org.htmlparser.filters

Class RegexFilter

public class RegexFilter extends Object implements NodeFilter

This filter accepts all string nodes matching a regular expression. Because this searches {@link org.htmlparser.Text Text} nodes, it is only useful for finding small fragments of text that are unlikely to be broken up by a tag. To find large fragments of text, convert the page to plain text with something like the {@link org.htmlparser.beans.StringBean StringBean} and then apply the regular expression.
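The limitation described above can be illustrated without the parser itself. The following is a minimal sketch, assuming that markup such as `<p>This is <b>breaking</b> news today</p>` yields three separate text fragments; the fragment list, the class name `SplitTextDemo` and both helper methods are hypothetical and only stand in for per-node versus plain-text searching.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class SplitTextDemo {
    // The phrase we are looking for; it spans a tag boundary in the assumed markup.
    static final Pattern PHRASE = Pattern.compile("breaking news");

    /** Per-node search: true if any single fragment contains the phrase. */
    static boolean anyFragmentMatches(List<String> fragments) {
        for (String fragment : fragments) {
            if (PHRASE.matcher(fragment).find()) {
                return true;
            }
        }
        return false;
    }

    /** Plain-text search: true if the concatenated fragments contain the phrase. */
    static boolean joinedTextMatches(List<String> fragments) {
        return PHRASE.matcher(String.join("", fragments)).find();
    }

    public static void main(String[] args) {
        // Assumed text fragments for: <p>This is <b>breaking</b> news today</p>
        List<String> fragments = Arrays.asList("This is ", "breaking", " news today");
        System.out.println(anyFragmentMatches(fragments)); // prints "false"
        System.out.println(joinedTextMatches(fragments));  // prints "true"
    }
}
```

Searching each fragment separately misses the phrase because no single fragment holds all of it, which is why converting the page to plain text first is suggested for longer passages.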

For example, to look for dates use:

   (19|20)\d\d([- \\/.](0[1-9]|1[012])[- \\/.](0[1-9]|[12][0-9]|3[01]))?
 
as in:
 Parser parser = new Parser ("http://cbc.ca");
 RegexFilter filter = new RegexFilter ("(19|20)\\d\\d([- \\\\/.](0[1-9]|1[012])[- \\\\/.](0[1-9]|[12][0-9]|3[01]))?");
 NodeIterator iterator = parser.extractAllNodesThatMatch (filter).elements ();
 
which matches a date in yyyy-mm-dd format between 1900-01-01 and 2099-12-31, with a choice of five separators: a dash, a space, either kind of slash, or a period. The year is matched by (19|20)\d\d, which uses alternation to allow either 19 or 20 as the first two digits. The round brackets are mandatory: without them, the alternation would apply to everything on either side of the bar. The month is matched by 0[1-9]|1[012], again enclosed in round brackets to keep the two options together. By using character classes, the first option matches a number between 01 and 09, and the second matches 10, 11 or 12. The day is matched by the last part of the regex, which consists of three options: the first matches 01 through 09, the second 10 through 29, and the third 30 or 31. The day and month are optional, but must occur together because of the ()? bracketing after the year.
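To see how the pattern behaves on sample input, it can be tried with java.util.regex directly, outside the parser. This is a minimal sketch; the class name `DatePatternDemo`, the helper `findDates` and the sample strings are hypothetical, and find() is used to mirror the default FIND strategy.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DatePatternDemo {
    // The date pattern from the text, written as a Java string literal
    // (every backslash of the regex is doubled).
    static final Pattern DATE = Pattern.compile(
        "(19|20)\\d\\d([- \\\\/.](0[1-9]|1[012])[- \\\\/.](0[1-9]|[12][0-9]|3[01]))?");

    /** Returns every substring of the text that the date pattern finds, in order. */
    static List<String> findDates(String text) {
        List<String> found = new ArrayList<>();
        Matcher matcher = DATE.matcher(text);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Full dates with different separators.
        System.out.println(findDates("Posted 2005-06-12, updated 2006/01/03."));
        // prints "[2005-06-12, 2006/01/03]"

        // Month and day are optional, so a bare year still matches.
        System.out.println(findDates("Copyright 1999 CBC"));
        // prints "[1999]"

        // Month 13 is rejected, so only the year part matches.
        System.out.println(findDates("Due 2005-13-01"));
        // prints "[2005]"
    }
}
```

Because the month and day group is optional, an invalid month does not prevent a match; the pattern simply falls back to matching the year alone.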
Field Summary
static int FIND
Use find() match strategy.
static int LOOKINGAT
Use lookingAt() match strategy.
protected Pattern mPattern
The compiled regular expression to search for.
protected String mPatternString
The regular expression to search for.
protected int mStrategy
The match strategy.
static int MATCH
Use matches() match strategy.
Constructor Summary
RegexFilter()
Creates a new instance of RegexFilter that accepts string nodes matching the regular expression ".*" using the FIND strategy.
RegexFilter(String pattern)
Creates a new instance of RegexFilter that accepts string nodes matching a regular expression using the FIND strategy.
RegexFilter(String pattern, int strategy)
Creates a new instance of RegexFilter that accepts string nodes matching a regular expression.
Method Summary
boolean accept(Node node)
Accept string nodes that match the regular expression.
String getPattern()
Get the search pattern.
int getStrategy()
Get the search strategy.
void setPattern(String pattern)
Set the search pattern.
void setStrategy(int strategy)
Set the search strategy.

Field Detail

FIND

public static final int FIND
Use find() match strategy.

LOOKINGAT

public static final int LOOKINGAT
Use lookingAt() match strategy.

mPattern

protected Pattern mPattern
The compiled regular expression to search for.

mPatternString

protected String mPatternString
The regular expression to search for.

mStrategy

protected int mStrategy
The match strategy.

See Also: RegexFilter

MATCH

public static final int MATCH
Use matches() match strategy.

Constructor Detail

RegexFilter

public RegexFilter()
Creates a new instance of RegexFilter that accepts string nodes matching the regular expression ".*" using the FIND strategy.

RegexFilter

public RegexFilter(String pattern)
Creates a new instance of RegexFilter that accepts string nodes matching a regular expression using the FIND strategy.

Parameters: pattern The pattern to search for.

RegexFilter

public RegexFilter(String pattern, int strategy)
Creates a new instance of RegexFilter that accepts string nodes matching a regular expression.

Parameters:
pattern The pattern to search for.
strategy The type of match:

  1. {@link #MATCH} use matches() method: attempts to match the entire input sequence against the pattern
  2. {@link #LOOKINGAT} use lookingAt() method: attempts to match the input sequence, starting at the beginning, against the pattern
  3. {@link #FIND} use find() method: scans the input sequence looking for the next subsequence that matches the pattern
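The difference between the three strategies is easiest to see by applying the same pattern to the same text with each java.util.regex.Matcher method. The following is a minimal sketch: the `Strategy` enum and the `accepts` method are hypothetical stand-ins for the filter's integer constants and its accept logic, not the library's implementation.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StrategyDemo {
    // Hypothetical stand-in for the MATCH, LOOKINGAT and FIND constants.
    enum Strategy { MATCH, LOOKINGAT, FIND }

    /** Applies the regex to the text using the Matcher method named by the strategy. */
    static boolean accepts(String regex, Strategy strategy, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        switch (strategy) {
            case MATCH:
                return matcher.matches();   // whole text must match
            case LOOKINGAT:
                return matcher.lookingAt(); // a match must start at the beginning
            default:
                return matcher.find();      // a match may occur anywhere
        }
    }

    public static void main(String[] args) {
        String regex = "\\d{4}"; // four consecutive digits

        // Text that starts with the digits but continues afterwards.
        System.out.println(accepts(regex, Strategy.MATCH, "2005 budget"));     // prints "false"
        System.out.println(accepts(regex, Strategy.LOOKINGAT, "2005 budget")); // prints "true"
        System.out.println(accepts(regex, Strategy.FIND, "2005 budget"));      // prints "true"

        // Text where the digits appear only at the end.
        System.out.println(accepts(regex, Strategy.LOOKINGAT, "Budget 2005")); // prints "false"
        System.out.println(accepts(regex, Strategy.FIND, "Budget 2005"));      // prints "true"
    }
}
```

FIND is the most permissive of the three, which fits its role as the default when no strategy is given.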

Method Detail

accept

public boolean accept(Node node)
Accept string nodes that match the regular expression.

Parameters: node The node to check.

Returns: true if the regular expression matches the text of the node, false otherwise.

getPattern

public String getPattern()
Get the search pattern.

Returns: The pattern.

getStrategy

public int getStrategy()
Get the search strategy.

Returns: The strategy, one of MATCH, LOOKINGAT or FIND.

setPattern

public void setPattern(String pattern)
Set the search pattern.

Parameters: pattern The pattern to set.

setStrategy

public void setStrategy(int strategy)
Set the search strategy.

Parameters: strategy The strategy to use. One of MATCH, LOOKINGAT or FIND.

HTML Parser is an open source library released under LGPL.