org.htmlparser

Class Parser

public class Parser extends Object implements Serializable, ConnectionMonitor

The main parser class. This is the primary class of the HTML Parser library. It provides constructors that take a {@link #Parser(String) String}, a {@link #Parser(URLConnection) URLConnection}, or a {@link #Parser(Lexer) Lexer}. In the case of a String, a check is made to see if the first non-whitespace character is a <, in which case it is assumed to be HTML. Otherwise an attempt is made to open it as a URL, and if that fails it assumes it is a local disk file. If you want to parse a String after using the {@link #Parser() no-args} constructor, use {@link #setInputHTML setInputHTML()}, or you can use {@link #createParser}.
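The detection order for a String argument described above can be sketched with the standard library alone. This is a minimal illustration, not the library's code: the class `ResourceKindSketch` and its `classify` method are hypothetical names, and the URL step is simplified to a syntax check for a URI scheme, whereas the parser itself attempts to open the URL.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Hypothetical helper for illustration; not part of org.htmlparser. */
final class ResourceKindSketch {

    enum Kind { HTML, URL, FILE }

    /** Classifies a resource string in the order the class description gives. */
    static Kind classify(String resource) {
        // Step 1: a first non-whitespace character of '<' means literal HTML.
        String trimmed = resource.trim();
        if (trimmed.startsWith("<")) {
            return Kind.HTML;
        }
        // Step 2 (simplified assumption): anything with a URI scheme counts as a URL.
        // The real parser tries to open the URL rather than inspecting its syntax.
        try {
            if (new URI(trimmed).isAbsolute()) {
                return Kind.URL;
            }
        } catch (URISyntaxException e) {
            // Not a syntactically valid URI; fall through to the file case.
        }
        // Step 3: everything else is assumed to be a local file name.
        return Kind.FILE;
    }

    public static void main(String[] args) {
        System.out.println(classify("  <html><body>hi</body></html>"));
        System.out.println(classify("http://example.com/index.html"));
        System.out.println(classify("pages/index.html"));
    }
}
```

Because the sketch only looks at syntax, a string such as a drive-letter path with forward slashes would be reported as a URL here; the open-then-fall-back approach described above does not have that limitation.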

The Parser provides access to the contents of the page, via a {@link #elements() NodeIterator}, a {@link #parse(NodeFilter) NodeList} or a {@link #visitAllNodesWith NodeVisitor}.

Typical usage of the parser is:

 Parser parser = new Parser ("http://whatever");
 NodeList list = parser.parse (null);
 // do something with your list of nodes.
 

What types of nodes are produced and what can be done with them depends on the setup, but in general a node can be converted back to HTML, and its children (enclosed nodes) and parent can be obtained, because nodes are nested. See the {@link Node} interface.

For example, if the URL contains:
{@.html Mondays -- What a bad idea. Most people have a pathological hatred of Mondays... }
and the example code above is used, the list contains only one element, the {@.html } node. This node is a {@link org.htmlparser.tags tag}, which is an object of class {@link org.htmlparser.tags.Html Html} if the default {@link NodeFactory} (a {@link PrototypicalNodeFactory}) is used.

To get at further content, the children of the top level nodes must be examined. When digging through a node list one must be conscious of the possibility of whitespace between nodes, e.g. in the example above:

 Node node = list.elementAt (0);
 NodeList sublist = node.getChildren ();
 System.out.println (sublist.size ());
 
would print out 5, not 2, because there are newlines after {@.html }, {@.html } and {@.html } that are children of the HTML node besides the {@.html } and {@.html } nodes.
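The effect of whitespace children on the count can be reproduced with a small stand-in that uses plain strings instead of nodes. This is only a sketch under stated assumptions: `WhitespaceChildrenSketch`, `withoutBlankText`, and the placeholder tag names `tagA` and `tagB` are hypothetical and not part of the library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for a child list; not the library's NodeList. */
final class WhitespaceChildrenSketch {

    /** Returns the children that are not whitespace-only text. */
    static List<String> withoutBlankText(List<String> children) {
        List<String> kept = new ArrayList<>();
        for (String child : children) {
            // A child consisting only of whitespace models a newline text node.
            if (!child.trim().isEmpty()) {
                kept.add(child);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Three newline text nodes interleaved with two tag nodes.
        List<String> children = Arrays.asList("\n", "<tagA>", "\n", "<tagB>", "\n");
        System.out.println(children.size());                   // prints 5
        System.out.println(withoutBlankText(children).size()); // prints 2
    }
}
```

With real nodes, the equivalent check would presumably test whether a child is a TextNode whose getText() is blank before skipping it.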

Because processing nodes is so common, two interfaces are provided to ease this task, {@link org.htmlparser.filters filters} and {@link org.htmlparser.visitors visitors}.

Field Summary
static ParserFeedback DEVNULL
A quiet message sink.
protected ParserFeedback mFeedback
Feedback object.
protected Lexer mLexer
The html lexer associated with this parser.
static ParserFeedback STDOUT
A verbose message sink.
static String VERSION_DATE
The date of the version ({@value}).
static double VERSION_NUMBER
The floating point version number ({@value}).
static String VERSION_STRING
The display version ({@value}).
static StringVERSION_TYPE
The type of version ({@value}).
Constructor Summary
Parser()
Zero argument constructor.
Parser(Lexer lexer, ParserFeedback fb)
Construct a parser using the provided lexer and feedback object.
Parser(URLConnection connection, ParserFeedback fb)
Constructor for custom HTTP access.
Parser(String resource, ParserFeedback feedback)
Creates a Parser object with the location of the resource (URL or file). You would typically create a DefaultHTMLParserFeedback object and pass it in.
Parser(String resource)
Creates a Parser object with the location of the resource (URL or file).
Parser(Lexer lexer)
Construct a parser using the provided lexer.
Parser(URLConnection connection)
Construct a parser using the provided URLConnection.
Method Summary
static Parser createParser(String html, String charset)
Creates the parser on an input string.
NodeIterator elements()
Returns an iterator (enumeration) over the html nodes.
NodeList extractAllNodesThatMatch(NodeFilter filter)
Extract all nodes matching the given filter.
URLConnection getConnection()
Return the current connection.
static ConnectionManager getConnectionManager()
Get the connection manager all Parsers use.
String getEncoding()
Get the encoding for the page this parser is reading from.
ParserFeedback getFeedback()
Returns the current feedback object.
Lexer getLexer()
Returns the lexer associated with the parser.
NodeFactory getNodeFactory()
Get the current node factory.
String getURL()
Return the current URL being parsed.
static String getVersion()
Return the version string of this parser.
static double getVersionNumber()
Return the version number of this parser.
static void main(String[] args)
The main program, which can be executed from the command line.
NodeList parse(NodeFilter filter)
Parse the given resource, using the filter provided.
void postConnect(HttpURLConnection connection)
Called just after calling connect.
void preConnect(HttpURLConnection connection)
Called just prior to calling connect.
void reset()
Reset the parser to start from the beginning again.
void setConnection(URLConnection connection)
Set the connection for this parser.
static void setConnectionManager(ConnectionManager manager)
Set the connection manager all Parsers use.
void setEncoding(String encoding)
Set the encoding for the page this parser is reading from.
void setFeedback(ParserFeedback fb)
Sets the feedback object used in scanning.
void setInputHTML(String inputHTML)
Initializes the parser with the given input HTML String.
void setLexer(Lexer lexer)
Set the lexer for this parser.
void setNodeFactory(NodeFactory factory)
Set the current node factory.
void setResource(String resource)
Set the html, a url, or a file.
void setURL(String url)
Set the URL for this parser.
void visitAllNodesWith(NodeVisitor visitor)
Apply the given visitor to the current page.

Field Detail

DEVNULL

public static final ParserFeedback DEVNULL
A quiet message sink. Use this for no feedback.

mFeedback

protected ParserFeedback mFeedback
Feedback object.

mLexer

protected Lexer mLexer
The html lexer associated with this parser.

STDOUT

public static final ParserFeedback STDOUT
A verbose message sink. Use this for output on System.out.

VERSION_DATE

public static final String VERSION_DATE
The date of the version ({@value}).

VERSION_NUMBER

public static final double VERSION_NUMBER
The floating point version number ({@value}).

VERSION_STRING

public static final String VERSION_STRING
The display version ({@value}).

VERSION_TYPE

public static final String VERSION_TYPE
The type of version ({@value}).

Constructor Detail

Parser

public Parser()
Zero argument constructor. The parser is in a safe but useless state parsing an empty string. Set the lexer or connection using {@link #setLexer} or {@link #setConnection}.

See Also: setLexer setConnection

Parser

public Parser(Lexer lexer, ParserFeedback fb)
Construct a parser using the provided lexer and feedback object. This would be used to create a parser for special cases where the normal creation of a lexer on a URLConnection needs to be customized.

Parameters: lexer The lexer to draw characters from. fb The object to use when information, warning and error messages are produced. If null no feedback is provided.

Parser

public Parser(URLConnection connection, ParserFeedback fb)
Constructor for custom HTTP access. This would be used to create a parser for a URLConnection that needs a special setup or negotiation conditioning beyond what is available from the {@link #getConnectionManager ConnectionManager}.

Parameters: connection A fully conditioned connection. The connect() method will be called so it need not be connected yet. fb The object to use for message communication.

Throws: ParserException If the creation of the underlying Lexer cannot be performed.

Parser

public Parser(String resource, ParserFeedback feedback)
Creates a Parser object with the location of the resource (URL or file). You would typically create a DefaultHTMLParserFeedback object and pass it in.

Parameters: resource Either a URL, a filename or a string of HTML. The string is considered HTML if the first non-whitespace character is a <. The use of a url or file is autodetected by first attempting to open the resource as a URL; if that fails it is assumed to be a file name. A standard HTTP GET is performed to read the content of the URL. feedback The HTMLParserFeedback object to use when information, warning and error messages are produced. If null no feedback is provided.

Throws: ParserException If the URL is invalid.

See Also: Parser

Parser

public Parser(String resource)
Creates a Parser object with the location of the resource (URL or file). A DefaultHTMLParserFeedback object is used for feedback.

Parameters: resource Either HTML, a URL or a filename (autodetects).

Throws: ParserException If the resource argument does not resolve to a valid page or file.

See Also: Parser

Parser

public Parser(Lexer lexer)
Construct a parser using the provided lexer. A feedback object printing to {@link #STDOUT System.out} is used. This would be used to create a parser for special cases where the normal creation of a lexer on a URLConnection needs to be customized.

Parameters: lexer The lexer to draw characters from.

Parser

public Parser(URLConnection connection)
Construct a parser using the provided URLConnection. This would be used to create a parser for a URLConnection that needs a special setup or negotiation conditioning beyond what is available from the {@link #getConnectionManager ConnectionManager}. A feedback object printing to {@link #STDOUT System.out} is used.

Parameters: connection A fully conditioned connection. The connect() method will be called so it need not be connected yet.

Throws: ParserException If the creation of the underlying Lexer cannot be performed.

See Also: Parser

Method Detail

createParser

public static Parser createParser(String html, String charset)
Creates the parser on an input string.

Parameters: html The string containing HTML. charset Optional. The character set encoding that will be reported by {@link #getEncoding}. If charset is null the default character set is used.

Returns: A parser with the html string as input.

Throws: IllegalArgumentException if html is null.

elements

public NodeIterator elements()
Returns an iterator (enumeration) over the html nodes. {@link org.htmlparser.nodes Nodes} can be of three main types: TagNode, TextNode and RemarkNode. In general, when parsing with an iterator or processing a NodeList, you will need to use recursion. For example:
 import org.htmlparser.Node;
 import org.htmlparser.Parser;
 import org.htmlparser.nodes.RemarkNode;
 import org.htmlparser.nodes.TagNode;
 import org.htmlparser.nodes.TextNode;
 import org.htmlparser.util.NodeIterator;
 import org.htmlparser.util.NodeList;
 import org.htmlparser.util.ParserException;

 void processMyNodes (Node node) throws ParserException
 {
     if (node instanceof TextNode)
     {
         // downcast to TextNode
         TextNode text = (TextNode)node;
         // do whatever processing you want with the text
         System.out.println (text.getText ());
     }
     else if (node instanceof RemarkNode)
     {
         // downcast to RemarkNode
         RemarkNode remark = (RemarkNode)node;
         // do whatever processing you want with the comment
     }
     else if (node instanceof TagNode)
     {
         // downcast to TagNode
         TagNode tag = (TagNode)node;
         // do whatever processing you want with the tag itself
         // ...
         // process recursively (nodes within nodes) via getChildren()
         NodeList nl = tag.getChildren ();
         if (null != nl)
             for (NodeIterator i = nl.elements (); i.hasMoreNodes (); )
                 processMyNodes (i.nextNode ());
     }
 }

 Parser parser = new Parser ("http://www.yahoo.com");
 for (NodeIterator i = parser.elements (); i.hasMoreNodes (); )
     processMyNodes (i.nextNode ());
 

Returns: An iterator over the top level nodes (usually {@.html }).

Throws: ParserException If a parsing error occurs.

extractAllNodesThatMatch

public NodeList extractAllNodesThatMatch(NodeFilter filter)
Extract all nodes matching the given filter.

Parameters: filter The filter to be applied to the nodes.

Returns: A list of nodes matching the filter criteria, i.e. for which the filter's accept method returned true.

Throws: ParserException If a parse error occurs.

See Also: Node

getConnection

public URLConnection getConnection()
Return the current connection.

Returns: The connection either created by the parser or passed into this parser via {@link #setConnection}.

See Also: setConnection

getConnectionManager

public static ConnectionManager getConnectionManager()
Get the connection manager all Parsers use.

Returns: The connection manager.

See Also: Parser

getEncoding

public String getEncoding()
Get the encoding for the page this parser is reading from. This item is set from the HTTP header but may be overridden by meta tags in the head, so this may change after the head has been parsed.

Returns: The encoding currently in force.

See Also: Parser
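As a rough illustration of the first source of the encoding mentioned above, the HTTP header, the following sketch pulls a charset out of a Content-Type value. It is not the parser's implementation: `HeaderCharsetSketch`, `charsetFromContentType`, and the caller-supplied fallback are hypothetical, and any later override by meta tags is outside its scope.

```java
import java.util.Locale;

/** Hypothetical helper for illustration; not part of org.htmlparser. */
final class HeaderCharsetSketch {

    /**
     * Extracts the charset parameter from a Content-Type header value,
     * returning the given fallback when none is present.
     */
    static String charsetFromContentType(String contentType, String fallback) {
        if (contentType == null) {
            return fallback;
        }
        // Parameters follow the media type, separated by semicolons.
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = p.substring("charset=".length()).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? fallback : value;
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(charsetFromContentType("text/html; charset=UTF-8", "ISO-8859-1"));
        System.out.println(charsetFromContentType("text/html", "ISO-8859-1"));
    }
}
```

A value obtained this way is only a starting point; as noted above, the encoding reported by getEncoding() may change once the head of the page has been parsed.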

getFeedback

public ParserFeedback getFeedback()
Returns the current feedback object.

Returns: The feedback object currently being used.

See Also: Parser

getLexer

public Lexer getLexer()
Returns the lexer associated with the parser.

Returns: The current lexer.

See Also: Parser

getNodeFactory

public NodeFactory getNodeFactory()
Get the current node factory.

Returns: The current lexer's node factory.

See Also: Parser

getURL

public String getURL()
Return the current URL being parsed.

Returns: The current url. This is the URL for the current page. A string passed into the constructor or set via setURL may be altered; for example, a file name may be modified to be a URL.

See Also: Page Parser

getVersion

public static String getVersion()
Return the version string of this parser.

Returns: A string of the form:

 "[floating point number] ([build-type] [build-date])"
 

getVersionNumber

public static double getVersionNumber()
Return the version number of this parser.

Returns: A floating point number; the whole number part is the major version, and the fractional part is the minor version.
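The shapes returned by the two version accessors can be illustrated with a small stdlib-only sketch. Everything here is an assumption made for illustration: `VersionSketch` and its methods are hypothetical, and the sample number, build type and build date are made-up values, not the library's actual constants.

```java
import java.math.BigDecimal;

/** Hypothetical helper for illustration; not part of org.htmlparser. */
final class VersionSketch {

    /** Builds a string of the form "[number] ([build-type] [build-date])". */
    static String display(double number, String type, String date) {
        return number + " (" + type + " " + date + ")";
    }

    /** Whole number part of the version, read as the major version. */
    static int major(double number) {
        return BigDecimal.valueOf(number).intValue();
    }

    /** Fractional part of the version, kept as text to avoid binary rounding noise. */
    static String fraction(double number) {
        return BigDecimal.valueOf(number).remainder(BigDecimal.ONE).toPlainString();
    }

    public static void main(String[] args) {
        // Sample values only; they are not taken from the library.
        System.out.println(display(1.6, "Release Build", "Jun 10, 2006"));
        System.out.println(major(1.6));
        System.out.println(fraction(1.6));
    }
}
```

Going through BigDecimal.valueOf rather than subtracting the integer part directly avoids results such as 0.6000000000000001 for the fractional part.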

main

public static void main(String[] args)
The main program, which can be executed from the command line.

Parameters: args A URL or file name to parse, and an optional tag name to be used as a filter.

parse

public NodeList parse(NodeFilter filter)
Parse the given resource, using the filter provided. This can be used to extract information from specific nodes. When used with a null filter it returns an entire page which can then be modified and converted back to HTML (Note: the synthesis use-case is not handled very well; the parser is more often used to extract information from a web page).

For example, to replace the entire contents of the HEAD with a single TITLE tag you could do this:

 NodeList nl = parser.parse (null); // here is your two node list
 NodeList heads = nl.extractAllNodesThatMatch (new TagNameFilter ("HEAD"));
 if (heads.size () > 0) // there may not be a HEAD tag
 {
     Head head = (Head)heads.elementAt (0); // there should be only one
     head.removeAll (); // clean out the contents
     Tag title = new TitleTag ();
     title.setTagName ("title");
     title.setChildren (new NodeList (new TextNode ("The New Title")));
     Tag title_end = new TitleTag ();
     title_end.setTagName ("/title");
     title.setEndTag (title_end);
     head.add (title);
 }
 System.out.println (nl.toHtml ()); // output the modified HTML
 

Parameters: filter The filter to apply to the parsed nodes, or null to retrieve all the top level nodes.

Returns: The list of matching nodes (for a null filter this is all the top level nodes).

Throws: ParserException If a parsing error occurs.

postConnect

public void postConnect(HttpURLConnection connection)
Called just after calling connect. Part of the ConnectionMonitor interface, this implementation just sends the response header to the feedback object if any.

Parameters: connection The connection that was just connected.

Throws: ParserException Not used.

See Also: ConnectionMonitor

preConnect

public void preConnect(HttpURLConnection connection)
Called just prior to calling connect. Part of the ConnectionMonitor interface, this implementation just sends the request header to the feedback object if any.

Parameters: connection The connection which is about to be connected.

Throws: ParserException Not used.

See Also: ConnectionMonitor

reset

public void reset()
Reset the parser to start from the beginning again. This assumes support for a reset from the underlying {@link org.htmlparser.lexer.Source} object.

This is cheaper (in terms of time) than resetting the URL, i.e.

 parser.setURL (parser.getURL ());
 
because the page is not refetched from the internet. Note: the nodes returned on the second parse are new nodes and not the same nodes returned on the first parse. If you want the same nodes for re-use, collect them in a NodeList with {@link #parse(NodeFilter) parse(null)} and operate on the NodeList.

setConnection

public void setConnection(URLConnection connection)
Set the connection for this parser. This method creates a new Lexer reading from the connection.

Parameters: connection A fully conditioned connection. The connect() method will be called so it need not be connected yet.

Throws: ParserException if the character set specified in the HTTP header is not supported, an I/O exception occurs creating the lexer, or a problem occurs in connecting. IllegalArgumentException if connection is null.

See Also: Parser Parser

setConnectionManager

public static void setConnectionManager(ConnectionManager manager)
Set the connection manager all Parsers use.

Parameters: manager The new connection manager.

See Also: Parser

setEncoding

public void setEncoding(String encoding)
Set the encoding for the page this parser is reading from.

Parameters: encoding The new character set to use.

Throws: ParserException If the encoding change causes characters that have already been consumed to differ from the characters that would have been seen had the new encoding been in force.

See Also: EncodingChangeException Parser

setFeedback

public void setFeedback(ParserFeedback fb)
Sets the feedback object used in scanning.

Parameters: fb The new feedback object to use. If this is null a {@link #DEVNULL silent feedback object} is used.

See Also: Parser

setInputHTML

public void setInputHTML(String inputHTML)
Initializes the parser with the given input HTML String.

Parameters: inputHTML the input HTML that is to be parsed.

Throws: ParserException If an error occurs in setting up the underlying Lexer. IllegalArgumentException if inputHTML is null.

setLexer

public void setLexer(Lexer lexer)
Set the lexer for this parser. The current NodeFactory is transferred to (set on) the given lexer, since the lexer owns the node factory object. It does not adjust the feedback object.

Parameters: lexer The lexer object to use.

Throws: IllegalArgumentException if lexer is null.

See Also: Parser Parser

setNodeFactory

public void setNodeFactory(NodeFactory factory)
Set the current node factory.

Parameters: factory The new node factory for the current lexer.

Throws: IllegalArgumentException if factory is null.

See Also: Parser

setResource

public void setResource(String resource)
Set the html, a url, or a file.

Parameters: resource The resource to use.

Throws: IllegalArgumentException if resource is null. ParserException if a problem occurs in connecting.

setURL

public void setURL(String url)
Set the URL for this parser. This method creates a new Lexer reading from the given URL. Trying to set the url to null or an empty string is a no-op.

Parameters: url The new URL for the parser.

Throws: ParserException If the url is invalid, creation of the underlying Lexer cannot be performed, or a problem occurs in connecting.

See Also: Parser

visitAllNodesWith

public void visitAllNodesWith(NodeVisitor visitor)
Apply the given visitor to the current page. The visitor is passed to the accept() method of each node in the page in a depth first traversal. The visitor beginParsing() method is called prior to processing the page and finishedParsing() is called after the processing.

Parameters: visitor The visitor to visit all nodes with.

Throws: ParserException If a parse error occurs while traversing the page with the visitor.
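The call order described for visitAllNodesWith can be sketched with a stripped-down visitor over a tiny tree. This is a minimal illustration under stated assumptions: `VisitorOrderSketch`, `SketchNode` and `SketchVisitor` are hypothetical types, the single `visit` callback stands in for whatever per-node callbacks the real NodeVisitor offers, and pre-order is assumed for the depth first traversal.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical types for illustration; not the library's NodeVisitor or Node. */
final class VisitorOrderSketch {

    interface SketchVisitor {
        void beginParsing();
        void visit(SketchNode node);
        void finishedParsing();
    }

    static final class SketchNode {
        final String name;
        final List<SketchNode> children = new ArrayList<>();

        SketchNode(String name, SketchNode... kids) {
            this.name = name;
            this.children.addAll(Arrays.asList(kids));
        }

        /** Visits this node first, then each child in order (depth first, pre-order). */
        void accept(SketchVisitor visitor) {
            visitor.visit(this);
            for (SketchNode child : children) {
                child.accept(visitor);
            }
        }
    }

    /** Brackets the traversal of all top level nodes with the begin and finish callbacks. */
    static void visitAll(List<SketchNode> topLevel, SketchVisitor visitor) {
        visitor.beginParsing();
        for (SketchNode node : topLevel) {
            node.accept(visitor);
        }
        visitor.finishedParsing();
    }

    /** Records the order in which the callbacks fire. */
    static List<String> trace(List<SketchNode> topLevel) {
        final List<String> log = new ArrayList<>();
        visitAll(topLevel, new SketchVisitor() {
            public void beginParsing() { log.add("begin"); }
            public void visit(SketchNode node) { log.add(node.name); }
            public void finishedParsing() { log.add("end"); }
        });
        return log;
    }

    public static void main(String[] args) {
        SketchNode page = new SketchNode("html",
                new SketchNode("head", new SketchNode("title")),
                new SketchNode("body"));
        System.out.println(trace(Arrays.asList(page)));
        // prints [begin, html, head, title, body, end]
    }
}
```

Keeping the begin and finish calls in the driver rather than in the nodes means a visitor can set up and tear down its state exactly once per page, however many top level nodes there are.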

HTML Parser is an open source library released under LGPL.