org.htmlparser.lexer
public class Page extends Object implements Serializable
Field Summary

Modifier and Type | Field | Description
---|---|---
static String | DEFAULT_CHARSET | The default charset.
static String | DEFAULT_CONTENT_TYPE | The default content type.
static char | EOF | Character value when the page is exhausted.
protected String | mBaseUrl | The base URL for this page.
protected URLConnection | mConnection | The connection this page is coming from or null.
protected static ConnectionManager | mConnectionManager | Connection control (proxy, cookies, authorization).
protected PageIndex | mIndex | Character positions of the first character in each line.
protected Source | mSource | The source of characters.
protected String | mUrl | The URL this page is coming from.

Constructor Summary

Constructor | Description
---|---
Page() | Construct an empty page.
Page(URLConnection connection) | Construct a page reading from a URL connection.
Page(InputStream stream, String charset) | Construct a page from a stream encoded with the given charset.
Page(String text, String charset) | Construct a page from the given string.
Page(String text) | Construct a page from the given string.
Page(Source source) | Construct a page from a source.

Method Summary

Modifier and Type | Method | Description
---|---|---
void | close() | Close the page by destroying the source of characters.
int | column(Cursor cursor) | Get the column number for a cursor.
int | column(int position) | Get the column number for a character position.
URL | constructUrl(String link, String base) | Build a URL from the link and base provided using non-strict rules.
URL | constructUrl(String link, String base, boolean strict) | Build a URL from the link and base provided.
protected void | finalize() | Clean up this page, releasing resources.
static String | findCharset(String name, String fallback) | Look up a character set name.
String | getAbsoluteURL(String link) | Create an absolute URL from a relative link.
String | getAbsoluteURL(String link, boolean strict) | Create an absolute URL from a relative link.
String | getBaseUrl() | Gets the baseUrl.
char | getCharacter(Cursor cursor) | Read the character at the given cursor position.
String | getCharset(String content) | Get a CharacterSet name corresponding to a charset parameter.
URLConnection | getConnection() | Get the connection, if any.
static ConnectionManager | getConnectionManager() | Get the connection manager all Parsers use.
String | getContentType() | Try to extract the content type from the HTTP header.
String | getEncoding() | Get the current encoding being used.
String | getLine(Cursor cursor) | Get the text line the position of the cursor lies on.
String | getLine(int position) | Get the text line the position lies on.
Source | getSource() | Get the source this page is reading from.
String | getText(int start, int end) | Get the text identified by the given limits.
void | getText(StringBuffer buffer, int start, int end) | Put the text identified by the given limits into the given buffer.
String | getText() | Get all text read so far from the source.
void | getText(StringBuffer buffer) | Put all text read so far from the source into the given buffer.
void | getText(char[] array, int offset, int start, int end) | Put the text identified by the given limits into the given array at the specified offset.
String | getUrl() | Get the URL for this page.
void | reset() | Reset the page by resetting the source of characters.
int | row(Cursor cursor) | Get the line number for a cursor.
int | row(int position) | Get the line number for a character position.
void | setBaseUrl(String url) | Sets the baseUrl.
void | setConnection(URLConnection connection) | Set the URLConnection to be used by this page.
static void | setConnectionManager(ConnectionManager manager) | Set the connection manager to use.
void | setEncoding(String character_set) | Begins reading from the source with the given character set.
void | setUrl(String url) | Set the URL for this page.
String | toString() | Display some of this page as a string.
void | ungetCharacter(Cursor cursor) | Return (unread) a character.

This should be ISO-8859-1, see RFC 2616 (http://www.ietf.org/rfc/rfc2616.txt?number=2616) section 3.7.1. Another alias is "8859_1".
null.
getConnection().toExternalForm() or setUrl().
Parameters: connection A fully conditioned connection. The connect() method will be called so it need not be connected yet.
Throws: ParserException An exception object wrapping a number of
possible error conditions, some of which are outlined below.
Parameters: stream The source of bytes. charset The encoding used. If null, defaults to DEFAULT_CHARSET.
Throws: UnsupportedEncodingException If the given charset is not supported.
Parameters: text The HTML text. charset Optional. The character set encoding that will be reported by getEncoding(). If charset is null the default character set is used.
Parameters: text The HTML text.
Parameters: source The source of characters.
Throws: IOException If destroying the source encounters an error.
Parameters: cursor The character offset into the page.
Returns: The character offset into the line this cursor is on.
Parameters: position The character offset into the page.
Returns: The character offset into the line this position is on.
Parameters: link The (relative) URI. base The base URL of the page, either from the <BASE> tag or, if none, the URL the page is being fetched from.
Returns: An absolute URL.
Throws: MalformedURLException If creating the URL fails.
See Also: Page
Parameters: link The (relative) URI. base The base URL of the page, either from the <BASE> tag or, if none, the URL the page is being fetched from. strict If true, a link starting with '?' is handled according to RFC 2396; otherwise the common interpretation of a query appended to the base is used instead.
Returns: An absolute URL.
Throws: MalformedURLException If creating the URL fails.
Calls close().
Throws: Throwable if close() throws an IOException.
Uses java.nio.charset when it is available. The lookup is done through reflection so the code will still run under earlier JDKs, but in that case the default is always returned.
Parameters: name The name to look up. One of the aliases for a character set. fallback The name to return if the lookup fails.
Returns: The character set name.
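As a rough illustration of the alias lookup described above, the following sketch resolves a name to its canonical form with java.nio.charset.Charset and returns the fallback when the lookup fails. The class name CharsetLookupSketch is hypothetical, and the sketch calls Charset directly rather than through reflection, so it is not the library's implementation.

```java
import java.nio.charset.Charset;

/** Hypothetical stand-in for the lookup; assumes java.nio.charset is present. */
final class CharsetLookupSketch {
    private CharsetLookupSketch() {}

    /** Return the canonical charset name for an alias, or fallback if unknown. */
    static String findCharset(String name, String fallback) {
        try {
            // Charset.forName accepts aliases; name() gives the canonical name.
            return Charset.forName(name).name();
        } catch (IllegalArgumentException e) {
            // Covers null, illegal, and unsupported charset names.
            return fallback;
        }
    }
}
```

With this sketch, an alias such as "8859_1" resolves to the canonical name of the same character set, while an unknown name yields whatever fallback the caller supplied.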
Parameters: link The relative portion of a URL.
Returns: The fully qualified URL or the original link if it was absolute already or a failure occurred.
Parameters: link The relative portion of a URL. strict If true, a link starting with '?' is handled according to RFC 2396; otherwise the common interpretation of a query appended to the base is used instead.
Returns: The fully qualified URL or the original link if it was absolute already or a failure occurred.
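The difference between strict and non-strict handling of a link starting with '?' can be sketched with java.net.URL alone. The class AbsoluteUrlSketch and its absolutize method are hypothetical names, and the base is passed in explicitly here, whereas the page itself would take it from its base URL or URL. In non-strict mode the query is appended to the base with any existing query removed; in strict mode resolution is left entirely to java.net.URL. On failure the original link is returned, as the Returns text above describes.

```java
import java.net.MalformedURLException;
import java.net.URL;

/** Hypothetical sketch of relative link resolution; not the library's code. */
final class AbsoluteUrlSketch {
    private AbsoluteUrlSketch() {}

    static String absolutize(String link, String base, boolean strict) {
        if (link == null || link.isEmpty()) {
            return link; // nothing to resolve
        }
        try {
            URL url;
            if (!strict && link.charAt(0) == '?') {
                // Non-strict: treat "?query" as a query appended to the base,
                // dropping any query the base already carries.
                int query = base.lastIndexOf('?');
                String stem = (query >= 0) ? base.substring(0, query) : base;
                url = new URL(stem + link);
            } else {
                // Strict, or an ordinary link: let java.net.URL resolve it.
                url = new URL(new URL(base), link);
            }
            return url.toExternalForm();
        } catch (MalformedURLException e) {
            return link; // on failure, hand back the original link
        }
    }
}
```

For example, resolving "?x=1" against "http://example.com/dir/page.html?old=1" in non-strict mode gives "http://example.com/dir/page.html?x=1".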
Returns: The base URL for this page, or null
if not set.
Parameters: cursor The position to read at.
Returns: The character at that position, and modifies the cursor to prepare for the next read. If the source is exhausted a zero is returned.
Throws: ParserException If an IOException on the underlying source occurs, or an attempt is made to read characters in the future (the cursor position is ahead of the underlying stream)
Parameters: content A text line of the form:
text/html; charset=Shift_JIS
which is applicable both to the HTTP header field Content-Type and
the meta tag http-equiv="Content-Type".
Note this method also handles non-compliant quoted charset directives
such as:
text/html; charset="UTF-8"
and
text/html; charset='UTF-8'
Returns: The character set name to use when reading the input stream. For JDKs that have the Charset class this is qualified by passing the name to findCharset() to render it into canonical form. If the charset parameter is not found in the given string, the default character set is returned.
See Also: Page DEFAULT_CHARSET
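The charset extraction described above, including the quoted non-compliant forms, might look roughly like the following. CharsetParamSketch and charsetOf are hypothetical names, the regular expression is one possible approach rather than the library's parsing code, and the sketch returns the raw parameter value without passing it through findCharset().

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: pull the charset parameter out of a Content-Type value. */
final class CharsetParamSketch {
    // "charset", "=", an optional opening quote, then the value up to a
    // quote, semicolon, or whitespace.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([^\"';\\s]+)", Pattern.CASE_INSENSITIVE);

    private CharsetParamSketch() {}

    /** Return the charset named in content, or fallback if none is present. */
    static String charsetOf(String content, String fallback) {
        if (content == null) {
            return fallback;
        }
        Matcher matcher = CHARSET.matcher(content);
        return matcher.find() ? matcher.group(1) : fallback;
    }
}
```

Plain, double-quoted, and single-quoted directives all yield the bare name, and a value with no charset parameter yields the fallback.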
Returns: The connection object for this page, or null if this page is built from a stream or a string.
Returns: The connection manager.
Returns: The content type.
Returns: The encoding used to convert characters.
Parameters: cursor The position to calculate for.
Returns: The contents of the URL or file corresponding to the line number containing the cursor position.
Parameters: position The position to calculate for.
Returns: The contents of the URL or file corresponding to the line number containing the cursor position.
Returns: The current source.
Parameters: start The starting position, zero based. end The ending position (exclusive, i.e. the character at the ending position is not included), zero based.
Returns: The text from start to end.
Throws: IllegalArgumentException If an attempt is made to get characters ahead of the current source offset (character position).
See Also: Page
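To make the range semantics concrete, here is a small model of a page that has read only part of its source. ReadSoFarSketch and its read method are hypothetical and exist only for illustration. The end position is exclusive, as with String.substring, and asking for characters beyond what has been read raises IllegalArgumentException.

```java
/** Hypothetical model of a page that has read only part of its source. */
final class ReadSoFarSketch {
    private final String source; // the full input, known only to the sketch
    private int offset;          // number of characters read so far

    ReadSoFarSketch(String source) {
        this.source = source;
    }

    /** Pretend to read up to count more characters from the source. */
    void read(int count) {
        offset = Math.min(source.length(), offset + count);
    }

    /** Text in [start, end): zero based, end exclusive. */
    String getText(int start, int end) {
        if (start < 0 || start > end || end > offset) {
            throw new IllegalArgumentException("range " + start + ".." + end
                    + " is outside the " + offset + " characters read so far");
        }
        return source.substring(start, end);
    }
}
```

After reading four characters of "<html>", getText(1, 4) returns "htm", while getText(0, 5) is rejected because position 4 has not been read yet.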
Parameters: buffer The accumulator for the characters. start The starting position, zero based. end The ending position (exclusive, i.e. the character at the ending position is not included), zero based.
Throws: IllegalArgumentException If an attempt is made to get characters ahead of the current source offset (character position).
Returns: The text from the source.
See Also: getText
Parameters: buffer The accumulator for the characters.
See Also: Page
Parameters: array The array of characters. offset The starting position in the array where characters are to be placed. start The starting position, zero based. end The ending position (exclusive, i.e. the character at the ending position is not included), zero based.
Throws: IllegalArgumentException If an attempt is made to get characters ahead of the current source offset (character position).
Only available if the page has a connection (getConnection() returns non-null), or the document base has been set via a call to setUrl().
Returns: The URL for the connection, or null if there is no connection or the document base has not been set.
Parameters: cursor The character offset into the page.
Returns: The line number the character is in.
Parameters: position The character offset into the page.
Returns: The line number the character is in.
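The row and column lookups can be pictured as searches over an index of line start positions, which is what the mIndex field is described as holding. The sketch below is a simplified model under stated assumptions: LineIndexSketch is a hypothetical class, only '\n' is treated as a line break, positions are non-negative, and rows and columns are zero based.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical line index: positions of the first character in each line. */
final class LineIndexSketch {
    private final int[] lineStarts;

    LineIndexSketch(String text) {
        List<Integer> starts = new ArrayList<>();
        starts.add(0); // the first line starts at position 0
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '\n') {
                starts.add(i + 1); // the next line starts after the break
            }
        }
        lineStarts = starts.stream().mapToInt(Integer::intValue).toArray();
    }

    /** Zero-based line number containing the given character position. */
    int row(int position) {
        int found = Arrays.binarySearch(lineStarts, position);
        // A miss returns -(insertionPoint) - 1; the line is the one before it.
        return found >= 0 ? found : -found - 2;
    }

    /** Zero-based character offset of the position within its line. */
    int column(int position) {
        return position - lineStarts[row(position)];
    }
}
```

For the text "ab\ncd\n\nx", position 4 (the 'd') is on row 1 at column 1, and position 7 (the 'x') is on row 3 at column 0.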
Parameters: url The base url for this page.
Parameters: connection The connection to use. It will be connected by this method.
Throws: ParserException If the connect() method fails, or an I/O error occurs opening the input stream or the character set designated in the HTTP header is unsupported.
Parameters: manager The new connection manager.
Some magic happens here to obtain this result if characters have already been consumed from this page. Since a Reader cannot be dynamically altered to use a different character set, the underlying stream is reset, a new Source is constructed and a comparison made of the characters read so far with the newly read characters up to the current position. If a difference is encountered, or some other problem occurs, an exception is thrown.
Parameters: character_set The character set to use to convert bytes into characters.
Throws: ParserException If a character mismatch occurs between characters already provided and those that would have been returned had the new character set been in effect from the beginning. An exception is also thrown if the underlying stream won't put up with these shenanigans.
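The consistency check described above can be approximated without any streams: decode the same bytes with the old and the new character set and compare the characters that were already handed out. EncodingSwitchSketch and redecode are hypothetical names, the whole input is assumed to be available as a byte array, and IllegalStateException stands in for the ParserException the page would throw.

```java
import java.nio.charset.Charset;

/** Hypothetical sketch of the check made when the encoding changes mid-read. */
final class EncodingSwitchSketch {
    private EncodingSwitchSketch() {}

    /**
     * Decode raw with newCharset, verifying that the first consumed
     * characters match what oldCharset had already produced.
     */
    static String redecode(byte[] raw, Charset oldCharset, Charset newCharset,
                           int consumed) {
        String before = new String(raw, oldCharset);
        String after = new String(raw, newCharset);
        if (consumed > before.length() || consumed > after.length()) {
            throw new IllegalStateException("consumed count exceeds decoded text");
        }
        for (int i = 0; i < consumed; i++) {
            if (before.charAt(i) != after.charAt(i)) {
                throw new IllegalStateException(
                        "character mismatch at position " + i);
            }
        }
        return after; // safe to continue reading in the new character set
    }
}
```

If only ASCII characters have been consumed, switching from ISO-8859-1 to UTF-8 passes the check; once a multi-byte character has been consumed under the wrong encoding, the comparison fails.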
Parameters: url The new URL.
Returns: The last few characters the source read in.
Parameters: cursor The position to 'unread' at.
Throws: ParserException If an IOException on the underlying source occurs.
HTML Parser is an open source library released under LGPL.