org.htmlparser.util

Class Translate

public class Translate extends Object

Translate numeric character references and character entity references to and from Unicode characters. Based on the tables found at http://www.w3.org/TR/REC-html40/sgml/entities.html

Typical usage:

      String s = Translate.decode (getTextFromHtmlPage ());
 
or
      String s = "<HTML>" + Translate.encode (getArbitraryText ()) + "</HTML>";
 
Field Summary
protected static int BREAKPOINT
The dividing point between a simple table lookup and a binary search.
static boolean DECODE_LINE_BY_LINE
If this member is set true, decoding of streams is done line by line in order to reduce the maximum memory required.
static boolean ENCODE_HEXADECIMAL
If this member is set true, encoding of numeric character references uses hexadecimal digits, i.e. &#x25CB;, instead of decimal digits.
protected static CharacterReference[] mCharacterList
List of references sorted by character.
protected static CharacterReference[] mCharacterReferences
Table mapping entity reference kernel to character.
Method Summary
static String decode(String string)
Decode a string containing references.
static String decode(StringBuffer buffer)
Decode the characters in a string buffer containing references.
static void decode(InputStream in, PrintStream out)
Decode a stream containing references.
static String encode(int character)
Convert a character to a numeric character reference.
static String encode(String string)
Encode a string to use references.
static void encode(InputStream in, PrintStream out)
Encode a stream to use references.
protected static int lookup(CharacterReference[] array, char ref, int lo, int hi)
Binary search for a reference.
static CharacterReference lookup(char character)
Look up a reference by character.
protected static CharacterReference lookup(CharacterReference key)
Look up a reference by kernel.
static CharacterReference lookup(String kernel, int start, int end)
Look up a reference by kernel.
static void main(String[] args)
Numeric character reference and character entity reference to Unicode codec.

Field Detail

BREAKPOINT

protected static final int BREAKPOINT
The dividing point between a simple table lookup and a binary search. Characters below the break point are stored in a sparse array allowing direct index lookup.

DECODE_LINE_BY_LINE

public static boolean DECODE_LINE_BY_LINE
If this member is set true, decoding of streams is done line by line in order to reduce the maximum memory required.

ENCODE_HEXADECIMAL

public static boolean ENCODE_HEXADECIMAL
If this member is set true, encoding of numeric character references uses hexadecimal digits, i.e. &#x25CB;, instead of decimal digits.

mCharacterList

protected static final CharacterReference[] mCharacterList
List of references sorted by character. The first part of this array, up to BREAKPOINT, is a direct translation table: indexing into it with a character yields the reference. The second part is dense and sorted by character, suitable for binary search.
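
The two-part layout described above can be sketched as follows. This is a minimal illustration, not the library's code: the class name HybridTableSketch, the break point value 0x100, and the handful of entries are all assumptions chosen for the example, and plain strings stand in for CharacterReference objects.

```java
import java.util.Arrays;

/** Hypothetical illustration of a sparse direct table plus a sorted dense tail. */
final class HybridTableSketch {
    // Assumed break point for this sketch; the real BREAKPOINT value is not stated on this page.
    static final int BREAKPOINT = 0x100;

    // Sparse part: index == character code, null where no entity is defined.
    static final String[] DIRECT = new String[BREAKPOINT];

    // Dense part: characters at or above BREAKPOINT, sorted ascending, with parallel names.
    static final char[] DENSE_CHARS = {'\u0152', '\u0153', '\u20AC'};
    static final String[] DENSE_NAMES = {"OElig", "oelig", "euro"};

    static {
        DIRECT['&'] = "amp";
        DIRECT['<'] = "lt";
        DIRECT['>'] = "gt";
        DIRECT['\u00A9'] = "copy";
    }

    /** Returns the entity name for c, or null if this sketch has no entry for it. */
    static String nameFor(char c) {
        if (c < BREAKPOINT) {
            return DIRECT[c];                         // constant-time index lookup
        }
        int i = Arrays.binarySearch(DENSE_CHARS, c);  // logarithmic search in the dense tail
        return i >= 0 ? DENSE_NAMES[i] : null;
    }
}
```

The direct table trades a little memory for constant-time lookups on the most common characters, while the dense tail avoids allocating a slot for every possible character value.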

mCharacterReferences

protected static final CharacterReference[] mCharacterReferences
Table mapping entity reference kernel to character. This is sorted by kernel when the class is loaded.

Method Detail

decode

public static String decode(String string)
Decode a string containing references. Change all numeric character references and character entity references to Unicode characters.

Parameters: string - The string to translate.

Returns: The decoded string.

decode

public static String decode(StringBuffer buffer)
Decode the characters in a string buffer containing references. Change all numeric character references and character entity references to Unicode characters.

Parameters: buffer - The StringBuffer containing references.

Returns: The decoded string.
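
To make the numeric half of decoding concrete, here is a minimal sketch that handles only numeric character references terminated by a semicolon. It is not the library's implementation: NumericDecodeSketch and decodeNumeric are hypothetical names, named entities such as &amp; are deliberately left untouched, and the digit limits in the pattern are an assumption made to keep parsing within int range.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: decodes &#NNN; and &#xHHH; references only. */
final class NumericDecodeSketch {
    // Group 1 captures hexadecimal digits, group 2 captures decimal digits.
    private static final Pattern REF =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]{1,6})|([0-9]{1,7}));");

    static String decodeNumeric(String s) {
        Matcher m = REF.matcher(s);
        StringBuilder out = new StringBuilder(s.length());
        int last = 0;  // end of the last span already copied to out
        while (m.find()) {
            int cp = m.group(1) != null
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2), 10);
            if (!Character.isValidCodePoint(cp)) {
                continue;  // out of range: the text is copied verbatim later
            }
            out.append(s, last, m.start()).appendCodePoint(cp);
            last = m.end();
        }
        return out.append(s, last, s.length()).toString();
    }
}
```

Using appendCodePoint rather than a char cast keeps references above U+FFFF intact as surrogate pairs.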

decode

public static void decode(InputStream in, PrintStream out)
Decode a stream containing references. Change all numeric character references and character entity references to Unicode characters. If DECODE_LINE_BY_LINE is true, the input stream is broken up into lines, each terminated by a carriage return or newline, to reduce latency and the maximum buffering memory required.

Parameters:
in - The stream to translate. It is assumed that the input stream is encoded with ISO-8859-1 since the table of character entity references in this class applies only to ISO-8859-1.
out - The stream to write the decoded stream to.
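
The line-by-line mode can be pictured with the sketch below. It is an assumption-laden illustration rather than the library's code: LineByLineSketch and decodeLines are hypothetical names, and the decoder argument is a stand-in for whatever string-level decode routine is in use.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PrintStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.function.UnaryOperator;

/** Hypothetical sketch of decoding a stream one line at a time. */
final class LineByLineSketch {
    /**
     * Reads ISO-8859-1 text from in, applies decoder to each line, and prints
     * the result to out. Only one line is held in memory at a time.
     * Note: println writes the platform line separator, so the original
     * line terminators are normalized in this sketch.
     */
    static void decodeLines(InputStream in, PrintStream out, UnaryOperator<String> decoder) {
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.ISO_8859_1));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                out.println(decoder.apply(line));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        out.flush();
    }
}
```

A reference that straddles a line break would not be decoded by this approach, which is one reason a whole-input mode can still be preferable when memory allows.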

encode

public static String encode(int character)
Convert a character to a numeric character reference. The Unicode character is written in the form &#xxxx;, or with hexadecimal digits (as in &#x25CB;) when ENCODE_HEXADECIMAL is true.

Parameters: character - The character to convert.

Returns: The numeric character reference for the character.
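
As an illustration of the two output forms, the sketch below formats a code point as either a decimal or a hexadecimal numeric character reference. NumericReferenceSketch and toReference are hypothetical names, and the boolean parameter stands in for the ENCODE_HEXADECIMAL switch; upper-case hexadecimal digits are an assumption taken from the &#x25CB; example.

```java
import java.util.Locale;

/** Hypothetical sketch of numeric character reference formatting. */
final class NumericReferenceSketch {
    /** Formats codePoint as &#NNNN; or, when hexadecimal is true, as &#xHHHH;. */
    static String toReference(int codePoint, boolean hexadecimal) {
        return hexadecimal
                ? "&#x" + Integer.toHexString(codePoint).toUpperCase(Locale.ROOT) + ";"
                : "&#" + codePoint + ";";
    }
}
```

For example, toReference(0x25CB, true) yields &#x25CB; and toReference(0x25CB, false) yields &#9675;.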

encode

public static String encode(String string)
Encode a string to use references. Change all characters that are not ISO-8859-1 to their numeric character reference or character entity reference.

Parameters: string - The string to translate.

Returns: The encoded string.
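
One plausible reading of this rule is sketched below: characters with a known entity name become character entity references, other characters outside ISO-8859-1 become numeric character references, and everything else is copied through. This is an assumption about the behavior, not the library's code; EncodeSketch, its encode method, and the five-entry table are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of encoding text to use references. */
final class EncodeSketch {
    // Tiny illustrative subset of the HTML 4 entity table, keyed by code point.
    private static final Map<Integer, String> NAMES = new HashMap<>();
    static {
        NAMES.put((int) '&', "amp");
        NAMES.put((int) '<', "lt");
        NAMES.put((int) '>', "gt");
        NAMES.put((int) '"', "quot");
        NAMES.put(0x20AC, "euro");
    }

    static String encode(String s) {
        StringBuilder out = new StringBuilder(s.length() * 2);
        s.codePoints().forEach(cp -> {
            String name = NAMES.get(cp);
            if (name != null) {
                out.append('&').append(name).append(';');  // character entity reference
            } else if (cp > 0xFF) {
                out.append("&#").append(cp).append(';');   // numeric character reference
            } else {
                out.appendCodePoint(cp);                   // representable in ISO-8859-1
            }
        });
        return out.toString();
    }
}
```

Preferring the named form where one exists keeps the output readable, while the numeric fallback covers every remaining character.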

encode

public static void encode(InputStream in, PrintStream out)
Encode a stream to use references. Change all characters that are not ISO-8859-1 to their numeric character reference or character entity reference.

Parameters:
in - The stream to translate. It is assumed that the input stream is encoded with ISO-8859-1 since the table of character entity references in this class applies only to ISO-8859-1.
out - The stream to write the encoded stream to.

lookup

protected static int lookup(CharacterReference[] array, char ref, int lo, int hi)
Binary search for a reference.

Parameters:
array - The array of CharacterReference objects.
ref - The character to search for.
lo - The lower index within which to look.
hi - The upper index within which to look.

Returns: The index at which the reference was found or at which it should be inserted.
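
A binary search with this "found index or insertion index" contract might look like the sketch below. It is illustrative only: BinarySearchSketch is a hypothetical name, a plain char array stands in for the CharacterReference array, and treating both lo and hi as inclusive bounds is an assumption.

```java
/** Hypothetical sketch of a binary search that reports an insertion point on a miss. */
final class BinarySearchSketch {
    /**
     * Searches sorted[lo..hi] (both bounds inclusive in this sketch) for ref.
     * Returns the index of ref if present, otherwise the index at which ref
     * would have to be inserted to keep the array sorted.
     */
    static int lookup(char[] sorted, char ref, int lo, int hi) {
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;      // unsigned shift avoids overflow on large bounds
            if (sorted[mid] < ref) {
                lo = mid + 1;
            } else if (sorted[mid] > ref) {
                hi = mid - 1;
            } else {
                return mid;                  // exact match
            }
        }
        return lo;                           // insertion point
    }
}
```

Because a hit and a miss both return a valid-looking index, the caller has to compare the element at the returned position with the key to tell them apart.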

lookup

public static CharacterReference lookup(char character)
Look up a reference by character. Use a combination of direct table lookup and binary search to find the reference corresponding to the character.

Parameters: character - The character to be looked up.

Returns: The entity reference for that character or null.

lookup

protected static CharacterReference lookup(CharacterReference key)
Look up a reference by kernel. Use a binary search on the ordered list of known references. Since the binary search returns the position at which a new item should be inserted, we check the references earlier in the list if there is a failure.

Parameters: key - A character reference with the kernel set to the string to be found. It need not be truncated at the exact end of the reference.
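
The fallback described above, checking earlier entries when the binary search misses, can be sketched as follows. This is an interpretation rather than the library's code: KernelLookupSketch is a hypothetical name, strings stand in for CharacterReference objects, and the kernel list is a small illustrative subset.

```java
import java.util.Arrays;

/** Hypothetical sketch of kernel lookup that tolerates trailing text after the kernel. */
final class KernelLookupSketch {
    // Illustrative subset of kernels, kept sorted so that binary search is valid.
    static final String[] KERNELS = {"amp", "copy", "gt", "lt", "not", "notin", "quot"};

    /** Returns the kernel equal to key or the longest kernel that is a prefix of key, else null. */
    static String lookup(String key) {
        int i = Arrays.binarySearch(KERNELS, key);
        if (i >= 0) {
            return KERNELS[i];                       // exact match
        }
        // On a miss, -(i + 1) is the insertion point. Any kernel that is a proper
        // prefix of key sorts before key, so scan backwards from just below it.
        for (int j = -(i + 1) - 1; j >= 0; j--) {
            String k = KERNELS[j];
            if (key.startsWith(k)) {
                return k;                            // longer prefixes are met first
            }
            if (k.charAt(0) != key.charAt(0)) {
                break;                               // no earlier entry can be a prefix
            }
        }
        return null;
    }
}
```

With this table, a key of "ampere" resolves to "amp" and "notinx" resolves to "notin" rather than "not", since the longer kernel sorts later and is therefore checked first.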

lookup

public static CharacterReference lookup(String kernel, int start, int end)
Look up a reference by kernel. Use a binary search on the ordered list of known references. This is not very efficient; use lookup(CharacterReference) instead.

Parameters:
kernel - The string to look up, e.g. "amp".
start - The starting point in the string of the kernel.
end - The ending point in the string of the kernel. This should be the index of the semicolon if it exists or, failing that, at least an index past the last character of the kernel.

Returns: The reference that matches the given string, or null if it wasn't found.

main

public static void main(String[] args)
Numeric character reference and character entity reference to unicode codec. Translate the System.in input into an encoded or decoded stream and send the results to System.out.

Parameters: args - If args[0] is -encode, perform an encoding on System.in; otherwise perform a decoding.

HTML Parser is an open source library released under LGPL.