Class Regexp

  • All Implemented Interfaces:
    java.io.Serializable

    public class Regexp
    extends java.lang.Object
    implements java.io.Serializable
    The Regexp class can be used to match a pattern against a string and optionally replace the matched parts with new strings.

    Regular expressions were implemented by translating Henry Spencer's regular expression package for tcl8.0. Much of the description below is copied verbatim from the tcl8.0 regsub manual entry.


    REGULAR EXPRESSIONS

    A regular expression is zero or more branches, separated by "|". It matches anything that matches one of the branches.

    A branch is zero or more pieces, concatenated. It matches a match for the first piece, followed by a match for the second piece, etc.

    A piece is an atom, possibly followed by "*", "+", or "?".

    • An atom followed by "*" matches a sequence of 0 or more matches of the atom.
    • An atom followed by "+" matches a sequence of 1 or more matches of the atom.
    • An atom followed by "?" matches either 0 or 1 matches of the atom.

    An atom is

    • a regular expression in parentheses (matching a match for the regular expression)
    • a range (see below)
    • "." (matching any single character)
    • "^" (matching the null string at the beginning of the input string)
    • "$" (matching the null string at the end of the input string)
    • a "\" followed by a single character (matching that character)
    • a single character with no other significance (matching that character).

    A range is a sequence of characters enclosed in "[]". The range normally matches any single character from the sequence. If the sequence begins with "^", the range matches any single character not from the rest of the sequence. If two characters in the sequence are separated by "-", this is shorthand for the full list of characters between them (e.g. "[0-9]" matches any decimal digit). To include a literal "]" in the sequence, make it the first character (following a possible "^"). To include a literal "-", make it the first or last character.

    In general there may be more than one way to match a regular expression to an input string. For example, consider the command

     String[] match = new String[2];
     new Regexp("(a*)b*").match("aabaaabb", match);
     
    Considering only the rules given so far, match[0] and match[1] could end up with the values
    • "aabb" and "aa"
    • "aaab" and "aaa"
    • "ab" and "a"
    or any of several other combinations. To resolve this potential ambiguity, Regexp chooses among alternatives using the rule "first then longest". In other words, it considers the possible matches in order working from left to right across the input string and the pattern, and it attempts to match longer pieces of the input string before shorter ones. More specifically, the following rules apply in decreasing order of priority:
    1. If a regular expression could match two different parts of an input string then it will match the one that begins earliest.
    2. If a regular expression contains "|" operators then the leftmost matching sub-expression is chosen.
    3. In "*", "+", and "?" constructs, longer matches are chosen in preference to shorter ones.
    4. In sequences of expression components the components are considered from left to right.

    In the example from above, "(a*)b*" therefore matches exactly "aab"; the "(a*)" portion of the pattern is matched first and it consumes the leading "aa", then the "b*" portion of the pattern consumes the next "b". Or, consider the following example:

     String[] match = new String[3];
     new Regexp("(ab|a)(b*)c").match("abc", match);
     
    After this command, match[0] will be "abc", match[1] will be "ab", and match[2] will be an empty string. Rule 4 specifies that the "(ab|a)" component gets first shot at the input string and Rule 2 specifies that the "ab" sub-expression is checked before the "a" sub-expression. Thus the "b" has already been claimed before the "(b*)" component is checked and therefore "(b*)" must match an empty string.
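    The two examples above can be reproduced with a small sketch. It assumes the Regexp class is not on the classpath and uses the standard java.util.regex package as a stand-in; that engine is a different implementation, but for these two patterns it resolves the ambiguity the same way (earliest start, leftmost alternative, longer repetitions first). The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Stand-in demo: java.util.regex is used instead of Regexp (assumption).
final class FirstThenLongest {
    /** Returns the whole match followed by each group, or null if nothing matches. */
    static String[] firstMatch(String pattern, String input) {
        Matcher m = Pattern.compile(pattern).matcher(input);
        if (!m.find()) {
            return null;
        }
        String[] result = new String[m.groupCount() + 1];
        for (int i = 0; i < result.length; i++) {
            result[i] = m.group(i);   // index 0 is the entire matched region
        }
        return result;
    }

    public static void main(String[] args) {
        // "(a*)" consumes the leading "aa", then "b*" consumes one "b".
        System.out.println(String.join("|", firstMatch("(a*)b*", "aabaaabb")));
        // "(ab|a)" tries "ab" first, so "(b*)" is left with the empty string.
        System.out.println(String.join("|", firstMatch("(ab|a)(b*)c", "abc")));
    }
}
```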
    REGULAR EXPRESSION SUBSTITUTION

    Regular expression substitution matches a string against a regular expression, transforming the string by replacing the matched region(s) with new substring(s).

    What gets substituted into the result is controlled by a subspec. The subspec is a formatting string that specifies what portions of the matched region should be substituted into the result.

    • "&" or "\0" is replaced with a copy of the entire matched region.
    • "\n", where n is a digit from 1 to 9, is replaced with a copy of the nth subexpression.
    • "\&" or "\\" are replaced with just "&" or "\" to escape their special meaning.
    • any other character is passed through.
    In the above, strings like "\2" represent the two characters backslash and "2", not the Unicode character 0002.
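    The subspec rules listed above can be sketched as a standalone routine. This is an illustration under stated assumptions, not the code behind sub, subAll, or applySubspec: it takes the matched substrings as a plain String[] (index 0 being the entire matched region), appends nothing for a subexpression that is null or out of range, and passes a backslash through unchanged when it is followed by any other character. Those last two behaviors are not specified in the text above. The class and method names are hypothetical.

```java
// Sketch of the subspec rules; the handling of unmatched groups and of
// "\" followed by an ordinary character is an assumption.
final class SubspecSketch {
    static String apply(String[] groups, String subspec) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < subspec.length(); i++) {
            char c = subspec.charAt(i);
            if (c == '&') {
                append(sb, groups, 0);               // "&": entire matched region
            } else if (c == '\\' && i + 1 < subspec.length()) {
                char next = subspec.charAt(i + 1);
                if (next >= '0' && next <= '9') {
                    append(sb, groups, next - '0');  // "\0" whole match, "\n" nth group
                    i++;
                } else if (next == '&' || next == '\\') {
                    sb.append(next);                 // "\&" and "\\": escaped literals
                    i++;
                } else {
                    sb.append(c);                    // assumption: keep the backslash
                }
            } else {
                sb.append(c);                        // any other character passes through
            }
        }
        return sb.toString();
    }

    private static void append(StringBuilder sb, String[] groups, int n) {
        if (n < groups.length && groups[n] != null) {
            sb.append(groups[n]);
        }
    }
}
```

    For instance, with groups {"abc,def", "abc", "def"}, the subspec "\\2,\\1" (written as a Java string literal) yields "def,abc".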
    Here is an example of how to use Regexp:
    
        public static void
        main(String[] args)
            throws Exception
        {
            Regexp re;
            String[] matches;
            String s;
    
            /*
             * A regular expression to match the first line of an HTTP request.
             *
             * 1. ^               - starting at the beginning of the line
             * 2. ([A-Z]+)        - match and remember some upper case characters
             * 3. [ \t]+          - skip blank space
             * 4. ([^ \t]+)       - match and remember up to the next blank space
             * 5. [ \t]+          - skip more blank space
             * 6. (HTTP/1\\.[01]) - match and remember HTTP/1.0 or HTTP/1.1
             * 7. $               - end of string - no chars left.
             */
    
            s = "GET http://a.b.com:1234/index.html HTTP/1.1";
    
            re = new Regexp("^([A-Z]+)[ \t]+([^ \t]+)[ \t]+(HTTP/1\\.[01])$");
            matches = new String[4];
            if (re.match(s, matches)) {
                System.out.println("METHOD  " + matches[1]);
                System.out.println("URL     " + matches[2]);
                System.out.println("VERSION " + matches[3]);
            }
    
            /*
             * A regular expression to extract some simple comma-separated data,
             * reorder some of the columns, and discard column 2.
             */
    
            s = "abc,def,ghi,klm,nop,pqr";
    
            re = new Regexp("^([^,]+),([^,]+),([^,]+),(.*)");
            System.out.println(re.sub(s, "\\3,\\1,\\4"));
        }
     
    Version:
    2.3
    Author:
    Colin Stevens (colin.stevens@sun.com)
    See Also:
    Regsub, Serialized Form
    • Nested Class Summary

      Nested Classes 
      Modifier and Type Class Description
      static interface  Regexp.Filter
      This interface is used by the Regexp class to generate the replacement string for each pattern match found in the source string.
    • Constructor Summary

      Constructors 
      Constructor Description
      Regexp​(java.lang.String pat)
      Compiles a new Regexp object from the given regular expression pattern.
      Regexp​(java.lang.String pat, boolean ignoreCase)
      Compiles a new Regexp object from the given regular expression pattern.
    • Method Summary

      Modifier and Type Method Description
      static void applySubspec​(Regsub rs, java.lang.String subspec, java.lang.StringBuffer sb)
      Utility method to give access to the standard substitution algorithm used by sub and subAll.
      static void main​(java.lang.String[] args)  
      java.lang.String match​(java.lang.String str)
      Matches the given string against this regular expression.
      boolean match​(java.lang.String str, int[] indices)
      Matches the given string against this regular expression, and computes the set of substrings that matched the parenthesized subexpressions.
      boolean match​(java.lang.String str, java.lang.String[] substrs)
      Matches the given string against this regular expression, and computes the set of substrings that matched the parenthesized subexpressions.
      java.lang.String sub​(java.lang.String str, java.lang.String subspec)
      Matches a string against a regular expression and replaces the first match with the string generated from the substitution parameter.
      java.lang.String sub​(java.lang.String str, Regexp.Filter rf)  
      java.lang.String subAll​(java.lang.String str, java.lang.String subspec)
      Matches a string against a regular expression and replaces all matches with the string generated from the substitution parameter.
      int subspecs()
      Returns the number of parenthesized subexpressions in this regular expression, plus one more for this expression itself.
      java.lang.String toString()
      Returns a string representation of this compiled regular expression.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
    • Constructor Detail

      • Regexp

        public Regexp​(java.lang.String pat)
               throws java.lang.IllegalArgumentException
        Compiles a new Regexp object from the given regular expression pattern.

        It takes a certain amount of time to parse and validate a regular expression pattern before it can be used to perform matches or substitutions. If the caller caches the new Regexp object, that parsing time will be saved because the same Regexp can be used with respect to many different strings.
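        The caching advice can be sketched as follows. The example assumes the Regexp class is not available and uses java.util.regex.Pattern as a stand-in, since the same compile-once, match-many idea applies to it; the class name, field name, and method name are hypothetical.

```java
import java.util.regex.Pattern;

// Stand-in sketch: Pattern plays the role of a cached Regexp (assumption).
final class RequestLineCounter {
    // Compiled once when the class is loaded, then reused for every line.
    private static final Pattern REQUEST_LINE =
            Pattern.compile("^([A-Z]+)[ \t]+([^ \t]+)[ \t]+(HTTP/1\\.[01])$");

    /** Counts how many of the given lines look like an HTTP/1.x request line. */
    static int countRequestLines(String[] lines) {
        int count = 0;
        for (String line : lines) {
            if (REQUEST_LINE.matcher(line).find()) {
                count++;
            }
        }
        return count;
    }
}
```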

        Parameters:
        pat - The string holding the regular expression pattern.
        Throws:
        java.lang.IllegalArgumentException - if the pattern is malformed. The detail message for the exception will be set to a string indicating how the pattern was malformed.
      • Regexp

        public Regexp​(java.lang.String pat,
                      boolean ignoreCase)
               throws java.lang.IllegalArgumentException
        Compiles a new Regexp object from the given regular expression pattern.
        Parameters:
        pat - The string holding the regular expression pattern.
        ignoreCase - If true then this regular expression will do case-insensitive matching. If false, then the matches are case-sensitive. Regular expressions generated by Regexp(String) are case-sensitive.
        Throws:
        java.lang.IllegalArgumentException - if the pattern is malformed. The detail message for the exception will be set to a string indicating how the pattern was malformed.
    • Method Detail

      • main

        public static void main​(java.lang.String[] args)
                         throws java.lang.Exception
        Throws:
        java.lang.Exception
      • subspecs

        public int subspecs()
        Returns the number of parenthesized subexpressions in this regular expression, plus one more for this expression itself.
        Returns:
        The number of parenthesized subexpressions, plus one.
      • match

        public java.lang.String match​(java.lang.String str)
        Matches the given string against this regular expression.
        Parameters:
        str - The string to match.
        Returns:
        The substring of str that matched the entire regular expression, or null if the string did not match this regular expression.
      • match

        public boolean match​(java.lang.String str,
                             java.lang.String[] substrs)
        Matches the given string against this regular expression, and computes the set of substrings that matched the parenthesized subexpressions.

        substrs[0] is set to the range of str that matched the entire regular expression.

        substrs[1] is set to the range of str that matched the first (leftmost) parenthesized subexpression. substrs[n] is set to the range that matched the nth subexpression, and so on.

        If subexpression n did not match, then substrs[n] is set to null. This is distinct from "", which is a valid value for a subexpression that matched 0 characters.

        The length that the caller should use when allocating the substrs array is the return value of Regexp.subspecs. The array can be shorter (in which case not all the information will be returned), or longer (in which case the remainder of the elements are initialized to null), or null (to ignore the subexpressions).
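        The array contract described above (shorter, longer, or null arrays; null versus "" for a subexpression) can be sketched as below. This is an emulation on top of java.util.regex, used as a stand-in on the assumption that Regexp is not on the classpath; groupCount() + 1 plays the role of subspecs(). The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Emulates the documented substrs contract with java.util.regex (assumption).
final class SubstrsSketch {
    static boolean match(Pattern p, String str, String[] substrs) {
        Matcher m = p.matcher(str);
        if (!m.find()) {
            return false;                 // substrs is left unchanged
        }
        if (substrs != null) {            // null means: ignore the subexpressions
            for (int i = 0; i < substrs.length; i++) {
                // Slots beyond the last subexpression are set to null.
                substrs[i] = (i <= m.groupCount()) ? m.group(i) : null;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("(x)?(b*)c");
        String[] substrs = new String[4];  // one slot more than groupCount() + 1
        boolean ok = match(p, "c", substrs);
        // "(x)?" did not take part in the match (null); "(b*)" matched 0 characters ("").
        System.out.println(ok + " " + Arrays.toString(substrs));
    }
}
```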

        Parameters:
        str - The string to match.
        substrs - An array of strings allocated by the caller, and filled in with information about the portions of str that matched the regular expression. May be null.
        Returns:
        true if str matched this regular expression, false otherwise. If false is returned, then the contents of substrs are unchanged.
        See Also:
        subspecs()
      • match

        public boolean match​(java.lang.String str,
                             int[] indices)
        Matches the given string against this regular expression, and computes the set of substrings that matched the parenthesized subexpressions.

        For the indices specified below, the range extends from the character at the starting index up to, but not including, the character at the ending index.

        indices[0] and indices[1] are set to starting and ending indices of the range of str that matched the entire regular expression.

        indices[2] and indices[3] are set to the starting and ending indices of the range of str that matched the first (leftmost) parenthesized subexpression. indices[n * 2] and indices[n * 2 + 1] are set to the range that matched the nth subexpression, and so on.

        If subexpression n did not match, then indices[n * 2] and indices[n * 2 + 1] are both set to -1.

        The length that the caller should use when allocating the indices array is twice the return value of Regexp.subspecs. The array can be shorter (in which case not all the information will be returned), or longer (in which case the remainder of the elements are initialized to -1), or null (to ignore the subexpressions).
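        As an illustration of the layout described above, the following sketch turns a filled indices array back into substrings. It relies only on the documented conventions (pairs at n * 2 and n * 2 + 1, end index exclusive, -1 for a subexpression that did not match); the class and method names are hypothetical.

```java
// Decodes an indices array laid out as described above.
final class IndicesSketch {
    static String[] toSubstrings(String str, int[] indices) {
        String[] out = new String[indices.length / 2];
        for (int n = 0; n < out.length; n++) {
            int start = indices[2 * n];
            int end = indices[2 * n + 1];   // exclusive, like String.substring
            out[n] = (start < 0) ? null : str.substring(start, end);
        }
        return out;
    }
}
```

        For "(ab|a)(b*)c" matched against "abc", the indices {0, 3, 0, 2, 2, 2} decode to "abc", "ab", and the empty string.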

        Parameters:
        str - The string to match.
        indices - An array of integers allocated by the caller, and filled in with information about the portions of str that matched all the parts of the regular expression. May be null.
        Returns:
        true if the string matched the regular expression, false otherwise. If false is returned, then the contents of indices are unchanged.
        See Also:
        subspecs()
      • sub

        public java.lang.String sub​(java.lang.String str,
                                    java.lang.String subspec)
        Matches a string against a regular expression and replaces the first match with the string generated from the substitution parameter.
        Parameters:
        str - The string to match against this regular expression.
        subspec - The substitution parameter, described in REGULAR EXPRESSION SUBSTITUTION.
        Returns:
        The string formed by replacing the first match in str with the string generated from subspec. If no matches were found, then the return value is null.
      • subAll

        public java.lang.String subAll​(java.lang.String str,
                                       java.lang.String subspec)
        Matches a string against a regular expression and replaces all matches with the string generated from the substitution parameter. After each substitution is done, the portions of the string already examined, including the newly substituted region, are not checked again for new matches -- only the rest of the string is examined.
        Parameters:
        str - The string to match against this regular expression.
        subspec - The substitution parameter, described in REGULAR EXPRESSION SUBSTITUTION.
        Returns:
        The string formed by replacing all the matches in str with the strings generated from subspec. If no matches were found, then the return value is a copy of str.
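        The difference between the documented return values of sub and subAll, and the rule that substituted text is never rescanned, can be sketched as below. The sketch assumes Regexp is not on the classpath and uses java.util.regex as a stand-in; for simplicity the replacement is treated as literal text rather than a subspec. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Stand-in sketch of the sub / subAll return conventions (assumption:
// literal replacement text, java.util.regex instead of Regexp).
final class SubAllSketch {
    /** Replaces the first match; returns null when nothing matches. */
    static String sub(Pattern p, String str, String replacement) {
        Matcher m = p.matcher(str);
        if (!m.find()) {
            return null;
        }
        return str.substring(0, m.start()) + replacement + str.substring(m.end());
    }

    /** Replaces every match; returns a copy of str when nothing matches. */
    static String subAll(Pattern p, String str, String replacement) {
        Matcher m = p.matcher(str);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(str, last, m.start());  // text between matches, unchanged
            out.append(replacement);           // written to out, never rescanned
            last = m.end();
        }
        out.append(str, last, str.length());   // remainder after the last match
        return out.toString();
    }
}
```

        Replacing "a" with "aa" in "aaa" therefore terminates and gives "aaaaaa": each original "a" is visited once, and the inserted text is not matched again.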
      • applySubspec

        public static void applySubspec​(Regsub rs,
                                        java.lang.String subspec,
                                        java.lang.StringBuffer sb)
        Utility method to give access to the standard substitution algorithm used by sub and subAll. Appends to the string buffer the string generated by applying the substitution parameter to the matched region.
        Parameters:
        rs - Information about the matched region.
        subspec - The substitution parameter.
        sb - StringBuffer to which the generated string is appended.
      • sub

        public java.lang.String sub​(java.lang.String str,
                                    Regexp.Filter rf)
      • toString

        public java.lang.String toString()
        Returns a string representation of this compiled regular expression. The format of the string representation is a symbolic dump of the bytecodes.
        Overrides:
        toString in class java.lang.Object
        Returns:
        A string representation of this regular expression.