final class Utf8
extends java.lang.Object
There are several variants of UTF-8. The one implemented by this class is the restricted definition of UTF-8 introduced in Unicode 3.1, which mandates the rejection of "overlong" byte sequences as well as rejection of 3-byte surrogate codepoint byte sequences. Note that the UTF-8 decoder included in Oracle's JDK has been modified to also reject "overlong" byte sequences, but (as of 2011) still accepts 3-byte surrogate codepoint byte sequences.
The byte sequences considered valid by this class are exactly those that can be roundtrip converted to Strings and back to bytes using the UTF-8 charset, without loss:
Arrays.equals(bytes, new String(bytes, Internal.UTF_8).getBytes(Internal.UTF_8))
See the Unicode Standard: Table 3-6, UTF-8 Bit Distribution, and Table 3-7, Well-Formed UTF-8 Byte Sequences.
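The round-trip criterion above can be tried with the JDK alone. The following is a minimal sketch, not this class's implementation: it substitutes StandardCharsets.UTF_8 for Internal.UTF_8 (assumed here to name the same charset), and the class RoundTripCheck and its method roundTrips are hypothetical names used only for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class RoundTripCheck {
    /** Returns true if decoding the bytes and re-encoding the result reproduces them exactly. */
    static boolean roundTrips(byte[] bytes) {
        // Malformed input is replaced with U+FFFD during decoding, so re-encoding changes the bytes.
        String decoded = new String(bytes, StandardCharsets.UTF_8);
        return Arrays.equals(bytes, decoded.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        byte[] valid = "h\u00E9llo".getBytes(StandardCharsets.UTF_8);
        byte[] overlong = {(byte) 0xC0, (byte) 0x80};               // overlong form of U+0000
        byte[] surrogate = {(byte) 0xED, (byte) 0xA0, (byte) 0x80}; // 3-byte form of U+D800
        System.out.println("valid:     " + roundTrips(valid));
        System.out.println("overlong:  " + roundTrips(overlong));
        System.out.println("surrogate: " + roundTrips(surrogate));
    }
}
```

The check allocates a String and a second byte array, so it is useful as a specification of validity rather than as a fast path.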
This class supports decoding of partial byte sequences, so that the bytes in a complete UTF-8 byte sequence can be stored in multiple segments. Methods typically return MALFORMED if the partial byte sequence is definitely not well-formed, or COMPLETE if it is well-formed in the absence of additional input. If the byte sequence apparently terminated in the middle of a character, they return an opaque integer "state" value containing enough information to decode the character when passed to a subsequent invocation of a partial decoding method.
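The state-passing contract described above can be illustrated with a small standalone validator. This is a sketch under stated assumptions, not this class's code: the class ChunkedUtf8Sketch, the method feed, the helper names, and in particular the bit layout of the state value are all hypothetical. Only the byte ranges (Table 3-7 of the Unicode Standard) and the three-way result are taken from the description.

```java
final class ChunkedUtf8Sketch {
    static final int COMPLETE = 0;
    static final int MALFORMED = -1;

    /** Validates bytes[index, limit), resuming from the state returned for the previous segment. */
    static int feed(int state, byte[] bytes, int index, int limit) {
        if (state == MALFORMED) {
            return MALFORMED;
        }
        byte[] pending = new byte[4];
        int n = 0;
        if (state != COMPLETE) {
            // Assumed layout: bits 24-25 hold the pending byte count, bits 0-23 hold the bytes.
            n = state >>> 24;
            for (int k = 0; k < n; k++) {
                pending[k] = (byte) (state >>> (8 * k));
            }
        }
        int i = index;
        while (true) {
            if (n == 0) {
                if (i == limit) {
                    return COMPLETE; // ended on a character boundary
                }
                pending[n++] = bytes[i++];
            }
            int need = sequenceLength(pending[0]);
            if (need < 0) {
                return MALFORMED;
            }
            while (n < need && i < limit) {
                pending[n++] = bytes[i++];
            }
            if (!trailingBytesOk(pending, n)) {
                return MALFORMED;
            }
            if (n < need) {
                return pack(pending, n); // input ended mid-character
            }
            n = 0; // one full character consumed
        }
    }

    /** Length implied by the lead byte, or -1 if the byte cannot start a sequence. */
    private static int sequenceLength(byte lead) {
        int b = lead & 0xFF;
        if (b < 0x80) return 1;
        if (b >= 0xC2 && b <= 0xDF) return 2;
        if (b >= 0xE0 && b <= 0xEF) return 3;
        if (b >= 0xF0 && b <= 0xF4) return 4;
        return -1; // continuation byte, overlong lead (C0, C1), or lead beyond U+10FFFF
    }

    /** Checks the continuation bytes seen so far against the well-formed ranges. */
    private static boolean trailingBytesOk(byte[] p, int n) {
        for (int k = 1; k < n; k++) {
            int lo = 0x80;
            int hi = 0xBF;
            if (k == 1) {
                switch (p[0] & 0xFF) {
                    case 0xE0: lo = 0xA0; break; // rejects overlong 3-byte forms
                    case 0xED: hi = 0x9F; break; // rejects surrogate code points
                    case 0xF0: lo = 0x90; break; // rejects overlong 4-byte forms
                    case 0xF4: hi = 0x8F; break; // rejects values above U+10FFFF
                    default: break;
                }
            }
            int b = p[k] & 0xFF;
            if (b < lo || b > hi) {
                return false;
            }
        }
        return true;
    }

    private static int pack(byte[] p, int n) {
        int state = n << 24; // n is 1..3 here, so the result is positive and never COMPLETE
        for (int k = 0; k < n; k++) {
            state |= (p[k] & 0xFF) << (8 * k);
        }
        return state;
    }
}
```

A caller would start with COMPLETE, feed each segment in order, and treat any final result other than COMPLETE as invalid input.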
Modifier and Type | Class and Description |
---|---|
private static class | Utf8.DecodeUtil - Utility methods for decoding bytes into String. |
(package private) static class | Utf8.Processor - A processor of UTF-8 strings, providing methods for checking validity and encoding. |
(package private) static class | Utf8.SafeProcessor - Utf8.Processor implementation that does not use any sun.misc.Unsafe methods. |
(package private) static class | Utf8.UnpairedSurrogateException |
(package private) static class | Utf8.UnsafeProcessor - Utf8.Processor that uses sun.misc.Unsafe where possible to improve performance. |
Modifier and Type | Field and Description |
---|---|
private static long | ASCII_MASK_LONG - A mask used when performing unsafe reads to determine if a long value contains any non-ASCII characters (i.e. |
static int | COMPLETE - State value indicating that the byte sequence is well-formed and complete (no further bytes are needed to complete a character). |
static int | MALFORMED - State value indicating that the byte sequence is definitely not well-formed. |
(package private) static int | MAX_BYTES_PER_CHAR - Maximum number of bytes per Java UTF-16 char in UTF-8. |
private static Utf8.Processor | processor - UTF-8 is a runtime hot spot so we attempt to provide heavily optimized implementations depending on what is available on the platform. |
private static int | UNSAFE_COUNT_ASCII_THRESHOLD - Used by Unsafe UTF-8 string validation logic to determine the minimum string length above which to employ an optimized algorithm for counting ASCII characters. |
Modifier | Constructor and Description |
---|---|
private | Utf8() |
Modifier and Type | Method and Description |
---|---|
(package private) static java.lang.String | decodeUtf8(byte[] bytes, int index, int size) - Decodes the given UTF-8 encoded byte array slice into a String. |
(package private) static java.lang.String | decodeUtf8(java.nio.ByteBuffer buffer, int index, int size) - Decodes the given UTF-8 portion of the ByteBuffer into a String. |
(package private) static int | encode(java.lang.CharSequence in, byte[] out, int offset, int length) |
(package private) static int | encodedLength(java.lang.CharSequence sequence) - Returns the number of bytes in the UTF-8-encoded form of sequence. |
private static int | encodedLengthGeneral(java.lang.CharSequence sequence, int start) |
(package private) static void | encodeUtf8(java.lang.CharSequence in, java.nio.ByteBuffer out) - Encodes the given characters to the target ByteBuffer using UTF-8 encoding. |
private static int | estimateConsecutiveAscii(java.nio.ByteBuffer buffer, int index, int limit) - Counts (approximately) the number of consecutive ASCII characters in the given buffer. |
private static int | incompleteStateFor(byte[] bytes, int index, int limit) |
private static int | incompleteStateFor(java.nio.ByteBuffer buffer, int byte1, int index, int remaining) |
private static int | incompleteStateFor(int byte1) |
private static int | incompleteStateFor(int byte1, int byte2) |
private static int | incompleteStateFor(int byte1, int byte2, int byte3) |
static boolean | isValidUtf8(byte[] bytes) - Returns true if the given byte array is a well-formed UTF-8 byte sequence. |
static boolean | isValidUtf8(byte[] bytes, int index, int limit) - Returns true if the given byte array slice is a well-formed UTF-8 byte sequence. |
(package private) static boolean | isValidUtf8(java.nio.ByteBuffer buffer) - Determines if the given ByteBuffer is a valid UTF-8 string. |
static int | partialIsValidUtf8(int state, byte[] bytes, int index, int limit) - Tells whether the given byte array slice is a well-formed, malformed, or incomplete UTF-8 byte sequence. |
(package private) static int | partialIsValidUtf8(int state, java.nio.ByteBuffer buffer, int index, int limit) - Determines if the given ByteBuffer is a partially valid UTF-8 string. |
private static final Utf8.Processor processor
private static final long ASCII_MASK_LONG
static final int MAX_BYTES_PER_CHAR
See also:
CharsetEncoder.maxBytesPerChar(), Constant Field Values
public static final int COMPLETE
public static final int MALFORMED
private static final int UNSAFE_COUNT_ASCII_THRESHOLD
Used by Unsafe UTF-8 string validation logic to determine the minimum string length above which to employ an optimized algorithm for counting ASCII characters. The reason for this threshold is that for small strings, the optimization may not be beneficial or may even negatively impact performance, since it requires additional logic to avoid unaligned reads (when calling Unsafe.getLong()). The threshold guarantees that even if the initial offset is unaligned, at least one call to Unsafe.getLong() is made, which provides a performance improvement that entirely subsumes the cost of the additional logic.
public static boolean isValidUtf8(byte[] bytes)
Returns true if the given byte array is a well-formed UTF-8 byte sequence.
This is a convenience method, equivalent to a call to isValidUtf8(bytes, 0, bytes.length).
public static boolean isValidUtf8(byte[] bytes, int index, int limit)
Returns true if the given byte array slice is a well-formed UTF-8 byte sequence. The range of bytes to be checked extends from index index, inclusive, to limit, exclusive.
This is a convenience method, equivalent to partialIsValidUtf8(Utf8.COMPLETE, bytes, index, limit) == Utf8.COMPLETE.
public static int partialIsValidUtf8(int state, byte[] bytes, int index, int limit)
Tells whether the given byte array slice is a well-formed, malformed, or incomplete UTF-8 byte sequence. The range of bytes to be checked extends from index index, inclusive, to limit, exclusive.
Parameters:
state - either COMPLETE (if this is the initial decoding operation) or the value returned from a call to a partial decoding method for the previous bytes
Returns:
MALFORMED if the partial byte sequence is definitely not well-formed; COMPLETE if it is well-formed (no additional input needed); or, if the byte sequence is "incomplete", i.e. apparently terminated in the middle of a character, an opaque integer "state" value containing enough information to decode the character when passed to a subsequent invocation of a partial decoding method.
private static int incompleteStateFor(int byte1)
private static int incompleteStateFor(int byte1, int byte2)
private static int incompleteStateFor(int byte1, int byte2, int byte3)
private static int incompleteStateFor(byte[] bytes, int index, int limit)
private static int incompleteStateFor(java.nio.ByteBuffer buffer, int byte1, int index, int remaining)
static int encodedLength(java.lang.CharSequence sequence)
Returns the number of bytes in the UTF-8-encoded form of sequence. For a string, this method is equivalent to string.getBytes(UTF_8).length, but is more efficient in both time and space.
Throws:
java.lang.IllegalArgumentException - if sequence contains ill-formed UTF-16 (unpaired surrogates)
private static int encodedLengthGeneral(java.lang.CharSequence sequence, int start)
static int encode(java.lang.CharSequence in, byte[] out, int offset, int length)
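The length computation described for encodedLength can be sketched without allocating the encoded bytes. The code below is an illustration only, assuming a straightforward per-char scan. The class Utf8LengthSketch and method utf8Length are hypothetical names, and the sketch does not guard against int overflow for very long inputs.

```java
import java.nio.charset.StandardCharsets;

final class Utf8LengthSketch {
    /** Counts the UTF-8 bytes needed for the sequence, rejecting unpaired surrogates. */
    static int utf8Length(CharSequence sequence) {
        int bytes = 0;
        for (int i = 0; i < sequence.length(); i++) {
            char c = sequence.charAt(i);
            if (c < 0x80) {
                bytes += 1;                      // ASCII
            } else if (c < 0x800) {
                bytes += 2;
            } else if (Character.isHighSurrogate(c)) {
                if (i + 1 == sequence.length() || !Character.isLowSurrogate(sequence.charAt(i + 1))) {
                    throw new IllegalArgumentException("Unpaired surrogate at index " + i);
                }
                bytes += 4;                      // a surrogate pair encodes one supplementary code point
                i++;                             // skip the low surrogate
            } else if (Character.isLowSurrogate(c)) {
                throw new IllegalArgumentException("Unpaired surrogate at index " + i);
            } else {
                bytes += 3;
            }
        }
        return bytes;
    }

    public static void main(String[] args) {
        String s = "h\u00E9llo \u20AC \uD83D\uDE00";
        // For well-formed strings the count matches the length of the encoded array.
        System.out.println(utf8Length(s) == s.getBytes(StandardCharsets.UTF_8).length);
    }
}
```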
static boolean isValidUtf8(java.nio.ByteBuffer buffer)
Determines if the given ByteBuffer is a valid UTF-8 string.
Selects an optimal algorithm based on the type of ByteBuffer (i.e. heap or direct) and the capabilities of the platform.
Parameters:
buffer - the buffer to check.
See also:
isValidUtf8(byte[], int, int)
static int partialIsValidUtf8(int state, java.nio.ByteBuffer buffer, int index, int limit)
Determines if the given ByteBuffer is a partially valid UTF-8 string.
Selects an optimal algorithm based on the type of ByteBuffer (i.e. heap or direct) and the capabilities of the platform.
Parameters:
buffer - the buffer to check.
See also:
partialIsValidUtf8(int, byte[], int, int)
static java.lang.String decodeUtf8(java.nio.ByteBuffer buffer, int index, int size) throws InvalidProtocolBufferException
Decodes the given UTF-8 portion of the ByteBuffer into a String.
Throws:
InvalidProtocolBufferException - if the input is not valid UTF-8.
static java.lang.String decodeUtf8(byte[] bytes, int index, int size) throws InvalidProtocolBufferException
Decodes the given UTF-8 encoded byte array slice into a String.
Throws:
InvalidProtocolBufferException - if the input is not valid UTF-8.
static void encodeUtf8(java.lang.CharSequence in, java.nio.ByteBuffer out)
Encodes the given characters to the target ByteBuffer using UTF-8 encoding.
Selects an optimal algorithm based on the type of ByteBuffer (i.e. heap or direct) and the capabilities of the platform.
Parameters:
in - the source string to be encoded
out - the target buffer to receive the encoded string.
See also:
encode(CharSequence, byte[], int, int)
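Because encodeUtf8 is package-private, code outside the package cannot call it. As a rough stand-in, the JDK's CharsetEncoder can write a CharSequence into a target ByteBuffer in a similar way. The sketch below uses that JDK API rather than this class's algorithm. The class EncodeIntoBufferSketch and method encodeInto are hypothetical names, and the exception types are the JDK's, not necessarily those thrown by encodeUtf8.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class EncodeIntoBufferSketch {
    /** Encodes the characters into out at its current position, failing on unpaired surrogates. */
    static void encodeInto(CharSequence in, ByteBuffer out) throws CharacterCodingException {
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        // endOfInput = true: the whole sequence is supplied in one call.
        CoderResult result = encoder.encode(CharBuffer.wrap(in), out, true);
        if (result.isUnderflow()) {
            result = encoder.flush(out);
        }
        if (!result.isUnderflow()) {
            // Overflow becomes BufferOverflowException, malformed input MalformedInputException.
            result.throwException();
        }
    }

    public static void main(String[] args) throws CharacterCodingException {
        ByteBuffer out = ByteBuffer.allocate(16);
        encodeInto("h\u00E9llo", out);
        System.out.println("bytes written: " + out.position());
    }
}
```

The same method works for heap and direct buffers, since the encoder only relies on the ByteBuffer interface.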
private static int estimateConsecutiveAscii(java.nio.ByteBuffer buffer, int index, int limit)
Counts (approximately) the number of consecutive ASCII characters in the given buffer. The byte order of the ByteBuffer does not matter, so performance can be improved if native byte order is used (i.e. no byte-swapping in ByteBuffer.getLong(int)).
Parameters:
buffer - the buffer to be scanned for ASCII chars
index - the starting index of the scan
limit - the limit within buffer for the scan
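The approximate ASCII scan described above can be sketched by reading eight bytes at a time and testing the high bit of each byte. This is a sketch under assumptions, not the method's actual body: the class AsciiRunSketch, the method estimateAsciiRun, and the mask value 0x8080808080808080L are choices made here for illustration. The result is a lower bound, rounded down to a multiple of eight, which is one way to read "approximately".

```java
import java.nio.ByteBuffer;

final class AsciiRunSketch {
    // Assumed mask: the high bit of each of the eight bytes in a long.
    private static final long HIGH_BITS = 0x8080808080808080L;

    /** Returns a lower bound on the number of consecutive ASCII bytes starting at index. */
    static int estimateAsciiRun(ByteBuffer buffer, int index, int limit) {
        int i = index;
        final int lastFullWord = limit - 7; // an 8-byte read at i needs i + 7 < limit
        // Byte order is irrelevant: the test only asks whether any byte has its high bit set.
        while (i < lastFullWord && (buffer.getLong(i) & HIGH_BITS) == 0) {
            i += 8;
        }
        return i - index;
    }

    public static void main(String[] args) {
        byte[] data = "plain ascii text, then \u00E9".getBytes(java.nio.charset.StandardCharsets.UTF_8);
        System.out.println(estimateAsciiRun(ByteBuffer.wrap(data), 0, data.length));
    }
}
```

A caller would follow the word-at-a-time scan with a byte-at-a-time loop from the returned offset to find the exact position of the first non-ASCII byte.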