Description
Given an integer array data
representing the data, return whether it is a valid UTF-8 encoding (i.e. it translates to a sequence of valid UTF-8 encoded characters).
A character in UTF8 can be from 1 to 4 bytes long, subjected to the following rules:
- For a 1-byte character, the first bit is a
0
, followed by its Unicode code. - For an n-bytes character, the first
n
bits are all one’s, then + 1
bit is0
, followed byn - 1
bytes with the most significant2
bits being10
.
This is how the UTF-8 encoding would work:
Number of Bytes | UTF-8 Octet Sequence
| (binary)
--------------------+-----------------------------------------
1 | 0xxxxxxx
2 | 110xxxxx 10xxxxxx
3 | 1110xxxx 10xxxxxx 10xxxxxx
4 | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
x
denotes a bit in the binary form of a byte that may be either 0
or 1
.
Examples:
Example 1:
Input: data = [197,130,1]
Output: true
Explanation: data represents the octet sequence: 11000101 10000010 00000001.
It is a valid utf-8 encoding for a 2-bytes character followed by a 1-byte character.
Example 2:
Input: data = [235,140,4]
Output: false
Explanation: data represented the octet sequence: 11101011 10001100 00000100.
The first 3 bits are all one's and the 4th bit is 0 means it is a 3-bytes character.
The next byte is a continuation byte which starts with 10 and that's correct.
But the second continuation byte does not start with 10, so it is invalid.
Solution in Python
UTF-8 Encoding Rules:
- 1-byte character:
- Format:
0xxxxxxx
- The first bit is
0
.
- Format:
- 2-byte character:
- Format:
110xxxxx 10xxxxxx
- The first byte starts with
110
, and the following byte starts with10
.
- Format:
- 3-byte character:
- Format:
1110xxxx 10xxxxxx 10xxxxxx
- The first byte starts with
1110
, and the next two bytes start with10
.
- Format:
- 4-byte character:
- Format:
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
- The first byte starts with
11110
, and the next three bytes start with10
.
- Format:
Each byte in the data array can be checked by looking at the binary representation of the first few bits to determine its role (starting byte or continuation byte).
Solution Strategy:
- Identify the Starting Byte:
- For each starting byte, determine how many bytes should follow by counting the number of leading
1
s until we encounter a0
.
- For each starting byte, determine how many bytes should follow by counting the number of leading
- Check Continuation Bytes:
- Each continuation byte should start with
10
.
- Each continuation byte should start with
- Move Through the Bytes:
- Use a pointer to iterate through the bytes. When encountering a starting byte, check that the next bytes are valid continuation bytes.
Python
class Solution:
def validUtf8(self, data: List[int]) -> bool:
# Initialize the index that will move through the array
i = 0
while i < len(data):
# Get the byte in binary format, only considering the last 8 bits
byte = data[i] & 0xFF
# Determine the number of leading 1s in the byte, this tells us the expected number of bytes
if byte & 0x80 == 0:
# 1-byte character (0xxxxxxx)
num_bytes = 1
elif byte & 0xE0 == 0xC0:
# 2-byte character (110xxxxx)
num_bytes = 2
elif byte & 0xF0 == 0xE0:
# 3-byte character (1110xxxx)
num_bytes = 3
elif byte & 0xF8 == 0xF0:
# 4-byte character (11110xxx)
num_bytes = 4
else:
# Invalid starting byte
return False
# Check if there are enough bytes remaining for this character
if i + num_bytes > len(data):
return False
# Verify the continuation bytes, which should all start with '10'
for j in range(1, num_bytes):
if (data[i + j] & 0xC0) != 0x80:
return False
# Move the index to the next character
i += num_bytes
return True
Explanation of the Code:
- Byte Masking:
- Each byte is masked with
0xFF
to ensure only the last 8 bits are considered.
- Each byte is masked with
- Determining Number of Bytes:
- We use bitwise operations to check the first bits in the current byte:
byte & 0x80 == 0
: Checks if it’s a 1-byte character.byte & 0xE0 == 0xC0
: Checks if it’s a 2-byte character.byte & 0xF0 == 0xE0
: Checks if it’s a 3-byte character.byte & 0xF8 == 0xF0
: Checks if it’s a 4-byte character.
- We use bitwise operations to check the first bits in the current byte:
- Validation of Continuation Bytes:
- For each continuation byte, we check that it starts with
10
by using(data[i + j] & 0xC0) == 0x80
.
- For each continuation byte, we check that it starts with
- Iteration and Boundary Checking:
- We move the index
i
forward bynum_bytes
after processing a character. - If
i + num_bytes
exceeds the length ofdata
, it means there aren’t enough continuation bytes left, so we returnFalse
.
- We move the index
Solution in C++
C++
class Solution {
public:
bool validUtf8(vector<int>& data) {
// Number of bytes remaining to be checked in a multi-byte character
int remainingBytes = 0;
// Iterate through each integer in the data array
for (int byte : data) {
// Check the first byte or continuation byte based on remainingBytes
if (remainingBytes == 0) {
// Determine the number of bytes in the current UTF-8 character
if ((byte >> 7) == 0b0) {
// 1-byte character (0xxxxxxx)
continue;
} else if ((byte >> 5) == 0b110) {
// 2-byte character (110xxxxx)
remainingBytes = 1;
} else if ((byte >> 4) == 0b1110) {
// 3-byte character (1110xxxx)
remainingBytes = 2;
} else if ((byte >> 3) == 0b11110) {
// 4-byte character (11110xxx)
remainingBytes = 3;
} else {
// If it does not match any of the above patterns, it's invalid
return false;
}
} else {
// Check if the current byte is a valid continuation byte (10xxxxxx)
if ((byte >> 6) != 0b10) {
return false;
}
// Decrease the remaining bytes count since we found a valid continuation byte
remainingBytes--;
}
}
// If we have consumed all expected continuation bytes, it's valid
return remainingBytes == 0;
}
};
Solution in Javascript
JavaScript
var validUtf8 = function(data) {
// Helper function to check if a byte is a valid continuation byte
// A valid continuation byte has the format '10xxxxxx' (binary)
const isContinuationByte = (byte) => (byte & 0b11000000) === 0b10000000;
// Initialize the number of bytes remaining for the current UTF-8 character
let bytesRemaining = 0;
// Iterate through each integer in the data array
for (let byte of data) {
// If bytesRemaining is 0, we are expecting the start of a new UTF-8 character
if (bytesRemaining === 0) {
// Determine how many bytes the current UTF-8 character uses
if ((byte & 0b10000000) === 0) {
// 1-byte character: '0xxxxxxx'
// Do nothing, continue to the next byte
continue;
} else if ((byte & 0b11100000) === 0b11000000) {
// 2-byte character: '110xxxxx'
bytesRemaining = 1;
} else if ((byte & 0b11110000) === 0b11100000) {
// 3-byte character: '1110xxxx'
bytesRemaining = 2;
} else if ((byte & 0b11111000) === 0b11110000) {
// 4-byte character: '11110xxx'
bytesRemaining = 3;
} else {
// If the byte does not match any of the above formats, it's invalid
return false;
}
} else {
// If bytesRemaining > 0, we expect a continuation byte ('10xxxxxx')
if (!isContinuationByte(byte)) {
return false;
}
// Decrease the count of remaining continuation bytes
bytesRemaining--;
}
}
// If we've processed all the bytes, bytesRemaining should be 0
return bytesRemaining === 0;
};
Solution in Java
Java
class Solution {
public boolean validUtf8(int[] data) {
// This variable keeps track of how many continuation bytes are expected.
int remainingBytes = 0;
// Iterate through each integer in the input data array
for (int num : data) {
// We only need the last 8 bits, so use bitwise AND with 0xFF (11111111 in binary)
num = num & 0xFF;
if (remainingBytes == 0) {
// Determine how many bytes are in the current UTF-8 character
if ((num >> 7) == 0b0) {
// If the first bit is 0, it's a 1-byte character (0xxxxxxx)
continue;
} else if ((num >> 5) == 0b110) {
// If the first 3 bits are 110, it's a 2-byte character (110xxxxx)
remainingBytes = 1;
} else if ((num >> 4) == 0b1110) {
// If the first 4 bits are 1110, it's a 3-byte character (1110xxxx)
remainingBytes = 2;
} else if ((num >> 3) == 0b11110) {
// If the first 5 bits are 11110, it's a 4-byte character (11110xxx)
remainingBytes = 3;
} else {
// If it doesn't match any of the above, it's not a valid UTF-8 byte
return false;
}
} else {
// Check if the current byte is a valid continuation byte (10xxxxxx)
if ((num >> 6) != 0b10) {
return false;
}
// One less continuation byte to check
remainingBytes--;
}
}
// If there are no remaining continuation bytes expected, the UTF-8 encoding is valid
return remainingBytes == 0;
}
}