393. UTF-8 Validation – Leetcode Solutions

November 13, 2024

Table of Contents

Description

Given an integer array data representing the data, return whether it is a valid UTF-8 encoding (i.e. it translates to a sequence of valid UTF-8 encoded characters).

A character in UTF8 can be from 1 to 4 bytes long, subjected to the following rules:

For a 1-byte character, the first bit is a 0, followed by its Unicode code.
For an n-bytes character, the first n bits are all one’s, the n + 1 bit is 0, followed by n - 1 bytes with the most significant 2 bits being 10.

This is how the UTF-8 encoding would work:

     Number of Bytes   |        UTF-8 Octet Sequence
                       |              (binary)
   --------------------+-----------------------------------------
            1          |   0xxxxxxx
            2          |   110xxxxx 10xxxxxx
            3          |   1110xxxx 10xxxxxx 10xxxxxx
            4          |   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

x denotes a bit in the binary form of a byte that may be either 0 or 1.

Examples:

Example 1:

Input: data = [197,130,1]
Output: true
Explanation: data represents the octet sequence: 11000101 10000010 00000001.
It is a valid utf-8 encoding for a 2-bytes character followed by a 1-byte character.

Example 2:

Input: data = [235,140,4]
Output: false
Explanation: data represented the octet sequence: 11101011 10001100 00000100.
The first 3 bits are all one's and the 4th bit is 0 means it is a 3-bytes character.
The next byte is a continuation byte which starts with 10 and that's correct.
But the second continuation byte does not start with 10, so it is invalid.

Solution in Python

UTF-8 Encoding Rules:

1-byte character:
- Format: 0xxxxxxx
- The first bit is 0.
2-byte character:
- Format: 110xxxxx 10xxxxxx
- The first byte starts with 110, and the following byte starts with 10.
3-byte character:
- Format: 1110xxxx 10xxxxxx 10xxxxxx
- The first byte starts with 1110, and the next two bytes start with 10.
4-byte character:
- Format: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
- The first byte starts with 11110, and the next three bytes start with 10.

Each byte in the data array can be checked by looking at the binary representation of the first few bits to determine its role (starting byte or continuation byte).

Solution Strategy:

Identify the Starting Byte:
- For each starting byte, determine how many bytes should follow by counting the number of leading 1s until we encounter a 0.
Check Continuation Bytes:
- Each continuation byte should start with 10.
Move Through the Bytes:
- Use a pointer to iterate through the bytes. When encountering a starting byte, check that the next bytes are valid continuation bytes.

Python

class Solution:
    def validUtf8(self, data: List[int]) -> bool:
        # Initialize the index that will move through the array
        i = 0
        
        while i < len(data):
            # Get the byte in binary format, only considering the last 8 bits
            byte = data[i] & 0xFF
            
            # Determine the number of leading 1s in the byte, this tells us the expected number of bytes
            if byte & 0x80 == 0:
                # 1-byte character (0xxxxxxx)
                num_bytes = 1
            elif byte & 0xE0 == 0xC0:
                # 2-byte character (110xxxxx)
                num_bytes = 2
            elif byte & 0xF0 == 0xE0:
                # 3-byte character (1110xxxx)
                num_bytes = 3
            elif byte & 0xF8 == 0xF0:
                # 4-byte character (11110xxx)
                num_bytes = 4
            else:
                # Invalid starting byte
                return False
            
            # Check if there are enough bytes remaining for this character
            if i + num_bytes > len(data):
                return False
            
            # Verify the continuation bytes, which should all start with '10'
            for j in range(1, num_bytes):
                if (data[i + j] & 0xC0) != 0x80:
                    return False
            
            # Move the index to the next character
            i += num_bytes
        
        return True

Explanation of the Code:

Byte Masking:
- Each byte is masked with 0xFF to ensure only the last 8 bits are considered.
Determining Number of Bytes:
- We use bitwise operations to check the first bits in the current byte:
  - byte & 0x80 == 0: Checks if it’s a 1-byte character.
  - byte & 0xE0 == 0xC0: Checks if it’s a 2-byte character.
  - byte & 0xF0 == 0xE0: Checks if it’s a 3-byte character.
  - byte & 0xF8 == 0xF0: Checks if it’s a 4-byte character.
Validation of Continuation Bytes:
- For each continuation byte, we check that it starts with 10 by using (data[i + j] & 0xC0) == 0x80.
Iteration and Boundary Checking:
- We move the index i forward by num_bytes after processing a character.
- If i + num_bytes exceeds the length of data, it means there aren’t enough continuation bytes left, so we return False.

Solution in C++

C++

class Solution {
public:
    bool validUtf8(vector<int>& data) {
        // Number of bytes remaining to be checked in a multi-byte character
        int remainingBytes = 0;
        
        // Iterate through each integer in the data array
        for (int byte : data) {
            // Check the first byte or continuation byte based on remainingBytes
            if (remainingBytes == 0) {
                // Determine the number of bytes in the current UTF-8 character
                if ((byte >> 7) == 0b0) {
                    // 1-byte character (0xxxxxxx)
                    continue;
                } else if ((byte >> 5) == 0b110) {
                    // 2-byte character (110xxxxx)
                    remainingBytes = 1;
                } else if ((byte >> 4) == 0b1110) {
                    // 3-byte character (1110xxxx)
                    remainingBytes = 2;
                } else if ((byte >> 3) == 0b11110) {
                    // 4-byte character (11110xxx)
                    remainingBytes = 3;
                } else {
                    // If it does not match any of the above patterns, it's invalid
                    return false;
                }
            } else {
                // Check if the current byte is a valid continuation byte (10xxxxxx)
                if ((byte >> 6) != 0b10) {
                    return false;
                }
                // Decrease the remaining bytes count since we found a valid continuation byte
                remainingBytes--;
            }
        }
        
        // If we have consumed all expected continuation bytes, it's valid
        return remainingBytes == 0;
    }
};

Solution in Javascript

JavaScript

var validUtf8 = function(data) {
    // Helper function to check if a byte is a valid continuation byte
    // A valid continuation byte has the format '10xxxxxx' (binary)
    const isContinuationByte = (byte) => (byte & 0b11000000) === 0b10000000;

    // Initialize the number of bytes remaining for the current UTF-8 character
    let bytesRemaining = 0;

    // Iterate through each integer in the data array
    for (let byte of data) {
        // If bytesRemaining is 0, we are expecting the start of a new UTF-8 character
        if (bytesRemaining === 0) {
            // Determine how many bytes the current UTF-8 character uses
            if ((byte & 0b10000000) === 0) {
                // 1-byte character: '0xxxxxxx'
                // Do nothing, continue to the next byte
                continue;
            } else if ((byte & 0b11100000) === 0b11000000) {
                // 2-byte character: '110xxxxx'
                bytesRemaining = 1;
            } else if ((byte & 0b11110000) === 0b11100000) {
                // 3-byte character: '1110xxxx'
                bytesRemaining = 2;
            } else if ((byte & 0b11111000) === 0b11110000) {
                // 4-byte character: '11110xxx'
                bytesRemaining = 3;
            } else {
                // If the byte does not match any of the above formats, it's invalid
                return false;
            }
        } else {
            // If bytesRemaining > 0, we expect a continuation byte ('10xxxxxx')
            if (!isContinuationByte(byte)) {
                return false;
            }
            // Decrease the count of remaining continuation bytes
            bytesRemaining--;
        }
    }

    // If we've processed all the bytes, bytesRemaining should be 0
    return bytesRemaining === 0;
};

Solution in Java

Java

class Solution {
    public boolean validUtf8(int[] data) {
        // This variable keeps track of how many continuation bytes are expected.
        int remainingBytes = 0;
        
        // Iterate through each integer in the input data array
        for (int num : data) {
            // We only need the last 8 bits, so use bitwise AND with 0xFF (11111111 in binary)
            num = num & 0xFF;
            
            if (remainingBytes == 0) {
                // Determine how many bytes are in the current UTF-8 character
                if ((num >> 7) == 0b0) {
                    // If the first bit is 0, it's a 1-byte character (0xxxxxxx)
                    continue;
                } else if ((num >> 5) == 0b110) {
                    // If the first 3 bits are 110, it's a 2-byte character (110xxxxx)
                    remainingBytes = 1;
                } else if ((num >> 4) == 0b1110) {
                    // If the first 4 bits are 1110, it's a 3-byte character (1110xxxx)
                    remainingBytes = 2;
                } else if ((num >> 3) == 0b11110) {
                    // If the first 5 bits are 11110, it's a 4-byte character (11110xxx)
                    remainingBytes = 3;
                } else {
                    // If it doesn't match any of the above, it's not a valid UTF-8 byte
                    return false;
                }
            } else {
                // Check if the current byte is a valid continuation byte (10xxxxxx)
                if ((num >> 6) != 0b10) {
                    return false;
                }
                // One less continuation byte to check
                remainingBytes--;
            }
        }
        
        // If there are no remaining continuation bytes expected, the UTF-8 encoding is valid
        return remainingBytes == 0;
    }
}

393. UTF-8 Validation – Leetcode Solutions

Description

Solution in Python

UTF-8 Encoding Rules:

Solution Strategy:

Explanation of the Code:

Solution in C++

Solution in Javascript

Solution in Java

Popular

51. N-Queens – Leetcode Solutions

99. Recover Binary Search Tree – Leetcode Solutions

34. Find First and Last Position of Element in Sorted Array – Leetcode Solutions

377. Combination Sum IV – Leetcode Solutions

Context Managers and the with Statement in Python

168. Excel Sheet Column Title – Leetcode Solutions

Guide to Strings in Python: Functions and Methods

393. UTF-8 Validation – Leetcode Solutions

Description

Solution in Python

UTF-8 Encoding Rules:

Solution Strategy:

Explanation of the Code:

Solution in C++

Solution in Javascript

Solution in Java

Related Post

Popular