Posts tagged as “encoding”

花花酱 LeetCode 187. Repeated DNA Sequences

By zxi on November 28, 2021

The DNA sequence is composed of a series of nucleotides abbreviated as 'A', 'C', 'G', and 'T'.

For example, "ACGAATTCCG" is a DNA sequence.

When studying DNA, it is useful to identify repeated sequences within the DNA.

Given a string s that represents a DNA sequence, return all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule. You may return the answer in any order.

Example 1:

Input: s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"
Output: ["AAAAACCCCC","CCCCCAAAAA"]

Example 2:

Input: s = "AAAAAAAAAAAAA"
Output: ["AAAAAAAAAA"]

Constraints:

1 <= s.length <= 10⁵
s[i] is either 'A', 'C', 'G', or 'T'.

Solution: Hashtable

Store each 10-letter substring in a hashtable and add it to the answer array when it appears for the second time.

Time complexity: O(n*l), where l = 10 is the window length
Space complexity: O(n*l), reduced to O(n) by using string_view keys that do not copy the substring

C++

// Author: Huahua
class Solution {
public:
  vector<string> findRepeatedDnaSequences(string_view s) {
    constexpr int kLen = 10;
    const int n = s.length();
    unordered_map<string_view, int> m;
    vector<string> ans;
    for (int i = 0; i + kLen <= n; ++i)
      if (++m[s.substr(i, kLen)] == 2)
        ans.emplace_back(s.substr(i, kLen));
    return ans;
  }
};

Optimization

There are 4 types of letters, so each one can be encoded in 2 bits. A 10-letter string then fits in the lowest 20 bits of an int32, and that int can be used as the hashtable key.

A -> 00
C -> 01
G -> 10
T -> 11

Time complexity: O(n)
Space complexity: O(n)

C++

// Author: Huahua
class Solution {
public:
  vector<string> findRepeatedDnaSequences(string s) {
    constexpr int kLen = 10;
    constexpr int mask = (1 << (2 * kLen)) - 1;
    const int n = s.length();
    unordered_map<int, int> m;
    array<int, 128> km;
    km['A'] = 0;
    km['C'] = 1;
    km['G'] = 2;
    km['T'] = 3;
    vector<string> ans;
    for (int i = 0, key = 0; i < n; ++i) {
      key = ((key << 2) & mask) | km[s[i]];
      if (i < kLen - 1) continue;
      if (++m[key] == 2)
        ans.push_back(s.substr(i - kLen + 1, kLen));
    }
    return ans;
  }
};
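
The 2-bit packing can be checked in isolation. The following is a minimal Python sketch, not part of the original solution: the helper names `encode` and `rolling_keys` are made up for illustration, while the letter codes and the 20-bit mask follow the mapping table and the C++ code above. It confirms that the O(1) rolling update produces the same key as encoding each window from scratch.

```python
# Hypothetical helper names; the 2-bit codes match the A/C/G/T table above.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
K_LEN = 10
MASK = (1 << (2 * K_LEN)) - 1  # keeps only the lowest 20 bits


def encode(window: str) -> int:
    """Pack a window of nucleotides into an integer, 2 bits per letter."""
    key = 0
    for ch in window:
        key = (key << 2) | CODE[ch]
    return key


def rolling_keys(s: str) -> list:
    """Return the key of every 10-letter window, updating in O(1) per step."""
    keys = []
    key = 0
    for i, ch in enumerate(s):
        # Shift in the new letter and drop the letter that left the window.
        key = ((key << 2) & MASK) | CODE[ch]
        if i >= K_LEN - 1:
            keys.append(key)
    return keys


s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"
# The rolling update agrees with encoding each window from scratch.
assert rolling_keys(s) == [encode(s[i:i + K_LEN]) for i in range(len(s) - K_LEN + 1)]
print(format(encode("AAAAACCCCC"), "020b"))  # 00000000000101010101
```

Because every 10-letter window maps to a distinct 20-bit value, two windows share a key only if they are the same sequence, so counting keys is equivalent to counting substrings.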

花花酱 LeetCode 1531. String Compression II

By zxi on July 26, 2020

Run-length encoding is a string compression method that works by replacing consecutive identical characters (repeated 2 or more times) with the concatenation of the character and the number marking the count of the characters (length of the run). For example, to compress the string "aabccc" we replace "aa" by "a2" and replace "ccc" by "c3". Thus the compressed string becomes "a2bc3".

Notice that in this problem, we are not adding '1' after single characters.
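
As a quick illustration of this encoding rule, here is a small Python sketch. The function name `rle` is hypothetical and the code is only meant to make the rule concrete; it is not part of any solution below.

```python
from itertools import groupby


def rle(s: str) -> str:
    """Run-length encode s, omitting the count for runs of length 1."""
    parts = []
    for ch, run in groupby(s):
        count = len(list(run))
        parts.append(ch if count == 1 else f"{ch}{count}")
    return "".join(parts)


print(rle("aabccc"))            # a2bc3
print(len(rle("a" * 11)))       # 3, from "a11"
```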

Given a string s and an integer k. You need to delete at most k characters from s such that the run-length encoded version of s has minimum length.

Find the minimum length of the run-length encoded version of s after deleting at most k characters.

Example 1:

Input: s = "aaabcccd", k = 2
Output: 4
Explanation: Compressing s without deleting anything will give us "a3bc3d" of length 6. Deleting any of the characters 'a' or 'c' would at most decrease the length of the compressed string to 5, for instance delete 2 'a' then we will have s = "abcccd" which compressed is abc3d. Therefore, the optimal way is to delete 'b' and 'd', then the compressed version of s will be "a3c3" of length 4.

Example 2:

Input: s = "aabbaa", k = 2
Output: 2
Explanation: If we delete both 'b' characters, the resulting compressed string would be "a4" of length 2.

Example 3:

Input: s = "aaaaaaaaaaa", k = 0
Output: 3
Explanation: Since k is zero, we cannot delete anything. The compressed string is "a11" of length 3.

Constraints:

1 <= s.length <= 100
0 <= k <= s.length
s contains only lowercase English letters.

Solution 0: Brute Force DFS (TLE)

Time complexity: O(C(n,k))
Space complexity: O(k)

C++

class Solution {
public:
  int getLengthOfOptimalCompression(string s, int k) {
    const int n = s.length();
    auto encode = [&]() -> int {
      char p = '$';
      int count = 0;
      int len = 0;
      for (char c : s) {
        if (c == '.') continue;
        if (c != p) {
          p = c;
          count = 0;
        }
        ++count;
        if (count <= 2 || count == 10 || count == 100)
          ++len;
      }
      return len;
    };
    function<int(int, int)> dfs = [&](int start, int k) -> int {
      if (start == n || k == 0) return encode();
      int ans = n;
      for (int i = start; i < n; ++i) {
        char c = s[i];
        s[i] = '.'; // delete
        ans = min(ans, dfs(i + 1, k - 1));
        s[i] = c;
      }
      return ans;
    };
    return dfs(0, k);
  }
};

Solution 1: DP

State:
i: the start index of the substring
last: last char
len: run-length
k: # of chars that can be deleted.

Base case:
1. k < 0: return inf # invalid
2. i >= s.length(): return 0 # done

Transition:
1. if s[i] == last: return carry + dp(i + 1, last, len + 1, k) # carry = 1 if len is 1, 9 or 99 (the encoding gains a character), else 0

2. if s[i] != last:
   return min(1 + dp(i + 1, s[i], 1, k),    # start a new group with s[i]
              dp(i + 1, last, len, k - 1))  # delete / skip s[i], keep the state as is

Time complexity: O(n^3*26)
Space complexity: O(n^3*26)

C++

int cache[101][27][101][101];

class Solution {
public:
  int getLengthOfOptimalCompression(string s, int k) {
    memset(cache, -1, sizeof(cache));
    // Min length of the compressed string of s[i:], where
    // 1. last char is |last|
    // 2. current run-length is len
    // 3. we can delete k chars.
    function<int(int, int, int, int)> dp =
      [&](int i, int last, int len, int k) {
      if (k < 0) return INT_MAX / 2;
      if (i >= s.length()) return 0;
      int& ans = cache[i][last][len][k];
      if (ans != -1) return ans;
      if (s[i] - 'a' == last) {
        // same as the previous char, no need to delete.
        int carry = (len == 1 || len == 9 || len == 99);
        ans = carry + dp(i + 1, last, len + 1, k);
      } else {
        ans = min(1 + dp(i + 1, s[i] - 'a', 1, k), // keep s[i]
                  dp(i + 1, last, len, k - 1));    // delete s[i]
      }
      return ans;
    };
    return dp(0, 26, 0, k);
  }
};
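
The `carry` term is worth a sanity check: extending a run only lengthens the encoding when the run grows from 1 to 2, from 9 to 10, or from 99 to 100. The Python sketch below, with hypothetical helper names `run_cost` and `carry`, verifies this against the encoded size of a run for every length the constraints allow (s.length <= 100).

```python
def run_cost(length: int) -> int:
    """Encoded size of one run: 0 if empty, 1 for a single char, else 1 + digits of the count."""
    if length == 0:
        return 0
    if length == 1:
        return 1
    return 1 + len(str(length))


def carry(length: int) -> int:
    """Extra characters needed when a run of `length` (>= 1) grows by one."""
    return 1 if length in (1, 9, 99) else 0


# carry matches the difference of run costs for all run lengths up to 100.
assert all(run_cost(n + 1) - run_cost(n) == carry(n) for n in range(1, 100))
```

The brute-force `encode` uses the same idea from the other side: it adds 1 to the length whenever the count, after incrementing, is at most 2 or equals 10 or 100.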

State compression

dp[i][k] := min length of the encoding of s[i:] after deleting at most k characters.

dp[i][k] = min(dp[i+1][k-1],  # delete s[i]
               encode_len(count(s[i~j] == s[i])) + dp[j+1][k - count(s[i~j] != s[i])] for j in range(i, n))  # keep s[i], delete every char in s[i~j] that differs from s[i]

Time complexity: O(n^2*k)
Space complexity: O(n*k)

C++

// Author: Huahua
class Solution {
public:
  int getLengthOfOptimalCompression(string s, int k) {
    const int n = s.length();
    vector<vector<int>> cache(n, vector<int>(k + 1, -1));
    function<int(int, int)> dp = [&](int i, int k) -> int {
      if (k < 0) return n;
      if (i + k >= n) return 0;
      int& ans = cache[i][k];
      if (ans != -1) return ans;
      ans = dp(i + 1, k - 1); // delete
      int len = 0;
      int same = 0;
      int diff = 0;
      for (int j = i; j < n && diff <= k; ++j) {
        if (s[j] == s[i] && ++same) {
          if (same <= 2 || same == 10 || same == 100) ++len;
        } else {
          ++diff;
        }
        ans = min(ans, len + dp(j + 1, k - diff));
      }
      return ans;
    };
    return dp(0, k);
  }
};

Java

// Author: Huahua
import java.util.Arrays;

class Solution {
  private int[][] dp;
  private char[] s;
  private int n;

  public int getLengthOfOptimalCompression(
      String s, int k) {
    this.s = s.toCharArray();
    this.n = s.length();
    this.dp = new int[n][k + 1];
    for (int[] row : dp)
      Arrays.fill(row, -1);
    return dp(0, k);
  }

  private int dp(int i, int k) {
    if (k < 0) return this.n;
    if (i + k >= n) return 0; // done or delete all.
    int ans = dp[i][k];
    if (ans != -1) return ans;
    ans = dp(i + 1, k - 1); // delete s[i]
    int len = 0;
    int same = 0;
    int diff = 0;
    for (int j = i; j < n && diff <= k; ++j) {
      if (s[j] == s[i]) {
        ++same;
        if (same <= 2 || same == 10 || same == 100) ++len;
      } else {
        ++diff;
      }
      ans = Math.min(ans, len + dp(j + 1, k - diff));
    }
    dp[i][k] = ans;
    return ans;
  }
}

Python3

# Author: Huahua
import functools

class Solution:
  def getLengthOfOptimalCompression(self, s: str, k: int) -> int:
    n = len(s)

    @functools.lru_cache(maxsize=None)
    def dp(i, k):
      if k < 0: return n
      if i + k >= n: return 0
      ans = dp(i + 1, k - 1)  # delete s[i]
      l = 0
      same = 0
      for j in range(i, n):
        if s[j] == s[i]:
          same += 1
          if same <= 2 or same == 10 or same == 100:
            l += 1
        diff = j - i + 1 - same
        if diff > k: break  # more chars to delete than allowed
        ans = min(ans, l + dp(j + 1, k - diff))
      return ans

    return dp(0, k)

花花酱 LeetCode 393. UTF-8 Validation

By zxi on March 20, 2020

A character in UTF-8 can be from 1 to 4 bytes long, subject to the following rules:

For a 1-byte character, the first bit is a 0, followed by its Unicode code.
For an n-byte character, the first n bits are all ones, the (n+1)-th bit is 0, followed by n-1 bytes whose most significant 2 bits are 10.

This is how the UTF-8 encoding would work:

 Char. number range  |        UTF-8 octet sequence
    (hexadecimal)    |              (binary)
 --------------------+---------------------------------------------
 0000 0000-0000 007F | 0xxxxxxx
 0000 0080-0000 07FF | 110xxxxx 10xxxxxx
 0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
 0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
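
To see the table in action, the Python sketch below prints the actual UTF-8 octets of one character from each row. The helper name `show_utf8` is hypothetical; `str.encode("utf-8")` is the standard library encoder.

```python
def show_utf8(ch: str) -> str:
    """Return the UTF-8 bytes of one character as space-separated binary octets."""
    return " ".join(format(b, "08b") for b in ch.encode("utf-8"))


# One sample per row of the table: U+0041, U+00E9, U+20AC, U+1F600.
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    print(f"U+{ord(ch):04X}: {show_utf8(ch)}")
# U+0041: 01000001
# U+00E9: 11000011 10101001
# U+20AC: 11100010 10000010 10101100
# U+1F600: 11110000 10011111 10011000 10000000
```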

Given an array of integers representing the data, return whether it is a valid utf-8 encoding.

Note:
The input is an array of integers. Only the least significant 8 bits of each integer are used to store the data. This means each integer represents only 1 byte of data.

Example 1:

data = [197, 130, 1], which represents the octet sequence: 11000101 10000010 00000001.

Return true.
It is a valid utf-8 encoding for a 2-bytes character followed by a 1-byte character.

Example 2:

data = [235, 140, 4], which represents the octet sequence: 11101011 10001100 00000100.

Return false.
The first 3 bits are all ones and the 4th bit is 0, which means it is a 3-byte character.
The next byte is a continuation byte which starts with 10 and that's correct.
But the second continuation byte does not start with 10, so it is invalid.

Solution: Bit Operation

Check the first byte of a character to find out how many bytes (from 0 to 3) are left to check. Each of the remaining bytes must start with 0b10.

Time complexity: O(n)
Space complexity: O(1)

C++

// Author: Huahua
class Solution {
public:
  bool validUtf8(vector<int>& data) {
    int left = 0;
    for (int d : data) {
      if (left == 0) {
        if ((d >> 3) == 0b11110) left = 3;
        else if ((d >> 4) == 0b1110) left = 2;
        else if ((d >> 5) == 0b110) left = 1;
        else if ((d >> 7) == 0b0) left = 0;
        else return false;
      } else {
        if ((d >> 6) != 0b10) return false;
        --left;
      }
    }
    return left == 0;
  }
};
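
For experimenting with the examples, here is a rough Python port of the same check. The function name `valid_utf8` is hypothetical, and the `& 0xFF` masking is an added assumption based on the note that only the lowest 8 bits of each integer carry data. It follows the simplified rules of the problem statement, not the full UTF-8 specification.

```python
def valid_utf8(data) -> bool:
    """Check the simplified UTF-8 rules from the problem statement."""
    left = 0  # continuation bytes still expected
    for d in data:
        d &= 0xFF  # assumption: ignore everything above the lowest 8 bits
        if left == 0:
            if d >> 3 == 0b11110:
                left = 3
            elif d >> 4 == 0b1110:
                left = 2
            elif d >> 5 == 0b110:
                left = 1
            elif d >> 7 != 0:
                return False  # stray continuation byte or 11111xxx
        elif d >> 6 != 0b10:
            return False
        else:
            left -= 1
    return left == 0


print(valid_utf8([197, 130, 1]))  # True
print(valid_utf8([235, 140, 4]))  # False
```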

花花酱 LeetCode 443. String Compression

By zxi on March 24, 2018

Problem

Problem summary: perform in-place run-length encoding on a string.

https://leetcode.com/problems/string-compression/description/

Given an array of characters, compress it in-place.

The length after compression must always be smaller than or equal to the original array.

Every element of the array should be a character (not int) of length 1.

After you are done modifying the input array in-place, return the new length of the array.

Follow up:
Could you solve it using only O(1) extra space?

Example 1:

Input:
["a","a","b","b","c","c","c"]

Output:
Return 6, and the first 6 characters of the input array should be: ["a","2","b","2","c","3"]

Explanation:
"aa" is replaced by "a2". "bb" is replaced by "b2". "ccc" is replaced by "c3".

Example 2:

Input:
["a"]

Output:
Return 1, and the first 1 characters of the input array should be: ["a"]

Explanation:
Nothing is replaced.

Example 3:

Input:
["a","b","b","b","b","b","b","b","b","b","b","b","b"]

Output:
Return 4, and the first 4 characters of the input array should be: ["a","b","1","2"].

Explanation:
Since the character "a" does not repeat, it is not compressed. "bbbbbbbbbbbb" is replaced by "b12".
Notice each digit has its own entry in the array.

Note:

All characters have an ASCII value in [35, 126].
1 <= len(chars) <= 1000.

Solution

Time complexity: O(n)

Space complexity: O(1)

C++

// Author: Huahua
// Running time: 9 ms
class Solution {
public:
  int compress(vector<char>& chars) {
    const int n = chars.size();
    int p = 0;
    for (int i = 1; i <= n; ++i) {
      int count = 1;
      while (i < n && chars[i] == chars[i - 1]) { ++i; ++count; }
      chars[p++] = chars[i - 1];
      if (count == 1) continue;
      for (char c : to_string(count))
        chars[p++] = c;
    }
    return p;
  }
};
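
The same two-pointer idea can be sketched in Python for quick experimentation. The names `compress`, `read` and `write` are illustrative only. The write position never overtakes the read position because a run of length c >= 2 is replaced by at most c characters.

```python
def compress(chars: list) -> int:
    """Run-length encode chars in place and return the new length."""
    n = len(chars)
    write = 0  # next position to write
    read = 0   # start of the current run
    while read < n:
        end = read
        while end < n and chars[end] == chars[read]:
            end += 1
        chars[write] = chars[read]
        write += 1
        count = end - read
        if count > 1:
            # Each digit of the count takes its own slot in the array.
            for digit in str(count):
                chars[write] = digit
                write += 1
        read = end
    return write


chars = list("aabbccc")
length = compress(chars)
print(length, chars[:length])  # 6 ['a', '2', 'b', '2', 'c', '3']
```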