# Posts tagged as “encoding”

The DNA sequence is composed of a series of nucleotides abbreviated as 'A''C''G', and 'T'.

• For example, "ACGAATTCCG" is a DNA sequence.

When studying DNA, it is useful to identify repeated sequences within the DNA.

Given a string s that represents a DNA sequence, return all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule. You may return the answer in any order.

Example 1:

Input: s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"
Output: ["AAAAACCCCC","CCCCCAAAAA"]


Example 2:

Input: s = "AAAAAAAAAAAAA"
Output: ["AAAAAAAAAA"]


Constraints:

• 1 <= s.length <= 105
• s[i] is either 'A''C''G', or 'T'.

## Solution: Hashtable

Store each subsequence into the hashtable, add it into the answer array when it appears for the second time.

Time complexity: O(n*l)
Space complexity: O(n*l) -> O(n) / string_view

## Optimization

There are 4 type of letters, each can be encoded into 2 bits. We can represent the 10-letter-long string using 20 lowest bit of a int32. We can use int as key for the hashtable.

A -> 00
C -> 01
G -> 10
T -> 11

Time complexity: O(n)
Space complexity: O(n)

## C++

Run-length encoding is a string compression method that works by replacing consecutive identical characters (repeated 2 or more times) with the concatenation of the character and the number marking the count of the characters (length of the run). For example, to compress the string "aabccc" we replace "aa" by "a2" and replace "ccc" by "c3". Thus the compressed string becomes "a2bc3".

Notice that in this problem, we are not adding '1' after single characters.

Given a string s and an integer k. You need to delete at most k characters from s such that the run-length encoded version of s has minimum length.

Find the minimum length of the run-length encoded version of s after deleting at most k characters.

Example 1:

Input: s = "aaabcccd", k = 2
Output: 4
Explanation: Compressing s without deleting anything will give us "a3bc3d" of length 6. Deleting any of the characters 'a' or 'c' would at most decrease the length of the compressed string to 5, for instance delete 2 'a' then we will have s = "abcccd" which compressed is abc3d. Therefore, the optimal way is to delete 'b' and 'd', then the compressed version of s will be "a3c3" of length 4.

Example 2:

Input: s = "aabbaa", k = 2
Output: 2
Explanation: If we delete both 'b' characters, the resulting compressed string would be "a4" of length 2.


Example 3:

Input: s = "aaaaaaaaaaa", k = 0
Output: 3
Explanation: Since k is zero, we cannot delete anything. The compressed string is "a11" of length 3.


Constraints:

• 1 <= s.length <= 100
• 0 <= k <= s.length
• s contains only lowercase English letters.

## Solution 0: Brute Force DFS (TLE)

Time complexity: O(C(n,k))
Space complexity: O(k)

## Solution1: DP

State:
i: the start index of the substring
last: last char
len: run-length
k: # of chars that can be deleted.

base case:
1. k < 0: return inf # invalid
2. i >= s.length(): return 0 # done

Transition:
1. if s[i] == last: return carry + dp(i + 1, last, len + 1, k)

2. if s[i] != last:
return min(1 + dp(i + 1, s[i], 1, k, # start a new group with s[i]
dp(i + 1, last, len, k -1) # delete / skip s[i], keep it as is.

Time complexity: O(n^3*26)
Space complexity: O(n^3*26)

## State compression

dp[i][k] := min len of s[i:] encoded by deleting at most k charchters.

dp[i][k] = min(dp[i+1][k-1] # delete s[i]
encode_len(s[i~j] == s[i]) + dp(j+1, k – sum(s[i~j])) for j in range(i, n)) # keep

Time complexity: O(n^2*k)
Space complexity: O(n*k)

## Python3

A character in UTF8 can be from 1 to 4 bytes long, subjected to the following rules:

1. For 1-byte character, the first bit is a 0, followed by its unicode code.
2. For n-bytes character, the first n-bits are all one’s, the n+1 bit is 0, followed by n-1 bytes with most significant 2 bits being 10.

This is how the UTF-8 encoding would work:

Given an array of integers representing the data, return whether it is a valid utf-8 encoding.

Note:
The input is an array of integers. Only the least significant 8 bits of each integer is used to store the data. This means each integer represents only 1 byte of data.

Example 1:

data = [197, 130, 1], which represents the octet sequence: 11000101 10000010 00000001.

Return true.
It is a valid utf-8 encoding for a 2-bytes character followed by a 1-byte character.


Example 2:

data = [235, 140, 4], which represented the octet sequence: 11101011 10001100 00000100.

Return false.
The first 3 bits are all one's and the 4th bit is 0 means it is a 3-bytes character.
The next byte is a continuation byte which starts with 10 and that's correct.
But the second continuation byte does not start with 10, so it is invalid.

## Solution: Bit Operation

Check the first byte of a character and find out the number of bytes (from 0 to 3) left to check. The left bytes must start with 0b10.

Time complexity: O(n)
Space complexity: O(1)

# Problem

https://leetcode.com/problems/string-compression/description/

Given an array of characters, compress it in-place.

The length after compression must always be smaller than or equal to the original array.

Every element of the array should be a character (not int) of length 1.

After you are done modifying the input array in-place, return the new length of the array.

Could you solve it using only O(1) extra space?

Example 1:

Input:
["a","a","b","b","c","c","c"]

Output:
Return 6, and the first 6 characters of the input array should be: ["a","2","b","2","c","3"]

Explanation:
"aa" is replaced by "a2". "bb" is replaced by "b2". "ccc" is replaced by "c3".


Example 2:

Input:
["a"]

Output:
Return 1, and the first 1 characters of the input array should be: ["a"]

Explanation:
Nothing is replaced.


Example 3:

Input:
["a","b","b","b","b","b","b","b","b","b","b","b","b"]

Output:
Return 4, and the first 4 characters of the input array should be: ["a","b","1","2"].

Explanation:
Since the character "a" does not repeat, it is not compressed. "bbbbbbbbbbbb" is replaced by "b12".
Notice each digit has it's own entry in the array.


Note:

1. All characters have an ASCII value in [35, 126].
2. 1 <= len(chars) <= 1000.

# Solution

Time complexity: O(n)

Space complexity: O(1)

C++