LeetCode Repeated DNA Sequences

187. Repeated DNA Sequences

All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: “ACGAATTCCG”. When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.

Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.

Example:

Input: s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"

Output: ["AAAAACCCCC", "CCCCCAAAAA"]

题意:给定一个DNA序列,问有哪些长度为10的子串出现过至少两次。 解法:遍历字符串,用Hash表记录所有长度为10的子串的出现频率,如果频率等于2,则把该子串加入到结果数组中。时间复杂度为O(n),空间复杂度为Hash表占用的空间。 完整代码如下:

class Solution {
public:
    vector<string> findRepeatedDnaSequences(string s)
    {
        vector<string> ans;
        unordered_map<string, int> hash;
        for (size_t i = 0; i + 10 <= s.size(); ++i) {
            string seq = s.substr(i, 10);
            ++hash[seq];
            if (hash[seq] == 2)
                ans.push_back(seq);
        }
        return ans;
    }
};

本代码提交AC,用时73MS。
优化思路:
因为碱基只有4个:A,T,G,C,所以可以用两个bit表示所有碱基,子串的长度为10,所以可以用20bits表示一个子串,也就是用一个int就可以表示一个子串了。所以我们在将长度为10的子串插入Hash表时,可以先将其编码成一个int。这样Hash表的key就变成一个int了,Hash存储空间应该会更小,而且int也更利于Hash的查找。
完整代码如下:

class Solution {
private:
    unsigned int encode(const string& s)
    {
        unsigned int code = 0;
        for (size_t i = 0; i < s.size(); ++i) {
            code <<= 2;
            switch (s[i]) {
            case ‘A’:
                code += 0;
                break;
            case ‘T’:
                code += 1;
                break;
            case ‘C’:
                code += 2;
                break;
            case ‘G’:
                code += 3;
                break;
            }
        }
        return code;
    }
public:
    vector<string> findRepeatedDnaSequences(string s)
    {
        vector<string> ans;
        unordered_map<unsigned int, int> hash;
        for (size_t i = 0; i + 10 <= s.size(); ++i) {
            string seq = s.substr(i, 10);
            unsigned int code = encode(seq);
            ++hash[code];
            if (hash[code] == 2)
                ans.push_back(seq);
        }
        return ans;
    }
};

本代码提交AC,用时59MS,快了一点。

Leave a Reply

Your email address will not be published. Required fields are marked *