LeetCode Repeated DNA Sequences

LeetCode Repeated DNA Sequences

All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.

Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.

For example,

Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT",

Return:
["AAAAACCCCC", "CCCCCAAAAA"].

题意:给定一个DNA序列,问有哪些长度为10的子串出现过至少两次。

解法:遍历字符串,用Hash表记录所有长度为10的子串的出现频率,如果频率等于2,则把该子串加入到结果数组中。时间复杂度为O(n),空间复杂度为Hash表占用的空间。

完整代码如下:

class Solution {
public:
	vector<string> findRepeatedDnaSequences(string s) {
		vector<string> ans;
		unordered_map<string, int> hash;
		for (size_t i = 0; i + 10 <= s.size(); ++i) {
			string seq = s.substr(i, 10);
			++hash[seq];
			if (hash[seq] == 2)ans.push_back(seq);
		}
		return ans;
	}
};

本代码提交AC,用时73MS。

优化思路:

因为碱基只有4个:A,T,G,C,所以可以用两个bit表示所有碱基,子串的长度为10,所以可以用20bits表示一个子串,也就是用一个int就可以表示一个子串了。所以我们在将长度为10的子串插入Hash表时,可以先将其编码成一个int。这样Hash表的key就变成一个int了,Hash存储空间应该会更小,而且int也更利于Hash的查找。

完整代码如下:

class Solution {
private:
	unsigned int encode(const string &s) {
		unsigned int code = 0;
		for (size_t i = 0; i < s.size(); ++i) {
			code <<= 2;
			switch (s[i]) {
			case 'A':code += 0; break;
			case 'T':code += 1; break;
			case 'C':code += 2; break;
			case 'G':code += 3; break;
			}
		}
		return code;
	}
public:
	vector<string> findRepeatedDnaSequences(string s) {
		vector<string> ans;
		unordered_map<unsigned int, int> hash;
		for (size_t i = 0; i + 10 <= s.size(); ++i) {
			string seq = s.substr(i, 10);
			unsigned int code = encode(seq);
			++hash;
			if (hash == 2)ans.push_back(seq);
		}
		return ans;
	}
};

本代码提交AC,用时59MS,快了一点。

Leave a Reply

Your email address will not be published. Required fields are marked *