LeetCode | Implement strStr() | Implementing a substring search function

Implement strStr(): https://leetcode.com/problems/implement-strstr/

Returns the index of the first occurrence of needle in haystack, or -1 if needle is not part of haystack.
For example, with haystack = "bcbcda" and needle = "bcd", the function returns 2.

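Analysis: this is a substring search function. The C library function strstr() locates the first occurrence of a substring within a string; its prototype is:
char *strstr(const char *str, const char *substr);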
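Since strstr() returns a pointer while the LeetCode problem asks for an index, a small wrapper can bridge the two. The sketch below is only an illustration: the helper names `indexOfWithStrstr` and `checkStrstrExample` are hypothetical and do not come from the original solution, and the wrapper assumes the strings contain no embedded NUL characters.

```cpp
#include <cassert>
#include <cstring>  // std::strstr
#include <string>

// Hypothetical helper: turns the pointer returned by std::strstr into
// the index the problem asks for, or -1 when needle is not found.
int indexOfWithStrstr(const std::string& haystack, const std::string& needle) {
    const char* start = haystack.c_str();
    const char* hit = std::strstr(start, needle.c_str());
    return hit ? static_cast<int>(hit - start) : -1;
}

// Quick check against the example from the problem statement.
void checkStrstrExample() {
    assert(indexOfWithStrstr("bcbcda", "bcd") == 2);
    assert(indexOfWithStrstr("bcbcda", "xyz") == -1);
}
```

An empty needle gives index 0, because std::strstr returns the start of the haystack in that case.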

Approach 1: easy to implement, but of little use (the time complexity does not meet the requirement)

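Use two indices: i starts at the beginning of haystack and j at the beginning of needle. First advance i until haystack[i] == needle[j]. Then advance j, stopping as soon as haystack[i+j] != needle[j]. If j advances m steps (m being the length of needle), a full match exists and we return i; on a mismatch, move one position forward in haystack and restart the comparison from needle[0].
In other words, slide needle along haystack and compare character by character. Each alignment needs at most m comparisons, and there are at most n alignments (n being the length of haystack),
so the time complexity is O(m*n), which does not meet LeetCode's time requirement.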
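Note: get the approach straight before writing any code.
1. Decide which algorithm solves the problem.
2. Work out its time and space complexity, consider whether it can be optimized, or ask the interviewer whether there are complexity requirements.
3. Identify the special cases that need handling.
Always, always, always get the reasoning clear first, and only then write the code.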

    #include <string>
    using std::string;

    // Time complexity O(m*n); does not meet LeetCode's time requirement.
    int strStr2(string haystack, string needle) {
        int m = needle.size();
        int n = haystack.size();
        if (m == 0) return 0;
        if (m > n) return -1;
        for (int i = 0; i < n; i++) {
            int j = 0;
            if (haystack[i] == needle[j]) {
                // Compare needle against haystack starting at position i.
                for (; j < m && i + j < n; j++) {
                    if (needle[j] != haystack[i + j])
                        break;
                }
                if (j == m)
                    return i;
            }
        }
        return -1;
    }
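To make the "special cases" step from the notes concrete, here is a minimal sketch of a slightly tightened brute-force variant together with a check against std::string::find. The names `strStrNaive` and `checkNaiveEdgeCases` are hypothetical, and the variant is an illustration rather than the original solution: it only tries start positions where a full match can still fit, which removes the need for the `i + j < n` bound check.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical tightened variant of the brute-force search.
int strStrNaive(const std::string& haystack, const std::string& needle) {
    const int n = static_cast<int>(haystack.size());
    const int m = static_cast<int>(needle.size());
    if (m == 0) return 0;
    // Only start positions i with i + m <= n can hold a full match,
    // so the m > n case falls out of the loop condition.
    for (int i = 0; i + m <= n; ++i) {
        int j = 0;
        while (j < m && haystack[i + j] == needle[j]) ++j;
        if (j == m) return i;
    }
    return -1;
}

// Compares the result with std::string::find on a few special cases:
// empty needle, empty haystack, needle longer than haystack,
// repeated prefix, match at the very end, and no match.
void checkNaiveEdgeCases() {
    const char* cases[][2] = {
        {"bcbcda", "bcd"}, {"abc", ""}, {"", ""}, {"ab", "abc"},
        {"aaab", "ab"},    {"abc", "c"}, {"abc", "d"}};
    for (const auto& c : cases) {
        const std::string h = c[0];
        const std::string nd = c[1];
        const std::size_t pos = h.find(nd);
        const int expected =
            (pos == std::string::npos) ? -1 : static_cast<int>(pos);
        assert(strStrNaive(h, nd) == expected);
    }
}
```

The worst case is still O(m*n); the change only trims redundant work at the tail of haystack.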

Approach 2: the Rabin–Karp algorithm (hash-based search)

The Rabin–Karp algorithm is a string-search algorithm that uses hashing to find a fixed-length string within a large body of text (pattern search).

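Approach 1 shows that confirming needle occurs in haystack requires comparing every character of needle. Is there a way to reuse the result of the previous comparison so that each new position costs only O(1) extra time?
The basic idea: represent a string by a hash code. To keep the hash codes unique, take a prime larger than the alphabet size as the base and weight each character by a power of that prime.
For example, for the lowercase alphabet choose the prime 29 as the base; the hash code of the string "abcd" is then

$hash = 1*29^0+2*29^1+3*29^2+4*29^3$ ,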
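and the next step, the hash code of the string "bcde", is

$hash = hash/29 + 5*29^3$ (integer division, which discards the outgoing term $1*29^0$).

This update is an $O(1)$ operation, so checking all substrings takes O(m+(n-m)) = $O(n)$ time, which makes this a linear algorithm (rolling hash).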
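The forward-order rolling update can be checked with a few lines of code. This is a minimal sketch under the same assumptions as the example (lowercase letters only, 'a' mapped to 1, base 29); the names `forwardHash`, `rollForward`, and `checkForwardRoll` are hypothetical.

```cpp
#include <cassert>
#include <string>

const long long kBase = 29;  // prime base from the example

// Forward-order hash: s[0]*29^0 + s[1]*29^1 + ..., with 'a' -> 1.
long long forwardHash(const std::string& s) {
    long long hash = 0;
    long long power = 1;
    for (char c : s) {
        hash += (c - 'a' + 1) * power;
        power *= kBase;
    }
    return hash;
}

// Slides the window one character to the right in O(1).
// Integer division drops the outgoing 29^0 term (its value is below 29,
// and every other term is a multiple of 29); the incoming character is
// added at the top power, 29^(m-1).
long long rollForward(long long hash, char incoming, long long topPower) {
    return hash / kBase + (incoming - 'a' + 1) * topPower;
}

void checkForwardRoll() {
    const long long topPower = 29LL * 29 * 29;  // 29^3 for windows of length 4
    assert(forwardHash("abcd") == 100138);
    assert(rollForward(forwardHash("abcd"), 'e', topPower) ==
           forwardHash("bcde"));
}
```

Here 100138 is 1 + 58 + 2523 + 97556, the four terms of the formula for "abcd".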
Note: the example above computes the hash code in forward order, while the program below computes it in reverse order, that is,

$hash("abcd") = 4*29^0+3*29^1+2*29^2+1*29^3$ , much like a positional base conversion, and

$hash("bcde")=(hash("abcd") - 1*29^3) * 29 + 5$
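As a quick numeric check of the reverse-order update, using the same mapping (a = 1, ..., e = 5) and base 29:

$hash("abcd") = 4 + 3*29 + 2*841 + 1*24389 = 26162$

$hash("bcde") = (26162 - 1*24389) * 29 + 5 = 1773*29 + 5 = 51422$

Computing "bcde" directly gives the same value: $5*29^0 + 4*29^1 + 3*29^2 + 2*29^3 = 5 + 116 + 2523 + 48778 = 51422$.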

    #include <string>
    using std::string;

    // Maps 'a'..'z' to 1..26.
    int charToInt(char c) {
        return (int)(c - 'a' + 1);
    }

    // Time complexity O(m + (n - m)) = O(n)
    int strStr(string haystack, string needle) {
        int m = needle.size();
        int n = haystack.size();
        if (m == 0) return 0;
        if (m > n) return -1;
        const int base = 29;
        long long max_base = 1;
        long long needle_code = 0;
        long long haystack_code = 0;
        // Hash needle and the first m characters of haystack (reverse order).
        for (int j = m - 1; j >= 0; j--) {
            needle_code += charToInt(needle[j]) * max_base;
            haystack_code += charToInt(haystack[j]) * max_base;
            max_base *= base;
        }
        max_base /= base; // highest power used by the window, base^(m-1)
        if (haystack_code == needle_code)
            return 0;
        for (int i = m; i < n; i++) {
            // Drop haystack[i-m], shift by one power, append haystack[i].
            haystack_code = (haystack_code - charToInt(haystack[i - m]) * max_base) * base + charToInt(haystack[i]);
            if (haystack_code == needle_code)
                return i - m + 1;
        }
        return -1;
    }

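The drawback is that the powers of the prime can grow very large, so the computation needs the long long type, and for longer needles even an arbitrary-precision big integer. Alternatively, the values can be kept small by taking them modulo some number, at the cost of a small probability of false matches (hash collisions).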
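One common way to handle both issues is to reduce every hash modulo a prime and confirm each hash hit with a direct comparison, so a collision can never produce a wrong answer. The sketch below illustrates that idea and is not the original solution: the function names and the constants 257 and 1000000007 are assumptions chosen for the example, and characters are mapped through unsigned char so any byte value is accepted.

```cpp
#include <cassert>
#include <string>

// Sketch: Rabin-Karp with hashes taken modulo a prime, plus an explicit
// comparison on every hash hit to rule out false matches.
int strStrModular(const std::string& haystack, const std::string& needle) {
    const long long kBase = 257;          // assumed base, above any byte value + 1
    const long long kMod = 1000000007LL;  // assumed prime modulus
    const int n = static_cast<int>(haystack.size());
    const int m = static_cast<int>(needle.size());
    if (m == 0) return 0;
    if (m > n) return -1;

    // Character value in 1..256.
    auto val = [](char c) {
        return static_cast<long long>(static_cast<unsigned char>(c)) + 1;
    };

    long long topPower = 1;  // kBase^(m-1) mod kMod
    for (int k = 1; k < m; ++k) topPower = topPower * kBase % kMod;

    // Reverse-order hashes (first character gets the highest power).
    long long needleCode = 0, windowCode = 0;
    for (int k = 0; k < m; ++k) {
        needleCode = (needleCode * kBase + val(needle[k])) % kMod;
        windowCode = (windowCode * kBase + val(haystack[k])) % kMod;
    }

    for (int i = 0;; ++i) {
        // Equal hashes are only a hint; verify before reporting a match.
        if (windowCode == needleCode && haystack.compare(i, m, needle) == 0)
            return i;
        if (i + m >= n) break;
        // Drop haystack[i], shift by one power, append haystack[i + m].
        windowCode = (windowCode - val(haystack[i]) * topPower % kMod + kMod) % kMod;
        windowCode = (windowCode * kBase + val(haystack[i + m])) % kMod;
    }
    return -1;
}

void checkModularSearch() {
    assert(strStrModular("bcbcda", "bcd") == 2);
    // A 101-character needle: unreduced powers of the base would not fit
    // in long long here.
    const std::string haystack = std::string(5000, 'a') + "b";
    const std::string needle = std::string(100, 'a') + "b";
    assert(strStrModular(haystack, needle) == 4900);
}
```

With verification the result is always correct; the price is that the running time is linear only in expectation, since every hash hit costs an extra O(m) comparison.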

Algorithm reference: http://blog.csdn.net/linhuanmars/article/details/20276833
