author | William Carroll <wpcarro@gmail.com> | 2020-11-16T17·14+0000 |
---|---|---|
committer | William Carroll <wpcarro@gmail.com> | 2020-11-16T17·14+0000 |
commit | 6989c3a91a99d18fbe527fd453e2f1f9a5a1c1af (patch) | |
tree | a16408275f3b5f884f099f6c880f58835779ece4 | |
parent | a2fa88f5611f878c6937abaab4f6858a203c37b6 (diff) |
Implement the Rabin Karp string matching algorithm
This algorithm is pretty interesting because, when hash collisions are rare, it runs in linear time with respect to the length of the `corpus` string. It does this by using a sliding window hash. Because the window slides, updating the hash takes constant time per iteration: we only subtract the character leaving the window and add the character entering it, rather than re-hashing the whole "window". Only when the hashes match do we compare the "window" to the `pattern`. String comparisons are linear in the length of the `pattern` because they compare the strings one character at a time. But because we only compare strings when our hashes match (a check which runs in constant time), we avoid that cost for most windows.
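To make the constant-time update concrete, here is a minimal sketch of the sliding window hash on its own, assuming the same additive hash (sum of character codes) that the commit uses. The helper name `window_sums` is hypothetical and is not part of the commit.

```python
def window_sums(corpus, n):
    """Yield (start_index, hash) for every length-n window of `corpus`.

    The hash is the sum of the character codes in the window. This is a
    hypothetical helper for illustration, not code from the commit.
    """
    if n > len(corpus):
        return
    # Hash the first window from scratch: O(n), done once.
    h = sum(ord(c) for c in corpus[:n])
    yield 0, h
    for i in range(1, len(corpus) - n + 1):
        # O(1) update: drop the character that slid out of the window and
        # add the character that slid in.
        h = h - ord(corpus[i - 1]) + ord(corpus[i + n - 1])
        yield i, h


for start, h in window_sums("abcd", 2):
    print(start, h)
```

Each hash produced this way equals the hash recomputed from scratch for the same window, which is what lets the search skip re-hashing.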
-rw-r--r-- | scratch/facebook/rabin-karp.py | 27 |
1 files changed, 27 insertions, 0 deletions
diff --git a/scratch/facebook/rabin-karp.py b/scratch/facebook/rabin-karp.py
new file mode 100644
index 000000000000..53a47b278333
--- /dev/null
+++ b/scratch/facebook/rabin-karp.py
@@ -0,0 +1,27 @@
+def substring_exists(corpus, pattern):
+    """
+    Return True if `pattern` appears in `corpus`.
+
+    This function runs in O(m) time where m is equal to the length of
+    `corpus`. To improve the efficiency of this algorithm, use a hashing
+    function that reduces the number of collisions, which will consequently
+    reduce the number of string-to-string, linear comparisons.
+    """
+    m, n = len(corpus), len(pattern)
+    a = sum(ord(c) for c in corpus[0:n])
+    b = sum(ord(c) for c in pattern)
+
+    # (clumsily) prevent an off-by-one error...
+    if a == b and corpus[0:n] == pattern:
+        return True
+
+    for i in range(1, m - n + 1):
+        # Update the hash of corpus by subtracting the hash of the character
+        # that is sliding out of view and adding the hash of the character that
+        # is sliding into view.
+        a = a - ord(corpus[i - 1]) + ord(corpus[i + n - 1])
+        # Integer comparison in O(1) time followed by string comparison in O(n)
+        # time.
+        if a == b and corpus[i:i + n] == pattern:
+            return True
+    return False
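The docstring suggests using a hashing function with fewer collisions. One common choice is a polynomial rolling hash, sketched below. The name `substring_exists_poly` and the constants `BASE` and `MOD` are assumptions chosen for illustration; none of them appear in the commit. Unlike the sum of character codes, this hash depends on character order, so anagrams of the pattern such as "cb" versus "bc" no longer trigger a string comparison.

```python
BASE = 256             # assumed radix; any value larger than the alphabet works
MOD = 1_000_000_007    # assumed large prime modulus


def substring_exists_poly(corpus, pattern):
    """Return True if `pattern` appears in `corpus` (polynomial rolling hash)."""
    m, n = len(corpus), len(pattern)
    if n == 0:
        return True
    if n > m:
        return False

    # Weight of the leading character in a window: BASE^(n-1) mod MOD.
    high = pow(BASE, n - 1, MOD)

    # Hash the pattern and the first window of the corpus.
    a = b = 0
    for i in range(n):
        a = (a * BASE + ord(corpus[i])) % MOD
        b = (b * BASE + ord(pattern[i])) % MOD

    for i in range(m - n + 1):
        # Compare strings only when the hashes agree.
        if a == b and corpus[i:i + n] == pattern:
            return True
        if i < m - n:
            # Remove the leading character, shift, and append the next one.
            a = ((a - ord(corpus[i]) * high) * BASE + ord(corpus[i + n])) % MOD
    return False


print(substring_exists_poly("hello world", "o w"))
print(substring_exists_poly("abc", "cb"))
```

The structure mirrors the additive version: one O(n) hash at the start, then an O(1) update per window, with the string comparison acting as the final check against collisions.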