about summary refs log tree commit diff
diff options
context:
space:
mode:
authorWilliam Carroll <wpcarro@gmail.com>2020-11-16T17·14+0000
committerWilliam Carroll <wpcarro@gmail.com>2020-11-16T17·14+0000
commit6989c3a91a99d18fbe527fd453e2f1f9a5a1c1af (patch)
treea16408275f3b5f884f099f6c880f58835779ece4
parenta2fa88f5611f878c6937abaab4f6858a203c37b6 (diff)
Implement the Rabin Karp string matching algorithm
This algorithm is pretty interesting because it runs in linear time with respect
to the length of the `corpus` string. It does this by using a sliding window
hash. This hash -- because it's a sliding window -- runs in constant time for
each iteration; we're only adding and subtracting one character each time and
not re-hashing the whole "window".

When our hashes match, only then do we compare the "window" to the
`pattern`. String comparisons are linear because they compare each character to
each character one at a time. But because we only compare strings when are
hashes match (a check which runs in constant time), this spares us the
performance hit.
-rw-r--r--scratch/facebook/rabin-karp.py27
1 files changed, 27 insertions, 0 deletions
diff --git a/scratch/facebook/rabin-karp.py b/scratch/facebook/rabin-karp.py
new file mode 100644
index 000000000000..53a47b278333
--- /dev/null
+++ b/scratch/facebook/rabin-karp.py
@@ -0,0 +1,27 @@
+def substring_exists(corpus, pattern):
+    """
+    Return True if `pattern` appears in `corpus`.
+
+    This function runs in O(m) time where n is equal to the length of
+    `corpus`. To improve the efficiency of this algorithm, use a hashing
+    function the reduces the number of collisions, which will consequently
+    reduce the number of string-to-string, linear comparisons.
+    """
+    m, n = len(corpus), len(pattern)
+    a = sum(ord(c) for c in corpus[0:n])
+    b = sum(ord(c) for c in pattern)
+
+    # (clumsily) prevent an off-by-one error...
+    if a == b and corpus[0:n] == pattern:
+        return True
+
+    for i in range(1, m - n):
+        # Update the hash of corpus by subtracting the hash of the character
+        # that is sliding out of view and adding the hash of the character that
+        # is sliding into view.
+        a = a - ord(corpus[i - 1]) + ord(corpus[i + n - 1])
+        # Integer comparison in O(0) time followed by string comparison in O(m)
+        # time.
+        if a == b and corpus[i:i + n] == pattern:
+            return True
+    return False