about summary refs log tree commit diff
path: root/scratch
diff options
context:
space:
mode:
authorWilliam Carroll <wpcarro@gmail.com>2020-11-16T17·14+0000
committerWilliam Carroll <wpcarro@gmail.com>2020-11-16T17·14+0000
commit6989c3a91a99d18fbe527fd453e2f1f9a5a1c1af (patch)
treea16408275f3b5f884f099f6c880f58835779ece4 /scratch
parenta2fa88f5611f878c6937abaab4f6858a203c37b6 (diff)
Implement the Rabin Karp string matching algorithm
This algorithm is pretty interesting because it runs in linear time with respect
to the length of the `corpus` string. It does this by using a sliding window
hash. This hash -- because it's a sliding window -- runs in constant time for
each iteration; we're only adding and subtracting one character each time and
not re-hashing the whole "window".

When our hashes match, only then do we compare the "window" to the
`pattern`. String comparisons are linear because they compare each character to
each character one at a time. But because we only compare strings when are
hashes match (a check which runs in constant time), this spares us the
performance hit.
Diffstat (limited to 'scratch')
-rw-r--r--scratch/facebook/rabin-karp.py27
1 files changed, 27 insertions, 0 deletions
diff --git a/scratch/facebook/rabin-karp.py b/scratch/facebook/rabin-karp.py
new file mode 100644
index 000000000000..53a47b278333
--- /dev/null
+++ b/scratch/facebook/rabin-karp.py
@@ -0,0 +1,27 @@
+def substring_exists(corpus, pattern):
+    """
+    Return True if `pattern` appears in `corpus`.
+
+    This function runs in O(m) time where n is equal to the length of
+    `corpus`. To improve the efficiency of this algorithm, use a hashing
+    function the reduces the number of collisions, which will consequently
+    reduce the number of string-to-string, linear comparisons.
+    """
+    m, n = len(corpus), len(pattern)
+    a = sum(ord(c) for c in corpus[0:n])
+    b = sum(ord(c) for c in pattern)
+
+    # (clumsily) prevent an off-by-one error...
+    if a == b and corpus[0:n] == pattern:
+        return True
+
+    for i in range(1, m - n):
+        # Update the hash of corpus by subtracting the hash of the character
+        # that is sliding out of view and adding the hash of the character that
+        # is sliding into view.
+        a = a - ord(corpus[i - 1]) + ord(corpus[i + n - 1])
+        # Integer comparison in O(0) time followed by string comparison in O(m)
+        # time.
+        if a == b and corpus[i:i + n] == pattern:
+            return True
+    return False