Efficient algorithm for strings - regex matching - python

Problem statement:
Given two lists: A of strings and B of regexes (the regexes are given as strings too).
For every regex in list B, find all the matching strings in list A.
Length of list A <= 10^6 (N)
Length of list B <= 100 (M)
Length of each string/regex <= 30 (K)
Assume regex matching and string comparison take O(K) time, and the regexes can contain any operation Python's re module supports.
My algorithm:
for regex in B:
    for s in A:
        if regex.match(s):
            mapping[regex].add(s)
This takes O(N*M*K) time.
Is there any way to make it more time efficient even compromising space (using any data structure)?

This is about as fast as it can go, in terms of time complexity.
Every regex has to be matched with every string at least once. Otherwise, you won't be able to get the information of "match" or "no match".
In terms of absolute time, you can compile each regex once and use filter to avoid the slow Python-level loops (in Python 3, wrap the filter in list() to materialize the results):
mapping = {regex: list(filter(re.compile(regex).match, A)) for regex in B}
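A self-contained version of the same idea, with each pattern compiled once (the function and variable names here are illustrative, not from the original post):

```python
import re

def match_all(A, B):
    """Map each regex pattern in B to the list of strings in A it matches."""
    mapping = {}
    for pattern in B:
        compiled = re.compile(pattern)  # compile once, reuse for all of A
        mapping[pattern] = [s for s in A if compiled.match(s)]
    return mapping
```

Precompiling avoids re-parsing the pattern for every one of the N strings, though the overall complexity stays O(N*M*K).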

Related

Efficiently remove single letter substrings from a string

So I've been trying to attack this problem for a while but have no idea how to do it efficiently.
I'm given a string of N (N >= 3) characters, consisting solely of the characters 'A' and 'B'. I have to efficiently count all the possible substrings which contain only one A or only one B, preserving the given order.
For example ABABA:
For three letters, the substrings would be: ABA, BAB, ABA. For this all three count because all three of them contain only one B or only one A.
For four letters, the substrings would be: ABAB, BABA. None of these count because they both don't have only one A or B.
For five letters: ABABA. This doesn't count because it doesn't have only one A or B.
If the string was bigger, then all substring combinations would be checked.
I need to implement this in O(n^2) or even O(n log n) time, but the best I've managed is O(n^3): I loop from 3 to the string's length for the substring length, use a nested for loop to check each substring, then use indexOf and lastIndexOf for both A and B, checking for each substring whether the two results match and aren't -1 (meaning there is only one of that character).
Any ideas how to implement O(n^2) or O(nlogn) time? Thanks!
Efficiently remove single letter substrings from a string
This is completely impossible: removing a letter is already O(n). The right answer is to not remove anything anywhere. You don't need to; if you find yourself calling substring, you've taken a wrong turn.
Any ideas how to implement O(n^2) or O(nlogn) time? Thanks!
I have no clue. Also seems kinda silly. But, there's some good news: There's an O(n) algorithm available, why mess about with pointlessly inefficient algorithms?
charAt(i) is efficient. We can use that.
Here's your algorithm, in pseudocode because if I just write it for you, you wouldn't learn much:
First do the setup. It's a little bit complicated:
Maintain counters for the # of times A and B occur.
Maintain the position of the start of the current substring you're on. This starts at 0, obviously.
Start off the proceedings by looping from 0 to x (x = substring length), and update your A/B counters. So, if x is 3, and input is ABABA, you want to end with aCount = 2 and bCount = 1.
With that prepwork completed, let's run the algorithm:
Check for your current substring (that's the substring that starts at 0) if it 'works'. You do not need to run substring or do any string manipulation at all to know this. Just check your aCount and bCount variables. Is one of them precisely 1? Then this substring works. If not, it doesn't. Increment your answer counter by 1 if it works, don't do that if it doesn't.
Next, move to the next substring. To calculate this, first get the character at your current position (0). Then subtract 1 from aCount or bCount depending on what's there. Then fetch the char at 'the end' (.charAt(pos + x)) and add 1 to aCount or bCount depending on what's there. Your aCount and bCount vars now represent how many As respectively Bs are in the substring that starts at pos 1. And it only took 2 constant-time steps to update these vars.
... and loop. Keep looping until the end (pos + x) is at the end of the string.
This is O(n): Given, say, an input string of 1000 chars, and a substring check of 10, then the setup costs 10, and the central loop costs 990 loops. O(n) to the dot. .charAt is O(1), and you need two of them on every loop. Constant factors don't change big-O number.
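The sliding-window count described above can be sketched in Python for a single window length x (the answer's pseudocode is language-agnostic; the names here are my own):

```python
def count_valid(s, x):
    """Count length-x substrings of s (over 'A'/'B') with exactly one A or one B."""
    a = s[:x].count('A')   # setup: counts for the window starting at 0
    b = x - a              # the string contains only 'A' and 'B'
    total = 1 if (a == 1 or b == 1) else 0
    for pos in range(len(s) - x):
        # slide the window one position to the right in O(1)
        if s[pos] == 'A':
            a -= 1
        else:
            b -= 1
        if s[pos + x] == 'A':
            a += 1
        else:
            b += 1
        if a == 1 or b == 1:
            total += 1
    return total
```

Running this once per window length from 3 to n answers the original question in O(n^2) overall.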

Check if two sorted strings are equal in O(log n) time

I need to write a Python function which takes two sorted strings (the characters in each string are in increasing alphabetical order) containing only lowercase letters, and checks whether or not the strings are equal.
The function's time complexity needs to be O(log n), where n is the length of each string.
I can't figure out how to check it without comparing each character in the first string with the parallel character of the second string.
This is, in fact, possible in O(log n) time in the worst case, since the strings are formed from an alphabet of constant size.
You can do 26 binary searches on each string to find the left-most occurrence of each letter. If the strings are equal, then all 26 binary searches will give the same results; either that the letter exists in neither string, or that its left-most occurrence is the same in both strings.
Conversely, if all of the binary searches give the same result, then the strings must be equal, because (1) the alphabet is fixed, (2) the indices of the left-most occurrences determine the frequency of each letter in the string, and (3) the strings are sorted, so the letter frequencies uniquely determine the string.
I'm assuming here that the strings have the same length. If they might not, then check that first and return False if the lengths are different. Getting the length of a string takes O(1) time.
As @wim notes in the comments, this solution cannot be generalised to lists of numbers; it specifically only works with strings. When you have an algorithmic problem involving strings, the alphabet size is usually a constant, and this fact can often be exploited to achieve a better time complexity than would otherwise be possible.
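The 26-binary-search comparison can be sketched with bisect from the standard library (the function name is mine):

```python
from bisect import bisect_left
from string import ascii_lowercase

def sorted_strings_equal(s1, s2):
    """Check equality of two sorted lowercase strings in O(log n) time."""
    if len(s1) != len(s2):  # O(1) length check first
        return False
    # 26 binary searches per string; each one is O(log n)
    return all(bisect_left(s1, c) == bisect_left(s2, c)
               for c in ascii_lowercase)
```

bisect_left works directly on strings because they support indexing; it returns the left-most position where each letter occurs (or would be inserted), which is exactly the information the argument above relies on.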

Merging and sorting n strings in O(n)

I recently was given a question in a coding challenge where I had to merge n strings of alphanumeric characters and then sort the new merged string, allowing only alphabetical characters in the sorted result. Now, this would be fairly straightforward except for the added caveat that the algorithm had to be O(n) (it didn't specify whether this was time or space complexity, or both).
My initial approach was to concatenate the strings into a new one, adding only alphabetical characters, and then sort at the end. I wanted to come up with a more efficient solution, but I was given less time than I was initially told. There isn't any sorting algorithm (that I know of) which runs in O(n) time, so the only thing I could think of was to increase the space complexity and use a sorted hashtable (e.g. a C++ map) to store the counts of each character, then print the hashtable in sorted order. But as this could require printing n characters n times, I think it would still run in quadratic time. Also, I was using Python, which I don't think has a way to keep a dictionary sorted (maybe it does).
Is there any way this problem could have been solved in O(n) time and/or space complexity?
Your counting sort is the way to go: build a simple count table for the 26 letters, in order. Iterate through your strings, counting letters and ignoring non-letters. This is one pass of O(n). Now simply go through your table, printing each letter the number of times indicated. This is also O(n), since the sum of the counts cannot exceed n. You're not printing n letters n times each: you're printing a total of n letters.
Concatenate your strings (not really needed; you can also count chars in the individual strings)
Create an array with length equal to the total number of character codes
Read through your concatenated string and count occurrences in the array made at step 2
By reading through the char frequency array, build up an output array with the right number of repetitions of each char.
Since each step is O(n), the whole thing is O(n)
If I've understood the requirement correctly, you're simply sorting the characters in the string?
I.e. ADFSACVB becomes AABCDFSV?
If so then the trick is to not really "sort". You have a fixed (and small) number of values. So you can simply keep a count of each value and generate your result from that.
E.g. Given ABACBA
In the first pass, increment counters in an array indexed by character. This produces:
[A] == 3
[B] == 2
[C] == 1
In second pass output the number of each character indicated by the counters. AAABBC
In summary, you're told to sort, but thinking outside the box, you really want a counting algorithm.
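The counting approach from the answers above, sketched in Python (the function name and the decision to count only alphabetic characters as-is are my own assumptions; the challenge statement didn't pin them down):

```python
from collections import Counter

def merge_and_sort(strings):
    """Merge strings, keep only alphabetic characters, return them sorted."""
    counts = Counter()
    for s in strings:          # one O(n) pass over all input characters
        for ch in s:
            if ch.isalpha():   # drop digits and other non-letters
                counts[ch] += 1
    # emit each character the counted number of times, in sorted order;
    # sorting a bounded number of distinct characters is effectively O(1)
    return ''.join(c * counts[c] for c in sorted(counts))
```

This matches the ABACBA example: one counting pass, then a direct emission pass, with no comparison sort over the full string.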

What is more efficient? Using .replace() or passing string to list

Solving the following problem from CodeFights:
Given two strings, find the number of common characters between them.
For s1 = "aabcc" and s2 = "adcaa", the output should be
commonCharacterCount(s1, s2) = 3.
Strings have 3 common characters - 2 "a"s and 1 "c".
The way I approached it: whenever I took a letter into account, I wanted to cancel it out so as not to count it again. I know strings are immutable, even when using methods such as .replace() (the replace() method returns a copy of the string; the actual string is not changed).
In order to mutate said string what I tend to do at the start is simply pass it on to a list with list(mystring) and then mutate that.
Question is... which of the following is more efficient? Take into account that option B gets done over and over; in the worst case the strings are equal and every character matches. Meanwhile option A happens once.
Option A)
list(mystring)
Option B)
mystring = mystring.replace(letterThatMatches, "")
The idea of calling replace on the string for each element you find, is simply not a good idea: it takes O(n) to do that. If you do that for every character in the other string, it will result in an O(m×n) algorithm with m the number of characters of the first string, and n the number of characters from the second string.
You can simply use two Counters, then calculate the minimum of the two, and then calculate the sum of the counts. Like:
from collections import Counter
def common_chars(s1, s2):
    c1 = Counter(s1)  # count, for every character in s1, the amount
    c2 = Counter(s2)  # count, for every character in s2, the amount
    c3 = c1 & c2      # produce the minimum of each character
    return sum(c3.values())  # sum up the resulting counts
Or as a one-liner:
def common_chars(s1, s2):
    return sum((Counter(s1) & Counter(s2)).values())
If dictionary lookup can be done in O(1) (which usually holds in the average case), this is an O(m+n) algorithm: the counting happens in O(m) and O(n) respectively, and calculating the minimum runs in the number of different characters (which is at most O(max(m,n))). Finally, taking the sum is again an O(max(m,n)) operation.
For the sample input, this results in:
>>> common_chars("aabcc","adcaa")
3

Is there a better way to find all the contiguous substrings of a string that appear in a given dictionary

Is there a more efficient algorithm to find all the substrings that are part of a given language over an alphabet than the following:
import string.ascii_lowercase as alphabet
language = {'aa', 'bc', 'wxyz', 'uz'}
for i in xrange(len(alphabet)):
    for j in xrange(i, len(alphabet)):
        substring = alphabet[i:j+1]
        if substring in language:
            print substring
If I understand your question correctly, you have an alphabet, or string (in this case a string of 26 characters, a-z), and you wish to check whether any of the strings given to you are substrings of that "alphabet string".
If this is indeed the case, there is a better way.
Your current approach amounts to computing all possible substrings from the alphabet, which is O(N^2) in the general case of an alphabet of size N and 26^2 in your particular case and then checking if the substring belongs to your predefined set. A much better approach would be to simply loop over your given strings and check if they are substrings of your alphabet. This is an O(N) operation for each string in your predefined set. This brings the complexity down to O(NM).
This is better if M is noticeably smaller than N.
There might be even better ways, but this is a good start.
Use Aho-Corasick or Rabin-Karp algorithms intended for this purpose:
It is a kind of dictionary-matching algorithm that locates elements of a finite set of strings (the "dictionary") within an input text. It matches all strings simultaneously.
There are numerous Python implementations for these algorithms.
Complexity for Aho-Corasick searching is O(TextLength + AnswerLength), with preprocessing O(n*σ), where n is the overall length of all words in the dictionary and σ is the alphabet size.
For Rabin-Karp the average time is O(TextLength + AnswerLength) too, but the worst case is O(TextLength * AnswerLength).
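Even without those libraries, you can beat enumerating all substrings by scanning one window per distinct word length in the dictionary (a simple sketch, not Aho-Corasick itself; the function name is mine):

```python
def find_dictionary_substrings(text, language):
    """Find all dictionary words that occur as contiguous substrings of text.

    One sliding window per distinct word length, so the cost is
    O(len(text) * number_of_distinct_lengths) rather than O(len(text)^2).
    """
    lengths = {len(w) for w in language}
    found = set()
    for k in lengths:
        for i in range(len(text) - k + 1):
            window = text[i:i + k]
            if window in language:   # O(1) average set lookup
                found.add(window)
    return found
```

With short, few-length dictionaries like the example this is already near-linear; Aho-Corasick removes the per-length factor entirely.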
It is nicer if you use
from string import ascii_lowercase as alphabet
instead:
language = {'aa', 'bc', 'wxyz', 'uz'}
for item in language:
    if item in alphabet:
        print item
this works but a list comprehension is preferred
substrings = [item for item in language if item in alphabet]
