How to find the longest common substring of multiple strings? - python

I am writing a python script where I have multiple strings.
For example:
x = "brownasdfoersjumps"
y = "foxsxzxasis12sa[[#brown"
z = "thissasbrownxc-34a#s;"
All three strings have one substring in common: brown. I want to search for it and build a dictionary of the form:
dict = {[commonly occurring substring]: [total number of occurrences in the strings provided]}
What would be the best way of doing that? Considering that I will have more than 200 strings each time, what would be an easy/efficient way of doing it?

This is a relatively optimised naïve algorithm. You first transform each sequence into a set of all its ngrams. Then you intersect all sets and find the longest ngram in the intersection.
from functools import partial, reduce
from itertools import chain
from typing import Iterator

def ngram(seq: str, n: int) -> Iterator[str]:
    return (seq[i: i + n] for i in range(0, len(seq) - n + 1))

def allngram(seq: str) -> set:
    lengths = range(len(seq))
    ngrams = map(partial(ngram, seq), lengths)
    return set(chain.from_iterable(ngrams))

sequences = ["brownasdfoersjumps",
             "foxsxzxasis12sa[[#brown",
             "thissasbrownxc-34a#s;"]

seqs_ngrams = map(allngram, sequences)
intersection = reduce(set.intersection, seqs_ngrams)
longest = max(intersection, key=len)  # -> brown
While this might get you through short sequences, this algorithm is extremely inefficient on long sequences. If your sequences are long, you can add a heuristic to limit the largest possible ngram length (i.e. the longest possible common substring). One obvious value for such a heuristic may be the shortest sequence's length.
def allngram(seq: str, minn=1, maxn=None) -> set:
    lengths = range(minn, maxn) if maxn else range(minn, len(seq))
    ngrams = map(partial(ngram, seq), lengths)
    return set(chain.from_iterable(ngrams))

sequences = ["brownasdfoersjumps",
             "foxsxzxasis12sa[[#brown",
             "thissasbrownxc-34a#s;"]

maxn = min(map(len, sequences))
seqs_ngrams = map(partial(allngram, maxn=maxn), sequences)
intersection = reduce(set.intersection, seqs_ngrams)
longest = max(intersection, key=len)  # -> brown
This may still take too long (or make your machine run out of RAM), so you might want to read about some optimal algorithms (see the link I left in my comment to your question).
Update
To count the number of strings in which each ngram occurs:
from collections import Counter

sequences = ["brownasdfoersjumps",
             "foxsxzxasis12sa[[#brown",
             "thissasbrownxc-34a#s;"]

seqs_ngrams = map(allngram, sequences)
counts = Counter(chain.from_iterable(seqs_ngrams))
Counter is a subclass of dict, so its instances have similar interfaces:
print(counts)
Counter({'#': 1,
'#b': 1,
'#br': 1,
'#bro': 1,
'#brow': 1,
'#brown': 1,
'-': 1,
'-3': 1,
'-34': 1,
'-34a': 1,
'-34a#': 1,
'-34a#s': 1,
'-34a#s;': 1,
...
You can filter the counts to leave substrings occurring in at least n strings: {string: count for string, count in counts.items() if count >= n}
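To tie this back to the dictionary asked for in the question, a minimal sketch (reusing longest and sequences from above, and using str.count, which counts non-overlapping occurrences) could be:
result = {longest: sum(seq.count(longest) for seq in sequences)}
print(result)  # {'brown': 3}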

I have used a straightforward method to get the common subsequences from multiple strings, although the code can be further optimised. Note that itertools.combinations yields subsequences (not necessarily contiguous); the later `key in word` check keeps only those that actually appear as substrings of at least one string.
import itertools

def getMaxOccurrence(stringsList, key):
    count = 0
    for word in stringsList:
        if key in word:
            count += 1
    return count

def getSubSequences(STR):
    combs = []
    result = []
    for l in range(1, len(STR) + 1):
        combs.append(list(itertools.combinations(STR, l)))
    for c in combs:
        for t in c:
            result.append(''.join(t))
    return result

def getCommonSequences(S):
    mainList = []
    for word in S:
        temp = getSubSequences(word)
        mainList.extend(temp)
    mainList = list(set(mainList))
    mainList = reversed(sorted(mainList, key=len))
    mainList = list(filter(None, mainList))
    finalData = dict()
    for alpha in mainList:
        val = getMaxOccurrence(S, alpha)
        if val > 0:
            finalData[alpha] = val
    finalData = {k: v for k, v in sorted(finalData.items(), key=lambda item: item[1], reverse=True)}
    return finalData
stringsList = ['abc', 'cab', 'dfab', 'xz']
seqs = getCommonSequences(stringsList)
print(seqs)

Related

removing duplicates from a bool list

I am trying to get the word in a list that comes after a word containing a '.'. For example, if this is the list
test_list = ["hello", "how", "are.", "you"]
it would select the word 'you'. I have managed to pull this off, but I am trying to ensure that I do not get duplicate words.
Here is what I have so far
list = []
i = 0
bool = False
words = sent.split()
for word in words:
    if bool:
        list.append(word)
        bool = False
    # the below if statement seems to make everything worse instead of fixing the duplicate problem
    if "." in word and word not in list:
        bool = True
return list
Your whole code can be reduced to this example using zip() and list comprehension:
a = ['hello', 'how', 'are.', 'you']

def get_new_list(a):
    return [v for k, v in zip(a, a[1:]) if k.endswith('.')]
Then, to remove the duplicates, if there are any, you can use set(), like this example:
final = set(get_new_list(a))
output:
{'you'}
This isn't based on the code you posted; however, it should do exactly what you're asking.
def get_word_after_dot(words):
    for index, word in enumerate(words):
        if word.endswith('.') and len(words) - index > 1:
            yield words[index + 1]
Iterating over this generator will yield words that are followed by a period.
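For instance, iterating the generator over the question's test_list should give the following (wrap it in set() if you also want to drop duplicates):
test_list = ["hello", "how", "are.", "you"]
print(list(get_word_after_dot(test_list)))  # ['you']
print(set(get_word_after_dot(test_list)))   # {'you'}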
Here is a different approach to the same problem.
import itertools
from collections import deque
t = deque(map(lambda x: '.' in x, test_list)) # create a deque of bools
>>deque([False, False, True, False])
t.rotate(1) # shift it by one since we want the word after the '.'
>>deque([False, False, False, True])
set(itertools.compress(test_list, t)) # and then grab everywhere it is True
>>{'you'}
The itertools recipes include a definition of pairwise, which is useful for iterating over a list two items at a time (this variant returns the two tee'd iterators and zips them at the call site):
import itertools as it

def pairwise(iterable):
    a, b = it.tee(iterable)
    next(b, None)
    return a, b
You can use this to create a list of words that follow a word ending in '.':
words = [n for m, n in zip(*pairwise(test_list)) if m[-1] == '.']
Remove duplicates:
seen = set()
results = [x for x in words if not (x in seen or seen.add(x))]
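Putting it together with the question's test_list, this should yield ['you'] (only 'are.' ends in a period):
test_list = ["hello", "how", "are.", "you"]
words = [n for m, n in zip(*pairwise(test_list)) if m[-1] == '.']
seen = set()
results = [x for x in words if not (x in seen or seen.add(x))]
print(results)  # ['you']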

How to optimize word_count in python

I am given n words (1≤n≤10^5). Some words may repeat. For each word, I have to output its number of occurrences, but the output order should correspond to the order of each word's first appearance.
I have a working program for the problem, but for large inputs I am getting a timeout. Here is my solution:
n = int(input())
l = []
ll = []
for x in range(n):
    l.append(raw_input())
    if l[x] not in ll:
        ll.append(l[x])
result = [l.count(ll[x]) for x in range(len(ll))]
for x in range(len(result)):
    print result[x],
Use an ordered counter by subclassing OrderedDict and Counter:
from collections import Counter, OrderedDict
class OrderedCounter(Counter, OrderedDict):
    pass

counts = OrderedCounter(['b', 'c', 'b', 'b', 'a', 'c'])
for k, c in counts.items():
    print(k, c)
Which prints:
b 3
c 2
a 1
See the documentation for the collections module for a more complete recipe which includes a __repr__ for OrderedCounter.
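For reference, the recipe in the documentation looks roughly like this (with Counter and OrderedDict imported as above):
class OrderedCounter(Counter, OrderedDict):
    'Counter that remembers the order elements are first encountered'

    def __repr__(self):
        return '%s(%r)' % (self.__class__.__name__, OrderedDict(self))

    def __reduce__(self):
        return self.__class__, (OrderedDict(self),)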
The easiest way to count items in python is to use a Counter from the collections module.
Assuming you have a list of items in the order that you expect, passing it to a Counter should suffice:
import collections

c = collections.Counter(['foo', 'bar', 'bar'])
print(c['bar'])  # Will print 2
If words is the list of words you retrieved from the user, you can iterate over it to print the values:
counter = collections.Counter(words)  # count every word first
seen = set()
for elem in words:
    if elem not in seen:
        print(counter[elem])
        seen.add(elem)
Take a look at collections.OrderedDict. It can handle this for you, and it removes the linear membership-test cost that using a list imposes:
import collections

n = int(input())
l = []
ll = collections.OrderedDict()
for x in range(n):
    v = raw_input()
    l.append(v)
    ll[v] = None  # If v already in OrderedDict, does nothing; otherwise, appends
ll = list(ll)  # Can convert back to list when you're done if you like
If you need the count, you can make a custom class based on OrderedDict that both handles counts and remains ordered.
class OrderedCounter(collections.OrderedDict):
    def __missing__(self, key):
        return 0
Then change ll to an OrderedCounter, and ll[v] = None to ll[v] += 1. At the end, ll will have the ordered words with their counts; l isn't even needed:
for word, count in ll.items():
    print(word, count)
The final code would simplify to just (omitting imports and class definition):
n = int(input())
word_counts = OrderedCounter()
for x in range(n):
    word_counts[raw_input()] += 1
for cnt in word_counts.values():
    print cnt,
Much simpler, right?

Python - counting duplicate strings

I'm trying to write a function that will count the number of word duplicates in a string and then return that word if the number of duplicates exceeds a certain number (n). Here's what I have so far:
from collections import defaultdict

def repeat_word_count(text, n):
    words = text.split()
    tally = defaultdict(int)
    answer = []
    for i in words:
        if i in tally:
            tally[i] += 1
        else:
            tally[i] = 1
I don't know where to go from here when it comes to comparing the dictionary values to n.
How it should work:
repeat_word_count("one one was a racehorse two two was one too", 3) should return ['one']
Try
for i in words:
    tally[i] = tally.get(i, 0) + 1
instead of
for i in words:
    if i in tally:
        tally[words] += 1  # you are using words the list as key, you should use i the item
    else:
        tally[words] = 1
If you simply want to count the words, using collections.Counter works fine.
>>> import collections
>>> a = collections.Counter("one one was a racehorse two two was one too".split())
>>> a
Counter({'one': 3, 'two': 2, 'was': 2, 'a': 1, 'racehorse': 1, 'too': 1})
>>> a['one']
3
Here is a way to do it:
from collections import defaultdict

tally = defaultdict(int)
text = "one two two three three three"
for i in text.split():
    tally[i] += 1
print tally  # defaultdict(<type 'int'>, {'three': 3, 'two': 2, 'one': 1})
Putting this in a function:
def repeat_word_count(text, n):
    output = []
    tally = defaultdict(int)
    for i in text.split():
        tally[i] += 1
    for k in tally:
        if tally[k] > n:
            output.append(k)
    return output
text = "one two two three three three four four four four"
repeat_word_count(text, 2)
Out[141]: ['four', 'three']
If what you want is a dictionary counting the words in a string, you can try this:
string = 'hello world hello again now hi there hi world'.split()
d = {}
for word in string:
    d[word] = d.get(word, 0) + 1
print d
Output:
{'again': 1, 'there': 1, 'hi': 2, 'world': 2, 'now': 1, 'hello': 2}
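To connect this back to the question's threshold n, a minimal follow-up sketch (reusing the dictionary d built above; >= matches the question's example, where n=3 should match a word occurring 3 times) could be:
n = 2
repeated = [word for word, count in d.items() if count >= n]
print repeated  # e.g. ['hi', 'world', 'hello'] (dict order may vary)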
Why don't you use the Counter class for that case:
from collections import Counter
cnt = Counter(text.split())
Elements are stored as dictionary keys and their counts are stored as dictionary values. Then it's easy to keep the words that exceed your n number with iterkeys() in a for loop like:
list = []
for k in cnt.iterkeys():
    if cnt[k] > n:
        list.append(k)
In list you'll have your list of words.
**Edited: sorry, that's if you need many words; BrianO has the right one for your case.
As luoluo says, use collections.Counter.
To get the item with the highest tally, use the Counter.most_common method with argument 1, which returns a one-element list containing a (word, tally) pair with the maximum tally. If the "sentence" is nonempty, then that list is too. So, the following function returns a word that occurs at least n times if there is one, and returns None otherwise:
from collections import Counter

def repeat_word_count(text, n):
    if not text: return None  # guard against '' and None!
    counter = Counter(text.split())
    max_pair = counter.most_common(1)[0]
    return max_pair[0] if max_pair[1] >= n else None  # >= so "at least n times" includes exactly n
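With the example from the question, this should behave as expected (assuming the >= comparison above):
print(repeat_word_count("one one was a racehorse two two was one too", 3))  # one
print(repeat_word_count("one one was a racehorse two two was one too", 4))  # None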

Filter List by Longest Element Containing a String

I want to group the items of a list that share the same last 4 digits and print the longest item of each group.
For example:
lst = ['abcd1234','abcdabcd1234','gqweri7890','poiupoiupoiupoiu7890']
# want to return abcdabcd1234 and poiupoiupoiupoiu7890
In this case, we print the longer of the elements containing 1234, and the longer of the elements containing 7890. Finding the longest element containing a certain element is not hard, but doing it for all items in the list (different last four digits) efficiently seems difficult.
My attempt was to first identify all the different last 4 digits using list comprehension and slice:
ids = []
for x in lst:
    ids.append(x[-4:])
ids = list(set(ids))
Next, I would search through the list by index, with a "max_length" variable and "current_id" to find the largest elements of each id. This is clearly very inefficient and was wondering what the best way to do this would be.
Use a dictionary:
>>> lst = ['abcd1234','abcdabcd1234','gqweri7890','poiupoiupoiupoiu7890']
>>> d = {}  # to keep the longest items for digits.
>>> for item in lst:
...     key = item[-4:]  # last 4 characters
...     d[key] = max(d.get(key, ''), item, key=len)
...
>>> d.values()  # list(d.values()) in Python 3.x
['abcdabcd1234', 'poiupoiupoiupoiu7890']
from collections import defaultdict

d = defaultdict(str)
lst = ['abcd1234','abcdabcd1234','gqweri7890','poiupoiupoiupoiu7890']
for x in lst:
    if len(x) > len(d[x[-4:]]):
        d[x[-4:]] = x
To display the results:
for key, value in d.items():
    print key, '=', value
which produces:
1234 = abcdabcd1234
7890 = poiupoiupoiupoiu7890
itertools is great. Use groupby with a lambda to group the list into the same endings, and then from there it is easy:
>>> from itertools import groupby
>>> lst = ['abcd1234','abcdabcd1234','gqweri7890','poiupoiupoiupoiu7890']
>>> [max(y, key=len) for x, y in groupby(lst, lambda l: l[-4:])]
['abcdabcd1234', 'poiupoiupoiupoiu7890']
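One caveat: groupby only groups consecutive items, which happens to work here because items with the same ending are adjacent in lst. If they might not be, a safer sketch is to sort by the key first:
>>> key = lambda s: s[-4:]
>>> [max(g, key=len) for _, g in groupby(sorted(lst, key=key), key)]
['abcdabcd1234', 'poiupoiupoiupoiu7890']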
Slightly more generic
import string
import collections

lst = ['abcd1234','abcdabcd1234','gqweri7890','poiupoiupoiupoiu7890']
z = [(x.translate(None, x.translate(None, string.digits)), x) for x in lst]
x = collections.defaultdict(list)
for a, b in z:
    x[a].append(b)
for k in x:
    print k, max(x[k], key=len)
1234 abcdabcd1234
7890 poiupoiupoiupoiu7890
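Note that str.translate(None, ...) and the bare print are Python 2 only; a rough Python 3 equivalent of the same grouping idea (variable names here are just illustrative) might be:
import string
import collections

lst = ['abcd1234','abcdabcd1234','gqweri7890','poiupoiupoiupoiu7890']
groups = collections.defaultdict(list)
for item in lst:
    digits = ''.join(ch for ch in item if ch in string.digits)
    groups[digits].append(item)
for k, v in groups.items():
    print(k, max(v, key=len))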

Python, get index from list of lists

I have a list of lists of strings, like this:
l = [['apple','banana','kiwi'],['chair','table','spoon']]
Given a string, I want its index in l. Experimenting with numpy, this is what I ended up with:
import numpy as np
l = [['apple','banana','kiwi'],['chair','table','spoon']]
def ind(s):
    i = [i for i in range(len(l)) if np.argwhere(np.array(l[i]) == s)][0]
    j = np.argwhere(np.array(l[i]) == s)[0][0]
    return i, j
s = ['apple','banana','kiwi','chair','table','spoon']
for val in s:
    try:
        print val, ind(val)
    except IndexError:
        print 'oops'
This fails for apple and chair with an IndexError. Also, this just looks bad to me. Is there a better approach to doing this?
Returns a list of tuples containing (outer list index, inner list index), designed such that the item you're looking for can be in multiple inner lists:
l = [['apple','banana','kiwi'],['chair','table','spoon']]

def findItem(theList, item):
    return [(ind, theList[ind].index(item)) for ind in xrange(len(theList)) if item in theList[ind]]

findItem(l, 'apple')  # [(0, 0)]
findItem(l, 'spoon')  # [(1, 2)]
If you want to use numpy, you don't need to roll your own:
import numpy as np

l = np.array([['apple','banana','kiwi'],['chair','table','spoon']])
s = ['apple','banana','kiwi','chair','table','spoon']
for a in s:
    arg = np.argwhere(l == a)
    print a, arg, tuple(arg[0]) if len(arg) else None
l = [['apple','banana','kiwi'],['chair','table','spoon']]

def search(lst, item):
    for i in range(len(lst)):
        part = lst[i]
        for j in range(len(part)):
            if part[j] == item:
                return (i, j)
    return None
I'd create a dictionary to map the items to their indices:
>>> import numpy as np
>>> l = [['apple','banana','kiwi'],['chair','table','spoon']]
>>> a = np.array(l,dtype=object)
>>> a
array([[apple, banana, kiwi],
[chair, table, spoon]], dtype=object)
>>> d = {s:idx for (idx),s in np.ndenumerate(a)}
>>> d['apple']
(0, 0)
>>> d['chair']
(1, 0)
numpy + ndenumerate is nice for creating the index, but it's definitely not necessary. Of course, this is going to be most efficient if you can create the index once and then reuse it for subsequent searches.
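For instance, the same one-time index can be built without numpy using a plain dict comprehension (a small sketch of the idea):
>>> l = [['apple','banana','kiwi'],['chair','table','spoon']]
>>> d = {s: (i, j) for i, row in enumerate(l) for j, s in enumerate(row)}
>>> d['spoon']
(1, 2)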
One way is to make use of enumerate:
l = [['apple','banana','kiwi'],['chair','table','spoon']]
s = ['apple','banana','kiwi','chair','table','spoon']
for a in s:
    for i, ll in enumerate(l):
        for j, b in enumerate(ll):
            if a == b:
                print a, i, j
In your line that computes i, you already have the answer if you apply argwhere to the entire list, rather than each sublist. There is no need to search again for j.
def ind(s):
    match = np.argwhere(np.array(l) == s)  # compare against the whole 2-D array at once
    if len(match):
        i, j = match[0]
        return i, j
    else:
        return -1, -1
This will return the indices of the first occurrence of the string you're searching for.
Also, you might consider how this method is impacted as the complexity of the problem increases. This method will iterate over every element of your list, so the runtime cost grows as the list becomes bigger. So, if the number of test strings you're trying to find in the list also increases, you might want to think about using a dictionary to create a lookup table once, so that subsequent searches for test strings are cheaper.
def make_lookup(search_list):
    lookup_table = {}
    for i, sublist in enumerate(search_list):
        for j, word in enumerate(sublist):
            lookup_table[word] = (i, j)
    return lookup_table

lookup_table = make_lookup(l)

def ind(s):
    if s in lookup_table:
        return lookup_table[s]
    else:
        return -1, -1
To get the index in a list of lists in Python:
theList = [[1,2,3], [4,5,6], [7,8,9]]
for i in range(len(theList)):
    if 5 in theList[i]:
        print("[{0}][{1}]".format(i, theList[i].index(5)))  # [1][1]
This solution will find all occurrences of the string you're searching for:
l = [['apple','banana','kiwi','apple'],['chair','table','spoon']]

def findItem(theList, item):
    return [(i, j) for i, line in enumerate(theList)
            for j, char in enumerate(line) if char == item]

findItem(l, 'apple')  # [(0, 0), (0, 3)]
findItem(l, 'spoon')  # [(1, 2)]
