Python - how to find all intersections of two strings? - python

How can I find all intersections (also called the longest common substrings) of two strings, along with their positions in both strings?
For example, if S1="never" and S2="forever", the resulting intersection must be ["ever"] and its positions are [(1,3)]. If S1="address" and S2="oddness", the resulting intersections are ["dd","ess"] and their positions are [(1,1),(4,4)].
The shortest solution that does not use any library is preferable, but any correct solution is welcome.

Well, you're saying that you can't include any library. However, Python's standard difflib contains a function which does exactly what you expect. Considering that it is a Python interview question, familiarity with difflib might be what the interviewer expected.
In [31]: import difflib
In [32]: difflib.SequenceMatcher(None, "never", "forever").get_matching_blocks()
Out[32]: [Match(a=1, b=3, size=4), Match(a=5, b=7, size=0)]
In [33]: difflib.SequenceMatcher(None, "address", "oddness").get_matching_blocks()
Out[33]: [Match(a=1, b=1, size=2), Match(a=4, b=4, size=3), Match(a=7, b=7, size=0)]
You can always ignore the last Match tuple, since it is a dummy entry of size 0 (according to the documentation).
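As a minimal sketch of how the session above could be wrapped up, the helper below turns the matching blocks into (substring, position in first string, position in second string) tuples and drops the size-0 dummy. The name matching_blocks is made up for this illustration; autojunk=False is passed as an assumption that the junk heuristic is not wanted for plain strings.

```python
import difflib

def matching_blocks(s1, s2):
    """Return (substring, pos_in_s1, pos_in_s2) for each non-empty matching block."""
    matcher = difflib.SequenceMatcher(None, s1, s2, autojunk=False)
    # The final Match always has size 0, so filtering on size drops it.
    return [(s1[m.a:m.a + m.size], m.a, m.b)
            for m in matcher.get_matching_blocks() if m.size]

print(matching_blocks("never", "forever"))    # [('ever', 1, 3)]
print(matching_blocks("address", "oddness"))  # [('dd', 1, 1), ('ess', 4, 4)]
```

Note that the blocks reported by difflib appear in the same order in both strings, so common substrings that cross each other are not all reported.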

This can be done in O(nm) time with dynamic programming, where n and m are the lengths of the input strings (a generalized suffix tree brings this down to O(n+m)).
The pseudocode is:
function LCSubstr(S[1..m], T[1..n])
    L := array(1..m, 1..n)
    z := 0
    ret := {}
    for i := 1..m
        for j := 1..n
            if S[i] = T[j]
                if i = 1 or j = 1
                    L[i,j] := 1
                else
                    L[i,j] := L[i-1,j-1] + 1
                if L[i,j] > z
                    z := L[i,j]
                    ret := {}
                if L[i,j] = z
                    ret := ret ∪ {S[i-z+1..i]}
            else
                L[i,j] := 0
    return ret
See the Longest_common_substring_problem wikipedia article for more details.
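The pseudocode above could be translated to Python roughly as follows. This is a sketch, not a reference implementation: the function name and the (substring, start in s, start in t) return format are choices made for this example, and only two rows of the table are kept to save memory. Like the pseudocode, it returns only the longest common substrings, so for "address" and "oddness" it reports "ess" but not the shorter "dd".

```python
def longest_common_substrings(s, t):
    """Return [(substring, start_in_s, start_in_t), ...] for the longest common substrings."""
    prev = [0] * (len(t) + 1)   # row i-1 of the DP table, padded with a zero column
    best = 0                    # length of the longest match found so far
    hits = []
    for i in range(1, len(s) + 1):
        cur = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            if s[i - 1] == t[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best:
                    best = cur[j]
                    hits = []   # a longer match invalidates earlier results
                if cur[j] == best:
                    hits.append((s[i - best:i], i - best, j - best))
        prev = cur
    return hits

print(longest_common_substrings("never", "forever"))      # [('ever', 1, 3)]
print(longest_common_substrings("abcd1234", "1234abcd"))  # [('abcd', 0, 4), ('1234', 4, 0)]
```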

Here's what I could come up with:
import itertools
def longest_common_substring(s1, s2):
    set1 = set(s1[begin:end] for (begin, end) in
               itertools.combinations(range(len(s1)+1), 2))
    set2 = set(s2[begin:end] for (begin, end) in
               itertools.combinations(range(len(s2)+1), 2))
    common = set1.intersection(set2)
    maximal = [com for com in common
               if sum((s.find(com) for s in common)) == -1 * (len(common)-1)]
    return [(s, s1.index(s), s2.index(s)) for s in maximal]
Checking some values:
>>> longest_common_substring('address', 'oddness')
[('dd', 1, 1), ('ess', 4, 4)]
>>> longest_common_substring('never', 'forever')
[('ever', 1, 3)]
>>> longest_common_substring('call', 'wall')
[('all', 1, 1)]
>>> longest_common_substring('abcd1234', '1234abcd')
[('abcd', 0, 4), ('1234', 4, 0)]

Batteries included!
The difflib module might have some help for you - here is a quick and dirty side-by-side diff:
>>> import difflib
>>> list(difflib.ndiff("never","forever"))
['- n', '+ f', '+ o', '+ r', '  e', '  v', '  e', '  r']
>>> diffs = list(difflib.ndiff("never","forever"))
>>> for d in diffs:
...     print({' ': ' ', '-':'', '+':' '}[d[0]]+d[1:])
...
n
f
o
r
e
v
e
r

I'm assuming you only want substrings to match if they have the same absolute position within their respective strings. For example, "abcd", and "bcde" won't have any matches, even though both contain "bcd".
a = "address"
b = "oddness"
# matches[x] is True if a[x] == b[x]
# (list() keeps the results indexable and printable on Python 3)
matches = list(map(lambda x: x[0] == x[1], zip(list(a), list(b))))
positions = list(filter(lambda x: matches[x], range(len(a))))
substrings = list(filter(lambda x: x.find("_") == -1 and x != "","".join(map(lambda x: ["_", a[x]][matches[x]], range(len(a)))).split("_")))
# positions == [1, 2, 4, 5, 6]
# substrings == ['dd', 'ess']
If you only want substrings, you can squish it into one line:
list(filter(lambda x: x.find("_") == -1 and x != "","".join(map(lambda x: ["_", a[x]][list(map(lambda x: x[0] == x[1], zip(list(a), list(b))))[x]], range(len(a)))).split("_")))

def IntersectStrings(first, second):
    x = list(first)
    #print x
    y = list(second)
    lst1 = []
    lst2 = []
    for i in x:
        if i in y:
            lst1.append(i)
    lst2 = sorted(lst1) + []
    # The sorting step above is optional: keep it if the result should be sorted alphabetically, otherwise use lst1 directly
    return ''.join(lst2)
print(IntersectStrings('hello', 'mello'))

Related

How to test if a list is sorted in ascending order

This is the question of the exercise: write a function that checks if a list is sorted in ascending order.
def ascending(lst):
    for k in range(0,len(lst)):
        if lst[k] < lst[k+1]:
            print('Ok')
        else:
            print('NOk, the number ' + str(lst[k]) + ' is greater than his next ' + str(lst[k+1]))
    return 'Bye!'
lst = [1,3,2,4,5]
print(ascending(lst))
I expect the output: Ok, Ok, NOk the number 3 is greater than his next 2, Ok ... and I get it, but at the very end the program fails with "IndexError: list index out of range". I understand that the problem is in the if statement, because for k = 4, k+1 = 5 (out of range), but I don't know how to solve it.
Your problem is here:
for k in range(0,len(lst)):
    if lst[k] < lst[k+1]:
When k=4 (which is len(lst) - 1, the last valid index), k+1 is out of range. Change your loop statement to
for k in range(0,len(lst) - 1):
A different approach:
If your task is as simple as 'Test if a list is sorted in ascending order'; then how about a simple function like:
def issorted(lst):
    return lst == sorted(lst)
print('Is sorted: ', issorted([5, 4, 1, 2]))
print('Is sorted: ', issorted([1, 2, 3, 4]))
print('Is sorted: ', issorted(['a', 'b', 'd', 'c']))
print('Is sorted: ', issorted(['w', 'x', 'y', 'z']))
Which outputs:
Is sorted: False
Is sorted: True
Is sorted: False
Is sorted: True
The easiest:
def ascending(lst):
    return lst == sorted(lst)
But this is log-linear and does not short-circuit. Better:
def ascending(lst):
    return all(a <= b for a, b in zip(lst, lst[1:]))
or in Python >= 3.10:
from itertools import pairwise
def ascending(lst):
    return all(a <= b for a, b in pairwise(lst))

How to return the count of the same elements in two lists?

I have two very large lists(that's why I used ... ), a list of lists:
x = [['I like stackoverflow. Hi ok!'],['this is a great community'],['Ok, I didn\'t like this!.'],...,['how to match and return the frequency?']]
and a list of strings:
y = ['hi', 'nice', 'ok',..., 'frequency']
I would like to return in a new list the times (count) that any word in y occurred in all the lists of x. For example, for the above lists, this should be the correct output:
[(1,2),(2,0),(3,1),...,(n,count)]
That is, [(1,count),...,(n,count)], where n is the number of the list and count is the number of times that any word from y appeared in that list of x. Any idea of how to approach this?
First, you should preprocess x into a list of sets of lowercased words -- that will speed up the following lookups enormously. E.g:
import re
ppx = []
for subx in x:
    # each subx is a one-element list; findall returns the words as strings
    ppx.append(set(w.lower() for w in re.findall(r'\w+', subx[0])))
(yes, you could collapse this into a list comprehension, but I'm aiming for some legibility).
Next, you loop over y, checking how many of the sets in ppx contain each item of y -- that would be
[sum(1 for s in ppx if w in s) for w in y]
That doesn't give you those redundant first items you crave, but enumerate to the rescue...:
list(enumerate((sum(1 for s in ppx if w in s) for w in y), 1))
should give exactly what you require.
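The expression above counts, for each word of y, how many sentences contain it. If the goal is instead one count per sentence of x, as in the question's expected output, the same preprocessing idea could be applied the other way round. The following is only a sketch under that reading: it counts distinct words of y per sentence (not repeated occurrences), and the names wanted and counts are made up for this example.

```python
import re

x = [['I like stackoverflow. Hi ok!'], ['this is a great community'],
     ["Ok, I didn't like this!."], ['how to match and return the frequency?']]
y = ['hi', 'nice', 'ok', 'frequency']

wanted = set(w.lower() for w in y)
counts = []
for n, sub in enumerate(x, 1):
    # lowercase word set of this sentence, as in the preprocessing step
    words = set(w.lower() for w in re.findall(r'\w+', sub[0]))
    counts.append((n, len(wanted & words)))

print(counts)  # [(1, 2), (2, 0), (3, 1), (4, 1)]
```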
Here is a more readable solution. Check my comments in the code.
#!/usr/bin/python
# -*- coding: utf-8 -*-
import re
x = [['I like stackoverflow. Hi ok!'],['this is a great community'],['Ok, I didn\'t like this!.'],['how to match and return the frequency?']]
y = ['hi', 'nice', 'ok', 'frequency']
assert len(x)==len(y), "you have to make sure length of x equals y's"
num = []
for i in range(len(y)):
    # lower all the strings in x for comparison
    # find all matched patterns in x and count it, and store result in variable num
    num.append(len(re.findall(y[i], x[i][0].lower())))
res = []
# use enumerate to give output in format you want
for k, v in enumerate(num):
    res.append((k,v))
# here is what you want
print(res)
OUTPUT:
[(0, 1), (1, 0), (2, 1), (3, 1)]
INPUT:
x = [['I like stackoverflow. Hi ok!'],['this is a great community'],
['Ok, I didn\'t like this!.'],['how to match and return the frequency?']]
y = ['hi', 'nice', 'ok', 'frequency']
CODE:
import re
s1 = set(y)
index = 0
result = []
for itr in x:
    # remove punctuation (including commas, so 'ok,' becomes 'ok') and convert to lower case
    itr = re.sub('[!.?,]', '', itr[0].lower()).split(' ')
    s2 = set(itr)
    # find the words common to both sets
    intersection = s1 & s2
    num = len(intersection)
    result.append((index,num))
    index = index+1
OUTPUT:
result = [(0, 2), (1, 0), (2, 1), (3, 1)]
You could also do it like this.
>>> import re
>>> x = [['I like stackoverflow. Hi ok!'],['this is a great community'],['Ok, I didn\'t like this!.'],['how to match and return the frequency?']]
>>> y = ['hi', 'nice', 'ok', 'frequency']
>>> l = []
>>> for i,j in enumerate(x):
...     c = 0
...     for w in y:
...         if re.search(r'(?i)\b'+w+r'\b', j[0]):
...             c += 1
...     l.append((i+1,c))
...
>>> l
[(1, 2), (2, 0), (3, 1), (4, 1)]
(?i) makes the match case-insensitive. \b is a word boundary, which matches between a word character and a non-word character.
Maybe you could concatenate the strings in x to make the computation easy:
w = ' '.join(i[0] for i in x)
Now w is a long string like this:
>>> w
"I like stackoverflow. Hi ok! this is a great community Ok, I didn't like this!. how to match and return the frequency?"
With this conversion, you can simply do this:
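>>> l = []
>>> for i in range(len(y)):
...     l.append((i+1, w.count(str(y[i]))))
...
which gives you the following (y here apparently still contains the literal ... placeholder from the question, hence the fifth entry; also note that str.count is case-sensitive and matches substrings, so 'hi' is counted inside 'this'):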
>>> l
[(1, 2), (2, 0), (3, 1), (4, 0), (5, 1)]
You can make a dictionary where each key is an item of the list y and each value is a count starting at zero. Loop through the words in the nested list x and look each one up in the dictionary, incrementing its value whenever you encounter the word.
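A minimal sketch of that dictionary approach, assuming words are compared case-insensitively and split on non-word characters (both are assumptions of this example, as is the name counts). It counts every occurrence of each word of y across all of x:

```python
import re

x = [['I like stackoverflow. Hi ok!'], ['this is a great community'],
     ["Ok, I didn't like this!."], ['how to match and return the frequency?']]
y = ['hi', 'nice', 'ok', 'frequency']

counts = dict.fromkeys(y, 0)          # one zeroed counter per word of y
for sub in x:
    for word in re.findall(r'\w+', sub[0].lower()):
        if word in counts:            # dictionary lookup is O(1) on average
            counts[word] += 1

print(counts)  # {'hi': 1, 'nice': 0, 'ok': 2, 'frequency': 1}
```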

Find index of strings within a list that contain a specific character

I am very new to Python, but I have a problem that Google hasn't yet solved for me. I have a list of strings (f_list). I would like to generate a list of the indices of the strings that contain a specific character ('>').
Example:
f_list = ['>EntryA', 'EntryB', '>EntryC', 'EntryD']
I would like to generate:
index_list = [0, 2]
This code works, but I have to enter the exact name of a string (ie. >EntryA) for Value. If I enter '>' (as indicated below in the code example), it returns no values in index_list.
f_list = ['>EntryA', 'EntryB', '>EntryC', 'EntryD']
index_list = []
def all_indices(v, qlist):
    idx = -1
    while True:
        try:
            idx = qlist.find(v, idx+1)
            index_list.append(idx)
        except ValueError:
            break
    return index_list
all_indices('>', f_list)
print(index_list)
>>> [i for i, s in enumerate(f_list) if '>' in s]
[0, 2]
You can use filter to find the strings:
>>> f_list = ['>EntryA', 'EntryB', '>EntryC', 'EntryD']
>>> list(filter(lambda s: '>' in s, f_list))
['>EntryA', '>EntryC']
Or use a list comprehension to find the indices:
>>> [i for i, s in enumerate(f_list) if '>' in s]
[0, 2]
Or you can find both with either:
>>> list(filter(lambda s: '>' in s[1], enumerate(f_list)))
[(0, '>EntryA'), (2, '>EntryC')]
>>> [(i, s) for i, s in enumerate(f_list) if '>' in s]
[(0, '>EntryA'), (2, '>EntryC')]
If you are ever working with indexes, enumerate() is your function:
>>> f_list = ['>EntryA', 'EntryB', '>EntryC', 'EntryD']
>>> for i, j in enumerate(f_list):
...     if '>' in j:
...         print(i)
...
0
2
In a function:
>>> def all_indices(v, qlist):
...     return [i for i, j in enumerate(qlist) if v in j]
...
>>> all_indices('>', f_list)
[0, 2]

Checking if a string's characters are ascending alphabetically and its ascent is evenly spaced python

I need to check if a string's characters are ascending alphabetically and if that ascent is evenly spaced.
a = "abc"
b = "ceg"
So a is alphabetically ascending and its spacing is 1 (if you convert to the ordinal values they are 97,98,99). And b is also alphabetically ascending and its spacing is 2 (99,101,103).
I am stuck with the following code:
a = 'jubjub'
words1 = []
ords = [ord(letter) for letter in a]
diff = ords[1] - ords[0]
for ord_val in range(1, len(ords)-1):
    if diff > 0:
        if ords[ord_val + 1] - ords[ord_val] == diff:
            if a not in words1:
                words1.append((a, diff))
print(words1)
How come 'jubjub' works, 'ace' works, but 'catcat' doesn't?
>>> from itertools import product
>>> from string import ascii_lowercase as lowercase  # string.lowercase on Python 2
>>> a="abc"
>>> any(a in lowercase[i::j+1] for i,j in product(range(26),repeat=2))
True
>>> b="ceg"
>>> any(b in lowercase[i::j+1] for i,j in product(range(26),repeat=2))
True
>>> c="longer"
>>> any(c in lowercase[i::j+1] for i,j in product(range(26),repeat=2))
False
>>> d="bdfhj"
>>> any(d in lowercase[i::j+1] for i,j in product(range(26),repeat=2))
True
It's not necessary to use product, and it is a little more efficient to do it this way:
>>> any(a in lowercase[i::j+1] for i in range(26) for j in range(26-i))
True
without itertools
>>> a = 'abc'
>>> ords = [ord(c) for c in a]
>>> ords == sorted(ords)
True
>>> diffs = set()
>>> for i in range(len(ords) -1): diffs.add(ords[i] - ords[i+1])
>>> len(diffs) == 1
True
Only hints, since this looks like homework.
You can try to make use of something like this:
In [100]: z = 'abc'
In [101]: [ord(x) for x in z]
Out[101]: [97, 98, 99]
From there, there are several ways to check whether the elements are evenly spaced :)
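One possible way to finish that hint, offered only as a sketch: collect the differences between consecutive ordinals into a set and require exactly one positive value. The function name even_ascent and the choice to return the step (or None) are assumptions of this example.

```python
def even_ascent(word):
    """Return the common step if word ascends by one constant positive step, else None."""
    ords = [ord(c) for c in word]
    # differences between each character and the next one
    steps = {b - a for a, b in zip(ords, ords[1:])}
    if len(steps) == 1:
        step = steps.pop()
        if step > 0:
            return step
    return None

print(even_ascent("abc"))     # 1
print(even_ascent("ceg"))     # 2
print(even_ascent("jubjub"))  # None
print(even_ascent("catcat"))  # None
```

Because every consecutive pair is checked, words such as 'jubjub', which the loop in the question accepts on the strength of a single matching pair, are rejected here.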
A less intuitive approach to handling this:
>>> somestring = "aceg"
>>> len(set([ord(y)-ord(x) for (x,y) in zip(somestring, somestring[1:]) if y > x]))==1
True
The concept is to compute the differences between consecutive elements and check that they are all the same. For that I build a set of the differences and test whether its length is 1, in which case the spacing is constant. Also to ensure that the series

Python: find sequential change in one member of list pairs, report other

There must be a simpler, more pythonic way of doing this.
Given this list of pairs:
pp = [('a',1),('b',1),('c',1),('d',2),('e',2)]
How do I most easily find the first item in adjacent pairs where the second item changes (here, from 1 to 2). Thus I'm looking for ['c','d']. Assume there will only be one change in pair[1] for the entire list, but that it may be a string.
This code works but seems excruciatingly long and cumbersome.
for i, pair in enumerate(pp):
    if i == 0:
        pInitial = pair[0]
        sgInitial = pair[1]
    pNext = pair[0]
    sgNext = pair[1]
    if sgInitial == sgNext:
        sgInitial = sgNext
        pInitial = pNext
    else:
        pOne = pInitial
        pTwo = pNext
        x = [pOne, pTwo]
        print(x)
        break
Thanks
Tim
pp = [('a',1),('b',1),('c',1),('d',2),('e',2)]
# with normal zip and slicing
for a,b in zip(pp,pp[1:]):
    if a[1] != b[1]:
        x=(a[0],b[0])
        print(x)
        break
# with generators and a lazy zip (use itertools.izip on Python 2)
iterfirst = (b for a,b in pp)
itersecond = (b for a,b in pp[1:])
iterfirstsymbol = (a for a,b in pp)
itersecondsymbol = (a for a,b in pp[1:])
iteranswer = zip(iterfirstsymbol, itersecondsymbol, iterfirst, itersecond)
print(next((symbol1, symbol2)
           for symbol1,symbol2, first, second in iteranswer
           if first != second))
Added my readable generator version.
You could try something like:
[[pp[i][0],pp[i+1][0]] for i in range(len(pp)-1) if pp[i][1]!=pp[i+1][1]][0]
(using a list comprehension)
Try comparing pp[:-1] to pp[1:], something like
[a for a in zip(pp[:-1], pp[1:]) if a[0][1] != a[1][1]]
(look at zip(pp[:-1], pp[1:]) first to see what's going on)
Edit:
I guess you'd need
next([a[0][0], a[1][0]] for a in zip(pp[:-1], pp[1:]) if a[0][1] != a[1][1])
>>> import itertools
>>> pp = [('a',1),('b',1),('c',1),('d',2),('e',2)]
>>> gb = itertools.groupby(pp, key=lambda x: x[1])
>>> f = lambda x: list(next(gb)[1])[x][0]
>>> f(-1), f(0)
('c', 'd')
Here is something (simple?) with recursion:
def first_diff( seq, key=lambda x:x ):
    """ returns the first items a,b of `seq` with `key(a) != key(b)` """
    it = iter(seq)
    def test(last): # recursive function
        cur = next(it)
        if key(last) != key(cur):
            return last, cur
        else:
            return test(cur)
    return test(next(it))
print(first_diff( pp, key=lambda x:x[1])) # (('c', 1), ('d', 2))
pp = [('a',1),('b',1),('c',1),('d',2),('e',2)]
def find_first(pp):
    for i,(a,b) in enumerate(pp):
        if i == 0: oldb = b
        else:
            if b != oldb: return i
    return None
print(find_first(pp))
>>> pp = [('a',1),('b',1),('c',1),('d',2),('e',2)]
>>> [[t1, t2] for ((t1, v1), (t2, v2)) in zip(pp, pp[1:]) if v1 != v2] [0]
['c', 'd']
>>>
I like this for clarity...if you find list comprehensions clear. It does create two temporary lists: pp[1:] and the zip() result. Then it compares all the adjacent pairs and gives you the first change it found.
This similar-looking generator expression doesn't create temporary lists and stops processing when it reaches the first change:
>>> from itertools import islice  # on Python 2, also import izip and use it instead of zip
>>> next([t1, t2] for ((t1, v1), (t2, v2)) in zip(pp, islice(pp, 1, None))
...      if v1 != v2
...      )
['c', 'd']
>>>
Everybody's examples on this page are more compact than they would be if you wanted to catch errors.
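To illustrate that last point, here is a sketch of a version that tolerates the most obvious edge cases: an empty list, a single pair, or a list in which the second item never changes. The function name first_change and the decision to return None in those cases are assumptions of this example; malformed elements that are not pairs would still raise an exception.

```python
def first_change(pairs):
    """Return [label_before, label_after] at the first change of the second item, or None."""
    pairs = list(pairs)
    # next() with a default avoids StopIteration when no change is found
    return next(
        ([a, b] for (a, va), (b, vb) in zip(pairs, pairs[1:]) if va != vb),
        None,
    )

pp = [('a', 1), ('b', 1), ('c', 1), ('d', 2), ('e', 2)]
print(first_change(pp))                    # ['c', 'd']
print(first_change([('a', 1), ('b', 1)]))  # None
print(first_change([]))                    # None
```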
