Check intersection between two strings in python

I'm trying to check intersection between two strings using Python.
I defined this function:
def check(s1,s2):
    word_array = set.intersection(set(s1.split(" ")), set(s2.split(" ")))
    n_of_words = len(word_array)
    return n_of_words
It works with some sample strings, but in this specific case:
d_word = "BANGKOKThailand"
nlp_word = "Despite Concerns BANGKOK"
print(check(d_word,nlp_word))
I got 0. What am I missing?

I was looking for the longest common part of two strings, wherever it occurs.
def get_intersection(s1, s2):
    res = ''
    l_s1 = len(s1)
    for i in range(l_s1):
        # + 1 so that substrings ending at the last character of s1 are included
        for j in range(i + 1, l_s1 + 1):
            t = s1[i:j]
            if t in s2 and len(t) > len(res):
                res = t
    return res
#get_intersection(s1, s2)
Works for this example as well:
>>> s1 = "BANGKOKThailand"
>>> s2 = "Despite Concerns BANGKOK"
>>> get_intersection('aa' + s1 + 'bb', 'cc' + s2 + 'dd')
'BANGKOK'
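For comparison, the standard library's difflib can find the same longest common block. This is only a minimal sketch, not the method used in the answer above; the function name longest_common_substring is made up for illustration.

```python
from difflib import SequenceMatcher

def longest_common_substring(s1, s2):
    # autojunk=False turns off difflib's heuristic that ignores very
    # frequent characters in long inputs.
    matcher = SequenceMatcher(None, s1, s2, autojunk=False)
    match = matcher.find_longest_match(0, len(s1), 0, len(s2))
    # match.a is the start index in s1, match.size the length of the block.
    return s1[match.a:match.a + match.size]

print(longest_common_substring("BANGKOKThailand", "Despite Concerns BANGKOK"))
```

If the strings share nothing, the match has size 0 and the function returns an empty string.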

The first set contains a single string and the second contains three. The string "BANGKOKThailand" is not equal to the string "BANGKOK", so the intersection is empty.
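Printing the two sets makes this visible. The sketch below also shows one possible alternative, assuming the goal is to count the words of one string that occur anywhere inside the other; the helper name count_contained_words is hypothetical and not part of the question.

```python
d_word = "BANGKOKThailand"
nlp_word = "Despite Concerns BANGKOK"

left = set(d_word.split(" "))
right = set(nlp_word.split(" "))
print(left)          # a single element: the whole string
print(left & right)  # the empty set, because no element is equal

def count_contained_words(haystack, words):
    # Count distinct words that appear as a substring of haystack.
    return sum(1 for word in set(words.split()) if word in haystack)

print(count_contained_words(d_word, nlp_word))
```

Substring matching is looser than set intersection (for example, "a" would match almost any string), so whether it fits depends on the data.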

I can see two possible mistakes:
n_of_words = len(array)
should be
n_of_words = len(word_array)
and
d_word = "BANGKOKThailand"
is missing a space, and should be
d_word = "BANGKOK Thailand"
Making those two changes gave me a result of 1.

Related

Remove equal characters from two python strings

I am writing Python code to remove characters that are equal at the same index in two strings. For example, remove_same('ABCDE', 'ACBDE') should reduce the two arguments to BC and CB. I know that strings are immutable, so I have converted them to lists. I am getting an index out of range error.
def remove_same(l_string, r_string):
    l_list = list(l_string)
    r_list = list(r_string)
    i = 0
    while i != len(l_list):
        print(f'in {i} length is {len(l_list)}')
        while l_list[i] == r_list[i]:
            l_list.pop(i)
            r_list.pop(i)
            if i == len(l_list) - 1:
                break
        if i != len(l_list):
            i += 1
    return l_list[0] == r_list[0]

I would avoid using a while loop in this case. I think this is a better and clearer solution:
def remove_same(s1, s2):
    l1 = list(s1)
    l2 = list(s2)
    out1 = []
    out2 = []
    for c1, c2 in zip(l1, l2):
        if c1 != c2:
            out1.append(c1)
            out2.append(c2)
    s1_out = "".join(out1)
    s2_out = "".join(out2)
    print(s1_out)
    print(s2_out)
It could be shortened using list comprehensions, but I was trying to be as explicit as possible.
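A shorter variant along those lines might look like the sketch below. It returns the two results instead of printing them, which is a choice made here for illustration rather than something the answer specifies.

```python
def remove_same(s1, s2):
    # Keep only the positions where the two strings differ.
    pairs = [(c1, c2) for c1, c2 in zip(s1, s2) if c1 != c2]
    out1 = "".join(c1 for c1, _ in pairs)
    out2 = "".join(c2 for _, c2 in pairs)
    return out1, out2

print(remove_same('ABCDE', 'ACBDE'))
```

Note that zip stops at the end of the shorter string, so any extra characters of a longer string are dropped.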

I think this could be the problem:
while l_list[i] == r_list[i]:
    l_list.pop(i)
    r_list.pop(i)
Each pop shrinks the lists, so their length can drop to i or below, and the next l_list[i] lookup is then out of range.
Do a dry run with l_list = ["a"] and r_list = ["a"].
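The dry run can be reproduced directly. This small sketch runs only the inner loop from the question and catches the exception so the script finishes; the variable error exists only for this demonstration.

```python
l_list = ["a"]
r_list = ["a"]
i = 0
error = None
try:
    # First check: "a" == "a", so both items are popped.
    # Second check: both lists are empty, so l_list[0] fails.
    while l_list[i] == r_list[i]:
        l_list.pop(i)
        r_list.pop(i)
except IndexError as exc:
    error = exc

print(l_list, r_list, repr(error))
```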

It is in general not a good idea to modify a list while looping over it. Here is a cleaner, more Pythonic solution. The two strings are zipped and processed in parallel. Each pair of equal characters is discarded, and the remaining characters are arranged into new strings.
a = 'ABCDE'
b = 'ACFDE'
def remove_same(s1, s2):
    return ["".join(s) for s
            in zip(*[(x,y) for x,y in zip(s1,s2) if x!=y])]
remove_same(a, b)
#['BC', 'CF']

Here you go:
def remove_same(l_string, r_string):
    # if either string is empty, return False
    if not l_string or not r_string:
        return False
    l_list = list(l_string)
    r_list = list(r_string)
    limit = min(len(l_list), len(r_list))
    i = 0
    while i < limit:
        if l_list[i] == r_list[i]:
            l_list.pop(i)
            r_list.pop(i)
            limit -= 1
        else:
            i += 1
    # note: raises IndexError if a list is empty at this point,
    # e.g. when both strings are identical
    return l_list[0] == r_list[0]
print(remove_same('ABCDE', 'ACBDE'))
Output:
False

Format a large integer with commas without using .format()

I'm trying to format any number by inserting ',' after every 3 digits, counting from the end, without using format().
123456789 becomes 123,456,789
1000000 becomes 1,000,000
What I have so far only seems to work from the start. I've tried different ideas to reverse it, but they don't work as I hoped.
def format_number(number):
    s = [x for x in str(number)]
    for a in s[::3]:
        if s.index(a) is not 0:
            s.insert(s.index(a), ',')
    return ''.join(s)
print(format_number(1123456789))
>> 112,345,678,9
But obviously what I want is 1,123,456,789
I tried reversing the range [:-1:3] but I get 112,345,6789
Clarification: I don't want to use format to structure the number. I'd prefer to understand how to do it myself, just for self-study's sake.

Here is a solution for you that does not use format:
def format_number(number):
    s = list(str(number))[::-1]
    o = ''
    for a in range(len(s)):
        if a and a % 3 == 0:
            o += ','
        o += s[a]
    return o[::-1]
print(format_number(1123456789))
And here is the same result using the built-in str.format:
def format_number(number):
    return '{:,}'.format(number)
print(format_number(1123456789))
I hope this helps. :D

One way to do it without built-in functions at all...
def format_number(number):
    i = 0
    r = ""
    while True:
        r = "0123456789"[number % 10] + r
        number //= 10
        if number == 0:
            return r
        i += 1
        if i % 3 == 0:
            r = "," + r
Here's a version that's almost free of built-in functions or methods (it does still have to use str):
def format_number(number):
    i = 0
    r = ""
    for character in str(number)[::-1]:
        if i > 0 and i % 3 == 0:
            r = "," + r
        r = character + r
        i += 1
    return r
Another way to do it without format but with other built-ins is to reverse the number, split it into chunks of 3, join them with a comma, and reverse it again.
def format_number(number):
    backward = str(number)[::-1]
    r = ",".join(backward[i:i+3] for i in range(0, len(backward), 3))
    return r[::-1]

Your current approach has the following drawbacks:
checking for equality/inequality should in most cases (especially for int) be done with the ==/!= operators, not with is/is not,
list.index returns the first occurrence from the left end (so s.index('1') will always be 0 in your example); we can iterate over a range of indices instead (using the range built-in).
We can have something like
def format_number(number):
    s = [x for x in str(number)]
    for index in range(len(s) - 3, 0, -3):
        s.insert(index, ',')
    return ''.join(s)
Test
>>> format_number(1123456789)
'1,123,456,789'
>>> format_number(6789)
'6,789'
>>> format_number(135)
'135'
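The tests above cover only non-negative integers. With a negative input the minus sign is treated like a digit, so format_number(-123) would give '-,123'. One possible way around this, as a sketch (the name format_signed is made up here), is to format the absolute value and put the sign back afterwards:

```python
def format_signed(number):
    # Handle the sign separately so it is never counted as a digit.
    sign = "-" if number < 0 else ""
    digits = list(str(abs(number)))
    # Walk backwards from the end, inserting a comma every 3 digits.
    for index in range(len(digits) - 3, 0, -3):
        digits.insert(index, ",")
    return sign + "".join(digits)

print(format_signed(-123))
print(format_signed(-1234567))
```

This still assumes an integer argument; floats would need the fractional part split off first.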
If range, list.insert and str.join are not allowed
We can replace
range with a while loop,
list.insert with slicing and concatenation,
str.join with concatenation,
like
def format_number(number):
    s = [x for x in str(number)]
    index = len(s) - 3
    while index > 0:
        s = s[:index] + [','] + s[index:]
        index -= 3
    result = ''
    for character in s:
        result += character
    return result
Using str.format
Finally, following the docs:
The ',' option signals the use of a comma for a thousands separator. For a locale aware separator, use the 'n' integer presentation type instead.
your function can be simplified to
def format_number(number):
    return '{:,}'.format(number)
and it will even work for floats.
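The same format specification is also accepted by the built-in format() function and by f-strings. A short sketch, with arbitrary sample values:

```python
number = 1123456789

# Three equivalent spellings of the ',' thousands-separator option.
with_method = '{:,}'.format(number)
with_builtin = format(number, ',')
with_fstring = f'{number:,}'
print(with_method, with_builtin, with_fstring)

# Combined with a precision for floats.
print('{:,.2f}'.format(1234567.891))  # 1,234,567.89
```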

Find values in list which differ from reference list by up to N characters

I have a list like the following:
Test = ['ASDFGH', 'QWERTYU', 'ZXCVB']
And a reference list like this:
Ref = ['ASDFGY', 'QWERTYI', 'ZXCAA']
I want to extract the values from Test that differ by N or fewer characters from any one of the items in Ref.
For example, if N = 1, only the first two elements of Test should be output. If N = 2, all three elements meet this criterion and should be returned.
Note that I am only looking for values of the same character length (ASDFGY -> ASDFG should not match for N = 1), so I want something more efficient than Levenshtein distance.
I have over 1000 values in Ref and a couple hundred million in Test, so efficiency is key.

Using a generator expression with sum (note that this compares the items pairwise by position, not each Test item against every Ref item):
Test = ['ASDFGH', 'QWERTYU', 'ZXCVB']
Ref = ['ASDFGY', 'QWERTYI', 'ZXCAA']
def comparer(x, y, n):
    return (len(x) == len(y)) and (sum(i != j for i, j in zip(x, y)) <= n)
# keep the Test value of each pair that differs in at most 1 position
res = [t for r, t in zip(Ref, Test) if comparer(r, t, 1)]
print(res)
['ASDFGH', 'QWERTYU']
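The question asks for a match against any item in Ref, not just the item at the same position. A minimal sketch of that variant follows; the helper names within_n_substitutions and close_to_any are made up for illustration, and no claim is made that this is fast enough for hundreds of millions of values. It counts mismatched positions (the Hamming distance), stops early once the limit is exceeded, and compares only strings of equal length.

```python
from collections import defaultdict

def within_n_substitutions(a, b, n):
    """True if a and b have equal length and differ in at most n positions."""
    if len(a) != len(b):
        return False
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > n:
                return False  # stop early, the limit is already exceeded
    return True

def close_to_any(test, ref, n):
    # Group reference strings by length so each test string is only
    # compared against candidates of the same length.
    ref_by_length = defaultdict(list)
    for r in ref:
        ref_by_length[len(r)].append(r)
    return [t for t in test
            if any(within_n_substitutions(t, r, n)
                   for r in ref_by_length.get(len(t), ()))]

Test = ['ASDFGH', 'QWERTYU', 'ZXCVB']
Ref = ['ASDFGY', 'QWERTYI', 'ZXCAA']
print(close_to_any(Test, Ref, 1))
print(close_to_any(Test, Ref, 2))
```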

Using difflib
Demo:
import difflib
N = 1
Test = ['ASDFGH', 'QWERTYU', 'ZXCVB']
Ref = ['ASDFGY', 'QWERTYI', 'ZXCAA']
result = []
for i,v in zip(Test, Ref):
    c = 0
    for j,s in enumerate(difflib.ndiff(i, v)):
        if s.startswith("-"):
            c += 1
    if c <= N:
        result.append( i )
print(result)
Output:
['ASDFGH', 'QWERTYU']

The third-party regex module (a separate package from the standard re module) offers "fuzzy" matching:
import regex as re
Test = ['ASDFGH', 'QWERTYU', 'ZXCVB']
Ref = ['ASDFGY', 'QWERTYI', 'ZXCAA', 'ASDFGI', 'ASDFGX']
for item in Test:
    rx = re.compile('(' + item + '){s<=3}')
    for r in Ref:
        if rx.search(r):
            print(rf'{item} is similar to {r}')
This yields
ASDFGH is similar to ASDFGY
ASDFGH is similar to ASDFGI
ASDFGH is similar to ASDFGX
QWERTYU is similar to QWERTYI
ZXCVB is similar to ZXCAA
You can control it via the {s<=3} part, which allows three or fewer substitutions.
To get pairs, you could write
pairs = [(origin, difference)
         for origin in Test
         for rx in [re.compile(rf"({origin}){{s<=3}}")]
         for difference in Ref
         if rx.search(difference)]
Which would yield for
Test = ['ASDFGH', 'QWERTYU', 'ZXCVB']
Ref = ['ASDFGY', 'QWERTYI', 'ZXCAA', 'ASDFGI', 'ASDFGX']
the following output:
[('ASDFGH', 'ASDFGY'), ('ASDFGH', 'ASDFGI'),
('ASDFGH', 'ASDFGX'), ('QWERTYU', 'QWERTYI'),
('ZXCVB', 'ZXCAA')]

splicing between sequence of two defined boundaries

I have created a function that takes four strings. The first two are long strings that can be anything. The last two are referred to as boundaries. I want to take everything in string 1 between the boundaries (boundaries included) and use it to replace everything in string 2 between the same boundaries. The part taken from string 1 is removed from string 1, and the part it replaces in string 2 is discarded. An example of this function is below:
bound('DOGYOMAMA', 'ROOGMEMAD', 'OG', 'MA') --> returns ('DMA', 'ROOGYOMAD', 'OG', 'MA')
This is the function I have created to do what I wrote above:
def bound(st,sz,a,b):
    s1=''.join(st)
    s2=''.join(sz)
    if a in s1 and b in s1 and a in s2 and b in s2:
        f1=s1.find(a)
        l1=s1.find(b)
        f2=s2.find(a)
        l2=s2.find(b)
        blen1 = len(b)
        blen2 = len(b)
        s1_n = s1[:f1] +s1[l1+blen1:]
        s2_n = s2[:f2] + s1[f1:l1 + blen1] +s2[l2+blen2]
        return s1_n, s2_n, a, b
print(bound('DOGYOMAMA','ROOGMEMAD', 'OG','MA'))
My problem is that I also need this to work in reverse: given ('DOGYOMAMA', 'ROOGMEMAD', 'OG', 'MA'), it should also look at ('AMAMOYGOD', 'DAMEMGOOR', 'GO', 'AM'). In addition, if the strings can be spliced both ways, it should take only the splice that starts at the lowest index.

Try this.
If you need to return several results, don't return inside the loop. Instead, collect each result in a list and return that list at the end, as I did here:
def bound(st,sz,a,b):
    result=[]
    string_s = [''.join(st), ''.join(sz), ''.join(st)[::-1], ''.join(sz)[::-1]]
    boundaries = [a, b, a[::-1], b[::-1]]
    for chunk in range(0, len(string_s), 2):
        word = string_s[chunk:chunk + 2]
        bound = boundaries[chunk:chunk + 2]
        if bound[0] in word[0] and bound[1] in word[0] and bound[0] in word[1] and bound[1] in word[1]:
            f1 = word[0].find(bound[0])
            l1 = word[0].find(bound[1])
            f2 = word[1].find(bound[0])
            l2 = word[1].find(bound[1])
            blen1 = len(bound[1])
            blen2 = len(bound[1])
            s1_n = word[0][:f1] + word[0][l1 + blen1:]
            s2_n = word[1][:f2] + word[0][f1:l1 + blen1] + word[1][l2 + blen2]
            result.append([s1_n, s2_n, bound[0], bound[1]])
    return result
print(bound('DOGYOMAMA','ROOGMEMAD', 'OG','MA'))
output:
[['DMA', 'ROOGYOMAD', 'OG', 'MA'], ['AMAMOYAMOYGOD', 'DAMEME', 'GO', 'AM']]

Splitting a string before the nth occurrence of a character [duplicate]

Is there a Pythonic way to split a string after the nth occurrence of a given delimiter?
Given a string:
'20_231_myString_234'
It should be split into (with the delimiter being '_', after its second occurrence):
['20_231', 'myString_234']
Or is the only way to accomplish this to count, split and join?

>>> text = '20_231_myString_234'
>>> n = 2
>>> groups = text.split('_')
>>> '_'.join(groups[:n]), '_'.join(groups[n:])
('20_231', 'myString_234')
This seems like the most readable way; the alternative is a regex.

Using re to build a regex of the form ^((?:[^_]*_){n-1}[^_]*)_(.*) where n is a variable:
import re
n=2
s='20_231_myString_234'
m=re.match(r'^((?:[^_]*_){%d}[^_]*)_(.*)' % (n-1), s)
if m: print(m.groups())
or have a nice function:
import re
def nthofchar(s, c, n):
    regex=r'^((?:[^%c]*%c){%d}[^%c]*)%c(.*)' % (c,c,n-1,c,c)
    l = ()
    m = re.match(regex, s)
    if m: l = m.groups()
    return l
s='20_231_myString_234'
print(nthofchar(s, '_', 2))
Or without regexes, using iterative find:
def nth_split(s, delim, n):
    p, c = -1, 0
    while c < n:
        p = s.index(delim, p + 1)
        c += 1
    # skip the whole delimiter, which may be longer than one character
    return s[:p], s[p + len(delim):]
s1, s2 = nth_split('20_231_myString_234', '_', 2)
print(s1, ":", s2)

I like this solution because it needs only a trivial regex and can easily be adapted to another "nth" or delimiter.
import re
string = "20_231_myString_234"
occur = 2 # the occurrence at which you want to split
indices = [x.start() for x in re.finditer("_", string)]
part1 = string[0:indices[occur-1]]
part2 = string[indices[occur-1]+1:]
print(part1, ' ', part2)

I thought I would contribute my two cents. The second parameter of split() lets you stop splitting after a given number of splits:
def split_at(s, delim, n):
    r = s.split(delim, n)[n]
    return s[:-len(r)-len(delim)], r
On my machine, the two good answers by @perreal, iterative find and regular expressions, actually measure 1.4 and 1.6 times slower (respectively) than this method.
It's worth noting that it can become even quicker if you don't need the initial bit. Then the code becomes:
def remove_head_parts(s, delim, n):
    return s.split(delim, n)[n]
Not so sure about the naming, I admit, but it does the job. Somewhat surprisingly, it is 2 times faster than iterative find and 3 times faster than regular expressions.
I put up my testing script online. You are welcome to review and comment.

>>> import re
>>> text = '20_231_myString_234'
>>> occurrences = [m.start() for m in re.finditer('_', text)] # this will give you a list of '_' positions
>>> occurrences
[2, 6, 15]
>>> result = [text[:occurrences[1]], text[occurrences[1]+1:]] # [text[:6], text[7:]]
>>> result
['20_231', 'myString_234']

It depends on the pattern of the strings you want to split. For example, if the first two elements are always numbers, you can build a regular expression and use the re module, which can split your string as well.

I had a larger string to split at every nth separator, and ended up with the following code:
# Split every 6 spaces (err_str is the input string, defined elsewhere)
n = 6
sep = ' '
n_split_groups = []
groups = err_str.split(sep)
while len(groups):
    n_split_groups.append(sep.join(groups[:n]))
    groups = groups[n:]
print(n_split_groups)
Thanks @perreal!

@AllBlackt's solution in function form:
def split_nth(s, sep, n):
    n_split_groups = []
    groups = s.split(sep)
    while len(groups):
        n_split_groups.append(sep.join(groups[:n]))
        groups = groups[n:]
    return n_split_groups
s = "aaaaa bbbbb ccccc ddddd eeeeeee ffffffff"
print(split_nth(s, " ", 2))
['aaaaa bbbbb', 'ccccc ddddd', 'eeeeeee ffffffff']

As @Yuval has noted in his answer, and @jamylak commented in his answer, the split and rsplit methods accept a second (optional) parameter maxsplit to avoid making splits beyond what is necessary. Thus, I find the better solution (both for readability and performance) is this:
text = '20_231_myString_234'
first_part = text.rsplit('_', 2)[0] # Gives '20_231'
second_part = text.split('_', 2)[2] # Gives 'myString_234'
Note that rsplit('_', 2)[0] gives the right first part here only because the string contains exactly three underscores.
This is not only simple, but also avoids the performance hits of regex solutions and of other solutions that use join to undo unnecessary splits.
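For strings with an arbitrary number of delimiters, the maxsplit idea can be kept general by splitting once from the left and rejoining the head. The following is only a sketch under that assumption; the function name split_at_nth and the choice to raise ValueError are made up for illustration.

```python
def split_at_nth(text, delim, n):
    # At most n splits: n leading fields plus the untouched remainder.
    parts = text.split(delim, n)
    if len(parts) <= n:
        raise ValueError("delimiter occurs fewer than n times")
    head = delim.join(parts[:n])
    tail = parts[n]
    return head, tail

print(split_at_nth('20_231_myString_234', '_', 2))
print(split_at_nth('a_b_c_d_e', '_', 2))
```

The second call shows the difference: with five fields, 'a_b_c_d_e'.rsplit('_', 2)[0] would return 'a_b_c', whereas this version returns 'a_b' as the first part.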
