What's the best way to find the intersection between two strings?

I need to find the intersection between two strings.
Assertions:
assert intersect("test", "tes") == list("tes"), "Assertion 1"
assert intersect("test", "ta") == list("t"), "Assertion 2"
assert intersect("foo", "fo") == list("fo"), "Assertion 3"
assert intersect("foobar", "foo") == list("foo"), "Assertion 4"
I tried different implementations of the intersect function, which receives two str parameters, w and w2.
List comprehension. Iterate and look for occurrences in the second string.
    return [l for l in w if l in w2]
Fails assertions 1 and 2 because each t in w matches the single t in w2.
Set intersection.
    return list(set(w).intersection(w2))
    return list(set(w) & set(w2))
Fails assertions 3 and 4 because a set is a collection of unique elements, so duplicated letters are discarded.
Iterate and count.
    out = ""
    for c in s1:
        if c in s2 and not c in out:
            out += c
    return out
Fails because it also eliminates duplicates.
difflib (Python documentation)
    letters_diff = difflib.ndiff(word, non_wildcards_letters)
    letters_intersection = []
    for l in letters_diff:
        # each ndiff line starts with a two-character code; two spaces mean "common to both"
        letter_code, letter = l[:2], l[2:]
        if letter_code == "  ":
            letters_intersection.append(letter)
    return letters_intersection
Passes.
difflib works, but can anybody think of a better, more optimized approach?
EDIT:
The function would return a list of chars. The order doesn't really matter.

Try this:
def intersect(string1, string2):
    common = []
    for char in set(string1):
        common.extend(char * min(string1.count(char), string2.count(char)))
    return common
Note: It doesn't preserve the order, because set() has no guaranteed iteration order. But, as you say in your comments, order doesn't matter.
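The same min-count idea can also be sketched with collections.Counter, whose & operator keeps the smaller count for each key. This is only an illustration; the name intersect_counts is made up here and is not from the original post.

```python
from collections import Counter

def intersect_counts(string1, string2):
    """Return the shared characters, each repeated min(count1, count2) times."""
    # Counter & Counter keeps every common key with the smaller of the two counts.
    common = Counter(string1) & Counter(string2)
    # elements() expands the counts back into individual characters.
    return list(common.elements())

# The result order is not guaranteed to match either input, so compare sorted lists.
print(sorted(intersect_counts("foobar", "foo")))  # ['f', 'o', 'o']
```

Because the order of the result is not something to rely on, the question's assertions would need sorted() on both sides to pass with this version.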

This works pretty well for your test cases:
def intersect(haystack, needle):
    while needle:
        pos = haystack.find(needle)
        if pos >= 0:
            return list(needle)
        needle = needle[:-1]
    return []
But bear in mind that in all your test cases the longer string comes first and the shorter second, and none of them has an empty search term, an empty search space, or a non-match.
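To make that caveat concrete, here is a small sketch that reuses the same logic under a hypothetical name (intersect_prefix) and runs it on inputs outside the question's test cases. It returns the longest prefix of the second string that occurs in the first, which is not the same thing as a character-by-character intersection.

```python
def intersect_prefix(haystack, needle):
    # Same logic as the answer above: shorten needle from the right
    # until what remains occurs somewhere in haystack.
    while needle:
        if haystack.find(needle) >= 0:
            return list(needle)
        needle = needle[:-1]
    return []

# "b" and "r" are in "foobar" too, but only the prefix "f" is found.
print(intersect_prefix("foobar", "fbr"))  # ['f']
# A needle whose first character is missing yields nothing at all.
print(intersect_prefix("foobar", "xfo"))  # []
# An empty search space is handled and gives an empty list.
print(intersect_prefix("", "foo"))        # []
```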

This gives the number of co-occurrences for all n-grams in the two strings:
from collections import Counter
def all_ngrams(text):
    ngrams = ( text[i:i+n] for n in range(1, len(text)+1)
                           for i in range(len(text)-n+1) )
    return Counter(ngrams)
def intersection(string1, string2):
    count_1 = all_ngrams(string1)
    count_2 = all_ngrams(string2)
    return count_1 & count_2 # intersection: min(c[x], d[x])
intersection('foo', 'f') # Counter({'f': 1})
intersection('foo', 'o') # Counter({'o': 1})
intersection('foobar', 'foo') # Counter({'f': 1, 'fo': 1, 'foo': 1, 'o': 2, 'oo': 1})
intersection('abhab', 'abab') # Counter({'a': 2, 'ab': 2, 'b': 2})
intersection('achab', 'abac') # Counter({'a': 2, 'ab': 1, 'ac': 1, 'b': 1, 'c': 1})
intersection('test', 'ates') # Counter({'e': 1, 'es': 1, 's': 1, 't': 1, 'te': 1, 'tes': 1})

Related

Removing duplicates using a dictionary

I am writing a function that is supposed to count how many distinct records (characters) are duplicated. Right now my output is the total number of duplicated occurrences, which I don't want.
For example, if there are 4 duplicates of one record, it gives me 4 instead of 1; if there are 6 duplicates of 2 individual records, it should give me 2.
Could someone please help find the bug?
Thank you
def duplicate_count(text):
    text = text.lower()
    dict = {}
    word = 0
    if len(text) != "":
        for a in text:
            dict[a] = dict.get(a,0) + 1
        for a in text:
            if dict[a] > 1:
                word = word + 1
        return word
    else:
        return "0"

Fixed it:
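def duplicate_count(text):
    text = text.lower()
    dict = {}
    word = 0
    # compare the string itself; len(text) != "" is always True
    if text != "":
        for a in text:
            dict[a] = dict.get(a,0) + 1
        # count each distinct character once if it occurs at least twice
        return sum(1 for a in dict.values() if a >= 2)
    else:
        return 0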

You can do this with set and sum. First, set is used to remove all duplicates, so that we iterate as few times as possible and get each character's count immediately rather than one occurrence at a time. The set is then used to build a dictionary that stores how many times each character occurs. Those values are then fed through a generator expression into sum, which counts how many of them are greater than 1.
def dup_cnt(t:str) -> int:
    if not t: return 0
    t = t.lower()
    d = dict()
    for c in set(t):
        d[c] = t.count(c)
    return sum(v>1 for v in d.values())
print(dup_cnt('aabccdeefggh')) #4
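As a variation on the same idea, collections.Counter can do the counting in a single pass instead of calling count() once per distinct character. A minimal sketch, with a function name (count_duplicated_chars) invented for illustration:

```python
from collections import Counter

def count_duplicated_chars(text):
    """Return how many distinct characters occur more than once (case-insensitive)."""
    # One pass over the text builds all the per-character counts.
    counts = Counter(text.lower())
    # Each character contributes 1 if it appears at least twice.
    return sum(1 for n in counts.values() if n > 1)

print(count_duplicated_chars('aabccdeefggh'))  # 4
```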

I don't really understand the question you asked.
But I assume you want to get the count or details of each letter's duplication in the text. You can do it like this; I hope it helps.
def duplicate_count(text):
    count_dict = {}
    for letter in text.lower():
        count_dict[letter] = count_dict.setdefault(letter, 0) + 1
    return count_dict
ret = duplicate_count('asuhvknasiasifjiasjfija')
# Get all letter details
print(ret)
#{'a': 5, 's': 4, 'u': 1, 'h': 1, 'v': 1, 'k': 1, 'n': 1, 'i': 4, 'f': 2, 'j': 3}
# Get all letter count
print(len(ret))
# 10
# Get only the letters appear more than once in the text
dedup = {k: v for k, v in ret.items() if v > 1}
# Get only duplicated letter details
print(dedup)
# {'a': 5, 's': 4, 'i': 4, 'f': 2, 'j': 3}
# Get only duplicated letter count
print(len(dedup))
# 5

I want to create a function that takes a text string and returns a dictionary containing certain characters with how many times they occur

This is what I have so far as a function
example = "Sample String"
def func(text, let):
    count= {}
    for let in text.lower():
        let = count.keys()
        if let in text:
            count[let] += 1
        else:
            count[let] = 1
    return count
I want to return something like this
print(func(example, "sao"))
{'s': 2, 'a' : 1}
I am not very sure what I could improve on

I would use Counter from the collections standard-library module:
from collections import Counter
def func(text, let):
    c = Counter(text.lower())
    return {l: c[l] for l in let if l in c.keys()}
Breaking it down:
Counter will return the count of letters in your string:
In [5]: Counter(example.lower())
Out[5]:
Counter({'s': 2,
'a': 1,
'm': 1,
'p': 1,
'l': 1,
'e': 1,
' ': 1,
't': 1,
'r': 1,
'i': 1,
'n': 1,
'g': 1})
So then all you need to do is return a dictionary of the appropriate letters, which can be done in a dictionary comprehension:
# iterate over every letter in `let`, and get the Counter value for that letter,
# if that letter is in the Counter keys
{l: c[l] for l in let if l in c.keys()}
Fixing your code
If you prefer to use your approach, you could make your code work properly with this:
def func(text, let):
    count = {}
    for l in text.lower():
        if l in let:
            if l in count.keys():
                count[l] += 1
            else:
                count[l] = 1
    return count

from functools import reduce
def count(text, letters):
    return reduce(
        lambda d, letr: d.update({letr: d.get(letr, 0) + 1}) or d,
        filter(lambda l: l in letters, text), {}
    )
Read it backwards.
Creates an empty dictionary.
{}
Filters letters from text.
lambda l: l in letters
This lambda function returns true if l is in letters
filter(lambda l: l in letters, text)
reduce will iterate over the object returned by filter, which will
only produce letters in text, if they are in letters.
lambda d, letr: d.update({letr: d.get(letr, 0) + 1}) or d
Updates the dictionary with the count of the letters it encounters.
Each time reduce iterates over an item generated by the filter object,
it calls this lambda function. Since dict.update() returns None, which evaluates as false, we add or d to hand the dict back to reduce, which passes it into the lambda on the next call, thus building up the counts. We also use dict.get() in the lambda instead of d[letr], which lets us supply a default of 0 if the letter is not yet in the dictionary.
At the end, reduce returns the dict, and we return that from count.
This is similar to how "map reduce" works.
You can read about functional style and lambda expressions in the Python docs.
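For comparison, here is a sketch that puts the reduce version next to an explicit loop that does the same fold step by step. The loop version (count_loop) is a hypothetical helper added only for illustration, and the input is lowercased by hand because count itself does not do that.

```python
from functools import reduce

def count(text, letters):
    # Fold over the filtered characters, carrying the dict along as the accumulator.
    return reduce(
        lambda d, letr: d.update({letr: d.get(letr, 0) + 1}) or d,
        filter(lambda l: l in letters, text),
        {},
    )

def count_loop(text, letters):
    # The same accumulation written out as a plain loop.
    counts = {}
    for ch in text:
        if ch in letters:
            counts[ch] = counts.get(ch, 0) + 1
    return counts

sample = "Sample String".lower()
print(count(sample, "sao"))       # {'s': 2, 'a': 1}
print(count_loop(sample, "sao"))  # {'s': 2, 'a': 1}
```

The loop is usually easier to read; the reduce form mainly shows how the accumulator is threaded through each call.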

>>> def func(text: str, let: str):
...     text, count = text.lower(), {}
...     for i in let:
...         if text.count(i) != 0:
...             count[i] = text.count(i)
...     return count
...
>>> print(func("Sample String", "sao"))
{'s': 2, 'a': 1}

Python dictionary sorting Anagram of a string [duplicate]

I have the following question. I found Permutation of string as substring of another, but it uses C++, and I am confused about how to apply it to Python.
Given two strings s and t, determine whether some anagram of t is a
substring of s. For example: if s = "udacity" and t = "ad", then the
function returns True. Your function definition should look like:
question1(s, t) and return a boolean True or False.
So I answered this question, but they want me to use dictionaries instead of sorting the string. The reviewer said:
We can first compile a dictionary of counts for t and check with every
possible consecutive substring sets in s. If any set is anagram of t,
then we return True, else False. Comparing counts of all characters
will can be done in constant time since there are only limited amount
of characters to check. Looping through all possible consecutive
substrings will take worst case O(len(s)). Therefore, the time
complexity of this algorithm is O(len(s)). space complexity is O(1)
although we are creating a dictionary because we can have at most 26
characters and thus it is bounded.
Could you please help me use dictionaries in my solution?
Here is my solution:
# Check if s1 and s2 are anagrams of each other
def anagram_check(s1, s2):
    # sorted returns a new list; compare the two lists
    return sorted(s1) == sorted(s2)
# Check if an anagram of t is a substring of s
def question1(s, t):
    for i in range(len(s) - len(t) + 1):
        if anagram_check(s[i: i+len(t)], t):
            return True
    return False
def main():
    print question1("udacity", "city")
if __name__ == '__main__':
    main()
'''
Test Case 1: question1("udacity", "city") -- True
Test Case 2: question1("udacity", "ud") -- True
Test Case 3: question1("udacity", "ljljl") -- False
'''
Any help is appreciated. Thank you,

A pure Python solution for building an object that records how many times each letter of the alphabet occurs in a string (t).
Using the function chr() you can convert an int to its corresponding ASCII character, so you can iterate over range(97, 123) (the codes 97 through 122) and use chr() to get each lowercase letter of the alphabet.
So if you have a string, say:
t = "abracadabra"
then you can do a for-loop like:
dt = {}
for c in range(97, 123):
    dt[chr(c)] = t.count(chr(c))
This works for this part of the solution and gives back the result:
{'k': 0, 'v': 0, 'a': 5, 'z': 0, 'n': 0, 't': 0, 'm': 0, 'q': 0, 'f': 0, 'x': 0, 'e': 0, 'r': 2, 'b': 2, 'i': 0, 'l': 0, 'h': 0, 'c': 1, 'u': 0, 'j': 0, 'p': 0, 's': 0, 'y': 0, 'o': 0, 'd': 1, 'w': 0, 'g': 0}
A different solution?
Comments are welcome, but why is storing in a dict necessary? Using count(), can you not simply compare the count of each char in t to the count of that char in s? If the count of any char in t is greater than in s, return False; otherwise return True.
Something along the lines of:
def question1(s, t):
    for c in range(97, 123):
        if t.count(chr(c)) > s.count(chr(c)):
            return False
    return True
which gives results:
>>> question1("udacity", "city")
True
>>> question1("udacity", "ud")
True
>>> question1("udacity", "ljljl")
False
If a dict is necessary...
If it is, then just create two as above and go through each key...
def question1(s, t):
    ds = {}
    dt = {}
    for c in range(97, 123):
        ds[chr(c)] = s.count(chr(c))
        dt[chr(c)] = t.count(chr(c))
    for c in range(97, 123):
        if dt[chr(c)] > ds[chr(c)]:
            return False
    return True
Update
The functions above ONLY CHECK that the letters of t occur somewhere in s, in any positions; they do NOT check for SUBSTRING anagrams. As maraca explained to me in the comments, there is a distinction between the two, and your example makes that clear.
Using the sliding window idea (by slicing the string), the code below should work for substrings:
def question1(s, t):
    dt = {}
    for c in range(97, 123):
        dt[chr(c)] = t.count(chr(c))
    for i in range(len(s) - len(t) + 1):
        contains = True
        for c in range(97, 123):
            if dt[chr(c)] > s[i:i+len(t)].count(chr(c)):
                contains = False
                break
        if contains:
            return True
    return False
The code above handles the substring cases, assuming lowercase a-z input, and uses a dictionary so that the counts for t are computed only once :)
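The reviewer's O(len(s)) idea can be sketched by updating the window counts incrementally instead of recounting each slice. This is one possible reading of that suggestion, not the reviewer's actual code; the name has_anagram_substring and the choice to treat an empty t as a match are assumptions made for this example.

```python
from collections import Counter

def has_anagram_substring(s, t):
    """Return True if some permutation of t occurs as a contiguous substring of s."""
    k = len(t)
    if k == 0:
        return True   # assumption: the empty string matches trivially
    if k > len(s):
        return False
    target = Counter(t)
    window = Counter(s[:k])        # counts for the first window of length k
    if window == target:
        return True
    for i in range(k, len(s)):
        window[s[i]] += 1          # character entering the window on the right
        left = s[i - k]
        window[left] -= 1          # character leaving the window on the left
        if window[left] == 0:
            del window[left]       # drop zero counts so the comparison stays exact
        if window == target:
            return True
    return False

print(has_anagram_substring("udacity", "ad"))  # True
print(has_anagram_substring("udacity", "uy"))  # False
```

Each step touches only two counts, and comparing the two counters costs at most the alphabet size, which is what makes the overall work proportional to len(s) for a bounded alphabet.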

import collections
print collections.Counter("google")
Counter({'o': 2, 'g': 2, 'e': 1, 'l': 1})

Anagram test for two strings in python

This is the question:
Write a function named test_for_anagrams that receives two strings as
parameters, both of which consist of alphabetic characters and returns
True if the two strings are anagrams, False otherwise. Two strings are
anagrams if one string can be constructed by rearranging the
characters in the other string using all the characters in the
original string exactly once. For example, the strings "Orchestra" and
"Carthorse" are anagrams because each one can be constructed by
rearranging the characters in the other one using all the characters
in one of them exactly once. Note that capitalization does not matter
here i.e. a lower case character can be considered the same as an
upper case character.
My code:
def test_for_anagrams (str_1, str_2):
    str_1 = str_1.lower()
    str_2 = str_2.lower()
    print(len(str_1), len(str_2))
    count = 0
    if (len(str_1) != len(str_2)):
        return (False)
    else:
        for i in range(0, len(str_1)):
            for j in range(0, len(str_2)):
                if(str_1[i] == str_2[j]):
                    count += 1
        if (count == len(str_1)):
            return (True)
        else:
            return (False)
#Main Program
str_1 = input("Enter a string 1: ")
str_2 = input("Enter a string 2: ")
result = test_for_anagrams (str_1, str_2)
print (result)
The problem here is that when I enter the strings Orchestra and Carthorse, it gives me False. The same happens for the strings The eyes and They see. Any help would be appreciated.

I'm new to python, so excuse me if I'm wrong
I believe this can be done in a different approach: sort the given strings and then compare them.
def anagram(a, b):
    # string to list
    str1 = list(a.lower())
    str2 = list(b.lower())
    #sort list
    str1.sort()
    str2.sort()
    #join list back to string
    str1 = ''.join(str1)
    str2 = ''.join(str2)
    return str1 == str2
print(anagram('Orchestra', 'Carthorse'))

The problem is that you only check whether a matching character exists anywhere in the other string and increment the counter each time. You do not account for characters you have already matched with another one. That's why the following also fails:
>>> test_for_anagrams('aa', 'aa')
False
Even though the strings are equal (and as such also anagrams), you are matching each a of the first string with each a of the other string, so you end up with a count of 4 and a result of False.
What you should do in general is count every character occurrence and make sure that every character occurs equally often in each string. You can count characters by using a collections.Counter object. You then just need to check whether the counts for each string are the same, which you can easily do by comparing the counter objects (Counter is a dict subclass):
from collections import Counter
def test_for_anagrams (str_1, str_2):
    c1 = Counter(str_1.lower())
    c2 = Counter(str_2.lower())
    return c1 == c2
>>> test_for_anagrams('Orchestra', 'Carthorse')
True
>>> test_for_anagrams('aa', 'aa')
True
>>> test_for_anagrams('bar', 'baz')
False

For completeness: if just importing Counter and being done with it is not in the spirit of the exercise, you can use plain dictionaries to count the letters.
def test_for_anagrams(str_1, str_2):
    counter1 = {}
    for c in str_1.lower():
        counter1[c] = counter1.get(c, 0) + 1
    counter2 = {}
    for c in str_2.lower():
        counter2[c] = counter2.get(c, 0) + 1
    # print statements so you can see what's going on,
    # comment out/remove at will
    print(counter1)
    print(counter2)
    return counter1 == counter2
Demo:
print(test_for_anagrams('The eyes', 'They see'))
print(test_for_anagrams('orchestra', 'carthorse'))
print(test_for_anagrams('orchestr', 'carthorse'))
Output:
{' ': 1, 'e': 3, 'h': 1, 's': 1, 't': 1, 'y': 1}
{' ': 1, 'e': 3, 'h': 1, 's': 1, 't': 1, 'y': 1}
True
{'a': 1, 'c': 1, 'e': 1, 'h': 1, 'o': 1, 's': 1, 'r': 2, 't': 1}
{'a': 1, 'c': 1, 'e': 1, 'h': 1, 'o': 1, 's': 1, 'r': 2, 't': 1}
True
{'c': 1, 'e': 1, 'h': 1, 'o': 1, 's': 1, 'r': 2, 't': 1}
{'a': 1, 'c': 1, 'e': 1, 'h': 1, 'o': 1, 's': 1, 'r': 2, 't': 1}
False

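Traverse the string test and check whether each character is present in the string test1; if it is, append it to the string value.
Then compare the length of value with the length of test1: if they are equal, return True, otherwise False. Note that this only tests membership, so strings with repeated letters can produce false positives (for example, anagram("aabb", "abcd") returns True).
def anagram(test,test1):
    value =''
    for data in test:
        if data in test1:
            value += data
    if len(value) == len(test1):
        return True
    else:
        return False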
anagram("abcd","adbc")

I have written an anagram program in a basic, easy-to-understand way.
def compare(str1,str2):
    if((str1==None) or (str2==None)):
        print(" You didn't enter a string.")
    elif(len(str1)!=len(str2)):
        print(" The strings entered are not anagrams.")
    elif(len(str1)==len(str2)):
        b=[]
        c=[]
        for i in str1:
            #print(i)
            b.append(i)
        b.sort()
        print(b)
        for j in str2:
            #print(j)
            c.append(j)
        c.sort()
        print(c)
        if (b==c and b!=[] ):
            print(" The strings entered are anagrams.")
        else:
            print(" The strings entered are not anagrams.")
    else:
        print(" The strings entered are not anagrams.")
str1=input(" Enter the first String :")
str2=input(" Enter the second String :")
compare(str1,str2)

A more concise and Pythonic way to do it is to use sorted together with lower (or upper).
First use lower/upper to make the case consistent, then sort the characters of each string and compare the results:
# Function definition
def test_for_anagrams (str_1, str_2):
    # sorted() returns a list, so the case conversion has to happen on the strings first
    return sorted(str_1.lower()) == sorted(str_2.lower())
#Main Program
str_1 = input("Enter a string 1: ")
str_2 = input("Enter a string 2: ")
result = test_for_anagrams (str_1, str_2)
print (result)

Another solution:
def test_for_anagrams(my_string1, my_string2):
    s1,s2 = my_string1.lower(), my_string2.lower()
    count = 0
    if len(s1) != len(s2) :
        return False
    for char in s1 :
        if s2.count(char,0,len(s2)) == s1.count(char,0,len(s1)):
            count = count + 1
    return count == len(s1)

My solution is:
#anagrams
def is_anagram(a, b):
    if sorted(a) == sorted(b):
        return True
    else:
        return False
print(is_anagram("Alice", "Bob"))

def anagram(test,test1):
    test_value = []
    if len(test) == len(test1):
        for i in test:
            value = test.count(i) == test1.count(i)
            test_value.append(value)
    else:
        test_value.append(False)
    if False in test_value:
        return False
    else:
        return True
Check the lengths of test and test1; if they match, traverse the string test and compare each character's count in test and test1, storing the result of each comparison in a list.

Determining Letter Frequency Of Cipher Text

I am trying to make a tool that finds the frequencies of letters in some type of cipher text.
Let's suppose it is all lowercase a-z, no numbers. The encoded message is in a txt file.
I am trying to build a script to help in cracking substitution or possibly transposition ciphers.
Code so far:
cipher = open('cipher.txt','U').read()
cipherfilter = cipher.lower()
cipherletters = list(cipherfilter)
alpha = list('abcdefghijklmnopqrstuvwxyz')
occurrences = {}
for letter in alpha:
    occurrences[letter] = cipherfilter.count(letter)
for letter in occurrences:
    print letter, occurrences[letter]
All it does so far is show how many times a letter appears.
How would I print the frequency of all letters found in this file?

import collections
d = collections.defaultdict(int)
for c in 'test':
    d[c] += 1
print d # defaultdict(<type 'int'>, {'s': 1, 'e': 1, 't': 2})
From a file:
myfile = open('test.txt')
for line in myfile:
    line = line.rstrip('\n')
    for c in line:
        d[c] += 1
For the genius that is the defaultdict container, we must give thanks and praise. Otherwise we'd all be doing something silly like this:
s = "andnowforsomethingcompletelydifferent"
d = {}
for letter in s:
    if letter not in d:
        d[letter] = 1
    else:
        d[letter] += 1

The modern way:
from collections import Counter
string = "ihavesometextbutidontmindsharing"
Counter(string)
#>>> Counter({'i': 4, 't': 4, 'e': 3, 'n': 3, 's': 2, 'h': 2, 'm': 2, 'o': 2, 'a': 2, 'd': 2, 'x': 1, 'r': 1, 'u': 1, 'b': 1, 'v': 1, 'g': 1})

If you want to know the relative frequency of a letter c, you have to divide the number of occurrences of c by the length of the input.
For instance, taking Adam's example:
s = "andnowforsomethingcompletelydifferent"
n = len(s) # n = 37
and storing the absolute frequency of each letter in
dict[letter]
we obtain the relative frequencies by:
from string import ascii_lowercase # this is "a...z"
for c in ascii_lowercase:
    print c, dict[c]/float(n)
Putting it all together, we get something like this:
# get input
s = "andnowforsomethingcompletelydifferent"
n = len(s) # n = 37
# get absolute frequencies of letters
import collections
dict = collections.defaultdict(int)
for c in s:
    dict[c] += 1
# print relative frequencies
from string import ascii_lowercase # this is "a...z"
for c in ascii_lowercase:
    print c, dict[c]/float(n)
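For anyone on Python 3, the same calculation might look roughly like the sketch below. It assumes, as the question does, that only the letters a-z matter, so everything else is ignored when computing the total; the function name relative_frequencies is made up for this example.

```python
from collections import Counter
from string import ascii_lowercase

def relative_frequencies(text):
    """Map each letter a-z that occurs in text to its share of all letters."""
    # Keep only a-z so spaces, digits and punctuation do not skew the total.
    letters = [c for c in text.lower() if c in ascii_lowercase]
    total = len(letters)
    if total == 0:
        return {}
    counts = Counter(letters)
    return {c: counts[c] / total for c in ascii_lowercase if counts[c]}

freqs = relative_frequencies("andnowforsomethingcompletelydifferent")
# Print the three most frequent letters first, which is the usual
# starting point when attacking a substitution cipher.
for letter, freq in sorted(freqs.items(), key=lambda kv: kv[1], reverse=True)[:3]:
    print(letter, round(freq, 3))
```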
