Find similar permutation of a word in another column - python

I want to look for permutations that match a given word, and arrange my data based on column position.
For example, I created a CSV with data I scraped from several websites. Say it looks something like this:
Name1      OtherVars  Name2      MoreVars
Stanford   23451      Mamford    No
MIT        yes        stanfor1d  12
BeachBoys  pie        Beatles    Sweeden
I want to (1) find permutations of each word from Name1 in Name2, and then (2) print a table with that word from Name1 + its matching value in OtherVars + the permutation of that word in Name2 + its matching value in MoreVars.
(If no match is found, just delete the word.)
The outcome in this case will be:
Name1     OtherVars  Name2      MoreVars
Stanford  23451      stanfor1d  12
So, how do I:
Find matching permutations for a word in another column?
Print the two words and the values they are mapped to in the other columns?
PS - here's a similar question; however, it's Java and it's pseudocode:
How to find all permutations of a given word in a given text?
difflib seems not to be suitable for CSVs, based on this: How to find the most similar word in a list in python
PS2 - I was advised to use Fuzzymatch; however, I suspect it's overkill in this case.

If you're looking for a function which returns the same output for "Stanford" and "stanfor1d", you could:
use lowercase
only keep letters
sort the letters
import re

def signature(word):
    return sorted(re.findall('[a-z]', word.lower()))

print(signature("Stanford"))
# ['a', 'd', 'f', 'n', 'o', 'r', 's', 't']
print(signature("Stanford") == signature("stanfor1d"))
# True
You could create a set or dict of signatures from 1st column, and see if there's any match within the second column.
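A minimal sketch of that idea, assuming the four columns have already been read into (name, value) pairs; the row variables and sample data below are illustrative, mirroring the question:

import re

def signature(word):
    return sorted(re.findall('[a-z]', word.lower()))

# Hypothetical in-memory rows: (Name1, OtherVars) and (Name2, MoreVars)
rows1 = [("Stanford", "23451"), ("MIT", "yes"), ("BeachBoys", "pie")]
rows2 = [("Mamford", "No"), ("stanfor1d", "12"), ("Beatles", "Sweeden")]

# Index the first column by signature (tuples, since lists can't be dict keys)
sigs = {tuple(signature(n1)): (n1, v1) for n1, v1 in rows1}

# Scan the second column for matching signatures
for n2, v2 in rows2:
    hit = sigs.get(tuple(signature(n2)))
    if hit:
        print(hit[0], hit[1], n2, v2)
# Stanford 23451 stanfor1d 12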

You seem to want fuzzy matching, not "permutations". There are a few Python fuzzy matching libraries, but I think people like fuzzywuzzy.
Alternatively, you can roll your own. Something like:

def ismatch(s1, s2):
    # implement logic
    # return True if the two names match
    pass

def group(rows1, rows2):
    # rows1 and rows2 are sequences of (name, value) pairs
    pairs = [(n1, v1, n2, v2)
             for n1, v1 in rows1
             for n2, v2 in rows2
             if ismatch(n1, n2)]
    return pairs
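If fuzzywuzzy is the route taken, a minimal ismatch could wrap fuzz.ratio; the threshold of 85 here is an illustrative choice, not a tuned value:

from fuzzywuzzy import fuzz

def ismatch(s1, s2, threshold=85):
    # fuzz.ratio returns a similarity score from 0 to 100
    return fuzz.ratio(s1.lower(), s2.lower()) >= threshold

print(ismatch("Stanford", "stanfor1d"))  # True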

Related

How to check if a string is made up only of words from a list?

I have a list of words:
words = ['ABC', 'CDE', 'EFG']
How do I check that my string consists only of words from that list? For example, 'EFG CDE' should give True since both 'CDE' and 'EFG' are in words.
My code is below:

import itertools

lmn = []
for j in list(itertools.permutations(words, 2)) + list(itertools.permutations(words, 3)):
    lmn.append(' '.join(j))

'EFG CDE' in lmn

My output gives True, which is correct.
But for strings like 'EFG EFG CDE' or 'CDE CDE CDE CDE' it will not give True, because these strings are not present in lmn, even though they are made up only of words from the list ['ABC', 'CDE', 'EFG'].
Here's how I would do it:
allowed_words = set(['ABC','CDE','EFG'])
target_string = 'EFG EFG CDE'
print(all(word in allowed_words for word in target_string.split()))
Rather than trying to build every possible permutation and then checking (which will be unbounded if the input is unbounded), just do the search yourself.
The problem is 'check every component part of the string is present in an iterable' where component part is defined as 'part separated by a space':
def check_string_made_of_parts(candidate, parts):
    return all(part in parts for part in candidate.split(" "))
With these kinds of problems in Python, it's helpful to talk through a sensible algorithm in words before you write any code.
If you break down your problem, you have a list (could be a set?) of allowed words and a string made up of words.
You want to check that every word in your string is in the list of allowed words. Translated loosely, you want to check that the set of words in your string is a subset of the set of allowed words.
Now there are two steps:
Get the set of words from the string.
Check if it is a subset of the allowed words.
(1) is quite easy - words = set(s.split())
(2) can be done by using basic set operations:
words.issubset(allowed_words) # allowed_words doesn't need to be a set
words <= set(allowed_words)
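A quick demonstration with the question's data:

allowed_words = {'ABC', 'CDE', 'EFG'}
words = set('EFG EFG CDE'.split())

print(words.issubset(allowed_words))  # True
print(words <= allowed_words)         # True; the operator form of the same check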

How would I write a function which uses multiple if statements, where each statement would modify the word one after the other?

I'm trying to write a function which would modify a tagged word depending on the tags present in the word, so basically a lemmatizer, but for words in Swedish.
For example, if the word was tagged with A it would remove ending X from the word, and if the word was also tagged with B it would remove ending Y, etc. In total there are seven different endings that might be present in the word, depending on the tag combinations, and which I then want to remove.
What I've tried so far is to use several if statements one after another: each modifies the word if it is tagged with one tag combination, then the next checks for another tag combination and modifies the word based on that, and so on.
if tag1 == 'A':
    word = word.rstrip('x')
if tag2 == 'B' and tag3 == 'C' and tag4 == 'D':
    word = word.rstrip('y')
if tag3 == 'B' and tag4 == 'D':
    word = word.rstrip('z')
I'm having problems with understanding how I should phrase the if statements so that they would each check for a tag combination, modify the word if the statement is true and then pass the modified word along to the next statement. How would I do this?
EDIT: As Prune said, I know that I could just add if statements with all the possible tag combinations, but I wanted to see if there is a more elegant solution than to do that.
From your description, it seems that you know how to beat this to death with brute force, but you'd like something more elegant. You might consider a structure of tags and associated removals, such as
rules = [
    ['A', 'x'],
    ['BCD', 'y'],
    ['B', 'z'],
    ...
]
Then iterate through your list of removal rules, applying each as appropriate, something like
for rule in rules:
    rule_tags = rule[0]
    # Check to see that all rule tags are in the input tags ... left to you to code
    if <your code here>:
        word = word.rstrip(rule[1])  # strip letter included in that rule
Does that get you moving toward a solution?
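As a sketch of one way to fill in the placeholder, assuming the word's tags are collected into a single set (the tags variable and sample word below are hypothetical):

# Hypothetical input: all of the word's tags gathered into one set
tags = {'A', 'B', 'C', 'D'}
word = 'exempelyx'

rules = [
    ['A', 'x'],
    ['BCD', 'y'],
    ['B', 'z'],
]

for rule_tags, ending in rules:
    # apply a rule only when every tag it names is present
    if all(t in tags for t in rule_tags):
        word = word.rstrip(ending)

print(word)  # exempel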

Single regular expression in Python with named groups for interleaved text

I would like to create a single regular expression in Python that extracts two interleaved portions of text from a filename as named groups. An example filename is given below:
CM00626141_H12.d4_T0001F003L01A02Z03C02.tif
The part of the filename I'd like to extract is contained between the underscores, and consists of the following:
An uppercase letter: [A-H]
A zero-padded two-digit number: 01 to 12
A period
A lowercase letter: [a-d]
A single digit: 1 to 4
For the example above, I would like one group ('Row') to contain H.d, and the other group ('Column') to contain 12.4. However, I don't know how to do this when the text is separated as it is here.
EDIT: A constraint which I omitted: it needs to be a single regex to handle the string. I've updated the text/title to reflect this point.
Regexp capturing groups (whether numbered or named) do not actually capture text - they capture starting/ending indices within the original text. Thus, it is impossible for them to capture non-contiguous text. Probably the best thing to do here is have four separate groups, and combine them into your two desired values manually.
You may do it in two steps using re.findall() as:
Step 1: Extract substring from the main string following your pattern as:
>>> import re
>>> my_file = 'CM00626141_H12.d4_T0001F003L01A02Z03C02.tif'
>>> my_content = re.findall(r'_([A-H])(0[1-9]|1[0-2])\.([a-d])([1-4])_', my_file)
# where content of my_content is: [('H', '12', 'd', '4')]
Step 2: Join tuples to get the value of row and column:
>>> row = ".".join(my_content[0][::2])
>>> row
'H.d'
>>> column = ".".join(my_content[0][1::2])
>>> column
'12.4'
I do not believe there is any way to capture everything you want in exactly two named capture groups and one regex call. The most straightforward way I see is to do the following:
>>> import re
>>> source = 'CM00626141_H12.d4_T0001F003L01A02Z03C02.tif'
>>> match = re.search(r'_([A-H])(0[1-9]|1[0-2])\.([a-d])([1-4])_', source)
>>> row, column = '.'.join(match.groups()[0::2]), '.'.join(match.groups()[1::2])
>>> row
'H.d'
>>> column
'12.4'
Alternatively, you might find it more appealing to handle the parsing almost completely in the regex:
>>> row, column = re.sub(
r'^.*_([A-H])(0[1-9]|1[0-2])\.([a-d])([1-4])_.*$',
r'\1.\3,\2.\4',
source).split(',')
>>> row, column
('H.d', '12.4')
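If named groups are still wanted, a sketch along the lines of the first answer (four groups combined manually; the group names are illustrative):

import re

source = 'CM00626141_H12.d4_T0001F003L01A02Z03C02.tif'
pattern = (r'_(?P<row_letter>[A-H])(?P<col_number>0[1-9]|1[0-2])'
           r'\.(?P<row_suffix>[a-d])(?P<col_digit>[1-4])_')

m = re.search(pattern, source)
row = '.'.join([m.group('row_letter'), m.group('row_suffix')])
column = '.'.join([m.group('col_number'), m.group('col_digit')])

print(row, column)  # H.d 12.4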

How to find the combination of words that includes all the letters in the input with Python

I want to find the most efficient way to loop through a combination of letters entered in Python and return a set of words whose combination includes all the letters, if feasible.
Example:
Say user entered A B C D E. Goal is to find the least number of words that includes all the letters. In this case an optimum solution, in preference order, will be:
One word that has all 5 letters
Two words that have all 5 letters (can be a 4-letter word + a 1-letter word OR a 3-letter word + a 2-letter word; it makes no difference)
....
etc.
If no match, then go back to 1. with n-1 letters, etc.
I have a function to check if a "combination of letters" (i.e. word) is in dictionary.
def is_in_lib(word):
    if word in lib:
        return word
    return False
An ideal answer should not involve generating every combination of those letters and searching for all of them. Searching through my dictionary is very costly, so I need something that also optimizes the time spent searching through the dictionary.
IMPORTANT EDIT: The order matters and continuity is required. Meaning that if the user enters "H", "T", "A", you cannot build "HAT".
Real example: if the input is T - H - G - R - A - C - E - K - B - Y - E, the output should be "Grace" and "Bye".
You could create a string/list from the input letters, and iterate through it for every word in the word library:

def find_word(lib, inputstring='abcde'):
    for i in lib:
        is_okay = True
        for j in inputstring:
            if i.find(j) == -1:
                is_okay = False
        if is_okay:
            return i
I think the other cases (two words with 3+2 letters) can be implemented recursively, but it wouldn't be efficient.
I think the key idea here would be to have some kind of index providing a mapping from a canonical sequence of characters to actual words. Something like this:
# List of known words
>>> words = ('bonjour', 'jour', 'bon', 'poire', 'proie')
# Build the index
>>> import collections
>>> index = collections.defaultdict(list)
>>> for w in words:
...     index[''.join(sorted(w.lower()))].append(w)
...
This will produce an efficient way to find all the anagrams corresponding to a sequence of characters:
>>> index
defaultdict(<class 'list'>, {'joru': ['jour'], 'eiopr': ['poire', 'proie'], 'bjnooru': ['bonjour'], 'bno': ['bon']})
You could query the index that way:
>>> user_str = 'OIREP'
>>> index.get(''.join(sorted(user_str.lower())), "")
['poire', 'proie']
Of course, this will only find "exact" anagrams -- that is, those containing all the letters provided by the user. To find all the strings that match a subset of the user-provided string, you will have to remove one letter at a time and check each combination again. I feel like recursion will help to solve that problem ;)
EDIT:
(should I put that on a spoiler section?)
Here is a possible solution:
import collections

words = ('bonjour', 'jour', 'bon', 'or', 'pire', 'poire', 'proie')

index = collections.defaultdict(list)
for w in words:
    index[''.join(sorted(w.lower()))].append(w)

# Recursively search for all the words contained in a sequence of letters
def search(letters, result=set()):
    # Assume "letters" is sorted
    if not letters:
        return
    solutions = index.get(letters)
    if solutions:
        for s in solutions:
            result.add(s)
    for i in range(0, len(letters)):
        search(letters[:i] + letters[i+1:], result)
    return result

# Use case:
user_str = "OIREP"
s = search(''.join(sorted(user_str.lower())))
print(s)
Producing:
set(['poire', 'or', 'proie', 'pire'])
It is not that bad, but it could be improved, since the same subsets of characters are examined several times. This is especially true if the user-provided search string contains several identical letters.
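A sketch of that improvement, reusing the index built above: track which letter sequences have already been expanded so each subset is visited only once (search_memo is a hypothetical name):

def search_memo(letters, result=None, seen=None):
    # letters is assumed sorted, as in search()
    if result is None:
        result = set()
    if seen is None:
        seen = set()
    if not letters or letters in seen:
        return result
    seen.add(letters)
    result.update(index.get(letters, ()))
    for i in range(len(letters)):
        search_memo(letters[:i] + letters[i+1:], result, seen)
    return result

print(search_memo(''.join(sorted('OIREP'.lower()))))
# {'or', 'pire', 'poire', 'proie'} (set order may vary)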

Breaking a string into individual words in Python

I have a large list of domain names (around six thousand), and I would like to see which words trend the highest for a rough overview of our portfolio.
The problem I have is the list is formatted as domain names, for example:
examplecartrading.com
examplepensions.co.uk
exampledeals.org
examplesummeroffers.com
+5996
Just running a word count brings up garbage. So I guess the simplest way to go about this would be to insert spaces between whole words then run a word count.
For my sanity I would prefer to script this.
I know (very) little Python 2.7, but I am open to any recommendations on approaching this; example code would really help. I have been told that using a simple string trie data structure would be the simplest way of achieving this, but I have no idea how to implement it in Python.
We try to split the domain name (s) into any number of words (not just 2) from a set of known words (words). Recursion ftw!
def substrings_in_set(s, words):
    if s in words:
        yield [s]
    for i in range(1, len(s)):
        if s[:i] not in words:
            continue
        for rest in substrings_in_set(s[i:], words):
            yield [s[:i]] + rest
This iterator function first yields the string it is called with, if it is in words. Then it splits the string in two in every possible way. If the first part is not in words, it tries the next split. If it is, the first part is prepended to all the results of calling itself on the second part (of which there may be none, as for a dead-end split like ["example", "cart", ...]).
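For instance, with a toy word set (hypothetical), the generator yields every viable split:

demo_words = {"example", "car", "cart", "trading"}
print(list(substrings_in_set("examplecartrading", demo_words)))
# [['example', 'car', 'trading']]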
Then we build the English dictionary:
# Assuming Linux. Word list may also be at /usr/dict/words.
# If not on Linux, grab yourself an English word list and insert here:
words = set(x.strip().lower() for x in open("/usr/share/dict/words").readlines())
# The above english dictionary for some reason lists all single letters as words.
# Remove all except "i" and "u" (remember a string is an iterable, which means
# that set("abc") == set(["a", "b", "c"])).
words -= set("bcdefghjklmnopqrstvwxyz")
# If there are more words we don't like, we remove them like this:
words -= set(("ex", "rs", "ra", "frobnicate"))
# We may also add words that we do want to recognize. Now the domain name
# slartibartfast4ever.co.uk will be properly counted, for instance.
words |= set(("4", "2", "slartibartfast"))
Now we can put things together:
count = {}
no_match = []
domains = ["examplecartrading.com", "examplepensions.co.uk",
           "exampledeals.org", "examplesummeroffers.com"]
# Assume domains is the list of domain names ["examplecartrading.com", ...]
for domain in domains:
    # Extract the part in front of the first ".", and make it lower case
    name = domain.partition(".")[0].lower()
    found = set()
    for split in substrings_in_set(name, words):
        found |= set(split)
    for word in found:
        count[word] = count.get(word, 0) + 1
    if not found:
        no_match.append(name)

print count
print "No match found for:", no_match
Result: {'ions': 1, 'pens': 1, 'summer': 1, 'car': 1, 'pensions': 1, 'deals': 1, 'offers': 1, 'trading': 1, 'example': 4}
Using a set to contain the english dictionary makes for fast membership checks. -= removes items from the set, |= adds to it.
Using the all function together with a generator expression improves efficiency, since all returns on the first False.
Some substrings may be a valid word both as either a whole or split, such as "example" / "ex" + "ample". For some cases we can solve the problem by excluding unwanted words, such as "ex" in the above code example. For others, like "pensions" / "pens" + "ions", it may be unavoidable, and when this happens, we need to prevent all the other words in the string from being counted multiple times (once for "pensions" and once for "pens" + "ions"). We do this by keeping track of the found words of each domain name in a set -- sets ignore duplicates -- and then count the words once all have been found.
EDIT: Restructured and added lots of comments. Forced strings to lower case to avoid misses because of capitalization. Also added a list to keep track of domain names where no combination of words matched.
NECROMANCY EDIT: Changed substring function so that it scales better. The old version got ridiculously slow for domain names longer than 16 characters or so. Using just the four domain names above, I've improved my own running time from 3.6 seconds to 0.2 seconds!
Assuming you only have a few thousand standard domains, you should be able to do this all in memory.
from collections import Counter

# domainfile and DictionaryFileOfEnglishLanguage are placeholders for your own files
domains = open(domainfile)
dictionary = set(DictionaryFileOfEnglishLanguage.readlines())
found = []
for domain in domains.readlines():
    for substring in all_sub_strings(domain):
        if substring in dictionary:
            found.append(substring)

c = Counter(found)  # this is what you want
print c
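The all_sub_strings helper is left undefined in the answer; a minimal sketch yielding every contiguous substring might look like this:

def all_sub_strings(s):
    # every contiguous substring of s
    for i in xrange(len(s)):
        for j in xrange(i + 1, len(s) + 1):
            yield s[i:j]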
with open('/usr/share/dict/words') as f:
    words = [w.strip() for w in f.readlines()]

def guess_split(word):
    result = []
    for n in xrange(len(word)):
        if word[:n] in words and word[n:] in words:
            result = [word[:n], word[n:]]
    return result

from collections import defaultdict
word_counts = defaultdict(int)
with open('blah.txt') as f:
    for line in f.readlines():
        for word in line.strip().split('.'):
            if len(word) > 3:
                # junks the com, org, stuff
                for x in guess_split(word):
                    word_counts[x] += 1

for spam in word_counts.items():
    print '{word}: {count}'.format(word=spam[0], count=spam[1])
Here's a brute-force method which only tries to split the domains into two English words. If the domain doesn't split into two English words, it gets junked. It should be straightforward to extend this to attempt more splits, but it will probably not scale well with the number of splits unless you're clever. Fortunately, I guess you'll only need 3 or 4 splits max.
output:
deals: 1
example: 2
pensions: 1
