How to match against tuples using Python FuzzyWuzzy library? - python

I am using FuzzyWuzzy to match a string against tuples contains two strings. For example:
from fuzzywuzzy import fuzz, process
query = "cat"
animals = [('cat','owner1'),('dog','owner3'),('lizard','owner45')]
result = process.extractOne(query, animals, scorer=fuzz.ratio)
This code returns an error because the list being compared to, animals , is not a list of strings. I would only like to compare to the 1st item in the tuple. What I would like to be returned is: (('cat','owner1), 100) because it is a 100% match.
The below code works, outputting ('cat', 100) but I need the other part of the tuple.
from fuzzywuzzy import fuzz, process
query = "cat"
animals = ["cat","dog",'lizard']
result = process.extractOne(query, lex, scorer=fuzz.ratio)
print(result)
Any ideas?
edit: I know that I can get a list of 1st elements with a list comprehension, but for memory and speed reasons, I would like to do this without creating a new list, because I am working with large data sets.

From your list of tuples you can create a sub-list containing only the first item of each tuple by using a list comprehension.
>>> animal_owners = [('cat','owner1'),('dog','owner3'),('lizard','owner45')]
>>> [ao[0] for ao in animal_owners]
['cat', 'dog', 'lizard']
With this technique you can substitute the second expression where you only need the animals while leaving the original list alone.

I know the post is older, but this is an issue I just had to contend with and found a way to do it! If you look at its signature:
process.extractOne(
query,
choices,
processor: function=function,
scorer:function=function,
score_cutoff: int=0
)
you can utilize the processor function to return the value each tuple you want to be analyzed. For example, say you have a list of company names and their ticker symbols in tuples, and want to get the closest match based on the company name:
from fuzzywuzzy import process
def get_company_name(tup):
return tup[0]
choices = [
('Apple, Inc.', 'AAPL'),
('Google, Inc.', 'GOOGL'),
('Tesla, Inc.', 'TSLA')
]
closest_match = process.extractOne("apple", choices, processor=get_company_name)
and then the script will return a tuple with the best match tuple and the pct match:
(('Apple, Inc.', 'AAPL'), 100)

Related

Use lower() method (or similar) when checking if input is in a list?

I'm working on a project where I have a list of items where some are capitalized (list is currently incomplete because debugging). I need to check if the user's input is in the list, and I'd rather it not be case sensitive. I tried using the lower() method, but it doesn't work for lists. Is there a way to get the same effect as that in the context of this code?
itemList = ['Dirt', 'Oak Log']
def takeName():
itemName = ''
while itemName == '':
itemName = input('What Minecraft item would you like the crafting recipe for?\n')
try:
itemName = int(itemName)
except ValueError:
if itemName.lower() not in itemList.lower():
print('Not a valid Minecraft item name or ID!\n')
itemName = ''
elif itemName.lower() in itemList.lower():
itemName = itemList.lower().index(itemName)
return itemName
You can use list comprehension to convert the list items to lowercase. For example:
itemList_lower = [x.lower() for x in itemList]
Then use itemList_lower in the places in your code where you tried using itemList.lower().
Unfortunately, there is no built-in method that set all the strings of a list to lower case.
If you think about it, it would be inconsistent to develop, because in Python a list can store object of different type, it would be incoherent about the purpose of the data structure itself.
So, you should design and develop an algorithm to do that.
Do not worry though, with "list comprehension" it's quick, simple and elegant, in fact:
lowered_list = [item.lower() for item in itemList]
so now, you have a new list named "lowered_list" that contains all the strings you have stored in the other list, but in lower case.
Cool, isn't it?
You can't use a string function on a list in order to apply this function to all list items. You'll have to apply the function using a list comprehension or using the map function.
Using a list comprehension:
lowerItemList = [item.lower() for item in itemList]
A list comprehension is basiccally a more efficient way and shorter to write short for loops. The above code would be the same as this:
lowerItemList = []
for item in itemList:
itemLower = item.lower()
loserItemList.append(itemLower)
Another way to do this is using the map function. This function applies a function to all items of an iterable.
lowerItemList = list(map(str.lower, itemList))
As this is an a bit more complicated example, here's how it works:
We are using the map function to apply the str.lower function on our item list. Calling str.lower(string) is actually the same as calling string.lower(). Python does this already for you.
So, our result will contain a map object containing all the strings in their lowered form. We need to convert this map object into a list object, as a map object is an iterator and this does not support indexing and many of the list class' methods.
You want a case-insensitive lookup but you want to return a value as it's defined in itemList. Therefore:
itemList = ['Dirt', 'Oak Log']
itemListLower = [e.lower() for e in itemList] # normal form
def takeName():
while True:
itemName = input('What Minecraft item would you like the crafting recipe for? ')
try:
idx = itemListLower.index(itemName.lower()) # do lookup in the normal form
# as the two lists are aligned you can then return the value from the original list using the same index
return itemList[idx]
except ValueError:
print(f'{itemName} is not a valid selection') # user will be promted to re-enter
print(takeName())
Output:
What Minecraft item would you like the crafting recipe for? banana
banana is not a valid selection
What Minecraft item would you like the crafting recipe for? oak log
Oak Log

Python: Unify multiple lists into one

Could you help me with the following challenge I am currently facing:
I have multiple lists, each of which contains multiple strings. Each string has the following format:
"ID-Type" - where ID is a number and type is a common Python type. One such example can be found here:
["1-int", "2-double", "1-string", "5-list", "5-int"],
["3-string", "1-int", "1-double", "5-double", "5-string"]
Before calculating further, I now want to preprocess these list to unify them the following way:
Count how often each type is appearing in each list
Generate a new list, combining both results
Create a mapping from initial list to that new list
As an example
In the above lists, we have the following types:
List 1: 2 int, 1 double, 1 string, 1 list
List 2: 2 string, 2 double, 1 int
The resulting table should now contain:
2 int, 2 double, 2 string, 1 list (in order to be able to contain both lists), like this:
[
"int_1-int",
"int_2-int",
"double_1-double",
"double_2-double",
"string_1-string",
"string_2-string",
"list_1-list"
]
And lastly, in order to map input to output, the idea is to have a corresponding dictionary to map this transformation, e.g., for list_1:
{
"1-int": "int_1-int",
"2-double": "double_1-double",
"1-string": "string_1-string",
"5-list": "list_1-list",
"5-int": "int_2-int"
}
I want to prevent to do this with a nested loop and multiple iterations - are there any libraries or is there maybe a smart vectorized solution to address this challenge?
Just add them:
Example :
['it'] + ['was'] + ['annoying']
You should read the Python tutorial to learn basic info like this.
Just another method....
import itertools
ab = itertools.chain(['it'], ['was'], ['annoying'])
list(ab)
Just add them: Example :
['it'] + ['was'] + ['annoying']
You should read the Python tutorial to learn basic info like this.
Just another method....
import itertools
ab = itertools.chain(['it'], ['was'], ['annoying'])
list(ab)
In general, this approach doesn't really make sense unless you specifically need to have the items in the resulting list and dict in this exact format. But here's how you can do it:
def process_type_list(type_list):
mapping = dict()
for i in type_list:
i_type = i.split('-')[1]
n_occur = 1
map_val = f'{i_type}_{n_occur}-{i_type}'
while map_val in mapping.values():
n_occur += 1
map_val = f'{i_type}_{n_occur}-{i_type}'
mapping[i] = map_val
return mapping
l1 = ["1-int", "2-double", "1-string", "5-list", "5-int"]
l2 = ["3-string", "1-int", "1-double", "5-double", "5-string"]
l1_mapping = process_type_list(l1)
l2_mapping = process_type_list(l2)
Additionally, Python does not have a double type. C doubles are implemented as Python floats (or decimal.Decimal if you need fine control over the precision)
I am pretty sure that this is what you want to do:
To make a joint list:
['any item'] + ['any item 2']
If you want to turn the list into a dictionary:
dict(zip(['key 1', 'key 2'], ['value 1', 'value 2']))
Another method of joining 2 lists:
a = ['list item', 'another list item']
a.extend(['another list item', 'another list item'])

Iterate over a list of strings in python

I am trying to set up a data set that checks how often several different names are mentioned in a list of articles. So for each article, I want to know how often nameA, nameB and so forth are mentioned. However, I have troubles with iterating over the list.
My code is the following:
for element in list_of_names:
for i in list_of_articles:
list_of_namecounts = len(re.findall(element, i))
list_of_names = a string with several names [nameA nameB nameC]
list_of_articles = a list with 40.000 strings that are articles
Example of article in list_of_articles:
Index: 1
Type: str
Size: Amsterdam - de financiële ...
the error i get is: expected string or buffer
I though that when iterating over the list of strings, that the re.findall command should work using lists like this, but am also fairly new to Python. Any idea how to solve my issue here?
Thank you!
If your list is ['apple', 'apple', 'banana'] and you want the result: number of apple = 2, then:
from collections import Counter
list_count = Counter(list_of_articles)
for element in list_of_names:
list_of_namecounts = list_count[element]
And assuming list_of_namecounts is a list ¿?
list_of_namecounts = []
for element in list_of_names:
list_of_namecounts.append(list_count[element])
See this for more understanding

Regular expressions matching words which contain the pattern but also the pattern plus something else

I have the following problem:
list1=['xyz','xyz2','other_randoms']
list2=['xyz']
I need to find which elements of list2 are in list1. In actual fact the elements of list1 correspond to a numerical value which I need to obtain then change. The problem is that 'xyz2' contains 'xyz' and therefore matches also with a regular expression.
My code so far (where 'data' is a python dictionary and 'specie_name_and_initial_values' is a list of lists where each sublist contains two elements, the first being specie name and the second being a numerical value that goes with it):
all_keys = list(data.keys())
for i in range(len(all_keys)):
if all_keys[i]!='Time':
#print all_keys[i]
pattern = re.compile(all_keys[i])
for j in range(len(specie_name_and_initial_values)):
print re.findall(pattern,specie_name_and_initial_values[j][0])
Variations of the regular expression I have tried include:
pattern = re.compile('^'+all_keys[i]+'$')
pattern = re.compile('^'+all_keys[i])
pattern = re.compile(all_keys[i]+'$')
And I've also tried using 'in' as a qualifier (i.e. within a for loop)
Any help would be greatly appreciated. Thanks
Ciaran
----------EDIT------------
To clarify. My current code is below. its used within a class/method like structure.
def calculate_relative_data_based_on_initial_values(self,copasi_file,xlsx_data_file,data_type='fold_change',time='seconds'):
copasi_tool = MineParamEstTools()
data=pandas.io.excel.read_excel(xlsx_data_file,header=0)
#uses custom class and method to get the list of lists from a file
specie_name_and_initial_values = copasi_tool.get_copasi_initial_values(copasi_file)
if time=='minutes':
data['Time']=data['Time']*60
elif time=='hour':
data['Time']=data['Time']*3600
elif time=='seconds':
print 'Time is already in seconds.'
else:
print 'Not a valid time unit'
all_keys = list(data.keys())
species=[]
for i in range(len(specie_name_and_initial_values)):
species.append(specie_name_and_initial_values[i][0])
for i in range(len(all_keys)):
for j in range(len(specie_name_and_initial_values)):
if all_keys[i] in species[j]:
print all_keys[i]
The table returned from pandas is accessed like a dictionary. I need to go to my data table, extract the headers (i.e. the all_keys bit), then look up the name of the header in the specie_name_and_initial_values variable and obtain the corresponding value (the second element within the specie_name_and_initial_value variable). After this, I multiply all values of my data table by the value obtained for each of the matched elements.
I'm most likely over complicating this. Do you have a better solution?
thanks
----------edit 2 ---------------
Okay, below are my variables
all_keys = set([u'Cyp26_G_R1', u'Cyp26_G_rep1', u'Time'])
species = set(['[Cyp26_R1R2_RARa]', '[Cyp26_SRC3_1]', '[18-OH-RA]', '[p38_a]', '[Cyp26_G_rep1]', '[Cyp26]', '[Cyp26_G_a]', '[SRC3_p]', '[mRARa]', '[np38_a]', '[mRARa_a]', '[RARa_pp_TFIIH]', '[RARa]', '[Cyp26_G_L2]', '[atRA]', '[atRA_c]', '[SRC3]', '[RARa_Ser369p]', '[p38]', '[Cyp26_mRNA]', '[Cyp26_G_L]', '[TFIIH]', '[Cyp26_SRC3_2]', '[Cyp26_G_R1R2]', '[MSK1]', '[MSK1_a]', '[Cyp26_G]', '[Basal_Kinases]', '[Cyp26_R1_RARa]', '[4-OH-RA]', '[Cyp26_G_rep2]', '[Cyp26_Chromatin]', '[Cyp26_G_R1]', '[RXR]', '[SMRT]'])
You don't need a regex to find common elements, set.intersection will find all elements in list2 that are also in list1:
list1=['xyz','xyz2','other_randoms']
list2=['xyz']
print(set(list2).intersection(list1))
set(['xyz'])
Also if you wanted to compare 'xyz' to 'xyz2' you would use == not in and then it would correctly return False.
You can also rewrite your own code a lot more succinctly, :
for key in data:
if key != 'Time':
pattern = re.compile(val)
for name, _ in specie_name_and_initial_values:
print re.findall(pattern, name)
Based on your edit you have somehow managed to turn lists into strings, one option is to strip the []:
all_keys = set([u'Cyp26_G_R1', u'Cyp26_G_rep1', u'Time'])
specie_name_and_initial_values = set(['[Cyp26_R1R2_RARa]', '[Cyp26_SRC3_1]', '[18-OH-RA]', '[p38_a]', '[Cyp26_G_rep1]', '[Cyp26]', '[Cyp26_G_a]', '[SRC3_p]', '[mRARa]', '[np38_a]', '[mRARa_a]', '[RARa_pp_TFIIH]', '[RARa]', '[Cyp26_G_L2]', '[atRA]', '[atRA_c]', '[SRC3]', '[RARa_Ser369p]', '[p38]', '[Cyp26_mRNA]', '[Cyp26_G_L]', '[TFIIH]', '[Cyp26_SRC3_2]', '[Cyp26_G_R1R2]', '[MSK1]', '[MSK1_a]', '[Cyp26_G]', '[Basal_Kinases]', '[Cyp26_R1_RARa]', '[4-OH-RA]', '[Cyp26_G_rep2]', '[Cyp26_Chromatin]', '[Cyp26_G_R1]', '[RXR]', '[SMRT]'])
specie_name_and_initial_values = set(s.strip("[]") for s in specie_name_and_initial_values)
print(all_keys.intersection(specie_name_and_initial_values))
Which outputs:
set([u'Cyp26_G_R1', u'Cyp26_G_rep1'])
FYI, if you had lists inside the set you would have gotten an error as lists are mutable so are not hashable.

Combining two lists and sorting through reference to dictionary Python

I have (what seems to me is) a pretty convoluted problem. I'm going to try to be as succinct as possible - though in order to understand the issue fully, you might have to click on my profile and look at the (only other) two questions I've posted on StackOverflow. In short: I have two lists -- one is comprised of email strings that contain a facility name, and a date of incident. The other is comprised of the facility ids for each email (I use one of the following regex functions to get this list). I've used Regex to be able to search each string for these pieces of information. The 3 Regex functions are:
def find_facility_name(incident):
pattern = re.compile(r'Subject:.*?for\s(.+?)\n')
findPat1 = re.search(pattern, incident)
facility_name = findPat1.group(1)
return facility_name
def find_date_of_incident(incident):
pattern = re.compile(r'Date of Incident:\s(.+?)\n')
findPat2 = re.search(pattern, incident)
incident_date = findPat2.group(1)
return incident_date
def find_facility_id(incident):
pattern = re.compile('(\d{3})\n')
findPat3 = re.search(pattern, incident)
f_id = findPat3.group(1)
return f_id
I also have a dictionary that is formatted like this:
d = {'001' : 'Facility #1', '002' : 'Another Facility'...etc.}
I'm trying to COMBINE the two lists and sort by the Key values in the dictionary, followed by the Date of Incident. Since the key values are attached to the facility name, this should automatically caused emails from the same facilities to be grouped together. In order to do that, I've tried to use these two functions:
def get_facility_ids(incident_list):
'''(lst) -> lst
Return a new list from incident_list that inserts the facility IDs from the
get_facilities dictionary into each incident.
'''
f_id = []
for incident in incident_list:
find_facility_name(incident)
for k in d:
if find_facility_name(incident) == d[k]:
f_id.append(k)
return f_id
id_list = get_facility_ids(incident_list)
def combine_lists(L1, L2):
combo_list = []
for i in range(len(L1)):
combo_list.append(L1[i] + L2[i])
return combo_list
combination = combine_lists(id_list, incident_list)
def get_sort_key(incident):
'''(str) -> tup
Return a tuple from incident containing the facility id as the first
value and the date of the incident as the second value.
'''
return (find_facility_id(incident), find_date_of_incident(incident))
final_list = sorted(combination, key=get_sort_key)
Here is an example of what my input might be and the desired output:
d = {'001' : 'Facility #1', '002' : 'Another Facility'...etc.}
input: first_list = ['email_1', 'email_2', etc.]
first output: next_list = ['facility_id_for_1+email_1', 'facility_id_for_2 + email_2', etc.]
DESIRED OUTPUT: FINAL_LIST = sorted(next_list, key=facility_id, date of incident)
The only problem is, the key values are not matching properly with what's found in each individual email string. Some DO, others are completely random. I have no idea why this is happening, but I have a feeling it has something to do with the way I'm combining the two lists. Can anyone help this lowly n00b? Thanks!!!
First off, I would suggest reversing your ID-to-name dictionary. Looking up a value by key is very fast but finding a key by value is very slow.
rd = { name: id_num for id_num, name in d.items() }
Then your first function can be replaced by a list comprehension:
id_list = [rd[find_facility_name(incident)] for incident in incident_list]
This might also expose why you're getting messed up values in your results. If an incident has a facility name that's not in your dictionary, this code will raise a KeyError (whereas your old function would just skip it).
Your combine function is very similar to Python's built in zip function. I'd replace it with:
combination = [id+incident for id, incident in zip(id_list, incident_list)]
However, since you're building the first list from the second one, it might make sense to build the combined version directly, rather than making separate lists and then combining them in a separate step. Here's an update to the list comprehension above that goes right to the combination result:
combination = [rd[find_facility_name(incident)] + incident
for incident in incident_list]
To do the sort, you can use the ID string that we just prepended to the email message, rather than parsing to find the ID again:
combination.sort(key=lambda x: (x[0:3], get_date_of_incident(x)))
The 3 in the slice is based off of your example of "001" and "002" as the id values. If the actual ids are longer or shorter you'll need to adjust that.
So, here is what I think is going on. Please correct me if possible.
The 'incident_list' is a list of email strings. You go in and find the facility names in each email because you have an external dictionary that has the (key:value)=(facility id: facility name). From the dictionary, you can extract the facility id in this 'id_list'.
You combine the lists so that you get a list of strings [facility id + email,...]
Then you want it to sort by a tuple( facility id, date of incidence ).
It looks like you are searching for the facility id and the facility name twice. You can skip a step if they are the same. Then, the best way is to do it all at once with tuples:
incident_list = ['email1', 'email2',...]
unsorted_list = []
for email in incident list:
id = find_facility_id(email)
date = find_date_of_incident(email)
mytuple = ( id, date, id + email )
unsorted_list.append(mytuple)
final_list = sorted(unsorted_list, key=lambda mytup:(mytup[0], mytup[1]))
Then you get an easy list of tuples sorted by first element (id as a string), and then second element (date as a string). If you just need a list of strings ( id + email ), then you need a list with the last element of each tuple part
FINALLIST = [ tup[-1] for tup in final_list ]

Categories

Resources