I have three lists: (1) treatments, (2) medicine names and (3) medicine code symbols. I am trying to identify the respective medicine code symbol for each of 14,700 treatments. My current approach is to check whether any name in (2) is "in" (1), and then return the corresponding (3). However, I get back an arbitrary list (of the correct length) of medicine code symbols corresponding to the 14,700 treatments. Code for the method I've written is below:
codes = pandas.read_csv('Codes.csv', dtype=str)
codes_list = codes.values.tolist()
names = pandas.read_csv('Names.csv', dtype=str)
names_list = names.values.tolist()
treatments = pandas.read_csv('Treatments.csv', dtype=str)
treatments_list = treatments.values.tolist()

matched_codes_list = range(len(treatments_list))
for i in range(len(treatments_list)):
    for j in range(len(names_list)):
        if names_list[j] in treatments_list[i]:
            matched_codes_list[i] = codes_list[j]
print matched_codes_list
Any suggestions for where I am going wrong would be much appreciated!
I can't tell what you are expecting. You should replace the xxx_list code with concrete examples, since the csv reading doesn't seem to be where your problem is.
Let's suppose you did that, and your result looks like this.
codes_list = ['shark', 'panda', 'horse']
names_list = ['fin', 'paw', 'hoof']
assert len(codes_list) == len(names_list)
treatments_list = ['tape up fin', 'reverse paw', 'stand on one hoof', 'pawn affinity maneuver', 'alert wing patrol']
It sounds like you are trying to determine the 'code' for each 'treatment', assuming that the numbers of codes and names are the same (and that their positions indicate a mapping). You plan to use the presence of the name to determine the code.
We can zip the names and codes lists together to avoid using indexes there, and we can iterate over the treatments list instead of using indexes, for Pythonic readability:
matched_codes_list = []
for treatment in treatments_list:
    matched_codes = []
    for name, code in zip(names_list, codes_list):
        if name in treatment:
            matched_codes.append(code)
    matched_codes_list.append(matched_codes)
This would give something like:
assert matched_codes_list == [
['shark'], # 'tape up fin'
['panda'], # 'reverse paw'
['horse'], # 'stand on one hoof'
['shark', 'panda', 'horse'], # 'pawn affinity maneuver'
[], # 'alert wing patrol'
]
Note that the method used to do this is quite slow (and will probably give false positives; see the 4th entry). You will traverse the text of every treatment description once for each name/code pair.
You can use a dictionary like 'lookup = {name: code for name, code in zip(names_list, codes_list)}', or itertools.izip, for minor gains. Otherwise something more clever might be needed, perhaps splitting each treatment into a set of words, or mapping words to multiple codes.
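For instance, here is a rough sketch of the word-splitting idea. It assumes each medicine name is a single word that appears verbatim in the treatment text, which may not hold for your data:

lookup = {name: code for name, code in zip(names_list, codes_list)}

matched_codes_list = []
for treatment in treatments_list:
    words = set(treatment.split())   # one pass over each treatment's text
    matched_codes_list.append([lookup[w] for w in words if w in lookup])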
I am attempting to match against a very long list of Python dictionaries. I want to append dicts from this list to a new list based on the values of their keys. An example of what I have is:
A list of 1000+ dictionaries structured like this:
{'regions': ['north','south'],
'age':35,
'name':'john',
'cars':['ford','kia']}
I want to search through this list using almost all keys and append the matching dicts to a new list. Sometimes I might search with only age, whereas other times I will search with regions & name, all the way to searching with all keys (age, name, regions, & cars), because all search parameters are optional.
I currently use for loops to sort through it, but as I add more and more optional parameters, it gets slower and more complex. Is there an easier way to accomplish what I am doing?
An example of what the user would send is:
regions: north
age: 10
And it would return a list of all dictionaries with north as a region and 10 as age
I think this one is pretty open ended, especially because you suggest that you want this to be extensible as you add more keys etc., and you haven't really discussed your operational requirements. But here are a few thoughts:
Third party modules
Are these dictionaries going to get any more nested? Or is it going to always be 'key -> value' or 'key -> [list, of, values]'?
If you can accept a chunky dependency, you might consider something like pandas, which we normally think of as representing tables, but which can certainly manage nesting to some degree.
For example:
from functools import partial
from pandas import DataFrame, concat
from typing import Dict

def matcher(comparator, target=None) -> bool:
    """
    matcher

    Checks whether a value matches or contains a target
    (won't match if the target is a substring of the value)
    """
    if target == comparator:  # simple case, return True immediately
        return True
    if isinstance(comparator, str):
        return False  # doesn't match exactly and is a string => no match
    try:  # handle looking in collections
        return target in comparator
    except TypeError:  # if it fails, you know there's no match
        return False
def search(data: DataFrame, query: Dict) -> DataFrame:
    """
    search

    Pass in a DataFrame and a query in the form of a dictionary
    of keys and values to match, for example:

    {"age": 42, "regions": "north", "cars": "ford"}

    Returns a matching subset of the data
    """
    # each element of the resulting list is a boolean series
    # corresponding to a dictionary key
    masks = [
        data[key].map(partial(matcher, target=value)) for key, value in query.items()
    ]
    # collapse the masks down to a single boolean series indicating
    # whether ALL conditions are met for each record
    mask = concat(masks, axis="columns").all(axis="columns")
    return data.loc[mask]
if __name__ == "__main__":
    data = DataFrame(your_big_list)
    query = {"age": 35, "regions": "north", "cars": "ford"}
    results = search(data, query)
    list_results = results.to_dict(orient="records")
Here list_results would restore the filtered data to the original format, if that's important to you.
I found that the matcher function had to be surprisingly complicated; I kept thinking of edge-cases (like: we need to support searching in a collection, but in can also find substrings, which isn't what we want ... unless it is, of course!).
But at least all that logic is walled off in there. You could write a series of unit tests for it, and if you extend your schema in future you can then alter the function accordingly and check the tests still pass.
The search function then is purely for nudging pandas into doing what you want with the matcher.
match case
In Python 3.10 the new match/case statement might allow you to encapsulate the matching logic very cleanly.
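For instance, a rough sketch of the matcher above rewritten with match/case (a sketch only: sequence patterns deliberately don't match strings, but note they also don't match sets, unlike the try/except version):

def matcher(comparator, target=None) -> bool:
    match comparator:
        case _ if comparator == target:  # simple case: exact match of any type
            return True
        case str():                      # a non-matching string: no substring search
            return False
        case [*items]:                   # a list/tuple: membership test
            return target in items
        case _:                          # anything else can't contain the target
            return False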
Performance
A pretty fundamental issue here is that if you care about performance (and I got the sense that this was secondary to maintainability for you), then:
the bigger the data get, the slower things will be
Python is already not fast, generally speaking
You could possibly improve things by building some sort of index for your data. Ultimately, however, it's always going to be more reliable to use a specialist tool. That's going to be some sort of database.
The precise details will depend on your requirements. For example: are these data going to become horribly unstructured? Are any fields going to be text that will need to be properly indexed in something like Elasticsearch/Solr?
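To give a flavour of the indexing idea mentioned above, here is a rough pure-Python sketch (your_big_list is the same placeholder as in the pandas example; this assumes values are scalars or flat lists):

from collections import defaultdict

# build an inverted index: (key, value) -> set of record positions
index = defaultdict(set)
for i, record in enumerate(your_big_list):
    for key, value in record.items():
        values = value if isinstance(value, list) else [value]
        for v in values:
            index[(key, v)].add(i)

# a query is then a set intersection rather than a full scan
query = {"age": 10, "regions": "north"}
ids = set.intersection(*(index[(k, v)] for k, v in query.items()))
results = [your_big_list[i] for i in ids]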
A really light-touch solution that you could implement in the short term with Python would be to
chuck that data into SQLite
rely on SQL for the searching
I am suggesting SQLite since it runs out of the box and lives in just a single local file:
from sqlalchemy import create_engine
engine = create_engine("sqlite:///mydb.sql")
# ... that's it, we can now connect to a SQLite DB/the 'mydb.sql' file that will be created
... but the drawback is that it won't support array-like data. Your options are:
use PostgreSQL instead and take the hit of running a DB with more firepower
normalise those data
I don't think option 2 would be too difficult. Something like:
REGIONS
id | name
----------
1 | north
2 | south
3 | east
4 | west
CUSTOMERS
id | age | name
---------------
...
REGION_LINKS
customer_id | region_id
-----------------------
1 | 1
1 | 2
I've called the main data table 'customers' but you haven't mentioned what these data really represent, so that's more by way of example.
Then your SQL queries could get built and executed using sqlalchemy's ORM capabilities.
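As a rough sketch of what that could look like (class and table names are illustrative, mirroring the example schema above; not a drop-in implementation):

from sqlalchemy import Column, ForeignKey, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, relationship, sessionmaker

Base = declarative_base()

class Region(Base):
    __tablename__ = "regions"
    id = Column(Integer, primary_key=True)
    name = Column(String)

class Customer(Base):
    __tablename__ = "customers"
    id = Column(Integer, primary_key=True)
    age = Column(Integer)
    name = Column(String)
    # many-to-many through the link table
    regions = relationship("Region", secondary="region_links")

class RegionLink(Base):
    __tablename__ = "region_links"
    customer_id = Column(Integer, ForeignKey("customers.id"), primary_key=True)
    region_id = Column(Integer, ForeignKey("regions.id"), primary_key=True)

engine = create_engine("sqlite:///mydb.sql")
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)

# e.g. the 'regions: north, age: 10' query from the question
with Session() as session:
    matches = (
        session.query(Customer)
        .join(Customer.regions)
        .filter(Customer.age == 10, Region.name == "north")
        .all()
    )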
I made some code to do this:
test = [
{
'regions': ['south'],
'age':35,
'name':'john',
'cars':['ford']
},
{
'regions': ['north'],
'age':15,
'name':'michael',
'cars':['kia']
},
{
'regions': ['north','south'],
'age':20,
'name':'terry',
'cars':['ford','kia']
},
{
'regions': ['East','south'],
'age':35,
'name':'user',
'cars':['other','kia']
},
{
'regions': ['East','south'],
'age':75,
'name':'john',
'cars':['other']
}
]
def Finder(inputs: list, regions: list = None, age: int = None, name: str = None, cars: list = None) -> list:
    output = []
    for entry in inputs:
        valid = True
        if regions is not None and valid: valid = all(i in entry["regions"] for i in regions)
        if age is not None and valid: valid = entry["age"] == age
        if name is not None and valid: valid = entry["name"] == name
        if cars is not None and valid: valid = all(i in entry["cars"] for i in cars)
        if valid:
            output.append(entry)
    return output
print(Finder(test))
print(Finder(test, age = 25))
print(Finder(test, regions = ["East"]))
print(Finder(test, cars = ["ford","kia"]))
print(Finder(test, name = "john", regions = ["south"]))
This function just checks each input against all the given parameters, and puts every valid input into an output list.
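For the test data above, those five calls would return, respectively: all five entries; an empty list (no entry has age 25); the two entries whose regions include 'East' (user and the 75-year-old john); only terry (the only entry with both 'ford' and 'kia' in cars); and both entries named john (both include 'south' in their regions).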
I am working on an online course exercise (practice problem before the final test).
The test involves working with a big csv file (not downloadable) and answering questions about the dataset. You're expected to write code to get the answers.
The data set is a list of all documented baby names for each year, along with how often each name was used for boys and for girls.
A sample list of the first 10 lines is also given:
Isabella,42567,Girl
Sophia,42261,Girl
Jacob,42164,Boy
and so on.
Questions you're asked include things like 'how many names in the data set', 'how many boys' names beginning with z' etc.
I can get all the data into a list of lists:
[['Isabella', '42567', 'Girl'], ['Sophia', '42261', 'Girl'], ['Jacob', '42164', 'Boy']]
My plan was to convert into a dictionary, as that would probably be easier for answering some of the other questions. The list of lists is saved to the variable 'data':
names = {}
for d in data:
    names[d[0]] = d[1:]
print(names)
{'Isabella': ['42567', 'Girl'], 'Sophia': ['42261', 'Girl'], 'Jacob': ['42164', 'Boy']}
Works perfectly.
Here's where it gets weird. If, instead of opening the sample file with 10 lines, I open the real csv file with around 16,000 lines, everything works perfectly right up to the very last bit.
I get the complete list of lists, but when I go to create the dictionary, it breaks (here I'm just showing the first three items, but the full 16,000 lines are all wrong in a similar way):
names = {}
for d in data:
    names[d[0]] = d[1:]
print(names)

{'Isabella': ['56', 'Boy'], 'Sophia': ['48', 'Boy'], 'Jacob': ['49', 'Girl']}
I know the data is there and correct, since I can read it directly:
for d in data:
    print(d[0], d[1], d[2])
Isabella 42567 Girl
Sophia 42261 Girl
Jacob 42164 Boy
Why would this dictionary work fine with the csv file with 10 lines, but completely break with the full file? I can't find any explanation for it.
The full file contains each name more than once (a name can appear both as a Boy row and a Girl row), and since dictionary keys are unique, later rows silently overwrite earlier ones. Follow the comments to create two dicts, or a single dictionary with tuple keys. Using tuples as keys is fine if you keep your variables inside Python, but you might get into trouble when exporting to JSON, for example.
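A tiny demonstration of the overwriting, with a made-up name that appears twice:

data = [['Jordan', '42567', 'Girl'], ['Jordan', '56', 'Boy']]
names = {}
for d in data:
    names[d[0]] = d[1:]   # the second 'Jordan' row replaces the first
print(names)              # {'Jordan': ['56', 'Boy']}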
Try a dictionary comprehension with list unpacking
names = {(name, sex): freq for name, freq, sex in data}
Or a for loop as you started
names = dict()
for name, freq, sex in data:
    names[(name, sex)] = freq
I'd go with something like
results = {}
for d in data:
    name, amount, gender = d
    results[name] = results.get(name, {})
    results[name].update({ gender: amount })
this way you'll get results in something like:
{
'Isabella': {'Girl': '42567', 'Boy': '67'},
'Sophia': {'Girl': '42261'},
'Jacob': {'Boy': '42164'}
}
However, duplicated values will override previous ones, so you need to take that into account if there are any; this also assumes that the whole file matches the format you've provided.
I have been searching for hours and hours on this problem and tried everything possible, but I can't get it cracked; I am quite a dictionary noob.
I work with Maya and got clashing names of lights. This happens when you duplicate a group: all children are named the same as before, so having an ALL_KEY in one group results in a clashing name with a key_char in another group.
I need to identify a clashing short name and return the long name, so I can print that the long name is a duplicate, or even run a cmds.select on it.
Unfortunately, everything I find on this matter on the internet is about whether a list contains duplicate values or not and only returns True or False, which is useless for me. So I tried list cleaning and list comparison, but I get stuck with a dictionary that maintains long and short names at the same time.
I managed to fetch short names if they are duplicates and return them, but on the way the long name got lost, so of course I can't identify it clearly anymore.
import itertools
import fnmatch
import maya.cmds as mc

LIGHT_TYPES = ["spotLight", "areaLight", "directionalLight", "pointLight", "aiAreaLight", "aiPhotometricLight", "aiSkyDomeLight"]

#create dict
dblList = {'long' : 'short'}
for x in mc.ls(type=LIGHT_TYPES, transforms=True):
    y = x.split('|')[-1:][0]
    dblList['long','short'] = dblList.setdefault(x, y)
#reverse values with keys for easier detection
rev_multidict = {}
for key, value in dblList.items():
    rev_multidict.setdefault(value, set()).add(key)

#detect the doubles in the dict
#print [values for key, values in rev_multidict.items() if len(values) > 1]
flattenList = set(itertools.chain.from_iterable(values for key, values in rev_multidict.items() if len(values) > 1))
#so by now I got all the long names which clash in the scene already!
#means now I just need to make a for loop strip away the pipes and ask if the object is already in the list, then return the path with the pipe, and ask if the object is in lightlist and return the longname if so.
#but after many many hours I can't get this part working.
##as example until now print flattenList returns
>set([u'ALL_blockers|ALL_KEY', u'ABCD_0140|scSet', u'SARAH_TOPShape', u'ABCD_0140|scChars', u'ALL|ALL_KEY', u'|scChars', u'|scSet', u'|scFX', ('long', 'short'), u'ABCD_0140|scFX'])
#we see ALL_KEY is double! and that's exactly what I need returned as long name
#This is the part that I can't get working: check which values in the list are doubled in the long name, and return the short-name list. (The whole dictionary is still complete at this point.)
seen = set()
uniq = []
for x in dblList2:
    if x[0].split('|')[-1:][0] not in seen:
        uniq.append(x.split('|')[-1:][0])
        seen.add(x.split('|')[-1:][0])
thanks for your help.
I'm going to take a stab at this. If this isn't what you want, let me know why.
If I have a scene with a hierarchy like this:
group1
    nurbsCircle1
group2
    nurbsCircle2
group3
    nurbsCircle1
I can run this (adjust ls() if you need it for selection or whatnot):
conflictObjs = {}
objs = cmds.ls(shortNames = True, transforms = True)
for obj in objs:
    if len( obj.split('|') ) > 1:
        conflictObjs[obj] = obj.split('|')[-1]
And the output of conflictObjs will be:
# Dictionary of objects with non-unique short names
# {<long name>:<short name>}
{u'group1|nurbsCircle1': u'nurbsCircle1', u'group3|nurbsCircle1': u'nurbsCircle1'}
Showing me what objects don't have unique short names.
This will give you a list of all the lights which have duplicate short names, grouped by what the duplicated name is and including the full path of the duplicated objects:
def clashes_by_type(*types):
    long_names = cmds.ls(type = types, l=True) or []
    # get the parents from the lights, not using ls -type transform
    long_names = set(cmds.listRelatives(*long_names, p=True, f=True) or [])
    short_names = set([i.rpartition("|")[-1] for i in long_names])
    short_dict = dict()
    for sn in short_names:
        short_dict[sn] = [i for i in long_names if i.endswith("|" + sn)]
    clashes = dict((k, v) for k, v in short_dict.items() if len(v) > 1)
    return clashes
clashes_by_type('directionalLight', 'ambientLight')

The main points to note:
work down from long names: short names are inherently unreliable!
when deriving the short names, include the last pipe so you don't get accidental overlaps of common names
the values in short_dict will always be lists, since they're created by a comprehension
once you have a dict of (name, [objects with that short name]) pairs, it's easy to get clashes by looking for values longer than 1
I have the following problem:
list1=['xyz','xyz2','other_randoms']
list2=['xyz']
I need to find which elements of list2 are in list1. In actual fact the elements of list1 correspond to a numerical value which I need to obtain and then change. The problem is that 'xyz2' contains 'xyz' and therefore also matches with a regular expression.
My code so far is below (where 'data' is a Python dictionary and 'specie_name_and_initial_values' is a list of lists, in which each sublist contains two elements: the species name and the numerical value that goes with it):
all_keys = list(data.keys())
for i in range(len(all_keys)):
    if all_keys[i] != 'Time':
        #print all_keys[i]
        pattern = re.compile(all_keys[i])
        for j in range(len(specie_name_and_initial_values)):
            print re.findall(pattern, specie_name_and_initial_values[j][0])
Variations of the regular expression I have tried include:
pattern = re.compile('^'+all_keys[i]+'$')
pattern = re.compile('^'+all_keys[i])
pattern = re.compile(all_keys[i]+'$')
And I've also tried using 'in' as a qualifier (i.e. within a for loop)
Any help would be greatly appreciated. Thanks
Ciaran
----------EDIT------------
To clarify. My current code is below. its used within a class/method like structure.
def calculate_relative_data_based_on_initial_values(self, copasi_file, xlsx_data_file, data_type='fold_change', time='seconds'):
    copasi_tool = MineParamEstTools()
    data = pandas.io.excel.read_excel(xlsx_data_file, header=0)
    #uses custom class and method to get the list of lists from a file
    specie_name_and_initial_values = copasi_tool.get_copasi_initial_values(copasi_file)
    if time == 'minutes':
        data['Time'] = data['Time'] * 60
    elif time == 'hour':
        data['Time'] = data['Time'] * 3600
    elif time == 'seconds':
        print 'Time is already in seconds.'
    else:
        print 'Not a valid time unit'
    all_keys = list(data.keys())
    species = []
    for i in range(len(specie_name_and_initial_values)):
        species.append(specie_name_and_initial_values[i][0])
    for i in range(len(all_keys)):
        for j in range(len(specie_name_and_initial_values)):
            if all_keys[i] in species[j]:
                print all_keys[i]
The table returned from pandas is accessed like a dictionary. I need to go to my data table, extract the headers (the all_keys bit), then look up each header name in the specie_name_and_initial_values variable and obtain the corresponding value (the second element of each sublist). After this, I multiply all values of my data table by the value obtained for each matched element.
I'm most likely overcomplicating this. Do you have a better solution?
thanks
----------edit 2 ---------------
Okay, below are my variables
all_keys = set([u'Cyp26_G_R1', u'Cyp26_G_rep1', u'Time'])
species = set(['[Cyp26_R1R2_RARa]', '[Cyp26_SRC3_1]', '[18-OH-RA]', '[p38_a]', '[Cyp26_G_rep1]', '[Cyp26]', '[Cyp26_G_a]', '[SRC3_p]', '[mRARa]', '[np38_a]', '[mRARa_a]', '[RARa_pp_TFIIH]', '[RARa]', '[Cyp26_G_L2]', '[atRA]', '[atRA_c]', '[SRC3]', '[RARa_Ser369p]', '[p38]', '[Cyp26_mRNA]', '[Cyp26_G_L]', '[TFIIH]', '[Cyp26_SRC3_2]', '[Cyp26_G_R1R2]', '[MSK1]', '[MSK1_a]', '[Cyp26_G]', '[Basal_Kinases]', '[Cyp26_R1_RARa]', '[4-OH-RA]', '[Cyp26_G_rep2]', '[Cyp26_Chromatin]', '[Cyp26_G_R1]', '[RXR]', '[SMRT]'])
You don't need a regex to find common elements, set.intersection will find all elements in list2 that are also in list1:
list1=['xyz','xyz2','other_randoms']
list2=['xyz']
print(set(list2).intersection(list1))
set(['xyz'])
Also, if you wanted to compare 'xyz' to 'xyz2' you would use ==, not in, and then it would correctly return False.
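For example:

>>> 'xyz' in 'xyz2'   # substring containment: not what you want here
True
>>> 'xyz' == 'xyz2'   # exact comparison
False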
You can also rewrite your own code a lot more succinctly:
for key in data:
    if key != 'Time':
        pattern = re.compile(key)
        for name, _ in specie_name_and_initial_values:
            print re.findall(pattern, name)
Based on your edit, you have somehow managed to turn lists into strings; one option is to strip the []:
all_keys = set([u'Cyp26_G_R1', u'Cyp26_G_rep1', u'Time'])
specie_name_and_initial_values = set(['[Cyp26_R1R2_RARa]', '[Cyp26_SRC3_1]', '[18-OH-RA]', '[p38_a]', '[Cyp26_G_rep1]', '[Cyp26]', '[Cyp26_G_a]', '[SRC3_p]', '[mRARa]', '[np38_a]', '[mRARa_a]', '[RARa_pp_TFIIH]', '[RARa]', '[Cyp26_G_L2]', '[atRA]', '[atRA_c]', '[SRC3]', '[RARa_Ser369p]', '[p38]', '[Cyp26_mRNA]', '[Cyp26_G_L]', '[TFIIH]', '[Cyp26_SRC3_2]', '[Cyp26_G_R1R2]', '[MSK1]', '[MSK1_a]', '[Cyp26_G]', '[Basal_Kinases]', '[Cyp26_R1_RARa]', '[4-OH-RA]', '[Cyp26_G_rep2]', '[Cyp26_Chromatin]', '[Cyp26_G_R1]', '[RXR]', '[SMRT]'])
specie_name_and_initial_values = set(s.strip("[]") for s in specie_name_and_initial_values)
print(all_keys.intersection(specie_name_and_initial_values))
Which outputs:
set([u'Cyp26_G_R1', u'Cyp26_G_rep1'])
FYI, if you had lists inside the set you would have gotten an error as lists are mutable so are not hashable.
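For example:

>>> set([['Cyp26_G_R1'], ['RXR']])
Traceback (most recent call last):
  ...
TypeError: unhashable type: 'list'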
I am pulling hotel names through the Expedia API and cross referencing results with another travel service provider.
The problem I am encountering is that many of the hotel names appear differently on the Expedia API than they do with the other provider and I cannot figure out a good way to match them.
I am storing the results of both in separate dicts with room rates. So, for example, the results from Expedia on a search for Vilnius in Lithuania might look like this:
expediadict = {'Ramada Hotel & Suites Vilnius': 120, 'Hotel Rinno': 100,
'Vilnius Comfort Hotel': 110}
But the results from the other provider might look like this:
altproviderdict = {'Ramada Vilnius': 120, 'Rinno Hotel': 100,
'Comfort Hotel LT': 110}
The only thing I can think of doing is stripping out all instances of 'Hotel', 'Vilnius', 'LT' and 'Lithuania' and then testing whether part of an expediadict key matches part of an altproviderdict key. This seems messy and not very Pythonic, so I wondered if any of you had any cleaner ideas?
>>> def simple_clean(word):
... return word.lower().replace(" ","").replace("hotel","")
...
>>> a = "Ramada Hotel & Suites Vilnius"
>>> b = "Hotel Ramada Suites Vilnous"
>>> a = simple_clean(a)
>>> b = simple_clean(b)
>>> a
'ramada&suitesvilnius'
>>> b
'ramadasuitesvilnous'
>>> import difflib
>>> difflib.SequenceMatcher(None,a,b).ratio()
0.9230769230769231
Do cleaning and normalization of the words: e.g. remove words like Hotel, The, Resort etc., and convert to lower case without spaces.
Then use a fuzzy string-matching algorithm like Levenshtein, e.g. from the difflib module.
This method is pretty raw and just an example; you can enhance it to suit your needs for optimal results.
If you only want to match names when the words appear in the same order, you might want to use a longest-common-subsequence algorithm like the one used in diff tools, but with words instead of characters or lines.
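difflib can already do this at the word level, since SequenceMatcher accepts any sequences of hashable items, not just strings (a sketch):

import difflib

a = "Ramada Hotel & Suites Vilnius".lower().split()
b = "Ramada Vilnius".lower().split()
# compare lists of words instead of strings of characters
print(difflib.SequenceMatcher(None, a, b).ratio())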
If order is not important, it's simpler: put all the words of the name into a set like this:
set(name.split())
and in order to match two names, test the size of the intersection of these two sets, or test whether the symmetric_difference only contains unimportant words.
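For instance, a small sketch of the symmetric-difference test; the UNIMPORTANT set here is purely illustrative and would need tuning to your data:

UNIMPORTANT = {"hotel", "the", "resort", "suites", "&", "lt", "lithuania", "vilnius"}

def names_match(a, b, unimportant=UNIMPORTANT):
    sa, sb = set(a.lower().split()), set(b.lower().split())
    # the names match if everything that differs between them is unimportant
    return (sa ^ sb) <= unimportant

print(names_match("Ramada Hotel & Suites Vilnius", "Ramada Vilnius"))  # True
print(names_match("Hotel Rinno", "Comfort Hotel LT"))                  # False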