Python list comprehension with if else conditions


I have a small (<100) list of chemical names called detected_chems.
And a second much larger (>1000) iterable; a dictionary chem_db containing chemical names as the key, and a dictionary of chemical properties as the value. Like this:
{'chemicalx':{'property1':'smells','property2':'poisonous'},
'chemicaly':{'property1':'stinks','property2':'toxic'}}
I am trying to match all the detected chemicals with those in the database and pull their properties.
I have studied these questions and answers but can't seem to apply them to my case:
Is it possible to use 'else' in a list comprehension?
if/else in a list comprehension?
Python Nested List Comprehension with If Else
So I am making a list of results res, but instead of nested for loops with an if x in condition, I've created this.
res = [{chem:chem_db[chem]}
for det_chem in detected_chems
for chem in chem_db.keys()
if det_chem in chem]
This works to an extent!
What I (think I) am doing here is creating a list of dictionaries, each holding one key:value pair of a chemical name (key) and the information about that chemical (itself a dictionary, as the value), whenever the detected chemical is found somewhere in the chemical database (chem_db).
The problem is not all the detected chemicals are found in the database. This is probably because of misspelling or name variation (e.g. they include numbers) or something similar.
So to solve the problem I need to identify which detected chemicals are not being matched. I thought this might be a solution:
not_matched=[]
res = [{chem:chem_db[chem]}
for det_chem in detected_chems
for chem in chem_db.keys()
if det_chem in chem else not_matched.append(det_chem)]
I am getting a syntax error, due to the else not_matched.append(det_chem) part.
I have two questions:
1) Where should I put the else condition to avoid the syntax error?
2) Can the not_matched list be built within the list comprehension, so I don't create that empty list first.
res = [{chem:chem_db[chem]}
for det_chem in detected_chems
for chem in chem_db.keys()
if det_chem in chem else print(det_chem)]
What I'd like to achieve is something like:
in: len(detected_chems)
out: 20
in: len(res)
out: 18
in: len(not_matched)
out: 2
in: print(not_matched)
out: ['chemical_strange_character$$','chemical___WeirdSPELLING']
That will help me troubleshoot the matching.

You should use
if det_chem in chem or not_matched.append(det_chem)
This works because append returns None: when det_chem in chem is false, the whole condition still evaluates as falsy, but the item has been appended to not_matched as a side effect. Note that inside the nested loop this appends det_chem once for every database key it fails to match, not once per unmatched chemical. That said, if you clean up a bit as per the comments, I think there is a much more efficient way of doing what you want.
Re: efficiency:
res = [{det_chem:chem_db[det_chem]}
for det_chem in detected_chems
if det_chem in chem_db or not_matched.append(det_chem)]
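This should be drastically faster: looping over the dictionary keys is an O(n) operation for each detected chemical, whereas dictionaries are used precisely because lookup is O(1) on average. So instead of retrieving the keys and comparing them one by one, we use the hash-based lookup det_chem in chem_db. Note that this tests for an exact key match, not the substring match (det_chem in chem) used in the question.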
Bonus: dict comprehension (to address question 2)
I am not sure why a list of one-key dicts is built; what is probably needed is a dict comprehension, as in:
chem_db = {1: 2, 4: 5}
detected_chems = [1, 3]
not_matched = []
res = {det_chem: chem_db[det_chem] for det_chem in detected_chems if
det_chem in chem_db or not_matched.append(det_chem)}
# output
print(res) # {1: 2}
print(not_matched) # [3]
I can't think of a way to build the not_matched list inside a single list/dict comprehension that also builds res without creating the empty list first.

A list comprehension formally consists of up to 3 parts. Let's show them in an example:
[2 * i for i in range(10) if i % 3 == 0]
The first part is an expression, which may be (or may contain) the ternary operator (x if y else z).
The second part is an iterable (or several, in nested for clauses) from which the variable takes its values.
The third part (optional) is a filter that selects among the values from the second part, and an else clause is not allowed here!
So if you want to use an else branch, you have to put it into the first part, for example
[2 * i if i < 5 else 3 * i for i in range(10) if i % 3 == 0]
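A small sketch of the difference between the two positions, reusing the expressions above (the variable names and the compile() check are only illustrative):

```python
# Filter only: an `if` after the for clause drops values; no else is allowed.
filtered = [2 * i for i in range(10) if i % 3 == 0]

# A ternary in the expression part picks one of two results per kept value.
mapped = [2 * i if i < 5 else 3 * i for i in range(10) if i % 3 == 0]

# An else in the filter position does not even compile.
try:
    compile("[i for i in range(10) if i % 3 == 0 else 0]", "<sketch>", "eval")
    else_in_filter_compiles = True
except SyntaxError:
    else_in_filter_compiles = False

print(filtered)                 # [0, 6, 12, 18]
print(mapped)                   # [0, 6, 18, 27]
print(else_in_filter_compiles)  # False
```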

Your syntax error comes from the fact that the filter part of a comprehension does not accept an else clause.
You could use the ... if ... else ... ternary operator to determine the value to put in your comprehension result. Something like below:
not_matched=[]
res = [{chem:chem_db[chem]} if det_chem in chem else not_matched.append(det_chem)
for det_chem in detected_chems
for chem in chem_db.keys()]
But it would be a bad idea, since you would then have None in your res for each non-match. This is because the ... if ... else ... operator always returns a value, and in your case the value would be the return value of the list.append method (= None).
You could then filter the res list to remove None values but... meh...
A better solution would be to simply keep your first comprehension and get the difference between the original list of detected chemicals and the matched ones:
not_matched = set(detected_chems).difference(<the already matched chems>)
Note that I used a <the already matched chems> placeholder instead of real code because the way you store your res is not practical at all. Indeed, it is a list of single-key dictionaries, which makes little sense: the role of a dictionary is to hold multiple values identified by keys.
A solution to this would be to make res a dictionary instead of a list, using a dict comprehension:
res = {chem: chem_db[chem]
for det_chem in detected_chems
for chem in chem_db.keys()
if det_chem in chem}
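Doing this, the <the already matched chems> placeholder could be replaced by res.keys(), since the chemical names are the keys of res (its values are the property dictionaries). Keep in mind that these keys are the database names, so the difference is only meaningful when a detected name matches a database name exactly.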
As an addition, even if comprehensions are a really cool feature in a lot of cases, they are not a miraculous feature which should be used everywhere.
And nested comprehensions are a real pain to read and should be avoided (in my opinion at least).
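For comparison, here is a minimal sketch of the plain-loop version, which keeps the substring matching from the question and collects the unmatched names once each. The sample data is made up for illustration; only the names chem_db, detected_chems, res and not_matched come from the question.

```python
# Made-up sample data shaped like the question's chem_db and detected_chems.
chem_db = {
    'chemicalx': {'property1': 'smells', 'property2': 'poisonous'},
    'chemicaly': {'property1': 'stinks', 'property2': 'toxic'},
}
detected_chems = ['chemicalx', 'micaly', 'chemical___WeirdSPELLING']

res = {}
not_matched = []
for det_chem in detected_chems:
    # Substring match against every database key, as in the question.
    hits = {chem: props for chem, props in chem_db.items() if det_chem in chem}
    if hits:
        res.update(hits)
    else:
        # Appended once per unmatched chemical, not once per database key.
        not_matched.append(det_chem)

print(sorted(res))   # ['chemicalx', 'chemicaly']
print(not_matched)   # ['chemical___WeirdSPELLING']
```

The loop is longer than the comprehension, but the else branch has an obvious place to live and no side effects are hidden inside a filter condition.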

The sample code below gives the output you want. It uses a dictionary comprehension instead of a list comprehension, so the matched items are collected in a single dictionary, which makes it easier to look up their properties later. Also, you don't need chem_db.keys(), because the in operator already searches the whole container (be it a list or a dictionary); for a dictionary, it tests the target against the keys.
Code:
detected_chems=['chemical_strange_character$$','chemical___WeirdSPELLING','chem3','chem4','chem5','chem6','chem7','chem8','chem9','chem10','chem11','chem12','chem13','chem14','chem15','chem16','chem17','chem18','chem19','chem20']
chem_db = {'chem1':{'property1':'smells','property2':'poisonous'},'chem2':{'property1':'stinks','property2':'toxic'},'chem3':{'property1':'smells','property2':'poisonous'},'chem4':{'property1':'smells','property2':'poisonous'},'chem5':{'property1':'smells','property2':'poisonous'},'chem6':{'property1':'smells','property2':'poisonous'},'chem7':{'property1':'smells','property2':'poisonous'},'chem8':{'property1':'smells','property2':'poisonous'},'chem9':{'property1':'smells','property2':'poisonous'},'chem10':{'property1':'smells','property2':'poisonous'},'chem11':{'property1':'smells','property2':'poisonous'},'chem12':{'property1':'smells','property2':'poisonous'},'chem13':{'property1':'smells','property2':'poisonous'},'chem14':{'property1':'smells','property2':'poisonous'},'chem15':{'property1':'smells','property2':'poisonous'},'chem16':{'property1':'smells','property2':'poisonous'},'chem17':{'property1':'smells','property2':'poisonous'},'chem18':{'property1':'smells','property2':'poisonous'},'chem19':{'property1':'smells','property2':'poisonous'},'chem20':{'property1':'smells','property2':'poisonous'}}
not_matched = []
res = {det_chem:chem_db[det_chem]
for det_chem in detected_chems
if det_chem in chem_db or not_matched.append(det_chem)}
print(len(detected_chems))
print(len(res))
print(len(not_matched))
print(not_matched)
The output:
20
18
2
['chemical_strange_character$$', 'chemical___WeirdSPELLING']
For further information, see: dictionary in python

Related

Refer to a list in Python from within the [ ] operator

While trying to optimize a one-liner that extracts values from a list of dicts (all strings that come after keyword/ til the next /) I got stuck in the following:
imagine this list of dicts:
my_list = [{'name':'val_1', 'url':'foo/bar/keyword/bla_1/etc'}.......{'name':'val_z', 'url':'foo/bar/keyword/bla_n/etc'}]
The numbers in val_ and bla_ are not related.
I am trying to extract all the bla_x (they could be anything, no pattern possible) words using list comprehension for 0 < x < n+1. I managed to get the index of those elements in the split string
[d['url'].split('/').index('keyword') + 1 for d in my_list]
But what I need is to access that value and not only its index, so I thought about something like this:
[d['url'].split('/')[the_resulting_split_list.index('keyword') + 1] for d in my_list]
As far as my knowledge goes, that seems impossible to do. Is there an alternative way to reach my goal of getting an output of ['bla_1', 'bla_2', ....., 'bla_n'] without having to run the split('/') operation twice in the list comprehension?
Let's not care about exceptions for now and assume input data is always correct.
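One possible way to do this, sketched on made-up data shaped like my_list, is to bind the split result to a name with an extra for clause over a one-element list. The name parts is an assumption introduced here for illustration.

```python
# Two sample entries shaped like the question's list of dicts.
my_list = [
    {'name': 'val_1', 'url': 'foo/bar/keyword/bla_1/etc'},
    {'name': 'val_2', 'url': 'foo/bar/keyword/bla_2/etc'},
]

# The inner `for parts in [...]` iterates over a one-element list, which
# binds the split result to `parts` so it can be used twice per entry.
result = [parts[parts.index('keyword') + 1]
          for d in my_list
          for parts in [d['url'].split('/')]]

print(result)  # ['bla_1', 'bla_2']
```

As in the question, this assumes every URL contains keyword followed by at least one more segment.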

How to add up values of the "sublists" within a list of lists

I have a list of lists in my script:
list = [[1,2],
[4,3],
[6,2],
[1,6],
[9,2],
[6,5]]
I am looking for a solution to sum up the contents of each "sublist" within the list of lists. The desired output would be:
new_list = [3,7,8,7,11,11]
I know about combining ALL of these lists into one which would be:
new_list = [27,20]
But that's not what I'm looking to accomplish.
I need to combine the two values within these "sublists" and have them remain as their own entry in the main list.
I would also greatly appreciate it if it was also explained how you solved the problem rather than just handing me the solution. I'm trying to learn python so even a minor explanation would be greatly appreciated.
Using Python 3.7.4
Thanks, Riftie.

The "manual" solution is to use a for loop.
new_list = []
for sub_list in list:
    new_list.append(sum(sub_list))
or as a list comprehension:
new_list = [sum(sub_list) for sub_list in list]
The for loop iterates through the elements of list. In your case, list is a list of lists, so every element is itself a list. That means that while iterating, sub_list is a simple list. To get the sum of a list I used the sum() built-in function. You can of course iterate manually and sum every element:
new_list = []
for sub_list in list:
    sum_val = 0
    for element in sub_list:
        sum_val = sum_val + element
    new_list.append(sum_val)
but there is no need for that.
For larger data, a better approach is to use numpy, which treats a list of lists as an array and lets you sum along an axis. Since you are learning basic Python, it may be too soon to learn numpy; just keep in mind that there is a package for handling multi-dimensional arrays that can perform operations such as sum along an axis of your choice.
Edit: I've seen the other solution suggested. Both will work, but I believe this one is more accessible for someone who is learning to program for the first time. Using a list comprehension is great and correct, but may be a bit confusing at first. Also, as suggested, calling your variable list is a bad idea because it shadows the built-in list type. Better names would be "my_list", "tmp_list" or something else.

Use a list comprehension. Also avoid using built-in names as variable names; in your case you overrode the builtin list.
# a, b -> sequence unpacking
summed = [a + b for a, b in lst] # where lst is your list of lists
# if the inner lists contain variable number of elements, a more
# concise solution would be
summed2 = [sum(seq) for seq in lst]
Read more about the powerful list comprehension here.
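To make the two outputs from the question concrete, here is a short sketch on the question's data. The names row_sums and column_sums are introduced here for illustration.

```python
lst = [[1, 2], [4, 3], [6, 2], [1, 6], [9, 2], [6, 5]]

# One total per sublist: this is the desired output.
row_sums = [sum(seq) for seq in lst]

# One total per position: zip(*lst) regroups the first items together and
# the second items together, giving the "combine ALL lists" result.
column_sums = [sum(col) for col in zip(*lst)]

print(row_sums)     # [3, 7, 8, 7, 11, 11]
print(column_sums)  # [27, 20]
```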

Pythonic Way to Compare Values in Two Lists of Dictionaries

I'm new to Python and am still trying to tear myself away from C++ coding techniques while in Python, so please forgive me if this is a trivial question. I can't seem to find the most Pythonic way of doing this.
I have two lists of dicts. The individual dicts in both lists may contain nested dicts. (It's actually some Yelp data, if you're curious.) The first list of dicts contains entries like this:
{'business_id': 'JwUE5GmEO-sH1FuwJgKBlQ',
'categories': ['Restaurants'],
'type': 'business'
...}
The second list of dicts contains entries like this:
{'business_id': 'vcNAWiLM4dR7D2nwwJ7nCA',
'date': '2010-03-22',
'review_id': 'RF6UnRTtG7tWMcrO2GEoAg',
'stars': 2,
'text': "This is a basic review",
...}
What I would like to do is extract all the entries in the second list that match specific categories in the first list. For example, if I'm interested in restaurants, I only want the entries in the second list where the business_id matches the business_id in the first list and the word Restaurants appears in the list of values for categories.
If I had these two lists as tables in SQL, I'd do a join on the business_id attribute then just a simple filter to get the rows I want (where Restaurants IN categories, or something similar).
These two lists are extremely large, so I'm running into both efficiency and memory space issues. Before I go and shove all of this into a SQL database, can anyone give me some pointers? I've messed around with Pandas some, so I do have some limited experience with that. I was having trouble with the merge process.

Suppose your lists are called l1 and l2:
All elements from l1:
[each for each in l1]
All elements from l1 with the Restaurant category:
[each for each in l1
if 'Restaurants' in each['categories']]
All elements from l2 matching id with elements from l1 with the Restaurant category:
[x for each in l1 for x in l2
if 'Restaurants' in each['categories']
and x['business_id'] == each['business_id'] ]

Let's define sample lists of dictionaries:
first = [
{'business_id':100, 'categories':['Restaurants']},
{'business_id':101, 'categories':['Printer']},
{'business_id':102, 'categories':['Restaurants']},
]
second = [
{'business_id':100, 'stars':5},
{'business_id':101, 'stars':4},
{'business_id':102, 'stars':3},
]
We can extract the items of interest in two steps. The first step is to collect the list of business ids that belong to restaurants:
ids = [d['business_id'] for d in first if 'Restaurants' in d['categories']]
The second step is to get the dicts that correspond to those ids:
[d for d in second if d['business_id'] in ids]
This results in:
[{'business_id': 100, 'stars': 5}, {'business_id': 102, 'stars': 3}]

Python programmers like using list comprehensions as a way to do both their logic and their design.
List comprehensions lead to terser and more compact expression. You're right to think of it quite a lot like a query language.
x = [comparison(a, b) for (a, b) in zip(A, B)]
x = [comparison(a, b) for (a, b) in itertools.product(A, B)]
x = [comparison(a, b) for a in A for b in B if test(a, b)]
x = [comparison(a, b) for X in Y for (a, b) in X if test(a, b, X)]
...are all patterns that I use.

This is pretty tricky, and I had fun with it. This is what I'd do:
def match_fields(business, review):
    return business['business_id'] == review['business_id'] and 'Restaurants' in business['categories']
def search_businesses(review):
    # true if any business in business_list matches the given review
    return any(match_fields(business, review) for business in business_list)
answer = filter(search_businesses, review_list)
This is the most readable way I found. I'm not terribly fond of list comprehensions that go past one line, and three lines is really pushing it. If you want this to look more terse, just use shorter variable names. I favor long ones for clarity.
I defined a function that returns true if an entry can be matched between lists, and a second function that helps me search through the review list. I then can say: get rid of any review that doesn't have a matching entry in the businesses list. This pattern works well with arbitrary checks between lists.

As a variation to the list comprehension only approaches, it may be more efficient to use a set and generator comprehension. This is especially true if the size of your first list is very large or if the total number of restaurants is very large.
restaurant_ids = set(biz['business_id'] for biz in first if 'Restaurants' in biz['categories'])
restaurant_data = [rest for rest in second if rest['business_id'] in restaurant_ids]
Note the brute force list comprehension approach is O(len(first)*len(second)), but it uses no additional memory storage whereas this approach is O(len(first)+len(second)) and uses O(number_of_restaurants) extra memory for the set.

You could do:
restaurant_ids = [biz['business_id'] for biz in list1 if 'Restaurants' in biz['categories']]
restaurant_data = [rest for rest in list2 if rest['business_id'] in restaurant_ids]
Then restaurant_data would contain all of the dictionaries from list2 that contain restaurant data.

finding first item in a list whose first item in a tuple is matched

I have a list of several thousand unordered tuples that are of the format
(mainValue, (value, value, value, value))
Given a main value (which may or may not be present), is there a 'nice' way, other than iterating through every item looking and incrementing a value, where I can produce a list of indexes of tuples that match like this:
destMatches = []
index = 0
for destEntry in destList:
    if destEntry[0] == sourceMatch:
        destMatches.append(index)
    index = index + 1
So I can compare the sub values against another set, and remove the best match from the list if necessary.
This works fine, but just seems like python would have a better way!
Edit:
As per the question, when writing the original question, I realised that I could use a dictionary instead of the first value (in fact this list is within another dictionary), but after removing the question, I still wanted to know how to do it as a tuple.

With list comprehension your for loop can be reduced to this expression:
destMatches = [i for i,destEntry in enumerate(destList) if destEntry[0] == sourceMatch]
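You can also use the filter() built-in function [1] to filter your data; note that this returns the matching tuples themselves rather than their indexes:
destMatches = filter(lambda destEntry:destEntry[0] == sourceMatch, destList)
[1]: In Python 3, filter is a class and returns a lazy filter object rather than a list.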
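A short sketch of both approaches under Python 3, on made-up data in the question's (mainValue, (value, value, value, value)) format. The name matching_entries is introduced here for illustration.

```python
# Made-up tuples in the (mainValue, (value, value, value, value)) format.
destList = [
    ('a', (1, 2, 3, 4)),
    ('b', (5, 6, 7, 8)),
    ('a', (9, 10, 11, 12)),
]
sourceMatch = 'a'

# enumerate() pairs each entry with its position, so this yields indexes.
destMatches = [i for i, destEntry in enumerate(destList)
               if destEntry[0] == sourceMatch]

# filter() yields the matching entries; list() materializes the lazy object.
matching_entries = list(filter(lambda destEntry: destEntry[0] == sourceMatch,
                               destList))

print(destMatches)       # [0, 2]
print(matching_entries)  # [('a', (1, 2, 3, 4)), ('a', (9, 10, 11, 12))]
```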

Python: remove lots of items from a list

I am in the final stretch of a project I have been working on. Everything is running smoothly but I have a bottleneck that I am having trouble working around.
I have a list of tuples. The list ranges in length from say 40,000 - 1,000,000 records. Now I have a dictionary where each and every (value, key) is a tuple in the list.
So, I might have
myList = [(20000, 11), (16000, 4), (14000, 9)...]
myDict = {11:20000, 9:14000, ...}
I want to remove each (v, k) tuple from the list.
Currently I am doing:
for k, v in myDict.iteritems():
    myList.remove((v, k))
Removing 838 tuples from the list containing 20,000 tuples takes anywhere from 3 - 4 seconds. I will most likely be removing more like 10,000 tuples from a list of 1,000,000 so I need this to be faster.
Is there a better way to do this?
I can provide code used to test, plus pickled data from the actual application if needed.

You'll have to measure, but I can imagine this to be more performant:
myList = filter(lambda x: myDict.get(x[1], None) != x[0], myList)
because the lookup happens in the dict, which is better suited for this kind of thing. Note, though, that this will create a new list before removing the old one, so there's a memory tradeoff. If that's an issue, rethinking your container type as jkp suggests might be in order.
Edit: Be careful, though, if None is actually in your list -- you'd have to use a different "placeholder."

To remove about 10,000 tuples from a list of about 1,000,000, if the values are hashable, the fastest approach should be:
totoss = set((v,k) for (k,v) in myDict.iteritems())
myList[:] = [x for x in myList if x not in totoss]
The preparation of the set is a small one-time cost, which saves doing tuple unpacking and repacking, or tuple indexing, a lot of times. Assigning to myList[:] instead of assigning to myList is also semantically important (in case there are any other references to myList around, it's not enough to rebind just the name; you really want to replace the contents!).
I don't have your test data around to do the time measurement myself, alas, but let me know how it plays out on your test data!
If the values are not hashable (e.g. they're sub-lists), the fastest is probably:
sentinel = object()
myList[:] = [x for x in myList if myDict.get(x[1], sentinel) != x[0]]
or maybe (shouldn't make a big difference either way, but I suspect the previous one is better, since indexing is cheaper than unpacking and repacking):
sentinel = object()
myList[:] = [(v,k) for (v,k) in myList if myDict.get(k, sentinel) != v]
In these two variants the tuples are (value, key), so the dict is looked up by the second element. The sentinel idiom is used to guard against values of None (which is not a problem for the preferred set-based approach, if the values are hashable!), as it's going to be way cheaper than if k not in myDict or myDict[k] != v (which requires two indexings into myDict).
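As a rough illustration of the set-based approach under Python 3 (where iteritems() is spelled items()), here is a sketch on the sample data from the question. The alias variable is an assumption added to show why the slice assignment matters.

```python
myList = [(20000, 11), (16000, 4), (14000, 9)]
myDict = {11: 20000, 9: 14000}
alias = myList  # a second reference to the same list object

# Build the (value, key) tuples to discard once, as a set for O(1) lookups.
totoss = {(v, k) for k, v in myDict.items()}

# Slice assignment replaces the contents in place, so `alias` sees it too.
myList[:] = [x for x in myList if x not in totoss]

print(myList)           # [(16000, 4)]
print(alias is myList)  # True
```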

Every time you call myList.remove, Python has to scan over the entire list to search for that item and remove it. In the worst case scenario, every item you look for would be at the end of the list each time.
Have you tried doing the "inverse" operation of:
newMyList = [(v,k) for (v,k) in myList if k not in myDict]
But I'm really not sure how well that would scale, either, since you would be making a copy of the original list -- could potentially be a lot of memory usage there.
Probably the best alternative here is to wait for Alex Martelli to post some mind-blowingly intuitive, simple, and efficient approach.

[(i, j) for i, j in myList if myDict.get(j) != i]

Try something like this:
myListSet = set(myList)
myDictSet = set(zip(myDict.values(), myDict.keys()))
myList = list(myListSet - myDictSet)
This will convert myList to a set, will swap the keys/values in myDict and put them into a set, and will then find the difference, turn it back into a list, and assign it back to myList. :)

The problem looks to me to be that you are using a list as the container you are removing from, and a list has no index for finding items by value. So finding each item in the list is a linear operation (O(n)): it has to iterate over the list until it finds a match.
If you could swap the list for some other container (set?) which uses a hash() of each item to locate it, then each match could be performed much quicker.
The following code shows how you could do this using a combination of ideas offered by myself and Nick on this thread:
list_set = set(original_list)
dict_set = set(zip(original_dict.values(), original_dict.keys()))
difference_set = list_set - dict_set  # keep this a set so lookups stay O(1)
final_list = []
for item in original_list:
    if item in difference_set:
        final_list.append(item)

[i for i in myList if i not in list(zip(myDict.values(), myDict.keys()))]

A list containing a million 2-tuples is not large on most machines running Python. However if you absolutely must do the removal in situ, here is a clean way of doing it properly:
def filter_by_dict(my_list, my_dict):
    sentinel = object()
    for i in xrange(len(my_list) - 1, -1, -1):
        key = my_list[i][1]
        if my_dict.get(key, sentinel) is not sentinel:
            del my_list[i]
Update Actually each del costs O(n) shuffling the list pointers down using C's memmove(), so if there are d dels, it's O(n*d) not O(n**2). Note that (1) the OP suggests that d approx == 0.01 * n and (2) the O(n*d) effort is copying one pointer to somewhere else in memory ... so this method could in fact be somewhat faster than a quick glance would indicate. Benchmarks, anyone?
What are you going to do with the list after you have removed the items that are in the dict? Is it possible to piggy-back the dict-filtering onto the next step?
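For readers on Python 3, the same in-place idea might look like the sketch below, with range in place of xrange and a plain membership test in place of the sentinel. Like the function above, it removes a tuple whenever its key is in the dict, without comparing values. The sample data is taken from the question.

```python
def filter_by_dict(my_list, my_dict):
    """Delete, in place, every (value, key) tuple whose key is in my_dict."""
    # Walk backwards so a deletion never shifts an index still to be visited.
    for i in range(len(my_list) - 1, -1, -1):
        key = my_list[i][1]
        if key in my_dict:
            del my_list[i]


my_list = [(20000, 11), (16000, 4), (14000, 9)]
my_dict = {11: 20000, 9: 14000}
filter_by_dict(my_list, my_dict)
print(my_list)  # [(16000, 4)]
```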
