Get related dictionaries from lists - python

I have two lists of dictionaries (ListA and ListB).
All dictionaries in ListA have the fields "id" and "external_id".
All dictionaries in ListB have the fields "num" and "external_num".
I need to get all pairs of dictionaries where the value of external_id equals num and the value of external_num equals id.
I can achieve that using this code:
for dictA in ListA:
    for dictB in ListB:
        if dictA["id"] == dictB["external_num"] and dictA["external_id"] == dictB["num"]:
            pass  # dictA and dictB are a matching pair
But I have seen many beautiful Python expressions, and I guess it is possible to get that result in a more Pythonic style, isn't it?
Something like:
res = [A, B for A, B in listA, listB if A['id'] == B['extnum'] and A['ext'] == B['num']]

You are pretty close, but you aren't telling Python how you want to connect the two lists to get the pairs of dictionaries A and B.
If you want to compare all dictionaries in ListA to all in ListB, you need itertools.product:
from itertools import product
res = [(A, B) for A, B in product(ListA, ListB) if ...]
Alternatively, if you want pairs at the same indices, use zip:
res = [(A, B) for A, B in zip(ListA, ListB) if ...]
If you don't need to build the whole list at once, note that (in Python 2) you can use itertools.ifilter to pick the pairs you want:
from itertools import ifilter, product
for A, B in ifilter(lambda (A, B): ...,
                    product(ListA, ListB)):
    # do whatever you want with A and B
(if you do this with zip, use itertools.izip instead to maximise performance).
Notes on Python 3.x:
zip and filter no longer return lists, so itertools.izip and itertools.ifilter no longer exist (just as range has replaced xrange), and you only need product from itertools; and
lambda (A, B): is no longer valid syntax; you will need to write the filtering function to take a single tuple argument, lambda t:, and e.g. replace A with t[0].
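As a rough illustration of the Python 3 notes, here is a minimal sketch that combines the built-in filter with product. The sample data and the helper name is_match are made up for the example; only the field names come from the question.

```python
from itertools import product

# Hypothetical sample data using the field names from the question.
list_a = [{"id": 1, "external_id": 10}, {"id": 2, "external_id": 20}]
list_b = [{"num": 10, "external_num": 1}, {"num": 20, "external_num": 99}]


def is_match(pair):
    """Take a single (A, B) tuple, as Python 3 lambdas/functions must."""
    a, b = pair
    return a["id"] == b["external_num"] and a["external_id"] == b["num"]


# filter is lazy in Python 3; list() forces it only so the result can be inspected.
matches = list(filter(is_match, product(list_a, list_b)))

for a, b in matches:
    print(a, b)
```

Dropping the list() call keeps the whole pipeline lazy, which is the behaviour ifilter gave in Python 2.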

Firstly, for code clarity, I actually would probably go with your first option - I don't think using for loops is particularly un-Pythonic, in this case. However, if you want to try using a list comprehension, there are a few things to be aware of:
Each item returned by the list comprehension needs to be just a singular item. Trying to return A, B is going to give you a SyntaxError. However, you can return either a list or a tuple (or anything else, that is a single object), so something like res = [(A,B) for...] would start working.
Another concern is how you're iterating over these lists. From your first snippet of code, it appears you don't make any assumptions about these lists lining up, meaning you seem to be OK if the 2nd item in listA matches the 14th item in listB, so long as they match on the appropriate fields. That's perfectly reasonable, but just be aware that it means you will need two for loops no matter how you try to do it*. And you still need your comparisons. So, as a list comprehension, you might try:
res = [(A, B) for A in listA for B in listB if A['id']==B['extnum'] and A['extid']==B['num']]
Then, in res, you'll have 0 or more tuples, and each tuple will contain the respective dictionaries you're interested in. To use them:
for tup in res:
    A = tup[0]
    B = tup[1]
    #....
or more concisely (and Pythonically):
for A, B in res:
    #...
since Python is smart enough to know that it's yielding an item (the tuple) that has 2 elements, and so it can directly assign them to A and B.
EDIT:* In retrospect, it isn't completely true that you need two for loops, and if your lists are big enough, it may be helpful, performance-wise, to make an intermediate dictionary such as this:
# make a dictionary with key=tuple, value=dictionary
interim = {(A['id'], A['extid']): A for A in listA}
for B in listB:
    tup = (B['extnum'], B['num'])  ## order matters! match-up with A
    if tup in interim:
        A = interim[tup]
        print(A, B)
and, if the id-extid pair is not expected to be unique across all items in listA, then you'd want to look into collections.defaultdict with a list... but I'm not sure this still fits in the 'more Pythonic' category.
I realize this is likely overkill for the question you asked, but I couldn't let my 'two for loops' statement stand, since it's not entirely true.
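For the non-unique case mentioned above, a sketch with collections.defaultdict might look like the following. The sample data (including the extra "name" field) is hypothetical and only there to make the duplicates visible.

```python
from collections import defaultdict

# Hypothetical sample data; the first two entries share the same (id, external_id) pair.
list_a = [
    {"id": 1, "external_id": 10, "name": "a1"},
    {"id": 1, "external_id": 10, "name": "a2"},
    {"id": 2, "external_id": 20, "name": "a3"},
]
list_b = [{"num": 10, "external_num": 1}, {"num": 30, "external_num": 3}]

# Group every dictionary from list_a under its (id, external_id) key.
by_key = defaultdict(list)
for a in list_a:
    by_key[(a["id"], a["external_id"])].append(a)

# One pass over list_b; each lookup is a dictionary access, not a scan of list_a.
pairs = []
for b in list_b:
    key = (b["external_num"], b["num"])  # same field order as the keys above
    for a in by_key.get(key, []):  # .get avoids inserting empty lists for misses
        pairs.append((a, b))

print([(a["name"], b["num"]) for a, b in pairs])
```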

Related

How do you convert a list of strings to separate strings in Python 3?

Suppose you have a list of strings such as:
l = ['ACGAAAG', 'CAGAAGC', 'ACCTGTT']
How do you convert it to:
O = 'ACGAAAG'
P = 'CAGAAGC'
Q = 'ACCTGTT'
Can you do this without knowing the number of items in a list? You have to store them as variables.
(The variables don't matter.)
Welcome to SE!
Structure Known
If you know the structure of the list, then you might simply unpack it:
O, P, Q = l
Structure Unknown
Unpack your list using a for loop. Do your work on each string inside the loop. In the example below, I am simply printing each one:
for element in l:
    print(element)
Good luck!
If you don't know the number of items beforehand, a list is the right structure to keep the items in.
You can, though, cut off the first few known items, and leave the unknown tail as a list:
a, b, *rest = ["ay", "bee", "see", "what", "remains"]
print("%r, %r, rest is %r" % (a, b, rest))
a, b, c = my_list
This works as long as the number of elements in the list equals the number of variables you unpack into. It actually works with any iterable: tuple, list, set, etc.
If the list is longer, you can always access the first 3 elements, if that is what you want:
a = my_list[0]
b = my_list[1]
c = my_list[2]
or in one line:
a, b, c = my_list[0], my_list[1], my_list[2]
Even better, with slice notation you can get a sublist with just the first 3 elements:
a, b, c = my_list[:3]
Those work as long as the list has at least 3 elements (or as many as the number of variables you want).
You can also use the extended unpacking notation:
a, b, c, *the_rest = my_list
the_rest will be a list with everything in the list other than the first 3 elements; again, the list needs to have 3 or more elements.
And that pretty much covers all the ways to extract a certain number of items.
Now, depending on what you are going to do with those, you may be better off with a regular loop:
for item in my_list:
    # do something with the current item, like printing it
    print(item)
In each iteration, item takes the value of one element in the list, so you can do what you need one item at a time.
If what you want is to take 3 items at a time in each iteration, there are several ways to do it, for example:
for i in range(3, len(my_list) + 1, 3):
    a, b, c = my_list[i-3:i]
    print(a, b, c)
There are more fun constructs like
it = [iter(my_list)] * 3
for a, b, c in zip(*it):
    print(a, b, c)
and others in the itertools module.
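One detail worth seeing in action: the zip-based grouping silently drops a trailing group that has fewer than 3 items. The sketch below, on made-up data, contrasts it with itertools.zip_longest, which pads the last group instead.

```python
from itertools import zip_longest

my_list = ["a", "b", "c", "d", "e", "f", "g"]  # 7 items: not a multiple of 3

# The same iterator object appears three times, so zip pulls 3 consecutive items per tuple.
# zip stops as soon as one pull fails, so the trailing "g" is dropped.
it = [iter(my_list)] * 3
complete_groups = list(zip(*it))

# zip_longest pads the incomplete last group with fillvalue instead of dropping it.
it = [iter(my_list)] * 3
padded_groups = list(zip_longest(*it, fillvalue=None))

print(complete_groups)
print(padded_groups)
```

Which behaviour is preferable depends on whether a partial group at the end is meaningful for your data.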
But you said something interesting: "so that every term is assigned to a variable". That is the wrong approach. You don't want an unknown number of variables running around; that gets messy very fast. Work with the list instead. If you want to do some work with each element, there are plenty of ways of doing it, like a list comprehension:
my_new_list = [some_fun(x) for x in my_list]
or in the old way:
my_new_list = []
for x in my_list:
    my_new_list.append(some_fun(x))
or, if you need to work with more than 1 item at a time, combine that with some of the above.
I do not know if your use case requires the strings to be stored in different variables. It usually is a bad idea.
But if you do need it, then you can use the exec builtin, which takes the string representation of a Python statement and executes it.
list_of_strings = ['ACGAAAG', 'CAGAAGC', 'ACCTGTT']
# Dynamically generate variable names equivalent to the column names in an Excel sheet (A, B, C, ..., Z, AA, AB, ..., AAA, ...).
variable_names = ['A', 'B', 'C']  # in this specific case
for vn, st in zip(variable_names, list_of_strings):
    exec('{} = "{}"'.format(vn, st))
Test it out: print(A, B, C) will output the three strings, and you can use A, B and C as variables in the rest of the program.
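If named access is the real goal, a dictionary keyed by the generated names may serve the same purpose without exec. The sketch below is one possible way to do it; the generator name column_names is made up for this example, and it produces the Excel-style names described above.

```python
from itertools import count, islice, product
from string import ascii_uppercase


def column_names():
    """Yield 'A', 'B', ..., 'Z', 'AA', 'AB', ... indefinitely (Excel-style)."""
    for width in count(1):
        for letters in product(ascii_uppercase, repeat=width):
            yield "".join(letters)


list_of_strings = ["ACGAAAG", "CAGAAGC", "ACCTGTT"]

# zip stops at the shorter input, so the infinite generator is safe here.
named = dict(zip(column_names(), list_of_strings))

print(named["B"])
print(list(islice(column_names(), 24, 28)))  # names around the Z -> AA rollover
```

The strings stay in one container, so the rest of the program can loop over named.items() regardless of how many there are.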

Python: replace values of sublist, with values looked up from another sublist without indexing

Description
I have two lists of lists which are derived from CSVs (minimal working example below). The real dataset is too large to do this manually.
mainlist = [["MH75","QF12",0,38], ["JQ59","QR21",105,191], ["JQ61","SQ48",186,284], ["SQ84","QF36",0,123], ["GA55","VA63",80,245], ["MH98","CX12",171,263]]
replacelist = [["MH75","QF12","BA89","QR29"], ["QR21","JQ59","VA51","MH52"], ["GA55","VA63","MH19","CX84"], ["SQ84","QF36","SQ08","JQ65"], ["SQ48","JQ61","QF87","QF63"], ["MH98","CX12","GA34","GA60"]]
mainlist contains a pair of identifiers (mainlist[x][0], mainlist[x][1]), which are associated with two integers (mainlist[x][2] and mainlist[x][3]).
replacelist is a second list of lists which also contains the same pairs of identifiers (but not in the same order within a pair, or across rows). All sublist pairs are unique. Importantly, replacelist[x][2],replacelist[x][3] corresponds to a replacement for replacelist[x][0],replacelist[x][1], respectively.
I need to create a new third list, newlist, which copies mainlist but replaces the identifiers with those from replacelist[x][2],replacelist[x][3].
For example, given:
mainlist[2] is: [JQ61,SQ48,186,284]
The matching pair in replacelist is
replacelist[4]: [SQ48,JQ61,QF87,QF63]
Therefore the expected output is
newlist[2] = [QF87,QF63,186,284]
More clearly put:
if replacelist = [[A, B, C, D]]
A is replaced with C, and B is replaced with D.
but it may appear in mainlist as [[B, A]]
Note newlist row position uses the same as mainlist
Attempt
What has me totally stumped on a simple problem is I feel I can't use basic list comprehension [i for i in replacelist if i in mainlist] as the order within a pair changes, and if I sorted(list) then I lose information about what to replace the lists with. Current solution (with commented blanks):
newlist = []
for k in replacelist:
    for i in mainlist:
        if k[0] in i and k[1] in i:
            # retrieve mainlist order, then use some kind of indexing to check a series of nested if statements to work out positional replacement.
As you can see, this solution is clearly inefficient and I can't work out the best way to perform the final step in a few lines.
I can add more information if this is not clear
It'll help if you had replacelist as a dict:
mainlist = [["MH75","QF12",0,38], ["JQ59","QR21",105,191], ["JQ61","SQ48",186,284], ["SQ84","QF36",0,123], ["GA55","VA63",80,245], ["MH98","CX12",171,263]]
replacelist = [["MH75","QF12","BA89","QR29"], ["QR21","JQ59","VA51","MH52"], ["GA55","VA63","MH19","CX84"], ["SQ84","QF36","SQ08","JQ65"], ["SQ48","JQ61","QF87","QF63"], ["MH98","CX12","GA34","GA60"]]
# key: the unordered pair of identifiers; value: a mapping from each identifier to its replacement
replacements = {frozenset(r[:2]): dict(zip(r[:2], r[2:])) for r in replacelist}
newlist = []
for *ids, val1, val2 in mainlist:
    reps = replacements[frozenset(ids)]
    newlist.append([reps[ids[0]], reps[ids[1]], val1, val2])
First thing you do: transform both lists into dictionaries:
from collections import OrderedDict
maindct = OrderedDict((frozenset(item[:2]), item[2:]) for item in mainlist)
replacedct = {frozenset(item[:2]): item[2:] for item in replacelist}
# Now it is trivial to create the desired output list:
output_list = [replacedct[key] + maindct[key] for key in maindct]
The big deal here is that by using a dictionary, you eliminate the search time for the indices on the replacement list. In a list, you have to scan the whole list for each item you have, which makes the running time grow with the square of your list length. With Python dictionaries, the lookup time is constant on average and does not depend on the data length.
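To make the dictionary idea concrete, here is a small end-to-end sketch on the question's sample data. It assumes the replacement identifiers are taken in the order they appear in replacelist, which is what the question's expected newlist[2] shows; the name lookup is made up for the example.

```python
mainlist = [["MH75", "QF12", 0, 38], ["JQ59", "QR21", 105, 191],
            ["JQ61", "SQ48", 186, 284], ["SQ84", "QF36", 0, 123],
            ["GA55", "VA63", 80, 245], ["MH98", "CX12", 171, 263]]
replacelist = [["MH75", "QF12", "BA89", "QR29"], ["QR21", "JQ59", "VA51", "MH52"],
               ["GA55", "VA63", "MH19", "CX84"], ["SQ84", "QF36", "SQ08", "JQ65"],
               ["SQ48", "JQ61", "QF87", "QF63"], ["MH98", "CX12", "GA34", "GA60"]]

# frozenset keys ignore the order within a pair, so ("JQ61", "SQ48") finds ("SQ48", "JQ61").
lookup = {frozenset(row[:2]): row[2:] for row in replacelist}

# One dictionary lookup per row of mainlist; row order follows mainlist.
newlist = [lookup[frozenset(row[:2])] + row[2:] for row in mainlist]

print(newlist[2])
```

A pair in mainlist with no counterpart in replacelist would raise KeyError here; whether to skip or report such rows is a choice the question does not specify.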

Matching a list of dictionaries A to list C with list B having common properties of A and C in Python?

I have three lists of dictionaries, A, B and C. They look like:
A = [{propA1: valueA1}, {propA1: valueA2}, ...]
B = [{propB1: valueB1, propB2: valueB2}, {propB1: valueB3, propB2: value4}, ...]
C = [{propC1: valueC1}, {propC1: valueC2}, ...]
propA1 and propB1 are the same property under different names, and so are propB2 and propC1.
However, propA1 and propB1 do not always have the same values, and I am only interested in the "set intersection" of [valueA1, valueA2, ...] and [valueB1, valueB2, ...]. Here is the goal: I want to return every propB2 from B whose propB1 counterpart (in the same dictionary) matches a propA1 in A. Then I will use that set of propB2 values to match against propC1 in C.
What I have tried:
propB2_match = set()
for elementB in B:
    for elementA in A:
        if elementB['propB1'] == elementA['propA1']:
            propB2_match.add(elementB['propB2'])
            break
At the end of this loop, I have propB2_match containing all of propB2 that I can use to match with propC1.
However, as you can see, this is an expensive O(n^2) loop. I am wondering if there is a way to handle this in O(n)? If not, is there any Pythonic optimization that can be done on it?
Note: I do not want to put it in a database and use relational database SQL to handle the join operation.
If I understand correctly, you are essentially trying to do a JOIN on A and B where A['propA1'] == B['propB1'].
Here's one way using defaultdict that's O(len(A)+len(B)):
from collections import defaultdict
A = [{'pA1': 'vA1'}, {'pA1': 'vA2'}]
B = [{'pB1': 'vA1', 'pB2': 'vB2'}, {'pB1': 'vB3', 'pB2': 'v4'}]
# Key by the value you want to group on
kA = [(x['pA1'],x) for x in A]
kB = [(x['pB1'],x) for x in B]
# Combine the lists
kAB = kA+kB
# Map each unique key to a list of elements that have that key
results = defaultdict(list)
for x in kAB:
    results[x[0]].append(x[1])
for x in results:
    print(results[x])
Outputs (the order of the groups may vary with the Python version):
[{'pA1': 'vA2'}]
[{'pB1': 'vB3', 'pB2': 'v4'}]
[{'pA1': 'vA1'}, {'pB1': 'vA1', 'pB2': 'vB2'}]
At this point you could merge each list of dicts into a single dict or whatever you need, and use the result to JOIN with the third list C.
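If only the matching propB2 values are needed, as in the question, a set-based sketch along these lines should also run in roughly O(len(A) + len(B) + len(C)). The sample values are hypothetical; the property names follow the question.

```python
# Hypothetical sample data using the property names from the question.
A = [{"propA1": "x"}, {"propA1": "y"}]
B = [
    {"propB1": "x", "propB2": "p"},
    {"propB1": "z", "propB2": "q"},
    {"propB1": "y", "propB2": "r"},
]
C = [{"propC1": "p"}, {"propC1": "q"}, {"propC1": "s"}]

# Build the set of propA1 values once; membership tests on a set are O(1) on average.
a_values = {a["propA1"] for a in A}

# Keep propB2 only where the propB1 in the same dictionary also occurs in A.
propB2_match = {b["propB2"] for b in B if b["propB1"] in a_values}

# Use the resulting set to pick the matching entries from C.
matched_C = [c for c in C if c["propC1"] in propB2_match]

print(propB2_match, matched_C)
```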

Python list index splitting and manipulation

My question seems simple, but for a novice to python like myself this is starting to get too complex for me to get, so here's the situation:
I need to take a list such as:
L = [(a, b, c), (d, e, d), (etc, etc, etc), (etc, etc, etc)]
and make each index an individual list so that I may pull elements from each index specifically. The problem is that the list I am actually working with contains hundreds of indices such as the ones above and I cannot make something like:
L_new = list(L['insert specific index here'])
for each one, as that would mean filling up the memory with hundreds of lists corresponding to individual indices of the first list, and would be far too time- and memory-consuming from my point of view. So my question is this: how can I separate those indices and then pull individual parts from them without needing to create hundreds of individual lists (at least to the point where I won't need hundreds of individual lines to create them)?
I might be misreading your question, but I'm inclined to say that you don't actually have to do anything to be able to index your tuples. See my comment, but: L[0][0] will give "a", L[0][1] will give "b", L[2][1] will give "etc" etc...
If you really want a clean way to turn this into a list of lists you could use a list comprehension:
cast = [list(entry) for entry in L]
In response to your comment: if you want to access across dimensions I would suggest list comprehension. For your comment specifically:
crosscut = [entry[0] for entry in L]
In response to comment 2: This is largely a part of a really useful operation called slicing. Specifically to do the referenced operation you would do this:
multiple_index = [entry[0:3] for entry in L]
Depending on your readability preferences there are actually a number of possibilities here:
list_of_lists = []
for sublist in L:
    list_of_lists.append(list(sublist))
or, with an explicit iterator:
iterator = iter(L)
for _ in range(len(L)):
    sublist = list(next(iterator))
    # use sublist here, or yield it from a generator function if you want lazy evaluation
What you have there is a list of tuples; access them like a list of lists:
L[3][2]
will get the third element from the 4th tuple in your list L (indices start at 0).
Two ways of using the inner lists:
for index, sublist in enumerate(L):
    # do something with sublist
    pass
or with an iterator:
iterator = iter(L)
sublist = next(iterator)  # <-- yields the first sublist
In both cases, sublist elements can be reached via
direct index:
sublist[2]
iteration:
iterator = iter(sublist)
next(iterator)  # <-- yields first elem of sublist
for elem in sublist:
    # do something with my elem
    pass
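Putting the access patterns from these answers side by side, a short sketch on made-up data (the tuples below merely stand in for the question's list) could look like this:

```python
# Hypothetical sample data standing in for the question's list of tuples.
L = [("a", "b", "c"), ("d", "e", "f"), ("g", "h", "i")]

# Direct indexing works on the tuples as they are; no conversion to lists is needed.
first_of_first = L[0][0]

# A "cross-cut": the element at position 1 of every tuple.
second_column = [entry[1] for entry in L]

# zip(*L) transposes the data, giving one tuple per position.
columns = list(zip(*L))

# Explicit iteration with the built-in next() (Python 3 spelling of iterator.next()).
iterator = iter(L)
first_tuple = next(iterator)

print(first_of_first, second_column, columns[2], first_tuple)
```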

How to separate one list in two via list comprehension or otherwise

I have a list of dictionary items like so:
L = [{"a":1, "b":0}, {"a":3, "b":1}...]
I would like to split these entries based upon the value of "b", either 0 or 1.
A(b=0) = [{"a":1, "b":1}, ....]
B(b=1) = [{"a":3, "b":2}, .....]
I am comfortable with using simple list comprehensions, and I am currently looping through the list L two times.
A = [d for d in L if d["b"] == 0]
B = [d for d in L if d["b"] != 0]
Clearly this is not the most efficient way.
An else clause does not seem to be available within the list comprehension functionality.
Can I do what I want via list comprehension?
Is there a better way to do this?
I am looking for a good balance between readability and efficiency, leaning towards readability.
Thanks!
update:
Thanks, everyone, for the comments and ideas! The easiest one for me to read is the one by Thomas, but I will look at Alex's suggestion as well. I had not found any reference to the collections module before.
Don't use a list comprehension. List comprehensions are for when you want a single list result. You obviously don't :) Use a regular for loop:
A = []
B = []
for item in L:
    if item['b'] == 0:
        target = A
    else:
        target = B
    target.append(item)
You can shorten the snippet by doing, say, (A, B)[item['b'] != 0].append(item), but why bother?
If the b value can be only 0 or 1, @Thomas's simple solution is probably best. For a more general case (in which you want to discriminate among several possible values of b -- your sample "expected results" appear to be completely divorced from and contradictory to your question's text, so it's far from obvious whether you actually need some generality;-):
from collections import defaultdict
separated = defaultdict(list)
for x in L:
    separated[x['b']].append(x)
When this code executes, separated ends up with a dict (actually an instance of collections.defaultdict, a dict subclass) whose keys are all values for b that actually occur in dicts in list L, the corresponding values being the separated sublists. So, for example, if b takes only the values 0 and 1, separated[0] would be what (in your question's text as opposed to the example) you want as list A, and separated[1] what you want as list B.
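A quick run on made-up data shows the shape of the result, plus one thing to watch for: indexing a defaultdict with a key that never occurred creates an empty entry, whereas .get does not.

```python
from collections import defaultdict

# Hypothetical sample data in the shape described by the question's text.
L = [{"a": 1, "b": 0}, {"a": 3, "b": 1}, {"a": 5, "b": 0}]

separated = defaultdict(list)
for d in L:
    separated[d["b"]].append(d)  # single pass over L

A = separated[0]  # all dicts with b == 0
B = separated[1]  # all dicts with b == 1

# .get returns the default without adding the missing key to the dictionary.
missing = separated.get(2, [])

print(A, B, missing, sorted(separated))
```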
