Is this the fastest way to build a dict? - python

I am reading a list of elements from an XML file and turning the data into two dictionaries.
Is this the fastest way? (I don't think it's the best; you guys always surprise me. ;-)
ADict = {}
BDict = {}
for x in fields:
    key = x.get('key')
    ADict[key] = x.find('A').text
    BDict[key] = x.find('B').text
I think adding them one by one is a bad idea, but what about writing it in a single line, i.e. a more Pythonic way, like this:
ADict,BDict = [dict(k) for k in zip(*([(x.get('key'),x.find('A').text),(x.get('key'),x.find('B').text)] for x in fields))]
I don't think it's better, for two reasons: first, x.get('key') is called twice; second, it creates too many temporary tuples.

Not tested, but it should work:
ADict = dict((x.get('key'), x.find('A').text) for x in fields)
BDict = dict((x.get('key'), x.find('B').text) for x in fields)
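For what it's worth, on Python 2.7+ the same idea reads even more cleanly as dict comprehensions. Like the pair of generator expressions above, this makes two passes over fields, which is fine as long as fields is a list rather than a one-shot iterator:

ADict = {x.get('key'): x.find('A').text for x in fields}
BDict = {x.get('key'): x.find('B').text for x in fields}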


accelerate comparing dictionary keys and values to strings in list in python

Sorry if this is trivial, I'm still learning, but I have a list of dictionaries that looks as follows:
[{'1102': ['00576', '00577', '00578', '00579', '00580', '00581']},
{'1102': ['00582', '00583', '00584', '00585', '00586', '00587']},
{'1102': ['00588', '00589', '00590', '00591', '00592', '00593']},
{'1102': ['00594', '00595', '00596', '00597', '00598', '00599']},
{'1102': ['00600', '00601', '00602', '00603', '00604', '00605']}
...]
It contains ~89,000 dictionaries. And I have a list containing 4,473,208 paths. Example:
['/****/**/******_1102/00575***...**0CT.csv',
'/****/**/******_1102/00575***...**1CT.csv',
'/****/**/******_1102/00575***...**2CT.csv',
'/****/**/******_1102/00575***...**3CT.csv',
'/****/**/******_1102/00575***...**4CT.csv',
'/****/**/******_1102/00578***...**1CT.csv',
'/****/**/******_1102/00578***...**2CT.csv',
'/****/**/******_1102/00578***...**3CT.csv',
...]
What I want to do is group together each path whose folder contains the key and whose filename contains one of the grouped values from the dict.
I tried using for loops like this:
grpd_cts = []
for elem in tqdm(dict_list):
    temp1 = []
    for file in ct_paths:
        for key, val in elem.items():
            if (file[16:20] == key) and (any(x in file[21:26] for x in val)):
                temp1.append(file)
    grpd_cts.append(temp1)
but this takes around 30 hours. Is there a way to make it more efficient? Any itertools function or something?
Thanks a lot!
ct_paths is iterated repeatedly in your inner loop, and you're only interested in a small slice of each path for the test; pull that out and use it to index the rest of your data, as a dictionary.
What makes your problem complicated is that you want to end up with the original filenames, so you need to construct a two-level dictionary where the values are lists of all the originals grouped under those two keys.
ct_path_index = {}
for f in ct_paths:
    ct_path_index.setdefault(f[16:20], {}).setdefault(f[21:26], []).append(f)
grpd_cts = []
for elem in tqdm(dict_list):
    temp1 = []
    for key, val in elem.items():
        d2 = ct_path_index.get(key)
        if d2:
            for v in val:
                v2 = d2.get(v)
                if v2:
                    temp1 += v2
    grpd_cts.append(temp1)
ct_path_index looks like this, using your data:
{'1102': {'00575': ['/****/**/******_1102/00575***...**0CT.csv',
                    '/****/**/******_1102/00575***...**1CT.csv',
                    '/****/**/******_1102/00575***...**2CT.csv',
                    '/****/**/******_1102/00575***...**3CT.csv',
                    '/****/**/******_1102/00575***...**4CT.csv'],
          '00578': ['/****/**/******_1102/00578***...**1CT.csv',
                    '/****/**/******_1102/00578***...**2CT.csv',
                    '/****/**/******_1102/00578***...**3CT.csv']}}
The use of setdefault (which can be a little hard to understand the first time you see it) is important when building up collections of collections, and is very common in these kinds of cases: it makes sure that the sub-collections are created on demand and then re-used for a given key.
Now, you've only got two nested loops; the inner checks are done using dictionary lookups, which are close to O(1).
Other optimizations would include turning the lists in dict_list into sets, which would be worthwhile if you made more than one pass through dict_list.
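As an aside, if setdefault reads awkwardly, collections.defaultdict builds the same two-level index; this is just an equivalent sketch of the indexing step, not a change to the approach:

from collections import defaultdict

# Missing keys are created on demand: outer level maps to another
# defaultdict, inner level maps to a list of matching paths
ct_path_index = defaultdict(lambda: defaultdict(list))
for f in ct_paths:
    ct_path_index[f[16:20]][f[21:26]].append(f)

Note that .get() on a defaultdict still returns None for a missing key (get never triggers the default factory), so the lookup loop above works unchanged.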

Simplifying the code to a dictionary comprehension

In a directory images, the images are named like 1_foo.png, 2_foo.png, 14_foo.png, etc.
The images are OCR'd, and the extracted text is stored in a dict by the code below:
data_dict = {}
for i in os.listdir(images):
    if str(i[1]) != '_':
        k = str(i[:2])  # Get first two characters of image name and use as 'key'
    else:
        k = str(i[:1])  # Get first character of image name and use as 'key'
    # Initialises a list for each key and allows storing multiple entries
    data_dict.setdefault(k, [])
    data_dict[k].append(pytesseract.image_to_string(i))
The code performs as expected.
The images can have varying numbers in their name ranging from 1 to 99.
Can this be reduced to a dictionary comprehension?
No. Each iteration in a dict comprehension assigns a value to a key; it cannot update an existing value list. Dict comprehensions aren't always better; the code you wrote seems good enough. Although maybe you could write:
data_dict = {}
for i in os.listdir(images):
    k = i.partition("_")[0]
    image_string = pytesseract.image_to_string(i)
    data_dict.setdefault(k, []).append(image_string)
Yes. Here's one way, but I wouldn't recommend it:
{k: d.setdefault(k, []).append(pytesseract.image_to_string(i)) or d[k]
 for d in [{}]
 for k, i in ((i.split('_')[0], i) for i in names)}
That might be as clean as I can make it, and it's still bad. Better to use a normal loop, especially a clean one like Dennis's.
Slight variation (if I do the abuse once, I might as well do it twice):
{k: d.setdefault(k, []).append(pytesseract.image_to_string(i)) or d[k]
 for d in [{}]
 for i in names
 for k in i.split('_')[:1]}
Edit: kaya3 has now posted a good one using a dict comprehension. I'd recommend that over mine as well. Mine are really just the dirty results of me thinking "Someone said it can't be done? Challenge accepted!".
In this case itertools.groupby can be useful; you can group the filenames by the numeric part. But making it work is not easy, because the groups have to be contiguous in the sequence.
That means before we can use groupby, we need to sort using a key function which extracts the numeric part. That's the same key function we want to group by, so it makes sense to write the key function separately.
from itertools import groupby

def image_key(image):
    return str(image).partition('_')[0]

images = ['1_foo.png', '2_foo.png', '3_bar.png', '1_baz.png']
result = {
    k: list(v)
    for k, v in groupby(sorted(images, key=image_key), key=image_key)
}
# {'1': ['1_foo.png', '1_baz.png'],
# '2': ['2_foo.png'],
# '3': ['3_bar.png']}
Replace list(v) with list(map(pytesseract.image_to_string, v)) for your use-case.
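To see why the sort is necessary: groupby only merges adjacent equal keys, so unsorted input silently yields split groups. A quick demonstration, reusing image_key from above:

# '1_baz.png' is not adjacent to '1_foo.png', so key '1' appears twice
names = ['1_foo.png', '2_foo.png', '1_baz.png']
print([(k, list(v)) for k, v in groupby(names, key=image_key)])
# [('1', ['1_foo.png']), ('2', ['2_foo.png']), ('1', ['1_baz.png'])]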

Dictionary comprehension to avoid lines of code

Hi, I want to understand how to make this code shorter using a dictionary comprehension:
for e in list_of_tuples:
    tmp = mydict.copy()
    tmp[e[0]] = tmp[e[1]]
    if someFunction(tmp):
        mydict = tmp
I would like to pass a dictionary comprehension to someFunction instead of relying on a temporary dictionary whose values are changed in the loop. Is it possible?
This answer assumes that someFunction does not alter the dictionary
The dictionary passed to someFunction is still going to be a basic copy of mydict, but this is the only way I can think of answering the question with comprehension.
for e in list_of_tuples:
    if someFunction({key: (val if key != e[0] else mydict[e[1]]) for key, val in mydict.items()}):
        mydict[e[0]] = mydict[e[1]]
However, the faster/easier way would be to just make a temp variable for mydict[e[0]] and change it back if someFunction fails. Also, having extra lines isn't always a bad thing; it can help readability, debugging, and maintenance, especially for newer programmers.
for e in list_of_tuples:
    temp = mydict[e[0]]
    mydict[e[0]] = mydict[e[1]]
    if not someFunction(mydict):
        mydict[e[0]] = temp

Elegantly Generalising Sorting into Dictionaries in Python?

The list comprehension is a great structure for generalising working with lists in such a way that the creation of lists can be managed elegantly. Is there a similar tool for managing Dictionaries in Python?
I have the following functions:
# takes in 3 lists of lists and a column specification by which to group
def custom_groupby(atts, zmat, zmat2, col):
    result = dict()
    for i in range(0, len(atts)):
        val = atts[i][col]
        row = (atts[i], zmat[i], zmat2[i])
        try:
            result[val].append(row)
        except KeyError:
            result[val] = list()
            result[val].append(row)
    return result
# organises samples into dictionaries using the groupby
def organise_samples(attributes, z_matrix, original_z_matrix):
    strucdict = custom_groupby(attributes, z_matrix, original_z_matrix, 'SecStruc')
    strucfrontdict = dict()
    for k, v in strucdict.iteritems():
        strucfrontdict[k] = custom_groupby([x[0] for x in strucdict[k]],
            [x[1] for x in strucdict[k]], [x[2] for x in strucdict[k]], 'Front')
    samples = dict()
    for k in strucfrontdict:
        samples[k] = dict()
        for k2 in strucfrontdict[k]:
            samples[k][k2] = custom_groupby([x[0] for x in strucfrontdict[k][k2]],
                [x[1] for x in strucfrontdict[k][k2]], [x[2] for x in strucfrontdict[k][k2]], 'Back')
    return samples
It seems like this is unwieldy. There being elegant ways to do almost everything in Python, I'm inclined to think I'm using Python wrongly.
More importantly, I'd like to be able to generalise this function better so that I can specify how many "layers" should be in the dictionary (without using several lambdas and approaching the problem in a Lisp style). I would like a function:
# organises samples into a dictionary by specified columns
# number of layers could also be assumed by number of criterion
def organise_samples(number_layers, list_of_strings_for_column_ids)
Is this possible to do in Python?
Thank you! Even if there isn't a way to do it elegantly in Python, any suggestions towards making the above code more elegant would be really appreciated.
::EDIT::
For context, the attributes object, z_matrix, and original_zmatrix are all lists of Numpy arrays.
Attributes might look like this:
Type,Num,Phi,Psi,SecStruc,Front,Back
11,181,-123.815,65.4652,2,3,19
11,203,148.581,-89.9584,1,4,1
11,181,-123.815,65.4652,2,3,19
11,203,148.581,-89.9584,1,4,1
11,137,-20.2349,-129.396,2,0,1
11,163,-34.75,-59.1221,0,1,9
The Z-matrices might both look like this:
CA-1, CA-2, CA-CB-1, CA-CB-2, N-CA-CB-SG-1, N-CA-CB-SG-2
-16.801, 28.993, -1.189, -0.515, 118.093, 74.4629
-24.918, 27.398, -0.706, 0.989, 112.854, -175.458
-1.01, 37.855, 0.462, 1.442, 108.323, -72.2786
61.369, 113.576, 0.355, -1.127, 111.217, -69.8672
Samples is a dict{num => dict {num => dict {num => tuple(attributes, z_matrix)}}}, having one row of the z-matrix.
The list comprehension is a great structure for generalising working with lists in such a way that the creation of lists can be managed elegantly. Is there a similar tool for managing Dictionaries in Python?
Have you tried using dictionary comprehensions?
See this great question about dictionary comprehensions.
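For the generalisation the question actually asks about, a small recursive helper can build an N-layer dictionary from a list of column names. This is only a sketch, under the assumption that each row is a dict keyed by column name; the nested_groupby helper below is hypothetical, not taken from any of the answers:

# Sketch: group rows into nested dicts, one layer per column name
def nested_groupby(rows, columns):
    if not columns:
        return list(rows)
    first, rest = columns[0], columns[1:]
    grouped = {}
    for row in rows:
        # setdefault creates each sub-list on demand, as elsewhere in this page
        grouped.setdefault(row[first], []).append(row)
    return {key: nested_groupby(group, rest) for key, group in grouped.items()}

# e.g. nested_groupby(rows, ['SecStruc', 'Front', 'Back']) gives
# {secstruc: {front: {back: [rows...]}}}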

Which of these is more python-like?

I'm doing some exploring of various languages I hadn't used before, using a simple Perl script as a basis for what I want to accomplish. I have a couple of versions of something, and I'm curious which is the preferred method when using Python -- or if neither is, what is?
Version 1:
workflowname = []
paramname = []
value = []
for line in lines:
    wfn, pn, v = line.split(",")
    workflowname.append(wfn)
    paramname.append(pn)
    value.append(v)
Version 2:
workflowname = []
paramname = []
value = []
i = 0
for line in lines:
    workflowname.append("")
    paramname.append("")
    value.append("")
    workflowname[i], paramname[i], value[i] = line.split(",")
    i = i + 1
Personally, I prefer the second, but, as I said, I'm curious what someone who really knows Python would prefer.
A Pythonic solution might look a bit like Bogdan's, but using zip and argument unpacking:
workflowname, paramname, value = zip(*[line.split(',') for line in lines])
If you're determined to use a for construct, though, the 1st is better.
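For example, with a few made-up lines (illustrative data only); note the unpacked results are tuples, so wrap them in list() if you need mutable lists:

lines = ["wf1,alpha,1", "wf2,beta,2", "wf1,gamma,3"]
workflowname, paramname, value = zip(*[line.split(',') for line in lines])
# workflowname == ('wf1', 'wf2', 'wf1')
# paramname    == ('alpha', 'beta', 'gamma')
# value        == ('1', '2', '3')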
Of your two attempts, the 2nd one doesn't make any sense to me. Maybe in other languages it would. So of your two proposed approaches, the 1st one is better.
Still I think the pythonic way would be something like Matt Luongo suggested.
Bogdan's answer is best. In general, if you need a loop counter (which you don't in this case), you should use enumerate instead of incrementing a counter:
for index, value in enumerate(lines):
    # do something with the value and the index
Version 1 is definitely better than version 2 (why put something in a list if you're just going to replace it?) but depending on what you're planning to do later, neither one may be a good idea. Parallel lists are almost never more convenient than lists of objects or tuples, so I'd consider:
# list of (workflow, paramname, value) tuples
items = []
for line in lines:
    items.append(line.split(","))
Or:
class WorkflowItem(object):
    def __init__(self, workflow, paramname, value):
        self.workflow = workflow
        self.paramname = paramname
        self.value = value

# list of objects
items = []
for line in lines:
    items.append(WorkflowItem(*line.split(",")))
(Also, nitpick: 4-space tabs are preferable to 8-space.)
