This code runs quickly on the sample data, but when iterating over a large file, it seems to run slowly, perhaps because of the nested for loops. Is there any other reason why iterating over items in a defaultdict is slow?
import itertools

sample_genes1 = {"0002": ["GENE1", "GENE2", "GENE3", "GENE4"],
                 "0003": ["GENE1", "GENE2", "GENE3", "GENE6"],
                 "0202": ["GENE4", "GENE2", "GENE1", "GENE7"]}

def get_common_gene_pairs(genelist):
    genedict = {}
    for k, v in sample_genes1.items():
        listofpairs = []
        for i in itertools.combinations(v, 2):
            listofpairs.append(i)
        genedict[k] = listofpairs
    return genedict
from collections import namedtuple, defaultdict

def get_gene_pair_pids(genelist):
    i = defaultdict(list)
    d = get_common_gene_pairs(sample_genes1)
    Pub_genes = namedtuple("pair", ["gene1", "gene2"])
    for p_id, genepairs in d.iteritems():
        for p in genepairs:
            thispair = Pub_genes(p[0], p[1])
            if thispair in i.keys():
                i[thispair].append(p_id)
            else:
                i[thispair] = [p_id, ]
    return i

if __name__ == "__main__":
    get_gene_pair_pids(sample_genes1)
One big problem: this line:
if thispair in i.keys():
doesn't take advantage of the dictionary's hash lookup: in Python 2, keys() builds a list and the membership test becomes a linear search. Drop the keys() call and let the dictionary do its fast lookup:
if thispair in i:
BUT since i is a defaultdict, which creates a new list whenever a key doesn't exist yet, just replace:

if thispair in i.keys():
    i[thispair].append(p_id)
else:
    i[thispair] = [p_id, ]

by this simple statement:

i[thispair].append(p_id)  # even if thispair isn't in the dict yet, the defaultdict creates the list and appends p_id

(it's even faster because thispair is hashed only once)
To sum it up:
don't do thispair in i.keys(): in Python 2 it builds a list and scans it linearly, and even in Python 3 the call is unnecessary, defaultdict or not
you have defined a defaultdict, but your code treats it like a plain dict, which works, but more slowly
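A minimal sketch of the loop with those fixes applied (same names as in the question, .items() assuming Python 3):

from collections import namedtuple, defaultdict

Pub_genes = namedtuple("pair", ["gene1", "gene2"])

def get_gene_pair_pids(genelist):
    i = defaultdict(list)
    d = get_common_gene_pairs(genelist)  # use the parameter, not the global
    for p_id, genepairs in d.items():    # .iteritems() on Python 2
        for p in genepairs:
            # the defaultdict creates the empty list on first access
            i[Pub_genes(p[0], p[1])].append(p_id)
    return i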
Note: without a defaultdict you could have just removed the .keys() call, or done this with a plain dict and setdefault:

i.setdefault(thispair, [])
i[thispair].append(p_id)

(setdefault only creates the empty list when thispair isn't already a key)
Aside:
def get_common_gene_pairs(genelist):
    genedict = {}
    for k, v in sample_genes1.items():  # should be genelist, not the global variable: the genelist parameter isn't used at all

The same goes for get_gene_pair_pids: it ignores its genelist argument and always reads the global sample_genes1.
Adding to Jean's answer, your get_common_gene_pairs function could be optimized by using a dict comprehension:

def get_common_gene_pairs(genelist):
    return {k: list(itertools.combinations(v, 2)) for k, v in genelist.items()}

list.append() is much more time consuming compared to its list comprehension counterpart. Also, you don't have to iterate over itertools.combinations(v, 2) yourself in order to convert it to a list: passing it to list() does that directly.
Here is the comparison I made between list comprehension and list.append() in answer to Comparing list comprehensions and explicit loops, in case you are interested in taking a look.
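As a rough local illustration of that difference (exact numbers vary by machine and Python version), a quick timeit sketch over made-up data:

import timeit

setup = "import itertools; v = ['GENE%d' % n for n in range(50)]"

loop = '''
listofpairs = []
for pair in itertools.combinations(v, 2):
    listofpairs.append(pair)
'''
cast = "listofpairs = list(itertools.combinations(v, 2))"

print(timeit.timeit(loop, setup, number=10000))  # explicit append loop
print(timeit.timeit(cast, setup, number=10000))  # list() over the iterator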
Related
What is the best way to replace the first character of all keys in a dictionary?
old_cols= {"~desc1":"adjustment1","~desc23":"adjustment3"}
I am trying to get
new_cols= {"desc1":"adjustment1","desc23":"adjustment3"}
I have tried:
for k, v in old_cols.items():
    new_cols[k] = old_cols.pop(k[1:])
old_cols = {"~desc1":"adjustment1", "~desc23":"adjustment3"}
new_cols = {}
for k, v in old_cols.items():
    new_key = k[1:]
    new_cols[new_key] = v
Here it is with a dictionary comprehension:
old_cols= {"~desc1":"adjustment1","~desc23":"adjustment3"}
new_cols = {k.replace('~', ''):old_cols[k] for k in old_cols}
print(new_cols)
#{'desc1': 'adjustment1', 'desc23': 'adjustment3'}
There are many ways to do this with dict comprehensions or for-loops. What is important to understand is that dictionaries are mutable. This basically means that you can either modify the existing dictionary or create a new one.
If you want to create a new one (and I would recommend it! - see 'A word of warning ...' below), both solutions provided in the answers by #Ethem_Turgut and by #pakpe do the trick. I would probably write:
old_dict = {"~desc1":"adjustment1","~desc23":"adjustment3"}
# get the id of this dict, just for comparison later.
old_id = id(old_dict)
new_dict = {k[1:]: v for k, v in old_dict.items()}
print(f'Is the result still the same dictionary? Answer: {old_id == id(new_dict)}')
Now, if you want to modify the dictionary in place you might loop over the keys and adapt the key/value to your liking:
old_dict = {"~desc1":"adjustment1","~desc23":"adjustment3"}
# get the id of this dict, just for comparison later.
old_id = id(old_dict)
for k in [*old_dict.keys()]: # note the weird syntax here
    old_dict[k[1:]] = old_dict.pop(k)
print(f'Is the result still the same dictionary? Answer: {old_id == id(old_dict)}')
A word of warning for the latter approach:
You should be aware that you are modifying the keys while looping over them. This is problematic in most cases and can even lead to a RuntimeError if you loop directly over old_dict. I avoided this by explicitly unpacking the keys into a list and then looping over that list, via [*old_dict.keys()].
Why can modifying keys while looping over them be problematic? Imagine for example that you have the keys '~key1' and 'key1' in your old_dict. Now when your loop handles '~key1' it will modify it to 'key1' which already exists in old_dict and thus it will overwrite the value of 'key1' with the value from '~key1'.
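A tiny demonstration of that overwrite, using a variant of the in-place loop that only renames tilde-prefixed keys, with made-up values:

old_dict = {'~key1': 'from tilde', 'key1': 'original'}
for k in [*old_dict.keys()]:
    if k.startswith('~'):
        old_dict[k[1:]] = old_dict.pop(k)
print(old_dict)
# {'key1': 'from tilde'}  (the original 'key1' value was silently overwritten)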
So, only use the latter approach if you are certain to not produce issues like the example mentioned here before. If you are uncertain, simply create a new dictionary!
In a directory images, images are named like - 1_foo.png, 2_foo.png, 14_foo.png, etc.
The images are OCR'd and the text extract is stored in a dict by the code below -
data_dict = {}
for i in os.listdir(images):
    if str(i[1]) != '_':
        k = str(i[:2]) # Get first two characters of image name and use as 'key'
    else:
        k = str(i[:1]) # Get first character of image name and use as 'key'
    # Initiates a list for each key and allows storing multiple entries
    data_dict.setdefault(k, [])
    data_dict[k].append(pytesseract.image_to_string(i))
The code performs as expected.
The images can have varying numbers in their name ranging from 1 to 99.
Can this be reduced to a dictionary comprehension?
No. Each iteration in a dict comprehension assigns a value to a key; it cannot update an existing value list. Dict comprehensions aren't always better--the code you wrote seems good enough. Although maybe you could write
data_dict = {}
for i in os.listdir(images):
    k = i.partition("_")[0]
    image_string = pytesseract.image_to_string(i)
    data_dict.setdefault(k, []).append(image_string)
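For comparison, a sketch of the same loop using a defaultdict instead of setdefault (same assumptions about images and pytesseract as in the question):

from collections import defaultdict

data_dict = defaultdict(list)
for i in os.listdir(images):
    k = i.partition("_")[0]
    # missing keys get an empty list automatically
    data_dict[k].append(pytesseract.image_to_string(i))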
Yes. Here's one way, but I wouldn't recommend it:
{k: d.setdefault(k, []).append(pytesseract.image_to_string(i)) or d[k]
for d in [{}]
for k, i in ((i.split('_')[0], i) for i in names)}
That might be as clean as I can make it, and it's still bad. It's better to use a normal loop, especially a clean one like Dennis's.
Slight variation (if I do the abuse once, I might as well do it twice):
{k: d.setdefault(k, []).append(pytesseract.image_to_string(i)) or d[k]
for d in [{}]
for i in names
for k in i.split('_')[:1]}
Edit: kaya3 now posted a good one using a dict comprehension. I'd recommend that over mine as well. Mine are really just the dirty results of me being like "Someone said it can't be done? Challenge accepted!".
In this case itertools.groupby can be useful; you can group the filenames by the numeric part. But making it work is not easy, because the groups have to be contiguous in the sequence.
That means before we can use groupby, we need to sort using a key function which extracts the numeric part. That's the same key function we want to group by, so it makes sense to write the key function separately.
from itertools import groupby
def image_key(image):
    return str(image).partition('_')[0]
images = ['1_foo.png', '2_foo.png', '3_bar.png', '1_baz.png']
result = {
    k: list(v)
    for k, v in groupby(sorted(images, key=image_key), key=image_key)
}
# {'1': ['1_foo.png', '1_baz.png'],
# '2': ['2_foo.png'],
# '3': ['3_bar.png']}
Replace list(v) with list(map(pytesseract.image_to_string, v)) for your use-case.
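To see why the sort matters: grouping the same list without sorting it first yields two separate '1' groups, and in a dict comprehension the second one silently overwrites the first:

unsorted_result = {
    k: list(v)
    for k, v in groupby(images, key=image_key)
}
# {'1': ['1_baz.png'], '2': ['2_foo.png'], '3': ['3_bar.png']}
# '1_foo.png' is lost because the later '1' group replaced the earlier one.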
How do I CONSOLIDATE the following using python COMPREHENSION
FROM (list of dicts)
[
{'server':'serv1','os':'Linux','archive':'/my/folder1'}
,{'server':'serv2','os':'Linux','archive':'/my/folder1'}
,{'server':'serv3','os':'Linux','archive':'/my/folder2'}
,{'server':'serv4','os':'AIX','archive':'/my/folder1'}
,{'server':'serv5','os':'AIX','archive':'/my/folder1'}
]
TO (a list of dicts with a tuple as key and a list of servers as value)
[
 {('Linux','/my/folder1'):['serv1','serv2']}
,{('Linux','/my/folder2'):['serv3']}
,{('AIX','/my/folder1'):['serv4','serv5']}
]
The need to set default values in your dictionary and to handle the same key several times makes a dict comprehension a bit clumsy here; I'd prefer something else. A defaultdict may help:
from collections import defaultdict
lst = [
{'server':'serv1','os':'Linux','archive':'/my/folder1'},
{'server':'serv2','os':'Linux','archive':'/my/folder1'},
{'server':'serv3','os':'Linux','archive':'/my/folder2'},
{'server':'serv4','os':'AIX','archive':'/my/folder1'},
{'server':'serv5','os':'AIX','archive':'/my/folder1'}
]
dct = defaultdict(list)
for d in lst:
    key = d['os'], d['archive']
    dct[key].append(d['server'])
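which gives (printed as a plain dict for readability):

print(dict(dct))
# {('Linux', '/my/folder1'): ['serv1', 'serv2'],
#  ('Linux', '/my/folder2'): ['serv3'],
#  ('AIX', '/my/folder1'): ['serv4', 'serv5']}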
If you prefer to have a standard dictionary in the end (actually, I do not really see a good reason for that), you could use dict.setdefault to create an empty list where the key does not yet exist:
dct = {}
for d in lst:
    key = d['os'], d['archive']
    dct.setdefault(key, []).append(d['server'])
The documentation on defaultdict (vs. setdefault) notes:
This technique is simpler and faster than an equivalent technique using dict.setdefault().
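A rough way to check that claim yourself (exact numbers depend on the data and the interpreter), using made-up keys:

import timeit

setup = "from collections import defaultdict; pairs = [('k%d' % (n % 100), n) for n in range(10000)]"

use_defaultdict = '''
d = defaultdict(list)
for k, v in pairs:
    d[k].append(v)
'''
use_setdefault = '''
d = {}
for k, v in pairs:
    d.setdefault(k, []).append(v)
'''
print(timeit.timeit(use_defaultdict, setup, number=100))
print(timeit.timeit(use_setdefault, setup, number=100))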
It's difficult to achieve with a comprehension alone because of the accumulation effect. However, it's possible using itertools.groupby on the list sorted by your keys (use the same key function for both sorting and grouping).
Then extract the server names in a list comprehension for each group, pair them with the group key, and feed the resulting (group key, server list) pairs to a dict comprehension, and there you go.
import itertools
lst = [
{'server':'serv1','os':'Linux','archive':'/my/folder1'}
,{'server':'serv2','os':'Linux','archive':'/my/folder1'}
,{'server':'serv3','os':'Linux','archive':'/my/folder2'}
,{'server':'serv4','os':'AIX','archive':'/my/folder1'}
,{'server':'serv5','os':'AIX','archive':'/my/folder1'}
]
sortfunc = lambda x: (x['os'], x['archive'])
result = {k: [x['server'] for x in v]
          for k, v in itertools.groupby(sorted(lst, key=sortfunc), key=sortfunc)}
print(result)
I get:
{('Linux', '/my/folder1'): ['serv1', 'serv2'], ('AIX', '/my/folder1'): ['serv4', 'serv5'], ('Linux', '/my/folder2'): ['serv3']}
Keep in mind that just because it can be written in one line doesn't mean it's more efficient: the defaultdict approach doesn't require sorting, for instance.
Is it possible to manipulate each dictionary in an iterable with lambdas?
To clarify, I just want to delete the _id key from each element.
I wonder if there is an elegant way to achieve this simple task without third-party packages (like toolz), helper functions, or copying the dict objects.
Possible solutions that I am not looking for:
Traditional way:
cleansed = {k:v for k,v in data.items() if k not in ['_id'] }
Another way:
def clean(iterable):
    for d in iterable:
        del d['_id']
    return
3rd Party way:
Python functional approach: remove key from dict using filter
The functional programming tendency is to not alter the original object, but to return a new, modified copy of the data. The main problem is that del is a statement and can't be used functionally.
Maybe this pleases you:
[d.pop('_id') for d in my_dicts]
that will modify all the dictionaries in my_dicts in place and remove the '_id' key from them. The resulting list is a list of _id values, but the dictionaries are edited in the original my_dicts.
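For example, with hypothetical data:

my_dicts = [{'_id': 1, 'name': 'a'}, {'_id': 2, 'name': 'b'}]
ids = [d.pop('_id') for d in my_dicts]
print(ids)       # [1, 2]
print(my_dicts)  # [{'name': 'a'}, {'name': 'b'}]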
I don't recommend using this at all, since it is not as easy to understand as your second example. Using functional programming to alter persistent data is not elegant at all. But if you need to...
Check if this does what you want:
# Python 2.x
func = lambda iterable, key="_id": type(iterable)([{k: v for k, v in d.iteritems() if k != key} for d in iterable])
# Python 3.x
func = lambda iterable, key="_id": type(iterable)([{k: v for k, v in d.items() if k != key} for d in iterable])
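Usage with the same hypothetical data (Python 3 version):

data = [{'_id': 1, 'name': 'a'}, {'_id': 2, 'name': 'b'}]
cleansed = func(data)
print(cleansed)  # [{'name': 'a'}, {'name': 'b'}]
print(data)      # unchanged: [{'_id': 1, 'name': 'a'}, {'_id': 2, 'name': 'b'}]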
I have this code:
self.statusIcons = collections.OrderedDict
for index in guiConfig.STATUS_ICON_SETS:
    self.statusIcons[index] = {condition:
        wx.Image(guiConfig.STATUS_ICON_STRING.format(index, condition),
                 wx.BITMAP_TYPE_PNG).ConvertToBitmap()
        for condition in guiConfig.STATUS_ICON_CONDITIONS}
It sets up an OrderedDict of regular dictionaries of wx.Image objects, built with a dict comprehension. I originally had nested dict comprehensions and it worked fine, but I decided I needed the top-level dict to be ordered, so I ended up doing it this way. The problem is that now I get this error when zeroing in on the piece of code in question:
TypeError: 'type' object does not support item assignment
I can't figure out what I did incorrectly. Does OrderedDict not allow comprehension even if it isn't at the top level? Maybe it tries to order all dicts within an OrderedDict and can't because the comprehension is on the lower level? Not sure; maybe it's something ridiculous I couldn't spot because of tunnel vision.
PS: in case you need to know what is in the globals I reference above:
STATUS_ICON_SETS = ("comp", "net", "serv", "audio", "sec", "ups", "zwave", "stats")
STATUS_ICON_CONDITIONS = ("on", "off")
STATUS_ICON_STRING = "images/{0}_{1}.png"
You need to call the type to create an instance:
self.statusIcons = collections.OrderedDict()
You omitted the () there.
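A minimal reproduction of that error, with a plain dict instead of OrderedDict, shows what the missing parentheses do:

d = dict     # binds the type itself, not a new dictionary
d['a'] = 1   # TypeError: 'type' object does not support item assignment

d = dict()   # or {}: now there is an instance, and item assignment works
d['a'] = 1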
You can create the OrderedDict elements in a generator expression producing (key, value) tuples here too:
self.statusIcons = collections.OrderedDict(
    (index, {condition: wx.Image(
        guiConfig.STATUS_ICON_STRING.format(index, condition),
        wx.BITMAP_TYPE_PNG).ConvertToBitmap()
        for condition in guiConfig.STATUS_ICON_CONDITIONS})
    for index in guiConfig.STATUS_ICON_SETS)
but I am not sure readability has improved with that approach.