What's a better practice and what's faster:
iterating over a dictionary (effectively, over its keys):
for key in dictionary: ...
or iterating over its items:
for key, value in dictionary.items(): ...?
Using dict.items() seems to be a cleaner way to iterate over a dictionary, but isn't it slower because it creates another object (a dict_items view)? (I suppose the cost is negligible, but I had to ask.) Or is that view perhaps already built when the dictionary is initialized?
Also, in the first way, accessing a value by its key shouldn't affect the efficiency because the operation is O(1).
Let's run some tests.
from time import perf_counter

i = {i: 10 for i in range(10**7)}

def t1():
    p1 = perf_counter()
    for key in i:
        i[key]
    print("key in i:", perf_counter() - p1)

def t2():
    p1 = perf_counter()
    for key, value in i.items():
        value
    print("key, value in i.items()", perf_counter() - p1)

t1()
t2()
OUTPUT
key in i: 0.2863648850000118
key, value in i.items() 0.19483049799998753
As you can see, iterating with items() is significantly faster. That's because with key in dict we iterate over the keys while ignoring the values; each value must then be fetched from the dictionary with dict[x] or dict.get(x), which by default performs all of the hash-table steps:
calls the hash function on your key x
probes the hash table's internal array (hidden from us, since it's an implementation detail)
resolves any collisions that occur
However, when we use items(), the iterator walks the dictionary once and yields each key together with its value, so we do not need to repeat the hash-table steps for each item.
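To make the difference concrete, here is a minimal side-by-side sketch (nothing new, just the two loop forms with the extra lookup spelled out):

d = {"a": 1, "b": 2}

# key iteration: every value access repeats the full hash lookup
for key in d:
    value = d[key]  # hash(key), probe the table, resolve collisions
    print(key, value)

# items(): each (key, value) pair is yielded straight from the table entry
for key, value in d.items():
    print(key, value)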
I have a dictionary which looks like this:
dictionary = {
    "ABC-6m-RF-200605-1352": "s3://blabla1.com",
    "ABC-3m-RF-200605-1352": "s3://blabla2.com",
    "DEF-6m-RF-200605-1352": "s3://blabla3.com"
}
Now I want to do a match: take an input such as helper="ABC-6m", match this string against the keys of the dictionary, and return the matching key (not the value)!
My code currently looks like this, but it is not robust, i.e. sometimes it works and sometimes it does not:
dictionary_scan = dict((el, el[:7]) for el in dictionary)
# swapping key and value
dictionary_scan = dict((v, k) for k, v in dictionary.items())
# concat string components in helper variable
helper = 'ABC' + '-' + '6m'
out = list(value for key, value in dictionary_scan.items() if helper in key)
The expected output is: 'ABC-6m-RF-200605-1352'. Sometimes this works in my code but sometimes it does not. Is there a better and more robust way to do this?
If you make a dictionary that maps prefixes to full keys, you'll only be able to get one key with a given prefix.
If there can be multiple keys that start with helper, you need to check them all with an ordinary list comprehension.
out = [key for key in dictionary if key.startswith(helper)]
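With the sample dictionary from the question, a quick check of this approach looks like:

dictionary = {
    "ABC-6m-RF-200605-1352": "s3://blabla1.com",
    "ABC-3m-RF-200605-1352": "s3://blabla2.com",
    "DEF-6m-RF-200605-1352": "s3://blabla3.com"
}
helper = 'ABC' + '-' + '6m'
out = [key for key in dictionary if key.startswith(helper)]
print(out)  # ['ABC-6m-RF-200605-1352']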
I have list of dictionaries and in each one of them the key site exists.
So in other words, this code returns True:
all('site' in site for site in summary)
The question is: what would be the Pythonic way to iterate over the list of dictionaries and return True if a key other than site exists in any of them?
Example: in the following list I would like to return True because of the existence of cost in the last dictionary. BUT I can't tell in advance what the other key will be; it happens to be cost in the example, but it could be any other string; random keys, for that matter.
[
    {"site": "site_A"},
    {"site": "site_B"},
    {"site": "site_C", "cost": 1000}
]
If all dictionaries have the key site, each dictionary has a length of at least 1. The presence of any other key pushes the size above 1, so test for that:
any(len(d) > 1 for d in summary)
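Applied to the example list, this gives:

summary = [
    {"site": "site_A"},
    {"site": "site_B"},
    {"site": "site_C", "cost": 1000}
]
print(any(len(d) > 1 for d in summary))  # True: the last dict has two keys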
You could just check, for each dictionary dct:
any(key != "site" for key in dct)
If you want to check this for a list of dictionaries dcts, shove another any around that: any(any(key != "site" for key in dct) for dct in dcts)
This also makes it easily extensible to allowing multiple different keys. (E.g. any(key not in ("site", "otherkey") for key in dct)) Because what's a dictionary good for if you can only use one key?
This version is a bit longer, but it gives you what you need. Just to offer one more option:
any({k: v for k, v in site.items() if k != 'site'} for site in summary)
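For the same sample list this also evaluates to True, because the inner comprehension produces a non-empty (and therefore truthy) dict only for the last entry:

summary = [
    {"site": "site_A"},
    {"site": "site_B"},
    {"site": "site_C", "cost": 1000}
]
print(any({k: v for k, v in site.items() if k != 'site'} for site in summary))  # True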
This code runs quickly on the sample data, but when iterating over a large file, it seems to run slowly, perhaps because of the nested for loops. Is there any other reason why iterating over items in a defaultdict is slow?
import itertools

sample_genes1 = {"0002": ["GENE1", "GENE2", "GENE3", "GENE4"],
                 "0003": ["GENE1", "GENE2", "GENE3", "GENE6"],
                 "0202": ["GENE4", "GENE2", "GENE1", "GENE7"]}

def get_common_gene_pairs(genelist):
    genedict = {}
    for k, v in sample_genes1.items():
        listofpairs = []
        for i in itertools.combinations(v, 2):
            listofpairs.append(i)
        genedict[k] = listofpairs
    return genedict

from collections import namedtuple, defaultdict

def get_gene_pair_pids(genelist):
    i = defaultdict(list)
    d = get_common_gene_pairs(sample_genes1)
    Pub_genes = namedtuple("pair", ["gene1", "gene2"])
    for p_id, genepairs in d.iteritems():
        for p in genepairs:
            thispair = Pub_genes(p[0], p[1])
            if thispair in i.keys():
                i[thispair].append(p_id)
            else:
                i[thispair] = [p_id, ]
    return i

if __name__ == "__main__":
    get_gene_pair_pids(sample_genes1)
One big problem: this line:
if thispair in i.keys():
doesn't take advantage of the dictionary's hash lookup; in Python 2, keys() builds a list, so this is a linear search. Drop the keys() call and let the dictionary do its fast lookup:
if thispair in i:
BUT since i is a defaultdict, which creates an empty list whenever a key doesn't exist yet, just replace:

if thispair in i.keys():
    i[thispair].append(p_id)
else:
    i[thispair] = [p_id, ]

by this simple statement:

i[thispair].append(p_id)  # i is a defaultdict: even if thispair isn't in the dict, it creates a list and appends p_id

(it's even faster because thispair is hashed only once)
To sum it up:
don't do thispair in i.keys(): in Python 2 it builds a list of keys and scans it linearly; in Python 3 keys() is a cheap view, but plain thispair in i is still simpler and at least as fast, defaultdict or not
you have defined a defaultdict, but your code treats it as a plain dict, which works, but more slowly.
Note: without a defaultdict you could have just removed the .keys() test, or done this with a plain dict:

i.setdefault(thispair, []).append(p_id)

(setdefault supplies the default at each call, so it can even vary per key)
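A quick demonstration of both idioms, using gene-pair tuples like the ones in the question as keys:

from collections import defaultdict

pairs = defaultdict(list)
pairs[("GENE1", "GENE2")].append("0002")  # missing key: a fresh list is created automatically
pairs[("GENE1", "GENE2")].append("0003")  # existing key: plain append
print(pairs[("GENE1", "GENE2")])          # ['0002', '0003']

# plain-dict equivalent with setdefault:
plain = {}
plain.setdefault(("GENE1", "GENE2"), []).append("0002")
print(plain)  # {('GENE1', 'GENE2'): ['0002']}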
Aside:
def get_common_gene_pairs(genelist):
    genedict = {}
    for k, v in sample_genes1.items():  # should be genelist, not the global variable; the genelist parameter isn't used at all
And the same goes for get_gene_pair_pids: its genelist argument is never used either.
Adding to Jean's answer, your get_common_gene_pairs function could be optimized by using a dict comprehension:
def get_common_gene_pairs(genelist):
    return {k: list(itertools.combinations(v, 2)) for k, v in genelist.items()}
list.append() is much more time-consuming than its list-comprehension counterpart. Also, you don't have to iterate over itertools.combinations(v, 2) yourself in order to build a list; passing it to list() does that.
Here is the comparison I made between list comprehensions and list.append() in my answer to Comparing list comprehensions and explicit loops, in case you are interested in taking a look.
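If you want to measure the difference yourself, a rough timeit comparison might look like this (the exact numbers will vary by machine and Python version):

from timeit import timeit

setup = "import itertools; v = ['GENE%d' % n for n in range(20)]"

append_loop = '''
listofpairs = []
for i in itertools.combinations(v, 2):
    listofpairs.append(i)
'''
direct_list = "list(itertools.combinations(v, 2))"

print("append loop:", timeit(append_loop, setup, number=10000))
print("list(...):  ", timeit(direct_list, setup, number=10000))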
I have a unique-keys dictionary that I update, adding new keys based on data from a webpage,
and I want to process only the new keys that may appear after a long time. Here is a piece of code to illustrate:
a = UniqueDict()
while 1:
    webpage = update()  # returns a list
    for i in webpage:
        title = getTitle(i)
        a[title] = new_value  # only new titles are added because it's a unique dictionary
    if len(a) > 50:
        a.clear()  # just to clear the dictionary if it gets too big
    # Condition before entering this loop, to process only newly entered titles
    for element in a.keys():
        process(element)
Is there a way to know which keys were newly added to the dictionary (because most of the time it will be the same keys and values, and I don't want those to be processed)?
Thank you.
What you might also do is keep the processed keys in a set.
Then you can check for new keys by using set(d.keys()) - set_already_processed.
And add processed keys using set_already_processed.add(key)
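A minimal sketch of that approach; process() and the title list are stand-ins for the question's own helpers:

already_processed = set()
a = {}

def process(key):
    print("processing", key)

for title in ["t1", "t2", "t1", "t3"]:  # stand-in for the titles scraped from the webpage
    a[title] = "value"

for key in set(a) - already_processed:  # only keys we haven't seen before
    process(key)
    already_processed.add(key)          # remember them for the next round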
You may want to use an OrderedDict:
Ordered dictionaries are just like regular dictionaries but they remember the order that items were inserted. When iterating over an ordered dictionary, the items are returned in the order their keys were first added.
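A sketch of how that ordering could be exploited (process() is again a stand-in): new keys always land at the end, so you only need to remember how many entries you have already handled:

from collections import OrderedDict

def process(key):
    print("processing", key)

a = OrderedDict()
handled = 0

a["t1"] = 1
a["t2"] = 2

for key in list(a)[handled:]:  # everything past the last handled index is new
    process(key)
handled = len(a)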
Make your own dict that tracks additions:
class NewKeysDict(dict):
    """A dict, but tracks keys that are added through __setitem__
    only. reset() resets tracking to begin tracking anew. self.new_keys
    is a set holding your keys.
    """
    def __init__(self, *args, **kw):
        super(NewKeysDict, self).__init__(*args, **kw)
        self.new_keys = set()

    def reset(self):
        self.new_keys = set()

    def __setitem__(self, key, value):
        super(NewKeysDict, self).__setitem__(key, value)
        self.new_keys.add(key)
d = NewKeysDict((i, str(i)) for i in range(10))
d.reset()
print(d.new_keys)
for i in range(5, 10):
    d[i] = '{} new'.format(i)
for k in d.new_keys:
    print(d[k])
(because most of the time, it will be the same keys and values so I don't want them to be processed)
You're overcomplicating this. Keys are immutable and unique,
and each key is followed by its value, separated by a colon:
d = {"title": title}
text = "textdude"
d["keytext"] = text
This adds the value "textdude" under the new key "keytext".
To check for a key, we use "in":
"keytext" in d
It returns True.
I have dicts that I need to clean, e.g.
dict = {
    'sensor1': [list of numbers from sensor 1 pertaining to measurements on different days],
    'sensor2': [list of numbers from sensor 2 pertaining to measurements on different days],
    etc. }
Some days have bad values, and I would like to generate a new dict with all the sensor values from those bad days erased, by using an upper limit on the values of one of the keys:
def clean_high(dict_name, key_string, limit):
    '''clean all the keys to eliminate the bad values from the arrays'''
    new_dict = dict_name
    for key in new_dict:
        new_dict[key] = new_dict[key][new_dict[key_string] < limit]
    return new_dict
If I run all the lines separately in IPython, it works. The bad days are eliminated and the good ones are kept; both new_dict[key] and new_dict[key][new_dict[key_string] < limit] are of type numpy.ndarray.
But, when I run clean_high(), I get the error:
TypeError: only integer arrays with one element can be converted to an index
What?
Inside of clean_high(), the type for new_dict[key] is a string, not an array.
Why would the type change? Is there a better way to modify my dictionary?
Do not modify a dictionary while iterating over it. According to the Python documentation: "Iterating views while adding or deleting entries in the dictionary may raise a RuntimeError or fail to iterate over all entries". There is a second trap here: new_dict = dict_name does not copy anything; both names refer to the same dictionary, so once the loop filters the key_string array, every later iteration builds its boolean mask from the already shortened array. Instead, create a new dictionary and fill it while iterating over the old one:
def clean_high(dict_name, key_string, limit):
    '''clean all the keys to eliminate the bad values from the arrays'''
    new_dict = {}
    for key in dict_name:
        new_dict[key] = dict_name[key][dict_name[key_string] < limit]
    return new_dict
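A quick sanity check with made-up numpy data (the sensor names and the limit of 50 are invented for the example):

import numpy as np

data = {
    'sensor1': np.array([1.0, 2.0, 99.0, 3.0]),    # 99.0 marks the bad day
    'sensor2': np.array([10.0, 20.0, 30.0, 40.0]),
}

cleaned = clean_high(data, 'sensor1', 50)
print(cleaned['sensor1'])  # [1. 2. 3.]
print(cleaned['sensor2'])  # [10. 20. 40.]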