Why isn't all the data being stored? - python

I have a dictionary of key:value pairs, but it only keeps the last iteration and discards the previous entries. Where is it being reset? Below is the output showing the iteration counter and the length of the dictionary:
Return the complete Term and DocID Ref.
LENGTH:6960
CTR:88699
My code:
class IndexData:
    def getTermDocIDCollection(self):
        ...............
        for term in terms:
            #TermDocIDCollection[term] = sourceFile['newid']
            TermDocIDCollection[term] = []
            TermDocIDCollection[term].append(sourceFile['newid'])
        return TermDocIDCollection

The piece of code you've commented out does the following:
1. Sets a value to the key (removing whatever was there before, if it existed)
2. Sets a new value to the key (an empty list)
3. Appends the value set in step 1 to the new empty list
Sadly, it would do the same each iteration, so you'd end up with [last value] assigned to the key. The new code (with update) does something similar. In the old days you'd do this:
if term in TermDocIDCollection:
    TermDocIDCollection[term].append(sourceFile['newid'])
else:
    TermDocIDCollection[term] = [sourceFile['newid']]
or a variation of the theme using try-except. After collections was added you can do this instead:
from collections import defaultdict
# ... code...
TermDocIDCollection = defaultdict(list)
and you'd update it like this:
TermDocIDCollection[term].append(sourceFile['newid'])
There is no need to check whether term already exists in the dictionary. If it doesn't, the defaultdict first calls the factory you passed (list) to create the initial value for that key.
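To see the difference on toy data, here is a minimal sketch. The `documents` list and its `terms` field are made up for illustration; only the `newid` key comes from the question.

```python
from collections import defaultdict

# Hypothetical stand-in for the parsed source files in the question.
documents = [
    {"newid": 1, "terms": ["cocoa", "sugar"]},
    {"newid": 2, "terms": ["sugar"]},
    {"newid": 3, "terms": ["cocoa", "grain"]},
]

# Buggy pattern: the list is re-created on every iteration,
# so only the last document ID survives for each term.
overwritten = {}
for doc in documents:
    for term in doc["terms"]:
        overwritten[term] = []
        overwritten[term].append(doc["newid"])

# defaultdict(list) creates the empty list only the first time a term is seen.
collected = defaultdict(list)
for doc in documents:
    for term in doc["terms"]:
        collected[term].append(doc["newid"])

print(overwritten)      # {'cocoa': [3], 'sugar': [2], 'grain': [3]}
print(dict(collected))  # {'cocoa': [1, 3], 'sugar': [1, 2], 'grain': [3]}
```

If you would rather stay with a plain dict, `overwritten.setdefault(term, []).append(doc["newid"])` should give the same result as the defaultdict version without the import.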

Related

dictionary being replaced and I am not sure why it is happening?

I have some code which is something along the lines of
storage = {}
for index, n in enumerate(dates):
    if n in specific_dates:
        for i in a_list:
            my_dict[i] = {}
            my_dict[i]["somthing"] = value
            my_dict[i]["somthing2"] = value_2
    else:
        #print(storage[dates[index - 1]]["my_dict"][i]["somthing"])
        for i in a_list:
            my_dict[i] = {}
            my_dict[i]["somthing"] = different_value - storage[dates[index - 1]]["my_dict"][i]["somthing"]
            my_dict[i]["somthing2"] = different_value_2
    storage[n]["my_dict"] = my_dict
The first pass will initiate the code in if n in specific_dates: the second pass goes to for i in a_list:
Essentially, the code gets a value set on specific dates, and this value is then used for the nonspecific dates that follow, until the next specific date overrides it. At every date, I save a dictionary of values within a master dictionary called storage.
I found the problem: when I print my_dict on the second pass, my_dict[i] is literally an empty dictionary, whereas prior to that loop it was filled. Where I have put the commented-out print line, it would print value. I have fixed this by changing storage[n]["my_dict"] = my_dict to storage[n]["my_dict"] = my_dict.copy() and can now access value.
However, I do not really understand why this didn't work as I expected in the first place, as I thought that assigning my_dict to storage created new memory.
I was hoping someone could explain why this is happening and why storage[dates[index - 1]]["my_dict"][i]["somthing"] doesn't refer to a new space in memory, if that is indeed what is happening.
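A minimal sketch of what is probably going on, using made-up keys and values: assignment stores a reference to the same dictionary object rather than a copy, so every entry in storage points at the one my_dict that the loop keeps modifying.

```python
my_dict = {}
storage = {}

# First pass: fill my_dict and store it WITHOUT copying.
my_dict["a"] = {"somthing": 10}
storage["day1"] = {"my_dict": my_dict}

# Second pass: resetting my_dict["a"] also changes what storage["day1"] sees,
# because both names refer to the same dict object.
my_dict["a"] = {}
print(storage["day1"]["my_dict"] is my_dict)  # True
print(storage["day1"]["my_dict"]["a"])        # {}

# With .copy(), storage keeps its own outer dict, so the old inner dict survives.
my_dict["a"] = {"somthing": 10}
storage["day1"] = {"my_dict": my_dict.copy()}
my_dict["a"] = {}
print(storage["day1"]["my_dict"]["a"])        # {'somthing': 10}
```

Note that `.copy()` is shallow. It is enough here because the loop replaces each inner dictionary with a fresh `{}`; if the inner dictionaries were modified in place instead, `copy.deepcopy` would be the safer choice.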

How to create a nested python dictionary with keys as strings?

Summary of issue: I'm trying to create a nested Python dictionary, with keys defined by pre-defined variables and strings. And I'm populating the dictionary from regular expressions outputs. This mostly works. But I'm getting an error because the nested dictionary - not the main one - doesn't like having the key set to a string, it wants an integer. This is confusing me. So I'd like to ask you guys how I can get a nested python dictionary with string keys.
Below I'll walk you through the steps of what I've done. What is working, and what isn't. Starting from the top:
# Regular expressions module
import re
# Read text data from a file
file = open("dt.cc", "r")
dtcc = file.read()
# Create a list of stations from regular expression matches
stations = sorted(set(re.findall(r"\n(\w+)\s", dtcc)))
The result is good, and is as something like this:
stations = ['AAAA','BBBB','CCCC','DDDD']
# Initialize a new dictionary
rows = {}
# Loop over each station in the station list, and start populating
for station in stations:
    rows[station] = re.findall("%s\s(.+)" %station, dtcc)
The result is good, and is something like this:
rows['AAAA'] = ['AAAA 0.1132 0.32 P',...]
However, when I try to create a sub-dictionary with a string key:
for station in stations:
    rows[station] = re.findall("%s\s(.+)" %station, dtcc)
    rows[station]["dt"] = re.findall("%s\s(\S+)" %station, dtcc)
I get the following error.
"TypeError: list indices must be integers, not str"
It doesn't seem to like that I'm specifying the second dictionary key as "dt". If I give it a number instead, it works just fine. But then my dictionary key name is a number, which isn't very descriptive.
Any thoughts on how to get this working?
The issue is that by doing
rows[station] = re.findall(...)
you are creating a dictionary with the station names as keys and the return values of re.findall as values, which are lists. So when you then do
rows[station]["dt"] = re.findall(...)
rows[station] on the left-hand side is a list, and lists are indexed by integers, which is what the TypeError is complaining about. You could do rows[station][0], for example, to get the first match from the regex. You said you want a nested dictionary, so you could do
rows[station] = dict()
rows[station]["dt"] = re.findall(...)
To make it a bit nicer, you could instead use a defaultdict from the collections module.
A defaultdict is a dictionary that takes a default factory as its argument, typically a type constructor, and uses it to build the value for any missing key. For example, dictlist = defaultdict(list) defines a dictionary whose values are lists. Doing dictlist[key].append(item1) immediately is then legal, because the list is created automatically the first time the key is accessed.
In your case you could do
from collections import defaultdict
rows = defaultdict(dict)
for station in stations:
    rows[station]["bulk"] = re.findall("%s\s(.+)" %station, dtcc)
    rows[station]["dt"] = re.findall("%s\s(\S+)" %station, dtcc)
Where you have to assign the first regex result to a new key, "bulk" here but you can call it whatever you like. Hope this helps.
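For anyone who wants to try this without the original file, here is a self-contained sketch. The contents of `dtcc` are invented sample data, and the `re.escape` call and raw strings are my additions rather than part of the code above.

```python
import re
from collections import defaultdict

# Made-up sample standing in for the contents of dt.cc.
dtcc = "\nAAAA 0.1132 0.32 P\nBBBB 0.2101 0.45 S\nAAAA 0.0950 0.28 S\n"

stations = sorted(set(re.findall(r"\n(\w+)\s", dtcc)))

rows = defaultdict(dict)
for station in stations:
    name = re.escape(station)  # harmless here, guards against regex metacharacters
    rows[station]["bulk"] = re.findall(r"%s\s(.+)" % name, dtcc)
    rows[station]["dt"] = re.findall(r"%s\s(\S+)" % name, dtcc)

print(stations)              # ['AAAA', 'BBBB']
print(rows["AAAA"]["dt"])    # ['0.1132', '0.0950']
print(rows["BBBB"]["bulk"])  # ['0.2101 0.45 S']
```

Because each pattern has one capture group, findall returns only the text after the station name, not the whole line.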

Sum values in a dict of sets

I have what might be a simple task and I tried several solutions but can't seem to figure it out.
I have a dict of sets containing gene names and corresponding positions as sets like:
gene_nr_snp = {'gene1': {3, 9}, 'gene2': {2, 3, 1}, 'gene3': {1}}
I want to return a dict with the gene name and the corresponding summed value.
I tried the following:
gene_values = {}
for gene, snp in gene_nr_snp.items():
    for i in snp: # iterate the values in each set
        snp_total = 0
        snp_total += i
    gene_values[gene].add(snp_total)
This is returning the same set of values
You can use a dict comprehension and the sum() function:
gene_values = {gene: sum(snp) for gene, snp in gene_nr_snp.items()}
Your attempt fails because you set the snp_total variable to 0 for every value in snp, thus failing to sum anything. You then seem to treat gene_values[gene] as a set but the dictionary starts empty, so you'll get a KeyError. A working version would be:
gene_values = {}
for gene, snp in gene_nr_snp.items():
    snp_total = 0
    for i in snp: # iterate the values in each set
        snp_total += i
    gene_values[gene] = snp_total
but the sum() function makes the inner loop rather more verbose than needed; the whole loop body could be replaced by gene_values[gene] = sum(snp).
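As a quick check, here is the comprehension run on the sample data from the question (with the key quoting fixed so that it is valid Python):

```python
gene_nr_snp = {"gene1": {3, 9}, "gene2": {2, 3, 1}, "gene3": {1}}

# sum() accepts any iterable of numbers, including a set.
gene_values = {gene: sum(snp) for gene, snp in gene_nr_snp.items()}

print(gene_values)  # {'gene1': 12, 'gene2': 6, 'gene3': 1}
```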

Python - process only new elements of a dictionary

I have a dictionary with unique keys that I update by adding new keys, depending on data from a webpage.
I want to process only the new keys, which may appear only after a long time. Here is a piece of code to illustrate:
a = UniqueDict()
while 1:
    webpage = update() # returns a list
    for i in webpage:
        title = getTitle(i)
        a[title] = new_value # only new titles are added, because keys are unique
    if len(a) > 50:
        a.clear() # just clear the dictionary if it gets too big
    # Condition needed before entering this loop, to process only newly added titles
    for element in a.keys():
        process(element)
Is there a way to find only the new keys added to the dictionary (because most of the time, it will be the same keys and values so I don't want them to be processed)?
Thank you.
What you might also do, is keep the processed keys in a set.
Then you can check for new keys by using set(d.keys()) - set_already_processed.
And add processed keys using set_already_processed.add(key)
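A minimal sketch of that idea. The `fetches` list is a made-up stand-in for successive results of the question's update() and getTitle() calls, and `processed_log` just records what process() would have received.

```python
# Hypothetical titles seen on two successive page fetches.
fetches = [["alpha", "beta"], ["alpha", "beta", "gamma"]]

d = {}
already_processed = set()
processed_log = []  # records what process() would have received

for titles in fetches:
    for title in titles:
        d[title] = "some value"
    new_keys = set(d) - already_processed  # set(d) is the same as set(d.keys())
    for key in sorted(new_keys):
        processed_log.append(key)          # stand-in for process(key)
        already_processed.add(key)

print(processed_log)  # ['alpha', 'beta', 'gamma']
```

If the dictionary is cleared when it grows too big, as in the question, keeping `already_processed` around should stop old titles from being processed again when they reappear.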
You may want to use an OrderedDict:
Ordered dictionaries are just like regular dictionaries but they remember the order that items were inserted. When iterating over an ordered dictionary, the items are returned in the order their keys were first added.
Make your own dict that tracks additions:
class NewKeysDict(dict):
    """A dict, but tracks keys that are added through __setitem__
    only. reset() resets tracking to begin tracking anew. self.new_keys
    is a set holding your keys.
    """
    def __init__(self, *args, **kw):
        super(NewKeysDict, self).__init__(*args, **kw)
        self.new_keys = set()
    def reset(self):
        self.new_keys = set()
    def __setitem__(self, key, value):
        super(NewKeysDict, self).__setitem__(key, value)
        self.new_keys.add(key)

d = NewKeysDict((i,str(i)) for i in range(10))
d.reset()
print(d.new_keys)
for i in range(5, 10):
    d[i] = '{} new'.format(i)
for k in d.new_keys:
    print(d[k])
(because most of the time, it will be the same keys and values so I don't want them to be processed)
You are overcomplicating this.
Dictionary keys are immutable and unique.
Each key is followed by its value, separated by a colon.
d = {"title": title}
text = "textdude"
d["keytext"] = text
This adds the value "textdude" under the new key "keytext".
To check whether a key is already present, use "in":
"keytext" in d
This returns True. Note that "in" tests keys, not values, so "textdude" in d would return False.

Pandas Dataframe to Dictionary with Multiple Keys

I am currently working with a dataframe consisting of a column of 13 letter strings ('13mer') paired with ID codes ('Accession') as such:
However, I would like to create a dictionary in which the Accession codes are the keys with values being the 13mers associated with the accession so that it looks as follows:
{'JO2176': ['IGY....', 'QLG...', 'ESS...', ...],
'CYO21709': ['IGY...', 'TVL...',.............],
...}
Which I've accomplished using this code:
Accession_13mers = {}
for group in grouped:
    Accession_13mers[group[0]] = []
    for item in group[1].iteritems():
        Accession_13mers[group[0]].append(item[1])
However, now I would like to go back and iterate through the keys for each Accession code and run a function I've defined as find_match_position(reference_sequence, 13mer), which finds the 13mer in a reference sequence and returns its position. I would then like to store the position as the value, with the 13mer as the key.
If anyone has any ideas for how I can expedite this process that would be extremely helpful.
Thanks,
Justin
I would suggest creating a new dictionary, whose values are another dictionary. Essentially a nested dictionary.
position_nmers = {}
for key in H1_Access_13mers:
    position_nmers[key] = {} # replicate key, val in new dictionary, as a dictionary
    for value in H1_Access_13mers[key]:
        position_nmers[key][value] = None # do something, e.g. find_match_position(reference_sequence, value)
To introspect the dictionary and make sure it's okay:
print(position_nmers)
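Here is a runnable sketch of the nested-dictionary approach. Everything in it is illustrative: `find_match_position` is a hypothetical stand-in built on str.find (the asker's real function is not shown), the sequence and accession data are invented, and 3-letter strings replace the 13mers for brevity.

```python
def find_match_position(reference_sequence, mer):
    # Hypothetical stand-in for the asker's function: index of the first
    # occurrence of mer in reference_sequence, or -1 if it is absent.
    return reference_sequence.find(mer)

reference_sequence = "MKTIIALSYIFCLVFA"  # made-up sequence
accession_mers = {"ACC1": ["KTI", "SYI"], "ACC2": ["LVF", "XXX"]}  # made-up data

position_nmers = {}
for accession, mers in accession_mers.items():
    # Inner dictionary: each mer becomes a key, its position the value.
    position_nmers[accession] = {
        mer: find_match_position(reference_sequence, mer) for mer in mers
    }

print(position_nmers)
# {'ACC1': {'KTI': 1, 'SYI': 7}, 'ACC2': {'LVF': 12, 'XXX': -1}}
```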
You can iterate over the groupby more cleanly by unpacking:
d = {}
for key, s in df.groupby('Accession')['13mer']:
    d[key] = list(s)
This also makes it much clearer where you should put your function!
... However, I think that it might be better suited to an enumerate:
d2 = {}
for pos, val in enumerate(df['13mer']):
    d2[val] = pos
