I'm extracting instances of three elements from an XML file: ComponentStr, keyID, and valueStr. Whenever I find a ComponentStr, I want to associate the keyID:valueStr pair with it. ComponentStr values are not unique. As multiple occurrences of a ComponentStr are read, I want to accumulate the keyID:valueStr pairs for that ComponentStr group. The resulting accumulated data structure after reading the XML file might look like this:
ComponentA: key1:value1, key2:value2, key3:value3
ComponentB: key4:value4
ComponentC: key5:value5, key6:value6
After I generate the final data structure, I want to sort the keyID:valueStr entries within each ComponentStr and also sort all the ComponentStrs.
I'm trying to structure this data in Python 2. ComponentStr seems to work well as a set, and the keyID:valueStr is clearly a dict. But how do I associate a ComponentStr entry in a set with its dict entries?
Alternatively, is there a better way to organize this data besides a set and associated dict entries? Each keyID is unique. Perhaps I could have one dict of keyID mapped to some combination of ComponentStr and valueStr? After the data structure is built, I could sort it on ComponentStr first, then take some kind of slice to group the keyID:valueStr pairs, and then sort again on keyID? That seems complicated.
How about a dict of dicts?
data = {
'ComponentA': {'key1':'value1', 'key2':'value2', 'key3':'value3'},
'ComponentB': {'key4':'value4'},
'ComponentC': {'key5':'value5', 'key6':'value6'},
}
It maintains your data structure and mapping. Interestingly enough, the underlying implementation of dicts is similar to the implementation of sets.
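Lookups then go component first, then keyID, and the outer dict doubles as your "set" of components. For example:

print(data['ComponentA']['key2'])   # value2
print('ComponentB' in data)         # True -- membership test on the components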
This is easily constructed with pseudo-code along these lines:
data = {}
for file in files:
    data[get_component(file)] = {}
    for key, value in get_data(file):
        data[get_component(file)][key] = value
In the case where you have repeated components, you need the sub-dict as the default, but add to the previous one if it's already there. I prefer setdefault to other solutions like a defaultdict or subclassing dict with a __missing__ (a sketch of the defaultdict version follows below), as long as I only have to do it once or twice in my code:
data = {}
for file in files:
    for key, value in get_data(file):
        data.setdefault(get_component(file), {})[key] = value
It works like this:
>>> d = {}
>>> d.setdefault('foo', {})['bar'] = 'baz'
>>> d
{'foo': {'bar': 'baz'}}
>>> d.setdefault('foo', {})['ni'] = 'ichi'
>>> d
{'foo': {'ni': 'ichi', 'bar': 'baz'}}
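For comparison, here is a minimal sketch of the defaultdict alternative mentioned above, using the same hypothetical files/get_component/get_data helpers from the pseudo-code; a missing component gets an empty sub-dict created automatically on first lookup:

from collections import defaultdict

data = defaultdict(dict)
for file in files:
    for key, value in get_data(file):
        # looking up a missing component creates {} automatically
        data[get_component(file)][key] = value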
Alternatively, since your comment on the other answer says you need simple code, you can keep it really simple with some more verbose and less optimized code:
data = {}
for file in files:
    for key, value in get_data(file):
        if get_component(file) not in data:
            data[get_component(file)] = {}
        data[get_component(file)][key] = value
You can then sort when you're done collecting the data.
for component in sorted(data):
    print(component)
    print('-----')
    for key in sorted(data[component]):
        print('{} {}'.format(key, data[component][key]))
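With the example data from the top of the question, that loop would print something like:

ComponentA
-----
key1 value1
key2 value2
key3 value3
ComponentB
-----
key4 value4
ComponentC
-----
key5 value5
key6 value6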
I want to accumulate the keyID:valueStr for that ComponentStr group
In this case you want the keys of your dictionary to be the ComponentStr values; "accumulating" immediately suggests a list to me, and lists are easily ordered.
Each keyID is unique. Perhaps I could have one dict of keyID:some
combo of ComponentStr and valueStr?
You should store your data in the manner that is most efficient for how you want to retrieve it. Since you will be accessing the data by component, there is no point in a dictionary keyed by keyID, even though keyIDs are unique; that is not how you are going to retrieve the data.
So, with that - how about using a defaultdict with a list, since you really want all items associated with the same component:
from collections import defaultdict

d = defaultdict(list)

with open('somefile.xml', 'r') as f:
    for component, key, value in parse_xml(f):  # parse_xml stands in for your XML parsing
        d[component].append((key, value))
Now you have, for each component, a list of tuples of the associated keys and values.
If you want to keep the components in the order in which they are read from the file, you can use an OrderedDict (also from the collections module), but if you just want to sort them afterwards, stick with a normal dictionary.
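If you do want file order, here is a minimal sketch of that variant, reusing the parse_xml placeholder from above; note that an OrderedDict has no default factory, so setdefault takes the place of defaultdict:

from collections import OrderedDict

d = OrderedDict()
with open('somefile.xml', 'r') as f:
    for component, key, value in parse_xml(f):
        d.setdefault(component, []).append((key, value))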
To get a list of sorted component names, just sort the keys of the dictionary:
component_sorted = sorted(d.keys())
For a use case of printing the sorted components with their associated key/value pairs, sorted by their keys:
for key in component_sorted:
    values = d[key]
    sorted_values = sorted(values, key=lambda x: x[0])  # sort by the keyIDs
    print('Pairs for {}'.format(key))
    for k, v in sorted_values:
        print('{} {}'.format(k, v))
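As a side note, tuples compare element by element, so when the keyIDs are unique you can drop the key= argument and get the same ordering:

pairs = [('key2', 'value2'), ('key1', 'value1')]
print(sorted(pairs))                       # [('key1', 'value1'), ('key2', 'value2')]
print(sorted(pairs, key=lambda x: x[0]))   # same ordering here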
Sorry if this is trivial, I'm still learning, but I have a list of dictionaries that looks as follows:
[{'1102': ['00576', '00577', '00578', '00579', '00580', '00581']},
{'1102': ['00582', '00583', '00584', '00585', '00586', '00587']},
{'1102': ['00588', '00589', '00590', '00591', '00592', '00593']},
{'1102': ['00594', '00595', '00596', '00597', '00598', '00599']},
{'1102': ['00600', '00601', '00602', '00603', '00604', '00605']}
...]
It contains ~89,000 dictionaries. I also have a list containing 4,473,208 paths, for example:
['/****/**/******_1102/00575***...**0CT.csv',
'/****/**/******_1102/00575***...**1CT.csv',
'/****/**/******_1102/00575***...**2CT.csv',
'/****/**/******_1102/00575***...**3CT.csv',
'/****/**/******_1102/00575***...**4CT.csv',
'/****/**/******_1102/00578***...**1CT.csv',
'/****/**/******_1102/00578***...**2CT.csv',
'/****/**/******_1102/00578***...**3CT.csv',
...]
What I want to do is, for each dictionary, group together the paths whose folder contains that dictionary's key and whose filename contains one of its values.
I tried using for loops like this:
grpd_cts = []
for elem in tqdm(dict_list):
    temp1 = []
    for file in ct_paths:
        for key, val in elem.items():
            if (file[16:20] == key) and (any(x in file[21:26] for x in val)):
                temp1.append(file)
    grpd_cts.append(temp1)
But this takes around 30 hours. Is there a way to make it more efficient? Any itertools function or something?
Thanks a lot!
ct_paths is iterated over and over in your inner loop, and only two small slices of each path matter for the test; pull those slices out once and use them to index the paths in a dictionary.
What makes your problem a little more complicated is that you want to end up with the original filenames, so you need to build a two-level dictionary whose values are lists of all the original paths grouped under those two keys.
ct_path_index = {}
for f in ct_paths:
    ct_path_index.setdefault(f[16:20], {}).setdefault(f[21:26], []).append(f)
grpd_cts = []
for elem in tqdm(dict_list):
    temp1 = []
    for key, val in elem.items():
        d2 = ct_path_index.get(key)
        if d2:
            for v in val:
                v2 = d2.get(v)
                if v2:
                    temp1 += v2
    grpd_cts.append(temp1)
ct_path_index looks like this, using your data:
{'1102': {'00575': ['/****/**/******_1102/00575***...**0CT.csv',
'/****/**/******_1102/00575***...**1CT.csv',
'/****/**/******_1102/00575***...**2CT.csv',
'/****/**/******_1102/00575***...**3CT.csv',
'/****/**/******_1102/00575***...**4CT.csv'],
'00578': ['/****/**/******_1102/00578***...**1CT.csv',
'/****/**/******_1102/00578***...**2CT.csv',
'/****/**/******_1102/00578***...**3CT.csv']}}
The use of setdefault (which can be a little hard to understand the first time you see it) is important when building up collections of collections, and is very common in these kinds of cases: it makes sure that the sub-collections are created on demand and then re-used for a given key.
Now, you've only got two nested loops; the inner checks are done using dictionary lookups, which are close to O(1).
Other optimizations would include turning the lists in dict_list into sets, which would be worthwhile if you made more than one pass through dict_list.
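A minimal sketch of that optimization, done once up front (only worthwhile if dict_list is scanned more than once):

# convert each value list to a set, so membership tests become O(1)
dict_list = [{key: set(val) for key, val in elem.items()} for elem in dict_list]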
I have a large dictionary like this:
{'T096': ['AABBAACACCAADBAACCAAAACACCBBAE'],
'T122': ['AACAABBBCABBDAACABCACABBAAACCCCAE'],
'T154': ['ABAABDBAAACCCAE'],
'T169': ['CCDBAACCAAAACCAE'],
'T176': ['BDACACCAAAACCCCAE'],
'T183': ['BBAACAACCBDAACBACCCCAAAACCAE'],
'T197': ['AABBAACABABACABAAABDBAAACCAE'],
'T178': ['CACDBAAACCCAE'],
'T067': ['BACAACCDBAABACCACCAAABCCACACCCCAE']}
I use a library that accepts a list of strings and does some processing (in this example generates common prefixes).
So I extract all values from the above dict.
Once the processing is done (in this example I get the prefixes), I am interested in assigning those processed strings (here, common prefixes) back to the keys in the dict.
E.g. something like this:
{'T096': ['AAB'],
'T197': ['AAB']} # in the rest, there are no common prefixes, so ignored, or set to None
Now I need to be able to retain the mapping between the keys and values, even after extracting all the values. Is there a way to do that elegantly?
Edit: Replaced "get prefix" with "string processing" to reflect the broader scope of the problem!
Use a dictionary comprehension with collections.Counter:
from collections import Counter
d = {'T096': ['AABBAACACCAADBAACCAAAACACCBBAE'],
'T122': ['AACAABBBCABBDAACABCACABBAAACCCCAE'],
'T154': ['ABAABDBAAACCCAE'],
'T169': ['CCDBAACCAAAACCAE'],
'T176': ['BDACACCAAAACCCCAE'],
'T183': ['BBAACAACCBDAACBACCCCAAAACCAE'],
'T197': ['AABBAACABABACABAAABDBAAACCAE'],
'T178': ['CACDBAAACCCAE'],
'T067': ['BACAACCDBAABACCACCAAABCCACACCCCAE']}
# count how many entries share each 3-character prefix
c = Counter(x[0][:3] for x in d.values())

# keep only the keys whose prefix occurs more than once
print({k: v[0][:3] for k, v in d.items() if c[v[0][:3]] > 1})
# {'T096': 'AAB', 'T197': 'AAB'}
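More generally, if the external library returns one processed string per input and preserves order, you can keep the keys in a list and zip the results back on afterwards. A sketch, where process_strings is a hypothetical stand-in for your library call:

keys = list(d)                        # fix an order for the keys
strings = [d[k][0] for k in keys]     # extract the values in that same order
processed = process_strings(strings)  # hypothetical library call, one result per input
result = {k: p for k, p in zip(keys, processed)}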
I have a python script that is working with a large (~14gb) textfile. I end up with a dictionary of keys and values, but I am getting a memory error when I try to sort the dictionary by value.
I know the dictionary is too big to load into memory and then sort, but how could I go about accomplishing this?
You can use an ordered key/value store like wiredtiger, leveldb, or bsddb. All of them keep keys in sorted order. leveldb is the easiest to use, but if you use Python 2.7, bsddb is included in the stdlib. If lexicographic ordering of the keys is enough, you can use the btopen function (a B-tree, so keys stay sorted) to open a persistent sorted dictionary:
from bsddb import btopen

db = btopen('dict.db')
db['020'] = 'twenty'
db['002'] = 'two'
db['value'] = 'value'
db['key'] = 'key'
print(db.keys())
This outputs
['002', '020', 'key', 'value']
Don't forget to close the db after your work:
db.close()
Mind that the default btopen configuration might not suit your needs; in that case I recommend leveldb, which has a simple API, or wiredtiger for speed.
To order by value in bsddb, you have to use the composite key pattern (key composition), which boils down to building a database key that sorts the way you want. In this example we pack the original dict value first (so that small values appear first), followed by the original dict key (so that the bsddb key is unique):
import struct
from bsddb import btopen

my_dict = {'a': 500, 'abc': 100, 'foobar': 1}

# insert
db = btopen('dict.db')
for key, value in my_dict.iteritems():
    # pack the value as a fixed-width, big-endian prefix, then append the key
    composite_key = struct.pack('>Q', value) + key
    db[composite_key] = ''  # the stored value is not useful here, but one is required
db.close()

# read back in sorted order
db = btopen('dict.db')
size = struct.calcsize('>Q')
for key, _ in db.iteritems():  # iterate over the database in key order
    # unpack the composite key back into the original value and key
    value, key = key[:size], key[size:]
    value = struct.unpack('>Q', value)[0]
    print key, value
db.close()
This outputs the following:
foobar 1
abc 100
a 500
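The reason the composite key sorts numerically is that struct.pack('>Q', n) produces fixed-width big-endian bytes, so byte-wise (lexicographic) comparison of the packed prefixes matches numeric comparison of the values. A quick check:

import struct

assert struct.pack('>Q', 1) < struct.pack('>Q', 100) < struct.pack('>Q', 500)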
I have a defaultdict(list) dictionary and I'm trying to access the stored values to perform some operations on them, only I've never had to do this before, so I'm not quite sure how to access them given a list index and a key.
listdict = defaultdict(list)
listdict = {'Cake':['cheesecake','icecream cake','oreo-cheesecake']}
So, for example, say I wanted to use the 'Cake' key to access the 'oreo-cheesecake' string at index 2 in the list.
You are overwriting your defaultdict with a plain dict on the second line. A defaultdict mostly works like a normal dict. We set elements:
listdict = defaultdict(list)
listdict['Cake'] = ['cheesecake','icecream cake','oreo-cheesecake']
And we recover them:
print listdict['Cake'][2]
'oreo-cheesecake'
But you can do:
listdict['nonexistent'].append('stuff')
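That works because looking up a missing key on a defaultdict calls the list factory first, so the append lands in a freshly created empty list:

print listdict['nonexistent']   # ['stuff'] -- the empty list was created on first access
print listdict['missing']       # [] -- even a bare lookup inserts an empty list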
I am currently working with a dataframe consisting of a column of 13-letter strings ('13mer') paired with ID codes ('Accession').
However, I would like to create a dictionary in which the Accession codes are the keys and the values are the 13mers associated with that accession, so that it looks as follows:
{'JO2176': ['IGY....', 'QLG...', 'ESS...', ...],
'CYO21709': ['IGY...', 'TVL...',.............],
...}
Which I've accomplished using this code:
Accession_13mers = {}
for group in grouped:
    Accession_13mers[group[0]] = []
    for item in group[1].iteritems():
        Accession_13mers[group[0]].append(item[1])
However, now I would like to go back, iterate through the keys for each Accession code, and run a function I've defined as find_match_position(reference_sequence, 13mer), which finds the 13mer in a reference sequence and returns its position. I would then like to store the position as the value for the 13mer, which will be the key.
If anyone has any ideas for how I can expedite this process that would be extremely helpful.
Thanks,
Justin
I would suggest creating a new dictionary whose values are themselves dictionaries; essentially a nested dictionary.
position_nmers = {}
for key in H1_Access_13mers:
    position_nmers[key] = {}  # one sub-dictionary per accession
    for value in H1_Access_13mers[key]:
        # e.g. the find_match_position function from the question
        position_nmers[key][value] = find_match_position(reference_sequence, value)
To introspect the dictionary and make sure it's okay:
print position_nmers
You can iterate over the groupby more cleanly by unpacking:
d = {}
for key, s in df.groupby('Accession')['13mer']:
    d[key] = list(s)
This also makes it much clearer where you should put your function!
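For instance, a sketch that plugs in the find_match_position function and reference_sequence from the question (both assumed to be defined already):

d = {}
for key, s in df.groupby('Accession')['13mer']:
    # map each 13mer in this accession group to its position in the reference sequence
    d[key] = {mer: find_match_position(reference_sequence, mer) for mer in s}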
... However, I think that it might be better suited to an enumerate:
d2 = {}
for pos, val in enumerate(df['13mer']):
    d2[val] = pos