I am currently working with a dataframe consisting of a column of 13-letter strings ('13mer') paired with ID codes ('Accession').
However, I would like to create a dictionary in which the Accession codes are the keys with values being the 13mers associated with the accession so that it looks as follows:
{'JO2176': ['IGY....', 'QLG...', 'ESS...', ...],
'CYO21709': ['IGY...', 'TVL...',.............],
...}
Which I've accomplished using this code:
Accession_13mers = {}
for group in grouped:
    Accession_13mers[group[0]] = []
    for item in group[1].iteritems():
        Accession_13mers[group[0]].append(item[1])
However, now I would like to go back through and iterate through the keys for each Accession code and run a function I've defined as find_match_position(reference_sequence, 13mer), which finds the 13mer in a reference sequence and returns its position. I would then like to append the position as the value for the 13mer, which will be the key.
If anyone has any ideas for how I can expedite this process that would be extremely helpful.
Thanks,
Justin
I would suggest creating a new dictionary whose values are themselves dictionaries. Essentially a nested dictionary.
position_nmers = {}
for key in H1_Access_13mers:
    position_nmers[key] = {}  # replicate each key in the new dictionary, with a dict as its value
    for value in H1_Access_13mers[key]:
        position_nmers[key][value] = ...  # do something: compute the value's position here
To introspect the dictionary and make sure it's okay:
print position_nmers
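To fill in the "do something" step, you could call your function directly (a sketch assuming the find_match_position and reference_sequence from your question):
position_nmers = {}
for key in H1_Access_13mers:
    position_nmers[key] = {}
    for value in H1_Access_13mers[key]:
        # map each 13mer to its position in the reference sequence
        position_nmers[key][value] = find_match_position(reference_sequence, value)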
You can iterate over the groupby more cleanly by unpacking:
d = {}
for key, s in df.groupby('Accession')['13mer']:
    d[key] = list(s)
This also makes it much clearer where you should put your function!
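For instance, you could build the nested mapping in one pass (a sketch, again assuming the find_match_position and reference_sequence from the question):
d = {}
for key, s in df.groupby('Accession')['13mer']:
    # map each 13mer to its position in the reference sequence
    d[key] = {mer: find_match_position(reference_sequence, mer) for mer in s}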
... However, I think that it might be better suited to an enumerate:
d2 = {}
for pos, val in enumerate(df['13mer']):
    d2[val] = pos
Sorry if this is trivial, I'm still learning, but I have a list of dictionaries that looks as follows:
[{'1102': ['00576', '00577', '00578', '00579', '00580', '00581']},
{'1102': ['00582', '00583', '00584', '00585', '00586', '00587']},
{'1102': ['00588', '00589', '00590', '00591', '00592', '00593']},
{'1102': ['00594', '00595', '00596', '00597', '00598', '00599']},
{'1102': ['00600', '00601', '00602', '00603', '00604', '00605']}
...]
It contains ~89,000 dictionaries. I also have a list containing 4,473,208 paths. Example:
['/****/**/******_1102/00575***...**0CT.csv',
'/****/**/******_1102/00575***...**1CT.csv',
'/****/**/******_1102/00575***...**2CT.csv',
'/****/**/******_1102/00575***...**3CT.csv',
'/****/**/******_1102/00575***...**4CT.csv',
'/****/**/******_1102/00578***...**1CT.csv',
'/****/**/******_1102/00578***...**2CT.csv',
'/****/**/******_1102/00578***...**3CT.csv',
...]
What I want to do is group together the paths whose folder matches a dictionary's key and whose filename contains one of that key's grouped values.
I tried using for loops like this:
grpd_cts = []
for elem in tqdm(dict_list):
    temp1 = []
    for file in ct_paths:
        for key, val in elem.items():
            if (file[16:20] == key) and (any(x in file[21:26] for x in val)):
                temp1.append(file)
    grpd_cts.append(temp1)
but this takes around 30 hours. Is there a way to make it more efficient? Any itertools function or something?
Thanks a lot!
ct_paths is iterated repeatedly in your inner loop, and you're only interested in a little bit of it for testing purposes; pull that out and use it to index the rest of your data, as a dictionary.
What makes your problem complicated is that you want to end up with the original list of filenames, so you need to construct a two-level dictionary whose values are lists of all the originals grouped under those two keys.
ct_path_index = {}
for f in ct_paths:
    ct_path_index.setdefault(f[16:20], {}).setdefault(f[21:26], []).append(f)

grpd_cts = []
for elem in tqdm(dict_list):
    temp1 = []
    for key, val in elem.items():
        d2 = ct_path_index.get(key)
        if d2:
            for v in val:
                v2 = d2.get(v)
                if v2:
                    temp1 += v2
    grpd_cts.append(temp1)
ct_path_index looks like this, using your data:
{'1102': {'00575': ['/****/**/******_1102/00575***...**0CT.csv',
'/****/**/******_1102/00575***...**1CT.csv',
'/****/**/******_1102/00575***...**2CT.csv',
'/****/**/******_1102/00575***...**3CT.csv',
'/****/**/******_1102/00575***...**4CT.csv'],
'00578': ['/****/**/******_1102/00578***...**1CT.csv',
'/****/**/******_1102/00578***...**2CT.csv',
'/****/**/******_1102/00578***...**3CT.csv']}}
The use of setdefault (which can be a little hard to understand the first time you see it) is important when building up collections of collections, and is very common in these kinds of cases: it makes sure that the sub-collections are created on demand and then re-used for a given key.
Now, you've only got two nested loops; the inner checks are done using dictionary lookups, which are close to O(1).
Other optimizations would include turning the lists in dict_list into sets, which would be worthwhile if you made more than one pass through dict_list.
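A one-time conversion pass might look like this (a sketch; sets drop duplicates and make membership tests O(1), and the loop above iterates over them just as well):
# convert each dict's value lists to sets for O(1) membership tests
dict_list = [{key: set(val) for key, val in elem.items()} for elem in dict_list]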
I am trying to find a way to remove duplicates from a dict list. I don't have to test the entire object contents because the "name" value in a given object is enough to identify duplication (i.e., duplicate name = duplicate object). My current attempt is this:
newResultArray = []
for i in range(0, len(resultArray)):
    for j in range(0, len(resultArray)):
        if(i != j):
            keyI = resultArray[i]['name']
            keyJ = resultArray[j]['name']
            if(keyI != keyJ):
                newResultArray.append(resultArray[i])
which is wildly incorrect. Grateful for any suggestions. Thank you.
If name is unique, you should just use a dictionary to store your inner dictionaries, with name being the key. Then you won't even have the issue of duplicates, and you can remove entries in O(1) time.
Since I don't have access to the code that populates resultArray, I'll simply show how you can convert it into a dictionary in linear time. Although the best option would be to use a dictionary instead of resultArray in the first place, if possible.
new_dictionary = {}
for item in resultArray:
    new_dictionary[item['name']] = item
If you must have a list in the end, then you can convert back into a list as such:
new_list = [v for k,v in new_dictionary.items()]
Since "name" provides uniqueness... and assuming "name" is a hashable object, you can build an intermediate dictionary keyed by "name". Any like-named dicts will simply overwrite their predecessor in the dict, giving you a list of unique dictionaries.
tmpDict = {result["name"]:result for result in resultArray}
newArray = list(tmpDict.values())
del tmpDict
You could shrink that down to
newArray = list({result["name"]:result for result in resultArray}.values())
which may be a bit obscure.
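For example, with some hypothetical data (insertion order shown for Python 3.7+):
resultArray = [
    {'name': 'a', 'x': 1},
    {'name': 'b', 'x': 2},
    {'name': 'a', 'x': 3},  # duplicate name; this one overwrites the first 'a'
]
newArray = list({result["name"]: result for result in resultArray}.values())
# -> [{'name': 'a', 'x': 3}, {'name': 'b', 'x': 2}]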
I may be missing something fundamental here but consider the following:
graph = nx.read_graphml('path here...')
dDict = dict(nx.degree_centrality(graph))  # create dict
lDict = dict(nx.load_centrality(graph))
new_lists = [dDict, lDict]
for i in new_lists:
    print(...)  # how do I get the variable name, i.e. dDict?
How do I iterate through the list of dicts so that when I print, it returns the name of the variable each dict is assigned to, i.e. I want to be able to retrieve back 'dDict' and 'lDict'? I do not want a quick hack such as
dDict['name'] = 'dDict'
Any ideas..?
EDIT: The reason I want to do this is so that I can append these centrality measures to a dataframe with a new column name, i.e.:
for idx in range(len(new_lists)):
    for i in range(len(df)):
        rowIndex = df.index[i]
        df.loc[rowIndex, idx] = new_lists[idx][rowIndex]  # instead of idx, how do I dynamically name the column according to the list item name?
You can iterate over globals() and get the variable name of the object that matches the content you are looking for.
More info on https://docs.python.org/3/library/functions.html?highlight=globals#globals
However, this is a rather cumbersome trick, you should not do that! Rather, redesign your software so you don't have to look for the variable names in the first place.
def get_var_name_of(x):
    return [k for k, v in globals().items() if v == x][0]

dDict = {1: 'asd'}
lDict = {2: 'qwe'}
new_list = [dDict, lDict]
for d in new_list:
    print(get_var_name_of(d))
dDict
lDict
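If the end goal is the dataframe columns from your EDIT, a simpler redesign is to carry the names alongside the dicts instead of recovering them from globals() (a sketch assuming pandas and a df indexed by node id):
import pandas as pd

# pair each measure with its intended column name up front
measures = {'dDict': dDict, 'lDict': lDict}
for name, d in measures.items():
    df[name] = pd.Series(d)  # aligns on the index; assumes df is indexed by node id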
I'm extracting instances of three elements from an XML file: ComponentStr, keyID, and valueStr. Whenever I find a ComponentStr, I want to add/associate the keyID:valueStr to it. ComponentStr values are not unique. As multiple occurrences of a ComponentStr are read, I want to accumulate the keyID:valueStr pairs for that ComponentStr group. The resulting accumulated data structure after reading the XML file might look like this:
ComponentA: key1:value1, key2:value2, key3:value3
ComponentB: key4:value4
ComponentC: key5:value5, key6:value6
After I generate the final data structure, I want to sort the keyID:valueStr entries within each ComponentStr and also sort all the ComponentStrs.
I'm trying to structure this data in Python 2. ComponentStr values seem to work well as a set. The keyID:valueStr is clearly a dict. But how do I associate a ComponentStr entry in a set with its dict entries?
Alternatively, is there a better way to organize this data besides a set and associated dict entries? Each keyID is unique. Perhaps I could have one dict of keyID:some combo of ComponentStr and valueStr? After the data structure was built, I could sort it based on ComponentStr first, then perform some type of slice to group the keyID:valueStr and then sort again on the keyID? Seems complicated.
How about a dict of dicts?
data = {
'ComponentA': {'key1':'value1', 'key2':'value2', 'key3':'value3'},
'ComponentB': {'key4':'value4'},
'ComponentC': {'key5':'value5', 'key6':'value6'},
}
It maintains your data structure and mapping. Interestingly enough, the underlying implementation of dicts is similar to the implementation of sets.
This would be easily constructed à la this pseudo-code:
data = {}
for file in files:
    data[get_component(file)] = {}
    for key, value in get_data(file):
        data[get_component(file)][key] = value
in the case where you have repeated components, you need to have the sub-dict as the default, but add to the previous one if it's there. I prefer setdefault to other solutions like a defaultdict or subclassing dict with a __missing__ as long as I only have to do it once or twice in my code:
data = {}
for file in files:
    for key, value in get_data(file):
        data.setdefault(get_component(file), {})[key] = value
It works like this:
>>> d = {}
>>> d.setdefault('foo', {})['bar'] = 'baz'
>>> d
{'foo': {'bar': 'baz'}}
>>> d.setdefault('foo', {})['ni'] = 'ichi'
>>> d
{'foo': {'ni': 'ichi', 'bar': 'baz'}}
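For comparison, the defaultdict alternative mentioned above would look like this (using the same get_component/get_data pseudo-helpers):
from collections import defaultdict

data = defaultdict(dict)  # missing components get an empty sub-dict on first access
for file in files:
    for key, value in get_data(file):
        data[get_component(file)][key] = value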
Alternatively, since your comment on the other answer says you need simple code, you can keep it really simple with some more verbose and less optimized code:
data = {}
for file in files:
    for key, value in get_data(file):
        if get_component(file) not in data:
            data[get_component(file)] = {}
        data[get_component(file)][key] = value
You can then sort when you're done collecting the data.
for component in sorted(data):
    print(component)
    print('-----')
    for key in sorted(data[component]):
        print(key, data[component][key])
I want to accumulate the keyID:valueStr for that ComponentStr group
In this case you want the keys of your dictionary to be the ComponentStr. "Accumulating" immediately suggests a list to me, and lists are easily sorted.
Each keyID is unique. Perhaps I could have one dict of keyID:some
combo of ComponentStr and valueStr?
You should store your data in a manner that is the most efficient when you want to retrieve it. Since you will be accessing your data by the component, even though your keys are unique there is no point in having a dictionary that is accessed by your key (since this is not how you are going to "retrieve" the data).
So, with that - how about using a defaultdict with a list, since you really want all items associated with the same component:
from collections import defaultdict

d = defaultdict(list)
with open('somefile.xml', 'r') as f:
    for component, key, value in parse_xml(f):
        d[component].append((key, value))
Now you have for each component, a list of tuples which are the associated key and values.
If you want to keep the components in the order they are read from the file, you can use an OrderedDict (also from the collections module), but if you want to sort them in any arbitrary order, then stick with a normal dictionary.
To get a list of sorted component names, just sort the keys of the dictionary:
component_sorted = sorted(d.keys())
For a use case of printing the sorted components with their associated key/value pairs, sorted by their keys:
for key in component_sorted:
    values = d[key]
    sorted_values = sorted(values, key=lambda x: x[0])  # sort by the keys
    print('Pairs for {}'.format(key))
    for k, v in sorted_values:
        print('{} {}'.format(k, v))
I have a django queryset that returns a list of values:
[(client pk, timestamp, value, task pk), (client pk, timestamp, value, task pk),....,].
I am trying to get it to return a dictionary of this format:
{client pk:[[timestamp, value],[timestamp, value],...,], client pk:[list of lists],...,}
The values_list may have multiple records for each client pk. I have been able to get dictionaries of lists for client or task pk using:
def dict_from_vl(vls_list):
    keys = [vls_list[x][3] for x in range(len(vls_list))]
    values = [[vls_list[x][1], vls_list[x][2]] for x in range(len(vls_list))]
    target_dict = dict(zip(keys, values))
    return target_dict
However, using this method, values for the same key overwrite previous values as it iterates through the values list, rather than being appended to a list. So this works great for getting the most recent record if the values list is sorted oldest to newest, but not for creating a list of lists for the dict value.
Instead of target_dict=dict(zip(keys,values)), do
target_dict = defaultdict(list)
for i, key in enumerate(keys):
    target_dict[key].append(values[i])
(defaultdict is available in the standard module collections.)
from collections import defaultdict

d = defaultdict(list)
for x in vls_list:
    d[x[0]].append([x[1], x[2]])
Although I'm not sure if I got the question right.
I know in Python you're supposed to cram everything into a single line, but you could do it the old fashioned way...
def dict_from_vl(vls_list):
    target_dict = {}
    for v in vls_list:
        if v[0] not in target_dict:
            target_dict[v[0]] = []
        target_dict[v[0]].append([v[1], v[2]])
    return target_dict
For better speed, I suggest you don't create the keys and values lists separately but simply use one loop:
from collections import defaultdict

tgt_dict = defaultdict(list)
for row in vls_list:
    tgt_dict[row[0]].append([row[1], row[2]])
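For example, with a small hypothetical values list:
from collections import defaultdict

vls_list = [
    (1, '2021-01-01', 10.0, 7),  # (client pk, timestamp, value, task pk)
    (1, '2021-01-02', 12.5, 8),
    (2, '2021-01-01', 3.0, 9),
]
tgt_dict = defaultdict(list)
for row in vls_list:
    tgt_dict[row[0]].append([row[1], row[2]])
print(dict(tgt_dict))
# {1: [['2021-01-01', 10.0], ['2021-01-02', 12.5]], 2: [['2021-01-01', 3.0]]}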