Input
df
A B
a 23
b,c 34
d,e,%f 30
Goal
df_dct = {'a':23,'b':34,'c':34,'d':'30','e':'30','f':30}
The details as below:
A as keys , B as values
The values in A is string and some is grouped by ','
The keys comes from by spliting ',' , and should replace all '%' and space.
Try
I know using zip to get dict from two dataframes but could not handle spliting.
You can use df.explode() for pandas >= 0.25 with df.to_dict():
In [32]: df.A = df.A.str.replace("%", "")
In [42]: df_dct = df.assign(var1=df['A'].str.split(',')).explode('var1').drop('A', 1).set_index('var1').to_dict()['B']
In [43]: df_dct
Out[43]: {'a': 23, 'b': 34, 'c': 34, 'd': 30, 'e': 30, 'f': 30}
Remove the percentage from column A with str replace
df["A"] = df.A.str.replace("%", "")
Use itertools' product to get the pairing of each element in A and B for each row, then combine them into one list, using chain
from itertools import product, chain
#apply dict to get your final result
dict(chain.from_iterable((product(A.split(","),[B])) for A,B in df.to_numpy()))
{'a': 23, 'b': 34, 'c': 34, 'd': 30, 'e': 30, 'f': 30}
Related
This question already has answers here:
How to merge dicts, collecting values from matching keys?
(17 answers)
Closed 6 days ago.
I am looking to combine two dictionaries by grouping elements that share common keys, but I would also like to account for keys that are not shared between the two dictionaries. For instance given the following two dictionaries.
d1 = {'a':1, 'b':2, 'c': 3, 'e':5}
d2 = {'a':11, 'b':22, 'c': 33, 'd':44}
The intended code would output
df = {'a':[1,11] ,'b':[2,22] ,'c':[3,33] ,'d':[0,44] ,'e':[5,0]}
Or some array like:
df = [[a,1,11] , [b,2,22] , [c,3,33] , [d,0,44] , [e,5,0]]
The fact that I used 0 specifically to denote an entry not existing is not important per se. Just any character to denote the missing value.
I have tried using the following code
df = defaultdict(list)
for d in (d1, d2):
for key, value in d.items():
df[key].append(value)
But get the following result:
df = {'a':[1,11] ,'b':[2,22] ,'c':[3,33] ,'d':[44] ,'e':[5]}
Which does not tell me which dict was missing the entry.
I could go back and look through both of them, but was looking for a more elegant solution
You can use a dict comprehension like so:
d1 = {'a':1, 'b':2, 'c': 3, 'e':5}
d2 = {'a':11, 'b':22, 'c': 33, 'd':44}
res = {k: [d1.get(k, 0), d2.get(k, 0)] for k in set(d1).union(d2)}
print(res)
Another solution:
d1 = {"a": 1, "b": 2, "c": 3, "e": 5}
d2 = {"a": 11, "b": 22, "c": 33, "d": 44}
df = [[k, d1.get(k, 0), d2.get(k, 0)] for k in sorted(d1.keys() | d2.keys())]
print(df)
Prints:
[['a', 1, 11], ['b', 2, 22], ['c', 3, 33], ['d', 0, 44], ['e', 5, 0]]
If you do not want sorted results, leave the sorted() out.
I have a list which contains multiple dict and all the dict have same key but different values. I want to get a dict which has maximum value for a particular key.
l = [{'a':23, 'b': 64, 'c':4},{'a':83, 'b': 34, 'c':47}]
I want to get a dict which has maximum value for 'b'.
Use max with a custom key:
>>> l = [{'a':23, 'b': 64, 'c':4},{'a':83, 'b': 34, 'c':47}]
>>> max(l, key=lambda x: x["b"])
{'a': 23, 'b': 64, 'c': 4}
I have a dataframe such that the column contains both json objects and strings. I want to get rid of rows that does not contains json objects.
Below is how my dataframe looks like :
import pandas as pd
df = pd.DataFrame({'A': ["hello","world",{"a":5,"b":6,"c":8},"usa","india",{"a":9,"b":10,"c":11}]})
print(df)
How should i remove the rows that contains only strings, so that after removing those string rows, I can apply below to this column to convert json object into separate columns of dataframe:
from pandas.io.json import json_normalize
df = json_normalize(df['A'])
print(df)
I think I would prefer to use an isinstance check:
In [11]: df.loc[df.A.apply(lambda d: isinstance(d, dict))]
Out[11]:
A
2 {'a': 5, 'b': 6, 'c': 8}
5 {'d': 9, 'e': 10, 'f': 11}
If you want to include numbers too, you can do:
In [12]: df.loc[df.A.apply(lambda d: isinstance(d, (dict, np.number)))]
Out[12]:
A
2 {'a': 5, 'b': 6, 'c': 8}
5 {'d': 9, 'e': 10, 'f': 11}
Adjust this to whichever types you want to include...
The last step, json_normalize takes a list of json objects, for whatever reason a Series is no good (and gives the KeyError), you can make this a list and your good to go:
In [21]: df1 = df.loc[df.A.apply(lambda d: isinstance(d, (dict, np.number)))]
In [22]: json_normalize(list(df1["A"]))
Out[22]:
a b c d e f
0 5.0 6.0 8.0 NaN NaN NaN
1 NaN NaN NaN 9.0 10.0 11.0
df[df.applymap(np.isreal).sum(1).gt(0)]
Out[794]:
A
2 {'a': 5, 'b': 6, 'c': 8}
5 {'d': 9, 'e': 10, 'f': 11}
If you want an ugly solution that also works...here's a function I created that finds columns that contain only strings, and returns the df minus those rows. (since your df has only one column, you'll just dataframe containing 1 column with all dicts). Then, from there, you'll want to use
df = json_normalize(df['A'].values) instead of just df = json_normalize(df['A']).
For a single column dataframe...
import pandas as pd
import numpy as np
from pandas.io.json import json_normalize
def delete_strings(df):
nrows = df.shape[0]
rows_to_keep = []
for row in np.arange(nrows):
if type(df.iloc[row,0]) == dict:
rows_to_keep.append(row) #add the row number to list of rows
#to keep if the row contains a dict
return df.iloc[rows_to_keep,0] #return only rows with dicts
df = pd.DataFrame({'A': ["hello","world",{"a":5,"b":6,"c":8},"usa","india",
{"a":9,"b":10,"c":11}]})
df = delete_strings(df)
df = json_normalize(df['A'].values)
print(df)
#0 {'a': 5, 'b': 6, 'c': 8}
#1 {'a': 9, 'b': 10, 'c': 11}
For a multi-column df (also works with a single column df):
def delete_rows_of_strings(df):
rows = df.shape[0] #of rows in df
cols = df.shape[1] #of coluns in df
rows_to_keep = [] #list to track rows to keep
for row in np.arange(rows): #for every row in the dataframe
#num_string will count the number of strings in the row
num_string = 0
for col in np.arange(cols): #for each column in the row...
#if the value is a string, add one to num_string
if type(df.iloc[row,col]) == str:
num_string += 1
#if num_string, the number of strings in the column,
#isn't equal to the number of columns in the row...
if num_string != cols: #...add that row number to the list of rows to keep
rows_to_keep.append(row)
#return the df with rows containing at least one non string
return(df.iloc[rows_to_keep,:])
df = pd.DataFrame({'A': ["hello","world",{"a":5,"b":6,"c":8},"usa","india"],
'B' : ['hi',{"a":5,"b":6,"c":8},'sup','america','china']})
# A B
#0 hello hi
#1 world {'a': 5, 'b': 6, 'c': 8}
#2 {'a': 5, 'b': 6, 'c': 8} sup
print(delete_rows_of_strings(df))
# A B
#1 world {'a': 5, 'b': 6, 'c': 8}
#2 {'a': 5, 'b': 6, 'c': 8} sup
I have test data which is gathered based on multiple inputs, and results in a single output. I'm currently storing this data in a dictionary whose keys are my parameter/ results labels, and whose values are the test conditions and results. I would like to be able to filter the data so I can generate plots based on isolated conditions.
In my example below, my test conditions would be 'a' and 'b', and the result of the experiment would be 'c'. I want to filter my data so I get a dictionary with the same key, value structure and only my filtered results. However my current dictionary comprehension returns an empty dictionary. Any advice to get the desired result?
Current Code:
data = {'a': [0, 1, 2, 0, 1, 2], 'b': [10, 10, 10, 20, 20, 20], 'c': [1.3, 1.9, 2.3, 2.3, 2.9, 3.4]}
filtered_data = {k:v for k,v in data.iteritems() if v in data['b'] >= 20}
Desired Result:
{'a': [0, 1, 2], 'b': [20, 20, 20], 'c': [2.3, 2.9, 3.4]}
Current Result:
{}
Also, is this dictionary of lists a good schema to store data of this type, given that I'm going to want to filter the results, or is there a better way to accomplish this?
use this:
k:[v[i] for i,x in enumerate(v) if data['b'][i] >= 20] for k,v in data.items()}
Desired Result:
{'a': [0, 1, 2], 'c': [2.3, 2.9, 3.4], 'b': [20, 20, 20]}
Consider using the pandas module for this type of work.
import pandas as pd
df = pd.DataFrame(data)
df = df[df["b"] >= 20]
print(df)
It appears like this will give you what you want. You are using the dictionary key to represent the column name and the values are just rows in a given column, so it is amenable to using a dataframe.
Result:
a b c
3 0 20 2.3
4 1 20 2.9
5 2 20 3.4
Are all of the dictionary value lists in matching orders? If so, you could just look at whichever list you want to filter by, say 'b' in this case, find the values you want, and then either use those indices or the same slice on the other values in the dictionary.
For example:
matching_indices = []
for i in data['b']:
if data['b'][i] >= 20:
matching_indices.append(i)
new_dict = {}
for key in data:
for item in matching_indices:
new_dict[key] = data[key][item]
You could probably figure a dictionary comprehension for it if you wanted. Hopefully this is clear.
you can change this into a method which would give it more flexibility. Your current logic means that dataset a and c are neglected because there are no values greater than or equal to 20:
data = {'a': [0, 1, 2, 0, 1, 2], 'b': [10, 10, 10, 20, 20, 20], 'c': [1.3, 1.9, 2.3, 2.3, 2.9, 3.4]}
filter_vals = ['a', 'b']
new_d = {}
for k, v in data.iteritems():
if k in filter_vals:
new_d[k] = [i for i in v if i >= 20]
print new_d
Now i'm not a big fan if many if statements, but something like this is straight forward and can be called many times
def my_filter(operator, condition, filter_vals, my_dict):
new_d = {}
for k, v in my_dict.iteritems():
if k in filter_vals:
if operator == '>':
new_d[k] = [i for i in v if i > condition]
elif operator == '<':
new_d[k] = [i for i in v if i < condition]
elif operator == '<=':
new_d[k] = [i for i in v if i <= condition]
elif operator == '>=':
new_d[k] = [i for i in v if i >= condition]
return new_d
I agree with the pandas approach above.
If for some reason you hate pandas or are an old school computer scientist, tuples are a good way to tore relational data. In your example, the a, b, and c lists are columns rather than rows. For tuples, you would want to store the rows as:
data = {'a':(0,10,1.3),'b':(1,10,1.9),'c':(2,10,2.3),'d':(0,20,2.3),'e':(1,20,2.9),'f':(2,20,3.4)}
where the tuples are stored in the (condition1, condition2, outcome) format you described and you can call a single test or filter a set as you describe. From there you can get a filtered set of results as follows:
filtered_data = {k:v for k,v in data.iteritems() if v[1]>=20}
which returns:
{'d': (0, 20, 2.3), 'e': (1, 20, 2.9), 'f': (2, 20, 3.4)}
I have a dictionary with multiple values under multiple keys. I do NOT want a single sum of the values. I want to find a way to find the sum for each key.
The file is tab delimited, with an identifier being a combination of two of these items, Btarg. There are multiple values for each of these identifiers.
Here is a test file:
Here is a test file with the desired result below:
Pattern Item Abundance
1 Ant 2
2 Dog 10
3 Giraffe 15
1 Ant 4
2 Dog 5
Here is the expected results:
Pattern1Ant, 6
Pattern2Dog, 15
Pattern3Giraffe, 15
This is what I have so far:
for line in K:
if "pattern" in line:
find = line
Bsplit = find.split("\t")
Buid = Bsplit[0]
Borg = Bsplit[1]
Bnum = (Bsplit[2])
Btarg = Buid[:-1] + "//" + Borg
if Btarg not in dict1:
dict1[Btarg] = []
dict1[Btarg].append(Bnum)
#The following used to work
#for key in dict1.iterkeys():
#dict1[key] = sum(dict1[key])
#print (dict1)
How do I make this work in Python 3 without the error message "Unsupported operand type(s) for +: 'int' and 'list'?
Thanks in advance!
Use from collections import Counter
From the documentation:
c = Counter('gallahad')
Counter({'a': 3, 'l': 2, 'h': 1, 'g': 1, 'd': 1})
Responding to your comment, now I think I know what you want, although I don't know what structure you have your data in. I will take for granted that you can organize your data like this:
In [41]: d
Out[41]: [{'Ant': 2}, {'Dog': 10}, {'Giraffe': 15}, {'Ant': 4}, {'Dog': 5}]
First create a defaultdict
from collections import defaultdict
a = defaultdict(int)
Then start couting:
In [42]: for each in d:
a[each.keys()[0]] += each.values()[0]
Result:
In [43]: a
Out[43]: defaultdict(<type 'int'>, {'Ant': 6, 'Giraffe': 15, 'Dog': 15})
UPDATE 2
Supposing you can get your data in this format:
In [20]: d
Out[20]: [{'Ant': [2, 4]}, {'Dog': [10, 5]}, {'Giraffe': [15]}]
In [21]: from collections import defaultdict
In [22]: a = defaultdict(int)
In [23]: for each in d:
a[each.keys()[0]] =sum(each.values()[0])
....:
In [24]: a
Out[24]: defaultdict(<type 'int'>, {'Ant': 6, 'Giraffe': 15, 'Dog': 15})