I have a dictionary that, for example's sake, looks like
{'a': 1, 'b': 4, 'c': 7}
I have a dataframe that has the same index values as the keys in this dict.
I want to add each value from the dict to the dataframe.
I feel like doing a check for every row of the DF (checking the index value, matching it to the one in the dict, then adding it) is going to be a very slow way, right?
You can use map on the index and assign the result back to a new column:
import pandas as pd

d = {'a': 1, 'b': 4, 'c': 7}
df = pd.DataFrame({'c': [1, 2, 3]}, index=['a', 'b', 'c'])
df['new_col'] = df.index.map(d)  # look up each index label in the dict
prints:
   c  new_col
a  1        1
b  2        4
c  3        7
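One thing to watch for: if the dict is missing some of the index labels, map leaves NaN in those rows. A minimal sketch of handling that, assuming 0 is an acceptable default (d_partial is a hypothetical dict for illustration):

d_partial = {'a': 1, 'b': 4}             # no entry for 'c'
df['new_col'] = df.index.map(d_partial)  # the 'c' row gets NaN
df['new_col'] = df['new_col'].fillna(0)  # fill missing labels with a default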
I have a dataframe (very large, millions of rows). Here is how it looks:
id value
a1 0:0,1:10,2:0,3:0,4:7
b4 0:5,1:0,2:0,3:0,4:1
c5 0:0,1:3,2:2,3:0,4:0
k2 0:0,1:2,2:0,3:4,4:0
I want to turn those strings into dictionaries, keeping only the key-value pairs where neither the key nor the value is 0. So the desired result is:
id value
a1 {1:10, 4:7}
b4 {4:1}
c5 {1:3, 2:2}
k2 {1:2, 3:4}
How can I do that? When I try to use the dict() function, it raises KeyError: 0:
df["value"] = dict(df["value"])
So I have trouble turning it into a dictionary in the first place.
I have also tried this:
df["value"] = json.loads(df["value"])
but it raises the same error.
This could do the trick, simply using list comprehensions:
import pandas as pd

dt = pd.DataFrame({"id": ["a1", "b4", "c5", "k2"],
                   "value": ["0:0,1:10,2:0,3:0,4:7", "0:5,1:0,2:0,3:0,4:1",
                             "0:0,1:3,2:2,3:0,4:0", "0:0,1:2,2:0,3:4,4:0"]})

def to_dict1(s):
    # keep a pair only if neither side of the ":" is the string "0"
    return [dict([map(int, y.split(":")) for y in x.split(",")
                  if "0" not in y.split(":")]) for x in s]

dt["dict"] = to_dict1(dt["value"])
Another way to obtain the same result would be using regular expressions (the negative lookahead (?!0) rejects any key or value that is a plain 0):
import re

def to_dict2(s):
    # findall returns (key, value) string tuples for the pairs where neither side is 0
    return [dict([map(int, y) for y in re.findall(r"(?!0)(\d):(?!0)(\d+)", x)]) for x in s]
In terms of performance, to_dict1 is almost 20% faster, according to my tests.
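If you want to reproduce the comparison yourself, a minimal timeit sketch (absolute numbers will vary by machine):

import timeit

s = dt["value"]
print(timeit.timeit(lambda: to_dict1(s), number=10000))
print(timeit.timeit(lambda: to_dict2(s), number=10000))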
This code produces the result you want. I made a sample input like the one you provided, and the expected result is printed at the end.
import pandas as pd

df = pd.DataFrame(
    {
        'id': ['a1', 'b4', 'c5', 'k2'],
        'value': ['0:0,1:10,2:0,3:0,4:7', '0:5,1:0,2:0,3:0,4:1', '0:0,1:3,2:2,3:0,4:0', '0:0,1:2,2:0,3:4,4:0']
    }
)

value = []  # temporary list that collects only the key-value pairs without a 0
for i, row in df.iterrows():
    pairs = row['value'].split(',')
    d = dict()
    for pair in pairs:
        k, v = pair.split(':')
        k = int(k)
        v = int(v)
        if (k != 0) and (v != 0):
            d[k] = v
    value.append(d)

df['value'] = pd.Series(value)
print(df)
#    id          value
# 0  a1  {1: 10, 4: 7}
# 1  b4         {4: 1}
# 2  c5   {1: 3, 2: 2}
# 3  k2   {1: 2, 3: 4}
def make_dict(row):
    """Requires a list of "key:value" strings, e.g.
    ["0:0", "1:10", ...]"""
    return {key: val for key, val
            in map(lambda x: map(int, x.split(":")), row)
            if key != 0 and val != 0}

df["value"] = df.value.str.split(",").apply(make_dict)
This is how I would do it:
def string_to_dict(s):
    d = {}
    pairs = s.split(',')                        # get each key:value pair
    for pair in pairs:
        key, value = map(int, pair.split(':'))  # split key from value and cast both to int
        if key and value:                       # skip pairs where the key or the value is zero
            d[key] = value
    return d

df['value'] = df['value'].apply(string_to_dict)
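A quick check on one cell:

print(string_to_dict('0:5,1:0,2:0,3:0,4:1'))
# {4: 1}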
Use a dictionary comprehension to exclude the pairs whose key or value equals zero:

import pandas as pd

df = pd.DataFrame({"id": ["a1", "b4", "c5", "k2"],
                   "value": ["0:0,1:10,2:0,3:0,4:7", "0:5,1:0,2:0,3:0,4:1",
                             "0:0,1:3,2:2,3:0,4:0", "0:0,1:2,2:0,3:4,4:0"]})

for key, row in df.iterrows():
    # build a dict from the non-zero pairs of this row
    df.at[key, 'value'] = {int(k): int(v)
                           for k, v in (x.split(':') for x in row['value'].split(','))
                           if int(k) != 0 and int(v) != 0}
print(df)

output:

   id          value
0  a1  {1: 10, 4: 7}
1  b4         {4: 1}
2  c5   {1: 3, 2: 2}
3  k2   {1: 2, 3: 4}
I have a dataframe in which one column contains a dictionary in every row. I want to select the rows whose dictionary contains a specific value, no matter which key holds it.
The dictionaries are deeply nested: they contain lists of dictionaries, which again contain lists, and so on.
The data could look similar to this, but with the dictionaries being more complex:
df = pd.DataFrame({"A": [1,2,3], "B": [{"a":1}, {"b":**specific_value**}, {"c":3}]})
A B
0 1 {'a': 1}
1 2 {'b': 2}
2 3 {'c': 3}
I tried:
df.B.apply(lambda x : 'specific_value' in x.values())
This returns False even for the rows that I know contain the specific value. I am unsure whether that is because of the nesting.
You could use a recursive function to search for the specific value:
import pandas as pd

def nested_find_value(d, needle=4):
    # we assume d is always a list or a dictionary
    haystack = d.values() if isinstance(d, dict) else d
    for hay in haystack:
        if isinstance(hay, (list, dict)):
            yield from nested_find_value(hay, needle)
        else:
            yield hay == needle

def find(d, needle=4):
    return any(nested_find_value(d, needle))

df = pd.DataFrame({"A": [1, 2, 3], "B": [{"a": 1}, {"b": {"d": 4}}, {"c": 3}]})

result = df["B"].apply(find)
print(result)
Output
0 False
1 True
2 False
Name: B, dtype: bool
In the example above the specific value is 4.
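To search for the question's placeholder value instead of 4, pass it as the needle, for example:

mask = df["B"].apply(lambda cell: find(cell, needle="specific_value"))
print(df[mask])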
I have a DataFrame
df = pd.DataFrame({'keywords': [{'a': 3, 'b': 4, 'c': 5}, {'c':1, 'd':2}, {'a':5, 'c':21, 'd':4}, {'b':2, 'c':1, 'g':1, 'h':1, 'i':1}]})
I want to add up the elements across all rows, which should give the result below, without using iterrows:
a: 8
b: 6
c: 28
d: 6
g: 1
h: 1
i: 1
note: no element occurs twice in a single row in the original DataFrame.
Using collections.Counter, you can sum an iterable of Counter objects. Since Counter is a subclass of dict, you can then feed the result to pd.DataFrame.from_dict.
from collections import Counter

# adding Counters merges their keys and sums the counts
counts = sum(map(Counter, df['keywords']), Counter())

res = pd.DataFrame.from_dict(counts, orient='index')
print(res)
0
a 8
b 6
c 28
d 6
g 1
h 1
i 1
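If you'd rather not keep the default 0 as the column label, you can rename it afterwards:

res.columns = ['total']  # give the single column a descriptive name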
Not sure how this compares in terms of performance with #jpp's answer, but I'll give it a shot.
# What we're starting out with
df = pd.DataFrame({'keywords': [{'a': 3, 'b': 4, 'c': 5}, {'c': 1, 'd': 2}, {'a': 5, 'c': 21, 'd': 4}, {'b': 2, 'c': 1, 'g': 1, 'h': 1, 'i': 1}]})

# Turn the column of dictionaries into a DataFrame (one column per key, NaN where a key is absent)
values_df = pd.DataFrame(df["keywords"].values.tolist())

# Sum up each key's column (NaN entries are skipped)
sums = {key: values_df[key].sum() for key in values_df.columns}
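As a side note, the comprehension can be collapsed into pandas' own column-wise sum, which also skips NaN; a minimal sketch:

sums = values_df.sum().astype(int).to_dict()  # {'a': 8, 'b': 6, 'c': 28, 'd': 6, 'g': 1, 'h': 1, 'i': 1}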
I want to create a matrix.
Input:
data = [
{'a': 2, 'g': 1},
{'p': 3, 'a': 5, 'cat': 4}
...
]
Output:
      a  p  cat  g
1st   2  0    0  1
2nd   5  3    4  0
This is my code, but I don't think it's elegant, and it will be very slow when the data gets huge.
Are there any better ways to do this?
Thank you.
data = [
    {'a': 2, 'g': 1},
    {'p': 3, 'a': 5, 'cat': 4}
]

### Get keyword map ###
key_map = set()
for row in data:
    key_map = key_map.union(set(row.keys()))
key_map = list(key_map)  # e.g. ['a', 'p', 'g', 'cat'] (set order is arbitrary)

### Create matrix ###
result = []
for row in data:
    matrix = [0] * len(key_map)
    for k, v in row.items():
        matrix[key_map.index(k)] = v
    result.append(matrix)

print(result)
# [[2, 0, 0, 1], [5, 3, 4, 0]]
Edited
Following #wwii's advice, using Pandas looks good:
from pandas import DataFrame

result = DataFrame(data, index=range(len(data)))
print(result.fillna(0).astype(int).to_numpy().tolist())
# [[2, 0, 1, 0], [5, 4, 0, 3]]
You can use a set comprehension to generate the key_map:
key_map = list({key for row in data for key in row})
Here is a partial answer. I couldn't get the columns into the order specified; that is limited by how the keys happen to be ordered in the set key_map. It uses string formatting to line the data up, and you can play with the spacing to fit larger or smaller numbers.
# ordinal from
# http://code.activestate.com/recipes/576888-format-a-number-as-an-ordinal/
from ordinal import ordinal

data = [
    {'a': 2, 'g': 1},
    {'p': 3, 'a': 5, 'cat': 4}
]

### Get keyword map ###
key_map = set()
for row in data:
    key_map = key_map.union(set(row.keys()))
key_map = list(key_map)  # e.g. ['a', 'p', 'g', 'cat']

# strings to format the output
header = '{: >10}{: >8}{: >8}{: >8}'.format(*key_map)
line_fmt = '{: <8}{: >2}{: >8}{: >8}{: >8}'
print(header)

def ordered_data(d, keys):
    """Returns an ordered list of dictionary values.

    Returns 0 if a key is not in d.
    d --> dict
    keys --> list of keys
    returns list
    """
    return [d.get(key, 0) for key in keys]

for i, thing in enumerate(data):
    print(line_fmt.format(ordinal(i + 1), *ordered_data(thing, key_map)))
Output
         a       p       g     cat
1st      2       0       1       0
2nd      5       3       0       4
It might be worthwhile to dig into the Pandas docs and look at its DataFrame - it might make life easier.
I second the answer using Pandas DataFrames; however, my code should be a bit simpler than yours.
In [1]: import pandas as pd

In [5]: data = [{'a': 2, 'g': 1}, {'p': 3, 'a': 5, 'cat': 4}]

In [6]: df = pd.DataFrame(data)

In [7]: df
Out[7]:
   a  cat   g   p
0  2  NaN   1 NaN
1  5    4 NaN   3

In [9]: df = df.fillna(0)

In [10]: df
Out[10]:
   a  cat  g  p
0  2    0  1  0
1  5    4  0  3
I did my coding in IPython, which I highly recommend!
To save to csv, just use an additional line of code:
df.to_csv('filename.csv')
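If you don't want the row index written to the file, pass index=False:

df.to_csv('filename.csv', index=False)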
I am new to Python; here is just a suggestion that will hopefully be helpful :)
key_map = []
for row in data:
    key_map.extend(row.keys())
key_map = list(set(key_map))
You can change the middle part (the key_map construction) to the above, which should save some time: union builds and re-scans a new set for every row, while extend just appends the keys and a single set() call deduplicates them at the end.
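If you want to verify the difference, a rough micro-benchmark sketch (timings will vary by machine and data shape):

import timeit

data = [{'a': 2, 'g': 1}, {'p': 3, 'a': 5, 'cat': 4}] * 10000

def with_union():
    key_map = set()
    for row in data:
        key_map = key_map.union(set(row.keys()))
    return list(key_map)

def with_extend():
    key_map = []
    for row in data:
        key_map.extend(row.keys())
    return list(set(key_map))

print(timeit.timeit(with_union, number=10))
print(timeit.timeit(with_extend, number=10))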