Pandas: Select rows whose dictionary contains a specific value - python

I have a dataframe in which one column contains a dictionary in every row. I want to select the rows whose dictionary contains a specific value, no matter which key holds it.
The dictionaries are deeply nested: they contain lists of dictionaries, which in turn contain more lists and dictionaries, and so on.
The data could look similar to this, but with the dictionaries being more complex:
df = pd.DataFrame({"A": [1,2,3], "B": [{"a":1}, {"b":**specific_value**}, {"c":3}]})
   A         B
0  1  {'a': 1}
1  2  {'b': 2}
2  3  {'c': 3}
I tried:
df.B.apply(lambda x : 'specific_value' in x.values())
This returns False even for the rows that I know contain the specific value. I am unsure whether that is because of the nesting.

You could use a recursive function to search for the specific value:
import pandas as pd

def nested_find_value(d, needle=4):
    # we assume d is always a list or a dictionary
    haystack = d.values() if isinstance(d, dict) else d
    for hay in haystack:
        if isinstance(hay, (list, dict)):
            yield from nested_find_value(hay, needle)
        else:
            yield hay == needle

def find(d, needle=4):
    return any(nested_find_value(d, needle))

df = pd.DataFrame({"A": [1, 2, 3], "B": [{"a": 1}, {"b": {"d": 4}}, {"c": 3}]})
result = df["B"].apply(find)
print(result)
Output
0    False
1     True
2    False
Name: B, dtype: bool
In the example above the specific value is 4.
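To then select the matching rows, the boolean Series can be used as a mask; a small usage sketch building on the find function above:
matches = df[df["B"].apply(find)]
print(matches)
Output
   A                B
1  2  {'b': {'d': 4}}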

Related

Add values to new column from a dict with keys matching the index of a dataframe

I have a dictionary that, for example's sake, looks like
{'a': 1, 'b': 4, 'c': 7}
I have a dataframe that has the same index values as the keys in this dict.
I want to add each value from the dict to the dataframe.
I feel like checking every row of the DataFrame, matching its index value against the dict, and then adding the value would be a very slow way to do it, right?
You can use map and assign back to a new column:
d = {'a': 1, 'b': 4, 'c': 7}
df = pd.DataFrame({'c': [1, 2, 3]}, index=['a', 'b', 'c'])
df['new_col'] = df.index.map(d)
prints:
   c  new_col
a  1        1
b  2        4
c  3        7
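Note that map leaves NaN for any index value that has no matching key in the dict. A small sketch of handling that case (the default of 0 here is just an assumption; pick whatever makes sense for your data):
d = {'a': 1, 'b': 4}  # no entry for 'c'
df = pd.DataFrame({'c': [1, 2, 3]}, index=['a', 'b', 'c'])
df['new_col'] = df.index.map(d)          # the 'c' row becomes NaN
df['new_col'] = df['new_col'].fillna(0)  # assumed default of 0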

JSON file with duplicate keys to a dataframe or excel file

I have a huge JSON file with duplicate keys in each object, simplified example:
[
    {
        "a": 3,
        "b": "Banana",
        "c": 45,
        "a": 3,
        "a": 8
    }
]
of course, my data has many more keys and objects, but this is a good snippet.
and I'd like it to look like this:
| a |   b    | c  |
--------------------
| 3 | Banana | 45 |
| 3 | Banana | 45 |
| 8 | Banana | 45 |
I'm not picky, anything in Excel, R, Python... but none of the JSON parsers I've seen allow duplicates like this.
I've searched a lot, but I haven't found an answer. Is there any way I can do this and not have to do it manually? The dataset is HUGE.
PS I know it's not favorable for json to have multiple duplicate keys. Both the key names and values have duplicates, and I need all of them, but I was given the file this way.
Here's an R solution.
Premise: partially un-jsonify into lists with duplicate names, convert into frames individually, then aggregate into one frame.
I'll augment the data slightly to demonstrate more than one dictionary:
json <- '[
  {
    "a": 3,
    "b": "Banana",
    "c": 45,
    "a": 3,
    "a": 8
  },
  {
    "a": 4,
    "b": "Pear",
    "c": 46,
    "a": 4,
    "a": 9
  }
]'
Here's the code:
L <- jsonlite::fromJSON(json, simplifyDataFrame=FALSE)
L2 <- lapply(L, function(z) as.data.frame(split(unlist(z, use.names=FALSE), names(z))))
do.call(rbind, L2)
# a b c
# 1 3 Banana 45
# 2 3 Banana 45
# 3 8 Banana 45
# 4 4 Pear 46
# 5 4 Pear 46
# 6 9 Pear 46
Maybe I can help with the duplicate keys issue; they are the main problem, IMO.
In Python there is a way to deal with duplicate keys in JSON: you can define your own "hook" that processes the key:value pairs.
In your example, the key "a" is present 3 times. Here is a demo that gives all such duplicate keys unique names by appending consecutive numbers "_1", "_2", "_3", etc. (If there is a chance of a name clash with an existing key like "a_1", change the naming format.)
The result is a valid dict you can process as you like.
import collections
import json

data = """
[
    {
        "a": 3,
        "b": "Banana",
        "c": 45,
        "a": 3,
        "a": 8
    }
]
"""

def object_pairs(pairs):
    dups = {d: 1 for d, i in collections.Counter(pair[0] for pair in pairs).items() if i > 1}
    #          ^^^ change to d: 0 for zero-based counting
    dedup = {}
    for k, v in pairs:
        try:
            num = dups[k]
            dups[k] += 1
            k = f"{k}_{num}"
        except KeyError:
            pass
        dedup[k] = v
    return dedup

result = json.loads(data, object_pairs_hook=object_pairs)
print(result)  # [{'a_1': 3, 'b': 'Banana', 'c': 45, 'a_2': 3, 'a_3': 8}]
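From there, one way to get the row layout asked for is to expand the renamed duplicate keys back into rows. A rough sketch, assuming "a" is the only duplicated key (adjust the prefix test for your real data):
import pandas as pd

rows = []
for obj in result:
    # values that came from the duplicated "a" key (renamed a_1, a_2, ...)
    a_values = [v for k, v in obj.items() if k.startswith("a_")]
    # everything else is shared by all expanded rows
    shared = {k: v for k, v in obj.items() if not k.startswith("a_")}
    rows.extend({"a": a, **shared} for a in a_values)

print(pd.DataFrame(rows))
#    a       b   c
# 0  3  Banana  45
# 1  3  Banana  45
# 2  8  Banana  45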

Weird behaviour while manipulating Pandas dataframe within a dictionary

I am unable to understand this behaviour. I have a dataframe, which is present as a "value" inside a dictionary my_dict
my_dict = {'a': pd.DataFrame({'x': [1], 'y': [2]})}
print(my_dict)
{'a':    x  y
0  1  2}
Now, when I attempt a mathematical operation on the dataframe, it works, but renaming a column does not:
for key, val in my_dict.items():
    val['z'] = val['x'] * val['y']
    val = val.rename(columns={'x': 'new_x'})
print(my_dict)
{'a':    x  y  z
0  1  2  2}
The mathematical operation val['z'] = val['x'] * val['y'] resulted in a new column z in the dataframe within my_dict
But the column renaming operation val = val.rename(columns = {'x': 'new_x'}) has no effect.
Why don't I see a column new_x in my_dict? What is going on?
Change the assignment to an in-place rename. By default, rename returns a new DataFrame, and val = val.rename(...) only rebinds the loop variable val; the DataFrame stored in my_dict is never touched. With inplace=True the original object is modified:
for key, val in my_dict.items():
    val['z'] = val['x'] * val['y']
    val.rename(columns={'x': 'new_x'}, inplace=True)
my_dict
Out[26]:
{'a':    new_x  y  z
0      1  2  2}
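If you prefer to avoid inplace=True, another option is to assign the renamed DataFrame back into the dictionary yourself; a small sketch of the same loop:
for key, val in my_dict.items():
    val['z'] = val['x'] * val['y']
    # rename returns a new DataFrame, so store it back under the same key
    my_dict[key] = val.rename(columns={'x': 'new_x'})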

How to find sum of dictionaries in a pandas DataFrame across all rows?

I have a DataFrame
df = pd.DataFrame({'keywords': [{'a': 3, 'b': 4, 'c': 5}, {'c':1, 'd':2}, {'a':5, 'c':21, 'd':4}, {'b':2, 'c':1, 'g':1, 'h':1, 'i':1}]})
I want to add up the elements across all rows, which would give the following result, without using iterrows:
a: 8
b: 6
c: 28
d: 6
g: 1
h: 1
i: 1
note: no element occurs twice in a single row in the original DataFrame.
Using collections.Counter, you can sum an iterable of Counter objects. Since Counter is a subclass of dict, you can then feed the result to pd.DataFrame.from_dict.
from collections import Counter
counts = sum(map(Counter, df['keywords']), Counter())
res = pd.DataFrame.from_dict(counts, orient='index')
print(res)
    0
a   8
b   6
c  28
d   6
g   1
h   1
i   1
Not sure how this compares in terms of optimization with #jpp's answer, but I'll give it a shot.
# What we're starting out with
df = pd.DataFrame({'keywords': [{'a': 3, 'b': 4, 'c': 5}, {'c': 1, 'd': 2}, {'a': 5, 'c': 21, 'd': 4}, {'b': 2, 'c': 1, 'g': 1, 'h': 1, 'i': 1}]})

# Turn the column of dictionaries into a DataFrame
values_df = pd.DataFrame(df["keywords"].values.tolist())

# Sum up the individual keys
sums = {key: values_df[key].sum() for key in values_df.columns}
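The per-key comprehension can likely be replaced by pandas' built-in, NaN-aware column sum; a hedged one-liner doing the same thing:
sums = pd.DataFrame(df['keywords'].tolist()).sum().astype(int).to_dict()
print(sums)  # {'a': 8, 'b': 6, 'c': 28, 'd': 6, 'g': 1, 'h': 1, 'i': 1}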

Create a matrix from dynamic dictionary

I want to create a matrix.
Input:
data = [
    {'a': 2, 'g': 1},
    {'p': 3, 'a': 5, 'cat': 4}
    ...
]
Output:
     a  p  cat  g
1st  2  0    0  1
2nd  5  3    4  0
This is my code, but I don't think it's smart, and it will be very slow when the data size is huge.
Are there any better ways to do this?
Thank you.
data = [
    {'a': 2, 'g': 1},
    {'p': 3, 'a': 5, 'cat': 4}
]

### Get keyword map ###
key_map = set()
for row in data:
    key_map = key_map.union(set(row.keys()))
key_map = list(key_map)  # ['a', 'p', 'g', 'cat']

### Create matrix ###
result = []
for row in data:
    matrix = [0] * len(key_map)
    for k, v in row.items():
        matrix[key_map.index(k)] = v
    result.append(matrix)

print(result)
# [[2, 0, 0, 1], [5, 3, 4, 0]]
Edited
Per @wwii's advice, using pandas looks good:
from pandas import DataFrame

result = DataFrame(data, index=range(len(data)))
print(result.fillna(0).astype(int).to_numpy().tolist())
# [[2, 0, 1, 0], [5, 4, 0, 3]]
You can use a set comprehension to generate the key_map:
key_map = list({key for row in data for key in row})
Here is a partial answer. I couldn't get the columns in the order specified; that is limited by how the keys get ordered in the set key_map. It uses string formatting to line the data up - you can play around with the spacing to fit larger or smaller numbers.
# ordinal from
# http://code.activestate.com/recipes/576888-format-a-number-as-an-ordinal/
from ordinal import ordinal

data = [
    {'a': 2, 'g': 1},
    {'p': 3, 'a': 5, 'cat': 4}
]

### Get keyword map ###
key_map = set()
for row in data:
    key_map = key_map.union(set(row.keys()))
key_map = list(key_map)  # ['a', 'p', 'g', 'cat']

# strings to format the output
header = '{: >10}{: >8}{: >8}{: >8}'.format(*key_map)
line_fmt = '{: <8}{: >2}{: >8}{: >8}{: >8}'
print(header)

def ordered_data(d, keys):
    """Return an ordered list of dictionary values.

    Returns 0 if a key is not in d.
    d --> dict
    keys --> list of keys
    returns list
    """
    return [d.get(key, 0) for key in keys]

for i, thing in enumerate(data):
    print(line_fmt.format(ordinal(i + 1), *ordered_data(thing, key_map)))
Output
         a       p       g     cat
1st      2       0       1       0
2nd      5       3       0       4
It might be worthwhile to dig into the Pandas docs and look at its DataFrame - it might make life easier.
I second the answer using the Pandas dataframes. However, my code should be a bit simpler than yours.
In [1]: import pandas as pd
In [5]: data = [{'a': 2, 'g': 1},{'p': 3, 'a': 5, 'cat': 4}]
In [6]: df = pd.DataFrame(data)
In [7]: df
Out[7]:
   a  cat    g    p
0  2  NaN    1  NaN
1  5    4  NaN    3
In [9]: df = df.fillna(0)
In [10]: df
Out[10]:
   a  cat  g  p
0  2    0  1  0
1  5    4  0  3
I did my coding in IPython, which I highly recommend!
To save to csv, just use an additional line of code:
df.to_csv('filename.csv')
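If you don't want the row index written to the file, index=False is a common choice (a small usage note):
df.to_csv('filename.csv', index=False)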
I am new to Python, so these are just suggestions that will hopefully be helpful :)
key_map = []
for row in data:
    key_map.extend(row.keys())
key_map = list(set(key_map))
You can change the middle part of your code to the above, which should save some time when building key_map. In your version, union has to scan each row and build a new intermediate set on every iteration.
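If you want to check the speed claim on your own data, here is a quick sketch with timeit (the inflated data size is an arbitrary assumption; actual numbers will vary):
import timeit

data = [{'a': 2, 'g': 1}, {'p': 3, 'a': 5, 'cat': 4}] * 10000

def with_union():
    key_map = set()
    for row in data:
        key_map = key_map.union(row.keys())
    return list(key_map)

def with_extend():
    key_map = []
    for row in data:
        key_map.extend(row.keys())
    return list(set(key_map))

print(timeit.timeit(with_union, number=10))   # original approach
print(timeit.timeit(with_extend, number=10))  # extend-based approach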
