JSON file with duplicate keys to a dataframe or Excel file - Python

I have a huge JSON file with duplicate keys in each object, simplified example:
[
    {
        "a": 3,
        "b": "Banana",
        "c": 45,
        "a": 3,
        "a": 8
    }
]
Of course, my data has many more keys and objects, but this is a good snippet.
I'd like it to look like this:
| a | b      | c  |
|---|--------|----|
| 3 | Banana | 45 |
| 3 | Banana | 45 |
| 8 | Banana | 45 |
I'm not picky; anything in Excel, R, Python... but none of the JSON parsers I've seen allow duplicates like this.
I've searched a lot, but I haven't found an answer. Is there any way I can do this and not have to do it manually? The dataset is HUGE.
PS: I know duplicate keys in JSON are ill-advised. Both the key names and values have duplicates, and I need all of them, but I was given the file this way.

Here's an R solution.
Premise: partially un-jsonify into lists with duplicate names, convert into frames individually, then aggregate into one frame.
I'll augment the data slightly to demonstrate more than one dictionary:
json <- '[
  {
    "a": 3,
    "b": "Banana",
    "c": 45,
    "a": 3,
    "a": 8
  },
  {
    "a": 4,
    "b": "Pear",
    "c": 46,
    "a": 4,
    "a": 9
  }
]'
Here's the code:
L <- jsonlite::fromJSON(json, simplifyDataFrame=FALSE)  # keep lists with duplicate names
# split each object's values by name; as.data.frame recycles the shorter columns
L2 <- lapply(L, function(z) as.data.frame(split(unlist(z, use.names=FALSE), names(z))))
do.call(rbind, L2)
#   a      b  c
# 1 3 Banana 45
# 2 3 Banana 45
# 3 8 Banana 45
# 4 4   Pear 46
# 5 4   Pear 46
# 6 9   Pear 46

Maybe I can help with the duplicate keys issue; they are the main problem, IMO.
In Python, there is a way to deal with duplicate keys in JSON: you can define your own "hook" that processes the key:value pairs.
In your example, the key "a" is present 3 times. Here is a demo that gives all such repeated keys unique names by appending consecutive numbers "_1", "_2", "_3", etc. (If there is a chance of a name clash with an existing key like "a_1", change the naming format.)
The result is a valid dict you can process as you like.
import collections
import json

data = """
[
    {
        "a": 3,
        "b": "Banana",
        "c": 45,
        "a": 3,
        "a": 8
    }
]
"""

def object_pairs(pairs):
    # collect the keys that occur more than once, starting their numbering at 1
    dups = {d: 1 for d, i in collections.Counter(pair[0] for pair in pairs).items() if i > 1}
    # ^^^ change to d:0 for zero-based counting
    dedup = {}
    for k, v in pairs:
        try:
            num = dups[k]
            dups[k] += 1
            k = f"{k}_{num}"
        except KeyError:
            pass
        dedup[k] = v
    return dedup

result = json.loads(data, object_pairs_hook=object_pairs)
print(result)  # [{'a_1': 3, 'b': 'Banana', 'c': 45, 'a_2': 3, 'a_3': 8}]
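Building on the same object_pairs_hook idea, here is a sketch of one way to get the row layout the question asks for (my addition, not part of the answer above; it assumes pandas is available and reuses the data string from the demo). The collect_pairs hook and the forward-fill step are assumptions of this sketch:

import json
import pandas as pd

def collect_pairs(pairs):
    # keep every value of a repeated key, in order of appearance
    out = {}
    for k, v in pairs:
        out.setdefault(k, []).append(v)
    return out

records = json.loads(data, object_pairs_hook=collect_pairs)
# each record is now e.g. {'a': [3, 3, 8], 'b': ['Banana'], 'c': [45]}

frames = [
    pd.DataFrame({k: pd.Series(v) for k, v in rec.items()}).ffill()
    for rec in records
]
df = pd.concat(frames, ignore_index=True)
print(df)
#    a       b     c
# 0  3  Banana  45.0
# 1  3  Banana  45.0
# 2  8  Banana  45.0
df.to_excel("output.xlsx")  # needs an Excel writer such as openpyxl

Note that columns which had to be padded with NaN before the forward fill (like c) come out as floats; cast them back with astype if that matters.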

Combine two dicts and replace missing values

I am looking to combine two dictionaries by grouping elements that share common keys, but I would also like to account for keys that are not shared between the two dictionaries. For instance, given the following two dictionaries:
d1 = {'a':1, 'b':2, 'c': 3, 'e':5}
d2 = {'a':11, 'b':22, 'c': 33, 'd':44}
The intended code would output
df = {'a': [1, 11], 'b': [2, 22], 'c': [3, 33], 'd': [0, 44], 'e': [5, 0]}
Or some array like:
df = [['a', 1, 11], ['b', 2, 22], ['c', 3, 33], ['d', 0, 44], ['e', 5, 0]]
The fact that I used 0 specifically to denote an entry not existing is not important per se. Just any character to denote the missing value.
I have tried using the following code
from collections import defaultdict

df = defaultdict(list)
for d in (d1, d2):
    for key, value in d.items():
        df[key].append(value)
But get the following result:
df = {'a': [1, 11], 'b': [2, 22], 'c': [3, 33], 'd': [44], 'e': [5]}
Which does not tell me which dict was missing the entry.
I could go back and look through both of them, but was looking for a more elegant solution
You can use a dict comprehension like so:
d1 = {'a':1, 'b':2, 'c': 3, 'e':5}
d2 = {'a':11, 'b':22, 'c': 33, 'd':44}
res = {k: [d1.get(k, 0), d2.get(k, 0)] for k in set(d1).union(d2)}
print(res)
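Note that iterating a set gives no guaranteed key order, so the order of res is arbitrary; if a stable order matters, a sorted variant (my tweak, not part of the answer above) would be:

res = {k: [d1.get(k, 0), d2.get(k, 0)] for k in sorted(set(d1) | set(d2))}
print(res)  # {'a': [1, 11], 'b': [2, 22], 'c': [3, 33], 'd': [0, 44], 'e': [5, 0]}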
Another solution:
d1 = {"a": 1, "b": 2, "c": 3, "e": 5}
d2 = {"a": 11, "b": 22, "c": 33, "d": 44}
df = [[k, d1.get(k, 0), d2.get(k, 0)] for k in sorted(d1.keys() | d2.keys())]
print(df)
Prints:
[['a', 1, 11], ['b', 2, 22], ['c', 3, 33], ['d', 0, 44], ['e', 5, 0]]
If you do not want sorted results, leave the sorted() out.

Python Pandas drop

I built a script with Python and I use Pandas.
I'm trying to delete lines from a dataframe.
I want to delete lines that contain empty values in two specific columns.
If one of those two columns is filled but not the other, the line is preserved.
So I have built this code that works, but I'm a beginner and I am sure I can simplify my work.
I'm sure I don't need a "for" loop in my function; I think there is a way with a good method. I read the docs online but I found nothing.
I tried my best but I need help.
Also, for some reasons I don't want to use numpy.
So here is my code:
import pandas as pnd

def drop_empty_line(df):
    a = df[(df["B"].isna()) & (df["C"].isna())].index
    for i in a:
        df = df.drop([i])
    return df

def main():
    df = pnd.DataFrame({
        "A": [5, 0, 4, 6, 5],
        "B": [pnd.NA, 4, pnd.NA, pnd.NA, 5],
        "C": [pnd.NA, pnd.NA, 9, pnd.NA, 8],
        "D": [5, 3, 8, 5, 2],
        "E": [pnd.NA, 4, 2, 0, 3]
    })
    print(drop_empty_line(df))

if __name__ == '__main__':
    main()
You indeed don't need a loop. You don't even need a custom function, there is already dropna:
df = df.dropna(subset=['B', 'C'], how='all')
# or in place:
# df.dropna(subset=['B', 'C'], how='all', inplace=True)
output:
   A     B     C  D  E
1  0     4  <NA>  3  4
2  4  <NA>     9  8  2
4  5     5     8  2  3
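For comparison, the same filter can be written with boolean indexing, which is essentially what the loop in the question did row by row (a sketch, not part of the answer above):

# keep rows where at least one of B and C has a value
df = df[~(df["B"].isna() & df["C"].isna())]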

Pandas: Select rows whose dictionary contains a specific value

I have a dataframe in which one column contains a dictionary in every row. I want to select the rows whose dictionary contains a specific value; it doesn't matter which key holds it.
The dictionaries have many levels (they contain a lot of lists, with a lot of dictionaries, which again contain lists, and so on).
The data could look similar to this, but with the dictionaries being more complex:
df = pd.DataFrame({"A": [1, 2, 3], "B": [{"a": 1}, {"b": specific_value}, {"c": 3}]})
A B
0 1 {'a': 1}
1 2 {'b': 2}
2 3 {'c': 3}
I tried:
df.B.apply(lambda x : 'specific_value' in x.values())
This returns False even for rows that I know contain the specific value. I am unsure if it is because of the nested layers.
You could use a recursive function to search for the specific value:
import pandas as pd

def nested_find_value(d, needle=4):
    # we assume d is always a list or dictionary
    haystack = d.values() if isinstance(d, dict) else d
    for hay in haystack:
        if isinstance(hay, (list, dict)):
            yield from nested_find_value(hay, needle)
        else:
            yield hay == needle

def find(d, needle=4):
    return any(nested_find_value(d, needle))

df = pd.DataFrame({"A": [1, 2, 3], "B": [{"a": 1}, {"b": {"d": 4}}, {"c": 3}]})
result = df["B"].apply(find)
print(result)
Output
0 False
1 True
2 False
Name: B, dtype: bool
In the example above the specific value is 4.
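To get the matching rows themselves rather than the boolean mask, the result can be used to index the dataframe (a usage sketch building on the code above):

matching = df[df["B"].apply(find)]
print(matching)
#    A                B
# 1  2  {'b': {'d': 4}}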

Create a matrix from dynamic dictionary

I want to create a matrix.
Input:
data = [
{'a': 2, 'g': 1},
{'p': 3, 'a': 5, 'cat': 4}
...
]
Output:
     a  p  cat  g
1st  2  0    0  1
2nd  5  3    4  0
This is my code, but I think it's not smart, and it is very slow when the data size is huge.
Is there a better way to do this?
Thank you.
data = [
    {'a': 2, 'g': 1},
    {'p': 3, 'a': 5, 'cat': 4}
]

### Get keyword map ###
key_map = set()
for row in data:
    key_map = key_map.union(set(row.keys()))
key_map = list(key_map)  # e.g. ['a', 'p', 'g', 'cat']

### Create matrix ###
result = []
for row in data:
    matrix = [0] * len(key_map)
    for k, v in row.items():
        matrix[key_map.index(k)] = v
    result.append(matrix)

print(result)
# [[2, 0, 0, 1], [5, 3, 4, 0]]
Edit: following @wwii's advice, Pandas looks good:
from pandas import DataFrame

result = DataFrame(data, index=range(len(data)))
print(result.fillna(0).astype(int).to_numpy().tolist())
# [[2, 0, 1, 0], [5, 4, 0, 3]]  (follows the DataFrame's column order)
You can use a set comprehension to generate the key_map:
key_map = list({k for row in data for k in row})
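Beyond building key_map faster, the repeated key_map.index(k) call in the question is linear in the number of keys; a dictionary of column positions gives O(1) lookups (a sketch of that change, my suggestion rather than part of the answers here):

data = [
    {'a': 2, 'g': 1},
    {'p': 3, 'a': 5, 'cat': 4}
]

# build key -> column index once, then look up positions in O(1)
key_map = sorted({k for row in data for k in row})
col = {k: i for i, k in enumerate(key_map)}

result = []
for row in data:
    matrix = [0] * len(key_map)
    for k, v in row.items():
        matrix[col[k]] = v
    result.append(matrix)

print(result)  # [[2, 0, 1, 0], [5, 4, 0, 3]]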
Here is a partial answer. I couldn't get the columns in the order specified - it is limited by how the keys get ordered in the set, key_map. It uses string formatting to line the data up - you can play around with the spacing to fit larger or smaller numbers.
# ordinal from
# http://code.activestate.com/recipes/576888-format-a-number-as-an-ordinal/
from ordinal import ordinal

data = [
    {'a': 2, 'g': 1},
    {'p': 3, 'a': 5, 'cat': 4}
]

### Get keyword map ###
key_map = set()
for row in data:
    key_map = key_map.union(set(row.keys()))
key_map = list(key_map)  # e.g. ['a', 'p', 'g', 'cat']

# strings to format the output
header = '{: >10}{: >8}{: >8}{: >8}'.format(*key_map)
line_fmt = '{: <8}{: >2}{: >8}{: >8}{: >8}'
print(header)

def ordered_data(d, keys):
    """Return an ordered list of dictionary values.

    Returns 0 if a key is not in d.
    d --> dict
    keys --> list of keys
    returns list
    """
    return [d.get(key, 0) for key in keys]

for i, thing in enumerate(data):
    print(line_fmt.format(ordinal(i + 1), *ordered_data(thing, key_map)))
Output
         a       p       g     cat
1st      2       0       1       0
2nd      5       3       0       4
It might be worthwhile to dig into the Pandas docs and look at its DataFrame - it might make life easier.
I second the answer using the Pandas dataframes. However, my code should be a bit simpler than yours.
In [1]: import pandas as pd
In [5]: data = [{'a': 2, 'g': 1},{'p': 3, 'a': 5, 'cat': 4}]
In [6]: df = pd.DataFrame(data)
In [7]: df
Out[7]:
   a  cat   g   p
0  2  NaN   1 NaN
1  5    4 NaN   3

In [9]: df = df.fillna(0)

In [10]: df
Out[10]:
   a  cat  g  p
0  2    0  1  0
1  5    4  0  3
I did my coding in IPython, which I highly recommend!
To save to csv, just use an additional line of code:
df.to_csv('filename.csv')
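One caveat (my addition, not part of the answer above): columns that contained NaN are floats, so in recent pandas fillna(0) leaves them as 0.0; an explicit cast restores integers before export:

# NaN forced these columns to float; cast back after filling
df = df.fillna(0).astype(int)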
I am new to Python; here is just a suggestion that may hopefully help :)
key_map = []
for row in data:
    key_map.extend(row.keys())
key_map = list(set(key_map))
You can change the key_map-building part of your code to this, which may save some time; in your version, union has to rescan each row as a set to find the distinct keys.

python map dictionary to array

I have a list of data of the form:
[line1,a]
[line2,c]
[line3,b]
I want to use a mapping of a=5, c=15, b=10:
[line1,5]
[line2,15]
[line3,10]
I am trying to use this code, which I know is incorrect; can someone guide me on how best to achieve this:
mapping = {"a": 5, "b": 10, "c": 15}
applyMap = [line[1] = 'a' for line in data]
Thanks
EDIT:
Just to clarify, here is the mapping for one line; I want it applied to all lines in the file:
Input: ["line1","a"]
Output: ["line1",5]
You could try with a list comprehension.
lines = [
["line1", "much_more_items1", "a"],
["line2", "much_more_items2", "c"],
["line3", "much_more_items3", "b"],
]
mapping = {"a": 5, "b": 10, "c": 15}
# here I assume the key you need to map is at the last position of your items
result = [line[0:-1] + [mapping[line[-1]]] for line in lines]
Try something like this:
data = [
['line1', 'a'],
['line2', 'c'],
['line3', 'b'],
]
mapping = {"a": 5, "b": 10, "c": 15}
applyMap = [[line[0], mapping[line[1]]] for line in data]
print(applyMap)
>>> data = [["line1", "a"], ["line2", "b"], ["line3", "c"]]
>>> mapping = { "a": 5, "b": 10, "c": 15}
>>> [[line[0], mapping[line[1]]] for line in data]
[['line1', 5], ['line2', 10], ['line3', 15]]
lineMap = {'line1': 'a', 'line2': 'b', 'line3': 'c'}
cha2num = {'a': 5, 'b': 10, 'c': 15}
result = [[key, cha2num[lineMap[key]]] for key in lineMap]
print(result)
What you need is a mapping that relates each character to its number, e.g. 'a' -> 5.
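If some lines may carry a value that has no entry in the mapping, dict.get avoids a KeyError (a defensive variant, my addition rather than part of the answers above):

data = [["line1", "a"], ["line2", "x"]]
mapping = {"a": 5, "b": 10, "c": 15}

# fall back to the original value when it has no mapping
result = [[name, mapping.get(key, key)] for name, key in data]
print(result)  # [['line1', 5], ['line2', 'x']]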
