pandas dataframe, with multi-index, to dictionary - python

I am trying to transform a pandas dataframe resulting from a groupby([columns]). The resulting multi-index has, for each "target_index", different lists of words (see the reproduction code below). Transforming it with to_dict() does not seem to work directly (I have tried several orient arguments).
The desired output (only two keys for the example):
{
    "2060": {
        "NOUN": ["product"]
    },
    "3881": {
        "ADJ": ["greater", "direct", "raw"],
        "NOUN": ["manufacturing", "capital"],
        "VERB": ["increased"]
    }
}
To recreate the dataset:
import pandas as pd

df = pd.DataFrame([
    ["2060", "NOUN", ["product"]],
    ["2060", "ADJ", ["greater"]],
    ["3881", "NOUN", ["manufacturing", "capital"]],
    ["3881", "ADJ", ["greater", "direct", "raw"]],
    ["3881", "VERB", ["increased"]]
], columns=["a", "b", "c"])
df = df.groupby(["a", "b"]).agg({"c": lambda x: x})

Note that the input given in the constructor differs slightly from the one shown in the question; the constructor input is used here. You could use a lambda in groupby.apply to convert each group to a dict, then convert the aggregate to a dict:
out = df.groupby(level=0).apply(lambda x: x.droplevel(0).to_dict()['c']).to_dict()
Another option is to use itertuples and dict.setdefault:
out = {}
for (ok, ik), v in df.itertuples():
    out.setdefault(ok, {}).setdefault(ik, []).extend(v)
Output:
{'2060': {'ADJ': ['greater'], 'NOUN': ['product']},
 '3881': {'ADJ': ['greater', 'direct', 'raw'],
          'NOUN': ['manufacturing', 'capital'],
          'VERB': ['increased']}}

Related

parse JSON file to CSV with key values null in python

Example
{"data":"value1","version":"value2","version1":"value3"}
{"data":"value1","version1":"value3"}
{"data":"value1","version1":"value3","hi":{"a":"true,"b":"false"}}
I have a JSON file that I need to convert to CSV; however, the rows do not all have the same columns, and some rows have nested attributes. How can I convert them in a Python script?
I tried converting JSON to CSV with Python code, but it gives me an error.
One way to convert a JSON file to a CSV file in Python is with the pandas library.
import pandas as pd

data = [
    {
        "data": "value1",
        "version": "value2",
        "version1": "value3"
    },
    {
        "data": "value1",
        "version1": "value3"
    },
    {
        "data": "value1",
        "version1": "value3",
        "hi": {
            "a": "true,",
            "b": "false"
        }
    }
]

df = pd.DataFrame(data)
df.to_csv('data.csv', index=False)
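Note that pd.DataFrame leaves the nested "hi" dictionary in a single column of the CSV. If the nested keys should become their own columns, one possible refinement is pandas' json_normalize:

# Flatten nested dicts into dotted column names (hi.a, hi.b).
df = pd.json_normalize(data)
df.to_csv('data.csv', index=False)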
I have correctly formatted your JSON since it was giving errors.
You could convert the JSON data to a flat list of lists with column names on the first line. Then process that to make the CSV output.
def flatDict(D, p=""):
    # Non-dict values terminate the recursion.
    if not isinstance(D, dict):
        return {"": D}
    # Combine nested keys with a "." separator, pulling them to the top level.
    return {p + k + s: v for k, d in D.items() for s, v in flatDict(d, ".").items()}

def flatData(data):
    lines = [*map(flatDict, data)]
    # Union of all keys, in order of first appearance.
    names = dict.fromkeys(k for d in lines for k in d)
    return [[*names]] + [[*map(line.get, names)] for line in lines]
The flatDict function converts a nested dictionary structure to a single-level dictionary, with nested keys combined and brought up to the top level. This is done recursively, so it works for any depth of nesting.
The flatData function flattens each record into a list of single-level dictionaries (lines). The union of all keys in that list forms the list of column names (using a dictionary constructor to keep them in order of first appearance). It returns the list of names followed by one row per record, mapping each column name to that record's data where present (using the .get() method of dictionaries, which yields None for missing keys).
Output:
E = [{"data": "value1", "version": "value2", "version1": "value3"},
     {"data": "value1", "version1": "value3"},
     {"data": "value1", "version1": "value3", "hi": {"a": "true", "b": "false"}}]

for line in flatData(E):
    print(line)
['data', 'version', 'version1', 'hi.a', 'hi.b'] # col names
['value1', 'value2', 'value3', None, None] # data ...
['value1', None, 'value3', None, None]
['value1', None, 'value3', 'true', 'false']
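The answer above stops at the flat list of lists; to actually write the CSV, a minimal sketch using the standard csv module (the output file name is just an example) might be:

import csv

with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    # First row is the column names, the rest is the data.
    writer.writerows(flatData(E))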

How can I remove a nested data If a specific key found similar in big JSON data?

The simple way to filter is to loop over them all, but I have a very large dataset, so looping is time-consuming and may not be an efficient approach.
[
    {
        "from_name": "Haio",
        "from_id": 183556205,
        "receiver_name": "Shubh M",
        "targeted_id": 78545445,
        "gift_value": '$56'
    },
    {
        "from_name": "Mr. A",
        "from_id": 54545455,
        "receiver_name": "haio",
        "targeted_id": 78545445,
        "gift_value": '$7'
    }
]
What do I want to accomplish?
I just want to delete a dict if its targeted_id is the same as another's.
def remove_duplicates(lst, key=lambda x: x, acc=[], keys=[]):
    # Recursively keeps the first item for each key value;
    # acc collects the kept items, keys tracks seen key values.
    if lst == []:
        return acc
    elif key(lst[0]) in keys:
        return remove_duplicates(lst[1:], key=key, acc=acc, keys=keys)
    else:
        return remove_duplicates(lst[1:], key=key,
                                 acc=acc + [lst[0]], keys=keys + [key(lst[0])])
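Since deep recursion and repeated list membership tests get expensive on large inputs, a plain iterative pass with a set of seen ids (same keep-first behaviour; the function name here is just for illustration) may scale better:

def remove_duplicates_iter(records, key):
    seen = set()  # key values already encountered
    result = []
    for rec in records:
        k = key(rec)
        if k not in seen:  # keep only the first record per key
            seen.add(k)
            result.append(rec)
    return result

unique = remove_duplicates_iter(data, key=lambda d: d['targeted_id'])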
Provided you can load the whole dataset into memory, use pandas and drop_duplicates. By default, it keeps the first of each set of duplicate records. There is some overhead associated with creating a dataframe, but if there are many records to go through, dropping duplicates this way should be much faster than a Python loop.
import pandas as pd

data = [
    {
        "from_name": "Haio",
        "from_id": 183556205,
        "receiver_name": "Shubh M",
        "targeted_id": 78545445,
        "gift_value": '$56'
    },
    {
        "from_name": "Mr. A",
        "from_id": 54545455,
        "receiver_name": "haio",
        "targeted_id": 78545445,
        "gift_value": '$7'
    }
]
df = pd.DataFrame(data).drop_duplicates(subset=['targeted_id'])
print(df.to_json()) # or specify a file name to save it
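One caveat: df.to_json() defaults to orient='columns', which produces a column-oriented layout rather than the original list of records. If the output should match the input shape, the standard orient='records' option does that:

# Reproduce the original list-of-dicts layout.
print(df.to_json(orient='records'))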

Original dict/json from a pd.io.json.json_normalize() dataframe row

I have a pandas dataframe with rows created from dicts, using pd.io.json.json_normalize(). The values (not the keys/column names) in the dataframe have been modified. I want to retrieve a dict, with the same nested format as the original, from a row of the dataframe.
sample = {
    "A": {
        "a": 7
    },
    "B": {
        "a": "name",
        "z": {
            "dD": 20,
            "f_f": 3,
        }
    }
}
df = pd.io.json.json_normalize(sample, sep='__')
As expected, df.columns returns:
Index(['A__a', 'B__a', 'B__z__dD', 'B__z__f_f'], dtype='object')
I want to "reverse" the process now.
I can guarantee no string in the original dict(key or value) has a '__' as a substring and neither starts or ends with '_'
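No answer is recorded here, but a minimal sketch of one possible reversal, assuming the '__' guarantee above holds (the helper name unflatten is hypothetical), is to split each column name and rebuild the nesting:

def unflatten(row, sep='__'):
    # Rebuild a nested dict from a flat {column_name: value}
    # mapping produced by json_normalize with the given separator.
    nested = {}
    for col, value in row.items():
        keys = col.split(sep)
        d = nested
        for k in keys[:-1]:
            d = d.setdefault(k, {})
        d[keys[-1]] = value
    return nested

original = unflatten(df.iloc[0].to_dict())  # e.g. for the first row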

how to create a dataframe to generate json in the given format

I need to generate JSON from my dataframe, but I have tried many df layouts and still cannot get the required JSON format.
My required JSON format is:
[
    {
        "Keyword": "Red",
        "values": [
            {
                "value": 5,
                "TC": "Color"
            }
        ]
    },
    {
        "Keyword": "Orange",
        "values": [
            {
                "value": 5,
                "TC": "Color"
            }
        ]
    },
    {
        "Keyword": "Violet",
        "values": [
            {
                "value": 5,
                "TC": "Color"
            }
        ]
    }
]
I want a df that generates this JSON. Please help.
Currently I'm getting this from df.to_json:
{"Names":{"0":"Ram","1":"pechi","2":"Sunil","3":" Ravi","4":"sri"},"Values":{"0":"[{'value':2,'TC': 'TC Count'}]","1":"[{'value':2,'TC': 'TC Count'}]","2":"[{'value':1,'TC': 'TC Count'}]","3":"[{'value':1,'TC': 'TC Count'}]","4":"[{'value':1,'TC': 'TC Count'}]"}}
I think you need:
set_index on the column that is not in the nested dictionaries (Keyword)
apply with to_dict to create the dicts
reset_index to turn the index back into a column
to_json to create the JSON
print (df)
  Keyword     TC  value
0     Red  Color      5
1  Orange  Color      5
2  Violet  Color      5
j = (df.set_index('Keyword')
       .apply(lambda x: [x.to_dict()], axis=1)
       .reset_index(name='values')
       .to_json(orient='records'))
print (j)
[{"Keyword":"Red","values":[{"TC":"Color","value":5}]},
{"Keyword":"Orange","values":[{"TC":"Color","value":5}]},
{"Keyword":"Violet","values":[{"TC":"Color","value":5}]}]
To write to a file:
(df.set_index('Keyword')
   .apply(lambda x: [x.to_dict()], axis=1)
   .reset_index(name='values')
   .to_json('myfile.json', orient='records'))

Unpacking nested dictionary list within python

I would be very grateful if someone could suggest a more Pythonic way of handling the following issue:
Problem:
I have a JSON object parsed into a Python object (dict). The issue is that the JSON structure is a list of dictionaries (dict1), and these dictionaries each contain a dictionary (dict2).
I would like to parse all the content of dict1 and combine the contents of dict2 within dict1.
Thereafter, I would like to parse this into pandas.
json_object = {
    "data": [
        {
            "complete": "true",
            "data_two": {
                "a": "5",
                "b": "6",
                "c": "6",
                "d": "8"
            },
            "time": "2016-10-17",
            "End_number": 2
        },
        {
            "complete": "true",
            "data_two": {
                "a": "11",
                "b": "21",
                "c": "31",
                "d": "41"
            },
            "time": "2016-10-17",
            "End_number": 1
        }
    ],
    "Location": "DE",
    "End Zone": 5
}
My attempt:
dataList = json_object['data']
Unpacked_Data = [(d['time'],d['End_number'], d['data_two'].keys(),d['data_two'].values()) for d in dataList]
Unpacked_Data is a list of tuples that now contains (time, end_number, [list of keys], [list of values]).
To use this in a pandas dataframe, I would then need to unpack the two lists within each tuple. Is there an easy way to unpack lists within a tuple?
Is there a better and more elegant/Pythonic way of approaching this problem?
Thanks,
12avi
One way (using pandas) is to start by putting everything into a dataframe, then apply pd.Series to it:
df = pd.DataFrame(Unpacked_Data)
# Expand each list-valued column into its own set of columns.
unpacked0 = df[2].apply(lambda x: pd.Series(list(x)))
unpacked1 = df[3].apply(lambda x: pd.Series(list(x)))
# axis=1 lines the expanded columns up alongside the originals.
pd.concat((df[[0, 1]], unpacked0, unpacked1), axis=1)
The other way is to use list comprehension and argument unpacking:
df = pd.DataFrame([[a,b,*c,*d] for a,b,c,d in Unpacked_Data])
However, the second method may not line things up the way you want if the packed lists are not all the same length.
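For what it's worth, pandas' json_normalize also flattens this kind of nesting directly; an alternative sketch (not the approach above):

import pandas as pd

# Turns each data_two dict into data_two.a, data_two.b, ... columns.
df = pd.json_normalize(json_object['data'])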
