Convert nested dictionary to csv - python

I have a particular case where I want to create a csv from a nested dictionary, using the inner values as the rows and the inner keys as the header. The 'healthy' key can contain more subkeys than 'height' and 'weight', but 'unhealthy' will only ever contain None or a string of values.
My current dictionary looks like this:
{0: {'healthy': {'height': 160,
'weight': 180},
'unhealthy': None},
1: {'healthy': {'height': 170,
'weight': 250},
'unhealthy': 'alcohol, smoking, overweight'}
}
How would I convert this to this csv:
+------+--------+----------------------------+
|height| weight| unhealthy|
+------+--------+----------------------------+
|160 | 180| |
+------+--------+----------------------------+
|170 | 250|alcohol, smoking, overweight|
+------+--------+----------------------------+
Is there any way of doing this without hardcoding the keys and without Pandas, and of saving the result to a location?

With D being your dictionary you can pass D.values() to pandas.json_normalize() and rename the columns if needed.
>>> import pandas as pd
>>> print(pd.json_normalize(D.values()).to_markdown(tablefmt='psql'))
+╌╌╌╌+╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌+╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌+╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌+
| | unhealthy | healthy.height | healthy.weight |
|╌╌╌╌+╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌+╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌+╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌|
| 0 | | 160 | 180 |
| 1 | alcohol, smoking, overweight | 170 | 250 |
+╌╌╌╌+╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌+╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌+╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌+
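To get from that frame to the csv in the question, the prefixed column names can be shortened before saving. This is only a sketch: it assumes pandas 1.0 or later (where pd.json_normalize is available) and that no two nested dictionaries share an inner key, since the rename keeps only the last part of each name.

```python
import pandas as pd

D = {
    0: {'healthy': {'height': 160, 'weight': 180}, 'unhealthy': None},
    1: {'healthy': {'height': 170, 'weight': 250},
        'unhealthy': 'alcohol, smoking, overweight'},
}

df = pd.json_normalize(list(D.values()))
# Keep only the last path component: 'healthy.height' -> 'height'.
df.columns = [name.split('.')[-1] for name in df.columns]
# With no path, to_csv returns the csv text; pass a path to write a file instead.
csv_text = df.to_csv(index=False)
```

The column order is whatever json_normalize produces, so reorder with df[[...]] if a specific order matters.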

This may be a very dumb way to do it, but if your dictionary has this structure and you don't mind hardcoding the keys, this might be the way:
import csv
dictionary = {0: {'healthy': {'height': 160,
'weight': 180},
'unhealthy': None},
1: {'healthy': {'height': 170,
'weight': 250},
'unhealthy': 'alcohol, smoking, overweight'}
}
with open("out.csv", "w", newline="") as f:
    writer = csv.writer(f)
    # header row
    writer.writerow(['height', 'weight', 'unhealthy'])
    # one row per inner record; None is written as an empty field
    writer.writerows([
        [value['healthy']['height'],
         value['healthy']['weight'],
         value['unhealthy']
        ] for key, value in dictionary.items()])
So the point is that you build a list of [<height>, <weight>, <unhealthy>] lists and write it to a csv file with writerows() on a writer from Python's built-in csv module.
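If hardcoding the keys is not acceptable, the same idea can be sketched with csv.DictWriter and a small flattening step. The helper names flatten_record and write_nested_csv are made up for this example, and it assumes the nesting is only one level deep, as in the question.

```python
import csv
import io

def flatten_record(record):
    """Merge one level of nested dicts into a single flat row."""
    row = {}
    for key, value in record.items():
        if isinstance(value, dict):
            row.update(value)   # inner keys become columns
        else:
            row[key] = value    # scalars (or None) keep the outer key
    return row

def write_nested_csv(data, fileobj):
    rows = [flatten_record(record) for record in data.values()]
    # Collect column names in first-seen order, so no header is hardcoded.
    fieldnames = list(dict.fromkeys(name for row in rows for name in row))
    writer = csv.DictWriter(fileobj, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)      # None is written as an empty field

data = {
    0: {'healthy': {'height': 160, 'weight': 180}, 'unhealthy': None},
    1: {'healthy': {'height': 170, 'weight': 250},
        'unhealthy': 'alcohol, smoking, overweight'},
}

buffer = io.StringIO()          # in-memory stand-in for a real file
write_nested_csv(data, buffer)
csv_text = buffer.getvalue()
```

To save to a location, open the target with open("out.csv", "w", newline="") and pass that file object instead of the StringIO buffer.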

I used pandas to process the data and save it as a csv file:
I loaded the dictionary as a dataframe.
By transposing the data, I got the 'unhealthy' column.
By using json_normalize(), I parsed the nested 'healthy' dictionaries into two columns, 'height' and 'weight'.
I concatenated the 'height' and 'weight' columns with the 'unhealthy' column.
I saved the dataframe as a csv file.
import pandas as pd
val = {
0: {'healthy': {'height': 160, 'weight': 180},
'unhealthy': None},
1: {'healthy': {'height': 170, 'weight': 250},
'unhealthy': 'alcohol, smoking, overweight'}
}
df = pd.DataFrame.from_dict(val)
df = df.transpose()
df = pd.concat([pd.json_normalize(df['healthy'], max_level=1), df['unhealthy']], axis=1)
df.to_csv('filename.csv', index=False) # A csv file generated.
This is the csv file I generated (I opened it using MS Excel).

Related

Python Pandas Flatten nested JSON

I'm new to Python and pandas, and it took me a while to get the results I wanted with the code below. Basically, I am trying to flatten a nested JSON. While json_normalize works, some columns contain a list of objects (key/value pairs). I want to break down those lists and add their contents as separate columns. The sample code below works, but I was wondering whether it could be simplified or improved, or whether there is an alternative. Most of the articles I've found rely on naming the columns explicitly (code that I can't follow just yet), but I would rather have this as a function that names the columns dynamically. The output csv will be used for Power BI.
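import json
import pandas as pd

with open('json.json', 'r') as json_file:
    jsondata = json.load(json_file)
df = pd.json_normalize(jsondata['results'], errors='ignore')
# columns that contain at least one list value
y = df.columns[df.applymap(type).eq(list).any()]
for x in y:
    # one row per list element, then expand each element's keys into columns
    df = df.explode(x).reset_index(drop=True)
    df_exploded = pd.json_normalize(df.pop(x))
    # prefix the new columns with the name of the column they came from
    for i in df_exploded.columns:
        df_exploded = df_exploded.rename(columns={i: x + '_' + i})
    df = df.join(df_exploded)
df.to_csv('json.csv')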
Sample JSON Format (Not including the large JSON I was working on):
data = {
'results': [
{
'recordType': 'recordType',
'id': 'id',
'values':
{
'orderType': [{'value': 'value', 'text': 'text'}],
'type': [{'value': 'value', 'text': 'text'}],
'entity': [{'value': 'value', 'text': 'text'}],
'trandate': 'trandate'
}
}
]
}
When json_normalize is applied, the lists inside values are not flattened, so they have to be exploded and joined back.
You can use something like this:
df = pd.json_normalize(jsondata['results'],meta=['recordType','id'])[['recordType','id','values.trandate']]
record_paths = [['values','orderType'],['values','type'],['values','entity']]
for i in record_paths:
    df = pd.concat([df, pd.json_normalize(jsondata['results'], record_path=i, record_prefix=i[1])], axis=1)
df.to_csv('json.csv')
Or (much faster):
df = pd.DataFrame({'recordType':[i['recordType'] for i in jsondata['results']],
'id':[i['id'] for i in jsondata['results']],
'values.trandate':[i['values']['trandate'] for i in jsondata['results']],
'orderTypevalue':[i['values']['orderType'][0]['value'] for i in jsondata['results']],
'orderTypetext':[i['values']['orderType'][0]['text'] for i in jsondata['results']],
'typevalue':[i['values']['type'][0]['value'] for i in jsondata['results']],
'typetext':[i['values']['type'][0]['text'] for i in jsondata['results']],
'entityvalue':[i['values']['entity'][0]['value'] for i in jsondata['results']],
'entitytext':[i['values']['entity'][0]['text'] for i in jsondata['results']]})
df.to_csv('json.csv')
Output:
| | recordType | id | values.trandate | orderTypevalue | orderTypetext | typevalue | typetext | entityvalue | entitytext |
|---:|:-------------|:-----|:------------------|:-----------------|:----------------|:------------|:-----------|:--------------|:-------------|
| 0 | recordType | id | trandate | value | text | value | text | value | text |
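Since the question asks for a function that names the columns dynamically, here is one possible sketch in plain Python. The flatten helper is hypothetical (it is not a pandas function), and it makes a design choice that differs from the explode approach: a list with several elements becomes indexed columns (name_0_..., name_1_...) rather than extra rows, while a single-element list is unwrapped without an index. Empty lists produce no column at all.

```python
def flatten(obj, prefix=""):
    """Recursively flatten dicts and lists into one flat dict.

    Nested keys are joined with '_'. A single-element list is unwrapped
    without an index; longer lists get a positional index.
    """
    flat = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            name = f"{prefix}_{key}" if prefix else str(key)
            flat.update(flatten(value, name))
    elif isinstance(obj, list):
        for index, item in enumerate(obj):
            name = prefix if len(obj) == 1 else f"{prefix}_{index}"
            flat.update(flatten(item, name))
    else:
        flat[prefix] = obj
    return flat

data = {
    'results': [
        {
            'recordType': 'recordType',
            'id': 'id',
            'values': {
                'orderType': [{'value': 'value', 'text': 'text'}],
                'type': [{'value': 'value', 'text': 'text'}],
                'entity': [{'value': 'value', 'text': 'text'}],
                'trandate': 'trandate',
            },
        }
    ]
}

# One flat dict per record; column names are derived from the nesting.
rows = [flatten(record) for record in data['results']]
```

The resulting list of flat dicts can be handed to pd.DataFrame(rows) and then to to_csv, so no column has to be named by hand.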

Create a dictionary of unique values of a column in a dataframe in pandas

I have a dataframe:
import pandas as pd
df = pd.DataFrame({
'ID': ['ABC', 'ABC', 'ABC', 'XYZ', 'XYZ', 'XYZ'],
'value': [100, 120, 130, 200, 190, 210],
'value2': [2100, 2120, 2130, 2200, 2190, 2210],
'state': ['init','mid', 'final', 'init', 'mid', 'final'],
})
I want to create a dictionary of the unique values of the column 'ID'. I can extract the unique values with:
df.ID.unique()
But that gives me an array. I want the output to be a dictionary, which looks like this:
dict = {0:'ABC', 1: 'XYZ'}
If the number of unique entries in the column is n, then the keys should start at 0 and go up to n-1. The values should be the names of the unique entries in the column.
The actual dataframe has 1000s of rows and is often updated, so I cannot maintain the dict manually.
Try this:
dict(enumerate(df.ID.unique()))
{0: 'ABC', 1: 'XYZ'}
If you want to get the unique values of a particular column as a dict, try:
val_dict = {idx:value for idx , value in enumerate(df["ID"].unique())}
Output when printing val_dict:
{0: 'ABC', 1: 'XYZ'}

Convert JSON data in data frame Python [duplicate]

This question already has answers here:
Split / Explode a column of dictionaries into separate columns with pandas
I am a beginner at programming, so your help and support would be appreciated.
Here is a DataFrame in which one column holds JSON-like data.
ID, Name, Information
1234, xxxx, '{'age': 25, 'gender': 'male'}'
2234, yyyy, '{'age': 34, 'gender': 'female'}'
3234, zzzz, '{'age': 55, 'gender': 'male'}'
I would like to convert this DataFrame as below.
ID, Name, age, gender
1234, xxxx, 25, male
2234, yyyy, 34, female
3234, zzzz, 55, male
I found that ast.literal_eval() can convert a str to a dict, but I have no idea how to write the code for this.
Could you please give an example of code that solves this issue?
Given test.csv
ID,Name,Information
1234,xxxx,"{'age': 25, 'gender': 'male'}"
2234,yyyy,"{'age': 34, 'gender': 'female'}"
3234,zzzz,"{'age': 55, 'gender': 'male'}"
Read the file in with pd.read_csv and use the converters parameter with ast.literal_eval, which will convert the data in the Information column from a str type to dict type.
Use pd.json_normalize to unpack the dict with keys as column headers and values in the rows
.join the normalized columns with df
.drop the Information column
import pandas as pd
from ast import literal_eval
df = pd.read_csv('test.csv', converters={'Information': literal_eval})
df = df.join(pd.json_normalize(df.Information))
df.drop(columns=['Information'], inplace=True)
# display(df)
ID Name age gender
0 1234 xxxx 25 male
1 2234 yyyy 34 female
2 3234 zzzz 55 male
If the data is not from a csv file
import pandas as pd
from ast import literal_eval
data = {'ID': [1234, 2234, 3234],
'Name': ['xxxx', 'yyyy', 'zzzz'],
'Information': ["{'age': 25, 'gender': 'male'}", "{'age': 34, 'gender': 'female'}", "{'age': 55, 'gender': 'male'}"]}
df = pd.DataFrame(data)
# apply literal_eval to Information
df.Information = df.Information.apply(literal_eval)
# normalize the Information column and join to df
df = df.join(pd.json_normalize(df.Information))
# drop the Information column
df.drop(columns=['Information'], inplace=True)
If the third column is a JSON string, ' is not valid as a string delimiter; it should be ", so we need to fix this first.
If the third column is a string representation of a Python dict, you can use eval to convert it (ast.literal_eval is the safer choice, since it only evaluates literals).
Sample code to convert the third column to dicts, split it into columns and merge them into the original DataFrame:
import json
import pandas as pd

data = [
[1234, 'xxxx', "{'age': 25, 'gender': 'male'}"],
[2234, 'yyyy', "{'age': 34, 'gender': 'female'}"],
[3234, 'zzzz', "{'age': 55, 'gender': 'male'}"],
]
df = pd.DataFrame(data)
df[2] = df[2].apply(lambda x: json.loads(x.replace("'", '"'))) # fix the quotes and convert to dict
merged = pd.concat([df[[0, 1]], df[2].apply(pd.Series)], axis=1)
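One caveat with replacing ' by " is that it also rewrites apostrophes inside the values. A small sketch of the difference, using a made-up value (the name O'Brien is not from the question's data):

```python
import ast
import json

# Hypothetical cell value whose string contains an apostrophe.
raw = "{'name': \"O'Brien\", 'age': 25}"

# literal_eval parses Python literal syntax directly, so the apostrophe is fine.
parsed = ast.literal_eval(raw)

# Blind quote replacement turns the apostrophe into a stray double quote,
# which makes the text invalid JSON.
try:
    json.loads(raw.replace("'", '"'))
    replace_worked = True
except json.JSONDecodeError:
    replace_worked = False
```

For data like the question's, where no value contains a quote character, both approaches give the same dicts.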

Convert Nested JSON into Dataframe

I have a nested JSON like below. I want to convert it into a pandas dataframe. As part of that, I also need to parse the weight value only. I don't need the unit.
I also want the number values converted from string to numeric.
Any help would be appreciated. I'm relatively new to python. Thank you.
JSON Example:
{'id': '123', 'name': 'joe', 'weight': {'number': '100', 'unit': 'lbs'},
'gender': 'male'}
Sample output below:
id name weight gender
123 joe 100 male
Use pd.json_normalize (before pandas 1.0: from pandas.io.json import json_normalize), which gives:
id name weight.number weight.unit gender
123 joe 100 lbs male
If you want to discard the weight unit, just flatten the json:
temp = {'id': '123', 'name': 'joe', 'weight': {'number': '100', 'unit': 'lbs'}, 'gender': 'male'}
temp['weight'] = temp['weight']['number']
Then turn it into a dataframe (the dict is wrapped in a list because all of its values are scalars):
pd.DataFrame([temp])
Something like this should do the trick:
json_data = [{'id': '123', 'name': 'joe', 'weight': {'number': '100', 'unit': 'lbs'}, 'gender': 'male'}]
# convert the data to a DataFrame
df = pd.DataFrame.from_records(json_data)
# convert id to an int
df['id'] = df['id'].apply(int)
# get the 'number' field of weight and convert it to an int
df['weight'] = df['weight'].apply(lambda x: int(x['number']))
df

Convert pyspark.sql.dataframe.DataFrame type Dataframe to Dictionary

I have a pyspark Dataframe and I need to convert this into python dictionary.
Below code is reproducible:
from pyspark.sql import Row
rdd = sc.parallelize([Row(name='Alice', age=5, height=80),Row(name='Alice', age=5, height=80),Row(name='Alice', age=10, height=80)])
df = rdd.toDF()
Once I have this dataframe, I need to convert it into dictionary.
I tried like this
df.set_index('name').to_dict()
But it gives an error. How can I achieve this?
Please see the example below:
>>> from pyspark.sql.functions import col
>>> df = (sc.textFile('data.txt')
.map(lambda line: line.split(","))
.toDF(['name','age','height'])
.select(col('name'), col('age').cast('int'), col('height').cast('int')))
+-----+---+------+
| name|age|height|
+-----+---+------+
|Alice| 5| 80|
| Bob| 5| 80|
|Alice| 10| 80|
+-----+---+------+
>>> list_persons = [row.asDict() for row in df.collect()]
>>> list_persons
[
{'age': 5, 'name': u'Alice', 'height': 80},
{'age': 5, 'name': u'Bob', 'height': 80},
{'age': 10, 'name': u'Alice', 'height': 80}
]
>>> dict_persons = {person['name']: person for person in list_persons}
>>> dict_persons
{u'Bob': {'age': 5, 'name': u'Bob', 'height': 80}, u'Alice': {'age': 10, 'name': u'Alice', 'height': 80}}
The input that I'm using to test data.txt:
Alice,5,80
Bob,5,80
Alice,10,80
First we load the data with pyspark by reading the lines. Then we split each line on the comma to get the columns. Then we convert the native RDD to a DF and add names to the columns. Finally we cast the columns to the appropriate types.
Then we collect everything to the driver and, using Python comprehensions, convert the data to the preferred form. We convert each Row object to a dictionary using the asDict() method. In the output we can observe that Alice appears only once; this is because the entry under the key Alice gets overwritten.
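If the overwritten Alice entry is a problem, one option is to keep a list of rows per name instead of a single row. A minimal sketch in plain Python, where the dicts below stand in for what row.asDict() returns after collect() (the name persons_by_name is made up for this example):

```python
from collections import defaultdict

# Plain dicts standing in for row.asDict() of each collected Row.
list_persons = [
    {'name': 'Alice', 'age': 5, 'height': 80},
    {'name': 'Bob', 'age': 5, 'height': 80},
    {'name': 'Alice', 'age': 10, 'height': 80},
]

# Group rows by name so duplicate keys accumulate instead of overwriting.
persons_by_name = defaultdict(list)
for person in list_persons:
    persons_by_name[person['name']].append(person)
```

Here persons_by_name['Alice'] holds both Alice rows, in the order they were collected.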
Please keep in mind that you want to do all the processing and filtering inside pyspark before returning the result to the driver.
Hope this helps, cheers.
You need to first convert to a pandas.DataFrame using toPandas(), then you can use the to_dict() method on the transposed dataframe with orient='list':
df.toPandas().set_index('name').T.to_dict('list')
# Out[1]: {u'Alice': [10, 80]}
Row objects have a built-in asDict() method that represents each row as a dict.
If you have a dataframe df, you need to convert it to an RDD and apply asDict() to each row.
new_rdd = df.rdd.map(lambda row: row.asDict(True))
One can then use new_rdd to perform normal Python map operations like:
# You can define normal Python functions like the one below and plug them in when needed
def transform(row):
    # Add a new key to each row
    row["new_key"] = "my_new_value"
    return row
new_rdd = new_rdd.map(lambda row: transform(row))
One easy way is to collect the rows and iterate over them with a dictionary comprehension. Here I will try to demonstrate something similar.
Let's assume a movie dataframe:
movie_df
| movieId | avg_rating |
|--------:|-----------:|
| 1 | 3.92 |
| 10 | 3.5 |
| 100 | 2.79 |
| 100044 | 4.0 |
| 100068 | 3.5 |
| 100083 | 3.5 |
| 100106 | 3.5 |
| 100159 | 4.5 |
| 100163 | 2.9 |
| 100194 | 4.5 |
We can use a dictionary comprehension and iterate over the collected rows like below:
movie_dict = {int(row.asDict()['movieId']) : row.asDict()['avg_rating'] for row in movie_df.collect()}
print(movie_dict)
{1: 3.92,
10: 3.5,
100: 2.79,
100044: 4.0,
100068: 3.5,
100083: 3.5,
100106: 3.5,
100159: 4.5,
100163: 2.9,
100194: 4.5}
