Flatten JSON in Dataframe Column - python

I have data in a dataframe as seen below (BEFORE)
I am trying to parse/flatten the JSON in the site_Activity column, but I am having no luck.
I have tried some of the methods below as proof that I have tried to solve this on my own.
I have provided a DESIRED AFTER section to highlight how I would expect the data to parse.
Any help is greatly appreciated!
Not working: df = df.explode(column='site_Activity').reset_index(drop=True) (from https://stackoverflow.com/questions/54546279/how-to-normalize-json-string-type-column-of-pandas-dataframe)
Not working: pd.json_normalize(df.site_Activity[0]) (from How to convert JSON data inside a pandas column into new columns)
BEFORE
id   site_Activity
123  [{"action_time":"2022-07-05T01:53:59.000000Z","time_spent":12,"url":"cool.stuff.io/advanced"},{"action_time":"2022-07-05T00:10:20.000000Z","time_spent":0,"url":"cool.stuff.io/advanced1"},{"action_time":"2022-07-04T23:45:39.000000Z","time_spent":0,"url":"cool.stuff.io"}]
456  [{"action_time":"2022-07-04T23:00:23.000000Z","time_spent":0,"url":"cool.stuff.io/awesome"}]
DESIRED AFTER
id   action_time                  time_spent  url
123  2022-07-05T01:53:59.000000Z  12          cool.stuff.io/advanced
123  2022-07-05T00:10:20.000000Z  0           cool.stuff.io/advanced1
123  2022-07-04T23:45:39.000000Z  0           cool.stuff.io
456  2022-07-04T23:00:23.000000Z  0           cool.stuff.io/awesome

You can:
use .apply(json.loads) to transform the JSON column into a list/dict column;
use df.explode to transform the list of dicts into a Series of dicts;
use .apply(pd.Series) to 'explode' the Series of dicts into a DataFrame;
use pd.concat to 'merge' the new columns onto the rest of the data.
Then it comes to:
import io
import pandas as pd
import json
TESTDATA="""id site_Activity
123 [{"action_time":"2022-07-05T01:53:59.000000Z","time_spent":12,"url":"cool.stuff.io/advanced"},{"action_time":"2022-07-05T00:10:20.000000Z","time_spent":0,"url":"cool.stuff.io/advanced1"},{"action_time":"2022-07-04T23:45:39.000000Z","time_spent":0,"url":"cool.stuff.io"}]
456 [{"action_time":"2022-07-04T23:00:23.000000Z","time_spent":0,"url":"cool.stuff.io/awesome"}]
"""
df = pd.read_csv(io.StringIO(TESTDATA), sep="\t")
df["site_Activity"] = df["site_Activity"].apply(json.loads)
df = df.explode("site_Activity")
df = pd.concat([df[["id"]], df["site_Activity"].apply(pd.Series)], axis=1)
print(df)
# result
id action_time time_spent url
0 123 2022-07-05T01:53:59.000000Z 12 cool.stuff.io/advanced
0 123 2022-07-05T00:10:20.000000Z 0 cool.stuff.io/advanced1
0 123 2022-07-04T23:45:39.000000Z 0 cool.stuff.io
1 456 2022-07-04T23:00:23.000000Z 0 cool.stuff.io/awesome
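An equivalent route, sketched here on a trimmed-down version of the same data, is to skip the explode/concat steps and hand the parsed records straight to pd.json_normalize with record_path/meta:

```python
import io
import json

import pandas as pd

TESTDATA = "id\tsite_Activity\n" \
    '123\t[{"action_time":"2022-07-05T01:53:59.000000Z","time_spent":12,"url":"cool.stuff.io/advanced"}]\n' \
    '456\t[{"action_time":"2022-07-04T23:00:23.000000Z","time_spent":0,"url":"cool.stuff.io/awesome"}]\n'

df = pd.read_csv(io.StringIO(TESTDATA), sep="\t")
df["site_Activity"] = df["site_Activity"].apply(json.loads)

# record_path walks into each row's list of dicts; meta repeats 'id' per record
flat = pd.json_normalize(df.to_dict("records"), record_path="site_Activity", meta="id")
print(flat)
```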

You can use this:
df = pd.json_normalize(json_data_var, 'site_Activity', ['id'])
'site_Activity' without the square brackets is the 'many' part; 'id' in the square brackets is the 'one'.
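A minimal sketch with a hand-built record (hypothetical data) shows that one-to-many expansion:

```python
import pandas as pd

record = {
    "id": 123,
    "site_Activity": [
        {"url": "cool.stuff.io/advanced", "time_spent": 12},
        {"url": "cool.stuff.io", "time_spent": 0},
    ],
}

# the list under 'site_Activity' becomes the rows; 'id' is repeated on each
df = pd.json_normalize(record, "site_Activity", ["id"])
print(df)
```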
Take this example:
import requests
import pandas as pd

url = 'https://someURL.com/burgers'
headers = {
    'X-RapidAPI-Key': 'someKey',
    'X-RapidAPI-Host': 'someHost.someAPI.com'
}
response = requests.request('GET', url, headers=headers)
api_data = response.json()
df = pd.json_normalize(api_data)
print(df)
This now gives me a dataframe where the 'ingredients' and 'addresses' columns contain nested data.
I can then focus on addresses and normalise that:
df = pd.json_normalize(api_data, 'addresses', ['name', 'restaurant', 'web', 'description'])

Related

Converting API output from a dictionary to a dataframe (Python)

I have fed some data into the TravelTime API (https://github.com/traveltime-dev/traveltime-python-sdk), which calculates the time it takes to drive between 2 locations. The result of this (called out) is a dictionary that looks like this:
{'results': [{'search_id': 'arrival_one_to_many',
'locations': [{'id': 'KA8 0EU', 'properties': {'travel_time': 2646}},
{'id': 'KA21 5DT', 'properties': {'travel_time': 392}}],
'unreachable': []}]}
However, I need a table that would look a bit like this:
search_id            id        Travel Time
arrival_one_to_many  KA8 0EU   2646
arrival_one_to_many  KA21 5DT  392
I've tried converting this dictionary to a dataframe using
out_2 = pd.DataFrame.from_dict(out)
This shows as a dataframe with one column called results, so I tried using out_2['results'].str.split(',', expand=True) to split this into multiple columns at the comma delimiters, but got an error:
     0
0  NaN
Is anyone able to help me to get this dictionary to a readable and useable dataframe/table?
Thanks
#MICHAELKM22 since you are not using all the keys from the dictionary, you won't be able to convert it directly to a dataframe.
First extract the required keys, then convert them into a dataframe.
df_list = []
for res in data['results']:
    search_id = res['search_id']
    for loc in res['locations']:
        temp_df = {}
        temp_df['search_id'] = search_id
        temp_df['id'] = loc["id"]
        temp_df['travel_time'] = loc["properties"]['travel_time']
        df_list.append(temp_df)
df = pd.DataFrame(df_list)
search_id id travel_time
0 arrival_one_to_many KA8 0EU 2646
1 arrival_one_to_many KA21 5DT 392
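For comparison, pd.json_normalize can do the same extraction declaratively; here is a sketch on the dictionary from the question, renaming the auto-generated properties.travel_time column afterwards:

```python
import pandas as pd

data = {'results': [{'search_id': 'arrival_one_to_many',
                     'locations': [{'id': 'KA8 0EU', 'properties': {'travel_time': 2646}},
                                   {'id': 'KA21 5DT', 'properties': {'travel_time': 392}}],
                     'unreachable': []}]}

# 'locations' is the record path; 'search_id' rides along as metadata
df = pd.json_normalize(data['results'], record_path='locations', meta='search_id')
df = df.rename(columns={'properties.travel_time': 'travel_time'})
print(df)
```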
First, this JSON needs to be parsed to fetch the required values. Once those values are fetched, we can store them in a dataframe.
Below is the code to parse this JSON (PS: I have saved the JSON in a file) and add the values to a DataFrame.
import json
import pandas as pd

with open('Json_file.json') as f:
    req_file = json.load(f)

# DataFrame.append was removed in pandas 2.0, so collect rows in a list first
rows = []
for i in req_file['results']:
    for j in i['locations']:
        rows.append({'searchid': i['search_id'],
                     'location_id': j['id'],
                     'travel_time': j['properties']['travel_time']})
df = pd.DataFrame(rows)
print(df)
Below is the output of the above code:
              searchid location_id  travel_time
0  arrival_one_to_many     KA8 0EU         2646
1  arrival_one_to_many    KA21 5DT          392

How to read this JSON file in Python?

I'm trying to read such a JSON file in Python, to save only two of the values of each response part:
{
  "responseHeader": {
    "status": 0,
    "time": 2,
    "params": {
      "q": "query",
      "rows": "2",
      "wt": "json"}},
  "response": {"results": 2, "start": 0, "docs": [
    {
      "name": ["Peter"],
      "country": ["England"],
      "age": ["23"]},
    {
      "name": ["Harry"],
      "country": ["Wales"],
      "age": ["30"]}]
}}
For example, I want to put the name and the age in a table. I already tried it this way (based on this topic), but it's not working for me.
import json
import pandas as pd
file = open("myfile.json")
data = json.loads(file)
columns = [dct['name', 'age'] for dct in data['response']]
df = pd.DataFrame(data['response'], columns=columns)
print(df)
I also have seen more solutions of reading a JSON file, but that all were solutions of a JSON file with no other header values at the top, like responseHeader in this case. I don't know how to handle that. Anyone who can help me out?
import json

with open("myfile.json") as f:
    columns = [(dic["name"], dic["age"]) for dic in json.load(f)["response"]["docs"]]
print(columns)
result:
[(['Peter'], ['23']), (['Harry'], ['30'])]
You can pass the list data["response"]["docs"] to pandas directly as it's a recordset.
df = pd.DataFrame(data["response"]["docs"])
print(df)
>>> name country age
0 [Peter] [England] [23]
1 [Harry] [Wales] [30]
The data in your DataFrame will be bracketed, though, as you can see. If you want to remove the brackets you can consider the following:
for column in df.columns:
    df.loc[:, column] = df.loc[:, column].str.get(0)
    if column == 'age':
        df.loc[:, column] = df.loc[:, column].astype(int)
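The same unbracketing can be done without an explicit per-column loop; a sketch on an inline copy of the docs, assuming every cell is a one-element list:

```python
import pandas as pd

df = pd.DataFrame({"name": [["Peter"], ["Harry"]],
                   "country": [["England"], ["Wales"]],
                   "age": [["23"], ["30"]]})

# .str.get(0) pulls the first element out of each list cell, column by column
df = df.apply(lambda col: col.str.get(0))
df["age"] = df["age"].astype(int)
print(df)
```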
sample = {"responseHeader": {
              "status": 0,
              "time": 2,
              "params": {
                  "q": "query",
                  "rows": "2",
                  "wt": "json"}},
          "response": {"results": 2, "start": 0, "docs": [
              {"name": ["Peter"], "country": ["England"], "age": ["23"]},
              {"name": ["Harry"], "country": ["Wales"], "age": ["30"]}]}}

data = [(x['name'][0], x['age'][0]) for x in sample['response']['docs']]
df = pd.DataFrame(data, columns=['name', 'age'])

convert a list in rows of dataframe in one column to simple string

I have a dataframe which has a list in one column that I want to convert into a simple string.
id data_words_nostops
26561364 [andrographolide, major, labdane, diterpenoid]
26561979 [dgat, plays, critical, role, hepatic, triglyc]
26562217 [despite, success, imatinib, inhibiting, bcr]
DESIRED OUTPUT
id data_words_nostops
26561364 andrographolide, major, labdane, diterpenoid
26561979 dgat, plays, critical, role, hepatic, triglyc
26562217 despite, success, imatinib, inhibiting, bcr
Try this:
df['data_words_nostops'] = df['data_words_nostops'].apply(lambda row: ','.join(row))
Complete code:
import pandas as pd

l1 = ['26561364', '26561979', '26562217']
l2 = [['andrographolide', 'major', 'labdane', 'diterpenoid'],
      ['dgat', 'plays', 'critical', 'role', 'hepatic', 'triglyc'],
      ['despite', 'success', 'imatinib', 'inhibiting', 'bcr']]
df = pd.DataFrame(list(zip(l1, l2)), columns=['id', 'data_words_nostops'])
df['data_words_nostops'] = df['data_words_nostops'].apply(lambda row: ','.join(row))
Output :
id data_words_nostops
0 26561364 andrographolide,major,labdane,diterpenoid
1 26561979 dgat,plays,critical,role,hepatic,triglyc
2 26562217 despite,success,imatinib,inhibiting,bcr
df["data_words_nostops"] = df.apply(lambda row: row["data_words_nostops"][0], axis=1)
You can use pandas str join for this:
df["data_words_nostops"] = df["data_words_nostops"].str.join(",")
df
id data_words_nostops
0 26561364 andrographolide,major,labdane,diterpenoid
1 26561979 dgat,plays,critical,role,hepatic,triglyc
2 26562217 despite,success,imatinib,inhibiting,bcr
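One caveat: .str.join only works when the cells hold real Python lists. If the column came back from a CSV round-trip, the "lists" may actually be strings, in which case ast.literal_eval can parse them first (a sketch with a single made-up row):

```python
import ast

import pandas as pd

# the cell is a string that merely looks like a list
df = pd.DataFrame({"data_words_nostops": ["['andrographolide', 'major']"]})

df["data_words_nostops"] = (
    df["data_words_nostops"].apply(ast.literal_eval).str.join(", ")
)
print(df)
```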
I tried the following as well
df_ready['data_words_nostops_Joined'] = df_ready.data_words_nostops.apply(', '.join)

Remove grave accent from IDs

I have an ID column with a grave accent, like `1234ABC40, and I want to remove just that character from the column while keeping the dataframe form.
I tried this on the column only. I have a file named x here with multiple columns; id is the column I want to fix.
x = pd.read_csv(r'C:\filename.csv', index_col=False)
id = str(x['id'])
id2 = unidecode.unidecode(id)
id3 = id2.replace('`', '')
This changes to str but I want that column back in the dataframe form
DataFrames have their own replace() function. Note, for partial replacements you must enable regex=True in the parameters:
import pandas as pd

d = {'id': ["12`3", "32`1"], 'id2': ["004`", "9`99"]}
df = pd.DataFrame(data=d)
df["id"] = df["id"].replace('`', '', regex=True)
print(df)
id id2
0 123 004`
1 321 9`99
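Since the backtick is a literal character rather than a pattern, the .str accessor with regex=False is an equivalent option (same toy data):

```python
import pandas as pd

df = pd.DataFrame({"id": ["`1234ABC40", "32`1"]})

# literal replacement; no regex machinery involved
df["id"] = df["id"].str.replace("`", "", regex=False)
print(df)
```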

Slicing Pandas DataFrame based on csv

Let's say I have a Pandas DataFrame like following.
df = pd.DataFrame({'Name' : ['A','B','C'],
                   'Country' : ['US','UK','SL']})
Country Name
0 US A
1 UK B
2 SL C
And I'm having a csv like following.
Name,Extended
A,Jorge
B,Alex
E,Mark
F,Bindu
I need to check whether df['Name'] is in csv and if so get the "Extended". If not I need to just get the "Name". So my Expected output is like following.
Country Name Extended
0 US A Jorge
1 UK B Alex
2 SL C C
Following shows what I tried so far.
f = open('mycsv.csv','r')
lines = f.readlines()

def parse(x):
    for line in lines:
        if x in line.split(',')[0]:
            return line.strip().split(',')[1]

df['Extended'] = df['Name'].apply(parse)
Name Country Extended
0 A US Jorge
1 B UK Alex
2 C SL None
I cannot figure out how to get the "Name" for C in "Extended" (the else part of the code). Any help?
You can use the "fillna" function from pandas like this:
import pandas as pd

df1 = pd.DataFrame({'Name' : ['A','B','C'],
                    'Country' : ['US','UK','SL']})
df2 = pd.read_csv('mycsv.csv')  # pd.DataFrame.from_csv was removed; use pd.read_csv
df_merge = pd.merge(df1, df2, how="left", on="Name")
df_merge["Extended"] = df_merge["Extended"].fillna(df_merge["Name"])
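A self-contained version of that merge-plus-fillna approach, building both frames inline instead of reading mycsv.csv, might look like:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["A", "B", "C"],
                   "Country": ["US", "UK", "SL"]})
df2 = pd.DataFrame({"Name": ["A", "B", "E", "F"],
                    "Extended": ["Jorge", "Alex", "Mark", "Bindu"]})

merged = df.merge(df2, how="left", on="Name")
# fall back to Name wherever the csv had no matching row
merged["Extended"] = merged["Extended"].fillna(merged["Name"])
print(merged)
```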
You could just load the csv as a df and then assign using where:
df['Name'] = df2['Extended'].where(df2['Name'] != df2['Extended'], df2['Name'])
So here we use the boolean condition to test whether 'Name' is not equal to 'Extended' and take that value; otherwise we just use 'Name'.
Also, is 'Extended' always either different from or the same as 'Name'? If so, why not just assign the value of Extended to the dataframe:
df['Name'] = df2['Extended']
This would be a lot simpler.
