Json file to pandas data frame

Json file to pandas data frame - python

I have a JSON file that looks like the one below.
myjson= {'data': [{'ID': 'da45e00ca',
'name': 'June_2016',
'objCode': 'ased',
'percentComplete': 4.17,
'plannedCompletionDate': '2021-04-29T10:00:00:000-0500',
'plannedStartDate': '2020-04-16T23:00:00:000-0500',
'priority': 4,
'asedectedCompletionDate': '2022-02-09T10:00:00:000-0600',
'status': 'weds'},
{'ID': '10041ce23c',
'name': '2017_Always',
'objCode': 'ased',
'percentComplete': 4.17,
'plannedCompletionDate': '2021-10-22T10:00:00:000-0600',
'plannedStartDate': '2021-08-09T23:00:00:000-0600',
'priority': 3,
'asedectedCompletionDate': '2023-12-30T11:05:00:000-0600',
'status': 'weds'},
{'ID': '10041ce23ca',
'name': '2017_Always',
'objCode': 'ased',
'percentComplete': 4.17,
'plannedCompletionDate': '2021-10-22T10:00:00:000-0600',
'plannedStartDate': '2021-08-09T23:00:00:000-0600',
'priority': 3,
'asedectedCompletionDate': '2023-12-30T11:05:00:000-0600',
'status': 'weds'}]}
I tried to normalize it and convert it to a pandas DataFrame with the code below, but the result doesn't come out right:
from pandas.io.json import json_normalize
reff = json_normalize(myjson)
df = pd.DataFrame(data=reff)
df
Does anyone have any idea what I'm doing wrong? Thanks in advance!

Try:
import pandas as pd
reff = pd.json_normalize(myjson['data'])
df = pd.DataFrame(data=reff)
df
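You forgot to pull your data out of myjson. json_normalize() only iterates through the outermost layer of the object you pass it, so it needs the list stored under the 'data' key rather than the enclosing dict. (json_normalize() already returns a DataFrame, so the extra pd.DataFrame(...) call is optional.)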
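A small sketch of the difference, using a shortened copy of the question's data. The trimmed-down dict (three keys per record) and the names whole and records are assumptions made for illustration only.

```python
import pandas as pd

# Shortened stand-in for the question's myjson (illustrative only)
myjson = {
    "data": [
        {"ID": "da45e00ca", "name": "June_2016", "priority": 4},
        {"ID": "10041ce23c", "name": "2017_Always", "priority": 3},
    ]
}

# Passing the whole dict: the list under 'data' is kept as a single cell
whole = pd.json_normalize(myjson)

# Passing the list of records: one row per record, one column per key
records = pd.json_normalize(myjson["data"])

print(whole.shape)
print(records.shape)
```

Comparing the two shapes shows why the original attempt looked wrong: the first call produces a single row whose only column holds the entire list.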

json_normalize() flattens the JSON data and returns it as a pandas DataFrame. It is available from the pandas module as pd.json_normalize (pandas 1.0 and later).
Step 1 - Load the json data
json.loads(json_string)
Step 2 - Pass the loaded data into json_normalize() method
json_normalize(json.loads(json_string))
Example:
import pandas as pd
import json
# Create json string
# with student details
json_string = '''
[
{ "id": "1", "name": "sravan","age":22 },
{ "id": "2", "name": "harsha","age":22 },
{ "id": "3", "name": "deepika","age":21 },
{ "id": "4", "name": "jyothika","age":23 }
]
'''
# Load json data and convert to Dataframe
df = pd.json_normalize(json.loads(json_string))
# Display the Dataframe
print(df)
Output:
   id      name  age
0   1    sravan   22
1   2    harsha   22
2   3   deepika   21
3   4  jyothika   23

Related

pandas read_json from s3 with chunksize option returns single row multiple columns dataframe

I have a json file in s3 (with >100 records), this is sample json file format:
[{
"data": {
"a": "hello"
},
"details": {
"b": "hello1"
},
"dtype": "SP"
},
{
"data": {
"a": "hello2"
},
"details": {
"b": "hello3"
},
"dtype": "SP"
}]
I use awswrangler's read_json with a boto3 session, and I get a DataFrame in the right format:
data details dtype
0 {'a': 'hello'} {'b': 'hello1'} SP
1 {'a': 'hello2'} {'b': 'hello3'} SP
If I use the chunksize option along with lines=True, I get a DataFrame with a single row and multiple columns:
0 1
0 {'data': {'a': 'hello'}, 'details': {'b': 'hel... {'data': {'a': 'hello2'}, 'details': {'b': 'he...
Is there a way to still get the right DataFrame format (multiple rows), with the number of rows given by chunksize?
Update: I have tried nrows instead of chunksize. It didn't help; it gives me the same output as chunksize.
Code I am using to read the JSON file from S3:
import boto3
import botocore
import awswrangler as wr
client = boto3.session.Session(
    aws_access_key_id=s3_access_key,
    aws_secret_access_key=s3_secret_key,
    region_name=s3_region,
)
def read_json_s3(path, client, **args):
    try:
        return wr.s3.read_json(path=path, boto3_session=client, **args)
    except botocore.exceptions.ClientError as err:
        raise err
I am sending chunksize=1000, lines=True as args
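No answer is recorded for this question. One likely cause, stated as an assumption because the actual file layout is not shown: in pandas.read_json, chunksize only works together with lines=True, and lines=True expects newline-delimited JSON (one object per line) rather than a single JSON array. A file holding one array on one line would then be parsed as a single record. The sketch below reproduces the idea locally with an in-memory buffer; it does not touch S3, and whether wr.s3.read_json behaves identically is an assumption.

```python
import io
import json
import pandas as pd

# Same two records as in the question
records = [
    {"data": {"a": "hello"}, "details": {"b": "hello1"}, "dtype": "SP"},
    {"data": {"a": "hello2"}, "details": {"b": "hello3"}, "dtype": "SP"},
]

# Rewrite the JSON array as JSON Lines: one object per line
json_lines = "\n".join(json.dumps(record) for record in records)

# With lines=True, chunksize yields an iterator of DataFrames
reader = pd.read_json(io.StringIO(json_lines), lines=True, chunksize=1)
chunks = [chunk for chunk in reader]

for chunk in chunks:
    print(chunk)
```

If the stored file cannot be changed to JSON Lines, reading it without chunksize and slicing the resulting DataFrame is another option.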

convert json data to pandas dataframe in python (dictionary inside list )

I have json data like below:
{"name": "Monkey", "image": "https://media.npr.org/assets/img/2017/09/12/macaca_nigra_self-portrait-3e0070aa19a7fe36e802253048411a38f14a79f8-s800-c85.webp", "attributes": [{"trait_type": "Bones", "value": "Zombie"}, {"trait_type": "Clothes", "value": "Striped"}, {"trait_type": "Mouth", "value": "Bubblegum"}, {"trait_type": "Eyes", "value": "Black Sunglasses"}, {"trait_type": "Hat", "value": "Sushi"}, {"trait_type": "Background", "value": "Purple"}]}
I want to convert this JSON data to a pandas DataFrame that contains only the attributes, laid out as below:
Bones   Clothes  Mouth      Eyes   Hat    Background
zombie  striped  bubblegum  black  sushi  purple
Can anyone help me get the output shown above?
Thank you

There is probably a prettier solution but this does the job:
import json
import pandas as pd
# Load the JSON document from disk
with open('file.json') as f:
    data = json.load(f)
trait_types = []
values = []
# Collect each attribute's trait type and value
for key in data['attributes']:
    trait_types.append(key['trait_type'])
    values.append(key['value'])
# One row per attribute (long format)
df = pd.DataFrame({
    'trait type': trait_types,
    'value': values})
print(df)
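The answer above yields one row per attribute. To get the single-row layout the question asks for, with one column per trait, a possible sketch is to build one dict that maps each trait_type to its value. The inline data dict (with the image key omitted) and the name wide are assumptions for illustration.

```python
import pandas as pd

# Inline copy of the question's data, without the image URL
data = {
    "name": "Monkey",
    "attributes": [
        {"trait_type": "Bones", "value": "Zombie"},
        {"trait_type": "Clothes", "value": "Striped"},
        {"trait_type": "Mouth", "value": "Bubblegum"},
        {"trait_type": "Eyes", "value": "Black Sunglasses"},
        {"trait_type": "Hat", "value": "Sushi"},
        {"trait_type": "Background", "value": "Purple"},
    ],
}

# Map trait_type -> value, then wrap in a list so it becomes a single row
row = {attr["trait_type"]: attr["value"] for attr in data["attributes"]}
wide = pd.DataFrame([row])
print(wide)
```

Lower-casing or shortening the values (for example "black" instead of "Black Sunglasses") would be an extra step that is not shown here.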

Extracting str from pandas dataframe using json

I read a CSV file into a DataFrame named df.
Each row contains a string like the one below.
'{"id":2140043003,"name":"Olallo Rubio",...}'
I would like to extract "name" and "id" from each row and store them in a new DataFrame.
I use the following code to extract them, but it raises an error. Please let me know if you have any suggestions on how to solve this problem. Thanks
JSONDecodeError: Expecting ',' delimiter: line 1 column 32 (char 31)

text={
"id": 2140043003,
"name": "Olallo Rubio",
"is_registered": True,
"chosen_currency": 'Null',
"avatar": {
"thumb": "https://ksr-ugc.imgix.net/assets/019/223/259/16513215a3869caaea2d35d43f3c0c5f_original.jpg?w=40&h=40&fit=crop&v=1510685152&auto=format&q=92&s=653706657ccc49f68a27445ea37ad39a",
"small": "https://ksr-ugc.imgix.net/assets/019/223/259/16513215a3869caaea2d35d43f3c0c5f_original.jpg?w=160&h=160&fit=crop&v=1510685152&auto=format&q=92&s=0bd2f3cec5f12553e679153ba2b5d7fa",
"medium": "https://ksr-ugc.imgix.net/assets/019/223/259/16513215a3869caaea2d35d43f3c0c5f_original.jpg?w=160&h=160&fit=crop&v=1510685152&auto=format&q=92&s=0bd2f3cec5f12553e679153ba2b5d7fa"
},
"urls": {
"web": {
"user": "https://www.kickstarter.com/profile/2140043003"
},
"api": {
"user": "https://api.kickstarter.com/v1/users/2140043003?signature=1531480520.09df9a36f649d71a3a81eb14684ad0d3afc83e03"
}
}
}
def extract(text,*args):
    list1=[]
    for i in args:
        list1.append(text[i])
    return list1
print(extract(text,'name','id'))
# ['Olallo Rubio', 2140043003]

Here's what I came up with using pandas.json_normalize():
import pandas as pd
sample = [{
"id": 2140043003,
"name":"Olallo Rubio",
"is_registered": True,
"chosen_currency": None,
"avatar":{
"thumb":"https://ksr-ugc.imgix.net/assets/019/223/259/16513215a3869caaea2d35d43f3c0c5f_original.jpg?w=40&h=40&fit=crop&v=1510685152&auto=format&q=92&s=653706657ccc49f68a27445ea37ad39a",
"small":"https://ksr-ugc.imgix.net/assets/019/223/259/16513215a3869caaea2d35d43f3c0c5f_original.jpg?w=160&h=160&fit=crop&v=1510685152&auto=format&q=92&s=0bd2f3cec5f12553e679153ba2b5d7fa",
"medium":"https://ksr-ugc.imgix.net/assets/019/223/259/16513215a3869caaea2d35d43f3c0c5f_original.jpg?w=160&h=160&fit=crop&v=1510685152&auto=format&q=92&s=0bd2f3cec5f12553e679153ba2b5d7fa"
},
"urls":{
"web":{
"user":"https://www.kickstarter.com/profile/2140043003"
},
"api":{
"user":"https://api.kickstarter.com/v1/users/2140043003?signature=1531480520.09df9a36f649d71a3a81eb14684ad0d3afc83e03"
}
}
}]
# Create dataframe
df = pd.json_normalize(sample)
# Select columns into new dataframe.
df1 = df.loc[:, ["name", "id"]]
Check df1:
Input:
print(df1)
Output:
           name          id
0  Olallo Rubio  2140043003
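The answer starts from a list of dicts, whereas the question describes a DataFrame column that holds JSON strings. A minimal sketch of bridging that gap is below; the column name raw and the two sample strings are hypothetical, not taken from the asker's file.

```python
import json
import pandas as pd

# Hypothetical DataFrame: each row holds one JSON document as a string
df = pd.DataFrame({
    "raw": [
        '{"id": 2140043003, "name": "Olallo Rubio", "is_registered": true}',
        '{"id": 42, "name": "Example User", "is_registered": false}',
    ]
})

# Parse every string into a dict, then flatten the dicts into columns
parsed = df["raw"].apply(json.loads)
df1 = pd.json_normalize(parsed.tolist())[["name", "id"]]
print(df1)
```

If json.loads raises JSONDecodeError on a row, the stored string is not valid JSON as written, so inspecting the text around the reported character position is a reasonable first step.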

Extracting JSON data into a relational table

I have a JSON file which resulted from YouTube's iframe API and needs to be preprocessed. I want to put this JSON data into a pandas dataframe, where each JSON key will be a column, and each recorded "event" should be a new row.
I was able to load the data as a dataframe using read_json, but then each event ends up as a single dict in one column instead of its keys becoming separate columns.
Here is what my JSON data looks like :
{
"events":[
{
"timemillis":1563467463580,
"date":"18.7.2019",
"time":"18:31:03,580",
"name":"Player is loading",
"data":""
},
{
"timemillis":1563467463668,
"date":"18.7.2019",
"time":"18:31:03,668",
"name":"Player is loaded",
"data":"5"
}
]
}
And this is what I did to convert it to a dataframe:
data=pd.read_json("file.json")
df=pd.DataFrame(data)
print(df)
The output looks like this:
0 {'timemillis': 1563469276604, 'date': '18.7.20...
1 {'timemillis': 1563469276694, 'date': '18.7.20...
...
How can I convert this output into a table with separate columns for keys such as 'timemillis', 'date', 'name' and so on? I have never worked with JSON before, so I am a bit confused.

import pandas as pd
import json
data = {
"events":[
{
"timemillis":1563467463580,
"date":"18.7.2019",
"time":"18:31:03,580",
"name":"Player is loading",
"data":""
},
{
"timemillis":1563467463668,
"date":"18.7.2019",
"time":"18:31:03,668",
"name":"Player is loaded",
"data":"5"
}
]
}
# or read the data from a file:
# rather than reading the file directly into a pandas dataframe, load it as JSON first
# data=pd.read_json("file.json")
with open('file.json') as json_file:
    data = json.load(json_file)
df=pd.DataFrame(data['events'])
print(df)
Result
   data       date               name          time     timemillis
0        18.7.2019  Player is loading  18:31:03,580  1563467463580
1     5  18.7.2019   Player is loaded  18:31:03,668  1563467463668
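As an alternative sketch, pd.json_normalize can pull the nested list out directly through its record_path parameter, which avoids indexing into the dict by hand. The inline data dict repeats the question's sample.

```python
import pandas as pd

# The question's sample data
data = {
    "events": [
        {
            "timemillis": 1563467463580,
            "date": "18.7.2019",
            "time": "18:31:03,580",
            "name": "Player is loading",
            "data": "",
        },
        {
            "timemillis": 1563467463668,
            "date": "18.7.2019",
            "time": "18:31:03,668",
            "name": "Player is loaded",
            "data": "5",
        },
    ]
}

# record_path names the key whose list holds the rows
df = pd.json_normalize(data, record_path="events")
print(df)
```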

import pandas as pd
df=pd.read_json("file.json",orient='columns')
rows = []
for i,r in df.iterrows():
    rows.append({'eventid':i+1,'timemillis':r['events']['timemillis'],'name':r['events']['name']})
df = pd.DataFrame(rows)
print(df)
Now you can insert this df into a database.
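For the database step, one possible sketch uses DataFrame.to_sql with an in-memory SQLite connection. The table name events and the choice of SQLite are assumptions; the answer does not say which database is meant.

```python
import sqlite3
import pandas as pd

# Rows shaped like the ones built in the loop above
df = pd.DataFrame([
    {"eventid": 1, "timemillis": 1563467463580, "name": "Player is loading"},
    {"eventid": 2, "timemillis": 1563467463668, "name": "Player is loaded"},
])

# In-memory SQLite database, used only for illustration
conn = sqlite3.connect(":memory:")

# Write the DataFrame as a table; 'events' is a hypothetical table name
df.to_sql("events", conn, index=False, if_exists="replace")

# Read the row count back to confirm the insert
row_count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(row_count)
conn.close()
```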

Write json format using pandas Series and DataFrame

I'm working with CSV files. My goal is to write JSON from the CSV data. Specifically, I want a format similar to miserables.json.
Example:
{"source": "Napoleon", "target": "Myriel", "value": 1},
With the information I have, the format would be:
[
{
"source": "Germany",
"target": "Mexico",
"value": 1
},
{
"source": "Germany",
"target": "USA",
"value": 2
},
{
"source": "Brazil",
"target": "Argentina",
"value": 3
}
]
However, with the code I used, the output looks as follows:
[
{
"source": "Germany",
"target": "Mexico",
"value": 1
},
{
"source": null,
"target": "USA",
"value": 2
}
][
{
"source": "Brazil",
"target": "Argentina",
"value": 3
}
]
The null source should be Germany. This is one of the main problems, because more entries have the same issue. Apart from this, the information is correct. I just want to merge the separate lists into one and replace null with the correct country.
This is the code I used, with pandas and collections:
csvdata = pandas.read_csv('file.csv', low_memory=False, encoding='latin-1')
countries = csvdata['country'].tolist()
newcountries = list(set(countries))
for element in newcountries:
    bills = csvdata['target'][csvdata['country'] == element]
    frquency = Counter(bills)
    sourceTemp = []
    value = []
    country = element
    for k,v in frquency.items():
        sourceTemp.append(k)
        value.append(int(v))
    forceData = {'source': Series(country), 'target': Series(sourceTemp), 'value': Series(value)}
    dfForce = DataFrame(forceData)
    jsondata = dfForce.to_json(orient='records', force_ascii=False, default_handler=callable)
    parsed = json.loads(jsondata)
    newData = json.dumps(parsed, indent=4, ensure_ascii=False, sort_keys=True)
    # since to_json doesn't have append mode this will be written in txt file
    savetxt = open('data.txt', 'a')
    savetxt.write(newData)
    savetxt.close()
Any suggestions to solve this problem are appreciated!
Thanks

Consider removing the Series() around the scalar value, country. Wrapping it in Series() creates a one-element series, and when the dictionary of series is then assembled into a dataframe, pandas pads that series with NaN (later converted to null in JSON) to match the lengths of the other series. You can see this by printing out the dfForce dataframe:
from pandas import Series
from pandas import DataFrame
country = 'Germany'
sourceTemp = ['Mexico', 'USA', 'Argentina']
value = [1, 2, 3]
forceData = {'source': Series(country),
'target': Series(sourceTemp),
'value': Series(value)}
dfForce = DataFrame(forceData)
# source target value
# 0 Germany Mexico 1
# 1 NaN USA 2
# 2 NaN Argentina 3
To resolve this, simply keep country as a scalar in the dictionary of series:
forceData = {'source': country,
'target': Series(sourceTemp),
'value': Series(value)}
dfForce = DataFrame(forceData)
# source target value
# 0 Germany Mexico 1
# 1 Germany USA 2
# 2 Germany Argentina 3
By the way, you do not need a dataframe object to output JSON. Simply use a list of dictionaries. Consider the following, which uses an OrderedDict (to maintain the order of keys). The list grows inside the loop and is dumped to the file once at the end. Appending to the file on every iteration instead would produce invalid JSON, because adjacent, opposite-facing square brackets ...][... are not allowed.
from collections import OrderedDict
...
data = []
for element in newcountries:
    bills = csvdata['target'][csvdata['country'] == element]
    frquency = Counter(bills)
    for k,v in frquency.items():
        inner = OrderedDict()
        inner['source'] = element
        inner['target'] = k
        inner['value'] = int(v)
        data.append(inner)
newData = json.dumps(data, indent=4)
with open('data.json', 'w') as savetxt:
    savetxt.write(newData)
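A different approach, offered as a sketch and not part of the answer above, is to let pandas do the counting with groupby and size. The small csvdata frame here is invented sample data standing in for the asker's CSV; the column names country and target are taken from the question's code.

```python
import json
import pandas as pd

# Invented stand-in for the CSV contents
csvdata = pd.DataFrame({
    "country": ["Germany", "Germany", "Germany", "Brazil"],
    "target": ["Mexico", "USA", "USA", "Argentina"],
})

# Count each (country, target) pair, then rename to the desired keys
counts = (
    csvdata.groupby(["country", "target"])
    .size()
    .reset_index(name="value")
    .rename(columns={"country": "source"})
)

# Round-trip through to_json so the counts become plain JSON integers
records = json.loads(counts.to_json(orient="records"))
newData = json.dumps(records, indent=4)
print(newData)
```

Because every row carries its own source value, no NaN padding occurs, and the result is a single JSON list that can be written to a file in one step.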
