Convert JSON file into proper format using python pandas - python

I want to convert JSON file into proper format.
I have a JSON file as given below:
{
"fruit": "Apple",
"size": "Large",
"color": "Red",
"details":"|seedless:true|,|condition:New|"
},
{
"fruit": "Almond",
"size": "small",
"color": "brown",
"details":"|Type:dry|,|seedless:true|,|condition:New|"
}
You can see the data in the details can vary.
I want to change it into :
{
"fruit": "Apple",
"size": "Large",
"color": "Red",
"seedless":"true",
"condition":"New",
},
{
"fruit": "Almond",
"size": "small",
"color": "brown",
"Type":"dry",
"seedless":"true",
"condition":"New",
}
I have tried doing it in python using pandas as:
import json
import pandas as pd
import re
df = pd.read_json("data.json",lines=True)
#I tried to change the pattern of data in details column as
re1 = re.compile('r/|(.?):(.?)|/')
re2 = re.compile('r\"(.*?)\":\"(.*?)\"')
df.replace({'details' :re1}, {'details' : re2},inplace = True, regex = True);
But that gives "objects" as the output in every row of the details column.

Try this,
for d in data:
    details = d.pop('details')
    d.update(dict(x.split(":") for x in details.split("|") if ":" in x))
print(data)
[{'color': 'Red',
'condition': 'New',
'fruit': 'Apple',
'seedless': 'true',
'size': 'Large'},
{'Type': 'dry',
'color': 'brown',
'condition': 'New',
'fruit': 'Almond',
'seedless': 'true',
'size': 'small'}]
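The loop above assumes that data is already a list of dictionaries and that no value contains a colon. A minimal self-contained sketch of the same idea follows; the helper name parse_details and the list name flattened are made up for illustration, and the split is limited to the first colon so that values containing ':' survive.

```python
def parse_details(details):
    """Turn '|k:v|,|k2:v2|' into {'k': 'v', 'k2': 'v2'}."""
    pairs = {}
    for chunk in details.split("|"):
        if ":" in chunk:
            # Split on the first colon only, so a value may itself contain ':'
            key, value = chunk.split(":", 1)
            pairs[key.strip()] = value.strip()
    return pairs


# Sample records copied from the question
data = [
    {"fruit": "Apple", "size": "Large", "color": "Red",
     "details": "|seedless:true|,|condition:New|"},
    {"fruit": "Almond", "size": "small", "color": "brown",
     "details": "|Type:dry|,|seedless:true|,|condition:New|"},
]

flattened = []
for record in data:
    # Copy every field except 'details', then merge in the parsed pairs
    row = {k: v for k, v in record.items() if k != "details"}
    row.update(parse_details(record.get("details", "")))
    flattened.append(row)

print(flattened)
```

Unlike the loop above, this version builds new dictionaries instead of modifying data in place, which may or may not be what you want.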

You can convert the (list of) dictionaries to a pandas data frame.
import pandas as pd
# data is a list of dictionaries
data = [{
"fruit": "Apple",
"size": "Large",
"color": "Red",
"details":"|seedless:true|,|condition:New|"
},
{
"fruit": "Almond",
"size": "small",
"color": "brown",
"details":"|Type:dry|,|seedless:true|,|condition:New|"
}]
# convert to data frame
df = pd.DataFrame(data)
# remove '|' from details and convert to list
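# (regex=False treats '|' as a literal character; this is the default only in pandas 2.0+)
df['details'] = df['details'].str.replace('|', '', regex=False).str.split(',')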
# explode list => one row for each element
df = df.explode('details')
# split details into name/value pair
df[['name', 'value']] = df['details'].str.split(':', expand=True)
# drop details column
df = df.drop(columns='details')
print(df)
    fruit   size  color       name value
0   Apple  Large    Red   seedless  true
0   Apple  Large    Red  condition   New
1  Almond  small  brown       Type   dry
1  Almond  small  brown   seedless  true
1  Almond  small  brown  condition   New
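The result above is in long format (one row per detail), whereas the question asks for one column per detail key. A possible way to get there is to pivot the long frame, sketched below. It assumes pandas 1.1 or later (for a list-valued index in pivot) and that each fruit/size/color combination is unique; long_df and wide_df are illustrative names, and long_df is rebuilt by hand here so the snippet runs on its own.

```python
import pandas as pd

# Long-format frame equivalent to the printed output above
long_df = pd.DataFrame(
    {
        "fruit": ["Apple", "Apple", "Almond", "Almond", "Almond"],
        "size": ["Large", "Large", "small", "small", "small"],
        "color": ["Red", "Red", "brown", "brown", "brown"],
        "name": ["seedless", "condition", "Type", "seedless", "condition"],
        "value": ["true", "New", "dry", "true", "New"],
    },
    index=[0, 0, 1, 1, 1],
)

# One row per fruit, one column per detail key; missing keys become NaN
wide_df = (
    long_df.pivot(index=["fruit", "size", "color"], columns="name", values="value")
    .reset_index()
)
print(wide_df)
```

Apple has no Type entry, so that cell is NaN in the wide frame.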

Related

Python- cannot get a list of values in for loop when there are numbers in json structures

I was trying to get all the types inside colours -> numbers, but it doesn't work because each colour (e.g. green, blue) is nested under a numeric key, so I could not loop through them.
Here is the code that I was trying to do:
x=0
for colour in colours['rootdata']:
    print(colour[x][type])
    x+1
but it shows 'string indices must be integers'
I'm able to get a single value like this (but this is not what I want):
colour_red = JsonResult['rootdata']['colours']['0']['type']
print (colour_red )
This is the simple json sample that I'm using
{
"rootdata": {
"colours": {
"0": {
"type": "red"
},
"1": {
"type": "green"
},
"2": {
"type": "blue"
}
}
}
}
Try this:
my_dict = {"rootdata":
{
"colours": {"0": {"type": "red"}, "1": {"type": "green"}, "2": {"type": "blue"}}
}
}
types = []
types_dict = {}
for k, v in my_dict["rootdata"]["colours"].items():
    types.append(v["type"])
    types_dict[k] = v["type"]
print(types)
# Result: ['red', 'green', 'blue']
print(types_dict)
# Result: {'0': 'red', '1': 'green', '2': 'blue'}
Regards...
You do not need to use a counter. You can obtain the colours by using the following code:
for colour in colours['rootdata']['colours'].values():
    print(colour['type'])
This code works even if the numbers are not in sequential order in the JSON, and the order of the output does not really matter.
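If the order does matter, one option is to sort the keys numerically before reading the types. This is only a sketch: it assumes every key under colours is a string that int() can parse, and the names raw and ordered_types are made up for the example.

```python
import json

# Same structure as the question's sample, with the keys deliberately out of order
raw = """
{"rootdata": {"colours": {
    "2": {"type": "blue"},
    "0": {"type": "red"},
    "1": {"type": "green"}
}}}
"""

colours = json.loads(raw)["rootdata"]["colours"]

# Sort the string keys by their integer value, then collect each type
ordered_types = [colours[key]["type"] for key in sorted(colours, key=int)]
print(ordered_types)
```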

convert json data to pandas dataframe in python (dictionary inside list )

I have json data like below:
{"name": "Monkey", "image": "https://media.npr.org/assets/img/2017/09/12/macaca_nigra_self-portrait-3e0070aa19a7fe36e802253048411a38f14a79f8-s800-c85.webp", "attributes": [{"trait_type": "Bones", "value": "Zombie"}, {"trait_type": "Clothes", "value": "Striped"}, {"trait_type": "Mouth", "value": "Bubblegum"}, {"trait_type": "Eyes", "value": "Black Sunglasses"}, {"trait_type": "Hat", "value": "Sushi"}, {"trait_type": "Background", "value": "Purple"}]}
I want to convert this JSON data into a pandas DataFrame, selecting only the attributes, as shown below:
Bones   Clothes  Mouth      Eyes   Hat    Background
zombie  striped  bubblegum  black  sushi  purple
Can anyone please help me get the output described above?
Thank you
There is probably a prettier solution but this does the job:
import json
import pandas as pd
with open('file.json') as f:
    data = json.load(f)
trait_types = []
values = []
# collect the trait type and value of each attribute
for attribute in data['attributes']:
    trait_types.append(attribute['trait_type'])
    values.append(attribute['value'])
# one row per attribute
df = pd.DataFrame({
    'trait type': trait_types,
    'value': values})
print(df)
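The answer above produces one row per attribute, while the question asks for one column per trait. A minimal sketch of that wide layout is to build a single dictionary keyed by trait_type and wrap it in a one-element list. The record is inlined here (with the image URL left out) so the snippet is self-contained; record, row and traits are illustrative names.

```python
import pandas as pd

# Sample record from the question, without the image URL
record = {
    "name": "Monkey",
    "attributes": [
        {"trait_type": "Bones", "value": "Zombie"},
        {"trait_type": "Clothes", "value": "Striped"},
        {"trait_type": "Mouth", "value": "Bubblegum"},
        {"trait_type": "Eyes", "value": "Black Sunglasses"},
        {"trait_type": "Hat", "value": "Sushi"},
        {"trait_type": "Background", "value": "Purple"},
    ],
}

# Map each trait type to its value, then build a one-row DataFrame from it
row = {attr["trait_type"]: attr["value"] for attr in record["attributes"]}
traits = pd.DataFrame([row])
print(traits)
```

Note that if two attributes shared the same trait_type, the later one would overwrite the earlier one in row.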

Extracting str from pandas dataframe using json

I read a CSV file into a dataframe named df.
Each row contains a string like the one below.
'{"id":2140043003,"name":"Olallo Rubio",...}'
I would like to extract "name" and "id" from each row and store them in a new dataframe.
I used the following code to extract them, but it shows an error. Please let me know if you have any suggestions on how to solve this problem. Thanks
JSONDecodeError: Expecting ',' delimiter: line 1 column 32 (char 31)
text={
"id": 2140043003,
"name": "Olallo Rubio",
"is_registered": True,
"chosen_currency": 'Null',
"avatar": {
"thumb": "https://ksr-ugc.imgix.net/assets/019/223/259/16513215a3869caaea2d35d43f3c0c5f_original.jpg?w=40&h=40&fit=crop&v=1510685152&auto=format&q=92&s=653706657ccc49f68a27445ea37ad39a",
"small": "https://ksr-ugc.imgix.net/assets/019/223/259/16513215a3869caaea2d35d43f3c0c5f_original.jpg?w=160&h=160&fit=crop&v=1510685152&auto=format&q=92&s=0bd2f3cec5f12553e679153ba2b5d7fa",
"medium": "https://ksr-ugc.imgix.net/assets/019/223/259/16513215a3869caaea2d35d43f3c0c5f_original.jpg?w=160&h=160&fit=crop&v=1510685152&auto=format&q=92&s=0bd2f3cec5f12553e679153ba2b5d7fa"
},
"urls": {
"web": {
"user": "https://www.kickstarter.com/profile/2140043003"
},
"api": {
"user": "https://api.kickstarter.com/v1/users/2140043003?signature=1531480520.09df9a36f649d71a3a81eb14684ad0d3afc83e03"
}
}
}
def extract(text,*args):
    list1=[]
    for i in args:
        list1.append(text[i])
    return list1
print(extract(text,'name','id'))
# ['Olallo Rubio', 2140043003]
Here's what I came up with using pandas.json_normalize():
import pandas as pd
sample = [{
"id": 2140043003,
"name":"Olallo Rubio",
"is_registered": True,
"chosen_currency": None,
"avatar":{
"thumb":"https://ksr-ugc.imgix.net/assets/019/223/259/16513215a3869caaea2d35d43f3c0c5f_original.jpg?w=40&h=40&fit=crop&v=1510685152&auto=format&q=92&s=653706657ccc49f68a27445ea37ad39a",
"small":"https://ksr-ugc.imgix.net/assets/019/223/259/16513215a3869caaea2d35d43f3c0c5f_original.jpg?w=160&h=160&fit=crop&v=1510685152&auto=format&q=92&s=0bd2f3cec5f12553e679153ba2b5d7fa",
"medium":"https://ksr-ugc.imgix.net/assets/019/223/259/16513215a3869caaea2d35d43f3c0c5f_original.jpg?w=160&h=160&fit=crop&v=1510685152&auto=format&q=92&s=0bd2f3cec5f12553e679153ba2b5d7fa"
},
"urls":{
"web":{
"user":"https://www.kickstarter.com/profile/2140043003"
},
"api":{
"user":"https://api.kickstarter.com/v1/users/2140043003?signature=1531480520.09df9a36f649d71a3a81eb14684ad0d3afc83e03"
}
}
}]
# Create dataframe
df = pd.json_normalize(sample)
# Select columns into new dataframe.
df1 = df.loc[:, ["name", "id"]]
Check df1:
Input:
print(df1)
Output:
           name          id
0  Olallo Rubio  2140043003
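Both snippets above start from an already-parsed dictionary, whereas the question describes a dataframe column holding JSON strings. A rough sketch for that case follows, assuming every string is valid JSON (json.loads raises JSONDecodeError otherwise). The column name raw and the second row are invented for illustration.

```python
import json

import pandas as pd

# Hypothetical dataframe: one JSON string per row in a column named 'raw'
df = pd.DataFrame({
    "raw": [
        '{"id": 2140043003, "name": "Olallo Rubio", "is_registered": true}',
        '{"id": 1, "name": "Example User", "is_registered": false}',
    ]
})

# Parse each string into a dict, then expand the dicts into columns
parsed = df["raw"].apply(json.loads)
result = pd.DataFrame(parsed.tolist(), index=df.index)[["name", "id"]]
print(result)
```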

python generator to pandas dataframe

I have a generator being returned from:
data = public_client.get_product_trades(product_id='BTC-USD', limit=10)
How do I turn the data into a pandas dataframe?
The method docstring reads:
"""{"Returns": [{
"time": "2014-11-07T22:19:28.578544Z",
"trade_id": 74,
"price": "10.00000000",
"size": "0.01000000",
"side": "buy"
}, {
"time": "2014-11-07T01:08:43.642366Z",
"trade_id": 73,
"price": "100.00000000",
"size": "0.01000000",
"side": "sell"
}]}"""
I have tried:
df = [x for x in data]
df = pd.DataFrame.from_records(df)
but it does not work; I get the error:
AttributeError: 'str' object has no attribute 'keys'
When I print the result of "x for x in data", I see the list of dicts, but the end looks strange. Could this be why?
print(list(data))
[{'time': '2020-12-30T13:04:14.385Z', 'trade_id': 116918468, 'price': '27853.82000000', 'size': '0.00171515', 'side': 'sell'},{'time': '2020-12-30T12:31:24.185Z', 'trade_id': 116915675, 'price': '27683.70000000', 'size': '0.01683711', 'side': 'sell'}, 'message']
It looks to be a list of dicts but the end value is a single string 'message'.
Based on the updated question:
df = pd.DataFrame(list(data)[:-1])
Or, more cleanly:
df = pd.DataFrame([x for x in data if isinstance(x, dict)])
print(df)
                       time   trade_id           price        size  side
0  2020-12-30T13:04:14.385Z  116918468  27853.82000000  0.00171515  sell
1  2020-12-30T12:31:24.185Z  116915675  27683.70000000  0.01683711  sell
Oh, and BTW, you'll still need to change those strings into something usable...
So e.g.:
df['time'] = pd.to_datetime(df['time'])
for k in ['price', 'size']:
    df[k] = pd.to_numeric(df[k])
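Putting the pieces above together, here is a self-contained sketch. Since the real client is not available here, fake_trades is a hypothetical stand-in for public_client.get_product_trades(...) that yields the two dictionaries from the question followed by the stray 'message' string.

```python
import pandas as pd


def fake_trades():
    """Hypothetical stand-in for public_client.get_product_trades(...)."""
    yield {"time": "2020-12-30T13:04:14.385Z", "trade_id": 116918468,
           "price": "27853.82000000", "size": "0.00171515", "side": "sell"}
    yield {"time": "2020-12-30T12:31:24.185Z", "trade_id": 116915675,
           "price": "27683.70000000", "size": "0.01683711", "side": "sell"}
    yield "message"  # the stray string that breaks DataFrame.from_records


# Keep only the dict items; this consumes the generator exactly once
df = pd.DataFrame([x for x in fake_trades() if isinstance(x, dict)])

# Convert the string columns into usable dtypes
df["time"] = pd.to_datetime(df["time"])
for col in ["price", "size"]:
    df[col] = pd.to_numeric(df[col])

print(df.dtypes)
```

Filtering with isinstance does not depend on the stray string being the last item, which makes it a little more robust than slicing off the final element.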
You could access the values in the dictionary and build a dataframe from it (although not particularly clean):
dict_of_data = [{
"time": "2014-11-07T22:19:28.578544Z",
"trade_id": 74,
"price": "10.00000000",
"size": "0.01000000",
"side": "buy"
}, {
"time": "2014-11-07T01:08:43.642366Z",
"trade_id": 73,
"price": "100.00000000",
"size": "0.01000000",
"side": "sell"
}]
import pandas as pd
list_of_data = [list(dict_of_data[0].values()),list(dict_of_data[1].values())]
pd.DataFrame(list_of_data, columns=list(dict_of_data[0].keys())).set_index('time')
It's straightforward; just use the pd.DataFrame constructor:
#list_of_dicts = [{
# "time": "2014-11-07T22:19:28.578544Z",
# "trade_id": 74,
# "price": "10.00000000",
# "size": "0.01000000",
# "side": "buy"
# }, {
# "time": "2014-11-07T01:08:43.642366Z",
# "trade_id": 73,
# "price": "100.00000000",
# "size": "0.01000000",
# "side": "sell"
#}]
# or, if you take it from 'data' (a generator cannot be sliced, so convert it to a list first)
list_of_dicts = list(data)[:-1]
df = pd.DataFrame(list_of_dicts)
df
Out[4]:
                          time  trade_id         price        size  side
0  2014-11-07T22:19:28.578544Z        74   10.00000000  0.01000000   buy
1  2014-11-07T01:08:43.642366Z        73  100.00000000  0.01000000  sell
UPDATE
According to the question update, it seems you have JSON data that is still a string:
import json
data = json.loads(data)
data = data['Returns']
pd.DataFrame(data)
                          time  trade_id         price        size  side
0  2014-11-07T22:19:28.578544Z        74   10.00000000  0.01000000   buy
1  2014-11-07T01:08:43.642366Z        73  100.00000000  0.01000000  sell

Formatting JSON for Pandas Dataframe

I'm trying to wrangle some data to make a recommender system for an app. Of course, to do this I need a record of which users like which posts. I currently have that data in a JSON file that is formatted like this (numbers being post id, and letters being user ids):
{
"-1234": {
"abc": "abc",
"def": "def",
"ghi": "ghi"
},
"-5678": {
"jkl": "jkl",
"mno": "mno"
}
}
I'm trying to figure out how to get this into a pandas dataframe that would look like this:
example format
I've tried using a few online JSON to CSV converters out of laziness which unsurprisingly didn't bring it into a useable format for me. I've tried using "print(json_normalize(data))", as well which also did not work, and put each instance of a like into separate columns.
Any advice?
This is a solution optimized for the peculiarities in your dataset.
import pandas as pd
data = {
"-1234": {
"abc": "abc",
"def": "def",
"ghi": "ghi"
},
"-5678": {
"jkl": "jkl",
"mno": "mno"
}}
formatted = [{'PostID': d, 'User Like': list(data[d].keys())} for d in data]
df = pd.DataFrame(formatted)
Output:
From my experience for such simple formats, writing a quick and dirty loop is usually the fastest method rather than finding some ready solution and customizing it. An example for the data you gave here:
import json
my_json=""" {
"-1234": {
"abc": "abc",
"def": "def",
"ghi": "ghi"
},
"-5678": {
"jkl": "jkl",
"mno": "mno"
}
}"""
parsed_json = json.loads(my_json)
print(parsed_json)
# result:
# {'-1234': {'abc': 'abc', 'def': 'def', 'ghi': 'ghi'},
# '-5678': {'jkl': 'jkl', 'mno': 'mno'}}
for key in parsed_json.keys():
    line = ''
    line += key
    line += ' | '
    for value in parsed_json[key].values():
        line += value + ', '
    line = line[:-2]  # stripping the ', ' from the end of the line
    print(line)
# result:
# -1234 | abc, def, ghi
# -5678 | jkl, mno
Setup
Thanks Zaroth
import json
import pandas as pd
my_json=""" {
"-1234": {
"abc": "abc",
"def": "def",
"ghi": "ghi"
},
"-5678": {
"jkl": "jkl",
"mno": "mno"
}
}"""
parsed_json = json.loads(my_json)
Comprehension
pd.DataFrame(
    [(k, [*v]) for k, v in parsed_json.items()],
    columns=['PostID', 'User Like']
)
  PostID        User Like
0  -1234  [abc, def, ghi]
1  -5678       [jkl, mno]
OR
pd.DataFrame({
    'PostID': [*parsed_json],
    'User Like': [[*v] for v in parsed_json.values()]
})
import pandas as pd
data = {"-1234": {"abc": "abc","def": "def","ghi": "ghi"},"-5678": {"jkl": "jkl","mno": "mno"}}
key = []
val = []
# collect each post id and the list of users who liked it
for k,v in data.items():
    key.append(k)
    val.append(list(v.values()))
pd.DataFrame(zip(key,val),columns=['PostID','User Like'])
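The answers above all produce one row per post with a list of users. Since the question mentions recording which users like which posts, a one-row-per-like layout may also be useful. The sketch below gets there with DataFrame.explode (available since pandas 0.25); the names likes, per_post and per_like are illustrative, and whether this layout suits a given recommender library is an assumption.

```python
import pandas as pd

# Same sample data as in the question
likes = {
    "-1234": {"abc": "abc", "def": "def", "ghi": "ghi"},
    "-5678": {"jkl": "jkl", "mno": "mno"},
}

# One row per post, with the liking users collected in a list
per_post = pd.DataFrame(
    [(post_id, list(users)) for post_id, users in likes.items()],
    columns=["PostID", "User Like"],
)

# One row per (post, user) pair
per_like = per_post.explode("User Like").reset_index(drop=True)
print(per_like)
```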
