Normalize json column and join with rest of dataframe - python

This is my first question here on Stack Overflow, so please don't roast me.
I tried to find similar problems on the internet, and there are actually several, but none of their solutions worked for me.
I have created this dataframe:
import pandas as pd
from ast import literal_eval
d = {'order_id': [1], 'email': ["hi#test.com"], 'line_items': ["[{'sku':'testproduct1', 'quantity':'2'},{'sku':'testproduct2','quantity':'2'}]"]}
orders = pd.DataFrame(data=d)
It looks like this:
order_id email line_items
1 hi#test.com [{'sku':'testproduct1', 'quantity':'2'},{'sku':'testproduct2','quantity':'2'}]
I want the dataframe to look like this:
order_id email line_items.sku line_items.quantity
1 hi#test.com testproduct1 2
1 hi#test.com testproduct2 2
I used the following code to change the type of line_items from string to dict:
orders.line_items = orders.line_items.apply(literal_eval)
Normally I would now use json_normalize to flatten the line_items column, but I also want to keep the order_id and email columns and don't know how to do that. I also want to avoid any loops.
Is there anyone who can help me with this issue?
Kind regards
joant95

If your dictionary really is that strange, then you could try:
d['line_items'] = eval(d['line_items'][0])
df = pd.json_normalize(d, record_path=['line_items'], meta=['order_id', 'email'])
To create d out of orders you could try:
d = orders.to_dict(orient='list')
Or you could try:
orders.line_items = orders.line_items.map(eval)
d = orders.to_dict(orient='records')
df = pd.json_normalize(d, record_path=['line_items'], meta=['order_id', 'email'])
But: I still don't have a clear picture of the situation :)
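Putting the pieces together, a complete runnable sketch of the second variant might look like this (using ast.literal_eval instead of eval, since the strings are plain Python literals and literal_eval won't execute arbitrary code):

```python
import pandas as pd
from ast import literal_eval

d = {'order_id': [1],
     'email': ["hi#test.com"],
     'line_items': ["[{'sku':'testproduct1', 'quantity':'2'},"
                    "{'sku':'testproduct2','quantity':'2'}]"]}
orders = pd.DataFrame(data=d)

# Parse the stringified list of dicts, then flatten one row per line item,
# carrying order_id and email along as meta columns
orders['line_items'] = orders['line_items'].map(literal_eval)
records = orders.to_dict(orient='records')
df = pd.json_normalize(records, record_path=['line_items'],
                       meta=['order_id', 'email'])
print(df)
```

This yields one row per line item with sku, quantity, order_id and email columns, and no explicit loop.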

Related

How to get metadata when reading nested json with pandas

I'm trying to get the metadata out from a json using pandas json_normalize, but it does not work as expected.
I have a json fine with the following structure
data=[
{'a':'aa',
'b':{'b1':'bb1','b2':'bb2'},
'c':[{
'ca':[{'ca1':'caa1'
}]
}]
}]
I'd like to get the following
ca1
a
b.b1
caa1
aa
bb1
I would expect this to work
pd.json_normalize(data, record_path=['c','ca'], meta = ['a',['b','b1']])
but it doesn't find the key b1. Strangely enough if my record_path is 'c' alone it does find the key.
I feel I'm missing something here, but I can't figure out what.
I appreciate any help!
Going down the first level, you grab the meta as a list of the columns you want to keep, and record_path takes a list mapping the levels you want to descend. Finally, column b still holds a dict, so you can apply pd.Series to it, concat the result back into df, and drop the unpacked dict column.
df = pd.json_normalize(
    data=data,
    meta=['a', 'b'],
    record_path=['c', 'ca']
)
df = pd.concat([df.drop(['b'], axis=1), df['b'].apply(pd.Series)], axis=1)
print(df)
Output:
ca1 a b1 b2
0 caa1 aa bb1 bb2
This is the workaround I eventually used:
data=[
{'a':'aa',
'b':{'b1':'bb1','b2':'bb2'},
'c':[{
'ca':[{'ca1':'caa1'
}]
}]
}]
df = pd.json_normalize(data, record_path=['c', 'ca'], meta=['a', ['b']])
df = pd.concat([df, pd.json_normalize(df['b'])], axis=1)
df.drop(columns='b', inplace=True)
I still think there should be a better way, but it works
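For reference, that workaround as a self-contained sketch: keep b whole as a meta column, then flatten it in a second pass (df['b'].tolist() is used so json_normalize receives a plain list of dicts):

```python
import pandas as pd

data = [
    {'a': 'aa',
     'b': {'b1': 'bb1', 'b2': 'bb2'},
     'c': [{'ca': [{'ca1': 'caa1'}]}]}
]

# First pass: descend c -> ca, keeping a and the whole b dict as meta
df = pd.json_normalize(data, record_path=['c', 'ca'], meta=['a', 'b'])

# Second pass: flatten the b dicts and replace the packed column
df = pd.concat([df.drop(columns='b'),
                pd.json_normalize(df['b'].tolist())], axis=1)
print(df)
```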

How to iterate over a CSV file with Pywikibot

I wanted to try uploading a series of items to test.wikidata, creating the item and then adding a statement of inception P571. The csv file sometimes has a date value, sometimes not. When no date value is given, I want to write out a placeholder 'some value'.
Imagine a dataframe like this:
df = pd.DataFrame({'Object': [1, 2, 3], 'Date': [250, None, 300]})
However, I am not sure using Pywikibot how to iterate over a csv file with pywikibot to create an item for each row and add a statement. Here is the code I wrote:
import pywikibot
import pandas as pd
site = pywikibot.Site("test", "wikidata")
repo = site.data_repository()
df = pd.read_csv('experiment.csv')
item = pywikibot.ItemPage(repo)
for item in df:
    date = df['date']
    prop_date = pywikibot.Claim(repo, u'P571')
    if date == '':
        prop_date.setSnakType('somevalue')
    else:
        target = pywikibot.WbTime(year=date)
        prop_date.setTarget(target)
    item.addClaim(prop_date)
When I run this through PAWS, I get the message: KeyError: 'date'
But I think the real issue here is that I am not sure how to get Pywikibot to iterate over each row of the dataframe and create a new claim for each new date value. I would value any feedback or suggestions for good examples and documentation. Many thanks!
Looking back on this, the solution was to use .itertuples() (or .iterrows() / .loc[]) to access the values in each row. Note that pandas reads empty CSV cells as NaN, not '', so the missing-value check should use pd.isna(). So:
for row in df.itertuples():
    prop_date = pywikibot.Claim(repo, u'P571')
    if pd.isna(row.Date):
        prop_date.setSnakType('somevalue')
    else:
        target = pywikibot.WbTime(year=int(row.Date))
        prop_date.setTarget(target)
    item.addClaim(prop_date)
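The pandas side of this can be tested without Pywikibot. A minimal sketch, assuming the column names Object and Date from the example above and replacing the claim objects with plain tuples:

```python
import pandas as pd

df = pd.DataFrame({'Object': [1, 2, 3], 'Date': [250, None, 300]})

# itertuples yields one named tuple per row; read_csv stores empty cells
# as NaN rather than '', so pd.isna() is the right missing-value check
claims = []
for row in df.itertuples():
    if pd.isna(row.Date):
        claims.append((row.Object, 'somevalue'))
    else:
        claims.append((row.Object, int(row.Date)))
print(claims)  # → [(1, 250), (2, 'somevalue'), (3, 300)]
```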

Drop rows in pandas if they contains "???"

I'm trying to drop rows in pandas that contain "???". It works for every other value except "???", and I don't know what the problem is.
This is my code (I have tried both types):
df = df[~df["text"].str.contains("?????", na=False)]
df = df[~df["text"].str.contains("?????")]
error that I'm getting:
re.error: nothing to repeat at position 0
It works for every other value except for "????".
I have googled it and looked all over this website, but I couldn't find any solutions.
The parameter expects a regular expression, hence the re.error.
You can either escape each ? inside the expression (best with a raw string) like this:
df = df[~df["text"].str.contains(r"\?\?\?\?\?")]
Or set regex=False as Vorsprung suggested:
df = df[~df["text"].str.contains("?????", regex=False)]
Let's convert this into running code:
import numpy as np
import pandas as pd
data = {'A': ['abc', 'cxx???xx', '???',], 'B': ['add', 'ddb', 'c', ]}
df = pd.DataFrame.from_dict(data)
df
output:
A B
0 abc add
1 cxx???xx ddb
2 ??? c
with this:
df[df['A'].str.contains('???',regex=False)]
output:
A B
1 cxx???xx ddb
2 ??? c
you need to tell contains(), that your search string is not a regex.
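A third option, useful when the search string is not known in advance, is re.escape, which turns any string into a pattern that matches it literally:

```python
import re
import pandas as pd

df = pd.DataFrame({'A': ['abc', 'cxx???xx', '???'],
                   'B': ['add', 'ddb', 'c']})

# re.escape backslash-escapes every regex metacharacter, so '?'
# no longer acts as a quantifier
mask = df['A'].str.contains(re.escape('???'))
clean = df[~mask]
print(clean)
```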

Normalizing a Python list to get JSON data into a Tables

I am trying to use the following python code to draw info from an API and create a players table ultimately. I have gotten the data mostly normalized to that level, but am struggling to work with the [rosters] list.
import requests
import json
import pandas as pd
from pandas.io.json import json_normalize
r1 = requests.get('https://statsapi.web.nhl.com/api/v1/teams/16?hydrate=franchise(roster(season=20182019,person(name,stats(splits=[yearByYear]))))')
data = r1.json()
df1 = json_normalize(data, 'teams',['teams.franchise'],errors='ignore')['franchise']
df2 = json_normalize(df1)['roster.roster']
df3 = pd.DataFrame(data=df2.index, columns=['Id'])
df4 = pd.DataFrame(data=df2.values, columns=['Players'])
df4
returns:
0 [{'person': {'id': 8470645, 'fullName': 'Corey...
Any ideas on what I could do to extract each person from this API into a table? IE:
ID | fullName |
.. .....
.. .....
Thanks.
It looks like the main problem is that this is a deeply nested dictionary. Some inspection showed that this code will get you to each player:
all_players = []
for team in data['teams']:
    for player in team['franchise']['roster']['roster']:
        player = player['person']
        all_players.append(player)  # collect each player's dict
        print(player.keys())
        print(player)
        print()
However, some of the keys in player correspond to more dictionaries. So you'll either have to decide which player fields are basic values like strings/ints/etc and keep those, or add more code to parse out the additional dictionaries.
But this code will get you to each player, then you can normalize how you want from there.
Let me know if you need help!
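As a runnable illustration, with a small hand-written stand-in for the API payload (the second player's id and name below are made up for the example; the real response is much larger), the nested loop can feed json_normalize directly:

```python
import pandas as pd

# Trimmed stand-in for the NHL API response shape
data = {'teams': [{'franchise': {'roster': {'roster': [
    {'person': {'id': 8470645, 'fullName': 'Corey Crawford'}},
    {'person': {'id': 8474141, 'fullName': 'Patrick Kane'}},
]}}}]}

# Collect every person dict, then let pandas build the table
players = [p['person']
           for team in data['teams']
           for p in team['franchise']['roster']['roster']]
df = pd.json_normalize(players)[['id', 'fullName']]
print(df)
```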

DataFrame constructor not properly called! error

I am new to Python and I am facing a problem creating a DataFrame in key/value format, i.e.:
data = [{'key':'\[GlobalProgramSizeInThousands\]','value':'1000'},]
Here is my code:
columnsss = ['key', 'value']
query = "select * from bparst_tags where tag_type = 1"
result = database.cursor(db.cursors.DictCursor)
result.execute(query)
result_set = result.fetchall()
data = "["
for row in result_set:
    data += "{'value': %s , 'key': %s }," % (row["tag_expression"], row["tag_name"])
data += "]"
df = DataFrame(data, columns=columnsss)
But when I pass the data in DataFrame it shows me
pandas.core.common.PandasError: DataFrame constructor not properly called!
while if I print the data and assign the same value to data variable then it works.
You are providing a string representation of a list of dicts to the DataFrame constructor, not the list itself. That is why you get this error.
So if you want to use your code, you could do:
df = DataFrame(eval(data))
But better would be to not create the string in the first place, but directly putting it in a dict. Something roughly like:
data = []
for row in result_set:
    data.append({'value': row["tag_expression"], 'key': row["tag_name"]})
But probably even this is not needed, as depending on what is exactly in your result_set you could probably:
provide this directly to a DataFrame: DataFrame(result_set)
or use the pandas read_sql_query function to do this for you (see docs on this)
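A sketch of that dict-based version (the result_set row below is a stand-in for what a DictCursor's fetchall() would return):

```python
import pandas as pd

# Stand-in for cursor.fetchall() with a DictCursor: a list of dicts
result_set = [
    {'tag_name': 'GlobalProgramSizeInThousands', 'tag_expression': '1000'},
]

# Build real dicts instead of a string, then hand them to the constructor
data = [{'key': row['tag_name'], 'value': row['tag_expression']}
        for row in result_set]
df = pd.DataFrame(data, columns=['key', 'value'])
print(df)
```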
Just ran into the same error, but the above answer could not help me.
My code worked fine on my computer which was like this:
test_dict = {'x': '123', 'y': '456', 'z': '456'}
df=pd.DataFrame(test_dict.items(),columns=['col1','col2'])
However, it did not work on another platform; it gave me the same error as in the original question. I tried the code below, simply wrapping list() around the dictionary items, and it worked smoothly after that:
df=pd.DataFrame(list(test_dict.items()),columns=['col1','col2'])
Hopefully, this answer can help whoever ran into a similar situation like me.
import json
import pandas as pd

# Opening the JSON file; json.load returns the JSON object as a dictionary
with open('data.json') as f:
    data1 = json.load(f)

# converting it into a dataframe: read_json expects a path or JSON string,
# not a dict, so use from_dict instead
df = pd.DataFrame.from_dict(data1, orient='index')
