Python JSON dict to dataframe no rows - python

So after hours of unsuccessful googling I finally decided to post this here.
I am trying to convert some data obtained from an API call to a pandas DataFrame.
This is my code:
response = requests.get(url)
data_as_list = response.json()['data']
for dct in data_as_list:
    json_df = pd.DataFrame.from_records(dct)
Unfortunately, the returned dataframe only contains the column names, but no row data at all, even though the dictionary has some. I already tried from_dict and pd.read_json() (after dumping the data into a JSON string), but all of these gave the same result.
The data is a nested dictionary in JSON format and looks like this

You can build a DataFrame directly from a Python list that contains dictionaries (or nested lists), like this:
json_df = pd.DataFrame(data_as_list)

Do this:
pd.DataFrame(data_as_list)
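
A minimal sketch of the suggestion above, using made-up records in place of the real API response (the keys id, name and value are assumptions, not the asker's data). Note that the loop in the question rebuilds json_df from a single dict on every pass, whereas here the whole list is passed in once:

```python
import pandas as pd

# Stand-in for response.json()['data']: a list with one dict per row.
# The keys and values below are invented for illustration.
data_as_list = [
    {"id": 1, "name": "alpha", "value": 10.5},
    {"id": 2, "name": "beta", "value": 7.0},
]

# Pass the whole list at once; each dict becomes one row.
json_df = pd.DataFrame(data_as_list)
print(json_df)
```

If the dicts themselves contain nested dicts, those values stay as dict objects in their cells and need flattening as a separate step.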

Related

How do I un-nest multiple sets of data and set them as columns

I am trying to load data from an API into a dataframe, but some of the data is nested. For example, I would like each stat, such as 'passing_att', to be its own column, but it is nested under 'passing', and 'passing', 'rushing' and 'receiving' are all nested under 'stats'. I can load just 'passing' into a dataframe, which shows 'passing_att' and the rest of the 'passing' data, but then I don't have the names and all the other data I would like.
Code
The term for this is 'flattening' a JSON object. There are many ways to do this. You could use the flatten_json pip package or write your own function. For your specific use case, though, you could use the json_normalize() function from the pandas library, because you are trying to flatten a list of JSON objects that can easily be converted into a pandas dataframe.
Once you have the nested JSON object as shown below (performed in your code):
json_list = json.loads(response.text)
Use the following code to flatten it:
df = pandas.json_normalize(json_list)
You will get all the nested columns flattened into a single-level dataframe, with the nested key names joined by a period. For example, the passing_att attribute can be accessed via the column name 'stats.passing.passing_att'. You can change the separator by passing the sep parameter to json_normalize. For example, to use a hyphen as the separator:
df = pandas.json_normalize(json_list, sep='-')
If you want to select only a few columns, you can pass a list of column names to the dataframe's [] operator. For example, if you want to select only the player_name, position, passing_att and passing_cmp columns, you can do so by using the following code:
df = df[['player_name', 'position', 'stats.passing.passing_att', 'stats.passing.passing_cmp']]
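
Here is a small self-contained sketch of the steps above. The two player records are invented; only the key names (player_name, position, stats, passing, passing_att, passing_cmp) follow the question, and the real API response may be shaped differently:

```python
import pandas as pd

# Invented sample shaped like the nesting described in the question.
json_list = [
    {
        "player_name": "Player A",
        "position": "QB",
        "stats": {
            "passing": {"passing_att": 30, "passing_cmp": 21},
            "rushing": {"rushing_att": 4},
        },
    },
    {
        "player_name": "Player B",
        "position": "RB",
        "stats": {"rushing": {"rushing_att": 18}},
    },
]

# Nested keys are joined with "." by default.
df = pd.json_normalize(json_list)

# The same data with a hyphen as the separator.
df_hyphen = pd.json_normalize(json_list, sep="-")

# Keep only a few of the flattened columns.
subset = df[["player_name", "position",
             "stats.passing.passing_att", "stats.passing.passing_cmp"]]
print(subset)
```

Records that lack a nested key (Player B has no passing stats here) get NaN in the corresponding columns.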

Combining Code Steps into a User-Defined Function

I'm writing a function that retrieves data from an API based on an ID#, then reads the json response into a pandas dataframe, munges the dataframe, and finally compiles every dataframe together. The goal is to pass a pandas series of ID#'s into the function, to retrieve the relevant data for a list of thousands of IDs.
When I execute every step manually, the steps work. I get a nice one-row pandas dataframe with all of the columns and the values that I want. When I combine all of the steps within a function containing a for-loop, it stops working.
Here are the steps:
req = Request('https://gs-api.greatschools.org/schools/3601714/metrics') ##request
req.add_header('X-API-Key', 'XXXXXXXXXXXXXXXXXXXXXXX') ##authenticate
content = urlopen(req).read() ##retrieve
data = pd.read_json(content) ##convert json to pandas dataframe
data.reset_index(inplace=True) ##reset index
data['id'] = 3601714 ##add id column
data.drop(columns=['head-official-name','head-official-email'],inplace=True) ##drop columns
data.pivot(['enrollment',
'percent-free-and-reduced-price-lunch',
'percent-students-with-limited-english-proficiency',
'student-teacher-ratio',
'percentage-male',
'percentage-female',
'percentage-of-teachers-with-3-or-more-years-experience',
'percentage-of-full-time-teachers-who-are-certified',
'average-salary','id'], 'index', 'ethnicity') ##pivot the dataframe
I've combined all of these steps into a function:
def demographics(universal_id):
    demo_mstr = []
    for item in universal_id:
        id = item
        req = Request(f'https://gs-api.greatschools.org/schools/{id}/metrics')
        req.add_header('X-API-Key', 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX')
        content = urlopen(req).read()
        data = pd.read_json(content)
        data.reset_index(inplace=True)
        data['id'] = id
        data.drop(columns=['head-official-name','head-official-email'],inplace=True)
        data = data.pivot(['enrollment',
                           'percent-free-and-reduced-price-lunch',
                           'percent-students-with-limited-english-proficiency',
                           'student-teacher-ratio',
                           'percentage-male',
                           'percentage-female',
                           'percentage-of-teachers-with-3-or-more-years-experience',
                           'percentage-of-full-time-teachers-who-are-certified',
                           'average-salary','id'], 'index', 'ethnicity')
        demo_mstr.append(data)
    return demo_mstr
If I run the function on a test list of ID#s, I get the following error: HTTPError: HTTP Error 422:
I've rewritten the function a number of times, and I've managed to get different error types, but not a working function.
What am I missing?
Update: I am answering my own question, in the hopes that it helps someone.
So, I figured out that the 422 error occurred because not every ID# has data associated with it in the API. Some of the API calls therefore returned no data, which caused the error.
As for pivot, I realized that the need for it was caused by pandas' poor handling of JSON data. In my opinion, pd.read_json is only good for exploratory analysis, and even then it is of limited use.
What you should do instead is use r.json() to unpack the raw JSON into its constituent dictionaries, and write a parse_json function that iterates over the dictionaries and converts them into the column names you want.
Converting to a pandas dataframe first, then pivoting, then trying to append dataframes together is a recipe for disaster. Stay in JSON, do what you need to do with the JSON, append all the parsed records to a master list, and only convert to a pandas dataframe at the very end!
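
The approach described above could be sketched roughly as follows. Everything about the payload is an assumption: the real API response is not shown in the question, so the keys in fake_fetch are invented, and fetch, fake_fetch and the skipped list are hypothetical names introduced only for this illustration. The API call is passed in as a function so the sketch runs without network access:

```python
from urllib.error import HTTPError

import pandas as pd


def parse_json(school_id, payload):
    """Flatten one payload (hypothetical shape) into a single flat dict."""
    row = {"id": school_id}
    for key, value in payload.items():
        if isinstance(value, dict):
            # One level of nesting becomes prefixed columns, e.g. ethnicity-Asian.
            for sub_key, sub_value in value.items():
                row[f"{key}-{sub_key}"] = sub_value
        else:
            row[key] = value
    return row


def demographics(universal_ids, fetch):
    """Collect one flat dict per ID, then build the dataframe once at the end."""
    rows = []
    skipped = []
    for school_id in universal_ids:
        try:
            payload = fetch(school_id)
        except HTTPError as err:
            # IDs with no data in the API (e.g. HTTP 422) are recorded and skipped.
            skipped.append((school_id, err.code))
            continue
        rows.append(parse_json(school_id, payload))
    return pd.DataFrame(rows), skipped


def fake_fetch(school_id):
    """Stand-in for the real API call; the payload contents are invented."""
    if school_id == 2:
        raise HTTPError("https://example.invalid", 422,
                        "Unprocessable Entity", None, None)
    return {"enrollment": 100 * school_id,
            "ethnicity": {"Asian": 10.0, "Hispanic": 20.0}}


df, skipped = demographics([1, 2, 3], fake_fetch)
print(df)
print(skipped)
```

With urlopen, as in the question, a 422 response surfaces as urllib.error.HTTPError, which is why that exception is caught here. A real fetch function would make the request and return the decoded JSON as a dict.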

Extracting data from a dict like column in a dataframe

I have a column named info in a dataframe, holding data in a dict-like format like the one below.
I would like to get another dataframe with this info, and I tried:
feature = [d.get('Feature') for d in df['info']]
but it returns none.
How can I do it? I am really having a bad time trying to get this done.
As the dict is nested, you can try pd.json_normalize() which normalizes semi-structured JSON data into a flat table:
df_new = pd.json_normalize(df['info'])
Because some of the inner dicts sit inside a list, you may need further handling to dig out the deeper contents. Still, this should serve as a good starting point.
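
As a rough illustration, assuming the info column holds already-parsed dicts (the sample values, the item column and the tags key are made up, since the original data is not shown):

```python
import pandas as pd

# Invented data: each cell of "info" is a nested dict.
df = pd.DataFrame({
    "item": ["a", "b"],
    "info": [
        {"Feature": {"name": "width", "value": 3}, "tags": [{"k": "x"}]},
        {"Feature": {"name": "height", "value": 5}, "tags": []},
    ],
})

# Flatten the dicts; nested keys become dotted column names.
df_new = pd.json_normalize(df["info"].tolist())

# Attach the flattened columns back onto the remaining columns.
result = df.drop(columns="info").join(df_new)
print(result)
```

Values that are lists (tags here) are left as list objects in their cells, which is the part that needs the further handling mentioned above.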

How to populate DataFrame column from JSON string element in another column

I have a DataFrame with a "details" column that I believe is a dictionary. The initial data is a JSON string parsed with json.loads, then converted from a dictionary to DataFrame. I would like to populate a new "affectedDealId" column with the value in data['details']['actions']['affectedDealId'].
I'm hoping I can do this the "DataFrame" way without using a loop with something like:
data['affectedDealId'] = data['details'].get('actions').get('affectedDealId')
To simplify I've tried:
data['actions'] = data['details'].get('actions')
But that ends up as "None".
Also data['details'] seems to be a series when I think it's a dictionary before converting it to a DataFrame.
Alternatively, I do later loop through the DataFrame. How would I access that 'affectedDealId' element?
Below is a screenshot of the DataFrame from the PyCharm debugger.
I'm making some assumptions about the details JSON, but does this help? You will have to adjust the key/index applied to json.loads(x) to reach the right location.
df['affectedDealId'] = df['details'].apply(lambda x: json.loads(x)['affectedDealId'])
I think something like this would work well.
This creates a dataframe from your JSON column by applying pd.Series to it:
data_frame_new = df['details'].apply(pd.Series)
Then concatenate data_frame_new with your existing dataframe and reassign the result:
df = pd.concat([df,data_frame_new],axis = 1)
print(df)
This approach worked for me on a recent project.
Each top-level key of details becomes a column of its own with the data populated, so affectedDealId will be among them if it sits at the top level of details.
It may be of help to you.
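
For the nested case in the question, here is a hedged sketch. It assumes each details cell is already a dict and that actions is itself a dict; if actions is actually a list in the real data, the helper needs adjusting. The sample rows and the helper name affected_deal_id are invented. It also shows why the attempt in the question gives None: Series.get looks up a label in the Series index, not a key inside each cell.

```python
import pandas as pd

# Invented rows; "details" holds parsed dicts, not JSON strings.
data = pd.DataFrame({
    "dealId": ["D1", "D2"],
    "details": [
        {"actions": {"affectedDealId": "A1"}},
        {"actions": {}},
    ],
})

# Series.get searches the index (0, 1, ...) for the label "actions",
# finds nothing, and returns its default of None.
print(data["details"].get("actions"))


def affected_deal_id(details):
    """Return details['actions']['affectedDealId'], or None if missing."""
    actions = details.get("actions") or {}
    return actions.get("affectedDealId")


# Apply the lookup to every cell to build the new column.
data["affectedDealId"] = data["details"].apply(affected_deal_id)
print(data[["dealId", "affectedDealId"]])
```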

Is there a way to convert list of string formatted dictionary to a dataframe in Python?

I am practicing how to use beautifulsoup and currently in a pickle as I can't convert the results to a dataframe. Hope to get your help.
In this example, the page I want to scrape can be obtained using the following:
from bs4 import BeautifulSoup
import requests
import pandas as pd
page = requests.get("https://store.moncler.com/en-ca/women/autumn-winter/view-all-outerwear?tp=72010&ds_rl=1243188&gclid=EAIaIQobChMIpfDj9bjP5wIVlJOzCh0-9ghJEAAYASAAEgLuSfD_BwE&gclsrc=aw.ds", verify = False)
soup = BeautifulSoup(page.content, 'html.parser')
I have managed to isolate to the product section using the following code
test_class = []
for section_tag in soup.find_all('section', class_='search__products__shelf search__products__shelf--moncler'):
    for test in section_tag.find_all('article'):
        test_class.append(test.get('data-ytos-track-product-data'))
The result of this is a list of string-formatted dictionaries, which looks like the following:
['{"product_position":0,"product_title":"TREPORT","product_brand":"MONCLER","product_category":"3074457345616676837/3074457345616676843","product_micro_category":"Outerwear","product_micro_category_id":"3074457345616676843","product_macro_category":"OUTERWEAR","product_macro_category_id":"3074457345616676837","product_color_id":"Dark
blue","product_color":"Dark
blue","product_price":0.0,"product_discountedPrice":2530.0,"product_price_tf":"0","product_discountedPrice_tf":"2126.05","product_id":"1890828705323513","product_variant_id":"1890828705323514","list":"searchresult","product_quantity":1,"product_coupon":"","product_cod8":null,"product_cod10":null,"product_legacy_macro_id":"1012","product_legacy_micro_id":"1446","product_is_in_stock":true,"is_rsi_product":false,"rsi_product_tracking_url":""}',
'{"product_position":1,"product_title":"RIMAC","product_brand":"MONCLER","product_category":"3074457345616676837/3074457345616676854","product_micro_category":"Bomber
Jacket","product_micro_category_id":"3074457345616676854","product_macro_category":"OUTERWEAR","product_macro_category_id":"3074457345616676837","product_color_id":"Dark
blue","product_color":"Dark
blue","product_price":0.0,"product_discountedPrice":2340.0,"product_price_tf":"0","product_discountedPrice_tf":"1966.39","product_id":"5549023491788128","product_variant_id":"5549023491788129","list":"searchresult","product_quantity":1,"product_coupon":"","product_cod8":null,"product_cod10":null,"product_legacy_macro_id":"1012","product_legacy_micro_id":"4715","product_is_in_stock":true,"is_rsi_product":false,"rsi_product_tracking_url":""}',
My question is: how do I convert a list of string-formatted dictionaries like this into a pandas dataframe?
I have tried to use the code below to start with
import ast
ast.literal_eval(test_class[1])
but to no avail (it gives me below error code).
ValueError: malformed node or string: <_ast.Name object at 0x000001985A976748>
The end result should store each key of the dictionary into columns in a Dataframe (ie. 'product_position','product_title','product_brand',etc)
Any help / guidance would be much appreciated.
Thanks.
It looks like the question is really about how to parse a string, not how to do something with pandas.
The list you have seems to contain plain, valid JSON strings. You can convert them to Python dicts using json.loads() from the standard library. Of course, if some strings are malformed, that's another story: you'll have to look up how to parse malformed JSON.
After getting a list of Python dicts, turning them into a DataFrame is trivial.
You can use json.loads and then instantiate pandas.DataFrame with the resulting list of dictionaries (data here is your list of strings, test_class in the question):
import json
d = [json.loads(e) for e in data]
df = pd.DataFrame(d)
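
A self-contained sketch of this, using shortened versions of the strings from the question (only four of the keys are kept, and the trailing None entry is an added assumption to cover articles that lack the data attribute). ast.literal_eval fails on these strings because null, true and false are JSON literals, not Python ones:

```python
import json

import pandas as pd

# Shortened stand-ins for the scraped attribute values.
test_class = [
    '{"product_position":0,"product_title":"TREPORT","product_cod8":null,"product_is_in_stock":true}',
    '{"product_position":1,"product_title":"RIMAC","product_cod8":null,"product_is_in_stock":true}',
    None,  # tag.get(...) returns None when the attribute is absent
]

# json.loads maps null/true/false to None/True/False; skip empty entries.
records = [json.loads(s) for s in test_class if s]

# Each dict becomes a row and each key becomes a column.
df = pd.DataFrame(records)
print(df)
```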
