Extracting data from a dict-like column in a dataframe - python

I have a column info in a dataframe whose values are in a dict-like format, like the one below.
I would like to build another dataframe from this data, and I tried:
feature = [d.get('Feature') for d in df['info']]
but it returns None.
How can I do it? I am really having a hard time trying to get this done.

As the dict is nested, you can try pd.json_normalize(), which normalizes semi-structured JSON data into a flat table:
df_new = pd.json_normalize(df['info'])
Because some of the inner dicts sit inside a list, you may need extra handling to dig out the deeper contents. Still, this should serve as a good starting point.
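A minimal sketch of that idea, assuming each cell of info is already a Python dict. The sample data, the 'Tags' key and the 'k' field are hypothetical, since the original data was not shown. The column is converted with .tolist() first so that json_normalize receives a plain list of dicts.

```python
import pandas as pd

# Hypothetical data: each cell of 'info' is a nested dict,
# with a list of dicts stored under the made-up key 'Tags'.
df = pd.DataFrame({
    "info": [
        {"Feature": {"Name": "A", "Size": 1}, "Tags": [{"k": "x"}, {"k": "y"}]},
        {"Feature": {"Name": "B", "Size": 2}, "Tags": [{"k": "z"}]},
    ]
})

records = df["info"].tolist()

# Nested dicts become dotted columns; the list under 'Tags' stays a list.
flat = pd.json_normalize(records)

# To dig into the list, expand it to one row per element and carry
# the parent's Feature -> Name value along as metadata.
tags = pd.json_normalize(records, record_path="Tags", meta=[["Feature", "Name"]])

print(flat)
print(tags)
```

The second call is one way to handle the "inner dicts under a list" case: record_path names the list to expand, and meta names the parent fields to repeat on each expanded row.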

Related

How do I un-nest multiple sets of data and set them as columns

I am trying to load data from an API into a dataframe, but some of the data is nested. For example, I would like each stat, such as 'passing_att', to be its own column. But it is nested under 'passing', and 'passing', 'rushing' and 'receiving' are all nested under 'stats'. I can load just 'passing' into a dataframe, which shows 'passing_att' and the rest of the 'passing' data, but then I lose the names and all the other data I would like to keep.
Code
The term for this is 'flattening' a JSON object. There are many ways to do it: you could use the flatten_json pip package or write your own function. For your specific use case, though, you can use the json_normalize() function from the pandas library, because you are flattening a list of JSON objects that converts easily into a pandas dataframe.
Once you have the nested JSON object as shown below (already done in your code):
json_list = json.loads(response.text)
Use the following code to flatten it:
df = pandas.json_normalize(json_list)
All the nested fields are flattened into a single-level dataframe, with the nesting levels joined by a period in the column names. For example, the passing_att attribute can be accessed via the column name 'stats.passing.passing_att'. You can change the separator by passing the sep parameter to json_normalize. For example, to use a hyphen as the separator:
df = pandas.json_normalize(json_list, sep='-')
If you want to select only a few columns, you can pass a list of column names to the dataframe's [] operator. For example, if you want to select only the player_name, position, passing_att and passing_cmp columns, you can do so by using the following code:
df = df[['player_name', 'position', 'stats.passing.passing_att', 'stats.passing.passing_cmp']]
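As a rough illustration of the steps above, here is a sketch with a made-up payload. The player names and stat values are invented, and the field names other than 'stats', 'passing' and 'passing_att' are assumptions based on the question. It uses an underscore separator and shows that a player with no 'passing' block simply gets NaN in those columns.

```python
import pandas as pd

# Hypothetical payload shaped like the one the question describes.
json_list = [
    {
        "player_name": "Player A",
        "position": "QB",
        "stats": {
            "passing": {"passing_att": 30, "passing_cmp": 20},
            "rushing": {"rushing_att": 4},
        },
    },
    {
        "player_name": "Player B",
        "position": "RB",
        "stats": {
            "rushing": {"rushing_att": 15},
            "receiving": {"receiving_rec": 3},
        },
    },
]

# Flatten every nesting level; levels are joined with '_' instead of '.'.
df = pd.json_normalize(json_list, sep="_")

# Keep only the columns of interest.
wanted = [
    "player_name",
    "position",
    "stats_passing_passing_att",
    "stats_passing_passing_cmp",
]
subset = df[wanted]
print(subset)
```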

Combining Code Steps into a User-Defined Function

I'm writing a function that retrieves data from an API based on an ID#, then reads the json response into a pandas dataframe, munges the dataframe, and finally compiles every dataframe together. The goal is to pass a pandas series of ID#'s into the function, to retrieve the relevant data for a list of thousands of IDs.
When I execute every step manually, the steps work. I get a nice one-row pandas dataframe with all of the columns and the values that I want. When I combine all of the steps within a function containing a for-loop, it stops working.
Here are the steps:
req = Request('https://gs-api.greatschools.org/schools/3601714/metrics') ##request
req.add_header('X-API-Key', 'XXXXXXXXXXXXXXXXXXXXXXX') ##authenticate
content = urlopen(req).read() ##retrieve
data = pd.read_json(content) ##convert json to pandas dataframe
data.reset_index(inplace=True) ##reset index
data['id'] = 3601714 ##add id column
data.drop(columns=['head-official-name','head-official-email'],inplace=True) ##drop columns
data.pivot(['enrollment',
'percent-free-and-reduced-price-lunch',
'percent-students-with-limited-english-proficiency',
'student-teacher-ratio',
'percentage-male',
'percentage-female',
'percentage-of-teachers-with-3-or-more-years-experience',
'percentage-of-full-time-teachers-who-are-certified',
'average-salary','id'], 'index', 'ethnicity') ##pivot the dataframe
I've combined all of these steps into a function:
def demographics(universal_id):
    demo_mstr = []
    for item in universal_id:
        id = item
        req = Request(f'https://gs-api.greatschools.org/schools/{id}/metrics')
        req.add_header('X-API-Key', 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX')
        content = urlopen(req).read()
        data = pd.read_json(content)
        data.reset_index(inplace=True)
        data['id'] = id
        data.drop(columns=['head-official-name','head-official-email'],inplace=True)
        data = data.pivot(['enrollment',
                           'percent-free-and-reduced-price-lunch',
                           'percent-students-with-limited-english-proficiency',
                           'student-teacher-ratio',
                           'percentage-male',
                           'percentage-female',
                           'percentage-of-teachers-with-3-or-more-years-experience',
                           'percentage-of-full-time-teachers-who-are-certified',
                           'average-salary','id'], 'index', 'ethnicity')
        demo_mstr.append(data)
    return demo_mstr
If I run the function on a test list of ID#s, I get the following error: HTTPError: HTTP Error 422:
I've rewritten the function a number of times, and I've managed to get different error types, but not a working function.
What am I missing?
Update: I am answering my own question, in the hopes that it helps someone.
So, I figured out that the 422 error came from the fact that not every ID# has data associated with it in the API. Some of the API calls returned no data, which caused the error.
As for pivot, I realized that I only needed it because of how poorly pandas handled this JSON data. In my view, pd.read_json is only good for exploratory analysis, and even then it is of limited use.
What you should do instead is use r.json() (the JSON decoder on a requests response) to unpack the raw JSON into its constituent dictionaries, and write a parse_json function that iterates over those dictionaries and maps them to the column names you want.
Converting to a pandas dataframe first, then pivoting, then trying to append dataframes together is a recipe for disaster. Stay in JSON, do what you need to do with the JSON, append all the parsed records to a master list, and only convert to a pandas dataframe at the very end!
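A sketch of what that workflow might look like. It is not the author's actual code: the payload layout, the parse_metrics helper and the fetch parameter are all assumptions for illustration, and it uses urllib (as in the question) rather than requests. IDs that come back with a 422 are skipped.

```python
import json
from urllib.error import HTTPError
from urllib.request import Request, urlopen

import pandas as pd


def fetch_metrics(school_id, api_key):
    """Return the decoded JSON for one school, or None if the API has no data."""
    req = Request(f"https://gs-api.greatschools.org/schools/{school_id}/metrics")
    req.add_header("X-API-Key", api_key)
    try:
        return json.loads(urlopen(req).read())
    except HTTPError as err:
        if err.code == 422:  # no data for this ID, as described above
            return None
        raise


def parse_metrics(school_id, payload):
    """Flatten one payload into a single row dict (payload layout is assumed)."""
    row = {"id": school_id}
    for key, value in payload.items():
        if isinstance(value, dict):
            # e.g. {"ethnicity": {"Asian": 10}} becomes "ethnicity.Asian": 10
            for sub_key, sub_value in value.items():
                row[f"{key}.{sub_key}"] = sub_value
        else:
            row[key] = value
    return row


def demographics(universal_ids, api_key, fetch=fetch_metrics):
    """Collect one row dict per school, then build the dataframe once at the end."""
    rows = []
    for school_id in universal_ids:
        payload = fetch(school_id, api_key)
        if payload is None:
            continue  # skip IDs with no data instead of failing the whole run
        rows.append(parse_metrics(school_id, payload))
    return pd.DataFrame(rows)
```

Passing the fetch function as a parameter is only a convenience here: it lets the parsing be tried on canned dictionaries without calling the API.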

How to populate DataFrame column from JSON string element in another column

I have a DataFrame with a "details" column that I believe is a dictionary. The initial data is a JSON string parsed with json.loads, then converted from a dictionary to DataFrame. I would like to populate a new "affectedDealId" column with the value in data['details']['actions']['affectedDealId'].
I'm hoping I can do this the "DataFrame" way without using a loop with something like:
data['affectedDealId'] = data['details'].get('actions').get('affectedDealId')
To simplify I've tried:
data['actions'] = data['details'].get('actions')
But that ends up as "None".
Also, data['details'] appears to be a Series, although I believed it was a dictionary before the conversion to a DataFrame.
Alternatively, I do later loop through the DataFrame. How would I access that 'affectedDealId' element?
Below is a screenshot of the DataFrame from the PyCharm debugger.
I'm making some assumptions about the details JSON, but does this help? You will have to adjust the key/index applied to json.loads(x) to reach the right location.
df['affectedDealId'] = df['details'].apply(lambda x: json.loads(x)['affectedDealId'])
I think it would work well if you did something like this.
This creates a new data frame from your JSON column by applying pd.Series to each value:
data_frame_new = df['details'].apply(pd.Series)
Then reassign your data frame by concatenating data_frame_new with your existing data frame:
df = pd.concat([df,data_frame_new],axis = 1)
print(df)
This approach worked for me on a recent project.
Each top-level key in the details dictionary will become a column of its own, with the data populated.
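To make the difference concrete, here is a small sketch assuming each cell of details is already a dict with the layout the question implies (details -> actions -> affectedDealId). The sample values are invented. It also shows why data['details'].get('actions') comes back as None: Series.get looks up an index label on the Series, not a key inside each cell.

```python
import pandas as pd

# Hypothetical rows; the third one has no 'actions' key at all.
df = pd.DataFrame({
    "details": [
        {"actions": {"affectedDealId": "DEAL1"}},
        {"actions": {"affectedDealId": "DEAL2"}},
        {"status": "REJECTED"},
    ]
})

# Series.get searches the index labels (0, 1, 2 here), finds no label
# named 'actions', and returns its default value, None.
label_lookup = df["details"].get("actions")


def deal_id(details):
    """Pull the nested value from one cell, tolerating a missing 'actions' key."""
    return details.get("actions", {}).get("affectedDealId")


# Apply the lookup to every cell instead of to the Series itself.
df["affectedDealId"] = df["details"].apply(deal_id)
print(df)
```

If 'actions' turns out to be a list rather than a dict in the real data, the helper would need to index into it first.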

Flattened JSON Repeating Columns Pandas

I'm having an issue currently with pulling from an API, getting a JSON dict, then flattening it, and placing it into a dataframe.
The data is structured like this from the json:
X1_0, X2_0, X3_0 ... X1_1, X2_1, X2_1, ... X1_2, X2_2, X2_3
When I flatten it and place it into a dataframe, each flattened key becomes its own column header instead of the keys being combined, because of the _# suffix.
So rather than getting something whose shape is 22 x 6, I get something like 1 x 130.
I'm basically just interested in getting the shape of the dataframe correct but I'm not sure how I should fix it, and whether it should be done before flattening or after?
Any help is appreciated
Try to strip the _# from the keys. My guess is that they are added in the flattening step, so it should be easy to get rid of them.
Now you will get several values per key. Fix this by creating a list of JSON objects where each contains all the values for each key with the same _#.
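One possible way to do that regrouping, sketched under the assumption that every flattened key ends in an underscore followed by an integer row number. The sample dictionary is made up.

```python
from collections import defaultdict

import pandas as pd

# Hypothetical flattened record: base name, underscore, row number.
flat = {
    "X1_0": 1, "X2_0": 2, "X3_0": 3,
    "X1_1": 4, "X2_1": 5, "X3_1": 6,
}

# Group the values by their numeric suffix: one dict per future row.
rows = defaultdict(dict)
for key, value in flat.items():
    name, _, index = key.rpartition("_")  # split on the last underscore only
    rows[int(index)][name] = value

# Build the dataframe from the list of row dicts, in suffix order.
df = pd.DataFrame([rows[i] for i in sorted(rows)])
print(df)
```

Splitting on the last underscore keeps base names that themselves contain underscores intact.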

Python JSON dict to dataframe no rows

So after hours of unsuccessful googling I finally decided to post this here.
I am trying to convert some data obtained by an API call to a pandas.DataFrame.
This is my code:
response = requests.get(url)
data_as_list = response.json()['data']
for dct in data_as_list:
    json_df = pd.DataFrame.from_records(dct)
Unfortunately, the returned dataframe only contains the column names, but no row data at all, even though the dictionary has some. I already tried from_dict and pd.read_json() (after dumping it into a JSON string). But all of these had the same result.
The data is a nested dictionary in JSON format and looks like this
You can build a DataFrame directly from a Python list that contains dictionaries (or nested lists), like this:
json_df = pd.DataFrame(data_as_list)
Do this,
pd.DataFrame(data_as_list)
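A short sketch with invented records, since the real response was not shown. It passes the whole list at once instead of looping over single dictionaries, and also shows pd.json_normalize as an option if the nested part should become columns too.

```python
import pandas as pd

# Hypothetical stand-in for response.json()['data'].
data_as_list = [
    {"id": 1, "name": "a", "meta": {"score": 0.5}},
    {"id": 2, "name": "b", "meta": {"score": 0.7}},
]

# One row per dictionary; the nested 'meta' dict stays in a single column.
json_df = pd.DataFrame(data_as_list)

# Same rows, but nested keys are expanded into dotted column names.
flat_df = pd.json_normalize(data_as_list)

print(json_df)
print(flat_df)
```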
