JSON to pandas DataFrame types change - Python

I have JSON output from the m3inference package in Python like this:
{'input': {'description': 'Bundeskanzlerin',
'id': '2631881902',
'img_path': '/root/m3/cache/angelamerkeicdu_224x224.jpg',
'lang': 'de',
'name': 'Angela Merkel',
'screen_name': 'angelamerkeicdu'},
'output': {'age': {'19-29': 0.0,
'30-39': 0.0001,
'<=18': 0.0001,
'>=40': 0.9998},
'gender': {'female': 0.9991, 'male': 0.0009},
'org': {'is-org': 0.0032, 'non-org': 0.9968}}}
I store it in:
org = pd.DataFrame.from_dict(json_normalize(org['output']), orient='columns')
gender.male gender.female age.<=18 ... age.>=40 org.non-org org.is-org
0 0.0009 0.9991 0.0000 ... 0.9998 0.9968 0.0032
I don't know where the 0 value in the first column is coming from. I save the org.is-org column to isorg:
isorg = org['org.is-org']
but when I append it to a pandas dataframe the dtype is object, and the value changes to
0 0.0032 Name: org.is-org, dtype: float64
instead of just 0.0032.
How do I fix this?

"i dont know where 0 value in first column coming from then i save org.isorg column to isorg"
That "0" is an index to your dataframe. Unless you specify your dataframe index, pandas will auto create the index. You can change you index instead.
code example:
org.set_index('gender.male', inplace=True)
An index is like an address for your data: it is how any data point across the dataframe or series is accessed.
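If what you actually want is the bare number rather than a labeled Series, select the single cell instead of the whole column. A minimal sketch, assuming the one-row dataframe from the question:
import pandas as pd

org = pd.json_normalize({'org': {'is-org': 0.0032, 'non-org': 0.9968}})

# org['org.is-org'] is a Series; printing it shows the index (0),
# the Series name, and the dtype alongside the value.
print(org['org.is-org'])

# Selecting the single cell returns a plain float: 0.0032.
isorg = org['org.is-org'].iloc[0]
print(isorg)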

How to access nested data in a pandas dataframe?

Here's an example of the data I'm working with:
   values                                               variable.variableName  timeZone
0  [{'value': [],                                       turbidity              PST
    'qualifier': [],
    'qualityControlLevel': [],
    'method': [{'methodDescription': '[TS087: YSI 6136]',
                'methodID': 15009}],
    'source': [],
    'offset': [],
    'sample': [],
    'censorCode': []},
   {'value': [{'value': '17.2',
               'qualifiers': ['P'],
               'dateTime': '2022-01-05T12:30:00.000-08:00'},
              {'value': '17.5',
               'qualifiers': ['P'],
               'dateTime': '2022-01-05T14:00:00.000-08:00'}]}]
1  [{'value': [{'value': '9.3',                         degC                   PST
               'qualifiers': ['P'],
               'dateTime': '2022-01-05T12:30:00.000-08:00'},
              {'value': '9.4',
               'qualifiers': ['P'],
               'dateTime': '2022-01-05T12:45:00.000-08:00'}]}]
I'm trying to break out each of the variables in the data into its own dataframe. What I have so far works; however, if there are multiple sets of values (as with turbidity), it only pulls in the first set, which is sometimes empty. How do I pull in all the value sets? Here's what I have so far:
import requests
import pandas as pd

url = 'https://waterservices.usgs.gov/nwis/iv?sites=11273400&period=P1D&format=json'
response = requests.get(url)
result = response.json()
json_list = result['value']['timeSeries']
df = pd.json_normalize(json_list)
new_df = df['values'].apply(lambda x: pd.DataFrame(x[0]['value']))
new_df.index = df['variable.variableName']

# print turbidity (the long name is split across two implicitly concatenated strings)
print(new_df.loc['Turbidity, water, unfiltered, monochrome near infra-red LED light, '
                 '780-900 nm, detection angle 90 ±2.5°, formazin nephelometric units (FNU)'])
This outputs:
turbidity df
Empty DataFrame
Columns: []
Index: []
degC df
value qualifiers dateTime
0 9.3 P 2022-01-05T12:30:00.000-08:00
1 9.4 P 2022-01-05T12:45:00.000-08:00
Whereas I want my output to be something like:
turbidity df
value qualifiers dateTime
0 17.2 P 2022-01-05T12:30:00.000-08:00
1 17.5 P 2022-01-05T14:00:00.000-08:00
degC df
value qualifiers dateTime
0 9.3 P 2022-01-05T12:30:00.000-08:00
1 9.4 P 2022-01-05T12:45:00.000-08:00
Unfortunately, it only grabs the first value set, which in the case of turbidity is empty. How can I grab them all, or check whether the dataframe is empty and grab the next one?
I believe the missing link here is DataFrame.explode() -- it allows you to split a single row that contains a list of values (your "values" column) into multiple rows.
You can then use
new_df = df.explode("values")
which will split the "turbidity" row into two.
You can then filter rows with empty "value" dictionaries and apply .explode() once again.
You can then also use pd.json_normalize again to expand a dictionary of values into multiple columns, or also look into Series.str.get() to extract a single element from a dict or list.
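A rough sketch of those steps, assuming df = pd.json_normalize(json_list) from the question (names and the non-empty filter are assumptions, not tested against the live endpoint):
import pandas as pd

# Split each list in "values" into its own row.
exploded = df.explode('values').reset_index(drop=True)

# Expand each "values" dict; keep only entries whose inner "value"
# list is non-empty, then explode that list as well.
inner = pd.json_normalize(exploded['values'])
inner = inner[inner['value'].str.len() > 0].explode('value')

# One more normalize turns each reading dict into columns
# (value, qualifiers, dateTime).
readings = pd.json_normalize(inner['value'])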
This JSON is deeply nested, so I think it takes a few steps to transform into what you want.
# First, use json_normalize on the time-series list to extract values and variableName.
df = pd.json_normalize(json_list, record_path=['values'], meta=[['variable', 'variableName']])

# Then explode value to flatten the arrays, and drop rows whose array was empty.
df = df.explode('value').dropna(subset=['value'])

# Another json_normalize on the exploded value extracts value, qualifiers and dateTime;
# concat with variableName. explode('qualifiers') unwraps the one-element arrays.
df = pd.concat([df[['variable.variableName']].reset_index(drop=True),
                pd.json_normalize(df.value).explode('qualifiers')], axis=1)
The resulting dataframe should look like this:
variable.variableName value qualifiers dateTime
0 Temperature, water, °C 10.7 P 2022-01-06T12:15:00.000-08:00
1 Temperature, water, °C 10.7 P 2022-01-06T12:30:00.000-08:00
2 Temperature, water, °C 10.7 P 2022-01-06T12:45:00.000-08:00
3 Temperature, water, °C 10.8 P 2022-01-06T13:00:00.000-08:00
If you are going to do further processing, it is probably better to keep everything in one dataframe, but if you really need separate dataframes, pull them out by filtering:
df_turbidity = df[df['variable.variableName'].str.startswith('Turbidity')]
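If you want one dataframe per variable without repeating the filter, a short sketch using groupby (df here is the combined frame built above):
# Map each variable name to its own dataframe.
dfs = {name: group.reset_index(drop=True)
       for name, group in df.groupby('variable.variableName')}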

Explode a dictionary into different rows

This is what my original dataset looks like:
  url      boolean  details                                                     numberOfPages  date
0 xzy.com  0        {'https://www.eltako.depdf': {'numberOfPages': 440, 'date': '2017-09-20'},
                     'https://new.com': {'numberOfPages': 240, 'date': '2017-09-20'}}
The numberOfPages and date columns are initially empty, while the details column holds a dictionary. I want to iterate through all rows (urls) and check their details column. For each key in the details column, I want to make a separate row, and then use the numberOfPages and date values to fill in the column values. The result should be something like this:
url boolean pdfLink numberOfPages date
xzy.com 0 https://www.eltako.depdf 440 2017-09-20
https://new.com 240 2017-09-20
I tried this but the second line gives me an error: TypeError: string indices must be integers
def arrange(df):
    df = df.explode('details').reset_index(drop=True)
    out = pd.DataFrame(df['details'].map(lambda x: [x[y] for y in x]).explode().tolist())
The original type of the Info column was dict. I also tried changing the type to str, but I still got the same error. Then I tried changing the lambda function to this:
lambda x:[y for y in x]
but the output I get is something like this:
url boolean details 0
xzy.com 0 https://www.eltako.depdf h
Nan Nan t
t
p
So basically the characters of the link are being exploded into different rows. How can I fix this?
Here is the raw data as a dictionary:
{'Company URL': {0: 'https://www.eltako.de/'},
 'Potential Client': {0: 1},
 'PDF Link': {0: nan},
 'Number of Pages': {0: nan},
 'Creation Date': {0: nan},
 'Info': {0: {'https://www.eltako.de/wp-content/uploads/2020/11/Eltako_Gesamtkatalog_LowRes.pdf':
                  {'numberOfPages': 440, 'date': '2017-09-20'}},
          1: {'https://new.com': {'numberOfPages': 230, 'date': '2017-09-20'}}}}
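One possible approach (a sketch, not an answer from the original thread; the column names are assumed from the question): turn each details dict into a list of per-link records, explode that list into rows, then normalize the records into columns.
import pandas as pd

# Hypothetical sample mirroring the question's layout.
df = pd.DataFrame({
    'url': ['xzy.com'],
    'boolean': [0],
    'details': [{'https://www.eltako.depdf': {'numberOfPages': 440, 'date': '2017-09-20'},
                 'https://new.com': {'numberOfPages': 240, 'date': '2017-09-20'}}],
})

# One record per PDF link, keeping the link itself as a field.
df['details'] = df['details'].map(
    lambda d: [{'pdfLink': k, **v} for k, v in d.items()])
df = df.explode('details').reset_index(drop=True)

# Expand the record dicts into real columns.
out = pd.concat([df.drop(columns='details'),
                 pd.json_normalize(df['details'])], axis=1)
print(out)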

Convert a dictionary of a list of dictionaries to pandas DataFrame

I pulled a list of historical option prices for AAPL using the Robinhood function robin_stocks.get_option_historicals(). The data was returned as a dictionary of a list of dictionaries, as shown below.
I am having difficulty converting the object below (named historicalData) into a DataFrame. Can someone please help?
historicalData = {'data_points': [{'begins_at': '2020-10-05T13:30:00Z',
'open_price': '1.430000',
'close_price': '1.430000',
'high_price': '1.430000',
'low_price': '1.430000',
'volume': 0,
'session': 'reg',
'interpolated': False},
{'begins_at': '2020-10-05T13:40:00Z',
'open_price': '1.430000',
'close_price': '1.340000',
'high_price': '1.440000',
'low_price': '1.320000',
'volume': 0,
'session': 'reg',
'interpolated': False}],
'open_time': '0001-01-01T00:00:00Z',
'open_price': '0.000000',
'previous_close_time': '0001-01-01T00:00:00Z',
'previous_close_price': '0.000000',
'interval': '10minute',
'span': 'week',
'bounds': 'regular',
'id': '22b49380-8c50-4c76-8fb1-a4d06058f91e',
'instrument': 'https://api.robinhood.com/options/instruments/22b49380-8c50-4c76-8fb1-a4d06058f91e/'}
I tried the code below, but that didn't help:
import pandas as pd
df = pd.DataFrame(historicalData)
df
You didn't write that you want only data_points (as in the
other answer), so I assume that you want your whole dictionary
converted to a DataFrame.
To do it, start with your code:
df = pd.DataFrame(historicalData)
It creates a DataFrame, with data_points "exploded" to
consecutive rows, but they are still dictionaries.
Then rename the open_price column to open_price_all:
df.rename(columns={'open_price': 'open_price_all'}, inplace=True)
The reason is to avoid duplicated column names after the join to be performed soon (data_points also contains an open_price attribute, and I want the corresponding column from data_points to "inherit" this name).
The next step is to create a temporary DataFrame - a split of
dictionaries in data_points to individual columns:
wrk = df.data_points.apply(pd.Series)
Print wrk to see the result.
And the last step is to join df with wrk and drop
data_points column (not needed any more, since it was
split into separate columns):
result = df.join(wrk).drop(columns=['data_points'])
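Put together, a usage sketch, assuming historicalData is the dictionary from the question:
import pandas as pd

df = pd.DataFrame(historicalData)
df.rename(columns={'open_price': 'open_price_all'}, inplace=True)
wrk = df.data_points.apply(pd.Series)
result = df.join(wrk).drop(columns=['data_points'])

# Two rows, one per data point: the dict keys become columns and the
# top-level scalar fields are repeated on each row.
print(result[['begins_at', 'open_price', 'open_price_all']])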
This is pretty easy to solve with the code below. I have converted the data into a list of dataframes via a list comprehension:
import pandas as pd
df_list = [pd.DataFrame(dic.items(), columns=['Parameters', 'Value']) for dic in historicalData['data_points']]
You then could do:
df_list[0]
which will yield
Parameters Value
0 begins_at 2020-10-05T13:30:00Z
1 open_price 1.430000
2 close_price 1.430000
3 high_price 1.430000
4 low_price 1.430000
5 volume 0
6 session reg
7 interpolated False
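For completeness, if you only want data_points as a tidy table with one row per data point (presumably what the "other answer" above did; a sketch):
import pandas as pd

# Columns: begins_at, open_price, close_price, high_price, low_price, ...
df = pd.DataFrame(historicalData['data_points'])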

How to convert a pandas dataframe and numpy array into a dictionary?

I have the following pandas dataframe:
code comp name
0 A292340 디비자산운용 마이티 200커버드콜ATM레버리지
1 A291630 키움투자자산운용 KOSEF 코스닥150선물레버리지
2 A278240 케이비자산운용 KBSTAR 코스닥150선물레버리지
3 A267770 미래에셋자산운용 TIGER 200선물레버리지
4 A267490 케이비자산운용 KBSTAR 미국장기국채선물레버리지(합성 H)
And I'd like to make a dictionary out of this that looks like:
{'20180408': {'A292340': {'comp': '디비자산운용', 'name': '마이티 200커버드콜ATM레버리지'}}}
Sorry that the data is in a language that may be foreign to you, but let me ask anyway.
What I tried is:
values = [comp, name]
names = ['comp', 'name']
tmp = {names:values for names, values in zip(names, values)}
tpm = {code:values for values in zip(*tmp)}
aaaa = {date:c for c in zip(*tpm)}
print(aaaa)
aaaa is what I'm trying to get, and date is just a simple list of dates from a start date to today. But when I run this, I get the error
TypeError: unhashable type: 'numpy.ndarray'
Thank you in advance for your answer.
Is this what you want? First, set the "code" column as the index. Then use to_dict with orient="index":
df.set_index("code").to_dict("index")
{'A267490': {'comp': '케이비자산운용', 'name': 'KBSTAR 미국장기국채선물레버리지(합성 H)'},
'A267770': {'comp': '미래에셋자산운용', 'name': 'TIGER 200선물레버리지'},
'A278240': {'comp': '케이비자산운용', 'name': 'KBSTAR 코스닥150선물레버리지'},
'A291630': {'comp': '키움투자자산운용', 'name': 'KOSEF 코스닥150선물레버리지'},
'A292340': {'comp': '디비자산운용', 'name': '마이티 200커버드콜ATM레버리지'}}
Using the argument "index" will give the layout:
{index -> {columnName -> valueOfTheColumn}}
Here since we set code as the index, we have
code -> {"comp" -> comp's value, "name" -> name's value}
'A267490': {'comp': '케이비자산운용', 'name': 'KBSTAR 미국장기국채선물레버리지(합성 H)'}
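To also get the date as the outermost key, as in the desired output, a small sketch (the date string here is assumed; the question builds it from a list of dates):
date = '20180408'  # assumed sample date
result = {date: df.set_index('code').to_dict('index')}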

JSON to pandas DataFrame without knowing whether the JSON will have all the columns of the dataframe

I am doing a research project and trying to pull thousands of quarterly results for companies from the SEC EDGAR API.
Each result is a list of dictionaries structured as follows:
[{'field': 'othercurrentliabilities', 'value': 6886000000.0},
{'field': 'otherliabilities', 'value': 13700000000.0},
{'field': 'propertyplantequipmentnet', 'value': 15789000000.0}...]
I want each result to be a row of a pandas dataframe. The issue is that the results may not all have the same fields, depending on the data available. I would like to check whether each column (field) of the dataframe is present among a result's fields and, if it is, add the result's value to the row; if not, add an np.nan. How would I go about doing this?
A list/dict comprehension ought to work here:
In [11]: s
Out[11]:
[[{'field': 'othercurrentliabilities', 'value': 6886000000.0},
{'field': 'otherliabilities', 'value': 13700000000.0},
{'field': 'propertyplantequipmentnet', 'value': 15789000000.0}],
[{'field': 'othercurrentliabilities', 'value': 6886000000.0}]]
In [12]: pd.DataFrame([{d["field"]: d["value"] for d in row} for row in s])
Out[12]:
othercurrentliabilities otherliabilities propertyplantequipmentnet
0 6.886000e+09 1.370000e+10 1.578900e+10
1 6.886000e+09 NaN NaN
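If you also want to guarantee a fixed set of columns regardless of which fields show up, reindex fills in the missing ones with NaN. A sketch, with the expected field list assumed:
import pandas as pd

expected = ['othercurrentliabilities', 'otherliabilities',
            'propertyplantequipmentnet']  # assumed field list
df = pd.DataFrame([{d['field']: d['value'] for d in row} for row in s])
df = df.reindex(columns=expected)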
Make a list of df.result.rows[x]['values'], like below:
s = []
for x in range(df.result.totalrows[0]):
    s = s + [df.result.rows[x]['values']]
    print(x)
df1 = pd.DataFrame([{d["field"]: d["value"] for d in row} for row in s])
df1
will give you the result.
