I want to convert a list of objects to a pandas DataFrame. The objects are large and complex; a sample one can be found here.
The output should be a DataFrame with 3 columns: info, releases, and URLs, as per the JSON object linked above. I've tried pd.DataFrame and from_records, but I keep getting hit with errors. Can anyone suggest a fix?
Have you tried the pd.read_json() function?
Here's a link to the documentation:
https://pandas.pydata.org/docs/reference/api/pandas.io.json.read_json.html
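For example (a quick sketch; the objects below are made-up stand-ins for the linked sample, which I haven't seen):
import json
from io import StringIO
import pandas as pd

# Hypothetical objects with the three top-level keys from the question;
# the real ones are much larger and more deeply nested.
objects = [
    {"info": {"name": "pkg-a"}, "releases": ["1.0", "1.1"], "URLs": ["https://example.org/a"]},
    {"info": {"name": "pkg-b"}, "releases": ["2.0"], "URLs": ["https://example.org/b"]},
]

# read_json expects a path, buffer, or JSON string, so serialize the list first
df = pd.read_json(StringIO(json.dumps(objects)))
print(df[["info", "releases", "URLs"]])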
I am using a simple method to fetch some surfing data from MSW with the following code:
import requests
spot = requests.get('http://magicseaweed.com/api/{API_KEY}/forecast/?spot_id={SPOT_ID}&units=eu')
Converting spot to JSON with spotJson = spot.json() returns a list like this:
[{"timestamp":1587772800,"localTimestamp":1587772800,"issueTimestamp":1587772800,"fadedRating":0,"solidRating":0,"swell":{"absMinBreakingHeight":0.22,"absMaxBreakingHeight":0.35,"probability":100,"unit":"m","minBreakingHeight":0.2,"maxBreakingHeight":0.3,"components":{"combined":{"height":0.6,"period":4,"direction":178.94,"compassDirection":"N"},"primary":{"height":0.6,"period":4,"direction":183.14,"compassDirection":"N"}}},"wind":{"speed":22,"direction":196,"compassDirection":"NNE","chill":2,"gusts":25,"unit":"kph"},"condition":{"pressure":1016,"temperature":8,"weather":"11","unitPressure":"mb","unit":"c"},"charts":{"swell":"https:\/\/charts-s3.msw.ms\/archive\/wave\/750\/7-1587772800-1.gif","period":"https:\/\/charts-s3.msw.ms\/archive\/wave\/750\/7-1587772800-2.gif","wind":"https:\/\/charts-s3.msw.ms\/archive\/gfs\/750\/7-1587772800-4.gif","pressure":"https:\/\/charts-s3.msw.ms\/archive\/gfs\/750\/7-1587772800-3.gif","sst":"https:\/\/charts-s3.msw.ms\/archive\/sst\/750\/7-1587772800-10.gif"}}]
I loop through this list to find the moments when the surf is good :)! BUT I would love to put the data into pandas using the read_json method to enrich my data with other forecasts:
import pandas as pd
raw = pd.read_json(spotJson)
For which I get the following error:
invalid file path or buffer object type: <class 'list'>
I can get it into a DataFrame using the following method:
pd.DataFrame(spotJson)
But I just want to know why I can't do it directly with pd.read_json, which would seem to make more sense since the option is available. Any thoughts on why this is not working?
To normalize nested data like this, you need to use json_normalize. For pandas older than version 1.0 use pd.io.json.json_normalize; otherwise use pd.json_normalize.
df = pd.io.json.json_normalize(data)
OR
df = pd.json_normalize(data)
Columns are labeled with their path inside the JSON, i.e. a nested dictionary in your JSON gets parent_key.child_key notation.
df.columns
Index(['charts.period', 'charts.pressure', 'charts.sst', 'charts.swell',
'charts.wind', 'condition.pressure', 'condition.temperature',
'condition.unit', 'condition.unitPressure', 'condition.weather',
'fadedRating', 'issueTimestamp', 'localTimestamp', 'solidRating',
'swell.absMaxBreakingHeight', 'swell.absMinBreakingHeight',
'swell.components.combined.compassDirection',
'swell.components.combined.direction',
'swell.components.combined.height', 'swell.components.combined.period',
'swell.components.primary.compassDirection',
'swell.components.primary.direction', 'swell.components.primary.height',
'swell.components.primary.period', 'swell.maxBreakingHeight',
'swell.minBreakingHeight', 'swell.probability', 'swell.unit',
'timestamp', 'wind.chill', 'wind.compassDirection', 'wind.direction',
'wind.gusts', 'wind.speed', 'wind.unit'],
      dtype='object')
If you need fewer columns or want the result structured differently, you'll have to pass some arguments when you call this function. See the documentation for more info.
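For example, something along these lines trims the result down (the columns picked here are just an illustration; pd.json_normalize and its max_level argument both need a reasonably recent pandas):
# Flatten only one level deep and keep a few columns of interest
df = pd.json_normalize(spotJson, max_level=1)
df = df[['timestamp', 'solidRating', 'wind.speed', 'swell.minBreakingHeight']]

# The epoch timestamps are easier to work with as datetimes
df['timestamp'] = pd.to_datetime(df['timestamp'], unit='s')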
I need to convert a dask dataframe into a list of dictionaries as the response for an API endpoint. I know I can convert the dask dataframe to pandas and then to a dictionary from there, but it would be better to map each partition to a dict and then concatenate the results.
What I tried:
import dask.dataframe as dd

df = dd.read_csv(path, usecols=cols)
dd.compute(df.to_dict(orient='records'))
Error I'm getting:
AttributeError: 'DataFrame' object has no attribute 'to_dict'
You can do it as follows:
import dask.bag as db
import pandas as pd

bag = db.from_delayed(
    df.map_partitions(pd.DataFrame.to_dict, orient='records').to_delayed()
)
This gives you a bag, which you can compute (if it fits in memory) or otherwise manipulate.
Note that to_delayed/from_delayed should not be necessary; there is also a to_bag method, but it doesn't seem to do the right thing.
Also, you are not really getting much from the dataframe model here; you may want to start with db.read_text and the built-in csv module.
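As a quick usage sketch, assuming the result fits in memory:
records = bag.compute()   # a flat list of dicts, one per row of the original dataframe
print(records[:2])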
Try this:
data = list(df.map_partitions(lambda x: x.to_dict(orient="records")))
It will return a list of dictionaries, in which each row is converted to a dictionary.
Kunal Bafna's answer is the easiest to implement and has fewer dependencies.
data = list(df.map_partitions(lambda x: x.to_dict(orient="records")))
Recently in my project, I needed to store a dictionary in a pandas DataFrame with the following code:
self.device_info.loc[i, 'interference'] = [temp_dict]
device_info is a pandas DataFrame, and temp_dict is a dictionary that I want stored as a single element in the DataFrame for future use. The square brackets are added to avoid an error when assigning.
I just found out today that with pandas version 0.22.0, this code packs the dictionary inside a list and stores that list in the DataFrame. However, in version 0.24.2, this code stores the dictionary itself in the DataFrame.
For example, say i = 0. After executing the code with pandas.__version__ == '0.22.0',
type(self.device_info.loc[0, 'interference'])
returns list, while with pandas.__version__ == '0.24.2' the same code returns dict. I need consistent behaviour: a dictionary should always be stored.
I am currently working on two PCs, one at home and one at my office, and I cannot update the older version of pandas on the office PC. I would much appreciate it if anyone could help me figure out why this happens.
pandas has a from_dict method with many options that takes a dict as input and returns a DataFrame.
You can choose to infer the dtype or force it (to str, for example).
Manipulating and appending DataFrames is then much easier, as you will no longer have a dict object problem in that row or column.
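A minimal sketch of what that looks like (the dictionary contents here are hypothetical):
import pandas as pd

# Hypothetical interference measurements standing in for temp_dict
temp_dict = {'channel_1': -70.5, 'channel_2': -68.1}

# One row per key; the dtype can be inferred (the default) or forced, e.g. to str
interference = pd.DataFrame.from_dict(temp_dict, orient='index', columns=['value'])
interference_as_str = pd.DataFrame.from_dict(temp_dict, orient='index', dtype=str)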
I recently started trying to use the HDF5 format in pandas to store data, but I have encountered a problem I can't find a workaround for. Before this I worked with CSV files and had no trouble appending new data.
This is what I try:
import pandas as pd

store = pd.HDFStore('cdw.h5')
frame.to_hdf('cdw.h5', 'cdw/data_cleaned', format='table', append=True, data_columns=True, dropna=False)
And it throws:
ValueError: invalid combinate of [values_axes] on appending data [name->Ordereingangsdatum,cname->Ordereingangsdatum,dtype->float64,kind->float,shape->(1, 176345)] vs current table [name->Ordereingangsdatum,cname->Ordereingangsdatum,dtype->bytes128,kind->string,shape->None]
I get that it's telling me I am trying to append a different data type for a column, but what baffles me is that I have written the same CSV file (together with some other CSV files) from a DataFrame to that HDF5 file before.
I'm doing analysis in the forwarding industry, and the data there is very inconsistent; more often than not there are missing values, mixed dtypes in columns, or other 'data dirt'.
I'm looking for a way to append data to the HDF5 file no matter what is inside a column, as long as the column names are the same.
It would be great to be able to force appending data to the HDF store independent of datatypes, or to find another simple solution for my problem. The goal is to automate the analysis later on, so I'd rather not change datatypes every time there is a missing value in one of the 62 columns.
A second question:
My read_hdf call takes more time than my read_csv; I have around 1.5 million rows with 62 columns. Is this because I have no SSD drive? I have read that read_hdf should be faster.
I am asking myself whether I should stick with CSV files or with HDF5.
Help would be greatly appreciated.
Okay, for anyone having the same issue with appending data where the dtype is not guaranteed to be the same: I finally found a solution. First convert every column to object with li = list(frame) and
frame[li] = frame[li].astype(object)
Check the result with frame.info(), then try df.to_hdf(path, key, append=True) and wait for its error message. The error TypeError: Cannot serialize the column [not_one_datatype] because its data contents are [mixed] object dtype tells you which columns it still doesn't like. Converting those columns to float worked for me: convert the mentioned column with df['not_one_datatype'].astype(float). Only use integer if you are sure that a float will never occur in that column; otherwise the append method will break again.
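Roughly, the whole workflow looks like this (a sketch only; 'not_one_datatype' stands in for whichever column the error message names in your data):
li = list(frame)
frame[li] = frame[li].astype(object)          # everything to object first
frame.info()

try:
    frame.to_hdf('cdw.h5', key='cdw/data_cleaned', format='table',
                 append=True, data_columns=True, dropna=False)
except TypeError:
    # the message names the column with mixed contents, e.g. 'not_one_datatype'
    frame['not_one_datatype'] = frame['not_one_datatype'].astype(float)
    frame.to_hdf('cdw.h5', key='cdw/data_cleaned', format='table',
                 append=True, data_columns=True, dropna=False)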
I decided to work with CSV and HDF5 files in parallel. If I hit a problem with HDF5 that I have no workaround for, I will simply switch to CSV; this is what I can personally recommend.
Update: Okay, it seems the creators of this API did not have real-world data in mind: the HDF5 min_itemsize error ValueError: Trying to store a string with len [##] in [y] column but this column has a limit of [##]! occurs when appending data to an already existing file if some column value happens to be longer than in the initial write to the HDF file.
The joke here is that the creators of this API expect me to know the maximum length of every possible value in a column at the first write? Really? Another inconsistency is that df.to_hdf(append=True) does not have the parameter min_itemsize={'column1': 1000}. This format is at best suited for storing self-created data only, but definitely not for data where the dtypes and the length of the entries in each column are NOT set in stone. The only solution left, if you want to append data from pandas DataFrames independent of the stubborn HDF5 API in Python, is to insert into every dataframe, before appending, a row with very long strings in all but the numeric columns, just to be sure that you will always be able to append the data no matter how long it may get.
When you do this, the write process takes ages and eats up huge amounts of disk space for the resulting HDF5 file.
CSV definitely wins against HDF5 in terms of performance, integration and especially usability.
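That said, for anyone who still wants to stay with HDF5: HDFStore.append does accept min_itemsize, so string columns can be given a generous width on the very first write. A sketch (the column name and width here are made up):
import pandas as pd

with pd.HDFStore('cdw.h5') as store:
    store.append('cdw/data_cleaned', frame, format='table',
                 data_columns=True, dropna=False,
                 min_itemsize={'some_string_column': 200})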
I am using Python to get data from an API which returns JSON that I need to parse for specific variables, as below:
import requests

r = requests.get(request_here)
g = r.json()
N = len(g)
for i in range(N):
    Submission_Timestamp = g[i]['vars']['submission']['timestamp']
    Outcome = g[i]['appended']['suppressionlist']['add_item']['outcome']
Each "i" is a unique record. While this method works, I wanted to know whether there is a way to vectorize this and grab everything in one swoop. Not all records have the Outcome variable, so I also need KeyError handling for those records. Finally, I am pushing this into a pandas DataFrame. How would I accomplish this? Thanks!
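One possible direction (a sketch only, not a tested answer): pd.json_normalize flattens the nested dicts into parent.child columns, and records that lack a key simply get NaN in that column, so no explicit KeyError handling is needed.
import pandas as pd
import requests

r = requests.get(request_here)
g = r.json()

cols = ['vars.submission.timestamp',
        'appended.suppressionlist.add_item.outcome']

# reindex keeps both columns present (as NaN) even if no record has them at all
df = pd.json_normalize(g).reindex(columns=cols)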