Convert Json to DataFrame using pandas - python

I have a JSON file that looks like this, and I want to convert it to a DataFrame with Android and IOS as the indexes of my DF:
json = {
    "Android": {
        "lastExecutionID": "21-08-16_07_02_25_25111",
        "lastExecutionTime": 1629,
        "avgDuration": 26884
    },
    "IOS": {
        "lastID": "21-08-16_07_02_25_25534",
        "lastTime": 1669,
        "avg": 109802
    }
}
The best way I have found so far is to convert each JSON object into lists: one list of the keys to use as the columns of my DF and one list of the values, while using the top-level keys as my indexes.
Is there a better way to do that?
Thanks everyone

The easiest way to parse JSON objects into pandas is pd.json_normalize():
>>> df = pd.json_normalize(json)
>>> df
   Android.lastExecutionID  Android.lastExecutionTime  Android.avgDuration               IOS.lastID  IOS.lastTime  IOS.avg
0  21-08-16_07_02_25_25111                       1629                26884  21-08-16_07_02_25_25534          1669   109802
Nested names are .-separated, so you then just need to split column names and unstack:
>>> df.columns = df.columns.str.split('.', n=1).map(tuple)
>>> df.loc[0].unstack()
            avg  avgDuration          lastExecutionID  lastExecutionTime                   lastID  lastTime
Android     NaN        26884  21-08-16_07_02_25_25111               1629                      NaN       NaN
IOS      109802          NaN                      NaN                NaN  21-08-16_07_02_25_25534      1669

As you want Android and IOS to be the indexes of the resulting dataframe, see whether this is what you want:
Use pd.Series + .apply(), as follows:
df = pd.Series(json).apply(pd.Series)
Result:
print(df)
                 lastExecutionID  lastExecutionTime  avgDuration                   lastID  lastTime       avg
Android  21-08-16_07_02_25_25111             1629.0      26884.0                      NaN       NaN       NaN
IOS                          NaN                NaN          NaN  21-08-16_07_02_25_25534    1669.0  109802.0
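
Since the nesting here is only one level deep, a shorter alternative sketch is to build the frame directly from the dict; orient='index' makes the top-level keys ("Android", "IOS") the row index. (The variable is named data rather than json, to avoid shadowing the standard-library module.)

import pandas as pd

data = {
    "Android": {"lastExecutionID": "21-08-16_07_02_25_25111",
                "lastExecutionTime": 1629,
                "avgDuration": 26884},
    "IOS": {"lastID": "21-08-16_07_02_25_25534",
            "lastTime": 1669,
            "avg": 109802},
}

# orient='index' turns the outer keys into row labels; fields missing on
# one platform become NaN automatically.
df = pd.DataFrame.from_dict(data, orient='index')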


From a mixed dtype column, extract string from specific column values in python pandas

My dataframe looks like this:
import numpy as np
import pandas as pd

data = {'pred_id': [np.nan, np.nan, 'Pred ID', 258, 265, 595, 658],
        'class': [np.nan, np.nan, np.nan, 'pork', 'sausage', 'chicken', 'pork'],
        'image': ['Weight', 115.37, 'pred',
                  'app_images/03112020/Prediction/222_prediction_resized.jpg',
                  'app_images/03112020/Prediction/333_prediction_resized.jpg',
                  'volume', np.nan]}
df = pd.DataFrame(data)
df
Edited:
I am trying to create a new column 'image_name' with values from the column 'image'. I want to extract a substring from the 'image' values that contain 'app_images/' in the string, and otherwise keep the value the same.
I tried the code below and it throws an AttributeError.
Help me with how to check the type of each value and then extract the substring from values that contain 'app_images/', keeping the other values as they are. I don't know how to fix this. Thanks in advance.
images = []
for i in df['image']:
    if i.dtypes == object:  # this line raises: scalar values have no .dtypes attribute
        if i.__contains__('app_images/'):
            new = i.split('_')[1]
            name = new.split('/')[3] + '.jpg'
            images.append(name)
    else:
        images.append(i)
df['image_name'] = images
df
Do not use a loop; use vectorized code with str.extract and a regex.
From your description and code, this seems to be what you expect:
df['image_name'] = (df['image'].str.extract(r'app_images/.*/(\d+)_[^/]+\.jpg',
                                            expand=False)
                    + '.jpg')
output:
   pred_id    class                                                       image image_name
0      NaN      NaN                                                      Weight        NaN
1      NaN      NaN                                                      115.37        NaN
2  Pred ID      NaN                                                        pred        NaN
3      258     pork  app_images/03112020/Prediction/222_prediction_resized.jpg    222.jpg
4      265  sausage  app_images/03112020/Prediction/333_prediction_resized.jpg    333.jpg
5      595  chicken                                                      volume        NaN
6      658     pork                                                         NaN        NaN
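Since the question also asks to keep the value as it is when there is no match, a small follow-up sketch (building on the extract above) fills the unmatched rows back in from the source column:

extracted = df['image'].str.extract(r'app_images/.*/(\d+)_[^/]+\.jpg', expand=False) + '.jpg'
df['image_name'] = extracted.fillna(df['image'])  # rows without a match keep their original value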

Does this occur because there is a NaN?

I have a list of floats, and I get strange results when I try to convert it into a Series or DataFrame:
code
000001.SZ 1.305442
000002.SZ 1.771655
000004.SZ 2.649862
000005.SZ 1.373074
000006.SZ 1.115238
...
601512.SH 16.305734
688123.SH 53.395579
603995.SH 19.598881
688268.SH 70.174454
002972.SZ 19.644900
300811.SZ 24.042762
688078.SH 86.263280
603109.SH NaN
Length: 3753, dtype: float64
df = pd.DataFrame(data=mylist, columns=["std_r_in20days"])
print(df)
s = pd.Series(mylist)
print(s)
The result is:
std_r_in20days
0 NaN
0 code
000001.SZ 1.305442
000002.SZ 1.77...
dtype: object
AttributeError: Can only use .str accessor with string values (i.e. inferred_type is 'string', 'unicode' or 'mixed')
Does this occur because there is a NaN in mylist? If so, how can I fix it? I don't want to delete the rows with NaN, just leave them there.
Ser = pd.DataFrame([[1, 2, 3, 4, None], [2, 3, 4, 5, np.nan]])
Ser = Ser.replace(np.nan, 0)
You can do it like this. There are also other functions in pandas like fillna(). :)
Instead of removing the whole row, you can just replace those values with 0 using the pandas function
df.fillna(0)
Also, it helps to perform a null check at the beginning of your script using
df.isna().sum().sum()
This will give you the number of NaN values in your whole dataframe.
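
A minimal sketch of both suggestions, on hypothetical data:

import numpy as np
import pandas as pd

s = pd.Series([1.305442, 1.771655, np.nan, 2.649862])
print(s.isna().sum())  # 1 -> number of missing values
print(s.fillna(0))     # NaN replaced by 0; the other rows are untouched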

DataFrame from dicts with automatic date parsing

I am creating a Pandas DataFrame from sequence of dicts.
The dicts are large and somewhat heterogeneous.
Some of the fields are dates.
I would like to automatically detect and parse the date fields.
This can be achieved by
df0 = pd.DataFrame.from_dict(dicts)
df0.to_csv('tmp.csv', index=False)
df = pd.read_csv('tmp.csv', parse_dates=True)
I would like to find a more direct way to do this.
Use pd.to_datetime with errors='ignore'.
Only use it on columns of dtype == object, selected via select_dtypes. This prevents converting numeric columns into nonsensical dates.
'ignore' abandons the conversion attempt for a column if any errors are encountered.
combine_first is used instead of update because update keeps the initial dtypes. Since they were object, this would mess it all up.
df.select_dtypes(include=object).apply(pd.to_datetime, errors='ignore').combine_first(df)
       date0      date1  feuxdate notadate
0 2019-01-01        NaT       NaN      NaN
1        NaT        NaT       0.0      NaN
2        NaT        NaT       NaN       hi
3        NaT 2019-02-01       NaN      NaN
Could've also gotten tricky with it, using assign to deal with the dtypes:
df.assign(**df.select_dtypes(include=object).apply(pd.to_datetime, errors='ignore'))
Setup
dicts = [
    {'date0': '2019-01-01'},
    {'feuxdate': 0},
    {'notadate': 'hi'},
    {'date1': '20190201'}
]
df = pd.DataFrame.from_dict(dicts)
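
As a side note, errors='ignore' has been deprecated in recent pandas releases, so a forward-compatible version of the same idea needs an explicit per-column try/except. A minimal sketch (the helper name is mine):

import pandas as pd

def parse_object_dates(df):
    # Try converting each object column to datetime; keep it unchanged on failure.
    out = df.copy()
    for col in out.select_dtypes(include=object).columns:
        try:
            out[col] = pd.to_datetime(out[col])
        except (ValueError, TypeError):
            pass  # not a date column, keep the original values
    return out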

Can I separate the values of a dictionary into multiple columns and still be able to plot them?

I want to separate the values of a dictionary into multiple columns and still be able to plot them. At the moment all the values are in one column.
Concretely, I would like to split all the different values in the lists, use the length of the longest list as the number of columns, and fill the gaps in the shorter lists with something like 'NA' so I can still plot it in seaborn.
This is the dictionary that I used:
dictio = {'seq_7009': [6236.9764, 6367.049999999999],
          'seq_418': [3716.3642000000004, 3796.4124000000006],
          'seq_9143_unamb': [4631.958999999999],
          'seq_2888': [5219.3359, 5365.4089],
          'seq_1101': [4287.7417, 4422.8254],
          'seq_107': [5825.695099999999, 5972.8073],
          'seq_6946': [5179.3118, 5364.420900000001],
          'seq_6162': [5531.503199999999, 5645.577399999999],
          'seq_504': [4556.920899999999, 4631.959],
          'seq_3535': [3396.1715999999997, 3446.1969999999997, 5655.896546],
          'seq_4077': [4551.9108, 4754.0073, 4565.987654, 5668.9999976],
          'seq_1626_unamb': [3724.3894999999998]}
This is the code for the dataframe:
df = pd.Series(dictio)
test = pd.DataFrame({'ID': df.index, 'Value': df.values})
The series looks like this:
seq_107 [5825.695099999999, 5972.8073]
seq_1101 [4287.7417, 4422.8254]
seq_1626_unamb [3724.3894999999998]
seq_2888 [5219.3359, 5365.4089]
seq_3535 [3396.1715999999997, 3446.1969999999997, 5655....
seq_4077 [4551.9108, 4754.0073, 4565.987654, 5668.9999976]
seq_418 [3716.3642000000004, 3796.4124000000006]
seq_504 [4556.920899999999, 4631.959]
seq_6162 [5531.503199999999, 5645.577399999999]
seq_6946 [5179.3118, 5364.420900000001]
seq_7009 [6236.9764, 6367.049999999999]
seq_9143_unamb [4631.958999999999]
Thanks in advance for the help!
Convert the Value column to a list of lists, and reload it into a new dataframe. Afterwards, call plot. Something like this -
df = pd.DataFrame(test.Value.tolist(), index=test.ID)
df
0 1 2 3
ID
seq_107 5825.6951 5972.8073 NaN NaN
seq_1101 4287.7417 4422.8254 NaN NaN
seq_1626_unamb 3724.3895 NaN NaN NaN
seq_2888 5219.3359 5365.4089 NaN NaN
seq_3535 3396.1716 3446.1970 5655.896546 NaN
seq_4077 4551.9108 4754.0073 4565.987654 5668.999998
seq_418 3716.3642 3796.4124 NaN NaN
seq_504 4556.9209 4631.9590 NaN NaN
seq_6162 5531.5032 5645.5774 NaN NaN
seq_6946 5179.3118 5364.4209 NaN NaN
seq_7009 6236.9764 6367.0500 NaN NaN
seq_9143_unamb 4631.9590 NaN NaN NaN
df.plot()
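
For what it's worth, a one-step alternative sketch that skips the intermediate test frame: pd.DataFrame.from_dict with orient='index' pads the shorter lists with NaN automatically:

import pandas as pd

df = pd.DataFrame.from_dict(dictio, orient='index')  # rows = dict keys, columns NaN-padded
df.plot()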

Pandas find max value from series of mixed data

I am using a pandas df, and from the dataframe I was able to extract a series named 'xy' that looks like this:
INVOICE
2014-08-14 00:00:00
4557
Printing
nan
Item AMOUNT
nan
1 9.6
0
0
0
9.6
2
11.6
nan
nan
nan
What I need to find is the maximum value, which is usually located towards the end of the 'xy' series. When I tried to convert it to string I ran into problems, as some of the entries are strings rather than int or float. I need a robust approach, as I am writing this script for several different files.
Try pd.to_numeric:
pd.to_numeric(xy, errors='coerce').max()
4557.0
To get the last float number:
s = pd.to_numeric(xy, errors='coerce')
s.loc[s.last_valid_index()]
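
A minimal end-to-end sketch on hypothetical data, showing both steps:

import pandas as pd

xy = pd.Series(['INVOICE', '2014-08-14 00:00:00', '4557', 'Printing', None, '9.6', '11.6'])
s = pd.to_numeric(xy, errors='coerce')  # non-numeric entries become NaN
print(s.max())                          # 4557.0 -> overall maximum
print(s.loc[s.last_valid_index()])      # 11.6  -> last numeric value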
