flatten array of arrays json object column in a pandas dataframe - python

0 [{'review_id': 4873356, 'rating': '5.0'}, {'review_id': 4973356, 'rating': '4.0'}]
1 [{'review_id': 4635892, 'rating': '5.0'}, {'review_id': 4645839, 'rating': '3.0'}]
I have a situation where I want to flatten JSON like the above, as solved here: Converting array of arrays into flattened dataframe
But I want to create new columns so that the output is:
review_id_1 rating_1 review_id_2 rating_2
4873356 5.0 4973356 4.0
4635892 5.0 4645839 3.0
Any help is highly appreciated.

Try using (where s is the Series holding the lists of review dicts):
print(pd.DataFrame(s.apply(lambda row: {key + str(i): val
                                        for i, d in enumerate(row, 1)
                                        for key, val in d.items()}).tolist()))
Output:
rating1 rating2 review_id1 review_id2
0 5.0 4.0 4873356 4973356
1 5.0 3.0 4635892 4645839
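The one-liner is dense; a spelled-out sketch of the same transformation makes the steps explicit:
import pandas as pd

def number_keys(reviews):
    # Merge a list of dicts into one dict, suffixing each key with its
    # 1-based position, e.g. {'review_id1': ..., 'rating1': ..., 'review_id2': ...}.
    merged = {}
    for i, review in enumerate(reviews, 1):
        for key, value in review.items():
            merged[key + str(i)] = value
    return merged

print(pd.DataFrame(s.apply(number_keys).tolist()))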

This type of data munging tends to be manual.
# Sample data.
df = pd.DataFrame({
    'json_data': [
        [{'review_id': 4873356, 'rating': '5.0'}, {'review_id': 4973356, 'rating': '4.0'}],
        [{'review_id': 4635892, 'rating': '5.0'}, {'review_id': 4645839, 'rating': '3.0'}],
    ]
})
# Data transformation:
# Step 1: Temporary dataframe that splits the data from `df` into two
# columns, one per review position.
df2 = pd.DataFrame(df['json_data'].tolist())
# Step 2: Use a list comprehension to concatenate the records from each column so that the df now has 4 columns.
df2 = pd.concat([pd.DataFrame.from_records(df2[col]) for col in df2], axis=1)
# Step 3: Rename final columns
df2.columns = ['review_id_1', 'rating_1', 'review_id_2', 'rating_2']
>>> df2
   review_id_1 rating_1  review_id_2 rating_2
0      4873356      5.0      4973356      4.0
1      4635892      5.0      4645839      3.0
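If rows may hold different numbers of reviews, a more general sketch (not part of the answer above) is to explode the lists, flatten the dicts, and unstack by position:
import pandas as pd

# Sketch: handles any number of reviews per row.
s = df['json_data'].explode()               # one review dict per row
flat = pd.DataFrame(s.tolist())             # columns: review_id, rating
flat.index = s.index                        # keep the original row labels
flat['pos'] = flat.groupby(level=0).cumcount() + 1
wide = flat.set_index('pos', append=True).unstack('pos')
wide.columns = [f'{field}_{pos}' for field, pos in wide.columns]
print(wide)   # review_id_1, review_id_2, rating_1, rating_2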

Related

Create new dataframe columns based on lists of indices in a column and another dictionary

Given the following dataframe and list of dictionaries:
import pandas as pd
import numpy as np
df = pd.DataFrame.from_dict([
    {'id': '912SAFD', 'key': 3, 'list_index': [0]},
    {'id': '812SAFD', 'key': 4, 'list_index': [0, 1]},
    {'id': '712SAFD', 'key': 5, 'list_index': [2]}])
designs = [{'designs': [{'color_id': 609090, 'value': 'b', 'lang': ''}]},
           {'designs': [{'color_id': 609091, 'value': 'c', 'lang': ''}]},
           {'designs': [{'color_id': 609092, 'value': 'd', 'lang': 'fr'}]}]
Dataframe output:
id key list_index
0 912SAFD 3 [0]
1 812SAFD 4 [0, 1]
2 712SAFD 5 [2]
Without using explicit loops (if possible), is it feasible to go through the lists in 'list_index' for each row, use their values to index into the list of dictionaries, and then create new columns from the values in those dictionaries?
Here is an example of the expected result:
id key list_index 609090 609091 609092 609092_lang
0 912SAFD 3 [0] b NaN NaN NaN
1 812SAFD 4 [0, 1] b c NaN NaN
2 712SAFD 5 [2] NaN NaN d fr
If 'lang' is not empty, it should be added as a column to the dataframe by using the color_id value combined with an underscore and its own name as the column name. For example: 609092_lang.
Any help would be much appreciated.
# This gets the inner dictionaries and makes a tidy dataframe from them.
designs = [info for design in designs for info in design['designs']]
df_designs = pd.DataFrame(designs)
# Use the naming scheme the question asked for, e.g. 609092_lang.
df_designs['lang_code'] = df_designs['color_id'].astype(str) + '_lang'
df_designs['lang'] = df_designs.lang.replace('', np.nan)
df = df.explode('list_index').merge(df_designs, left_on='list_index', right_index=True)
df_color = df.pivot(index=['id', 'key'], columns=['color_id'], values='value')
df_lang = df.pivot(index=['id', 'key'], columns=['lang_code'], values='lang')
df = df_color.join(df_lang).reset_index().dropna(how='all', axis=1)
print(df)
Output:
>>>
        id  key 609090 609091 609092 609092_lang
0  712SAFD    5    NaN    NaN      d          fr
1  812SAFD    4      b      c    NaN         NaN
2  912SAFD    3      b    NaN    NaN         NaN
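Note that passing lists for index/columns to DataFrame.pivot requires pandas >= 1.1. On older versions, an equivalent sketch (under that assumption) uses set_index plus unstack:
# Equivalent for pandas < 1.1, where DataFrame.pivot does not accept lists.
df_color = df.set_index(['id', 'key', 'color_id'])['value'].unstack('color_id')
df_lang = df.set_index(['id', 'key', 'lang_code'])['lang'].unstack('lang_code')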
Alternatively, if you can work with a MultiIndex dataframe instead of naming the columns, it is simpler:
# This gets the inner dictionaries and makes a tidy dataframe from them.
designs = [info for design in designs for info in design['designs']]
df_designs = pd.DataFrame(designs)
df_designs['lang'] = df_designs.lang.replace('', np.nan)
df = df.explode('list_index').merge(df_designs, left_on='list_index', right_index=True)
df = df.pivot(index=['id', 'key'], columns=['color_id'], values=['value', 'lang']).dropna(how='all', axis=1).reset_index()
print(df)
Output:
>>>
              id key  value               lang
color_id             609090 609091 609092 609092
0        712SAFD   5    NaN    NaN      d     fr
1        812SAFD   4      b      c    NaN    NaN
2        912SAFD   3      b    NaN    NaN    NaN
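If you later need the flat names the question asked for, the MultiIndex columns can be collapsed afterwards. A minimal sketch (flat_name is a hypothetical helper, not part of the answer above):
# Collapse the MultiIndex columns into the question's flat naming scheme.
def flat_name(col):
    kind, color_id = col            # e.g. ('value', 609090) or ('id', '')
    if kind in ('id', 'key'):
        return kind                 # index columns restored by reset_index()
    if kind == 'value':
        return str(color_id)        # e.g. '609090'
    return f'{color_id}_lang'       # e.g. '609092_lang'

df.columns = [flat_name(c) for c in df.columns]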

Group latest values in pandas columns for a given id

I have a pandas dataframe containing some metrics for a given date and user.
>>> pd.DataFrame({"user": ['juan','juan','juan','gonzalo'], "date": [1, 2, 3, 1], "var1": [1, 2, None, 1], "var2": [None, 4, 5, 6]})
user date var1 var2
0 juan 1 1.0 NaN
1 juan 2 2.0 4.0
2 juan 3 NaN 5.0
3 gonzalo 1 1.0 6.0
Now, for each user, I want to extract the 2 most recent values for each variable (var1, var2), ignoring NaN unless there aren't enough values to fill the data.
For reference, this should be the resulting dataframe for the data depicted above
user var1_0 var1_1 var2_0 var2_1
juan 2.0 1.0 5.0 4.0
gonzalo 1.0 NaN 6.0 NaN
each "historical" value is added as a new column with a _0 or _1 suffix.
First, sort by both columns with DataFrame.sort_values if necessary, reshape with DataFrame.melt and remove the missing values, keep the 2 most recent rows per group with GroupBy.tail, then create a counter column with GroupBy.cumcount, pivot with DataFrame.pivot, and flatten the MultiIndex:
df1 = (df.sort_values(['user','date'])
         .melt(id_vars='user', value_vars=['var1','var2'])
         .dropna(subset=['value'])
       )
df1 = df1.groupby(['user','variable']).tail(2)  # tail -> the 2 most recent rows
df1['g'] = df1.groupby(['user','variable']).cumcount(ascending=False)
df1 = df1.pivot(index='user', columns=['variable', 'g'], values='value')
# For older pandas versions:
#df1 = df1.set_index(['user','variable', 'g'])['value'].unstack([1,2])
df1.columns = df1.columns.map(lambda x: f'{x[0]}_{x[1]}')
df1 = df1.reset_index()
print (df1)
user var1_0 var1_1 var2_0 var2_1
0 gonzalo 1.0 NaN 6.0 NaN
1 juan 2.0 1.0 5.0 4.0
You could group by user and aggregate to get the 2 most recent values. That gets you almost all the way there, but instead of two columns you have a list of elements. If you want the actual 2 columns, you have to split the newly created list into columns. Full code:
import pandas as pd
import numpy as np
df = pd.DataFrame(
    {
        "user": ["juan", "juan", "juan", "gonzalo"],
        "date": [1, 2, 3, 1],
        "var1": [1, 2, None, 1],
        "var2": [None, 4, 5, 6],
    }
)
# This almost gets you there: tail(2) keeps the 2 most recent non-null values.
df = (
    df.sort_values(by="date")
    .groupby("user")
    .agg({"var1": lambda x: x.dropna().tail(2), "var2": lambda x: x.dropna().tail(2)})
)
# Split the columns and get the correct column names; the lists are in date
# order, so the last element is the most recent one.
df[["var1_0", "var2_0"]] = df[["var1", "var2"]].apply(
    lambda row: pd.Series(el[-1] if isinstance(el, np.ndarray) else el for el in row),
    axis=1,
)
df[["var1_1", "var2_1"]] = df[["var1", "var2"]].apply(
    lambda row: pd.Series(el[0] if isinstance(el, np.ndarray) else None for el in row),
    axis=1,
)
print(df)
>>>
               var1        var2  var1_0  var2_0  var1_1  var2_1
user
gonzalo         1.0         6.0     1.0     6.0     NaN     NaN
juan     [1.0, 2.0]  [4.0, 5.0]     2.0     5.0     1.0     4.0
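As a cross-check, a small helper can build the same columns directly (a sketch, not from either answer; last_two is a hypothetical name, and df is the original frame from the question):
import numpy as np
import pandas as pd

# Return the two most recent non-null values, newest first, padded with NaN.
def last_two(series):
    vals = series.dropna().tolist()[::-1][:2]
    return vals + [np.nan] * (2 - len(vals))

out = (df.sort_values('date')
         .groupby('user')
         .apply(lambda g: pd.Series({f'{c}_{i}': v
                                     for c in ('var1', 'var2')
                                     for i, v in enumerate(last_two(g[c]))})))
print(out.reset_index())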

Squeezing pandas DataFrame to have non-null values and modify column names

I have the following sample DataFrame
import numpy as np
import pandas as pd
df = pd.DataFrame({'Tom': [2, np.nan, np.nan],
                   'Ron': [np.nan, 5, np.nan],
                   'Jim': [np.nan, np.nan, 6],
                   'Mat': [7, np.nan, np.nan]},
                  index=['Min', 'Max', 'Avg'])
that looks like this, where each column has only one non-null value:
Tom Ron Jim Mat
Min 2.0 NaN NaN 7.0
Max NaN 5.0 NaN NaN
Avg NaN NaN 6.0 NaN
Desired Outcome
For each column, I want to have the non-null value and then append the index of the corresponding non-null value to the name of the column. So the final result should look like this
Tom_Min Ron_Max Jim_Avg Mat_Min
0 2.0 5.0 6.0 7.0
My attempt
Using list comprehensions: find the non-null value, append the corresponding index to the column name, and then create a new DataFrame:
values = [df[col][~pd.isna(df[col])].values[0] for col in df.columns]
# [2.0, 5.0, 6.0, 7.0]
new_cols = [col + '_{}'.format(df[col][~pd.isna(df[col])].index[0]) for col in df.columns]
# ['Tom_Min', 'Ron_Max', 'Jim_Avg', 'Mat_Min']
df_new = pd.DataFrame([values], columns=new_cols)
My question
Is there some in-built functionality in pandas which can do this without using for loops and list comprehensions?
If there is only one non-missing value per column, you can use DataFrame.stack, convert the resulting Series to a one-row DataFrame, and then flatten the MultiIndex; for the correct column order, use DataFrame.swaplevel with DataFrame.reindex:
df = df.stack().to_frame().T.swaplevel(1,0, axis=1).reindex(df.columns, level=0, axis=1)
df.columns = df.columns.map('_'.join)
print (df)
Tom_Min Ron_Max Jim_Avg Mat_Min
0 2.0 5.0 6.0 7.0
Use:
s = df.T.stack()
s.index = s.index.map('_'.join)
df = s.to_frame().T
Result:
# print(df)
Tom_Min Ron_Max Jim_Avg Mat_Min
0 2.0 5.0 6.0 7.0

Converting array of arrays into flattened dataframe

Got a pandas dataframe with the below structure
0 [{'review_id': 4873356, 'rating': '5.0'}, {'review_id': 4973356, 'rating': '4.0'}]
1 [{'review_id': 4635892, 'rating': '5.0'}, {'review_id': 4645839, 'rating': '3.0'}]
....
....
I would like to flatten into a dataframe with the following columns review_id and rating
I was trying out pd.DataFrame(df1.values.flatten()) but it looks like I'm getting something basic wrong. Need help!
You wind up with an array of lists of dicts, so you need:
import pandas as pd
pd.DataFrame([x for y in df1.values for x in y])
rating review_id
0 5.0 4873356
1 4.0 4973356
2 5.0 4635892
3 3.0 4645839
Or, if you're willing to use itertools:
from itertools import chain
pd.DataFrame(chain.from_iterable(df1.values.ravel()))
First do the unnesting, then rebuild your dataframe (assuming your list column is named 0):
pd.DataFrame(unnesting(df1, [0])[0].values.tolist())
Out[61]:
rating review_id
0 5.0 4873356
1 4.0 4973356
2 5.0 4635892
3 3.0 4645839
def unnesting(df, explode):
    # Repeat each index label once per list element (requires numpy as np).
    idx = df.index.repeat(df[explode[0]].str.len())
    df1 = pd.concat([pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    df1.index = idx
    return df1.join(df.drop(columns=explode), how='left')
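On pandas >= 0.25 the same unnesting can be done with the built-in Series.explode, so the helper is no longer needed (a sketch, assuming the list column is named 0):
# Series.explode replaces the custom helper on pandas >= 0.25.
flat = pd.DataFrame(df1[0].explode().tolist())
print(flat)  # same review_id / rating rows as above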

Combine two pandas DataFrames into one new DataFrame

I have two pandas DataFrames whose data comes from different sources, but both DataFrames have the same column names. When combined naively, only one column would keep its name.
Like this:
speed_df = pd.DataFrame.from_dict({
    'ts': [0, 1, 3, 4],
    'val': [5, 4, 2, 1]
})
temp_df = pd.DataFrame.from_dict({
    'ts': [0, 1, 2],
    'val': [9, 8, 7]
})
And I need to have a result like this:
final_df = pd.DataFrame.from_dict({
    'ts': [0, 1, 2, 3, 4],
    'speed': [5, 4, NaN, 2, 1],
    'temp': [9, 8, 7, NaN, NaN]
})
Later I will deal with the empty cells (here shown as NaN) by carrying forward the previous valid value, to get something like this:
final_df = pd.DataFrame.from_dict({
    'ts': [0, 1, 2, 3, 4],
    'speed': [5, 4, 4, 2, 1],
    'temp': [9, 8, 7, 7, 7]
})
Use pd.merge
In [406]: (pd.merge(speed_df, temp_df, how='outer', on='ts')
.rename(columns={'val_x': 'speed','val_y': 'temp'})
.sort_values(by='ts'))
Out[406]:
ts speed temp
0 0 5.0 9.0
1 1 4.0 8.0
4 2 NaN 7.0
2 3 2.0 NaN
3 4 1.0 NaN
In [407]: (pd.merge(speed_df, temp_df, how='outer', on='ts')
.rename(columns={'val_x': 'speed', 'val_y': 'temp'})
.sort_values(by='ts').ffill())
Out[407]:
ts speed temp
0 0 5.0 9.0
1 1 4.0 8.0
4 2 4.0 7.0
2 3 2.0 7.0
3 4 1.0 7.0
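A variant (a sketch, not from the answer above) avoids the val_x/val_y renaming step by renaming each frame before the merge:
final = (speed_df.rename(columns={'val': 'speed'})
         .merge(temp_df.rename(columns={'val': 'temp'}), on='ts', how='outer')
         .sort_values('ts')
         .ffill())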
Two main DataFrame operations are needed: pd.merge and a forward fill. Here is the code:
df = speed_df.merge(temp_df, how='outer', on='ts')
df = df.rename(columns=dict(val_x='speed', val_y='temp'))
df = df.sort_values('ts')
df = df.ffill()  # fillna(method='ffill') is deprecated in newer pandas
Hope this helps.
You need to do an outer join using the pandas.merge function:
d = pd.merge(speed_df, temp_df, on='ts', how='outer').rename(
    columns={'val_x': 'speed', 'val_y': 'temp'})
d = d.sort_values('ts')
# Forward-fill rather than hardcoding the fill values, so the code
# generalizes beyond this sample data.
d = d.ffill()
That should return you this:
   ts  speed  temp
0   0    5.0   9.0
1   1    4.0   8.0
4   2    4.0   7.0
2   3    2.0   7.0
3   4    1.0   7.0
