I've got a pandas DataFrame with the structure below:
0 [{'review_id': 4873356, 'rating': '5.0'}, {'review_id': 4973356, 'rating': '4.0'}]
1 [{'review_id': 4635892, 'rating': '5.0'}, {'review_id': 4645839, 'rating': '3.0'}]
....
....
I would like to flatten it into a DataFrame with the columns review_id and rating.
I tried pd.DataFrame(df1.values.flatten()), but it looks like I'm getting something basic wrong. Any help appreciated!
You wind up with an array of lists of dicts, so you need to flatten one level first:
import pandas as pd
pd.DataFrame([x for y in df1.values for x in y])
rating review_id
0 5.0 4873356
1 4.0 4973356
2 5.0 4635892
3 3.0 4645839
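In the nested comprehension, for y in df1.values walks the array of lists (one list of dicts per row) and for x in y yields the individual dicts, so pd.DataFrame receives a flat list of records.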
Or, if you're willing to use itertools:
from itertools import chain
pd.DataFrame(chain.from_iterable(df1.values.ravel()))
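Here ravel() first flattens the cell values into a one-dimensional array, so this also works if df1 is a single-column DataFrame rather than a Series; chain.from_iterable then concatenates the inner lists lazily instead of materializing an intermediate list.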
First unnest, then rebuild your DataFrame (assuming your column is named 0):
pd.DataFrame(unnesting(df,[0])[0].values.tolist())
Out[61]:
rating review_id
0 5.0 4873356
1 4.0 4973356
2 5.0 4635892
3 3.0 4645839
import numpy as np

def unnesting(df, explode):
    # repeat each index label once per element of the list it holds
    idx = df.index.repeat(df[explode[0]].str.len())
    # flatten each exploded column into one long column, then line them up
    df1 = pd.concat([pd.DataFrame({x: np.concatenate(df[x].values)})
                     for x in explode], axis=1)
    df1.index = idx
    # reattach the remaining columns
    return df1.join(df.drop(explode, axis=1), how='left')
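As a side note, on pandas 0.25+ the same unnesting is built in via Series.explode. A minimal sketch, under this answer's assumption that the column is named 0:

s = df[0].explode()              # one dict per row, index repeated per original row
flat = pd.DataFrame(s.tolist())  # expand each dict into review_id / rating columns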
I would like to know whether it's possible to combine rows that have NaN values in specific columns. The row order can change; my idea was to combine rows where Name is duplicated.
import pandas as pd
import numpy as np
d = {'Name': ['Jacque','Paul', 'Jacque'], 'City': [np.nan, '4', '10'], 'Birthday' : ['1','2',np.nan]}
df = pd.DataFrame(data=d)
df
And I would like to have this output:
Check with sorted: sorting each column with key=pd.isnull pushes that column's NaNs to the bottom, and dropna() then drops the trailing rows that still contain NaN. Note that this realigns values by position within each column, not by Name.
out = df.apply(lambda x: sorted(x, key=pd.isnull)).dropna()
Name City Birthday
0 Jacque 4.0 1.0
1 Paul 10.0 2.0
I have a pandas DataFrame as shown below. I want to identify the index values of the columns in df that match a given string (more specifically, a string that matches the column names after 'sim-' or 'act-').
# Sample df
import pandas as pd
df = pd.DataFrame({
'sim-prod1': [1, 1.4],
'sim-prod2': [2, 2.1],
'act-prod1': [1.1, 1],
'act-prod2': [2.5, 2]
})
# Get unique prod values from df.columns
prods = pd.Series(df.columns[1:]).str[4:].unique()
prods
array(['prod2', 'prod1'], dtype=object)
I now want to loop through prods and identify the columns where prod1 and prod2 occur, and then use those columns to create new dataframes. How can I do this? In R I could use the which function to do this easily. Example dataframes I want to obtain are below.
df_prod1
sim_prod1 act_prod1
0 1.0 1.1
1 1.4 1.0
df_prod2
sim_prod2 act_prod2
0 2.0 2.5
1 2.1 2.0
Try groupby with axis=1, grouping on the suffix of each column name (str[4:] strips the four-character 'sim-'/'act-' prefix):
for prod, d in df.groupby(df.columns.str[4:], axis=1):
    print(f'this is {prod}')
    print(d)
    print('='*20)
Output:
this is prod1
sim-prod1 act-prod1
0 1.0 1.1
1 1.4 1.0
====================
this is prod2
sim-prod2 act-prod2
0 2.0 2.5
1 2.1 2.0
====================
Now, to have them as variables:
dfs = {prod: d for prod, d in df.groupby(df.columns.str[4:], axis=1)}
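Each sub-frame is then available by key, e.g. dfs['prod1'].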
Try this, storing the parts of the dataframe as a dictionary:
df_dict = dict(tuple(df.groupby(df.columns.str[4:], axis=1)))
print(df_dict['prod1'])
print('\n')
print(df_dict['prod2'])
Output:
sim-prod1 act-prod1
0 1.0 1.1
1 1.4 1.0
sim-prod2 act-prod2
0 2.0 2.5
1 2.1 2.0
You can also do this without groupby() or a for loop (recall from above that prods[0] is 'prod2' and prods[1] is 'prod1'):
df_prod2=df[df.columns[df.columns.str.contains(prods[0])]]
df_prod1=df[df.columns[df.columns.str.contains(prods[1])]]
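DataFrame.filter can do the same substring matching in one call; a minimal sketch on the sample df:

df_prod1 = df.filter(like='prod1')  # keeps sim-prod1 and act-prod1
df_prod2 = df.filter(like='prod2')  # keeps sim-prod2 and act-prod2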
I have a pandas DataFrame:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':[1,5,3],
'B': [4,2,6]})
df['avg'] = df.mean(axis=1)
df[df<df['avg']]
I would like to keep all the values in the DataFrame that are below the average value in column df['avg']. When I perform the operation below, I get back all NaNs:
df[df<df['avg']]
If I set up a for loop, I can get the boolean mask I want:
col_names = ['A', 'B']
for colname in col_names:
    df[colname] = df[colname] < df['avg']
What I am searching for would look like this:
df_desired = pd.DataFrame({
    'A': [1, np.nan, 3],
    'B': [np.nan, 2, np.nan],
    'avg': [2.5, 3.5, 4.5]
})
How do I do this? There has to be a Pythonic way.
You can use .mask(..) [pandas-doc] here. NumPy broadcasting builds a boolean array that is True wherever a value exceeds its row average, and .mask replaces those entries with NaN:
>>> df.mask(df.values > df['avg'].values[:,None])
A B avg
0 1.0 NaN 2.5
1 NaN 2.0 3.5
2 3.0 NaN 4.5
I think this is somewhat more idiomatic, and clearer, than the accepted solution:
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [1, 5, 3],
'B': [4, 2, 6]})
print(df)
df['avg'] = df.mean(axis=1)
print(df)
df[df[['A', 'B']].ge(df['avg'], axis=0)] = np.nan
print(df)
Output:
A B
0 1 4
1 5 2
2 3 6
A B avg
0 1 4 2.5
1 5 2 3.5
2 3 6 4.5
A B avg
0 1.0 NaN 2.5
1 NaN 2.0 3.5
2 3.0 NaN 4.5
Speaking of the accepted solution, using .values to convert a pandas DataFrame or Series to a NumPy array is no longer recommended (.to_numpy() is preferred). Fortunately, we don't actually need a NumPy conversion here at all: DataFrame.gt with axis=0 aligns the comparison along the index, so the condition stays in pandas:
df.mask(df.gt(df['avg'], axis=0))
0 [{'review_id': 4873356, 'rating': '5.0'}, {'review_id': 4973356, 'rating': '4.0'}]
1 [{'review_id': 4635892, 'rating': '5.0'}, {'review_id': 4645839, 'rating': '3.0'}]
I have a situation where I want to flatten JSON like the above, as solved here: Converting array of arrays into flattened dataframe
But I want to create new columns so that the output is:
review_id_1 rating_1 review_id_2 rating_2
4873356 5.0 4973356 4.0
4635892 5.0 4645839 3.0
Any help is highly appreciated.
Try using:
# s is the Series holding the lists of dicts
print(pd.DataFrame(s.apply(
    lambda row: {key + str(i): value
                 for i, d in enumerate(row, 1)
                 for key, value in d.items()}
).tolist()))
Output:
rating1 rating2 review_id1 review_id2
0 5.0 4.0 4873356 4973356
1 5.0 3.0 4635892 4645839
This type of data munging tends to be manual.
# Sample data.
df = pd.DataFrame({
'json_data': [
[{'review_id': 4873356, 'rating': '5.0'}, {'review_id': 4973356, 'rating': '4.0'}],
[{'review_id': 4635892, 'rating': '5.0'}, {'review_id': 4645839, 'rating': '3.0'}],
]
})
# Data transformation:
# Step 1: Temporary dataframe with one column per review position within each row.
df2 = pd.DataFrame(df['json_data'].tolist())
# Step 2: Use a list comprehension to concatenate the records from each column so that the df now has 4 columns.
df2 = pd.concat([pd.DataFrame.from_records(df2[col]) for col in df2], axis=1)
# Step 3: Rename final columns
df2.columns = ['review_id_1', 'rating_1', 'review_id_2', 'rating_2']
>>> df2
review_id_1 rating_1 review_id_2 rating_2
0 4873356 5.0 4973356 4.0
1 4635892 5.0 4645839 3.0
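The same result can also be written as a comprehension over the two positions; a sketch on the same sample df, assuming every row holds exactly two reviews:

parts = [
    pd.DataFrame(df['json_data'].str[i].tolist()).add_suffix(f'_{i + 1}')
    for i in range(2)  # two reviews per row in this sample
]
df_wide = pd.concat(parts, axis=1)
# one review_id_i / rating_i pair per position (column order may vary by pandas version)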
I have two Pandas DataFrames whose data comes from different sources, but both DataFrames have the same column names. When they are combined, the two val columns need to end up with distinct names, like this:
speed_df = pd.DataFrame.from_dict({
'ts': [0,1,3,4],
'val': [5,4,2,1]
})
temp_df = pd.DataFrame.from_dict({
'ts': [0,1,2],
'val': [9,8,7]
})
And I need to have a result like this:
final_df = pd.DataFrame.from_dict({
    'ts': [0, 1, 2, 3, 4],
    'speed': [5, 4, np.nan, 2, 1],
    'temp': [9, 8, 7, np.nan, np.nan]
})
Later I will deal with the empty cells (here filled with NaN) by forward-filling the previous valid value, to get something like this:
final_df = pd.DataFrame.from_dict({
    'ts': [0, 1, 2, 3, 4],
    'speed': [5, 4, 4, 2, 1],
    'temp': [9, 8, 7, 7, 7]
})
Use pd.merge
In [406]: (pd.merge(speed_df, temp_df, how='outer', on='ts')
.rename(columns={'val_x': 'speed','val_y': 'temp'})
.sort_values(by='ts'))
Out[406]:
ts speed temp
0 0 5.0 9.0
1 1 4.0 8.0
4 2 NaN 7.0
2 3 2.0 NaN
3 4 1.0 NaN
In [407]: (pd.merge(speed_df, temp_df, how='outer', on='ts')
.rename(columns={'val_x': 'speed', 'val_y': 'temp'})
.sort_values(by='ts').ffill())
Out[407]:
ts speed temp
0 0 5.0 9.0
1 1 4.0 8.0
4 2 4.0 7.0
2 3 2.0 7.0
3 4 1.0 7.0
Two operations do the main work here: pd.merge for the outer join and a forward fill for the gaps. Here is the code:
df = speed_df.merge(temp_df, how='outer', on='ts')
df = df.rename(columns=dict(val_x='speed', val_y='temp'))
df = df.sort_values('ts')
df = df.ffill()
Hope this helps.
You can do a full outer join using the pandas.merge function:
d = pd.merge(speed_df, temp_df, on='ts', how='outer').rename(
    columns={'val_x': 'speed', 'val_y': 'temp'})
d = d.sort_values('ts')
# forward-fill the gaps with the previous valid value
d['speed'] = d['speed'].ffill()
d['temp'] = d['temp'].ffill()
That should give you the same filled result as shown above.