I have a pandas dataframe containing some metrics for a given date and user.
>>> import pandas as pd
>>> df = pd.DataFrame({"user": ['juan','juan','juan','gonzalo'], "date": [1, 2, 3, 1], "var1": [1, 2, None, 1], "var2": [None, 4, 5, 6]})
>>> df
user date var1 var2
0 juan 1 1.0 NaN
1 juan 2 2.0 4.0
2 juan 3 NaN 5.0
3 gonzalo 1 1.0 6.0
Now, for each user, I want to extract the 2 most recent values for each variable (var1, var2), ignoring NaN unless there aren't enough values to fill the data.
For reference, this should be the resulting dataframe for the data depicted above:
user var1_0 var1_1 var2_0 var2_1
juan 2.0 1.0 5.0 4.0
gonzalo 1.0 NaN 6.0 NaN
each "historical" value is added as a new column with a _0 or _1 suffix.
First sort by both columns with DataFrame.sort_values if necessary, reshape with DataFrame.melt and remove missing values, filter the 2 most recent rows per group with GroupBy.tail, then create a counter column with GroupBy.cumcount (ascending=False, so the most recent value gets counter 0) and pivot with DataFrame.pivot, flattening the MultiIndex:
df1 = (df.sort_values(['user','date'])
.melt(id_vars='user', value_vars=['var1','var2'])
.dropna(subset=['value'])
)
df1 = df1.groupby(['user','variable']).tail(2)
df1['g'] = df1.groupby(['user','variable']).cumcount(ascending=False)
df1 = df1.pivot(index='user', columns=['variable', 'g'], values='value')
# for older pandas versions
#df1 = df1.set_index(['user','variable', 'g'])['value'].unstack([1,2])
df1.columns = df1.columns.map(lambda x: f'{x[0]}_{x[1]}')
df1 = df1.reset_index()
print (df1)
user var1_0 var1_1 var2_0 var2_1
0 gonzalo 1.0 NaN 6.0 NaN
1 juan 2.0 1.0 5.0 4.0
You could group by user and aggregate to get the 2 most recent values. That gets almost all the way there, but instead of separate columns you have a list of elements in each cell. If you want the actual 2 columns, you have to split the newly created list into columns. Full code:
import pandas as pd
import numpy as np
df = pd.DataFrame(
{
"user": ["juan", "juan", "juan", "gonzalo"],
"date": [1, 2, 3, 1],
"var1": [1, 2, None, 1],
"var2": [None, 4, 5, 6],
}
)
# This almost gets you there
df = (
df.sort_values(by="date")
.groupby("user")
.agg({"var1": lambda x: x.dropna().head(2), "var2": lambda x: x.dropna().head(2)})
)
# Split the columns and get the correct column names
df[["var1_0", "var2_0"]] = df[["var1", "var2"]].apply(
lambda row: pd.Series(el[0] if isinstance(el, np.ndarray) else el for el in row),
axis=1,
)
df[["var1_1", "var2_1"]] = df[["var1", "var2"]].apply(
lambda row: pd.Series(el[-1] if isinstance(el, np.ndarray) else None for el in row),
axis=1,
)
print(df)
>>
var1 var2 var1_0 var2_0 var1_1 var2_1
user
gonzalo 1.0 6.0 1.0 6.0 NaN NaN
juan [1.0, 2.0] [4.0, 5.0] 2.0 5.0 1.0 4.0
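As an alternative sketch for the splitting step (not from the answer above; it assumes aggregating into plain Python lists, most recent first, so that pd.DataFrame can expand them positionally and pad the shorter ones with NaN):
import pandas as pd

df = pd.DataFrame({"user": ["juan", "juan", "juan", "gonzalo"],
                   "date": [1, 2, 3, 1],
                   "var1": [1, 2, None, 1],
                   "var2": [None, 4, 5, 6]})

# aggregate each variable into a list of its (up to) two most recent
# non-NaN values, most recent first, so every cell is a plain list
recent = (df.sort_values("date")
            .groupby("user")
            .agg({"var1": lambda x: x.dropna().tail(2).tolist()[::-1],
                  "var2": lambda x: x.dropna().tail(2).tolist()[::-1]}))

# expand each list column into positional columns;
# pd.DataFrame pads the shorter lists with NaN automatically
parts = []
for col in ["var1", "var2"]:
    expanded = pd.DataFrame(recent[col].tolist(), index=recent.index)
    expanded.columns = [f"{col}_{i}" for i in expanded.columns]
    parts.append(expanded)

print(pd.concat(parts, axis=1).reset_index())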
Hello, I have the following dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame(data={'grade_1':['A','B','C'],
'grade_1_count': [19,28,32],
'grade_2': ['pass','fail',np.nan],
'grade_2_count': [39,18, np.nan]})
whereby some grades are missing and need to be inserted into the grade_n column according to the values in this dictionary:
grade_dict = {'grade_1':['A','B','C','D','E','F'],
'grade_2' : ['pass','fail','not present', 'borderline']}
and the corresponding row value in the _count column should be filled with 0 (with np.nan only where the grade column itself is empty).
So the expected output is like this:
expected_df = pd.DataFrame(data={'grade_1':['A','B','C','D','E','F'],
'grade_1_count': [19,28,32,0,0,0],
'grade_2': ['pass','fail','not present','borderline', np.nan, np.nan],
'grade_2_count': [39,18,0,0,np.nan,np.nan]})
So far I have this rather inelegant code that creates a column including all the correct categories for the grades, but I cannot reinsert it into the dataframe or fill the count columns with zeros (the np.nans just reflect empty cells due to coercing columns with different row lengths). I hope that makes sense. Any advice would be great. Thanks.
x=[]
for k, v in grade_dict.items():
out = df[k].reindex(grade_dict[k], axis=0, fill_value=0)
x = pd.concat([out], axis=1)
x[k] = x.index
x = x.reset_index(drop=True)
df[k] = x.fillna(np.nan)
Here is a solution using two consecutive merges. Note that df.filter(like='grade_1') selects both the grade_1 and grade_1_count columns, so each merge carries the count column along:
# set up combinations
from itertools import zip_longest
df2 = pd.DataFrame(list(zip_longest(*grade_dict.values())), columns=grade_dict)
# merge
(df2.merge(df.filter(like='grade_1'),
on='grade_1', how='left')
.merge(df.filter(like='grade_2'),
on='grade_2', how='left')
.sort_index(axis=1)
)
output:
grade_1 grade_1_count grade_2 grade_2_count
0 A 19.0 pass 39.0
1 B 28.0 fail 18.0
2 C 32.0 not present NaN
3 D NaN borderline NaN
4 E NaN None NaN
5 F NaN None NaN
Or, generalized as multiple merges in a loop:
df2 = pd.DataFrame(list(zip_longest(*grade_dict.values())), columns=grade_dict)
for col in grade_dict:
df2 = df2.merge(df.filter(like=col),
on=col, how='left')
df2
If you only need to merge on grade_1 without updating the non-NaN values of grade_2, you can cast grade_dict into a DataFrame and then use combine_first, which aligns on the grade_1 index and only fills missing cells:
print (df.set_index("grade_1").combine_first(pd.DataFrame(grade_dict.values(),
index=grade_dict.keys()).T.set_index("grade_1"))
.fillna({"grade_1_count": 0}).reset_index())
grade_1 grade_1_count grade_2 grade_2_count
0 A 19.0 pass 39.0
1 B 28.0 fail 18.0
2 C 32.0 not present NaN
3 D 0.0 borderline NaN
4 E 0.0 None NaN
5 F 0.0 None NaN
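For reference, a minimal sketch of the helper frame that the one-liner above builds from grade_dict (list() is used here for safety when passing dict views to the constructor):
import pandas as pd

grade_dict = {'grade_1': ['A', 'B', 'C', 'D', 'E', 'F'],
              'grade_2': ['pass', 'fail', 'not present', 'borderline']}

# rows of unequal length are padded with missing values, so after
# transposing, grade_2 ends with two empty entries for the E and F rows
helper = pd.DataFrame(list(grade_dict.values()), index=list(grade_dict.keys())).T
print(helper)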
I'm working with pandas data frames, and I've run into an error message that I don't understand.
In this toy example, I've got a data frame called df with a number of columns ('var1_1', 'var1_2', 'var1_3', 'var2_1', 'var2_2', 'var3'), a list called root_names with a few elements ('var2', 'var3', 'var1'), and an empty list called df_list.
I want to loop over root_names, in such a way that when, e.g., the value from root_names is var2, I create a new data frame with the df columns var2_1 and var2_2, and finally append the new dataframe to df_list.
When I run the code I get the following error message: KeyError: "None of [Index([('var2_1', 'var2_2')], dtype='object')] are in the [columns]".
# TOY DATASET
import re
import numpy as np
import pandas as pd

cars = {'var1_1': [1, np.nan, np.nan, np.nan],
'var1_2': [np.nan, 1, 1, np.nan],
'var1_3': [np.nan, np.nan, 1, np.nan],
'var2_1': [1, np.nan, 1, np.nan],
'var2_2': [np.nan, 1, 1, np.nan],
'var3': [1, np.nan, 1, 1]
}
df = pd.DataFrame(cars, columns = ['var1_1', 'var1_2', 'var1_3', 'var2_1', 'var2_2', 'var3'])
print(df)
var1_1 var1_2 var1_3 var2_1 var2_2 var3
0 1.0 NaN NaN 1.0 NaN 1.0
1 NaN 1.0 NaN NaN 1.0 NaN
2 NaN 1.0 1.0 1.0 1.0 1.0
3 NaN NaN NaN NaN NaN 1.0
# CODE
root_names = ['var2', 'var3', 'var1']
df_list = []
for var in root_names:
match_names = [x for x in list(df) if re.match(var,x)]
temp_df = df[[match_names]]
df_list.append(temp_df)
# ERROR MESSAGE
KeyError: "None of [Index([('var2_1', 'var2_2')], dtype='object')] are in the [columns]"
However, when I use bits of the code to check (see below), the columns seem to be there. Can someone please explain the error message? Thanks!
root_names = ['var2', 'var3', 'var1']
for var in root_names:
match_names = [x for x in list(df) if re.match(var,x)]
print(match_names)
# Output
['var2_1', 'var2_2']
['var3']
['var1_1', 'var1_2', 'var1_3']
df[['var2_1', 'var2_2']]
# Output
var2_1 var2_2
0 1.0 NaN
1 NaN 1.0
2 1.0 1.0
3 NaN NaN
match_names is already a list; you don't have to enclose it in another pair of brackets.
Replace this,
temp_df = df[[match_names]]
With this
temp_df = df[match_names]
This error indicates that some column is missing from the DataFrame. Try replacing temp_df = df[[match_names]] with temp_df = df[match_names].
You tried to pass a list containing a list of column names, not a list of column names, so pandas looks for a single column named by the tuple ('var2_1', 'var2_2').
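For reference, a minimal corrected version of the loop; DataFrame.filter with a start-anchored regex is an equivalent way to select the matching columns (df is the toy frame above):
import re

root_names = ['var2', 'var3', 'var1']
df_list = []
for var in root_names:
    match_names = [x for x in df.columns if re.match(var, x)]
    df_list.append(df[match_names])  # a flat list of column names

# equivalent selection without building the list manually
df_list_alt = [df.filter(regex=f'^{var}') for var in root_names]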
I would like to fill my first dataframe with data from the second dataframe. Since I don't need any special condition, the combine_first function looks like the right choice for me.
Unfortunately, when I try to combine the two dataframes, the result is still the original dataframe.
My code:
import pandas as pd
df1 = pd.DataFrame({'Gen1': [5, None, 3, 2, 1],
'Gen2': [1, 2, None, 4, 5]})
df2 = pd.DataFrame({'Gen1': [None, 4, None, None, None],
'Gen2': [None, None, 3, None, None]})
df1.combine_first(df2)
Then, when I print(df1), I get df1 exactly as I initialized it on the second line.
Where did I make a mistake?
This works fine if you assign the output back; the very similar method DataFrame.update works in place:
df = df1.combine_first(df2)
print (df)
Gen1 Gen2
0 5.0 1.0
1 4.0 2.0
2 3.0 3.0
3 2.0 4.0
4 1.0 5.0
df1.update(df2)
print (df1)
Gen1 Gen2
0 5.0 1.0
1 4.0 2.0
2 3.0 3.0
3 2.0 4.0
4 1.0 5.0
combine_first returns a new dataframe containing the changes rather than updating the existing dataframe in place, so you need to assign the returned dataframe:
df1=df1.combine_first(df2)
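To illustrate the difference between the two methods with toy values invented for this sketch (combine_first only fills the caller's NaNs, while update overwrites the caller wherever the other frame has non-NaN values):
import numpy as np
import pandas as pd

a = pd.DataFrame({'x': [1.0, np.nan]})
b = pd.DataFrame({'x': [9.0, 2.0]})

print(a.combine_first(b))  # keeps the existing 1.0, fills the NaN -> 1.0, 2.0

c = a.copy()
c.update(b)  # overwrites with b's non-NaN values -> 9.0, 2.0
print(c)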
I have the following sample DataFrame
import numpy as np
import pandas as pd
df = pd.DataFrame({'Tom': [2, np.nan, np.nan],
'Ron': [np.nan, 5, np.nan],
'Jim': [np.nan, np.nan, 6],
'Mat': [7, np.nan, np.nan],},
index=['Min', 'Max', 'Avg'])
that looks like this, where each column has only one non-null value:
Tom Ron Jim Mat
Min 2.0 NaN NaN 7.0
Max NaN 5.0 NaN NaN
Avg NaN NaN 6.0 NaN
Desired Outcome
For each column, I want to have the non-null value and then append the index of the corresponding non-null value to the name of the column. So the final result should look like this
Tom_Min Ron_Max Jim_Avg Mat_Min
0 2.0 5.0 6.0 7.0
My attempt
Using list comprehensions: Find the non-null value, and append the corresponding index to the column name and then create a new DataFrame
values = [df[col][~pd.isna(df[col])].values[0] for col in df.columns]
# [2.0, 5.0, 6.0, 7.0]
new_cols = [col + '_{}'.format(df[col][~pd.isna(df[col])].index[0]) for col in df.columns]
# ['Tom_Min', 'Ron_Max', 'Jim_Avg', 'Mat_Min']
df_new = pd.DataFrame([values], columns=new_cols)
My question
Is there some in-built functionality in pandas which can do this without using for loops and list comprehensions?
If there is only one non-missing value per column, you can use DataFrame.stack, convert the resulting Series to a one-row DataFrame, and then flatten the MultiIndex; for the correct column order, DataFrame.swaplevel is used together with DataFrame.reindex:
df = df.stack().to_frame().T.swaplevel(1,0, axis=1).reindex(df.columns, level=0, axis=1)
df.columns = df.columns.map('_'.join)
print (df)
Tom_Min Ron_Max Jim_Avg Mat_Min
0 2.0 5.0 6.0 7.0
Or transpose first and then stack; the stacked index comes out as (column, row) pairs, so joining with '_' directly yields names like Tom_Min:
s = df.T.stack()
s.index = s.index.map('_'.join)
df = s.to_frame().T
Result:
# print(df)
Tom_Min Ron_Max Jim_Avg Mat_Min
0 2.0 5.0 6.0 7.0
Let's say I have 3 different pandas DataFrames:
>>> import pandas as pd
>>> import numpy as np
>>> df1 = pd.DataFrame({'ID': [20016, 50048, 13478, 68493, 57483],
'Sex': ['F', 'M', 'F', 'F', 'M'],
'Var1': [3, 3, 3, 3, 2],
'Var2': [2, 3, np.nan, 3, 2],
'Var3': [-0.25, 0, 4, np.nan, 0.14]})
>>> df1.set_index('ID')
Sex Var1 Var2 Var3
ID
20016 F 3 2.0 -0.25
50048 M 3 3.0 0.00
13478 F 3 NaN 4.00
68493 F 3 3.0 NaN
57483 M 2 2.0 0.14
The 2nd DF is basically an updated version of DF1: more row entries, possibly other columns, and maybe changed values in some of the existing columns, e.g.
>>> df2 = pd.DataFrame({'PERSID': [20016, 50048, 13478, 68493, 57483, 45623],
'Sex': ['F', 'M', 'F', 'F', 'M', 'M'],
'Var1': [3, 1, 3, 3, 2, np.nan],  # value for 50048 changed from 3 to 1
'Var2': [3, 3, np.nan, 3, 2, 0],  # value for 20016 changed from 2 to 3
'Var3': [-0.25, 0, 4, np.nan, 0.14, 0.28]})
>>> df2.set_index('PERSID')
Sex Var1 Var2 Var3
PERSID
20016 F 3.0 3.0 -0.25
50048 M 1.0 3.0 0.00
13478 F 3.0 NaN 4.00
68493 F 3.0 3.0 NaN
57483 M 2.0 2.0 0.14
45623 M NaN 0.0 0.28
And the last DataFrame, as an example, is rather different, with something like:
SUBJECT Var4 Var5 Var6
200 1640.345 345.0 -0.250000
6700 14236.430 1713.0 -0.050735
6702 1345.400 NaN 0.034450
1330__201805 345.750 335.0 0.140000
4786__201805 NaN 0.0 NaN
And the goal is to merge all 3 DataFrames into one, containing all non-redundant information. This means:
if there is a new ID simply add the row
if there is a new column add the column
if the exact same IDs appear in two different DFs, they need to be merged in such a way that if the cell content is the same, the content of the 2nd DF can be discarded. If, however, the content of the cell differs, a new column named columnName.y needs to be added and the original column renamed to columnName.x
Considering only the merge of DF1 and DF2, it should look something like this:
ID Sex_x Var1_x Var2_x Var3 Var1_y Var2_y
20016 F 3.0 2.0 -0.25 NaN 3.0
50048 M 3.0 3.0 0.00 1.0 NaN
13478 F 3.0 NaN 4.00 NaN NaN
68493 F 3.0 3.0 NaN NaN NaN
57483 M 2.0 2.0 0.14 NaN NaN
45623 M NaN NaN 0.28 NaN 0.0
The 3rd DF should then be merged too, which would basically result in only adding its rows and columns.
All cells which are not present in the other DFs should be filled with NaN,
and it would be awesome if corresponding columns like name.x and name.y were next to each other to ensure readability.
I have tried pandas merge, join, and concatenate, and also doing it by hand, but nothing works as it needs to.
This is an example of how I added the columns if they are not present:
df_combined = df_1.copy()
for ind, column in enumerate(df_2):
if not column in list(df_combined):
df_combined.insert(len(df_combined.columns), column,
value=pd.Series(np.nan),
allow_duplicates=False)
frame = [df_combined, df_2]
df_combined = pd.concat(frame)
which is probably already not a good solution.
Thanks for any help on how to implement this!
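Not a full answer, but a minimal sketch of the DF1/DF2 step under the rules above, assuming df1 and df2 from the question; it outer-joins on the ID, collapses column pairs that agree wherever both sides are present, and keeps only the changed cells in the *_y columns (helper names are illustrative, and the suffix style is _x/_y rather than .x/.y):
import numpy as np
import pandas as pd

# df1 and df2 as constructed above
a = df1.set_index('ID')
b = df2.set_index('PERSID').rename_axis('ID')

merged = a.join(b, how='outer', lsuffix='_x', rsuffix='_y')

for col in a.columns.intersection(b.columns):
    x, y = merged[f'{col}_x'], merged[f'{col}_y']
    conflict = x.ne(y) & x.notna() & y.notna()
    if conflict.any():
        # keep only the cells of the second frame that actually changed
        merged[f'{col}_y'] = y.where(x.ne(y) & y.notna())
    else:
        # the columns agree wherever both are present: collapse them
        merged[col] = x.combine_first(y)
        merged = merged.drop(columns=[f'{col}_x', f'{col}_y'])

print(merged)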