I'm trying to update locations in a DataFrame based on an idxmax series from a different DataFrame. For each row in df0, I want to set to 1 the column that holds the highest value in the corresponding row of df1. Is this possible without having to loop through each row?
Here's the code I'm trying to improve:
import pandas as pd

df0 = pd.DataFrame.from_dict({
    '2020-05-04': [0, 0, 0],
    '2020-05-05': [0, 0, 0],
    '2020-05-06': [0, 0, 0]},
    orient='index', columns=['a', 'b', 'c'])
df1 = pd.DataFrame.from_dict({
    '2020-05-04': [1.1, 1.0, 1.0],
    '2020-05-05': [1.0, 1.2, 1.0],
    '2020-05-06': [1.0, 1.0, 1.4]},
    orient='index', columns=['a', 'b', 'c'])

for date in df0.index:
    df0.loc[date, df1.idxmax(axis=1)[date]] = 1
Thanks for your suggestions!
Yes, that's possible using unstack and fillna:
# get 'coordinates' of max value
df0 = df1.idxmax(axis=1).rename('col').to_frame()
# set your 'marker' value
df0['val'] = 1
# col val
# 2020-05-04 a 1
# 2020-05-05 b 1
# 2020-05-06 c 1
df0 = df0.set_index('col', append=True).unstack().fillna(0)
# val
# col a b c
# 2020-05-04 1.0 0.0 0.0
# 2020-05-05 0.0 1.0 0.0
# 2020-05-06 0.0 0.0 1.0
The DataFrame now has a MultiIndex for columns. To drop the val level, run
df0.droplevel(0, axis=1)
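As an aside, if df0 simply starts as all zeros, a shorter route that avoids the unstack step is pd.get_dummies on the idxmax result; this is a sketch assuming the df0 and df1 from the question:
# one-hot encode the per-row argmax of df1 and align it to df0's columns
marker = (pd.get_dummies(df1.idxmax(axis=1))
            .reindex(columns=df0.columns, fill_value=0)
            .astype(int))
print(marker)
#             a  b  c
# 2020-05-04  1  0  0
# 2020-05-05  0  1  0
# 2020-05-06  0  0  1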
Related
I have a pandas dataframe containing some metrics for a given date and user.
>>> pd.DataFrame({"user": ['juan','juan','juan','gonzalo'], "date": [1, 2, 3, 1], "var1": [1, 2, None, 1], "var2": [None, 4, 5, 6]})
user date var1 var2
0 juan 1 1.0 NaN
1 juan 2 2.0 4.0
2 juan 3 NaN 5.0
3 gonzalo 1 1.0 6.0
Now, for each user, I want to extract the 2 most recent values for each variable (var1, var2), ignoring NaN unless there aren't enough values to fill the data.
For reference, this should be the resulting dataframe for the data depicted above
user var1_0 var1_1 var2_0 var2_1
juan 2.0 1.0 5.0 4.0
gonzalo 1.0 NaN 6.0 NaN
each "historical" value is added as a new column with a _0 or _1 suffix.
First sort by both columns with DataFrame.sort_values if necessary, reshape with DataFrame.melt and remove missing values, keep the last 2 rows per group (the most recent non-missing values) with GroupBy.tail, then create a counter column with GroupBy.cumcount, pivot with DataFrame.pivot and flatten the MultiIndex:
df1 = (df.sort_values(['user', 'date'])
         .melt(id_vars='user', value_vars=['var1', 'var2'])
         .dropna(subset=['value']))
df1 = df1.groupby(['user', 'variable']).tail(2)
df1['g'] = df1.groupby(['user', 'variable']).cumcount(ascending=False)
df1 = df1.pivot(index='user', columns=['variable', 'g'], values='value')
# older pandas versions:
# df1 = df1.set_index(['user', 'variable', 'g'])['value'].unstack([1, 2])
df1.columns = df1.columns.map(lambda x: f'{x[0]}_{x[1]}')
df1 = df1.reset_index()
print (df1)
user var1_0 var1_1 var2_0 var2_1
0 gonzalo 1.0 NaN 6.0 NaN
1 juan 2.0 1.0 5.0 4.0
You could group by user and aggregate to get the 2 most recent values. That gets you almost all the way there, but instead of separate columns you get a list of elements per cell. If you want the actual 2 columns, you have to split the newly created list into columns. Full code:
import pandas as pd
import numpy as np
df = pd.DataFrame(
{
"user": ["juan", "juan", "juan", "gonzalo"],
"date": [1, 2, 3, 1],
"var1": [1, 2, None, 1],
"var2": [None, 4, 5, 6],
}
)
# This almost gets you there
df = (
df.sort_values(by="date")
.groupby("user")
.agg({"var1": lambda x: x.dropna().head(2), "var2": lambda x: x.dropna().head(2)})
)
# Split the columns and get the correct column names
df[["var1_0", "var2_0"]] = df[["var1", "var2"]].apply(
lambda row: pd.Series(el[0] if isinstance(el, np.ndarray) else el for el in row),
axis=1,
)
df[["var1_1", "var2_1"]] = df[["var1", "var2"]].apply(
lambda row: pd.Series(el[-1] if isinstance(el, np.ndarray) else None for el in row),
axis=1,
)
print(df)
>>
var1 var2 var1_0 var2_0 var1_1 var2_1
user
gonzalo 1.0 6.0 1.0 6.0 NaN NaN
juan [1.0, 2.0] [4.0, 5.0] 1.0 4.0 2.0 5.0
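To get closer to the desired layout, you would probably also drop the intermediate list columns and turn the user index back into a column; a small follow-up, assuming the df from above:
# remove the helper columns that still hold the aggregated lists
df = df.drop(columns=["var1", "var2"]).reset_index()
print(df)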
Goal
Drop a row if, in any of the columns (ao, hia, cyp1a2s, cyp3a4s in this case), the min sub-column does not equal the max sub-column; NaN should compare equal to NaN.
Example
import numpy as np
import pandas as pd

arrays = [np.array(['ao', 'ao', 'hia', 'hia', 'cyp1a2s', 'cyp1a2s', 'cyp3a4s', 'cyp3a4s']),
          np.array(['min', 'max', 'min', 'max', 'min', 'max', 'min', 'max'])]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['', ''])
df = pd.DataFrame(np.array([[1, 1, 0, 0, np.nan, np.nan, 0, 0],
                            [1, 1, 0, 0, np.nan, 1, 0, 0],
                            [0, 2, 0, 0, np.nan, np.nan, 1, 1]]),
                  index=['1', '2', '3'], columns=index)
df
ao hia cyp1a2s cyp3a4s
min max min max min max min max
1 1.0 1.0 0.0 0.0 NaN NaN 0.0 0.0
2 1.0 1.0 0.0 0.0 NaN 1.0 0.0 0.0
3 0.0 2.0 0.0 0.0 NaN NaN 1.0 1.0
Want
df = pd.DataFrame(np.array([[1, 1, 0, 0, float('nan'), float('nan'), 0, 0]]), index=['1'], columns=index)
df
ao hia cyp1a2s cyp3a4s
min max min max min max min max
1 1.0 1.0 0.0 0.0 NaN NaN 0.0 0.0
Attempt
df.apply(lambda x: x['min'].map(str) == x['max'].map(str), axis=1)
KeyError: ('min', 'occurred at index 1')
Note
The actual dataframe has 50+ columns.
Use DataFrame.xs to select the min and max sub-columns by the second level of the MultiIndex, then replace NaN so missing values compare as equal:
df1 = df.xs('min', axis=1, level=1).fillna('nan')
df2 = df.xs('max', axis=1, level=1).fillna('nan')
Or convert data to strings:
df1 = df.xs('min', axis=1, level=1).astype('str')
df2 = df.xs('max', axis=1, level=1).astype('str')
Compare the two DataFrames with DataFrame.eq, test whether all values per row are True with DataFrame.all, and finally filter with boolean indexing:
df = df[df1.eq(df2).all(axis=1)]
print (df)
ao hia cyp1a2s cyp3a4s
min max min max min max min max
1 1.0 1.0 0.0 0.0 NaN NaN 0.0 0.0
The reason df.apply() didn't work is that you need to reference two levels of columns. Also, .map(str) was invalid for mapping from float64, so .astype(str) is used instead.
The following works for more than one column:
eqCols = ['cyp1a2s','hia']
neqCols = list(set(df.xs('min', level=1, axis=1).columns) - set(eqCols))
EQ = lambda r,c : r[c]['min'].astype(str) == r[c]['max'].astype(str)
df[df.apply(lambda r: ([EQ(r,c) for c in eqCols][0]) & ([(not EQ(r,c)) for c in neqCols][0]), axis=1)]
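For reference, a row-wise apply that references both column levels (the missing piece in the original attempt) could look like the sketch below; it assumes the df constructed in the question and only checks the basic "every min equals its max" condition, with NaN compared as the string 'nan':
# keep rows where each column's min sub-column equals its max sub-column
mask = df.apply(
    lambda r: r.xs('min', level=1).astype(str).eq(r.xs('max', level=1).astype(str)).all(),
    axis=1)
print(df[mask])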
I have the following sample DataFrame
import numpy as np
import pandas as pd
df = pd.DataFrame({'Tom': [2, np.nan, np.nan],
                   'Ron': [np.nan, 5, np.nan],
                   'Jim': [np.nan, np.nan, 6],
                   'Mat': [7, np.nan, np.nan]},
                  index=['Min', 'Max', 'Avg'])
that looks like this, where each column has only one non-null value
Tom Ron Jim Mat
Min 2.0 NaN NaN 7.0
Max NaN 5.0 NaN NaN
Avg NaN NaN 6.0 NaN
Desired Outcome
For each column, I want to keep the non-null value and append the index label of that non-null value to the column name. The final result should look like this
Tom_Min Ron_Max Jim_Avg Mat_Min
0 2.0 5.0 6.0 7.0
My attempt
Using list comprehensions: find the non-null value, append the corresponding index to the column name, and then create a new DataFrame
values = [df[col][~pd.isna(df[col])].values[0] for col in df.columns]
# [2.0, 5.0, 6.0, 7.0]
new_cols = [col + '_{}'.format(df[col][~pd.isna(df[col])].index[0]) for col in df.columns]
# ['Tom_Min', 'Ron_Max', 'Jim_Avg', 'Mat_Min']
df_new = pd.DataFrame([values], columns=new_cols)
My question
Is there some in-built functionality in pandas which can do this without using for loops and list comprehensions?
If there is only one non-missing value per column, you can use DataFrame.stack, convert the Series to a one-row DataFrame, and then flatten the MultiIndex; for the correct column order, DataFrame.swaplevel is used together with DataFrame.reindex:
df = df.stack().to_frame().T.swaplevel(1,0, axis=1).reindex(df.columns, level=0, axis=1)
df.columns = df.columns.map('_'.join)
print (df)
Tom_Min Ron_Max Jim_Avg Mat_Min
0 2.0 5.0 6.0 7.0
Use:
s = df.T.stack()
s.index = s.index.map('_'.join)
df = s.to_frame().T
Result:
# print(df)
Tom_Min Ron_Max Jim_Avg Mat_Min
0 2.0 5.0 6.0 7.0
Consider the following example (the two elements of interest are final_df and pivot_df; the rest of the code just constructs these two DataFrames):
import numpy
import pandas
numpy.random.seed(0)
input_df = pandas.concat([pandas.Series(numpy.round_(numpy.random.random_sample(10,), 2)),
                          pandas.Series(numpy.random.randint(0, 2, 10))], axis = 1)
input_df.columns = ['key', 'val']

pivot_df = input_df.pivot(columns = 'key', values = 'val')\
                   .fillna(method = 'pad')\
                   .cumsum()

index_df = pivot_df.notnull()\
                   .multiply(pivot_df.columns, axis = 1)\
                   .replace({0.0: numpy.nan})\
                   .values

final_df = numpy.delete(numpy.partition(index_df, 3, axis = 1),
                        numpy.s_[3:index_df.shape[1]], axis = 1)
final_df.sort(axis = 1)
final_df = pandas.DataFrame(final_df)
final_df contains as many rows as pivot_df. I want to use these two to construct a third df: bingo_df.
bingo_df should have the same dimensions as final_df. Then, the cells of bingo_df should contain:
Whenever the entry (row = i, col = j) of final_df is numpy.nan,
the entry (i,j) of bingo_df should be numpy.nan as well.
Otherwise [whenever the entry (i, j) of final_df is not numpy.nan], the entry (i, j) of bingo_df should be the value of pivot_df at row i, in the column whose name is final_df[i, j] (each entry of final_df is either the name of a column of pivot_df or numpy.nan).
Expected output:
So the first row of final_df is
0.55, nan, nan.
So I'm expecting the first row of bingo_df to be:
0.0, nan, nan
because the value in cell (row = 0, col = 0.55) of pivot_df is 0
(and the two subsequent numpy.nan in the first row of final_df should also be numpy.nan in bingo_df)
So the second row of final_df is
0.55, 0.72, nan
So I'm expecting the second row of bingo_df to be:
0.0, 1.0, nan
because the value in cell (row = 1, col = 0.55) of pivot_df is 0.0
and the value in cell (row = 1, col = 0.72) of pivot_df is 1.0
IIUC, you can use DataFrame.lookup:
s = final_df.stack()
pd.Series(pivot_df.lookup(s.index.get_level_values(0), s), index=s.index).unstack()
Out[87]:
0 1 2
0 0.0 NaN NaN
1 0.0 1.0 NaN
2 0.0 1.0 2.0
3 0.0 0.0 2.0
4 0.0 0.0 0.0
5 0.0 0.0 0.0
6 0.0 1.0 0.0
7 0.0 2.0 0.0
8 0.0 3.0 0.0
9 0.0 0.0 4.0
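Note that DataFrame.lookup is deprecated in recent pandas releases and removed in 2.0. On a newer version, a rough positional-indexing equivalent might look like this (a sketch, assuming the pivot_df and final_df from the question):
import pandas as pd

s = final_df.stack()
# map index labels and looked-up column names to integer positions
rows = pivot_df.index.get_indexer(s.index.get_level_values(0))
cols = pivot_df.columns.get_indexer(s)
bingo_df = pd.Series(pivot_df.to_numpy()[rows, cols], index=s.index).unstack()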
I have 2 pandas dataframes. The second one is contained in the first one. How can I replace the values in the first one with the ones in the second?
Consider this example:
df1 = pd.DataFrame(0, index=[1,2,3], columns=['a','b','c'])
df2 = pd.DataFrame(1, index=[1, 2], columns=['a', 'c'])
ris = [[1, 0, 1],
       [1, 0, 1],
       [0, 0, 0]]
where ris has the same index and columns as df1.
A possible solution is:
for i in df2.index:
    for j in df2.columns:
        df1.loc[i, j] = df2.loc[i, j]
But this is ugly
I think you can use copy with combine_first:
df3 = df1.copy()
df1[df2.columns] = df2[df2.columns]
print(df1.combine_first(df3))
a b c
1 1.0 0 1.0
2 1.0 0 1.0
3 0.0 0 0.0
The next solution creates a new empty DataFrame df4 with the index and columns from df1, and fills it with a double combine_first:
df4 = pd.DataFrame(index=df1.index, columns=df1.columns)
df4 = df4.combine_first(df2).combine_first(df1)
print(df4)
a b c
1 1.0 0.0 1.0
2 1.0 0.0 1.0
3 0.0 0.0 0.0
Try
In [7]: df1['a'] = df2['a']
In [8]: df1['c'] = df2['c']
In [14]: df1[['a','c']] = df2[['a','c']]
If the column names are not known:
In [25]: for col in df2.columns:
....: df1[col] = df2[col]
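One caveat with plain column assignment: it aligns on the index, so rows of df1 that are missing from df2 end up as NaN instead of keeping their original values. DataFrame.update avoids that by only overwriting cells where df2 actually has data; a minimal sketch using the df1 and df2 from the question:
import pandas as pd

df1 = pd.DataFrame(0, index=[1, 2, 3], columns=['a', 'b', 'c'])
df2 = pd.DataFrame(1, index=[1, 2], columns=['a', 'c'])

# update in place: aligns on index and columns, keeps df1 values where df2 has none
df1.update(df2)
print(df1)
# note: the updated columns may be upcast to float by the alignment step
#      a  b    c
# 1  1.0  0  1.0
# 2  1.0  0  1.0
# 3  0.0  0  0.0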