Pandas Multiply 2D by 1D Dataframe

Looking for an elegant way to multiply a 2D dataframe by a 1D series, where the indices and column names align.
df1 =

Index  A  B
1      1  5
2      2  6
3      3  7
4      4  8
df2 =

   Coef
A    10
B   100
Something like...
df3 = df1.mul(df2)
To get:

Index   A    B
1      10  500
2      20  600
3      30  700
4      40  800

There is no such thing as a 1D DataFrame; you need to select the column as a Series to get 1D data, then multiply (by default on axis=1):
df3 = df1.mul(df2['Coef'])
Output:
A B
1 10 500
2 20 600
3 30 700
4 40 800
If Index is a regular column, the multiplication sets it to NaN (there is no 'Index' label in df2's index), so restore it from df1 with combine_first:
df3 = df1.mul(df2['Coef']).combine_first(df1)[df1.columns]
Output:
Index A B
0 1.0 10.0 500.0
1 2.0 20.0 600.0
2 3.0 30.0 700.0
3 4.0 40.0 800.0
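
For reference, a self-contained sketch of this approach (the frame construction is assumed from the question's tables):

import pandas as pd

# Rebuild the question's frames: df1 indexed 1-4, df2 indexed by column name.
df1 = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8]}, index=[1, 2, 3, 4])
df2 = pd.DataFrame({'Coef': [10, 100]}, index=['A', 'B'])

# Selecting the column yields a Series whose index ('A', 'B') aligns with
# df1's columns, so mul broadcasts it across the rows (axis=1 by default).
df3 = df1.mul(df2['Coef'])
print(df3)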

Related

Add a portion of a dataframe to another dataframe

Suppose we have two dataframes, df1 and df2, with an equal number of columns but a different number of rows, e.g.:
df1 = pd.DataFrame([(1,2),(3,4),(5,6),(7,8),(9,10),(11,12)], columns=['a','b'])
    a   b
0   1   2
1   3   4
2   5   6
3   7   8
4   9  10
5  11  12
df2 = pd.DataFrame([(100,200),(300,400),(500,600)], columns=['a','b'])
     a    b
0  100  200
1  300  400
2  500  600
I would like to add df2 to the tail of df1 (df1.loc[df2.shape[0]:]), thus obtaining:
     a    b
0    1    2
1    3    4
2    5    6
3  107  208
4  309  410
5  511  612
Any idea?
Thanks!
If df1 has at least as many rows as df2, you can use DataFrame.iloc and convert df2's values to a NumPy array to avoid index alignment (the differing indices would otherwise create NaNs):
df1.iloc[-df2.shape[0]:] += df2.to_numpy()
print (df1)
a b
0 1 2
1 3 4
2 5 6
3 107 208
4 309 410
5 511 612
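
To see why the conversion matters: adding df2 directly would align on the index labels (3-5 vs 0-2) and produce NaNs, while to_numpy() strips df2's index so the addition is purely positional. A minimal self-contained sketch:

import pandas as pd

df1 = pd.DataFrame([(1, 2), (3, 4), (5, 6), (7, 8), (9, 10), (11, 12)], columns=['a', 'b'])
df2 = pd.DataFrame([(100, 200), (300, 400), (500, 600)], columns=['a', 'b'])

# Positional slice of df1's last len(df2) rows; the NumPy array carries no
# index, so values are added row by row instead of label-aligned.
df1.iloc[-df2.shape[0]:] += df2.to_numpy()
print(df1)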
For a general solution that works with any number of rows (given unique indices in both DataFrames), align the tail indices with rename and use DataFrame.add:
df = df1.add(df2.rename(dict(zip(df2.index[::-1], df1.index[::-1]))), fill_value=0)
print (df)
a b
0 1.0 2.0
1 3.0 4.0
2 5.0 6.0
3 107.0 208.0
4 309.0 410.0
5 511.0 612.0
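
A sketch of what the rename is doing, with the index mapping spelled out (same frames as above; the variable name is just for illustration):

import pandas as pd

df1 = pd.DataFrame([(1, 2), (3, 4), (5, 6), (7, 8), (9, 10), (11, 12)], columns=['a', 'b'])
df2 = pd.DataFrame([(100, 200), (300, 400), (500, 600)], columns=['a', 'b'])

# Pair the indices from the back: {2: 5, 1: 4, 0: 3}, so df2's rows land on
# df1's last rows; fill_value=0 keeps df1's unmatched rows unchanged.
mapping = dict(zip(df2.index[::-1], df1.index[::-1]))
print(df1.add(df2.rename(index=mapping), fill_value=0))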

Pandas : Apply weights to another column, for certain ids only

Let's take this sample dataframe and this list of ids:
df=pd.DataFrame({'Id':['A','A','A','B','C','C','D','D'], 'Weight':[50,20,30,1,2,8,3,2], 'Value':[100,100,100,10,20,20,30,30]})
Id Weight Value
0 A 50 100
1 A 20 100
2 A 30 100
3 B 1 10
4 C 2 20
5 C 8 20
6 D 3 30
7 D 2 30
L = ['A','C']
The Value column has the same value for each id in the Id column. For the specific ids in L, I would like to apply the weights in the Weight column to the Value column. I am currently doing it the following way, but it is extremely slow on my real, much bigger dataframe:
for i in L:
    df.loc[df["Id"]==i, "Value"] = (df.loc[df["Id"]==i, "Value"] * df.loc[df["Id"]==i, "Weight"] /
                                    df[df["Id"]==i]["Weight"].sum())
How could I do that efficiently?
Expected output :
Id Weight Value
0 A 50 50
1 A 20 20
2 A 30 30
3 B 1 10
4 C 2 4
5 C 8 16
6 D 3 30
7 D 2 30
The idea is to work only on the rows filtered with Series.isin, using GroupBy.transform with 'sum' to get per-group sums broadcast back to the same size as the filtered DataFrame:
L = ['A','C']
m = df['Id'].isin(L)
df1 = df[m].copy()
s = df1.groupby('Id')['Weight'].transform('sum')
df.loc[m, 'Value'] = df1['Value'].mul(df1['Weight']).div(s)
print (df)
Id Weight Value
0 A 50 50.0
1 A 20 20.0
2 A 30 30.0
3 B 1 10.0
4 C 2 4.0
5 C 8 16.0
6 D 3 30.0
7 D 2 30.0
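
The same idea without the intermediate copy, computing everything through boolean-masked selections (a sketch, equivalent to the answer above):

import pandas as pd

df = pd.DataFrame({'Id': ['A', 'A', 'A', 'B', 'C', 'C', 'D', 'D'],
                   'Weight': [50, 20, 30, 1, 2, 8, 3, 2],
                   'Value': [100, 100, 100, 10, 20, 20, 30, 30]})
L = ['A', 'C']

m = df['Id'].isin(L)
# Per-Id Weight totals for the selected rows, broadcast back row by row.
s = df[m].groupby('Id')['Weight'].transform('sum')
df.loc[m, 'Value'] = df.loc[m, 'Value'] * df.loc[m, 'Weight'] / s
print(df)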

With pd.merge() on different column names, the resulting dataframe has duplicated columns. How to avoid that?

Assume the dataframes df_1 and df_2 below, which I want to merge "left".
df_1= pd.DataFrame({'A': [1,2,3,4,5],
'B': [10,20,30,40,50]})
df_2= pd.DataFrame({'AA': [1,5],
'BB': [10,50],
'CC': [100, 500]})
>>> df_1
A B
0 1 10
1 2 20
2 3 30
3 4 40
4 5 50
>>> df_2
AA BB CC
0 1 10 100
1 5 50 500
I want to perform a merging which will result to the following output:
A B CC
0 1 10 100.0
1 2 20 NaN
2 3 30 NaN
3 4 40 NaN
4 5 50 500.0
So, I tried pd.merge(df_1, df_2, left_on=['A', 'B'], right_on=['AA', 'BB'], how='left') which unfortunately duplicates the columns upon which I merge:
A B AA BB CC
0 1 10 1.0 10.0 100.0
1 2 20 NaN NaN NaN
2 3 30 NaN NaN NaN
3 4 40 NaN NaN NaN
4 5 50 5.0 50.0 500.0
How do I achieve this without needing to drop the columns 'AA' and 'BB'?
Thank you!
You can rename the columns and then join on the A and B columns only:
df = pd.merge(df_1, df_2.rename(columns={'AA':'A','BB':'B'}), on=['A', 'B'], how='left')
print (df)
A B CC
0 1 10 100.0
1 2 20 NaN
2 3 30 NaN
3 4 40 NaN
4 5 50 500.0
pd.merge's right_on parameter also accepts array-like arguments. From the docs:
"Can also be an array or list of arrays of the length of the right DataFrame. These arrays are treated as if they are columns."
df_1.merge(
    df_2["CC"], left_on=["A", "B"], right_on=[df_2["AA"], df_2["BB"]], how="left"
)
A B CC
0 1 10 100.0
1 2 20 NaN
2 3 30 NaN
3 4 40 NaN
4 5 50 500.0
Note that only df_2["CC"] is passed as the right side, so the merge-key columns never appear in the result.
right_on takes a list of the columns to merge on; [df_2['AA'], df_2['BB']] is equivalent to [*df_2[['AA', 'BB']].to_numpy().T].
IMHO this method is cumbersome. As @jezrael posted, renaming the columns and merging on them is more pythonic/pandorable.
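
For completeness, a runnable sketch of the rename-based merge (frames rebuilt from the question):

import pandas as pd

df_1 = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [10, 20, 30, 40, 50]})
df_2 = pd.DataFrame({'AA': [1, 5], 'BB': [10, 50], 'CC': [100, 500]})

# Rename the right-hand keys to match the left, then merge on the shared
# names; a single pair of 'A'/'B' key columns survives in the output.
out = pd.merge(df_1, df_2.rename(columns={'AA': 'A', 'BB': 'B'}), on=['A', 'B'], how='left')
print(out)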

Map value from one row as a new column in pandas

I have a pandas dataframe:
SrNo value
a nan
1 100
2 200
3 300
b nan
1 500
2 600
3 700
c nan
1 900
2 1000
I want my final dataframe as:
value new_col
100 a
200 a
300 a
500 b
600 b
700 b
900 c
1000 c
i.e. for SrNo 'a', the value rows under it should get 'a' in a new column; similarly for 'b' and 'c'.
Create the new column with where, keeping SrNo only where value is null (isnull), then use ffill to replace the NaNs by forward filling.
Finally remove the NaN rows with dropna and the helper column with drop:
print (df['SrNo'].where(df['value'].isnull()))
0 a
1 NaN
2 NaN
3 NaN
4 b
5 NaN
6 NaN
7 NaN
8 c
9 NaN
10 NaN
Name: SrNo, dtype: object
df['new_col'] = df['SrNo'].where(df['value'].isnull()).ffill()
df = df.dropna().drop(columns='SrNo')
print (df)
value new_col
1 100.0 a
2 200.0 a
3 300.0 a
5 500.0 b
6 600.0 b
7 700.0 b
9 900.0 c
10 1000.0 c
Here's one way
In [2160]: df.assign(
               new_col=df.SrNo.str.extract(r'(\D+)', expand=False).ffill()
           ).dropna().drop(columns='SrNo')
Out[2160]:
value new_col
1 100.0 a
2 200.0 a
3 300.0 a
5 500.0 b
6 600.0 b
7 700.0 b
9 900.0 c
10 1000.0 c
Another way: replace the numbers with NaN and ffill():
import numpy as np

df['col'] = df['SrNo'].replace(r'([0-9]+)', np.nan, regex=True).ffill()
df = df.dropna(subset=['value']).drop(columns='SrNo')
Output:
value col
1 100.0 a
2 200.0 a
3 300.0 a
5 500.0 b
6 600.0 b
7 700.0 b
9 900.0 c
10 1000.0 c
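
A self-contained sketch of the first approach (the mixed-type SrNo column below is an assumption reconstructing the question's data):

import pandas as pd

df = pd.DataFrame({'SrNo': ['a', 1, 2, 3, 'b', 1, 2, 3, 'c', 1, 2],
                   'value': [None, 100, 200, 300, None, 500, 600, 700, None, 900, 1000]})

# Header rows are where value is NaN; keep SrNo only there, forward-fill
# the label down the group, then drop the header rows and the helper column.
df['new_col'] = df['SrNo'].where(df['value'].isnull()).ffill()
df = df.dropna(subset=['value']).drop(columns='SrNo')
print(df)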

Backfilling columns by groups in Pandas

I have a csv like
A,B,C,D
1,2,,
1,2,30,100
1,2,40,100
4,5,,
4,5,60,200
4,5,70,200
8,9,,
In rows 1 and 4 the C value is missing (NaN). I want to take their values from rows 2 and 5 respectively (the first occurrence with the same A, B values).
If no matching row is found, just put 0 (as in the last line).
Expected op:
A,B,C,D
1,2,30,
1,2,30,100
1,2,40,100
4,5,60,
4,5,60,200
4,5,70,200
8,9,0,
Using fillna I found bfill ("use NEXT valid observation to fill gap"), but the next observation has to be chosen logically (by matching the A, B values), not simply as the next value in column C.
You'll have to call df.groupby on A and B first and then apply the bfill function:
In [501]: df.C = df.groupby(['A', 'B']).apply(lambda x: x.C.bfill()).reset_index(drop=True)
In [502]: df
Out[502]:
A B C D
0 1 2 30 NaN
1 1 2 30 100.0
2 1 2 40 100.0
3 4 5 60 NaN
4 4 5 60 200.0
5 4 5 70 200.0
6 8 9 0 NaN
You can also group and then call GroupBy.bfill directly (I think this would be faster):
In [508]: df.C = df.groupby(['A', 'B']).C.bfill().fillna(0).astype(int); df
Out[508]:
A B C D
0 1 2 30 NaN
1 1 2 30 100.0
2 1 2 40 100.0
3 4 5 60 NaN
4 4 5 60 200.0
5 4 5 70 200.0
6 8 9 0 NaN
If you wish to get rid of the NaNs in D, you could do:
df['D'] = df['D'].fillna('')
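
Putting the second approach together as a runnable sketch (reading the question's CSV from a string):

import io
import pandas as pd

csv = """A,B,C,D
1,2,,
1,2,30,100
1,2,40,100
4,5,,
4,5,60,200
4,5,70,200
8,9,,
"""
df = pd.read_csv(io.StringIO(csv))

# Backfill C within each (A, B) group; a group with no C value at all
# (the 8,9 row) falls through to fillna(0).
df['C'] = df.groupby(['A', 'B'])['C'].bfill().fillna(0).astype(int)
print(df)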
