Python loop for calculating sum of column values in pandas

I have the data frame below, with a single column col:
col
a
100
200
200
b
20
30
40
c
400
50
I need help calculating the sum of values for each item and placing it in a second column, which ideally should look like this:
a 500
100
200
200
b 90
20
30
40
c 450
400
50

If you need sums per group, convert column col to numeric and use GroupBy.transform with 'sum', forming the groups by forward-filling the non-numeric values:
import pandas as pd

s = pd.to_numeric(df['col'], errors='coerce')
mask = s.isna()
new = s.groupby(df['col'].where(mask).ffill()).transform('sum')
df.loc[mask, 'new'] = new
print (df)
col new
0 a 500.0
1 100 NaN
2 200 NaN
3 200 NaN
4 b 90.0
5 20 NaN
6 30 NaN
7 40 NaN
8 c 450.0
9 400 NaN
10 50 NaN
Or, if you want empty strings instead of NaN on the numeric rows (note this makes new an object column):
import numpy as np

df['new'] = np.where(mask, new.astype(int), '')
print (df)
col new
0 a 500
1 100
2 200
3 200
4 b 90
5 20
6 30
7 40
8 c 450
9 400
10 50
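The key step in both variants is the grouping key built with where + ffill. A minimal sketch, rebuilding the question's single-column frame (an assumption about its exact construction), shows what that intermediate key looks like:
import pandas as pd

df = pd.DataFrame({'col': ['a', 100, 200, 200, 'b', 20, 30, 40, 'c', 400, 50]})

s = pd.to_numeric(df['col'], errors='coerce')
mask = s.isna()                               # True on the letter rows
key = df['col'].where(mask).ffill()           # letters propagated downward
print(key.tolist())
# ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'c', 'c', 'c']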

Related

Pandas Multiply 2D by 1D Dataframe

Looking for an elegant way to multiply a 2D dataframe by a 1D series where the indices and column names align
df1 =
Index  A  B
1      1  5
2      2  6
3      3  7
4      4  8
df2 =
   Coef
A    10
B   100
Something like...
df3 = df1.mul(df2)
To get:
Index   A    B
1      10  500
2      20  600
3      30  700
4      40  800
There is no such thing as a 1D DataFrame; you need to select the column as a Series to get 1D data, then multiply (by default on axis=1):
df3 = df1.mul(df2['Coef'])
Output:
A B
1 10 500
2 20 600
3 30 700
4 40 800
If Index is a column:
df3 = df1.mul(df2['Coef']).combine_first(df1)[df1.columns]
Output:
Index A B
0 1.0 10.0 500.0
1 2.0 20.0 600.0
2 3.0 30.0 700.0
3 4.0 40.0 800.0
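If you want to keep the integer dtypes, a possible alternative is to set Index as the index before multiplying; a minimal sketch, rebuilding the frames from the question:
import pandas as pd

df1 = pd.DataFrame({'Index': [1, 2, 3, 4], 'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8]})
df2 = pd.DataFrame({'Coef': [10, 100]}, index=['A', 'B'])

# Set 'Index' aside, multiply column-wise against the Coef Series, restore it.
df3 = df1.set_index('Index').mul(df2['Coef']).reset_index()
print(df3)
#    Index   A    B
# 0      1  10  500
# 1      2  20  600
# 2      3  30  700
# 3      4  40  800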

Pandas: Apply weights to another column, for certain ids only

Let's take this sample dataframe and this list of ids:
df=pd.DataFrame({'Id':['A','A','A','B','C','C','D','D'], 'Weight':[50,20,30,1,2,8,3,2], 'Value':[100,100,100,10,20,20,30,30]})
Id Weight Value
0 A 50 100
1 A 20 100
2 A 30 100
3 B 1 10
4 C 2 20
5 C 8 20
6 D 3 30
7 D 2 30
L = ['A','C']
The Value column has the same value for each id in the Id column. For the specific ids in L, I would like to apply the weights from the Weight column to the Value column. I currently do it the following way, but it is extremely slow on my real, big dataframe:
for i in L:
    df.loc[df["Id"] == i, "Value"] = (df.loc[df["Id"] == i, "Value"] * df.loc[df["Id"] == i, "Weight"] /
                                      df[df["Id"] == i]["Weight"].sum())
How could I do that efficiently?
Expected output :
Id Weight Value
0 A 50 50
1 A 20 20
2 A 30 30
3 B 1 10
4 C 2 4
5 C 8 16
6 D 3 30
7 D 2 30
The idea is to work only with the rows filtered by Series.isin, using GroupBy.transform with 'sum' to get per-group sums as a Series of the same length as the filtered DataFrame:
L = ['A','C']
m = df['Id'].isin(L)
df1 = df[m].copy()
s = df1.groupby('Id')['Weight'].transform('sum')
df.loc[m, 'Value'] = df1['Value'].mul(df1['Weight']).div(s)
print (df)
Id Weight Value
0 A 50 50.0
1 A 20 20.0
2 A 30 30.0
3 B 1 10.0
4 C 2 4.0
5 C 8 16.0
6 D 3 30.0
7 D 2 30.0
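The intermediate copy is not strictly necessary; the same idea works by aligning on the filtered rows directly. A sketch under the same data:
import pandas as pd

df = pd.DataFrame({'Id': ['A', 'A', 'A', 'B', 'C', 'C', 'D', 'D'],
                   'Weight': [50, 20, 30, 1, 2, 8, 3, 2],
                   'Value': [100, 100, 100, 10, 20, 20, 30, 30]})
L = ['A', 'C']

m = df['Id'].isin(L)
# Per-group weight sums, computed only on the filtered rows.
s = df.loc[m].groupby('Id')['Weight'].transform('sum')
df.loc[m, 'Value'] = df.loc[m, 'Value'] * df.loc[m, 'Weight'] / s
print(df)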

Merge Columns with the Same name in the same dataframe if null

I have a dataframe that looks like this
Depth DT DT DT GR GR GR
1 100 NaN 45 NaN 100 50 NaN
2 200 NaN 45 NaN 100 50 NaN
3 300 NaN 45 NaN 100 50 NaN
4 400 NaN NaN 50 100 50 NaN
5 500 NaN NaN 50 100 50 NaN
I need to merge the same name columns into one if there are null values and keep the first occurrence of the column if other columns are not null.
In the end the data frame should look like
Depth DT GR
1 100 45 100
2 200 45 100
3 300 45 100
4 400 50 100
5 500 50 100
I am a beginner in pandas. I tried but wasn't successful; drop_duplicates couldn't do what I wanted. Any suggestions?
IIUC, you can do:
(df.set_index('Depth')
   .groupby(level=0, axis=1).first()
   .reset_index())
output:
Depth DT GR
0 100 45.0 100.0
1 200 45.0 100.0
2 300 45.0 100.0
3 400 50.0 100.0
4 500 50.0 100.0
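Note that groupby(..., axis=1) is deprecated in recent pandas versions (2.1+). A sketch of the same merge done by transposing instead, rebuilding the question's frame (the duplicate column names are assumed literal):
import numpy as np
import pandas as pd

df = pd.DataFrame([[100, np.nan, 45, np.nan, 100, 50, np.nan],
                   [200, np.nan, 45, np.nan, 100, 50, np.nan],
                   [300, np.nan, 45, np.nan, 100, 50, np.nan],
                   [400, np.nan, np.nan, 50, 100, 50, np.nan],
                   [500, np.nan, np.nan, 50, 100, 50, np.nan]],
                  columns=['Depth', 'DT', 'DT', 'DT', 'GR', 'GR', 'GR'])

out = (df.set_index('Depth')
         .T.groupby(level=0).first()   # first non-null value per duplicate name
         .T.reset_index())
print(out)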

How to fill np.nan values with data from another row based on matching

I need to do the following:
import numpy as np
import pandas as pd

a = [1, 2, 3, 4, 5]
b = ['2013-06-10', np.nan, '2013-02-15', np.nan, '2013-05-15']
c = [100, 100, 200, 200, 100]
df = pd.DataFrame({'a': a, 'b': b, 'c': c})
this will give:
a b c
0 1 2013-06-10 100
1 2 NaN 100
2 3 2013-02-15 200
3 4 NaN 200
4 5 2013-05-15 100
Based on the value in column c, I want to look up the same value in a previous row and fill in the date in column b when it is null. It should eventually look like this:
a b c
0 1 2013-06-10 100
1 2 2013-06-10 100
2 3 2013-02-15 200
3 4 2013-02-15 200
4 5 2013-05-15 100
I currently do it with a row-wise apply/lambda function to fill the date, but because my raw data has millions of rows, it slows down tremendously. I am wondering if anyone knows a much faster way to fill NaN values with data from a different row based on the same value in column c.
You can use ffill per group:
df['b'] = df.groupby('c')['b'].ffill()
print (df)
a b c
0 1 2013-06-10 100
1 2 2013-06-10 100
2 3 2013-02-15 200
3 4 2013-02-15 200
4 5 2013-05-15 100
Also, if the first value of some group in b is NaN, use apply, because both functions need to be applied per group:
print (df)
a b c
0 1 NaN 100 <- NaN
1 1 2013-06-10 100
2 2 NaN 100
3 3 2013-02-15 200
4 4 NaN 200
5 5 2013-05-15 100
df['b'] = df.groupby('c')['b'].apply(lambda x: x.ffill().bfill())
print (df)
a b c
0 1 2013-06-10 100
1 1 2013-06-10 100
2 2 2013-06-10 100
3 3 2013-02-15 200
4 4 2013-02-15 200
5 5 2013-05-15 100
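A transform-based variant of the same ffill + bfill step avoids apply entirely; a sketch, rebuilding the frame from this answer:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2, 3, 4, 5],
                   'b': [np.nan, '2013-06-10', np.nan,
                         '2013-02-15', np.nan, '2013-05-15'],
                   'c': [100, 100, 100, 200, 200, 100]})

# transform keeps the original index, so the result aligns back directly.
df['b'] = df.groupby('c')['b'].transform(lambda x: x.ffill().bfill())
print(df)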

Sort pandas grouped cols in line

I am working on a machine learning task and I want to change each line from "numbered objects" to "objects sorted by some attributes".
For example, I have 5 heroes in 2 teams, represented by their stats (dN_%stat% and rN_%stat%), and I want to sort the heroes in each team by the stats numbered 3, 4, 0, 2, so that the first one is the strongest and so on.
Here is my current code, but it is very slow, so I want to use native pandas objects and operations:
def sort_heroes(df):
    # stats is a list of stat-name suffixes defined elsewhere
    for match_id in df.index:
        for team in ['r', 'd']:
            heroes = []
            for n in range(1, 6):
                heroes.append(
                    [df.loc[match_id, '%s%s_%s' % (team, n, stat)]
                     for stat in stats])
            heroes.sort(key=lambda x: (x[3], x[4], x[0], x[2]))
            for n in range(1, 6):
                for i, stat in enumerate(stats):
                    df.loc[match_id, '%s%s_%s' %
                           (team, n, stat)] = heroes[n - 1][i]
A short example with partial but representative data:
match_id r1_xp r1_gold r2_xp r2_gold r3_xp r3_gold d1_xp d1_gold d2_xp d2_gold
1 10 20 100 10 5000 300 0 0 15 5
2 1 1 1000 80 100 13 200 87 311 67
What I want is to sort those column groups by prefix (rN_ and dN_), first by gold, then by xp:
match_id r1_xp r1_gold r2_xp r2_gold r3_xp r3_gold d1_xp d1_gold d2_xp d2_gold
1 5000 300 10 20 100 10 15 5 0 0
2 1000 80 100 13 1 1 200 87 311 67
You can use:
df.set_index('match_id', inplace=True)
#create MultiIndex with 3 levels
arr = df.columns.str.extract(r'([rd])(\d*)_(.*)', expand=True).T.values
df.columns = pd.MultiIndex.from_arrays(arr)
#reshape df, sorting
df = df.stack([0,1]).reset_index().sort_values(['match_id','level_1','gold','xp'],
                                               ascending=[True,False,False,False])
print (df)
match_id level_1 level_2 gold xp
4 1 r 3 300.0 5000.0
2 1 r 1 20.0 10.0
3 1 r 2 10.0 100.0
1 1 d 2 5.0 15.0
0 1 d 1 0.0 0.0
8 2 r 2 80.0 1000.0
9 2 r 3 13.0 100.0
7 2 r 1 1.0 1.0
5 2 d 1 87.0 200.0
6 2 d 2 67.0 311.0
#assign new values to level_2
df.level_2 = df.groupby(['match_id','level_1']).cumcount().add(1).astype(str)
#get original shape
df = df.set_index(['match_id','level_1','level_2']).stack().unstack([1,2,3]).astype(int)
df = df.sort_index(level=[0,1,2], ascending=[False, True, False], axis=1)
#MultiIndex in columns to flat column names
df.columns = ['{}{}_{}'.format(x[0], x[1], x[2]) for x in df.columns]
df.reset_index(inplace=True)
print (df)
match_id r1_xp r1_gold r2_xp r2_gold r3_xp r3_gold d1_xp d1_gold \
0 1 5000 300 10 20 100 10 15 5
1 2 1000 80 100 13 1 1 200 87
d2_xp d2_gold
0 0 0
1 311 67
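A minimal sketch of just the column-splitting step, showing what str.extract plus MultiIndex.from_arrays produce for a few columns of the naming scheme above:
import pandas as pd

cols = pd.Index(['r1_xp', 'r1_gold', 'd2_xp', 'd2_gold'])
# Split '<team><n>_<stat>' into three levels: team, slot number, stat name.
arr = cols.str.extract(r'([rd])(\d*)_(.*)', expand=True).T.values
mi = pd.MultiIndex.from_arrays(arr)
print(mi.tolist())
# [('r', '1', 'xp'), ('r', '1', 'gold'), ('d', '2', 'xp'), ('d', '2', 'gold')]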
