Pandas update row value based on group - python

I have a DataFrame like this:
product type price
A 1 10
A 2 15
A 3 NAN
A 4 20
B 1 40
B 2 30
B 3 NAN
B 4 5
C 1 80
C 2 70
C 3 90
C 4 NAN
D 6 75
D 8 40
I want to update the value of price (not only where it is NaN) based on the price of another row:
if type == 3:
    price = price of the row where type == 1
elif type == 4:
    price = price of the row where type == 2
elif type == 8:
    price = price of the row where type == 6
The condition is something like this. I tried the following for each case:
df.loc[df['type'] == 3, 'price'] = df[df['type']==1]['price'].iloc[0]
This works, except that the value 10 is written to products A, B and C alike.
Is there a way to update this value per product group? The expected outcome is as follows:
product type price
A 1 10
A 2 15
A 3 10
A 4 15
B 1 40
B 2 30
B 3 40
B 4 30
C 1 80
C 2 70
C 3 80
C 4 70
D 6 75
D 8 75
Note: using transform('first') may not be appropriate in this use case, because not every product will have all the types (1-10).
Thanks

IIUC, you need to fillna with the first value of each group:
df['price']=df['price'].fillna(df.groupby('product')['price'].transform('first'))
print(df)
product type price
0 A 1 10.0
1 A 2 15.0
2 A 3 10.0
3 A 4 20.0
4 B 1 40.0
5 B 2 30.0
6 B 3 40.0
7 B 4 5.0
EDIT: based on the edited question, you can try pivoting and filling NaN from the respective columns:
piv = df.set_index(['product','type'])['price'].unstack()
piv = piv.fillna({3:piv[1],4:piv[2]})
out = piv.stack().reset_index(name='price')
print(out)
product type price
0 A 1 10.0
1 A 2 15.0
2 A 3 10.0
3 A 4 20.0
4 B 1 40.0
5 B 2 30.0
6 B 3 40.0
7 B 4 5.0
8 C 1 80.0
9 C 2 70.0
10 C 3 90.0
11 C 4 70.0

You can try:
for x,y in zip([3,4,8],[1,2,6]):
    df.loc[df['type'].eq(x), 'price'] = df.loc[df['type'].eq(y),'price'].values
Output:
product type price
0 A 1 10
1 A 2 15
2 A 3 10
3 A 4 15
4 B 1 40
5 B 2 30
6 B 3 40
7 B 4 30
8 C 1 80
9 C 2 70
10 C 3 80
11 C 4 70
12 D 6 75
13 D 8 75
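The loop above relies on the rows for each target type and its source type lining up one-to-one across products. As a hedged sketch for the case the question mentions, where some products lack some types, the prices can instead be looked up by (product, type). The name `source_type` is a hypothetical mapping introduced for illustration, and the sketch assumes each (product, type) pair occurs at most once; rows whose source price is missing keep their current price.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "product": list("AAAABBBBCCCCDD"),
    "type": [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 6, 8],
    "price": [10, 15, np.nan, 20, 40, 30, np.nan, 5, 80, 70, 90, np.nan, 75, 40],
})

# Hypothetical mapping: target type -> type whose price should be copied.
source_type = {3: 1, 4: 2, 8: 6}

# Price lookup keyed by (product, type); assumes each pair is unique.
lookup = df.set_index(["product", "type"])["price"]

# Rows that have an update rule, and the (product, source type) key for each.
mask = df["type"].isin(list(source_type))
keys = pd.MultiIndex.from_arrays(
    [df.loc[mask, "product"], df.loc[mask, "type"].map(source_type)]
)

# Missing (product, source type) pairs come back as NaN.
found = lookup.reindex(keys).to_numpy()

# Keep the current price where no source price was found.
current = df.loc[mask, "price"].to_numpy()
df.loc[mask, "price"] = np.where(np.isnan(found), current, found)

print(df)
```

Because the lookup is keyed by product, a product that lacks type 1 simply leaves its type 3 row unchanged instead of shifting values between products.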

Related

In pandas, how to operate on the row with the first instance of a string?

I have a csv file, and I'm trying to convert a column with cumulative values to individual values. I can form most of the column with
df['delta'] = df['expenditure'].diff()
For each person (A, B, ...) I want the change in expenditure since they last attended. The code above gives me
person days expenditure delta
A 1 10
A 2 24 14
A 10 45 21
B 2 0 -45
B 7 2 2
B 8 10 8
C 5 50 40
C 6 78 28
C 7 90 12
and what I want is
person days expenditure delta
A 1 10 ---> 10
A 2 24 14
A 10 45 21
B 2 0 ---> 0
B 7 2 2
B 8 10 8
C 5 50 ---> 50
C 6 78 28
C 7 90 12
so for each person, I want their lowest day's expenditure value put in delta.
Additionally, if I'm trying to average delta by the days, how would I go about it? That is if I wanted
person days expenditure delta
A 1 10 10
A 2 24 14
A 10 45 21/8
B 2 0 0
B 7 2 2/5
B 8 10 8
So 21/8 is the (change in expenditure)/(change in days) for A
Use DataFrameGroupBy.diff and replace the first missing value of each group with the original value using Series.fillna:
df['delta'] = df.groupby('person')['expenditure'].diff().fillna(df['expenditure'])
print (df)
person days expenditure delta
0 A 1 10 10.0
1 A 2 24 14.0
2 A 10 45 21.0
3 B 2 0 0.0
4 B 7 2 2.0
5 B 8 10 8.0
6 C 5 50 50.0
7 C 6 78 28.0
8 C 7 90 12.0
For the second part, you can process both columns and then divide them with DataFrame.eval:
df['delta'] = (df.groupby('person')[['expenditure', 'days']].diff()
                 .fillna(df[['expenditure','days']])
                 .eval('expenditure / days'))
This works the same as:
df['delta'] = (df.groupby('person')['expenditure'].diff().fillna(df['expenditure'])
                 .div(df.groupby('person')['days'].diff().fillna(df['days'])))
print (df)
person days expenditure delta
0 A 1 10 10.000
1 A 2 24 14.000
2 A 10 45 2.625
3 B 2 0 0.000
4 B 7 2 0.400
5 B 8 10 8.000
6 C 5 50 10.000
7 C 6 78 28.000
8 C 7 90 12.000
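For reference, here is a self-contained sketch of both steps on the sample data. It assumes the rows are already sorted by person and days, and it keeps the raw change (`delta`) and the per-day change (`rate`, a column name chosen here for illustration) in separate columns. As in the answer above, the first row of each person is divided by its own `days` value.

```python
import pandas as pd

df = pd.DataFrame({
    "person": list("AAABBBCCC"),
    "days": [1, 2, 10, 2, 7, 8, 5, 6, 7],
    "expenditure": [10, 24, 45, 0, 2, 10, 50, 78, 90],
})

grouped = df.groupby("person")

# Per-person differences; the first row of each person is NaN after diff(),
# so it is filled with the original value of that row.
d_exp = grouped["expenditure"].diff().fillna(df["expenditure"])
d_days = grouped["days"].diff().fillna(df["days"])

df["delta"] = d_exp            # change in expenditure since the last visit
df["rate"] = d_exp / d_days    # change in expenditure per day

print(df)
```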

Pandas Dataframe Question: Groupby show largest value

I made a slight mistake on a previous question (Pandas Dataframe Question: Subtract next row and add specific value if NaN)
import pandas as pd
df = pd.DataFrame({'label': 'a a b a c c b b c'.split(), 'Val': [2,2,6, 8, 14, 14,16, 18, 22]})
df
label Val
0 a 2
1 a 2
2 b 6
3 a 8
4 c 14
5 c 14
6 b 16
7 b 18
8 c 22
df['Results'] = abs(df.groupby(['label'])['Val'].diff(-1)).fillna(3)
df
label Val Results
0 a 2 0.0
1 a 2 6.0
2 b 6 10.0
3 a 8 3.0
4 c 14 0.0
5 c 14 8.0
6 b 16 2.0
7 b 18 3.0
8 c 22 3.0
Is it possible to get something like this:
label Val Results
0 a 2 6.0
1 a 2 6.0
2 b 6 10.0
3 a 8 3.0
4 c 14 8.0
5 c 14 8.0
6 b 16 2.0
7 b 18 3.0
8 c 22 3.0
That is, there should be no zero values; the value should be the same for equal distances.
If you need to replace 0 with the next non-zero value per group, replace 0 with missing values and then use GroupBy.bfill:
import numpy as np
df['Results'] = (df.groupby(['label'])['Val']
                   .diff(-1)
                   .fillna(3)
                   .replace(0, np.nan)
                   .groupby(df['label'])
                   .bfill()
                   .abs())
print (df)
label Val Results
0 a 2 6.0
1 a 2 6.0
2 b 6 10.0
3 a 8 3.0
4 c 14 8.0
5 c 14 8.0
6 b 16 2.0
7 b 18 3.0
8 c 22 3.0

pandas personalized groupby data arithmetic manipulation

Here is my dataframe:
id_1 id_2 cost id_3 other
0 1 a 30 10 a
1 1 a 30 20 f
2 1 a 30 30 h
3 1 b 60 40 b
4 1 b 60 50 m
5 2 a 10 60 u
6 2 a 10 70 l
7 2 b 8 80 u
8 3 c 15 90 y
9 3 c 15 100 l
10 4 d 8 110 m
11 5 e 5 120 v
I want to group by ['id_1', 'id_2'] and divide the cost, which is the same on every row of a group, evenly among the rows of that group (for example, 30/3 = 10 for each of the three a rows).
I would expect something like this:
id_1 id_2 cost id_3 other
0 1 a 10 10 a
1 1 a 10 20 f
2 1 a 10 30 h
3 1 b 30 40 b
4 1 b 30 50 m
5 2 a 5 60 u
6 2 a 5 70 l
7 2 b 8 80 u
8 3 c 7.5 90 y
9 3 c 7.5 100 l
10 4 d 8 110 m
11 5 e 5 120 v
It is a similar question to this link, but now I want more flexibility in manipulating data inside a group of rows.
How can I proceed?
Thanks!
Use transform to divide each cost by the size of its group:
df.cost/=df.groupby(['id_1','id_2']).cost.transform('count')
df
id_1 id_2 cost id_3 other
0 1 a 10.0 10 a
1 1 a 10.0 20 f
2 1 a 10.0 30 h
3 1 b 30.0 40 b
4 1 b 30.0 50 m
5 2 a 5.0 60 u
6 2 a 5.0 70 l
7 2 b 8.0 80 u
8 3 c 7.5 90 y
9 3 c 7.5 100 l
10 4 d 8.0 110 m
11 5 e 5.0 120 v
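A small self-contained variant of the same idea, as a sketch: it uses `transform('size')` rather than `transform('count')`, so the divisor is the number of rows in the group even if some `cost` values were missing (`count` skips NaN). Only the three columns needed for the calculation are reproduced here.

```python
import pandas as pd

df = pd.DataFrame({
    "id_1": [1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 5],
    "id_2": list("aaabbaabccde"),
    "cost": [30, 30, 30, 60, 60, 10, 10, 8, 15, 15, 8, 5],
})

# Number of rows in each (id_1, id_2) group, broadcast back to every row.
group_size = df.groupby(["id_1", "id_2"])["cost"].transform("size")

# Split the group's cost evenly among its rows.
df["cost"] = df["cost"] / group_size

print(df)
```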

How to pass dataframe column value as window size after df.groupby?

A B C
0 1 10 2
1 1 15 2
2 1 14 2
3 2 11 4
4 2 12 4
5 2 13 4
6 2 16 4
7 1 18 2
This is my sample DataFrame.
I want to group by column 'A' and apply a rolling sum on column 'B' with a window size taken from column 'C'. For example, when A is 1 the window size should be 2. Instead of NaN, I want the sum of the remaining values regardless of window size.
Currently my output is:
A
1 0 25.0
1 29.0
2 32.0
7 NaN
2 3 23.0
4 25.0
5 29.0
6 NaN
code for above:
df['B'].groupby(df['A']).rolling(df['C'][0]).sum().shift(-1)
When C = 4, I want the rolling window to be 4, and I don't want NaN.
The desired output should be as follows:
A B C Rolling_sum
0 1 10 2 25
1 1 15 2 29
2 1 14 2 32
7 1 18 2 18
3 2 11 4 52
4 2 12 4 41
5 2 13 4 29
6 2 16 4 16
Because you want to pass a dynamic window taken from column C, use a lambda function and reverse the row order with iloc[::-1]:
df = df.sort_values('A')
df['Rolling_sum'] = (df.iloc[::-1].groupby('A')
                       .apply(lambda x: x.B.rolling(x.C.iat[0], min_periods=0).sum())
                       .reset_index(level=0, drop=True))
print (df)
A B C Rolling_sum
0 1 10 2 25.0
1 1 15 2 29.0
2 1 14 2 32.0
7 1 18 2 18.0
3 2 11 4 52.0
4 2 12 4 41.0
5 2 13 4 29.0
6 2 16 4 16.0
Solution with strides if performance is important (this depends on the number and size of groups, so it is best to test on real data):
import numpy as np
def rolling_window(a, window):
    # pad with zeros so the first windows are partial sums
    a = np.concatenate([[0] * (window - 1), a])
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides).sum(axis=1)
df = df.sort_values('A')
df['Rolling_sum'] = (df.iloc[::-1].groupby('A')
                       .apply(lambda x: pd.Series(rolling_window(x.B, x.C.iat[0]),
                                                  index=x.index))
                       .reset_index(level=0, drop=True))
print (df)
A B C Rolling_sum
0 1 10 2 25
1 1 15 2 29
2 1 14 2 32
7 1 18 2 18
3 2 11 4 52
4 2 12 4 41
5 2 13 4 29
6 2 16 4 16
We can use DataFrame.groupby together with groupby.rolling, choosing the window from the value of column C.
Here we use group[::-1] to reverse the row order of each group and obtain the appropriate solution.
Finally we use pd.concat to join the series obtained for each value of C.
df = df.sort_values('A')
df['Rolling_sum']= pd.concat([group[::-1].groupby(df['A'])
                                         .rolling(i,min_periods = 1)
                                         .B.sum()
                                         .reset_index(level = 'A',drop =True)
                              for i, group in df.groupby('C')])
print(df)
Output
A B C Rolling_sum
0 1 10 2 25.0
1 1 15 2 29.0
2 1 14 2 32.0
7 1 18 2 18.0
3 2 11 4 52.0
4 2 12 4 41.0
5 2 13 4 29.0
6 2 16 4 16.0
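The answers above get a forward-looking sum by reversing the rows before rolling. As an alternative sketch, pandas (1.1 and later) provides `pandas.api.indexers.FixedForwardWindowIndexer`, which defines a window that looks ahead, so no reversal is needed. The sketch assumes, as in the sample data, that C is constant within each group of A.

```python
import pandas as pd
from pandas.api.indexers import FixedForwardWindowIndexer

df = pd.DataFrame({
    "A": [1, 1, 1, 2, 2, 2, 2, 1],
    "B": [10, 15, 14, 11, 12, 13, 16, 18],
    "C": [2, 2, 2, 4, 4, 4, 4, 2],
})

parts = []
for _, group in df.groupby("A"):
    # Window size for this group; assumes C is constant within the group.
    window = int(group["C"].iat[0])
    indexer = FixedForwardWindowIndexer(window_size=window)
    # min_periods=1 sums whatever rows remain near the end of the group.
    parts.append(group["B"].rolling(indexer, min_periods=1).sum())

# Each part keeps the original row labels, so assignment aligns by index.
df["Rolling_sum"] = pd.concat(parts)

print(df.sort_values("A", kind="stable"))
```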

How to duplicate rows and increment column where incremented column value does not exist?

I have a pandas Dataframe like this:
id alt amount
6 b 30
6 a 30
3 d 56
3 a 40
1 c 35
1 b 10
1 a 20
which I would like to be turned into this:
id alt amount
6 d 56
6 c 35
6 b 30
6 a 30
5 d 56
5 c 35
5 b 26
5 a 33.33
4 d 56
4 c 35
4 b 22
4 a 36.66
3 d 56
3 c 35
3 b 18
3 a 40
2 c 35
2 b 14
2 a 30
1 c 35
1 b 10
1 a 20
For each missing id number N that is less than the maximum id number, all alt values for the largest id number smaller than N should be duplicated, with id number set to N for these duplicated rows. If the alt value is repeated for a larger id number, then the additional amount entries should increase by the difference divided by the difference between id values (number of steps). If the alt value is not repeated, then the amount can simply be copied over for each additional id value.
For instance, a appears with id numbers 1, 3 and 6 and amounts 20, 40, 30 respectively. We need to add an instance of a with an id of 2. The amount in this will be 30, since it takes 2 steps to go from 1 to 3 and we are increasing by 20. Going from 3 to 6 there are 3 steps and we are decreasing by 10. -10/3 = -3.33, so we subtract 3.33 for each new instance of a.
I thought of doing some combination of duplicating, sorting, and forward filling? I'm unsure of the logic here though.
You can do this with pivot + reindex, then interpolate:
yourdf = (df.pivot(index='id', columns='alt', values='amount')
            .reindex(range(df.id.min(), df.id.max() + 1))
            .interpolate(method='index').stack().reset_index())
yourdf
Out[51]:
id alt 0
0 1 a 20.000000
1 1 b 10.000000
2 1 c 35.000000
3 2 a 30.000000
4 2 b 14.000000
5 2 c 35.000000
6 3 a 40.000000
7 3 b 18.000000
8 3 c 35.000000
9 3 d 56.000000
10 4 a 36.666667
11 4 b 22.000000
12 4 c 35.000000
13 4 d 56.000000
14 5 a 33.333333
15 5 b 26.000000
16 5 c 35.000000
17 5 d 56.000000
18 6 a 30.000000
19 6 b 30.000000
20 6 c 35.000000
21 6 d 56.000000
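A self-contained sketch of the same pivot, reindex and interpolate steps, with two additions that are assumptions about what the asker wants: the value column is named `amount`, and rows are sorted by descending id and alt to match the layout in the question. An explicit `dropna()` after `stack()` removes the (id, alt) pairs that have no value, such as d for ids 1 and 2, regardless of how the installed pandas version handles NaN in `stack()`.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [6, 6, 3, 3, 1, 1, 1],
    "alt": list("badacba"),
    "amount": [30, 30, 56, 40, 35, 10, 20],
})

# One row per id, one column per alt.
wide = df.pivot(index="id", columns="alt", values="amount")

# Add the missing ids, then interpolate linearly along the id axis.
# Values after the last known id are carried forward; values before the
# first known id stay NaN.
full_ids = pd.RangeIndex(int(df["id"].min()), int(df["id"].max()) + 1, name="id")
wide = wide.reindex(full_ids).interpolate(method="index")

out = (
    wide.stack()
    .dropna()
    .rename("amount")
    .reset_index()
    .sort_values(["id", "alt"], ascending=False, ignore_index=True)
)

print(out)
```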
