pandas personalized groupby data arithmetic manipulation - python

Here is my dataframe:
id_1 id_2 cost id_3 other
0 1 a 30 10 a
1 1 a 30 20 f
2 1 a 30 30 h
3 1 b 60 40 b
4 1 b 60 50 m
5 2 a 10 60 u
6 2 a 10 70 l
7 2 b 8 80 u
8 3 c 15 90 y
9 3 c 15 100 l
10 4 d 8 110 m
11 5 e 5 120 v
I want a groupby(['id_1', 'id_2']), but dividing the cost value, which is the same on every line of a group, evenly across those lines (for example, dividing 30/3 = 10 across the three a rows).
I would expect something like this:
id_1 id_2 cost id_3 other
0 1 a 10 10 a
1 1 a 10 20 f
2 1 a 10 30 h
3 1 b 30 40 b
4 1 b 30 50 m
5 2 a 5 60 u
6 2 a 5 70 l
7 2 b 8 80 u
8 3 c 7.5 90 y
9 3 c 7.5 100 l
10 4 d 8 110 m
11 5 e 5 120 v
It is a similar question to this link, but now I want more flexibility in manipulating the data inside a group of rows.
How can I proceed?
Thanks!

Let us do transform:
df.cost /= df.groupby(['id_1', 'id_2']).cost.transform('count')
df
id_1 id_2 cost id_3 other
0 1 a 10.0 10 a
1 1 a 10.0 20 f
2 1 a 10.0 30 h
3 1 b 30.0 40 b
4 1 b 30.0 50 m
5 2 a 5.0 60 u
6 2 a 5.0 70 l
7 2 b 8.0 80 u
8 3 c 7.5 90 y
9 3 c 7.5 100 l
10 4 d 8.0 110 m
11 5 e 5.0 120 v
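For reference, a minimal self-contained sketch of the same idea (the frame is rebuilt here only for illustration), using transform('size') instead of 'count'; 'size' counts every row in the group whether or not cost is NaN:
import pandas as pd

df = pd.DataFrame({
    'id_1':  [1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 5],
    'id_2':  list('aaabbaabccde'),
    'cost':  [30, 30, 30, 60, 60, 10, 10, 8, 15, 15, 8, 5],
    'id_3':  range(10, 130, 10),
    'other': list('afhbmuluylmv'),
})

# each row gets an equal share of its group's shared cost
df['cost'] = df['cost'] / df.groupby(['id_1', 'id_2'])['cost'].transform('size')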

Related

Pandas copy value from row above

Got a panda's dataframe with below structure:
Index  Name  Value  Other
1      NaN      10      5
2      A        20      2
3               30      3
4              100     12
5      NaN      40     10
6      C        10      1
7               40     10
8               40     10
9               40     10
10     NaN      40     10
11     D        10      1
12     NaN      40     10
...
I need to copy the value from the Name column in a row that has one down to the rows below it, until a NaN or another value is reached. So how do I approach copying the name A to rows 3 and 4, then copying C (row 6) to rows 7, 8, 9... until a NaN or some other name appears?
After running the code it should produce a dataframe like this:
Index Name Value Other
1 NaN 10 5
2 A 20 2
3 A 30 3
4 A 100 12
5 NaN 40 10
6 C 10 1
7 C 40 10
8 C 40 10
9 C 40 10
10 NaN 40 10
11 D 10 1
12 NaN 40 10
Just use replace():
# convert the string 'nan' to an actual NaN
df = df.replace('nan', float('NaN'), regex=True)
# forward-fill the '' values in Name
df['Name'] = df['Name'].replace('', method='ffill')
Output:
Index Name Value Other
1 NaN 10 5
2 A 20 2
3 A 30 3
4 A 100 12
5 NaN 40 10
6 C 10 1
7 C 40 10
8 C 40 10
9 C 40 10
10 NaN 40 10
11 D 10 1
12 NaN 40 10
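In newer pandas versions the method= argument of replace is deprecated, so if you hit a warning or error there, here is a sketch of an equivalent fill (after the 'nan' conversion above, and assuming the gaps in Name really are empty strings):
# forward-fill, but write the result back only at the blank positions,
# so the genuine NaN rows stay NaN
blank = df['Name'] == ''
df.loc[blank, 'Name'] = df['Name'].mask(blank).ffill()[blank]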

python pandas: Remove duplicates by column A that do not satisfy a condition in column B

I have a dataframe with repeated values in column A. I want to drop the duplicates, keeping only the rows whose value in column B is > 0.
So this:
A B
1 20
1 10
1 -3
2 30
2 -9
2 40
3 10
Should turn into this:
A B
1 20
1 10
2 30
2 40
3 10
Any suggestions on how this can be achieved? I shall be grateful!
There are no duplicates in the sample data, so you only need:
df = df[df['B'].gt(0)]
print (df)
A B
0 1 20
1 1 10
3 2 30
5 2 40
6 3 10
If there are duplicates:
print (df)
A B
0 1 20
1 1 10
2 1 10
3 1 10
4 1 -3
5 2 30
6 2 -9
7 2 40
8 3 10
df = df[df['B'].gt(0) & ~df.duplicated()]
print (df)
A B
0 1 20
1 1 10
5 2 30
7 2 40
8 3 10
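If you prefer method chaining, the same filter can be written with query, which should be equivalent for this data:
# keep only positive B, then drop the remaining exact-duplicate rows
df = df.query('B > 0').drop_duplicates()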

In pandas, how to operate on the row with the first instance of a string?

I have a csv file, and I'm trying to convert a column with cumulative values to individual values. I can form most of the column with
df['delta'] = df['expenditure'].diff()
So for each person (A, B, ...) I want the change in expenditure since they last attended, which gives me
person  days  expenditure  delta
A          1           10
A          2           24     14
A         10           45     21
B          2            0    -45
B          7            2      2
B          8           10      8
C          5           50     40
C          6           78     28
C          7           90     12
and what I want is
person  days  expenditure  delta
A          1           10  ---> 10
A          2           24       14
A         10           45       21
B          2            0  --->  0
B          7            2        2
B          8           10        8
C          5           50  ---> 50
C          6           78       28
C          7           90       12
So for each person, I want the expenditure from their earliest day put into delta.
Additionally, if I'm trying to average delta over the days, how would I go about it? That is, if I wanted
person days expenditure delta
A 1 10 10
A 2 24 14
A 10 45 21/8
B 2 0 0
B 7 2 2/5
B 8 10 8
So 21/8 is the (change in expenditure)/(change in days) for A
Use DataFrameGroupBy.diff and replace the first missing value in each group with the original value using Series.fillna:
df['delta'] = df.groupby('person')['expenditure'].diff().fillna(df['expenditure'])
print (df)
person days expenditure delta
0 A 1 10 10.0
1 A 2 24 14.0
2 A 10 45 21.0
3 B 2 0 0.0
4 B 7 2 2.0
5 B 8 10 8.0
6 C 5 50 50.0
7 C 6 78 28.0
8 C 7 90 12.0
And for the second part, you can diff both columns and then divide using DataFrame.eval:
df['delta'] = (df.groupby('person')[['expenditure', 'days']].diff()
                 .fillna(df[['expenditure', 'days']])
                 .eval('expenditure / days'))
which works the same as:
df['delta'] = (df.groupby('person')['expenditure'].diff().fillna(df['expenditure'])
                 .div(df.groupby('person')['days'].diff().fillna(df['days'])))
print (df)
person days expenditure delta
0 A 1 10 10.000
1 A 2 24 14.000
2 A 10 45 2.625
3 B 2 0 0.000
4 B 7 2 0.400
5 B 8 10 8.000
6 C 5 50 10.000
7 C 6 78 28.000
8 C 7 90 12.000
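For completeness, a self-contained sketch of the second part (the frame is rebuilt here only for illustration):
import pandas as pd

df = pd.DataFrame({
    'person':      list('AAABBBCCC'),
    'days':        [1, 2, 10, 2, 7, 8, 5, 6, 7],
    'expenditure': [10, 24, 45, 0, 2, 10, 50, 78, 90],
})

# per-person change in expenditure divided by per-person change in days;
# the first row of each person falls back to its own expenditure / days
df['delta'] = (df.groupby('person')[['expenditure', 'days']].diff()
                 .fillna(df[['expenditure', 'days']])
                 .eval('expenditure / days'))
# e.g. A's last row: (45 - 24) / (10 - 2) = 21 / 8 = 2.625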

Pandas update row value based on group

I have this df:
product type price
A 1 10
A 2 15
A 3 NAN
A 4 20
B 1 40
B 2 30
B 3 NAN
B 4 5
C 1 80
C 2 70
C 3 90
C 4 NAN
D 6 75
D 8 40
I want to update the value of price (not necessarily NaN) based on another row's value. The condition is roughly:
if type == 3:
    price = price of the row where type == 1
elif type == 4:
    price = price of the row where type == 2
elif type == 8:
    price = price of the row where type == 6
I tried the following for each case:
df.loc[df['type'] == 3, 'price'] = df[df['type']==1]['price'].iloc[0]
This works, except that the value 10 gets written for products A, B and C alike.
Is there a way for the value to be updated according to the product group? The expected outcome is as follows:
product type price
A 1 10
A 2 15
A 3 10
A 4 15
B 1 40
B 2 30
B 3 40
B 4 30
C 1 80
C 2 70
C 3 80
C 4 70
D 6 75
D 8 75
Note: using transform('first') may not be appropriate in this use case, since not every product necessarily has all of the types (1-10).
Thanks
IIUC, you need to fillna with the first value of each group:
df['price'] = df['price'].fillna(df.groupby('product')['price'].transform('first'))
print(df)
product type price
0 A 1 10.0
1 A 2 15.0
2 A 3 10.0
3 A 4 20.0
4 B 1 40.0
5 B 2 30.0
6 B 3 40.0
7 B 4 5.0
EDIT: based on the edited question, you can try pivoting and filling NaN from the respective columns:
piv = df.set_index(['product', 'type'])['price'].unstack()
piv = piv.fillna({3: piv[1], 4: piv[2]})
out = piv.stack().reset_index(name='price')
print(out)
product type price
0 A 1 10.0
1 A 2 15.0
2 A 3 10.0
3 A 4 20.0
4 B 1 40.0
5 B 2 30.0
6 B 3 40.0
7 B 4 5.0
8 C 1 80.0
9 C 2 70.0
10 C 3 90.0
11 C 4 70.0
You can try:
for x, y in zip([3, 4, 8], [1, 2, 6]):
    df.loc[df['type'].eq(x), 'price'] = df.loc[df['type'].eq(y), 'price'].values
Output:
product type price
0 A 1 10
1 A 2 15
2 A 3 10
3 A 4 15
4 B 1 40
5 B 2 30
6 B 3 40
7 B 4 30
8 C 1 80
9 C 2 70
10 C 3 80
11 C 4 70
12 D 6 75
13 D 8 75
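Note that the loop above assigns by position, so it assumes the type-3/4/8 rows and their type-1/2/6 counterparts appear in the same product order and in equal numbers. A sketch of an order-independent variant, assuming each (product, type) pair is unique:
mapping = {3: 1, 4: 2, 8: 6}                            # target type -> source type
lookup = df.set_index(['product', 'type'])['price']     # price keyed by (product, type)
target = df['type'].isin(mapping.keys())                # rows whose price gets rewritten
keys = list(zip(df.loc[target, 'product'],
                df.loc[target, 'type'].map(mapping)))   # (product, source type) pairs
df.loc[target, 'price'] = lookup.reindex(keys).to_numpy()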

How to duplicate rows and increment column where incremented column value does not exist?

I have a pandas Dataframe like this:
id alt amount
6 b 30
6 a 30
3 d 56
3 a 40
1 c 35
1 b 10
1 a 20
which I would like to be turned into this:
id alt amount
6 d 56
6 c 35
6 b 30
6 a 30
5 d 56
5 c 35
5 b 26
5 a 33.33
4 d 56
4 c 35
4 b 22
4 a 36.66
3 d 56
3 c 35
3 b 18
3 a 40
2 c 35
2 b 14
2 a 30
1 c 35
1 b 10
1 a 20
For each missing id number N that is less than the maximum id number, all alt values for the largest id number smaller than N should be duplicated, with the id number set to N for these duplicated rows. If the alt value is repeated for a larger id number, then the additional amount entries should increase by the difference divided by the difference between id values (the number of steps). If the alt value is not repeated, then the amount can simply be copied over for each additional id value.
For instance, a appears with id numbers 1, 3 and 6 and amounts 20, 40, 30 respectively. We need to add an instance of a with an id of 2. The amount in this will be 30, since it takes 2 steps to go from 1 to 3 and we are increasing by 20. Going from 3 to 6 there are 3 steps and we are decreasing by 10. -10/3 = -3.33, so we subtract 3.33 for each new instance of a.
I thought of doing some combination of duplicating, sorting, and forward filling? I'm unsure of the logic here though.
You can do this with pivot + reindex, then interpolate:
yourdf = (df.pivot(*df.columns)
            .reindex(range(df.id.min(), df.id.max() + 1))
            .interpolate(method='index')
            .stack()
            .reset_index())
yourdf
Out[51]:
id alt 0
0 1 a 20.000000
1 1 b 10.000000
2 1 c 35.000000
3 2 a 30.000000
4 2 b 14.000000
5 2 c 35.000000
6 3 a 40.000000
7 3 b 18.000000
8 3 c 35.000000
9 3 d 56.000000
10 4 a 36.666667
11 4 b 22.000000
12 4 c 35.000000
13 4 d 56.000000
14 5 a 33.333333
15 5 b 26.000000
16 5 c 35.000000
17 5 d 56.000000
18 6 a 30.000000
19 6 b 30.000000
20 6 c 35.000000
21 6 d 56.000000
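Two small follow-ups, as a sketch: in recent pandas versions the arguments of DataFrame.pivot are keyword-only, and the stacked value column comes back named 0, so to match the layout in the question you might use something like:
yourdf = (df.pivot(index='id', columns='alt', values='amount')
            .reindex(range(df.id.min(), df.id.max() + 1))
            .interpolate(method='index')
            .stack()
            .reset_index(name='amount')                   # restore the column name
            .sort_values(['id', 'alt'], ascending=False)  # descending order, as in the question
            .reset_index(drop=True))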
