Calculated column with shift - python

This is the base DataFrame:
g_accessor number_opened number_closed
0 49 - 20 3.0 1.0
1 50 - 20 2.0 14.0
2 51 - 20 1.0 6.0
3 52 - 20 0.0 6.0
4 1 - 21 1.0 4.0
5 2 - 21 3.0 5.0
6 3 - 21 4.0 11.0
7 4 - 21 2.0 7.0
8 5 - 21 6.0 10.0
9 6 - 21 2.0 8.0
10 7 - 21 4.0 9.0
11 8 - 21 2.0 3.0
12 9 - 21 2.0 1.0
13 10 - 21 1.0 11.0
14 11 - 21 6.0 3.0
15 12 - 21 3.0 3.0
16 13 - 21 2.0 6.0
17 14 - 21 5.0 9.0
18 15 - 21 9.0 13.0
19 16 - 21 7.0 7.0
20 17 - 21 9.0 4.0
21 18 - 21 3.0 8.0
22 19 - 21 6.0 3.0
23 20 - 21 6.0 1.0
24 21 - 21 3.0 5.0
25 22 - 21 5.0 3.0
26 23 - 21 1.0 0.0
I want to add a calculated new column number_active which relies on previous values. For this I'm trying to use pd.DataFrame.shift(), like this:
# Creating new column and setting all rows to 0
df['number_active'] = 0
# Active from previous period
PREVIOUS_PERIOD_ACTIVE = 22
# Calculating active value for first period in the DataFrame, based on `PREVIOUS_PERIOD_ACTIVE`
df.iat[0,3] = (df.iat[0,1] + PREVIOUS_PERIOD_ACTIVE) - df.iat[0,2]
# Calculating all columns using DataFrame.shift()
df['number_active'] = (df['number_opened'] + df['number_active'].shift(1)) - df['number_closed']
# Recalculating first active value as it was overwritten in the previous step.
df.iat[0,3] = (df.iat[0,1] + PREVIOUS_PERIOD_ACTIVE) - df.iat[0,2]
The result:
g_accessor number_opened number_closed number_active
0 49 - 20 3.0 1.0 24.0
1 50 - 20 2.0 14.0 12.0
2 51 - 20 1.0 6.0 -5.0
3 52 - 20 0.0 6.0 -6.0
4 1 - 21 1.0 4.0 -3.0
5 2 - 21 3.0 5.0 -2.0
6 3 - 21 4.0 11.0 -7.0
7 4 - 21 2.0 7.0 -5.0
8 5 - 21 6.0 10.0 -4.0
9 6 - 21 2.0 8.0 -6.0
10 7 - 21 4.0 9.0 -5.0
11 8 - 21 2.0 3.0 -1.0
12 9 - 21 2.0 1.0 1.0
13 10 - 21 1.0 11.0 -10.0
14 11 - 21 6.0 3.0 3.0
15 12 - 21 3.0 3.0 0.0
16 13 - 21 2.0 6.0 -4.0
17 14 - 21 5.0 9.0 -4.0
18 15 - 21 9.0 13.0 -4.0
19 16 - 21 7.0 7.0 0.0
20 17 - 21 9.0 4.0 5.0
21 18 - 21 3.0 8.0 -5.0
22 19 - 21 6.0 3.0 3.0
23 20 - 21 6.0 1.0 5.0
24 21 - 21 3.0 5.0 -2.0
25 22 - 21 5.0 3.0 2.0
26 23 - 21 1.0 0.0 1.0
Oddly, only the value at index 1 seems to be calculated correctly (the value at index 0 is set independently, via df.iat). For the rest of the values, it seems that number_closed is interpreted as a negative value for some reason.
What am I missing/doing wrong?

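You are assuming that the result for the previous row is available when the current row is calculated. This is not how pandas calculations work. The right-hand side is evaluated for the whole column at once, so df['number_active'].shift(1) sees the column as it was before the assignment: 24 in the first row and 0 everywhere else. Pandas treats each row in isolation unless you apply multi-row operations like cumsum and shift, and shift only looks at values that already exist.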
I would calculate the number active with a minimal example as:
import pandas
df = pandas.DataFrame({'ignore': ['a','b','c','d','e'], 'number_opened': [3,4,5,4,3], 'number_closed':[1,2,2,1,2]})
df['number_active'] = df['number_opened'].cumsum() + 22 - df['number_closed'].cumsum()
This gives a result of:
  ignore  number_opened  number_closed  number_active
0      a              3              1             24
1      b              4              2             26
2      c              5              2             29
3      d              4              1             32
4      e              3              2             33
The code in your question with my minimal example gave:
  ignore  number_opened  number_closed  number_active
0      a              3              1             24
1      b              4              2             26
2      c              5              2              3
3      d              4              1              3
4      e              3              2              1
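To see that the cumsum formulation really matches the row-by-row recurrence active[i] = active[i-1] + opened[i] - closed[i], here is a minimal sketch on the same five rows. The column names active_loop and active_cumsum are made up for the comparison, and the starting value of 22 is taken from the question.

```python
import pandas as pd

PREVIOUS_PERIOD_ACTIVE = 22  # active count carried over from the previous period

df = pd.DataFrame({
    "number_opened": [3, 4, 5, 4, 3],
    "number_closed": [1, 2, 2, 1, 2],
})

# Explicit recurrence: each row depends on the result of the row before it.
active = []
previous = PREVIOUS_PERIOD_ACTIVE
for opened, closed in zip(df["number_opened"], df["number_closed"]):
    previous = previous + opened - closed
    active.append(previous)
df["active_loop"] = active

# Vectorized equivalent: starting value plus the running sum of net changes.
net_change = df["number_opened"] - df["number_closed"]
df["active_cumsum"] = PREVIOUS_PERIOD_ACTIVE + net_change.cumsum()

print(df)
```

Both columns come out as 24, 26, 29, 32, 33. The loop is only there to make the dependency explicit; on a real frame the cumsum version is the one to use.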

Related

add two dataframe with unequal index [duplicate]

This question already has answers here:
Pandas: how to merge two dataframes on a column by keeping the information of the first one?
(4 answers)
Pandas Merging 101
(8 answers)
Closed 1 year ago.
a:
Length 10kN
0 0.0 5
1 0.5 5
2 1.0 5
3 1.5 5
4 2.0 5
5 2.5 5
6 3.0 5
7 3.5 5
8 4.0 5
9 4.5 5
10 5.0 5
11 5.0 -5
12 5.5 -5
13 6.0 -5
14 6.5 -5
15 7.0 -5
16 7.5 -5
17 8.0 -5
18 8.5 -5
19 9.0 -5
20 9.5 -5
21 10.0 -5
b:
Length1 20kN
0 0.0 50
1 0.5 45
2 1.0 40
3 1.5 35
4 2.0 30
5 2.5 25
6 3.0 20
7 3.5 15
8 4.0 10
9 4.5 5
10 5.0 0
11 5.5 -5
12 6.0 -10
13 6.5 -15
14 7.0 -20
15 7.5 -25
16 8.0 -30
17 8.5 -35
18 9.0 -40
19 9.5 -45
20 10.0 -50
c as a result of my code below:
Length 10kN Length1 20kN Total
0 0.0 5 0.0 50.0 55.0
1 0.5 5 0.5 45.0 50.0
2 1.0 5 1.0 40.0 45.0
3 1.5 5 1.5 35.0 40.0
4 2.0 5 2.0 30.0 35.0
5 2.5 5 2.5 25.0 30.0
6 3.0 5 3.0 20.0 25.0
7 3.5 5 3.5 15.0 20.0
8 4.0 5 4.0 10.0 15.0
9 4.5 5 4.5 5.0 10.0
10 5.0 5 5.0 0.0 5.0
11 5.0 -5 5.5 -5.0 -10.0
12 5.5 -5 6.0 -10.0 -15.0
13 6.0 -5 6.5 -15.0 -20.0
14 6.5 -5 7.0 -20.0 -25.0
15 7.0 -5 7.5 -25.0 -30.0
16 7.5 -5 8.0 -30.0 -35.0
17 8.0 -5 8.5 -35.0 -40.0
18 8.5 -5 9.0 -40.0 -45.0
19 9.0 -5 9.5 -45.0 -50.0
20 9.5 -5 10.0 -50.0 -55.0
21 10.0 -5 NaN NaN NaN
Code I tried:
import pandas as pd
a=pd.read_csv("first.csv")
b=pd.read_csv("second.csv")
c=pd.concat([a,b], axis=1)
c['Total']=c['10kN']+c['20kN']
print(c['Total'])
print(a)
print(b)
print(c)
I want to add the two columns 10kN and 20kN for rows that have the same length.
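One possible approach, sketched on a shortened copy of the data rather than the CSV files: pd.concat with axis=1 pairs rows by index position, which is why the totals drift once a has the repeated Length of 5.0. Merging on the length columns pairs rows by value instead. This assumes the goal is to add the loads wherever Length equals Length1.

```python
import pandas as pd

# Shortened stand-ins for first.csv and second.csv (values from the question).
a = pd.DataFrame({"Length": [0.0, 0.5, 5.0, 5.0, 5.5],
                  "10kN": [5, 5, 5, -5, -5]})
b = pd.DataFrame({"Length1": [0.0, 0.5, 5.0, 5.5],
                  "20kN": [50, 45, 0, -5]})

# A left merge keeps every row of a and looks up the row of b with the same length.
c = a.merge(b, how="left", left_on="Length", right_on="Length1")
c["Total"] = c["10kN"] + c["20kN"]

print(c)
```

Both rows of a with Length 5.0 are matched to the single 5.0 row of b, so the totals stay aligned with the lengths. Merging on float keys relies on exact equality, which holds for these multiples of 0.5 but may need rounding for other data.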

Convert column vector into multi-column matrix

I have a column vector with, say, 30 values (1-30). I would like to manipulate this vector so that it becomes a matrix with 5 values in the first column, 10 values in the second and 15 values in the third column. How would I implement this using Pandas or NumPy?
import numpy as np
import pandas as pd
#Create data
df = pd.DataFrame(np.linspace(1,30,30))
print(df)
1
2
:
28
29
30
In order to get something like this:
# Manipulate the column vector to make columns where the first column has 5
# the second column has 10 and the last column has 15 values
'T1' 'T2' 'T3'
1 6 16
2 7 17
3 8 18
4 9 19
5 10 20
NA 11 21
NA 12 22
NA 13 23
NA 14 24
NA 15 25
NA NA 26
NA NA 27
NA NA 28
NA NA 29
NA NA 30
It took a little time to work out which series this is: it is the triangular series, just a modified one.
tri = lambda x:int((0.25+2*x)**0.5-0.5)
This would give results like:
0 1 1 2 2 2 3 3 3 3 4 4 4 4 4 5 5 5 5 5 5 ...
And after the modification:
modtri = lambda x:int((0.25+2*(x//5))**0.5-0.5)
0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 ...
So each occurrence in normal triangular series repeats 5 times.
The above modtri function would directly map the index starting from 0, to appropriate group ids.
and so after that, this would do the job:
df[0].groupby(modtri).apply(lambda x: pd.Series(x.values)).unstack().T
Full execution:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.linspace(1,30,30))
N = 5 #the increment value
modtri = lambda x:int((0.25+2*(x//N))**0.5-0.5)
df2 = df[0].groupby(modtri).apply(lambda x: pd.Series(x.values)).unstack().T
df2.rename(columns={0: "T1", 1: "T2",2:"T3"},inplace=True)
print(df2)
Output:
T1 T2 T3
0 1.0 6.0 16.0
1 2.0 7.0 17.0
2 3.0 8.0 18.0
3 4.0 9.0 19.0
4 5.0 10.0 20.0
5 NaN 11.0 21.0
6 NaN 12.0 22.0
7 NaN 13.0 23.0
8 NaN 14.0 24.0
9 NaN 15.0 25.0
10 NaN NaN 26.0
11 NaN NaN 27.0
12 NaN NaN 28.0
13 NaN NaN 29.0
14 NaN NaN 30.0
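As a quick sanity check on the group ids (a sketch, not part of the answer above), counting how many of the indices 0 to 29 modtri sends to each group should give the 5, 10 and 15 split the question asks for.

```python
from collections import Counter

N = 5  # the increment value, as in the answer

def modtri(x):
    # Map a zero-based row index to its group id (0, 1, 2, ...).
    return int((0.25 + 2 * (x // N)) ** 0.5 - 0.5)

# Count how many row indices fall into each group.
sizes = Counter(modtri(i) for i in range(30))
print(sorted(sizes.items()))
```

This prints [(0, 5), (1, 10), (2, 15)], so the first 5 rows go to T1, the next 10 to T2 and the last 15 to T3.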
Try slicing the column and resetting the index of each slice:
df['T1'] = df[0][0:5]
df['T2'] = df[0][5:15].reset_index(drop=True)
df['T3'] = df[0][15:].reset_index(drop=True)
Original data before operation:
df = pd.DataFrame(np.linspace(1,30,30))
print(df)
0
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 6.0
6 7.0
7 8.0
8 9.0
9 10.0
10 11.0
11 12.0
12 13.0
13 14.0
14 15.0
15 16.0
16 17.0
17 18.0
18 19.0
19 20.0
20 21.0
21 22.0
22 23.0
23 24.0
24 25.0
25 26.0
26 27.0
27 28.0
28 29.0
29 30.0
Running the new code:
df['T1'] = df[0][0:5]
df['T2'] = df[0][5:15].reset_index(drop=True)
df['T3'] = df[0][15:].reset_index(drop=True)
print(df)
0 T1 T2 T3
0 1.0 1.0 6.0 16.0
1 2.0 2.0 7.0 17.0
2 3.0 3.0 8.0 18.0
3 4.0 4.0 9.0 19.0
4 5.0 5.0 10.0 20.0
5 6.0 NaN 11.0 21.0
6 7.0 NaN 12.0 22.0
7 8.0 NaN 13.0 23.0
8 9.0 NaN 14.0 24.0
9 10.0 NaN 15.0 25.0
10 11.0 NaN NaN 26.0
11 12.0 NaN NaN 27.0
12 13.0 NaN NaN 28.0
13 14.0 NaN NaN 29.0
14 15.0 NaN NaN 30.0
15 16.0 NaN NaN NaN
16 17.0 NaN NaN NaN
17 18.0 NaN NaN NaN
18 19.0 NaN NaN NaN
19 20.0 NaN NaN NaN
20 21.0 NaN NaN NaN
21 22.0 NaN NaN NaN
22 23.0 NaN NaN NaN
23 24.0 NaN NaN NaN
24 25.0 NaN NaN NaN
25 26.0 NaN NaN NaN
26 27.0 NaN NaN NaN
27 28.0 NaN NaN NaN
28 29.0 NaN NaN NaN
29 30.0 NaN NaN NaN
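If the original 30-row column should not be carried along, a variation on the slicing idea is to split the values at the cumulative column lengths and concatenate the pieces side by side. This is only a sketch; the names sizes, bounds and wide are chosen for the example.

```python
import numpy as np
import pandas as pd

values = pd.Series(np.arange(1, 31, dtype=float))

sizes = [5, 10, 15]             # desired column lengths from the question
bounds = np.cumsum(sizes)[:-1]  # split points: [5, 15]

# np.split cuts the array at the given positions into three pieces.
chunks = np.split(values.to_numpy(), bounds)

# Each piece gets a fresh 0-based index, so shorter columns are padded with NaN.
wide = pd.concat([pd.Series(chunk) for chunk in chunks],
                 axis=1, keys=["T1", "T2", "T3"])

print(wide)
```

The result has 15 rows and only the columns T1, T2 and T3, and changing sizes is enough to get a different layout.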

Subtraction between two dataframes' columns

I have two different datasets: total product data and selling data. I need to find the remaining products by comparing the product data with the selling data. I have done some general preprocessing and made both dataframes ready to use, but I can't work out how to compare them.
DataFrame 1:
Item Qty
0 BUDS2 1.0
1 C100 4.0
2 CK1 5.0
3 DM10 10.0
4 DM7 2.0
5 DM9 9.0
6 HM12 6.0
7 HM13 4.0
8 HOCOX25(CTYPE) 1.0
9 HOCOX30USB 1.0
10 RM510 8.0
11 RM512 8.0
12 RM569 1.0
13 RM711 2.0
14 T2C 1.0
and
DataFrame 2 :
Item Name Quantity
0 BUDS2 2.0
1 C100 5.0
2 C101CABLE 1.0
3 CK1 8.0
4 DM10 12.0
5 DM7 5.0
6 DM9 10.0
7 G20CTYPE 1.0
8 G20NORMAL 1.0
9 HM12 9.0
10 HM13 8.0
11 HM9 3.0
12 HOCOX25CTYPE 3.0
13 HOCOX30USB 3.0
14 M45 1.0
15 REMAXRC080M 2.0
16 RM510 11.0
17 RM512 10.0
18 RM569 2.0
19 RM711 3.0
20 T2C 1.0
21 Y1 3.0
22 ZIRCON 1.0
I want to see the available quantity for each item. The output should look like DataFrame 2, but with the Quantity column holding the values after the subtraction. How can I do that?
Expected Output:
Item Name Quantity
0 BUDS2 1.0
1 C100 1.0
2 C101CABLE 1.0
3 CK1 3.0
4 DM10 2.0
5 DM7 3.0
6 DM9 1.0
7 G20CTYPE 1.0
8 G20NORMAL 1.0
9 HM12 3.0
10 HM13 4.0
11 HM9 3.0
12 HOCOX25CTYPE 2.0
13 HOCOX30USB 2.0
14 M45 1.0
15 REMAXRC080M 2.0
16 RM510 3.0
17 RM512 2.0
18 RM569 1.0
19 RM711 1.0
20 T2C 0.0
21 Y1 3.0
22 ZIRCON 1.0
You can do this by merging the two dataframes and subtracting the matched quantities:
df_new = df_2.merge(df_1, how='left', left_on='Item Name', right_on='Item').fillna(0)
df_new.Quantity = df_new.Quantity - df_new.Qty
df_new = df_new.drop(['Item','Qty'],axis=1)
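df_new output (HOCOX25CTYPE stays at 3.0 rather than the expected 2.0, because DataFrame 1 spells that item HOCOX25(CTYPE) and the merge finds no match):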
Item Name Quantity
0 BUDS2 1.0
1 C100 1.0
2 C101CABLE 1.0
3 CK1 3.0
4 DM10 2.0
5 DM7 3.0
6 DM9 1.0
7 G20CTYPE 1.0
8 G20NORMAL 1.0
9 HM12 3.0
10 HM13 4.0
11 HM9 3.0
12 HOCOX25CTYPE 3.0
13 HOCOX30USB 2.0
14 M45 1.0
15 REMAXRC080M 2.0
16 RM510 3.0
17 RM512 2.0
18 RM569 1.0
19 RM711 1.0
20 T2C 0.0
21 Y1 3.0
22 ZIRCON 1.0
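One way to handle spelling differences such as HOCOX25(CTYPE) versus HOCOX25CTYPE is to normalize the item names before matching. The sketch below uses a few rows from the question and assumes the names differ only by punctuation, which may not hold for the full data; the frame names sold and stock are made up for the example.

```python
import pandas as pd

# A few rows from the question: sold quantities and total stock.
sold = pd.DataFrame({"Item": ["BUDS2", "HOCOX25(CTYPE)", "T2C"],
                     "Qty": [1.0, 1.0, 1.0]})
stock = pd.DataFrame({"Item Name": ["BUDS2", "G20CTYPE", "HOCOX25CTYPE", "T2C"],
                      "Quantity": [2.0, 1.0, 3.0, 1.0]})

# Assumption: stripping everything except letters and digits makes the keys comparable.
key = sold["Item"].str.replace(r"[^A-Za-z0-9]", "", regex=True)

# Total sold per normalized item name.
sold_by_item = sold.groupby(key)["Qty"].sum()

# Look up the sold quantity for each stock item; unmatched items count as 0 sold.
remaining = stock.copy()
remaining["Quantity"] = stock["Quantity"] - stock["Item Name"].map(sold_by_item).fillna(0)

print(remaining)
```

With this, HOCOX25CTYPE ends up at 2.0, items that were never sold keep their full quantity, and the result keeps the shape of DataFrame 2.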

Pandas backfill specific value

I have a dataframe like this:
df = pd.DataFrame({'val': [np.nan,np.nan,np.nan,np.nan, 15, 1, 5, 2,np.nan, np.nan, np.nan, np.nan,np.nan,np.nan,2,23,5,12, np.nan, np.nan, 3,4,5]})
df['name'] = ['a']*8 + ['b']*15
df
>>>
val name
0 NaN a
1 NaN a
2 NaN a
3 NaN a
4 15.0 a
5 1.0 a
6 5.0 a
7 2.0 a
8 NaN b
9 NaN b
10 NaN b
11 NaN b
12 NaN b
13 NaN b
14 2.0 b
15 23.0 b
16 5.0 b
17 12.0 b
18 NaN b
19 NaN b
20 3.0 b
21 4.0 b
22 5.0 b
For each name, I want to backfill up to 3 NaN spots immediately before a valid value with -1, so that I end up with
>>>
val name
0 NaN a
1 -1.0 a
2 -1.0 a
3 -1.0 a
4 15.0 a
5 1.0 a
6 5.0 a
7 2.0 a
8 NaN b
9 NaN b
10 NaN b
11 -1.0 b
12 -1.0 b
13 -1.0 b
14 2.0 b
15 23.0 b
16 5.0 b
17 12.0 b
18 -1.0 b
19 -1.0 b
20 3.0 b
21 4.0 b
22 5.0 b
Note that there can be multiple sections with NaN. If a section has fewer than 3 NaNs, all of them are filled (it backfills up to 3).
You can use first_valid_index, which returns the index label of the first non-null value in each group, and then assign -1 with loc. This only handles the leading run of NaN in each group and relies on the default integer index:
idx=df.groupby('name').val.apply(lambda x : x.first_valid_index())
for x in idx:
    df.loc[x - 3:x - 1, 'val'] = -1
df
Out[51]:
val name
0 NaN a
1 -1.0 a
2 -1.0 a
3 -1.0 a
4 15.0 a
5 1.0 a
6 5.0 a
7 2.0 a
8 NaN b
9 NaN b
10 NaN b
11 -1.0 b
12 -1.0 b
13 -1.0 b
14 2.0 b
15 23.0 b
16 5.0 b
17 12.0 b
Update
s=df.groupby('name').val.bfill(limit=3)
s.loc[s.notnull()&df.val.isnull()]=-1
s
Out[59]:
0 NaN
1 -1.0
2 -1.0
3 -1.0
4 15.0
5 1.0
6 5.0
7 2.0
8 NaN
9 NaN
10 NaN
11 -1.0
12 -1.0
13 -1.0
14 2.0
15 23.0
16 5.0
17 12.0
18 NaN
19 -1.0
20 -1.0
21 -1.0
22 3.0
23 4.0
24 5.0
Name: val, dtype: float64
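For reference, here is a self-contained sketch of the bfill(limit=3) idea applied to the 23-row frame from the question. The helper names filled and reached are made up for the example; the logic is the same as in the update: back fill within each name, then put -1 wherever a NaN was reached by the fill.

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame({"val": [nan] * 4 + [15, 1, 5, 2]
                          + [nan] * 6 + [2, 23, 5, 12]
                          + [nan, nan] + [3, 4, 5]})
df["name"] = ["a"] * 8 + ["b"] * 15

# Back fill at most 3 positions, separately for each name.
filled = df.groupby("name")["val"].bfill(limit=3)

# Positions that were NaN before and got a value from the back fill.
reached = filled.notna() & df["val"].isna()

# Write -1 into exactly those positions and leave everything else untouched.
df["val"] = df["val"].mask(reached, -1)

print(df)
```

On this data, rows 1 to 3, 11 to 13 and 18 to 19 become -1.0, while rows 0 and 8 to 10 stay NaN, which matches the output requested in the question.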

pandas series bfill first half, ffill second half

Suppose I had:
import pandas as pd
import numpy as np
np.random.seed([3,1415])
s = pd.Series(np.random.choice((0, 1, 2, 3, 4, np.nan),
(50,), p=(.1, .1, .1, .1, .1, .5)))
I want to back fill in missing values for the first half of the series and forward fill for the second half of the series. Middle out, if you will.
Expected output
0 4.0
1 4.0
2 4.0
3 4.0
4 4.0
5 0.0
6 1.0
7 1.0
8 1.0
9 1.0
10 1.0
11 1.0
12 1.0
13 1.0
14 1.0
15 1.0
16 1.0
17 1.0
18 4.0
19 1.0
20 2.0
21 0.0
22 0.0
23 NaN
24 NaN
25 NaN
26 NaN
27 3.0
28 2.0
29 4.0
30 4.0
31 4.0
32 4.0
33 0.0
34 0.0
35 0.0
36 0.0
37 2.0
38 2.0
39 2.0
40 2.0
41 1.0
42 1.0
43 0.0
44 2.0
45 2.0
46 2.0
47 2.0
48 2.0
49 2.0
dtype: float64
I just operate on the two halves independently here:
In [71]: pd.concat([s.loc[:len(s)//2].bfill(), s.loc[len(s)//2 + 1:].ffill()])
Out[71]:
0 4
1 4
2 4
3 4
4 4
5 0
6 1
7 1
8 1
9 1
10 1
11 1
12 1
13 1
14 1
15 1
16 1
17 1
18 4
19 1
20 2
21 0
22 0
23 NaN
24 NaN
25 NaN
26 NaN
27 3
28 2
29 4
30 4
31 4
32 4
33 0
34 0
35 0
36 0
37 2
38 2
39 2
40 2
41 1
42 1
43 0
44 2
45 2
46 2
47 2
48 2
49 2
dtype: float64
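The same idea can be wrapped in a small helper. This is only a sketch: the name middle_out_fill is made up, and it splits by position with iloc instead of by label, so it also works when the index is not 0, 1, 2 and so on.

```python
import numpy as np
import pandas as pd

def middle_out_fill(s):
    """Back fill the first half and forward fill the second half, by position."""
    mid = len(s) // 2
    first_half = s.iloc[:mid].bfill()   # NaN takes the next valid value in this half
    second_half = s.iloc[mid:].ffill()  # NaN takes the previous valid value in this half
    return pd.concat([first_half, second_half])

s = pd.Series([np.nan, 1.0, np.nan, np.nan, np.nan, 2.0, np.nan, np.nan])
result = middle_out_fill(s)
print(result)
```

Here positions 2 to 4 stay NaN because neither half has a value on the side it fills from, just like rows 23 to 26 in the output above.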
