Suppose I had:
import pandas as pd
import numpy as np
np.random.seed([3,1415])
s = pd.Series(np.random.choice((0, 1, 2, 3, 4, np.nan),
(50,), p=(.1, .1, .1, .1, .1, .5)))
I want to backfill missing values in the first half of the series and forward-fill in the second half. Middle out, if you will.
Expected output
0 4.0
1 4.0
2 4.0
3 4.0
4 4.0
5 0.0
6 1.0
7 1.0
8 1.0
9 1.0
10 1.0
11 1.0
12 1.0
13 1.0
14 1.0
15 1.0
16 1.0
17 1.0
18 4.0
19 1.0
20 2.0
21 0.0
22 0.0
23 NaN
24 NaN
25 NaN
26 NaN
27 3.0
28 2.0
29 4.0
30 4.0
31 4.0
32 4.0
33 0.0
34 0.0
35 0.0
36 0.0
37 2.0
38 2.0
39 2.0
40 2.0
41 1.0
42 1.0
43 0.0
44 2.0
45 2.0
46 2.0
47 2.0
48 2.0
49 2.0
dtype: float64
I just operate on the two halves independently here:
In [71]: s.ix[:len(s)/2].bfill().append(s.ix[1+len(s)/2:].ffill())
Out[71]:
0 4
1 4
2 4
3 4
4 4
5 0
6 1
7 1
8 1
9 1
10 1
11 1
12 1
13 1
14 1
15 1
16 1
17 1
18 4
19 1
20 2
21 0
22 0
23 NaN
24 NaN
25 NaN
26 NaN
27 3
28 2
29 4
30 4
31 4
32 4
33 0
34 0
35 0
36 0
37 2
38 2
39 2
40 2
41 1
42 1
43 0
44 2
45 2
46 2
47 2
48 2
49 2
dtype: float64
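For what it's worth, .ix was removed in pandas 1.0 and Series.append in 2.0, so on current pandas the same split can be sketched with positional slicing and pd.concat:
half = len(s) // 2   # 25 here
# backfill rows 0..half, forward-fill the rest, mirroring the .ix slices above
out = pd.concat([s.iloc[:half + 1].bfill(), s.iloc[half + 1:].ffill()])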
This is the base DataFrame:
g_accessor number_opened number_closed
0 49 - 20 3.0 1.0
1 50 - 20 2.0 14.0
2 51 - 20 1.0 6.0
3 52 - 20 0.0 6.0
4 1 - 21 1.0 4.0
5 2 - 21 3.0 5.0
6 3 - 21 4.0 11.0
7 4 - 21 2.0 7.0
8 5 - 21 6.0 10.0
9 6 - 21 2.0 8.0
10 7 - 21 4.0 9.0
11 8 - 21 2.0 3.0
12 9 - 21 2.0 1.0
13 10 - 21 1.0 11.0
14 11 - 21 6.0 3.0
15 12 - 21 3.0 3.0
16 13 - 21 2.0 6.0
17 14 - 21 5.0 9.0
18 15 - 21 9.0 13.0
19 16 - 21 7.0 7.0
20 17 - 21 9.0 4.0
21 18 - 21 3.0 8.0
22 19 - 21 6.0 3.0
23 20 - 21 6.0 1.0
24 21 - 21 3.0 5.0
25 22 - 21 5.0 3.0
26 23 - 21 1.0 0.0
I want to add a calculated new column number_active which relies on previous values. For this I'm trying to use pd.DataFrame.shift(), like this:
# Creating new column and setting all rows to 0
df['number_active'] = 0
# Active from previous period
PREVIOUS_PERIOD_ACTIVE = 22
# Calculating active value for first period in the DataFrame, based on `PREVIOUS_PERIOD_ACTIVE`
df.iat[0,3] = (df.iat[0,1] + PREVIOUS_PERIOD_ACTIVE) - df.iat[0,2]
# Calculating all rows using DataFrame.shift()
df['number_active'] = (df['number_opened'] + df['number_active'].shift(1)) - df['number_closed']
# Recalculating first active value as it was overwritten in the previous step.
df.iat[0,3] = (df.iat[0,1] + PREVIOUS_PERIOD_ACTIVE) - df.iat[0,2]
The result:
g_accessor number_opened number_closed number_active
0 49 - 20 3.0 1.0 24.0
1 50 - 20 2.0 14.0 12.0
2 51 - 20 1.0 6.0 -5.0
3 52 - 20 0.0 6.0 -6.0
4 1 - 21 1.0 4.0 -3.0
5 2 - 21 3.0 5.0 -2.0
6 3 - 21 4.0 11.0 -7.0
7 4 - 21 2.0 7.0 -5.0
8 5 - 21 6.0 10.0 -4.0
9 6 - 21 2.0 8.0 -6.0
10 7 - 21 4.0 9.0 -5.0
11 8 - 21 2.0 3.0 -1.0
12 9 - 21 2.0 1.0 1.0
13 10 - 21 1.0 11.0 -10.0
14 11 - 21 6.0 3.0 3.0
15 12 - 21 3.0 3.0 0.0
16 13 - 21 2.0 6.0 -4.0
17 14 - 21 5.0 9.0 -4.0
18 15 - 21 9.0 13.0 -4.0
19 16 - 21 7.0 7.0 0.0
20 17 - 21 9.0 4.0 5.0
21 18 - 21 3.0 8.0 -5.0
22 19 - 21 6.0 3.0 3.0
23 20 - 21 6.0 1.0 5.0
24 21 - 21 3.0 5.0 -2.0
25 22 - 21 5.0 3.0 2.0
26 23 - 21 1.0 0.0 1.0
Oddly, it seems that only the first active value (index 1) is calculated correctly (since the value at index 0 is calculated independently, via df.iat). For the rest of the values it seems that number_closed is interpreted as a negative value, for some reason.
What am I missing/doing wrong?
You are assuming that the result for the previous row is available when the current row is calculated. That is not how pandas calculations work: the entire right-hand side is evaluated against the column as it currently stands, and only then assigned. So shift(1) sees the zeroed-out number_active column (plus the one value you set via iat), not the freshly computed rows. Carrying a value across rows needs a genuinely cumulative operation such as cumsum.
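A tiny sketch (hypothetical frame) makes the evaluation order visible: shift(1) reads the column as it existed before the assignment, so no recurrence develops.
import pandas as pd

tmp = pd.DataFrame({'x': [0, 0, 0]})           # stand-in for the zeroed column
tmp['x'] = [10, 11, 12] + tmp['x'].shift(1)    # shift(1) is [NaN, 0, 0] here
print(tmp['x'].tolist())                       # [nan, 11.0, 12.0], not a running total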
I would calculate number_active with cumulative sums; a minimal example:
import pandas

df = pandas.DataFrame({'ignore': ['a', 'b', 'c', 'd', 'e'],
                       'number_opened': [3, 4, 5, 4, 3],
                       'number_closed': [1, 2, 2, 1, 2]})
# 22 is the PREVIOUS_PERIOD_ACTIVE balance carried in from before this frame
df['number_active'] = df['number_opened'].cumsum() + 22 - df['number_closed'].cumsum()
This gives a result of:
  ignore  number_opened  number_closed  number_active
0      a              3              1             24
1      b              4              2             26
2      c              5              2             29
3      d              4              1             32
4      e              3              2             33
The code in your question with my minimal example gave:
  ignore  number_opened  number_closed  number_active
0      a              3              1             24
1      b              4              2             26
2      c              5              2              3
3      d              4              1              3
4      e              3              2              1
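The same cumulative identity applies directly to the DataFrame in your question; a sketch, assuming the df and PREVIOUS_PERIOD_ACTIVE defined there:
# running balance: carried-in total, plus opened so far, minus closed so far
df['number_active'] = (PREVIOUS_PERIOD_ACTIVE
                       + df['number_opened'].cumsum()
                       - df['number_closed'].cumsum())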
I have two different datasets: total product data and selling data. I need to find the remaining products by comparing the product data against the selling data. I have done some general preprocessing to make both DataFrames ready to use, but I can't work out how to compare them.
DataFrame 1:
Item Qty
0 BUDS2 1.0
1 C100 4.0
2 CK1 5.0
3 DM10 10.0
4 DM7 2.0
5 DM9 9.0
6 HM12 6.0
7 HM13 4.0
8 HOCOX25(CTYPE) 1.0
9 HOCOX30USB 1.0
10 RM510 8.0
11 RM512 8.0
12 RM569 1.0
13 RM711 2.0
14 T2C 1.0
and
DataFrame 2 :
Item Name Quantity
0 BUDS2 2.0
1 C100 5.0
2 C101CABLE 1.0
3 CK1 8.0
4 DM10 12.0
5 DM7 5.0
6 DM9 10.0
7 G20CTYPE 1.0
8 G20NORMAL 1.0
9 HM12 9.0
10 HM13 8.0
11 HM9 3.0
12 HOCOX25CTYPE 3.0
13 HOCOX30USB 3.0
14 M45 1.0
15 REMAXRC080M 2.0
16 RM510 11.0
17 RM512 10.0
18 RM569 2.0
19 RM711 3.0
20 T2C 1.0
21 Y1 3.0
22 ZIRCON 1.0
I want to see the available quantity for each item, i.e. an output shaped like DataFrame 2 but with the Quantity column holding the values after the subtraction. How can I do that?
Expected Output:
Item Name Quantity
0 BUDS2 1.0
1 C100 1.0
2 C101CABLE 1.0
3 CK1 3.0
4 DM10 2.0
5 DM7 3.0
6 DM9 1.0
7 G20CTYPE 1.0
8 G20NORMAL 1.0
9 HM12 3.0
10 HM13 4.0
11 HM9 3.0
12 HOCOX25CTYPE 2.0
13 HOCOX30USB 2.0
14 M45 1.0
15 REMAXRC080M 2.0
16 RM510 3.0
17 RM512 2.0
18 RM569 1.0
19 RM711 1.0
20 T2C 0.0
21 Y1 3.0
22 ZIRCON 1.0
This can be done by merging the two DataFrames:
# left-join df_1's quantities onto df_2; fillna(0) covers items with no match in df_1
df_new = df_2.merge(df_1, 'left', left_on='Item Name', right_on='Item').fillna(0)
# subtract, then drop the helper columns brought in from df_1
df_new['Quantity'] = df_new['Quantity'] - df_new['Qty']
df_new = df_new.drop(['Item', 'Qty'], axis=1)
df_new output (note HOCOX25CTYPE stays at 3.0 rather than the expected 2.0: DataFrame 1 spells it HOCOX25(CTYPE), so the left merge finds no match and subtracts 0):
Item Name Quantity
0 BUDS2 1.0
1 C100 1.0
2 C101CABLE 1.0
3 CK1 3.0
4 DM10 2.0
5 DM7 3.0
6 DM9 1.0
7 G20CTYPE 1.0
8 G20NORMAL 1.0
9 HM12 3.0
10 HM13 4.0
11 HM9 3.0
12 HOCOX25CTYPE 3.0
13 HOCOX30USB 2.0
14 M45 1.0
15 REMAXRC080M 2.0
16 RM510 3.0
17 RM512 2.0
18 RM569 1.0
19 RM711 1.0
20 T2C 0.0
21 Y1 3.0
22 ZIRCON 1.0
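A merge-free alternative is plain index alignment; a sketch, assuming every Item in df_1 also appears as an Item Name in df_2 (the HOCOX25(CTYPE) spelling noted above breaks that assumption for one row here):
# subtract df_1's quantities by item; fill_value=0 covers items absent from df_1
remaining = (df_2.set_index('Item Name')['Quantity']
                 .sub(df_1.set_index('Item')['Qty'], fill_value=0))
The result is a Series indexed by item name; any item present only in df_1 shows up as a negative leftover row.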
I have a dataframe as such:
import numpy as np
import pandas as pd

df = pd.DataFrame({'val': [np.nan, np.nan, np.nan, np.nan, 15, 1, 5, 2,
                           np.nan, np.nan, np.nan, np.nan, np.nan, np.nan,
                           2, 23, 5, 12, np.nan, np.nan, 3, 4, 5]})
df['name'] = ['a']*8 + ['b']*15
df
>>>
val name
0 NaN a
1 NaN a
2 NaN a
3 NaN a
4 15.0 a
5 1.0 a
6 5.0 a
7 2.0 a
8 NaN b
9 NaN b
10 NaN b
11 NaN b
12 NaN b
13 NaN b
14 2.0 b
15 23.0 b
16 5.0 b
17 12.0 b
18 NaN b
19 NaN b
20 3.0 b
21 4.0 b
22 5.0 b
For each name I want to backfill with -1 up to 3 of the NaN spots immediately preceding each stretch of valid values, so that I end up with
>>>
val name
0 NaN a
1 -1.0 a
2 -1.0 a
3 -1.0 a
4 15.0 a
5 1.0 a
6 5.0 a
7 2.0 a
8 NaN b
9 NaN b
10 NaN b
11 -1.0 b
12 -1.0 b
13 -1.0 b
14 2.0 b
15 23.0 b
16 5.0 b
17 12.0 b
18 -1.0 b
19 -1.0 b
20 3.0 b
21 4.0 b
22 5.0 b
Note there can be multiple sections with NaNs. If a section has fewer than 3 NaNs it fills all of them (it backfills up to 3).
You can use first_valid_index to return the index of the first non-null value in each group, then assign the -1 using loc:
idx = df.groupby('name').val.apply(lambda x: x.first_valid_index())
for x in idx:
    df.loc[x - 3:x - 1, 'val'] = -1
df
Out[51]:
val name
0 NaN a
1 -1.0 a
2 -1.0 a
3 -1.0 a
4 15.0 a
5 1.0 a
6 5.0 a
7 2.0 a
8 NaN b
9 NaN b
10 NaN b
11 -1.0 b
12 -1.0 b
13 -1.0 b
14 2.0 b
15 23.0 b
16 5.0 b
17 12.0 b
Update: bfill with limit=3 also handles multiple NaN sections per group; then mark what got filled with -1:
s = df.groupby('name').val.bfill(limit=3)
s.loc[s.notnull() & df.val.isnull()] = -1
s
Out[59]:
0 NaN
1 -1.0
2 -1.0
3 -1.0
4 15.0
5 1.0
6 5.0
7 2.0
8 NaN
9 NaN
10 NaN
11 -1.0
12 -1.0
13 -1.0
14 2.0
15 23.0
16 5.0
17 12.0
18 NaN
19 -1.0
20 -1.0
21 -1.0
22 3.0
23 4.0
24 5.0
Name: val, dtype: float64
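To write the filled values back into the frame rather than keep the intermediate series, the same trick can be phrased as a mask; a sketch, assuming the df above:
# mark the NaNs that a 3-limited backfill reaches, then overwrite them in place
mask = df['val'].isna() & df.groupby('name')['val'].bfill(limit=3).notna()
df.loc[mask, 'val'] = -1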
I've got some SQL data that I'm grouping and performing some aggregation on. It works nicely:
grouped = df.groupby(['a', 'b'])
agged = grouped.aggregate({
    'c': [numpy.sum, numpy.mean, numpy.size],
    'd': [numpy.sum, numpy.mean, numpy.size]
})
which gives:
c d
sum mean size sum mean size
a b
25 20 107.0 0.804511 133.0 5328000 40060.150376 133
21 110.0 0.774648 142.0 6031000 42471.830986 142
23 126.0 0.792453 159.0 8795000 55314.465409 159
24 72.0 0.947368 76.0 2920000 38421.052632 76
25 54.0 0.818182 66.0 2570000 38939.393939 66
26 23 126.0 0.792453 159.0 8795000 55314.465409 159
but I want to fill all of the rows that are in a=25 but not in a=26 with zeros. In other words, something like:
c d
sum mean size sum mean size
a b
25 20 107.0 0.804511 133.0 5328000 40060.150376 133
21 110.0 0.774648 142.0 6031000 42471.830986 142
23 126.0 0.792453 159.0 8795000 55314.465409 159
24 72.0 0.947368 76.0 2920000 38421.052632 76
25 54.0 0.818182 66.0 2570000 38939.393939 66
26 20 0 0 0 0 0 0
21 0 0 0 0 0 0
23 126.0 0.792453 159.0 8795000 55314.465409 159
24 0 0 0 0 0 0
25 0 0 0 0 0 0
How can I do this?
Consider the dataframe df
df = pd.DataFrame(
np.random.randint(10, size=(6, 6)),
pd.MultiIndex.from_tuples(
[(25, 20), (25, 21), (25, 23), (25, 24), (25, 25), (26, 23)],
names=['a', 'b']
),
pd.MultiIndex.from_product(
[['c', 'd'], ['sum', 'mean', 'size']]
)
)
c d
sum mean size sum mean size
a b
25 20 8 3 5 5 0 2
21 3 7 8 9 2 7
23 2 1 3 2 5 4
24 9 0 1 7 1 6
25 1 9 3 5 8 8
26 23 8 8 4 8 0 5
You can quickly recover all missing rows from the cartesian product with unstack(fill_value=0) followed by stack
df.unstack(fill_value=0).stack()
c d
mean size sum mean size sum
a b
25 20 3 5 8 0 2 5
21 7 8 3 2 7 9
23 1 3 2 5 4 2
24 0 1 9 1 6 7
25 9 3 1 8 8 5
26 20 0 0 0 0 0 0
21 0 0 0 0 0 0
23 8 4 8 0 5 8
24 0 0 0 0 0 0
25 0 0 0 0 0 0
Note: Using fill_value=0 preserves the int dtype. Without it, the gaps get filled with NaN when unstacked and the dtypes are converted to float.
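Applied to the aggregated frame from the question this is a one-liner (a sketch, assuming the agged frame built earlier; note the inner column level comes back sorted, as in the output above):
agged = agged.unstack(fill_value=0).stack()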
print(df)
c d
sum mean size sum mean size
a b
25 20 107.0 0.804511 133.0 5328000 40060.150376 133
21 110.0 0.774648 142.0 6031000 42471.830986 142
23 126.0 0.792453 159.0 8795000 55314.465409 159
24 72.0 0.947368 76.0 2920000 38421.052632 76
25 54.0 0.818182 66.0 2570000 38939.393939 66
26 23 126.0 0.792453 159.0 8795000 55314.465409 159
I like:
df = df.unstack().replace(np.nan,0).stack(-1)
print(df)
c d
mean size sum mean size sum
a b
25 20 0.804511 133.0 107.0 40060.150376 133.0 5328000.0
21 0.774648 142.0 110.0 42471.830986 142.0 6031000.0
23 0.792453 159.0 126.0 55314.465409 159.0 8795000.0
24 0.947368 76.0 72.0 38421.052632 76.0 2920000.0
25 0.818182 66.0 54.0 38939.393939 66.0 2570000.0
26 20 0.000000 0.0 0.0 0.000000 0.0 0.0
21 0.000000 0.0 0.0 0.000000 0.0 0.0
23 0.792453 159.0 126.0 55314.465409 159.0 8795000.0
24 0.000000 0.0 0.0 0.000000 0.0 0.0
25 0.000000 0.0 0.0 0.000000 0.0 0.0
I have data
age
32
16
39
39
23
36
29
26
43
34
35
50
29
29
31
42
53
I need to get something like this (counts and percentages per age group, plus a Total row).
I can get
df.age.value_counts()
and
100. * df.age.value_counts() / len(df.age)
But how can I combine these and give names to the columns?
You can use cut with agg:
# helper df with min and max ages; the Total category must be added as well
df1 = pd.DataFrame({'G':['14 yo and younger','15-19','20-24','25-29','30-34',
                         '35-39','40-44','45-49','50-54','55-59','60-64','65+','Total'],
                    'Min':[0, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, np.nan],
                    'Max':[14, 19, 24, 29, 34, 39, 44, 49, 54, 59, 64, 120, np.nan]})
print (df1)
G Max Min
0 14 yo and younger 14.0 0.0
1 15-19 19.0 15.0
2 20-24 24.0 20.0
3 25-29 29.0 25.0
4 30-34 34.0 30.0
5 35-39 39.0 35.0
6 40-44 44.0 40.0
7 45-49 49.0 45.0
8 50-54 54.0 50.0
9 55-59 59.0 55.0
10 60-64 64.0 60.0
11 65+ 120.0 65.0
12 Total NaN NaN
cutoff = np.hstack([np.array(df1.Min[0]), df1.Max.values])
labels = df1.G.values
df['Groups'] = pd.cut(df.age, bins=cutoff, labels=labels, right=True, include_lowest=True)
print (df)
age Groups
0 32 30-34
1 16 15-19
2 39 35-39
3 39 35-39
4 23 20-24
5 36 35-39
6 29 25-29
7 26 25-29
8 43 40-44
9 34 30-34
10 35 35-39
11 50 50-54
12 29 25-29
13 29 25-29
14 31 30-34
15 42 40-44
16 53 50-54
df = (df.groupby('Groups')['Groups']
        .agg({'Total': [len, lambda x: len(x) / df.shape[0] * 100]})
        .rename(columns={'len': 'N', '<lambda>': '%'}))
# last Total row (.loc instead of the removed .ix)
df.loc['Total'] = df.sum()
print (df)
Total
N %
Groups
14 yo and younger 0.0 0.000000
15-19 1.0 5.882353
20-24 1.0 5.882353
25-29 4.0 23.529412
30-34 3.0 17.647059
35-39 4.0 23.529412
40-44 2.0 11.764706
45-49 0.0 0.000000
50-54 2.0 11.764706
55-59 0.0 0.000000
60-64 0.0 0.000000
65+ 0.0 0.000000
Total 17.0 100.000000
EDIT1:
A solution with size scales better:
df1 = df.groupby('Groups').size().to_frame()
df1.columns = pd.MultiIndex.from_tuples([('Total', 'N')])
df1.loc[:, ('Total', '%')] = 100 * df1.loc[:, ('Total', 'N')] / df.shape[0]
df1.loc['Total'] = df1.sum()
print (df1)
Total
N %
Groups
14 yo and younger 0.0 0.000000
15-19 1.0 5.882353
20-24 1.0 5.882353
25-29 4.0 23.529412
30-34 3.0 17.647059
35-39 4.0 23.529412
40-44 2.0 11.764706
45-49 0.0 0.000000
50-54 2.0 11.764706
55-59 0.0 0.000000
60-64 0.0 0.000000
65+ 0.0 0.000000
Total 17.0 100.000000
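For reference, .ix is gone from current pandas and the dict-based column renaming inside agg no longer works either, so here is a sketch of the same idea that runs today, assuming the df of ages above (the bin edges and labels mirror df1):
labels = ['14 yo and younger', '15-19', '20-24', '25-29', '30-34', '35-39',
          '40-44', '45-49', '50-54', '55-59', '60-64', '65+']
bins = [0, 14, 19, 24, 29, 34, 39, 44, 49, 54, 59, 64, 120]

groups = pd.cut(df['age'], bins=bins, labels=labels, include_lowest=True)
out = groups.value_counts(sort=False).rename('N').to_frame()
out['%'] = 100 * out['N'] / len(df)
out.index = out.index.astype(str)   # plain index so the 'Total' row can be appended
out.loc['Total'] = out.sum()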