Here's my dataset
customer_id hour size
1 0 1
1 1 18
2 1 7
Here's my code
table = a.pivot_table(index=['customer_id'],
columns='hour',
fill_value=0,
values='size')
Here's what I've got
hour 0 1
customer_id
1 1 18
2 0 7
What I need
hour 0 1 count sum
customer_id
1 1 18 2 19
2 0 7 1 7
count 1 2
sum 1 25
count is the number of non-zero values in each row or column, and sum is the total of each row or column.
One fairly dynamic solution is to omit fill_value=0, so that missing cells stay NaN and count ignores them:
table = a.pivot_table(index='customer_id',
columns='hour',
values='size')
print (table)
hour 0 1
customer_id
1 1.0 18.0
2 NaN 7.0
a = table.agg(['count','sum'])
b = table.T.agg(['count','sum']).T
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
print (pd.concat([table.fillna(0), a]).join(b))
0 1 count sum
1 1.0 18.0 2.0 19.0
2 0.0 7.0 1.0 7.0
count 1.0 2.0 NaN NaN
sum 1.0 25.0 NaN NaN
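For reference, here is a minimal self-contained sketch of the same idea. It rebuilds the input frame from the question's sample rows and uses pd.concat; the names col_stats, row_stats and result are only illustrative.

```python
import pandas as pd

# Sample data copied from the question
a = pd.DataFrame({"customer_id": [1, 1, 2],
                  "hour": [0, 1, 1],
                  "size": [1, 18, 7]})

# No fill_value, so missing combinations stay NaN and are skipped by count
table = a.pivot_table(index="customer_id", columns="hour", values="size")

col_stats = table.agg(["count", "sum"])      # one count/sum per hour
row_stats = table.T.agg(["count", "sum"]).T  # one count/sum per customer

# Append the per-hour rows, then attach the per-customer columns
result = pd.concat([table.fillna(0), col_stats]).join(row_stats)
print(result)
```

The bottom-right corner stays NaN because the count and sum rows have no entry in row_stats.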
I have the data below. Recent pandas versions no longer keep the grouping columns in the result of a groupby followed by fillna/ffill/bfill. Is there a way to keep the grouping columns in the output?
data = """one;two;three
1;1;10
1;1;nan
1;1;nan
1;2;nan
1;2;20
1;2;nan
1;3;nan
1;3;nan"""
df = pd.read_csv(io.StringIO(data), sep=";")
print(df)
one two three
0 1 1 10.0
1 1 1 NaN
2 1 1 NaN
3 1 2 NaN
4 1 2 20.0
5 1 2 NaN
6 1 3 NaN
7 1 3 NaN
print(df.groupby(['one','two']).ffill())
three
0 10.0
1 10.0
2 10.0
3 NaN
4 20.0
5 20.0
6 NaN
7 NaN
With the most recent pandas, if we want to keep the groupby columns, we need to add apply here (group_keys=False keeps the original flat index):
out = df.groupby(['one','two'], group_keys=False).apply(lambda x : x.ffill())
Out[219]:
one two three
0 1 1 10.0
1 1 1 10.0
2 1 1 10.0
3 1 2 NaN
4 1 2 20.0
5 1 2 20.0
6 1 3 NaN
7 1 3 NaN
Is this what you expect?
df['three']= df.groupby(['one','two'])['three'].ffill()
print(df)
# Output:
one two three
0 1 1 10.0
1 1 1 10.0
2 1 1 10.0
3 1 2 NaN
4 1 2 20.0
5 1 2 20.0
6 1 3 NaN
7 1 3 NaN
Yes. Set the grouping columns as the index first and then group on them; they are preserved in the index of the result, as shown here:
df = pd.read_csv(io.StringIO(data), sep=";")
df.set_index(['one','two'], inplace=True)
df.groupby(['one','two']).ffill()
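A minimal sketch of the column-wise variant, written so that the original frame is left untouched. It reuses the question's data; the name filled is only illustrative.

```python
import io
import pandas as pd

data = """one;two;three
1;1;10
1;1;nan
1;1;nan
1;2;nan
1;2;20
1;2;nan
1;3;nan
1;3;nan"""
df = pd.read_csv(io.StringIO(data), sep=";")

# Forward-fill only the value column within each (one, two) group and
# write it back with assign; the grouping columns are never dropped.
filled = df.assign(three=df.groupby(["one", "two"])["three"].ffill())
print(filled)
```

Because only the 'three' column goes through the groupby, this does not depend on how a given pandas version treats grouping columns in the result.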
I am new to pandas and I am having trouble with null values. I have a list of 3 values that should be inserted, in order, into the missing entries of a column. How do I do that?
In [57]: df
Out[57]:
a b c d
0 0 1 2 3
1 0 NaN 0 1
2 0 NaN 3 4
3 0 1 2 5
4 0 NaN 2 6
In [58]: list = [11,22,44]
The output I want
Out[57]:
a b c d
0 0 1 2 3
1 0 11 0 1
2 0 22 3 4
3 0 1 2 5
4 0 44 2 6
If your list has the same length as the number of NaN values:
l=[11,22,44]
df.loc[df['b'].isna(),'b'] = l
print(df)
a b c d
0 0 1.0 2 3
1 0 11.0 0 1
2 0 22.0 3 4
3 0 1.0 2 5
4 0 44.0 2 6
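A minimal sketch of the same assignment with an explicit length check added, so a mismatch fails with a clear message. The name fill_values is hypothetical and the frame is rebuilt from the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [0, 0, 0, 0, 0],
                   "b": [1, np.nan, np.nan, 1, np.nan],
                   "c": [2, 0, 3, 2, 2],
                   "d": [3, 1, 4, 5, 6]})
fill_values = [11, 22, 44]  # hypothetical name; avoids shadowing the built-in list

mask = df["b"].isna()
# The boolean-mask assignment needs exactly one replacement per missing value
if mask.sum() != len(fill_values):
    raise ValueError(f"{mask.sum()} missing values but {len(fill_values)} replacements")

df.loc[mask, "b"] = fill_values  # filled top to bottom, in list order
print(df)
```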
Try stack, assign the values, then unstack back:
s = df.stack(dropna=False)
s.loc[s.isna()] = l # the list is named l here, because calling it list would shadow the Python built-in
df = s.unstack()
df
Out[178]:
a b c d
0 0.0 1.0 2.0 3.0
1 0.0 11.0 0.0 1.0
2 0.0 22.0 3.0 4.0
3 0.0 1.0 2.0 5.0
4 0.0 44.0 2.0 6.0
I have a dataframe:
data=[[0,1,5],
[0,1,6],
[0,0,8],
[0,0,10],
[0,1,12],
[0,0,14],
[0,1,16],
[0,1,18],
[1,0,2],
[1,1,0],
[1,0,1],
[1,0,2]]
df = pd.DataFrame(data,columns=['KEY','COND','VAL'])
For RES1, I want a counter that increments on each row where COND == 1. The first row of each
KEY group keeps its VAL as the starting value (can I use cumcount() in some way?).
For RES2, I just want to fill the missing values with
the previous value (I am thinking of df.fillna(method='ffill')).
KEY COND VAL RES1 RES2
0 0 1 5 5 5
1 0 1 6 6 6
2 0 0 8 6
3 0 0 10 6
4 0 1 12 7 7
5 0 0 14 7
6 0 1 16 8 8
7 0 1 18 9 9
8 1 0 2 2 2
9 1 1 0 3 3
10 1 0 1 3
11 1 0 2 3
The aim is to find a vectorized solution that performs well over millions of rows.
IIUC
con=(df.COND==1)|(df.index.isin(df.drop_duplicates('KEY').index))
# Parentheses are needed so the expression can span several lines
df['res1']=(df.groupby('KEY').VAL.transform('first')+
df.groupby('KEY').COND.cumsum()[con]-
df.groupby('KEY').COND.transform('first'))
df['res2']=df.res1.ffill()
df
Out[148]:
KEY COND VAL res1 res2
0 0 1 5 5.0 5.0
1 0 1 6 6.0 6.0
2 0 0 8 NaN 6.0
3 0 0 10 NaN 6.0
4 0 1 12 7.0 7.0
5 0 0 14 NaN 7.0
6 0 1 16 8.0 8.0
7 0 1 18 9.0 9.0
8 1 0 2 2.0 2.0
9 1 1 0 3.0 3.0
10 1 0 1 NaN 3.0
11 1 0 2 NaN 3.0
You want:
s = (df[df.KEY.duplicated()] # Ignore first row in each KEY group
.groupby('KEY').COND.cumsum() # Counter within KEY
.add(df.groupby('KEY').VAL.transform('first')) # Add first value per KEY
.where(df.COND.eq(1)) # Set only where COND == 1
.add(df.loc[~df.KEY.duplicated(), 'VAL'], fill_value=0) # Set 1st row by KEY
)
df['RES1'] = s
df['RES2'] = df['RES1'].ffill()
KEY COND VAL RES1 RES2
0 0 1 5 5.0 5.0
1 0 1 6 6.0 6.0
2 0 0 8 NaN 6.0
3 0 0 10 NaN 6.0
4 0 1 12 7.0 7.0
5 0 0 14 NaN 7.0
6 0 1 16 8.0 8.0
7 0 1 18 9.0 9.0
8 1 0 2 2.0 2.0
9 1 1 0 3.0 3.0
10 1 0 1 NaN 3.0
11 1 0 2 NaN 3.0
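The same logic can also be sketched more compactly by zeroing out the first row's COND before the cumulative sum. This is only an alternative formulation under the same reading of the question; the helper names is_first, first_val and hits are illustrative.

```python
import pandas as pd

data = [[0, 1, 5], [0, 1, 6], [0, 0, 8], [0, 0, 10], [0, 1, 12], [0, 0, 14],
        [0, 1, 16], [0, 1, 18], [1, 0, 2], [1, 1, 0], [1, 0, 1], [1, 0, 2]]
df = pd.DataFrame(data, columns=["KEY", "COND", "VAL"])

is_first = ~df["KEY"].duplicated()                       # first row of each KEY
first_val = df.groupby("KEY")["VAL"].transform("first")  # starting value per KEY

# Running count of COND == 1 within each KEY, ignoring the first row's own COND
hits = df["COND"].where(~is_first, 0).groupby(df["KEY"]).cumsum()

# Keep the counter only on first rows and on rows where COND == 1
df["RES1"] = (first_val + hits).where(is_first | df["COND"].eq(1))
df["RES2"] = df["RES1"].ffill()
print(df)
```

The plain ffill does not leak across KEY groups here, because the first row of every group always has a RES1 value.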
I would like a rolling count of unique values over a window of 36 rows. NaN values should not be counted, so the count should start at 0 if the first value is NaN. I have a dataframe that looks like this:
Input:
val
NaN
1
1
NaN
2
1
3
NaN
5
Code:
b = a.rolling(36,min_periods=1).apply(lambda x: len(np.unique(x))).astype(int)
It gives me:
Val count
NaN 1
1 2
1 2
NaN 3
2 4
1 4
3 5
NaN 6
5 7
Expected Output:
Val count
NaN 0
1 1
1 1
NaN 1
2 2
1 2
3 3
NaN 3
5 4
You can just filter out the NaN values before counting:
df.val.rolling(36,min_periods=1).apply(lambda x: len(np.unique(x[~np.isnan(x)]))).fillna(0)
Out[35]:
0 0.0
1 1.0
2 1.0
3 1.0
4 2.0
5 2.0
6 3.0
7 3.0
8 4.0
Name: val, dtype: float64
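The reason is that NaN is not equal to itself, so np.unique (on NumPy versions before 1.21) keeps every NaN as a separate value: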
np.unique([np.nan]*2)
Out[38]: array([nan, nan])
np.nan==np.nan
Out[39]: False
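A minimal sketch of an alternative that does not depend on how np.unique handles NaN: Series.nunique drops NaN by default. The frame is rebuilt from the question's input and the name counts is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"val": [np.nan, 1, 1, np.nan, 2, 1, 3, np.nan, 5]})

counts = (df["val"]
          .rolling(36, min_periods=1)
          .apply(lambda w: w.nunique(), raw=False)  # raw=False passes each window as a Series
          .fillna(0)    # windows with no valid value come back as NaN
          .astype(int))
print(counts.tolist())
```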
I'm building a Monte Carlo model and need to model how many new items I capture each month, for a given number of months. Each month I add a random number of items with a known mean and stdev.
months = ['2017-03','2017-04','2017-05']
new = np.random.normal(4,3,size = len(months)).round()
print(new)
[ 1. 5. 4.]
df_new = pd.DataFrame(list(zip(months,new)),columns = ['Period','newPats'])
print(df_new)
Period newPats
0 2017-03 1.0
1 2017-04 5.0
2 2017-05 4.0
I need to transform this into an item x month dataframe, where the value is a zero until the month that the given item starts.
Here's the shape I have:
df_full = pd.DataFrame(np.ones((int(new.sum()), len(months))),columns = months) # shape must be an int, new.sum() is a float
2017-03 2017-04 2017-05
0 1.0 1.0 1.0
1 1.0 1.0 1.0
2 1.0 1.0 1.0
3 1.0 1.0 1.0
4 1.0 1.0 1.0
5 1.0 1.0 1.0
6 1.0 1.0 1.0
7 1.0 1.0 1.0
8 1.0 1.0 1.0
9 1.0 1.0 1.0
and here's the output I need:
#perform transformation
print(df_out)
2017-03 2017-04 2017-05
0 1 1 1
1 0 1 1
2 0 1 1
3 0 1 1
4 0 1 1
5 0 1 1
6 0 0 1
7 0 0 1
8 0 0 1
9 0 0 1
The rule is that there was 1 item added in 2017-03, so all periods = 1 for the first record. The next 5 items were added in 2017-04, so all prior periods = 0. The final 4 items were added in 2017-05, so they are only = 1 in the last month. This is going into a monte carlo simulation which will be run thousands of times, so I can't manually iterate over the columns/rows - any vectorized suggestions for how to handle?
Beat you all to it.
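# int() is needed because new holds floats and a list can only be repeated an integer number of times
df_out = pd.DataFrame([int(new[:x+1].sum()) * [1] + int(new.sum() - new[:x+1].sum()) * [0] for x in range(len(months))]).transpose()
df_out.columns = months
print(df_out)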
2017-03 2017-04 2017-05
0 1 1 1
1 0 1 1
2 0 1 1
3 0 1 1
4 0 1 1
5 0 1 1
6 0 0 1
7 0 0 1
8 0 0 1
9 0 0 1
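Since the question asks for a vectorized approach, here is a minimal sketch using np.repeat and broadcasting instead of a list comprehension. The counts are fixed to the question's sample draw, and the names start and active are illustrative.

```python
import numpy as np
import pandas as pd

months = ["2017-03", "2017-04", "2017-05"]
new = np.array([1, 5, 4])  # fixed to the sample draw from the question

# Index of the month in which each item first appears:
# one 0, five 1s, four 2s
start = np.repeat(np.arange(len(months)), new)

# Item i is active in month j when j >= start[i]; broadcasting compares
# the (n_items, 1) column against the (n_months,) row in one step.
active = (start[:, None] <= np.arange(len(months))).astype(int)

df_out = pd.DataFrame(active, columns=months)
print(df_out)
```

np.repeat needs non-negative integer counts, so a draw from np.random.normal would have to be rounded, cast with astype(int), and clipped at 0 before use.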