Pandas DataFrame of staggered zeros - python

I'm building a Monte Carlo model and need to model how many new items I capture each month, over a given number of months. Each month I add a random number of items with a known mean and standard deviation.
import numpy as np
import pandas as pd

months = ['2017-03', '2017-04', '2017-05']
new = np.random.normal(4, 3, size=len(months)).round()
print(new)
[ 1.  5.  4.]
df_new = pd.DataFrame(list(zip(months, new)), columns=['Period', 'newPats'])
print(df_new)
Period newPats
0 2017-03 1.0
1 2017-04 5.0
2 2017-05 4.0
I need to transform this into an item x month dataframe, where the value is 0 until the month in which the given item starts.
Here's the shape I have:
df_full = pd.DataFrame(np.ones((int(new.sum()), len(months))), columns=months)  # np.ones needs an integer shape, and new.sum() is a float
2017-03 2017-04 2017-05
0 1.0 1.0 1.0
1 1.0 1.0 1.0
2 1.0 1.0 1.0
3 1.0 1.0 1.0
4 1.0 1.0 1.0
5 1.0 1.0 1.0
6 1.0 1.0 1.0
7 1.0 1.0 1.0
8 1.0 1.0 1.0
9 1.0 1.0 1.0
and here's the output I need:
# perform transformation
print(df_out)
2017-03 2017-04 2017-05
0 1 1 1
1 0 1 1
2 0 1 1
3 0 1 1
4 0 1 1
5 0 1 1
6 0 0 1
7 0 0 1
8 0 0 1
9 0 0 1
The rule is that 1 item was added in 2017-03, so all periods = 1 for the first record. The next 5 items were added in 2017-04, so all prior periods = 0. The final 4 items were added in 2017-05, so they are 1 only in the last month. This is going into a Monte Carlo simulation that will be run thousands of times, so I can't manually iterate over the columns/rows. Any vectorized suggestions for how to handle this?

Beat you all to it.
df_out = pd.DataFrame([int(new[:x + 1].sum()) * [1] + int(new.sum() - new[:x + 1].sum()) * [0]
                       for x in range(len(months))]).transpose()  # list replication needs ints, so cast the float sums
df_out.columns = months
print(df_out)
2017-03 2017-04 2017-05
0 1 1 1
1 0 1 1
2 0 1 1
3 0 1 1
4 0 1 1
5 0 1 1
6 0 0 1
7 0 0 1
8 0 0 1
9 0 0 1
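
A fully vectorized alternative is to compute each item's start-month index with np.repeat and broadcast-compare it against the column positions. A minimal sketch, assuming the counts come from the same rounded normal draw (the staggered_ones name and the clipping of negative draws are illustrative additions):

import numpy as np
import pandas as pd

def staggered_ones(new, months):
    # clip guards against negative draws from the normal before casting to int
    counts = np.clip(new, 0, None).astype(int)
    # start[i] = index of the month in which item i first appears
    start = np.repeat(np.arange(len(months)), counts)
    # item i is 1 in every month at or after its start month
    grid = (np.arange(len(months)) >= start[:, None]).astype(int)
    return pd.DataFrame(grid, columns=months)

months = ['2017-03', '2017-04', '2017-05']
print(staggered_ones(np.array([1., 5., 4.]), months))

This avoids the Python-level list comprehension and the repeated .sum() calls, which matters when the simulation runs thousands of times.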

Related

How to insert a list of values into the null values of a column in Python?

I am new to pandas and I am facing an issue with null values. I have a list of 3 values that has to be inserted into a column's missing values. How do I do that?
In [57]: df
Out[57]:
a b c d
0 0 1 2 3
1 0 NaN 0 1
2 0 NaN 3 4
3 0 1 2 5
4 0 NaN 2 6
In [58]: list = [11,22,44]
The output I want
Out[57]:
a b c d
0 0 1 2 3
1 0 11 0 1
2 0 22 3 4
3 0 1 2 5
4 0 44 2 6
If your list is the same length as the number of NaNs:
l=[11,22,44]
df.loc[df['b'].isna(),'b'] = l
print(df)
a b c d
0 0 1.0 2 3
1 0 11.0 0 1
2 0 22.0 3 4
3 0 1.0 2 5
4 0 44.0 2 6
Try stack, assign the values, then unstack back:
s = df.stack(dropna=False)
s.loc[s.isna()] = l  # note: the list is renamed from list to l, since shadowing the built-in name causes problems later
df = s.unstack()
df
Out[178]:
a b c d
0 0.0 1.0 2.0 3.0
1 0.0 11.0 0.0 1.0
2 0.0 22.0 3.0 4.0
3 0.0 1.0 2.0 5.0
4 0.0 44.0 2.0 6.0
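
For reference, a self-contained sketch of the boolean-mask approach, with the question's frame rebuilt from the post above (values assumed from the displayed output):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [0, 0, 0, 0, 0],
                   'b': [1, np.nan, np.nan, 1, np.nan],
                   'c': [2, 0, 3, 2, 2],
                   'd': [3, 1, 4, 5, 6]})
l = [11, 22, 44]  # must have exactly as many entries as there are NaNs in b

# the boolean mask selects the NaN slots; the list fills them positionally
df.loc[df['b'].isna(), 'b'] = l
print(df)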

Sum values of columns that start with the same text string

I want to take the row-wise sum of values of columns that start with the same text string. Below is my original df with course fails.
Original df:
ID P_English_2 P_English_3 P_German_1 P_Math_1 P_Math_3 P_Physics_2 P_Physics_4
56 1 3 1 2 0 0 3
11 0 0 0 1 4 1 0
6 0 0 0 0 0 1 0
43 1 2 1 0 0 1 1
14 0 1 0 0 1 0 0
Desired df:
ID P_English P_German P_Math P_Physics
56 4 1 2 3
11 0 0 5 1
6 0 0 0 1
43 3 1 0 2
14 1 0 1 0
Tried code:
import pandas as pd

df = pd.DataFrame({"ID": [56, 11, 6, 43, 14],
                   "P_Math_1": [2, 1, 0, 0, 0],
                   "P_English_3": [3, 0, 0, 2, 1],
                   "P_English_2": [1, 0, 0, 1, 0],
                   "P_Math_3": [0, 4, 0, 0, 1],
                   "P_Physics_2": [0, 1, 1, 1, 0],
                   "P_Physics_4": [3, 0, 0, 1, 0],
                   "P_German_1": [1, 0, 0, 1, 0]})
print(df)

categories = ['P_Math', 'P_English', 'P_Physics', 'P_German']

def correct_categories(cols):
    return [cat for col in cols for cat in categories if col.startswith(cat)]

result = df.groupby(correct_categories(df.columns), axis=1).sum()
print(result)
Let's try groupby with axis=1:
# extract the subjects
subjects = [x[0] for x in df.columns.str.rsplit('_',n=1)]
df.groupby(subjects, axis=1).sum()
Output:
ID P_English P_German P_Math P_Physics
0 56 4 1 2 3
1 11 0 0 5 1
2 6 0 0 0 1
3 43 3 1 0 2
4 14 1 0 1 0
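Note that DataFrame.groupby(..., axis=1) is deprecated in recent pandas (2.1 onward). A sketch of an equivalent without it, transposing so the column labels can be grouped by their prefix (the subjects_sum name is illustrative):

subjects_sum = (df.set_index('ID')
                  .T
                  .groupby(lambda c: c.rsplit('_', 1)[0])  # strip the trailing _<number>
                  .sum()
                  .T
                  .reset_index())
print(subjects_sum)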
Or you can use wide_to_long, assuming the ID values are unique:
(pd.wide_to_long(df, stubnames=categories,
i=['ID'], j='count', sep='_')
.groupby('ID').sum()
)
Output:
P_Math P_English P_Physics P_German
ID
56 2.0 4.0 3.0 1.0
11 5.0 0.0 1.0 0.0
6 0.0 0.0 1.0 0.0
43 0.0 3.0 2.0 1.0
14 1.0 1.0 0.0 0.0
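
wide_to_long returns floats here and sorts the result by ID; if the original row order and integer dtype matter, one possible cleanup (a sketch, reusing df and categories from above) is:

result = (pd.wide_to_long(df, stubnames=categories, i=['ID'], j='count', sep='_')
            .groupby('ID').sum()
            .astype(int)
            .reindex(df['ID'])  # restore the original row order
            .reset_index())
print(result)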

How do I add sum and count to a pivot table

Here's my dataset
customer_id hour size
1 0 1
1 1 18
2 1 7
Here's my code
table = a.pivot_table(index=['customer_id'],
columns='hour',
fill_value=0,
values='size')
Here's what I've got
hour 0 1
customer_id
1 1 18
2 0 7
What I need
hour 0 1 count sum
customer_id
1 1 18 2 19
2 0 7 1 7
count 1 2
sum 1 25
count is the non-zero count within a category and sum is the sum within a category.
One possible, slightly more dynamic solution is to omit fill_value=0:
table = a.pivot_table(index='customer_id',
columns='hour',
values='size')
print (table)
hour 0 1
customer_id
1 1.0 18.0
2 NaN 7.0
a = table.agg(['count','sum'])
b = table.T.agg(['count','sum']).T
print (table.fillna(0).append(a).join(b))
0 1 count sum
1 1.0 18.0 2.0 19.0
2 0.0 7.0 1.0 7.0
count 1.0 2.0 NaN NaN
sum 1.0 25.0 NaN NaN
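
DataFrame.append was removed in pandas 2.0, so on current versions the last step can be written with pd.concat instead; with the same a and b as above, this gives an identical layout:

print(pd.concat([table.fillna(0), a]).join(b))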

Pandas: summing multiple DataFrames with different columns

I have the following DataFrames:
A =
0 1 2
0 1 1 1
1 1 1 1
2 1 1 1
B =
0 5
0 1 1
5 1 1
I want to 'join' these two frames such that:
A + B =
0 1 2 5
0 2 1 1 1
1 1 1 1 0
2 1 1 1 0
5 1 0 0 1
where A+B is a new dataframe
Using add. Note that fill_value=0 only applies where a cell exists in at least one of the frames; cells missing from both (for example row 5, column 1) remain NaN, hence the trailing fillna(0):
df1.add(df2, fill_value=0).fillna(0)
Out[217]:
0 1 2 5
0 2.0 1.0 1.0 1.0
1 1.0 1.0 1.0 0.0
2 1.0 1.0 1.0 0.0
5 1.0 0.0 0.0 1.0
If you need integer output:
df1.add(df2,fill_value=0).fillna(0).astype(int)
Out[242]:
0 1 2 5
0 2 1 1 1
1 1 1 1 0
2 1 1 1 0
5 1 0 0 1
import numpy as np
import pandas as pd
A = pd.DataFrame(np.ones(9).reshape(3, 3))
B = pd.DataFrame(np.ones(4).reshape(2, 2), columns=[0, 5], index=[0, 5])
A.add(B, fill_value=0).fillna(0)
[Out]
0 1 2 5
0 2.0 1.0 1.0 1.0
1 1.0 1.0 1.0 0.0
2 1.0 1.0 1.0 0.0
5 1.0 0.0 0.0 1.0

No results are returned in dataframe

I am trying to take the average of every fifth and every sixth row of variable A in a dataframe, and put the result in a new column as variable B. But it shows NaN. Maybe I did not align the values correctly?
Here is the sample data:
PID A
1 0
1 3
1 2
1 6
1 0
1 2
2 3
2 3
2 1
2 4
2 0
2 4
Expected results:
PID A B
1 0 1
1 3 1
1 2 1
1 6 1
1 0 1
1 2 1
2 3 2
2 3 2
2 1 2
2 4 2
2 0 2
2 4 2
My code:
lst1 = df.iloc[5::6, :]
lst2 = df.iloc[4::6, :]
df['B'] = (lst1['A'] + lst2['A'])/2
print(df['B'])
The script runs without error, but column B only shows NaN.
Thanks for your help!
The problem is that the data are not aligned: the two slices have different indexes, so the addition produces NaNs.
print(lst1)
PID A
5 1 2
11 2 4
print(lst2)
PID A
4 1 0
10 2 0
print (lst1['A'] + lst2['A'])
4 NaN
5 NaN
10 NaN
11 NaN
Name: A, dtype: float64
The solution is to use .values, which adds the Series to a plain NumPy array and so ignores the index:
print (lst1['A'] + (lst2['A'].values))
5 2
11 4
Name: A, dtype: int64
Or you can sum two NumPy arrays:
print (lst1['A'].values + (lst2['A'].values))
[2 4]
It seems you need:
df['B'] = (lst1['A'] + lst2['A'].values).div(2)
df['B'] = df['B'].bfill()
print(df)
PID A B
0 1 0 1.0
1 1 3 1.0
2 1 2 1.0
3 1 6 1.0
4 1 0 1.0
5 1 2 1.0
6 2 3 2.0
7 2 3 2.0
8 2 1 2.0
9 2 4 2.0
10 2 0 2.0
11 2 4 2.0
But if you need the mean of the 5th and 6th values per PID group, use groupby with transform:
df['B'] = df.groupby('PID').transform(lambda x: x.iloc[[4, 5]].mean())
print(df)
PID A B
0 1 0 1.0
1 1 3 1.0
2 1 2 1.0
3 1 6 1.0
4 1 0 1.0
5 1 2 1.0
6 2 3 2.0
7 2 3 2.0
8 2 1 2.0
9 2 4 2.0
10 2 0 2.0
11 2 4 2.0
Option 1
A straightforward way: take the mean of the 5th and 6th positions within each group defined by 'PID'.
df.assign(B=df.groupby('PID').transform(lambda x: x.values[[4, 5]].mean()))
PID A B
0 1 0 1.0
1 1 3 1.0
2 1 2 1.0
3 1 6 1.0
4 1 0 1.0
5 1 2 1.0
6 2 3 2.0
7 2 3 2.0
8 2 1 2.0
9 2 4 2.0
10 2 0 2.0
11 2 4 2.0
Option 2
A fun way using join, assuming there are exactly 6 rows for each 'PID'.
df.join(df.set_index('PID').A.pipe(lambda d: (d.iloc[4::6] + d.iloc[5::6]) / 2).rename('B'), on='PID')
PID A B
0 1 0 1.0
1 1 3 1.0
2 1 2 1.0
3 1 6 1.0
4 1 0 1.0
5 1 2 1.0
6 2 3 2.0
7 2 3 2.0
8 2 1 2.0
9 2 4 2.0
10 2 0 2.0
11 2 4 2.0
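
If the lambda-based transforms are too slow on a large frame, a fully vectorized variant (a sketch assuming at least six rows per 'PID'; the pos and means names are illustrative) tags each group's 5th and 6th rows with cumcount, averages them per group, and maps the result back:

pos = df.groupby('PID').cumcount()                       # 0-based position within each PID
means = df.loc[pos.isin([4, 5])].groupby('PID')['A'].mean()
df['B'] = df['PID'].map(means)
print(df)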
