I've seen solutions in other languages (e.g. SQL, Fortran, or C++) that mainly use for loops.
I am hoping that someone can help me solve this task using pandas instead.
I have a data frame that looks like this:
date pcp sum_count sumcum
7/13/2013 0.1 3.0 48.7
7/14/2013 48.5
7/15/2013 0.1
7/16/2013
8/1/2013 1.5 1.0 1.5
8/2/2013
8/3/2013
8/4/2013 0.1 2.0 3.6
8/5/2013 3.5
9/22/2013 0.3 3.0 26.3
9/23/2013 14.0
9/24/2013 12.0
9/25/2013
9/26/2013
10/1/2014 0.1 11.0
10/2/2014 96.0 135.5
10/3/2014 2.5
10/4/2014 37.0
10/5/2014 9.5
10/6/2014 26.5
10/7/2014 0.5
10/8/2014 25.5
10/9/2014 2.0
10/10/2014 5.5
10/11/2014 5.5
And I was hoping I could do the following:
STEP 1 : create the sum_count column by counting the length of each run of consecutive non-zero 'pcp' values.
STEP 2 : create the sumcum column by summing the 'pcp' values within each consecutive run.
STEP 3 : create a pivot table that will look like this:
year max_sum_count
2013 48.7
2014 135.5
BUT!! the max_sum_count is based on the condition when sum_count = 3
I'd appreciate any help! thank you!
UPDATED QUESTION:
I previously emphasized that the sum_count should only consider the maximum of 3 consecutive pcps, but I mistakenly gave the wrong data frame and had to edit it. Sorry.
The sumcum of 135.5 comes from 96.0 + 2.5 + 37.0: it is the maximum sum of 3 consecutive pcps within the run whose sum_count is 11.
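To make that concrete, a rolling 3-value sum over that 2014 run reproduces the 135.5 (a minimal check using the pcp values listed above):
import pandas as pd

# pcp values of the consecutive 2014 run (10/1/2014 - 10/11/2014)
run = pd.Series([0.1, 96.0, 2.5, 37.0, 9.5, 26.5, 0.5, 25.5, 2.0, 5.5, 5.5])
print(run.rolling(3).sum().max())   # 135.5 = 96.0 + 2.5 + 37.0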
Thank you
Use:
import pandas as pd

# filter the runs, then take a rolling sum by days
N = 3
df['date'] = pd.to_datetime(df['date'])
# use the dates as the index (required for the f'{N}D' rolling window below)
df = df.set_index('date')
# mask of missing pcp values
m = df['pcp'].isna()
# label each run of consecutive non-NaN rows
df['g'] = m.cumsum()[~m]
# extract years
df['year'] = df.index.year
# drop the NaN rows
df = df[~m].copy()
# keep only runs with at least N rows
df['sum_count1'] = df.groupby(['g','year'])['g'].transform('size')
df = df[df['sum_count1'].ge(N)].copy()
# rolling N-day sum of pcp within each run
df['sumcum1'] = (df.groupby(['g','year'])
                   .rolling(f'{N}D')['pcp']
                   .sum()
                   .reset_index(level=[0, 1], drop=True))
# yearly maximum of the rolling sums, reindexed so missing years still appear
r = range(df['year'].min(), df['year'].max() + 1)
df1 = df.groupby('year')['sumcum1'].max().reindex(r).reset_index(name='max_sum_count')
print(df1)
year max_sum_count
0 2013 48.7
1 2014 135.5
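Note that the f'{N}D' offset window only works because date was set as the index: rolling with a day-based offset needs datetime-like labels, whereas a plain integer window=N would count rows instead of days.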
First, convert date to a real datetime dtype and create a boolean mask that keeps rows where pcp is not null. Then you can create groups and compute your variables:
Input data:
>>> df
date pcp
0 7/13/2013 0.1
1 7/14/2013 48.5
2 7/15/2013 0.1
3 7/16/2013 NaN
4 8/1/2013 1.5
5 8/2/2013 NaN
6 8/3/2013 NaN
7 8/4/2013 0.1
8 8/5/2013 3.5
9 9/22/2013 0.3
10 9/23/2013 14.0
11 9/24/2013 12.0
12 9/25/2013 NaN
13 9/26/2013 NaN
14 10/1/2014 0.1
15 10/2/2014 96.0
16 10/3/2014 2.5
17 10/4/2014 37.0
18 10/5/2014 9.5
19 10/6/2014 26.5
20 10/7/2014 0.5
21 10/8/2014 25.5
22 10/9/2014 2.0
23 10/10/2014 5.5
24 10/11/2014 5.5
Code:
df['date'] = pd.to_datetime(df['date'])
mask = df['pcp'].notna()

# start a new group whenever the date is not exactly one day after the previous non-null date
grp = df.loc[mask, 'date'] \
        .ne(df.loc[mask, 'date'].shift().add(pd.Timedelta(days=1))) \
        .cumsum()

# aggregate each run and write the results back onto its first row
df = df.join(df.reset_index()
               .groupby(grp)
               .agg(index=('index', 'first'),
                    sum_count=('pcp', 'size'),
                    sumcum=('pcp', 'sum'))
               .set_index('index'))

# yearly maximum of the run sums
pivot = df.groupby(df['date'].dt.year)['sumcum'].max() \
          .rename('max_sum_count').reset_index()
Output results:
>>> df
date pcp sum_count sumcum
0 2013-07-13 0.1 3.0 48.7
1 2013-07-14 48.5 NaN NaN
2 2013-07-15 0.1 NaN NaN
3 2013-07-16 NaN NaN NaN
4 2013-08-01 1.5 1.0 1.5
5 2013-08-02 NaN NaN NaN
6 2013-08-03 NaN NaN NaN
7 2013-08-04 0.1 2.0 3.6
8 2013-08-05 3.5 NaN NaN
9 2013-09-22 0.3 3.0 26.3
10 2013-09-23 14.0 NaN NaN
11 2013-09-24 12.0 NaN NaN
12 2013-09-25 NaN NaN NaN
13 2013-09-26 NaN NaN NaN
14 2014-10-01 0.1 11.0 210.6
15 2014-10-02 96.0 NaN NaN
16 2014-10-03 2.5 NaN NaN
17 2014-10-04 37.0 NaN NaN
18 2014-10-05 9.5 NaN NaN
19 2014-10-06 26.5 NaN NaN
20 2014-10-07 0.5 NaN NaN
21 2014-10-08 25.5 NaN NaN
22 2014-10-09 2.0 NaN NaN
23 2014-10-10 5.5 NaN NaN
24 2014-10-11 5.5 NaN NaN
>>> pivot
date max_sum_count
0 2013 48.7
1 2014 210.6
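Note that this pivot sums each full run, which is why 2014 shows 210.6 rather than the 135.5 expected by the updated question (maximum rolling 3-day sum within a run). A hedged sketch of one way to adapt it, reusing the grp groups defined above (roll3 and pivot3 are just illustrative names):
# max rolling 3-row pcp sum within each consecutive run, then the yearly maximum
roll3 = df.groupby(grp)['pcp'].transform(lambda s: s.rolling(3).sum())
pivot3 = roll3.groupby(df['date'].dt.year).max().rename('max_sum_count').reset_index()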
I'm looking to make a new column, MaxPriceBetweenEntries, based on the max() of a slice of the dataframe:
idx Price EntryBar ExitBar
0 10.00 0 1
1 11.00 NaN NaN
2 10.15 2 4
3 12.14 NaN NaN
4 10.30 NaN NaN
turned into
idx Price EntryBar ExitBar MaxPriceBetweenEntries
0 10.00 0 1 11.00
1 11.00 NaN NaN NaN
2 10.15 2 4 12.14
3 12.14 NaN NaN NaN
4 10.30 NaN NaN NaN
I can get all the rows with an EntryBar or ExitBar value with df.loc[df["EntryBar"].notnull()] and df.loc[df["ExitBar"].notnull()], but I can't use that to set a new column:
df.loc[df["EntryBar"].notnull(),"MaxPriceBetweenEntries"] = df.loc[df["EntryBar"]:df["ExitBar"]]["Price"].max()
but that's effectively a guess at this point, because nothing I'm trying works. Ideally the solution wouldn't involve a loop directly because there may be millions of rows.
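For reference, the example frame can be rebuilt like this (a minimal sketch, assuming idx is an ordinary column as it appears in the outputs below):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'idx':      [0, 1, 2, 3, 4],
    'Price':    [10.00, 11.00, 10.15, 12.14, 10.30],
    'EntryBar': [0, np.nan, 2, np.nan, np.nan],
    'ExitBar':  [1, np.nan, 4, np.nan, np.nan],
})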
You can group by the cumulative sum of non-null entries and take the max, using np.where() to apply it only to the non-null rows:
import numpy as np

df['MaxPriceBetweenEntries'] = np.where(df['EntryBar'].notnull(),
                                        df.groupby(df['EntryBar'].notnull().cumsum())['Price'].transform('max'),
                                        np.nan)
df
Out[1]:
idx Price EntryBar ExitBar MaxPriceBetweenEntries
0 0 10.00 0.0 1.0 11.00
1 1 11.00 NaN NaN NaN
2 2 10.15 2.0 4.0 12.14
3 3 12.14 NaN NaN NaN
4 4 10.30 NaN NaN NaN
Let's try groupby() and where:
s = df['EntryBar'].notna()
df['MaxPriceBetweenEntries'] = df.groupby(s.cumsum())['Price'].transform('max').where(s)
Output:
idx Price EntryBar ExitBar MaxPriceBetweenEntries
0 0 10.00 0.0 1.0 11.00
1 1 11.00 NaN NaN NaN
2 2 10.15 2.0 4.0 12.14
3 3 12.14 NaN NaN NaN
4 4 10.30 NaN NaN NaN
You can forward fill the null values, group by entry, and get the max of that group's Price. Use that as the right side of a left join and you should be in business.
df.merge(df.ffill().groupby('EntryBar')['Price'].max().reset_index(name='MaxPriceBetweenEntries'),
         on='EntryBar',
         how='left')
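Note that merge returns a new frame rather than modifying df in place, so assign the result back (for example df = df.merge(...)) if you want to keep the MaxPriceBetweenEntries column.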
Try
df.loc[df['ExitBar'].notna(), 'Max'] = df.groupby(df['ExitBar'].ffill()).Price.max().values
df
Out[74]:
idx Price EntryBar ExitBar Max
0 0 10.00 0.0 1.0 11.00
1 1 11.00 NaN NaN NaN
2 2 10.15 2.0 4.0 12.14
3 3 12.14 NaN NaN NaN
4 4 10.30 NaN NaN NaN
I have two dataframes which I need to merge/join based on a column. When I try to join/merge them, the new columns are all NaN.
Basically, I need to perform Left Join on the dataframes, considering df_user as the dataframe on the Left.
PS: The join column has the same datatype in both dataframes.
Please find the dataframes below:
df_user.dtypes
App category
Sentiment int8
Sentiment_Polarity float64
Sentiment_Subjectivity float64
df_play.dtypes
App category
Category category
Rating float64
Reviews float64
Size float64
Installs int64
Type int8
Price float64
Content Rating int8
Installs_Cat int8
df_play.head()
App Category Rating Reviews Size Installs Type Price Content Installs_Cat
0 SPrapBook ART_AND_DESIGN 4.1 159 19 10000 0 0 0 9
1 U Launcher ART_AND_DESIGN 4.5 87510 25 5000000 0 0 0 14
2 Sketch - ART_AND_DESIGN 4.3 215644 2.8 50000000 0 0 1 16
3 Pixel Dra ART_AND_DESIGN 4.4 967 5.6 100000 0 0 0 11
4 Paper flo ART_AND_DESIGN 3.8 167 19 50000 0 0 0 10
df_user.head()
App Sentiment Sentiment_Polarity Sentiment_Subjectivity
0 10 Best Foods for You 2 1.00 0.533333
1 10 Best Foods for You 2 0.25 0.288462
3 10 Best Foods for You 2 0.40 0.875000
4 10 Best Foods for You 2 1.00 0.300000
5 10 Best Foods for You 2 1.00 0.300000
I tried both of the code snippets below:
result = pd.merge(df_user, df_play, how='left', on='App')
result = df_user.join(df_play.set_index('App'),on='App',how='left',rsuffix='_y')
But all I got was:
App Sentiment Sentiment_Polarity Sentiment_Subjectivity Category Rating Reviews Size Installs Type Price Content Rating Installs_Cat
0 10 Best Foods for You 2 1.00 0.533333 NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 10 Best Foods for You 2 0.25 0.288462 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 10 Best Foods for You 2 0.40 0.875000 NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 10 Best Foods for You 2 1.00 0.300000 NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 10 Best Foods for You 2 1.00 0.300000 NaN NaN NaN NaN NaN NaN NaN NaN NaN
Please excuse me for the formatting.
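Since a left join fills NaN whenever a key has no match on the right, a useful first check (a hedged diagnostic sketch, not a confirmed fix) is whether any App values actually match after normalising whitespace and case:
# do any 'App' keys overlap between the two frames at all?
u = df_user['App'].astype(str).str.strip().str.casefold()
p = df_play['App'].astype(str).str.strip().str.casefold()
print(len(set(u) & set(p)))

# if normalising recovers matches, hidden whitespace/case differences are the likely
# cause (an assumption, not confirmed by the post); normalise both keys before merging
result = pd.merge(df_user.assign(App=u), df_play.assign(App=p), how='left', on='App')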
I have a problem with pandas interpolate(). I only want to interpolate when there are no more than 2 successive NaNs.
But the interpolate function also fills in values when there are more than 2 NaNs in a row!?
import numpy as np
import pandas as pd

s = pd.Series(data=[np.nan, 10, np.nan, np.nan, np.nan, 5, np.nan, 6, np.nan, np.nan, 30])
a = s.interpolate(limit=2, limit_area='inside')
print(a)
the output I get is:
0 NaN
1 10.00
2 8.75
3 7.50
4 NaN
5 5.00
6 5.50
7 6.00
8 14.00
9 22.00
10 30.00
dtype: float64
I do not want the interpolated values in rows 2 and 3.
What I want is:
0 NaN
1 10.00
2 NaN
3 NaN
4 NaN
5 5.00
6 5.50
7 6.00
8 14.00
9 22.00
10 30.00
dtype: float64
Can anybody please help?
Use GroupBy.transform with Series.where:
s_notna = s.notna()
# each group is one non-NaN value plus the NaNs that follow it,
# so a group size of at most 3 means at most 2 consecutive NaNs
m = s.groupby(s_notna.cumsum()).transform('size').le(3) | s_notna
s = s.interpolate(limit_area='inside').where(m)
print(s)
Output
0 NaN
1 10.0
2 NaN
3 NaN
4 NaN
5 5.0
6 5.5
7 6.0
8 14.0
9 22.0
10 30.0
dtype: float64
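If the gap length needs to be configurable, the same idea can be wrapped up in a helper (a hedged sketch; interpolate_with_gap_limit is just a hypothetical name):
import numpy as np
import pandas as pd

def interpolate_with_gap_limit(s, max_gap=2):
    # hypothetical helper: interpolate only gaps of at most `max_gap` consecutive NaNs
    notna = s.notna()
    # each group is one non-NaN value plus the NaNs that follow it
    small_gap = s.groupby(notna.cumsum()).transform('size').le(max_gap + 1)
    return s.interpolate(limit_area='inside').where(small_gap | notna)

s_raw = pd.Series([np.nan, 10, np.nan, np.nan, np.nan, 5, np.nan, 6, np.nan, np.nan, 30])
print(interpolate_with_gap_limit(s_raw, max_gap=2))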
I have a dataframe like this
import pandas as pd
import numpy as np

raw_data = {'Country': ['UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','US','US','US','US','US','US'],
            'Product': ['A','A','A','A','B','B','B','B','B','B','B','B','C','C','C','D','D','D','D','D','D'],
            'Week': [1,2,3,4,1,2,3,4,5,6,7,8,1,2,3,1,2,3,4,5,6],
            'val': [5,4,3,1,5,6,7,8,9,10,11,12,5,5,5,5,6,7,8,9,10]}
df2 = pd.DataFrame(raw_data, columns=['Country', 'Product', 'Week', 'val'])
print(df2)
I want to calculate the moving average and standard deviation of the val column by Country and Product, for windows of 3 weeks, 5 weeks, 7 weeks, etc.
Wanted dataframe:
'Country', 'Product', 'Week', 'val', '3wks_avg', '3wks_std', '5wks_avg', '5wks_std', ... etc.
Like WenYoBen suggested, we can create a list of all the window sizes you want and then dynamically create the corresponding columns with GroupBy.rolling:
weeks = [3, 5, 7]
for week in weeks:
    df[[f'{week}wks_avg', f'{week}wks_std']] = (
        df.groupby(['Country', 'Product']).rolling(window=week, on='Week')['val']
          .agg(['mean', 'std']).reset_index(drop=True)
    )
Country Product Week val 3wks_avg 3wks_std 5wks_avg 5wks_std 7wks_avg 7wks_std
0 UK A 1 5 nan nan nan nan nan nan
1 UK A 2 4 nan nan nan nan nan nan
2 UK A 3 3 4.00 1.00 nan nan nan nan
3 UK A 4 1 2.67 1.53 nan nan nan nan
4 UK B 1 5 nan nan nan nan nan nan
5 UK B 2 6 nan nan nan nan nan nan
6 UK B 3 7 6.00 1.00 nan nan nan nan
7 UK B 4 8 7.00 1.00 nan nan nan nan
8 UK B 5 9 8.00 1.00 7.00 1.58 nan nan
9 UK B 6 10 9.00 1.00 8.00 1.58 nan nan
10 UK B 7 11 10.00 1.00 9.00 1.58 8.00 2.16
11 UK B 8 12 11.00 1.00 10.00 1.58 9.00 2.16
12 UK C 1 5 nan nan nan nan nan nan
13 UK C 2 5 nan nan nan nan nan nan
14 UK C 3 5 5.00 0.00 nan nan nan nan
15 US D 1 5 nan nan nan nan nan nan
16 US D 2 6 nan nan nan nan nan nan
17 US D 3 7 6.00 1.00 nan nan nan nan
18 US D 4 8 7.00 1.00 nan nan nan nan
19 US D 5 9 8.00 1.00 7.00 1.58 nan nan
20 US D 6 10 9.00 1.00 8.00 1.58 nan nan
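This relies on positional alignment: reset_index(drop=True) lines up with the original rows only because the sample frame is already ordered by Country, Product and Week (the same order groupby produces).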
This is how you would get the moving average for 3 weeks:
df['3weeks_avg'] = list(df.groupby(['Country', 'Product']).rolling(3).mean()['val'])
Apply the same principle for the other columns you want to compute.
IIUC, you may try this
wks = ['Week_3', 'Week_5', 'Week_7']
df_calc = (df2.groupby(['Country', 'Product']).expanding().val
              .agg(['mean', 'std']).rename(lambda x: f'Week_{x+1}', level=-1)
              .query('ilevel_2 in @wks').unstack())
Out[246]:
mean std
Week_3 Week_5 Week_7 Week_3 Week_5 Week_7
Country Product
UK A 4.0 NaN NaN 1.0 NaN NaN
B NaN 5.0 6.0 NaN NaN 1.0
You will want to use a groupby-transform to get the rolling moments of your data. The following should compute what you are looking for:
weeks = [3, 5, 7]               # define weeks
df2 = df2.sort_values('Week')   # order by time
for i in weeks:                 # loop through the window sizes you want to compute
    df2['{}wks_avg'.format(i)] = df2.groupby(['Country', 'Product'])['val'].transform(lambda x: x.rolling(i).mean())  # i-week rolling mean
    df2['{}wks_std'.format(i)] = df2.groupby(['Country', 'Product'])['val'].transform(lambda x: x.rolling(i).std())   # i-week rolling std
Here is what the resulting dataframe will look like.
print(df2.dropna().head().to_string())
Country Product Week val 3wks_avg 3wks_std 5wks_avg 5wks_std 7wks_avg 7wks_std
17 US D 3 7 6.0 1.0 6.0 1.0 6.0 1.0
6 UK B 3 7 6.0 1.0 6.0 1.0 6.0 1.0
14 UK C 3 5 5.0 0.0 5.0 0.0 5.0 0.0
2 UK A 3 3 4.0 1.0 4.0 1.0 4.0 1.0
7 UK B 4 8 7.0 1.0 7.0 1.0 7.0 1.0