I have a dataframe which looks like this:
UserId Date_watched Days_not_watch
1 2010-09-11 5
1 2010-10-01 8
1 2010-10-28 1
2 2010-05-06 12
2 2010-05-18 5
3 2010-08-09 10
3 2010-09-25 5
I want to find the number of days each user left as a gap between one watch and the next, so I want a Gap value on every row, and my dataframe should look something like this:
UserId Date_watched Days_not_watch Gap(2nd watch_date - 1st watch_date - days_not_watch)
1 2010-09-11 5 0 (first gap is 0 for all users)
1 2010-10-01 8 15 (11th Sept + 5 = 16th Sept; 1st Oct - 16th Sept = 15 days)
1 2010-10-28 1 19 (1st Oct + 8 = 9th Oct; 28th Oct - 9th Oct = 19 days)
2 2010-05-06 12 0
2 2010-05-18 5 0 (because 6th May + 12 days = 18th May)
3 2010-08-09 10 0
3 2010-09-25 5 37
I have mentioned the formula for calculating the Gap beside the column name of the dataframe.
Here is one approach using groupby + shift:
# sort by date first
df['Date_watched'] = pd.to_datetime(df['Date_watched'])
df = df.sort_values(['UserId', 'Date_watched'])
# calculate groupwise start dates, shifted
grp = df.groupby('UserId')
starts = grp['Date_watched'].shift() + \
pd.to_timedelta(grp['Days_not_watch'].shift(), unit='d')
# calculate timedelta gaps
df['Gap'] = (df['Date_watched'] - starts).fillna(pd.Timedelta(0))
# convert to days and then integers
df['Gap'] = (df['Gap'] / pd.Timedelta('1 day')).astype(int)
print(df)
UserId Date_watched Days_not_watch Gap
0 1 2010-09-11 5 0
1 1 2010-10-01 8 15
2 1 2010-10-28 1 19
3 2 2010-05-06 12 0
4 2 2010-05-18 5 0
5 3 2010-08-09 10 0
6 3 2010-09-25 5 37
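For reference, here is a minimal sketch that rebuilds the sample frame (column names and values taken from the question) so the snippet above runs end to end:
import pandas as pd

df = pd.DataFrame({
    'UserId': [1, 1, 1, 2, 2, 3, 3],
    'Date_watched': ['2010-09-11', '2010-10-01', '2010-10-28',
                     '2010-05-06', '2010-05-18', '2010-08-09', '2010-09-25'],
    'Days_not_watch': [5, 8, 1, 12, 5, 10, 5]
})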
I have the following dataframe:
df = pd.DataFrame({'timestamp' : [10,10,10,20,20,20], 'idx': [1,2,3,1,2,3], 'v1' : [1,2,4,5,1,9], 'v2' : [1,2,8,5,1,2]})
timestamp idx v1 v2
0 10 1 1 1
1 10 2 2 2
2 10 3 4 8
3 20 1 5 5
4 20 2 1 1
5 20 3 9 2
I'd like to group the data by timestamp and calculate the following statistic for every timestamp, broadcast to each row of the group:
np.sum(v1 * v2) for every timestamp. I'd like to see the following result:
timestamp idx v1 v2 stat
0 10 1 1 1 37
1 10 2 2 2 37
2 10 3 4 8 37
3 20 1 5 5 44
4 20 2 1 1 44
5 20 3 9 2 44
I'm trying to do the following:
def calc_some_stat(d):
    return np.sum(d.v1 * d.v2)
df.loc[:, 'stat'] = df.groupby('timestamp').apply(calc_some_stat)
But the stat column comes out as all NaN values - what is wrong with my code?
We want groupby + transform here, not groupby + apply:
df['stat'] = (df['v1'] * df['v2']).groupby(df['timestamp']).transform('sum')
If we really want to use the function, we need to join the aggregated result back to scale it up to the original rows:
def calc_some_stat(d):
    return np.sum(d.v1 * d.v2)
df = df.join(
    df.groupby('timestamp').apply(calc_some_stat)
      .rename('stat'),  # needed to use join, but also sets the column name
    on='timestamp'
)
df:
timestamp idx v1 v2 stat
0 10 1 1 1 37
1 10 2 2 2 37
2 10 3 4 8 37
3 20 1 5 5 44
4 20 2 1 1 44
5 20 3 9 2 44
The issue is that groupby apply is producing summary information:
timestamp
10 37
20 44
dtype: int64
This does not assign back to the DataFrame naturally, as there are only 2 rows where the initial DataFrame has 6. We either need to use join to scale these 2 rows up to align with the original DataFrame, or we can avoid all of this using groupby transform, which is designed to (quoting the pandas docs) produce a:
like-indexed DataFrame on each group and return a DataFrame having the same indexes as the original object filled with the transformed values
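Since the apply result is a Series indexed by timestamp, a third option is to broadcast it back with map instead of join (a sketch I am adding here, reusing calc_some_stat from above):
df['stat'] = df['timestamp'].map(df.groupby('timestamp').apply(calc_some_stat))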
I have a pandas dataframe with several columns, and for each row I would like to know the number of date columns whose date falls after 2016-12-31. Here is an example:
ID  Bill  Date 1      Date 2      Date 3      Date 4      Bill 2
4   6     2000-10-04  2000-11-05  1999-12-05  2001-05-04  8
6   8     2016-05-03  2017-08-09  2018-07-14  2015-09-12  17
12  14    2016-11-16  2017-05-04  2017-07-04  2018-07-04  35
And I would like to get this column:
Count
0
2
3
Just create the mask and call sum on axis=1
# assumes the Date columns are datetime dtype; convert with pd.to_datetime first if they are strings
date = pd.to_datetime('2016-12-31')
(df[['Date 1','Date 2','Date 3','Date 4']] > date).sum(1)
OUTPUT:
0 0
1 2
2 3
dtype: int64
If needed, call .to_frame('Count') to create a DataFrame with the column named Count:
(df[['Date 1','Date 2','Date 3','Date 4']]>date).sum(1).to_frame('Count')
Count
0 0
1 2
2 3
Use df.filter to select the Date* columns, then .sum(axis=1):
(df.filter(like='Date') > '2016-12-31').sum(axis=1).to_frame(name='Count')
Result:
Count
0 0
1 2
2 3
You can do:
df['Count'] = (df.loc[:, [x for x in df.columns if 'Date' in x]] > '2016-12-31').sum(axis=1)
Output:
ID Bill Date 1 Date 2 Date 3 Date 4 Bill 2 Count
0 4 6 2000-10-04 2000-11-05 1999-12-05 2001-05-04 8 0
1 6 8 2016-05-03 2017-08-09 2018-07-14 2015-09-12 17 2
2 12 14 2016-11-16 2017-05-04 2017-07-04 2018-07-04 35 3
We select the columns with 'Date' in the name; this is better when there are lots of such columns and we don't want to list them one by one. Then we compare them with the lookup date and sum the True values.
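All three answers assume the Date columns compare cleanly against a date, i.e. that they are datetime dtype (or at least ISO-formatted strings). A minimal sketch that rebuilds the sample frame with proper datetime columns, using the data from the question:
import pandas as pd

df = pd.DataFrame({
    'ID': [4, 6, 12],
    'Bill': [6, 8, 14],
    'Date 1': ['2000-10-04', '2016-05-03', '2016-11-16'],
    'Date 2': ['2000-11-05', '2017-08-09', '2017-05-04'],
    'Date 3': ['1999-12-05', '2018-07-14', '2017-07-04'],
    'Date 4': ['2001-05-04', '2015-09-12', '2018-07-04'],
    'Bill 2': [8, 17, 35]
})
for col in df.filter(like='Date').columns:
    df[col] = pd.to_datetime(df[col])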
I have two tables like this:
Customr Issue Date_Issue
1 1 01/01/2019
1 2 03/06/2019
1 3 04/07/2019
1 4 13/09/2019
2 5 01/02/2019
2 6 16/03/2019
2 7 20/08/2019
2 8 30/08/2019
2 9 01/09/2019
3 10 01/02/2019
3 11 03/02/2019
3 12 05/03/2019
3 13 20/04/2019
3 14 25/04/2019
3 15 13/05/2019
3 16 20/05/2019
3 17 25/05/2019
3 18 01/06/2019
3 19 03/07/2019
3 20 20/08/2019
Customr Date_Survey df_Score
1 06/04/2019 10
2 10/06/2019 9
3 01/08/2019 3
And I need to obtain the number of issues of each customer in the three months before the date of the survey, but I cannot get this query working in pandas.
#first table (issues)
df1 = pd.DataFrame({
    'Customr': [1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
    'Issue': list(range(1, 21)),
    'Date_Issue': ['01/01/2019', '03/06/2019', '04/07/2019', '13/09/2019',
                   '01/02/2019', '16/03/2019', '20/08/2019', '30/08/2019',
                   '01/09/2019', '01/02/2019', '03/02/2019', '05/03/2019',
                   '20/04/2019', '25/04/2019', '13/05/2019', '20/05/2019',
                   '25/05/2019', '01/06/2019', '03/07/2019', '20/08/2019']
})
#second table (survey)
df2 = pd.DataFrame({
    'Customr': [1, 2, 3],
    'Date_Survey': ['06/04/2019', '10/06/2019', '01/08/2019'],
    'df_Score': [10, 9, 3]
})
I expect the result
Customr Date_Survey df_Score Count_issues
1 06/04/2019 10 0
2 10/06/2019 9 1
3 01/08/2019 3 5
Use:
#convert columns to datetimes
df1['Date_Issue'] = pd.to_datetime(df1['Date_Issue'], dayfirst=True)
df2['Date_Survey'] = pd.to_datetime(df2['Date_Survey'], dayfirst=True)
#create datetimes for 3 months before
df2['Date1'] = df2['Date_Survey'] - pd.offsets.DateOffset(months=3)
#merge together
df = df1.merge(df2, on='Customr')
#filter by between, select only Customr and get counts
s = df.loc[df['Date_Issue'].between(df['Date1'], df['Date_Survey']), 'Customr'].value_counts()
#map to new column and replace NaNs to 0
df2['Count_issues'] = df2['Customr'].map(s).fillna(0, downcast='int')
print (df2)
Customr Date_Survey df_Score Date1 Count_issues
0 1 2019-04-06 10 2019-01-06 0
1 2 2019-06-10 9 2019-03-10 1
2 3 2019-08-01 3 2019-05-01 5
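An equivalent sketch without the merge, counting directly per survey row (this is an alternative I am adding, not part of the answer above; it assumes df1 and df2 as built earlier, with the date columns already converted and Date1 already computed):
df2['Count_issues'] = [
    ((df1['Customr'] == c) & df1['Date_Issue'].between(d1, d2)).sum()
    for c, d1, d2 in zip(df2['Customr'], df2['Date1'], df2['Date_Survey'])
]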
I am looking to create a new column in pandas based on values from other rows. My sample data:
df = pd.DataFrame({"A": ['a', 'a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
                   "Sales": [2, 3, 7, 1, 4, 3, 5, 6, 9, 10],
                   "Week": [1, 2, 3, 4, 5, 11, 1, 2, 3, 4]})
I want a new column "Last3WeekSales" corresponding to each week, having the sum of sales for the previous 3 weeks.
NOTE: Shift() won't work here as data for some weeks is missing.
Logic which I thought:
Checking the week no. in each row, then summing up the data from w-1, w-2, w-3.
Output required:
A Week Last3WeekSales
0 a 1 0
1 a 2 2
2 a 3 5
3 a 4 12
4 a 5 11
5 a 11 0
6 b 1 0
7 b 2 5
8 b 3 11
9 b 4 20
Use groupby, shift and rolling:
df['Last3WeekSales'] = df.groupby('A')['Sales']\
.apply(lambda x: x.shift(1)
.rolling(3, min_periods=1)
.sum())\
.fillna(0)
Output:
A Sales Week Last3WeekSales
0 a 2 1 0.0
1 a 3 2 2.0
2 a 7 3 5.0
3 a 1 4 12.0
4 a 4 5 11.0
5 a 3 11 12.0
6 b 5 1 0.0
7 b 6 2 5.0
8 b 9 3 11.0
9 b 10 4 20.0
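Note that this rolls over the previous 3 rows, not the previous 3 calendar weeks, so the week-11 row gets 12.0 rather than the 0 the question asks for. A week-aware sketch (my own variant, not part of the original answer) reindexes each group over its full week range so missing weeks count as zero:
import pandas as pd

def last3weeks(g):
    # index sales by week and fill the missing weeks with 0
    s = g.set_index('Week')['Sales']
    full = s.reindex(range(s.index.min(), s.index.max() + 1), fill_value=0)
    # sum of the previous 3 calendar weeks; the first week has no history -> 0
    out = full.shift(1).rolling(3, min_periods=1).sum().fillna(0)
    # keep only the weeks actually present, in the group's row order
    return pd.Series(out.loc[g['Week']].to_numpy(), index=g.index)

df['Last3WeekSales'] = df.groupby('A', group_keys=False).apply(last3weeks)
This reproduces the desired output above, including 0 for week 11.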
you can use a rolling sum over the last 3 values, and shift(n) to shift your column by n rows (1 in your case).
If we suppose you have a column 'Sales' with the sales of each week, the code would be:
df["Last3WeekSales"] = df.groupby("A")["Sales"].apply(lambda x: x.shift(1).rolling(3).sum())
I want to make a new column of the 5-day return for a stock, using a pandas dataframe. I computed a moving average using the rolling_mean function, but I'm not sure how to reference rows the way I would in a spreadsheet (B6-B1, for example). Does anyone know how I can do this index reference and subtraction?
sample data frame:
day price 5-day-return
1 10 -
2 11 -
3 15 -
4 14 -
5 12 -
6 18 I want to find this ((day 6 price) - (day 1 price))
7 20 then continue this down the list
8 19
9 21
10 22
Is this what you want:
In [10]:
df['5-day-return'] = (df['price'] - df['price'].shift(5)).fillna(0)
df
Out[10]:
day price 5-day-return
0 1 10 0
1 2 11 0
2 3 15 0
3 4 14 0
4 5 12 0
5 6 18 8
6 7 20 9
7 8 19 4
8 9 21 7
9 10 22 10
shift returns the value from the row at the given offset; we use this to subtract the price 5 rows back from the current one. fillna fills the NaN values that occur before the first valid calculation.
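If a percentage 5-day return is wanted rather than a raw price difference (an assumption on my part; the question only shows the difference), pct_change does the same row-offset arithmetic in one call:
df['5-day-return-pct'] = df['price'].pct_change(periods=5).fillna(0)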