I have a large year-long dataframe of occurrences with month (1-12), week (1-52), day_of_week (0-6), and hour (0-23).
Below is just a snippet of the dataset. Each row is an occurrence.
The first part of the snippet shows multiple occurrences captured with a timestamp of 2018-04-01 00:00:00 (Sunday). The second part (after the first ellipsis) shows occurrences in the following hour, the third part shows the hour after that, and so on.
month week day_of_week hour
0 4 13 6 0
1 4 13 6 0
2 4 13 6 0
3 4 13 6 0
4 4 13 6 0
...
100 4 13 6 1
101 4 13 6 1
102 4 13 6 1
...
...
300 4 13 6 2
301 4 13 6 2
302 4 13 6 2
...
I would like to be able to display a summary of this dataset showing the weekly average count of occurrences for each of the hours (0-23) as well as for each month.
For example:
month hour weekly_ave
4 0 100
4 1 175
4 2 250
...
4 23 500
5 0 90
How do I do this using pandas groupby and aggregate functions?
Thanks!
df.groupby(['month','hour'])['hour'].count()
Then, if you need this formatted a little bit nicer:
df.groupby(['month','hour'])['hour'].count().rename("weekly_ave").reset_index()
I was able to figure it out. I had to do a second groupby:
df.groupby(['month', 'hour', 'week']) \
.agg({'day_of_week': 'count'}) \
.groupby(['month', 'hour']).mean() \
.rename(columns={"day_of_week": "weekly_ave"}).reset_index()
This gave me what I needed but is there a more elegant way of doing this?
Thanks.
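A slightly more compact variant of the two-step groupby above can be sketched as follows. It uses size() for the per-week counts, then averages over the week level. The toy data is made up for illustration and is not taken from the real dataset.

```python
import pandas as pd

# Hypothetical toy data: one row per occurrence
df = pd.DataFrame({
    "month": [4, 4, 4, 4, 4, 4],
    "week": [13, 13, 14, 14, 14, 14],
    "day_of_week": [6, 6, 0, 0, 1, 2],
    "hour": [0, 0, 0, 0, 0, 1],
})

weekly_ave = (
    df.groupby(["month", "hour", "week"]).size()   # occurrences per (month, hour, week)
      .groupby(level=["month", "hour"]).mean()     # average those counts over weeks
      .reset_index(name="weekly_ave")
)
print(weekly_ave)
```

Note that a week with no occurrences at a given hour produces no row in the first groupby, so it is left out of the mean rather than counted as zero. If such weeks should count as zero, the counts would need to be reindexed over all weeks first.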
I have following dataframe in pandas
code tank nozzle_1 nozzle_2 nozzle_var nozzle_sale
123 1 1 1 10 10
123 1 2 2 12 10
123 2 1 1 10 10
123 2 2 2 12 10
123 1 1 1 10 10
123 2 2 2 12 10
Now, I want to generate cumulative sum of all the columns grouping over tank and take out the last observation. nozzle_1 and nozzle_2 columns are dynamic, it could be nozzle_3, nozzle_4....nozzle_n etc. I am doing following in pandas to get the cumsum
## Below code for calculating cumsum of dynamic columns nozzle_1 and nozzle_2
cols = df.columns[df.columns.str.contains(pat=r'nozzle_\d+$', regex=True)]
df = df.assign(**df.groupby('tank')[cols].agg(['cumsum'])\
.pipe(lambda x: x.set_axis(x.columns.map('_'.join), axis=1)))
## nozzle_sale_cumsum is static column
df['nozzle_sale_cumsum'] = df.groupby('tank')['nozzle_sale'].cumsum()
From above code I will get cumsum of following columns
tank nozzle_1 nozzle_2 nozzle_var nozzle_1_cumsum nozzle_2_cumsum nozzle_sale_cumsum
1 1 1 10 1 1 10
1 2 2 12 3 3 20
2 1 1 10 1 1 10
2 2 2 12 3 3 20
1 1 1 10 4 4 30
2 2 2 12 5 5 30
Now, I want to get last values of all 3 cumsum columns grouping over tank. I can do it with following code in pandas, but it is hard coded with column names.
final_df= df.groupby('tank').agg({'nozzle_1_cumsum':'last',
'nozzle_2_cumsum':'last',
'nozzle_sale_cumsum':'last',
}).reset_index()
The problem with the above code is that nozzle_1_cumsum and nozzle_2_cumsum are hard-coded, but the actual column names vary. How can I do this in pandas with dynamic columns?
How about:
df.filter(regex='_cumsum').groupby(df['tank']).last()
Output:
nozzle_1_cumsum nozzle_2_cumsum nozzle_sale_cumsum
tank
1 4 4 30
2 5 5 30
You can also replace df.filter(...) by, e.g., df.iloc[:,-3:] or df[col_names].
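As a minimal end-to-end sketch using the sample data from the question: the nozzle_<n> columns are selected by regex, the cumulative sums get a _cumsum suffix, and the last row per tank is taken. The variable names cols, cumsums, final_df, and totals are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "code": [123] * 6,
    "tank": [1, 1, 2, 2, 1, 2],
    "nozzle_1": [1, 2, 1, 2, 1, 2],
    "nozzle_2": [1, 2, 1, 2, 1, 2],
    "nozzle_var": [10, 12, 10, 12, 10, 12],
    "nozzle_sale": [10, 10, 10, 10, 10, 10],
})

# Pick the nozzle_<n> columns dynamically, plus the static nozzle_sale column
cols = df.filter(regex=r"^nozzle_\d+$").columns.tolist() + ["nozzle_sale"]

# Per-tank cumulative sums, renamed to <column>_cumsum
cumsums = df.groupby("tank")[cols].cumsum().add_suffix("_cumsum")

# Last cumulative value per tank, with no hard-coded column names
final_df = cumsums.groupby(df["tank"]).last()
print(final_df)

# The last value of a running sum is the group total, so this gives the same numbers
totals = df.groupby("tank")[cols].sum()
```

If only the final values are needed, the plain sum() at the end avoids building the intermediate cumsum columns at all.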
I have a dataframe which looks like this:
UserId Date_watched Days_not_watch
1 2010-09-11 5
1 2010-10-01 8
1 2010-10-28 1
2 2010-05-06 12
2 2010-05-18 5
3 2010-08-09 10
3 2010-09-25 5
I want to find the number of gap days for each user, so I want a new column computed per row within each user. My dataframe should look something like this:
UserId Date_watched Days_not_watch Gap(2nd watch_date - 1st watch_date - days_not_watch)
1 2010-09-11 5 0 (First gap will be 0 for all users)
1 2010-10-01 8 15 (11th Sept+5=16th Sept; 1st Oct - 16th Sept=15days)
1 2010-10-28 1 9
2 2010-05-06 12 0
2 2010-05-18 5 0 (because 6th May+12 days=18th May)
3 2010-08-09 10 0
3 2010-09-25 4 36
3 2010-10-01 2 2
I have mentioned the formula for calculating the Gap beside the column name of the dataframe.
Here is one approach using groupby + shift:
# sort by date first
df['Date_watched'] = pd.to_datetime(df['Date_watched'])
df = df.sort_values(['UserId', 'Date_watched'])
# calculate groupwise start dates, shifted
grp = df.groupby('UserId')
starts = grp['Date_watched'].shift() + \
pd.to_timedelta(grp['Days_not_watch'].shift(), unit='d')
# calculate timedelta gaps
df['Gap'] = (df['Date_watched'] - starts).fillna(pd.Timedelta(0))
# convert to days and then integers
df['Gap'] = (df['Gap'] / pd.Timedelta('1 day')).astype(int)
print(df)
UserId Date_watched Days_not_watch Gap
0 1 2010-09-11 5 0
1 1 2010-10-01 8 15
2 1 2010-10-28 1 19
3 2 2010-05-06 12 0
4 2 2010-05-18 5 0
5 3 2010-08-09 10 0
6 3 2010-09-25 5 37
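For reference, here is a self-contained sketch of the same groupby + shift idea on the sample data. It uses the .dt.days accessor in place of dividing by a one-day Timedelta. The name starts follows the answer above; everything else is taken from the question.

```python
import pandas as pd

df = pd.DataFrame({
    "UserId": [1, 1, 1, 2, 2, 3, 3],
    "Date_watched": pd.to_datetime([
        "2010-09-11", "2010-10-01", "2010-10-28",
        "2010-05-06", "2010-05-18",
        "2010-08-09", "2010-09-25",
    ]),
    "Days_not_watch": [5, 8, 1, 12, 5, 10, 5],
})
df = df.sort_values(["UserId", "Date_watched"])

grp = df.groupby("UserId")
# Previous watch date plus previous Days_not_watch, per user (NaT on each first row)
starts = grp["Date_watched"].shift() + pd.to_timedelta(grp["Days_not_watch"].shift(), unit="D")

# Whole days between that point and the current watch date; first row per user becomes 0
df["Gap"] = (df["Date_watched"] - starts).dt.days.fillna(0).astype(int)
print(df)
```

A negative Gap would appear if a user came back before the previous Days_not_watch had elapsed. Whether that should be clipped to 0 depends on the intended definition, which the question does not specify.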
I have a dataframe like this:
userId date new doa
67 23 2018-07-02 1 2
68 23 2018-07-03 1 3
69 23 2018-07-04 1 4
70 23 2018-07-06 1 6
71 23 2018-07-07 1 7
72 23 2018-07-10 1 10
73 23 2018-07-11 1 11
74 23 2018-07-13 1 13
75 23 2018-07-15 1 15
76 23 2018-07-16 1 16
77 23 2018-07-17 1 17
......
194605 448053 2018-08-11 1 11
194606 448054 2018-08-11 1 11
194607 448065 2018-08-11 1 11
df['doa'] stands for day of appearance.
Now I want to find out which unique userIds appeared on each day, i.e. which userIds appear on day 1, day 2, day 3, and so on. How exactly do I group them? I also want to find the average number of days per month on which unique users open the app.
Finally, I want to find out which users appeared at least once every day throughout the month.
I want some thing like this:
userId week_no ndays
23 1 2
23 2 5
23 3 6
.....
1533 1 0
1534 2 1
1534 3 4
1534 4 1
1553 1 1
1553 2 0
1553 3 0
1553 4 0
And so on. ndays means no. of days in a week.
You're asking several different questions. None of them is particularly difficult; they just require a couple of groupbys and aggregation operations.
Setup
import pandas as pd
df = pd.DataFrame({
'userId': [1,1,1,1,1,2,2,2,2,3,3,3,3,3],
'date': ['2018-07-02', '2018-07-03', '2018-08-04', '2018-08-05', '2018-08-06',
'2018-07-02', '2018-07-03', '2018-08-04', '2018-08-05', '2018-07-02', '2018-07-03',
'2018-07-04', '2018-07-05', '2018-08-06']
})
df.date = pd.to_datetime(df.date)
df['doa'] = df.date.dt.day
userId date doa
0 1 2018-07-02 2
1 1 2018-07-03 3
2 1 2018-08-04 4
3 1 2018-08-05 5
4 1 2018-08-06 6
5 2 2018-07-02 2
6 2 2018-07-03 3
7 2 2018-08-04 4
8 2 2018-08-05 5
9 3 2018-07-02 2
10 3 2018-07-03 3
11 3 2018-07-04 4
12 3 2018-07-05 5
13 3 2018-08-06 6
Questions
How do I find the unique visitors per day?
You may use groupby and unique:
df.groupby([df.date.dt.month, 'doa']).userId.unique()
date doa
7 2 [1, 2, 3]
3 [1, 2, 3]
4 [3]
5 [3]
8 4 [1, 2]
5 [1, 2]
6 [1, 3]
Name: userId, dtype: object
How do I find the average number of days per month users open the app?
Using groupby and size:
df.groupby(['userId', df.date.dt.month]).size()
userId date
1 7 2
8 3
2 7 2
8 2
3 7 4
8 1
dtype: int64
This will give you the number of times per month each unique visitor has visited. If you want the average of this, simply apply mean:
df.groupby(['userId', df.date.dt.month]).size().groupby('date').mean()
date
7 2.666667
8 2.000000
dtype: float64
The last request was less clear, but it seems that you want the number of days a user was seen per week:
You can group by userId together with a variation on your date column that creates continuous week numbers starting at the minimum date, then use size:
(df.groupby(
['userId', (df.date.dt.isocalendar().week.sub(df.date.dt.isocalendar().week.min())+1).rename('week_no')])
.size().reset_index(name='ndays')
)
userId week_no ndays
0 1 1 2
1 1 5 2
2 1 6 1
3 2 1 2
4 2 5 2
5 3 1 4
6 3 6 1
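The question also asks which users appeared at least once on every day of a month. One possible sketch is to count distinct dates per user and month and compare that count with the number of days in the month. The toy data and the names days_seen and every_day are hypothetical, and the date column is assumed to hold dates without a time component.

```python
import pandas as pd

# Hypothetical toy data: user 1 appears on every day of February 2018, user 2 on two days
df = pd.DataFrame({
    "userId": [1] * 28 + [2, 2],
    "date": list(pd.date_range("2018-02-01", "2018-02-28"))
            + [pd.Timestamp("2018-02-03"), pd.Timestamp("2018-02-10")],
})

# Distinct days on which each user was seen, per calendar month
days_seen = (
    df.groupby(["userId", df["date"].dt.to_period("M").rename("month")])["date"]
      .nunique()
      .reset_index(name="days_seen")
)
days_seen["days_in_month"] = days_seen["month"].dt.days_in_month

# Keep the (userId, month) pairs where the user was seen on every day of that month
every_day = days_seen[days_seen["days_seen"] == days_seen["days_in_month"]]
print(every_day)
```

If the date column carried times as well, normalizing it first (for example with df["date"].dt.normalize()) would be needed so that several visits on one day count once.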
I have the following data:
df =
MONTH DAY HOUR DURATION
1 1 7 20
1 1 7 21
1 2 7 20
1 2 8 22
2 1 7 19
2 1 8 25
2 1 8 29
2 2 8 27
I want to get the mean DURATION grouped by HOUR and averaged over MONTH and DAY. In other words, I want to know what is the average DURATION per HOUR.
This is my current code. If I delete 'MONTH','DAY' from df.groupby(['MONTH','DAY','HOUR','DURATION']), then I get higher values of DURATION, which are not correct. Therefore I decided to keep 'MONTH','DAY'.
grouped = df.groupby(['MONTH','DAY','HOUR','DURATION']).size() \
.groupby(level=['HOUR','DURATION']).mean().reset_index()
grouped
However, it still gives me incorrect output. This is an example for some random data (hour 8 is repeated many times, and a column named 0 appears).
HOUR DURATION 0
0 7 122.0 1.0
1 8 77.0 1.0
2 8 82.0 1.0
3 8 83.0 1.0
Have you tried:
df.groupby("HOUR").agg({'DURATION': 'mean'})
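A small sketch on the sample data from the question, showing two possible readings of "average DURATION per HOUR": the plain mean over all rows for each hour, and the mean of the per-day means. The names per_hour and per_hour_of_daily are illustrative, and which reading is intended is an assumption.

```python
import pandas as pd

df = pd.DataFrame({
    "MONTH":    [1, 1, 1, 1, 2, 2, 2, 2],
    "DAY":      [1, 1, 2, 2, 1, 1, 1, 2],
    "HOUR":     [7, 7, 7, 8, 7, 8, 8, 8],
    "DURATION": [20, 21, 20, 22, 19, 25, 29, 27],
})

# Reading 1: mean DURATION over all rows that share an HOUR
per_hour = df.groupby("HOUR")["DURATION"].mean()

# Reading 2: mean per (MONTH, DAY, HOUR) first, then average those daily means per HOUR
per_hour_of_daily = (
    df.groupby(["MONTH", "DAY", "HOUR"])["DURATION"].mean()
      .groupby(level="HOUR").mean()
)
print(per_hour)
print(per_hour_of_daily)
```

The two results differ whenever days contribute different numbers of rows. In both cases DURATION is the aggregated column and is not a grouping key, which avoids the repeated hours and the count column named 0 seen in the question's output.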
I want to make a new column of the 5-day return for a stock. I am using a pandas dataframe. I computed a moving average using the rolling_mean function, but I'm not sure how to reference rows as I would in a spreadsheet (B6-B1, for example). Does anyone know how I can do this index reference and subtraction?
sample data frame:
day price 5-day-return
1 10 -
2 11 -
3 15 -
4 14 -
5 12 -
6 18 i want to find this ((day 5 price) -(day 1 price) )
7 20 then continue this down the list
8 19
9 21
10 22
Are you wanting this:
In [10]:
df['5-day-return'] = (df['price'] - df['price'].shift(5)).fillna(0)
df
Out[10]:
day price 5-day-return
0 1 10 0
1 2 11 0
2 3 15 0
3 4 14 0
4 5 12 0
5 6 18 8
6 7 20 9
7 8 19 4
8 9 21 7
9 10 22 10
shift returns the row at a specific offset; we subtract that shifted value from the current row. fillna fills the NaN values that occur before the first valid calculation.
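The same subtraction can also be sketched with diff, which is shorthand for the value minus its shifted counterpart. The example below also adds a percentage return with pct_change. The column name 5-day-return-pct is made up for illustration, and the leading rows are left as NaN here instead of being filled with 0.

```python
import pandas as pd

df = pd.DataFrame({
    "day": range(1, 11),
    "price": [10, 11, 15, 14, 12, 18, 20, 19, 21, 22],
})

# price minus the price 5 rows earlier; equivalent to price - price.shift(5)
df["5-day-return"] = df["price"].diff(5)

# Relative change over the same 5-row offset: price / price.shift(5) - 1
df["5-day-return-pct"] = df["price"].pct_change(5)
print(df)
```

Leaving the first five rows as NaN keeps "no data yet" distinguishable from a genuine return of zero; fillna(0) can still be applied afterwards if zeros are preferred.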