Here is the data:
id  date    population
1   2021-5  21
2   2021-5  22
3   2021-5  23
4   2021-5  24
1   2021-4  17
2   2021-4  24
3   2021-4  18
4   2021-4  29
1   2021-3  20
2   2021-3  29
3   2021-3  17
4   2021-3  22
I want to calculate the monthly change in population for each id, so the result will be:
id  date  delta
1   5     .2353
1   4     -.15
2   5     -.1519
2   4     -.2083
3   5     .2174
3   4     .0556
4   5     -.2083
4   4     .3182
delta := (this month - last month) / last month
For example, for id 1 in month 5: (21 - 17) / 17 ≈ .2353.
How do I approach this in pandas? I'm thinking of groupby but don't know what to do next. Bear in mind there might be more dates, but the result always has this format.
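For reference, a minimal sketch that builds the sample above (column names taken from the table; dates kept as 'YYYY-M' strings):
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3, 4] * 3,
    'date': ['2021-5'] * 4 + ['2021-4'] * 4 + ['2021-3'] * 4,
    'population': [21, 22, 23, 24, 17, 24, 18, 29, 20, 29, 17, 22],
})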
Use GroupBy.pct_change after sorting by both columns; last, remove missing rows by the delta column:
df['date'] = pd.to_datetime(df['date'])
# newest month first within each id
df = df.sort_values(['id', 'date'], ascending=[True, False])
# periods=-1 compares each row with the next (older) row in its group
df['delta'] = df.groupby('id')['population'].pct_change(-1)
df = df.dropna(subset=['delta'])
print(df)
id date population delta
0 1 2021-05-01 21 0.235294
4 1 2021-04-01 17 -0.150000
1 2 2021-05-01 22 -0.083333
5 2 2021-04-01 24 -0.172414
2 3 2021-05-01 23 0.277778
6 3 2021-04-01 18 0.058824
3 4 2021-05-01 24 -0.172414
7 4 2021-04-01 29 0.318182
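If pct_change(-1) feels opaque: it compares each row with the next row in its group, which here is the previous month because rows are sorted newest-first. Written out explicitly, the equivalent is:
# next row within each id = the previous month (rows sorted newest-first)
prev = df.groupby('id')['population'].shift(-1)
df['delta'] = (df['population'] - prev) / prev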
Try this:
(df.groupby('id')['population']
   .rolling(2)
   .apply(lambda x: (x.iloc[0] - x.iloc[1]) / x.iloc[0])
   .dropna())
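Note that with the sample ordering (newest month first within each id), x.iloc[0] is the newer value, so this computes (this month - last month) / this month. If you want the formula exactly as stated in the question, divide by the older value instead; a sketch:
# divide by the older month (x.iloc[1]) to match the question's formula
(df.groupby('id')['population']
   .rolling(2)
   .apply(lambda x: (x.iloc[0] - x.iloc[1]) / x.iloc[1])
   .dropna())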
maybe you could try something like:
# sort so each id's months are in ascending order
data = data.sort_values(['id', 'date'])
# change vs. the previous month, divided by the previous month
data['delta'] = data.groupby('id')['population'].diff()
data['delta'] /= data.groupby('id')['population'].shift()
with this approach the first row of each id would be NaN, but for the rest, this should work.
I have a pandas dataframe with several columns, and I would like to know, for each row, the number of date columns with a date later than 2016-12-31. Here is an example:
ID  Bill  Date 1      Date 2      Date 3      Date 4      Bill 2
4   6     2000-10-04  2000-11-05  1999-12-05  2001-05-04  8
6   8     2016-05-03  2017-08-09  2018-07-14  2015-09-12  17
12  14    2016-11-16  2017-05-04  2017-07-04  2018-07-04  35
And I would like to get this column
Count
0
2
3
Just create the mask and call sum on axis=1
date = pd.to_datetime('2016-12-31')
(df[['Date 1','Date 2','Date 3','Date 4']]>date).sum(1)
OUTPUT:
0 0
1 2
2 3
dtype: int64
If needed, call .to_frame('Count') to create a DataFrame with the column named Count:
(df[['Date 1','Date 2','Date 3','Date 4']]>date).sum(1).to_frame('Count')
Count
0 0
1 2
2 3
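If the Date columns are stored as plain strings rather than datetimes, the comparison with a Timestamp will raise a TypeError, so convert them first (a minimal sketch, assuming the four column names above):
date_cols = ['Date 1', 'Date 2', 'Date 3', 'Date 4']
df[date_cols] = df[date_cols].apply(pd.to_datetime)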
Use df.filter to select the Date* columns, then .sum(axis=1):
(df.filter(like='Date') > '2016-12-31').sum(axis=1).to_frame(name='Count')
Result:
Count
0 0
1 2
2 3
You can do:
df['Count'] = (df.loc[:, [x for x in df.columns if 'Date' in x]] > '2016-12-31').sum(axis=1)
Output:
ID Bill Date 1 Date 2 Date 3 Date 4 Bill 2 Count
0 4 6 2000-10-04 2000-11-05 1999-12-05 2001-05-04 8 0
1 6 8 2016-05-03 2017-08-09 2018-07-14 2015-09-12 17 2
2 12 14 2016-11-16 2017-05-04 2017-07-04 2018-07-04 35 3
We select the columns with 'Date' in the name, which is better when there are many such columns and we don't want to list them one by one. Then we compare them with the lookup date and sum the True values.
I have a dataframe that looks like this:
ID | START      | END
1  | 2016-12-31 | 2017-02-28
2  | 2017-01-30 | 2017-10-30
3  | 2016-12-21 | 2018-12-30
I want to know the number of active IDs on each possible day, so basically count the number of overlapping time periods.
What I did to calculate this was to create a new dataframe c_df with the columns date and count. The date column was populated using a range:
all_dates = pd.date_range(start=min(df['START']), end=max(df['END']))
Then, for every row in my original dataframe, I calculated a range between its start and end dates:
id_dates = pd.date_range(start=min(user['START']), end=max(user['END']))
I then used this range of dates to increment the corresponding count cell in c_df by one.
All these loops, though, are not very efficient for big datasets, and the code looks ugly. Is there a more efficient way of doing this?
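For reference, a minimal sketch that builds the sample above as proper datetimes (using the corrected END date for the first row):
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 3],
    'START': pd.to_datetime(['2016-12-31', '2017-01-30', '2016-12-21']),
    'END': pd.to_datetime(['2017-02-28', '2017-10-30', '2018-12-30']),
})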
If your dataframe is small enough that performance is not a concern, create a date range for each row, then explode them and count how many times each date appears in the exploded series.
Requires pandas >= 0.25:
df.apply(lambda row: pd.date_range(row['START'], row['END']), axis=1) \
.explode() \
.value_counts() \
.sort_index()
If your dataframe is large, take advantage of numpy broadcasting to improve performance.
Works with any version of pandas:
dates = pd.date_range(df['START'].min(), df['END'].max()).values
# column vectors of shape (n, 1) broadcast against the 1-D dates array
start = df['START'].values[:, None]
end = df['END'].values[:, None]
# mask[i, j] is True when interval i contains date j
mask = (start <= dates) & (dates <= end)
result = pd.DataFrame({
'Date': dates,
'Count': mask.sum(axis=0)
})
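One design note: mask is a len(df)-by-len(dates) boolean array, so this approach trades memory for speed; for very long date spans with many rows, the exploded or interval-based approaches may be gentler on memory.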
Create an IntervalIndex and use a genexp or list comprehension with contains to check each date against each interval (note: I made a smaller sample to test this solution on).
Sample `df`
Out[56]:
ID START END
0 1 2016-12-31 2017-01-20
1 2 2017-01-20 2017-01-30
2 3 2016-12-28 2017-02-03
3 4 2017-01-20 2017-01-25
iix = pd.IntervalIndex.from_arrays(df.START, df.END, closed='both')
all_dates = pd.date_range(start=min(df['START']), end=max(df['END']))
df_final = pd.DataFrame({'dates': all_dates,
'date_counts': (iix.contains(dt).sum() for dt in all_dates)})
In [58]: df_final
Out[58]:
dates date_counts
0 2016-12-28 1
1 2016-12-29 1
2 2016-12-30 1
3 2016-12-31 2
4 2017-01-01 2
5 2017-01-02 2
6 2017-01-03 2
7 2017-01-04 2
8 2017-01-05 2
9 2017-01-06 2
10 2017-01-07 2
11 2017-01-08 2
12 2017-01-09 2
13 2017-01-10 2
14 2017-01-11 2
15 2017-01-12 2
16 2017-01-13 2
17 2017-01-14 2
18 2017-01-15 2
19 2017-01-16 2
20 2017-01-17 2
21 2017-01-18 2
22 2017-01-19 2
23 2017-01-20 4
24 2017-01-21 3
25 2017-01-22 3
26 2017-01-23 3
27 2017-01-24 3
28 2017-01-25 3
29 2017-01-26 2
30 2017-01-27 2
31 2017-01-28 2
32 2017-01-29 2
33 2017-01-30 2
34 2017-01-31 1
35 2017-02-01 1
36 2017-02-02 1
37 2017-02-03 1
I have a dataframe which looks like this:
UserId Date_watched Days_not_watch
1 2010-09-11 5
1 2010-10-01 8
1 2010-10-28 1
2 2010-05-06 12
2 2010-05-18 5
3 2010-08-09 10
3 2010-09-25 5
I want to find out the number of days the user left as a gap, so I want a column for each row for each user, and my dataframe should look something like this:
UserId Date_watched Days_not_watch Gap(2nd watch_date - 1st watch_date - days_not_watch)
1 2010-09-11 5 0 (First gap will be 0 for all users)
1 2010-10-01 8 15 (11th Sept+5=16th Sept; 1st Oct - 16th Sept=15days)
1 2010-10-28 1 9
2 2010-05-06 12 0
2 2010-05-18 5 0 (because 6th May+12 days=18th May)
3 2010-08-09 10 0
3 2010-09-25 4 36
3 2010-10-01 2 2
I have mentioned the formula for calculating the Gap beside the column name of the dataframe.
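For reference, a minimal sketch that builds the sample input above (only the rows shown in the input table):
import pandas as pd

df = pd.DataFrame({
    'UserId': [1, 1, 1, 2, 2, 3, 3],
    'Date_watched': ['2010-09-11', '2010-10-01', '2010-10-28',
                     '2010-05-06', '2010-05-18', '2010-08-09', '2010-09-25'],
    'Days_not_watch': [5, 8, 1, 12, 5, 10, 5],
})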
Here is one approach using groupby + shift:
# sort by date first
df['Date_watched'] = pd.to_datetime(df['Date_watched'])
df = df.sort_values(['UserId', 'Date_watched'])
# calculate groupwise start dates, shifted
grp = df.groupby('UserId')
starts = grp['Date_watched'].shift() + \
pd.to_timedelta(grp['Days_not_watch'].shift(), unit='d')
# calculate timedelta gaps
df['Gap'] = (df['Date_watched'] - starts).fillna(pd.Timedelta(0))
# convert to days and then integers
df['Gap'] = (df['Gap'] / pd.Timedelta('1 day')).astype(int)
print(df)
UserId Date_watched Days_not_watch Gap
0 1 2010-09-11 5 0
1 1 2010-10-01 8 15
2 1 2010-10-28 1 19
3 2 2010-05-06 12 0
4 2 2010-05-18 5 0
5 3 2010-08-09 10 0
6 3 2010-09-25 5 37
I have a dataframe with a date column, like below:
df = pd.DataFrame({'Date':pd.date_range('2018-10-01', periods=14)})
I want to append a week number column based on the date, so that 2018-10-01 is week 1, 2018-10-08 (7 days later) is week 2, and so on. Any help on how I can perform this?
Use weekofyear with factorize, adding 1 so the groups start from 1:
df['Week'] = pd.factorize(df['Date'].dt.weekofyear)[0] + 1
print(df)
Date Week
0 2018-10-01 1
1 2018-10-02 1
2 2018-10-03 1
3 2018-10-04 1
4 2018-10-05 1
5 2018-10-06 1
6 2018-10-07 1
7 2018-10-08 2
8 2018-10-09 2
9 2018-10-10 2
10 2018-10-11 2
11 2018-10-12 2
12 2018-10-13 2
13 2018-10-14 2
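Note that weekofyear is deprecated in newer pandas, and factorize only yields consecutive week numbers when the dates are contiguous and sorted. A version-independent sketch that numbers 7-day blocks from the earliest date:
# number 7-day blocks starting at the earliest date
df['Week'] = (df['Date'] - df['Date'].min()).dt.days // 7 + 1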
I have a dataframe like this:
userId date new doa
67 23 2018-07-02 1 2
68 23 2018-07-03 1 3
69 23 2018-07-04 1 4
70 23 2018-07-06 1 6
71 23 2018-07-07 1 7
72 23 2018-07-10 1 10
73 23 2018-07-11 1 11
74 23 2018-07-13 1 13
75 23 2018-07-15 1 15
76 23 2018-07-16 1 16
77 23 2018-07-17 1 17
......
194605 448053 2018-08-11 1 11
194606 448054 2018-08-11 1 11
194607 448065 2018-08-11 1 11
df['doa'] stands for day of appearance.
Now I want to find out which unique userIds have appeared on a daily basis, i.e. which userIds appear on day 1, day 2, day 3, and so on. How exactly do I group them? I also want to find out the average number of days per month that unique users open the app.
Finally, I want to find out which users have appeared at least once every day throughout the month.
I want something like this:
userId week_no ndays
23 1 2
23 2 5
23 3 6
.....
1533 1 0
1534 2 1
1534 3 4
1534 4 1
1553 1 1
1553 2 0
1553 3 0
1553 4 0
And so on. ndays means the number of days the user appeared in that week.
You're asking several different questions, and none of them is particularly difficult; they just require a couple of groupbys and aggregation operations.
Setup
df = pd.DataFrame({
'userId': [1,1,1,1,1,2,2,2,2,3,3,3,3,3],
'date': ['2018-07-02', '2018-07-03', '2018-08-04', '2018-08-05', '2018-08-06',
'2018-07-02', '2018-07-03', '2018-08-04', '2018-08-05', '2018-07-02', '2018-07-03',
'2018-07-04', '2018-07-05', '2018-08-06']
})
df.date = pd.to_datetime(df.date)
df['doa'] = df.date.dt.day
userId date doa
0 1 2018-07-02 2
1 1 2018-07-03 3
2 1 2018-08-04 4
3 1 2018-08-05 5
4 1 2018-08-06 6
5 2 2018-07-02 2
6 2 2018-07-03 3
7 2 2018-08-04 4
8 2 2018-08-05 5
9 3 2018-07-02 2
10 3 2018-07-03 3
11 3 2018-07-04 4
12 3 2018-07-05 5
13 3 2018-08-06 6
Questions
How do I find the unique visitors per day?
You may use groupby and unique:
df.groupby([df.date.dt.month, 'doa']).userId.unique()
date doa
7 2 [1, 2, 3]
3 [1, 2, 3]
4 [3]
5 [3]
8 4 [1, 2]
5 [1, 2]
6 [1, 3]
Name: userId, dtype: object
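If you only need the counts rather than the lists of ids, swap unique for nunique:
df.groupby([df.date.dt.month, 'doa']).userId.nunique()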
How do I find the average number of days per month users open the app?
Using groupby and size:
df.groupby(['userId', df.date.dt.month]).size()
userId date
1 7 2
8 3
2 7 2
8 2
3 7 4
8 1
dtype: int64
This will give you the number of times per month each unique visitor has visited. If you want the average of this, simply apply mean:
df.groupby(['userId', df.date.dt.month]).size().groupby('date').mean()
date
7 2.666667
8 2.000000
dtype: float64
This one was a bit more unclear, but it seems that you want the number of days a user was seen per week.
You can group by userId, as well as a variation on your date column that creates continuous weeks starting at the minimum date, then use size:
(df.groupby(
['userId', (df.date.dt.week.sub(df.date.dt.week.min())+1).rename('week_no')])
.size().reset_index(name='ndays')
)
userId week_no ndays
0 1 1 2
1 1 5 2
2 1 6 1
3 2 1 2
4 2 5 2
5 3 1 4
6 3 6 1
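On newer pandas, dt.week and dt.weekofyear are deprecated; a sketch of the same idea using isocalendar() (assuming, as in this sample, all dates fall within one year):
# ISO week number, then shifted so the first observed week is 1
week_no = df.date.dt.isocalendar().week
(df.groupby(['userId', (week_no - week_no.min() + 1).rename('week_no')])
   .size().reset_index(name='ndays'))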