I have a df that is a time series of user access data
UserID Access Date
a 10/01/2019
b 10/01/2019
c 10/01/2019
a 10/02/2019
b 10/02/2019
d 10/02/2019
e 10/03/2019
f 10/03/2019
a 10/03/2019
b 10/03/2019
a 10/04/2019
b 10/04/2019
c 10/05/2019
I have another df that lists out the dates, and I want to aggregate the unique occurrences of UserIDs over the rolling past 3 days. The expected output would look like this:
Date Past_3_days_unique_count
10/01/2019 NaN
10/02/2019 NaN
10/03/2019 6
10/04/2019 5
10/05/2019 5
How would I be able to achieve this?
It's quite straightforward - let me walk you through it via the following snippet and its comments.
import pandas as pd
import numpy as np
# Generate some dates
dates = pd.date_range("01-01-2016", "01-10-2016", freq="6H")
# Generate some user ids
ids = np.random.randint(1, 5, len(dates))
df = pd.DataFrame({"id": ids, "date": dates})
# Count the unique IDs for each day
q = df.groupby(df["date"].dt.to_period("D"))["id"].nunique()
# Rolling 3-day sum of the daily unique counts (a user active on more than one of the days is counted once per day)
q.rolling(3).sum()
Use pandas groupby; the documentation is very good.
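Note that the snippet above sums the per-day unique counts, so a user active on several of the 3 days is counted more than once. If you need IDs that are unique across the whole trailing 3-day window, exactly matching the expected output in the question, here is a minimal sketch built from the sample data (the column names UserID and Access Date are taken from the question, and it assumes the listed dates are consecutive days, as in the sample):
import pandas as pd

df = pd.DataFrame({
    "UserID": list("abcabdefababc"),
    "Access Date": pd.to_datetime(
        ["10/01/2019"] * 3 + ["10/02/2019"] * 3 + ["10/03/2019"] * 4
        + ["10/04/2019"] * 2 + ["10/05/2019"],
        format="%m/%d/%Y"),
})
# one row per calendar day; NaN until a full 3-day window is available
dates = pd.Series(sorted(df["Access Date"].unique()))
counts = [
    df.loc[df["Access Date"].between(d - pd.Timedelta(days=2), d), "UserID"].nunique()
    if i >= 2 else float("nan")
    for i, d in enumerate(dates)
]
out = pd.DataFrame({"Date": dates, "Past_3_days_unique_count": counts})
print(out)
The last three rows come out as 6, 5 and 5, matching the expected output in the question.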
Related
I have a dataset of events with a date column which I need to display in a weekly plot and do some more data processing on afterwards. After some googling I found pd.Grouper(freq="W"), so I am using that to group the events by week and display them. My problem is that after doing the groupby and unstack I end up with a data frame where there is an unnamed column that I am unable to refer to except using iloc. This is an issue because in later plots I am grouping by other columns, so I need a way to refer to this column by name, not iloc.
Here's a reproducible example of my dataset:
from datetime import datetime
import pandas as pd
from faker import Faker
fake = Faker()
start_date = datetime(2023, 1, 1)
end_date = datetime(2023, 2, 1)
# Generate data frame of 30 random dates in January 2023
df = pd.DataFrame(
    {"date": [fake.date_time_between(start_date=start_date, end_date=end_date) for i in range(30)],
     "dummy": [1 for i in range(30)]})  # There's probably a better way of counting than this
grouper = df.set_index("date").groupby([pd.Grouper(freq="W"), 'dummy'])
result = grouper['dummy'].count().unstack('dummy').fillna(0)
The result data frame that I get has weird indexes/columns that I am unable to navigate:
>>> print(result)
dummy 1
date
2023-01-01 1
2023-01-08 3
2023-01-15 4
2023-01-22 9
2023-01-29 8
2023-02-05 5
>>> print(result.columns)
Int64Index([1], dtype='int64', name='dummy')
The only column here is dummy, but even result.dummy gives an AttributeError.
I've also tried result.reset_index():
dummy date 1
0 2023-01-01 1
1 2023-01-08 3
2 2023-01-15 4
3 2023-01-22 9
4 2023-01-29 8
5 2023-02-05 5
But for this data frame I can only get the date column - the counts column named "1" cannot be accessed using result.reset_index()["1"] as I get an AttributeError
I am completely perplexed by what is going on here, pandas is really powerful but sometimes I find it incredibly unintuitive. I've checked several pages of the docs and checked if there's another index level (there isn't). Can someone who's better at pandas help me out here?
I just want a way to convert the grouped data frame into something like this:
date counts
0 2023-01-01 1
1 2023-01-08 3
2 2023-01-15 4
3 2023-01-22 9
4 2023-01-29 8
5 2023-02-05 5
Where date and counts are columns and there is an unnamed index
You can solve this by simply doing:
from datetime import datetime
import pandas as pd
from faker import Faker

fake = Faker()
start_date = datetime(2023, 1, 1)
end_date = datetime(2023, 2, 1)
# Generate data frame of 30 random dates in January 2023
df = pd.DataFrame(
    {"date": [fake.date_time_between(start_date=start_date, end_date=end_date) for i in range(30)],
     "dummy": [1 for i in range(30)]})  # There's probably a better way of counting than this
result = df.groupby([pd.Grouper(freq="W", key='date'), 'dummy'])['dummy'].count()
result = result.reset_index(name='counts')
result = result.drop(['dummy'], axis=1)
which gives
date counts
0 2023-01-01 3
1 2023-01-08 7
2 2023-01-15 5
3 2023-01-22 5
4 2023-01-29 8
5 2023-02-05 2
Given a dataframe like the one below:
df = pd.DataFrame({'date': ['2013-04-19', '2013-04-19', '2013-04-20', '2013-04-20', '2013-04-19'],
'id': [1,2,2,3,1]})
I need to create another dataframe containing only the id and the number of distinct days on which calls were made. An example of the output is as follows:
Id | Count
1 | 1
2 | 2
3 | 1
What I've tried so far:
df2 = df.groupby(['id','date']).size().reset_index().rename(columns={0:'COUNT'})
df2
However, the output is far from what I want. Can anyone help?
You can make use of .nunique() [pandas-doc] to count the unique days per id:
df.groupby('id').date.nunique()
This gives us a series:
>>> df.groupby('id').date.nunique()
id
1 1
2 2
3 1
Name: date, dtype: int64
You can make use of .to_frame() [pandas-doc] to convert it to a dataframe:
>>> df.groupby('id').date.nunique().to_frame('count')
count
id
1 1
2 2
3 1
You can use the pd.DataFrame function to convert the result into a dataframe and then rename the columns however you like.
import pandas as pd
df = pd.DataFrame({'date': ['2013-04-19', '2013-04-19', '2013-04-20', '2013-04-20', '2013-04-19'],
'id': [1,2,2,3,1]})
x = pd.DataFrame(df.groupby('id').date.nunique().reset_index())
x.columns = ['Id', 'Count']
print(x)
I have the following dataframe:
df.index = df['Date']
df.groupby([df.index.month, df['Category']])['Amount'].sum()
Date Category Amount
1 A -125.35
B -40.00
...
12 A 505.15
B -209.00
I would like to report the sum of the Amount only for Category B, like:
Date Category Amount
1 B -40.00
...
12 B -209.00
I tried the df.get_group method, but it needs a tuple containing both the Date and Category keys. Is there a way to filter out only Category B?
You can use IndexSlice:
# groupby here
df_group = df.groupby([df.index.month, df['Category']])[['Amount']].sum()
# report only Category B
df_group.loc[pd.IndexSlice[:,'B'],:]
Or query:
# query works with index level name too
df_group.query('Category=="B"')
Output:
Amount
Date Category
1 B -40.0
12 B -209.0
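Another option, shown here as a small sketch assuming the df_group built above, is a cross-section with .xs, which selects Category B and drops that level from the index:
df_group.xs('B', level='Category')
Pass drop_level=False if you want to keep the Category level in the result.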
Apply a boolean filter for Category equal to B and then group:
mask = df['Category'] == 'B'
filtered = df[mask]
filtered.groupby([filtered.index.month, filtered['Category']])['Amount'].sum()
I am working with Python in BigQuery and have a large dataframe df (circa 7m rows). I also have a list lst that holds some dates (say all days in a given month).
I am trying to create an additional column "random_day" in df with a random value from lst in each row.
I tried running a loop and an apply function, but with such a large dataset it is proving challenging.
My attempts started with the loop solution:
df["rand_day"] = ""
for i in a["row_nr"]:
    rand_day = sample(day_list, 1)[0]
    df.loc[i, "rand_day"] = rand_day
And the apply solution, first defining my function and then calling it:
def random_day():
    rand_day = sample(day_list, 1)[0]
    return rand_day
df["rand_day"] = df.apply(lambda row: random_day())
Any tips on this?
Thank you
Use numpy.random.choice and, if necessary, convert the dates with to_datetime:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [4, 5, 4, 5, 5, 4],
})
day_list = pd.to_datetime(['2015-01-02','2016-05-05','2015-08-09'])
#alternative
#day_list = pd.DatetimeIndex(['2015-01-02','2016-05-05','2015-08-09'])
df["rand_day"] = np.random.choice(day_list, size=len(df))
print (df)
A B rand_day
0 a 4 2016-05-05
1 b 5 2016-05-05
2 c 4 2015-08-09
3 d 5 2015-01-02
4 e 5 2015-08-09
5 f 4 2015-08-09
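If you need the result to be reproducible (an extra note, not part of the original answer), the same idea works with a seeded NumPy generator:
rng = np.random.default_rng(42)  # fixed seed, chosen here only for illustration
df["rand_day"] = rng.choice(day_list, size=len(df))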
Input is like this
Data Id
201505 A
201507 A
201509 A
200001 B
200001 C
200002 C
200005 C
I am finding date gaps using the code below, but it is taking too long to complete for large data. How can I reduce the time complexity?
#convert to datetimes
month['data'] = pd.to_datetime(month['data'], format='%Y%m')
#resample by start of months with asfreq
mdf = month.set_index('data').groupby(['series_id','symbol'])['series_id'].resample('MS').asfreq().rename('val').reset_index()
x = mdf['val'].notnull().rename('g')
#create index by cumulative sum for unique groups for consecutive NaNs
mdf.index = x.cumsum()
#filter only NaNs row and aggregate first, last and count.
mdf = (mdf[~x.values].groupby(['series_id','symbol','g'])['data'].agg(['first','last','size']).reset_index(level=2, drop=True).reset_index())
print(mdf)
Id first last size
0 A 2015-06-01 2015-06-01 1
1 A 2015-08-01 2015-08-01 1
2 B 2000-02-01 2000-02-01 1
3 C 2003-03-01 2003-04-01 2
How can I reduce the time complexity, or is there some other way to find the date gaps?
The assumptions made are the following:
All values in the Data column are unique, even across groups
The data in the data column are integers
The data is sorted by group first and then by value.
Here is my algorithm (mdf is the input df):
import pandas as pd
df2 = pd.DataFrame({'Id': mdf['Id'], 'Data': mdf['Data'],
                    'First': mdf['Data'] + 1, 'Last': (mdf['Data'] - 1).shift(-1)})
# drop each Id's last row: its 'Last' value was shifted in from the next group
df2 = df2.groupby('Id').apply(lambda g: g[g['Data'] != g['Data'].max()]).reset_index(drop=True)
print(df2[~df2['First'].isin(mdf['Data']) & ~df2['Last'].isin(mdf['Data'])])
Building a bit on @RushabhMehta's idea, you can use pd.DateOffset to create the output dataframe. Your input dataframe is called month, with columns 'data' and 'series_id', according to your code. Here is the idea:
month['data'] = pd.to_datetime(month['data'], format='%Y%m')
month = month.sort_values(['series_id','data'])
# create mdf with the column you want
mdf = pd.DataFrame({'Id': month.series_id, 'first': month.data + pd.DateOffset(months=1),
                    'last': (month.groupby('series_id').data.shift(-1) - pd.DateOffset(months=1))})
Note how the column 'last' is created: group by 'series_id', shift the values, and subtract a month with pd.DateOffset(months=1). Now select only the rows where the date in 'first' is on or before the one in 'last' and create the size column like this:
mdf = mdf.loc[mdf['first'] <= mdf['last']]
mdf['size'] = (mdf['last']- mdf['first']).astype('timedelta64[M]')+1
mdf looks like:
first Id last size
0 2015-06-01 A 2015-06-01 1.0
1 2015-08-01 A 2015-08-01 1.0
3 2000-02-01 B 2000-02-01 1.0
6 2000-03-01 C 2000-04-01 2.0
You just need to reorder the columns and reset_index if you want.
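For completeness, a one-line sketch of that final clean-up, with the column names taken from the output above:
mdf = mdf[['Id', 'first', 'last', 'size']].reset_index(drop=True)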