Count unique dates in pandas dataframe - python

I have a dataframe of surface weather observations (fzraHrObs) organized by a station identifier code and date. fzraHrObs has several columns of weather data. The station code and date (datetime objects) look like:
usaf    dat
716270  2014-11-23 12:00:00
        2015-12-20 08:00:00
        2015-12-20 09:00:00
        2015-12-21 04:00:00
        2015-12-28 03:00:00
716280  2015-12-19 08:00:00
        2015-12-19 08:00:00
I would like to get a count of the number of unique dates (days) per year for each station - i.e. the number of days of obs per year at each station. In my example above this would give me:
usaf    Year  Count
716270  2014      1
        2015      3
716280  2014      0
        2015      1
I've tried using groupby and grouping by station, year, and date:
grouped = fzraHrObs['dat'].groupby([fzraHrObs['usaf'], fzraHrObs.dat.dt.year, fzraHrObs.dat.dt.date])
Count, size, nunique, etc. on this just gives me the number of obs on each date, not the number of dates themselves per year. Any suggestions on getting what I want here?

It could be something like this: group the date by usaf and year, then count the number of unique values:
import pandas as pd
df.dat.apply(lambda dt: dt.date()).groupby([df.usaf, df.dat.apply(lambda dt: dt.year)]).nunique()
# usaf    dat
# 716270  2014    1
#         2015    3
# 716280  2015    1
# Name: dat, dtype: int64

The following should work:
df.groupby(['usaf', df.dat.dt.year])['dat'].apply(lambda s: s.dt.date.nunique())
What I did differently is group by two levels only, then use the nunique method of pandas series to count the number of unique dates in each group.
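If you also need the zero rows shown in the desired output (a year in which a station has no observations at all), one possible extension of the above, as a sketch rather than part of the original answers, is to unstack the year level with a fill value and stack it back:
counts = df.groupby(['usaf', df.dat.dt.year])['dat'].apply(lambda s: s.dt.date.nunique())
# unstack the year level into columns, fill missing station/year pairs with 0,
# then stack back into a (usaf, year) index
counts = counts.unstack(fill_value=0).stack()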

Related

calc churn rate in pandas

I have a sales dataset (simplified) with sales from existing customers (first_order = 0):
import pandas as pd
import datetime as dt
df = pd.DataFrame({'Date':['2020-06-30 00:00:00','2020-05-05 00:00:00','2020-04-10 00:00:00','2020-02-26 00:00:00'],
'email':['1#abc.de','2#abc.de','3#abc.de','1#abc.de'],
'first_order':[1,1,1,1],
'Last_Order_Date':['2020-06-30 00:00:00','2020-05-05 00:00:00','2020-04-10 00:00:00','2020-02-26 00:00:00']
})
I would like to analyze how many existing customers we lose per month.
my idea is to
group(count) by month and
then count how many have made their last purchase in the following months which gives me a churn cross table where I can see that e.g. we had 300 purchases in January, and 10 of them bought the last time in February.
like this:
Column B is the total number of repeat customers, and column C onwards shows the last month in which they bought something.
E.g. we had 2400 customers in January, 677 of them made their last purchase in this month, 203 more followed in February etc.
I guess I could first group the total number of sales per month and then group a second dataset by Last_Order_Date and filter by month.
but I guess there is a handy python way ?! :)
any ideas?
thanks!
The code below helps you identify how many purchases were made in each month.
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date')
df.groupby(pd.Grouper(freq="M")).size()
O/P:
Date
2020-02-29 1
2020-03-31 0
2020-04-30 1
2020-05-31 1
2020-06-30 1
Freq: M, dtype: int64
I couldn't fully work out the required output from the data and explanation, but this could be a starting point for you. Please let me know if it helps you in any way.
Update-1:
df.pivot_table(index='Date', columns='email', values='Last_Order_Date', aggfunc='count')
Output:
email 1#abc.de 2#abc.de 3#abc.de
Date
2020-02-26 00:00:00 1.0 NaN NaN
2020-04-10 00:00:00 NaN NaN 1.0
2020-05-05 00:00:00 NaN 1.0 NaN
2020-06-30 00:00:00 1.0 NaN NaN
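For the churn cross table itself (purchase month in the rows, month of that customer's last purchase in the columns), a rough sketch along those lines could use pd.crosstab on the two month periods; this assumes the original frame from the question, before Date is set as the index:
df['Date'] = pd.to_datetime(df['Date'])
df['Last_Order_Date'] = pd.to_datetime(df['Last_Order_Date'])
# rows: month of the purchase, columns: month of the same customer's last purchase
churn = pd.crosstab(df['Date'].dt.to_period('M'),
                    df['Last_Order_Date'].dt.to_period('M'))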

plot the sorted weekdays/month on timeseries dataframe in python

I have an one year of traffic data stored in a data frame.
study time           volume  month    hour  day        year  weekday  week_of_year  weekend
2019-01-01 00:00:00  25      January  0     Tuesday    2019  1        1             0
2019-01-01 00:00:15  25      January  0     Tuesday    2019  1        1             0
2019-01-01 00:00:30  21      January  0     Tuesday    2019  1        1             0
2019-01-02 00:00:00  100     January  0     Wednesday  2019  2        1             0
2019-01-02 00:00:15  2       January  0     Wednesday  2019  2        1             0
2019-01-02 00:00:30  50      January  0     Wednesday  2019  2        1             0
I want to see the hourly, daily, weekly and monthly patterns on volume data. I did so using this script:
import matplotlib.pyplot as plt

fig, ax = plt.subplots(nrows=2, ncols=2, figsize=(16,10))
plt.axes(ax[0,0])
countData19_gdf.groupby(['hour','address']).mean().groupby(['hour'])['volume'].mean().plot(x='hour',y='volume')
plt.ylabel("Total averge counts of the stations")
plt.axes(ax[0,1])
countData19_gdf.groupby(['day','address']).mean().groupby(['day'])['volume'].mean().plot(x='day',y='volume')
plt.axes(ax[1,0])
countData19_gdf.groupby(['week_of_year','address']).mean().groupby(['week_of_year'])['volume'].mean().plot(x='week_of_year',y='volume', rot=90)
plt.ylabel("Total averge counts of the stations")
plt.axes(ax[1,1])
countData19_gdf.groupby(['month','address']).mean().groupby(['month'])['volume'].mean().plot(x='month',y='volume', rot=90)
plt.ylabel("Total averge counts of the stations")
ax[0,0].title.set_text('Hourly')
ax[0,1].title.set_text('Daily')
ax[1,0].title.set_text('Weekly')
ax[1,1].title.set_text('Monthly')
plt.savefig('temporal_global.png')
and the result looks like this, in which the weekdays and months are not sorted.
Can you please help me with how I can sort them? I tried to sort days as integers but it does not work.
The groupby method will automatically sort the index, however for string values, that means sorting alphabetically (and not by, for example, order of weekdays).
What you can do is use the reindex method to put the index in the order you would like. For example:
countData19_gdf.groupby(['day','address']).mean().groupby(['day'])['volume'].mean().reindex(['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']).plot(x='day',y='volume')
Note:
If a value in the index is not present in the list passed to reindex, that row will not be included. Likewise, if the list contains a value that is not in the index, a NaN is assigned to that new index entry. So, if your countData19_gdf doesn't have a day such as Monday, Monday will still appear in the reindexed result, but its value will be NaN.
Edit:
Since you already have numerical values for weekday (you might want to get the same for months), to avoid specifying the new index by hand, you could get sorted string values via:
countData19_gdf.sort_values(by = 'weekday')['day'].unique()
Quick example (I changed around some 'day' values in the given data to display the issue):
df.groupby(['day','address']).mean().groupby(['day'])['volume'].mean().plot(x='day',y='volume')
This outputs a plot with the days in alphabetical order, whereas the reindexed version:
df.groupby(['day','address']).mean().groupby(['day'])['volume'].mean().reindex(['Tuesday','Wednesday','Friday']).plot(x='day',y='volume')
outputs a plot with the days in the specified order.
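For the monthly panel, another option worth sketching (not from the original answer) is to make the month column an ordered categorical, so that groupby sorts it in calendar order rather than alphabetically; this assumes the same countData19_gdf frame with its 'month' column of month names:
import pandas as pd

month_order = ['January', 'February', 'March', 'April', 'May', 'June',
               'July', 'August', 'September', 'October', 'November', 'December']
# with an ordered categorical, groupby and sorting follow calendar order, not alphabetical order
countData19_gdf['month'] = pd.Categorical(countData19_gdf['month'],
                                          categories=month_order, ordered=True)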

Pandas Dataframe resample week, starting first day of the year

I have a dataframe containing hourly data. I want to get the max for each week of the year, so I used resample to group the data by week:
weeks = data.resample("W").max()
The problem is that the weekly max is calculated starting from the first Monday of the year, while I want it to be calculated starting from the first day of the year.
I obtain the following result, where you can notice that there are 53 weeks, and the last week is labelled in the next year even though 2017 doesn't exist in the data:
Date dots
2016-01-03 0.647786
2016-01-10 0.917071
2016-01-17 0.667857
2016-01-24 0.669286
2016-01-31 0.645357
Date dots
2016-12-04 0.646786
2016-12-11 0.857714
2016-12-18 0.670000
2016-12-25 0.674571
2017-01-01 0.654571
is there a way to calculate week for pandas dataframe starting first day of the year?
Find the starting day of the year; for example, say it is a Friday. You can then pass an anchored offset to resample so that each weekly bin ends on the preceding weekday (Thursday), which makes every week start on the first day of the year:
weeks = data.resample("W-THU").max()
One quick remedy, given your data covers a single year, is to group it by day first and then take groups of 7 days:
new_df = (df.resample("D", on='Date').dots
          .max().reset_index())
new_df = new_df.groupby(new_df.index // 7).agg({'Date': 'min', 'dots': 'max'})
new_df.head()
Output:
Date dots
0 2016-01-01 0.996387
1 2016-01-08 0.999775
2 2016-01-15 0.997612
3 2016-01-22 0.979376
4 2016-01-29 0.998240
5 2016-02-05 0.995030
6 2016-02-12 0.987500
and tail:
Date dots
48 2016-12-02 0.999910
49 2016-12-09 0.992910
50 2016-12-16 0.996877
51 2016-12-23 0.992986
52 2016-12-30 0.960348
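Another option worth noting, as a sketch assuming data has a DatetimeIndex as in the first snippet: resampling with a plain 7-day frequency anchors the bins to the first day present in the data rather than to a fixed weekday, so the weeks start on the first day of the year (on recent pandas versions you can also pass origin='start_day' explicitly):
# 7-day bins anchored to the first day in the data (2016-01-01 here), not to a weekday
weeks = data.resample("7D").max()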

Hourly average for each week/month in dataframe (moving average)

I have a dataframe with a full year of data, with values for every second:
YYYY-MO-DD HH-MI-SS_SSS TEMPERATURE (C)
2016-09-30 23:59:55.923 28.63
2016-09-30 23:59:56.924 28.61
2016-09-30 23:59:57.923 28.63
... ...
2017-05-30 23:59:57.923 30.02
I want to create a new dataframe which takes each week or month of values and averages them over the same hour of each day (a kind of moving average, but for each hour).
So the result for the month case will be like this:
Date TEMPERATURE (C)
2016-09 00:00:00 28.63
2016-09 01:00:00 27.53
2016-09 02:00:00 27.44
...
2016-10 00:00:00 28.61
... ...
I'm aware of the fact that I can split the df into 12 df's for each month and use:
hour = pd.to_timedelta(df['YYYY-MO-DD HH-MI-SS_SSS'].dt.hour, unit='H')
df2 = df.groupby(hour).mean()
But I'm searching for a better and faster way.
Thanks !!
Here's an alternate method of converting your date and time columns:
df['datetime'] = pd.to_datetime(df['YYYY-MO-DD'] + ' ' + df['HH-MI-SS_SSS'])
Additionally you could groupby both week and hour to form a MultiIndex dataframe (instead of creating and managing 12 dfs):
df.groupby([df.datetime.dt.weekofyear, df.datetime.dt.hour]).mean()
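For the monthly case described in the question, a similar sketch (assuming the datetime column created above) groups by month period and hour in one pass instead of splitting into 12 frames:
# MultiIndex result: (month, hour) -> mean temperature over that month
monthly_hourly = df.groupby([df.datetime.dt.to_period('M'),
                             df.datetime.dt.hour]).mean()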

Pandas groupby date range

I have a table where one of the columns is the date of occurrence (the dataframe is not indexed by date)
I want to group the table by date wherein all items which occurred prior to a certain date are grouped into one bucket. This would need to be cumulative, so later buckets will include all datapoints from earlier ones.
Here's the daterange object I need to group by:
date_rng = pd.date_range('28/02/2010', '31/08/2014', freq='3M')
Here's an example of a few datapoints in the table:
df_raw.head()
Ticker FY Periodicity Measure Val Date
0 BP9DL90 2009 ANN CPX 1000.00 2008-03-31 00:00:00
1 BP9DL90 2010 ANN CPX 600.00 2009-03-25 00:00:00
2 BP9DL90 2010 ANN CPX 600.00 2009-09-16 00:00:00
3 BP9DL90 2011 ANN CPX 570.00 2010-03-17 00:00:00
4 BP9DL90 2011 ANN GRM 57.09 2010-09-06 00:00:00
[5 rows x 6 columns]
Any input would be much appreciated.
Thanks
You could create a function that returns 1 if the date is in the date range you want, and then use this to group by:
# convert the date column to datetime type
df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d %H:%M:%S')

def is_in_range(x):
    # x is a single Timestamp taken from the 'Date' column
    if pd.Timestamp('2010-02-28') < x < pd.Timestamp('2014-08-31'):
        return 1
    else:
        return 0

df.groupby(df['Date'].map(is_in_range))
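The question asks for cumulative buckets against date_rng (every row that occurred on or before a cutoff lands in that cutoff's bucket). A possible sketch for that, assuming df_raw and date_rng from the question and summing Val purely as an example aggregation:
df_raw['Date'] = pd.to_datetime(df_raw['Date'])

# one cumulative bucket per cutoff date: everything that occurred on or before it
cumulative = pd.Series(
    {cutoff: df_raw.loc[df_raw['Date'] <= cutoff, 'Val'].sum() for cutoff in date_rng}
)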
