Pandas average row count per Day of the week - python

To get the sum/count of rows per weekday I do the following:
df['day'] = pandas.to_datetime(df['datetime']).dt.day_name()
print(df['day'].value_counts())
But how can I get the average number of rows per weekday if, for example, there are more Fridays in the data frame than Mondays?
Or, asked differently: how can I divide each count by the number of times that weekday has occurred?
To clarify:
If, for example, there have been 5 Mondays, Tuesdays, Wednesdays and Thursdays but only 4 Fridays, Saturdays and Sundays, I would like to divide the counts of Monday through Thursday by 5 and the counts of Friday through Sunday by 4.

Assuming you just want the number of days:
num_days = df['day'].value_counts()
If you want percentages of days in the dataset:
df['day'].value_counts(normalize=True)
Taking this a step further, it looks like you want the number of days in your dataset versus the number of possible days.
# Create a series counting each day in your dataframe
days_in_df = df['day'].value_counts()
# Create a dataframe covering all days in the date range
start = '01/01/2019'
end = '01/31/2019'
all_days_df = pd.DataFrame(data={'datetime': pd.date_range(start=start, end=end, freq='d')})
all_days_df['all_days'] = all_days_df['datetime'].dt.day_name()
# Use that for value counts
all_days_count = all_days_df['all_days'].value_counts()
# We now merge them
result = pd.concat([all_days_count, days_in_df], axis=1, sort=True)
# Finally we can get the ratio
result['day'] / result['all_days']
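Putting it together on hypothetical event data (the datetime column name and the January 2019 range are carried over from the snippets above; this is a sketch, not a definitive recipe):
import pandas as pd

# Hypothetical events, several rows per day
df = pd.DataFrame({'datetime': pd.to_datetime([
    '2019-01-01 09:00', '2019-01-01 17:30', '2019-01-04 08:15',
    '2019-01-11 12:00', '2019-01-18 10:45'])})
df['day'] = df['datetime'].dt.day_name()
# Rows observed per weekday
rows_per_weekday = df['day'].value_counts()
# How often each weekday occurs in the calendar range
calendar = pd.Series(pd.date_range('2019-01-01', '2019-01-31', freq='d'))
weekday_occurrences = calendar.dt.day_name().value_counts()
# Average rows per occurrence of each weekday; division aligns on the labels
print(rows_per_weekday / weekday_occurrences)
Dividing the two series directly aligns them on the weekday labels, which is the same idea as the pd.concat approach above (weekdays absent from the data come out as NaN).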

Related

Calculate 3 months unique Emp count for a given month from last 3 months data using pandas

I am looking to calculate the unique employee ID count over the last 3 months using pandas. I am able to calculate the unique employee ID count for the current month, but I am not sure how to do it for the last 3 months.
df['DateM'] = df['Date'].dt.to_period('M')
df.groupby("DateM")["EmpId"].nunique().reset_index().rename(columns={"EmpId":"One Month Unique EMP count"}).sort_values("DateM",ascending=False).reset_index(drop=True)
testdata.xlsx Google Drive link:
https://docs.google.com/spreadsheets/d/1Kaguf72YKIsY7rjYfctHop_OLIgOvIaS/edit?usp=sharing&ouid=117123134308310688832&rtpof=true&sd=true
After using the above groupby command I get output grouped by one-month periods based on the DateM column, which is correct.
Similarly, I'm looking for another column with the 3-month unique active user count based on EmpId.
Sample output:
I tried calculating the same using a rolling window, but it didn't help. I also tried creating a period for the last 3 months, and I searched for this before asking the question. Thanks for your help in advance; otherwise I'll have to calculate it manually.
I don't know if you are looking for 3 consecutive months or something else, because your dates are discontinuous between 2022-09 and 2022-10.
I also don't know your purpose, so I give a general solution here. In case you only want the unique count for every 3 consecutive months, it is much easier. The solution here gives you the list of unique empid values for every 3 consecutive months. Note that this means for 2022-08, I will count the 3 consecutive months 2022-08, 2022-09, and 2022-10, and so on.
# Sort data:
df.sort_values(by='datem', inplace=True, ignore_index=True)
# Create `dfu`, which is `df` reduced to unique `empid` per `datem`:
dfu = df.groupby(['datem', 'empid']).count().reset_index()
dfu.rename(columns={'date': 'count'}, inplace=True)
dfu.sort_values(by=['datem', 'empid'], inplace=True, ignore_index=True)
dfu
# Obtain the list of unique periods:
unique_period = dfu['datem'].unique()
# Create an empty dataframe:
dfe = pd.DataFrame(columns=['datem', 'empid', 'start_period'])
for p in unique_period:
    # Create a range of 3 consecutive months:
    tem_range = pd.period_range(start=p, freq='M', periods=3)
    # Extract the rows of `dfu` whose period falls in that range:
    tem_dfu = dfu.loc[dfu['datem'].isin(tem_range), :].copy()
    # Some cleaning (note: reassign, since `drop_duplicates` is not in place by default):
    tem_dfu = tem_dfu.drop_duplicates(subset='empid', keep='first')
    tem_dfu.drop(columns='count', inplace=True)
    tem_dfu['start_period'] = p
    # Concat to build the desired output:
    dfe = pd.concat([dfe, tem_dfu])
dfe
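If what you ultimately need is the count rather than the list, a one-line follow-up on dfe gives it (a sketch, assuming the column names used above):
# Unique empid count over each 3-month window, keyed by the window's first month
dfe.groupby('start_period')['empid'].nunique()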
Hope this is what you are looking for.

Loop Through Days of the Week in Pandas Dataframe

I have a Pandas DataFrame with a start column of dtype of datetime64[ns, UTC] and the DataFrame is sorted in ascending order based on the start column. From this DataFrame I used the following to create a new (updated) DataFrame indicating the day of the week for the start column
format_datetime_df['day_of_week'] = format_datetime_df['start'].dt.dayofweek
I want to pass the DataFrame into a function. The function needs to loop through the days of the week, so from 0 to 6, and keep a running total of the distance (kept in column 'distance') covered. If the distance covered is greater than 15, then a counter is incremented. It needs to do this for all rows of the DataFrame. The return of the function will be the total number of weeks over 15.
I am getting stuck on how to implement this, as my 'day_of_week' column starts as follows:
3
3
5
1
5
So, week 1 would be comprised of 3, 3, 5 and week 2 would be comprised of 1, 5, ...
I want to do something like
number_of_weeks_over_10km = format_datetime_df.groupby().apply(weeks_over_10km)
but am not really sure what should go in the groupby() function. I also feel like I am overcomplicating this.
It was complicated, but I figured it out. Here is the basic flow of what I did:
# Create a helper index that allows iteration by week while also considering the year
# Function to return the total distance for each week
# Create a NumPy array to store the total distance for each week
# Append the total distance for each week to the array
# Count the number of times the total distance for each week was > x (in km)
The helper index that allowed for iteration by week while also considering the year came from another post here on Stack Overflow (Iterate over pd df with date column by week python). This had a consequence though, in that I had to create and append the NumPy array outside of the function in order to get everything to work.
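For reference, here is a minimal sketch of that flow, using ISO year/week as the per-week key (the column names start and distance and the 15 km threshold come from the question; the helper itself is a reconstruction, not the original code):
import pandas as pd

def weeks_over_threshold(df, threshold_km=15):
    # Helper key that iterates by week while also considering the year
    iso = df['start'].dt.isocalendar()
    week_key = iso['year'].astype(str) + '-' + iso['week'].astype(str)
    # Total distance covered in each week
    weekly_distance = df.groupby(week_key)['distance'].sum()
    # Count the weeks whose total distance exceeded the threshold
    return (weekly_distance > threshold_km).sum()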
I guess you can solve that using Pandas without a custom function. Just determine the year and week using
df["isoweek"] = (df["start"].dt.isocalendar()["year"].astype(str)
+ " "
+ df["start"].dt.isocalendar()["week"].astype(str)
)
Then you determine the distance using a groupby and count the entries above 15:
weeks_above_15 = (df.groupby("isoweek")["distance"].sum() > 15).sum()
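A quick self-contained check of that approach on hypothetical data (the 15 km threshold is from the question):
import pandas as pd

df = pd.DataFrame({
    'start': pd.to_datetime(['2021-01-04', '2021-01-06', '2021-01-11'], utc=True),
    'distance': [10.0, 6.0, 5.0],
})
df["isoweek"] = (df["start"].dt.isocalendar()["year"].astype(str)
                 + " "
                 + df["start"].dt.isocalendar()["week"].astype(str))
weeks_above_15 = (df.groupby("isoweek")["distance"].sum() > 15).sum()
print(weeks_above_15)  # 1: only the week of Jan 4 totals more than 15 (16.0)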

How to get specific information from a csv file using pandas?

I want to extract specific information from [this csv file][1].
I need to make a list of days and give an overview.
You're looking for DataFrame.resample. Based on a specific column, it will group the rows of the dataframe by a specific time interval.
First you need to do this, if you haven't already:
data['Date/Time'] = pd.to_datetime(data['Date/Time'])
Get the lowest 5 days of visibility:
>>> data.resample(rule='D', on='Date/Time')['Visibility (km)'].mean().nsmallest(5)
Date/Time
2012-03-01 2.791667
2012-03-14 5.350000
2012-12-27 6.104167
2012-01-17 6.433333
2012-02-01 6.795833
Name: Visibility (km), dtype: float64
Basically what that does is this:
Groups all the rows by day
Converts each group to the average value of all the Visibility (km) items for that day
Returns the 5 smallest
Count the number of foggy days:
>>> data.resample(rule='D', on='Date/Time').apply(lambda x: x['Weather'].str.contains('Fog').any()).sum()
78
Basically what that does is this:
Groups all the rows by day
For each day, adds a True if any row inside that day contains 'Fog' in the Weather column, False otherwise
Counts how many True's there were, and thus the number of foggy days.
This will get you an array of all unique foggy days. You can then use its shape attribute (or len) to count them.
df[df["Weather"].apply(lambda x: "Fog" in x)]["Date/Time"].unique()
I need to make a list of days with the lowest visibility and give an overview of other time parameters for those days in tabular form.
Since your Date/Time column represents a particular hour, you'll need to do some grouping to get the minimum visibility for a particular day. The following will find the 5 least-visible days.
# Extract the date from the "Date/Time" column
>>> data["Date"] = pandas.to_datetime(data["Date/Time"]).dt.date
# Group on the new "Date" column and get the minimum values of
# each column for each group.
>>> min_by_day = data.groupby("Date").min()
# Now we can use nsmallest, since 1 row == 1 day in min_by_day.
# Since `nsmallest` returns a pandas.Series with "Date" as the index,
# we have to use `.index` to pull the date objects from the result.
>>> least_visible_days = min_by_day.nsmallest(5, "Visibility (km)").index
Then you can limit your original dataset to the least-visible days with
data[data["Date"].isin(least_visible_days)]
I also need the total number of foggy days.
We can use the extracted date in this case too:
# Extract the date from the "Date/Time" column
>>> data["Date"] = pandas.to_datetime(data["Date/Time"]).dt.date
# Filter on hours which have foggy weather
>>> foggy = data[data["Weather"].str.contains("Fog")]
# Count number of unique days
>>> len(foggy["Date"].unique())

Add hours column to regular list of minutes, group by it, and average the data in Python

I have looked for similar questions, but none seems to be addressing the following challenge. I have a pandas dataframe with a list of minutes and corresponding values, like the following:
minute  value
     0    454
     1    434
     2    254
The list is a year-long list, thus counting 60 minutes * 24 hours * 365 days = 525600 observations.
I would like to add a new column called hour, which expresses the hour of the day (assuming minutes 0-59 are 12 AM, 60-119 are 1 AM, and so forth until the following day, where the sequence restarts).
Then, once the hour column is added, I would like to group observations by it and calculate the average value for every hour of the year, and end up with a dataframe with 24 observations, each expressing the average value of the original data at each hour n.
Using integer division and the remainder (modulo) you can get the hour:
df['hour'] = df['minute'] // 60 % 24
If you want other date information it can be useful to use January 1st of some year (not a leap year) as the origin and convert to a datetime. Then you can grab a lot of the date attributes, in this case hour.
df['hour'] = pd.to_datetime(df['minute'], unit='m', origin='2017-01-01').dt.hour
Then for your averages you get the resulting 24 row Series with:
df.groupby('hour')['value'].mean()
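A quick sanity check of the arithmetic on hypothetical data:
import numpy as np
import pandas as pd

df = pd.DataFrame({'minute': np.arange(525600), 'value': np.random.rand(525600)})
df['hour'] = df['minute'] // 60 % 24
hourly_means = df.groupby('hour')['value'].mean()
print(hourly_means.shape)  # (24,) - one average per hour of the day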
Here's a way to do it:
# sample df
import numpy as np
import pandas as pd
df = pd.DataFrame({'minute': np.arange(525600), 'value': np.arange(525600)})
# convert the minute counter to a timedelta
df['minute'] = pd.to_timedelta(df['minute'], unit='m')
# calculate the mean for each clock hour of the year
df_new = df.groupby(pd.Grouper(key='minute', freq='1H'))['value'].mean().reset_index()
Although you don't need the hour column explicitly to calculate these values, if you want it you can extract it from the timedelta components:
df_new['hour'] = df_new['minute'].dt.components['hours']
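Note that the Grouper above yields one mean per clock hour of the year (8760 rows for a non-leap year), not 24. To reduce that to the 24 hour-of-day averages the question asks for, you would still group on the hour column:
df_new.groupby('hour')['value'].mean()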

Remove last n days from dataframe

I have a pandas dataframe with a datetime index (30-minute frequency), and I want to remove the last "n" days from it. My dataframe does not include weekends, so if its last day is a Monday and n is 3, I want to remove (from the end) that Monday plus the preceding Friday and Thursday. So I mean observed days, not calendar days. What is the most pythonic way to do it?
Thanks.
Pandas knows about Monday to Friday as business days.
So if you want to remove the last n business days from your dataframe, you can just do:
df.drop(df[df.index >= df.index.max().date()-pd.offsets.BDay(n-1)].index, inplace=True)
If you really need to remove observed days in the dataframe, it will be slightly more complex because you will have to count the days. The code could be (using a companion dataframe called df_days):
# create a companion dataframe with the same index and only one row per day:
df_days = pd.DataFrame(index=df.index).assign(day=df.index.date).drop_duplicates('day')
# now count the observed days in the companion dataframe
df_days['new_day'] = 1
df_days['days'] = df_days['new_day'].cumsum()
# compute the first index of the nth observed day from the end
ix = df_days.loc[df_days['days'] == df_days['days'].max() + 1 - n].index[0]
# drop the last n observed days from the initial dataframe; note that `drop`
# returns a copy, so reassign, and `>=` is needed to drop the nth day itself
df = df.drop(df.loc[df.index >= ix].index)
del df_days
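A shorter alternative sketch for the observed-days case, assuming df has a DatetimeIndex (this swaps the cumulative counting for numpy on the unique dates):
import numpy as np

# the last n distinct dates actually present in the index
last_n_dates = np.sort(df.index.normalize().unique())[-n:]
# keep only the rows outside those days
df = df[~df.index.normalize().isin(last_n_dates)]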
