Counting number of entries per month pandas - python

I have a df in format:
start end
0 2020-01-01 2020-01-01
1 2020-01-01 2020-01-01
2 2020-01-02 2020-01-02
...
57 2020-04-01 2020-04-01
58 2020-04-02 2020-04-02
And I want to count the number of entries in each month and place it in a new df i.e. the number of 'start' entries for Jan, Feb, etc, to give me:
Month Entries
2020-01 3
...
2020-04 2
I am currently trying something like this, but it's not what I need:
df.index = pd.to_datetime(df['start'],format='%Y-%m-%d')
df.groupby(pd.Grouper(freq='M'))
df['start'].value_counts()

Use GroupBy.count with the Series.dt accessor:
In [1282]: df
Out[1282]:
start end
0 2020-01-01 2020-01-01
1 2020-01-01 2020-01-01
2 2020-01-02 2020-01-02
57 2020-04-01 2020-04-01
58 2020-04-02 2020-04-02
# Do this only when your `start` and `end` columns are of object dtype. If they are already datetime, you can skip the next two statements
In [1284]: df.start = pd.to_datetime(df.start)
In [1285]: df.end = pd.to_datetime(df.end)
In [1296]: df1 = df.groupby([df.start.dt.year, df.start.dt.month]).count().rename_axis(['year', 'month'])['start'].reset_index(name='Entries')
In [1297]: df1
Out[1297]:
year month Entries
0 2020 1 3
1 2020 4 2
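If you want the Month labels in the 2020-01 form shown in the desired output, a hedged alternative is Series.dt.to_period, which bins dates directly by month (a sketch, assuming start is already datetime):
# to_period('M') turns each date into its month period; value_counts tallies them
df1 = (df['start'].dt.to_period('M')
                  .value_counts()
                  .sort_index()
                  .rename_axis('Month')
                  .reset_index(name='Entries'))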

Related

Create New DataFrame, assigning a count for each instance in a time frame

Below is a script for a simplified version of the df in question:
plan_dates=pd.DataFrame({'start_date':['2021-01-01','2021-01-02','2021-01-03','2021-01-04','2021-01-05'],
'end_date': ['2021-01-03','2021-01-04','2021-02-03','2021-03-04','2021-03-05']})
plan_dates
start_date end_date
0 2021-01-01 2021-01-03
1 2021-01-02 2021-01-04
2 2021-01-03 2021-02-03
3 2021-01-04 2021-03-04
4 2021-01-05 2021-03-05
I would like to create a new DataFrame which has 2 columns:
date
count of active plans (the count of cases where the date is within the start_date & end_date in each row of the plan_dates df)
INTENDED DF:
date count_active_plans
0 2021-01-01 1
1 2021-01-02 2
2 2021-01-03 3
3 2021-01-04 3
4 2021-01-05 3
Any help would be greatly appreciated.
First convert both columns to datetime and add one day to end_date, then repeat the index with Index.repeat using the difference in days, add counter values by GroupBy.cumcount with to_timedelta, and last count by Series.value_counts with some data cleaning and conversion to a DataFrame:
plan_dates['start_date'] = pd.to_datetime(plan_dates['start_date'])
plan_dates['end_date'] = pd.to_datetime(plan_dates['end_date']) + pd.Timedelta(1, unit='d')
s = plan_dates['end_date'].sub(plan_dates['start_date']).dt.days
df = plan_dates.loc[plan_dates.index.repeat(s)].copy()
counter = df.groupby(level=0).cumcount()
df1 = (df['start_date'].add(pd.to_timedelta(counter, unit='d'))
.value_counts()
.sort_index()
.rename_axis('date')
.reset_index(name='count_active_plans'))
print (df1)
date count_active_plans
0 2021-01-01 1
1 2021-01-02 2
2 2021-01-03 3
3 2021-01-04 3
4 2021-01-05 3
.. ... ...
59 2021-03-01 2
60 2021-03-02 2
61 2021-03-03 2
62 2021-03-04 2
63 2021-03-05 1
[64 rows x 2 columns]
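For reference, a shorter route to the same table is pd.date_range plus Series.explode (pandas >= 0.25); a sketch assuming both columns are datetime and end_date has not yet had the extra day added, since pd.date_range includes both endpoints:
# one DatetimeIndex per plan, then one row per active date
dates = pd.Series([pd.date_range(s, e)
                   for s, e in zip(plan_dates['start_date'], plan_dates['end_date'])])
df1 = (dates.explode()
            .value_counts()
            .sort_index()
            .rename_axis('date')
            .reset_index(name='count_active_plans'))
This trades the index-repeat arithmetic for per-row DatetimeIndex objects, so it is simpler to read but slower on large frames.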

How can I combine the dataframe if the dates are consecutive?

I am new to Python and pandas.
I have the following dataframe. I would like to combine the start and end dates if they are on consecutive days.
data = {"Project":["A","A","A",'A',"B","B"], "Start":[dt.datetime(2020,1,1),dt.datetime(2020,1,16),dt.datetime(2020,1,31),dt.datetime(2020,7,1),dt.datetime(2020,1,31),dt.datetime(2020,2,16)],"End":[dt.datetime(2020,1,15),dt.datetime(2020,1,30),dt.datetime(2020,2,15),dt.datetime(2020,7,15),dt.datetime(2020,2,15),dt.datetime(2020,2,20)]}
df = pd.DataFrame(data)
Project Start End
0 A 2020-01-01 2020-01-15
1 A 2020-01-16 2020-01-30
2 A 2020-01-31 2020-02-15
3 A 2020-07-01 2020-07-15
4 B 2020-01-31 2020-02-15
5 B 2020-02-16 2020-02-20
And my expected result:
Project Start End
0 A 2020-01-01 2020-02-15
1 A 2020-07-01 2020-07-15
2 B 2020-01-31 2020-02-20
If the day after one row's End is another row's Start, I would like to combine the two rows.
Is there any pandas function that can do this?
Thanks a lot!
Create a mask with groupby and shift, then assign the values directly and drop_duplicates:
mask = df.groupby("Project").apply(lambda d: (d["Start"].shift(-1)-d["End"]).dt.days<=1).reset_index(drop=True)
df.loc[mask, "End"]= df["End"].shift(-1)
print (df.drop_duplicates(subset=["Project","End"],keep="first"))
Project Start End
0 A 2020-01-01 2020-01-30
1 A 2020-01-16 2020-02-15
3 A 2020-07-01 2020-07-15
4 B 2020-01-31 2020-02-20
The mask above merges only adjacent pairs of rows. To merge longer chains of consecutive rows, one way is to create an array of dates in long form via a list comprehension and pd.date_range, then build a grouping mask with cumsum over the day gaps, and finally take the min/max date of each group:
s = [(i[0],x) for i in df.to_numpy() for x in pd.date_range(*i[1:])]
new = pd.DataFrame(index=pd.MultiIndex.from_tuples(s,names=["Project","Date"])).reset_index()
mask = new.groupby("Project")["Date"].diff().dt.days.gt(1).cumsum()
print (new.groupby(["Project", mask]).agg({"min","max"}))
Date
min max
Project Date
A 0 2020-01-01 2020-02-15
1 2020-07-01 2020-07-15
B 1 2020-01-31 2020-02-20
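If building the long-form date table is too heavy, a row-level sketch with shift and cumsum should give the same result, assuming rows are sorted and intervals within a project do not overlap:
df = df.sort_values(["Project", "Start"]).reset_index(drop=True)
prev_end = df.groupby("Project")["End"].shift()
# a new run starts where there is no previous row or the gap exceeds one day
new_run = prev_end.isna() | ((df["Start"] - prev_end).dt.days > 1)
out = (df.groupby(["Project", new_run.cumsum()])
         .agg({"Start": "min", "End": "max"})
         .reset_index(level="Project")
         .reset_index(drop=True))
On the sample data this yields the three expected rows (A 2020-01-01 to 2020-02-15, A 2020-07-01 to 2020-07-15, B 2020-01-31 to 2020-02-20).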

How can I join columns by DatetimeIndex, matching day, month and hour from data from different years?

I have a dataset with meteorological features for 2019, to which I want to join two columns of power consumption datasets for 2017, 2018. I want to match them by hour, day and month, but the data belongs to different years. How can I do that?
The meteo dataset is a similar 6-column dataframe with a DatetimeIndex belonging to 2019.
You can create from the index three additional columns that represent the hour, day and month and use them for a later join. DatetimeIndex has attributes for the different parts of the timestamp:
import pandas as pd
ind = pd.date_range(start='2020-01-01', end='2020-01-20', periods=10)
df = pd.DataFrame({'number' : range(10)}, index = ind)
df['hour'] = df.index.hour
df['day'] = df.index.day
df['month'] = df.index.month
print(df)
number hour day month
2020-01-01 00:00:00 0 0 1 1
2020-01-03 02:40:00 1 2 3 1
2020-01-05 05:20:00 2 5 5 1
2020-01-07 08:00:00 3 8 7 1
2020-01-09 10:40:00 4 10 9 1
2020-01-11 13:20:00 5 13 11 1
2020-01-13 16:00:00 6 16 13 1
2020-01-15 18:40:00 7 18 15 1
2020-01-17 21:20:00 8 21 17 1
2020-01-20 00:00:00 9 0 20 1
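The join itself is then a plain merge on the derived columns; a hedged sketch, where meteo is the 2019 frame, power is a 2017/2018 frame prepared the same way, and 'consumption' is a hypothetical column name:
# match rows whose month, day and hour coincide, regardless of year
merged = meteo.merge(
    power[['month', 'day', 'hour', 'consumption']],  # 'consumption' is hypothetical
    on=['month', 'day', 'hour'],
    how='left',
)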

Get List of Hours Between Two Datetime Variables [duplicate]

This question already has answers here:
Calculate Time Difference Between Two Pandas Columns in Hours and Minutes (4 answers)
I have a dataframe that looks like this:
Date Name Provider Task StartDateTime LastDateTime
2020-01-01 00:00:00 Bob PEM ED A 7a-4p 2020-01-01 07:00:00 2020-01-01 16:00:00
2020-01-02 00:00:00 Tom PEM ED C 10p-2a 2020-01-02 22:00:00 2020-01-03 02:00:00
I would like to list the hours between each person's StartDateTime and LastDateTime (datetime64[ns]) and then create an updated dataframe reflecting those lists. So for example, the updated dataframe would look like this:
Name Date Hour
Bob 2020-01-01 7
Bob 2020-01-01 8
Bob 2020-01-01 9
...
Tom 2020-01-02 22
Tom 2020-01-02 23
Tom 2020-01-03 0
Tom 2020-01-03 1
...
I honestly do not have a solid idea where to start. I've found some articles that may provide a foundation, but I'm not sure how to adapt the code below to my problem, since I want the counts based on each row's date and hour values.
from datetime import date, timedelta

def daterange(date1, date2):
    for n in range(int((date2 - date1).days) + 1):
        yield date1 + timedelta(n)

start_dt = date(2015, 12, 20)
end_dt = date(2016, 1, 11)
for dt in daterange(start_dt, end_dt):
    print(dt.strftime("%Y-%m-%d"))
https://www.w3resource.com/python-exercises/date-time-exercise/python-date-time-exercise-50.php
Let us create the range of datetimes, then use explode:
df['Date'] = [pd.date_range(x, y, freq='H') for x, y in zip(df.StartDateTime, df.LastDateTime)]
s=df[['Date','Name']].explode('Date').reset_index(drop=True)
s['Hour']=s.Date.dt.hour
s['Date']=s.Date.dt.date
s.head()
Date Name Hour
0 2020-01-01 Bob 7
1 2020-01-01 Bob 8
2 2020-01-01 Bob 9
3 2020-01-01 Bob 10
4 2020-01-01 Bob 11
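Note that pd.date_range includes both endpoints, so the final hour (e.g. Bob's 16:00) is counted too. If that hour should be excluded, a hedged tweak is to drop the right endpoint:
# pandas >= 1.4; on older versions use closed='left' instead of inclusive='left'
df['Date'] = [pd.date_range(x, y, freq='H', inclusive='left')
              for x, y in zip(df.StartDateTime, df.LastDateTime)]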

Comparing Two DataFrames and Loop through them (to test a condition)

I am trying to 'join' two DataFrames based on a condition.
Condition
if df1.Year == df2.Year &
df1.Date >= df2.BeginDate or df1.Date <= df2.EndDate &
df1.ID == df2.ID
#if the condition is True, I would love to add an extra column (binary) to df1, something like
#df1.condition = Yes or No.
My data looks like this:
df1:
Year Week ID Date
2020 1 123 2020-01-01 00:00:00
2020 1 345 2020-01-01 00:00:00
2020 2 123 2020-01-07 00:00:00
2020 1 123 2020-01-01 00:00:00
df2:
Year BeginDate EndDate ID
2020 2020-01-01 00:00:00 2020-01-02 00:00:00 123
2020 2020-01-01 00:00:00 2020-01-02 00:00:00 123
2020 2020-01-01 00:00:00 2020-01-02 00:00:00 978
2020 2020-09-21 00:00:00 2020-01-02 00:00:00 978
end_df: #Expected output
Year Week ID Condition
2020 1 123 True #Year is matching, week1 is between the dates, ID is matching too
2019 1 345 False #Year is not matching
2020 2 187 False # ID is not matching
2020 1 123 True # Same as first row.
I thought to solve this by looping over two DataFrames:
for row in df1.iterrows():
    for row2 in df2.iterrows():
        if row['Year'] == row2['Year']:
            if row['ID'] == row2['ID']:
                .....
                .....
                row['Condition'] = True
            else:
                row['Condition'] = False
However... this is leading to error after error.
Really looking forward how you guys will tackle this problem. Many thanks in advance!
UPDATE 1
I created a loop. However, this loop is taking ages (and I am not sure how to add the value to a new column).
Note: in df1 I created a 'Date' column (in the same format as the BeginDate and EndDate from df2).
The key question now: how can I add the True value (at the end of the loop) to df1 as an extra column?
for index, row in df1.iterrows():
    row['Year'] = str(row['Year'])
    for index1, row1 in df2.iterrows():
        row1['Year'] = str(row1['Year'])
        if row['Year'] == row1['Year']:
            row['ID'] = str(row['ID'])
            row1['ID'] = str(row1['ID'])
            if row['ID'] == row1['ID']:
                if row['Date'] >= row1['BeginDate'] and row['Date'] <= row1['EndDate']:
                    print("I would like to add this YES to df1 in an extra column")
Edit 2
Trying @davidbilla's solution: it looks like the 'condition' column is not doing well. As you can see, it matches even when df1.Year != df2.Year. Note that df2 is sorted based on ID (so all the same unique numbers should be there).
I guess you are expecting something like this, if you are trying to match the dataframes row-wise (i.e. compare row 1 of df1 with row 1 of df2):
import numpy as np

# elementwise | is needed here; Python's `or` raises ValueError on Series
df1['condition'] = np.where((df1['Year']==df2['Year']) & (df1['ID']==df2['ID']) & ((df1['Date']>=df2['BeginDate']) | (df1['Date']<=df2['EndDate'])), True, False)
np.where takes the condition as the first parameter; the second parameter is the value if the condition passes, and the third parameter is the value if it fails.
EDIT 1:
Based on your sample dataset
df1 = pd.DataFrame([[2020,1,123],[2020,1,345],[2020,2,123],[2020,1,123]],
columns=['Year','Week','ID'])
df2 = pd.DataFrame([[2020,'2020-01-01 00:00:00','2020-01-02 00:00:00',123],
[2020,'2020-01-01 00:00:00','2020-01-02 00:00:00',123],
[2020,'2020-01-01 00:00:00','2020-01-02 00:00:00',978],
[2020,'2020-09-21 00:00:00','2020-01-02 00:00:00',978]],
columns=['Year','BeginDate','EndDate','ID'])
df2['BeginDate'] = pd.to_datetime(df2['BeginDate'])
df2['EndDate'] = pd.to_datetime(df2['EndDate'])
df1['condition'] = np.where((df1['Year']==df2['Year'])&(df1['ID']==df2['ID']),True, False)
# &((df1['Date']>=df2['BeginDate'])or(df1['Date']<=df2['EndDate'])) - removed this condition as the df has no Date field
print(df1)
Output:
Year Week ID condition
0 2020 1 123 True
1 2020 1 345 False
2 2020 2 123 False
3 2020 1 123 False
EDIT 2: To compare one row in df1 with all rows in df2
df1['condition'] = (df1['Year'].isin(df2['Year']))&(df1['ID'].isin(df2['ID']))
This takes df1['Year'] and compares it against all values of df2['Year'].
Based on the sample dataset:
df1:
Year Date ID
0 2020 2020-01-01 123
1 2020 2020-01-01 345
2 2020 2020-10-01 123
3 2020 2020-11-13 123
df2:
Year BeginDate EndDate ID
0 2020 2020-01-01 2020-02-01 123
1 2020 2020-01-01 2020-01-02 123
2 2020 2020-03-01 2020-05-01 978
3 2020 2020-09-21 2020-10-01 978
Code change:
date_range = list(zip(df2['BeginDate'], df2['EndDate']))

def check_date(date):
    for (s, e) in date_range:
        if date >= s and date <= e:
            return True
    return False
df1['condition'] = (df1['Year'].isin(df2['Year']))&(df1['ID'].isin(df2['ID']))
df1['date_compare'] = df1['Date'].apply(lambda x: check_date(x)) # you can directly store this in df1['condition']. I just wanted to print the values so have used a new field
df1['condition'] = (df1['condition']==True)&(df1['date_compare']==True)
Output:
Year Date ID condition date_compare
0 2020 2020-01-01 123 True True # Year match, ID match and Date is within the range of df2 row 1
1 2020 2020-01-01 345 False True # Year match, ID no match
2 2020 2020-10-01 123 True True # Year match, ID match, Date is within range of df2 row 4
3 2020 2020-11-13 123 False False # Year match, ID match, but Date is not in range of any row in df2
EDIT 3:
Based on the updated question (earlier I thought it was OK if the three values year, ID and date matched df2 across any rows, not on the same row). I think I have a better understanding of your requirement now.
df2['BeginDate'] = pd.to_datetime(df2['BeginDate'])
df2['EndDate'] = pd.to_datetime(df2['EndDate'])
df1['Date'] = pd.to_datetime(df1['Date'])
df1['condition'] = False
for idx1, row1 in df1.iterrows():
    match = False
    for idx2, row2 in df2.iterrows():
        if (row1['Year'] == row2['Year']) & \
           (row1['ID'] == row2['ID']) & \
           (row1['Date'] >= row2['BeginDate']) & \
           (row1['Date'] <= row2['EndDate']):
            match = True
    # .at writes back to df1 itself; assigning to `row1` would only mutate a copy
    df1.at[idx1, 'condition'] = match
Output - Set 1:
DF1:
Year Date ID
0 2020 2020-01-01 123
1 2020 2020-01-01 123
2 2020 2020-01-01 345
3 2020 2020-01-10 123
4 2020 2020-11-13 123
DF2:
Year BeginDate EndDate ID
0 2020 2020-01-15 2020-02-01 123
1 2020 2020-01-01 2020-01-02 123
2 2020 2020-03-01 2020-05-01 978
3 2020 2020-09-21 2020-10-01 978
DF1 result:
Year Date ID condition
0 2020 2020-01-01 123 True
1 2020 2020-01-01 123 True
2 2020 2020-01-01 345 False
3 2020 2020-01-10 123 False
4 2020 2020-11-13 123 False
Output - Set 2:
DF1:
Year Date ID
0 2019 2019-01-01 s904112
1 2019 2019-01-01 s911243
2 2019 2019-01-01 s917131
3 2019 2019-01-01 sp986214
4 2019 2019-01-01 s510006
5 2020 2020-01-10 s540006
DF2:
Year BeginDate EndDate ID
0 2020 2020-01-27 2020-09-02 s904112
1 2020 2020-01-27 2020-09-02 s904112
2 2020 2020-01-03 2020-03-15 s904112
3 2020 2020-04-15 2020-01-05 s904112
4 2020 2020-01-05 2020-05-15 s540006
5 2019 2019-01-05 2019-05-15 s904112
DF1 Result:
Year Date ID condition
0 2019 2019-01-01 s904112 False
1 2019 2019-01-01 s911243 False
2 2019 2019-01-01 s917131 False
3 2019 2019-01-01 sp986214 False
4 2019 2019-01-01 s510006 False
5 2020 2020-01-10 s540006 True
The 2nd row of the desired output has Year as 2019, so I assume the 2nd row of df1.Year is also 2019 instead of 2020
If I understand correctly, you need to merge and then filter out Date values outside the BeginDate and EndDate range. First, there are duplicates and invalid date ranges in df2. We need to drop the duplicates and invalid ranges before the merge. Invalid date ranges are ranges where BeginDate >= EndDate, which is index 3 of df2.
#convert all date columns of both `df1` and `df2` to datetime dtype
df1['Date'] = pd.to_datetime(df1['Date'])
df2[['BeginDate', 'EndDate']] = df2[['BeginDate', 'EndDate']].apply(pd.to_datetime)
#left-merge on `Year`, `ID` and using `eval` to compute
#columns `Condition` where `Date` is between `BeginDate` and `EndDate`.
#Finally assign back to `df1`
df1['Condition'] = (df1.merge(df2.loc[df2.BeginDate < df2.EndDate].drop_duplicates(),
on=['Year','ID'], how='left')
.eval('Condition= BeginDate <= Date <= EndDate')['Condition'])
Out[614]:
Year Week ID Date Condition
0 2020 1 123 2020-01-01 True
1 2019 1 345 2020-01-01 False
2 2020 2 123 2020-01-07 False
3 2020 1 123 2020-01-01 True
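For readers less comfortable with eval strings, an equivalent sketch uses Series.between on the merged frame, under the same assumption that at most one df2 row remains per (Year, ID) after deduplication:
m = df1.merge(df2.loc[df2.BeginDate < df2.EndDate].drop_duplicates(),
              on=['Year', 'ID'], how='left')
# between is inclusive on both ends; comparisons against NaT from unmatched rows are False
df1['Condition'] = m['Date'].between(m['BeginDate'], m['EndDate'])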
