Existing Dataframe :
Id Date_of_activity
A 2020-09-17 12:36:00
A 2020-11-02 00:00:00
A 2020-12-02 00:00:00
A 2021-01-02 00:00:00
A 2021-02-02 00:00:00
A 2021-03-03 12:12:00
A 2021-04-03 12:12:00
B 2020-11-02 00:00:00
B 2021-01-02 00:00:00
B 2021-03-03 12:12:00
B 2021-04-03 12:12:00
Expected Dataframe :
Id Missed_Month_Count
A 1
B 2
I want to calculate, for each Id, the number of missed months, i.e. months in which no activity was done.
For Id A, no activity was done in the 10th month of 2020, so the missed month count should be 1. Likewise, for B, no activity was done in the 12th month of 2020 or the 2nd month of 2021, so the missed month count is 2.
You can use:
# convert to Monthly period
s = pd.to_datetime(df['Date_of_activity']).dt.to_period('M')
# compute the difference per group
# if != 1, then there is a missing month
out = (s.sort_values()
        .groupby(df['Id'], sort=False)
        .apply(lambda g: g.drop_duplicates().diff().ne('M').sum()-1)
        .reset_index(name='Missed_Month_Count')
      )
output:
Id Missed_Month_Count
0 A 1
1 B 2
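As a cross-check, here is a minimal alternative sketch that avoids comparing period differences. It numbers each month as year * 12 + month and, per Id, subtracts the number of distinct active months from the length of the span between the first and last month. The sample frame is rebuilt from the question; names such as month_index are illustrative only.

```python
import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame({
    "Id": ["A"] * 7 + ["B"] * 4,
    "Date_of_activity": [
        "2020-09-17 12:36:00", "2020-11-02 00:00:00", "2020-12-02 00:00:00",
        "2021-01-02 00:00:00", "2021-02-02 00:00:00", "2021-03-03 12:12:00",
        "2021-04-03 12:12:00",
        "2020-11-02 00:00:00", "2021-01-02 00:00:00", "2021-03-03 12:12:00",
        "2021-04-03 12:12:00",
    ],
})

dates = pd.to_datetime(df["Date_of_activity"])

# One integer per calendar month, so consecutive months differ by exactly 1
month_index = dates.dt.year * 12 + dates.dt.month

grouped = month_index.groupby(df["Id"])

# Months spanned between first and last activity, minus months that had activity
missed = grouped.max() - grouped.min() + 1 - grouped.nunique()

out = missed.rename("Missed_Month_Count").reset_index()
print(out)
```

This only counts gaps between each Id's first and last active month; months before the first or after the last activity are not treated as missed.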
I have a dataset called weather and it contains one column 'Date' that looks like this.
Date
2020-01-01
2020-01-02
2020-02-01
2020-02-04
2020-03-01
2020-04-01
2020-04-02
2020-04-03
2020-04-04
2020-05-01
2020-06-01
2020-07-01
2020-08-01
2020-09-01
2020-10-01
2020-11-01
2020-01-01
2020-02-01
2020-04-01
2020-05-01
2020-06-01
2020-07-01
2020-08-01
2020-09-01
2020-10-01
2020-11-01
2020-12-01
2020-01-01
The problem is the year is always 2020 when it should be 2020, 2021, and 2022.
The desired column looks like this
Date
2020-01-01
2020-01-02
2020-02-01
2020-02-04
2020-03-01
2020-04-01
2020-04-02
2020-04-03
2020-04-04
2020-05-01
2020-06-01
2020-07-01
2020-08-01
2020-09-01
2020-10-01
2020-11-01
2021-01-01
2021-02-01
2021-04-01
2021-05-01
2021-06-01
2021-07-01
2021-08-01
2021-09-01
2021-10-01
2021-11-01
2021-12-01
2022-01-01
Each year's last month is not necessarily 12, but the new year starts with month 01.
Here is my code:
month = ['01','02','03','04','05','06','07','08','09','10','11','12']
for i in range(len(weather['Date'])):
    year = 2022
    for j in range(len(month)):
        if weather['Date'][i][5:7] == '01':
            weather['Date'][i] = weather['Date'][i].apply(lambda x: 'year' + x[5:])
Is there any suggestion for fixing my code and getting the desired column?
Here's one approach:
Turn the date strings in column Date into datetime using pd.to_datetime, then apply Series.diff and chain Series.dt.days to get the difference in days between consecutive rows.
Since each negative value (i.e. "day") in our Series will represent the start of a new year, let's apply Series.lt(0) to turn all values below 0 into True and the rest into False.
At this stage, we chain Series.cumsum to end up with a Series containing 0, ..., 1, ..., 2. These will be the values that need to be added to the year 2020 to achieve the correct years.
Now, finally, we can create the correct dates by passing (new_year = year + addition), month, day again to pd.to_datetime (cf. this SO answer).
df['Date'] = pd.to_datetime(df['Date'])
df['Date'] = pd.to_datetime(dict(year=(df['Date'].dt.year
                                       + df['Date'].diff().dt.days.lt(0).cumsum()),
                                 month=df['Date'].dt.month,
                                 day=df['Date'].dt.day))
df['Date']
0 2020-01-01
1 2020-01-02
2 2020-02-01
3 2020-02-04
4 2020-03-01
5 2020-04-01
6 2020-04-02
7 2020-04-03
8 2020-04-04
9 2020-05-01
10 2020-06-01
11 2020-07-01
12 2020-08-01
13 2020-09-01
14 2020-10-01
15 2020-11-01
16 2021-01-01
17 2021-02-01
18 2021-04-01
19 2021-05-01
20 2021-06-01
21 2021-07-01
22 2021-08-01
23 2021-09-01
24 2021-10-01
25 2021-11-01
26 2021-12-01
27 2022-01-01
Name: Date, dtype: datetime64[ns]
You don't need to convert to datetime, of course. You can also rebuild the date strings directly, starting from the following line:
df['Date'].str[5:7].astype(int).diff().lt(0).cumsum()
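A minimal sketch of how that line could be completed while staying with strings. It assumes every value has the form YYYY-MM-DD; the shortened sample data and the names year_offset and new_year are illustrative only.

```python
import pandas as pd

# Shortened sample: the month drops back to 01 twice, marking two new years
df = pd.DataFrame({"Date": ["2020-01-01", "2020-02-01", "2020-11-01",
                            "2020-01-01", "2020-12-01", "2020-01-01"]})

# A negative month-to-month difference marks the start of a new year;
# the running total gives the number of years to add to each row
year_offset = df["Date"].str[5:7].astype(int).diff().lt(0).cumsum()

# Add the offset to the original year and glue the "-MM-DD" part back on
new_year = df["Date"].str[:4].astype(int) + year_offset
df["Date"] = new_year.astype(str) + df["Date"].str[4:]

print(df["Date"].tolist())
```

Like the datetime version, this relies on the month number decreasing at every year boundary.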
Similar to @ouroboros1's answer, but using numpy to get the number of years to add to each date, and then pd.offsets.DateOffset(years=...) for the addition.
import numpy as np
import pandas as pd
df['Date'] = pd.to_datetime(df['Date'])
s = df['Date'].values
y = np.r_[0, (s[:-1] > s[1:]).cumsum()]
At this point, it would be tempting to do:
df['Date'] += y * pd.offsets.DateOffset(years=1)
But then we would get a warning: PerformanceWarning: Adding/subtracting object-dtype array to DatetimeArray not vectorized.
So instead, we group by number of years to add, and add the relevant offset to all the dates in the group.
def add_years(g):
    return g['Date'] + pd.offsets.DateOffset(years=g['y'].iloc[0])
df['Date'] = df.assign(y=y).groupby('y', sort=False, group_keys=False).apply(add_years)
This is reasonably fast (4.25 ms for 1000 rows and 10 distinct y values) and, for situations other than yours, a bit more general than @ouroboros1's answer:
It handles date changes due to leap years. This does not arise in your example, which contains no February 29, but if one of the dates were '2020-02-29' and we tried to add 1 year to it using the construct dt = df['Date'].dt; pd.to_datetime(dict(year=dt.year + y, month=dt.month, ...)), we would get ValueError: cannot assemble the datetimes: day is out of range for month.
It preserves any time of day and timezone information (again, not relevant in your case, but worth retaining in the general case).
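The two points above can be illustrated with a small sketch. The timestamp, its UTC timezone and the names leap_day, shifted and rebuild_failed are made up for the demonstration and do not come from the question's data.

```python
import pandas as pd

# Hypothetical leap-day timestamp with a time of day and a timezone
leap_day = pd.Timestamp("2020-02-29 08:30", tz="UTC")

# DateOffset falls back to the last valid day of February and keeps time and tz
shifted = leap_day + pd.DateOffset(years=1)
print(shifted)

# Rebuilding from components asks for 2021-02-29, which does not exist
try:
    pd.to_datetime(dict(year=[leap_day.year + 1],
                        month=[leap_day.month],
                        day=[leap_day.day]))
    rebuild_failed = False
except ValueError as exc:
    rebuild_failed = True
    print("rebuild failed:", exc)
```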
I have a dataset with that form :
>>> df
my_timestamp disease month
0 2016-01-01 15:00:00 2 jan
0 2016-01-01 11:00:00 1 jan
1 2016-01-02 15:00:00 3 jan
2 2016-01-03 15:00:00 4 jan
3 2016-01-04 15:00:00 2 jan
I want to count the number of occurrences of each disease value per month, then plot the count of every value by month.
df
values count
jan 2 3
jan 2 3
How can I plot it? I'd like a single plot with the month on the x-axis, one line for each value, and the count on the y-axis.
If you want to plot by month, you also need to group by year when the data spans multiple years. You can use dt.strftime inside .groupby to group by year and month.
Given the following slightly altered dataset to include more months:
my_timestamp disease month
2016-01-01 15:00:00 2 jan
2016-02-01 11:00:00 1 feb
2017-01-02 15:00:00 3 jan
2017-01-02 15:00:00 4 jan
2016-01-04 15:00:00 2 jan
You can run the following
df['my_timestamp'] = pd.to_datetime(df['my_timestamp'])
df.groupby(df['my_timestamp'].dt.strftime('%Y-%m'))['disease'].nunique().plot()
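Note that nunique gives the number of distinct disease values per month, not a count per value. If one line per value is wanted, as the question describes, a possible sketch is to group by both the year-month and the value and then unstack. The column label year_month and the variable counts are illustrative names, not part of the original answer.

```python
import pandas as pd

# The slightly altered dataset from above
df = pd.DataFrame({
    "my_timestamp": ["2016-01-01 15:00:00", "2016-02-01 11:00:00",
                     "2017-01-02 15:00:00", "2017-01-02 15:00:00",
                     "2016-01-04 15:00:00"],
    "disease": [2, 1, 3, 4, 2],
})
df["my_timestamp"] = pd.to_datetime(df["my_timestamp"])

# Year-month label used as the x-axis
year_month = df["my_timestamp"].dt.strftime("%Y-%m").rename("year_month")

# Rows: year-month, columns: disease value, cells: number of occurrences
counts = df.groupby([year_month, "disease"]).size().unstack(fill_value=0)
print(counts)
```

Calling counts.plot(marker="o") on the result (this needs matplotlib installed) should then draw one line per disease value, with the year-month on the x-axis and the count on the y-axis.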
What I did to get that data into a bar plot:
I created a month column, then:
for v in df.disease.unique():
    diseases = df_cut[df_cut['disease']==v].groupby('month_num')['disease'].count()
    x = diseases.index
    y = diseases.values
    plt.bar(x, y)
I have a dataframe that has four different columns and looks like the table below:
index_example | column_a | column_b | column_c | datetime_column
1 A 1,000 1 2020-01-01 11:00:00
2 A 2,000 2 2019-11-01 10:00:00
3 A 5,000 3 2019-12-01 08:00:00
4 B 1,000 4 2020-01-01 05:00:00
5 B 6,000 5 2019-01-01 01:00:00
6 B 7,000 6 2019-04-01 11:00:00
7 A 8,000 7 2019-11-30 07:00:00
8 B 500 8 2020-01-01 05:00:00
9 B 1,000 9 2020-01-01 03:00:00
10 B 2,000 10 2020-01-01 02:00:00
11 A 1,000 11 2019-05-02 01:00:00
Purpose:
For each row, get the different rolling statistics for column_b based on a window of time in the datetime_column defined as the last N months. The window of time to look at however, is filtered by the value in column_a.
Code example using a for loop which is not feasible given the size:
from datetime import timedelta

mean_dict = {}
for index, value in enumerate(df.datetime_column):
    test_date = value
    test_column_a = df.column_a.iloc[index]
    subset_df = df[(df.datetime_column < test_date) &
                   (df.datetime_column >= test_date - timedelta(days=180)) &
                   (df.column_a == test_column_a)]
    # average over the filtered subset, not the whole frame
    mean_dict[index] = subset_df.column_b.mean()
For example for row #1:
Target date = 2020-01-01 11:00:00
Target value in column_a = A
Date Range: from 2019-07-01 11:00:00 to 2020-01-01 11:00:00
Average would be the mean of rows 2,3,7
If I wanted average for row #2 then it would be:
Target date = 2019-11-01 10:00:00
Target value in column_a = A
Date Range: from 2019-05-01 10:00 to 2019-11-01 10:00:00
Average would be the mean of row 11
and so on...
I cannot use the grouper since in reality I do not have dates but datetimes.
Has anyone encountered this before?
Thanks!
EDIT
The dataframe is big ~2M rows which means that looping is not an option. I already tried looping and creating a subset based on conditional values but it takes too long.
I have a pandas column that contain timestamps that are unordered. When I sort them it works fine except for the values H:MM:SS.
d = ({
'A' : ['8:00:00','9:00:00','10:00:00','20:00:00','24:00:00','26:20:00'],
})
df = pd.DataFrame(data=d)
df = df.sort_values(by='A',ascending=True)
Out:
A
2 10:00:00
3 20:00:00
4 24:00:00
5 26:20:00
0 8:00:00
1 9:00:00
Ideally, I'd like to add a leading zero to the H:MM:SS strings. If I convert them all to timedelta, the times after midnight become 1 day plus n hours, e.g.
df['A'] = pd.to_timedelta(df['A'])
A
0 0 days 08:00:00
1 0 days 09:00:00
2 0 days 10:00:00
3 0 days 20:00:00
4 1 days 00:00:00
5 1 days 02:20:00
Intended Output:
A
0 08:00:00
1 09:00:00
2 10:00:00
3 20:00:00
4 24:00:00
5 26:20:00
If you only need to sort by the column as timedelta, you can convert the column to timedelta and use argsort on it to create the sorting order to sort the data frame:
df.iloc[pd.to_timedelta(df.A).argsort()]
# A
#0 8:00:00
#1 9:00:00
#2 10:00:00
#3 20:00:00
#4 24:00:00
#5 26:20:00
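If the zero-padded strings from the intended output are wanted as well, a minimal sketch is to pad with Series.str.zfill before sorting. It assumes every value is H:MM:SS or HH:MM:SS with at most two hour digits; the shuffled input order is made up for the example.

```python
import pandas as pd

# Same values as in the question, deliberately out of order
df = pd.DataFrame({"A": ["10:00:00", "26:20:00", "8:00:00",
                         "24:00:00", "9:00:00", "20:00:00"]})

# Left-pad to 8 characters so "8:00:00" becomes "08:00:00"
df["A"] = df["A"].str.zfill(8)

# With equal-width strings, plain lexicographic sorting matches time order
df = df.sort_values(by="A").reset_index(drop=True)
print(df)
```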
I have a dataframe of Ids and dates.
id date
1 2010-03-09 00:00:00
1 2010-05-28 00:00:00
1 2010-10-12 00:00:00
1 2010-12-10 00:00:00
1 2011-07-11 00:00:00
I'd like to reshape the dataframe so that I have one date in one column, and the next date adjacent in another column. See below
id date date2
1 2010-03-09 00:00:00 2010-05-28 00:00:00
1 2010-05-28 00:00:00 2010-10-12 00:00:00
1 2010-10-12 00:00:00 2010-12-10 00:00:00
1 2010-12-10 00:00:00 2011-07-11 00:00:00
How can I achieve this?
df['date2'] = df.date.shift(-1) # use shift function to shift index of the date
# column and assign it back to df as a new column
df.dropna() # the last row will be nan for date2, drop it if you
# don't need it
# id date date2
#0 1 2010-03-09 00:00:00 2010-05-28 00:00:00
#1 1 2010-05-28 00:00:00 2010-10-12 00:00:00
#2 1 2010-10-12 00:00:00 2010-12-10 00:00:00
#3 1 2010-12-10 00:00:00 2011-07-11 00:00:00
Looks like Psidom has a swaggy answer already ... but since I was already at it:
df_new = df.iloc[:-1].copy()  # copy so the assignment below does not trigger SettingWithCopyWarning
df_new['date2'] = df.date.values[1:]
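Both answers treat the frame as a single id. If the real data holds several ids, a hedged variation is to shift within each id so that the last date of one id is not paired with the first date of the next. The second id, its dates and the name pairs are invented for illustration.

```python
import pandas as pd

# Hypothetical frame with two ids; id 2 and its dates are invented
df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2],
    "date": pd.to_datetime(["2010-03-09", "2010-05-28", "2010-10-12",
                            "2011-01-05", "2011-07-11"]),
})

# Shift within each id, so the last row of every id gets NaT
df["date2"] = df.groupby("id")["date"].shift(-1)

# Drop the rows that have no following date
pairs = df.dropna(subset=["date2"]).reset_index(drop=True)
print(pairs)
```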