Plotting count of unique values in groupby - python

I have a dataset of this form:
>>> df
my_timestamp disease month
0 2016-01-01 15:00:00 2 jan
0 2016-01-01 11:00:00 1 jan
1 2016-01-02 15:00:00 3 jan
2 2016-01-03 15:00:00 4 jan
3 2016-01-04 15:00:00 2 jan
I want to count the number of occurrences of each value per month, then plot the count of every value by month.
df
values count
jan 2 3
jan 2 3
How can I plot it? I want a single plot with the month on the x axis, one line for each value, and the count on the y axis.

If your data spans multiple years, you need to group by year and month together; otherwise the same month from different years is merged into one group. You can use dt.strftime inside .groupby to group by year and month.
Given the following slightly altered dataset to include more months:
my_timestamp disease month
2016-01-01 15:00:00 2 jan
2016-02-01 11:00:00 1 feb
2017-01-02 15:00:00 3 jan
2017-01-02 15:00:00 4 jan
2016-01-04 15:00:00 2 jan
You can run the following
df['my_timestamp'] = pd.to_datetime(df['my_timestamp'])
df.groupby(df['my_timestamp'].dt.strftime('%Y-%m'))['disease'].nunique().plot()
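Note that nunique() gives the number of distinct diseases per month, not a count per disease. A minimal sketch of the per-value count the question asks for, assuming the altered dataset above (the names year_month and counts are illustrative only):

```python
import pandas as pd

# Sample data copied from the altered dataset above.
df = pd.DataFrame({
    "my_timestamp": [
        "2016-01-01 15:00:00",
        "2016-02-01 11:00:00",
        "2017-01-02 15:00:00",
        "2017-01-02 15:00:00",
        "2016-01-04 15:00:00",
    ],
    "disease": [2, 1, 3, 4, 2],
})
df["my_timestamp"] = pd.to_datetime(df["my_timestamp"])

# Group by year-month and disease, count rows, then pivot diseases into columns.
year_month = df["my_timestamp"].dt.strftime("%Y-%m").rename("year_month")
counts = df.groupby([year_month, "disease"]).size().unstack(fill_value=0)
print(counts)
```

With matplotlib installed, counts.plot() should then draw one line per disease, with the year-month on the x axis and the count on the y axis.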

Here is what I did to get that data into a bar plot.
I created a month column, then:
for v in df.disease.unique():
    # count the rows for this disease in each month
    diseases = df_cut[df_cut['disease']==v].groupby('month_num')['disease'].count()
    x = diseases.index
    y = diseases.values
    plt.bar(x, y)

Related

Missing month count from the datetime column in a pandas dataframe

Existing Dataframe :
Id Date_of_activity
A 2020-09-17 12:36:00
A 2020-11-02 00:00:00
A 2020-12-02 00:00:00
A 2021-01-02 00:00:00
A 2021-02-02 00:00:00
A 2021-03-03 12:12:00
A 2021-04-03 12:12:00
B 2020-11-02 00:00:00
B 2021-01-02 00:00:00
B 2021-03-03 12:12:00
B 2021-04-03 12:12:00
Expected Dataframe :
Id Missed_Month_Count
A 1
B 2
I am looking to calculate the number of missed months in which no activity was done.
For Id A, no activity was done in the 10th month of 2020, so the missed month count should be 1. Likewise for B, no activity was done in the 12th month of 2020 or the 2nd month of 2021, which makes the missed month count 2.
You can use:
# convert to Monthly period
s = pd.to_datetime(df['Date_of_activity']).dt.to_period('M')
# compute the difference per group
# if != 1, then there is a missing month
out = (s.sort_values()
        .groupby(df['Id'], sort=False)
        .apply(lambda g: g.drop_duplicates().diff().ne('M').sum()-1)
        .reset_index(name='Missed_Month_Count')
      )
output:
Id Missed_Month_Count
0 A 1
1 B 2
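As far as I can tell, the diff-based approach counts gaps between consecutive active months, so a gap spanning two or more months would be counted once. A hedged alternative sketch, assuming "missed" means every calendar month between an Id's first and last activity that has no row (the helper name missed_months is hypothetical):

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Id": ["A"] * 7 + ["B"] * 4,
    "Date_of_activity": [
        "2020-09-17 12:36:00", "2020-11-02 00:00:00", "2020-12-02 00:00:00",
        "2021-01-02 00:00:00", "2021-02-02 00:00:00", "2021-03-03 12:12:00",
        "2021-04-03 12:12:00",
        "2020-11-02 00:00:00", "2021-01-02 00:00:00", "2021-03-03 12:12:00",
        "2021-04-03 12:12:00",
    ],
})

# Convert each timestamp to a monthly period.
months = pd.to_datetime(df["Date_of_activity"]).dt.to_period("M")

def missed_months(g):
    # Hypothetical helper: months spanned (inclusive) minus months with activity.
    first, last = g.min(), g.max()
    span = (last.year - first.year) * 12 + (last.month - first.month) + 1
    return span - g.nunique()

out = (months.groupby(df["Id"])
             .apply(missed_months)
             .reset_index(name="Missed_Month_Count"))
print(out)
```

On the sample data this gives 1 for A and 2 for B, matching the expected dataframe.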

How do I classify or regroup a dataset based on time variation in python

I need to assign a number to each value according to the hour it falls in. How can I add a new column that groups each row by hour? For instance, all transactions within 00:00:00 to 00:59:59 should be filled with 1, transactions within 01:00:00 to 01:59:59 with 2, and so on up to 23:00:00 to 23:59:59, which should be filled with 24.
Time_duration = df['period']
print (Time_duration)
0 23:59:56
1 23:59:56
2 23:59:55
3 23:59:53
4 23:59:52
...
74187 00:00:18
74188 00:00:09
74189 00:00:08
74190 00:00:03
74191 00:00:02
This is the result I desire:
0 23:59:56 24
1 23:59:56 24
2 23:59:55 24
3 23:59:53 24
4 23:59:52 24
...
74187 00:00:18 1
74188 00:00:09 1
74189 00:00:08 1
74190 00:00:03 1
74191 00:00:02 1
df = df.sort_values(by=["period"])  # sort_values returns a new frame, so assign it
timeStamp_list = pd.to_datetime(list(df['period']))
df['Hour'] = timeStamp_list.hour + 1  # .hour is 0-23; add 1 to get 1-24
Try this code; it works for me.
You can use regular expressions and str.extract:
import pandas as pd
pattern= r'^(\d{1,2}):' #capture the digits of the hour
df['hour']=df['period'].str.extract(pattern).astype('int') + 1 # cast it as int so that you can add 1
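Another option, sketched here under the assumption that the period column holds HH:MM:SS strings as in the question, is to parse the values as timedeltas and read the hour component directly (the column name hour_group is illustrative only):

```python
import pandas as pd

# Small stand-in for the 'period' column from the question.
df = pd.DataFrame({"period": ["23:59:56", "01:15:00", "00:00:02"]})

# Parse as timedeltas, take the hour component (0-23) and shift it to 1-24.
df["hour_group"] = pd.to_timedelta(df["period"]).dt.components["hours"] + 1
print(df)
```

This avoids the regular expression and also copes with hours written without a leading zero.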

How can I parse a field in a DF into Month, Day, Year, Hour, and Weekday?

I have data that looks like this.
VendorID lpep_pickup_datetime lpep_dropoff_datetime store_and_fwd_flag
2 1/1/2018 0:18:50 1/1/2018 12:24:39 AM N
2 1/1/2018 0:30:26 1/1/2018 12:46:42 AM N
2 1/1/2018 0:07:25 1/1/2018 12:19:45 AM N
2 1/1/2018 0:32:40 1/1/2018 12:33:41 AM N
2 1/1/2018 0:32:40 1/1/2018 12:33:41 AM N
2 1/1/2018 0:38:35 1/1/2018 1:08:50 AM N
2 1/1/2018 0:18:41 1/1/2018 12:28:22 AM N
2 1/1/2018 0:38:02 1/1/2018 12:55:02 AM N
2 1/1/2018 0:05:02 1/1/2018 12:18:35 AM N
2 1/1/2018 0:35:23 1/1/2018 12:42:07 AM N
So, I converted df.lpep_pickup_datetime to datetime, but originally it comes in as a string. I'm not sure which one is easier to work with. I want to append 5 fields onto my current dataframe: year, month, day, weekday, and hour.
I tried this:
df['Year']=[d.split('-')[0] for d in df.lpep_pickup_datetime]
df['Month']=[d.split('-')[1] for d in df.lpep_pickup_datetime]
df['Day']=[d.split('-')[2] for d in df.lpep_pickup_datetime]
That gives me this error: AttributeError: 'Timestamp' object has no attribute 'split'
I tried this:
df2 = pd.DataFrame(df.lpep_pickup_datetime.dt.strftime('%m-%d-%Y-%H').str.split('/').tolist(),
columns=['Month', 'Day', 'Year', 'Hour'],dtype=int)
df = pd.concat((df,df2),axis=1)
That gives me this error: AssertionError: 4 columns passed, passed data had 1 columns
Basically, I want to parse df.lpep_pickup_datetime into year, month, day, weekday, and hour, appending each to the same dataframe. How can I do that?
Thanks!!
Here you go. First I create a random dataset and then copy the date column under the name you want, so you can just copy the code. Pandas has extensive support for time-series manipulation, so you don't actually need to import datetime. The pandas time series documentation has a lot more information about it.
import pandas as pd
date_rng = pd.date_range(start='1/1/2018', end='4/01/2018', freq='H')
df = pd.DataFrame(date_rng, columns=['date'])
df['lpep_pickup_datetime'] = df['date']
df['year'] = df['lpep_pickup_datetime'].dt.year
df['month'] = df['lpep_pickup_datetime'].dt.month
df['weekday'] = df['lpep_pickup_datetime'].dt.weekday
df['day'] = df['lpep_pickup_datetime'].dt.day
df['hour'] = df['lpep_pickup_datetime'].dt.hour
print(df)
Output:
date lpep_pickup_datetime year month weekday day hour
0 2018-01-01 00:00:00 2018-01-01 00:00:00 2018 1 0 1 0
1 2018-01-01 01:00:00 2018-01-01 01:00:00 2018 1 0 1 1
2 2018-01-01 02:00:00 2018-01-01 02:00:00 2018 1 0 1 2
3 2018-01-01 03:00:00 2018-01-01 03:00:00 2018 1 0 1 3
4 2018-01-01 04:00:00 2018-01-01 04:00:00 2018 1 0 1 4
... ... ... ... ... ... ... ...
2156 2018-03-31 20:00:00 2018-03-31 20:00:00 2018 3 5 31 20
2157 2018-03-31 21:00:00 2018-03-31 21:00:00 2018 3 5 31 21
2158 2018-03-31 22:00:00 2018-03-31 22:00:00 2018 3 5 31 22
2159 2018-03-31 23:00:00 2018-03-31 23:00:00 2018 3 5 31 23
2160 2018-04-01 00:00:00 2018-04-01 00:00:00 2018 4 6 1 0
EDIT: Since this is not working (as stated in the comments on this answer), I believe your data is not being parsed with the right format. Try this before applying anything:
df['lpep_pickup_datetime'] = pd.to_datetime(df['lpep_pickup_datetime'], format='%d/%m/%Y %H:%M:%S')
If this format is recognized properly (swap %d and %m if your dates are month-first), then you should have no trouble using dt.year, dt.month, dt.hour, dt.day and dt.weekday.
Give this a go. Since your dates are in the datetime dtype already, just use the datetime properties to extract each part.
import pandas as pd
from datetime import datetime as dt
# Creating a fake dataset of dates.
dates = [dt.now().strftime('%d/%m/%Y %H:%M:%S') for i in range(10)]
df = pd.DataFrame({'lpep_pickup_datetime': dates})
df['lpep_pickup_datetime'] = pd.to_datetime(df['lpep_pickup_datetime'])
# Parse each date into its parts and store as a new column.
df['month'] = df['lpep_pickup_datetime'].dt.month
df['day'] = df['lpep_pickup_datetime'].dt.day
df['year'] = df['lpep_pickup_datetime'].dt.year
# ... and so on ...
Output:
lpep_pickup_datetime month day year
0 2019-09-24 16:46:10 9 24 2019
1 2019-09-24 16:46:10 9 24 2019
2 2019-09-24 16:46:10 9 24 2019
3 2019-09-24 16:46:10 9 24 2019
4 2019-09-24 16:46:10 9 24 2019
5 2019-09-24 16:46:10 9 24 2019
6 2019-09-24 16:46:10 9 24 2019
7 2019-09-24 16:46:10 9 24 2019
8 2019-09-24 16:46:10 9 24 2019
9 2019-09-24 16:46:10 9 24 2019
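Putting the pieces together for strings shaped like those in the question, here is a minimal sketch that adds all five requested fields. It assumes the dates are month-first (the sample rows are all 1/1/2018, so the order cannot be confirmed from the question), and the second sample row is made up for illustration:

```python
import pandas as pd

# Two strings in the question's format; the second row is invented for the demo.
df = pd.DataFrame({"lpep_pickup_datetime": ["1/1/2018 0:18:50", "1/2/2018 13:30:26"]})

# Assumption: month comes first. Use '%d/%m/%Y %H:%M:%S' for day-first data.
ts = pd.to_datetime(df["lpep_pickup_datetime"], format="%m/%d/%Y %H:%M:%S")

# Append year, month, day, weekday (Monday=0) and hour to the same dataframe.
df = df.assign(
    year=ts.dt.year,
    month=ts.dt.month,
    day=ts.dt.day,
    weekday=ts.dt.weekday,
    hour=ts.dt.hour,
)
print(df)
```

Using assign with a parsed helper series keeps the original string column untouched, which can be handy if the raw value is still needed.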

Convert and order time in a pandas df

I am trying to order timestamps in a pandas df. The times begin around 08:00:00 am and finish around 3:00:00 am the next day. I'd like to add 24 hours to the times after midnight, so that the times run from 08:00:00 to 27:00:00. The problem is that the times aren't ordered.
Example:
import pandas as pd
d = ({
'time' : ['08:00:00 am','12:00:00 pm','16:00:00 pm','20:00:00 pm','2:00:00 am','13:00:00 pm','3:00:00 am'],
})
df = pd.DataFrame(data=d)
If I try to order the times via
df = pd.DataFrame(data=d)
df['time'] = pd.to_timedelta(df['time'])
df = df.sort_values(by='time',ascending=True)
Out:
time
4 02:00:00
6 03:00:00
0 08:00:00
1 12:00:00
5 13:00:00
2 16:00:00
3 20:00:00
Whereas I'm hoping the output is:
time
0 08:00:00
1 12:00:00
2 13:00:00
3 16:00:00
4 20:00:00
5 26:00:00
6 27:00:00
I'm not sure if this can be done though. Specifically, if I can differentiate between 8:00:00 am and the times after midnight (1am-3am).
Add a one-day offset to the times that fall after midnight but before the point where a new "day" is supposed to begin (pick a cutoff some time after 3 am and before 7 am), then sort the values:
cutoff, day = pd.to_timedelta(['3.5H', '24H'])
df.time.apply(lambda x: x if x > cutoff else x + day).sort_values().reset_index(drop=True)
# Out:
0 0 days 08:00:00
1 0 days 12:00:00
2 0 days 13:00:00
3 0 days 16:00:00
4 0 days 20:00:00
5 1 days 02:00:00
6 1 days 03:00:00
The last two values are numerically equal to 26 hours & 27 hours, just displayed differently.
If you need them in HH:MM:SS format, use string-formatting with the appropriate timedelta components
Ex:
x = df.time.apply(lambda x: x if x > cutoff else x + day).sort_values().reset_index(drop=True).dt.components
x.apply(lambda x: '{:02d}:{:02d}:{:02d}'.format(x.days*24+x.hours, x.minutes, x.seconds), axis=1)
#Out:
0 08:00:00
1 12:00:00
2 13:00:00
3 16:00:00
4 20:00:00
5 26:00:00
6 27:00:00
dtype: object
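The same idea as a self-contained sketch, assuming the times are plain HH:MM:SS strings without the am/pm suffix (as far as I know, pd.to_timedelta does not parse that suffix) and a 03:30 cutoff chosen only for illustration:

```python
import pandas as pd

# Times from the question, written as 24-hour strings without am/pm.
times = pd.Series(["08:00:00", "12:00:00", "16:00:00", "20:00:00",
                   "02:00:00", "13:00:00", "03:00:00"])
td = pd.to_timedelta(times)

# Assumed cutoff: anything at or before 03:30 belongs to the following day.
cutoff = pd.Timedelta(hours=3, minutes=30)
shifted = (td.where(td > cutoff, td + pd.Timedelta(days=1))
             .sort_values()
             .reset_index(drop=True))

# Format as HH:MM:SS with hours allowed to run past 24.
seconds = shifted.dt.total_seconds().astype(int).tolist()
labels = [f"{s // 3600:02d}:{s % 3600 // 60:02d}:{s % 60:02d}" for s in seconds]
print(labels)
```

Using where instead of apply keeps the shift vectorized, which should matter on larger frames.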

Create Datetime index after groupby

I would like to restore a datetime index after a groupby.
The question is how to create a DateTime index from a MultiIndex that has year, month, day and hour in separate levels.
Given a DataFrame as an example:
import pandas as pd
import numpy as np
index=pd.date_range('2011-1-1 00:00:00', '2011-1-31 23:50:00', freq='10min')
df=pd.DataFrame(np.random.randn(len(index),2).cumsum(axis=0),columns=['A','B'],index=index)
Then, get the mean over each hour using groupby:
day_h = df.groupby([lambda x: x.year, lambda x: x.month, lambda x: x.day,lambda x: x.hour]).mean()
This creates an Index, where year, month, day and hour are in separate columns.
A B
2011 1 1 0 0.209908 1.196164
2011 1 1 1 0.692531 0.518185
2011 1 1 2 1.674748 0.013136
2011 1 1 3 1.674748 0.013136
2011 1 1 4 1.674748 0.013136
2011 1 1 5 1.674748 0.013136
The desired output would be to have DateTime index:
A B
2011-1-1 00:00 0.209908 1.196164
2011-1-1 01:00 0.692531 0.518185
2011-1-1 03:00 1.674748 0.013136
2011-1-1 04:00 1.674748 0.013136
2011-1-1 05:00 1.674748 0.013136
In my files there are some missing rows, so I can't create a new index with 1h timestep.
Someone else on SO had a similar question, but their solution was to use resample. You can avoid resampling by mapping the tuples in the multi-index to create a new index. This will handle missing rows just fine.
import datetime
# build one datetime per (year, month, day, hour) tuple in the multi-index
day_h['new_index'] = day_h.index.map(lambda x: datetime.datetime(x[0], x[1], x[2], x[3]))
day_h.set_index('new_index')
Output:
A B
new_index
2011-01-01 00:00:00 -1.095114 1.995776
2011-01-01 01:00:00 -2.411459 4.508794
2011-01-01 02:00:00 -1.261747 4.953709
2011-01-01 03:00:00 -0.311934 5.454112
2011-01-01 04:00:00 2.095718 6.854375
2011-01-01 05:00:00 1.696756 3.518919
2011-01-01 06:00:00 0.623589 1.740478
2011-01-01 07:00:00 0.544426 0.916016
2011-01-01 08:00:00 2.331326 0.891177
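A possible alternative, sketched here rather than taken from the answer, is to turn the index levels into a frame and let pd.to_datetime assemble the timestamps from columns named year, month, day and hour. The data below is a small deterministic stand-in with one hour removed to mimic missing rows:

```python
import numpy as np
import pandas as pd

# Six hours of 10-minute data, with hour 2 dropped to simulate a gap.
index = pd.date_range("2011-01-01 00:00:00", "2011-01-01 05:50:00", freq="10min")
df = pd.DataFrame(np.arange(len(index) * 2, dtype=float).reshape(-1, 2),
                  columns=["A", "B"], index=index)
df = df[df.index.hour != 2]

# Hourly mean, grouped on the separate date parts as in the question.
day_h = df.groupby([df.index.year, df.index.month,
                    df.index.day, df.index.hour]).mean()

# Rebuild a DatetimeIndex from the four index levels.
parts = day_h.index.to_frame(index=False)
parts.columns = ["year", "month", "day", "hour"]
day_h.index = pd.DatetimeIndex(pd.to_datetime(parts))
print(day_h)
```

Because the new index is built from the groups that actually exist, the missing hour simply stays absent instead of being filled in.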
