Extract day of month as array from datetime column - python

I have loaded a pandas dataframe from a .csv file that contains a column having datetime values.
df = pd.read_csv('data.csv')
The name of the column having the datetime values is pickup_datetime. Here's what I get if I do df['pickup_datetime'].head():
0 2009-06-15 17:26:00+00:00
1 2010-01-05 16:52:00+00:00
2 2011-08-18 00:35:00+00:00
3 2012-04-21 04:30:00+00:00
4 2010-03-09 07:51:00+00:00
Name: pickup_datetime, dtype: datetime64[ns, UTC]
How do I convert this column into a numpy array containing only the day values of the datetime? For example: 15 from row 0 (2009-06-15 17:26:00+00:00), 05 from row 1 (2010-01-05 16:52:00+00:00), etc.

df['pickup_datetime'] = pd.to_datetime(df['pickup_datetime'], errors='coerce')
df['pickup_datetime'].dt.day.values
# array([15, 5, 18, 21, 9])
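
As a self-contained check of the approach above, here is a minimal sketch that rebuilds the five sample rows from the question as strings (the variable names raw, pickup and days are only for illustration) and extracts the day of month as a numpy array:

```python
import pandas as pd

# The five sample values from the question, as they might come out of a CSV.
raw = pd.Series(
    [
        "2009-06-15 17:26:00+00:00",
        "2010-01-05 16:52:00+00:00",
        "2011-08-18 00:35:00+00:00",
        "2012-04-21 04:30:00+00:00",
        "2010-03-09 07:51:00+00:00",
    ],
    name="pickup_datetime",
)

# Parse to timezone-aware timestamps; unparseable values would become NaT.
pickup = pd.to_datetime(raw, errors="coerce", utc=True)

# .dt.day gives the day of month; .to_numpy() turns the Series into an array.
days = pickup.dt.day.to_numpy()
print(days)
```

Note that if errors='coerce' actually produces NaT values, .dt.day returns floats with NaN for those rows, so you may want to drop or fill them before converting to integers.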

Just adding another variant, although coldspeed has already given a concise answer (as an X-mas and New Year bonus :-)):
>>> df
pickup_datetime
0 2009-06-15 17:26:00+00:00
1 2010-01-05 16:52:00+00:00
2 2011-08-18 00:35:00+00:00
3 2012-04-21 04:30:00+00:00
4 2010-03-09 07:51:00+00:00
Convert the strings to timestamps by inferring their format:
>>> df['pickup_datetime'] = pd.to_datetime(df['pickup_datetime'])
>>> df
pickup_datetime
0 2009-06-15 17:26:00
1 2010-01-05 16:52:00
2 2011-08-18 00:35:00
3 2012-04-21 04:30:00
4 2010-03-09 07:51:00
You can pick out only the day from pickup_datetime:
>>> df['pickup_datetime'].dt.day
0 15
1 5
2 18
3 21
4 9
Name: pickup_datetime, dtype: int64
You can pick out only the month from pickup_datetime:
>>> df['pickup_datetime'].dt.month
0 6
1 1
2 8
3 4
4 3
You can pick out only the year from pickup_datetime:
>>> df['pickup_datetime'].dt.year
0 2009
1 2010
2 2011
3 2012
4 2010

Related

How to calculate monthly changes in a time series using pandas dataframe

As I am new to Python, I am probably asking something that is basic for most of you. I have a df where 'Date' is the index, one column that holds the month of the Date, and one data column.
Mnth TSData
Date
2012-01-05 1 192.6257
2012-01-12 1 194.2714
2012-01-19 1 192.0086
2012-01-26 1 186.9729
2012-02-02 2 183.7700
2012-02-09 2 178.2343
2012-02-16 2 172.3429
2012-02-23 2 171.7800
2012-03-01 3 169.6300
2012-03-08 3 168.7386
2012-03-15 3 167.1700
2012-03-22 3 165.9543
2012-03-29 3 165.0771
2012-04-05 4 164.6371
2012-04-12 4 164.6500
2012-04-19 4 166.9171
2012-04-26 4 166.4514
2012-05-03 5 166.3657
2012-05-10 5 168.2543
2012-05-17 5 176.8271
2012-05-24 5 179.1971
2012-05-31 5 183.7120
2012-06-07 6 195.1286
I wish to calculate monthly changes in the data set that I can later use in a boxplot. So from the table above, the results I seek are:
Mnth Chng
1 -8,9 (183,77 - 192,66)
2 -14,14 (169,63 - 183,77)
3 -5 (164,63 - 169,63)
4 1,73 (166,36 - 164,63)
5 28,77 (195,13 - 166,36)
and so on...
any suggestions?
thanks :)
IIUC, starting from this as df:
Date Mnth TSData
0 2012-01-05 1 192.6257
1 2012-01-12 1 194.2714
2 2012-01-19 1 192.0086
3 2012-01-26 1 186.9729
4 2012-02-02 2 183.7700
...
20 2012-05-24 5 179.1971
21 2012-05-31 5 183.7120
22 2012-06-07 6 195.1286
you can use:
df.groupby('Mnth')['TSData'].first().diff().shift(-1)
# or
# -df.groupby('Mnth')['TSData'].first().diff(-1)
NB: the data must be sorted by date so that the first item of each group is the earliest date of that month (use df.sort_values(by=['Mnth', 'Date']) if needed).
output:
Mnth
1 -8.8557
2 -14.1400
3 -4.9929
4 1.7286
5 28.7629
6 NaN
Name: TSData, dtype: float64
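
One caveat with grouping on Mnth: if the data spans more than one year, January 2012 and January 2013 end up in the same group. A minimal sketch of a variant that groups by year-month period instead, assuming 'Date' is a regular datetime column (the shortened sample below and the names first_per_month and monthly_change are only for illustration):

```python
import pandas as pd

# A shortened version of the question's data: first and last row of each month.
df = pd.DataFrame(
    {
        "Date": pd.to_datetime(
            [
                "2012-01-05", "2012-01-26",
                "2012-02-02", "2012-02-23",
                "2012-03-01", "2012-03-29",
                "2012-04-05",
            ]
        ),
        "TSData": [192.6257, 186.9729, 183.7700, 171.7800,
                   169.6300, 165.0771, 164.6371],
    }
)

# Sort so that .first() really picks the earliest observation of each month.
df = df.sort_values("Date")

# Group by year-month period rather than by bare month number.
first_per_month = df.groupby(df["Date"].dt.to_period("M"))["TSData"].first()

# Change = first value of the next month minus first value of this month.
monthly_change = first_per_month.diff().shift(-1)
print(monthly_change)
```

The last period has no following month to compare against, so its change is NaN, just as in the output above.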
First, make sure we have a datetime index:
df.index = pd.to_datetime(df.index)
Then it's simply a matter of using resample:
df['TSData'].resample('M').first().diff().shift(freq='-1M')
Output:
Date
2011-12-31 NaN
2012-01-31 -8.8557
2012-02-29 -14.1400
2012-03-31 -4.9929
2012-04-30 1.7286
2012-05-31 28.7629
Name: TSData, dtype: float64

Count number of columns above a date

I have a pandas dataframe with several columns and I would like to know, for each row, the number of date columns that fall after 2016-12-31. Here is an example:
ID  Bill  Date 1      Date 2      Date 3      Date 4      Bill 2
4   6     2000-10-04  2000-11-05  1999-12-05  2001-05-04  8
6   8     2016-05-03  2017-08-09  2018-07-14  2015-09-12  17
12  14    2016-11-16  2017-05-04  2017-07-04  2018-07-04  35
And I would like to get this column
Count
0
2
3
Just create the mask and call sum on axis=1
date = pd.to_datetime('2016-12-31')
(df[['Date 1','Date 2','Date 3','Date 4']]>date).sum(1)
OUTPUT:
0 0
1 2
2 3
dtype: int64
If needed, call .to_frame('Count') to create a dataframe with a column named Count:
(df[['Date 1','Date 2','Date 3','Date 4']]>date).sum(1).to_frame('Count')
Count
0 0
1 2
2 3
Use df.filter to select the Date* columns, then .sum(axis=1):
(df.filter(like='Date') > '2016-12-31').sum(axis=1).to_frame(name='Count')
Result:
Count
0 0
1 2
2 3
You can do:
df['Count'] = (df.loc[:, [x for x in df.columns if 'Date' in x]] > '2016-12-31').sum(axis=1)
Output:
ID Bill Date 1 Date 2 Date 3 Date 4 Bill 2 Count
0 4 6 2000-10-04 2000-11-05 1999-12-05 2001-05-04 8 0
1 6 8 2016-05-03 2017-08-09 2018-07-14 2015-09-12 17 2
2 12 14 2016-11-16 2017-05-04 2017-07-04 2018-07-04 35 3
We select the columns with 'Date' in the name. This is better when we have lots of columns like these and don't want to list them one by one. Then we compare them with the lookup date and sum the True values.
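
For reference, here is a self-contained sketch of the mask-and-sum idea on the question's data. It assumes the date columns start out as strings, so they are converted with pd.to_datetime first; the names date_cols and cutoff are only for illustration:

```python
import pandas as pd

# The example table from the question, with dates stored as strings.
df = pd.DataFrame(
    {
        "ID": [4, 6, 12],
        "Bill": [6, 8, 14],
        "Date 1": ["2000-10-04", "2016-05-03", "2016-11-16"],
        "Date 2": ["2000-11-05", "2017-08-09", "2017-05-04"],
        "Date 3": ["1999-12-05", "2018-07-14", "2017-07-04"],
        "Date 4": ["2001-05-04", "2015-09-12", "2018-07-04"],
        "Bill 2": [8, 17, 35],
    }
)

# Select every column whose name contains 'Date' and parse it to datetime.
date_cols = df.filter(like="Date").columns
df[date_cols] = df[date_cols].apply(pd.to_datetime)

# Boolean mask of dates after the cutoff; summing along axis=1 counts per row.
cutoff = pd.Timestamp("2016-12-31")
df["Count"] = df[date_cols].gt(cutoff).sum(axis=1)
print(df[["ID", "Count"]])
```

Converting to real datetimes first avoids relying on string comparison, which only happens to work when every value is in ISO YYYY-MM-DD form.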

Find number of months between two dates in python

I have a date (as a 'YYYYMM' string) and a date column 'date1' in a pandas dataframe 'df', as below:
date = '202107'
df
date1
0 2021-07-01
1 2021-08-01
2 2021-09-01
3 2021-10-01
4 2021-11-01
5 2023-02-01
6 2023-03-01
I want to create a column 'months' in df where
months = (date1 + 1month) - date
My output dataframe should look like below:
df
date1 months
0 2021-07-01 1
1 2021-08-01 2
2 2021-09-01 3
3 2021-10-01 4
4 2021-11-01 5
5 2023-02-01 20
6 2023-03-01 21
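Here's a way to do it using pandas (note that the column is called date here, and that this only gives the right result when all dates fall in the same year as the reference date):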
date = '202107'
date = pd.to_datetime(date, format='%Y%m')
df['months'] = (df.date + pd.offsets.MonthBegin(1)).dt.month - date.month
print(df)
date months
0 2021-07-01 1
1 2021-08-01 2
2 2021-09-01 3
3 2021-10-01 4
4 2021-11-01 5
Given a date variable as follows
mydate = 202003
and a dataframe df containing a datetime column START_DATE, you can do:
mydate_to_use = pd.to_datetime(str(mydate), format='%Y%m')
df['months'] = (df['START_DATE'].dt.year - mydate_to_use.year) * 12 + (df['START_DATE'].dt.month - mydate_to_use.month)
IIUC
s=(df.date1-pd.to_datetime(date,format='%Y%m'))//np.timedelta64(1, 'M')+1
Out[118]:
0 1
1 2
2 3
3 4
4 5
Name: date1, dtype: int64
df['months']=s
Update (year/month arithmetic, which also handles dates in later years):
(df.date1.dt.year*12+df.date1.dt.month)-(pd.to_numeric(date)//100)*12-(pd.to_numeric(date)%100)+1
Out[379]:
0 1
1 2
2 3
3 4
4 5
5 20
6 21
Name: date1, dtype: int64
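
To tie the answers together, here is a minimal runnable sketch on the question's own data, including the two 2023 rows. It uses the year * 12 + month arithmetic and adds 1 to match the requested (date1 + 1 month) - date definition; the name ref is only for illustration:

```python
import pandas as pd

# The date1 column from the question.
df = pd.DataFrame(
    {
        "date1": pd.to_datetime(
            [
                "2021-07-01", "2021-08-01", "2021-09-01", "2021-10-01",
                "2021-11-01", "2023-02-01", "2023-03-01",
            ]
        )
    }
)

# Parse the 'YYYYMM' reference string into a Timestamp.
ref = pd.to_datetime("202107", format="%Y%m")

# Whole-month difference: 12 months per year of difference plus the month
# difference, then +1 because the question counts the starting month too.
df["months"] = (
    (df["date1"].dt.year - ref.year) * 12
    + (df["date1"].dt.month - ref.month)
    + 1
)
print(df)
```

Because it only looks at the year and month components, this ignores the day of month entirely, which is what the expected output in the question implies.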

Number of active IDs in each period

I have a dataframe that looks like this
ID | START | END
1 |2016-12-31|2017-02-30
2 |2017-01-30|2017-10-30
3 |2016-12-21|2018-12-30
I want to know the number of active IDs in each possible day. So basically count the number of overlapping time periods.
What I did to calculate this was creating a new data frame c_df with the columns date and count. The first column was populated using a range:
all_dates = pd.date_range(start=min(df['START']), end=max(df['END']))
Then for every line in my original data frame I calculated a different range for the start and end dates:
id_dates = pd.date_range(start=min(user['START']), end=max(user['END']))
I then used this range of dates to increment by one the corresponding count cell in c_df.
All these loops though are not very efficient for big data sets and look ugly. Is there a more efficient way of doing this?
If your dataframe is small enough so that performance is not a concern, create a date range for each row, then explode them and count how many times each date exists in the exploded series.
Requires pandas >= 0.25:
df.apply(lambda row: pd.date_range(row['START'], row['END']), axis=1) \
.explode() \
.value_counts() \
.sort_index()
If your dataframe is large, take advantage of numpy broadcasting to improve performance.
Works with any version of pandas:
dates = pd.date_range(df['START'].min(), df['END'].max()).values
start = df['START'].values[:, None]
end = df['END'].values[:, None]
mask = (start <= dates) & (dates <= end)
result = pd.DataFrame({
'Date': dates,
'Count': mask.sum(axis=0)
})
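
A self-contained sketch of the broadcasting approach, run on a small made-up sample (the four rows below are an assumption for illustration, not the asker's data), may make the shapes easier to follow:

```python
import pandas as pd

# Small illustrative sample with valid calendar dates.
df = pd.DataFrame(
    {
        "ID": [1, 2, 3, 4],
        "START": pd.to_datetime(
            ["2016-12-31", "2017-01-20", "2016-12-28", "2017-01-20"]
        ),
        "END": pd.to_datetime(
            ["2017-01-20", "2017-01-30", "2017-02-03", "2017-01-25"]
        ),
    }
)

# Every calendar day between the earliest start and the latest end.
dates = pd.date_range(df["START"].min(), df["END"].max())

# Column vectors of shape (n_rows, 1) broadcast against dates of shape (n_days,).
start = df["START"].to_numpy()[:, None]
end = df["END"].to_numpy()[:, None]

# mask[i, j] is True when day j lies inside the interval of row i (inclusive).
mask = (start <= dates.to_numpy()) & (dates.to_numpy() <= end)

# Summing over rows gives the number of active IDs per day.
counts = pd.Series(mask.sum(axis=0), index=dates, name="Count")
print(counts.head())
```

The mask has one cell per (row, day) pair, so memory grows with the number of rows times the number of days; for very large inputs that trade-off is worth keeping in mind.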
Create an IntervalIndex and use a generator expression (or list comprehension) with contains to check each date against each interval. (Note: I made a smaller sample to test this solution.)
Sample `df`
Out[56]:
ID START END
0 1 2016-12-31 2017-01-20
1 2 2017-01-20 2017-01-30
2 3 2016-12-28 2017-02-03
3 4 2017-01-20 2017-01-25
iix = pd.IntervalIndex.from_arrays(df.START, df.END, closed='both')
all_dates = pd.date_range(start=min(df['START']), end=max(df['END']))
df_final = pd.DataFrame({'dates': all_dates,
'date_counts': (iix.contains(dt).sum() for dt in all_dates)})
In [58]: df_final
Out[58]:
dates date_counts
0 2016-12-28 1
1 2016-12-29 1
2 2016-12-30 1
3 2016-12-31 2
4 2017-01-01 2
5 2017-01-02 2
6 2017-01-03 2
7 2017-01-04 2
8 2017-01-05 2
9 2017-01-06 2
10 2017-01-07 2
11 2017-01-08 2
12 2017-01-09 2
13 2017-01-10 2
14 2017-01-11 2
15 2017-01-12 2
16 2017-01-13 2
17 2017-01-14 2
18 2017-01-15 2
19 2017-01-16 2
20 2017-01-17 2
21 2017-01-18 2
22 2017-01-19 2
23 2017-01-20 4
24 2017-01-21 3
25 2017-01-22 3
26 2017-01-23 3
27 2017-01-24 3
28 2017-01-25 3
29 2017-01-26 2
30 2017-01-27 2
31 2017-01-28 2
32 2017-01-29 2
33 2017-01-30 2
34 2017-01-31 1
35 2017-02-01 1
36 2017-02-02 1
37 2017-02-03 1

Pandas: Convert a DataFrame into a Series when index is Year-Month and columns are Day

I have a dataframe that looks similar to the following:
df = pd.DataFrame({'Y_M':['201710','201711','201712'],'1':[1,5,9],'2':[2,6,10],'3':[3,7,11],'4':[4,8,12]})
df = df.set_index('Y_M')
Which creates a dataframe looking like this:
1 2 3 4
Y_M
201710 1 2 3 4
201711 5 6 7 8
201712 9 10 11 12
The columns are the day of the month. They stretch on to the right, going all the way up to 31. (February will have columns 29, 30, and 31 filled with NaN).
The index contains the year and the month (e.g. 201711 referring to Nov 2017)
My question is: How can I make this a single series, with the year/month/day combined? My output would be the following:
Y_M
20171001 1
20171002 2
20171003 3
20171004 4
20171101 5
20171102 6
20171103 7
20171104 8
20171201 9
20171202 10
20171203 11
20171204 12
The index can be converted to a datetime. In fact I think it would make it easier.
Use stack to get a Series, then build the new index by adding the day offsets (via to_timedelta) to the month starts (via to_datetime):
df = df.stack()
df.index = pd.to_datetime(df.index.get_level_values(0), format='%Y%m') + \
pd.to_timedelta(df.index.get_level_values(1).astype(int) - 1, unit='D')
print (df)
2017-10-01 1
2017-10-02 2
2017-10-03 3
2017-10-04 4
2017-11-01 5
2017-11-02 6
2017-11-03 7
2017-11-04 8
2017-12-01 9
2017-12-02 10
2017-12-03 11
2017-12-04 12
dtype: int64
print (df.index)
DatetimeIndex(['2017-10-01', '2017-10-02', '2017-10-03', '2017-10-04',
'2017-11-01', '2017-11-02', '2017-11-03', '2017-11-04',
'2017-12-01', '2017-12-02', '2017-12-03', '2017-12-04'],
dtype='datetime64[ns]', freq=None)
Finally, if you need strings in the index (rather than a DatetimeIndex), use DatetimeIndex.strftime:
df.index = df.index.strftime('%Y%m%d')
print (df)
20171001 1
20171002 2
20171003 3
20171004 4
20171101 5
20171102 6
20171103 7
20171104 8
20171201 9
20171202 10
20171203 11
20171204 12
dtype: int64
print (df.index)
Index(['20171001', '20171002', '20171003', '20171004', '20171101', '20171102',
'20171103', '20171104', '20171201', '20171202', '20171203', '20171204'],
dtype='object')
Without bringing dates into it:
s = df.stack()
s.index = s.index.map('{0[0]}{0[1]:>02s}'.format)
s
20171001 1
20171002 2
20171003 3
20171004 4
20171101 5
20171102 6
20171103 7
20171104 8
20171201 9
20171202 10
20171203 11
20171204 12
dtype: int64
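
The question mentions that day columns run up to 31, so some year-month/day combinations (such as February 30) are not real dates. As a different technique from the stack-based answers above, here is a hedged sketch using melt plus pd.to_datetime with errors='coerce', which turns impossible dates into NaT so they can be dropped explicitly. The names long and s are only for illustration:

```python
import pandas as pd

# The frame from the question: year-month index, day-of-month columns.
df = pd.DataFrame(
    {"1": [1, 5, 9], "2": [2, 6, 10], "3": [3, 7, 11], "4": [4, 8, 12]},
    index=pd.Index(["201710", "201711", "201712"], name="Y_M"),
)

# Wide to long: one row per (year-month, day) pair.
long = df.reset_index().melt(id_vars="Y_M", var_name="day", value_name="value")

# Build 'YYYYMMDD' strings and parse them; invalid dates become NaT.
long["date"] = pd.to_datetime(
    long["Y_M"] + long["day"].str.zfill(2), format="%Y%m%d", errors="coerce"
)

# Drop impossible dates and missing values, then index by date.
s = long.dropna(subset=["date", "value"]).set_index("date")["value"].sort_index()
print(s)
```

If the string form is needed afterwards, s.index.strftime('%Y%m%d') converts the DatetimeIndex back, as in the first answer.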
