Find days between two dates in Python, but only the number - python

I was trying to find the difference between a series of dates and a single date. For example, the series runs
from May 1 to June 1, which is
date = pd.DataFrame()
In [0]: date['test'] = pd.date_range("2021-05-01", "2021-06-01", freq = "D")
Out[0]: date
test
0 2021-05-01 00:00:00
1 2021-05-02 00:00:00
2 2021-05-03 00:00:00
3 2021-05-04 00:00:00
4 2021-05-05 00:00:00
5 2021-05-06 00:00:00
6 2021-05-07 00:00:00
7 2021-05-08 00:00:00
8 2021-05-09 00:00:00
9 2021-05-10 00:00:00
In[1]
date['test'] = date['test'].dt.date
Out[1]:
test
0 2021-05-01
1 2021-05-02
2 2021-05-03
3 2021-05-04
4 2021-05-05
5 2021-05-06
6 2021-05-07
7 2021-05-08
8 2021-05-09
9 2021-05-10
In[2]:date['base'] = dt.strptime("2021-05-01",'%Y-%m-%d')
Out[2]:
0 2021-05-01 00:00:00
1 2021-05-01 00:00:00
2 2021-05-01 00:00:00
3 2021-05-01 00:00:00
4 2021-05-01 00:00:00
5 2021-05-01 00:00:00
6 2021-05-01 00:00:00
7 2021-05-01 00:00:00
8 2021-05-01 00:00:00
9 2021-05-01 00:00:00
In[3]:date['base'] = date['base'].dt.date
Out[3]:
base
0 2021-05-01
1 2021-05-01
2 2021-05-01
3 2021-05-01
4 2021-05-01
5 2021-05-01
6 2021-05-01
7 2021-05-01
8 2021-05-01
9 2021-05-01
In[4]:date['test']-date['base']
Out[4]:
diff
0 0 days 00:00:00.000000000
1 1 days 00:00:00.000000000
2 2 days 00:00:00.000000000
3 3 days 00:00:00.000000000
4 4 days 00:00:00.000000000
5 5 days 00:00:00.000000000
6 6 days 00:00:00.000000000
7 7 days 00:00:00.000000000
8 8 days 00:00:00.000000000
9 9 days 00:00:00.000000000
10 10 days 00:00:00.000000000
This is the only thing I could get. I don't want anything other than the plain day numbers, because I need them for further numerical calculation, but I can't get rid of the rest. Also, how can I construct a time series that shows only the date, not the h:m:s after it? I don't want to call .dt.date manually on all of those columns, and it sometimes messes things up.

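You don't need to create a base column for this. Subtract the reference date directly and take .dt.days (this assumes date['test'] is still a datetime64 column, i.e. .dt.date has not been applied to it):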
>>> (date['test'] - pd.to_datetime("2021-05-01", format='%Y-%m-%d')).dt.days
0 0
1 1
2 2
3 3
4 4
...
27 27
28 28
29 29
30 30
31 31
Name: test, dtype: int64
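For reference, here is a self-contained sketch of the approach above. It assumes the column is kept as datetime64 throughout; the names df, base, and diff are illustrative and not from the original post.

```python
import pandas as pd

# Daily dates from 1 May to 1 June 2021, kept as datetime64 (no .dt.date).
df = pd.DataFrame({"test": pd.date_range("2021-05-01", "2021-06-01", freq="D")})

# Subtracting a single Timestamp broadcasts over the whole column
# and yields a timedelta Series.
base = pd.Timestamp("2021-05-01")
diff = df["test"] - base

# .dt.days extracts the whole-day component as plain integers.
df["days"] = diff.dt.days

print(df["days"].tolist()[:5])
print(df["days"].dtype)
```

A datetime64 column whose values all fall on midnight is normally displayed by pandas without the time part, so converting with .dt.date just for display is usually unnecessary.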

You can first convert the timestamps to epoch seconds (for the datetime64[ns] dtype they are stored internally as 64-bit integer counts of nanoseconds since the Unix epoch, so this is just a change of unit)
Using pandas datetime to unix timestamp seconds
import pandas as pd
# start df with date column
df = pd.DataFrame({"date": pd.date_range("2021-05-01", "2021-06-01", freq = "D")})
# create a column of epoch seconds
df["ts"] = (df["date"] - pd.Timestamp("1970-01-01")) // pd.Timedelta("1s")
>>> df
date ts
0 2021-05-01 1619827200
1 2021-05-02 1619913600
2 2021-05-03 1620000000
3 2021-05-04 1620086400
...
31 2021-06-01 1622505600
This will allow you to do integer math before converting back
>>> df["days"] = (df["ts"] - min(df["ts"])) // (60*60*24) # 1 day in seconds
>>> df
date ts days
0 2021-05-01 1619827200 0
1 2021-05-02 1619913600 1
2 2021-05-03 1620000000 2
3 2021-05-04 1620086400 3
...
31 2021-06-01 1622505600 31

Alternatively, with a naive day-based series, you can use the index as the day offset (as that's how the DataFrame was generated)!
>>> import pandas as pd
>>> df = pd.DataFrame({"date": pd.date_range("2021-05-01", "2021-06-01", freq = "D")})
>>> df["days"] = df.index
>>> df
date days
0 2021-05-01 0
1 2021-05-02 1
2 2021-05-03 2
3 2021-05-04 3
...
31 2021-06-01 31

Related

Stacking multiple dataframes together for different timestamp format into one timestamp

I have multiple data frames, each holding one day of per-minute data (1440 minutes). The data frames are all alike, with the same columns and the same length. The time column values are in hhmm format.
Let's say df_A has the data of the first day, 2021-05-06. It looks like this.
>df_A
timestamp col1 col2..... col80
0
1
2
.
.
.
2359
And the next day's data is in df_B which is also the same. The date is 2021-05-07
>df_B
timestamp col1 col2..... col80
0
1
2
.
.
.
2359
How could I stack these one under another into a single dataframe, while identifying each row with a column whose values are in a format like YYYYMMDD HHmm? It would look somewhat like this:
>df
timestamp col1 col2..... col80
20210506 0000
20210506 0001
.
.
20210506 2359
20210507 0000
.
.
20210507 2359
How could I achieve this while dealing with multiple data frames at once?
df_A = pd.DataFrame(range(0, 10), columns=['timestamp'])
df_B = pd.DataFrame(range(0, 10), columns=['timestamp'])
df_A['date'] = pd.to_datetime('2021-05-06 ' +
df_A['timestamp'].astype(str).str.zfill(4), format='%Y-%m-%d %H%M')
df_B['date'] = pd.to_datetime('2021-05-07 ' +
df_B['timestamp'].astype(str).str.zfill(4), format='%Y-%m-%d %H%M')
df_final = pd.concat([df_A, df_B])
df_final
timestamp date
0 0 2021-05-06 00:00:00
1 1 2021-05-06 00:01:00
2 2 2021-05-06 00:02:00
3 3 2021-05-06 00:03:00
4 4 2021-05-06 00:04:00
5 5 2021-05-06 00:05:00
6 6 2021-05-06 00:06:00
7 7 2021-05-06 00:07:00
8 8 2021-05-06 00:08:00
9 9 2021-05-06 00:09:00
0 0 2021-05-07 00:00:00
1 1 2021-05-07 00:01:00
2 2 2021-05-07 00:02:00
3 3 2021-05-07 00:03:00
4 4 2021-05-07 00:04:00
5 5 2021-05-07 00:05:00
6 6 2021-05-07 00:06:00
7 7 2021-05-07 00:07:00
8 8 2021-05-07 00:08:00
9 9 2021-05-07 00:09:00
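To handle many frames at once, the same idea can be wrapped in a loop. The sketch below assumes the frames are held in a dict keyed by their date string (the frames mapping and the label column are hypothetical names) and that timestamp holds hhmm integers as in the question.

```python
import pandas as pd

# Hypothetical input: one frame per day, keyed by that day's date.
frames = {
    "2021-05-06": pd.DataFrame({"timestamp": [0, 1, 59, 100, 2359]}),
    "2021-05-07": pd.DataFrame({"timestamp": [0, 1, 59, 100, 2359]}),
}

stacked = []
for day, frame in frames.items():
    frame = frame.copy()
    # Zero-pad hhmm values (e.g. 59 -> "0059") and prepend the day.
    hhmm = frame["timestamp"].astype(str).str.zfill(4)
    frame["date"] = pd.to_datetime(day + " " + hhmm, format="%Y-%m-%d %H%M")
    stacked.append(frame)

# ignore_index=True gives the result one continuous 0..n-1 index.
df_final = pd.concat(stacked, ignore_index=True)

# Optional: render the timestamp in the "YYYYMMDD HHMM" layout from the question.
df_final["label"] = df_final["date"].dt.strftime("%Y%m%d %H%M")
print(df_final)
```

Keeping the real datetime column alongside the formatted label means the stacked frame can still be sorted, resampled, or filtered by time.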

Pandas DateTime only partially showing in Matplotlib

Problem: I am trying to make a bar chart based on a very simple Pandas DataFrame that has a DateTime index and integers in one column. Data below. Only some of the data is showing up in Matplotlib, however.
Code:
fig, ax = plt.subplots(figsize=(16,6))
ax.bar(unique_opps_by_month.index, unique_opps_by_month['Opportunity Name'])
ax.xaxis.set_major_locator(mdates.MonthLocator((1,4,7,10)))
ax.xaxis.set_minor_locator(mdates.MonthLocator())
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%b'))
ax.set_xlim('2016-01', '2021-04')
fig.autofmt_xdate()
The output, however, doesn't match the data (the plot image is not reproduced here). There should be eight bars in 2017, and not all the same height. There are similar problems in the other years as well. Why are only some of the bars showing? How can I make Matplotlib show all the data?
Data
2016-02-01 1
2016-05-01 1
2016-08-01 1
2016-09-01 1
2017-01-01 1
2017-02-01 1
2017-03-01 1
2017-04-01 1
2017-07-01 3
2017-10-01 2
2017-11-01 3
2017-12-01 1
2018-02-01 2
2018-03-01 2
2018-04-01 2
2018-06-01 1
2018-07-01 1
2018-08-01 1
2018-11-01 1
2018-12-01 2
2019-03-01 5
2019-04-01 2
2019-05-01 1
2019-06-01 2
2019-07-01 1
2019-08-01 2
2019-09-01 4
2019-11-01 5
2020-01-01 4
2020-02-01 6
2020-03-01 1
2020-06-01 1
2020-07-01 2
2020-09-01 3
2020-10-01 5
2020-11-01 4
2020-12-01 6
2021-01-01 3
2021-02-01 6
2021-03-01 3

How can I parse a field in a DF into Month, Day, Year, Hour, and Weekday?

I have data that looks like this.
VendorID lpep_pickup_datetime lpep_dropoff_datetime store_and_fwd_flag
2 1/1/2018 0:18:50 1/1/2018 12:24:39 AM N
2 1/1/2018 0:30:26 1/1/2018 12:46:42 AM N
2 1/1/2018 0:07:25 1/1/2018 12:19:45 AM N
2 1/1/2018 0:32:40 1/1/2018 12:33:41 AM N
2 1/1/2018 0:32:40 1/1/2018 12:33:41 AM N
2 1/1/2018 0:38:35 1/1/2018 1:08:50 AM N
2 1/1/2018 0:18:41 1/1/2018 12:28:22 AM N
2 1/1/2018 0:38:02 1/1/2018 12:55:02 AM N
2 1/1/2018 0:05:02 1/1/2018 12:18:35 AM N
2 1/1/2018 0:35:23 1/1/2018 12:42:07 AM N
So, I converted df.lpep_pickup_datetime to datetime, but originally it comes in as a string. I'm not sure which one is easier to work with. I want to append 5 fields onto my current dataframe: year, month, day, weekday, and hour.
I tried this:
df['Year']=[d.split('-')[0] for d in df.lpep_pickup_datetime]
df['Month']=[d.split('-')[1] for d in df.lpep_pickup_datetime]
df['Day']=[d.split('-')[2] for d in df.lpep_pickup_datetime]
That gives me this error: AttributeError: 'Timestamp' object has no attribute 'split'
I tried this:
df2 = pd.DataFrame(df.lpep_pickup_datetime.dt.strftime('%m-%d-%Y-%H').str.split('/').tolist(),
columns=['Month', 'Day', 'Year', 'Hour'],dtype=int)
df = pd.concat((df,df2),axis=1)
That gives me this error: AssertionError: 4 columns passed, passed data had 1 columns
Basically, I want to parse df.lpep_pickup_datetime into year, month, day, weekday, and hour, appending each to the same dataframe. How can I do that?
Thanks!!
Here you go. First I create a sample dataset and copy the date column to the column name you use, so you can just copy the code. Pandas has extensive time-series functionality, so you don't actually need to import datetime:
import pandas as pd
date_rng = pd.date_range(start='1/1/2018', end='4/01/2018', freq='H')
df = pd.DataFrame(date_rng, columns=['date'])
df['lpep_pickup_datetime'] = df['date']
df['year'] = df['lpep_pickup_datetime'].dt.year
df['month'] = df['lpep_pickup_datetime'].dt.month
df['weekday'] = df['lpep_pickup_datetime'].dt.weekday
df['day'] = df['lpep_pickup_datetime'].dt.day
df['hour'] = df['lpep_pickup_datetime'].dt.hour
print(df)
Output:
date lpep_pickup_datetime year month weekday day hour
0 2018-01-01 00:00:00 2018-01-01 00:00:00 2018 1 0 1 0
1 2018-01-01 01:00:00 2018-01-01 01:00:00 2018 1 0 1 1
2 2018-01-01 02:00:00 2018-01-01 02:00:00 2018 1 0 1 2
3 2018-01-01 03:00:00 2018-01-01 03:00:00 2018 1 0 1 3
4 2018-01-01 04:00:00 2018-01-01 04:00:00 2018 1 0 1 4
... ... ... ... ... ... ... ...
2156 2018-03-31 20:00:00 2018-03-31 20:00:00 2018 3 5 31 20
2157 2018-03-31 21:00:00 2018-03-31 21:00:00 2018 3 5 31 21
2158 2018-03-31 22:00:00 2018-03-31 22:00:00 2018 3 5 31 22
2159 2018-03-31 23:00:00 2018-03-31 23:00:00 2018 3 5 31 23
2160 2018-04-01 00:00:00 2018-04-01 00:00:00 2018 4 6 1 0
EDIT: Since this is not working (as stated in the comments on this answer), I believe your data is formatted differently. Try this before applying anything:
df['lpep_pickup_datetime'] = pd.to_datetime(df['lpep_pickup_datetime'], format='%d/%m/%Y %H:%M:%S')
If this format is recognized properly, then you should have no trouble using dt.year, dt.month, dt.hour, dt.day, dt.weekday. (The sample shows four-digit years, hence %Y; swap %d and %m if your dates are month-first.)
Give this a go. Since your dates are in the datetime dtype already, just use the datetime properties to extract each part.
import pandas as pd
from datetime import datetime as dt
# Creating a fake dataset of dates.
dates = [dt.now().strftime('%d/%m/%Y %H:%M:%S') for i in range(10)]
df = pd.DataFrame({'lpep_pickup_datetime': dates})
df['lpep_pickup_datetime'] = pd.to_datetime(df['lpep_pickup_datetime'], format='%d/%m/%Y %H:%M:%S')
# Parse each date into its parts and store as a new column.
df['month'] = df['lpep_pickup_datetime'].dt.month
df['day'] = df['lpep_pickup_datetime'].dt.day
df['year'] = df['lpep_pickup_datetime'].dt.year
# ... and so on ...
Output:
lpep_pickup_datetime month day year
0 2019-09-24 16:46:10 9 24 2019
1 2019-09-24 16:46:10 9 24 2019
2 2019-09-24 16:46:10 9 24 2019
3 2019-09-24 16:46:10 9 24 2019
4 2019-09-24 16:46:10 9 24 2019
5 2019-09-24 16:46:10 9 24 2019
6 2019-09-24 16:46:10 9 24 2019
7 2019-09-24 16:46:10 9 24 2019
8 2019-09-24 16:46:10 9 24 2019
9 2019-09-24 16:46:10 9 24 2019
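Putting the pieces together for all five requested fields, here is a minimal sketch that starts from strings shaped like the question's sample rows. The month-first format string is an assumption (the sample only shows 1/1/2018, which is ambiguous), and pickup is an illustrative name.

```python
import pandas as pd

# Two pickup strings copied from the question's sample rows.
df = pd.DataFrame({"lpep_pickup_datetime": ["1/1/2018 0:18:50", "1/1/2018 0:30:26"]})

# Assumption: month comes before day, i.e. the layout is m/d/Y H:M:S.
pickup = pd.to_datetime(df["lpep_pickup_datetime"], format="%m/%d/%Y %H:%M:%S")

# Each .dt property returns an integer Series aligned with the frame.
df["year"] = pickup.dt.year
df["month"] = pickup.dt.month
df["day"] = pickup.dt.day
df["weekday"] = pickup.dt.weekday  # Monday=0 ... Sunday=6
df["hour"] = pickup.dt.hour
print(df)
```

Parsing into a separate Series first leaves the original string column untouched, which can help when checking whether the format was guessed correctly.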

Changing datetime column to integer number without loop

I have a pandas dataset like this:
user_id datetime
1 13 days 21:50:00
2 0 days 02:05:00
5 10 days 00:10:00
7 2 days 01:20:00
1 3 days 11:50:00
2 1 days 02:30:00
I want to have a column that contains the minutes, so in this case the result would be:
user_id datetime minutes
1 13 days 21:50:00 20030
2 0 days 02:05:00 125
5 10 days 00:10:00 14410
7 2 days 01:20:00 2960
1 3 days 11:50:00 5030
2 1 days 02:30:00 1590
Is there any way to do that without loop?
Yes, there is a special dt accessor for date/time series:
df['minutes'] = df['datetime'].dt.total_seconds() / 60
If you only want whole minutes, cast the result using .astype(int).
Here is a way with pd.Timedelta:
df['minutes'] = pd.to_timedelta(df.datetime) / pd.Timedelta(1, 'm')
>>> df
user_id datetime minutes
0 1 13 days 21:50:00 20030.0
1 2 0 days 02:05:00 125.0
2 5 10 days 00:10:00 14410.0
3 7 2 days 01:20:00 2960.0
4 1 3 days 11:50:00 5030.0
5 2 1 days 02:30:00 1590.0
If your datetime column is already of dtype timedelta, you can omit the explicit casting and just use:
df['minutes'] = df.datetime / pd.Timedelta(1, 'm')
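If integer minutes are wanted directly, floor division by a one-minute Timedelta is another option. This is a small sketch on a few of the question's rows; delta is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 2, 5],
    "datetime": ["13 days 21:50:00", "0 days 02:05:00", "10 days 00:10:00"],
})

# Parse the strings into timedelta values (harmless if already timedelta).
delta = pd.to_timedelta(df["datetime"])

# Floor-dividing by a one-minute Timedelta returns whole minutes as integers.
df["minutes"] = delta // pd.Timedelta(minutes=1)
print(df)
```

Unlike true division, this drops any leftover seconds instead of producing a fractional float.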

Pandas: How to create a datetime object from Week and Year?

I have a dataframe that provides two integer columns with the Year and Week of the year:
import pandas as pd
import numpy as np
L1 = [43,44,51,2,5,12]
L2 = [2016,2016,2016,2017,2017,2017]
df = pd.DataFrame({"Week":L1,"Year":L2})
df
Out[72]:
Week Year
0 43 2016
1 44 2016
2 51 2016
3 2 2017
4 5 2017
5 12 2017
I need to create a datetime-object from these two numbers.
I tried this, but it throws an error:
df["DT"] = df.apply(lambda x: np.datetime64(x.Year,'Y') + np.timedelta64(x.Week,'W'),axis=1)
Then I tried this. It runs but gives the wrong result: it ignores the week completely:
df["S"] = df.Week.astype(str)+'-'+df.Year.astype(str)
df["DT"] = df["S"].apply(lambda x: pd.to_datetime(x,format='%W-%Y'))
df
Out[74]:
Week Year S DT
0 43 2016 43-2016 2016-01-01
1 44 2016 44-2016 2016-01-01
2 51 2016 51-2016 2016-01-01
3 2 2017 2-2017 2017-01-01
4 5 2017 5-2017 2017-01-01
5 12 2017 12-2017 2017-01-01
I'm really getting lost between Python's datetime, Numpy's datetime64, and pandas Timestamp, can you tell me how it's done correctly?
I'm using Python 3, if that is relevant in any way.
EDIT:
Starting with Python 3.8 the problem is easily solved with a newly introduced method on datetime.date objects: https://docs.python.org/3/library/datetime.html#datetime.date.fromisocalendar
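A minimal sketch of that method applied to the dataframe from the question. It assumes ISO week numbering is what is wanted and picks Monday (day 1) as the representative day of each week.

```python
import datetime
import pandas as pd

df = pd.DataFrame({"Week": [43, 44, 51, 2, 5, 12],
                   "Year": [2016, 2016, 2016, 2017, 2017, 2017]})

# date.fromisocalendar(year, week, day) needs Python 3.8+; day=1 is Monday.
df["DT"] = pd.to_datetime([
    datetime.date.fromisocalendar(int(y), int(w), 1)
    for y, w in zip(df["Year"], df["Week"])
])
print(df)
```

The int() casts convert the NumPy integers coming out of the columns into plain Python ints before they reach the standard-library call.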
Try this:
In [19]: pd.to_datetime(df.Year.astype(str), format='%Y') + \
pd.to_timedelta(df.Week.mul(7).astype(str) + ' days')
Out[19]:
0 2016-10-28
1 2016-11-04
2 2016-12-23
3 2017-01-15
4 2017-02-05
5 2017-03-26
dtype: datetime64[ns]
"Initially I have timestamps in s"
It's much easier to parse it from UNIX epoch timestamp:
df['Date'] = pd.to_datetime(df['UNIX_Time'], unit='s')
Timing for 10M rows DF:
Setup:
In [26]: df = pd.DataFrame(pd.date_range('1970-01-01', freq='1min', periods=10**7), columns=['date'])
In [27]: df.shape
Out[27]: (10000000, 1)
In [28]: df['unix_ts'] = df['date'].astype(np.int64)//10**9
In [30]: df
Out[30]:
date unix_ts
0 1970-01-01 00:00:00 0
1 1970-01-01 00:01:00 60
2 1970-01-01 00:02:00 120
3 1970-01-01 00:03:00 180
4 1970-01-01 00:04:00 240
5 1970-01-01 00:05:00 300
6 1970-01-01 00:06:00 360
7 1970-01-01 00:07:00 420
8 1970-01-01 00:08:00 480
9 1970-01-01 00:09:00 540
... ... ...
9999990 1989-01-05 10:30:00 599999400
9999991 1989-01-05 10:31:00 599999460
9999992 1989-01-05 10:32:00 599999520
9999993 1989-01-05 10:33:00 599999580
9999994 1989-01-05 10:34:00 599999640
9999995 1989-01-05 10:35:00 599999700
9999996 1989-01-05 10:36:00 599999760
9999997 1989-01-05 10:37:00 599999820
9999998 1989-01-05 10:38:00 599999880
9999999 1989-01-05 10:39:00 599999940
[10000000 rows x 2 columns]
Check:
In [31]: pd.to_datetime(df.unix_ts, unit='s')
Out[31]:
0 1970-01-01 00:00:00
1 1970-01-01 00:01:00
2 1970-01-01 00:02:00
3 1970-01-01 00:03:00
4 1970-01-01 00:04:00
5 1970-01-01 00:05:00
6 1970-01-01 00:06:00
7 1970-01-01 00:07:00
8 1970-01-01 00:08:00
9 1970-01-01 00:09:00
...
9999990 1989-01-05 10:30:00
9999991 1989-01-05 10:31:00
9999992 1989-01-05 10:32:00
9999993 1989-01-05 10:33:00
9999994 1989-01-05 10:34:00
9999995 1989-01-05 10:35:00
9999996 1989-01-05 10:36:00
9999997 1989-01-05 10:37:00
9999998 1989-01-05 10:38:00
9999999 1989-01-05 10:39:00
Name: unix_ts, Length: 10000000, dtype: datetime64[ns]
Timing:
In [32]: %timeit pd.to_datetime(df.unix_ts, unit='s')
10 loops, best of 3: 156 ms per loop
Conclusion: I think 156 milliseconds for converting 10,000,000 rows is not that slow
As @Gianmario Spacagna mentioned, for years from 2018 onward use %V with %G:
L1 = [43,44,51,2,5,12,52,53,1,2,5,52]
L2 = [2016,2016,2016,2017,2017,2017,2018,2018,2019,2019,2019,2019]
df = pd.DataFrame({"Week":L1,"Year":L2})
df['new'] = pd.to_datetime(df.Week.astype(str)+
df.Year.astype(str).add('-1') ,format='%V%G-%u')
print (df)
Week Year new
0 43 2016 2016-10-24
1 44 2016 2016-10-31
2 51 2016 2016-12-19
3 2 2017 2017-01-09
4 5 2017 2017-01-30
5 12 2017 2017-03-20
6 52 2018 2018-12-24
7 53 2018 2018-12-31
8 1 2019 2018-12-31
9 2 2019 2019-01-07
10 5 2019 2019-01-28
11 52 2019 2019-12-23
There is something fishy going on with weeks starting from 2019. The ISO-8601 standard assigns the 31st December 2018 to the week 1 of year 2019. The other approaches based on:
pd.to_datetime(df.Week.astype(str)+
df.Year.astype(str).add('-2') ,format='%W%Y-%w')
will give shifted results starting from 2019.
In order to be compliant with the ISO-8601 standard you would have to do the following:
import pandas as pd
import datetime
L1 = [52,53,1,2,5,52]
L2 = [2018,2018,2019,2019,2019,2019]
df = pd.DataFrame({"Week":L1,"Year":L2})
df['ISO'] = df['Year'].astype(str) + '-W' + df['Week'].astype(str) + '-1'
df['DT'] = df['ISO'].map(lambda x: datetime.datetime.strptime(x, "%G-W%V-%u"))
print(df)
It prints:
Week Year ISO DT
0 52 2018 2018-W52-1 2018-12-24
1 53 2018 2018-W53-1 2018-12-31
2 1 2019 2019-W1-1 2018-12-31
3 2 2019 2019-W2-1 2019-01-07
4 5 2019 2019-W5-1 2019-01-28
5 52 2019 2019-W52-1 2019-12-23
The week 53 of 2018 is ignored and mapped to the week 1 of 2019.
Please verify yourself on https://www.epochconverter.com/weeks/2019.
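The year-boundary behaviour can also be checked locally with the standard library alone. This sketch only inspects the single date 2018-12-31; the variable names are illustrative.

```python
import datetime

monday = datetime.date(2018, 12, 31)

# isocalendar() returns (ISO year, ISO week, ISO weekday).
iso_year, iso_week, iso_weekday = monday.isocalendar()
print(iso_year, iso_week, iso_weekday)

# Round trip: parsing the ISO triple gives back the same calendar date.
parsed = datetime.datetime.strptime("2019-W01-1", "%G-W%V-%u").date()
print(parsed == monday)
```

Here the calendar year (2018) and the ISO year (2019) differ for the same date, which is why mixing %Y with ISO week numbers gives shifted results.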
If you want to follow the ISO week date definition:
Weeks start with Monday. Each week's year is the Gregorian year in
which the Thursday falls. The first week of the year, hence, always
contains 4 January. ISO week year numbering therefore slightly
deviates from the Gregorian for some days close to 1 January.
The following sample code generates a sequence of 60 dates, starting from Sunday 18 Dec 2016, and adds the appropriate columns.
It adds:
A "Date"
Week Day of the "Date"
Finds the Week Starting Monday of that "Date"
Finds the Year of the Week Starting Monday of that "Date"
Adds a Week Number (ISO)
Gets the Starting Monday Date, from Year and Week Number
Sample Code Below:
# Generate Some Dates
dft1 = pd.DataFrame(pd.date_range('2016-12-18', freq='D', periods=60))
dft1.columns = ['e_FullDate']
dft1['e_FullDateWeekDay'] = dft1.e_FullDate.dt.day_name().str.slice(0,3)
#Add a Week Start Date (Monday)
dft1['e_week_start'] = dft1['e_FullDate'] - pd.to_timedelta(dft1['e_FullDate'].dt.weekday,
unit='D')
dft1['e_week_startWeekDay'] = dft1.e_week_start.dt.day_name().str.slice(0,3)
#Add a Week Start Year
dft1['e_week_start_yr'] = dft1.e_week_start.dt.year
#Add a Week Number of Week Start Monday
dft1['e_week_no'] = dft1['e_week_start'].dt.isocalendar().week  # .dt.week was removed in pandas 2.0
#Add a Week Start generate from Week Number and Year
dft1['e_week_start_from_week_no'] = pd.to_datetime(dft1.e_week_no.astype(str)+
dft1.e_week_start_yr.astype(str).add('-1') ,format='%W%Y-%w')
dft1['e_week_start_from_week_noWeekDay'] = dft1.e_week_start_from_week_no.dt.day_name().str.slice(0,3)
with pd.option_context('display.max_rows', 999, 'display.max_columns', 0, 'display.max_colwidth', 9999):
display(dft1)
