Pandas to datetime - python

I have a date that is formatted like this:
01-19-71
Here 71 means 1971, but whenever to_datetime is used it converts it to 2071! How can I solve this problem? I am told that this would need regex, but I can't imagine how, since there are many cases in this data.
my current code:
re_1 = r"\d{1,2}[/-]\d{1,2}[/-]\d{2,4}"
re_2 = r"(?:\d{1,2} )?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*[ \-\.,]+(?:\d{1,2}[\w]*[ \-,]+)?[1|2]\d{3}"
re_3 = r"(?:\d{1,2}/)?[1|2]\d{3}"
# Correct misspellings
df = df.str.replace("Janaury", "January")
df = df.str.replace("Decemeber", "December")
# Extract dates
regex = "((%s)|(%s)|(%s))"%(re_1, re_2, re_3)
dates = df.str.extract(regex)
# Sort the Series
dates = pd.Series(pd.to_datetime(dates.iloc[:,0]))
dates.sort_values(ascending=True, inplace=True)

Considering that one has a string as follows
date = '01-19-71'
In order to convert to datetime object where 71 is converted to 1971 and not 2071, one can use datetime.strptime as follows
import datetime as dt
date = dt.datetime.strptime(date, '%m-%d-%y')
[Out]:
1971-01-19 00:00:00
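The same format string can also be applied to a whole pandas Series. The sketch below assumes every value uses the mm-dd-yy layout; the sample values and the cutoff date are made up for illustration. %y follows the POSIX convention (69-99 become 1969-1999, 00-68 become 2000-2068), so any date that still lands on or after the cutoff is shifted back by a century.

```python
import pandas as pd

# Hypothetical sample values in mm-dd-yy layout
s = pd.Series(['01-19-71', '12-31-68', '07-04-05'])

# An explicit format avoids guessing; %y maps 69-99 to 19xx and 00-68 to 20xx
parsed = pd.to_datetime(s, format='%m-%d-%y')

# Assumed cutoff: no date in this data should fall on or after it
cutoff = pd.Timestamp('2024-01-01')

# Keep dates before the cutoff, move the rest back by 100 years
fixed = parsed.where(parsed < cutoff, parsed - pd.DateOffset(years=100))
print(fixed)
```

Whether a cutoff like this is appropriate depends on the data; it only works if the real dates span less than a century.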

Related

What is the correct format code for this type of date?

I have a numpy array (called dates) of dates (as strings) which I thought were in the form %Y-%m-%d %H:%M:%S. However, I get an error showing that some dates look like 2021-05-11T00:00:00.0000000. I am not sure where the additional 'T' comes from or why the time is so precise.
I am trying to get rid of the time and only have the date.
My code is here:
dates = dataset.iloc[:,0].to_numpy()
newDates = []
for i in range(0,len(dates)):
    newDates.append(datetime.strptime(dates[i], '%Y-%m-%dT%H:%M:%S.%f'))
    newDates[i] = newDates[i].strftime('%Y-%m-%d')
dates = newDates
I get an error saying "ValueError: unconverted data remains: 0".
If I instead write
newDates.append(datetime.strptime(dates[i], '%Y-%m-%dT%H:%M:%S%f'))
I get an error "ValueError: unconverted data remains: .0000000".
In which format should the date be given?
If you have datetimes in a DataFrame, you can use pd.to_datetime and Series.dt.strftime to convert them to the desired format. pandas does it all for you, so there is no need to convert the values to a numpy array first.
import pandas as pd
# example df
df = pd.DataFrame({'datetime': ['2021-05-11T00:00:00.0000000' ,
'2021-05-20T00:00:00.0000000' ,
'2021-06-24T00:00:00.0000000']})
df['datetime'] = pd.to_datetime(df['datetime']).dt.strftime('%Y-%m-%d')
print(df)
datetime
0 2021-05-11
1 2021-05-20
2 2021-06-24
Does this help? https://strftime.org/
The extra T can be seen after %Y-%m-%d
If you just want to get the date, just split the string like this.
date = date.split('T')[0]
This first splits the date string into two parts,
['2021-05-11', '00:00:00.0000000']
and keeping only index 0 of that list leaves you with
date = '2021-05-11'
dates = dataset.iloc[:,0].to_numpy()
newDates = []
for i in dates:
    newDates.append(i.split('T')[0])
dates = newDates
assuming dates is a list
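The ValueError in the question arises because %f accepts at most six fractional digits, while these strings carry seven. If strptime is still preferred over splitting, one option is to truncate each string to six fractional digits first. A minimal sketch, assuming every value has exactly the layout shown in the question (the sample list is made up):

```python
from datetime import datetime

# Sample values in the layout from the question (seven fractional digits)
dates = ['2021-05-11T00:00:00.0000000', '2021-05-20T00:00:00.0000000']

new_dates = []
for value in dates:
    # Keep 'YYYY-MM-DDTHH:MM:SS.' plus six fractional digits (26 characters)
    parsed = datetime.strptime(value[:26], '%Y-%m-%dT%H:%M:%S.%f')
    new_dates.append(parsed.strftime('%Y-%m-%d'))

print(new_dates)
```

The slice length of 26 is tied to this exact layout, so strings with a different number of digits would need a different approach.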

python pandas converting UTC integer to datetime

I am calling some financial data from an API which stores the time values as (I think) UTC timestamps, for example 1645804609719.
I cannot seem to convert the entire column into a usable date. I can do it for a single value using the following code, so I know this works, but I have thousands of rows with this problem and thought pandas would offer an easier way to update all the values.
from datetime import datetime
tx = int('1645804609719')/1000
print(datetime.utcfromtimestamp(tx).strftime('%Y-%m-%d %H:%M:%S'))
Any help would be greatly appreciated.
Simply use pandas.Series.apply on the column:
df['date'] = df.date.apply(lambda x: datetime.utcfromtimestamp(int(x)/1000).strftime('%Y-%m-%d %H:%M:%S'))
Another way to do it is by using pd.to_datetime as recommended by Panagiotos in the comments:
df['date'] = pd.to_datetime(df['date'],unit='ms')
You can use "to_numeric" to convert the column to integers, "div" to divide it by 1000, and finally a loop over the column with datetime to get the format you want.
import pandas as pd
from datetime import datetime
df = pd.DataFrame({'date': ['1584199972000', '1645804609719'], 'values': [30,40]})
# Convert the strings to numbers, then milliseconds to seconds
seconds = pd.to_numeric(df['date']).div(1000)
# Format each timestamp as a UTC date string
df['date'] = [datetime.utcfromtimestamp(s).strftime('%Y-%m-%d %H:%M:%S') for s in seconds]
print(df)
Output:
date values
0 2020-03-14 15:32:52 30
1 2022-02-25 15:56:49 40
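pd.to_datetime with unit='ms' expects numbers, so if the API returns the timestamps as strings (an assumption here, mirroring the sample frame above), they can be converted with pd.to_numeric first. The sketch below stays fully vectorized; the date_str column name is made up for illustration.

```python
import pandas as pd

# Sample frame with millisecond timestamps stored as strings
df = pd.DataFrame({'date': ['1584199972000', '1645804609719'],
                   'values': [30, 40]})

# Strings -> integers -> datetimes (milliseconds since the Unix epoch)
df['date'] = pd.to_datetime(pd.to_numeric(df['date']), unit='ms')

# Optional: a formatted string column; the datetime column is kept for arithmetic
df['date_str'] = df['date'].dt.strftime('%Y-%m-%d %H:%M:%S')
print(df)
```

Passing utc=True to pd.to_datetime would additionally mark the result as timezone-aware UTC.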

Subtract 2 datetime lists dd/mm/YYYY in pandas

So, basically, I have these 2 df columns with date content. The initial content is in the dd/mm/YYYY format, and I want to subtract them. I can't subtract strings, so I converted them to datetime, but when I do that the format for some reason changes to YYYY-dd-mm (day and month swapped), so when I subtract them I get a wrong result. For example:
Initial Content:
a: 05/09/2021
b: 30/09/2021
result expected: 25 days.
Converted to DateTime:
a: 2021-05-09
b: 2021-09-30 (For some reason this date stays the same)
result: 144 days.
I'm using pandas and datetime for this project.
So, I wanted to know how I can subtract these 2 columns and get the proper result.
--- Answer
When I used
pd.to_datetime(date, format="%d/%m/%Y")
It worked. Thank you all for your time. This is my first project in pandas. :)
df = pd.DataFrame({'Date1': ['05/09/2021'], 'Date2': ['30/09/2021']})
df = df.apply(lambda x:pd.to_datetime(x,format=r'%d/%m/%Y')).assign(Delta=lambda x: (x.Date2-x.Date1).dt.days)
print(df)
Date1 Date2 Delta
0 2021-09-05 2021-09-30 25
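To see why the explicit format matters, the sketch below parses the ambiguous string from the question both ways. Without a format, pandas reads 05/09/2021 month-first, while 30/09/2021 can only be day-first, which is why the two values ended up inconsistent. The variable names are made up for illustration.

```python
import pandas as pd

a_text, b_text = '05/09/2021', '30/09/2021'

# Without a format, '05/09/2021' is read month-first as 9 May 2021
naive_a = pd.to_datetime(a_text)

# With an explicit day-first format, both strings are read as intended
a = pd.to_datetime(a_text, format='%d/%m/%Y')
b = pd.to_datetime(b_text, format='%d/%m/%Y')

print((b - naive_a).days)  # difference using the misread date
print((b - a).days)        # difference using the correctly parsed date
```

Passing dayfirst=True is another option, though an explicit format is stricter because it fails loudly on values that do not match.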
I just answered a similar query here subtracting dates in python
from datetime import datetime
date_format_str = '%Y-%m-%d %H:%M:%S.%f'
date_1 = '2016-09-24 17:42:27.839496'
date_2 = '2017-01-18 10:24:08.629327'
start = datetime.strptime(date_1, date_format_str)
end = datetime.strptime(date_2, date_format_str)
# Get the interval between the two timestamps as a timedelta object
diff = end - start
# Express the interval in hours
diff_in_hours = diff.total_seconds() / 3600
print(diff_in_hours)
# Get the difference between the two calendar dates as a timedelta object
diff = end.date() - start.date()
print(diff.days)
With pandas:
import pandas as pd
date_1 = '2016-09-24 17:42:27.839496'
date_2 = '2017-01-18 10:24:08.629327'
start = pd.to_datetime(date_1, format='%Y-%m-%d %H:%M:%S.%f')
end = pd.to_datetime(date_2, format='%Y-%m-%d %H:%M:%S.%f')
# get the difference between two datetimes as timedelta object
diff = end - start
print(diff.days)

convert yyyy-mm-dd to mmm-yy in dataframe python

I am trying to change the way the month and year are presented.
I have dataframe as below
Date
2020-01-31
2020-04-30
2021-05-05
and I want to convert it so that it shows only the month and year.
The output that I am expecting is
Date
Jan-20
Apr-20
May-21
I tried to do it with datetime but it doesn't work.
pd.to_datetime(pd.Series(df['Date']),format='%mmm-%yy')
Use .dt.strftime() to change the display format. %b-%y is the format string for Mmm-YY:
df.Date = pd.to_datetime(df.Date).dt.strftime('%b-%y')
# Date
# 0 Jan-20
# 1 Apr-20
# 2 May-21
Or if Date is the index:
df.index = pd.to_datetime(df.index).strftime('%b-%y')
import pandas as pd
date_sr = pd.to_datetime(pd.Series("2020-12-08"))
change_format = date_sr.dt.strftime('%b-%y')
print(change_format)
Reference: https://docs.python.org/3/library/datetime.html
The format changes from '%Y-%m-%d' to '%b-%y'.
import datetime
df['Date'] = df['Date'].apply(lambda x: datetime.datetime.strptime(x,'%Y-%m-%d').strftime('%b-%y'))
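One thing to keep in mind with all of these approaches is that strftime returns plain strings, so sorting the formatted column gives alphabetical rather than chronological order. A minimal sketch of one way around that, assuming it is acceptable to keep the datetime column and add a separate label column (the Label name is made up):

```python
import pandas as pd

# Sample data from the question, deliberately out of order
df = pd.DataFrame({'Date': ['2021-05-05', '2020-01-31', '2020-04-30']})

# Keep real datetimes for sorting and arithmetic
df['Date'] = pd.to_datetime(df['Date'])

# Add the display label as a separate string column
df['Label'] = df['Date'].dt.strftime('%b-%y')

# Sort on the datetime column, not on the label
df = df.sort_values('Date')
print(df)
```

Note that %b is locale-dependent, so the month abbreviations shown here assume an English locale.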

Pandas converting date with string in

I'm starting with python, pandas and matplotlib. I'm working with data with over a million entries. I'm trying to change the date format. In the CSV file the date format is 23-JUN-11. I would like to use the dates later to plot the amount of donations for each candidate. How can I convert the date format to one that pandas can read?
Here is the link to cut file 149 entries
My code:
%matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
First candidate
reader_bachmann = pd.read_csv('P00000001-ALL.csv' ,converters={'cand_id': lambda x: str(x)[1:]},parse_dates=True, squeeze=True, low_memory=False, nrows=411 )
date_frame = pd.DataFrame(reader_bachmann, columns = ['contb_receipt_dt'])
Data slice
s = date_frame.iloc[:,0]
date_slice = pd.Series([s])
date_strip = date_slice.str.replace('JUN','6')
Trying to convert to new date format
date = pd.to_datetime(s, format='%d%b%Y')
print(date_slice)
Here is the error message
ValueError: could not convert string to float: '05-JUL-11'
You need to use a different date format string:
format='%d-%b-%y'
Why?
The error message gives a clue as to what is wrong:
ValueError: could not convert string to float: '05-JUL-11'
The format string controls the conversion, and is currently:
format='%d%b%Y'
And the fields needed are:
%y - year without a century (range 00 to 99)
%b - abbreviated month name
%d - day of the month (01 to 31)
What is missing are the - characters that separate the fields in your data string, and the y for a two-digit year instead of the current Y for a four-digit year.
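Applied to a small sample, the corrected format string can be sketched as follows. The two values are taken from the question and its error message; the Series itself is made up for illustration.

```python
import pandas as pd

# Sample values in the CSV's dd-MON-yy layout
s = pd.Series(['23-JUN-11', '05-JUL-11'])

# %d = day, %b = abbreviated month name, %y = two-digit year, '-' as separator
parsed = pd.to_datetime(s, format='%d-%b-%y')
print(parsed)
```

The month names are matched case-insensitively, so the upper-case JUN and JUL need no preprocessing.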
As an alternative, you can use dateutil.parser to parse dates containing strings directly. I have created a sample dataframe for the demo.
l = []
for i in range(100):
    l.append('23-JUN-11')
B = pd.DataFrame({'Date':l})
Now, let's import dateutil.parser and apply it to our date column
import dateutil.parser
B['Date2'] = B['Date'].apply(lambda x : dateutil.parser.parse(x))
B.head()
Out[106]:
Date Date2
0 23-JUN-11 2011-06-23
1 23-JUN-11 2011-06-23
2 23-JUN-11 2011-06-23
3 23-JUN-11 2011-06-23
4 23-JUN-11 2011-06-23
