I have a date that is formatted like this:
01-19-71
Here 71 means 1971, but whenever to_datetime is used it converts it to 2071! How can I solve this problem? I am told that this would need regex, but I can't see how, since there are many cases in this data.
My current code:
re_1 = r"\d{1,2}[/-]\d{1,2}[/-]\d{2,4}"
re_2 = r"(?:\d{1,2} )?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*[ \-\.,]+(?:\d{1,2}[\w]*[ \-,]+)?[1|2]\d{3}"
re_3 = r"(?:\d{1,2}/)?[1|2]\d{3}"
# Correct misspellings
df = df.str.replace("Janaury", "January")
df = df.str.replace("Decemeber", "December")
# Extract dates
regex = "((%s)|(%s)|(%s))"%(re_1, re_2, re_3)
dates = df.str.extract(regex)
# Sort the Series
dates = pd.Series(pd.to_datetime(dates.iloc[:,0]))
dates.sort_values(ascending=True, inplace=True)
Considering that one has a string as follows
date = '01-19-71'
In order to convert to datetime object where 71 is converted to 1971 and not 2071, one can use datetime.strptime as follows
import datetime as dt
date = dt.datetime.strptime(date, '%m-%d-%y')
[Out]:
1971-01-19 00:00:00
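This works because strptime's %y follows the POSIX convention: 69-99 map to 1969-1999 and 00-68 map to 2000-2068, so 71 lands in 1971. A value such as 45 would still come out as 2045. Below is a minimal sketch of one way to handle that, assuming no date in the data can be later than a known cutoff year. The helper name parse_two_digit_year and the latest_year parameter are made up for this illustration.

```python
from datetime import datetime

def parse_two_digit_year(text, fmt="%m-%d-%y", latest_year=2025):
    """Parse a date with a two-digit year, pushing 'future' years back a century.

    latest_year is an assumption about the data: no real date is later than it.
    """
    parsed = datetime.strptime(text, fmt)
    if parsed.year > latest_year:
        # e.g. 2045 -> 1945; %y never yields a year above 2068
        parsed = parsed.replace(year=parsed.year - 100)
    return parsed

print(parse_two_digit_year("01-19-71"))  # %y already maps 71 to 1971
print(parse_two_digit_year("03-05-45"))  # 2045 is past the cutoff, so 1945
```

Whether a cutoff like this is appropriate depends on the data (birth dates, for example, are never in the future).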
Related
I have a numpy array (called dates) of dates (as strings) which I thought were in the form %Y-%m-%d %H:%M:%S. However, I get an error because I have dates such as 2021-05-11T00:00:00.0000000. I'm not sure where that additional 'T' came from or why the time is so precise.
I am trying to get rid of the time and only have the date.
My code is here:
dates = dataset.iloc[:,0].to_numpy()
newDates = []
for i in range(0,len(dates)):
    newDates.append(datetime.strptime(dates[i], '%Y-%m-%dT%H:%M:%S.%f'))
    newDates[i] = newDates[i].strftime('%Y-%m-%d')
dates = newDates
I get an error saying "ValueError: unconverted data remains: 0".
If I wrote instead
newDates.append(datetime.strptime(dates[i], '%Y-%m-%dT%H:%M:%S%f'))
I get an error "ValueError: unconverted data remains: .0000000".
In which format should the date be given?
If you have datetimes in a dataframe, you can use pd.to_datetime and Series.dt.strftime to convert them to the desired format. pandas does it all for you! (Why convert the values in the dataframe to a numpy.array?)
import pandas as pd
# example df
df = pd.DataFrame({'datetime': ['2021-05-11T00:00:00.0000000' ,
'2021-05-20T00:00:00.0000000' ,
'2021-06-24T00:00:00.0000000']})
df['datetime'] = pd.to_datetime(df['datetime']).dt.strftime('%Y-%m-%d')
print(df)
     datetime
0  2021-05-11
1  2021-05-20
2  2021-06-24
Does this help? https://strftime.org/
The extra T can be seen after %Y-%m-%d
If you just want to get the date, just split the string like this.
date = date.split('T')[0]
This will first split the date string into two parts,
['2021-05-11', '00:00:00.0000000']
then you can extract the first element of the list by keeping only index 0,
and you are left with
date = '2021-05-11'
dates = dataset.iloc[:,0].to_numpy()
newDates = []
for i in dates:
    newDates.append(i.split('T')[0])
dates = newDates
assuming dates is a list
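As for why strptime complained in the first place: %f accepts at most six fractional digits (microseconds), and these strings carry seven, which is why a single 0 was left unconverted. If the time part is needed after all, a rough sketch is to trim the fraction to six digits before parsing. The function name parse_seven_digit_fraction is hypothetical, and the input layout is assumed to match the example in the question.

```python
from datetime import datetime

def parse_seven_digit_fraction(text):
    """Parse 'YYYY-mm-ddTHH:MM:SS.fffffff' by truncating the fraction to 6 digits."""
    head, _, fraction = text.partition(".")
    # %f needs at least one digit, so fall back to "0" when there is no fraction
    trimmed = f"{head}.{fraction[:6] or '0'}"
    return datetime.strptime(trimmed, "%Y-%m-%dT%H:%M:%S.%f")

stamp = parse_seven_digit_fraction("2021-05-11T00:00:00.0000000")
print(stamp.strftime("%Y-%m-%d"))
```

Note that this truncates rather than rounds the seventh digit.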
I am calling some financial data from an API which is storing the time values as (I think) UTC (example below):
I cannot seem to convert the entire column into a usable date. I can do it for a single value with the following code, so I know the conversion works, but I have thousands of rows with this problem and thought pandas would offer an easier way to update all the values.
from datetime import datetime
tx = int('1645804609719')/1000
print(datetime.utcfromtimestamp(tx).strftime('%Y-%m-%d %H:%M:%S'))
Any help would be greatly appreciated.
Simply use pandas.DataFrame.apply:
df['date'] = df.date.apply(lambda x: datetime.utcfromtimestamp(int(x)/1000).strftime('%Y-%m-%d %H:%M:%S'))
Another way to do it is by using pd.to_datetime as recommended by Panagiotos in the comments:
df['date'] = pd.to_datetime(df['date'],unit='ms')
You can use "to_numeric" to convert the column to integers, "div" to divide it by 1000 and finally a loop over the dataframe column with datetime to get the format you want.
import pandas as pd
from datetime import datetime  # the class, not the module, provides utcfromtimestamp
df = pd.DataFrame({'date': ['1584199972000', '1645804609719'], 'values': [30,40]})
# milliseconds -> seconds
df['date'] = pd.to_numeric(df['date']).div(1000)
for i in range(len(df)):
    df.iloc[i,0] = datetime.utcfromtimestamp(df.iloc[i,0]).strftime('%Y-%m-%d %H:%M:%S')
print(df)
Output:
                  date  values
0  2020-03-14 15:32:52      30
1  2022-02-25 15:56:49      40
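One caveat: datetime.utcfromtimestamp is deprecated as of Python 3.12. A small sketch of the same millisecond conversion with a timezone-aware call instead, using only the standard library. The helper name ms_to_utc_string is invented for this example.

```python
from datetime import datetime, timezone

def ms_to_utc_string(ms):
    """Convert epoch milliseconds (int or numeric string) to a UTC timestamp string."""
    seconds = int(ms) / 1000
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")

print(ms_to_utc_string("1645804609719"))  # 2022-02-25 15:56:49
print(ms_to_utc_string("1584199972000"))  # 2020-03-14 15:32:52
```

It can be applied to a column in the same way as the lambda shown earlier, for example df['date'].apply(ms_to_utc_string).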
So, basically, I have these two df columns with date content. The initial content is in dd/mm/YYYY format, and I want to subtract them. I can't subtract strings, so I converted them to datetime, but when I do that the format for some reason changes to YYYY-dd-mm, so when I subtract them I get a wrong result. For example:
Initial Content:
a: 05/09/2021
b: 30/09/2021
result expected: 25 days.
Converted to DateTime:
a: 2021-05-09
b: 2021-09-30 (for some reason this date stays the same)
result: 144 days.
I'm using pandas and datetime for this project.
So, I wanted to know how I can subtract these two columns and get the proper result.
--- Answer
When I used
pd.to_datetime(date, format="%d/%m/%Y")
It worked. Thank you all for your time. This is my first project in pandas. :)
df = pd.DataFrame({'Date1': ['05/09/2021'], 'Date2': ['30/09/2021']})
df = df.apply(lambda x:pd.to_datetime(x,format=r'%d/%m/%Y')).assign(Delta=lambda x: (x.Date2-x.Date1).dt.days)
print(df)
       Date1      Date2  Delta
0 2021-09-05 2021-09-30     25
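To see where the 144 days came from: without an explicit format, 05/09/2021 can be read month-first (May 9), while 30/09/2021 can only be day-first because there is no month 30. A small standard-library sketch reproducing both results; the variable names are just for illustration.

```python
from datetime import datetime

a, b = "05/09/2021", "30/09/2021"

# Both strings read as day/month/year: 5 Sep -> 30 Sep
day_first = datetime.strptime(b, "%d/%m/%Y") - datetime.strptime(a, "%d/%m/%Y")

# 'a' misread as month/day/year (9 May), 'b' still day-first
mixed = datetime.strptime(b, "%d/%m/%Y") - datetime.strptime(a, "%m/%d/%Y")

print(day_first.days)  # 25
print(mixed.days)      # 144
```

Passing the format explicitly, as in the accepted fix, removes the guesswork.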
I just answered a similar query here subtracting dates in python
from datetime import datetime
date_format_str = '%Y-%m-%d %H:%M:%S.%f'
date_1 = '2016-09-24 17:42:27.839496'
date_2 = '2017-01-18 10:24:08.629327'
start = datetime.strptime(date_1, date_format_str)
end = datetime.strptime(date_2, date_format_str)
# Difference between the two timestamps as a timedelta object
diff = end - start
# Convert the interval to hours
diff_in_hours = diff.total_seconds() / 3600
print(diff_in_hours)
# Difference between the two calendar dates (time of day ignored), in whole days
diff = end.date() - start.date()
print(diff.days)
Pandas
import pandas as pd
date_1 = '2016-09-24 17:42:27.839496'
date_2 = '2017-01-18 10:24:08.629327'
start = pd.to_datetime(date_1, format='%Y-%m-%d %H:%M:%S.%f')
end = pd.to_datetime(date_2, format='%Y-%m-%d %H:%M:%S.%f')
# get the difference between two datetimes as timedelta object
diff = end - start
print(diff.days)
I am trying to convert the way month and year is presented.
I have dataframe as below
Date
2020-01-31
2020-04-30
2021-05-05
and I want to convert it in the way like month and year.
The output that I am expecting is
Date
Jan-20
Apr-20
May-21
I tried to do it with datetime but it doesn't work.
pd.to_datetime(pd.Series(df['Date'),format='%mmm-%yy')
Use .dt.strftime() to change the display format. %b-%y is the format string for Mmm-YY:
df.Date = pd.to_datetime(df.Date).dt.strftime('%b-%y')
# Date
# 0 Jan-20
# 1 Apr-20
# 2 May-21
Or if Date is the index (a DatetimeIndex has strftime directly; the .dt accessor exists only on a Series):
df.index = pd.to_datetime(df.index).strftime('%b-%y')
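Keep in mind that strftime returns plain strings, so sorting the formatted values orders them alphabetically (Apr before Jan). A sketch of one way around that, assuming you can keep the real datetime column and add the label next to it. The column name Label and the shuffled sample rows are made up for the example.

```python
import pandas as pd

# Same dates as in the question, deliberately out of order
df = pd.DataFrame({"Date": ["2021-05-05", "2020-01-31", "2020-04-30"]})

df["Date"] = pd.to_datetime(df["Date"])          # keep real datetimes for sorting
df["Label"] = df["Date"].dt.strftime("%b-%y")    # display-only string column

df = df.sort_values("Date").reset_index(drop=True)
labels = df["Label"].tolist()
print(labels)  # chronological order, not alphabetical
```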
import pandas as pd
date_sr = pd.to_datetime(pd.Series("2020-12-08"))
change_format = date_sr.dt.strftime('%b-%y')
print(change_format)
Reference: https://docs.python.org/3/library/datetime.html
The format %Y-%m-%d is changed to %b-%y:
import datetime
df['Date'] = df['Date'].apply(lambda x: datetime.datetime.strptime(x,'%Y-%m-%d').strftime('%b-%y'))
I'm starting out with Python, pandas and matplotlib. I'm working with data that has over a million entries, and I'm trying to change the date format. In the CSV file the date format is 23-JUN-11. I would like to use the dates later to plot the amount of donations for each candidate. How do I convert the dates to a format pandas can read?
Here is the link to cut file 149 entries
My code:
%matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
First candidate
reader_bachmann = pd.read_csv('P00000001-ALL.csv' ,converters={'cand_id': lambda x: str(x)[1:]},parse_dates=True, squeeze=True, low_memory=False, nrows=411 )
date_frame = pd.DataFrame(reader_bachmann, columns = ['contb_receipt_dt'])
Data slice
s = date_frame.iloc[:,0]
date_slice = pd.Series([s])
date_strip = date_slice.str.replace('JUN','6')
Trying to convert to new date format
date = pd.to_datetime(s, format='%d%b%Y')
print(date_slice)
Here is the error message
ValueError: could not convert string to float: '05-JUL-11'
You need to use a different date format string:
format='%d-%b-%y'
Why?
The error message gives a clue as to what is wrong:
ValueError: could not convert string to float: '05-JUL-11'
The format string controls the conversion, and is currently:
format='%d%b%Y'
And the fields needed are:
%y - year without a century (range 00 to 99)
%b - abbreviated month name
%d - day of the month (01 to 31)
What is missing are the - characters that separate the fields in your data string, and a lowercase y for a two-digit year instead of the current Y for a four-digit year.
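A quick way to check a format string before running it over a million rows is to try it on a single value with the standard library, which uses the same directives. A minimal sketch, assuming an English locale for the month abbreviations; the variable names are illustrative only.

```python
from datetime import datetime

raw = "05-JUL-11"

# Corrected format: dashes between fields, %y for the two-digit year
parsed = datetime.strptime(raw, "%d-%b-%y")
print(parsed)

# The original format has no dashes and expects a four-digit year
try:
    datetime.strptime(raw, "%d%b%Y")
    old_format_ok = True
except ValueError:
    old_format_ok = False
print(old_format_ok)
```

Matching of month names is case-insensitive, so JUL and Jul are both accepted.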
As an alternative, you can use dateutil.parser to parse strings containing dates directly. I have created a dummy dataframe for the demo.
l = []
for i in range(100):
    l.append('23-JUN-11')
B = pd.DataFrame({'Date':l})
Now, Let's import dateutil.parser and apply it on our date column
import dateutil.parser
B['Date2'] = B['Date'].apply(lambda x : dateutil.parser.parse(x))
B.head()
Out[106]:
        Date      Date2
0  23-JUN-11 2011-06-23
1  23-JUN-11 2011-06-23
2  23-JUN-11 2011-06-23
3  23-JUN-11 2011-06-23
4  23-JUN-11 2011-06-23