I am working on a stock market analysis where I look at past balance sheets and income statements, and I want to convert the date column, which stores dates as strings of the form "2021-09-30", into datetimes. I am trying to use pd.to_datetime, but it gives me an error.
When I run
df['datekey'] = pd.to_datetime(df['datekey'], format='%Y-%m-%d')
I get
"ValueError: time data "2021-09-30" doesn't match format specified"
when it should (if I am doing this correctly).
This column doesn't have a time value in it. It is just (for all dates) "2021-09-30".
You have extra quotes and spaces in your data. Try:
df["datekey"] = pd.to_datetime(df["datekey"].str.replace(" ","").str.strip('"'), format="%Y-%m-%d")
>>> df["datekey"]
0 2021-09-30
1 2021-06-30
2 2021-03-31
3 2020-12-31
4 2020-09-30
5 2020-06-30
6 2020-03-31
7 2019-12-31
8 2019-09-30
9 2019-06-30
Name: datekey, dtype: datetime64[ns]
It seems the value itself is enclosed in double quotes, so you need to include the quotes in your format as well:
df['datekey'] = pd.to_datetime(df['datekey'], format='"%Y-%m-%d"')
Alternatively, you can strip off the quotes before converting to datetime. This is useful if some values are not enclosed in double quotes:
df['datekey'] = pd.to_datetime(df['datekey'].str.strip('"'), format='%Y-%m-%d')
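A minimal sketch of how one might confirm the diagnosis before converting. The sample values below are made up for illustration (the original data is not shown); printing the repr of each value makes hidden quotes and spaces visible, and a single str.strip call removes both.

```python
import pandas as pd

# Hypothetical sample: some values carry literal double quotes and stray spaces.
df = pd.DataFrame({"datekey": ['"2021-09-30"', ' "2021-06-30" ', "2021-03-31"]})

# repr() exposes characters that are easy to miss in a normal print.
print(df["datekey"].map(repr).tolist())

# Strip spaces and double quotes from both ends, then parse with the plain format.
cleaned = df["datekey"].str.strip(' "')
df["datekey"] = pd.to_datetime(cleaned, format="%Y-%m-%d")
print(df["datekey"].dtype)
```

Stripping first keeps the format string simple and also works when only some of the values are quoted.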
I want to convert strings into datetime format. I have three different formats in my current column, like this:
01-05-21 5:50 (month-day-year hour:min)
13/01/2021 05:50:00 (Day/month/year hour:min:sec)
1/1/21 0:00 (day/month/year hour:min)
I want to convert them into a single format, let's say 13-01-21 05:50:00 and 05-01-21 05:50:00 (day-month-year hour:min:sec).
I am not able to handle all of these formats in a single piece of Python code.
df.head()
ts
0 01-05-21 5:50
1 01-05-21 6:00
2 13/01/2021 05:00:00
3 13/01/2021 05:10:00
4 1/1/21 0:00
(Three different formats)
https://stackabuse.com/how-to-format-dates-in-python/
See this link. You can add the hyphens between the numbers yourself.
You may use to_datetime here to conditionally choose between one of the two formatting masks when converting the string column to datetime. Build a boolean mask once and assign through .loc (chained indexing silently assigns to a copy, and `not` cannot be applied to a Series; use `~` instead):
mask = df["ts"].str.contains(r'^\d{2}-\d{2}-\d{2} \d{1,2}:\d{2}$')
df.loc[mask, "dt"] = pd.to_datetime(df.loc[mask, "ts"], format='%m-%d-%y %H:%M')
df.loc[~mask, "dt"] = pd.to_datetime(df.loc[~mask, "ts"], format='%d/%m/%Y %H:%M:%S', errors='coerce')  # rows like "1/1/21 0:00" do not match this format and become NaT
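For all three formats in the question, one possible approach (a sketch, not the only way) is to try each format in turn with errors="coerce" and keep the first successful parse per row. The third format string, "%d/%m/%y %H:%M", is an assumption based on the question's description of "1/1/21 0:00" as day/month/year, and the output column names dt and dt_str are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"ts": ["01-05-21 5:50", "01-05-21 6:00",
                          "13/01/2021 05:00:00", "13/01/2021 05:10:00",
                          "1/1/21 0:00"]})

# One format per input style, in the order described in the question.
formats = ["%m-%d-%y %H:%M", "%d/%m/%Y %H:%M:%S", "%d/%m/%y %H:%M"]

# Start with the first format; rows that do not match become NaT.
parsed = pd.to_datetime(df["ts"], format=formats[0], errors="coerce")
for fmt in formats[1:]:
    attempt = pd.to_datetime(df["ts"], format=fmt, errors="coerce")
    parsed = parsed.fillna(attempt)  # fill only the rows still unparsed

df["dt"] = parsed
# Render every row in the single target layout: day-month-year hour:min:sec.
df["dt_str"] = df["dt"].dt.strftime("%d-%m-%y %H:%M:%S")
print(df)
```

Rows that match none of the formats stay NaT, which makes them easy to find with df["dt"].isna().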
I am new to Python. I have a DataFrame with a date column that contains several different formats. I would like to check whether each value follows a particular date format and drop the rows that do not. I have tried iterating over the rows with try/except, but I am looking for a faster way to do this. Is there one, perhaps using the datetime library?
My code:
Date_format = '%Y%m%d'
df =
Date abc
0 2020-03-22 q
1 03-12-2020 w
2 55552020 e
3 25122020 r
4 12/25/2020 r
5 1212202033 y
Expected output:
Date abc
0 2020-03-22 q
You could try
pd.to_datetime(df.Date, errors='coerce')
0 2020-03-22
1 2020-03-12
2 NaT
3 NaT
4 2020-12-25
5 NaT
It's easy to drop the null values then
EDIT:
For a given format you can still leverage pd.to_datetime:
datetimes = pd.to_datetime(df.Date, format='%Y-%m-%d', errors='coerce')
datetimes
0 2020-03-22
1 NaT
2 NaT
3 NaT
4 NaT
5 NaT
df.loc[datetimes.notnull()]
Also note that I am using the format %Y-%m-%d, which I think is the one you want based on your expected output (not the one you gave as Date_format).
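A self-contained variant of the same idea, using the question's data. As an extra assumption on my part, it adds a regular-expression shape check before parsing, because how strictly to_datetime applies a format has varied between pandas versions; the regex guarantees that only strings shaped like YYYY-MM-DD are even attempted.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["2020-03-22", "03-12-2020", "55552020",
             "25122020", "12/25/2020", "1212202033"],
    "abc": ["q", "w", "e", "r", "r", "y"],
})

# Shape check: exactly four digits, dash, two digits, dash, two digits.
looks_right = df["Date"].str.fullmatch(r"\d{4}-\d{2}-\d{2}")

# Calendar check: parse only the well-shaped values; everything else is NaT.
parsed = pd.to_datetime(df["Date"].where(looks_right),
                        format="%Y-%m-%d", errors="coerce")

# Keep only the rows that passed both checks.
result = df[parsed.notna()]
print(result)
```

Both steps are vectorized, so this avoids the row-by-row try/except loop mentioned in the question.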
I've been trying to find an answer for 4 hours, but no luck. Any help will be much appreciated.
Goal: convert 20170103 into 2017-01-03 and 022100 into 02:21:00 for candlestick plotting
date_int = 20170103
df = pd.DataFrame({'date':[date_int]*10})
df['date'] = df['date'].apply(lambda x: pd.to_datetime(str(x), format='%Y%m%d'))
print(df['date'])
time_int = 020100
df = pd.DataFrame({'time':[time_int]*10})
df['time'] = df['time'].apply(lambda x: pd.to_datetime(str(x), format='%H:%M:%S'))
print(df['time'])
but the second snippet raises an 'invalid token' error.
I also notice that this code runs very slowly. If there is a more efficient way, please let me know. Thank you so much in advance for your help.
To expand on my comments, you have a few things wrong here. Firstly, as mentioned, the format used in your second example is wrong. Your data has the format '%H%M%S', so that is the one you need to specify in the argument.
When using pd.to_datetime, the specified format indicates the actual data format so that it can be correctly parsed.
In order to further modify it, you need to add Series.dt.strftime:
date_int = 20170103
df = pd.DataFrame({'date':[date_int]*10})
df.date = pd.to_datetime(df.date, format='%Y%m%d').dt.strftime('%Y-%m-%d')
date
0 2017-01-03
1 2017-01-03
2 2017-01-03
3 2017-01-03
4 2017-01-03
5 2017-01-03
6 2017-01-03
7 2017-01-03
8 2017-01-03
9 2017-01-03
So similarly for your second example you need:
df.time = pd.to_datetime(df.time, format='%H%M%S').dt.strftime('%H:%M:%S')
Here, based on my comment above (to fix the invalid token error, make the value a string surrounded by single or double quotes):
time_int = '020100'
df = pd.DataFrame({'time':[time_int]*10})
df['time'] = pd.to_datetime(df['time'], format='%H%M%S')  # vectorized; much faster than apply with a lambda
df['time'] = df['time'].dt.time
print(df['time'])
Output:
0 02:01:00
1 02:01:00
2 02:01:00
3 02:01:00
4 02:01:00
5 02:01:00
6 02:01:00
7 02:01:00
8 02:01:00
9 02:01:00
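If the values really are stored as integers, which is an assumption here since the question does not show how the data is loaded, the leading zero of a time such as 020100 is already gone by the time it reaches the DataFrame (and writing 020100 as a literal is what triggers the 'invalid token' error). A sketch that pads the time back to six digits and parses date and time together in one vectorized call; the column names timestamp, date_fmt and time_fmt are hypothetical:

```python
import pandas as pd

# The integer 20100 stands for 02:01:00; the leading zero cannot be stored in an int.
df = pd.DataFrame({"date": [20170103] * 3, "time": [20100] * 3})

date_str = df["date"].astype(str)
time_str = df["time"].astype(str).str.zfill(6)  # restore leading zeros: "020100"

# Parse the whole column at once instead of calling to_datetime per row.
df["timestamp"] = pd.to_datetime(date_str + " " + time_str,
                                 format="%Y%m%d %H%M%S")

# Formatted strings, if they are needed for display.
df["date_fmt"] = df["timestamp"].dt.strftime("%Y-%m-%d")
df["time_fmt"] = df["timestamp"].dt.strftime("%H:%M:%S")
print(df)
```

A single datetime column is usually more convenient for plotting than separate date and time strings.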
Looking at the question, it seems the original consisted of two test cases for debugging code that uses the pandas package. The comment that the code ran slowly suggests that a file of dates and times is being read. Given that candlestick plots can be drawn from datetime objects, perhaps this could all be solved more simply.
When reading each line, pull the date and time out as a single string, say '20170103 022100'.
Use datetime to parse directly to a datetime object.
import datetime as dt
ts='20170103 022100'
result=dt.datetime.strptime(ts,'%Y%m%d %H%M%S')
What's nice about strptime is that a single space in the format matches any run of whitespace, so a string with multiple spaces between the date and the time still parses correctly.
Hope that simplifies things.
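A quick check of the whitespace behaviour described above, using only the standard library. The second input string, with several spaces, is made up for illustration.

```python
import datetime as dt

fmt = "%Y%m%d %H%M%S"

# One space between date and time.
single = dt.datetime.strptime("20170103 022100", fmt)
# Several spaces between date and time; the one space in fmt still matches.
multiple = dt.datetime.strptime("20170103    022100", fmt)

print(single)
print(single == multiple)
```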
I have a dataframe (df) with two date columns where the head looks like
name start end
0 John 2018-11-09 00:00:00 2012-03-01 00:00:00
1 Steve 1990-09-03 00:00:00
2 Debs 1977-09-07 00:00:00 2012-07-02 00:00:00
3 Mandy 2009-01-09 00:00:00
4 Colin 1993-08-22 00:00:00 2002-06-03 00:00:00
The start and end columns have the type object. I want to change the type to datetime so I can use the following:
referenceError = DeptTemplate['start'] > DeptTemplate['end']
I am trying to change the type using:
df['start'].dt.strftime('%d/%m/%Y')
df['end'].dt.strftime('%d/%m/%Y')
but I think the rows where there is no date in the columns are causing a problem. How can I handle the blank values so that I can change the type to datetime and run my analysis?
As shown in the pd.to_datetime docs, you can set the behavior for values that cannot be parsed using the errors kwarg. The format kwarg takes a strftime-style pattern that describes how the input strings are laid out, so it should match the data:
# Bad values will be NaT
df["start"] = pd.to_datetime(df.start, errors='coerce', format='%Y-%m-%d %H:%M:%S')
As mentioned in the comments, you can prepare the column with replace if you absolutely must use strftime.
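A small end-to-end sketch of this approach, assuming the blanks are empty strings (the question does not say how they are stored) and using a cut-down copy of the question's data. The name reference_error is a hypothetical stand-in for the question's referenceError.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["John", "Steve", "Debs"],
    "start": ["2018-11-09 00:00:00", "1990-09-03 00:00:00", "1977-09-07 00:00:00"],
    "end": ["2012-03-01 00:00:00", "", "2012-07-02 00:00:00"],
})

# Convert both columns; blanks and anything unparseable become NaT.
for col in ["start", "end"]:
    df[col] = pd.to_datetime(df[col], format="%Y-%m-%d %H:%M:%S", errors="coerce")

# Comparisons involving NaT evaluate to False, so rows without an end date are not flagged.
reference_error = df["start"] > df["end"]
print(df[reference_error])
```

Once the columns are datetime64, .dt.strftime can be used purely for display, without affecting the comparison.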
I have a column with a birthdate. Some are N.A, some 01.01.2016 but some contain 01.01.2016 01:01:01
Filtering the N.A. values works fine. But handling the different date formats seems clumsy. Is it possible to have pandas handle these gracefully and e.g. for a birthdate only interpret the date and not fail?
pd.to_datetime() will handle multiple formats (in pandas 2.0 and later, pass format='mixed' so that each element is parsed individually):
>>> ser = pd.Series(['NaT', '01.01.2016', '01.01.2016 01:01:01'])
>>> pd.to_datetime(ser)
0 NaT
1 2016-01-01 00:00:00
2 2016-01-01 01:01:01
dtype: datetime64[ns]
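If only the date matters, as for a birthdate, one option that does not rely on format inference is to keep the first ten characters of each value and parse them with an explicit format. This is a sketch that assumes day.month.year order (the sample 01.01.2016 is ambiguous) and that every real date starts with a ten-character dd.mm.yyyy prefix.

```python
import pandas as pd

ser = pd.Series(["N.A", "01.01.2016", "01.01.2016 01:01:01"])

# Keep only the "dd.mm.yyyy" prefix, dropping any time-of-day part.
date_part = ser.str.slice(0, 10)

# Explicit format; placeholders such as "N.A" become NaT instead of raising.
birthdates = pd.to_datetime(date_part, format="%d.%m.%Y", errors="coerce")
print(birthdates)
```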