I'm receiving the following error:
ValueError: time data '2013' does not match format '%Y%m%d' (match)
Here is the section of code where the error is occurring:
# Convert periodEndDate from string to datetime to epoch timestamp
df['periodEndDate'] = df['periodEndDate'].apply(lambda x: pd.to_datetime(int(x), format='%Y%m%d').timestamp())
df['periodEndDate'] = df['periodEndDate'].astype(int)
df['periodTypeId'] = 1
return df.to_dict('records')
output:
0 2013
1 2012
2 2015
3 20111231
4 2016
5 2014
6 2017
7 2018
I understand that the code is failing because '2013' does not match the format. Is it possible to insert a day and month to resolve this issue?
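Don't specify the format; let pandas infer it. This assumes the column holds strings, since integers would be interpreted as nanoseconds since the epoch.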
df['periodEndDate'] = pd.to_datetime(df["periodEndDate"])
>>> df
0 2013-01-01
1 2012-01-01
2 2015-01-01
3 2011-12-31
4 2016-01-01
5 2014-01-01
6 2017-01-01
7 2018-01-01
Name: periodEndDate, dtype: datetime64[ns]
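If format inference turns out not to be reliable for a column that mixes year-only and full-date values (its behaviour has differed between pandas versions), one option is to pad the short values explicitly, which is what the question asks about. This is a minimal sketch, assuming that four-character values are bare years and that January 1 is an acceptable default; the sample frame is made up for illustration:

```python
import pandas as pd

# Hypothetical sample mirroring the question's column
df = pd.DataFrame({"periodEndDate": ["2013", "2012", "2015", "20111231"]})

s = df["periodEndDate"].astype(str)
# Keep 8-character values; append "0101" to bare years (assumed default)
padded = s.where(s.str.len() != 4, s + "0101")
df["periodEndDate"] = pd.to_datetime(padded, format="%Y%m%d")

# Whole seconds since the Unix epoch, treating the dates as UTC
epoch = (df["periodEndDate"] - pd.Timestamp("1970-01-01")) // pd.Timedelta(seconds=1)
print(epoch.tolist())
```

Computing the epoch value on the whole column avoids the row-by-row lambda from the question.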
I have the following data frame where the column hour shows hours of the day in int64 form. I'm trying to convert that into a time format; so that hour 1 would show up as '01:00'. I then want to add this to the date column and convert it into a timestamp index.
Using the datetime function in pandas resulted in the column "hr2", which is not what I need. I'm not sure I can even apply datetime directly, as the original data (i.e. in column "hr") is not really a date time format to begin with. Google searches so far have been unproductive.
I am still in the dark about the format of your date column, so I will assume that the Date column holds strings and the Hr column is int64. To create a TimeStamp column in pandas timestamp format, this is how I would proceed:
Given df:
Date Hr
0 12/01/2010 1
1 12/01/2010 2
2 12/01/2010 3
3 12/01/2010 4
4 12/02/2010 1
5 12/02/2010 2
6 12/02/2010 3
7 12/02/2010 4
df['TimeStamp'] = df.apply(lambda row: pd.to_datetime(row['Date']) + pd.to_timedelta(row['Hr'], unit='h'), axis = 1)
yields:
Date Hr TimeStamp
0 12/01/2010 1 2010-12-01 01:00:00
1 12/01/2010 2 2010-12-01 02:00:00
2 12/01/2010 3 2010-12-01 03:00:00
3 12/01/2010 4 2010-12-01 04:00:00
4 12/02/2010 1 2010-12-02 01:00:00
5 12/02/2010 2 2010-12-02 02:00:00
6 12/02/2010 3 2010-12-02 03:00:00
7 12/02/2010 4 2010-12-02 04:00:00
The timestamp column can then be used as your index.
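The same result can be computed on whole columns rather than row by row, which is usually faster on large frames. This is a sketch under the same assumptions as above (Date holds month/day/year strings, Hr is an integer hour); the three-row frame is illustrative only:

```python
import pandas as pd

# Illustrative subset of the frame shown above
df = pd.DataFrame({"Date": ["12/01/2010", "12/01/2010", "12/02/2010"],
                   "Hr": [1, 2, 1]})

# Parse the dates once, then add the hours as a timedelta column
df["TimeStamp"] = (pd.to_datetime(df["Date"], format="%m/%d/%Y")
                   + pd.to_timedelta(df["Hr"], unit="h"))

# Use the combined timestamp as the index
df = df.set_index("TimeStamp")
print(df)
```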
I have a variable as:
start_dt = 201901 which is basically Jan 2019
I have an initial data frame as:
month
0
1
2
3
4
I want to add a new column (date) to the dataframe where for month 0, the date is the start_dt - 1 month, and for subsequent months, the date is a month + 1 increment.
I want the resulting dataframe as:
month date
0 12/1/2018
1 1/1/2019
2 2/1/2019
3 3/1/2019
4 4/1/2019
You can subtract 1 from the month column, add the start date converted to a monthly period with Timestamp.to_period, and then convert the result back to timestamps with to_timestamp:
start_dt = 201801
start_dt = pd.to_datetime(start_dt, format='%Y%m')
s = df['month'].sub(1).add(start_dt.to_period('m')).dt.to_timestamp()
print (s)
0 2017-12-01
1 2018-01-01
2 2018-02-01
3 2018-03-01
4 2018-04-01
Name: month, dtype: datetime64[ns]
Alternatively, convert the column to month offsets (after subtracting 1) and add the datetime:
s = df['month'].apply(lambda x: pd.DateOffset(months=x-1)).add(start_dt)
print (s)
0 2017-12-01
1 2018-01-01
2 2018-02-01
3 2018-03-01
4 2018-04-01
Name: month, dtype: datetime64[ns]
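When the month column is simply 0, 1, 2, ... in order, the dates can also be generated directly with pd.date_range. This is a sketch under that assumption, using the start_dt value from the question; it would not be correct for a month column with gaps or a different ordering:

```python
import pandas as pd

start_dt = 201901  # value from the question (January 2019)
df = pd.DataFrame({"month": range(5)})

start = pd.to_datetime(str(start_dt), format="%Y%m")
# Month 0 is one month before start_dt; "MS" steps by month start
df["date"] = pd.date_range(start - pd.DateOffset(months=1),
                           periods=len(df), freq="MS")
print(df)
```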
Here is how you can use the third-party library dateutil to increment a datetime by one month:
import pandas as pd
from datetime import datetime
from dateutil.relativedelta import relativedelta
start_dt = '201801'
number_of_rows = 10
start_dt = datetime.strptime(start_dt, '%Y%m')
df = pd.DataFrame({'date': [start_dt+relativedelta(months=+n)
                            for n in range(-1, number_of_rows-1)]})
print(df)
Output:
date
0 2017-12-01
1 2018-01-01
2 2018-02-01
3 2018-03-01
4 2018-04-01
5 2018-05-01
6 2018-06-01
7 2018-07-01
8 2018-08-01
9 2018-09-01
As you can see, each iteration of the list comprehension adds the corresponding number of months (starting at -1) to the initial datetime.
I have an Excel file in which the date column is not stored in a date format, so the trailing 0 of 2018.10 has been dropped and the value appears as 2018.1.
date
2018.12
2018.11
2018.1
2018.9
2018.8
2018.7
2018.6
2018.5
2018.4
2018.3
2018.2
2018.1
How can I convert this column to year month format correctly? Thank you.
I tried df['date'] = pd.to_datetime(df['date'].map('{:.1f}'.format), format='%Y.%m'), but I got this:
8 2018-01-01
9 2018-01-01
10 2018-01-01
11 2018-09-01
12 2018-08-01
13 2018-07-01
14 2018-06-01
15 2018-05-01
16 2018-04-01
17 2018-03-01
18 2018-02-01
First convert the values to strings and then to datetimes.
Then correct October: a row is the mis-parsed October if its month is 1, the previous row's month is 11, and the next row's month is 9:
df['date'] = pd.to_datetime(df['date'].astype(str), format='%Y.%m')
mo = df['date'].dt.month
mask = mo.shift().eq(11) & mo.eq(1) & mo.shift(-1).eq(9)
df.loc[mask, 'date'] = df.loc[mask, 'date'] + pd.offsets.DateOffset(month=10)
print (df)
date
0 2018-12-01
1 2018-11-01
2 2018-10-01
3 2018-09-01
4 2018-08-01
5 2018-07-01
6 2018-06-01
7 2018-05-01
8 2018-04-01
9 2018-03-01
10 2018-02-01
11 2018-01-01
It might be easiest to fix this in the Excel file! If you've got a lot of data (thousands of rows), then maybe it's worth writing code. The code options are:
Look at the rows above and below and try to infer whether .1 means January or October.
Ignore the column: if you have data for every month, just generate the correct sequence.
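The second option could look like the following sketch. It assumes the rows cover every month of 2018 exactly once, newest first, as in the sample above; the frame is reconstructed from the question for illustration:

```python
import pandas as pd

# The float column as read from Excel, newest month first
df = pd.DataFrame({"date": [2018.12, 2018.11, 2018.1, 2018.9, 2018.8, 2018.7,
                            2018.6, 2018.5, 2018.4, 2018.3, 2018.2, 2018.1]})

# Build the month starts ending at December 2018, then reverse the order
months = pd.date_range(end="2018-12-01", periods=len(df), freq="MS")[::-1]
df["date"] = months
print(df)
```

This discards the original values entirely, so it is only safe when the ordering and completeness of the rows are known.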
Consider the following pandas code:
datetest = pd.DataFrame({'year':['02','08',23,32,43,68,70,72,85,94]})
newdate = pd.to_datetime(datetest['year'], format='%y')
print(newdate)
Output:
0 2002-01-01
1 2008-01-01
2 2023-01-01
3 2032-01-01
4 2043-01-01
5 2068-01-01
6 1970-01-01
7 1972-01-01
8 1985-01-01
9 1994-01-01
Name: year, dtype: datetime64[ns]
So how can I convert 2023, 2032, 2043 and 2068 to 1923, 1932, 1943 and 1968 respectively, keeping the datetime format intact?
You could use boolean indexing and pandas.DateOffset to move any dates that fall in the future back by 100 years.
If this rule is too strict, you can set your own threshold for what an acceptable year might be:
year = pd.Timestamp.today().year
# If setting your own threshold year eg.
# year = 2030
newdate.loc[newdate.dt.year.gt(year)] -= pd.DateOffset(years=100)
[out]
0 2002-01-01
1 2008-01-01
2 1923-01-01
3 1932-01-01
4 1943-01-01
5 1968-01-01
6 1970-01-01
7 1972-01-01
8 1985-01-01
9 1994-01-01
Name: year, dtype: datetime64[ns]
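Another way to get the same result is to expand the two-digit years to four digits before parsing, so no correction is needed afterwards. This is a sketch only; the cutoff name pivot and its value 20 are assumptions chosen to reproduce the output above (values above the cutoff are read as 19xx, the rest as 20xx):

```python
import pandas as pd

datetest = pd.DataFrame({"year": ["02", "08", 23, 32, 43, 68, 70, 72, 85, 94]})

pivot = 20  # hypothetical cutoff: two-digit years above this are 19xx
yy = datetest["year"].map(int)

# Years above the cutoff become 19yy; the others become 20yy
full_year = yy.where(yy > pivot, yy + 100) + 1900
newdate = pd.to_datetime(full_year.astype(str), format="%Y")
print(newdate)
```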
Command:
dataframe.date.head()
Result:
0 12-Jun-98
1 7-Aug-2005
2 28-Aug-66
3 11-Sep-1954
4 9-Oct-66
5 NaN
Command:
pd.to_datetime(dataframe.date.head())
Result:
0 1998-06-12 00:00:00
1 2005-08-07 00:00:00
2 2066-08-28 00:00:00
3 1954-09-11 00:00:00
4 2066-10-09 00:00:00
5 NaN
I don't want to get 2066; it should be 1966. What should I do?
The year range is supposed to be from 1920 to 2017. The dataframe contains null values.
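You can subtract 100 years where dt.year is greater than 2017. Use pd.DateOffset(years=100) rather than pd.Timedelta(100, unit='Y'): the Timedelta version subtracts a fixed number of average-length years, which shifts the time of day, and the 'Y' unit is not accepted by recent pandas versions:
df['date'] = pd.to_datetime(df['date'])
df['date'] = df['date'].mask(df['date'].dt.year > 2017,
                             df['date'] - pd.DateOffset(years=100))
print (df)
date
0 1998-06-12
1 2005-08-07
2 1966-08-28
3 1954-09-11
4 1966-10-09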
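Because the column mixes two-digit and four-digit years, it may be safer to parse each format explicitly instead of relying on inference. This is a sketch under the assumptions stated in the question (valid years run from 1920 to 2017, nulls are present); the series is rebuilt from the sample shown there:

```python
import pandas as pd

raw = pd.Series(["12-Jun-98", "7-Aug-2005", "28-Aug-66",
                 "11-Sep-1954", "9-Oct-66", None])

# Parse each layout separately; rows that do not match become NaT
four_digit = pd.to_datetime(raw, format="%d-%b-%Y", errors="coerce")
two_digit = pd.to_datetime(raw, format="%d-%b-%y", errors="coerce")
dates = four_digit.fillna(two_digit)

# Two-digit years parsed into the future are moved back one century
dates = dates.mask(dates.dt.year > 2017, dates - pd.DateOffset(years=100))
print(dates)
```

Null inputs stay as NaT throughout, since the comparison against 2017 is False for missing values.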