Why does pandas refuse to read a date 9 centuries into the future? - python

Consider this example df.
import pandas as pd
from io import StringIO
mycsv = StringIO("id,date\n1,11/07/2018\n2,11/07/2918\n3,02/01/2019")
df = pd.read_csv(mycsv)
df
id date
0 1 11/07/2018
1 2 11/07/2918
2 3 02/01/2019
Clearly there was a typo there (2918 instead of 2018), but I'd like to parse it as a date nonetheless.
So let's check df.dtypes
id int64
date object
dtype: object
Ok, by default it was read as a string. So I'll explicitly tell read_csv to parse that column as a date.
mycsv.seek(0)  # rewind the buffer, which the first read_csv call consumed
df = pd.read_csv(mycsv, parse_dates=["date"])
But df.dtypes still shows date was read as a string (object dtype).
If I correct the typo ...
mycsv = StringIO("id,date\n1,11/07/2018\n2,11/07/2018\n3,02/01/2019")
it works
df = pd.read_csv(mycsv, parse_dates=["date"])
df
id date
0 1 2018-11-07
1 2 2018-11-07
2 3 2019-02-01
df.dtypes
id int64
date datetime64[ns]
dtype: object
So clearly it is failing to parse such an unrealistic date (11/07/2918) and then the whole column gets handled as string.
But why can't it properly handle the 11/07/2918 date? And how can I make it parse such a date correctly?
read_csv documentation says that by default it uses dateutil.parser.parse. And when you try by hand:
import dateutil.parser
dateutil.parser.parse("13/07/2918")
It just works. No exception, no error and produces a valid datetime object: datetime.datetime(2918, 7, 13, 0, 0)
Also converting that to numpy.datetime64 works
import numpy as np
import dateutil.parser
toy = dateutil.parser.parse("13/07/2918")
np.datetime64(toy)
It produces a valid and correctly parsed object.
numpy.datetime64('2918-07-13T00:00:00.000000')
Similarly, strptime works fine and produces a valid datetime object (pd.datetime is only an alias for the standard library's datetime.datetime).
pd.datetime.strptime("11/07/2918", "%d/%m/%Y")
Now, trying that with a custom date parser, just to make sure date-format is right
mycsv = StringIO("id,date\n1,11/07/2018\n2,11/07/2918\n3,02/01/2019")
df = pd.read_csv(
    mycsv,
    parse_dates=["date"],
    date_parser=lambda x: pd.datetime.strptime(x, "%d/%m/%Y")
)
Again df["date"].dtype is dtype('O')
Ok, so I was giving up trying to convince read_csv to properly parse the date. So I said, let's just convert it to date.
Either this
df["date"].astype("datetime64")
or this
pd.to_datetime(df["date"])
throws an exception:
OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 2918-07-11 00:00:00
Nothing seems to work.
Any ideas why this happens and how to make it work?

From the docs:
Since pandas represents timestamps in nanosecond resolution, the time span that can
be represented using a 64-bit integer is limited to approximately 584 years:
In [92]: pd.Timestamp.min
Out[92]: Timestamp('1677-09-21 00:12:43.145225')
In [93]: pd.Timestamp.max
Out[93]: Timestamp('2262-04-11 23:47:16.854775807')
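The bounds quoted above follow directly from the 64-bit nanosecond counter. As a rough check using only the standard library (no pandas involved, and truncated to microseconds because datetime cannot hold nanoseconds), the upper limit and the total span can be reproduced like this:

```python
from datetime import datetime, timedelta

# A signed 64-bit integer holds at most 2**63 - 1 nanoseconds after the epoch.
MAX_NANOSECONDS = 2**63 - 1

# datetime only resolves microseconds, so drop the last three digits.
upper_bound = datetime(1970, 1, 1) + timedelta(microseconds=MAX_NANOSECONDS // 1000)
print(upper_bound)  # 2262-04-11 23:47:16.854775

# Width of the whole representable range (2**64 nanoseconds) in 365.25-day years.
span_years = 2**64 / 1e9 / (365.25 * 24 * 3600)
print(round(span_years, 1))  # 584.5

# The date from the question lies far beyond the upper bound.
typo_date = datetime(2918, 7, 11)
print(typo_date > upper_bound)  # True
```

So 11/07/2918 simply does not fit into a nanosecond-resolution timestamp, which is why the conversion fails even though the string itself parses fine.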
How to represent out of bounds times:
https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries-oob
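One way to follow that advice is to store the values as periods instead of timestamps. The snippet below is only a sketch under a few assumptions: the dates are day-first as in the question, daily resolution is enough, and to_period is a hypothetical helper name introduced here for illustration.

```python
from datetime import datetime

import pandas as pd

raw = ["11/07/2018", "11/07/2918", "02/01/2019"]

def to_period(text):
    """Parse a day-first date string into a daily pandas Period (hypothetical helper)."""
    d = datetime.strptime(text, "%d/%m/%Y")
    # Periods are not limited to the nanosecond timestamp range.
    return pd.Period(year=d.year, month=d.month, day=d.day, freq="D")

periods = pd.PeriodIndex([to_period(t) for t in raw], freq="D")
print(periods)
```

The resulting index keeps the year 2918 intact. If the far-future value is really just a typo, converting with pd.to_datetime and then cleaning or dropping the offending rows may be the better choice.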

Related

How to convert Pandas Series of strings to Pandas datetime with non-standard formats that contain dates before 1970

I have a column of dates in the following format:
Jan-85
Apr-99
Nov-01
Feb-65
Apr-57
Dec-19
I want to convert this to a pandas datetime object.
The following syntax works to convert them:
pd.to_datetime(temp, format='%b-%y')
where temp is the pd.Series object of dates. The glaring issue here of course is that dates that are prior to 1970 are being wrongly converted to 20xx.
I tried updating the function call with the following parameter:
pd.to_datetime(temp, format='%b-%y', origin='1950-01-01')
However, I am getting the error:
Name: temp, Length: 42537, dtype: object' is not compatible with origin='1950-01-01'; it must be numeric with a unit specified
I tried specifying a unit as it said, but I got a different error citing that the unit cannot be specified alongside a format.
Any ideas how to fix this?
Just @DudeWah's logic, but improving upon the code:
def days_of_future_past(date, chk_y=pd.Timestamp.today().year):
    return date.replace(year=date.year-100) if date.year > chk_y else date
temp = pd.to_datetime(temp, format='%b-%y').map(days_of_future_past)
Output:
>>> temp
0 1985-01-01
1 1999-04-01
2 2001-11-01
3 1965-02-01
4 1957-04-01
5 2019-12-01
6 1965-05-01
Name: date, dtype: datetime64[ns]
Gonna go ahead and answer my own question so others can use this solution if they come across this same issue. Not the greatest, but it gets the job done. It should work until 2069, so hopefully pandas will have a better solution to this by then lol
Perhaps someone else will post a better solution.
def wrong_date_preprocess(data):
    """Correct date issues with pre-1970 dates with whacky mon-yy format."""
    df1 = data.copy()
    dates = df1['date_column_of_interest']
    # use particular datetime format with data; ex: jan-91
    dates = pd.to_datetime(dates, format='%b-%y')
    # look at wrongly defined python dates (pre 1970) and get indices
    date_dummy = dates[dates > pd.Timestamp.today().floor('D')]
    idx = list(date_dummy.index)
    # fix wrong dates by offsetting 100 years back dates that defaulted to > 2069
    dummy2 = date_dummy.apply(lambda x: x.replace(year=x.year - 100)).to_list()
    dates.loc[idx] = dummy2
    df1['date_column_of_interest'] = dates
    return df1
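For what it's worth, the underlying cause is the two-digit-year convention that Python's strptime follows for %y: values 69 to 99 become 1969 to 1999, and 00 to 68 become 2000 to 2068. Here is a standard-library sketch of the same "shift back 100 years" idea. The name parse_mon_yy and the latest_year parameter are made up for this illustration, and %b assumes English month abbreviations in the active locale.

```python
from datetime import datetime

def parse_mon_yy(text, latest_year):
    """Parse 'Mon-yy' and move any year after latest_year back one century."""
    parsed = datetime.strptime(text, "%b-%y")
    if parsed.year > latest_year:
        # The data cannot contain dates after latest_year, so this was 19xx.
        parsed = parsed.replace(year=parsed.year - 100)
    return parsed

samples = ["Jan-85", "Apr-99", "Nov-01", "Feb-65", "Apr-57", "Dec-19"]
fixed = [parse_mon_yy(s, latest_year=2019) for s in samples]
print([d.strftime("%Y-%m") for d in fixed])
# ['1985-01', '1999-04', '2001-11', '1965-02', '1957-04', '2019-12']
```

Passing the cutoff explicitly, instead of reading today's date, keeps the result reproducible when the same file is processed again years later.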

skip rows with bad dates while using pd.read_csv

I'm reading in csv files from an external data source using pd.read_csv, as in the code below:
pd.read_csv(
    BytesIO(raw_data),
    parse_dates=['dates'],
    date_parser=np.datetime64,
)
However, somewhere in the csv that's being sent, there is a misformatted date, resulting in the following error:
ValueError: Error parsing datetime string "2015-08-2" at position 8
This causes the entire application to crash. Of course, I can handle this case with a try/except, but then I will lose all the other data in that particular csv. I need pandas to keep and parse that other data.
I have no way of predicting when/where this data (which changes daily) will have badly formatted dates. Is there some way to get pd.read_csv to skip only the rows with bad dates but to still parse all the other rows in the csv?
Quoting the question: "somewhere in the csv that's being sent, there is a misformatted date"
np.datetime64 needs ISO 8601 formatted strings to work properly. The good news is that you can wrap np.datetime64 in your own function and use this as the date_parser:
def parse_date(v):
    try:
        return np.datetime64(v)
    except ValueError:
        # apply whatever remedies you deem appropriate
        pass
    return v
pd.read_csv(
    ...
    date_parser=parse_date
)
Quoting the question: "I need pandas to keep and parse that other data."
I often find that a more flexible date parser like dateutil works better than np.datetime64 and may even work without the extra function:
import dateutil.parser
pd.read_csv(
    BytesIO(raw_data),
    parse_dates=['dates'],
    date_parser=dateutil.parser.parse,
)
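Here's another way to do this, using the convert_objects() method (deprecated since pandas 0.21 and removed in 1.0; on current versions pd.to_datetime(..., errors='coerce') does the same job):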
# make good and bad date csv files
# read in good dates file using parse_dates - no problem
df = pd.read_csv('dategood.csv', parse_dates=['dates'], date_parser=np.datetime64)
df.dtypes
dates datetime64[ns]
data float64
dtype: object
# try same code on bad dates file - throws exceptions
df = pd.read_csv('datebad.csv', parse_dates=['dates'], date_parser=np.datetime64)
ValueError: Error parsing datetime string "Q%Bte0tvk5" at position 0
# read the file first without converting dates
# then use convert objects to force conversion
df = pd.read_csv('datebad.csv')
df['cdate'] = df.dates.convert_objects(convert_dates='coerce')
# resulting new date column is a datetime64 same as good data file
df.dtypes
dates object
data float64
cdate datetime64[ns]
dtype: object
# the bad date has NaT in the cdate column - can clean it later
df.head()
dates data cdate
0 2015-12-01 0.914836 2015-12-01
1 2015-12-02 0.866848 2015-12-02
2 2015-12-03 0.103718 2015-12-03
3 2015-12-04 0.514086 2015-12-04
4 Q%Bte0tvk5 0.583617 NaT
Use the built-in pd.to_datetime with errors='coerce', which converts values that cannot be parsed as dates to NaT:
df = pd.read_csv(
    BytesIO(raw_data),
    parse_dates=['dates'],
    date_parser=lambda col: pd.to_datetime(col, errors='coerce'),
)
Now you can filter out the invalid rows with a standard NaN/null check:
df = df[~df["dates"].isnull()]
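If you would rather not depend on date_parser at all, the same effect can be had in two steps: read the column as plain text, then convert it afterwards. This is a sketch only; the sample CSV is invented for the illustration and assumes ISO-style YYYY-MM-DD dates.

```python
import io

import pandas as pd

# Invented sample data: the second row carries a garbage date.
raw_csv = io.StringIO(
    "dates,data\n"
    "2015-12-01,0.91\n"
    "Q%Bte0tvk5,0.58\n"
    "2015-12-03,0.10\n"
)

df = pd.read_csv(raw_csv)  # 'dates' stays a plain string column here

# errors="coerce" turns every value that does not match the format into NaT.
df["dates"] = pd.to_datetime(df["dates"], format="%Y-%m-%d", errors="coerce")

# Keep only the rows whose date could be parsed.
clean = df.dropna(subset=["dates"])
print(clean)
```

The rows with good dates survive and the bad one is dropped, so a single broken value no longer takes the whole file down.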

pandas raises ValueError on DatetimeIndex Conversion

I am converting all ISO-8601 formatted values into Unix timestamps. For some inexplicable reason this line
a_col = pd.DatetimeIndex(a_col).astype(np.int64)/10**6
raises the error
ValueError: Unable to convert 0 2001-06-29
... (Abbreviated Output of Column
Name: DateCol, dtype: datetime64[ns] to datetime dtype
This is very odd because I've guaranteed that each value is in datetime.datetime format as you can see here:
if a_col.dtypes is (np.dtype('object') or np.dtype('O')):
a_col = a_col.apply(lambda x: x if isinstance(x, datetime.datetime) else epoch)
a_col = pd.DatetimeIndex(a_col).astype(np.int64)/10**6
epoch is a datetime.datetime.
When I check the dtype of the column that gives me the error, it's "object", exactly what I'm checking for. Is there something I'm missing?
Assuming that your time zone is US/Eastern (based on your dataset) and that your DataFrame is named df, please try the following:
import datetime as dt
from time import mktime
import pytz
df['Job Start Date'] = \
df['Job Start Date'].apply(lambda x: mktime(pytz.timezone('US/Eastern').localize(x)
.astimezone(pytz.UTC).timetuple()))
>>> df['Job Start Date'].head()
0 993816000
1 1080824400
2 1052913600
3 1080824400
4 1075467600
Name: Job Start Date, dtype: float64
You first need to make your 'naive' datetime objects timezone aware (to US/Eastern) and then convert them to UTC. Finally, pass your new UTC-aware datetime object as a timetuple to the mktime function from the time module.
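For comparison, here is a minimal standard-library sketch of the conversion to epoch milliseconds. It assumes the naive datetimes already represent UTC (unlike the US/Eastern assumption above), and to_epoch_ms is a hypothetical helper name, not something from pandas or the question.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def to_epoch_ms(naive_utc):
    """Return milliseconds since the Unix epoch for a naive datetime taken as UTC."""
    aware = naive_utc.replace(tzinfo=timezone.utc)
    # Integer division by a timedelta avoids floating-point rounding.
    return (aware - EPOCH) // timedelta(milliseconds=1)

print(to_epoch_ms(datetime(2001, 6, 29)))  # 993772800000
```

Doing the arithmetic on aware datetimes sidesteps time.mktime, which interprets its argument in the machine's local time zone.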

pandas save date in ISO format?

I'm trying to generate a Pandas DataFrame where date_range is an index. Then save it to a CSV file so that the dates are written in ISO-8601 format.
import pandas as pd
import numpy as np
from pandas import DataFrame, Series
NumberOfSamples = 10
dates = pd.date_range('20130101',periods=NumberOfSamples,freq='90S')
df3 = DataFrame(index=dates)
df3.to_csv('dates.txt', header=False)
The current output to dates.txt is:
2013-01-01 00:00:00
2013-01-01 00:01:30
2013-01-01 00:03:00
2013-01-01 00:04:30
...................
I'm trying to get it to look like:
2013-01-01T00:00:00Z
2013-01-01T00:01:30Z
2013-01-01T00:03:00Z
2013-01-01T00:04:30Z
....................
Use datetime.strftime and call map on the index:
In [72]:
NumberOfSamples = 10
import datetime as dt
dates = pd.date_range('20130101',periods=NumberOfSamples,freq='90S')
df3 = pd.DataFrame(index=dates)
df3.index = df3.index.map(lambda x: dt.datetime.strftime(x, '%Y-%m-%dT%H:%M:%SZ'))
df3
Out[72]:
Empty DataFrame
Columns: []
Index: [2013-01-01T00:00:00Z, 2013-01-01T00:01:30Z, 2013-01-01T00:03:00Z, 2013-01-01T00:04:30Z, 2013-01-01T00:06:00Z, 2013-01-01T00:07:30Z, 2013-01-01T00:09:00Z, 2013-01-01T00:10:30Z, 2013-01-01T00:12:00Z, 2013-01-01T00:13:30Z]
Alternatively and better in my view (thanks to @unutbu) you can pass a format specifier to to_csv:
df3.to_csv('dates.txt', header=False, date_format='%Y-%m-%dT%H:%M:%SZ')
With pd.Index.strftime:
If you're sure that all your dates are UTC, you can hardcode the format:
df3.index = df3.index.strftime('%Y-%m-%dT%H:%M:%SZ')
which gives you 2013-01-01T00:00:00Z and so on. Note that the "Z" denotes UTC!
With pd.Timestamp.isoformat and pd.Index.map:
df3.index = df3.index.map(lambda timestamp: timestamp.isoformat())
This gives you 2013-01-01T00:00:00. If you attach a timezone to your dates first (e.g. by passing tz="UTC" to date_range), you'll get: 2013-01-01T00:00:00+00:00 which also conforms to ISO-8601 but is a different notation. This should work for any dateutil or pytz timezone, leaving no room for ambiguity when clocks switch from daylight saving to standard time.
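The two notations can be compared side by side without pandas. The following standard-library sketch assumes the timestamps are UTC and uses the same 90-second spacing as the question:

```python
from datetime import datetime, timedelta, timezone

start = datetime(2013, 1, 1, tzinfo=timezone.utc)
stamps = [start + timedelta(seconds=90 * i) for i in range(4)]

# Hard-coded "Z" suffix: only correct because every value is known to be UTC.
z_form = [s.strftime("%Y-%m-%dT%H:%M:%SZ") for s in stamps]

# isoformat() writes the explicit UTC offset instead.
offset_form = [s.isoformat() for s in stamps]

print(z_form[1])       # 2013-01-01T00:01:30Z
print(offset_form[1])  # 2013-01-01T00:01:30+00:00
```

Both strings describe the same instant; which one to write mostly depends on what the consumer of the file expects.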

Can pandas automatically read dates from a CSV file?

Today I was positively surprised by the fact that while reading data from a data file (for example) pandas is able to recognize types of values:
df = pandas.read_csv('test.dat', delimiter=r"\s+", names=['col1','col2','col3'])
For example it can be checked in this way:
for i, r in df.iterrows():
    print(type(r['col1']), type(r['col2']), type(r['col3']))
In particular, integers, floats and strings were recognized correctly. However, I have a column that has dates in the following format: 2013-6-4. These dates were recognized as strings (not as Python date objects). Is there a way to "teach" pandas to recognize dates?
You should add parse_dates=True or parse_dates=['column name'] when reading; that's usually enough to magically parse it. But there are always weird formats which need to be defined manually. In such a case you can also add a date parser function, which is the most flexible way possible.
Suppose you have a column 'datetime' with your string, then:
from datetime import datetime
dateparse = lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S')
df = pd.read_csv(infile, parse_dates=['datetime'], date_parser=dateparse)
This way you can even combine multiple columns into a single datetime column, this merges a 'date' and a 'time' column into a single 'datetime' column:
dateparse = lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S')
df = pd.read_csv(infile, parse_dates={'datetime': ['date', 'time']}, date_parser=dateparse)
You can find directives (i.e. the letters to be used for different formats) for strptime and strftime in this page.
Perhaps the pandas interface has changed since @Rutger answered, but in the version I'm using (0.15.2), the date_parser function receives a list of dates instead of a single value. In this case, his code should be updated like so:
from datetime import datetime
import pandas as pd
dateparse = lambda dates: [datetime.strptime(d, '%Y-%m-%d %H:%M:%S') for d in dates]
df = pd.read_csv('test.dat', parse_dates=['datetime'], date_parser=dateparse)
Since the original question asker said he wants dates and the dates are in 2013-6-4 format, the dateparse function should really be:
dateparse = lambda dates: [datetime.strptime(d, '%Y-%m-%d').date() for d in dates]
You could use pandas.to_datetime() as recommended in the documentation for pandas.read_csv():
If a column or index contains an unparseable date, the entire column
or index will be returned unaltered as an object data type. For
non-standard datetime parsing, use pd.to_datetime after pd.read_csv.
Demo:
>>> D = {'date': '2013-6-4'}
>>> df = pd.DataFrame(D, index=[0])
>>> df
date
0 2013-6-4
>>> df.dtypes
date object
dtype: object
>>> df['date'] = pd.to_datetime(df.date, format='%Y-%m-%d')
>>> df
date
0 2013-06-04
>>> df.dtypes
date datetime64[ns]
dtype: object
When merging two columns into a single datetime column, the accepted answer generates an error (pandas version 0.20.3), since the columns are sent to the date_parser function separately.
The following works:
def dateparse(d, t):
    dt = d + " " + t
    return pd.datetime.strptime(dt, '%d/%m/%Y %H:%M:%S')
df = pd.read_csv(infile, parse_dates={'datetime': ['date', 'time']}, date_parser=dateparse)
pandas read_csv method is great for parsing dates. Complete documentation at http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html
you can even have the different date parts in different columns and pass the parameter:
parse_dates : boolean, list of ints or names, list of lists, or dict
If True -> try parsing the index. If [1, 2, 3] -> try parsing columns 1, 2, 3 each as a
separate date column. If [[1, 3]] -> combine columns 1 and 3 and parse as a single date
column. {‘foo’ : [1, 3]} -> parse columns 1, 3 as date and call result ‘foo’
The default date detection works well, but it seems to be biased towards North American date formats. If you live elsewhere you might occasionally be caught out by the results. As far as I can remember, 1/6/2000 means 6 January in the USA, as opposed to 1 June where I live. It is smart enough to swap them around when a date like 23/6/2000 is used. It is probably safer to stay with YYYYMMDD variations of dates, though. Apologies to the pandas developers here, but I have not tested it with local dates recently.
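To make the ambiguity concrete, here is a tiny standard-library illustration of how the same string yields two different dates depending on the assumed field order. In pandas itself, read_csv and to_datetime accept dayfirst=True to prefer the day-first reading.

```python
from datetime import datetime

text = "1/6/2000"

# European reading: day first, then month.
day_first = datetime.strptime(text, "%d/%m/%Y")

# North American reading: month first, then day.
month_first = datetime.strptime(text, "%m/%d/%Y")

print(day_first.date())    # 2000-06-01
print(month_first.date())  # 2000-01-06
```

Spelling out the format, or writing dates year-first to begin with, removes the guesswork.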
you can use the date_parser parameter to pass a function to convert your format.
date_parser : function
Function to use for converting a sequence of string columns to an array of datetime
instances. The default uses dateutil.parser.parser to do the conversion.
Yes - according to the pandas.read_csv documentation:
Note: A fast-path exists for iso8601-formatted dates.
So if your csv has a column named datetime and the dates looks like 2013-01-01T01:01 for example, running this will make pandas (I'm on v0.19.2) pick up the date and time automatically:
df = pd.read_csv('test.csv', parse_dates=['datetime'])
Note that you need to explicitly pass parse_dates, it doesn't work without.
Verify with:
df.dtypes
You should see the datatype of the column is datetime64[ns]
When loading a CSV file that contains a date column, there are two approaches to make pandas recognize it:
Explicitly specify the format with the date_parser=mydateparser argument
Let pandas infer the format with the infer_datetime_format=True argument
Some sample values from the date column:
01/01/18
01/02/18
Here we cannot tell whether the first two digits are the month or the day, so in this case we have to use Method 1.
Method 1:- Explicitly pass the format
mydateparser = lambda x: pd.datetime.strptime(x, "%m/%d/%y")
df = pd.read_csv(file_name, parse_dates=['date_col_name'],
                 date_parser=mydateparser)
Method 2:- Implicitly or automatically recognize the format
df = pd.read_csv(file_name, parse_dates=['date_col_name'], infer_datetime_format=True)
In addition to what the other replies said, if you have to parse very large files with hundreds of thousands of timestamps, date_parser can prove to be a huge performance bottleneck, as it's a Python function called once per row. You can get a sizeable performance improvement by instead keeping the dates as text while parsing the CSV file and then converting the entire column into dates in one go:
# For a data column
df = pd.read_csv(infile, parse_dates={'mydatetime': ['date', 'time']})
df['mydatetime'] = pd.to_datetime(df['mydatetime'], exact=True, cache=True, format='%Y-%m-%d %H:%M:%S')
# For a DateTimeIndex
df = pd.read_csv(infile, parse_dates={'mydatetime': ['date', 'time']}, index_col='mydatetime')
df.index = pd.to_datetime(df.index, exact=True, cache=True, format='%Y-%m-%d %H:%M:%S')
# For a MultiIndex
df = pd.read_csv(infile, parse_dates={'mydatetime': ['date', 'time']}, index_col=['mydatetime', 'num'])
idx_mydatetime = df.index.get_level_values(0)
idx_num = df.index.get_level_values(1)
idx_mydatetime = pd.to_datetime(idx_mydatetime, exact=True, cache=True, format='%Y-%m-%d %H:%M:%S')
df.index = pd.MultiIndex.from_arrays([idx_mydatetime, idx_num])
For my use case on a file with 200k rows (one timestamp per row), that cut down processing time from about a minute to less than a second.
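A stripped-down version of the same idea, shown only as a sketch: the two-row CSV is invented for the illustration, the date and time columns are read as plain strings, and the combined column is converted in a single vectorized call with an explicit format.

```python
import io

import pandas as pd

# Invented sample data with separate date and time columns.
raw = io.StringIO(
    "date,time,value\n"
    "2016-10-15,00:00:43,1\n"
    "2016-10-15,00:00:56,2\n"
)

# Keep both columns as text while reading the CSV.
df = pd.read_csv(raw, dtype={"date": str, "time": str})

# One vectorized conversion for the whole column instead of one call per row.
df["mydatetime"] = pd.to_datetime(
    df["date"] + " " + df["time"], format="%Y-%m-%d %H:%M:%S"
)
df = df.drop(columns=["date", "time"])
print(df.dtypes)
```

As always with performance advice, it is worth timing this against your own data before committing to it.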
Read the existing string columns as date and time columns respectively:
pd.read_csv('CGMData.csv', parse_dates=['Date', 'Time'])
Concatenate the date and time string columns into a new datetime column and drop the original columns.
To choose the name of the new column, pass a dictionary, as in the example below; the key becomes the new column name.
If you pass a nested list of columns instead, the new column name is the listed column names joined by _, e.g. Date_Time.
# parse_dates={'given_name': ['Date', 'Time']}
pd.read_csv("InsulinData.csv",low_memory=False,
parse_dates=[['Date', 'Time']])
pd.read_csv("InsulinData.csv",low_memory=False,
parse_dates={'date_time': ['Date', 'Time']})
Concatenate the date and time string columns into a new datetime column and keep the original columns:
pd.read_csv("InsulinData.csv",low_memory=False,
parse_dates=[['Date', 'Time']], keep_date_col=True)
To specify the date and time format while reading from the CSV:
parser = lambda x: pd.to_datetime(x, format='%Y-%m-%d %H:%M:%S')
pd.read_csv('path', date_parser=parser, parse_dates=['date', 'time'])
If performance matters to you make sure you time:
import sys
import timeit
import pandas as pd
print('Python %s on %s' % (sys.version, sys.platform))
print('Pandas version %s' % pd.__version__)
repeat = 3
numbers = 100
def time(statement, _setup=None):
    print(min(
        timeit.Timer(statement, setup=_setup or setup).repeat(
            repeat, numbers)))
print("Format %m/%d/%y")
setup = """import pandas as pd
import io
data = io.StringIO('''\
ProductCode,Date
''' + '''\
x1,07/29/15
x2,07/29/15
x3,07/29/15
x4,07/30/15
x5,07/29/15
x6,07/29/15
x7,07/29/15
y7,08/05/15
x8,08/05/15
z3,08/05/15
''' * 100)"""
time('pd.read_csv(data); data.seek(0)')
time('pd.read_csv(data, parse_dates=["Date"]); data.seek(0)')
time('pd.read_csv(data, parse_dates=["Date"],'
'infer_datetime_format=True); data.seek(0)')
time('pd.read_csv(data, parse_dates=["Date"],'
'date_parser=lambda x: pd.datetime.strptime(x, "%m/%d/%y")); data.seek(0)')
print("Format %Y-%m-%d %H:%M:%S")
setup = """import pandas as pd
import io
data = io.StringIO('''\
ProductCode,Date
''' + '''\
x1,2016-10-15 00:00:43
x2,2016-10-15 00:00:56
x3,2016-10-15 00:00:56
x4,2016-10-15 00:00:12
x5,2016-10-15 00:00:34
x6,2016-10-15 00:00:55
x7,2016-10-15 00:00:06
y7,2016-10-15 00:00:01
x8,2016-10-15 00:00:00
z3,2016-10-15 00:00:02
''' * 1000)"""
time('pd.read_csv(data); data.seek(0)')
time('pd.read_csv(data, parse_dates=["Date"]); data.seek(0)')
time('pd.read_csv(data, parse_dates=["Date"],'
'infer_datetime_format=True); data.seek(0)')
time('pd.read_csv(data, parse_dates=["Date"],'
'date_parser=lambda x: pd.datetime.strptime(x, "%Y-%m-%d %H:%M:%S")); data.seek(0)')
prints:
Python 3.7.1 (v3.7.1:260ec2c36a, Oct 20 2018, 03:13:28)
[Clang 6.0 (clang-600.0.57)] on darwin
Pandas version 0.23.4
Format %m/%d/%y
0.19123052499999993
8.20691274
8.143124389
1.2384357139999977
Format %Y-%m-%d %H:%M:%S
0.5238807110000039
0.9202787830000005
0.9832778819999959
12.002349824999996
So with iso8601-formatted date (%Y-%m-%d %H:%M:%S is apparently an iso8601-formatted date, I guess the T can be dropped and replaced by a space) you should not specify infer_datetime_format (which does not make a difference with more common ones either apparently) and passing your own parser in just cripples performance. On the other hand, date_parser does make a difference with not so standard day formats. Be sure to time before you optimize, as usual.
You can use the parameter date_parser with a function for converting a sequence of string columns to an array of datetime instances:
parser = lambda x: pd.to_datetime(x, format='%Y-%m-%d %H:%M:%S')
pd.read_csv('path', date_parser=parser, parse_dates=['date_col1', 'date_col2'])
Yes, this code works like a breeze. Here index 0 refers to the position of the date column.
df = pd.read_csv(filepath, parse_dates=[0], infer_datetime_format = True)
No, there is no way in pandas to automatically recognize date columns.
Pandas does a poor job at type inference. It basically puts most columns as the generic object type, unless you manually work around it eg. using the abovementioned parse_dates parameter.
If you want to automatically detect columns types, you'd have to use a separate data profiling tool, eg. visions, and then cast or feed the inferred types back into your DataFrame constructor (eg. for dates and from_csv, using the parse_dates parameter).
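Short of a dedicated profiling tool, a crude do-it-yourself version is to try converting every string column and keep whatever succeeds. The helper below is hypothetical (parse_date_like_columns is not a pandas function) and assumes all date columns share one known format:

```python
import pandas as pd

def parse_date_like_columns(frame, fmt="%Y-%m-%d"):
    """Return a copy where string columns that fully match fmt become datetimes."""
    result = frame.copy()
    for name in result.columns:
        column = result[name]
        if not pd.api.types.is_string_dtype(column):
            continue  # numeric and other non-string columns are left alone
        try:
            result[name] = pd.to_datetime(column, format=fmt)
        except (ValueError, TypeError):
            pass  # at least one value did not match: keep the column as text
    return result

df = pd.DataFrame({
    "code": ["x1", "x2"],
    "day": ["2013-06-04", "2013-06-05"],
    "n": [1, 2],
})
converted = parse_date_like_columns(df)
print(converted.dtypes)
```

Using a strict, explicit format is deliberate here: it keeps columns such as product codes from being misread as dates.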
