I'm reading in csv files from an external data source using pd.read_csv, as in the code below:
pd.read_csv(
    BytesIO(raw_data),
    parse_dates=['dates'],
    date_parser=np.datetime64,
)
However, somewhere in the csv that's being sent, there is a misformatted date, resulting in the following error:
ValueError: Error parsing datetime string "2015-08-2" at position 8
This causes the entire application to crash. Of course, I can handle this case with a try/except, but then I will lose all the other data in that particular csv. I need pandas to keep and parse that other data.
I have no way of predicting when/where this data (which changes daily) will have badly formatted dates. Is there some way to get pd.read_csv to skip only the rows with bad dates but to still parse all the other rows in the csv?
somewhere in the csv that's being sent, there is a misformatted date
np.datetime64 needs ISO8601 formatted strings to work properly. The good news is that you can wrap np.datetime64 in your own function and use this as the date_parser:
def parse_date(v):
    try:
        return np.datetime64(v)
    except ValueError:
        # apply whatever remedies you deem appropriate
        return v

pd.read_csv(
    ...
    date_parser=parse_date,
)
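One caveat: rows where parse_date fell through keep their original strings, so the column comes back as object dtype. A follow-up sketch (assuming the column is named 'dates' as in the question) that coerces the leftovers to NaT and drops them:
df = pd.read_csv(BytesIO(raw_data), parse_dates=['dates'], date_parser=parse_date)
# values the wrapper could not parse are still plain strings at this point;
# a second, coercing pass turns them into NaT so the rows can be dropped
df['dates'] = pd.to_datetime(df['dates'], errors='coerce')
df = df.dropna(subset=['dates'])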
I need pandas to keep and parse that other data.
I often find that a more flexible date parser like dateutil works better than np.datetime64 and may even work without the extra function:
import dateutil.parser

pd.read_csv(
    BytesIO(raw_data),
    parse_dates=['dates'],
    date_parser=dateutil.parser.parse,
)
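For instance, the misformatted date from the question parses fine with dateutil:
import dateutil.parser

# the short-day string that made np.datetime64 raise is no problem here
dateutil.parser.parse("2015-08-2")
# datetime.datetime(2015, 8, 2, 0, 0)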
Here's another way to do this, using the Series.convert_objects() method (note it has since been deprecated in favor of pd.to_datetime):
# make good and bad date csv files
# read in good dates file using parse_dates - no problem
df = pd.read_csv('dategood.csv', parse_dates=['dates'], date_parser=np.datetime64)
df.dtypes
dates datetime64[ns]
data float64
dtype: object
# try same code on bad dates file - throws exceptions
df = pd.read_csv('datebad.csv', parse_dates=['dates'], date_parser=np.datetime64)
ValueError: Error parsing datetime string "Q%Bte0tvk5" at position 0
# read the file first without converting dates
# then use convert objects to force conversion
df = pd.read_csv('datebad.csv')
df['cdate'] = df.dates.convert_objects(convert_dates='coerce')
# resulting new date column is a datetime64 same as good data file
df.dtypes
dates object
data float64
cdate datetime64[ns]
dtype: object
# the bad date has NaT in the cdate column - can clean it later
df.head()
dates data cdate
0 2015-12-01 0.914836 2015-12-01
1 2015-12-02 0.866848 2015-12-02
2 2015-12-03 0.103718 2015-12-03
3 2015-12-04 0.514086 2015-12-04
4 Q%Bte0tvk5 0.583617 NaT
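Since convert_objects is gone from current pandas, the equivalent coercion today is pd.to_datetime with errors='coerce' (a sketch against the same hypothetical file):
# modern replacement for convert_objects(convert_dates='coerce')
df = pd.read_csv('datebad.csv')
df['cdate'] = pd.to_datetime(df['dates'], errors='coerce')
# bad dates become NaT, just like in the convert_objects output above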
Use the built-in pd.to_datetime with errors="coerce", which converts unparseable values to NaT (with the default errors="raise" the bad rows would still blow up the read):
pd.read_csv(
    BytesIO(raw_data),
    parse_dates=['dates'],
    date_parser=lambda x: pd.to_datetime(x, errors='coerce'),
)
Now you can filter out the invalid rows with a standard NaN/null check:
df = df[~df["dates"].isnull()]
Context
I have a Pandas Series containing dates in string format (e.g. 2017-12-19 09:35:00). My goal is to convert this Series into timestamps (time in seconds since 1970).
The difficulty is that some values in this Series are corrupt and cannot be converted to a timestamp. In that case, they should be converted to None.
Code
import datetime
series = series.apply(lambda x: datetime.datetime.strptime(x, "%Y-%m-%d %H:%M:%S").timestamp())
Question
The code above would work if all values were in the correct format; however, there is corrupt data.
How can I achieve my goal while converting all not-convertible data to None?
Pandas typically represents invalid timestamps with NaT (Not a Time). You can use pd.to_datetime with errors="coerce":
import pandas as pd
series = pd.Series(["2023-01-07 12:34:56", "error"])
out = pd.to_datetime(series, format="%Y-%m-%d %H:%M:%S", errors="coerce")
Output:
0 2023-01-07 12:34:56
1 NaT
dtype: datetime64[ns]
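Since the question asked for seconds since 1970 with None for corrupt values, one way to finish from the coerced result (a sketch; NaT propagates to NaN through the subtraction):
# seconds since the epoch; NaT becomes NaN automatically
epoch_seconds = (out - pd.Timestamp("1970-01-01")) / pd.Timedelta(seconds=1)
# 0    1.673095e+09
# 1             NaN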
Create a function with try/except, like this:
import datetime

def to_timestamp(x):
    try:
        return datetime.datetime.strptime(x, "%Y-%m-%d %H:%M:%S").timestamp()
    except (ValueError, TypeError):
        return None

series = series.apply(to_timestamp)
I have a column of dates in the following format:
Jan-85
Apr-99
Nov-01
Feb-65
Apr-57
Dec-19
I want to convert this to a pandas datetime object.
The following syntax works to convert them:
pd.to_datetime(temp, format='%b-%y')
where temp is the pd.Series object of dates. The glaring issue here of course is that dates that are prior to 1970 are being wrongly converted to 20xx.
I tried updating the function call with the following parameter:
pd.to_datetime(temp, format='%b-%y', origin='1950-01-01')
However, I am getting the error:
Name: temp, Length: 42537, dtype: object' is not compatible with origin='1950-01-01'; it must be numeric with a unit specified
I tried specifying a unit as it said, but I got a different error citing that the unit cannot be specified alongside a format.
Any ideas how to fix this?
Just @DudeWah's logic, but improving upon the code:
def days_of_future_past(date, chk_y=pd.Timestamp.today().year):
    # dates parsed past the current year defaulted to the wrong century
    return date.replace(year=date.year - 100) if date.year > chk_y else date

temp = pd.to_datetime(temp, format='%b-%y').map(days_of_future_past)
Output:
>>> temp
0 1985-01-01
1 1999-04-01
2 2001-11-01
3 1965-02-01
4 1957-04-01
5 2019-12-01
6 1965-05-01
Name: date, dtype: datetime64[ns]
Gonna go ahead and answer my own question so others can use this solution if they come across this same issue. Not the greatest, but it gets the job done. It should work until 2069, so hopefully pandas will have a better solution to this by then lol
Perhaps someone else will post a better solution.
def wrong_date_preprocess(data):
    """Correct date issues with pre-1970 dates with whacky mon-yy format."""
    df1 = data.copy()
    dates = df1['date_column_of_interest']
    # use particular datetime format with data; ex: jan-91
    dates = pd.to_datetime(dates, format='%b-%y')
    # look at wrongly defined python dates (pre 1970) and get indices
    date_dummy = dates[dates > pd.Timestamp.today().floor('D')]
    idx = list(date_dummy.index)
    # fix wrong dates by offsetting 100 years back dates that defaulted to > 2069
    dummy2 = date_dummy.apply(lambda x: x.replace(year=x.year - 100)).to_list()
    dates.loc[idx] = dummy2
    df1['date_column_of_interest'] = dates
    return df1
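For what it's worth, a more compact variant of the same fix, sketched with mask and DateOffset (the column name is the same placeholder as above):
dates = pd.to_datetime(df1['date_column_of_interest'], format='%b-%y')
# anything parsed into the future defaulted to the wrong century; shift it back
cutoff = pd.Timestamp.today().floor('D')
df1['date_column_of_interest'] = dates.mask(dates > cutoff, dates - pd.DateOffset(years=100))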
I have an excel file with many columns, one of them, 'Column3' is date with some text in it, basically it looks like that:
26/05/20
XXX
YYY
12/05/2020
The data is written in DD/MM/YY format, but pandas, just like Excel, thinks that 12/05/2020 is 05 Dec 2020 while it is actually 12 May 2020. (My Windows is set to the American date format.)
Important note: when I open the stock Excel file, cells with 12/05/2020 are already of Date type; trying to convert one to text gives me 44170, which would give the wrong date if I just reformat it into DD/MM/YY.
I added this line of code:
import pandas as pd
dateparse = lambda x: pd.datetime.strptime(x,'%d/%m/%y')
df = pd.read_excel("my_file.xlsx", parse_dates=['Column3'], date_parser=dateparse)
But the text in the column generates an error.
ValueError: time data 'XXX' does not match format '%d/%m/%y'
I went a step further and manually removed all the text (obviously I can't do that every time) to see whether it works or not, but then I got the following error:
dateparse = lambda x: pd.datetime.strptime(x,'%d/%m/%y')
TypeError: strptime() argument 1 must be str, not datetime.datetime
I also tried this:
df['Column3'] = pd.to_datetime(df.Column3, format ='%d/%m/%y', errors="coerce")
# if I make errors="ignore" it doesn't change anything.
In that case my 26/05/20 was correctly converted to 26 May 2020, but I lost all my text data (that's OK) and also the other dates that didn't match my format argument, because they had already been recognized as American-style dates.
My objective is to convert the data in Column3 to the same format so I could apply filters with pandas.
I think there are a couple of possible solutions:
tell pandas not to convert text to date at all (but it is already saved as Date type in the stock file, so will that work?)
somehow ignore the text values and use the date_parser= argument to convert all dates to DD/MM/YY
with the help of pd.to_datetime, convert 26/05/20 to 26 May 2020 and then convert 2020-09-06 00:00:00 to 9 June 2020 (seems to be the simplest one, but the ignore argument doesn't work)
Here's link to small sample file https://easyupload.io/ca5p6w
You can pass a date_parser to read_excel:
dateparser = lambda x: pd.to_datetime(x, dayfirst=True)
pd.read_excel('test.xlsx', parse_dates=['Column3'], date_parser=dateparser)
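To see why dayfirst=True helps here: pd.to_datetime accepts both the remaining strings and the cells Excel already converted to datetime, so the mixed column parses in one pass. A sketch with invented values (note the already-converted cell keeps Excel's month-first reading, which the next answer digs into):
import datetime
import pandas as pd

mixed = pd.Series(["26/05/20", datetime.datetime(2020, 12, 5)])
mixed.apply(lambda x: pd.to_datetime(x, dayfirst=True))
# 0   2020-05-26   <- string parsed day-first
# 1   2020-12-05   <- datetime cell passes through as Excel stored it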
Posting this as an answer, since it's too long for a comment.
The problem originates in Excel. If I open the file in Excel, I see strings that look like dates: 26/05/20, 05/12/2020 and 06/02/2020. Note the difference between the 20 and the 2020. On lines 24 and 48 I also see dates in Column4. This seems to indicate how the Excel file is put together. Is this Excel assembled by copy-paste, or programmatically?
Loading it with just pd.read_excel gives these results for the dates:
26/05/20
2020-12-05 00:00:00
2020-02-06 00:00:00
If I do df["Column3"].apply(type), it gives me:
str
<class 'datetime.datetime'>
<class 'datetime.datetime'>
So in the Excel file these are marked as datetime.
Loading them with df = pd.read_excel(DATA_DIR / "sample.xlsx", dtype={"Column3": str}) changes the type of all to str, but does not change the output.
If you extract the file (an .xlsx is a zip archive) and look at the xml file xl\worksheets\sheet1.xml directly, you see cell C26 as 44170, while C5 is 6, which is a reference to 26/05/20 in xl/sharedStrings.xml.
How do you 'make' this Excel file? This can best be solved at the point where the file is put together.
Workaround
As a workaround, you can convert the dates piecemeal. The two different formats allow this:
format1 = "%d/%m/%y"
format2 = "%Y-%d-%m %H:%M:%S"  # day and month deliberately swapped to undo Excel's misreading
Then you can do pd.to_datetime(dates, format=format1, errors="coerce") to get only the dates in the first format, and NaT for the ones that don't match. Then you use combine_first to fill the voids.
dates = df["Column3"] # of the one imported with dtype={"Column3": str}
dates_parsed = (
pd.to_datetime(dates, format=format1, errors="coerce")
.combine_first(pd.to_datetime(dates, format=format2, errors="coerce"))
.astype(object)
.combine_first(dates)
)
The astype(object) is needed to fill in the empty places with the string values.
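A minimal, self-contained sketch of the workaround, with sample strings invented to mirror the file:
import pandas as pd

dates = pd.Series(["26/05/20", "XXX", "2020-12-05 00:00:00"])
format1 = "%d/%m/%y"
format2 = "%Y-%d-%m %H:%M:%S"

parsed = (
    pd.to_datetime(dates, format=format1, errors="coerce")
    .combine_first(pd.to_datetime(dates, format=format2, errors="coerce"))
    .astype(object)
    .combine_first(dates)
)
# 0    2020-05-26 00:00:00
# 1                    XXX
# 2    2020-05-12 00:00:00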
I think you should first import the file without date parsing, then convert it to date format using the following:
df['column3'] = pd.to_datetime(df['column3'], errors='coerce')
Hope this will work.
Consider this example df.
import pandas as pd
from io import StringIO
mycsv = StringIO("id,date\n1,11/07/2018\n2,11/07/2918\n3,02/01/2019")
df = pd.read_csv(mycsv)
df
id date
0 1 11/07/2018
1 2 11/07/2918
2 3 02/01/2019
Clearly there was a typo there (2918 instead of 2018), but I'd like to parse it as a date nonetheless.
So let's check df.dtypes
id int64
date object
dtype: object
Ok, by default it was read as a string. So I'll explicitly tell read_csv to parse that column as a date.
df = pd.read_csv(mycsv, parse_dates=["date"])
But df.dtypes still shows date was read as a string (object dtype).
If I correct the typo ...
mycsv = StringIO("id,date\n1,11/07/2018\n2,11/07/2018\n3,02/01/2019")
it works
df = pd.read_csv(mycsv, parse_dates=["date"])
df
id date
0 1 2018-11-07
1 2 2018-11-07
2 3 2019-02-01
df.dtypes
id int64
date datetime64[ns]
dtype: object
So clearly it is failing to parse such an unrealistic date (11/07/2918) and then the whole column gets handled as string.
But why can't it properly handle the 11/07/2918 date? And how can I make it correctly parse such a date?
read_csv documentation says that by default it uses dateutil.parser.parse. And when you try by hand:
import dateutil
dateutil.parser.parse("13/07/2918")
It just works. No exception, no error and produces a valid datetime object: datetime.datetime(2918, 7, 13, 0, 0)
Also converting that to numpy.datetime64 works
import dateutil.parser
import numpy as np

toy = dateutil.parser.parse("13/07/2918")
np.datetime64(toy)
It produces a valid and correctly parsed object.
numpy.datetime64('2918-07-13T00:00:00.000000')
Similarly, using pandas' strptime works all right and produces a valid datetime object.
pd.datetime.strptime("11/07/2918", "%d/%m/%Y")
Now, trying that with a custom date parser, just to make sure date-format is right
mycsv = StringIO("id,date\n1,11/07/2018\n2,11/07/2918\n3,02/01/2019")
df = pd.read_csv(
    mycsv,
    parse_dates=["date"],
    date_parser=lambda x: pd.datetime.strptime(x, "%d/%m/%Y"),
)
Again df["date"].dtype is dtype('O')
Ok, so I gave up trying to convince read_csv to properly parse the date and decided to just convert it afterwards.
Either this
df["date"].astype("datetime64")
or this
pd.to_datetime(df["date"])
throws an exception:
OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 2918-07-11 00:00:00
Nothing seems to work.
Any ideas why this happens and how to make it work?
From the docs:
Since pandas represents timestamps in nanosecond resolution, the time span that can
be represented using a 64-bit integer is limited to approximately 584 years:
In [92]: pd.Timestamp.min
Out[92]: Timestamp('1677-09-21 00:12:43.145225')
In [93]: pd.Timestamp.max
Out[93]: Timestamp('2262-04-11 23:47:16.854775807')
How to represent out of bounds times:
https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries-oob
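Two hedged ways around the limit, assuming the raw strings are still in df["date"]: coerce the out-of-range value to NaT, or keep it at daily resolution with Period, which is the out-of-bounds representation the linked docs describe:
import datetime
import pandas as pd

# option 1: out-of-bounds timestamps simply become NaT
pd.to_datetime(df["date"], format="%d/%m/%Y", errors="coerce")

# option 2: Periods are not nanosecond-bound, so the year 2918 is representable
df["date"].apply(lambda s: pd.Period(datetime.datetime.strptime(s, "%d/%m/%Y"), freq="D"))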
I have a dataframe with a date column. I want to turn this date column into my index. When I convert the date column with pd.to_datetime(df['Date'], errors='raise', dayfirst=True) I get:
df1.head()
Out[60]:
Date Open High Low Close Volume Market Cap
0 2018-03-14 0.789569 0.799080 0.676010 0.701902 479149000 30865600000
1 2018-03-13 0.798451 0.805729 0.778471 0.789711 279679000 31213000000
2 2018-12-03 0.832127 0.838328 0.787882 0.801048 355031000 32529500000
3 2018-11-03 0.795765 0.840407 0.775737 0.831122 472972000 31108000000
4 2018-10-03 0.854872 0.860443 0.793736 0.796627 402670000 33418600000
The format of Date is originally the string dd-mm-yyyy, but as you can see, the transformation to datetime messes things up from the 2nd row on. How can I get consistent datetimes?
Edit: I think I solved it. Using the answers below about format, I found out the error was in a package I used to generate the data (cryptocmd). I changed the format to %Y-%m-%d in the utils script of the package and now it seems to work fine.
According to the docs:
dayfirst : boolean, default False
Specify a date parse order if arg is str or its list-likes. If True,
parses dates with the day first, eg 10/11/12 is parsed as 2012-11-10.
Warning: dayfirst=True is not strict, but will prefer to parse with
day first (this is a known bug, based on dateutil behavior).
Emphasis mine. Since you apparently know that your format is "dd-mm-yyyy" you should specify it explicitly:
df['Date'] = pd.to_datetime(df['Date'], format='%d-%m-%Y', errors='raise')
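To see the difference (values invented): dayfirst is only a preference, while an explicit format is strict:
import pandas as pd

# 13 cannot be a month, so dateutil silently swaps despite dayfirst=True
pd.to_datetime("12/13/2018", dayfirst=True)  # Timestamp('2018-12-13 00:00:00')

# an explicit format refuses the swap and raises, or coerces to NaT
pd.to_datetime("12/13/2018", format="%d/%m/%Y", errors="coerce")  # NaT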