Processing data with incorrect dates like 30th of February

Processing data with incorrect dates like 30th of February - python

In trying to process a large number of bank account statements given in CSV format I realized that some of the dates are incorrect (30th of February, which is not possible).
So this snippet fails [1] telling me that some dates are incorrect:
df_from_csv = pd.read_csv( csv_filename
, encoding='cp1252'
, sep=";"
, thousands='.', decimal=","
, dayfirst=True
, parse_dates=['Buchungstag', 'Wertstellung']
)
I could of course pre-process those CSV files and replace the 30th of Feb with 28th of Feb (or whatever the Feb ended in that year).
But is there a way to do this in Pandas, while importing? Like "If this column fails, set it to X"?
Sample row
775945;28.02.2018;30.02.2018;;901;"Zinsen"
As you can see, the date 30.02.2018 is not correct, because there ain't no 30th of Feb. But this seems to be a known problem in Germany. See [2].
[1] Here's the error message:
ValueError: day is out of range for month
[2] https://de.wikipedia.org/wiki/30._Februar

Here is how I solved it:
I added a custom date-parser:
import calendar
def mydateparser(dat_str):
"""Given a date like `30.02.2020` create a correct date `28.02.2020`"""
if dat_str.startswith("30.02"):
(d, m, y) = [int(el) for el in dat_str.split(".")]
# This here will get the first and last days in a given year/month:
(first, last) = calendar.monthrange(y, m)
# Use the correct last day (`last`) in creating a new datestring:
dat_str = f"{last:02d}.{m:02d}.{y}"
return pd.datetime.strptime(dat_str, "%d.%m.%Y")
# and used it in `read_csv`
for csv_filename in glob.glob(f"{path}/*.csv"):
# read csv into DataFrame
df_from_csv = pd.read_csv(csv_filename,
encoding='cp1252',
sep=";",
thousands='.', decimal=",",
dayfirst=True,
parse_dates=['Buchungstag', 'Wertstellung'],
date_parser=mydateparser
)
This allows me to fix those incorrect "30.02.XX" dates and allow pandas to convert those two date columns (['Buchungstag', 'Wertstellung']) into dates, instead of objects.

You could load it all up as text, then run it through a regex to identify non legal dates - which you could apply some adjustment function.
A sample regex you might apply could be:
ok_date_pattern = re.compile(r"^(0[1-9]|[12][0-9]|3[01])[-](0[1-9]|1[012])[-](19|20|99)[0-9]{2}\b")
This finds dates in DD-MM-YYYY format where the DD is constrained to being from 01 to 31 (i.e. a day of 42 would be considered illegal) and MM is constrained to 01 to 12, and YYYY is constrained to being within the range 1900 to 2099.
There are other regexes that go into more depth - such as some of the inventive answers found here
What you then need is a working adjustment function - perhaps that parses the date as best it can and returns a nearest legal date. I'm not aware of anything that does that out of the box, but a function could be written to deal with the most common edge cases I guess.
Then it'd be a case of tagging legal and illegal dates using an appropriate regex, and assigning some date-conversion function to deal with these two classes of dates appropriately.

Related

.Between function not working as expected

I am encountering some issues when using the .between method in Python.
I have a simple dataset consisting of ~59000 records
The date format is in DD/MM/YYYY and I would like to filter the days in the month of April in the year 2014.
psi_df = pd.read_csv('thecsvfile.csv')
psi_west_df = psi_df[['24-hr_psi','west']]
april_records = psi_west_df[psi_west_df['24-hr_psi'].between('1/4/2014','31/4/2014')]
april_records.head(100)
I received the output whereby the date suddenly jumps from 3/4/2014 (3rd April) - 10/4/2014 (10th April). This pattern recurs for every month and for every year up till the year 2020 (the final year of this dataset), which was not my original intention of obtaining the data for the month of April in the year 2014.
As I am still rather new to python, I decided to perform some fixes in Excel instead. I separated the date and the time columns and reran the code with the necessary syntax updated.
psi_df = pd.read_csv('psi_new.csv')
psi_west_df = psi_df[['date','west']]
april_records = psi_west_df[psi_west_df['date'].between('1/4/2014','31/4/2014')]
april_records.head(100)
I still faced the same issue and now, I am totally stumped as to why this is occurring. Am I using the .between method wrongly? Seeking everyone's kind guidance and directions as to why this is occurring. Much appreciated and many thanks everyone.
The csv file that I am using can be obtained from this website:
https://data.gov.sg/dataset/historical-24-hr-psi

The first problem is your date column isn't a date but an object column.
Ensure you column is really a date by using the pandas to_datetime function.
psi_west_df['date'] = pd.to_datetime(psi_west_df['date'], format='%d/%m/%Y')
After the column is really a date column in order for the between function to run with no problems you should give it two date object and not string object like this:
start_day = pd.to_datetime('1/4/2014', format='%d/%m/%Y')
end_day = pd.to_datetime('30/4/2014', format='%d/%m/%Y')
april_records = psi_west_df[psi_west_df['date'].between(start_day, end_day)]
So all together:
psi_df = pd.read_csv('psi_new.csv')
psi_west_df = psi_df[['date','west']]
psi_west_df['date'] = pd.to_datetime(psi_west_df['date'], format='%d/%m/%Y')
start_day = pd.to_datetime('1/4/2014', format='%d/%m/%Y')
end_day = pd.to_datetime('30/4/2014', format='%d/%m/%Y')
april_records = psi_west_df[psi_west_df['date'].between(start_day, end_day)]
april_records.head(100)
Note - this code should work on the data after you change it with excel, meaning you have a separate column for data and time.

Data parsing in pandas, python

I have an excel file with many columns, one of them, 'Column3' is date with some text in it, basically it looks like that:
26/05/20
XXX
YYY
12/05/2020
The data is written in DD/MM/YY format but pandas, just like excel, thinks that 12/05/2020 it's 05 Dec 2020 while it is 12 May 2020. (My windows is set to american date format)
Important note: when I open stock excel file, cells with 12/05/2020 already are Date type, trying to convert it to text it gives me 44170 which will give me wrong date if I just reformat it into DD/MM/YY
I added this line of code:
iport pandas as pd
dateparse = lambda x: pd.datetime.strptime(x,'%d/%m/%y')
df = pd.read_excel("my_file.xlsx", parse_dates=['Column3'], date_parser=dateparse)
But the text in the column generates an error.
ValueError: time data 'XXX' does not match format '%d/%m/%y'
I went a step further and manually removed all text (obviously I can't do it all the time) to see whether it works or nor, but then I got following error
dateparse = lambda x: pd.datetime.strptime(x,'%d/%m/%y')
TypeError: strptime() argument 1 must be str, not datetime.datetime
I also tried this:
df['Column3'] = pd.to_datetime(df.Column3, format ='%d/%m/%y', errors="coerce")
# if I make errors="ignore" it doesn't change anything.
in that case my 26/05/20 was correctly converted to 26 May 2020 but I lost all my text data(it's ok) and other dates which didn't match with my format argument. Because previously they were recognized as American type date.
My objective is to convert the data in Column3 to the same format so I could apply filters with pandas.
I think it's couple solutions:
tell Pandas to not convert text to date at all (but it is already saved as Date type in stock file, will it work?)
somehow ignore text values and use date_parser= method co convert add dates to DD/MM/YY
with help of pd.to_datetime convert 26/05/20 to 26 May 2020 and than convert 2020-09-06 00:00:00 to 9 June 2020 (seems to be the simplest one but ignore argument doesn't work.)
Here's link to small sample file https://easyupload.io/ca5p6w

You can pass a date_parser to read_excel:
dateparser = lambda x: pd.to_datetime(x, dayfirst=True)
pd.read_excel('test.xlsx', date_parser = dateparser)

Posting this as an answer, since it's too long for a comment
The problem originates in Excel. If I open it in Excel, I see 2 strings that look like dates 26/05/20, 05/12/2020 and 06/02/2020. Note the difference between the 20 and 2020 On lines 24 and 48 I see dates in Column4. This seems to indicate the Excel is put together. Is this Excel assembled by copy-paste, or programmatically?
loading it with just pd.read_excel gives these results for the dates:
26/05/20
2020-12-05 00:00:00
2020-02-06 00:00:00
If I do df["Column3"].apply(type)
gives me
str
<class 'datetime.datetime'>
<class 'datetime.datetime'>
So in the Excel file these are marked as datetime.
Loading them with df = pd.read_excel(DATA_DIR / "sample.xlsx", dtype={"Column3": str}) changes the type of all to str, but does not change the output.
If you open the extract the file, and go look at the xml file xl\worksheets\sheet1.xml directly and look for cell C26, you see it as 44170, while C5 is 6, which is a reference to 26/05/20 in xl/sharedStrings.xml
How do you 'make' this Excel file? This can best be solved in how this file is put together.
Workaround
As a workaround, you can convert the dates piecemeal. The different format allows this:
format1 = "%d/%m/%y"
format2 = "%Y-%d-%m %H:%M:%S"
Then you can do pd.to_datetime(dates, format=format1, errors="coerce") to only get the first dates, and NaT for the ones not according to the format. Then you use combine_first to fill the voids.
dates = df["Column3"] # of the one imported with dtype={"Column3": str}
dates_parsed = (
pd.to_datetime(dates, format=format1, errors="coerce")
.combine_first(pd.to_datetime(dates, format=format2, errors="coerce"))
.astype(object)
.combine_first(dates)
)
The astype(object) is needed to fill in the empty places with the string values.

I think, first you should import the file without date parsing then convert it to date format using following:
df['column3']= pd.to_datetime(df['column3'], errors='coerce')
Hope this will work

How do I parse a date without zero padding, in the format (1 or 2-digit year)-(Month abbreviation)?

I need to parse a few dates that are roughly in the format (1 or 2-digit year)-(Month abbreviation), for example:
5-Jun (June 2005)
13-Jan (January 2013)
I tried using strptime with the format %b-%y but it did not consistently produce the desired date. Per the documentation, this is because some years in my dataset are not zero-padded.
Further, when I tested the datetime module (please see below for my code) on the string "5-Jun", I got "2019-06-05", instead of the desired result (June 2005), even if I set yearfirst=True when calling parse.
from dateutil.parser import parse
parsed = parse("5-Jun",yearfirst=True)
print(parsed)

It will be easier if 0 is padded to single digit years, as it can be directly converted to time using format. Regular expression is used here to replace any instance of single digit number with it's '0 padded in front' value. I've used regex from here.
Sample code:
import re
match_condn = r'\b([0-9])\b'
replace_str = r'0\1'
datetime.strptime(re.sub(match_condn, replace_str, '15-Jun'), '%y-%b').strftime("%B %Y")
Output:
June 2015

One approach is to use str.zfill
Ex:
import datetime
d = ["5-Jun", "13-Jan"]
for date in d:
date, month = date.split("-")
date = date.zfill(2)
print(datetime.datetime.strptime(date+"-"+month, "%y-%b").strftime("%B %Y"))
Output:
June 2005
January 2013

Ah. I see from #Rakesh's answer what your data is about. I thought you needed to parse the full name of the month. So you had your two terms %b and %y backwards, but then you had the problem with the single-digit years. I get it now. Here's a much simpler way to get what you want if you can assume your dates are always in one of the two formats you indicate:
inp = "5-Jun"
t = time.strptime(("0" + inp)[-6:], "%y-%b")

Convert 18-digit LDAP/FILETIME timestamps to human readable date

I have exported a list of AD Users out of AD and need to validate their login times.
The output from the powershell script give lastlogin as LDAP/FILE time
EXAMPLE 130305048577611542
I am having trouble converting this to readable time in pandas
Im using the following code:
df['date of login'] = pd.to_datetime(df['FileTime'], unit='ns')
The column FileTime contains time formatted like the EXAMPLE above.
Im getting the following output in my new column date of login
EXAMPLE 1974-02-17 03:50:48.577611542
I know this is being parsed incorrectly as when i input this date time on a online converter i get this output
EXAMPLE:
Epoch/Unix time: 1386031258
GMT: Tuesday, December 3, 2013 12:40:58 AM
Your time zone: Monday, December 2, 2013 4:40:58 PM GMT-08:00
Anyone have an idea of what occuring here why are all my dates in the 1970'

I know this answer is very late to the party, but for anyone else looking in the future.
The 18-digit Active Directory timestamps (LDAP), also named 'Windows NT time format','Win32 FILETIME or SYSTEMTIME' or NTFS file time. These are used in Microsoft Active Directory for pwdLastSet, accountExpires, LastLogon, LastLogonTimestamp and LastPwdSet. The timestamp is the number of 100-nanoseconds intervals (1 nanosecond = one billionth of a second) since Jan 1, 1601 UTC.
Therefore, 130305048577611542 does indeed relate to December 3, 2013.
When putting this value through the date time function in Python, it is truncating the value to nine digits. Therefore the timestamp becomes 130305048 and goes from 1.1.1970 which does result in a 1974 date!
In order to get the correct Unix timestamp you need to do:
(130305048577611542 / 10000000) - 11644473600

Here's a solution I did in Python that worked well for me:
import datetime
def ad_timestamp(timestamp):
if timestamp != 0:
return datetime.datetime(1601, 1, 1) + datetime.timedelta(seconds=timestamp/10000000)
return np.nan
So then if you need to convert a Pandas column:
df.lastLogonTimestamp = df.lastLogonTimestamp.fillna(0).apply(ad_timestamp)
Note: I needed to use fillna before using apply. Also, since I filled with 0's, I checked for that in the conversion function about, if timestamp != 0. Hope that makes sense. It's extra stuff but you may need it to convert the column in question.

I've been stuck on this for couple of days. But now i am ready to share really working solution in more easy to use form:
import datetime
timestamp = 132375402928051110
value = datetime.datetime (1601, 1, 1) +
datetime.timedelta(seconds=timestamp/10000000) ### combine str 3 and 4
print(value.strftime('%Y-%m-%d %H:%M:%S'))

How to use ``xlrd.xldate_as_tuple()``

I am not quite sure how to use the following function:
xlrd.xldate_as_tuple
for the following data
xldate:39274.0
xldate:39839.0
Could someone please give me an example on usage of the function for the data?

Quoth the documentation:
Dates in Excel spreadsheets
In reality, there are no such things.
What you have are floating point
numbers and pious hope. There are
several problems with Excel dates:
(1) Dates are not stored as a separate
data type; they are stored as floating
point numbers and you have to rely on
(a) the "number format" applied to
them in Excel and/or (b) knowing which
cells are supposed to have dates in
them. This module helps with (a) by
inspecting the format that has been
applied to each number cell; if it
appears to be a date format, the cell
is classified as a date rather than a
number. Feedback on this feature,
especially from non-English-speaking
locales, would be appreciated.
(2) Excel for Windows stores dates by
default as the number of days (or
fraction thereof) since
1899-12-31T00:00:00. Excel for
Macintosh uses a default start date of
1904-01-01T00:00:00. The date system
can be changed in Excel on a
per-workbook basis (for example: Tools
-> Options -> Calculation, tick the "1904 date system" box). This is of
course a bad idea if there are already
dates in the workbook. There is no
good reason to change it even if there
are no dates in the workbook. Which
date system is in use is recorded in
the workbook. A workbook transported
from Windows to Macintosh (or vice
versa) will work correctly with the
host Excel. When using this module's
xldate_as_tuple function to convert
numbers from a workbook, you must use
the datemode attribute of the Book
object. If you guess, or make a
judgement depending on where you
believe the workbook was created, you
run the risk of being 1462 days out of
kilter.
Reference:
http://support.microsoft.com/default.aspx?scid=KB;EN-US;q180162
(3) The Excel implementation of the
Windows-default 1900-based date system
works on the incorrect premise that
1900 was a leap year. It interprets
the number 60 as meaning 1900-02-29,
which is not a valid date.
Consequently any number less than 61
is ambiguous. Example: is 59 the
result of 1900-02-28 entered directly,
or is it 1900-03-01 minus 2 days? The
OpenOffice.org Calc program "corrects"
the Microsoft problem; entering
1900-02-27 causes the number 59 to be
stored. Save as an XLS file, then open
the file with Excel -- you'll see
1900-02-28 displayed.
Reference:
http://support.microsoft.com/default.aspx?scid=kb;en-us;214326
which I quote here because the answer to your question is likely to be wrong unless you take that into account.
So to put this into code would be something like:
import datetime
import xlrd
book = xlrd.open_workbook("myfile.xls")
sheet = book.sheet_by_index(0)
cell = sheet.cell(5, 19) # type, <class 'xlrd.sheet.Cell'>
if sheet.cell(5, 19).ctype == 3: # 3 means 'xldate' , 1 means 'text'
ms_date_number = sheet.cell_value(5, 19) # Correct option 1
ms_date_number = sheet.cell(5, 19).value # Correct option 2
year, month, day, hour, minute, second = xlrd.xldate_as_tuple(ms_date_number,
book.datemode)
py_date = datetime.datetime(year, month, day, hour, minute, nearest_second)
which gives you a Python datetime in py_date that you can do useful operations upon using the standard datetime module.
I've never used xlrd, and my example is completely made up, but if there is a myfile.xls and it really has a date number in cell F20, and you aren't too fussy about precision as noted above, this code should work.

The documentation of the function (minus the list of possible exceptions):
xldate_as_tuple(xldate, datemode) [#]
Convert an Excel number (presumed to represent a date, a datetime or a
time) into a tuple suitable for feeding to datetime or mx.DateTime
constructors.
xldate
The Excel number
datemode
0: 1900-based, 1: 1904-based.
WARNING: when using this function to interpret the contents of
a workbook, you should pass in the Book.datemode attribute of that
workbook. Whether the workbook has ever been anywhere near a Macintosh is
irrelevant.
Returns:
Gregorian (year, month, day, hour, minute, nearest_second).
As the author of xlrd, I'm interested in knowing how the documentation can be made better. Could you please answer these:
Did you read the general section on dates (quoted by #msw)?
Did you read the above specific documentation of the function?
Can you suggest any improvement in the documentation?
Did you actually try running the function, like this:
>>> import xlrd
>>> xlrd.xldate_as_tuple(39274.0, 0)
(2007, 7, 11, 0, 0, 0)
>>> xlrd.xldate_as_tuple(39274.0 - 1.0/60/60/24, 0)
(2007, 7, 10, 23, 59, 59)
>>>

Use it as such:
number = 39274.0
book_datemode = my_book.datemode
year, month, day, hour, minute, second = xldate_as_tuple(number, book_datemode)

import datetime as dt
import xlrd
log_dir = 'C:\\Users\\'
infile = 'myfile.xls'
book = xlrd.open_workbook(log_dir+infile)
sheet1 = book.sheet_by_index(0)
date_column_idx = 1
## iterate through the sheet to locate the date columns
for rownum in range(sheet1.nrows):
rows = sheet1.row_values(rownum)
## check if the cell is a date; continue otherwise
if sheet1.cell(rownum, date_column_idx).ctype != 3 :
continue
install_dt_tuple = xlrd.xldate_as_tuple((rows[date_column_idx ]), book.datemode)
## the "*date_tuple" will automatically unpack the tuple. Thanks mfitzp :-)
date = dt.datetime(*date_tuple)

Here's what I use to automatically convert dates:
cell = sheet.cell(row, col)
value = cell.value
if cell.ctype == 3: # xldate
value = datetime.datetime(*xlrd.xldate_as_tuple(value, workbook.datemode))

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Processing data with incorrect dates like 30th of February - python

Related

.Between function not working as expected

Data parsing in pandas, python

How do I parse a date without zero padding, in the format (1 or 2-digit year)-(Month abbreviation)?

Convert 18-digit LDAP/FILETIME timestamps to human readable date

How to use ``xlrd.xldate_as_tuple()``

Categories

Resources