How to remove garbage values from dates extracted with regex in Python

Goal: extract dates from medical records (stored in pandas Series, dates are in all possible formats)
For numerical dates I used:
str.extractall(r'((?:\b\d{1,2}[/]){1,2}(?:(?:\d{2}\b)|\b\d{4}\b))')
Problem:
Input text1:
"(5/11/85) Crt-1.96, BUN-26; AST/ALT-16/22; Independent
Output1: 5/11/85 (as wished) but also: 16/22
Input text2:
[text...] (7/11/77) CBC: 4.9/36/308 Pertinent [...]:
Output2: 7/11/77 (as wished) but also 9/36
The second case is especially hard: parsing 9/36 as a date gives September 2036, so it can't be filtered out that way.
Adding [^-] to the pattern makes it even worse.
The dates are everywhere in the text, like:
[...] has also taken diet pills (last episode in Feb 1993) but [...]
Feb 1993 etc. wasn't a problem.

You should specify what "all possible formats" means. In your example you show only one format. Could "JAN-02-2016", "01/02/2016" and "02/01/2016" all be present? European and US date formats? etc.?
In your example, though, it looks like the dates are always at the start of the line and surrounded by parentheses, which makes it fairly straightforward, e.g.:
^(\d+/\d+/\d+)|^(\d+/\d+)
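If the dates really are at the beginning of the text and wrapped in parentheses, here is a minimal sketch of that idea (the Series below is just a stand-in for your data); anchoring on the opening parenthesis keeps the lab values later in the line out of the results:

import pandas as pd

text = pd.Series([
    "(5/11/85) Crt-1.96, BUN-26; AST/ALT-16/22; Independent",
    "(7/11/77) CBC: 4.9/36/308 Pertinent",
])

# Anchor on "(" at the start of the string so values such as 16/22 or 4.9/36
# further along the line are never considered.
dates = text.str.extract(r'^\((\d{1,2}/\d{1,2}/(?:\d{2}|\d{4}))\)', expand=False)
print(dates.tolist())  # ['5/11/85', '7/11/77']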

The main rule when working with regexes is: know your data. Compose as accurate a regex as you can.
Then I would suggest parsing such crude date strings into actual, full-fledged date objects. This serves two main goals: first, you filter out false regex matches; second, you can now work with your dates in a much more convenient way using the date object's methods rather than comparing plain text strings. For example, you can access a date's day, month or year, compare it with a desired value, and filter out dates based on that comparison.
For parsing dates I would recommend one of the sophisticated date-parsing libraries, such as dateutil or dateparser, which handle a lot of tricky details for you, for free.
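For example, here is a minimal sketch using dateutil on top of your extractall pattern (the sample Series and the cut-off year are purely illustrative assumptions):

import pandas as pd
from dateutil import parser

records = pd.Series([
    "(5/11/85) Crt-1.96, BUN-26; AST/ALT-16/22; Independent",
    "(7/11/77) CBC: 4.9/36/308 Pertinent",
])

# Raw regex hits, including garbage such as 16/22 and 9/36.
candidates = records.str.extractall(
    r'((?:\b\d{1,2}[/]){1,2}(?:(?:\d{2}\b)|\b\d{4}\b))')[0]

def to_date(text):
    try:
        return parser.parse(text)   # US-style month-first by default
    except (ValueError, OverflowError):
        return None

parsed = pd.to_datetime(candidates.map(to_date))  # unparseable hits become NaT
# Garbage like 9/36 parses as September 2036, so also filter on a plausible
# range for these records (the year-2000 cut-off here is just an example).
dates = parsed[parsed.notna() & (parsed < pd.Timestamp("2000-01-01"))]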

Related

Pandas df.loc with regex

I'm working with a data set consisting of several csv files of nearly the same form. Each csv describes a particular date, and labels the data by state/province. However, the format of one of the column headers in the data set was altered from Province/State to Province_State, so that all csv's created before a certain date use the first format and all csv's created after that date use the second format.
I'm trying to sum up all the entries corresponding to a particular state. At present, the code I'm working with is as follows:
daily_data.loc[daily_data[areaLabel] == location].sum()
where daily_data is the dataframe containing the csv data, location is the name of the state I'm looking for, and areaLabel is a variable storing either 'Province/State' or 'Province_State' depending on the result of a date check. I would like to eliminate the date check by e.g. conditioning on a regular expression like Province(/|_)State, but I'm having a lot of trouble finding a way to index into a pandas dataframe by regular expression. Is this doable (and in a way that would make the code more elegant rather than less elegant)? If so, I'd appreciate it if someone could point me in the right direction.
Use filter to get the columns that match your regex
>>> df.filter(regex="Province(/|_)State").columns[0]
'Province/State'
Then use this to select only rows that match your location:
df[df[df.filter(regex="Province(/|_)State").columns[0]]==location].sum()
This however assumes that there are no other columns that would match the regex.
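A small self-contained sketch of the same idea (the frames, column values and location below are made up for illustration):

import pandas as pd

old_style = pd.DataFrame({"Province/State": ["Ontario", "Quebec"], "cases": [3, 5]})
new_style = pd.DataFrame({"Province_State": ["Ontario", "Quebec"], "cases": [7, 2]})

def state_total(daily_data, location):
    # filter() matches column labels against the regex, so either header works.
    area_label = daily_data.filter(regex="Province(/|_)State").columns[0]
    return daily_data.loc[daily_data[area_label] == location, "cases"].sum()

print(state_total(old_style, "Ontario"))  # 3
print(state_total(new_style, "Ontario"))  # 7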

Order_By custom date in peewee for SQLite

I made a huge mistake building up a database, but it works perfectly except for one feature. Changing the program in all the places where it would need to be changed for that feature to work would be a titanic job of weeks, so let's hope this workaround is possible.
The issue: I've stored data in a SQLite database as "dd/mm/yyyy" TextField format instead of DateField.
The need: I need to sort by date on a union query, to get the last number of records in that union following my custom date format. They come from different tables, so I can't just use rowid or anything like that to get the last ones; I need to do it by date. I also can't change the data already stored in the database, because invoices have already been created with that format ("dd/mm/yyyy" is the default date format in my country).
This is the query that captures data:
records = []
limited_to = 25
union = (facturas | contado | albaranes | presupuestos)
for record in (union
               .select_from(union.c.idunica, union.c.fecha, union.c.codigo,
                            union.c.tipo, union.c.clienterazonsocial,
                            union.c.totalimporte, union.c.pagada,
                            union.c.contabilizar, union.c.concepto1,
                            union.c.cantidad1, union.c.precio1,
                            union.c.observaciones)
               .order_by(union.c.fecha.desc())  # TODO this is what I need to change.
               .limit(limited_to)
               .tuples()):
    records.append(record)
Now to complicate things even more, the union is already created by a really complex where clause for each database before it's transformed into a union query.
So my only hope is: Is there a way to make order_by follow a custom date format instead?
To clarify, this is the simple transformation that I'd need the order_by clause to follow, because I assume SQLite wouldn't have issues sorting if this were the date format:
def reverse_date(date: str) -> str:
    """Reverse the date order from dd/mm/yyyy into yyyy-mm-dd."""
    dd, mm, yyyy = date.split("/")
    return f"{yyyy}-{mm}-{dd}"
Note: I've left a lot of code out because I think it's unnecessary. This is the minimum amount of code needed to understand the problem. Let me know if you need more details.
Update: Trying this workaround, it seems to work fine. Need more testing but it's promising. If someone ever faces the same issue, here you go:
.order_by(sqlfunc.Substr(union.c.fecha, 7)
          .concat('/')
          .concat(sqlfunc.Substr(union.c.fecha, 4, 2))
          .concat('/')
          .concat(sqlfunc.Substr(union.c.fecha, 1, 2))
          .desc())
Happy end of 2020!
As you pointed out, if you want the dates to sort properly, they need to be in yyyy-mm-dd format, which is the text format you should always use in SQLite (or something with the same year, month, day order).
You might be able to do that rearrangement with a regular expression, using re.sub:
re.sub(r'(\d{2})/(\d{2})/(\d{4})', r'\3-\2-\1', fecha)
Here the regex captures the day, month, and year components in separate capture groups, and the replacement emits those components in the correct order for sorting. Bear in mind, though, that re.sub operates on Python strings, not on peewee column expressions, so it only helps once the rows have been fetched; inside the order_by clause itself you still need SQL string functions, such as the Substr/concat chain from your update.
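For reference, a tiny Python-side demonstration of that substitution (the sample values are made up), for instance as a sort key if you ever need to order rows after fetching them:

import re

def sortable(fecha):
    """Turn 'dd/mm/yyyy' into 'yyyy-mm-dd' so plain string ordering works."""
    return re.sub(r'(\d{2})/(\d{2})/(\d{4})', r'\3-\2-\1', fecha)

fechas = ["31/12/2020", "01/01/2019", "15/06/2020"]
print(sorted(fechas, key=sortable, reverse=True))
# ['31/12/2020', '15/06/2020', '01/01/2019']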

Need to somehow change the formatting of a column of data from text to date

I've been handed a few sheets of data that I need to compare and contrast for a Uni project.
I have two sheets where the date/time format is 01/08/2020 00:01, which is perfect for what I need to do. The third sheet has the date formatted (for some inexplicable reason) as 2020-08-01T00:01:00.0000000.
It's stored as plain text instead of a date format, there are thousands of rows, and they don't follow the same increments, so I can't just start at the top and drag down, and obviously I can't do it manually. I'm wondering if there's some Python code I could use that recognises the year, month and day, reformats them accordingly, replaces the T with a space, and gets rid of the seven zeros at the end.
You can use a formula to extract the dates you need. Assuming your strings start in cell A1, you can use this formula:
=DATE(LEFT(A1,4),MID(A1,6,2),MID(A1,9,2))+TIME(MID(A1,12,2),MID(A1,15,2),MID(A1,18,2))
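If you would rather do it in Python, here is a minimal pandas sketch (the file and column names are hypothetical); pandas parses the ISO 8601 text directly, including the seven fractional digits:

import pandas as pd

df = pd.read_excel("third_sheet.xlsx")    # hypothetical file name
parsed = pd.to_datetime(df["timestamp"])  # "2020-08-01T00:01:00.0000000" parses as ISO 8601
df["timestamp"] = parsed.dt.strftime("%d/%m/%Y %H:%M")  # match the other sheets: 01/08/2020 00:01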

Parsing mystery date format [duplicate]

I'm importing data from an Excel spreadsheet into Python. My dates are coming through in a bizarre format with which I am not familiar and which I cannot parse.
in excel: (7/31/2015)
42216
after I import it:
u'/Date(1438318800000-0500)/'
Two questions:
(1) What format is this and how might I parse it into something more intuitive and easier to read?
(2) Is there a robust, swiss-army-knife-esque way to convert dates without specifying the input format?
Timezones necessarily make this more complex, so let's ignore them...
As @SteJ remarked, what you get is (close to) the time in seconds since 1 January 1970. Here's a Wikipedia article on how that's normally used. Oddly, the string you get seems to have a timezone (-0500, EST in North America) attached. That makes no sense if it's proper UNIX time (which is always in UTC), but we'll pass on that...
Assuming you can get it reduced to a number (sans timezone), the conversion into something sensible in Python is really straightforward (note the reduction in precision; your original number is the number of milliseconds since the epoch, rather than the standard number of seconds since the epoch):
from datetime import datetime
time_stamp = 1438318800
time_stamp_dt = datetime.fromtimestamp(time_stamp)
You can then get time_stamp_dt into any format you think best using strftime, e.g., time_stamp_dt.strftime('%m/%d/%Y'), which pretty much gives you what you started with.
Now, assuming that the format of the string you provided is fairly regular, we can extract the relevant time quite simply like this:
s = '/Date(1438318800000-0500)/'
time_stamp = int(s[6:16])
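Putting the two steps together, here is a small sketch that pulls the milliseconds out with a regex instead of fixed slicing (the variable names are just illustrative):

import re
from datetime import datetime

s = '/Date(1438318800000-0500)/'

match = re.search(r'/Date\((\d+)([+-]\d{4})?\)/', s)
millis = int(match.group(1))                           # 1438318800000
time_stamp_dt = datetime.fromtimestamp(millis / 1000)  # local time
print(time_stamp_dt.strftime('%m/%d/%Y'))              # e.g. 07/31/2015, depending on your timezone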

Determining if an xlsx cell is date formatted for Excel 2007 spreadsheets

I'm working with some code that reads data from xlsx files by parsing the XML. It is all pretty straightforward, with the exception of date cells.
Dates are stored as integers and have an "s" attribute that is an index into the stylesheet, which can be used to get a date formatting string. Here are some examples from a previous stackoverflow question that is linked below:
19 = 'h:mm:ss AM/PM';
20 = 'h:mm';
21 = 'h:mm:ss';
22 = 'm/d/yy h:mm';
These are the built-in date formatting strings from the OOXML standard; however, it seems like Excel tends to use custom format strings instead of the built-ins. Here is an example format from an Excel 2007 spreadsheet; a numFmtId of 164 or greater indicates a custom format.
<numFmt formatCode="MM/DD/YY" numFmtId="165"/>
Determining if a cell should be formatted as a date is difficult because the only indicator I can find is the formatCode. This one is obviously a date, but cells could be formatted any number of ways. My initial attempt is to look for Ms, Ds, and Ys in the formatCode, but that seems problematic.
Has anybody had any luck with this problem? It seems like the standard excel reading libraries are lacking in xlsx support at this time. I've read through the standards and have dug through a lot of xlsx files without much luck.
The best information seems to come from this stackoverflow question:
what indicates an office open xml cell contains a date time value
Thanks!
Dates are stored as integers
In the Excel data model, there is really no such thing as an integer. Everything is a float. Dates and datetimes are floats, representing days and a fraction since a variable epoch. Times are fractions of a day.
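As an aside, the 42216 shown in the earlier question on this page is exactly such a value: it is the day count for 7/31/2015. A quick sketch of the usual conversion, assuming the default 1900 date system:

from datetime import datetime, timedelta

# In the 1900 date system, serial 1 is 1900-01-01; because of the inherited
# Lotus 1-2-3 leap-year bug, 1899-12-30 works as the effective epoch.
serial = 42216
print(datetime(1899, 12, 30) + timedelta(days=serial))  # 2015-07-31 00:00:00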
It seems like the standard excel reading libraries are lacking in xlsx support at this time.
google("xlsxrd"). To keep up to date, join the python-excel group.
Edit: I see that you have already asked a question there. If you had asked a question there as specific as this one, or responded to my request for clarification, you would have had this info over two weeks ago.
Have a look at the xlrd documentation. Near the front there is a discussion of Excel dates. All of it applies to Excel 2007 as well as earlier versions. In particular: it is necessary to parse custom formats, and it is necessary to have a table of the "standard" format indexes that are date formats. The "standard" formats listed in some places don't include the formats used in CJK locales.
Options for you:
(1) Borrow from the xlrd source code, including the xldate_as_tuple function.
(2) Option (1) + Get the xlsxrd bolt-on kit and borrow from its source code.
(3) [Recommended] Get the xlsxrd bolt-on kit and use it ... you get a set of APIs that operate across Excel versions 2.0 to 2007 and Python versions 2.1 to 2.7.
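As a rough illustration of option (3), here is a minimal sketch using the xlrd API (assuming a version or bolt-on that can open xlsx files; the file name is hypothetical). xlrd does the number-format analysis itself and flags date cells:

import xlrd

book = xlrd.open_workbook("example.xlsx")  # hypothetical file
sheet = book.sheet_by_index(0)

for rowx in range(sheet.nrows):
    for colx in range(sheet.ncols):
        if sheet.cell_type(rowx, colx) == xlrd.XL_CELL_DATE:
            # The cell value is a float: days (and fraction) since the workbook's epoch.
            y, m, d, hh, mm, ss = xlrd.xldate_as_tuple(
                sheet.cell_value(rowx, colx), book.datemode)
            print(rowx, colx, (y, m, d, hh, mm, ss))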
It isn't enough simply to look for Ms, Ds, and Ys in the number format code
[Red]#,##0 ;[Yellow](#,##0)
is a perfectly valid number format, which contains both Y and D, but isn't a date format. I specifically test for any of the standard date/time formatting characters ('y', 'm', 'd', 'H', 'i', 's') that are outside of square braces ('[' ']').
Even then, I was finding that a few false positives were slipping through, mainly associated with accounting and currency formats. Because these typically begin with either an underscore ('_') or a zero followed by a space ('0 ') (neither of which I've ever encountered in a date format), I explicitly filter these values out.
A part of my (PHP) code for determining if a format mask is a date or not:
private static $possibleDateFormatCharacters = 'ymdHis';

// Typically number, currency or accounting (or occasionally fraction) formats
if ((substr($pFormatCode, 0, 1) == '_') || (substr($pFormatCode, 0, 2) == '0 ')) {
    return false;
}
// Try checking for any of the date formatting characters that don't appear within square braces
if (preg_match('/(^|\])[^\[]*['.self::$possibleDateFormatCharacters.']/i', $pFormatCode)) {
    return true;
}
// No date...
return false;
I'm sure there may still be exceptions that I'm missing, but (if so) they are probably extreme cases.
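Since the rest of this page is Python, here is a rough port of the same heuristic (the function name is made up; the behaviour mirrors the PHP logic above):

import re

# Mirrors $possibleDateFormatCharacters above ('i' kept for parity with the PHP code).
POSSIBLE_DATE_FORMAT_CHARACTERS = 'ymdHis'

def looks_like_date_format(format_code):
    # Typically number, currency or accounting (or occasionally fraction) formats
    if format_code.startswith('_') or format_code.startswith('0 '):
        return False
    # Any date formatting character that does not sit inside square braces
    pattern = r'(^|\])[^\[]*[' + POSSIBLE_DATE_FORMAT_CHARACTERS + r']'
    return re.search(pattern, format_code, flags=re.IGNORECASE) is not None

print(looks_like_date_format('MM/DD/YY'))                     # True
print(looks_like_date_format('[Red]#,##0 ;[Yellow](#,##0)'))  # False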
