retrieving a date from a string - python

I'm trying to retrieve a date from a string. The problem is that the pattern of this date varies a lot (the string comes from an OCR reading). These are the patterns I need to identify:
11/11/1111 (I can get this one already)
11-11-1111 (I can get this one already)
11 11 1111 (I can get this one already)
11- 11- 1111
11 11 1111
11-11 1111
23- 10-17
9 06- 17
So far, the regex I have is a slight adaptation of a Stack Overflow answer (it now allows spaces, instead of just - or /, to separate the numbers):
match_date=re.search(r'(?:(?:31(\/|-|\.| )(?:0?[13578]|1[02]))\1|(?:(?:29|30)(\/|-|\.| )(?:0?[1,3-9]|1[0-2])\2))(?:(?:1[6-9]|[2-9]\d)?\d{2})$|^(?:29(\/|-|\.)0?2\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$|^(?:0?[1-9]|1\d|2[0-8])(\/|-|\.| )(?:(?:0?[1-9])|(?:1[0-2]))\4(?:(?:1[6-9]|[2-9]\d)?\d{2})',line)
Is there a way of building a regex for such a "fluid" date structure?

Regex: \b(?:\d{1,2}[- /]\s?){2}(?:\d{4}|\d{2})\b or ^(?:\d{1,2}[- /]\s?){2}(?:\d{4}|\d{2})$

You could go for
\b\d{1,2}[- /]+\d{1,2}[- /]+\d{2,4}\b
See a demo on regex101.com.
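As a quick sanity check, the pattern above can be run against the sample strings from the question. This is only a minimal sketch: the names DATE_RE and samples, and the surrounding "Invoice dated ..." text, are made up for illustration.

```python
import re

# Pattern from the answer above: day, month and year separated by any
# run of '-', '/' or spaces.
DATE_RE = re.compile(r"\b\d{1,2}[- /]+\d{1,2}[- /]+\d{2,4}\b")

samples = [
    "11/11/1111",
    "11-11-1111",
    "11 11 1111",
    "11- 11- 1111",
    "11-11 1111",
    "23- 10-17",
    "9 06- 17",
]

for sample in samples:
    # Hypothetical OCR line with the date embedded in other text.
    line = f"Invoice dated {sample} received"
    match = DATE_RE.search(line)
    print(repr(sample), "->", match.group() if match else None)
```

Note that \d{2,4} also accepts three-digit years, and that a stray number right before the date (for example "5 23- 10-17") can shift the match, so the result still needs validating.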

I know a regex is arguably the better answer, because one line can match all the possibilities, but I prefer converting to datetime by trying each format in turn:
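from datetime import datetime
string = "11- 11- 1111"
datetime_object = None
for fmt in ('%Y-%m-%d', '%d- %m- %Y', '%d %m %Y', '%d- %m- %y'):
    try:
        # use the loop variable, not a hard-coded format
        datetime_object = datetime.strptime(string, fmt)
        break  # stop at the first format that parses
    except ValueError:
        continue  # this format did not match; try the next one
print(datetime_object)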
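The two approaches can also be combined: let a loose regex find the three numeric fields, then hand them to strptime so that impossible dates are rejected. The sketch below is one possible way to do that, not a definitive implementation; extract_date and DATE_RE are hypothetical names, and it assumes day-month-year order as in the question.

```python
import re
from datetime import datetime

# Capture day, month and year separately; separators may be any run of
# '-', '/' or spaces. The year is either four or two digits.
DATE_RE = re.compile(r"\b(\d{1,2})[- /]+(\d{1,2})[- /]+(\d{4}|\d{2})\b")


def extract_date(text):
    """Return the first valid date found in text, or None."""
    match = DATE_RE.search(text)
    if match is None:
        return None
    day, month, year = match.groups()
    # Pick the year directive based on how many digits were captured.
    fmt = "%d-%m-%Y" if len(year) == 4 else "%d-%m-%y"
    try:
        return datetime.strptime(f"{day}-{month}-{year}", fmt).date()
    except ValueError:
        # The fields looked like a date but are not one (e.g. 31 02 2020).
        return None


print(extract_date("11- 11- 1111"))
print(extract_date("scanned on 9 06- 17 by OCR"))
print(extract_date("31 02 2020"))
```

Two-digit years are interpreted by strptime's %y rule (00-68 map to 2000-2068, 69-99 to 1969-1999), which may or may not suit the documents being scanned.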

Related

Regex for date of birth with maximum age

I am looking for a regular expression in Python which matches a date of birth in the given format: dd.mm.YYYY
For example:
31.12.1999 (31st of December)
02.07.2021
I have provided more examples in the demo.
Dates older than 01.01.1920 should not match!
Try:
^(?:0[1-9]|[12]\d|3[01])\.(?:0[1-9]|1[0-2])\.(?:19[2-9]\d|[2-9]\d{3})$
See Regex Demo
Be aware that this will still accept invalid dates such as 31.02.2021: the pattern does not know how many days each month has, and February is especially awkward because a regex cannot easily work out which years are leap years.
This will also allow future dates such as 01.01.3099 (you do want this to work for future dates, no?).
Update
You really should be using the datetime class from the datetime module and, if you want to insist that the day and month fields contain two digits, a regex just to enforce the format:
import re
from datetime import datetime, date
validated = False # assume not validated
s = '31.03.2019'
m = re.fullmatch(r'\d{2}\.\d{2}\.\d{4}', s)
if m:
    # we have ensured the correct number of digits:
    try:
        d = datetime.strptime(s, '%d.%m.%Y').date()
        if d >= date(1920, 1, 1):
            validated = True
    except ValueError:
        pass
print(validated)
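As I said, it can be done with a very convoluted regex. However, I do not actually recommend using this; I just had fun writing it as a challenge. You should in reality use a very permissive regex and validate the ranges in code. Note that the pattern is written in verbose mode (re.VERBOSE in Python), so the comments and line breaks are ignored.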
# Easy dates, those <= 28th, valid for all months/years.
(0[1-9]|1[0-9]|2[0-8])\.(0[1-9]|1[0-2])\.(19[2-9][0-9]|2[0-9][0-9][0-9])
|
# Validate the 29th of February for 1920-1999.
29\.02\.19([3579][26]|[2468][048])
|
# Validate the 29th of February for 2000-2999.
29\.02\.((2[0-9])(0[48]|[13579][26]|[2468][048])|2000|2400|2800)
|
# Validate 29th and 30th.
(29|30)\.(01|0[3-9]|1[0-2])\.(19[2-9][0-9]|2[0-9][0-9][0-9])
|
# Validate 31st.
31\.(01|03|05|07|08|10|12)\.(19[2-9][0-9]|2[0-9][0-9][0-9])
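For anyone who wants to try the commented pattern in Python, here is a minimal sketch. It assumes the pattern is compiled with re.VERBOSE (so the comments and whitespace are ignored) and applied with fullmatch instead of explicit anchors; the name DOB_RE is made up for illustration.

```python
import re

DOB_RE = re.compile(r"""
    # Easy dates, those <= 28th, valid for all months/years.
    (0[1-9]|1[0-9]|2[0-8])\.(0[1-9]|1[0-2])\.(19[2-9][0-9]|2[0-9][0-9][0-9])
    |
    # 29 February for leap years in 1920-1999.
    29\.02\.19([3579][26]|[2468][048])
    |
    # 29 February for leap years in 2000-2999.
    29\.02\.((2[0-9])(0[48]|[13579][26]|[2468][048])|2000|2400|2800)
    |
    # 29th and 30th of every month except February.
    (29|30)\.(01|0[3-9]|1[0-2])\.(19[2-9][0-9]|2[0-9][0-9][0-9])
    |
    # 31st of the seven long months.
    31\.(01|03|05|07|08|10|12)\.(19[2-9][0-9]|2[0-9][0-9][0-9])
""", re.VERBOSE)

for candidate in ("31.12.1999", "29.02.2020", "29.02.2021",
                  "31.02.2021", "01.01.1919"):
    # fullmatch only succeeds if one alternative covers the whole string.
    print(candidate, bool(DOB_RE.fullmatch(candidate)))
```

Using fullmatch avoids having to wrap the whole alternation in ^(?:...)$.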
\d{2}\.\d{2}\.\d{4}
Validating the value of the dates should be done at the application level.

Extracting time with regex from a string

I have scraped some data, and some of the opening hours are in 12-hour format. The string looks like this: Mon - Fri:,10:00 am - 7:00 pm. I need to extract the times 10:00 am and 7:00 pm and convert them to 24-hour format. The final string I want is:
Mon - Fri:,10:00 - 19:00
Any help would be appreciated in this regard. I have tried the following:
import re
txt = 'Mon - Fri:,10:00 am - 7:00 pm'
data = re.findall(r'\s(\d{2}\:\d{2}\s?(?:AM|PM|am|pm))', txt)
print(data)
But this regex and any other that I tried to use didn't do the task.
Your regex enforces a whitespace before the leading digit which prevents ,10:00 am from matching and requires two digits before the colon which fails to match 7:00 pm. r"(?i)(\d?\d:\d\d (?:a|p)m)" seems like the most precise option.
After that, parse the match using datetime.strptime and convert it to 24-hour ("military") time using the "%H:%M" format string. Invalid times like 10:67 will raise a ValueError (if you anticipate strings that should be ignored, adjust the regex to match only valid times).
import re
from datetime import datetime
def to_military_time(x):
    return datetime.strptime(x.group(), "%I:%M %p").strftime("%H:%M")
txt = "Mon - Fri:,10:00 am - 7:00 pm"
data = re.sub(r"(?i)(\d?\d:\d\d (?:a|p)m)", to_military_time, txt)
print(data) # => Mon - Fri:,10:00 - 19:00
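If raising on a bad time such as 10:67 is not what you want, one option is to catch the ValueError inside the replacement callback and leave that match untouched. The following is a sketch under that assumption; TIME_RE, to_24_hour and convert_times are hypothetical names, and it relies on the default (English) locale for %p.

```python
import re
from datetime import datetime

# One or two hour digits, two minute digits, optional space, then am/pm.
TIME_RE = re.compile(r"\b(\d{1,2}:\d{2})\s?([ap]m)\b", re.IGNORECASE)


def to_24_hour(match):
    """Replacement callback: convert one 12-hour time to HH:MM."""
    clock, meridiem = match.groups()
    try:
        parsed = datetime.strptime(f"{clock} {meridiem}", "%I:%M %p")
    except ValueError:
        # Not a real time (e.g. 10:67 am): keep the original text.
        return match.group()
    return parsed.strftime("%H:%M")


def convert_times(text):
    return TIME_RE.sub(to_24_hour, text)


print(convert_times("Mon - Fri:,10:00 am - 7:00 pm"))
print(convert_times("Sat:,10:67 am - 12:30PM"))
```

Whether to skip, log or raise on an invalid time depends on how much you trust the scraped data.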
Your regex looks only for two digit hours (\d{2}) with white space before them (\s). The following captures also one digit hours, with a possible comma instead of the space.
data = re.findall(r'[\s,](\d{1,2}\:\d{2}\s?(?:AM|PM|am|pm))', txt)
However, you might want to consider all punctuation as valid:
data = re.findall(r'[\s!"#$%&\'\(\)*+,-./:;\<=\>?#\[\\\]^_`\{|\}~](\d{1,2}\:\d{2}\s?(?:AM|PM|am|pm))', txt)
The regex needs to change as follows:
import re
text = 'Mon - Fri:,10:00 am - 7:00 pm'
result = re.match(r'\D* - \D*:,([\d\s\w:]+) - ([\d\s\w:]+)', text)
print(result.group(1))
# it will print 10:00 am
print(result.group(2))
# it will print 7:00 pm
You need quantifiers such as '+' and '*' to tell the regex to match multiple characters; if you only use \s, it will match a single character.
You can learn more regex here.
https://regexr.com/
And here you can try regex online.
https://regex101.com/
Why not use the time module?
import time
data = "Mon - Fri:,10:00 am - 7:00 pm"
parts = data.split(",")
days = parts[0]
hours = parts[1]
parts = hours.split("-")
t1 = time.strptime(parts[0].strip(), "%I:%M %p")
t2 = time.strptime(parts[1].strip(), "%I:%M %p")
result = days + "," + time.strftime("%H:%M", t1) + " - " + time.strftime("%H:%M", t2)
print(result)
Output:
Mon - Fri:,10:00 - 19:00

Extract strings that follow changing time strings

So I've been trying to extract the strings that follow the "dot" character in the text file, but only for lines that follow the pattern as below, that is, after the date and time:
09 May 2018 10:37AM • 6PR, Perth (Mornings)
The problem is for each of those lines, the date and time would change so the only common pattern is that there would be AM or PM right before the "dot".
However, if I search for "AM" or "PM" it wouldn't recognize the lines because the "AM" and "PM" are attached to the time.
This is my current code:
for i,s in enumerate(open(file)):
    for words in ['PM','AM']:
        if re.findall(r'\b' + words + r'\b', s):
            source=s.split('•')[0]
Any idea how to get around this problem? Thank you.
I guess your regex is the problem here.
for i, s in enumerate(open(file)):
    if re.findall(r'\d{2}[AP]M', s):
        source = s.split('•')[0]
        # 09 May 2018 10:37AM
If you are trying to extract the datetime, try using a regex.
Ex:
import re
s = "09 May 2018 10:37AM • 6PR, Perth (Mornings)"
m = re.search(r"(?P<datetime>\d{2}\s+(January|February|March|April|May|June|July|August|September|October|November|December)\s+\d{4}\s+\d{2}\:\d{2}(AM|PM))", s)
if m:
    print(m.group("datetime"))
Output:
09 May 2018 10:37AM
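Since the question actually asks for the text that follows the "•", a single pattern can capture both the timestamp and the rest of the line. This is only a sketch: LINE_RE and parse_line are hypothetical names, and it assumes full English month names and the exact "DD Month YYYY HH:MMAM" layout from the example line.

```python
import re
from datetime import datetime

# Group 1: the date and time; group 2: everything after the bullet.
LINE_RE = re.compile(
    r"^(\d{1,2} [A-Za-z]+ \d{4} \d{1,2}:\d{2}[AP]M)\s*•\s*(.+)$"
)


def parse_line(line):
    """Return (datetime, text after the bullet), or None if no match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    try:
        stamp = datetime.strptime(match.group(1), "%d %B %Y %I:%M%p")
    except ValueError:
        # Looked like a timestamp but could not be parsed.
        return None
    return stamp, match.group(2)


print(parse_line("09 May 2018 10:37AM • 6PR, Perth (Mornings)"))
print(parse_line("Some other line without a timestamp"))
```

Lines that do not start with a timestamp simply return None, so the function can be called on every line of the file.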

How to Extract Date From String Python

I have a String Like
"Originally Posted on 09 May, 2016. By query 3 j...."
How can I extract the date using Python?
I tried this code:
dStr = "Originally Posted on 09 May, 2016. By query 3 j...."
date_st = re.findall("(\d+\ \w+,)", dStr)
Printing date_st, I got:
['09 May,']
What should I do for the year?
You forgot to add the year after the ','. You only need to append ' \d+\.' to the pattern:
re.findall(r"(\d+ \w+, \d+\.)", dStr)
You will get this:
['09 May, 2016.']
You were almost there. Just add 4 digits for the year after the comma. It is better to use [a-z]+ instead of \w+ to match the month names, as \w also matches _ and 0-9 (in addition to letters).
re.findall(r'\d+\s[a-z]+,\s\d{4}', dStr, re.I)
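Once the text is matched, it can be turned into a real date object with strptime. A minimal sketch, assuming the month is always a full English name as in the example; the variable name posted is made up for illustration.

```python
import re
from datetime import datetime

dStr = "Originally Posted on 09 May, 2016. By query 3 j...."

# Day, month name, comma, four-digit year.
match = re.search(r"\d{1,2}\s[A-Za-z]+,\s\d{4}", dStr)

# %B expects the full month name; the comma is matched literally.
posted = datetime.strptime(match.group(), "%d %B, %Y").date() if match else None
print(posted)  # 2016-05-09
```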

Xpath extract dates between certain characters AND use as dates

UPDATE: Regarding my 2nd question (how to convert string to date format in MySQL), I found a way and want to share it:
1) Save the "string date" data as VARCHAR (Don't use TEXT)
2) When reading the MySQL data from PHP or elsewhere, use the str_to_date(string-date-column, date-format) function, as in the following example:
$sql = "SELECT * FROM yourtablename ORDER BY str_to_date(string-date-column, '%d %M %Y')";
I am using scrapy to collect data and write it to a database. On the website, the post date of each item is listed as follows:
<p> #This is the last <p> within each <div>
<br>
[15 May 2015, #9789]
<br>
</p>
So the date is always behind a "[" and before a ",". I used the following xpath code to extract:
sel.xpath("p[last()]/text()[contains(., '[')]").extract()
But I will get the whole line:
[15 May 2015, #9789]
So, how to get only the part of "15 May 2015"? If this can be done, how to convert the scraped string (15 May 2015) as real DATE data, so it can be used for sorting? Thanks a bunch!
Regarding the first question, assuming that there is at most one date at a time, you can use a combination of the XPath substring-after() and substring-before() functions to get the 15 May 2015 part of the text node:
substring-before(substring-after(p[last()]/text()[contains(., '[')], '['), ',')
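Regarding the second question, you can use datetime.strptime() to convert the string to a datetime (note that %b matches abbreviated month names such as "Jun"; use %B if the site prints full names such as "June"):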
import datetime
result = datetime.datetime.strptime("15 May 2015", "%d %b %Y")
print(result)
print(type(result))
output :
2015-05-15 00:00:00
<class 'datetime.datetime'>
A more "scrapic" approach would involve using the built-in regular expression support in the XPath expressions and/or .re().
This is with both applied:
In [1]: response.xpath("p[last()]/text()[re:test(., '\[\d+ \w+ \d{4}\, #\d+\]')]").re(r"\d+ \w+ \d{4}")
Out[1]: [u'15 May 2015']
Or, this is when you use .re() to extract the date locating the element as you did before:
In [2]: response.xpath("p[last()]/text()[contains(., '[')]").re(r"\d+ \w+ \d{4}")
Out[2]: [u'15 May 2015']
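To use the scraped values for sorting without going through MySQL, the raw text nodes can be parsed and sorted in plain Python. This is a sketch, not part of the answers above: post_date is a hypothetical helper, the rows other than the first are invented sample data, and full English month names are assumed.

```python
import re
from datetime import datetime


def post_date(raw):
    """Parse the date out of text like '[15 May 2015, #9789]'."""
    match = re.search(r"\[(\d{1,2} \w+ \d{4}),", raw)
    if match is None:
        return None
    try:
        return datetime.strptime(match.group(1), "%d %B %Y").date()
    except ValueError:
        return None


# The first row is from the question; the other two are made up.
rows = ["[15 May 2015, #9789]", "[02 January 2014, #120]", "[30 June 2015, #9901]"]

# Pair each row with its date, drop rows without one, sort by date.
parsed = [(post_date(raw), raw) for raw in rows]
ordered = [raw for when, raw in sorted(p for p in parsed if p[0] is not None)]
print(ordered)
```

Storing the parsed date (for example via isoformat()) in a DATE column would also let the database sort it directly.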
