How to find a date with regular expressions - python

I have to find a date inside a string with a regular expression in Python:
astring = 'L2A_T21HUB_A023645_20210915T135520'
I'm trying to get the part before the T, with shape xxxxxxxx, where every x is a digit.
desiredOutput = '20210915'
I'm new to regex, so I have no idea how to solve this.

If astring's format is consistent, meaning the date always sits in the same position, you can split the string on '_', take the last substring, and slice the date out of it:
astring = 'L2A_T21HUB_A023645_20210915T135520'
date_split = astring.split("_") # --> ['L2A', 'T21HUB', 'A023645', '20210915T135520']
desiredOutput = date_split[3][:8] # --> [3] = '20210915T135520' [:8] gets first 8 chars
print(desiredOutput) # --> 20210915

If you want an actual datetime object:
>>> from datetime import datetime
>>> astring = 'L2A_T21HUB_A023645_20210915T135520'
>>> date_str = astring.split('_')[-1]
>>> datetime.strptime(date_str, '%Y%m%dT%H%M%S')
datetime.datetime(2021, 9, 15, 13, 55, 20)
From that, you can use datetime.strftime to reformat to a new string, or you can use split('T')[0] to get the string you want.
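As a minimal sketch of the strftime route mentioned above (the names dt and date_only are only illustrative):

```python
from datetime import datetime

astring = 'L2A_T21HUB_A023645_20210915T135520'

# Parse the last underscore-separated field into a datetime object.
dt = datetime.strptime(astring.split('_')[-1], '%Y%m%dT%H%M%S')

# Reformat only the date part back into an 8-digit string.
date_only = dt.strftime('%Y%m%d')
print(date_only)  # 20210915
```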

The trouble with regex is that unexpected substrings can match your pattern and throw things off. However, if you know that only the date portion will ever have 8 sequential digits, you can do this:
import re
date_patt = re.compile(r'\d{8}')
date = date_patt.search(astring).group(0)
You can develop more robust patterns based on your knowledge of the formatting of the incoming strings. For instance, if you know that the date will always follow an underscore, you could use a look-behind assertion:
date_patt = re.compile(r'(?<=_)\d{8}') # look for '_' before the date, but don't capture it
Hope this helps. Regex can be finicky and may take some tweaking, but hopefully this sets you in the right direction.
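One possible way to make the pattern more robust, sketched under the assumption that the string always ends with `_YYYYMMDDTHHMMSS` (the helper name extract_date is made up for this example). It combines the look-behind with a strptime check so that digit runs that are not real dates are rejected:

```python
import re
from datetime import datetime

# Assumed layout: '_' + 8-digit date + 'T' + 6-digit time at the end of the string.
STAMP = re.compile(r'(?<=_)(\d{8})T(\d{6})$')

def extract_date(name):
    """Return the 8-digit date string, or None if absent or not a real date."""
    m = STAMP.search(name)
    if m is None:
        return None
    try:
        datetime.strptime(m.group(1), '%Y%m%d')  # rejects e.g. 20211345
    except ValueError:
        return None
    return m.group(1)

print(extract_date('L2A_T21HUB_A023645_20210915T135520'))  # 20210915
```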

Related

How to create a datetime object from a string?

I tried to be smart and write a one-liner that extracts the timestamp from my_string and turns it into a datetime. However, it did not work quite well.
my_string = 'London_XX65TR_20211116_112413.txt'
This is my code:
datetime= datetime.datetime.strptime(my_string .split('_')[2],'%Y%m%d_%H%M%S')
This is my output:
ValueError: time data '20211116' does not match format '%Y%m%d_%H%M%S'
You could use the maxsplit argument in str.split:
>>> from datetime import datetime
>>> region, code, date_time = my_string[:-4].split('_', maxsplit=2)
>>> datetime.strptime(date_time, "%Y%m%d_%H%M%S")
datetime.datetime(2021, 11, 16, 11, 24, 13)
This splits on the _ character at most maxsplit times, starting from the left, and leaves the rest of the string as is.
For this particular case you could use my_string.rstrip('.txt') instead of my_string[:-4], but this is not advised in general: rstrip removes any trailing run of the characters '.', 't' and 'x' rather than the literal suffix, so it may strip useful information as well. From Python 3.9 onward you can use str.removesuffix:
>>> my_string = 'London_XX65TR_20211116_112413.txt'
>>> region, code, date_time = my_string.removesuffix('.txt').split('_', maxsplit=2)
>>> datetime.strptime(date_time, "%Y%m%d_%H%M%S")
datetime.datetime(2021, 11, 16, 11, 24, 13)
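To illustrate why rstrip is risky here, a small sketch with a made-up file name. The helper strip_suffix is hypothetical and only stands in for str.removesuffix on Python versions older than 3.9:

```python
def strip_suffix(text, suffix):
    """Remove `suffix` from the end of `text` only if it is literally there."""
    if suffix and text.endswith(suffix):
        return text[:-len(suffix)]
    return text

name = 'report_x.txt'  # made-up name whose stem ends in a character from '.txt'
print(name.rstrip('.txt'))         # report_   (the trailing 'x' is lost too)
print(strip_suffix(name, '.txt'))  # report_x
```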
You could use re.findall here:
import re
from datetime import datetime
my_string = 'London_XX65TR_20211116_112413.txt'
ts = re.findall(r'_(\d{8}_\d{6})\.', my_string)[0]
dt = datetime.strptime(ts, '%Y%m%d_%H%M%S')
print(dt) # 2021-11-16 11:24:13
This approach uses a regex to extract the timestamp from the input string. The rest of your logic was already correct.
The method you are following is correct. You are just not considering the HH:MM:SS part, which you need to append before parsing:
import datetime
my_string = 'London_XX65TR_20211116_112413.txt'
parts = my_string.split('_')
my_date = (parts[2] + parts[3]).replace(".txt", "")
dt = datetime.datetime.strptime(my_date, '%Y%m%d%H%M%S')
print(dt) # 2021-11-16 11:24:13
Your code does not work because my_string.split('_') gives ['London', 'XX65TR', '20211116', '112413.txt'], so strptime('20211116', '%Y%m%d_%H%M%S') raises an error.
You should either:
limit the format to '%Y%m%d', losing the HMS
find another way to get the whole substring matching the format
The first option is trivial, so let's go for the second one using a regex:
import datetime
import re
dt = datetime.datetime.strptime(re.search(r'\d{8}_\d{6}', my_string)[0], '%Y%m%d_%H%M%S')
from datetime import datetime
date_time_str = '18/09/19 01:55:19'
date_time_obj = datetime.strptime(date_time_str, '%d/%m/%y %H:%M:%S')
print ("The type of the date is now", type(date_time_obj))
print ("The date is", date_time_obj)

Python regular expression to change date formatting

I have an array of strings representing dates like '2015-6-03' and I want to convert these to the format '2015-06-03'.
Instead of doing the replacement with an ugly loop, I'd like to use a regular expression. Something along the lines of:
str.replace('(-){1}(\d){1}(-){1}', '-0{my digit here}-')
Is something like this possible?
You don't have to retrieve the digit from the match. You can replace the hyphen before a single-digit month with -0.
Like this:
re.sub(r'-(?=\d-)', '-0', text)
Note that (?=\d-) is a lookahead assertion: it checks that a digit and a hyphen follow, but it does not consume them. That's why only the hyphen gets replaced.
Test:
import re
text = '2015-09-03 2015-6-03 2015-1-03 2015-10-03'
re.sub(r'-(?=\d-)', '-0', text)
Result:
'2015-09-03 2015-06-03 2015-01-03 2015-10-03'
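The lookahead above pads only the month. As a hedged variation that is not part of the original answer, a word-boundary pattern can pad a one-digit month and a one-digit day in one pass over a list of strings (the sample data is made up):

```python
import re

dates = ['2015-6-03', '2015-10-3', '2015-1-1', '2015-12-25']  # sample data

# \b(\d)\b matches a digit that stands alone between word boundaries,
# i.e. a one-digit month or day; the replacement prefixes it with 0.
padded = [re.sub(r'\b(\d)\b', r'0\1', d) for d in dates]
print(padded)
# ['2015-06-03', '2015-10-03', '2015-01-01', '2015-12-25']
```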
Yes, a regex will accomplish what you want:
(\d+)-(\d)-(\d+)
To replace, you would use something like:
import re
target = "2015-6-05"
out = re.sub(r'(\d+)-(\d)-(\d+)', r'\1-0\2-\3', target) # '2015-06-05'
No need for regex, you can load it as datetime object and format the string as requested when you print it:
import datetime
s = '2015-6-03'
date_obj = datetime.datetime.strptime(s, '%Y-%m-%d')
print("%d-%02d-%02d" % (date_obj.year, date_obj.month, date_obj.day))
OUTPUT
2015-06-03
Something along the lines of...
import re
def replaceRegex(what, pattern, filler):
    regex = re.compile(pattern)
    match = regex.match(what)  # only matches at the start of the string
    if match is not None:
        start, end = match.span()  # 'from' is a reserved word, so use start/end
        return what.replace(what[start:end], filler)
    else:
        return None
Might help you.

Obtain part of a string using regular expressions in Python

I am dealing with strings that need to be converted into dates in Python. In a normal situation, my string would have %d/%m/%Y %H:%M:%S. For instance:
18/02/2013 09:21:14
However, on some occasions I get something like %d/%m/%Y %H:%M:%S:%ms, such as: 06/01/2014 09:52:14:78
I would like to get rid of that ms bit but I need to figure out how. I have been able to create a regular expression which can test if the date matches:
mydate = re.compile("^((((31\/(0?[13578]|1[02]))|((29|30)\/(0?[1,3-9]|1[0-2])))\/(1[6-9]|[2-9]\d)?\d{2})|(29\/0?2\/(((1[6-9]|[2-9]\d)?(0[48]|[2468][048]|[13579][26])|((16|[2468][048]|[3579][26])00))))|(0?[1-9]|1\d|2[0-8])\/((0?[1-9])|(1[0-2]))\/((1[6-9]|[2-9]\d)?\d{2})) (20|21|22|23|[0-1]?\d):[0-5]?\d:[0-5]?\d$")
s = "06/01/2014 09:52:14:78"
bool(mydate.match(s))
>>> False
However I do not know how to obtain only the interesting part, i.e 06/01/2014 09:52:14
Any ideas?
You can use a positive lookbehind and re.sub():
>>> re.sub(r'(?<=\d{2}:\d{2}:\d{2}).*','','06/01/2014 09:52:14:78')
'06/01/2014 09:52:14'
How about the re.sub function
>>> re.sub(r'( \d{2}(:\d{2}){2}).*$',r'\1','06/01/2014 09:52:14:78')
'06/01/2014 09:52:14'
>>> re.sub(r'( \d{2}(:\d{2}){2}).*$',r'\1','8/02/2013 09:21:14')
'8/02/2013 09:21:14'
( \d{2}(:\d{2}){2}) matches hours:min:sec, saved in capture group 1
.*$ matches everything after it (the milliseconds, if present)
r'\1' replaces the whole match with the contents of the first capture group
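Since the goal is to convert the strings into dates, the trimming and the parsing can be combined. A minimal sketch, assuming the timestamp always starts the string (parse_timestamp and CORE are names invented for this example):

```python
import re
from datetime import datetime

# dd/mm/YYYY HH:MM:SS at the start; anything after the seconds is ignored.
CORE = re.compile(r'\d{1,2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}')

def parse_timestamp(text):
    m = CORE.match(text)
    if m is None:
        raise ValueError('no timestamp found in %r' % text)
    return datetime.strptime(m.group(0), '%d/%m/%Y %H:%M:%S')

print(parse_timestamp('06/01/2014 09:52:14:78'))  # 2014-01-06 09:52:14
print(parse_timestamp('18/02/2013 09:21:14'))     # 2013-02-18 09:21:14
```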

date matching using python regex

What am I doing wrong in the regular expression matching below?
>>> import re
>>> d="30-12-2001"
>>> re.findall(r"\b[1-31][/-:][1-12][/-:][1981-2011]\b",d)
[]
[1-31] matches 1-3 and 1, which is basically 1, 2 or 3. You cannot match a number range unless it's a subset of 0-9. The same applies to [1981-2011], which matches exactly one character that is 0, 1, 2, 8 or 9.
The best solution is simply matching any number and then checking the numbers later using python itself. A date such as 31-02-2012 would not make any sense - and making your regex check that would be hard. Making it also handle leap years properly would make it even harder or impossible. Here's a regex matching anything that looks like a dd-mm-yyyy date: \b\d{1,2}[-/:]\d{1,2}[-/:]\d{4}\b
However, I would highly suggest not allowing any of -, : and / as : is usually used for times, / usually for the US way of writing a date (mm/dd/yyyy) and - for the ISO way (yyyy-mm-dd). The EU dd.mm.yyyy syntax is not handled at all.
If the string does not contain anything but the date, you don't need a regex at all - use strptime() instead.
All in all, tell the user what date format you expect and parse that one, rejecting anything else. Otherwise you'll get ambiguous cases such as 04/05/2012 (is it April 5th or May 4th?).
[1-31] does not mean what you think it means. The square bracket syntax matches a range of characters, not a range of numbers. Matching a range of numbers with a regex is possible, but unwieldy.
If you really want to use regular expressions for this (rather than a date parsing library) you'd be better off matching all numbers of the right number of digits, capturing the values, and then checking the values yourself:
>>> import re
>>> d="30-12-2001"
>>> re.findall(r"\b([0-9]{1,2})[-/:]([0-9]{1,2})[-/:]([0-9]{4})\b",d)
[('30', '12', '2001')]
You'll have to do actual date verification anyway, to catch invalid dates like 31-02-2012.
(Note that [/-:] doesn't work either, because /-: is interpreted as a range running from / to :, which covers the digits but not the hyphen. Use [-/:] instead: putting the hyphen at the front prevents it from being interpreted as a range separator.)
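The "capture, then check the values yourself" idea can be sketched as follows. The function name find_valid_dates is made up, and datetime.date does the actual calendar check:

```python
import re
from datetime import date

DATE_RE = re.compile(r'\b(\d{1,2})[-/:](\d{1,2})[-/:](\d{4})\b')

def find_valid_dates(text):
    """Return date objects for every dd-mm-yyyy candidate that is a real date."""
    found = []
    for day, month, year in DATE_RE.findall(text):
        try:
            found.append(date(int(year), int(month), int(day)))
        except ValueError:  # e.g. 31-02-2012
            pass
    return found

print(find_valid_dates('ok: 30-12-2001, bad: 31-02-2012'))
# [datetime.date(2001, 12, 30)]
```

A range restriction such as 1981 to 2011 from the question could then be applied as an ordinary comparison on the resulting date objects.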
Regular expressions do not understand numbers; to a regular expression, 1 is just a character of the string - the same kind of thing that a is. Thus, for example, [1-31] is parsed as a character class which contains the range 1-3 and the (redundant) single symbol 1.
You do not want to use regular expressions to parse dates. There is already a built-in module for handling date parsing:
>>> import datetime
>>> datetime.datetime.strptime('30-12-2001', '%d-%m-%Y')
datetime.datetime(2001, 12, 30, 0, 0) # an object representing the date.
This also does all the secondary checks (for things like an attempt to refer to Feb. 31) for you. If you want to handle multiple types of separators, you can simply .replace them in the original string so that they all turn into the same separator, then use that in your format.
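A short sketch of the "replace the separators, then parse" suggestion. The helper parse_date and its default separator set are assumptions made for illustration:

```python
from datetime import datetime

def parse_date(text, separators='/:.'):
    """Parse dd-mm-yyyy where the separator may be '-' or any of `separators`."""
    for sep in separators:
        text = text.replace(sep, '-')
    return datetime.strptime(text, '%d-%m-%Y')

print(parse_date('30/12/2001'))  # 2001-12-30 00:00:00
print(parse_date('30.12.2001'))  # 2001-12-30 00:00:00
```

An impossible date such as 31-02-2001 still raises ValueError, because strptime performs the calendar check.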
You're probably doing it wrong. Some other replies here are helping you with the regex, but I suggest you use the datetime.strptime method to turn your formatted date into a datetime object, and do further logic with that object:
>>> from datetime import datetime
>>> datetime.strptime('30-12-2001', '%d-%m-%Y')
datetime.datetime(2001, 12, 30, 0, 0)
More info on the strptime method and its format strings is available in the datetime module documentation.
regexp = r'(0?[1-9]|[12][0-9]|3[01])/(0?[1-9]|1[012])/((19|20)\d\d)'
( #start of group #1
0?[1-9] # 01-09 or 1-9
| # ..or
[12][0-9] # 10-19 or 20-29
| # ..or
3[01] # 30, 31
) #end of group #1
/ # followed by a "/"
( # start of group #2
0?[1-9] # 01-09 or 1-9
| # ..or
1[012] # 10,11,12
) # end of group #2
/ # followed by a "/"
( # start of group #3
(19|20)\d\d # 19[0-9][0-9] or 20[0-9][0-9]
) # end of group #3
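The commented breakdown above maps naturally onto re.VERBOSE, which lets whitespace and # comments live inside the pattern itself. A minimal sketch using the same expression (the name DATE_RE is illustrative):

```python
import re

DATE_RE = re.compile(r"""
    (0?[1-9]|[12][0-9]|3[01])   # group 1: day, 1-31 with optional leading zero
    /
    (0?[1-9]|1[012])            # group 2: month, 1-12 with optional leading zero
    /
    ((19|20)\d\d)               # group 3: year, 1900-2099
""", re.VERBOSE)

m = DATE_RE.fullmatch('31/12/1999')
print(m.group(1), m.group(2), m.group(3))  # 31 12 1999
print(DATE_RE.fullmatch('32/12/1999'))     # None
```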
Maybe you can try this regex:
^((0|1|2)[0-9]{1}|(3)[0-1]{1})/((0)[0-9]{1}|(1)[0-2]{1})/((19)[0-9]{2}|(20)[0-9]{2})$
This matches (01 to 31)/(01 to 12)/(1900 to 2099). Note that it also accepts 00 as a day or month.

Python regex to match dates

What regular expression in Python do I use to match dates like this: "11/12/98"?
Instead of using regex, it is generally better to parse the string as a datetime.datetime object:
In [140]: datetime.datetime.strptime("11/12/98","%m/%d/%y")
Out[140]: datetime.datetime(1998, 11, 12, 0, 0)
In [141]: datetime.datetime.strptime("11/12/98","%d/%m/%y")
Out[141]: datetime.datetime(1998, 12, 11, 0, 0)
You could then access the day, month, and year (and hour, minutes, and seconds) as attributes of the datetime.datetime object:
In [142]: date = datetime.datetime.strptime("11/12/98","%m/%d/%y")
In [143]: date.year
Out[143]: 1998
In [144]: date.month
Out[144]: 11
In [145]: date.day
Out[145]: 12
To test if a sequence of digits separated by forward-slashes represents a valid date, you could use a try..except block. Invalid dates will raise a ValueError:
In [159]: try:
   .....:     datetime.datetime.strptime("99/99/99","%m/%d/%y")
   .....: except ValueError as err:
   .....:     print(err)
   .....:
time data '99/99/99' does not match format '%m/%d/%y'
If you need to search a longer string for a date,
you could use regex to search for digits separated by forward-slashes:
In [146]: import re
In [152]: match = re.search(r'(\d+/\d+/\d+)','The date is 11/12/98')
In [153]: match.group(1)
Out[153]: '11/12/98'
Of course, invalid dates will also match:
In [154]: match = re.search(r'(\d+/\d+/\d+)','The date is 99/99/99')
In [155]: match.group(1)
Out[155]: '99/99/99'
To check that match.group(1) returns a valid date string, you could then parse it using datetime.datetime.strptime as shown above.
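Putting the two steps together, a hedged sketch: search a longer string for date-like candidates, then let strptime decide which ones are real. The name first_valid_date and the default format are assumptions for this example:

```python
import re
from datetime import datetime

def first_valid_date(text, fmt='%m/%d/%y'):
    """Return the first slash-separated candidate that parses with `fmt`, else None."""
    for candidate in re.findall(r'\d+/\d+/\d+', text):
        try:
            return datetime.strptime(candidate, fmt)
        except ValueError:
            continue  # looked like a date but is not one
    return None

print(first_valid_date('The date is 11/12/98'))  # 1998-11-12 00:00:00
print(first_valid_date('The date is 99/99/99'))  # None
```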
I find the RE below works fine for dates in the following formats:
14-11-2017
14.11.2017
14|11|2017
It accepts years from 2000 to 2099.
Do not forget the $ at the end; without it, the pattern also accepts trailing characters, as in 14-11-20177.
import re
date = "13-11-2017"
x = re.search(r"^([1-9]|1[0-9]|2[0-9]|3[0-1])(\.|-|\|)([1-9]|1[0-2])(\.|-|\|)20[0-9][0-9]$", date)
x.group()
output = '13-11-2017'
I built my solution on top of @aditya Prakash's approach:
print(re.search(r"^([1-9]|0[1-9]|1[0-9]|2[0-9]|3[0-1])(\.|-|/)([1-9]|0[1-9]|1[0-2])(\.|-|/)([0-9][0-9]|19[0-9][0-9]|20[0-9][0-9])$|^([0-9][0-9]|19[0-9][0-9]|20[0-9][0-9])(\.|-|/)([1-9]|0[1-9]|1[0-2])(\.|-|/)([1-9]|0[1-9]|1[0-9]|2[0-9]|3[0-1])$",'01/01/2018'))
The first part (^([1-9]|0[1-9]|1[0-9]|2[0-9]|3[0-1])(\.|-|/)([1-9]|0[1-9]|1[0-2])(\.|-|/)([0-9][0-9]|19[0-9][0-9]|20[0-9][0-9])$) can handle the following formats:
01.10.2019
1.1.2019
1.1.19
12/03/2020
01.05.1950
The second part (^([0-9][0-9]|19[0-9][0-9]|20[0-9][0-9])(\.|-|/)([1-9]|0[1-9]|1[0-2])(\.|-|/)([1-9]|0[1-9]|1[0-9]|2[0-9]|3[0-1])$) can basically do the same, but in inverse order, where the year comes first, followed by month, and then day.
2020/02/12
As delimiters it allows ., /, and -. As years it allows everything from 1900 to 2099; a two-digit year is also accepted.
If you have suggestions for improvement please let me know in the comments, so I can update the answer.
Using this regular expression you can validate different kinds of Date/Time samples, just a little change is needed.
^\d\d\d\d/(0?[1-9]|1[0-2])/(0?[1-9]|[12][0-9]|3[01]) (00|[0-9]|1[0-9]|2[0-3]):([0-9]|[0-5][0-9]):([0-9]|[0-5][0-9])$ -->validate this: 2018/7/12 13:00:00
For your format, you can change it to:
^(0?[1-9]|[12][0-9]|3[01])/(0?[1-9]|1[0-2])/\d\d$ --> validates this: 11/12/98
I use something like this
>>> import datetime
>>> regex = datetime.datetime.strptime
>>>
>>> # TEST
>>> assert regex('2020-08-03', '%Y-%m-%d')
>>>
>>> assert regex('2020-08', '%Y-%m-%d')
ValueError: time data '2020-08' does not match format '%Y-%m-%d'
>>> assert regex('08/03/20', '%m/%d/%y')
>>>
>>> assert regex('08-03-2020', '%m/%d/%y')
ValueError: time data '08-03-2020' does not match format '%m/%d/%y'
Well, from my understanding, for simply matching this format in a given string, I prefer this regular expression:
pattern='[0-9/]+'
To match the format more strictly, the following works:
pattern='(?:[0-9]{2}/){2}[0-9]{2}'
Personally, I cannot agree with unutbu's answer since sometimes we use regular expression for "finding" and "extract", not only "validating".
Sometimes we need to get the date from a string.
One example with grouping:
import re
record = '1518-09-06 00:57 some-alphanumeric-charecter'
pattern_date_time = r'([0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}) .+'
match = re.match(pattern_date_time, record)
if match is not None:
    date = match.group(1)  # group 1 holds the captured date and time
    print(date)  # outputs 1518-09-06 00:57
As the question title asks for a regex that finds many dates, I would like to propose another solution, although there are many already.
To find all dates in a string that fall in this century (2000-2099), the following worked for me:
dates = re.findall(r'([1-9]|1[0-9]|2[0-9]|3[0-1]|0[0-9])(\.|-|/)([1-9]|1[0-2]|0[0-9])(\.|-|/)(20[0-9][0-9])', dates_ele)
dates = [''.join(parts) for parts in dates]
This regex can find multiple dates in the same string, like bla Bla 8.05/2020 \n BLAH bla15/05-2020 blaa. Instead of /, the date can use . or -, and the two separators need not be the same. (The dot is escaped as \. so that it matches a literal dot rather than any character.)
Some explanation
More specifically, it finds dates in day, month, year order. Day is a one-digit integer, or a zero followed by one digit, or 1 or 2 followed by one digit, or 3 followed by 0 or 1. Month is a one-digit integer, or a zero followed by one digit, or 1 followed by 0, 1, or 2. Year is the number 20 followed by any number between 00 and 99.
Useful notes
To allow more separators, add an alternative after a | at the end of both (\.|-|/) groups. For example, to add --, write (\.|-|/|--).
To allow years outside 2000-2099, change (20[0-9][0-9]) to ([0-9][0-9][0-9][0-9]).
I use something like this :
string="text 24/02/2021 ... 24-02-2021 ... 24_02_2021 ... 24|02|2021 text"
new_string = re.sub(r"[0-9]{1,4}[\_|\-|\/|\|][0-9]{1,2}[\_|\-|\/|\|][0-9]{1,4}", ' ', string)
print(new_string)
out : text ... ... ... text
If you don't want to handle a ValueError exception as in the datetime-based methods, you can use re. The pattern below also checks that the day of the month is at most 31 and the month number is at most 12:
from re import search as re_search
date_input = '31.12.1998'
re_search(r'^(3[01]|[12][0-9]|0[1-9])\.(1[0-2]|0[1-9])\.[0-9]{4}$', date_input)
For a datetime-based approach, see @unutbu's answer above.
In case anyone wants to match this type of date, "24 November 2008", you can use:
import re
date = "24 November 2008"
regex = re.compile(r"\d+\s\w+\s\d+")
matchDate = regex.findall(date)
print(matchDate)
Or:
import re
date = "24 November 2008"
matchDate = re.findall(r"\d+\s\w+\s\d+", date)
print(matchDate)
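If the matched text should also become a date object, strptime can parse the month name with %B. A minimal sketch, assuming English month names and the default locale (the sample sentence is made up):

```python
import re
from datetime import datetime

text = "Signed on 24 November 2008 in London"  # sample sentence

parsed = None
m = re.search(r'\b\d{1,2}\s+[A-Za-z]+\s+\d{4}\b', text)
if m:
    # %B is the full month name; this assumes an English locale.
    parsed = datetime.strptime(m.group(0), '%d %B %Y')
    print(parsed.date())  # 2008-11-24
```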
This regular expression for matching dates in this format "22/10/2021" works for me :
import re
date = "WHATEVER 22/10/2029 WHATEVER"
match = re.search("([0-9]|1[0-9]|2[0-9]|3[0-5])/([0-9]|1[0-9]|2[0-9]|3[0-5])/([0-9][0-9][0-9][0-9])", date)
print(match)
OUTPUT = <re.Match object; span=(9, 19), match='22/10/2029'>
The string ([0-9]|1[0-9]|2[0-9]|3[0-5])/([0-9]|1[0-9]|2[0-9]|3[0-5])/([0-9][0-9][0-9][0-9]) passed to re.search is the regular expression doing the work. Note that it is permissive: it also accepts out-of-range values such as 35/35/2029, so validate the result separately if that matters.
