What am I doing wrong in the regular expression below?
>>> import re
>>> d="30-12-2001"
>>> re.findall(r"\b[1-31][/-:][1-12][/-:][1981-2011]\b",d)
[]
[1-31] matches 1-3 and 1, which is basically 1, 2 or 3. You cannot match a number range unless it's a subset of 0-9. The same applies to [1981-2011], which matches exactly one character that is 0, 1, 2, 8 or 9.
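To see what those two character classes really accept, here is a minimal sketch that runs each one against the ten digits. The variable names are only illustrative.

```python
import re

digits = "0123456789"

# [1-31] is the range 1-3 plus a redundant literal 1
small_class = re.findall(r"[1-31]", digits)

# [1981-2011] is the literals 1, 9, 8, the range 1-2, then 0, 1, 1
year_class = re.findall(r"[1981-2011]", digits)

print(small_class)  # ['1', '2', '3']
print(year_class)   # ['0', '1', '2', '8', '9']
```

Each class still matches only one character at a time, which is why the original pattern returns an empty list for "30-12-2001".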
The best solution is simply matching any number and then checking the numbers later using Python itself. A date such as 31-02-2012 would not make any sense - and making your regex check that would be hard. Making it also handle leap years properly would make it even harder or impossible. Here's a regex matching anything that looks like a dd-mm-yyyy date: \b\d{1,2}[-/:]\d{1,2}[-/:]\d{4}\b
However, I would highly suggest not allowing any of -, : and / as : is usually used for times, / usually for the US way of writing a date (mm/dd/yyyy) and - for the ISO way (yyyy-mm-dd). The EU dd.mm.yyyy syntax is not handled at all.
If the string does not contain anything but the date, you don't need a regex at all - use strptime() instead.
All in all, tell the user what date format you expect and parse that one, rejecting anything else. Otherwise you'll get ambiguous cases such as 04/05/2012 (is it April 5th or May 4th?).
[1-31] does not mean what you think it means. The square bracket syntax matches a range of characters, not a range of numbers. Matching a range of numbers with a regex is possible, but unwieldy.
If you really want to use regular expressions for this (rather than a date parsing library) you'd be better off matching all numbers of the right number of digits, capturing the values, and then checking the values yourself:
>>> import re
>>> d="30-12-2001"
>>> re.findall(r"\b([0-9]{1,2})[-/:]([0-9]{1,2})[-/:]([0-9]{4})\b",d)
[('30', '12', '2001')]
You'll have to do actual date verification anyway, to catch invalid dates like 31-02-2012.
(Note that [/-:] doesn't work either, because it's interpreted as a range. Use [-/:] instead - putting the hyphen at the front prevents it being interpreted as a range separator.)
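A rough sketch of the "capture, then check the values yourself" approach, assuming day-month-year order. The helper name find_valid_dates is made up for this example. It relies on datetime.date raising ValueError for impossible dates.

```python
import re
from datetime import date

DATE_RE = re.compile(r"\b([0-9]{1,2})[-/:]([0-9]{1,2})[-/:]([0-9]{4})\b")

def find_valid_dates(text):
    """Return a date object for every dd-mm-yyyy match that is a real calendar date."""
    found = []
    for day, month, year in DATE_RE.findall(text):
        try:
            found.append(date(int(year), int(month), int(day)))
        except ValueError:
            # e.g. 31-02-2012: looks like a date but does not exist
            continue
    return found

print(find_valid_dates("30-12-2001 and 31-02-2012"))
# [datetime.date(2001, 12, 30)]
```

A year restriction such as 1981-2011 could be added as a plain comparison on the resulting date objects rather than inside the regex.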
Regular expressions do not understand numbers; to a regular expression, 1 is just a character of the string - the same kind of thing that a is. Thus, for example, [1-31] is parsed as a character class which contains the range 1-3 and the (redundant) single symbol 1.
You do not want to use regular expressions to parse dates. There is already a built-in module for handling date parsing:
>>> import datetime
>>> datetime.datetime.strptime('30-12-2001', '%d-%m-%Y')
datetime.datetime(2001, 12, 30, 0, 0) # an object representing the date.
This also does all the secondary checks (for things like an attempt to refer to Feb. 31) for you. If you want to handle multiple types of separators, you can simply .replace them in the original string so that they all turn into the same separator, then use that in your format.
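The separator trick could look roughly like this. The function name parse_day_first is hypothetical, and the sketch assumes the input is always day-month-year.

```python
from datetime import datetime

def parse_day_first(text):
    """Parse dd-mm-yyyy, accepting '/' or ':' in place of '-'."""
    normalized = text.replace("/", "-").replace(":", "-")
    return datetime.strptime(normalized, "%d-%m-%Y")

print(parse_day_first("30/12/2001"))  # 2001-12-30 00:00:00
```

An impossible date such as 31/02/2012 makes strptime raise ValueError, so the caller can catch that to reject bad input.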
You're probably doing it wrong. Some other replies here are helping you with the regex, but I suggest you use the datetime.strptime method to turn your formatted date into a datetime object, and do further logic with that object:
>>> import datetime
>>> datetime.datetime.strptime('30-12-2001', '%d-%m-%Y')
datetime.datetime(2001, 12, 30, 0, 0)
More info on the strptime method and its format strings.
regexp = r'(0?[1-9]|[12][0-9]|3[01])/(0?[1-9]|1[012])/((19|20)\d\d)'
( #start of group #1
0?[1-9] # 01-09 or 1-9
| # ..or
[12][0-9] # 10-19 or 20-29
| # ..or
3[01] # 30, 31
) #end of group #1
/ # follow by a "/"
( # start of group #2
0?[1-9] # 01-09 or 1-9
| # ..or
1[012] # 10,11,12
) # end of group #2
/ # follow by a "/"
( # start of group #3
(19|20)\d\d # 19[0-9][0-9] or 20[0-9][0-9]
) # end of group #3
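The commented breakdown above can be kept inside the pattern itself with re.VERBOSE, which ignores whitespace and # comments in the regex. This is only a sketch. Using fullmatch is my own choice here, so that the whole string has to be a date.

```python
import re

date_re = re.compile(r"""
    (0?[1-9]|[12][0-9]|3[01])   # group 1: day, 1-31 with optional leading zero
    /
    (0?[1-9]|1[012])            # group 2: month, 1-12 with optional leading zero
    /
    ((19|20)\d\d)               # group 3: year 1900-2099 (group 4 is the century)
    """, re.VERBOSE)

match = date_re.fullmatch("30/12/2001")
print(match.group(1), match.group(2), match.group(3))  # 30 12 2001
print(date_re.fullmatch("32/12/2001"))                 # None
```

This still accepts 31/02/2012, so the values need a real date check afterwards.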
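Maybe you can try this regex:
^((0|1|2)[0-9]{1}|(3)[0-1]{1})/((0)[0-9]{1}|(1)[0-2]{1})/((19)[0-9]{2}|(20)[0-9]{2})$
This matches (00 to 31)/(00 to 12)/(1900 to 2099) with a mandatory two-digit day and month. Note that it also accepts 00 as a day or month, so the values still need checking afterwards.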
I have to find a date inside a string with a regular expression in Python
astring ='L2A_T21HUB_A023645_20210915T135520'
and I'm trying to get the part before the T, with shape xxxxxxxx where every x is a digit.
desiredOutput = '20210915'
I'm new to regex, so I have no idea how to solve this.
If the astring's format is consistent, meaning it will always have the same shape with respect to the date, you can split the string by '_' and get the last substring and get the date from there as such:
astring ='L2A_T21HUB_A023645_20210915T135520'
date_split = astring.split("_") # --> ['L2A', 'T21HUB', 'A023645', '20210915T135520']
desiredOutput = date_split[3][:8] # --> [3] = '20210915T135520' [:8] gets first 8 chars
print(desiredOutput) # --> 20210915
If you wanted an actual datetime object:
>>> from datetime import datetime
>>> astring = 'L2A_T21HUB_A023645_20210915T135520'
>>> date_str = astring.split('_')[-1]
>>> datetime.strptime(date_str, '%Y%m%dT%H%M%S')
datetime.datetime(2021, 9, 15, 13, 55, 20)
From that, you can use datetime.strftime to reformat to a new string, or you can use split('T')[0] to get the string you want.
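For example, the strftime route might look like this. The second format is just an illustration of reformatting and was not asked for in the question.

```python
from datetime import datetime

astring = "L2A_T21HUB_A023645_20210915T135520"
stamp = datetime.strptime(astring.split("_")[-1], "%Y%m%dT%H%M%S")

compact = stamp.strftime("%Y%m%d")     # same shape as the desired output
readable = stamp.strftime("%d-%m-%Y")  # any other layout works the same way
print(compact, readable)  # 20210915 15-09-2021
```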
The trouble with Regex is that there can be unexpected patterns that match your expected pattern and throw things off. However, if you know that only the date portion will ever have 8 sequential digits, you can do this:
import re
date_patt = re.compile(r'\d{8}')
date = date_patt.search(astring).group(0)
You can develop more robust patterns based on your knowledge of the formatting of the incoming strings. For instance, if you know that the date will always follow an underscore, you could use a look-behind assertion:
date_patt = re.compile(r'(?<=\_)\d{8}') # look for '_' before the date, but don't capture
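Put together, the look-behind version could be used like this. extract_date is a hypothetical helper. The None check matters because search returns None when nothing matches.

```python
import re

date_patt = re.compile(r"(?<=_)\d{8}")

def extract_date(name):
    """Return the first 8-digit run that directly follows an underscore, or None."""
    found = date_patt.search(name)
    return found.group(0) if found else None

print(extract_date("L2A_T21HUB_A023645_20210915T135520"))  # 20210915
print(extract_date("no_date_here"))                         # None
```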
Hope this helps. Regex can be finicky and may take some tweaking, but hope this sets you in the right direction.
I receive an input string having values expressed in two possible formats. E.g.:
#short format
data = '"interval":19'
>>> "interval":19
#extended format
data = '"interval":{"t0":19,"tf":19}'
>>> "interval":{"t0":19,"tf":19}
I would like to check whether the short format is used and, if so, convert it to the extended one.
Considering that the string could be composed of multiple values, i.e.
data = '"interval":19,"interval2":{"t0":10,"tf":15}'
>>> "interval":19,"interval2":{"t0":10,"tf":15}
I cannot just say:
if ":{" not in data:
#then short format is used
I would like to code something like:
if ":$(a general int/float/double number)" in data:
#extract the number
#replace ":{number}" with the extended format
I know how to code the replacing part.
I need help implementing the if condition: in my mind, I model it as a variable substring, in which the variable part is the number, while the rigid part is the value name plus ":".
"some_value":19
^ ^
rigid format variable part
EDIT - WHY NOT PARSE IT?
I know the string is "JSON-friendly" and I can convert it into a dictionary, easily accessing then the values.
Indeed, I already have this solution in my code. But I don't like it since the input string could be multilevel and I need to iterate on the leaf values of the resulting dictionary, independently from the dictionary levels. The latter is not a simple thing to do.
So I was wondering whether a way to act directly on the string exists.
If you replace all keys, except t0, tf, followed by numbers, it should work.
Here is an example on a multilevel string (the code could probably be tidied up):
import re
s = '"interval": 19,"t0interval2":{"t0":10,"tf":15},{"deeper": {"other_interval":23}}'
gex = r'("(?!(t0|tf)")\w+":)\s*(\d+)'
new_s = re.sub(gex, r'\1 {"t0": \3, "tf": \3}', s)
print(new_s)
"interval": {"t0": 19, "tf": 19},"t0interval2":{"t0":10,"tf":15},{"deeper": {"other_interval": {"t0": 23, "tf": 23}}}
You could use a regular expression. ("interval":)(\d+) will look for the string '"interval":' followed by one or more digits.
Let's test this
data = '"interval":19,"interval2":{"t0":10,"tf":15},"interval":25'
result = re.sub(r'("interval":)(\d+)', r'xxx', data)
print(result)
# -> xxx,"interval2":{"t0":10,"tf":15},xxx
We see that we found the correct places. Now we're going to create your target format. Here the matched groups come in handy. In the regular expression ("interval":) is group 1, (\d+) is group 2.
Now we use the content of those groups to create your wanted result.
data = '"interval":19,"interval2":{"t0":10,"tf":15},"interval":25'
result = re.sub(r'("interval":)(\d+)', r'\1{"t0":\2,"tf":\2}', data)
print(result)
# -> "interval":{"t0":19,"tf":19},"interval2":{"t0":10,"tf":15},"interval":{"t0":25,"tf":25}
If there are floating point values involved you'll have to change (\d+) to ([.\d]+).
If you want any Unicode standard word characters and not only interval you can use the special sequence \w and because it could be multiple characters the expression will be \w+.
data = '"interval":19,"interval2":{"t0":10,"tf":15},"Monty":25.4'
result = re.sub(r'("\w+":)([.\d]+)', r'\1{"t0":\2,"tf":\2}', data)
print(result)
# -> "interval":{"t0":19,"tf":19},"interval2":{"t0":{"t0":10,"tf":10},"tf":{"t0":15,"tf":15}},"Monty":{"t0":25.4,"tf":25.4}
Dang! Yes, we found "Monty" but now the values from the second part are found too. We'll have to fix this somehow. Let's see. We don't want ("\w+") if it's preceded by {, so we're going to use a negative lookbehind assertion: (?<!{)("\w+"). And after the number part ([.\d]+) we don't want a } or another digit, so we're using negative lookahead assertions here: ([.\d]+)(?!})(?!\d).
data = '"interval":19,"interval2":{"t0":10,"tf":15},"Monty":25.4'
result = re.sub(r'(?<!{)("\w+":)([.\d]+)(?!})(?!\d)', r'\1{"t0":\2,"tf":\2}', data)
print(result)
# -> "interval":{"t0":19,"tf":19},"interval2":{"t0":10,"tf":15},"Monty":{"t0":25.4,"tf":25.4}
Hooray, it works!
Regular expressions are powerful and fun, but if you start to add more constraints this might become unmanageable.
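If the regex does become unmanageable, the parsing route from the question's edit need not be complicated. A short recursive walk handles any nesting depth. This is only a sketch under two assumptions: the fragment becomes valid JSON once wrapped in braces, and a dict whose keys are exactly t0 and tf is already in extended form. The name expand is made up.

```python
import json

def expand(value):
    """Recursively turn bare numbers into {"t0": n, "tf": n}."""
    if isinstance(value, dict):
        if set(value) == {"t0", "tf"}:
            return value  # already in extended format, leave untouched
        return {key: expand(item) for key, item in value.items()}
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return {"t0": value, "tf": value}
    return value

data = '"interval":19,"interval2":{"t0":10,"tf":15},"Monty":25.4'
parsed = json.loads("{" + data + "}")  # assumption: wrapping in braces yields valid JSON
expanded = expand(parsed)
print(json.dumps(expanded, separators=(",", ":")))
# {"interval":{"t0":19,"tf":19},"interval2":{"t0":10,"tf":15},"Monty":{"t0":25.4,"tf":25.4}}
```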
I am dealing with strings that need to be converted into dates in Python. In a normal situation, my string would have %d/%m/%Y %H:%M:%S. For instance:
18/02/2013 09:21:14
However, on some occasions I could obtain something like %d/%m/%Y %H:%M:%S:%ms, such as: 06/01/2014 09:52:14:78
I would like to get rid of that ms bit but I need to figure out how. I have been able to create a regular expression which can test if the date matches:
mydate = re.compile("^((((31\/(0?[13578]|1[02]))|((29|30)\/(0?[1,3-9]|1[0-2])))\/(1[6-9]|[2-9]\d)?\d{2})|(29\/0?2\/(((1[6-9]|[2-9]\d)?(0[48]|[2468][048]|[13579][26])|((16|[2468][048]|[3579][26])00))))|(0?[1-9]|1\d|2[0-8])\/((0?[1-9])|(1[0-2]))\/((1[6-9]|[2-9]\d)?\d{2})) (20|21|22|23|[0-1]?\d):[0-5]?\d:[0-5]?\d$")
s = "06/01/2014 09:52:14:78"
bool(mydate.match(s))
>>> False
However I do not know how to obtain only the interesting part, i.e 06/01/2014 09:52:14
Any ideas?
You can use a positive lookbehind and re.sub():
>>> re.sub(r'(?<=\d{2}:\d{2}:\d{2}).*','','06/01/2014 09:52:14:78')
'06/01/2014 09:52:14'
How about the re.sub function?
>>> re.sub(r'( \d{2}(:\d{2}){2}).*$',r'\1','06/01/2014 09:52:14:78')
'06/01/2014 09:52:14'
>>> re.sub(r'( \d{2}(:\d{2}){2}).*$',r'\1','8/02/2013 09:21:14')
'8/02/2013 09:21:14'
( \d{2}(:\d{2}){2}) matches hours:min:sec, saved in capture group 1
.*$ matches the milliseconds
r'\1' replaces the match with the contents of the first capture group
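Since the goal is to end up with a date, the trimmed string can go straight into strptime. A minimal sketch, with parse_timestamp as a made-up helper name:

```python
import re
from datetime import datetime

def parse_timestamp(text):
    """Drop anything after hh:mm:ss, then parse the remainder."""
    trimmed = re.sub(r"( \d{2}(:\d{2}){2}).*$", r"\1", text)
    return datetime.strptime(trimmed, "%d/%m/%Y %H:%M:%S")

print(parse_timestamp("06/01/2014 09:52:14:78"))  # 2014-01-06 09:52:14
print(parse_timestamp("18/02/2013 09:21:14"))     # 2013-02-18 09:21:14
```

This way strptime does the validity check, and the long validation regex from the question is not needed.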
I have a list of files, and I am trying to filter for a subset of file names that end in 000000, 060000, 120000, 180000. I know I could do a straight string match, but I would like to understand why the regular expression I attempted below, r'[00|06|12|18]+0000', would not work (it is returning MSM_20130519210000.csv as well). I intend it to match any one of 00, 06, 12, 18, followed by 0000. How can that be accomplished? Please keep the answer along the lines of this intended regex instead of other functions, thanks.
Here is the code snippet:
import re
files_in_input_directory = ['MSM_20130519150000.csv', 'MSM_20130519180000.csv', 'MSM_20130519210000.csv',
'MSM_20130520000000.csv', 'MSM_20130520030000.csv', 'MSM_20130520060000.csv', 'MSM_20130520090000.csv',
'MSM_20130520120000.csv', 'MSM_20130520150000.csv', 'MSM_20130520180000.csv', 'MSM_20130520210000.csv',
'MSM_20130521000000.csv', 'MSM_20130521030000.csv', 'MSM_20130521060000.csv', 'MSM_20130521090000.csv',
'MSM_20130521120000.csv', 'MSM_20130521150000.csv', 'MSM_20130521180000.csv', 'MSM_20130521210000.csv',
'MSM_20130522000000.csv', 'MSM_20130522030000.csv', 'MSM_20130522060000.csv', 'MSM_20130522090000.csv',
'MSM_20130522120000.csv', 'MSM_20130522150000.csv', 'MSM_20130522180000.csv', 'MSM_20130522210000.csv',
'MSM_20130523000000.csv', 'MSM_20130523030000.csv', 'MSM_20130523060000.csv', 'MSM_20130523090000.csv',
'MSM_20130523120000.csv', 'MSM_20130523150000.csv', 'MSM_20130523180000.csv', 'MSM_20130523210000.csv',
'MSM_20130524000000.csv', 'MSM_20130524030000.csv', 'MSM_20130524060000.csv', 'MSM_20130524090000.csv',
'MSM_20130524120000.csv', 'MSM_20130524150000.csv', 'MSM_20130524180000.csv', 'MSM_20130524210000.csv',
'MSM_20130525000000.csv', 'MSM_20130525030000.csv', 'MSM_20130525060000.csv', 'MSM_20130525090000.csv',
'MSM_20130525120000.csv', 'MSM_20130525150000.csv', 'MSM_20130525180000.csv', 'MSM_20130525210000.csv',
'MSM_20130526000000.csv', 'MSM_20130526030000.csv', 'MSM_20130526060000.csv', 'MSM_20130526090000.csv',
'MSM_20130526120000.csv', 'MSM_20130526150000.csv', 'MSM_20130526180000.csv', 'MSM_20130526210000.csv',
'MSM_20130527000000.csv', 'MSM_20130527030000.csv', 'MSM_20130527060000.csv', 'MSM_20130527090000.csv',
'MSM_20130527120000.csv', 'MSM_20130527150000.csv', 'MSM_20130527180000.csv', 'MSM_20130527210000.csv',
'MSM_20130528000000.csv', 'MSM_20130528030000.csv', 'MSM_20130528060000.csv', 'MSM_20130528090000.csv',
'MSM_20130528120000.csv', 'MSM_20130528150000.csv', 'MSM_20130528180000.csv', 'MSM_20130528210000.csv',
'MSM_20130529000000.csv', 'MSM_20130529030000.csv', 'MSM_20130529060000.csv', 'MSM_20130529090000.csv']
print(files_in_input_directory)
print("\n")
# trying to match any string with 000000, 060000, 120000, 180000
# Question: I use + meaning one or more, and | to indicate the options, but this will match
# 'MSM_20130519210000.csv' as well, and I don't know why
print(list(filter(lambda x: re.search(r'[00|06|12|18]+0000', x), files_in_input_directory)))
print("\n")
# This verbose version works
print(list(filter(lambda x: re.search(r'000000|060000|120000|180000', x), files_in_input_directory)))
print("\n")
If you are trying to match filenames that contain 000000, 060000, 120000 or 180000, then instead of
re.search(r'[00|06|12|18]+0000', x)
use
re.search(r'(00|06|12|18)0000', x)
The square brackets [...] only match a single character at a time, and the + character means "match 1 or more of the preceding expression".
[00|06|12|18] is a character set containing the characters 0, 6, 1, 2, 8 and |, so it is equivalent to writing [01268|]. Thus it will match 210000 in "MSM_20130519210000.csv", because 2 and 1 are both in the set. Not what you meant, I should think.
Instead of expressing a character set that can match one or more times, make it either a capturing group
r'(00|06|12|18)0000'
Or a positive lookbehind expression
r'(?<=00|06|12|18)0000'
They are equivalent for your purposes, since you don't care about the match or any groups.
The basic problem here is that you were not grouping the patterns, but creating a character set to match against using [ ... ].
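A quick side-by-side on a few of the file names from the question shows the difference. The short list and the variable names are just for illustration, and the group is written as non-capturing since the captured value is not used.

```python
import re

files = [
    "MSM_20130519180000.csv",
    "MSM_20130519210000.csv",
    "MSM_20130520000000.csv",
    "MSM_20130520030000.csv",
]

char_set = re.compile(r"[00|06|12|18]+0000")  # one or more of the characters 0 1 2 6 8 |
grouped = re.compile(r"(?:00|06|12|18)0000")  # exactly one of the four two-digit pairs

set_hits = [name for name in files if char_set.search(name)]
group_hits = [name for name in files if grouped.search(name)]

print(set_hits)    # includes MSM_20130519210000.csv, which is not wanted
print(group_hits)  # only the 180000 and 000000 files
```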
This regex works: ((00)|(06)|(12)|(18))0000
I know that there are similar questions to mine that have been answered, but after reading through them I still don't have the solution I'm looking for.
Using Python 3.2.2, I need to match "Month, Day, Year" with the Month being a string, Day being two digits not over 30, 31, or 28 for February and 29 for February on a leap year. (Basically a REAL and Valid date)
This is what I have so far:
pattern = "(January|February|March|April|May|June|July|August|September|October|November|December)[,][ ](0[1-9]|[12][0-9]|3[01])[,][ ]((19|20)[0-9][0-9])"
expression = re.compile(pattern)
matches = expression.findall(sampleTextFile)
I'm still not too familiar with regex syntax so I may have characters in there that are unnecessary (the [,][ ] for the comma and spaces feels like the wrong way to go about it), but when I try to match "January, 26, 1991" in my sample text file, the printing out of the items in "matches" is ('January', '26', '1991', '19').
Why does the extra '19' appear at the end?
Also, what things could I add to or change in my regex that would allow me to validate dates properly? My plan right now is to accept nearly all dates and weed them out later using high level constructs by comparing the day grouping with the month and year grouping to see if the day should be <31,30,29,28
Any help would be much appreciated including constructive criticism on how I am going about designing my regex.
Here's one way to make a regular expression that will match any date of your desired format (though you could obviously tweak whether commas are optional, add month abbreviations, and so on):
import re

years = r'((?:19|20)\d\d)'
pattern = r'(%%s) +(%%s), *%s' % years
thirties = pattern % (
"September|April|June|November",
r'0?[1-9]|[12]\d|30')
thirtyones = pattern % (
"January|March|May|July|August|October|December",
r'0?[1-9]|[12]\d|3[01]')
fours = '(?:%s)' % '|'.join('%02d' % x for x in range(4, 100, 4))
feb = r'(February) +(?:%s|%s)' % (
r'(?:(0?[1-9]|1\d|2[0-8])), *%s' % years, # 1-28 any year
r'(?:(29), *((?:(?:19|20)%s)|2000))' % fours) # 29 leap years only
result = '|'.join('(?:%s)' % x for x in (thirties, thirtyones, feb))
r = re.compile(result)
print(result)
Then we have:
>>> r.match('January 30, 2001') is not None
True
>>> r.match('January 31, 2001') is not None
True
>>> r.match('January 32, 2001') is not None
False
>>> r.match('February 32, 2001') is not None
False
>>> r.match('February 29, 2001') is not None
False
>>> r.match('February 28, 2001') is not None
True
>>> r.match('February 29, 2000') is not None
True
>>> r.match('April 30, 1908') is not None
True
>>> r.match('April 31, 1908') is not None
False
And what is this glorious regexp, you may ask?
>>> print(result)
(?:(September|April|June|November) +(0?[1-9]|[12]\d|30), *((?:19|20)\d\d))|(?:(January|March|May|July|August|October|December) +(0?[1-9]|[12]\d|3[01]), *((?:19|20)\d\d))|(?:February +(?:(?:(0?[1-9]|1\d|2[0-8]), *((?:19|20)\d\d))|(?:(29), *((?:(?:19|20)(?:04|08|12|16|20|24|28|32|36|40|44|48|52|56|60|64|68|72|76|80|84|88|92|96))|2000))))
(I initially intended to do a tongue-in-cheek enumeration of the possible dates, but I basically ended up hand-writing that whole gross thing except for the multiples of four, anyway.)
Here are some quick thoughts:
Everyone who is suggesting you use something other than regular expression is giving you very good advice. On the other hand, it's always a good time to learn more about regular expression syntax...
An expression in square brackets -- [...] -- matches any single character inside those brackets. So writing [,], which only contains a single character, is exactly identical to writing a simple unadorned comma: ,.
The .findall method returns a list of all matching groups in the string. A group is identified by parentheses -- (...) -- and groups count from left to right, outermost first. Your final expression looks like this:
((19|20)[0-9][0-9])
The outermost parentheses match the entire year, and the inside parentheses match the first two digits. Hence, for a date like "1989", the final two match groups are going to be 1989 and 19. Since you don't want the inner group (the first two digits), you should use a non-capturing group instead. Non-capturing groups start with ?:, used like this: (?:a|b|c)
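Here is a small before/after sketch using the sample date from the question. The pattern is the asker's own with the [,][ ] classes written as plain characters, and the variable names are illustrative.

```python
import re

months = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
prefix = r"(" + months + r"), (0[1-9]|[12][0-9]|3[01]), "
text = "January, 26, 1991"

# Inner (19|20) is a capturing group, so it shows up as a fourth item
capturing = re.findall(prefix + r"((19|20)[0-9][0-9])", text)

# Inner (?:19|20) only groups the alternatives and captures nothing
non_capturing = re.findall(prefix + r"((?:19|20)[0-9][0-9])", text)

print(capturing)      # [('January', '26', '1991', '19')]
print(non_capturing)  # [('January', '26', '1991')]
```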
By the way, there is some good documentation on how to use regular expressions here.
Python has a date parser as part of the time module:
import time
time.strptime("December 31, 2012", "%B %d, %Y")
The above is all you need if the date format is always the same.
So, in real production code, I would write a regular expression that parses the date, and then use the results from the regular expression to build a date string that is always the same format.
Now that you said, in the comments, that this is homework, I'll post another answer with tips on regular expressions.
You have this regular expression:
pattern = "(January|February|March|April|May|June|July|August|September|October|November|December)[,][ ](0[1-9]|[12][0-9]|3[01])[,][ ]((19|20)[0-9][0-9])"
One feature of regular expressions is a "character class". Characters in square brackets make a character class. Thus [,] is a character class matching a single character, , (a comma). You might as well just put the comma.
Perhaps you wanted to make the comma optional? You can do that by putting a question mark after it: ,?
Anything you put into parentheses makes a "match group". I think the mysterious extra "19" came from a match group you didn't mean to have. You can make a non-capturing group using this syntax: (?:...)
So, for example:
r'(?:red|blue) socks'
This would match "red socks" or "blue socks" but does not make a match group. If you then put that inside plain parentheses:
r'((?:red|blue) socks)'
That would make a match group, whose value would be "red socks" or "blue socks"
I think if you apply these comments to your regular expression, it will work. It is mostly correct now.
As for validating the date against the month, that is way beyond the scope of a regular expression. Your pattern will match "February 31" and there is no easy way to fix that.
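One way to do that check outside the regex is to hand the matched text to strptime and treat ValueError as "not a real date". This is a sketch that assumes the "Month, DD, YYYY" layout from the question and English month names (the default C locale). is_real_date is a hypothetical name.

```python
from datetime import datetime

def is_real_date(text):
    """Return True if text such as 'February, 28, 2001' names an existing calendar date."""
    try:
        datetime.strptime(text, "%B, %d, %Y")
    except ValueError:
        return False
    return True

print(is_real_date("January, 26, 1991"))   # True
print(is_real_date("February, 31, 2012"))  # False
print(is_real_date("February, 29, 2000"))  # True
```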
First of all, as others have said, I don't think that regular expressions are the best choice to solve this problem, but to answer your question: by using parentheses you are dissecting the string into several subgroups, and when you call findall, you get a list with all the matching groups you created.
((19|20)[0-9][0-9])
Here is your problem: the regex will match both the entire year and 19 or 20, depending on whether the year starts with 19 or 20.