Obtain part of a string using regular expressions in Python - python

I am dealing with strings that need to be converted into dates in Python. In a normal situation, my string would have %d/%m/%Y %H:%M:%S. For instance:
18/02/2013 09:21:14
However, on some occasions I get something like %d/%m/%Y %H:%M:%S:%ms, such as: 06/01/2014 09:52:14:78
I would like to get rid of that ms bit but I need to figure out how. I have been able to create a regular expression which can test if the date matches:
mydate = re.compile("^((((31\/(0?[13578]|1[02]))|((29|30)\/(0?[1,3-9]|1[0-2])))\/(1[6-9]|[2-9]\d)?\d{2})|(29\/0?2\/(((1[6-9]|[2-9]\d)?(0[48]|[2468][048]|[13579][26])|((16|[2468][048]|[3579][26])00))))|(0?[1-9]|1\d|2[0-8])\/((0?[1-9])|(1[0-2]))\/((1[6-9]|[2-9]\d)?\d{2})) (20|21|22|23|[0-1]?\d):[0-5]?\d:[0-5]?\d$")
s = "06/01/2014 09:52:14:78"
bool(mydate.match(s))
>>> False
However, I do not know how to obtain only the interesting part, i.e. 06/01/2014 09:52:14
Any ideas?

You can use a positive lookbehind and re.sub():
>>> re.sub(r'(?<=\d{2}:\d{2}:\d{2}).*','','06/01/2014 09:52:14:78')
'06/01/2014 09:52:14'

How about the re.sub function:
>>> re.sub(r'( \d{2}(:\d{2}){2}).*$',r'\1','06/01/2014 09:52:14:78')
'06/01/2014 09:52:14'
>>> re.sub(r'( \d{2}(:\d{2}){2}).*$',r'\1','8/02/2013 09:21:14')
'8/02/2013 09:21:14'
( \d{2}(:\d{2}){2}) matches hours:min:sec, saved in capture group 1
.*$ matches the milliseconds
r'\1' replaces the match with the contents of the first capture group
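Since the goal in the question is a date object, the trimming step can be followed directly by strptime. A minimal sketch, assuming the format string from the question; the helper name parse_timestamp is made up for illustration:

```python
import re
from datetime import datetime

def parse_timestamp(text):
    # Hypothetical helper: keep "dd/mm/yyyy HH:MM:SS" and drop anything
    # that follows the seconds (for example a trailing ":78").
    trimmed = re.sub(r'( \d{2}(?::\d{2}){2}).*$', r'\1', text)
    return datetime.strptime(trimmed, '%d/%m/%Y %H:%M:%S')

print(parse_timestamp('06/01/2014 09:52:14:78'))  # 2014-01-06 09:52:14
print(parse_timestamp('18/02/2013 09:21:14'))     # 2013-02-18 09:21:14
```

strptime raises ValueError for anything that still does not fit the format, so malformed input is not silently accepted.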

Related

How to find a date with regular expressions

I have to find a date inside a string with a regular expression in Python:
astring ='L2A_T21HUB_A023645_20210915T135520'
I'm trying to get the part before the T, with the shape xxxxxxxx where every x is a digit.
desiredOutput = '20210915'
I'm new to regex, so I have no idea how to solve this.
If the astring's format is consistent, meaning it will always have the same shape with respect to the date, you can split the string by '_' and get the last substring and get the date from there as such:
astring ='L2A_T21HUB_A023645_20210915T135520'
date_split = astring.split("_") # --> ['L2A', 'T21HUB', 'A023645', '20210915T135520']
desiredOutput = date_split[3][:8] # --> [3] = '20210915T135520' [:8] gets first 8 chars
print(desiredOutput) # --> 20210915
If you wanted an actual datetime object:
>>> from datetime import datetime
>>> astring = 'L2A_T21HUB_A023645_20210915T135520'
>>> date_str = astring.split('_')[-1]
>>> datetime.strptime(date_str, '%Y%m%dT%H%M%S')
datetime.datetime(2021, 9, 15, 13, 55, 20)
From that, you can use datetime.strftime to reformat to a new string, or you can use split('T')[0] to get the string you want.
The trouble with Regex is that there can be unexpected patterns that match your expected pattern and throw things off. However, if you know that only the date portion will ever have 8 sequential digits, you can do this:
import re
date_patt = re.compile(r'\d{8}')
date = date_patt.search(astring).group(0)
You can develop more robust patterns based on your knowledge of the formatting of the incoming strings. For instance, if you know that the date will always follow an underscore, you could use a look-behind assertion:
date_patt = re.compile(r'(?<=\_)\d{8}') # look for '_' before the date, but don't capture
Hope this helps. Regex can be finicky and may take some tweaking, but hope this sets you in the right direction.
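One way to make the pattern stricter is to anchor it on both sides and then let strptime confirm that the digits form a real date. This is only a sketch: it assumes the date is always preceded by '_' and followed by 'T', and extract_date is a hypothetical helper name.

```python
import re
from datetime import datetime

# Eight digits preceded by '_' and followed by 'T'; only the digits are captured.
date_patt = re.compile(r'(?<=_)(\d{8})(?=T)')

def extract_date(text):
    # Hypothetical helper: return the yyyymmdd string, or None if absent.
    match = date_patt.search(text)
    if match is None:
        return None
    date_str = match.group(1)
    # Raises ValueError if the eight digits are not a valid calendar date.
    datetime.strptime(date_str, '%Y%m%d')
    return date_str

print(extract_date('L2A_T21HUB_A023645_20210915T135520'))  # 20210915
```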

regex replace group with specific value

I use python 2.7
I just try to change a group in a regex with a value:
import re
r = "/foo/bar/(?P<pk>[0-9]+)/"
rc = re.compile(r)
# what I am trying to do: rc["pk"] = 42 and get the result
print rc.groupindex
# prints {'pk': 1}
I need to do this because I don't know the regex, but I know that there is a group in it.
Edit:
I want to have a result like this:
rc["pk"] = 42
# now rc is /foo/bar/42 because (?P<pk>[0-9]+) is replaced with 42
I am not a python programmer, but I work with regexes a great deal in a number of other systems. I believe you can use the re.sub function with backreferences to groups like so:
Search Pattern:
'(/foo/bar/)[0-9]+(/)'
Replacement pattern:
'\g<1>42\g<2>'
This would replace
'/foo/bar/17/'
with
'/foo/bar/42/'
This would even work where the folder names are expressions themselves:
'(/\w+/\w+/)\d+(/)'
Python also supports lookaround statements, like this:
'(?<=/foo/bar/)\d+(?=/)'
Then you just replace the match with '42'. (Lookarounds do not "consume" characters, so the text in '(?<=...)' and '(?=...)' would not be replaced.)
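Both variants can be tried directly with re.sub. A short sketch using the sample path from this answer; the variable names are only for illustration:

```python
import re

path = '/foo/bar/17/'

# Backreference form: keep groups 1 and 2, swap the digits between them.
by_backref = re.sub(r'(/foo/bar/)[0-9]+(/)', r'\g<1>42\g<2>', path)

# Lookaround form: only the digits are matched, so the replacement is just '42'.
by_lookaround = re.sub(r'(?<=/foo/bar/)\d+(?=/)', '42', path)

print(by_backref)     # /foo/bar/42/
print(by_lookaround)  # /foo/bar/42/
```

The \g<1> syntax matters here: a plain \1 followed by 42 would be read as a reference to group 142.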

How do I extract some string from a long string in Python?

I have a lot of long strings - not all of them have the same length and content, so that's why I can't use indices - and I want to extract a string from all of them. This is what I want to extract:
http://www.someDomainName.com/anyNumber
SomeDomainName doesn't contain any numbers, and anyNumber is different in each long string. The code should extract the desired string from any string possible and should take into account spaces and any other odd thing that might appear in the long string. This should be possible with regex, right? Could anybody help me with this? Thank you.
Update: I should have said that www. and .com are always the same. Also someDomainName! But there's another http://www. in the string
import re
results = re.findall(r'\bhttp://www\.someDomainName\.com/\d+\b', long_string)
>>> import re
>>> pattern = re.compile("(http://www\\.)(\\w*)(\\.com/)(\\d+)")
>>> matches = pattern.search("http://www.someDomainName.com/2134")
>>> if matches:
...     print(matches.group(0))
...     print(matches.group(1))
...     print(matches.group(2))
...     print(matches.group(3))
...     print(matches.group(4))
...
http://www.someDomainName.com/2134
http://www.
someDomainName
.com/
2134
The pattern above has four capturing groups, so together with group 0 there are five groups to read:
Group 0 is the complete string that matched.
The rest follow the order of the parentheses in the pattern, so the domain name is the second one, (\\w*).
If you want, you can capture only the part of the string you are interested in: remove the parentheses from the rest of the pattern and keep just (\\w*).
>>> pattern = re.compile("http://www\\.(\\w*)\\.com/\\d+")
>>> matches = pattern.search("http://www.someDomainName.com/2134")
>>> if matches:
...     print(matches.group(1))
...
someDomainName
In the above example there are no groups 2, 3 and 4 as in the previous one, because the pattern captures only one group. Group 0 is always available: it is the complete string that matched.
Yeah, your simplest bet is regex. Here's something that will probably get the job done:
import re
matcher = re.compile(r'www\.(.+)\.com/(.+)')
matches = matcher.search(yourstring)
if matches:
    str1, str2 = matches.groups()
If you are sure that there are no dots in SomeDomainName, you can just find the first occurrence of the string ".com/" and take everything from that index on.
This avoids regular expressions, which are harder to maintain:
exp = 'http://www.aejlidjaelidjl.com/alieilael'
print(exp[exp.find('.com/') + 5:])
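For the updated question, where the domain is fixed and the text contains another http://www. link, spelling out the domain in the pattern avoids picking up the wrong URL. A small sketch; the sample text in long_string is invented for illustration:

```python
import re

# Hypothetical sample text with a second, unrelated http://www. link in it.
long_string = ('see http://www.other.org/page and '
               'http://www.someDomainName.com/2134 for details')

# The dots are escaped so they match literal dots only.
url_patt = re.compile(r'http://www\.someDomainName\.com/\d+')
urls = url_patt.findall(long_string)
print(urls)  # ['http://www.someDomainName.com/2134']
```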

Regular expression split

I have inputs similar to the following:
TV-12VX
TV-14JW
TV-2JIS
VC-224X
I need to remove everything after the numbers after the dash. The result would be:
TV-12
TV-14
TV-2
VC-224
How would I do this split via regular expressions?
The following code shows how to match strings of the form "TV-" + (some number):
>>> re.match('TV-[0-9]+','TV-12VX').group(0)
'TV-12'
(Note that, because I'm using match, this only works if the string starts with the bit you want to extract.)
I think this regex is appropriate for you: (.+?-\d+?)[a-zA-Z]. You can use it with re.findall, or re.match.
import re
p = re.match(r'([\w]{2}-\d+)', 'TV-12VX')
print(p.group(0))
Outputs
TV-12
You can remove everything after the digits with this:
re.sub(r"^(\w+-\d+).*", r"\1", input)
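Applied to the four inputs from the question, the substitution above can be sketched as follows (the list and variable names are only for illustration):

```python
import re

inputs = ['TV-12VX', 'TV-14JW', 'TV-2JIS', 'VC-224X']

# Keep the prefix, the dash and the digits; drop whatever follows.
trimmed = [re.sub(r'^(\w+-\d+).*', r'\1', item) for item in inputs]
print(trimmed)  # ['TV-12', 'TV-14', 'TV-2', 'VC-224']
```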

date matching using python regex

What am I doing wrong in the regular expression match below?
>>> import re
>>> d="30-12-2001"
>>> re.findall(r"\b[1-31][/-:][1-12][/-:][1981-2011]\b",d)
[]
[1-31] matches 1-3 and 1, which is basically 1, 2 or 3. You cannot match a number range unless it's a subset of 0-9. The same applies to [1981-2011], which matches exactly one character that is 0, 1, 2, 8 or 9.
The best solution is simply matching any number and then checking the numbers later using python itself. A date such as 31-02-2012 would not make any sense - and making your regex check that would be hard. Making it also handle leap years properly would make it even harder or impossible. Here's a regex matching anything that looks like a dd-mm-yyyy date: \b\d{1,2}[-/:]\d{1,2}[-/:]\d{4}\b
However, I would highly suggest not allowing any of -, : and / as : is usually used for times, / usually for the US way of writing a date (mm/dd/yyyy) and - for the ISO way (yyyy-mm-dd). The EU dd.mm.yyyy syntax is not handled at all.
If the string does not contain anything but the date, you don't need a regex at all - use strptime() instead.
All in all, tell the user what date format you expect and parse that one, rejecting anything else. Otherwise you'll get ambiguous cases such as 04/05/2012 (is it April 5th or May 4th?).
[1-31] does not mean what you think it means. The square bracket syntax matches a range of characters, not a range of numbers. Matching a range of numbers with a regex is possible, but unwieldy.
If you really want to use regular expressions for this (rather than a date parsing library) you'd be better off matching all numbers of the right number of digits, capturing the values, and then checking the values yourself:
>>> import re
>>> d="30-12-2001"
>>> re.findall(r"\b([0-9]{1,2})[-/:]([0-9]{1,2})[-/:]([0-9]{4})\b",d)
[('30', '12', '2001')]
You'll have to do actual date verification anyway, to catch invalid dates like 31-02-2012.
(Note that [/-:] doesn't work either, because it's interpreted as a range. Use [-/:] instead - putting the hyphen at the front prevents it being interpreted as a range separator.)
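The two character-class pitfalls are easy to check interactively. A quick sketch with re.fullmatch, which only succeeds if the whole string matches:

```python
import re

# [1-31] is the range 1-3 plus a redundant 1: one character out of 1, 2, 3.
print(bool(re.fullmatch(r'[1-31]', '3')))   # True
print(bool(re.fullmatch(r'[1-31]', '31')))  # False

# [/-:] is the range from '/' to ':', which covers the digits but not '-'.
print(bool(re.fullmatch(r'[/-:]', '7')))    # True
print(bool(re.fullmatch(r'[/-:]', '-')))    # False
```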
Regular expressions do not understand numbers; to a regular expression, 1 is just a character of the string - the same kind of thing that a is. Thus, for example, [1-31] is parsed as a character class which contains the range 1-3 and the (redundant) single symbol 1.
You do not want to use regular expressions to parse dates. There is already a built-in module for handling date parsing:
>>> import datetime
>>> datetime.datetime.strptime('30-12-2001', '%d-%m-%Y')
datetime.datetime(2001, 12, 30, 0, 0) # an object representing the date.
This also does all the secondary checks (for things like an attempt to refer to Feb. 31) for you. If you want to handle multiple types of separators, you can simply .replace them in the original string so that they all turn into the same separator, then use that in your format.
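The replace-then-parse idea can be sketched like this. It assumes day-month-year order and the separators '/', '.' and '-'; parse_date is a hypothetical helper name:

```python
from datetime import datetime

def parse_date(text):
    # Hypothetical helper: normalise '/' and '.' to '-' and parse as dd-mm-yyyy.
    normalised = text.replace('/', '-').replace('.', '-')
    try:
        return datetime.strptime(normalised, '%d-%m-%Y')
    except ValueError:
        return None  # wrong shape or impossible date

print(parse_date('30/12/2001'))  # 2001-12-30 00:00:00
print(parse_date('31-02-2012'))  # None
```

Returning None is just one choice; letting the ValueError propagate may be better if an invalid date should stop processing.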
You're probably doing it wrong. Some other replies here are helping you with the regex, but I suggest you use the datetime.strptime method to turn your formatted date into a datetime object, and do further logic with that object:
>>> import datetime
>>> datetime.datetime.strptime('30-12-2001', '%d-%m-%Y')
datetime.datetime(2001, 12, 30, 0, 0)
More info on the strptime method and its format strings is in the Python datetime documentation.
regexp = r'(0?[1-9]|[12][0-9]|3[01])/(0?[1-9]|1[012])/((19|20)\d\d)'
( #start of group #1
0?[1-9] # 01-09 or 1-9
| # ..or
[12][0-9] # 10-19 or 20-29
| # ..or
3[01] # 30, 31
) #end of group #1
/ # followed by a "/"
( # start of group #2
0?[1-9] # 01-09 or 1-9
| # ..or
1[012] # 10,11,12
) # end of group #2
/ # followed by a "/"
( # start of group #3
(19|20)\d\d # 19[0-9][0-9] or 20[0-9][0-9]
) # end of group #3
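The commented breakdown can be kept inside the pattern itself by compiling with re.VERBOSE, which ignores whitespace and # comments in the pattern. A sketch of the same expression; the variable name date_patt is only for illustration:

```python
import re

date_patt = re.compile(r'''
    (0?[1-9]|[12][0-9]|3[01])   # group 1: day, 1-31 with optional leading zero
    /
    (0?[1-9]|1[012])            # group 2: month, 1-12 with optional leading zero
    /
    ((19|20)\d\d)               # group 3: year, 1900-2099
''', re.VERBOSE)

match = date_patt.fullmatch('30/12/2001')
print(match.group(1), match.group(2), match.group(3))  # 30 12 2001
```

The pattern only checks the shape of the date, so something like 31/02/2012 still matches and needs a separate validity check.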
Maybe you can try this regex:
^((0|1|2)[0-9]{1}|(3)[0-1]{1})/((0)[0-9]{1}|(1)[0-2]{1})/((19)[0-9]{2}|(20)[0-9]{2})$
This matches days 01 to 31, months 01 to 12 and years 1900 to 2099 in the form dd/mm/yyyy. Note that it also accepts 00 as a day or month.
