How to separate strings using regex? - python

I have a datetime string, and I want to separate the hours from the minutes and then print each of them on its own line.
Here is the code:
import re
from datetime import datetime
ct = datetime.now().strftime("%H:%M")
time = "12:30"
minus = datetime.strptime(time,"%H:%M") - datetime.strptime(ct,"%H:%M")
minus = datetime.strptime(str(minus),"%H:%M:%S").strftime("%H:%M")
# print(minus)
regex2 = re.compile(r'(\d)+:(\d)+')
match = regex2.search(minus)
print(match.group(0))
If the variable minus holds 01:22, I want 01 and 22 to be printed on separate lines.
The output should look like:
01
22

You don't need regex for that.
You can use timedelta directly:
from datetime import datetime
ct = datetime.now().strftime("%H:%M")
time = "10:30"
minus = datetime.strptime(time,"%H:%M") - datetime.strptime(ct,"%H:%M")
## extract whatever values you want from this delta:
## eg:
seconds = minus.seconds
hours = minus.seconds//3600
minutes = (minus.seconds//60)%60
print(hours)
print(minutes)

Setting aside the fact that your script fails with a runtime error at line 6 (which prevents the regex part of the script from producing any output), to capture the digits on each side of the colon you need to put the repetition marker (the plus sign) inside the capturing markers (the parentheses).
So revising your regex:
regex2 = re.compile(r'(\d)+:(\d)+')
as follows:
regex2 = re.compile(r'(\d+):(\d+)')
Will achieve that.
The second thing you need to do is print the matching groups (as opposed to the entire matched text - i.e. group(0)).
Printing the first and second matching groups is achieved as follows:
print(match.group(1))
print(match.group(2))
The difference between your regular expression and mine is in what each group ends up capturing on a match.
In your case, (\d)+ repeats a group that holds a single digit; each repetition overwrites the previous capture, so only the last digit is kept.
In my case, (\d+) is a single group containing one or more digits. I then use this pattern twice, once on each side of the colon.
So, for input like 01:22, your regex captures just "1" and "2" (the last digit on each side), whereas mine captures "01" and "22", which I think is what you are trying to achieve.
Similarly, for input like "0:122", your regular expression would capture "0" and "2", losing the digits in between. My RE would capture "0" and "122", correctly telling you which digits sit on each side of the colon.
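The difference between the two quantifier positions can be checked directly. This is a minimal sketch using the sample value 01:22 from the question; the variable names outside and inside are only illustrative. In Python's re module a repeated capturing group keeps only its last repetition.

```python
import re

text = "01:22"

# Quantifier outside the group: the group is captured again on every
# repetition, so only the last digit on each side survives.
outside = re.search(r"(\d)+:(\d)+", text)
print(outside.groups())  # ('1', '2')

# Quantifier inside the group: each group holds the whole run of digits.
inside = re.search(r"(\d+):(\d+)", text)
hours, minutes = inside.groups()
print(hours)    # 01
print(minutes)  # 22
```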

Use the regex below:
'([^:]+)'
See regex explanation here: https://regex101.com/r/EgXlcD/46
import re
from datetime import datetime
ct = datetime.now().strftime("%H:%M")
time = "10:30"
minus = datetime.strptime(time,"%H:%M") - datetime.strptime(ct,"%H:%M")
minus = datetime.strptime(str(minus),"%H:%M:%S").strftime("%H:%M")
#print(minus)
matches = re.findall(r'([^:]+)',minus)
for match in matches:
    print(match)

Related

Is there a way to find (potentially) multiple results with re.search?

While parsing file names of TV shows, I would like to extract information about them to use for renaming. I have a working model, but it currently uses 28 if/elif statements for every iteration of filename I've seen over the last few years. I'd love to be able to condense this to something that I'm not ashamed of, so any help would be appreciated.
Phase one of this code repentance is to hopefully grab multiple episode numbers. I've gotten as far as the code below, but in the first entry it only displays the first episode number and not all three.
import re
def main():
    pattern = '(.*)\.S(\d+)[E(\d+)]+'
    strings = ['blah.s01e01e02e03', 'foo.s09e09', 'bar.s05e05']
    #print(strings)
    for string in strings:
        print(string)
        result = re.search("(.*)\.S(\d+)[E(\d+)]+", string, re.IGNORECASE)
        print(result.group(2))
if __name__== "__main__":
    main()
This outputs:
blah.s01e01e02e03
01
foo.s09e09
09
bar.s05e05
05
It's probably trivial, but regular expressions might as well be Cuneiform most days. Thanks in advance!
No. You can use findall to find all e\d+, but it cannot find overlapping matches, which makes it impossible to use s\d+ together with it (i.e. you can't distinguish the e007 in "foo.s01e006e007" from the one in "age007.s01e001"), and Python doesn't let you use variable-length lookbehind (to make sure s\d+ comes before it without overlapping).
The way to do this is to find \.s\d+((?:e\d+)+)$ then split the resultant group 1 in another step (whether by using findall with e\d+, or by splitting with (?<!^)(?=e)).
import re
text = 'blah.s01e01e02e03'
match = re.search(r'\.(s\d+)((?:e\d+)+)$', text, re.I)
season = match.group(1)
episodes = re.findall(r'e\d+', match.group(2), re.I)
print(season, episodes)
# => s01 ['e01', 'e02', 'e03']
Using re.findall instead of re.search will return a list of all matches.
If you can use the regex module from PyPI, you could use repeated capture groups in the pattern and then call .captures()
For example:
import regex
s = "blah.s01e01e02e03"
pattern = r"\.(s\d+)(e\d+)+"
m = regex.search(pattern, s, regex.IGNORECASE)
if m:
    print(m.captures(1)[0], m.captures(2))
Output:
s01 ['e01', 'e02', 'e03']
See a Python demo and a regex101 demo.
Or use .capturesdict() with named capture groups.
For example:
import regex
s = "blah.s01e01e02e03"
pattern = r"\.(?P<season>s\d+)(?P<episodes>e\d+)+"
m = regex.search(pattern, s, regex.IGNORECASE)
if m:
    print(m.capturesdict())
Output:
{'season': ['s01'], 'episodes': ['e01', 'e02', 'e03']}
See a Python demo.
Note that the notation [E(\d+)] that you used is a character class, which matches one of the listed characters: E, (, a digit, + or ).
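To make the character-class point concrete, here is a small sketch comparing the bracketed form with a non-capturing group. The test strings are made up for illustration and do not come from the question.

```python
import re

# [E(\d+)] is a character class: it matches ONE character that is
# 'E', '(', a digit, '+' or ')'. Repeating it accepts any mix of those.
char_class = re.compile(r"[E(\d+)]+", re.IGNORECASE)
print(char_class.fullmatch("(+)e9") is not None)  # True

# A non-capturing group repeats the whole "E followed by digits" unit.
unit = re.compile(r"(?:E\d+)+", re.IGNORECASE)
print(unit.fullmatch("(+)e9") is not None)    # False
print(unit.fullmatch("e01e02e03").group(0))   # e01e02e03
```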

Unable to extract date of birth from a given random format

I have a set of text files from which I have to extract the date of birth. The code below extracts the date of birth from most of the text files, but it fails when the date is given in the format below. How can I extract the DOB? The data is very non-uniform and broken.
Code:
import re
string = """ This is python to extract date
D
.O.B.
:
14
J
u
n
e
199
1
work in a team or as individual
contributor.
And Name is: Zon; DOB: 12/23/
1955 11/15/2014 11:53 AM"""
pattern = re.findall(r'.*?D.O.B.*?:\s+([\d]{1,2}\s(?:JAN|NOV|OCT|DEC|June)\s[\d]{4})', string)
pattern2 = re.findall(r'.*?DOB.*?:\s+([\d/]+)', string)
print(pattern)
print(pattern2)
Expected Output:
['14 June 1991']
['12/23/1955']
Working with dates and times is always a nightmare for developers, for many reasons. In your case, you are trying to extract the date of birth, which is marked with a DOB prefix, with or without separators.
I suggest not using and maintaining a lot of regexes in the code, since you said the date formats can vary. You can use a good library like python-dateutil; install it from PyPI with pip install python-dateutil
All you have to do is find a good candidate section of the text and use the library to parse it. For example, in your case, find the section of text that contains the date, like this:
import re
from dateutil.parser import parse
in_str = """DOB: 14 June 1991
work in a team or as individual
contributor"""
# find DOB prefixed string patterns
candidates = re.findall(r"D\.?O\.?B\.?:.*\d{4}\b", in_str)
# parse the dates from the candidates; fuzzy=True lets the parser skip the "DOB:" prefix
parsed_dates = [parse(dt, fuzzy=True) for dt in candidates]
print(parsed_dates)
This will give you an output like
[datetime.datetime(1991, 6, 14, 0, 0)]
From here, you can manipulate or process them easily. Isolating the date-containing sections first is not strictly necessary for the date parser to work, but it minimizes your work as well.
For the first pattern, you can add matching optional whitespace chars between the single characters.
\bD\s*\.\s*O\s*\.\s*B[^:]*:\s+(\d{1,2}\s*(?:JAN|NOV|OCT|DEC|J\s*u\s*n\s*e)(?:\s*\d){4})
Then in the match, remove the newlines.
See a regex demo and a Python demo.
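As a rough sketch of how the first pattern could be applied: the input below is a shortened, hypothetical version of the broken text, and the tidy helper (including the step that re-inserts spaces between day, month and year) is an assumption added for illustration, not part of the answer above.

```python
import re

pattern = re.compile(
    r"\bD\s*\.\s*O\s*\.\s*B[^:]*:\s+"
    r"(\d{1,2}\s*(?:JAN|NOV|OCT|DEC|J\s*u\s*n\s*e)(?:\s*\d){4})"
)

# Shortened, hypothetical sample with the date broken across lines.
s = "extract date\nD\n.O.B.\n:\n14\nJ\nu\nn\ne\n199\n1\nwork in a team"


def tidy(raw):
    """Drop all whitespace, then put spaces back between day, month, year."""
    compact = re.sub(r"\s+", "", raw)  # e.g. "14June1991"
    return re.sub(r"^(\d{1,2})([A-Za-z]+)(\d{4})$", r"\1 \2 \3", compact)


dates = [tidy(m) for m in pattern.findall(s)]
print(dates)  # ['14 June 1991']
```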
For the second pattern, you can match optional whitespace chars around the / and then remove the whitespace chars from the matches.
\bDOB.*?:\s+(\d\d\s*/\s*\d\d\s*/\s*\d{4})\b
See another regex demo and a Python demo.
For example
import re
pattern = r"\bDOB.*?:\s+(\d\d\s*/\s*\d\d\s*/\s*\d{4})\b"
s = (" This is python to extract date\n"
"D\n"
".O.B.\n"
": \n"
"14 \n"
"J\n"
"u\n"
"n\n"
"e \n\n"
"199\n"
"1\n"
"work in a team or as individual \n"
"contributor.\n"
"And Name is: Zon; DOB: 12/23/\n"
" 1955 11/15/2014 11:53 AM")
res = [re.sub(r"\s+", "", s) for s in re.findall(pattern, s)]
print(res)
Output
['12/23/1955']
If no other colon should occur between DOB and the colon that precedes the "date" part, you can also use a negated character class that excludes the colon instead of .*?
\bDOB[^:]*:\s+(\d\d\s*/\s*\d\d\s*/\s*\d{4})\b
Regex demo
I agree with @Kris that you should try to have as few regexes to maintain as possible, and make them as simple as possible. You should also, as he suggested, divide your problem into 2 steps:
1/ extracting candidates
2/ parsing (using, for example, dateutil.parser.parse)
step 1: extracting candidates
One solution for making regex patterns simpler is to manipulate the input string (if possible).
For example, in your case the difficulty comes from varying newlines and spaces. Taking your example again:
import re
s1 = """ This is python to extract date
D
.O.B.
:
14
J
u
n
e
199
1
work in a team or as individual
contributor.
And Name is: Zon; DOB: 12/23/
1955 11/15/2014 11:53 AM"""
You can create s2, which removes newlines and spaces:
s2 = s1.replace("\n", "").replace(" ", "")
Then your pattern becomes simpler:
pattern = re.compile(r"D\.?O\.?B\.?:(?P<date_of_birth>(.*?)(\d{4}))")
(see pattern explanation below)
Match the pattern with your simplified string:
matches = [m.group('date_of_birth') for m in pattern.finditer(s2)]
You get:
>>> print(matches)
['14June1991', '12/23/1955']
step 2: parsing candidates to date objects
@Kris's suggestion works very well:
import dateutil.parser
dobs = [dateutil.parser.parse(m) for m in matches]
You get your expected result:
>>> print(dobs)
[datetime.datetime(1991, 6, 14, 0, 0), datetime.datetime(1955, 12, 23, 0, 0)]
You can then use strftime if you want to turn all your dates into tidy, standardized strings:
dobs_pretty = [d.strftime('%Y-%m-%d') for d in dobs]
>>> print(dobs_pretty)
['1991-06-14', '1955-12-23']
Pattern explanation
D\.?O\.?B\.?: you look for "DOB", with or without periods (hence the ? operator)
(?P<date_of_birth>(.*?)(\d{4})): you capture everything to the right of "DOB" until you find 4 consecutive digits (representing the year). (.*?) captures everything "up until" (\d{4}) (the 4 consecutive digits)
?P<date_of_birth> allows you to name the captured group, making retrieving the date much easier. Group names must be valid Python identifiers, so the name uses underscores rather than hyphens. You simply put the group name (date_of_birth) in the group() method: m.group('date_of_birth')

Pandas: count dots in a string - same as length?

I'm trying to count the number of dots in an email address using Python + Pandas.
The first record is "addison.shepherd@gmail.com". It should count 2 dots. Instead, it returns 26, the length of the string.
import pandas as pd
url = "http://profalibania.com.br/python/EmailsDoctors.xlsx"
docs = pd.read_excel(url)
docs["PosAt"] = docs["Email"].str.count('.')
Can anybody help me? Thanks in advance!
pandas.Series.str.count takes a regular expression as input. To match a literal period (.), you must escape it:
docs["Email"].str.count(r'\.')
Just specifying . uses the regex meaning of the period (matching any single character).
The .str.count(..) method works with a regular expression. This is specified in the documentation:
This function is used to count the number of times a particular regex pattern is repeated in each of the string elements of the Series.
In a regex, the dot means "any character except a newline". You can use a character set instead (by surrounding the dot with square brackets):
docs["PosAt"] = docs["Email"].str.count('[.]')
A variant here would be to compare the length of the original email column with the length of that column with all dots removed:
docs["Email"].str.len() - docs["Email"].str.replace(".", "", regex=False).str.len()
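The same effect can be reproduced without pandas, using only the standard re module. This is an illustrative sketch on a single address shaped like the one in the question, not a replacement for the .str.count call.

```python
import re

email = "addison.shepherd@gmail.com"

# An unescaped dot matches any character except a newline,
# so the count equals the length of the string.
any_char = len(re.findall(".", email))

# An escaped dot or a one-character class matches only a literal period.
escaped = len(re.findall(r"\.", email))
in_class = len(re.findall("[.]", email))

print(any_char, escaped, in_class)  # 26 2 2
```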

python re match string with integer

I need to match strings like: '2017-08-09,08:59:20.445 INFO {peers_peak_parameters_grid} [eval_peers_peak] Evaluating batch 0 out of 2158',
I have tried different regular expressions such as: comp = re.compile("Evaluating batch ^[-+]?[0-9]+$ out of ^[-+]?[0-9]+$")
and this is an example usage:
def get_batch_process_time(log):
    loglines = log.splitlines()
    comp = re.compile("Evaluating batch ^[-+]?[0-9]+$ out of ^[-+]?[0-9]+$")
    times = []
    matches = []
    for i, line in enumerate(loglines):
        if comp.search(line):
            time = string2datetime(line.split(' ')[0])
            times.append(time)
            matches.append(line)
    return np.array(times), matches
Unfortunately none of the lines seems to match the given pattern. I assume that I'm using the wrong regular expression.
What is the right regular expression?
Am I using re correctly? (should I use match rather than search?)
^[-+]?[0-9]+$ alone would match a whole string consisting of an optional plus or minus sign followed by a non-empty sequence of digits.
When I say a whole string, it's because ^ and $ are "anchors" that match the start and end of the string respectively. Placed in the middle of your pattern they cannot match, which is why your regex doesn't work.
I suppose you could also remove the optional sign part, i.e. [-+]?.
You could have found that out by yourself by testing your regex in regex101 (check the explanation panel on the top right) or a similar utility.
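Putting that advice together, one possible pattern drops the anchors and the sign and captures the two numbers. This is a sketch under those assumptions; the name BATCH_RE is illustrative, and the log line is the one quoted in the question.

```python
import re

# No ^ or $ inside the pattern, and no optional sign: just two runs of digits.
BATCH_RE = re.compile(r"Evaluating batch (\d+) out of (\d+)")

line = ("2017-08-09,08:59:20.445 INFO {peers_peak_parameters_grid} "
        "[eval_peers_peak] Evaluating batch 0 out of 2158")

# search() scans the whole line; match() only tries at the very start,
# so it would not find text that appears after the timestamp.
m = BATCH_RE.search(line)
batch, total = (int(g) for g in m.groups())
print(batch, total)          # 0 2158
print(BATCH_RE.match(line))  # None
```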

date matching using python regex

What am I doing wrong in the regular expression match below?
>>> import re
>>> d="30-12-2001"
>>> re.findall(r"\b[1-31][/-:][1-12][/-:][1981-2011]\b",d)
[]
[1-31] matches the range 1-3 and the character 1, which is basically 1, 2 or 3. You cannot match a number range unless it's a subset of 0-9. The same applies to [1981-2011], which matches exactly one character that is 0, 1, 2, 8 or 9.
The best solution is simply matching any number and then checking the numbers later using python itself. A date such as 31-02-2012 would not make any sense - and making your regex check that would be hard. Making it also handle leap years properly would make it even harder or impossible. Here's a regex matching anything that looks like a dd-mm-yyyy date: \b\d{1,2}[-/:]\d{1,2}[-/:]\d{4}\b
However, I would highly suggest not allowing any of -, : and / as : is usually used for times, / usually for the US way of writing a date (mm/dd/yyyy) and - for the ISO way (yyyy-mm-dd). The EU dd.mm.yyyy syntax is not handled at all.
If the string does not contain anything but the date, you don't need a regex at all - use strptime() instead.
All in all, tell the user what date format you expect and parse that one, rejecting anything else. Otherwise you'll get ambiguous cases such as 04/05/2012 (is it April 5th or May 4th?).
[1-31] does not mean what you think it means. The square bracket syntax matches a range of characters, not a range of numbers. Matching a range of numbers with a regex is possible, but unwieldy.
If you really want to use regular expressions for this (rather than a date parsing library) you'd be better off matching all numbers of the right number of digits, capturing the values, and then checking the values yourself:
>>> import re
>>> d="30-12-2001"
>>> re.findall(r"\b([0-9]{1,2})[-/:]([0-9]{1,2})[-/:]([0-9]{4})\b",d)
[('30', '12', '2001')]
You'll have to do actual date verification anyway, to catch invalid dates like 31-02-2012.
(Note that [/-:] doesn't work either, because it's interpreted as a range. Use [-/:] instead - putting the hyphen at the front prevents it being interpreted as a range separator.)
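The "capture, then check the values yourself" approach might look like the sketch below. The helper name find_valid_dates and the sample text are hypothetical; the 1981 to 2011 year limits are an assumption about what the question's [1981-2011] was meant to express.

```python
import re
from datetime import date

DATE_SHAPE = re.compile(r"\b([0-9]{1,2})[-/:]([0-9]{1,2})[-/:]([0-9]{4})\b")


def find_valid_dates(text, first_year=1981, last_year=2011):
    """Return date objects for dd-mm-yyyy matches that are real calendar dates."""
    found = []
    for day, month, year in DATE_SHAPE.findall(text):
        try:
            candidate = date(int(year), int(month), int(day))
        except ValueError:
            continue  # right shape, impossible date (e.g. 31 February)
        if first_year <= candidate.year <= last_year:
            found.append(candidate)
    return found


print(find_valid_dates("30-12-2001 and 31-02-2001"))
# [datetime.date(2001, 12, 30)]
```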
Regular expressions do not understand numbers; to a regular expression, 1 is just a character of the string - the same kind of thing that a is. Thus, for example, [1-31] is parsed as a character class which contains the range 1-3 and the (redundant) single symbol 1.
You do not want to use regular expressions to parse dates. There is already a built-in module for handling date parsing:
>>> import datetime
>>> datetime.datetime.strptime('30-12-2001', '%d-%m-%Y')
datetime.datetime(2001, 12, 30, 0, 0) # an object representing the date.
This also does all the secondary checks (for things like an attempt to refer to Feb. 31) for you. If you want to handle multiple types of separators, you can simply .replace them in the original string so that they all turn into the same separator, then use that in your format.
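A minimal sketch of the separator-replacement idea follows. The function name parse_dmy and the set of accepted separators are assumptions made for illustration.

```python
from datetime import datetime


def parse_dmy(text, separators="/:."):
    """Parse a dd-mm-yyyy date after mapping other separators to '-'."""
    for sep in separators:
        text = text.replace(sep, "-")
    return datetime.strptime(text, "%d-%m-%Y")


print(parse_dmy("30/12/2001"))  # 2001-12-30 00:00:00

# strptime itself rejects impossible dates with a ValueError.
try:
    parse_dmy("31-02-2012")
except ValueError as exc:
    print("rejected:", exc)
```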
You're probably doing it wrong. Some other replies here are helping you with the regex, but I suggest you use the datetime.strptime method to turn your formatted date into a datetime object, and do further logic with that object:
>>> from datetime import datetime
>>> datetime.strptime('30-12-2001', '%d-%m-%Y')
datetime.datetime(2001, 12, 30, 0, 0)
More info on the strptime method and its format strings.
regexp = r'(0?[1-9]|[12][0-9]|3[01])/(0?[1-9]|1[012])/((19|20)\d\d)'
(                 # start of group #1
  0?[1-9]         # 01-09 or 1-9
  |               # ..or
  [12][0-9]       # 10-19 or 20-29
  |               # ..or
  3[01]           # 30, 31
)                 # end of group #1
/                 # followed by a "/"
(                 # start of group #2
  0?[1-9]         # 01-09 or 1-9
  |               # ..or
  1[012]          # 10, 11, 12
)                 # end of group #2
/                 # followed by a "/"
(                 # start of group #3
  (19|20)\d\d     # 19[0-9][0-9] or 20[0-9][0-9]
)                 # end of group #3
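The commented breakdown can be kept inside the code itself with re.VERBOSE, which ignores whitespace and # comments in the pattern. This is a sketch using the same regex; the name DATE_RE and the sample dates are illustrative. It also shows that the pattern checks only the shape of the value, not whether the date exists.

```python
import re

DATE_RE = re.compile(r"""
    (0?[1-9]|[12][0-9]|3[01])   # group 1: day, 1-31
    /
    (0?[1-9]|1[012])            # group 2: month, 1-12
    /
    ((19|20)\d\d)               # group 3: year; group 4 is the century
    """, re.VERBOSE)

m = DATE_RE.fullmatch("30/12/2001")
print(m.group(1), m.group(2), m.group(3))  # 30 12 2001

# Right shape, impossible date: the regex still accepts it.
print(DATE_RE.fullmatch("31/02/2012") is not None)  # True
print(DATE_RE.fullmatch("32/12/2001"))              # None
```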
Maybe you can try this regex:
^((0|1|2)[0-9]{1}|(3)[0-1]{1})/((0)[0-9]{1}|(1)[0-2]{1})/((19)[0-9]{2}|(20)[0-9]{2})$
This matches (00 to 31)/(00 to 12)/(1900 to 2099). Note that it also accepts 00 as a day or month, so the values still need checking afterwards.
