I am using regex to extract the month and year of pairs of dates in text:
import re

regex = (
r"((Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(tember)?(t)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)"
r"\s?[\.\s\’\’\,\/\'\,\‘\-\–\—]?\s?(\d{4}|\d{2})?\s?\s?((to)|[\|\-\–\—])\s?\s?"
r"((Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(tember)?(t)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)"
r"\s?[\.\s\’\’\,\/\'\,\‘\-\–\—]?\s?(\d{4}|\d{2})|(Present|Now|till\s?(now|date|today)?|current)))"
)
When I test the regex on inputs where some dates include the day of the month and others do not:
lst = [
'July 2014 - 28th August 2014',
'Jan 2012 - 3rd sep 2014',
'Jan 2008 - May 2012',
'Jan 2008 and May 2012'
]
for i in lst:
    word = re.finditer(regex, i, re.IGNORECASE)
    for match in word:
        print(match.group())
I get the following output:
Jan 2008 - May 2012
but my expected output is:
July 2014 - August 2014
Jan 2012 - sep 2014
Jan 2008 - May 2012
What do I need to change to make the regex match text with an optional day in the date? When a date string includes the day, it is always an ordinal number with a st, nd, rd or th suffix.
You cannot "skip" part of a string during a single match operation, so if you have 26th August, you can't match or capture just 26 August. In these cases, you either need to capture parts of the match and then concatenate them, or replace the parts you do not need as a post-processing step.
So here I'd use the post-processing replacement approach:
import re
day = r'(?:((?:0?[1-9]|[12]\d|3[01])(?:\s*(?:st|[rn]d|th))?)\s*)?'
month = r'(Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?|Aug(?:ust)?|Sep(?:t(?:ember)?)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)'
year = r'(\d{2}(?:\d{2})?)'
rx_valid = re.compile( fr'\b{day}{month}\s*{year}\s*[-—–]\s*{day}{month}\s*{year}(?!\d)', re.IGNORECASE )
rx_ordinal = re.compile( r'\s*\d+\s*(?:st|[rn]d|th)', re.IGNORECASE )
lst = [
'July 2014 - 28th August 2014',
'Jan 2012 - 3rd sep 2014',
'Jan 2008 - May 2012',
'Jan 2008 and May 2012'
]
for i in lst:
    word = rx_valid.finditer(i)
    for match in word:
        print(rx_ordinal.sub("", match.group()))
Output:
July 2014 - August 2014
Jan 2012 - sep 2014
Jan 2008 - May 2012
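The other option mentioned above, capturing the parts you need and concatenating them, could look roughly like the sketch below. The names DAY, MONTH, YEAR, RANGE_RX and month_year_ranges are made up for this illustration, and the day part is assumed to always carry an ordinal suffix, as stated in the question.

```python
import re

# Hypothetical names for this sketch; the month alternatives mirror the answer above.
MONTH = (r'(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?'
         r'|Aug(?:ust)?|Sep(?:t(?:ember)?)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)')
# Optional ordinal day such as "28th " or "3rd " (assumed to always have a suffix).
DAY = r'(?:\d{1,2}(?:st|nd|rd|th)\s+)?'
YEAR = r'\d{4}|\d{2}'

# \u2013 and \u2014 are the en dash and em dash; the day is matched but not captured.
RANGE_RX = re.compile(
    rf'\b{DAY}(?P<m1>{MONTH})\s*(?P<y1>{YEAR})\s*[-\u2013\u2014]\s*'
    rf'{DAY}(?P<m2>{MONTH})\s*(?P<y2>{YEAR})(?!\d)',
    re.IGNORECASE,
)

def month_year_ranges(text):
    """Rebuild each 'month year - month year' range from the named groups."""
    return [f"{m['m1']} {m['y1']} - {m['m2']} {m['y2']}"
            for m in RANGE_RX.finditer(text)]

for line in ['July 2014 - 28th August 2014', 'Jan 2012 - 3rd sep 2014',
             'Jan 2008 - May 2012', 'Jan 2008 and May 2012']:
    print(month_year_ranges(line))
```

Note that rebuilding the string this way normalizes the separator to " - " whatever dash the input used, which may or may not be what you want.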
I have a table that contains dates in multiple formats. It also contains some unwanted text, which I want to drop so that I can process the date strings.
Data :
sr.no. col_1 col_2
1 'xper may 2022 - nov 2022' 'derh 06/2022 - 07/2022 ubj'
2 'sp# 2021 - 2022' 'zpt May 2022 - December 2022'
Expected Output :
sr.no. col_1 col_2
1 'may 2022 - nov 2022' '06/2022 - 07/2022'
2 '2021 - 2022' 'May 2022 - December 2022'
def keep_valid_characters(string):
    return re.sub(r'(?i)\b(jan(uary)?|feb(ruary)?|mar(ch)?|apr(il)?|may|jun(e)?|jul(y)?|aug(ust)?|sep(tember)?|oct(ober)?|nov(ember)?|dec(ember)?)\b|[^a-z0-9/-]', '', string)
I am using the above pattern to drop the unwanted text, but I am stuck. Is there another approach?
In a complicated case like this, you can build the pattern from several smaller strings:
months = r"jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|june?|july?|aug(?:ust)?|sep(?:tember)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?"
pat = rf"(?i)((?:{months})?\s*[\d/]+\s*-\s*(?:{months})?\s*[\d/]+)"
df[["col_1", "col_2"]] = df[["col_1", "col_2"]].transform(lambda x: x.str.extract(pat)[0])
print(df)
Prints:
sr.no. col_1 col_2
0 1 may 2022 - nov 2022 06/2022 - 07/2022
1 2 2021 - 2022 May 2022 - December 2022
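For checking the pattern without a DataFrame, here is a small sketch that uses only the re module on the sample strings. The helper name extract_range is invented for this example. One thing worth verifying on your own data: because the month part is optional and is followed by `\s*`, a match can begin on the space in front of a bare number (as in 'sp# 2021 - 2022'), so this sketch strips the result.

```python
import re

months = (r"jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|june?|july?"
          r"|aug(?:ust)?|sep(?:tember)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?")
# Same structure as the pattern above, compiled once with the ignore-case flag.
pat = re.compile(rf"(?:{months})?\s*[\d/]+\s*-\s*(?:{months})?\s*[\d/]+", re.IGNORECASE)

def extract_range(s):
    """Return the first date range found in s, or None (hypothetical helper)."""
    m = pat.search(s)
    # strip() removes a leading space picked up when no month name precedes the number
    return m.group().strip() if m else None

samples = ['xper may 2022 - nov 2022', 'derh 06/2022 - 07/2022 ubj',
           'sp# 2021 - 2022', 'zpt May 2022 - December 2022']
for s in samples:
    print(repr(extract_range(s)))
```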
Tue Aug 21 17:02:26 2018 (gtgrhrthrhrhrthhhthrthrhrh)
fjfpjpgporejpejgjr[eh[[[jh[j[ej[[ej[ej[e]]]]
fkw[kgkeg[ekrk[ekg[kergk[erkg[eg[kg]
Tue Aug 21 17:31:06 2018 ( ijwejfwfjwpfjwf[[few[jjfwfefwfeffeww]]
fiowhfiweohewhfpwfhpfhpepwehfphpwhfpehfpwfh
f,wfpewfefewgpwpg,pewgp
Tue Aug 21 18:10:42 2018 ( reijpjfpjejferjfrejfpjefjer
k[pfk[epkf[kr[ek[ke[gkk]
r[g[keprkgpekg[rkg[pkg[ekg]
Above is an example of the content in the text file. I want to extract a string with re.
How should I construct the findall condition to achieve the expected result below? I have tried the following:
match=re.findall(r'[Tue\w]+2018$',data2)
but it is not working. I understand that $ is the symbol for the end of the string. How can I do it?
Expected Result is:
Tue Aug 21 17:02:26 2018
Tue Aug 21 17:31:06 2018
Tue Aug 21 18:10:42 2018
.
.
.
Use the pattern:
^Tue.*?2018
^ Assert position beginning of line.
Tue Literal substring.
.*? Match anything lazily.
2018 Match literal substring.
Since you are working with a multiline string and you want to match the pattern at the beginning of each line, you have to use the re.MULTILINE flag; without it, ^ only matches at the very start of the string.
import re
mystr="""
Tue Aug 21 17:02:26 2018 (gtgrhrthrhrhrthhhthrthrhrh)
fjfpjpgporejpejgjr[eh[[[jh[j[ej[[ej[ej[e]]]]
fkw[kgkeg[ekrk[ekg[kergk[erkg[eg[kg]
Tue Aug 21 17:31:06 2018 ( ijwejfwfjwpfjwf[[few[jjfwfefwfeffeww]]
fiowhfiweohewhfpwfhpfhpepwehfphpwhfpehfpwfh
f,wfpewfefewgpwpg,pewgp
Tue Aug 21 18:10:42 2018 ( reijpjfpjejferjfrejfpjefjer
k[pfk[epkf[kr[ek[ke[gkk]
r[g[keprkgpekg[rkg[pkg[ekg]
"""
print(re.findall(r'^Tue.*?2018',mystr,re.MULTILINE))
Prints:
['Tue Aug 21 17:02:26 2018', 'Tue Aug 21 17:31:06 2018', 'Tue Aug 21 18:10:42 2018']
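To see what the flag changes, and as one possible next step, the sketch below compares the pattern with and without re.MULTILINE and then parses the timestamps with datetime.strptime. The shortened log text, the stricter stamp_rx pattern, and the format string are assumptions made for illustration; %a and %b depend on the locale, so this assumes English day and month names.

```python
import re
from datetime import datetime

# Shortened stand-in for the file content; it starts directly with a timestamp.
log = (
    "Tue Aug 21 17:02:26 2018 (gtgrhrthrh)\n"
    "fjfpjpgporejpejgjr[eh[[[jh\n"
    "Tue Aug 21 17:31:06 2018 ( ijwejfwfjw\n"
    "f,wfpewfefewgpwpg,pewgp\n"
    "Tue Aug 21 18:10:42 2018 ( reijpjfpj\n"
)

# Without the flag, ^ matches only at the start of the whole string.
without_flag = re.findall(r'^Tue.*?2018', log)
# With the flag, ^ also matches right after every newline.
with_flag = re.findall(r'^Tue.*?2018', log, re.MULTILINE)

# A stricter pattern (an assumption about the format) that accepts any weekday.
stamp_rx = re.compile(
    r'^[A-Z][a-z]{2} [A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2} \d{4}',
    re.MULTILINE,
)
stamps = stamp_rx.findall(log)
parsed = [datetime.strptime(s, "%a %b %d %H:%M:%S %Y") for s in stamps]

print(len(without_flag), len(with_flag))
print(parsed)
```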
I have a text that will change weekly:
text = "Weekly Comparison, Week 50 October 28 - November 3, 2016 October 30 - November 5, 2015"
I'm looking for regex patterns for year 1, and year 2.
(Both will change weekly so I need the formula to capture all months, days, years)
My output should be the following:
2015 = November 5, 2015
2016 = November 3, 2016
The framework I'm using does not allow for regex capture groups or splits, so I need the formula to be specialized for this type of string.
Thanks!
Code
As per my original comments
See regex in use here
(\w+\s+\d+,\s*(\d+))
Note: The above regex and the regex on regex101 do not match. This is done purposely. Regex101 can only demonstrate the output of substitutions, thus I've prepended .*? to the regex in order to properly display the expected output.
Results
Input
Weekly Comparison, Week 50 October 28 - November 3, 2016 October 30 - November 5, 2015
Output
2016 = November 3, 2016
2015 = November 5, 2015
Usage
import re
regex = r"(\w+\s+\d+,\s*(\d+))"
text = "Weekly Comparison, Week 50 October 28 - November 3, 2016 October 30 - November 5, 2015"
for date, year in re.findall(regex, text):
    print(year + ' = ' + date)
You can try this:
text = "Weekly Comparison, Week 50 October 28 - November 3, 2016 October 30 - November 5, 2015"
import re
final_data = sorted(["{} = {}".format(re.findall(r"\d+$", i)[0], i) for i in re.findall(r"[a-zA-Z]+\s\d+,\s\d+", text)], key=lambda x: int(re.findall(r"^\d+", x)[0]))
Output:
['2015 = November 5, 2015', '2016 = November 3, 2016']
Using @ctwheels' regex:
text = "Weekly Comparison, Week 50 October 28 - November 3, 2016 October 30 - November 5, 2015"
import re
result = [(date.split(",")[1].strip(), date) for date in re.findall(r'\w+\s+\d+,\s*\d+', text)]
print(result)
# [('2016', 'November 3, 2016'), ('2015', 'November 5, 2015')]
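Since the question says the framework allows neither capture groups nor splits, one possible workaround is to use two separate patterns whose whole match is the wanted date. This is only a sketch: it assumes the framework supports negative lookahead, and the names first_rx and last_rx are invented here. Python is used just to check the patterns.

```python
import re

text = "Weekly Comparison, Week 50 October 28 - November 3, 2016 October 30 - November 5, 2015"

# First "Month day, year" in the string: the whole match is the date, no groups needed.
first_rx = r"[A-Za-z]+\s+\d{1,2},\s*\d{4}"
# Same shape, but only where no other four-digit number follows later in the string.
last_rx = r"[A-Za-z]+\s+\d{1,2},\s*\d{4}(?!.*\d{4})"

first_date = re.search(first_rx, text).group()
last_date = re.search(last_rx, text).group()
print(first_date)
print(last_date)
```

The two patterns rely on the order of the dates in the string (the first range end, then the second), not on the year values themselves.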
I have tried many regex patterns to extract the dates from emails that have this format, but none of them worked:
Date: Tue, 13 Nov 2001 08:41:49 -0800 (PST)
Sent: Thursday, November 08, 2001 10:25 AM
This is how the dates look in all the emails, and I want to extract both of them.
Thank you in advance
You can do something like this using this kind of pattern:
Using Python3:
import re
data = "Date: Tue, 13 Nov 2001 08:41:49 -0800 (PST)"
final = re.findall(r"Date: (\w+), ([0-9]+) (\w+) ([0-9]+)", data)
print("{0}, {1}".format(final[0][0], " ".join(final[0][1:])))
print(" ".join(final[0][1:]))
Using Python2:
import re
data = "Date: Tue, 13 Nov 2001 08:41:49 -0800 (PST)"
final = re.findall(r"Date: (\w+), ([0-9]+) (\w+) ([0-9]+)", data)
print "%s, %s" % (final[0][0], " ".join(final[0][1:]))
print " ".join(final[0][1:])
Output:
Tue, 13 Nov 2001
13 Nov 2001
Edit:
A quick answer to the new update of your question, you can do something like this:
import re
email = '''Date: Tue, 13 Nov 2001 08:41:49 -0800 (PST)
Sent: Thursday, November 08, 2001 10:25 AM'''
data = email.split("\n")
pattern = r"(\w+: \w+, [0-9]+ \w+ [0-9]+)|(\w+: \w+, \w+ [0-9]+, [0-9]+)"
final = []
for k in data:
    final += re.findall(pattern, k)
final = [j.split(":") for k in final for j in k if j != '']
# Python3
print(final)
# Python2
# print final
Output:
[['Date', ' Tue, 13 Nov 2001'], ['Sent', ' Thursday, November 08, 2001']]
import re
my_email = 'Date: Tue, 13 Nov 2001 08:41:49 -0800 (PST)'
match = re.search(r': (\w{3,3}, \d{2,2} \w{3,3} \d{4,4})', my_email)
print(match.group(1))
I am not regex expert but here is a solution, you can write some tests for that
import re

d = "Date: Tue, 13 Nov 2001 08:41:49 -0800 (PST)"
dates = [re.search(r'(\d+ \w+ \d+)', date).groups()[0] for date in re.search(r'(Date: \w+, \d+ \w+ \d+)', d).groups()]
print(dates)
# ['13 Nov 2001']
Instead of using regex, you can use split() if you only need to extract dates from strings that always follow this same layout:
email_date = "Date: Tue, 13 Nov 2001 08:41:49 -0800 (PST)"
email_date = '%s %s %s %s' % (tuple(email_date.split(' ')[1:5]))
Output:
Tue, 13 Nov 2001
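If both header styles need to be handled in one pass, a single pattern with named groups and re.MULTILINE is another option. The sketch below assumes the headers are always spelled Date: and Sent: and start at the beginning of a line; header_rx and the group names are invented for the example.

```python
import re

email = """Date: Tue, 13 Nov 2001 08:41:49 -0800 (PST)
Sent: Thursday, November 08, 2001 10:25 AM"""

header_rx = re.compile(
    r"^(?P<header>Date|Sent):\s*"
    r"(?P<date>\w+, \d{1,2} \w+ \d{4}"   # e.g. Tue, 13 Nov 2001
    r"|\w+, \w+ \d{1,2}, \d{4})",        # e.g. Thursday, November 08, 2001
    re.MULTILINE,
)

# Map each header name to the date text that follows it.
dates = {m["header"]: m["date"] for m in header_rx.finditer(email)}
print(dates)
```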
Hi, I am having a problem with the regex at the following link.
https://regex101.com/r/wU4xK1/1
It matches almost all patterns, but it struggles when it encounters certain characters or a newline.
My regex is :
(\b(?:(jan|january|feb|february|mar|march|apr|april|may|jun|june|jul|july|aug|august|set|sep|september|oct|october|nov|november|dec|december)[/\.\s',’-]{0,4}\d{2,4}|(jan|january|feb|february|mar|march|apr|april|may|jun|june|jul|july|aug|august|set|sep|september|oct|october|nov|november|dec|december))[/\r-–––,]{0,4}[a-zA-Z]{3,8}[/\.\s',’-]{0,2}[\s]{0,4}\d{2,4})
My text is:
July 2005 – December - 2006
(Nov '12 - Feb 12)
(Nov 12 - Feb 12 )
july 2005 – Dec 2012 ## Note here. If i press enter after Dec 2012 I will get a match. Dont know why ?
Just turn the capturing groups into non-capturing groups and then wrap the whole pattern in a single capturing group.
((?:\b(?:(?:jan|january|feb|february|mar|march|apr|april|may|jun|june|jul|july|aug|august|set|sep|september|oct|october|nov|november|dec|december)[/\.\s',’-]{0,4}\d{2,4}|(jan|january|feb|february|mar|march|apr|april|may|jun|june|jul|july|aug|august|set|sep|september|oct|october|nov|november|dec|december))[/\r-–––,]{0,4}[a-zA-Z]{3,8}[/\.\s',’-]{0,2}[\s]{0,4}\d{2,4}))
DEMO
>>> s = '''July 2005 – December - 2006
(Nov '12 - Feb 12)
(Nov 12 - Feb 12 )
july 2005 – Dec 2012 ## Note here. If i press enter after Dec 2012 I will get a match. Dont know why ?'''
>>> re.findall(r"(?mi)((?:\b(?:(?:jan|january|feb|february|mar|march|apr|april|may|jun|june|jul|july|aug|august|set|sep|september|oct|october|nov|november|dec|december)[/\.\s',’-]{0,4}\d{2,4}|(jan|january|feb|february|mar|march|apr|april|may|jun|june|jul|july|aug|august|set|sep|september|oct|october|nov|november|dec|december))[/\r-–––,]{0,4}[a-zA-Z]{3,8}[/\.\s',’-]{0,2}[\s]{0,4}\d{2,4}))", s)
[('July 2005 – December - 2006', ''), ("Nov '12 - Feb 12", ''), ('Nov 12 - Feb 12', ''), ('july 2005 – Dec 2012', '')]
>>> m = re.findall(r"(?mi)((?:\b(?:(?:jan|january|feb|february|mar|march|apr|april|may|jun|june|jul|july|aug|august|set|sep|september|oct|october|nov|november|dec|december)[/\.\s',’-]{0,4}\d{2,4}|(jan|january|feb|february|mar|march|apr|april|may|jun|june|jul|july|aug|august|set|sep|september|oct|october|nov|november|dec|december))[/\r-–––,]{0,4}[a-zA-Z]{3,8}[/\.\s',’-]{0,2}[\s]{0,4}\d{2,4}))", s)
>>> [(x) for x,y in m]
['July 2005 – December - 2006', "Nov '12 - Feb 12", 'Nov 12 - Feb 12', 'july 2005 – Dec 2012']
Here (?mi) combines the multiline and case-insensitive modifiers.
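The reason the grouping matters is how re.findall reports results: with more than one capturing group it returns tuples of the groups, and with no capturing groups it returns the full matches. A minimal sketch on a shortened, made-up pattern (not the full one above):

```python
import re

s = "(Nov 12 - Feb 12 )"

# Two capturing groups: findall returns one tuple of groups per match.
tuples = re.findall(r"(nov|feb) (\d{2})", s, re.IGNORECASE)

# Non-capturing group only: findall returns the whole matched text.
whole = re.findall(r"(?:nov|feb) \d{2}", s, re.IGNORECASE)

# finditer with group(0) gives the whole match regardless of grouping.
via_iter = [m.group(0) for m in re.finditer(r"(nov|feb) (\d{2})", s, re.IGNORECASE)]

print(tuples)
print(whole)
print(via_iter)
```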
Your regex indeed works, but you have to delete the blank line at the end of your Regular expression.
See
https://regex101.com/r/wU4xK1/3
Apart from Aaron's correct remark (after removing that line break, the matches are displayed), I'd also like to mention that \s matches any whitespace character, including those in the [\r\n\t\f ] class, so you could avoid matching across line breaks by limiting the character class to [\t\f ].
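A small sketch of that difference, using a made-up two-line string: with \s the match runs across the line break, while a class restricted to spaces and tabs does not.

```python
import re

text = "Dec 2012\nJan 2013"

# \s also matches the newline, so the year and the next month are joined.
across_lines = re.findall(r"\d{4}\s+[A-Za-z]{3}", text)

# [ \t] excludes line breaks, so nothing matches across the two lines.
same_line_only = re.findall(r"\d{4}[ \t]+[A-Za-z]{3}", text)

print(across_lines)
print(same_line_only)
```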