Extract date from email using python 2.7 regex - python

I have tried many regexes to extract the date from emails that contain lines in this format, but none of them worked:
Date: Tue, 13 Nov 2001 08:41:49 -0800 (PST)
Sent: Thursday, November 08, 2001 10:25 AM
This is how the lines look in every email, and I want to extract both dates.
Thank you in advance.

You can do something like this using this kind of pattern:
Using Python3:
import re
data = "Date: Tue, 13 Nov 2001 08:41:49 -0800 (PST)"
final = re.findall(r"Date: (\w+), ([0-9]+) (\w+) ([0-9]+)", data)
print("{0}, {1}".format(final[0][0], " ".join(final[0][1:])))
print(" ".join(final[0][1:]))
Using Python2:
import re
data = "Date: Tue, 13 Nov 2001 08:41:49 -0800 (PST)"
final = re.findall(r"Date: (\w+), ([0-9]+) (\w+) ([0-9]+)", data)
print "%s, %s" % (final[0][0], " ".join(final[0][1:]))
print " ".join(final[0][1:])
Output:
Tue, 13 Nov 2001
13 Nov 2001
Edit:
A quick answer to the update to your question: you can do something like this:
import re
email = '''Date: Tue, 13 Nov 2001 08:41:49 -0800 (PST)
Sent: Thursday, November 08, 2001 10:25 AM'''
data = email.split("\n")
pattern = r"(\w+: \w+, [0-9]+ \w+ [0-9]+)|(\w+: \w+, \w+ [0-9]+, [0-9]+)"
final = []
for k in data:
    final += re.findall(pattern, k)
final = [j.split(":") for k in final for j in k if j != '']
# Python3
print(final)
# Python2
# print final
Output:
[['Date', ' Tue, 13 Nov 2001'], ['Sent', ' Thursday, November 08, 2001']]

import re
my_email = 'Date: Tue, 13 Nov 2001 08:41:49 -0800 (PST)'
match = re.search(r': (\w{3,3}, \d{2,2} \w{3,3} \d{4,4})', my_email)
print(match.group(1))

I am not a regex expert, but here is a solution; you can write some tests for it:
import re
d = "Date: Tue, 13 Nov 2001 08:41:49 -0800 (PST)"
dates = [re.search(r'(\d+ \w+ \d+)', date).groups()[0] for date in re.search(r'(Date: \w+, \d+ \w+ \d+)', d).groups()]
print(dates)
Output:
['13 Nov 2001']

Instead of using a regex, you can use split() if every line follows the same layout:
email_date = "Date: Tue, 13 Nov 2001 08:41:49 -0800 (PST)"
email_date = '%s %s %s %s' % (tuple(email_date.split(' ')[1:5]))
Output:
Tue, 13 Nov 2001
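If you need real datetime objects rather than substrings, the regex capture can be combined with datetime.strptime. This is only a sketch: it assumes Python 3 (the %z directive is not available in Python 2.7's strptime), an English locale for the day and month names, and that both headers always follow the two layouts shown in the question. The names DATE_RE, SENT_RE and extract_dates are made up for this example.

```python
import re
from datetime import datetime

EMAIL = """Date: Tue, 13 Nov 2001 08:41:49 -0800 (PST)
Sent: Thursday, November 08, 2001 10:25 AM"""

# One pattern per header; each captures only the part strptime will parse.
DATE_RE = re.compile(
    r"^Date: (\w{3}, \d{1,2} \w{3} \d{4} \d{2}:\d{2}:\d{2} [+-]\d{4})",
    re.MULTILINE,
)
SENT_RE = re.compile(
    r"^Sent: (\w+, \w+ \d{1,2}, \d{4} \d{1,2}:\d{2} [AP]M)",
    re.MULTILINE,
)


def extract_dates(text):
    """Return a dict with the parsed 'Date' and 'Sent' values that were found."""
    found = {}
    m = DATE_RE.search(text)
    if m:
        found["Date"] = datetime.strptime(m.group(1), "%a, %d %b %Y %H:%M:%S %z")
    m = SENT_RE.search(text)
    if m:
        found["Sent"] = datetime.strptime(m.group(1), "%A, %B %d, %Y %I:%M %p")
    return found


dates = extract_dates(EMAIL)
print(dates["Date"].date())
print(dates["Sent"].date())
```

A header that is missing or laid out differently is simply left out of the returned dict, so check for the key before using it.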

Related

dropping unnecessary text from list of string

Existing Df :
Id dates
01 ['ATIVE 04/2018 to 03/2020',' XYZ mar 2020 – Jul 2022','June 2021 - 2023 XYZ']
Expected Df :
Id dates
01 ['04/2018 to 03/2020','mar 2020 – Jul 2022','June 2021 - 2023']
I want to clean the lists in the dates column. I tried the function below, but it doesn't do the job. Any leads?
def clean_dates_list(dates_list):
    cleaned_dates_list = []
    for date_str in dates_list:
        cleaned_date_str = re.sub(r'[^A-Za-z\s\d]+', '', date_str)
        cleaned_dates_list.append(cleaned_date_str)
    return cleaned_dates_list
ls = ['ATIVE 04/2018 to 03/2020', ' XYZ mar 2020 – Jul 2022', 'June 2021 - 2023 XYZ']
ls_to_remove = ['ATIVE', 'XYZ']
for item in ls:
    ls_str = item.split()
    new_item = str()
    for item in ls_str:
        if item in ls_to_remove:
            continue
        new_item += " " + item
    print(new_item)
I don't know your full list of words to remove, and hard-coding such a list is not good practice, but it works for your example.
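An alternative that avoids a word list is to describe what a date looks like and keep everything from the first date token to the last one. The sketch below assumes the only date shapes are MM/YYYY, "month YYYY" and a bare YYYY, as in the sample; MONTH, DATE_TOKEN and trim_to_dates are names invented here.

```python
import re

MONTH = (r"(?:jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|june?|july?|"
         r"aug(?:ust)?|sep(?:t(?:ember)?)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?)")

# A date token is MM/YYYY, "<month> YYYY", or a bare four-digit year.
DATE_TOKEN = re.compile(
    r"\b(?:\d{1,2}/\d{4}|" + MONTH + r"\s+\d{4}|\d{4})\b",
    re.IGNORECASE,
)


def trim_to_dates(s):
    """Keep the text between the first and the last date token, inclusive."""
    matches = list(DATE_TOKEN.finditer(s))
    if not matches:
        return s.strip()
    return s[matches[0].start():matches[-1].end()]


ls = ['ATIVE 04/2018 to 03/2020', ' XYZ mar 2020 – Jul 2022', 'June 2021 - 2023 XYZ']
cleaned = [trim_to_dates(s) for s in ls]
print(cleaned)
```

Anything between the two outermost tokens (the "to" or the dash) is kept untouched, so unrelated words in the middle of a string would survive as well.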

Regex for extract months and year combination in a date

I am using regex to extract the month and year of pairs of dates in text:
regex = (
r"((Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(tember)?(t)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)"
r"\s?[\.\s\’\’\,\/\'\,\‘\-\–\—]?\s?(\d{4}|\d{2})?\s?\s?((to)|[\|\-\–\—])\s?\s?"
r"((Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(tember)?(t)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)"
r"\s?[\.\s\’\’\,\/\'\,\‘\-\–\—]?\s?(\d{4}|\d{2})|(Present|Now|till\s?(now|date|today)?|current)))"
)
When I test the regex with some inputs that contain the day of the month in some and not in others:
lst = [
'July 2014 - 28th August 2014',
'Jan 2012 - 3rd sep 2014',
'Jan 2008 - May 2012',
'Jan 2008 and May 2012'
]
for i in lst:
    word = re.finditer(regex,i,re.IGNORECASE)
    for match in word:
        print(match.group())
I get the following output:
Jan 2008 - May 2012
but my expected output is:
July 2014 - August 2014
Jan 2012 - sep 2014
Jan 2008 - May 2012
What do I need to change to make the regex match text with an optional day in the date? When a date string includes the day, it is always an ordinal number with a st, nd, rd or th suffix.
You cannot "skip" part of a string during a single match operation, so if you have 26th August, you can't match or capture just 26 August. In these cases, you either need to capture parts of the match and then concatenate them, or replace the parts you do not need as a post-processing step.
So, here, I'd use the post-process replace approach with
import re
day = r'(?:((?:0?[1-9]|[12]\d|3[01])(?:\s*(?:st|[rn]d|th))?)\s*)?'
month = r'(Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?|Aug(?:ust)?|Sep(?:t(?:ember)?)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)'
year = r'(\d{2}(?:\d{2})?)'
rx_valid = re.compile( fr'\b{day}{month}\s*{year}\s*[-—–]\s*{day}{month}\s*{year}(?!\d)', re.IGNORECASE )
rx_ordinal = re.compile( r'\s*\d+\s*(?:st|[rn]d|th)', re.IGNORECASE )
lst = [
'July 2014 - 28th August 2014',
'Jan 2012 - 3rd sep 2014',
'Jan 2008 - May 2012',
'Jan 2008 and May 2012'
]
for i in lst:
    word = rx_valid.finditer(i)
    for match in word:
        print(rx_ordinal.sub("", match.group()))
Output:
July 2014 - August 2014
Jan 2012 - sep 2014
Jan 2008 - May 2012
See the Python demo and the regex demo.
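For comparison, the other option mentioned above (capture the parts and concatenate them) could look roughly like this. It is a sketch, not a drop-in replacement: the day is matched without being captured, only month and year are captured, and the separator in the output is normalized to " - ". The names rx_range and month_year_ranges are made up for this example.

```python
import re

# Optional ordinal day: matched so the regex can get past it, but not captured.
day = r'(?:(?:0?[1-9]|[12]\d|3[01])\s*(?:st|[rn]d|th)?\s*)?'
month = (r'(Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?|'
         r'Aug(?:ust)?|Sep(?:t(?:ember)?)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)')
year = r'(\d{2}(?:\d{2})?)'
rx_range = re.compile(
    rf'\b{day}{month}\s*{year}\s*[-—–]\s*{day}{month}\s*{year}(?!\d)',
    re.IGNORECASE,
)


def month_year_ranges(text):
    """Rebuild each range from its four captures: month, year, month, year."""
    return ['{} {} - {} {}'.format(*m.groups()) for m in rx_range.finditer(text)]


lst = [
    'July 2014 - 28th August 2014',
    'Jan 2012 - 3rd sep 2014',
    'Jan 2008 - May 2012',
    'Jan 2008 and May 2012',
]
results = [r for s in lst for r in month_year_ranges(s)]
print(results)
```

Because the output is rebuilt from the captures, an en dash or em dash in the input comes out as a plain hyphen.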

How to retrieve wanted string with re from lines

Tue Aug 21 17:02:26 2018 (gtgrhrthrhrhrthhhthrthrhrh)
fjfpjpgporejpejgjr[eh[[[jh[j[ej[[ej[ej[e]]]]
fkw[kgkeg[ekrk[ekg[kergk[erkg[eg[kg]
Tue Aug 21 17:31:06 2018 ( ijwejfwfjwpfjwf[[few[jjfwfefwfeffeww]]
fiowhfiweohewhfpwfhpfhpepwehfphpwhfpehfpwfh
f,wfpewfefewgpwpg,pewgp
Tue Aug 21 18:10:42 2018 ( reijpjfpjejferjfrejfpjefjer
k[pfk[epkf[kr[ek[ke[gkk]
r[g[keprkgpekg[rkg[pkg[ekg]
Above is an example of the content in the text file. I want to extract a string with re.
How should I construct the findall condition to achieve the expected result below? I have tried the following:
match=re.findall(r'[Tue\w]+2018$',data2)
but it is not working. I understand that $ is the symbol for the end of the string. How can I do it?
Expected Result is:
Tue Aug 21 17:02:26 2018
Tue Aug 21 17:31:06 2018
Tue Aug 21 18:10:42 2018
.
.
.
Use the pattern:
^Tue.*?2018
^ Assert position beginning of line.
Tue Literal substring.
.*? Match anything lazily.
2018 Match literal substring.
Since you are working with a multiline string and you want to match pattern at the beginning of a string, you have to use the re.MULTILINE flag.
import re
mystr="""
Tue Aug 21 17:02:26 2018 (gtgrhrthrhrhrthhhthrthrhrh)
fjfpjpgporejpejgjr[eh[[[jh[j[ej[[ej[ej[e]]]]
fkw[kgkeg[ekrk[ekg[kergk[erkg[eg[kg]
Tue Aug 21 17:31:06 2018 ( ijwejfwfjwpfjwf[[few[jjfwfefwfeffeww]]
fiowhfiweohewhfpwfhpfhpepwehfphpwhfpehfpwfh
f,wfpewfefewgpwpg,pewgp
Tue Aug 21 18:10:42 2018 ( reijpjfpjejferjfrejfpjefjer
k[pfk[epkf[kr[ek[ke[gkk]
r[g[keprkgpekg[rkg[pkg[ekg]
"""
print(re.findall(r'^Tue.*?2018',mystr,re.MULTILINE))
Prints:
['Tue Aug 21 17:02:26 2018', 'Tue Aug 21 17:31:06 2018', 'Tue Aug 21 18:10:42 2018']
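If the file can contain other weekdays or years, the same idea works with a pattern that describes the whole timestamp instead of the literals Tue and 2018. A minimal sketch, assuming every timestamp has the layout shown in the question and an English locale for strptime; the "Wed Aug 22" line is a made-up sample and STAMP is a name chosen for this example.

```python
import re
from datetime import datetime

LOG = """Tue Aug 21 17:02:26 2018 (gtgrhrthrhrhrthhhthrthrhrh)
fjfpjpgporejpejgjr[eh[[[jh[j[ej[[ej[ej[e]]]]
Wed Aug 22 09:15:00 2018 ( reijpjfpjejferjfrejfpjefjer
k[pfk[epkf[kr[ek[ke[gkk]
"""

# Weekday, month, day, HH:MM:SS, year, anchored at the start of a line.
STAMP = re.compile(
    r"^[A-Z][a-z]{2} [A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2} \d{4}",
    re.MULTILINE,
)

stamps = STAMP.findall(LOG)
parsed = [datetime.strptime(s, "%a %b %d %H:%M:%S %Y") for s in stamps]
print(stamps)
```

The original attempt failed for two reasons: [Tue\w] is a character class (any one of T, u, e or a word character), not the word Tue, and $ requires 2018 to be the last thing in the string, which it is not.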

How to split large text file based on regex using python

I have a large file containing many lines, some of which follow a unique pattern. I want to split the file on that pattern.
The data in the text file looks like this:
commit e6bcab96ffe1f55e80be8d0c1e5342fb9d69ca30
Date: Sat Jun 9 04:11:37 2018 +0530
configurations
commit 5c8deb3114b4ed17c5d2ea31842869515073670f
Date: Sat Jun 9 02:59:56 2018 +0530
remote
commit 499516b7e4f95daee4f839f34cc46df404b52d7a
Date: Sat Jun 9 02:52:51 2018 +0530
remote fix
This reverts commit 0a2917bd49eec7ca2f380c890300d75b69152353.
commit 349e1b42d3b3d23e95a227a1ab744fc6167e6893
Date: Sat Jun 9 02:52:37 2018 +0530
Revert "Removing the printf added"
This reverts commit da0fac94719176009188ce40864b09cfb84ca590.
commit 8bfd4e7086ff5987491f280b57d10c1b6e6433fe
Date: Sat Jun 9 02:52:18 2018 +0530
Revert Bulk
This reverts commit c2ee318635987d44e579c92d0b86b003e1d2a076.
commit bcb10c54068602a96d367ec09f08530ede8059ef
Date: Fri Jun 8 19:53:03 2018 +0530
fix crash observed
commit a84169f79fbe9b18702f6885b0070bce54d6dd5a
Date: Fri Jun 8 18:14:21 2018 +0530
Interface PBR
commit 254726fe3fe0b9f6b228189e8a6fe7bdf4aa9314
Date: Fri Jun 8 18:12:10 2018 +0530
Crash observed
commit 18e7106d54e19310d32e8b31d584cec214fb2cb7
Date: Fri Jun 8 18:09:13 2018 +0530
Changes to fix crash
My current code is below:
import re
readtxtfile = r'C:\gitlog.txt'
with open(readtxtfile) as fp:
    txtrawdata = fp.read()
commits = re.split(r'^(commit|)[ a-zA-Z0-9]{40}$',txtrawdata)
print(commits)
Expected Output:
I want to split the string above at each line such as "commit 18e7106d54e19310d32e8b31d584cec214fb2cb7" and collect the pieces in a Python list.
import re
text = ''' commit e6bcab96ffe1f55e80be8d0c1e5342fb9d69ca30
Date: Sat Jun 9 04:11:37 2018 +0530
configurations
commit 5c8deb3114b4ed17c5d2ea31842869515073670f
Date: Sat Jun 9 02:59:56 2018 +0530
remote
commit 499516b7e4f95daee4f839f34cc46df404b52d7a
Date: Sat Jun 9 02:52:51 2018 +0530
remote fix
This reverts commit 0a2917bd49eec7ca2f380c890300d75b69152353.'''
print(re.split(r'^\s*commit \S*\s*', text, flags=re.MULTILINE))
This outputs:
['', 'Date: Sat Jun 9 04:11:37 2018 +0530\n\n configurations\n', 'Date: Sat Jun 9 02:59:56 2018 +0530\n\n remote\n', 'Date: Sat Jun 9 02:52:51 2018 +0530\n\n remote fix\n This reverts commit 0a2917bd49eec7ca2f380c890300d75b69152353.']
Explanation of this regex in Regex101 here.
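If you also want to keep each sha next to its block, re.split returns the text of a capturing group along with the pieces, so the result alternates between shas and bodies. A small sketch on a shortened copy of the sample, assuming full 40-character shas at the start of a line; the pairing with zip is just one way to consume the result.

```python
import re

text = """commit e6bcab96ffe1f55e80be8d0c1e5342fb9d69ca30
Date: Sat Jun 9 04:11:37 2018 +0530
configurations
commit 499516b7e4f95daee4f839f34cc46df404b52d7a
Date: Sat Jun 9 02:52:51 2018 +0530
remote fix
This reverts commit 0a2917bd49eec7ca2f380c890300d75b69152353.
"""

# The capturing group makes re.split keep each sha in the result list.
parts = re.split(r"^commit ([0-9a-f]{40})$", text, flags=re.MULTILINE)

# parts[0] is whatever precedes the first commit line (empty here);
# after that the list alternates sha, body, sha, body, ...
commits = [(sha, body.strip()) for sha, body in zip(parts[1::2], parts[2::2])]
for sha, body in commits:
    print(sha)
    print(body)
    print('-' * 80)
```

The "This reverts commit ..." line is not split on, because the pattern only matches when commit is the first word on the line.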
groups = re.findall(r'(^\s*commit\s+[a-z0-9]+.*?)(?=^commit|\Z)', data, flags=re.DOTALL|re.MULTILINE)
for g in groups:
    print(g)
    print('-' * 80)
Prints:
commit e6bcab96ffe1f55e80be8d0c1e5342fb9d69ca30
Date: Sat Jun 9 04:11:37 2018 +0530
configurations
--------------------------------------------------------------------------------
commit 5c8deb3114b4ed17c5d2ea31842869515073670f
Date: Sat Jun 9 02:59:56 2018 +0530
remote
--------------------------------------------------------------------------------
commit 499516b7e4f95daee4f839f34cc46df404b52d7a
Date: Sat Jun 9 02:52:51 2018 +0530
remote fix
This reverts commit 0a2917bd49eec7ca2f380c890300d75b69152353.
--------------------------------------------------------------------------------
...and so on
This will extract the commit shas:
import re
commits = list()
readtxtfile = r'C:\gitlog.txt'
with open(readtxtfile) as fp:
    for line in fp:
        m = re.match(r'^commit\s+([a-f0-9]{40})$', line)
        if m:
            # group(1) is the 40-character sha; group(0) would be the whole line
            commits.append(m.group(1))
commits is now a list of just the commit shas. Note that if your git log output format changes, the matching regex will have to change too. Make sure you generate the log with --no-abbrev-commit.
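Since the file is large, the same line-by-line loop can also group the lines that belong to each commit, without reading the whole file into memory. This is a sketch under the same assumption (full 40-character shas); iter_commits is a made-up helper, and io.StringIO stands in for the real file object here.

```python
import io
import re

COMMIT_RE = re.compile(r"^commit ([0-9a-f]{40})$")


def iter_commits(lines):
    """Yield (sha, body_lines) pairs from an iterable of git log lines."""
    sha, body = None, []
    for line in lines:
        stripped = line.strip()
        m = COMMIT_RE.match(stripped)
        if m:
            if sha is not None:
                yield sha, body      # previous commit is complete
            sha, body = m.group(1), []
        elif sha is not None and stripped:
            body.append(stripped)    # non-empty line of the current commit
    if sha is not None:
        yield sha, body              # last commit in the file


sample = io.StringIO("""commit 499516b7e4f95daee4f839f34cc46df404b52d7a
Date: Sat Jun 9 02:52:51 2018 +0530
remote fix
This reverts commit 0a2917bd49eec7ca2f380c890300d75b69152353.
commit 349e1b42d3b3d23e95a227a1ab744fc6167e6893
Date: Sat Jun 9 02:52:37 2018 +0530
Revert "Removing the printf added"
""")

commit_list = list(iter_commits(sample))
for sha, body in commit_list:
    print(sha, body)
```

With a real file you would pass the open file object instead of the StringIO, for example iter_commits(fp) inside the with block.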

Regex is working, but I don't know what is wrong in some parts

Hi, I am having a problem with the regex at the following link.
https://regex101.com/r/wU4xK1/1
It matches almost all patterns, but it fails when it encounters certain characters or a newline.
My regex is :
(\b(?:(jan|january|feb|february|mar|march|apr|april|may|jun|june|jul|july|aug|august|set|sep|september|oct|october|nov|november|dec|december)[/\.\s',’-]{0,4}\d{2,4}|(jan|january|feb|february|mar|march|apr|april|may|jun|june|jul|july|aug|august|set|sep|september|oct|october|nov|november|dec|december))[/\r-–––,]{0,4}[a-zA-Z]{3,8}[/\.\s',’-]{0,2}[\s]{0,4}\d{2,4})
My text is:
July 2005 – December - 2006
(Nov '12 - Feb 12)
(Nov 12 - Feb 12 )
july 2005 – Dec 2012 ## Note here. If i press enter after Dec 2012 I will get a match. Dont know why ?
Just turn all the capturing groups into non-capturing groups, and then wrap the whole pattern in a single capturing group.
((?:\b(?:(?:jan|january|feb|february|mar|march|apr|april|may|jun|june|jul|july|aug|august|set|sep|september|oct|october|nov|november|dec|december)[/\.\s',’-]{0,4}\d{2,4}|(jan|january|feb|february|mar|march|apr|april|may|jun|june|jul|july|aug|august|set|sep|september|oct|october|nov|november|dec|december))[/\r-–––,]{0,4}[a-zA-Z]{3,8}[/\.\s',’-]{0,2}[\s]{0,4}\d{2,4}))
DEMO
>>> s = '''July 2005 – December - 2006
(Nov '12 - Feb 12)
(Nov 12 - Feb 12 )
july 2005 – Dec 2012 ## Note here. If i press enter after Dec 2012 I will get a match. Dont know why ?'''
>>> re.findall(r"(?mi)((?:\b(?:(?:jan|january|feb|february|mar|march|apr|april|may|jun|june|jul|july|aug|august|set|sep|september|oct|october|nov|november|dec|december)[/\.\s',’-]{0,4}\d{2,4}|(jan|january|feb|february|mar|march|apr|april|may|jun|june|jul|july|aug|august|set|sep|september|oct|october|nov|november|dec|december))[/\r-–––,]{0,4}[a-zA-Z]{3,8}[/\.\s',’-]{0,2}[\s]{0,4}\d{2,4}))", s)
[('July 2005 – December - 2006', ''), ("Nov '12 - Feb 12", ''), ('Nov 12 - Feb 12', ''), ('july 2005 – Dec 2012', '')]
>>> m = re.findall(r"(?mi)((?:\b(?:(?:jan|january|feb|february|mar|march|apr|april|may|jun|june|jul|july|aug|august|set|sep|september|oct|october|nov|november|dec|december)[/\.\s',’-]{0,4}\d{2,4}|(jan|january|feb|february|mar|march|apr|april|may|jun|june|jul|july|aug|august|set|sep|september|oct|october|nov|november|dec|december))[/\r-–––,]{0,4}[a-zA-Z]{3,8}[/\.\s',’-]{0,2}[\s]{0,4}\d{2,4}))", s)
>>> [(x) for x,y in m]
['July 2005 – December - 2006', "Nov '12 - Feb 12", 'Nov 12 - Feb 12', 'july 2005 – Dec 2012']
(?mi) combines the multiline and case-insensitive modifiers.
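The reason the groups matter is how re.findall reports results: with capturing groups it returns the groups, without them it returns the whole match. A tiny illustration with a deliberately simplified pattern (not the full regex from the question):

```python
import re

text = "Nov 12 - Feb 12"

# Two capturing groups: findall returns one tuple of groups per match.
with_groups = re.findall(r"(?i)(nov|feb) (\d{2})", text)

# Non-capturing group only: findall returns the full matched strings.
without_groups = re.findall(r"(?i)(?:nov|feb) \d{2}", text)

print(with_groups)
print(without_groups)
```

This is also why the session above still gets tuples: one inner capturing group is left in the pattern, so each result has a second, empty element that the list comprehension drops.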
Your regex does work, but you have to delete the blank line at the end of your regular expression.
See
https://regex101.com/r/wU4xK1/3
Apart from Aaron's correct remark (after removing that line break, the matches are displayed), I'd also like to mention that \s matches any whitespace character in the [\r\n\t\f ] class, so you could avoid matching across line breaks by limiting the class to [\t\f ].
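To see the difference, here is a small sketch with a simplified range pattern (again not the full regex from the question) run on a made-up sample where the last year sits on its own line. The names loose and strict are just labels for the two variants.

```python
import re

sample = "Nov 12 - Feb 12\njuly 2005 – Dec\n2012"

# \s also matches \n, so the second range is matched across the line break.
loose = re.compile(
    r"\b[a-z]{3,9}\s+\d{2,4}\s*[-–]\s*[a-z]{3,9}\s+\d{2,4}",
    re.IGNORECASE,
)

# [ \t] allows only spaces and tabs, so a match has to stay on one line.
strict = re.compile(
    r"\b[a-z]{3,9}[ \t]+\d{2,4}[ \t]*[-–][ \t]*[a-z]{3,9}[ \t]+\d{2,4}",
    re.IGNORECASE,
)

loose_matches = loose.findall(sample)
strict_matches = strict.findall(sample)
print(loose_matches)
print(strict_matches)
```

Whether matching across a line break is wanted depends on the input; the point is only that \s makes that choice implicitly.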
