How to retrieve wanted string with re from lines

How to retrieve wanted string with re from lines - python

Tue Aug 21 17:02:26 2018 (gtgrhrthrhrhrthhhthrthrhrh)
fjfpjpgporejpejgjr[eh[[[jh[j[ej[[ej[ej[e]]]]
fkw[kgkeg[ekrk[ekg[kergk[erkg[eg[kg]
Tue Aug 21 17:31:06 2018 ( ijwejfwfjwpfjwf[[few[jjfwfefwfeffeww]]
fiowhfiweohewhfpwfhpfhpepwehfphpwhfpehfpwfh
f,wfpewfefewgpwpg,pewgp
Tue Aug 21 18:10:42 2018 ( reijpjfpjejferjfrejfpjefjer
k[pfk[epkf[kr[ek[ke[gkk]
r[g[keprkgpekg[rkg[pkg[ekg]
Above is an example of the content in the text file. I want to extract a string with re.
How should I construct the findall condition to achieve the expected result below? I have tried the following:
match=re.findall(r'[Tue\w]+2018$',data2)
but it is not working. I understand that $ is the symbol for the end of the string. How can I do it?
Expected Result is:
Tue Aug 21 17:02:26 2018
Tue Aug 21 17:31:06 2018
Tue Aug 21 18:10:42 2018
.
.
.

Use the pattern:
^Tue.*?2018
^ Assert position beginning of line.
Tue Literal substring.
.*? Match anything lazily.
2018 Match literal substring.
Since you are working with a multiline string and you want to match pattern at the beginning of a string, you have to use the re.MULTILINE flag.
import re
mystr="""
Tue Aug 21 17:02:26 2018 (gtgrhrthrhrhrthhhthrthrhrh)
fjfpjpgporejpejgjr[eh[[[jh[j[ej[[ej[ej[e]]]]
fkw[kgkeg[ekrk[ekg[kergk[erkg[eg[kg]
Tue Aug 21 17:31:06 2018 ( ijwejfwfjwpfjwf[[few[jjfwfefwfeffeww]]
fiowhfiweohewhfpwfhpfhpepwehfphpwhfpehfpwfh
f,wfpewfefewgpwpg,pewgp
Tue Aug 21 18:10:42 2018 ( reijpjfpjejferjfrejfpjefjer
k[pfk[epkf[kr[ek[ke[gkk]
r[g[keprkgpekg[rkg[pkg[ekg]
"""
print(re.findall(r'^Tue.*?2018',mystr,re.MULTILINE))
Prints:
['Tue Aug 21 17:02:26 2018', 'Tue Aug 21 17:31:06 2018', 'Tue Aug 21 18:10:42 2018']

Related

Group python dataframe and display all correspond values for each unique key in a dictionary

I have the following dataset
id
date
7510
15 Jun 2020
7510
16 Jun 2020
7512
15 Jun 2020
7512
07 Jul 2020
7520
15 Jun 2020
7520
16 Aug 2020
I need to convert this to a dictionary which is quite straight forward, but need each unique id as a key and all corresponding values as values to the unique key.
for example;
dictionary = {7510: ["15 Jun 2020", "16 Jun 2020"], 7512: ["15 Jun 2020", "07 Jul 2020"],
7520: ["15 Jun 2020", "16 Aug 2020"] }

Try this:
df.groupby('id')['date'].agg(list).to_dict()
Output:
{7510: ['15 Jun 2020', '16 Jun 2020'],
7512: ['15 Jun 2020', '07 Jul 2020'],
7520: ['15 Jun 2020', '16 Aug 2020']}

python search the user name

Mon Apr 27 18:12:54 '<rdpDirect> ITC\109975#win_itsskdcbkp2_10.18.2.104
Mon Apr 27 18:14:31 '<rdpDirect> JDAWMSAPPDEV\VINOTHKUMAR#win_jdawmsappdev_10.10.45.172
Mon Apr 27 18:14:32 '<rdpDirect> JDAWMSAPPDEV\VINOTHKUMAR#win_jdawmsappdev_10.10.45.172
Mon Apr 27 18:24:21 '<rdpDirect> SAAMARTHA\UITSMUSR#win_saamarthaad_10.10.45.147
Mon Apr 27 18:24:21 '<rdpDirect> SAAMARTHA\UITSMUSR#win_saamarthaad_10.10.45.147
From the above text, need to search user name which is mentioned between the "\" and "#" sign. This can be numeric, characters, or alphanumeric.
eg: in the above example, 109975, VINOTHKUMAR, UITSMUSR etc.

Search for the group with zero or more characters within \ and # character.
import re
data = "Mon Apr 27 18:12:54 ' ITC\\109975#win_itsskdcbkp2_10.18.2.104 Mon Apr 27 18:14:31 ' JDAWMSAPPDEV\\VINOTHKUMAR#win_jdawmsappdev_10.10.45.172 Mon Apr 27 18:14:32 ' JDAWMSAPPDEV\\VINOTHKUMAR#win_jdawmsappdev_10.10.45.172 Mon Apr 27 18:24:21 ' SAAMARTHA\\UITSMUSR#win_saamarthaad_10.10.45.147 Mon Apr 27 18:24:21 ' SAAMARTHA\\UITSMUSR#win_saamarthaad_10.10.45.147"
usernames = re.findall(r"\\(.*?)#", data)
print(usernames)
Output:
['109975', 'VINOTHKUMAR', 'VINOTHKUMAR', 'UITSMUSR', 'UITSMUSR']
re.findall() method returns list of non overlapping occurrences of the searched string. More can be read in the official documentation of findall method.

A simple regex should do the job. You can try \\(.*?)\#.
This "captures" text between \ and #

buffer = '''
Mon Apr 27 18:12:54 ' ITC\\19975#win_itsskdcbkp2_10.18.2.104 Mon Apr 27 18:14:31 ' JDAWMSAPPDEV\\VINOTHKUMAR#win_jdawmsappdev_10.10.45.172 Mon Apr 27 18:14:32 ' JDAWMSAPPDEV\\VINOTHKUMAR#win_jdawmsappdev_10.10.45.172 Mon Apr 27 18:24:21 ' SAAMARTHA\\UITSMUSR#win_saamarthaad_10.10.45.147 Mon Apr 27 18:24:21 ' SAAMARTHA\\UITSMUSR#win_saamarthaad_10.10.45.147'''
import re
print(re.findall('\\\([\d\w\W]+?)#', buffer))
['19975', 'VINOTHKUMAR', 'VINOTHKUMAR', 'UITSMUSR', 'UITSMUSR']

How to split large text file based on regex using python

I have large file contain multiple lines but in some line having unique pattern, I want to split our large file based on this pattern.
Below data in text file:
commit e6bcab96ffe1f55e80be8d0c1e5342fb9d69ca30
Date: Sat Jun 9 04:11:37 2018 +0530
configurations
commit 5c8deb3114b4ed17c5d2ea31842869515073670f
Date: Sat Jun 9 02:59:56 2018 +0530
remote
commit 499516b7e4f95daee4f839f34cc46df404b52d7a
Date: Sat Jun 9 02:52:51 2018 +0530
remote fix
This reverts commit 0a2917bd49eec7ca2f380c890300d75b69152353.
commit 349e1b42d3b3d23e95a227a1ab744fc6167e6893
Date: Sat Jun 9 02:52:37 2018 +0530
Revert "Removing the printf added"
This reverts commit da0fac94719176009188ce40864b09cfb84ca590.
commit 8bfd4e7086ff5987491f280b57d10c1b6e6433fe
Date: Sat Jun 9 02:52:18 2018 +0530
Revert Bulk
This reverts commit c2ee318635987d44e579c92d0b86b003e1d2a076.
commit bcb10c54068602a96d367ec09f08530ede8059ef
Date: Fri Jun 8 19:53:03 2018 +0530
fix crash observed
commit a84169f79fbe9b18702f6885b0070bce54d6dd5a
Date: Fri Jun 8 18:14:21 2018 +0530
Interface PBR
commit 254726fe3fe0b9f6b228189e8a6fe7bdf4aa9314
Date: Fri Jun 8 18:12:10 2018 +0530
Crash observed
commit 18e7106d54e19310d32e8b31d584cec214fb2cb7
Date: Fri Jun 8 18:09:13 2018 +0530
Changes to fix crash
Currently my code as below:
import re
readtxtfile = r'C:\gitlog.txt'
with open(readtxtfile) as fp:
txtrawdata = fp.read()
commits = re.split(r'^(commit|)[ a-zA-Z0-9]{40}$',txtrawdata)
print(commits)
Expected Output:
I want to split above string based on "commit 18e7106d54e19310d32e8b31d584cec214fb2cb7" and convert them into python list.

import re
text = ''' commit e6bcab96ffe1f55e80be8d0c1e5342fb9d69ca30
Date: Sat Jun 9 04:11:37 2018 +0530
configurations
commit 5c8deb3114b4ed17c5d2ea31842869515073670f
Date: Sat Jun 9 02:59:56 2018 +0530
remote
commit 499516b7e4f95daee4f839f34cc46df404b52d7a
Date: Sat Jun 9 02:52:51 2018 +0530
remote fix
This reverts commit 0a2917bd49eec7ca2f380c890300d75b69152353.'''
print(re.split(r'^\s*commit \S*\s*', text, flags=re.MULTILINE))
This outputs:
['', 'Date: Sat Jun 9 04:11:37 2018 +0530\n\n configurations\n', 'Date: Sat Jun 9 02:59:56 2018 +0530\n\n remote\n', 'Date: Sat Jun 9 02:52:51 2018 +0530\n\n remote fix\n This reverts commit 0a2917bd49eec7ca2f380c890300d75b69152353.']

Explanation of this regex in Regex101 here.
groups = re.findall(r'(^\s*commit\s+[a-z0-9]+.*?)(?=^commit|\Z)', data, flags=re.DOTALL|re.MULTILINE)
for g in groups:
print(g)
print('-' * 80)
Prints:
commit e6bcab96ffe1f55e80be8d0c1e5342fb9d69ca30
Date: Sat Jun 9 04:11:37 2018 +0530
configurations
--------------------------------------------------------------------------------
commit 5c8deb3114b4ed17c5d2ea31842869515073670f
Date: Sat Jun 9 02:59:56 2018 +0530
remote
--------------------------------------------------------------------------------
commit 499516b7e4f95daee4f839f34cc46df404b52d7a
Date: Sat Jun 9 02:52:51 2018 +0530
remote fix
This reverts commit 0a2917bd49eec7ca2f380c890300d75b69152353.
--------------------------------------------------------------------------------
...and so on

This will extract the commit shas:
commits = list()
readtxtfile = r'C:\gitlog.txt'
with open(readtxtfile) as fp:
for line in fp:
m = re.match('^commit\s+([a-f0-9]{40})$', line)
if m:
commits.append(m.group(0))
commits is now a list of just the strings of the commit. Now if your gitlog output format changes this will change the matching regex. Make sure you're generating it with --no-abbrev-commit.

Extract date from email using python 2.7 regex

I have tried many regex code to extract the date from the emails that has this format but I couldn't:
Date: Tue, 13 Nov 2001 08:41:49 -0800 (PST)
Sent: Thursday, November 08, 2001 10:25 AM
This how it looks like in all emails and I want to extract them both.
Thank you in advance

You can do something like this using this kind of pattern:
Using Python3:
import re
data = "Date: Tue, 13 Nov 2001 08:41:49 -0800 (PST)"
final = re.findall(r"Date: (\w+), ([0-9]+) (\w+) ([0-9]+)", data)
print("{0}, {1}".format(final[0][0], " ".join(final[0][1:])))
print(" ".join(final[0][1:]))
Using Python2:
import re
data = "Date: Tue, 13 Nov 2001 08:41:49 -0800 (PST)"
final = re.findall(r"Date: (\w+), ([0-9]+) (\w+) ([0-9]+)", data)
print "%s, %s" % (final[0][0], " ".join(final[0][1:]))
print " ".join(final[0][1:])
Output:
Tue, 13 Nov 2001
13 Nov 2001
Edit:
A quick answer to the new update of your question, you can do something like this:
import re
email = '''Date: Tue, 13 Nov 2001 08:41:49 -0800 (PST)
Sent: Thursday, November 08, 2001 10:25 AM'''
data = email.split("\n")
pattern = r"(\w+: \w+, [0-9]+ \w+ [0-9]+)|(\w+: \w+, \w+ [0-9]+, [0-9]+)"
final = []
for k in data:
final += re.findall(pattern, k)
final = [j.split(":") for k in final for j in k if j != '']
# Python3
print(final)
# Python2
# print final
Output:
[['Date', ' Tue, 13 Nov 2001'], ['Sent', ' Thursday, November 08, 2001']]

import re
my_email = 'Date: Tue, 13 Nov 2001 08:41:49 -0800 (PST)'
match = re.search(r': (\w{3,3}, \d{2,2} \w{3,3} \d{4,4})', my_email)
print(match.group(1))

I am not regex expert but here is a solution, you can write some tests for that
d = "Date: Tue, 13 Nov 2001 08:41:49 -0800 (PST)"
dates = [re.search('(\d+ \w+ \d+)',date).groups()[0] for date in re.search('(Date: \w+, \d+ \w+ \d+)', d).groups()]
['13 Nov 2001']

Instead of using regex, you can use split() if only extract the same string model:
email_date = "Date: Tue, 13 Nov 2001 08:41:49 -0800 (PST)"
email_date = '%s %s %s %s' % (tuple(email_date.split(' ')[1:5]))
Output:
Tue, 13 Nov 2001

Regex is working. But seriously dont know, what is wrong in some parts

Hi I am having a problem with the regex in the following link.
https://regex101.com/r/wU4xK1/1
It matches almost all patterns. But when it encounter some characters or newline I am struggling.
My regex is :
(\b(?:(jan|january|feb|february|mar|march|apr|april|may|jun|june|jul|july|aug|august|set|sep|september|oct|october|nov|november|dec|december)[/\.\s',’-]{0,4}\d{2,4}|(jan|january|feb|february|mar|march|apr|april|may|jun|june|jul|july|aug|august|set|sep|september|oct|october|nov|november|dec|december))[/\r-–––,]{0,4}[a-zA-Z]{3,8}[/\.\s',’-]{0,2}[\s]{0,4}\d{2,4})
My text is:
July 2005 – December - 2006
(Nov '12 - Feb 12)
(Nov 12 - Feb 12 )
july 2005 – Dec 2012 ## Note here. If i press enter after Dec 2012 I will get a match. Dont know why ?

Just turn all the capturing group to non-capturing group and then include the whole pattern inside a single capturing group.
((?:\b(?:(?:jan|january|feb|february|mar|march|apr|april|may|jun|june|jul|july|aug|august|set|sep|september|oct|october|nov|november|dec|december)[/\.\s',’-]{0,4}\d{2,4}|(jan|january|feb|february|mar|march|apr|april|may|jun|june|jul|july|aug|august|set|sep|september|oct|october|nov|november|dec|december))[/\r-–––,]{0,4}[a-zA-Z]{3,8}[/\.\s',’-]{0,2}[\s]{0,4}\d{2,4}))
DEMO
>>> s = '''July 2005 – December - 2006
(Nov '12 - Feb 12)
(Nov 12 - Feb 12 )
july 2005 – Dec 2012 ## Note here. If i press enter after Dec 2012 I will get a match. Dont know why ?'''
>>> re.findall(r"(?mi)((?:\b(?:(?:jan|january|feb|february|mar|march|apr|april|may|jun|june|jul|july|aug|august|set|sep|september|oct|october|nov|november|dec|december)[/\.\s',’-]{0,4}\d{2,4}|(jan|january|feb|february|mar|march|apr|april|may|jun|june|jul|july|aug|august|set|sep|september|oct|october|nov|november|dec|december))[/\r-–––,]{0,4}[a-zA-Z]{3,8}[/\.\s',’-]{0,2}[\s]{0,4}\d{2,4}))", s)
[('July 2005 – December - 2006', ''), ("Nov '12 - Feb 12", ''), ('Nov 12 - Feb 12', ''), ('july 2005 – Dec 2012', '')]
>>> m = re.findall(r"(?mi)((?:\b(?:(?:jan|january|feb|february|mar|march|apr|april|may|jun|june|jul|july|aug|august|set|sep|september|oct|october|nov|november|dec|december)[/\.\s',’-]{0,4}\d{2,4}|(jan|january|feb|february|mar|march|apr|april|may|jun|june|jul|july|aug|august|set|sep|september|oct|october|nov|november|dec|december))[/\r-–––,]{0,4}[a-zA-Z]{3,8}[/\.\s',’-]{0,2}[\s]{0,4}\d{2,4}))", s)
>>> [(x) for x,y in m]
['July 2005 – December - 2006', "Nov '12 - Feb 12", 'Nov 12 - Feb 12', 'july 2005 – Dec 2012']
(?mi) here we combined multiline and case-insensitive modifier.

Your regex indeed works, but you have to delete the blank line at the end of your Regular expression.
See
https://regex101.com/r/wU4xK1/3

Apart from correct Aaron's remark (after removing that line break, the matches are displayed), I'd also like to mention that \s matches any white space character in [\r\n\t\f ] class, so you could avoid capturing line breaks by limiting the group to [\t\f ].

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to retrieve wanted string with re from lines - python

Related

Group python dataframe and display all correspond values for each unique key in a dictionary

python search the user name

How to split large text file based on regex using python

Extract date from email using python 2.7 regex

Regex is working. But seriously dont know, what is wrong in some parts

Categories

Resources