Tue Aug 21 17:02:26 2018 (gtgrhrthrhrhrthhhthrthrhrh)
fjfpjpgporejpejgjr[eh[[[jh[j[ej[[ej[ej[e]]]]
fkw[kgkeg[ekrk[ekg[kergk[erkg[eg[kg]
Tue Aug 21 17:31:06 2018 ( ijwejfwfjwpfjwf[[few[jjfwfefwfeffeww]]
fiowhfiweohewhfpwfhpfhpepwehfphpwhfpehfpwfh
f,wfpewfefewgpwpg,pewgp
Tue Aug 21 18:10:42 2018 ( reijpjfpjejferjfrejfpjefjer
k[pfk[epkf[kr[ek[ke[gkk]
r[g[keprkgpekg[rkg[pkg[ekg]
Above is an example of the content in the text file. I want to extract a string with re.
How should I construct the findall condition to achieve the expected result below? I have tried the following:
match=re.findall(r'[Tue\w]+2018$',data2)
but it is not working. I understand that $ is the symbol for the end of the string. How can I do it?
Expected Result is:
Tue Aug 21 17:02:26 2018
Tue Aug 21 17:31:06 2018
Tue Aug 21 18:10:42 2018
.
.
.
Use the pattern:
^Tue.*?2018
^     Assert position at the beginning of a line.
Tue   Match the literal substring.
.*?   Match anything, lazily.
2018  Match the literal substring.
Since you are working with a multiline string and you want to match the pattern at the beginning of each line, you have to use the re.MULTILINE flag.
import re
mystr="""
Tue Aug 21 17:02:26 2018 (gtgrhrthrhrhrthhhthrthrhrh)
fjfpjpgporejpejgjr[eh[[[jh[j[ej[[ej[ej[e]]]]
fkw[kgkeg[ekrk[ekg[kergk[erkg[eg[kg]
Tue Aug 21 17:31:06 2018 ( ijwejfwfjwpfjwf[[few[jjfwfefwfeffeww]]
fiowhfiweohewhfpwfhpfhpepwehfphpwhfpehfpwfh
f,wfpewfefewgpwpg,pewgp
Tue Aug 21 18:10:42 2018 ( reijpjfpjejferjfrejfpjefjer
k[pfk[epkf[kr[ek[ke[gkk]
r[g[keprkgpekg[rkg[pkg[ekg]
"""
print(re.findall(r'^Tue.*?2018',mystr,re.MULTILINE))
Prints:
['Tue Aug 21 17:02:26 2018', 'Tue Aug 21 17:31:06 2018', 'Tue Aug 21 18:10:42 2018']
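If the year is not always 2018 (an assumption on my part, not stated in the question), a slight generalization is to match any run of four digits instead:
print(re.findall(r'^Tue.*?\d{4}', mystr, re.MULTILINE))  # same output for the sample above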
I have a large file containing multiple lines, and some lines contain a unique pattern. I want to split the large file based on this pattern.
Below is the data in the text file:
commit e6bcab96ffe1f55e80be8d0c1e5342fb9d69ca30
Date: Sat Jun 9 04:11:37 2018 +0530
configurations
commit 5c8deb3114b4ed17c5d2ea31842869515073670f
Date: Sat Jun 9 02:59:56 2018 +0530
remote
commit 499516b7e4f95daee4f839f34cc46df404b52d7a
Date: Sat Jun 9 02:52:51 2018 +0530
remote fix
This reverts commit 0a2917bd49eec7ca2f380c890300d75b69152353.
commit 349e1b42d3b3d23e95a227a1ab744fc6167e6893
Date: Sat Jun 9 02:52:37 2018 +0530
Revert "Removing the printf added"
This reverts commit da0fac94719176009188ce40864b09cfb84ca590.
commit 8bfd4e7086ff5987491f280b57d10c1b6e6433fe
Date: Sat Jun 9 02:52:18 2018 +0530
Revert Bulk
This reverts commit c2ee318635987d44e579c92d0b86b003e1d2a076.
commit bcb10c54068602a96d367ec09f08530ede8059ef
Date: Fri Jun 8 19:53:03 2018 +0530
fix crash observed
commit a84169f79fbe9b18702f6885b0070bce54d6dd5a
Date: Fri Jun 8 18:14:21 2018 +0530
Interface PBR
commit 254726fe3fe0b9f6b228189e8a6fe7bdf4aa9314
Date: Fri Jun 8 18:12:10 2018 +0530
Crash observed
commit 18e7106d54e19310d32e8b31d584cec214fb2cb7
Date: Fri Jun 8 18:09:13 2018 +0530
Changes to fix crash
Currently my code is as below:
import re
readtxtfile = r'C:\gitlog.txt'
with open(readtxtfile) as fp:
    txtrawdata = fp.read()
commits = re.split(r'^(commit|)[ a-zA-Z0-9]{40}$', txtrawdata)
print(commits)
Expected Output:
I want to split the above string on lines like "commit 18e7106d54e19310d32e8b31d584cec214fb2cb7" and convert the result into a Python list.
import re
text = ''' commit e6bcab96ffe1f55e80be8d0c1e5342fb9d69ca30
Date: Sat Jun 9 04:11:37 2018 +0530
configurations
commit 5c8deb3114b4ed17c5d2ea31842869515073670f
Date: Sat Jun 9 02:59:56 2018 +0530
remote
commit 499516b7e4f95daee4f839f34cc46df404b52d7a
Date: Sat Jun 9 02:52:51 2018 +0530
remote fix
This reverts commit 0a2917bd49eec7ca2f380c890300d75b69152353.'''
print(re.split(r'^\s*commit \S*\s*', text, flags=re.MULTILINE))
This outputs:
['', 'Date: Sat Jun 9 04:11:37 2018 +0530\n\n configurations\n', 'Date: Sat Jun 9 02:59:56 2018 +0530\n\n remote\n', 'Date: Sat Jun 9 02:52:51 2018 +0530\n\n remote fix\n This reverts commit 0a2917bd49eec7ca2f380c890300d75b69152353.']
Explanation of this regex in Regex101 here.
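If you want to drop the leading empty string and trim the whitespace around each chunk (a small addition of mine, not part of the original answer), you can filter the result:
chunks = [c.strip() for c in re.split(r'^\s*commit \S*\s*', text, flags=re.MULTILINE) if c.strip()]
print(chunks)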
# data holds the full git log text (e.g. txtrawdata from the question)
groups = re.findall(r'(^\s*commit\s+[a-z0-9]+.*?)(?=^commit|\Z)', data, flags=re.DOTALL|re.MULTILINE)
for g in groups:
    print(g)
    print('-' * 80)
Prints:
commit e6bcab96ffe1f55e80be8d0c1e5342fb9d69ca30
Date: Sat Jun 9 04:11:37 2018 +0530
configurations
--------------------------------------------------------------------------------
commit 5c8deb3114b4ed17c5d2ea31842869515073670f
Date: Sat Jun 9 02:59:56 2018 +0530
remote
--------------------------------------------------------------------------------
commit 499516b7e4f95daee4f839f34cc46df404b52d7a
Date: Sat Jun 9 02:52:51 2018 +0530
remote fix
This reverts commit 0a2917bd49eec7ca2f380c890300d75b69152353.
--------------------------------------------------------------------------------
...and so on
This will extract the commit shas:
import re

commits = list()
readtxtfile = r'C:\gitlog.txt'
with open(readtxtfile) as fp:
    for line in fp:
        m = re.match(r'^commit\s+([a-f0-9]{40})$', line)
        if m:
            commits.append(m.group(1))  # group(1) is just the 40-character sha
commits is now a list of just the commit sha strings. Note that if your git log output format changes, this matching regex will need to change as well. Make sure you're generating the log with --no-abbrev-commit.
The array below is the result of my model. How do I append it back to my dataframe as the last column?
In[] logireg.predict(X.head(5))
Out[] array([0, 0, 0, 1, 0], dtype=int64)
dataframe data:
age job month
33 blue apr
56 admin jun
37 tech aug
76 retired jun
56 service may
expected output
age job month predict
33 blue apr 0
56 admin jun 0
37 tech aug 0
76 retired jun 1
56 service may 0
Should I use a for loop or the zip function?
You can just assign it directly to a dataframe column. Assuming your dataframe is called dataframe here:
predictions = logireg.predict(X.head(5))
dataframe['predict'] = predictions
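For illustration, here is a minimal, self-contained sketch with a hard-coded stand-in for the model output (the column names and values are taken from the question; logireg itself is not reproduced). The length of the array must match the number of rows in the dataframe:
import pandas as pd

dataframe = pd.DataFrame({
    'age': [33, 56, 37, 76, 56],
    'job': ['blue', 'admin', 'tech', 'retired', 'service'],
    'month': ['apr', 'jun', 'aug', 'jun', 'may'],
})

predictions = [0, 0, 0, 1, 0]  # stand-in for logireg.predict(X.head(5))
dataframe['predict'] = predictions
print(dataframe)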
I'm trying to pull the data contained within an FTP LIST response.
I'm using regex within Python 2.7.
test = "-rw-r--r-- 1 owner group 75148624 Jan 6 2015 somename.csv-rw-r--r-- 1 owner group 223259072 Feb 26 2015 somename.csv-rw-r--r-- 1 owner group 4041411 Jun 5 2015 somename-adjusted.csv-rw-r--r-- 1 owner group 2879228 May 13 2015 somename.csv-rw-r--r-- 1 owner group 11832668 Feb 13 2015 somename.csv-rw-r--r-- 1 owner group 1510522 Feb 19 2015 somename.csv-rw-r--r-- 1 owner group 2826664 Feb 25 2015 somename.csv-rw-r--r-- 1 owner group 582985 Feb 26 2015 somename.csv-rw-r--r-- 1 owner group 212427 Feb 26 2015 somename.csv-rw-r--r-- 1 owner group 3015592 Feb 27 2015 somename.csv-rw-r--r-- 1 owner group 103576 Feb 27 2015 somename-corrected.csv"
I've tried various incarnations of the following
from re import compile
ftp_list_re = compile('(?P<permissions>[d-][rwx-]{9})[\s]{1,20}'
'(?P<links>[0-9]{1,8})[\s]{1,20}'
'(?P<owner>[0-9A-Za-z_-]{1,16})[\s]{1,20}'
'(?P<group>[0-9A-Za-z_-]{1,16})[\s]{1,20}'
'(?P<size>[0-9]{1,16})[\s]{1,20}'
'(?P<month>[A-Za-z]{0,3})[\s]{1,20}'
'(?P<date>[0-9]{1,2})[\s]{1,20}'
'(?P<timeyear>[0-9:]{4,5})[\s]{1,20}'
'(?P<filename>[\s\w\.\-]+)(?=[drwx\-]{10})')
with the last line as
'(?P<filename>.+)(?=[drwx\-]{10})')
'(?P<filename>.+(?=[drwx\-]{10}))')
and originally,
'(?P<filename>[\s\w\.\-]+(?=[drwx\-]{10}|$))')
so I can capture the last entry,
but regardless, I keep getting the following output:
ftp_list_re.findall(test)
[('-rw-r--r--',
'1',
'owner',
'group',
'75148624',
'Jan',
'6',
'2015',
'somename.csv-rw-r--r-- 1 owner group 223259072 Feb 26 2015 somename.csv-rw-r--r-- 1 owner group 4041411 Jun 5 2015 somename-adjusted.csv-rw-r--r-- 1 owner group 2879228 May 13 2015 somename.csv-rw-r--r-- 1 owner group 11832668 Feb 13 2015 somename.csv-rw-r--r-- 1 owner group 1510522 Feb 19 2015 somename.csv-rw-r--r-- 1 owner group 2826664 Feb 25 2015 somename.csv-rw-r--r-- 1 owner group 582985 Feb 26 2015 somename.csv-rw-r--r-- 1 owner group 212427 Feb 26 2015 somename.csv-rw-r--r-- 1 owner group 3015592 Feb 27 2015 somename.csv')]
What am I doing wrong?
You should make the sub-pattern before the lookahead non-greedy. Furthermore, your regex can be shortened a bit, like this:
(?P<permissions>[d-][rwx-]{9})\s{1,20}(?P<links>\d{1,8})\s{1,20}(?P<owner>[\w-]{1,16})\s{1,20}(?P<group>[\w-]{1,16})\s{1,20}(?P<size>\d{1,16})\s{1,20}(?P<month>[A-Za-z]{0,3})\s{1,20}(?P<date>\d{1,2})\s{1,20}(?P<timeyear>[\d:]{4,5})\s{1,20}(?P<filename>[\s\w.-]+?)(?=[drwx-]{10}|$)
Or using compile:
from re import compile
ftp_list_re = compile('(?P<permissions>[d-][rwx-]{9})\s{1,20}'
'(?P<links>\d{1,8})\s{1,20}'
'(?P<owner>[\w-]{1,16})\s{1,20}'
'(?P<group>[\w-]{1,16})\s{1,20}'
'(?P<size>\d{1,16})\s{1,20}'
'(?P<month>[A-Za-z]{0,3})\s{1,20}'
'(?P<date>\d{1,2})\s{1,20}'
'(?P<timeyear>[\d:]{4,5})\s{1,20}'
'(?P<filename>[\s\w.-]+?)(?=[drwx-]{10}|$)')
RegEx Demo
Code:
import re
p = re.compile(ur'(?P<permissions>[d-][rwx-]{9})\s{1,20}(?P<links>\d{1,8})\s{1,20}(?P<owner>[\w-]{1,16})\s{1,20}(?P<group>[\w-]{1,16})\s{1,20}(?P<size>[0-9]{1,16})\s{1,20}(?P<month>[A-Za-z]{0,3})\s{1,20}(?P<date>[0-9]{1,2})\s{1,20}(?P<timeyear>[\d:]{4,5})\s{1,20}(?P<filename>[\s\w.-]+?)(?=[drwx-]{10}|$)')
test_str = u"-rw-r--r-- 1 owner group 75148624 Jan 6 2015 somename.csv-rw-r--r-- 1 owner group 223259072 Feb 26 2015 somename.csv-rw-r--r-- 1 owner group 4041411 Jun 5 2015 somename-adjusted.csv-rw-r--r-- 1 owner group 2879228 May 13 2015 somename.csv-rw-r--r-- 1 owner group 11832668 Feb 13 2015 somename.csv-rw-r--r-- 1 owner group 1510522 Feb 19 2015 somename.csv-rw-r--r-- 1 owner group 2826664 Feb 25 2015 somename.csv-rw-r--r-- 1 owner group 582985 Feb 26 2015 somename.csv-rw-r--r-- 1 owner group 212427 Feb 26 2015 somename.csv-rw-r--r-- 1 owner group 3015592 Feb 27 2015 somename.csv-rw-r--r-- 1 owner group 103576 Feb 27 2015 somename-corrected.csv"
re.findall(p, test_str)
Regular expression quantifiers are by default "greedy" which means that they will "eat" as much as possible.
[\s\w\.\-]+
means to find at least one AND AS MANY AS POSSIBLE of whitespace, word, dot, or dash characters. The look ahead prevents it from eating the entire input (actually the regex engine will eat the entire input and then start backing off as needed), which means that it eats each file specification line, except for the last (which the look ahead insists must be left).
Adding a ? after a quantifier (*?, +?, ??, and so on) makes the quantifier "lazy" or "reluctant". This changes the meaning of "+" from "match at least one and as many as possible" to "match at least one and no more than necessary".
Therefore changing that last + to a +? should fix your problem.
The problem wasn't with the look ahead, which works just fine, but with the last subexpression before it.
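As a toy illustration of greedy versus lazy matching (this example is mine, not from the original post):
import re

s = '<a><b><c>'
print(re.findall(r'<.+>', s))   # greedy: ['<a><b><c>']
print(re.findall(r'<.+?>', s))  # lazy:   ['<a>', '<b>', '<c>']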
EDIT:
Even with this change, your regular expression will not parse the last file specification line. This is because the regular expression INSISTS that there must be a permission spec after the filename. To fix this, we must allow the lookahead to also accept the end of the string (while still requiring a permission spec after every entry BUT the last). Making the following change will fix that:
ftp_list_re = compile('(?P<permissions>[d-][rwx-]{9})[\s]{1,20}'
'(?P<links>[0-9]{1,8})[\s]{1,20}'
'(?P<owner>[0-9A-Za-z_-]{1,16})[\s]{1,20}'
'(?P<group>[0-9A-Za-z_-]{1,16})[\s]{1,20}'
'(?P<size>[0-9]{1,16})[\s]{1,20}'
'(?P<month>[A-Za-z]{0,3})[\s]{1,20}'
'(?P<date>[0-9]{1,2})[\s]{1,20}'
'(?P<timeyear>[0-9:]{4,5})[\s]{1,20}'
'(?P<filename>[\s\w\.\-]+?)(?=(?:(?:[drwx\-]{10})|$))')
What I have done here (besides making that last + lazy) is to make the lookahead check two possibilities: either a permission specification OR the end of the string. The ?: are there to prevent those parentheses from capturing (otherwise you will end up with undesired extra data in your matches).
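As a quick usage sketch (assuming the test string from the question and the ftp_list_re compiled above), each entry can then be pulled out as a dict of named groups:
for m in ftp_list_re.finditer(test):
    print(m.groupdict())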
I fixed your last line; the filename group was not working. See the fixed regex below:
(?P<permissions>[d-][rwx-]{9})[\s]{1,20}
(?P<links>[0-9]{1,8})[\s]{1,20}
(?P<owner>[0-9A-Za-z_-]{1,16})[\s]{1,20}
(?P<group>[0-9A-Za-z_-]{1,16})[\s]{1,20}
(?P<size>[0-9]{1,16})[\s]{1,20}
(?P<month>[A-Za-z]{0,3})[\s]{1,20}
(?P<date>[0-9]{1,2})[\s]{1,20}
(?P<timeyear>[0-9:]{4,5})[\s]{1,20}
(?P<filename>[\w\-]+.\w+)
With the PyPI regex module, which allows splitting on an empty match, you can do the same thing in a simpler way, without having to describe all the fields:
import regex
fields = ('permissions', 'links', 'owner', 'group', 'size', 'month', 'day', 'year', 'filename')
p = regex.compile(r'(?=[d-](?:[r-][w-][x-]){3})', regex.V1)
res = [dict(zip(fields, x.split(None, 9))) for x in p.split(test)[1:]]
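A quick check of the parsed result (assuming the test string from the question) might look like this; each entry in res is a dict keyed by the field names above:
for entry in res[:2]:
    print(entry['filename'], entry['size'], entry['month'], entry['day'], entry['year'])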