How to split a large text file based on a regex using Python

I have a large file containing multiple lines, and some lines contain a unique pattern. I want to split the large file based on this pattern.
Below is the data in the text file:
commit e6bcab96ffe1f55e80be8d0c1e5342fb9d69ca30
Date: Sat Jun 9 04:11:37 2018 +0530
configurations
commit 5c8deb3114b4ed17c5d2ea31842869515073670f
Date: Sat Jun 9 02:59:56 2018 +0530
remote
commit 499516b7e4f95daee4f839f34cc46df404b52d7a
Date: Sat Jun 9 02:52:51 2018 +0530
remote fix
This reverts commit 0a2917bd49eec7ca2f380c890300d75b69152353.
commit 349e1b42d3b3d23e95a227a1ab744fc6167e6893
Date: Sat Jun 9 02:52:37 2018 +0530
Revert "Removing the printf added"
This reverts commit da0fac94719176009188ce40864b09cfb84ca590.
commit 8bfd4e7086ff5987491f280b57d10c1b6e6433fe
Date: Sat Jun 9 02:52:18 2018 +0530
Revert Bulk
This reverts commit c2ee318635987d44e579c92d0b86b003e1d2a076.
commit bcb10c54068602a96d367ec09f08530ede8059ef
Date: Fri Jun 8 19:53:03 2018 +0530
fix crash observed
commit a84169f79fbe9b18702f6885b0070bce54d6dd5a
Date: Fri Jun 8 18:14:21 2018 +0530
Interface PBR
commit 254726fe3fe0b9f6b228189e8a6fe7bdf4aa9314
Date: Fri Jun 8 18:12:10 2018 +0530
Crash observed
commit 18e7106d54e19310d32e8b31d584cec214fb2cb7
Date: Fri Jun 8 18:09:13 2018 +0530
Changes to fix crash
Currently my code is as below:
import re

readtxtfile = r'C:\gitlog.txt'
with open(readtxtfile) as fp:
    txtrawdata = fp.read()

commits = re.split(r'^(commit|)[ a-zA-Z0-9]{40}$', txtrawdata)
print(commits)
Expected Output:
I want to split the above string on lines like "commit 18e7106d54e19310d32e8b31d584cec214fb2cb7" and convert the pieces into a Python list.

import re
text = ''' commit e6bcab96ffe1f55e80be8d0c1e5342fb9d69ca30
Date: Sat Jun 9 04:11:37 2018 +0530
configurations
commit 5c8deb3114b4ed17c5d2ea31842869515073670f
Date: Sat Jun 9 02:59:56 2018 +0530
remote
commit 499516b7e4f95daee4f839f34cc46df404b52d7a
Date: Sat Jun 9 02:52:51 2018 +0530
remote fix
This reverts commit 0a2917bd49eec7ca2f380c890300d75b69152353.'''
print(re.split(r'^\s*commit \S*\s*', text, flags=re.MULTILINE))
This outputs:
['', 'Date: Sat Jun 9 04:11:37 2018 +0530\n\n configurations\n', 'Date: Sat Jun 9 02:59:56 2018 +0530\n\n remote\n', 'Date: Sat Jun 9 02:52:51 2018 +0530\n\n remote fix\n This reverts commit 0a2917bd49eec7ca2f380c890300d75b69152353.']
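Note that re.split leaves an empty string before the first match (the leading '' above). If that's unwanted, a small filter handles it:
# Drop the empty leading element that re.split produces before the first match.
commits = [chunk for chunk in re.split(r'^\s*commit \S*\s*', text, flags=re.MULTILINE) if chunk]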

An explanation of this regex can be found on Regex101. Alternatively, to keep each commit header together with its message, use re.findall:
groups = re.findall(r'(^\s*commit\s+[a-z0-9]+.*?)(?=^commit|\Z)', data, flags=re.DOTALL|re.MULTILINE)
for g in groups:
    print(g)
    print('-' * 80)
Prints:
commit e6bcab96ffe1f55e80be8d0c1e5342fb9d69ca30
Date: Sat Jun 9 04:11:37 2018 +0530
configurations
--------------------------------------------------------------------------------
commit 5c8deb3114b4ed17c5d2ea31842869515073670f
Date: Sat Jun 9 02:59:56 2018 +0530
remote
--------------------------------------------------------------------------------
commit 499516b7e4f95daee4f839f34cc46df404b52d7a
Date: Sat Jun 9 02:52:51 2018 +0530
remote fix
This reverts commit 0a2917bd49eec7ca2f380c890300d75b69152353.
--------------------------------------------------------------------------------
...and so on

This will extract the commit shas:
import re

commits = []
readtxtfile = r'C:\gitlog.txt'
with open(readtxtfile) as fp:
    for line in fp:
        m = re.match(r'^commit\s+([a-f0-9]{40})$', line)
        if m:
            commits.append(m.group(1))  # group(1) captures just the 40-char sha
commits is now a list of just the commit sha strings. If your git log output format changes, the matching regex will need to change with it. Make sure you're generating the log with --no-abbrev-commit.
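For reference, a hedged sketch of producing that log straight from Python (the subprocess call is my assumption; any git log --no-abbrev-commit output saved to gitlog.txt works the same):
import subprocess

# Full, unabbreviated shas so the 40-character pattern above keeps matching.
log = subprocess.run(
    ['git', 'log', '--no-abbrev-commit'],
    capture_output=True, text=True, check=True,
).stdout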

Related

Turning one column into multiple pro-rated columns

I have data regarding an insurance customer's premium during a certain year.
User ID | Period From | Period to | Period from-period to (months) | Total premium
A8856   | Jan 2022    | Apr 2022  | 4                              | $600
A8857   | Jan 2022    | Feb 2022  | 2                              | $400
And I'm trying to turn it into a pro-rated one. The output I'm expecting looks like this:
User ID | Period From | Total premium
A8856   | Jan 2022    | $150
A8856   | Feb 2022    | $150
A8856   | Mar 2022    | $150
A8856   | Apr 2022    | $150
A8857   | Jan 2022    | $200
A8857   | Feb 2022    | $200
What kind of code do you think I should use? I'm using Python, and help is really appreciated.
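A minimal sketch with pandas, assuming the exact column names shown above: parse both period columns, build one entry per covered month with pd.period_range, explode to one row per month, and split the premium evenly.
import pandas as pd

# Sample frame matching the first table above (column names are assumptions).
df = pd.DataFrame({
    'User ID': ['A8856', 'A8857'],
    'Period From': ['Jan 2022', 'Jan 2022'],
    'Period to': ['Apr 2022', 'Feb 2022'],
    'Total premium': [600, 400],
})

df['Period From'] = pd.to_datetime(df['Period From'], format='%b %Y')
df['Period to'] = pd.to_datetime(df['Period to'], format='%b %Y')

# One monthly period per month covered, then one row per period.
df['months'] = df.apply(
    lambda r: list(pd.period_range(r['Period From'], r['Period to'], freq='M')),
    axis=1)
df['Total premium'] = df['Total premium'] / df['months'].apply(len)  # pro-rate evenly
out = (df.explode('months')[['User ID', 'months', 'Total premium']]
         .rename(columns={'months': 'Period From'}))
print(out)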

How to retrieve wanted string with re from lines

Tue Aug 21 17:02:26 2018 (gtgrhrthrhrhrthhhthrthrhrh)
fjfpjpgporejpejgjr[eh[[[jh[j[ej[[ej[ej[e]]]]
fkw[kgkeg[ekrk[ekg[kergk[erkg[eg[kg]
Tue Aug 21 17:31:06 2018 ( ijwejfwfjwpfjwf[[few[jjfwfefwfeffeww]]
fiowhfiweohewhfpwfhpfhpepwehfphpwhfpehfpwfh
f,wfpewfefewgpwpg,pewgp
Tue Aug 21 18:10:42 2018 ( reijpjfpjejferjfrejfpjefjer
k[pfk[epkf[kr[ek[ke[gkk]
r[g[keprkgpekg[rkg[pkg[ekg]
Above is an example of the content in the text file. I want to extract a string with re.
How should I construct the findall condition to achieve the expected result below? I have tried the following:
match=re.findall(r'[Tue\w]+2018$',data2)
but it is not working. I understand that $ is the symbol for the end of the string. How can I do it?
Expected Result is:
Tue Aug 21 17:02:26 2018
Tue Aug 21 17:31:06 2018
Tue Aug 21 18:10:42 2018
...
Use the pattern:
^Tue.*?2018
^ Assert position beginning of line.
Tue Literal substring.
.*? Match anything lazily.
2018 Match literal substring.
Since you are working with a multiline string and you want to match the pattern at the beginning of each line, you have to use the re.MULTILINE flag.
import re
mystr="""
Tue Aug 21 17:02:26 2018 (gtgrhrthrhrhrthhhthrthrhrh)
fjfpjpgporejpejgjr[eh[[[jh[j[ej[[ej[ej[e]]]]
fkw[kgkeg[ekrk[ekg[kergk[erkg[eg[kg]
Tue Aug 21 17:31:06 2018 ( ijwejfwfjwpfjwf[[few[jjfwfefwfeffeww]]
fiowhfiweohewhfpwfhpfhpepwehfphpwhfpehfpwfh
f,wfpewfefewgpwpg,pewgp
Tue Aug 21 18:10:42 2018 ( reijpjfpjejferjfrejfpjefjer
k[pfk[epkf[kr[ek[ke[gkk]
r[g[keprkgpekg[rkg[pkg[ekg]
"""
print(re.findall(r'^Tue.*?2018',mystr,re.MULTILINE))
Prints:
['Tue Aug 21 17:02:26 2018', 'Tue Aug 21 17:31:06 2018', 'Tue Aug 21 18:10:42 2018']

In python, what's the most efficient way to identify clusters of n-events t-time apart?

For example, if I have the following event data, and want to find clusters of at least 2 events that are within 1 minute of each other in which id_1, id_2, and id_3 are all the same. For reference, I have the epoch timestamp (in microseconds) in addition to the date-time timestamp.
event_id timestamp id_1 id_2 id_3
9442823 Jun 15, 2015 10:22 PM PDT A 1 34567
9442821 Jun 15, 2015 10:22 PM PDT A 2 12345
9442817 Jun 15, 2015 10:22 PM PDT A 3 34567
9442814 Jun 15, 2015 10:22 PM PDT A 1 12345
9442813 Jun 15, 2015 10:22 PM PDT A 2 34567
9442810 Jun 15, 2015 10:22 PM PDT A 3 12345
9442805 Jun 15, 2015 10:22 PM PDT A 1 34567
9442876 Jun 15, 2015 10:23 PM PDT A 2 12345
9442866 Jun 15, 2015 10:23 PM PDT A 3 34567
9442858 Jun 15, 2015 10:23 PM PDT A 1 12345
9442845 Jun 15, 2015 10:23 PM PDT C 2 34567
9442840 Jun 15, 2015 10:23 PM PDT C 3 12345
9442839 Jun 15, 2015 10:23 PM PDT C 1 34567
9442838 Jun 15, 2015 10:23 PM PDT C 2 12345
9442907 Jun 15, 2015 10:24 PM PDT C 3 34567
9442886 Jun 15, 2015 10:24 PM PDT C 1 12345
9442949 Jun 15, 2015 10:25 PM PDT C 2 34567
9442934 Jun 15, 2015 10:25 PM PDT C 3 12345
For each cluster found, I want to return a set of (id_1, id_2, id_3, [list of event_ids], min_timestamp_of_cluster, max_timestamp_of_cluster). Additionally, if there's a cluster with (e.g.) 6 events, I'd only want to return a single result with all events, not one for each grouping of 3 events.
If I understood your problem correctly, you can make use of scikit-learn's DBSCAN clustering algorithm with a custom distance (or metric) function. Your distance function should return a very large number if any of id_1, id_2, or id_3 differ. Otherwise it should return the time difference.
But with this method, the number of clusters is determined by the algorithm, not given as an input to it. If you are determined to pass the number of clusters as an input, k-means is the clustering algorithm you may need to look into.
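A minimal sketch of that DBSCAN idea (the numeric encoding of id_1 and the sample rows are my assumptions for illustration, not from the question):
import numpy as np
from sklearn.cluster import DBSCAN

LARGE = 1e12  # effectively infinite distance for mismatched ids

def event_distance(a, b):
    # a, b are rows: [epoch_seconds, id_1, id_2, id_3], with ids encoded as numbers.
    if a[1] != b[1] or a[2] != b[2] or a[3] != b[3]:
        return LARGE          # different ids can never share a cluster
    return abs(a[0] - b[0])   # otherwise distance = time difference in seconds

X = np.array([
    # epoch_s, id_1 (A=0, C=1), id_2, id_3
    [0.0,   0, 1, 34567],
    [10.0,  0, 1, 34567],
    [200.0, 1, 2, 34567],
])

# eps=60: two events are neighbors if they share ids and are <= 60 seconds apart.
labels = DBSCAN(eps=60, min_samples=2, metric=event_distance).fit_predict(X)
print(labels)  # [0, 0, -1]; -1 marks noise, equal labels share a cluster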
In pure python, use a "sliding window" that encompasses all the events in a 1 minute range.
The premise is simple: maintain a queue of events that is a subsequence
of the total list, in order. The "window" (queue) should be all the events you care about. In this case, that is determined by the 60-second time gap requirement.
As you process events, add one event to the end of the queue. If the first event in the queue is more than 60 seconds from the newly-added last event, slide the window forward by dropping the first event from the front of the queue.
This is Python 3:
import collections
import operator
import itertools
from datetime import datetime

#### FROM HERE: vvv is just faking events. Delete or replace.
class Event(collections.namedtuple('Event', 'event_id timestamp id_1 id_2 id_3 epoch_ts')):
    def __str__(self):
        return ('{e.event_id} {e.timestamp} {e.id_1} {e.id_2} {e.id_3}'
                .format(e=self))

def get_events():
    event_list = map(operator.methodcaller('strip'), '''
        9442823 Jun 15, 2015 10:22 PM PDT A 1 34567
        9442821 Jun 15, 2015 10:22 PM PDT A 2 12345
        9442817 Jun 15, 2015 10:22 PM PDT A 3 34567
        9442814 Jun 15, 2015 10:22 PM PDT A 1 12345
        9442813 Jun 15, 2015 10:22 PM PDT A 2 34567
        9442810 Jun 15, 2015 10:22 PM PDT A 3 12345
        9442805 Jun 15, 2015 10:22 PM PDT A 1 34567
        9442876 Jun 15, 2015 10:23 PM PDT A 2 12345
        9442866 Jun 15, 2015 10:23 PM PDT A 3 34567
        9442858 Jun 15, 2015 10:23 PM PDT A 1 12345
        9442845 Jun 15, 2015 10:23 PM PDT C 2 34567
        9442840 Jun 15, 2015 10:23 PM PDT C 3 12345
        9442839 Jun 15, 2015 10:23 PM PDT C 1 34567
        9442838 Jun 15, 2015 10:23 PM PDT C 2 12345
        9442907 Jun 15, 2015 10:24 PM PDT C 3 34567
        9442886 Jun 15, 2015 10:24 PM PDT C 1 12345
        9442949 Jun 15, 2015 10:25 PM PDT C 2 34567
        9442934 Jun 15, 2015 10:25 PM PDT C 3 12345
        '''.strip().splitlines())
    for line in event_list:
        idev, *rest = line.split()
        ts = rest[:6]
        id1, id2, id3 = rest[6:]
        id2 = int(id2)  # faster when sorting (see find_clustered_events)
        id3 = int(id3)  # faster when sorting (see find_clustered_events)
        ts_str = ' '.join(ts)
        dt = datetime.strptime(ts_str.replace('PDT', '-0700'), '%b %d, %Y %I:%M %p %z')
        epoch = dt.timestamp()
        ev = Event(idev, ts_str, id1, id2, id3, epoch)
        yield ev
#### TO HERE: ^^^ was just faking up your events. Delete or replace.

def add_cluster(key, group):
    '''Do whatever you want with the clusters. I'll print them.'''
    print('Cluster:', key)
    print(' ','\n '.join(map(str, group)), sep='')

def find_clustered_events(events, cluster=3, gap_secs=60):
    '''Call add_cluster on clusters of events within a maximum time gap.

    Args:
        events (iterable): series of events, in chronological order
        cluster (int): minimum number of events in a cluster
        gap_secs (float): maximum time-gap from start to end of cluster

    Returns:
        None.
    '''
    window = collections.deque()
    evkey = lambda e: (e.id_1, e.id_2, e.id_3)
    for ev in events:
        window.append(ev)
        t0 = window[0].epoch_ts
        tn = window[-1].epoch_ts
        if tn - t0 < gap_secs:
            continue
        window.pop()
        for k, g in itertools.groupby(sorted(window, key=evkey), key=evkey):
            group = tuple(g)
            if len(group) >= cluster:
                add_cluster(k, group)
        window.append(ev)
        window.popleft()

# Call find_clustered_events with the event generator and cluster args.
# Note that your data doesn't have any 3-clusters within 60 seconds. :-(
find_clustered_events(get_events(), cluster=2)
The output looks like this:
$ python test.py
Cluster: ('A', 1, 34567)
9442823 Jun 15, 2015 10:22 PM PDT A 1 34567
9442805 Jun 15, 2015 10:22 PM PDT A 1 34567
Cluster: ('A', 2, 12345)
9442821 Jun 15, 2015 10:22 PM PDT A 2 12345
9442876 Jun 15, 2015 10:23 PM PDT A 2 12345
Cluster: ('A', 3, 34567)
9442817 Jun 15, 2015 10:22 PM PDT A 3 34567
9442866 Jun 15, 2015 10:23 PM PDT A 3 34567
Cluster: ('A', 1, 12345)
9442814 Jun 15, 2015 10:22 PM PDT A 1 12345
9442858 Jun 15, 2015 10:23 PM PDT A 1 12345
Cluster: ('C', 2, 34567)
9442845 Jun 15, 2015 10:23 PM PDT C 2 34567
9442949 Jun 15, 2015 10:25 PM PDT C 2 34567
Please note: the code above doesn't try to keep track of events already in a cluster. So if you have, for example, an event type that occurs every fifteen seconds, you will have a sequence like this:
event1 t=0:00
event2 t=0:15
event3 t=0:30
event4 t=0:45
event5 t=1:00
And you will get overlapping clusters:
event1, event2, event3 (t=0:00 .. 0:30)
event2, event3, event4 (t=0:15 .. 0:45)
event3, event4, event5 (t=0:30 .. 1:00)
Technically, those are valid clusters, each slightly different. But you may wish to expunge previously-clustered events from the window, if you want the events to only appear in a single cluster.
Alternatively, if the chance of clustering and repetition is low, it might improve performance to implement repeat-checking in the add_cluster() function, to reduce the work done by the main loop.
A final note: this does a LOT of sorting. And the sorting is not efficient, since it gets repeated every time a new event appears. If you have a large data set, the performance will probably be bad. If your event keys are relatively few - that is, if the id1,2,3 values tend to repeat over and over again - you would be better off dynamically creating separate deques for each distinct key (id1+id2+id3) and dispatching the event to the appropriate deque, applying the same window logic, and then checking the length of the deque.
On the other hand, if you are processing something like web-server logs, where the requester is always changing, that might tend to create a memory problem with all the useless deques. So this is a memory vs. time trade-off you'll have to be aware of.
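For illustration, a sketch of that per-key variant (the function name is mine; it reuses the Event records and add_cluster from above): one deque per distinct (id_1, id_2, id_3), so the main loop needs no sorting or groupby.
import collections

def find_clustered_events_by_key(events, cluster=3, gap_secs=60):
    # One sliding window (deque) per distinct (id_1, id_2, id_3) key.
    windows = collections.defaultdict(collections.deque)
    for ev in events:
        key = (ev.id_1, ev.id_2, ev.id_3)
        w = windows[key]
        w.append(ev)
        # Slide this key's window forward past events older than gap_secs.
        while ev.epoch_ts - w[0].epoch_ts >= gap_secs:
            w.popleft()
        if len(w) >= cluster:
            add_cluster(key, tuple(w))  # may report overlapping clusters; dedupe in add_cluster if needed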

Extract date from email using python 2.7 regex

I have tried many regexes to extract the date from emails that have this format, but I couldn't get it to work:
Date: Tue, 13 Nov 2001 08:41:49 -0800 (PST)
Sent: Thursday, November 08, 2001 10:25 AM
This is how it looks in all the emails, and I want to extract both.
Thank you in advance
You can do something like this using this kind of pattern:
Using Python3:
import re
data = "Date: Tue, 13 Nov 2001 08:41:49 -0800 (PST)"
final = re.findall(r"Date: (\w+), ([0-9]+) (\w+) ([0-9]+)", data)
print("{0}, {1}".format(final[0][0], " ".join(final[0][1:])))
print(" ".join(final[0][1:]))
Using Python2:
import re
data = "Date: Tue, 13 Nov 2001 08:41:49 -0800 (PST)"
final = re.findall(r"Date: (\w+), ([0-9]+) (\w+) ([0-9]+)", data)
print "%s, %s" % (final[0][0], " ".join(final[0][1:]))
print " ".join(final[0][1:])
Output:
Tue, 13 Nov 2001
13 Nov 2001
Edit:
A quick answer to the new update of your question: you can do something like this:
import re

email = '''Date: Tue, 13 Nov 2001 08:41:49 -0800 (PST)
Sent: Thursday, November 08, 2001 10:25 AM'''
data = email.split("\n")
pattern = r"(\w+: \w+, [0-9]+ \w+ [0-9]+)|(\w+: \w+, \w+ [0-9]+, [0-9]+)"
final = []
for k in data:
    final += re.findall(pattern, k)
final = [j.split(":") for k in final for j in k if j != '']

# Python3
print(final)
# Python2
# print final
Output:
[['Date', ' Tue, 13 Nov 2001'], ['Sent', ' Thursday, November 08, 2001']]
import re
my_email = 'Date: Tue, 13 Nov 2001 08:41:49 -0800 (PST)'
match = re.search(r': (\w{3,3}, \d{2,2} \w{3,3} \d{4,4})', my_email)
print(match.group(1))
I am not a regex expert, but here is a solution; you can write some tests for it:
import re

d = "Date: Tue, 13 Nov 2001 08:41:49 -0800 (PST)"
dates = [re.search(r'(\d+ \w+ \d+)', date).groups()[0]
         for date in re.search(r'(Date: \w+, \d+ \w+ \d+)', d).groups()]
print(dates)  # ['13 Nov 2001']
Instead of using a regex, you can use split() if you only need to extract from the same fixed string layout:
email_date = "Date: Tue, 13 Nov 2001 08:41:49 -0800 (PST)"
email_date = '%s %s %s %s' % (tuple(email_date.split(' ')[1:5]))
Output:
Tue, 13 Nov 2001

Python Regular expression Lookahead overshooting pattern

I'm trying to pull the data contained within FTP LIST.
I'm using regex within Python 2.7.
test = "-rw-r--r-- 1 owner group 75148624 Jan 6 2015 somename.csv-rw-r--r-- 1 owner group 223259072 Feb 26 2015 somename.csv-rw-r--r-- 1 owner group 4041411 Jun 5 2015 somename-adjusted.csv-rw-r--r-- 1 owner group 2879228 May 13 2015 somename.csv-rw-r--r-- 1 owner group 11832668 Feb 13 2015 somename.csv-rw-r--r-- 1 owner group 1510522 Feb 19 2015 somename.csv-rw-r--r-- 1 owner group 2826664 Feb 25 2015 somename.csv-rw-r--r-- 1 owner group 582985 Feb 26 2015 somename.csv-rw-r--r-- 1 owner group 212427 Feb 26 2015 somename.csv-rw-r--r-- 1 owner group 3015592 Feb 27 2015 somename.csv-rw-r--r-- 1 owner group 103576 Feb 27 2015 somename-corrected.csv"
I've tried various incarnations of the following
from re import compile

ftp_list_re = compile('(?P<permissions>[d-][rwx-]{9})[\s]{1,20}'
                      '(?P<links>[0-9]{1,8})[\s]{1,20}'
                      '(?P<owner>[0-9A-Za-z_-]{1,16})[\s]{1,20}'
                      '(?P<group>[0-9A-Za-z_-]{1,16})[\s]{1,20}'
                      '(?P<size>[0-9]{1,16})[\s]{1,20}'
                      '(?P<month>[A-Za-z]{0,3})[\s]{1,20}'
                      '(?P<date>[0-9]{1,2})[\s]{1,20}'
                      '(?P<timeyear>[0-9:]{4,5})[\s]{1,20}'
                      '(?P<filename>[\s\w\.\-]+)(?=[drwx\-]{10})')
with the last line as
'(?P<filename>.+)(?=[drwx\-]{10})')
'(?P<filename>.+(?=[drwx\-]{10}))')
and originally,
'(?P<filename>[\s\w\.\-]+(?=[drwx\-]{10}|$))')
so I can capture the last entry. But regardless, I keep getting the following output:
ftp_list_re.findall(test)
[('-rw-r--r--',
'1',
'owner',
'group',
'75148624',
'Jan',
'6',
'2015',
'somename.csv-rw-r--r-- 1 owner group 223259072 Feb 26 2015 somename.csv-rw-r--r-- 1 owner group 4041411 Jun 5 2015 somename-adjusted.csv-rw-r--r-- 1 owner group 2879228 May 13 2015 somename.csv-rw-r--r-- 1 owner group 11832668 Feb 13 2015 somename.csv-rw-r--r-- 1 owner group 1510522 Feb 19 2015 somename.csv-rw-r--r-- 1 owner group 2826664 Feb 25 2015 somename.csv-rw-r--r-- 1 owner group 582985 Feb 26 2015 somename.csv-rw-r--r-- 1 owner group 212427 Feb 26 2015 somename.csv-rw-r--r-- 1 owner group 3015592 Feb 27 2015 somename.csv')]
What am I doing wrong?
You should make the sub-pattern before the lookahead non-greedy. Further, your regex can be shortened a bit like this:
(?P<permissions>[d-][rwx-]{9})\s{1,20}(?P<links>\d{1,8})\s{1,20}(?P<owner>[\w-]{1,16})\s{1,20}(?P<group>[\w-]{1,16})\s{1,20}(?P<size>\d{1,16})\s{1,20}(?P<month>[A-Za-z]{0,3})\s{1,20}(?P<date>\d{1,2})\s{1,20}(?P<timeyear>[\d:]{4,5})\s{1,20}(?P<filename>[\s\w.-]+?)(?=[drwx-]{10}|$)
Or using compile:
from re import compile
ftp_list_re = compile('(?P<permissions>[d-][rwx-]{9})\s{1,20}'
                      '(?P<links>\d{1,8})\s{1,20}'
                      '(?P<owner>[\w-]{1,16})\s{1,20}'
                      '(?P<group>[\w-]{1,16})\s{1,20}'
                      '(?P<size>\d{1,16})\s{1,20}'
                      '(?P<month>[A-Za-z]{0,3})\s{1,20}'
                      '(?P<date>\d{1,2})\s{1,20}'
                      '(?P<timeyear>[\d:]{4,5})\s{1,20}'
                      '(?P<filename>[\s\w.-]+?)(?=[drwx-]{10}|$)')
Code:
import re
p = re.compile(ur'(?P<permissions>[d-][rwx-]{9})\s{1,20}(?P<links>\d{1,8})\s{1,20}(?P<owner>[\w-]{1,16})\s{1,20}(?P<group>[\w-]{1,16})\s{1,20}(?P<size>[0-9]{1,16})\s{1,20}(?P<month>[A-Za-z]{0,3})\s{1,20}(?P<date>[0-9]{1,2})\s{1,20}(?P<timeyear>[\d:]{4,5})\s{1,20}(?P<filename>[\s\w.-]+?)(?=[drwx-]{10}|$)')
test_str = u"-rw-r--r-- 1 owner group 75148624 Jan 6 2015 somename.csv-rw-r--r-- 1 owner group 223259072 Feb 26 2015 somename.csv-rw-r--r-- 1 owner group 4041411 Jun 5 2015 somename-adjusted.csv-rw-r--r-- 1 owner group 2879228 May 13 2015 somename.csv-rw-r--r-- 1 owner group 11832668 Feb 13 2015 somename.csv-rw-r--r-- 1 owner group 1510522 Feb 19 2015 somename.csv-rw-r--r-- 1 owner group 2826664 Feb 25 2015 somename.csv-rw-r--r-- 1 owner group 582985 Feb 26 2015 somename.csv-rw-r--r-- 1 owner group 212427 Feb 26 2015 somename.csv-rw-r--r-- 1 owner group 3015592 Feb 27 2015 somename.csv-rw-r--r-- 1 owner group 103576 Feb 27 2015 somename-corrected.csv"
re.findall(p, test_str)
Regular expression quantifiers are by default "greedy" which means that they will "eat" as much as possible.
[\s\w\.\-]+
means to find at least one AND AS MANY AS POSSIBLE of whitespace, word, dot, or dash characters. The look ahead prevents it from eating the entire input (actually the regex engine will eat the entire input and then start backing off as needed), which means that it eats each file specification line, except for the last (which the look ahead insists must be left).
Adding a ? after a quantifier (*?, +?, ??, and so on) makes the quantifier "lazy" or "reluctant". This changes the meaning of "+" from "match at least one and as many as possible" to "match at least one and no more than necessary".
Therefore changing that last + to a +? should fix your problem.
The problem wasn't with the look ahead, which works just fine, but with the last subexpression before it.
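A small illustration of that difference (the sample string here is invented, shortened from the question's data):
import re

s = 'a.csv-rw-r--r-- 1 owner group 5 Jan 6 2015 b.csv-rw-r--r--'
print(re.search(r'[\s\w.-]+(?=-rw)', s).group())
# greedy: 'a.csv-rw-r--r-- 1 owner group 5 Jan 6 2015 b.csv' (overshoots into the next entry)
print(re.search(r'[\s\w.-]+?(?=-rw)', s).group())
# lazy: 'a.csv' (stops as soon as the lookahead can match)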
EDIT:
Even with this change, your regular expression will not parse that last file specification line. This is because the regular expressions INSISTS that there must be a permission spec after the filename. To fix this, we must allow that look ahead to not match (but require it to match at everything BUT the last specification). Making the following change will fix that
ftp_list_re = compile('(?P<permissions>[d-][rwx-]{9})[\s]{1,20}'
                      '(?P<links>[0-9]{1,8})[\s]{1,20}'
                      '(?P<owner>[0-9A-Za-z_-]{1,16})[\s]{1,20}'
                      '(?P<group>[0-9A-Za-z_-]{1,16})[\s]{1,20}'
                      '(?P<size>[0-9]{1,16})[\s]{1,20}'
                      '(?P<month>[A-Za-z]{0,3})[\s]{1,20}'
                      '(?P<date>[0-9]{1,2})[\s]{1,20}'
                      '(?P<timeyear>[0-9:]{4,5})[\s]{1,20}'
                      '(?P<filename>[\s\w\.\-]+?)(?=(?:(?:[drwx\-]{10})|$))')
What I have done here (besides making that last + lazy) is to make the lookahead check two possibilities - either a permission specification OR an end of string. The ?: are to prevent those parentheses from capturing (otherwise you will end up with undesired extra data in your matches).
Fixed your last line; the filename group was not working. See the fixed regex below:
(?P<permissions>[d-][rwx-]{9})[\s]{1,20}
(?P<links>[0-9]{1,8})[\s]{1,20}
(?P<owner>[0-9A-Za-z_-]{1,16})[\s]{1,20}
(?P<group>[0-9A-Za-z_-]{1,16})[\s]{1,20}
(?P<size>[0-9]{1,16})[\s]{1,20}
(?P<month>[A-Za-z]{0,3})[\s]{1,20}
(?P<date>[0-9]{1,2})[\s]{1,20}
(?P<timeyear>[0-9:]{4,5})[\s]{1,20}
(?P<filename>[\w\-]+.\w+)
With the PyPI regex module, which allows splitting on an empty match, you can do the same in a simpler way, without having to describe all the fields:
import regex
fields = ('permissions', 'links', 'owner', 'group', 'size', 'month', 'day', 'year', 'filename')
p = regex.compile(r'(?=[d-](?:[r-][w-][x-]){3})', regex.V1)
res = [dict(zip(fields, x.split(None, 9))) for x in p.split(test)[1:]]
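With test set to the FTP listing from the question, each element of res is then a dict, for example:
print(res[0]['filename'])   # somename.csv
print(res[-1]['filename'])  # somename-corrected.csv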
