Python Regular expression Lookahead overshooting pattern

I'm trying to parse the entries returned by an FTP LIST command.
I'm using regex in Python 2.7.
test = "-rw-r--r-- 1 owner group 75148624 Jan 6 2015 somename.csv-rw-r--r-- 1 owner group 223259072 Feb 26 2015 somename.csv-rw-r--r-- 1 owner group 4041411 Jun 5 2015 somename-adjusted.csv-rw-r--r-- 1 owner group 2879228 May 13 2015 somename.csv-rw-r--r-- 1 owner group 11832668 Feb 13 2015 somename.csv-rw-r--r-- 1 owner group 1510522 Feb 19 2015 somename.csv-rw-r--r-- 1 owner group 2826664 Feb 25 2015 somename.csv-rw-r--r-- 1 owner group 582985 Feb 26 2015 somename.csv-rw-r--r-- 1 owner group 212427 Feb 26 2015 somename.csv-rw-r--r-- 1 owner group 3015592 Feb 27 2015 somename.csv-rw-r--r-- 1 owner group 103576 Feb 27 2015 somename-corrected.csv"
I've tried various incarnations of the following
from re import compile
ftp_list_re = compile('(?P<permissions>[d-][rwx-]{9})[\s]{1,20}'
'(?P<links>[0-9]{1,8})[\s]{1,20}'
'(?P<owner>[0-9A-Za-z_-]{1,16})[\s]{1,20}'
'(?P<group>[0-9A-Za-z_-]{1,16})[\s]{1,20}'
'(?P<size>[0-9]{1,16})[\s]{1,20}'
'(?P<month>[A-Za-z]{0,3})[\s]{1,20}'
'(?P<date>[0-9]{1,2})[\s]{1,20}'
'(?P<timeyear>[0-9:]{4,5})[\s]{1,20}'
'(?P<filename>[\s\w\.\-]+)(?=[drwx\-]{10})')
with the last line as
'(?P<filename>.+)(?=[drwx\-]{10})')
'(?P<filename>.+(?=[drwx\-]{10}))')
and originally,
'(?P<filename>[\s\w\.\-]+(?=[drwx\-]{10}|$))')
so that I can capture the last entry,
but regardless, I keep getting the following output:
ftp_list_re.findall(test)
[('-rw-r--r--',
'1',
'owner',
'group',
'75148624',
'Jan',
'6',
'2015',
'somename.csv-rw-r--r-- 1 owner group 223259072 Feb 26 2015 somename.csv-rw-r--r-- 1 owner group 4041411 Jun 5 2015 somename-adjusted.csv-rw-r--r-- 1 owner group 2879228 May 13 2015 somename.csv-rw-r--r-- 1 owner group 11832668 Feb 13 2015 somename.csv-rw-r--r-- 1 owner group 1510522 Feb 19 2015 somename.csv-rw-r--r-- 1 owner group 2826664 Feb 25 2015 somename.csv-rw-r--r-- 1 owner group 582985 Feb 26 2015 somename.csv-rw-r--r-- 1 owner group 212427 Feb 26 2015 somename.csv-rw-r--r-- 1 owner group 3015592 Feb 27 2015 somename.csv')]
What am I doing wrong?

You should make the sub-pattern before the lookahead non-greedy. Your regex can also be shortened a bit, like this:
(?P<permissions>[d-][rwx-]{9})\s{1,20}(?P<links>\d{1,8})\s{1,20}(?P<owner>[\w-]{1,16})\s{1,20}(?P<group>[\w-]{1,16})\s{1,20}(?P<size>\d{1,16})\s{1,20}(?P<month>[A-Za-z]{0,3})\s{1,20}(?P<date>\d{1,2})\s{1,20}(?P<timeyear>[\d:]{4,5})\s{1,20}(?P<filename>[\s\w.-]+?)(?=[drwx-]{10}|$)
Or using compile:
from re import compile
ftp_list_re = compile(r'(?P<permissions>[d-][rwx-]{9})\s{1,20}'
                      r'(?P<links>\d{1,8})\s{1,20}'
                      r'(?P<owner>[\w-]{1,16})\s{1,20}'
                      r'(?P<group>[\w-]{1,16})\s{1,20}'
                      r'(?P<size>\d{1,16})\s{1,20}'
                      r'(?P<month>[A-Za-z]{0,3})\s{1,20}'
                      r'(?P<date>\d{1,2})\s{1,20}'
                      r'(?P<timeyear>[\d:]{4,5})\s{1,20}'
                      r'(?P<filename>[\s\w.-]+?)(?=[drwx-]{10}|$)')
Code:
import re
p = re.compile(ur'(?P<permissions>[d-][rwx-]{9})\s{1,20}(?P<links>\d{1,8})\s{1,20}(?P<owner>[\w-]{1,16})\s{1,20}(?P<group>[\w-]{1,16})\s{1,20}(?P<size>[0-9]{1,16})\s{1,20}(?P<month>[A-Za-z]{0,3})\s{1,20}(?P<date>[0-9]{1,2})\s{1,20}(?P<timeyear>[\d:]{4,5})\s{1,20}(?P<filename>[\s\w.-]+?)(?=[drwx-]{10}|$)')
test_str = u"-rw-r--r-- 1 owner group 75148624 Jan 6 2015 somename.csv-rw-r--r-- 1 owner group 223259072 Feb 26 2015 somename.csv-rw-r--r-- 1 owner group 4041411 Jun 5 2015 somename-adjusted.csv-rw-r--r-- 1 owner group 2879228 May 13 2015 somename.csv-rw-r--r-- 1 owner group 11832668 Feb 13 2015 somename.csv-rw-r--r-- 1 owner group 1510522 Feb 19 2015 somename.csv-rw-r--r-- 1 owner group 2826664 Feb 25 2015 somename.csv-rw-r--r-- 1 owner group 582985 Feb 26 2015 somename.csv-rw-r--r-- 1 owner group 212427 Feb 26 2015 somename.csv-rw-r--r-- 1 owner group 3015592 Feb 27 2015 somename.csv-rw-r--r-- 1 owner group 103576 Feb 27 2015 somename-corrected.csv"
re.findall(p, test_str)
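For readers on Python 3, here is a minimal sketch of the same fix. It assumes a shortened three-entry listing, and the names listing and entries are illustrative rather than taken from the question. It uses finditer with groupdict so that each entry comes back as a dictionary keyed by the group names.

```python
import re

# Same pattern as above: lazy filename group, lookahead accepts
# either the next permission block or the end of the string.
ftp_list_re = re.compile(
    r'(?P<permissions>[d-][rwx-]{9})\s{1,20}'
    r'(?P<links>\d{1,8})\s{1,20}'
    r'(?P<owner>[\w-]{1,16})\s{1,20}'
    r'(?P<group>[\w-]{1,16})\s{1,20}'
    r'(?P<size>\d{1,16})\s{1,20}'
    r'(?P<month>[A-Za-z]{0,3})\s{1,20}'
    r'(?P<date>\d{1,2})\s{1,20}'
    r'(?P<timeyear>[\d:]{4,5})\s{1,20}'
    r'(?P<filename>[\s\w.-]+?)(?=[drwx-]{10}|$)'
)

# Shortened sample (assumption): three entries with no separator,
# as in the question's test string.
listing = (
    "-rw-r--r-- 1 owner group 75148624 Jan 6 2015 somename.csv"
    "-rw-r--r-- 1 owner group 4041411 Jun 5 2015 somename-adjusted.csv"
    "-rw-r--r-- 1 owner group 103576 Feb 27 2015 somename-corrected.csv"
)

# One dict per entry, keyed by the named groups.
entries = [m.groupdict() for m in ftp_list_re.finditer(listing)]

for entry in entries:
    print(entry['filename'], entry['size'])
```

Note that a filename which itself contains ten consecutive characters from [drwx-] would still be cut short, so treat this as a sketch for listings like the one shown.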

Regular expression quantifiers are by default "greedy" which means that they will "eat" as much as possible.
[\s\w\.\-]+
means to find at least one AND AS MANY AS POSSIBLE of whitespace, word, dot, or dash characters. The look ahead prevents it from eating the entire input (actually the regex engine will eat the entire input and then start backing off as needed), which means that it eats each file specification line, except for the last (which the look ahead insists must be left).
Adding a ? after a quantifier (*?, +?, ??, and so on) makes the quantifier "lazy" or "reluctant". This changes the meaning of "+" from "match at least one and as many as possible" to "match at least one and no more than necessary".
Therefore changing that last + to a +? should fix your problem.
The problem wasn't with the look ahead, which works just fine, but with the last subexpression before it.
EDIT:
Even with this change, your regular expression will not parse the last file specification line. This is because the regular expression insists that there must be a permission spec after the filename. To fix this, we must allow the lookahead to succeed at the end of the input as well (while still requiring a permission spec after every entry but the last). Making the following change will fix that:
ftp_list_re = compile('(?P<permissions>[d-][rwx-]{9})[\s]{1,20}'
'(?P<links>[0-9]{1,8})[\s]{1,20}'
'(?P<owner>[0-9A-Za-z_-]{1,16})[\s]{1,20}'
'(?P<group>[0-9A-Za-z_-]{1,16})[\s]{1,20}'
'(?P<size>[0-9]{1,16})[\s]{1,20}'
'(?P<month>[A-Za-z]{0,3})[\s]{1,20}'
'(?P<date>[0-9]{1,2})[\s]{1,20}'
'(?P<timeyear>[0-9:]{4,5})[\s]{1,20}'
'(?P<filename>[\s\w\.\-]+?)(?=(?:(?:[drwx\-]{10})|$))')
What I have done here (besides making that last + lazy) is to make the lookahead check two possibilities - either a permission specification OR an end of string. The ?: are to prevent those parentheses from capturing (otherwise you will end up with undesired extra data in your matches).
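To see the three behaviours side by side, here is a small sketch on a toy string. The string and the variable names are made up for illustration; only the filename sub-pattern and the lookahead come from the question.

```python
import re

# Toy input (assumption): three "permissions name" entries with no separator.
text = "-rw-r--r-- a.csv-rw-r--r-- b.csv-rw-r--r-- c.csv"
prefix = r'[d-][rwx-]{9}\s+'

# Greedy: the filename group runs up to the LAST permission block.
greedy = re.findall(prefix + r'([\s\w.-]+)(?=[drwx-]{10})', text)

# Lazy: the filename group stops at the FIRST permission block,
# but the final entry is lost because nothing follows it.
lazy = re.findall(prefix + r'([\s\w.-]+?)(?=[drwx-]{10})', text)

# Lazy plus "or end of string": every entry is captured.
lazy_or_end = re.findall(prefix + r'([\s\w.-]+?)(?=[drwx-]{10}|$)', text)

print(greedy)       # ['a.csv-rw-r--r-- b.csv']
print(lazy)         # ['a.csv', 'b.csv']
print(lazy_or_end)  # ['a.csv', 'b.csv', 'c.csv']
```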

I fixed your last line; the filename group was not working. See the fixed regex below:
(?P<permissions>[d-][rwx-]{9})[\s]{1,20}
(?P<links>[0-9]{1,8})[\s]{1,20}
(?P<owner>[0-9A-Za-z_-]{1,16})[\s]{1,20}
(?P<group>[0-9A-Za-z_-]{1,16})[\s]{1,20}
(?P<size>[0-9]{1,16})[\s]{1,20}
(?P<month>[A-Za-z]{0,3})[\s]{1,20}
(?P<date>[0-9]{1,2})[\s]{1,20}
(?P<timeyear>[0-9:]{4,5})[\s]{1,20}
(?P<filename>[\w\-]+\.\w+)

With the PyPI regex module, which can split on an empty match, you can do the same thing more simply, without having to describe every field:
import regex
fields = ('permissions', 'links', 'owner', 'group', 'size', 'month', 'day', 'year', 'filename')
p = regex.compile(r'(?=[d-](?:[r-][w-][x-]){3})', regex.V1)
# maxsplit=8 yields at most 9 parts, so a filename containing spaces stays in one piece
res = [dict(zip(fields, x.split(None, 8))) for x in p.split(test)[1:]]
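On Python 3.7 or later, the standard re module can also split on an empty match, so the same idea can be sketched without the third-party module. The shortened listing and the names boundary and chunks are assumptions for illustration.

```python
import re

fields = ('permissions', 'links', 'owner', 'group', 'size',
          'month', 'day', 'year', 'filename')

# Shortened sample (assumption): three entries with no separator.
listing = (
    "-rw-r--r-- 1 owner group 75148624 Jan 6 2015 somename.csv"
    "-rw-r--r-- 1 owner group 4041411 Jun 5 2015 somename-adjusted.csv"
    "-rw-r--r-- 1 owner group 103576 Feb 27 2015 somename-corrected.csv"
)

# Zero-width split point: just before anything that looks like a permission block.
boundary = re.compile(r'(?=[d-](?:[r-][w-][x-]){3})')

# Drop the empty piece produced by the split point at position 0.
chunks = [chunk for chunk in boundary.split(listing) if chunk]

# Split each entry on whitespace into at most 9 parts (8 splits),
# so the last part keeps any spaces inside the filename.
res = [dict(zip(fields, chunk.strip().split(None, 8))) for chunk in chunks]

for row in res:
    print(row['filename'], row['size'])
```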

Related

turning one column into multiple pro-rated column

I have data on insurance customers' premiums for a given year.
User ID  Period From  Period to  Period from-period to  Total premium
A8856    Jan 2022     Apr 2022   4                      $600
A8857    Jan 2022     Feb 2022   2                      $400
I'm trying to turn it into a pro-rated table, with one row per month. The output I'm expecting looks like this:
User ID  Period From  Total premium
A8856    Jan 2022     $150
A8856    Feb 2022     $150
A8856    Mar 2022     $150
A8856    Apr 2022     $150
A8857    Jan 2022     $200
A8857    Feb 2022     $200
What kind of code should I use? I use Python, and any help is really appreciated.
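One possible approach, sketched with the standard library only. The function name prorate, the tuple layout of the rows, and the premium stored as a plain number without the "$" sign are all assumptions made for illustration. The premium is split evenly across every month from "Period From" to "Period to", inclusive.

```python
from datetime import datetime

def prorate(rows):
    """Expand (user_id, period_from, period_to, total) into one row per month.

    Periods are strings such as "Jan 2022"; the total is split evenly.
    """
    out = []
    for user_id, start, end, total in rows:
        first = datetime.strptime(start, "%b %Y")
        last = datetime.strptime(end, "%b %Y")
        # Number of months covered, counting both endpoints.
        n_months = (last.year - first.year) * 12 + (last.month - first.month) + 1
        share = total / n_months
        for i in range(n_months):
            # Step forward i months from the first month, rolling over years.
            year_offset, month_index = divmod(first.month - 1 + i, 12)
            month = datetime(first.year + year_offset, month_index + 1, 1)
            out.append((user_id, month.strftime("%b %Y"), share))
    return out

rows = [
    ("A8856", "Jan 2022", "Apr 2022", 600),
    ("A8857", "Jan 2022", "Feb 2022", 400),
]

prorated = prorate(rows)
for row in prorated:
    print(row)
```

The month abbreviations are parsed and printed with %b, which depends on the locale; the sketch assumes the default English month names.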

Extract date and sort rows by date

I have a dataset that includes some strings in the following forms:
Text
Jun 28, 2021 — Brendan Moore is p...
Professor of Psychology at University
Aug 24, 2019 — Chemistry (Nobel prize...
by A Craig · 2019 · Cited by 1 — Authors. ...
... 2020 | Volume 8 | Article 330Edited by:
I would like to create a new column that numbers the rows containing a date, in ascending date order.
To do so, I need to extract the date from each row, if one exists.
Something like this:
Text Numbering
Jun 28, 2021 — Brendan Moore is p... 2
Professor of Psychology at University -1
Aug 24, 2019 — Chemistry (Nobel prize... 1
by A Craig · 2019 · Cited by 1 — Authors. ... -1
... 2020 | Volume 8 | Article 330Edited by: -1
All rows that do not start with a date in the format "Jun 28, 2021" are assigned -1.
The first step would be to identify the pattern (xxx xx, xxxx), then convert the matched text into a datetime (yyyy-mm-dd).
Once I have the dates, they need to be converted into numbers and sorted.
I am having difficulty with the last point, specifically how to keep only the dates and rank them in the right order.
The expected output would be
Text Numbering (sort by date asc)
Jun 28, 2021 — Brendan Moore is p... 2
Professor of Psychology at University -1
Aug 24, 2019 — Chemistry (Nobel prize... 1
by A Craig · 2019 · Cited by 1 — Authors. ... -1
... 2020 | Volume 8 | Article 330Edited by: -1

Mission accomplished:
import numpy as np
import pandas as pd
# Find rows that start with a date
matches = df['Text'].str.match(r'^\w+ \d+, \d{4}')
# Parse dates out of date rows
df['date'] = pd.to_datetime(df.loc[matches, 'Text'], format='%b %d, %Y', exact=False, errors='coerce')
# Assign numbering for dates
df['Numbering'] = df['date'].sort_values().groupby(np.ones(df.shape[0])).cumcount() + 1
# -1 for the non-dates
df.loc[~matches, 'Numbering'] = -1
# Cleanup
df.drop('date', axis=1, inplace=True)
Output:
>>> df
Text Numbering
0 Jun 28, 2021 - Brendan Moore is p... 2
1 Professor of Psychology at University -1
2 Aug 24, 2019 - Chemistry (Nobel prize... 1
3 by A Craig - 2019 - Cited by 1 - Authors. ... -1
4 ... 2020 | Volume 8 | Article 330Edited by: -1

Django str() method for model based on other instances

In Django, assuming a model that has a date and a description as attributes:
2021 Feb 04 Description A
2021 Feb 02 Description B
2021 Jan 31 Description C
2021 Jan 30 Description D
2020 Dec 24 Description E
Is there an easy way to not print the year/month if there is an older record with that year/month?
04 Description A
Feb 02 Description B
31 Description C
2021 Jan 30 Description D
24 Description E
Can I write a __str__() or another method for that model that takes other instances of the model into account? I have been through the Meta options for models in Django, but I'm not sure to what extent I can customize those.

Merging DataFrames with "uneven" data

Excuse the title, I'm not even sure how to label what I'm trying to do. I have data in a DataFrame that looks like this:
Name Month Status
---- ----- ------
Bob Jan Good
Bob Feb Good
Bob Mar Bad
Martha Feb Bad
John Jan Good
John Mar Bad
Not every 'Name' has a row for every 'Month'. What I want to get is:
Name Month Status
---- ----- ------
Bob Jan Good
Bob Feb Good
Bob Mar Bad
Martha Jan N/A
Martha Feb Bad
Martha Mar N/A
John Jan Good
John Feb N/A
John Mar Bad
where the missing months are filled in with a placeholder value (N/A) in the 'Status' column.
What I've tried to do so far is export all of the unique 'Month' values to a list, convert that to a DataFrame, then join/merge the two DataFrames. But I can't get anything to work.
What is the best way to do this?

You have to take advantage of pandas' indexing to reshape the data:
Step 1: create a new index from the unique values of the Name and Month columns:
new_index = pd.MultiIndex.from_product(
(df.Name.unique(), df.Month.unique()), names=["Name", "Month"]
)
Step 2: set Name and Month as the index, reindex with new_index, and call reset_index to get your final output:
df.set_index(["Name", "Month"]).reindex(new_index).reset_index()
UPDATE 2021/01/08:
You can use the complete function from pyjanitor; at the moment you have to install the latest development version from github:
# install latest dev version
# pip install git+https://github.com/ericmjl/pyjanitor.git
import janitor  # the pyjanitor package is imported under the name "janitor"
df.complete("Name", "Month")

You can treat the month as a categorical column, then allow GroupBy to do the heavy lifting:
df['Month'] = pd.Categorical(df['Month'])
df.groupby(['Name', 'Month'], as_index=False, observed=False).first()
Name Month Status
0 Bob Feb Good
1 Bob Jan Good
2 Bob Mar Bad
3 John Feb NaN
4 John Jan Good
5 John Mar Bad
6 Martha Feb Bad
7 Martha Jan NaN
8 Martha Mar NaN
The secret sauce here is that pandas treats missing "categories" by inserting a NaN there.
Caveat: This always sorts your data.

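Use pivot, then stack with dropna=False so that the missing combinations are kept:
df=df.pivot(index='Name', columns='Month', values='Status').stack(dropna=False).to_frame('Status').reset_index()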
Name Month Status
0 Bob Feb Good
1 Bob Jan Good
2 Bob Mar Bad
3 John Feb NaN
4 John Jan Good
5 John Mar Bad
6 Martha Feb Bad
7 Martha Jan NaN
8 Martha Mar NaN

Extract month names and date numbers from a raw string using regex (Edit: new test cases from 7)

I have a table in which one of the fields is a raw string of text:
"get it as soon asdec. 5 - 9 when you choose expedited shipping at checkout."
"get it as soon asdec. 10 - 13 when you choose standard shipping at checkout."
"get it as soon as"
" order soon. get it as soon asnov. 21 - 26 when you choose standard shipping at checkout."
"this item ships to canada. get it by thursday, nov. 21 - monday, dec. 2 choose this date at checkout."
"want it friday, nov. 8?order within and choose two-day shipping at checkout."
"arrives: july 2 - 3detailsfastest delivery: sunday, june 28details"
"arrives: july 6 - 9 fastest delivery: july 1 - 6"
"arrives: july 6 - 7detailsfastest delivery: june 30 - july 3"
"arrives: july 6 - july 7detailsfastest delivery: june 30 - july 3"
Note that in some of the strings above there really is no space between "as" and the month (e.g. "asdec").
I want to extract the month names and the dates from these strings and save them into new fields. An example would be:
mth_from mth_to rng_frm rng_to lat_mth lat_to lat_rn lat_rng_to
dec NULL 5 9 NULL NULL NULL NULL
dec NULL 10 13 NULL NULL NULL NULL
NULL NULL NULL NULL NULL NULL NULL NULL
nov NULL 21 26 NULL NULL NULL NULL
nov dec 21 2 NULL NULL NULL NULL
nov NULL 8 NULL NULL NULL NULL NULL
july NULL 2 3 june NULL 28 NULL
july NULL 6 9 july NULL 1 6
july NULL 6 7 june july 30 3
july july 6 7 june july 30 3
I tried using regex and created groups
re.findall("(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec).*?(\d{1,2}).*?(\d{1,2})")
and (thanks @wiktor) New EDIT
(jan|feb|mar|apr|may|june?|july?|aug|sep|oct|nov|dec)\W*(\d{1,2})(?:\s*-\s*(\d+))?(?:(?:.*?(jan|feb|mar|apr|may|june?|july?|aug|sep|oct|nov|dec))?\W+(\d{1,2})(?:\s*-\s*(\d+))?)?
New Edit End
It is working well for cases 1, 2, and 4 from above list:
group 1 = dec
group 2 = 5
group 3 = 9 ...
However, it is grabbing full match for dec. 13 - monday, dec. 23 like:
group 1 = dec
group 2 = 13
group 3 = 23
instead of creating 4 groups I want when the month name is mentioned again i.e.
group 1 = dec
group 2 = 13
group 3 = dec
group 4 = 23
Furthermore, it is not extracting anything in case of want it friday, nov. 8? which should actually show results like:
group 1 = nov
group 2 = 8
Is there a better way to do this that works for all of these test cases?
New EDIT
Is creating 8 groups ideal? Happy to learn more ideas.

One solution (works with your text input in your question, probably needs more input data to work-out quirks):
data = [
"get it as soon asdec. 5 - 9 when you choose expedited shipping at checkout.",
"get it as soon asdec. 10 - 13 when you choose standard shipping at checkout.",
"get it as soon as",
" order soon. get it as soon asnov. 21 - 26 when you choose standard shipping at checkout.",
"this item ships to canada. get it by thursday, nov. 21 - monday, dec. 2 choose this date at checkout.",
"want it friday, nov. 8?order within and choose two-day shipping at checkout.",
]
import re
for line in data:
    m = re.findall(r'((?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\.)|(\d+)', line)
    month_from, month_to, range_from, range_to = 'NULL', 'NULL', 'NULL', 'NULL'
    if len(m) == 3:
        month_from = m[0][0]
        range_from = m[1][1]
        range_to = m[2][1]
    elif len(m) == 4:
        month_from = m[0][0]
        month_to = m[2][0]
        range_from = m[1][1]
        range_to = m[3][1]
    elif len(m) == 2:
        month_from = m[0][0]
        range_from = m[1][1]
    print('{:<10} {:<10} {:<10} {:<10}'.format(month_from, month_to, range_from, range_to))
Prints:
dec. NULL 5 9
dec. NULL 10 13
NULL NULL NULL NULL
nov. NULL 21 26
nov. dec. 21 2
nov. NULL 8 NULL

You may use a pattern that is more precise about what can appear between the numbers, together with a couple of optional groups:
(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\W*(\d{1,2})(?:(?:.*?(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec))?\W+(\d{1,2}))?
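Or, add a word boundary to only match months at the start of a word (note that this will not match inputs such as "asdec. 5", where the month is glued to the preceding word):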
\b(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\W*(\d{1,2})(?:(?:.*?(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec))?\W+(\d{1,2}))?
Details
\b - word boundary
(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec) - Group 1: abbreviated month names (when they are part of a longer pattern, it makes sense to make each alternative match at a different location in a string, thus, change it to (j(?:an|u[nl])|feb|ma[ry]|a(?:pr|ug)|sep|oct|nov|dec))
\W* - 0 or more non-word chars
(\d{1,2}) - Group 2: one or two digits
(?:(?:.*?(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec))?\W+(\d{1,2}))? - an optional sequence of:
    (?:.*?(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec))? - an optional sequence of:
        .*? - any 0+ chars other than line break chars, as few as possible
        (jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec) - Group 3: abbreviated month names
    \W+ - 1 or more non-word chars
    (\d{1,2}) - Group 4: one or two digits
In Python, you may build the pattern dynamically to make it readable:
import re
months = r'(j(?:an|u[nl])|feb|ma[ry]|a(?:pr|ug)|sep|oct|nov|dec)'
pat = r'\b{0}\W*(\d{{1,2}})(?:(?:.*?{0})?\W+(\d{{1,2}}))?'.format(months)
re.findall(pat, text)
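As a quick check, here is a sketch that runs the dynamically built pattern against a few of the question's strings. The leading \b is left out here because some inputs glue the month to the previous word ("asdec"). The helper name extract and the lines list are illustrative, not part of the answer above.

```python
import re

months = r'(j(?:an|u[nl])|feb|ma[ry]|a(?:pr|ug)|sep|oct|nov|dec)'
# Same pattern as above, minus the leading \b (see the note on "asdec").
pat = re.compile(r'{0}\W*(\d{{1,2}})(?:(?:.*?{0})?\W+(\d{{1,2}}))?'.format(months))

lines = [
    "get it as soon asdec. 5 - 9 when you choose expedited shipping at checkout.",
    "get it as soon as",
    "this item ships to canada. get it by thursday, nov. 21 - monday, dec. 2 choose this date at checkout.",
    "want it friday, nov. 8?order within and choose two-day shipping at checkout.",
]

def extract(line):
    """Return (month_from, day_from, month_to, day_to), or None if no date is found."""
    m = pat.search(line)
    return m.groups() if m else None

results = [extract(line) for line in lines]
for line, groups in zip(lines, results):
    print(groups)
```

Groups that did not take part in the match come back as None, which maps naturally onto the NULL values in the desired table. Full month names such as "july" are not matched by this abbreviated-month alternation; the question's own edit with june? and july? addresses that.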
