List index out of range, splitting regular expression match - python

For some reason this little part of my code is giving me a problem. I have been trying to figure out why it raises a "list index out of range" error.
#This works fine, and finds a match
if re.search("Manufacturer\/Distributor name:?", arg) != None:
    #---->This is giving me the problem, "List index out of range"<----
    address = arg.split("Manufacturer\/Distributor name:?", 1)[1]
This is the arg I'm feeding it:
Product Name: Tio Nacho Shampoo Mexican Herbs Recommended Use: Shampoo Manufacturer/Distributor name: Garcoa Laboratories, Inc. 26135 Mureau Road Calabasas, CA 91302 (818) 225 - 0375 Emerg ency telephone number: CHEMTREC 1 - 800 - 424 - 9300 2 .
When I have it set to [1], this is the result:
List index out of range
When I have the split set to [0], this is the result:
/Distributor name: Garcoa Laboratories, Inc. 26135 Mureau Road Calabasas, CA 91302 (818) 225 - 0375 Emerg ency telephone number: CHEMTREC 1 - 800 - 424 - 9300 2 .
I'm trying to get this result:
Garcoa Laboratories, Inc. 26135 Mureau Road Calabasas, CA 91302 (818) 225 - 0375 Emerg ency telephone number: CHEMTREC 1 - 800 - 424 - 9300 2 .
It's matching, but the split for some reason doesn't want to work. What am I missing? Why does it give that result for [0]?
Thanks for the help!

str.split() doesn't take a regular expression, you need to use re.split().
address = re.split(r'Manufacturer\/Distributor name:?', arg, 1)[1]
You should also get in the habit of using raw strings for regular expressions, otherwise you need to escape the \.
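To make the difference concrete, here is a minimal sketch run against a shortened version of the arg from the question (the shortened string and the names pattern, literal_parts and address are only for illustration). str.split() looks for the pattern text literally, finds nothing, and returns a one-element list, which is why [1] fails, while re.split() treats the same text as a regex:

```python
import re

# Shortened version of the arg from the question (illustrative only).
arg = ("Recommended Use: Shampoo Manufacturer/Distributor name: "
       "Garcoa Laboratories, Inc. 26135 Mureau Road Calabasas, CA 91302")

pattern = r"Manufacturer\/Distributor name:?"

# str.split() searches for the literal text, backslash and "?" included,
# so nothing is found and the whole string comes back as a single element.
literal_parts = arg.split(pattern, 1)
print(len(literal_parts))  # indexing literal_parts[1] would raise IndexError

# re.split() interprets the pattern as a regular expression.
address = re.split(pattern, arg, maxsplit=1)[1].strip()
print(address)
```

The .strip() call is optional; it only removes the space left after the colon.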

I'm assuming arg is a string. str.split() does not accept a regex as the delimiter; it treats the argument as a literal string.
Instead, you should use arg.split("Manufacturer/Distributor name", 1)[1].

Related

Using regex to split a column

The regex I am using is \d+-\d+, but I'm not quite sure about how to separate the Roman numbers and how to create a new column with them.
I have this dataset:
Date_Title                        Date  Copies
05-21 I. Don Quixote              1605  252
21-20 IV. Macbeth                 1629  987
10-12 ML. To Kill a Mockingbird   1960  478
12 V. Invisible Man               1897  136
Basically, I would like to split the "Date Title", so, when I print a row, I would get this:
('05-21 I', 'I', 'Don Quixote', 1605, 252)
Or
('10-12 ML', 'ML', 'To Kill a Mockingbird',1960, 478)
The first element holds the numbers and the Roman numeral; the second, only the Roman numeral; the third, the name; the fourth and fifth are the same as in the dataset.
You can use
df = pd.DataFrame({'Date_Title':['05-21 I. Don Quixote','21-20 IV. Macbeth','10-12 ML. To Kill a Mockingbird','12 V. Invisible Man'], 'Date':[1605,1629,1960,1897], 'Copies':[252,987,478,136]})
rx = r'^(\d+(?:-\d+)?\s*(M{0,4}(?:C[MD]|D?C{0,3})(?:X[CL]|L?X{0,3})(?:I[XV]|V?I{0,3})))\.\s*(.*)'
df[['NumRoman','Roman','Name']] = df.pop('Date_Title').str.extract(rx)
df = df[['NumRoman','Roman','Name', 'Date', 'Copies']]
>>> df
   NumRoman  Roman                   Name  Date  Copies
0   05-21 I      I            Don Quixote  1605     252
1  21-20 IV     IV                Macbeth  1629     987
2  10-12 ML     ML  To Kill a Mockingbird  1960     478
3      12 V      V          Invisible Man  1897     136
See the regex demo. Details:
^ - start of string
(\d+(?:-\d+)?\s*(M{0,4}(?:C[MD]|D?C{0,3})(?:X[CL]|L?X{0,3})(?:I[XV]|V?I{0,3}))) - Group 1 ("NumRoman"):
\d+(?:-\d+)? - one or more digits followed with an optional sequence of a - and one or more digits
\s* - zero or more whitespaces
(M{0,4}(?:C[MD]|D?C{0,3})(?:X[CL]|L?X{0,3})(?:I[XV]|V?I{0,3})) - Group 2 ("Roman"): see How do you match only valid roman numerals with a regular expression? for explanation
\. - a dot
\s* - zero or more whitespaces
(.*) - Group 3 ("Name"): any zero or more chars other than line break chars, as many as possible
Note df.pop('Date_Title') removes the Date_Title column and yields it as input for the extract method. df = df[['NumRoman','Roman','Name', 'Date', 'Copies']] is necessary if you need to keep the original column order.
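If you want to check the pattern outside pandas first, a small sketch with plain re may help. It applies the same regex to single strings; the helper name split_date_title is hypothetical and not part of the answer above:

```python
import re

# Same pattern as above, split over several lines for readability.
rx = re.compile(
    r'^(\d+(?:-\d+)?\s*'
    r'(M{0,4}(?:C[MD]|D?C{0,3})(?:X[CL]|L?X{0,3})(?:I[XV]|V?I{0,3})))'
    r'\.\s*(.*)'
)

def split_date_title(value):
    """Return (NumRoman, Roman, Name), or None if the value does not match."""
    m = rx.match(value)
    return m.groups() if m else None

print(split_date_title('10-12 ML. To Kill a Mockingbird'))
print(split_date_title('12 V. Invisible Man'))
```

Rows that do not follow the "digits, Roman numeral, dot, title" layout return None here; in the pandas version the corresponding cells come back as NaN.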
I am pretty sure there might be a more optimal solution, but this would be a fast way of solving it:
df['Date_Title'] = df['Date_Title'].apply(lambda x: (x.split()[0], x.split()[1], ' '.join(x.split()[2:])))
Or:
parts = df['Date_Title'].str.split()
df['Date_Title'] = list(zip(parts.str[0], parts.str[1], parts.str[2:].str.join(' ')))
Focusing on the string split:
string = "21-20 IV. Macbeth"
i = string.index(".") # Finds the first point
date, roman = string[:i].split() # 21-20, IV
title = string[i+2:] # Macbeth
df = df.assign(
    x=df['Date_Title'].str.split(r'\.').str[0],
    y=df['Date_Title'].str.extract(r'(\w+(?=\.))', expand=False),
    z=df['Date_Title'].str.split(r'\.').str[1:].str.join(','),
)

Python Regular expression of group to match text before amount

I am trying to write a python regular expression which captures multiple values from a few columns in dataframe. Below regular expression attempts to do the same. There are 4 parts of the string.
group 1: Date - month and day
group 2: Date - month and day
group 3: description text before amount i.e. group 4
group 4: amount - this group is optional
Some peculiar conditions for group 3 - text that
(1)the text itself might contain characters like "-" , "$". So we cannot use - & $ as the boundary of text.
(2) The text (group 3) sometimes may not be followed by amount.
(3) Empty space between group 3 and 4 is optional
Below is python function code which takes in a dataframe having 4 columns c1,c2,c3,c4 adds the columns dt, txt and amt after processing to dataframe.
def parse_values(args):
    re_1='(([JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC]{3}\s{0,}[\d]{1,2})\s{0,}){2}(.*[\s]|.*[^\$]|.*[^-]){1}([-+]?\$[\d|,]+(?:\.\d+)?)?'
    srch=re.search(re_1, args[0])
    if srch is None:
        return args
    m = re.match(re_1, args[0])
    args['dt']=m.group(1)
    args['txt']=m.group(3)
    args['amt']=m.group(4)
    if m.group(4) is None:
        if pd.isnull(args['c3']):
            args['amt']=args.c2
        else:
            args['amt']=args.c3
    return args
To test the results, I have the 6 rows below, each of which needs to come back with a properly formatted amt column.
tt=[{'c1':'OCT 7 OCT 8 HURRY CURRY THORNHILL ','c2':'$16.84'},
{'c1':'OCT 7 OCT 8 HURRY CURRY THORNHILL','c2':'$16.84'},
{'c1':'MAR 15 MAR 16 LOBLAWS FOODS INC - EAST YORK -$80,00,7770.70'},
{'c1':'MAR 15 MAR 16 LOBLAWS FOODS INC - EAST YORK-$2070.70'},
{'c1':'MAR 15 MAR 16 LOBLAWS FOODS INC - EAST YORK$2070.70'},
{'c1':'MAR 15 MAR 16 LOBLAWS FOODS INC - EAST YORK $80,00,7770.70'}
]
t=pd.DataFrame(tt,columns=['c1','c2','c3','c4'])
t=t.apply(parse_values,1)
t
However, due to the error in my regular expression re_1, the amt and txt columns are not parsed properly: they return NaN or miss some words (as depicted in some rows of the output image below).
How about this:
(((?:JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)\s*[\d]{1,2})\s*){2}(.*?)\s*(?=[\-$])([-+]?\$[\d|,]+(?:\.\d+)?)
As seen at regex101.com
Explanation:
First off, I've shortened the regex by changing a few minor details, like using \s* instead of \s{0,}, which mean exactly the same thing.
The whole [JAN|...|DEC] construct was a character class, i.e. [], which only matches a single character from the set. Using a non-capturing group with alternation is the correct way of choosing between multi-letter alternatives, which in your case are the months.
The meat of the regex: LOOKAHEADS
(?=[\-$]) tells the regex that the lazy (.*?) before it should stop at a position that is followed by a dash or a dollar sign. Lookaheads don't consume whatever they're looking for; they only assert that it follows the current position.
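As a quick check, here is a minimal sketch that runs the suggested pattern over some of the test rows from the question. The helper name parse_row and the variable names are made up for illustration. Note that in this pattern the amount group is not optional, so rows without an amount produce no match and have to be handled separately (for example by falling back to c2 or c3, as the question's function already does):

```python
import re

amount_rx = re.compile(
    r'(((?:JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)\s*[\d]{1,2})\s*){2}'
    r'(.*?)\s*(?=[\-$])([-+]?\$[\d|,]+(?:\.\d+)?)'
)

def parse_row(text):
    """Return (txt, amt) from groups 3 and 4, or None when nothing matches."""
    m = amount_rx.search(text)
    return (m.group(3), m.group(4)) if m else None

rows = [
    'OCT 7 OCT 8 HURRY CURRY THORNHILL',
    'MAR 15 MAR 16 LOBLAWS FOODS INC - EAST YORK -$80,00,7770.70',
    'MAR 15 MAR 16 LOBLAWS FOODS INC - EAST YORK-$2070.70',
    'MAR 15 MAR 16 LOBLAWS FOODS INC - EAST YORK$2070.70',
]
for row in rows:
    print(parse_row(row))
```

The dash inside the description ("INC - EAST") does not end the text early, because the lazy group keeps extending until the amount pattern after the lookahead also matches.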

How to separate the number and string in python?

I am getting the string from the front end which contains both string and number Eg: "L'Oreal Paris L'Huile Nail Paint, 224 Rose Ballet, 13.5ml".
Now I want to separate the 13.5ml to 13.5 as one value and ml as another value to insert the value in the backend table.
You could try using re.findall with the regex pattern \d+(?:\.\d+)?ml:
import re
text = "L'Oreal Paris L'Huile Nail Paint, 224 Rose Ballet, 13.5ml"
matches = re.findall(r'(\d+(?:\.\d+)?)(ml)', text)
print(matches)
This prints:
[('13.5', 'ml')]
Edit:
To handle capturing a known list of units, you may modify the above regex pattern to the following:
\d+(?:\.\d+)?(?:GM|KG|LIT)
This uses an alternation to represent each possible unit, and you may add new units as you see fit.
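A rough sketch of how that could look with capturing groups, so the number and the unit come back as separate values. GM, KG and LIT are the units from the pattern above; adding ML, ignoring case, and the names UNITS and unit_rx are assumptions made for this example:

```python
import re

# GM, KG and LIT come from the pattern above; ML is added as an assumption
# so that the sample string from the question matches.
UNITS = ("GM", "KG", "LIT", "ML")

unit_rx = re.compile(
    r'(\d+(?:\.\d+)?)\s*(' + '|'.join(UNITS) + r')\b',
    re.IGNORECASE,
)

text = "L'Oreal Paris L'Huile Nail Paint, 224 Rose Ballet, 13.5ml"
quantity, unit = unit_rx.search(text).groups()
print(float(quantity), unit)
```

The "224" in the shade name is skipped because it is not followed by one of the listed units. If a string might contain no quantity at all, check the result of search() for None before calling groups().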
data = "L'Oreal Paris L'Huile Nail Paint, 224 Rose Ballet, 13.5ml, 14dl"
for i in range(len(data)-1):
    try:
        # if number is before letter
        int(data[i])
        if data[i+1].isalpha():
            data = data[:i+1] + ' ' + data[i+1:] # add space between number and letter
    except ValueError:
        # data[i] is not a digit, move on
        pass
print (data)
output:
L'Oreal Paris L'Huile Nail Paint, 224 Rose Ballet, 13.5 ml, 14 dl

Extract string after a multiline string using python regex [duplicate]

I am trying to write a regular expression which returns a part of substring which is after a string. For example: I want to get part of substring along with spaces which resides after "15/08/2017".
a='''S
LINC SHORT LEGAL TITLE NUMBER
0037 471 661 1720278;16;21 172 211 342
LEGAL DESCRIPTION
PLAN 1720278
BLOCK 16
LOT 21
EXCEPTING THEREOUT ALL MINES AND MINERALS
ESTATE: FEE SIMPLE
ATS REFERENCE: 4;24;54;2;SW
MUNICIPALITY: CITY OF EDMONTON
REFERENCE NUMBER: 172 023 641 +71
----------------------------------------------------------------------------
----
REGISTERED OWNER(S)
REGISTRATION DATE(DMY) DOCUMENT TYPE VALUE CONSIDERATION
---------------------------------------------------------------------------
--
---
172 211 342 15/08/2017 AFFIDAVIT OF CASH & MTGE'''
Is there a way to get 'AFFIDAVIT OF' and 'CASH & MTGE' as separate strings?
Here is the expression I have pieced together so far:
doc = (a.split('15/08/2017', 1)[1]).strip()
'AFFIDAVIT OF CASH & MTGE'
Not a regex based solution. But does the trick.
a='''S
LINC SHORT LEGAL TITLE NUMBER
0037 471 661 1720278;16;21 172 211 342
LEGAL DESCRIPTION
PLAN 1720278
BLOCK 16
LOT 21
EXCEPTING THEREOUT ALL MINES AND MINERALS
ESTATE: FEE SIMPLE
ATS REFERENCE: 4;24;54;2;SW
MUNICIPALITY: CITY OF EDMONTON
REFERENCE NUMBER: 172 023 641 +71
----------------------------------------------------------------------------
----
REGISTERED OWNER(S)
REGISTRATION DATE(DMY) DOCUMENT TYPE VALUE CONSIDERATION
---------------------------------------------------------------------------
--
---
172 211 342 15/08/2017 AFFIDAVIT OF CASH & MTGE'''
doc = (a.split('15/08/2017', 1)[1]).strip()
# used split with two white spaces instead of one to get the desired result
# (this assumes the columns in the real data are separated by two or more spaces)
print(doc.split("  ")[0].strip()) # outputs AFFIDAVIT OF
print(doc.split("  ")[-1].strip()) # outputs CASH & MTGE
Hope it helps.
re based code snippet
import re
foo = '''S
LINC SHORT LEGAL TITLE NUMBER
0037 471 661 1720278;16;21 172 211 342
LEGAL DESCRIPTION
PLAN 1720278
BLOCK 16
LOT 21
EXCEPTING THEREOUT ALL MINES AND MINERALS
ESTATE: FEE SIMPLE
ATS REFERENCE: 4;24;54;2;SW
MUNICIPALITY: CITY OF EDMONTON
REFERENCE NUMBER: 172 023 641 +71
----------------------------------------------------------------------------
----
REGISTERED OWNER(S)
REGISTRATION DATE(DMY) DOCUMENT TYPE VALUE CONSIDERATION
---------------------------------------------------------------------------
--
---
172 211 342 15/08/2017 AFFIDAVIT OF CASH & MTGE'''
pattern = r'.*\d{2}/\d{2}/\d{4}\s+(\w+\s+\w+)\s+(\w+\s+.*\s+\w+)'
result = re.findall(pattern, foo, re.MULTILINE)
print("1st match:", result[0][0])
print("2nd match:", result[0][1])
Output
1st match: AFFIDAVIT OF
2nd match: CASH & MTGE
We can try using re.findall with the following pattern:
PHASED OF (((?!\bCONDOMINIUM PLAN).)*)(?=CONDOMINIUM PLAN)
Searching in multiline and DOTALL mode, the above pattern will match everything occurring between PHASED OF until, but not including, CONDOMINIUM PLAN.
input = "182 246 612 01/10/2018 PHASED OF CASH & MTGE\n CONDOMINIUM PLAN"
result = re.findall(r'PHASED OF (((?!\bCONDOMINIUM PLAN).)*)(?=CONDOMINIUM PLAN)', input, re.DOTALL|re.MULTILINE)
output = result[0][0].strip()
print(output)
CASH & MTGE
Note that I also strip off whitespace from the match. We might be able to modify the regex pattern to do this, but in a general solution, maybe you want to keep some of the whitespace, in certain cases.
Why regular expressions?
It looks like you know the exact delimiting string, just str.split() by it and get the first part:
In [1]: a='172 211 342 15/08/2017 TRANSFER OF LAND $610,000 CASH & MTGE'
In [2]: a.split("15/08/2017", 1)[0]
Out[2]: '172 211 342 '
I would avoid using a single matching regex here, because the only meaningful separation between the logical terms appears to be 2 or more spaces. Individual terms, including the one you want to match, may also contain spaces. So, I recommend doing a regex split on the input using \s{2,} as the pattern. This yields a list containing all the terms. Then we can walk down the list once, and when we find the forward-looking term, return the previous term in the list.
import re
# fields are separated by two or more spaces
a = "172 211 342  15/08/2017  TRANSFER OF LAND  $610,000  CASH & MTGE"
parts = re.compile(r"\s{2,}").split(a)
print(parts)
for i in range(1, len(parts)):
    if (parts[i] == "15/08/2017"):
        print(parts[i-1])
['172 211 342', '15/08/2017', 'TRANSFER OF LAND', '$610,000', 'CASH & MTGE']
172 211 342
Positive lookbehind assertion:
m=re.search('(?<=15/08/2017).*', a)
m.group(0)
You have to return the right group:
re.match("(.*?)15/08/2017",a).group(1)
You need to use group(1):
import re
re.match("(.*?)15/08/2017",a).group(1)
Output
'172 211 342 '
Building on your expression, this is what I believe you need:
import re
a='172 211 342 15/08/2017 TRANSFER OF LAND $610,000 CASH & MTGE'
re.match("(.*?)(\w+/)",a).group(1)
Output:
'172 211 342 '
You can do this by using group(1)
re.match("(.*?)15/08/2017",a).group(1)
UPDATE
For updated string you can use .search instead of .match
re.search("(.*?)15\/08\/2017",a).group(1)
Your problem is that your string is formatted the way it is.
The line you are looking for is
182 246 612 01/10/2018 PHASED OF CASH & MTGE
And then you are looking for whatever comes after 'PHASED OF' and some spaces.
You want to search for
(?<=PHASED OF)\s*(?P<your_text>.*?)\n
in your string. This will return a match object containing the value you are looking for in the named group your_text.
m = re.search(r'(?<=PHASED OF)\s*(?P<your_text>.*?)\n', a)
your_desired_text = m.group('your_text')
Also: There are many good online regex testers to fiddle around with your regexes.
And only after finishing up the regex just copy and paste it into python.
I use this one: https://regex101.com/

How can I use regex in python to extract information from a legal description

I am trying to use regex to extract degrees/minutes/seconds and feet in a legal description for a land parcel. An example of a written legal description would be something like this:
CONT FROM THE PT ON THE NWLY ROW LN OF CO RD NO 31 N 56D 54M 00S W 365
FT TH S 32D 06M 00S W 91/89 FT TH S 61D 54M 00S E 335/77 FT TO THE
NWLY ROW OF SD CO RD NO 31 TH N 32D 06M 00S E 62/62 FT TO THE POB EXC
THAT PART CONVEYED IN BOOK 1132 PAGE 473 0/5900A
I have written a regex that will go through this and find the areas that I am looking for, such as: N 32D 06M 00S E 62/62 FT.
The problem is sometimes the feet are not written directly after the degrees/minutes/seconds. For example it might say instead: N 32D 06M 00S E along the road for 62/62 FT.
The "along the road for" is the part that messes with my regex.
Is there a good way to get around this? Below is an example of my code
Input for user:
legal_input=input("Paste legal description from RW here: ")
Regex code to find cogo:
cogo_rgx = re.compile(r'([N]{,2}[S]{,2} \w{,1}\d{,2}D{,1} \d{,2}M{,1} \d{,2}S{,1}\s{,2}\w) (\s{,2}\d{1,4}\W{,1}\d{,2} FT){,1}')
full_legal=cogo_rgx.findall(legal_input)
Print message:
print("\nCogo below: \n")
Print each match as the DMS bearing followed by the feet (findall returns a list of tuples, one per match). This makes it easier to read:
for key, value in full_legal:
    print(key, value)
Try Regex: ((?:N|S) \d{2}D \d{2}M \d{2}S (?:E|W) )(?:.)*?(?=\d+(?:\/\d+)? FT)(\d+(?:\/\d+)? FT)
and combine capture groups 1 and 2
Demo
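For reference, a minimal sketch of how the two capture groups could be combined in Python. The sample string is a single-line excerpt adapted from the question, with the "along the road for" wording added to one call; the names cogo_rx, legal and calls are only for illustration:

```python
import re

cogo_rx = re.compile(
    r'((?:N|S) \d{2}D \d{2}M \d{2}S (?:E|W) )(?:.)*?'
    r'(?=\d+(?:\/\d+)? FT)(\d+(?:\/\d+)? FT)'
)

# Single-line excerpt adapted from the question (illustrative only).
legal = ("TH S 32D 06M 00S W 91/89 FT TH N 32D 06M 00S E "
         "along the road for 62/62 FT TO THE POB")

# Group 1 is the bearing (with a trailing space), group 2 the distance.
calls = [bearing + distance for bearing, distance in cogo_rx.findall(legal)]
print(calls)
```

Two caveats, as far as I can tell: "." does not match a newline by default, so a description that is wrapped across lines may need normalizing first (for example " ".join(legal_input.split())), and because the filler in the middle is unrestricted, a bearing that has no distance of its own could get paired with the next distance that appears later in the text.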
