Regular expression optimization

Regular expression optimization - python

I need to optimize this regular expression.
^(.+?)\|[\w\d]+?\s+?(\d\d\/\d\d\/\d\d\d\d\s+?\d\d:\d\d:\d\d\.\d\d\d)[\s\d]+?\s+?(\d+?)\s+?\d+?\s+?(\d+?)$
The input is something like this:
-tpf0q16|856B 11/20/2014 00:00:00.015 0 0 0 0 0 689 14 689 703 702 701 700
I've already replaced all greedy matches with lazy matches, but this didn't help. I've also tried DOTALL, but it didn't help either. I use Python with the re module; I know about re2 but I can't use it :(

The first step is to get rid of the unneeded reluctant (a.k.a. "lazy") quantifiers. According to RegexBuddy, your regex:
^(.+?)\|[\w\d]+?\s+?(\d\d\/\d\d\/\d\d\d\d\s+?\d\d:\d\d:\d\d\.\d\d\d)[\s\d]+?\s+?(\d+?)\s+?\d+?\s+?(\d+?)$
...takes 6425 steps to match your sample string. This one:
^(.+?)\|[\w\d]+\s+(\d\d\/\d\d\/\d\d\d\d\s+\d\d:\d\d:\d\d\.\d\d\d)[\s\d]+\s+(\d+)\s+\d+\s+(\d+)$
...takes 716 steps.
Reluctant quantifiers reduce backtracking by doing more work up front. Your regex wasn't prone to excessive backtracking, so the reluctant quantifiers were adding quite a lot to the workload.
This version brings it down to 237 steps:
^([^|]+)\|\w+\s+(\d\d/\d\d/\d\d\d\d\s+\d\d:\d\d:\d\d\.\d\d\d)(?:\s+\d+)+\s+(\d+)\s+\d+\s+(\d+)$
It also removes some noise, like the backslash before /; and [\w\d], which is exactly the same as \w.
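To check this on your own machine, here is a minimal sketch that runs the three patterns above against the sample line and times them with timeit. The step counts quoted above come from RegexBuddy; Python's re engine counts work differently, so expect the ratios to differ. The dictionary keys ("lazy", "greedy", "tightened") are just labels chosen for this example.

```python
import re
import timeit

sample = "-tpf0q16|856B 11/20/2014 00:00:00.015 0 0 0 0 0 689 14 689 703 702 701 700"

# The three patterns from the answer above, copied unchanged.
patterns = {
    "lazy": r"^(.+?)\|[\w\d]+?\s+?(\d\d\/\d\d\/\d\d\d\d\s+?\d\d:\d\d:\d\d\.\d\d\d)[\s\d]+?\s+?(\d+?)\s+?\d+?\s+?(\d+?)$",
    "greedy": r"^(.+?)\|[\w\d]+\s+(\d\d\/\d\d\/\d\d\d\d\s+\d\d:\d\d:\d\d\.\d\d\d)[\s\d]+\s+(\d+)\s+\d+\s+(\d+)$",
    "tightened": r"^([^|]+)\|\w+\s+(\d\d/\d\d/\d\d\d\d\s+\d\d:\d\d:\d\d\.\d\d\d)(?:\s+\d+)+\s+(\d+)\s+\d+\s+(\d+)$",
}

compiled = {name: re.compile(p) for name, p in patterns.items()}

# All three variants should capture the same four groups.
results = {name: rx.match(sample).groups() for name, rx in compiled.items()}

for name, rx in compiled.items():
    # Time 10,000 matches of the precompiled pattern.
    seconds = timeit.timeit(lambda: rx.match(sample), number=10_000)
    print(f"{name:10s} {seconds:.4f}s {results[name]}")
```

The absolute timings depend on the machine and Python version; the point is only to confirm that the rewritten patterns capture the same groups before comparing their speed.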

A bit more optimized.
>>> import re
>>> s = "-tpf0q16|856B 11/20/2014 00:00:00.015 0 0 0 0 0 689 14 689 703 702 701 700"
>>> re.findall(r'(?m)^([^|]+)\|[\w\d]+?\s+?(\d{2}\/\d{2}\/\d{4}\s+\d{2}:\d{2}:\d{2}\.\d{3})[\s\d]+?(\d+)\s+\d+\s+(\d+?)$', s)
[('-tpf0q16', '11/20/2014 00:00:00.015', '702', '700')]

Related

Using regex to split a column

The regex I am using is \d+-\d+, but I'm not quite sure about how to separate the Roman numbers and how to create a new column with them.
I have this dataset:
Date_Title                       Date  Copies
05-21 I. Don Quixote             1605  252
21-20 IV. Macbeth                1629  987
10-12 ML. To Kill a Mockingbird  1960  478
12 V. Invisible Man              1897  136
Basically, I would like to split the "Date Title", so, when I print a row, I would get this:
('05-21 I', 'I', 'Don Quixote', 1605, 252)
Or
('10-12 ML', 'ML', 'To Kill a Mockingbird',1960, 478)
The first field holds the numbers together with the Roman numeral, the second only the Roman numeral, and the third the name; the fourth and fifth stay the same as in the dataset.

You can use
import pandas as pd
df = pd.DataFrame({'Date_Title':['05-21 I. Don Quixote','21-20 IV. Macbeth','10-12 ML. To Kill a Mockingbird','12 V. Invisible Man'], 'Date':[1605,1629,1960,1897], 'Copies':[252,987,478,136]})
rx = r'^(\d+(?:-\d+)?\s*(M{0,4}(?:C[MD]|D?C{0,3})(?:X[CL]|L?X{0,3})(?:I[XV]|V?I{0,3})))\.\s*(.*)'
df[['NumRoman','Roman','Name']] = df.pop('Date_Title').str.extract(rx)
df = df[['NumRoman','Roman','Name', 'Date', 'Copies']]
>>> df
   NumRoman Roman                   Name  Date  Copies
0   05-21 I     I            Don Quixote  1605     252
1  21-20 IV    IV                Macbeth  1629     987
2  10-12 ML    ML  To Kill a Mockingbird  1960     478
3      12 V     V          Invisible Man  1897     136
Details:
^ - start of string
(\d+(?:-\d+)?\s*(M{0,4}(?:C[MD]|D?C{0,3})(?:X[CL]|L?X{0,3})(?:I[XV]|V?I{0,3}))) - Group 1 ("NumRoman"):
    \d+(?:-\d+)? - one or more digits followed by an optional sequence of a - and one or more digits
    \s* - zero or more whitespaces
    (M{0,4}(?:C[MD]|D?C{0,3})(?:X[CL]|L?X{0,3})(?:I[XV]|V?I{0,3})) - Group 2 ("Roman"): see How do you match only valid roman numerals with a regular expression? for explanation
\. - a dot
\s* - zero or more whitespaces
(.*) - Group 3 ("Name"): any zero or more chars other than line break chars, as many as possible
Note df.pop('Date_Title') removes the Date_Title column and yields it as input for the extract method. df = df[['NumRoman','Roman','Name', 'Date', 'Copies']] is necessary if you need to keep the original column order.
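If you want to try the pattern without pandas, the following sketch applies the same regular expression to the four sample titles using only the re module. The variable names titles and rows are made up for this illustration; the pattern itself is the one from the answer, split over three string literals for readability.

```python
import re

# Same pattern as in the answer, split across lines for readability.
rx = re.compile(
    r'^(\d+(?:-\d+)?\s*'                                                # number part, start of group 1
    r'(M{0,4}(?:C[MD]|D?C{0,3})(?:X[CL]|L?X{0,3})(?:I[XV]|V?I{0,3})))'  # group 2: Roman numeral
    r'\.\s*(.*)'                                                        # dot, spaces, group 3: name
)

titles = [
    '05-21 I. Don Quixote',
    '21-20 IV. Macbeth',
    '10-12 ML. To Kill a Mockingbird',
    '12 V. Invisible Man',
]

# Each match yields (NumRoman, Roman, Name).
rows = [rx.match(title).groups() for title in titles]
for row in rows:
    print(row)
```

Note that the Roman numeral group can match an empty string, so a title such as "12. Something" would still match, with an empty Roman field.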

I am pretty sure there is a more optimal solution, but this would be a fast way of solving it:
df['Date_Title'] = df['Date_Title'].apply(lambda x: (x.split()[0], x.split()[1], ' '.join(x.split()[2:])))
Or, with the vectorized string methods (this produces three Series, so assign them to three columns; the column names here are only illustrative, and Roman keeps its trailing dot):
parts = df['Date_Title'].str.split()
df['Num'], df['Roman'], df['Name'] = parts.str[0], parts.str[1], parts.str[2:].str.join(' ')

Focusing on the string split:
string = "21-20 IV. Macbeth"
i = string.index(".") # Finds the first point
date, roman = string[:i].split() # 21-20, IV
title = string[i+2:] # Macbeth

df = df.assign(
    x=df['Date_Title'].str.split(r'\.').str[0],
    y=df['Date_Title'].str.extract(r'(\w+(?=\.))'),
    z=df['Date_Title'].str.split(r'\.').str[1:].str.join(','),
)

Regex expression to find strings between two strings in Python

I am trying to write a regular expression which returns a string which is between two other strings. For example: I want to get the string along with spaces which resides between the strings "15/08/2017" and "$610,000"
a='172 211 342 15/08/2017 TRANSFER OF LAND $610,000 CASH & MTGE'
should return
"TRANSFER OF LAND"
Here is the expression I have pieced together so far:
re.search(r'15/08/2017(.*?)$610,000', a).group(1)
It doesn't return any matches. I think it is because we also need to consider spaces in the expression. Is there a way to find strings between two strings ignoring the spaces?

Use Regex Lookbehind & Lookahead
Ex:
import re
a='172 211 342 15/08/2017 TRANSFER OF LAND $610,000 CASH & MTGE'
print(re.search(r"(?<=15/08/2017).*?(?=\$610,000)", a).group().strip())
Output:
TRANSFER OF LAND

>>> re.search(r'15/08/2017(.*)\$610,000',a).group(1)
' TRANSFER OF LAND '
Since $ is a regex metacharacter (standing for the end of a logical line), you need to escape it to use as a literal '$'.
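When the two delimiters are arbitrary literal strings, escaping them by hand is easy to get wrong. A minimal sketch of one way around that is to let re.escape do the escaping. The helper name between is made up for this example and is not part of the re module.

```python
import re

def between(text, start, end):
    """Return the text between two literal delimiters, or None if not found."""
    # re.escape turns metacharacters such as '$' into literals.
    pattern = re.escape(start) + r"\s*(.*?)\s*" + re.escape(end)
    m = re.search(pattern, text)
    return m.group(1) if m else None

a = '172 211 342 15/08/2017 TRANSFER OF LAND $610,000 CASH & MTGE'
print(between(a, '15/08/2017', '$610,000'))
```

The \s* on either side of the lazy group keeps the surrounding spaces out of the captured text, so no strip() call is needed afterwards.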

Might be easier to use find:
a = '172 211 342 15/08/2017 TRANSFER OF LAND $610,000 CASH & MTGE'
b = '15/08/2017'
c = '$610,000'
a[a.find(b) + len(b):a.find(c)].strip()
'TRANSFER OF LAND'

Extract string after a multiline string using python regex [duplicate]

I am trying to write a regular expression which returns a part of substring which is after a string. For example: I want to get part of substring along with spaces which resides after "15/08/2017".
a='''S
LINC SHORT LEGAL TITLE NUMBER
0037 471 661 1720278;16;21 172 211 342
LEGAL DESCRIPTION
PLAN 1720278
BLOCK 16
LOT 21
EXCEPTING THEREOUT ALL MINES AND MINERALS
ESTATE: FEE SIMPLE
ATS REFERENCE: 4;24;54;2;SW
MUNICIPALITY: CITY OF EDMONTON
REFERENCE NUMBER: 172 023 641 +71
----------------------------------------------------------------------------
----
REGISTERED OWNER(S)
REGISTRATION DATE(DMY) DOCUMENT TYPE VALUE CONSIDERATION
---------------------------------------------------------------------------
--
---
172 211 342 15/08/2017 AFFIDAVIT OF CASH & MTGE'''
Is there a way to get 'AFFIDAVIT OF' and 'CASH & MTGE' as separate strings?
Here is the expression I have pieced together so far:
doc = (a.split('15/08/2017', 1)[1]).strip()
'AFFIDAVIT OF CASH & MTGE'

Not a regex based solution. But does the trick.
a='''S
LINC SHORT LEGAL TITLE NUMBER
0037 471 661 1720278;16;21 172 211 342
LEGAL DESCRIPTION
PLAN 1720278
BLOCK 16
LOT 21
EXCEPTING THEREOUT ALL MINES AND MINERALS
ESTATE: FEE SIMPLE
ATS REFERENCE: 4;24;54;2;SW
MUNICIPALITY: CITY OF EDMONTON
REFERENCE NUMBER: 172 023 641 +71
----------------------------------------------------------------------------
----
REGISTERED OWNER(S)
REGISTRATION DATE(DMY) DOCUMENT TYPE VALUE CONSIDERATION
---------------------------------------------------------------------------
--
---
172 211 342 15/08/2017 AFFIDAVIT OF CASH & MTGE'''
doc = (a.split('15/08/2017', 1)[1]).strip()
# split on two spaces instead of one to get the desired result
# (this assumes the columns in the original text are separated by at least two spaces)
print(doc.split("  ")[0].strip()) # outputs AFFIDAVIT OF
print(doc.split("  ")[-1].strip()) # outputs CASH & MTGE
Hope it helps.

re based code snippet
import re
foo = '''S
LINC SHORT LEGAL TITLE NUMBER
0037 471 661 1720278;16;21 172 211 342
LEGAL DESCRIPTION
PLAN 1720278
BLOCK 16
LOT 21
EXCEPTING THEREOUT ALL MINES AND MINERALS
ESTATE: FEE SIMPLE
ATS REFERENCE: 4;24;54;2;SW
MUNICIPALITY: CITY OF EDMONTON
REFERENCE NUMBER: 172 023 641 +71
----------------------------------------------------------------------------
----
REGISTERED OWNER(S)
REGISTRATION DATE(DMY) DOCUMENT TYPE VALUE CONSIDERATION
---------------------------------------------------------------------------
--
---
172 211 342 15/08/2017 AFFIDAVIT OF CASH & MTGE'''
pattern = r'.*\d{2}/\d{2}/\d{4}\s+(\w+\s+\w+)\s+(\w+\s+.*\s+\w+)'
result = re.findall(pattern, foo, re.MULTILINE)
print("1st match:", result[0][0])
print("2nd match:", result[0][1])
Output
1st match: AFFIDAVIT OF
2nd match: CASH & MTGE

We can try using re.findall with the following pattern:
PHASED OF (((?!\bCONDOMINIUM PLAN).)*)(?=CONDOMINIUM PLAN)
Searching in multiline and DOTALL mode, the above pattern will match everything occurring between PHASED OF until, but not including, CONDOMINIUM PLAN.
input = "182 246 612 01/10/2018 PHASED OF CASH & MTGE\n CONDOMINIUM PLAN"
result = re.findall(r'PHASED OF (((?!\bCONDOMINIUM PLAN).)*)(?=CONDOMINIUM PLAN)', input, re.DOTALL|re.MULTILINE)
output = result[0][0].strip()
print(output)
CASH & MTGE
Note that I also strip off whitespace from the match. We might be able to modify the regex pattern to do this, but in a general solution, maybe you want to keep some of the whitespace, in certain cases.

Why regular expressions?
It looks like you know the exact delimiting string, just str.split() by it and get the first part:
In [1]: a='172 211 342 15/08/2017 TRANSFER OF LAND $610,000 CASH & MTGE'
In [2]: a.split("15/08/2017", 1)[0]
Out[2]: '172 211 342 '

I would avoid matching the target directly with a regex here, because the only meaningful separator between the logical terms appears to be two or more spaces, and individual terms, including the one you want to match, may themselves contain single spaces. So I recommend a regex split on the input using \s{2,} as the pattern. This yields a list containing all the terms. Then we can walk down the list once, and when we find the forward-looking term, we can return the previous term in the list.
import re
a = "172 211 342  15/08/2017  TRANSFER OF LAND  $610,000  CASH & MTGE"
parts = re.compile(r"\s{2,}").split(a)
print(parts)
for i in range(1, len(parts)):
    if parts[i] == "15/08/2017":
        print(parts[i-1])
['172 211 342', '15/08/2017', 'TRANSFER OF LAND', '$610,000', 'CASH & MTGE']
172 211 342
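The same split can be wrapped in a small helper that returns the terms on both sides of a marker. This is only a sketch: the function name neighbours is invented for the example, and it assumes the fields in the real data are separated by at least two spaces, as in the answer above.

```python
import re

def neighbours(line, marker):
    """Return (term before marker, term after marker); None where there is none."""
    # Fields are assumed to be separated by two or more whitespace characters.
    parts = re.split(r"\s{2,}", line.strip())
    i = parts.index(marker)  # raises ValueError if the marker is absent
    before = parts[i - 1] if i > 0 else None
    after = parts[i + 1] if i + 1 < len(parts) else None
    return before, after

row = "172 211 342  15/08/2017  TRANSFER OF LAND  $610,000  CASH & MTGE"
print(neighbours(row, "15/08/2017"))
```

Because the split happens once, looking up the term after the date (the document type) costs nothing extra compared with looking up the term before it.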

Use a positive lookbehind assertion:
m=re.search('(?<=15/08/2017).*', a)
m.group(0)

You have to return the right group:
re.match("(.*?)15/08/2017",a).group(1)

You need to use group(1)
import re
re.match("(.*?)15/08/2017",a).group(1)
Output
'172 211 342 '

Building on your expression, this is what I believe you need:
import re
a='172 211 342 15/08/2017 TRANSFER OF LAND $610,000 CASH & MTGE'
re.match("(.*?)(\w+/)",a).group(1)
Output:
'172 211 342 '

You can do this by using group(1)
re.match("(.*?)15/08/2017",a).group(1)
UPDATE
For updated string you can use .search instead of .match
re.search("(.*?)15\/08\/2017",a).group(1)

Your problem is that your string is formatted the way it is.
The line you are looking for is
182 246 612 01/10/2018 PHASED OF CASH & MTGE
And then you are looking for whatever comes after 'PHASED OF' and some spaces.
You want to search for
(?<=PHASED OF)\s*(?P<your_text>.*?)\n
in your string. This will return a match object containing the value you are looking for in the named group your_text.
m = re.search(r'(?<=PHASED OF)\s*(?P<your_text>.*?)\n', a)
your_desired_text = m.group('your_text')
Also: There are many good online regex testers to fiddle around with your regexes.
And only after finishing up the regex just copy and paste it into python.
I use this one: https://regex101.com/

How to split a multi-line string using regular expressions?

I have been banging my beginner head for most of the day trying various things.
Here is the string
1 default active Eth2/45, Eth2/46, Eth2/47
Eth3/41, Eth3/42, Eth3/43
Eth4/41, Eth4/42, Eth4/43
47 Production active Po1, Po21, Po23, Po25, Po101
Po102, Eth2/1, Eth2/2, Eth2/3
Eth2/4, Eth3/29, Eth3/30
Eth3/31, Eth3/32, Eth3/33
Eth3/34, Eth3/35, Eth3/36
Eth3/37, Eth3/38, Eth3/39
Eth3/40, Eth3/44, Eth4/29
Eth4/30, Eth4/31, Eth4/32
Eth4/33, Eth4/34, Eth4/35
Eth4/36, Eth4/37, Eth4/38
Eth4/39, Eth4/40, Eth4/44
128 Test active Po1, Eth1/13, Eth2/1, Eth2/2
Eth2/3, Eth2/4
129 Backup active Po1, Eth1/14, Eth2/1, Eth2/2
Eth2/3, Eth2/4
What I need is to split like below. I have tried to use regex101.com to simulate various regex but I did not have much luck. I managed to isolate the delimiters with (\n\d+) and then I wanted to use lookbehind but I got an error saying that I need fixed string length.
Here is a link to the regex101 section:
1 default active Eth2/45, Eth2/46, Eth2/47
Eth3/41, Eth3/42, Eth3/43
Eth4/41, Eth4/42, Eth4/43
47 VLAN047 active Po1, Po21, Po23, Po25, Po101
Po102, Eth2/1, Eth2/2, Eth2/3
Eth2/4, Eth3/29, Eth3/30
Eth3/31, Eth3/32, Eth3/33
Eth3/34, Eth3/35, Eth3/36
Eth3/37, Eth3/38, Eth3/39
Eth3/40, Eth3/44, Eth4/29
Eth4/30, Eth4/31, Eth4/32
Eth4/33, Eth4/34, Eth4/35
Eth4/36, Eth4/37, Eth4/38
Eth4/39, Eth4/40, Eth4/44
128 Rogers-Refresh-MGT active Po1, Eth1/13, Eth2/1, Eth2/2
Eth2/3, Eth2/4
129 ManagementSegtNorthW active Po1, Eth1/14, Eth2/1, Eth2/2
Eth2/3, Eth2/4
Update: I updated the regex101 example, but it is not selecting what I want. The Python code works. I wonder what the problem with regex101 is.

That's pretty simple - use lookahead instead of lookbehind:
parsed = re.split(r'\n(?=\d)', data)
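As a quick check of the lookahead split, here is a small sketch on a shortened version of the VLAN table. The data string is abbreviated from the question for illustration, and it assumes continuation lines never start with a digit.

```python
import re

# Shortened sample: each section starts with a line beginning with a VLAN number.
data = (
    "1 default active Eth2/45, Eth2/46, Eth2/47\n"
    "Eth3/41, Eth3/42, Eth3/43\n"
    "47 Production active Po1, Po21, Po23\n"
    "Po102, Eth2/1, Eth2/2\n"
    "128 Test active Po1, Eth1/13\n"
)

# Split at every newline that is followed by a digit; the lookahead
# does not consume the digit, so it stays at the start of the next section.
sections = re.split(r"\n(?=\d)", data.strip())
for section in sections:
    print(repr(section))
```

A lookahead has no fixed-width restriction in Python's re module, which is why it works here where the lookbehind attempt did not.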

In Python there is always more than one way to skin a cat. Multiline regexes are usually very hard. The following is a lot simpler and, more importantly, readable:
sections = []
section = []
for line in data.split("\n"):
    if line[:1].isdigit():  # a leading digit starts a new section
        if section:
            sections.append("\n".join(section))
        section = []
    section.append(line)
sections.append("\n".join(section)) # grab the last one
print(sections)
Performance-wise, I think this would probably be better, because we are not looking for a pattern in the entire string; we are only looking at the first character of each line.

Extract phone numbers from email using python 2.7 regex

I'm trying to extract the phone numbers from many files of emails. I wrote regex code to extract them but I got the results for just one format.
PHONERX = re.compile("(\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]??\d{4}|\d{3}[-\.\s]??\d{4})")
phonenumber = re.findall(PHONERX,content)
When I reviewed the data, I found there were many formats for phone numbers.
How can I extract all the phone numbers in these formats together:
800-569-0123
1-866-523-4176
(324)442-9843
(212) 332-1200
713/853-5620
713 853-0357
713 837 1749
This link is a sample of the dataset. The problem is that the regex sometimes also extracts digits from the messageId and from other numbers in the email.
https://www.dropbox.com/sh/pw2yfesim4ejncf/AADwdWpJJTuxaJTPfha38OdRa?dl=0

You may want to use:
\(?(?:1-)?\b[2-9][0-9]{2}\)?[-. \/]?[2-9][0-9]{2}[-. ]?[0-9]{4}\b
Which will match all your examples + ignore false positives, like:
113 837 1749
222 2222 22222
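A short sketch to verify that claim against the formats listed in the question. The names PHONE_RX, wanted, rejected and found are chosen for this example; the pattern is the one given above.

```python
import re

PHONE_RX = re.compile(
    r"\(?(?:1-)?\b[2-9][0-9]{2}\)?[-. \/]?[2-9][0-9]{2}[-. ]?[0-9]{4}\b"
)

# Formats from the question that should be extracted in full.
wanted = [
    "800-569-0123",
    "1-866-523-4176",
    "(324)442-9843",
    "(212) 332-1200",
    "713/853-5620",
    "713 853-0357",
    "713 837 1749",
]
# Strings from the answer that should produce no match.
rejected = ["113 837 1749", "222 2222 22222"]

# The pattern has no capturing groups, so findall returns whole matches.
found = {s: PHONE_RX.findall(s) for s in wanted + rejected}
for s, matches in found.items():
    print(f"{s!r:20} -> {matches}")
```

The pattern requires the area code and exchange to start with 2-9, so it will also skip real numbers written in other national formats; treat it as tuned to the North American examples above.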

You don't need to include all the possibilities using a logical OR. You can use following regex:
(?:\(\d+\)\s?\d*|\d+)([-\/ ]\d+){1,3}
For use with re.findall(), make the repeated group non-capturing, so that whole matches are returned instead of the last captured group:
(?:\(\d+\)\s?\d*|\d+)(?:[-\/ ]\d+){1,3}
