Parsing Data using Regex. Split it into columns via groups - python

I want to use REGEX to parse my data into 3 columns
Film data:
Marvel Comics Presents (1988) #125
Spider-Man Legends Vol. II: Todd Mcfarlane Book I (Trade Paperback)
Spider-Man Legends Vol. II: Todd Mcfarlane Book I
Spider-Man Legends Vol. II: Todd Mcfarlane Book I (1998)
Marvel Comics Presents #125
Expected output: [image not available]
I can see how to group it, but can't seem to REGEX it: [image not available]
I built this expression: (.*)\((\d{4})\)(.*)
I essentially want to use the ? quantifier, like this:
(.*)\((\d{4})\)?(.*)
as a way of saying that this group may or may not be there.
However, it's not working.

You could use 3 capture groups, where the last 2 are optional:
^(.*?)(?:\((\d{4})\))?\s*(#\d+)?$
The pattern matches:
^ Start of string
(.*?) Capture group 1
(?:\((\d{4})\))? Optional non capture group capturing 4 digits in group 2
\s* match optional whitespace chars
(#\d+)? Optional group 3, match # and 1+ digits
$ End of string
See a regex101 demo.
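As a minimal sketch of how the pattern could be applied in Python to the sample data from the question: the helper name split_title and the stripping of trailing whitespace from the title are assumptions added for illustration, not part of the answer.

```python
import re

# Pattern from the answer: title, optional (year), optional #issue.
PATTERN = re.compile(r'^(.*?)(?:\((\d{4})\))?\s*(#\d+)?$')

def split_title(line):
    """Return (title, year, issue); year and issue are None when absent."""
    m = PATTERN.match(line)
    if m is None:  # e.g. input with embedded newlines
        return line, None, None
    title, year, issue = m.groups()
    # The lazy group 1 can keep a trailing space before "(year)", so strip it.
    return title.strip(), year, issue

rows = [
    "Marvel Comics Presents (1988) #125",
    "Spider-Man Legends Vol. II: Todd Mcfarlane Book I (Trade Paperback)",
    "Spider-Man Legends Vol. II: Todd Mcfarlane Book I",
    "Spider-Man Legends Vol. II: Todd Mcfarlane Book I (1998)",
    "Marvel Comics Presents #125",
]
for row in rows:
    print(split_title(row))
```

Note that a parenthesised part that is not a 4-digit year, such as (Trade Paperback), stays inside the title, because only \((\d{4})\) is treated as the year column.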

Related

How to remove a specific pattern from re.findall() results

I have a re.findall() searching for a pattern in python, but it returns some undesired results and I want to know how to exclude them. The text is below, I want to get the names, and my statement (re.findall(r'([A-Z]{4,} \w. \w*|[A-Z]{4,} \w*)', text)) is returning this:
'ERIN E. SCHNEIDER',
'MONIQUE C. WINKLER',
'JASON M. HABERMEYER',
'MARC D. KATZ',
'JESSICA W. CHAN',
'RAHUL KOLHATKAR',
'TSPU or taken',
'TSPU or the',
'TSPU only',
'TSPU was',
'TSPU and']
I want to get rid of the "TSPU" pattern items. Does anyone know how to do it?
JINA L. CHOI (NY Bar No. 2699718)
ERIN E. SCHNEIDER (Cal. Bar No. 216114) schneidere#sec.gov
MONIQUE C. WINKLER (Cal. Bar No. 213031) winklerm#sec.gov
JASON M. HABERMEYER (Cal. Bar No. 226607) habermeyerj#sec.gov
MARC D. KATZ (Cal. Bar No. 189534) katzma#sec.gov
JESSICA W. CHAN (Cal. Bar No. 247669) chanjes#sec.gov
RAHUL KOLHATKAR (Cal. Bar No. 261781) kolhatkarr#sec.gov
The Investor Solicitation Process Generally Included a Face-to-Face Meeting, a Technology Demonstration, and a Binder of Materials [...]
You can use
\b(?!TSPU\b)[A-Z]{4,}(?:(?:\s+\w\.)?\s+\w+)?
See this regex demo. Details:
\b - a word boundary (else, the regex may "catch" a part of a word that contains TSPU)
(?!TSPU\b) - a negative lookahead that fails the match if there is TSPU string followed with a non-word char or end of string immediately to the right of the current location
[A-Z]{4,} - four or more uppercase ASCII letters
(?:(?:\s+\w\.)?\s+\w+)? - an optional occurrence of:
(?:\s+\w\.)? - an optional occurrence of one or more whitespaces, a word char and a literal . char
\s+ - one or more whitespaces
\w+ - one or more word chars.
In Python, you can use
re.findall(r'\b(?!TSPU\b)[A-Z]{4,}(?:(?:\s+\w\.)?\s+\w+)?', text)
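A short, self-contained check of the lookahead pattern is sketched below. The sample text is a shortened, partly made-up stand-in for the document in the question (the last line is invented to contain TSPU), so treat it as an illustration only.

```python
import re

# Pattern from the answer: skip any run of capitals that is exactly "TSPU".
NAME_RE = re.compile(r'\b(?!TSPU\b)[A-Z]{4,}(?:(?:\s+\w\.)?\s+\w+)?')

# Hypothetical sample; the final line only exists to exercise the lookahead.
text = (
    "JINA L. CHOI (NY Bar No. 2699718)\n"
    "RAHUL KOLHATKAR (Cal. Bar No. 261781)\n"
    "The TSPU was shown, TSPU only."
)

names = NAME_RE.findall(text)
print(names)
```

Because the pattern has no capturing groups, re.findall returns the full matched strings rather than tuples.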
Alternatively, you can simply filter the list afterwards. If your list is called results:
removed_TSPU_results = list(filter(lambda result: not result.startswith("TSPU"), results))

How should I construct a regex match for a various strings within repeated delimiters?

I have a string formatted as:
GENESIS 1:1 In the beginning God created the heavens ...
the ground. 2:7 And the LORD ...
I buried Leah. 49:32 The purchase of the field and of the cave ...
and he was put in a coffin in Egypt. EXODUS 1:1 Now these are the names ...
Using only one regular expression, I want to match as groups:
the book names
the chapter numbers (as above 1, 2, 49, 1)
the verse numbers (as above 1, 7, 32, 1)
the verses themselves
Take the first as example:
(GENESIS)g1 (1)g2:(1)g3 (In the beginning God created the heavens ...)g4
This requires that I individually match everything within number-pair colons, while retaining my other groups, and with the limitation of fixed length lookaheads / lookbehinds. That last part specifically is what is proving difficult.
My expression up to now is (%(BOOK1)s) (\d+):(\d+)\s?(.+?)\s?(?=\d|%(BOOK2)s|$), where BOOK1 and BOOK2 change as they iterate through a predetermined list. $ appears because the very last book will not have a BOOK2 after it. I call re.finditer() on this expression over the whole string and then I iterate through the match object to produce my groups.
The functional part of my expression is currently (\d+):(\d+)\s?(.+?)\s?(?=\d|%(BOOK2)s|$), but by itself this in effect treats GENESIS as BOOK1 always, and matches everything from just after ^ to whatever BOOK2 may be.
Alternatively, keeping my full expression (%(BOOK1)s) (\d+):(\d+)\s?(.+?)\s?(?=\d|%(BOOK2)s|$) as is will only return the very first desired match.
I get the sense that some of my greedy / non-greedy terms are malformed, or that I could better use leading / trailing expressions. Any feedback will be highly appreciated.
One option is to use the regex module from PyPI, which supports the \G anchor.
Capture group 1 holds the name of the book; groups 2, 3 and 4 hold the chapter number, the verse number and the verse text that follows.
When looping over the results, you can check which groups are present.
\b(?:([A-Z]{2,})(?= \d+:\d)|\G(?!^))(?:(\d+):(\d+))?\s*((?:[^\dA-Z]+|\d++(?!:\d)|[A-Z](?![A-Z]+ \d+:\d))*)
Explanation
\b A word boundary
(?: Non capture group
([A-Z]{2,})(?= \d+:\d) Capture group 1, match 2 or more uppercase chars and assert what is directly at the right is a space, 1+ digits : and a digit
| Or
\G(?!^) Assert the position at the end of the previous match, not at the start
) Close group
(?: Non capture group
(\d+):(\d+) Capture 1 or more digits in group 2 and group 3
)?\s* Close group and make it optional and match optional whitespace chars
( Capture group 4
(?: Non capture group
[^\dA-Z]+ Match 1+ times any char except a digit or A-Z
| Or
\d++(?!:\d) Match 1+ digits in a possessive way and assert what is at the right is not : followed by a digit
| Or
[A-Z](?![A-Z]+ \d+:\d) Match a char A-Z and assert what is directly at the right is not 1+ chars A-Z, space, 1+ digits : and a digit
)* Close group and repeat 0+ times
) Close group 4
Regex demo | Python demo
For example
import regex
pattern = r"\b(?:([A-Z]{2,})(?= \d+:\d)|\G(?!^))(?:(\d+):(\d+))?\s*((?:[^\dA-Z]+|\d++(?!:\d)|[A-Z](?![A-Z]+ \d+:\d))*)"
s = ("GENESIS 1:1 In the beginning God created the heavens ... the ground. 2:7 And the LORD ... I buried Leah. 49:32 The purchase of the field and of the cave ... and he was put in a coffin in Egypt. EXODUS 1:1 Now these are the names ...\n")
matches = regex.finditer(pattern, s)
for matchNum, match in enumerate(matches, start=1):
    if match.group(1):
        print(f"Book name: {match.group(1)}")
        print("------------------------------")
    else:
        print(f"Chapter Nr: {match.group(2)}\nVerse Nr: {match.group(3)}\nThe verse: {match.group(4)}\n")
Output
Book name: GENESIS
------------------------------
Chapter Nr: 1
Verse Nr: 1
The verse: In the beginning God created the heavens ... the ground.
Chapter Nr: 2
Verse Nr: 7
The verse: And the LORD ... I buried Leah.
Chapter Nr: 49
Verse Nr: 32
The verse: The purchase of the field and of the cave ... and he was put in a coffin in Egypt.
Book name: EXODUS
------------------------------
Chapter Nr: 1
Verse Nr: 1
The verse: Now these are the names ...
I came up with a solution in pure python with re. Thanks to the above response, I was able to get on the right track. Turns out that wrench I was trying to throw in by testing LORD 2:8 ... wasn't actually an issue since non-title capitals never occur before digits that way without punctuation between [A-Z] and \d in the full string.
Using the same example with the derived pattern:
import re
pattern = r"(?:([A-Z]{2,})(?= \d+:\d)|(?!^))(?:(\d+):(\d+))?\s*((?:[^\dA-Z]+(?!:\d)|[A-Z](?![A-Z]+ \d+:\d))+)"
s = ("GENESIS 1:1 In the beginning God created the heavens ... the ground. 2:7 And the LORD ... I buried Leah. 49:32 The purchase of the field and of the cave ... and he was put in a coffin in Egypt. EXODUS 1:1 Now these are the names ...\n")
matches = re.finditer(pattern, s)
for matchNum, match in enumerate(matches, start=1):
    if match.group(1):
        print(f"Book name: {match.group(1)}")
        print("------------------------------")
    else:
        print(f"Chapter Nr: {match.group(2)}\nVerse Nr: {match.group(3)}\nThe verse: {match.group(4)}\n")
As with regex, the Output is:
Book name: GENESIS
------------------------------
Chapter Nr: 1
Verse Nr: 1
The verse: In the beginning God created the heavens ... the ground.
Chapter Nr: 2
Verse Nr: 7
The verse: And the LORD ... I buried Leah.
Chapter Nr: 49
Verse Nr: 32
The verse: The purchase of the field and of the cave ... and he was put in a coffin in Egypt.
Book name: EXODUS
------------------------------
Chapter Nr: 1
Verse Nr: 1
The verse: Now these are the names ...

regex template for Ukrainian phone numbers

I am reading 'Automate the boring stuff with python'. Right now, I'm stuck on chapter 7 (Regex part). There is a template for American phone numbers, which I want to implement for Ukrainian phone numbers.
Ukrainian numbers can appear in different formats, such as: +380445371428, +38(044)5371428, +38(044)537 14 28, +38(044)537-14-28, +38(044) 537.14.28, 044.537.14.28, 0445371428, 044-537-1428, (044)537-1428, 044 537-1428, etc.
Below is my implementation, but it's not quite correct. What do I need to change?
When I'm copying some website pages, from all of the info I have copied, I want to extract the Ukrainian number appearing in this (044-537-1428) format.
phoneRegex = re.compile(r'''(
(^\+38?) # area code(not necessarily)
(\d{3}|\(\d{3}\)) # carrier code(usually starts with 0
(\s|-|\.)? # separator
(\d{3}|\(\d{3}\)) # first 3 digits
(\s|-|\.) # separator
(\d{4}) # last 4 digits
(\s*(ext|x|ext.)\s*(\d{2,5}))? # extension
)''', re.VERBOSE)
The template for American numbers (according to the book) looks like the following:
phoneRegex = re.compile(r'''(
(\d{3}|\(\d{3}\))?
(\s|-|\.)
(\d{3})
(\s|-|\.)
(\d{4})
(\s*(ext|x|ext.)\s*(\d{2,5}))?
)''', re.VERBOSE)
Maybe, an option would be to incorporate alternation, based on the types of patterns that we might have, such as:
^(?:\+38)?(?:\(044\)[ .-]?[0-9]{3}[ .-]?[0-9]{2}[ .-]?[0-9]{2}|044[ .-]?[0-9]{3}[ .-]?[0-9]{2}[ .-]?[0-9]{2}|044[0-9]{7})$
Or even more restricted than that, if we'd be validating.
Demo
Test
import re
regex = r'^(?:\+38)?(?:\(044\)[ .-]?[0-9]{3}[ .-]?[0-9]{2}[ .-]?[0-9]{2}|044[ .-]?[0-9]{3}[ .-]?[0-9]{2}[ .-]?[0-9]{2}|044[0-9]{7})$'
string = '''
+380445371428
+38(044)5371428
+38(044)537 14 28
+38(044)537-14-28
+38(044) 537.14.28
044.537.14.28
0445371428
044-537-1428
(044)537-1428
044 537-1428
+83(044)537 14 28
088 537-1428
'''
print(re.findall(regex, string, re.M))
Output
['+380445371428', '+38(044)5371428', '+38(044)537 14 28',
'+38(044)537-14-28', '+38(044) 537.14.28', '044.537.14.28',
'0445371428', '044-537-1428', '(044)537-1428', '044 537-1428']
If you wish to simplify/modify/explore the expression, it's been explained on the top right panel of regex101.com. If you'd like, you can also watch in this link, how it would match against some sample inputs.
Here is my regex for all Ukrainian numbers:
^\+?3?8?(0[\s\.-]\d{2}[\s\.-]\d{3}[\s\.-]\d{2}[\s\.-]\d{2})$
This allows:
+380 XX XXX XX XX or
+380-XX-XXX-XX-XX
or same without "plus"
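A quick way to see what this pattern accepts is to run it over a few strings. The sample list below is an assumption for illustration; it suggests that the separators are mandatory, so the compact formats from the question (such as +380445371428) are not covered by this pattern.

```python
import re

# Pattern from the answer above, unchanged.
UA_SEPARATED = re.compile(
    r'^\+?3?8?(0[\s\.-]\d{2}[\s\.-]\d{3}[\s\.-]\d{2}[\s\.-]\d{2})$'
)

samples = [
    '+380 44 537 14 28',
    '+380-44-537-14-28',
    '380 44 537 14 28',
    '+380445371428',   # no separators
    '044-537-1428',    # different grouping
]

# Keep only the strings the pattern matches in full.
accepted = [s for s in samples if UA_SEPARATED.match(s)]
print(accepted)
```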
I am not familiar with python but I think following regex would resolve your problem
((\+38)?\(?\d{3}\)?[\s\.-]?(\d{7}|\d{3}[\s\.-]\d{2}[\s\.-]\d{2}|\d{3}-\d{4}))
you can check it working here
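Since the question is about extracting numbers from copied page text, here is a hedged Python sketch of how this pattern might be used for that. The sample sentence is made up. re.finditer with group(0) is used because the pattern contains capturing groups, and re.findall would then return tuples of groups instead of the whole number.

```python
import re

# Pattern from the answer above, unchanged.
PHONE_RE = re.compile(
    r'((\+38)?\(?\d{3}\)?[\s\.-]?(\d{7}|\d{3}[\s\.-]\d{2}[\s\.-]\d{2}|\d{3}-\d{4}))'
)

# Hypothetical text copied from a web page.
text = "Call +38(044)537-14-28 or 044-537-1428 today."

# group(0) is the full match, independent of the inner capturing groups.
found = [m.group(0) for m in PHONE_RE.finditer(text)]
print(found)
```

The pattern is not anchored, so on real pages it may also pick up other digit runs of the same shape; that trade-off is inherent in extraction as opposed to validation.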

joining multiple regular expression for readability

I need to extract dates that can appear in either of the following formats:
mm/dd/yyyy or dd Mon YYYY
A few examples are shown below:
04/20/2009 and 24 Jan 2001
To handle this, I have written the regular expression shown after the sample texts.
A few text scenarios are given below:
txt1 = 'Lithium 0.25 (7/11/77). LFTS wnl. Urine tox neg. Serum tox
+ fluoxetine 500; otherwise neg. TSH 3.28. BUN/Cr: 16/0.83. Lipids unremarkable. B12 363, Folate >20. CBC: 4.9/36/308 Pertinent Medical
Review of Systems Constitutional:'
txt2 = "s The patient is a 44 year old married Caucasian woman,
unemployed Decorator, living with husband and caring for two young
children, who is referred by Capitol Hill Hospital PCP, Dr. Heather
Zubia, for urgent evaluation/treatment till first visit with Dr. Toney
Winkler IN EIGHT WEEKS on 24 Jan 2001."
date = re.findall(r'(?:\b(?<!\.)[\d{0,2}]+)'
'(?:[/-]\d{0,}[/-]\d{2,4}) | (?:\b(?<!\.)[\d{1,2}]+)[th|st|nd]*'
' (?:[Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec][a-z]*) \d{2,4}', txtData)
I am not getting 24 Jan 2001, whereas if I run (?:\b(?<!\.)[\d{1,2}]+)[th|st|nd]* (?:[Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec][a-z]*) \d{2,4} on its own, I do get that output.
Question 1: What is the bug in the above expression?
Question 2: I want to combine both patterns to make the code more readable, since I have to parse other formats as well, so I used join as shown below:
RE1 = '(?:\b(?<!\.)[\d{0,2}]+) (?:[/-]\d{0,}[/-]\d{2,4})'
RE2 = '(?:\b(?<!\.)[\d{1,2}]+)[th|st|nd]* (?:[Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec][a-z]*) \d{2,4}'
regex_all = '|'.join([RE1, RE2])
regex_all = re.compile(regex_all)
date = regex_all.findall(txtData)  # txtData can be either of the strings above
With this, I get NaN as the output for date.
Please suggest what goes wrong when I join the patterns.
Thanks for your help.
Note that it is a very bad idea to join such long patterns when they can also match at the same location within the string. That causes the regex engine to backtrack too much and can lead to slowdowns or even crashes. If there is a way to rewrite the alternations so that they can only match at different locations, or to get rid of them completely, do it.
Besides, you should use grouping constructs (...) to group sequences of patterns, and only use [...] character classes when you need to match specific chars.
Also, your alternatives overlap, so you can combine them easily. See the fixed regex:
\b(?<!\.)\d{1,2}(?:[/-]\d+[/-]|(?:th|st|[nr]d)?\s*(?:(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*))\s*(?:\d{4}|\d{2})\b
See the regex demo.
Details
\b - a word boundary
(?<!\.) - no . immediately to the left of the current location
\d{1,2} - 1 or 2 digits
(?: - start of a non-capturing alternation group:
[/-]\d+[/-] - / or -, 1+ digits, - or /
| - or
(?:th|st|[nr]d)?\s*(?:(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*)) - th, st, nd or rd (optionally), followed with 0+ whitespaces, and then month names
\s* - 0+ whitespaces
(?:\d{4}|\d{2}) - 2 or 4 digits
\b - trailing word boundary.
Another note: if you want to match the date-like strings with two matching delimiters, you will need to capture the first one, and use a backreference to match the second one, see this regex demo. In Python, you would need a re.finditer to get those matches.
See this Python demo:
import re
rx = r"\b(?<!\.)\d{1,2}(?:([/-])\d+\1|(?:th|st|[nr]d)?\s*(?:(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*))\s*(?:\d{4}|\d{2})\b"
s = "Lithium 0.25 (7/11/77). LFTS wnl. Urine tox neg. Serum tox\nfluoxetine 500; otherwise neg. TSH 3.28. BUN/Cr: 16/0.83. Lipids unremarkable. B12 363, Folate >20. CBC: 4.9/36/308 Pertinent Medical\nReview of Systems Constitutional:\n\nThe patient is a 44 year old married Caucasian woman, unemployed Decorator, living with husband and caring for two young children, who is referred by Capitol Hill Hospital PCP, Dr. Heather Zubia, for urgent evaluation/treatment till first visit with Dr. Toney Winkler IN EIGHT WEEKS on 24 Jan 2001"
print([x.group(0) for x in re.finditer(rx, s, re.I)])
# => ['7/11/77', '24 Jan 2001']
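Regarding the readability goal in the question, one possible way to keep a combined pattern maintainable is to assemble it from named raw-string pieces. This is only a sketch: the names DAY, NUMERIC, MONTH, YEAR and DATE_RE are made up for illustration, and the sample text is a shortened version of the question's data.

```python
import re

# Each piece is a raw string, so backslashes reach the re module intact.
DAY = r'\b(?<!\.)\d{1,2}'
NUMERIC = r'[/-]\d+[/-]'   # e.g. the "/11/" in 7/11/77
MONTH = (r'(?:th|st|[nr]d)?\s*'
         r'(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*')
YEAR = r'\s*(?:\d{4}|\d{2})\b'

# Wrap the joined alternatives in a non-capturing group so that "|" only
# separates NUMERIC and MONTH, not the whole pattern.
DATE_RE = re.compile(DAY + '(?:' + '|'.join([NUMERIC, MONTH]) + ')' + YEAR)

text = "Lithium 0.25 (7/11/77). First visit on 24 Jan 2001."
dates = DATE_RE.findall(text)
print(dates)
```

Adding another middle format then means adding one more entry to the joined list, while the shared day prefix and year suffix stay in one place.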
I think your approach is too complicated. I suggest using a combination of a simple regex and strptime().
import re
from datetime import datetime
date_formats = ['%m/%d/%Y', '%d %b %Y']
pattern = re.compile(r'\b(\d\d?/\d\d?/\d{4}|\d\d? \w{3} \d{4})\b')
data = "... your string ..."
for match in re.findall(pattern, data):
    print("Trying to parse '%s'" % match)
    for fmt in date_formats:
        try:
            date = datetime.strptime(match, fmt)
            print(" OK:", date)
            break
        except ValueError:
            # this format does not fit (or the date does not exist); try the next one
            pass
The advantage of this approach, besides a much more manageable regex, is that it won't pick dates that look plausible but do not exist, like 2/29/2001 (whereas 2/29/2004 works).
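The validation effect of strptime() can be checked in isolation. In this small sketch the helper name parse_us_date is hypothetical; it relies only on the documented behaviour that datetime.strptime raises ValueError for a string that does not correspond to a real date.

```python
from datetime import datetime

def parse_us_date(text):
    """Return a datetime for mm/dd/yyyy input, or None if it is not a real date."""
    try:
        return datetime.strptime(text, '%m/%d/%Y')
    except ValueError:
        return None

print(parse_us_date('2/29/2004'))  # 2004 is a leap year, so this parses
print(parse_us_date('2/29/2001'))  # 2001 is not, so this is rejected
```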
r'(?:\b(?<!\.)[\d{0,2}]+)'
'(?:[/-]\d{0,}[/-]\d{2,4}) | (?:\b(?<!\.)[\d{1,2}]+)[th|st|nd]*'
' (?:[Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec][a-z]*) \d{2,4}'
You should use raw strings (r'foo') for every string literal that makes up the pattern, not only the first one. That way backslashes (\) are passed through unchanged to the re library instead of being interpreted as Python string escapes.
[abc|def] is a character class that matches any single character listed between the brackets, while (one|two|three) matches any one of the alternatives (one, two, or three).
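Both points can be seen in a few lines of Python. The inputs are made up for illustration, and the variable names exist only so the results can be inspected.

```python
import re

# A character class matches ONE character out of J, a, n, |, F, e, b.
class_hits = re.findall(r'[Jan|Feb]', 'Feb')
# A group with alternation matches one whole alternative.
group_hits = re.findall(r'(?:Jan|Feb)', 'Feb, Jan')
print(class_hits, group_hits)

# Without the r prefix, '\b' is a backspace character, not a word boundary,
# so the pattern looks for a literal backspace followed by "24".
is_backspace = ('\b' == '\x08')
non_raw_hits = re.findall('\b24', 'on 24 Jan')
raw_hits = re.findall(r'\b24', 'on 24 Jan')
print(is_backspace, non_raw_hits, raw_hits)
```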

Discover identically adjacent strings with regex and python

Consider this text:
...
bedeubedeu France The Provençal name for tripe
bee balmbee balm Bergamot
beechmastbeechmast Beech nut
beech nutbeech nut A small nut from the beech tree,
genus Fagus and Nothofagus, similar in
flavour to a hazelnut but not commonly used.
A flavoursome oil can be extracted from
them. Also called beechmast
beechwheatbeechwheat Buckwheat
beefbeef The meat of the animal known as a cow
(female) or bull (male) (NOTE: The Anglo-
saxon name ‘Ox’ is still used for some of what
were once the less desirable parts e.g. oxtail,
ox liver)
beef bourguignonnebeef bourguignonne See boeuf à la
bourguignonne
...
I would like to parse with python this text and keep only the strings that appear exactly twice and are adjacent. For example an acceptable result should be
bedeu
bee balm
beechmast
beech nut
beechwheat
beef
beef bourguignonne
because the trend is that each string appears adjacent to an identical one, just like this:
bedeubedeu
bee balmbee balm
beechmastbeechmast
beech nutbeech nut
beechwheatbeechwheat
beefbeef
beef bourguignonnebeef bourguignonne
So how can someone search for adjacent and identical strings with a regular expression? I am testing my trials here. Thanks!
You can use the following regex:
(\b.+)\1
See demo
Or, to just match and capture the unique substring part:
(\b.+)(?=\1)
Another demo
The word boundary \b makes sure we only match at the beginning of a word, and then match 1 or more characters other than a newline (in a singleline mode, . will also match a newline), and then with the help of a backreference we match exactly the same sequence of characters that was captured with (\b.+).
When using the version with a (?=\1) look-ahead, the matched text does not contain the duplicate part because look-aheads do not consume text and the match does not contain those chunks.
UPDATE
See Python demo:
#!/usr/bin/env python3
import re
p = re.compile(r'(\b.+)\1')
test_str = "zymezyme Yeast, the origin of the word enzyme, as the first enzymes were extracted from yeast Page 632 Thursday, August 19, 2004 7:50 PM\nabbrühenabbrühen"
for m in p.finditer(test_str):
    # group(1) is the repeated chunk without its duplicate
    print(m.group(1))
Output:
zyme
abbrühen
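The lookahead variant can be tried the same way on a few lines from the question. This is a sketch in Python 3, where str patterns are Unicode-aware by default; the three sample lines are copied from the question's text and the variable names are made up.

```python
import re

# Lookahead variant from the answer: capture the chunk, only peek at its copy.
DOUBLED = re.compile(r'(\b.+)(?=\1)')

text = (
    "bedeubedeu France The Provençal name for tripe\n"
    "bee balmbee balm Bergamot\n"
    "beefbeef The meat of the animal known as a cow"
)

# With a single capturing group, findall returns the group 1 strings.
headwords = DOUBLED.findall(text)
print(headwords)
```

On the full glossary, a definition line could in principle contain its own adjacent repeat starting at a word boundary, so the results are worth a quick check before relying on them.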
