Split String based on multiple Regex matches - python

First of all, I checked these previous posts, and did not help me. 1 & 2 & 3
I have this string (or a similar case could be) that need to be handled with regex:
"Text Table 6-2: Management of children study and actions"
What I am supposed to do is detect the word Table and the word(s) before if existed
detect the numbers following and they can be in this format: 6 or 6-2 or 66-22 or 66-2
Finally the rest of the string (in this case: Management of children study and actions)
After doing so, the return value must be like this:
return 1 and 2 as one string, the rest as another string
e.g. returned value must look like this: Text Table 6-2, Management of children study and actions
Below is my code:
mystr = "Text Table 6-2: Management of children study and actions"
if re.match("([a-zA-Z0-9]+[ ])?(figure|list|table|Figure|List|Table)[ ][0-9]([-][0-9]+)?", mystr):
print("True matched")
parts_of_title = re.search("([a-zA-Z0-9]+[ ])?(figure|list|table|Figure|List|Table)[ ][0-9]([-][0-9]+)?", mystr)
print(parts_of_title)
print(" ".join(parts_of_title.group().split()[0:3]), parts_of_title.group().split()[-1])
The first requirement is returned true as should be but the second doesn't so, I changed the code and used compile but the regex functionality changed, the code is like this:
mystr = "Text Table 6-2: Management of children study and actions"
if re.match("([a-zA-Z0-9]+[ ])?(figure|list|table|Figure|List|Table)[ ][0-9]([-][0-9]+)?", mystr):
print("True matched")
parts_of_title = re.compile("([a-zA-Z0-9]+[ ])?(figure|list|table|Figure|List|Table)[ ][0-9]([-][0-9]+)?").split(mystr)
print(parts_of_title)
Output:
True matched
['', 'Text ', 'Table', '-2', ':\tManagement of children study and actions']
So based on this, how I can achieve this and stick to a clean and readable code? and why does using compile change the matching?

The matching changes because:
In the first part, you call .group().split() where .group() returns the full match which is a string.
In the second part, you call re.compile("...").split() where re.compile returns a regular expression object.
In the pattern, this part will match only a single word [a-zA-Z0-9]+[ ], and if this part should be in a capture group [0-9]([-][0-9]+)? the first (single) digit is currently not part of the capture group.
You could write the pattern writing 4 capture groups:
^(.*? )?((?:[Ll]ist|[Tt]able|[Ff]igure))\s+(\d+(?:-\d+)?):\s+(.+)
See a regex demo.
import re
pattern = r"^(.*? )?((?:[Ll]ist|[Tt]able|[Ff]igure))\s+(\d+(?:-\d+)?):\s+(.+)"
s = "Text Table 6-2: Management of children study and actions"
m = re.match(pattern, s)
if m:
print(m.groups())
Output
('Text ', 'Table', '6-2', 'Management of children study and actions')
If you want point 1 and 2 as one string, then you can use 2 capture groups instead.
^((?:.*? )?(?:[Ll]ist|[Tt]able|[Ff]igure)\s+\d+(?:-\d+)?):\s+(.+)
Regex demo
The output will be
('Text Table 6-2', 'Management of children study and actions')

you have already had answers but I wanted to try your problem to train myself so I give you all the same what I found if you are interested:
((?:[a-zA-Z0-9]+)? ?(?:[Ll]ist|[Tt]able|[Ff]igure)).*?((?:[0-9]+\-[0-9]+)|(?<!-)[0-9]+): (.*)
And here is the link to my tests: https://regex101.com/r/7VpPM2/1

Related

Why doesn't this pattern with a negative look-ahead restrict these overrides of the re.sub() function?

Using a negative look-ahead X(?!Y), check that it is NOT ahead of the match, the goal is to identify substrings "they" that are not ahead some sequence ((PERS)the , and if there aren't any then than replace that substring "they" with the string "((PERS)they NO DATA) ". Otherwise you should not make any replacement.
import re
# Example 1 :
input_text = "They are great friends, the cellos of they, soon they became best friends. They saw each other in the park before taking the old cabinets, since ((PERS)the printers) were still useful to the company themselves. They are somewhat worse than the new models."
# Example 2 :
input_text = "They finished the flow chart pretty quickly"
input_text = re.sub(r"\(\(PERS\)\s*the", "((PERS)the", input_text)
#constraint_pattern = r"\bthey\b(?<!\(\(PERS\)/s*the)" # --> re.error: look-ahead requires fixed-width pattern
constraint_pattern = r"\bellos\b(?<!\(\(PERS\)the)"
input_text = re.sub(constraint_pattern,
"((PERS)they NO DATA)",
input_text, flags = re.IGNORECASE)
print(input_text) # --> output
Using this code, for some reason all occurrences of the "ellos" substring are replaced by "((PERS)they NO DATA)", but really only "they" substrings that are NOT preceded by a sequence "((PERS)the" must be replaced by "((PERS)they NO DATA)"
The goal is really to get this output:
#correct output for example 1
"((PERS)they NO DATA) are great friends, the cellos of ((PERS)they NO DATA), soon ((PERS)they NO DATA) became best friends. ((PERS)they NO DATA) saw each other in the park before taking the old cabinets, since ((PERS)the printers) were still useful to the company themselves. They are somewhat worse than the new models."
#correct output for example 2
"((PERS)they NO DATA) finished the flow chart pretty quickly"

Regex not specific enough

So I wrote a program for my Kindle e-reader that searches my highlights and deletes repetitive text (it's usually information about the book title, author, page number, etc.). I thought it was functional but sometimes there would random be periods (.) on certain lines of the output. At first I thought the program was just buggy but then I realized that the regex I'm using to match the books title and author was also matching any sentence that ended in brackets.
This is the code for the regex that I'm using to detect the books title and author
titleRegex = re.compile('(.+)\((.+)\)')
Example
Desired book title and author match: Book title (Author name)
What would also get matched: *I like apples because they are green (they are sometimes red as well). *
In this case it would delete everything and leave just the period at the end of the sentence. This is obviously not ideal because it deletes the text I highlighted
Here is the unformatted text file that goes into my program
The program works by finding all of the matches for the regexes I wrote, looping through those matches and one by one replacing them with empty strings.
Would there be any ways to make my title regex more specific so that it only picks up author titles and not full sentences that end in brackets? If not, what steps would I have to take to restructure this program?
I've attached my code to the bottom of this post. I would greatly appreciate any help as I'm a total coding newbie. Thanks :)
import re
titleRegex = re.compile('(.+)\((.+)\)')
titleRegex2 = re.compile(r'\ufeff (.+)\((.+)\)')
infoRegex = re.compile(r'(.) ([a-zA-Z]+) (Highlight|Bookmark|Note) ([a-zA-Z]+) ([a-zA-Z]+) ([0-9]+) (\|)')
locationRegex = re.compile(r' Location (\d+)(-\d+)? (\|)')
dateRegex = re.compile(r'([a-zA-Z]+) ([a-zA-Z]+) ([a-zA-Z]+), ([a-zA-Z]+) ([0-9]+), ([0-9]+)')
timeRegex = re.compile(r'([0-9]+):([0-9]+):([0-9]+) (AM|PM)')
newlineRegex = re.compile(r'\n')
sepRegex = re.compile('==========')
regexList = [titleRegex, titleRegex2, infoRegex, locationRegex, dateRegex, timeRegex, sepRegex, newlineRegex]
string = open("/Users/devinnagami/myclippings.txt").read()
for x in range(len(regexList)):
newString = re.sub(regexList[x], ' ', string)
string = newString
finalText = newString.split(' ')
with open('booknotes.txt', 'w') as f:
for item in finalText:
f.write('%s\n' % item)
There isn't enough information to tell if "Book title (Book Author)" is different than something like "I like Books (Good Ones)" without context. Thankfully, the text you showed has plenty of context. Instead of creating several different regular expressions, you can combine them into one expression to encode that context.
For instance:
quoteInfoRegex = re.compile(
r"^=+\n(?P<title>.*?) \((?P<author>.*?)\)\n" +
r"- Your Highlight on page (?P<page>[\d]+) \| Location (?P<location>[\d-]+) \| Added on (?P<added>.*?)\n" +
r"\n" +
r"(?P<quote>.*?)\n", flags=re.MULTILINE)
for m in quoteInfoRegex.finditer(data):
print(m.groupdict())
This will pull out each line of the text, and parse it, knowing that the book title is the first line after the equals, and the quote itself is below that.

Find values using regex (includes brackets)

it's my first time with regex and I have some issues, which hopefully you will help me find answers. Let's give an example of data:
chartData.push({
date: newDate,
visits: 9710,
color: "#016b92",
description: "9710"
});
var newDate = new Date();
newDate.setFullYear(
2007,
10,
1 );
Want I want to retrieve is to get the date which is the last bracket and the corresponding description. I have no idea how to do it with one regex, thus I decided to split it into two.
First part:
I retrieve the value after the description:. This was managed with the following code:[\n\r].*description:\s*([^\n\r]*) The output gives me the result with a quote "9710" but I can fairly say that it's alright and no changes are required.
Second part:
Here it gets tricky. I want to retrieve the values in brackets after the text newDate.setFullYear. Unfortunately, what I managed so far, is to only get values inside brackets. For that, I used the following code \(([^)]*)\) The result is that it picks all 3 brackets in the example:
"{
date: newDate,
visits: 9710,
color: "#016b92",
description: "9710"
}",
"()",
"2007,
10,
1 "
What I am missing is an AND operator for REGEX with would allow me to construct a code allowing retrieval of data in brackets after the specific text.
I could, of course, pick every 3rd result but unfortunately, it doesn't work for the whole dataset.
Does anyone of you know the way how to resolve the second part issue?
Thanks in advance.
You can use the following expression:
res = re.search(r'description: "([^"]+)".*newDate.setFullYear\((.*)\);', text, re.DOTALL)
This will return a regex match object with two groups, that you can fetch using:
res.groups()
The result is then:
('9710', '\n2007,\n10,\n1 ')
You can of course parse these groups in any way you want. For example:
date = res.groups()[1]
[s.strip() for s in date.split(",")]
==>
['2007', '10', '1']
import re
test = r"""
chartData.push({
date: 'newDate',
visits: 9710,
color: "#016b92",
description: "9710"
})
var newDate = new Date()
newDate.setFullYear(
2007,
10,
1);"""
m = re.search(r".*newDate\.setFullYear(\(\n.*\n.*\n.*\));", test, re.DOTALL)
print(m.group(1).rstrip("\n").replace("\n", "").replace(" ", ""))
The result:
(2007,10,1)
The AND part that you are referring to is not really an operator. The pattern matches characters from left to right, so after capturing the values in group 1 you cold match all that comes before you want to capture your values in group 2.
What you could do, is repeat matching all following lines that do not start with newDate.setFullYear(
Then when you do encounter that value, match it and capture in group 2 matching all chars except parenthesis.
\r?\ndescription: "([^"]+)"(?:\r?\n(?!newDate\.setFullYear\().*)*\r?\nnewDate\.setFullYear\(([^()]+)\);
Regex demo | Python demo
Example code
import re
regex = r"\r?\ndescription: \"([^\"]+)\"(?:\r?\n(?!newDate\.setFullYear\().*)*\r?\nnewDate\.setFullYear\(([^()]+)\);"
test_str = ("chartData.push({\n"
"date: newDate,\n"
"visits: 9710,\n"
"color: \"#016b92\",\n"
"description: \"9710\"\n"
"});\n"
"var newDate = new Date();\n"
"newDate.setFullYear(\n"
"2007,\n"
"10,\n"
"1 );")
print (re.findall(regex, test_str))
Output
[('9710', '\n2007,\n10,\n1 ')]
There is another option to get group 1 and the separate digits in group 2 using the Python regex PyPi module
(?:\r?\ndescription: "([^"]+)"(?:\r?\n(?!newDate\.setFullYear\().*)*\r?\nnewDate\.setFullYear\(|\G)\r?\n(\d+),?(?=[^()]*\);)
Regex demo

Using Regex to extract certain phrases but exclude phrases followed by the word "of"

I am basically trying to extract Section references from a long document.
The following code does so quite well:
example1 = 'Sections 21(1), 54(2), 78(1) of Harry Potter'
res = re.search(r'Sections?\W+(\w+)(\(\w+\))?(, (\w+)(\(\w+\))?)*', example1)
res.group(0)
Output: 'Sections 21(1), 54(2), 78(1)'
However, frequently the sections refer to outside books and I would like to either indicate those or exclude them. Generally, the section reference is followed by an "of" if it refers to another book (example below):
example2 = 'Sections 21(1), 54(2), 78(1) of Harry Potter'
So in this case, I would like to exclude these sections because they refer to Harry Potter and not to sections within the document. The following should achieve this but it doesn't work.
example2 = 'Sections 21(1), 54(2), 78(1) of Harry Potter'
res = re.search(r'Sections?(\W+(\w+)(\(\w+\))?(, (\w+)(\(\w+\))?)*)(?!\s+of)', example2)
res.group(0)
Expected output: Sections 21(1), 54(2), 78 --> (?!\s+of) removes the (1) behind 78 but not the entire reference.
You can emulate atomic groups with capturing groups and lookahead:
(?=(?P<section>Sections?\W+(\w+)(\(\w+\))?(, (\w+)(\(\w+\))?)*))(?P=section)(?! of)
Demo
Long story short:
* in positive lookahead you create a capturing group called section that finds a section pattern
* then you match the group contents in (?P=secion)
* then in negative lookahead you check that there is no of following
Here is a really good answer that explains that technique.
This is because after (?!\s+of) fails, it backtracks before optional (\(..\))? which matches because negative lookahead doesn't match.
Atomic group could be used with other regex engines but isn't implemented in python re.
Other solution is to use a possessive quantifier + after ? optional part :
r'Sections?(\W+(\w+)(\(\w+\))?(, (\w+)(\(\w+\))?+)*)(?!\s+of)'
note the + after ?

python, re.search / re.split for phrases which looks like a title, i.e. starting with an uppper case

I have a list of phrases (input by user) I'd like to locate them in a text file, for examples:
titles = ['Blue Team', 'Final Match', 'Best Player',]
text = 'In today Final match, The Best player is Joe from the Blue Team and the second best player is Jack from the Red team.'
1./ I can find all the occurrences of these phrases like so
titre = re.compile(r'(?P<title>%s)' % '|'.join(titles), re.M)
list = [ t for t in titre.split(text) if titre.search(t) ]
(For simplicity, I am assuming a perfect spacing.)
2./ I can also find variants of these phrases e.g. 'Blue team', final Match', 'best player' ... using re.I, if they ever appear in the text.
But I want to restrict to finding only variants of the input phrases with their first letter upper-cased e.g. 'Blue team' in the text, regardless how they were entered as input, e.g. 'bluE tEAm'.
Is it possible to write something to "block" the re.I flag for a portion of a phrase? In pseudo code I imagine generate something like '[B]lue Team|[F]inal Match'.
Note: My primary goal is not, for example, calculating frequency of the input phrases in the text but extracting and analyzing the text fragments between or around them.
I would use re.I and modify the list-comp to:
l = [ t for t in titre.split(text) if titre.search(t) and t[0].isupper() ]
I think regular expressions won't let you specify just a region where the ignore case flag is applicable. However, you can generate a new version of the text in which all the characters have been lower cased, but the first one for every word:
new_text = ' '.join([word[0] + word[1:].lower() for word in text.split()])
This way, a regular expression without the ignore flag will match taking into account the casing only for the first character of each word.
How about modifying the input so that it is in the correct case before you use it in the regular expression?

Categories

Resources