How to capture all content between two captured groups - python

I have a txt file that I converted from a pdf that contains a long list of items. These items have a numbering convention as follows:
[A-Z]{1,2}\d{1,2}\.\d{1,2}\.\d{1,2}
This expression would match something between:
A1.1.1
and
ZZ99.99.99
This works just fine. The issue I am having is that I am trying to capture this in group 1 and everything between each item number (the item description) in group 2.
I also need these returned as a list or an iterable so that, eventually, the contents captured can be exported to an excel spreadsheet.
This is the regex I have currently:
^([A-Z]{1,2}\d{1,2}\.\d{1,2}\.\d{1,2}\s)([\w\W]*?)(?:\n)
Follow this link to find a sample of what I have and the issues I am facing:
Debuggex Demo
Is anyone able to help me figure out how to capture everything between each number no matter how many paragraphs?
Any input would be greatly appreciated, thanks!

You are very close:
import re
s = """
A1.2.1 This is the first paragraph of the description that is being captured by the regex even if the description contains multiple lines of text.ZZ99.99.99
"""
final_data = re.findall(r"[A-Z]{1,2}\d{1,2}\.\d{1,2}\.\d{1,2}(.*?)[A-Z]{1,2}\d{1,2}\.\d{1,2}\.\d{1,2}", s)
Output:
[' This is the first paragraph of the description that is being captured by the regex even if the description contains multiple lines of text.']
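By using (.*?) you can match any text between the letters and numbers as defined by your first regex. Note that . does not match a newline unless the re.DOTALL (re.S) flag is passed, so as written this only captures a description that stays on a single line.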
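The pattern above consumes the following item number as part of each match, so with back-to-back items every second one would be skipped. A minimal sketch of one way around this, assuming each item number starts at the beginning of a line: capture the number in group 1 and the description in group 2, and stop at a lookahead that does not consume the next number. The sample text and the names ITEM, pattern and items are made up for illustration.

```python
import re

# Hypothetical sample: descriptions may span several lines and paragraphs.
text = """A1.1.1 First item description
that spans two lines.

A second paragraph of the first item.
B2.10.3 Second item.
ZZ99.99.99 Last item
with a trailing line.
"""

ITEM = r"[A-Z]{1,2}\d{1,2}\.\d{1,2}\.\d{1,2}"

# re.M lets ^ match at every line start; re.S lets . match newlines.
# The lookahead stops the lazy group at the next item number (or at the
# end of the string) without consuming it.
pattern = re.compile(rf"^({ITEM})\s+(.*?)(?=^{ITEM}\s|\Z)", re.M | re.S)

items = [(number, description.strip()) for number, description in pattern.findall(text)]

for number, description in items:
    print(number, repr(description))
```

The result is a list of (number, description) tuples, which should be straightforward to write out row by row to a spreadsheet.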

Related

Python Regex for Searching pattern in text file

Tags in Sample.txt:
<ServiceRQ>want everything between...</ServiceRQ>
<ServiceRQ xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance>want everything between</ServiceRQ>
..
Can someone please help me with a regex to extract the expected output from a text file? I want to create a regex that finds the above tags.
This is what I have tried: re.search(r"<(.*?)RQ(.*?)>(.*?)</(.*?)RQ>", line), but it is not working properly. I want to search based on the word RQ in the text file.
The expected output should be
1. <ServiceRQ>want everything between</ServiceRQ>
2. <ServiceRQ xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance>want everything between</ServiceRQ>
Try this pattern
regex= r'<\w+RQ.*?>.*?</\w+RQ>'
data=re.findall(regex, line)
The above regex will give output like
['<ServiceRQ>want everything between...</ServiceRQ>', '<ServiceRQ xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance>want everything between</ServiceRQ>']
As Ashish has mentioned, this one gives the tag including the contents.
regex= r'<\w+RQ.*?>.*?</\w+RQ>'
data=re.findall(regex, line)
You can also retrieve just the contents within the tags by changing .*? to (.*?) between the tags.
regex = r'<\w+RQ.*?>(.*?)<\/\w+RQ>'
data = re.findall(regex, sample)
This would result in the following output:
['want everything between...', 'want everything between']
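Both variants can be compared side by side. This is only a sketch: the sample string is made up, it assumes the attribute value is properly closed with a quote, and it uses [^>]* instead of .*? inside the opening tag so the match cannot run past the first >.

```python
import re

# Hypothetical sample; the last tag has no "RQ" suffix and should be ignored.
sample = (
    '<ServiceRQ>want everything between...</ServiceRQ>\n'
    '<ServiceRQ xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">'
    'want everything between</ServiceRQ>\n'
    '<OtherTag>ignored</OtherTag>\n'
)

# Whole element, tags included.
whole_tags = re.findall(r"<\w+RQ\b[^>]*>.*?</\w+RQ>", sample, re.S)

# Only the text between the opening and closing tag (one capture group).
contents = re.findall(r"<\w+RQ\b[^>]*>(.*?)</\w+RQ>", sample, re.S)

print(whole_tags)
print(contents)
```

With a single capture group, re.findall returns the group instead of the whole match, which is why the second call yields only the inner text.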

Unable to fetch multiline line text with Regex

I have scanned a PDF with Tika which contains the text in the following format, having multiple line breaks
Some non Interview text
interview with Mr.XYZ
Question: How are you?
Answer: I am fine.
Question: What do you do?
Answer: Nothing
Some non Interview text
How do I apply a regex here? I can match words and spaces, but the match does not extend across multiple lines. I tried the following regex:
https://regex101.com/r/sekUyT/1
All I want is the interview-related text, which starts with "interview with" and ends when there are no more Question: and Answer: lines.
Use the re.findall function to get all occurrences of a particular text.
match = re.findall(r'interview with \s*?\w+\.\w+', text)
match is a list of occurrences of the matched text. If you only want the names, use r'interview with \s*?(\w+\.\w+)' as the search string.
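The pattern above only returns the "interview with ..." line itself. To grab the whole interview block, one possible sketch is shown below. It assumes every question and answer sits on its own line starting with Question: or Answer:, with optional blank lines in between; the sample text and variable names are for illustration only.

```python
import re

text = """Some non Interview text
interview with Mr.XYZ
Question: How are you?
Answer: I am fine.
Question: What do you do?
Answer: Nothing
Some non Interview text
"""

# Header line, then one or more Question:/Answer: lines.
# re.M makes ^ match at the start of every line; . never crosses a newline,
# so each alternative consumes exactly one line plus its line break(s).
pattern = re.compile(
    r"^interview with .*\n+(?:(?:Question|Answer):.*(?:\n+|\Z))+",
    re.M,
)

blocks = pattern.findall(text)
for block in blocks:
    print(block)
```

The block ends at the first line that does not start with Question: or Answer:, so the surrounding non-interview text is left out.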

Regex for parsing list items

I want to slurp a text file into a list, and then parse each item a bit so I can keep only the text I actually want.
I'm currently using:
import re
# "rU" mode was removed in Python 3.11; plain "r" already handles universal newlines
with open("C:/text.txt", "r") as input:
    lines = [line.rstrip('\n') for line in input]
for line in lines:
    #str(line)
    regex = r"\:\s*\"(.*)\"\s{5}\d?"
    try:
        found = re.search(regex, line).group(1)
    except AttributeError:
        found = 'nah'
    print(found)
But it doesn't work. Always goes to the exception. When applied to a defined string, it works. Is there a difference when dealing with list items?
The text file is structured as such:
Thank you in advance!
It appears from the image you provided that there are 3 whitespace characters between the text and the digits, while your pattern requires exactly 5 (\s{5}).
Without the exact text it is impossible to tell which whitespace characters they are, but there is clearly at least one.
So, you need to change the regex to
r':\s*"(.*)"\s+'
Here, \s+ matches 1 or more whitespaces.
Note that \d? at the end of the pattern is not required if you are not interested in the whole match and only need Group 1 value.
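The difference can be checked on a single line. The line below is a made-up stand-in for the file contents (which are not shown in the question), with 3 spaces between the closing quote and the digit.

```python
import re

# Hypothetical line: 3 spaces between the quoted text and the digit.
line = 'title: "Some text"   7'

# Original pattern: demands exactly 5 whitespace characters, so it fails here.
strict = re.search(r':\s*"(.*)"\s{5}\d?', line)

# Relaxed pattern: one or more whitespace characters.
relaxed = re.search(r':\s*"(.*)"\s+', line)

print(strict)            # no match object
print(relaxed.group(1))  # the quoted text
```

When re.search finds nothing it returns None, and calling .group(1) on None is what raises the AttributeError caught in the original loop.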

Python: remove/filter equals sign from list

Quick question because I'm stuck and cannot seem to get any further.
Here is my problem:
I'm working in a dataset where I'm extracting every section name of a Wikipedia page from an XML dump. I extract the text and from the text, every section is given through:
==Section Name==
However, there are also subsections which I do not want to process and are given through
===Section Name===
Currently I am using a regex to filter the sections from the text (pagetext)
sections = re.findall("==(.*)==", pagetext)
The result, however, is that the subsections are included in my list of sections as well. Question: how can I filter these subsections out of my list so that I only retrieve the sections from the text?
I have used this list comprehension but that does not work
sections = [section for section in sections if section[0] == (r"^=")]
Any help is greatly appreciated:) Many thanks in advance!!
If the surrounding text is completely arbitrary, you might have to resort to negative lookahead and negative lookbehind:
re.findall(r'(?<!=)==(?!=)(.*?)(?<!=)==(?!=)', pagetext)
# (?<!...) only matches if not preceded by ...
# (?!...) only matches if not followed by ...
# (.*?) the captured group itself, anything matched non-greedily
This ensures that the '==' enclosing a section name is neither preceded nor followed by another '='.
Alternatively, enable the multiline flag re.M so that ^ anchors the expression at the beginning of each line, and exclude subsections in your original regex by requiring that the character after the opening '==' is not a third equals sign.
For example:
sections = re.findall("^==([^=].*)==", pagetext, re.M)
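A quick way to compare the lookaround approach and the anchored approach is to run both on a small made-up page. The pagetext below is a hypothetical sample, not real dump data.

```python
import re

# Hypothetical wikitext with two sections and one subsection.
pagetext = """==History==
Some text.
===Early years===
More text.
==Geography==
"""

# Lookarounds: the enclosing '==' may not touch another '='.
lookaround = re.findall(r"(?<!=)==(?!=)(.*?)(?<!=)==(?!=)", pagetext)

# Anchored: '==' at the start of a line, next character is not '='.
anchored = re.findall(r"^==([^=].*)==", pagetext, re.M)

print(lookaround)
print(anchored)
```

On this sample both calls return only the two top-level section names; the subsection is skipped because its delimiter has a third '='.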

Need help in RegEx to grab anything after a mandatory value

I have a text in which I need to grab data and split it up. I need to find "Review frequency" within a large group of text, then once that is found, take everything after it and stop at the ')'.
Example text is:
No. of components Variable
Review frequency Quarterly (Mar., Jun., Sep., Dec.)
Quick facts
To learn more about the
What I need is 'Quarterly' and 'Mar., Jun., Sep., Dec.'
My current regex is:
((?=.*?\bReview frequency\b)(\b(Q|q)uarterly|(A|a)nnually|(S|s)emi-(A|a)nnually))
But this is not working. Essentially the 'Review frequency' needs to be the qualifier before we start picking up the other information, as there may be other dates/periods within the file. Thank you!
You are not matching the rest of the data on the line.
I suggest using:
(?m)^Review frequency[ \t]+(\w+)[ \t]+(.+)
See the regex demo
If the first capturing group can only contain 3 words as indicated in your pattern, use
(?m)^Review frequency[ \t]+([Qq]uarterly|(?:[Ss]emi-)?[Aa]nnually)[ \t]+(.*)
See another regex demo
Use these patterns with re.findall:
import re
regex = r"(?m)^Review frequency[ \t]+([Qq]uarterly|(?:[Ss]emi-)?[Aa]nnually)[ \t]+(.*)"
test = "No. of components Variable\nReview frequency Quarterly (Mar., Jun., Sep., Dec.)\nQuick facts\nTo learn more about the"
print(re.findall(regex, test))
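The patterns above return the second group with its surrounding parentheses. If the months are wanted without them, as in the question, one possible variation (a sketch, assuming the parenthesised list always follows the frequency on the same line) is to match the parentheses literally and capture only what is inside:

```python
import re

text = (
    "No. of components Variable\n"
    "Review frequency Quarterly (Mar., Jun., Sep., Dec.)\n"
    "Quick facts\n"
    "To learn more about the"
)

# Group 1: the frequency word; group 2: everything inside the parentheses.
pattern = re.compile(
    r"^Review frequency[ \t]+([Qq]uarterly|(?:[Ss]emi-)?[Aa]nnually)[ \t]+\(([^)]*)\)",
    re.M,
)

match = pattern.search(text)
if match:
    frequency, months = match.groups()
    print(frequency)
    print(months)
```

Using [^)]* rather than .* keeps the capture from running past the first closing parenthesis if more text follows on the line.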
