Unable to fetch multiline text with regex - Python

I have scanned a PDF with Tika which contains the text in the following format, having multiple line breaks
Some non Interview text
interview with Mr.XYZ
Question: How are you?
Answer: I am fine.
Question: What do you do?
Answer: Nothing
Some non Interview text
How do I apply a regex to this? I can match words and spaces, but the match does not extend across multiple lines. I tried the following regex:
https://regex101.com/r/sekUyT/1
All I want is the interview-related text, which starts with "interview with" and ends when there are no more Question: and Answer: lines.

Use the re.findall function to get all the occurrences of a particular text.
match = re.findall(r'interview with \s*?\w+.\w+', text)
match is a list of occurrences of the matched text. If you only want the names, use r'interview with \s*?(\w+.\w+)' as the search string.
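The pattern above only returns the "interview with ..." line itself. To also pick up the Question:/Answer: lines that follow, one possible sketch is shown below. It assumes the extracted text has one Question: or Answer: entry per line (possibly separated by blank lines); the sample string and variable names are illustrative, not part of the original post.

```python
import re

# Sample text modelled on the format described in the question.
text = """Some non Interview text
interview with Mr.XYZ
Question: How are you?
Answer: I am fine.
Question: What do you do?
Answer: Nothing
Some non Interview text
"""

# re.MULTILINE makes ^ match at the start of every line.
# The header line is followed by one or more lines that start with
# "Question:" or "Answer:"; \n* also tolerates blank lines between them.
pattern = re.compile(
    r"^interview with .*\n+(?:(?:Question|Answer):.*\n*)+",
    re.MULTILINE,
)

# The pattern has no capturing groups, so findall returns whole matches.
blocks = [block.strip() for block in pattern.findall(text)]

for block in blocks:
    print(block)
```

The block ends at the first line that does not start with Question: or Answer:, which is the end condition the question describes.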

Related

regular expression in python to extract words that contain # but not dot [duplicate]

I am using Python to extract words from a text. I want to extract words that contain # but not a dot.
The regular expression should match the following: #bob, cat#bob
The regular expression should not match: xyz#bob.com
I tried the following: (?:\w+)?#\w+(?!\.) - but it extracts #bob, cat#bob and xyz#bo.
Just to elaborate, if I have the text "hi #bob and cat#bob my email is xyz#bob.com", I want to extract only #bob and cat#bob from it. My regular expression above extracts part of xyz#bob.com (precisely, it extracts xyz#bo). How can I avoid matching xyz#bob.com at all?
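I was finally able to find a solution. The following expression worked for me: (?:\w+)?#\w+\b(?!\.)
The \b prevents the match from backtracking to a position in the middle of a word, and the dot in the lookahead has to be escaped so that it means a literal dot rather than any character.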
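A quick way to check the expression with the escaped dot against the sample sentence from the question (the variable names here are illustrative):

```python
import re

text = "hi #bob and cat#bob my email is xyz#bob.com"

# \b forces the match to end at a word boundary, so the engine cannot
# backtrack to "xyz#bo"; (?!\.) then rejects matches followed by a dot.
pattern = re.compile(r"(?:\w+)?#\w+\b(?!\.)")

# Only a non-capturing group is used, so findall returns whole matches.
matches = pattern.findall(text)
print(matches)
```

Note that a sentence-ending period directly after a tag (for example "#bob.") would also be rejected by this pattern, which may or may not be what you want.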

Regular expression to extract info from HTML file

I would like to use a regular expression to extract the following text from an HTML file: ">ABCDE</A></td><td>
I need to extract: ABCDE
Could anybody please help me with the regular expression that I should use?
Building on this answer, https://stackoverflow.com/a/40908001/11450166:
(?<=(<A>))[A-Za-z]+(?=(<\/A>))
That expression works if your opening tag is exactly <A>.
This other one matches the form of your input:
(?<=(>))[A-Za-z]+(?=(<\/A>))
You can try using this regular expression in your specific example:
/">(.*)<\/A><\/td><td>/g
Tested on string:
Lorem ipsum">ABCDE</A></td><td>Lorem ipsum<td></td>Lorem ipsum
extracts:
">ABCDE</A></td><td>
Then it's just a matter of extracting the substring from each match in whatever programming language you use. This can be done by removing the first 2 characters and the last 13 characters of each matched string, which leaves only ABCDE.
I also tried:
/">([^<]*)<\/A><\/td><td>/g
It has the same effect, but it won't match when there is additional HTML between "> and </A>. ([^<]*) is a negated character class that matches any character except <, so it cannot run across other tags in that region. This gives finer control when you are searching for specific text and need to exclude nested HTML.
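In Python, the capturing group makes the manual trimming unnecessary, since re.findall returns only the group's content. A minimal sketch using the test string from above (the variable names are illustrative):

```python
import re

html = 'Lorem ipsum">ABCDE</A></td><td>Lorem ipsum<td></td>Lorem ipsum'

# With exactly one capturing group, findall returns the group's text
# rather than the whole match, so no characters need to be stripped.
values = re.findall(r'">([^<]*)</A></td><td>', html)
print(values)
```

For anything beyond a fixed, known snippet like this one, an HTML parser is usually more robust than a regular expression.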

How to capture all content between two captured groups

I have a txt file that I converted from a pdf that contains a long list of items. These items have a numbering convention as follows:
[A-Z]{1,2}\d{1,2}\.\d{1,2}\.\d{1,2}
This expression would match something between:
A1.1.1
and
ZZ99.99.99
This works just fine. The issue I am having is that I am trying to capture this in group 1 and everything between each item number (the item description) in group 2.
I also need these returned as a list or an iterable so that, eventually, the contents captured can be exported to an excel spreadsheet.
This is the regex I have currently:
^([A-Z]{1,2}\d{1,2}\.\d{1,2}\.\d{1,2}\s)([\w\W]*?)(?:\n)
Follow this link to find a sample of what I have and the issues I am facing:
Debuggex Demo
Is anyone able to help me figure out how to capture everything between each number no matter how many paragraphs?
Any input would be greatly appreciated, thanks!
You are very close:
import re
s = """
A1.2.1 This is the first paragraph of the description that is being captured by the regex even if the description contains multiple lines of text.ZZ99.99.99
"""
final_data = re.findall(r"[A-Z]{1,2}\d{1,2}\.\d{1,2}\.\d{1,2}(.*?)[A-Z]{1,2}\d{1,2}\.\d{1,2}\.\d{1,2}", s)
Output:
[' This is the first paragraph of the description that is being captured by the regex even if the description contains multiple lines of text.']
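By using (.*?) you can lazily match any text between two item numbers as defined by your first regex. Note that . does not match newlines by default, so descriptions that span several lines also need the re.DOTALL flag.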
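To get both the item number (group 1) and its description (group 2) for every item, one option is to stop each description with a lookahead instead of consuming the next item number. The following is a sketch under the assumption that every item number starts at the beginning of a line; the sample text and the names ITEM, pattern and items are made up for illustration.

```python
import re

# Item number convention from the question, e.g. A1.1.1 or ZZ99.99.99.
ITEM = r"[A-Z]{1,2}\d{1,2}\.\d{1,2}\.\d{1,2}"

# Group 1: the item number at the start of a line.
# Group 2: everything up to (but not including) the next item number
#          or the end of the text. The lookahead does not consume the
#          next number, so it is still available for the next match.
pattern = re.compile(
    rf"^({ITEM})\s+(.*?)(?=^{ITEM}\s|\Z)",
    re.DOTALL | re.MULTILINE,
)

text = """A1.1.1 First item description
that spans two lines.

Second paragraph of the first item.
B2.10.3 Second item description.
ZZ99.99.99 Last item.
"""

# A list of (item number, description) tuples, ready to be written out
# row by row, for example to a spreadsheet.
items = [(num, desc.strip()) for num, desc in pattern.findall(text)]

for num, desc in items:
    print(num, "->", repr(desc))
```

re.DOTALL lets .*? cross line breaks, and re.MULTILINE lets ^ match at the start of each line, so a description can contain any number of paragraphs.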

Regex Python: Adding . after every 15 terms

I have a text file containing clean tweets and after every 15th term I need to insert a period.
In Python how do I add a character after a specific word using regex? Right now I am parsing the line word by word and I don't understand regex enough to write the code.
Basically, so that each line becomes its own string after a period.
Or is there an alternative way to split a paragraph into individual sentences?
Splitting paragraphs into sentences can be done with functions in the nltk package. Please refer to this answer: Python split text on sentences
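If you do want the literal behaviour asked for (a period after every 15th term), a regex substitution can do it. This is a minimal sketch that treats a "term" as any run of non-whitespace characters; the function name add_period_every_n_words is hypothetical.

```python
import re

def add_period_every_n_words(text, n=15):
    """Append a period to every n-th whitespace-separated term."""
    # (n - 1) repetitions of "term + whitespace", followed by the n-th term.
    pattern = re.compile(r"((?:\S+\s+){%d}\S+)" % (n - 1))
    # \1 is the matched run of n terms; a period is added right after it.
    return pattern.sub(r"\1.", text)

# Small demonstration with n=3 so the effect is easy to see.
print(add_period_every_n_words("a b c d e f g", n=3))
```

Any leftover terms at the end (fewer than n) are left unchanged, and the result can then be split on the inserted periods.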

Python Regex for selecting text between fixed title

I am trying to get the text between two fixed headers in Python.
Please check this link: http://regex101.com/r/jV4oP5/1
I want to extract everything that comes after OPINION. The regex I wrote matches only the first line, and it also matches OPINION BY.
Is there another regex that can fetch the data?
Any help is appreciated.
Use the DOTALL modifier (s) to extract everything after OPINION.
OPINION.*
If you don't want to match OPINION then use a lookbehind,
(?<=OPINION).*
If you really mean you want the entire document after "OPINION", try: (OPINION(\n*.*)*$)
Your regex was only finding the first new line character followed by any normal characters (excluding new line).
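In Python the DOTALL modifier is passed as the re.DOTALL flag. A short sketch with the lookbehind variant; the sample text is invented for illustration and is not the document from the question:

```python
import re

# Hypothetical sample document with an OPINION header.
text = "HEADER\nOPINION BY: Judge X\nFirst line.\nSecond line.\n"

# With re.DOTALL, "." also matches newlines, so ".*" runs to the end
# of the string. The lookbehind keeps "OPINION" itself out of the match.
match = re.search(r"(?<=OPINION).*", text, re.DOTALL)
rest = match.group(0) if match else ""
print(rest)
```

Because re.search stops at the first occurrence, everything after the first OPINION is returned, including any later occurrences of the word.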
