Regular expression in Python sentence extractor

I have a script that gives me sentences containing one of a specified list of key words. A sentence is defined as anything between two periods.
Now I want it to select the whole sentence: for example, in 'Put 1.5 grams of powder in', if 'powder' were a key word it should return the whole sentence and not just '5 grams of powder'.
I am trying to figure out how to express that a sentence lies between two sequences of a period followed by a space. My new filter is:
from itertools import ifilter, imap  # Python 2 iterator tools
from re import finditer

def iterphrases(text):
    return ifilter(None, imap(lambda m: m.group(1), finditer(r'([^\.\s]+)', text)))
However, now I no longer print whole sentences, just pieces/phrases of words (including my key word). I am very confused about what I am doing wrong.

If you don't HAVE to use an iterator, re.split would be a bit simpler for your use case (custom definition of a sentence):
re.split(r'\.\s', text)
Note that the last sentence will include the trailing . or will be empty (if the text ends with whitespace after the last period); to fix that:
re.split(r'\.\s', re.sub(r'\.\s*$', '', text))
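For example, on the sample sentence from the question (note the split consumes the period-plus-space delimiter, so sentences come back without their final period):
import re

text = "Put 1.5 grams of powder in the beaker. Stir well. "
print(re.split(r'\.\s', re.sub(r'\.\s*$', '', text)))
# ['Put 1.5 grams of powder in the beaker', 'Stir well']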
Also have a look at a slightly more general case in the answer to Python - RegEx for splitting text into sentences (sentence-tokenizing).
For a completely general solution you would need a proper sentence tokenizer, such as nltk.tokenize:
nltk.tokenize.sent_tokenize(text)
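A minimal sketch of that last option; it assumes the Punkt model has been downloaded once, and it should leave the decimal number in the question's example intact:
import nltk
nltk.download('punkt')  # one-time model download ('punkt_tab' on newer NLTK versions)

from nltk.tokenize import sent_tokenize
print(sent_tokenize("Put 1.5 grams of powder in the beaker. Stir well."))
# expected: ['Put 1.5 grams of powder in the beaker.', 'Stir well.']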

Here you get it as an iterator. It works with my test cases. It considers a sentence to be anything (non-greedy) up to a period that is followed by either a space or the end of the line.
import re

sentence = re.compile(r"\w.*?\.(?= |$)", re.MULTILINE)

def iterphrases(text):
    return (match.group(0) for match in sentence.finditer(text))
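For example, this keeps the decimal number from the question intact, because the lookahead rejects the period inside 1.5:
for phrase in iterphrases("Put 1.5 grams of powder in the beaker. Stir well."):
    print(phrase)
# Put 1.5 grams of powder in the beaker.
# Stir well.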

If you are sure that . is used for nothing besides sentence delimiters and that every relevant sentence ends with a period, then the following may be useful:
matches = re.finditer(r'([^.]*?(powder|keyword2|keyword3).*?)\.', text)
result = [m.group() for m in matches]
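A quick check under that assumption (no decimal numbers or abbreviations; the keyword list here is the illustrative one from the answer):
import re

text = 'I like tea. Add the powder to the cup. Done.'
matches = re.finditer(r'([^.]*?(powder|keyword2|keyword3).*?)\.', text)
print([m.group() for m in matches])
# [' Add the powder to the cup.']  (note the leading space carried over from the previous delimiter)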

Related

Printing words from a text

I've made this Python program for printing words from a text, but I got stuck: when Python reaches the next 'tab' index, it returns to the initial one when it checks the conditional, and I don't know why. Can anyone explain why it doesn't take the new 'tab' index?
import re

initial_text = '# Traditionally, a text is understood to be a piece of written or spoken material in its primary form (as opposed to a paraphrase or summary). A text is any stretch of language that can be understood in context. It may be as simple as 1-2 words (such as a stop sign) or as complex as a novel. Any sequence of sentences that belong together can be considered a text.'
text = re.sub(r'\W+', ' ', initial_text)
t = -1
for i in text:
    n = text.find(i)
    if i == ' ':
        print(text[t+1:n])
        t = n
This is because you are using the find() function: it returns the index of the first occurrence of the character you are searching for, so the index keeps jumping back to that first match. You can refer to the find() function documentation.
Use this approach:
import re

initial_text = "whatever your text is"
text = re.sub(r'[^\w\s]', '', initial_text)  # strip punctuation
words_list = text.split()
for word in words_list:
    print(word)
Explanation using an example:
import re

initial_text = "Hello : David welcome to Stack ! overflow"
text = re.sub(r'[^\w\s]', '', initial_text)
The piece above removes the punctuation.
words_list = text.split()
words_list after this step will be: ['Hello', 'David', 'welcome', 'to', 'Stack', 'overflow']
for word in words_list:
    print(word)
The loop above takes each element from the list and prints it.
It looks like you can use:
import re
initial_text = '# Traditionally, a text is understood to be a piece of written or spoken material in its primary form (as opposed to a paraphrase or summary). A text is any stretch of language that can be understood in context. It may be as simple as 1-2 words (such as a stop sign) or as complex as a novel. Any sequence of sentences that belong together can be considered a text.'
words = re.findall(r'[^\W_]+', initial_text)
for word in words:
    print(word)
re.findall extracts all non-overlapping matches from the given text.
[^\W_]+ is a regular expression that matches one or more characters that are neither non-word characters nor underscores; that is, it matches substrings consisting only of letters and/or digits (ASCII and other Unicode).
EXPLANATION

[^\W_]+    one or more characters (matching as many as possible), each being
           any character except non-word characters (all but a-z, A-Z, 0-9, _)
           and '_'
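A quick illustration of the underscore exclusion, on a made-up input string:
import re

print(re.findall(r'[^\W_]+', 'Hello_world, 1-2 words!'))
# ['Hello', 'world', '1', '2', 'words']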

Is there a way to tell if a newline character is splitting two distinct words in Python?

Using the below code, I imported a few .csv files with sentences like the following into Python:
df = pd.concat((pd.read_csv(f) for f in path), ignore_index=True)
Sample sentence:
I WANT TO UNDERSTAND WHERE TH\nERE ARE\nSOME \n NEW RESTAURANTS. \n
While I have no problem removing the newline characters surrounded by spaces, in the middle of words, or at the end of the string, I don't know what to do with the newline characters separating words.
The output I want is as follows:
Goal sentence:
I WANT TO UNDERSTAND WHERE THERE ARE SOME NEW RESTAURANTS.
Is there a way for me to indicate in my code that the newline character is surrounded by two distinct words? Or is this classic garbage in, garbage out?
df = df[~df['Sentence'].str.contains("\n")]
After doing some digging, I came up with two solutions.
1. The textwrap package: though textwrap is normally used for visual formatting (i.e. telling a UI when to show "..." to signify a long string), it successfully identified the \n patterns I was having issues with. It's still necessary to remove extra whitespace of other kinds, but this package got me 90% of the way there.
import textwrap
sample = 'I WANT TO UNDERSTAND WHERE TH\nERE ARE\nSOME \n NEW RESTAURANTS. \n'
sample_wrap = textwrap.wrap(sample)
print(sample_wrap)
'I WANT TO UNDERSTAND WHERE THERE ARE SOME NEW RESTAURANTS. '
2. Function to ID different \n appearance patterns: the 'boil the ocean' solution I came up with before learning about textwrap, and it doesn't work as well. The function finds matches, defined as a newline character surrounded by two word (alphanumeric) characters. For each match, it looks up both strings surrounding the newline in NLTK's words.words() list. If at least one of the two strings is a word in that list, they're considered to be two separate words.
This doesn't take into consideration domain-specific words, which have to be added to the word list, or words like "about", which would be incorrectly categorized by this function if the newline character appeared as "ab\nout". I'd recommend textwrap for this reason, but I still thought I'd share.
import re
from nltk.corpus import words  # the word list referenced above; requires nltk.download('words')

wordlist = set(words.words())  # set membership checks are much faster than a list

carriage = re.compile(r'(\n+)')
wordword = re.compile(r'((\w+)\n+(\w+))')

def carriage_return(sentence):
    if carriage.search(sentence):
        if not wordword.search(sentence):
            sentence = re.sub(carriage, '', sentence)
        else:
            matches = re.findall(wordword, sentence)
            for match in matches:
                word1 = match[1].lower()
                word2 = match[2].lower()
                if word1 in wordlist or word2 in wordlist or word1.isdigit() or word2.isdigit():
                    # at least one side is a known word: treat as two separate words
                    sentence = sentence.replace(match[0], word1 + ' ' + word2)
                else:
                    # neither side is a known word: assume a single word was split
                    sentence = sentence.replace(match[0], word1 + word2)
            sentence = re.sub(carriage, '', sentence)
    display(sentence)  # Jupyter/IPython display; use print() outside a notebook
    return sentence
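A quick call on the sample string from the question (no output asserted here: the result depends on the contents of NLTK's word list, and as noted above a fragment like 'TH\nERE' can still be rejoined incorrectly; rejoined words also come back lowercased):
sample = 'I WANT TO UNDERSTAND WHERE TH\nERE ARE\nSOME \n NEW RESTAURANTS. \n'
print(carriage_return(sample))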

Looking for Italicized Text In Python using re.sub

The gist of this is: I'm making a function that uses re.sub to remove the italics tags and duplicate the text they contain. The function has an argument named sentence that contains a string.
A few examples:
sentence = "<i>All of this text is italicized.</i>"
Return value: "All of this text is italicized. All of this text is italicized."
sentence = "<i>beep</i><i>bop</i><i>boop</i><i>bonk</i>"
Return value: "beep beepbop bopboop boopbonk bonk"
sentence = "I <i>Like</i>, food because <i>it's so great</i>!"
Return value: "I Like Like food because it's so great it's so great!"
Here's what I have so far:
pattern = r'<.*?>'
for i in sentence:
    return re.sub(pattern, i, sentence)
Can anyone help?
First, your pattern is wrong: <.*?> matches only the tags themselves and captures nothing, so there is no text to duplicate (and a greedy <.*> would match everything from the first < to the last >). Second, for i in sentence makes no sense: iterating over a string gives you the individual characters of the string, and passing a single character as the replacement won't duplicate anything.
This, however, seems to do what you want:
return re.sub('<i>(.*?)</i>', r'\1 \1', sentence)
\1 is a reference to whatever the first capturing group, i.e. (.*?), has matched, and it is used twice to achieve the doubling effect.
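A minimal check with the first sample input from the question:
import re

sentence = "<i>All of this text is italicized.</i>"
print(re.sub('<i>(.*?)</i>', r'\1 \1', sentence))
# All of this text is italicized. All of this text is italicized.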

insert space between regex match

I want to un-join typos in my string by locating them with a regex and inserting a space character between the matched expressions.
I tried the solution to a similar question (Insert space between characters regex), using '\1 \2' as the replacement string in re.sub, but it did not work for me.
import re
corpus = '''
This is my corpus1a.I am looking to convert it into a 2corpus 2b.
'''
clean = re.compile('\.[^(\d,\s)]')
corpus = re.sub(clean,' ', corpus)
clean2 = re.compile('\d+[^(\d,\s,\.)]')
corpus = re.sub(clean2,'\1 \2', corpus)
EXPECTED OUTPUT:
This is my corpus 1 a. I am looking to convert it into a 2 corpus 2 b.
You need to put the capture group parentheses around the patterns that match each string that you want to copy to the result.
There's also no need to use + after \d. You only need to match the last digit of the number.
clean = re.compile(r'(\d)([^\d,\s])')
corpus = re.sub(clean, r'\1 \2', corpus)
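A minimal check on a fragment of the corpus:
import re

clean = re.compile(r'(\d)([^\d,\s])')
print(re.sub(clean, r'\1 \2', 'convert it into a 2corpus 2b.'))
# convert it into a 2 corpus 2 b.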
I'm not sure about other possible inputs, but we might be able to add spaces using an expression similar to:
(\d+)([a-z]+)\b
and after that replace any run of two or more spaces with a single space; it might work, though I'm not sure:
import re

# first insert spaces around the digit/letter boundaries, then collapse runs of spaces
spaced = re.sub(r"(\d+)([a-z]+)\b", " \\1 \\2", "This is my corpus1a.I am looking to convert it into a 2corpus 2b")
print(re.sub(r"\s{2,}", " ", spaced))
Capture groups, marked by parentheses ( and ), should be around the patterns you want to match.
So this should work for you
clean = re.compile(r'(\d+)([^\d,\s])')
corpus = re.sub(clean, r'\1 \2', corpus)
The regex (\d+)([^\d,\s]) reads: match one or more digits (\d+) as group 1 (the first set of parentheses), then match a character that is neither a digit, a comma, nor whitespace as group 2.
The reason yours didn't work is that you did not have parentheses surrounding the patterns you want to reuse; the replacement string also needs to be raw (r'\1 \2'), otherwise Python interprets '\1' as an escape character rather than a backreference.

Python regex for multiple and single dots

I'm currently trying to clean a 1-gram file. Some of the words are as follows:
word - basic word, classical case
word. - basic word but with a dot
w.s.f.w. - (word stands for word) - correct acronym
w.s.f.w - incorrect acronym (missing the last dot)
My current implementation uses two different regexes because I haven't succeeded in combining them into one. The first regex recognises basic words:
find_word_pattern = re.compile(r'[A-Za-z]', flags=re.UNICODE)
The second one is used in order to recognise acronyms:
find_acronym_pattern = re.compile(r'([A-Za-z]+(?:\.))', flags=re.UNICODE)
Let's say that I have an input_word as a sequence of characters. The output is obtained with:
"".join(re.findall(pattern, input_word))
Then I choose which output to use based on length: the longer the output, the better. My strategy works well with case no. 1, where both patterns return the same length.
Case no. 2 is problematic: my approach produces word. (with a dot) but I need it to return word (without the dot). Currently the case is decided in favour of find_acronym_pattern, which produces the longer sequence.
Case no. 3 works as expected.
Case no. 4: find_acronym_pattern misses the last character, producing w.s.f., whereas find_word_pattern produces wsfw.
I'm looking for a regex (preferably one instead of the two currently used) that:
given word returns word
given word. returns word
given w.s.f.w. returns w.s.f.w.
given w.s.f.w returns w.s.f.w.
given m.in returns m.in.
A regular expression will never return what is not there, so you can forget about requirement 5. What you can do is always drop the final period, and add it back if the result contains embedded periods. That will give you the result you want, and it's pretty straightforward:
found = re.findall(r"\w+(?:\.\w+)*", input_word)[0]
if "." in found:
found += "."
As you see I match a word plus any number of ".part" suffixes. Like your version, this matches not only single letter acronyms but longer abbreviations like Ph.D., Prof.Dr., or whatever.
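Wrapped in a small helper (the function name is just for illustration), the five cases from the question come out as required:
import re

def normalize(input_word):
    found = re.findall(r"\w+(?:\.\w+)*", input_word)[0]
    if "." in found:
        found += "."
    return found

for w in ["word", "word.", "w.s.f.w.", "w.s.f.w", "m.in"]:
    print(normalize(w))
# word
# word
# w.s.f.w.
# w.s.f.w.
# m.in.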
If you want one regex, you can use something like this:
((?:[A-Za-z](\.))*[A-Za-z]+)\.?
And substitute with:
\1\2
Python 3 example:
import re

regex = r"((?:[A-Za-z](\.))*[A-Za-z]+)\.?"
test_str = ("word\n" "word.\n" "w.s.f.w.\n" "w.s.f.w\n" "m.in")
subst = "\\1\\2"

# note: if group 2 never participated in the match, Python 3.5+ substitutes it as an empty string
result = re.sub(regex, subst, test_str, 0, re.MULTILINE)
if result:
    print(result)
Output:
word
word
w.s.f.w.
w.s.f.w.
m.in.
