I've made this Python program for printing words from a text but I got stuck where Python reaches the next 'tab' index it returns to the initial one when it checks the conditional and I don't know why, so can anyone explain to me why it doesn't take the new 'tab' index?
import re
initial_text = '# Traditionally, a text is understood to be a piece of written or spoken material in its primary form (as opposed to a paraphrase or summary). A text is any stretch of language that can be understood in context. It may be as simple as 1-2 words (such as a stop sign) or as complex as a novel. Any sequence of sentences that belong together can be considered a text.'
text = re.sub('\W+', ' ', initial_text)
t = -1
for i in text:
n = text.find(i)
if i == ' ':
print(text[t+1:n])
t = n
This is because you are using the find() function, this will return the index number of the first occurrence of the word you are searching, that's why it is again moving to the first index.
You can refer to the find() function documentation.
Use this approach
import re
initial_text = "whatever your text is"
text = re.sub(r'[^\w\s]', '', initial_text)
words_list = text.split()
for word in words:
print(word)
Explanation using an example :
import re
initial_text = "Hello : David welcome to Stack ! overflow"
text = re.sub(r'[^\w\s]', '', initial_text)
Above piece removes the punctuations
words_list = text.split()
words_list after this step will be : ['Hello', 'David', 'welcome', 'to', 'Stack', 'overflow']
for word in words_list:
print(word)
Above code takes each element from the list and prints it.
Looks like you can use
import re
initial_text = '# Traditionally, a text is understood to be a piece of written or spoken material in its primary form (as opposed to a paraphrase or summary). A text is any stretch of language that can be understood in context. It may be as simple as 1-2 words (such as a stop sign) or as complex as a novel. Any sequence of sentences that belong together can be considered a text.'
words = re.findall(r'[^\W_]+', initial_text)
for word in words:
print(word)
See Python proof.
re.findall extracts all non-overlapping matches from the given text.
[^\W_]+ is a regular expression that matches one or more characters different from non-word and underscores, and that means it matches substrings that consist of digits or/and letters only (all, ASCII and other Unicode).
See regex proof.
EXPLANATION
[^\W_]+ any character except: non-word characters
(all but a-z, A-Z, 0-9, _), '_' (1 or more
times (matching the most amount possible))
Related
Using the below code, I imported a few .csv files with sentences like the following into Python:
df = pd.concat((pd.read_csv(f) for f in path), ignore_index=True)
Sample sentence:
I WANT TO UNDERSTAND WHERE TH\nERE ARE\nSOME \n NEW RESTAURANTS. \n
While I have no problem removing the newline characters surrounded by spaces, in the middle of words, or at the end of the string, I don't know what to do with the newline characters separating words.
The output I want is as follows:
Goal sentence:
I WANT TO UNDERSTAND WHERE THERE ARE SOME NEW RESTAURANTS.
Is there a way for me to indicate in my code that the newline character is surrounded by two distinct words? Or is this classic garbage in, garbage out?
df = df[~df['Sentence'].str.contains("\n")]
After doing some digging, I came up with two solutions.
1. The textwrap package: Though it seems that the textwrap package is normally used for visual formatting (i.e. telling a UI when to show "..." to signify a long string), it successfully identified the \n patterns I was having issues with. Though it's still necessary to remove extra whitespace of other kinds, this package got me 90% of the way there.
import textwrap
sample = 'I WANT TO UNDERSTAND WHERE TH\nERE ARE\nSOME \n NEW RESTAURANTS. \n'
sample_wrap = textwrap.wrap(sample)
print(sample_wrap)
'I WANT TO UNDERSTAND WHERE THERE ARE SOME NEW RESTAURANTS. '
2. Function to ID different \n appearance patterns: The 'boil the ocean' solution I came up with before learning about textwrap, and it doesn't work as well. This function finds matches defined as a newline character surrounded by two word (alphanumeric) characters. For all matches, the function searches NLTK's words.words() list for each string surrounding the newline character. If at least one of the two strings is a word in that list, it's considered to be two separate words.
This doesn't take into consideration domain-specific words, which have to be added to the wordlist, or words like "about", which would be incorrectly categorized by this function if the newline character appeared as "ab\nout". I'd recommend textwrap for this reason, but still thought I'd share.
carriage = re.compile(r'(\n+)')
wordword = re.compile(r'((\w+)\n+(\w+))')
def carriage_return(sentence):
if carriage.search(sentence):
if not wordword.search(sentence):
sentence = re.sub(carriage, '', sentence)
else:
matches = re.findall(wordword, sentence)
for match in matches:
word1 = match[1].lower()
word2 = match[2].lower()
if word1 in wordlist or word2 in wordlist or word1.isdigit() or word2.isdigit():
sentence = sentence.replace(match[0], word1 + ' ' + word2)
else:
sentence = sentence.replace(match[0], word1+word2)
sentence = re.sub(carriage, '', sentence)
display(sentence)
return sentence
How do you write a regular expression to match a specific word in a string, when the string has white space added in random places?
I've got a string that has been extracted from a pdf document that has a table structure. As a consequence of that structure the extracted string contains randomly inserted new lines and white spaces. The specific words and phrases that I'm looking for are there with characters all in the correct order, but chopped randomly with white spaces. For example: "sta ck over flow".
The content of the pdf document was extracted with PyPDF2 as this is the only option available on my company's python library.
I know that I can write a specific string match for this with a possible white space after every character, but there must be a better way of searching for it.
Here's an example of what I've been trying to do.
my_string = "find the ans weron sta ck over flow"
# r's\s*t\s*a\s*c\s*k\s*' # etc
my_cleaned_string = re.sub(r's\s*t\s*a\s*c\s*k\s*', '', my_string)
Any suggestions?
Actually what you're doing is the best way. The only addition I can suggest is to dynamically construct such regexp from a word:
word = "stack"
regexp = r'\s*'.join(word)
my_string = "find the ans weron sta ck over flow"
my_cleaned_string = re.sub(regexp, '', my_string)
The best you can probably do here is to just strip all whitespace and then search for the target string inside the stripped text:
my_string = "find the ans weron sta ck over flow"
my_string = re.sub(r'\s+', '', my_string)
if 'stack' in my_string:
print("MATCH")
The reason I use "best" above is that in general you won't know if a space is an actual word boundary, or just random whitespace which has been inserted. So, you can really only do as good as finding your target as a substring in the stripped text. Note that the input text 'rust acknowledge' would now match positive for stack.
The problem is that I now have a string where some words are sticked together:
fooledDog and I need fooled D****string text continues with inserted " "
whateveredJ and I need whatevered J*******string text continues with inserted " "
string = string.replace("edD","ed D")
string = string.replace("edJ","ed J")
but I need instead of "D" and "J" to have any possible character so to avoid hard coding values here so that the code will work with any letter or number in this position.
This is a pretty easy problem to solve with regular expressions (not something that is always true, even if regex are the best tool for the job). Try this:
import re
text = "fooledDog whateveredJob"
fixed_text = re.sub(r'ed([A-Z])', r'ed \1', text)
print(fixed_text) # prints "fooled Dog whatevered Job"
The pattern looks for the letters 'ed' in lowercase, followed by any capital letter (which gets captured). The replacement is 'ed' and a space, followed by the capital letter from the capturing group.
I don't fully understand your question, but it seems you have some camelCase words you wanna separate. If that's the case, try this:
import re
name = 'CamelCaseTest123'
splitted = re.sub('(?!^)([A-Z][a-z]+)', r' \1', name).split()
Output:
['Camel', 'Case', 'Test123']
I have a script that gives me sentences that contain one of a specified list of key words. A sentence is defined as anything between 2 periods.
Now I want to use it to select all of a sentence like 'Put 1.5 grams of powder in' where if powder was a key word it would get the whole sentence and not '5 grams of powder'
I am trying to figure out how to express that a sentence is between to sequences of period then space. My new filter is:
def iterphrases(text):
return ifilter(None, imap(lambda m: m.group(1), finditer(r'([^\.\s]+)', text)))
However now I no longer print any sentences just pieces/phrases of words (including my key word). I am very confused as to what I am doing wrong.
if you don't HAVE to use an iterator, re.split would be a bit simpler for your use case (custom definition of a sentence):
re.split(r'\.\s', text)
Note the last sentence will include . or will be empty (if text ends with whitespace after last period), to fix that:
re.split(r'\.\s', re.sub(r'\.\s*$', '', text))
also have a look at a bit more general case in the answer for Python - RegEx for splitting text into sentences (sentence-tokenizing)
and for a completely general solution you would need a proper sentence tokenizer, such as nltk.tokenize
nltk.tokenize.sent_tokenize(text)
Here you get it as an iterator. Works with my testcases. It considers a sentence to be anything (non-greedy) until a period, which is followed by either a space or the end of the line.
import re
sentence = re.compile("\w.*?\.(?= |$)", re.MULTILINE)
def iterphrases(text):
return (match.group(0) for match in sentence.finditer(text))
If you are sure that . is used for nothing besides sentences delimiters and that every relevant sentence ends with a period, then the following may be useful:
matches = re.finditer('([^.]*?(powder|keyword2|keyword3).*?)\.', text)
result = [m.group() for m in matches]
I'm trying to match a specific pattern using the re module in python.
I wish to match a full sentence (More correctly I would say that they are alphanumeric string sequences separated by spaces and/or punctuation)
Eg.
"This is a regular sentence."
"this is also valid"
"so is This ONE"
I'm tried out of various combinations of regular expressions but I am unable to grasp the working of the patterns properly, with each expression giving me a different yet inexplicable result (I do admit I am a beginner, but still).
I'm tried:
"((\w+)(\s?))*"
To the best of my knowledge this should match one or more alpha alphanumerics greedily followed by either one or no white-space character and then it should match this entire pattern greedily. This is not what it seems to do, so clearly I am wrong but I would like to know why. (I expected this to return the entire sentence as the result)
The result I get for the first sample string mentioned above is [('sentence', 'sentence', ''), ('', '', ''), ('', '', ''), ('', '', '')].
"(\w+ ?)*"
I'm not even sure how this one should work. The official documentation(python help('re')) says that the ,+,? Match x or x (greedy) repetitions of the preceding RE.
In such a case is simply space the preceding RE for '?' or is '\w+ ' the preceding RE? And what will be the RE for the '' operator? The output I get with this is ['sentence'].
Others such as "(\w+\s?)+)" ; "((\w*)(\s??)) etc. which are basically variation of the same idea that the sentence is a set of alpha numerics followed by a single/finite number of white spaces and this pattern is repeated over and over.
Can someone tell me where I go wrong and why, and why the above expressions do not work the way I was expecting them to?
P.S I eventually got "[ \w]+" to work for me but With this I cannot limit the number of white-space characters in continuation.
Your reasoning about the regex is correct, your problem is coming from using capturing groups with *. Here's an alternative:
>>> s="This is a regular sentence."
>>> import re
>>> re.findall(r'\w+\s?', s)
['This ', 'is ', 'a ', 'regular ', 'sentence']
In this case it might make more sense for you to use \b in order to match word boundries.
>>> re.findall(r'\w+\b', s)
['This', 'is', 'a', 'regular', 'sentence']
Alternatively you can match the entire sentence via re.match and use re.group(0) to get the whole match:
>>> r = r"((\w+)(\s?))*"
>>> s = "This is a regular sentence."
>>> import re
>>> m = re.match(r, s)
>>> m.group(0)
'This is a regular sentence'
Here's an awesome Regular Expression tutorial website:
http://regexone.com/
Here's a Regular Expression that will match the examples given:
([a-zA-Z0-9,\. ]+)
Why do you want to limit the number of white space character in continuation? Because a sentence can have any number of words (sequences of alphanumeric characters) and spaces in a row, but rather a sentence is the area of text that ends with a punctuation mark or rather something that is not in the above sequence including white space.
([a-zA-Z0-9\s])*
The above regex will match a sentence wherein it is a series or spaces in series zero or more times. You can refine it to be the following though:
([a-zA-Z0-9])([a-zA-Z0-9\s])*
Which simply states that the above sequence must be prefaced with a alphanumeric character.
Hope this is what you were looking for.
Maybe this will help:
import re
source = """
This is a regular sentence.
this is also valid
so is This ONE
how about this one followed by this one
"""
re_sentence = re.compile(r'[^ \n.].*?(\.|\n| +)')
def main():
i = 0
for s in re_sentence.finditer(source):
print "%d:%s" % (i, s.group(0))
i += 1
if __name__ == '__main__':
main()
I am using alternation in the expression (\.|\n| +) to describe the end-of-sentence condition. Note the use of two spaces in the third alternation. The second space has the '+' meta-character so that two or more spaces in a row will be an end-of-sentence.