I am trying out regular expressions and would like to find out if there is any way to remove immediate subsequent words after the word I have identified.
For example,
text = "This is the full sentence that I wish to apply regex on."
If I want to remove the word "full", I understand that I can do the following:
result = re.sub(r"full", "", text)
which would give me
This is the sentence that I wish to apply regex on.
Is there any way to get the following statement from my above text? (e.g. keep parts 5 words after the word "full" or remove the first 5 words after "full" )
apply regex on.
Match everything from the beginning to the string to the full, then repeat a group of \s\S+ 5 times to match 5 words, and replace the whole match with the empty string:
^.*?full(?:\s\S+){5}\s
https://regex101.com/r/H6tdCH/1
Related
I am parsing some user input to make a basic Discord bot assigning roles and such. I am trying to generalize some code to reuse for different similar tasks (doing similar things in different categories/channels).
Generally, I am looking for a substring (the category), then taking the string after as that categories value. I am looking line by line for my category, replacing the "category" substring and returning a stripped version. However, what I have now also replaces any space in the "value" string.
Originally the string looks like this:
Gamertag : 00test gamertag
What I want to do, is preserve the spaces in the value. The regex I am trying to do is: match all non alpha-numeric chars until the first letter.
My return is already matching non alpha but can't figure out how to get just first group, looks like it should be simply adding a ? to make it a lazy operator but not sure.. example code and string below (regex I want to replace is the final print string).
String I am working with:
- 00test Gamertag #(or any non-alpha delimiter)
Desired Results (by matching and stripping the extra characters)
00test Gamertag #(remove leading space and any non-alpha characters before the first words)
The regex I am trying to do is: match all non alpha-numeric chars until the first letter. Should be something like the following, which is close to what I use to strip non-alphas now but it does all not the first group - so I want to match the first group of non-alphas in a string to strip that part using re.sub..
\W+?
https://www.online-python.com/gDVhZrnmlq
Thank you!
Your regex will substitute the non-alphanumerical characters anywhere in the input string. If you only need to have this happening at the start of the string, then use the start-of-input anchor (i.e. ^):
^\W+
It depends on your inputs, you can use two regex to achieve your goal, the first to remove all non alpha-numeric from your string including the ones between words, and the second one to remove whitespaces between words if there is more than one space between each two words :
import re
gamer_tag = "ยต& - 00test - Gamertag"
gamer_tag = re.sub(r"[^a-zA-Z0-9\s]", "", gamer_tag)
gamer_tag = re.sub(r" +", " ", gamer_tag)
print(gamer_tag.strip())
# Output: 00test Gamertag
You can remove the second re.sub() if you sure that there will no more than one space between words.
gamer_tag = "- 00test Gamertag "
gamer_tag = re.sub(r"[^a-zA-Z0-9\s]", "", gamer_tag)
print(gamer_tag.strip())
# Output: 00test Gamertag
I want to un-join typos in my string by locating them using regex and insert a space character between the matched expression.
I tried the solution to a similar question ... but it did not work for me -(Insert space between characters regex); solution- to use the replace string as '\1 \2' in re.sub .
import re
corpus = '''
This is my corpus1a.I am looking to convert it into a 2corpus 2b.
'''
clean = re.compile('\.[^(\d,\s)]')
corpus = re.sub(clean,' ', corpus)
clean2 = re.compile('\d+[^(\d,\s,\.)]')
corpus = re.sub(clean2,'\1 \2', corpus)
EXPECTED OUTPUT:
This is my corpus 1 a. I am looking to convert it into a 2 corpus 2 b.
You need to put the capture group parentheses around the patterns that match each string that you want to copy to the result.
There's also no need to use + after \d. You only need to match the last digit of the number.
clean = re.compile(r'(\d)([^\d,\s])')
corpus = re.sub(clean, r'\1 \2', corpus)
DEMO
I'm not sure about other possible inputs, we might be able to add spaces using an expression similar to:
(\d+)([a-z]+)\b
after that we would replace any two spaces with a single space and it might work, not sure though:
import re
print(re.sub(r"\s{2,}", " ", re.sub(r"(\d+)([a-z]+)\b", " \\1 \\2", "This is my corpus1a.I am looking to convert it into a 2corpus 2b")))
The expression is explained on the top right panel of this demo, if you wish to explore further or modify it, and in this link, you can watch how it would match against some sample inputs step by step, if you like.
Capture groups, marked by parenthesis ( and ), should be around the patterns you want to match.
So this should work for you
clean = re.compile(r'(\d+)([^\d,\s])')
corpus = re.sub(clean,'\1 \2', corpus)
The regex (\d+)([^\d,\s]) reads: match 1 or more digits (\d+) as group 1 (first set of parenthesis), match non-digit and non-whitespace as group 2.
The reason why your's doesn't work was that you did not have parenthesis surrounding the patterns you want to reuse.
I have a list of words and am creating a regular expression like so:
((word1)|(word2)|(word3){1,3})
Basically I want to match a string that contains 1 - 3 of those words.
This works, however I want it to match the string only if the string contains words from the regex. For example:
((investment)|(property)|(something)|(else){1,3})
This should match the string investmentproperty but not the string abcinvestmentproperty. Likewise it should match somethinginvestmentproperty because all those words are in the regex.
How do I go about achieving that?
Thanks
You can use $...^ to match with a string with (^) and ($) to mark the beginning and ending of the string you want to match. Also note you need to add (...) around your group of words to match for the {1,3}:
^((investment)|(property)|(something)|(else)){1,3}$
Regex101 Example
Im looking for a python regex which matches sentences with 3 words before a string.
for example say i have sentence "this is the test" i want to match this one and only if there are any 3 words before the string test.
re.match(r'(\d\w+\d){3}test', "this is the test")
thought the above sentence would work but didnt work.
If you want to match strings with 3 words, separated by whitespace(s) followed by 'test': (\b){3}test
If you want to extract first 3 words, separated by whitespace(s) followed by 'test': (\w+\s+){3}test
If you want same thing, but want to allow whitespace(s) before the stopword: (\w+\s+){3}\w?test
If you want to do same, but with string ending with the stopword: (\w+\s+){3}\w?test$
I have a script that gives me sentences that contain one of a specified list of key words. A sentence is defined as anything between 2 periods.
Now I want to use it to select all of a sentence like 'Put 1.5 grams of powder in' where if powder was a key word it would get the whole sentence and not '5 grams of powder'
I am trying to figure out how to express that a sentence is between to sequences of period then space. My new filter is:
def iterphrases(text):
return ifilter(None, imap(lambda m: m.group(1), finditer(r'([^\.\s]+)', text)))
However now I no longer print any sentences just pieces/phrases of words (including my key word). I am very confused as to what I am doing wrong.
if you don't HAVE to use an iterator, re.split would be a bit simpler for your use case (custom definition of a sentence):
re.split(r'\.\s', text)
Note the last sentence will include . or will be empty (if text ends with whitespace after last period), to fix that:
re.split(r'\.\s', re.sub(r'\.\s*$', '', text))
also have a look at a bit more general case in the answer for Python - RegEx for splitting text into sentences (sentence-tokenizing)
and for a completely general solution you would need a proper sentence tokenizer, such as nltk.tokenize
nltk.tokenize.sent_tokenize(text)
Here you get it as an iterator. Works with my testcases. It considers a sentence to be anything (non-greedy) until a period, which is followed by either a space or the end of the line.
import re
sentence = re.compile("\w.*?\.(?= |$)", re.MULTILINE)
def iterphrases(text):
return (match.group(0) for match in sentence.finditer(text))
If you are sure that . is used for nothing besides sentences delimiters and that every relevant sentence ends with a period, then the following may be useful:
matches = re.finditer('([^.]*?(powder|keyword2|keyword3).*?)\.', text)
result = [m.group() for m in matches]