I'm searching for a regex solution to replace a substring given a dynamic pattern. The issue is that the substring might contain a known token and we don't know at which position this token occurs.
I can formulate the problem as: Replace (a given) pattern in string even if (known) token would conflict.
Let's assume we have my_string:
I like green and PLUS blue beans!
PLUS represents the known token we want to ignore in case it is hindering a match.
We also have a variable pattern called my_pattern which can be any part of my_string except PLUS such as:
1) green and blue
2) green and blue beans
3) I like green
We know PLUS may occur somewhere in my_string and we don't know the position. Theoretically, my_string could also be:
I PLUS like green and blue beans!
Since my_pattern can occur in form 1), 2), or 3), we also can't hardcode the solution using ORs.
The sought solution is something like:
my_string.replace(my_pattern, "red") with the output for my_pattern:
1) I like red beans!
2) I like red!
3) red and PLUS blue beans!
my_pattern shall match although the PLUS occurs in my_string (which might conflict with my_pattern).
It is something like: match my_pattern and ignore PLUS in case it is hindering a match.
You could modify your pattern such that a regex for your token is added between every single character.
One thing you did not explain explicitly is that the token also adds a space into the string, so the token regex should also allow for spaces to the left and right.
import re
token = 'PLUS'
patterns = ['green and blue', 'green and blue beans', 'I like green']
# Insert an optional token (with optional surrounding spaces) between every character
ptn_pls = [f'( ?{token} ?)?'.join(p) for p in patterns]
Applied to three different strings:
my_string = 'I like green and PLUS blue beans!'
for p in ptn_pls:
    print(re.sub(p, 'red', my_string))
# I like red beans!
# I like red!
# red and PLUS blue beans!
my_string = 'I PLUS like green and blue beans!'
for p in ptn_pls:
    print(re.sub(p, 'red', my_string))
# I PLUS like red beans!
# I PLUS like red!
# red and blue beans!
my_string = 'I like grPLUSeen a PLUSnd blue beans!'
for p in ptn_pls:
    print(re.sub(p, 'red', my_string))
# I like red beans!
# I like red!
# red a PLUSnd blue beans!
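One caveat: if a pattern contains regex metacharacters (e.g. `.` or `?`), each character should be escaped before joining. A minimal sketch of a safer variant, reusing the first pattern from above (`safe` is a name chosen here for illustration):

```python
import re

token = 'PLUS'
pattern = 'green and blue'  # could in general contain metacharacters
# Escape each character, then allow an optional token (with optional
# surrounding spaces) between every pair of characters.
safe = f'( ?{re.escape(token)} ?)?'.join(map(re.escape, pattern))
print(re.sub(safe, 'red', 'I like green and PLUS blue beans!'))
# I like red beans!
```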
If your token is a word with blanks around it, this crude function can work:
import re
def skip_token(s, pattern, token, sub):
    p = pattern.split()
    gex = "|".join([pattern] + [" ".join(p[:i] + [token] + p[i:]) for i in range(1, len(p))])
    return re.sub(gex, sub, s)
s = "I like green and PLUS blue beans!"
token = "PLUS"
sub = "red"
>>> print(skip_token(s, "green and blue", token, sub))
>>> print(skip_token(s, "green and blue beans", token, sub))
>>> print(skip_token(s, "I like green", token, sub))
I like red beans!
I like red!
red and PLUS blue beans!
But if your my_string has punctuation AND the token can be found literally anywhere, this will sometimes fail.
Related
I want to extract the full sentence if it contains certain keywords (like or love).
text = 'I like blueberry icecream. He has a green car. She has blue car.'
pattern = '[^.]* like|love [^.]*\.'
re.findall(pattern,text)
Using | as the divider, I expected ['I like blueberry icecream.'] but only got ['I like'].
I also tried pattern = '[^.]*(like|love)[^.]*\.' but got only ['like']
What did I do wrong? I know a single word works with the following regex: '[^.]* like [^.]*\.'
You need to put a group around like|love. Otherwise the | applies to the entire patterns on either side of it. So it's matching either a string ending with like or a string beginning with love.
pattern = '[^.]* (?:like|love) [^.]*\.'
After researching more, I found out I was missing ?:
text = 'I like blueberry icecream. He has a green car. She has blue car.'
pattern = '[^.]*(?:like|love)[^.]*\.'
Output
['I like blueberry icecream.']
Source: https://www.ocpsoft.org/tutorials/regular-expressions/or-in-regex/
I actually think it would be easier to do this without regex. Just my two cents.
text = 'I like blueberry icecream. He has a green car. She has blue car. I love dogs.'
print([x for x in text.split('.') if any(y in x for y in ['like', 'love'])])
You can use the regex below:
regex = /[^.]* (?:like|love) [^.]*\./g
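In Python, the same pattern applied to the question's `text` variable would be:

```python
import re

text = 'I like blueberry icecream. He has a green car. She has blue car.'
print(re.findall(r'[^.]* (?:like|love) [^.]*\.', text))
# ['I like blueberry icecream.']
```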
I want to extract the word before a certain match from the Names column and append a new column called color.
If there is no color before the name, I want to display an empty string.
I've been trying to extract the word before the match. For example, I have the following table:
import pandas as pd
import re
data = ['red apple','green topaz','black grapes','white grapes']
df = pd.DataFrame(data, columns = ['Names'])
Names
red apple
green apple
black grapes
white grapes
normal apples
red apple
I tried the code below, but I am only getting partial output:
df['Names'].apply(lambda x: ' '.join(re.findall(r'(\w+)\s+apple', x)))
df['Names'].apply(lambda x: ' '.join(re.findall(r'(\w+)\s+apple|grapes', x)))
Desired output:
Names color
red apple red
green apple green
black grapes black
white grapes white
normal apples
red apple red
Please help me out with this issue.
I found this solution, which gives me a colors_column like ['red', 'green', 'black', 'white', '']:
import re
data = ['red apple','green topaz','black grapes','white grapes','apples']
colors_column = list(map(lambda x: ' '.join(re.findall(r'(\S\w+)\s+\w+', x)) ,data))
One solution is just to remove the fruit names to get the color:
def remove_fruit_name(description):
    return re.sub(r"apple|grapes", "", description)

df['Colors'] = df['Names'].apply(remove_fruit_name)
If you have many lines it may be faster to compile your regexp:
fruit_pattern = re.compile(r"apple|grapes")

def remove_fruit_name(description):
    return fruit_pattern.sub("", description)
Another solution is to use a lookahead assertion; it's (probably) a bit faster, but the code is a bit more complex:
# It may be useful to have a set of fruits:
valid_fruit_names = {"apple", "grapes"}
any_fruit_pattern = '|'.join(valid_fruit_names)
fruit_pattern = re.compile(rf"(\w*)\s*(?={any_fruit_pattern})")

def remove_fruit_name(description):
    match = fruit_pattern.search(description)
    if match:
        return match.groups()[0]
    return description

df['Colors'] = df['Names'].apply(remove_fruit_name)
Here is an example of lookahead quoted from the documentation:
(?=...)
Matches if ... matches next, but doesn’t consume any of the string. This is called a lookahead assertion. For example, Isaac (?=Asimov) will match 'Isaac ' only if it’s followed by 'Asimov'.
Finally, if you want to make a difference between normal and green you'll need a dictionary of valid colors. Same goes for fruit names if you have non-fruit strings in your input, such as topaz.
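For instance, matching against an explicit set of known colors (the set below is an assumption; extend it as needed) sidesteps the "normal" problem:

```python
import re

# Hypothetical set of valid colors; "normal" won't match.
valid_colors = {"red", "green", "black", "white"}
color_pattern = re.compile(r"^(\w+)\s")

def extract_color(description):
    # Take the first word only if it is a known color
    match = color_pattern.match(description)
    if match and match.group(1) in valid_colors:
        return match.group(1)
    return ""

print(extract_color("red apple"))      # red
print(extract_color("normal apples"))  # (empty string)
print(extract_color("green topaz"))    # green
```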
Not necessarily an elegant trick, but this seems to work:
((re.search(r'(\w*) (apple|grape)', a)) or ['', ''])[1]
Briefly, you search for the first word before apple or grape, but if there is no match, re.search returns None, which is falsy. So you use or with a fallback list of empty strings; since you want to take group 1 of the matched expression (index 1), I used a two-element list of empty strings (so the fallback also has a second element to take).
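Applied to the question's data list (without pandas, for brevity; note that indexing a match object with `[1]` requires Python 3.6+):

```python
import re

# The question's data list; 'green topaz' has no fruit, so it gets ''
data = ['red apple', 'green topaz', 'black grapes', 'white grapes']
colors = [((re.search(r'(\w*) (apple|grape)', a)) or ['', ''])[1] for a in data]
print(colors)
# ['red', '', 'black', 'white']
```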
I currently work on a project which simply creates basic corpus databases and tokenizes texts. But it seems I am stuck on something. Assume that we have these things:
import os, re

texts = []
for i in os.listdir(somedir):  # somedir contains text files with very large plain texts
    with open(os.path.join(somedir, i), 'r') as f:
        texts.append(f.read())
Now I want to find the word before and after a token.
myToken = 'blue'
found = []
for i in texts:
    fnd = re.findall(r'[a-zA-Z0-9]+ %s [a-zA-Z0-9]+|\. %s [a-zA-Z0-9]+|[a-zA-Z0-9]+ %s\.' % (myToken, myToken, myToken), i, re.IGNORECASE | re.UNICODE)
    found.extend(fnd)
print(myToken)
for i in found:
    print('\t\t%s' % i)
I thought there would be three possibilities: The token might start sentence, the token might end sentence or the token might appear somewhere in the sentence, so I used the regex rule above. When I run, I come across those things:
blue
My blue car # What I exactly want.
he blue jac # That's not what I want. That must be "the blue jacket."
eir blue phone # Wrong! > their
a blue ali # Wrong! > alien
. Blue is # Okay.
is blue. # Okay.
...
I also tried \b\w\b or \b\W\b things, but unfortunately those returned no results at all rather than wrong ones. I tried:
'\b\w\b%s\b[a-zA-Z0-9]+|\.\b%s\b\w\b|\b\w\b%s\.'
'\b\W\b%s\b[a-zA-Z0-9]+|\.\b%s\b\W\b|\b\W\b%s\.'
I hope the question is not too vague.
I think what you want is:
(Optionally) a word and a space;
(Always) 'blue';
(Optionally) a space and a word.
Therefore one appropriate regex would be:
r'(?i)((?:\w+\s)?blue(?:\s\w+)?)'
For example:
>>> import re
>>> text = """My blue car
the blue jacket
their blue phone
a blue alien
End sentence. Blue is
is blue."""
>>> re.findall(r'(?i)((?:\w+\s)?{0}(?:\s\w+)?)'.format('blue'), text)
['My blue car', 'the blue jacket', 'their blue phone', 'a blue alien', 'Blue is', 'is blue']
Let's say token is test.
(?=^test\s+.*|.*?\s+test\s+.*?|.*?\s+test$).*
You can use a lookahead. It will not consume anything and will validate at the same time.
http://regex101.com/r/wK1nZ1/2
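For example, applied in Python with 'blue' as the token (a sketch; the `.*` after the lookahead then matches the whole line):

```python
import re

# The same lookahead pattern, with 'blue' substituted for 'test'
pattern = r'(?=^blue\s+.*|.*?\s+blue\s+.*?|.*?\s+blue$).*'
for line in ['blue car ahead', 'my blue car', 'the sky is blue', 'bluebird']:
    # 'bluebird' fails: 'blue' is not a standalone word there
    print(line, '->', re.search(pattern, line) is not None)
```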
Regex can sometimes be slow (if not implemented correctly), and moreover the accepted answer did not work for me in several cases.
So I went for the brute-force solution (not saying it is the best one), where the keyword can be composed of several words:
# @staticmethod, if used inside a class
def find_neighbours(word, sentence):
    prepost_map = []
    if word not in sentence:
        return prepost_map
    split_sentence = sentence.split(word)
    for i in range(0, len(split_sentence) - 1):
        prefix = ""
        postfix = ""
        prefix_list = split_sentence[i].split()
        postfix_list = split_sentence[i + 1].split()
        if len(prefix_list) > 0:
            prefix = prefix_list[-1]
        if len(postfix_list) > 0:
            postfix = postfix_list[0]
        prepost_map.append([prefix, word, postfix])
    return prepost_map
Empty string before or after the keyword indicates that keyword was the first or the last word in the sentence, respectively.
I looked around, but couldn't find what I was looking for....
Basically I have a string with lots of asterisks scattered around:
Example: red blue green * hello* pink orange 4pgp42g4jg42 * world* violet black
What I am trying to do is split the string up so I can extract "hello" and "world" and eventually print them out as a list using a for statement. The strings I'm working with are longer and do not necessarily have any set number of slices that I would want to take out.
Could anyone assist me with this please?
Thank you
I would expect that:
re.findall(r'\*([^*]+)\*',string)
would do the trick. Basically this regex looks for a '*' (\*), then captures one or more characters that aren't a '*' (([^*]+)), and then another '*'.
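For example, with the question's string:

```python
import re

string = 'red blue green * hello* pink orange 4pgp42g4jg42 * world* violet black'
print(re.findall(r'\*([^*]+)\*', string))
# [' hello', ' world']
```

The leading spaces are part of the captures; `.strip()` on each result removes them if needed.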
As an alternative to the excellent re suggestions:
Use split to separate sections of "between asterisks" and "not between asterisks":
>>> msg = "red blue green * hello* pink orange 4pgp42g4jg42 * world* violet black"
>>> msg.split("*")
['red blue green ', ' hello', ' pink orange 4pgp42g4jg42 ', ' world', ' violet black']
Then use array slicing to get every other section, starting with the second.
>>> msg.split("*")[1::2]
[' hello', ' world']
Have you ever tried the re module? It uses a syntax called regular expressions that allows you to do very complicated matches (see the docs here). In your case, you could try something like this:
import re

# Store your string
my_str = 'red blue green * hello* pink orange 4pgp42g4jg42 * world* violet black'

# Find matches
match = re.findall(r'\*([^\*]*)\*', my_str)

# Print everything
print(match)

# Iterate
for item in match:
    print(item)
You can use .split('*') and then take every other element.
For instance:
my_string = 'this is a *test* of my code that *I* have written'
split_string = my_string.split('*')
words_between = [split_string[i] for i in range(1, len(split_string), 2)]
Regex seems like overkill here. I would just use:
my_str = 'red blue green * hello* pink orange 4pgp42g4jg42 * world* violet black'
broken_up = my_str.split('*')
And if you don't want the ends, just do
broken_up[1:-1]
EDIT:
I think I just realized what you're really looking for. Technically 'pink orange 4pgp42g4jg42' is between asterisks too, which poses a problem. I think that this'll work though.
my_str = 'red blue green * hello* pink orange 4pgp42g4jg42 * world* violet black'
broken_up = my_str.split('*')
broken_up = [broken_up[i] for i in range(1, len(broken_up), 2)]
If you want to get rid of spaces, just use .strip() like
broken_up = [broken_up[i].strip() for i in range(1, len(broken_up), 2)]
Give this a try:
from re import findall

sstring = "red blue green * hello* pink orange 4pgp42g4jg42 * world*"
result = findall(r'\*.*?\*', sstring)
print(result)
This will give you:
['* hello*', '* world*']
I would do this using re.split to split this into a list of strings thusly:
import re

my_string = "red blue green * hello* pink orange 4pgp42g4jg42 * world* violet black"
all_split_up = re.split(r'\*', my_string)
When you do this, typing:
for item in all_split_up:
    print(item)
will yield:
red blue green
hello
pink orange 4pgp42g4jg42
world
violet black
By using re.split instead of re.findall, you won't have to worry about specifying non-capturing groups in the regex pattern. I think this is the simplest regex answer though a little late on the 'Answer' button.
I want to match a list of words against a string and get how many of the words are matched.
Now I have this:
import re

words = ["red", "blue"]
exactMatch = re.compile(r'\b%s\b' % r'\b|\b'.join(words), flags=re.IGNORECASE)
print(exactMatch.search("my blue cat"))
print(exactMatch.search("my red car"))
print(exactMatch.search("my red and blue monkey"))
print(exactMatch.search("my yellow dog"))
My current regex will match the first 3, but I would like to find out how many of the words in the list words that matches the string passed to search. Is this possible without making a new re.compile for each word in the list?
Or is there another way to achieve the same thing?
The reason I want to keep the number of re.compile to a minimum is speed, since in my application I have multiple word lists and about 3500 strings to search against.
If you use findall instead of search, then you get a list containing all the matched words.
print(exactMatch.findall("my blue cat"))
print(exactMatch.findall("my red car"))
print(exactMatch.findall("my red and blue monkey"))
print(exactMatch.findall("my yellow dog"))
will result in
['blue']
['red']
['red', 'blue']
[]
If you need the number of matches, use len():
print(len(exactMatch.findall("my blue cat")))
print(len(exactMatch.findall("my red car")))
print(len(exactMatch.findall("my red and blue monkey")))
print(len(exactMatch.findall("my yellow dog")))
will result in
1
1
2
0
If I got the question right, you only want to know the number of matches of blue or red in a sentence.
>>> exactMatch = re.compile(r'%s' % '|'.join(words), flags=re.IGNORECASE)
>>> print(exactMatch.findall("my blue blue cat"))
['blue', 'blue']
>>> print(len(exactMatch.findall("my blue blue cat")))
2
You need more code if you want to test multiple colors
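To count each color separately, one option (a sketch using collections.Counter and a non-capturing group, a slight variant of the question's pattern) is:

```python
import re
from collections import Counter

words = ['red', 'blue']
# Non-capturing group so findall returns the full matched words
exact_match = re.compile(r'\b(?:%s)\b' % '|'.join(words), flags=re.IGNORECASE)
counts = Counter(m.lower() for m in exact_match.findall('my red and blue blue monkey'))
print(counts)
# Counter({'blue': 2, 'red': 1})
```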
Why not store all words in a dict and look up every word of the sentence via finditer:
words = { "red": 1 .... }
word = re.compile(r'\b(\w+)\b')
for i in word.finditer(sentence):
    if words.get(i.group(1)):
        ....
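A complete, runnable version of that sketch, with the elided parts filled in by hypothetical counting logic (the word list and sentence are examples):

```python
import re

# Hypothetical word list mapped to counts
words = {'red': 0, 'blue': 0}
word = re.compile(r'\b(\w+)\b')
sentence = 'my red and blue monkey'
for i in word.finditer(sentence):
    if i.group(1) in words:
        words[i.group(1)] += 1
print(words)
# {'red': 1, 'blue': 1}
```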
for w in words:
    if w in searchterm:
        print("found")