Regex to grab word before a certain character in python

I want to extract the word that comes before a certain word in the Names column and store it in a new column called color.
If there is no color before the name, I want an empty string instead.
For example, I have the following table:
import pandas as pd
import re
data = ['red apple','green topaz','black grapes','white grapes']
df = pd.DataFrame(data, columns = ['Names'])
Names
red apple
green apple
black grapes
white grapes
normal apples
red apple
This is the code I tried, but it only gives me partial output:
df['Names'].apply(lambda x: ' '.join(re.findall(r'(\w+)\s+apple', x)))
df['Names'].apply(lambda x: ' '.join(re.findall(r'(\w+)\s+apple|grapes', x)))
Desired output:
Names          color
red apple      red
green apple    green
black grapes   black
white grapes   white
normal apples
red apple      red
Please help me with this issue.

I found this solution, which gives me a color column like ['red', 'green', 'black', 'white', '']:
import re
data = ['red apple','green topaz','black grapes','white grapes','apples']
colors_column = list(map(lambda x: ' '.join(re.findall(r'(\S\w+)\s+\w+', x)) ,data))

One solution is just to remove the fruit names to get the color:
def remove_fruit_name(description):
    # Drop the fruit name and strip the leftover whitespace
    return re.sub(r"apple|grapes", "", description).strip()
df['Colors'] = df['Names'].apply(remove_fruit_name)
If you have many lines it may be faster to compile your regexp:
fruit_pattern = re.compile(r"apple|grapes")
def remove_fruit_name(description):
    return fruit_pattern.sub("", description).strip()
Another solution is to use a lookahead assertion. It is (probably) a bit faster, but the code is a bit more complex:
# It may be useful to have a set of fruits:
valid_fruit_names = {"apple", "grapes"}
any_fruit_pattern = '|'.join(valid_fruit_names)
# Raw f-string so that \w and \s reach the regex engine unchanged
fruit_pattern = re.compile(rf"(\w*)\s*(?={any_fruit_pattern})")
def remove_fruit_name(description):
    match = fruit_pattern.search(description)
    if match:
        return match.groups()[0]
    return description
df['Colors'] = df['Names'].apply(remove_fruit_name)
Here is an example of lookahead quoted from the documentation:
(?=...)
Matches if ... matches next, but doesn’t consume any of the string. This is called a lookahead assertion. For example, Isaac (?=Asimov) will match 'Isaac ' only if it’s followed by 'Asimov'.
Finally, if you want to make a difference between normal and green you'll need a dictionary of valid colors. Same goes for fruit names if you have non-fruit strings in your input, such as topaz.
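A minimal sketch of that dictionary-based idea, assuming hypothetical VALID_COLORS and VALID_FRUITS sets (neither appears in the question, so fill them with your own vocabulary). It returns the color only when a known color directly precedes a known fruit, and an empty string otherwise:

```python
import re

# Hypothetical vocabularies; extend them to match your data.
VALID_COLORS = {"red", "green", "black", "white"}
VALID_FRUITS = {"apple", "apples", "grapes"}


def _alternation(names):
    # Longest names first, so "apples" is tried before "apple".
    return "|".join(re.escape(n) for n in sorted(names, key=len, reverse=True))


color_before_fruit = re.compile(
    rf"\b({_alternation(VALID_COLORS)})\s+(?:{_alternation(VALID_FRUITS)})\b"
)


def extract_color(description):
    """Return the color preceding a fruit name, or '' if there is none."""
    match = color_before_fruit.search(description)
    return match.group(1) if match else ""


names = ["red apple", "green topaz", "black grapes", "normal apples", "apples"]
print([extract_color(n) for n in names])  # ['red', '', 'black', '', '']
```

With the DataFrame from the question, this could be applied as df['color'] = df['Names'].apply(extract_color).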

Not necessarily an elegant trick, but this seems to work:
(re.search(r'(\w*) (apple|grape)', a) or ['', ''])[1]
Briefly, you search for the first word before apple or grape. If there is no match, re.search returns None, which is falsy, so the or falls back to a list of empty strings. Since the first group of a match is at index 1, the fallback list needs two elements so that index 1 exists there as well.
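The same fallback can be written out explicitly, which may be easier to read. A minimal sketch; the function name word_before_fruit is made up for illustration:

```python
import re


def word_before_fruit(text):
    """Return the word before 'apple' or 'grape', or '' when nothing matches."""
    match = re.search(r"(\w*) (apple|grape)", text)
    # re.search returns None on failure, so test it before reading the group.
    return match.group(1) if match else ""


print(word_before_fruit("red apple"))    # red
print(word_before_fruit("green topaz"))  # prints an empty line
```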

Related

what is the fast way to match words in text?

I have a list of regexes like:
regex_list = [".+rive.+",".+ll","[0-9]+ blue car.+"......] ## list of length 3000
What is the best way to match all of these regexes against my text?
For example:
text : Hello, Owning 2 blue cars for a single driver
In the output, I want a list of the matched words:
matched_words = ["Hello","4 blue cars","driver"] ##Hello <==>.+llo

Alright, first of all, you will probably want to adjust your regex_list, because as it stands, matching those patterns returns the entire text as the match. This is because of .+, which matches any character one or more times, as many as possible. What I have done here is the following:
import re
regex_list = [".rive.",".+ll.","[0-9]+ blue car."]
text = "Hello, Owning 2 blue cars for a single driver"
# Returns all the spans of matched regex items in text
spans = [re.search(regex_item,text).span() for regex_item in regex_list]
# Sorts the spans by first occurrence (so, by the first element of every span).
spans.sort()
# Retrieves the text via index of spans in text.
matching_texts = [text[x[0]:x[1]] for x in spans]
print(matching_texts)
I adjusted your regex_list slightly, so it does not match the entire text. Then, I retrieve the spans of all matches in the text. Additionally, I sort the spans by first occurrence. Lastly, I retrieve the texts via the indexes of the spans and print them out. What you will get is the following:
['Hello', '2 blue cars', 'driver']
NOTE: I am unsure why you would like to match '4 blue cars', because that is not in your text.
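Note that re.search returns None when a pattern does not occur in the text, so calling .span() directly would raise AttributeError for such a pattern. A small sketch that skips non-matching patterns; the helper name first_matches and the extra pattern "no such text" are assumptions for illustration:

```python
import re


def first_matches(patterns, text):
    """Return the first match of each pattern, ordered by position in text."""
    spans = []
    for pattern in patterns:
        match = re.search(pattern, text)
        if match is not None:  # skip patterns that do not occur at all
            spans.append(match.span())
    spans.sort()
    return [text[start:end] for start, end in spans]


regex_list = [".rive.", ".+ll.", "[0-9]+ blue car.", "no such text"]
text = "Hello, Owning 2 blue cars for a single driver"
print(first_matches(regex_list, text))  # ['Hello', '2 blue cars', 'driver']
```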

You could also try this, which is a multi-threaded version of @Lexpj's answer:
from concurrent.futures import ThreadPoolExecutor, as_completed
import re
# list of length 3000
regex_list = [".rive.", ".+ll.", "[0-9]+ blue car."]
my_string = "Hello, Owning 2 blue cars for a single driver "
def test(text, regex):
    # Returns all the spans of matched regex items in text
    spans = [re.search(regex, text).span()]
    # Sorts the spans by first occurrence (so, by the first element of every span).
    spans.sort()
    # Retrieves the text via index of spans in text.
    matching_texts = [text[x[0]:x[1]] for x in spans]
    return matching_texts
with ThreadPoolExecutor(max_workers=10) as executor:
    futures = {executor.submit(test, my_string, regex)
               for regex in regex_list}
    # as_completed() yields the futures as they finish
    matched = set()
    for f in as_completed(futures):
        # Get the results
        rs = f.result()
        matched = matched.union(set(rs))
print(matched)

Looking at the desired result, your regexes are not correct. You don't want to match .+, but \w+, and also with the second regex, you'll want to match some letters after ll too.
The main idea is then to make one regex for all, by concatenating them with the | symbol:
import re
regex_list = [r"\w+rive\w+", r"\w+ll\w+", r"\d+ blue car\w+"]
regex = re.compile('|'.join(regex_list))
text = "Hello, Owning 2 blue cars for a single driver "
print(regex.findall(text)) # ["Hello","2 blue cars","driver"]
This could still give undesired results when part of your string matches more than one regex in the list. In that case the first one will "win". So make sure that when multiple regexes could match the same text, they are listed in order of desired priority.
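To illustrate the priority rule: at any given position the regex engine tries the alternatives from left to right and takes the first one that matches. A minimal sketch with two made-up overlapping patterns:

```python
import re

text = "2 blue cars"

# "blue" is listed first, so it wins and "blue cars" is never tried.
short_first = re.compile("|".join([r"blue", r"blue cars"]))
print(short_first.findall(text))  # ['blue']

# Listing the longer pattern first gives it priority.
long_first = re.compile("|".join([r"blue cars", r"blue"]))
print(long_first.findall(text))   # ['blue cars']
```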

Extract full sentence with list of words

I want to extract the full sentence if it contains certain keywords (like or love).
text = 'I like blueberry icecream. He has a green car. She has blue car.'
pattern = '[^.]* like|love [^.]*\.'
re.findall(pattern,text)
Using | as the divider, I expected ['I like blueberry icecream.']
But I only got ['I like']
I also tried pattern = '[^.]*(like|love)[^.]*\.' but got only ['like']
What did I do wrong? I know a single word works with the following regex: '[^.]* like [^.]*\.'

You need to put a group around like|love. Otherwise the | applies to the entire patterns on either side of it. So it's matching either a string ending with like or a string beginning with love.
pattern = '[^.]* (?:like|love) [^.]*\.'
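The difference between the three variants from the question can be checked directly. A short sketch using the question's own text:

```python
import re

text = 'I like blueberry icecream. He has a green car. She has blue car.'

# No group: the two alternatives are "[^.]* like" and "love [^.]*\."
no_group = re.findall(r'[^.]* like|love [^.]*\.', text)
print(no_group)       # ['I like']

# Capturing group: findall returns only what the group captured
capturing = re.findall(r'[^.]*(like|love)[^.]*\.', text)
print(capturing)      # ['like']

# Non-capturing group: findall returns the whole match
non_capturing = re.findall(r'[^.]* (?:like|love) [^.]*\.', text)
print(non_capturing)  # ['I like blueberry icecream.']
```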

After more research, I found out I was missing ?:
text = 'I like blueberry icecream. He has a green car. She has blue car.'
pattern = '[^.]*(?:like|love)[^.]*\.'
Output
['I like blueberry icecream.']
Source: https://www.ocpsoft.org/tutorials/regular-expressions/or-in-regex/

I actually think it would be easier to do this without regex. Just my two cents.
text = 'I like blueberry icecream. He has a green car. She has blue car. I love dogs.'
print([x for x in text.split('.') if any(y in x for y in ['like', 'love'])])

You can use the regex below:
regex = /[^.]* (?:like|love) [^.]*\./g

Regex for matching a string with substrings

I'm searching for a regex solution to replace a substring given a dynamic pattern. The issue is that the substring might contain a known token and we don't know at which position this token occurs.
I can formulate the problem as: Replace (a given) pattern in string even if (known) token would conflict.
Let's assume we have my_string:
I like green and PLUS blue beans!
PLUS represents the known token we want to ignore in case it is hindering a match.
We also have a variable pattern called my_pattern which can be any part of my_string except PLUS such as:
1) green and blue
2) green and blue beans
3) I like green
We know PLUS may occur somewhere in my_string and we don't know the position. Theoretically, my_string could also be:
I PLUS like green and blue beans!
Since my_pattern can occur in form 1), 2), or 3), we also can't hardcode the solution using ORs.
The sought solution is something like:
my_string.replace(my_pattern, "red") with the output for my_pattern:
1) I like red beans!
2) I like red!
3) red and PLUS blue beans!
my_pattern shall match although the PLUS occurs in my_string (which might conflict with my_pattern).
It is something like: match my_pattern and ignore PLUS in case it is hindering a match.

You could modify your pattern so that a regex for your token is inserted between every pair of adjacent characters.
What you did not state explicitly is that the token also adds a space to the string, so the token regex should also allow for a space to its left and right.
import re
token = 'PLUS'
patterns = ['green and blue', 'green and blue beans', 'I like green']
ptn_pls = [f'( ?{token} ?)?'.join(p) for p in patterns]
Applied to three different strings:
my_string = 'I like green and PLUS blue beans!'
for p in ptn_pls:
    print(re.sub(p, 'red', my_string))
# I like red beans!
# I like red!
# red and PLUS blue beans!
my_string = 'I PLUS like green and blue beans!'
for p in ptn_pls:
    print(re.sub(p, 'red', my_string))
# I PLUS like red beans!
# I PLUS like red!
# red and blue beans!
my_string = 'I like grPLUSeen a PLUSnd blue beans!'
for p in ptn_pls:
    print(re.sub(p, 'red', my_string))
# I like red beans!
# I like red!
# red a PLUSnd blue beans!
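One caveat with joining the raw characters: any regex metacharacter in the pattern or the token (such as + or .) would be interpreted rather than matched literally. A hedged sketch that escapes both with re.escape; the helper name build_skipping_pattern is made up for illustration:

```python
import re


def build_skipping_pattern(pattern, token):
    """Allow an optional, space-padded token between every character of pattern."""
    # Non-capturing optional gap; re.escape makes the token match literally.
    gap = f"(?: ?{re.escape(token)} ?)?"
    return gap.join(re.escape(ch) for ch in pattern)


my_string = "I like green and PLUS blue beans!"
result = re.sub(build_skipping_pattern("green and blue", "PLUS"), "red", my_string)
print(result)  # I like red beans!
```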

If your token is a word with blanks around it, this crude function can work:
import re
def skip_token(s, pattern, token, sub):
    p = pattern.split()
    # One alternative for the plain pattern, plus one per position where the token may be inserted
    gex = "|".join([pattern] + [" ".join(p[:i] + [token] + p[i:]) for i in range(1, len(p))])
    return re.sub(gex, sub, s)
s = "I like green and PLUS blue beans!"
token = "PLUS"
sub = "red"
print(skip_token(s, "green and blue", token, sub))
print(skip_token(s, "green and blue beans", token, sub))
print(skip_token(s, "I like green", token, sub))
I like red beans!
I like red!
red and PLUS blue beans!
However, if your my_string contains punctuation and the token can appear literally anywhere, this will sometimes fail.

How can I get words after and before a specific token?

I am currently working on a project that simply builds basic corpus databases and tokenizes texts, but I am stuck on one point. Assume that we have the following:
import os, re
texts = []
for i in os.listdir(somedir): # Somedir contains text files which contain very large plain texts.
    # os.listdir returns bare file names, so join them with the directory
    with open(os.path.join(somedir, i), 'r') as f:
        texts.append(f.read())
Now I want to find the word before and after a token.
myToken = 'blue'
found = []
for i in texts:
    fnd = re.findall('[a-zA-Z0-9]+ %s [a-zA-Z0-9]+|\. %s [a-zA-Z0-9]+|[a-zA-Z0-9]+ %s\.' %(myToken, myToken, myToken), i, re.IGNORECASE|re.UNICODE)
    found.extend(fnd)
print(myToken)
for i in found:
    print('\t\t%s' % i)
I thought there would be three possibilities: the token might start a sentence, end a sentence, or appear somewhere inside it, so I used the regex above. When I run it, I get the following:
blue
My blue car # What I exactly want.
he blue jac # That's not what I want. That must be "the blue jacket."
eir blue phone # Wrong! > their
a blue ali # Wrong! > alien
. Blue is # Okay.
is blue. # Okay.
...
I also tried things like \b\w\b or \b\W\b, but those returned no results at all rather than wrong ones. I tried:
'\b\w\b%s\b[a-zA-Z0-9]+|\.\b%s\b\w\b|\b\w\b%s\.'
'\b\W\b%s\b[a-zA-Z0-9]+|\.\b%s\b\W\b|\b\W\b%s\.'
I hope the question is not too unclear.

I think what you want is:
(Optionally) a word and a space;
(Always) 'blue';
(Optionally) a space and a word.
Therefore one appropriate regex would be:
r'(?i)((?:\w+\s)?blue(?:\s\w+)?)'
For example:
>>> import re
>>> text = """My blue car
the blue jacket
their blue phone
a blue alien
End sentence. Blue is
is blue."""
>>> re.findall(r'(?i)((?:\w+\s)?{0}(?:\s\w+)?)'.format('blue'), text)
['My blue car', 'the blue jacket', 'their blue phone', 'a blue alien', 'Blue is', 'is blue']
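One possible refinement, sketched here under the assumption that the token should match only as a whole word and may contain regex metacharacters: wrap it in \b and pass it through re.escape. The function name neighbours and the sample sentence are made up for illustration:

```python
import re


def neighbours(token, text):
    """Find token with its optional preceding and following word."""
    pattern = re.compile(
        r"(?:\w+\s)?\b{}\b(?:\s\w+)?".format(re.escape(token)),
        re.IGNORECASE,
    )
    # No capturing groups, so findall returns the full matches.
    return pattern.findall(text)


# "bluebird" is skipped because of the word boundaries.
print(neighbours("blue", "My blue car. A bluebird sang. Blue is nice"))
# ['My blue car', 'Blue is']
```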

Let's say the token is test.
(?=^test\s+.*|.*?\s+test\s+.*?|.*?\s+test$).*
You can use a lookahead. It does not consume any characters, yet it still validates the match.
http://regex101.com/r/wK1nZ1/2

Regex can sometimes be slow (if not implemented correctly), and moreover the accepted answer did not work for me in several cases.
So I went for a brute-force solution (not saying it is the best one), where the keyword can consist of several words:
@staticmethod  # assumes the function lives inside a class; drop the decorator at module level
def find_neighbours(word, sentence):
    prepost_map = []
    if word not in sentence:
        return prepost_map
    split_sentence = sentence.split(word)
    for i in range(0, len(split_sentence) - 1):
        prefix = ""
        postfix = ""
        prefix_list = split_sentence[i].split()
        postfix_list = split_sentence[i + 1].split()
        if len(prefix_list) > 0:
            prefix = prefix_list[-1]
        if len(postfix_list) > 0:
            postfix = postfix_list[0]
        prepost_map.append([prefix, word, postfix])
    return prepost_map
An empty string before or after the keyword indicates that the keyword was the first or the last word in the sentence, respectively.

Python RegEx, match words in string and get count

I want to match a list of words against a string and find out how many of the words matched.
Now I have this:
import re
words = ["red", "blue"]
exactMatch = re.compile(r'\b%s\b' % '\\b|\\b'.join(words), flags=re.IGNORECASE)
print(exactMatch.search("my blue cat"))
print(exactMatch.search("my red car"))
print(exactMatch.search("my red and blue monkey"))
print(exactMatch.search("my yellow dog"))
My current regex matches the first 3, but I would like to find out how many of the words in the list match the string passed to search. Is this possible without making a new re.compile for each word in the list?
Or is there another way to achieve the same thing?
The reason I want to keep the number of re.compile to a minimum is speed, since in my application I have multiple word lists and about 3500 strings to search against.

If you use findall instead of search, you get a list containing all the matched words.
print(exactMatch.findall("my blue cat"))
print(exactMatch.findall("my red car"))
print(exactMatch.findall("my red and blue monkey"))
print(exactMatch.findall("my yellow dog"))
will result in
['blue']
['red']
['red', 'blue']
[]
If you need the number of matches, use len():
print(len(exactMatch.findall("my blue cat")))
print(len(exactMatch.findall("my red car")))
print(len(exactMatch.findall("my red and blue monkey")))
print(len(exactMatch.findall("my yellow dog")))
will result in
1
1
2
0
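Note that len() counts every occurrence, so a repeated word is counted more than once. If the goal is the number of distinct list words that occur, one option is to deduplicate the matches first. A minimal sketch; the name count_distinct is an assumption, and re.escape is added so that list words containing metacharacters would still match literally:

```python
import re

words = ["red", "blue"]
# One compiled pattern for the whole word list
exact_match = re.compile(
    r"\b(?:%s)\b" % "|".join(map(re.escape, words)), flags=re.IGNORECASE
)


def count_distinct(text):
    """Count how many different words from the list occur in text."""
    # Lowercase before deduplicating because the pattern ignores case.
    return len({m.lower() for m in exact_match.findall(text)})


print(count_distinct("my Blue blue cat"))        # 1
print(count_distinct("my red and blue monkey"))  # 2
print(count_distinct("my yellow dog"))           # 0
```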

If I understood the question correctly, you only want to know the number of matches of blue or red in a sentence.
>>> exactMatch = re.compile(r'%s' % '|'.join(words), flags=re.IGNORECASE)
>>> print(exactMatch.findall("my blue blue cat"))
['blue', 'blue']
>>> print(len(exactMatch.findall("my blue blue cat")))
2
You need more code if you want to count each color separately.

Why not store all the words in a hash (dict) and look up every word of the sentence while iterating with finditer?
words = { "red": 1 .... }
word = re.compile(r'\b(\w+)\b')
for i in word.finditer(sentence):
    if words.get(i.group(1)):
        ....
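A runnable sketch of this lookup idea, using a set instead of a dict and assuming case-insensitive matching is wanted (the function name matched_words is made up for illustration). The regex is compiled once, and each word costs one hash lookup regardless of how long the word list is:

```python
import re

words = {"red", "blue"}           # the word list, stored for fast lookup
word_re = re.compile(r"\b\w+\b")  # compiled once, matches every word


def matched_words(sentence):
    """Return the words of sentence that appear in the word set."""
    return [
        m.group(0)
        for m in word_re.finditer(sentence)
        if m.group(0).lower() in words
    ]


print(matched_words("my Red and blue monkey"))  # ['Red', 'blue']
print(len(matched_words("my yellow dog")))      # 0
```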

for w in words:
    if w in searchterm:
        print("found")
