Python RegEx, match words in string and get count

I want to match a list of words against a string and count how many of the words are matched.
Now I have this:
import re
words = ["red", "blue"]
exactMatch = re.compile(r'\b%s\b' % '\\b|\\b'.join(words), flags=re.IGNORECASE)
print exactMatch.search("my blue cat")
print exactMatch.search("my red car")
print exactMatch.search("my red and blue monkey")
print exactMatch.search("my yellow dog")
My current regex matches the first three strings, but I would like to find out how many of the words in the list words match the string passed to search. Is this possible without calling re.compile for each word in the list?
Or is there another way to achieve the same thing?
The reason I want to keep the number of re.compile calls to a minimum is speed: my application has multiple word lists and about 3500 strings to search against.

If you use findall instead of search, you get a list containing all the matched words.
print exactMatch.findall("my blue cat")
print exactMatch.findall("my red car")
print exactMatch.findall("my red and blue monkey")
print exactMatch.findall("my yellow dog")
will result in
['blue']
['red']
['red', 'blue']
[]
If you need the number of matches, use len():
print len(exactMatch.findall("my blue cat"))
print len(exactMatch.findall("my red car"))
print len(exactMatch.findall("my red and blue monkey"))
print len(exactMatch.findall("my yellow dog"))
will result in
1
1
2
0
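Note that len() on the findall result counts every occurrence, so a word that appears twice is counted twice. If the goal is how many different words from the list occur in the string, one possible sketch is to collect the lowercased matches into a set. The helper name count_distinct_matches is made up for illustration, and re.escape plus the non-capturing group are additions that are not in the original pattern:

```python
import re

words = ["red", "blue"]
# One compiled pattern for the whole list; re.escape guards against
# regex metacharacters in the words (an addition to the original pattern).
exact_match = re.compile(
    r"\b(?:%s)\b" % "|".join(re.escape(w) for w in words),
    flags=re.IGNORECASE,
)


def count_distinct_matches(text):
    """Return how many different list words occur in text (hypothetical helper)."""
    return len({m.lower() for m in exact_match.findall(text)})


print(count_distinct_matches("my red and blue monkey"))
print(count_distinct_matches("my blue Blue cat"))
print(count_distinct_matches("my yellow dog"))
```

The pattern is still compiled only once per word list, so this should not add compile overhead when searching many strings.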

If I understood the question correctly, you only want to know how many times blue or red occurs in a sentence.
>>> exactMatch = re.compile(r'%s' % '|'.join(words), flags=re.IGNORECASE)
>>> print exactMatch.findall("my blue blue cat")
['blue', 'blue']
>>> print len(exactMatch.findall("my blue blue cat"))
2
You need more code if you want to test multiple colors

Why not store all the words in a hash (a dict) and look up every word of the sentence while iterating with finditer?
words = { "red": 1 .... }
word = re.compile(r'\b(\w+)\b')
for i in word.finditer(sentence):
    if words.get(i.group(1)):
        ....
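A runnable version of that idea might look like the sketch below. It assumes a small hypothetical word list, uses a set instead of a dict, and lowercases each token so the lookup ignores case. The function name matched_words is invented for this example:

```python
import re

# Hypothetical word list; a set gives constant-time membership tests.
words = {"red", "blue"}
token = re.compile(r"\b(\w+)\b")


def matched_words(sentence):
    """Return the set of list words found in sentence, ignoring case."""
    found = set()
    for m in token.finditer(sentence):
        candidate = m.group(1).lower()
        if candidate in words:
            found.add(candidate)
    return found


print(len(matched_words("my red and blue monkey")))
```

The token pattern does not depend on the word list, so a single compiled regex can be shared across all word lists.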

for w in words:
    if w in searchterm:
        print "found"

Related

How to group words of a string into different strings using pre-defined word groups in python?

I would like to convert a string which contains words like this: The Red Fox The Cat The Dog Is Blue, into 3 strings which would contain The Red Fox for the first one, The Cat for the second and The Dog Is Blue for the last one.
Put more simply, it should work like this:
# String0 = The Red Fox The Cat The Dog Is Blue
# The line above should transform to the lines below
# String1 = The Red Fox
# String2 = The Cat
# String3 = The Dog Is Blue
Note that the words forming the expressions may change (while still forming known expressions), so I was thinking of building a dictionary that would help recognize the words and define how they should be grouped, if that is possible.
I hope my question is clear and that someone has an answer.

You can use regex:
import re
string = "The Red Fox The Cat The Dog Is Blue"
# create a regex by joining your words using pipe (|)
pattern = "(The(\\s(Red|Fox|Cat|Dog|Is|Blue))+)"
print([x[0] for x in re.findall(pattern, string)]) # ['The Red Fox', 'The Cat', 'The Dog Is Blue']
In the above example, you can dynamically create your pattern from a list of words that you have.
EDIT: Dynamically constructing the pattern:
pattern = f"(The(\\s({'|'.join(list_of_words)}))+)"
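As a fuller sketch of the dynamic construction, assuming the leading word and the word list are known up front (leader and list_of_words below are illustrative values), the pattern can be built with re.escape and non-capturing groups so that findall returns the whole matches directly:

```python
import re

# Hypothetical inputs: the leading word and the words allowed to follow it.
leader = "The"
list_of_words = ["Red", "Fox", "Cat", "Dog", "Is", "Blue"]

alternatives = "|".join(re.escape(w) for w in list_of_words)
# Non-capturing groups make findall return whole matches instead of tuples.
pattern = re.compile(rf"\b{re.escape(leader)}(?:\s(?:{alternatives})\b)+")

groups = pattern.findall("The Red Fox The Cat The Dog Is Blue")
print(groups)
```

The word boundaries are an addition; they keep a listed word such as Is from matching the start of a longer word.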

This basic code gets you what you need:
def separate():
    string0 = "The Red Fox The Cat The Dog Is Blue"
    # Note: lower() means the words after "The" come back lowercased.
    sentences = ["The "+sentence.strip() for sentence in string0.lower().split("the") if sentence != ""]
    for sentence in sentences:
        print(sentence)

Regex to grab word before a certain character in python

I want to extract the word before a certain word in the Names column and add it as a new column called color.
If there is no color before the name, I want an empty string instead.
I've been trying to extract the word before the match. For example, I have the following table:
import pandas as pd
import re
data = ['red apple','green topaz','black grapes','white grapes']
df = pd.DataFrame(data, columns = ['Names'])
Names
red apple
green apple
black grapes
white grapes
normal apples
red apple
This is the code I tried, but I only get partial output:
df['Names'].apply(lambda x: ' '.join(re.findall(r'(\w+)\s+apple', x)))
df['Names'].apply(lambda x: ' '.join(re.findall(r'(\w+)\s+apple|grapes', x)))
Desired output:
Names          color
red apple      red
green apple    green
black grapes   black
white grapes   white
normal apples
red apple      red
Please help me with this issue.

I found this solution, which gives me a colors_column like ['red', 'green', 'black', 'white', '']:
import re
data = ['red apple','green topaz','black grapes','white grapes','apples']
colors_column = list(map(lambda x: ' '.join(re.findall(r'(\S\w+)\s+\w+', x)) ,data))

One solution is just to remove the fruit names to get the color:
def remove_fruit_name(description):
    return re.sub(r"apple|grapes", "", description)
df['Colors'] = df['Names'].apply(remove_fruit_name)
If you have many lines it may be faster to compile your regexp:
fruit_pattern = re.compile(r"apple|grapes")
def remove_fruit_name(description):
    return fruit_pattern.sub("", description)
Another solution is to use a lookahead assertion. It is (probably) a bit faster, but the code is a bit more complex:
# That may be useful to have a set of fruits:
valid_fruit_names = {"apple", "grapes"}
any_fruit_pattern = '|'.join(valid_fruit_names)
# Raw f-string so that \w and \s reach the regex engine unchanged.
fruit_pattern = re.compile(rf"(\w*)\s*(?={any_fruit_pattern})")
def remove_fruit_name(description):
    match = fruit_pattern.search(description)
    if match:
        return match.groups()[0]
    return description
df['Colors'] = df['Names'].apply(remove_fruit_name)
Here is an example of lookahead quoted from the documentation:
(?=...)
Matches if ... matches next, but doesn’t consume any of the string. This is called a lookahead assertion. For example, Isaac (?=Asimov) will match 'Isaac ' only if it’s followed by 'Asimov'.
Finally, if you want to make a difference between normal and green you'll need a dictionary of valid colors. Same goes for fruit names if you have non-fruit strings in your input, such as topaz.
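A minimal sketch of that last idea, using plain lists instead of a DataFrame to keep it self-contained. The two vocabularies and the function name extract_color are assumptions made for illustration:

```python
import re

# Assumed vocabularies for illustration; extend them to fit the real data.
valid_colors = ["red", "green", "black", "white"]
valid_fruits = ["apple", "grapes"]

color_before_fruit = re.compile(
    r"\b(%s)\s+(?=%s)" % ("|".join(valid_colors), "|".join(valid_fruits))
)


def extract_color(description):
    """Return the color preceding a known fruit, or '' if there is none."""
    match = color_before_fruit.search(description)
    return match.group(1) if match else ""


names = ["red apple", "green topaz", "black grapes", "normal apples"]
print([extract_color(n) for n in names])
```

With these vocabularies, green topaz and normal apples both yield an empty string, because topaz is not a listed fruit and normal is not a listed color. With pandas, the same function could be passed to df['Names'].apply.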

Not necessarily an elegant trick, but this seems to work:
((re.search(r'(\w*) (apple|grape)', a)) or ['', ''])[1]
Briefly, you search for the word before apple or grape. If there is no match, re.search returns None, which is falsy, so the or falls back to a list of empty strings. Since the wanted group of the match is at index 1, the fallback is a two-element list of empty strings, so that index 1 exists there too.

Matching a word to first 3 letters of any word in a list

I have a list of keywords (in CSV format) which all entries coming into my database should match. I am trying to write Python code that checks whether the first 3 or more letters of an entered word match those of any word in a list.
For example:
if my word is ora
the list of words:
orange
yellow
blue
green
purple
I want to assign the word ora to the key orange. Is there some way of doing this in Python?
Another example: if the word is orazzz, I still want it to detect that the first 3 letters match orange and assign it to that key.
I would like to put it into an if statement if possible.

You can handle this with a set.
word=set('orange')
db_entry=set('orngesdksd')
if len(word.intersection(db_entry))>=5:
    print(word.intersection(db_entry))
Output (the element order may vary):
{'n', 'e', 'o', 'g', 'r'}

Use a dictionary for lookups and a try/except to handle the not-found case.
keywords = ('orange yellow blue green purple'.split())
keys = dict((w[0:3], w) for w in keywords)
entry = 'orzazzz'
try:
    key = keys[entry[0:3]]
    print( 'Entered value {0} matches key {1}'.format(entry, key) )
except KeyError:
    print( 'Entered value {0} does not match any keyword.'.format(entry) )
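For the examples from the question (ora and orazzz), the same prefix dictionary can be wrapped in a small function and used in an if statement. This is only a sketch: match_keyword is a made-up name, and it assumes no two keywords share the same first three letters:

```python
keywords = "orange yellow blue green purple".split()
# Map each keyword's first three letters to the keyword itself.
# Assumes no two keywords share the same three-letter prefix.
keys = {w[:3]: w for w in keywords}


def match_keyword(entry):
    """Return the keyword whose first 3 letters start entry, or None."""
    return keys.get(entry[:3].lower())


for entry in ("ora", "orazzz", "or", "xyz"):
    key = match_keyword(entry)
    if key is not None:
        print("{0} -> {1}".format(entry, key))
    else:
        print("{0} -> no match".format(entry))
```

Entries shorter than three letters never match, since their slice cannot equal a three-letter key.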

How to print all the string after matching a string in python

a. I have a line as given below:
HELLO CMD-LINE: hello how are you -color blue how is life going -color red,green life is pretty -color orange,violet,red
b. I wanted to print the string after -color.
c. I tried the below reg exp method,
for i in range (len(tar_read_sp)):
    print tar_read_sp[i]
    wordy = re.findall(r'-color.(\w+)', tar_read_sp[i], re.M|re.I|re.U)
    # print "%s"%(wordy.group(0))
    if wordy:
        print "Matched"
        print "Full match: %s" % (wordy)
        print "Full match: %s" % (wordy[0])
        # wordy_ls = wordy.group(0).split('=')
        # print wordy_ls[1]
        # break
    else:
        print "Not Matched"
but it prints only the first word after each match, like
['blue', 'red', 'orange'].
d. How can I print all the words after each match, like
['blue', 'red', 'green', 'orange', 'violet'], and remove the repeated values?
Please share your comments and suggestions.

Agree with depperm: fix your indentation.
Using his regex suggestion and combining it with the necessary split, de-duping, and re-ordering the list:
wordy = re.findall(r'(?:-color.((?:\w+,?)+))', test_string, re.M|re.I|re.U)
wordy = list({new_word for word in wordy for new_word in word.split(',')})[::-1]
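That should give you a flattened, unique list like you asked for (at least I assume that's what you mean by "remove the repeating variable"). Note that a set is unordered, so the order of the resulting list is not guaranteed, and reversing it with [::-1] does not restore the original order.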

My personal preference would be to do something like this:
import re
tar_read_sp = "hello how are you -color blue how is life going -color red,green life is pretty -color orange,violet,red"
wordy = re.findall(r'-color.([^\s]+)', tar_read_sp, re.I)
big_list = []
for match in wordy:
    small_list = match.split(',')
    big_list.extend(small_list)
big_set = list(set(big_list))
print (big_set)
I find this approach a little easier to read and update down the road. The idea is to get all the color matches, build a big list of them, and then use set to dedupe it. The regex I'm using:
-color.([^\s]+)
will capture the 'small_list' of colors up to the next space.

I have a solution not using regex.
test_string = 'hello how are you -color blue how is life going -color red,green life is pretty -color orange,violet,red'
result = []
for colors in [after_color.split(' ')[1] for after_color in test_string.split('-color')[1:]]:
    result = result+colors.split(',')
print result
The result is:
['blue', 'red', 'green', 'orange', 'violet', 'red']
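The list above still contains red twice. To drop repeats while keeping first-seen order, one option is dict.fromkeys, sketched here under the assumption of Python 3.7 or later, where dicts keep insertion order. The variable names are illustrative:

```python
test_string = ("hello how are you -color blue how is life going "
               "-color red,green life is pretty -color orange,violet,red")

colors = []
for chunk in test_string.split("-color")[1:]:
    # The first whitespace-separated token after "-color" holds the list.
    colors.extend(chunk.split()[0].split(","))

# dict keys are unique and, from Python 3.7 on, keep insertion order.
unique_colors = list(dict.fromkeys(colors))
print(unique_colors)
```

Unlike list(set(...)), this keeps the colors in the order they first appear in the line.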

How can I get words after and before a specific token?

I am currently working on a project that simply creates basic corpus databases and tokenizes texts, but I am stuck on one point. Assume that we have the following:
import os, re
texts = []
for i in os.listdir(somedir): # Somedir contains text files which contain very large plain texts.
    # os.listdir returns bare file names, so join them with the directory.
    with open(os.path.join(somedir, i), 'r') as f:
        texts.append(f.read())
Now I want to find the word before and after a token.
myToken = 'blue'
found = []
for i in texts:
    fnd = re.findall('[a-zA-Z0-9]+ %s [a-zA-Z0-9]+|\. %s [a-zA-Z0-9]+|[a-zA-Z0-9]+ %s\.' %(myToken, myToken, myToken), i, re.IGNORECASE|re.UNICODE)
    found.extend(fnd)
print myToken
for i in found:
    print '\t\t%s' %(i)
I thought there would be three possibilities: the token might start a sentence, end a sentence, or appear somewhere in the middle of a sentence, so I used the regex above. When I run it, I get this:
blue
My blue car # What I exactly want.
he blue jac # That's not what I want. That must be "the blue jacket."
eir blue phone # Wrong! > their
a blue ali # Wrong! > alien
. Blue is # Okay.
is blue. # Okay.
...
I also tried \b\w\b and \b\W\b variants, but unfortunately those returned no results at all rather than wrong results. I tried:
'\b\w\b%s\b[a-zA-Z0-9]+|\.\b%s\b\w\b|\b\w\b%s\.'
'\b\W\b%s\b[a-zA-Z0-9]+|\.\b%s\b\W\b|\b\W\b%s\.'
I hope the question is not too vague.

I think what you want is:
(Optionally) a word and a space;
(Always) 'blue';
(Optionally) a space and a word.
Therefore one appropriate regex would be:
r'(?i)((?:\w+\s)?blue(?:\s\w+)?)'
For example:
>>> import re
>>> text = """My blue car
the blue jacket
their blue phone
a blue alien
End sentence. Blue is
is blue."""
>>> re.findall(r'(?i)((?:\w+\s)?{0}(?:\s\w+)?)'.format('blue'), text)
['My blue car', 'the blue jacket', 'their blue phone', 'a blue alien', 'Blue is', 'is blue']
See demo and token-by-token explanation here.
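If the token comes from user input, or should not match inside longer words such as blueberry, a variation along these lines may help. It is a sketch rather than part of the answer above: re.escape and the word-boundary anchors are additions, and neighbours is a made-up helper name:

```python
import re


def neighbours(token, text):
    """Return matches of token with an optional word on either side.

    Hypothetical helper: re.escape and the word-boundary anchors
    are additions to the pattern shown above.
    """
    pattern = re.compile(
        r"(?:\w+\s)?\b{0}\b(?:\s\w+)?".format(re.escape(token)),
        re.IGNORECASE,
    )
    return pattern.findall(text)


print(neighbours("blue", "My blue car. A blueberry pie. Sky is blue."))
```

Passing re.IGNORECASE as a flag has the same effect as the inline (?i) used above.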

Let's say the token is test.
(?=^test\s+.*|.*?\s+test\s+.*?|.*?\s+test$).*
You can use a lookahead. It does not consume anything, and it still validates the match.
http://regex101.com/r/wK1nZ1/2

Regex can sometimes be slow (if not implemented correctly), and moreover the accepted answer did not work for me in several cases.
So I went for a brute-force solution (not saying it is the best one), where the keyword can consist of several words:
#staticmethod
def find_neighbours(word, sentence):
    prepost_map = []
    if word not in sentence:
        return prepost_map
    split_sentence = sentence.split(word)
    for i in range(0, len(split_sentence) - 1):
        prefix = ""
        postfix = ""
        prefix_list = split_sentence[i].split()
        postfix_list = split_sentence[i + 1].split()
        if len(prefix_list) > 0:
            prefix = prefix_list[-1]
        if len(postfix_list) > 0:
            postfix = postfix_list[0]
        prepost_map.append([prefix, word, postfix])
    return prepost_map
Empty string before or after the keyword indicates that keyword was the first or the last word in the sentence, respectively.
