Python: match a single word (with spaces) - python

The problem is that I am trying to match a word (spaces on either side) if it exists.
The code I have working (at least mostly) is:
import re, os
str1 = "the host offered $ rec*ting advice"
# replace the wildcard markers with a lazy "match anything" group
str1 = re.sub('[*]', '(.*?)', str1)
str1 = re.sub('[$]', '(.*?)', str1)
str1 = str1.lower()
print(str1)
previous_dir = os.getcwd()
os.chdir('testfilefolder')
for filename in os.listdir('.'):
    with open(filename) as f:
        file_contents = f.read().lower()
    output = re.search(str1, file_contents)
    if output:
        print(" Match found in " + filename)
So, for example, if I have the string "the host has offered some recruiting advice" and search for "the host offered some $ rec*ting advice", it does not work, because of the dollar sign (which is replaced by (.*?)). The interesting thing is that "the host offered $ rec*ting advice" (note that "some" is gone) does work, so I can match one word if it exists. It looks like (.*?) is supposed to match one character, and every word has at least one character in it, so I suppose that is why it works. I am not sure whether (.*?) is even the right thing to use, but it is the best I have gotten working so far after my research. Any advice on that would be much appreciated. (Note: where I wrote (.*?) above, the site seems to treat it as some sort of tag and may format the text between the (.*?)'s.)
I however want to match 0 or 1 word. I had found something before similar to \bs+\b (I can't quite remember and I can't find it again), but couldn't get it to work anyways. I know that \b is supposed to match an empty string on either side of the possible existence of a word.
I apologize if this question has been asked elsewhere, but everything I have found (that I can still find and was able to get working) looks for a particular word. I, however, want to check whether zero or one word exists:
How do I match a word in a text file using python?

Your question is very hard to understand, so this is probably not exactly what you are looking for, but it may point you in the right direction.
If you want to find all words in the text this is how it could be done:
import re
str1 = "the host offered $ rec*ting advice"
re.findall(r'\b\S+\b',str1)
This will produce:
['the', 'host', 'offered', 'rec*ting', 'advice']
The \b-thing in the pattern is not actually matching a character, but a place in the string where a word starts or ends (see http://docs.python.org/2/library/re for more info on this).
The dollar sign is not considered a word, since it is not a word character according to the definition that \b uses.
If you want to get the first word in a string, provided there is one, you could use:
re.findall(r'\b\S+\b',str1)[:1]
You will then get a list of zero or one element!
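For the original goal (a placeholder that stands for zero or one word), one possible approach is to translate the search template into a regex piece by piece. The sketch below is only an illustration under assumed rules: $ is taken to mean "an optional whole word", * is taken to mean "any run of non-space characters inside a word", and template_to_regex is a hypothetical helper name that does not come from the question.

```python
import re

def template_to_regex(template):
    """Hypothetical helper: build a regex from a search template.

    Assumed rules: '$' = zero or one whole word,
    '*' = any run of non-space characters inside a word.
    """
    pieces = []
    for token in template.lower().split():
        if token == "$":
            # an optional word together with the whitespace that follows it
            pieces.append(r"(?:\S+\s+)?")
        else:
            # escape the literal parts, join them with \S* where '*' was
            word = r"\S*".join(re.escape(part) for part in token.split("*"))
            pieces.append(word + r"\s+")
    pattern = "".join(pieces)
    if pattern.endswith(r"\s+"):
        # no whitespace is required after the final word
        pattern = pattern[:-3]
    # (?<!\S) and (?!\S) keep the match aligned to whole words
    return re.compile(r"(?<!\S)" + pattern + r"(?!\S)")

regex = template_to_regex("the host offered $ rec*ting advice")
print(bool(regex.search("the host offered some recruiting advice")))       # True
print(bool(regex.search("the host offered recruiting advice")))            # True
print(bool(regex.search("the host offered some more recruiting advice")))  # False
```

The file contents would still need to be lower-cased before searching, as in the question's loop.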

Related

Find all word with a #

I want to find all words which have a # attached to it.
I tried:
import re
text = "I was searching my #source to make a big desk yesterday."
re.findall(r'\b#\w+', text)
but it does not work...
Here's a small regex to do that:
>>> import re
>>> s = "I was searching my #source to make a big desk yesterday."
>>> re.findall(r"#(\w+)", s)
['source']
If you want to include the hashtag then use:
>>> re.findall(r"#\w+", s)
['#source']
You can use:
re.findall(r"#.+?\b", text)
which gives:
['#source']
Here is a link to regex101 which gives in-depth insight into what each part does.
Basically what is happening is:
the # matches the '#' character literally
then the . matches any character
the + says to match one or more of them
the ? after the + makes that match non-greedy, so it stops as early as possible
the \b is a word boundary and marks where the match ends
Update
As pointed out by @AnthonySottile, there is a case where the above regex will fail, namely:
hello#fred
where a match is made when it shouldn't be.
To get around this problem, a \s could be added to the front of the regex to make sure the # comes after some whitespace, but this fails when the hashtag comes right at the start of the string. A \b also won't suffice, because # is not a word character, so there is no word boundary between whitespace and the #.
So, to get around these, I came up with this rather ugly solution of adding a space to the start of the string before doing the findall:
re.findall(r"\s(#.+?)\b", " " + text)
It's not very neat, I know, but I couldn't find a neater way of doing it. I tried using an OR at the start to match either whitespace or the start of the string, as in (^|\s), but this produces multiple groups (as tuples) in the list returned by re.findall, so it would require some post-processing, which is even less neat.
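As a possible alternative to padding the string, a negative lookbehind can require that the # is not preceded by a non-whitespace character. That condition holds both after whitespace and at the very start of the string, and it creates no extra group. This is a sketch rather than the answer's own method, and find_hashtags is a hypothetical name used only for illustration.

```python
import re

def find_hashtags(text):
    # (?<!\S): the character before '#' must not be non-whitespace,
    # i.e. '#' follows whitespace or sits at the start of the string
    return re.findall(r"(?<!\S)#\w+", text)

print(find_hashtags("I was searching my #source to make a big desk yesterday."))
# ['#source']
print(find_hashtags("#first word, hello#fred"))
# ['#first']
```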
You do not need regex to solve this problem:
text = "I was searching my #source to make a big desk yesterday."
final_text = [i for i in text.split() if i.startswith('#')]
Output:
['#source']
However, this regex will work:
import re
text = "I was searching my #source to make a big desk yesterday."
final_text = list(filter(None, re.findall(r'(?<=^)|(?<=\s)#\w+(?=\s)|(?=$)', text)))  # filter drops the empty matches
Output:
['#source']

Matching an optional '#' does not seem to be working properly

I'm attempting to get full words or hashtags from a string, it seems as though I'm applying the 'optional character' ? flag wrong in regex.
Here is my code:
print re.findall(r'(#)?\w*', text)
print re.findall(r'[#]?\w*', text)
Thus 'this is a sentence talking about this, #this, #that, #etc'
should return matches for 'this' and '#this'.
Yet it seems to be returning a list with empty strings as well as other random things.
What is wrong with the regex?
EDIT:
I'm attempting to get whole spam words, and I seem to have jumbled myself...
s = 'spamword'
print re.findall(r'(#)?'+s, text)
I need to match the whole word, and not word parts...
You can use word boundary in your regex:
s = 'spamword'
re.findall(r'#?' + s + r'\b', text)
The answers above explain why. Here is one piece of code that should work:
>>> re.findall(r'#?\w+\b', text)
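The empty strings come from \w*, which is allowed to match nothing, so the pattern succeeds at every position, including between words. A small sketch of the difference, using the sentence from the question (the variable names, and the extra leading (?<!\w) check, are only for illustration):

```python
import re

text = "this is a sentence talking about this, #this, #that, #etc"

# \w* may match zero characters, so empty matches show up in the result
loose = re.findall(r"#?\w*", text)
print("" in loose)  # True

# \w+ needs at least one word character, so only real tokens are returned
tokens = re.findall(r"#?\w+", text)
print(tokens)

# whole-word search for one term, with or without a leading '#'
s = "this"
whole = re.findall(r"(?<!\w)#?" + re.escape(s) + r"\b", text)
print(whole)  # ['this', 'this', '#this']
```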

Python multiple repeat Error

I'm trying to determine whether a term appears in a string.
Before and after the term must appear a space, and a standard suffix is also allowed.
Example:
term: google
string: "I love google!!! "
result: found
term: dog
string: "I love dogs "
result: found
I'm trying the following code:
regexPart1 = "\s"
regexPart2 = "(?:s|'s|!+|,|.|;|:|\(|\)|\"|\?+)?\s"
p = re.compile(regexPart1 + term + regexPart2 , re.IGNORECASE)
and get the error:
raise error("multiple repeat")
sre_constants.error: multiple repeat
Update
Real code that fails:
term = 'lg incite" OR author:"http++www.dealitem.com" OR "for sale'
regexPart1 = r"\s"
regexPart2 = r"(?:s|'s|!+|,|.|;|:|\(|\)|\"|\?+)?\s"
p = re.compile(regexPart1 + term + regexPart2 , re.IGNORECASE)
On the other hand, the following term passes smoothly (+ instead of ++)
term = 'lg incite" OR author:"http+www.dealitem.com" OR "for sale'
The problem is that, in a non-raw string, \" is ".
You get lucky with all of your other unescaped backslashes—\s is the same as \\s, not s; \( is the same as \\(, not (, and so on. But you should never rely on getting lucky, or assuming that you know the whole list of Python escape sequences by heart.
Either print out your string and escape the backslashes that get lost (bad), escape all of your backslashes (OK), or just use raw strings in the first place (best).
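A quick way to see the difference is to compare what the two kinds of literal actually contain. This is just a sketch; the variable names are arbitrary.

```python
# In a normal string literal, \" is an escape sequence for a bare quote,
# so the backslash never reaches the regex engine.
normal = "\""
# In a raw string literal the backslash is kept as a real character.
raw = r"\""

print(len(normal), len(raw))  # 1 2
print(normal == '"')          # True
print(raw == '\\"')           # True
```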
That being said, your regexp as posted won't match some expressions that it should, but it will never raise that "multiple repeat" error. Clearly, your actual code is different from the code you've shown us, and it's impossible to debug code we can't see.
Now that you've shown a real reproducible test case, that's a separate problem.
You're searching for terms that may have special regexp characters in them, like this:
term = 'lg incite" OR author:"http++www.dealitem.com" OR "for sale'
That p++ in the middle of a regexp means "1 or more of 1 or more of the letter p" (in other words, the same as "1 or more of the letter p") in some regexp languages, "always fail" in others, and "raise an exception" in others. Python's re falls into the last group (at least before Python 3.11, which added possessive quantifiers such as ++). In fact, you can test this in isolation:
>>> re.compile('p++')
error: multiple repeat
If you want to put random strings into a regexp, you need to call re.escape on them.
One more problem (thanks to Ωmega):
. in a regexp means "any character". So ,|.|;|: (I've just extracted a short fragment of your longer alternation chain) means "a comma, or any character, or a semicolon, or a colon"… which is the same as "any character". You probably wanted to escape the dot.
Putting all three fixes together:
term = 'lg incite" OR author:"http++www.dealitem.com" OR "for sale'
regexPart1 = r"\s"
regexPart2 = r"(?:s|'s|!+|,|\.|;|:|\(|\)|\"|\?+)?\s"
p = re.compile(regexPart1 + re.escape(term) + regexPart2 , re.IGNORECASE)
As Ωmega also pointed out in a comment, you don't need to use a chain of alternations if they're all one character long; a character class will do just as well, more concisely and more readably.
And I'm sure there are other ways this could be improved.
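The character-class suggestion could look roughly like the following. It is a sketch rather than a drop-in replacement: term_pattern is a hypothetical helper, and the suffix list simply mirrors the one in the question.

```python
import re

def term_pattern(term):
    # multi-character suffixes stay as alternatives; the single-character
    # punctuation marks are folded into one character class
    suffix = r"(?:'s|s|!+|\?+|[,.;:()\"])?"
    # re.escape makes characters such as '+' or '.' in the term literal
    return re.compile(r"\s" + re.escape(term) + suffix + r"\s", re.IGNORECASE)

print(bool(term_pattern("google").search("I love google!!! ")))  # True
print(bool(term_pattern("dog").search("I love dogs ")))          # True
print(bool(term_pattern("dog").search("I love dogma ")))         # False

term = 'lg incite" OR author:"http++www.dealitem.com" OR "for sale'
print(bool(term_pattern(term).search(" " + term + " ")))         # True
```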
The other answer is great, but I would like to point out that using regular expressions to find strings in other strings is not the best way to go about it. In Python, simply write:
if term in string:
    pass  # do whatever
Also make sure that your arguments are in the correct order!
I was trying to run a regular expression on some html code. I kept getting the multiple repeat error, even with very simple patterns of just a few letters.
Turns out I had the pattern and the html mixed up. I tried re.findall(html, pattern) instead of re.findall(pattern, html).
I had the string "i love you c++" and got the "multiple repeat" error when I used a term from it in a regex. The error occurs because the term contains "++", which the regex engine treats as special characters (quantifiers). My fix was to pass the term through re.escape. Here is my code:
word_filter = "c++"
word_en = "i love you c++"
# (?<!\w) and (?!\w) are used instead of \b, because \b does not match between "+" and a space or the end of the string
regex_word = re.search(rf'(?<!\w){re.escape(word_filter)}(?!\w)', word_en)
A general solution to "multiple repeat" is using re.escape to match the literal pattern.
Example:
>>> re.compile(re.escape("c++"))
re.compile('c\\+\\+')
However, if you want to match a literal word with a space before and after it, try this example:
>>> re.findall(rf"\s{re.escape('c++')}\s", "i love c++ you c++")
[' c++ ']

python: replacing exact with minus sign

Given the following string:
"-local locally local test local."
my objective is to replace the string "local" with "we" such that the result becomes
"-local locally we test local."
so far (with the help from the guys here at stackoverflow: Python: find exact match) I have been able to come up with the following regular expression:
variable='local'
re.sub(r'\b%s([\b\s])' %variable, r'we\1', "-local locally local test local.")
However I have two problems with this code:
The search goes through the minus sign and the output becomes:
'-we locally we test local.'
where it should have been
'-local locally we test local.'
searching for a string starting with a minus sign such as "-local" fails the search
Try the following:
re.sub(r'(^|\s)%s($|\s)' % re.escape(variable), r'\1we\2', some_string)
The regex that was suggested in the other question is kind of odd, since \b in a character class means a backspace character.
Basically what you have now is a regex that searches for your target string with a word boundary at the beginning (going from a word character to a non-word character or vice versa), and a whitespace character at the end.
Since you don't want to match the final "local" since it is followed by a period, I don't think that word boundaries are the way to go here, instead you should look for whitespace or beginning/end of string, which is what the above regex does.
I also used re.escape on the variable, so that if your target string includes characters like . or $ that usually have special meanings, they will be escaped and interpreted as literal characters.
Examples:
>>> s = "-local locally local test local."
>>> variable = 'local'
>>> re.sub(r'(^|\s)%s($|\s)' % re.escape(variable), r'\1we\2', s)
'-local locally we test local.'
>>> variable = '-local'
>>> re.sub(r'(^|\s)%s($|\s)' % re.escape(variable), r'\1we\2', s)
'we locally local test local.'
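One thing to be aware of: because (^|\s) and ($|\s) consume the surrounding whitespace, two adjacent occurrences that share a single space (as in "local local") are not both replaced. If that matters, lookarounds are one possible variation, since they check the neighbouring characters without consuming them. A sketch, with replace_word as a hypothetical helper name:

```python
import re

def replace_word(text, target, replacement):
    # (?<!\S) / (?!\S): no non-whitespace character directly before / after
    pattern = r"(?<!\S)" + re.escape(target) + r"(?!\S)"
    # a function as replacement avoids backslash processing in `replacement`
    return re.sub(pattern, lambda m: replacement, text)

s = "-local locally local test local."
print(replace_word(s, "local", "we"))   # -local locally we test local.
print(replace_word(s, "-local", "we"))  # we locally local test local.
print(replace_word("local local local", "local", "we"))  # we we we
```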
You could separate the string into substrings using the spaces as the delimiter. Then check each string, replace if it matches what you are looking for, and recombine them.
Certainly not efficient though :)
sed 's/ local / we /g' filename
I don't do python, but the idea is just put a space before and after local in the pattern to find, and also include spaces in the replacement.
If you just want to replace all occurrences of the word that are separated by spaces, you could split the string and operate on the resulting list:
search = "local"
replace = "we"
s = "-local locally local test local."
result = ' '.join([x if not x == search else replace for x in s.split(" ")])

extract a sentence using python

I would like to extract the exact sentence if a particular word is present in that sentence. Could anyone let me know how to do it with python. I used concordance() but it only prints lines where the word matches.
Just a quick reminder: sentence breaking is actually a pretty complex thing. There are exceptions to the period rule, such as "Mr." or "Dr.", and there is also a variety of sentence-ending punctuation marks. But there are also exceptions to the exceptions (if the next word is capitalized and is not a proper noun, then "Dr." can end a sentence, for example).
If you're interested in this (it's a natural language processing topic) you could check out:
the natural language tool kit's (nltk) punkt module.
If you have each sentence in a string you can use find() on your word and if found return the sentence. Otherwise you could use a regex, something like this
pattern = r"\.?(?P<sentence>.*?good.*?)\."
match = re.search(pattern, yourwholetext)
if match is not None:
    sentence = match.group("sentence")
I haven't tested this, but something along those lines should work.
My test:
import re
text = "muffins are good, cookies are bad. sauce is awesome, veggies too. fmooo mfasss, fdssaaaa."
pattern = r"\.?(?P<sentence>.*?good.*?)\."
match = re.search(pattern, text)
if match is not None:
    print(match.group("sentence"))
dutt did a good job answering this. I just wanted to add a couple of things:
import re
text = "go directly to jail. do not cross go. do not collect $200."
pattern = r"\.(?P<sentence>.*?(go).*?)\."
match = re.search(pattern, text)
if match is not None:
    sentence = match.group("sentence")
Obviously, you'll need to import the regex library (import re) before you begin. Here is a teardown of what the regular expression actually does (more info can be found at the Python re library page):
\. # looks for a period preceding sentence.
(?P<sentence>...) # sets the regex captured to variable "sentence".
.*? # selects all text (non-greedy) until the word "go".
again, the link to the library ref page is key.
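Putting the regex idea into a reusable form, a very naive sketch might split the text on sentence-ending punctuation and keep the sentences that contain the word. It deliberately ignores the abbreviation problems mentioned earlier (for those, something like nltk's punkt is the better tool), it assumes the word starts and ends with a word character so that \b applies, and sentences_containing is a hypothetical helper name.

```python
import re

def sentences_containing(text, word):
    # naive split: a sentence ends at '.', '!' or '?' followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # \b keeps "go" from matching inside "good"
    word_re = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return [s for s in sentences if word_re.search(s)]

text = "go directly to jail. do not cross go. do not collect $200."
print(sentences_containing(text, "go"))
# ['go directly to jail.', 'do not cross go.']
```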
