Matching an optional '#' does not seem to be working properly - python

I'm attempting to get full words or hashtags from a string, but it seems I'm applying the optional-character quantifier ? incorrectly in my regex.
Here is my code:
print re.findall(r'(#)?\w*', text)
print re.findall(r'[#]?\w*', text)
Given the text 'this is a sentence talking about this, #this, #that, #etc',
this should return matches for 'this' and '#this'.
Yet it returns a list that contains empty strings as well as other unexpected items.
What is wrong with the regex?
EDIT:
I'm attempting to get whole spam words, and I seem to have jumbled myself...
s = 'spamword'
print re.findall(r'(#)?'+s, text)
I need to match the whole word, and not word parts...

You can use word boundary in your regex:
s = 'spamword'
re.findall(r'#?' + s + r'\b', text)
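Note that a trailing \b only guards the end of the word. A minimal sketch that guards both sides is shown here; the helper name find_spam and the sample string are made up for illustration, and re.escape is added on the assumption that a spam word may contain regex metacharacters:

```python
import re

def find_spam(word, text):
    # re.escape keeps characters such as '.' or '|' in the word literal.
    # (?<!\w) rejects a match that starts inside a longer word,
    # (?!\w) rejects a match that is followed by more word characters.
    pattern = r'(?<!\w)#?' + re.escape(word) + r'(?!\w)'
    return re.findall(pattern, text)

sample = 'spamword #spamword notspamword spamwords'
print(find_spam('spamword', sample))
```

Lookarounds are used instead of \b on the left because \b never matches between a space and '#', since neither is a word character.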

The answers above explain why. Here is one piece of code that should work:
>>> re.findall(r'#?\w+\b', text)
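To see where the empty strings in the question come from, here is a small sketch comparing the original patterns with the + version. The variable names loose, strict and grouped are only for illustration:

```python
import re

text = 'this is a sentence talking about this, #this, #that, #etc'

# '*' allows zero word characters, so the pattern also succeeds with an
# empty match at positions where no word starts (spaces, commas, the end).
loose = re.findall(r'#?\w*', text)

# '+' requires at least one word character, so the empty matches disappear.
strict = re.findall(r'#?\w+', text)

# With a capturing group, findall returns only the group's text:
# '#' when the hash was present and '' when it was not.
grouped = re.findall(r'(#)?\w+', text)

print(loose)
print(strict)
print(grouped)
```

So the two surprises are independent: \w* produces the empty matches, and the parentheses in (#)? change what findall reports.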

Related

Replacing method for words with boundaries in python (like with regex)

I am looking for a more robust replace method in Python because I am building a
spellchecker to correct words in an OCR context.
Let's say we have the following text in python:
text = """
this is a text, generated using optical character recognition.
this ls having a lot of errors because
the scanned pdf has too bad resolution.
Unfortunately, his text is very difficult to work with.
"""
It is easy to realize that instead of "his is a text" the right phrase would be "this is a text".
And if I do text.replace('his','this'), I replace every single 'his' with 'this', so I get errors like "tthis is a text".
When I do a replacement, I would like to replace only the whole word 'his', not the 'his' inside 'this'.
Why not trying this?
word_to_replace='his'
corrected_word = 'this'
corrected_text = re.sub(r'\b' + word_to_replace + r'\b', corrected_word, text)
corrected_text
Awesome, we did it, but the problem is... what if the word to correct contains a special character like '|'? For example,
'|ights are on' instead of 'lights are on'. Trust me, it happened to me; re.sub is a disaster in that case.
The question is, have you encountered the same problem? Is there any method to solve this? The replacement is the most
robust option.
I tried text.replace(' '+word_to_replace+' ',' '+corrected_word+' ') and this solves a lot of things, but I still
have the problem with phrases like "his is a text ", because the replacement doesn't work when 'his' is at the beginning of a sentence:
there is no ' his ' to replace with ' this '.
Is there any replacement method in Python that takes the whole word as input, like \bword_to_correct\b
in regex?
After a few days I solved the problem I had. I hope this is
helpful for someone else. Let me know if you have any questions.
text = """
this is a text, generated using optical character recognition.
this ls having a lot of errors because
the scanned pdf has too bad resolution.
Unfortunately, his text is very difficult to work with.
"""
# Assume you have already corrected the word with your OCR spellchecker
# and only have to replace it in the text.
# So we have the following word2correct and corrected_word (the word after spellchecking)
import re

word2correct = 'his'
corrected_word = 'this'

# Now we replace the word together with its context
def context_replace(old_word, new_word, text):
    # Match the word between \b boundaries using regex. This captures 'his' and its context but not 'this' and its context
    phrase2correct = re.findall('.{1,10}' + '\\b' + old_word + '\\b' + '.{1,10}', text)[0]
    # Once the context is matched, put in the new word
    phrase_corrected = phrase2correct.replace(old_word, new_word)
    # Now replace the old phrase (phrase2correct) with the new one (phrase_corrected)
    text = text.replace(phrase2correct, phrase_corrected)
    return text
Test if the function works...
print(context_replace(old_word=word2correct,new_word=corrected_word,text=text))
Output:
this is a text, generated using optical character recognition.
this ls having a lot of errors because
the scanned pdf has too bad resolution.
Unfortunately, this text is very difficult to work with.
It worked for my purpose. I hope this is helpful for someone else.
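For the '|ights' case from the question, a possible alternative is to escape the word and use lookarounds instead of \b. This is only a sketch: the function name replace_whole_word is hypothetical, and it assumes "whole word" means "not directly next to another word character":

```python
import re

def replace_whole_word(old_word, new_word, text):
    # re.escape makes characters like '|' literal instead of regex syntax.
    # (?<!\w) and (?!\w) behave like \b for ordinary words, but still work
    # when old_word starts or ends with a non-word character.
    pattern = r'(?<!\w)' + re.escape(old_word) + r'(?!\w)'
    # A function as replacement inserts new_word as-is, so backslashes
    # in new_word are not interpreted as group references.
    return re.sub(pattern, lambda match: new_word, text)

print(replace_whole_word('his', 'this', 'this is his text. his'))
print(replace_whole_word('|ights', 'lights', 'the |ights are on'))
```

Unlike the context-based approach, this replaces every whole-word occurrence in one pass and does not depend on the surrounding ten characters.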

Find and print the number of non-alphanumeric characters in python

I'm doing a project and I'm having issues getting re.findall to work properly. Here's the code so far (short and sweet):
pattern = ['^a-zA-Z0-9_']
results = re.findall(pattern, (str(lorem_ipsum))
print(len(results))
I'm getting a syntax error printing this way. Any help will be greatly appreciated. I'm crunched for time, and will be tweaking tomorrow when I have some more time.
You actually don't need any regex to do this. Simply use .isalnum():
text = "hello 23232#"
for character in text:
    if not character.isalnum():
        print("found: '{}'".format(character))
Output:
found: ' '
found: '#'
You're missing a closing parenthesis after lorem_ipsum. Also, pattern must be a string, not a list. We add the r in front to make it a raw string, so that backslashes are taken literally rather than needing to be escaped.
pattern = r'[^a-zA-Z0-9\_]'
results = re.findall(pattern, (str(lorem_ipsum)))
print(len(results))
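The two answers do not count exactly the same thing, which the following sketch illustrates. The function names are invented here, and the regex is the character class from the question without the unnecessary backslash:

```python
import re

def count_non_alnum_regex(s):
    # The class from the question: everything except ASCII letters,
    # digits and the underscore counts as "non-alphanumeric".
    return len(re.findall(r'[^a-zA-Z0-9_]', s))

def count_non_alnum_method(s):
    # str.isalnum() is False for the underscore, and in Python 3 it is
    # True for non-ASCII letters and digits as well.
    return sum(1 for ch in s if not ch.isalnum())

print(count_non_alnum_regex("hello 23232#"), count_non_alnum_method("hello 23232#"))
print(count_non_alnum_regex("snake_case!"), count_non_alnum_method("snake_case!"))
```

Which one is right depends on whether the underscore should count as alphanumeric for the project.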

Find all word with a #

I want to find all words which have a # attached to them.
I tried:
import re
text = "I was searching my #source to make a big desk yesterday."
re.findall(r'\b#\w+', text)
but it does not work...
Here's a small regex to do that:
>>> import re
>>> s = "I was searching my #source to make a big desk yesterday."
>>> re.findall(r"#(\w+)", s)
['source']
If you want to include the hashtag then use:
>>> re.findall(r"#\w+", s)
['#source']
You can use:
re.findall(r"#.+?\b", text)
which gives:
['#source']
Here is a link to regex101 which gives in-depth insight into what each part does.
Basically what is happening is:
the # matches the '#' character literally
the . matches any character
the + means one or more of them
the ? after the + makes that match non-greedy, so it takes as few characters as possible
the \b is a word boundary and marks where the match stops
Update
As pointed out by @AnthonySottile, there is a case where the above regex will fail, namely:
hello#fred
where a match is made when it shouldn't be.
To get around this problem, a \s could be added to the front of the regex to make sure the # comes after some whitespace, but this fails when the hashtag comes right at the start of the string. A \b won't suffice either: # is not a word character, so there is no word boundary between whitespace and the #.
So, to get around these, I came up with this rather ugly solution of adding a space to the start of the string before doing the findall:
re.findall(r"\s(#.+?)\b", " " + text)
It's not very neat, I know, but I could not find a cleaner way. I tried using an OR at the start to match a whitespace or the start of the string, as in (^|\s), but this adds another group, so re.findall returns tuples and requires some post-processing, which is even less neat.
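One alternative that avoids padding the string is a negative lookbehind, offered here as a sketch (the names HASHTAG and find_hashtags are made up for illustration). (?<!\S) means "not preceded by a non-whitespace character", which is true both after whitespace and at the very start of the string:

```python
import re

# (?<!\S) succeeds at the start of the string and after whitespace,
# so no capturing group and no padded space are needed.
HASHTAG = re.compile(r'(?<!\S)#\w+')

def find_hashtags(text):
    return HASHTAG.findall(text)

print(find_hashtags("I was searching my #source to make a big desk yesterday."))
print(find_hashtags("hello#fred"))
print(find_hashtags("#first word"))
```

Because the lookbehind consumes nothing and contains no group, findall returns the full hashtags directly.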
You do not need regex to solve this problem:
text = "I was searching my #source to make a big desk yesterday."
final_text = [i for i in text.split() if i.startswith('#')]
Output:
['#source']
However, this regex will work:
import re
text = "I was searching my #source to make a big desk yesterday."
final_text = re.findall(r'(?:^|(?<=\s))#\w+(?=\s|$)', text)
Output:
['#source']

Python regex: find words and emoticons

I want to find matches between a tweet and a list of strings containing words, phrases, and emoticons. Here is my code:
words = [':)','and i','sleeping','... :)','! <3','facebook']
regex = re.compile(r'\b%s\b|(:\(|:\))+' % '\\b|\\b'.join(words), flags=re.IGNORECASE)
I keep receiving this error:
error: unbalanced parenthesis
Apparently there is something wrong with the code and it cannot match emoticons. Any idea how to fix it?
I tried the below and it stopped throwing the error:
words = [r':\)','and i','sleeping',r'... :\)','! <3','facebook']
The re module has a function escape that takes care of correct escaping of words, so you could just use
words = map(re.escape, [':)','and i','sleeping','... :)','! <3','facebook'])
Note that word boundaries might not work as you expect when used with words that don't start or end with actual word characters.
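One way to handle that caveat, sketched below, is to add \b only on the sides of an entry that actually begin or end with a word character. The helper name bounded and the sample tweet are assumptions made for this example, not part of the question:

```python
import re

words = [':)', 'and i', 'sleeping', '... :)', '! <3', 'facebook']

def bounded(word):
    # Add \b only where the entry starts or ends with a word character;
    # around ':' or ')' a \b would demand an adjacent letter or digit.
    prefix = r'\b' if re.match(r'\w', word) else ''
    suffix = r'\b' if re.search(r'\w\Z', word) else ''
    return prefix + re.escape(word) + suffix

# Longer entries go first so '... :)' is tried before the shorter ':)'.
alternatives = [bounded(w) for w in sorted(words, key=len, reverse=True)]
regex = re.compile('|'.join(alternatives), flags=re.IGNORECASE)

tweet = 'Sleeping and I love Facebook! <3 ... :)'
print(regex.findall(tweet))
```

re.escape takes care of the parentheses, and the alternation contains no capturing groups, so findall returns the matched text itself.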
While words has all the necessary formatting, re treats ( and ) as special characters. You need to write \( or \) so that they are interpreted as the literal ASCII characters 40 and 41 rather than as grouping syntax. Since you didn't understand what @Nicarus was saying, you need to use this:
words = [r':\)','and i','sleeping',r'... :\)','! <3','facebook']
Note: I'm only spelling it out because this doesn't seem like a school assignment, for all the people who might want to criticize this. Also, look at the documentation prior to going to stack overflow. This explains everything.

Multiple punctuation stripping

I tried multiple solutions here, and although they strip some punctuation, they don't seem to work on multiple punctuation characters, e.g. "[ or ',
This code:
regex = re.compile('[%s]' % re.escape(string.punctuation))
for i in words:
    while regex.match(i):
        regex.sub('', i)
which I got from "Best way to strip punctuation from a string in Python", was good, but I still encounter problems with double punctuation.
I added a while loop in the hope of iterating over each word to remove multiple punctuation characters, but that does not seem to work: it just gets stuck on the first item "[ and never exits.
Am I just missing some obvious piece that I am oblivious to?
I solved the problem by adding a redundancy and double looping over my lists, but this takes an extremely long time (well into the minutes) due to fairly large sets.
I use Python 2.7
Your code doesn't work as intended because regex.match only matches at the beginning of the string.
Also, you did not do anything with the return value of regex.sub(). sub doesn't work in place, but you need to assign its result to something.
regex.search returns a match if the pattern is found anywhere in the string and works as expected:
import re
import string
words = ['a.bc,,', 'cdd,gf.f.d,fe']
regex = re.compile('[%s]' % re.escape(string.punctuation))
for i in words:
    while regex.search(i):
        i = regex.sub('', i)
    print(i)
Edit: As pointed out below by @senderle, the while clause isn't necessary and can be left out completely.
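The loop is unnecessary because sub already replaces every non-overlapping match in a single call. A short sketch of the loop-free version follows; the extra sample words and the name cleaned are added here for illustration:

```python
import re
import string

words = ['a.bc,,', 'cdd,gf.f.d,fe', '"[quoted', "',"]

# One character class containing every ASCII punctuation character.
regex = re.compile('[%s]' % re.escape(string.punctuation))

# sub() removes all matches in one pass, and its result must be kept,
# because strings are immutable and sub() never changes its argument.
cleaned = [regex.sub('', word) for word in words]
print(cleaned)
```

This also avoids the infinite loop from the question, where the result of sub was discarded and the while condition never changed.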
This will remove everything that is not alphanumeric or a space:
re.sub("[^a-zA-Z0-9 ]","",my_text)
>>> re.sub("[^a-zA-Z0-9 ]","","A [Black. Cat' On a Hot , tin roof!")
'A Black Cat On a Hot  tin roof'
Here is a simple way:
>>> print str.translate("My&& Dog's {{{%!@#%!@#$L&&&ove Sal*mon", None, '~`!@#$%^&*()_+=-[]\|}{;:/><,.?\"\'')
My Dogs Love Salmon
Using str.translate this way (the Python 2 form, with None as the table) eliminates the punctuation. I usually use this for eliminating numbers from DNA sequence reads.
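In Python 3 the same idea needs a translation table built with str.maketrans. A minimal sketch, where string.punctuation stands in for the hand-typed character list (an assumption; adjust the deleted characters as needed):

```python
import string

# The third argument of str.maketrans lists characters to delete.
table = str.maketrans('', '', string.punctuation)

raw = "My&& Dog's {{{%!@#%!@#$L&&&ove Sal*mon"
cleaned = raw.translate(table)
print(cleaned)
```

For the DNA use case mentioned above, string.digits could be passed as the third argument instead.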
