Find all word with a # - python

I want to find all words which have a # attached to it.
I tried:
import re
text = "I was searching my #source to make a big desk yesterday."
re.findall(r'\b#\w+', text)
but it does not work...

Here's a small regex to do that:
>>> import re
>>> s = "I was searching my #source to make a big desk yesterday."
>>> re.findall(r"#(\w+)", s)
['source']
If you want to include the hashtag then use:
>>> re.findall(r"#.\w+", s)
['#source']

You can use:
re.findall(r"#.+?\b", text)
which gives:
['#source']
Here is a link to regex101 which gives in-depth insight into what each part does.
Basically what is happening is:
the # means capture the '#' character literally
then we say to match any character with a .
but the + signifies to capture one or more of them
then the ? begins a non-greedy match to whatever follows
the \b is a word boundary and signifies when to stop the lookup
Update
As pointed out by #AnthonySottile, there is a case where the above regex will fail, namely:
hello#fred
where a match is made when it shouldn't be.
To get around this problem, a /s could be added to the front of the regex so as to make sure the # comes after some whitespace, but this fails in the case where the hashtag comes right at the start of the string. A /b also won't suffice as the # makes the hashtag not count as a word.
So, to get around these, I came up with this rather ugly solution of adding a space to the start of the string before doing the findall:
re.findall(r"\s(#.+?)\b", " " + text)
It's not very neat I know but there really isn't another way of doing it. I tried using an OR at the start to match a whitespace or the start of the string, as in (^|\s), but this will produce multiple groups (as tuples) in the list that is returned from re.findall so would requires some post-processing which is even less neat.

You do not need regex to solve this problem:
text = "I was searching my #source to make a big desk yesterday."
final_text = [i for i in text.split() if i.startswith('#')]
Output:
['#source']
However, this regex will work:
import re
text = "I was searching my #source to make a big desk yesterday."
final_text = filter(lambda x:x, re.findall('(?<=^)|(?<=\s)#\w+(?=\s)|(?=$)', text))
Output:
['#source']

Related

python regex get last words of a text up to a stop word with re module (not regex module)

I am after getting the last words of a text up to a stopword.
Imagine I have the text:
first_part = "This is a text that with the blue paper"
going from end back I would like to get "blue paper".
In order to do that I use the regex module
import regex as re
print(first_part)
result=re.search(r"(?r)(?<=(\s*\b(an|a|the|for)\b\s*))(?P<feature>.*?)(?=\s*)$",first_part)
print(result)
Regex explanation:
(?r) = reverse
(?<=(\s*\b(an|a|the|for)\b\s*)) =look behind any of the stop words with word boundary \b
(?P feature .?) = basically whatever .
$ = from the end of the string
This works just fine.
but I am using the module regex in order to be able to use "(?r)" meaning reverse.
Anyone knows if it would be possible to do this using re?
I need to implement this functionality with standard libraries functionalities.
If you add a greedy match in front and a lazy one in the back, you will just get the last words.. Not 100% sure this is what you want though.
>>> first_part = "This is a text that with the blue paper"
>>> m = re.match(r"(?:.*)(?:an|a|the|for)\W(.+?)$", first_part)
>>> m[1]
'blue paper'

Matching an optional '#' does not seem to be working properly

I'm attempting to get full words or hashtags from a string, it seems as though I'm applying the 'optional character' ? flag wrong in regex.
Here is my code:
print re.findall(r'(#)?\w*', text)
print re.findall(r'[#]?\w*', text)
Thus 'this is a sentence talking about this, #this, #that, #etc'
Should return matches for 'this' and '#this'
Yet it seems to be returning a list with empty strings as well as other random things.
What is wrong with the regex?
EDIT:
I'm attempting to get whole spam words, and I seem to have jumbled myself...
s = 'spamword'
print re.findall(r'(#)?'+s, text)
I need to match the whole word, and not word parts...
You can use word boundary in your regex:
s = 'spamword'
re.findall(r'#?' + s + r'\b', text)
The above answers really explains why,Here is one piece of code that should work.
>>>re.findall(r'#?\w+\b')

Multiple punctuation stripping

I tried multiple solutions here, and although they strip some code, they dont seem to work on multiple punctuations ex. "[ or ',
This code:
regex = re.compile('[%s]' % re.escape(string.punctuation))
for i in words:
while regex.match(i):
regex.sub('', i)
I got from:
Best way to strip punctuation from a string in Python was good but i still encounter problems with double punctuations.
I added While loop in hope to ittirate over each word to remove multiple punctuations but that does not seem to work it just gets stuck on the first item "[ and does not exit it
Am I just missing some obvious piece that I am just being oblivious too?
I solved the problem by adding a redundancy and double looping my lists, this takes extremely long time (well into the minutes) due to fairly large sets
I use Python 2.7
Your code doesn't work because regex.match needs the beginning of the string or complete string to match.
Also, you did not do anything with the return value of regex.sub(). sub doesn't work in place, but you need to assign its result to something.
regex.search returns a match if the pattern is found anywhere in the string and works as expected:
import re
import string
words = ['a.bc,,', 'cdd,gf.f.d,fe']
regex = re.compile('[%s]' % re.escape(string.punctuation))
for i in words:
while regex.search(i):
i = regex.sub('', i)
print i
Edit: As pointed out below by #senderle, the while clause isn't necessary and can be left out completely.
this will replace everything not alphanumeric ...
re.sub("[^a-zA-Z0-9 ]","",my_text)
>>> re.sub("[^a-zA-Z0-9 ]","","A [Black. Cat' On a Hot , tin roof!")
'A Black Cat On a Hot tin roof'
Here is a simple way:
>>> print str.translate("My&& Dog's {{{%!##%!##$L&&&ove Sal*mon", None,'~`!##$%^&*()_+=-[]\|}{;:/><,.?\"\'')
>>> My Dogs Love Salmon
Using this str.translate function will eliminate the punctuation. I usually use this for eliminating numbers from DNA sequence reads.

Python split by regular expression

In Python, I am extracting emails from a string like so:
split = re.split(" ", string)
emails = []
pattern = re.compile("^[a-zA-Z0-9_\.-]+#[a-zA-Z0-9-]+.[a-zA-Z0-9-\.]+$");
for bit in split:
result = pattern.match(bit)
if(result != None):
emails.append(bit)
And this works, as long as there is a space in between the emails. But this might not always be the case. For example:
Hello, foo#foo.com
would return:
foo#foo.com
but, take the following string:
I know my best friend mailto:foo#foo.com!
This would return null. So the question is: how can I make it so that a regex is the delimiter to split? I would want to get
foo#foo.com
in all cases, regardless of punctuation next to it. Is this possible in Python?
By "splitting by regex" I mean that if the program encounters the pattern in a string, it will extract that part and put it into a list.
I'd say you're looking for re.findall:
>>> email_reg = re.compile(r'[a-zA-Z0-9_.-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+')
>>> email_reg.findall('I know my best friend mailto:foo#foo.com!')
['foo#foo.com']
Notice that findall can handle more than one email address:
>>> email_reg.findall('Text text foo#foo.com, text text, baz#baz.com!')
['foo#foo.com', 'baz#baz.com']
Use re.search or re.findall.
You also need to escape your expression properly (. needs to be escaped outside of character classes, not inside) and remove/replace the anchors ^ and $ (for example with \b), eg:
r"\b[a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+\b"
The problem I see in your regex is your use of ^ which matches the start of a string and $ which matches the end of your string. If you remove it and then run it with your sample test case it will work
>>> re.findall("[A-Za-z0-9\._-]+#[A-Za-z0-9-]+.[A-Za-z0-9-\.]+","I know my best friend mailto:foo#foo.com!")
['foo#foo.com']
>>> re.findall("[A-Za-z0-9\._-]+#[A-Za-z0-9-]+.[A-Za-z0-9-\.]+","Hello, foo#foo.com")
['foo#foo.com']
>>>

extract a sentence using python

I would like to extract the exact sentence if a particular word is present in that sentence. Could anyone let me know how to do it with python. I used concordance() but it only prints lines where the word matches.
Just a quick reminder: Sentence breaking is actually a pretty complex thing, there's exceptions to the period rule, such as "Mr." or "Dr." There's also a variety of sentence ending punctuation marks. But there's also exceptions to the exception (if the next word is Capitalized and is not a proper noun, then Dr. can end a sentence, for example).
If you're interested in this (it's a natural language processing topic) you could check out:
the natural language tool kit's (nltk) punkt module.
If you have each sentence in a string you can use find() on your word and if found return the sentence. Otherwise you could use a regex, something like this
pattern = "\.?(?P<sentence>.*?good.*?)\."
match = re.search(pattern, yourwholetext)
if match != None:
sentence = match.group("sentence")
I havent tested this but something along those lines.
My test:
import re
text = "muffins are good, cookies are bad. sauce is awesome, veggies too. fmooo mfasss, fdssaaaa."
pattern = "\.?(?P<sentence>.*?good.*?)\."
match = re.search(pattern, text)
if match != None:
print match.group("sentence")
dutt did a good job answering this. just wanted to add a couple things
import re
text = "go directly to jail. do not cross go. do not collect $200."
pattern = "\.(?P<sentence>.*?(go).*?)\."
match = re.search(pattern, text)
if match != None:
sentence = match.group("sentence")
obviously, you'll need to import the regex library (import re) before you begin. here is a teardown of what the regular expression actually does (more info can be found at the Python re library page)
\. # looks for a period preceding sentence.
(?P<sentence>...) # sets the regex captured to variable "sentence".
.*? # selects all text (non-greedy) until the word "go".
again, the link to the library ref page is key.

Categories

Resources