There are so many guides on how to tokenize a sentence, but I didn't find any on how to do the opposite.
import nltk
words = nltk.word_tokenize("I've found a medicine for my disease.")
The result I get is: ['I', "'ve", 'found', 'a', 'medicine', 'for', 'my', 'disease', '.']
Is there any function that reverts the tokenized sentence to its original state? The function tokenize.untokenize() for some reason doesn't work.
Edit:
I know that I can, for example, do the following, and it probably solves the problem, but I am curious whether there is an integrated function for this:
result = ' '.join(sentence).replace(' , ',',').replace(' .','.').replace(' !','!')
result = result.replace(' ?','?').replace(' : ',': ').replace(' \'', '\'')
You can use "treebank detokenizer" - TreebankWordDetokenizer:
from nltk.tokenize.treebank import TreebankWordDetokenizer
TreebankWordDetokenizer().detokenize(['the', 'quick', 'brown'])
# 'the quick brown'
There is also MosesDetokenizer, which used to be in nltk but was removed because of licensing issues; it is now available in the standalone Sacremoses package.
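A minimal sketch of the Sacremoses route (assuming the package is installed, e.g. via pip install sacremoses; treat the output as an expectation based on the Moses English rules):
from sacremoses import MosesDetokenizer

md = MosesDetokenizer(lang='en')
md.detokenize(['I', "'ve", 'found', 'a', 'medicine', 'for', 'my', 'disease', '.'])
# expected, roughly: "I've found a medicine for my disease."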
To reverse word_tokenize from nltk, I suggest looking at http://www.nltk.org/_modules/nltk/tokenize/punkt.html#PunktLanguageVars.word_tokenize and doing some reverse engineering.
Short of doing crazy hacks on nltk, you can try this:
>>> import nltk
>>> import string
>>> nltk.word_tokenize("I've found a medicine for my disease.")
['I', "'ve", 'found', 'a', 'medicine', 'for', 'my', 'disease', '.']
>>> tokens = nltk.word_tokenize("I've found a medicine for my disease.")
>>> "".join([" "+i if not i.startswith("'") and i not in string.punctuation else i for i in tokens]).strip()
"I've found a medicine for my disease."
Use token_utils.untokenize from here:
import re
def untokenize(words):
    """
    Untokenizing a text undoes the tokenizing operation, restoring
    punctuation and spaces to the places that people expect them to be.
    Ideally, `untokenize(tokenize(text))` should be identical to `text`,
    except for line breaks.
    """
    text = ' '.join(words)
    step1 = text.replace("`` ", '"').replace(" ''", '"').replace('. . .', '...')
    step2 = step1.replace(" ( ", " (").replace(" ) ", ") ")
    step3 = re.sub(r' ([.,:;?!%]+)([ \'"`])', r"\1\2", step2)
    step4 = re.sub(r' ([.,:;?!%]+)$', r"\1", step3)
    step5 = step4.replace(" '", "'").replace(" n't", "n't").replace(
        "can not", "cannot")
    step6 = step5.replace(" ` ", " '")
    return step6.strip()
tokenized = ['I', "'ve", 'found', 'a', 'medicine', 'for', 'my','disease', '.']
untokenize(tokenized)
"I've found a medicine for my disease."
I propose keeping offsets during tokenization: (token, offset).
I think this information is useful for processing over the original sentence.
import re
from nltk.tokenize import word_tokenize
def offset_tokenize(text):
    tail = text
    accum = 0
    tokens = word_tokenize(text)  # was self.tokenize(text), which is undefined in a plain function
    info_tokens = []
    for tok in tokens:
        escaped_tok = re.escape(tok)
        m = re.search(escaped_tok, tail)
        start, end = m.span()
        # global offsets
        gs = accum + start
        ge = accum + end
        accum += end
        # keep searching in the rest
        tail = tail[end:]
        info_tokens.append((tok, (gs, ge)))
    return info_tokens
sent = '''I've found a medicine for my disease.
This is line:3.'''
toks_offsets = offset_tokenize(sent)
for t in toks_offsets:
    (tok, offset) = t
    print (tok == sent[offset[0]:offset[1]]), tok, sent[offset[0]:offset[1]]
Gives:
True I I
True 've 've
True found found
True a a
True medicine medicine
True for for
True my my
True disease disease
True . .
True This This
True is is
True line:3 line:3
True . .
For me, it worked when I installed nltk 3.2.5:
pip install -U nltk
then,
import nltk
nltk.download('perluniprops')
from nltk.tokenize.moses import MosesDetokenizer
If you are using it inside a pandas DataFrame, then:
detokenizer = MosesDetokenizer()
df['detoken'] = df['token_column'].apply(lambda x: detokenizer.detokenize(x, return_str=True))
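On a plain list of tokens, a minimal usage sketch of the same call (treat the result as an expectation; the exact spacing depends on the Moses detokenization rules):
detokenizer.detokenize(['I', "'ve", 'found', 'a', 'medicine', 'for', 'my', 'disease', '.'], return_str=True)
# expected: "I've found a medicine for my disease."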
The reason there is no simple answer is you actually need the span locations of the original tokens in the string. If you don't have that, and you aren't reverse engineering your original tokenization, your reassembled string is based on guesses about the tokenization rules that were used. If your tokenizer didn't give you spans, you can still do this if you have three things:
1) The original string
2) The original tokens
3) The modified tokens (I'm assuming you have changed the tokens in some way, because that is the only application for this I can think of if you already have #1)
Use the original token set to identify spans (wouldn't it be nice if the tokenizer did that?) and modify the string from back to front so the spans don't change as you go.
Here I'm using TweetTokenizer but it shouldn't matter as long as the tokenizer you use doesn't change the values of your tokens so that they aren't actually in the original string.
import nltk

tokenizer = nltk.tokenize.casual.TweetTokenizer()
string="One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin."
tokens=tokenizer.tokenize(string)
replacement_tokens=list(tokens)
replacement_tokens[-3]="cute"
def detokenize(string, tokens, replacement_tokens):
    spans = []
    cursor = 0
    for token in tokens:
        while not string[cursor:cursor + len(token)] == token and cursor < len(string):
            cursor += 1
        if cursor == len(string):
            break
        newcursor = cursor + len(token)
        spans.append((cursor, newcursor))
        cursor = newcursor
    i = len(tokens) - 1
    for start, end in spans[::-1]:
        string = string[:start] + replacement_tokens[i] + string[end:]
        i -= 1
    return string
>>> detokenize(string,tokens,replacement_tokens)
'One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a cute vermin.'
The reason tokenize.untokenize does not work is that it needs more information than just the words. Here is an example program using tokenize.untokenize:
from StringIO import StringIO
import tokenize
sentence = "I've found a medicine for my disease.\n"
tokens = tokenize.generate_tokens(StringIO(sentence).readline)
print tokenize.untokenize(tokens)
Additional help: Tokenize - Python Docs | Potential Problem
I am using the following code, without any major library function, for detokenization. I use detokenization for some specific tokens:
_SPLITTER_ = r"([-.,/:!?\";)(])"
def basic_detokenizer(sentence):
    """This is the basic detokenizer that helps us resolve the issues created by our tokenizer."""
    detokenize_sentence = []
    words = sentence.split(' ')
    pos = 0
    while pos < len(words):
        if words[pos] in '-/.' and pos > 0 and pos < len(words) - 1:
            left = detokenize_sentence.pop()
            detokenize_sentence.append(left + ''.join(words[pos:pos + 2]))
            pos += 1
        elif words[pos] in '[(' and pos < len(words) - 1:
            detokenize_sentence.append(''.join(words[pos:pos + 2]))
            pos += 1
        elif words[pos] in ']).,:!?;' and pos > 0:
            left = detokenize_sentence.pop()
            detokenize_sentence.append(left + ''.join(words[pos:pos + 1]))
        else:
            detokenize_sentence.append(words[pos])
        pos += 1
    return ' '.join(detokenize_sentence)
Use the join function:
You could just do a ' '.join(words) to get back the original string.
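For the tokens from the question, that gives:
words = ['I', "'ve", 'found', 'a', 'medicine', 'for', 'my', 'disease', '.']
' '.join(words)
# "I 've found a medicine for my disease ."
Note that the spacing around punctuation and contractions is not restored exactly, which is what the detokenizers above handle.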
I use the PyThaiNLP package to tokenize my Thai-language data for sentiment analysis.
First, I build a function to add a new word set and tokenize with it:
from pythainlp.corpus.common import thai_words
from pythainlp.util import dict_trie
from pythainlp import word_tokenize
def text_tokenize(Mention):
    new_words = {'คนละครึ่ง', 'ยืนยันตัวตน', 'เติมเงิน', 'เราชนะ', 'เป๋าตัง', 'แอปเป๋าตัง'}
    words = new_words.union(thai_words())
    custom_dictionary_trie = dict_trie(words)
    dataa = word_tokenize(Mention, custom_dict=custom_dictionary_trie, keep_whitespace=False)
    return dataa
After that, I apply it within my text_process function, which also removes punctuation and stop words:
puncuations = '''.?!,;:-_[]()'/<>{}\##$&%~*ๆฯ'''
from pythainlp import word_tokenize
def text_process(Mention):
    final = "".join(u for u in Mention if u not in puncuations and ('ๆ', 'ฯ'))
    final = text_tokenize(final)
    final = " ".join(word for word in final)
    final = " ".join(word for word in final.split() if word.lower not in thai_stopwords)
    return final
dff['text_tokens'] = dff['Mention'].apply(text_process)
dff
The point is that this function takes too long to run; it took 17 minutes and still had not finished. I tried to replace
final = text_tokenize(final) with final = word_tokenize(final)
and it took just 2 minutes, but I can no longer use that because I need to add my new custom dictionary. I know there is something wrong but really don't know how to fix it.
I am new to Python and NLP, so please help.
P.S. Sorry for my broken English.
I am not familiar with the Thai language, but I assume that for tokenization you can also use language-agnostic tokenization tools.
If you want to perform word tokenization, try the example below:
from nltk.tokenize import word_tokenize
s = '''This is the text I want to tokenize'''
word_tokenize(s)
>>> ['This', 'is', 'the', 'text', 'I', 'want', 'to', 'tokenize']
Input:"My favorite game is call of duty."
And I set "call of duty" as a key-words, this phrase will be one word in tokenize process.
Finally want to get the result:['my','favorite','game','is','call of duty']
So, how to set the key-words in python NLP ?
I think what you want is keyphrase extraction, and you can do it, for instance, by first tagging each word with its PoS-tag and then applying some sort of regular expression over the PoS-tags to join interesting words into keyphrases.
import nltk
from nltk import pos_tag
from nltk import tokenize
def extract_phrases(my_tree, phrase):
    my_phrases = []
    if my_tree.label() == phrase:
        my_phrases.append(my_tree.copy(True))
    for child in my_tree:
        if type(child) is nltk.Tree:
            list_of_phrases = extract_phrases(child, phrase)
            if len(list_of_phrases) > 0:
                my_phrases.extend(list_of_phrases)
    return my_phrases

def main():
    sentences = ["My favorite game is call of duty"]
    grammar = "NP: {<DT>?<JJ>*<NN>|<NNP>*}"
    cp = nltk.RegexpParser(grammar)
    for x in sentences:
        sentence = pos_tag(tokenize.word_tokenize(x))
        tree = cp.parse(sentence)
        print "\nNoun phrases:"
        list_of_noun_phrases = extract_phrases(tree, 'NP')
        for phrase in list_of_noun_phrases:
            print phrase, "_".join([x[0] for x in phrase.leaves()])

if __name__ == "__main__":
    main()
This will output the following:
Noun phrases:
(NP favorite/JJ game/NN) favorite_game
(NP call/NN) call
(NP duty/NN) duty
But, you can play around with
grammar = "NP: {<DT>?<JJ>*<NN>|<NNP>*}"
trying other types of expressions, so that you can get exactly what you want, depending on the words/tags you want to join together.
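For example, a hedged sketch of a variant grammar (the exact tag pattern is an assumption and may need tuning for your data) that also chunks a noun-preposition-noun sequence such as "call of duty" into a single NP:
grammar = "NP: {<NN><IN><NN>|<DT>?<JJ>*<NN>|<NNP>*}"
With the same parsing code as above, this should yield (NP call/NN of/IN duty/NN), i.e. call_of_duty, as one phrase.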
Also if you are interested, check this very good introduction to keyphrase/word extraction:
https://bdewilde.github.io/blog/2014/09/23/intro-to-automatic-keyphrase-extraction/
This is, of course, way too late to be useful to the OP, but I thought I'd put this answer here for others:
It sounds like what you might be really asking is: How do I make sure that compound phrases like 'call of duty' get grouped together as one token?
You can use nltk's multiword expression tokenizer, like so:
import nltk

string = 'My favorite game is call of duty'
tokenized_string = nltk.word_tokenize(string)
mwe = [('call', 'of', 'duty')]
mwe_tokenizer = nltk.tokenize.MWETokenizer(mwe, separator=' ')
tokenized_string = mwe_tokenizer.tokenize(tokenized_string)
Where mwe stands for multi-word expression. The value of tokenized_string will be ['My', 'favorite', 'game', 'is', 'call of duty']. (Note: MWETokenizer joins matched tokens with '_' by default, so separator=' ' is passed here to get the space-joined form.)
In a sentence containing hashtags, such as a tweet, spacy's tokenizer splits hashtags into two tokens:
import spacy
nlp = spacy.load('en')
doc = nlp(u'This is a #sentence.')
[t for t in doc]
output:
[This, is, a, #, sentence, .]
I'd like to have hashtags tokenized as follows, is that possible?
[This, is, a, #sentence, .]
I also tried several ways to prevent spaCy from splitting hashtags or words with hyphens like "cutting-edge". My experience is that merging tokens afterwards can be problematic, because the pos tagger and dependency parsers already used the wrong tokens for their decisions. Touching the infix, prefix, suffix regexps is kind of error prone / complex, because you don't want to produce side effects by your changes.
The simplest way is indeed, as pointed out before, to modify the token_match function of the tokenizer. This is a re.match identifying regular expressions that will not be split. Instead of importing the specific URL pattern, I'd rather extend whatever spaCy's default is.
import re
import spacy
from spacy.tokenizer import _get_regex_pattern

nlp = spacy.load('en')

# get default pattern for tokens that don't get split
re_token_match = _get_regex_pattern(nlp.Defaults.token_match)
# add your patterns (here: hashtags and in-word hyphens)
re_token_match = rf"({re_token_match}|#\w+|\w+-\w+)"
# overwrite token_match function of the tokenizer
nlp.tokenizer.token_match = re.compile(re_token_match).match

text = "#Pete: choose low-carb #food #eatsmart ;-) 😋👍"
doc = nlp(text)
This yields:
['#Pete', ':', 'choose', 'low-carb', '#food', '#eatsmart', ';-)', '😋', '👍']
This is more of an add-on to the great answer by @DhruvPathak, and a shameless copy from the GitHub thread linked below (and the even better answer there by @csvance). spaCy features (since v2.0) the add_pipe method, meaning you can define @DhruvPathak's great answer in a function and add the step (conveniently) into your nlp processing pipeline, as below.
Citation starts here:
def hashtag_pipe(doc):
    merged_hashtag = False
    while True:
        for token_index, token in enumerate(doc):
            if token.text == '#':
                if token.head is not None:
                    start_index = token.idx
                    end_index = start_index + len(token.head.text) + 1
                    if doc.merge(start_index, end_index) is not None:
                        merged_hashtag = True
                        break
        if not merged_hashtag:
            break
        merged_hashtag = False
    return doc
nlp = spacy.load('en')
nlp.add_pipe(hashtag_pipe)
doc = nlp("twitter #hashtag")
assert len(doc) == 2
assert doc[0].text == 'twitter'
assert doc[1].text == '#hashtag'
Citation ends here; Check out how to add hashtags to the part of speech tagger #503 for the full thread.
PS It's clear when reading the code, but for the copy&pasters, don't disable the parser :)
You can do some pre- and post-processing string manipulation, which lets you bypass the '#'-based tokenization, and it is easy to implement. e.g.
>>> import re
>>> import spacy
>>> nlp = spacy.load('en')
>>> sentence = u'This is my twitter update #MyTopic'
>>> parsed = nlp(sentence)
>>> [token.text for token in parsed]
[u'This', u'is', u'my', u'twitter', u'update', u'#', u'MyTopic']
>>> new_sentence = re.sub(r'#(\w+)', r'ZZZPLACEHOLDERZZZ\1', sentence)
>>> new_sentence
u'This is my twitter update ZZZPLACEHOLDERZZZMyTopic'
>>> parsed = nlp(new_sentence)
>>> [token.text for token in parsed]
[u'This', u'is', u'my', u'twitter', u'update', u'ZZZPLACEHOLDERZZZMyTopic']
>>> [x.replace(u'ZZZPLACEHOLDERZZZ', '#') for x in [token.text for token in parsed]]
[u'This', u'is', u'my', u'twitter', u'update', u'#MyTopic']
You can try setting custom separators in spaCy's tokenizer.
I am not aware of methods to do that.
UPDATE: You can use a regex to find the span of a token that you want to keep as a single token, and retokenize it using the span.merge method as mentioned here: https://spacy.io/docs/api/span#merge
Merge example:
>>> import spacy
>>> import re
>>> nlp = spacy.load('en')
>>> my_str = u'Tweet hashtags #MyHashOne #MyHashTwo'
>>> parsed = nlp(my_str)
>>> [(x.text,x.pos_) for x in parsed]
[(u'Tweet', u'PROPN'), (u'hashtags', u'NOUN'), (u'#', u'NOUN'), (u'MyHashOne', u'NOUN'), (u'#', u'NOUN'), (u'MyHashTwo', u'PROPN')]
>>> indexes = [m.span() for m in re.finditer('#\w+',my_str,flags=re.IGNORECASE)]
>>> indexes
[(15, 25), (26, 36)]
>>> for start,end in indexes:
... parsed.merge(start_idx=start,end_idx=end)
...
#MyHashOne
#MyHashTwo
>>> [(x.text,x.pos_) for x in parsed]
[(u'Tweet', u'PROPN'), (u'hashtags', u'NOUN'), (u'#MyHashOne', u'NOUN'), (u'#MyHashTwo', u'PROPN')]
>>>
I found this on github, which uses spaCy's Matcher:
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)
matcher.add('HASHTAG', None, [{'ORTH': '#'}, {'IS_ASCII': True}])
doc = nlp('This is a #sentence. Here is another #hashtag. #The #End.')
matches = matcher(doc)
hashtags = []
for match_id, start, end in matches:
    hashtags.append(doc[start:end])
for span in hashtags:
    span.merge()
print([t.text for t in doc])
outputs:
['This', 'is', 'a', '#sentence', '.', 'Here', 'is', 'another', '#hashtag', '.', '#The', '#End', '.']
A list of hashtags is also available in the hashtags list:
print(hashtags)
output:
[#sentence, #hashtag, #The, #End]
I spent quite a bit of time on this, so I figured I'd share what I came up with:
Subclassing the Tokenizer and adding the regex for hashtags to the default URL_PATTERN was the easiest solution for me; additionally, I added a custom extension to match on hashtags in order to identify them:
import re
import spacy
from spacy.language import Language
from spacy.tokenizer import Tokenizer
from spacy.tokens import Token
nlp = spacy.load('en_core_web_sm')
def create_tokenizer(nlp):
    # contains the regex to match all sorts of urls:
    from spacy.lang.tokenizer_exceptions import URL_PATTERN

    # spacy defaults: when the standard behaviour is required, they
    # need to be included when subclassing the tokenizer
    prefix_re = spacy.util.compile_prefix_regex(Language.Defaults.prefixes)
    infix_re = spacy.util.compile_infix_regex(Language.Defaults.infixes)
    suffix_re = spacy.util.compile_suffix_regex(Language.Defaults.suffixes)

    # extending the default url regex with regex for hashtags with "or" = |
    hashtag_pattern = r'''|^(#[\w_-]+)$'''
    url_and_hashtag = URL_PATTERN + hashtag_pattern
    url_and_hashtag_re = re.compile(url_and_hashtag)

    # set a custom extension to match if token is a hashtag
    hashtag_getter = lambda token: token.text.startswith('#')
    Token.set_extension('is_hashtag', getter=hashtag_getter)

    return Tokenizer(nlp.vocab, prefix_search=prefix_re.search,
                     suffix_search=suffix_re.search,
                     infix_finditer=infix_re.finditer,
                     token_match=url_and_hashtag_re.match
                     )

nlp.tokenizer = create_tokenizer(nlp)
doc = nlp("#spreadhappiness #smilemore so_great#good.com https://www.somedomain.com/foo")
for token in doc:
    print(token.text)
    if token._.is_hashtag:
        print("-> matches hashtag")
# returns: "#spreadhappiness -> matches hashtag #smilemore -> matches hashtag so_great#good.com https://www.somedomain.com/foo"
Ahoy StackOverflow-ers!
I have a rather trivial question, but it's something that I haven't been able to find in other questions here or in online tutorials: how might we format the output of a Python program so that it fits a certain aesthetic format without any extra modules?
The aim here is that I have a block of plain text, like that from a newspaper article, which I've already filtered to extract just the words I want; now I'd like to print it so that each line has at most 70 characters and no word is broken if it would normally fall on a line break.
Using .ljust(70) as in stdout.write(article.ljust(70)) doesn't seem to do anything to it.
The other point is that words should not be broken, so instead of output like:
Latest news tragic m
urder innocent victi
ms family quiet neig
hbourhood
it should look more like this:
Latest news tragic
murder innocent
victims family
quiet neighbourhood
Thank you all kindly in advance!
Check out the Python textwrap module (a standard library module):
>>> import textwrap
>>> t="""Latest news tragic murder innocent victims family quiet neighbourhood"""
>>> print "\n".join(textwrap.wrap(t, width=20))
Latest news tragic
murder innocent
victims family quiet
neighbourhood
>>>
Use the textwrap module:
http://docs.python.org/library/textwrap.html
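A minimal sketch using textwrap.fill, which wraps at word boundaries and joins the lines for you (width=70 to match the question; the article string here is just the question's example):
import textwrap

article = "Latest news tragic murder innocent victims family quiet neighbourhood"
print(textwrap.fill(article, width=70))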
I'm sure this can be improved on. Without any libraries:
def wrap_text(text, wrap_column=80):
    sentence = ''
    for word in text.split(' '):
        if len(sentence + word) <= wrap_column:
            sentence += ' ' + word
        else:
            print sentence
            sentence = word
    print sentence
EDIT: From the comment, if you want to use regular expressions to just pick out the words, use this:
import re
def wrap_text(text, wrap_column=80):
    sentence = ''
    for word in re.findall(r'\w+', text):
        if len(sentence + word) <= wrap_column:
            sentence += ' ' + word
        else:
            print sentence
            sentence = word
    print sentence
I am having a heck of a time taking the information in a tweet including hashtags, and pulling each hashtag into an array using Python. I am embarrassed to even put what I have been trying thus far.
For example, "I love #stackoverflow because #people are very #helpful!"
This should pull the 3 hashtags into an array.
A simple regex should do the job:
>>> import re
>>> s = "I love #stackoverflow because #people are very #helpful!"
>>> re.findall(r"#(\w+)", s)
['stackoverflow', 'people', 'helpful']
Note though, that as suggested in other answers, this may also find non-hashtags, such as a hash location in a URL:
>>> re.findall(r"#(\w+)", "http://example.org/#comments")
['comments']
So another simple solution would be the following (removes duplicates as a bonus):
>>> def extract_hash_tags(s):
... return set(part[1:] for part in s.split() if part.startswith('#'))
...
>>> extract_hash_tags("#test http://example.org/#comments #test")
set(['test'])
>>> s="I love #stackoverflow because #people are very #helpful!"
>>> [i for i in s.split() if i.startswith("#") ]
['#stackoverflow', '#people', '#helpful!']
The best Twitter hashtag regular expression:
import re
text = "#promovolt #1st # promovolt #123"
re.findall(r'\B#\w*[a-zA-Z]+\w*', text)
>>> ['#promovolt', '#1st']
Suppose that you have to retrieve your #Hashtags from a sentence full of punctuation symbols. Let's say that #stackoverflow, #people and #helpful are terminated with different symbols; you want to retrieve them from the text, but you may want to avoid repetitions:
>>> text = "I love #stackoverflow, because #people... are very #helpful! Are they really #helpful??? Yes #people in #stackoverflow are really really #helpful!!!"
if you try with set([i for i in text.split() if i.startswith("#")]) alone, you will get:
>>> set(['#helpful???',
'#people',
'#stackoverflow,',
'#stackoverflow',
'#helpful!!!',
'#helpful!',
'#people...'])
which in my mind is redundant. A better solution uses regular expressions with the re module:
>>> import re
>>> set([re.sub(r"(\W+)$", "", j) for j in set([i for i in text.split() if i.startswith("#")])])
>>> set(['#people', '#helpful', '#stackoverflow'])
Now it's ok for me.
EDIT: UNICODE #Hashtags
Add the re.UNICODE flag if you want to delete punctuation while still preserving letters with accents, apostrophes and other unicode-encoded stuff, which may be important if the #Hashtags are expected not to be only in English... maybe this is only an Italian guy's nightmare, maybe not! ;-)
For example:
>>> text = u"I love #stackoverflòw, because #peoplè... are very #helpfùl! Are they really #helpfùl??? Yes #peoplè in #stackoverflòw are really really #helpfùl!!!"
will be unicode-encoded as:
>>> u'I love #stackoverfl\xf2w, because #peopl\xe8... are very #helpf\xf9l! Are they really #helpf\xf9l??? Yes #peopl\xe8 in #stackoverfl\xf2w are really really #helpf\xf9l!!!'
and you can retrieve your (correctly encoded) #Hashtags in this way:
>>> set([re.sub(r"(\W+)$", "", j, flags = re.UNICODE) for j in set([i for i in text.split() if i.startswith("#")])])
>>> set([u'#stackoverfl\xf2w', u'#peopl\xe8', u'#helpf\xf9l'])
EDITx2: UNICODE #Hashtags and control for # repetitions
If you want to control for multiple repetitions of the # symbol, as in (forgive me if the text example has become almost unreadable):
>>> text = u"I love ###stackoverflòw, because ##################peoplè... are very ####helpfùl! Are they really ##helpfùl??? Yes ###peoplè in ######stackoverflòw are really really ######helpfùl!!!"
>>> u'I love ###stackoverfl\xf2w, because ##################peopl\xe8... are very ####helpf\xf9l! Are they really ##helpf\xf9l??? Yes ###peopl\xe8 in ######stackoverfl\xf2w are really really ######helpf\xf9l!!!'
then you should substitute these multiple occurrences with a unique #.
A possible solution is to introduce another nested implicit set() definition with the sub() function replacing occurrences of more-than-1 # with a single #:
>>> set([re.sub(r"#+", "#", k) for k in set([re.sub(r"(\W+)$", "", j, flags = re.UNICODE) for j in set([i for i in text.split() if i.startswith("#")])])])
>>> set([u'#stackoverfl\xf2w', u'#peopl\xe8', u'#helpf\xf9l'])
AndiDog's answer will screw up with links and other stuff; you may want to filter those out first. After that, use this code:
UTF_CHARS = ur'a-z0-9_\u00c0-\u00d6\u00d8-\u00f6\u00f8-\u00ff'
TAG_EXP = ur'(^|[^0-9A-Z&/]+)(#|\uff03)([0-9A-Z_]*[A-Z_]+[%s]*)' % UTF_CHARS
TAG_REGEX = re.compile(TAG_EXP, re.UNICODE | re.IGNORECASE)
It may seem overkill but this has been converted from here http://github.com/mzsanford/twitter-text-java.
It will handle like 99% of all hashtags in the same way that twitter handles them.
For more converted Twitter regexes, check out this: http://github.com/BonsaiDen/Atarashii/blob/master/atarashii/usr/share/pyshared/atarashii/formatter.py
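A brief usage sketch on the question's example sentence (this assumes the third group of TAG_EXP holds the tag text, which is how the pattern above is written):
text = u"I love #stackoverflow because #people are very #helpful!"
[match[2] for match in TAG_REGEX.findall(text)]
# expected: [u'stackoverflow', u'people', u'helpful']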
EDIT:
Check out http://github.com/BonsaiDen/AtarashiiFormat
A simple gist (better than the chosen answer):
https://gist.github.com/mahmoud/237eb20108b5805aed5f
It also works with unicode hashtags.
hashtags = [word for word in tweet.split() if word[0] == "#"]
I had a lot of issues with unicode languages.
I had seen many ways to extract hashtags, but found none of them handling all cases,
so I wrote some small Python code to handle most of the cases. It works for me.
def get_hashtagslist(string):
    ret = []
    s = ''
    hashtag = False
    for char in string:
        if char == '#':
            hashtag = True
            if s:
                ret.append(s)
                s = ''
            continue
        # take only the prefix of the hashtag if it contains one of these chars
        # (e.g. for '#happy,but i..' it will take only 'happy')
        if hashtag and char in [' ', '.', ',', '(', ')', ':', '{', '}'] and s:
            ret.append(s)
            s = ''
            hashtag = False
        if hashtag:
            s += char
    if s:
        ret.append(s)
    # keep tags with a sensible length and drop duplicates
    return list(set([word for word in ret if len(word) > 1 and len(word) < 20]))
I extracted hashtags in a silly but effective way.
def retrive(s):
    indice_t = []
    tags = []
    tmp_str = ''
    s = s.strip()
    for i in range(len(s)):
        if s[i] == "#":
            indice_t.append(i)
    for i in range(len(indice_t)):
        index = indice_t[i]
        if i == len(indice_t) - 1:
            boundary = len(s)
        else:
            boundary = indice_t[i + 1]
        index += 1
        while index < boundary:
            if s[index] in "`~!##$%^&*()-_=+[]{}|\\:;'"",.<>?/ \n\t":
                tags.append(tmp_str)
                tmp_str = ''
                break
            else:
                tmp_str += s[index]
                index += 1
    if tmp_str != '':
        tags.append(tmp_str)
    return tags