How to tokenize all currency symbols using Regex in python?

I want to tokenize all currency symbols using NLTK's regexp tokenizer.
For example, these are my sentences:
The price of it is $5.00.
The price of it is RM5.00.
The price of it is €5.00.
I used this regex pattern:
pattern = r'''(['()""\w]+|\.+|\?+|\,+|\!+|\$?\d+(\.\d+)?%?)'''
tokenize_list = nltk.regexp_tokenize(sentence, pattern)
But as we can see, it only handles $.
I tried to use \p{Sc} as explained in "What is regex for currency symbol?", but it is still not working for me.

Try padding the number attached to the currency symbol with spaces, then tokenize:
>>> import re
>>> from nltk import word_tokenize
>>> sents = """The price of it is $5.00.
... The price of it is RM5.00.
... The price of it is €5.00.""".split('\n')
>>>
>>> for sent in sents:
...     numbers_in_sent = re.findall(r"[-+]?\d+[\.]?\d*", sent)
...     for num in numbers_in_sent:
...         sent = sent.replace(num, ' '+num+' ')
...     print word_tokenize(sent)
...
['The', 'price', 'of', 'it', 'is', '$', '5.00', '.']
['The', 'price', 'of', 'it', 'is', 'RM', '5.00', '.']
['The', 'price', 'of', 'it', 'is', '\xe2\x82\xac', '5.00', '.']
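As an aside, the \p{Sc} class from the linked question does work in the third-party regex module (pip install regex); it is the built-in re module, which nltk.regexp_tokenize relies on, that does not support it. A minimal Python 3 sketch, assuming the regex package is installed:

import regex  # third-party module; supports Unicode property classes like \p{Sc}

sents = ["The price of it is $5.00.",
         "The price of it is RM5.00.",
         "The price of it is €5.00."]

# \p{Sc} matches any Unicode currency symbol; numbers and letter-only words
# are separate alternatives, so 'RM5.00' splits into 'RM' and '5.00'.
pattern = r"\p{Sc}|\d+(?:\.\d+)?|[^\W\d]+|[.?!,]"
for sent in sents:
    print(regex.findall(pattern, sent))
# ['The', 'price', 'of', 'it', 'is', '$', '5.00', '.']
# ['The', 'price', 'of', 'it', 'is', 'RM', '5.00', '.']
# ['The', 'price', 'of', 'it', 'is', '€', '5.00', '.']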

Related

Splitting words and taking specific words

I have a problem with my Python script; it's a straightforward problem, but I can't resolve it.
For example, I have an 11-word text that I split using the re.split(r"\s+", text) function:
import re
text = "this is the example text and i will splitting this text"
split = re.split(r"\s+", text)
for a in range(len(split)):
    print(split[a])
The result is
this
is
the
example
text
and
i
will
splitting
this
text
I only need to take 10 of the 11 words, so the result I need is like this:
is
the
example
text
and
i
will
splitting
this
text
Can you solve this problem? It would be very helpful.
Thank you!
Just slice it like that:
>>> import re
>>>
>>> text = "this is the example text and i will splitting this text"
>>> split = re.split(r"\s+", text)
>>> split
['this', 'is', 'the', 'example', 'text', 'and', 'i', 'will', 'splitting', 'this', 'text']
>>> split[-10:]
['is', 'the', 'example', 'text', 'and', 'i', 'will', 'splitting', 'this', 'text']
No need for regex:
text = "this is the example text and i will splitting this text"
l = text.split() # Split with whitespace
l.pop(0) # Remove first item
print(l) # Print the results
Results: ['is', 'the', 'example', 'text', 'and', 'i', 'will', 'splitting', 'this', 'text']
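Another way to drop the first word in one step (a sketch, equivalent to the pop approach) is to split it off with a maxsplit of 1:

text = "this is the example text and i will splitting this text"
# split(None, 1) splits off only the first whitespace-delimited word;
# [1] keeps the remainder, which is then split into words.
rest = text.split(None, 1)[1].split()
print(rest)
# ['is', 'the', 'example', 'text', 'and', 'i', 'will', 'splitting', 'this', 'text']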

python regular expression to split string and get all words is not working

I'm trying to split a string using a regular expression in Python and get all the matched literals.
RE: \w+(\.?\w+)*
This needs to capture only [a-zA-Z0-9_]-like content.
But when I try to match and get all the contents from the string, it doesn't return proper results.
Code snippet:
>>> import re
>>> from pprint import pprint
>>> pattern = r"\w+(\.?\w+)*"
>>> string = """this is some test string and there are some digits as well that need to be captured as well like 1234567890 and 321 etc. But it should also select _ as well. I'm pretty sure that that RE does exactly the same.
... Oh wait, it also need to filter out the symbols like !##$%^&*()-+=[]{}.,;:'"`| \(`.`)/
...
... I guess that's it."""
>>> pprint(re.findall(r"\w+(.?\w+)*", string))
[' etc', ' well', ' same', ' wait', ' like', ' it']
It's only returning some of the words, but it should return all the words, numbers, and underscore(s), as in the linked example.
python version: Python 3.6.2 (default, Jul 17 2017, 16:44:45)
Thanks.
You need to use a non-capturing group (with a capturing group, re.findall returns what the group captured rather than the whole match) and escape the dot:
>>> import re
>>> from pprint import pprint
>>> pattern = r"\w+(?:\.?\w+)*"
>>> string = """this is some test string and there are some digits as well that need to be captured as well like 1234567890 and 321 etc. But it should also select _ as well. I'm pretty sure that that RE does exactly the same.
... Oh wait, it also need to filter out the symbols like !##$%^&*()-+=[]{}.,;:'"`| \(`.`)/
...
... I guess that's it."""
>>> pprint(re.findall(pattern, string, re.A))
['this', 'is', 'some', 'test', 'string', 'and', 'there', 'are', 'some', 'digits', 'as', 'well', 'that', 'need', 'to', 'be', 'captured', 'as', 'well', 'like', '1234567890', 'and', '321', 'etc', 'But', 'it', 'should', 'also', 'select', '_', 'as', 'well', 'I', 'm', 'pretty', 'sure', 'that', 'that', 'RE', 'does', 'exactly', 'the', 'same', 'Oh', 'wait', 'it', 'also', 'need', 'to', 'filter', 'out', 'the', 'symbols', 'like', 'I', 'guess', 'that', 's', 'it']
Also, to match only ASCII letters, digits, and _, you must pass the re.A flag.
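To see why the capturing group breaks re.findall, compare the two forms on a tiny input (a minimal sketch):

import re

s = "foo.bar baz"
# With a capturing group, findall returns what the group captured
# (only the last repetition, or '' when the group matched zero times):
print(re.findall(r"\w+(\.?\w+)*", s))    # ['.bar', '']
# With a non-capturing group, findall returns the whole match:
print(re.findall(r"\w+(?:\.?\w+)*", s))  # ['foo.bar', 'baz']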

Python: Regex Search

I wanted to split a sentence on multiple delimiters:
.?!\n
However, I want to keep the comma along with the word.
For example for the string
'Hi, How are you?'
I want the result
['Hi,', 'How', 'are', 'you', '?']
I tried the following, but I'm not getting the required result:
words = re.findall(r"\w+|\W+", text)
Use re.split and keep your delimiters (by wrapping them in a capturing group), then filter out the strings that contain only whitespace.
>>> import re
>>> s = 'Hi, How are you?'
>>> [x for x in re.split(r'(\s|!|\.|\?|\n)', s) if x.strip()]
['Hi,', 'How', 'are', 'you', '?']
If using re.findall:
>>> ss = """
... Hi, How are
...
... yo.u
... do!ing?
... """
>>> [w for w in re.findall(r'(\w+\,?|[.?!]?)?\s*', ss) if w]
['Hi,', 'How', 'are', 'yo', '.', 'u', 'do', '!', 'ing', '?']
You can use:
re.findall(r'(.*?)([\s\.\?!\n])', text)
With a bit of itertools magic and list comprehensions:
[i.strip() for i in itertools.chain.from_iterable(re.findall(r'(.*?)([\s\.\?!\n])', text)) if i.strip()]
And a more readable version:
import itertools
import re

words = []
found = itertools.chain.from_iterable(re.findall(r'(.*?)([\s\.\?!\n])', text))
for i in found:
    w = i.strip()
    if w:
        words.append(w)
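A more direct pattern can also produce the desired list (a sketch, not from the original answers): match a word with an optional trailing comma, or a sentence-ending delimiter on its own:

import re

s = 'Hi, How are you?'
# \w+,? keeps a trailing comma attached to its word;
# [.?!] emits sentence-ending punctuation as separate tokens.
print(re.findall(r"\w+,?|[.?!]", s))
# ['Hi,', 'How', 'are', 'you', '?']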

text.replace(punctuation,'') does not remove all punctuation contained in list(punctuation)?

import urllib2,sys
from bs4 import BeautifulSoup,NavigableString
from string import punctuation as p
# URL for Obama's presidential acceptance speech in 2008
obama_4427_url = 'http://www.millercenter.org/president/obama/speeches/speech-4427'
# read in URL
obama_4427_html = urllib2.urlopen(obama_4427_url).read()
# BS magic
obama_4427_soup = BeautifulSoup(obama_4427_html)
# find the speech itself within the HTML
obama_4427_div = obama_4427_soup.find('div',{'id': 'transcript'},{'class': 'displaytext'})
# obama_4427_div.text.lower() removes extraneous characters (e.g. '<br/>')
# and places all letters in lowercase
obama_4427_str = obama_4427_div.text.lower()
# for further text analysis, remove punctuation
for punct in list(p):
    obama_4427_str_processed = obama_4427_str.replace(p,'')
    obama_4427_str_processed_2 = obama_4427_str_processed.replace(p,'')
print(obama_4427_str_processed_2)
# store individual words
words = obama_4427_str_processed.split(' ')
print(words)
Long story short, I have a speech from President Obama, and I am looking to remove all punctuation so that I'm left with only the words. I imported the punctuation constant and ran a for loop, but it didn't remove all my punctuation. What am I doing wrong here?
str.replace() searches for the whole value of its first argument. It is not a pattern, so the text is replaced with an empty string only if the entire string.punctuation value occurs in it. Note that your loop passes p (the whole punctuation string) to replace() instead of the loop variable punct.
Use a regular expression instead:
import re
from string import punctuation as p
punctuation = re.compile('[{}]+'.format(re.escape(p)))
obama_4427_str_processed = punctuation.sub('', obama_4427_str)
words = obama_4427_str_processed.split()
Note that you can just use str.split() without an argument to split on any arbitrary-width whitespace, including newlines.
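For example (a quick illustration of the difference):

print("  a  b\nc ".split())     # ['a', 'b', 'c']: any whitespace run separates
print("  a  b\nc ".split(' '))  # ['', '', 'a', '', 'b\nc', '']: literal single spaces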
If you want to remove punctuation from the end of each word, you can rstrip it off:
obama_4427_str = obama_4427_div.text.lower()
# for further text analysis, remove punctuation
from string import punctuation
print([w.rstrip(punctuation) for w in obama_4427_str.split()])
Output:
['transcript', 'to', 'chairman', 'dean', 'and', 'my', 'great',
'friend', 'dick', 'durbin', 'and', 'to', 'all', 'my', 'fellow',
'citizens', 'of', 'this', 'great', 'nation', 'with', 'profound',
'gratitude', 'and', 'great', 'humility', 'i', 'accept', 'your',
'nomination', 'for', 'the', 'presidency', 'of', 'the', 'united',
................................................................
Using Python 3, to remove punctuation from anywhere in the string, use str.translate:
from string import punctuation
tbl = str.maketrans({ord(ch):"" for ch in punctuation})
obama_4427_str = obama_4427_div.text.lower().translate(tbl)
print(obama_4427_str.split())
For Python 2:
from string import punctuation
obama_4427_str = obama_4427_div.text.lower().encode("utf-8").translate(None,punctuation)
print( obama_4427_str.split())
Output:
['transcript', 'to', 'chairman', 'dean', 'and', 'my', 'great',
'friend', 'dick', 'durbin', 'and', 'to', 'all', 'my', 'fellow',
'citizens', 'of', 'this', 'great', 'nation', 'with', 'profound',
'gratitude', 'and', 'great', 'humility', 'i', 'accept', 'your',
'nomination', 'for', 'the', 'presidency', 'of', 'the', 'united',
............................................................
On another note, you can iterate over a string directly, so list(p) is redundant in your own code.
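For completeness, the asker's loop works once the loop variable is actually used (a sketch of the fix on a toy string):

from string import punctuation as p

text = "hello, world! (test)"
for punct in p:                      # iterate over the string directly
    text = text.replace(punct, '')   # remove each single character
print(text.split())
# ['hello', 'world', 'test']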

Python NLTK: how to lemmatize text, including verbs, in English?

I want to lemmatize this text, but it only lemmatizes the nouns; I need to lemmatize the verbs as well.
>>> import nltk, re, string
>>> from nltk.stem import WordNetLemmatizer
>>> from urllib import urlopen
>>> url="https://raw.githubusercontent.com/evandrix/nltk_data/master/corpora/europarl_raw/english/ep-00-01-17.en"
>>> raw = urlopen(url).read()
>>> raw = "".join(l for l in raw if l not in string.punctuation)
>>> tokens=nltk.word_tokenize(raw)
>>> from nltk.stem import WordNetLemmatizer
>>> lemmatizer = WordNetLemmatizer()
>>> lem = [lemmatizer.lemmatize(t) for t in tokens]
>>> lem[:20]
['Resumption', 'of', 'the', 'session', 'I', 'declare', 'resumed', 'the', 'session', 'of', 'the', 'European', 'Parliament', 'adjourned', 'on', 'Friday', '17', 'December', '1999', 'and']
Here a verb like "resumed" is supposed to become "resume". Can you tell me what I should do to lemmatize the whole text?
Use the pos parameter of WordNetLemmatizer:
>>> from nltk.stem import WordNetLemmatizer
>>> from nltk import pos_tag
>>> wnl = WordNetLemmatizer()
>>> wnl.lemmatize('resumed')
'resumed'
>>> wnl.lemmatize('resumed', pos='v')
u'resume'
Here's complete code, using the pos_tag function:
>>> from nltk import word_tokenize, pos_tag
>>> from nltk.stem import WordNetLemmatizer
>>> wnl = WordNetLemmatizer()
>>> txt = """Resumption of the session I declare resumed the session of the European Parliament adjourned on Friday 17 December 1999 , and I would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period ."""
>>> [wnl.lemmatize(i,j[0].lower()) if j[0].lower() in ['a','n','v'] else wnl.lemmatize(i) for i,j in pos_tag(word_tokenize(txt))]
['Resumption', 'of', 'the', 'session', 'I', 'declare', u'resume', 'the', 'session', 'of', 'the', 'European', 'Parliament', u'adjourn', 'on', 'Friday', '17', 'December', '1999', ',', 'and', 'I', 'would', 'like', 'once', 'again', 'to', 'wish', 'you', 'a', 'happy', 'new', 'year', 'in', 'the', 'hope', 'that', 'you', u'enjoy', 'a', 'pleasant', 'festive', 'period', '.']
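If you need this as a reusable helper, a common pattern is to map Penn Treebank tags to WordNet POS letters before lemmatizing (a sketch; penn_to_wordnet is a name made up here, not an NLTK function):

from nltk import pos_tag, word_tokenize
from nltk.stem import WordNetLemmatizer

def penn_to_wordnet(tag):
    # Map a Penn Treebank tag to a WordNet POS letter;
    # default to noun, which is also WordNetLemmatizer's default.
    if tag.startswith('J'):
        return 'a'   # adjective
    if tag.startswith('V'):
        return 'v'   # verb
    if tag.startswith('R'):
        return 'r'   # adverb
    return 'n'       # noun

wnl = WordNetLemmatizer()

def lemmatize_sentence(sent):
    return [wnl.lemmatize(tok, penn_to_wordnet(tag))
            for tok, tag in pos_tag(word_tokenize(sent))]

print(lemmatize_sentence("I declare resumed the session adjourned on Friday"))
# expected: ['I', 'declare', 'resume', 'the', 'session', 'adjourn', 'on', 'Friday']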
