Can't use NLTK on text pulled from the Silmarillion - python

I'm trying to use Tolkien's Silmarillion as a practice text for learning some NLP with nltk.
I am having trouble getting started because I'm running into text encoding issues.
I'm using the TextBlob wrapper (https://github.com/sloria/TextBlob) around NLTK because it's a lot easier.
The sentence that I can't parse is:
"But Húrin did not answer, and they sat beside the stone, and did not speak again".
I believe the special character in "Húrin" is causing the issue.
My code:
from text.blob import TextBlob
b = TextBlob( 'But Húrin did not answer, and they sat beside the stone, and did not speak again' )
b.noun_phrases
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)
As this is just a for-fun project, I just want to be able to use this text, extract some attributes, and do some basic processing.
How can I convert this text to ASCII when I don't know what the initial encoding is? I tried to decode from UTF8, then re-encode into ASCII:
>>> asc = unicode_text.decode('utf-8')
>>> asc = unicode_text.encode('ascii')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 10: ordinal not in range(128)
But even that doesn't work. Any suggestions are appreciated -- I'm fine with losing the special characters, as long as it's done consistently across the document.
I'm using Python 2.6.8 with the required modules correctly installed.

First, update TextBlob to the latest version (0.6.0 as of this writing), as there have been some unicode fixes in recent updates. This can be done with
$ pip install -U textblob
Then, use a unicode literal, like so:
from text.blob import TextBlob
b = TextBlob( u'But Húrin did not answer, and they sat beside the stone, and did not speak again' )
noun_phrases = b.noun_phrases
print noun_phrases
# WordList([u'h\xfarin'])
print noun_phrases[0]
# húrin
This is verified on Python 2.7.5 with TextBlob 0.6.0, but it should work with Python 2.6.8 as well.
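If you do still want to fold the accented characters down to plain ASCII, as the question mentions, one hedged approach (a sketch, not part of the fix above) is to normalize the unicode text and drop the combining accents:
# -*- coding: utf-8 -*-
import unicodedata

text = u'But Húrin did not answer, and they sat beside the stone, and did not speak again'
# NFKD splits 'ú' into 'u' plus a combining accent; encoding to ASCII with
# errors='ignore' then drops the accents consistently across the whole document.
ascii_text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore')
print ascii_text  # But Hurin did not answer, ...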

Related

UnicodeDecodeError with nltk

I am working with Python 2.7 and nltk on a large .txt file of content scraped from various websites. However, I am getting various unicode errors such as
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 6: ordinal not in range(128)
My question is not so much how I can 'fix' this with python but instead is there anything I can do to the .txt file (as in formatting) before 'feeding' it to python, such as 'make plain text' to avoid this issue entirely?
Update:
I looked around and found a solution within python that seems to work perfectly:
import sys
reload(sys)
sys.setdefaultencoding('utf8')
Try opening the file with:
f = open(fname, encoding="ascii", errors="surrogateescape")
Replace "ascii" with the desired encoding. Note that this is Python 3 syntax: on Python 2.7 the built-in open() does not accept an encoding argument and the "surrogateescape" handler does not exist, so use io.open with errors="replace" or errors="ignore" instead, as sketched below.

(Python 2.7) How to solve: UnicodeDecodeError: 'ascii' codec can't decode byte 0x8b in position 6: ordinal not in range(128)

Here is my question:
if not line.startswith(' File "<frozen importlib._bootstrap'))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xbb in position 45: ordinal not in range(128)
I've tried this method:
import sys
reload(sys)
sys.setdefaultencoding('utf8')
but it seems that it doesn't work on version 2.7.
You would benefit from learning about string encoding.
That said, there's something in Python 2.7 which solves these problems most of the time.
from __future__ import unicode_literals
This has to be the first statement in your Python module (apart from a docstring). It makes every string literal in the module unicode instead of bytes.
Be careful and read about the pros and cons.
Generally speaking, you will be fine if you are starting a new project, but introducing this change in an existing (especially big) project can be a hard and time-consuming task.
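For illustration, a minimal sketch of what the import changes in a Python 2.7 module:
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

s = 'café'          # with the import, this literal is unicode, not a byte str
print(type(s))      # <type 'unicode'>
b = b'raw bytes'    # byte strings now need an explicit b'' prefix
print(type(b))      # <type 'str'>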
One more thing: when starting a new Python project now, you should really consider doing it in Python 3. The end of life for Python 2.7 has been moved to 2020, which is just two years from now. Python 3 has many interesting features and improvements. What you will find important is that str is unicode in Python 3, which means encoding problems don't happen as often.
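A quick Python 3 illustration of that last point:
s = '\u2620'       # in Python 3 this is already a str (unicode), no u'' prefix needed
print(type(s))     # <class 'str'>
print(s.encode())  # b'\xe2\x98\xa0' -- bytes only appear when you explicitly encode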

How to handle encoding in Python 2.7 and SQLAlchemy 🏴‍☠️

I have written a code in Python 3.5, where I was using Tweepy & SQLAlchemy & the following lines to load Tweets into a database and it worked well:
twitter = Twitter(str(tweet.user.name).encode('utf8'), str(tweet.text).encode('utf8'))
session.add(twitter)
session.commit()
Using the same code now in Python 2.7 raises an Error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026' in
position 139: ordinal not in range(128)
What's the solution? My MySQL configuration is the following:
Server side --> utf8mb4 encoding
Client side --> create_engine('mysql+pymysql://abc:def#abc/def', encoding='utf8', convert_unicode=True)
UPDATE
It seems that there is no solution, at least not with Python 2.7 + SQLAlchemy. Here is what I found out so far and if I am wrong, please correct me.
Tweepy, at least in Python 2.7, returns unicode type objects.
In Python 2.7: tweet = u'☠' is a <'unicode' type>
In Python 3.5: tweet = u'☠' is a <'str' class>
This means Python 2.7 will give me a 'UnicodeEncodeError' if I do str(tweet), because Python 2.7 then tries to encode the character '☠' into ASCII, which is not possible, because ASCII can only handle these basic characters.
Conclusion:
Using just this statement tweet.user.name in the SQLAlchemy line gives me the following error:
UnicodeEncodeError: 'latin-1' codec can't encode characters in
position 0-4: ordinal not in range(256)
Using either this statement tweet.user.name.encode('utf-8') or this one str(tweet.user.name.encode('utf-8')) in the SQLAlchemy line should actually work the right way, but it shows me unencoded characters on the database side:
ð´ââ ï¸Jack Sparrow
This is what I want it to show:
Printed: 🏴‍☠️ Jack Sparrow
Special characters unicode: u'\U0001f3f4\u200d\u2620\ufe0f'
Special characters UTF-8 encoding: '\xf0\x9f\x8f\xb4\xe2\x80\x8d\xe2\x98\xa0\xef\xb8\x8f'
Do not use any encode/decode functions; they only compound the problems.
Do set the connection to be UTF-8.
Do set the column/table to utf8mb4 instead of utf8.
Do use # -*- coding: utf-8 -*- at the beginning of Python code.
See also More Python tips, which has a link to "Python 2.7 issues; improvements in Python 3".
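Putting those points together, a minimal sketch; the Twitter model's column names and the connection details are assumptions rather than taken from the question, and tweet stands for a Tweepy status object:
# -*- coding: utf-8 -*-
from sqlalchemy import create_engine, Column, Integer, Unicode
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()

class Twitter(Base):
    __tablename__ = 'twitter'
    __table_args__ = {'mysql_charset': 'utf8mb4'}  # utf8mb4, not utf8
    id = Column(Integer, primary_key=True)
    name = Column(Unicode(255))
    text = Column(Unicode(500))

# charset=utf8mb4 in the URL makes the pymysql connection itself utf8mb4
engine = create_engine('mysql+pymysql://user:password@localhost/db?charset=utf8mb4')
session = sessionmaker(bind=engine)()

# pass the unicode objects from Tweepy straight through -- no str(), no .encode()
session.add(Twitter(name=tweet.user.name, text=tweet.text))
session.commit()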

UnicodeDecodeError 'charmap' codec with Tesseract OCR in Python

I am trying to do OCR on an image file in Python using Tesseract-OCR.
My environment is:
Python 3.5 (Anaconda) on a Windows machine.
Here is the code:
from PIL import Image
from pytesseract import image_to_string
out = image_to_string(Image.open('sample.png'))
The error I am getting is :
File "Anaconda3\lib\sitepackages\pytesseract\pytesseract.py", line 167, in image_to_string
return f.read().strip()
File "Anaconda3\lib\encodings\cp1252.py", line 23 in decode
return codecs.charmap_decode(input, self.errors, decoding_table)[0]
UnicodeDecodeError:'charmap' codec can't decode byte 0x81 in position 1583: character maps to <undefined>
I have tried the solution mentioned here, but the hack is not working.
The same code works on Mac OS.
I have also looked into the pytesseract issues; there is an open issue for this.
Thanks
Hmm, something very weird is going on there.
The byte "\x81" is an unprintable control character in the "latin1" text encoding. In the "cp1252" encoding the library is using, however, it is explicitly mapped to an "undefined character".
What happens is that "latin1" is somewhat of a "no-op" codec, sometimes used in Python simply to translate a byte sequence into a unicode string (the default string type in Python 3.x). The "cp1252" codec is almost the same thing, and in some contexts the two are used interchangeably, but this "\x81" code is one difference between them - in your case, a crucial one.
The correct thing to do is to try supplying the image_to_string function with the optional lang parameter, so that it might use the correct codec to decode your text, if it can better recognize the character it is exposing as "0x81". However, this might not work, as it might simply be an OCR error for a very weird character not related to the language at all.
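For example, if the underlying text is English, the call might look like this (the lang value is only an assumption about your data):
from PIL import Image
from pytesseract import image_to_string

# tell tesseract which trained language data to use
out = image_to_string(Image.open('sample.png'), lang='eng')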
So, the workaround for you is to monkeypatch the "cp1252" codec so that, instead of raising an error, it fills in a Unicode replacement character. One way to do that is to insert these lines before calling tesseract:
from encodings import cp1252
original_decode = cp1252.Codec.decode
cp1252.Codec.decode = lambda self, input, errors="replace": original_decode(self, input, errors)
But please, if you can, open a bug report against the pytesseract project. My guess is they should be using "latin1" and not "cp1252" encoding at this point.
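For clarity, the patch just has to run before the call from the question, roughly like this:
from encodings import cp1252
from PIL import Image
from pytesseract import image_to_string

# patch cp1252 first, so the file read inside image_to_string uses errors="replace"
original_decode = cp1252.Codec.decode
cp1252.Codec.decode = lambda self, input, errors="replace": original_decode(self, input, errors)

out = image_to_string(Image.open('sample.png'))
print(out)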

Python write (iPhone) Emoji to a file

I have been trying to write a simple script that can save user input (originating from an iPhone) to a text file. The issue I'm having is that when a user uses an Emoji icon, it breaks the whole thing.
OS: Ubuntu
Python Version: 2.7.3
My code currently looks like this
f = codecs.open(path, "w+", encoding="utf8")
f.write("Desc: " + json_obj["description"])
f.close()
When an Emoji character is passed in the description variable, I get the error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 7-8: ordinal not in range(128)
Any possible help is appreciated.
The most likely problem here is that json_obj["description"] is actually a UTF-8-encoded str, not a unicode. So, when you try to write it to a codecs-wrapped file, Python has to decode it from str to unicode so it can re-encode it. And that's the part that fails, because that automatic decoding uses sys.getdefaultencoding(), which is 'ascii'.
For example:
>>> f = codecs.open('emoji.txt', 'w+', encoding='utf-8')
>>> e = u'\U0001f1ef'
>>> print e
🇯
>>> e
u'\U0001f1ef'
>>> f.write(e)
>>> e8 = e.encode('utf-8')
>>> e8
'\xf0\x9f\x87\xaf'
>>> f.write(e8)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf0 in position 0: ordinal not in range(128)
There are two possible solutions here.
First, you can explicitly decode everything to unicode as early as possible. I'm not sure where your json_obj is coming from, but I suspect it's not actually the stdlib json.loads, because by default, that always gives you unicode keys and values. So, replacing whatever you're using for JSON with the stdlib functions will probably solve the problem.
Second, you can leave everything as UTF-8 str objects and stay in binary mode. If you know you have UTF-8 everywhere, just open the file instead of codecs.open, and write without any encoding.
Also, you should strongly consider using io.open instead of codecs.open. It has a number of advantages, including:
Raises an exception instead of doing the wrong thing if you pass it incorrect values.
Often faster.
Forward-compatible with Python 3.
Has a number of bug fixes that will never be back-ported to codecs.
The only disadvantage is that it's not backward compatible with Python 2.5 and earlier. Unless that matters to you, don't use codecs.
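As a concrete illustration of the first option combined with io.open, here is a sketch; the JSON payload and filename are made up for the example:
# -*- coding: utf-8 -*-
import io
import json

raw = '{"description": "lunch \\ud83c\\udf55"}'   # hypothetical payload containing an emoji
json_obj = json.loads(raw)                        # stdlib json gives back unicode values

with io.open('note.txt', 'w', encoding='utf-8') as f:
    f.write(u"Desc: " + json_obj["description"])  # stays unicode until the file encodes it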
