VADER-Sentiment-Analysis toolkit and decoding to UTF-8


I'm trying out this awesome sentiment analysis toolkit for Python called VADER (https://github.com/cjhutto/vaderSentiment#python-code-example). However, I can't even run its examples, apparently because of a decoding problem.
I've tried .decode('utf-8'), but I still get this error:
Traceback (most recent call last):
File "/Users/solari/Codes/EmotionalTwitter/vader.py", line 22, in <module>
analyzer = SentimentIntensityAnalyzer()
File "/usr/local/lib/python3.6/site-packages/vaderSentiment/vaderSentiment.py", line 199, in __init__
self.lexicon_full_filepath = f.read()
File "/usr/local/Cellar/python3/3.6.2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/encodings/ascii.py", line 26, in decode
return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 6573: ordinal not in range(128)
[Finished in 0.5s with exit code 1]
Why does it complain about the 'ascii' codec? If I've read the documentation correctly, the file should be in UTF-8 anyway. I'm using Python 3.6.2.
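
A likely explanation for the 'ascii' codec: in Python 3, open() in text mode without an encoding argument uses locale.getpreferredencoding(False), and some environments (for example, an editor build system started without a UTF-8 locale) report ASCII. The traceback shows the failure inside the library's own f.read(), which would also explain why calling .decode('utf-8') in the script does not help. The sketch below illustrates the behaviour with a temporary file standing in for the lexicon; the file name and its contents are made up for illustration and are not VADER's.

```python
import locale
import os
import tempfile

# The encoding open() falls back to when none is given; it depends on the environment.
default_encoding = locale.getpreferredencoding(False)
print("default text encoding:", default_encoding)

# Hypothetical stand-in for a UTF-8 file that contains a non-ASCII byte (0xc3).
path = os.path.join(tempfile.mkdtemp(), "lexicon_demo.txt")
with open(path, "wb") as f:
    f.write("na\u00efve\t-0.5\n".encode("utf-8"))

# Forcing ASCII reproduces the kind of error shown in the traceback.
try:
    with open(path, encoding="ascii") as f:
        f.read()
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

# Passing the real encoding explicitly works regardless of the locale.
with open(path, encoding="utf-8") as f:
    text = f.read()

print("ascii read failed:", ascii_error is not None)
print("utf-8 read ok:", text.startswith("na\u00efve"))
```

When the open() call lives inside a library, one possible workaround is to start Python with a UTF-8 locale (for example, by setting LANG or LC_ALL to a UTF-8 locale such as en_US.UTF-8). Whether that fits depends on how the script is launched.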

Related

Emoji support reading from file in python? [duplicate]

I need to analyse a text file in Tamil (UTF-8 encoded). I'm using the nltk package for Python in IDLE. When I try to read the text file, I get the error below. How do I avoid it?
corpus = open('C:\\Users\\Customer\\Desktop\\DISSERTATION\\ettuthokai.txt').read()
Traceback (most recent call last):
File "<pyshell#2>", line 1, in <module>
corpus = open('C:\\Users\\Customer\\Desktop\\DISSERTATION\\ettuthokai.txt').read()
File "C:\Users\Customer\AppData\Local\Programs\Python\Python35-32\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 33: character maps to <undefined>

Since you are using Python 3, just add the encoding parameter to open():
corpus = open(
r"C:\Users\Customer\Desktop\DISSERTATION\ettuthokai.txt", encoding="utf-8"
).read()
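
A slightly fuller sketch of the same idea, using a context manager so the file is closed promptly. The helper name read_text, its errors parameter, and the temporary demo file are illustrative choices and not part of the answer above.

```python
import os
import tempfile


def read_text(path, encoding="utf-8", errors="strict"):
    """Read a whole text file using an explicit encoding."""
    with open(path, encoding=encoding, errors=errors) as f:
        return f.read()


# Demo: write a small UTF-8 file containing Tamil text, then read it back.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(demo_path, "w", encoding="utf-8") as f:
    f.write("\u0ba4\u0bae\u0bbf\u0bb4\u0bcd")  # the word "Tamil" in Tamil script

corpus = read_text(demo_path)
print(len(corpus), "characters read")
```

Passing errors="replace" instead of the default would substitute undecodable bytes rather than raise, at the cost of silently losing data.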

About UnicodeDecodeError

I am writing a program to count words with Python (3.6). The code runs smoothly from the terminal, but if I use Python IDLE, the error below occurs:
Traceback (most recent call last):
File "/Users/zhangchaont/python/Course Python Programming/6.7V2.py", line 122, in <module>
main()
File "/Users/zhangchaont/python/Course Python Programming/6.7V2.py", line 21, in main
for line in txtFile:
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/encodings/ascii.py", line 26, in decode
return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 33: ordinal not in range(128)
How can I solve this?

Since there is not much information about your code, I can only suggest that, instead of codecs, you could use the unidecode package: https://github.com/iki/unidecode. The function below should solve your problem. Open your file with open() and pass the function the result of file_handle.read():
unidecode.unidecode_expect_nonascii(string)
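
Because the script works in the terminal but fails in IDLE, the two environments probably resolve different default encodings. That is an inference from the traceback, not something the question confirms. A sketch of a word count that does not rely on the default is shown below; count_words and the sample file are hypothetical, and UTF-8 is assumed as the file's encoding.

```python
from collections import Counter
import os
import tempfile


def count_words(path, encoding="utf-8"):
    """Count whitespace-separated words, decoding the file explicitly."""
    counts = Counter()
    with open(path, encoding=encoding) as txt_file:
        for line in txt_file:
            counts.update(line.split())
    return counts


# Demo file with non-ASCII text (its UTF-8 bytes start with 0xe4, not plain ASCII).
sample_path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(sample_path, "w", encoding="utf-8") as f:
    f.write("hello \u4e16\u754c hello\nworld\n")

counts = count_words(sample_path)
print(counts.most_common(1))
```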

Decoding HappyBase data from HBase

While trying to decode values from HBase, I am seeing an error. Python apparently thinks the data is not in UTF-8 format, but the Java application that put the data into HBase encoded it as UTF-8.
a = '\x00\x00\x00\x00\x10j\x00\x00\x07\xe8\x02Y'
a.decode("UTF-8")
Traceback (most recent call last):
File "", line 1, in
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe8 in position 9: invalid continuation byte
Any thoughts?

That data is not valid UTF-8, so if you really retrieved it as such from the database, you should check who or what put it there.
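
The claim can be checked directly. The sketch below uses the value from the question as a Python 3 bytes literal (the question itself uses Python 2, where a plain string literal is already bytes). The helper try_decode_utf8 is a hypothetical name introduced here.

```python
# The value from the question, written as bytes.
raw = b"\x00\x00\x00\x00\x10j\x00\x00\x07\xe8\x02Y"


def try_decode_utf8(data):
    """Return (text, None) on success, or (None, error) if data is not valid UTF-8."""
    try:
        return data.decode("utf-8"), None
    except UnicodeDecodeError as exc:
        return None, exc


text, error = try_decode_utf8(raw)
print("hex:", raw.hex())
if error is not None:
    # 0xe8 announces a three-byte sequence, but the next byte (0x02)
    # is not a continuation byte, so decoding stops here.
    print("not UTF-8:", error.reason, "at byte", error.start)
```

The run of leading zero bytes suggests a binary-serialized value rather than text, though that is a guess; the writer's serialization format would need to be confirmed before choosing how to unpack it.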

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 434852: invalid continuation byte

I am using hfcca to calculate cyclomatic complexity for C++ code. hfcca is a simple Python script (https://code.google.com/p/headerfile-free-cyclomatic-complexity-analyzer/). When I try to run the script to generate the output as an XML file, I get the following error:
Traceback (most recent call last):
File "./hfcca.py", line 802, in <module>
main(sys.argv[1:])
File "./hfcca.py", line 798, in main
print(xml_output([f for f in r], options))
File "./hfcca.py", line 798, in <listcomp>
print(xml_output([f for f in r], options))
File "/x/home06/smanchukonda/PREFIX/lib/python3.3/multiprocessing/pool.py", line 652, in next
raise value
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 434852: invalid continuation byte
Please help me with this.

It looks like the file contains characters encoded as latin1, whose bytes do not form valid UTF-8 sequences. The file utility can be useful for figuring out what encoding a file should be treated as, e.g.:
monk@monk-VirtualBox:~$ file foo.txt
foo.txt: UTF-8 Unicode text
Here's what the byte means in latin1:
>>> b'\xe2'.decode('latin1')
'â'
Probably the easiest fix is to convert the files to UTF-8.
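
One way such a conversion could look in Python is sketched below, assuming the source files really are latin1. The function name convert_to_utf8 and the demo file names are hypothetical.

```python
import os
import tempfile


def convert_to_utf8(src_path, dst_path, source_encoding="latin-1"):
    """Re-encode a text file from source_encoding to UTF-8."""
    # newline="" keeps the original line endings untouched.
    with open(src_path, encoding=source_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)


# Demo: a small source file whose comment contains the latin1 byte 0xe2.
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "legacy.cpp")
dst_path = os.path.join(workdir, "legacy_utf8.cpp")
with open(src_path, "wb") as f:
    f.write(b"// ch\xe2teau\nint main() { return 0; }\n")

convert_to_utf8(src_path, dst_path)
with open(dst_path, "rb") as f:
    converted = f.read()
print(converted.decode("utf-8"))
```

Note that latin1 maps every possible byte to a character, so decoding with it never fails. A wrong guess about the source encoding therefore produces wrong characters rather than an error, which makes it worth checking the result by eye.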

I had the same problem when rendering Markup("""yyyyyy"""), and I solved it using an online tool which removed the 'bad' characters: https://pteo.paranoiaworks.mobi/diacriticsremover/
It is a nice tool and even works offline.

UnicodeDecodeError for writing file

I know that this is a very common error, but it's the first time I've encountered it when trying to write a file.
I'm using networkx to work with graphs for network analysis, and when I try to write into any format:
nx.write_gml(G, "Graph.gml")
nx.write_pajek(G, "Graph.net")
nx.write_gexf(G, "graph.gexf")
I get:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<string>", line 2, in write_pajek
File "/Library/Python/2.7/site-packages/networkx/utils/decorators.py", line 263, in _open_file
result = func(*new_args, **kwargs)
File "/Library/Python/2.7/site-packages/networkx/readwrite/pajek.py", line 100, in write_pajek
path.write(line.encode(encoding))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 19: ordinal not in range(128)
I haven't found documentation on this, so I'm quite confused.

I wonder whether you can use the codecs module to solve it. Create a file object with codecs, as follows, before passing it to networkx.
For example:
import codecs
f = codecs.open("graph.gml", "w", "utf-8")
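
The traceback shows a decode error raised from line.encode(encoding). In Python 2, calling .encode() on a byte string first decodes it implicitly with the ASCII codec, so a non-ASCII byte such as 0xc3 fails at that step. One possible approach, assuming the node labels or attributes are UTF-8 byte strings (the question does not say), is to convert them to text before writing. The sketch below is written in Python 3 syntax and uses only the standard library; to_text and the sample labels are hypothetical.

```python
def to_text(value, encoding="utf-8"):
    """Decode byte strings to text; leave every other value unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Hypothetical node labels: one UTF-8 byte string, one text string, one integer.
labels = [b"S\xc3\xa3o Paulo", "Lisboa", 42]

# Mapping from each original label to its text form.
mapping = {label: to_text(label) for label in labels}
for old, new in mapping.items():
    print(repr(old), "->", repr(new))
```

A mapping like this could then be applied to the graph, for instance with networkx's relabel_nodes(G, mapping), before calling the writer functions. Attribute values would need the same treatment.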
