Google App Engine: UnicodeDecodeError in bulk data upload - python

I'm getting an odd error with Google App Engine devserver 1.3.5, and Python 2.5.4, on Windows.
A sample row in the CSV:
EQS,550,foobar,"<some><html><garbage /></html></some>",odp,Ti4=,http://url.com,success
The error:
..................................................................................................................[ERROR ] [Thread-1] WorkerThread:
Traceback (most recent call last):
File "C:\Program Files\Google\google_appengine\google\appengine\tools\adaptive_thread_pool.py", line 150, in WorkOnItems
status, instruction = item.PerformWork(self.__thread_pool)
File "C:\Program Files\Google\google_appengine\google\appengine\tools\bulkloader.py", line 695, in PerformWork
transfer_time = self._TransferItem(thread_pool)
File "C:\Program Files\Google\google_appengine\google\appengine\tools\bulkloader.py", line 852, in _TransferItem
self.request_manager.PostEntities(self.content)
File "C:\Program Files\Google\google_appengine\google\appengine\tools\bulkloader.py", line 1296, in PostEntities
datastore.Put(entities)
File "C:\Program Files\Google\google_appengine\google\appengine\api\datastore.py", line 282, in Put
req.entity_list().extend([e._ToPb() for e in entities])
File "C:\Program Files\Google\google_appengine\google\appengine\api\datastore.py", line 687, in _ToPb
properties = datastore_types.ToPropertyPb(name, values)
File "C:\Program Files\Google\google_appengine\google\appengine\api\datastore_types.py", line 1499, in ToPropertyPb
pbvalue = pack_prop(name, v, pb.mutable_value())
File "C:\Program Files\Google\google_appengine\google\appengine\api\datastore_types.py", line 1322, in PackString
pbvalue.set_stringvalue(unicode(value).encode('utf-8'))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 36: ordinal not in range(128)
[INFO ] Unexpected thread death: Thread-1
[INFO ] An error occurred. Shutting down...
..[ERROR ] Error in Thread-1: 'ascii' codec can't decode byte 0xe8 in position 36: ordinal not in range(128)
Is the error being generated by an issue with a base64 string, of which there is one in every row?
KGxwMAoobHAxCihTJ0JJT0VFJwpwMgpJMjYxMAp0cDMKYWEu
KGxwMAoobHAxCihTJ01BVEgnCnAyCkkyOTQwCnRwMwphYS4=
The data loader:
class CourseLoader(bulkloader.Loader):
    def __init__(self):
        bulkloader.Loader.__init__(self, 'Course',
            [('dept_code', str),
             ('number', int),
             ('title', str),
             ('full_description', str),
             ('unparsed_pre_reqs', str),
             ('pickled_pre_reqs', lambda x: base64.b64decode(x)),
             ('course_catalog_url', str),
             ('parse_succeeded', lambda x: x == 'success')
            ])

loaders = [CourseLoader]
Is there a way to tell from the traceback which row caused the error?
UPDATE: It looks like there are two characters causing errors: è and ®. How can I get Google App Engine to handle them?

Looks like some row of the CSV has some non-ascii data (maybe a LATIN SMALL LETTER E WITH GRAVE -- that's what 0xe8 would be in ISO-8859-1, for example) and yet you're mapping it to str (should be unicode, and I believe the CSV should be in utf-8).
To find if any row of a text file has non-ascii data, a simple Python snippet will help, e.g.:
>>> f = open('thefile.csv')
>>> prob = []
>>> for i, line in enumerate(f):
...     try: unicode(line)
...     except: prob.append(i)
...
>>> print 'Problems in %d lines:' % len(prob)
>>> print prob
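Applying that advice to the loader in the question: assuming the CSV is saved as UTF-8, the columns currently mapped to str can be routed through a small decoder instead. A sketch (the to_unicode helper is mine, not part of the bulkloader API):

```python
def to_unicode(value, encoding='utf-8'):
    """Decode a raw CSV cell to unicode; assumes the file is UTF-8 encoded."""
    if isinstance(value, bytes):  # a Python 2 str is a byte string
        return value.decode(encoding)
    return value

# In CourseLoader.__init__, map the text columns through the decoder
# instead of str, e.g.:
#     ('title', to_unicode),
#     ('full_description', to_unicode),
```

With that in place, ToPropertyPb receives unicode objects and never attempts an implicit ASCII decode.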

Related

UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 49: for textacy

I am using the textacy method to get synonyms.
import textacy.resources
rs = textacy.resources.ConceptNet()
syn = rs.get_synonyms('happy')
I get the below error
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\Dhiraj\Desktop\Work\QGen\lib\site-packages\textacy\resources\concept_net.py", line 353, in get_synonyms
return self._get_relation_values(self.synonyms, term, lang=lang, sense=sense)
File "C:\Users\Dhiraj\Desktop\Work\QGen\lib\site-packages\textacy\resources\concept_net.py", line 338, in synonyms
self._synonyms = self._get_relation_data("/r/Synonym", is_symmetric=True)
File "C:\Users\Dhiraj\Desktop\Work\QGen\lib\site-packages\textacy\resources\concept_net.py", line 162, in _get_relation_data
for row in rows:
File "C:\Users\Dhiraj\Desktop\Work\QGen\lib\site-packages\textacy\io\csv.py", line 96, in read_csv
for row in csv_reader:
File "C:\Python37\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 49: character maps to <undefined>
I have tried to enforce encoding='utf8' in both concept_net.py (line 162) and io\csv.py (line 96, in read_csv), but that gives another error:
raise EOFError("Compressed file ended before the "
EOFError: Compressed file ended before the end-of-stream marker was reached
What can be done?
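The EOFError points at a different problem than the encoding: the compressed ConceptNet dump that textacy cached appears to be truncated (e.g. an interrupted download), so deleting the cached file and letting textacy re-download it is the more likely fix than changing the encoding. A stdlib-only sketch reproducing that exact error from a deliberately truncated gzip file:

```python
import gzip
import os
import tempfile

# Build a small gzip file, then cut it in half to simulate an
# interrupted download of a compressed data dump.
path = os.path.join(tempfile.gettempdir(), 'truncated_sample.csv.gz')
with gzip.open(path, 'wb') as f:
    f.write(b'happy,glad\n' * 1000)
with open(path, 'r+b') as f:
    f.truncate(os.path.getsize(path) // 2)

# Reading it back fails with "Compressed file ended before the
# end-of-stream marker was reached" -- the same EOFError as above.
try:
    with gzip.open(path, 'rb') as f:
        f.read()
    truncated = False
except EOFError:
    truncated = True
```

(The cache location varies by platform and textacy version, so check textacy's data-directory settings rather than guessing a path.)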

ignore encoding error when parsing pdf with pdfminer

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdftypes import resolve1
fn = 'test.pdf'
with open(fn, mode='rb') as fp:
    parser = PDFParser(fp)
    doc = PDFDocument(parser)
    fields = resolve1(doc.catalog['AcroForm'])['Fields']
    item = {}
    for i in fields:
        field = resolve1(i)
        name, value = field.get('T'), field.get('V')
        item[name] = value
Hello, I need help with this code, as it raises a UnicodeDecodeError on some characters:
Traceback (most recent call last):
File "<stdin>", line 7, in <module>
File "/home/timmy/.local/lib/python3.8/site-packages/pdfminer/pdftypes.py", line 80, in resolve1
x = x.resolve(default=default)
File "/home/timmy/.local/lib/python3.8/site-packages/pdfminer/pdftypes.py", line 67, in resolve
return self.doc.getobj(self.objid)
File "/home/timmy/.local/lib/python3.8/site-packages/pdfminer/pdfdocument.py", line 673, in getobj
stream = stream_value(self.getobj(strmid))
File "/home/timmy/.local/lib/python3.8/site-packages/pdfminer/pdfdocument.py", line 676, in getobj
obj = self._getobj_parse(index, objid)
File "/home/timmy/.local/lib/python3.8/site-packages/pdfminer/pdfdocument.py", line 648, in _getobj_parse
raise PDFSyntaxError('objid mismatch: %r=%r' % (objid1, objid))
File "/home/timmy/.local/lib/python3.8/site-packages/pdfminer/psparser.py", line 85, in __repr__
return self.name.decode('ascii')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0: ordinal not in range(128)
Is there anything I can add so it "ignores" the characters it's not able to decode, or at least returns a blank name in name, value = field.get('T'), field.get('V')?
Any help is appreciated.
Here is one way you can fix it:
nano "/home/timmy/.local/lib/python3.8/site-packages/pdfminer/psparser.py"
Then, in line 85:
def __repr__(self):
    return self.name.decode('ascii', 'ignore')  # this fixes it
That said, editing installed source files isn't recommended; you should also report the issue on GitHub.
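If you would rather not patch the installed package, you can also blank out just the fields that fail, as the question asks. A sketch with a hypothetical extract_fields wrapper (pass it pdfminer's resolve1; the UnicodeDecodeError is caught per field, so the rest of the form still parses):

```python
def extract_fields(fields, resolver):
    """Collect (T, V) pairs from AcroForm field refs, blanking any field
    whose resolution raises a UnicodeDecodeError instead of aborting."""
    item = {}
    for ref in fields:
        try:
            field = resolver(ref)
            name, value = field.get('T'), field.get('V')
        except UnicodeDecodeError:
            name, value = None, None
        item[name] = value
    return item

# Usage in the question's script:
#     item = extract_fields(fields, resolve1)
```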

NLTK Python word_tokenize [duplicate]

This question already has answers here:
How to fix: "UnicodeDecodeError: 'ascii' codec can't decode byte"
(20 answers)
Python (nltk) - UnicodeDecodeError: 'ascii' codec can't decode byte
(4 answers)
Closed 4 years ago.
I have loaded a txt file that contains 6000 lines of sentences. I tried to split('\n') and word_tokenize the sentences, but I get the following error:
Traceback (most recent call last):
File "final.py", line 52, in <module>
short_pos_words = word_tokenize(short_pos)
File "/home/tuanct1997/anaconda2/lib/python2.7/site-packages/nltk/tokenize/__init__.py", line 128, in word_tokenize
sentences = [text] if preserve_line else sent_tokenize(text, language)
File "/home/tuanct1997/anaconda2/lib/python2.7/site-packages/nltk/tokenize/__init__.py", line 95, in sent_tokenize
return tokenizer.tokenize(text)
File "/home/tuanct1997/anaconda2/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1237, in tokenize
return list(self.sentences_from_text(text, realign_boundaries))
File "/home/tuanct1997/anaconda2/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1285, in sentences_from_text
return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
File "/home/tuanct1997/anaconda2/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1276, in span_tokenize
return [(sl.start, sl.stop) for sl in slices]
File "/home/tuanct1997/anaconda2/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1316, in _realign_boundaries
for sl1, sl2 in _pair_iter(slices):
File "/home/tuanct1997/anaconda2/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 313, in _pair_iter
for el in it:
File "/home/tuanct1997/anaconda2/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1291, in _slices_from_text
if self.text_contains_sentbreak(context):
File "/home/tuanct1997/anaconda2/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1337, in text_contains_sentbreak
for t in self._annotate_tokens(self._tokenize_words(text)):
File "/home/tuanct1997/anaconda2/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1472, in _annotate_second_pass
for t1, t2 in _pair_iter(tokens):
File "/home/tuanct1997/anaconda2/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 312, in _pair_iter
prev = next(it)
File "/home/tuanct1997/anaconda2/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 581, in _annotate_first_pass
for aug_tok in tokens:
File "/home/tuanct1997/anaconda2/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 546, in _tokenize_words
for line in plaintext.split('\n'):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 6: ordinal not in range(128)
The issue is the encoding of the file's content. Assuming you want to decode the str bytes to unicode as UTF-8:
Option 1 (a global hack; discouraged, and removed in Python 3):
import sys
reload(sys)
sys.setdefaultencoding('utf8')
Option 2:
Pass the encoding parameter when opening your text file. On Python 2 the built-in open has no such parameter, so use io.open:
import io
f = io.open('/path/to/txt/file', 'r+', encoding='utf-8')
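A runnable sketch of that approach, with a throwaway sample file standing in for the 6000-line corpus (io.open works on both Python 2 and 3; on Python 3 it is the built-in open):

```python
import io
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), 'sentences.txt')

# Write a sample file containing non-ASCII text.
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(u'caf\u00e9 ouvert ce soir.\nSecond sentence.\n')

# Read it back as unicode before handing it to word_tokenize;
# the tokenizer then never has to guess the encoding.
with io.open(path, 'r', encoding='utf-8') as f:
    short_pos = f.read()

lines = short_pos.split('\n')  # note: '\n', not '/n'
```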

decode subprocess.Popen and store in file

I wrote a script / addon for pyLoad.
Basically it executes FileBot with arguments.
What I am trying to do is get the output and store it in the pyLoad log file.
So far so good. It works until a single non-ASCII character needs to be decoded:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 5: ordinal not in range(128)
I don't know how to fix that. I hope you can help.
try:
    if self.getConfig('output_to_log') is True:
        log = open('Logs/log.txt', 'a')
        subprocess.Popen(args, stdout=log, stderr=log, bufsize=-1)
Thanks in advance
[edit]
28.05.2015 12:34:06 DEBUG FileBot-Hook: MKV-Checkup (package_extracted)
28.05.2015 12:34:06 DEBUG There are no archives here
28.05.2015 12:34:06 INFO FileBot: executed
28.05.2015 12:34:06 INFO FileBot: cleaning
Locking /usr/share/filebot/data/logs/amc.log
Done ヾ(@⌒ー⌒@)ノ
Parameter: exec = cd / && ./filebot.sh "{file}"
Parameter: clean = y
Parameter: skipExtract = y
Parameter: reportError = n
Parameter: storeReport = n
Parameter: artwork = n
Parameter: subtitles = de
Parameter: movieFormat = /mnt/HD/Medien/Movies/{n} ({y})/{n} ({y})
Parameter: seriesFormat = /mnt/HD/Medien/TV Shows/{n}/Season {s.pad(2)}/{n} - {s00e00} - {t}
Parameter: extras = n
So I'm guessing this:
Done ヾ(@⌒ー⌒@)ノ
is causing the issue.
When I open the log interface in the web GUI to view the log, this is the traceback:
Traceback (most recent call last):
File "/usr/share/pyload/module/lib/bottle.py", line 733, in _handle
return route.call(**args)
File "/usr/share/pyload/module/lib/bottle.py", line 1448, in wrapper
rv = callback(*a, **ka)
File "/usr/share/pyload/module/web/utils.py", line 113, in _view
return func(*args, **kwargs)
File "/usr/share/pyload/module/web/pyload_app.py", line 464, in logs
[pre_processor])
File "/usr/share/pyload/module/web/utils.py", line 30, in render_to_response
return t.render(**args)
File "/usr/share/pyload/module/lib/jinja2/environment.py", line 891, in render
return self.environment.handle_exception(exc_info, True)
File "/usr/share/pyload/module/web/templates/Next/logs.html", line 1, in top-level template code
{% extends 'Next/base.html' %}
File "/usr/share/pyload/module/web/templates/Next/base.html", line 179, in top-level template code
{% block content %}
File "/usr/share/pyload/module/web/templates/Next/logs.html", line 30, in block "content"
<tr><td class="logline">{{line.line}}</td><td>{{line.date}}</td><td class="loglevel">{{line.level}}</td><td>{{line.message}}</td></tr>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 5: ordinal not in range(128)
I found a solution:
proc = subprocess.Popen(args, stdout=subprocess.PIPE)
for line in proc.stdout:
    self.logInfo(line.decode('utf-8').rstrip('\r\n'))  # rstrip takes a set of characters, not a regex, so no '|'
proc.wait()
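The same decode-per-line pattern, demonstrated with a stand-in child process so it runs anywhere (the real addon would pass the FileBot args instead, and call self.logInfo rather than appending to a list):

```python
import subprocess
import sys

# Child process that emits a non-ASCII line, standing in for FileBot.
child = [sys.executable, '-c',
         "import sys; sys.stdout.buffer.write(u'Done \\u30fe\\n'.encode('utf-8'))"]

proc = subprocess.Popen(child, stdout=subprocess.PIPE)
captured = []
for raw in proc.stdout:  # raw is a bytes line
    captured.append(raw.decode('utf-8').rstrip('\r\n'))
proc.wait()
```

Decoding each line explicitly keeps the web GUI's template from ever seeing raw bytes, which is what triggered the ASCII decode in Jinja2.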

rendering a file in python using pygal - ascii code error

I am trying to create a pygal chart in python and saving it to a .svg file.
# Creating pygal charts
pie_chart = pygal.Pie(style=DarkSolarizedStyle, legend_box_size=20, pretty_print=True)
pie_chart.title = 'Github-Migration Status Chart (in %)'
pie_chart.add('Intro', int(intro))
pie_chart.add('Parallel', int(parallel))
pie_chart.add('In Progress', int(in_progress))
pie_chart.add('Complete', int(complete))
pie_chart.render_to_file('../../../../../usr/share/nginx/html/TeamFornax/githubMigration/OverallProgress/overallProgress.svg')
This simple piece of code seems to give the following error:
Traceback (most recent call last):
File "/home/ec2-user/githubr/migrationcharts.py", line 161, in <module>
pie_chart.render_to_file('../../../../../usr/share/nginx/html/TeamFornax/githubMigration/OverallProgress/overallProgress.svg')
File "/usr/lib/python2.6/site-packages/pygal/ghost.py", line 149, in render_to_file
f.write(self.render(is_unicode=True, **kwargs))
File "/usr/lib/python2.6/site-packages/pygal/ghost.py", line 112, in render
.render(is_unicode=is_unicode))
File "/usr/lib/python2.6/site-packages/pygal/graph/base.py", line 293, in render
is_unicode=is_unicode, pretty_print=self.pretty_print)
File "/usr/lib/python2.6/site-packages/pygal/svg.py", line 271, in render
self.root, **args)
File "/usr/lib64/python2.6/xml/etree/ElementTree.py", line 1010, in tostring
return string.join(data, "")
File "/usr/lib64/python2.6/string.py", line 318, in join
return sep.join(words)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 40: ordinal not in range(128)
Any idea why?
Try decoding the path string you pass to render_to_file to unicode, like so:
pie_chart.render_to_file('path/to/overallProgress.svg'.decode('utf-8'))
The decoding charset should be consistent with your source file's encoding.
