Scrapy response.replace encoding error - python

I am trying to replace the response body with a single search result block from a Google search results page using response.replace(), and I am running into encoding issues.
scrapy shell "http://www.google.de/search?q=Zuckerccc"
>>> srb = hxs.select("//li[@class='g']").extract()
>>> body = '<html><body>' + srb[0] + '</body></html>' # get only 1st search result block
>>> b = response.replace(body = body)
Traceback (most recent call last):
File "<console>", line 1, in <module>
File "scrapy/lib/python2.6/site-packages/scrapy/http/response/text.py", line 54, in replace
return Response.replace(self, *args, **kwargs)
File "scrapy/lib/python2.6/site-packages/scrapy/http/response/__init__.py", line 77, in replace
return cls(*args, **kwargs)
File "scrapy/lib/python2.6/site-packages/scrapy/http/response/text.py", line 31, in __init__
super(TextResponse, self).__init__(*args, **kwargs)
File "scrapy/lib/python2.6/site-packages/scrapy/http/response/__init__.py", line 19, in __init__
self._set_body(body)
File "scrapy/lib/python2.6/site-packages/scrapy/http/response/text.py", line 48, in _set_body
self._body = body.encode(self._encoding)
File "../local_1/Linux-2.6c2.5-x86_64/Python/Python-147.0-0/lib/python2.6/encodings/cp1252.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u0131' in position 529: character maps to <undefined>
I tried creating my own response as well:
>>> x = HtmlResponse("http://www.google.de/search?q=Zuckerccc", body = body, encoding = response.encoding)
Traceback (most recent call last):
File "<console>", line 1, in <module>
File "scrapy/lib/python2.6/site-packages/scrapy/http/response/text.py", line 31, in __init__
super(TextResponse, self).__init__(*args, **kwargs)
File "scrapy/lib/python2.6/site-packages/scrapy/http/response/__init__.py", line 19, in __init__
self._set_body(body)
File "scrapy/lib/python2.6/site-packages/scrapy/http/response/text.py", line 48, in _set_body
self._body = body.encode(self._encoding)
File "../local_1/Linux-2.6c2.5-x86_64/Python/Python-147.0-0/lib/python2.6/encodings/cp1252.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u0131' in position 529: character maps to <undefined>
Also, when I pass _body_declared_encoding() as the encoding to replace(), it works:
replace(body = body, encoding = response._body_declared_encoding())
I don't understand why response._body_declared_encoding() and response.encoding are different. Can anybody please shed some light on this?
So, what would be a good way to fix this?

I successfully replaced the response body with these lines of code:
scrapy shell "http://www.google.de/search?q=Zuckerccc"
>>> google_result = response.xpath('//li[@class="g"]').extract()[0]
>>> body = '<html><body>' + google_result + '</body></html>'
>>> b = response.replace(body = body)

I checked the source code of scrapy.http.response.text: when we use TextResponse, we need to set self._encoding first. So we can do this:
>>> response._encoding = 'utf8'
>>> response._set_body("aaaaaa")
>>> response.body
'aaaaaa'
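The underlying mismatch can be reproduced without Scrapy at all. A minimal sketch (the sample body text is made up; U+0131, dotless i, is the exact character from the traceback): the header-derived encoding (cp1252) cannot represent the character, while the body-declared encoding (utf-8) can, which is presumably why passing _body_declared_encoding() to replace() succeeds.

```python
# U+0131 (LATIN SMALL LETTER DOTLESS I) is the character from the traceback.
body = u'<html><body>K\u0131z\u0131l</body></html>'

# Encoding with the header-derived encoding (cp1252) fails: that codec
# simply has no mapping for U+0131.
try:
    body.encode('cp1252')
    raised = False
except UnicodeEncodeError:
    raised = True
assert raised

# Encoding with the body-declared encoding (utf-8) round-trips cleanly.
assert body.encode('utf-8').decode('utf-8') == body
```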

Related

ignore encoding error when parsing pdf with pdfminer

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdftypes import resolve1

fn = 'test.pdf'
with open(fn, mode='rb') as fp:
    parser = PDFParser(fp)
    doc = PDFDocument(parser)
    fields = resolve1(doc.catalog['AcroForm'])['Fields']
    item = {}
    for i in fields:
        field = resolve1(i)
        name, value = field.get('T'), field.get('V')
        item[name] = value
Hello, I need help with this code, as it gives me a Unicode error on some characters:
Traceback (most recent call last):
File "<stdin>", line 7, in <module>
File "/home/timmy/.local/lib/python3.8/site-packages/pdfminer/pdftypes.py", line 80, in resolve1
x = x.resolve(default=default)
File "/home/timmy/.local/lib/python3.8/site-packages/pdfminer/pdftypes.py", line 67, in resolve
return self.doc.getobj(self.objid)
File "/home/timmy/.local/lib/python3.8/site-packages/pdfminer/pdfdocument.py", line 673, in getobj
stream = stream_value(self.getobj(strmid))
File "/home/timmy/.local/lib/python3.8/site-packages/pdfminer/pdfdocument.py", line 676, in getobj
obj = self._getobj_parse(index, objid)
File "/home/timmy/.local/lib/python3.8/site-packages/pdfminer/pdfdocument.py", line 648, in _getobj_parse
raise PDFSyntaxError('objid mismatch: %r=%r' % (objid1, objid))
File "/home/timmy/.local/lib/python3.8/site-packages/pdfminer/psparser.py", line 85, in __repr__
return self.name.decode('ascii')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0: ordinal not in range(128)
Is there anything I can add so it "ignores" the characters it is not able to decode, or at least returns the name with a blank value in name, value = field.get('T'), field.get('V')?
Any help is appreciated.
Here is one way you can fix it:
nano "/home/timmy/.local/lib/python3.8/site-packages/pdfminer/psparser.py"
then in line 85:
def __repr__(self):
    return self.name.decode('ascii', 'ignore')  # this fixes it
I don't believe it's recommended to edit installed packages, though; you should also post an issue on GitHub.
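For context on what that one-word change does, here is a small standalone sketch of Python's error handlers, using the exact byte (0xe5) from the traceback; the names here are illustrative, not from pdfminer:

```python
# The byte 0xe5 from the traceback is not valid ASCII.
raw = b'\xe5name'

# Strict decoding (the default) raises, exactly as in the traceback.
try:
    raw.decode('ascii')
    raised = False
except UnicodeDecodeError:
    raised = True
assert raised

# errors='ignore' silently drops the undecodable byte.
assert raw.decode('ascii', 'ignore') == 'name'

# errors='replace' keeps a U+FFFD placeholder instead, which is often
# easier to spot when debugging.
assert raw.decode('ascii', 'replace') == '\ufffdname'
```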

Scrapy: 'ascii' codec can't encode characters

I am having a problem running my crawler:
UnicodeEncodeError: 'ascii' codec can't encode characters in position
I am using this code:
author = str(info.css(".author::text").extract_first())
but I still get that error. Any idea how I can solve it?
Thank you!
Here's the error
Traceback (most recent call last):
File "/usr/local/lib/python2.7/site-packages/scrapy/utils/defer.py", line
102, in iter_errback
yield next(it)
File "/usr/local/lib/python2.7/site-packages/sh_scrapy/middlewares.py", line 30, in process_spider_output
for x in result:
File "/usr/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
for x in result:
File "/usr/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
return (_set_referer(r) for r in result or ())
File "/usr/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "/usr/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "/app/__main__.egg/teslamotorsclub_spider/spiders/teslamotorsclub.py", line 40, in parse
author = str(info.css(".author::text").extract_first())
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
Try:
author = info.css(".author::text").extract_first()
The reason for this is that extract_first() already returns a unicode string. Wrapping it in str() makes Python 2 encode it with the default ASCII codec, which fails on the first non-ASCII character. Just drop the str() call, or, if you really need a byte string, encode explicitly with .encode('utf-8'); UTF-8 will handle just about anything you throw at it.
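A quick sketch of the failure mode, independent of Scrapy (the author name here is invented): an ASCII encode of a unicode string, which is what Python 2's str() does implicitly, fails on non-ASCII text, while an explicit UTF-8 encode does not.

```python
# A non-ASCII author name, as extract_first() might return.
author = u'Ren\xe9e \u2665'

# Python 2's str() implicitly encodes with the ASCII codec; this is the
# step that raises in the traceback.
try:
    author.encode('ascii')
    raised = False
except UnicodeEncodeError:
    raised = True
assert raised

# An explicit UTF-8 encode succeeds for any character.
assert author.encode('utf-8').decode('utf-8') == author
```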

UnicodeDecodeError: 'utf8' codec can't decode byte 0x89 in position 51: invalid start byte

An error occurred when running "QRCODE.py":
from pyshorteners import Shortener

class Shortening():
    def __init__(self):
        self.shortener = Shortener('Tinyurl')
        fo = open('/home/jayu/Desktop/qr.png', 'r+')
        apiKey = fo.read()
        self.shortener = Shortener('Google', api_key=apiKey)

    def shortenURL(self):
        self.url = raw_input("Enter The Url to shortener : ")
        shortURL = self.shortener.short(self.url)
        print("the short url : " + shortURL)

    def decodeURL(self):
        self.url = raw_input("Enter The Url to expand: ")
        expandURL = self.shortener.expand(self.url)
        print("the expanded url : " + expandURL)

    def generateQRcode(self):
        self.url = raw_input("Enter the URL to get QR code :")
        self.shortener.short(self.url)
        print(self.shortener.qrcode(150, 150))

app = Shortening()
option = int(input("Enter ur choice : "))
if option == 1:
    app.shortenURL()
elif option == 2:
    app.decodeURL()
elif option == 3:
    app.generateQRcode()
else:
    print("wrong ")
jayu#jayu:~/Desktop$ python QRCODE.py
Enter ur choice : 3
Enter the URL to get QR code :http://www.google.com
Traceback (most recent call last):
File "QRCODE.py", line 29, in <module>
app.generateQRcode()
File "QRCODE.py", line 19, in generateQRcode
self.shortener.short(self.url)
File "/home/jayu/.local/lib/python2.7/site-packages/pyshorteners/shorteners/__init__.py", line 115, in short
self.shorten = self._class(**self.kwargs).short(url)
File "/home/jayu/.local/lib/python2.7/site-packages/pyshorteners/shorteners/googl.py", line 25, in short
response = self._post(url, data=params, headers=headers)
File "/home/jayu/.local/lib/python2.7/site-packages/pyshorteners/shorteners/base.py", line 32, in _post
timeout=self.kwargs['timeout'])
File "/home/jayu/.local/lib/python2.7/site-packages/requests/api.py", line 112, in post
return request('post', url, data=data, json=json, **kwargs)
File "/home/jayu/.local/lib/python2.7/site-packages/requests/api.py", line 58, in request
return session.request(method=method, url=url, **kwargs)
File "/home/jayu/.local/lib/python2.7/site-packages/requests/sessions.py", line 498, in request
prep = self.prepare_request(req)
File "/home/jayu/.local/lib/python2.7/site-packages/requests/sessions.py", line 441, in prepare_request
hooks=merge_hooks(request.hooks, self.hooks),
File "/home/jayu/.local/lib/python2.7/site-packages/requests/models.py", line 309, in prepare
self.prepare_url(url, params)
File "/home/jayu/.local/lib/python2.7/site-packages/requests/models.py", line 359, in prepare_url
url = url.decode('utf8')
File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x89 in position 51: invalid start byte
What is the cause of the error? My Python version is 2.7.15rc1.
Each time I ran python QRCODE.py I got the same position N in the traceback.
Can anyone correct me?
If you have this problem in the open(...) call, you need to set the encoding in that function:
fo = open(filename, mode, encoding='UTF-8')
but that works only in Python 3; in Python 2 you need to use io.open:
fo = io.open(filename, mode, encoding='UTF-8')
I don't know the full syntax offhand (Google it), but I already answered a similar question here: unable to decode this string using python
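A small self-contained sketch of the io.open approach (the file path and text are made up): write and read a text file with an explicit encoding, which works the same way on Python 2 and Python 3.

```python
import io
import os
import tempfile

# Write and read a file with an explicit encoding; io.open behaves the
# same on Python 2 and Python 3.
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
with io.open(path, 'w', encoding='utf-8') as fo:
    fo.write(u'Zucker \xfc\xdf')
with io.open(path, 'r', encoding='utf-8') as fo:
    content = fo.read()
assert content == u'Zucker \xfc\xdf'
```

Note that this only helps for text files; the qr.png in the question is a PNG, i.e. binary data (0x89 is in fact the first byte of the PNG signature), and binary files should be opened with mode='rb' and no text encoding at all.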

Python Twitter Unicode Error 'UCS-2'

Hello, when I try to get some tweets with nltk or tweepy, I get an error like this:
>>> from nltk.twitter import Twitter
>>> tw = Twitter()
>>> tw.tweets(keywords='love', limit=50)
Tweet output:
RT @BookOProverbs: Love your neighbor as yourself. -Mat 22:37
RT @davebernstein: Dear @SpeakerRyan & @GOP:
You had 8 years– 8 years to come up with a replacement to #ACA.
ERROR
Traceback (most recent call last):
File "<pyshell#5>", line 1, in <module>
tw.tweets(keywords='love', limit=50)
File "C:\Python\lib\site-packages\nltk\twitter\twitterclient.py", line 380, in tweets
self.streamer.filter(track=keywords, follow=follow, lang=lang)
File "C:\Python\lib\site-packages\nltk\twitter\twitterclient.py", line 118, in filter
self.statuses.filter(track=track, follow=follow, lang=lang)
File "C:\Python\lib\site-packages\twython\streaming\types.py", line 66, in filter
self.streamer._request(url, 'POST', params=params)
File "C:\Python\lib\site-packages\twython\streaming\api.py", line 154, in _request
if self.on_success(data): # pragma: no cover
File "C:\Python\lib\site-packages\nltk\twitter\twitterclient.py", line 73, in on_success
self.handler.handle(data)
File "C:\Python\lib\site-packages\nltk\twitter\twitterclient.py", line 404, in handle
print(text)
UnicodeEncodeError: 'UCS-2' codec can't encode characters in position 29-29: Non-BMP character not supported in Tk
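This error means a tweet contained a character outside the Basic Multilingual Plane (above U+FFFF, typically an emoji), which Tk-based consoles such as IDLE cannot render. One common workaround, sketched below (the helper name is my own, not from nltk), is to strip or replace non-BMP characters before printing:

```python
import re

# Matches any character outside the Basic Multilingual Plane (> U+FFFF),
# e.g. most emoji, which Tk-based consoles like IDLE cannot display.
NON_BMP = re.compile(u'[\U00010000-\U0010FFFF]')

def strip_non_bmp(text, replacement=u''):
    """Drop (or replace) characters Tk cannot render."""
    return NON_BMP.sub(replacement, text)

assert strip_non_bmp(u'love \U0001F600!') == u'love !'
assert strip_non_bmp(u'love \U0001F600!', u'\ufffd') == u'love \ufffd!'
```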

How to avoid scrapy UnicodeEncodeError

I have the following code in my parse_item callback:
sel = Selector(response)
item['name'] = sel.xpath('//div[@class="productDescriptionBlock"]/h2/text()').extract()[0]
return item
But I get UnicodeEncodeError:
exceptions.UnicodeEncodeError: 'charmap' codec can't encode character u'\uff01' in position 271761: character maps to <undefined>
I also tried adding .encode('utf-8'), but I still get the same error.
Traceback (most recent call last):
File "/home/scraper/.fakeroot/lib/python2.7/site-packages/twisted/internet/base.py", line 824, in runUntilCurrent
call.func(*call.args, **call.kw)
File "/home/scraper/.fakeroot/lib/python2.7/site-packages/twisted/internet/task.py", line 638, in _tick
taskObj._oneWorkUnit()
File "/home/scraper/.fakeroot/lib/python2.7/site-packages/twisted/internet/task.py", line 484, in _oneWorkUnit
result = next(self._iterator)
File "/home/scraper/.fakeroot/lib/python2.7/site-packages/scrapy/utils/defer.py", line 57, in <genexpr>
work = (callable(elem, *args, **named) for elem in iterable)
--- <exception caught here> ---
File "/home/scraper/.fakeroot/lib/python2.7/site-packages/scrapy/utils/defer.py", line 96, in iter_errback
yield next(it)
File "/home/scraper/.fakeroot/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/offsite.py", line 23, in process_spider_output
for x in result:
File "/home/scraper/.fakeroot/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/referer.py", line 22, in <genexpr>
return (_set_referer(r) for r in result or ())
File "/home/scraper/.fakeroot/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/urllength.py", line 33, in <genexpr>
return (r for r in result or () if _filter(r))
File "/home/scraper/.fakeroot/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/depth.py", line 50, in <genexpr>
return (r for r in result or () if _filter(r))
File "/home/scraper/.fakeroot/lib/python2.7/site-packages/scrapy/contrib/spiders/crawl.py", line 73, in _parse_response
for request_or_item in self._requests_to_follow(response):
File "/home/scraper/.fakeroot/lib/python2.7/site-packages/scrapy/contrib/spiders/crawl.py", line 52, in _requests_to_follow
links = [l for l in rule.link_extractor.extract_links(response) if l not in seen]
File "/home/scraper/.fakeroot/lib/python2.7/site-packages/scrapy/contrib/linkextractors/sgml.py", line 124, in extract_links
).encode(response.encoding)
File "/home/scraper/.fakeroot/lib/python2.7/encodings/cp1252.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_table)
exceptions.UnicodeEncodeError: 'charmap' codec can't encode character u'\x99' in position 349751: character maps to <undefined>
I've seen this before. If I'm not wrong, you are using the restrict_xpaths parameter in your rule's link extractor.
Possible solutions are:
Avoid using restrict_xpaths for that particular site. This error happens because the page content contains characters not defined in the declared encoding.
Identify the invalid characters and replace them before the rule acts on the response. This can be tricky, though.
Use the middleware in this answer to re-encode the response into its declared encoding: UnicodeEncodeError after setting restrict_xpaths settings
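To illustrate the re-encoding idea outside of Scrapy (the sample text is invented; U+0099 is the exact character from the traceback, one that cp1252 cannot map to any byte): encoding with errors='replace' substitutes '?' for unmappable characters, which is the kind of cleanup a re-encoding middleware would perform before the link extractor runs.

```python
# U+0099 is the character from the traceback: cp1252 has no mapping for it
# (the byte 0x99 decodes to U+2122 instead, so U+0099 can never round-trip).
text = u'caf\xe9 \u0099'

# A strict encode fails, as in the traceback.
try:
    text.encode('cp1252')
    raised = False
except UnicodeEncodeError:
    raised = True
assert raised

# errors='replace' substitutes '?' for unmappable characters, so the
# encode always succeeds.
assert text.encode('cp1252', errors='replace') == b'caf\xe9 ?'
```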
