unicode issue with scraped data via scrapy

I have been struggling for the last two weeks to handle some data that I scraped with Scrapy. I am using Python 2.7 on Windows 7. This is a small snippet of the data, extracted through a Scrapy XPath selector:
{'city': [u'Mangenberger Str.\xa0162', u'42655\xa0Solingen']}
The data comes from a page that is UTF-8 encoded, or at least that is what its header says:
Content-Type: text/html;charset=utf-8
So I believe that I need to decode them in order to get:
Mangenberger Str. 16242655 Solingen
This is what I am getting in my console:
>>> s='Mangenberger Str.\xc2\xa016242655\xc2\xa0Solingen'
>>> s1=s.decode('utf-8')
>>> print s1
Mangenberger Str. 16242655 Solingen
Perfect!
But this is far from what I get when I run my script. I tried both encoding and decoding:
utf-8 encoding
{'city': 'Mangenberger Str.\xc2\xa016242655\xc2\xa0Solingen'}
exceptions.UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 17:
utf-8-sig encoding
{'city': '\xef\xbb\xbfMangenberger Str.\xc2\xa016242655\xc2\xa0Solingen'}
exceptions.UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0:
utf-8 decoding
exceptions.UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 17:
utf-8-sig decoding
exceptions.UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 17:
Encode code:
item['city']= "".join(element.select('//div[@id="bubble_2"]/div/text()').extract()).encode('utf-8')
Decode code:
item['city']= "".join(element.select('//div[@id="bubble_2"]/div/text()').extract()).decode('utf-8')
From what I understand, the BOM is the problem when I try to decode this string? But then why does it work without problems in my console, yet fail with an error once I run it in Scrapy?

\xa0 in that Python unicode string is the non-breaking space character (U+00A0).
u'Mangenberger Str.\xa0162' and u'42655\xa0Solingen' are perfectly valid unicode strings, and Python works with unicode strings wonderfully.
Scrapy XPath selector extract() calls return a list of unicode strings, and dealing with unicode all along is usually the way to go.
I would NOT recommend encoding the unicode strings to something else in your Scrapy code.
(And encoding is what you would be after here: decoding is for converting non-unicode byte strings to unicode strings. Calling .decode() on a unicode string makes Python 2 encode it with the ASCII codec first, which is where your UnicodeEncodeError comes from.)
The only step where it makes sense to encode the strings is at the end, when exporting the data (CSV, XML), and even that is handled already.
Maybe you can explain what trouble these unicode strings are causing you.
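A minimal sketch of the "stay in unicode, encode at the end" approach, using the values quoted in the question. It is written so that it also runs on Python 3 (the u'' prefixes are kept for parity with the Python 2.7 code above). Joining with a space is an assumption about the desired output, and the variable names are illustrative only.

```python
# Values shaped like the extract() result quoted in the question.
parts = [u'Mangenberger Str.\xa0162', u'42655\xa0Solingen']

# Work in unicode throughout: join the pieces and turn the non-breaking
# space (U+00A0) into an ordinary space.
city = u' '.join(parts).replace(u'\xa0', u' ')

# Encode once, at the output boundary (file, socket, exporter).
city_bytes = city.encode('utf-8')

# What Python 2 does behind the scenes when .decode() is called on a
# unicode object: an implicit ASCII encode, which fails on U+00A0.
try:
    parts[0].encode('ascii')
    implicit_encode_failed = False
    failing_position = None
except UnicodeEncodeError as exc:
    implicit_encode_failed = True
    failing_position = exc.start  # index of the offending character
```

The failing index matches the "position 17" reported in the question's UnicodeEncodeError, which suggests the error comes from that implicit ASCII step rather than from a BOM.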

Related

Error : 'utf-8' codec can't decode byte 0xb0 in position 14: invalid start byte

I'm a beginner at Python, and I would like to read multiple CSV files. When I read them with encoding = "ISO-8859-1", I get this kind of character in my data: "DÂ°faut". So I tried utf-8 instead, and I get this error: 'utf-8' codec can't decode byte 0xb0 in position 14: invalid start byte.
Can someone help me, please?
Thank you!

If you decode with utf-8, the data must also have been encoded with utf-8.
utf-8 needs multiple bytes to store every character outside ASCII (basically everything except basic Latin letters, digits and the usual symbols). Since the file is read byte by byte, the decoder has to know whether a byte starts a new character or continues the previous one, and this is indicated by the most significant bits of the byte. 0xb0 translates to 1011 0000 in binary; the leading bits 10 mark a continuation byte, which is only valid after a lead byte such as 0xC2. In your file 0xb0 shows up where a new character should start, which is what "invalid start byte" means: the byte was most likely written as iso-8859-1, where 0xb0 on its own is the degree symbol, so decoding it as utf-8 fails.
If you want to encode the degree symbol (°) in utf-8, it would be encoded as 0xC2 0xB0.
In any case: always decode with the same encoding that was used to encode. If you need characters outside the code page, use utf-8. In general, using one of the UTF encodings is good advice.
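The byte values mentioned above can be checked directly. This is a small sketch, not taken from the original answer; the variable names are illustrative.

```python
degree = u'\xb0'  # the degree symbol

# The same character has different byte representations per encoding.
latin1_bytes = degree.encode('iso-8859-1')  # a single byte, 0xB0
utf8_bytes = degree.encode('utf-8')         # two bytes, 0xC2 0xB0

# Decoding the Latin-1 byte as UTF-8 fails: 0xB0 is a continuation byte
# and cannot begin a character.
try:
    latin1_bytes.decode('utf-8')
    error_reason = None
except UnicodeDecodeError as exc:
    error_reason = exc.reason

# Decoding the UTF-8 bytes as Latin-1 does not fail, but yields two
# characters instead of one (the 'Â°' seen in the question).
mojibake = utf8_bytes.decode('iso-8859-1')
```

Seeing both symptoms in the same set of files may mean the files do not all share one encoding, though that is only a guess from the two messages quoted in the question.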

selenium unicode encode error

When retrieving the content of a Google search result page I get this error:
print driver.find_element_by_tag_name('body').get_attribute('innerHTML')
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe6' in position 15663: ordinal not in range(128)
I'm calling the python script from PHP like this
exec('python selenium_scrape.py');
This solves the problem, but then all unicode chars are encoded twice:
print driver.find_element_by_tag_name('body').get_attribute('innerHTML').encode('utf-8')

That's probably because you're printing to a stdout that uses ASCII (7 bit) encoding. Call Python with a locale setting that uses utf-8, or do some appropriate encoding of the (unicode) HTML content to a 7-bit character string first.
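To illustrate the difference between an ASCII and a UTF-8 output stream without touching the real stdout, the sketch below writes to an in-memory stand-in. The emit helper and the sample HTML fragment are hypothetical. For the real script, setting the PYTHONIOENCODING environment variable to utf-8 before launching Python is one way to select the stdout encoding.

```python
import io

def emit(text, encoding):
    """Write text to an in-memory stand-in for stdout; return the bytes written."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding)
    stream.write(text)
    stream.flush()
    return raw.getvalue()

html = u'<b>\xe6bler</b>'  # hypothetical page fragment containing U+00E6

# An ASCII stream cannot represent U+00E6, so writing fails.
try:
    emit(html, 'ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# A UTF-8 stream encodes the same text without trouble.
utf8_output = emit(html, 'utf-8')
```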

Try to encode the text before printing:
print driver.find_element_by_tag_name('body').get_attribute('innerHTML').encode("utf-8")

Python: Unicode problems

I am getting an error at this line
logger.info(u"Data: {}".format(data))
I'm getting this error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 4: ordinal not in range(128)
Before that line, I tried adding data = data.decode('utf8') and I still get the same error.
I tried data = data.encode('utf8') and it says UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 4: ordinal not in range(128)
How do I fix this? I don't know if I should encode or decode but neither works.

Use a str (byte string) literal for the template, and encode the data first:
if isinstance(data, unicode):
    data = data.encode('utf8')
logger.info("Data: {}".format(data))
The logging module needs you to pass in string values as these values are passed on unaltered to formatters and the handlers. Writing log messages to a file means that unicode values are encoded with the default (ASCII) codec otherwise. But you also need to pass in a bytestring value when formatting.
Passing in a str value into a unicode .format() template leads to decoding errors, passing in a unicode value into a str .format() template leads to encoding errors, and passing a formatted unicode value to logger.info() leads to encoding errors too.
Better not to mix the two types, and to encode explicitly beforehand.
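The "do not mix types" rule can also be applied in the other direction, by converting everything to text before formatting. The sketch below shows that variant rather than the byte-string approach of this answer, and it is written to run on Python 3, where text is the only type str.format and logging expect. The to_text helper, the logger name and the sample data are hypothetical.

```python
import io
import logging

def to_text(value, encoding='utf-8'):
    """Return value as text, decoding byte strings with the given encoding."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

# Log into an in-memory text buffer so the result can be inspected.
log_output = io.StringIO()
logger = logging.getLogger('unicode_demo')  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(logging.StreamHandler(log_output))

# UTF-8 bytes with U+2019 at index 4, as in the error messages above.
data = b'What\xe2\x80\x99s new'

# Template and argument are both text, so no implicit conversion happens.
logger.info(u"Data: {}".format(to_text(data)))
logged = log_output.getvalue()
```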

You could do something such as
data.decode('utf-8').encode("ascii",errors="ignore")
This drops every character that cannot be represented in ASCII.
edit: data.encode('ascii',errors='ignore') may be enough if data is already a unicode string, but I'm not in a position to test this currently.
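A quick check of what the 'ignore' error handler does, next to 'replace' for comparison. The sample bytes are made up for illustration. Note that both handlers lose information, which may or may not be acceptable for a log message.

```python
# UTF-8 bytes containing a right single quotation mark (U+2019).
data = b'What\xe2\x80\x99s new'

text = data.decode('utf-8')

# 'ignore' silently drops characters that ASCII cannot represent.
ascii_ignored = text.encode('ascii', errors='ignore')

# 'replace' substitutes a question mark, which keeps the loss visible.
ascii_replaced = text.encode('ascii', errors='replace')
```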

Printing decoded JSON string

I am receiving a JSON string, passing it through json.loads, and ending up with an array of unicode strings. That's all well and good. One of the strings in the array is:
u'\xc3\x85sum'
This should translate into 'Åsum' when decoded using decode('utf8'), but instead I get an error:
UnicodeEncodeError: 'charmap' codec can't encode character u'\x85' in position 1: character maps to <undefined>
To test what's wrong I did the following
'Åsum'.encode('utf8')
'\xc3\x85sum'
print '\xc3\x85sum'.decode('utf8')
Åsum
So that worked fine, but if I make it to a unicode string as json.loads does I get the same error:
print u'\xc3\x85sum'.decode('utf8')
UnicodeEncodeError: 'charmap' codec can't encode character u'\x85' in position 1: character maps to <undefined>
I tried doing json.loads(jsonstring, encoding = 'utf8') but that changes nothing.
Is there a way to solve this, either by making json.loads not return unicode or by making it decode using 'utf8' as I ask it to?
Edit:
The original string I receive looks like this, or at least the part that causes trouble:
"\\u00c3\\u0085sum"

You already have a Unicode value, so trying to decode it forces an encode first, using the default codec.
It looks like you received malformed JSON instead; JSON values are already unicode. If you have UTF-8 data in your Unicode values, the only way to recover is to encode to Latin-1 (which maps the first 256 codepoints to bytes one-to-one), then decode from that as UTF-8:
>>> print u'\xc3\x85sum'.encode('latin1').decode('utf8')
Åsum
The better solution is to fix the JSON source, however; it should not doubly-encode to UTF-8. The correct representation would be:
>>> json.dumps(u'Åsum')
'"\\u00c5sum"'
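The whole round trip can be reproduced from the JSON text quoted in the question. This is a sketch that also runs on Python 3; the variable names are illustrative.

```python
import json

# The JSON text from the question: UTF-8 bytes 0xC3 0x85 were escaped
# as if they were two separate characters.
received = '"\\u00c3\\u0085sum"'

value = json.loads(received)  # two wrong characters followed by 'sum'

# Latin-1 maps code points 0-255 straight back to bytes, which can
# then be decoded as the UTF-8 they originally were.
repaired = value.encode('latin-1').decode('utf-8')

# What a correct producer would have sent for the same string.
correct_json = json.dumps(u'\xc5sum')
```

The repair only works when every character of the damaged string is below U+0100, so it is a workaround for this specific kind of double encoding rather than a general fix.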

Use latin characters in appengine

How can I store Latin characters in App Engine (e.g. "peña")? When I try to store this I get this error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf1 in position 2: ordinal not in range(128)
I can replace the Ñ with N, but is there not another, better way?
And if I encode the value, how can I print "Peña" again?

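GAE stores strings as unicode. Perhaps decode your byte string to unicode before saving it, using whichever encoding the bytes are actually in (the 0xf1 in your error suggests Latin-1 rather than UTF-8), and encode it again only when you print or write it out.
value = "peña"
value = value.decode("latin-1")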
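A sketch of the round trip, assuming the incoming bytes are Latin-1 (the 0xf1 at position 2 in the error message is consistent with "peña" in that encoding). The byte literal and the variable names are illustrative, and nothing here is App Engine specific.

```python
# Incoming byte string, written with an explicit escape so the example
# does not depend on the encoding of the source file.
raw = b'pe\xf1a'

# Decode to unicode once, on the way in; this is the value to store.
value = raw.decode('latin-1')

# Text operations now treat the accented letter as one character.
display = value.capitalize()

# Encode only on the way out, in whatever encoding the consumer expects.
utf8_out = display.encode('utf-8')
```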

From the error ("Unicode Decode Error"), it seems you could have more luck using Unicode - I'd try UTF-8.
