BeautifulSoup: scraping Spanish characters issue - Python

I'm trying to get some Spanish text from a website using BeautifulSoup and urllib2. The text should read ¡Hola! ¿Cómo estás?, but what I currently get comes out garbled.
I have tried applying the different unicode functions I have seen on related threads, but nothing seems to work for my issue:
# import the main window object (mw) from aqt
from aqt import mw
# import the "show info" tool from utils.py
from aqt.utils import showInfo
# import all of the Qt GUI library
from aqt.qt import *
from BeautifulSoup import BeautifulSoup
import urllib2
wiki = "http://spanishdict.com/translate/hola"
page = urllib2.urlopen(wiki)
soup = BeautifulSoup(page)
dictionarydiv = soup.find("div", { "class" : "dictionary-neodict-example" })
dictionaryspans = dictionarydiv.contents
firstspan = dictionaryspans[0]
firstspantext = firstspan.contents
thetext = firstspantext[0]
thetextstring = str(thetext)

thetext is of type <class 'BeautifulSoup.NavigableString'>, which is a Unicode string. Printing it encodes it with the output terminal's encoding:
print thetext
Output (in a Windows console):
¡Hola! ¿Cómo estás?
This works on any terminal whose configured encoding supports the Unicode characters being printed; if the terminal's encoding doesn't support them, you'll get a UnicodeEncodeError instead.
Using str on that type returns a byte string, in this case encoded in UTF-8. If you print that on anything but a UTF-8-configured terminal, you'll get an incorrect display.
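If you can't rely on the terminal being UTF-8, encode explicitly instead of calling str. A minimal Python 2 sketch, assuming thetext from your code above (out.txt and the 'replace' error policy are just example choices):
import sys
import codecs

# Encode for whatever the current terminal uses, substituting unsupported characters
terminal_encoding = sys.stdout.encoding or 'utf-8'
print thetext.encode(terminal_encoding, 'replace')

# Or keep the text as Unicode and write it to a UTF-8 file instead
with codecs.open('out.txt', 'w', encoding='utf-8') as f:
    f.write(thetext)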

Related

Latin encoding issue

I am working on a python web scraper to extract data from this webpage. It contains latin characters like ą, č, ę, ė, į, š, ų, ū, ž. I use BeautifulSoup to recognise the encoding:
from bs4 import UnicodeDammit

def decode_html(html_string):
    converted = UnicodeDammit(html_string)
    print(converted.original_encoding)
    if not converted.unicode_markup:
        raise UnicodeDecodeError(
            "Failed to detect encoding, tried [%s]",
            ', '.join(converted.tried_encodings))
    return converted.unicode_markup
The encoding that it always seems to use is "windows-1252". However, this turns characters like ė into ë and ų into ø when printing to file or console. I use the lxml library to scrape the data. So I would think that it uses the wrong encoding, but what's odd is that if I use lxml.html.open_in_browser(decoded_html), all the characters are back to normal. How do I print the characters to a file without all the mojibake?
This is what I am using for output:
import json

def write(filename, obj):
    with open(filename, "w", encoding="utf-8") as output:
        # CustomEncoder is defined elsewhere in the scraper
        json.dump(obj, output, cls=CustomEncoder, ensure_ascii=False)
    return
From the HTTP headers set on the specific webpage you tried to load:
Content-Type:text/html; charset=windows-1257
so decoding as Windows-1252 will give invalid results. BeautifulSoup made a guess (based on statistical models) and guessed wrong. As you noticed, using 1252 instead leads to incorrect codepoints:
>>> 'ė'.encode('cp1257').decode('cp1252')
'ë'
>>> 'ų'.encode('cp1257').decode('cp1252')
'ø'
CP1252 is the fallback for the base character-set detection implementation in BeautifulSoup. You can improve the success rate of BeautifulSoup's character detection by installing an external library; both chardet and cchardet are supported. Those two libraries guess MacCyrillic and ISO-8859-13, respectively (both wrong, but cchardet got pretty close, perhaps close enough).
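If you want to see what the detector would pick, you can run UnicodeDammit directly. A rough sketch, assuming raw_bytes holds the undecoded response body and that chardet or cchardet has been installed:
from bs4.dammit import UnicodeDammit

# raw_bytes is a stand-in for the undecoded HTML you fetched
dammit = UnicodeDammit(raw_bytes, is_html=True)
print(dammit.original_encoding)     # whatever charset was detected or declared
print(dammit.unicode_markup[:200])  # the markup decoded with that guess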
In this specific case, you can make use of the HTTP headers instead. In requests, I generally use:
import requests
from bs4 import BeautifulSoup
from bs4.dammit import EncodingDetector

resp = requests.get(url)
# charset from the HTTP Content-Type header, but only if the server set one explicitly
http_encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
# encoding declared in the document itself (<meta> tag or XML declaration)
html_encoding = EncodingDetector.find_declared_encoding(resp.content, is_html=True)
encoding = html_encoding or http_encoding
soup = BeautifulSoup(resp.content, 'lxml', from_encoding=encoding)
The above only uses the encoding from the HTTP response when it was explicitly set by the server and there is no HTML <meta> declaration. For text/* mime types, HTTP specifies that the response should be considered to be using Latin-1, which requests adheres to, but that default would be incorrect for most HTML data.
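Once the markup is decoded with the right encoding, everything BeautifulSoup gives back is already Unicode, so your write() function works as-is. A small usage sketch continuing from the snippet above (the td selector and scraped.json filename are made up for illustration):
import json

# soup was built with the correct from_encoding above, so .get_text() returns
# proper Unicode; ensure_ascii=False keeps ą, č, ė, ų readable in the file
texts = [el.get_text(strip=True) for el in soup.select('td')]
with open('scraped.json', 'w', encoding='utf-8') as output:
    json.dump(texts, output, ensure_ascii=False)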

Python print function side-effect

I'm using lxml to parse some HTML with Russian letters, which is why I'm having encoding headaches.
I transform the HTML text into a tree using the following code. Then I try to extract some things from the page (header, article content) using CSS queries.
from lxml import html
from bs4 import UnicodeDammit
doc = UnicodeDammit(html_text, is_html=True)
parser = html.HTMLParser(encoding=doc.original_encoding)
tree = html.fromstring(html_text, parser=parser)
...
def extract_title(tree):
    metas = tree.cssselect("meta[property^=og]")
    for meta in metas:
        # print(meta.attrib)
        # print(sys.stdout.encoding)
        # print("123") # Uncomment this to fix error
        content = meta.attrib['content']
        print(content.encode('utf-8')) # This fails with "[Decode error - output not utf-8]"
I get "Decode error" when i'm trying to print unicode symbols to stdout. But if i add some print statement before failing print then everything works fine. I never saw such strange behavior of python print function. I thought it has no side-effects.
Do you have any idea why this is happening?
I use Windows and Sublime to run this code.
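One way to sidestep the decode error, whatever its exact cause, is to bypass the text layer and hand UTF-8 bytes straight to the binary stream. A hedged workaround sketch, assuming Python 3 (print_utf8 is a made-up helper name, not part of any library):
import sys

def print_utf8(text):
    # Write UTF-8 bytes directly, independent of whatever encoding
    # the Sublime build console reports for sys.stdout
    sys.stdout.buffer.write(text.encode('utf-8') + b'\n')
    sys.stdout.flush()

# inside extract_title, replace the failing call with: print_utf8(content)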

Spynner wrong encoding

I'm trying to download this page - https://itunes.apple.com/ru/app/farm-story/id367107953?mt=8 (looks like this for me in Russia - http://screencloud.net/v/6a7o) via spynner in python - it uses some javascript checking so one does not simply download it without full browser emulation.
My code:
# -*- coding: utf-8 -*-
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
from StringIO import StringIO
import spynner
def log(str, filename_end):
    filename = '/tmp/apple_log_%s.html' % filename_end
    print 'logged to %s' % filename
    f = open(filename, 'w')
    f.write(str)
    f.close()
debug_stream = StringIO()
browser = spynner.Browser(debug_level=3, debug_stream=debug_stream)
browser.load("https://itunes.apple.com/ru/app/farm-story/id367107953?mt=8")
ret = browser.contents
log(ret, 'noenc')
print 'content length = %s' % len(ret)
browser.close()
del browser
f=open('/tmp/apple_log_debug', 'w')
f.write(debug_stream.getvalue())
f.close()
print 'log stored in /tmp/debug_log'
So, the problem is: either Apple or spynner handles Cyrillic symbols incorrectly. I see them fine if I try browser.show() after loading, but in the code and logs they are still wrongly encoded, like <meta content="ÐолÑÑиÑÑ Farm Story⢠в App Store. ÐÑоÑмоÑÑеÑÑ ÑкÑинÑоÑÑ Ð¸ ÑейÑинги, пÑоÑиÑаÑÑ Ð¾ÑзÑÐ²Ñ Ð¿Ð¾ÐºÑпаÑелей." property="og:description">.
http://2cyr.com/ says that it is UTF-8 text displayed as ISO-8859-1...
As you can see, I don't use any headers in my request, but if I take them from Chrome's network debug console and pass them to the load() method, e.g. headers=[('Accept-Encoding', 'utf-8'), ('Accept-Language', 'ru-RU,ru;q=0.8,en-US;q=0.6,en;q=0.4')], I get the same result.
Also, from the same network console you can see that chrome uses gzip,deflate,sdch as Accept-Encoding. I can try that too, but I fail to decode what I get: <html><head></head><body>��}ksÇ�g!���4�I/z�O���/)�(yw���é®i��{�<v���:��ٷ�س-?�b�b�� j�... even if I remove the tags from the begin and end of the result.
Any help?
Basically, browser.webframe.toHtml() returns a QString, so str() won't help if the result actually contains non-Latin Unicode characters.
If you want to get a Python unicode string you need to do:
ret = unicode(browser.webframe.toHtml().toUtf8(), encoding="UTF-8")
#if you want to get rid of non-latin text
ret = ret.encode("ascii", errors="replace") # encodes to bytestring
In case you suspect it's in Russian, you could encode it to a Russian multibyte OEM string (still a bytestring) by doing
ret = ret.encode("cp1251", errors="replace") # encodes to Win-1251
# or
ret = ret.encode("cp866", errors="replace") # encodes to windows/dos console
Only then can you save it to a plain byte-oriented file.
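If the goal is just to save the page for inspection, as the log() helper does, you can skip the ASCII/cp1251 round-trips and write the Unicode string out as UTF-8. A minimal Python 2 sketch, assuming the ret produced by the first line above (the filename is only an example):
import io

# ret is a Python unicode string here; io.open encodes it on write
with io.open('/tmp/apple_log_unicode.html', 'w', encoding='utf-8') as f:
    f.write(ret)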
str(browser.webframe.toHtml()) saved me

Is there a module to convert Chinese characters to Japanese (kanji) or Korean (hanja) in Python 3?

I'd like to convert between CJK characters in Python 3.3. That is, I need to get 價 (Korean) from 价 (Chinese), and 価 (Japanese) from 價. Is there an external module for that?
Unihan information
The Unihan page about 價 provides a simplified variant (vs. traditional), but doesn't seem to give a Japanese/Korean one. So...
CJKlib
I would recommend having a look at CJKlib, which has a feature section called Variants stating:
Z-variant forms, which only differ in typeface
[update] Z-variant
Your sample character 價 (U+50F9) doesn't have a Z-variant. However, 価 (U+4FA1) has a kZVariant pointing to U+50F9 價. This seems weird.
Further reading
Package documentation is available on Python.org/pypi/cjklib;
Z-variant form definition.
Here is a relatively complete conversion table. You can dump it to json for later use:
import requests
from bs4 import BeautifulSoup as BS
import json

def gen(soup):
    for tr in soup.select('tr'):
        tds = tr.select('td.tdR4')
        if len(tds) == 6:
            yield tds[2].string, tds[3].string

uri = 'http://www.kishugiken.co.jp/cn/code10d.html'
soup = BS(requests.get(uri).content, 'html5lib')
d = {}
for hanzi, kanji in gen(soup):
    a = d.get(hanzi, [])
    a.append(kanji)
    d[hanzi] = a
print(json.dumps(d, indent=4))
The code and its output are in this gist.
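A brief usage sketch, assuming the JSON above was saved to a file named variants.json (a made-up name) and that the table maps a Chinese character to its Japanese form(s), as gen() yields them:
import json

with open('variants.json', encoding='utf-8') as f:
    table = json.load(f)

# Look up the Japanese variant(s) of a character, if it is in the table at all
print(table.get('价', []))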

detect and change website encoding in python

I have a problem with website encoding. I made a program to scrape a website, but I haven't been successful in changing the encoding of the content I read. My code is:
import sys,os,glob,re,datetime,optparse
import urllib2
from BSXPath import BSXPathEvaluator,XPathResult
#import BeautifulSoup
#from utility import *
sTargetEncoding = "utf-8"
page_to_process = "http://www.xxxx.com"
req = urllib2.urlopen(page_to_process)
content = req.read()
encoding=req.headers['content-type'].split('charset=')[-1]
print encoding
ucontent = unicode(content, encoding).encode(sTargetEncoding)
#ucontent = content.decode(encoding).encode(sTargetEncoding)
#ucontent = content
document = BSXPathEvaluator(ucontent)
print "ORIGINAL ENCODING: " + document.originalEncoding
I used an external library (BSXPath, an extension of BeautifulSoup), and document.originalEncoding prints the website's original encoding rather than the utf-8 I tried to convert to.
Does anyone have any suggestions?
Thanks
Well, there is no guarantee that the encoding presented by the HTTP headers is the same as the one specified inside the HTML itself. This can happen either because of misconfiguration on the server side or because the charset declaration inside the HTML is simply wrong. There is really no fully reliable automatic way to detect the right encoding. I suggest checking the HTML manually for the right encoding (e.g. iso-8859-1 vs. utf-8 can be distinguished easily) and then hardcoding the encoding somewhere in your app.
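A quick way to check for that kind of mismatch is to compare what the server claims with what the page itself declares. A hedged sketch reusing req and content from your code, plus bs4's declared-encoding helper (an extra dependency here):
from bs4.dammit import EncodingDetector

header_encoding = req.headers.getparam('charset')  # charset from the Content-Type header
meta_encoding = EncodingDetector.find_declared_encoding(content, is_html=True)
print "header:", header_encoding, "| meta:", meta_encoding
# If they disagree, hardcode whichever one actually renders the page correctly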
