libxml for Python's UTF encoding issue, or mine? - python

Hi all. I'm trying to extract the META description from a webpage using libxml for Python. When it encounters UTF characters it seems to choke and displays garbage characters. However, when getting the data via a regex I get the Unicode characters just fine. Am I doing something wrong with libxml?
Thanks
''' test encoding issues with utf8 '''
from lxml.html import fromstring
import urllib2
import re
url = 'http://www.youtube.com/watch?v=LE-JN7_rxtE'
page = urllib2.urlopen(url).read()
xmldoc = fromstring(page)
desc = xmldoc.xpath('/html/head/meta[@name="description"]/@content')
meta_description = desc[0].strip()
print "**** LIBXML TEST ****\n"
print meta_description
print "**** REGEX TEST ******"
reg = re.compile(r'<meta name="description" content="(.*)">')
for desc in reg.findall(page):
    print desc
OUTPUTS:
**** LIBXML TEST ****
My name is Hikakin.<br>I'm Japanese Beatboxer.<br><br>HIKAKIN Official Blog<br>http://ameblo.jp/hikakin/<br><br>ãã³çã³ãã¥<br>http://com.nicovideo.jp/community/co313576<br><br>â»å¾¡ç¨ã®æ¹ã¯Youtubeã®ã¡ãã»ã¼ã¸ã¾ã...
**** REGEX TEST ******
My name is Hikakin.<br>I'm Japanese Beatboxer.<br><br>HIKAKIN Official Blog<br>http://ameblo.jp/hikakin/<br><br>ニコ生コミュ<br>http://com.nicovideo.jp/community/co313576<br><br>※御用の方はYoutubeのメッセージまた...

Does this help?
xmldoc = fromstring(page.decode('utf-8'))
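A minimal end-to-end sketch of that fix, assuming the page really is UTF-8 encoded (decode the raw bytes before handing them to lxml):
import urllib2
from lxml.html import fromstring
page = urllib2.urlopen('http://www.youtube.com/watch?v=LE-JN7_rxtE').read()
# Decode the raw bytes to unicode first so lxml doesn't have to guess the encoding
xmldoc = fromstring(page.decode('utf-8'))
desc = xmldoc.xpath('/html/head/meta[@name="description"]/@content')
print desc[0].strip().encode('utf-8')  # re-encode for a UTF-8 terminal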

It is very possible that the problem is that your console does not support the display of Unicode characters. Try piping the output to a file and then opening it in something that can display Unicode.
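One way to test that theory (a sketch, reusing meta_description from the question's code):
import codecs
# Write to a UTF-8 file instead of the console, to rule out a terminal
# that simply can't render these characters
with codecs.open('meta_description.txt', 'w', encoding='utf-8') as f:
    f.write(meta_description)
If the file looks right in a Unicode-aware editor, the parsing is fine and only the console display is at fault.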

In lxml, you need to pass the encoding to the parser.
For HTML/XML parsing:
import urllib2
from lxml import etree
url = 'http://en.wikipedia.org/wiki/' + wiki_word
parser = etree.HTMLParser(encoding='utf-8')  # an etree.XMLParser() can be used the same way
page = urllib2.urlopen(url)
doc = etree.parse(page, parser)
T = doc.xpath('//p//text()')
text = u''.join(T).encode('utf-8')

Related

Python print function side-effect

I'm using lxml to parse some HTML with Russian letters, which is why I'm having encoding headaches.
I transform the HTML text to a tree using the following code. Then I try to extract some things from the page (header, article content) using CSS queries.
from lxml import html
from bs4 import UnicodeDammit
doc = UnicodeDammit(html_text, is_html=True)
parser = html.HTMLParser(encoding=doc.original_encoding)
tree = html.fromstring(html_text, parser=parser)
...
def extract_title(tree):
    metas = tree.cssselect("meta[property^=og]")
    for meta in metas:
        # print(meta.attrib)
        # print(sys.stdout.encoding)
        # print("123") # Uncomment this to fix error
        content = meta.attrib['content']
        print(content.encode('utf-8')) # This fails with "[Decode error - output not utf-8]"
I get "Decode error" when i'm trying to print unicode symbols to stdout. But if i add some print statement before failing print then everything works fine. I never saw such strange behavior of python print function. I thought it has no side-effects.
Do you have any idea why this is happening?
I use Windows and Sublime to run this code.
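One common workaround on Windows (a sketch, assuming Python 3) is to wrap stdout in an explicit UTF-8 writer, so print() stops depending on the console encoding that Sublime reports:
import io
import sys
# Replace stdout with an explicit UTF-8 writer so print() no longer
# depends on whatever encoding the build console advertises
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')
print('заголовок статьи')  # Russian text now reaches stdout as UTF-8 bytes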

count the number of images on a webpage, using urllib

For a class, I have an exercise where I need to count the number of images on any given web page. I know that every image starts with <img, so I am using a regexp to try and locate them. But I keep getting a count of one, which I know is wrong. What is wrong with my code:
import sys
import urllib.request
import re
img_pat = re.compile('<img.*>', re.I)
def get_img_cnt(url):
    try:
        w = urllib.request.urlopen(url)
    except IOError:
        sys.stderr.write("Couldn't connect to %s " % url)
        sys.exit(1)
    contents = str(w.read())
    img_num = len(img_pat.findall(contents))
    return img_num
print(get_img_cnt('http://www.americascup.com/en/schedules/races'))
Don't ever use regex for parsing HTML; use an HTML parser like lxml or BeautifulSoup. Here's a working example of how to get the img tag count using BeautifulSoup and requests:
from bs4 import BeautifulSoup
import requests
def get_img_cnt(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content)
    return len(soup.find_all('img'))
print(get_img_cnt('http://www.americascup.com/en/schedules/races'))
Here's a working example using lxml and requests:
from lxml import etree
import requests
def get_img_cnt(url):
    response = requests.get(url)
    parser = etree.HTMLParser()
    root = etree.fromstring(response.content, parser=parser)
    return int(root.xpath('count(//img)'))
print(get_img_cnt('http://www.americascup.com/en/schedules/races'))
Both snippets print 106.
Also see:
Python Regex - Parsing HTML
Python regular expression for HTML parsing (BeautifulSoup)
Hope that helps.
Ahhh, regular expressions.
Your regex pattern <img.*> says "find me something that starts with <img and stuff, and make sure it ends with >".
Regular expressions are greedy, though; it'll fill that .* with literally everything it can while leaving a single > character somewhere afterwards to satisfy the pattern. In this case, it would go all the way to the end of the document, </html>, and say "look! I found a > right there!"
You should come up with the right count by making .* non-greedy, like this:
<img.*?>
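A small demo of the difference (hypothetical snippet, not the page from the question):
import re
html = '<img src="a.jpg"><p>text</p><img src="b.jpg">'
print(len(re.findall(r'<img.*>', html, re.I)))   # 1: greedy .* runs to the last >
print(len(re.findall(r'<img.*?>', html, re.I)))  # 2: non-greedy stops at the first >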
Your regular expression is greedy, so it matches much more than you want. I suggest using an HTML parser.
img_pat = re.compile('<img.*?>',re.I) will do the trick if you must do it the regex way. The ? makes it non-greedy.
A good website for checking what your regex matches on the fly: http://www.pyregex.com/
Learn more about regexes: http://docs.python.org/2/library/re.html

Is there a module to convert Chinese character to Japanese (kanji) or Korean (hanja) in Python 3?

I'd like to switch between CJK characters in Python 3.3. That is, I need to get 價 (Korean) from 价 (Chinese), and 価 (Japanese) from 價. Is there an external module for that?
Unihan information
The Unihan page about 價 provides a simplified variant (vs. traditional), but doesn't seem to give the Japanese/Korean ones. So...
CJKlib
I would recommend having a look at CJKlib, which has a feature section called Variants stating:
Z-variant forms, which only differ in typeface
[update] Z-variant
Your sample character 價 (U+50F9) doesn't have a Z-variant. However, 価 (U+4FA1) has a kZVariant pointing to U+50F9 (價). This seems weird.
Further reading
Package documentation is available on Python.org/pypi/cjklib;
Z-variant form definition.
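If you try CJKlib, the lookup would look roughly like this (a sketch based on its documented CharacterLookup interface; note the library targets Python 2, so check compatibility with 3.3 first):
from cjklib.characterlookup import CharacterLookup
cjk = CharacterLookup('T')  # 'T' selects the traditional Chinese character locale
# 'Z' asks for Z-variants, the forms that only differ in typeface
print(cjk.getCharacterVariants(u'価', 'Z'))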
Here is a relatively complete conversion table. You can dump it to json for later use:
import requests
from bs4 import BeautifulSoup as BS
import json
def gen(soup):
    for tr in soup.select('tr'):
        tds = tr.select('td.tdR4')
        if len(tds) == 6:
            yield tds[2].string, tds[3].string
uri = 'http://www.kishugiken.co.jp/cn/code10d.html'
soup = BS(requests.get(uri).content, 'html5lib')
d = {}
for hanzi, kanji in gen(soup):
    a = d.get(hanzi, [])
    a.append(kanji)
    d[hanzi] = a
print(json.dumps(d, indent=4))
The code and its output are in this gist.
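A hypothetical usage sketch once the table has been dumped (the file name is assumed):
import json
# 'hanzi_to_kanji.json' is a hypothetical file holding the dumped table
with open('hanzi_to_kanji.json', encoding='utf-8') as f:
    table = json.load(f)
print(table.get('价'))  # a list of kanji variants, if the table has this entry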

Python & lxml / xpath: Parsing XML

I need to get the value of FLVPath from this link: http://www.testpage.com/v2/videoConfigXmlCode.php?pg=video_29746_no_0_extsite
import requests
from lxml import html
sub_r = requests.get("http://www.testpage.co/v2/videoConfigXmlCode.php?pg=video_%s_no_0_extsite" % list[6])
sub_root = html.fromstring(sub_r.content)
for sub_data in sub_root.xpath('//PLAYER_SETTINGS[@Name="FLVPath"]/@Value'):
    print sub_data.text
But no data is returned.
You're using lxml.html to parse the document, which causes lxml to lowercase all element and attribute names (since that doesn't matter in HTML), which means you'll have to use:
sub_root.xpath('//player_settings[@name="FLVPath"]/@value')
Or, as you're parsing an XML file anyway, you could use lxml.etree.
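A quick sketch showing the difference between the two parsers (made-up snippet):
from lxml import etree, html
snippet = '<root><PLAYER_SETTINGS Name="FLVPath" Value="video.flv"/></root>'
# lxml.html lowercases element and attribute names (attribute values keep their case)
print(html.fromstring(snippet).xpath('//player_settings[@name="FLVPath"]/@value'))
# lxml.etree preserves the original casing
print(etree.fromstring(snippet).xpath('//PLAYER_SETTINGS[@Name="FLVPath"]/@Value'))
Both lines print ['video.flv'].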
You could try
print sub_data.attrib['Value']
url = "http://www.testpage.com/v2/videoConfigXmlCode.php?pg=video_29746_no_0_extsite"
response = requests.get(url)
# Use `lxml.etree` rathern than `lxml.html`,
# and unicode `response.text` instead of `response.content`
doc = lxml.etree.fromstring(response.text)
for path in doc.xpath('//PLAYER_SETTINGS[#Name="FLVPath"]/#Value'):
print path

Remove HTML tags in AppEngine Python Env (equivalent to Ruby's Sanitize)

I am looking for a Python module that will help me get rid of HTML tags but keep the text values. I tried BeautifulSoup before and couldn't figure out how to do this simple task. I tried searching for Python modules that could do this, but they all seem to depend on other libraries, which does not work well on AppEngine.
Below is a sample code from Ruby's sanitize library and that's what I am after in Python:
require 'rubygems'
require 'sanitize'
html = '<b>foo</b><img src="http://foo.com/bar.jpg" />'
Sanitize.clean(html) # => 'foo'
Thanks for your suggestions.
-e
>>> import BeautifulSoup
>>> html = '<b>foo</b><img src="http://foo.com/bar.jpg" />'
>>> bs = BeautifulSoup.BeautifulSoup(html)
>>> bs.findAll(text=True)
[u'foo']
This gives you a list of (Unicode) strings. If you want to turn it into a single string, use ''.join(thatlist).
If you don't want to use separate libs, you can import the standard Django utils. For example:
from django.utils.html import strip_tags
html = '<b>foo</b><img src="http://foo.com/bar.jpg" />'
stripped = strip_tags(html)
print stripped
# you get: foo
Also, it's already included in Django templates, so you don't need anything else there; just use the filter, like this:
{{ unsafehtml|striptags }}
Btw, this is one of the fastest ways.
Using lxml:
htmlstring = '<b>foo</b><img src="http://foo.com/bar.jpg" />'
from lxml.html import fromstring
mySearchTree = fromstring(htmlstring)
for item in mySearchTree.cssselect('a'):
    print item.text
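Note that the sample string contains no anchor tags, so that loop prints nothing here. For plain tag-stripping, lxml elements also have a text_content() method, which is likely closer to what the question asks (a sketch):
from lxml.html import fromstring
htmlstring = '<b>foo</b><img src="http://foo.com/bar.jpg" />'
# text_content() returns the element's text plus that of all descendants,
# with the markup removed
print fromstring(htmlstring).text_content()  # prints: foo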
#!/usr/bin/python
from xml.dom.minidom import parseString
def getText(el):
    ret = ''
    for child in el.childNodes:
        if child.nodeType == 3:  # 3 == TEXT_NODE
            ret += child.nodeValue
        else:
            ret += getText(child)
    return ret
html = '<b>this is a link and some bold text </b> followed by <img src="http://foo.com/bar.jpg" /> an image'
dom = parseString('<root>' + html + '</root>')
print getText(dom.documentElement)
Prints:
this is a link and some bold text followed by an image
Late, but.
You can use jinja2.Markup():
http://jinja.pocoo.org/docs/api/#jinja2.Markup.striptags
>>> from jinja2 import Markup
>>> Markup("<div>About</div>").striptags()
u'About'
