Parsing fails with lxml XPath using UTF-16 - Python

I am parsing the following page: http://www.amazon.de/product-reviews/B004K1K172
I am using lxml's etree for parsing, and the content variable contains the entire page source.
Code:
myparser = etree.HTMLParser(encoding="utf-16") # as the characters are beyond utf-8
tree = etree.HTML(content, parser=myparser)
review = tree.xpath(".//*[@id='productReviews']/tr/td[1]/div[1]/text()")
This returns an empty list.
But when I change the code to:
myparser = etree.HTMLParser(encoding="utf-8") # neglecting some reviews having characters beyond utf-8
tree = etree.HTML(content, parser=myparser)
review = tree.xpath(".//*[@id='productReviews']/tr/td[1]/div[1]/text()")
I get proper data with the same XPath, but most of the reviews are rejected.
So is this a problem with lxml's XPath, or with my XPath expression?
How can I parse the above page with UTF-16 encoding?

Edit: Based on the suggestion of nymk, I parsed the page using ISO-8859-15 encoding, changing the following line in the code:
myparser = etree.HTMLParser(encoding="ISO-8859-15")
But changes had to be made in SQL so that it accepts an encoding other than UTF-8.

To get the character encoding from the HTTP headers automatically:
import cgi
import urllib2
from lxml import html
response = urllib2.urlopen("http://www.amazon.de/product-reviews/B004K1K172")
# extract encoding from Content-Type
_, params = cgi.parse_header(response.headers.get('Content-Type', ''))
html_text = response.read().decode(params['charset'])
root = html.fromstring(html_text)
reviews = root.xpath(".//*[@id='productReviews']/tr/td[1]/div[1]/text()")
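On Python 3, where urllib2 is unavailable and the cgi module has been removed in recent versions, roughly the same approach could look like the following sketch; the utf-8 fallback for responses that omit a charset is my assumption:
from urllib.request import urlopen
from lxml import html
response = urlopen("http://www.amazon.de/product-reviews/B004K1K172")
# the response headers behave like an email.message.Message and can report the charset directly
charset = response.headers.get_content_charset() or "utf-8"  # assumed fallback
html_text = response.read().decode(charset)
root = html.fromstring(html_text)
reviews = root.xpath(".//*[@id='productReviews']/tr/td[1]/div[1]/text()")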

Related

How to extract Unicode text inside a tag?

I'm trying to collect data for my lab from this website: link
Here is my code:
from bs4 import BeautifulSoup
import requests
url='https://www.coursera.org/learn/applied-data-science-capstone-ar'
html=requests.get(url).text
soup=BeautifulSoup(html,'lxml')
info=soup.find('div',class_='_1wb6qi0n')
title=info.find('h1',class_='banner-title banner-title-without--subtitle m-b-0')
print(title)
I expect title to be كابستون علوم البيانات التطبيقية, but the result is منهجية علم البيانات.
What is the problem, and how do I fix it?
Thank you for taking the time to answer.
The issue you are facing is due to improper encoding when fetching the URL with requests.get(). By default, requests falls back to ISO-8859-1 for text responses that do not declare a charset, which results in incorrect decoding of the HTML itself. To force a proper encoding for the requested page, set the encoding attribute of the response. For this to work, the line requests.get(url).text has to be broken up like so:
...
# Request the URL and store the request
request = requests.get(url)
# Change the encoding before extracting the text
# Automatically infer encoding
request.encoding = request.apparent_encoding
# Now extract the HTML as text
html = request.text
...
In the above code snippet, request.apparent_encoding infers the encoding of the page automatically, without you having to specify one encoding or another by hand.
So, the final code would be as follows:
from bs4 import BeautifulSoup
import requests
url = 'https://www.coursera.org/learn/applied-data-science-capstone-ar'
request = requests.get(url)
request.encoding = request.apparent_encoding
html = request.text
soup = BeautifulSoup(html,'lxml')
info = soup.find('div',class_='_1wb6qi0n')
title = info.find('h1',class_='banner-title banner-title-without--subtitle m-b-0')
print(title.text)
PS: You must use title.text when printing to get the inner content of the tag rather than the tag itself.
Output:
كابستون علوم البيانات التطبيقية
What was causing the error is the encoding of the HTML data.
Arabic letters need two bytes each to be represented.
You need to set the HTML data's encoding to UTF-8:
from bs4 import BeautifulSoup
import requests
url='https://www.coursera.org/learn/applied-data-science-capstone-ar'
html=requests.get(url)
html.encoding = html.apparent_encoding
soup=BeautifulSoup(html.text,'lxml')
info=soup.find('div',class_='_1wb6qi0n')
title=info.find('h1',class_='banner-title banner-title-without--subtitle m-b-0').get_text()
print(title)
In the above, apparent_encoding automatically sets the encoding to whatever suits the data.
OUTPUT :
كابستون علوم البيانات التطبيقية
There is a nice library called ftfy. It has multi-language support.
Installation: pip install ftfy
Try this:
from bs4 import BeautifulSoup
import ftfy
import requests
url='https://www.coursera.org/learn/applied-data-science-capstone-ar'
html=requests.get(url).text
soup=BeautifulSoup(html,'lxml')
info=soup.find('div',class_='_1wb6qi0n')
title=info.find('h1',class_='banner-title banner-title-without--subtitle m-b-0').text
title = ftfy.fix_text(title)
print(title)
Output:
كابستون علوم البيانات التطبيقية
I think you need to use UTF-8 encoding/decoding. If your problem is in the terminal, the terminal itself may simply be unable to display the characters; but if your output environment is something else, like a web page, you should see the correct result there.
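As a quick check (a sketch, not part of the original answer), you can ask Python what encoding it thinks your terminal uses:
import sys
print(sys.stdout.encoding)  # if this is not UTF-8, Arabic text may fail to render in the terminal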

Parsing XML and HTML pages with the lxml and requests packages in Python

I have been trying to parse XML and HTML pages using the lxml and requests packages in Python. I am using the following code for this purpose:
import requests
from lxml import html
url = ""
req = requests.get(url)
tree = html.fromstring(req.content)
root = tree.xpath('')
for item in root:
    print(item.text)
This code works fine, but some web pages don't show their content properly and need the encoding set to UTF-8, and I don't know how to set the encoding in this code.
requests automatically decodes content from the server.
Important to understand:
r.content - contains the not-yet-decoded response body
r.encoding - contains information about the response content's encoding
r.text - according to the official docs, the already-decoded version of r.content
Following the Unicode standard, I got used to r.text, but you can still decode the content manually using:
r.content.decode(r.encoding)
Hope it helps.
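If you specifically want to force UTF-8, as the question asks, a minimal sketch using the documented requests attributes would be to set the encoding before touching r.text:
import requests
r = requests.get(url)  # url as in the question above
r.encoding = 'utf-8'   # force the codec requests uses to build r.text
text = r.text          # now decoded as UTF-8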

How to decode and encode a web page with Python?

I use BeautifulSoup and urllib2 to download web pages, but different web pages use different encoding methods, such as utf-8, gb2312, or gbk. I use urllib2 to get sohu's home page, which is encoded with gbk, but in my code I also decode its page this way:
self.html_doc = self.html_doc.decode('gb2312','ignore')
But how can I know the encoding method a page uses before I use BeautifulSoup to decode it to Unicode? On most Chinese websites there is no content-type in the HTTP header.
Using BeautifulSoup you can parse the HTML and access the original_encoding attribute:
import urllib2
from bs4 import BeautifulSoup
html = urllib2.urlopen('http://www.sohu.com').read()
soup = BeautifulSoup(html)
>>> soup.original_encoding
u'gbk'
And this agrees with the encoding declared in the <meta> tag in the HTML's <head>:
<meta http-equiv="content-type" content="text/html; charset=GBK" />
>>> soup.meta['content']
u'text/html; charset=GBK'
Now you can decode the HTML:
decoded_html = html.decode(soup.original_encoding)
but there is not much point, since the HTML is already available as Unicode:
>>> soup.a['title']
u'\u641c\u72d0-\u4e2d\u56fd\u6700\u5927\u7684\u95e8\u6237\u7f51\u7ad9'
>>> print soup.a['title']
搜狐-中国最大的门户网站
>>> soup.a.text
u'\u641c\u72d0'
>>> print soup.a.text
搜狐
It is also possible to attempt to detect it using the chardet module (although it is a bit slow):
>>> import chardet
>>> chardet.detect(html)
{'confidence': 0.99, 'encoding': 'GB2312'}
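You could then feed the detected encoding back into decode. A small sketch (the utf-8 fallback and errors='replace' are defensive assumptions, since chardet can return None for the encoding):
detected = chardet.detect(html)
decoded_html = html.decode(detected['encoding'] or 'utf-8', errors='replace')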
Another solution:
from simplified_scrapy.request import req
from simplified_scrapy.simplified_doc import SimplifiedDoc
html = req.get('http://www.sohu.com') # This will automatically help you find the correct encoding
doc = SimplifiedDoc(html)
print (doc.title.text)
I know this is an old question, but I spent a while today puzzling over a particularly problematic website so I thought I'd share the solution that worked for me, which I got from here: http://shunchiubc.blogspot.com/2016/08/python-to-scrape-chinese-websites.html
Requests has a feature that will automatically get the actual encoding of the website, meaning you don't have to wrestle with encoding/decoding it (before I found this, I was getting all sorts of errors trying to encode/decode strings/bytes and never getting any output which was readable). This feature is called apparent_encoding. Here's how it worked for me:
from bs4 import BeautifulSoup
import requests
url = 'http://url_youre_using_here.html'
readOut = requests.get(url)
readOut.encoding = readOut.apparent_encoding #sets the encoding properly before you hand it off to BeautifulSoup
soup = BeautifulSoup(readOut.text, "lxml")

Urllib2 returning HTML with newlines and tabs

I want to scrape the HTML from some website and then send it off to BeautifulSoup for parsing. The problem is that the HTML returned by urllib2.urlopen() contains newlines (\n) and tabs (\t) as well as having single quotes and other characters escaped. When I try to build a BeautifulSoup object with this HTML, I get an error.
b = BeautifulSoup(src)
gives this error.
My code:
def get_page_source(url):
    """
    Retrieves the HTML source code for url.
    """
    try:
        return urllib2.urlopen(url)
    except:
        return ""

def retrieve_links(url):
    """
    Use the BeautifulSoup module to efficiently grab all links from the source
    code retrieved by get_page_source.
    """
    src = get_page_source(url)
    b = BeautifulSoup(src)
    .
    .
    .
How can I solve this problem?
EDIT
import urllib2
link = "http://www.techcrunch.com/"
src = urllib2.urlopen(link).read()
f = open('out.txt', 'w')
f.write(src)
f.close()
gives this output.
The problem is that the HTML you are parsing contains embedded JavaScript code (the BeautifulSoup error complains about line 130, which is in the middle of embedded JavaScript), and the JavaScript contains embedded HTML.
Line 130, notice the <a> tag:
adNode += "<a href='http://t.aol.com?ncid=...
It's a Matryoshka doll of HTML and JavaScript, and Python's built-in parser can't handle it.
You can follow the instructions for installing a parser, given by BeautifulSoup itself in the error message you posted:
Python's built-in HTMLParser cannot parse the given document. This is not a bug in Beautiful Soup. The best solution is to install an external parser (lxml or html5lib), and use Beautiful Soup with that parser. See http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser for help.
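For example, after pip install lxml (or pip install html5lib), pass the parser name explicitly when building the soup; this is the standard second argument to BeautifulSoup, shown here as a minimal sketch:
from bs4 import BeautifulSoup
b = BeautifulSoup(src, 'lxml')  # or BeautifulSoup(src, 'html5lib')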

Why is Beautiful Soup returning ASCII codes instead of the actual characters when I output with HttpResponse?

I am reading an HTML file from a storage directory, making some alterations with Beautiful Soup and then outputting the result using HttpResponse. My problem is that some of the characters, such as the <> symbols, are being returned as ASCII codes instead of symbols, e.g. &lt; instead of <.
To remove any chance that it was the changes I am making with BeautifulSoup, I simplified it down to the basics. This works:
file = default_storage.open(fileLocation, 'r')
html = file.read()
HttpResponse(html)
This does not:
file = default_storage.open(fileLocation, 'r')
html = file.read()
soup = BeautifulSoup(html)
HttpResponse(str(soup))
This by no means represents my only attempt at this. I have combed through the BeautifulSoup documentation and tried several different encoding methods, but with the same result.
