I have been trying to parse XML and HTML pages using the lxml and requests packages in Python. I am using the following code for this purpose:
import requests
from lxml import html

url = ""
req = requests.get(url)
tree = html.fromstring(req.content)
root = tree.xpath('')
for item in root:
    print(item.text)
This code works fine, but for some web pages it can't show their contents properly and I need to set the encoding to UTF-8. I don't know how I can set the encoding in this code.
requests automatically decodes content from the server.
Important to understand:
r.content - contains the not-yet-decoded response content (raw bytes)
r.encoding - contains information about the response content's encoding
r.text - according to the official docs, this is the already decoded version of r.content
Following the Unicode standard, I'm used to r.text, but you can still decode your content manually using
r.content.decode(r.encoding)
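For example, applied to the lxml snippet in the question, a minimal sketch of forcing UTF-8 could look like this (the URL and XPath below are placeholders, and it assumes the page really is served as UTF-8):
import requests
from lxml import html

url = "https://example.com"          # placeholder URL, assumed to serve UTF-8
req = requests.get(url)
req.encoding = "utf-8"               # override whatever requests guessed from the headers
tree = html.fromstring(req.text)     # parse the decoded text instead of the raw bytes
for item in tree.xpath("//title/text()"):   # placeholder XPath expression
    print(item)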
Hope it helps.
I'm trying to collect data for my lab from this website: link
Here is my code:
from bs4 import BeautifulSoup
import requests
url='https://www.coursera.org/learn/applied-data-science-capstone-ar'
html=requests.get(url).text
soup=BeautifulSoup(html,'lxml')
info=soup.find('div',class_='_1wb6qi0n')
title=info.find('h1',class_='banner-title banner-title-without--subtitle m-b-0')
print(title)
I expect the title to be كابستون علوم البيانات التطبيقية
but the result is منهجية علم البيانات.
What is the problem? And how do I fix it?
Thank you for taking time to answer.
The issue you are facing is due to improper encoding when fetching the URL with the requests.get() function. By default, pages requested via the requests library fall back to an encoding of ISO-8859-1, which results in the HTML itself being decoded incorrectly. In order to force a proper encoding for the requested page, you need to change it using the encoding attribute of the response. For this to work, the line requests.get(url).text has to be split up like so:
...
# Request the URL and store the request
request = requests.get(url)
# Change the encoding before extracting the text
# Automatically infer encoding
request.encoding = request.apparent_encoding
# Now extract the HTML as text
html = request.text
...
In the above code snippet, request.apparent_encoding will automatically infer the encoding of the page without you having to specify one encoding or another by hand.
So, the final code would be as follows:
from bs4 import BeautifulSoup
import requests
url = 'https://www.coursera.org/learn/applied-data-science-capstone-ar'
request = requests.get(url)
request.encoding = request.apparent_encoding
html = request.text
soup = BeautifulSoup(html,'lxml')
info = soup.find('div',class_='_1wb6qi0n')
title = info.find('h1',class_='banner-title banner-title-without--subtitle m-b-0')
print(title.text)
PS: You must use title.text when printing to get the inner content of the tag.
Output:
كابستون علوم البيانات التطبيقية
What was causing the error is the encoding of the HTML data.
Arabic letters need 2 bytes each to be represented (see the small byte-length check after the output below).
You need to set the HTML data encoding to UTF-8.
from bs4 import BeautifulSoup
import requests
url='https://www.coursera.org/learn/applied-data-science-capstone-ar'
html=requests.get(url)
html.encoding = html.apparent_encoding
soup=BeautifulSoup(html.text,'lxml')
info=soup.find('div',class_='_1wb6qi0n')
title=info.find('h1',class_='banner-title banner-title-without--subtitle m-b-0').get_text()
print(title)
In the above, apparent_encoding will automatically set the encoding to whatever suits the data.
Output:
كابستون علوم البيانات التطبيقية
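If you want to see why the bytes matter, here is a tiny illustrative check (my addition, not part of the original answer) showing that an Arabic letter occupies two bytes in UTF-8 while a Latin letter takes one:
# Illustrative only: UTF-8 byte lengths of single characters
print(len('ك'.encode('utf-8')))   # 2 - Arabic letter kaf
print(len('k'.encode('utf-8')))   # 1 - ASCII letter k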
There is a nice library called ftfy. It has support for multiple languages.
Installation: pip install ftfy
Try this:
from bs4 import BeautifulSoup
import ftfy
import requests
url='https://www.coursera.org/learn/applied-data-science-capstone-ar'
html=requests.get(url).text
soup=BeautifulSoup(html,'lxml')
info=soup.find('div',class_='_1wb6qi0n')
title=info.find('h1',class_='banner-title banner-title-without--subtitle m-b-0').text
title = ftfy.fix_text(title)
print(title)
Output:
كابستون علوم البيانات التطبيقية
I think you need to use UTF-8 encoding/decoding! If your problem is in the terminal, I think you have no solution, but if your output goes to another environment, like a web page, you will see it displayed correctly there.
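If the terminal is the problem, one thing that sometimes helps on Python 3.7+ (my suggestion, not from the answer above) is to reconfigure stdout to UTF-8 before printing:
import sys

# Assumes Python 3.7+: force UTF-8 output even if the terminal advertises another encoding
sys.stdout.reconfigure(encoding='utf-8')
print('كابستون علوم البيانات التطبيقية')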
I am trying to scrape a website into a string, but when I use decode("utf-8") on my bytes object it doesn't return a string; instead I get a UnicodeEncodeError.
I am trying to scrape this website: https://www.futbin.com/20/player/24248/leon-goretzka, which I know uses charset="utf-8".
from bs4 import BeautifulSoup
import requests

r = requests.get("https://www.futbin.com/20/player/24248/leon-goretzka")
text = r.text.encode("utf-8")
html = text.decode("utf-8")
print(html)
The get function for requests needs to take an actual link. In your example, you're providing a string "link".
r = requests.get("https://www.futbin.com/20/player/24248/leon-goretzka")
data = r.text
print(data)
This gives you a Response object for r. Using r.text will give you the string, r.content will give you bytes (which would require decoding).
Here's a link for reference: Response example
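To illustrate that difference, a small sketch of my own (not from the linked docs), reusing r from the snippet above:
# r.text is a str already decoded by requests; r.content is raw bytes
decoded = r.content.decode(r.encoding or 'utf-8')   # fall back to UTF-8 if requests reports no encoding
print(type(r.content), type(r.text), type(decoded))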
I am trying to scrape this url:
url = 'http://www.jmlr.org/proceedings/papers/v36/li14.pdf'
This is my code:
import requests
from bs4 import BeautifulSoup

html = requests.get(url)
htmlText = html.text
soup = BeautifulSoup(htmlText)
print soup  # gives garbage
However, it gives weird symbols that I think are garbage. It's an HTML file, so it shouldn't be trying to parse it as a PDF, should it?
I tried the following:
How to correctly parse UTF-8 encoded HTML to Unicode strings with BeautifulSoup?
request = urllib2.Request(url)
request.add_header('Accept-Encoding', 'utf-8') #tried with 'latin-1'too
response = urllib2.urlopen(request)
soup = BeautifulSoup(response.read().decode('utf-8', 'ignore'))
and this too :
Python and BeautifulSoup encoding issues
html = requests.get(url)
htmlText = html.text
soup = BeautifulSoup(htmlText)
print soup.prettify('utf-8')
Both gave me garbage, i.e. the HTML tags were not parsed correctly. The last link also suggested the encoding might be different despite the meta charset being 'utf8', so I tried the above with 'latin-1' too, but nothing seems to work.
Any suggestions on how I can scrape the given link for data? Please don't suggest downloading the file and using pdfminer on it. Feel free to ask for more information!
That's because the URL points to a document in PDF format, so interpreting it as HTML won't make any sense at all.
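A quick way to confirm this (my own suggestion, not part of the answer) is to ask the server for the headers only and inspect the Content-Type:
import requests

r = requests.head('http://www.jmlr.org/proceedings/papers/v36/li14.pdf')
print(r.headers.get('Content-Type'))   # expected to report a PDF type such as application/pdf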
I am trying to write code in Python for a web crawler. I want to check if the page I am about to crawl is an HTML page and not a file like .pdf/.doc/.docx etc. I do not want to check it by the .html extension, since asp, aspx, or pages like http://bing.com/travel/ do not have .html extensions explicitly but are still HTML pages. Is there any good way to do this in Python?
This gets the header only from the server:
import urllib2
url = 'http://www.kernel.org/pub/linux/kernel/v3.0/testing/linux-3.7-rc6.tar.bz2'
req = urllib2.Request(url)
req.get_method = lambda: 'HEAD'
response = urllib2.urlopen(req)
content_type = response.headers.getheader('Content-Type')
print(content_type)
prints
application/x-bzip2
From which you could conclude this is not HTML. You could use
'html' in content_type
to programmatically test if the content is HTML (or possibly XHTML).
If you wanted to be even more sure the content is HTML you could download the contents and try to parse it with an HTML parser like lxml or BeautifulSoup.
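A rough sketch of that two-step idea, using requests and lxml rather than urllib2 (my own combination, with an assumed example URL):
import requests
from lxml import html

url = 'http://bing.com/travel/'   # assumed example URL

# Step 1: HEAD request, look at the declared Content-Type
head = requests.head(url, allow_redirects=True)
if 'html' in head.headers.get('Content-Type', ''):
    # Step 2: download the body and try to parse it to be more confident
    body = requests.get(url).content
    try:
        html.fromstring(body)
        print('Looks like HTML')
    except Exception:
        print('Declared as HTML but failed to parse')
else:
    print('Not HTML:', head.headers.get('Content-Type'))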
Beware of using requests.get like this:
import requests
r = requests.get(url)
print(r.headers['content-type'])
This takes a long time and my network monitor shows a sustained load leading me to believe this is downloading the entire file, not just the header.
On the other hand,
import requests
r = requests.head(url)
print(r.headers['content-type'])
gets the header only.
Don't bother with what the standard library throws at you; rather, try requests.
>>> import requests
>>> r = requests.get("http://www.google.com")
>>> r.headers['content-type']
'text/html; charset=ISO-8859-1'
I am parsing the following page: http://www.amazon.de/product-reviews/B004K1K172
I am using lxml's etree for parsing.
The content variable contains the entire page content.
Code:
myparser = etree.HTMLParser(encoding="utf-16") #As characters are beyond utf-8
tree = etree.HTML(content,parser = myparser)
review = tree.xpath(".//*[@id='productReviews']/tr/td[1]/div[1]/text()")
This is returning an empty list.
But when I change the code to:
myparser = etree.HTMLParser(encoding="utf-8") #Neglecting some reviews having ascii character above utf-8
tree = etree.HTML(content,parser = myparser)
review = tree.xpath(".//*[@id='productReviews']/tr/td[1]/div[1]/text()")
Now I am getting proper data with the same XPath.
But most of the reviews are getting rejected.
So is this a problem with lxml-based XPath or with my XPath implementation?
How can I parse the above page with utf-16 encoding?
Based on the suggestion of nymk, I parsed the page using ISO-8859-15 encoding, changing the following line in the code: myparser = etree.HTMLParser(encoding="ISO-8859-15"). But changes have to be made in SQL so as to accept encodings other than utf-8.
To get the character encoding from http headers automatically:
import cgi
import urllib2
from lxml import html
response = urllib2.urlopen("http://www.amazon.de/product-reviews/B004K1K172")
# extract encoding from Content-Type
_, params = cgi.parse_header(response.headers.get('Content-Type', ''))
html_text = response.read().decode(params['charset'])
root = html.fromstring(html_text)
reviews = root.xpath(".//*[@id='productReviews']/tr/td[1]/div[1]/text()")
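As a side note (my addition, not part of the answer above), requests exposes the same header-derived charset directly via response.encoding, so an equivalent sketch could be:
import requests
from lxml import html

response = requests.get("http://www.amazon.de/product-reviews/B004K1K172")
# requests decodes response.text using the charset from the Content-Type header when it is present
root = html.fromstring(response.text)
reviews = root.xpath(".//*[@id='productReviews']/tr/td[1]/div[1]/text()")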