How to decode and encode a web page with Python?

I use BeautifulSoup and urllib2 to download web pages, but different pages use different encodings, such as utf-8, gb2312, or gbk. I used urllib2 to get Sohu's home page, which is encoded with gbk, and in my code I decode it like this:
self.html_doc = self.html_doc.decode('gb2312','ignore')
But how can I know which encoding a page uses before I decode it to unicode for BeautifulSoup? Most Chinese websites don't declare a charset in the HTTP Content-Type header.
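For reference, you can check whether the server sent a charset at all; a minimal sketch with urllib2:
import urllib2

resp = urllib2.urlopen('http://www.sohu.com')
# If the server declares a charset it appears in the Content-Type header,
# e.g. 'text/html; charset=GBK'; many Chinese sites omit it, as noted above.
print(resp.info().getheader('Content-Type'))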

Using BeautifulSoup you can parse the HTML and access the original_encoding attribute:
import urllib2
from bs4 import BeautifulSoup
html = urllib2.urlopen('http://www.sohu.com').read()
soup = BeautifulSoup(html)
>>> soup.original_encoding
u'gbk'
And this agrees with the encoding declared in the <meta> tag in the HTML's <head>:
<meta http-equiv="content-type" content="text/html; charset=GBK" />
>>> soup.meta['content']
u'text/html; charset=GBK'
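If you want just the bare charset out of that value, a simple split is enough (illustrative, not from the original answer):
>>> soup.meta['content'].split('charset=')[-1]
u'GBK'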
Now you can decode the HTML:
decoded_html = html.decode(soup.original_encoding)
but there's not much point, since the HTML is already available as unicode:
>>> soup.a['title']
u'\u641c\u72d0-\u4e2d\u56fd\u6700\u5927\u7684\u95e8\u6237\u7f51\u7ad9'
>>> print soup.a['title']
搜狐-中国最大的门户网站
>>> soup.a.text
u'\u641c\u72d0'
>>> print soup.a.text
搜狐
It is also possible to attempt to detect it using the chardet module (although it is a bit slow):
>>> import chardet
>>> chardet.detect(html)
{'confidence': 0.99, 'encoding': 'GB2312'}
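Putting that together, here is a minimal sketch of a decode helper with a fallback (the helper name is my own, for illustration):
import chardet

def decode_html(raw_bytes):
    # use chardet's guess when it has one, fall back to utf-8 otherwise
    guess = chardet.detect(raw_bytes)
    return raw_bytes.decode(guess['encoding'] or 'utf-8', 'ignore')

decoded_html = decode_html(html)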

Another solution:
from simplified_scrapy.request import req
from simplified_scrapy.simplified_doc import SimplifiedDoc
html = req.get('http://www.sohu.com') # This will automatically help you find the correct encoding
doc = SimplifiedDoc(html)
print(doc.title.text)

I know this is an old question, but I spent a while today puzzling over a particularly problematic website, so I thought I'd share the solution that worked for me, which I got from here: http://shunchiubc.blogspot.com/2016/08/python-to-scrape-chinese-websites.html
Requests has a feature that will automatically get the actual encoding of the website, meaning you don't have to wrestle with encoding/decoding it yourself (before I found this, I was getting all sorts of errors trying to encode/decode strings and bytes and never getting readable output). This feature is called apparent_encoding. Here's how it worked for me:
from bs4 import BeautifulSoup
import requests
url = 'http://url_youre_using_here.html'
readOut = requests.get(url)
readOut.encoding = readOut.apparent_encoding  # sets the encoding properly before you hand it off to BeautifulSoup
soup = BeautifulSoup(readOut.text, "lxml")
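To see what apparent_encoding is correcting, you can compare it with what requests inferred from the headers (a quick illustrative check, reusing Sohu's home page from the first question):
import requests

readOut = requests.get('http://www.sohu.com')
print(readOut.encoding)           # taken from the HTTP headers (often ISO-8859-1)
print(readOut.apparent_encoding)  # detected from the response body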

Related

Extracting info from HTML that has no tags

I am using both Selenium and BeautifulSoup in order to do some web scraping. I have managed to obtain the following piece of code:
from selenium.webdriver import Chrome
from bs4 import BeautifulSoup
url = 'https://www.renfe.com/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1591014181985.html'
driver = Chrome()
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'lxml')
The output soup produces has the following structure:
<html>
<head>
</head>
<body>
<rf-list-detail line-color="245,150,40" line-number="C2" line-text="Línea C2"
list='[{..., "direction": "Place1"}, ..., {..., "direction": "Place2"}, ...]'>
Recall that both the text and the output style have been modified for readability. I attach an image of the actual output just in case it is more convenient.
Does anyone know how could I obtain every PlaceN (in the image, Moixent would be Place1) in a list? Something like
places = [Place1,...,PlaceN]
I have tried parsing it, but as it has no tags (or at least my HTML knowledge, which is barely any, says so) I obtain nothing. I have also tried using a regular expression, which I have just found out is a thing, but I am not sure how to do it properly.
Any thoughts?
Thank you in advance!!
[image: output of soup]
This site responds with a non-HTML structure, so you don't need an HTML parser like BeautifulSoup or lxml for this task.
Here is an example using the requests library. You can install it like this:
pip install requests
import requests
import html
import json
url = 'https://www.renfe.com/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1591014181985.html'
response = requests.get(url)
data = response.text # get data from site
raw_list = data.split("'")[1] # extract rf-list-detail.list attribute
json_list = html.unescape(raw_list) # decode html symbols
parsed_list = json.loads(json_list) # parse json
print(parsed_list) # printing result
directions = []
for item in parsed_list:
    directions.append(item["direction"])
print(directions)  # extracting directions
# ['Moixent', 'Vallada', 'Montesa', "L'Alcudia de Crespins", 'Xàtiva', "L'Enova-Manuel", 'La Pobla Llarga', 'Carcaixent', 'Alzira', 'Algemesí', 'Benifaió-Almussafes', 'Silla', 'Catarroja', 'Massanassa', 'Alfafar-Benetússer', 'València Nord']
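If splitting on single quotes feels fragile, a regex can target the list attribute directly; a sketch under the assumption that the attribute value is single-quoted, as the split above implies:
import html
import json
import re
import requests

url = 'https://www.renfe.com/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1591014181985.html'
match = re.search(r"list='([^']*)'", requests.get(url).text)  # grab the quoted attribute value
if match:
    parsed_list = json.loads(html.unescape(match.group(1)))
    print([item["direction"] for item in parsed_list])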

How to extract unicode text inside a tag?

I'm trying to collect data for my lab from this website: link
Here is my code:
from bs4 import BeautifulSoup
import requests
url='https://www.coursera.org/learn/applied-data-science-capstone-ar'
html=requests.get(url).text
soup=BeautifulSoup(html,'lxml')
info=soup.find('div',class_='_1wb6qi0n')
title=info.find('h1',class_='banner-title banner-title-without--subtitle m-b-0')
print(title)
I expect title to be كابستون علوم البيانات التطبيقية,
but the result is منهجية علم البيانات.
What is the problem? And how do I fix it?
Thank you for taking the time to answer.
The issue you are facing is due to improper encoding when fetching the URL with requests.get(). By default, a page requested via the requests library that does not declare a charset in its headers is assumed to be ISO-8859-1, which results in incorrect decoding of the HTML. To force a proper encoding for the requested page, change it via the encoding attribute of the response. For this to work, the line requests.get(url).text has to be split up like so:
...
# Request the URL and store the request
request = requests.get(url)
# Change the encoding before extracting the text
# Automatically infer encoding
request.encoding = request.apparent_encoding
# Now extract the HTML as text
html = request.text
...
In the above code snippet, request.apparent_encoding automatically infers the encoding of the page, so you don't have to specify one by hand.
So, the final code would be as follows:
from bs4 import BeautifulSoup
import requests
url = 'https://www.coursera.org/learn/applied-data-science-capstone-ar'
request = requests.get(url)
request.encoding = request.apparent_encoding
html = request.text
soup = BeautifulSoup(html,'lxml')
info = soup.find('div',class_='_1wb6qi0n')
title = info.find('h1',class_='banner-title banner-title-without--subtitle m-b-0')
print(title.text)
PS: You must use title.text when printing, to get the inner content of the tag.
Output:
كابستون علوم البيانات التطبيقية
What was causing the error is the encoding of the HTML data. Arabic letters need 2 bytes each to be represented, so you need to set the HTML data's encoding to UTF-8:
from bs4 import BeautifulSoup
import requests
url='https://www.coursera.org/learn/applied-data-science-capstone-ar'
html=requests.get(url)
html.encoding = html.apparent_encoding
soup=BeautifulSoup(html.text,'lxml')
info=soup.find('div',class_='_1wb6qi0n')
title=info.find('h1',class_='banner-title banner-title-without--subtitle m-b-0').get_text()
print(title)
Above, apparent_encoding automatically sets the encoding to whatever suits the data.
Output:
كابستون علوم البيانات التطبيقية
There is a nice library called ftfy. It has support for multiple languages.
Installation: pip install ftfy
Try this:
from bs4 import BeautifulSoup
import ftfy
import requests
url='https://www.coursera.org/learn/applied-data-science-capstone-ar'
html=requests.get(url).text
soup=BeautifulSoup(html,'lxml')
info=soup.find('div',class_='_1wb6qi0n')
title=info.find('h1',class_='banner-title banner-title-without--subtitle m-b-0').text
title = ftfy.fix_text(title)
print(title)
Output:
كابستون علوم البيانات التطبيقية
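For context, ftfy repairs mojibake, i.e. text that was decoded with the wrong codec. An illustrative example (the broken string below is my own construction, UTF-8 bytes mis-decoded as windows-1252, not output from this page):
import ftfy

broken = 'ÙƒØ§Ø¨Ø³ØªÙˆÙ†'  # 'كابستون' encoded as utf-8, then wrongly decoded as windows-1252
print(ftfy.fix_text(broken))  # كابستون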
I think you need to use UTF-8 encoding/decoding. If the problem is only how your terminal displays the text, there may be no fix on that side, but if the result is rendered in another environment, such as a web page, you will see it correctly.

How to fix Cyrillic characters while web-scraping with Python

I'm scraping a Cyrillic website with Python using BeautifulSoup, but I'm having some trouble: every word shows up like this:
СилÑановÑка Ðавкова во Ðази
I also tried some other Cyrillic websites, and they work fine.
My code is this:
from bs4 import BeautifulSoup
import requests
source = requests.get('https://').text
soup = BeautifulSoup(source, 'lxml')
print(soup.prettify())
How should I fix it?
requests fails to detect it as utf-8.
from bs4 import BeautifulSoup
import requests
source = requests.get('https://time.mk/') # don't convert to text just yet
# print(source.encoding)
# prints out ISO-8859-1
source.encoding = 'utf-8' # override encoding manually
soup = BeautifulSoup(source.text, 'lxml') # this will now decode utf-8 correctly
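Alternatively, instead of hard-coding utf-8, you could let requests guess from the body, as in the earlier answers (a variant, not part of the original answer):
from bs4 import BeautifulSoup
import requests

source = requests.get('https://time.mk/')
source.encoding = source.apparent_encoding  # chardet-based guess from the body
soup = BeautifulSoup(source.text, 'lxml')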

BeautifulSoup chinese character encoding error

I'm trying to identify and save all of the headlines on a specific site, and keep getting what I believe to be encoding errors.
The site is: http://paper.people.com.cn/rmrb/html/2016-05/06/nw.D110000renmrb_20160506_2-01.htm
the current code is:
holder = {}
url = urllib.urlopen('http://paper.people.com.cn/rmrb/html/2016-05/06/nw.D110000renmrb_20160506_2-01.htm').read()
soup = BeautifulSoup(url, 'lxml')
head1 = soup.find_all(['h1','h2','h3'])
print head1
holder["key"] = head1
The output of the print is:
[<h3>\u73af\u5883\u6c61\u67d3\u6700\u5c0f\u5316 \u8d44\u6e90\u5229\u7528\u6700\u5927\u5316</h3>, <h1>\u5929\u6d25\u6ee8\u6d77\u65b0\u533a\uff1a\u697c\u5728\u666f\u4e2d \u5382\u5728\u7eff\u4e2d</h1>, <h2></h2>]
I'm reasonably certain that those are unicode escape sequences, but I haven't been able to figure out how to convince Python to display them as the actual characters.
I have tried to find the answer elsewhere. The question that was more clearly on point was this one:
Python and BeautifulSoup encoding issues
which suggested adding
soup = BeautifulSoup.BeautifulSoup(content.decode('utf-8','ignore'))
however that gave me the same error that is mentioned in a comment ("AttributeError: type object 'BeautifulSoup' has no attribute 'BeautifulSoup'")
removing the second '.BeautifulSoup' resulted in a different error ("RuntimeError: maximum recursion depth exceeded while calling a Python object").
I also tried the answer suggested here:
Chinese character encoding error with BeautifulSoup in Python?
by breaking up the creation of the object
html = urllib2.urlopen("http://www.515fa.com/che_1978.html")
content = html.read().decode('utf-8', 'ignore')
soup = BeautifulSoup(content)
but that also generated the recursion error. Any other tips would be most appreciated.
thanks
Decode using unicode-escape:
In [6]: from bs4 import BeautifulSoup
In [7]: h = """<h3>\u73af\u5883\u6c61\u67d3\u6700\u5c0f\u5316 \u8d44\u6e90\u5229\u7528\u6700\u5927\u5316</h3>, <h1>\u5929\u6d25\u6ee8\u6d77\u65b0\u533a\uff1a\u697c\u5728\u666f\u4e2d \u5382\u5728\u7eff\u4e2d</h1>, <h2></h2>"""
In [8]: soup = BeautifulSoup(h, 'lxml')
In [9]: print(soup.h3.text.decode("unicode-escape"))
环境污染最小化 资源利用最大化
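This works because the pasted string is a Python 2 byte str containing literal \uXXXX escape sequences rather than the characters themselves; the unicode-escape codec interprets those sequences. A tiny illustration:
In [10]: s = '\u73af\u5883'  # in Python 2 this str holds the literal backslash sequences
In [11]: print(s.decode('unicode-escape'))
环境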
If you look at the source you can see the data is utf-8 encoded:
<meta http-equiv="content-language" content="utf-8" />
For me using bs4 4.4.1 just decoding what urllib returns works fine also:
In [1]: from bs4 import BeautifulSoup
In [2]: import urllib
In [3]: url = urllib.urlopen('http://paper.people.com.cn/rmrb/html/2016-05/06/nw.D110000renmrb_20160506_2-01.htm').read()
In [4]: soup = BeautifulSoup(url.decode("utf-8"), 'lxml')
In [5]: print(soup.h3.text)
环境污染最小化 资源利用最大化
When you are writing to a csv you will want to encode the data to a utf-8 str:
.decode("unicode-escape").encode("utf-8")
You can do the encode when you save the data in your dict.
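A minimal sketch of that (Python 2, assuming the soup was built from properly decoded HTML as in the In [4] example, so each tag's text is already unicode):
import csv

with open('headlines.csv', 'wb') as f:
    writer = csv.writer(f)
    for tag in holder['key']:
        writer.writerow([tag.text.encode('utf-8')])  # the csv module in Python 2 wants utf-8 strs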
This may provide a pretty simple solution; I'm not sure if it does absolutely everything you need, though. Let me know:
holder = {}
url = urllib.urlopen('http://paper.people.com.cn/rmrb/html/2016-05/06/nw.D110000renmrb_20160506_2-01.htm').read()
soup = BeautifulSoup(url, 'lxml')
head1 = soup.find_all(['h1','h2','h3'])
print unicode(head1)
holder["key"] = head1
Reference: Python 2.7 Unicode

python list.append between text

In Python 3, how would you go about taking the string between header tags, for example printing Hello, world! out of <h1>Hello, world!</h1>?
import urllib
from urllib.request import urlopen
#example URL that includes an <h> tag: http://www.hobo-web.co.uk/headers/
userAddress = input("Enter a website URL: ")
webPage = urllib.request.urlopen(userAddress)
list = []
while webPage != "":
    webPage.read()
    list.append()
You need an HTML Parser. For example, BeautifulSoup:
from bs4 import BeautifulSoup
soup = BeautifulSoup(webPage)
print(soup.find("h1").get_text(strip=True))
Demo:
>>> from urllib.request import urlopen
>>> from bs4 import BeautifulSoup
>>>
>>> url = "http://www.hobo-web.co.uk/headers/"
>>> webPage = urlopen(url)
>>>
>>> soup = BeautifulSoup(webPage, "html.parser")
>>> print(soup.find("h1").get_text(strip=True))
How To Use H1-H6 HTML Elements Properly
I'm not allowed to use any additional libraries, aside from what comes with Python. Does Python come with the ability to parse HTML, albeit in a less efficient way?
If you are, for some reason, not allowed to use third-party libraries, you can use the built-in html.parser module. Some people also use regular expressions to parse HTML. That is not always a bad thing, but you have to be very careful with it; see:
RegEx match open tags except XHTML self-contained tags
Definitely HTMLParser is your best friend to deal with that issue.
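A minimal sketch of extracting <h1> text with only the standard library's html.parser (the class name is my own, for illustration):
from html.parser import HTMLParser
from urllib.request import urlopen

class H1Extractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == 'h1':
            self.in_h1 = True

    def handle_endtag(self, tag):
        if tag == 'h1':
            self.in_h1 = False

    def handle_data(self, data):
        if self.in_h1 and data.strip():
            self.headings.append(data.strip())

parser = H1Extractor()
parser.feed(urlopen('http://www.hobo-web.co.uk/headers/').read().decode('utf-8'))
print(parser.headings)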
There are related questions which already exist and cover your needs.
