Persian characters in url and working with python urlopen() method - python

I need help encoding/decoding a non-ASCII URL into a form that can be fed to the urlopen() method. My code scrapes a (non-ASCII) URL from a page and then moves on to the next page:
from urllib.request import urlopen
from bs4 import BeautifulSoup
The entrance URL, copy-pasted from the Chrome browser:
url = 'https://www.sheypoor.com/%DA%A9%D9%85%D8%AF-%D9%86%D9%88%D8%AC%D9%88%D8%A7%D9%86-34926671.html'
for i in range(1, 10):
    html = urlopen(url)
    page = BeautifulSoup(html.read(), 'html.parser')
    url_obj = page.findAll('a')[13]['href'].strip()
    print(url_obj)
    url = url_obj
But I got an error:
'ascii' codec can't encode characters in position 5-9: ordinal not in range(128)
When I checked the UnicodeEncodeError traceback, it pointed to this line:
----> 8 html = urlopen(url)
As you can see, on the first iteration urlopen() works with the entrance URL, because it is in the form:
https://www.sheypoor.com/%DA%A9%D9%85%D8%AF-%D9%86%D9%88%D8%AC%D9%88%D8%A7%D9%86-34926671.html
But the problem starts when url_obj, scraped from the BeautifulSoup object, replaces the previous URL. It comes in a form like
https://www.sheypoor.com/سرویس-تخت-کمد-نوجوان-44887762.html
which is not suitable for feeding to the urlopen() method.
I tried to find a way to convert my url_obj into the correct URL form, like the entrance URL, but I failed. :-(
I would appreciate any support and guidance in solving this problem.

You could use something like this:
from urllib.request import urlopen
from urllib.parse import quote

# Percent-encode the Persian path segment before appending it.
persian_url = 'https://www.isna.ir/news/99010100077/' + quote('حواشی-در-آکروباتیک-ژیمناستیک-بالا-گرفت-دبیر-هم-استعفا-کرد')
page = urlopen(persian_url)
The original URL was: 'https://www.isna.ir/news/99010100077/حواشی-در-آکروباتیک-ژیمناستیک-بالا-گرفت-دبیر-هم-استعفا-کرد'
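Applied to the loop in the question, that might look like the following sketch (assuming, as in the question, that the scraped href holds raw Persian characters and no query string; safe=':/' keeps the scheme and path separators literal):

from urllib.request import urlopen
from urllib.parse import quote
from bs4 import BeautifulSoup

url = 'https://www.sheypoor.com/%DA%A9%D9%85%D8%AF-%D9%86%D9%88%D8%AC%D9%88%D8%A7%D9%86-34926671.html'
for i in range(1, 10):
    page = BeautifulSoup(urlopen(url).read(), 'html.parser')
    url_obj = page.findAll('a')[13]['href'].strip()
    # quote() percent-encodes the Persian characters; ':' and '/' must
    # stay literal so the scheme and path structure survive.
    url = quote(url_obj, safe=':/')
    print(url)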

Related

How to extract a unicode text inside a tag?

I'm trying to collect data for my lab from this website: link
Here is my code:
from bs4 import BeautifulSoup
import requests
url='https://www.coursera.org/learn/applied-data-science-capstone-ar'
html=requests.get(url).text
soup=BeautifulSoup(html,'lxml')
info=soup.find('div',class_='_1wb6qi0n')
title=info.find('h1',class_='banner-title banner-title-without--subtitle m-b-0')
print(title)
I expect the title to be كابستون علوم البيانات التطبيقية,
but the result is منهجية علم البيانات.
What is the problem? And how do I fix it?
Thank you for taking time to answer.
The issue you are facing is due to improper encoding when fetching the URL with requests.get(). By default, pages requested via the requests library fall back to ISO-8859-1 encoding when the server does not declare a charset, which garbles the extracted HTML. To force a proper encoding for the requested page, set the encoding attribute of the response before extracting its text. For this to work, the line requests.get(url).text has to be broken up like so:
...
# Request the URL and store the request
request = requests.get(url)
# Change the encoding before extracting the text
# Automatically infer encoding
request.encoding = request.apparent_encoding
# Now extract the HTML as text
html = request.text
...
In the snippet above, request.apparent_encoding infers the encoding of the page from its content, so you don't have to hard-code one encoding or another.
So, the final code would be as follows:
from bs4 import BeautifulSoup
import requests
url = 'https://www.coursera.org/learn/applied-data-science-capstone-ar'
request = requests.get(url)
request.encoding = request.apparent_encoding
html = request.text
soup = BeautifulSoup(html,'lxml')
info = soup.find('div',class_='_1wb6qi0n')
title = info.find('h1',class_='banner-title banner-title-without--subtitle m-b-0')
print(title.text)
PS: Use title.text when printing so that you get the tag's inner content rather than the whole tag.
Output:
كابستون علوم البيانات التطبيقية
What was causing the error is the encoding of the HTML data.
Arabic letters need 2 bytes each (they are multi-byte characters).
You need to set the encoding of the HTML data to UTF-8:
from bs4 import BeautifulSoup
import requests
url='https://www.coursera.org/learn/applied-data-science-capstone-ar'
html=requests.get(url)
html.encoding = html.apparent_encoding
soup=BeautifulSoup(html.text,'lxml')
info=soup.find('div',class_='_1wb6qi0n')
title=info.find('h1',class_='banner-title banner-title-without--subtitle m-b-0').get_text()
print(title)
In the code above, apparent_encoding automatically sets the encoding to whatever suits the data.
OUTPUT :
كابستون علوم البيانات التطبيقية
There is a nice library called ftfy ("fixes text for you"). It has multi-language support.
Installation: pip install ftfy
Try this:
from bs4 import BeautifulSoup
import ftfy
import requests
url='https://www.coursera.org/learn/applied-data-science-capstone-ar'
html=requests.get(url).text
soup=BeautifulSoup(html,'lxml')
info=soup.find('div',class_='_1wb6qi0n')
title=info.find('h1',class_='banner-title banner-title-without--subtitle m-b-0').text
title = ftfy.fix_text(title)
print(title)
Output:
كابستون علوم البيانات التطبيقية
I think you need to use UTF-8 encoding/decoding. If the problem lies in your terminal's display, I think there is no fix on the Python side; but if the result ends up in another environment, such as a web page, you will see the correct text there.
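For completeness, a minimal sketch of forcing UTF-8 explicitly, assuming the page really is served as UTF-8 (Coursera's is):

import requests
from bs4 import BeautifulSoup

url = 'https://www.coursera.org/learn/applied-data-science-capstone-ar'
# .content is the raw bytes; decode them as UTF-8 instead of trusting
# requests' header-based guess.
html = requests.get(url).content.decode('utf-8')
soup = BeautifulSoup(html, 'lxml')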

Why is urlopen giving me a strange string of characters?

I am trying to scrape the NBA game predictions on FiveThirtyEight. I usually use urllib2 and BeautifulSoup to scrape data from the web. However, the HTML returned by this process is very strange: it is a string of characters such as "\x82\xdf\x97S\x99\xc7\x9d", and I cannot decode it into regular text. Here is my code:
from urllib2 import urlopen
html = urlopen('http://projects.fivethirtyeight.com/2016-nba-picks/').read()
This method works on other websites and other pages on 538, but not this one.
Edit: I tried to decode the string using
html.decode('utf-8')
and the method located here, but I got the following error message:
UnicodeDecodeError: 'utf8' codec can't decode byte 0x8b in position 1: invalid start byte
That page appears to return gzipped data by default. The following should do the trick:
from urllib2 import urlopen
import zlib

opener = urlopen('http://projects.fivethirtyeight.com/2016-nba-picks/')
if 'gzip' in opener.info().get('Content-Encoding', 'NOPE'):
    # 16 + zlib.MAX_WBITS tells zlib to expect a gzip header and trailer.
    html = zlib.decompress(opener.read(), 16 + zlib.MAX_WBITS)
else:
    html = opener.read()
The result went into BeautifulSoup with no issues.
The HTTP headers (returned by the .info() above) are often helpful when trying to deduce the cause of issues with the Python url libraries.
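For example, a quick sketch (Python 2, to match the answer) of dumping the headers that matter here:

from urllib2 import urlopen

opener = urlopen('http://projects.fivethirtyeight.com/2016-nba-picks/')
# Inspect what the server actually sent before deciding how to decode.
print opener.info()                              # full header block
print opener.info().get('Content-Type', '')      # e.g. text/html; charset=utf-8
print opener.info().get('Content-Encoding', '')  # e.g. gzip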

How to open with urllib, link parsed by BeautifulSoup?

I use Python 3, Beautiful Soup 4, and urllib for parsing some HTML.
I need to parse some pages, get some links from those pages, and then parse the pages behind those links. I am trying to do it like this:
import urllib.request
import urllib
from bs4 import BeautifulSoup

with urllib.request.urlopen("https://example.com/mypage?myparam=%D0%BC%D0%B2") as response:
    html = response.read()

soup = BeautifulSoup(html, 'html.parser')
total = soup.find(attrs={"class": "item_total"})
link = u"https://example.com" + total.find('a').get('href')
with urllib.request.urlopen(link) as response:
    exthtml = BeautifulSoup(response.read(), 'html.parser')
But urllib can't open the second link, because it is not percent-encoded like the first link: it contains characters from other languages and whitespace.
I tried to encode the URL, like this:
link = urllib.parse.quote("https://example.com" + total.find('a').get('href'))
But that encodes all the symbols, including ':' and '/'. How can I get a properly formed URL from BeautifulSoup and request it?
UPD: an example of the second link, as produced by
link = u"https://example.com" + total.find('a').get('href')
is
"https://example.com/mypage?p1url=www.example.net%2Fthatpage%2F01234&text=абвгд еёжз ийклмно"
You should just be URL-encoding the path portion of your link:
link = "https://example.com" + urllib.parse.quote(total.find('a').get('href'))

how to decode and encode web page with python?

I use BeautifulSoup and urllib2 to download web pages, but different web pages use different encodings, such as UTF-8, GB2312, or GBK. I use urllib2 to get Sohu's home page, which is encoded with GBK, and in my code I decode it like this:
self.html_doc = self.html_doc.decode('gb2312','ignore')
But how can I know which encoding a page uses before I decode it to Unicode for BeautifulSoup? Most Chinese websites don't send a charset in the HTTP Content-Type header.
Using BeautifulSoup you can parse the HTML and access the original_encoding attribute:
import urllib2
from bs4 import BeautifulSoup
html = urllib2.urlopen('http://www.sohu.com').read()
soup = BeautifulSoup(html)
>>> soup.original_encoding
u'gbk'
And this agrees with the encoding declared in the <meta> tag in the HTML's <head>:
<meta http-equiv="content-type" content="text/html; charset=GBK" />
>>> soup.meta['content']
u'text/html; charset=GBK'
Now you can decode the HTML:
decoded_html = html.decode(soup.original_encoding)
but there's not much point, since the HTML is already available as Unicode:
>>> soup.a['title']
u'\u641c\u72d0-\u4e2d\u56fd\u6700\u5927\u7684\u95e8\u6237\u7f51\u7ad9'
>>> print soup.a['title']
搜狐-中国最大的门户网站
>>> soup.a.text
u'\u641c\u72d0'
>>> print soup.a.text
搜狐
It is also possible to attempt to detect it using the chardet module (although it is a bit slow):
>>> import chardet
>>> chardet.detect(html)
{'confidence': 0.99, 'encoding': 'GB2312'}
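The detected encoding can then be fed straight into decode(). A sketch; note chardet guessed GB2312, a subset of GBK, so passing 'replace' guards against any stray characters:
>>> detected = chardet.detect(html)
>>> decoded_html = html.decode(detected['encoding'], 'replace')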
Another solution.
from simplified_scrapy.request import req
from simplified_scrapy.simplified_doc import SimplifiedDoc
html = req.get('http://www.sohu.com') # This will automatically help you find the correct encoding
doc = SimplifiedDoc(html)
print (doc.title.text)
I know this is an old question, but I spent a while today puzzling over a particularly problematic website so I thought I'd share the solution that worked for me, which I got from here: http://shunchiubc.blogspot.com/2016/08/python-to-scrape-chinese-websites.html
Requests has a feature that will automatically get the actual encoding of the website, meaning you don't have to wrestle with encoding/decoding it (before I found this, I was getting all sorts of errors trying to encode/decode strings/bytes and never getting any output which was readable). This feature is called apparent_encoding. Here's how it worked for me:
from bs4 import BeautifulSoup
import requests
url = 'http://url_youre_using_here.html'
readOut = requests.get(url)
readOut.encoding = readOut.apparent_encoding #sets the encoding properly before you hand it off to BeautifulSoup
soup = BeautifulSoup(readOut.text, "lxml")

BeautifulSoup responses with error

I am trying to get my feet wet with BeautifulSoup.
I tried to work my way through the documentation, but at the very first step I already ran into a problem.
This is my code:
from bs4 import BeautifulSoup
soup = BeautifulSoup('https://api.flickr.com/services/rest/?method=flickr.photos.search&api_key=5....1b&per_page=250&accuracy=1&has_geo=1&extras=geo,tags,views,description')
print(soup.prettify())
This is the response I get:
Warning (from warnings module):
  File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/site-packages/bs4/__init__.py", line 189
    '"%s" looks like a URL. Beautiful Soup is not an HTTP client. You should probably use an HTTP client to get the document behind the URL, and feed that document to Beautiful Soup.' % markup)
UserWarning: "https://api.flickr.com/services/rest/?method=flickr.photos.search&api_key=5...b&per_page=250&accuracy=1&has_geo=1&extras=geo,tags,views,description" looks like a URL. Beautiful Soup is not an HTTP client. You should probably use an HTTP client to get the document behind the URL, and feed that document to Beautiful Soup.
Is it because I am trying to call an https URL, or is it another problem?
Thanks for your help!
You are passing the URL as a string. Instead, you need to get the page source via urllib2 or requests:
from urllib2 import urlopen # for Python 3: from urllib.request import urlopen
from bs4 import BeautifulSoup
URL = 'https://api.flickr.com/services/rest/?method=flickr.photos.search&api_key=5....1b&per_page=250&accuracy=1&has_geo=1&extras=geo,tags,views,description'
soup = BeautifulSoup(urlopen(URL))
Note that you don't need to call read() on the result of urlopen(), BeautifulSoup allows the first argument to be a file-like object, urlopen() returns a file-like object.
The error says it all: you are passing a URL to Beautiful Soup. You need to get the website content first, and only then pass that content to BeautifulSoup.
To download the content you can use urllib2:
import urllib2
response = urllib2.urlopen('http://www.example.com/')
html = response.read()
and later
soup = BeautifulSoup(html)
