I'm scraping a Cyrillic website with Python using BeautifulSoup, but I'm having some trouble: every word shows up like this:
СилÑановÑка Ðавкова во Ðази
I also tried some other Cyrillic websites, and they work fine.
My code is this:
from bs4 import BeautifulSoup
import requests
source = requests.get('https://time.mk/').text
soup = BeautifulSoup(source, 'lxml')
print(soup.prettify())
How should I fix it?
requests fails to detect the page's encoding as UTF-8.
from bs4 import BeautifulSoup
import requests
source = requests.get('https://time.mk/')  # don't convert to .text just yet
# print(source.encoding)  # prints ISO-8859-1
source.encoding = 'utf-8'  # override the encoding manually
soup = BeautifulSoup(source.text, 'lxml')  # .text now decodes UTF-8 correctly
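To see where the wrong guess comes from, you can inspect the response yourself. A quick diagnostic sketch (the exact header value depends on the server): when the Content-Type header carries no charset, requests falls back to ISO-8859-1 for text responses, which garbles UTF-8 pages.
import requests

source = requests.get('https://time.mk/')
print(source.headers.get('Content-Type'))  # e.g. 'text/html' with no charset
print(source.encoding)                     # ISO-8859-1, the fallback
print(source.apparent_encoding)            # detected from the body, e.g. 'utf-8'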
Related
I'm trying to collect data for my lab from this website: https://www.coursera.org/learn/applied-data-science-capstone-ar
Here is my code:
from bs4 import BeautifulSoup
import requests
url='https://www.coursera.org/learn/applied-data-science-capstone-ar'
html=requests.get(url).text
soup=BeautifulSoup(html,'lxml')
info=soup.find('div',class_='_1wb6qi0n')
title=info.find('h1',class_='banner-title banner-title-without--subtitle m-b-0')
print(title)
I expected the title to be كابستون علوم البيانات التطبيقية,
but the result is منهجية علم البيانات.
What is the problem? And how do I fix it?
Thank you for taking time to answer.
The issue you are facing is due to improper encoding when fetching the URL with the requests.get() function. When the server's Content-Type header does not specify a charset, requests falls back to a default encoding of ISO-8859-1, which mangles the HTML text. To force a proper encoding for the requested page, change the encoding via the response's encoding attribute. For this to work, the line requests.get(url).text has to be split up like so:
...
# Request the URL and store the request
request = requests.get(url)
# Change the encoding before extracting the text
# Automatically infer encoding
request.encoding = request.apparent_encoding
# Now extract the HTML as text
html = request.text
...
In the above code snippet, request.apparent_encoding inspects the content of the response and infers the encoding from it, so you don't have to hard-code one encoding or another.
So, the final code would be as follows:
from bs4 import BeautifulSoup
import requests
url = 'https://www.coursera.org/learn/applied-data-science-capstone-ar'
request = requests.get(url)
request.encoding = request.apparent_encoding
html = request.text
soup = BeautifulSoup(html, 'lxml')
info = soup.find('div', class_='_1wb6qi0n')
title = info.find('h1', class_='banner-title banner-title-without--subtitle m-b-0')
print(title.text)
PS: You must call title.text before printing so that you print the inner content of the tag rather than the whole tag.
Output:
كابستون علوم البيانات التطبيقية
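As an aside, here is the difference between printing the tag and its text; the snippet below is illustrative, not the real page markup:
from bs4 import BeautifulSoup

# Hypothetical markup, just to show the distinction
snippet = '<h1 class="banner-title">كابستون علوم البيانات التطبيقية</h1>'
tag = BeautifulSoup(snippet, 'lxml').h1
print(tag)       # the whole tag, markup included
print(tag.text)  # only the inner content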
What was causing the error is the encoding of the HTML data.
Arabic letters take two bytes each in UTF-8.
You need to set the encoding of the HTML data to UTF-8:
from bs4 import BeautifulSoup
import requests

url = 'https://www.coursera.org/learn/applied-data-science-capstone-ar'
html = requests.get(url)
html.encoding = html.apparent_encoding
soup = BeautifulSoup(html.text, 'lxml')
info = soup.find('div', class_='_1wb6qi0n')
title = info.find('h1', class_='banner-title banner-title-without--subtitle m-b-0').get_text()
print(title)
In the above, apparent_encoding automatically sets the encoding to whatever suits the data.
Output:
كابستون علوم البيانات التطبيقية
There is a nice library called ftfy. It has support for multiple languages.
Installation: pip install ftfy
Try this:
from bs4 import BeautifulSoup
import ftfy
import requests
url = 'https://www.coursera.org/learn/applied-data-science-capstone-ar'
html = requests.get(url).text
soup = BeautifulSoup(html, 'lxml')
info = soup.find('div', class_='_1wb6qi0n')
title = info.find('h1', class_='banner-title banner-title-without--subtitle m-b-0').text
title = ftfy.fix_text(title)
print(title)
Output:
كابستون علوم البيانات التطبيقية
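ftfy can also repair text that has already been mangled, which is handy when you cannot re-fetch the page. A small standalone sketch (the example strings come from ftfy's own documentation):
import ftfy

# ftfy detects and undoes common mojibake, e.g. UTF-8 text that was
# mistakenly decoded as Latin-1 or Windows-1252.
print(ftfy.fix_text('âœ” No problems'))  # ✔ No problems
print(ftfy.fix_text('uÌˆnicode'))         # ünicode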
I think you need to use UTF-8 encoding/decoding. If the problem is how your terminal displays the text, there may be no easy fix within the page itself, but if the output ends up in another environment, such as a web page, it should display correctly.
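That said, if the terminal really is the blocker, one option on Python 3.7+ is to reconfigure stdout to UTF-8; a minimal sketch (this helps on Windows consoles in particular):
import sys

# Python 3.7+: override the console's default encoding for stdout.
sys.stdout.reconfigure(encoding='utf-8')
print('كابستون علوم البيانات التطبيقية')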
I am scraping a Chinese website, and usually there is no problem parsing the Chinese characters that I use to find specific URLs via pattern matching in bs4.
However, for this particular Chinese website the soup cannot be parsed properly.
Below is the code I use to set up the soup:
start = f'http://www.shuichan.cc/news_list.asp?action=&c_id=93&s_id=210&page={1}'
r = requests.get(start)
soup = bs(r.content, "html.parser")
An example of the printed soup is the following:
[Image: current soup]
Note: I had to add a picture because Stack Overflow thought the raw text was spam :)
The above should have looked like the following:
[Image: proper soup]
I wonder if I have to specify some kind of encoding in the request, or perhaps something in the soup, but so far I have not found anything that works.
Thanks in advance!
I don't know Chinese. Does this give the desired results?
import requests
from bs4 import BeautifulSoup as bs
start = f'http://www.shuichan.cc/news_list.asp?action=&c_id=93&s_id=210&page={1}'
r = requests.get(start)
soup = bs(r.content.decode('GBK', 'ignore'), "html.parser")
print(soup)
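If you'd rather not hard-code 'GBK', a sketch that lets requests guess the charset instead (detection is heuristic, so verify the output):
import requests
from bs4 import BeautifulSoup as bs

start = 'http://www.shuichan.cc/news_list.asp?action=&c_id=93&s_id=210&page=1'
r = requests.get(start)
print(r.apparent_encoding)        # typically GB2312 or GBK for this kind of page
r.encoding = r.apparent_encoding  # let the guess drive the decoding
soup = bs(r.text, "html.parser")
Note that GB2312 is a subset of GBK, so detection may report GB2312 even when the page contains GBK-only characters; decoding with the GBK superset, as in the answer above, is the safer choice.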
I would like to scrape this site: http://waqfeya.com/book.php?bid=1
But when I do, I get characters like these: ÇáÞÑÂä ÇáßÑíã.
This is how my script looks:
import requests
from bs4 import BeautifulSoup
BASE_URL = "http://waqfeya.com/book.php?bid=1"
source = requests.get(BASE_URL)
soup = BeautifulSoup(source.text, 'lxml')
print(soup)
I tried these things, but they didn't work for me:
source.encoding = 'utf-8'
and this:
source.encoding = 'ISO-8859-1'
also this:
soup = BeautifulSoup(source.text, from_encoding='ISO-8859-1')
But none worked for me.
Use urlopen instead of requests:
from bs4 import BeautifulSoup
from urllib.request import urlopen

BASE_URL = "http://waqfeya.com/book.php?bid=1"
page = urlopen(BASE_URL)  # returns raw bytes
soup = BeautifulSoup(page, 'lxml')
print(soup.encode('utf-8'))
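The reason this works is that urlopen hands BeautifulSoup raw bytes, so bs4 runs its own charset detection (Unicode, Dammit) instead of trusting a wrong guess. You can get the same effect while staying with requests by passing source.content (bytes) rather than source.text; a sketch:
import requests
from bs4 import BeautifulSoup

BASE_URL = "http://waqfeya.com/book.php?bid=1"
source = requests.get(BASE_URL)
soup = BeautifulSoup(source.content, 'lxml')  # bytes in, bs4 detects the charset
print(soup)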
Sometimes requests may get the encoding wrong. For this site we can get the correct encoding from the page source.
You can assign the encoding like source.encoding = 'windows-1256' before using source.text in BeautifulSoup.
import requests
BASE_URL = "http://waqfeya.com/book.php?bid=1"
source = requests.get(BASE_URL)
print(source.encoding)
print(source.apparent_encoding)
source.encoding = 'windows-1256'
print(source.text)
I was able to get all the Arabic characters properly.
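Alternatively, you can hand the charset straight to BeautifulSoup with the from_encoding argument (note the spelling; the attempt in the question had a typo). It only takes effect when bs4 receives bytes, so pass source.content; a sketch:
import requests
from bs4 import BeautifulSoup

BASE_URL = "http://waqfeya.com/book.php?bid=1"
source = requests.get(BASE_URL)
# from_encoding applies to bytes input only, hence .content
soup = BeautifulSoup(source.content, 'lxml', from_encoding='windows-1256')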
Hi there, I am writing scraping code, but when I try to get all the paragraphs from a website it gives me the following error:
UnicodeEncodeError: 'charmap' codec can't encode character '\xa9'
Here is my code:
# Loading libraries
import urllib.request
from bs4 import BeautifulSoup

# Define URL for scraping
newsurl = "http://www.techspot.com/news/67832-netflix-exceeds-growth-expectations-home-abroad-stock-soars.html"
thepage = urllib.request.urlopen(newsurl)
soup = BeautifulSoup(thepage, "html.parser")
article = soup.find_all('div', {'class': 'articleBody'})
for pg in article:
    paragraph = pg.find_all('p')  # paragraphs within the article body
    ptag = paragraph
    print(ptag)
The error I am getting is the one shown above.
Let me know how to remove this error.
soup.findAll() returns a ResultSet object, which is basically a list, and a list does not have an encode attribute. You either meant to use .text instead:
text = soup.text
Or, "join" the texts:
text = "".join(soup.findAll(whatever, you, want))
At times this error occurs while using Beautiful Soup 4 (bs4) or fetching data with requests. So try the code below in your print statement:
print(myHtmlData.encode("utf-8"))
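Note that in Python 3 this prints a bytes literal (a b'...' prefix with escape sequences). If you need readable output, fixing the console encoding itself, for example by setting the PYTHONIOENCODING environment variable to utf-8 before running the script, avoids the encode step entirely.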
I use BeautifulSoup and urllib2 to download web pages, but different pages use different encodings, such as UTF-8, GB2312, or GBK. I used urllib2 to get Sohu's home page, which is encoded with GBK, but in my code I decode its page this way:
self.html_doc = self.html_doc.decode('gb2312','ignore')
But how can I know which encoding a page uses before I use BeautifulSoup to decode it to Unicode? Most Chinese websites don't provide a charset in the Content-Type field of the HTTP header.
Using BeautifulSoup you can parse the HTML and access the original_encoding attribute:
import urllib2
from bs4 import BeautifulSoup
html = urllib2.urlopen('http://www.sohu.com').read()
soup = BeautifulSoup(html)
>>> soup.original_encoding
u'gbk'
And this agrees with the encoding declared in the <meta> tag in the HTML's <head>:
<meta http-equiv="content-type" content="text/html; charset=GBK" />
>>> soup.meta['content']
u'text/html; charset=GBK'
Now you can decode the HTML:
decoded_html = html.decode(soup.original_encoding)
but there's not much point, since the HTML is already available as Unicode:
>>> soup.a['title']
u'\u641c\u72d0-\u4e2d\u56fd\u6700\u5927\u7684\u95e8\u6237\u7f51\u7ad9'
>>> print soup.a['title']
搜狐-中国最大的门户网站
>>> soup.a.text
u'\u641c\u72d0'
>>> print soup.a.text
搜狐
It is also possible to attempt to detect it using the chardet module (although it is a bit slow):
>>> import chardet
>>> chardet.detect(html)
{'confidence': 0.99, 'encoding': 'GB2312'}
Another solution.
from simplified_scrapy.request import req
from simplified_scrapy.simplified_doc import SimplifiedDoc
html = req.get('http://www.sohu.com') # This will automatically help you find the correct encoding
doc = SimplifiedDoc(html)
print (doc.title.text)
I know this is an old question, but I spent a while today puzzling over a particularly problematic website, so I thought I'd share the solution that worked for me, which I got from here: http://shunchiubc.blogspot.com/2016/08/python-to-scrape-chinese-websites.html
Requests has a feature that will automatically get the actual encoding of the website, meaning you don't have to wrestle with encoding or decoding it yourself (before I found this, I was getting all sorts of errors trying to encode/decode strings and bytes and never getting readable output). This feature is called apparent_encoding. Here's how it worked for me:
from bs4 import BeautifulSoup
import requests
url = 'http://url_youre_using_here.html'
readOut = requests.get(url)
readOut.encoding = readOut.apparent_encoding  # set the encoding properly before handing the text to BeautifulSoup
soup = BeautifulSoup(readOut.text, "lxml")
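A quick sanity check after the hand-off (the tag choice is illustrative): non-Latin characters should now print correctly instead of as mojibake.
print(soup.title.get_text())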