Looking for some help. I am working on a project scraping specific Craigslist posts using Beautiful Soup in Python. I can successfully display emojis found within the post title but have been unsuccessful within the post body. I've tried different variations but nothing has worked so far. Any help would be appreciated.
Code:
import requests
from bs4 import BeautifulSoup

f = open("clcondensed.txt", "w")
html2 = requests.get("https://raleigh.craigslist.org/wan/6078682335.html")
soup = BeautifulSoup(html2.content,"html.parser")
#Post Title
title = soup.find(id="titletextonly")
title1 = soup.title.string.encode("ascii","xmlcharrefreplace")
f.write(title1)
#Post Body
body = soup.find(id="postingbody")
body = str(body)
body = body.encode("ascii","xmlcharrefreplace")
f.write(body)
Error received from the body:
'ascii' codec can't decode byte 0xef in position 273: ordinal not in range(128)
You should use unicode:
body = unicode(body)
See the Beautiful Soup documentation on NavigableString.
Update:
Sorry for the quick answer; it's not right.
Here you should use the lxml parser instead of the html parser, because the html parser does not handle NCR (Numeric Character Reference) emoji well.
In my test, when an NCR emoji's decimal value is greater than 65535, such as the ship emoji 🚢 in your HTML, the html parser decodes it to the wrong character u'\ufffd' instead of u'\U0001F6A2'. I could not find an exact Beautiful Soup reference for this, but the lxml parser handles it correctly.
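To see the difference yourself, here is a minimal sketch (my own demo, not from the original post; &#128674; is the NCR for the ship emoji):

from bs4 import BeautifulSoup

snippet = "<p>&#128674;</p>"  # NCR with a decimal value greater than 65535

# html.parser may produce the replacement character u'\ufffd' (depends on the Python build)
print repr(BeautifulSoup(snippet, "html.parser").p.string)

# lxml decodes it to the correct code point u'\U0001f6a2'
print repr(BeautifulSoup(snippet, "lxml").p.string)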
Below is the tested code:
import requests
from bs4 import BeautifulSoup
f = open("clcondensed.txt", "w")
html = requests.get("https://raleigh.craigslist.org/wan/6078682335.html")
soup = BeautifulSoup(html.content, "lxml")
#Post Title
title = soup.find(id="titletextonly")
title = unicode(title)
f.write(title.encode('utf-8'))
#Post Body
body = soup.find(id="postingbody")
body = unicode(body)
f.write(body.encode('utf-8'))
f.close()
You can refer to the lxml entity handling documentation to do more.
If you have not installed lxml, see the lxml installation docs.
Hope this helps.
I'm trying to collect data for my lab from this website: link
Here is my code:
from bs4 import BeautifulSoup
import requests
url='https://www.coursera.org/learn/applied-data-science-capstone-ar'
html=requests.get(url).text
soup=BeautifulSoup(html,'lxml')
info=soup.find('div',class_='_1wb6qi0n')
title=info.find('h1',class_='banner-title banner-title-without--subtitle m-b-0')
print(title)
I expected title to be كابستون علوم البيانات التطبيقية,
but the result is منهجية علم البيانات.
What is the problem? And how do I fix it?
Thank you for taking the time to answer.
The issue you are facing is due to improper encoding when fetching the URL with requests.get(). By default, pages requested via the requests library fall back to ISO-8859-1 encoding when the server does not declare a charset, which results in incorrect decoding of the HTML itself. To force a proper encoding for the requested page, change the encoding using the encoding attribute of the response. For this to work, the line requests.get(url).text has to be broken up like so:
...
# Request the URL and store the request
request = requests.get(url)
# Change the encoding before extracting the text
# Automatically infer encoding
request.encoding = request.apparent_encoding
# Now extract the HTML as text
html = request.text
...
In the above code snippet, request.apparent_encoding automatically infers the encoding of the page, without you having to forcefully specify one encoding or another.
So, the final code would be as follows:
from bs4 import BeautifulSoup
import requests
url = 'https://www.coursera.org/learn/applied-data-science-capstone-ar'
request = requests.get(url)
request.encoding = request.apparent_encoding
html = request.text
soup = BeautifulSoup(html,'lxml')
info = soup.find('div',class_='_1wb6qi0n')
title = info.find('h1',class_='banner-title banner-title-without--subtitle m-b-0')
print(title.text)
PS: You must use title.text when printing to get the inner content of the tag.
Output:
كابستون علوم البيانات التطبيقية
What was causing the error is the encoding of the HTML data.
Arabic letters need 2 bytes each to display.
You need to set the HTML data's encoding to UTF-8.
from bs4 import BeautifulSoup
import requests
url='https://www.coursera.org/learn/applied-data-science-capstone-ar'
html=requests.get(url)
html.encoding = html.apparent_encoding
soup=BeautifulSoup(html.text,'lxml')
info=soup.find('div',class_='_1wb6qi0n')
title=info.find('h1',class_='banner-title banner-title-without--subtitle m-b-0').get_text()
print(title)
In the above, apparent_encoding automatically sets the encoding to what suits the data.
Output:
كابستون علوم البيانات التطبيقية
There is a nice library called ftfy. It has support for multiple languages.
Installation: pip install ftfy
Try this:
from bs4 import BeautifulSoup
import ftfy
import requests
url='https://www.coursera.org/learn/applied-data-science-capstone-ar'
html=requests.get(url).text
soup=BeautifulSoup(html,'lxml')
info=soup.find('div',class_='_1wb6qi0n')
title=info.find('h1',class_='banner-title banner-title-without--subtitle m-b-0').text
title = ftfy.fix_text(title)
print(title)
Output:
كابستون علوم البيانات التطبيقية
I think you need to use UTF-8 encoding/decoding. If your problem is in the terminal, I think there is no solution there, but if your output environment is somewhere else, such as a web page, you should see the correct result.
I have been trying to parse XML and HTML pages using the lxml and requests packages in Python. I am using the following code for this purpose:
import requests
from lxml import html

url = ""
req = requests.get(url)
tree = html.fromstring(req.content)
root = tree.xpath('')
for item in root:
    print(item.text)
This code works fine, but for some web pages it can't show their contents properly, and I need to set the encoding to UTF-8. I don't know how I can set the encoding in this code.
requests automatically decodes content from the server.
Important to understand:
r.content - contains the not-yet-decoded response content
r.encoding - contains information about the response content's encoding
r.text - according to the official docs, this is the already-decoded version of r.content
Following the Unicode standard, I have gotten used to r.text, but you can still decode the content manually using
r.content.decode(r.encoding)
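For the lxml code in the question, a minimal sketch putting this together (the URL is a placeholder, and forcing utf-8 is an assumption; r.apparent_encoding is an alternative when the charset is unknown):

import requests
from lxml import html

r = requests.get("https://example.com")  # placeholder URL
r.encoding = "utf-8"  # force utf-8, or use r.apparent_encoding to guess
tree = html.fromstring(r.text)  # r.text is decoded using r.encoding
print(tree.findtext(".//title"))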
Hope it helps.
I'm trying to identify and save all of the headlines on a specific site, and keep getting what I believe to be encoding errors.
The site is: http://paper.people.com.cn/rmrb/html/2016-05/06/nw.D110000renmrb_20160506_2-01.htm
the current code is:
import urllib
from bs4 import BeautifulSoup

holder = {}
url = urllib.urlopen('http://paper.people.com.cn/rmrb/html/2016-05/06/nw.D110000renmrb_20160506_2-01.htm').read()
soup = BeautifulSoup(url, 'lxml')
head1 = soup.find_all(['h1','h2','h3'])
print head1
holder["key"] = head1
The output of the print is:
[<h3>\u73af\u5883\u6c61\u67d3\u6700\u5c0f\u5316 \u8d44\u6e90\u5229\u7528\u6700\u5927\u5316</h3>, <h1>\u5929\u6d25\u6ee8\u6d77\u65b0\u533a\uff1a\u697c\u5728\u666f\u4e2d \u5382\u5728\u7eff\u4e2d</h1>, <h2></h2>]
I'm reasonably certain that those are Unicode characters, but I haven't been able to figure out how to convince Python to display them as the actual characters.
I have tried to find the answer elsewhere. The question that was more clearly on point was this one:
Python and BeautifulSoup encoding issues
which suggested adding
soup = BeautifulSoup.BeautifulSoup(content.decode('utf-8','ignore'))
however that gave me the same error that is mentioned in a comment ("AttributeError: type object 'BeautifulSoup' has no attribute 'BeautifulSoup'")
removing the second '.BeautifulSoup' resulted in a different error ("RuntimeError: maximum recursion depth exceeded while calling a Python object").
I also tried the answer suggested here:
Chinese character encoding error with BeautifulSoup in Python?
by breaking up the creation of the object
import urllib2
from bs4 import BeautifulSoup

html = urllib2.urlopen("http://www.515fa.com/che_1978.html")
content = html.read().decode('utf-8', 'ignore')
soup = BeautifulSoup(content)
but that also generated the recursion error. Any other tips would be most appreciated.
thanks
decode using unicode-escape:
In [6]: from bs4 import BeautifulSoup
In [7]: h = """<h3>\u73af\u5883\u6c61\u67d3\u6700\u5c0f\u5316 \u8d44\u6e90\u5229\u7528\u6700\u5927\u5316</h3>, <h1>\u5929\u6d25\u6ee8\u6d77\u65b0\u533a\uff1a\u697c\u5728\u666f\u4e2d \u5382\u5728\u7eff\u4e2d</h1>, <h2></h2>"""
In [8]: soup = BeautifulSoup(h, 'lxml')
In [9]: print(soup.h3.text.decode("unicode-escape"))
环境污染最小化 资源利用最大化
If you look at the source you can see the data is utf-8 encoded:
<meta http-equiv="content-language" content="utf-8" />
For me, using bs4 4.4.1, just decoding what urllib returns also works fine:
In [1]: from bs4 import BeautifulSoup
In [2]: import urllib
In [3]: url = urllib.urlopen('http://paper.people.com.cn/rmrb/html/2016-05/06/nw.D110000renmrb_20160506_2-01.htm').read()
In [4]: soup = BeautifulSoup(url.decode("utf-8"), 'lxml')
In [5]: print(soup.h3.text)
环境污染最小化 资源利用最大化
When you are writing to a csv you will want to encode the data to a utf-8 str:
.decode("unicode-escape").encode("utf-8")
You can do the encode when you save the data in your dict.
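For example, a Python 2 sketch assuming the head1 list from the question's code:

# decode the literal \uXXXX escapes, then encode to a utf-8 str for writing out
holder = {}
holder["key"] = [h.text.decode("unicode-escape").encode("utf-8") for h in head1]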
This may provide a pretty simple solution; I'm not sure if it does absolutely everything you need, though. Let me know:
holder = {}
url = urllib.urlopen('http://paper.people.com.cn/rmrb/html/2016-05/06/nw.D110000renmrb_20160506_2-01.htm').read()
soup = BeautifulSoup(url, 'lxml')
head1 = soup.find_all(['h1','h2','h3'])
print unicode(head1)
holder["key"] = head1
Reference: Python 2.7 Unicode
I am trying to scrape this URL:
url = 'http://www.jmlr.org/proceedings/papers/v36/li14.pdf'
This is my code:
import requests
from bs4 import BeautifulSoup

html = requests.get(url)
htmlText = html.text
soup = BeautifulSoup(htmlText)
print soup #gives garbage
However, it gives weird symbols that I think are garbage. If it's an HTML file, it shouldn't be parsed as a PDF, should it?
I tried the following:
How to correctly parse UTF-8 encoded HTML to Unicode strings with BeautifulSoup?
import urllib2
from bs4 import BeautifulSoup

request = urllib2.Request(url)
request.add_header('Accept-Encoding', 'utf-8')  # tried with 'latin-1' too
response = urllib2.urlopen(request)
soup = BeautifulSoup(response.read().decode('utf-8', 'ignore'))
and this too :
Python and BeautifulSoup encoding issues
html = requests.get(url)
htmlText = html.text
soup = BeautifulSoup(htmlText)
print soup.prettify('utf-8')
Both gave me garbage, i.e. the HTML tags were not parsed correctly. The last link also suggested the encoding might be different despite the meta charset being 'utf-8', so I tried the above with 'latin-1' too, but nothing seems to work.
Any suggestions on how I can scrape the given link for data? Please don't suggest downloading the file and using pdfminer on it. Feel free to ask for more information!
That's because the URL points to a document in PDF format, so interpreting it as HTML won't make any sense at all.
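If you want to guard against this in code, here is a sketch (the Content-Type check is my own illustration, not part of the original answer):

import requests

url = 'http://www.jmlr.org/proceedings/papers/v36/li14.pdf'
r = requests.get(url)
if 'pdf' in r.headers.get('Content-Type', '').lower():
    # the server says this is a PDF, so save the raw bytes instead of parsing as HTML
    with open('li14.pdf', 'wb') as f:
        f.write(r.content)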
I use BeautifulSoup and urllib2 to download web pages, but different web pages use different encoding methods, such as utf-8, gb2312, or gbk. I used urllib2 to get sohu's home page, which is encoded with gbk, but in my code I also used this way to decode its web page:
self.html_doc = self.html_doc.decode('gb2312','ignore')
But how can I know the encoding method a page uses before I use BeautifulSoup to decode it to unicode? On most Chinese websites, there is no content-type field in the HTTP headers.
Using BeautifulSoup you can parse the HTML and access the original_encoding attribute:
import urllib2
from bs4 import BeautifulSoup
html = urllib2.urlopen('http://www.sohu.com').read()
soup = BeautifulSoup(html)
>>> soup.original_encoding
u'gbk'
And this agrees with the encoding declared in the <meta> tag in the HTML's <head>:
<meta http-equiv="content-type" content="text/html; charset=GBK" />
>>> soup.meta['content']
u'text/html; charset=GBK'
Now you can decode the HTML:
decoded_html = html.decode(soup.original_encoding)
but there's not much point, since the HTML is already available as unicode:
>>> soup.a['title']
u'\u641c\u72d0-\u4e2d\u56fd\u6700\u5927\u7684\u95e8\u6237\u7f51\u7ad9'
>>> print soup.a['title']
搜狐-中国最大的门户网站
>>> soup.a.text
u'\u641c\u72d0'
>>> print soup.a.text
搜狐
It is also possible to attempt to detect it using the chardet module (although it is a bit slow):
>>> import chardet
>>> chardet.detect(html)
{'confidence': 0.99, 'encoding': 'GB2312'}
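A short sketch combining detection and decoding (assuming html is the raw bytes fetched above):

detected = chardet.detect(html)
decoded_html = html.decode(detected['encoding'])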
Another solution.
from simplified_scrapy.request import req
from simplified_scrapy.simplified_doc import SimplifiedDoc
html = req.get('http://www.sohu.com') # This will automatically help you find the correct encoding
doc = SimplifiedDoc(html)
print(doc.title.text)
I know this is an old question, but I spent a while today puzzling over a particularly problematic website so I thought I'd share the solution that worked for me, which I got from here: http://shunchiubc.blogspot.com/2016/08/python-to-scrape-chinese-websites.html
Requests has a feature that will automatically get the actual encoding of the website, meaning you don't have to wrestle with encoding/decoding it (before I found this, I was getting all sorts of errors trying to encode/decode strings/bytes and never getting any output which was readable). This feature is called apparent_encoding. Here's how it worked for me:
from bs4 import BeautifulSoup
import requests
url = 'http://url_youre_using_here.html'
readOut = requests.get(url)
readOut.encoding = readOut.apparent_encoding #sets the encoding properly before you hand it off to BeautifulSoup
soup = BeautifulSoup(readOut.text, "lxml")
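From there the parsed text should come out readable, for example (assuming the page has a title tag):

print(soup.title.string)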