Handling Indian Languages in BeautifulSoup - python

I'm trying to scrape the NDTV website for news titles. This is the page I'm using as a HTML source. I'm using BeautifulSoup (bs4) to handle the HTML code, and I've got everything working, except my code breaks when I encounter the hindi titles in the page I linked to.
My code so far is :
import urllib2
from bs4 import BeautifulSoup
htmlUrl = "http://archives.ndtv.com/articles/2012-01.html"
FileName = "NDTV_2012_01.txt"
fptr = open(FileName, "w")
fptr.seek(0)
page = urllib2.urlopen(htmlUrl)
soup = BeautifulSoup(page, from_encoding="UTF-8")
li = soup.findAll( 'li')
for link_tag in li:
hypref = link_tag.find('a').contents[0]
strhyp = str(hypref)
fptr.write(strhyp)
fptr.write("\n")
The error I get is :
Traceback (most recent call last):
File "./ScrapeTemplate.py", line 30, in <module>
strhyp = str(hypref)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-5: ordinal not in range(128)
I got the same error even when I didn't include the from_encoding parameter. I initially used it as fromEncoding, but python warned me that it was deprecated usage.
How do I fix this? From what I've read I need to either avoid the hindi titles or explicitly encode it into non-ascii text, but I don't know how to do that. Any help would be greatly appreciated!

What you see is a NavigableString instance (which is derived from the Python unicode type):
(Pdb) hypref.encode('utf-8')
'NDTV'
(Pdb) hypref.__class__
<class 'bs4.element.NavigableString'>
(Pdb) hypref.__class__.__bases__
(<type 'unicode'>, <class 'bs4.element.PageElement'>)
You need to convert to utf-8 using
hypref.encode('utf-8')

strhyp = hypref.encode('utf-8')
http://joelonsoftware.com/articles/Unicode.html

Related

codec can't encode character python3

I want to scrape the name and price from this website:
https://www.flipkart.com/laptops/~buyback-guarantee-on-laptops-/pr?sid=6bo%2Cb5g&uniqBStoreParam1=val1&wid=11.productCard.PMU_V2
Both name and price are within div tags.
Name:
Price
Printing name works fine, but printing Price gives me an error:
Traceback (most recent call last):
File "c:\File.py", line 37, in <module>
print(price.text)
File "C:\Python37\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u20b9' in position 0: character maps to <undefined>
Code:
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd
import requests
response = requests.get("https://www.flipkart.com/laptops/~buyback-guarantee-on-laptops-/pr?sid=6bo%2Cb5g&uniq")
soup = BeautifulSoup(response.text, 'html.parser')
for a in soup.findAll('a',href=True, attrs={'class':'_31qSD5'}):
name=a.find('div', attrs={'class':'_3wU53n'})
price=a.find('div', attrs={'class':'_1vC4OE _2rQ-NK'})
print(name.text)
What is the difference between those?
So why one of them give me an error and the other one is not?
It is yielding that error because python is having trouble with that currency sign. The Indian rupee sign is interpreted differently depending on the language and is not in the python charmap by default. If we change your last print statement to print(str(price.text.encode("utf-8"))) we will get results that look like this:
b'\xe2\x82\xb961,990'
b'\xe2\x82\xb940,000'
b'\xe2\x82\xb963,854'
b'\xe2\x82\xb934,990'
b'\xe2\x82\xb948,990'
b'\xe2\x82\xb952,990'
b'\xe2\x82\xb932,990'
b'\xe2\x82\xb954,990'
b'\xe2\x82\xb952,990'
Since this output isn't very pretty and probably isn't usable, I would personally truncate that symbol before printing. If you really want python to print the Indian rupee symbol, you can add it to your charmap. Follow this steps from this post to add customizations to the charmap.

Finding Mac address for Python3 [duplicate]

I am trying to learn how to automatically fetch urls from a page. In the following code I am trying to get the title of the webpage:
import urllib.request
import re
url = "http://www.google.com"
regex = r'<title>(,+?)</title>'
pattern = re.compile(regex)
with urllib.request.urlopen(url) as response:
html = response.read()
title = re.findall(pattern, html)
print(title)
And I get this unexpected error:
Traceback (most recent call last):
File "path\to\file\Crawler.py", line 11, in <module>
title = re.findall(pattern, html)
File "C:\Python33\lib\re.py", line 201, in findall
return _compile(pattern, flags).findall(string)
TypeError: can't use a string pattern on a bytes-like object
What am I doing wrong?
You want to convert html (a byte-like object) into a string using .decode, e.g. html = response.read().decode('utf-8').
See Convert bytes to a Python String
The problem is that your regex is a string, but html is bytes:
>>> type(html)
<class 'bytes'>
Since python doesn't know how those bytes are encoded, it throws an exception when you try to use a string regex on them.
You can either decode the bytes to a string:
html = html.decode('ISO-8859-1') # encoding may vary!
title = re.findall(pattern, html) # no more error
Or use a bytes regex:
regex = rb'<title>(,+?)</title>'
# ^
In this particular context, you can get the encoding from the response headers:
with urllib.request.urlopen(url) as response:
encoding = response.info().get_param('charset', 'utf8')
html = response.read().decode(encoding)
See the urlopen documentation for more details.
Based upon last one, this was smimple to do when pdf read was done .
text = text.decode('ISO-8859-1')
Thanks #Aran-fey

UnicodeDecodeError error with every urllib.request

When I use urllib in Python3 to get the HTML code of a web page, I use this code:
def getHTML(url):
request = Request(url)
request.add_header('User-Agent', 'Mozilla/5.0')
html = urlopen(request).read().decode('utf-8')
print(html)
return html
However, this fails every time with the error:
Traceback (most recent call last):
File "/Users/chris/Documents/Code/Python/HLTV Parser/getTeams.py", line 56, in <module>
getHTML('https://www.hltv.org/team/7900/spirit-academy')
File "/Users/chris/Documents/Code/Python/HLTV Parser/getTeams.py", line 53, in getHTML
print(html)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 10636-10638: ordinal not in range(128)
[Finished in 1.14s]
The page is in UTF-8 and I am decoding it properly according to the urllib docs. The page is not gzipped or in another charset from what I can tell.
url.info().get_charset() returns None for the page, however the meta tags specify UTF-8. I have no problems viewing the HTML in any program.
I do not want to use any external libraries.
Is there a solution? What is going on? This works fine with the following Python2 code:
def getHTML(url):
opener = urllib2.build_opener()
opener.addheaders = [('User-Agent', 'Mozilla/5.0')]
response = opener.open(url)
html = response.read()
return html
You don't need to decode('utf-8')
The following should return the fetched html.
def getHTML(url):
request = Request(url)
request.add_header('User-Agent', 'Mozilla/5.0')
html = urlopen(request).read()
return html
There, found your error, the parsing was done just fine, everything was evaluated alright. But when you read the Traceback carefully:
Traceback (most recent call last): File
"/Users/chris/Documents/Code/Python/HLTV Parser/getTeams.py", line 56, in <module>
getHTML('hltv.org/team/7900/spirit-academy') File
"/Users/chris/Documents/Code/Python/HLTV Parser/getTeams.py", line 53, in getHTML
print(html)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 10636-10638: ordinal not in range(128)
[Finished in 1.14s]
The error was caused by the print statement, as you can see, this is in the traceback print(html).
This is somewhat common exception, it's just telling you that with your current system encoding, some of the text cannot be printed to the console. One simple solution will be to add print(html.encode('ascii', 'ignore')) to ignore all the unprintable characters. You still can do all the other stuff with html, it's just that you can't print it.
See this if you want a better "fix": https://wiki.python.org/moin/PrintFails
btw: re module can search byte strings. Copy this exactly as-is, will work:
import re
print(re.findall(b'hello', b'hello world'))

Write data scraped to text file with python script

I am newbie to data scraping. This is my first program i am writting in python to scrape data and store it into the text file. I have written following code to scrape the data.
from bs4 import BeautifulSoup
import urllib2
text_file = open("scrape.txt","w")
url = urllib2.urlopen("http://ga.healthinspections.us/georgia/search.cfm?1=1&f=s&r=name&s=&inspectionType=&sd=04/24/2016&ed=05/24/2016&useDate=NO&county=Appling&")
content = url.read()
soup = BeautifulSoup(content, "html.parser")
type = soup.find('span',attrs={"style":"display:inline-block; font- size:10pt;"}).findAll()
for found in type:
text_file.write(found)
However i run this program using command prompt it shows me following error.
c:\PyProj\Scraping>python sample1.py
Traceback (most recent call last):
File "sample1.py", line 9, in <module>
text_file.write(found)
TypeError: expected a string or other character buffer object
What am i missing here, or is there anything i haven't added to. Thanks.
You need to check if type is None, ie soup.find did not actually find what you searched.
Also, don't use the name type, it's a builtin.
find, much like find_all return one/a list of Tag object(s). If you call print on a Tag you see a string representation. This automatism isn;t invoked on file.write. You have to decide what attribute of found you want to write.

Python Beautifulsoup : file.write(str) method get TypeError: write() argument must be str, not BeautifulSoup

I wrote below codes:
from bs4 import BeautifulSoup
import sys # where is the sys module in the source code folder ?
try:
import urllib.request as urllib2
except ImportError:
import urllib2
print (sys.argv) #
print(type(sys.argv)) #
#baseUrl = "https://ecurep.mainz.de.xxx.com/ae5/"
baseUrl = "http://www.bing.com"
baseUrl = "http://www.sohu.com/"
print(baseUrl)
url = baseUrl
page = urllib2.urlopen(url) #urlopen is a function, function is also an object
soup = BeautifulSoup(page.read(), "html.parser") #NameError: name 'BeautifulSoup' is not defined
html_file = open("Output.html", "w")
soup_string = str(soup)
print(type(soup_string))
html_file.write(soup_string) # TypeError: write() argument must be str, not BeautifulSoup
html_file.close()
which compiler give below errors:
C:\hzg>py Py_logDownload2.py 1
['Py_logDownload2.py', '1']
<class 'list'>
http://www.sohu.com/
<class 'str'>
Traceback (most recent call last):
File "Py_logDownload2.py", line 25, in <module>
html_file.write(soup_string) # TypeError: write() argument must be str, not
BeautifulSoup
File "C:\Users\ADMIN\AppData\Local\Programs\Python\Python35\lib\encodings\
cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 376-377:
character maps to <undefined>
but soup_string is obviously a str, so why the first error given by compiler ?
also don't know why the second one appear.
what is more confusion, if I change code to:
baseUrl = "https://ecurep.mainz.de.xxx.com/ae5/"
#baseUrl = "http://www.bing.com"
#baseUrl = "http://www.sohu.com/"
it will compile and have no error. (I have change the original name to "xxx").
Can anyone help to debug this?
Update :
I used to write Java codes and am a newbie for Python. With the help of you, I think I have made some progress with debugging Python: when debugging Java, always try to solve the first error; while in Python, ayways try to solve the last one.
Try to open your file with utf-8
import codecs
f = codecs.open("test", "w", "utf-8")
You can ignore the coding error (not recommended) by
f = codecs.open("test", "w", "utf-8", errors='ignore')

Categories

Resources