Unable to decode HTML page with urllib.request - python

I've written the following piece of code which fetches a URL and saves the HTML to a text file. However, I have two issues.
Most importantly, it does not save the € and £ in the HTML correctly. This is likely a decoding issue which I've tried to fix, but so far without success.
The code also does not replace the "\n" in the HTML with "". This isn't as important to me, but I am curious as to why it is not working.
Any ideas?
import urllib.request

while True:  # this is an infinite loop
    with urllib.request.urlopen('WEBSITE_URL') as f:
        fDecoded = f.read().decode('utf-8')
    data = str(fDecoded.read()).replace('\n', '')  # does not seem to work?
    myfile = open("TestFile.txt", "r+")
    myfile.write(data)
    print('----------------')

When you do this -
fDecoded = f.read().decode('utf-8')
fDecoded is already of type str: you are reading the byte string from the response and decoding it into a str using the utf-8 encoding.
Then after this you cannot call -
str(fDecoded.read()).replace('\n', '')
str has no method read(), and you do not actually need to convert it to str again. Just do -
data = fDecoded.replace('\n', '')
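Putting it together, a corrected sketch of the loop might look like this ('WEBSITE_URL' is the placeholder from the question; the "a" append mode is an assumption here, since "r+" would overwrite the file from the start on every iteration):
import urllib.request

while True:
    with urllib.request.urlopen('WEBSITE_URL') as f:
        fDecoded = f.read().decode('utf-8')   # bytes -> str
    data = fDecoded.replace('\n', '')         # str.replace returns a new str
    with open("TestFile.txt", "a") as myfile: # append instead of "r+"
        myfile.write(data)
    print('----------------')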


Fixing damaged text after the fact

Last month I made a scraper for this Latin dictionary. It finally finished executing (that website gave me response times of 6 to 8 seconds per page). Too bad I found out that a good chunk of my data is severely compromised...
eg. commandūcor ----> command\xc5\xabcor || commandūcāris ----> command\xc5\xabc\xc4\x81ris
I made the stupid mistake of using the str() function on the raw data I got from requests. Just like this:
import requests

r = requests.get("https://www.dizionario-latino.com/dizionario-latino-flessione.php?lemma=COMMANDUCOR100", verify=False)
out = str(r.content)
with open("test.html", 'w') as file:
    file.write(out)
I'd really appreciate it if anyone could help me restore the broken text.
Thank you in advance!
Just .decode them using utf-8 (the default). You can read more about character encodings in Python's Unicode howto.
b'command\xc5\xabcor'.decode() # 'commandūcor'
b'command\xc5\xabc\xc4\x81ris'.decode() # 'commandūcāris'
r.content returns bytes. (In contrast, r.text returns a str: the requests module guesses the correct encoding from the HTTP headers and decodes the bytes for you. In the future, that may be what you want to use instead; see the sketch after the recovery script below.)
If r.content contained bytes such as b'command\xc5\xabcor', then
str(r.content) returns a str which begins with the characters b' and ends with a literal '.
In [45]: str(b'command\xc5\xabcor')
Out[45]: "b'command\\xc5\\xabcor'"
You can use ast.literal_eval to recover the bytes:
In [46]: ast.literal_eval(str(b'command\xc5\xabcor'))
Out[46]: b'command\xc5\xabcor'
You could then decode these bytes to a str. The URL you posted declares the content is UTF-8 encoded:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
Assuming all the data you downloaded uses the same encoding, you could recover the content as a str by calling the bytes.decode('utf-8') method:
In [47]: ast.literal_eval(str(b'command\xc5\xabcor')).decode('utf-8')
Out[47]: 'commandūcor'
import ast
import requests

r = requests.get("https://www.dizionario-latino.com/dizionario-latino-flessione.php?lemma=COMMANDUCOR100", verify=False)
out = str(r.content)
with open("test.html", 'w') as file:
    file.write(out)

with open("test.html", 'r') as f_in, open("test-fixed.html", 'w') as f_out:
    broken_text = f_in.read()
    content = ast.literal_eval(broken_text)
    assert content == r.content
    text = content.decode('utf-8')
    f_out.write(text)
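Going forward, a minimal sketch that avoids the problem entirely, assuming the pages really are UTF-8 as the meta tag declares (r.text already hands you a decoded str, so no str() call on bytes is ever needed):
import requests

r = requests.get("https://www.dizionario-latino.com/dizionario-latino-flessione.php?lemma=COMMANDUCOR100", verify=False)
with open("test.html", 'w', encoding='utf-8') as file:
    file.write(r.text)  # r.text is already a str, decoded for you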

writing persian text into a text file in the way which could be read in python

I have developed a simple program which sends a request to a Persian web server and gets the source code of the main page. I then convert it to a string, use file.open(new_file, 'w') and write the string into it.
When I print the string in the Python IDLE I can see the right words in Persian, but the text file created in the directory contains strings like \xd9\x8a\xd8\xb9\n instead.
Here is the code:
import urllib.request as ul
import sys

url = 'http://www.uut.ac.ir/'
resp = ul.urlopen(url).read()
string = str(resp)
create_file(filename, string)  # this function creates a text file on the desktop
I also used:
file.open(new_file, 'w', encoding='utf-8')
string = resp.encode('utf-8')
But nothing changed. Any help would be appreciated.
So look at your code:
>>> resp = ul.urlopen(url).read()
>>> type(resp)
<class 'bytes'>
resp has the type bytes. Next, you used:
string = str(resp)
But you forgot to set the encoding. The right call is:
string = str(resp, encoding="utf-8")
Now you get the right string and can write it directly to your file.
Your second attempt is also wrong: resp is bytes, so you must use decode instead of encode.
string = resp.decode('utf-8')
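A minimal end-to-end sketch of the corrected flow (a plain open with a hypothetical output path is used here in place of the question's create_file helper):
import urllib.request as ul

url = 'http://www.uut.ac.ir/'
resp = ul.urlopen(url).read()        # bytes
string = resp.decode('utf-8')        # str with the Persian characters intact
with open('uut.txt', 'w', encoding='utf-8') as f:  # write the file as UTF-8 too
    f.write(string)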
Decode the web site content before writing it into the file:
import urllib.request as ul

url = 'http://www.uut.ac.ir/'
resp = ul.urlopen(url).read()
string = resp.decode()  # the extra str() wrapper was redundant
with open("a.txt", 'w', encoding='utf-8') as f:
    f.write(string)

Writing on text file, accents and special characters not displaying correctly

Here's what I'm doing: I'm web crawling a website, for my personal use, to copy the text of a book's chapters into text files, then transform them automatically into a PDF with another program to put in my cloud. Everything is fine until this happens: special characters are not copied correctly. For example, the apostrophe shows up as \xe2\x80\x99 in the text file and the dash shows up as \xe2\x80\x93. I used this (Python 3):
for text in soup.find_all('p'):
    texta = text.text
    f.write(str(str(texta).encode("utf-8")))
    f.write('\n')
Since I had a bug where those characters stopped my program while reading, I encoded everything to utf-8 and turned it back into a string with Python's str() method.
I will post the whole code if anyone has a better solution to my problem. Here's the part that crawls the website from page 1 to max_pages; you can modify it on line 21 to get more or fewer chapters of the book:
import requests
from bs4 import BeautifulSoup

def crawl_ATG(max_pages):
    page = 1
    while page <= max_pages:
        x = page
        url = 'http://www.wuxiaworld.com/atg-index/atg-chapter-' + str(x) + "/"
        source = requests.get(url)
        chapter = source.content
        soup = BeautifulSoup(chapter.decode('utf-8', 'ignore'), 'html.parser')
        f = open('atg_chapter' + str(x) + '.txt', 'w+')
        for text in soup.find_all('p'):
            texta = text.text
            f.write(str(str(texta).encode("utf-8")))
            f.write('\n')
        f.close()  # the original had f.close without parentheses, which never closes the file
        page += 1

crawl_ATG(10)
I will clean up the first useless lines that get copied once I have a solution to this problem. Thank you.
The easiest way to fix this problem that I found is adding encoding= "utf-8" in the open function:
with open('file.txt', 'w', encoding='utf-8') as file:
    file.write('ñoño')
For some reason, you (wrongly) have utf8-encoded data in a Python 3 string. The likely cause is the manual decoding step: source.content is raw bytes, and BeautifulSoup can detect the encoding itself, so you should not decode it yourself but pass it in directly:
url = 'http://www.wuxiaworld.com/atg-index/atg-chapter-' + str(x) + "/"
source = requests.get(url)
chapter = source.content
soup = BeautifulSoup(chapter, 'html.parser')
If that is not enough, that is, if you still see ’ and – (Unicode u'\u2019' and u'\u2013') displayed as \xe2\x80\x99 and \xe2\x80\x93, the HTML page may not be declaring its encoding correctly. In that case you can first encode the string to bytes with the latin1 encoding, and then decode it as utf8:
chapter = source.content.encode('latin1', 'ignore').decode('utf8', 'ignore')
soup = BeautifulSoup(chapter, 'html.parser')
Demonstration:
>>> t = u'\xe2\x80\x99 \xe2\x80\x93'
>>> t = t.encode('latin1').decode('utf8')
>>> t
u'\u2019 \u2013'
>>> print(t)
’ –
The only error I can spot is here:
f.write(str(str(texta).encode("utf-8")))
texta is already a str; encoding it gives bytes, and the outer str() call then produces the b'...' repr of those bytes, which is exactly the garbage that ends up in the file. Since the file is opened in text mode, drop both calls and write the text directly, opening the file with encoding="utf-8":
f.write(texta)
EDIT:
The error stems from the server not declaring the correct encoding for the page, so requests falls back to 'ISO-8859-1'. As noted in this bug report, that is a deliberate decision.
Luckily, chardet library correctly detects the 'utf-8' encoding, so you can do:
source.encoding = source.apparent_encoding
chapter = source.text
And there won't be any need to manually decode the text in chapter, since requests uses it to decode the content for you.
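A hedged sketch of the corrected loop body under that approach (the chapter URL pattern and output file name are adapted from the question):
import requests
from bs4 import BeautifulSoup

url = 'http://www.wuxiaworld.com/atg-index/atg-chapter-1/'
source = requests.get(url)
source.encoding = source.apparent_encoding  # use the charset chardet detected
soup = BeautifulSoup(source.text, 'html.parser')
with open('atg_chapter1.txt', 'w', encoding='utf-8') as f:
    for text in soup.find_all('p'):
        f.write(text.text + '\n')  # str in, UTF-8 bytes out via the file encoding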

Some issue with Unicode encoding

I am trying to open and parse a JSON file using a Python script and write its content into another JSON file after formatting it the way I want. My source JSON file contains the character \" which I want to replace with a blank. I don't have any issue parsing or creating the new file; the only issue is that the character is not getting replaced by a blank. How do I do it? I achieved the same task earlier, but back then there was no such character in the document.
Here is my code:
import json

doubleQuote = "\""
cnt = 0        # index into data; not initialised in the original
my_news = ""   # accumulator; not initialised in the original
try:
    destination = open("TodaysHtScrapedItemsOutput.json", "w")  # open JSON file for output
except IOError:
    pass
with open('TodaysHtScrapedItems.json') as f:  # load json file
    data = json.load(f)
print "file successfully loaded"
for dataobj in data:
    for news in data[cnt]["body"]:
        news = news.encode("utf-8")
        if news.find(doubleQuote) != -1:  # if doublequotes found in first body tag
            # print "found double quote"
            news.replace(doubleQuote, "")  # bug: return value discarded (see answer below)
        if news != "":
            my_news = my_news + " " + news
    destination.write("{\"body\":" + "\"" + my_news + "\"}" + "\n")
    my_news = ""
    cnt = cnt + 1
Some things to try:
You should write and read the json files as binaries, so "w" becomes "wb" and you need to add "rb".
You can define your search string as unicode, with:
doubleQuote = u'"'
You can look up the integer value of the character with this command:
ord(u'"')
I get 34 as a response. The reverse function is chr(34). Are the double quotes you are looking for the same double quotes the JSON contains? See here for details.
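For instance, typographic (curly) quotes are distinct characters from the straight ASCII quote, which is one way a replace can silently miss (a small illustrative check, not from the original post):
ord(u'"')       # 34   - straight double quote
ord(u'\u201c')  # 8220 - left curly quote “
ord(u'\u201d')  # 8221 - right curly quote ”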
You don't need the if check to see whether news contains the '"'; doing a replace on news is enough.
Try these steps and let me know if it still doesn't work.
str.replace doesn't change the original string, so you need to assign the result back to news:
if news.find(doubleQuote) != -1:  # if doublequotes found in first body tag
    # print "found double quote"
    news = news.replace(doubleQuote, "")
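A quick demonstration of why the assignment matters (strings are immutable, so replace returns a new string):
>>> s = 'a"b'
>>> s.replace('"', '')
'ab'
>>> s            # the original is unchanged
'a"b'
>>> s = s.replace('"', '')
>>> s
'ab'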

Some characters (trademark sign, etc) unable to write to a file but is printable on the screen

I've been trying to scrape data from a website and write the data I find out to a file. More than 90% of the time I don't run into Unicode errors, but when the data contains characters such as "Burger King®, Hans Café", writing to the file fails; my error handling then prints the line to the screen as is, without any further errors.
I've tried the encode and decode functions and the various encodings but to no avail.
Please find an excerpt of the current code that I've written below:
import urllib2, sys
import re
import os
import urllib
import string
import time
from BeautifulSoup import BeautifulSoup, NavigableString, SoupStrainer
from string import maketrans
import codecs

f = codecs.open('alldetails7.txt', mode='w', encoding='utf-8', errors='replace')
...
soup5 = BeautifulSoup(html5)
enc_s5 = soup5.originalEncoding
for company in iter(soup5.findAll(height="20px")):
    stream = ""
    count_detail = 1
    for tag in iter(company.findAll('td')):
        if count_detail > 1:
            stream = stream + tag.text.replace(u',', u';')
            if count_detail < 4:
                stream = stream + ","
        count_detail = count_detail + 1
    stream.strip()
    try:
        f.write(str(stnum) + "," + br_name_addr + "," + stream.decode(enc_s5) + os.linesep)
    except:
        print "Unicode error ->" + str(storenum) + "," + branch_name_address + "," + stream
Your f.write() line doesn't make sense to me: stream will be a unicode, since it's built indirectly from tag.text and BeautifulSoup gives you Unicode, so you shouldn't call decode on stream. (You use decode to turn a str with a particular character encoding into a unicode.) You've opened the file for writing with codecs.open() and told it to use UTF-8, so you can just write() a unicode and that should work. So, instead I would try:
f.write(unicode(stnum) + u"," + br_name_addr + u"," + stream + os.linesep)
... or, supposing that instead you had just opened the file with f=open('alldetails7.txt','w'), you would do:
line = unicode(stnum)+br_name_addr+u","+stream+os.linesep
f.write(line.encode('utf-8'))
Have you checked the encoding of the file you're writing to, and made sure the characters you need can be represented in that encoding? Try setting the file's character encoding explicitly to UTF-8 so the characters show up.
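For example, a quick round-trip check (a small sketch, assuming the alldetails7.txt file from the question was written as UTF-8):
import codecs

with codecs.open('alldetails7.txt', 'r', encoding='utf-8') as f:
    print f.read()[:200]  # the ® and é should display correctly if the encoding matches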
