Decoding html file downloaded with urllib - python

I tried to download an HTML file like this:
import urllib
req = urllib.urlopen("http://www.stream-urls.de/webradio")
html = req.read()
print html
html = html.decode('utf-16')
print html
Since the output of req.read() did not look like plain text, I tried to decode the response, but I get this error:
Traceback (most recent call last):
  File "e:\Documents\Python\main.py", line 8, in <module>
    html = html.decode('utf-16')
  File "E:\Software\Python2.7\lib\encodings\utf_16.py", line 16, in decode
    return codecs.utf_16_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode bytes in position 38-39: illegal UTF-16 surrogate
What do I have to do to get the right encoding?

Use requests and you will get correct, ungzipped HTML:
import requests
r = requests.get("http://www.stream-urls.de/webradio")
print r.text
EDIT: how to use gzip and StringIO to ungzip the data without saving it to a file:
import urllib
import gzip
import StringIO
req = urllib.urlopen("http://www.stream-urls.de/webradio")
# create file-like object in memory
buf = StringIO.StringIO(req.read())
# create gzip object using file-like object instead of real file on disk
f = gzip.GzipFile(fileobj=buf)
# get data from file
html = f.read()
print html
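As a side note (not part of the original answer): in Python 3, where urllib.urlopen and StringIO no longer exist, the same round trip can be sketched offline with gzip.decompress:

```python
import gzip
import io

# simulate a gzipped HTTP body (what the server sends with
# Content-Encoding: gzip)
body = gzip.compress(b"<html><body>radio list</body></html>")

# option 1: the file-like-object approach from the answer above
f = gzip.GzipFile(fileobj=io.BytesIO(body))
html_1 = f.read()

# option 2: one call, no in-memory file object needed
html_2 = gzip.decompress(body)

assert html_1 == html_2 == b"<html><body>radio list</body></html>"
```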

Related

How can I encode an HTML file after reading it with ZipFile?

I am reading a zip file from a URL. Inside the zip file, there is an HTML file. Reading the file works fine, but when I print the text I run into a Unicode problem. Python version: 3.8
import requests
from zipfile import ZipFile
from io import BytesIO
from bs4 import BeautifulSoup
from lxml import html

content = requests.get("www.url.com")
zf = ZipFile(BytesIO(content.content))
file_name = zf.namelist()[0]
file = zf.open(file_name)
soup = BeautifulSoup(file.read(), 'html.parser', from_encoding='utf-8', exclude_encodings='utf-8')
for product in soup.find_all('tr'):
    product = product.find_all('td')
    if len(product) < 2: continue
    print(product[1].text)
I already tried opening the file and printing the text with .decode('utf-8'); I got the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe8 in position 0: invalid continuation byte
I added from_encoding and exclude_encodings in BeautifulSoup, but nothing changed, and I didn't get an error.
Expected prints:
ÇEŞİTLİ MADDELER TOPLAMI
Tarçın
Fidanı
What I am getting:
ÇEÞÝTLÝ MADDELER TOPLAMI
Tarçýn
Fidaný
I looked at the file, and the encoding is not utf-8 but iso-8859-9.
Change the encoding and everything will be fine:
soup = BeautifulSoup(file.read(),'html.parser',from_encoding='iso-8859-9')
This will output: ÇEŞİTLİ MADDELER TOPLAMI
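The garbled output is classic mojibake: ISO-8859-9 (Latin-5, Turkish) bytes read with a Western European table. A quick sketch reproduces exactly the characters the question shows:

```python
# 'ı', 'Ş', 'İ' sit at byte positions that latin-1 assigns
# to 'ý', 'Þ', 'Ý' - hence the garbled output in the question
raw = 'Tarçın'.encode('iso-8859-9')

print(raw.decode('latin-1'))     # Tarçýn  (wrong table)
print(raw.decode('iso-8859-9'))  # Tarçın  (correct)
```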

Got stuck while reading files

What the code does
I am trying to read each file from the folder I was given, and extract some lines using the bs4 (BeautifulSoup) package in Python.
I get an error reading a file because some Unicode characters cannot be read.
The error:
Traceback (most recent call last):
  File "C:-----\check.py", line 25, in <module>
    soup = BeautifulSoup(text.read(), 'html.parser')
  File "C:\Python\Python37\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 3565: character maps to <undefined>
from bs4 import BeautifulSoup
from termcolor import colored
import re, os
import requests

path = "./brian-work/"
freddys_library = os.listdir(path)
files = []
count = 0

def getfiles():
    for r, d, f in os.walk(path):
        for file in f:
            if '.html' in file:
                files.append(os.path.join(r, file))
    return files

for book in getfiles():
    print("file is printed")
    print(book)
    text = open(book, "r")
    soup = BeautifulSoup(text.read(), 'html.parser')
    if soup.find('h1'):
        h1 = soup.select('h1')[0].text.strip()
        print(h1)
    else:
        print("no h1")
        continue
    filename1 = book.split("/")[-1]
    filename1 = filename1.split(".")[0]
    print(h1.split(' ', 1)[0])
    print(filename1)
    if h1.split(' ', 1)[0].lower() == filename1.split('-', 1)[0]:
        print('+++++++++++++++++++++++++++++++++++++++++++++')
        print('same\n')
    else:
        print('XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX')
        print('not')
        count = count + 1
Please help, what should I correct here?
Thanks
The problem is opening a file without knowing its encoding. The default encoding for text = open(book, "r") is, per the open() docs, the value returned by locale.getpreferredencoding(False), which is cp1252 on your system. The file is in some other encoding, so decoding fails.
Use text = open(book, "rb") (binary mode) and let BeautifulSoup figure it out. HTML files usually indicate their encoding.
You can also use text = open(book,encoding='utf8') or whatever the correct encoding is if you know it already.
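A minimal sketch of the read-bytes-first idea, assuming (as the traceback hints) that the files are actually UTF-8 while the locale default is cp1252. If you would rather not rely on BeautifulSoup's detection, you can try encodings in order; latin-1 last, since it maps every byte and never raises:

```python
import os
import tempfile

def read_html_text(path, encodings=('utf-8', 'cp1252', 'latin-1')):
    """Read a file as bytes and try a few encodings in order."""
    with open(path, 'rb') as fh:   # binary mode: no implicit decode
        data = fh.read()
    for enc in encodings:
        try:
            return data.decode(enc)
        except UnicodeDecodeError:
            continue
    return data.decode('latin-1', errors='replace')

# a UTF-8 file containing smart quotes: byte 0x9d appears inside the
# UTF-8 sequence for '\u201d', which cp1252 alone cannot decode
fd, tmp = tempfile.mkstemp(suffix='.html')
with os.fdopen(fd, 'wb') as fh:
    fh.write('<h1>“Title”</h1>'.encode('utf-8'))

print(read_html_text(tmp))   # <h1>“Title”</h1>
os.remove(tmp)
```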

How to serialize an image into str and deserialize it as an image?

I need to send a JPEG image over the network via JSON. I tried to convert the data into a str via base64, as below:
from PIL import Image
from tinydb import TinyDB, Query
import base64
import io
from pdb import set_trace as bp

# note: with 'encoding' in the name, it is always a bytes obj
in_jpg_encoding = None

# open some random image
with open('rkt2.jpg', 'rb') as f:
    # the file content is a JPEG-encoded bytes object
    in_jpg_encoding = f.read()

# output is a bytes object
in_b64_encoding = base64.b64encode(in_jpg_encoding)

# interpret the above bytes as str, because a JSON value needs to be a string
in_str = in_b64_encoding.decode(encoding='utf-8')
# in_str = str(in_b64_encoding)  # alternative way of writing the above statement

# simulates a transmission, e.g. sending the image data over the internet via JSON
out_str = in_str

# strip off the utf-8 interpretation to restore it as a base64 encoding
out_utf8_encoding = out_str.encode(encoding='utf-8')
# out_utf8_encoding = out_str.encode()  # same as the statement above

# strip off the base64 encoding to restore the original JPEG-encoded content
# note: output is still a bytes obj after b64 decoding
out_b64_decoding = base64.b64decode(out_utf8_encoding)
out_jpg_encoding = out_b64_decoding

# ---- verification stage
out_jpg_file = io.BytesIO(out_jpg_encoding)
out_jpg_image = Image.open(out_jpg_file)
out_jpg_image.show()
But I got an error at the deserialization stage, saying it cannot identify the image file:
Traceback (most recent call last):
  File "3_test_img.py", line 38, in <module>
    out_jpg_image = Image.open(out_jpg_file)
  File "/home/gaopeng/Envs/venv_celeb_parser/lib/python3.6/site-packages/PIL/Image.py", line 2687, in open
    % (filename if filename else fp))
OSError: cannot identify image file <_io.BytesIO object at 0x7f6f823c6b48>
Did I miss something?
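One detail worth checking (an observation, not a confirmed diagnosis): the commented-out str(in_b64_encoding) is not equivalent to .decode('utf-8'). Calling str() on a bytes object produces the literal "b'...'" repr wrapper, which corrupts the base64 payload. A minimal round-trip sketch, with no PIL dependency and made-up stand-in bytes:

```python
import base64

# stand-in for JPEG bytes (real code would read them from a file)
jpg_bytes = b'\xff\xd8\xff\xe0fake-jpeg-data'

b64_bytes = base64.b64encode(jpg_bytes)

# correct: interpret the base64 bytes as ASCII/UTF-8 text
good_str = b64_bytes.decode('utf-8')
# incorrect: str() adds the "b'...'" repr wrapper
bad_str = str(b64_bytes)

assert base64.b64decode(good_str.encode('utf-8')) == jpg_bytes
assert bad_str.startswith("b'")   # not valid base64 text
```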

Unable to decode unicode for Stack Exchange API

I was looking at this codegolf problem, and decided to try taking the python solution and use urllib instead. I modified some sample code for manipulating json with urllib:
import urllib.request
import json
res = urllib.request.urlopen('http://api.stackexchange.com/questions?sort=hot&site=codegolf')
res_body = res.read()
j = json.loads(res_body.decode("utf-8"))
This gives:
➜ codegolf python clickbait.py
Traceback (most recent call last):
  File "clickbait.py", line 7, in <module>
    j = json.loads(res_body.decode("utf-8"))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
If you go to: http://api.stackexchange.com/questions?sort=hot&site=codegolf and click under "Headers" it says charset=utf-8. Why is it giving me these weird results with urlopen?
res_body is gzipped. I'm not sure that uncompressing the response is something urllib takes care of by default.
You'll have your data if you uncompress the response from the API server.
import urllib.request
import zlib
import json

with urllib.request.urlopen(
        'http://api.stackexchange.com/questions?sort=hot&site=codegolf'
) as res:
    decompressed_data = zlib.decompress(res.read(), 16 + zlib.MAX_WBITS)

j = json.loads(decompressed_data.decode('utf-8'))
print(j)
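The 16 + zlib.MAX_WBITS argument tells zlib to expect a gzip header; gzip.decompress does the same thing with less ceremony. A small offline sketch of both, assuming a gzipped JSON body:

```python
import gzip
import json
import zlib

# simulate a gzipped API response body
body = gzip.compress(json.dumps({"items": []}).encode('utf-8'))

via_zlib = zlib.decompress(body, 16 + zlib.MAX_WBITS)
via_gzip = gzip.decompress(body)

assert via_zlib == via_gzip
assert json.loads(via_gzip.decode('utf-8')) == {"items": []}
```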

Parse text file from content-type=application/zip and base64 encoding in AWS SES

On amazon SES, I have a rule to save incoming emails to S3 buckets. Amazon saves these in MIME format.
These emails have a .txt attachment that appears in the MIME file as content-type=text/plain, Content-Disposition=attachment ... .txt, and Content-Transfer-Encoding=quoted-printable or base64.
I am able to parse it fine using Python.
I have a problem decoding the content of the .txt attachment when it is compressed (i.e., content-type: application/zip), as if the encoding weren't base64.
My code:
import base64
s = unicode(base64.b64decode(attachment_content), "utf-8")
throws the error:
Traceback (most recent call last):
  File "<input>", line 796, in <module>
UnicodeDecodeError: 'utf8' codec can't decode byte 0xcf in position 10: invalid continuation byte
Below are the first few lines of the "base64" string in attachment_content, which, by the way, has length 53683 plus "==" at the end, and I thought the length of a base64 string should be a multiple of 4 (??).
So maybe the decoding is failing because the compression changes attachment_content and I need some other operation before/after decoding it? I really have no idea.
UEsDBBQAAAAIAM9Ah0otgkpwx5oAADMTAgAJAAAAX2NoYXQudHh0tL3bjiRJkiX23sD+g0U3iOxu
REWGu8c1l2Ag8lKd0V2ZWajM3kLuC6Hubu5uFeZm3nYJL6+n4T4Ry8EOdwCSMyQXBRBLgMQ+7CP5
QPBj5gdYn0CRI6JqFxWv7hlyszursiJV1G6qonI5cmQyeT6dPp9cnCaT6Yvp5Yvz6xfJe7cp8P/k
1SbL8xfJu0OSvUvr2q3TOnFVWjxrknWZFeuk2VRlu978s19MRvNMrHneOv51SOZlGUtMLYnfp0nd
...
I have also tried used "latin-1", but get gibberish.
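The multiple-of-4 intuition is correct for standard base64; one possible explanation for the odd count (an assumption, not verified against the data) is that a naive len() on a MIME-wrapped body also counts the line-break characters. A quick check:

```python
import base64

# standard base64 output length is always a multiple of 4
for n in range(1, 10):
    encoded = base64.b64encode(b'x' * n)
    assert len(encoded) % 4 == 0

# newlines inside a MIME-wrapped body inflate len(); strip them first
wrapped = b'UEsDBBQA\nAAAIAM9A\n'
assert len(wrapped) % 4 != 0
assert len(wrapped.replace(b'\n', b'')) % 4 == 0
```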
The problem was that, after conversion, I was dealing with a file in zip format, starting with bytes like "PK\x03\x04\x3c\x0a\x0c ...", and I needed to unzip it before decoding it as UTF-8.
This code worked for me:
import email

# parse results from email
received_email = email.message_from_string(email_text)

for part in received_email.walk():
    c_type = part.get_content_type()
    c_enco = part.get('Content-Transfer-Encoding')
    attachment_content = part.get_payload()

    if c_enco == 'base64':
        import base64
        decoded_file = base64.b64decode(attachment_content)
        print("File decoded from base64")

        if c_type == "application/zip":
            from cStringIO import StringIO
            import zipfile
            zfp = zipfile.ZipFile(StringIO(decoded_file), "r")
            unzipped_list = zfp.open(zfp.namelist()[0]).readlines()
            decoded_file = "".join(unzipped_list)
            print('And un-zipped')

        result = unicode(decoded_file, "utf-8")
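The same unzip-then-decode step, sketched for Python 3, where cStringIO and unicode() are gone and io.BytesIO holds the binary data. The attachment name and content here are made up for illustration:

```python
import base64
import io
import zipfile

# build a fake base64-encoded zip attachment, as SES would deliver it
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as zf:
    zf.writestr('_chat.txt', 'hello from the attachment')
attachment_content = base64.b64encode(buf.getvalue()).decode('ascii')

# decode base64, then unzip in memory, then decode the text
decoded_file = base64.b64decode(attachment_content)
zfp = zipfile.ZipFile(io.BytesIO(decoded_file), 'r')
text = zfp.read(zfp.namelist()[0]).decode('utf-8')

assert text == 'hello from the attachment'
```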
