How can I decode an HTML file after reading it with ZipFile? - python

I am reading a zip file from a URL. The archive contains an HTML file. Reading the file works fine, but when I print the text, the non-ASCII characters come out wrong. Python version: 3.8
import requests
from zipfile import ZipFile
from io import BytesIO
from bs4 import BeautifulSoup
from lxml import html

content = requests.get("www.url.com")
zf = ZipFile(BytesIO(content.content))
file_name = zf.namelist()[0]
file = zf.open(file_name)
soup = BeautifulSoup(file.read(),'html.parser',from_encoding='utf-8',exclude_encodings='utf-8')
for product in soup.find_all('tr'):
    product = product.find_all('td')
    if len(product) < 2: continue
    print(product[1].text)
I already tried reading the file and decoding the bytes with .decode('utf-8'), but I got the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe8 in position 0: invalid continuation byte
I added from_encoding and exclude_encodings to the BeautifulSoup call. That raised no error, but the output did not change.
Expected prints:
ÇEŞİTLİ MADDELER TOPLAMI
Tarçın
Fidanı
What I am getting:
ÇEÞÝTLÝ MADDELER TOPLAMI
Tarçýn
Fidaný

I looked at the file: its encoding is not UTF-8 but ISO-8859-9 (Latin-5, Turkish).
Pass the correct encoding and everything will be fine:
soup = BeautifulSoup(file.read(),'html.parser',from_encoding='iso-8859-9')
This will output: ÇEŞİTLİ MADDELER TOPLAMI
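The effect described above can be reproduced without the original URL. The following is a minimal sketch, assuming a stand-in archive built in memory (the member name page.html and the helper read_first_member are made up for illustration). It shows that the same bytes give the expected Turkish text when decoded as ISO-8859-9 and the garbled text from the question when decoded as ISO-8859-1:

```python
from io import BytesIO
from zipfile import ZipFile


def read_first_member(zip_bytes: bytes, encoding: str) -> str:
    """Read the first file in an in-memory zip archive and decode it."""
    with ZipFile(BytesIO(zip_bytes)) as zf:
        first_name = zf.namelist()[0]
        return zf.read(first_name).decode(encoding)


# Build a small stand-in archive (hypothetical content, not the real download).
buffer = BytesIO()
with ZipFile(buffer, "w") as zf:
    zf.writestr("page.html", "<td>ÇEŞİTLİ MADDELER TOPLAMI</td>".encode("iso-8859-9"))
zip_bytes = buffer.getvalue()

right = read_first_member(zip_bytes, "iso-8859-9")  # correct Turkish text
wrong = read_first_member(zip_bytes, "iso-8859-1")  # same bytes, wrong code page
print(right)
print(wrong)
```

Decoding the same bytes as UTF-8 raises UnicodeDecodeError, which is consistent with the error in the question.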

Related

Python - Pandas : how to save csv file from url

I'm trying to get a CSV file with requests and save it to my project:
import requests
import pandas as pd
import csv

def get_and_save_countries():
    url = 'https://www.trackcorona.live/api/countries'
    r = requests.get(url)
    data = r.json()
    data = data["data"]
    with open("corona/dash_apps/finished_apps/apicountries.csv","w",newline="") as f:
        title = "location,country_code,latitude,longitude,confirmed,dead,recovered,updated".split(",")
        cw = csv.DictWriter(f,title,delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
        cw.writeheader()
        cw.writerows(data)
That part works, but when I try this:
get_data.get_and_save_countries()
df = pd.read_csv("corona\\dash_apps\\finished_apps\\apicountries.csv")
I get this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 1: invalid continuation byte
And I have no idea why. Any help is welcome. Thanks.
Try:
with open("corona/dash_apps/finished_apps/apicountries.csv","w",newline="", encoding='utf-8') as f:
to explicitly specify the encoding with encoding='utf-8'.
When you write to a file, the default encoding is locale.getpreferredencoding(False). On Windows that is usually not UTF-8, and even on Linux the locale could be configured to something other than UTF-8. Pandas defaults to UTF-8 when reading, so pass encoding='utf8' as another parameter to open.
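A minimal sketch of the fix, using only the standard library so it runs without pandas. The sample row and the temporary path are made up for illustration and are not the data from the API. The point is that the file is written and read with the same explicit encoding, so the result does not depend on the locale:

```python
import csv
import os
import tempfile

# Hypothetical sample data containing a non-ASCII character.
rows = [{"location": "Réunion", "country_code": "re"}]
path = os.path.join(tempfile.mkdtemp(), "apicountries.csv")

# Write with an explicit encoding instead of the locale default.
with open(path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["location", "country_code"])
    writer.writeheader()
    writer.writerows(rows)

# Read back with the same encoding (pandas.read_csv also assumes UTF-8 by default).
with open(path, newline="", encoding="utf-8") as f:
    read_back = list(csv.DictReader(f))

print(read_back)
```

If the file had instead been written as cp1252, the byte for "é" would be 0xe9, which is not valid UTF-8 at that position. That is the kind of failure reported in the question.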

Got stuck while reading files

What the code does:
I am trying to read each file in a given folder and extract some lines from it using BeautifulSoup (bs4) in Python.
Reading a file fails with an error saying that some Unicode characters cannot be decoded.
Error:
Traceback (most recent call last):
  File "C:-----\check.py", line 25, in <module>
    soup=BeautifulSoup(text.read(), 'html.parser')
  File "C:\Python\Python37\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 3565: character maps to <undefined>
from bs4 import BeautifulSoup
from termcolor import colored
import re, os
import requests

path = "./brian-work/"
freddys_library = os.listdir(path)

def getfiles():
    files = []  # was never initialized in the original snippet
    for r, d, f in os.walk(path):
        for file in f:
            if '.html' in file:
                files.append(os.path.join(r, file))
    return files

count = 0  # was never initialized in the original snippet
for book in getfiles():
    print("file is printed")
    print(book)
    text = open(book, "r")
    soup=BeautifulSoup(text.read(), 'html.parser')
    h1 = soup.select('h1')[0].text.strip()
    print(h1)
    if soup.find('h1'):
        h1 = soup.select('h1')[0].text.strip()
    else:
        print ("no h1")
        continue
    filename1=book.split("/")[-1]
    filename1=filename1.split(".")[0]
    print(h1.split(' ', 1)[0])
    print(filename1)
    if h1.split(' ', 1)[0].lower() == filename1.split('-',1)[0] :
        print('+++++++++++++++++++++++++++++++++++++++++++++');
        print('same\n');
    else:
        print('XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX');
        print('not')
    count=count+1
Please help: what should I correct here?
Thanks
The problem is opening a file without knowing its encoding. The default encoding for text = open(book, "r"), per the open() documentation, is the value returned by locale.getpreferredencoding(False), which is cp1252 on your system. The file uses some other encoding, so decoding fails.
Use text = open(book, "rb") (binary mode) and let BeautifulSoup figure it out. HTML files usually indicate their encoding.
You can also use text = open(book,encoding='utf8') or whatever the correct encoding is if you know it already.
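The failure and the binary-mode workaround can be sketched with the standard library alone. This example assumes a made-up sample file whose content is UTF-8; the real files in the question may use a different encoding. The right curly quote encodes to bytes ending in 0x9d, which cp1252 leaves undefined, so reading the file as cp1252 fails, while reading raw bytes and decoding them explicitly works:

```python
import os
import tempfile

# Hypothetical sample file: UTF-8 HTML containing curly quotes.
path = os.path.join(tempfile.mkdtemp(), "sample.html")
with open(path, "wb") as f:
    f.write("<h1>“Quoted” title</h1>".encode("utf-8"))

# Text mode with the wrong encoding fails, as in the question.
try:
    with open(path, "r", encoding="cp1252") as f:
        f.read()
    cp1252_failed = False
except UnicodeDecodeError:
    cp1252_failed = True

# Binary mode never decodes, so reading always succeeds; decode afterwards.
with open(path, "rb") as f:
    raw = f.read()
text = raw.decode("utf-8")

print(cp1252_failed)
print(text)
```

The raw bytes can also be passed straight to BeautifulSoup, as the answer suggests, which then attempts to detect the encoding itself.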

Unable to decode unicode for Stack Exchange API

I was looking at this codegolf problem, and decided to try taking the python solution and use urllib instead. I modified some sample code for manipulating json with urllib:
import urllib.request
import json
res = urllib.request.urlopen('http://api.stackexchange.com/questions?sort=hot&site=codegolf')
res_body = res.read()
j = json.loads(res_body.decode("utf-8"))
This gives:
➜ codegolf python clickbait.py
Traceback (most recent call last):
  File "clickbait.py", line 7, in <module>
    j = json.loads(res_body.decode("utf-8"))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
If you go to: http://api.stackexchange.com/questions?sort=hot&site=codegolf and click under "Headers" it says charset=utf-8. Why is it giving me these weird results with urlopen?
res_body is gzipped: the byte 0x8b at position 1 is the second byte of the gzip magic number (0x1f 0x8b). urllib does not decompress the response for you.
Decompress the response from the API server and you'll have your data.
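import urllib.request
import zlib
import json

with urllib.request.urlopen(
    'http://api.stackexchange.com/questions?sort=hot&site=codegolf'
) as res:
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer
    decompressed_data = zlib.decompress(res.read(), 16+zlib.MAX_WBITS)
# json.loads no longer accepts an encoding argument (removed in Python 3.9), so decode explicitly
j = json.loads(decompressed_data.decode('utf-8'))
print(j)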
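The same idea can be tried offline. This is a minimal sketch, assuming a made-up compressed payload in place of the real API response; the helper decode_json_body is a hypothetical name, not part of urllib. It checks for the gzip magic number before decompressing, and shows that gzip.decompress and zlib.decompress with 16 + zlib.MAX_WBITS give the same bytes:

```python
import gzip
import json
import zlib

GZIP_MAGIC = b"\x1f\x8b"


def decode_json_body(body: bytes) -> object:
    """Decompress the body if it looks gzipped, then parse it as UTF-8 JSON."""
    if body[:2] == GZIP_MAGIC:
        body = gzip.decompress(body)
    return json.loads(body.decode("utf-8"))


# Stand-in for res.read(): a gzip-compressed JSON document (made-up content).
fake_body = gzip.compress(json.dumps({"items": [{"title": "café"}]}).encode("utf-8"))

# Decoding the compressed bytes directly fails, as in the question.
try:
    fake_body.decode("utf-8")
    raw_decode_failed = False
except UnicodeDecodeError:
    raw_decode_failed = True

parsed = decode_json_body(fake_body)
same_as_zlib = zlib.decompress(fake_body, 16 + zlib.MAX_WBITS) == gzip.decompress(fake_body)
print(raw_decode_failed, parsed, same_as_zlib)
```

Checking the magic number keeps the helper usable for uncompressed bodies too, which may matter if a server only compresses some responses.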

Decoding html file downloaded with urllib

I tried to download an HTML file like this:
import urllib
req = urllib.urlopen("http://www.stream-urls.de/webradio")
html = req.read()
print html
html = html.decode('utf-16')
print html
Since the output of req.read() looks like Unicode, I tried to decode the response, but I get this error:
Traceback (most recent call last):
  File "e:\Documents\Python\main.py", line 8, in <module>
    html = html.decode('utf-16')
  File "E:\Software\Python2.7\lib\encodings\utf_16.py", line 16, in decode
    return codecs.utf_16_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode bytes in position 38-39: illegal UTF-16 surrogate
What do I have to do to get the right encoding?
Use requests and you get correct, ungzipped HTML:
import requests
r = requests.get("http://www.stream-urls.de/webradio")
print r.text
EDIT: how to use gzip and StringIO to ungzip the data without saving it to a file (Python 2):
import urllib
import gzip
import StringIO
req = urllib.urlopen("http://www.stream-urls.de/webradio")
# create file-like object in memory
buf = StringIO.StringIO(req.read())
# create gzip object using file-like object instead of real file on disk
f = gzip.GzipFile(fileobj=buf)
# get data from file
html = f.read()
print html
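The snippets above are Python 2. A rough Python 3 equivalent of the in-memory approach might look like the sketch below, where io.BytesIO takes the place of StringIO.StringIO. The compressed sample stands in for req.read() and is made up for illustration, and gunzip_in_memory is a hypothetical helper name:

```python
import gzip
from io import BytesIO


def gunzip_in_memory(compressed: bytes) -> bytes:
    """Decompress gzip data through a file-like object instead of a file on disk."""
    buf = BytesIO(compressed)
    with gzip.GzipFile(fileobj=buf) as f:
        return f.read()


# Stand-in for the downloaded body (hypothetical content).
sample_html = "<html><body>Webradio</body></html>"
compressed = gzip.compress(sample_html.encode("utf-8"))

# Decompress first, then decode the resulting bytes to text.
html = gunzip_in_memory(compressed).decode("utf-8")
print(html)
```

The decode step assumes the page is UTF-8; if the server declares another charset, that one should be used instead.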

Some characters (trademark sign, etc) unable to write to a file but is printable on the screen

I've been trying to scrape data from a website and write the data I find to a file. More than 90% of the time I don't run into Unicode errors, but when the data contains characters such as those in "Burger King®, Hans Café", writing to the file fails. My error handling then prints the line to the screen, where it appears as is and without any further errors.
I've tried the encode and decode functions and the various encodings but to no avail.
Please find an excerpt of the current code that I've written below:
import urllib2,sys
import re
import os
import urllib
import string
import time
from BeautifulSoup import BeautifulSoup,NavigableString, SoupStrainer
from string import maketrans
import codecs
f=codecs.open('alldetails7.txt', mode='w', encoding='utf-8', errors='replace')
...
soup5 = BeautifulSoup(html5)
enc_s5 = soup5.originalEncoding
for company in iter(soup5.findAll(height="20px")):
    stream = ""
    count_detail = 1
    for tag in iter(company.findAll('td')):
        if count_detail > 1:
            stream = stream + tag.text.replace(u',',u';')
            if count_detail < 4 :
                stream=stream+","
        count_detail = count_detail + 1
    stream.strip()
    try:
        f.write(str(stnum)+","+br_name_addr+","+stream.decode(enc_s5)+os.linesep)
    except:
        print "Unicode error ->"+str(storenum)+","+branch_name_address+","+stream
Your f.write() line doesn't make sense to me: stream will be a unicode object, since it's built indirectly from tag.text and BeautifulSoup gives you Unicode, so you shouldn't call decode on stream. (You use decode to turn a str with a particular character encoding into a unicode.) You've opened the file for writing with codecs.open() and told it to use UTF-8, so you can just write() a unicode and that should work. So, instead I would try:
f.write(unicode(stnum)+br_name_addr+u","+stream+os.linesep)
... or, supposing that instead you had just opened the file with f=open('alldetails7.txt','w'), you would do:
line = unicode(stnum)+br_name_addr+u","+stream+os.linesep
f.write(line.encode('utf-8'))
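For comparison, here is a minimal Python 3 sketch of the same distinction; the code in the question is Python 2, so this is an illustration rather than a drop-in fix. The sample text comes from the question, while the temporary path is made up. In Python 3, text is str and encoding happens once, at the file boundary:

```python
import os
import tempfile

text = "Burger King®, Hans Café"

# ASCII cannot represent these characters, so strict encoding fails.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# errors="replace" avoids the exception but loses the characters.
replaced = text.encode("ascii", errors="replace")

# Opening the file with an explicit encoding lets write() accept the text directly.
path = os.path.join(tempfile.mkdtemp(), "alldetails.txt")
with open(path, "w", encoding="utf-8", newline="") as f:
    f.write(text + "\n")

with open(path, encoding="utf-8") as f:
    read_back = f.read().rstrip("\n")

print(ascii_ok, replaced, read_back)
```

Writing through a UTF-8 file object keeps the trademark sign and the accented letter intact, whereas the replace error handler substitutes question marks.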
Have you checked the encoding of the file you're writing to, and made sure the characters can be shown in the encoding you're trying to write to the file? Try setting the character encoding to UTF-8 or something else explicitly to have the characters show up.
