Python Selenium page cannot save source code encode error - python

I am trying to save a page's source code to a .txt file with Selenium, but the .txt file stays empty.
When I tried to print the source code with these commands:
htmlcode = driver.page_source
driver.page_source.encode('utf-8')
print(htmlcode)
the source code is printed, but then the script dies with this error:
File "C:\Python27\lib\encodings\cp850.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u20ac' in position 16329: character maps to <undefined>

Problem solved, after three hours of searching :-)
html = driver.page_source
# Binary mode, because the text is encoded to UTF-8 bytes before writing.
with open('savepage.html', 'wb') as f:
    f.write(html.encode('utf-8'))
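For anyone who wants the file object to do the encoding instead, here is a minimal sketch using io.open, which accepts an encoding argument on both Python 2 and Python 3. The helper name save_page_source and the sample HTML string are made up for illustration; in a real script the text would come from driver.page_source.

```python
import io
import os
import tempfile


def save_page_source(html, path):
    """Write page source text to `path` as UTF-8 (hypothetical helper)."""
    # io.open accepts `encoding` on both Python 2 and Python 3, so the
    # text is encoded on write instead of going through the console codec.
    with io.open(path, 'w', encoding='utf-8') as f:
        f.write(html)


# Stand-in for driver.page_source; in a real script this comes from Selenium.
sample_html = u'<html><body>Price: \u20ac10</body></html>'
out_path = os.path.join(tempfile.mkdtemp(), 'savepage.html')
save_page_source(sample_html, out_path)

# Read the file back to confirm the euro sign survived the round trip.
with io.open(out_path, encoding='utf-8') as f:
    round_trip = f.read()
```

The point is that the euro sign (u'\u20ac') never has to pass through the cp850 console codec, which is where the original traceback came from.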

Related

How can I encode html file after read file with ZipFile?

I am reading a zip file from a URL. Inside the zip file, there is an HTML file. After I read the file everything works fine. But when I print the text I am facing a Unicode problem. Python version: 3.8
import requests
from zipfile import ZipFile
from io import BytesIO
from bs4 import BeautifulSoup
from lxml import html
content = requests.get("www.url.com")
zf = ZipFile(BytesIO(content.content))
file_name = zf.namelist()[0]
file = zf.open(file_name)
soup = BeautifulSoup(file.read(), 'html.parser', from_encoding='utf-8', exclude_encodings='utf-8')
for product in soup.find_all('tr'):
    product = product.find_all('td')
    if len(product) < 2: continue
    print(product[1].text)
I already tried opening the file and printing the text with .decode('utf-8'), and got the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe8 in position 0: invalid continuation byte
I added from_encoding and exclude_encodings to the BeautifulSoup call, but nothing changed and no error was raised.
Expected prints:
ÇEŞİTLİ MADDELER TOPLAMI
Tarçın
Fidanı
What I am getting:
ÇEÞÝTLÝ MADDELER TOPLAMI
Tarçýn
Fidaný
I looked at the file: its encoding is not utf-8 but iso-8859-9.
Change the encoding and everything will be fine:
soup = BeautifulSoup(file.read(), 'html.parser', from_encoding='iso-8859-9')
This will output: ÇEŞİTLİ MADDELER TOPLAMI
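To see why the output looked the way it did, here is a small sketch that uses only the standard codecs, with no zip file or BeautifulSoup involved. It assumes the wrong decoding was a Western European codec such as ISO-8859-1, which is consistent with the garbled output in the question but is not confirmed by it.

```python
# Bytes as they would appear in an ISO-8859-9 (Latin-5, Turkish) file.
raw = 'ÇEŞİTLİ MADDELER TOPLAMI'.encode('iso-8859-9')

# Decoding with the right codec recovers the original text.
correct = raw.decode('iso-8859-9')

# Decoding the same bytes as ISO-8859-1 raises no error but yields
# mojibake, because the Turkish letters map to different characters there.
mojibake = raw.decode('iso-8859-1')

# Decoding as UTF-8 fails outright: these bytes are not valid UTF-8.
try:
    raw.decode('utf-8')
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True
```

Here mojibake comes out as ÇEÞÝTLÝ MADDELER TOPLAMI, the same garbling shown in the question, while correct matches the expected text.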

UnicodeDecodeError open python file with windows command prompt [duplicate]

I have a script called runsplit.py that looks like this:
import sys
sys.stdout = open('final.txt', 'w')
import re
with open('a.txt') as f:
    new_split = [item.strip() for item in f.readlines()]
for word in new_split:
    m = re.match(r"(?:\{[^-#={}/|]+\})?(?:([^-#={}/|]+)-)?([^-#={}/|]+)(?:/[^-#={}/|]+)?(?:[#=]([^-#={}/|]+))?", word)
    if m:
        print("\t".join([str(item).lstrip() for item in m.groups()]))
    else:
        print("(no match: %s)" % word)
I also have a text file called a.txt that I want to split into final.txt, but a.txt contains characters such as ⁱ and ǐ, which cause an error when I run the script from the command prompt:
File "runsplit_in_terminal.py", line 9, in <module>
print("\t".join([str(item).lstrip() for item in m.groups()]))
File "C:\Users\Sina\anaconda3\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\x94' in position 10: character maps to <undefined>
Is there any advice on how to solve this issue? Thanks.
You may try opening the text file as Unicode text by passing the parameter encoding='utf-8' to the open() function.
Example: f = open('hii.txt', encoding='utf-8')
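Note that the traceback is a UnicodeEncodeError raised from print, so the output file matters as much as the input. A minimal sketch of the idea follows, assuming both files should be UTF-8. The temporary directory and the sample lines are made up so the snippet runs on its own; the regex step from the question is left out.

```python
import os
import tempfile

workdir = tempfile.mkdtemp()
in_path = os.path.join(workdir, 'a.txt')      # stand-in for the question's a.txt
out_path = os.path.join(workdir, 'final.txt')

# Create a small input file containing the characters from the question.
with open(in_path, 'w', encoding='utf-8') as f:
    f.write('wordⁱ\nwordǐ\n')

# Read and write with an explicit encoding so neither side falls back
# to the platform default (cp1252 in the traceback above).
with open(in_path, encoding='utf-8') as src, \
        open(out_path, 'w', encoding='utf-8') as dst:
    for line in src:
        # print(..., file=dst) avoids reassigning sys.stdout globally.
        print(line.strip(), file=dst)

# Read the result back to check that both characters were written.
with open(out_path, encoding='utf-8') as f:
    result = f.read()
```

If you prefer to keep the sys.stdout redirection from the original script, passing encoding='utf-8' to that open('final.txt', 'w') call should have the same effect.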

Docx (xml) file parsing error on Python 'charmap' codec can't decode byte 0x98 in position 7618: character maps to <undefined>

I'm trying to parse a docx file. I unzipped it first, then tried to read the Document.xml file using with open(..), and it raises the error "'charmap' codec can't decode byte 0x98 in position 7618: character maps to <undefined>". The XML is UTF-8 encoded:
Error:
I wrote the following code:
with open(self.tempDir + self.CONFIG['main_xml']) as xml_file:
    self.dom_xml = etree.parse(xml_file)
I tried to force the encoding to UTF-8, but then I can't read it correctly with etree.fromstring(..).
Symbol 7618 (from the error) is:
Please help me. How do I read the XML file correctly?
Thanks
This works without errors on your file:
import zipfile
import xml.etree.ElementTree as ET
zipfile.ZipFile('file.docx').extractall()
root = ET.parse('word/document.xml').getroot()
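A variation on the same idea, as a sketch: read the archive member as bytes and hand those bytes to the parser, so the platform's default text codec (the 'charmap' one in the error) is never used. The in-memory archive and its contents below are made up to keep the example self-contained; with a real file you would pass the .docx path to zipfile.ZipFile.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Build a tiny in-memory archive so the example is self-contained; with a
# real document, pass the .docx path to zipfile.ZipFile instead.
xml_bytes = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<doc><p>Привет</p></doc>'
).encode('utf-8')
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as zf:
    zf.writestr('word/document.xml', xml_bytes)

# Read the member as bytes and let the XML parser handle decoding,
# based on the encoding declared in the XML itself.
with zipfile.ZipFile(buffer) as zf:
    root = ET.fromstring(zf.read('word/document.xml'))

text = root.find('p').text
```

This also skips extracting files to disk, which may or may not matter for your use case.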

Writing to a file but getting ASCII encoding error

I am trying to scrape data from a table on a website.
page_soup = soup(html, 'html.parser')
stat_table = page_soup.find_all('table')
stat_table = stat_table[0]
with open('stats.txt', 'w') as q:
    for row in stat_table.find_all('tr'):
        for cell in row.find_all('td'):
            q.write(cell.text)
However, when I try to write the file, I get this error message: 'ascii' codec can't encode character '\xa0' in position 19: ordinal not in range(128).
I understand that it should be encoded with .encode('utf-8'), but
cell.text.encode('utf-8')
doesn't work.
Any help would be greatly appreciated. Using Python 3.6
The file encoding is determined from the current environment, which in this case is assumed to be ASCII. You can specify the file encoding directly:
with open('stats.txt', 'w', encoding='utf8') as q:
    pass
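Here is a minimal sketch of the full write loop with that change. The list of cell texts is a made-up stand-in for the cell.text values from the table, and the temporary path is only there so the snippet runs anywhere. On Python 3 a text-mode file expects str, so the strings are written unchanged and open() does the encoding.

```python
import os
import tempfile

# Hypothetical cell texts standing in for cell.text values from the table;
# '\xa0' is the non-breaking space from the error message.
cells = ['Goals\xa012', 'Assists\xa07']

out_path = os.path.join(tempfile.mkdtemp(), 'stats.txt')

# A text-mode file encodes str values itself, so pass the encoding to open()
# and write the strings unchanged (no .encode() call on each cell).
with open(out_path, 'w', encoding='utf-8') as q:
    for cell_text in cells:
        q.write(cell_text)

# Read the file back to confirm the non-breaking spaces were written.
with open(out_path, encoding='utf-8') as q:
    written = q.read()
```

Calling q.write(cell.text.encode('utf-8')) on a text-mode file would raise a TypeError instead, because write() would then receive bytes rather than str.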

UnicodeEncodeError: 'charmap' codec can't encode character inspite of encoding to utf-8

I am converting my XML documents to plain text. There is a directory containing the XML files and one Python file to run.
I opened my XML files with:
with open(file, 'r', encoding = 'utf-8') as f:
then wrote the contents of f to another file:
for items in xmllist:
    fx.write(items)
but it gives me this error:
UnicodeEncodeError: 'charmap' codec can't encode character '\u2009' in position 25: character maps to <undefined>
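The question does not show how fx was opened. One possible cause, sketched below, is that only the input file was given encoding='utf-8' while the output file fell back to a legacy default codec. The snippet assumes that codec is cp1252 (the question does not say which one it is) and uses a made-up string and temporary path.

```python
import os
import tempfile

text = 'thin\u2009space'  # U+2009 THIN SPACE, the character from the error
out_path = os.path.join(tempfile.mkdtemp(), 'out.txt')

# Reproduce the failure deterministically by forcing the legacy codec
# (cp1252 is an assumption; the question does not name the codec).
try:
    with open(out_path, 'w', encoding='cp1252') as fx:
        fx.write(text)
    cp1252_failed = False
except UnicodeEncodeError:
    cp1252_failed = True

# The same write succeeds once the output file is opened as UTF-8.
with open(out_path, 'w', encoding='utf-8') as fx:
    fx.write(text)

with open(out_path, encoding='utf-8') as fx:
    round_trip = fx.read()
```

If that matches your situation, opening fx with encoding='utf-8' as well should be enough.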
