UnicodeDecodeError open python file with windows command prompt [duplicate] - python

I have a script called runsplit.py that looks like this:
import sys
sys.stdout = open('final.txt', 'w')
import re
with open('a.txt') as f:
    new_split = [item.strip() for item in f.readlines()]
for word in new_split:
    m = re.match(r"(?:\{[^-#={}/|]+\})?(?:([^-#={}/|]+)-)?([^-#={}/|]+)(?:/[^-#={}/|]+)?(?:[#=]([^-#={}/|]+))?", word)
    if m:
        print("\t".join([str(item).lstrip() for item in m.groups()]))
    else:
        print("(no match: %s)" % word)
I also have a text file called a.txt that I want to split into final.txt, but a.txt contains characters such as ⁱ and ǐ, which cause an error when I run the script from the command prompt. The error is:
File "runsplit_in_terminal.py", line 9, in <module>
print("\t".join([str(item).lstrip() for item in m.groups()]))
File "C:\Users\Sina\anaconda3\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\x94' in position 10: character maps to <undefined>
Is there any advice on how to solve this issue? Thanks.

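You may try opening the text files as UTF-8 by passing the parameter encoding='utf-8' to the open() function. Because the error here is raised while encoding (that is, while writing), the output file needs the parameter as well as the input file.
Example: f = open('hii.txt', encoding='utf-8')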
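A minimal sketch of that advice applied to both files, assuming a.txt really is UTF-8. The temporary directory and the sample contents are made up so the snippet runs on its own, and it writes with print(..., file=out) instead of reassigning sys.stdout, which is a design choice of this sketch rather than something the answer prescribes.

```python
import os
import tempfile

# Hypothetical working directory and sample input, so the sketch is self-contained.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "a.txt")
dst = os.path.join(workdir, "final.txt")
with open(src, "w", encoding="utf-8") as f:
    f.write("w\u01d0\nx\u2071\n")  # lines containing ǐ and ⁱ

# Pass encoding='utf-8' on the reading side AND the writing side.
with open(src, encoding="utf-8") as f, open(dst, "w", encoding="utf-8") as out:
    for line in f:
        print(line.strip(), file=out)

# Read the result back to check that the characters survived.
with open(dst, encoding="utf-8") as f:
    written = f.read()
```

With an explicit encoding on both open() calls, the result no longer depends on the code page of the Windows console or locale.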

Related

Continuing for loop after exception in Python

So first of all I saw similar questions, but nothing worked/wasn't applicable to my problem.
I'm writing a program that is taking in a Text file with a lot of search queries to be searched on Youtube. The program is iterating through the text file line by line. But these have special UTF-8 characters that cannot be decoded. So at a certain point the program stops with a
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 1826: character maps to <undefined>
As I cannot check every line of my entries, I want it to catch the error, print the line it was working on and continue at that point.
As the error is not happening inside my for loop, but rather in the for statement itself, I don't know how to write a try...except statement.
This is the code:
import urllib.request
import re
from unidecode import unidecode
with open('out.txt', 'r') as infh,\
        open("links.txt", "w") as outfh:
    for line in infh:
        try:
            clean = unidecode(line)
            search_keyword = clean
            html = urllib.request.urlopen("https://www.youtube.com/results?search_query=" + search_keyword)
            video_ids = re.findall(r"watch\?v=(\S{11})", html.read().decode())
            outfh.write("https://www.youtube.com/watch?v=" + video_ids[0] + "\n")
            #print("https://www.youtube.com/watch?v=" + video_ids[0])
        except:
            print("Error encountered with Line: " + line)
This is the full error message, to see that the for loop itself is causing the problem.
Traceback (most recent call last):
  File "ytbysearchtolinks.py", line 6, in <module>
    for line in infh:
  File "C:\Users\nfeyd\AppData\Local\Programs\Python\Python36\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 1826: character maps to <undefined>
If you need an example of input I'm working with: https://pastebin.com/LEkwdU06
The try-except-block looks correct and should allow you to catch all occurring exceptions.
The usage of unidecode probably won't help you, because non-ASCII characters must be percent-encoded in URLs.
One solution is to use the quote() function from urllib.parse. As per the documentation:
Replace special characters in string using the %xx escape.
This is what works for me with the input you've provided:
import urllib.request
from urllib.parse import quote
import re
with open('out.txt', 'r', encoding='utf-8') as infh,\
        open("links.txt", "w") as outfh:
    for line in infh:
        search_keyword = quote(line)
        html = urllib.request.urlopen("https://www.youtube.com/results?search_query=" + search_keyword)
        video_ids = re.findall(r"watch\?v=(\S{11})", html.read().decode())
        outfh.write("https://www.youtube.com/watch?v=" + video_ids[0] + "\n")
        print("https://www.youtube.com/watch?v=" + video_ids[0])
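To illustrate what quote() does to a search term, here is a small sketch. The sample string is made up. Stripping the line first is an addition of this sketch, not part of the code above: a line read from a file still ends with a newline, and quote() escapes that as well.

```python
from urllib.parse import quote

# Hypothetical search query as it would come out of "for line in infh".
line = "Beyonc\u00e9 halo\n"

# The trailing newline is escaped too (as %0A).
with_newline = quote(line)

# Stripping first keeps only the query text; é becomes its UTF-8 bytes %C3%A9.
stripped = quote(line.strip())

print(with_newline)
print(stripped)
```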
EDIT:
After thinking about it, I believe you are running into the following problem:
You are running the code on Windows, and apparently, Python will try to open the file with cp1252 encoding when on Windows, while the file that you shared is in UTF-8 encoding:
$ file out.txt
out.txt: UTF-8 Unicode text, with CRLF line terminators
This would explain the exception you are getting and why it's not being caught by your try-except-block: it is raised while the for statement reads and decodes the next line from the file, which happens outside the try block.
Make sure that you are using encoding='utf-8' when opening the file.
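If some lines might still be undecodable even as UTF-8, one way to get the behaviour asked for in the question (report the offending line and carry on) is to read the file in binary mode and decode each line inside the try block. This is only a sketch: the temporary file and its contents are invented for the demonstration, and the names good_lines and bad_lines are hypothetical.

```python
import os
import tempfile

# Hypothetical sample file: two decodable lines and one with an invalid UTF-8 byte.
path = os.path.join(tempfile.mkdtemp(), "out.txt")
with open(path, "wb") as fh:
    fh.write("caf\u00e9 music\n".encode("utf-8"))
    fh.write(b"broken \xff line\n")
    fh.write(b"plain line\n")

good_lines, bad_lines = [], []
with open(path, "rb") as infh:  # binary mode: iterating yields raw bytes, no decoding
    for raw in infh:
        try:
            line = raw.decode("utf-8")  # decoding now happens inside the try block
        except UnicodeDecodeError:
            bad_lines.append(raw)
            print("Could not decode line:", raw)
            continue
        good_lines.append(line.strip())
```

Passing errors='replace' to open() is a simpler alternative when it is acceptable to substitute a placeholder character for the undecodable bytes instead of skipping the line.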
I ran your code and didn't have any problems. Have you created a virtual environment with virtualenv and installed all the packages you use?

UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 2483: character maps to <undefined>

I am parsing a CSV file and I am getting the error below:
import os
import csv
from collections import defaultdict
demo_data = defaultdict(list)
if os.path.exists("infoed_daily _file.csv"):
    f = open("infoed_daily _file.csv", "rt")
    csv_reader = csv.DictReader(f)
    line_no = 0
    for line in csv_reader:
        line_no +=1
        print(line,line_no)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 2483: character maps to <undefined>
Please advise.
Thanks..
-Prasanna.K
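The error probably means your file is in a different encoding from the one open() uses by default. The default depends on the system: it is UTF-8 on most systems, but on Windows it is usually a legacy code page such as cp1252, in which byte 0x81 is undefined.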
When I run
b'\x81'.decode('Latin1')
b'\x81'.decode('Latin2')
b'\x81'.decode('iso8859')
b'\x81'.decode('iso8859-2')
then it runs without error, so your file may be in one of these encodings (or a similar one) and you have to specify it:
open(..., encoding='Latin1')
or similar.
List of other encodings: codecs: standard encodings
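The check above can be extended into a small loop that shows which codecs accept the byte and which reject it. The list of candidates is just an illustration, not an exhaustive one.

```python
# Try the problematic byte against a few candidate encodings.
candidates = ["utf-8", "cp1252", "latin1", "iso8859-2"]
results = {}
for name in candidates:
    try:
        b"\x81".decode(name)
        results[name] = True   # the codec has a mapping for byte 0x81
    except UnicodeDecodeError:
        results[name] = False  # the codec rejects byte 0x81

for name, ok in results.items():
    print(name, "decodes 0x81" if ok else "fails on 0x81")
```

Note that latin1 maps every possible byte to some character, so decoding without an error does not prove the file really is Latin-1; the text may still come out garbled if the true encoding is something else.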
f=open("myfile1.txt",'r')
print(f.read())
With the code above I got this error:
'charmap' codec can't decode byte 0x81 in position 637: character maps to <undefined>
So I tried changing the file extension and it worked.
Happy Coding
Thanks!
Vani
You can use
with open('filename.txt', 'r') as f:
    content = f.read()
The good thing is that it automatically closes the file after the work is done.

UnicodeEncodeError: 'charmap' codec can't encode character in spite of encoding to utf-8

I am converting my XML documents to plain text. There is a directory containing XML files and one python file to compile.
I have opened my XML files as:
with open(file, 'r', encoding = 'utf-8') as f:
then wrote in another file the contents of f:
for items in xmllist:
    fx.write(items)
but it gives me the error:
UnicodeEncodeError: 'charmap' codec can't encode character '\u2009' in position 25: character maps to <undefined>
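The question does not show how fx was opened. Assuming it was opened without an encoding argument on Windows, it would use a code page such as cp1252, which has no mapping for U+2009 (thin space). The sketch below reproduces that on any platform by naming cp1252 explicitly, then shows the same write succeeding with UTF-8. The file path and sample text are made up.

```python
import os
import tempfile

text = "thin\u2009space"  # contains U+2009, as in the error message
path = os.path.join(tempfile.mkdtemp(), "out.txt")

# Writing through a cp1252 file fails, like the 'charmap' error in the question.
try:
    with open(path, "w", encoding="cp1252") as fx:
        fx.write(text)
    legacy_ok = True
except UnicodeEncodeError as exc:
    legacy_ok = False
    print("cp1252 failed:", exc.reason)

# Opening the OUTPUT file as UTF-8 lets the same text through.
with open(path, "w", encoding="utf-8") as fx:
    fx.write(text)

with open(path, encoding="utf-8") as fx:
    round_trip = fx.read()
```

The encoding passed when reading the XML files only affects decoding; the file being written needs its own encoding argument.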

Removing Unicode \uxxxx in String from JSON Using Regex

I have a JSON file called stream_key.json that stores text data:
{"text":"RT #WBali: Ideas for easter? Digging in with Seminyak\u2019s best beachfront view? \nRSVP: b&f.wbali#whotels.com https:\/\/t.co\/fRoAanOkyC"}
As we can see, the text in the JSON file contains the Unicode escape \u2019. I want to remove it using a regex in Python 2.7. This is my code so far (eraseunicode.py):
import re
import json
def removeunicode(text):
    text = re.sub(r'\\[u]\S\S\S\S[s]', "", text)
    text = re.sub(r'\\[u]\S\S\S\S', "", text)
    return text
with open('stream_key.json', 'r') as f:
    for line in f:
        tweet = json.loads(line)
        text = tweet['text']
        text = removeunicode(text)
        print(text)
The result I get is:
Traceback (most recent call last):
File "eraseunicode.py", line 17, in <module>
print(text)
File "C:\Python27\lib\encodings\cp437.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2019' in position 53: character maps to <undefined>
Since I already use a function to remove the \u2019 before printing, I don't understand why there is still an error. Please help. Thanks.
In the text file, \u2019 is a literal six-character escape sequence. Once json.loads() has parsed the line, it becomes the single Unicode character U+2019, so the regex replacement no longer matches anything.
So you have to apply your regex before loading into json, and then it works:
tweet = json.loads(removeunicode(line))
Of course, this processes the entire raw line. You can also remove non-ASCII characters from the decoded text by checking the character code like this (note that it is not strictly equivalent):
text = "".join([x for x in tweet['text'] if ord(x)<128])
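A short sketch of the difference between the raw line and the parsed text, written for Python 3 (the question uses 2.7) and using a shortened, made-up version of the tweet. The regex here is a simplified pattern chosen for the illustration, not the one from the question.

```python
import json
import re

# Raw line as stored in the file: \u2019 is six ASCII characters here.
raw_line = r'{"text": "Seminyak\u2019s best beachfront view"}'
escape = re.compile(r'\\u[0-9a-fA-F]{4}')

found_in_raw = escape.search(raw_line) is not None

# After parsing, the escape has become the single character U+2019.
text = json.loads(raw_line)["text"]
found_in_parsed = escape.search(text) is not None

# Dropping every non-ASCII character from the parsed text.
ascii_only = "".join(ch for ch in text if ord(ch) < 128)
print(ascii_only)
```

The regex finds the escape in the raw line but not in the parsed text, which is why removeunicode() had no effect when it was called after json.loads().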

Python Selenium page cannot save source code encode error

I am trying to save source code with Selenium into a .txt file, but the .txt file stays empty.
When I tried to print the source code with these commands:
htmlcode = driver.page_source
(driver.page_source).encode('utf-8')
print(htmlcode)
it prints the source code, but then the script dies with this error:
File "C:\Python27\lib\encodings\cp850.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u20ac' in position 16329: character maps to <undefined>
Problem solved! After 3 hours searching :-)
html = driver.page_source
f = open('savepage.html', 'w')
f.write(html.encode('utf-8'))
f.close()
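The snippet above targets Python 2.7. On Python 3 a text-mode file does not accept bytes, so a rough equivalent is to leave the string unencoded and give the file its encoding instead. In this sketch the html string is a made-up stand-in for driver.page_source, and the file goes into a temporary directory so that it runs without Selenium.

```python
import os
import tempfile

# Hypothetical stand-in for driver.page_source (a text string containing U+20AC).
html = "<p>Price: \u20ac 10</p>"
path = os.path.join(tempfile.mkdtemp(), "savepage.html")

# Python 3: writing encoded bytes to a text-mode file raises TypeError.
try:
    with open(path, "w") as f:
        f.write(html.encode("utf-8"))
    bytes_rejected = False
except TypeError:
    bytes_rejected = True

# Instead, let the file object do the encoding.
with open(path, "w", encoding="utf-8") as f:
    f.write(html)

with open(path, encoding="utf-8") as f:
    saved = f.read()
```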
