Editing UTF-8 text file on Windows - python

I'm trying to manipulate a text file with song names. I want to clean up the data, by changing all the spaces and tabs into +.
This is the code:
input = open('music.txt', 'r')
out = open("out.txt", "w")
for line in input:
new_line = line.replace(" ", "+")
new_line2 = new_line.replace("\t", "+")
out.write(new_line2)
#print(new_line2)
fh.close()
out.close()
It gives me an error:
Traceback (most recent call last):
File "music.py", line 3, in <module>
for line in input:
File "C:\Users\nfeyd\AppData\Local\Programs\Python\Python36\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 2126: character maps to <undefined>
As music.txt is saved in UTF-8, I changed the first line to:
input = open('music.txt', 'r', encoding="utf8")
This gives another error:
UnicodeEncodeError: 'charmap' codec can't encode character '\u039b' in position 21: character maps to <undefined>
I tried other things with the out.write() but it didn't work.
This is the raw data of music.txt.
https://pastebin.com/FVsVinqW
I saved it in windows editor as UTF-8 .txt file.

If your system's default encoding is not UTF-8, you will need to explicitly configure it for both the filehandles you open, on legacy versions of Python 3 on Windows.
with open('music.txt', 'r', encoding='utf-8') as infh,\
open("out.txt", "w", encoding='utf-8') as outfh:
for line in infh:
line = line.replace(" ", "+").replace("\t", "+")
outfh.write(line)
This demonstrates how you can use fewer temporary variables for the replacements; I also refactored to use a with context manager, and renamed the file handle variables to avoid shadowing the built-in input function.
Going forward, perhaps a better solution would be to upgrade your Python version; my understanding is that Python should now finally offer UTF-8 by default on Windows, too.

Related

How to use Special characters in Python?

this is my code
f = open('test.txt','w')
f.write("\N{Circled White Star}")
f.close
And I get this error
Traceback (most recent call last):
File "f:/mc experiment/python/SkyblockSniper-main/df.py", line 3, in <module>
f.write("\N{Circled White Star}")
File "F:\programing\python\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u272a' in position 0: character maps to <undefined>
What I expected
The test.txt file should have
✪
What i got
<nothing>
Try changing the encoding of the file you open. UTF-8 worked for my testing.
You should also open files using context managers instead of the way you did it.
star = "✪"
with open('test.txt', 'w', encoding="UTF-8") as f:
f.write(f"\n{star}")
Without access to the original Circled White Star symbol, unicode may be useful
star = '\u272A'
with open('test.txt', 'w', encoding="UTF-8") as f:
f.write(star)

I get python frameworks error while reading a csv file, when I try a different easier file it works fine

import csv
exampleFile = open('example.csv')
exampleReader = csv.reader(exampleFile)
for row in exampleReader:
print('Row #' + str(exampleReader.line_num) + ' ' + str(row))
Traceback (most recent call last):
File "/Users/jossan113/Documents/Python II/test.py", line 7, in <module>
for row in exampleReader:
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/encodings/ascii.py", line 26, in decode
return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0x89 in position 4627: ordinal not in range(128)
Do anyone have any idea why I get this error? I tried an very easy cvs file from the internet and it worked just fine, but when I try the bigger file it doesn't
The file contains unicode characters, which was painful to deal with in old versions of python, since you are using 3.5 try opening the file as utf-8 and see if the issue goes away:
exampleFile = open('example.csv', encoding="utf-8")
From the docs:
Since open() is used to open a CSV file for reading, the file will by default be decoded into unicode using the system default encoding (see locale.getpreferredencoding()). To decode a file using a different encoding, use the encoding argument of open:
import csv
with open('some.csv', newline='', encoding='utf-8') as f:
reader = csv.reader(f)
for row in reader:
print(row)
csv modeule docs

UnicodeEncodeError when reading a file

I am trying to read from rockyou wordlist and write all words that are >= 8 chars to a new file.
Here is the code -
def main():
with open("rockyou.txt", encoding="utf8") as in_file, open('rockout.txt', 'w') as out_file:
for line in in_file:
if len(line.rstrip()) < 8:
continue
print(line, file = out_file, end = '')
print("done")
if __name__ == '__main__':
main()
Some words are not utf-8.
Traceback (most recent call last):
File "wpa_rock.py", line 10, in <module>
main()
File "wpa_rock.py", line 6, in main
print(line, file = out_file, end = '')
File "C:\Python\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u0e45' in position
0: character maps to <undefined>
Update
def main():
with open("rockyou.txt", encoding="utf8") as in_file, open('rockout.txt', 'w', encoding="utf8") as out_file:
for line in in_file:
if len(line.rstrip()) < 8:
continue
out_file.write(line)
print("done")
if __name__ == '__main__':
main()```
Traceback (most recent call last):
File "wpa_rock.py", line 10, in <module>
main()
File "wpa_rock.py", line 3, in main
for line in in_file:
File "C:\Python\lib\codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 933: invali
d continuation byte
Your UnicodeEncodeError: 'charmap' error occurs during writing to out_file (in print()).
By default, open() uses locale.getpreferredencoding() that is ANSI codepage on Windows (such as cp1252) that can't represent all Unicode characters and '\u0e45' character in particular. cp1252 is a one-byte encoding that can represent at most 256 different characters but there are a million (1114111) Unicode characters. It can't represent them all.
Pass encoding that can represent all the desired data e.g., encoding='utf-8' must work (as #robyschek suggested)—if your code reads utf-8 data without any errors then the code should be able to write the data using utf-8 too.
Your UnicodeDecodeError: 'utf-8' error occurs during reading in_file (for line in in_file). Not all byte sequences are valid utf-8 e.g., os.urandom(100).decode('utf-8') may fail. What to do depends on the application.
If you expect the file to be encoded as utf-8; you could pass errors="ignore" open() parameter, to ignore occasional invalid byte sequences. Or you could use some other error handlers depending on your application.
If the actual character encoding used in the file is different then you should pass the actual character encoding. bytes by themselves do not have any encoding—that metadata should come from another source (though some encodings are more likely than others: chardet can guess) e.g., if the file content is an http body then see A good way to get the charset/encoding of an HTTP response in Python
Sometimes a broken software can generate mostly utf-8 byte sequences with some bytes in a different encoding. bs4.BeautifulSoup can handle some special cases. You could also try ftfy utility/library and see if it helps in your case e.g., ftfy may fix some utf-8 variations.
Hey I was having a similar issue, in the case of rockyou.txt wordlist, I tried a number of encodings that Python had to offer and I found that encoding = 'kio8_u' worked to read the file.

How to print to a file a string with diacritics?

I have a word in Polish as a string variable which I need to print to a file:
# coding: utf-8
a = 'ilośc'
with open('test.txt', 'w') as f:
print(a, file=f)
This throws
Traceback (most recent call last):
File "C:/scratches/scratch_3.py", line 5, in <module>
print(a, file=f)
File "C:\Python34\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u015b' in position 3: character maps to <undefined>
Looking for existing answers (with .decode("utf-8"), or with .encode("utf-8")) and trying various incantations I finally managed the file to be created.
Unfortunately what was written was b'ilośc'and not ilośc. When I tried to decode that before printing to the file, I got back to the initial error and the same traceback.
How to write a str containing diacritics to a file so that it is a string and not a bytes representation?
The traceback says that you are trying to save 'ś' ('\u015b') character using cp1252 encoding (the default is locale.getpreferredencoding(False)—your Windows ANSI code page) that can't represent this Unicode character (there more than a million Unicode characters and cp1252 is a single-byte encoding that can represent only 256 characters).
Use a character encoding that can represent the desired characters:
with open(filename, 'w', encoding='utf-16') as file:
print('ilośc', file=file)
a = 'ilośc'
with open('test.txt', 'w') as f:
f.write(a)
You can even write to the file using the binary mode:
a = 'ilośc'
with open('test.txt', 'wb') as f:
f.write(a.encode())

UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

I don't know exactly what's the source of this error and how to fix it. I am getting it by running this code.
Traceback (most recent call last):
File "t1.py", line 86, in <module>
write_results(results)
File "t1.py", line 34, in write_results
dw.writerows(results)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/csv.py", line 154, in writerows
return self.writer.writerows(rows)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
Any explanation is really appreciated!
I changed the code and now I get this error:
File "t1.py", line 88, in <module>
write_results(results)
File "t1.py", line 35, in write_results
dw.writerows(results)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/csv.py", line 154, in writerows
return self.writer.writerows(rows)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
Here's the change:
with codecs.open('results.csv', 'wb', 'utf-8') as f:
dw = csv.DictWriter(f, fieldnames=fields, delimiter='|')
dw.writer.writerow(dw.fieldnames)
dw.writerows(results)
The error is raised by this part of the code:
with open('results.csv', 'w') as f:
dw = csv.DictWriter(f, fieldnames=fields, delimiter='|')
dw.writer.writerow(dw.fieldnames)
dw.writerows(results)
You're opening an ASCII file, and then you're trying to write non-ASCII data to it. I guess that whoever wrote that script happened to never encounter a non-ASCII character during testing, so he never ran into an error.
But if you look at the docs for the csv module, you'll see that the module can't correctly handle Unicode strings (which is what Beautiful Soup returns), that CSV files always have to be opened in binary mode, and that only UTF-8 or ASCII are safe to write.
So you need to encode all the strings to UTF-8 before writing them. I first thought that it should suffice to encode the strings on writing, but the Python 2 csv module chokes on the Unicode strings anyway. So I guess there's no other way but to encode each string explicitly:
In parse_results(), change the line
results.append({'url': url, 'create_date': create_date, 'title': title})
to
results.append({'url': url, 'create_date': create_date, 'title': title.encode("utf-8")})
That might already be sufficient since I don't expect URLs or dates to contain non-ASCII characters.
This should work. works for me. Code snippet
import csv
import sys
reload(sys)
sys.setdefaultencoding('utf8')
data = [["a", "b", u'\xe9']]
with open("output.csv", "w") as csv_file:
writer = csv.writer(csv_file, quoting=csv.QUOTE_ALL)
writer.writerows(data)

Categories

Resources