Dealing with multi-language directories (Python)

I'm trying to open a file and just realized that Python is having trouble with my username (it's in Russian). Any suggestions on how to properly decode/encode the path to make IDLE happy?
I'm using Python 2.6.5.
xmlfile = open(u"D:\\Users\\Эрик\\Downloads\\temp.xml", "r")
Traceback (most recent call last):
File "<pyshell#23>", line 1, in <module>
xmlfile = open(str(u"D:\\Users\\Эрик\\Downloads\\temp.xml"), "r")
UnicodeEncodeError: 'ascii' codec can't encode characters in position 9-12: ordinal not in range(128)
os.sys.getfilesystemencoding()
'mbcs'
xmlfile = open(u"D:\Users\Эрик\Downloads\temp.xml".encode("mbcs"), "r")
Traceback (most recent call last):
File "", line 1, in
xmlfile = open(u"D:\Users\Эрик\Downloads\temp.xml".encode("mbcs"), "r")
IOError: [Errno 22] invalid mode ('r') or filename: 'D:\Users\Y?ee\Downloads\temp.xml'

The first problem is that Python interprets backslash escapes in string literals unless you use the raw-string prefix, as in r"raw string". In 2.6.5, you needn't treat your Unicode string specially, but you may need a source encoding declaration at the top of your file, like:
# -*- coding: utf-8 -*-
as defined in PEP 263. Here is an example of it working interactively:
$ python
Python 2.6.5 (r265:79063, Apr 16 2010, 13:09:56) [GCC 4.4.3] on linux2
>>> f = r"D:\Users\Эрик\Downloads\temp.xml"
>>> f
'D:\\Users\\\xd0\xad\xd1\x80\xd0\xb8\xd0\xba\\Downloads\\temp.xml'
>>> x = open(f, 'w')
>>> x.close()
>>>
$ ls D*
D:\Users\Эрик\Downloads\temp.xml
Yes, this is on a Unix system so the \ isn't meaningful and my terminal encoding is utf-8, but it works. You just may have to give the coding hint to the parser when it is reading a file.
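To make the backslash point concrete, here is a minimal sketch (Python 3 syntax; the path C:\temp.xml is a made-up example, not the asker's real path) showing how the same characters produce different strings depending on how the literal is written:

```python
# Hypothetical path, used only to illustrate escape handling.
plain = "C:\temp.xml"       # "\t" is parsed as a TAB character
raw = r"C:\temp.xml"        # raw string: the backslash is kept literally
doubled = "C:\\temp.xml"    # an escaped backslash gives the same string as raw

print(repr(plain))
print(raw == doubled)
```

The plain literal ends up one character shorter than the other two, because the backslash and the t collapse into a single TAB.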

First problem:
xmlfile = open(u"D:\\Users\\Эрик\\Downloads\\temp.xml", "r")
### The above line should be OK, provided that you have the correct coding line
### For example # coding: cp1251
Traceback (most recent call last):
File "<pyshell#23>", line 1, in <module>
xmlfile = open(str(u"D:\\Users\\Эрик\\Downloads\\temp.xml"), "r")
### HOWEVER the above traceback line shows you actually using str(),
### which is DIRECTLY causing the error because it is attempting
### to encode your filename using the default ASCII codec -- DON'T DO THAT.
### Please copy/paste; don't type from memory.
UnicodeEncodeError: 'ascii' codec can't encode characters in position 9-12: ordinal not in range(128)
Second problem:
os.sys.getfilesystemencoding() produces 'mbcs'
xmlfile = open(u"D:\Users\Эрик\Downloads\temp.xml".encode("mbcs"), "r")
### (a) \t is interpreted as a TAB character, hence the file name is invalid.
### (b) encoding with mbcs seems not to be useful; it messes up your name ("Y?ee").
Traceback (most recent call last):
File "", line 1, in
xmlfile = open(u"D:\Users\Эрик\Downloads\temp.xml".encode("mbcs"), "r")
IOError: [Errno 22] invalid mode ('r') or filename: 'D:\Users\Y?ee\Downloads\temp.xml'
General advice on hard-coding filenames in Windows, in descending order of preference:
(1) Don't
(2) Use / e.g. "c:/temp.xml"
(3) Use raw strings with backslashes r"c:\temp.xml"
(4) Use doubled backslashes "c:\\temp.xml"
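Option (1) usually means deriving the path at run time instead of typing it out. A minimal sketch for Python 3, assuming the file sits in a Downloads folder under the current user's home directory (that layout and the helper name downloads_file are illustrative assumptions, not something from the question):

```python
import os


def downloads_file(name):
    """Build a path under the current user's home directory.

    Hypothetical helper: assumes a "Downloads" folder exists there.
    """
    home = os.path.expanduser("~")  # resolved at run time, no hard-coded username
    return os.path.join(home, "Downloads", name)


path = downloads_file("temp.xml")
print(path)
```

Because the username never appears in the source file, neither backslash escapes nor the source encoding declaration come into play.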

Related

How to use Special characters in Python?

This is my code:
f = open('test.txt','w')
f.write("\N{Circled White Star}")
f.close
And I get this error
Traceback (most recent call last):
File "f:/mc experiment/python/SkyblockSniper-main/df.py", line 3, in <module>
f.write("\N{Circled White Star}")
File "F:\programing\python\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u272a' in position 0: character maps to <undefined>
What I expected
The test.txt file should have
✪
What I got
<nothing>
Try changing the encoding of the file you open. UTF-8 worked in my testing.
You should also open files using a context manager instead of closing them by hand.
star = "✪"
with open('test.txt', 'w', encoding="UTF-8") as f:
    f.write(star)
If you can't type the Circled White Star symbol directly, its Unicode escape works too:
star = '\u272A'
with open('test.txt', 'w', encoding="UTF-8") as f:
    f.write(star)
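To see why the encoding matters here, a small sketch that encodes the same character with both codecs. It works on in-memory strings only, so no file is touched:

```python
star = "\N{CIRCLED WHITE STAR}"   # same character as '\u272A'

# UTF-8 can represent this character.
utf8_bytes = star.encode("utf-8")
print(utf8_bytes)

# cp1252 has no mapping for it, so encoding raises UnicodeEncodeError.
try:
    star.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False
print(cp1252_ok)
```

This is the same failure as in the traceback: a text-mode file opened without an encoding argument used cp1252 on that machine.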

Editing UTF-8 text file on Windows

I'm trying to manipulate a text file with song names. I want to clean up the data, by changing all the spaces and tabs into +.
This is the code:
input = open('music.txt', 'r')
out = open("out.txt", "w")
for line in input:
    new_line = line.replace(" ", "+")
    new_line2 = new_line.replace("\t", "+")
    out.write(new_line2)
    #print(new_line2)
input.close()
out.close()
It gives me an error:
Traceback (most recent call last):
File "music.py", line 3, in <module>
for line in input:
File "C:\Users\nfeyd\AppData\Local\Programs\Python\Python36\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 2126: character maps to <undefined>
As music.txt is saved in UTF-8, I changed the first line to:
input = open('music.txt', 'r', encoding="utf8")
This gives another error:
UnicodeEncodeError: 'charmap' codec can't encode character '\u039b' in position 21: character maps to <undefined>
I tried other things with the out.write() but it didn't work.
This is the raw data of music.txt.
https://pastebin.com/FVsVinqW
I saved it in windows editor as UTF-8 .txt file.
On Windows, if your system's default encoding is not UTF-8, you need to pass the encoding explicitly for both of the file handles you open.
with open('music.txt', 'r', encoding='utf-8') as infh,\
        open("out.txt", "w", encoding='utf-8') as outfh:
    for line in infh:
        line = line.replace(" ", "+").replace("\t", "+")
        outfh.write(line)
This demonstrates how you can use fewer temporary variables for the replacements; I also refactored to use a with context manager, and renamed the file handle variables to avoid shadowing the built-in input function.
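Going forward, upgrading your Python version may eventually make this unnecessary. My understanding is that Python 3.7 and later offer an opt-in UTF-8 mode (-X utf8 or PYTHONUTF8=1), and that PEP 686 makes it the default starting with Python 3.15. Until then, passing encoding explicitly is the safe choice.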
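To check which encoding open() would fall back to on a given machine, a quick sketch (assuming Python 3.7 or later, which is when sys.flags.utf8_mode was added):

```python
import locale
import sys

# Encoding that open() falls back to in text mode when none is given.
default_encoding = locale.getpreferredencoding(False)

# 1 when UTF-8 mode is active (python -X utf8 or PYTHONUTF8=1), otherwise 0.
utf8_mode = sys.flags.utf8_mode

print(default_encoding, utf8_mode)
```

If the first value is something like cp1252, both open() calls need an explicit encoding argument to handle UTF-8 data.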

Unable to use csv.reader for a non ascii string in python 3

This is currently my code:
# -*- coding: utf-8 -*-
import csv
import codecs
# original directory
phys_comp_dir = '/Users/lmnt74/Physician_Compare'
# for row in Performance_Scores:
# print(','.join(row))
# file name
National_Downloadable_File = ('/Physician_Compare_National_Downloadable'
                              '_File.csv')
National_File = csv.reader(open(phys_comp_dir+National_Downloadable_File,
                                newline='', encoding='utf-8'),
                           quotechar='|', quoting=csv.QUOTE_MINIMAL,
                           lineterminator='\n'
                           )
for row in National_File:
    for i in row:
        try:
            print(i)
        except UnicodeError:
            print(i.encode('latin-1').decode('utf-8'))
I receive the following error:
Traceback (most recent call last):
File "/Users/lmn74/Physician_Compare/q2.py", line 41, in <module>
print(i)
UnicodeEncodeError: 'ascii' codec can't encode character '\xae' in position 52: ordinal not in range(128)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/lmnt74/Physician_Compare/q2.py", line 43, in <module>
print(i.encode('latin-1').decode('utf-8'))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xae in position 52: invalid start byte
I am unsure about how to proceed. I know the string that is throwing the error is the (R), the registered trademark. I would like to figure out how to re-write my code so that it is able to check for this in each string OR if a better way exists to allocate for this when reading the file initially, I'm all for that.
What I've done so far:
I've read the unicode documentation.
I've read the CSV documentation
I've read about the unicode sandwich
None of which have helped me or are easy enough reads for me to understand. I'm a fairly new beginner and anything to point me in the right direction would be greatly appreciated.
EDIT: Figured it out, see below:
changed:
print(i.encode('latin-1').decode('utf-8'))
to:
print(i.encode('ascii', 'ignore').decode('utf-8', 'ignore'))
Sorry to waste anyone's time.
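Note that the workaround in the edit silently drops every non-ASCII character. The first traceback suggests the file was read correctly and that it is the output stream (stdout) whose encoding is ASCII. One alternative, sketched below, is to give the output stream an encoding that can represent the text. The sketch uses an in-memory stream as a stand-in for stdout, and the sample string is made up. For the real stdout, sys.stdout.reconfigure(encoding='utf-8') (Python 3.7+) or the PYTHONIOENCODING environment variable should have the same effect.

```python
import io

text = "Brand\u00ae name"   # made-up sample containing the registered sign

# Stand-in for an ASCII-only stdout: printing the text fails.
ascii_out = io.TextIOWrapper(io.BytesIO(), encoding="ascii")
try:
    print(text, file=ascii_out)
    ascii_out.flush()
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# The same kind of stream with UTF-8 accepts the text unchanged.
buffer = io.BytesIO()
utf8_out = io.TextIOWrapper(buffer, encoding="utf-8", newline="\n")
print(text, file=utf8_out)
utf8_out.flush()
written = buffer.getvalue().decode("utf-8")
print(ascii_failed, repr(written))
```

With a UTF-8 output stream the try/except around print(i) is no longer needed, and the registered sign survives.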

How to convert binary file into readable format on linux server

I am trying to convert binary file into readable format but unable to do so, please suggest how it could be achieved.
$ file test.docx
test.docx: Microsoft Word 2007+
$ file -i test.docx
test.docx: application/msword; charset=binary
$
>>> raw = codecs.open('test.docx', encoding='ascii').readlines()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/Python/installPath/lib/python2.7/codecs.py", line 694, in readlines
return self.reader.readlines(sizehint)
File "/home/Python/installPath/lib/python2.7/codecs.py", line 603, in readlines
data = self.read()
File "/home/Python/installPath/lib/python2.7/codecs.py", line 492, in read
newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x93 in position 18: ordinal not in range(128)
Try the code below, which reads the file as binary data:
with open("test_file.docx", "rb") as binary_file:
    # Read the whole file at once
    data = binary_file.read()
    print(data)
    # Seek position and read N bytes
    binary_file.seek(0) # Go to beginning
    couple_bytes = binary_file.read(2)
    print(couple_bytes)
You'll have to read it in binary mode:
import binascii
with open('test.docx', 'rb') as f: # 'rb' stands for read binary
    hexdata = binascii.hexlify(f.read()) # convert to hex
print(hexdata)
I think the other answers did not address the whole question, at least not the part that @ankitpandey clarified in his comment about catdoc returning an error:
" catdoc then error is This file looks like ZIP archive or Office 2007
or later file. Not supported by catdoc"
I had just run into the same issue with catdoc and found a solution that worked for me.
The mention of a ZIP archive was the clue, and I was able to use the following command
unzip -q -c 'test.docx' word/document.xml | python etree.py
to extract the text portion of test.docx to stdout.
The Python code was placed in etree.py:
from lxml import etree
import sys
xml = sys.stdin.read().encode('utf-8')
root = etree.fromstring(xml)
bits_of_text = root.xpath('//text()')
# print(bits_of_text) # Note that some bits are whitespace-only
joined_text = ' '.join(
    bit.strip() for bit in bits_of_text
    if bit.strip() != ''
)
print(joined_text)
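The same idea can be sketched with the standard library alone, without unzip or lxml. This is an assumption-laden sketch rather than a full .docx parser: the helper name docx_text is made up, it only collects the text of w:t elements in word/document.xml, and it joins runs with a space, which can split a word that Word stored across two runs. The demo builds a tiny stand-in archive in memory instead of using a real document.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by text runs (<w:t>) in word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_text(source):
    """Return the text of a .docx given a path or a binary file object."""
    with zipfile.ZipFile(source) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    # Collect the text of every <w:t> element, skipping empty ones.
    return " ".join(node.text for node in root.iter(W_NS + "t") if node.text)


# Tiny in-memory stand-in for a real .docx, built only for demonstration.
sample_xml = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/'
    'wordprocessingml/2006/main"><w:body><w:p>'
    '<w:r><w:t>Hello</w:t></w:r><w:r><w:t>world</w:t></w:r>'
    '</w:p></w:body></w:document>'
)
sample = io.BytesIO()
with zipfile.ZipFile(sample, "w") as archive:
    archive.writestr("word/document.xml", sample_xml)
sample.seek(0)

extracted = docx_text(sample)
print(extracted)
```

For a file on disk, docx_text('test.docx') would be the equivalent call.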

How to print to a file a string with diacritics?

I have a word in Polish as a string variable which I need to print to a file:
# coding: utf-8
a = 'ilośc'
with open('test.txt', 'w') as f:
    print(a, file=f)
This throws
Traceback (most recent call last):
File "C:/scratches/scratch_3.py", line 5, in <module>
print(a, file=f)
File "C:\Python34\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u015b' in position 3: character maps to <undefined>
Looking at existing answers (with .decode("utf-8") or .encode("utf-8")) and trying various incantations, I finally managed to get the file created.
Unfortunately, what was written was b'ilośc' and not ilośc. When I tried to decode that before printing to the file, I got back the initial error and the same traceback.
How do I write a str containing diacritics to a file so that it is a string and not a bytes representation?
The traceback says that you are trying to save the 'ś' ('\u015b') character using the cp1252 encoding (the default is locale.getpreferredencoding(False), which is your Windows ANSI code page), and cp1252 can't represent this character: Unicode has more than a million code points, while cp1252 is a single-byte encoding that can represent at most 256 characters.
Use a character encoding that can represent the desired characters:
with open(filename, 'w', encoding='utf-16') as file:
    print('ilośc', file=file)
UTF-8 works as well, as long as you pass the encoding explicitly:
a = 'ilośc'
with open('test.txt', 'w', encoding='utf-8') as f:
    f.write(a)
You can even write to the file using the binary mode:
a = 'ilośc'
with open('test.txt', 'wb') as f:
    f.write(a.encode('utf-8'))
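The b'...' output described in the question most likely came from passing a bytes object to print(). A small sketch of why that happens, using io.StringIO as a stand-in for a text-mode file:

```python
import io

a = "ilośc"
encoded = a.encode("utf-8")

# Printing a bytes object to a text stream writes its repr, b'...'.
wrong = io.StringIO()
print(encoded, file=wrong)

# Printing the str itself writes the actual characters.
right = io.StringIO()
print(a, file=right)

print(repr(wrong.getvalue()))
print(repr(right.getvalue()))
```

So either keep the str and let a text-mode file with a suitable encoding do the encoding, or encode yourself and write the bytes to a file opened in binary mode, but don't mix the two.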
