Output difference after reading files saved in different encoding option in python - python

I have a file containing a list of Unicode strings, saved with the UTF-8 encoding option. I have another input file, saved as plain ANSI. I read a directory path from the ANSI file, run os.walk() on it, and check whether any file found is present in the list (the one saved as UTF-8). But nothing matches, even when the file is present.
Later I do some normal checking with a single string "40M_Ãz­µ´ú¸ÕÀÉ" and save this particular string (from notepad) in three different files with encoding option ansi, unicode and utf-8. I write a python script to print:
print repr(string)
print string
And the output is like:
ANSI Encoding
'40M_\xc3z\xad\xb5\xb4\xfa\xb8\xd5\xc0\xc9'
40M_Ãz­µ´ú¸ÕÀÉ
UNICODE Encoding
'\x004\x000\x00M\x00_\x00\xc3\x00z\x00\xad\x00\xb5\x00\xb4\x00\xfa\x00\xb8\x00\xd5\x00\xc0\x00\xc9\x00'
4 0 M _ Ã z ­µ ´ ú ¸ Õ À É
UTF-8 Encoding
'40M_\xc3\x83z\xc2\xad\xc2\xb5\xc2\xb4\xc3\xba\xc2\xb8\xc3\x95\xc3\x80\xc3\x89'
40M_Ãz­µ´ú¸ÕÀÉ
I really can't understand how to compare the same string coming from differently encoded files. Please help.
PS: I have some typical unicode characters like: 唐朝小栗子第集.mp3 which are very difficult to handle.

I really can't understand how to compare the same string coming from differently encoded files.
Notepad encoded your character string with three different encodings, resulting in three different byte sequences. To retrieve the character string you must decode those bytes using the same encodings:
>>> ansi_bytes = '40M_\xc3z\xad\xb5\xb4\xfa\xb8\xd5\xc0\xc9'
>>> utf16_bytes = '4\x000\x00M\x00_\x00\xc3\x00z\x00\xad\x00\xb5\x00\xb4\x00\xfa\x00\xb8\x00\xd5\x00\xc0\x00\xc9\x00'
>>> utf8_bytes = '40M_\xc3\x83z\xc2\xad\xc2\xb5\xc2\xb4\xc3\xba\xc2\xb8\xc3\x95\xc3\x80\xc3\x89'
>>> ansi_bytes.decode('mbcs')
u'40M_\xc3z\xad\xb5\xb4\xfa\xb8\xd5\xc0\xc9' # 40M_Ãz­µ´ú¸ÕÀÉ
>>> utf16_bytes.decode('utf-16le')
u'40M_\xc3z\xad\xb5\xb4\xfa\xb8\xd5\xc0\xc9' # 40M_Ãz­µ´ú¸ÕÀÉ
>>> utf8_bytes.decode('utf-8')
u'40M_\xc3z\xad\xb5\xb4\xfa\xb8\xd5\xc0\xc9' # 40M_Ãz­µ´ú¸ÕÀÉ
‘ANSI’ (not “ASCI”) is what Windows (somewhat misleadingly) calls its default locale-specific code page, which in your case is 1252 (Western European, which you can get in Python as windows-1252) but this will vary from machine to machine. You can get whatever this encoding is from Python on Windows using the name mbcs.
‘Unicode’ is the name Windows uses for the UTF-16LE encoding (very misleadingly, because Unicode is the character set standard and not any kind of bytes⇔characters encoding in itself). Unlike ANSI and UTF-8 this is not an ASCII-compatible encoding, so your attempt to read a line from the file has failed because the line terminator in UTF-16LE is not \n, but \n\x00. This has left a spurious \x00 at the start of the byte string you have above.
‘UTF-8’ is at least accurately named, but Windows likes to put fake Byte Order Marks at the front of its “UTF-8” files that will give you an unwanted u'\uFEFF' character when you decode them. If you want to accept “UTF-8” files saved from Notepad you can manually remove this or use Python's utf-8-sig encoding.
You can use codecs.open() instead of open() to read a file with automatic Unicode decoding. This also fixes the UTF-16 newline problem, because then the \n characters are detected after decoding instead of before.
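As a rough Python 3 illustration of the decoding step described above (the session in the answer is Python 2), the sketch below encodes one string three ways and decodes each byte sequence with the matching codec. It assumes the machine's ANSI code page is cp1252, as in the question; the variable names are only illustrative.

```python
# Python 3 sketch; assumes the "ANSI" code page is cp1252, as in the question.
text = "40M_\xc3z\xad\xb5\xb4\xfa\xb8\xd5\xc0\xc9"

# The same text in each of the three encodings discussed above.
ansi_bytes = text.encode("cp1252")
utf16_bytes = text.encode("utf-16-le")
utf8_bytes = text.encode("utf-8-sig")  # "utf-8-sig" prepends the BOM

# Decoding with the matching codec recovers the same str every time.
decoded = [
    ansi_bytes.decode("cp1252"),
    utf16_bytes.decode("utf-16-le"),
    utf8_bytes.decode("utf-8-sig"),  # also strips the BOM
]
all_match = all(d == text for d in decoded)
print(all_match)

# Decoding the BOM-prefixed bytes as plain UTF-8 leaves U+FEFF at the front.
with_bom = utf8_bytes.decode("utf-8")
print(ascii(with_bom[0]))
```

Once every input has been decoded to a str this way, ordinary == comparison and set membership behave as expected.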
I read directory path from that asci file and do os.walk()
Windows filenames are natively handled as Unicode, so when you give Windows a byte string it has to guess what encoding is needed to convert those bytes into characters. It chooses ANSI not UTF-8. That would be fine if you were using a byte string from a file also encoded in the same machine's ANSI encoding, however in that case you would be limited to filenames that fit within your machine's locale. In Western European 40M_Ãz­µ´ú¸ÕÀÉ would fit but 唐朝小栗子第集.mp3 would not so you wouldn't be able to refer to Chinese files at all.
Python supports passing Unicode filenames directly to Windows, which avoids the problem (most other languages can't do this). Pass a Unicode string into filesystem functions like os.walk() and you should get Unicode strings out, instead of failure.
So, for UTF-8-encoded input files, something like:
import codecs
import os

with codecs.open(u'directory_path.txt', 'rb', 'utf-8-sig') as fp:
    directory_path = fp.readline().strip(u'\r\n') # unicode dir path
good_names = set()
with codecs.open(u'filename_list.txt', 'rb', 'utf-8-sig') as fp:
    for line in fp:
        good_names.add(line.strip(u'\r\n')) # set of unicode file names
for dirpath, dirnames, filenames in os.walk(directory_path): # names will be unicode strings
    for filename in filenames:
        if filename in good_names:
            pass # do something with file
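For readers on Python 3, where the built-in open() accepts an encoding argument and os.walk() returns str names when given a str path, the same idea might look like the sketch below. The helper name find_listed_files and the temporary files are hypothetical and exist only to make the example self-contained.

```python
import os
import tempfile


def find_listed_files(root, list_path, encoding="utf-8-sig"):
    """Return paths under root whose file name appears in the list file."""
    # "utf-8-sig" accepts UTF-8 files with or without Notepad's BOM.
    with open(list_path, encoding=encoding) as fp:
        wanted = {line.strip("\r\n") for line in fp if line.strip()}
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):  # str in, str out
        for name in filenames:
            if name in wanted:
                matches.append(os.path.join(dirpath, name))
    return matches


# Small demonstration in a throwaway directory (hypothetical file names).
with tempfile.TemporaryDirectory() as tmp:
    root = os.path.join(tmp, "media")
    os.mkdir(root)
    for name in ("唐朝小栗子.mp3", "other.txt"):
        open(os.path.join(root, name), "w").close()
    list_path = os.path.join(tmp, "filename_list.txt")
    with open(list_path, "w", encoding="utf-8-sig") as fp:
        fp.write("唐朝小栗子.mp3\nmissing.txt\n")
    found = [os.path.basename(p) for p in find_listed_files(root, list_path)]

print(len(found))
```

The comparison works because both sides are decoded text: the list file is decoded on read, and the directory listing is already str.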

Related

FPDF encoding error when reading a UTF8 txt file in Python [duplicate]

I'm having some brain failure in understanding reading and writing text to a file (Python 2.4).
# The string, which has an a-acute in it.
ss = u'Capit\xe1n'
ss8 = ss.encode('utf8')
repr(ss), repr(ss8)
("u'Capit\xe1n'", "'Capit\xc3\xa1n'")
print ss, ss8
print >> open('f1','w'), ss8
>>> file('f1').read()
'Capit\xc3\xa1n\n'
So I type in Capit\xc3\xa1n into my favorite editor, in file f2.
Then:
>>> open('f1').read()
'Capit\xc3\xa1n\n'
>>> open('f2').read()
'Capit\\xc3\\xa1n\n'
>>> open('f1').read().decode('utf8')
u'Capit\xe1n\n'
>>> open('f2').read().decode('utf8')
u'Capit\\xc3\\xa1n\n'
What am I not understanding here? Clearly there is some vital bit of magic (or good sense) that I'm missing. What does one type into text files to get proper conversions?
What I'm truly failing to grok here, is what the point of the UTF-8 representation is, if you can't actually get Python to recognize it, when it comes from outside. Maybe I should just JSON dump the string, and use that instead, since that has an asciiable representation! More to the point, is there an ASCII representation of this Unicode object that Python will recognize and decode, when coming in from a file? If so, how do I get it?
>>> print simplejson.dumps(ss)
'"Capit\u00e1n"'
>>> print >> file('f3','w'), simplejson.dumps(ss)
>>> simplejson.load(open('f3'))
u'Capit\xe1n'
Rather than mess with .encode and .decode, specify the encoding when opening the file. The io module, added in Python 2.6, provides an io.open function, which allows specifying the file's encoding.
Supposing the file is encoded in UTF-8, we can use:
>>> import io
>>> f = io.open("test", mode="r", encoding="utf-8")
Then f.read returns a decoded Unicode object:
>>> f.read()
u'Capit\xe1l\n\n'
In 3.x, the io.open function is an alias for the built-in open function, which supports the encoding argument (it does not in 2.x).
We can also use open from the codecs standard library module:
>>> import codecs
>>> f = codecs.open("test", "r", "utf-8")
>>> f.read()
u'Capit\xe1l\n\n'
Note, however, that this can cause problems when mixing read() and readline().
In the notation u'Capit\xe1n\n' (should be just 'Capit\xe1n\n' in 3.x, and must be in 3.0 and 3.1), the \xe1 represents just one character. \x is an escape sequence, indicating that e1 is in hexadecimal.
Writing Capit\xc3\xa1n into the file in a text editor means that it actually contains \xc3\xa1. Those are 8 bytes and the code reads them all. We can see this by displaying the result:
# Python 3.x - reading the file as bytes rather than text,
# to ensure we see the raw data
>>> open('f2', 'rb').read()
b'Capit\\xc3\\xa1n\n'
# Python 2.x
>>> open('f2').read()
'Capit\\xc3\\xa1n\n'
Instead, just input characters like á in the editor, which should then handle the conversion to UTF-8 and save it.
In 2.x, a string that actually contains these backslash-escape sequences can be decoded using the string_escape codec:
# Python 2.x
>>> print 'Capit\\xc3\\xa1n\n'.decode('string_escape')
Capitán
The result is a str that is encoded in UTF-8 where the accented character is represented by the two bytes that were written \\xc3\\xa1 in the original string. To get a unicode result, decode again with UTF-8.
In 3.x, the string_escape codec is replaced with unicode_escape, and it is strictly enforced that we can only encode from a str to bytes, and decode from bytes to str. unicode_escape needs to start with a bytes in order to process the escape sequences (the other way around, it adds them); and then it will treat the resulting \xc3 and \xa1 as character escapes rather than byte escapes. As a result, we have to do a bit more work:
# Python 3.x
>>> 'Capit\\xc3\\xa1n\n'.encode('ascii').decode('unicode_escape').encode('latin-1').decode('utf-8')
'Capitán\n'
Now all you need in Python3 is open(Filename, 'r', encoding='utf-8')
[Edit on 2016-02-10 for requested clarification]
Python3 added the encoding parameter to its open function. The following information about the open function is gathered from here: https://docs.python.org/3/library/functions.html#open
open(file, mode='r', buffering=-1,
encoding=None, errors=None, newline=None,
closefd=True, opener=None)
Encoding is the name of the encoding used to decode or encode the
file. This should only be used in text mode. The default encoding is
platform dependent (whatever locale.getpreferredencoding()
returns), but any text encoding supported by Python can be used.
See the codecs module for the list of supported encodings.
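So by adding encoding='utf-8' as a parameter to the open function, the file reading and writing is all done as UTF-8. (UTF-8 is also the default source-code encoding in Python 3, but as the documentation quoted above says, the default for open() is platform dependent, so it is safer to pass the encoding explicitly.)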
So, I've found a solution for what I'm looking for, which is:
print open('f2').read().decode('string-escape').decode("utf-8")
There are some unusual codecs that are useful here. This particular reading allows one to take UTF-8 representations from within Python, copy them into an ASCII file, and have them be read in to Unicode. Under the "string-escape" decode, the slashes won't be doubled.
This allows for the sort of round trip that I was imagining.
This works for reading a file with UTF-8 encoding in Python 3.2:
import codecs
f = codecs.open('file_name.txt', 'r', 'UTF-8')
for line in f:
    print(line)
# -*- encoding: utf-8 -*-
# Convert a file of unknown encoding to UTF-8 (Python 2; relies on the Unix `file` command)
import codecs
import commands
file_location = "jumper.sub"
file_encoding = commands.getoutput('file -b --mime-encoding %s' % file_location)
file_stream = codecs.open(file_location, 'r', file_encoding)
file_output = codecs.open(file_location+"b", 'w', 'utf-8')
for l in file_stream:
    file_output.write(l)
file_stream.close()
file_output.close()
Aside from codecs.open(), io.open() can be used in both 2.x and 3.x to read and write text files. Example:
import io
text = u'á'
encoding = 'utf8'
with io.open('data.txt', 'w', encoding=encoding, newline='\n') as fout:
    fout.write(text)
with io.open('data.txt', 'r', encoding=encoding, newline='\n') as fin:
    text2 = fin.read()
assert text == text2
To read in an Unicode string and then send to HTML, I did this:
fileline.decode("utf-8").encode('ascii', 'xmlcharrefreplace')
Useful for python powered http servers.
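In Python 3 the same conversion works on bytes read from a file in binary mode. A minimal sketch, with the variable fileline standing in for such a line:

```python
# Python 3 sketch: UTF-8 bytes in, ASCII-only bytes with character references out.
fileline = "Capitán".encode("utf-8")  # stand-in for a line read in binary mode

html_safe = fileline.decode("utf-8").encode("ascii", "xmlcharrefreplace")
print(html_safe)
```

Every non-ASCII character becomes a numeric character reference, so the result can be sent as plain ASCII and still render correctly in HTML.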
Well, your favorite text editor does not realize that \xc3\xa1 are supposed to be character literals, but it interprets them as text. That's why you get the double backslashes in the last line -- it's now a real backslash + xc3, etc. in your file.
If you want to read and write encoded files in Python, best use the codecs module.
Pasting text between the terminal and applications is difficult, because you don't know which program will interpret your text using which encoding. You could try the following:
>>> s = file("f1").read()
>>> print unicode(s, "Latin-1")
Capitán
Then paste this string into your editor and make sure that it stores it using Latin-1. Under the assumption that the clipboard does not garble the string, the round trip should work.
You have stumbled over the general problem with encodings: How can I tell in which encoding a file is?
Answer: You can't unless the file format provides for this. XML, for example, begins with:
<?xml version="1.0" encoding="utf-8"?>
This header was carefully chosen so that it can be read no matter the encoding. In your case, there is no such hint, hence neither your editor nor Python has any idea what is going on. Therefore, you must use the codecs module and use codecs.open(path,mode,encoding) which provides the missing bit in Python.
As for your editor, you must check if it offers some way to set the encoding of a file.
The point of UTF-8 is to be able to encode 21-bit characters (Unicode) as an 8-bit data stream (because that's the only thing all computers in the world can handle). But since most OSs predate the Unicode era, they don't have suitable tools to attach the encoding information to files on the hard disk.
The next issue is the representation in Python. This is explained perfectly in the comment by heikogerlach. You must understand that your console can only display ASCII. In order to display Unicode or anything >= charcode 128, it must use some means of escaping. In your editor, you must not type the escaped display string but what the string means (in this case, you must enter the accented character á and save the file).
That said, you can use the Python function eval() to turn an escaped string into a string:
>>> x = eval("'Capit\\xc3\\xa1n\\n'")
>>> x
'Capit\xc3\xa1n\n'
>>> x[5]
'\xc3'
>>> len(x[5])
1
As you can see, the string "\xc3" has been turned into a single character. This is now an 8-bit string, UTF-8 encoded. To get Unicode:
>>> x.decode('utf-8')
u'Capit\xe1n\n'
Gregg Lind asked: I think there are some pieces missing here: the file f2 contains: hex:
0000000: 4361 7069 745c 7863 335c 7861 316e Capit\xc3\xa1n
codecs.open('f2','rb', 'utf-8'), for example, reads them all in as separate chars (expected). Is there any way to write to a file in ASCII that would work?
Answer: That depends on what you mean. ASCII can't represent characters > 127. So you need some way to say "the next few characters mean something special" which is what the sequence "\x" does. It says: The next two characters are the code of a single character. "\u" does the same using four characters to encode Unicode up to 0xFFFF (65535).
So you can't directly write Unicode to ASCII (because ASCII simply doesn't contain the same characters). You can write it as string escapes (as in f2); in this case, the file can be represented as ASCII. Or you can write it as UTF-8, in which case, you need an 8-bit safe stream.
Your solution using decode('string-escape') does work, but you must be aware how much memory you use: Three times the amount of using codecs.open().
Remember that a file is just a sequence of bytes with 8 bits. Neither the bits nor the bytes have a meaning. It's you who says "65 means 'A'". Since \xc3\xa1 should become "á" but the computer has no means to know, you must tell it by specifying the encoding which was used when writing the file.
The \x.. sequence is something that's specific to Python. It's not a universal byte escape sequence.
How you actually enter in UTF-8-encoded non-ASCII depends on your OS and/or your editor. Here's how you do it in Windows. For OS X to enter a with an acute accent you can just hit option + E, then A, and almost all text editors in OS X support UTF-8.
You can also improve the original open() function to work with Unicode files by replacing it in place, using the partial function. The beauty of this solution is you don't need to change any old code. It's transparent.
import codecs
import functools
open = functools.partial(codecs.open, encoding='utf-8')
I was trying to parse iCal using Python 2.7.9:
from icalendar import Calendar
But I was getting:
Traceback (most recent call last):
File "ical.py", line 92, in parse
print "{}".format(e[attr])
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 7: ordinal not in range(128)
and it was fixed with just:
print "{}".format(e[attr].encode("utf-8"))
(Now it can print liké á böss.)
I found the most simple approach by changing the default encoding of the whole script to be 'UTF-8':
import sys
reload(sys)
sys.setdefaultencoding('utf8')
Implicit conversions between str and unicode will then use utf8 instead of ascii.
Works at least for Python 2.7.9.
Thx goes to https://markhneedham.com/blog/2015/05/21/python-unicodeencodeerror-ascii-codec-cant-encode-character-uxfc-in-position-11-ordinal-not-in-range128/ (look at the end).

type of encoding to read csv files in pandas

Alright, So I'm writing a code where I read a CSV file using pandas.read_csv, the problem is with the encoding, I was using utf-8-sig encoding and this is working. However, this gives me an error with other CSV files. I found out that some files need other types of encoding such as cp1252. The problem is that I can't restrict the user to a specific CSV type that matches my encoding.
So is there any solution for this? for example is there a universal encoding type that works for all CSV's? or can I pass an array of all the possible encoders?
A CSV file is a text file. If it contains only ASCII characters, there is no problem nowadays: most encodings can correctly handle plain ASCII characters. The problem arises with non-ASCII characters. Example:
character   Latin1 code   cp850 code   UTF-8 codes
é           '\xe9'        '\x82'       '\xc3\xa9'
è           '\xe8'        '\x8a'       '\xc3\xa8'
ö           '\xf6'        '\x94'       '\xc3\xb6'
Things are even worse, because single-byte character sets can represent at most 256 characters while UTF-8 can represent all of Unicode. For example, besides the normal quote character ', Unicode contains left (‘) and right (’) versions of it, neither of which can be represented in Latin1 or CP850.
Long story short, there is no such thing as a universal encoding. But certain encodings, for example Latin1, have a special property: they can decode any byte. So if you declare a Latin1 encoding, no UnicodeDecodeError will be raised. It is just that if the file was actually UTF-8 encoded, an é will look like Ã©. And the right single quote would be 'â\x80\x99', which will appear as â on a Latin1 system and as â€™ on a cp1252 one.
As you spoke of CP1252, it is a Windows variant of Latin1, but it does not share the property of being able to decode any byte.
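The behaviour described in the last two paragraphs can be checked directly. This is a small Python 3 sketch; the variable names are only illustrative, and ascii() is used when printing so the output does not depend on the console encoding.

```python
# UTF-8 bytes misread as Latin1: one character turns into two.
as_latin1 = "é".encode("utf-8").decode("latin1")
print(ascii(as_latin1))

# The right single quote (U+2019) misread as cp1252: three characters.
quote_cp1252 = "\u2019".encode("utf-8").decode("cp1252")
print(ascii(quote_cp1252))

# Latin1 maps every one of the 256 byte values to a character...
every_byte = bytes(range(256)).decode("latin1")

# ...while Python's cp1252 codec leaves a few bytes (such as 0x81) undefined.
try:
    b"\x81".decode("cp1252")
    cp1252_rejects_0x81 = False
except UnicodeDecodeError:
    cp1252_rejects_0x81 = True
print(cp1252_rejects_0x81)
```

This is why Latin1 is a safe last resort in a list of candidate encodings while cp1252 is not: Latin1 never raises, although it may still produce the wrong characters.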
The common way is to ask people sending you CSV file to use the same encoding and try to decode with that encoding. Then you have two workarounds for badly encoded files. First is the one proposed by CygnusX: try a sequence of encodings terminated with Latin1, for example encodings = ["utf-8-sig", "utf-8", "cp1252", "latin1"] (BTW Latin1 is an alias for ISO-8859-1 so no need to test both).
The second one is to open the file with errors='replace': any offending byte will be replaced with a replacement character. At least all ASCII characters will be correct:
with open(filename, encoding='utf-8-sig', errors='replace') as file:
    fd = pd.read_csv(file, other_parameters...)
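To see what errors='replace' does without needing pandas or a real file, the sketch below wraps some bytes in io.TextIOWrapper, which here merely stands in for open(filename, encoding=..., errors='replace'). The sample data is made up.

```python
import io

# Made-up CSV content: the \xe9 byte is Latin1/cp1252 "é", not valid UTF-8.
raw = b"name,city\nJos\xe9,Lyon\n"

with io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8-sig",
                      errors="replace") as file:
    text = file.read()

print(ascii(text))
```

The offending byte becomes U+FFFD while the separators and ASCII text survive, so the file still parses, at the cost of losing the original character.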
You could try this: https://stackoverflow.com/a/48556203/11246056
Or iterate over several formats in a try/except statement:
encodings = ["utf-8-sig", "cp1252", "iso-8859-1", "latin1"]
for encoding in encodings:
    try:
        pandas.read_csv(..., encoding=encoding, ...)
        ...
        break # stop at the first encoding that works
    except ValueError: # or the error you receive
        continue
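The same try-each-encoding idea can be sketched on raw bytes, independent of pandas. The helper name decode_with_fallback and its default candidate list are assumptions made for this example, not part of any library.

```python
def decode_with_fallback(data, encodings=("utf-8-sig", "cp1252", "latin1")):
    """Return (text, encoding) for the first candidate that decodes data."""
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError("none of the candidate encodings could decode the data")


utf8_result = decode_with_fallback("café".encode("utf-8"))
cp1252_result = decode_with_fallback(b"caf\xe9")
latin1_result = decode_with_fallback(b"\x81")

for result in (utf8_result, cp1252_result, latin1_result):
    print(ascii(result))
```

Order matters: strict encodings such as UTF-8 go first because they reject most foreign data, and Latin1 goes last because it accepts anything. A successful decode only means the bytes were valid in that encoding, not that the guess was right.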

python codecs can't encode to cp1252...but notepad++ can?

I have a very simple piece of code that's converting a CSV. (Note that I reference Notepad++ a few times, but my standard IDE is VS Code.)
with codecs.open(filePath, "r", encoding = "UTF-8") as sourcefile:
    lines = sourcefile.read()
with codecs.open(filePath, 'w', encoding = 'cp1252') as targetfile:
    targetfile.write(lines)
Now the job I'm doing requires a specific file to be encoded as windows-1252, and from what I understand cp1252 = windows-1252. This conversion works fine when I do it using the UI features in Notepad++, but when I try using Python codecs to encode this file it fails:
UnicodeEncodeError: 'charmap' codec can't encode character '\ufffd' in position 561488: character maps to <undefined>
When I saw this failure I was confused, so I double-checked the output from manually converting the file using Notepad++, and the converted file is encoded in windows-1252. So what gives? Why is a UI feature in Notepad++ able to do the job when codecs does not seem to be able to? Does Notepad++ just ignore errors?
Looks like your input text has the character "�" (the actual placeholder "replacement character" character, not some other undefined character), which cannot be mapped to cp1252 (because it doesn't have the concept).
Depending on what you need, you can:
Filter it out (or replace it, or otherwise handle it) in Python before writing out lines to the output file.
Pass errors=... to the second codecs.open, choosing one of the other error-handling modes; the default is 'strict', you can also use 'ignore', 'replace', 'xmlcharrefreplace', 'backslashreplace' or 'namereplace'.
Check the input file and see why it's got the "�" character; is it corrupted?
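To compare the error-handling modes listed above, this short sketch encodes a string containing U+FFFD to cp1252 with each of them. The sample string is made up.

```python
text = "a\ufffdb"  # contains U+FFFD, which cp1252 cannot represent

# Encode the same text with each non-strict error handler.
results = {
    mode: text.encode("cp1252", errors=mode)
    for mode in ("ignore", "replace", "xmlcharrefreplace",
                 "backslashreplace", "namereplace")
}
for mode, encoded in results.items():
    print(mode, encoded)

# The default, errors="strict", raises instead.
try:
    text.encode("cp1252")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
print(strict_failed)
```

Which mode is appropriate depends on the consumer of the file: 'ignore' and 'replace' lose information silently, while the escaping modes keep it in a form that cp1252 can carry.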
Probably Python is simply more explicit in its error handling. If Notepad++ managed to represent every character correctly in CP-1252 then there is a bug in the Python codec where it should not fail where it currently does; but I'm guessing Notepad++ is silently replacing some characters with some other characters, and falsely claiming success.
Maybe try converting the result back to UTF-8 and compare the files byte by byte if the data is not easy to inspect manually.
Unicode U+FFFD is a reserved character which serves as a placeholder for a character which cannot be represented in Unicode; often, it's an indication of an earlier conversion problem, when presumably this data was imperfectly input or converted.
(And yes, Windows-1252 is another name for Windows code page 1252.)
Why notepad++ "succeeds"
Notepad++ does not offer to convert your file to cp1252, but to reinterpret it using this encoding. What led to your confusion is that the program uses the wrong term for this in its Encoding menu.
When "Encode with cp1252" is selected, Notepad++ decodes the file using cp1252 and shows you the result. If you save the character '\ufffd' to a file using utf8:
with open('f.txt', 'w', encoding='utf8') as f:
    f.write('\ufffd')
and use "Encode with cp1252" you'd see three characters: ï¿½
That means that Notepad++ does not read the character in utf8 and then write it in cp1252, because then you'd see exactly one character. You could achieve similar results to Notepad++ by reading the file using cp1252:
with open('f.txt', 'r', encoding='cp1252') as f:
    print(f.read()) # Prints ï¿½
Notepad++ lets you actually convert to only five encodings, which are listed in the same Encoding menu.
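The reinterpretation can also be reproduced without touching a file. A minimal sketch:

```python
# U+FFFD takes three bytes in UTF-8.
utf8_bytes = "\ufffd".encode("utf-8")
print(utf8_bytes)

# Reading those three bytes as cp1252 yields three separate characters.
reinterpreted = utf8_bytes.decode("cp1252")
print(ascii(reinterpreted), len(reinterpreted))
```

Nothing is converted here: the bytes stay the same and only their interpretation changes, which is why this route never hits an encoding error.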
What should you do
This character does not exist in the cp1252 encoding, which means you can't convert this file without losing information. Common solutions are to skip such characters or replace them with other similar characters that exist in your encoding (see encoding error handlers)
You are dealing with the "utf-8-sig" encoding -- please specify this one as the encoding argument instead of "utf-8".
There is information on it in the docs (search the page for "utf-8-sig").
To increase the reliability with which a UTF-8 encoding can be detected, Microsoft invented a variant of UTF-8 (that Python 2.5 calls "utf-8-sig") for its Notepad program: Before any of the Unicode characters is written to the file, a UTF-8 encoded BOM (which looks like this as a byte sequence: 0xef, 0xbb, 0xbf) is written. [...]

How to use .write() with characters from foreign languages (ã, à, ê, ó, ...)

I'm working on a small project in Python 3 where I have to scan a drive full of files and output a .txt file with the path of all of the files inside the drive. The problem is that some of the files are in Brazilian Portuguese which has "accented letters" such as "não", "você" and others and those special letters are being output wrongly in the final .txt.
The code is just these few lines below:
import glob
path = r'path/path'
files = [f for f in glob.glob(path + "**/**", recursive=True)]
with open("file.txt", 'w') as output:
    for row in files:
        output.write(str(row.encode('utf-8') )+ '\n')
An example of outputs
path\folder1\Treino_2.doc
path\folder1\Treino_1.doc
path\folder1\\xc3\x81gua de Produ\xc3\xa7\xc3\xa3o.doc
The last line shows how some of the outputs are wrong, since \xc3\x81gua de Produ\xc3\xa7\xc3\xa3o should be Água de Produção.
Python files handle Unicode text (including Brazilian accented characters) directly. All you need to do is using the file in text mode, which is the default unless you explicitly ask open() to give you a binary file. "w" gives you a text file that's writable.
You may want to be explicit about the encoding, however, by using the encoding argument for the open() function:
with open("file.txt", "w", encoding="utf-8") as output:
for row in files:
output.write(row + "\n")
If you don't explicitly set the encoding, then a system-specific default is selected. Not all encodings can encode all possible Unicode codepoints. This happens on Windows more than on other operating systems, where the default ANSI codepage then leads to "'charmap' codec can't encode character" errors, but it can happen on other operating systems as well if the current locale is configured to use a non-Unicode encoding.
Do not encode to bytes and then convert the resulting bytes object back to a string again with str(). That only makes a big mess with string representations and escapes and the b prefix there too:
>>> v = r"path\folder1\Água de Produção.doc"
>>> v.encode("utf8") # bytes are represented with the "b'...'" syntax
b'path\\folder1\\\xc3\x81gua de Produ\xc3\xa7\xc3\xa3o.doc'
>>> str(v.encode("utf8")) # converting back with `str()` includes that syntax
"b'path\\\\folder1\\\\\\xc3\\x81gua de Produ\\xc3\\xa7\\xc3\\xa3o.doc'"
See What does a b prefix before a python string mean? for more details as to what happens here.
You probably just want to write the filename strings directly to the file, without first encoding them as UTF-8, since they are already text strings that the file object encodes for you. That is:
…
for row in files:
    output.write(row + '\n')
Should do the right thing.
I say “probably” since filenames do not have to be valid UTF-8 in some operating systems (e.g. Linux!), and treating those as UTF-8 will fail. In that case your only recourse is to handle the filenames as raw byte sequences — however, this won’t ever happen in your code, since glob already returns strings rather than byte arrays, i.e. Python has already attempted to decode the byte sequences representing the filenames as UTF-8.
You can tell glob to handle arbitrary byte filenames (i.e. non-UTF-8) by passing the globbing pattern as a byte sequence. On Linux, the following works:
filename = b'\xbd\xb2=\xbc \xe2\x8c\x98'
with open(filename, 'w') as file:
    file.write('hi!\n')
import glob
print(glob.glob(b'*')[0])
# b'\xbd\xb2=\xbc \xe2\x8c\x98'
# BUT:
print(glob.glob('*')[0])
#---------------------------------------------------------------------------
#UnicodeEncodeError Traceback (most recent call last)
#<ipython-input-12-2bce790f5243> in <module>
#----> 1 print(glob.glob('*')[0])
#
#UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-1: surrogates not allowed
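The "surrogates not allowed" error comes from the surrogateescape error handler that Python 3 uses when it decodes file names on POSIX systems: each undecodable byte is mapped to a lone surrogate code point so that the original bytes can be recovered later. The sketch below applies the handler by hand, assuming a UTF-8 filesystem encoding; os.fsdecode() and os.fsencode() perform the same step using the actual filesystem encoding.

```python
raw = b"\xbd\xb2=\xbc \xe2\x8c\x98"  # the non-UTF-8 file name from above

# Undecodable bytes become lone surrogates (U+DC80..U+DCFF).
name = raw.decode("utf-8", "surrogateescape")
print(ascii(name))

# The same handler turns the surrogates back into the original bytes.
round_trip = name.encode("utf-8", "surrogateescape")
print(round_trip == raw)

# A strict encode, which is what print() needs, rejects the surrogates.
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
print(strict_ok)
```

So such a name can be passed back to open() or os.rename() unchanged, but it has to be escaped (for example with ascii() or repr()) before it is printed or written to a UTF-8 text file.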

Converting Non-UTF-8 characters to UTF-8

I have some files on my Linux system. Their file names may use an encoding other than UTF-8. I want to convert them from non-UTF-8 characters to UTF-8 characters. How can I do that using a C library function or a Python script?
If you know the character encoding that is used to encode the filenames:
unicode_filename = bytestring_filename.decode(character_encoding)
utf8filename = unicode_filename.encode('utf-8')
If you don't know the character encoding then there is no way in the general case to do the conversion without losing data. "Non-utf8" is not specific enough: for example, if you have a filename that contains the byte b'\xae', it can be interpreted differently depending on the filename encoding. It is u'®' in the cp1252 encoding, but the same byte represents u'«' in cp437. There are modules such as chardet that allow you to guess the character encoding, but it is only a guess: "There Ain't No Such Thing as Plain Text."
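A Python 3 sketch of both points: the two-step conversion when the encoding is known, and the ambiguity of a single byte when it is not. The helper name to_utf8 is made up for this example.

```python
def to_utf8(bytestring_filename, character_encoding):
    """Re-encode a byte-string file name from a known encoding to UTF-8."""
    return bytestring_filename.decode(character_encoding).encode("utf-8")


# The same byte means different characters in different encodings.
as_cp1252 = b"\xae".decode("cp1252")
as_cp437 = b"\xae".decode("cp437")
print(ascii(as_cp1252), ascii(as_cp437))

# So the UTF-8 result depends entirely on which source encoding is assumed.
utf8_from_cp1252 = to_utf8(b"\xae", "cp1252")
utf8_from_cp437 = to_utf8(b"\xae", "cp437")
print(utf8_from_cp1252, utf8_from_cp437)
```

Both conversions succeed without error, which is exactly the problem: a wrong guess produces a valid but incorrect UTF-8 name.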
def converttoutf8(a):
    return unicode(a, "utf-8")
Now, for every filename you iterate through, that will return the filename decoded as UTF-8 into a unicode object.
Or even better, use convmv. It converts filenames from one encoding to another and takes a directory as an argument. Sounds perfect.
