Windows file names displayed as corrupted characters on Linux - Python

I believe this is a common issue caused by the different default character encodings on Linux and Windows. However, after searching the internet I have not found any easy way to fix it automatically, so I am about to write a script to do it.
Here is the scenario:
I created some files on a Windows system, some with non-English names (Chinese, specifically, in my case), and compressed them into a zip file using 7-zip. I then downloaded the zip file to a Linux machine (Ubuntu 16.04 LTS) and extracted it there with the default archive program. As I had guessed, all the non-English file names are now displayed as corrupted characters! At first I thought it would be easy to fix with convmv, but...
I tried convmv, and it says: "Skipping, already utf8". Nothing was changed.
Then I decided to write a tool in Python to do the dirty job. After some testing I reached a point where I could not associate the original file names with the corrupted file names (other than by hashing the contents).
Here is an example. I set up a web server to list the file names on Windows, and one file name, after being decoded as "gbk" in Python, is displayed as
u'j\u63a5\u53e3\u6587\u6863'
I can also query the file names on my Linux system. If I create a file directly with the name shown above, the name is CORRECT. If I encode that unicode string to utf8 and create a file with it, the name is also CORRECT. (So I cannot do both at the same time, since they are in fact the same name.) But when I read the name of the file I extracted earlier, which should be the same file, the file name is completely different:
'j\xe2\x95\x9c\xe2\x95\x99.....'
Decoding it with utf8 gives something like u'j\u255c\u2559...'. Decoding it with gbk raises a UnicodeDecodeError, and decoding it with utf8 and then encoding with gbk also gives something else entirely.
To summarize: I cannot recover the original file name by decoding or encoding it after it has been extracted on the Linux system. If I really want a program to do the job, I either have to redo the archive, possibly with some encoding options, or fall back to my script and use a content hash (such as md5 or sha1) to match each file to its original name on Windows.
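For reference, the content-hash fallback I mention would look roughly like this (just a sketch; the chunked hashing function is generic and the mapping step is only described in the comment):

import hashlib

def md5_of(path, chunk_size=1 << 20):
    # Hash the file contents in chunks so large files are not loaded into memory.
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

# Build {hash: original_name} on the Windows side (e.g. via the web server),
# {hash: extracted_path} on the Linux side, then rename where the hashes match.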
In the case above, is there still any way to infer the original name from a Python script, other than comparing file contents between the two systems?

With a little experimentation with common encodings, I was able to reverse your mojibake:
>>> bad = 'j\xe2\x95\x9c\xe2\x95\x99\xe2\x94\x90\xe2\x94\x8c\xe2\x95\xac\xe2\x94\x80\xe2\x95\xa1\xe2\x95\xa1'
>>> good = bad.decode('utf8').encode('cp437').decode('gbk')
>>> good
u'j\u63a5\u53e3\u6587\u6863'  # u'j接口文档'
gbk - common Chinese Windows encoding
cp437 - common US Windows OEM console encoding
utf8 - common Linux encoding
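If that same round trip holds for the rest of your extracted names, a short script can rename them in place. A minimal sketch (Python 2; the directory path is a placeholder, a flat directory is assumed for simplicity, and names that don't fit the pattern are skipped):

import os

target_dir = '/path/to/extracted/files'    # placeholder, adjust to your extraction folder
for name in os.listdir(target_dir):        # byte-string names on Python 2
    try:
        fixed = name.decode('utf8').encode('cp437').decode('gbk')
    except (UnicodeDecodeError, UnicodeEncodeError):
        continue                           # this name was not mangled this way
    if fixed != name.decode('utf8'):
        os.rename(os.path.join(target_dir, name),
                  os.path.join(target_dir, fixed.encode('utf8')))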

Related

decoding a file with .msr extension

I am trying to decode a file with the .msr extension. This is the data file for an old version of the program "PHYWE measure 4", which is used for recording various physics experiments. It has a completely incomprehensible encoding; I went through all the available encodings in Notepad++ and also tried reading the bytes with Python. The first line contains data like this:
\x19\x05\x06\x07\x08\tmeasure45 FileFormat\x04\x00\x01\x00X\x00\x00\x00\x02\x00\xfc\x1d\xba\x13\x00\x00\x00\x004\xa1\xd3\xdf\xb7\xca\xe5#\xa4\xf8\xb8\x13\xe4\x17\xb9\x13\x00\x00\x00\x00\x02\x00\x00\x00\xa3\xf3\x00\x00\x01\x00\x00\x01\x01\x01\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x02\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\xa4p}?\n
Can you please tell me whether it is possible to extract the numerical data in my case?
The PHYWE measure software uses a proprietary (binary) file format, but you can download the software for free from the PHYWE website:
https://repository.curriculab.net/files/software/setupm.exe
In the measure application, you can use the option "Export data..." to either save the numerical data as text file or copy the values to the clipboard.

What is the rule of thumb for choosing the proper encoding when dealing with Japanese characters in Python

I have been opening and rebuilding 2000+ HTML files with Python's bs4, and all of them are written in Japanese.
The thing is, I cannot find a proper rule-of-thumb encoding for those files.
I made try-excepts for the following encodings, but there are still errors.
utf-8, utf-16, utf-32, shift-jis, cp1252, euc
Nothing worked without errors, and there were even files that I could not open at all (they open in text editors, though! I tried re-saving them as utf-8, but that never worked either!).
I am completely exhausted trying to figure out how to let my code run unattended.
Can anybody help me? Is there an error-free encoding for opening / writing Japanese?
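For reference, a sketch of the try-except chain described above (the file name and exact encoding list are illustrative; Python's codec name for EUC is euc-jp, and a codec like cp1252 rarely raises an error even when it is the wrong guess, so order matters):

def read_japanese_html(path, encodings=('utf-8', 'shift-jis', 'euc-jp', 'utf-16')):
    # Return the decoded text and the first candidate encoding that works.
    with open(path, 'rb') as f:
        raw = f.read()
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError('could not decode %s with any candidate encoding' % path)

text, used_encoding = read_japanese_html('page0001.html')   # placeholder file name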

UnicodeDecodeError when trying to read an hdf file made with python 2.7

I have a bunch of hdf files that I need to read in with pandas pd.read_hdf() but they have been saved in a python 2.7 environment. Nowadays, I'm on python 3.7, and when trying to read them with data = pd.read_hdf('data.h5', 'data'), I'm getting
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 6: invalid start byte
Now, I know those files can contain various special characters like Ä or ö, and 0xf6 probably is ö.
So how do I read this hdf file?
The documentation for read_hdf only specifies mode as a parameter, but that doesn't help here. Apparently this is an old bug in pandas, or rather in the underlying pytables, that can't be fixed. However, that report is from 2017, so I wonder whether it has been fixed since, or whether there is a workaround that I just can't find. According to the bug report, you can also pass encoding='' to the reader, but that does nothing when I specify encoding='UTF8' as suggested in the bug, or encoding='cp1250', which I would assume could be the culprit.
It's quite annoying that a file format meant for archiving data apparently can't be read anymore by the program that produced it after just one version step. I would be perfectly fine with the ös being garbled into ␣ý⌧ or similar fun things, as usual with encoding errors, but simply not being able to read the file at all is an issue.
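For what it's worth, the 0xf6 guess is easy to confirm in an interactive session; in the single-byte Windows/ISO encodings that byte is indeed ö:

>>> b'\xf6'.decode('latin-1')
'ö'
>>> b'\xf6'.decode('cp1250')
'ö'

That suggests the strings stored in the file are single-byte encoded rather than UTF-8, which is why the utf-8 codec rejects them.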

Strange Python version-dependent behavior when writing sorted output using heapq.merge

Some gsutil users have reported a failure when running gsutil rsync, which I've tracked down to what appears to be a Python 2.7.8-specific problem: we write sorted lists of the source and destination directories being synchronized in binary mode ('w+b') and then read those lists back in, also in binary mode ('rb'). This works fine under Python 2.6.x and Python 2.7.3, but under Python 2.7.8 the output ends up in a garbled-looking binary format that doesn't parse correctly when read back in.
If I switch the output to use 'w+' mode instead the problem goes away. But (a) I think I do want to write in binary mode, since these files can contain Unicode, and (b) I'd like to understand why this is a Python version-dependent problem.
Does anyone have any ideas about why this might be happening?
FYI, I tried to reproduce this problem with a short program that just writes a file in binary mode and reads it back in binary mode, but the problem doesn't reproduce with that program. I'm wondering if something about the heapq.merge implementation changed in Python 2.7.8 that might explain this (we sort in batches, and the individual sorted files are fine; it's the output from heapq.merge that gets garbled in binary mode under Python 2.7.8).
Any suggestions/thoughts would be appreciated.
It sounds to me as if the file object hasn't been properly flushed, or no seek has been done between a read and a write action (or vice versa). A binary file object is more susceptible to this, as the OS won't be doing newline translation either. At the C level this can trigger undefined behaviour, and uninitialised memory is then read or written. There is a Python issue about this at http://bugs.python.org/issue1394612.
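A minimal sketch of that rule (the file name is a placeholder): whenever you switch from writing to reading on a 'w+b' file, flush and seek first.

with open('sorted_batch.tmp', 'w+b') as f:
    f.write(b'gs://bucket/obj_a\n')
    f.write(b'gs://bucket/obj_b\n')
    f.flush()   # push buffered data to the OS before changing direction
    f.seek(0)   # an explicit seek is needed between a write and the following read
    lines = f.read().splitlines()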
Why this changed in a Python minor version is interesting, however, and if you have a reproducible case you should definitely report it to the Python project issue tracker.
If you are just writing Unicode, then encode that Unicode to a UTF encoding; you do not need to use a binary file mode for that as UTF-8 will never use newline bytes in other codepoints.
Alternatively, use the io.open() function to open a Unicode-aware file object for your data.
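For example, a sketch using io.open (available since Python 2.6; the file name and object name are made up):

import io

# io.open gives a text-mode file object that encodes on write and decodes on read,
# so unicode object names need no manual .encode()/.decode() calls.
with io.open('file_listing.txt', 'w', encoding='utf-8') as out:
    out.write(u'gs://bucket/photos/n\u00e4me.jpg\n')

with io.open('file_listing.txt', 'r', encoding='utf-8') as src:
    names = src.read().splitlines()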

Python, UTF-8 filesystem, iso-8859-1 files

I have an application written in Python 2.7 that reads users' files from the hard drive using os.walk.
The application requires a UTF-8 system locale (we check the environment variables before it starts) because we handle files with Unicode characters (audio files with the artist name in them, for example) and want to make sure we can save these files to the filesystem with the correct file name.
Some of our users have UTF-8 locales (and therefore a UTF-8 filesystem), but still somehow manage to have ISO-8859-1-encoded file names on their drives. This causes problems when our code tries to os.walk() these directories, as Python throws an exception when it tries to decode these ISO-8859-1 byte sequences as UTF-8.
So my question is: how do I get Python to ignore such a file and move on to the next one instead of aborting the entire os.walk()? Should I just roll my own os.walk() function?
Edit: Until now we've been telling our users to use the convmv Linux command to correct their file names. However, many users have a mix of encodings (8859-1, 8859-2, etc.), and convmv requires the user to make an educated guess about which files use which encoding before running it on each one individually.
Please read Unicode filenames, part of the Python Unicode how-to. Most importantly, filesystem encodings are not necessarily the same as the current LANG setting in the terminal.
Specifically, os.walk is built upon os.listdir, and will thus switch between unicode and 8-bit byte strings depending on whether or not you give it a unicode path.
Pass it an 8-bit (byte string) path instead and your code will work properly; you can then decode each name from UTF-8 or ISO 8859-1 as needed.
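A sketch of that byte-string approach (Python 2, as in the question; the path and the fallback encoding are assumptions for illustration):

import os

unicode_names = []
for root, dirs, files in os.walk('/home/user/Music'):   # str path, so names come back as byte strings
    for name in files:
        try:
            uname = name.decode('utf-8')
        except UnicodeDecodeError:
            uname = name.decode('iso-8859-1')            # best-effort fallback; latin-1 never fails
        unicode_names.append(uname)
        # keep the original byte string `name` for any actual filesystem calls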
Use character-encoding detection: the chardet module for Python works well for determining the actual encoding with some confidence. As for decoding "as needed": you either know the encoding or you have to guess at it, and if chardet guesses wrong, at least you tried.
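A sketch of the chardet approach (the byte string here is a made-up example):

import chardet

raw_name = b'\xd6\xd0\xce\xc4.mp3'      # file name bytes of unknown encoding (example)
guess = chardet.detect(raw_name)        # returns a dict with 'encoding' and 'confidence' keys
if guess['encoding']:
    unicode_name = raw_name.decode(guess['encoding'])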
