I built a Python steganographer that hides UTF-8 text in images, and it works fine for that. I was wondering whether I could encode complete files in images. For this, the program needs to read all kinds of files. The problem is that not all files are encoded as UTF-8, so you have to read them with:
file = open('somefile.docx', encoding='utf-8', errors='surrogateescape')
and if you copy the contents to a new file and open it, the file is no longer readable. I need a way to read all kinds of files and later write them back so that they still work. Is there a way to do this in Python 3?
Thanks.
Change your view. You don't "hide UTF-8 text in images". You hide bytes in images.
These bytes could, purely by accident, be interpretable as UTF-8-encoded text. But in reality they could be anything.
Reading a file as text with open("...", encoding="...") involves the hidden step of decoding the file's bytes into a string. This is convenient when you want to treat the file contents as a string in your program.
Skip that hidden decoding step and read the file as bytes: open("...", "rb").
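A minimal sketch of the byte-level round trip, with placeholder file names:

with open('somefile.docx', 'rb') as f:   # 'rb': raw bytes, no decoding step
    payload = f.read()                   # payload is bytes, not str

# ... embed payload in the image here, extract it again later ...

with open('recovered.docx', 'wb') as f:  # 'wb': raw bytes, no encoding step
    f.write(payload)

As long as the extracted bytes are identical to the originals, the written file works regardless of its format.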
I am simply trying to run a test script in Python using Task Scheduler, for which I need to make a .bat file so that it can run. This is the current bat file:
"C:\Program Files (x86)\Python38-32\python.exe" "C:\Users\Declan\Documents\_Automation\test.py"
pause
However, it is giving me the following error:
C:\Users\Declan\Documents\_Automation>"ÔǬC:\Program Files (x86)\Python
38-32\python.exe" "C:\Users\Declan\Documents\_Automation\test.py"
The filename, directory name, or volume label syntax is incorrect.
C:\Users\Declan\Documents\_Automation>pause
Press any key to continue . . .
This is likely an encoding issue. Check the character encoding of your batch file.
If you don't know how, simply create a new one using your favourite text editor, and instead of copying and pasting the text from the original source, rewrite the batch file from scratch.
However, if you have more than a bare-bones text editor (like Notepad++, UltraEdit, etc.), there will be menu options that let you inspect and change the encoding of an existing file. UTF-8 without a BOM, or ANSI (which varies depending on the code page), are options to try.
In case you're wondering: not all text files are created equal. On disk, a text file (like any file) is just a series of bytes, and each character of the 'text' is represented by a number of bytes (the exact number depends on the encoding). Which character a byte sequence represents depends on the character encoding chosen for the file: many encodings use the same byte sequences for the most common characters, but may use different byte sequences for less common ones, or use specific sequences to represent characters that don't exist in other encodings. Think of special characters that are needed in some languages but not in others, for example.
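If you want to check for invisible leading bytes yourself, here is a quick sketch (the batch file path is hypothetical):

with open('run_test.bat', 'rb') as f:   # binary mode shows the raw bytes
    head = f.read(4)
print(head)   # b'\xef\xbb\xbf...' would be a UTF-8 BOM; other stray leading
              # bytes often come from copy-pasting text off a web page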
When a text file is open for reading using (say) UTF-8 encoding, is it possible to change encoding during the reading?
Motivation: it happens that you need to read a text file that was written using a non-default encoding. The text format may contain information about the encoding used. Take an HTML file as an example, or XML, or ASCIIDOC, and many others. In such cases, the lines above the encoding information are allowed to contain only ASCII, or some default encoding.
In Python, it is possible to read the file in binary mode and translate the lines of bytes type to str on your own. When the information about the encoding is found on some line, you just switch the encoding used for converting the lines to Unicode strings.
In Python 3, text files are implemented using TextIOBase, which also defines the encoding attribute, the buffer, and other things.
Is there any nice way to change the encoding information (used for decoding the bytes) so that the next lines would be decoded in the wanted way?
Classic usage is:
Open the file in binary mode (bytes strings)
read a chunk and guess the encoding (for instance with a simple scan or a regex)
Then:
close the file and re-open it in text mode with the found encoding
Or
move to the beginning: seek(0), read the whole content as a bytes string, then decode the content using the found encoding.
See this example: Detect character encoding in an XML file (Python recipe)
Note: the code is a little old, but useful.
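A minimal sketch of the second variant, assuming an XML-style encoding declaration and a hypothetical file name:

import re

with open('data.xml', 'rb') as f:
    chunk = f.read(1024)                                  # sniff a small chunk
    m = re.search(rb'encoding=["\']([A-Za-z0-9_-]+)["\']', chunk)
    enc = m.group(1).decode('ascii') if m else 'utf-8'    # fall back to UTF-8
    f.seek(0)                                             # back to the start
    text = f.read().decode(enc)                           # decode everything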
I exported some measurements (not image files) from Photoshop in CS4. This page describes the tool in CS6 and up, where the measurements nicely export to a csv with utf-8 encoding. In CS4, it exports as a tab-delimited text, and I can't figure out what the encoding is.
Now I want to read the file in Python using pandas or csv. I've tried utf8, ISO-8859-1, US-ASCII, cp1252, and latin1 encodings, all of which give me errors (e.g. invalid start byte, NULL byte, and EOF inside string).
How can I read this file?
I was getting close to posting one of these files here to see if someone could figure it out, but before doing so I opened one in Excel. When I did Save As, Excel defaulted to a utf-16 text file.
I was able to open these files with csv using encoding='utf-16', and with pandas read_table using encoding='utf-16' and engine='python'. read_csv didn't work.
Seems rather obscure to me, so I'm posting in case this helps someone else.
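For anyone hitting the same wall, a sketch of both approaches that worked (the file name is assumed):

import csv
import pandas as pd

# pandas: read_table with the python engine handles the UTF-16 tab-delimited file
df = pd.read_table('measurements.txt', encoding='utf-16', engine='python')

# csv module: pass the encoding to open(); the delimiter is a tab
with open('measurements.txt', encoding='utf-16', newline='') as f:
    rows = list(csv.reader(f, delimiter='\t'))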
I want to read all files from a folder (with os.walk) and convert them to one encoding (UTF-8). The problem is that those files don't all have the same encoding. They could be UTF-8, UTF-8 with BOM, or UTF-16.
Is there any way to read those files without knowing their encoding?
You can read those files in binary mode. And there is the chardet module. With it you can detect the encoding of your files and decode the data you get, though this module has limitations.
As an example:
from chardet import detect

with open('your_file.txt', 'rb') as ef:
    raw = ef.read()
result = detect(raw)                   # e.g. {'encoding': 'UTF-16', 'confidence': 1.0}
text = raw.decode(result['encoding'])  # decode using the guessed encoding
If it is indeed always one of these three, then it is easy. If you can read the file using UTF-8, it is probably UTF-8; otherwise it will be UTF-16. Python can also automatically discard the BOM if present (that is what the 'utf-8-sig' codec below does).
You can use a try ... except block to try both:
try:
    tryToConvertMyFile(src, dst, 'utf-8-sig')   # src/dst: 'from' is a reserved word in Python
except UnicodeDecodeError:
    tryToConvertMyFile(src, dst, 'utf-16')
If other encodings are present as well (like ISO-8859-1) then forget it, there is no 100% reliable way of figuring out the encoding. But you can guess—see for example Is there a Python library function which attempts to guess the character-encoding of some bytes?
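For reference, a sketch of what such a helper might look like; tryToConvertMyFile is a placeholder here, not a library function:

def tryToConvertMyFile(src, dst, enc):
    # Reading with the wrong encoding raises UnicodeDecodeError,
    # which lets the caller's except clause try the next candidate.
    with open(src, encoding=enc) as f:
        text = f.read()
    with open(dst, 'w', encoding='utf-8') as f:
        f.write(text)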
I want to write a Python script that converts a file's encoding from cp949 to UTF-8. The file is originally encoded in cp949.
My script is as follows:
cpstr = open('terms.rtf').read()
utfstr = cpstr.decode('cp949').encode('utf-8')
tmp = open('terms_utf.rtf', 'w')
tmp.write(utfstr)
tmp.close()
But this doesn't change the encoding as I intended.
There are three kinds of RTF, and I have no idea which kind you have. You can tell by opening the file in a plain-text editor, or just using less/more/cat/type/whatever to print it out to your terminal.
First, the easy cases: plaintext RTF.
A plaintext RTF file starts off with {\rtf, and all of the text within it is (as you'd expect) plain text, although runs of text are sometimes broken up into separate runs with formatting commands (which start with \) in between them. Since all of the formatting commands are pure ASCII, if you convert a plaintext RTF from one charset to another (as long as both are supersets of ASCII, as cp949 and UTF-8 both are), it should work fine.
However, the file may also have a formatting command that specifies what character set it's written in. This command looks like \ansicpg949. When an RTF editor like Wordpad opens your file, it will interpret all your nice UTF-8 data as cp949 data and mojibake the hell out of it unless you fix it.
The simplest way to fix it is to figure out what charset your editor wants to put there for UTF-8 files. Maybe it's \ansicpg65001, maybe it's \utf8, maybe it's something completely different. So just save a simple file as a UTF-8 RTF, then look at it in plain text, and see what it has in place of \ansicpg949, and replace the string in your file with the right one. (Note that code page 65001 is not really UTF-8, but it's close, and a lot of Microsoft code assumes they're the same…)
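A minimal sketch of that replacement step, assuming the editor expects \ansicpg65001 (check what yours actually writes first) and hypothetical file names:

with open('terms.rtf', encoding='cp949') as f:
    text = f.read()

# The replacement token below is an assumption, not a fixed part of the RTF spec.
text = text.replace(r'\ansicpg949', r'\ansicpg65001')

with open('terms_utf.rtf', 'w', encoding='utf-8') as f:
    f.write(text)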
Also, some RTF editors (like Apple's TextEdit) escape any non-ASCII characters (so, e.g., an é is stored as \'e9), in which case there's nothing to convert.
Finally, Office Open XML includes an XML spec for something that's called RTF, but isn't really the same thing. I believe many RTF editors can handle this. Fortunately, you can treat this the same way as plaintext RTF—all of the XML tags have pure-ASCII names.
The almost-as-easy case is compressed plaintext RTF. This is the same thing, but compressed, I believe with zlib. Or it can actually be RTFD (which can be plaintext RTF together with images and other things in separate files, or actual plain text with the formatting runs stored in a separate file) in a .zip archive. Anyway, if you have one of these, the file command on most Unix systems should be able to detect it as "compressed RTF", at which point we can figure out what the specific format is and decompress it, and then you can edit it as plaintext RTF (or RTFD).
Needless to say, if you don't uncompress this first, you won't see any of your familiar text in the file—and you could easily end up breaking it so it can't be decompressed, or decompresses to garbage, by changing arbitrary bytes to different bytes.
Finally, the hard case: binary RTF.
The earliest versions of these were in an undocumented format, although they've been reverse-engineered. The later versions are public specs. Wikipedia has links to the specs. If you want to parse it manually you can, but it's going to be a substantial amount of code, and you're going to have to write it yourself.
A better solution would be to use one of the many libraries on PyPI that can convert RTF (including binary RTF) to other formats, which you can then edit easily.
import codecs

# Read the file, decoding from cp949 into a Unicode string.
cpstr = codecs.open('terms.rtf', 'r', 'cp949').read()

# Write it back out; the utf-8 writer handles the encoding.
tmp = codecs.open('terms_utf.rtf', 'w', 'utf-8')
tmp.write(cpstr)
tmp.close()