How to convert a CP949 RTF to a UTF-8 encoded RTF? - python

I want to write a Python script that converts a file from cp949 to utf-8 encoding. The file is originally encoded in cp949.
My script is as follows:
cpstr = open('terms.rtf').read()
utfstr = cpstr.decode('cp949').encode('utf-8')
tmp = open('terms_utf.rtf', 'w')
tmp.write(utfstr)
tmp.close()
But this doesn't change the encoding as I intended.

There are three kinds of RTF, and I have no idea which kind you have. You can tell by opening the file in a plain-text editor, or just using less/more/cat/type/whatever to print it out to your terminal.
First, the easy cases: plaintext RTF.
A plaintext RTF file starts off with {\rtf, and all of the text within it is (as you'd expect) plain text—although sometimes runs of text will be broken up into separate runs with formatting commands—which start with \—in between them. Since all of the formatting commands are pure ASCII, if you convert a plaintext RTF from one charset to another (as long as both are supersets of ASCII, as cp949 and utf-8 both are), it should work fine.
However, the file may also have a formatting command that specifies what character set it's written in. This command looks like \ansicpg949. When an RTF editor like Wordpad opens your file, it will interpret all your nice UTF-8 data as cp949 data and mojibake the hell out of it unless you fix it.
The simplest way to fix it is to figure out what charset your editor wants to put there for UTF-8 files. Maybe it's \ansicpg65001, maybe it's \utf8, maybe it's something completely different. So just save a simple file as a UTF-8 RTF, then look at it in plain text, and see what it has in place of \ansicpg949, and replace the string in your file with the right one. (Note that code page 65001 is not really UTF-8, but it's close, and a lot of Microsoft code assumes they're the same…)
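For example, a minimal sketch of that whole round trip, assuming the replacement control word turns out to be \ansicpg65001 - verify this against whatever your own editor actually writes for UTF-8 RTFs:
# Decode from cp949, swap the charset control word, re-encode as UTF-8.
with open('terms.rtf', 'rb') as f:
    text = f.read().decode('cp949')

text = text.replace(r'\ansicpg949', r'\ansicpg65001')   # assumption: adjust to your editor's control word

with open('terms_utf.rtf', 'wb') as f:
    f.write(text.encode('utf-8'))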
Also, some RTF editors (like Apple's TextEdit) will escape any non-ASCII characters (so, e.g., an é is stored as \'e9), so there's nothing to convert.
Finally, Office Open XML includes an XML spec for something that's called RTF, but isn't really the same thing. I believe many RTF editors can handle this. Fortunately, you can treat this the same way as plaintext RTF—all of the XML tags have pure-ASCII names.
The almost-as-easy case is compressed plaintext RTF. This is the same thing, but compressed with, I believe, zlib. Or it can actually be RTFD (which can be plaintext RTF together with images and other things in separate files, or actual plain text with formatting runs stored in a separate file) in a .zip archive. Anyway, if you have one of these, the file command on most Unix systems should be able to detect it as "compressed RTF", at which point we can figure out what the specific format is and decompress it, and then you can edit it as plaintext RTF (or RTFD).
Needless to say, if you don't uncompress this first, you won't see any of your familiar text in the file—and you could easily end up breaking it so it can't be decompressed, or decompresses to garbage, by changing arbitrary bytes to different bytes.
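As a rough sketch of telling the variants apart before editing anything - the {\rtf prefix and the zip check follow from the descriptions above, and anything that matches neither gets punted to the file command:
import zipfile

with open('terms.rtf', 'rb') as f:
    head = f.read(5)

if head.startswith(b'{\\rtf'):
    print('plaintext RTF')
elif zipfile.is_zipfile('terms.rtf'):
    print('zip archive, possibly RTFD')
else:
    print('something else - compressed or binary RTF; try the file command')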
Finally, the hard case: binary RTF.
The earliest versions of these were in an undocumented format, although they've been reverse-engineered. The later versions are public specs. Wikipedia has links to the specs. If you want to parse it manually you can, but it's going to be a substantial amount of code, and you're going to have to write it yourself.
A better solution would be to use one of the many libraries on PyPI that can convert RTF (including binary RTF) to other formats, which you can then edit easily.

import codecs

# decode from cp949 while reading...
cpstr = codecs.open('terms.rtf', 'r', 'cp949').read()

# ...and encode to UTF-8 while writing
tmp = codecs.open('terms_utf.rtf', 'w', 'utf-8')
tmp.write(cpstr)
tmp.close()
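In Python 3 the same transcoding can be done with the built-in open(), which takes an encoding argument directly. (As the answer above explains, this only changes the byte encoding; any \ansicpg control word in the RTF still has to be updated separately.)
with open('terms.rtf', 'r', encoding='cp949') as src, \
        open('terms_utf.rtf', 'w', encoding='utf-8') as dst:
    dst.write(src.read())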

Related

UnicodeDecodeError "charmap" when trying to import a .json file into python [duplicate]

I'm trying to scrape a website, but it gives me an error.
I'm using the following code:
import urllib.request
from bs4 import BeautifulSoup
get = urllib.request.urlopen("https://www.website.com/")
html = get.read()
soup = BeautifulSoup(html)
print(soup)
And I'm getting the following error:
File "C:\Python34\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 70924-70950: character maps to <undefined>
What can I do to fix this?
I was getting the same UnicodeEncodeError when saving scraped web content to a file. To fix it I replaced this code:
with open(fname, "w") as f:
f.write(html)
with this:
with open(fname, "w", encoding="utf-8") as f:
f.write(html)
If you need to support Python 2, then use this:
import io
with io.open(fname, "w", encoding="utf-8") as f:
    f.write(html)
If you want to use a different encoding than UTF-8, specify whatever your actual encoding is for encoding.
I fixed it by adding .encode("utf-8") to soup.
That means that print(soup) becomes print(soup.encode("utf-8")).
In Python 3.7 running on Windows 10, this worked (I am not sure whether it will work on other platforms and/or other versions of Python):
Replacing this line:
with open('filename', 'w') as f:
With this:
with open('filename', 'w', encoding='utf-8') as f:
The reason this works is that the file is now opened with UTF-8 as its encoding, so any character in the data can be written out, instead of raising an error when a character is encountered that is not supported by the default encoding.
Another option is to set the interpreter's standard-stream encoding through environment variables before launching Python:
set PYTHONIOENCODING=utf-8
set PYTHONLEGACYWINDOWSSTDIO=utf-8
You may or may not need to set that second environment variable PYTHONLEGACYWINDOWSSTDIO.
Alternatively, this can be done in code (although it seems that doing it through env vars is recommended):
import sys
sys.stdin.reconfigure(encoding='utf-8')
sys.stdout.reconfigure(encoding='utf-8')
Additionally: Reproducing this error was a bit of a pain, so leaving this here too in case you need to reproduce it on your machine:
set PYTHONIOENCODING=windows-1252
set PYTHONLEGACYWINDOWSSTDIO=windows-1252
While saving the response of a GET request, the same error was thrown on Python 3.7 on Windows 10. The response received from the URL was UTF-8 encoded, so it is always recommended to check the encoding and pass the same to open(); this trivial issue can kill a lot of time in production.
import requests
resp = requests.get('https://en.wikipedia.org/wiki/NIFTY_50')
print(resp.encoding)
with open('NiftyList.txt', 'w') as f:
    f.write(resp.text)
When I added encoding="utf-8" to the open command, it saved the file with the correct response:
with open('NiftyList.txt', 'w', encoding="utf-8") as f:
    f.write(resp.text)
I faced the same encoding issue; it occurs when you try to print, read/write, or open the data. As others mentioned above, adding .encode("utf-8") will help if you are trying to print it:
soup.encode("utf-8")
If you are trying to open the scraped data and maybe write it into a file, then open the file with encoding="utf-8":
with open(filename_csv, 'w', newline='', encoding="utf-8") as csv_file:
For those still getting this error, adding encode("utf-8") to soup will also fix this.
soup = BeautifulSoup(html_doc, 'html.parser').encode("utf-8")
print(soup)
There are multiple aspects to this problem. The fundamental question is which character set you want to output into. You may also have to figure out the input character set.
Printing (with either print or write) into a file with an explicit encoding="..." will translate Python's internal Unicode representation into that encoding. If the output contains characters which are not supported by that encoding, you will get a UnicodeEncodeError. For example, you can't write Russian or Chinese or Indic or Hebrew or Arabic or emoji or ... anything except a restricted set of some 200+ Western characters to a file whose encoding is "cp1252" because this limited 8-bit character set has no way to represent these characters.
Basically the same problem will occur with any 8-bit character set, including nearly all the legacy Windows code pages (437, 850, 1250, 1251, etc etc), though some of them support some additional script in addition to or instead of English (1251 supports Cyrillic, for example, so you can write Russian, Ukrainian, Serbian, Bulgarian, etc). An 8-bit encoding has only a maximum of 256 character codes and no way to represent a character which isn't among them.
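A quick way to see this for yourself in an interactive session (the exact character positions in your own traceback will of course differ):
>>> "кошка".encode("cp1252")
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-4: character maps to <undefined>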
Perhaps now would be a good time to read Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
On platforms where the terminal is not capable of printing Unicode (only Windows these days really, though if you're into retrocomputing, this problem was also prevalent on other platforms in the previous millennium) attempting to print Unicode strings can also produce this error, or output mojibake. If you see something like HÃ©llÃ¶ instead of Héllö, this is your issue.
In short, then, you need to know:
What is the character set of the page you scraped, or the data you received? Was it correctly scraped? Did the originator correctly identify its encoding, or are you able to otherwise obtain this information (or guess it)? Some web sites incorrectly declare a different character set than the page actually contains, some sites have incorrectly configured the connection between the web server and a back-end database. See e.g. scrape with correct character encoding (python requests + beautifulsoup) for a more detailed example with some solutions.
What is the character set you want to write? If printing to the screen, is your terminal correctly configured, and is your Python interpreter configured identically?
Perhaps see also How to display utf-8 in windows console
If you are here, probably the answer to one of these questions is not "UTF-8". This is increasingly becoming the prevalent encoding for web pages, too, though the former standard was ISO-8859-1 (aka Latin-1) and more recently Windows code page 1252.
Going forward, you basically want all your textual data to be Unicode, outside of a few fringe use cases. Generally, that means UTF-8, though on Windows (or if you need Java compatibility), UTF-16 is also vaguely viable, albeit somewhat cumbersome. (There are several other Unicode serialization formats, which may be useful in specialized circumstances. UTF-32 is technically trivial, but takes up a lot more memory; UTF-7 is used in a few network protocols where 7-bit ASCII is required for transport.)
Perhaps see also https://utf8everywhere.org/
Naturally, if you are printing to a file, you also need to examine that file using a tool which can correctly display it. A common pilot error is to open the file using a tool which only displays the currently selected system encoding, or one which tries to guess the encoding, but guesses wrong. Again, when viewing UTF-8 text using Windows code page 1252, a common symptom is, for example, Héllö displaying as HÃ©llÃ¶.
If the encoding of character data is unknown, there is no simple way to automatically establish it. If you know what the text is supposed to represent, you can perhaps infer it, but this is typically a manual process with some guesswork involved. (Automatic tools like chardet and ftfy can help, but they get it wrong some of the time, too.)
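If you do want to try automated guessing anyway, here is a minimal sketch with chardet (pip install chardet); treat the result as a hint rather than a fact, and note that the filename is just a placeholder:
import chardet

with open('mystery.txt', 'rb') as f:
    raw = f.read()

guess = chardet.detect(raw)        # returns a dict, e.g. {'encoding': 'windows-1252', 'confidence': 0.73, ...}
print(guess['encoding'], guess['confidence'])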
To establish which encoding you are looking at, it can be helpful if you can identify the individual bytes in a character which isn't displayed correctly. For example, if you are looking at H\x8ell\x9a but expect it to represent Héllö, you can look up the bytes in a translation table. I have published one such table at https://tripleee.github.io/8bit where you can see that in this example, it's probably one of the legacy Mac 8-bit character sets; with more data points, perhaps you can narrow it down to just one of them (and if not, any one of them will do in practice, since all the code points you care about map to the same Unicode characters).
Python 3 on most platforms defaults to UTF-8 for all input and output, but on Windows, this is commonly not the case. It will then instead default to the system's default encoding (still misleadingly called "ANSI code page" in some Microsoft documentation), which depends on a number of factors. On Western systems, the default encoding out of the box is commonly Windows code page 1252.
(Earlier Python versions had somewhat different expectations, and in Python 2, the internal string representation was not Unicode.)
If you are on Windows and write UTF-8 to a text file, maybe specify encoding="utf-8-sig" which adds a BOM sequence at the beginning of the file. This is strictly speaking not necessary or correct, but some Windows tools need it to correctly identify the encoding.
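For example (the filename is a placeholder):
with open('report.txt', 'w', encoding='utf-8-sig') as f:
    f.write('Héllö wörld\n')   # the utf-8-sig codec writes the BOM (EF BB BF) before the text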
Several of the earlier answers here suggest blindly applying some encoding, but hopefully this should help you understand how that's not generally the correct approach, and how to figure out - rather than guess - which encoding to use.
From Python 3.7 onwards, set the environment variable PYTHONUTF8 to 1.
The following script also includes other useful variables which set system environment variables:
setx /m PYTHONUTF8 1
REM In CMD, a Python file can then be executed without its extension:
setx PATHEXT "%PATHEXT%;.PY"
REM Set the default Python version for the py launcher:
setx /m PY_PYTHON 3.10
I got the same error, so I used encoding="utf-8" and it solved the error.
This generally happens when the text data contains some symbol or pattern that the encoder does not understand.
with open("text.txt", "w", encoding='utf-8') as f:
f.write(data)
This will solve your problem.
If you are using Windows, try passing encoding='latin1', encoding='iso-8859-1', or encoding='cp1252'.
example:
import pandas as pd

csv_data = pd.read_csv(csvpath, encoding='iso-8859-1')
print(soup.encode('iso-8859-1'))

UTF-8 issue - Is there a way to convert strange looking characters like Ã¤ to the proper German character ä in Python?

I have a .txt file which should contain German umlauts like ä, ö, ß, ü. But these characters don't appear as such; instead, what appears is Ã¤ instead of ä, Ãœ instead of Ü, and so on. It happens because the .txt file is stored in ANSI encoding. Now, when I import this file, with the respective columns as strings, into either SAS (DATA step) or Python (with .read_csv), then these strange characters appear in the .sas7bdat dataset and the Python DataFrame as such, instead of the proper characters like ä, ö, ü, ß.
One work around to solve this issue is -
Open the file in standard Notepad.
Press 'Save As' and then a window appears.
Then in the drop down, change encoding to UTF-8.
Now, when you import the files, in SAS or Python, then everything is imported correctly.
But, sometimes the .txt files that I have are very big (in GBs), so I cannot open them and do this hack to solve this issue.
I could use the .replace() function to replace these strange characters with the real ones, but there could be some combinations of strange characters that I am not aware of, which is why I wish to avoid that.
Is there any Python library which can automatically translate these strange characters into their proper characters - like Ã¤ gets translated to ä and so on?
Did you try to use the codecs library?
import codecs
your_file = codecs.open('your_file.extension', 'w', 'encoding_type')
If the file contains the correct code points, you just have to specify the correct encoding. Python 3 will default to UTF-8 on most sane platforms, but if you need your code to also run on Windows, you probably want to spell out the encoding.
with open(filename, 'r', encoding='utf-8') as f:
    # do things with f
If the file actually contains mojibake, there is no simple way in the general case to revert every possible way to screw up text; but a common mistake is assuming the text was Latin-1 and converting it to UTF-8 when in fact the input was already UTF-8. What you can do then is say you want Latin-1, and probably make sure you save it in the correct format as soon as you have read it.
with open(filename, 'r', encoding='latin-1') as inp, \
        open('newfile', 'w', encoding='utf-8') as outp:
    for line in inp:
        outp.write(line)
The ftfy library claims to be able to identify and correct a number of common mojibake problems.
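For example, a minimal sketch with ftfy (pip install ftfy), assuming the garbling is the common UTF-8-read-as-Latin-1/cp1252 kind described above:
import ftfy

print(ftfy.fix_text('SchÃ¶ne GrÃ¼ÃŸe aus MÃ¼nchen'))   # expected output: 'Schöne Grüße aus München'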

python - Reading all kinds of files in different encodings

I built a Python steganographer that hides UTF-8 text in images and it works fine for it. I was wondering if I could encode complete files in images. For this, the program needs to read all kinds of files. The problem is that not all files are encoded with UTF-8 and therefore, you have to read them with:
file = open('somefile.docx', encoding='utf-8', errors='surrogateescape')
and if you copy it to a new file and read that back, the files are reported as not decipherable. I need a way to read all kinds of files and later write them so that they still work. Do you have a way to do this in Python 3?
Thanks.
Change your view. You don't "hide UTF-8 text in images". You hide bytes in images.
These bytes could be - purely accidentally - interpretable as UTF-8-encoded text. But in reality they could be anything.
Reading a file as text with open("...", encoding="...") has the hidden step of decoding the bytes of the file into string. This is convenient when you want to treat the file contents as string in your program.
Skip that hidden decoding step and read the file as bytes: open("...", "rb").
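A minimal sketch of that byte-level round trip (the filenames here are just placeholders):
with open('somefile.docx', 'rb') as f:
    payload = f.read()               # raw bytes, no decoding step involved

# ... hide `payload` in the image and extract it again later ...

with open('restored.docx', 'wb') as f:
    f.write(payload)                 # written back unchanged, so the file still opens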

Change to recognized encoding when reading a text file?

When a text file is open for reading using (say) UTF-8 encoding, is it possible to change encoding during the reading?
Motivation: It happens that you need to read a text file that was written using a non-default encoding. The text format may contain information about the encoding used. Take an HTML file as an example, or XML, or AsciiDoc, and many others. In such cases, the lines above the encoding information are allowed to contain only ASCII or some default encoding.
In Python, it is possible to read the file in binary mode, and translate the lines of bytes type to str on your own. When the information about the encoding is found on some line, you just switch the encoding to be used when converting the lines to unicode strings.
In Python 3, text files are implemented using TextIOBase, which also defines the encoding attribute, the buffer, and other things.
Is there any nice way to change the encoding information (used for decoding the bytes) so that the next lines would be decoded in the wanted way?
Classic usage is:
Open the file in binary format (bytes string)
read a chunk and guess the encoding (for instance with simple scanning or a regex)
Then:
close the file and re-open it in text mode with the found encoding (see the sketch below)
Or
move to the beginning: seek(0), read the whole content as a bytes string then decode the content using the found encoding.
See this example: Detect character encoding in an XML file (Python recipe)
note: the code is a little old, but useful.
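A minimal sketch of the first option (close and re-open with the found encoding); detect_encoding() is a made-up helper standing in for whatever scanning or regex you run over the ASCII-only prologue, and the filename is a placeholder:
import re

def detect_encoding(path, default='utf-8'):
    with open(path, 'rb') as f:
        head = f.read(1024)                         # the prologue is plain ASCII per the formats above
    m = re.search(rb'encoding="([\w-]+)"', head)    # e.g. an XML declaration
    return m.group(1).decode('ascii') if m else default

enc = detect_encoding('document.xml')
with open('document.xml', 'r', encoding=enc) as f:
    text = f.read()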

Trying to unpack/decode proprietary data file in Python

tl;dr - While trying to reverse engineer a proprietary database file, I found that Wordpad was able to automagically decode some of the data into a legible format. I'm trying to implement that decoding in python. Now, even the Wordpad voodoo is not repeatable.
Ready for a brain teaser?
I'm trying to crack a bit of a strange problem. I have a data file, it is the database of a program for a scientific instrument (Mettler DSC / STARe software), and I'm trying to grab sample information from experiments. From my digging around in the file, it appears to consist of plaintext, unencrypted information about the experiments run, along with data. It's a .t00 file, over 40 mb in size (it stores essentially all the data of the runs), and I know very little about the encoding (other than it's seemingly arbitrary. It's not meant to be a text file). I can open this file in Wordpad and can see the information I'm looking for (sample names, timestamp, experiment parameters), surrounded by experimental run data (as expected, this looks like lots of gobbledygook, e.g. ¶+ú#”‹ø#ðßö#¨...). It seems like I basically got lucky with it being able to make some sense of the contents, and I'm trying to replicate that.
I can read the file into python with a basic file handler and use regex to get some of the pieces of info I want. 'r' vs 'rb' doesn't seem to help.
def textOpenLines(filename, mode='rb'):
    with open(filename, mode) as content_file:
        return [line for line in content_file]
I'm able to take that list and search it for relevant strings and get the sample name from it. BUT from looking at the file in Wordpad, I found that the sample name is listed twice, the second time it has the datestamp following it (e.g. 'Dibenzoylperoxid 120 C 03.05.1994 14:24:30'). In python, I can't find this string. I can't find even the timestamp by itself. When I look at the line where it is supposed to occur, I get a bunch of random bytes. Opening in Notepad looks like the python output.
I suspect it's an encoding issue. I've tried reading the file in as Unicode, I've tried taking snippets of lines and reading those in, but I can't crack it. I'm stumped.
Any thoughts on how to read this in so that it decodes right? Wordpad got it right (though now subsequently trying to open it, it looks like the Notepad output).
Thanks!!
Edit:
I don't know who changed the title, but of course it 'looks like random bytes in Python/Notepad'. It's mostly data.
It's not meant to be a text file. I sorta got lucky with the Wordpad opening
It's not corrupted. The DSC instrument program reads it just fine. It's just proprietary so I have no idea how it ticks.
I've tried using 'r', 'rb', and 'U' flags.
I've tried codecs.open using utf8, 16 and 32, but it gives UnicodeDecodeError: 'utf8' codec can't decode byte 0xdf in position 49: invalid continuation byte. I don't think it has a BOM, because I don't think it's meant to be human readable.
First 32 bytes (f.read(32)) reads
'\x10 \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x04\x10\x00\x00'
I don't know much about BOMs, but from reading the Wiki page, that doesn't look like any of the valid UTF markings.
The start of the file, when first automagically decoded in Wordpad, looks like this:
121 22Dibenzoylperoxid 120 C 03.05.1994 14:24:30 1 0 4096 ESTimeAI–#£®#nôÂ#49Õ#kÉå#FÞò#`sþ#N5A2­A®"A"A—¥A¿ÝA¡zA"ÓAÿãAÐÅAäHA‚œAÑÌAŸäA¤ÆAE–AFNATöAÐ|AõAº^A(ÄAèAýqA¹AÖûAº8A¬uAK«AgÜAüAÞAo4A>N
AfAB
The start of the file, when opened in Notepad, Python, and now Wordpad, looks like this:
(empty bytes x00...)](x00...)eß(x00...)NvN(x00)... etc
Your file is not composed of ASCII characters but is being interpreted as such by the applications that open it. The same thing would happen if you opened a .jpg image in Wordpad - you would get a bunch of binary along with some ASCII characters that happen to be printable and recognizable to the human eye.
This is the reason why you can't do a plain-text search for your timestamp, for example.
Here is an example in code to demonstrate the issue. In your binary file you have the following bytes:
\x44\x69\x62\x65\x6e\x7a\x6f\x79\x6c\x70\x65\x72\x6f\x78\x69\x64\x20\x31\x32\x30\x20\x43\x20\x30\x33\x2e\x30\x35\x2e\x31\x39\x39\x34\x20\x31\x34\x3a\x32\x34\x3a\x33\x30
If you were to open this inside of a text editor like wordpad it would render the following:
Dibenzoylperoxid 120 C 03.05.1994 14:24:30
Here is a code snippet in Python:
>>> c = '\x44\x69\x62\x65\x6e\x7a\x6f\x79\x6c\x70\x65\x72\x6f\x78\x69\x64\x20\x31\x32\x30\x20\x43\x20\x30\x33\x2e\x30\x35\x2e\x31\x39\x39\x34\x20\x31\x34\x3a\x32\x34\x3a\x33\x30'
>>> print c
Dibenzoylperoxid 120 C 03.05.1994 14:24:30
These bytes are shown above in their hexadecimal escape form, which is why a plain-text search for the string you see rendered will not necessarily match what is actually in the raw file.
The reason for this is that the binary file follows a very particular structure (a protocol or specification) so that the program that reads it can parse it correctly. If you take a JPEG image as an example, you will find that the first and last bytes of the image are always the same (depending on the format used): FF D8 will be the first two bytes of a JPEG and FF D9 will be the last two bytes, identifying it as such. An image editing program then knows to start parsing this binary data as a JPEG and will "walk" the structures inside the file to render the image. There are resources that help you identify files based on such "signatures" or "headers"; the first two bytes of your file, 10 00, do not show up in those databases, so you are likely dealing with a proprietary format and won't be able to find the specs online very easily. This is where reverse engineering comes in handy.
I would recommend you open your file up in a hexeditor - it will give you both the hexadecimal output as well as the ascii output so that you can start to analyze the file format. I personally use the Hackman Hexeditor found here (it's free and has a lot of features).
But for now - to give you something useful for searching the file for the data you are interested in, here is a quick method to convert your search queries to binary before starting the search.
import struct

# binary_data = open("your_binary_file.bin", "rb").read()
# your binary data would show up as a big string like this one when you .read()
binary_data = '\x44\x69\x62\x65\x6e\x7a\x6f\x79\x6c\x70\x65\x72\x6f\x78\x69\x64\x20\x31\x32\x30\x20\x43\x20\x30\x33\x2e\x30\x35\x2e\x31\x39\x39\x34\x20\x31\x34\x3a\x32\x34\x3a\x33\x30'

def search(text):
    # convert the text to binary first
    s = ""
    for c in text:
        s += struct.pack("b", ord(c))
    results = binary_data.find(s)
    if results == -1:
        print "no results found"
    else:
        print "the string [%s] is found at position %s in the binary data" % (text, results)

search("Dibenzoylperoxid")
search("03.05.1994")
The results of the above script are:
the string [Dibenzoylperoxid] is found at position 0 in the binary data
the string [03.05.1994] is found at position 25 in the binary data
This should get you started.
it's FutureMe.
You probably got lucky with the Wordpad thing. I don't know for sure, because that data is long gone, but I am guessing Wordpad made a valiant effort to try to decode the file as UTF-8 (or maybe UTF-16 or CP1252). The reason this seemed to work was that in most binary protocols, strings are probably encoded as UTF-8, so for the ASCII character set, they will look the same in the dump. However, everything else is going to be binary encoded.
You had the right idea with open(fn, 'rb') but you should have just read the whole blob in, rather than readlines, which tries to split on \n. Since the db file isn't \n delimited, that just won't work.
A better approach would have been to take a histogram of the bytes and try to infer what the field/row separators are, if they even exist. Look for TLV (type-length-value) encoded fields. Since you know the list of sample names, you could take a list of the starting strings, use that to find slice points in the blob, and determine how regular the field sizes are.
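A minimal sketch of that byte-histogram idea in Python 3 (the filename is a placeholder):
from collections import Counter

with open('data.t00', 'rb') as f:
    blob = f.read()

hist = Counter(blob)                              # iterating bytes in Python 3 yields ints 0-255
for byte, count in hist.most_common(10):
    print('0x{:02x}: {}'.format(byte, count))     # very frequent bytes hint at padding or separators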
Also, buy bitcoin.
