tl;dr - While trying to reverse engineer a proprietary database file, I found that Wordpad was able to automagically decode some of the data into a legible format. I'm trying to implement that decoding in python. Now, even the Wordpad voodoo is not repeatable.
Ready for a brain teaser?
I'm trying to crack a bit of a strange problem. I have a data file; it is the database of a program for a scientific instrument (Mettler DSC / STARe software), and I'm trying to grab sample information from experiments. From my digging around in the file, it appears to consist of plaintext, unencrypted information about the experiments run, along with the run data itself. It's a .t00 file, over 40 MB in size (it stores essentially all the data of the runs), and I know very little about the encoding, other than that it seems arbitrary and the file isn't meant to be a text file. I can open this file in Wordpad and can see the information I'm looking for (sample names, timestamps, experiment parameters), surrounded by experimental run data (as expected, this looks like lots of gobbledygook, e.g. ¶+ú#”‹ø#ðßö#¨...). It seems like I basically got lucky with Wordpad being able to make some sense of the contents, and I'm trying to replicate that.
I can read the file into Python with a basic file handler and use regex to get some of the pieces of info I want. Opening with 'r' vs 'rb' doesn't seem to make a difference.
def textOpenLines(filename, mode='rb'):
    with open(filename, mode) as content_file:
        return [line for line in content_file]
I'm able to take that list and search it for relevant strings and get the sample name from it. BUT from looking at the file in Wordpad, I found that the sample name is listed twice; the second time it has the datestamp following it (e.g. 'Dibenzoylperoxid 120 C 03.05.1994 14:24:30'). In Python, I can't find this string. I can't even find the timestamp by itself. When I look at the line where it is supposed to occur, I get a bunch of random bytes. Opening the file in Notepad looks like the Python output.
I suspect it's an encoding issue. I've tried reading the file in as Unicode, I've tried taking snippets of lines and reading those in, but I can't crack it. I'm stumped.
Any thoughts on how to read this in so that it decodes correctly? Wordpad got it right once (though on subsequent attempts to open the file, it looks like the Notepad output).
Thanks!!
Edit:
I don't know who changed the title, but of course it 'looks like random bytes in Python/Notepad'. It's mostly data.
It's not meant to be a text file. I sorta got lucky with the Wordpad opening
It's not corrupted. The DSC instrument program reads it just fine. It's just proprietary so I have no idea how it ticks.
I've tried using 'r', 'rb', and 'U' flags.
I've tried codecs.open using utf8, 16 and 32, but it gives UnicodeDecodeError: 'utf8' codec can't decode byte 0xdf in position 49: invalid continuation byte. I don't think it has a BOM, because I don't think it's meant to be human readable.
The first 32 bytes (from f.read(32)) are:
'\x10 \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x04\x10\x00\x00'
I don't know much about BOMs, but from reading the Wiki page, that doesn't look like any of the valid UTF markings.
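For reference, here is roughly how I checked for one (a quick sketch using the codecs BOM constants; 'data.t00' stands in for my actual file name):

import codecs

# Known Unicode byte order marks, longest first so UTF-32 isn't
# mistaken for UTF-16 (their BOMs share a prefix).
boms = [
    ('utf-32-le', codecs.BOM_UTF32_LE),
    ('utf-32-be', codecs.BOM_UTF32_BE),
    ('utf-16-le', codecs.BOM_UTF16_LE),
    ('utf-16-be', codecs.BOM_UTF16_BE),
    ('utf-8-sig', codecs.BOM_UTF8),
]

with open('data.t00', 'rb') as f:
    head = f.read(4)

for name, bom in boms:
    if head.startswith(bom):
        print('looks like a %s BOM' % name)
        break
else:
    print('no recognizable BOM at the start of the file')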
The start of the file, when first automagically decoded in Wordpad, looks like this:
121 22Dibenzoylperoxid 120 C 03.05.1994 14:24:30 1 0 4096 ESTimeAI–#£®#nôÂ#49Õ#kÉå#FÞò#`sþ#N5A2A®"A"A—¥A¿ÝA¡zA"ÓAÿãAÐÅAäHA‚œAÑÌAŸäA¤ÆAE–AFNATöAÐ|AõAº^A(ÄAèAýqA¹AÖûAº8A¬uAK«AgÜAüAÞAo4A>N
AfAB
The start of the file, when opened in Notepad, Python, and now Wordpad, looks like this:
(empty bytes x00...)](x00...)eß(x00...)NvN(x00)... etc
Your file is not composed of ASCII characters, but it is being interpreted as such by the applications that open it. The same thing would happen if you opened up a .jpg image in Wordpad - you would get a bunch of binary along with some ASCII characters that are printable and recognizable to the human eye.
This is the reason why you can't do a plain-text search for your timestamp, for example.
Here is an example in code to demonstrate the issue. In your binary file you have the following bytes:
\x44\x69\x62\x65\x6e\x7a\x6f\x79\x6c\x70\x65\x72\x6f\x78\x69\x64\x20\x31\x32\x30\x20\x43\x20\x30\x33\x2e\x30\x35\x2e\x31\x39\x39\x34\x20\x31\x34\x3a\x32\x34\x3a\x33\x30
If you were to open this inside a text editor like Wordpad, it would render the following:
Dibenzoylperoxid 120 C 03.05.1994 14:24:30
Here is a code snippet in Python:
>>> c = '\x44\x69\x62\x65\x6e\x7a\x6f\x79\x6c\x70\x65\x72\x6f\x78\x69\x64\x20\x31\x32\x30\x20\x43\x20\x30\x33\x2e\x30\x35\x2e\x31\x39\x39\x34\x20\x31\x34\x3a\x32\x34\x3a\x33\x30'
>>> print c
Dibenzoylperoxid 120 C 03.05.1994 14:24:30
These particular bytes happen to be plain ASCII text (the hexadecimal notation above is just a way of writing raw byte values), but most of the bytes surrounding them in the file are not, which is why treating the whole file as lines of text and searching it that way doesn't work reliably.
The reason for this is that the binary file follows a very particular structure (a protocol or specification) so that the program that reads it can parse it correctly. If you take a JPEG image as an example, you will find that the first bytes and the last bytes of the image are always the same (depending on the format used): FF D8 will be the first two bytes of a JPEG and FF D9 will be the last two bytes, identifying it as such. An image-editing program then knows to start parsing this binary data as a JPEG and will "walk" the structures inside the file to render the image. There are online databases of such file "signatures" or "headers" that help you identify file types; the first two bytes of your file, 10 00, do not show up in them, so you are likely dealing with a proprietary format and won't be able to find the specs online very easily. This is where reverse engineering comes in handy.
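As a tiny illustration of the signature idea (nothing specific to your .t00 format; the file name is a placeholder):

# Minimal sketch: check whether a file carries the JPEG start/end markers.
def looks_like_jpeg(path):
    with open(path, 'rb') as f:
        head = f.read(2)
        f.seek(-2, 2)          # jump to the last two bytes of the file
        tail = f.read(2)
    return head == b'\xff\xd8' and tail == b'\xff\xd9'

print(looks_like_jpeg('photo.jpg'))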
I would recommend you open your file up in a hex editor - it will give you both the hexadecimal output and the ASCII output side by side, so that you can start to analyze the file format. I personally use the Hackman Hexeditor (it's free and has a lot of features).
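If you just want a quick look before installing anything, a few lines of Python can give you a similar hex-plus-ASCII view (a rough sketch, not a replacement for a real hex editor; the file name is a placeholder):

def hexdump(path, length=256, width=16):
    # Print offset, hex bytes and printable ASCII side by side,
    # similar to what a hex editor shows.
    with open(path, 'rb') as f:
        data = bytearray(f.read(length))
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = ' '.join('%02x' % b for b in chunk)
        text_part = ''.join(chr(b) if 32 <= b < 127 else '.' for b in chunk)
        print('%08x  %-48s %s' % (offset, hex_part, text_part))

hexdump('your_binary_file.bin')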
But for now - to give you something useful for searching the file for the data you are interested in - here is a quick method to convert your search queries to binary before starting the search.
import struct
#binary_data = open("your_binary_file.bin","rb").read()
#your binary data would show up as a big string like this one when you .read()
binary_data = '\x44\x69\x62\x65\x6e\x7a\x6f\x79\x6c\x70\x65\x72\x6f\x78\x69\x64\x20\x31\x32\x30\x20\x43\x20\x30\x33\x2e\x30\x35\x2e\x31\x39\x39\x34\x20\x31\x34\x3a\x32\x34\x3a\x33\x30'

def search(text):
    #convert the text to binary first
    s = ""
    for c in text:
        s += struct.pack("b", ord(c))
    results = binary_data.find(s)
    if results == -1:
        print "no results found"
    else:
        print "the string [%s] is found at position %s in the binary data" % (text, results)
search("Dibenzoylperoxid")
search("03.05.1994")
The results of the above script are:
the string [Dibenzoylperoxid] is found at position 0 in the binary data
the string [03.05.1994] is found at position 25 in the binary data
This should get you started.
it's FutureMe.
You probably got lucky with the Wordpad thing. I don't know for sure, because that data is long gone, but I am guessing Wordpad made a valiant effort to try to decode the file as UTF-8 (or maybe UTF-16 or CP1252). The reason this seemed to work was that in most binary protocols, strings are probably encoded as UTF-8, so for the ASCII character set, they will look the same in the dump. However, everything else is going to be binary encoded.
You had the right idea with open(fn, 'rb') but you should have just read the whole blob in, rather than readlines, which tries to split on \n. Since the db file isn't \n delimited, that just won't work.
A better approach would have been a histogram of the byte values, and trying to infer from it what the field/row separators are, if any even exist. Look for TLV (type-length-value) encoded fields. Since you know the list of sample names, you could take a list of those starting strings, use them to find slice points in the blob, and determine how regular the field sizes are.
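Something along these lines would have been a reasonable first pass (just a sketch; the file name and the sample-name list are placeholders):

from collections import Counter

# Read the whole file as one blob instead of splitting it into "lines".
with open('data.t00', 'rb') as f:
    blob = f.read()

# Byte histogram: very frequent values are candidates for padding or separators.
hist = Counter(bytearray(blob))
for value, count in hist.most_common(10):
    print('byte 0x%02x occurs %d times' % (value, count))

# Use the known sample names as anchors; the gaps between their offsets
# give a feel for how regular the record sizes are.
names = [b'Dibenzoylperoxid']        # add the other sample names here
for name in names:
    pos = blob.find(name)
    while pos != -1:
        print('%s found at offset %d' % (name, pos))
        pos = blob.find(name, pos + 1)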
Also, buy bitcoin.
Related
I have a lot of CSV files and I want to merge them into one CSV file. The thing is that the CSV files contain data in different languages like Russian, English, Croatian, Spanish, etc. Some of the CSV files even have their data written in multiple languages.
When I open the CSV files, the data looks perfectly fine, written properly in their languages and I want to read all the CSV files in their language, and write them to one big CSV file as they are.
The code I use is this:
import os
import glob
import pandas as pd

directory_path = os.getcwd()
all_files = glob.glob(os.path.join(directory_path, "DR_BigData_*.csv"))
print(all_files)
merge_file = 'data_5.csv'
df_from_each_file = (pd.read_csv(f, encoding='latin1') for f in all_files)
df_merged = pd.concat(df_from_each_file, ignore_index=True)
df_merged.to_csv(merge_file, index=False)
If I use "encoding='latin1'", it successfully writes all the CSV files into one but as you might guess, the characters are so messed up.
I also tried to write them into .xlsx using encoding='latin1', and I encountered the same issue. In addition to these, I tried many different encodings, but those gave me decoding errors.
When you force the input encoding to Latin-1, you are basically wrecking any input files which are not actually Latin-1. For example, a Russian text file containing the text привет in code page 1251 will silently be translated to ïðèâåò. (The same text in the UTF-8 encoding would map to the similarly bogus but completely different string пÑивеÑ.)
The sustainable solution is to, first, correctly identify the input encoding of each file, and then, second, choose an output encoding which can accommodate all of the input encodings correctly.
I would choose UTF-8 for output, but any Unicode variant will technically work. If you need to pass the result to something more or less braindead (cough Microsoft cough Java) maybe UTF-16 will be more convenient for your use case.
data = dict()
for file in glob.glob("DR_BigData_*.csv"):
    if 'ru' in file:
        enc = 'cp1251'
    elif 'it' in file:
        enc = 'latin-1'
    # ... add more here
    else:
        raise KeyError("I don't know the encoding for %s" % file)
    data[file] = pd.read_csv(file, encoding=enc)
# ... merge data[] as previously
The if statement is really just a placeholder for something more useful; without access to your files, I have no idea how your files are named, or which encodings to use for which ones. This simplistically assumes that files in Russian would all have the substring "ru" in their names, and that you want to use a specific encoding for all of those.
If you only have two encodings, and one of them is UTF-8, this is actually quite easy; try to decode as UTF-8, then if that doesn't work, fall back to the other encoding:
for file in glob.glob("DR_BigData_*.csv"):
    try:
        data[file] = pd.read_csv(file, encoding='utf-8')
    except UnicodeDecodeError:
        data[file] = pd.read_csv(file, encoding='latin-1')
This is likely to work simply because text which is not valid UTF-8 will typically raise a UnicodeDecodeError very quickly. The encoding is designed so that bytes with the 8th bit set have to adhere to a very specific pattern. This is a useful feature, not something you should feel frustrated about. Not getting the correct data from the file is much worse.
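You can see that fail-fast behaviour with a two-byte example (a trivial demonstration, not taken from your data):

# Latin-1 can decode any byte sequence, but UTF-8 rejects malformed
# sequences immediately: 0xdf announces a two-byte sequence, and 0x20
# is not a valid continuation byte.
try:
    b'\xdf\x20'.decode('utf-8')
except UnicodeDecodeError as exc:
    print(exc)

print(b'\xdf\x20'.decode('latin-1'))   # decodes without complaint (to 'ß ')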
If you don't know what encodings are, now would be a good time to finally read Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
As an aside, your computer already knows which directory it's in; you basically never need to call os.getcwd() unless you specifically need the absolute path of the current directory.
If I understood your question correctly, you can easily merge all your csv files (as they are) using cat command:
cat file1.csv file2.csv file3.csv ... > Merged.csv
I'm trying to extract data from a face dataset I found online which provides png pictures and their corresponding pcd files. However, whenever I try to extract data from the pcd files I get the error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc0 in position 202: invalid start byte
I understand that this is because I'm trying to read a non-ASCII character; however, I haven't seen other people run into this problem when opening .pcd files from an outside source. Is there an error on the dataset's end, or is there a workaround that will let me read this file? I eventually want to work towards a depth image for machine learning applications (I'm fairly new to machine learning in general).
If this is a problem with the dataset, I'd love to hear about other RGB-D face datasets, as I haven't been able to find any others that provide depth information.
If this is my problem, I'd like to know what I can do to fix it, because I have tried a number of different techniques and libraries to read the files and have only gotten this error.
Thanks!
import os
import math
import numpy as np
from PIL import Image
filePath = "001_01_cloud.pcd"
with open(filePath, "r") as pcd_file:
    lines = [line.strip().split(" ") for line in pcd_file.readlines()]
Googling for the specification for the PCD format says that the actual point cloud data could be stored in binary form, and I'm assuming that's what's going on here:
DATA - specifies the data type that the point cloud data is stored in. As of version 0.7, two data types are supported: ascii and binary.
Since you're opening the file with mode "r", Python will assume it's text, and will handily attempt to interpret everything as UTF-8 (by default; you can pass encoding="...").
However since the format has a text header followed by text or binary data, you will need to open it in binary mode, "rb". (This means reads from the file will yield bytes objects, not strings.) You can then .decode() bytes objects into strings if you need to handle them as text.
You also shouldn't use .readlines() with a file like this; the binary data that follows the textual headers can contain \n characters, and that data would be "broken" if split into lines.
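Putting those two points together, a rough sketch of reading just the header in binary mode could look like this (not a full parser; it assumes a well-formed header that ends with a DATA line):

filePath = "001_01_cloud.pcd"

header = {}
with open(filePath, "rb") as pcd_file:
    # The header is plain ASCII, one "KEY value..." entry per line,
    # terminated by the DATA line; everything after that may be binary.
    while True:
        raw = pcd_file.readline()
        if not raw:                      # reached EOF without a DATA line
            break
        line = raw.decode("ascii", errors="replace").strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(" ")
        header[key] = value
        if key == "DATA":
            break
    payload = pcd_file.read()            # raw point data (text or binary)

print(header)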
Anyway, you may be reinventing the wheel here; there seems to be a Python library for PCD files.
My code encrypts a file, and the second part will decrypt it. It works fine with txt files but if I put a .docx through it it throws up an error that I can not figure out how to solve. Below is the main part of the code that I need help with.
I already did encoding and decoding it using examples from this site, but it does not work, it just gives the same error.
dwdfa = input('Enter the entire file directory plus extension you wish to decrypt:')
dodf = open(dwdfa, "r+").read()
a = len(dodf)
dfirst = dodf[a-2] + dodf[a-1] + dodf[:a-2]
for i in dfirst:
    dsecond = (chr(ord(i) - 5))
    Word.append(dsecond)
dsecond = ''.join(Word)
print(dsecond)
new = open(dwdfa + "1", "w")
new.write(dsecond)
I expected the output to give me the decoded version of the text and print it out, however it just gives the same encrypted text and the error of:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 18: character maps to <undefined>
If possible, please keep the answer simple, as I do not understand conversion of bytes to strings or anything like that. The r+ mode is there to open the file; if needed, I'll add the encryption part of the code.
docx is a binary format (to be precise, a zip archive containing XML files), and therefore needs to be processed as bytes rather than string in Python.
If you want to simply encrypt arbitrary files (e.g. images, executables), you will need to rewrite the function to work on bytes instead of characters: for example, that -5 Caesar shift chr(ord(i) - 5) would become (i - 5 + 256) % 256, and you would add b to the mode of the open() calls. Text files will then still remain text files, unless they contain Unicode characters (which will be broken). Encrypted docx files will be gibberish, so they can't be opened in Word until decrypted.
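For illustration, a byte-oriented version of that shift might look roughly like this (a sketch only; the file names are placeholders, and the two-character rotation from your original code is left out):

# Decryption side of a byte-based -5 shift, assuming the encryption
# simply added 5 to every byte. Python's % already wraps negative values.
with open('secret.docx.enc', 'rb') as f:
    data = f.read()

decrypted = bytes((b - 5) % 256 for b in data)

with open('secret.docx', 'wb') as f:
    f.write(decrypted)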
But if you want to work on the text of your docx files, you will need a special docx library (eg https://python-docx.readthedocs.io/en/latest/). Note that doing the processing in-place (leaving formatting and layout intact) may not be trivial.
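As a very rough illustration of that second route (assuming python-docx is installed; this only reads paragraph text and ignores formatting; the file name is a placeholder):

from docx import Document

doc = Document('letter.docx')

for paragraph in doc.paragraphs:
    # paragraph.text gives the plain text of each paragraph;
    # rewriting it discards run-level formatting.
    print(paragraph.text)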
First of all, I found an existing question which is basically the same as mine, but it is closed and I'm not sure I understand the reason for closing vs. the content of the post. I also don't really see a working answer there.
I have 20+ input files from 4 apps. All files are exported as .csv files. The first 19 files worked (4 others exported from the same app work) and then I ran into a file that gives me this error:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 5762: character maps to <undefined>
If I looked that up right, it is a <ctrl> character. The lines below are the relevant code:
with open(file, newline = '') as f:
    reader = csv.DictReader(f, dialect = 'excel')
    for line in reader:
I know I'm going to be getting a file. I know it will be a .csv. There may be some variance in what I get due to the manual generation/export of the source files. There may also be some strange characters in some of the files (e.g. Japanese, Russian, etc.). I provide this information because going back to the source to get a different file might just kick the can down the road until I have to pull updated data (or worse, someone else does).
So the question is probably multi-part:
1) Is there a way to tell the csv.DictReader to ignore undefined characters? (Hint for the codec: if I can't see it, it is of no value to me.)
2) If I do have "crazy" characters, what should I do? I've considered opening each input as a binary file, filtering out offending hex characters, writing the file back to disk and then opening the new file, but that seems like a lot of overhead for the program and even more for me. It's also a few JCL statements from being 1977 again.
3) How do I figure out what I'm getting as an input if it crashes while I'm reading it in.
4) I chose dialect = 'excel' because many of the inputs are Excel files that can be downloaded from one of the source applications. From the docs on DictReader, my impression is that this just defines the delimiter, quote character and EOL characters to expect/use. Therefore, I don't think this is my issue, but I'm also a Python noob, so I'm not 100% sure.
I posted the solution I went with in the comments above; it was to set the errors argument of open() to 'ignore':
with open(file, newline = '', errors='ignore') as f:
This is exactly what I was looking for in my first question in the original post above (i.e. whether there is a way to tell the csv.DictReader to ignore undefined characters).
Update: Later I did need to work with some of the Unicode characters and couldn't ignore them. The correct answer for that situation, with an Excel-produced Unicode .csv file, was to use the 'utf_8_sig' codec. That consumes the byte order mark (the UTF-8 BOM) that Excel/Windows writes at the top of the file to flag it as Unicode.
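In code, that later fix looked roughly like this (a sketch; the file name is a placeholder and the row handling is trimmed):

import csv

file = 'some_input.csv'     # placeholder; in the real code this comes from the file list

# Same loop as before, but let Python strip the UTF-8 BOM that Excel writes.
with open(file, newline='', encoding='utf_8_sig') as f:
    reader = csv.DictReader(f, dialect='excel')
    for line in reader:
        pass                # process each row as before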
I want to write a Python script that converts a file's encoding from cp949 to UTF-8. The file is originally encoded in cp949.
My script is as follows:
cpstr = open('terms.rtf').read()
utfstr = cpstr.decode('cp949').encode('utf-8')
tmp = open('terms_utf.rtf', 'w')
tmp.write(utfstr)
tmp.close()
But this doesn't change the encoding as I intended.
There are three kinds of RTF, and I have no idea which kind you have. You can tell by opening the file in a plain-text editor, or just using less/more/cat/type/whatever to print it out to your terminal.
First, the easy cases: plaintext RTF.
A plaintext RTF file starts off with {\rtf, and all of the text within it is (as you'd expect) plain text—although sometimes runs of text will be broken up into separate runs, with formatting commands (which start with \) in between them. Since all of the formatting commands are pure ASCII, if you convert a plaintext RTF from one charset to another (as long as both are supersets of ASCII, as cp949 and utf-8 both are), it should work fine.
However, the file may also have a formatting command that specifies what character set it's written in. This command looks like \ansicpg949. When an RTF editor like Wordpad opens your file, it will interpret all your nice UTF-8 data as cp949 data and mojibake the hell out of it unless you fix it.
The simplest way to fix it is to figure out what charset your editor wants to put there for UTF-8 files. Maybe it's \ansicpg65001, maybe it's \utf8, maybe it's something completely different. So just save a simple file as a UTF-8 RTF, then look at it in plain text, and see what it has in place of \ansicpg949, and replace the string in your file with the right one. (Note that code page 65001 is not really UTF-8, but it's close, and a lot of Microsoft code assumes they're the same…)
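For example, something along these lines (a sketch; \ansicpg65001 is only a guess at the right replacement token, so check a real UTF-8 RTF saved by your editor first, as described above):

import codecs

# Patch the charset command in the RTF header after the cp949 -> UTF-8
# conversion. The replacement token is an assumption; use whatever your
# own RTF editor writes for UTF-8 files.
rtf = codecs.open('terms_utf.rtf', 'r', 'utf-8').read()
rtf = rtf.replace('\\ansicpg949', '\\ansicpg65001', 1)

out = codecs.open('terms_utf.rtf', 'w', 'utf-8')
out.write(rtf)
out.close()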
Also, some RTF editors (like Apple's TextEdit) will escape any non-ASCII characters (so, e.g., a é is stored as \'e9), so there's nothing to convert.
Finally, Office Open XML includes an XML spec for something that's called RTF, but isn't really the same thing. I believe many RTF editors can handle this. Fortunately, you can treat this the same way as plaintext RTF—all of the XML tags have pure-ASCII names.
The almost-as-easy case is compressed plaintext RTF. This is the same thing, but compressed with, I believe, zlib. Or it can actually be RTFD (which can be plaintext RTF together with images and other things in separate files, or actual plain text with formatting runs stored in a separate file) in a .zip archive. Anyway, if you have one of these, the file command on most Unix systems should be able to detect it as "compressed RTF", at which point we can figure out what the specific format is and decompress it, and then you can edit it as plaintext RTF (or RTFD).
Needless to say, if you don't uncompress this first, you won't see any of your familiar text in the file—and you could easily end up breaking it so it can't be decompressed, or decompresses to garbage, by changing arbitrary bytes to different bytes.
Finally, the hard case: binary RTF.
The earliest versions of these were in an undocumented format, although they've been reverse-engineered. The later versions are public specs. Wikipedia has links to the specs. If you want to parse it manually you can, but it's going to be a substantial amount of code, and you're going to have to write it yourself.
A better solution would be to use one of the many libraries on PyPI that can convert RTF (including binary RTF) to other formats, which you can then edit easily.
import codecs

# Read the file, decoding it from cp949 into a unicode string
cpstr = codecs.open('terms.rtf', 'r', 'cp949').read()

# Write it back out encoded as UTF-8
tmp = codecs.open('terms_utf.rtf', 'w', 'utf-8')
tmp.write(cpstr)
tmp.close()