I am attempting to decode a very large .bin file that contains a header as well as numeric data. The problem I am running into is that the method I am using to decode it produces a jumbled list of characters. When I open the file in a text editor it tells me the encoding is 'ANSI', so that is what I am using here.
I am using this code to decode it:
import codecs as cs

file = open('3D_test.bin', 'rb')
b = file.read()  # inserting a number here in the actual code
data = cs.decode(b, encoding='ANSI')
print(b)
print(data)  # this one shows me the jumbled characters but I can't seem to show that here
The first 100 bytes, which is header data:
b'\x08\x00\x00\x00\x1f\x85\xebQ\xb8\x1e\x10#\x08\x00\x00\x00\x0c\x00\x00\x00L\x00\x00\x00L\x00\x00\x006\x00\x00\x00\x0c\x00\x00\x00\x04\x00\x00\x00(\x00\x00\x00\x04\x00\x00\x00(\x00\x00\x00 Longitude (\x00\x00\x00(\x00\x00\x00'
Some bytes that are after the header data:
b'\xd7\xa4\x97\x07\xa4\x94\x10#\xe9A\xc3h\x00\xee\x10#\xfb\xde\xee\xc9\\G\x11#\r|\x1a+\xb9\xa0\x11#\x1f\x19F\x8c\x15\xfa\x11#1\xb6q\xedqS\x12#CS'
I've tried quite a bit of digging but can't seem to find anything that helps me with this problem. I'm not new to python but have never worked with binary data/decoding data before.
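For binary numeric data like this, decoding to text is the wrong tool; the struct module is the usual way to unpack it. Below is a minimal exploratory sketch. The guesses that the header holds little-endian 32-bit integers and that the payload comes in 8-byte groups are read off the dumps above, not known facts about this format; the 100-byte header length is taken from the question.
import struct

with open('3D_test.bin', 'rb') as f:
    blob = f.read(200)  # a small probe is enough to inspect the layout

# Many of the 4-byte groups in the header (b'\x08\x00\x00\x00', b'L\x00\x00\x00', ...)
# decode as little-endian 32-bit integers (8, 76, ...), so print the header that way.
print(struct.unpack_from('<25i', blob, 0))

# The section after the header repeats in 8-byte groups; try a few of them as
# little-endian doubles and as pairs of 32-bit integers and see which looks sane.
offset = 100  # header length, taken from the question
for i in range(5):
    chunk = blob[offset + 8 * i:offset + 8 * (i + 1)]
    print(chunk, struct.unpack('<d', chunk), struct.unpack('<2i', chunk))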
Related
I am reading a JSON file in Python which has lots of fields and values (~8000 records).
Environment: Windows 10, Python 3.6.4
Code:
import json
json_data = json.load(open('json_list.json'))
print (json_data)
With this I get an error. Below is the stack trace:
json_data = json.load(open('json_list.json'))
File "C:\Program Files (x86)\Python36-32\lib\json\__init__.py", line 296, in load
return loads(fp.read(),
File "C:\Program Files (x86)\Python36-32\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 7977319: character maps to <undefined>
Along with this I have tried
import json

with open('json_list.json', encoding='utf-8') as fd:
    json_data = json.load(fd)
print(json_data)
With this, my program runs for a long time and then hangs with no output.
I have searched almost all topics related to this and could not find a solution.
Note: the JSON data itself is valid; when I view it in Postman or any other REST client it doesn't report any anomalies.
Any help with this, or an alternative way to load my JSON data (for example by converting it to a string and back to JSON), would be greatly appreciated.
Here is what the file looks like around the reported error:
>>> from pprint import pprint
>>> f = open('C:/Users/c5242046/Desktop/test2/dblist_rest.json', 'rb')
>>> f.seek(7977319)
7977319
>>> pprint(f.read(100))
(b'\x81TICA EL ABGEN INGL\xc3\x83\xc2\x89S, S.A.","memory_size_gb":"64","since'
b'":"2017-04-10","storage_size_gb":"84.747')
The snippet you are asking about seems to have been double-encoded. Basically, whatever originally generated this data produced text in Latin-1 or some related encoding (Windows code page 1252?). It was then fed to a process which converts Latin-1 to UTF-8 ... twice.
Of course, "converting" data which is already UTF-8 but telling the computer that it's Latin-1 just produces mojibake.
The string INGL\xc3\x83\xc2\x89S suggests this analysis, if you can guess that it is supposed to say Inglés in upper case, and realize that the UTF-8 encoding for É is \xC3 \x89 and then examine which characters these two bytes encode in Latin-1 (or, as it happens, Unicode, which is a superset of Latin-1, though they are not compatible on the encoding level).
Notice that being able to guess which string a problematic sequence is supposed to represent is the crucial step here; it also explains why including a representative snippet of the problematic data - with enough context! - is vital for debugging.
Anyway, if the entire file has the same symptom, you should be able to undo the second, superfluous and incorrect round of re-encoding; though an error this far into the file makes me imagine it's probably a local problem with just one or a few records. Maybe they were merged from multiple input files, only one of which had this error. Then fixing it requires a fair bit of detective work, and manual editing, or identifying and fixing the erroneous source. A quick and dirty workaround is to simply manually remove any erroneous records.
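If the whole file really were uniformly double-encoded, the repair would be mechanical: decode the bytes as UTF-8 once, re-encode the result as Latin-1, and decode as UTF-8 again. A minimal sketch of that round trip on the string from the question (the stray 0x81 byte in your dump is exactly the kind of thing that will still break it):
# Forward direction: how 'É' (UTF-8: C3 89) becomes b'\xc3\x83\xc2\x89' after a
# second, superfluous Latin-1 -> UTF-8 conversion.
good = 'INGLÉS'
once = good.encode('utf-8')                       # b'INGL\xc3\x89S'
twice = once.decode('latin-1').encode('utf-8')    # b'INGL\xc3\x83\xc2\x89S'

# Undoing the extra round: decode as UTF-8, re-encode as Latin-1, decode again.
repaired = twice.decode('utf-8').encode('latin-1').decode('utf-8')
assert repaired == good

# On the real file, bytes such as 0x81 that are not valid UTF-8 will still raise,
# so a first pass with errors='replace' helps locate the broken records:
# text = open('dblist_rest.json', 'rb').read().decode('utf-8', errors='replace')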
This question already has answers here: UnicodeDecodeError when reading CSV file in Pandas with Python (25 answers). Closed 5 years ago.
What I am trying to do: read a CSV into a dataframe, make changes to a column, write the updated values back to the same CSV (to_csv), and then read that CSV again into another dataframe. On that second read I get an error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe7 in position 7: invalid continuation byte
My code is:
import pandas as pd
df = pd.read_csv("D:\ss.csv")
df.columns #o/p is Index(['CUSTOMER_MAILID', 'False', 'True'], dtype='object')
df['True'] = df['True'] + 2 #making changes to one column of type float
df.to_csv("D:\ss.csv") #updating that .csv
df1 = pd.read_csv("D:\ss.csv") #again trying to read that csv
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe7 in position 7: invalid continuation byte
So please suggest how I can avoid the error and be able to read that CSV into a dataframe again.
I know that somewhere I am missing an encoding or decoding argument while reading and writing the CSV, but I don't know exactly what should be changed, so I need help.
Known encoding
If you know the encoding of the file you want to read in,
you can use
pd.read_csv('filename.txt', encoding='encoding')
These are the possible encodings:
https://docs.python.org/3/library/codecs.html#standard-encodings
Unknown encoding
If you do not know the encoding, you can try chardet; however, this is not guaranteed to work. It is more of an educated guess.
import chardet
import pandas as pd
with open('filename.csv', 'rb') as f:
    result = chardet.detect(f.read())  # or readline if the file is large

pd.read_csv('filename.csv', encoding=result['encoding'])
Is that error happening on your first read of the data, or on the second read after you write it out and read it back in again? My guess is that it's actually happening on the first read of the data, because your CSV has an encoding that isn't UTF-8.
Try opening that CSV file in Notepad++, or Excel, or LibreOffice. Does your data source have the ç (C with cedilla) character in it? If it does, then that 0xE7 byte you're seeing is probably the ç encoded in either Latin-1 or Windows-1252 (called "cp1252" in Python).
Looking at the documentation for the Pandas read_csv() function, I see it has an encoding parameter, which should be the name of the encoding you expect that CSV file to be in. So try adding encoding="cp1252" to your read_csv() call, as follows:
df = pd.read_csv(r"D:\ss.csv", encoding="cp1252")
Note that I added the character r in front of the filename, so that it will be considered a "raw string" and backslashes won't be treated specially. That way you don't get a surprise when you change the filename from ss.csv to new-ss.csv, where the string D:\new-ss.csv would be read as D, :, newline character, e, w, etc.
Anyway, try that encoding parameter on your first read_csv() call and see if it works. (It's only a guess, since I don't know your actual data. If the data file isn't private and isn't too large, try posting the data file so we can see its contents -- that would let us do better than just guessing.)
One simple solution is to open the CSV file in an editor like Sublime Text and save it with 'utf-8' encoding. Then pandas can read the file without trouble.
The approach above, importing chardet and detecting the file's encoding before reading, works:
import chardet
import pandas as pd

with open('filename.csv', 'rb') as f:
    result = chardet.detect(f.read())  # or readline if the file is large

pd.read_csv('filename.csv', encoding=result['encoding'])
Yes, you'll get this error. I worked around this problem by opening the CSV file in Notepad++ and changing the encoding through the Encoding menu (Convert to UTF-8), then saving the file and running the Python program over it again.
Another solution is to use the codecs module in Python for encoding and decoding files. I haven't used that myself.
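For completeness, a minimal sketch of that codecs-based re-encoding (treating the source as Latin-1 is an assumption, in line with the ç/0xe7 guess above; the output filename is made up):
import codecs

# Read with the presumed source encoding and rewrite as UTF-8.
with codecs.open(r"D:\ss.csv", 'r', encoding='latin_1') as src, \
        codecs.open(r"D:\ss_utf8.csv", 'w', encoding='utf-8') as dst:
    dst.write(src.read())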
I am new to Python. I ran into this exact issue when I manually changed the extension of my Excel file to .csv and tried to read it with read_csv. However, if I opened the Excel file and saved it as a CSV file instead, it seemed to work.
I have recently started my job as an ETL Developer and, as part of my exercise, I am extracting data from a text file containing raw data. My raw data looks like the sample shown in the image below.
[image: My Raw Data]
Now I want to add delimiters to my data file. Basically after every line, I want to add a comma (,). My code in Python looks like this.
with open('new_locations.txt', 'w') as output:
    with open('locations.txt', 'r') as input:
        for line in input:
            new_line = line + ','
            output.write(new_line)
where new_locations.txt is the output text file, locations.txt is the raw data.
However, it throws an error every time:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 3724: character maps to <undefined>
Where exactly am I going wrong?
Note: The characters in raw data are not all ASCII characters. Some are Latin characters as well.
When you open a file in Python 3 in text mode, reading and writing convert between the bytes in the file and Python (Unicode) strings. The default encoding is platform dependent: on Linux and macOS it is usually UTF-8, but on Windows it is the ANSI code page (for example cp1252), which is where the 'charmap' error in your traceback comes from.
If your file uses Latin-1 encoding, you should open it with:
with open('locations.txt', 'r', encoding='latin_1') as input
You should probably also pass an explicit encoding when opening the output file if you want the output to be Latin-1 as well.
In the longer term, you should probably consider converting all your data files to a single Unicode encoding such as UTF-8.
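Putting that together with the loop from the question, here is a minimal sketch with explicit encodings on both files. Latin-1 is an assumption; it also moves the comma in front of the newline, which is presumably what "after every line" was meant to achieve:
# Sketch only: assumes the raw file really is Latin-1; swap in cp1252 or
# whatever the data turns out to use.
with open('locations.txt', 'r', encoding='latin_1') as src, \
        open('new_locations.txt', 'w', encoding='latin_1') as dst:
    for line in src:
        dst.write(line.rstrip('\n') + ',\n')  # comma at the end of each line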
So when you write to the file you need to encode the data before writing; if you search for that you will find plenty of results.
Here is how it can be done (note that writing bytes like this requires the output file to be opened in binary mode, e.g. open('new_locations.txt', 'wb')):
output.write(new_line.encode('utf-8'))  # or ascii
You can also ask it to ignore characters that can't be converted, but that will cause a loss of characters and may not be the desired output. Here is how that is done:
output.write(new_line.encode('ascii', 'ignore'))  # or 'utf-8'
tl;dr - While trying to reverse engineer a proprietary database file, I found that Wordpad was able to automagically decode some of the data into a legible format. I'm trying to implement that decoding in python. Now, even the Wordpad voodoo is not repeatable.
Ready for a brain teaser?
I'm trying to crack a bit of a strange problem. I have a data file, it is the database of a program for a scientific instrument (Mettler DSC / STARe software), and I'm trying to grab sample information from experiments. From my digging around in the file, it appears to consist of plaintext, unencrypted information about the experiments run, along with data. It's a .t00 file, over 40 mb in size (it stores essentially all the data of the runs), and I know very little about the encoding (other than it's seemingly arbitrary. It's not meant to be a text file). I can open this file in Wordpad and can see the information I'm looking for (sample names, timestamp, experiment parameters), surrounded by experimental run data (as expected, this looks like lots of gobbledygook, e.g. ¶+ú#”‹ø#ðßö#¨...). It seems like I basically got lucky with it being able to make some sense of the contents, and I'm trying to replicate that.
I can read the file into python with a basic file handler and use regex to get some of the pieces of info I want. 'r' vs 'rb' doesn't seem to help.
def textOpenLines(filename, mode='rb'):
    with open(filename, mode) as content_file:
        return [line for line in content_file]
I'm able to take that list and search it for relevant strings and get the sample name from it. BUT from looking at the file in Wordpad, I found that the sample name is listed twice, the second time it has the datestamp following it (e.g. 'Dibenzoylperoxid 120 C 03.05.1994 14:24:30'). In python, I can't find this string. I can't find even the timestamp by itself. When I look at the line where it is supposed to occur, I get a bunch of random bytes. Opening in Notepad looks like the python output.
I suspect it's an encoding issue. I've tried reading the file in as Unicode, I've tried taking snippets of lines and reading those in, but I can't crack it. I'm stumped.
Any thoughts on how to read this in so that it decodes right? Wordpad got it right (though now subsequently trying to open it, it looks like the Notepad output).
Thanks!!
Edit:
I don't know who changed the title, but of course it 'looks like random bytes in Python/Notepad'. It's mostly data.
It's not meant to be a text file. I sorta got lucky with the Wordpad opening
It's not corrupted. The DSC instrument program reads it just fine. It's just proprietary so I have no idea how it ticks.
I've tried using 'r', 'rb', and 'U' flags.
I've tried codecs.open using utf8, 16 and 32, but it gives UnicodeDecodeError: 'utf8' codec can't decode byte 0xdf in position 49: invalid continuation byte. I don't think it has a BOM, because I don't think it's meant to be human readable.
First 32 bytes (f.read(32)) reads
'\x10 \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x04\x10\x00\x00'
I don't know much about BOMs, but from reading the Wiki page, that doesn't look like any of the valid UTF markings.
The start of the file, when first automagically decoded in Wordpad, looks like this:
121 22Dibenzoylperoxid 120 C 03.05.1994 14:24:30 1 0 4096 ESTimeAI–#£®#nôÂ#49Õ#kÉå#FÞò#`sþ#N5A2A®"A"A—¥A¿ÝA¡zA"ÓAÿãAÐÅAäHA‚œAÑÌAŸäA¤ÆAE–AFNATöAÐ|AõAº^A(ÄAèAýqA¹AÖûAº8A¬uAK«AgÜAüAÞAo4A>N
AfAB
The start of the file, when opened in Notepad, Python, and now Wordpad, looks like this:
(empty bytes x00...)](x00...)eß(x00...)NvN(x00)... etc
Your file is not composed of ASCII characters but is being interpreted as such by the applications that open it. The same thing would happen if you opened a .jpg image in Wordpad: you would get a bunch of binary along with some ASCII characters that happen to be printable and recognizable to the human eye.
This is the reason why you can't do a plain-text search for your timestamp, for example.
Here is an example in code to demonstrate the issue. In your binary file you have the following bytes:
\x44\x69\x62\x65\x6e\x7a\x6f\x79\x6c\x70\x65\x72\x6f\x78\x69\x64\x20\x31\x32\x30\x20\x43\x20\x30\x33\x2e\x30\x35\x2e\x31\x39\x39\x34\x20\x31\x34\x3a\x32\x34\x3a\x33\x30
If you were to open this inside of a text editor like wordpad it would render the following:
Dibenzoylperoxid 120 C 03.05.1994 14:24:30
Here is a code snippet in Python:
>>> c='\x44\x69\x62\x65\x6e\x7a\x6f\x79\x6c\x70\x65\x72\x6f\x78\x69\x64\x20\x31\x32\x30\x20\x43\x20\x30\x33\x2e\x30\x35\x2e\x31\x39\x39\x34\x20\x31\x34\x3a\x32\x34\x3a\x33\x30'
>>> print c
Dibenzoylperoxid 120 C 03.05.1994 14:24:30
The \xNN escapes above are just hexadecimal notation for the raw byte values; the file itself is binary rather than text, which is why a plain-text search doesn't find what you expect.
The reason for this is that the binary file follows a very particular structure (a protocol or specification) so that the program which reads it can parse it correctly. Take a JPEG image as an example: the first and last bytes of the image are always the same (depending on the format used); FF D8 will be the first two bytes of a JPEG and FF D9 will be the last two, identifying it as such. An image-editing program then knows to start parsing this binary data as a JPEG and "walks" the structures inside the file to render the image. Here is a link to a resource that helps you identify files based on "signatures" or "headers" - the first two bytes of your file (10 20) do not show up in that database, so you are likely dealing with a proprietary format and won't be able to find the specs online very easily. This is where reverse engineering comes in handy.
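As a concrete illustration of the signature idea (a sketch; the JPEG byte values are standard, the filename is a placeholder):
# Standard JPEG signature check: FF D8 at the start, FF D9 at the end.
def looks_like_jpeg(path):
    with open(path, 'rb') as f:
        head = f.read(2)
        f.seek(-2, 2)            # whence=2 seeks relative to the end of the file
        tail = f.read(2)
    return head == b'\xff\xd8' and tail == b'\xff\xd9'

# For an undocumented format like .t00 there is no published signature to check,
# so the first step is simply to record its opening bytes and compare them
# across files written by the same instrument software.
with open('your_file.t00', 'rb') as f:   # placeholder filename
    print(repr(f.read(8)))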
I would recommend you open your file up in a hexeditor - it will give you both the hexadecimal output as well as the ascii output so that you can start to analyze the file format. I personally use the Hackman Hexeditor found here (it's free and has a lot of features).
But for now, to give you something useful for searching the file for the data you are interested in, here is a quick method to convert your search queries to bytes before starting the search.
import struct
#binary_data = open("your_binary_file.bin","rb").read()
#your binary data would show up as a big string like this one when you .read()
binary_data = '\x44\x69\x62\x65\x6e\x7a\x6f\x79\x6c\x70\x65\x72\x6f\x78\x69\x64\x20\x31\x32\x30\x20\x43\x20\x30\x33\x2e\x30\x35\x2e\x31\x39\x39\x34\x20\x31\x34\x3a\x32\x34\x3a\x33\x30'

def search(text):
    # convert the text to binary first
    s = ""
    for c in text:
        s += struct.pack("b", ord(c))
    results = binary_data.find(s)
    if results == -1:
        print "no results found"
    else:
        print "the string [%s] is found at position %s in the binary data" % (text, results)
search("Dibenzoylperoxid")
search("03.05.1994")
The results of the above script are:
the string [Dibenzoylperoxid] is found at position 0 in the binary data
the string [03.05.1994] is found at position 25 in the binary data
This should get you started.
it's FutureMe.
You probably got lucky with the Wordpad thing. I don't know for sure, because that data is long gone, but I am guessing Wordpad made a valiant effort to try to decode the file as UTF-8 (or maybe UTF-16 or CP1252). The reason this seemed to work was that in most binary protocols, strings are probably encoded as UTF-8, so for the ASCII character set, they will look the same in the dump. However, everything else is going to be binary encoded.
You had the right idea with open(fn, 'rb') but you should have just read the whole blob in, rather than readlines, which tries to split on \n. Since the db file isn't \n delimited, that just won't work.
A better approach would have been to build a histogram of the bytes and try to infer what the field/row separators are, if they exist at all. Look for TLV (type-length-value) encoded fields. Since you know the list of sample names, you could take those starting strings, use them to find slice points in the blob, and check how regular the field sizes are; a sketch of both probes follows below.
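A minimal sketch of those two probes. It assumes the whole file fits in memory and that sample names are stored as plain ASCII/UTF-8 byte strings; both are guesses, not facts about the .t00 format:
from collections import Counter

with open('your_file.t00', 'rb') as f:   # placeholder filename
    blob = f.read()

# Byte histogram: a very frequent byte (often 0x00) hints at padding or
# fixed-width records rather than a textual delimiter.
for byte, count in Counter(blob).most_common(10):
    print(hex(byte), count)

# Use known sample names as anchors and look at the spacing between hits;
# regular offsets suggest fixed-size records, irregular ones suggest TLV fields.
sample_names = [b'Dibenzoylperoxid']     # taken from the question
for name in sample_names:
    pos = blob.find(name)
    while pos != -1:
        print(name, 'found at offset', pos)
        pos = blob.find(name, pos + 1)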
Also, buy bitcoin.
I need to get data from a JSON file so I can then send it in a POST request. Unfortunately, when I read the file, some unexplained Unicode symbols appear at the beginning.
path = '.\jsons_updated'
newpath = os.path.join(path, 'Totem Plus eT 00078-20140224_060406.ord.txt')
file = open(newpath, 'r')
#data = json.dumps(file.read())
data = file.read()
print('data= ', data)
file.close()
Data in the file starts with this:
{"PriceTableHash": [{"Hash": ...
I get the result:
data= п»ї{"PriceTableHash": [{"Hash": ...
or in case of data = json.dumps(file.read())
data= "\u043f\u00bb\u0457{\"PriceTableHash\": [{
So my request can't process this data.
The odd symbols are the same for all the files I have.
UPD:
If I copy the data manually into a new json or txt file, the problem disappears. But I have about 2.5k files, so that's not an option =)
The command open(newpath, 'r') opens the file with your system's default encoding (whichever that might be). So when you read encoded Unicode data, that will mangle the encoding (so instead of reading the UTF-8 encoded data with a UTF-8 decoder, Python will try Cp-1250 or something).
Use codecs.open() instead and specify the correct encoding of the data (i.e. the one which was used when the files were written).
The odd bytes you get look like a BOM header. You may want to change the code which writes those files to omit it and send you pure UTF-8. See also Reading Unicode file data with BOM chars in Python
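If fixing the producer isn't practical, Python can also strip the BOM at read time with the utf-8-sig codec. A minimal sketch using the path from the question; it assumes the files really are UTF-8 with a BOM (the п»ї you see is how the BOM bytes EF BB BF render under a legacy code page such as cp1251):
import json
import os

path = r'.\jsons_updated'
newpath = os.path.join(path, 'Totem Plus eT 00078-20140224_060406.ord.txt')

# 'utf-8-sig' behaves like UTF-8 but silently drops a leading BOM if present.
with open(newpath, 'r', encoding='utf-8-sig') as f:
    data = json.load(f)

# Hypothetical access; the keys are taken from the snippet in the question.
print(data['PriceTableHash'][0]['Hash'])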