Python - convert a raw binary dump into ASCII HEX bytes

Further to this question: Handling and working with binary data HEX with python (and thanks to the awesome pointers I received) I'm stuck on one last aspect of the tool.
I am basically writing a cleaner for files that I have with data past the EOF marker. This extra data means they fail some validation tools. I need to strip the extra data so they can be presented to the validator; however, I don't want to throw this data away (in fact I have to keep it...)
I've written an XML container to hold the data, and a few other provenance/audit type values, but I'm (still) stuck on elegantly moving between raw binary and something I can "bake" into a file.
example:
A jpg file ends with (hex editor view)
96 1a 9c fd ab 4f 9e 69 27 ad fd da 0a db 76 bb
ee d2 6a fd ff 00 ff d9 ff ff ff ff ff ff ff ff
ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ff ff ff ff ff ff ff
The EOF marker for jpg is ff d9, so the cleaner works backwards through the file until it matches the EOF marker. In this case it would create a new jpg file stopping at the ff d9, and then attempt to write the stripped data to the XML (via the ElementTree lib): changeString.text = str(excessData)
Of course this won't work, as the XML writer expects ASCII, not binary dumps.
In the above case, the error is UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128), which I can see is because it's not a valid ASCII character.
My question therefore, is how do I elegantly deal with this raw data, in a way that it can be stored and used in the future? (I plan to write an 'uncleaner' next that can take the clean file and the XML and reconstruct the original file...)
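For reference, a minimal Python 2 sketch of the split step described above (the file name is illustrative; str.rfind locates the last occurrence of the marker, which matches the backwards search):

with open('photo.jpg', 'rb') as f:
    f_data = f.read()

# Work backwards: find the last ff d9 and cut just after it.
offset = f_data.rfind('\xff\xd9') + 2
cleanData, excessData = f_data[:offset], f_data[offset:]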
______EDIT_______
Using the suggestions from below, this is the traceback:
Traceback (most recent call last):
  File "C:\...\EOF_cleaner\scripts\test6.py", line 87, in <module>
    main()
  File "C:\...\EOF_cleaner\scripts\test6.py", line 73, in main
    splitFile(f_data, offset)
  File "C:\...\EOF_cleaner\scripts\test6.py", line 60, in splitFile
    makeXML(excessData)
  File "C:\...\EOF_cleaner\scripts\test6.py", line 53, in makeXML
    ET.ElementTree(root).write(noteFile)
  File "c:\python27\lib\xml\etree\ElementTree.py", line 815, in write
    serialize(write, self._root, encoding, qnames, namespaces)
  File "c:\python27\lib\xml\etree\ElementTree.py", line 934, in _serialize_xml
    _serialize_xml(write, e, encoding, qnames, None)
  File "c:\python27\lib\xml\etree\ElementTree.py", line 934, in _serialize_xml
    _serialize_xml(write, e, encoding, qnames, None)
  File "c:\python27\lib\xml\etree\ElementTree.py", line 934, in _serialize_xml
    _serialize_xml(write, e, encoding, qnames, None)
  File "c:\python27\lib\xml\etree\ElementTree.py", line 932, in _serialize_xml
    write(_escape_cdata(text, encoding))
  File "c:\python27\lib\xml\etree\ElementTree.py", line 1068, in _escape_cdata
    return text.encode(encoding, "xmlcharrefreplace")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)
The lines that trigger this are changeString.text = excessData.encode('base64') (line 45 of the script) and ET.ElementTree(root).write(noteFile) (line 53).

Use Base64:
excessData.encode('base64')
It'll be easy to turn that back to binary data later on with a simple .decode('base64') call.
Base64 encodes to ASCII data safe for inclusion in XML, in a reasonably compact format; every 3 bytes of binary information become 4 Base64 characters.
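As a minimal Python 2 sketch of the full round trip (the element names and the note.xml file name are illustrative, not from the question):

import xml.etree.ElementTree as ET

excessData = '\xff\xd9\xff\xff\xff\xff'  # stands in for the stripped bytes

root = ET.Element('wrapper')
changeString = ET.SubElement(root, 'excessData')
changeString.text = excessData.encode('base64')  # now ASCII-safe for XML
ET.ElementTree(root).write('note.xml')

# The planned 'uncleaner' can recover the exact original bytes later:
restored = ET.parse('note.xml').find('excessData').text.decode('base64')
assert restored == excessData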

To convert raw bytes to space-separated ASCII hex, you can use something like:
>>> a = "abc\x01\x02"
>>> print(" ".join("{:02x}".format(x) for x in a))
61 62 63 01 02
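Going the other way, from space-separated hex text back to raw bytes, can be done with the binascii module; a small Python 2 sketch:

>>> import binascii
>>> binascii.unhexlify("61 62 63 01 02".replace(" ", ""))
'abc\x01\x02'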
However, as mentioned in other answers, something like Base64 is probably going to be more efficient and easier to work with.

Related

python read binary file byte by byte and read line feed as 0a

This might seem pretty stupid, but I'm a complete newbie in Python.
So, I have a binary file that holds data as
47 40 ad e8 66 29 10 87 d7 73 0a 40 10
When I try to read it with Python using
with open(absolutePathInput, "rb") as f:
    for line in self.file:
        for byte, nextbyte in zip(line[:], line[1:]):
            if state == 'wait_for_sync_1':
                if (byte == 0x10) and (nextbyte == 0x87):
                    state = 'message_id'
I get all the bytes, but 0a is read as a line feed (i.e. \n).
It seems that the iteration treats 0a as a "line feed", and self.file reads only up to the 0a.
How can I correct this so that 0a is read as a plain byte value?
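One way around this is to not iterate line by line at all: reading the whole file with f.read() keeps 0a as an ordinary byte. A minimal sketch, assuming Python 2 (where iterating a byte string yields one-character strings, hence the ord() calls; on Python 3 the loop values would already be integers):

with open(absolutePathInput, "rb") as f:
    data = f.read()  # entire file as raw bytes; 0x0a gets no line-ending treatment

state = 'wait_for_sync_1'
for byte, nextbyte in zip(data, data[1:]):
    if state == 'wait_for_sync_1':
        if ord(byte) == 0x10 and ord(nextbyte) == 0x87:
            state = 'message_id'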

Converting broken byte string from unicode back to corresponding bytes

The following code receives an iterable of strings in rows, which together contain a PDF byte stream. Each row was of type str. The resulting file was in PDF format and could be opened.
with open(fname, "wb") as fd:
    for row in rows:
        fd.write(row)
Due to a new C library and changes in the Python implementation, the str values are now unicode, and the corresponding content has changed as well, so my PDF file is broken.
Starting bytes of first row object:
old row[0]: 25 50 44 46 2D 31 2E 33 0D 0A 25 E2 E3 CF D3 0D 0A ...
new row[0]: 25 50 44 46 2D 31 2E 33 0D 0A 25 C3 A2 C3 A3 C3 8F C3 93 0D 0A ...
I have aligned the corresponding byte positions here; it looks like a Unicode problem.
I think this is a good start but I still have a unicode string as input...
>>> "\xc3\xa2".decode('utf8') # but as input I have u"\xc3\xa2"
u'\xe2'
I already tried several combinations of encode and decode calls, so I need a more analytical way to fix this. I can't see the wood for the trees. Thank you.
When you find u"\xc3\xa2" in a Python unicode string, it often means that you have read a UTF-8 encoded file as if it were Latin1 encoded. So the best thing to do is certainly to fix the initial read.
That being said, if you have to depend on the broken code, the fix is still easy: you just encode the string as Latin1 and then decode it as UTF-8:
fixed_u_str = broken_u_str.encode('Latin1').decode('UTF-8')
For example:
u"\xc3\xa2\xc3\xa3".encode('Latin1').decode('utf8')
correctly gives u"\xe2\xe3" which displays as âã
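Applied to the original write loop, the fix would look something like this sketch (assuming every row is now one of these mis-decoded unicode strings):

with open(fname, "wb") as fd:
    for row in rows:
        fd.write(row.encode('Latin1'))  # recovers the original PDF bytes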
This looks like you should be doing
fd.write(row.encode('utf-8'))
assuming the type of row is now unicode (this is my understanding of how you presented things).

How to store key in the smart card reader?

I am using reader ACR1281 and MIFARE cards.
I communicate with the cards using python smartcard library (pc/sc).
I know the MIFARE key needed to read the card's blocks and want to store that key in the reader so I can use it (as far as I can tell from the docs, this is the only way to use my key: store it in the reader and authenticate the block to be read with it).
But the command specified in the ACR documentation, FF 82 00 00 06 FF FF FF FF FF FF, returns error 63 00.
In the command above I use key number 0 (volatile) and the key value FF FF FF FF FF FF.
Silly error.
I am using a volatile key (P1 = key_structure = 0).
According to the ACS documentation, for a volatile key only one key number can be used: the session key (P2 = key_number = 0x20).
So correct command is
FF 82 00 20 06 FF FF FF FF FF FF
I had taken the wrong command from another ACS reader's documentation.
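For reference, a sketch of loading the key via the python smartcard (pyscard) library mentioned in the question; the reader index and key value here are assumptions:

from smartcard.System import readers

reader = readers()[0]  # assumes the ACR1281 is the first PC/SC reader
connection = reader.createConnection()
connection.connect()

# Load the key: volatile structure (P1 = 00), session key number (P2 = 20)
apdu = [0xFF, 0x82, 0x00, 0x20, 0x06,
        0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF]
data, sw1, sw2 = connection.transmit(apdu)
print("SW1 SW2: %02X %02X" % (sw1, sw2))  # 90 00 indicates success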

Python 3 UnicodeDecodeError - How do I debug UnicodeDecodeError?

I have a text file which the publisher (the US Securities and Exchange Commission) asserts is encoded in UTF-8 (https://www.sec.gov/files/aqfs.pdf, section 4). I'm processing the lines with the following code:
def tags(filename):
    """Yield Tag instances from tag.txt."""
    with codecs.open(filename, 'r', encoding='utf-8', errors='strict') as f:
        fields = f.readline().strip().split('\t')
        for line in f.readlines():
            yield process_tag_record(fields, line)
I receive the following error:
Traceback (most recent call last):
File "/home/randm/Projects/finance/secxbrl.py", line 151, in <module>
main()
File "/home/randm/Projects/finance/secxbrl.py", line 143, in main
all_tags = list(tags("tag.txt"))
File "/home/randm/Projects/finance/secxbrl.py", line 109, in tags
content = f.read()
File "/home/randm/Libraries/anaconda3/lib/python3.6/codecs.py", line 698, in read
return self.reader.read(size)
File "/home/randm/Libraries/anaconda3/lib/python3.6/codecs.py", line 501, in read
newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xad in position 3583587: invalid start byte
Given that I probably can't go back to the SEC and tell them they have files that don't seem to be encoded in UTF-8, how should I debug and catch this error?
What have I tried
I did a hexdump of the file and found that the offending text was "SUPPLEMENTAL DISCLOSURE OF NON�CASH INVESTING". If I interpret the offending byte as a Unicode code point (i.e. U+00AD), it makes sense in context, as it is the soft hyphen. But the following does not seem to work:
Python 3.5.2 (default, Nov 17 2016, 17:05:23)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> b"\x41".decode("utf-8")
'A'
>>> b"\xad".decode("utf-8")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xad in position 0: invalid start byte
>>> b"\xc2ad".decode("utf-8")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 0: invalid continuation byte
I've used errors='replace', which seems to pass. But I'd like to understand what will happen if I try to insert that into a database.
Hexdump:
0036ae40 31 09 09 09 09 53 55 50 50 4c 45 4d 45 4e 54 41 |1....SUPPLEMENTA|
0036ae50 4c 20 44 49 53 43 4c 4f 53 55 52 45 20 4f 46 20 |L DISCLOSURE OF |
0036ae60 4e 4f 4e ad 43 41 53 48 20 49 4e 56 45 53 54 49 |NON.CASH INVESTI|
0036ae70 4e 47 20 41 4e 44 20 46 49 4e 41 4e 43 49 4e 47 |NG AND FINANCING|
0036ae80 20 41 43 54 49 56 49 54 49 45 53 3a 09 0a 50 72 | ACTIVITIES:..Pr|
You have a corrupted data file. If that character really is meant to be a U+00AD SOFT HYPHEN, then you are missing a 0xC2 byte:
>>> '\u00ad'.encode('utf8')
b'\xc2\xad'
Of all the possible UTF-8 encodings that end in 0xAD, a soft hyphen does make the most sense. However, it is indicative of a data set that may have other bytes missing. You just happened to have hit one that matters.
I'd go back to the source of this dataset and verify that the file was not corrupted when downloaded. Otherwise, using errors='replace' is a viable work-around, provided no delimiters (tabs, newlines, etc.) are missing.
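For the database question: with errors='replace', every undecodable byte is substituted with U+FFFD (the replacement character), so that is what would end up stored. A quick sketch:

with open('tag.txt', encoding='utf-8', errors='replace') as f:
    text = f.read()
# The 0xad byte arrives as '\ufffd', not as a soft hyphen.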
Another possibility is that the SEC is really using a different encoding for the file; for example, in Windows Codepage 1252 and Latin-1, 0xAD is the correct encoding of a soft hyphen. And indeed, when I download the same dataset directly (warning, large ZIP file linked) and open tag.txt, I can't decode the data as UTF-8:
>>> open('/tmp/2017q1/tag.txt', encoding='utf8').read()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/.../lib/python3.6/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xad in position 3583587: invalid start byte
>>> from pprint import pprint
>>> f = open('/tmp/2017q1/tag.txt', 'rb')
>>> f.seek(3583550)
3583550
>>> pprint(f.read(100))
(b'1\t1\t\t\t\tSUPPLEMENTAL DISCLOSURE OF NON\xadCASH INVESTING AND FINANCING A'
b'CTIVITIES:\t\nProceedsFromSaleOfIn')
There are two such non-ASCII characters in the file:
>>> f.seek(0)
0
>>> pprint([l for l in f if any(b > 127 for b in l)])
[b'SupplementalDisclosureOfNoncashInvestingAndFinancingActivitiesAbstract\t0'
b'001654954-17-000551\t1\t1\t\t\t\tSUPPLEMENTAL DISCLOSURE OF NON\xadCASH I'
b'NVESTING AND FINANCING ACTIVITIES:\t\n',
b'HotelKranichhheMember\t0001558370-17-001446\t1\t0\tmember\tD\t\tHotel Krani'
b'chhhe [Member]\tRepresents information pertaining to Hotel Kranichh\xf6h'
b'e.\n']
Hotel Kranichh\xf6he decoded as Latin-1 is Hotel Kranichhöhe.
There are also several 0x1C / 0x1D pairs in the file:
>>> f.seek(0)
0
>>> quotes = [l for l in f if any(b in {0x1C, 0x1D} for b in l)]
>>> quotes[0].split(b'\t')[-1][50:130]
b'Temporary Payroll Tax Cut Continuation Act of 2011 (\x1cTCCA\x1d) recognized during th'
>>> quotes[1].split(b'\t')[-1][50:130]
b'ributory defined benefit pension plan (the \x1cAetna Pension Plan\x1d) to allow certai'
I'm betting those are really U+201C LEFT DOUBLE QUOTATION MARK and U+201D RIGHT DOUBLE QUOTATION MARK characters; note the 1C and 1D parts. It almost feels as if their encoder took UTF-16 and stripped out all the high bytes, rather than encode to UTF-8 properly!
There is no codec shipping with Python that would encode '\u201C\u201D' to b'\x1C\x1D', making it all the more likely that the SEC has botched their encoding process somewhere. In fact, there are also 0x13 and 0x14 characters that are probably en and em dashes (U+2013 and U+2014), as well as 0x19 bytes that are almost certainly single quotes (U+2019). All that is missing to complete the picture is a 0x18 byte to represent U+2018.
If we assume that the encoding is broken, we can attempt to repair it. The following code would read the file and fix the quote issues, assuming that the rest of the data does not use characters outside of Latin-1 apart from the quotes:
_map = {
    # dashes
    0x13: '\u2013', 0x14: '\u2014',
    # single quotes
    0x18: '\u2018', 0x19: '\u2019',
    # double quotes
    0x1c: '\u201c', 0x1d: '\u201d',
}

def repair(line, _map=_map):
    """Repair mis-encoded SEC data. Assumes line was decoded as Latin-1."""
    return line.translate(_map)
Then apply that to the lines you read:
with open(filename, 'r', encoding='latin-1') as f:
    repaired = map(repair, f)
    fields = next(repaired).strip().split('\t')
    for line in repaired:
        yield process_tag_record(fields, line)
Separately, addressing your posted code, you are making Python work harder than it needs to. Don't use codecs.open(); that's legacy code that has known issues and is slower than the newer Python 3 I/O layer. Just use open(). Do not use f.readlines(); you don't need to read the whole file into a list here. Just iterate over the file directly:
def tags(filename):
    """Yield Tag instances from tag.txt."""
    with open(filename, 'r', encoding='utf-8', errors='strict') as f:
        fields = next(f).strip().split('\t')
        for line in f:
            yield process_tag_record(fields, line)
If process_tag_record also splits on tabs, use a csv.reader() object and avoid splitting each row manually:
import csv

def tags(filename):
    """Yield Tag instances from tag.txt."""
    with open(filename, 'r', encoding='utf-8', errors='strict') as f:
        reader = csv.reader(f, delimiter='\t')
        fields = next(reader)
        for row in reader:
            yield process_tag_record(fields, row)
If process_tag_record combines the fields list with the values in row to form a dictionary, just use csv.DictReader() instead:
def tags(filename):
    """Yield Tag instances from tag.txt."""
    with open(filename, 'r', encoding='utf-8', errors='strict') as f:
        reader = csv.DictReader(f, delimiter='\t')
        # First row is used as keys for the dictionary; no need to read fields manually.
        yield from reader

Problem writing unicode UTF-16 data to file in python

I'm working on Windows with Python 2.6.1.
I have a Unicode UTF-16 text file containing the single string Hello; if I look at it in a binary editor I see:
FF FE 48 00 65 00 6C 00 6C 00 6F 00 0D 00 0A 00
BOM H e l l o CR LF
What I want to do is read in this file, run it through Google Translate API, and write both it and the result to a new Unicode UTF-16 text file.
I wrote the following Python script (actually I wrote something more complex than this with more error checking, but this is stripped down as a minimal test case):
#!/usr/bin/python
import urllib
import urllib2
import sys
import codecs

def translate(key, line, lang):
    ret = ""
    print "translating " + line.strip() + " into " + lang
    url = "https://www.googleapis.com/language/translate/v2?key=" + key + "&source=en&target=" + lang + "&q=" + urllib.quote(line.strip())
    f = urllib2.urlopen(url)
    for l in f.readlines():
        if l.find("translatedText") > 0 and l.find('""') == -1:
            a, b = l.split(":")
            ret = unicode(b.strip('"'), encoding='utf-16', errors='ignore')
            break
    return ret

rd_file_name = sys.argv[1]
rd_file = codecs.open(rd_file_name, encoding='utf-16', mode="r")
rd_file_new = codecs.open(rd_file_name + ".new", encoding='utf-16', mode="w")
key_file = open("api.key", "r")
key = key_file.readline().strip()

for line in rd_file.readlines():
    new_line = translate(key, line, "ja")
    rd_file_new.write(unicode(line) + "\n")
    rd_file_new.write(new_line)
    rd_file_new.write("\n")
This gives me an almost-Unicode file with some extra bytes in it:
FF FE 48 00 65 00 6C 00 6C 00 6F 00 0D 00 0A 00 0A 00
20 22 E3 81 93 E3 82 93 E3 81 AB E3 81 A1 E3 81 AF 22 0A 00
I can see that 20 is a space and 22 is a quote; I assume that "E3" is an escape character that urllib2 is using to indicate that the next character is UTF-16 encoded??
If I run the same script but with "cs" (Czech) instead of "ja" (Japanese) as the target language, the response is all ASCII and I get the Unicode file with my "Hello" first as UTF-16 chars and then "Ahoj" as single byte ASCII chars.
I'm sure I'm missing something obvious but I can't see what. I tried urllib.unquote() on the result from the query but that didn't help. I also tried printing the string as it comes back in f.readlines() and it all looks pretty plausible, but it's hard to tell because my terminal window doesn't support Unicode properly.
Any other suggestions for things to try? I've looked at the suggested dupes but none of them seem to quite match my scenario.
I believe the output from Google is UTF-8, not UTF-16. Try this fix:
ret = unicode(b.strip('"'), encoding='utf-8', errors='ignore')
Those E3 bytes are not "escape characters". If one had no access to documentation, and was forced to make a guess, the most likely suspect for the response encoding would be UTF-8. Expectation (based on a one-week holiday in Japan): something like "konnichiwa".
>>> response = "\xE3\x81\x93\xE3\x82\x93\xE3\x81\xAB\xE3\x81\xA1\xE3\x81\xAF"
>>> ucode = response.decode('utf8')
>>> print repr(ucode)
u'\u3053\u3093\u306b\u3061\u306f'
>>> import unicodedata
>>> for c in ucode:
... print unicodedata.name(c)
...
HIRAGANA LETTER KO
HIRAGANA LETTER N
HIRAGANA LETTER NI
HIRAGANA LETTER TI
HIRAGANA LETTER HA
>>>
Looks close enough to me ...
