I have a string representation of a (former) binary file, created by Python's str().
The string (or actually the file the string is stored to) looks like
some\nexample\x00text'with"all\xbe\xa1Dsorts\\of[itchy%chars
So we have ASCII, escape sequences, hex escape sequences and all sorts of itchy ASCII chars like quotes.
Is there any way to convert this file back to the actual binary?
Edit 1:
The file is actually the result of a fd.write(str(dict(bottle.request.forms))).
The bottle request dictionary contains multiple entries, one of which has a pdf file as value.
The string is not encoded, the encoding is for display purposes only.
Printing it with print function/command will print its content.
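One possible way back, sketched in Python 3 and assuming the file holds exactly the text that str() produced for a Python 2 dict (so it is a valid Python literal): parse it with ast.literal_eval and re-encode each value as Latin-1, which maps code points 0-255 back to the bytes with the same values. The sample text and the key names below are made up for illustration.

```python
import ast

# Hypothetical file content: the str() of a Python 2 dict whose 'pdf'
# value holds raw bytes shown as escape sequences.
text = r"""{'name': 'report', 'pdf': 'some\nexample\x00text\'with"all\xbe\xa1Dsorts\\of[itchy%chars'}"""

# literal_eval only evaluates literals; unlike eval() it does not
# execute arbitrary code found in the file.
parsed = ast.literal_eval(text)

# In Python 3 every value comes back as str. Latin-1 maps each code point
# below 256 to the byte with the same value, restoring the original bytes.
recovered = {key: value.encode('latin-1') for key, value in parsed.items()}

print(recovered['pdf'])
```

The binary value could then be written out with a file opened in 'wb' mode. This would not work if the file were truncated or contained anything other than the dict literal.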
I'm trying to convert some binary data I have in Python (a gzipped protocol buffer object) to a hexadecimal string in string-escape fashion (e.g. \xFA\x1C ..).
I have tried both
repr(<mygzipfileobj>.getvalue())
as well as
<mygzipfileobj>.getvalue().encode('string-escape')
In both cases I end up with a string which is not made of HEX chars only.
\x86\xe3$T]\x0fPE\x1c\xaa\x1c8d\xb7\x9e\x127\xcd\x1a.\x88v ...
How can I achieve a consistent hexadecimal conversion where every single byte is actually translated to a \xHH format? (where H represents a valid hex char 0-9A-F)
The \xhh format you often see is a debugging aid: it is the output of repr() applied to a string containing non-ASCII or non-printable bytes. Printable ASCII characters are left in place so that whatever readable information there is stays readable.
If you must have a string with all characters replaced by \xhh escapes, you need to do so manually:
''.join(r'\x{0:02x}'.format(ord(c)) for c in value)
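For comparison, a minimal Python 3 sketch of the same join, assuming the data is a bytes object rather than a Python 2 str. The helper name escape_all is made up for this example.

```python
def escape_all(data):
    """Return every byte of `data` as a \\xhh escape (hypothetical helper)."""
    # Iterating over bytes in Python 3 yields integers, so no ord() is needed.
    return ''.join('\\x{:02x}'.format(byte) for byte in data)


# Printable bytes such as '$' and 'T' are escaped as well.
print(escape_all(b'\x86\xe3$T]'))
```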
If you need quotes around that, you'd need to add those manually too:
"'{0}'".format(''.join(r'\x{:02x}'.format(ord(c)) for c in value))
In Python 2, I want to write the Unicode character whose integer value is k to a text file.
How should I do that?
(For instance, with ASCII, if I want to write the character with value 65, it should appear in the text file as 'A'.)
Afterwards, how should I read the file back to get the integer value?
One last question: how many Unicode characters are there in total? (As far as I know, there is more than one Unicode "alphabet", such as UTF-8, UTF-16, etc.)
Thanks a lot
You can't write Unicode code points to text files. They must be encoded. UTF-8, UTF-16 and UTF-32 are encodings that support the full range of Unicode code points. unichr() is the function to turn an integer into a Unicode codepoint. Note that Python 2 will default to an encoding that depends on your operating system if you don't specify one, but it won't be able to write all Unicode characters unless that default is one of the UTF encodings.
Create a Unicode character:
k = 65
u = unichr(k)
Write it to a file encoded in UTF-8:
import io
with io.open('output.txt', 'w', encoding='utf8') as f:
    f.write(u)
ord() will convert a character back to an integer.
Example (make sure to open with the same encoding as written):
import io
with io.open('output.txt', encoding='utf8') as f:
    u = f.read()
k = ord(u)
Unicode code points range from U+0000 to U+10FFFF. Not all code points are defined, but there are 1,114,112 possible values in that range.
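A rough Python 3 equivalent of the round trip above, where chr() takes the place of unichr() and the built-in open() accepts an encoding. The code point and the temporary file location are chosen only for illustration.

```python
import os
import tempfile

k = 0x20AC  # code point of the euro sign, chosen only as an example

# Write into a fresh temporary directory so no existing file is touched.
path = os.path.join(tempfile.mkdtemp(), 'output.txt')

with open(path, 'w', encoding='utf8') as f:
    f.write(chr(k))          # integer -> one-character str

with open(path, encoding='utf8') as f:
    restored = ord(f.read())  # one-character str -> integer

# U+0000 through U+10FFFF inclusive.
total_code_points = 0x10FFFF + 1
print(restored == k, total_code_points)
```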
I have large hex data files from which I need to compare some hex values. When I read a file with Python's read(), it seems to convert the contents to ASCII automatically, so I have to decode them again. How can I read the file directly as hex?
So far I have tried the IntelHex Python package, but it throws an error:
intelhex.HexRecordError: Hex files contain invalid record. So is the problem with my files?
How much of a performance difference will it make if I manage to read the hex data without decoding?
Split the file into hex words consisting purely of [0-9a-fA-F] characters; int(word, 16) will then turn a word into a normal Python integer. You can compare the integers directly.
Alternatively, you can keep the hex words and convert an integer to a hex string using '{0:x}'.format(someinteger) before comparing the hex strings.
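A small sketch of the first suggestion, assuming the file is text made of hex words separated by whitespace or other non-hex characters. The helper name hex_words and the sample strings are invented for the example.

```python
import re


def hex_words(text):
    """Return each run of hex digits in `text` as an integer (hypothetical helper)."""
    return [int(word, 16) for word in re.findall(r'[0-9a-fA-F]+', text)]


left = hex_words('00ff 1A2b\n7f')
right = hex_words('FF 1a2B 80')

# Comparing integers sidesteps differences in case and leading zeros.
matches = [a == b for a, b in zip(left, right)]
print(left, right, matches)
```

Comparing the raw hex strings instead would treat '00ff' and 'FF' as different, which is why converting to integers first can be the safer choice.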
>>> s = open('input_file', 'rb').read(10)
>>> s
'\x00\x00\x00\x02\x00\xe6\x00\xa1I\x8d'
It is an ordinary sequence of bytes. If a byte is in the ASCII range, it is shown as the corresponding character in the representation, e.g. s[-2] == 'I'. The byte is the same (73 in decimal); it is just shown in a human-readable form.
You don't need to do any conversion to compare bytestrings (a[2:10] == b[4:12] works). Python does not decode your files to hex, ascii, or anything else unless you ask. Just make sure you open the files in binary mode (rb).
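The same point as a short Python 3 sketch, using the ten bytes from the session above plus a second, made-up byte string to compare against. In Python 3 indexing yields an integer, and bytes.hex() gives a hex view only when one is asked for.

```python
data = b'\x00\x00\x00\x02\x00\xe6\x00\xa1I\x8d'
other = b'\xff\xff\x00\x02\x00\xe6'  # hypothetical second file's contents

# Slices of bytes compare directly, with no decoding step.
same_field = data[2:6] == other[2:6]

# In Python 3 a single index returns the byte's integer value.
second_to_last = data[-2]

# A hex string is available on request, but is not needed for comparing.
as_hex = data.hex()
print(same_field, second_to_last, as_hex)
```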
In Python 3, when I opened a text file with mode string 'rb', and then did f.read(), I was taken aback to find the file contents enclosed in single quotes after the character 'b'.
In Python 2 I just get the file contents.
I'm sure this is well known, but I can't find anything about it in the doco. Could someone point me to it?
You get "just the file contents" in Python 3 as well. Most likely you can just keep on doing whatever you were doing anyway. Read on for a longer explanation:
The b'' signifies that the result value is a bytes string. A bytes string is similar to a normal string, but not identical; it is used to handle binary, non-textual data.
Some string methods that don't make sense for binary data are gone, but most are still there. A big difference is that when you index a single byte from a bytes string you get an integer back, while for a normal str you get a str of length one.
>>> b'foo'[1]
111
>>> 'foo'[1]
'o'
If you open the file in text mode with the 't' flag you get a str back. The Python 3 str is what in Python 2 was called unicode. It's used to handle textual data.
You convert back and forth between bytes and str with the .encode() and .decode() methods.
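A minimal Python 3 illustration of that conversion, with UTF-8 picked as the encoding for the example:

```python
text = 'héllo'

# str -> bytes: the accented letter becomes two bytes in UTF-8.
raw = text.encode('utf-8')

# bytes -> str: decoding with the same encoding restores the text.
back = raw.decode('utf-8')

print(raw, back, len(text), len(raw))
```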
First of all, the Python 2 str type has been renamed to bytes in Python 3, and byte literals use the b'' prefix. The Python 2 unicode type is the new Python 3 str type.
To get the Python 3 file behaviour in Python 2, you'd use io.open() or codecs.open(); Python 3 decodes text files to Unicode by default.
What you see is that for binary files, Python 3 gives you the exact same thing as Python 2, namely byte strings. What changed is that the repr() of a byte string is now prefixed with b, and that print() shows that repr() form for bytes values: print() calls str() on its arguments, and str() of a bytes object is the same as its repr(). Only str (Unicode) values are printed as plain text.
To print your binary data as Unicode text with the print() function, decode it to Unicode first. But then you could perhaps have opened the file as a text file instead anyway.
The bytes type has some other improvements to reflect that you are dealing with binary data, not text. Indexing individual bytes or iterating over a bytes value gives you int values (between 0 and 255) and not characters, for example.
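A short Python 3 sketch of these bytes behaviours; the sample value is arbitrary.

```python
raw = b'AB\xff'

# Iterating yields integers between 0 and 255, not one-character strings.
values = list(raw)

# bytes() accepts such a list and rebuilds the original value.
rebuilt = bytes(values)

# str() of a bytes object is its repr(), which is what print() displays.
shown = str(raw)

# Decoding first gives text that prints without the b'' wrapper.
as_text = raw[:2].decode('ascii')
print(values, shown, as_text)
```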
Sometimes we need (needed?) to know whether a text file had single-character newlines (0A) or double character newlines (0D0A).
We used to avoid confusion by opening the text file in binary mode, recognising 0D and 0A, and treating other bytes as regular text characters.
One could port such code by finding all binary-mode reads and replacing them with a new function oldread() that stripped off the added material, but it’s a bit painful.
I suppose the Python theologians thought of keeping ‘rb’ as it was, and adding a new ‘rx’ or something for the new behaviour. It seems a bit high-handed just to abolish something.
But, there it is, the question is certainly answered by a search for ‘rb’ in Lennert’s document.
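For what it is worth, the newline check described above still seems workable on Python 3 bytes without stripping anything. A minimal sketch, with a made-up helper name and sample data:

```python
def newline_counts(data):
    """Count CRLF and bare LF line endings in a bytes value (hypothetical helper)."""
    crlf = data.count(b'\r\n')
    # Every CRLF also contains an LF, so subtract those to get the bare ones.
    lf_only = data.count(b'\n') - crlf
    return {'crlf': crlf, 'lf': lf_only}


sample = b'first\r\nsecond\nthird\r\n'
counts = newline_counts(sample)
print(counts)
```

For a real file, the data would come from open(path, 'rb').read(), which returns the bytes unchanged.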
When you use the .read(n) method on a file object in Python, you get n bytes back.
What if I first load a file into a string: is there some function that lets me do the same thing?
Because I guess it's not as easy as filestring[0:5], because of different types of encoding.
(And I don't really want to pay attention to that, the file read can be a text file in any format or a binary file)
If the string is of type str (a byte string in Python 2, not a Unicode string of type unicode), then slicing will work as expected:
import struct
# The 8 bytes after the colon are a native-order (here little-endian) double.
prefixed_bits = "extract this double:\xc2\x8eET\xfb!\t@"
pos = prefixed_bits.index(":") + 1
print "That looks like the value %f" % struct.unpack("d", prefixed_bits[pos:pos+8])
This prints 3.141593, the binary representation of which is encoded in the string literal.
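The same idea as a Python 3 sketch, where the data must be bytes from the start. Here the payload is built with struct.pack instead of a hand-written literal, and the byte order is fixed to little-endian with '<d'; both choices are assumptions made for the example.

```python
import math
import struct

# Build a byte string with a text prefix followed by an 8-byte double.
payload = b'extract this double:' + struct.pack('<d', math.pi)

# Slicing bytes counts bytes, so no encoding needs to be considered.
pos = payload.index(b':') + 1
(value,) = struct.unpack('<d', payload[pos:pos + 8])

print('That looks like the value %f' % value)
```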