I have big hex data files and I need to compare some hex values in them. When I read a file with Python, read() seems to convert it to ASCII automatically, so I have to decode it again. How can I read the file directly as hex?
So far I have tried the intelhex Python package, but it throws an error:
intelhex.HexRecordError: Hex files contain invalid record
Is this an issue with my files in particular?
Also, how much of a performance difference would it make if I could read the hex data directly without decoding?
Split the file into hex words consisting purely of [0-9a-fA-F] characters; then int(word, 16) will convert a word into a normal Python integer. You can compare integers directly.
Alternatively, you can keep the hex words and convert an integer to a hex string using '{0:x}'.format(someinteger) before comparing the hex strings.
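A rough sketch of the first approach (the file name and the assumption of whitespace-separated hex words are made up; adjust the splitting to whatever your files actually look like):

import re

# Hypothetical text file of hex words, e.g. "00e6 00a1 498d ..."
with open('hexfile.txt') as f:
    words = re.findall(r'[0-9a-fA-F]+', f.read())

values = [int(word, 16) for word in words]   # hex word -> plain integer
print(values[0] == values[1])                # integers compare directly

print('{0:x}'.format(values[0]))             # and back: integer -> hex string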
>>> s = open('input_file', 'rb').read(10)
>>> s
'\x00\x00\x00\x02\x00\xe6\x00\xa1I\x8d'
It is an ordinary sequence of bytes. If a byte is in the printable ASCII range then it is shown as the corresponding character in the representation, e.g., s[-2] == 'I'. The byte is the same (73 in decimal), it is just shown in a human-readable form.
You don't need to do any conversion to compare bytestrings (a[2:10] == b[4:12] works). Python does not decode your files to hex, ASCII, or anything else unless you ask. Just make sure you open the files in binary mode ('rb').
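For example (the file names are placeholders):

a = open('file_a.bin', 'rb').read()
b = open('file_b.bin', 'rb').read()

# compare slices of raw bytes directly; no decoding involved
if a[2:10] == b[4:12]:
    print('slices match')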
In Python 2, I want to write the Unicode character whose integer value is k to a text file.
How should I do that?
(For instance, with ASCII, if I want to write the character with value 65, it should appear in the text file as 'A'.)
Afterwards, how should I read the file back to an integer value?
Last question: how many Unicode characters are there in total? (As far as I know there is more than one Unicode alphabet, such as UTF-8, UTF-16, etc.)
Thanks a lot
You can't write Unicode code points to text files. They must be encoded. UTF-8, UTF-16 and UTF-32 are encodings that support the full range of Unicode code points. unichr() is the function to turn an integer into a Unicode codepoint. Note that Python 2 will default to an encoding that depends on your operating system if you don't specify one, but it won't be able to write all Unicode characters unless that default is one of the UTF encodings.
Create a Unicode character:
k = 65
u = unichr(k)
Write it to a file encoded in UTF-8:
import io
with io.open('output.txt', 'w', encoding='utf8') as f:
    f.write(u)
ord() will convert a character back to an integer.
Example (make sure to open with the same encoding as written):
import io
with io.open('output.txt', encoding='utf8') as f:
    u = f.read()
k = ord(u)
Unicode code points range from U+0000 to U+10FFFF. Not all code points are defined, but there are 1,114,112 possible values in that range.
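A quick check of those numbers in the interpreter (the sys.maxunicode value shown assumes a "wide" Python 2 build; narrow builds report 0xFFFF):

>>> 0x10FFFF + 1          # number of possible code points
1114112
>>> import sys
>>> sys.maxunicode
1114111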
I have a string representation of some (formerly) binary file, created by Python's str().
The string (or rather the file the string is stored in) looks like:
some\nexample\x00text'with"all\xbe\xa1Dsorts\\of[itchy%chars
So we have ASCII, escape sequences, hex escape sequences, and all sorts of itchy ASCII characters like quotes.
Is there any way to convert this file back to the actual binary data?
Edit 1:
The file is actually the result of a fd.write(str(dict(bottle.request.forms))).
The bottle request dictionary contains multiple entries, one of which has a pdf file as value.
The string itself is not encoded; the escaping is for display purposes only.
Printing it with the print function/statement will print its content.
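If the file really does hold the str() of a plain dict (as described in the edit), one rough way to recover the original byte strings is to evaluate that dict literal again. This is only a sketch under that assumption; the file name and form key are made up:

import ast

# Assumes the file holds exactly what str(dict(bottle.request.forms)) wrote,
# i.e. a valid Python dict literal whose values are escaped byte strings.
with open('dump.txt', 'rb') as fd:
    forms = ast.literal_eval(fd.read())

pdf_bytes = forms['upload']          # hypothetical key holding the PDF value
with open('recovered.pdf', 'wb') as out:
    out.write(pdf_bytes)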
I am using the dpkt Python module to parse a pcap file. I'm looking deep enough into the packets that some of the data is represented as byte streams. I can convert from regular byte strings easily enough; however, some of the byte strings appear as:
\t\x01\x1c\x88
The first value should be 09; however, for some reason it's shown as an escaped tab character (the hex code of a tab is 09).
It's doing this for other characters in other streams as well.
Some more sample outputs:
\x10\x00#\x00\
\x05q\x00\x00\
\x069\x9c\n\x00
So my question is: can I convert this byte stream to one without these extra characters?
Alternatively, how would I go about converting something like '\t' to hex so that it returns '09'?
Update:
It turns out that I was creating the strings to be converted using a function that returned
\t011c88 in place of the first stream.
Leaving the stream alone and using stream.encode("hex") worked.
The repr function escapes all non-printable characters by default, as you've seen.
To get a hex-only representation, use
string.encode("hex")
NOTE: The original bytestream is correct; you should only convert to hex for viewing purposes, not for integrity purposes. The repr merely shows the data in a strange way.
In Python 3, when I opened a text file with mode string 'rb', and then did f.read(), I was taken aback to find the file contents enclosed in single quotes after the character 'b'.
In Python 2 I just get the file contents.
I'm sure this is well known, but I can't find anything about it in the doco. Could someone point me to it?
You get "just the file contents" in Python 3 as well. Most likely you can just keep on doing whatever you were doing anyway. Read on for a longer explanation:
The b'' prefix signifies that the value is a bytes string. A bytes string is quite similar to a normal string, but not quite the same, and is used to handle binary, non-textual data.
Some of the string methods that don't make sense for binary data are gone, but most are still there. A big difference is that when you get a specific byte from a bytes string you get an integer back, while for a normal str you get a one-character str.
>>> b'foo'[1]
111
>>> 'foo'[1]
'o'
If you open the file in text mode with the 't' flag you get a str back. The Python 3 str is what in Python 2 was called unicode. It's used to handle textual data.
You convert back and forth between bytes and str with the .encode() and .decode() methods.
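For example:

>>> 'héllo'.encode('utf-8')            # str -> bytes
b'h\xc3\xa9llo'
>>> b'h\xc3\xa9llo'.decode('utf-8')    # bytes -> str
'héllo'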
First of all, the Python 2 str type has been renamed to bytes in Python 3, and byte literals use the b'' prefix. The Python 2 unicode type is the new Python 3 str type.
To get the Python 3 file behaviour in Python 2, you'd use io.open() or codecs.open(); Python 3 decodes text files to Unicode by default.
What you see is that for binary files, Python 3 gives you the exact same thing as Python 2, namely byte strings. What changed is that the repr() of a byte string is prefixed with b, and print() displays a bytes object using that repr(), whereas a str prints as plain text.
To print your binary data as Unicode text with the print() function, decode it first. But then you could perhaps have opened the file as a text file in the first place.
The bytes type has some other improvements to reflect that you are dealing with binary data, not text. Indexing individual bytes or iterating over a bytes value gives you int values (between 0 and 255) and not characters, for example.
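A small illustration (the file name is a placeholder):

data = open('input_file', 'rb').read(4)
print(data)                   # bytes print via their repr, e.g. b'\x00\x01I\x8d'
print(data.decode('latin1'))  # decode first to print the bytes as text
print(data[0])                # indexing bytes gives an int between 0 and 255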
Sometimes we need (needed?) to know whether a text file had single-character newlines (0A) or double character newlines (0D0A).
We used to avoid confusion by opening the text file in binary mode, recognising 0D and 0A, and treating other bytes as regular text characters.
One could port such code by finding all binary-mode reads and replacing them with a new function oldread() that strips off the added material, but it's a bit painful.
I suppose the Python theologians thought of keeping ‘rb’ as it was, and adding a new ‘rx’ or something for the new behaviour. It seems a bit high-handed just to abolish something.
But, there it is, the question is certainly answered by a search for ‘rb’ in Lennert’s document.
I'm writing a Python script to process some music data. It's supposed to merge two separate databases by comparing their entries and matching them up. It's almost working, but it fails when comparing strings containing special characters (e.g., accented letters). I'm pretty sure it's an ASCII vs. Unicode encoding issue, as I get the error:
"Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal"
I realize I could use regular expressions to remove the offending characters, but I'm processing a lot of data and relying too much on regexes makes my program grindingly slow. Is there a way to have Python properly compare these strings? What is going on here--is there a way to tell whether it's storing my strings as ASCII or Unicode?
EDIT 1: I'm using Python v2.6.6. After checking the types, I've discovered that one database spits out Unicode strings and the other gives ASCII. So that's probably the problem. I'm trying to convert the ASCII strings from the second database to Unicode with a line like
line = unicode(f.readline().decode('latin_1').encode('utf_8'))
but this gives an error like:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 41: ordinal not in range(128)
I'm not sure why the 'ascii' codec is complaining since I'm trying to decode from ASCII. Can anyone help?
Unicode vs Bytes
First, some terminology. There are two types of strings, encoded and decoded:
Encoded. This is what's stored on disk. To Python, it's a bunch of 0's and 1's that you might treat like ASCII, but it could be anything -- binary data, a JPEG image, whatever. In Python 2.x, this is called a "string" variable. In Python 3.x, it's more accurately called a "bytes" variable.
Decoded. This is a string of actual characters. They could be encoded as 8-bit ASCII strings, or as 32-bit Chinese characters. But until it's time to convert it to an encoded variable, it's just a Unicode string of characters.
What this means to you
So here's the thing. You said you were getting one ASCII variable and one Unicode variable. That's actually not true.
You have one variable that's a string of bytes -- ones and zeros, presumably in sets of 8. This is the variable you assumed, incorrectly, to be ASCII.
You have another variable that's Unicode data -- numbers, letters, and symbols.
Before you compare the string of bytes to a Unicode string of characters, you have to make some assumptions. In your case, Python (and you) assumed that the string of bytes was ASCII encoded. That worked fine until you came across a character that wasn't ASCII -- a character with an accent mark.
So you need to find out what that string of bytes is encoded as. It might be latin1. If it is, you want to do this:
if unicode_variable == string_variable.decode('latin1'):
Latin1 is basically ASCII plus some extended characters like Ç and Â.
If your data is in Latin1, that's all you need to do. But if your string of bytes is encoded in something else, you'll need to figure out what encoding that is and pass it to decode().
The bottom line is, there's no easy answer, unless you know (or make some assumptions) about the encoding of your input data.
What I would do
Try running var.decode('latin1') on your string of bytes. That will give you a Unicode variable. If that works, and the data looks correct (i.e., characters with accent marks look like they belong), roll with it.
Oh, and if latin1 doesn't parse or doesn't look right, try utf8 -- another common encoding.
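A quick, hedged way to eyeball that (the sample bytes are made up; substitute one of the strings that fails to compare):

# Python 2 sketch: try the common encodings and inspect the result
raw = 'Caf\xc3\xa9'                      # hypothetical byte string from the second database
for encoding in ('utf8', 'latin1'):
    try:
        print encoding, repr(raw.decode(encoding))
    except UnicodeDecodeError:
        print encoding, 'failed'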
You might need to preprocess the databases and convert everything into UTF-8. My guess is that you've got Latin-1 accented characters in some entries.
As to your question, the only way to know for sure is to look. Have your script spit out those that don't compare, and look up the character codes. Or just try string.decode('latin1').encode('utf8') and see what happens.
Converting both to unicode should help:
if unicode(str1) == unicode(str2):
print "same"
To find out whether YOU (not it) are storing your strings as str objects or unicode objects, print type(your_string).
You can use print repr(your_string) to show yourself (and us) unambiguously what is in your string.
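For example, in Python 2:

>>> s = 'caf\xc3\xa9'       # a byte string (UTF-8 encoded 'café')
>>> type(s)
<type 'str'>
>>> print repr(s)
'caf\xc3\xa9'
>>> type(u'café')
<type 'unicode'>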
By the way, exactly what version of Python are you using, on what OS? If Python 3.x, use ascii() instead of repr().