I am using the dpkt Python module to parse a pcap file. I'm looking deep enough into the packets that some of the data is represented as byte streams. I can convert from regular byte strings easily enough; however, some of the byte strings appear as:
\t\x01\x1c\x88
The first byte should be 09, but for some reason it's shown as an escaped tab character (the hex code of a tab is 09).
It's doing this for other characters in other streams as well.
Some more sample outputs:
\x10\x00#\x00\
\x05q\x00\x00\
\x069\x9c\n\x00
So my question is: can I convert this byte stream to one without these extra characters?
Alternatively, how would I go about converting something like '\t' to hex so that it returns '09'?
Update:
It turns out I was creating the strings to be converted with a function that returned
\t011c88 in place of the first stream.
Leaving the stream alone and using stream.encode("hex") worked.
The repr function by default escapes all non-printable characters like you've seen.
To get a hex-only representation, use
string.encode("hex")
NOTE: The original bytestream is correct; the escaped repr just shows the data in a strange way. Convert to hex for viewing purposes only, not for integrity purposes.
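For example, with the first sample stream above (Python 2; data is just an illustrative name):
>>> data = '\t\x01\x1c\x88'
>>> data
'\t\x01\x1c\x88'
>>> data.encode("hex")
'09011c88'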
Related
I am trying to capture some data from a piece of hardware I'm developing through one of Cypress' FX2LP chips. I used Cypress' software to record a sample of my data stream to a file, which I am trying to read with Python. However, when I read it, I'm getting some interesting output that I'm not sure how to interpret.
I am opening the file like this:
f = open("testdata_5Aug2014.dat","rb")
Then I read the data in variously sized chunks, similar to this:
f.read(100)
Typically, the result of the above line (and what I want to see) is something like this:
'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x05\x12\x05\x12\x05\x12\x05\x12\x05\x12\x05\x12\x05\x12\x05\x12\x05\x12\x05\x12\x05\x12\x05\x12\x05\x12\x05\x12\x05\x12\x05\x12\x05\x12\x05\x12\x05\x12\x05\x12\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
But I sometimes get returns with 't's and '?'s thrown in, like this:
'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00t\x14t\x14t\x14t\x14t\x14t\x14t\x14t\x14t\x14t\x14t\x14t\x14t\x14t\x14t\x14t\x14t\x14t\x14t\x14t\x14K\x01?\x00\xff??\x00\xff??\x00\xff??\x00\xff?\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
This is a problem, because when I use struct.unpack to parse this out, it won't return any of those bytes with the special characters appended.
So my question is: what are those symbols, how did they get there, and how do I remove them or deal with them?
You're reading binary data from a file, but f.read returns that data as a string. When that string is displayed, Python interprets its bytes as characters. However, not every byte value maps to a displayable character, so some bytes are shown as escape sequences: \x followed by two hexadecimal digits. For example, byte value 0 shows up as \x00 and 255 shows up as \xff.
Some values do map to characters, such as 63 mapping to '?' and 116 mapping to 't'. The ord and chr functions can be used to fetch the numerical value of a character, and the character mapping for a number, respectively, so ord('t') returns 116 and chr(63) returns '?'.
Either way, no matter how it's displayed, your data should be fine, and struct.unpack should be able to work with it as usual.
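A quick check with a two-byte sample from your output (Python 2; '<H' assumes a little-endian unsigned short, which may not match your hardware's actual format):
>>> import struct
>>> chunk = 't\x14'          # the same bytes as '\x74\x14'
>>> ord('t'), chr(63)
(116, '?')
>>> struct.unpack('<H', chunk)
(5236,)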
I have big hex data files from which I need to compare some hex values. When I read them with Python, the data is automatically converted to ASCII, so I have to decode it again. How can I read a file directly as hex?
So far I have tried the IntelHex Python package, but it throws an error:
intelhex.HexRecordError: Hex files contain invalid record. So is the issue with my files?
How much of a performance difference would it make if I could read the hex data without decoding?
Split the file into hex words consisting purely of [0-9a-fA-F] characters; then int(word, 16) will convert a word to a normal Python integer. You can compare integers directly.
Alternatively, you can keep the hex words and convert any integer to a hex string using '{0:x}'.format(someinteger) before comparing the hex strings.
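A minimal sketch of that approach (Python 2; the file name and whitespace-separated layout are assumptions about your data):
words = open('input_file').read().split()           # assumes whitespace-separated hex words
values = [int(word, 16) for word in words]          # plain Python integers
if values[0] == values[1]:                          # compare directly as integers
    print "first two words match"
hex_strings = ['{0:x}'.format(v) for v in values]   # or back to hex strings if preferred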
>>> s = open('input_file', 'rb').read(10)
>>> s
'\x00\x00\x00\x02\x00\xe6\x00\xa1I\x8d'
It is an ordinary sequence of bytes. If a byte is in the printable ASCII range, it is shown as the corresponding character in the representation, e.g., s[-2] == 'I'. The byte is the same (73 in decimal form); it is just shown in a human-readable form.
You don't need to do any conversion to compare bytestrings (a[2:10] == b[4:12] works). Python does not decode your files to hex, ascii, or anything else unless you ask. Just make sure you open the files in binary mode (rb).
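For example (file names are hypothetical):
a = open('file_a', 'rb').read()
b = open('file_b', 'rb').read()
same = (a[2:10] == b[4:12])   # compares raw bytes directly; no decoding involved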
I'm writing a Python script to process some music data. It's supposed to merge two separate databases by comparing their entries and matching them up. It's almost working, but fails when comparing strings containing special characters (i.e. accented letters). I'm pretty sure it's a ASCII vs. Unicode encoding issue, as I get the error:
"Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal"
I realize I could use regular expressions to remove the offending characters, but I'm processing a lot of data and relying too much on regexes makes my program grindingly slow. Is there a way to have Python properly compare these strings? What is going on here--is there a way to tell whether it's storing my strings as ASCII or Unicode?
EDIT 1: I'm using Python v2.6.6. After checking the types, I've discovered that one database gives me Unicode strings and the other gives ASCII. So that's probably the problem. I'm trying to convert the ASCII strings from the second database to Unicode with a line like
line = unicode(f.readline().decode("latin_1").encode("utf_8"))
but this gives an error like:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 41: ordinal not in range(128)
I'm not sure why the 'ascii' codec is complaining since I'm trying to decode from ASCII. Can anyone help?
Unicode vs Bytes
First, some terminology. There are two types of strings, encoded and decoded:
Encoded. This is what's stored on disk. To Python, it's a bunch of 0's and 1's that you might treat like ASCII, but it could be anything -- binary data, a JPEG image, whatever. In Python 2.x, this is called a "string" variable. In Python 3.x, it's more accurately called a "bytes" variable.
Decoded. This is a string of actual characters. It could be encoded to an 8-bit ASCII string, or to 32-bit Chinese characters, but until it's time to convert to an encoded variable, it's just a Unicode string of characters.
What this means to you
So here's the thing. You said you were getting one ASCII variable and one Unicode variable. That's actually not true.
You have one variable that's a string of bytes -- ones and zeros, presumably in sets of 8. This is the variable you assumed, incorrectly, to be ASCII.
You have another variable that's Unicode data -- numbers, letters, and symbols.
Before you compare the string of bytes to a Unicode string of characters, you have to make some assumptions. In your case, Python (and you) assumed that the string of bytes was ASCII encoded. That worked fine until you came across a character that wasn't ASCII -- a character with an accent mark.
So you need to find out what that string of bytes is encoded as. It might be latin1. If it is, you want to do this:
if unicode_variable == string_variable.decode('latin1')
Latin1 is basically ASCII plus some extended characters like Ç and Â.
If your data is in Latin1, that's all you need to do. But if your string of bytes is encoded in something else, you'll need to figure out what encoding that is and pass it to decode().
The bottom line is, there's no easy answer, unless you know (or make some assumptions) about the encoding of your input data.
What I would do
Try running var.decode('latin1') on your string of bytes. That will give you a Unicode variable. If that works, and the data looks correct (ie, characters with accent marks look like they belong), roll with it.
Oh, and if latin1 doesn't parse or doesn't look right, try utf8 -- another common encoding.
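A rough sketch of that procedure (Python 2; variable names are made up, and note that latin1 never raises an error because every byte value is valid latin1, so try the stricter utf8 first and eyeball the result):
byte_string = row_from_second_db          # your encoded bytes (name assumed)
try:
    decoded = byte_string.decode('utf8')
except UnicodeDecodeError:
    decoded = byte_string.decode('latin1')
if unicode_variable == decoded:           # now a unicode-to-unicode comparison
    print "match"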
You might need to preprocess the databases and convert everything into UTF-8. My guess is that you've got Latin-1 accented characters in some entries.
As to your question, the only way to know for sure is to look. Have your script spit out those that don't compare, and look up the character codes. Or just try string.decode('latin1').encode('utf8') and see what happens.
Converting both to unicode should help:
if unicode(str1) == unicode(str2):
    print "same"
(Note that unicode() without an encoding argument assumes ASCII, so this raises UnicodeDecodeError if either string contains non-ASCII bytes; pass an explicit encoding, e.g. unicode(str1, 'utf-8'), in that case.)
To find out whether YOU (not it) are storing your strings as str objects or unicode objects, print type(your_string).
You can use print repr(your_string) to show yourself (and us) unambiguously what is in your string.
By the way, exactly what version of Python are you using, on what OS? If Python 3.x, use ascii() instead of repr().
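For example (Python 2; the values are made up):
>>> s = 'caf\xc3\xa9'    # a str object: bytes (here, UTF-8 encoded)
>>> u = u'caf\xe9'       # a unicode object: characters
>>> type(s), type(u)
(<type 'str'>, <type 'unicode'>)
>>> print repr(s)
'caf\xc3\xa9'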
I am retrieving a value that is set by another application from memcached using python-memcached library. But unfortunately this is the value that I am getting:
>>> mc.get("key")
'\x04\x08"\nHello'
Is it possible to parse this mixed ASCII output into a plain string using a Python function?
Thanks heaps for your help
It is a "plain string", to the extent that such a thing exists. I have no idea what kind of output you're expecting, but:
There ain't no such thing as plain text.
The Python (in 2.x, anyway) str type is really a container for bytes, not characters. So it isn't really text in the first place :) It displays the bytes assuming a very simple encoding, using escape sequences to represent every byte that's even slightly "weird". It will be formatted differently again if you print the string (what you're seeing right now is the syntax for creating such a literal string in your code).
In simpler times, we naively assumed that we could just map bytes to these symbols we call "characters", and that would be that. Then it turned out that there were approximately a zillion different mappings that people wanted to use, and lots of them needed more symbols than a byte could represent. Which is why we have Unicode now: it represents every symbol you could conceivably need for any real-world language (and several for fake languages and other purposes), and it abstractly assigns numbers to those symbols but does not say how to collect and interpret the bytes as numbers. (That is the purpose of the encoding).
If you know that the string data is encoded in a particular way, you can decode it to a Unicode string. It could either be an encoding of actual Unicode data, or it could be in some other format (for example, Japanese text is often found in something called "Shift-JIS", because it has approximately the same significance to them as "Latin-1" - a common extension of ASCII - does to us). Either way, you get an in-memory representation of a series of Unicode code points (the numbers referred to in the previous paragraph). This, for all intents and purposes, is really "text", but it isn't really "plain" :)
But it looks like the data you have is really a binary blob of bytes that simply happens to consist mostly of "readable text" if interpreted as ASCII.
What you really need to do is figure out why the first byte has a value of 4 and the next byte has a value of 8, and proceed accordingly.
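One way to start, using the value from your example, is to look at the numeric byte values (Python 2):
>>> data = mc.get("key")
>>> [ord(c) for c in data]
[4, 8, 34, 10, 72, 101, 108, 108, 111]
>>> chr(34), chr(10)
('"', '\n')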
If you just need to trim the '\x04\x08"\n' prefix, and it's always the same (you haven't put your question very clearly; I'm not certain whether that's what you want), do something like this:
to_trim = '\x04\x08"\n'
string = mc.get('key')
if string.startswith(to_trim):
    string = string[len(to_trim):]
I have been studying unicode and its Python implementation for two days now, and I think I'm getting a glimpse of what it is about. Just to be sure, I'm asking whether my assumptions about my current problem are correct.
In Django, forms give me unicode strings which I suspect to be "broken". Unicode strings in Python should be encoded in UTF-8, is that right? After entering the string "fähre" into a text field, the browser sends the string "f%c3%a4hre" in the POST request (checked via wireshark). When I retrieve the value via form.cleaned_data, I'm getting the string u'f\xe4hre' (note it is a unicode string), though. As far as I understand that, that is an ISO-8859-1-encoded unicode string, which is incorrect. The correct string should be u'f\xc3\xa4hre', which would be a UTF-8-encoded unicode string. Is that a Django bug or is there something wrong with my understanding of it?
To fix the issue, I wrote a function to apply it to any text input from Django forms:
def fix_broken_unicode(s):
    return unicode(s.encode(u'utf-8'), u'iso-8859-1')
which does
>>> fix_broken_unicode(u'f\xe4hre')
u'f\xc3\xa4hre'
That doesn't seem very elegant to me, but setting Django's settings.DEFAULT_CHARSET to 'utf-8' didn't help, nor did anything else. I am trying to work with unicode throughout the whole application so I won't get any weird errors later on, but it obviously does not suffice to mark all strings with u'...'.
Edit: Considering the answers from Dirk and sth, I will now save the strings to the database as they are. The real problem was that I was trying to urlencode these kinds of strings to use them as input for the Twitter API etc. In GET or POST requests, though, UTF-8 encoding is expected, which the standard urllib.urlencode() function does not handle for unicode input (it throws exceptions). Take a look at my solution in the pastebin and feel free to comment on it also.
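For readers hitting the same issue, a minimal sketch of such a workaround (Python 2; the parameter name is made up): encode each unicode value to UTF-8 bytes before handing the dict to urllib.urlencode.
import urllib
params = {'status': u'f\xe4hre'}   # hypothetical parameter for the Twitter API
encoded = dict((k, v.encode('utf-8')) for k, v in params.items())
print urllib.urlencode(encoded)    # status=f%C3%A4hre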
u'f\xe4hre' is a unicode string, not encoded as anything. The unicode codepoint 0xe4 is the character ä. It is not really important that ä would also be encoded as byte 0xe4 in ISO-8859-1.
The unicode string can contain any unicode characters without encoding them in some way. For example 轮渡 would be represented as u'\u8f6e\u6e21', which are simply two unicode codepoints. The UTF-8 encoding would be the much longer '\xe8\xbd\xae\xe6\xb8\xa1'.
So there is no need to fix the encoding, you are just seeing the internal representation of the unicode string.
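You can see this at the prompt (Python 2; printing assumes your console can display ä):
>>> u = u'f\xe4hre'
>>> print u
fähre
>>> u.encode('utf-8')
'f\xc3\xa4hre'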
Not exactly: after having been decoded, the unicode string is just unicode, which means it may contain characters with codes beyond 255. How the interpreter represents these internally depends on the platform, but nowadays it usually uses character elements at least 16 bits wide. ISO-8859-1 is a proper subset of Unicode. Thus, the string u'f\xe4hre' is actually proper -- the \xe4 escape is a rendering artifact, since Python doesn't know if (and when) it is safe to show characters with codes beyond a certain range on the console.
UTF-8 is a transport encoding, that is, a special way to write unicode data such that it can be stored in "channels" with an element width of 8 bits per character/byte. In order to compute the proper "external" (or transport) encoding of a unicode string, you use the encode method, passing the desired representation. It returns a properly encoded byte string (as opposed to a unicode character string).
The reverse transformation is decode which takes a byte string and an encoding name and yields a unicode character string.
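A short round trip showing both directions (Python 2):
>>> u = u'f\xe4hre'
>>> b = u.encode('utf-8')     # unicode -> byte string
>>> b
'f\xc3\xa4hre'
>>> b.decode('utf-8') == u    # byte string -> unicode
True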