I am trying to capture some data from a piece of hardware I'm developing through one of Cypress's FX2LP chips. I used Cypress's software to record a sample of my data stream to a file, which I am trying to read with Python. However, when I read it, I'm getting some interesting output that I'm not sure how to interpret.
I am opening the file like this:
f = open("testdata_5Aug2014.dat","rb")
Then I read the data in chunks of various sizes, similar to this:
f.read(100)
Typically, the result of the above line (and what I want to see) is something like this:
'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x05\x12\x05\x12\x05\x12\x05\x12\x05\x12\x05\x12\x05\x12\x05\x12\x05\x12\x05\x12\x05\x12\x05\x12\x05\x12\x05\x12\x05\x12\x05\x12\x05\x12\x05\x12\x05\x12\x05\x12\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
But I sometimes get returns that include 't's and '?'s thrown in, like this:
'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00t\x14t\x14t\x14t\x14t\x14t\x14t\x14t\x14t\x14t\x14t\x14t\x14t\x14t\x14t\x14t\x14t\x14t\x14t\x14t\x14K\x01?\x00\xff??\x00\xff??\x00\xff??\x00\xff?\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
This is a problem, because when I use struct.unpack to parse this out, it won't return any of the bytes that show up as those special characters.
So my question is: what are those symbols? How did they get there? And how do I remove them or deal with them?
You're reading binary data from a file, but f.read returns that data as a string. When you print that string, it's interpreting those bytes as characters. However, not every byte value maps to a displayable character, so some bytes are shown as escape sequences: \x followed by two hexadecimal digits. For example, 0 shows up as \x00 and 255 shows up as \xff.
Some values do map to characters, such as 63 mapping to '?' and 116 mapping to 't'. The ord and chr functions can be used to fetch the numerical value of a character, and the character mapping for a number, respectively, so ord('t') returns 116 and chr(63) returns '?'.
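You can check this in an interactive session:

>>> ord('t')
116
>>> chr(63)
'?'
>>> chr(0x74)
't'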
Either way, no matter how it's displayed, your data should be fine, and struct.unpack should be able to work with it as usual.
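As a quick sketch, if your samples are 16-bit little-endian words (the '<H' format below is an assumption about your hardware's byte order; adjust it to your actual sample layout), the 't' bytes unpack to perfectly ordinary values:

>>> import struct
>>> struct.unpack('<H', '\x05\x12')
(4613,)
>>> struct.unpack('<H', 't\x14')   # 't' is just byte 0x74, so this is 0x1474
(5236,)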
Related
I want to store numpy bytes with Presto. I have the following
import numpy as np
array = np.array([1.0,3.4,5.1])
these_bytes = array.tobytes()
and then I want to store them in Presto, using a query like this:
query = f"INSERT INTO some_table VALUES ({these_bytes},'2021-03-11')"
where the {these_bytes} entry is a VARBINARY column. Of course, Presto gives the error 'b' not recognized, since these_bytes is actually a bytes object and not a string, so it renders as b'...'. It seems I should then be decoding this object and storing the decoded string... What is the correct way to store Python binary bytes with Presto, and will there be any transformations required upon retrieval? Assume my Python Presto client just passes the query along without doing additional transformations.
The expanded f-string looks like
INSERT INTO imu_test_table_1000 VALUES (b'\x00\x00\x00\x00\x00\x00\xf0?333333\x0b#ffffff\x14#','2021-03-11')
which is not right.
I guess you need these_bytes.decode() (although if you have unprintable characters, that is probably going to be an issue and will fail). You would also need to know what encoding your client wants the characters to be in (UTF-8, UTF-16, etc.). If you don't know that, then I would be confused about how your client interprets the characters it is sent, or why it wants to receive characters instead of bytes.
In general, most data transfer is done in bytes (I don't know about your particular client). Things like subprocesses, sockets, requests, etc. all use bytes (since this is what computers talk in). A byte string can contain any byte value and is simply data ready for transfer. A string is a collection of characters (each character could be made of several bytes in memory) which represents some kind of human writing/text. In particular, not every byte string can be decoded into characters, so bytes.decode() will not work unless you specifically have bytes in the UTF-8 code set.
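To see the problem concretely, the numpy bytes from the question are not valid UTF-8, so a bare .decode() fails:

>>> import numpy as np
>>> np.array([1.0, 3.4, 5.1]).tobytes().decode()
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf0 in position 6: invalid continuation byte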
If you want to combine arbitrary bytes (not representing any particular format or characters), then you CANNOT use a string; you must keep the data in a byte string, like this:
query = b"INSERT INTO some_table VALUES (" + these_bytes + b",'2021-03-11')"
I have an NVARCHAR column in my database, and I am unable to convert the content of this column to a plain string in my code. (I am using pyodbc for the database connection.)
# This unicode string is returned by the database
>>> my_string = u'\u4157\u4347\u6e65\u6574\u2d72\u3430\u3931\u3530\u3731\u3539\u3533\u3631\u3630\u3530\u3330\u322d\u3130\u3036\u3036\u3135\u3432\u3538\u2d37\u3134\u3039\u352d'
# prints something in Chinese
>>> print my_string
䅗䍇湥整㐰㤱㔰㜱㔹㔳㘱㘰㔰㌰㈭〶〶ㄵ㐲㔸ⴷㄴ〹㔭
The closest I have gotten is by encoding it to utf-16:
>>> my_string.encode('utf-16')
'\xff\xfeWAGCenter-04190517953516060503-20160605124857-4190-5'
>>> print my_string.encode('utf-16')
��WAGCenter-04190517953516060503-20160605124857-4190-5
But the actual value that I need, as stored in the database, is:
WAGCenter-04190517953516060503-20160605124857-4190-51
I tried encoding it to utf-8, utf-16, ascii, and utf-32, but nothing seemed to work.
Does anyone have an idea what I am missing, and how to get the desired result from my_string?
Edit: On converting it to utf-16-le, I am able to remove the unwanted characters from the start, but one character is still missing from the end:
>>> print my_string.encode('utf-16-le')
WAGCenter-04190517953516060503-20160605124857-4190-5
Trying it on some other columns, it works. What might be the cause of this intermittent issue?
You have a major problem in your database definition, in the way you store values in it, or in the way you read values from it. I can only explain what you are seeing; I cannot say why it happens, nor how to fix it, without knowing:
the type of the database
the way you input values in it
the way you extract values to obtain your pseudo unicode string
the actual content if you use direct (native) database access
What you get is an ASCII string where the 8-bit characters are grouped in pairs to build 16-bit unicode characters in little-endian order. As the expected string has an odd number of characters, the last character was (irremediably) lost in translation, because the original string ends with u'\u352d', where 0x2d is the ASCII code for '-' and 0x35 for '5'. Demo:
def cvt(ustring):
    l = []
    for uc in ustring:
        l.append(chr(ord(uc) & 0xFF))         # low order byte
        l.append(chr((ord(uc) >> 8) & 0xFF))  # high order byte
    return ''.join(l)

>>> cvt(my_string)
'WAGCenter-04190517953516060503-20160605124857-4190-5'
The issue was that I was using UTF-16 in my odbcinst.ini file, whereas I had to use the UTF-8 character encoding.
Earlier I was changing it as an OPTION parameter while making the connection with pyodbc. Changing it in the odbcinst.ini file instead fixed the issue.
I am using the dpkt Python module to parse a pcap file. I'm looking deep enough into the packets that some of the data is represented as byte streams. I can convert from regular byte strings easily enough; however, some of the byte strings appear as:
\t\x01\x1c\x88
The first value should be 09; however, for some reason it's shown using an escaped tab character (the hex code of a tab is 09).
It's doing this for other characters in other streams as well.
Some more sample outputs:
\x10\x00#\x00\
\x05q\x00\x00\
\x069\x9c\n\x00
So my question is: can I convert this byte stream to one without these extra characters?
Alternatively, how would I go about converting something like '\t' to hex so that it returns '09'?
Update:
It turns out that I was creating the strings to be converted using a function that returned
\t011c88
in place of the first stream. Leaving the stream alone and using stream.encode("hex") worked.
The repr function by default escapes all non-printable characters like you've seen.
To get a hex-only representation, use
string.encode("hex")
NOTE: The original byte stream is correct; repr only shows the data in a strange way. You should only convert to hex for viewing purposes, not for integrity purposes.
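For example, with the stream from the question (this is Python 2; on Python 3 the 'hex' codec is gone from str, and you would use binascii.hexlify(data) or data.hex() on a bytes object instead):

>>> '\t\x01\x1c\x88'.encode("hex")
'09011c88'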
In Python 3, when I opened a text file with mode string 'rb', and then did f.read(), I was taken aback to find the file contents enclosed in single quotes after the character 'b'.
In Python 2 I just get the file contents.
I'm sure this is well known, but I can't find anything about it in the doco. Could someone point me to it?
You get "just the file contents" in Python 3 as well. Most likely you can just keep on doing whatever you were doing anyway. Read on for a longer explanation:
The b'' signifies that the result value is a bytes string. A bytes-string is quite similar to a normal string, but not quite, and is used to handle binary, non-textual data.
Some of the methods on a string that don't make sense for binary data are gone, but most are still there. A big difference is that when you get a specific byte from a bytes string you get an integer back, while for a normal str you get a one-character str.
>>> b'foo'[1]
111
>>> 'foo'[1]
'o'
If you open the file in text mode with the 't' flag you get a str back. The Python 3 str is what in Python 2 was called unicode. It's used to handle textual data.
You convert back and forth between bytes and str with the .encode() and .decode() methods.
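For example, a round trip through UTF-8:

>>> b'\xc3\xa9'.decode('utf-8')
'é'
>>> 'é'.encode('utf-8')
b'\xc3\xa9'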
First of all, the Python 2 str type has been renamed to bytes in Python 3, and byte literals use the b'' prefix. The Python 2 unicode type is the new Python 3 str type.
To get the Python 3 file behaviour in Python 2, you'd use io.open() or codecs.open(); Python 3 decodes text files to Unicode by default.
What you see is that for binary files, Python 3 gives you the exact same thing as Python 2, namely byte strings. What changed, then, is that the repr() of a byte string is prefixed with b, and the print() function writes the str() conversion of whatever you pass it, which for a bytes object is the same as its repr().
To print your binary data as unicode text with the print() function, decode it to unicode first. But then you could perhaps have opened the file as a text file instead anyway.
The bytes type has some other improvements to reflect that you are dealing with binary data, not text. Indexing individual bytes or iterating over a bytes value gives you int values (between 0 and 255) and not characters, for example.
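For example:

>>> list(b'foo')
[102, 111, 111]
>>> for byte in b'hi':
...     print(byte)
...
104
105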
Sometimes we need (needed?) to know whether a text file has single-character newlines (0A) or two-character newlines (0D0A).
We used to avoid confusion by opening the text file in binary mode, recognising 0D and 0A, and treating other bytes as regular text characters.
One could port such code by finding all binary-mode reads and replacing them with a new function oldread() that strips off the added material, but it's a bit painful.
I suppose the Python theologians thought of keeping ‘rb’ as it was, and adding a new ‘rx’ or something for the new behaviour. It seems a bit high-handed just to abolish something.
But, there it is, the question is certainly answered by a search for ‘rb’ in Lennert’s document.
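For what it's worth, the old trick still works in Python 3 as long as you stay in bytes; a minimal sketch (the filename is made up):

# Count line-ending styles without any newline translation.
with open('somefile.txt', 'rb') as f:
    data = f.read()
crlf = data.count(b'\r\n')           # two-character 0D0A endings
lf_only = data.count(b'\n') - crlf   # single-character 0A endings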
I am retrieving a value that is set by another application from memcached, using the python-memcached library. But unfortunately this is the value that I am getting:
>>> mc.get("key")
'\x04\x08"\nHello'
Is it possible to parse this mixed ASCII output into a plain string using a Python function?
Thanks heaps for your help
It is a "plain string", to the extent that such a thing exists. I have no idea what kind of output you're expecting, but:
There ain't no such thing as plain text.
The Python (in 2.x, anyway) str type is really a container for bytes, not characters. So it isn't really text in the first place :) It displays the bytes assuming a very simple encoding, using escape sequences to represent every byte that's even slightly "weird". It will be formatted differently again if you print the string (what you're seeing right now is the syntax for creating such a literal string in your code).
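For example, the same string, two displays (Python 2 shown, to match the question):

>>> s = 'Hi\tthere'
>>> s          # the repr: the tab appears as an escape sequence
'Hi\tthere'
>>> print s    # the actual bytes: the tab is rendered as whitespace
Hi      there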
In simpler times, we naively assumed that we could just map bytes to these symbols we call "characters", and that would be that. Then it turned out that there were approximately a zillion different mappings that people wanted to use, and lots of them needed more symbols than a byte could represent. Which is why we have Unicode now: it represents every symbol you could conceivably need for any real-world language (and several for fake languages and other purposes), and it abstractly assigns numbers to those symbols but does not say how to collect and interpret the bytes as numbers. (That is the purpose of the encoding).
If you know that the string data is encoded in a particular way, you can decode it to a Unicode string. It could either be an encoding of actual Unicode data, or it could be in some other format (for example, Japanese text is often found in something called "Shift-JIS", because it has approximately the same significance to them as "Latin-1" - a common extension of ASCII - does to us). Either way, you get an in-memory representation of a series of Unicode code points (the numbers referred to in the previous paragraph). This, for all intents and purposes, is really "text", but it isn't really "plain" :)
But it looks like the data you have is really a binary blob of bytes that simply happens to consist mostly of "readable text" if interpreted as ASCII.
What you really need to do is figure out why the first byte has a value of 4 and the next byte has a value of 8, and proceed accordingly.
If you just need to trim the '\x04\x08"\n', and it's always the same (you haven't put your question very clearly, I'm not certain if that's what it is or what you want), do something like this:
to_trim = '\x04\x08"\n'
string = mc.get('key')
if string.startswith(to_trim):
    string = string[len(to_trim):]
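Applied to the value from the question, that leaves just the readable part:

>>> string = '\x04\x08"\nHello'
>>> to_trim = '\x04\x08"\n'
>>> string.startswith(to_trim)
True
>>> string[len(to_trim):]
'Hello'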