I've been using version 8.1 of the Python websockets library. It has been a good tool for receiving string data, but now I need to receive a mixture of string and bytes data.
Let me explain.
There is a socket whose data is encoded with an algorithm that uses character codes as numbers.
For example, at the beginning of the message it has a character c, whose ord(c) == 777. That doesn't mean it should be read as chr(777), the way a human would read it; it represents, for example, the type of message the client got. So it's a message of type 777, and there is an algorithm for handling that type of message. The next character represents the length of the message. And so on.
There is a problem, though. There is a message whose type is 0, which means a NUL byte. When the client receives such a string, for some reason either Python or the websockets recv method interprets it as a space character, resulting in ord(c) == 32 instead of ord(c) == 0, which obviously makes the whole message incorrect. I could replace 32 with 0 and vice versa, but that would lead to even more errors, as those characters are not interchangeable.
I suppose that if I received bytes instead of str, the problem would go away? But I cannot seem to find a method for that. Has anyone had experience in the past with Python and/or websockets incorrectly treating unreadable characters?
You can use the struct module to build your packets:
https://docs.python.org/3/library/struct.html
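For instance, here is a minimal sketch of packing and unpacking a header with struct, assuming the protocol is moved to binary frames; the '>HH' layout (2-byte big-endian type, 2-byte length) and the function names are illustrative assumptions, not your actual wire format:

import struct

# Hypothetical layout: a 2-byte big-endian message type, then a 2-byte
# payload length, then the payload itself. Adjust to the real protocol.
def build_packet(msg_type, payload):
    return struct.pack('>HH', msg_type, len(payload)) + payload

def parse_header(frame):
    msg_type, length = struct.unpack_from('>HH', frame, 0)
    return msg_type, length

frame = build_packet(777, b'hello')
print(parse_header(frame))  # (777, 5)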
I'm working through the Black Hat Python book, and though it was written in 2015, some of the code seems a little dated. For example, print statements aren't using parentheses. However, I cannot seem to get the script below to run, and keep getting an error.
# TCP Client Tool
import socket
target_host = "www.google.com"
target_port = 80
# creates a socket object. AF_INET parameter specifies IPv4 addr/host. SOCK_STREAM is TCP specific, not UDP.
client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# connect the client
client.connect((target_host, target_port))
# sending some data
client.send("GET / HTTP/1.1\r\nHost: google.com\r\n\r\n\")
# receive some data
response = client.recv(4096)
print(response)
The error I'm getting simply reads, File "", line 15
client.send("GET / HTTP/1.1\r\nHost: google.com\r\n\r\n\")
^
You are escaping " by putting a \ before it, which means Python does not know that the string ends there. You can notice that in your post: all the code after that line is coloured as if it were a string.
client.send also needs a bytes-like object, not a string. You can specify that by putting a b before your string:
client.send(b"GET / HTTP/1.1\r\nHost: google.com\r\n\r\n")
After that, the script works fine.
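For reference, here is the full script with both fixes applied (the stray trailing backslash removed and the request sent as bytes); this is just the question's script made runnable:

# TCP Client Tool (corrected)
import socket

target_host = "www.google.com"
target_port = 80

# creates a socket object. AF_INET specifies IPv4; SOCK_STREAM specifies TCP.
client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# connect the client
client.connect((target_host, target_port))

# send the request as bytes (note the b prefix and no trailing backslash)
client.send(b"GET / HTTP/1.1\r\nHost: google.com\r\n\r\n")

# receive some data
response = client.recv(4096)
print(response)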
I think #Anonyme2000 answered the question in full, and all the details needed to solve the issue are there. However, since this is a learning exercise from a book and others might come here later, and since the details in #Anonyme2000's answer are a bit short, I'll expand some more.
Strings
Python, like many other languages, has what are called Escape Sequences. In short, putting \ in front of something means that whatever follows has a special meaning. Two examples:
Example 1: Row breaks (new-lines)
print("Something \nThis is a new line")
This causes Python to interpret n not as the letter "n", but as a special character indicating "there should be a new line here", all thanks to the \n in front of the letter n. \r is also a "new line", but in the old days it was the equivalent of moving the printer's carriage head to the start of the line - not just down one line.
Example 2: Quote escapes in strings
print("I want to print this quote: \" in my string")
In this example, because we are using the quote character " to start and end our string, adding one in the middle would break the string (hopefully this is clear to you). In order to add quotes in the middle of the text, we need to, again, add the escape character \ before the quote; this tells Python not to parse the quote as the end of the string, but simply to add it to the string. There's an alternative to doing this, and that is:
print('I want to print this quote: " in my string')
And that's because the whole string is started and ended with ' instead, which enables Python to accurately parse the start and stop of the actual string - making it 100% confident that the quote in this case is just another piece of the string. These escape sequences are described here with more examples.
Bytes vs Strings
To better understand the difference, we'll first have a look at how Python and the terminal you use interact. I'm assuming you're running your Python scripts from cmd.exe or powershell.exe, or on Linux something like xterm - basic terminals, that is.
The terminal will try to parse anything sent to its output buffer and represent it to you. You can test this by doing:
print('\xc3\xa5\xc3\xa4\xc3\xb6') # Most Linux systems
print('\xe5\xe4\xf6') # Most Windows systems
In theory, one of the prints above should have printed a bunch of bytes that the terminal somehow knew how to render as åäö. Even your browser just did that for you (fun side note: that's how the emoji problem is solved too - everyone has agreed that certain byte combinations should become 🙀). I say most Windows and Linux systems because the result is entirely up to which region/language you selected when you installed your operating system. I'm in EU North (Sweden), so my default codec on Windows is ISO-8859-1, and on all my Linux machines it's UTF-8. These codecs are important, as they are the machine-human interface for representing text.
Knowing this, anything you send to the output buffer of your terminal by doing either print("...") or sys.stdout.write("...") will be interpreted by the terminal and rendered in your locale. If that's not possible, errors occur.
This is where Python 2 and Python 3 start to become two different beasts, and that's why you're here today. Putting it simply, Python 2 did a lot of automated, magic "guess-work" on strings, so that you could send a string to a socket and Python would take care of the encoding for you; Python 2 parsed and converted strings in all kinds of ways. In Python 3, a lot of that automated guess-work was removed because it confused people more often than not, and the data being sent through functions and sockets was essentially Schrödinger's data: sometimes strings, sometimes bytes. Instead, it's now up to you, the developer, to convert the data and encode it - always.
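As a minimal illustration of that explicitness in Python 3 (the names here are just for the example):

text = 'hello åäö'               # str: a sequence of unicode characters
payload = text.encode('utf-8')   # bytes: what sockets and files actually carry
assert payload.decode('utf-8') == text  # decoding must use the same codec

# A Python 3 socket refuses str outright:
# sock.send(text)     -> TypeError: a bytes-like object is required
# sock.send(payload)  -> OK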
So what is bytes vs strings?
bytes is, in layman's terms, a string that hasn't been encoded in any way and can thus contain any "data". It doesn't have to be just text (a-Z, 0-9, !"#¤% and so on); it can also contain special bytes like \x00, which is a NUL byte/character. Python 3 will never try to auto-parse this data. And when doing:
print(b'\xe5\xe4\xf6')
Like above, except that you define the string as a bytes string, Python 3 will send a representation of the bytes, not the actual bytes, to the terminal buffer; thus, the terminal will never interpret them as the bytes they actually are.
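To see the difference, Python 3's sys.stdout.buffer lets you bypass the string layer and write raw bytes straight to the terminal (a quick sketch; how they render still depends on your terminal's codec):

import sys

print(b'\xe5\xe4\xf6')                      # prints the repr: b'\xe5\xe4\xf6'
sys.stdout.buffer.write(b'\xe5\xe4\xf6\n')  # writes the raw bytes; an
                                            # ISO-8859-1 terminal renders åäö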
Example 1: Encoding your data
Which brings us to this first example. So how do you convert the bytes in print(b'\xe5\xe4\xf6') into the characters your terminal represents? By converting them to a string with a particular encoding. In the example above, the three bytes \xe5\xe4\xf6 happen to be åäö in the ISO-8859-1 encoding. I know this because I'm currently on Windows, and if you run the command chcp in your terminal, you'll see which code page/encoder you're using.
Therefore, I can do:
print(b'\xe5\xe4\xf6'.decode('ISO-8859-1'))
And that converts the bytes object into a string object (with an encoding).
The problem is that if you send this string to my Linux machine, it won't have a clue what's going on. Because if you try:
print(b'\x86\x84\x94'.decode('UTF-8'))
You will end up with an error message like this:
>>> print(b'\x86\x84\x94'.decode('UTF-8'))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x86 in position 0: invalid start byte
This is because, in UTF-8 land, the byte \x86 is not a valid start byte, so there is no way of knowing what to do with it. And because my Linux machine's default encoder is UTF-8, your Windows data is garbage to my machine.
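Decoding the same bytes with the codec that actually produced them works fine. As it happens, \x86\x84\x94 spells åäö in the legacy DOS code page cp850 (a common Swedish Windows console default - an assumption about where these bytes came from):

# The bytes are fine; they just need the codec that produced them.
print(b'\x86\x84\x94'.decode('cp850'))  # åäö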
Which brings us to...
Sockets
In Python 3, and in the physical realm of a computer, encodings and strings aren't really a thing. Instead, machines communicate in bits - in short, 1's and 0's. Eight of those become a byte, and that's where Python's bytes comes into play. When sending something from machine to machine (or application to application), we have to convert any text representation into a bytes sequence so that the machines can talk to each other - without encodings, without parsing things. Just: take the data.
We can do this in three ways:
print('åäö'.encode('UTF-8'))
print(bytes('åäö', 'UTF-8'))
print(b'åäö')
The last option will fail, but I'll leave it in on purpose to show the differences between the ways of telling Python, "hey, this weird thing, convert it to a bytes object".
All of these options return a bytes representation of åäö using an encoder (except the last one, which only accepts ASCII characters in the literal and is limited at best).
In the UTF-8 case, you will be returned something like:
b'\xc3\xa5\xc3\xa4\xc3\xb6'
And this is something you can send out on a socket, because it's just a series of bytes that terminals, machines and applications won't touch or deal with in any way other than as a series of ones and zeroes ('11000011 10100101 11000011 10100100 11000011 10110110', to be specific).
Together with some network logic, that's what's going to be sent out on your socket. And that's how machines communicate.
This is an overview of what's going on. The "human" side is the terminal, aka the machine-human interface, where you input your åäö and the terminal encodes/parses it as a certain encoding. Your application then has to do the work of converting that into something the socket/physical world can work with.
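Putting it all together, here is a minimal sketch of the encode-on-send / decode-on-receive pattern (the host, port and codec are placeholders):

import socket

HOST, PORT = '127.0.0.1', 9000  # placeholders for your real endpoint

with socket.create_connection((HOST, PORT)) as sock:
    sock.sendall('åäö\n'.encode('utf-8'))  # str -> bytes at the boundary
    data = sock.recv(4096)                 # the wire always carries bytes
    print(data.decode('utf-8'))            # bytes -> str, same codec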
The protocol for a device I'm working with sends a UDP packet to a server (My Python program, using twisted for networking) in a comma separated ASCII format. In order to acknowledge that my application received the packet correctly, I have to take a couple of values from the header of that packet, convert them to binary and send them back. I have everything setup, though I can't seem to figure out the binary part.
I've been going through a lot of stuff and I'm still confused. The documentation says "binary" though it looks more like hexadecimal because it says the ACK packet has to start with "0xFE 0x02".
The format of the acknowledgement requires me to send "0xFE 0x02" + an 8-byte unsigned integer (IMEI number, 15 digits) + a 2-byte unsigned integer (Sequence ID).
How would I go about converting the ASCII text values that I have into "binary"?
First:
The documentation says "binary" though it looks more like hexadecimal because it says the ACK packet has to start with "0xFE 0x02".
Well, it's impossible to print actual binary data in a human-readable form, so documentation will usually give a sequence of hexadecimal bytes. They could use a bytes literal like b'\xFE\x02' or something instead, but that's still effectively hexadecimal to the human reader, right?
So, if they say "binary", they probably mean "binary", and the hex is just how they're showing you what binary bytes you need.
So, you need to convert the ASCII representation of a number into an actual number, which you do with the int function. Then you need to pack that into 8 bytes, which you do with the struct module.
You didn't mention whether you needed big-endian or little-endian. Since this sounds like a network protocol, and it sounds like it wasn't designed by Microsoft, I would guess big-endian, but you should actually know, not guess.
So:
import struct

imei_string = '1234567890123456789'
imei_number = int(imei_string)  # 1234567890123456789
imei_bytes = struct.pack('>Q', imei_number)  # b'\x11\x22\x10\xf4\x7d\xe9\x81\x15'
seq_bytes = struct.pack('>H', seq_id)  # seq_id: your 2-byte Sequence ID
buf = b'\xFE\x02' + imei_bytes + seq_bytes
(You didn't say where you're supposed to get the sequence number (seq_id above) from, but wherever it comes from, you'll pack it the same way, just using >H instead of >Q.)
If you actually did need a hex string rather than binary, you'd need to know exactly what format. The binascii.hexlify function gives you "bare hex": two characters per byte, no 0x or other header or footer. If you want something different, it depends on exactly what you want, but no format is really that hard. Again, though, I'm pretty sure you don't need a hex string here.
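For example, a quick illustration of that "bare hex" output (shown Python 3 style; in Python 2 the result is a plain str):

import binascii

binascii.hexlify(b'\xFE\x02\x11\x22')  # -> b'fe021122'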
One last thing: Since you didn't specify your Python version, I wrote this in a way that's compatible with both 2.6+ and 3.0+; if you need to use 2.5 or earlier, just drop the b prefix on the literal in the buf = line.
I'm trying to comprehend something. I'm receiving raw network data in the form of hex - in this particular example, a MAC address. I'm using the unhexlify()/hexlify() functions from the binascii module in Python 2.7, and I'm receiving, for example, the following MAC address in the form of hex:
"\xa5\xbb%\x8f\xa0\xda"
it's a six-octet-long MAC address and I absolutely have no clue what's going on....
If I use the function
binascii.hexlify('\xa5\xbb%\x8f\xa0\xda')
it returns
a5bb258fa0da
which is indeed the correct MAC address I'm expecting to receive, but this really doesn't make sense....
"\xa5\xbb%\x8f\xa0\xda" - this isn't a correct form of hex; it contains a %, and yet somehow the binascii.hexlify() function manages to translate it to the correct MAC address...
I'm honestly failing to understand this. I'm guessing it has something to do with translating from hex to ASCII rather than hex to decimal, but I don't understand how the unhexlify()/hexlify() functions work, how I can be receiving data in the form of hex that contains a %, or why my hex-to-ASCII function manages to handle it...
What's going on?
Andrey is right: if characters are not preceded by \x, the standard ASCII table is used:
>>> print binascii.hexlify('012')
'303132'
"\xa5\xbb%\x8f\xa0\xda" - it is a sequence of characters that are defined by their hex code. One character is "\xa5", it is single character, not 4. For example print '\x61' will produce just a.
About % sign, it is printable character that is why it printed as is in the string. It has hex code of 0x25 which is actually used. you can write it as \x25: "\xa5\xbb\x25\x8f\xa0\xda"
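A quick check in the Python 2.7 REPL makes the equivalence concrete:

>>> '\x25' == '%'
True
>>> '\xa5\xbb%\x8f\xa0\xda' == '\xa5\xbb\x25\x8f\xa0\xda'
True
>>> binascii.hexlify('\xa5\xbb%\x8f\xa0\xda')
'a5bb258fa0da'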
When printing out DB2 query results, I'm getting the following error on column 'F00002', which is a binary array.
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe3' in position 2: ordinal not in range(128)
I am using the following line:
print result[2].decode('cp037')
...just as I do for the first two columns, where the same code works fine. Why is this not working on the third column, and what is the proper decoding/encoding?
Notice that the error is about encoding to ASCII, not about decoding from cp037. But you're not asking it to encode anywhere, so why is this happening?
Well, there are actually two possible places this could go wrong, and we can't know which of them it is without some help from you.
First, if your result[2] is already a unicode object, calling decode('cp037') on it will first try to encode it with sys.getdefaultencoding(), which is usually 'ascii', so that it has something to decode. So, instead of getting an error saying "Hey, bozo, I'm already decoded", you get an error about encoding to ASCII failing. (This may seem very silly, but it's useful for a handful of codecs that can decode unicode->unicode or unicode->str, like ROT13 and quoted-printable.)
If this is your problem, the solution is to not call decode. You've presumably already decoded the data somewhere along the way to this point, so don't try to do it again. (If you've decoded it wrong, you need to figure out where you decoded it and fix that to do it right; re-decoding it after it's already wrong won't help.)
Second, passing a Unicode string to print will automatically try to encode it with (depending on your Python version) either sys.getdefaultencoding() or sys.stdout.encoding. If Python has failed to guess the right encoding for your console (pretty common on Windows), or if you're redirecting your script's stdout to a file instead of printing to the console (which means Python can't possibly guess the right encoding), you can end up with 'ascii' even in sys.stdout.encoding.
If this is your problem, you have to explicitly specify the right encoding for your console (if you're lucky, it's in sys.stdout.encoding), or the encoding you want for the text file you're redirecting to (probably 'utf-8', but that's up to you), and explicitly encode everything you print.
So, how do you know which one of these it is?
Simple: print type(result[2]) and see whether it's a unicode or a str. Or break it up into two pieces - x = result[2].decode('cp037') and then print x - and see which of the two raises. Or run it in a debugger. You have all kinds of options for debugging this, but you have to do something.
Of course it's also possible that, once you fix the first one, you'll immediately run into the second one. But now you know how to deal with that too.
Also, note that cp037 is EBCDIC, one of the few encodings that Python knows about that isn't ASCII-compatible. In fact, '\xe3' is EBCDIC for the letter T.
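You can verify that round trip in a Python 2 shell:

>>> u'T'.encode('cp037')
'\xe3'
>>> '\xe3'.decode('cp037')
u'T'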
It seems that your result[2] is already unicode:
>>> u'\xe3'.decode('cp037')
Traceback (most recent call last):
...
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe3' in position 0: ordinal not in range(128)
>>> u'\xe3'.encode('cp037')
'F'
In fact, as pointed out by #abarnert in the comments, in Python 2.x a decode called on a unicode object is performed in two steps:
encoding to a string with sys.getdefaultencoding(),
then decoding back to unicode.
I.e., your statement is translated as:
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> u'\xe3'.encode('ascii').decode('cp037')
and the error you get is from the first part of the expression, u'\xe3'.encode('ascii').
All right, so as #abarnert established, you don't really have a Unicode problem per se; Unicode only enters the picture when you try to print. After looking at your data, I can see that there is actually not just EBCDIC character data in there, but arbitrary binary data as well. The data definitely seems columnar, so what we probably have here is a bunch of subfields all packed into the field called F00002 in your example. RPG programmers would refer to this as a data structure; it's akin to a C struct.
The F00001 and K00001 columns probably worked fine because they happen to contain only EBCDIC character data.
So if you want to extract the complete data from F00002, you'll have to find out (via documentation or some person who has the knowledge) what the subfields are. Normally, once you've found that out, you could just use Python's struct module to quickly and simply unpack the data, but since the data comes from an IBM i, you may be faced with converting its native data types into Python's types. (The most common of these would be packed decimal for numeric data.)
For now, you can still extract the character portions of F00002 by decoding as before, but then explicitly choosing a new encoding that works with your output (display or file), as #abarnert suggested. My recommendation is to write the values to a file using result[2].decode('cp037').encode('utf-8'). This will produce a bunch of clearly non-human-readable data interspersed with the text; you may be able to use it as-is, or at least use it to tell you where the text portions are for further processing.
Edit:
We don't have time to do all your work and research for you. Things you need to read up on and work out for yourself:
IBM's packed decimal format (crash course: each digit takes up 4 bits, read as plain hexadecimal; an additional 4 bits on the right hold the sign, 'F' for positive and 'D' for negative; the whole thing is zero-padded on the left if needed to fill a whole number of bytes; the decimal place is implied - see the sketch after this list)
IBM's zoned decimal format (crash course: each digit is 1 byte and is identical to the EBCDIC representation of the corresponding character, except that in the rightmost digit the upper 4 bits are used for the sign, 'F' for positive and 'D' for negative; the decimal place is implied)
Python's struct module (it doesn't automatically handle the above types; you have to use raw bytes for everything (format 's') and process them as needed)
Possibly pick up some ideas (and code) for handling IBM packed and zoned decimals from the add-on api2 module for iSeriesPython 2.7; in particular, check out the iSeriesStruct class, which is a subclass of struct.Struct. Keep in mind that the whole module is designed to run on the iSeries, using iSeriesPython, and thus is not necessarily usable as-is from regular Python communicating with the iSeries via pyodbc.
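As for the sketch promised above: a minimal, hedged decoder for the packed decimal format as described in the crash course (the function name and scale parameter are mine, and sign nibbles other than 0xD are treated as positive; check IBM's documentation before relying on it):

from decimal import Decimal

def unpack_packed_decimal(raw, scale=0):
    # Split each byte into two 4-bit nibbles.
    nibbles = []
    for byte in bytearray(raw):  # bytearray yields ints in Python 2 and 3
        nibbles.append(byte >> 4)
        nibbles.append(byte & 0x0F)
    sign = nibbles[-1]      # rightmost nibble is the sign: 0xF = +, 0xD = -
    digits = nibbles[:-1]   # every other nibble is one decimal digit
    number = 0
    for d in digits:
        number = number * 10 + d
    value = Decimal(number).scaleb(-scale)  # the decimal place is implied
    return -value if sign == 0xD else value

# b'\x01\x23\x4d' holds digits 01234 with sign D (negative);
# with an implied two decimal places, that reads as -12.34:
print(unpack_packed_decimal(b'\x01\x23\x4d', scale=2))  # -12.34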
I am trying to encode, store, and decode arguments in Python and getting lost somewhere along the way. Here are my steps:
1) I use google toolkit's gtm_stringByEscapingForURLArgument to convert an NSString properly for passing into HTTP arguments.
2) On my server (Python), I store these string arguments as something like u'1234567890-/:;()$&@".,?!\'[]{}#%^*+=_\\|~<>\u20ac\xa3\xa5\u2022.,?!\'' (note that these are the standard keys on an iPhone keypad in the "123" view and the "#+=" view, the \u and \x chars in there being monetary symbols like pound, yen, etc.)
3) I call urllib.quote(myString, '') on that stored value, presumably to %-escape it for transport to the client so the client can un-percent-escape it.
The result is that I am getting an exception when I try to log the result of the %-escaping. Is there some crucial step I am overlooking that needs to be applied to the stored value, with its \u and \x characters, in order to properly convert it for sending over HTTP?
Update: The suggestion marked as the answer below worked for me. To be complete, though, I am providing some updates addressing the comments below.
The exception I received cited an issue with \u20ac. I don't know if the problem was with that character specifically, or simply that it was the first unicode character in the string.
That \u20ac char is the unicode for the 'euro' symbol. I basically found I'd have issues with it unless I used the urllib2 quote method.
URL-encoding a "raw" unicode doesn't really make sense. What you need to do is .encode("utf8") first, so you have a known byte encoding, and then quote that.
The output isn't very pretty, but it should be a correct URI encoding.
>>> s = u'1234567890-/:;()$&@".,?!\'[]{}#%^*+=_\|~<>\u20ac\xa3\xa5\u2022.,?!\''
>>> urllib2.quote(s.encode("utf8"))
'1234567890-/%3A%3B%28%29%24%26%40%22.%2C%3F%21%27%5B%5D%7B%7D%23%25%5E%2A%2B%3D_%5C%7C%7E%3C%3E%E2%82%AC%C2%A3%C2%A5%E2%80%A2.%2C%3F%21%27'
Remember that you will need to both unquote() and decode() this to print it out properly if you're debugging or whatever.
>>> print urllib2.unquote(urllib2.quote(s.encode("utf8")))
1234567890-/:;()$&@".,?!'[]{}#%^*+=_\|~<>â‚¬Â£Â¥â€¢.,?!'
>>> # oops, nasty â‚¬ means we've got a utf8 byte stream being treated as an ascii stream
>>> print urllib2.unquote(urllib2.quote(s.encode("utf8"))).decode("utf8")
1234567890-/:;()$&@".,?!'[]{}#%^*+=_\|~<>€£¥•.,?!'
This is, in fact, what the django functions mentioned in another answer do.
The functions django.utils.http.urlquote() and django.utils.http.urlquote_plus() are versions of Python's standard urllib.quote() and urllib.quote_plus() that work with non-ASCII characters. (The data is converted to UTF-8 prior to encoding.)
Be careful, if you are applying any further quoting or encoding, not to mangle things.
I want to second pycruft's remark. Web protocols have evolved over decades, and dealing with the various sets of conventions can be cumbersome. Now, URLs happen to be explicitly defined not in terms of characters, but only in terms of bytes (octets). As a historical coincidence, URLs are one of the places where you can only assume, but not enforce or safely expect, an encoding to be present. However, there is a convention to prefer Latin-1 and UTF-8 over other encodings here. For a while, it looked like 'unicode percent escapes' would be the future, but they never caught on.
It is of paramount importance to be pedantically picky in this area about the difference between unicode objects and octet strings (in Python < 3.0 those are unicode and str; in Python >= 3.0, confusingly, str is the unicode type and bytes/bytearray are the octet types). Unfortunately, in my experience it is, for a number of reasons, pretty difficult to cleanly separate the two concepts in Python 2.x.
Even more off-topic: when you want to receive third-party HTTP requests, you cannot absolutely rely on URLs being sent as percent-escaped, UTF-8-encoded octets. There may be the occasional %uxxxx escape in there, and at least Firefox 2.x used to encode URLs as Latin-1 where possible, and as UTF-8 only where necessary.
You are out of luck with the stdlib: urllib.quote doesn't work with unicode. If you are using Django, you can use django.utils.http.urlquote, which handles unicode properly.
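If you're not on Django, a small wrapper along the lines of what the accepted answer describes works too (the helper name urlquote_unicode is mine; Python 2 assumed):

import urllib

def urlquote_unicode(value, safe=''):
    # urllib.quote chokes on unicode, so encode to UTF-8 bytes first -
    # the same conversion Django's urlquote performs before quoting.
    if isinstance(value, unicode):
        value = value.encode('utf-8')
    return urllib.quote(value, safe)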