I'm working through the Black Hat Python book, and though it was written in 2015, some of the code seems a little dated. For example, print statements aren't using parentheses. However, I cannot seem to get the script below to run, and keep getting an error.
# TCP Client Tool
import socket
target_host = "www.google.com"
target_port = 80
# creates a socket object. AF_INET parameter specifies IPv4 addr/host. SOCK_STREAM is TCP specific, not UDP.
client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# connect the client
client.connect((target_host, target_port))
# sending some data
client.send("GET / HTTP/1.1\r\nHost: google.com\r\n\r\n\")
# receive some data
response = client.recv(4096)
print(response)
The error I'm getting simply reads: File "", line 15
client.send("GET / HTTP/1.1\r\nHost: google.com\r\n\r\n\")
^
You are escaping the closing " by putting a \ before it, which means Python does not know that the string ends there. You can see this in your post: all the code after that line is coloured as if it were part of a string.
client.send also needs a bytes-like object, not a string. You can specify that by putting a b before your string:
client.send(b"GET / HTTP/1.1\r\nHost: google.com\r\n\r\n")
After that, the script works fine.
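Putting both fixes together, the whole script looks like the sketch below. To keep it runnable without depending on www.google.com being reachable, I've pointed it at a throwaway local server; that server thread and its canned reply are illustration-only assumptions, not part of the book's exercise.

```python
import socket
import threading

# Illustration only: a tiny local stand-in for the real web server.
def run_fake_server(server_sock):
    conn, _ = server_sock.accept()
    conn.recv(4096)                           # read the client's request
    conn.sendall(b"HTTP/1.1 200 OK\r\n\r\n")  # canned reply
    conn.close()

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
server.listen(1)
port = server.getsockname()[1]
t = threading.Thread(target=run_fake_server, args=(server,))
t.start()

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(("127.0.0.1", port))
# note the b prefix and the clean trailing \r\n (no stray backslash)
client.send(b"GET / HTTP/1.1\r\nHost: localhost\r\n\r\n")
response = client.recv(4096)
client.close()
t.join()
print(response)
```

Against the real www.google.com you would keep the original target_host and port 80; only the b"..." request line actually needed to change.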
I think #Anonyme2000's answer covers the question in full, and all the details needed to solve the issue are there. However, since this is a learning exercise from a book, others might land here, and the details in #Anonyme2000's answer are a bit short, so I'll expand some more.
Strings
Python, like many other languages, has what are called escape sequences. In short, putting \ in front of something gives whatever follows a special meaning. Two examples:
Example 1: Row breaks (new-lines)
print("Something \nThis is a new line")
This will cause Python to interpret n not as the letter "n" but as a special character indicating "there should be a new line here", all thanks to the \n in front of it. \r is also a "new line" of sorts; in the old days it was the equivalent of moving the carriage (the printer head) back to the start of the line, not just down one line.
Example 2: Quote escapes in strings
print("I want to print this quote: \" in my string")
In this example, because we use the quote character " to start and end our string, adding one in the middle would break the string (hopefully this is clear by now). To put a quote in the middle of the text, we again need the escape character \ before the quote; this tells Python not to parse the quote as the end of the string, but simply to include it in the string. There's an alternative to doing this, and that is:
print('I want to print this quote: " in my string')
And that's because the whole string is started and ended by ' instead, which lets Python accurately parse the start and end of the actual string, making it 100% confident that the quote in the middle is just another piece of the string. These escape sequences are described here with more examples.
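A few self-checking lines make the rules above concrete (the raw-string form at the end is an aside the book doesn't cover):

```python
# Each escape sequence collapses to a single character in the string:
assert len("\n") == 1      # one newline character, not backslash + n
assert len("\\n") == 2     # an escaped backslash followed by 'n'

# An escaped quote is the same character as a plain quote:
assert "\"" == '"'

# Raw strings (r"...") switch escape processing off entirely:
assert r"\n" == "\\n"
print("all escape checks passed")
```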
Bytes vs Strings
To better understand the difference, we'll first look at how Python and the terminal you use interact. I'm assuming you're running your Python scripts from cmd.exe, powershell.exe, or on Linux something like xterm (basic terminals, that is).
The terminal will try to parse anything sent to its output buffer and render it for you. You can test this by doing:
print('\xc3\xa5\xc3\xa4\xc3\xb6') # Most Linux systems
print('\xe5\xe4\xf6') # Most Windows systems
In theory, one of the prints above should have printed a bunch of bytes that the terminal somehow knew how to render as åäö. Your browser just did the same thing for you (fun side note: that's how the emoji problem is solved too; everyone has agreed that certain byte combinations should become 🙀). I say most Windows and Linux systems because the result depends entirely on the region/language you selected when you installed your operating system. I'm in EU North (Sweden), so my default codec on Windows is ISO-8859-1, and on all my Linux machines it's UTF-8. These codecs are important, as they are the machine-to-human interface for representing text.
Knowing this, anything you send to your terminal's output buffer, whether via print("...") or sys.stdout.write("..."), will be interpreted by the terminal and rendered in your locale. If that's not possible, errors will occur.
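If you're curious, you can ask Python what it believes those encodings are; the printed values depend entirely on your OS and locale settings, so treat the comments as examples, not guarantees:

```python
import locale
import sys

terminal_encoding = sys.stdout.encoding         # e.g. 'utf-8' or 'cp1252'
locale_encoding = locale.getpreferredencoding() # the locale's default codec
print(terminal_encoding, locale_encoding)
```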
This is where Python 2 and Python 3 start to become two different beasts, and that's why you're here today. Put simply, Python 2 did a lot of automated, magic guesswork on strings, so that you could send a string to a socket and Python would take care of the encoding for you; it parsed and converted strings in all kinds of ways. In Python 3, most of that automated guesswork was removed, because it confused people more often than it helped, and the data being sent through functions and sockets was essentially Schrödinger's data: sometimes strings, sometimes bytes. So instead, it's now up to you, the developer, to convert and encode the data. Always.
So what is bytes vs strings?
bytes is, in layman's terms, a string that hasn't been encoded in any way and can therefore contain any kind of data. It doesn't have to be just text (a-Z, 0-9, !"#¤% and so on); it can also contain special bytes like \x00, the NUL byte/character. And Python 3 will never try to auto-parse this data. And when doing:
print(b'\xe5\xe4\xf6')
Like above, except that here you define the string as a bytes string. Python 3 will instead send a representation of the bytes, not the actual bytes, to the terminal buffer, so the terminal will never interpret them as the bytes they actually are.
Example 1: Encoding your data
Which brings us to this first example. So how do you convert the bytes in print(b'\xe5\xe4\xf6') into the characters your terminal renders? By converting them to a string with a particular encoding. In the example above, the three bytes \xe5\xe4\xf6 happen to be the ISO-8859-1 encoding of åäö. I know this because I'm currently on Windows, and if you run the command chcp in your terminal, it tells you which code page/encoding you're using.
Therefore, I can do:
print(b'\xe5\xe4\xf6'.decode('ISO-8859-1'))
And that will convert the bytes object into a string object (with an encoding).
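The same idea, written as a small round-trip sketch you can paste into a REPL:

```python
text = "åäö"

# Encoding the same three characters with two different codecs gives
# two different byte sequences:
as_latin1 = text.encode("ISO-8859-1")   # one byte per character
as_utf8 = text.encode("UTF-8")          # two bytes per character here

assert as_latin1 == b"\xe5\xe4\xf6"
assert as_utf8 == b"\xc3\xa5\xc3\xa4\xc3\xb6"

# decode() reverses encode() as long as the codec matches:
assert as_latin1.decode("ISO-8859-1") == text
assert as_utf8.decode("UTF-8") == text
```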
The problem here, is that if you send this string to my Linux machine, it won't have a clue what's going on. Because, if you try:
print(b'\x86\x84\x94'.decode('UTF-8'))
You will end up with a error message like this:
>>> print(b'\x86\x84\x94'.decode('UTF-8'))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x86 in position 0: invalid start byte
This is because, in UTF-8 land, the byte \x86 is not a valid start byte, so the decoder has no way of knowing what to do with it. And because my Linux machine's default encoding is UTF-8, your Windows data is garbage to my machine.
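When you can't control which codec produced the bytes, decode() takes an errors argument that trades correctness for robustness; a quick sketch:

```python
data = b"\x86\x84\x94"   # not valid UTF-8 (these happen to be åäö
                         # in the old DOS code pages)

try:
    data.decode("UTF-8")           # raises UnicodeDecodeError
except UnicodeDecodeError as exc:
    print(exc)

# errors='replace' substitutes U+FFFD for each undecodable byte:
assert data.decode("UTF-8", errors="replace") == "\ufffd\ufffd\ufffd"

# errors='ignore' silently drops them (use with care):
assert data.decode("UTF-8", errors="ignore") == ""
```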
Which, brings us to..
Sockets
In Python 3, and in most physical realms of a computer, encodings and strings aren't welcome, because they aren't really a thing there. Machines communicate in bits; in short, 1s and 0s. Eight of those make a byte, and that's where Python's bytes comes into play. When sending something from machine to machine (or application to application), we have to convert any text representation into a bytes sequence so that the machines can talk to each other. Without encodings, without parsing things. Just the raw data.
We do this in three ways and they are:
print('åäö'.encode('UTF-8'))
print(bytes('åäö', 'UTF-8'))
print(b'åäö')
The last option will fail with a SyntaxError, because a bytes literal may only contain ASCII characters, but I'll leave it in on purpose to show the different ways of telling Python, "hey, this weird thing, convert it to a bytes object".
The first two options will return a bytes representation of åäö using the chosen encoder (the last one cannot, since b'...' notation only accepts plain ASCII, which is limited at best).
In the UTF-8 case, you will be returned something like:
b'\xc3\xa5\xc3\xa4\xc3\xb6'
And this is something you can send out on a socket, because it's just a series of bytes that terminals, machines and applications won't touch or deal with in any way other than as a series of ones and zeroes ('11000011 10100101 11000011 10100100 11000011 10110110', to be specific).
Together with some network logic, that's what's going to be sent out on your socket. And that's how machines communicate.
This is an overview of what's going on. The "human" side is the terminal, i.e. the machine-human interface where you input your åäö and the terminal encodes/parses it with a certain encoding. Your application then has to do the magic of converting that into something the socket/physical world can work with.
Related
I've been using version 8.1 of the Python websockets library. It has been a good tool for receiving string data, but now I need to receive a mixture of string and bytes data.
Let me explain.
There is a socket which encodes its data with an algorithm that uses character codes as numbers.
For example, at the beginning of the message there is a character c with ord(c) == 777. That doesn't mean it should be read as chr(777), the way a human would; it represents, for example, the type of message the client got. So it's a message of type 777, and there is an algorithm to handle that type of message. The next character represents the length of the message. And so on.
There is a problem, though. There is a message whose type is 0, which means a NUL byte. When the client receives such a string, for some reason either Python or the websockets recv method interprets it as a space character, resulting in ord(c) == 32 instead of ord(c) == 0. Which, obviously, makes the whole message incorrect. I could swap 32 and 0, but that would lead to even more errors, as those characters are not interchangeable.
I suppose that if I received bytes instead of str, the problem would go away? But I cannot seem to find a method for that. Maybe someone has had experience with Python and/or websockets incorrectly treating unreadable characters?
You can use the struct module to build your packets:
https://docs.python.org/3/library/struct.html
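To expand on that: if you pack the message type and length as raw bytes with struct, a type of 0 is just the byte 0x00 and can never be mangled by text decoding. (Also worth checking: as far as I recall, websockets' recv() returns bytes for binary frames and str only for text frames, so having the server send binary frames may already solve this.) The 2-byte-type, 2-byte-length frame layout below is a made-up example, not your actual protocol:

```python
import struct

def build_frame(msg_type, payload):
    # ">HH" = big-endian: unsigned 16-bit type, unsigned 16-bit length
    return struct.pack(">HH", msg_type, len(payload)) + payload

def parse_frame(frame):
    msg_type, length = struct.unpack(">HH", frame[:4])
    return msg_type, frame[4:4 + length]

frame = build_frame(777, b"hello")
assert parse_frame(frame) == (777, b"hello")

# A type-0 message survives intact; there is no str in sight to turn
# the NUL byte into a space:
assert parse_frame(build_frame(0, b""))[0] == 0
```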
I'm currently trying to parse an Apache log in a format I can't handle normally (I tried using goaccess).
In Sublime, the delimiters show up as ENQ, SOH, and ETX, which to my understanding are "|", space, and superscript L. I'm trying to use re.split to separate the individual components of the log, but I'm not sure how to deal with the superscript L.
On sublime it shows up as 3286d68255beaf010000543a000012f1/Madonna_Home_1.jpgENQx628a135bENQZ1e5ENQAB50632SOHA50.134.214.130SOHC98.138.19.91SOHD42857ENQwwww.newprophecy.net...
The ENQs show up as '|' and the SOHs as ' ' when I open the file in a plain text editor (like Notepad).
I just need to parse out the IP addresses so the rest of the line is mostly irrelevant.
Currently I have
pkts = re.split("\s|\\|")
But I don't know what to do for the L.
Those 3-letter codes are ASCII control codes: characters that come before 32 (the space character) in the ASCII character set. You can find a full list online.
These characters do not correspond to anything printable, so you're incorrect in assuming they map to "|", space, and superscript L. You can refer to them as literals in several languages using \x00 notation; for example, the control code ETX corresponds to \x03 (see the reference I linked to above). You can use these to split strings or anything else.
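For example, assuming the delimiters in your log really are the raw control bytes ENQ (\x05), SOH (\x01) and ETX (\x03), you can split on them directly. The sample line below is a shortened version of yours, and treating fields that start with A or C as the IP fields is a guess based on your paste:

```python
import re

line = (b"3286d68255beaf/Madonna_Home_1.jpg\x05x628a135b\x05Z1e5"
        b"\x05AB50632\x01A50.134.214.130\x01C98.138.19.91\x01D42857")

# Split on any of the three control bytes:
fields = re.split(b"[\x01\x03\x05]", line)

# Keep only fields that look like <A-or-C prefix><dotted IPv4>:
ips = [f[1:] for f in fields
       if re.fullmatch(rb"[AC](\d{1,3}\.){3}\d{1,3}", f)]
print(ips)   # [b'50.134.214.130', b'98.138.19.91']
```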
That is the literal answer to your question, but all this aside, I find it quite unlikely that you actually need to split your Apache log file by control codes. At a guess, what's actually happened is that some Unicode characters have crept into your log file somehow, perhaps with UTF-8 encoding. An encoding is a way of representing characters beyond the 255 limit of a single byte by encoding extended characters as multiple bytes.
There are several types of encoding, but UTF-8 is one of the most popular. If you use UTF-8 it has the property that standard ASCII characters will appear as normal (so you might never even realise that UTF-8 was being used), but if you view the file in an editor which isn't UTF-8 aware (or which incorrectly identifies the file as plain ASCII) then you'll see these odd control codes. These are places where really the code and the character(s) before or after it should be interpreted together as a single unit.
I'm not sure that this is the reason, it's just an educated guess, but if you haven't already considered it then it's important to figure out the encoding of your file since it'll affect how you interpret the entire content of it. I suggest loading the file into an editor that understands encodings (I'm sure something as popular as Sublime does with proper configuration) and force the encoding to UTF-8 and see if that makes the content seem more sensible.
I have a web-server on which I try to submit a form containing Cyrillic letters. As a result I get the following error message:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
This message comes from the following line of the code:
ups = 'rrr {0}'.format(body.replace("'","''"))
(body contains Cyrillic letters). Strangely, I cannot reproduce this error message on the Python command line. The following works fine:
>>> body = 'ппп'
>>> ups = 'rrr {0}'.format(body.replace("'","''"))
It's working in the interactive prompt because your terminal is using your locale to determine what encoding to use. Directly from the Python docs:
Whereas the other file-like objects in python always convert to ASCII
unless you set them up differently, using print() to output to the
terminal will use the user’s locale to convert before sending the
output to the terminal.
On the other hand, while your server is running the scripts, there is no such assumption. Everything read as a byte str from a file-like object is treated as ASCII in memory unless otherwise specified. Your Cyrillic characters, presumably encoded as UTF-8, can't be converted; they're far beyond U+007F, the last code point where UTF-8 and ASCII map directly onto each other. (Unicode numbers its code points in hex; U+007F is 127 in decimal. ASCII has only 128 code points, 0 through 127, because it uses a single byte, and of that byte only the 7 least significant bits; the most significant bit is always 0.)
Back to your problem. If you want to operate on the body of the file, you'll have to specify that it should be opened with a UTF-8 encoding. (Again, I'm assuming it's UTF-8 because it's information submitted from the web. If it's not -- well, it really should be.) The solution has already been given in other StackOverflow answers, so I'll just link to one of them rather than reiterate what's already been answered. The best answer may vary a little bit depending on your version of Python -- if you let me know in a comment I could give you a clearer recommendation.
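A minimal sketch of doing that explicitly (Python 3 syntax shown; on Python 2 the equivalent would be io.open or codecs.open, and the file path here is just a placeholder):

```python
import os
import tempfile

# Stand-in for the submitted form body:
body = "ппп"
path = os.path.join(tempfile.mkdtemp(), "body.txt")

# Name the encoding explicitly on both write and read; nothing is
# silently assumed to be ASCII:
with open(path, "w", encoding="utf-8") as f:
    f.write(body)

with open(path, encoding="utf-8") as f:
    ups = "rrr {0}".format(f.read().replace("'", "''"))

assert ups == "rrr ппп"
```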
When printing out DB2 query results I'm getting the following error on column 'F00002' which is a binary array.
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe3' in position 2: ordinal not in range(128)
I am using the following line:
print result[2].decode('cp037')
...just as I do the first two columns where the same code works fine. Why is this not working on the third column and what is the proper decoding/encoding?
Notice that the error is about encoding to ASCII, not about decoding from cp037. But you're not asking it to encode anywhere, so why is this happening?
Well, there are actually two possible places this could go wrong, and we can't know which of them it is without some help from you.
First, if your result[2] is already a unicode object, calling decode('cp037') on it will first try to encode it with sys.getdefaultencoding(), which is usually 'ascii', so that it has something to decode. So, instead of getting an error saying "Hey, bozo, I'm already decoded", you get an error about encoding to ASCII failing. (This may seem very silly, but it's useful for a handful of codecs that can decode unicode->unicode or unicode->str, like ROT13 and quoted-printable.)
If this is your problem, the solution is to not call decode. You've presumably already decoded the data somewhere along the way to this point, so don't try to do it again. (If you've decoded it wrong, you need to figure out where you decoded it and fix that to do it right; re-decoding it after it's already wrong won't help.)
Second, passing a Unicode string to print will automatically try to encode it with (depending on your Python version) either sys.getdefaultencoding() or sys.stdout.encoding. If Python has failed to guess the right encoding for your console (pretty common on Windows), or if you're redirecting your script's stdout to a file instead of printing to the console (which means Python can't possibly guess the right encoding), you can end up with 'ascii' even in sys.stdout.encoding.
If this is your problem, you have to explicitly specify the right encoding for your console (if you're lucky, it's in sys.stdout.encoding), or the encoding you want for the text file you're redirecting to (probably 'utf-8', but that's up to you), and explicitly encode everything you print.
So, how do you know which one of these it is?
Simple. print type(result[2]) and see whether it's a unicode or a str. Or break it up into two pieces: x = result[2].decode('cp037') and then print x, and see which of the two raises. Or run in a debugger. You have all kinds of options for debugging this, but you have to do something.
Of course it's also possible that, once you fix the first one, you'll immediately run into the second one. But now you know how to deal with that too.
Also, note that cp037 is EBCDIC, one of the few encodings that Python knows about that isn't ASCII-compatible. In fact, '\xe3' is EBCDIC for the letter T.
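You can check that mapping yourself in a couple of lines:

```python
# cp037 (EBCDIC) maps bytes completely differently from ASCII:
assert b"\xe3".decode("cp037") == "T"    # EBCDIC 0xE3 is the letter T
assert "T".encode("cp037") == b"\xe3"
assert "T".encode("ascii") == b"\x54"    # same letter, different byte
```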
It seems that your result[2] is already unicode:
>>> u'\xe3'.decode('cp037')
Traceback (most recent call last):
...
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe3' in position 0: ordinal not in range(128)
>>> u'\xe3'.encode('cp037')
'F'
In fact, as pointed out by #abarnert in the comments, in Python 2.x, decode called on a unicode object is performed in two steps:
encoding to string with sys.getdefaultencoding(),
then decoding back to unicode
i.e., your statement is translated as:
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> u'\xe3'.encode('ascii').decode('cp037')
and the error you get is from the first part of expression, u'\xe3'.encode('ascii')
All right, so as #abarnert established, you don't really have a Unicode problem, per se. The Unicode only enters the picture when trying to print. After looking at your data, I can see that there is actually not just EBCDIC character data in there, but arbitrary binary data as well. The data definitely seems columnar, so what we probably have here is a bunch of subfields all packed into the field called F00002 in your example. RPG programmers would refer to this as a data structure; it's akin to a C struct.
The F00001 and K00001 columns probably worked fine because they happen to contain only EBCDIC character data.
So if you want to extract the complete data from F00002, you'll have to find out (via documentation or some person who has the knowledge) what the subfields are. Normally, once you've found that out, you could just use Python's struct module to quickly and simply unpack the data, but since the data comes from an IBM i, you may be faced with converting its native data types into Python's types. (The most common of these would be packed decimal for numeric data.)
For now, you can still extract the character portions of F00002 by decoding as before, but then explicitly choosing a new encoding that works with your output (display or file), as #abarnert suggested. My recommendation is to write the values to a file, using result[2].decode('cp037').encode('utf-8') (which will produce a bunch of clearly not human-readable data interspersed with the text; you may be able to use that as-is, or you could use it to at least tell you where the text portions are for further processing).
Edit:
We don't have time to do all your work and research for you. Things you need to just read up on and work out for yourself:
IBM's packed decimal format (crash course: each digit takes up 4 bits, one hexadecimal nibble; an additional 4 bits on the right hold the sign, 'F' for positive and 'D' for negative; the whole thing is zero-padded on the left if needed to fill a whole number of bytes; the decimal point is implied)
IBM's zoned decimal format (crash course: each digit is 1 byte and is identical to the EBCDIC representation of the corresponding character; except that on the rightmost digit, the upper 4 bits are used for the sign, 'F' for positive and 'D' for negative; decimal place is implied)
Python's struct module (doesn't automatically handle the above types; you have to use raw bytes for everything (type 's') and handle as needed)
Possibly pick up some ideas (and code) for handling IBM packed and zoned decimals from the add-on api2 module for iSeriesPython 2.7 (in particular, check out the iSeriesStruct class, which is a subclass of struct.Struct, keeping in mind that the whole module is designed to be running on the iSeries, using iSeriesPython, and thus is not necessarily usable as-is from regular Python communicating with the iSeries via pyodbc).
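As a starting point, here is a minimal sketch of unpacking packed decimal by the crash-course rules above; the function name and sample values are mine, not from any IBM library, and it does no validation:

```python
def unpack_packed_decimal(data, scale=0):
    """Decode IBM packed decimal: digits as 4-bit nibbles, sign last."""
    nibbles = []
    for byte in data:               # each byte holds two nibbles
        nibbles.append(byte >> 4)
        nibbles.append(byte & 0x0F)
    sign = nibbles.pop()            # rightmost nibble holds the sign
    value = 0
    for digit in nibbles:           # leading zero nibbles are harmless
        value = value * 10 + digit
    if sign == 0x0D:                # 'D' nibble means negative
        value = -value
    return value / 10 ** scale if scale else value

assert unpack_packed_decimal(b"\x12\x3F") == 123      # positive
assert unpack_packed_decimal(b"\x45\x6D") == -456     # negative
assert unpack_packed_decimal(b"\x01\x23\x4F", scale=2) == 12.34
```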
I am trying to encode and store, and decode arguments in Python and getting lost somewhere along the way. Here are my steps:
1) I use google toolkit's gtm_stringByEscapingForURLArgument to convert an NSString properly for passing into HTTP arguments.
2) On my server (python), I store these string arguments as something like u'1234567890-/:;()$&#".,?!\'[]{}#%^*+=_\\|~<>\u20ac\xa3\xa5\u2022.,?!\'' (note that these are the standard keys on an iphone keypad in the "123" view and the "#+=" view, the \u and \x chars in there being some monetary prefixes like pound, yen, etc)
3) I call urllib.quote(myString,'') on that stored value, presumably to %-escape them for transport to the client so the client can unpercent escape them.
The result is that I am getting an exception when I try to log the result of % escaping. Is there some crucial step I am overlooking that needs to be applied to the stored value with the \u and \x format in order to properly convert it for sending over http?
Update: The suggestion marked as the answer below worked for me. I am providing some updates to address the comments below to be complete, though.
The exception I received cited an issue with \u20ac. I don't know if it was a problem with that specifically, rather than the fact that it was the first unicode character in the string.
That \u20ac char is the unicode for the 'euro' symbol. I basically found I'd have issues with it unless I used the urllib2 quote method.
URL-encoding a "raw" unicode doesn't really make sense. What you need to do is .encode("utf8") it first, so you have a known byte encoding, and then quote() that.
The output isn't very pretty but it should be a correct uri encoding.
>>> s = u'1234567890-/:;()$&#".,?!\'[]{}#%^*+=_\|~<>\u20ac\xa3\xa5\u2022.,?!\''
>>> urllib2.quote(s.encode("utf8"))
'1234567890-/%3A%3B%28%29%24%26%40%22.%2C%3F%21%27%5B%5D%7B%7D%23%25%5E%2A%2B%3D_%5C%7C%7E%3C%3E%E2%82%AC%C2%A3%C2%A5%E2%80%A2.%2C%3F%21%27'
Remember that you will need to both unquote() and decode() this to print it out properly if you're debugging or whatever.
>>> print urllib2.unquote(urllib2.quote(s.encode("utf8")))
1234567890-/:;()$&#".,?!'[]{}#%^*+=_\|~<>€£¥•.,?!'
>>> # oops, those nasty characters mean we've got a utf8 byte stream being treated as an ascii stream
>>> print urllib2.unquote(urllib2.quote(s.encode("utf8"))).decode("utf8")
1234567890-/:;()$&#".,?!'[]{}#%^*+=_\|~<>€£¥•.,?!'
This is, in fact, what the django functions mentioned in another answer do.
The functions
django.utils.http.urlquote() and
django.utils.http.urlquote_plus() are
versions of Python’s standard
urllib.quote() and urllib.quote_plus()
that work with non-ASCII characters.
(The data is converted to UTF-8 prior
to encoding.)
Be careful if you are applying any further quotes or encodings not to mangle things.
I want to second pycruft's remark. Web protocols have evolved over decades, and dealing with the various sets of conventions can be cumbersome. Now, URLs happen to be explicitly defined not for characters, but only for bytes (octets). As a historical coincidence, URLs are one of the places where you can only assume, but not enforce or safely expect, an encoding to be present. However, there is a convention to prefer latin-1 and utf-8 over other encodings here. For a while it looked like 'unicode percent escapes' would be the future, but they never caught on.
It is of paramount importance to be pedantically picky in this area about the difference between unicode objects and octet strings (that's unicode vs. str in Python < 3.0 and, confusingly, str vs. bytes/bytearray in Python >= 3.0). Unfortunately, in my experience it is, for a number of reasons, pretty difficult to cleanly separate the two concepts in Python 2.x.
Even more off-topic: when you want to receive third-party HTTP requests, you cannot absolutely rely on URLs being sent as percent-escaped, utf-8-encoded octets: there may be the occasional %uxxxx escape in there, and at least Firefox 2.x used to encode URLs as latin-1 where possible, and as utf-8 only where necessary.
You are out of luck with the stdlib: urllib.quote doesn't work with unicode. If you are using Django, you can use django.utils.http.urlquote, which works properly with unicode.
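(A note for readers on Python 3, where this particular pain point is gone: urllib.parse.quote accepts str directly and, by default, encodes it as UTF-8 before percent-escaping.)

```python
from urllib.parse import quote, unquote

s = "1234567890-€£¥•"
escaped = quote(s, safe="")   # str in, percent-escaped ASCII str out
print(escaped)

# unquote() reverses it, decoding the UTF-8 bytes back to str:
assert unquote(escaped) == s
assert quote("\u20ac") == "%E2%82%AC"   # the euro sign as UTF-8 bytes
```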