Bizarre behavior of python printing non-alphabetic ASCII characters

I have the following Python code:
for num in range(80, 150):
    input()
    print(num)
    print(chr(27))
    print(chr(num))
The input() statement is only there to control how quickly the for loop proceeds. I am not expecting this to do anything special, but when the loop hits certain numbers, printing that ASCII character, preceded by ASCII 27 (which is the ESC character) does some unexpected things:
At 92 and 94, the number does not print. http://i.stack.imgur.com/DzUew.png
At 99 (the letter c), a bunch of terminal output gets deleted. http://i.stack.imgur.com/5XPy3.png
At 108 (the letter l), the current line jumps up several lines (but text remains below). (didn't get a proper screencap, I'll add one later if it helps)
At 128 or 129, the first character starts getting masked. You have to type something (I typed "jjj") in order to prevent this from happening on that line. http://i.stack.imgur.com/DRwTm.png
I don't know why any of this happens although I imagine it has something to do with the ESC character interacting with the terminal. Could someone help me figure this out?

It is due to confusion between escape sequences and character-encoding.
Your program is printing escape sequences, including
escapec (resets the terminal)
escape^ (begins a privacy message, which causes other characters to be eaten)
In ISO-8859-1 (and ECMA-48), character bytes between 128 and 159 are considered control characters, referred to as C1 controls. Several of these are treated the same as escape combined with another character. The mapping between C1 and "another character" is not straightforward, but the interesting ones include
0x9a which is device attributes, causing characters to be sent to the host.
0x9b which is control sequence initiator, more usually seen as escape[.
On the other hand, bytes in the 128-159 range are legal parts of a UTF-8 character. If your terminal is not properly configured to match the locale settings, you can find that your terminal responds to control sequences.
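As a rough illustration of how a C1 byte pairs with an escape sequence, the sketch below assumes the usual ECMA-48 convention that an 8-bit C1 byte corresponds to ESC followed by the character whose code is the byte value minus 0x40. The helper name c1_to_escape is hypothetical, and repr() is used so that nothing is actually sent to the terminal as a control sequence.

```python
ESC = "\x1b"

def c1_to_escape(byte_value):
    """Return the 7-bit 'ESC + character' form of a C1 control byte (0x80-0x9f)."""
    if not 0x80 <= byte_value <= 0x9F:
        raise ValueError("not a C1 control byte")
    # Assumed convention: subtract 0x40 to get the character that follows ESC
    return ESC + chr(byte_value - 0x40)

for value in (0x9A, 0x9B):
    # repr() shows the sequence as text instead of letting the terminal act on it
    print(hex(value), "->", repr(c1_to_escape(value)))
```

Under that assumption, 0x9b lines up with escape[ as mentioned above.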
The OS X Terminal implements (but does not document) many of the standard control sequences. XTerm documents these (and many others), so you may find the following useful:
XTerm Control Sequences
C1 (8-Bit) Control Characters (specifically)
Standard ECMA-48:
Control Functions for Coded Character Sets
For amusement, you are referred to the xterm FAQ: Interesting but misleading

ESC followed by one of those characters forms a special control code for the terminal.
A terminal control code is a special sequence of characters that is
printed (like any other text). If the terminal understands the code,
it won't display the character-sequence, but will perform some action.
You can print the codes with a simple echo command.
Terminal Codes
For example,
ESC\ = ST, String Terminator (chr(92))
ESC^ = PM, Privacy Message (chr(94))
Control sequences differ depending on which terminal you use.
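One way to see which sequence is about to be sent, without the terminal acting on it, is to print its repr() instead of the raw characters. This is only a sketch of that idea; the describe helper is a hypothetical name, not part of the original code.

```python
def describe(num):
    """Return a printable description of ESC followed by chr(num)."""
    sequence = chr(27) + chr(num)
    # repr() escapes the ESC character, so the terminal only sees ordinary text
    return repr(sequence)

for num in (92, 94, 99):
    print(num, describe(num))
```

Swapping print(chr(27)) and print(chr(num)) in the original loop for something like this makes the loop safe to step through.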
More about:
Xterm Control Sequences
ANSI escape code
ANSI/VT100 Terminal Control Escape Sequences

Related

Different results when running C program from Python Subprocess vs in Bash

I've got a string/argument that I'd like to pass to a C program. It's a string format exploit.
'\xb2\x33\02\x08%13x%2$n'
However, there seems to be different behaviours exhibited if I call the C program from Python by doing
subprocess.Popen(["env", "-i", "./practice", '\xb2\x33\02\x08%13x%2$n'])
versus
./practice '\xb2\x33\02\x08%13x%2$n'
The difference is that the string exploit attack works as expected when calling the script via subprocess, but not when I call it through the CLI.
What might the reason be? Thanks.

Bash manpage says:
Words of the form $'string' are treated specially. The word expands to
string, with backslash-escaped characters replaced as specified by the
ANSI C standard. Backslash escape sequences, if present, are decoded
as follows: [snipped]
\xHH the eight-bit character whose value is the hexadecimal
value HH (one or two hex digits)
Then would you please try:
./practice $'\xb2\x33\02\x08%13x%2$n'
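The difference can be sketched from the Python side. The snippet below uses bytes literals (an assumption made here so the byte values are exact regardless of Python version) to compare what Python produces after interpreting the escapes with what plain single quotes in a shell would pass through untouched.

```python
# Escapes interpreted: \xb2, \x33, \02 and \x08 each become a single byte
interpreted = b'\xb2\x33\02\x08%13x%2$n'

# Raw literal: the backslashes stay as ordinary characters,
# which is what plain single quotes in a shell hand to the program
literal = rb'\xb2\x33\02\x08%13x%2$n'

print(len(interpreted), interpreted)
print(len(literal), literal)
```

The two arguments differ in both length and content, which is why the program behaves differently in the two cases.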

Black Hat Python TCP Client

I'm working through the Black Hat Python book, and though it was written in 2015 some of the code seems a little dated. For example, print statements aren't utilizing parentheses. However, I cannot seem to get the script below to run, and keep getting an error.
# TCP Client Tool
import socket
target_host = "www.google.com"
target_port = 80
# creates a socket object. AF_INET parameter specifies IPv4 addr/host. SOCK_STREAM is TCP specific, not UDP.
client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# connect the client
client.connect((target_host, target_port))
# sending some data
client.send("GET / HTTP/1.1\r\nHost: google.com\r\n\r\n\")
# receive some data
response = client.recv(4096)
print(response)
The error I'm getting simply reads, File "", line 15
client.send("GET / HTTP/1.1\r\nHost: google.com\r\n\r\n\")
^

You are escaping the closing " by putting a \ before it, which means Python does not know that the string ends there. You can see this in your post: all the code after that line is coloured as if it were a string.
client.send also needs a bytes-like object, not a string. You can specify that by putting a b before your string:
client.send(b"GET / HTTP/1.1\r\nHost: google.com\r\n\r\n")
After that, the script works fine.
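An alternative to the b prefix is to build the request as text and encode it just before sending. The helper below is a hypothetical sketch (build_request is not from the book); it only constructs the bytes and does not open a connection.

```python
def build_request(host):
    """Build a minimal HTTP/1.1 GET request as bytes (hypothetical helper)."""
    text = "GET / HTTP/1.1\r\nHost: {}\r\n\r\n".format(host)
    # Sockets in Python 3 expect bytes, so encode the text explicitly
    return text.encode("ascii")

request = build_request("google.com")
print(request)
```

The result could then be passed to client.send(request) in the script above.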

I think @Anonyme2000 answered the question in full, and all the details needed to solve the issue are there. However, since this is a learning exercise from a book, others might come here, and @Anonyme2000's answer is a bit short on the details of what's going on, so I'll expand some more.
Strings
Python, like many other languages, has what are called escape sequences. In short, putting \ in front of something means that whatever follows will have a special meaning. Two examples:
Example 1: Row breaks (new-lines)
print("Something \nThis is a new line")
This will cause Python to interpret n not as the letter "n" but as a special character indicating that "here there should be a new line", all thanks to \ being in front of the letter n. \r is also a kind of "new-line" (a carriage return), but in the old days it was the equivalent of moving the printer carriage back to the start of the line, not down one line.
Example 2: Quote escapes in strings
print("I want to print this quote: \" in my string")
In this example, because we are using the quote character " to start and end our string, adding it in the middle would break the string (hopefully this is clear to you). In order to add quotes in the middle of the text, we again need to add an escape character \ before the quote. This tells Python not to parse the quote as the end of the string, but simply to add it to the string. There's an alternative to doing this, and that is:
print('I want to print this quote: " in my string')
And that's because the whole string is started and ended by ' instead, which enables Python to accurately parse the start and end of the whole string and makes it 100% confident that the quote in this case is just another piece of the string. These escape sequences are described here with more examples.
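The points above can be checked with a few short comparisons. This is just an illustrative sketch; the variable names are made up for the example.

```python
newline = "\n"
print(len(newline))        # the escape sequence is a single character

double = "quote: \" here"  # escaped quote inside double quotes
single = 'quote: " here'   # same text, using single quotes to avoid escaping
print(double == single)

raw = r"\n"                # a raw string keeps the backslash as-is
print(len(raw))
```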
Bytes vs Strings
To better understand the difference, we'll first have a look at how Python and the terminal you use interact. I'm assuming you're running your Python scripts from cmd.exe, powershell.exe or, on Linux, something like xterm. Basic terminals, that is.
The terminal will try to parse anything sent to its output buffer and represent it to you. You can test this by doing:
print('\xc3\xa5\xc3\xa4\xc3\xb6') # Most Linux systems
print('\xe5\xe4\xf6') # Most Windows systems
In theory, one of the prints above should have just printed a bunch of bytes that the terminal somehow knew how to render as åäö. Even your browser just did that for you (fun side note: that's how they solve the emoji problem too; everyone has agreed that certain byte combinations should become 🙀). I say most Windows and Linux systems because this result depends entirely on the region/language you selected when you installed your operating system. I'm in EU North (Sweden), so my default codec in Windows is ISO-8859-1 and on all my Linux machines I have UTF-8. These codecs are important, as they are the machine-human interface for representing text.
Knowing this, anything you send to the output buffer of your terminal by doing either print("...") or sys.stdout.write("...") - will be interpreted by the terminal and rendered in your locale. If that's not possible, errors will occur.
This is where Python 2 and Python 3 start to become two different beasts, and that's why you're here today. Put simply, Python 2 did a lot of automated and magic "guess-work" on strings, so that you could send a string to a socket and Python would take care of the encoding for you. Python 2 parsed them and converted them in all kinds of ways. In Python 3 a lot of that automated guess-work was removed because it confused people more often than not. The data being sent through functions and sockets was essentially Schrödinger's data: it was strings sometimes and bytes sometimes. So instead, it's now always up to you, the developer, to convert and encode the data.
So what is bytes vs strings?
bytes is, in layman's terms, a string that hasn't been encoded in any way and thus can contain anything "data"-related. It doesn't have to be just text (a-Z, 0-9, !"#¤% and so on); it can also contain special bytes like \x00, which is a null byte/character. And Python 3 will never try to auto-parse this data. And when doing:
print(b'\xe5\xe4\xf6')
As above, except that you define the string as a bytes string in Python 3, Python will instead send a representation of the bytes, not the actual bytes, to the terminal buffer. Thus, the terminal will never interpret them as the actual bytes they are.
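A small sketch of what "a representation of the bytes" means in practice: printing a bytes object in Python 3 shows its repr(), which is plain ASCII text, while the object itself still holds three raw byte values.

```python
data = b'\xe5\xe4\xf6'

shown = repr(data)             # the text that print(data) writes to the terminal
print(shown)
print(len(data), len(shown))   # 3 raw bytes versus the longer textual form
print(list(data))              # the underlying integer byte values
```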
Example 1: Encoding your data
Which brings us to this first example. So how do you convert the bytes b'\xe5\xe4\xf6' to the characters they represent in your terminal? By converting them to a string with a particular encoding. In the above example, the three bytes \xe5\xe4\xf6 happen to be åäö encoded as ISO-8859-1. I know this because I'm currently on Windows, and if you run the command chcp in your terminal, it will tell you which code page/encoding you're using.
Therefore, I can do:
print(b'\xe5\xe4\xf6'.decode('ISO-8859-1'))
And that will convert the bytes object into a string object (with an encoding).
The problem here, is that if you send this string to my Linux machine, it won't have a clue what's going on. Because, if you try:
print(b'\x86\x84\x94'.decode('UTF-8'))
You will end up with a error message like this:
>>> print(b'\x86\x84\x94'.decode('UTF-8'))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x86 in position 0: invalid start byte
This is because, in UTF-8 land, byte \x86 is not a valid start byte: it can only appear as a continuation byte inside a multi-byte character. So the decoder has no way of knowing what to do with it. And because my Linux machine's default encoding is UTF-8, your Windows data is garbage to my machine.
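The failure above can be handled or worked around once the real source encoding is known. The sketch below assumes, purely for illustration, that the bytes came from a machine using code page 437 (where they spell åäö); the errors='replace' option is shown as a lossy fallback.

```python
data = b'\x86\x84\x94'

try:
    data.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False            # the bytes are not valid UTF-8

# Lossy fallback: undecodable bytes become the U+FFFD replacement character
replaced = data.decode('utf-8', errors='replace')

# Decoding with the encoding the data was (assumed to be) written in
as_cp437 = data.decode('cp437')

print(utf8_ok, repr(replaced), as_cp437)
```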
Which, brings us to..
Sockets
In Python 3 and most physical realms of a computer, encodings and strings are not welcome as they aren't really a thing. Instead, machines communicate in bits, in short, 1's and 0's. Eight of those become a byte, and that's where Python's bytes comes into play. When sending something from machine to machine (or application to application), we have to convert any text representation into a sequence of bytes so that the machines can talk to each other. Without encodings, without parsing things. Just take the data.
We do this in three ways and they are:
print('åäö'.encode('UTF-8'))
print(bytes('åäö', 'UTF-8'))
print(b'åäö')
The last option will fail, but I've left it in on purpose to show the difference between the ways of telling Python, "hey, this weird thing, convert it to a bytes object".
All of these options will return a bytes representation of åäö using an encoder (except the last one, which only accepts ASCII characters and is limited at best).
In the UTF-8 case, you will be returned something like:
b'\xc3\xa5\xc3\xa4\xc3\xb6'
And this, this is something you can send out on a socket. Because it's just a series of bytes, that the terminals, machines and applications won't touch or deal with in any other way than a series of ones and zeroes *('11000011 10100101 11000011 10100100 11000011 10110110' to be specific)
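The encoding step and the bit pattern quoted above can be reproduced with a short sketch. The string is written with escapes here only so the example does not depend on the source file's own encoding.

```python
text = "\xe5\xe4\xf6"                    # the string 'åäö'

data = text.encode("utf-8")              # text -> bytes, ready for a socket
bits = " ".join(format(byte, "08b") for byte in data)
round_trip = data.decode("utf-8")        # bytes -> text on the receiving side

print(data)
print(bits)
print(round_trip == text)
```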
Together with some network logic, that's what's going to be sent out on your socket. And that's how machines communicate.
This is an overview of what's going on. The "human" is the terminal, aka, the machine-human-interface where you input your åäö and the terminal encodes/parses it as a certain encoding. Your application has to do magic in order to convert it to something the socket/physical world can work with.

print statement in python, a space is written before the object is written unless

From the docs:
A space is written before each object is (converted and) written, unless the output system believes it is positioned at the beginning of a line. This is the case (1) when no characters have yet been written to standard output, (2) when the last character written to standard output is a whitespace character except ' ', or (3) when the last write operation on standard output was not a print statement.
But I don't understand what the (2) means...
when the last character written to standard output is a whitespace character except ' '

They mean any whitespace character other than the ASCII space character U+0020 SPACE (i.e. the character created when you press the space bar on a typical American keyboard). In particular, this includes the carriage return, line feed (one or both of which may be created when pressing the enter key, depending on your operating system), horizontal and vertical tabs, and (perhaps) a variety of non-ASCII characters that the Unicode consortium has seen fit to create over the years, but which you are unlikely to encounter "in the wild" unless you go looking for them or allow the end user to supply you with arbitrary data.
Since you have not stated whether this is Python 2 or Python 3, it is not clear to me whether the system has enough information to recognize these non-ASCII characters when they are printed. If, for example, you are using Python 2 and 8-bit strings, the system does not know what encoding you are using, and may not be able to deal with anything that doesn't closely follow ASCII.

Processing delimiters with python

I'm currently trying to parse an Apache log in a format I can't handle normally. (Tried using goaccess)
In Sublime the delimiters show up as ENQ, SOH, and ETX, which to my understanding are "|", space, and superscript L. I'm trying to use re.split to separate the individual components of the log, but I'm not sure how to deal with the superscript L.
On sublime it shows up as 3286d68255beaf010000543a000012f1/Madonna_Home_1.jpgENQx628a135bENQZ1e5ENQAB50632SOHA50.134.214.130SOHC98.138.19.91SOHD42857ENQwwww.newprophecy.net...
With ENQ's as '|' and SOH as ' ' when I open the file in a plain text editor (Like notepad)
I just need to parse out the IP addresses so the rest of the line is mostly irrelevant.
Currently I have
pkts = re.split("\s|\\|")
But I don't know what to do for the L.

Those 3-letter codes are ASCII control codes - these are ASCII characters which occur prior to 32 (space character) in the ASCII character set. You can find a full list online.
These characters do not correspond to anything printable, so you're incorrect in assuming they correspond to those characters. You can refer to them as literals in several languages using \x00 notation - for example, control code ETX corresponds to \x03 (see the reference I linked to above). You can use these to split strings or anything else.
This is the literal answer to your question, but all this aside I find it quite unlikely that you actually need to split your Apache log file by control codes. At a guess, what's actually happened is that perhaps some Unicode characters have crept into your log file somehow, perhaps with UTF-8 encoding. An encoding is a way of representing characters that extend beyond the 255 limit of a single byte by encoding extended characters with multiple bytes.
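As a minimal sketch of splitting on those control codes, the example below rebuilds the visible part of the sample line from the question with real ENQ (\x05) and SOH (\x01) characters, splits on ENQ, SOH and ETX, and then picks out anything that looks like an IPv4 address. The IP pattern is a simple assumption for illustration, not a strict validator.

```python
import re

ENQ, SOH, ETX = "\x05", "\x01", "\x03"

# Visible portion of the sample log line, with the control characters restored
line = (
    "3286d68255beaf010000543a000012f1/Madonna_Home_1.jpg"
    + ENQ + "x628a135b" + ENQ + "Z1e5" + ENQ + "AB50632"
    + SOH + "A50.134.214.130" + SOH + "C98.138.19.91" + SOH + "D42857"
    + ENQ + "wwww.newprophecy.net"
)

# Split on any of the three control characters
fields = re.split("[\x01\x03\x05]", line)

# Keep the first dotted-quad found in each field, if any
ip_pattern = re.compile(r"\d{1,3}(?:\.\d{1,3}){3}")
ips = [m.group(0) for m in map(ip_pattern.search, fields) if m]
print(ips)
```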
There are several types of encoding, but UTF-8 is one of the most popular. If you use UTF-8 it has the property that standard ASCII characters will appear as normal (so you might never even realise that UTF-8 was being used), but if you view the file in an editor which isn't UTF-8 aware (or which incorrectly identifies the file as plain ASCII) then you'll see these odd control codes. These are places where really the code and the character(s) before or after it should be interpreted together as a single unit.
I'm not sure that this is the reason, it's just an educated guess, but if you haven't already considered it then it's important to figure out the encoding of your file since it'll affect how you interpret the entire content of it. I suggest loading the file into an editor that understands encodings (I'm sure something as popular as Sublime does with proper configuration) and force the encoding to UTF-8 and see if that makes the content seem more sensible.

Python '\x0e' in character by character XOR encryption

I am trying to build an encryption system using Python. It is based on the Lorenz cipher machine used by Germany in WWII, though a lot more complicated (7-bit ASCII encryption and 30 rotors compared with the original's 5-bit and 12 rotors).
So far I have worked out and written the stepping system. I have also created a system for the and chopping up the plaintext. But when checking the output, in character for character (By not stitching together the ciphertext) I got this for hello:
['H', 'Z', '\x0e', '>', 'f']
I have realised that '\x0e' must be some special character in ascii, but I am certain that when the program goes to decrypt it will look at each of the letters in it individually. Can someone please tell me what '\x0e' signifies, if there are other such characters, and if there's an easy way to get around it.
Thanks in advance!

It's the ASCII "shift-out" control character and is nonprintable.
A control character which is used in conjunction with SHIFT IN and
ESCAPE to extend the graphic character set of the code. It may alter
the meaning of octets 33 - 126 (dec.). The effect of this character
when using code extension techniques is described in International
Standard ISO 2022.

'\x0e' is the ASCII SO (shift out) unprintable character. It is a single character, and any reasonable program dealing with the string will treat it as such; you're only seeing it represented like that because you're printing a list, which shows the repr of each value in the list.
As for the question of if there are others, yes, there are 33 of them; ASCII 0-31 and 127 are all generally considered "control characters" which aren't typically printable.
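Both points can be checked directly. The sketch below counts the ASCII control codes described above and shows that '\x0e' in the question's output is one character that survives joining like any other.

```python
# ASCII control characters: codes 0-31 plus 127 (DEL)
control_codes = [code for code in range(128) if code < 32 or code == 127]
print(len(control_codes))

cipher = ['H', 'Z', '\x0e', '>', 'f']
print(len(cipher[2]), ord(cipher[2]))   # a single character with code 14

joined = "".join(cipher)
print(len(joined), repr(joined))        # still five characters in total
```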
