Python extension for Upskirt: garbage at end of string

Python extension for Upskirt: garbage at end of string - python

I've been trying to make a Python extension for Upskirt. I though it would not be too hard for a first C project since there are examples (example program in the Upskirt code and the Ruby extension).
The extension works, it converts the Markdown I throw at it, but sometimes the output has some garbage at the end of the string. And I don't know what causes it.
Here's some output:
python test.py
<module 'pantyshot' from '/home/frank/Code/pantyshot/virtenv/lib/python2.7/site-packages/pantyshot.so'>
<built-in function render>
'<p>This <strong>is</strong> <em>a</em> <code>test</code>. Test.</p>\n\x7f'
<p>This <strong>is</strong> <em>a</em> <code>test</code>. Test.</p>
--------------------------------------------------------------------------------
'<p>This <strong>is</strong> <em>a</em> <code>test</code>. Test.</p>\n\x7f'
<p>This <strong>is</strong> <em>a</em> <code>test</code>. Test.</p>
--------------------------------------------------------------------------------
My code can be found in my Github repo. I called it pantyshot, because I thought of that when I heard upskirt. Strange name, I know.
I hope someone can help me.

You are doing a strdup in pantyshot_render:
output_text = strdup(ob->data); /* ob is a "struct buf *" */
But I don't think ob->data is a nul-terminated C string. You'll find this inside upskirt/buffer.c:
/* bufnullterm • NUL-termination of the string array (making a C-string) */
void
bufnullterm(struct buf *buf) {
if (!buf || !buf->unit) return;
if (buf->size < buf->asize && buf->data[buf->size] == 0) return;
if (bufgrow(buf, buf->size + 1))
buf->data[buf->size] = 0; }
So, you're probably running off the end of the buffer and getting lucky by hitting a '\0' before doing any damage. I think you're supposed to call bufnullterm(ob) before copying ob->data as a C string; or you could look at ob->size, use malloc and strncpy to copy it, and take care of the nul-terminator by hand (but make sure you allocation ob->size + 1 bytes for your copied string).
And if you want to get rid of the newline (i.e. the trailing \n), then you'll probably have to do some whitespace stripping by hand somewhere.

Related

snprintf still complaints about buffer overflow

When I ran my code through one of the analyzer tool Fortify, it complains about both the snprintf's saying "the format string argument does not properly limit the amount of data the function can write"
I understand snprintf should not result in buffer overflow. But still why the tool raises this complain. Can anyone please help?

I understand snprintf should not result in buffer overflow. But still why the tool raises this complain. Can anyone please help?
Often I have seen analysis tool's complaints are pedantically real, yet the tool points to the wrong original culprit.
Code has at least this weakness
Potential junk in timebuf[]
char timebuf[20];
// Enough space?
strftime(timebuf, sizeof(timebuf),"%Y-%m-%d %H:%M:%S", adjustedtime);
As adjustedtime->tm_year, an int, may have values in the range -2147483648 ... 2147483647, more than size 20 needed.
Avoid under sizing. Recommend:
#define INT_TEXT_LENGTH_MAX 11
char timebuf[6*INT_TEXT_LENGTH_MAX + sizeof "%Y-%m-%d %H:%M:%S"];
Further, it the buffer is not big enough, then:
If the total number of resulting characters including the terminating null character is not more than maxsize, the strftime function returns the number of characters placed into the array pointed to by s not including the terminating null character. Otherwise, zero is returned and the contents of
the array are indeterminate. C17dr § 7.27.3.5 Library 8
Thus an analysis tool can assume any content for timebuf[] including a non-string following an unchecked strftime(). That can easily break snprintf(gendata, sizeof(gendata), "%s", timebuf); as "%s" requires a string, which timebuf[] is not guarantied to be. The sizeof(gendata) in snprintf(gendata, sizeof(gendata), ... is not sufficient to prevent UB of an unterminated timebuf[].
Better code would also check the size.
struct tm *adjustedtime = localtime(&t);
if (adjustedtime == NULL) {
Handle_Error();
}
if (strftime(timebuf, sizeof(timebuf),"%Y-%m-%d %H:%M:%S", adjustedtime) == 0) {
Handle_Error();
}
Now we can continue with snprintf() code.

How CPython handles multiline input in REPL?

Python's REPL reads input line by line.
However, function definitions consist from multiple lines.
For example:
>>> def answer():
... return 42
...
>>> answer()
42
How does CPython's parser request additional input after partial def answer(): line?

Python's REPL reads input line by line.
That statement is technically correct, but it's somewhat misleading. I suppose you got it from some Python "tutorial"; please be aware that it is, at best, an oversimplification, and that it is quite possible that you will encounter other oversimplifications in the tutorial.
The Python REPL does read input line by line, in order to avoid reading too much. This differs from the way Python reads files; these are read in larger blocks, for efficiency. If the REPL did that, then the following wouldn't work:
>>> print(f"******* {input()} *******")
Hello, world
******* Hello, world *******
because the line intended as input to the expression would have already been consumed before the expression was evaluated. (And, of course, the whole point of the REPL is that you immediately see the result of executing a statement, rather than having to wait for the entire input to be read.)
So the REPL only reads lines as needed, and it does read whole lines. But that doesn't mean that it executes line-by-line. It reads an entire command, then compiles the command, and then executes it, printing the result.
That doesn't answer the question as to how the REPL knows that it has reached the end of a command, though. To answer that, we have to start with the Python grammar, conveniently reproduced in the Python documentation.
The first five lines of that grammar are the five different top-level targets of the parser. The first two, file and interactive, are the top-level targets used for reading files and for use in an interactive session. (The others are used in different parsing contexts, and I'm not going to consider them here.)
file and interactive are very different grammars. The file target is intended to parse an entire file, consisting of an optional list of statements ([statements]) followed by an end-of-file markers (ENDMARKER). In contrast, the interactive target reads a single statement_newline, whose definition is a few lines later in the grammar:
statement_newline:
| compound_stmt NEWLINE
| simple_stmts
| NEWLINE
| ENDMARKER
Here, simple_stmts is a single line consisting of a sequence of ;-separated simple statements, followed by a NEWLINE:
>>> a = 3; print(a)
3
The import aspect of the definition of statement_newline is that every option either ends with (or is) a NEWLINE, or is the end of the file itself.
None of the above has anything to do with actually reading input, because the Python parser --like most language parsers-- is not responsible for handling input. As is usual, the parser takes as input a sequence of tokens, which it requests one at a time as needed. In the grammar, tokens are represented either with CAPITAL_LETTERs (NEWLINE) or as quoted literals ('if' or '+'), which represent themselves.
These tokens come from the lexical analyser (the "lexer" in common usage), which is responsible for acquiring input as necessary and turning it into a token stream by:
recognising classes of tokens with the same syntactic usage (like NUMBER and NAME, whose precise characters are not important to the parser, although they will obviously be needed later on in the process).
recognising individual keyword tokens (the quoted literals in the grammar), which includes operator tokens. (It might sound odd to call + a keyword, but from the viewpoint of the lexer, that's what it is: a particular sequence of characters which make up a unique token.)
fabricating other tokens as needed. In Python, these have to do with the way leading whitespace is handled; the generated tokens are NEWLINE, INDENT and DEDENT.
ignoring comments and irrelevant whitespace.
The NEWLINE token represents a newline character (or, as it happens, the two-byte sequence \r\n sometimes used as a newline marker, for example by Windows or in many internet protocols). But not every newline character is turned into a NEWLINE token. Newlines which occur inside triple-quoted strings are considered ordinary characters. A newline immediately following a \ indicates that the next physical line is logically a continuation of the current input line. Newline characters inside parenthetic syntaxes ((...), [...], and {...}) are considered ignorable whitespace. And finally, in one of the few places where the lexer distinguishes between file and interactive input, the newline at the end of a line containing only whitespace and possibly a comment is ignored, unless the input is interactive and the line is completely empty.
The distinction in the last rule is required in order to implement the REPL rule that an empty line terminates a multi-line compound statement, which is not the case in file input. In file input, a compound statement terminates when a another statement is encountered at the same indent level, but that rule isn't suitable for interactive input, because it would require reading the first line of the next statement.
The fact that bracketed newlines are considered ignorable whitespace requires the lexer to duplicate a small amount of the work of the parser. In particular, the lexer maintains its own stack of open parenthesis/brace/bracket, which lets it track the tokens ()[]{}. Newline characters encountered in the input stream are ignored unless the bracket stack is empty. The slight duplication of effort is annoying but sometimes such deviations from perfection are necessary.
If you're interested in the way that INDENT and DEDENT are constructed, you can read about it in the reference manual; it's interesting, but not relevant here. (NEWLINE handling is also described in the reference manual section on Lexical Analysis, but I summarised it above because it is relevant to this question.)
So, to get back to the original question: How does the REPL know that it has read a complete command? The answer is simple: it asks the parser to recognise a single statement_newline target. As noted above, that construct is terminated by a NEWLINE token, and when the NEWLINE token which terminates the statement_newline target is encountered, the parser returns the resulting AST to the REPL, which proceeds to compile and execute it.
Not all NEWLINEs match the end of statement_newline, as you can see with a careful reading of the grammar. In particular, NEWLINEs inside compound statements are part of the compound statement syntax. The grammar for compound statements does not allow two consecutive NEWLINEs, but that can never happen when reading from a file because the lexical analyser does not produce a NEWLINE token for a blank line, as noted above. In interactive input, though, the lexical analyser does produce a NEWLINE token for a blank line, so it is possible for the parser to receive two consecutive NEWLINEs. Since the compound statement syntax doesn't include the second one, it becomes part of the statement_newline syntax, thereby terminating the parser's target.

TLDR: Digging into source code of CPython, I figured out that lazy lexer outputs >>> and ... promts.
Entry point for REPL is pymain_repl function:
static void
pymain_repl(PyConfig *config, int *exitcode)
{
/* ... */
PyCompilerFlags cf = _PyCompilerFlags_INIT;
int res = PyRun_AnyFileFlags(stdin, "<stdin>", &cf); // <-
*exitcode = (res != 0);
}
Which sets name of compiled file to "<stdin>".
If the name of file is "<stdin>", then _PyRun_InteractiveLoopObject will be called.
It's the REPL loop itself. Also, here >>> and ... are loaded to some global state.
int
_PyRun_InteractiveLoopObject(FILE *fp, PyObject *filename, PyCompilerFlags *flags)
{
/* ... */
PyObject *v = _PySys_GetAttr(tstate, &_Py_ID(ps1));
if (v == NULL) {
_PySys_SetAttr(&_Py_ID(ps1), v = PyUnicode_FromString(">>> ")); // <-
Py_XDECREF(v);
}
v = _PySys_GetAttr(tstate, &_Py_ID(ps2));
if (v == NULL) {
_PySys_SetAttr(&_Py_ID(ps2), v = PyUnicode_FromString("... ")); // <-
Py_XDECREF(v);
}
/* ... */
do {
ret = PyRun_InteractiveOneObjectEx(fp, filename, flags); // <-
/* ... */
} while (ret != E_EOF);
return err;
}
PyRun_InteractiveOneObjectEx reads, parses, compiles and runs single python's object
static int
PyRun_InteractiveOneObjectEx(FILE *fp, PyObject *filename,
PyCompilerFlags *flags)
{
/* ... */
v = _PySys_GetAttr(tstate, &_Py_ID(ps1)); // <-
/* ... (ps1 is set to v) */
w = _PySys_GetAttr(tstate, &_Py_ID(ps2)); // <-
/* ... (ps2 is set to w) */
mod = _PyParser_ASTFromFile(fp, filename, enc, Py_single_input,
ps1, ps2, flags, &errcode, arena);
/* ... */
}
Then we have bunch of parsing function...
Finally, we see tok_underflow_interactive function, that requests tokens with prompt through PyOS_Readline(stdin, stdout, tok->prompt) call
P.S: The 'Your Guide to the CPython Source Code' article was really helpful. But beware - linked source code is coming from an older branch.

It depends on the code you want to insert into the console. But in this case, as Python detects the keyword def referring to a function declaration, it will initiate a process that detects the end of the function code by looking at its indentation.
def a():
if 1==1:
if not 1==1:
pass
else:
return "End of execution"
#End of function
As you can see, the indentation of a function, or any similar structure is fundamental when writing it on multiple lines on the Python console. Here Python reads line by line until it detects and end on the normal function spacing, so it continues reading instructions outside a().

depythonifying 'char', got 'str' for pyobjc

0
Story would be: I was using a hardware which can be automatic controlled by a objc framework, it was already used by many colleagues so I can see it as a "fixed" library. But I would like to use it via Python, so with pyobjc I can already connect to this device, but failed to send data into it.
The objc command in header is like this
(BOOL) executeabcCommand:(NSString*)commandabc
withArgs:(uint32_t)args
withData:(uint8_t*)data
writeLength:(NSUInteger)writeLength
readLength:(NSUInteger)readLength
timeoutMilliseconds:(NSUInteger)timeoutMilliseconds
error:(NSError **) error;
and from my python code, data is an argument which can contain 256bytes of data such
as 0x00, 0x01, 0xFF. My python code looks like this:
senddata=Device.alloc().initWithCommunicationInterface_(tcpInterface)
command = 'ABCw'
args= 0x00
writelength = 0x100
readlength = 0x100
data = '\x50\x40'
timeout = 500
success, error = senddata.executeabcCommand_withArgs_withData_writeLength_readLength_timeoutMilliseconds_error_(command, args, data, writelength, readlength, timeout, None)
Whatever I sent into it, it always showing that.
ValueError: depythonifying 'char', got 'str'
I tired to dig in a little bit, but failed to find anything about convert string or list to char with pyobjc

Objective-C follows the rules that apply to C.
So in objc as well as C when we look at uint8_t*, it is in fact the very same as char* in memory. string differs from this only in that sense that it is agreed that the last character ends in \0 to indicate that the char* block that we call string has its cap. So char* blocks end with \0 because, well its a string.
What do we do in C to find out the length of a character block?
We iterate the whole block until we find \0. Usually with a while loop, and break the loop when you find it, your counter inside the loop tells you your length if you did not give it somehow anyway.
It is up to you to interpret the data in the desired format.
Which is why sometime it is easier to cast from void* or to take indeed a char* block which is then cast to and declared as uint8_t data inside the function which makes use if it. Thats the nice part of C to be able to define that as you wish, use that force that was given to you.
So to make your life easier, you could define a length parameter like so
-withData:(uint8_t*)data andLength:(uint64_t)len; to avoid parsing the character stream again, as you know already it is/or should be 256 characters long. The only thing you want to avoid at all cost in C is reading attempts at indices that are out of bound throwing an BAD_ACCESS exception.
But this basic information should enable you to find a way to declare your char* block containing uint8_t data addressed with the very first pointer (*) which also contains the first uint8_t character of the block as str with a specific length or up to the first appearance of \0.
Sidenote:
objective-c #"someNSString" == pythons u"pythonstring"
PS: in your question is not clear who throw that error msg.
Python? Because it could not interpret the data when receiving?
Pyobjc? Because it is python syntax hell when you mix with objc?
The objc runtime? Because it follows the strict rules of C as well?

Python has always been very forgiving about shoe-horning one type into another, but python3 uses Unicode strings by default, which need to be converted into binary strings before plugging into pyobjc methods.

Try specifying the strings as byte objects as b'this'
I was hitting the same error trying to use IOKit:
import objc
from Foundation import NSBundle
IOKit = NSBundle.bundleWithIdentifier_('com.apple.framework.IOKit')
functions = [("IOServiceGetMatchingService", b"II#"), ("IOServiceMatching", b"#*"),]
objc.loadBundleFunctions(IOKit, globals(), functions)
The problem arose when I tried to call the function like so:
IOServiceMatching('AppleSmartBattery')
Receiving
Traceback (most recent call last):
File "<pyshell#53>", line 1, in <module>
IOServiceMatching('AppleSmartBattery')
ValueError: depythonifying 'charptr', got 'str'
While as a byte object I get:
IOServiceMatching(b'AppleSmartBattery')
{
IOProviderClass = AppleSmartBattery;
}

Using NULL bytes in bash (for buffer overflow)

I programmed a little C program that is vulnerable to a buffer overflow. Everything is working as expected, though I came across a little problem now:
I want to call a function which lies on address 0x00007ffff7a79450 and since I am passing the arguments for the buffer overflow through the bash terminal (like this:
./a "$(python -c 'print "aaaaaaaaaaaaaaaaaaaaaa\x50\x94\xA7\xF7\xFF\x7F\x00\x00"')" )
I get an error that the bash is ignoring the nullbytes.
/bin/bash: warning: command substitution: ignored null byte in input
As a result I end up with the wrong address in memory (0x7ffff7a79450instead of0x00007ffff7a79450).
Now my question is: How can I produce the leading 0's and give them as an argument to my program?

I'll take a bold move and assert what you want to do is not possible in a POSIX environment, because of the way arguments are passed.
Programs are run using the execve system call.
int execve(const char *filename, char *const argv[], char *const envp[]);
There are a few other functions but all of them wrap execve in the end or use an extended system call with the properties that follow:
Program arguments are passed using an array of NUL-terminated strings.
That means that when the kernel will take your arguments and put them aside for the new program to use, it will only read them up to the first NUL character, and discard anything that follows.
So there is no way to make your example work if it has to include nul characters. This is why I suggested reading from stdin instead, which has no such limitation:
char buf[256];
read(STDIN_FILENO, buf, 2*sizeof(buf));
You would normally need to check the returned value of read. For a toy problem it should be enough for you to trigger your exploit. Just pipe your malicious input into your program.

Python bug: null byte in input prompt

I've found that
input('some\x00 text')
will prompt for some instead of some text.
From sources, I've figured out that this function uses C function PyOS_Readline, which ignores everything in prompt after NULL byte.
From PyOS_StdioReadline(FILE *sys_stdin, FILE *sys_stdout, const char *prompt):
fprintf(stderr, "%s", prompt);
https://github.com/python/cpython/blob/3.6/Python/bltinmodule.c#L1989
https://github.com/python/cpython/blob/3.6/Parser/myreadline.c#L251
Is this a bug or there is a reason for that?
Issue: http://bugs.python.org/issue30431

The function signature pretty much requires a NUL terminated C-string, PyOS_StdioReadline(FILE *sys_stdin, FILE *sys_stdout, const char *prompt), so there isn't much than can be done about this without changing the API and breaking interoperability with GNU readline.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.