How does CPython handle multiline input in the REPL?

Python's REPL reads input line by line.
However, function definitions span multiple lines.
For example:
>>> def answer():
...     return 42
...
>>> answer()
42
How does CPython's parser request additional input after the partial def answer(): line?

Python's REPL reads input line by line.
That statement is technically correct, but it's somewhat misleading. I suppose you got it from some Python "tutorial"; please be aware that it is, at best, an oversimplification, and that it is quite possible that you will encounter other oversimplifications in the tutorial.
The Python REPL does read input line by line, in order to avoid reading too much. This differs from the way Python reads files; these are read in larger blocks, for efficiency. If the REPL did that, then the following wouldn't work:
>>> print(f"******* {input()} *******")
Hello, world
******* Hello, world *******
because the line intended as input to the expression would have already been consumed before the expression was evaluated. (And, of course, the whole point of the REPL is that you immediately see the result of executing a statement, rather than having to wait for the entire input to be read.)
So the REPL only reads lines as needed, and it does read whole lines. But that doesn't mean that it executes line-by-line. It reads an entire command, then compiles the command, and then executes it, printing the result.
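You can reproduce the "compile the whole command, then execute it, printing the result" step yourself: each interactive command is compiled in 'single' mode, which is what arranges for the value of a bare expression to be echoed. A minimal sketch using only builtins (the source string is just an example):
code_obj = compile("1 + 1\n", "<stdin>", "single")  # one complete REPL command
exec(code_obj)   # prints 2, exactly as the interactive prompt would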
That doesn't answer the question as to how the REPL knows that it has reached the end of a command, though. To answer that, we have to start with the Python grammar, conveniently reproduced in the Python documentation.
The first five lines of that grammar are the five different top-level targets of the parser. The first two, file and interactive, are the top-level targets used for reading files and for use in an interactive session. (The others are used in different parsing contexts, and I'm not going to consider them here.)
file and interactive are very different grammars. The file target is intended to parse an entire file, consisting of an optional list of statements ([statements]) followed by an end-of-file marker (ENDMARKER). In contrast, the interactive target reads a single statement_newline, whose definition is a few lines later in the grammar:
statement_newline:
| compound_stmt NEWLINE
| simple_stmts
| NEWLINE
| ENDMARKER
Here, simple_stmts is a single line consisting of a sequence of ;-separated simple statements, followed by a NEWLINE:
>>> a = 3; print(a)
3
The important aspect of the definition of statement_newline is that every option either ends with (or is) a NEWLINE, or is the end of the file itself.
None of the above has anything to do with actually reading input, because the Python parser --like most language parsers-- is not responsible for handling input. As is usual, the parser takes as input a sequence of tokens, which it requests one at a time as needed. In the grammar, tokens are represented either with CAPITAL_LETTERs (NEWLINE) or as quoted literals ('if' or '+'), which represent themselves.
These tokens come from the lexical analyser (the "lexer" in common usage), which is responsible for acquiring input as necessary and turning it into a token stream by:
recognising classes of tokens with the same syntactic usage (like NUMBER and NAME, whose precise characters are not important to the parser, although they will obviously be needed later on in the process).
recognising individual keyword tokens (the quoted literals in the grammar), which includes operator tokens. (It might sound odd to call + a keyword, but from the viewpoint of the lexer, that's what it is: a particular sequence of characters which make up a unique token.)
fabricating other tokens as needed. In Python, these have to do with the way leading whitespace is handled; the generated tokens are NEWLINE, INDENT and DEDENT.
ignoring comments and irrelevant whitespace.
The NEWLINE token represents a newline character (or, as it happens, the two-byte sequence \r\n sometimes used as a newline marker, for example by Windows or in many internet protocols). But not every newline character is turned into a NEWLINE token. Newlines which occur inside triple-quoted strings are considered ordinary characters. A newline immediately following a \ indicates that the next physical line is logically a continuation of the current input line. Newline characters inside parenthetic syntaxes ((...), [...], and {...}) are considered ignorable whitespace. And finally, in one of the few places where the lexer distinguishes between file and interactive input, the newline at the end of a line containing only whitespace and possibly a comment is ignored, unless the input is interactive and the line is completely empty.
The distinction in the last rule is required in order to implement the REPL rule that an empty line terminates a multi-line compound statement, which is not the case in file input. In file input, a compound statement terminates when another statement is encountered at the same indent level, but that rule isn't suitable for interactive input, because it would require reading the first line of the next statement.
The fact that bracketed newlines are considered ignorable whitespace requires the lexer to duplicate a small amount of the work of the parser. In particular, the lexer maintains its own stack of open parentheses, brackets and braces, which lets it track the tokens ()[]{}. Newline characters encountered in the input stream are ignored unless the bracket stack is empty. The slight duplication of effort is annoying, but sometimes such deviations from perfection are necessary.
If you're interested in the way that INDENT and DEDENT are constructed, you can read about it in the reference manual; it's interesting, but not relevant here. (NEWLINE handling is also described in the reference manual section on Lexical Analysis, but I summarised it above because it is relevant to this question.)
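If you want to watch these tokens being produced, the tokenize module (a pure-Python lexical scanner for Python source) shows NEWLINE, the ignorable NL used for newlines inside brackets, and the fabricated INDENT/DEDENT tokens. A small sketch, using a function with a parenthesised continuation (the source string is just an example):
import io
import tokenize

src = (
    "def answer():\n"
    "    total = (40 +\n"
    "             2)\n"
    "    return total\n"
)
for tok in tokenize.generate_tokens(io.StringIO(src).readline):
    print(tokenize.tok_name[tok.type], repr(tok.string))
# The newline after "40 +" comes out as NL (ignorable), the others as NEWLINE,
# and the function body is wrapped in one INDENT / DEDENT pair.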
So, to get back to the original question: How does the REPL know that it has read a complete command? The answer is simple: it asks the parser to recognise a single statement_newline target. As noted above, that construct is terminated by a NEWLINE token, and when the NEWLINE token which terminates the statement_newline target is encountered, the parser returns the resulting AST to the REPL, which proceeds to compile and execute it.
Not all NEWLINEs match the end of statement_newline, as you can see with a careful reading of the grammar. In particular, NEWLINEs inside compound statements are part of the compound statement syntax. The grammar for compound statements does not allow two consecutive NEWLINEs, but that can never happen when reading from a file because the lexical analyser does not produce a NEWLINE token for a blank line, as noted above. In interactive input, though, the lexical analyser does produce a NEWLINE token for a blank line, so it is possible for the parser to receive two consecutive NEWLINEs. Since the compound statement syntax doesn't include the second one, it becomes part of the statement_newline syntax, thereby terminating the parser's target.
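You can also watch that decision from Python: the code and codeop modules in the standard library reimplement this loop in pure Python and expose the "is the command complete yet?" check directly. A small sketch (the function name answer is just the one from the question):
import code
import codeop

# codeop.compile_command returns None while the command is still incomplete.
print(codeop.compile_command("def answer():"))   # None -> keep prompting with "..."
print(codeop.compile_command("1 + 1"))           # a code object -> complete, ready to run

# code.InteractiveConsole drives the same check line by line, like the REPL.
console = code.InteractiveConsole()
print(console.push("def answer():"))    # True: more input needed
print(console.push("    return 42"))    # True: still inside the compound statement
print(console.push(""))                 # False: the blank line terminates it
print(console.push("answer()"))         # prints 42 (via sys.displayhook), then False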

TL;DR: Digging into the CPython source code, I figured out that the lazy lexer is what emits the >>> and ... prompts.
The entry point for the REPL is the pymain_repl function:
static void
pymain_repl(PyConfig *config, int *exitcode)
{
    /* ... */
    PyCompilerFlags cf = _PyCompilerFlags_INIT;
    int res = PyRun_AnyFileFlags(stdin, "<stdin>", &cf);  // <-
    *exitcode = (res != 0);
}
This sets the name of the compiled "file" to "<stdin>".
If the file name is "<stdin>", then _PyRun_InteractiveLoopObject will be called.
This is the REPL loop itself. Here the >>> and ... prompts are also stored in global state (sys.ps1 and sys.ps2) if they are not set yet.
int
_PyRun_InteractiveLoopObject(FILE *fp, PyObject *filename, PyCompilerFlags *flags)
{
    /* ... */
    PyObject *v = _PySys_GetAttr(tstate, &_Py_ID(ps1));
    if (v == NULL) {
        _PySys_SetAttr(&_Py_ID(ps1), v = PyUnicode_FromString(">>> "));  // <-
        Py_XDECREF(v);
    }
    v = _PySys_GetAttr(tstate, &_Py_ID(ps2));
    if (v == NULL) {
        _PySys_SetAttr(&_Py_ID(ps2), v = PyUnicode_FromString("... "));  // <-
        Py_XDECREF(v);
    }
    /* ... */
    do {
        ret = PyRun_InteractiveOneObjectEx(fp, filename, flags);  // <-
        /* ... */
    } while (ret != E_EOF);
    return err;
}
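Those ps1/ps2 values are the documented sys.ps1 and sys.ps2 attributes, so you can change the prompts from inside a session. A tiny sketch (only meaningful when running interactively):
import sys

sys.ps1 = "py> "    # primary prompt, normally ">>> "
sys.ps2 = "....  "  # continuation prompt, normally "... "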
PyRun_InteractiveOneObjectEx reads, parses, compiles and runs a single interactive command:
static int
PyRun_InteractiveOneObjectEx(FILE *fp, PyObject *filename,
                             PyCompilerFlags *flags)
{
    /* ... */
    v = _PySys_GetAttr(tstate, &_Py_ID(ps1));  // <-
    /* ... (ps1 is set to v) */
    w = _PySys_GetAttr(tstate, &_Py_ID(ps2));  // <-
    /* ... (ps2 is set to w) */
    mod = _PyParser_ASTFromFile(fp, filename, enc, Py_single_input,
                                ps1, ps2, flags, &errcode, arena);
    /* ... */
}
Then there is a chain of parsing functions...
Finally, we reach the tok_underflow_interactive function, which requests more input line by line, printing the appropriate prompt, through a PyOS_Readline(stdin, stdout, tok->prompt) call.
P.S.: The 'Your Guide to the CPython Source Code' article was really helpful. But beware: the linked source code comes from an older branch.

It depends on the code you want to enter into the console. In this case, since Python detects the keyword def starting a function definition, it begins a process that finds the end of the function body by looking at its indentation.
def a():
    if 1==1:
        if not 1==1:
            pass
        else:
            return "End of execution"
#End of function
As you can see, the indentation of a function, or any similar structure, is fundamental when writing it across multiple lines in the Python console. Python reads line by line until it detects the end of the function's indentation, and then continues reading instructions outside a().

Related

Parsing blocks as Python

I am writing a lexer + parser in JFlex + CUP, and I wanted to have Python-like syntax regarding blocks; that is, indentation marks the block level.
I am unsure how to tackle this, and whether it should be done at the lexical or syntax level.
My current approach is to solve the issue at the lexical level: newlines are parsed as instruction separators, and when one is processed I move the lexer into a special state which checks how many characters follow the newline, remembers in which column the last line started, and accordingly emits an open-block or close-block token.
However, I am running into all sort of trouble. For example:
JFlex cannot match empty strings, so my instructions need to have at least one blank after every newline.
I cannot close two blocks at the same time with this approach.
Is my approach correct? Should I be doing things different?
Your approach of handling indents in the lexer rather than the parser is correct. Well, it’s doable either way, but this is usually the easier way, and it’s the way Python itself (or at least CPython and PyPy) does it.
I don’t know much about JFlex, and you haven’t given us any code to work with, but I can explain in general terms.
For your first problem, you're already putting the lexer into a special state after the newline, so that "grab 0 or more spaces" should be doable by escaping from the normal flow of things and just running a regex against the line.
For your second problem, the simplest solution (and the one Python uses) is to keep a stack of indents. I'll demonstrate something a bit simpler than what Python does.
First:
indents = [0]
After each newline, grab a run of 0 or more spaces as spaces. Then:
if len(spaces) == indents[-1]:
    pass
elif len(spaces) > indents[-1]:
    indents.append(len(spaces))
    emit(INDENT_TOKEN)
else:
    while len(spaces) != indents[-1]:
        indents.pop()
        emit(DEDENT_TOKEN)
Now your parser just sees INDENT_TOKEN and DEDENT_TOKEN, which are no different from, say, OPEN_BRACE_TOKEN and CLOSE_BRACE_TOKEN in a C-like language.
Of course you'd want better error handling: raise some kind of tokenizer error rather than an implicit IndexError, and maybe use < instead of != so you can detect that you've gone too far rather than exhausting the stack (useful for error recovery if you want to keep emitting further errors instead of bailing at the first one), etc.
For real-life example code (with error handling, and tabs as well as spaces, and backslash newline escaping, and handling non-syntactic indentation inside of parenthesized expressions, etc.), see the tokenize docs and source in the stdlib.
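For a runnable illustration of the indent-stack idea outside any particular lexer generator, here is a plain-Python sketch; the INDENT/DEDENT/LINE token tuples and the indent_tokens name are made up for the example, not JFlex/CUP API:
def indent_tokens(lines):
    """Yield ('INDENT',), ('DEDENT',) and ('LINE', text) tokens from raw lines."""
    indents = [0]
    for line in lines:
        if not line.strip():          # blank line: no block change
            continue
        spaces = len(line) - len(line.lstrip(" "))
        if spaces > indents[-1]:
            indents.append(spaces)
            yield ("INDENT",)
        else:
            while spaces < indents[-1]:
                indents.pop()
                yield ("DEDENT",)     # one DEDENT per closed block
            if spaces != indents[-1]:
                raise SyntaxError("inconsistent dedent")
        yield ("LINE", line.strip())
    while len(indents) > 1:           # close any blocks still open at EOF
        indents.pop()
        yield ("DEDENT",)

for tok in indent_tokens(["a:", "    b:", "        c", "    d", "e"]):
    print(tok)
Note how the line "    d" closes one block and "e" closes another, which is exactly the "close two blocks at once" case from the question.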

How to automatically insert spaces to make empty lines in Python files indent?

I recently encountered the common "unexpected indent" problem when trying to evaluate Python code by copying it from PyDev and Emacs into a Python interpreter.
After trying to fix tab/spaces and some searches, I found the cause in this answer:
This error can also occur when pasting something into the Python
interpreter (terminal/console).
Note that the interpreter interprets an empty line as the end of an
expression, so if you paste in something like
def my_function():
    x = 3

    y = 7
the interpreter will interpret the empty line before y = 7 as the end
of the expression ...
, which is exactly the case in my situation. And there is also a comment to the answer which points out a solution:
key being that blank lines within the function definition are fine,
but they still must have the initial whitespace since Python
interprets any blank line as the end of the function
But the solution is impractical as I have many empty lines that are problematic for the interpreter. My question is:
Is there a method/tool to automatically insert the right number of initial whitespaces to empty lines so that I can copy-and-paste my code from an editor to an interpreter?
Don't bother with inserting spaces. Tell the interpreter to execute a block of text instead:
>>> exec(r'''
<paste your code>
''')
The r''' ... ''' triple-quoted string preserves escapes and newlines. Sometimes (though in my experience, rarely) you need to use r""" ... """ instead, when the code block contains triple-quoted strings using single quotes.
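For instance, the snippet from the question above pastes fine this way, blank line and all:
exec(r'''
def my_function():
    x = 3

    y = 7
''')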
Another option is to switch to using IPython to do your day-to-day testing of pasted code, which handles pasted code with blank lines natively.

Using NULL bytes in bash (for buffer overflow)

I programmed a little C program that is vulnerable to a buffer overflow. Everything is working as expected, though I came across a little problem now:
I want to call a function which lies on address 0x00007ffff7a79450 and since I am passing the arguments for the buffer overflow through the bash terminal (like this:
./a "$(python -c 'print "aaaaaaaaaaaaaaaaaaaaaa\x50\x94\xA7\xF7\xFF\x7F\x00\x00"')" )
I get a warning that bash is ignoring the null bytes:
/bin/bash: warning: command substitution: ignored null byte in input
As a result I end up with the wrong address in memory (0x7ffff7a79450 instead of 0x00007ffff7a79450).
Now my question is: How can I produce the leading 0's and give them as an argument to my program?
I'll go out on a limb and assert that what you want to do is not possible in a POSIX environment, because of the way arguments are passed.
Programs are run using the execve system call.
int execve(const char *filename, char *const argv[], char *const envp[]);
There are a few other functions but all of them wrap execve in the end or use an extended system call with the properties that follow:
Program arguments are passed using an array of NUL-terminated strings.
That means that when the kernel will take your arguments and put them aside for the new program to use, it will only read them up to the first NUL character, and discard anything that follows.
So there is no way to make your example work if it has to include nul characters. This is why I suggested reading from stdin instead, which has no such limitation:
char buf[256];
read(STDIN_FILENO, buf, 2*sizeof(buf));
You would normally need to check the returned value of read. For a toy problem it should be enough for you to trigger your exploit. Just pipe your malicious input into your program.
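If it helps, here is a sketch of that stdin route in Python 3; the address bytes are the ones from your command, while the file name make_payload.py and the padding length are just placeholders you should adjust to your binary:
# make_payload.py: emit the overflow payload as raw bytes, NUL bytes included.
import sys

payload = b"a" * 22 + b"\x50\x94\xa7\xf7\xff\x7f\x00\x00"
sys.stdout.buffer.write(payload)
Run it as python3 make_payload.py | ./a so the NUL bytes reach read() without the shell ever treating them as part of an argument.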

Find word/pattern/string in txt/xml-file and add incremental number

I'm using TextWrangler, which falls short when it comes to adding numbers to replacements.
I have an xml-file with several strings containing the words:
generatoritem id="Outline Text"
I need to add an incrementing number at the end of each substitution, like so:
generatoritem id="Outline Text1"
So I need to replace 'Outline Text' with 'Outline Text' and an incrementing number.
I found an answer to a similar question, typed this into TextWrangler and hit Check Syntax:
perl -ple 's/"Outline Text"/$n++/e' /path/of/file.xml
Plenty of errors... So I need this nice one-liner explained. Or perhaps a new one, or a Python script?
-p makes perl read your file(s) one line at a time, and for each line it will execute the script and then emit the line as modified by your script. Note that there is an implicit use of a variable called $_ - it is used as the variable holding the line being read, it's also the default target for s/// and it's the default source for the print after each line.
You won't need the -l (the l in the middle of -ple) for the task you describe, so I won't bother going into it. Remove it.
The final flag -e (the e at the end of -ple) introduces your 'script' from the command line itself - ie allowing a quick and small use of perl without a source file.
Onto the guts of your script: it is wrong for the purpose you describe, and as an attempt it's also a bit unsafe.
If you want to change "Outline text" into something else, your current script replaces ALL of it with $n - which is not what you describe you want. A simple way to do exactly what you ask for is
s/(id="Outline Text)(")/$1 . $n++ . $2/eg;
This matches the exact text you want, and notice that I'm also matching id= for extra certainty in case OTHER parts of your file contain "Outline Text" - don't laugh, it can happen!
By putting ( ) around parts of the pattern, those bits are saved in variables known as $1, $2 etc. I am then using these in the replacement part. The . operator glues the pieces together, giving your result.
The /e at the end of the s/// means that the replacement is treated as a Perl expression, not just a plain replacement string. I've also added g which makes it match more than once on a line - you may have more than one interesting id= on a line in the input file, so be ready for it.
One final point. You seem to suggest you want to start numbering from 1, so replace $n++ with ++$n. With my suggested change, the variable $n starts out empty (effectively zero); it will be incremented to 1 (and 2, and 3, and so on) and THEN its value will be used.
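Since you also mentioned a Python script as an option, here is a minimal sketch that does the same numbered substitution; the pattern id="Outline Text" comes from your question, while the file name file.xml stands in for your actual path:
import re
from itertools import count

counter = count(1)   # supplies 1, 2, 3, ...

with open("file.xml") as f:
    text = f.read()

# Insert the next number before the closing quote of each id="Outline Text".
result = re.sub(r'(id="Outline Text)(")',
                lambda m: m.group(1) + str(next(counter)) + m.group(2),
                text)

print(result)   # redirect to a new file, or write it back out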

Why does python use unconventional triple-quotation marks for comments?

Why didn't Python just use the traditional style of comments that C/C++/Java use:
/**
* Comment lines
* More comment lines
*/
// line comments
// line comments
//
Is there a specific reason for this or is it just arbitrary?
Python doesn't use triple quotation marks for comments. Comments use the hash (a.k.a. pound) character:
# this is a comment
The triple quote thing is a doc string, and, unlike a comment, is actually available as a real string to the program:
>>> def bla():
...     """Print the answer"""
...     print(42)
...
>>> bla.__doc__
'Print the answer'
>>> help(bla)
Help on function bla in module __main__:
bla()
Print the answer
It's not strictly required to use triple quotes, as long as it's a string. Using """ is just a convention (and has the advantage of being multiline).
A number of the answers got many of the points, but don't give the complete view of how things work. To summarize...
# comment is how Python does actual comments (similar to bash, and some other languages). Python only has "to the end of the line" comments, it has no explicit multi-line comment wrapper (as opposed to javascript's /* .. */). Most Python IDEs let you select-and-comment a block at a time, this is how many people handle that situation.
Then there are normal single-line python strings: They can use ' or " quotation marks (eg 'foo' "bar"). The main limitation with these is that they don't wrap across multiple lines. That's what multiline-strings are for: These are strings surrounded by triple single or double quotes (''' or """) and are terminated only when a matching unescaped terminator is found. They can go on for as many lines as needed, and include all intervening whitespace.
Either of these two string types define a completely normal string object. They can be assigned a variable name, have operators applied to them, etc. Once parsed, there are no differences between any of the formats. However, there are two special cases based on where the string is and how it's used...
First, if a string is just written down, with no additional operations applied and not assigned to a variable, what happens to it? When the code executes, the bare string is simply discarded. So people have found it convenient to comment out large blocks of Python code using multi-line strings (provided you escape any internal multi-line strings). This isn't that common, or semantically correct, but it is allowed.
The second use is that any such bare strings which follow immediately after a def Foo(), class Foo(), or the start of a module, are treated as string containing documentation for that object, and stored in the __doc__ attribute of the object. This is the most common case where strings can seem like they are a "comment". The difference is that they are performing an active role as part of the parsed code, being stored in __doc__... and unlike a comment, they can be read at runtime.
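A short sketch of both cases (the function name is just for illustration):
def answer():
    """Return the answer."""       # stored as answer.__doc__
    '''
    This second, bare string is parsed like any other expression statement
    and then simply discarded: it behaves like a comment, but it is not one.
    '''
    return 42

print(answer.__doc__)   # Return the answer.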
Triple-quotes aren't comments. They're string literals that span multiple lines and include those line breaks in the resulting string. This allows you to use
somestr = """This is a rather long string containing
several lines of text just as you would do in C.
Note that whitespace at the beginning of the line is\
significant."""
instead of
somestr = "This is a rather long string containing\n\
several lines of text just as you would do in C.\n\
Note that whitespace at the beginning of the line is\
significant."
Most scripting languages use # as a comment marker so that the shebang line (#!), which tells the program loader which interpreter to run (as in #!/bin/bash), is skipped automatically. Alternatively, the interpreter could be instructed to skip the first line explicitly, but it's far more convenient to simply define # as the comment marker, so the shebang gets skipped as a consequence.
Guido, the creator of Python, actually weighs in on the topic here:
https://twitter.com/gvanrossum/status/112670605505077248?lang=en
In summary: for multiline comments, just use triple quotes. For academic purposes: yes, it technically is a string, but it gets ignored because it is never used or assigned to a variable.
