How to split a string of Python source code into Python "statements"?

Given a string s containing (syntactically valid) Python source code, how can I split s into an array whose elements are the strings corresponding to the Python "statements" in s?
I put scare-quotes around "statements" because this term does not capture exactly what I'm looking for. Rather than trying to come up with a more accurate wording, here's an example. Compare the following two ipython interactions:
In [1]: if 1 > 0:
   ...:     pass
   ...:
In [2]: if 1 > 0
  File "<ipython-input-1082-0b411f095922>", line 1
    if 1 > 0
            ^
SyntaxError: invalid syntax
In the first interaction, after the first [RETURN] statement, ipython processes the input if 1 > 0: without objection, even though it is still incomplete (i.e. it is not a full Python statement). In contrast, in the second interaction, the input is not only incomplete (in this sense), but also not acceptable to ipython.
As a second, more complete example, suppose the file foo.py contains the following Python source code:
def print_vertically(s):
    '''A pretty useless procedure.
    Prints the characters in its argument one per line.
    '''
    for c in s:
        print c
greeting = ('hello '
            'world'.
            upper())
print_vertically(greeting)
Now, if I ran the following snippet, featuring the desired split_python_source function:
src = open('foo.py').read()
for i, s in enumerate(split_python_source(src)):
    print '%d. >>>%s<<<' % (i, s)
the output would look like this:
0. >>>def print_vertically(s):<<<
1. >>>    '''A pretty useless procedure.
    Prints the characters in its argument one per line.
    '''<<<
2. >>>    for c in s:<<<
3. >>>        print c<<<
4. >>>greeting = ('hello '
            'world'.
            upper())<<<
5. >>>print_vertically(greeting)<<<
As you can see, in this splitting, for c in s: (for example) gets assigned to its own item, rather than being part of some "compound statement."
In fact, I don't have a very precise specification for how the splitting should be done, as long as it is done "at the joints" (like ipython does).

I'm not familiar with the internals of the Python lexer (though almost certainly many people on SO are :), but my guess is that you're basically looking for lines, with one important exception: paired open-close delimiters that can span multiple lines.
As a quick and dirty first pass, you might be able to start with something that splits a piece of code on newlines, and then you could merge successive lines that are found to contain paired delimiters -- parentheses (), braces {}, brackets [], and quotes '', ''' ''' are the ones that come to mind.
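Rather than merging lines by hand, one possible shortcut is to let the standard library's tokenize module do the delimiter tracking: it emits a NEWLINE token at the end of each logical line and an NL token for line breaks inside brackets or between statements. The following is only a sketch under some assumptions: it targets Python 3 (so the example uses print(c) instead of print c), split_python_source is the hypothetical name from the question, and whole physical lines are kept, so several statements joined by semicolons stay in one chunk.

```python
import io
import tokenize

# Token types that never start a logical line.
_SKIP = {tokenize.NL, tokenize.COMMENT, tokenize.INDENT,
         tokenize.DEDENT, tokenize.ENDMARKER}


def split_python_source(src):
    """Split src into chunks, one per logical line (hypothetical helper)."""
    lines = src.splitlines(keepends=True)
    chunks = []
    start_row = None  # 1-based row where the current logical line began
    for tok in tokenize.generate_tokens(io.StringIO(src).readline):
        if tok.type == tokenize.NEWLINE:
            if start_row is not None:
                # NEWLINE sits on the last physical row of the logical line.
                chunk = ''.join(lines[start_row - 1:tok.start[0]])
                chunks.append(chunk.rstrip('\r\n'))
                start_row = None
        elif tok.type not in _SKIP and start_row is None:
            start_row = tok.start[0]
    return chunks


src = (
    "def print_vertically(s):\n"
    "    '''A pretty useless procedure.\n"
    "    Prints the characters in its argument one per line.\n"
    "    '''\n"
    "    for c in s:\n"
    "        print(c)\n"
    "greeting = ('hello '\n"
    "            'world'.\n"
    "            upper())\n"
    "print_vertically(greeting)\n"
)
chunks = split_python_source(src)
for i, s in enumerate(chunks):
    print('%d. >>>%s<<<' % (i, s))
```

Comment-only lines and blank lines are dropped in this sketch because they produce no NEWLINE token of their own.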

Related

functioning of replace()

print("abc".replace("","|")) #Explain this
#|a|b|c|
print("".replace("","abc"))
#abc
print("".replace("","abc",3))
#no output why? is this bug ?
I am really unable to understand these lines. Please explain them briefly.

In the first line you are replacing the empty string with |. The empty string matches at every position in abc (before the first character, between each pair of characters, and after the last one), so the output should be and is |a|b|c|. If your string were a b c, your output would be |a| |b| |c|.
Regarding the last line and your expected output of abcabcabc: the replace function replaces, it does not multiply. So you can modify your code to something like this, where you first replace your desired characters and then multiply the result by 3 to get what you want.
print("".replace("", "abc")*3)
Output is now abcabcabc.
As for your code, you're telling the Python interpreter to find three occurrences of '' and replace them with 'abc', but your string contains only one empty match, so you cannot replace three of them, and you get an empty value.
That is not a bug in fact.
Edit
I searched a bit more and found that Issue 28029 on the Python bug tracker describes a bug like your case, present in Python versions up to 3.8. I checked it again with Python 3.9 IDLE and now it is working fine:
print(''.replace('', 'abc', 3))
abc

According to doc (help(str.replace)):
replace(self, old, new, count=-1, /)
Return a copy with all occurrences of substring old replaced by new.
count
Maximum number of occurrences to replace.
-1 (the default value) means replace all occurrences.
If the optional argument count is given, only the first count occurrences are
replaced.
Basically you're setting a limit on the number of occurrences to be replaced.
As for the first example:
it seems that print("abc".replace("","|")) falls into a special case that the Python developers handle explicitly. See the zero-length special case in the source code.
If you switch from Python to the C source, you can read there that when the string to be replaced has length 0, the function stringlib_replace_interleave is called.
The comment there gives an example of what it does:
/* insert the 'to' bytes everywhere. */
/* >>> b"Python".replace(b"", b".") */
/* b'.P.y.t.h.o.n.' */
For all the other cases you can check string replace implementation.
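The behaviour described in the answers above can be checked with a few lines. This is a small sketch, not taken from the original posts; the variable names are made up for illustration, and the last case is version dependent (it returns an empty string before Python 3.9 and 'abc' from 3.9 on, per Issue 28029).

```python
import sys

# An empty "old" string matches before, between, and after every character.
interleaved = "abc".replace("", "|")

# count caps the number of insertions, working from the left.
limited = "abc".replace("", "|", 2)

# An empty source string has exactly one empty match.
empty_all = "".replace("", "abc")

# Version dependent: '' before Python 3.9, 'abc' from 3.9 on.
empty_limited = "".replace("", "abc", 3)
expected_empty_limited = "abc" if sys.version_info >= (3, 9) else ""

print(repr(interleaved))    # '|a|b|c|'
print(repr(limited))        # '|a|bc'
print(repr(empty_all))      # 'abc'
print(repr(empty_limited))
```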

triple-quoted strings comments crash simple python program

I found this strange problem when I was trying to add comments to my code. I used triple-quoted strings as comments, but the program crashed with the following error:
IndentationError: unexpected indent
When I use # to comment the triple-quoted strings, everything works normally. Does anyone know the reason behind this error and how I could fix it?
My Code:
#This programs show that comments using # rather than """ """
def main():
    print("let's do something")
    #Try using hashtag to comment this block to get code working
'''
Note following block gives you a non-sense indent error
The next step would be to consider how to get all the words from spam and ham
folder from different directory. My suggestion would be do it twice and then
concentrate two lists
Frist think about the most efficient way
For example, we might need to get rid off the duplicated words in the beginning
The thoughts of writing the algorithem to create the dictionary
Method-1:
1. To append all the list from the email all-together
2. Eliminate those duplicated words
cons: the list might become super large
I Choose method-2 to save the memory
Method-2:
1. kill the duplicated words in each string
2. Only append elements that is not already in the dictionary
Note:
1. In this case, the length of feature actually was determined by the
training cohorts, as we used the different English terms to decide feature
cons: the process time might be super long
'''
    def wtf_python(var1, var2):
        var3 = var1 + var2 + (var1*var2)
        return var3
    wtfRst1 = wtf_python(1,2)
    wtfRst2 = wtf_python(3,4)
    rstAll = { "wtfRst1" : wtfRst1,
               "wtfRst2" : wtfRst2
    }
    return(rstAll)
if __name__ == "__main__":
    mainRst = main()
    print("wtfRst1 is :\n", mainRst['wtfRst1'])
    print("wtfRst2 is :\n", mainRst['wtfRst2'])

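The culprit: the triple-quoted string sits outside the function body.
The fix: move the comment inside the function definition by indenting it.
The reason: triple-quoted strings are valid Python expressions, so they have to be treated like any other statement, i.e. indented so that they sit inside the function scope.
Hence: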
def main():
    print("let's do something")
    #Try using hashtag to comment this block to get code working
    '''
    Note following block gives you a non-sense indent error
    The next step would be to consider how to get all the words from spam and ham
    folder from different directory. My suggestion would be do it twice and then
    concentrate two lists
    Frist think about the most efficient way
    For example, we might need to get rid off the duplicated words in the beginning
    The thoughts of writing the algorithem to create the dictionary
    Method-1:
    1. To append all the list from the email all-together
    2. Eliminate those duplicated words
    cons: the list might become super large
    I Choose method-2 to save the memory
    Method-2:
    1. kill the duplicated words in each string
    2. Only append elements that is not already in the dictionary
    Note:
    1. In this case, the length of feature actually was determined by the
    training cohorts, as we used the different English terms to decide feature
    cons: the process time might be super long
    '''
    def wtf_python(var1, var2):
        var3 = var1 + var2 + (var1*var2)
        return var3
    wtfRst1 = wtf_python(1,2)
    wtfRst2 = wtf_python(3,4)
    rstAll = { "wtfRst1" : wtfRst1,
               "wtfRst2" : wtfRst2
    }
    return(rstAll)
if __name__ == "__main__":
    mainRst = main()
    print("wtfRst1 is :\n", mainRst['wtfRst1'])
    print("wtfRst2 is :\n", mainRst['wtfRst2'])
OUTPUT:
let's do something
wtfRst1 is :
5
wtfRst2 is :
19

You should push the indentation level of your triple-quoted string one tab to the right.
Although triple-quote strings are often used as comments, they are normal python expressions, so they should follow the language's syntax.

Triple quoted strings as comments must be valid Python strings. Valid Python strings must be properly indented.
Python sees the multi-line string, evaluates it, but since you don't assign a variable to it the string gets thrown away in the next line.
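The point made in the answers above can be reproduced in a few lines by compiling two tiny sources with the built-in compile(). This is a minimal sketch with made-up source strings and a hypothetical helper named compiles; it is not the asker's program.

```python
# A string statement at column 0 ends the function body, so the
# indented line that follows it is an unexpected indent.
bad = (
    "def main():\n"
    "    print('start')\n"
    "'''\n"
    "a block comment at column 0\n"
    "'''\n"
    "    return 1\n"
)

# The same string indented with the body is just an expression statement.
good = (
    "def main():\n"
    "    print('start')\n"
    "    '''\n"
    "    a block comment indented with the body\n"
    "    '''\n"
    "    return 1\n"
)


def compiles(src):
    """Return True if src compiles, False on an indentation error."""
    try:
        compile(src, "<example>", "exec")
        return True
    except IndentationError:
        return False


print(compiles(bad))   # False
print(compiles(good))  # True
```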

Remove '>>> ' from copied and pasted doctest

When reading documentation, I often encounter doctests that I would like to run. Let's say you want to run the following in a Jupyter notebook:
>>> a = 2
>>> b = 3
>>> c = a + b
What is the fastest way to do it?

Just copy and paste it in a new cell. Jupyter strips such markup for you when it runs the sample.
If you must strip the markup (for aesthetic reasons, perhaps), you can use a bit of Python code to do so:
def extract_console_code(sample):
    return ''.join([l[4:] for l in sample.splitlines(True) if l[:4] in ('>>> ', '... ')])
print(extract_console_code(r'''<paste code here>'''))
Note the r raw string literal! This should work for most Python code. Only if your code sample contains more ''' triple-single-quotes would you have to handle those separately (by using double quotes around the code, or by concatenating sections together with different string literal styles). Also, note that we skip any line that doesn't start with >>> or ...; those are output lines and not code.
You'll have to run this in a Python script, because the Jupyter console still just strips those initial lines away, and so for your exact example, depending on how you added the lines, it could be that none or only a few of the lines are returned; any line starting with >>> or ..., even in a string literal, will have been stripped by Jupyter already!
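To see what the filter keeps and drops, here is a small usage sketch. It assumes the code is run as a plain Python script (not pasted into a Jupyter cell, for the reason given above), and the sample session is made up for illustration.

```python
def extract_console_code(sample):
    # Keep only lines that start with a console prompt, minus the prompt.
    return ''.join(l[4:] for l in sample.splitlines(True)
                   if l[:4] in ('>>> ', '... '))


# A made-up console session: prompts, a continuation line, and output lines.
sample = (
    ">>> a = 2\n"
    ">>> b = 3\n"
    ">>> for i in range(2):\n"
    "...     print(a + b + i)\n"
    "5\n"
    "6\n"
)

code = extract_console_code(sample)
print(code)
```

The two output lines (5 and 6) are dropped, and the continuation line keeps its four-space indentation, so the result can be executed as-is.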

Python Regex Error--returns either Nonetype or wrong part of string

I couldn't quite find a similar question on here (or don't know Python well enough to figure it out from other questions), so here goes.
I'm trying to extract part of a string with re.search().start (I've also tried end()), and that line either seems to find something (but a few spaces off) or it returns None, which is baffling me. For example:
def getlsuscore(line):
    print(line)
    start=re.search(' - [0-9]', line).start()+2
    score=line[start:start+3]
    print(score)
    score=int(score.strip())
    return(score)
The two prints are in there for troubleshooting purposes. The first one prints out:
02:24 LSU 62 - 80 EDDLESTONE,BRANDON SUB IN. SHORTESS,HENRY SUB IN. ROBINSON III,ELBERT SUB OUT. QUARTERMAN,TIM SUB OUT.
Exactly as I expect it to. For the record, I'm trying to extract the 80 in that line and force it to an int. I've tried with various things in the regex match, always including the hyphen, and accordingly different numbers at the end to get me to the correct starting point, and I've tried playing with this in many other ways and still haven't got it to work. As for the print(score), I either get "AttributeError: 'NoneType' object has no attribute 'start'" when I have the start()+whatever correct, or if I change it to something wrong just to try it out, I get something like "ValueError: invalid literal for int() with base 10: '-'" or "ValueError: invalid literal for int() with base 10: '- 8'", with no addition or +1, respectively. So why when I put +2 or +3 at the end of start() does it give me an error? What am I messing up here?
Thanks for the help, I'm a noob at Python so if there's another/better way to do this that isn't regex, that works as well. I've just been using this exact same thing quite a bit on this project and had no problems, so I'm a bit stumped.
Edit: More code/context
def getprevlsuscore(file, time):
    realline=''
    for line in file:
        line=line.rstrip()
        if line[0:4]==time:
            break
        if re.search('SUB IN', line):
            if not re.search('LSU', line[:9]):
                realline=line
    return(getlsuscore(realline))
It only throws the error when called in this block of code, and it's reading from a text file that has the play by play of a basketball game. Several hundred lines long, formatted like the line above, and it only throws an error towards the end of the file (I've tried on a couple different games).
The above function is called by this one:
def plusminus(file, list):
    for player in list:
        for line in file:
            line=line.rstrip()
            if not re.search('SUB IN', line):
                continue
            if not re.search('LSU', line):
                continue
            if not re.search(player.name, line):
                continue
            lsuscore=getlsuscore(line)
            previouslsuscore=getprevlsuscore(file, line[0:4])
            oppscore=getoppscore(line)
            previousoppscore=getprevoppscore(file, line[0:4])
            print(lsuscore)
            print(previouslsuscore)
            print(oppscore)
            print(previousoppscore)
Obviously not finished, the prints are to check the numbers. The scope of the project is that I'm trying to read a txt file copy/paste of a play by play and create a plus/minus for each player, showing the point differentials for the time they've played (e.g. if player X was in for 5 minutes, and his school scored 15 while the other school scored 5 in that time, he'd be +10).

I think a much easier way of extracting the scores from that string, with no regex involved, is to use the split() method. Called without arguments, it splits the input string on any whitespace and returns a list of the substrings.
def getlsuscore(line):
    # Example line: 02:24 LSU 62 - 80 ...
    splitResults = line.split()
    # Index 0 holds the time,
    # Index 1 holds the team name,
    # Index 2 holds the first score (as a string still),
    # Index 3 holds the separating dash,
    # Index 4 holds the second score (as a string still),
    # And further indexes hold everything else
    firstScore = int(splitResults[2])
    secondScore = int(splitResults[4])
    print(firstScore, secondScore)
    return firstScore

You could try something like this:
m = re.search('(\\d+)\\s*-\\s*(\\d+)', line)
s1 = int(m.group(1))
s2 = int(m.group(2))
print(s1, s2)
This just looks for two numbers separated by a hyphen, then decodes them into s1 and s2, after which you can do what you like with them. In practice, you should check m to make sure it isn't None, which would indicate a failed search.

Use a group to extract the number instead of resorting to fiddling with the start index of the match:
>>> import re
>>> line="02:24 LSU 62 - 80 EDDLESTONE,BRANDON SUB IN. blah blah..."
>>> int(re.search(r' - (\d+)', line).group(1))
80
>>>
If you get an error like AttributeError: 'NoneType' object has no attribute 'group', that means that the line you are working on doesn't have the " - (\d+)" sequence in it. For instance, maybe it's an empty line. You can catch the problem with a try/except block. Then you have to decide whether this is a bad error or not. If you are absolutely positive that all lines follow your rules, then maybe it's a fatal error and you should exit, warning the user that the data is bad. Or, if you are more relaxed about the data, ignore it and continue.
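Putting the advice from the answers above together (capture the numbers with groups and check for a failed match), a minimal sketch might look like the following. The names SCORE_RE and parse_scores are hypothetical, and returning None for a non-matching line is just one possible policy.

```python
import re

# Two integers separated by a hyphen, with optional whitespace around it.
SCORE_RE = re.compile(r'(\d+)\s*-\s*(\d+)')


def parse_scores(line):
    """Return (first_score, second_score), or None if no score is found."""
    m = SCORE_RE.search(line)
    if m is None:
        # e.g. an empty line or a line without a "NN - NN" pair
        return None
    return int(m.group(1)), int(m.group(2))


line = "02:24 LSU 62 - 80 EDDLESTONE,BRANDON SUB IN."
print(parse_scores(line))  # (62, 80)
print(parse_scores(""))    # None
```

A caller can then decide what a None result means: skip the line, or stop and report bad data.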

Python: 2.6 and 3.1 string matching inconsistencies

I wrote my module in Python 3.1.2, but now I have to validate it for 2.6.4.
I'm not going to post all my code since it may cause confusion.
Brief explanation:
I'm writing a XML parser (my first interaction with XML) that creates objects from the XML file. There are a lot of objects, so I have a 'unit test' that manually scans the XML and tries to find a matching object. It will print out anything that doesn't have a match.
I open the XML file and use a simple 'for' loop to read line-by-line through the file. If I match a regular expression for an 'application' (XML has different 'application' nodes), then I add it to my dictionary, d, as the key. I perform a lxml.etree.xpath() query on the title and store it as the value.
After I go through the whole thing, I iterate through my dictionary, d, and try to match the key to my value (I have to use the get() method from my 'application' class). Any time a mismatch is found, I print the key and title.
Python 3.1.2 has all matching items in the dictionary, so nothing is printed. In 2.6.4, every single value is printed (~600) in all. I can't figure out why my string comparisons aren't working.
Without further ado, here's the relevant code:
for i in d:
    if i[1:-2] != d[i].get('id'):
        print('X%sX Y%sY' % (i[1:-3], d[i].get('id')))
I slice the strings because the strings are different. Where the key would be "9626-2008olympics_Prod-SH"\n the value would be 9626-2008olympics_Prod-SH, so I have to cut the quotes and newline. I also added the Xs and Ys to the print statements to make sure that there wasn't any kind of whitespace issues.
Here is an example line of output:
X9626-2008olympics_Prod-SHX Y9626-2008olympics_Prod-SHY
Remember to ignore the Xs and Ys. Those strings are identical. I don't understand why Python2 can't match them.
Edit:
So the problem seems to be the way that I am slicing.
In Python3,
if i[1:-2] != d[i].get('id'):
this comparison works fine.
In Python2,
if i[1:-3] != d[i].get('id'):
I have to change the offset by one.
Why would strings need different offsets? The only possible thing that I can think of is that Python2 treats a newline as two characters (i.e. '\' + 'n').
Edit 2:
Updated with requested repr() information.
I added a small amount of code to produce the repr() info from the "2008olympics" example above. I have not done any slicing. It actually looks like it might not be a unicode issue. There is now a "\r" character.
Python2:
'"9626-2008olympics_Prod-SH"\r\n'
'9626-2008olympics_Prod-SH'
Python3:
'"9626-2008olympics_Prod-SH"\n'
'9626-2008olympics_Prod-SH'
Looks like this file was created/modified on Windows. Is there a way in Python2 to automatically suppress '\r'?

You are printing i[1:-3] but comparing i[1:-2] in the loop.
Very Important Question
Why are you writing code to parse XML when lxml will do all that for you? The point of unit tests is to test your code, not to ensure that the libraries you are using work!

Russell Borogrove is right.
Python 3 defaults to unicode, and the newline character is correctly interpreted as one character. That's why my offset of [1:-2] worked in 3 because I needed to eliminate three characters: ", ", and \n.
In Python 2, the newline is being interpreted as two characters, meaning I have to eliminate four characters and use [1:-3].
I just added a manual check for the Python major version.
Here is the fixed code:
for i in d:
    # The keys in D contain quotes and a newline which need
    # to be removed. In v3, newline = 1 char and in v2,
    # newline = 2 char.
    if sys.version_info[0] < 3:
        if i[1:-3] != d[i].get('id'):
            print('%s %s' % (i[1:-3], d[i].get('id')))
    else:
        if i[1:-2] != d[i].get('id'):
            print('%s %s' % (i[1:-2], d[i].get('id')))
Thanks for the responses everyone! I appreciate your help.

repr() and %r format are your friends ... they show you (for basic types like str/unicode/bytes) exactly what you've got, including type.
Instead of
print('X%sX Y%sY' % (i[1:-3], d[i].get('id')))
do
print('%r %r' % (i, d[i].get('id')))
Note leaving off the [1:-3] so that you can see what is in i before you slice it.
Update after comment "You are perfectly right about comparing the wrong slice. However, once I change it, python2.6 works, but python3 has the problem now (i.e. it doesn't match any objects)":
How are you opening the file (two answers please, for Python 2 and 3). Are you running on Windows? Have you tried getting the repr() as I suggested?
Update after actual input finally provided by OP:
If, as it appears, your input file was created on Windows (lines are separated by "\r\n"), you can read Windows and *x text files portably by using the "universal newlines" option: open('datafile.txt', 'rU') on Python2. Universal newlines mode is the default in Python3. Note that the Python3 docs say that you can use 'rU' also in Python3; this would save you having to test which Python version you are using.

I don't understand what you're doing exactly, but would you try using strip() instead of slicing and see whether it helps?
for i in d:
    stripped = i.strip()
    if stripped != d[i].get('id'):
        print('X%sX Y%sY' % (stripped, d[i].get('id')))
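The two suggestions above (universal newlines and stripping instead of slicing) can be tried out with a short Python 3 sketch. Assumptions: the file content is the sample key from the question written with a Windows line ending, clean_key is a hypothetical helper, and the plain 'r' mode is used because the 'U' flag was removed from open() in Python 3.11.

```python
import os
import tempfile

# Write one line with a Windows line ending, in binary so nothing is translated.
with tempfile.NamedTemporaryFile('wb', suffix='.txt', delete=False) as f:
    f.write(b'"9626-2008olympics_Prod-SH"\r\n')
    path = f.name

# Default text mode (newline=None): "\r\n" is translated to "\n" on read.
with open(path, 'r') as fh:
    translated = fh.readline()

# newline='' disables translation, which is what the raw file looks like.
with open(path, 'r', newline='') as fh:
    raw = fh.readline()

os.remove(path)


def clean_key(line):
    """Drop surrounding whitespace (including \\r\\n) and the quotes."""
    return line.strip().strip('"')


print(repr(translated))       # '"9626-2008olympics_Prod-SH"\n'
print(repr(raw))              # '"9626-2008olympics_Prod-SH"\r\n'
print(clean_key(raw))         # 9626-2008olympics_Prod-SH
```

Because clean_key does not depend on how many characters the line ending has, the same comparison works whichever way the file was opened.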
