docstring blocks elif statement - python

Let me paste the exact code I have:
This is the short module
class SentenceSplitter:
    def __init__(self, filename=None):
        self._raw_text = self.raw_text(filename)
        self._sentences = self.to_sentences()
    def raw_text(self, filename):
        text = ''
        with open(filename, 'r') as file:
            for line in file.readlines():
                line = line.strip()
                text += ''.join(line.replace(line, line+' '))
            file.close()
        text = text.strip() # Deal with the last whitespace
        return text
    def to_sentences(self):
        """ Sentence boundaries occur at '.', '!', '?' except that,
        there are some not-sentence boundaries that
        may occur before/after the period.
        """
        raw_text = self._raw_text
        sentences = []
        sentence = ''
        boundary = None
        for char in raw_text:
            sentence += ''.join(char)
            if char == '!' or char == '?':
                sentences.append(sentence)
                sentence = ''
            """ The sign -> refers to 'followed by' """
            elif char == '.':
                i = raw_text.index(char) # slicing previous/following characters
                boundary = True
            if boundary:
                sentences.append(sentence)
                sentence = ''
        return sentences
And the main:
import textchange
ss = textchange.SentenceSplitter(filename='text.txt')
print(ss._sentences)
The docstring after the first if statement
""" The sign -> refers to 'followed by' """
If I comment it out, the program runs; otherwise it does not.
There was more code in the elif block, but I removed it after making sure the error still occurs. Here is the traceback:
Traceback (most recent call last):
  File "D:\Programs\Python 3.3.2\Tutorials\46 Simple Python Exercises.py", line 26, in <module>
    import textchange
  File "D:\Programs\Python 3.3.2\Tutorials\textchange.py", line 51
    elif char == '.':
       ^
SyntaxError: invalid syntax

Docstrings are just string literals that are found at the start of the function. They still have to follow indentation rules.
Your string is not correctly indented for the elif block; by being de-dented from the if block before, you ended the if-elif-else blocks altogether and no elif is permitted to follow.
Use a regular, normal comment instead, a line starting with #; lines that contain only comments are exempt from the indentation rules:
if char == '!' or char == '?':
    sentences.append(sentence)
    sentence = ''
# The sign -> refers to 'followed by'
elif char == '.':
    i = raw_text.index(char) # slicing previous/following characters
    boundary = True
Alternatively, indent the string so that it belongs to the elif block (Python still evaluates it as an expression statement, but the value is not assigned and is simply discarded):
if char == '!' or char == '?':
    sentences.append(sentence)
    sentence = ''
elif char == '.':
    """ The sign -> refers to 'followed by' """
    i = raw_text.index(char) # slicing previous/following characters
    boundary = True
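The difference between the three layouts can be checked without running the full module. The following is a minimal sketch that feeds small source strings to the built-in compile(); the helper name compiles and the snippet contents are made up for illustration.

```python
def compiles(source):
    """Return True if `source` parses as Python, False on a SyntaxError."""
    try:
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:
        return False

# A string literal de-dented to the level of the `if` ends the if block,
# so the following `elif` has nothing to attach to.
WITH_DEDENTED_STRING = (
    "if x:\n"
    "    y = 1\n"
    '""" stray string literal """\n'
    "elif z:\n"
    "    y = 2\n"
)

# A comment-only line is ignored by the indentation rules.
WITH_COMMENT = (
    "if x:\n"
    "    y = 1\n"
    "# just a comment\n"
    "elif z:\n"
    "    y = 2\n"
)

# A string literal indented into the elif block is an ordinary statement.
WITH_INDENTED_STRING = (
    "if x:\n"
    "    y = 1\n"
    "elif z:\n"
    '    """ string inside the block """\n'
    "    y = 2\n"
)

for name, src in [("de-dented string", WITH_DEDENTED_STRING),
                  ("comment", WITH_COMMENT),
                  ("indented string", WITH_INDENTED_STRING)]:
    print(name, "->", "ok" if compiles(src) else "SyntaxError")
```

Only the first snippet is rejected; the names x, y and z are never looked up because the code is compiled but not executed.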

Related

How to check if a char in a parsed line is part of a string variable in python

I am trying to implement in python a function that checks if the '#' symbol inside a parsed line is part of a string variable.
def comment_part_of_string(line,comment_idx):
"""
:param line: stripped line that has '#' symbol
comment_idx: index of '#' symbol in line
:return: return True when the '#' symbol is inside a string variable
"""
For example, I want the function to return True for:
> line="peace'and#much'love"
> comment_idx=line.find('#')
and False for:
> line="peace#love"
> comment_idx=line.find('#')
How can I check if a char in a parsed line is part of a string variable?
Edit: I tried this and it also worked:
def comment_part_of_string(line, comment_idx):
    """
    :param comment_idx: index of '#' symbol in line
    :param line: stripped line that has '#' symbol
    :return: return True when the '#' symbol is inside a string variable
    """
    if ((line[:comment_idx].count(b"\'") % 2 == 1 and line[comment_idx:].count(b"\'") % 2 == 1)
            or (line[:comment_idx].count(b"\"") % 2 == 1 and line[comment_idx:].count(b"\"") % 2 == 1)):
        return True
    return False
You can do it by counting the single quotes (') before the # symbol. If the count is even, the # is outside a string literal; if it is odd, the # is inside one. Do it like so:
def comment_part_of_string(line, comment_idx):
    """
    :param line: stripped line that has '#' symbol
           comment_idx: index of '#' symbol in line
    :return: return True when the '#' symbol is inside a string variable
    """
    # count the single quotes that appear before the '#'
    count = line[:comment_idx].count("'")
    return count % 2 == 1
Hope this helps :)
I think this should work:
def iscomment(line):
    line = line.split(" ")
    for i in line:
        if "#" in i:
            if '"' in i or "'" in i:
                return True
    return False
It splits the line on spaces, then goes through the parts; if a part contains both a # and a ' or ", it returns True.
This can be solved using a regex.
Note: strings can be delimited by ' or ", so both have to be considered.
import re
def comment_part_of_string(line):
    pattern = r'\'.*#.*\'|\".*#.*\"'
    if re.findall(pattern, line):
        return True
    return False
Output:
>>> comment_part_of_string("peace'and#much'love")
True
>>> comment_part_of_string("peace#love")
False
>>> comment_part_of_string('peace"and#much"love')
True
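The counting and regex approaches above can be misled by an apostrophe inside a double-quoted string, or by an escaped quote. A small state scanner avoids that. This is only a sketch under simplifying assumptions: single-line input, backslash escapes, and no triple-quoted strings; it reuses the function name from the question.

```python
def comment_part_of_string(line, comment_idx):
    """Return True if line[comment_idx] lies inside a quoted string."""
    open_quote = None  # quote character that opened the current string, if any
    i = 0
    while i < comment_idx:
        ch = line[i]
        if open_quote is None:
            if ch in "'\"":
                open_quote = ch      # a string starts here
        elif ch == '\\':
            i += 1                   # skip the escaped character
        elif ch == open_quote:
            open_quote = None        # the string ends here
        i += 1
    return open_quote is not None


for sample in ["peace'and#much'love", "peace#love", '"it\'s" # note']:
    print(sample, comment_part_of_string(sample, sample.find('#')))
```

The third sample is the interesting one: it contains a single apostrophe before the #, so an odd/even count of single quotes would report True, while the scanner sees that the double-quoted string has already been closed.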

How can I capitalize the first letter of a string in Python, ignoring HTML tags?

I would like to capitalize the first letter of a string, ignoring HTML tags. For instance:
hello world
should become:
Hello world
I wrote the following, which works, but it seems inefficient, since every character of the string is being copied to the output. Is there a better way to do it?
#register.filter
def capinit(value):
    gotOne = False
    inTag = False
    outValue = ''
    for c in value:
        cc = c
        if c == '<':
            inTag = True
        if c == '>':
            inTag = False
        if not inTag:
            if c.isalpha() or c.isdigit():
                if not gotOne:
                    cc = c.upper()
                gotOne = True
        outValue = outValue + cc
    return outValue
Note that this ignores initial punctuation. It will capitalize the first letter it finds, unless it finds a number first in which case it doesn't capitalize anything.
I tried to do what you wanted:
html = 'hello world'
afterletter = None
dontcapital = 0
afterhtml = ""
for character in html:
    if character == "/" and afterletter == "<":
        afterhtml += character
        dontcapital = 1
    elif afterletter == ">":
        if dontcapital == 0:
            afterhtml += character.upper()
        else:
            afterhtml += character
            dontcapital = 0
    else:
        afterhtml += character
    afterletter = character
print(afterhtml)
# afterhtml is the output!
This worked in all the tests I did.
If anyone wants to improve it, feel free.
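On the efficiency concern in the question: one option is to stop scanning at the first letter or digit outside a tag and build the result with a single slice, rather than appending every character. The sketch below assumes the same simple tag model as the question (anything between < and > is a tag); the function name capitalize_first_outside_tags and the sample markup are made up for illustration.

```python
def capitalize_first_outside_tags(value):
    """Upper-case the first letter outside <...> tags.

    Like the original loop, nothing is changed if a digit comes first.
    """
    in_tag = False
    for i, c in enumerate(value):
        if c == '<':
            in_tag = True
        elif c == '>':
            in_tag = False
        elif not in_tag and (c.isalpha() or c.isdigit()):
            # First significant character found: rebuild the string once.
            return value[:i] + c.upper() + value[i + 1:]
    return value  # no letter or digit outside tags


print(capitalize_first_outside_tags('<b>hello</b> world'))
print(capitalize_first_outside_tags('<p>1st place</p>'))
```

The scan still visits characters up to the first hit, but the rest of the string is copied in one slice operation instead of one concatenation per character.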

Bracket balancing algorithm doesn't detect imbalanced brackets

The code takes in any combination of brackets and checks if they are balanced or not. If they are balanced it should output success; if they aren't balanced it should output the index (starting at index 1) where the brackets are not balanced.
Example:
Input: ())
Output: 3
\\
Input: ()
Output: Success
However, the code always displays "Success", whether or not the brackets are balanced.
This is what I get instead:
Input: ())
Output: Success
import sys
def Match(self, c):
    if self == '[' and c == ']':
        return True
    if self == '{' and c == '}':
        return True
    if self == '(' and c == ')':
        return True
    else:
        return False
if __name__ == "__main__":
    text = sys.stdin.read()
    char_code = 0
    opening_brackets_stack = []
    for i, next in enumerate(text):
        if next == '(' or next == '[' or next == '{':
            char_code += 1
            opening_brackets_stack.append(next)
            stack_pop = opening_brackets_stack.pop()
        if next == ')' or next == ']' or next == '}':
            char_code += 1
            if not Match(stack_pop, next):
                print(char_code)
        else:
            char_code += 1
    print ('Success')
Your code prints "Success" because you've told it to always print "Success" once it finishes:
if __name__ == "__main__":
    # A bunch of stuff unrelated to program flow...
    print ('Success')
You probably only want "Success" if you've reached the end of your text with nothing left on the stack.
if __name__ == "__main__":
    text = sys.stdin.read()
    char_code = 0
    opening_brackets_stack = []
    for i, next in enumerate(text):
        if next == '(' or next == '[' or next == '{':
            char_code += 1
            opening_brackets_stack.append(next)
            stack_pop = opening_brackets_stack.pop()
        if next == ')' or next == ']' or next == '}':
            char_code += 1
            if not Match(stack_pop, next):
                print(char_code)
        else:
            char_code += 1
    if not opening_brackets_stack:  # <-- new line
        print ('Success')
Except this won't solve your problem either, since you've never properly checked if you have an unmatched closing bracket, only an unmatched opening bracket. Consider this, instead:
import sys

text = sys.stdin.read()
# this will let us check for an expected closing bracket more easily
opening_brackets = "([{"
closing_brackets = ")]}"
mapping = dict(zip(opening_brackets, closing_brackets))
stack = []
for i, ch in enumerate(text):
    if ch in opening_brackets:
        # throw the closing bracket on the stack
        matching_closer = mapping[ch]
        stack.append(matching_closer)
    elif stack and ch == stack[-1]:
        # if the character closes the last-opened bracket
        # (the `stack and` guard avoids an IndexError on an empty stack)
        stack.pop()  # pop it off
    elif ch in closing_brackets:
        # this is an unmatched closing bracket, making the brackets
        # imbalanced in this expression
        print("FAILED")
        sys.exit(1)  # closes the program immediately with a retcode of 1
    else:
        # not a bracket, continue as normal
        # this is technically a NOP and everything from the `else` can be
        # omitted, but I think this looks more obvious to the reader.
        continue
if not stack:  # empty stack means matched brackets!
    print("SUCCESS")
else:
    print("FAILED")
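The question asks for the 1-based index of the offending bracket rather than a plain FAILED. One way to get it is to keep each opening bracket's position on the stack alongside the character. The following is a sketch of that idea, not the answerer's code; the function name first_mismatch and the PAIRS table are illustrative.

```python
PAIRS = {')': '(', ']': '[', '}': '{'}  # closing bracket -> expected opener


def first_mismatch(text):
    """Return the 1-based index of the first problem bracket, or "Success"."""
    stack = []  # entries are (opening bracket, 1-based position)
    for pos, ch in enumerate(text, start=1):
        if ch in "([{":
            stack.append((ch, pos))
        elif ch in PAIRS:
            # A closer with no opener, or with the wrong opener, fails here.
            if not stack or stack.pop()[0] != PAIRS[ch]:
                return pos
    # Any opener still on the stack is unmatched; report the earliest one.
    return stack[0][1] if stack else "Success"


for sample in ["())", "()", "{}([]", "]()", "()[}"]:
    print(repr(sample), "->", first_mismatch(sample))
```

Returning the value instead of printing inside the loop keeps the reporting in one place, so "Success" can only be produced when no mismatch was found.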
Code can contain any brackets from the set []{}(), where the opening brackets are [,{, and ( and the closing brackets corresponding to them are ],}, and ).
For convenience, the text editor should not only inform the user that there is an error in the usage of brackets, but also point to the exact place in the code with the problematic bracket. First priority is to find the first unmatched closing bracket which either doesn’t have an opening bracket before it, like ] in ](), or closes the wrong opening bracket, like } in ()[}. If there are no such mistakes, then it should find the first unmatched opening bracket without the corresponding closing bracket after it, like ( in {}([]. If there are no mistakes, text editor should inform the user that the usage of brackets is correct.
Apart from the brackets, code can contain big and small latin letters, digits and punctuation marks.
More formally, all brackets in the code should be divided into pairs of matching brackets, such that in each pair the opening bracket goes before the closing bracket, and for any two pairs of brackets either one of them is nested inside another one as in (foo[bar]) or they are separate as in f(a,b)-g[c]. The bracket [ corresponds to the bracket ], { corresponds to }, and ( corresponds to ).
# python3
from collections import namedtuple
Bracket = namedtuple("Bracket", ["char", "position"])
def are_matching(left, right):
    return (left + right) in ["()", "[]", "{}"]
def find_mismatch(text):
    opening_brackets_stack = []
    mismatch_pos = None
    for i, next in enumerate(text):
        if next in "([{":
            # Process opening bracket, write your code here
            opening_brackets_stack.append(next)
            if len(opening_brackets_stack) < 2:
                mismatch_pos = Bracket(next, i + 1).position
        if next in ")]}":
            # Process closing bracket, write your code here
            if len(opening_brackets_stack) == 0:
                return Bracket(next, i + 1).position
            top = opening_brackets_stack.pop()
            if not are_matching(top, next):
                return Bracket(next, i + 1).position
    if len(opening_brackets_stack) == 0:
        return "Success"
    return mismatch_pos
def main():
    text = input()
    mismatch = find_mismatch(text)
    # Printing answer, write your code here
    print(mismatch)
if __name__ == "__main__":
    main()

Splitting bracket delimited text which can contain quoted strings

I am trying to split some text. Basically I want to separate level-1 brackets, like "('1','a',NULL),(2,'b')" => ["('1','a',NULL)", "(2,'b')"], but I need to be aware of possible quoted strings inside. It needs to at least satisfy the following py.tests:
from splitter import split_text
def test_normal():
    assert split_text("('1'),('2')") == ["('1')", "('2')"]
    assert split_text("(1),(2),(3)") == ["(1)", "(2)", "(3)"]
def test_complex():
    assert split_text("('1','a'),('2','b')") == ["('1','a')", "('2','b')"]
    assert split_text("('1','a',NULL),(2,'b')") == ["('1','a',NULL)", "(2,'b')"]
def test_apostrophe():
    assert split_text("('\\'1','a'),('2','b')") == ["('\\'1','a')", "('2','b')"]
def test_coma_in_string():
    assert split_text("('1','a,c'),('2','b')") == ["('1','a,c')", "('2','b')"]
def test_bracket_in_string():
    assert split_text("('1','a)c'),('2','b')") == ["('1','a)c')", "('2','b')"]
def test_bracket_and_coma_in_string():
    assert split_text("('1','a),(c'),('2','b')") == ["('1','a),(c')", "('2','b')"]
def test_bracket_and_coma_in_string_apostrophe():
    assert split_text("('1','a\\'),(c'),('2','b')") == ["('1','a\\'),(c')", "('2','b')"]
I have tried the following:
1) Regular expressions
This looks like the best solution, but unfortunately I did not come up with anything satisfying all tests.
My best try is:
import re

def split_text(text):
    return re.split(r'(?<=\)),(?=\()', text)
But obviously, that is rather simplistic and fails test_bracket_and_coma_in_string and test_bracket_and_coma_in_string_apostrophe.
2) Finite-state-machine-like solution
I tried to code the FSM myself:
OUTSIDE, IN_BRACKETS, IN_STRING, AFTER_BACKSLASH = range(4)
def split_text(text):
    state = OUTSIDE
    read = []
    result = []
    for character in text:
        if state == OUTSIDE:
            if character == ',':
                result.append(''.join(read))
                read = []
            elif character == '(':
                read.append(character)
                state = IN_BRACKETS
            else:
                read.append(character)
        elif state == IN_BRACKETS:
            read.append(character)
            if character == ')':
                state = OUTSIDE
            elif character == "'":
                state = IN_STRING
        elif state == IN_STRING:
            read.append(character)
            if character == "'":
                state = IN_BRACKETS
            elif character == '\\':
                state = AFTER_BACKSLASH
        elif state == AFTER_BACKSLASH:
            read.append(character)
            state = IN_STRING
    result.append(''.join(read))  # The rest of string
    return result
It works, passes all tests, but is very slow.
3) pyparsing
from pyparsing import QuotedString, ZeroOrMore, Literal, Group, Suppress, Word, nums
null_value = Literal('NULL')
number_value = Word(nums)
string_value = QuotedString("'", escChar='\\', unquoteResults=False)
value = null_value | number_value | string_value
one_bracket = Group(Literal('(') + value + ZeroOrMore(Literal(',') + value) + Literal(')'))
all_brackets = one_bracket + ZeroOrMore(Suppress(',') + one_bracket)
def split_text(text):
    parse_result = all_brackets.parseString(text)
    return [''.join(a) for a in parse_result]
Also passes all tests, but surprisingly it is even slower than solution #2.
Any ideas how to make the solution fast and robust? I have this feeling that I am missing something obvious.
One way would be to use the newer regex module which supports the (*SKIP)(*FAIL) functionality:
import regex as re
def split_text(text):
    rx = r"""'.*?(?<!\\)'(*SKIP)(*FAIL)|(?<=\)),(?=\()"""
    return re.split(rx, text)
Broken down it says:
'.*?(?<!\\)' # look for a single quote up to a new single quote
# that MUST NOT be escaped (thus the neg. lookbehind)
(*SKIP)(*FAIL)| # these parts shall fail
(?<=\)),(?=\() # your initial pattern with a positive lookbehind/ahead
This succeeds on all your examples.
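If installing the third-party regex module is not an option, a similar effect can be approximated with the standard re module: let one alternative consume whole quoted strings, let the other match the separator comma, and cut only at separator matches. This is a sketch of that idea, not the answerer's code; the pattern name TOKEN is illustrative, and escapes are assumed to use a backslash.

```python
import re

# Alternative 1 swallows a quoted string (including escaped characters).
# Alternative 2 is the separator: a comma between ")" and "(".
TOKEN = re.compile(r"""'(?:\\.|[^'\\])*'|(?<=\)),(?=\()""")


def split_text(text):
    parts, start = [], 0
    for match in TOKEN.finditer(text):
        if match.group() == ',':          # separator, not a quoted string
            parts.append(text[start:match.start()])
            start = match.end()
    parts.append(text[start:])            # the remainder after the last separator
    return parts


print(split_text("('1','a),(c'),('2','b')"))
print(split_text("('1','a\\'),(c'),('2','b')"))
```

Because finditer resumes after each quoted string it has consumed, a ),( sequence inside quotes is never examined as a possible separator, which plays the same role as (*SKIP)(*FAIL).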
I cooked this up and it works on the given tests.
tests = ["('1'),('2')",
"(1),(2),(3)",
"('1','a'),('2','b')",
"('1','a',NULL),(2,'b')",
"('\\'1','a'),('2','b')",
"('1','a,c'),('2','b')",
"('1','a)c'),('2','b')",
"('1','a),(c'),('2','b')",
"('1','a\\'),(c'),('2','b')"]
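for text in tests:
    tmp = ''
    res = []
    bracket = 0
    quote = False
    for idx, i in enumerate(text):
        if i == "'":
            # toggle the quote state unless the apostrophe is escaped
            if text[idx-1] != '\\':
                quote = not quote
            tmp += i
        elif quote:
            tmp += i
        elif i == ',':
            # keep commas inside brackets, drop the separators between them
            if bracket: tmp += i
            else: pass
        else:
            if i == '(': bracket += 1
            elif i == ')': bracket -= 1
            if bracket: tmp += i
            else:
                # bracket closed: finish the current group
                tmp += i
                res.append(tmp)
                tmp = ''
    print(res)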
Output:
["('1')", "('2')"]
['(1)', '(2)', '(3)']
["('1','a')", "('2','b')"]
["('1','a',NULL)", "(2,'b')"]
["('\\'1','a')", "('2','b')"]
["('1','a,c')", "('2','b')"]
["('1','a)c')", "('2','b')"]
["('1','a),(c')", "('2','b')"]
["('1','a\\'),(c')", "('2','b')"]
The code has room for improvement, and edits are welcome. :)
This is the regular expression that seems to work and passes all the tests. Running it on real data, it is about 6x faster than the finite-state machine implemented in Python.
PATTERN = re.compile(
r"""
\( # Opening bracket
(?:
# String
(?:'(?:
(?:\\')|[^'] # Either escaped apostrophe, or other character
)*'
)
|
# or other literal not containing right bracket
[^')]
)
(?:, # Zero or more of them separated with comma following the first one
# String
(?:'(?:
(?:\\')|[^'] # Either escaped apostrophe, or other character
)*'
)
|
# or other literal
[^')]
)*
\) # Closing bracket
""",
re.VERBOSE)
def split_text(text):
    return PATTERN.findall(text)

Writing Sebesta's lexical analyzer in Python. Does not work for last lexeme in the input file

I have to translate the lexical analyzer code in Sebesta's Concepts of Programming Languages (chapter 4, section 2) to Python. Here's what I have so far:
# Character classes #
LETTER = 0
DIGIT = 1
UNKNOWN = 99
# Token Codes #
INT_LIT = 10
IDENT = 11
ASSIGN_OP = 20
ADD_OP= 21
SUB_OP = 22
MULT_OP = 23
DIV_OP = 24
LEFT_PAREN = 25
RIGHT_PAREN = 26
charClass = ''
lexeme = ''
lexLen = 0
token = ''
nextToken = ''
### lookup - function to lookup operators and parentheses ###
### and return the token ###
def lookup(ch):
    def left_paren():
        addChar()
        globals()['nextToken'] = LEFT_PAREN
    def right_paren():
        addChar()
        globals()['nextToken'] = RIGHT_PAREN
    def add():
        addChar()
        globals()['nextToken'] = ADD_OP
    def subtract():
        addChar()
        globals()['nextToken'] = SUB_OP
    def multiply():
        addChar()
        globals()['nextToken'] = MULT_OP
    def divide():
        addChar()
        globals()['nextToken'] = DIV_OP
    options = {')': right_paren, '(': left_paren, '+': add,
               '-': subtract, '*': multiply , '/': divide}
    if ch in options.keys():
        options[ch]()
    else:
        addChar()
### addchar- a function to add next char to lexeme ###
def addChar():
    #lexeme = globals()['lexeme']
    if(len(globals()['lexeme']) <=98):
        globals()['lexeme'] += nextChar
    else:
        print("Error. Lexeme is too long")
### getChar- a function to get the next Character of input and determine its character class ###
def getChar():
    globals()['nextChar'] = globals()['contents'][0]
    if nextChar.isalpha():
        globals()['charClass'] = LETTER
    elif nextChar.isdigit():
        globals()['charClass'] = DIGIT
    else:
        globals()['charClass'] = UNKNOWN
    globals()['contents'] = globals()['contents'][1:]
## getNonBlank() - function to call getChar() until it returns a non whitespace character ##
def getNonBlank():
    while nextChar.isspace():
        getChar()
## lex- simple lexical analyzer for arithmetic functions ##
def lex():
    globals()['lexLen'] = 0
    getNonBlank()
    def letterfunc():
        globals()['lexeme'] = ''
        addChar()
        getChar()
        while(globals()['charClass'] == LETTER or globals()['charClass'] == DIGIT):
            addChar()
            getChar()
        globals()['nextToken'] = IDENT
    def digitfunc():
        globals()['lexeme'] = ''
        addChar()
        getChar()
        while(globals()['charClass'] == DIGIT):
            addChar()
            getChar()
        globals()['nextToken'] = INT_LIT
    def unknownfunc():
        globals()['lexeme'] = ''
        lookup(nextChar)
        getChar()
    lexDict = {LETTER: letterfunc, DIGIT: digitfunc, UNKNOWN: unknownfunc}
    if charClass in lexDict.keys():
        lexDict[charClass]()
    print('The next token is: '+ str(globals()['nextToken']) + ' The next lexeme is: ' + globals()['lexeme'])
with open('input.txt') as input:
    contents = input.read()
getChar()
lex()
while contents:
    lex()
I'm using the string sum + 1 / 33 as my sample input. From what I understand, the first call to getChar() at the top level sets charClass to LETTER and contents to um + 1 / 33.
The program then enters the while loop and calls lex(). lex() in turn accumulates the word sum into lexeme. When the while loop inside letterfunc encounters the first whitespace character, it breaks, exiting lex().
Since contents is not empty, the program goes through the while loop at the top level again. This time, the getNonBlank() call inside lex() throws out the spaces in contents, and the same process as before is repeated.
The error occurs at the last lexeme: I'm told that globals()['contents'][0] is out of range when called by getChar(). I don't expect it to be a difficult error to find, but I've tried tracing it by hand and can't seem to spot the problem. Any help would be greatly appreciated.
This happens because, after successfully reading the final 3 of the input string, digitfunc calls getChar one more time. At that point contents has been exhausted and is empty, so contents[0] reads past the end of the buffer, hence the error.
As a workaround, if you add a newline or a space after the last character of the expression, your current code does not exhibit the problem.
The reason is that when the last character is UNKNOWN, you return from lex immediately and exit the loop because contents is empty; but when you are processing a number or an identifier, you keep calling getChar in a loop without testing for end of input. By the way, if your input string ends with a right parenthesis, your lexer consumes it and forgets to display that it found it.
So you should at least:
test for end of input in getChar:
def getChar():
    if len(contents) == 0:
        # print "END OF INPUT DETECTED"
        globals()['charClass'] = UNKNOWN
        globals()['nextChar'] = ''
        return
    ...
display the last token if one is left:
...
while contents:
    lex()
lex()
check that a lexeme is present (weird things may happen at end of input):
...
if charClass in lexDict.keys():
    lexDict[charClass]()
if lexeme != '':
    print('The next token is: '+ str(globals()['nextToken']) +
          ' The next lexeme is: >' + globals()['lexeme'] + '<')
But your usage of globals() is bad. The common idiom for assigning to a global from within a function is to declare it before use:
a = 5
def setA(val):
    global a
    a = val # sets the global variable a
But globals in Python are a code smell. The best you could do is to properly encapsulate your parser in a class. Objects are better than globals.
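As a rough illustration of the class-based approach, the lexer state (the input text and the current position) can live on an object, and end of input becomes a simple bounds check. This is a minimal sketch, not a translation of the book's code: the class name Lexer, its method names, and the UNKNOWN_TOKEN code are made up for illustration, while the numeric token codes are the ones from the question.

```python
INT_LIT, IDENT = 10, 11
OPERATORS = {'+': 21, '-': 22, '*': 23, '/': 24, '(': 25, ')': 26}
UNKNOWN_TOKEN = -1  # hypothetical code for characters the lexer does not know


class Lexer:
    def __init__(self, text):
        self.text = text
        self.pos = 0  # index of the next unread character

    def _take_while(self, predicate):
        """Consume characters while predicate holds; stops safely at end of input."""
        start = self.pos
        while self.pos < len(self.text) and predicate(self.text[self.pos]):
            self.pos += 1
        return self.text[start:self.pos]

    def tokens(self):
        """Yield (token code, lexeme) pairs until the input is exhausted."""
        while True:
            self._take_while(str.isspace)           # skip blanks
            if self.pos >= len(self.text):
                return                              # end of input
            ch = self.text[self.pos]
            if ch.isalpha():
                yield IDENT, self._take_while(
                    lambda c: c.isalpha() or c.isdigit())
            elif ch.isdigit():
                yield INT_LIT, self._take_while(str.isdigit)
            else:
                self.pos += 1
                yield OPERATORS.get(ch, UNKNOWN_TOKEN), ch


for token, lexeme in Lexer("sum + 1 / 33").tokens():
    print('The next token is:', token, 'The next lexeme is:', lexeme)
```

Because every read goes through a check against len(self.text), the last lexeme needs no special handling, and a trailing right parenthesis is reported like any other token.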
