Use readlines() with indices or parse lines on the fly? - python

I'm making a simple test function that asserts that the output from an interpreter I'm developing is correct, by reading the expression to evaluate and the expected result from a file, much like Python's doctest. The interpreter is for Scheme, so an example of an input file would be
> 42
42
> (+ 1 2 3)
6
My first attempt for a function that can parse such a file looks like the following, and it seems to work as expected:
import os

def run_test(filename):
    interp = Interpreter()
    response_next = False
    num_tests = 0
    with open(filename) as f:
        for line in f:
            if response_next:
                assert response == line.rstrip('\n')
                response_next = False
            elif line.startswith('> '):
                num_tests += 1
                response = interp.eval(line[2:])
                response = str(response) if response else ''
                response_next = True
    print "{:20} Ran {} tests successfully".format(os.path.basename(filename),
                                                   num_tests)
I wanted to improve it slightly by removing the response_next flag, as I am not a fan of such flags, and instead reading the next line within the elif block with next(f). I had a small unrelated question about that which I asked on IRC at Freenode. I got the help I wanted, but I was also given the suggestion to use f.readlines() instead and then index into the resulting list. (I was also told that I could use itertools.groupby() for the pairwise lines, but I'll investigate that approach later.)
Now to the question: I was very curious why that approach would be better, but my Internet connection on the train was flaky and I was unable to ask, so I'll ask here instead. Why would it be better to read everything with readlines() instead of parsing each line on the fly as it is read?
I'm really wondering because my feeling is the opposite: it seems cleaner to parse the lines one at a time, so that everything is finished in one pass. I usually avoid indexing into lists in Python and prefer to work with iterators and generators. Maybe it is impossible to guess what the person was thinking if it was a subjective opinion, but if there is some general recommendation I'd be happy to hear about it.

It's certainly more Pythonic to process input iteratively rather than reading the whole input at once; for example, this will work if the input is a console.
An argument in favour of reading everything into a list and indexing is that using next(f) can be unclear when combined with a for loop; the options there are either to replace the for loop with while True, or to clearly document that you are calling next() on f within the loop:
try:
    while True:
        test = next(f)
        response = next(f)
except StopIteration:
    pass
As Jonas suggests, you could accomplish this (if you're sure that the input will always consist of alternating test/response lines) by zipping the input with itself:
for test, response in zip(f, f): # Python 3
for test, response in itertools.izip(f, f): # Python 2
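For example, a minimal sketch of the full loop (Python 3 shown; it assumes strictly alternating test/response lines and the Interpreter class from the question):
def run_test(filename):
    interp = Interpreter()  # from the question
    num_tests = 0
    with open(filename) as f:
        for test, response in zip(f, f):  # consecutive line pairs
            num_tests += 1
            result = interp.eval(test[2:])        # drop the '> ' prompt
            result = str(result) if result else ''
            assert result == response.rstrip('\n')
    print("Ran {} tests successfully".format(num_tests))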

Reading everything into an array gives you the equivalent of random access: You use an array index to move down the array, and at any time you can check what's next and back up if necessary.
If you can carry out your task without backing up, you don't need the random access and it would be cleaner to do without it. In your examples, it seems that your syntax is always a single-line (?) expression followed by the expected response. So, I'd have written a top-level loop that iterates once per expression-value pair, reading lines as necessary.
If you want to support multi-line expressions and results, you can write separate functions to read each one: one that reads a complete expression, and one that reads a result (up to the next blank line). The important thing is that they should be able to consume as much input as they need, and leave the input pointer in a reasonable state for the next read.
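A minimal sketch of that structure, assuming expressions are single '> '-prefixed lines and results run up to the next blank line, as this answer suggests; read_expression and read_result are hypothetical helpers, and Interpreter is the class from the question:
def read_expression(f):
    # hypothetical helper: return the next '> '-prefixed expression, or None at EOF
    for line in f:
        if line.startswith('> '):
            return line[2:].rstrip('\n')
    return None

def read_result(f):
    # hypothetical helper: collect result lines up to the next blank line
    lines = []
    for line in f:
        if not line.strip():
            break
        lines.append(line.rstrip('\n'))
    return '\n'.join(lines)

def run_test(filename):
    interp = Interpreter()  # from the question
    with open(filename) as f:
        while True:
            expression = read_expression(f)  # consumes as much input as it needs
            if expression is None:
                break
            assert str(interp.eval(expression)) == read_result(f)
This works because file objects are their own iterators, so each helper resumes reading where the previous one stopped.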

from itertools import ifilter, imap

def run_test(filename):
    interp = Interpreter()
    num_tests, num_passed, last_result = 0, 0, None
    with open(filename) as f:
        # iterate over non-blank lines
        for line in ifilter(None, imap(str.strip, f)):
            if line.startswith('> '):
                last_result = interp.eval(line[2:])
            else:
                num_tests += 1
                try:
                    assert line == repr(last_result), \
                        "expected {!r}, got {!r}".format(line, repr(last_result))
                except AssertionError, e:
                    print e
                else:
                    num_passed += 1
    print("Ran {} tests, {} passed".format(num_tests, num_passed))
... this simply assumes that any result-line refers to the preceding test.
I would avoid .readlines() unless you get some specific benefit from having the whole file available at once.
I also changed the comparison to look at the representation of the result, so it can distinguish between output types, i.e.
'6' + '2'
> '62'
60 + 2
> 62
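A quick interpreter session showing the difference:
>>> repr(62)
'62'
>>> repr('62')
"'62'"
>>> repr(62) == repr('62')
False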

Related

Evaluation of IF statement in Python

Please help me understand the evaluation of this script; it must be something simple I'm not seeing. I want to scan a text file and output the lines containing a specific string value.
So for example my txt file will contain simple stuff:
beep
boop
bop
Hey
beep
boop
bop
and the script is looking for the line containing "Hey" and trying to output that line:
file_path = "C:\\downloads\\test.txt"
i = 0
file_in = open(file_path, "r+")  # read & write
for lines in file_in.readlines():
    i += 1
    #if lines.rfind("this entity will be skipped..."):
    if lines.find("Hey"):
        print(i, lines)
file_in.close()
For some reason it outputs every line except the one it found a match on. How is this possible?
It's probably more straightforward to do if "Hey" in lines: instead of if lines.find("Hey"):. If you really want to use find(), you could do this: if lines.find("Hey") != -1:
While Daniel's answer is, of course, correct and should be accepted, I want to fuel my OCD and offer some improvements:
# use a context manager to automatically release the file handle
with open(file_path) as file_in:
    # file objects are iterable and yield lines;
    # reading an entire file into memory can crash if the file is too big.
    # enumerate() is a more readable alternative to a manual counter
    for i, line in enumerate(file_in):  # 'line' should be singular
        if "Hey" in line:  # same as Daniel
            print(i, line)
.find(sub) returns an integer - the first index of sub if sub is found, and -1 if it is not.
When "Hey" is present, it is at the first index (0), so that is what .find() returns. Otherwise, "Hey" is not found, so .find() returns -1.
Since Python treats every integer except 0 as True, the conditional evaluates to False precisely when 0 is returned, i.e. when "Hey" is found at the start of the line, and to True for every other line, where .find() returns -1. That is why the output contains every line except the matching one.
Change your use of .find() to something which fulfills your if statement the way you want.
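A quick interpreter session illustrating the point:
>>> "Hey there".find("Hey")   # found at index 0
0
>>> "boop".find("Hey")        # not found
-1
>>> bool(0), bool(-1)         # 0 is falsy, -1 is truthy
(False, True)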

How to avoid Nonetype when combining lists in Python

I am very new to Python and am looking for help with where I am going wrong in an assignment. I have attempted different approaches to the problem but keep getting stuck at the same point(s):
Problem 1: When I try to create a list of words from a file, I keep building a list of the words per line rather than for the entire file.
Problem 2: When I try to combine the lists I keep receiving "None" as my result, or NoneType errors [which I think means I have added the Nones together(?)].
The assignment is:
#8.4 Open the file romeo.txt and read it line by line. For each line, split the line into a list of words using the split() method. The program should build a list of words. For each word on each line check to see if the word is already in the list and if not append it to the list. When the program completes, sort and print the resulting words in alphabetical order. You can download the sample data at http://www.py4e.com/code3/romeo.txt
My current code which is giving me a Nonetype error is:
poem = input("enter file:")
play = open(poem)
lst = list()
for line in play:
    line = line.rstrip()
    word = line.split()
    if not word in lst:
        lst = lst.append(word)
print(lst.sort())
If someone could just talk me through where I am going wrong that will be greatly appreciated!
Your problem was lst = lst.append(word): append() modifies the list in place and returns None, so lst becomes None after the first iteration. (print(lst.sort()) has the same problem, since sort() also returns None.)
with open(poem) as f:
    lines = f.read().split('\n')  # you can also use readlines()

lst = []
for line in lines:
    words = line.split()
    for word in words:
        if word:
            lst.append(word)
Problem 1: When I am trying to create a list of words from a file, I keep making a list for the words per line rather than the entire file
You are doing play = open(poem) and then for line in play:, which is the idiom for processing a file line by line. If you want to process the whole content at once, then do:
play = open(poem)
content = play.read()
words = content.split()
Please always remember to close the file after you have used it, i.e. do
play.close()
unless you use the context-manager way (i.e. with open(poem) as f:).
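For example, a minimal sketch of the same whole-file read using a context manager:
with open(poem) as play:        # the file is closed automatically
    words = play.read().split()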
Just to help you get into Python a little more:
You can:
1. Read the whole file at once (if it is big, it is better to pull it into RAM if you have enough; if not, read a reasonably sized chunk, then another one, and so on)
2. Split the data you read into words, and
3. Use set() or dict() to remove duplicates
Along the way, don't forget to pay attention to upper and lower case if you want the same word in different cases to count as one word, not just as distinct strings.
This will work in Py2 and Py3, as long as you do something about the input() function in Py2 (it evaluates what you type, so use quotes when entering the path):
path = input("Filename: ")
f = open(path)
c = f.read()
f.close()
words = set(x.lower() for x in c.split())  # This will make a set of lower-case words, with no repetitions
# This equals to:
#words = set()
#for x in c.split():
#    words.add(x.lower())
# set() is an unordered datatype ignoring duplicate items;
# it mimics mathematical sets and their properties (unions, intersections, ...)
# and it is also fast as hell.
# Checking a list() each time for existence of the word is slow as hell.
#----
# OK, you need a sorted list, so:
words = sorted(words)
# Or step-by-step:
#words = list(words)
#words.sort()
# Now words is your list
As for your errors, do not worry; they are common at the beginning in almost any object-oriented language.
Others explained them well in their comments. But not to make the answer lacking...:
Always pay attention to which functions or methods operate on the datatype in place (list.sort(), list.append(), list.insert(), set.add(), ...) and which return a new version of the datatype (sorted(), str.lower(), ...).
If you run into a similar situation again, use help() in the interactive shell to see exactly what the function you used does.
>>> help(list.append)
>>> help(list.sort)
>>> help(str.lower)
>>> # Or any short documentation you need
Python, and especially Python 3.x, is strict about operations between different types, but some combinations have their own meaning and actually work, though they may do unexpected things.
E.g. you can do:
print(40*"x")
It will print out 40 'x' characters, because it will create a string of 40 characters.
But:
print([1, 2, 3]+None)
will, logically, not work, and this is what is happening somewhere in the rest of your code.
In some languages, like JavaScript (terrible stuff), this will work perfectly well:
v = "abc "+123+" def";
inserting the 123 seamlessly into the string, which is useful, but also a programming nightmare and nonsense from another viewing angle.
Also, the reasonable assumption from Py2 that you can mix unicode and byte strings, with an automatic conversion performed for you, does not hold in Py3.
I.e. this is a TypeError:
print(b"abc"+"def")
because b"abc" is bytes() and "def" (or u"def") is str() in Py3 - what is unicode() in Py2)
Enjoy Python, it is the best!

Don't understand why getting 'ValueError: Mixing iteration and read methods would lose data' [duplicate]

I'm writing a script that logs errors from another program and restarts the program where it left off when it encounters an error. For whatever reasons, the developers of this program didn't feel it necessary to put this functionality into their program by default.
Anyways, the program takes an input file, parses it, and creates an output file. The input file is in a specific format:
UI - 26474845
TI - the title (can be any number of lines)
AB - the abstract (can also be any number of lines)
When the program throws an error, it gives you the reference information you need to track the error - namely, the UI, which section (title or abstract), and the line number relative to the beginning of the title or abstract. I want to log the offending sentences from the input file with a function that takes the reference number and the file, finds the sentence, and logs it. The best way I could think of doing it involves moving forward through the file a specific number of times (namely, n times, where n is the line number relative to the beginning of the section). The way that seemed to make sense to do this is:
i = 1
while i <= lineNumber:
    print original.readline()
    i += 1
I don't see how this would make me lose data, but Python thinks it would, and says ValueError: Mixing iteration and read methods would lose data. Does anyone know how to do this properly?
You get the ValueError because your code probably has for line in original: in addition to original.readline(). An easy solution which fixes the problem without making your program slower or consuming more memory is changing
for line in original:
    ...
to
while True:
    line = original.readline()
    if not line: break
    ...
Use for and enumerate.
Example:
for line_num, line in enumerate(file):
    if line_num < cut_off:
        print line
NOTE: This assumes you are already cleaning up your file handles, etc.
Also, the takewhile function could prove useful if you prefer a more functional flavor.
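For instance, a sketch with takewhile, assuming you want the lines up to the next blank line (the end of a title or abstract in the question's format):
from itertools import takewhile

# consume lines while they are non-blank
# (note: takewhile also consumes the first blank line it sees)
for line in takewhile(lambda l: l.strip(), original):
    print line,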
Assuming you need only one line, this could be of help
import itertools

def getline(fobj, line_no):
    "Return a (1-based) line from a file object"
    return itertools.islice(fobj, line_no - 1, line_no).next()  # 1-based!

>>> print getline(open("/etc/passwd", "r"), 4)
adm:x:3:4:adm:/var/adm:/bin/false
You might want to catch StopIteration (raised if the file has fewer lines).
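For example, a sketch of a variant that returns a default instead of raising (hypothetical name, same 1-based convention):
def getline_or(fobj, line_no, default=None):
    "Like getline(), but returns `default` if the file has fewer lines"
    try:
        return itertools.islice(fobj, line_no - 1, line_no).next()
    except StopIteration:
        return default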
Here's a version without the ugly while True pattern and without other modules:
for line in iter(original.readline, ''):
    if …:  # to the beginning of the title or abstract
        for i in range(lineNumber):
            print original.readline(),
        break

find common elements in the strings python

I'm trying to find common elements in the strings read from a file, and this is what I wrote:
file = open("words.txt", 'r')
while 1:
    line = file.readlines()
    if len(line) == 0:
        break
    print line
file.close

def com_Letters(*strings):
    return set.intersection(*map(set, strings))
and the result turns out: ['out\n', 'dog\n', 'pingo\n', 'coconut']
I put com_Letters(line), but the result is empty.
There are two problems, but neither one is with com_Letters.
First, this code guarantees that line will always be an empty list:
while 1:
    line = file.readlines()
    if len(line) == 0:
        break
    print line
The first time through the loop, you call readlines(), which will
Read until EOF using readline() and return a list containing the lines thus read.
If the file is empty, that's an empty list, so you'll break.
Otherwise, you'll print out the list, and go back into the loop. At which point readlines() is going to have nothing left to read, since you already read until EOF, so it's guaranteed to be an empty list. Which means you'll break.
Either way, line ends up as an empty list.
It's not clear what you're trying to do with that loop. There's never any good reason to call readlines() repeatedly on the same file. But, even if there were, you'd probably want to accumulate all of the results, rather than just keeping the last (guaranteed-empty) result. Something like this:
line = []
while 1:
    new_line = file.readlines()
    if len(new_line) == 0:
        break
    print new_line
    line += new_line
Anyway, if you fix that problem (e.g., by scrapping the whole loop and just using line = file.readlines()), you're calling com_Letters with a single list of strings. That's not particularly useful; it's just a very convoluted way of calling set. If it's not clear why:
Since there's only one argument (a list of strings), *strings ends up as a one-element tuple of that argument.
map(set, strings) on a single-element tuple just calls set on that element and returns a single-element list.
*map(set, strings) explodes that into one argument, the set.
set.intersection(s) is the same thing as s.intersection(), which just returns s itself.
All of this would be easier to see if you broke up some of those complex expressions and printed the intermediate values. Then you'd know exactly where it first goes wrong, instead of just knowing it's somewhere in a long chain of events.
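For instance, a sketch of what those intermediates look like with the data above:
line = ['out\n', 'dog\n', 'pingo\n', 'coconut']
strings = (line,)               # what *strings captures: a 1-tuple
sets = map(set, strings)        # a 1-element list: [set of the four strings]
print set.intersection(*sets)   # the same set back again, not letters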
A few side notes (a combined sketch follows the list):
You forgot the () on the file.close, which means you're not actually closing the file. One of the many reasons that with is better is that it means you can't make that mistake.
Use plural names for collections. line sounds like a variable that should have a single line in it, not a variable that should have all of your lines.
The readlines function with no sizehint argument is basically useless. If you're just going to iterate over the lines, you can do that to the file itself. If you really need the lines in a list instead of reading them lazily, list(file) makes your intention clearer—and doesn't mislead you into thinking it might be useful to do repeatedly.
The Pythonic way to check for an empty collection is just if not line:, rather than if len(line) == 0:.
while True is clearer than while 1.
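A combined sketch applying those notes (Python 2, as in the question):
with open("words.txt") as f:   # the file closes itself, even on error
    lines = list(f)            # clearer than readlines(), and no loop needed
print lines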
I suggest modifying the function as follows:
def com_Letters(strings):
    return set.intersection(*map(set, strings))
I think the function was treating the argument strings as a list containing a single list of strings (only one argument was passed, in this case a single list) and therefore not finding the intersection.
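A sketch of how the fixed function could then be called on the file's stripped lines:
with open("words.txt") as f:
    lines = [line.strip() for line in f]   # ['out', 'dog', 'pingo', 'coconut']
print com_Letters(lines)                   # letters common to every word: set(['o'])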

Escape quotes contained within certain html tags

I've done a mysqldump of a large database, ~300MB. It has made an error, though: it has not escaped any quotes contained in <o:p>...</o:p> tags. Here's a sample:
...Text here\' escaped correctly, <o:p> But text in here isn't. </o:p> Out here all\'s well again...
Is it possible to write a script (preferably in Python, but I'll take anything!) that would be able to scan and fix these errors automatically? There are quite a lot of them, and Notepad++ can't handle a file of that size very well...
If the "lines" your file is divided into are of reasonable lengths, and there are no binary sequences in it that "reading as text" would break, you can use fileinput's handy "make believe I'm rewriting a file in place" functionality:
import re
import fileinput

tagre = re.compile(r"<o:p>.*?</o:p>")

def sub(mo):
    return mo.group().replace(r"'", r"\'")

for line in fileinput.input('thefilename', inplace=True):
    print tagre.sub(sub, line),
If not, you'll have to simulate the "in-place rewriting" yourself, e.g. (oversimplified...):
with open('thefilename', 'rb') as inf:
    with open('fixed', 'wb') as ouf:
        while True:
            b = inf.read(1024*1024)
            if not b: break
            ouf.write(tagre.sub(sub, b))
and then move 'fixed' into the place of 'thefilename' (either in code or manually) if you need that filename to remain after the fixing.
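For the in-code variant, a minimal sketch (os.rename atomically replaces the target on POSIX; on Windows, remove the target first or use shutil.move):
import os
os.rename('fixed', 'thefilename')   # swap the fixed copy into place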
The chunked version above is oversimplified because one of the crucial <o:p> ... </o:p> parts might end up getting split between two successive megabyte "blocks" and therefore not identified (in the first example, I'm assuming each such part is always fully contained within a "line" -- if that's not the case then you should not use that code, but the following, anyway). Fixing this requires, alas, more complicated code...:
with open('thefilename', 'rb') as inf:
    with open('fixed', 'wb') as ouf:
        while True:
            b = getblock(inf)
            if not b: break
            ouf.write(tagre.sub(sub, b))
with e.g.
partsofastartag = '<', '<o', '<o:', '<o:p'

def getblock(inf):
    b = ''
    while True:
        newb = inf.read(1024 * 1024)
        if not newb: return b
        b += newb
        if any(b.endswith(p) for p in partsofastartag):
            continue
        if b.count('<o:p>') != b.count('</o:p>'):
            continue
        return b
As you see, this is pretty delicate code, and therefore, what with it being untested, I can't know that it is correct for your problem. In particular, can there be cases of <o:p> that are NOT matched by a closing </o:p> or vice versa? If so, then a call to getblock could end up returning the whole file in quite a costly way, and even the RE matching and substitution might backfire (the latter would also occur if SOME of the single-quotes in such tags are already properly escaped, but not all).
If you have at least a GB or so of memory, avoiding the delicate issues with block division IS feasible, since everything should fit in memory, making the code much simpler:
with open('thefilename', 'rb') as inf:
    with open('fixed', 'wb') as ouf:
        b = inf.read()
        ouf.write(tagre.sub(sub, b))
However, the other issues mentioned above (possible unbalanced opening/closing tags, etc.) might remain -- only you can study your existing defective data and see whether it allows such a reasonably simple approach to fixing!
