I need to loop until I hit the end of a file-like object, but I'm not finding an "obvious way to do it", which makes me suspect I'm overlooking something, well, obvious. :-)
I have a stream (in this case, it's a StringIO object, but I'm curious about the general case as well) which stores an unknown number of records in "<length><data>" format, e.g.:
data = StringIO("\x07\x00\x00\x00foobar\x00\x04\x00\x00\x00baz\x00")
Now, the only clear way I can imagine to read this is using (what I think of as) an initialized loop, which seems a little un-Pythonic:
import struct

names = []
len_name = data.read(4)
while len_name != "":
    len_name = struct.unpack("<I", len_name)[0]
    names.append(data.read(len_name))
    len_name = data.read(4)
In a C-like language, I'd just stick the read(4) in the while's test clause, but of course that won't work for Python. Any thoughts on a better way to accomplish this?
You can combine iteration through iter() with a sentinel:
for block in iter(lambda: file_obj.read(4), ""):
    use(block)
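Applied to the record format from the question, that pattern might look like this (a sketch in Python 2, matching the StringIO example above):

import struct
from StringIO import StringIO

data = StringIO("\x07\x00\x00\x00foobar\x00\x04\x00\x00\x00baz\x00")
names = []
for raw_len in iter(lambda: data.read(4), ""):
    # each record is "<length><data>": unpack the little-endian length,
    # then read that many bytes of payload
    names.append(data.read(struct.unpack("<I", raw_len)[0]))
# names == ['foobar\x00', 'baz\x00']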
Have you seen how to iterate over lines in a text file?
for line in file_obj:
    use(line)
You can do the same thing with your own generator:
def read_blocks(file_obj, size):
    while True:
        data = file_obj.read(size)
        if not data:
            break
        yield data

for block in read_blocks(file_obj, 4):
    use(block)
See also:
file.read
I prefer the already-mentioned iterator-based solutions that turn this into a for-loop. Another solution, written out directly, is Knuth's "loop-and-a-half":
while 1:
    len_name = data.read(4)
    if not len_name:
        break
    names.append(data.read(struct.unpack("<I", len_name)[0]))
You can see by comparison how that's easily hoisted into its own generator and used as a for-loop.
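For instance, hoisted into a generator it might read like this (a sketch, reusing the data, names, and struct import from the question):

def read_names(file_obj):
    # the loop-and-a-half, hoisted into a generator
    while True:
        len_name = file_obj.read(4)
        if not len_name:
            break
        yield file_obj.read(struct.unpack("<I", len_name)[0])

for name in read_names(data):
    names.append(name)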
I see, as predicted, that the typical and most popular answers are using very specialized generators to "read 4 bytes at a time". Sometimes generality isn't any harder (and it's much more rewarding;-), so, I've suggested instead the following very general solution:
import operator

def funlooper(afun, *a, **k):
    wearedone = k.pop('wearedone', operator.not_)
    while True:
        data = afun(*a, **k)
        if wearedone(data): break
        yield data
Now your desired loop header is just: for len_name in funlooper(data.read, 4):.
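For the question's format, the body of that loop might look like this (a sketch, assuming the struct import and names list from the question):

for len_name in funlooper(data.read, 4):
    names.append(data.read(struct.unpack("<I", len_name)[0]))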
Edit: made much more general by the wearedone idiom since a comment accused my slightly less general previous version (hardcoding the exit test as if not data:) of having "a hidden dependency", of all things!-)
The usual swiss army knife of looping, itertools, is fine too, of course, as usual:
import itertools as it
for len_name in it.takewhile(bool, it.imap(data.read, it.repeat(4))): ...
or, quite equivalently:
import itertools as it
def loop(pred, fun, *args):
    return it.takewhile(pred, it.starmap(fun, it.repeat(args)))

for len_name in loop(bool, data.read, 4): ...
The EOF marker in Python is an empty string, so what you have is pretty close to the best you are going to get without writing a function to wrap this up in an iterator. It could be written in a slightly more Pythonic way by changing the while like so:
while len_name:
    len_name = struct.unpack("<I", len_name)[0]
    names.append(data.read(len_name))
    len_name = data.read(4)
I'd go with Tendayi's suggestion re function and iterator for readability:
def read4():
    len_name = data.read(4)
    if len_name:
        len_name = struct.unpack("<I", len_name)[0]
        return data.read(len_name)
    else:
        raise StopIteration

for d in iter(read4, ''):
    names.append(d)
I have a bunch of strings that are of the form:
'foo.bar.baz.spam.spam.spam...etc'
In all likelihood they have three or more multi-letter substrings separated by .'s. There might be ill-formed strings with fewer than two .'s, and I want the original string back in that case.
The first thing that comes to mind is the str.partition method, which I would use if I were after everything after the first .:
'foo.bar.baz.boink.a.b.c'.partition('.')[2]
returns
'bar.baz.boink.a.b.c'
This could be repeated:
def secondpartition(s):
    return s.partition('.')[2].partition('.')[2] or s
But is this efficient? It doesn't seem efficient to call a method twice and use a subscript twice. It is certainly inelegant. Is there a better way?
The main question is:
How do you drop everything from the beginning up to the second instance of the . character, so that 'foo.bar.baz.spam.spam.spam' becomes 'baz.spam.spam.spam'? What would be the best/most efficient way to do that?
Using str.split with maxsplit argument:
>>> 'foo.bar.baz.spam.spam.spam'.split('.', 2)[-1]
'baz.spam.spam.spam'
UPDATE
To handle strings with fewer than two .'s:
def secondpartition(s):
    parts = s.split('.', 2)
    if len(parts) <= 2:
        return s
    return parts[-1]
Summary: This is the most performant approach (generalized to n characters):
def maxsplittwoexcept(s, n, c):
    '''
    Given string s, return the substring after the nth occurrence of
    character c. If there are fewer than n c's, return the whole string s.
    '''
    try:
        return s.split(c, n)[n]
    except IndexError:
        return s
but I show other approaches for comparison.
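For example, with the generalized signature (a quick check; the function is as defined above):

>>> maxsplittwoexcept('foo.bar.baz.spam.spam.spam', 2, '.')
'baz.spam.spam.spam'
>>> maxsplittwoexcept('foo.bar', 2, '.')
'foo.bar'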
There are various ways of doing this with string methods and regular expressions. I'll make sure you can follow along with an interpreter: everything below can be cut and pasted in order.
First imports:
import re
import timeit
from itertools import islice
Different approaches: string methods
The way mentioned in the question is to partition twice, but I discounted it because it seems rather inelegant and unnecessarily repetitive:
def secondpartition(s):
    return s.partition('.')[2].partition('.')[2] or s
The second way that came to mind to do this is to split on the .'s, slice from the second on, and join with .'s. This struck me as fairly elegant and I assumed it would be rather efficient.
def splitslicejoin(s):
    return '.'.join(s.split('.')[2:]) or s
But slices create an unnecessary extra list. However, islice from the itertools module provides an iterable that doesn't! So I expected this to do even better:
def splitislicejoin(s):
    return '.'.join(islice(s.split('.'), 2, None)) or s
Different approaches: regular expressions
Now regular expressions. First way that came to mind with regular expressions was to find and substitute with an empty string up to the second ..
dot2 = re.compile(r'.*?\..*?\.')

def redot2(s):
    return dot2.sub('', s, 1)  # count=1: substitute only once, up to the second dot
But it occurred to me that it might be better to use a non-capturing group, and return a match on the end:
dot2match = re.compile(r'(?:.*?\..*?\.)(.*)')

def redot2match(s):
    match = dot2match.match(s)
    if match is not None:
        return match.group(1)
    else:
        return s
Finally, I could use a regular expression search to find the end of the second . and then use that index to slice the string, which would use a lot more code, but might still be fast and memory efficient.
dot = re.compile(r'\.')

def find2nddot(s):
    for i, found_dot in enumerate(dot.finditer(s)):
        if i == 1:
            return s[found_dot.end():] or s
    return s
Update: falsetru suggests str.split's maxsplit argument, which had completely slipped my mind. My feeling is that it may be the most straightforward approach, but the assignment and extra checking might hurt its performance.
def maxsplittwo(s):
    parts = s.split('.', 2)
    if len(parts) <= 2:
        return s
    return parts[-1]
And JonClements suggests using a try/except instead, referencing https://stackoverflow.com/a/27989577/541136, which would look like this:
def maxsplittwoexcept(s):
    try:
        return s.split('.', 2)[2]
    except IndexError:
        return s
which would be totally appropriate since not having enough .s would be exceptional.
Testing
Now let's test our functions. First, let's assert that they actually work (not a best practice in production code, which should use unittests, but useful for fast validation on StackOverflow):
functions = ('secondpartition', 'redot2match', 'redot2',
             'splitslicejoin', 'splitislicejoin', 'find2nddot',
             'maxsplittwo', 'maxsplittwoexcept')

for function in functions:
    assert globals()[function]('foo.baz') == 'foo.baz'
    assert globals()[function]('foo.baz.bar') == 'bar'
    assert globals()[function]('foo.baz.bar.boink') == 'bar.boink'
The asserts don't raise AssertionErrors, so now let's time them to see how they perform:
Performance
setup = 'from __main__ import ' + ', '.join(functions)

perfs = {}
for func in functions:
    perfs[func] = min(timeit.repeat(func + '("foo.bar.baz.a.b.c")', setup))

for func in sorted(perfs, key=lambda x: perfs[x]):
    print('{0}: {1}'.format(func, perfs[func]))
Results
Update: The best performer is falsetru's maxsplittwo, which slightly edges out the secondpartition function. Congratulations to falsetru. It makes sense since it is a very direct approach. And JonClements's try/except modification is even better...
maxsplittwoexcept: 1.01329493523
maxsplittwo: 1.08345508575
secondpartition: 1.1336209774
splitslicejoin: 1.49500417709
redot2match: 2.22423219681
splitislicejoin: 3.4605550766
find2nddot: 3.77172589302
redot2: 4.69134306908
Older run and analysis without falsetru's maxsplittwo and JonClements' maxsplittwoexcept:
secondpartition: 0.636116637553
splitslicejoin: 1.05499717616
redot2match: 1.10188927335
redot2: 1.6313087087
find2nddot: 1.65386564664
splitislicejoin: 3.13693511439
It turns out that the most performant approach is to partition twice, even though my intuition didn't like it.
Also, it turns out my intuition about islice was wrong in this case: it is much less performant, so the extra list created by the regular slice is probably a worthwhile tradeoff in similar code.
Of the regular expressions, the match approach for my desired string is the best performer here, nearly tied with splitslicejoin.
I have code that looks like this:
if(func_cliche_start(line)):
    a=func_cliche_start(line)
    #... do stuff with 'a' and line here
elif(func_test_start(line)):
    a=func_test_start(line)
    #... do stuff with a and line here
elif(func_macro_start(line)):
    a=func_macro_start(line)
    #... do stuff with a and line here
...
Each of the func_blah_start functions either return None or a string (based on the input line). I don't like the redundant call to func_blah_start as it seems like a waste (func_blah_start is "pure", so we can assume no side effects). Is there a better idiom for this type of thing, or is there a better way to do it?
Perhaps I'm wrong (my C is rusty), but I thought that you could do something like this in C:
int a;
if (a = myfunc(input)) { /* do something with a and input here */ }
is there a python equivalent?
Why don't you assign the result of func_cliche_start(line) to the variable a before the if statement?
a = func_cliche_start(line)
if a:
    pass # do stuff with 'a' and line here
The if statement will fail if func_cliche_start(line) returns None.
You can create a wrapper function to make this work.
def assign(value, lst):
    lst[0] = value
    return value

a = [None]
if assign(func_cliche_start(line), a):
    pass #... do stuff with 'a[0]' and line here
elif assign(func_test_start(line), a):
    pass #...
You can just loop through your processing functions; that would be easier and fewer lines :). If you want to do something different in each case, wrap that in a function and call it, e.g.:
for func, proc in [(func_cliche_start, cliche_proc),
                   (func_test_start, test_proc),
                   (func_macro_start, macro_proc)]:
    a = func(line)
    if a:
        proc(a, line)
        break
I think you should put those blocks of code in functions. That way you can use a dispatcher-style approach. If you need to modify a lot of local state, use a class and methods. (If not, just use functions; but I'll assume the class case here.) So something like this:
class LineHandler(object):
    def __init__(self, state):
        self.state = state

    def handle_cliche_start(self, line):
        pass  # modify state

    def handle_test_start(self, line):
        pass  # modify state

    def handle_macro_start(self, line):
        pass  # modify state

line_handler = LineHandler(initial_state)

handlers = [line_handler.handle_cliche_start,
            line_handler.handle_test_start,
            line_handler.handle_macro_start]

tests = [func_cliche_start,
         func_test_start,
         func_macro_start]

handlers_tests = zip(handlers, tests)

for line in lines:
    handler_iter = ((h, t(line)) for h, t in handlers_tests)
    handler_filter = ((h, l) for h, l in handler_iter if l is not None)
    handler, line = next(handler_filter, (None, None))
    if handler:
        handler(line)
This is a bit more complex than your original code, but I think it compartmentalizes things in a much more scalable way. It does require you to maintain separate parallel lists of functions, but the payoff is that you can add as many as you want without having to write long if statements -- or calling your function twice! There are probably more sophisticated ways of organizing the above too -- this is really just a roughed-out example of what you could do. For example, you might be able to create a sorted container full of (priority, test_func, handler_func) tuples and iterate over it.
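For example, the sorted-container idea could be roughed out like this (a sketch; the rule and handler names are hypothetical):

# (priority, test_func, handler_func) rules, lowest priority number first
rules = sorted([
    (0, func_cliche_start, handle_cliche),
    (1, func_test_start, handle_test),
    (2, func_macro_start, handle_macro),
])

for priority, test, handle in rules:
    result = test(line)
    if result is not None:
        handle(result, line)
        break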
In any case, I think you should consider refactoring this long list of if/elif clauses.
You could take a list of functions, make it a generator and return the first truthy result:
functions = [func_cliche_start, func_test_start, func_macro_start]
functions_gen = (f(line) for f in functions)
a = next((x for x in functions_gen if x), None)
Still seems a little strange, but much less repetition.
If I do this:
x = [(t, some_very_complex_computation(y)) for t in z]
Apparently some_very_complex_computation(y) is not dependent on t. So it should be evaluated only once. Is there any way to make Python aware of this, so it won't evaluate some_very_complex_computation(y) for every iteration?
Edit: I really want to do that in one line...
Usually you should follow San4ez's advice and just use a temporary variable here. I will still present a few techniques that might prove useful under certain circumstances:
In general, if you want to bind a name just for a sub-expression (which is usually why you need a temporary variable), you can use a lambda:
x = (lambda result=some_very_complex_computation(y): [(t, result) for t in z])()
In this particular case, the following is a quite clean and readable solution:
x = zip(z, itertools.repeat(some_very_complex_computation(y)))
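A quick illustration of the pairing (Python 2, where zip returns a list; repeat with no count supplies the same value for every element of z):

>>> import itertools
>>> z = [1, 2, 3]
>>> zip(z, itertools.repeat(10))
[(1, 10), (2, 10), (3, 10)]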
A general note about automatic optimizations like these
In a dynamic language like Python, an implementation would have a very hard time to figure out that some_very_complex_computation is referentially transparent, that is, that it will always return the same result for the same arguments. You might want to look into a functional language like Haskell if you want magic like that.
"Explicit" pureness: Memoization
What you can do however is make some_very_complex_computation explicitly cache its return values for recent arguments:
from functools import lru_cache

@lru_cache()
def some_very_complex_computation(y):
    ...  # the expensive work
This is Python 3. In Python 2, you'd have to write the decorator yourself:
from functools import wraps

def memoize(f):
    cache = {}
    @wraps(f)
    def memoized(*args):
        if args in cache:
            return cache[args]
        res = cache[args] = f(*args)
        return res
    return memoized

@memoize
def some_very_complex_computation(x):
    pass  # the expensive work
No, you should save the value in a variable:
result = some_very_complex_computation(y)
x = [(t, result) for t in z]
I understand the sometimes perverse urge to get everything into one line, but at the same time it is good to keep things readable. You may consider this more readable than the lambda version:
x = [(t, s) for s in [some_very_complex_computation(y)] for t in z]
However, you are probably better going for the answer by San4ez as being simple, readable (and possibly faster than creating and iterating through a one element list).
You can either:
Move the call out of the list comprehension
or
Use memoization (i.e. when some_very_complex_computation(y) gets called, store the result in a dictionary, and if it gets called again with the same value, just return the value stored in the dict); a sketch follows.
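A minimal sketch of the memoization option (the squaring is a hypothetical stand-in for the real computation):

_cache = {}

def some_very_complex_computation(y):
    # compute and store on a cache miss; subsequent calls are cheap lookups
    if y not in _cache:
        _cache[y] = y ** 2  # stand-in for the real expensive work
    return _cache[y]

x = [(t, some_very_complex_computation(y)) for t in z]  # real work happens only once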
TL;DR version
zip(z, [long_computation(y)] * len(z))
Original answer:
As a rule of thumb, if you have some computation with a long execution time, it would be a good idea to cache the result directly in the function like this:
_cached_results = {}

def computation(v):
    if v in _cached_results:
        return _cached_results[v]
    # otherwise do the computation here...
    _cached_results[v] = result
    return result
This would solve your problem too.
On one-liners
Doing one-liners for the sake of them is poor coding, yet... if you really wanted to do it in one line:
>>> def func(v):
... print 'executing func'
... return v * 2
...
>>> z = [1, 2, 3]
>>> zip(z, [func(10)] * len(z))
executing func
[(1, 20), (2, 20), (3, 20)]
@San4ez has given the traditional, correct, simple, and beautiful answer.
In the spirit of the one-liner though, here's a technique for putting it all in one statement. The core idea is to use a nested for-loop to pre-evaluate subexpressions:
result = [(t, result) for result in [some_very_complex_computation(y)] for t in z]
If that blows your mind, you could just use a semicolon to put multiple statements on one line:
result = some_very_complex_computation(y); x = [(t, result) for t in z]
It can't know whether the function has side effects and changes from run to run, so you have to move the call out of the list comprehension manually.
Say I have something like the following:
dest = "\n".join( [line for line in src.split("\n") if line[:1]!="#"] )
(i.e. strip any lines starting with # from the multi-line string src)
src is very large, so I'm assuming .split() will create a large intermediate list. I can change the list comprehension to a generator expression, but is there some kind of "xsplit" I can use to only work on one line at a time? Is my assumption correct? What's the most (memory) efficient way to handle this?
Clarification: This arose due to my code running out of memory. I know there are ways to entirely rewrite my code to work around that, but the question is about Python: Is there a version of split() (or an equivalent idiom) that behaves like a generator and hence doesn't make an additional working copy of src?
Here's a way to do a general type of split using itertools
>>> import itertools as it
>>> src="hello\n#foo\n#bar\n#baz\nworld\n"
>>> line_gen = (''.join(j) for i,j in it.groupby(src, "\n".__ne__) if i)
>>> '\n'.join(s for s in line_gen if s[0]!="#")
'hello\nworld'
groupby treats each char in src separately, so the performance probably isn't stellar, but it does avoid creating any intermediate huge data structures
Probably better to spend a few lines and make a generator
>>> src="hello\n#foo\n#bar\n#baz\nworld\n"
>>>
>>> def isplit(s, t): # iterator to split string s at character t
...     i = j = 0
...     while True:
...         try:
...             j = s.index(t, i)
...         except ValueError:
...             if i < len(s):
...                 yield s[i:]
...             raise StopIteration
...         yield s[i:j]
...         i = j + 1
...
>>> '\n'.join(x for x in isplit(src, '\n') if x[0]!='#')
'hello\nworld'
re has a method called finditer, that could be used for this purpose too
>>> import re
>>> src="hello\n#foo\n#bar\n#baz\nworld\n"
>>> line_gen = (m.group(1) for m in re.finditer("(.*?)(\n|$)",src))
>>> '\n'.join(s for s in line_gen if not s.startswith("#"))
'hello\nworld'
comparing the performance is an exercise for the OP to try on the real data
from StringIO import StringIO

buffer = StringIO(src)
dest = "".join(line for line in buffer if line[:1]!="#")
Of course, this really makes the most sense if you use StringIO throughout. It works mostly the same as files. You can seek, read, write, iterate (as shown), etc.
In your existing code you can change the list to a generator expression:
dest = "\n".join(line for line in src.split("\n") if line[:1]!="#")
This very small change avoids the construction of one of the two temporary lists in your code, and requires no effort on your part.
A completely different approach that avoids the temporary construction of both lists is to use a regular expression:
import re
regex = re.compile('^#.*\n?', re.M)
dest = regex.sub('', src)
This will not only avoid creating temporary lists, it will also avoid creating temporary strings for each line in the input. Here are some performance measurements of the proposed solutions:
init = r'''
import re, StringIO
regex = re.compile('^#.*\n?', re.M)
src = ''.join('foo bar baz\n' for _ in range(100000))
'''
method1 = r'"\n".join([line for line in src.split("\n") if line[:1] != "#"])'
method2 = r'"\n".join(line for line in src.split("\n") if line[:1] != "#")'
method3 = 'regex.sub("", src)'
method4 = '''
buffer = StringIO.StringIO(src)
dest = "".join(line for line in buffer if line[:1] != "#")
'''
import timeit
for method in [method1, method2, method3, method4]:
    print timeit.timeit(method, init, number=100)
Results:
9.38s # Split then join with temporary list
9.92s # Split then join with generator
8.60s # Regular expression
64.56s # StringIO
As you can see the regular expression is the fastest method.
From your comments I can see that you are not actually interested in avoiding creating temporary objects. What you really want is to reduce the memory requirements for your program. Temporary objects don't necessarily affect the memory consumption of your program as Python is good about clearing up memory quickly. The problem comes from having objects that persist in memory longer than they need to, and all these methods have this problem.
If you are still running out of memory then I'd suggest that you shouldn't be doing this operation entirely in memory. Instead store the input and output in files on the disk and read from them in a streaming fashion. This means that you read one line from the input, write a line to the output, read a line, write a line, etc. This will create lots of temporary strings but even so it will require almost no memory because you only need to handle the strings one at a time.
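A minimal sketch of that streaming approach (the file names are placeholders):

with open('input.txt') as src_file, open('output.txt', 'w') as dst_file:
    for line in src_file:             # reads one line at a time
        if not line.startswith('#'):  # keep only non-comment lines
            dst_file.write(line)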
If I understand your question about "more generic calls to split()" correctly, you could use re.finditer, like so:
output = ""
for i in re.finditer("^.*\n", input, re.M):
    i = i.group(0).strip()
    if i.startswith("#"):
        continue
    output += i + "\n"
Here you can replace the regular expression by something more sophisticated.
The problem is that strings are immutable in python, so it's going to be very difficult to do anything at all without intermediate storage.
I want the variable assignment to return the assigned value, so that I can compare it to an empty string directly in the while loop.
Here is how I'm doing it in PHP:
while((name = raw_input("Name: ")) != ''):
    names.append(name)
What I'm trying to do is identical to this in functionality:
names = []
while(True):
    name = raw_input("Name: ")
    if (name == ''):
        break
    names.append(name)
Is there any way to do this in Python?
from functools import partial
for name in iter(partial(raw_input, 'Name:'), ''):
    do_something_with(name)
or if you want a list:
>>> names = list(iter(partial(raw_input, 'Name: '), ''))
Name: nosklo
Name: Andreas
Name: Aaron
Name: Phil
Name:
>>> names
['nosklo', 'Andreas', 'Aaron', 'Phil']
You can wrap raw_input() to turn it into a generator:
def wrapper(s):
    while True:
        result = raw_input(s)
        if result == '': break
        yield result

names = list(wrapper('Name:'))
which means we're back to square one but with more complex code. So if you need to wrap an existing method, you need to use nosklo's approach.
No, sorry. It's a FAQ, explained well in the Python docs and on Fredrik Lundh's blog.
The reason for not allowing assignment in Python expressions is a common, hard-to-find bug in those other languages.
Many alternatives have been proposed. Most are hacks that save some typing but use arbitrary or cryptic syntax or keywords, and fail the simple criterion for language change proposals: it should intuitively suggest the proper meaning to a human reader who has not yet been introduced to the construct.
An interesting phenomenon is that most experienced Python programmers recognize the while True idiom and don’t seem to be missing the assignment in expression construct much; it’s only newcomers who express a strong desire to add this to the language.
There’s an alternative way of spelling this that seems attractive:
line = f.readline()
while line:
    ...  # do something with line...
    line = f.readline()
I'm only 7 years late, but there's another solution. It's not the best solution I can think of, but it highlights an interesting use of the StopIteration exception. You can do a similar loop for chunk reading files/sockets and handle Timeouts and whatnot nicely.
names = []
try:
    while True:
        f = raw_input()
        if not f:
            raise StopIteration
        else:
            names.append(f)
except StopIteration:
    pass

print names
names = []
for name in iter(lambda: raw_input("Name: "), ''):
    names.append(name)
PEP 572 proposes Assignment Expressions and has already been accepted. Starting with Python 3.8, you will be able to write:
while name := input("Name: "):
    names.append(name)
Quoting the Syntax and semantics part of the PEP for some more examples:
# Handle a matched regex
if (match := pattern.search(data)) is not None:
    # Do something with match
    ...

# A loop that can't be trivially rewritten using 2-arg iter()
while chunk := file.read(8192):
    process(chunk)

# Reuse a value that's expensive to compute
[y := f(x), y**2, y**3]

# Share a subexpression between a comprehension filter clause and its output
filtered_data = [y for x in data if (y := f(x)) is not None]