Disclaimer: I'm fairly new to python!
If I want all the lines of a file until (edit: and including) the line containing some string stopterm, is there a way of using the list syntax for it? I was hoping there would be something like:
usefullines = [line for line in file until stopterm in line]
For now, I've got
usefullines = []
for line in file:
    usefullines.append(line)
    if stopterm in line:
        break
It's not the end of the world, but since the rest of Python's syntax is so straightforward, I was hoping for a 1 thought->1 Python line mapping.
import re
from itertools import takewhile

usefullines = takewhile(lambda x: not re.search(stopterm, x), lines)
from itertools import takewhile
usefullines = takewhile(lambda x: stopterm not in x, lines)
Here's a way that keeps the stopterm line:
def useful_lines(lines, stopterm):
    for line in lines:
        if stopterm in line:
            yield line
            break
        yield line
usefullines = useful_lines(lines, stopterm)
# or...
for line in useful_lines(lines, stopterm):
    # ... do stuff
    pass
" I was hoping for a 1 thought->1 Python line mapping." Wouldn't we all love a programming language that somehow mirrored our natural language?
You can achieve that; you just need to define each unique thought once. Then you have the 1:1 mapping you were hoping for.
def usefulLines(aFile):
    for line in aFile:
        yield line
        if stopterm in line:
            break

Is pretty much it.

for line in usefulLines(aFile):
    # process a line, knowing the stopterm line will be the last one yielded.
There are more general approaches. lassevk's answer, with enum_while and enum_until, is a generalization of this simple design pattern.
That itertools solution is neat. I have been amazed by itertools before; groupby is one handy tool. But still, I was tinkering to see whether I could do this without itertools, so here it is.
(There is one assumption and one drawback: the file is not huge, and the code makes one extra complete pass over the lines, respectively.)
I created a sample file named "try":
hello
world
happy
day
bye
Once you read the file and have the lines in a variable named lines:
lines = open('./try').readlines()
then
print [each for each in lines if lines.index(each)<=[lines.index(line) for line in lines if 'happy' in line][0]]
gives the result:
['hello\n', 'world\n', 'happy\n']
and
print [each for each in lines if lines.index(each)<=[lines.index(line) for line in lines if 'day' in line][0]]
gives the result:
['hello\n', 'world\n', 'happy\n', 'day\n']
So you got the last line, the stop term line, also included. (Note that lines.index returns the position of the first matching line, so this approach misbehaves if the file contains duplicate lines.)
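For what it's worth, the same result can be had in a single pass with enumerate and a slice, which avoids the quadratic index lookups (a sketch, reusing the lines list from above):
# index of the first line containing the stop term
stop_index = next(i for i, line in enumerate(lines) if 'happy' in line)
print lines[:stop_index + 1]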
Forget this
Leaving the answer, but marking it community. See Steven Huwig's answer for the correct way to do this.
Well, [x for x in enumerable] will run until enumerable doesn't produce data any more, the if-part will simply allow you to filter along the way.
What you can do is add a function, and filter your enumerable through it:
def enum_until(source, until_criteria):
    for k in source:
        if until_criteria(k):
            break
        yield k

def enum_while(source, while_criteria):
    for k in source:
        if not while_criteria(k):
            break
        yield k

l1 = [k for k in enum_until(xrange(1, 100000), lambda y: y == 100)]
l2 = [k for k in enum_while(xrange(1, 100000), lambda y: y < 100)]
print l1
print l2
Of course, it doesn't look as nice as what you wanted...
I think it's fine to keep it that way. Sophisticated one-liner are not really pythonic, and since Guido had to put a limit somewhere, I guess this is it...
I'd go with Steven Huwig's or S.Lott's solutions for real usage, but as a slightly hacky solution, here's one way to obtain this behaviour:
def stop(): raise StopIteration()
usefullines = list(stop() if stopterm in line else line for line in file)
It's slightly abusing the fact that anything raising StopIteration will abort the current iteration (here the generator expression). It's uglier to read than your desired syntax, but it works in Python 2. Note, however, that PEP 479 (Python 3.7+) turns a StopIteration raised inside a generator into a RuntimeError, so this trick no longer works there.
Related
I have a file with many sections like the following:
[40.742742,-73.993847]
[40.739389,-73.985667]
[40.74715499999999,-73.97992]
[40.750573,-73.988415]
[40.742742,-73.993847]
[40.734706,-73.991915]
[40.736917,-73.990263]
[40.736104,-73.98846]
[40.740315,-73.985263]
[40.74364800000001,-73.993353]
[40.73729099999999,-73.997988]
[40.734706,-73.991915]
[40.729226,-74.003463]
[40.7214529,-74.006038]
[40.717745,-74.000389]
[40.722299,-73.996634]
[40.725291,-73.994413]
[40.729226,-74.003463]
[40.754604,-74.007836]
[40.751289,-74.000649]
[40.7547179,-73.9983309]
[40.75779,-74.0054339]
[40.754604,-74.007836]
I need to read in each of these sections as a list of pairs of coordinates (Each section being separated by an extra \n).
In a similar file (the same format, except without the extra newline breaks), I am drawing one polygon from the whole file. I can use the following code to read in the coordinates and draw it in matplotlib:
mVerts = []
with open('Manhattan_Coords.txt') as f:
    for line in f:
        pair = [float(s) for s in line.strip()[1:-1].split(", ")]
        mVerts.append(pair)

plt.plot(*zip(*mVerts))
plt.show()
How can I accomplish the same task, except with many more than 1 polygon, each polygon in my file separated by an extra newline?
Here's my personal favorite way to "chunk" a file into groups of things that are separated by whitespace:
from itertools import groupby
def chunk_groups(it):
    stripped_lines = (x.strip() for x in it)
    for k, group in groupby(stripped_lines, bool):
        if k:
            yield list(group)
And I'd recommend ast.literal_eval to turn those string-representations of lists into actual python lists:
from ast import literal_eval
with open(filename) as f:
    result = [[literal_eval(li) for li in chunk] for chunk in chunk_groups(f)]
Gives:
result
Out[66]:
[[[40.742742, -73.993847],
  [40.739389, -73.985667],
  [40.74715499999999, -73.97992],
  [40.750573, -73.988415],
  [40.742742, -73.993847]],
 [[40.734706, -73.991915],
  [40.736917, -73.990263],
  [40.736104, -73.98846],
  [40.740315, -73.985263],
  [40.74364800000001, -73.993353],
  [40.73729099999999, -73.997988],
  [40.734706, -73.991915]],
 [[40.729226, -74.003463],
  [40.7214529, -74.006038],
  [40.717745, -74.000389],
  [40.722299, -73.996634],
  [40.725291, -73.994413],
  [40.729226, -74.003463],
  [40.754604, -74.007836],
  [40.751289, -74.000649],
  [40.7547179, -73.9983309],
  [40.75779, -74.0054339],
  [40.754604, -74.007836]]]
A slight variation on roippi's idea, using json instead of ast:
import json
from itertools import groupby
with open(FILE, "r") as coordinates_file:
    grouped = groupby(coordinates_file, lambda line: line.isspace())
    groups = (group for empty, group in grouped if not empty)
    polygons = [[json.loads(line) for line in group] for group in groups]
from pprint import pprint
pprint(polygons)
#>>> [[[40.742742, -73.993847],
#>>> [40.739389, -73.985667],
#>>> [40.74715499999999, -73.97992],
#>>> [40.750573, -73.988415],
#>>> [40.742742, -73.993847]],
#>>> [[40.734706, -73.991915],
#>>> [40.736917, -73.990263],
#>>> [40.736104, -73.98846],
#>>> [40.740315, -73.985263],
#>>> [40.74364800000001, -73.993353],
#>>> [40.73729099999999, -73.997988],
#>>> [40.734706, -73.991915]],
#>>> [[40.729226, -74.003463],
#>>> [40.7214529, -74.006038],
#>>> [40.717745, -74.000389],
#>>> [40.722299, -73.996634],
#>>> [40.725291, -73.994413],
#>>> [40.729226, -74.003463],
#>>> [40.754604, -74.007836],
#>>> [40.751289, -74.000649],
#>>> [40.7547179, -73.9983309],
#>>> [40.75779, -74.0054339],
#>>> [40.754604, -74.007836]]]
There are a lot of nifty approaches taken in the answers already posted. There's nothing wrong with any of them.
However, there's also nothing wrong with taking the obvious-but-readable approach.
On a side note, you seem to be working with geographic data. This sort of format is something you'll run into all the time, and the segment delimiter often isn't something as obvious as an extra newline. (There are a lot of fairly bad ad-hoc "ASCII export" formats out there, particularly in obscure proprietary software. For example, one common format uses an F at the end of the last line in a segment as the delimiter (e.g. 1.0 2.0F). Plenty of others don't use a delimiter at all, and require you to start a new segment/polygon whenever you're more than some distance "x" away from the last point.) Furthermore, these things often wind up being multi-GB ASCII files, so reading the entire thing into memory can be impractical.
My point is: Regardless of the approach you choose, make sure you understand it. You're going to be doing this again, and it's going to be just different enough to be difficult to generalize. You absolutely should learn libraries like itertools well, but make sure you fully understand the functions you're calling.
Here's one version of the "obvious-but-readable" approach. It's more verbose, but no one is going to be left scratching their heads as to what it does. (You could write this same logic several slightly different ways. Use what makes the most sense to you.)
import matplotlib.pyplot as plt
def polygons(infile):
    group = []
    for line in infile:
        line = line.strip()
        if line:
            coords = line[1:-1].split(',')
            group.append(map(float, coords))
        else:
            yield group
            group = []
    else:
        yield group

fig, ax = plt.subplots()
ax.ticklabel_format(useOffset=False)

with open('data.txt', 'r') as infile:
    for poly in polygons(infile):
        ax.plot(*zip(*poly))

plt.show()
In many of my python projects, I find myself having to go through a file, match lines against regexes, and then perform some computation on the basis of elements from the line extracted by regex.
In pseudo-C code, this is pretty easy:
while (read(line))
{
    if (m = matchregex(regex1, line))
    {
        /* munch on the components extracted in regex1 by accessing m */
    }
    else if (m = matchregex(regex2, line))
    {
        /* munch on the components extracted in regex2 by accessing m */
    }
    else if ...
    ...
    else
    {
        error("Unrecognized line format");
    }
}
However, because Python does not allow an assignment in the condition of an if (at least not before Python 3.8's assignment expressions; see below), this cannot be done so directly. One could first match against all the regexes and then branch on the various match objects, but that is neither elegant nor efficient.
What I find myself doing instead is including code like this at the base level of every project:
im=None
img=None
def imps(p,s):
global im
global img
im=re.search(p,s)
if im:
img=im.groups()
return True
else:
img=None
return False
Then I can work like this:
for line in open(file,'r').read().splitlines():
if imps(regex1,line):
# munch on contents of img
elsif imps(regex2,line):
# munch on contents of img
else:
error('Unrecognised line: {}'.format(line))
That works, is reasonably compact, and easy to type. But it is hardly beautiful; it uses global variables and is not thread safe (which has not been an issue for me so far).
But I'm sure others have run across this problem before and come up with an equally compact, but more python-y and generally superior solution. What is it?
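Since Python 3.8, assignment expressions (PEP 572, the "walrus" operator :=) allow exactly the C-style pattern from the question. A minimal sketch, where munch1 and munch2 stand in for whatever processing you do with the captured groups:
import re

for line in lines:
    if m := re.search(regex1, line):
        munch1(m.groups())   # hypothetical handler for regex1's captures
    elif m := re.search(regex2, line):
        munch2(m.groups())   # hypothetical handler for regex2's captures
    else:
        error('Unrecognised line: {}'.format(line))
On older Pythons, the approaches below still apply.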
Depends on the needs of the code.
A common choice I use is something like this:
import re

# note, order is important here. The first one to match will exit the processing
parse_regexps = [
    (re.compile(r"^foo"), handle_foo),
    (re.compile(r"^bar"), handle_bar),
]

for regexp, handler in parse_regexps:
    m = regexp.match(line)
    if m:
        handler(line)  # possibly other data too, like m.groups()
        break
else:
    error("Unrecognized format....")
This has the advantage of moving the handling code into clear and obvious functions which makes testing and change easy.
You can just use continue:
for line in file:
    m = re.match(re1, line)
    if m:
        # do stuff
        continue
    m = re.match(re2, line)
    if m:
        # do stuff
        continue
    raise BadLine
Another, less obvious, option is to have a function like this:
def match_any(subject, *regexes):
    for n, regex in enumerate(regexes):
        m = re.match(regex, subject)
        if m:
            return n, m
    return -1, None
and then:
for line in file:
    n, m = match_any(line, re1, re2)
    if n == 0:
        pass  # ....
    elif n == 1:
        pass  # ....
    else:
        raise BadLine
I am a beginner in Python (and in programming). I have a large file containing a repeating pattern: three lines with numbers, then one empty line, and so on...
If I print the file it looks like:
1.93202838
1.81608154
1.50676177
2.35787777
1.51866227
1.19643624
...
I want to take each group of three numbers as one vector, do some math operations on it, write the result to a new file, and move on to the next three lines, i.e. the next vector. So here is my code (it doesn't work):
import math

inF = open("data.txt", "r+")
outF = open("blabla.txt", "w")
a = []
fin = []
b = []
for line in inF:
    a.append(line)
    if line.startswith(" \n"):
        fin.append(b)
        h1 = float(fin[0])
        k2 = float(fin[1])
        l3 = float(fin[2])
        h = h1/(math.sqrt(h1*h1+k1*k1+l1*l1)+1)
        k = k1/(math.sqrt(h1*h1+k1*k1+l1*l1)+1)
        l = l1/(math.sqrt(h1*h1+k1*k1+l1*l1)+1)
        vector = [str(h), str(k), str(l)]
        outF.write('\n'.join(vector)
        b = a
        a = []
inF.close()
outF.close()
print "done!"
I want to get a "vector" from each 3 lines in my file and put it into the blabla.txt output file. Thanks a lot!
My 'code comment' answer:
take care to close all parentheses, in order to match the opened ones! (this is very likely to raise a SyntaxError ;-) )
fin is created as an empty list, and is never filled. Trying to call any value by fin[n] is therefore very likely to break with an IndexError;
k2 and l3 are created but never used;
k1 and l1 are not created but used, this is very likely to break with a NameError;
b is created as a copy of a, so is a list. But you do a fin.append(b): what do you expect in this case by appending (not extending) a list?
Hope this helps!
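For completeness, here is a minimal corrected sketch of the loop (assuming Python 2.7+, groups of three numbers separated by blank lines as in the question, and the same normalization formula):
import math

def write_vector(nums, outF):
    # normalize the three components the way the question's formulas do
    h1, k1, l1 = nums
    norm = math.sqrt(h1*h1 + k1*k1 + l1*l1) + 1
    outF.write('\n'.join(str(x / norm) for x in nums) + '\n')

with open("data.txt") as inF, open("blabla.txt", "w") as outF:
    nums = []
    for line in inF:
        if line.strip():
            nums.append(float(line))
        else:
            if len(nums) == 3:
                write_vector(nums, outF)
            nums = []
    if len(nums) == 3:  # handle a final group not followed by a blank line
        write_vector(nums, outF)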
This is only in the answers section for length and formatting.
See the Python tutorial sections on input and output and on control flow.
I know nothing of vectors; you might want to look into the math module or NumPy.
Those links should hopefully give you all the information you need to at least get started with this problem, as yuvi said, the code won't be written for you but you can come back when you have something that isn't working as you expected or you don't fully understand.
Here is the code:
import math
with open("test.stl") as file:
    vertices = [map(float, line.split()[1:4])
                for line in file
                if line.lstrip().startswith('vertex')]
    normals = [map(float, line.split()[2:5])
               for line in file
               if line.lstrip().startswith('facet')]
V = len(vertices)
ordering = []
N = len(normals)
for i in range(0, N):
    p1 = vertices[3*i]
    p2 = vertices[3*i+1]
    p3 = verticies[3*i+2]
    print p1
    x1 = p1[0]
    y1 = p1[1]
    z1 = p1[2]
    x2 = p2[0]
    y2 = p2[1]
    z2 = p2[2]
    x3 = p3[0]
    y3 = p3[1]
    z3 = p3[2]
    a = [x2-x1, y2-y1, z2-z1]
    b = [x3-x1, y3-y1, z3-z1]
    a1 = x2-x1
    a2 = y2-y1
    a3 = z2-z1
    b1 = x3-x1
    b2 = y3-y1
    b3 = z3-z1
    normal = normals[i]
    cross_vector = [a2*b3-a3*b2, a3*b1-a1*b3, a1*b2-a2*b1]
    if cross_vector == normal:
        ordering.append([i, i+1, i+2])
    else:
        ordering.append([i, i+2, i+1])
print ordering
print cross_vector
If I try to add print p1 (or any of the other variables, such as cross_vector) inside of the for loop, there are no errors but also no output, and if I try to print them outside of the for loop it says NameError: name '(variable name)' is not defined. So since none of these variables are being defined, my ordering array obviously prints as [] (blank). How can I change this? Do variables have to be declared before they are defined?
Edit: Here is the error output when the code above is run:
Traceback (most recent call last):
  File "convert.py", line 52, in <module>
    print cross_vector
NameError: name 'cross_vector' is not defined
As explained above this happens with any variable defined in the for loop, I am just using cross_vector as an example.
This line:
vertices = [map(float, line.split()[1:4])
            for line in file
            if line.lstrip().startswith('vertex')]
reads through all the lines in the file. After that, you're at the end of the file, and there's nothing left to read. So
normals = [map(float, line.split()[2:5])
           for line in file
           if line.lstrip().startswith('facet')]
is empty (normals == []). Thus
N=len(normals)
sets N to 0, meaning that this loop:
for i in range(0,N):
is never executed. That's why printing from inside it does nothing -- the loop isn't being run.
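A minimal fix along these lines is to rewind the file object before building the second list (a sketch, assuming the file supports seeking):
with open("test.stl") as file:
    vertices = [map(float, line.split()[1:4])
                for line in file
                if line.lstrip().startswith('vertex')]
    file.seek(0)  # rewind so the second comprehension sees the whole file again
    normals = [map(float, line.split()[2:5])
               for line in file
               if line.lstrip().startswith('facet')]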
To solve the problem diagnosed by DSM, use:
import math
import itertools

with open("test.stl") as file:
    i1, i2 = itertools.tee(file)
    vertices = [map(float, line.split()[1:4])
                for line in i1
                if line.lstrip().startswith('vertex')]
    normals = [map(float, line.split()[2:5])
               for line in i2
               if line.lstrip().startswith('facet')]
You might also want to try and drop the list comprehension, and work with iterators throughout, to save on memory for large files.
Edit:
At present, you load the entire file into memory, and then create two more full size lists in memory. Instead, you can write it in a way that only reads from the file in memory as required. As an example, we can replace the list comprehensions with generator comprehensions:
import math
import itertools

with open("test.stl") as file:
    i1, i2 = itertools.tee(file)
    vertexIter = (map(float, line.split()[1:4])
                  for line in i1
                  if line.lstrip().startswith('vertex'))
    normalIter = (map(float, line.split()[2:5])
                  for line in i2
                  if line.lstrip().startswith('facet'))
Here, almost nothing is read into memory up front; lines are only pulled from the file as the two iterators are consumed (tee buffers only the gap between them).
For this to be useful, you need to be able to replace your loop, from:
for i in range(0, N):
    p1 = vertices[3*i]
    p2 = vertices[3*i+1]
    p3 = vertices[3*i+2]
    normal = normals[i]
    # processing
To a single iterator:
for normal, p1, p2, p3 in myMagicIterator:
    # processing
One way I can think of doing this is to pass the same vertex iterator to izip three times; since izip pulls one item from each of its arguments, left to right, for every output tuple, each tuple consumes three consecutive vertices:
myMagicIterator = itertools.izip(
    normalIter, vertexIter, vertexIter, vertexIter
)
Which is the iterator equivalent of:
myNormalList = zip(normals, vertices[0::3], vertices[1::3], vertices[2::3])
Declare them outside of it (before the for loop) and see what happens. Even if it would be ok to declare them in the for loop, you would probably like to have a "default" value of them when the loop doesn't run.
And please try to post a much smaller example if necessary.
How to read n lines from a file instead of just one when iterating over it? I have a file which has well defined structure and I would like to do something like this:
for line1, line2, line3 in file:
    do_something(line1)
    do_something_different(line2)
    do_something_else(line3)
but it doesn't work:
ValueError: too many values to unpack
For now I am doing this:
for line in file:
    do_something(line)
    newline = file.readline()
    do_something_else(newline)
    newline = file.readline()
    do_something_different(newline)
    # ... etc.
which sucks, because I am writing endless 'newline = file.readline()' lines that clutter the code.
Is there any smart way to do this ? (I really want to avoid reading whole file at once because it is huge)
Basically, your file is an iterator which yields your file one line at a time. This turns your problem into how to yield several items at a time from an iterator. A solution to that is given in this question. Note that the function islice is in the itertools module, so you will have to import it from there.
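A sketch of that approach for groups of three (the filename here is hypothetical):
from itertools import islice

with open('data.txt') as f:
    while True:
        chunk = list(islice(f, 3))  # the next three lines; shorter (or empty) at EOF
        if len(chunk) < 3:
            break
        line1, line2, line3 = chunk
        # ... process the three lines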
If it's xml why not just use lxml?
You could use a helper function like this:
def readnlines(f, n):
    lines = []
    for x in range(0, n):
        lines.append(f.readline())
    return lines
Then you can do something like you want:
while True:
    line1, line2, line3 = readnlines(file, 3)
    do_stuff(line1)
    do_stuff(line2)
    do_stuff(line3)
That being said, if you are using xml files, you will probably be happier in the long run if you use a real xml parser...
itertools to the rescue:
import itertools
def grouper(n, iterable, fillvalue=None):
    "grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx"
    args = [iter(iterable)] * n
    return itertools.izip_longest(fillvalue=fillvalue, *args)

fobj = open(yourfile, "r")

for line1, line2, line3 in grouper(3, fobj):
    pass
for i in file produces a str, so you can't just do for i, j, k in file and read it in batches of three (try a, b, c = 'bar' and a, b, c = 'too many characters' and look at the values of a, b and c to work out why you get the "too many values to unpack").
It's not clear entirely what you mean, but if you're doing the same thing for each line and just want to stop at some point, then do it like this:
for line in file_handle:
    do_something(line)
    if some_condition:
        break  # Don't want to read anything else
(Also, don't use file as a variable name, you're shadowing a builtin.)
If you're doing the same thing for each line, why do you need to process multiple lines per iteration?
for line in file is your friend. It is in general much more efficient than manually reading the file, both in terms of I/O performance and memory.
Do you know something about the length of the lines/format of the data? If so, you could read in the first n bytes (say 80*3) and f.read(240).split("\n")[0:3].
If you want to be able to use this data over and over again, one approach might be to do this:
lines = []
for line in file_handle:
    lines.append(line)
This will give you a list of the lines, which you can then access by index. Also, when you say a HUGE file, the size is most likely trivial, because Python can process thousands of lines very quickly.
why can't you just do:
ctr = 0
for line in file:
    if ctr % 3 == 0:
        pass  # handle the first line of each group
    elif ctr % 3 == 1:
        pass  # handle the second line of each group
    ctr = ctr + 1
if you find the if/elif construct ugly you could just create a hash table or list of function pointers and then do:
for line in file:
    function_list[ctr % len(function_list)](line)
    ctr = ctr + 1
or something similar
It sounds like you are trying to read from disk in parallel... that is really hard to do. All the solutions given to you are realistic and legitimate. You shouldn't let something put you off just because the code "looks ugly". The most important thing is how efficient/effective it is; if the code is messy, you can tidy it up afterwards, but don't look for a whole new method of doing something just because you don't like what one way of doing it looks like in code.
As for running out of memory, you may want to check out pickle.
It's possible to do it with a clever use of the zip function. It's short, but a bit voodoo-ish for my tastes (hard to see how it works). It cuts off any lines at the end that don't fill a group, which may be good or bad depending on what you're doing. If you need the final lines, itertools.izip_longest might do the trick.
zip(*[iter(inputfile)] * 3)
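Why this works: the list multiplication repeats the same iterator object three times, and zip pulls one item from each of its arguments per output tuple, so every tuple consumes three consecutive lines. A quick illustration (Python 2, where zip returns a list):
>>> lines = iter(['a\n', 'b\n', 'c\n', 'd\n', 'e\n', 'f\n'])
>>> zip(*[lines] * 3)
[('a\n', 'b\n', 'c\n'), ('d\n', 'e\n', 'f\n')]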
Doing it more explicitly and flexibly, this is a modification of Mats Ekberg's solution:
def groupsoflines(f, n):
    while True:
        group = []
        for i in range(n):
            try:
                group.append(next(f))
            except StopIteration:
                if group:
                    tofill = n - len(group)
                    yield group + [None] * tofill
                return
        yield group

for line1, line2, line3 in groupsoflines(inputfile, 3):
    ...
N.B. If this runs out of lines halfway through a group, it will fill in the gaps with None, so that you can still unpack it. So, if the number of lines in your file might not be a multiple of three, you'll need to check whether line2 and line3 are None.