python style question around reading small files

What is the most pythonic way to read in a named file, strip lines that are either empty, contain only spaces, or have # as a first character, and then process remaining lines? Assume it all fits easily in memory.
Note: it's not tough to do this -- what I'm asking is for the most pythonic way. I've been writing a lot of Ruby and Java and have lost my feel.
Here's a strawman:
file_lines = [line.strip() for line in open(config_file, 'r').readlines() if len(line.strip()) > 0]
for line in file_lines:
    if line[0] == '#':
        continue
    # Do whatever with line here.
I'm interested in concision, but not at the cost of becoming hard to read.

Generators are perfect for tasks like this. They are readable, maintain perfect separation of concerns, and are efficient in both memory use and time.
def RemoveComments(lines):
    for line in lines:
        if not line.strip().startswith('#'):
            yield line

def RemoveBlankLines(lines):
    for line in lines:
        if line.strip():
            yield line
Now applying these to your file:
filehandle = open('myfile', 'r')
for line in RemoveComments(RemoveBlankLines(filehandle)):
    Process(line)
In this case, it's pretty clear that the two generators can be merged into a single one, but I left them separate to demonstrate their composability.
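For reference, a minimal sketch of the merged version (same behaviour, one pass):
def RemoveBlankLinesAndComments(lines):
    # Skip lines that are empty, all-whitespace, or whose first
    # non-whitespace character is '#'.
    for line in lines:
        stripped = line.strip()
        if stripped and not stripped.startswith('#'):
            yield line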

lines = [r for r in open(thefile) if not r.isspace() and r[0] != '#']
The .isspace() method of strings is by far the best way to test if a string is entirely whitespace -- no need for contortions such as len(r.strip()) == 0 (ech;-).
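One caveat worth spelling out: ''.isspace() is False for the empty string, but that doesn't matter here, because lines read from a file always carry their trailing newline, so a "blank" line arrives as '\n':
>>> ''.isspace()        # empty string: no characters, so not "all space"
False
>>> '\n'.isspace()      # a blank line from a file still holds its newline
True
>>> '   \t\n'.isspace()
True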

for line in open("file"):
    sline = line.strip()
    if sline and sline[0] != "#":
        print sline

Output:
$ cat file
one
#
#
two
three
$ ./python.py
one
two
three

I would use this:
processed = [process(line.strip())
             for line in open(config_file, 'r')
             if line.strip() and not line.strip().startswith('#')]
The only ugliness I see here is all the repeated stripping. Getting rid of it complicates the function a bit:
processed = [process(line)
             for line in (line.strip() for line in open(config_file, 'r'))
             if line and not line.startswith('#')]

This matches the description, i.e.

    "strip lines that are either empty, contain only spaces, or have # as a first character, and then process remaining lines"

So lines that start or end in spaces are passed through unfettered.
with open("config_file","r") as fp:
data = (line for line in fp if line.strip() and not line.startswith("#"))
for item in data:
print repr(item)

I like Paul Hankin's thinking, but I'd do it differently:
from itertools import ifilter, ifilterfalse, imap

with open(r'c:\temp\testfile.txt', 'rb') as f:
    s1 = ifilterfalse(str.isspace, f)
    s2 = ifilter(lambda x: not x.startswith('#'), s1)
    s3 = imap(str.rstrip, s2)
    print "\n".join(s3)
I'd probably only do it this way instead of using some of the more obvious approaches suggested here if I were concerned about memory usage. And I might define an iscomment function to eliminate the lambda.
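A sketch of that variant, reusing the same imports (ifilterfalse drops the comment lines, so the lambda disappears):
def iscomment(line):
    # True for lines whose first character is '#'
    return line.startswith('#')

with open(r'c:\temp\testfile.txt', 'rb') as f:
    s = imap(str.rstrip, ifilterfalse(iscomment, ifilterfalse(str.isspace, f)))
    print "\n".join(s)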

The file is small, so performance is not really an issue. I would go for clarity over conciseness:
fp = open('file.txt')
for line in fp:
    line = line.strip()
    if line and not line.startswith('#'):
        pass  # process line here
fp.close()
If you want, you can wrap this in a function.

Using slightly newer idioms (or, in Python 2.5, from __future__ import with_statement) you could do this, which has the advantage of cleaning up safely yet is quite concise.
with open('file.txt') as fp:
    for line in fp:
        line = line.strip()
        if not line or line[0] == '#':
            continue
        # rest of processing here
Note that stripping the line first means the check for "#" will actually reject lines with that as the first non-blank, not merely "as first character". Easy enough to modify if you're strict about that.
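If you do want the strict "first character" reading, a minimal variant tests the raw line before stripping:
with open('file.txt') as fp:
    for line in fp:
        if line.startswith('#'):  # '#' as the literal first character only
            continue
        line = line.strip()
        if not line:
            continue
        # rest of processing here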


splitlines() and iterating over an opened file give different results

I have files with sometimes weird end-of-lines characters like \r\r\n. With this, it works like I want:
with open('test.txt', 'wb') as f:  # simulate a file with weird end-of-lines
    f.write(b'abc\r\r\ndef')

with open('test.txt', 'rb') as f:
    for l in f:
        print(l)
# b'abc\r\r\n'
# b'def'
I want to be able to get the same result from a string. I thought about splitlines, but it does not give the same result:
print(b'abc\r\r\ndef'.splitlines())
# [b'abc', b'', b'def']
Even with keepends=True, it's not the same result.
Question: how can I get the same behaviour as for l in f using splitlines()?
Linked: Changing str.splitlines to match file readlines and https://bugs.python.org/issue22232
Note: I don't want to put everything in a BytesIO or StringIO, because that roughly halves the speed (already benchmarked); I want to keep a simple string. So it's not a duplicate of How do I wrap a string in a file in Python?.
Why don't you just split it:
input = b'\nabc\r\r\r\nd\ref\nghi\r\njkl'
result = input.split(b'\n')
print(result)
[b'', b'abc\r\r\r', b'd\ref', b'ghi\r', b'jkl']
You will lose the trailing \n on each line; it can be re-added to every line afterwards if you really need it, with a check on the last line to see whether one is actually needed. Like this:
fixed = [bstr + b'\n' for bstr in result]
if not input.endswith(b'\n'):  # in Python 3, input[-1] is an int, so use endswith to compare
    fixed[-1] = fixed[-1][:-1]
print(fixed)
[b'\n', b'abc\r\r\r\n', b'd\ref\n', b'ghi\r\n', b'jkl']
Another variant uses a generator. This way it stays memory-friendly on huge files, and the syntax stays close to the original: for l in bin_split(input):
def bin_split(input_str):
    start = 0
    while start >= 0:
        found = input_str.find(b'\n', start) + 1
        if 0 < found < len(input_str):
            yield input_str[start:found]
            start = found
        else:
            yield input_str[start:]
            break
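Used on the sample input, it yields the same pieces as the list-based version above:
>>> list(bin_split(b'\nabc\r\r\r\nd\ref\nghi\r\njkl'))
[b'\n', b'abc\r\r\r\n', b'd\ref\n', b'ghi\r\n', b'jkl']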
There are a couple ways to do this, but none are especially fast.
If you want to keep the line endings, you might try the re module:
lines = re.findall(r'[\r\n]+|[^\r\n]+[\r\n]*', text)
# or equivalently
line_split_regex = re.compile(r'[\r\n]+|[^\r\n]+[\r\n]*')
lines = line_split_regex.findall(text)
If you need the endings and the file is really big, you may want to iterate instead:
for r in re.finditer(r'[\r\n]+|[^\r\n]+[\r\n]*', text):
    line = r.group()
    # do stuff with line here
If you don't need the endings, then you can do it much more easily:
lines = list(filter(None, text.splitlines()))
You can omit the list() part if you just iterate over the results (or if you're using Python 2):
for line in filter(None, text.splitlines()):
    pass  # do stuff with line
I would iterate through like this:
text = 'abc\r\r\ndef'
results = text.split('\r\r\n')
for r in results:
    print(r)
This is a for l in f: solution:
The key to this is the newline argument on the open call. From the documentation:
[Screenshot of the documentation for the newline argument of open().]
Therefore, you should use newline='' when writing to suppress newline translation and then when reading use newline='\n', which will work if all your lines terminate with 0 or more '\r' characters followed by a '\n' character:
with open('test.txt', 'w', newline='') as f:
    f.write('abc\r\r\ndef')

with open('test.txt', 'r', newline='\n') as f:
    for line in f:
        print(repr(line))
Prints:
'abc\r\r\n'
'def'
A quasi-splitlines solution:
Strictly speaking, this is not a splitlines solution: handling arbitrary line endings that way would require a regular-expression version of split that captures the line endings and then re-assembles each line with its ending. Instead, this solution uses a regular expression to break up the input text directly, allowing line endings consisting of any number of '\r' characters followed by a '\n' character:
import re

input = '\nabc\r\r\ndef\nghi\r\njkl'
with open('test.txt', 'w', newline='') as f:
    f.write(input)

with open('test.txt', 'r', newline='') as f:
    text = f.read()

lines = re.findall(r'[^\r\n]*\r*\n|[^\r\n]+$', text)
for line in lines:
    print(repr(line))
Prints:
'\n'
'abc\r\r\n'
'def\n'
'ghi\r\n'
'jkl'

Python, select a specific line

I want to read a text file using Python and print out specific lines. I know how to print a line that starts with the word "nominal"; the problem is that I also want to print the line following it, and that line isn't identifiable by any specific string. Could you point me to some lines of code that are able to do that?
In good faith, and under the assumption that this will help you start coding and show some effort, here you go:
file_to_read = r'myfile.txt'

with open(file_to_read, 'r') as f_in:
    flag = False
    for line in f_in:
        if line.startswith('nominal'):
            print(line)
            flag = True  # remember to print the following line too
        elif flag:
            print(line)
            flag = False
It might work out-of-the-box, but please try to spend some time going through it and you will definitely get the logic behind it. Note that text comparison in Python is case-sensitive.
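If you ever need a case-insensitive match, a small tweak (my sketch, not part of the original question) is to lower-case each line before testing:
with open(file_to_read, 'r') as f_in:
    for line in f_in:
        # 'nominal', 'Nominal', 'NOMINAL', ... all match
        if line.lower().startswith('nominal'):
            print(line)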
If the file isn't too large, you can put it all in a list:
def printLines(fname):
    with open(fname) as f:
        lines = f.read().split('\n')
    if len(lines) == 0:
        return None
    if lines[0].startswith('Nominal'):
        print(lines[0])
    # enumerate over lines[1:] starts i at 0, so lines[i] is the previous line
    for i, line in enumerate(lines[1:]):
        if lines[i].startswith('Nominal') or line.startswith('Nominal'):
            print(line)
Then e.g. printLines('test.txt') will do what you want.

Improving the speed of a python script

I have an input file containing a list of strings.
I am iterating through every fourth line starting on line two.
From each of these lines I make a new string from the first and last 6 characters and put this in an output file only if that new string is unique.
The code I wrote to do this works, but I am working with very large deep-sequencing files, and it has been running for a day without making much progress. So I'm looking for any suggestions to make this much faster if possible. Thanks.
def method():
    target = open(output_file, 'w')
    with open(input_file, 'r') as f:
        lineCharsList = []
        for line in f:
            # Make string from first and last 6 characters of a line
            lineChars = line[0:6] + line[145:151]
            if not (lineChars in lineCharsList):
                lineCharsList.append(lineChars)
                target.write(lineChars + '\n')  # If string is unique, write to output file
            for skip in range(3):  # Used to step through four lines at a time
                try:
                    check = line  # Check for additional lines in file
                    next(f)
                except StopIteration:
                    break
    target.close()
Try defining lineCharsList as a set instead of a list:
lineCharsList = set()
...
lineCharsList.add(lineChars)
That'll improve the performance of the in operator. Also, if memory isn't a problem at all, you might want to accumulate all the output in a list and write it all at the end, instead of performing multiple write() operations.
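A sketch of both suggestions combined, keeping the original slicing and line-skipping logic:
def method():
    seen = set()
    out_lines = []
    with open(input_file, 'r') as f:
        for line in f:
            lineChars = line[0:6] + line[145:151]
            if lineChars not in seen:  # O(1) membership test on a set
                seen.add(lineChars)
                out_lines.append(lineChars + '\n')
            for skip in range(3):  # step through four lines at a time
                try:
                    next(f)
                except StopIteration:
                    break
    with open(output_file, 'w') as target:
        target.writelines(out_lines)  # one bulk write at the end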
You can use itertools.islice (https://docs.python.org/2/library/itertools.html#itertools.islice):
import itertools

def method():
    with open(input_file, 'r') as inf, open(output_file, 'w') as ouf:
        seen = set()
        for line in itertools.islice(inf, None, None, 4):
            # strip the newline so the last 6 characters are real data
            s = line[:6] + line.rstrip('\n')[-6:]
            if s not in seen:
                seen.add(s)
                ouf.write("{}\n".format(s))
Besides using set as Oscar suggested, you can also use islice to skip lines rather than use a for loop.
As stated in this post, islice preprocesses the iterator in C, so it should be much faster than using a plain vanilla python for loop.
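If you want to verify that claim on your own machine, a rough timeit sketch (the numbers will vary with hardware and data):
import itertools
import timeit

data = list(range(1000000))

def with_islice():
    return sum(1 for _ in itertools.islice(iter(data), None, None, 4))

def with_loop():
    it = iter(data)
    count = 0
    for item in it:
        count += 1
        for _ in range(3):  # manually skip the next three items
            try:
                next(it)
            except StopIteration:
                break
    return count

print(timeit.timeit(with_islice, number=10))
print(timeit.timeit(with_loop, number=10))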
Try replacing
lineChars = line[0:6]+line[145:151]
with
lineChars = ''.join([line[0:6], line[145:151]])
as it can be more efficient, depending on the circumstances.

Python: Concise / elegant way to reformat a set of text files?

I have written a Python script to process a set of ASCII files within a given dir. I wonder if there is a more concise and/or "pythonesque" way to do it, without losing readability?
Python Code
import os
import fileinput
import glob
import string

indir = './'
outdir = './processed/'

for filename in glob.glob(indir + '*.asc'):  # get a list of input ASCII files to be processed
    fin = open(indir + filename, 'r')    # input file
    fout = open(outdir + filename, 'w')  # out: processed file
    lines = iter(fileinput.input([indir + filename]))  # iterator over all lines in the input file
    fout.write(next(lines))  # just copy the first line (the header) to output
    for line in lines:
        val = iter(string.split(line, ' '))
        fout.write('{0:6.2f}'.format(float(val.next())))  # first value in the line has its own format
        for x in val:  # iterate over the rest of the numbers in the line
            fout.write('{0:10.6f}'.format(float(val.next())))  # the rest of the values have a different format
        fout.write('\n')
    fin.close()
    fout.close()
An example:
Input:
;;; This line is the header line
-5.0 1.090074154029272 1.0034662411357929 0.87336062116561186 0.78649408279093869 0.65599958665017222 0.4379879132749317 0.26310799350679176 0.087808018565486673
-4.9900000000000002 1.0890770415316042 1.0025480136545413 0.87256100700428996 0.78577373527626004 0.65539842673645277 0.43758616966566649 0.26286647978335914 0.087727357602906453
-4.9800000000000004 1.0880820021223023 1.0016316956763136 0.87176305623792771 0.78505488659611744 0.65479851808106115 0.43718526271594083 0.26262546925502467 0.087646864773454014
-4.9700000000000006 1.0870890372077564 1.0007172884938402 0.87096676998908273 0.78433753775986659 0.65419986152386733 0.4367851929843618 0.26238496225635727 0.087566540188423345
-4.9600000000000009 1.086098148170821 0.99980479337809591 0.87017214936140763 0.78362168975984026 0.65360245789061966 0.4363859610200459 0.26214495911617541 0.087486383957276398
Processed:
;;; This line is the header line
-5.00 1.003466 0.786494 0.437988 0.087808
-4.99 1.002548 0.785774 0.437586 0.087727
-4.98 1.001632 0.785055 0.437185 0.087647
-4.97 1.000717 0.784338 0.436785 0.087567
-4.96 0.999805 0.783622 0.436386 0.087486
Other than a few minor changes, due to how Python has changed through time, this looks fine.
You're mixing two different styles of next(); the old way was it.next() and the new is next(it). You should use the string method split() instead of going through the string module (that module is there mostly for backwards compatibility with Python 1.x). There's no need to go through the almost-useless fileinput module, since open file handles are also iterators (that module comes from a time before Python's file handles were iterators).
Edit: As @codeape pointed out, glob() returns the full path. Your code would not have worked if indir was something other than "./". I've changed the following to use the correct listdir/os.path.join solution. I'm also more familiar with "%" string interpolation than string formatting.
Here's how I would write this in more idiomatic modern Python:
def reformat(fin, fout):
    fout.write(next(fin))  # just copy the first line (the header) to output
    for line in fin:
        fields = line.split(' ')
        # Make a format string specific to the number of fields
        fmt = '%6.2f' + ('%10.6f' * (len(fields) - 1)) + '\n'
        fout.write(fmt % tuple(map(float, fields)))

# only the .asc files, matching the original glob('*.asc')
basenames = [f for f in os.listdir(indir) if f.endswith('.asc')]
for basename in basenames:
    input_filename = os.path.join(indir, basename)
    output_filename = os.path.join(outdir, basename)
    with open(input_filename, 'r') as fin, open(output_filename, 'w') as fout:
        reformat(fin, fout)
The Zen of Python says "There should be one-- and preferably only one --obvious way to do it". It's interesting how functions which, for the last 10+ years, were "obviously" the right solution are no longer. :)
fin = open(indir + filename, 'r')    # input file
fout = open(outdir + filename, 'w')  # out: processed file
# code
fin.close()
fout.close()
can be written as:
with open(indir + filename, 'r') as fin, open(outdir + filename, 'w') as fout:
    # code
In Python 2.6, you can use:
with open(indir + filename, 'r') as fin:
    with open(outdir + filename, 'w') as fout:
        # code
And the line
lines = iter(fileinput.input([indir+filename]))
is useless. You can just iterate over an open file (fin in your case).
You can also do line.split(' ') instead of string.split(line, ' ')
If you change those things, there is no need to import string and fileinput.
Edit: I didn't know you could use inline code. That's cool.
In my build script, I have this code:
inFile = open(sourceFile, 'r')
outFile = open(targetFile, 'w')

for line in inFile:
    line = doKeywordSubstitution(line)
    outFile.write(line)

inFile.close()
outFile.close()
I don't know of a way to make this any more concise. Putting the line-changing logic in a different function looks neater to me though.
I may be missing the point of your code, but I don't understand why you have lines = iter(fileinput.input([indir+filename])).
I don't understand why you use string.split(line, ' ') instead of just line.split(' ').
Well maybe I would write the string-processing part like this:
values = line.split(' ')
values[0] = '{0:6.2f}'.format(float(values[0]))
values[1:] = ['{0:10.6f}'.format(float(v)) for v in values[1:]]
fout.write(' '.join(values) + '\n')  # re-add the newline consumed by the float formatting
At least for me this looks better but this might be subjective :)
Instead of indir I would use os.curdir. Instead of "./processed" I would do: os.path.join(os.curdir, 'processed').

How do I write only certain lines to a file in Python?

I have a file that looks like this (I have to put it in a code box so it resembles the file):
text
(starts with parentheses)
	tabbed info
text
(starts with parentheses)
	tabbed info
...repeat
I want to grab only "text" lines from the file (or every fourth line) and copy them to another file. This is the code I have, but it copies everything to the new file:
import sys

def process_file(filename):
    output_file = open("data.txt", 'w')
    input_file = open(filename, "r")
    for line in input_file:
        line = line.strip()
        if not line.startswith("(") or line.startswith(""):
            output_file.write(line)
    output_file.close()

if __name__ == "__main__":
    process_file(sys.argv[1])
The reason your script copies every line is that line.startswith("") is True, no matter what line equals.
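You can check this in the interpreter:
>>> "anything".startswith("")
True
>>> "".startswith("")
True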
You might try using isspace to test if line begins with a space:
def process_file(filename):
    with open("data.txt", 'w') as output_file:
        with open(filename, "r") as input_file:
            for line in input_file:
                line = line.rstrip()
                # keep only lines that start with neither "(" nor whitespace
                if not (line.startswith("(") or line[:1].isspace()):
                    output_file.write(line + '\n')
with open('data.txt', 'w') as of:
    of.write(''.join(textline
                     for textline in open(filename)
                     if textline[0] not in ' \t('))
To write every fourth line use slice result[::4]
with open('data.txt', 'w') as of:
    of.write(''.join([textline
                      for textline in open(filename)
                      if textline[0] not in ' \t('][::4]))
I don't need to rstrip the newlines, since I write the lines out with them intact.
In addition to line.startswith("") always being true, line.strip() will remove the leading tab, forcing the tabbed data to be written as well. Change it to line.rstrip() and use \t to test for a tab. That part of your code should look like:
line = line.rstrip()
if not line.startswith(('(', '\t')):
    # ...
In response to your question in the comments:
# edited in response to comments in post
for i, line in enumerate(input_file):
    if i % 4 == 0:
        output_file.write(line)
Try:
    if not line.startswith("(") and not line.startswith("\t"):
without doing line.strip() (line.strip() would remove the tabs).
So the issue is that (1) you are misusing boolean logic, and (2) every possible line starts with "".
First, the boolean logic:
The way the or operator works is that it returns True if either of its operands is True. The operands are "not line.startswith('(')" and "line.startswith('')". Note that the not only applies to one of the operands. If you want to apply it to the total result of the or expression, you will have to put the whole thing in parentheses.
The second issue is your use of the startswith() method with a zero-length string as an argument. This essentially says "match any string where the first zero characters are nothing", which matches any string you could give it.
See other answers for what you should be doing here.
