Elegant way to skip first line when using python fileinput module? - python

Is there an elegant way of skipping the first line of a file when using the python fileinput module?
I have a data file with nicely formatted data, but the first line is a header. Using fileinput I would have to include a check and discard the line if it does not seem to contain data.
The problem is that the same check would then be applied to every remaining line of the file.
With read() you can open the file, read the first line, and then loop over the rest of the file. Is there a similar trick with fileinput?
Is there an elegant way to skip processing of the first line?
Example code:
import fileinput

# how to skip first line elegantly?
for line in fileinput.input(["file.dat"]):
    data = process_line(line)
    output(data)

lines = iter(fileinput.input(["file.dat"]))
next(lines)  # extract and discard the first line
for line in lines:
    data = process_line(line)
    output(data)
Or use the itertools.islice way if you prefer:
import itertools

finput = fileinput.input(["file.dat"])
lines = itertools.islice(finput, 1, None)  # cuts off the first line
dataset = (process_line(line) for line in lines)
results = [output(data) for data in dataset]
Since everything here is a generator or an iterator, no intermediate list of lines is built.

The fileinput module contains a bunch of handy functions, one of which seems to do exactly what you're looking for:
for line in fileinput.input(["file.dat"]):
    if not fileinput.isfirstline():
        data = process_line(line)
        output(data)
fileinput module documentation

It's right in the docs: http://docs.python.org/library/fileinput.html#fileinput.isfirstline
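A minimal sketch of the same idea, reusing the process_line and output helpers from the question:
import fileinput

for line in fileinput.input(["file.dat"]):
    if fileinput.isfirstline():  # True for the first line of each file in the list
        continue                 # skip the header
    output(process_line(line))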

One option is to use openhook:
The openhook, when given, must be a function that takes two arguments,
filename and mode, and returns an accordingly opened file-like object.
You cannot use inplace and openhook together.
One could create a helper function skip_header and use it as the openhook, something like:
import fileinput

files = ['file_1', 'file_2']

def skip_header(filename, mode):
    f = open(filename, mode)
    next(f)  # consume the header line
    return f

for line in fileinput.input(files=files, openhook=skip_header):
    ...  # do something with line
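Because the hook is called once for every file that fileinput opens, this skips the header line of each file in the list.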

Do two loops, where the first one breaks immediately.
with fileinput.input(files=files, mode='rU', inplace=True) as f:
    for line in f:
        # add print() here if you only want to empty the line
        break
    for line in f:
        process(line)
Let's say you want to remove or empty all of the first 5 lines.
with fileinput.input(files=files, mode='rU', inplace=True) as f:
    for line in f:
        # add print() here if you only want to empty the first 5 lines
        if f.filelineno() == 5:
            break
    for line in f:
        process(line)
But if you only want to get rid of the first line, just call next once before the loop.
with fileinput.input(files=files, mode='rU', inplace=True) as f:
    next(f)
    for line in f:
        process(line)

with open(file) as j:            # open the file as j
    for i in j.readlines()[1:]:  # readlines() reads every line; the slice starts from the second one
        process(i)               # placeholder for whatever you do with each remaining line
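Bear in mind that readlines() loads the whole file into memory first, so for very large files the iterator-based approaches above are preferable.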

Related

How do I split each line into two strings and print without the comma?

I'm trying to get the output without commas, separating each line into two strings and printing them.
My code so far yields:
173,70
134,63
122,61
140,68
201,75
222,78
183,71
144,69
But I'd like it to print the output without the comma, with the two values on each line separated as strings.
if __name__ == '__main__':
    # Complete main section of code
    file_name = "data.txt"
    # Open the file for reading here
    my_file = open('data.txt')
    lines = my_file.read()
    with open('data.txt') as f:
        for line in f:
            lines.split()
            lines.replace(',', ' ')
    print(lines)
In your sample code, lines contains the full content of the file as a str.
my_file = open('data.txt')
lines = my_file.read()
You then later re-open the file to iterate the lines:
with open('data.txt') as f:
    for line in f:
        lines.split()
        lines.replace(',', ' ')
Note, however, that str.split and str.replace do not modify the existing value, since str objects in Python are immutable. Also note that you are operating on lines there, rather than on the for-loop variable line.
Instead, you'll need to assign the result of those functions to new names, or pass them as arguments (e.g., to print). So you'll want to open the file, iterate over the lines, and print each value with the "," replaced by a " ":
with open("data.txt") as f:
    for line in f:
        print(line.replace(",", " "), end="")  # end="" because the line already ends with a newline
Or, since you are operating on the whole file anyway:
with open("data.txt") as f:
    print(f.read().replace(",", " "))
Or, as your file appears to be CSV content, you may wish to use the csv module from the standard library instead:
import csv

with open("data.txt", newline="") as csvfile:
    for row in csv.reader(csvfile):
        print(*row)
with open('data.txt', 'r') as f:
    for line in f:
        for value in line.split(','):
            print(value)
While Python offers several ways to open files, this is the preferred one for working with them: the file is read lazily (which matters especially for large files), and after leaving the with block (the indented block) the underlying file handle is closed automatically.
Here we open the file in read mode. File objects follow the iterator protocol, so we can iterate over them like lists; each item is one real line of the file, as a string.
After getting the line in the line variable, we split it (see str.split()) into two tokens, one before the comma and one after it. split returns a newly constructed list of strings. If you need to remove unwanted characters, you can use the str.strip() method; strip and split are often combined.
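For example, a small sketch of combining strip and split on one of the sample lines from the question:
raw = "173,70\n"
parts = raw.strip().split(",")  # drop the trailing newline, then split on the comma
print(parts[0], parts[1])       # -> 173 70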
elegant and efficient file reading - method 1
with open("data.txt", 'r') as io:
    for line in io:
        sl = line.split(',')  # now sl is a list of strings
        print("{} {}".format(sl[0], sl[1]))  # use format to print the two values on the screen
non-elegant, but efficient file reading - method 2
fp = open("data.txt", 'r')
while (line := fp.readline()) != '':  # an empty string means EOF has been reached; the walrus operator needs Python 3.8+
    sl = line.split(',')
    print("{} {}".format(sl[0], sl[1]))
fp.close()

Python skip operation on first line of file

I have a csv file which looks like following:
CCC;reserved;reserved;pIndex;wedgeWT;NA;NA;NA;NA;NA;xOffset;yOffset;zOffset
0.10089,0,0,1,0,0,0,0,0,0,-1.8,-0.7,1999998
0.1124,0,0,3,0,0,0,0,0,0,-1.2,1.8,-3.9
I am using the fileinput module to do some operations on the file in place, but I want to skip the operation on the first (header) line while still keeping it in the file. I have tried using next(f) and f.isfirstline(), but they delete the header line. I want to keep the header line intact, just without doing any operation on it.
with fileinput.input(inplace=True) as f:
    # skip line
    for line in f:
        .
        .
You can use enumerate to easily keep track of the line numbers:
for linenum, line in enumerate(f):
    if linenum == 0:
        # 'line' is the header line
        continue
    # 'line' is a data line
    # ...
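In the inplace=True setup from the question the header still has to be written back out so that it stays in the file; a minimal sketch (process() is a placeholder for whatever operation you apply to the data lines, and is assumed to return the full transformed line):
import fileinput

with fileinput.input(inplace=True) as f:
    for linenum, line in enumerate(f):
        if linenum == 0:
            print(line, end='')           # write the header through unchanged
        else:
            print(process(line), end='')  # process only the data lines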
You can use iter and skip it with next:
with fileinput.input(inplace=True) as f:
    iterF = iter(f)
    print(next(iterF), end='')  # skip the computation for the header, but still print it back to the file
    for line in iterF:
        ...
This way you only pay the overhead of creating the iterator once, and you neither track indexes nor evaluate an if on every iteration of the loop, as in #JonathonReinhart's solution (which is also valid).

Read in every line that starts with a certain character from a file

I am trying to read in every line in a file that starts with an 'X:'. I don't want to read the 'X:' itself, just the rest of the line that follows.
with open("hnr1.abc","r") as file: f = file.read()
id = []
for line in f:
    if line.startswith("X:"):
        id.append(f.line[2:])
print(id)
It doesn't have any errors but it doesn't print anything out.
try this:
with open("hnr1.abc","r") as fi:
    id = []
    for ln in fi:
        if ln.startswith("X:"):
            id.append(ln[2:])
print(id)
Don't use names like file or line (file in particular shadows a built-in name in Python 2).
Note that the append uses the loop variable ln on its own, not as an attribute of the file object.
By pre-reading the whole file into one string, the original for loop was iterating over the data character by character rather than line by line.
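A quick illustration of that difference, using a couple of made-up lines:
text = "X:1\nT:Some tune\n"
print([item for item in text][:5])           # iterating a str yields characters: ['X', ':', '1', '\n', 'T']
print([item for item in text.splitlines()])  # iterating lines: ['X:1', 'T:Some tune']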
for line in f:
    search = line.split()
    if search[0] == "X:":  # assumes a space follows "X:", so the first token is exactly "X:"
        storagearray.extend(search)
That should give you an array of all the lines you want, but they'll be split into separate words. Also, you'll need to have defined storagearray before we call it in the above block of code. It's an inelegant solution, as I'm a learner myself, but it should do the job!
Edit: if you want to output the lines, simply use Python's built-in print function:
print(storagearray)
Read every line in the file (for loop).
Select the lines that contain X:.
Slice the line from index 0, keeping the whole line including the leading X:, with X_ln = ln[0:].
Print the lines that begin with X:.
for ln in input_file:
    if ln.startswith('X:'):
        X_ln = ln[0:]
        print(X_ln)
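(If you want to drop the leading 'X:' itself, as the question asks, slice from index 2 instead: X_ln = ln[2:].)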

Improving the speed of a python script

I have an input file containing a list of strings.
I am iterating through every fourth line starting on line two.
From each of these lines I make a new string from the first and last 6 characters and put this in an output file only if that new string is unique.
The code I wrote to do this works, but I am working with very large deep sequencing files, and it has been running for a day without making much progress. So I'm looking for any suggestions to make this much faster, if possible. Thanks.
def method():
    target = open(output_file, 'w')
    with open(input_file, 'r') as f:
        lineCharsList = []
        for line in f:
            # Make string from first and last 6 characters of a line
            lineChars = line[0:6]+line[145:151]
            if not (lineChars in lineCharsList):
                lineCharsList.append(lineChars)
                target.write(lineChars + '\n')  # If string is unique, write to output file
            for skip in range(3):  # Used to step through four lines at a time
                try:
                    check = line  # Check for additional lines in file
                    next(f)
                except StopIteration:
                    break
    target.close()
Try defining lineCharsList as a set instead of a list:
lineCharsList = set()
...
lineCharsList.add(lineChars)
That'll improve the performance of the in operator. Also, if memory isn't a problem at all, you might want to accumulate all the output in a list and write it all at the end, instead of performing multiple write() operations.
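A sketch of that buffering idea, keeping the rest of the original logic (input_file and output_file are the names from the question):
def method():
    seen = set()
    out_lines = []
    with open(input_file, 'r') as f:
        for line in f:
            line_chars = line[0:6] + line[145:151]
            if line_chars not in seen:
                seen.add(line_chars)
                out_lines.append(line_chars + '\n')
            for _ in range(3):  # step over the next three lines
                try:
                    next(f)
                except StopIteration:
                    break
    with open(output_file, 'w') as target:
        target.writelines(out_lines)  # one bulk write at the end instead of many small writes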
You can use https://docs.python.org/2/library/itertools.html#itertools.islice:
import itertools

def method():
    with open(input_file, 'r') as inf, open(output_file, 'w') as ouf:
        seen = set()
        for line in itertools.islice(inf, None, None, 4):
            s = line[:6] + line.rstrip('\n')[-6:]  # strip the newline so the last six data characters are used
            if s not in seen:
                seen.add(s)
                ouf.write("{}\n".format(s))
Besides using set as Oscar suggested, you can also use islice to skip lines rather than use a for loop.
As stated in this post, islice preprocesses the iterator in C, so it should be much faster than using a plain vanilla python for loop.
Try replacing
lineChars = line[0:6]+line[145:151]
with
lineChars = ''.join([line[0:6], line[145:151]])
as it can be more efficient, depending on the circumstances.

How to only read lines in a text file after a certain string?

I'd like to read into a dictionary all of the lines in a text file that come after a particular string. I'd like to do this over thousands of text files.
I'm able to identify and print out the particular string ('Abstract') using the following code (gotten from this answer):
for files in filepath:
    with open(files, 'r') as f:
        for line in f:
            if 'Abstract' in line:
                print line;
But how do I tell Python to start reading the lines that only come after the string?
Just start another loop when you reach the line you want to start from:
for files in filepath:
    with open(files, 'r') as f:
        for line in f:
            if 'Abstract' in line:
                for line in f:  # now you are at the lines you want
                    ...  # do work
A file object is its own iterator, so when we reach the line with 'Abstract' in it, we continue the iteration from the line after it until the iterator is exhausted.
A simple example:
gen = (n for n in range(8))
for x in gen:
    if x == 3:
        print('Starting second loop')
        for x in gen:
            print('In second loop', x)
    else:
        print('In first loop', x)
Produces:
In first loop 0
In first loop 1
In first loop 2
Starting second loop
In second loop 4
In second loop 5
In second loop 6
In second loop 7
You can also use itertools.dropwhile to consume the lines up to the point you want:
from itertools import dropwhile

for files in filepath:
    with open(files, 'r') as f:
        dropped = dropwhile(lambda _line: 'Abstract' not in _line, f)
        next(dropped, '')
        for line in dropped:
            print(line)
Use a boolean to ignore lines up to that point:
found_abstract = False
for files in filepath:
    with open(files, 'r') as f:
        for line in f:
            if 'Abstract' in line:
                found_abstract = True
            if found_abstract:
                ...  # do whatever you want
You can use itertools.dropwhile and itertools.islice here, a pseudo-example:
from itertools import dropwhile, islice

for fname in filepaths:
    with open(fname) as fin:
        start_at = dropwhile(lambda L: 'Abstract' not in L.split(), fin)
        for line in islice(start_at, 1, None):  # ignore the line that still has Abstract in it
            print line
To me, the following code is easier to understand.
with open(file_name, 'r') as f:
    while 'Abstract' not in next(f):
        pass
    for line in f:
        ...  # line is now the next line after the one that contains 'Abstract'
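(Note that next(f) raises StopIteration if 'Abstract' never appears, so this version assumes the marker is always present in the file.)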
Just to clarify, your code already "reads" all the lines. To start "paying attention" to lines after a certain point, you can just set a boolean flag to indicate whether or not lines should be ignored, and check it at each line.
pay_attention = False
for line in f:
    if pay_attention:
        print line
    else:  # We haven't found our trigger yet; see if it's in this line
        if 'Abstract' in line:
            pay_attention = True
If you don't mind a little more rearranging of your code, you can also use two partial loops instead: one loop that terminates once you've found your trigger phrase ('Abstract'), and one that reads all following lines. This approach is a little cleaner (and a very tiny bit faster).
for skippable_line in f:  # First skim over all lines until we find 'Abstract'.
    if 'Abstract' in skippable_line:
        break
for line in f:  # The file's iterator starts up again right where we left it.
    print line
The reason this works is that the file object returned by open behaves like a generator, rather than, say, a list: it only produces values as they are requested. So when the first loop stops, the file is left with its internal position set at the beginning of the first "unread" line. This means that when you enter the second loop, the first line you see is the first line after the one that triggered the break.
Making a guess as to how the dictionary is involved, I'd write it this way:
lines = dict()
for filename in filepath:
    with open(filename, 'r') as f:
        for line in f:
            if 'Abstract' in line:
                break
        lines[filename] = tuple(f)
So for each file, your dictionary contains a tuple of lines.
This works because the loop reads up to and including the line you identify, leaving the remaining lines in the file ready to be read from f.
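For example, to peek at what was collected (using the lines dictionary built above):
for filename, abstract_lines in lines.items():
    print(filename, abstract_lines[:3])  # the first three lines that followed 'Abstract' in each file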
