I want to skip the first 17 lines while reading a text file.
Let's say the file looks like:
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
good stuff
I just want the good stuff. What I'm doing is a lot more complicated, but this is the part I'm having trouble with.
Use a slice, like below:
with open('yourfile.txt') as f:
    lines_after_17 = f.readlines()[17:]
If the file is too big to load into memory:
with open('yourfile.txt') as f:
    for _ in range(17):
        next(f)
    for line in f:
        # do stuff
Use itertools.islice, starting at index 17. It will automatically skip the first 17 lines.
import itertools

with open('file.txt') as f:
    for line in itertools.islice(f, 17, None):  # start=17, stop=None
        # process lines
Use itertools.dropwhile to discard lines for as long as a predicate holds:
for line in dropwhile(isBadLine, lines):
    # process as you see fit
Full demo:
from itertools import dropwhile

def isBadLine(line):
    # compare against the stripped line, since lines keep their trailing newline
    return line.strip() == '0'

with open(...) as f:
    for line in dropwhile(isBadLine, f):
        # process as you see fit
Advantages: This is easily extensible to cases where your prefix lines are more complicated than "0" (but not interdependent).
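For instance, here is a sketch with a hypothetical, more elaborate predicate (blank lines and '#' comments count as prefix; the file name is a placeholder):
from itertools import dropwhile

def is_prefix_line(line):
    # hypothetical predicate: blank lines and '#' comments form the prefix
    stripped = line.strip()
    return not stripped or stripped.startswith('#')

with open('yourfile.txt') as f:
    for line in dropwhile(is_prefix_line, f):
        print(line, end='')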
Here are the timeit results for the top two answers. Note that "file.txt" is a text file containing 100,000+ lines of random strings, with a file size of 1 MB+.
Using itertools:
from timeit import timeit

# the setup argument makes itertools visible inside the timed statement
timeit("""with open("file.txt", "r") as fo:
    for line in itertools.islice(fo, 90000, None):
        line.strip()""", setup="import itertools", number=100)
>>> 1.604976346003241
Using two for loops:
from timeit import timeit
timeit("""with open("file.txt", "r") as fo:
for i in range(90000):
next(fo)
for j in fo:
j.strip()""", number=100)
>>> 2.427317383000627
Clearly, the itertools method is more efficient when dealing with large files.
If you don't want to read the whole file into memory at once, you can use a few tricks:
With next(iterator) you can advance to the next line:
with open("filename.txt") as f:
next(f)
next(f)
next(f)
for line in f:
print(f)
Of course, this is slightly ugly, so itertools has a better way of doing this:
from itertools import islice

with open("filename.txt") as f:
    # skip the first 17 lines (start=17) and never stop (None), until the end
    for line in islice(f, 17, None):
        print(line)
This solution helped me skip the number of lines given by the linetostart variable, while still giving you the index (an int) and the line (a string) if you want to keep track of those. In your case, assign 18 to linetostart so that processing begins at line 18:
linetostart = 18
with open("file.txt", 'r') as f:
    for i, line in enumerate(f, start=1):
        if i < linetostart:
            continue  # skip lines 1 .. linetostart-1
        # Your code
If it's a table, pandas can skip the rows for you:
import pandas as pd

df = pd.read_table("path/to/file", sep="\t", index_col=0, skiprows=17)
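skiprows also accepts a list of row indices, which helps when the lines to drop are not contiguous. A sketch with hypothetical row numbers:
import pandas as pd

# drop rows 0-16 plus a hypothetical stray comment row at index 20
df = pd.read_csv("path/to/file", sep="\t", index_col=0,
                 skiprows=list(range(17)) + [20])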
You can use a list comprehension to make it a one-liner (the original used xrange, which is Python 2; use range on Python 3):
[fl.readline() for _ in range(17)]
More about list comprehensions in PEP 202 and in the Python documentation.
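In context that might look like this (a sketch; fl stands for any already-open file handle, and the file name is a placeholder):
with open('yourfile.txt') as fl:
    [fl.readline() for _ in range(17)]  # consume and discard the first 17 lines
    for line in fl:
        print(line, end='')  # only the good stuff remains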
Here is a method to get lines between two line numbers in a file:
import sys

def file_line(name, start=1, end=sys.maxsize):
    # sys.maxsize replaces Python 2's sys.maxint
    lc = 0
    with open(name) as f:
        for line in f:
            lc += 1
            if start <= lc <= end:
                yield line

s = '/usr/share/dict/words'
l1 = list(file_line(s, 235880))
l2 = list(file_line(s, 1, 10))
print(l1)
print(l2)
Output:
['Zyrian\n', 'Zyryan\n', 'zythem\n', 'Zythia\n', 'zythum\n', 'Zyzomys\n', 'Zyzzogeton\n']
['A\n', 'a\n', 'aa\n', 'aal\n', 'aalii\n', 'aam\n', 'Aani\n', 'aardvark\n', 'aardwolf\n', 'Aaron\n']
Omit the end parameter to read from line n to EOF.
I have a train_file.txt which has 3 columns on each row.
For example;
1 10 1
1 12 1
2 64 2
6 17 1
...
I am reading this txt file with
train_data = open("train_file.txt", 'r').readlines()
Then I am trying to get each value with a for loop:
for eachline in train_data:
    uid, lid, x = eachline.strip().split()
Question: train_data is a huge file, which is why I want to get just the first 1000 rows.
I was trying to execute the following code, but I am getting an error ('list' object cannot be interpreted as an integer):
for eachline in range(train_data, 1000):
    uid, lid, x = eachline.strip().split()
It is not necessary to read the entire file at all. You could use enumerate on the file directly and break early or use itertools.islice:
from itertools import islice
train_data = list(islice(open("train_file.txt", 'r'), 1000))
You can also keep using the same file handle to read more data later:
f = open("train_file.txt", 'r')
train_data = list(islice(f, 1000)) # reads first 1000
test_data = list(islice(f, 100)) # reads next 100
Maybe try changing this line:
train_data = open("train_file.txt", 'r').readlines()
To:
train_data = open("train_file.txt", 'r').readlines()[:1000]
train_data is a list, use slicing:
for eachline in train_data[:1000]:
As the file is "huge", in your words, a better approach is to read just the first 1000 rows (readlines() would read the whole file into memory):
with open("train_file.txt", 'r'):
train_data = []
for idx, line in enumerate(f, start=1):
train_data.append(line.strip.split())
if idx == 1000:
break
Note that data will be str, not int. You probably want to convert them to int.
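One way to fold the conversion into the loop above (a sketch, assuming every row really holds three integers):
with open("train_file.txt") as f:
    train_data = []
    for idx, line in enumerate(f, start=1):
        # int() raises ValueError on malformed rows
        uid, lid, x = (int(v) for v in line.split())
        train_data.append((uid, lid, x))
        if idx == 1000:
            break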
You could use enumerate and a break:
for k, line in enumerate(lines):
    if k >= 1000:
        break  # exit the loop after the first 1000 lines
    # do stuff on the line
I would recommend using the built-in csv library, since the data is csv-like (or pandas if you're already using it), together with a with statement. Something like this:
import csv
from itertools import islice

with open('./test.csv', 'r') as input_file:
    csv_reader = csv.reader(input_file, delimiter=' ')
    rows = list(islice(csv_reader, 1000))
    # Use rows
    print(rows)
You don't need it right now but it will make escaped characters or multiline entries way easier to parse. Also, if there are headers you can use csv.DictReader to include them.
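A sketch of the DictReader variant, assuming a header row with hypothetical column names uid, lid and x:
import csv
from itertools import islice

with open('./test.csv', 'r') as input_file:
    reader = csv.DictReader(input_file, delimiter=' ')
    for row in islice(reader, 1000):
        print(row['uid'], row['lid'], row['x'])  # hypothetical header names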
Regarding your original code:
The call to readlines() reads all the lines at that point, so any filtering applied afterwards won't stop the whole file from being read.
If you did read it that way, to get the first 1000 lines your for loop should be:
for eachline in train_data[:1000]:
...
I have a data file with 100 lines, and I want to create a dictionary that skips the first two lines and then maps enumerated keys to the remaining lines as values.
myfile = open(infile, 'r')
d = {}
with myfile as f:
    next(f)
    next(f)
    for line in f:
This is what I've got. I don't know how to use iteritems(), enumerate(), or itervalues(), but I feel like I will need them (or maybe not). Can anybody help?
You could do something like:
from itertools import islice

with open(infile, 'r') as myfile:
    d = dict(enumerate(islice(myfile, 2, None)))
But I wish I understood why you want to skip the first two lines – are you sure you don't want linecache?
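For reference, linecache fetches single lines by 1-based number ('data.txt' is a placeholder):
import linecache

# returns '' if the line number is out of range; the module caches the file
third_line = linecache.getline('data.txt', 3)
print(third_line)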
This is just off the top of my head, so there will certainly be room for improvement.
myfile = open(infile, 'r')     # open the file
d = {}                         # initiate the dict
counter = 0                    # initiate the counter before the loop (inside, it would reset every iteration)
for line in myfile:            # iterate over lines in the file
    if counter <= 1:           # if we have counted under 2 iterations
        counter += 1           # increase the counter by 1 and do nothing else
    else:                      # if we have counted 2 or more iterations
        d[counter - 2] = line  # key is the line count, offset by the 2 skipped lines
        counter += 1           # increase the counter by 1 before continuing
I can't remember off the top of my head where in the code it would be best to close the file, but do some experimentation and read this and this. Another good place to start is Google and the Python docs in general.
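As for closing the file: a with block does it automatically, so a tidier sketch of the same idea (using enumerate, as hinted in the question) could be:
d = {}
with open(infile) as myfile:  # the file is closed automatically on exit
    for counter, line in enumerate(myfile):
        if counter >= 2:      # skip the first two lines
            d[counter - 2] = line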
Currently I read a file like this:
f = open("myfile.txt")
for line in f:
#do s.th. with the line
What do I need to do to start reading not at the first line, but at line X (e.g. line 5)?
Using itertools.islice you can specify start, stop and step if need be, and apply that to your input file. To start at line 5, skip the first 4 lines:
from itertools import islice

with open('yourfile') as fin:
    for line in islice(fin, 4, None):  # skip the first 4 lines; iteration starts at line 5
        pass
An opened file object f is an iterator. Read (and throw away) the first four lines and then go on with regular reading:
with open("myfile.txt", 'r') as f:
for i in xrange(4):
next(f, None)
for line in f:
#do s.th. with the line
I wish to count the number of lines in a .txt file which looks something like this:
apple
orange
pear

hippo
donkey
Where there are blank lines used to separate blocks. The result I'm looking for, based on the above sample, is five (lines).
How can I achieve this?
As a bonus, it would be nice to know how many blocks/paragraphs there are. So, based on the above example, that would be two blocks.
non_blank_count = 0

with open('data.txt') as infp:
    for line in infp:
        if line.strip():
            non_blank_count += 1

print('number of non-blank lines found %d' % non_blank_count)
Update: re-read the question; the OP wants to count non-blank lines (thanks @RanRag).
A short way to count the number of non-blank lines could be:
with open('data.txt', 'r') as f:
    lines = f.readlines()
    num_lines = len([l for l in lines if l.strip(' \n') != ''])
I am surprised to see that there isn't a clean pythonic answer yet (as of Jan 1, 2019). Many of the other answers create unnecessary lists, count in a non-pythonic way, loop over the lines of the file in a non-pythonic way, do not close the file properly, do unnecessary things, assume that the end of line character can only be '\n', or have other smaller issues.
Here is my suggested solution:
with open('myfile.txt') as f:
    line_count = sum(1 for line in f if line.strip())
The question does not define what a blank line is. My definition: a line is blank if and only if line.strip() returns the empty string. This may or may not be your definition of a blank line.
sum([1 for i in open("file_name","r").readlines() if i.strip()])
Considering that the blank lines will only contain the newline character, it is faster to avoid calling str.strip (which creates a new string) and instead check whether the line contains only whitespace using str.isspace, then skip it:
with open('data.txt') as f:
    non_blank_lines = sum(not line.isspace() for line in f)
Demo:
from io import StringIO

s = '''apple
orange
pear

hippo
donkey'''

non_blank_lines = sum(not line.isspace() for line in StringIO(s))
# 5
You can further use str.isspace with itertools.groupby to count the number of contiguous lines/blocks in the file:
from itertools import groupby
no_paragraphs = sum(k for k, _ in groupby(StringIO(s), lambda x: not x.isspace()))
print(no_paragraphs)
# 2
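The same works directly on a file handle ('data.txt' is a placeholder):
from itertools import groupby

with open('data.txt') as f:
    blocks = sum(k for k, _ in groupby(f, key=lambda x: not x.isspace()))
print(blocks)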
Non-blank lines counter:
lines_counter = 0
with open('test_file.txt') as f:
    for line in f:
        if line != '\n':
            lines_counter += 1
Blocks counter:
para_counter = 0
prev = '\n'
with open('test_file.txt') as f:
    for line in f:
        if line != '\n' and prev == '\n':
            para_counter += 1
        prev = line
This bit of Python code should solve your problem:
with open('data.txt', 'r') as f:
    lines = len(list(filter(lambda x: x.strip(), f)))
This is how I would've done it:
f = open("file.txt")
l = [x for x in f.readlines() if x != "\n"]
print len(l)
readlines() will make a list of all the lines in the file and then you can just take those lines that have at least something in them.
Looks pretty straightforward to me!
Pretty straightforward one, I believe:
f = open('path', 'r')
count = 0
for line in f:
    if line.strip():
        count += 1
print(count)
My one-liner would be:
print(sum(1 for line in open(path_to_file,'r') if line.strip()))
Python: what is the quickest way to split a file into two files, each having half the number of lines of the original, such that the lines in each of the two files are in random order?
for example: if the file is
1
2
3
4
5
6
7
8
9
10
it could be split into:
3
2
10
9
1
4
6
8
5
7
This sort of operation is often called "partition". Although there isn't a built-in partition function, I found this article: Partition in Python.
Given that definition, you can do this:
import random

def partition(l, pred):
    yes, no = [], []
    for e in l:
        if pred(e):
            yes.append(e)
        else:
            no.append(e)
    return yes, no

lines = open("file.txt").readlines()
lines1, lines2 = partition(lines, lambda x: random.random() < 0.5)
Note that this won't necessarily exactly split the file in two, but it will on average.
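If an exact 50/50 split matters, one alternative (a sketch) is to sample half of the indices instead:
import random

lines = open("file.txt").readlines()
half = set(random.sample(range(len(lines)), len(lines) // 2))
lines1 = [l for i, l in enumerate(lines) if i in half]
lines2 = [l for i, l in enumerate(lines) if i not in half]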
You can just load the file, call random.shuffle on the resulting list, and then split it into two files (untested code):
def shuffle_split(infilename, outfilename1, outfilename2):
    from random import shuffle

    with open(infilename, 'r') as f:
        lines = f.readlines()

    # append a newline in case the last line didn't end with one
    lines[-1] = lines[-1].rstrip('\n') + '\n'

    shuffle(lines)

    with open(outfilename1, 'w') as f:
        f.writelines(lines[:len(lines) // 2])
    with open(outfilename2, 'w') as f:
        f.writelines(lines[len(lines) // 2:])
random.shuffle shuffles lines in-place, and pretty much does all the work here. Python's sequence indexing system (e.g. lines[len(lines) // 2:]) makes things really convenient.
I'm assuming that the file isn't huge, i.e. that it will fit comfortably in memory. If that's not the case, you'll need to do something a bit more fancy, probably using the linecache module to read random line numbers from your input file. I think probably you would want to generate two lists of line numbers, using a similar technique to what's shown above.
Update: changed / to // to avoid issues when __future__.division is enabled.
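A rough sketch of that linecache idea, assuming the total line count n is known in advance (note that linecache itself caches file contents, so this is only a starting point):
import random
import linecache

def shuffle_split_by_number(infilename, outfilename1, outfilename2, n):
    # n is the total number of lines; linecache numbering is 1-based
    nums = list(range(1, n + 1))
    random.shuffle(nums)
    half = len(nums) // 2
    with open(outfilename1, 'w') as f1, open(outfilename2, 'w') as f2:
        for i in nums[:half]:
            f1.write(linecache.getline(infilename, i))
        for i in nums[half:]:
            f2.write(linecache.getline(infilename, i))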
import random

data = open("file").readlines()
random.shuffle(data)
c = 1
f = open("test." + str(c), "w")
for n, i in enumerate(data):
    if n == len(data) // 2:  # integer division; / would compare against a float
        c += 1
        f.close()
        f = open("test." + str(c), "w")
    f.write(i)
f.close()
Other version:
from random import shuffle

def shuffle_split(infilename, outfilename1, outfilename2):
    with open(infilename, 'r') as f:
        lines = f.read().splitlines()

    shuffle(lines)
    half_lines = len(lines) // 2

    with open(outfilename1, 'w') as f:
        f.write('\n'.join(lines.pop() for count in range(half_lines)))
    with open(outfilename2, 'w') as f:
        f.write('\n'.join(lines))