Huge slowdown with string.count() when reading certain data with certain characters - python

I need to read a large CSV file that is riddled with stray newline characters and generally quite chaotic, so instead of using pandas I parse it manually. However, I'm running into a strange slowdown that seems to depend on which characters appear in the file.
While trying to reproduce the problem with a randomly generated CSV file of similar shape, I came to suspect that the problem lies in the count function.
Consider this example, which creates a large file of chaotic random data, reads it back, and then uses count to reassemble the lines so that they can be read as columnar data.
Note that the first run uses only string.ascii_letters for the random data, while the second run uses characters from string.printable.
import os
import random as rd
import string
import time
# Function to create random data in a specific pattern with separator ";":
def createRandomString(num,io,fullLength):
    lineFull = ''
    nl = True
    randstr = ''.join(rd.choice(string.ascii_letters) for _ in range(7))
    for i in range(num):
        if i == 0:
            line = 'Start;'
        else:
            line = ''
            bb = rd.choice([True,True,False])
            if bb:
                line = line+'\"\";'
            else:
                if rd.random() < 0.999:
                    line = line+randstr
                else:
                    line = line+rd.randint(10,100)*randstr
                if nl and i != num-1:
                    line = line+';\n'
                    nl = False
                elif rd.random() < 0.04 and i != num-1:
                    line = line+';\n'
                    if rd.random() < 0.01:
                        add = rd.randint(1,10)*'\n'
                        line = line+add
                else:
                    line = line+';'
        lineFull = lineFull+line
    return lineFull+'\n'
# Create file with random data:
outputFolder = "C:\\DataDir\\Output\\"
numberOfCols = 38
fullLength = 10000
testLines = [createRandomString(numberOfCols,i,fullLength) for i in range(fullLength)]
with open(outputFolder+"TestFile.txt",'w') as tf:
    tf.writelines(testLines)
# Read in file:
with open(outputFolder+"TestFile.txt",'r') as ff:
    lines = []
    for line in ff.readlines():
        lines.append(unicode(line.rstrip('\n')))
# Restore columns by counting the separator:
linesT = ''
lines2 = []
time0 = time.time()
for i in range(len(lines)):
    linesT = linesT + lines[i]
    count = linesT.count(';')
    if count == numberOfCols:
        lines2.append(linesT)
        linesT = ''
    if i%1000 == 0:
        print time.time()-time0
        time0 = time.time()
print time.time()-time0
The print statements output this:
0.0
0.0019998550415
0.00100016593933
0.000999927520752
0.000999927520752
0.000999927520752
0.000999927520752
0.00100016593933
0.0019998550415
0.000999927520752
0.00100016593933
0.0019998550415
0.00100016593933
0.000999927520752
0.00200009346008
0.000999927520752
0.000999927520752
0.00200009346008
0.000999927520752
0.000999927520752
0.00200009346008
0.000999927520752
0.00100016593933
0.000999927520752
0.00200009346008
0.000999927520752
Consistently fast performance.
Now I change the third line in createRandomString to randstr = ''.join(rd.choice(string.printable) for _ in range(7)), my output now becomes this:
0.0
0.0759999752045
0.273000001907
0.519999980927
0.716000080109
0.919999837875
1.11500000954
1.25199985504
1.51200008392
1.72199988365
1.8820002079
2.07999992371
2.21499991417
2.37400007248
2.64800000191
2.81900000572
3.04500007629
3.20299983025
3.55500006676
3.6930000782
3.79499983788
4.13900017738
4.19899988174
4.58700013161
4.81799983978
4.92000007629
5.2009999752
5.40199995041
5.48399996758
5.70299983025
5.92300009727
6.01099991798
6.44200015068
6.58999991417
3.99399995804
Not only is the performance very slow but it is consistently becoming slower over time.
The only difference lies in the range of characters which are written into the random data.
The full set of characters that appear in my real data is this:
charSet = [' ','"','&',"'",'(',')','*','+',',','-','.','/','0','1','2','3','4','5','6',
'7','8','9',':',';','<','=','>','A','B','C','D','E','F','G','H','I','J','K',
'L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z','\\','_','`','a',
'b','d','e','g','h','i','l','m','n','o','r','s','t','x']
Let's do some benchmarking of the count function:
import random as rd
import string
rd.seed()
def Test0():
    randstr = ''.join(rd.choice(string.digits) for _ in range(10000))
    randstr.count('7')
def Test1():
    randstr = ''.join(rd.choice(string.ascii_letters) for _ in range(10000))
    randstr.count('a')
def Test2():
    randstr = ''.join(rd.choice(string.printable) for _ in range(10000))
    randstr.count(';')
def Test3():
    randstr = ''.join(rd.choice(charSet) for _ in range(10000))
    randstr.count(';')
I'm testing only digits, only letters, printable, and the charset from my data.
Results of %timeit:
%timeit(Test0())
100 loops, best of 3: 9.27 ms per loop
%timeit(Test1())
100 loops, best of 3: 9.12 ms per loop
%timeit(Test2())
100 loops, best of 3: 9.94 ms per loop
%timeit(Test3())
100 loops, best of 3: 8.31 ms per loop
The performance is consistent and doesn't suggest any problems of count with certain character sets.
I also tested if concatenating strings with + would cause a slow down but this wasn't the case either.
Can anyone explain this or give me some hints?
EDIT: Using Python 2.7.12
EDIT 2: In my original data the following is happening:
The file has around 550000 lines, which are often broken by random newline characters, yet each record always contains exactly 38 ";" delimiters. Up to roughly line 300000 the performance is fast; from that line on it suddenly starts getting slower and slower. I'm investigating this further now with the new clues.

The problem is in how the result of count(';') is used.
string.printable contains ';' while string.ascii_letters doesn't.
As a result linesT is not always flushed, and as the length of linesT grows, the execution time of each count call grows as well:
0.000236988067627
0.0460968017578
0.145275115967
0.271568059921
0.435608148575
0.575787067413
0.750104904175
0.899538993835
1.08505797386
1.24447107315
1.34459710121
1.45430088043
1.63317894936
1.90502595901
1.92841100693
2.07722711563
2.16924905777
2.30753016472
In particular, this code is problematic with string.printable:
numberOfCols = 38
if count == numberOfCols:
    lines2.append(linesT)
    linesT = ''
Since randstr can itself contain ';', a record can hold more than 38 separators, and the line that takes count past 37 may add several at once. count then jumps over 38, the equality test never succeeds, linesT is never reset, and it grows indefinitely.
You can observe this behaviour by leaving the initial set as string.ascii_letters and changing your code to count('a').
To fix the problem with printable you can modify your code like this:
if count >= numberOfCols:
Then we go back to the expected runtime behaviour:
0.000234842300415
0.00233697891235
0.00247097015381
0.00217199325562
0.00262403488159
0.00262403488159
0.0023078918457
0.0024049282074
0.00231409072876
0.00233006477356
0.00214791297913
0.0028760433197
0.00241804122925
0.00250506401062
0.00254893302917
0.00266218185425
0.00236296653748
0.00201988220215
0.00245118141174
0.00206398963928
0.00219988822937
0.00230193138123
0.00205302238464
0.00230097770691
0.00248003005981
0.00204801559448
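The skipped equality test described in this answer can be reproduced without random data. The following is a minimal Python 3 sketch, not the original poster's code: `reassemble` is a hypothetical helper that keeps a running separator count (so each physical line is scanned once instead of re-counting the growing buffer) and flushes on `>=` rather than `==`.

```python
def reassemble(lines, n_cols, sep=";"):
    """Join physical lines into records holding at least n_cols separators.

    Hypothetical helper for illustration only.
    """
    records = []
    parts = []   # pieces of the record currently being assembled
    count = 0    # running number of separators in `parts`
    for line in lines:
        parts.append(line)
        count += line.count(sep)   # scan only the new piece
        if count >= n_cols:        # '==' would miss a jump from 2 to 5
            records.append("".join(parts))
            parts = []
            count = 0
    return records, "".join(parts)  # second value: unflushed leftover


# The last line adds three separators at once, so the count goes 2 -> 5.
lines = ["a;b;", "c;", "d;e;", "f;;g;"]
records, leftover = reassemble(lines, 3)
```

With an `==` test the second record would never be flushed, and everything after it would pile up in the leftover buffer.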

I am just reporting what I found. The performance difference seemingly does not come from str.count() function. I changed your code and refactored the str.count() into its own function. I also put your global code into a main function. The following is my version of your code:
import os
import time
import random as rd
import string
import timeit
# Function to create random data in a specific pattern with separator ";":
def createRandomString(num,io,fullLength):
    lineFull = ''
    nl = True
    randstr = ''.join(rd.choice(string.ascii_letters) for _ in range(7))
    #randstr = ''.join(rd.choice(string.printable) for _ in range(7))
    for i in range(num):
        if i == 0:
            line = 'Start;'
        else:
            line = ''
            bb = rd.choice([True,True,False])
            if bb:
                line = line+'\"\";'
            else:
                if rd.random() < 0.999:
                    line = line+randstr
                else:
                    line = line+rd.randint(10,100)*randstr
                if nl and i != num-1:
                    line = line+';\n'
                    nl = False
                elif rd.random() < 0.04 and i != num-1:
                    line = line+';\n'
                    if rd.random() < 0.01:
                        add = rd.randint(1,10)*'\n'
                        line = line+add
                else:
                    line = line+';'
        lineFull = lineFull+line
    return lineFull+'\n'
def counting_func(lines_iter):
    try:
        return lines_iter.next().count(';')
    except StopIteration:
        return -1
def wrapper(func, *args, **kwargs):
    def wrapped():
        return func(*args, **kwargs)
    return wrapped
# Create file with random data:
def main():
    fullLength = 100000
    outputFolder = ""
    numberOfCols = 38
    testLines = [createRandomString(numberOfCols,i,fullLength) for i in range(fullLength)]
    with open(outputFolder+"TestFile.txt",'w') as tf:
        tf.writelines(testLines)
    # Read in file:
    with open(outputFolder+"TestFile.txt",'r') as ff:
        lines = []
        for line in ff.readlines():
            lines.append(unicode(line.rstrip('\n')))
    # Restore columns by counting the separator:
    lines_iter = iter(lines)
    print timeit.timeit(wrapper(counting_func, lines_iter), number=fullLength)
if __name__ == '__main__': main()
Tests are done 100000 times on each line generated. With string.ascii_letters, I get from timeit on average 0.0454177856445 seconds each loop. With string.printable, I get on average 0.0426299571991. In fact the latter is slightly faster than the former, though not really a significant difference.
I suspect the performance difference comes from what you are doing in the following loop besides counting:
for i in range(len(lines)):
linesT = linesT + lines[i]
count = linesT.count(';')
if count == numberOfCols:
lines2.append(linesT)
linesT = ''
if i%1000 == 0:
print time.time()-time0
time0 = time.time()
Another possibility is a slowdown from accessing global variables outside a main function. But that would happen in both cases, so it is unlikely to be the cause.

Related

NoneType-Error with yield in Python

I've got a problem with the yield and generator thing in Python, hope you might know the solution
Here's my code (very simplified)
#!/usr/bin/env python
import sys
import time
import subprocess
from Tkinter import *
import numpy
import threading
CDatei = subprocess.Popen("/home/pi/meinc++/Spi")
print("Hallo")
i = 0
x = 0
def GetValue():
with open("/home/pi/meinc++/BeispielDatei.txt","r") as Datei:
for line in Datei:
time.sleep(0.1)
return line
def WithoutNull(input):
ReturnValue = input
while ReturnValue is None:
ReturnValue = GetValue()
return ReturnValue
def UebergabeWert():
while x == 0:
WholeString = WithoutNull(GetValue())
StringVar, DatumVar = WholeString.strip().split(' - ')
IntStringVar = [int(v) for v in StringVar.split()]
return IntStringVar,DatumVar
def MinutenWert():
ArrayValue = []
ZeitStart = time.time()
i = 0
while 1:
CompleteValue, Datum = UebergabeWert()
ArrayValue.insert(i,CompleteValue[0])
i = i + 1
ZeitEnde = time.time()
if (ZeitEnde-ZeitStart >= 10):
LaengeArray = len(ArrayValue)
print ArrayValue
ArrayValue = []
i = 0
break
while i <= LaengeArray:
CompleteValue, Datum = UebergabeWert()
ArrayValue.insert(i,CompleteValue[0])
i = i + 1
ArraySumme = numpy.sum(ArrayValue)
LaengeArray = len(ArrayValue)
Mittelwert = ArraySumme/LaengeArray
print ArrayValue
print ArraySumme
print LaengeArray
yield Mittelwert
if i == LaengeArray:
i = 0
xx = MinutenWert()
for x in xx:
print x
Quick Explanation of the Code:
I have a sensor and I'm reading data out of the UebergabeWert(). But since i wanted to make the average of a minute, i started to do following: I put the data in an array for 60 seconds (in the code it's 10 cause i dont want to wait so long) and then i sum up the array and divide it with the length of the array.
The first while loop is to set the total Array length (cause i cant make my main loop dependend from the time cause when the sensor is slower, it messes up with the data) and the second loop is to make the average. The idea is: when the array reaches it's end, it will erase the first value and insert the newest one. The loop should go indefinitely and I'll soon implement threading so it runs in the background.
PS: the "prints" are here for me to keep track of the process
My Problem:
The first loop works perfectly, the array prints out about 100 (different) values in the array without None.
The second loop however, breaks after the first iteration.
Error-message:
Traceback (most recent call last):
line 87, in <module>
for x in xx:
line 54, in MinutenWert
CompleteValue, Datum = UebergabeWert()
TypeError: 'NoneType' object is not iterable
Why is there suddenly a NoneType-error? I just cannot figure it out.
EDIT:
People pointed out that the problem is rather with a previous function of the code so I'll add it to the code.
Also, quick explanation: The data is in a textfile so i open the textfile with GetValue(). Sometimes the sensor is too slow and gives back a None-value so WithoutNull() gets rid of that.
The data is in this form "var1, var2, var3, var4 - timestamp". So i need to seperate the values from the String with UebergabeWert(). Normally i won't get a null-response from that so it's kinda strange.
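The problem was with...
xx = MinutenWert()
for x in xx:
    print x
The for loop does call next() on the generator, but it also rebinds the global name x to each yielded value. UebergabeWert() only enters its loop while x == 0, so after the first yield x holds the average, the while loop is skipped, and the function falls through and returns None, which cannot be unpacked into CompleteValue, Datum.
The updated version, which uses a different variable name....
xx = MinutenWert()
while 1:
    y = next(xx)
    print y
returns the values and doesn't give back an error.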
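The interaction between a loop variable and a global that a generator depends on can be shown in isolation. This is a minimal Python 3 sketch with hypothetical names (`flag`, `read_pair`, `means`) standing in for `x`, `UebergabeWert` and `MinutenWert`; it is not the original program.

```python
flag = 0  # plays the role of the global `x`


def read_pair():
    while flag == 0:
        return 1, "timestamp"
    # Falls through and implicitly returns None once flag != 0.


def means():
    while True:
        value, _stamp = read_pair()   # unpacking None raises TypeError
        yield value * 2.5


error = None
try:
    for flag in means():   # rebinds the global `flag` on every iteration
        pass
except TypeError as exc:
    error = exc

flag = 0
first_two = []
for y in means():          # a different loop variable leaves `flag` alone
    first_two.append(y)
    if len(first_two) == 2:
        break
```

The first loop fails on its second iteration because `flag` has become 2.5; the second loop keeps producing values.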

Faster way of calculating the percentage of identical sites in alignment using biopython

I developed the following code to calculate the number of identical sites in an alignment. Unfortunately the code is slow, and I have to iterate it over hundreds of files, it takes close to 12 hours to process more than 1000 alignments, meaning that something ten times faster would be appropriate. Any help would be appreciated:
import os
from Bio import SeqIO
from Bio.Seq import Seq
from Bio import AlignIO
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import generic_dna
from Bio.Align import MultipleSeqAlignment
import time
a = SeqRecord(Seq("CCAAGCTGAATCAGCTGGCGGAGTCACTGAAACTGGAGCACCAGTTCCTAAGAGTTCCTTTCGAGCACTACAAGAAGACGATTCGCGCGAACCACCGCAT", generic_dna), id="Alpha")
b = SeqRecord(Seq("CGAAGCTGACTCAGTGGGCGGAGTCACTGAAACTGGAGCACCAGTTCCTCAGAGTCCCCTTCGAGCACTACAAGAAGACAATTCGTGCGAACCACCGCAT", generic_dna), id="Beta")
c = SeqRecord(Seq("CGAAGCTGACTCAGTTGGCAGAATCACTGAAACTGGAGCACCAGTTCCTCAGAGTCCCCTTCGAGCACTACAAGAAGACGATTCGTGCGAACCACCGCAT", generic_dna), id="Gamma")
d = SeqRecord(Seq("CGAAGCTGACTCAGTTGGCAGAGTCACTGAAACTGGAGCACCAGTTCCTCAGAGTCCCCTTCGAGCACTACAAGAAGACGATTCGTGCGAACCACCGCAT", generic_dna), id="Delta")
e = SeqRecord(Seq("CGAAGCTGACTCAGTTGGCGGAGTCACTGAAACTGGAGCACCAGTTCCTCAGAGTCCCCTTCGAGCACTACAAGAAGACGATTCGTGCGAACCACCGCAT", generic_dna), id="Epsilon")
align = MultipleSeqAlignment([a, b, c], annotations={"tool": "demo"})
start_time = time.time()
if len(align) != 1:
for n in range(0,len(align[0])):
n=0
i=0
while n<len(align[0]): #part that needs to be faster
column = align[:,n]
if (column == len(column) * column[0]) == True:
i=i+1
n=n+1
match = float(i)
length = float(n)
global_identity = 100*(float(match/length))
print(global_identity)
print("--- %s seconds ---" % (time.time() - start_time))
So, you're trying to check that each of the 5 strings have the same characters in the column? If the characters in the column all match, you increment i, else you increment n.
Your interpretation of the code is correct.
Based on the above, I'd hazard to suggest the following code as a faster alternative.
I suppose that align is a structure like this:
align = [
'AGCTCGCGGAGGCGCTGCT....',
'ACCTCGGAGGGCTGCTGTAC...',
'AGCTCGGAGGGCTGCTGTAC...',
# possibly more ...
]
We try to detect the columns of same characters in it. Above, the first column is AAA (a match), the next is GCG (a mismatch).
def all_equal(items):
    """Returns True iff all items are equal."""
    first = items[0]
    return all(x == first for x in items)
def compute_match(aligned_sequences):
    """Returns the ratio of same-character columns in ``aligned_sequences``.
    :param aligned_sequences: a list of strings of equal length.
    """
    match_count = 0
    mismatch_count = 0
    for chars in zip(*aligned_sequences):
        # Here chars is a column of chars,
        # one taken from each element of aligned_sequences.
        if all_equal(chars):
            match_count += 1
        else:
            mismatch_count += 1
    return float(match_count) / float(mismatch_count)
    # What would make more sense:
    # return float(match_count) / len(aligned_sequences[0])
An even shorter version:
def compute_match(aligned_sequences):
    match_count = sum(1 for chars in zip(*aligned_sequences) if all_equal(chars))
    total = len(aligned_sequences[0])
    mismatch_count = total - match_count # Obviously.
    return ...
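For reference, the column-wise idea above can be completed into a percentage as follows. This is a minimal Python 3 sketch that assumes the alignment has already been converted to plain strings of equal length (for a Biopython alignment, something like `[str(rec.seq) for rec in align]` would be one way to get there); `percent_identity` is a hypothetical name.

```python
def percent_identity(aligned_sequences):
    """Return the percentage of alignment columns whose characters all match."""
    if not aligned_sequences or not aligned_sequences[0]:
        return 0.0
    length = len(aligned_sequences[0])
    # zip(*rows) walks the alignment column by column; a column is
    # identical when the set of its characters has exactly one element.
    matches = sum(1 for column in zip(*aligned_sequences) if len(set(column)) == 1)
    return 100.0 * matches / length


rows = ["ACGT", "ACGA", "ACCT"]
identity = percent_identity(rows)   # columns 1 and 2 match, 3 and 4 do not
```

Working on plain strings avoids slicing the alignment object once per column, which is likely where most of the time goes in the original loop.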

How to find number of rows of a huge csv file in pandas [duplicate]

How do I get a line count of a large file in the most memory- and time-efficient manner?
def file_len(filename):
    with open(filename) as f:
        for i, _ in enumerate(f):
            pass
    return i + 1
One line, probably pretty fast:
num_lines = sum(1 for line in open('myfile.txt'))
You can't get any better than that.
After all, any solution will have to read the entire file, figure out how many \n you have, and return that result.
Do you have a better way of doing that without reading the entire file? Not sure... The best solution will always be I/O-bound, best you can do is make sure you don't use unnecessary memory, but it looks like you have that covered.
I believe that a memory-mapped file will be the fastest solution. I tried four functions: the function posted by the OP (opcount); a simple iteration over the lines in the file (simplecount); readline with a memory-mapped file (mmap) (mapcount); and the buffer read solution offered by Mykola Kharechko (bufcount).
I ran each function five times, and calculated the average run-time for a 1.2 million-line text file.
Windows XP, Python 2.5, 2GB RAM, 2 GHz AMD processor
Here are my results:
mapcount : 0.465599966049
simplecount : 0.756399965286
bufcount : 0.546800041199
opcount : 0.718600034714
Edit: numbers for Python 2.6:
mapcount : 0.471799945831
simplecount : 0.634400033951
bufcount : 0.468800067902
opcount : 0.602999973297
So the buffer read strategy seems to be the fastest for Windows/Python 2.6
Here is the code:
from __future__ import with_statement
import time
import mmap
import random
from collections import defaultdict
def mapcount(filename):
    f = open(filename, "r+")
    buf = mmap.mmap(f.fileno(), 0)
    lines = 0
    readline = buf.readline
    while readline():
        lines += 1
    return lines
def simplecount(filename):
    lines = 0
    for line in open(filename):
        lines += 1
    return lines
def bufcount(filename):
    f = open(filename)
    lines = 0
    buf_size = 1024 * 1024
    read_f = f.read # loop optimization
    buf = read_f(buf_size)
    while buf:
        lines += buf.count('\n')
        buf = read_f(buf_size)
    return lines
def opcount(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1
counts = defaultdict(list)
for i in range(5):
    for func in [mapcount, simplecount, bufcount, opcount]:
        start_time = time.time()
        assert func("big_file.txt") == 1209138
        counts[func].append(time.time() - start_time)
for key, vals in counts.items():
    print key.__name__, ":", sum(vals) / float(len(vals))
I had to post this on a similar question until my reputation score jumped a bit (thanks to whoever bumped me!).
All of these solutions ignore one way to make this run considerably faster, namely by using the unbuffered (raw) interface, using bytearrays, and doing your own buffering. (This only applies in Python 3. In Python 2, the raw interface may or may not be used by default, but in Python 3, you'll default into Unicode.)
Using a modified version of the timing tool, I believe the following code is faster (and marginally more pythonic) than any of the solutions offered:
def rawcount(filename):
    f = open(filename, 'rb')
    lines = 0
    buf_size = 1024 * 1024
    read_f = f.raw.read
    buf = read_f(buf_size)
    while buf:
        lines += buf.count(b'\n')
        buf = read_f(buf_size)
    return lines
Using a separate generator function, this runs a smidge faster:
def _make_gen(reader):
    b = reader(1024 * 1024)
    while b:
        yield b
        b = reader(1024*1024)
def rawgencount(filename):
    f = open(filename, 'rb')
    f_gen = _make_gen(f.raw.read)
    return sum( buf.count(b'\n') for buf in f_gen )
This can be done completely with generators expressions in-line using itertools, but it gets pretty weird looking:
from itertools import (takewhile,repeat)
def rawincount(filename):
    f = open(filename, 'rb')
    bufgen = takewhile(lambda x: x, (f.raw.read(1024*1024) for _ in repeat(None)))
    return sum( buf.count(b'\n') for buf in bufgen )
Here are my timings:
function      average, s  min, s  ratio
rawincount    0.0043      0.0041  1.00
rawgencount   0.0044      0.0042  1.01
rawcount      0.0048      0.0045  1.09
bufcount      0.008       0.0068  1.64
wccount       0.01        0.0097  2.35
itercount     0.014       0.014   3.41
opcount       0.02        0.02    4.83
kylecount     0.021       0.021   5.05
simplecount   0.022       0.022   5.25
mapcount      0.037       0.031   7.46
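The chunked binary reads used by the raw* functions above can also be written with the two-argument form of iter(). The following is a minimal Python 3 sketch, not one of the benchmarked functions: `count_newlines` is a hypothetical name, it uses the ordinary buffered `f.read` rather than `f.raw.read`, and no timing claim is made for it.

```python
import os
import tempfile
from functools import partial


def count_newlines(path, chunk_size=1 << 20):
    """Count newline bytes by reading the file in fixed-size binary chunks."""
    with open(path, "rb") as f:
        # iter(callable, sentinel) keeps calling f.read(chunk_size)
        # until it returns the empty bytes object at end of file.
        chunks = iter(partial(f.read, chunk_size), b"")
        return sum(chunk.count(b"\n") for chunk in chunks)


# Small self-check on a temporary file; a tiny chunk size forces several reads.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"one\ntwo\nthree\n")
    demo_path = tmp.name
demo_count = count_newlines(demo_path, chunk_size=4)
os.remove(demo_path)
```

The with block closes the file, which the raw* functions above leave to the garbage collector.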
You could execute a subprocess and run wc -l filename
import subprocess
def file_len(fname):
    p = subprocess.Popen(['wc', '-l', fname], stdout=subprocess.PIPE,
                         stderr=subprocess.PIPE)
    result, err = p.communicate()
    if p.returncode != 0:
        raise IOError(err)
    return int(result.strip().split()[0])
After a perfplot analysis, one has to recommend the buffered read solution
def buf_count_newlines_gen(fname):
    def _make_gen(reader):
        while True:
            b = reader(2 ** 16)
            if not b: break
            yield b
    with open(fname, "rb") as f:
        count = sum(buf.count(b"\n") for buf in _make_gen(f.raw.read))
    return count
It's fast and memory-efficient. Most other solutions are about 20 times slower.
Code to reproduce the plot:
import mmap
import subprocess
from functools import partial
import perfplot
def setup(n):
    fname = "t.txt"
    with open(fname, "w") as f:
        for i in range(n):
            f.write(str(i) + "\n")
    return fname
def for_enumerate(fname):
    i = 0
    with open(fname) as f:
        for i, _ in enumerate(f):
            pass
    return i + 1
def sum1(fname):
    return sum(1 for _ in open(fname))
def mmap_count(fname):
    with open(fname, "r+") as f:
        buf = mmap.mmap(f.fileno(), 0)
    lines = 0
    while buf.readline():
        lines += 1
    return lines
def for_open(fname):
    lines = 0
    for _ in open(fname):
        lines += 1
    return lines
def buf_count_newlines(fname):
    lines = 0
    buf_size = 2 ** 16
    with open(fname) as f:
        buf = f.read(buf_size)
        while buf:
            lines += buf.count("\n")
            buf = f.read(buf_size)
    return lines
def buf_count_newlines_gen(fname):
    def _make_gen(reader):
        b = reader(2 ** 16)
        while b:
            yield b
            b = reader(2 ** 16)
    with open(fname, "rb") as f:
        count = sum(buf.count(b"\n") for buf in _make_gen(f.raw.read))
    return count
def wc_l(fname):
    return int(subprocess.check_output(["wc", "-l", fname]).split()[0])
def sum_partial(fname):
    with open(fname) as f:
        count = sum(x.count("\n") for x in iter(partial(f.read, 2 ** 16), ""))
    return count
def read_count(fname):
    return open(fname).read().count("\n")
b = perfplot.bench(
    setup=setup,
    kernels=[
        for_enumerate,
        sum1,
        mmap_count,
        for_open,
        wc_l,
        buf_count_newlines,
        buf_count_newlines_gen,
        sum_partial,
        read_count,
    ],
    n_range=[2 ** k for k in range(27)],
    xlabel="num lines",
)
b.save("out.png")
b.show()
Here is a python program to use the multiprocessing library to distribute the line counting across machines/cores. My test improves counting a 20million line file from 26 seconds to 7 seconds using an 8 core windows 64 server. Note: not using memory mapping makes things much slower.
import multiprocessing, sys, time, os, mmap
import logging, logging.handlers
def init_logger(pid):
console_format = 'P{0} %(levelname)s %(message)s'.format(pid)
logger = logging.getLogger() # New logger at root level
logger.setLevel( logging.INFO )
logger.handlers.append( logging.StreamHandler() )
logger.handlers[0].setFormatter( logging.Formatter( console_format, '%d/%m/%y %H:%M:%S' ) )
def getFileLineCount( queues, pid, processes, file1 ):
init_logger(pid)
logging.info( 'start' )
physical_file = open(file1, "r")
# mmap.mmap(fileno, length[, tagname[, access[, offset]]]
m1 = mmap.mmap( physical_file.fileno(), 0, access=mmap.ACCESS_READ )
#work out file size to divide up line counting
fSize = os.stat(file1).st_size
chunk = (fSize / processes) + 1
lines = 0
#get where I start and stop
_seedStart = chunk * (pid)
_seekEnd = chunk * (pid+1)
seekStart = int(_seedStart)
seekEnd = int(_seekEnd)
if seekEnd < int(_seekEnd + 1):
seekEnd += 1
if _seedStart < int(seekStart + 1):
seekStart += 1
if seekEnd > fSize:
seekEnd = fSize
#find where to start
if pid > 0:
m1.seek( seekStart )
#read next line
l1 = m1.readline() # need to use readline with memory mapped files
seekStart = m1.tell()
#tell previous rank my seek start to make their seek end
if pid > 0:
queues[pid-1].put( seekStart )
if pid < processes-1:
seekEnd = queues[pid].get()
m1.seek( seekStart )
l1 = m1.readline()
while len(l1) > 0:
lines += 1
l1 = m1.readline()
if m1.tell() > seekEnd or len(l1) == 0:
break
logging.info( 'done' )
# add up the results
if pid == 0:
for p in range(1,processes):
lines += queues[0].get()
queues[0].put(lines) # the total lines counted
else:
queues[0].put(lines)
m1.close()
physical_file.close()
if __name__ == '__main__':
init_logger( 'main' )
if len(sys.argv) > 1:
file_name = sys.argv[1]
else:
logging.fatal( 'parameters required: file-name [processes]' )
exit()
t = time.time()
processes = multiprocessing.cpu_count()
if len(sys.argv) > 2:
processes = int(sys.argv[2])
queues=[] # a queue for each process
for pid in range(processes):
queues.append( multiprocessing.Queue() )
jobs=[]
prev_pipe = 0
for pid in range(processes):
p = multiprocessing.Process( target = getFileLineCount, args=(queues, pid, processes, file_name,) )
p.start()
jobs.append(p)
jobs[0].join() #wait for counting to finish
lines = queues[0].get()
logging.info( 'finished {} Lines:{}'.format( time.time() - t, lines ) )
A one-line bash solution similar to this answer, using the modern subprocess.check_output function:
def line_count(filename):
    return int(subprocess.check_output(['wc', '-l', filename]).split()[0])
I would use Python's file object method readlines, as follows:
with open(input_file) as foo:
    lines = len(foo.readlines())
This opens the file, creates a list of lines in the file, counts the length of the list, saves that to a variable and closes the file again.
This is the fastest thing I have found using pure python.
You can use whatever amount of memory you want by setting buffer, though 2**16 appears to be a sweet spot on my computer.
from functools import partial
buffer=2**16
with open(myfile) as f:
    print sum(x.count('\n') for x in iter(partial(f.read,buffer), ''))
I found the answer here Why is reading lines from stdin much slower in C++ than Python? and tweaked it just a tiny bit. It's a very good read to understand how to count lines quickly, though wc -l is still about 75% faster than anything else.
def file_len(full_path):
    """ Count number of lines in a file."""
    f = open(full_path)
    nr_of_lines = sum(1 for line in f)
    f.close()
    return nr_of_lines
Here is what I use, seems pretty clean:
import subprocess
def count_file_lines(file_path):
    """
    Counts the number of lines in a file using wc utility.
    :param file_path: path to file
    :return: int, no of lines
    """
    num = subprocess.check_output(['wc', '-l', file_path])
    num = num.split(' ')
    return int(num[0])
UPDATE: This is marginally faster than using pure python but at the cost of memory usage. Subprocess will fork a new process with the same memory footprint as the parent process while it executes your command.
One line solution:
import os
os.system("wc -l filename")
My snippet:
>>> os.system('wc -l *.txt')
0 bar.txt
1000 command.txt
3 test_file.txt
1003 total
Kyle's answer
num_lines = sum(1 for line in open('my_file.txt'))
is probably best, an alternative for this is
num_lines = len(open('my_file.txt').read().splitlines())
Here is a performance comparison of the two:
In [20]: timeit sum(1 for line in open('Charts.ipynb'))
100000 loops, best of 3: 9.79 µs per loop
In [21]: timeit len(open('Charts.ipynb').read().splitlines())
100000 loops, best of 3: 12 µs per loop
I got a small (4-8%) improvement with this version which re-uses a constant buffer so it should avoid any memory or GC overhead:
lines = 0
buffer = bytearray(2048)
with open(filename, 'rb') as f:
    n = f.readinto(buffer)
    while n > 0:
        # Count only the bytes just read; the tail may hold stale data.
        lines += buffer.count(b'\n', 0, n)
        n = f.readinto(buffer)
You can play around with the buffer size and maybe see a little improvement.
Just to complete the above methods I tried a variant with the fileinput module:
import fileinput as fi
def filecount(fname):
    for line in fi.input(fname):
        pass
    return fi.lineno()
And passed a 60mil lines file to all the above stated methods:
mapcount : 6.1331050396
simplecount : 4.588793993
opcount : 4.42918205261
filecount : 43.2780818939
bufcount : 0.170812129974
It's a little surprise to me that fileinput is that bad and scales far worse than all the other methods...
As for me this variant will be the fastest:
#!/usr/bin/env python
def main():
    f = open('filename')
    lines = 0
    buf_size = 1024 * 1024
    read_f = f.read # loop optimization
    buf = read_f(buf_size)
    while buf:
        lines += buf.count('\n')
        buf = read_f(buf_size)
    print lines
if __name__ == '__main__':
    main()
reasons: buffering faster than reading line by line and string.count is also very fast
This code is shorter and clearer. It's probably the best way:
num_lines = open('yourfile.ext').read().count('\n')
I have modified the buffer case like this:
def CountLines(filename):
    f = open(filename)
    try:
        lines = 1
        buf_size = 1024 * 1024
        read_f = f.read # loop optimization
        buf = read_f(buf_size)
        # Empty file
        if not buf:
            return 0
        while buf:
            lines += buf.count('\n')
            buf = read_f(buf_size)
        return lines
    finally:
        f.close()
Now also empty files and the last line (without \n) are counted.
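Note that starting the counter at 1, as above, appears to report one line too many when the file does end with a newline. One way to handle both cases is to remember the last byte read and add one only when it is not a newline. This is a minimal Python 3 sketch; `count_lines` and `_demo` are hypothetical names used for illustration.

```python
import os
import tempfile


def count_lines(path, chunk_size=1 << 16):
    """Count lines, including a final line that lacks a trailing newline."""
    newlines = 0
    last_byte = b""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            newlines += chunk.count(b"\n")
            last_byte = chunk[-1:]
    # A non-empty file whose last byte is not a newline has one extra line.
    if last_byte and last_byte != b"\n":
        return newlines + 1
    return newlines


def _demo(data):
    """Write `data` to a temporary file and return its line count."""
    with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
        tmp.write(data)
    try:
        return count_lines(tmp.name)
    finally:
        os.remove(tmp.name)


results = [_demo(b""), _demo(b"a\nb\n"), _demo(b"a\nb")]
```

The three demo inputs cover an empty file, a newline-terminated file, and a file with an unterminated last line.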
print open('file.txt', 'r').read().count("\n") + 1
A lot of answers already, but unfortunately most of them are just tiny economies on a barely optimizable problem...
I worked on several projects where line count was the core function of the software, and working as fast as possible with a huge number of files was of paramount importance.
The main bottleneck with line count is I/O access, as you need to read each line in order to detect the line return character, there is simply no way around. The second potential bottleneck is memory management: the more you load at once, the faster you can process, but this bottleneck is negligible compared to the first.
Hence, there are 3 major ways to reduce the processing time of a line count function, apart from tiny optimizations such as disabling gc collection and other micro-managing tricks:
Hardware solution: the major and most obvious way is non-programmatic: buy a very fast SSD/flash drive. By far, this is how you get the biggest speed boost.
Data preparation solution: if you generate the files you process, can modify how they are generated, or can afford to pre-process them, first convert the line endings to Unix style (\n), as this saves one character per line compared to Windows style (\r\n); it's not a big saving, but it's an easy gain. Second, and most importantly, you can potentially write lines of fixed length; if you need variable length, you can always pad the shorter lines. This way, you can calculate the number of lines instantly from the total file size, which is much faster to access. Often, the best solution to a problem is to pre-process it so that it better fits your end purpose.
Parallelization + hardware solution: if you can buy multiple disks (SSDs if possible), you can go beyond the speed of a single disk by leveraging parallelization: store your files in a balanced way among the disks (the easiest is to balance by total size), then read from all those disks in parallel. You can then expect a speed-up roughly in proportion to the number of disks. If buying multiple disks is not an option, parallelization likely won't help (except if your disk has multiple read heads, like some professional-grade disks, but even then the disk's internal cache memory and PCB circuitry will likely be a bottleneck and prevent you from fully using all heads in parallel; you would also have to write code specific to that drive, because you need to know the exact cluster mapping in order to store your files on clusters under different heads and read them back with different heads). Indeed, it's commonly known that sequential reading is almost always faster than random reading, and parallel reads on a single disk will perform more like random reading than sequential reading (you can test your drive's speed in both respects using CrystalDiskMark, for example).
If none of those is an option, you can only rely on micro-managing tricks to improve the speed of your line-counting function by a few percent, but don't expect anything really significant. Rather, expect the time you spend tweaking to be disproportionate to the speed improvement you'll see.
Simple method:
1)
>>> f = len(open("myfile.txt").readlines())
>>> f
430
>>> f = open("myfile.txt").read().count('\n')
>>> f
430
>>>
num_lines = len(list(open('myfile.txt')))
If you want to get the line count cheaply in Python on Linux, I recommend this method:
import os
print(os.popen("wc -l file_path").readline().split()[0])
file_path can be either an absolute or a relative path. Hope this helps.
def count_text_file_lines(path):
    with open(path, 'rt') as file:
        line_count = sum(1 for _line in file)
    return line_count
The result of opening a file is an iterator, which can be converted to a sequence, which has a length:
with open(filename) as f:
    return len(list(f))
This is more concise than your explicit loop, and avoids the enumerate.
What about this:
import itertools
def file_len(fname):
    counts = itertools.count()
    with open(fname) as f:
        for _ in f:
            next(counts)  # advance the counter once per line
    return next(counts)  # the next value equals the number of lines read
count = max(enumerate(open(filename), start=1))[0]  # start=1 so the last index is the line count
How about this?
import fileinput
import sys
counter = 0
for line in fileinput.input([sys.argv[1]]):
    counter += 1
fileinput.close()
print(counter)
How about this one-liner:
file_length = len(open('myfile.txt','r').read().split('\n'))  # one extra element if the file ends with '\n'
It takes 0.003 sec on a 3900-line file, timed with this function:
def c():
    import time
    s = time.time()
    file_length = len(open('myfile.txt','r').read().split('\n'))
    print(time.time() - s)
def line_count(path):
    count = 0
    with open(path) as lines:
        for count, l in enumerate(lines, start=1):
            pass
    return count

Find the length of a text file [duplicate]

How do I get a line count of a large file in the most memory- and time-efficient manner?
def file_len(filename):
    with open(filename) as f:
        for i, _ in enumerate(f):
            pass
    return i + 1
One line, probably pretty fast:
num_lines = sum(1 for line in open('myfile.txt'))
You can't get any better than that.
After all, any solution will have to read the entire file, figure out how many \n you have, and return that result.
Do you have a better way of doing that without reading the entire file? Not sure... The best solution will always be I/O-bound, best you can do is make sure you don't use unnecessary memory, but it looks like you have that covered.
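One detail worth keeping in mind with that reasoning: counting \n characters and counting lines only agree when the file ends with a newline. A minimal sketch of the difference (the helper names and the temporary sample file are made up for illustration, they are not from any answer here):

```python
import os
import tempfile


def count_newlines(path, buf_size=1024 * 1024):
    """Count newline bytes by reading the file in fixed-size chunks."""
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(buf_size)
            if not chunk:
                break
            total += chunk.count(b"\n")
    return total


def count_lines(path):
    """Count lines the way plain file iteration sees them."""
    with open(path, "rb") as f:
        return sum(1 for _ in f)


def write_sample(data):
    """Write bytes to a temporary file and return its path (illustration only)."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    return path


if __name__ == "__main__":
    sample = write_sample(b"a\nb\nc")  # the last line has no trailing newline
    print(count_newlines(sample), count_lines(sample))
    os.remove(sample)
```

Both functions read the whole file, so neither escapes the I/O bound; they just answer slightly different questions when the final line is unterminated.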
I believe that a memory-mapped file will be the fastest solution. I tried four functions: the function posted by the OP (opcount); a simple iteration over the lines in the file (simplecount); readline with a memory-mapped file (mmap) (mapcount); and the buffer read solution offered by Mykola Kharechko (bufcount).
I ran each function five times, and calculated the average run-time for a 1.2 million-line text file.
Windows XP, Python 2.5, 2GB RAM, 2 GHz AMD processor
Here are my results:
mapcount : 0.465599966049
simplecount : 0.756399965286
bufcount : 0.546800041199
opcount : 0.718600034714
Edit: numbers for Python 2.6:
mapcount : 0.471799945831
simplecount : 0.634400033951
bufcount : 0.468800067902
opcount : 0.602999973297
So the buffer read strategy seems to be the fastest for Windows/Python 2.6
Here is the code:
from __future__ import with_statement
import time
import mmap
import random
from collections import defaultdict
def mapcount(filename):
    f = open(filename, "r+")
    buf = mmap.mmap(f.fileno(), 0)
    lines = 0
    readline = buf.readline
    while readline():
        lines += 1
    return lines
def simplecount(filename):
    lines = 0
    for line in open(filename):
        lines += 1
    return lines
def bufcount(filename):
    f = open(filename)
    lines = 0
    buf_size = 1024 * 1024
    read_f = f.read # loop optimization
    buf = read_f(buf_size)
    while buf:
        lines += buf.count('\n')
        buf = read_f(buf_size)
    return lines
def opcount(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1
counts = defaultdict(list)
for i in range(5):
    for func in [mapcount, simplecount, bufcount, opcount]:
        start_time = time.time()
        assert func("big_file.txt") == 1209138
        counts[func].append(time.time() - start_time)
for key, vals in counts.items():
    # single-argument print() behaves the same under Python 2 and 3
    print("%s : %s" % (key.__name__, sum(vals) / float(len(vals))))
I had to post this on a similar question until my reputation score jumped a bit (thanks to whoever bumped me!).
All of these solutions ignore one way to make this run considerably faster, namely by using the unbuffered (raw) interface, using bytearrays, and doing your own buffering. (This only applies in Python 3. In Python 2, the raw interface may or may not be used by default, but in Python 3, you'll default into Unicode.)
Using a modified version of the timing tool, I believe the following code is faster (and marginally more pythonic) than any of the solutions offered:
def rawcount(filename):
    f = open(filename, 'rb')
    lines = 0
    buf_size = 1024 * 1024
    read_f = f.raw.read
    buf = read_f(buf_size)
    while buf:
        lines += buf.count(b'\n')
        buf = read_f(buf_size)
    return lines
Using a separate generator function, this runs a smidge faster:
def _make_gen(reader):
    b = reader(1024 * 1024)
    while b:
        yield b
        b = reader(1024*1024)
def rawgencount(filename):
    f = open(filename, 'rb')
    f_gen = _make_gen(f.raw.read)
    return sum( buf.count(b'\n') for buf in f_gen )
This can be done completely with generator expressions in-line using itertools, but it gets pretty weird looking:
from itertools import (takewhile,repeat)
def rawincount(filename):
    f = open(filename, 'rb')
    bufgen = takewhile(lambda x: x, (f.raw.read(1024*1024) for _ in repeat(None)))
    return sum( buf.count(b'\n') for buf in bufgen )
Here are my timings:
function      average, s   min, s   ratio
rawincount    0.0043       0.0041   1.00
rawgencount   0.0044       0.0042   1.01
rawcount      0.0048       0.0045   1.09
bufcount      0.008        0.0068   1.64
wccount       0.01         0.0097   2.35
itercount     0.014        0.014    3.41
opcount       0.02         0.02     4.83
kylecount     0.021        0.021    5.05
simplecount   0.022        0.022    5.25
mapcount      0.037        0.031    7.46
You could execute a subprocess and run wc -l filename
import subprocess
def file_len(fname):
    p = subprocess.Popen(['wc', '-l', fname], stdout=subprocess.PIPE,
                         stderr=subprocess.PIPE)
    result, err = p.communicate()
    if p.returncode != 0:
        raise IOError(err)
    return int(result.strip().split()[0])
After a perfplot analysis, one has to recommend the buffered read solution
def buf_count_newlines_gen(fname):
    def _make_gen(reader):
        while True:
            b = reader(2 ** 16)
            if not b: break
            yield b
    with open(fname, "rb") as f:
        count = sum(buf.count(b"\n") for buf in _make_gen(f.raw.read))
    return count
It's fast and memory-efficient. Most other solutions are about 20 times slower.
Code to reproduce the plot:
import mmap
import subprocess
from functools import partial
import perfplot
def setup(n):
    fname = "t.txt"
    with open(fname, "w") as f:
        for i in range(n):
            f.write(str(i) + "\n")
    return fname
def for_enumerate(fname):
    i = 0
    with open(fname) as f:
        for i, _ in enumerate(f):
            pass
    return i + 1
def sum1(fname):
    return sum(1 for _ in open(fname))
def mmap_count(fname):
    with open(fname, "r+") as f:
        buf = mmap.mmap(f.fileno(), 0)
    lines = 0
    while buf.readline():
        lines += 1
    return lines
def for_open(fname):
    lines = 0
    for _ in open(fname):
        lines += 1
    return lines
def buf_count_newlines(fname):
    lines = 0
    buf_size = 2 ** 16
    with open(fname) as f:
        buf = f.read(buf_size)
        while buf:
            lines += buf.count("\n")
            buf = f.read(buf_size)
    return lines
def buf_count_newlines_gen(fname):
    def _make_gen(reader):
        b = reader(2 ** 16)
        while b:
            yield b
            b = reader(2 ** 16)
    with open(fname, "rb") as f:
        count = sum(buf.count(b"\n") for buf in _make_gen(f.raw.read))
    return count
def wc_l(fname):
    return int(subprocess.check_output(["wc", "-l", fname]).split()[0])
def sum_partial(fname):
    with open(fname) as f:
        count = sum(x.count("\n") for x in iter(partial(f.read, 2 ** 16), ""))
    return count
def read_count(fname):
    return open(fname).read().count("\n")
b = perfplot.bench(
    setup=setup,
    kernels=[
        for_enumerate,
        sum1,
        mmap_count,
        for_open,
        wc_l,
        buf_count_newlines,
        buf_count_newlines_gen,
        sum_partial,
        read_count,
    ],
    n_range=[2 ** k for k in range(27)],
    xlabel="num lines",
)
b.save("out.png")
b.show()
Here is a Python program that uses the multiprocessing library to distribute the line counting across cores. In my test, counting a 20-million-line file went from 26 seconds to 7 seconds on an 8-core Windows 64-bit server. Note: not using memory mapping makes things much slower.
import multiprocessing, sys, time, os, mmap
import logging, logging.handlers
def init_logger(pid):
    console_format = 'P{0} %(levelname)s %(message)s'.format(pid)
    logger = logging.getLogger() # New logger at root level
    logger.setLevel( logging.INFO )
    logger.handlers.append( logging.StreamHandler() )
    logger.handlers[0].setFormatter( logging.Formatter( console_format, '%d/%m/%y %H:%M:%S' ) )
def getFileLineCount( queues, pid, processes, file1 ):
    init_logger(pid)
    logging.info( 'start' )
    physical_file = open(file1, "r")
    # mmap.mmap(fileno, length[, tagname[, access[, offset]]]
    m1 = mmap.mmap( physical_file.fileno(), 0, access=mmap.ACCESS_READ )
    #work out file size to divide up line counting
    fSize = os.stat(file1).st_size
    chunk = (fSize / processes) + 1
    lines = 0
    #get where I start and stop
    _seedStart = chunk * (pid)
    _seekEnd = chunk * (pid+1)
    seekStart = int(_seedStart)
    seekEnd = int(_seekEnd)
    if seekEnd < int(_seekEnd + 1):
        seekEnd += 1
    if _seedStart < int(seekStart + 1):
        seekStart += 1
    if seekEnd > fSize:
        seekEnd = fSize
    #find where to start
    if pid > 0:
        m1.seek( seekStart )
        #read next line
        l1 = m1.readline() # need to use readline with memory mapped files
        seekStart = m1.tell()
    #tell previous rank my seek start to make their seek end
    if pid > 0:
        queues[pid-1].put( seekStart )
    if pid < processes-1:
        seekEnd = queues[pid].get()
    m1.seek( seekStart )
    l1 = m1.readline()
    while len(l1) > 0:
        lines += 1
        l1 = m1.readline()
        if m1.tell() > seekEnd or len(l1) == 0:
            break
    logging.info( 'done' )
    # add up the results
    if pid == 0:
        for p in range(1,processes):
            lines += queues[0].get()
        queues[0].put(lines) # the total lines counted
    else:
        queues[0].put(lines)
    m1.close()
    physical_file.close()
if __name__ == '__main__':
    init_logger( 'main' )
    if len(sys.argv) > 1:
        file_name = sys.argv[1]
    else:
        logging.fatal( 'parameters required: file-name [processes]' )
        exit()
    t = time.time()
    processes = multiprocessing.cpu_count()
    if len(sys.argv) > 2:
        processes = int(sys.argv[2])
    queues=[] # a queue for each process
    for pid in range(processes):
        queues.append( multiprocessing.Queue() )
    jobs=[]
    prev_pipe = 0
    for pid in range(processes):
        p = multiprocessing.Process( target = getFileLineCount, args=(queues, pid, processes, file_name,) )
        p.start()
        jobs.append(p)
    jobs[0].join() #wait for counting to finish
    lines = queues[0].get()
    logging.info( 'finished {} Lines:{}'.format( time.time() - t, lines ) )
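The start/end handshake through queues is only needed because each worker counts via readline(). If all you want is the number of \n bytes, a simpler variant may be enough: every newline byte falls into exactly one byte range, so each worker can count inside its own range with no coordination. A rough sketch of that idea, not the author's program: count_range and parallel_newline_count are hypothetical names, and threads are used only to keep the example self-contained. Whether it beats a single sequential pass depends on the disk and the OS cache, so measure before relying on it.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def count_range(path, start, end, buf_size=1024 * 1024):
    """Count newline bytes in the byte range [start, end) of the file."""
    total = 0
    with open(path, "rb") as f:
        f.seek(start)
        remaining = end - start
        while remaining > 0:
            chunk = f.read(min(buf_size, remaining))
            if not chunk:
                break
            total += chunk.count(b"\n")
            remaining -= len(chunk)
    return total


def parallel_newline_count(path, workers=4):
    """Split the file into up to `workers` byte ranges and sum the counts."""
    size = os.path.getsize(path)
    if size == 0:
        return 0
    step = -(-size // workers)  # ceiling division
    ranges = [(s, min(s + step, size)) for s in range(0, size, step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        counts = pool.map(lambda r: count_range(path, *r), ranges)
        return sum(counts)
```

Note that this counts raw 0x0A bytes, which matches line endings for ASCII and UTF-8 text but not for encodings such as UTF-16.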
A one-line solution similar to this answer, calling wc through the modern subprocess.check_output function:
import subprocess
def line_count(filename):
    return int(subprocess.check_output(['wc', '-l', filename]).split()[0])
I would use Python's file object method readlines, as follows:
with open(input_file) as foo:
    lines = len(foo.readlines())
This opens the file, creates a list of lines in the file, counts the length of the list, saves that to a variable and closes the file again.
This is the fastest thing I have found using pure python.
You can use whatever amount of memory you want by setting buffer, though 2**16 appears to be a sweet spot on my computer.
from functools import partial
buffer = 2**16
with open(myfile) as f:
    print(sum(x.count('\n') for x in iter(partial(f.read, buffer), '')))
I found the answer here Why is reading lines from stdin much slower in C++ than Python? and tweaked it just a tiny bit. It's a very good read to understand how to count lines quickly, though wc -l is still about 75% faster than anything else.
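Since the sweet spot for the buffer size is machine-dependent, it may be worth measuring it yourself rather than taking 2**16 on faith. A small sketch of how that could be done, assuming binary reads; time_buffer_sizes and its default sizes are hypothetical choices for illustration, not anything measured in this thread:

```python
import time
from functools import partial


def count_newlines(path, buf_size):
    """Count newline bytes, reading `buf_size` bytes per call."""
    with open(path, "rb") as f:
        return sum(chunk.count(b"\n") for chunk in iter(partial(f.read, buf_size), b""))


def time_buffer_sizes(path, sizes=(2 ** 12, 2 ** 16, 2 ** 20), repeats=3):
    """Return {buf_size: best wall-clock seconds over `repeats` runs}."""
    results = {}
    for size in sizes:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            count_newlines(path, size)
            best = min(best, time.perf_counter() - start)
        results[size] = best
    return results
```

Keep in mind that after the first pass the file is probably served from the OS cache, so the numbers mostly compare per-call overhead rather than disk speed.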
def file_len(full_path):
    """ Count number of lines in a file."""
    f = open(full_path)
    nr_of_lines = sum(1 for line in f)
    f.close()
    return nr_of_lines
Here is what I use, seems pretty clean:
import subprocess
def count_file_lines(file_path):
    """
    Counts the number of lines in a file using wc utility.
    :param file_path: path to file
    :return: int, no of lines
    """
    num = subprocess.check_output(['wc', '-l', file_path])
    num = num.split()  # split on any whitespace; works for bytes (Python 3) and str
    return int(num[0])
UPDATE: This is marginally faster than using pure python but at the cost of memory usage. Subprocess will fork a new process with the same memory footprint as the parent process while it executes your command.
One line solution:
import os
os.system("wc -l filename")
My snippet:
>>> os.system('wc -l *.txt')
0 bar.txt
1000 command.txt
3 test_file.txt
1003 total
Kyle's answer
num_lines = sum(1 for line in open('my_file.txt'))
is probably best, an alternative for this is
num_lines = len(open('my_file.txt').read().splitlines())
Here is a performance comparison of the two:
In [20]: timeit sum(1 for line in open('Charts.ipynb'))
100000 loops, best of 3: 9.79 µs per loop
In [21]: timeit len(open('Charts.ipynb').read().splitlines())
100000 loops, best of 3: 12 µs per loop
I got a small (4-8%) improvement with this version, which re-uses a constant buffer so it should avoid any memory or GC overhead:
lines = 0
buffer = bytearray(2048)
with open(filename, 'rb') as f:  # readinto() needs a binary file
    n = f.readinto(buffer)
    while n > 0:
        # count only the n bytes just read; the rest of the buffer is stale
        lines += buffer.count(b'\n', 0, n)
        n = f.readinto(buffer)
You can play around with the buffer size and maybe see a little improvement.
Just to complete the above methods, I tried a variant with the fileinput module:
import fileinput as fi
def filecount(fname):
    for line in fi.input(fname):
        pass
    return fi.lineno()
And passed a 60-million-line file to all the above stated methods:
mapcount : 6.1331050396
simplecount : 4.588793993
opcount : 4.42918205261
filecount : 43.2780818939
bufcount : 0.170812129974
It's a little surprise to me that fileinput is that bad and scales far worse than all the other methods...
As for me this variant will be the fastest:
#!/usr/bin/env python
def main():
    f = open('filename')
    lines = 0
    buf_size = 1024 * 1024
    read_f = f.read # loop optimization
    buf = read_f(buf_size)
    while buf:
        lines += buf.count('\n')
        buf = read_f(buf_size)
    print(lines)
if __name__ == '__main__':
    main()
Reasons: buffered reads are faster than reading line by line, and string.count is also very fast.
This code is shorter and clearer. It's probably the best way:
num_lines = open('yourfile.ext').read().count('\n')
I have modified the buffer case like this:
def CountLines(filename):
    f = open(filename)
    try:
        lines = 0
        last = '\n'  # so that an empty file yields 0
        buf_size = 1024 * 1024
        read_f = f.read # loop optimization
        buf = read_f(buf_size)
        while buf:
            lines += buf.count('\n')
            last = buf[-1]
            buf = read_f(buf_size)
        # Count a final line that has no trailing \n
        if last != '\n':
            lines += 1
        return lines
    finally:
        f.close()
Now empty files and a last line without \n are both counted correctly.
print(open('file.txt', 'r').read().count("\n") + 1)
A lot of answers already, but unfortunately most of them are just tiny economies on a barely optimizable problem...
I worked on several projects where line count was the core function of the software, and working as fast as possible with a huge number of files was of paramount importance.
The main bottleneck with line counting is I/O access: you need to read each line in order to detect the line-ending character, and there is simply no way around it. The second potential bottleneck is memory management: the more you load at once, the faster you can process, but this bottleneck is negligible compared to the first.
Hence, there are 3 major ways to reduce the processing time of a line-counting function, apart from tiny optimizations such as disabling garbage collection and other micro-managing tricks:
Hardware solution: the major and most obvious way is non-programmatic: buy a very fast SSD/flash drive. By far, this is how you get the biggest speed boost.
Data preparation solution: if you generate the files you process, can modify how they are generated, or can afford to pre-process them, first convert the line endings to Unix style (\n), as this saves one character per line compared to Windows style (\r\n); it's not a big saving, but it's an easy gain. Second, and most importantly, you can potentially write lines of fixed length; if you need variable length, you can always pad the shorter lines. This way, you can calculate the number of lines instantly from the total file size, which is much faster to access. Often, the best solution to a problem is to pre-process it so that it better fits your end purpose.
Parallelization + hardware solution: if you can buy multiple disks (SSDs if possible), you can go beyond the speed of a single disk by leveraging parallelization: store your files in a balanced way among the disks (the easiest is to balance by total size), then read from all those disks in parallel. You can then expect a speed-up roughly in proportion to the number of disks. If buying multiple disks is not an option, parallelization likely won't help (except if your disk has multiple read heads, like some professional-grade disks, but even then the disk's internal cache memory and PCB circuitry will likely be a bottleneck and prevent you from fully using all heads in parallel; you would also have to write code specific to that drive, because you need to know the exact cluster mapping in order to store your files on clusters under different heads and read them back with different heads). Indeed, it's commonly known that sequential reading is almost always faster than random reading, and parallel reads on a single disk will perform more like random reading than sequential reading (you can test your drive's speed in both respects using CrystalDiskMark, for example).
If none of those is an option, you can only rely on micro-managing tricks to improve the speed of your line-counting function by a few percent, but don't expect anything really significant. Rather, expect the time you spend tweaking to be disproportionate to the speed improvement you'll see.
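To make the data preparation idea concrete: if every line is padded to the same width, the line count follows from the file size with no read of the contents at all. A minimal sketch under stated assumptions (ASCII content, space padding, one \n terminator per record; write_fixed_width and fixed_width_line_count are hypothetical helper names):

```python
import os


def write_fixed_width(path, lines, width):
    """Write each line padded with spaces to `width` bytes, plus one newline byte."""
    with open(path, "wb") as f:
        for line in lines:
            data = line.encode("ascii")
            if len(data) > width:
                raise ValueError("line longer than the fixed width")
            f.write(data.ljust(width) + b"\n")


def fixed_width_line_count(path, width):
    """Derive the line count from the file size alone."""
    record = width + 1  # payload plus the newline byte
    size = os.path.getsize(path)
    if size % record:
        raise ValueError("file size is not a multiple of the record length")
    return size // record
```

The size check is there because the shortcut silently gives wrong answers as soon as a single record deviates from the agreed width.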
Simple method:
1)
>>> f = len(open("myfile.txt").readlines())
>>> f
430
>>> f = open("myfile.txt").read().count('\n')
>>> f
430
>>>
num_lines = len(list(open('myfile.txt')))
If you want to get the line count cheaply in Python on Linux, I recommend this method:
import os
print(os.popen("wc -l file_path").readline().split()[0])
file_path can be either an absolute or a relative path. Hope this helps.
def count_text_file_lines(path):
    with open(path, 'rt') as file:
        line_count = sum(1 for _line in file)
    return line_count
The result of opening a file is an iterator, which can be converted to a sequence, which has a length:
with open(filename) as f:
    return len(list(f))
This is more concise than your explicit loop, and avoids the enumerate.
What about this:
import itertools
def file_len(fname):
    counts = itertools.count()
    with open(fname) as f:
        for _ in f:
            next(counts)  # advance the counter once per line
    return next(counts)  # the next value equals the number of lines read
count = max(enumerate(open(filename), start=1))[0]  # start=1 so the last index is the line count
How about this?
import fileinput
import sys
counter = 0
for line in fileinput.input([sys.argv[1]]):
    counter += 1
fileinput.close()
print(counter)
How about this one-liner:
file_length = len(open('myfile.txt','r').read().split('\n'))  # one extra element if the file ends with '\n'
It takes 0.003 sec on a 3900-line file, timed with this function:
def c():
    import time
    s = time.time()
    file_length = len(open('myfile.txt','r').read().split('\n'))
    print(time.time() - s)
def line_count(path):
    count = 0
    with open(path) as lines:
        for count, l in enumerate(lines, start=1):
            pass
    return count

How to get line count of a large file cheaply in Python?

How do I get a line count of a large file in the most memory- and time-efficient manner?
def file_len(filename):
with open(filename) as f:
for i, _ in enumerate(f):
pass
return i + 1
One line, probably pretty fast:
num_lines = sum(1 for line in open('myfile.txt'))
You can't get any better than that.
After all, any solution will have to read the entire file, figure out how many \n you have, and return that result.
Do you have a better way of doing that without reading the entire file? Not sure... The best solution will always be I/O-bound, best you can do is make sure you don't use unnecessary memory, but it looks like you have that covered.
I believe that a memory mapped file will be the fastest solution. I tried four functions: the function posted by the OP (opcount); a simple iteration over the lines in the file (simplecount); readline with a memory-mapped filed (mmap) (mapcount); and the buffer read solution offered by Mykola Kharechko (bufcount).
I ran each function five times, and calculated the average run-time for a 1.2 million-line text file.
Windows XP, Python 2.5, 2GB RAM, 2 GHz AMD processor
Here are my results:
mapcount : 0.465599966049
simplecount : 0.756399965286
bufcount : 0.546800041199
opcount : 0.718600034714
Edit: numbers for Python 2.6:
mapcount : 0.471799945831
simplecount : 0.634400033951
bufcount : 0.468800067902
opcount : 0.602999973297
So the buffer read strategy seems to be the fastest for Windows/Python 2.6
Here is the code:
from __future__ import with_statement
import time
import mmap
import random
from collections import defaultdict
def mapcount(filename):
f = open(filename, "r+")
buf = mmap.mmap(f.fileno(), 0)
lines = 0
readline = buf.readline
while readline():
lines += 1
return lines
def simplecount(filename):
lines = 0
for line in open(filename):
lines += 1
return lines
def bufcount(filename):
f = open(filename)
lines = 0
buf_size = 1024 * 1024
read_f = f.read # loop optimization
buf = read_f(buf_size)
while buf:
lines += buf.count('\n')
buf = read_f(buf_size)
return lines
def opcount(fname):
with open(fname) as f:
for i, l in enumerate(f):
pass
return i + 1
counts = defaultdict(list)
for i in range(5):
for func in [mapcount, simplecount, bufcount, opcount]:
start_time = time.time()
assert func("big_file.txt") == 1209138
counts[func].append(time.time() - start_time)
for key, vals in counts.items():
print key.__name__, ":", sum(vals) / float(len(vals))
I had to post this on a similar question until my reputation score jumped a bit (thanks to whoever bumped me!).
All of these solutions ignore one way to make this run considerably faster, namely by using the unbuffered (raw) interface, using bytearrays, and doing your own buffering. (This only applies in Python 3. In Python 2, the raw interface may or may not be used by default, but in Python 3, you'll default into Unicode.)
Using a modified version of the timing tool, I believe the following code is faster (and marginally more pythonic) than any of the solutions offered:
def rawcount(filename):
f = open(filename, 'rb')
lines = 0
buf_size = 1024 * 1024
read_f = f.raw.read
buf = read_f(buf_size)
while buf:
lines += buf.count(b'\n')
buf = read_f(buf_size)
return lines
Using a separate generator function, this runs a smidge faster:
def _make_gen(reader):
b = reader(1024 * 1024)
while b:
yield b
b = reader(1024*1024)
def rawgencount(filename):
f = open(filename, 'rb')
f_gen = _make_gen(f.raw.read)
return sum( buf.count(b'\n') for buf in f_gen )
This can be done completely with generators expressions in-line using itertools, but it gets pretty weird looking:
from itertools import (takewhile,repeat)
def rawincount(filename):
f = open(filename, 'rb')
bufgen = takewhile(lambda x: x, (f.raw.read(1024*1024) for _ in repeat(None)))
return sum( buf.count(b'\n') for buf in bufgen )
Here are my timings:
function average, s min, s ratio
rawincount 0.0043 0.0041 1.00
rawgencount 0.0044 0.0042 1.01
rawcount 0.0048 0.0045 1.09
bufcount 0.008 0.0068 1.64
wccount 0.01 0.0097 2.35
itercount 0.014 0.014 3.41
opcount 0.02 0.02 4.83
kylecount 0.021 0.021 5.05
simplecount 0.022 0.022 5.25
mapcount 0.037 0.031 7.46
You could execute a subprocess and run wc -l filename
import subprocess
def file_len(fname):
p = subprocess.Popen(['wc', '-l', fname], stdout=subprocess.PIPE,
stderr=subprocess.PIPE)
result, err = p.communicate()
if p.returncode != 0:
raise IOError(err)
return int(result.strip().split()[0])
After a perfplot analysis, one has to recommend the buffered read solution
def buf_count_newlines_gen(fname):
def _make_gen(reader):
while True:
b = reader(2 ** 16)
if not b: break
yield b
with open(fname, "rb") as f:
count = sum(buf.count(b"\n") for buf in _make_gen(f.raw.read))
return count
It's fast and memory-efficient. Most other solutions are about 20 times slower.
Code to reproduce the plot:
import mmap
import subprocess
from functools import partial
import perfplot
def setup(n):
fname = "t.txt"
with open(fname, "w") as f:
for i in range(n):
f.write(str(i) + "\n")
return fname
def for_enumerate(fname):
i = 0
with open(fname) as f:
for i, _ in enumerate(f):
pass
return i + 1
def sum1(fname):
return sum(1 for _ in open(fname))
def mmap_count(fname):
with open(fname, "r+") as f:
buf = mmap.mmap(f.fileno(), 0)
lines = 0
while buf.readline():
lines += 1
return lines
def for_open(fname):
lines = 0
for _ in open(fname):
lines += 1
return lines
def buf_count_newlines(fname):
lines = 0
buf_size = 2 ** 16
with open(fname) as f:
buf = f.read(buf_size)
while buf:
lines += buf.count("\n")
buf = f.read(buf_size)
return lines
def buf_count_newlines_gen(fname):
def _make_gen(reader):
b = reader(2 ** 16)
while b:
yield b
b = reader(2 ** 16)
with open(fname, "rb") as f:
count = sum(buf.count(b"\n") for buf in _make_gen(f.raw.read))
return count
def wc_l(fname):
return int(subprocess.check_output(["wc", "-l", fname]).split()[0])
def sum_partial(fname):
with open(fname) as f:
count = sum(x.count("\n") for x in iter(partial(f.read, 2 ** 16), ""))
return count
def read_count(fname):
return open(fname).read().count("\n")
b = perfplot.bench(
setup=setup,
kernels=[
for_enumerate,
sum1,
mmap_count,
for_open,
wc_l,
buf_count_newlines,
buf_count_newlines_gen,
sum_partial,
read_count,
],
n_range=[2 ** k for k in range(27)],
xlabel="num lines",
)
b.save("out.png")
b.show()
Here is a python program to use the multiprocessing library to distribute the line counting across machines/cores. My test improves counting a 20million line file from 26 seconds to 7 seconds using an 8 core windows 64 server. Note: not using memory mapping makes things much slower.
import multiprocessing, sys, time, os, mmap
import logging, logging.handlers
def init_logger(pid):
console_format = 'P{0} %(levelname)s %(message)s'.format(pid)
logger = logging.getLogger() # New logger at root level
logger.setLevel( logging.INFO )
logger.handlers.append( logging.StreamHandler() )
logger.handlers[0].setFormatter( logging.Formatter( console_format, '%d/%m/%y %H:%M:%S' ) )
def getFileLineCount( queues, pid, processes, file1 ):
init_logger(pid)
logging.info( 'start' )
physical_file = open(file1, "r")
# mmap.mmap(fileno, length[, tagname[, access[, offset]]]
m1 = mmap.mmap( physical_file.fileno(), 0, access=mmap.ACCESS_READ )
#work out file size to divide up line counting
fSize = os.stat(file1).st_size
chunk = (fSize / processes) + 1
lines = 0
#get where I start and stop
_seedStart = chunk * (pid)
_seekEnd = chunk * (pid+1)
seekStart = int(_seedStart)
seekEnd = int(_seekEnd)
if seekEnd < int(_seekEnd + 1):
seekEnd += 1
if _seedStart < int(seekStart + 1):
seekStart += 1
if seekEnd > fSize:
seekEnd = fSize
#find where to start
if pid > 0:
m1.seek( seekStart )
#read next line
l1 = m1.readline() # need to use readline with memory mapped files
seekStart = m1.tell()
#tell previous rank my seek start to make their seek end
if pid > 0:
queues[pid-1].put( seekStart )
if pid < processes-1:
seekEnd = queues[pid].get()
m1.seek( seekStart )
l1 = m1.readline()
while len(l1) > 0:
lines += 1
l1 = m1.readline()
if m1.tell() > seekEnd or len(l1) == 0:
break
logging.info( 'done' )
# add up the results
if pid == 0:
for p in range(1,processes):
lines += queues[0].get()
queues[0].put(lines) # the total lines counted
else:
queues[0].put(lines)
m1.close()
physical_file.close()
if __name__ == '__main__':
init_logger( 'main' )
if len(sys.argv) > 1:
file_name = sys.argv[1]
else:
logging.fatal( 'parameters required: file-name [processes]' )
exit()
t = time.time()
processes = multiprocessing.cpu_count()
if len(sys.argv) > 2:
processes = int(sys.argv[2])
queues=[] # a queue for each process
for pid in range(processes):
queues.append( multiprocessing.Queue() )
jobs=[]
prev_pipe = 0
for pid in range(processes):
p = multiprocessing.Process( target = getFileLineCount, args=(queues, pid, processes, file_name,) )
p.start()
jobs.append(p)
jobs[0].join() #wait for counting to finish
lines = queues[0].get()
logging.info( 'finished {} Lines:{}'.format( time.time() - t, lines ) )
A one-line bash solution similar to this answer, using the modern subprocess.check_output function:
def line_count(filename):
return int(subprocess.check_output(['wc', '-l', filename]).split()[0])
I would use Python's file object method readlines, as follows:
with open(input_file) as foo:
lines = len(foo.readlines())
This opens the file, creates a list of lines in the file, counts the length of the list, saves that to a variable and closes the file again.
This is the fastest thing I have found using pure python.
You can use whatever amount of memory you want by setting buffer, though 2**16 appears to be a sweet spot on my computer.
from functools import partial
buffer=2**16
with open(myfile) as f:
print sum(x.count('\n') for x in iter(partial(f.read,buffer), ''))
I found the answer here Why is reading lines from stdin much slower in C++ than Python? and tweaked it just a tiny bit. Its a very good read to understand how to count lines quickly, though wc -l is still about 75% faster than anything else.
def file_len(full_path):
""" Count number of lines in a file."""
f = open(full_path)
nr_of_lines = sum(1 for line in f)
f.close()
return nr_of_lines
Here is what I use, seems pretty clean:
import subprocess
def count_file_lines(file_path):
"""
Counts the number of lines in a file using wc utility.
:param file_path: path to file
:return: int, no of lines
"""
num = subprocess.check_output(['wc', '-l', file_path])
num = num.split(' ')
return int(num[0])
UPDATE: This is marginally faster than using pure python but at the cost of memory usage. Subprocess will fork a new process with the same memory footprint as the parent process while it executes your command.
One line solution:
import os
os.system("wc -l filename")
My snippet:
>>> os.system('wc -l *.txt')
0 bar.txt
1000 command.txt
3 test_file.txt
1003 total
Kyle's answer
num_lines = sum(1 for line in open('my_file.txt'))
is probably best, an alternative for this is
num_lines = len(open('my_file.txt').read().splitlines())
Here is the comparision of performance of both
In [20]: timeit sum(1 for line in open('Charts.ipynb'))
100000 loops, best of 3: 9.79 µs per loop
In [21]: timeit len(open('Charts.ipynb').read().splitlines())
100000 loops, best of 3: 12 µs per loop
I got a small (4-8%) improvement with this version which re-uses a constant buffer so it should avoid any memory or GC overhead:
lines = 0
buffer = bytearray(2048)
with open(filename) as f:
while f.readinto(buffer) > 0:
lines += buffer.count('\n')
You can play around with the buffer size and maybe see a little improvement.
Just to complete the above methods I tried a variant with the fileinput module:
import fileinput as fi
def filecount(fname):
for line in fi.input(fname):
pass
return fi.lineno()
And passed a 60mil lines file to all the above stated methods:
mapcount : 6.1331050396
simplecount : 4.588793993
opcount : 4.42918205261
filecount : 43.2780818939
bufcount : 0.170812129974
It's a little surprise to me that fileinput is that bad and scales far worse than all the other methods...
As for me this variant will be the fastest:
#!/usr/bin/env python
def main():
f = open('filename')
lines = 0
buf_size = 1024 * 1024
read_f = f.read # loop optimization
buf = read_f(buf_size)
while buf:
lines += buf.count('\n')
buf = read_f(buf_size)
print lines
if __name__ == '__main__':
main()
reasons: buffering faster than reading line by line and string.count is also very fast
This code is shorter and clearer. It's probably the best way:
num_lines = open('yourfile.ext').read().count('\n')
I have modified the buffer case like this:
def CountLines(filename):
    f = open(filename)
    try:
        lines = 1
        buf_size = 1024 * 1024
        read_f = f.read # loop optimization
        buf = read_f(buf_size)
        # Empty file
        if not buf:
            return 0
        while buf:
            lines += buf.count('\n')
            buf = read_f(buf_size)
        return lines
    finally:
        f.close()
Now empty files and a last line without a trailing \n are counted as well.
print open('file.txt', 'r').read().count("\n") + 1
A lot of answers already, but unfortunately most of them are just tiny economies on a barely optimizable problem...
I worked on several projects where line count was the core function of the software, and working as fast as possible with a huge number of files was of paramount importance.
The main bottleneck with line counting is I/O access: you need to read every line in order to detect the newline character, and there is simply no way around it. The second potential bottleneck is memory management: the more you load at once, the faster you can process, but this bottleneck is negligible compared to the first.
Hence, there are 3 major ways to reduce the processing time of a line count function, apart from tiny optimizations such as disabling garbage collection and other micro-managing tricks:
Hardware solution: the major and most obvious way is non-programmatic: buy a very fast SSD/flash hard drive. By far, this is how you can get the biggest speed boosts.
Data preparation solution: if you generate the files yourself, can change how they are generated, or can afford to pre-process them, first convert the line endings to Unix style (\n). This saves one character per line compared to Windows style (\r\n); it is not a big saving, but it is an easy gain. Second, and more importantly, you can write lines of fixed length, padding shorter lines if your data is of variable length. The number of lines then follows directly from the total file size, which is much faster to obtain than reading the file. Often, the best solution to a problem is to pre-process it so that it better fits your end purpose.
Parallelization + hardware solution: if you can buy multiple hard disks (SSDs if possible), you can go beyond the speed of a single disk by distributing your files evenly among the disks (balancing by total size is easiest) and reading from all of them in parallel. You can then expect a speed-up roughly in proportion to the number of disks. If buying multiple disks is not an option, parallelization likely won't help. The exception is a disk with multiple read heads, like some professional-grade disks, but even then the disk's internal cache and circuitry will likely be a bottleneck that prevents you from using all heads in parallel, and you would have to write code specific to that drive, because you need to know the exact cluster mapping in order to store your files on clusters under different heads and read them back with different heads. Indeed, it is commonly known that sequential reading is almost always faster than random reading, and parallel reads on a single disk perform more like random reads than sequential ones (you can test your hard drive's speed in both respects using CrystalDiskMark, for example).
If none of those is an option, you can only rely on micro-managing tricks to speed up your line counting function by a few percent, but don't expect anything really significant. Rather, expect the time you spend tweaking to be disproportionate to the speed improvement you get back.
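The fixed-length idea from the data preparation point can be sketched as follows. This is only an illustration for Python 3: the width of 16 bytes, the space padding, and the names write_fixed_width and count_fixed_width are all assumptions, not something taken from a real project.

```python
import os
import tempfile

RECORD_WIDTH = 16  # hypothetical fixed payload width, excluding the "\n"


def write_fixed_width(path, rows, width=RECORD_WIDTH):
    # Pad every row with spaces to the same width; longer rows are rejected.
    with open(path, "wb") as f:
        for row in rows:
            data = row.encode("ascii")
            if len(data) > width:
                raise ValueError("row longer than the fixed width")
            f.write(data.ljust(width) + b"\n")


def count_fixed_width(path, width=RECORD_WIDTH):
    # Every line occupies width + 1 bytes, so the count follows from the size alone.
    size = os.path.getsize(path)
    record = width + 1
    if size % record:
        raise ValueError("file size is not a multiple of the record size")
    return size // record


fd, sample_path = tempfile.mkstemp()
os.close(fd)
write_fixed_width(sample_path, ["alpha", "beta", "gamma"])
fixed_count = count_fixed_width(sample_path)
fixed_size = os.path.getsize(sample_path)
os.remove(sample_path)
print(fixed_count)
```

The count is obtained from the file size alone, without opening the file, which is where the speed-up comes from. The price is the padding written to disk and the need to strip it again when reading the rows.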
Simple method:
>>> f = len(open("myfile.txt").readlines())
>>> f
430
>>> f = open("myfile.txt").read().count('\n')
>>> f
430
>>>
num_lines = len(list(open('myfile.txt')))
If one wants to get the line count cheaply in Python in Linux, I recommend this method:
import os
print os.popen("wc -l file_path").readline().split()[0]
file_path can be either an absolute or a relative path. Hope this helps.
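If the path comes from a variable, building the shell command by string concatenation breaks on spaces or quotes in the file name. A possible alternative, sketched here for Python 3, passes the arguments as a list through subprocess so that no shell is involved. The name wc_line_count and the pure-Python fallback for systems without wc are additions for this sketch.

```python
import os
import shutil
import subprocess
import tempfile


def wc_line_count(path):
    """Return the newline count of path, using wc -l when it is available."""
    wc = shutil.which("wc")
    if wc is None:
        # Fallback for systems without wc (an addition for this sketch).
        with open(path, "rb") as f:
            return sum(chunk.count(b"\n")
                       for chunk in iter(lambda: f.read(1 << 20), b""))
    # Passing a list avoids the shell, so spaces or quotes in path are safe.
    out = subprocess.check_output([wc, "-l", path])
    # The output is the count followed by the path; split() also drops
    # any leading spaces that some wc versions print.
    return int(out.split()[0])


# Hypothetical sample file with a space in its name and three lines.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "my file.txt")
with open(sample_path, "wb") as f:
    f.write(b"one\ntwo\nthree\n")

wc_count = wc_line_count(sample_path)
os.remove(sample_path)
os.rmdir(tmp_dir)
print(wc_count)
```

Keep in mind that wc -l counts newline characters, so a last line without a trailing newline is not included in the result.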
def count_text_file_lines(path):
    with open(path, 'rt') as file:
        line_count = sum(1 for _line in file)
    return line_count
The result of opening a file is an iterator, which can be converted to a sequence, which has a length:
with open(filename) as f:
    return len(list(f))
This is more concise than your explicit loop, and avoids the enumerate.
What about this?
import itertools
def file_len(fname):
    counts = itertools.count()
    with open(fname) as f:
        for _ in f:
            next(counts)
    return next(counts)
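# Start enumerate at 1 so the last index equals the line count (raises ValueError on an empty file).
count = max(enumerate(open(filename), 1))[0]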
How about this?
import fileinput
import sys
counter=0
for line in fileinput.input([sys.argv[1]]):
    counter+=1
fileinput.close()
print counter
How about this one-liner:
file_length = len(open('myfile.txt','r').read().split('\n'))
It takes 0.003 sec on a 3900-line file, timed with this function:
def c():
    import time
    s = time.time()
    file_length = len(open('myfile.txt','r').read().split('\n'))
    print time.time() - s
def line_count(path):
    count = 0
    with open(path) as lines:
        for count, l in enumerate(lines, start=1):
            pass
    return count
