Reading 100 000 lines from stdin is too slow - python

I am writing a program that reads up to 100,000 lines of integer pairs from sys.stdin and does calculations on them. My whole program, consisting of reading this input and performing calculations on the integers of each input line, has to finish within a maximum of 1 second. The problem is that just reading through all the lines of input takes far more than 1 second: in the case of 100,000 lines, it takes roughly 10 seconds.
My question is: is this performance to be expected for this number of lines?
The input is in the format:
100000 5 100000
72324 563
56487 2252
866 19750
65532 69349
96171 56840
70287 14094
76381 14722
48359 38831
74431 12611
29994 66230
92169 20726
39565 38429
59416 2360
45470 40781
...
Where the rightmost integer on the first line indicates the number of lines to come.
To read this input, I'm using the following code:
import time
from sys import stdin, stderr
def read():
    row = stdin.readline().split()
    n, k, q = int(row[0]), int(row[1]), int(row[2])
    start = time.clock()
    for i in range(q):
        line = stdin.readline().split()
        # Do some calculation on the integers of this line...
    end = time.clock()
    print("Reading time: " + str(end-start))

read()
Am I missing something here? The 1-second limit comes from this being a school assignment that involves calculating Q distances between pairs of nodes in a K-ary tree.
Thanks in advance.
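One thing worth trying (not from the original post; a sketch assuming the exact input format above) is to read all of stdin in one call and split it into tokens, instead of calling readline() once per line, since per-line reads and splits carry a lot of per-call overhead in Python:

import sys

def read_fast():
    # Read the whole input at once and split it into whitespace-separated tokens.
    data = sys.stdin.buffer.read().split()
    n, k, q = int(data[0]), int(data[1]), int(data[2])
    # The remaining tokens are the q integer pairs, flattened.
    nums = list(map(int, data[3:]))
    pairs = [(nums[2 * i], nums[2 * i + 1]) for i in range(q)]
    return n, k, q, pairs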


tqdm: simple loop on iterations, show MB/s

Question
I want to do a simple loop over n iterations. Each iteration contains a copy operation of which I know how many bytes were copied. My question is: how can I also show the number of MB/s in the progress bar?
Example
I'm showing a progress bar around rsync. I modified this answer as follows:
import subprocess
import sys
import re
import tqdm

n = len(open('myfiles.txt').readlines())
your_command = 'rsync -aP --files-from="myfiles.txt" myhost:mysource .'
pbar = tqdm.trange(n)
process = subprocess.Popen(your_command, stdout=subprocess.PIPE, shell=True)
for line in iter(process.stdout.readline, b''):  # sentinel is b'' because stdout yields bytes
    line = line.decode("utf-8")
    if re.match(r'(.*)(xfr\#)([0-9])(.*)(to\-chk\=)([0-9])(.*)', line):
        pbar.update()
Whereby myfiles.txt contains a list of files. This gives me a good progress bar, showing the number of iterations per second.
However, the summary line that I match on to signal that a file was copied, e.g.
682,356 100% 496.92kB/s 0:00:01 (xfr#5, to-chk=16756/22445)
also contains the number of bytes that were copied, which I want to use to show a copying speed.
Below is code that does what you need.
Because I didn't have your example data, I created a simple generator function that emulates the process of retrieving data. On each iteration, after a random delay, it yields a value equal to a random number of bytes (not megabytes).
There are two tqdm bars below: the first tracks progress, measuring iterations per second, the total number of iterations done, and the percentage; the second measures speed in megabytes per second and the total number of megabytes received.
import tqdm

def gen(cnt):
    import time, random
    for i in range(cnt):
        time.sleep(random.random() * 0.125)
        yield random.randrange(1 << 20)

total_iterations = 150
pbar = tqdm.tqdm(total = total_iterations, ascii = True)
sbar = tqdm.tqdm(unit = 'B', ascii = True, unit_scale = True)

for e in gen(total_iterations):
    pbar.update()
    sbar.update(e)
Output:
9%|█████████▉ | 89/1000 [00:11<01:53, 8.05it/s]
40.196MiB [00:11, 3.63MiB/s]
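Applied to the rsync loop from the question, a rough sketch (the regex and the position of the byte count are assumptions based on the sample summary line shown in the question):

import re
import subprocess
import tqdm

n = len(open('myfiles.txt').readlines())
cmd = 'rsync -aP --files-from="myfiles.txt" myhost:mysource .'

pbar = tqdm.tqdm(total=n, ascii=True)                    # files copied
sbar = tqdm.tqdm(unit='B', unit_scale=True, ascii=True)  # bytes copied

process = subprocess.Popen(cmd, stdout=subprocess.PIPE, shell=True)
for raw in iter(process.stdout.readline, b''):
    line = raw.decode('utf-8')
    # Match the per-file summary line, e.g.
    # "682,356 100%  496.92kB/s  0:00:01 (xfr#5, to-chk=16756/22445)"
    m = re.match(r'\s*([\d,]+)\s+\d+%.*xfr#\d+', line)
    if m:
        nbytes = int(m.group(1).replace(',', ''))
        pbar.update()        # one more file finished
        sbar.update(nbytes)  # bytes copied for that file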

How can I handle a file with 161 million lines?

I am trying to handle a big 3 GB file, "mydata.dat", with 161,991,000 lines. The code computes distances between pairs of points for DensityPeakCluster; the number of points is 18,000.
A sample of the file looks like:
1 2 26.23
1 3 44.49
1 4 47.17
and so on until
1 18000 23.5
then
2 3 25.2
2 4 15.2
until 2 18000 0.25, and so on up to 17999 18000 0.25
Block one of the code is:
from collections import defaultdict  # needed for the Graph base class

class Graph(defaultdict):
    def __init__(self, input_file, sep=" ", header=False, undirect=True):
        super(Graph, self).__init__(dict)
        self.edges_num = 0
        with open(input_file) as f:
            if header:
                f.readline()
            for line in f:
                line = line.strip().split(sep)
                self[line[0]][line[1]] = float(line[2])
                self.edges_num += 1
                if undirect:
                    self[line[1]][line[0]] = float(line[2])
                    self.edges_num += 1

    def edges(self):
        edges_list = []
        for node1 in self:
            for node2 in self[node1]:
                edges_list.append((node1, node2))
        return edges_list
Block two of the code (the full code is too long to post here):
def edges_weight(self):
    weight_list = []
    for edge in self.edges():
        node1, node2 = edge
        weight_list.append([node1, node2, self[node1][node2]])
    weight_list = sorted(weight_list, key=lambda x: x[2])
    return weight_list

def get_weight(self, node1, node2):
    return self[node1][node2]

def get_weights(self):
    weights = []
    for edge in self.edges():
        weights.append(self.get_weight(edge[0], edge[1]))
    return weights

if __name__ == "__main__":
    input_file = "./data/mydata.dat"
    percent = 2.0
    output_file = "./data/results"
    G = Graph(input_file)
    position = round(G.number_of_edges() * percent / 100)
    dc = G.edges_weight()[position][2]
    print("average percentage of neighbours (hard coded): {}".format(percent))
    print("Computing Rho with gaussian kernel of radius: {}".format(dc))
    nodes = G.nodes()
    for i in range(G.number_of_nodes() - 1):
        for j in range(i + 1, G.number_of_nodes()):
            node_i = nodes[i]
            node_j = nodes[j]
            dist_ij = G.get_weight(node_i, node_j)
What happened to me:
1- The process got killed, so I tried to change the file reading to:
bigfile = open(input_file, 'r')
tmp_lines = bigfile.readlines(1024*1024)
for line in tmp_lines:
    line = line.strip().split(sep)
    self[line[0]][line[1]] = float(line[2])
    self.edges_num += 1
    if undirect:
        self[line[1]][line[0]] = float(line[2])
        self.edges_num += 1
2- but got
dist_ij = G.get_weight(node_i, node_j) in get_weight
return self[node1][node2]
KeyError: '6336'
3- I tried to use Google Colab, but it didn't work because its 12 GB of RAM wasn't enough for me. I asked about buying new RAM, but the real problem remains that I can't manage the code well enough to need less RAM for processing. I'm stuck on this problem and don't know what I should do.
1- My problem is how to deal with a file as big as the one I have. What approach should I use to handle this size?
2- If I use NumPy to load the file, will that decrease memory usage?
The most straightforward answer is to not load the whole file at once. This can even be done one line at a time. For example, suppose you wanted the sum of the distances:
filename = 'file.dat'
lines = (float(line.split(' ')[2]) for line in open(filename))
print(sum(lines))
Here we did not load all the lines into memory. Instead we opened a file and created a Python generator. The generator holds the expression "float(line.split(' ')[2])" and only evaluates it when a line is requested. The requests are driven by sum(), which pulls one line at a time as needed, never loading more than one line into memory at a time. So when we execute that statement, we add up the values coming from the generator one by one and keep a running total. The point is that the code uses essentially no RAM (aside from interpreter and buffering overhead).
This can also be done a piece at a time. For example, to load only the lines whose first or second field is '0':
filename = 'file.dat'
lines = (line.split(' ') for line in open(filename))
zeros = (line for line in lines if line[0]=='0' or line[1]=='0')
print(sum(float(c) for a,b,c in zeros))
This can of course be slower than loading some or all of the file into memory. Moreover, you have to consider how many times you want to iterate over the file like this. It is preferable to iterate over the lines only a few times, gathering all the calculations you want, and then save those results, because re-iterating over the file takes more time.
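For instance, a minimal sketch (not from the original answer) that gathers several aggregates in a single pass over the file:

filename = 'file.dat'

count = 0
total = 0.0
minimum = float('inf')
maximum = float('-inf')

# One pass over the file: accumulate everything needed, one line at a time.
for line in open(filename):
    d = float(line.split(' ')[2])
    count += 1
    total += d
    minimum = min(minimum, d)
    maximum = max(maximum, d)

print(count, total / count, minimum, maximum)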
In considering loading the file into memory, you need to double check what exactly you want to load and how. For example, do you want to load the values 1 2 in the line 1 2 26.23? If not, then strip those out to take up less memory. For example
import numpy as np
filename = 'file.dat'
values = (float(line.split(' ')[2]) for line in open(filename))
X = np.fromiter(values,dtype='float32',count=161991000)
By specifying the count, we told NumPy exactly how much memory to allocate in advance (instead of having it re-grow the array every time it needs more memory). With a count of that size and a dtype of float32, we know this data will take up about 648 MB of RAM (161,991,000 × 4 bytes). So be careful not to write any operations that duplicate this data; if you write something that makes 5 copies of it, that will eat up RAM quickly.
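A quick check of that figure, assuming the array X from the snippet above has been built:

print(161991000 * 4 / 1e6)  # 647.964 -> roughly 648 MB of float32 values
print(X.nbytes / 1e6)       # the same number, reported by the array itself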
I think this gives you an idea of how to manage memory. :-)

search a 2GB WAV file for dropouts using wave module

What is the best way to analyze a 2 GB WAV file (1 kHz tone) for audio dropouts using the wave module? I tried the script below:
import wave

file1 = wave.open("testdropout.wav", "r")
file2 = open("silence.log", "w")

for i in xrange(file1.getnframes()):
    frame = file1.readframes(i)
    zero = True
    for j in xrange(len(frame)):
        # check if amplitude is greater than 0
        # the ord() function converts the hex values to integers
        if ord(frame[j]) > 0:
            zero = False
            break
    if zero:
        print >> file2, 'dropout at second %s' % (file1.tell()/file1.getframerate())

file1.close()
file2.close()
I haven't used the wave module before, but file1.readframes(i) looks like it's reading 1 frame when you're at the first frame, 2 frames when you're at the second frame, 10 frames when you're at the tenth frame, and a 2 GB CD-quality file might have a million frames; by the time you're at frame 100,000 you're reading 100,000 frames, getting slower each time through the loop as well?
And from my comment: in Python 2, range() generates an in-memory list of the full size first and xrange() doesn't, but not using range at all helps even more.
And push the looping down into the lower layers with any() to make the code shorter, and possibly faster:
import wave

file1 = wave.open("testdropout.wav", "r")
file2 = open("silence.log", "w")

chunksize = file1.getframerate()
chunk = file1.readframes(chunksize)

while chunk:
    if not any(ord(sample) for sample in chunk):
        print >> file2, 'dropout at second %s' % (file1.tell()/chunksize)
    chunk = file1.readframes(chunksize)

file1.close()
file2.close()
This should read the file in 1-second chunks.
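Note that ord(sample) treats every byte as one sample, which only matches 8-bit audio. For 16-bit PCM, a sketch (not from the original answers; Python 3 with numpy assumed) could interpret each chunk as signed 16-bit samples instead:

import wave
import numpy as np

file1 = wave.open("testdropout.wav", "rb")
file2 = open("silence.log", "w")

chunksize = file1.getframerate()  # one second of frames per read
chunk = file1.readframes(chunksize)
second = 0

while chunk:
    # Interpret the raw bytes as signed 16-bit little-endian samples.
    samples = np.frombuffer(chunk, dtype='<i2')
    if not np.any(samples):
        file2.write('dropout at second %s\n' % second)
    second += 1
    chunk = file1.readframes(chunksize)

file1.close()
file2.close()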
I think a simple solution to this would be to consider that the frame rate on audio files is pretty high. A sample file on my computer happens to have a framerate of 8,000. That means for every second of audio, I have 8,000 samples. If you have missing audio, I'm sure it will exist across multiple frames within a second, so you can essentially reduce your comparisons as drastically as your standards would allow. If I were you, I would try iterating over every 1,000th sample instead of every single sample in the audio file. That basically means it will examine every 1/8th of a second of audio to see if it's dead. Not as precise, but hopefully it will get the job done.
import wave

file1 = wave.open("testdropout.wav", "r")
file2 = open("silence.log", "w")

for i in range(file1.getnframes()):
    frame = file1.readframes(i)
    zero = True
    for j in range(0, len(frame), 1000):
        # check if amplitude is greater than 0
        # the ord() function converts the hex values to integers
        if ord(frame[j]) > 0:
            zero = False
            break
    if zero:
        print >> file2, 'dropout at second %s' % (file1.tell()/file1.getframerate())

file1.close()
file2.close()
At the moment, you're reading the entire file into memory, which is not ideal. If you look at the methods available for a "Wave_read" object, one of them is setpos(pos), which sets the position of the file pointer to pos. If you update this position, you should be able to only keep the frame you want in memory at any given time, preventing errors. Below is a rough outline:
import wave

file1 = wave.open("testdropout.wav", "r")
file2 = open("silence.log", "w")

def scan_frame(frame):
    for j in range(len(frame)):
        # check if amplitude is less than 0
        # It makes more sense here to check for the desired case (low amplitude)
        # rather than breaking at higher amplitudes
        if ord(frame[j]) <= 0:
            return True

for i in range(file1.getnframes()):
    frame = file1.readframes(1)  # only read the frame at the current file position
    zero = scan_frame(frame)
    if zero:
        print >> file2, 'dropout at second %s' % (file1.tell()/file1.getframerate())
    pos = file1.tell()  # States current file position
    file1.setpos(pos + len(frame))  # or pos + 1, or whatever a single unit in a wave
                                    # file is, I'm not entirely sure

file1.close()
file2.close()
Hope this can help!

Writing to a file after one by one process completes in python

Context:
I have a large CSV file (30 MB now, but later it may be over a gigabyte, with 185 lines) that has to be searched row by row for certain values (each element of times), in chunks of 6 CSV values at a time; when a match is found, it is written to another file. That is: take one element from the sorted times deque, search for it in another deque (i.e. rdr = deque(reder)) six elements at a time, write to the output file if found, and continue with the next element of the times deque.
Problem:
I have already written code that does the work correctly, but it is very slow (8 hours). I want better performance, and I am thinking of multiprocessing, but I am not getting anywhere and so am seeking help. I use a function ddd that picks up all its arguments from the calling scope, except times1, which I pass explicitly.
Code I tried:
dim = [0, 76, '1.040000', 1, 1, '1.000000'] + min_max_ret(X, Y)
times = deque(sorted(list(timestep), key=lambda x: ast.literal_eval(x)))

def ddd(times1):  # outfl, rdr, acc_ret, FR_XY, width, length are all taken from the calling scope
    for tim in times1:
        time = ['{0:.6f}'.format(ast.literal_eval(tim)/1000.000000)]
        outfl.writelines([u'2 ********* TIMESTEP']+['\n']+time+['\n'])
        for index, line in enumerate(rdr):
            if index != 0:
                cnt = 8
                for counter in [qq for qq in [line[jj:jj+6] for jj in range(8, len(line), 6)] if len(qq) == 6]:
                    counter = map(unicode.strip, counter)
                    if counter[5] == tim:
                        cr_id = line[0]
                        acc = '{0:.6f}'.format(acc_ret(counter[3], counter[4]))
                        car_ltlng = map(unicode.strip, [line[cnt], line[cnt+1], line[cnt+6], line[cnt+7]])
                        xy = FR_XY(*car_ltlng)
                        data = [3]+[cr_id]+[1, 1]+xy+[length, width]+[counter[2]]+[acc]
                        outfl.writelines([unicode(ww).strip()+'\n' for ww in data])
                    cnt += 6
        print "Time is %s is completed" % tim

with open(r"C:\my_output_ascii_14Dec.trj", 'w') as outfl:
    with open(fl, 'r') as inf:
        reder = csv.reader(inf, delimiter=';')
        rdr = deque(reder)
        outfl.writelines([str(w)+'\n' for w in dim])
        p = Pool(5)
        p.map(ddd, times)  # [[xx for xx in islice(times,ii,ii+10)] for ii in range(0,len(times))]
Sample csv content:
car_id; car_type; entry_gate; entry_time(ms); exit_gate; exit_time(ms); traveled_dist(m); avg_speed(m/s); trajectory(x[m];y[m];speed[m/s];a_tangential[ms-2];a_lateral[ms-2];timestamp[ms];)
24; Bus; 25; 4300.00; 26; 48520.00; 118.47; 2.678999; 509552.78; 5039855.59; 10.0740; 0.4290; 0.2012; 0.0000; 509552.97; 5039855.57; 10.0821; 0.3853; 0.2183; 20.0000; 509553.17; 5039855.55; 10.0886; 0.2636; 0.2356; 40.0000; 509553.37; 5039855.53; 10.0927; 0.1420; 0.2532; 60.0000; 509553.57; 5039855.51; 10.0943; 0.0203; 0.2710; 80.0000; 509553.76; 5039855.48; 10.0935; -0.1014; 0.2890; 100.0000; 509553.96; 5039855.46; 10.0902; -0.2231; 0.3073; 120.0000; 509554.16; 5039855.44; 10.0846; -0.3448; 0.3257; 140.0000; 509554.36; 5039855.42; 10.0765; -0.4665; 0.3444; 160.0000; 509554.56; 5039855.40; 10.0659; -0.5881; 0.3633; 180.0000; 509554.76; 5039855.37; 10.0529; -0.7098; 0.3823; 200.0000; 509554.96; 5039855.35; 10.0375; -0.8315; 0.4016; 220.0000; 509555.17; 5039855.33; 10.0197; -0.9532; 0.4211; 240.0000; 509555.37;
A partial csv file is available here.
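One common pattern for this kind of problem (not from the original post; a minimal sketch with a hypothetical worker and placeholder data) is to have worker processes build the output text for each timestep and return it, while the parent process does all the writing, since several processes cannot safely share one open file handle:

from multiprocessing import Pool

def process_timestep(tim):
    # Hypothetical worker: build and return the output block for one timestep
    # instead of writing to a shared file handle inside the worker.
    lines = ['2 ********* TIMESTEP\n', '%s\n' % tim]
    # ... search the rows for this timestep and append the matching records ...
    return ''.join(lines)

if __name__ == '__main__':
    times = ['0.000000', '20.000000', '40.000000']  # placeholder timesteps
    outfl = open('my_output_ascii.trj', 'w')
    pool = Pool(5)
    try:
        # imap yields results in timestep order while the parent writes them out.
        for block in pool.imap(process_timestep, times):
            outfl.write(block)
    finally:
        pool.close()
        pool.join()
        outfl.close()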

using python to search extremely large text file

I have a large 40-million-line, 3 gigabyte text file (which probably won't fit in memory) in the following format:
399.4540176 {Some other data}
404.498759292 {Some other data}
408.362737492 {Some other data}
412.832976111 {Some other data}
415.70665675 {Some other data}
419.586515381 {Some other data}
427.316825959 {Some other data}
.......
Each line starts off with a number and is followed by some other data. The numbers are in sorted order. I need to be able to:
Given a number x and a range y, find all the lines whose number is within y of x. For example, if x=20 and y=5, I need to find all lines whose number is between 15 and 25.
Store these lines into another separate file.
What would be an efficient method to do this without having to trawl through the entire file?
If you don't want to generate a database ahead of time for line lengths, you can try this:
import os
import sys

# Configuration, change these to suit your needs
maxRowOffset = 100  # increase this if some lines are being missed
fileName = 'longFile.txt'
x = 2000
y = 25

# seek to first character c before the current position
def seekTo(f, c):
    while f.read(1) != c:
        f.seek(-2, 1)

def parseRow(row):
    # the numbers in the file may be floats, so parse the first field as float
    return (float(row.split(None, 1)[0]), row)

minRow = x - y
maxRow = x + y
step = os.path.getsize(fileName) / 2.

with open(fileName, 'r') as f:
    while True:
        f.seek(int(step), 1)
        seekTo(f, '\n')
        row = parseRow(f.readline())
        if row[0] < minRow:
            if minRow - row[0] < maxRowOffset:
                with open('outputFile.txt', 'w') as fo:
                    for row in f:
                        row = parseRow(row)
                        if row[0] > maxRow:
                            sys.exit()
                        if row[0] >= minRow:
                            fo.write(row[1])
            else:
                step /= 2.
                step = step * -1 if step < 0 else step
        else:
            step /= 2.
            step = step * -1 if step > 0 else step
It starts by performing a binary search on the file until it gets close (within maxRowOffset) to the row to find. Then it reads every line until it finds one that is greater than x-y. That line, and every line after it, are written to an output file until a line is found that is greater than x+y, at which point the program exits.
I tested this on a 1,000,000 line file and it runs in 0.05 seconds. Compare this to reading every line which took 3.8 seconds.
You need random access to the lines, which you won't get with a text file unless the lines are all padded to the same length.
One solution is to dump the table into a database (such as SQLite) with two columns, one for the number and one for all the other data (assuming that the data is guaranteed to fit into whatever the maximum number of characters allowed in a single column in your database is). Then index the number column and you're good to go.
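A minimal sketch of that approach (the table, column, and file names here are made up for illustration):

import sqlite3

conn = sqlite3.connect('lines.db')
conn.execute('CREATE TABLE IF NOT EXISTS lines (num REAL, rest TEXT)')

# One-time load: one database row per line of the text file.
with open('file.txt') as f:
    rows = ((float(line.split(None, 1)[0]), line) for line in f)
    conn.executemany('INSERT INTO lines VALUES (?, ?)', rows)

conn.execute('CREATE INDEX IF NOT EXISTS idx_num ON lines (num)')
conn.commit()

# Query: all lines whose number is within y of x, written to a separate file.
x, y = 20.0, 5.0
with open('outputFile.txt', 'w') as out:
    for (line,) in conn.execute(
            'SELECT rest FROM lines WHERE num BETWEEN ? AND ?', (x - y, x + y)):
        out.write(line)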
Without a database, you could read through the file one time and create an in-memory data structure containing (number, line-offset) pairs. You calculate the line offset by adding up the lengths of each row (including the line ending). Now you can binary search these pairs on number and randomly access the lines in the file using the offset. If you need to repeat the search later, pickle the in-memory structure and reload it for later re-use.
This reads the entire file (which you said you don't want to do), but does so only once to build the index. After that you can execute as many requests against the file as you want and they will be very fast.
Note that this second solution is essentially creating a database index on your text file.
Rough code to create the index in second solution:
import pickle

offset = 0
index = []  # probably a better structure to use than a list

f = open(filename)
for row in f:
    nbr = float(row.split(' ')[0])
    index.append([nbr, offset])
    offset += len(row)  # row already includes the line ending

pickle.dump(index, open('filename.idx', 'wb'))  # saves it for future use
Now, you can perform a binary search on the list. There's probably a much better data structure to use for accruing the index values than a list, but I'd have to read up on the various collection types.
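A sketch of the lookup step (not in the original answer), using bisect on the saved index and seeking straight to the matching lines; filename is the same data file as in the snippet above, and the files are opened in binary mode so the stored offsets line up with seek():

import bisect
import pickle

index = pickle.load(open('filename.idx', 'rb'))
numbers = [nbr for nbr, off in index]  # already sorted, because the file is sorted

x, y = 20.0, 5.0
lo = bisect.bisect_left(numbers, x - y)   # first line with number >= x - y
hi = bisect.bisect_right(numbers, x + y)  # one past the last line with number <= x + y

with open(filename, 'rb') as f, open('outputFile.txt', 'wb') as out:
    if lo < hi:
        f.seek(index[lo][1])
        for _ in range(hi - lo):
            out.write(f.readline())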
Since you want to match the first field, you can use gawk:
$ gawk '{if ($1 >= 15 && $1 <= 25) { print }; if ($1 > 25) { exit }}' your_file
Edit: Taking a file with 261,775,557 lines that is 2.5 GiB big, searching for lines 50,010,015 to 50,010,025, this takes 27 seconds on my Intel(R) Core(TM) i7 CPU 860 @ 2.80GHz. Sounds good enough for me.
In order to find the line that starts with the number just above your lower limit, you have to go through the file line by line until you find that line. No other way, i.e. all data in the file has to be read and parsed for newline characters.
We have to run this search up to the first line that exceeds your upper limit and stop. Hence, it helps that the file is already sorted. This code will hopefully help:
with open(outpath, 'w') as outfile:
    with open(inpath) as infile:
        for line in infile:
            t = float(line.split()[0])
            if lower_limit <= t <= upper_limit:
                outfile.write(line)
            elif t > upper_limit:
                break
I think theoretically there is no other option.
