Improve speed when reading very large files in Python

I'm running multiple functions; each function takes a section of a million-line .txt file and has a for loop that runs through every line in its section.
Each function takes info from those lines and checks whether it matches info in two other files, one about 50,000-100,000 lines long, the other about 500-1,000 lines long. I check whether the lines match by running for loops through the other two files. Once the info matches, I write the output to a new file; all functions write to the same file. The program produces about 2,500 lines a minute but slows down the longer it runs. Also, when I run just one of the functions it does about 500 lines a minute, but when I run it alongside 23 other processes it only manages about 2,500 a minute in total. Why is that?
Does anyone know why that would happen? Is there anything I could import to make the program read through the files faster? I'm already opening the files with the "with ... as file1:" idiom.
Can the multiprocessing be reorganized to run faster?

A thread can only use the resources you have: 4 cores = 4 threads with full resources. There are a few cases where having more threads than cores improves performance, but this is not one of them, so keep the thread count at the number of cores you have.
Also, because you have concurrent access to a single output file, you need a lock on that file, which slows things down a bit.
What could be improved, however, is the code that compares the strings, but that is another question.
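For illustration, a hedged sketch of that string-comparison idea: load the two smaller files into sets once, so each line of the big file is checked with a constant-time membership test instead of two nested for loops (file names and the exact matching rule are placeholders, not the asker's real data):

# Hedged sketch; adapt the key extraction to whatever part of the line you actually compare.
with open('file2.txt') as f2:
    keys2 = {line.strip() for line in f2}    # the 50,000-100,000 line file
with open('file3.txt') as f3:
    keys3 = {line.strip() for line in f3}    # the 500-1,000 line file

with open('million_lines.txt') as big, open('output.txt', 'w') as out:
    for line in big:
        key = line.strip()
        if key in keys2 and key in keys3:
            out.write(line)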


How to solve Python RAM leak when running large script

I have a massive Python script I inherited. It runs continuously on a long list of files, opens them, does some processing, creates plots, writes some variables to a new text file, then loops back over the same files (or waits for new files to be added to the list).
My memory usage steadily goes up to the point where my RAM is full within an hour or so. The code is designed to run 24/7/365 and apparently used to work just fine. I see the RAM usage steadily going up in task manager. When I interrupt the code, the RAM stays used until I restart the Python kernel.
I have used sys.getsizeof() to check all my variables and none are unusually large or growing over time. This is odd - where is the RAM going then? The text files I am writing to? I have checked, and as far as I can tell every file I create is closed with an f.close() statement. The same goes for the plots I create (I think).
What else would be steadily eating away at my RAM? Any tips or solutions?
What I'd like to do is some sort of "close all open files/figures" command at some point in my code. I am aware of the del command but then I'd have to list hundreds of variables at multiple points in my code to routinely delete them (plus, as I pointed out, I already checked getsizeof and none of the variables are large. Largest was 9433 bytes).
Thanks for your help!
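Since the question mentions creating plots, here is a minimal, hedged sketch of the kind of "close all open figures" step described above, assuming the plots are made with matplotlib (the loop and file names are placeholders). Figures that are never closed stay registered inside pyplot, so they keep consuming RAM without showing up in sys.getsizeof() checks of your own variables.

import gc
import matplotlib.pyplot as plt

file_list = ['data_001', 'data_002']      # placeholder for the real list of files
for name in file_list:
    fig, ax = plt.subplots()
    # ... open the file, process it, draw onto ax, write variables out ...
    fig.savefig(name + '.png')
    plt.close(fig)                        # or plt.close('all') once per pass
gc.collect()                              # optional: nudge the garbage collector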

Iterating through files is faster after the first time [duplicate]

This question already has answers here: How did Python read this binary faster the second time? (2 answers). Closed 1 year ago.
This was something I came across while working on a project and I'm kind of confused. I have a .txt file with ~15,000 lines, and when I run the program once it takes around 4-5 seconds to go through all the lines. But I wrapped the file handling in a while True loop (with a file.close() at the end) so that it continuously opens the file, goes through all the lines, and then closes it.
After the first run, I noticed that each pass takes only around 1 second to complete. I made sure to close the file afterwards, so what might be causing it to be so much faster?
It's called "file caching" or "warming the cache". All of the major operating systems allocate a goodly portion of your RAM to a file cache. When you read a file, those buffers are retained for a while instead of being released right away. If you read the same file again, it can often pull the data from RAM instead of going to disk.

Python Stops Running then causes memory to spike

I'm running a large Python3.7 script using PyCharm and interfaced by Django that parses txt files line by line and processes the text. It gets stuck at a certain point on one particularly large file and I can't for the life of me figure out why. Once it gets stuck, the memory that PyCharm uses according to Task Manager runs up to 100% of available over the course of 5-10 seconds and I have to manually stop the execution (memory usage is low when it runs on other files and before the execution stops on the large file).
I've narrowed the issue down to the following loop:
i = 0
for line in line_list:
    label_tmp = self.get_label(line)  # note: self because this is all contained in a class
    if label_tmp in target_list:
        index_dict[i] = line
    i += 1
    print(i)  # this is only here for diagnostic purposes for this issue
This works perfectly for a handful of files that I've tested it on, but on the problem file it will stop on the 2494th iteration (ie when i=2494). It does this even when I delete the 2494th line of the file or when I delete the first 10 lines of the file - so this rules out a bug in the code on any particular line in the file - it will stop running regardless of what is in the 2494th line.
I built self.get_label() to produce a log file since it is a large function. After playing around, I've begun to suspect that it will stop running after a certain number of actions no matter what. For example I added the following dummy lines to the beginning of self.get_label():
log.write('Check1\n')
log.write('Check2\n')
log.write('Check3\n')
log.write('Check4\n')
On the 2494th iteration, the last entry in the log file is "Check2". If I make some tweaks to the function it will stop at Check 4; if I make other tweaks it will stop at iteration 2493 but stop at "Check1" or even make it all the way to the end of the function.
I thought the problem might have something to do with memory from the log file, but even when I comment out the log lines the code still stops on the 2494th line (once again, irrespective of the text that's actually contained in that line) or the 2493rd line, depending on the changes that I make.
No matter what I do, execution stops, then memory used according to Task Manager runs up to 100%. It's important to note that the memory DOES NOT increase substantially until AFTER the execution gets stuck.
Does anyone have any ideas what might be causing this? I don't see anything wrong with the code and the fact that it stops executing after a certain number of actions indicates that I'm hitting some sort of fundamental limit that I'm not aware of.
Can you try using sys.getsizeof on that dict? Something must be happening to it that increases memory like crazy. Something else to try is running the script from your regular terminal/cmd. Otherwise, I'd want to see a little bit more of the code.
Also, instead of using i += 1, you can enumerate your for loop:
for i, line in enumerate(line_list):
Hopefully some of that helps.
(Sorry, not enough rep to comment)
Just wanted to provide the solution months after asking. As most experienced coders probably know, write() only adds the output to a buffer, so if an infinite loop occurs before the buffer is flushed (it only flushes every few lines, depending on the buffer size), any lines still in the buffer never make it into the file. This made it look like a different kind of issue: I thought the problem was ~20-30 lines before the actual flawed line, and the buffer flushed on different lines depending on how I changed the code, which explains why the log file ended on different lines when unrelated changes were made. When I replaced "write" with "print" I was able to identify the exact line in the code that caused the loop.
To avoid a situation like this, I recommend writing a custom "write_to_file" function that flushes after every call so that every single line reaches the log file. I also added other protections to that custom "write_to_file" function, such as not writing if the file exceeds a certain size.
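For example (a hedged sketch of such a helper; the names and the size cap are illustrative, not the asker's actual code):

import os

MAX_LOG_BYTES = 50 * 1024 * 1024              # stop logging past ~50 MB

def write_to_file(log, message):
    # Skip writing once the log file grows too large.
    if log.tell() > MAX_LOG_BYTES:
        return
    log.write(message + '\n')
    log.flush()                               # push Python's buffer to the OS
    os.fsync(log.fileno())                    # optionally force the OS to write to disk too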

Pyspark write multiple outputs by key without partition

I have a PySpark dataframe that contains records for 6 million people, each with an individual userid. Each userid has 2,000 entries. I want to save each userid's data into a separate CSV file with the userid as the file name.
I have some code that does this, taken from the solution to this question. However, as I understand it the code will try to partition each of the 6 million ids. I don't actually care about this as I'm going to write each of these files to another non-HDFS server.
I should note that the code works for a small number of userids (up to 3000) but it fails on the full 6 million.
Code:
output_file = '/path/to/some/hdfs/location'
myDF.write.partitionBy('userid').mode('overwrite').format("csv").save(output_file)
When I run the above it takes WEEKS to run with most of that time spent on the writing step. I assume this is because of the number of partitions. Even if I manually specify the number of partitions to something small it still takes ages to execute.
Question: Is there a way to save each of the userids data into a single, well named (name of file = userid) file without partitioning?
Given the requirements, there is really not much hope for improvement. HDFS is not designed for handling very small files, and pretty much any file system will be challenged if you try to open 6 million file descriptors at the same time.
You can improve this a little, if you haven't already, by calling repartition before the write:
(myDF
    .repartition('userid')
    .write.partitionBy('userid').mode('overwrite').format("csv").save(output_file))
If you can accept multiple ids per file, you can use a persistent table and bucketing:
(myDF
    .write
    .bucketBy(1024, 'userid')  # Adjust the number of buckets if needed
    .sortBy('userid')
    .mode('overwrite').format("csv")
    .saveAsTable(output_table))
and process each file separately, taking consecutive chunks of data.
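A hedged sketch of what that per-file post-processing could look like once the bucketed output has been copied off HDFS (paths and column names are assumptions, and this presumes the CSV was written without a header, Spark's default):

import glob
import pandas as pd

for bucket_path in glob.glob('/local/copy/of/output_table/part-*.csv'):
    bucket = pd.read_csv(bucket_path, header=None, names=['userid', 'value'])
    for userid, group in bucket.groupby('userid'):
        group.to_csv('/target/location/{}.csv'.format(userid), index=False)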
Finally, if plain text output is not a hard requirement you can use any sharded database and partition data by userid.

Best Approach for I/O Bound Problems?

I am currently running code on an HPC cluster that writes several 16 MB files to disk (all in the same directory), keeps them for a short period of time, and then deletes them. They are written and deleted sequentially, but the total number of I/O operations exceeds 20,000 * 12,000.
I am using the joblib module in Python 2.7 to run my code on several cores. It is basically a nested-loop problem: the outer loop is parallelised by joblib and the inner loop runs sequentially inside the function, for a total of 20,000 * 12,000 iterations.
The basic skeleton of my code is the following:
from joblib import Parallel, delayed
import subprocess

def f(a, b, c, d):
    cmds = 'path/to/a/bash_script_on_disk with arguments from a,b > save_file_to_disk'
    subprocess.check_output(cmds, shell=True)
    cmds1 = 'path/to/a/second_bash_script_on_disk > save_file_to_disk'
    subprocess.check_output(cmds1, shell=True)
    # The structure above is repeated several times.
    # However, I do delete the files as soon as I can using:
    cmds2 = 'rm -rf files'
    subprocess.check_output(cmds2, shell=True)
    # This is followed by the second/inner loop.
    for i in range(12000):
        # Do some computation, create and delete files in each iteration.
        pass

if __name__ == '__main__':
    num_cores = 48
    # a, b, c, d are defined elsewhere in the real script.
    Parallel(n_jobs=num_cores)(delayed(f)(a, b, c, d) for i in range(20000))
    # range(20000) is batched by a wrapper script that sends no more
    # than 48 jobs per node (max. cores available).
This code is extremely slow and the bottleneck is the I/O time. Is this a good use case to temporarily write files to /dev/shm/? I have 34GB of space available as tmpfs on /dev/shm/.
Things I already tested:
I tried to set up the same code on a smaller scale on my laptop which has 8 cores. However, writing to /dev/shm/ ran slower than writing to disk.
Side note: the inner loop could be parallelised too, but the number of cores I have available is far fewer than 20,000, which is why I am sticking to this configuration. Please let me know if there are better ways to do this.
First, do not talk about the total number of I/O operations; on its own that is meaningless. Instead, talk about IOPS and throughput.
Second, it is almost impossible for writing to /dev/shm/ to be slower than writing to disk. Please provide more information. You can test write performance using fio, for example: sudo fio --name fio_test_file --rw=read --direct=1 --bs=4k --size=50M --numjobs=16 --group_reporting. My test result is: bw=428901KB/s, iops=107225.
Third, you are really writing too many files; you should rethink your structure.
It depends on your temporary data size.
If you have much more memory than you're using for the data, then yes - shm will be a good place for it. If you're going to write almost as much as you've got available, then you're likely going to start swapping - which would kill the performance of everything.
If you can fit your data in memory, then tmpfs by definition will always be faster than writing to a physical disk. If it isn't, then there are more factors impacting your environment. Running your code under a profiler would be a good idea in this case.
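If the scratch files do end up in /dev/shm/, a minimal sketch of how that could fit the skeleton above, assuming the bash scripts accept an output path (script names and arguments are placeholders; this also works on Python 2.7):

import shutil
import subprocess
import tempfile

def f(a, b, c, d):
    scratch = tempfile.mkdtemp(dir='/dev/shm')   # per-task scratch directory in RAM
    try:
        out_path = scratch + '/intermediate.dat'
        subprocess.check_output(
            'path/to/a/bash_script_on_disk {0} {1} > {2}'.format(a, b, out_path),
            shell=True)
        # ... the inner loop creates and reads its files under `scratch` ...
    finally:
        shutil.rmtree(scratch)                   # replaces the explicit 'rm -rf'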
