The code below is a part of my main function
def main():
    model = GoodPackage.load_file_format('hello.bin', binary=True)
    do_stuff_with_model(model)

def do_stuff_with_model(model):
    # do something with the model
    ...
Assume that the size of hello.bin is a few gigabytes and it takes a while to load. The method do_stuff_with_model is still unstable and I must go through a lot of iterations until I have a stable version. In other words, I have to run the main function many times to finish debugging. However, since it takes a few minutes to load the model every time I run the code, this is time consuming. Is there a way for me to store the model object somewhere else, so that I don't have to wait every time I run the code by typing python my_code.py in the console? I assume using pickle wouldn't help either, because the file will still be big.
How about creating a ramdisk? If you have enough memory, you can store the entire file in RAM. This will drastically speed things up, though you'll likely have to do this every time you restart your computer.
Creating a ramdisk is quite simple on Linux. Just create a directory:
mkdir ramdisk
and mount it as a tmpfs or ramfs filesystem:
mount -t tmpfs -o size=512m tmpfs ./ramdisk
From there you can simply copy your large file to the ramdisk. This has the benefit that your code stays exactly the same, apart from simply changing the path to your big file. File access occurs just as it normally would, but now it's much faster, since it's loading it from RAM.
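For example, once hello.bin has been copied onto the ramdisk (a minimal sketch, assuming the ./ramdisk mount point from above), the loading code from the question only needs the new path:
# Same loading call as in the question; only the path now points at the ramdisk copy.
model = GoodPackage.load_file_format('./ramdisk/hello.bin', binary=True)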
I have a massive Python script I inherited. It runs continuously on a long list of files, opens them, does some processing, creates plots, writes some variables to a new text file, then loops back over the same files (or waits for new files to be added to the list).
My memory usage steadily goes up to the point where my RAM is full within an hour or so. The code is designed to run 24/7/365 and apparently used to work just fine. I see the RAM usage steadily going up in task manager. When I interrupt the code, the RAM stays used until I restart the Python kernel.
I have used sys.getsizeof() to check all my variables and none are unusually large/increasing with time. This is odd - where is the RAM going then? The text files I am writing to? I have checked and as far as I can tell every file creation ends with a f.close() statement, closing the file. Similar for my plots that I create (I think).
What else would be steadily eating away at my RAM? Any tips or solutions?
What I'd like to do is some sort of "close all open files/figures" command at some point in my code. I am aware of the del command but then I'd have to list hundreds of variables at multiple points in my code to routinely delete them (plus, as I pointed out, I already checked getsizeof and none of the variables are large. Largest was 9433 bytes).
Thanks for your help!
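If the plots are being made with matplotlib (an assumption, since the question doesn't name the library), note that pyplot keeps every figure alive until it is explicitly closed, which is a common cause of exactly this kind of steady RAM growth. A minimal sketch of closing figures once they have been saved:
import matplotlib.pyplot as plt

data = range(10)           # placeholder for whatever is actually being plotted
fig, ax = plt.subplots()
ax.plot(data)
fig.savefig('output.png')
plt.close(fig)             # release this figure's memory
# or, at a convenient point in the loop, close everything still open:
plt.close('all')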
Is a file stored to disk when it is only present for a fraction of a second?
I'm running Python 3.7 on Ubuntu 18.04.
I make use of a Python script. This script extracts JSON files from a zip package. The resulting files are processed and afterwards deleted.
Since I'm running on an SSD, I want to spare it unnecessary write cycles.
Does Linux buffer such writes in RAM, or do I need to assume that I'm forcing my poor SSD through several thousand write cycles per second?
Linux may cache file operations under some circumstances, but you're looking for it to optimize by avoiding ever committing a whole sequence of operations to storage at all, based on there being no net effect. I do not think you can expect that.
It sounds like you might be better served by using a different filesystem in the first place. Linux has memory-backed file systems (served by the tmpfs filesystem driver, for example), so perhaps you want to set up such a filesystem for your application to use for these scratch files.1 Do note, however, that these are backed by virtual memory, so, although this approach should reduce the number of write cycles on your SSD, it might not eliminate all writes.
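As a rough sketch of that approach (assuming a tmpfs mount such as /dev/shm is available; package.zip and process() are placeholders for your actual archive and processing step):
import os
import tempfile
import zipfile

# Extract into a scratch directory backed by tmpfs (RAM) instead of the SSD.
with tempfile.TemporaryDirectory(dir='/dev/shm') as scratch:
    with zipfile.ZipFile('package.zip') as zf:
        zf.extractall(scratch)
    for name in os.listdir(scratch):
        process(os.path.join(scratch, name))  # hypothetical processing step
# The scratch directory and its contents are removed automatically on exit.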
1 For example, see https://unix.stackexchange.com/a/66331/289373
I'm loading the model using:
model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
Now every time I run the file in PyCharm, it loads the model again.
So, is there a way to load it once and have it available whenever I run things like model['king'] and model.doesnt_match("house garage store dog".split())?
Because it takes a lot of time whenever I want to check the similarity or find the words that don't match.
When I ran model.most_similar('finance') it was really slow and the whole laptop froze for about 2 minutes. So, is there a way to make things faster? I want to use it in my project, but I can't let the user wait this long.
Any suggestions?
That's a set of word-vectors that's about 3.6GB on disk, and slightly larger when loaded - so just the disk IO can take a noticeable amount of time.
Also, at least until gensim-4.0.0 (now available as a beta preview), versions of Gensim through 3.8.3 require an extra one-time pre-calculation of unit-length-normalized vectors upon the very first use of a .most_similar() or .doesnt_match() operation (& others). This step can also take a noticeable moment, & then immediately requires a few extra GB of memory for a full model like GoogleNews - which on any machine with less than about 8GB RAM free risks using slower virtual-memory or even crashing with an out-of-memory error. (Starting in gensim-4.0.0beta, once the model loads, the 1st .most_similar() won't need any extra pre-calculation/allocation.)
The main way to avoid this annoying lag is to structure your code or service to not reload it separately before each calculation. Typically, this means keeping an interactive Python process that's loaded it alive, ready for your extra operations (or later user requests, as might be the case with a web-deployed service.)
It sounds like you may be developing a single Python script, something like mystuff.py, and running it via PyCharm's execute/debug/etc utilities for launching a Python file. Unfortunately, upon each completed execution, that will let the whole Python process end, releasing any loaded data/objects completely. Running the script again must do all the loading/precalculation again.
If your main interest is doing a bit of investigational examination & experimentation with the set of word-vectors, on your own, a big improvement would be to move to an interactive environment that keeps a single Python run alive & waiting for your next line of code.
For example, if you run the ipython interpreter at a command-line, in a separate shell, you can load the model, do a few lookup/similarity operations to print the results, and then just leave the prompt waiting for your next code. The full loaded state of the process remains available until you choose to exit the interpreter.
Similarly, if you use a Jupyter Notebook inside a web-browser, you get that same interpreter experience inside a growing set of editable-code-and-result 'cells' that you can re-run. All are sharing the same back-end interpreter process, with persistent state – unless you choose to restart the 'kernel'.
If you're providing a script or library code for your users' investigational work, they could also use such persistent interpreters.
But if you're building a web service or other persistently-running tool, you'd similarly want to make sure that the model remains loaded between user requests. (Exactly how you'd do that would depend on the details of your deployment, including web server software, so it'd be best to ask/search-for that as a separate question supplying more details when you're at that step.)
There is one other trick that may help in your constant-relaunch scenario. Gensim can save & load in its own native format, which can make use of 'memory-mapping'. Essentially, a range of a file on-disk can be used directly by the operating-system's virtual memory system. Then, when many processes all designate the same file as the canonical version of something they want in their own memory-space, the OS knows they can re-use any parts of that file that are already in memory.
This technique works far more simply in gensim-4.0.0beta and later, so I'm only going to describe the steps needed there. (See this message if you want to force this preview installation before Gensim 4.0 is officially released.)
First, load the original-format file, but then re-save it in Gensim's format:
from gensim.models import KeyedVectors
kv_model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
kv_model.save('GoogleNews-vectors-negative300.kv')
Note that there will be an extra .npy file created that must be kept alongside GoogleNews-vectors-negative300.kv if you move the model elsewhere. Do this conversion only once to create the new files.
Second, when you later need the model, use Gensim's .load() with the mmap option:
kv_model = KeyedVectors.load('GoogleNews-vectors-negative300.kv', mmap='r')
# do your other operations
Right away, the .load() should complete faster. However, when you 1st try to access any word – or all words in a .most_similar() – the read from disk will still need to happen, just shifting the delays to later. (If you're only ever doing individual-word lookups or small sets of .doesnt_match() words, you may not notice any long lags.)
Further, depending on your OS & amount-of-RAM, you might even get some speedup when you run your script once, let it finish, then run it again soon after. It's possible in some cases that even though the OS has ended the prior process, its virtual-memory machinery remembers that some of the not-yet-cleared old-process memory pages are still in RAM, & correspond to the memory-mapped file. Thus, the next memory-map will re-use them. (I'm not sure of this effect, and if you're in a low-memory situation the chance of such re-use from a completed process may disappear completely.)
But, you could increase the chances of the model file staying memory-resident by taking a third step: launch a separate Python process to preload the model that doesn't exit until killed. To do this, make another Python script like preload.py:
from gensim.models import KeyedVectors
from threading import Semaphore
model = KeyedVectors.load('GoogleNews-vectors-negative300.kv', mmap='r')
model.most_similar('stuff') # any word will do: just to page all in
Semaphore(0).acquire() # just hang until process killed
Run this script in a separate shell: python preload.py. It will map the model into memory, but then hang until you CTRL-C exit it.
Now, any other code you run on the same machine that memory-maps the same file will automatically re-use any already-loaded memory pages from this separate process. (In low-memory conditions, if any other virtual-memory is being relied upon, ranges could still be flushed out of RAM. But if you have plentiful RAM, this will ensure minimal disk IO each new time the same file is referenced.)
Finally, one other option that can be mixed with any of these is to load only a subset of the full 3-million-token, 3.6GB GoogleNews set. The less-common words are near the end of this file, and skipping them won't affect many uses. So you can use the limit argument of load_word2vec_format() to only load a subset - which loads faster, uses less memory, and completes later full-set searches (like .most_similar()) faster. For example, to load just the 1st 1,000,000 words for about 67% savings of RAM/load-time/search-time:
from gensim.models import KeyedVectors
kv_model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', limit=1000000, binary=True)
I am currently running code on an HPC cluster that writes several 16 MB files to disk (same directory) for a short period of time and then deletes them. They are written to disk and then deleted sequentially. However, the total number of I/O operations exceeds 20,000 * 12,000.
I am using the joblib module in Python 2.7 to take advantage of running my code on several cores. It's basically a nested-loop problem, with the outer loop being parallelised by joblib and the inner loop run sequentially in the function. In total it's a 20,000 * 12,000 loop.
The basic skeleton of my code is the following.
from joblib import Parallel, delayed
import subprocess

def f(a, b, c, d):
    cmds = 'path/to/a/bash_script_on_disk with arguments from a,b > save_file_to_disk'
    subprocess.check_output(cmds, shell=True)
    cmds1 = 'path/to/a/second_bash_script_on_disk > save_file_to_disk'
    subprocess.check_output(cmds1, shell=True)
    # The structure above is repeated several times.
    # However, I do delete the files as soon as I can using:
    cmds2 = 'rm -rf files'
    subprocess.check_output(cmds2, shell=True)
    # This is followed by the second/inner loop.
    for i in range(12000):
        # Do some computation, create and delete files in each iteration.
        pass

if __name__ == '__main__':
    num_cores = 48
    Parallel(n_jobs=num_cores)(delayed(f)(a, b, c, d) for i in range(20000))
    # range(20000) is batched by a wrapper script that sends no more
    # than 48 jobs per node (max. cores available).
This code is extremely slow and the bottleneck is the I/O time. Is this a good use case to temporarily write files to /dev/shm/? I have 34GB of space available as tmpfs on /dev/shm/.
Things I already tested:
I tried to set up the same code on a smaller scale on my laptop which has 8 cores. However, writing to /dev/shm/ ran slower than writing to disk.
Side Note: (The inner loop can be parallelised too; however, the number of cores I have available is far fewer than 20,000, which is why I am sticking to this configuration. Please let me know if there are better ways to do this.)
First, do not talk about total I/O operations; that figure on its own is meaningless. Instead, talk about IOPS and throughput.
Second, it is almost impossible that writing to /dev/shm/ will be slower than writing to disk. Please provide more information. You can test read/write performance using fio, for example: sudo fio --name fio_test_file --rw=read --direct=1 --bs=4k --size=50M --numjobs=16 --group_reporting, and my test result is: bw=428901KB/s, iops=107225.
Third, you are really writing too many files, you should think about your structure.
It depends on your temporary data size.
If you have much more memory than you're using for the data, then yes - shm will be a good place for it. If you're going to write almost as much as you've got available, then you're likely going to start swapping - which would kill the performance of everything.
If you can fit your data in memory, then tmpfs by definition will always be faster than writing to a physical disk. If it isn't, then there are more factors impacting your environment. Running your code under a profiler would be a good idea in this case.
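As a minimal sketch of that profiling step (assuming f, a, b, c and d are the names from the skeleton above), the standard-library cProfile module will show whether the time goes into the subprocess calls, file I/O waits, or the Python-level work:
import cProfile
import pstats

# Profile a single task rather than the whole 20,000-task run.
cProfile.run('f(a, b, c, d)', 'one_task.prof')
stats = pstats.Stats('one_task.prof')
stats.sort_stats('cumulative').print_stats(20)  # top 20 entries by cumulative time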
I have written some code which does some processing, and I want to reduce its execution time. I think this can be done if I run it from my RAM, which is 1 GB.
So will running my program from RAM make any difference to my execution time, and if so, how can it be done?
Believe it or not, when you use a modernish computer system, most of your computation is done from RAM. (Well, technically, it's "done" from processor registers, but those are filled from RAM so let's brush that aside for the purposes of this answer)
This is thanks to the magic we call caches and buffers. A disk "cache" in RAM is filled by the operating system whenever something is read from permanent storage. Any further reads of that same data (until and unless it is "evicted" from the cache) only read memory instead of the permanent storage medium.
A "buffer" works similarly for write output, with data first being written to RAM and then eventually flushed out to the underlying medium.
So, in the course of normal operation, any runs of your program after the first (unless you've done a lot of work in between), will already be from RAM. Ditto the program's input file: if it's been read recently, it's already cached in memory! So you're unlikely to be able to speed things up by putting it in memory yourself.
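You can see this cache effect for yourself with a rough sketch like the following ('bigfile.bin' is a placeholder for any large file already on disk):
import time

def timed_read(path):
    start = time.perf_counter()
    with open(path, 'rb') as fh:
        while fh.read(1024 * 1024):  # read in 1 MiB chunks until EOF
            pass
    return time.perf_counter() - start

# The first read comes from the storage medium; the second usually comes
# straight from the OS page cache in RAM and is noticeably faster.
print('first read :', timed_read('bigfile.bin'))
print('second read:', timed_read('bigfile.bin'))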
Now, if you want to force things for some reason, you can create a "ramdisk", which is a filesystem backed by RAM. In Linux the easy way to do this is to mount "tmpfs" or put files in the /dev/shm directory. Files on a tmpfs filesystem go away when the computer loses power and are entirely stored in RAM, but otherwise behave like normal disk-backed files. From the way your question is phrased, I don't think this is what you want. I think your real answer is "whatever performance problems you think you have, this is not the cause, sorry".