How to save spaCy model onto cache? - python

I'm using spaCy with Python for Named Entity Recognition, but the script has to load the model on every run, and loading it takes about 1.6GB of memory.
Spending 1.6GB (and the load time) on every single run is not affordable.
How do I load it into a cache or temporary memory so that the script runs faster?

First of all, if you only do NER, you can install the parser without the word vectors.
This is possible by passing the parser argument to the download command:
python -m spacy.en.download parser
This will prevent the 700MB+ GloVe vectors from being downloaded, slimming the memory needed for a single run.
Then, it depends on the application/usage you make of the library.
If you call it often, it is better to assign the result of spacy.load('en') to a module- or class-level variable loaded at the beginning of your stack.
This will slow down your boot time a bit, but spaCy will be ready (in memory) to be called.
(If the boot time is a big problem, you can do lazy loading, as in the sketch below.)
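A minimal sketch of that lazy-loading pattern (the module and function names are just illustrative, and 'en' is the model name used by the spacy.en-era releases referenced above):
# model_cache.py
import spacy

_nlp = None  # module-level cache: the pipeline is loaded at most once per process

def get_nlp():
    """Lazily load the spaCy pipeline and keep it in memory for later calls."""
    global _nlp
    if _nlp is None:
        _nlp = spacy.load('en')  # slow the first time, instant on every later call
    return _nlp
Any code that calls get_nlp() after the first call reuses the already-loaded pipeline instead of paying the load cost again, for as long as the Python process stays alive.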

Related

How can a Word2Vec pretrained model be loaded in Gensim faster?

I'm loading the model using:
model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
Now every time I run the file in PyCharm, it loads the model again.
So, is there a way to load it once and have it available whenever I run things like model['king'] and model.doesnt_match("house garage store dog".split())?
It takes a lot of time whenever I want to check the similarity or the words that don't match.
When I ran model.most_similar('finance') it was really slow and the whole laptop froze for about 2 minutes. So, is there a way to make things faster? I want to use it in my project, but I can't let the user wait this long.
Any suggestions?
That's a set of word-vectors that's about 3.6GB on disk, and slightly larger when loaded - so just the disk IO can take a noticeable amount of time.
Also, at least until gensim-4.0.0 (now available as a beta preview), versions of Gensim through 3.8.3 require an extra one-time pre-calculation of unit-length-normalized vectors upon the very first use of a .most_similar() or .doesnt_match() operation (& others). This step can also take a noticeable moment, & then immediately requires a few extra GB of memory for a full model like GoogleNews - which on any machine with less than about 8GB RAM free risks using slower virtual-memory or even crashing with an out-of-memory error. (Starting in gensim-4.0.0beta, once the model loads, the 1st .most_similar() won't need any extra pre-calculation/allocation.)
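(Under gensim-3.8.3 and earlier you can at least trigger that one-time step explicitly right after loading, and with replace=True avoid holding two full copies of the vectors in memory - just a sketch, and unnecessary in 4.0+:)
from gensim.models import KeyedVectors
kv = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
kv.init_sims(replace=True)  # gensim <= 3.8.3 only: pre-normalize once, overwriting the raw vectors to save RAM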
The main way to avoid this annoying lag is to structure your code or service to not reload it separately before each calculation. Typically, this means keeping an interactive Python process that's loaded it alive, ready for your extra operations (or later user requests, as might be the case with a web-deployed service.)
It sounds like you may be developing a single Python script, something like mystuff.py, and running it via PyCharm's execute/debug/etc utilities for launching a Python file. Unfortunately, upon each completed execution, that will let the whole Python process end, releasing any loaded data/objects completely. Running the script again must do all the loading/precalculation again.
If your main interest is doing a bit of investigational examination & experimentation with the set of word-vectors, on your own, a big improvement would be to move to an interactive environment that keeps a single Python run alive & waiting for your next line of code.
For example, if you run the ipython interpreter at a command-line, in a separate shell, you can load the model, do a few lookup/similarity operations to print the results, and then just leave the prompt waiting for your next code. The full loaded state of the process remains available until you choose to exit the interpreter.
Similarly, if you use a Jupyter Notebook inside a web-browser, you get that same interpreter experience inside a growing set of editable-code-and-result 'cells' that you can re-run. All are sharing the same back-end interpreter process, with persistent state – unless you choose to restart the 'kernel'.
If you're providing a script or library code for your users' investigational work, they could also use such persistent interpreters.
But if you're building a web service or other persistently-running tool, you'd similarly want to make sure that the model remains loaded between user requests. (Exactly how you'd do that would depend on the details of your deployment, including web server software, so it'd be best to ask/search-for that as a separate question supplying more details when you're at that step.)
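Just to illustrate the general shape (a sketch only - Flask and the /similar endpoint here are my own assumptions, and your real deployment will differ), the key point is that the model is loaded once at module import, outside the request handler, so every request reuses it:
# app.py
from flask import Flask, jsonify, request
from gensim.models import KeyedVectors

app = Flask(__name__)
# loaded once when the server process starts, then shared by all requests
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

@app.route('/similar')
def similar():
    word = request.args.get('word', 'king')
    # convert numpy floats to plain floats so they serialize cleanly as JSON
    results = [(w, float(score)) for w, score in model.most_similar(word)]
    return jsonify(results=results)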
There is one other trick that may help in your constant-relaunch scenario. Gensim can save & load in its own native format, which can make use of 'memory-mapping'. Essentially, a range of a file on-disk can be used directly by the operating-system's virtual memory system. Then, when many processes all designate the same file as the canonical version of something they want in their own memory-space, the OS knows they can re-use any parts of that file that are already in memory.
This technique works far more simply in gensim-4.0.0beta and later, so I'm only going to describe the steps needed there. (See this message if you want to force this preview installation before Gensim 4.0 is officially released.)
First, load the original-format file, but then re-save it in Gensim's format:
from gensim.models import KeyedVectors
# one-time conversion: load the word2vec-format file, then re-save in Gensim's native format
kv_model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
kv_model.save('GoogleNews-vectors-negative300.kv')
Note that there will be an extra .npy file created that must be kept alongside GoogleNews-vectors-negative300.kv if you move the model elsewhere. Do this only once to create the new files.
Second, when you later need the model, use Gensim's .load() with the mmap option:
kv_model = KeyedVectors.load('GoogleNews-vectors-negative300.kv', mmap='r')
# do your other operations
Right away, the .load() should complete faster. However, when you 1st try to access any word – or all words in a .most_similar() – the read from disk will still need to happen, just shifting the delays to later. (If you're only ever doing individual-word lookups or small sets of .doesnt_match() words, you may not notice any long lags.)
Further, depending on your OS & amount-of-RAM, you might even get some speedup when you run your script once, let it finish, then run it again soon after. It's possible in some cases that even though the OS has ended the prior process, its virtual-memory machinery remembers that some of the not-yet-cleared old-process memory pages are still in RAM, & correspond to the memory-mapped file. Thus, the next memory-map will re-use them. (I'm not sure of this effect, and if you're in a low-memory situation the chance of such re-use from a completed run may disappear completely.)
But, you could increase the chances of the model file staying memory-resident by taking a third step: launch a separate Python process to preload the model that doesn't exit until killed. To do this, make another Python script like preload.py:
from gensim.models import KeyedVectors
from threading import Semaphore
model = KeyedVectors.load('GoogleNews-vectors-negative300.kv', mmap='r')
model.most_similar('stuff') # any word will do: just to page all in
Semaphore(0).acquire() # just hang until process killed
Run this script in a separate shell: python preload.py. It will map the model into memory, but then hang until you CTRL-C exit it.
Now, any other code you run on the same machine that memory-maps the same file will automatically re-use any already-loaded memory pages from this separate process. (In low-memory conditions, if any other virtual-memory is being relied upon, ranges could still be flushed out of RAM. But if you have plentiful RAM, this will ensure minimal disk IO each new time the same file is referenced.)
Finally, one other option that can be mixed with any of these is to load only a subset of the full 3-million-token, 3.6GB GoogleNews set. The less-common words are near the end of this file, and skipping them won't affect many uses. So you can use the limit argument of load_word2vec_format() to only load a subset - which loads faster, uses less memory, and completes later full-set searches (like .most_similar()) faster. For example, to load just the 1st 1,000,000 words for about 67% savings of RAM/load-time/search-time:
from gensim.models import KeyedVectors
kv_model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', limit=1000000, binary=True)

Does executing a python script load it into memory?

I'm running a python script using python3 myscript.py on Ubuntu 16.04. Is the script loaded into memory, or read and interpreted line by line from the HDD? If it's not loaded all at once, is there any way of knowing or controlling how big the chunks that get loaded into memory are?
It is loaded into memory in its entirety. This must be the case, because a syntax error near the end will abort the program straight away. Try it and see.
There does not need to be any way to control or configure this. It is surely an implementation detail best left alone. If you have a problem related to this (e.g. your script is larger than your RAM), it can be solved some other way.
The "script" you use is only the human friendly representation you see. Python opens that script, reads lines, tokenizes them, creates a parse and ast tree for it and then emits bytecode which you can see using the dis module.
The "script" isn't loaded, it's code object (the object that contains the instructions generated for it) is. There's no direct way to affect that process. I have never heard of a script being so big that you need to read it in chunks, I'd be surprised if you accomplished it.

Having time-consuming object in memory

The code below is a part of my main function
def main():
    model = GoodPackage.load_file_format('hello.bin', binary=True)
    do_stuff_with_model(model)

def do_stuff_with_model(model):
    # do something with the model...
    pass
Assume that the size of hello.bin is a few gigabytes and it takes a while to load. The method do_stuff_with_model is still unstable and I must iterate on it many times until I have a stable version. In other words, I have to run the main function many times to finish debugging. However, since it takes a few minutes to load the model every time I run the code, it is very time consuming. Is there a way for me to store the model object somewhere else, so that every time I run the code by typing python my_code.py in the console I don't have to wait? I assume using pickle wouldn't help either, because the file will still be big.
How about creating a ramdisk? If you have enough memory, you can store the entire file in RAM. This will drastically speed things up, though you'll likely have to do this every time you restart your computer.
Creating a ramdisk is quite simple on linux. Just create a directory:
mkdir ramdisk
and mount it as a tmpfs or ramfs filesystem:
mount -t tmpfs -o size=512m tmpfs ./ramdisk
From there you can simply copy your large file to the ramdisk. This has the benefit that your code stays exactly the same, apart from simply changing the path to your big file. File access occurs just as it normally would, but now it's much faster, since it's loading it from RAM.
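For instance (just a sketch - the paths are placeholders, and GoodPackage.load_file_format is the placeholder call from the question above), the only change in your Python code is the path you read the file from:
import shutil

# one-time copy of the big file into the ramdisk mounted above
shutil.copy('hello.bin', './ramdisk/hello.bin')

# same call as before, but now the bytes come from RAM instead of the HDD
model = GoodPackage.load_file_format('./ramdisk/hello.bin', binary=True)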

Loading time shared library too large

I'm calling into a shared library (*.so) with ctypes. However, the loading time is very long, which makes every run slow.
What technique can I use to improve performance?
My module will always be run from the command prompt, one command at a time.
> $./myrunlib.py fileQuestion fileAnswer
# again
> $./myrunlib.py fileQuestion fileAnswer
code:
from ctypes import *
drv = cdll.LoadLibrary('/usr/lib/libXPTO.so')
Either you've got a strange bug which makes your library load extremely slowly when used by a Python program (which I find rather unlikely), or the loading simply takes the time it takes (maybe because the library does a large initialization task upon being loaded).
In the latter case your only option seems to be to prevent any restarts of your Python program. Let it run in a loop which reads all tasks from stdin (or any other pipe or socket or maybe even from job files) instead of from the command line.
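A rough sketch of such a loop (the one-file-pair-per-line protocol is my own assumption, and any calls into the library depend on what libXPTO actually exposes):
#!/usr/bin/env python
# myrunlib_server.py - long-running worker: load the library once,
# then read one "fileQuestion fileAnswer" pair per line from stdin until EOF
import sys
from ctypes import cdll

drv = cdll.LoadLibrary('/usr/lib/libXPTO.so')  # the slow load is paid only once, at startup

for line in sys.stdin:
    parts = line.split()
    if len(parts) != 2:
        continue
    file_question, file_answer = parts
    # call into the already-loaded library here; the actual function
    # name and signature depend on libXPTO's API
    print('processed', file_question, file_answer)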

How can I save the state of running python programs to resume later?

I am developing a machine learning analysis program which has to process 27GB of text files on Linux. Although my production system won't be rebooted very often, I need to test the program on my home computer or development environment.
I have power failures very often, so I can hardly run it continuously for 3 weeks.
My program reads the files, applies some parsing, saves the filtered data to new files in a directory, then applies the algorithm to those files and saves the result in a MySQL DB.
I am not able to find out how I can save the algorithm's state.
If everything regarding the algorithm state is saved in a class, you can serialize the class and save it to disk: http://docs.python.org/2/library/pickle.html
Since the entire algorithm state can be saved in a class, you might want to use pickle (as mentioned above), but pickle comes with its own overheads and risks.
For better ways to do the same, you might want to check out this article, which explains why you should use the camel library instead of pickle.
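As a minimal sketch of that pickle-based checkpointing (the class, field, and file names here are made up for illustration):
import os
import pickle

CHECKPOINT = 'algorithm_state.pkl'

class AnalysisState:
    """Hypothetical container for everything the algorithm needs to resume."""
    def __init__(self):
        self.processed_files = []
        self.partial_results = {}

def load_state():
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, 'rb') as f:
            return pickle.load(f)  # resume from the last saved checkpoint
    return AnalysisState()         # first run (or no checkpoint): start fresh

def save_state(state):
    with open(CHECKPOINT, 'wb') as f:
        pickle.dump(state, f)      # call this periodically, e.g. after each processed file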
