How can a Word2Vec pretrained model be loaded in Gensim faster?

I'm loading the model using:
model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
Now every time I run the file in PyCharm, it loads the model again.
So, is there a way to load it once and keep it available whenever I run things like model['king'] and model.doesnt_match("house garage store dog".split())?
It takes a lot of time whenever I want to check the similarity or find words that don't match.
When I ran model.most_similar('finance') it was really slow and the whole laptop froze for about 2 minutes. Is there a way to make things faster? I want to use it in my project, but I can't make the user wait this long.
Any suggestions?

That's a set of word-vectors that's about 3.6GB on disk, and slightly larger when loaded - so just the disk IO can take a noticeable amount of time.
Also, versions of Gensim through 3.8.3 require an extra one-time pre-calculation of unit-length-normalized vectors upon the very first use of a .most_similar() or .doesnt_match() operation (& others). This step can also take a noticeable moment, & then immediately requires a few extra GB of memory for a full model like GoogleNews - which on any machine with less than about 8GB RAM free risks using slower virtual-memory or even crashing with an out-of-memory error. (Starting in gensim-4.0.0, now available as a beta preview, the 1st .most_similar() after the model loads won't need any extra pre-calculation/allocation.)
The main way to avoid this annoying lag is to structure your code or service to not reload it separately before each calculation. Typically, this means keeping an interactive Python process that's loaded it alive, ready for your extra operations (or later user requests, as might be the case with a web-deployed service.)
It sounds like you may be developing a single Python script, something like mystuff.py, and running it via PyCharm's execute/debug/etc utilities for launching a Python file. Unfortunately, upon each completed execution, that will let the whole Python process end, releasing any loaded data/objects completely. Running the script again must do all the loading/precalculation again.
If your main interest is doing a bit of investigational examination & experimentation with the set of word-vectors, on your own, a big improvement would be to move to an interactive environment that keeps a single Python run alive & waiting for your next line of code.
For example, if you run the ipython interpreter at a command-line, in a separate shell, you can load the model, do a few lookup/similarity operations to print the results, and then just leave the prompt waiting for your next code. The full loaded state of the process remains available until you choose to exit the interpreter.
Similarly, if you use a Jupyter Notebook inside a web-browser, you get that same interpreter experience inside a growing set of editable-code-and-result 'cells' that you can re-run. All are sharing the same back-end interpreter process, with persistent state – unless you choose to restart the 'kernel'.
If you're providing a script or library code for your users' investigational work, they could also use such persistent interpreters.
But if you're building a web service or other persistently-running tool, you'd similarly want to make sure that the model remains loaded between user requests. (Exactly how you'd do that would depend on the details of your deployment, including web server software, so it'd be best to ask/search-for that as a separate question supplying more details when you're at that step.)
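Inside any long-running process, the "load once, reuse for every request" idea can be as small as a module-level cache. The following is only a sketch: get_model and its loader argument are hypothetical names, not Gensim API, and the loader is passed in so the caching logic can be tried without the 3.6GB file.

```python
_models = {}  # path -> loaded model; lives as long as the process does


def get_model(path, loader):
    """Return the model for `path`, calling `loader(path)` only on first use."""
    if path not in _models:
        _models[path] = loader(path)
    return _models[path]


# With Gensim, the loader could look like this (assumption: gensim is installed):
#   from gensim.models import KeyedVectors
#   loader = lambda p: KeyedVectors.load_word2vec_format(p, binary=True)
```

Every request handler would then call get_model(...) instead of loading the file itself; only the first call pays the load time.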
There is one other trick that may help in your constant-relaunch scenario. Gensim can save & load in its own native format, which can make use of 'memory-mapping'. Essentially, a range of a file on-disk can be used directly by the operating-system's virtual memory system. Then, when many processes all designate the same file as the canonical version of something they want in their own memory-space, the OS knows they can re-use any parts of that file that are already in memory.
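To make the memory-mapping idea concrete, here is a small sketch that uses Python's standard-library mmap module rather than Gensim. The file contents and the helper name read_slice_mmap are made up for the demonstration. Only the pages that are actually touched need to be read from disk, and the OS can share those pages between processes that map the same file.

```python
import mmap
import os
import tempfile


def read_slice_mmap(path, start, length):
    """Map the whole file read-only and return just one slice of it."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm[start:start + length]


# Build a throwaway demo file: the bytes 0123456789 repeated 1000 times.
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"0123456789" * 1000)

chunk = read_slice_mmap(demo_path, 5000, 10)  # touches only part of the file
os.remove(demo_path)
```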
This technique works far more simply in gensim-4.0.0beta and later, so I'm only going to describe the steps needed there. (See this message if you want to force this preview installation before Gensim 4.0 is officially released.)
First, load the original-format file, but then re-save it in Gensim's format:
from gensim.models import KeyedVectors
kv_model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
kv_model.save('GoogleNews-vectors-negative300.kv')
Note that there will be an extra .npy file created that must be kept alongside the GoogleNews-vectors-negative300.kv if you move the model elsewhere. Do this only once to create the new files.
Second, when you later need the model, use Gensim's .load() with the mmap option:
kv_model = KeyedVectors.load('GoogleNews-vectors-negative300.kv', mmap='r')
# do your other operations
Right away, the .load() should complete faster. However, when you 1st try to access any word – or all words in a .most_similar() – the read from disk will still need to happen, just shifting the delays to later. (If you're only ever doing individual-word lookups or small sets of .doesnt_match() words, you may not notice any long lags.)
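The two steps can be folded into one helper that converts on the first run and memory-maps on every later run. This is a sketch, assuming gensim 4.x is installed; native_path and load_vectors are hypothetical helper names. The gensim import sits inside the function so that merely defining the helper costs nothing.

```python
from pathlib import Path


def native_path(bin_path):
    """Path of the Gensim-native copy, stored next to the original file."""
    return Path(bin_path).with_suffix(".kv")


def load_vectors(bin_path):
    """Convert once to Gensim's native format, then memory-map it read-only."""
    from gensim.models import KeyedVectors  # assumption: gensim 4.x installed

    kv_path = native_path(bin_path)
    if not kv_path.exists():
        # One-time, slow step: read the word2vec-format file and re-save it.
        kv = KeyedVectors.load_word2vec_format(str(bin_path), binary=True)
        kv.save(str(kv_path))
    return KeyedVectors.load(str(kv_path), mmap='r')
```

A script would then call kv_model = load_vectors('GoogleNews-vectors-negative300.bin') in place of the slow load.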
Further, depending on your OS & amount-of-RAM, you might even get some speedup when you run your script once, let it finish, then run it again soon after. It's possible in some cases that even though the OS has ended the prior process, its virtual-memory machinery remembers that some of the not-yet-cleared old-process memory pages are still in RAM, & correspond to the memory-mapped file. Thus, the next memory-map will re-use them. (I'm not sure of this effect, and if you're in a low-memory situation the chance of such re-use from a completed process may disappear completely.)
But, you could increase the chances of the model file staying memory-resident by taking a third step: launch a separate Python process to preload the model that doesn't exit until killed. To do this, make another Python script like preload.py:
from gensim.models import KeyedVectors
from threading import Semaphore
model = KeyedVectors.load('GoogleNews-vectors-negative300.kv', mmap='r')
model.most_similar('stuff') # any word will do: just to page all in
Semaphore(0).acquire() # just hang until process killed
Run this script in a separate shell: python preload.py. It will map the model into memory, but then hang until you CTRL-C exit it.
Now, any other code you run on the same machine that memory-maps the same file will automatically re-use any already-loaded memory pages from this separate process. (In low-memory conditions, if any other virtual-memory is being relied upon, ranges could still be flushed out of RAM. But if you have plentiful RAM, this will ensure minimal disk IO each new time the same file is referenced.)
Finally, one other option that can be mixed with any of these is to load only a subset of the full 3-million-token, 3.6GB GoogleNews set. The less-common words are near the end of this file, and skipping them won't affect many uses. So you can use the limit argument of load_word2vec_format() to only load a subset - which loads faster, uses less memory, and completes later full-set searches (like .most_similar()) faster. For example, to load just the 1st 1,000,000 words for about 67% savings of RAM/load-time/search-time:
from gensim.models import KeyedVectors
kv_model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', limit=1000000, binary=True)
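The 3.6GB and 67% figures follow from simple arithmetic, assuming 300-dimensional vectors stored as 4-byte floats (the function name vector_bytes is only for illustration):

```python
def vector_bytes(n_words, dims=300, bytes_per_float=4):
    """Raw size of the vector array: one float per dimension per word."""
    return n_words * dims * bytes_per_float


full = vector_bytes(3_000_000)    # all 3 million tokens
subset = vector_bytes(1_000_000)  # limit=1000000
savings = 1 - subset / full       # fraction of RAM saved by the limit
```

Here full is 3,600,000,000 bytes (about 3.6GB), subset is 1,200,000,000 bytes, and savings is two thirds.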

Related

How to efficiently run multiple Pytorch Processes / Models at once ? Traceback: The paging file is too small for this operation to complete

Background
I have a very small network which I want to test with different random seeds.
The network barely uses 1% of my GPUs compute power so i could in theory run 50 processes at once to try many different seeds at once.
Problem
Unfortunately i can't even import pytorch in multiple processes. When the nr of processes exceeds 4 I get a Traceback regarding a too small paging file.
Minimal reproducible code§ - dispatcher.py
from subprocess import Popen
import sys
procs = []
for seed in range(50):
    procs.append(Popen([sys.executable, "ml_model.py", str(seed)]))
for proc in procs:
    proc.wait()
§I increased the number of seeds so people with better machines can also reproduce this.
Minimal reproducible code - ml_model.py
import torch
import time
time.sleep(10)
Traceback (most recent call last):
File "ml_model.py", line 1, in <module>
import torch
File "C:\Users\user\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\__init__.py", line 117, in <module>
import torch
File "C:\Users\user\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\__init__.py", line 117, in <module>
raise err
OSError: [WinError 1455] The paging file is too small for this operation to complete. Error loading "C:\Users\user\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\lib\cudnn_cnn_infer64_8.dll" or one of its dependencies.
raise err
Further Investigation
I noticed that each process loads a lot of DLLs into RAM. And when I close all other programs which use a lot of RAM I can get up to 10 processes instead of 4. So it seems like a resource constraint.
Questions
Is there a workaround?
What's the recommended way to train many small networks with PyTorch on a single GPU?
Should I write my own CUDA kernel instead, or use a different framework to achieve this?
My goal would be to run around 50 processes at once (on a machine with 16GB RAM and 8GB GPU RAM).

I've looked a bit into this tonight. I don't have a solution (edit: I have a mitigation, see the edit at end), but I have a bit more information.
It seems the issue is caused by NVidia fatbins (.nv_fatb) being loaded into memory. Several DLLs, such as cusolver64_xx.dll, torch_cuda_cu.dll, and a few others, have .nv_fatb sections in them. These contain tons of different variations of CUDA code for different GPUs, so it ends up being several hundred megabytes to a couple gigabytes.
When Python imports 'torch' it loads these DLLs, and maps the .nv_fatb section into memory. For some reason, instead of just being a memory mapped file, it is actually taking up memory. The section is set as 'copy on write', so it's possible something writes into it? I don't know. But anyway, if you look at Python using VMMap ( https://learn.microsoft.com/en-us/sysinternals/downloads/vmmap ) you can see that these DLLs are committing huge amounts of memory for this .nv_fatb section. The frustrating part is it doesn't seem to be using the memory. For example, right now my Python.exe has 2.7GB committed, but the working set is only 148MB.
Every Python process that loads these DLLs will commit several GB of memory loading these DLLs. So if 1 Python process is wasting 2GB of memory, and you try running 8 workers, you need 16GB of memory to spare just to load the DLLs. It really doesn't seem like this memory is used, just committed.
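Until the commit charge itself is reduced, the arithmetic above suggests capping how many processes run at once. The following is a minimal sketch of a throttled dispatcher. The 2GB-per-process figure comes from the example above, while max_workers_for, run_seeds, and the commit-budget number are hypothetical and would need to be measured on the actual machine.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def max_workers_for(commit_budget_gb, per_process_gb):
    """How many processes fit in the commit budget (always at least 1)."""
    return max(1, int(commit_budget_gb // per_process_gb))


def run_seeds(seeds, command, workers):
    """Run `command + [seed]` for every seed, at most `workers` at a time."""
    def run_one(seed):
        return subprocess.run(command + [str(seed)]).returncode

    # Threads only wait on child processes, so a thread pool is enough here.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_one, seeds))


# Example (not executed here):
#   workers = max_workers_for(16, 2)
#   run_seeds(range(50), [sys.executable, "ml_model.py"], workers)
```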
I don't know enough about these fatbinaries to try to fix it, but from looking at this for the past 2 hours it really seems like they are the issue. Perhaps it's an NVidia problem that these are committing memory?
edit: I made this python script: https://gist.github.com/cobryan05/7d1fe28dd370e110a372c4d268dcb2e5
Get it and install its pefile dependency ( python -m pip install pefile ).
Run it on your torch and cuda DLLs. In OPs case, command line might look like:
python fixNvPe.py --input=C:\Users\user\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\lib\*.dll
(You also want to run this wherever your cusolver64_*.dll and friends are. This may be in your torch\lib folder, or it may be, eg, C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\vXX.X\bin . If it is under Program Files, you will need to run the script with administrative privileges)
What this script is going to do is scan through all DLLs specified by the input glob, and if it finds an .nv_fatb section it will back up the DLL, disable ASLR, and mark the .nv_fatb section read-only.
ASLR is 'address space layout randomization.' It is a security feature that randomizes where a DLL is loaded in memory. We disable it for this DLL so that all Python processes will load the DLL into the same base virtual address. If all Python processes using the DLL load it at the same base address, they can all share the DLL. Otherwise each process needs its own copy.
Marking the section 'read-only' lets Windows know that the contents will not change in memory. If you map a file into memory read/write, Windows has to commit enough memory, backed by the pagefile, just in case you make a modification to it. If the section is read-only, there is no need to back it in the pagefile. We know there are no modifications to it, so it can always be found in the DLL.
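At the level of the PE file format, both changes are single-bit edits. The sketch below is not the linked script's code. It only shows the flag arithmetic, using the standard PE constants for a writable section and for the "dynamic base" (ASLR) DLL characteristic; the function names are made up for illustration.

```python
# Standard PE/COFF flag values.
IMAGE_SCN_MEM_READ = 0x40000000                  # section is readable
IMAGE_SCN_MEM_WRITE = 0x80000000                 # section is writable
IMAGE_DLLCHARACTERISTICS_DYNAMIC_BASE = 0x0040   # image may be relocated (ASLR)


def make_read_only(section_characteristics):
    """Clear the 'writable' bit of a section's Characteristics field."""
    return section_characteristics & ~IMAGE_SCN_MEM_WRITE


def disable_aslr(dll_characteristics):
    """Clear the 'dynamic base' bit of the optional header's DllCharacteristics."""
    return dll_characteristics & ~IMAGE_DLLCHARACTERISTICS_DYNAMIC_BASE
```

For example, a readable and writable data section with flags 0xC0000040 becomes 0x40000040 once the write bit is cleared.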
The theory behind the script is that by changing these 2 flags, less memory will be committed for the .nv_fatb, and more memory will be shared between the Python processes. In practice, it works. Not quite as well as I'd hope (it still commits a lot more than it uses), so my understanding may be flawed, but it significantly decreases memory commit.
In my limited testing I haven't run into any issues, but I can't guarantee there are no code paths that attempt to write to that section we marked 'read only.' If you start running into issues, though, you can just restore the backups.
edit 2022-01-20:
Per NVIDIA: "We have gone ahead and marked the nv_fatb section as read-only, this change will be targeting next major CUDA release 11.7 . We are not changing the ASLR, as that is considered a safety feature ."
This should certainly help. If it's not enough without disabling ASLR as well, then the script should still work.

In my case the paging file is already set to system-managed size, yet I get the same error, because I pass a large variable to multiple processes within a function. I would likely need to set a very large paging file, since Windows cannot grow it on the fly; instead I opted to reduce the number of processes, as this function is not used all the time.
If you are on Windows, it may be better to use 1 (or more) fewer cores than the total number of physical cores, because Python's multiprocessing module on Windows tends to grab everything it can if you use all of them, and actually tries to use all logical cores.
import multiprocessing
multiprocessing.cpu_count()
12
# I actually have 6 physical cores; if you use this as the base it will likely hog the system
import psutil
psutil.cpu_count(logical = False)
6 # actual number of physical cores
psutil.cpu_count(logical = True)
12 #logical cores (e.g. hyperthreading)
Please refer to here for more detail:
Multiprocessing: use only the physical cores?
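Put together, the advice amounts to "physical cores minus one". A small sketch follows; safe_worker_count is a hypothetical helper name. It uses psutil when available and falls back to os.cpu_count(), which reports logical cores.

```python
import os


def safe_worker_count(physical_cores=None, reserve=1):
    """Number of workers that leaves `reserve` physical cores free (minimum 1)."""
    if physical_cores is None:
        try:
            import psutil  # optional third-party dependency
            physical_cores = psutil.cpu_count(logical=False)
        except ImportError:
            physical_cores = None
    if not physical_cores:
        # Fallback: logical core count, which may overestimate physical cores.
        physical_cores = os.cpu_count() or 1
    return max(1, physical_cores - reserve)
```

On the 6-core machine above, safe_worker_count(6) gives 5, which could be passed as the processes argument of multiprocessing.Pool.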

Well, I managed to resolve this.
Open "Advanced system settings". Go to the Advanced tab, then click Settings under Performance.
Click the Advanced tab again --> Change --> unselect 'automatically......'. For all the drives, set 'system managed size'. Restart your PC.

Following up on #chris-obryan's answer (I would comment but have no reputation), I've found that memory utilisation drops pretty sharply some time into training with their fix applied (on the order of the roughly 2GB per process mentioned).
To eke out some more performance it may be worth monitoring memory utilisation and spawning a new instance of the model when these drops in memory occur, leaving enough space (~3 or 4 GB to be safe) for a bit of overhead.
I was seeing ~28GB of RAM utilised during the setup phase, which dropped to about 14GB after iterating for a while.
(Note that my use case is a little different here as I'm bottlenecked by host<->device transfers due to optimising with a GA, as a reasonable amount of CPU bound processing needs to occur after each generation, so this could play into it. I am also using concurrent.futures.ProcessPoolExecutor() rather than manually using subprocesses.)

I have changed 'num_workers = 10' to 'num_workers = 1'. It helped me to solve the problem.

To fix this problem, I updated the CUDA 11.8.0 version and PyTorch to the 11.6 cudatoolkit version with PyTorch 1.9.1. Using conda:
conda install pytorch torchvision torchaudio cudatoolkit=11.6 -c pytorch -c conda-forge
Thanks to #chris-obryan I understood the problem and guessed that an update might already be available. I measured the memory consumption before and after the updates, and it dropped sharply.

Since it seems that each import torch loads a bunch of fat DLLs (thanks #chris-obryan), I tried changing this:
import torch
if __name__ == "__main__":
    # multiprocessing stuff, paging file errors
to this...
if __name__ == "__main__":
    import torch
    # multiprocessing stuff
And it worked well (because when the subprocesses are created __name__ is not "__main__").
Not an elegant solution, but perhaps useful to someone.

Problems Using the Berkeley DB Transactional Processing

I'm writing a set of programs that have to operate on a common database, possibly concurrently. For the sake of simplicity (for the user), I didn't want to require the setup of a database server. Therefore I settled on Berkeley DB, where one can just fire up a program and let it create the DB if it doesn't exist.
In order to let programs work concurrently on a database, one has to use the transactional features present in the 5.x release (here I use python3-bsddb3 6.1.0-1+b2 with libdb5.3 5.3.28-12): the documentation clearly says that it can be done. However, I quickly ran into trouble, even with some basic tasks:
Program 1 initializes records in a table
Program 2 has to scan the records previously added by program 1 and updates them with additional data.
To speed things up, there is an index for said additional data. When program 1 creates the records, the additional data isn't present, so the pointer to that record is added to the index under an empty key. Program 2 can then just quickly seek to the not-yet-updated records.
Even when not run concurrently, the record updating program crashes after a few updates. First it complained about insufficient space in the mutex area. I had to resolve this with an obscure DB_CONFIG file and then run db_recover.
Next, again after a few updates it complained 'Cannot allocate memory -- BDB3017 unable to allocate space from the buffer cache'. db_recover and relaunching the program did the trick, only for it to crash again with the same error a few records later.
I'm not even mentioning concurrent use: when one of the programs is launched while the other is running, they almost instantly crash with deadlock, panic about corrupted segments and ask to run recover. I made many changes, so I went through a wide spectrum of errors, which often yield irrelevant matches when searched for. I even rewrote the db calls to use lmdb, which in fact works quite well and is really quick, which tends to indicate my program logic isn't at fault. Unfortunately it seems the datafile produced by lmdb is quite sparse, and quickly grew to unacceptable sizes.
From what I said, it seems that maybe some resources are being leaked somewhere. I'm hesitant to rewrite all this directly in C to check if the problem can come from the Python binding.
I can and I will update the question with code, but for the moment it is long enough. I'm looking for people who have used the transactional stuff in BDB for similar uses and who could point me to some of the gotchas.
Thanks

RPM (see http://rpm5.org) uses Berkeley DB in transactional mode. There's a fair number of gotchas, depending on what you are attempting.
You have already found DB_CONFIG: you MUST configure the sizes for mutexes and locks, the defaults are invariably too small.
Needing to run db_recover while developing is quite painful too. The best fix (imho) is to automate recovery while opening by checking the return code for DB_RUNRECOVERY, and then reopening the dbenv with DB_RECOVER.
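The "recover on open" pattern can be sketched independently of the binding. In the sketch below, open_with_recovery and both callables are hypothetical. With bsddb3 one would presumably have open_env add db.DB_RECOVER to the flags passed to DBEnv.open() when asked to recover, and have is_run_recovery test for the binding's run-recovery exception; treat those details as assumptions to check against the bsddb3 documentation.

```python
def open_with_recovery(open_env, is_run_recovery):
    """Try a normal open; if it fails with 'run recovery', reopen with recovery.

    open_env(recover)     -> environment handle (hypothetical callable)
    is_run_recovery(exc)  -> True if exc means DB_RUNRECOVERY (hypothetical)
    """
    try:
        return open_env(False)
    except Exception as exc:
        if not is_run_recovery(exc):
            raise  # unrelated failure: do not mask it
    # Only reached when the first open asked for recovery.
    return open_env(True)
```

Recovery normally has to run while no other process has the environment open, so this check belongs in whichever program starts first.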
Deadlocks are usually design/coding errors: run db_stat -CA to see what is deadlocked (or what locks are held) and adjust your program. "Works with lmdb" isn't sufficient to claim working code ;-)
Leaks can be seen with valgrind and/or by compiling BDB with -fsanitize=address. Note that valgrind will report false uninitializations unless you use overrides and/or compile BDB to initialize.

How to save spaCy model onto cache?

I'm using spaCy with Python for Named Entity Recognition, but the script requires the model to be loaded on every run and takes about 1.6GB memory to load it.
But 1.6GB is not dispensable for every run.
How do I load it into the cache or temporary memory so as to enable the script to run faster?

First of all, if you only do NER, you can install the parser without vectors.
This is possible by passing the argument parser to:
python -m spacy.en.download parser
This will prevent the 700MB+ GloVe vectors from being downloaded, slimming the memory needed for a single run.
Then, well, it depends on your application and how you use the library.
If you call it often, it is better to assign spacy.load('en') to a module/class variable loaded at the beginning of your stack.
This will slow down your boot time a bit, but spaCy will be ready (in memory) to be called.
(If the boot time is a big problem, you can do lazy loading.)

Speed up feedparser

I'm using feedparser to print the top 5 Google news titles. I get all the information from the URL the same way as always.
import feedparser as fp
x = 'https://news.google.com/news/feeds?pz=1&cf=all&ned=us&hl=en&topic=t&output=rss'
feed = fp.parse(x)
My problem is that I'm running this script when I start a shell, so that ~2 second lag gets quite annoying. Is this time delay primarily from communications through the network, or is it from parsing the file?
If it's from parsing the file, is there a way to only take what I need (since that is very minimal in this case)?
If it's from the former possibility, is there any way to speed this process up?

I suppose that a few delays are adding up:
The Python interpreter needs a while to start and import the module
Network communication takes a bit
Parsing probably takes only a little time, but it still adds up
I think there is no straightforward way of speeding things up, especially not the first point. My suggestion is that you have your feeds downloaded on a regular basis (you could set up a cron job or write a Python daemon) and stored somewhere on your disk (e.g. a plain text file) so you just need to display them at your terminal's startup (echo would probably be the easiest and fastest).
I personally have had good experiences with feedparser. I use it to download ~100 feeds every half hour with a Python daemon.
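A lighter variant of the same idea is to cache the titles in a file and only refetch when the cache is stale. The following is a sketch using only the standard library: cached_titles, the cache path, and the 30-minute default are assumptions, and fetch stands for whatever function actually downloads and parses the feed (for example one that calls feedparser).

```python
import json
import os
import time


def cached_titles(cache_path, fetch, max_age_seconds=1800):
    """Return titles from `cache_path` if fresh enough, else call `fetch()`."""
    try:
        age = time.time() - os.path.getmtime(cache_path)
        if age < max_age_seconds:
            with open(cache_path, encoding="utf-8") as f:
                return json.load(f)
    except (OSError, ValueError):
        pass  # missing or unreadable cache: fall through and refetch

    titles = fetch()
    with open(cache_path, "w", encoding="utf-8") as f:
        json.dump(titles, f)
    return titles
```

At shell startup the common case is then a single small file read with no network access.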

Parsing in real time is not the best approach if you want faster results.
You can try doing it asynchronously with Celery or a similar solution. I like Celery; it offers a lot, including cron-like periodic tasks, async tasks, and more.

How can I save the state of running python programs to resume later?

I am developing a machine learning analysis program which has to process 27GB of text files on Linux. Although my production system won't be rebooted very often, I need to test it on my home computer or development environment.
I have power failures very often, so I can hardly run it continuously for 3 weeks.
My program reads the files, applies some parsing, saves the filtered data in new files in dictionary, then I apply the algorithm on those files and save the results in a MySQL DB.
I can't find out how to save the algorithm state.

If everything regarding the algorithm state is saved in a class, you can serialize the class and save it to disk: http://docs.python.org/2/library/pickle.html
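Since the concern here is power failure, the checkpoint itself should survive being interrupted mid-write. The following is a minimal sketch; save_checkpoint, load_checkpoint, and the shape of the state object are assumptions. It writes to a temporary file first and then renames it over the old checkpoint.

```python
import os
import pickle


def save_checkpoint(state, path):
    """Write `state` to `path` so a crash never leaves a half-written file."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())    # push the bytes to disk before renaming
    os.replace(tmp_path, path)  # swap the new checkpoint in place of the old one


def load_checkpoint(path, default):
    """Return the saved state, or `default` when no checkpoint exists yet."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return default
```

The program would call load_checkpoint at startup and save_checkpoint after each processed file or batch. Only unpickle files you wrote yourself, since loading a pickle can execute arbitrary code.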

Since the entire algorithm state can be saved in a class, you might want to use pickle (as mentioned above), but pickle comes with its own overheads and risks.
For better ways to do the same, you might want to check out this article, which explains why you should use the camel library instead of pickle.
