Update spaCy Vocabulary - python

I was wondering if it is possible to update spaCy's default vocabulary. What I am trying to do is this:
run word2vec on my own corpus with gensim
load the vectors into my model with nlp.vocab.load_vectors_from_bin_loc(path)
But since a lot of words in my corpus aren't in spaCy's default vocabulary, I can't make use of the imported vectors. Is there an (easy) way to add those missing types?
Edit:
I realize it might be problematic to mix vectors. So my question is:
How can I import a custom vocabulary into spacy?

This is much easier in the next version, which should be out this week --- I'm just finishing testing it. For now:
By default spaCy loads a data/vocab/vec.bin file, where the "data" directory is within the spacy.en module directory.
Create the vec.bin file from a bz2 file using spacy.vocab.write_binary_vectors.
Either replace spaCy's vec.bin file, or call nlp.vocab.load_rep_vectors at run-time, with the path to the binary file.
The above is a bit inconvenient at first, but the binary file format is much smaller and faster to load, and the vectors files are fairly big. Note that GloVe distributes in gzip format, not bzip.
Out of interest: are you using the GloVe vectors, or something you trained on your own data? If your own data, did you use Gensim? I'd like to make this much easier, so I'd appreciate suggestions for what work-flow you'd like to see.
Load new vectors at run-time, optionally converting them
import spacy.vocab

def set_spacy_vectors(nlp, binary_loc, bz2_loc=None):
    # Optionally convert a bz2 text vectors file to the binary format first.
    if bz2_loc is not None:
        spacy.vocab.write_binary_vectors(bz2_loc, binary_loc)
    nlp.vocab.load_rep_vectors(binary_loc)
Replace the vec.bin, so your vectors will be loaded by default
from spacy.vocab import write_binary_vectors
import spacy.en
from os import path
import plac

def main(bz2_loc):
    bin_loc = path.join(path.dirname(spacy.en.__file__), 'data', 'vocab', 'vec.bin')
    write_binary_vectors(bz2_loc, bin_loc)

if __name__ == '__main__':
    plac.call(main)
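If the vectors do come from gensim (as in the question), a minimal sketch of producing a bz2 text file for write_binary_vectors might look like the following; the paths are hypothetical, gensim attribute names vary by version, and the exact one-word-per-line layout should be checked against your spaCy version.

import bz2
from gensim.models import Word2Vec

# Hypothetical paths; adjust to your own model and output locations.
model = Word2Vec.load('my_word2vec.model')

# Write one "word v1 v2 ... vN" line per word, bz2-compressed.
with bz2.BZ2File('my_vectors.txt.bz2', 'w') as out_file:
    for word in model.index2word:  # newer gensim exposes this as model.wv.index_to_key
        vector = ' '.join('%f' % value for value in model[word])
        out_file.write(('%s %s\n' % (word, vector)).encode('utf8'))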

Related

Set up data pipeline for video processing

I am a newbie to TensorFlow, so I would appreciate any constructive help.
I am trying to build a feature extraction and data pipeline with TensorFlow for video processing, where multiple folders hold video files with multiple classes (the JHMDB database), but I am kind of stuck.
I have the features extracted to one folder, at the moment as separate *.npz compressed arrays; the class name is stored in the filename as well.
First Attempt
First I thought I would use this code from the TF tutorials site, the simple "reading files from a folder" method:
from pathlib import Path
import numpy as np
import tensorflow as tf

jhmdb_path = Path('...')

# Process files in folder
list_ds = tf.data.Dataset.list_files(str(jhmdb_path / '*.npz'))
for f in list_ds.take(5):
    print(f.numpy())

def process_path(file_path):
    labels = tf.strings.split(file_path, '_')[-1]
    features = np.load(file_path)  # file_path is a Tensor here, which triggers the error below
    return features, labels

labeled_ds = list_ds.map(process_path)
for a, b in labeled_ds.take(5):
    print(a, b)
TypeError: expected str, bytes or os.PathLike object, not Tensor
...but this is not working.
Second Attempt
Then I thought OK, I will use generators:
# using a generator
jhmdb_path = Path('...')

def generator():
    for item in jhmdb_path.glob("*.npz"):
        features = np.load(item)
        print(features.files)          # keys stored in the archive
        print(features['PAFs'].shape)
        features = features['PAFs']
        yield features

dataset = tf.data.Dataset.from_generator(generator, (tf.uint8))
iter(next(dataset))  # raises the TypeError below
TypeError: 'FlatMapDataset' object is not an iterator...not working.
In the first case, the path is somehow a bytes object, and I could not change it to str to be able to load it with np.load(). (If I point np.load(direct_path) at a file directly, then, strangely, it works.)
In the second case... I am not sure what is wrong.
I looked for hours to find a solution for how to build an iterable dataset from a large number of relatively big 'npz' or 'npy' files, but it seems this is not mentioned anywhere (or is just way too trivial, maybe).
Also, as I could not test the model so far, I am not sure if this is the right way to go, i.e. to feed the model with hundreds of files in this way, or to just build one huge 3.5 GB npz (that would still fit in memory) and use that instead. Or to use TFRecords, but that looks more complicated than the usual examples.
What is really annoying here is that the TF tutorials, and in general all of them, are about how to load a ready dataset directly, or how to load np array(s), or how to load image, text, or dataframe objects, but I could not find any real examples of how to process big chunks of data files, e.g. the case of extracting features from audio or video files.
So any suggestions or solutions would be highly appreciated and I would be really, really grateful to have something finally working! :)
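Not a full answer, but as an illustration of the kind of fix usually suggested for the first attempt: wrap np.load in tf.py_function so the byte path can be decoded to a str inside the pipeline. The 'PAFs' key, the dtypes and the glob path below are assumptions taken from the snippets above.

import numpy as np
import tensorflow as tf

def load_npz(file_path):
    # Inside tf.data the path arrives as a byte tensor; decode it to str for np.load.
    path = file_path.numpy().decode('utf-8')
    data = np.load(path)
    features = data['PAFs']                           # assumed array key, as in the generator above
    label = path.split('_')[-1].replace('.npz', '')   # class name encoded in the filename
    return features.astype(np.uint8), label

def process_path(file_path):
    features, label = tf.py_function(load_npz, [file_path], [tf.uint8, tf.string])
    return features, label

list_ds = tf.data.Dataset.list_files('path/to/jhmdb/*.npz')  # hypothetical location
labeled_ds = list_ds.map(process_path)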

Most efficient way to use a large data set for PyTorch?

Perhaps this question has been asked before, but I'm having trouble finding relevant info for my situation.
I'm using PyTorch to create a CNN for regression with image data. I don't have a formal, academic programming background, so many of my approaches are ad-hoc and just terribly inefficient. Many times I can go back through my code and clean things up later, because the inefficiency is not so drastic that performance is significantly affected. However, in this case, my method for using the image data takes a long time, uses a lot of memory, and is done every time I want to test a change in the model.
What I've done is essentially load the image data into numpy arrays, save those arrays in an .npy file, and then, when I want to use said data for the model, import all of the data from that file. I don't think the data set is really THAT large, as it is comprised of 5000 3-color-channel images of size 64x64. Yet my memory usage shoots up to 70%-80% (out of 16 GB) when it is being loaded, and it takes 20-30 seconds to load in every time.
My guess is that I'm being dumb about the way I'm loading it in, but frankly I'm not sure what the standard is. Should I, in some way, put the image data somewhere before I need it, or should the data be loaded directly from the image files? And in either case, what is the best, most efficient way to do that, independent of file structure?
I would really appreciate any help on this.
For speed I would advise using HDF5 or LMDB:
Reasons to use LMDB:
LMDB uses memory-mapped files, giving much better I/O performance.
It works well with really large datasets. HDF5 files are always read entirely into memory, so you can't have any HDF5 file exceed your memory capacity. You can easily split your data into several HDF5 files, though (just put several paths to h5 files in your text file). Then again, compared to LMDB's page caching, the I/O performance won't be nearly as good.
[http://deepdish.io/2015/04/28/creating-lmdb-in-python/]
If you decide to use LMDB:
ml-pyxis is a tool for creating and reading deep learning datasets using LMDBs. (I am a co-author of this tool.)
It allows you to create binary blobs (LMDB), and they can be read quite quickly. The link above comes with some simple examples of how to create and read the data, including Python generators/iterators.
This notebook has an example of how to create a dataset and read it in parallel while using PyTorch.
If you decide to use HDF5:
PyTables is a package for managing hierarchical datasets, designed to efficiently and easily cope with extremely large amounts of data.
https://www.pytables.org/
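As an illustration only, here is a minimal sketch of dumping an image array into an HDF5 file with PyTables; the file name and array shape mirror the question and are otherwise arbitrary.

import numpy as np
import tables

# Dummy stand-in for the real data: 5000 RGB images of size 64x64.
images = np.random.randint(0, 255, size=(5000, 3, 64, 64), dtype=np.uint8)

with tables.open_file('train_images.hdf5', mode='w') as h5:
    # Store the whole array as a single dataset named 'images'.
    h5.create_array(h5.root, 'images', images)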
Here is a concrete example to demonstrate what I meant. This assumes that you've already dumped the images into an hdf5 file (train_images.hdf5) using h5py.
import h5py
hf = h5py.File('train_images.hdf5', 'r')
group_key = list(hf.keys())[0]
ds = hf[group_key]
# load only one example
x = ds[0]
# load a subset, slice (n examples)
arr = ds[:n]
# should load the whole dataset into memory.
# this should be avoided
arr = ds[:]
In simple terms, ds can now be used as an iterator which gives images on the fly (i.e. it doesn't load the whole dataset into memory). This should make the whole run time blazing fast.
for idx, img in enumerate(ds):
    # do something with `img`
    ...
In addition to the above answers, the following may be useful due to some recent advances (2020) in the Pytorch world.
Your question: Should I, in some way, put the image data somewhere before I need it, or should the data be loaded directly from the image files? And in either case, what is the best, most efficient way to do that, independent of file structure?
You can leave the image files in their original format (.jpg, .png, etc.) on your local disk or on cloud storage, but with one added step: pack the directory into a tar file. Please read this for more details:
Pytorch Blog (Aug 2020): Efficient PyTorch I/O library for Large Datasets, Many Files, Many GPUs (https://pytorch.org/blog/efficient-pytorch-io-library-for-large-datasets-many-files-many-gpus/)
This package is designed for situations where the data files are too large to fit in memory for training. Therefore, you give the URL of the dataset location (local, cloud, ..) and it will bring in the data in batches and in parallel.
The only (current) requirement is that the dataset must be in a tar file format.
The tar file can be on the local disk or on the cloud. With this, you don't have to load the entire dataset into the memory every time. You can use the torch.utils.data.DataLoader to load in batches for stochastic gradient descent.
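A minimal sketch of what that looks like with the webdataset package from that blog post; the tar name and keys here are illustrative, so check the post for the current API.

import webdataset as wds

# Hypothetical tar file containing pairs like sample0001.jpg / sample0001.cls
dataset = (
    wds.WebDataset("train_images.tar")
    .decode("torchrgb")        # decode images straight to CHW float tensors
    .to_tuple("jpg", "cls")    # yield (image_tensor, label) pairs
)

for image, label in dataset:
    ...  # feed a training loop, or wrap the dataset in a DataLoader for batching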
There is no need to save the images into npy files and load everything into memory. Just load a batch of image paths and transform them into tensors.
The following code defines the MassiveDataset and passes it into a DataLoader; everything works well.
from torch.utils.data.dataset import Dataset
from typing import Optional, Callable
import os
import multiprocessing

def apply_transform(transform: Callable, data):
    try:
        if isinstance(data, (list, tuple)):
            return [transform(item) for item in data]
        return transform(data)
    except Exception as e:
        raise RuntimeError(f'applying transform {transform}: {e}')

class MassiveDataset(Dataset):
    def __init__(self, filename, transform: Optional[Callable] = None):
        self.offset = []
        self.n_data = 0

        if not os.path.exists(filename):
            raise ValueError(f'filename does not exist: {filename}')

        # Record the byte offset of every line so samples can be read on demand.
        with open(filename, 'rb') as fp:
            self.offset = [0]
            while fp.readline():
                self.offset.append(fp.tell())
            self.offset = self.offset[:-1]

        self.n_data = len(self.offset)

        self.filename = filename
        self.fd = open(filename, 'rb', buffering=0)
        self.lock = multiprocessing.Lock()
        self.transform = transform

    def __len__(self):
        return self.n_data

    def __getitem__(self, index: int):
        if index < 0:
            index = self.n_data + index

        with self.lock:
            self.fd.seek(self.offset[index])
            line = self.fd.readline()

        data = line.decode('utf-8').strip('\n')

        return apply_transform(self.transform, data) if self.transform is not None else data
NB: opening the file with buffering=0 and using multiprocessing.Lock() avoids loading bad data (usually a bit from one part of the file and a bit from another part of the file).
Additionally, if using multiprocessing in the DataLoader, you could get the exception TypeError: cannot serialize '_io.BufferedReader' object. This is caused by the pickle module used by multiprocessing: it cannot serialize _io.BufferedReader, but dill can. Replacing multiprocessing with multiprocess makes things work (the major change compared with multiprocessing is that enhanced serialization is done with dill).
The same thing was discussed in this issue.
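For illustration, the swap mentioned above is essentially a one-line change (assuming the multiprocess package is installed, e.g. via pip install multiprocess):

# Drop-in replacement suggested above: multiprocess serializes with dill instead of pickle.
import multiprocess as multiprocessing  # instead of: import multiprocessing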

How to load index shards by gensim.similarities.Similarity?

I'm working on something using gensim.
In gensim, the variable index usually means an object of gensim.similarities.<cls>.
At first, I use gensim.similarities.Similarity(filepath, ...) to save the index as a file, and then load it with gensim.similarities.Similarity.load(filepath + '.0'), because gensim.similarities.Similarity by default saves the index to shard files like index.0.
When the index file becomes larger, it automatically separates into more shards, like index.0, index.1, index.2, ...
How can I load these shard files? gensim.similarities.Similarity.load() can only load one file.
BTW: I have tried to find the answer in gensim's docs, but failed.
from gensim.corpora.textcorpus import TextCorpus
from gensim.test.utils import datapath, get_tmpfile
from gensim.similarities import Similarity
temp_fname = get_tmpfile("index")
output_fname = get_tmpfile("saved_index")
corpus = TextCorpus(datapath('testcorpus.txt'))
index = Similarity(output_fname, corpus, num_features=400)
index.save(output_fname)
loaded_index = index.load(output_fname)
https://radimrehurek.com/gensim/similarities/docsim.html
shoresh's answer is correct. The key part that OP was missing was
index.save(output_fname)
While just creating the object appears to save it, it's really only saving the shards; a sort of directory file also needs to be saved (via index.save(output_fname)) for the index to be accessible as a whole object.
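For completeness, a minimal sketch of loading the saved index back in a fresh session; the path is hypothetical, and the shard files (saved_index.0, saved_index.1, ...) just need to sit next to the saved master file.

from gensim.similarities import Similarity

# Load the master index written by index.save(...); it finds its shard files
# from the output prefix it recorded.
loaded_index = Similarity.load('/tmp/saved_index')  # hypothetical path to the saved master file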

Word2Vec Vocabulary not defined error

I am new to python and word2vec and keep getting a "you must first build vocabulary before training the model" error. What is wrong with my code?
Here is my code:
file_object=open("SupremeCourt.txt","w")
from gensim.models import word2vec
data = word2vec.Text8Corpus('SupremeCourt.txt')
model = word2vec.Word2Vec(data, size=200)
out=model.most_similar()
print(out[1])
print(out[2])
I can see some things wrong in your code, like the file being opened in write mode, and the model you have loaded doesn't contain the word for which you want to find the most similar words.
I would suggest using predefined models like the Google News vectors loaded into gensim, or building your own word2vec model, so that you won't get the error.
The usage of most_similar in gensim is out = model.most_similar("word-name").
file_object=open("SupremeCourt.txt","r")
from gensim.models import word2vec
data = word2vec.Text8Corpus('SupremeCourt.txt')
model = word2vec.Word2Vec(data, size=200)#use google news vectors here
out=model.most_similar("word")
print(out)
You're opening that file in write mode with this line:
file_object = open("SupremeCourt.txt", "w")
By doing this, you're erasing the contents of your file, so that when you try to pass the file to the model for training, there is no data to read. That's why that error is thrown.
Remove that line (and also restore your file contents), and it'll work.

How to import a Matrix from a text file in Python

Hey, this might be a simple question, but I have never had to import any files into Python before.
So I have a numpy matrix in a text file named dmatrix.txt; how would I be able to import the file into Python and use the matrix?
I am trying to use numpy.load(), but I am unsure how to use it.
Try numpy.loadtxt('dmatrix.txt'); you could add a delimiter argument if the file is comma-separated or something.
numpy.load is for files in numpy/python binary formats - created by numpy.save, numpy.savez, or pickle.
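A quick sketch of both calls side by side (the file name comes from the question):

import numpy as np

# Whitespace-separated values (the default for loadtxt):
dmatrix = np.loadtxt('dmatrix.txt')

# If the file were comma-separated instead:
dmatrix = np.loadtxt('dmatrix.txt', delimiter=',')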
