saving and loading multiple shards made by gensim similarity model

saving and loading multiple shards made by gensim similarity model - python

My data has more than 1 million rows and while training gensim similarity model, it is making multiple .sav files (model.sav, model.sav.0, model.sav.1 and so on..). Problem is while loading, it is loading only one sub-part, instead of all the sub-parts, hence performing horribly in prediction. Parameters/options are not working as per gensim documentation.
As per the gensim documentation - https://radimrehurek.com/gensim/similarities/docsim.html
Saving as file handle and giving the following params should have worked - :
model.save(fname_or_handle, separately = None)
model.load(filepath, mmap = 'r')
Even tried to -
pickle the .sav files ( this pickles the 1st shard only i.e. model.sav)
compressing all sub-parts as .gz file ( this compresses one shard only , not all the sub-parts) and also gives some sort of pickle error.
tf_idf = gensim.models.TfidfModel(corpus)
sims = gensim.similarities.Similarity('./models/model.sav',tf_idf[corpus],
num_features=len(dictionary))
sims.save('./models/model.sav')
sims1 = gensim.similarities.Similarity.load(./models/model.sav)
Expected results should give all matching documents from corpus, but this gives only from model.sav (the file mentioned while loading). It does NOT even execute the other shards. I checked result from each shard.
Question: How do I use all the sub-files of gensim model to predict similarity of my test document, WITHOUT looping through every sub-file individually and then presenting union of those results.

It's my understanding that 'model.sav' serves as a sort of directory to access all the actual similarity shards.
What's your output from len(sims1)? Running the above code on a corpus of 65,536 entries (creates exactly two shards), I can save and load a corpus and check that it has the 65,536 documents. I can also add documents and further save/load.

Related

Training Word2Vec Model from sourced data - Issue Tokenizing data

I have recently sourced and curated a lot of reddit data from Google Bigquery.
The dataset looks like this:
Before passing this data to word2vec to create a vocabulary and be trained, it is required that I properly tokenize the 'body_cleaned' column.
I have attempted the tokenization with both manually created functions and NLTK's word_tokenize, but for now I'll keep it focused on using word_tokenize.
Because my dataset is rather large, close to 12 million rows, it is impossible for me to open and perform functions on the dataset in one go. Pandas tries to load everything to RAM and as you can understand it crashes, even on a system with 24GB of ram.
I am facing the following issue:
When I tokenize the dataset (using NTLK word_tokenize), if I perform the function on the dataset as a whole, it correctly tokenizes and word2vec accepts that input and learns/outputs words correctly in its vocabulary.
When I tokenize the dataset by first batching the dataframe and iterating through it, the resulting token column is not what word2vec prefers; although word2vec trains its model on the data gathered for over 4 hours, the resulting vocabulary it has learnt consists of single characters in several encodings, as well as emojis - not words.
To troubleshoot this, I created a tiny subset of my data and tried to perform the tokenization on that data in two different ways:
Knowing that my computer can handle performing the action on the dataset, I simply did:
reddit_subset = reddit_data[:50]
reddit_subset['tokens'] = reddit_subset['body_cleaned'].apply(lambda x: word_tokenize(x))
This produces the following result:
This in fact works with word2vec and produces model one can work with. Great so far.
Because of my inability to operate on such a large dataset in one go, I had to get creative with how I handle this dataset. My solution was to batch the dataset and work on it in small iterations using Panda's own batchsize argument.
I wrote the following function to achieve that:
def reddit_data_cleaning(filepath, batchsize=20000):
if batchsize:
df = pd.read_csv(filepath, encoding='utf-8', error_bad_lines=False, chunksize=batchsize, iterator=True, lineterminator='\n')
print("Beginning the data cleaning process!")
start_time = time.time()
flag = 1
chunk_num = 1
for chunk in df:
chunk[u'tokens'] = chunk[u'body_cleaned'].apply(lambda x: word_tokenize(x))
chunk_num += 1
if flag == 1:
chunk.dropna(how='any')
chunk = chunk[chunk['body_cleaned'] != 'deleted']
chunk = chunk[chunk['body_cleaned'] != 'removed']
print("Beginning writing a new file")
chunk.to_csv(str(filepath[:-4] + '_tokenized.csv'), mode='w+', index=None, header=True)
flag = 0
else:
chunk.dropna(how='any')
chunk = chunk[chunk['body_cleaned'] != 'deleted']
chunk = chunk[chunk['body_cleaned'] != 'removed']
print("Adding a chunk into an already existing file")
chunk.to_csv(str(filepath[:-4] + '_tokenized.csv'), mode='a', index=None, header=None)
end_time = time.time()
print("Processing has been completed in: ", (end_time - start_time), " seconds.")
Although this piece of code allows me to actually work through this huge dataset in chunks and produces results where otherwise I'd crash from memory failures, I get a result which doesn't fit my word2vec requirements, and leaves me quite baffled at the reason for it.
I used the above function to perform the same operation on the Data subset to compare how the result differs between the two functions, and got the following:
The desired result is on the new_tokens column, and the function that chunks the dataframe produces the "tokens" column result.
Is anyone any wiser to help me understand why the same function to tokenize produces a wholly different result depending on how I iterate over the dataframe?
I appreciate you if you read through the whole issue and stuck through!

First & foremost, beyond a certain size of data, & especially when working with raw text or tokenized text, you probably don't want to be using Pandas dataframes for every interim result.
They add extra overhead & complication that isn't fully 'Pythonic'. This is particularly the case for:
Python list objects where each word is a separate string: once you've tokenized raw strings into this format, as for example to feed such texts to Gensim's Word2Vec model, trying to put those into Pandas just leads to confusing list-representation issues (as with your columns where the same text might be shown as either ['yessir', 'shit', 'is', 'real'] – which is a true Python list literal – or [yessir, shit, is, real] – which is some other mess likely to break if any tokens have challenging characters).
the raw word-vectors (or later, text-vectors): these are more compact & natural/efficient to work with in raw Numpy arrays than Dataframes
So, by all means, if Pandas helps for loading or other non-text fields, use it there. But then use more fundamntal Python or Numpy datatypes for tokenized text & vectors - perhaps using some field (like a unique ID) in your Dataframe to correlate the two.
Especially for large text corpuses, it's more typical to get away from CSV and instead use large text files, with one text per newline-separated line, and any each line being pre-tokenized so that spaces can be fully trusted as token-separated.
That is: even if your initial text data has more complicated punctuation-sensative tokenization, or other preprocessing that combines/changes/splits other tokens, try to do that just once (especially if it involves costly regexes), writing the results to a single simple text file which then fits the simple rules: read one text per line, split each line only by spaces.
Lots of algorithms, like Gensim's Word2Vec or FastText, can either stream such files directly or via very low-overhead iterable-wrappers - so the text is never completely in memory, only read as needed, repeatedly, for multiple training iterations.
For more details on this efficient way to work with large bodies of text, see this artice: https://rare-technologies.com/data-streaming-in-python-generators-iterators-iterables/

After taking gojomo's advice I simplified my approach at reading the csv file and writing to a text file.
My initial approach using pandas had yielded some pretty bad processing times for a file with around 12 million rows, and memory issues due to how pandas reads data all into memory before writing it out to a file.
What I also realized was that I had a major flaw in my previous code.
I was printing some output (as a sanity check), and because I printed output too often, I overflowed Jupyter and crashed the notebook, not allowing the underlying and most important task to complete.
I got rid of that, simplified reading with the csv module and writing into a txt file, and I processed the reddit database of ~12 million rows in less than 10 seconds.
Maybe not the finest piece of code, but I was scrambling to solve an issue that stood as a roadblock for me for a couple of days (and not realizing that part of my problem was my sanity checks crashing Jupyter was an even bigger frustration).
def generate_corpus_txt(csv_filepath, output_filepath):
import csv
import time
start_time = time.time()
with open(csv_filepath, encoding = 'utf-8') as csvfile:
datareader = csv.reader(csvfile)
count = 0
header = next(csvfile)
print(time.asctime(time.localtime()), " ---- Beginning Processing")
with open(output_filepath, 'w+') as output:
# Check file as empty
if header != None:
for row in datareader:
# Iterate over each row after the header in the csv
# row variable is a list that represents a row in csv
processed_row = str(' '.join(row)) + '\n'
output.write(processed_row)
count += 1
if count == 1000000:
print(time.asctime(time.localtime()), " ---- Processed 1,000,000 Rows of data.")
count = 0
print('Processing took:', int((time.time()-start_time)/60), ' minutes')
output.close()
csvfile.close()

Set up data pipeline for video processing

I am newbie to Tensorflow so I would appreciate any constructive help.
I am trying to build a feature extraction and data pipeline with Tensorflow for video processing where multiple folders holding video files with multiple classes (JHMDB database), but kind of stuck.
I have the feature extracted to one folder, at the moment to separate *.npz compressed arrays, in the filename I have stored the class name as well.
First Attempt
First I thought I would use this code from the TF tutorials site, simply reading files from folder method:
jhmdb_path = Path('...')
# Process files in folder
list_ds = tf.data.Dataset.list_files(str(jhmdb_path/'*.npz'))
for f in list_ds.take(5):
print(f.numpy())
def process_path(file_path):
labels = tf.strings.split(file_path, '_')[-1]
features = np.load(file_path)
return features, labels
labeled_ds = list_ds.map(process_path)
for a, b in labeled_ds.take(5):
print(a, b)
TypeError: expected str, bytes or os.PathLike object, not Tensor
..but this not working.
Second Attempt
Then I thought ok I will use generators:
# using generator
jhmdb_path = Path('...')
def generator():
for item in jhmdb_path.glob("*.npz"):
features = np.load(item)
print(item.files)
print(f['PAFs'].shape)
features = features['PAFs']
yield features
dataset = tf.data.Dataset.from_generator(generator, (tf.uint8))
iter(next(dataset))
TypeError: 'FlatMapDataset' object is not an iterator...not working.
In the first case, somehow the path is a byte type, and I could not change it to str to be able to load it with np.load(). (If I point a file directly on np.load(direct_path), then strange, but it works.)
At second case... I am not sure what is wrong.
I looked for hours to find a solution how to build an iterable dataset from list of relatively big and large numbers of 'npz' or 'npy' files, but seems to be this is not mentioned anywhere (or just way too trivial maybe).
Also, as I could not test the model so far, I am not sure if this is the good way to go. I.e. to feed the model with hundreds of files in this way, or just build one huge 3.5 GB npz (that would sill fit in memory) and use that instead. Or use TFrecords, but that looks more complicated, than the usual examples.
What is really annoying here, that TF tutorials and in general all are about how to load a ready dataset directly, or how to load np array(s) or how to load, image, text, dataframe objects, but no way to find any real examples how to process big chunks of data files, e.g. the case of extracting features from audio or video files.
So any suggestions or solutions would be highly appreciated and I would be really, really grateful to have something finally working! :)

How to load index shards by gensim.similarities.Similarity？

I'm working on something using gensim.
In gensim, var index usually means an object of gensim.similarities.<cls>.
At first, I use gensim.similarities.Similarity(filepath, ...) to save index as a file, and then loads it by gensim.similarities.Similarity.load(filepath + '.0'). Because gensim.similarities.Similarity default save index to shards file like index.0.
When index file becoming larger, it automatically seperate into more shards, like index.0,index.1,index.2......
How can I load these shards file? gensim.similarities.Similarity.load() can only load one file.
BTW: I have try to find the answer in gensim's doc, but failed.

from gensim.corpora.textcorpus import TextCorpus
from gensim.test.utils import datapath, get_tmpfile
from gensim.similarities import Similarity
temp_fname = get_tmpfile("index")
output_fname = get_tmpfile("saved_index")
corpus = TextCorpus(datapath('testcorpus.txt'))
index = Similarity(output_fname, corpus, num_features=400)
index.save(output_fname)
loaded_index = index.load(output_fname)
https://radimrehurek.com/gensim/similarities/docsim.html

shoresh's answer is correct. The key part that OP was missing was
index.save(output_fname)
While just creating the object appears to save it, it's really only saving the shards, which require saving a sort of directory file (via index.save(output_fname) to be made accessible as a whole object.

TensorFlow - tf.data.Dataset reading large HDF5 files

I am setting up a TensorFlow pipeline for reading large HDF5 files as input for my deep learning models. Each HDF5 file contains 100 videos of variable size length stored as a collection of compressed JPG images (to make size on disk manageable). Using tf.data.Dataset and a map to tf.py_func, reading examples from the HDF5 file using custom Python logic is quite easy. For example:
def read_examples_hdf5(filename, label):
with h5py.File(filename, 'r') as hf:
# read frames from HDF5 and decode them from JPG
return frames, label
filenames = glob.glob(os.path.join(hdf5_data_path, "*.h5"))
labels = [0]*len(filenames) # ... can we do this more elegantly?
dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
dataset = dataset.map(
lambda filename, label: tuple(tf.py_func(
read_examples_hdf5, [filename, label], [tf.uint8, tf.int64]))
)
dataset = dataset.shuffle(1000 + 3 * BATCH_SIZE)
dataset = dataset.batch(BATCH_SIZE)
iterator = dataset.make_one_shot_iterator()
next_batch = iterator.get_next()
This example works, however the problem is that it seems like tf.py_func can only handle one example at a time. As my HDF5 container stores 100 examples, this limitation causes significant overhead as the files constantly need to be opened, read, closed and reopened. It would be much more efficient to read all the 100 video examples into the dataset object and then move on with the next HDF5 file (preferably in multiple threads, each thread dealing with it's own collection of HDF5 files).
So, what I would like is a number of threads running in the background, reading video frames from the HDF5 files, decode them from JPG and then feed them into the dataset object. Prior to the introduction of the tf.data.Dataset pipeline, this was quite easy using the RandomShuffleQueue and enqueue_many ops, but it seems like there is currently no elegant way of doing this (or the documentation is lacking).
Does anyone know what would be the best way of achieving my goal? I have also looked into (and implemented) the pipeline using tfrecord files, but taking a random sample of video frames stored in a tfrecord file seems quite impossible (see here). Additionally, I have looked at the from_generator() inputs for tf.data.Dataset but that is definitely not going to run in multiple threads it seems. Any suggestions are more than welcome.

I stumbled across this question while dealing with a similar issue. I came up with a solution based on using a Python generator, together with the TF dataset construction method from_generator. Because we use a generator, the HDF5 file should be opened for reading only once and kept open as long as there are entries to read. So it will not be opened, read, and then closed for every single call to get the next data element.
Generator definition
To allow the user to pass in the HDF5 filename as an argument, I generated a class that has a __call__ method since from_generator specifies that the generator has to be callable. This is the generator:
import h5py
import tensorflow as tf
class generator:
def __init__(self, file):
self.file = file
def __call__(self):
with h5py.File(self.file, 'r') as hf:
for im in hf["train_img"]:
yield im
By using a generator, the code should pick up from where it left off at each call from the last time it returned a result, instead of running everything from the beginning again. In this case it is on the next iteration of the inner for loop. So this should skip opening the file again for reading, keeping it open as long as there is data to yield. For more on generators, see this excellent Q&A.
Of course, you will have to replace anything inside the with block to match how your dataset is constructed and what outputs you want to obtain.
Usage example
ds = tf.data.Dataset.from_generator(
generator(hdf5_path),
tf.uint8,
tf.TensorShape([427,561,3]))
value = ds.make_one_shot_iterator().get_next()
# Example on how to read elements
while True:
try:
data = sess.run(value)
print(data.shape)
except tf.errors.OutOfRangeError:
print('done.')
break
Again, in my case I had stored uint8 images of height 427, width 561, and 3 color channels in my dataset, so you will need to modify these in the above call to match your use case.
Handling multiple files
I have a proposed solution for handling multiple HDF5 files. The basic idea is to construct a Dataset from the filenames as usual, and then use the interleave method to process many input files concurrently, getting samples from each of them to form a batch, for example.
The idea is as follows:
ds = tf.data.Dataset.from_tensor_slices(filenames)
# You might want to shuffle() the filenames here depending on the application
ds = ds.interleave(lambda filename: tf.data.Dataset.from_generator(
generator(filename),
tf.uint8,
tf.TensorShape([427,561,3])),
cycle_length, block_length)
What this does is open cycle_length files concurrently, and produce block_length items from each before moving to the next file - see interleave documentation for details. You can set the values here to match what is appropriate for your application: e.g., do you need to process one file at a time or several concurrently, do you only want to have a single sample at a time from each file, and so on.
Edit: for a parallel version, take a look at tf.contrib.data.parallel_interleave!
Possible caveats
Be aware of the peculiarities of using from_generator if you decide to go with the solution. For Tensorflow 1.6.0, the documentation of from_generator mentions these two notes.
It may be challenging to apply this across different environments or with distributed training:
NOTE: The current implementation of Dataset.from_generator() uses
tf.py_func and inherits the same constraints. In particular, it
requires the Dataset- and Iterator-related operations to be placed on
a device in the same process as the Python program that called
Dataset.from_generator(). The body of generator will not be serialized
in a GraphDef, and you should not use this method if you need to
serialize your model and restore it in a different environment.
Be careful if the generator depends on external state:
NOTE: If generator depends on mutable global variables or other
external state, be aware that the runtime may invoke generator
multiple times (in order to support repeating the Dataset) and at any
time between the call to Dataset.from_generator() and the production
of the first element from the generator. Mutating global variables or
external state can cause undefined behavior, and we recommend that you
explicitly cache any external state in generator before calling
Dataset.from_generator().

I took me a while to figure this out, so I thought I should record this here. Based on mikkola's answer, this is how to handle multiple files:
import h5py
import tensorflow as tf
class generator:
def __call__(self, file):
with h5py.File(file, 'r') as hf:
for im in hf["train_img"]:
yield im
ds = tf.data.Dataset.from_tensor_slices(filenames)
ds = ds.interleave(lambda filename: tf.data.Dataset.from_generator(
generator(),
tf.uint8,
tf.TensorShape([427,561,3]),
args=(filename,)),
cycle_length, block_length)
The key is you can't pass filename directly to generator, since it's a Tensor. You have to pass it through args, which tensorflow evaluates and converts it to a regular python variable.

Adding Relationships to Existing Nodes

I have a very large (~24 million lines) edge list that I'm trying to import into a Neo4j graph that is populated with nodes. The CSV file has three columns: from, to, and the period (relationship property). I've tried this using the REST API using the following (Python) code:
batch_queue.append({"method":"POST","to":'index/node/people?uniqueness=get_or_create','id':1,'body':{'key':'name','value':row[0]}})
batch_queue.append({"method":"POST","to":'index/node/people?uniqueness=get_or_create','id':2,'body':{'key':'name','value':row[1]}})
batch_queue.append({"method":"POST","to":'{1}/relationships','body':{'to':"{2}","type":"FP%s" % row[2]}})
Where the third line failed, and then also using the Cypher statement:
USING PERIODIC COMMIT
LOAD CSV FROM "file:///file-name.csv" AS line
MATCH (a:Person {name: line[0]}),(b:Person {name:line[1]})
CREATE (a)-[:FOLLOWS {period: line[2]}]->(b)
Which worked in small scale but gave me an "Unknown Error" when using the whole list (also with smaller periodic commit values).
Any guidance as to what I'm doing incorrectly would be appreciated.

You might want to look into my batch-importer for that: http://github.com/jexp/batch-import
Otherwise for LOAD CSV, see my blog post here: http://jexp.de/blog/2014/06/load-csv-into-neo4j-quickly-and-successfully/
use the neo4j-shell for LOAD CSV
Depending on your memory available, you might have to split the data a bit. By moving a window over the file (e.g. 1M rows at once below). Do you have indexes / constraints created for :Person(name) ?
USING PERIODIC COMMIT
LOAD CSV FROM "file:///file-name.csv" AS line
WITH line
SKIP 2000000 LIMIT 1000000
MATCH (a:Person {name: line[0]}),(b:Person {name:line[1]})
CREATE (a)-[:FOLLOWS {period: line[2]}]->(b)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

saving and loading multiple shards made by gensim similarity model - python

Related

Training Word2Vec Model from sourced data - Issue Tokenizing data

Set up data pipeline for video processing

How to load index shards by gensim.similarities.Similarity？

TensorFlow - tf.data.Dataset reading large HDF5 files

Adding Relationships to Existing Nodes

Categories

Resources