Disclaimer: I've predominantly used PyTorch in the past, so my reasoning follows how things are done in PyTorch.
I have a large MySQL database that I want to load as a dataset. It is not feasible to keep the dataset in memory at all times, so it needs to be loaded lazily/on demand. My plan is to instantiate a Dataset object from a range of row ids and then retrieve the corresponding rows, much like you would store file names/paths for large files such as images and load them on demand. The issue with this approach is that each worker thread retrieves only one row at a time, so I have to issue a separate SELECT query for each row. I found that writing a batch of ids into a table and issuing a JOIN against it, as if it were a foreign key, is orders of magnitude faster.
My first thought was to apply a map operation over each batch, which would require me to call such a function after obtaining the batch from the dataset. In PyTorch, I would define all this behaviour in a class that inherits from its Dataset class, which I think is cleaner and encapsulates the behaviour. Is there any way to do this (neatly) within TensorFlow?
Bonus points if someone can conjure up a method that is perfectly encapsulated from the user (the user does not know how the dataset is internally stored and tracked), yet conforms to the TensorFlow API (i.e. a callable class to be used as a generator for tf.data.Dataset.from_generator()).
Edit: In PyTorch, a common implementation is as follows (which I consider to be "neat" and is encapsulated).
class MyDataset(torch.utils.data.Dataset):
    def __init__(self, row_ids):
        # Store row ids and do any pre-processing if necessary.
        self.row_ids = row_ids

    def __getitem__(self, item):
        # From the item (may be several), join all corresponding
        # database rows and apply post-processing.
        ...
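For concreteness, here is the rough shape of what I'm imagining on the TensorFlow side. This is only a sketch: the table/column names, the output signature, and the connection details are made up, and a plain WHERE id IN (...) stands in for the temp-table JOIN described above.

import numpy as np
import pymysql
import tensorflow as tf


class MySQLBatchGenerator:
    """Callable that yields whole batches, one SELECT per batch."""

    def __init__(self, row_ids, batch_size):
        self.row_ids = list(row_ids)
        self.batch_size = batch_size

    def _fetch_batch(self, ids):
        # Reconnect per batch for simplicity; a pooled connection would be better.
        conn = pymysql.connect(host="localhost", user="user", password="pw", database="mydb")
        placeholders = ",".join(["%s"] * len(ids))
        cur = conn.cursor()
        cur.execute(
            f"SELECT feature_1, feature_2, label FROM samples WHERE id IN ({placeholders})",
            ids,
        )
        rows = cur.fetchall()
        conn.close()
        return np.asarray(rows, dtype=np.float32)

    def __call__(self):
        for i in range(0, len(self.row_ids), self.batch_size):
            yield self._fetch_batch(self.row_ids[i:i + self.batch_size])


gen = MySQLBatchGenerator(row_ids=range(1, 100_001), batch_size=256)
dataset = tf.data.Dataset.from_generator(
    gen,
    output_signature=tf.TensorSpec(shape=(None, 3), dtype=tf.float32),
)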
I have trained a gensim doc2vec model for an English news recommender system. The model was trained with 40K news articles. I am using the code below to recommend the top 5 most similar articles for, e.g., news_1:
inferred_vector = model.infer_vector(news_1)
sims = model.dv.most_similar([inferred_vector], topn=5)
The problem is that if I add another 100 news articles to the database (so it now holds 40K + 100 articles) and re-run the same code, the code will only recommend news from the original 40K (instead of 40K + 100). In other words, the recommended articles will never come from the 100 new articles.
How can I address this issue without retraining the model? Thank you in advance!
PS: Since our app is for news, lots of new articles come into our database every day, so retraining the model daily isn't an option (doing so may crash our backend server).
There's a bulk contiguous vector structure initially created by training, for the initial known set of vectors. It's amenable to the every-candidate bulk vector calculation at the heart of most_similar() - so that operation goes about as fast as it can, with the right vector libraries for your OS/processor.
But, that structure wasn't originally designed with incremental expansion in mind. Indeed, if you have 1 million vectors in a dense array and then want to add 1 to the end, the straightforward approach requires you to allocate a new 1-million-and-1-long array, bulk-copy over the 1 million, then add the last 1. That works, but what seems like a "tiny" operation takes a while, and ever longer as the structure grows. And each add more than doubles the temporary memory usage, for the bulk copy. So the naive pattern of adding a whole bunch of new items individually in a loop can be really slow & memory-intensive.
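To make the cost concrete, here's a toy illustration with made-up sizes (plain NumPy, not Gensim internals):

import numpy as np

# 1 million existing 300-d vectors in one dense, contiguous array
vectors = np.random.rand(1_000_000, 300).astype(np.float32)

new_vector = np.random.rand(1, 300).astype(np.float32)

# Appending one row forces a brand-new (1_000_001, 300) array to be allocated
# and every existing row to be bulk-copied, so both copies briefly coexist in
# memory; doing this per-item in a loop is O(n) work for every single append.
vectors = np.vstack([vectors, new_vector])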
So, Gensim hasn't yet focused on providing a set-of-vectors that's easy & efficient to incrementally grow with new vectors. But, it's still indirectly possible, if you understand the caveats.
Especially in gensim-4.0.0 & above, the .dv set of doc-vectors is an instance of KeyedVectors with all that class's standard functions. Those include the add_vector() and add_vectors() methods:
https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.KeyedVectors.add_vector
https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.KeyedVectors.add_vectors
You can try these methods to add your new inferred vectors to the model.dv object - and then they'll also be included in follow-up most_similar() results.
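For example, roughly (a sketch assuming gensim 4.x; the model path, the tags, and the new_articles dict are placeholders for however you store your new items):

from gensim.models import Doc2Vec

model = Doc2Vec.load("doc2vec_40k.model")  # placeholder path

# new_articles maps a new tag to its preprocessed token list,
# e.g. {"news_40001": ["some", "tokens", ...], ...}
for tag, tokens in new_articles.items():
    model.dv.add_vector(tag, model.infer_vector(tokens))

# The new articles are now candidates in most_similar() results
inferred_vector = model.infer_vector(news_1)
sims = model.dv.most_similar([inferred_vector], topn=5)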
But keep in mind:
The above caveats about performance & memory-usage - which may be minor concerns as long as your dataset isn't too large, or manageable if you do additions in occasional larger batches.
The containing Doc2Vec model generally isn't expecting its internal .dv to be arbitrarily modified or expanded by other code. So, once you start doing that, parts of the model may not behave as expected. If you have problems with this, you could consider saving-aside the full Doc2Vec model before any direct-tampering with its .dv, and/or only expanding a completely separate instance of the doc-vectors, for example by saving them aside (eg: model.dv.save(DOC_VECS_FILENAME)) & reloading them into a separate KeyedVectors (eg: growing_docvecs = KeyedVectors.load(DOC_VECS_FILENAME)).
I'm currently using AzureML with pretty complex workflows involving large datasets, and I'm wondering what the best way is to manage the splits resulting from preprocessing steps. All my projects are built as pipelines fed by registered Datasets. I want to be able to track the splits in order to easily retrieve, for example, test and validation sets for integration testing purposes.
What would be the best pattern to apply here? Registering every intermediate set as a different Dataset? Directly retrieving the intermediate sets using the Run IDs? ...
Thanks!
I wish I had a more coherent answer; the upside is that you're at the bleeding edge, so should you find a pattern that works for you, you can evangelize it and make it best practice! Hopefully you find my rantings below valuable.
First off -- if you aren't already, you should definitely use PipelineData as the intermediate artifact for passing data between PipelineSteps. This way you can treat the PipelineData as semi-ephemeral: they are materialized should you need them, but it isn't a requirement to hold on to every single version of every PipelineData. You can always grab them using Azure Storage Explorer, or, like you said, using the SDK and walking down from a PipelineRun object.
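For illustration, wiring two steps together with PipelineData looks roughly like this (a sketch only; the workspace config, compute targets, script names, and directories are placeholders):

from azureml.core import Workspace
from azureml.pipeline.core import Pipeline, PipelineData
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()
datastore = ws.get_default_datastore()

# Intermediate artifact passed between the two steps
featurized = PipelineData("featurized", datastore=datastore)

featurize_step = PythonScriptStep(
    name="featurize",
    script_name="featurize.py",
    arguments=["--output", featurized],
    outputs=[featurized],
    compute_target="cpu-cluster",        # placeholder
    source_directory="./featurization",  # placeholder
)

train_step = PythonScriptStep(
    name="train",
    script_name="train.py",
    arguments=["--input", featurized],
    inputs=[featurized],
    compute_target="gpu-cluster",        # placeholder
    source_directory="./training",       # placeholder
)

pipeline = Pipeline(workspace=ws, steps=[featurize_step, train_step])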
Another recommendation is to split your workflow into the following pipelines:
featurization pipeline (all joining, munging, and featurizing)
training pipeline
scoring pipeline (if you have a batch score scenario).
The intra-pipeline artifacts are PipelineData, and the inter-pipeline artifacts would be registered Datasets.
To actually get to your question of associating data splits with models: our team struggled with this -- especially because for each train/test split we also have an "extra cols" set, which contains either identifiers or leaking variables that the model shouldn't see.
In our current hack implementation, we register our "gold" dataset as an Azure ML Dataset at the end of the featurization pipeline. The first step of our training pipeline is a PythonScriptStep, "Split Data", which contains our train/test split logic and outputs a pickled dictionary as data.pkl. Then we can unpickle it anytime we need one of the splits and join back using the index for any reporting needs. Here's a gist.
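The gist has the details, but the shape of that split step boils down to something like this (a simplified sketch, not the gist itself; the paths, column handling, and split ratios are illustrative):

import pickle

import pandas as pd
from sklearn.model_selection import train_test_split

# The registered "gold" dataset, mounted/downloaded by the step (path illustrative)
df = pd.read_parquet("gold_dataset.parquet")

train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
train_df, valid_df = train_test_split(train_df, test_size=0.25, random_state=42)

# Keep only the indices so we can join back to the gold dataset (and its "extra cols") later
splits = {"train": train_df.index, "valid": valid_df.index, "test": test_df.index}

with open("data.pkl", "wb") as f:
    pickle.dump(splits, f)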
Registration is to make sharing and reuse easier so that you can retrieve the dataset by its name. If you do expect to reuse the test/validation sets in other experiments, then registering them makes sense. However, if you are just trying to keep records of what you used for this particular experiment, you can always find that info via the Run, as you suggested.
Thanks for hearing me out.
I have a dataset that is a matrix of shape 75000x10000 filled with float values. Think of it like a heatmap/correlation matrix. I want to store this in a SQLite database (SQLite because I am modifying an existing Django project). The source data file is 8 GB in size and I am trying to use Python to carry out my task.
I have tried using pandas chunking to read the file into Python, transform it into unstacked pairwise-indexed data, and write it out to a JSON file. But this method is computationally very expensive: for a chunk of size 100x10000 it generates a 200 MB JSON file.
This JSON file will be used as a fixture to populate the SQLite database in the Django backend.
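For reference, my current chunked conversion looks roughly like this (the file names and chunk size are illustrative; the real version then reshapes these records into the Django fixture format):

import pandas as pd

# Stream the 75000 x 10000 matrix in chunks of 100 rows
chunks = pd.read_csv("matrix.csv", chunksize=100)

for i, chunk in enumerate(chunks):
    # Unstack each chunk into long-form (row_id, col_id, value) records
    pairs = chunk.stack().reset_index()
    pairs.columns = ["row_id", "col_id", "value"]
    pairs.to_json(f"fixture_{i}.json", orient="records")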
Is there a better (faster/smarter) way to do this? I don't think writing out roughly 90 GB of JSON over a full day is the way to go, and I'm not even sure the Django database can take this load.
Any help is appreciated!
SQLite is quite impressive for what it is, but it's probably not going to give you the performance you are looking for at that scale, so even though your existing project is Django on SQLite I would recommend simply writing a Python wrapper for a different data backend and just using that from within Django.
More importantly, forget about using Django models for something like this; they are an abstraction layer built for convenience (mapping database records to Python objects), not for performance. Django would very quickly choke trying to build 100s of millions of objects since it doesn't understand what you're trying to achieve.
Instead, you'll want to use a database type / engine that's suited to the type of queries you want to make:
If a typical query consists of a hundred point lookups to get the data in particular 'cells', a key-value store might be ideal.
If you're typically pulling ranges of values from individual 'rows' or 'columns', then that's something to optimize for.
If your queries typically involve taking sub-matrices and performing predictable operations on them, then you might improve the performance significantly by precalculating certain cumulative values.
And if you want to use the full dataset to train machine learning models, you're probably better off not using a database for your primary storage at all (since databases by nature sacrifice fast retrieval of the full raw data for fast calculations on interesting subsets), especially if your ML models can be parallelised using something like Spark.
No DB will handle everything well, so it would be useful if you could elaborate on the workload you'll be running on top of that data -- the kind of questions you want to ask of it?
For a field that can be computed, like full_name from the first and last names, we can use @property to compute full_name. But when we need to get a list of all 'n' persons with their full names, full_name will be computed 'n' times, which should take more time than just reading the field from the database (if it were already stored as a separate field!).
So is there any processing-time / DB-fetching-time advantage or disadvantage to using @property to compute full_name?
(Note: I have considered the other advantages of @property, like a smaller database, not having to worry about the first or last name changing without full_name changing, a setter to set first and last name, etc. I just want to know the processing/DB-fetching-time advantage/disadvantage over saving full_name in the database.)
The technique you're talking about is called denormalization. It is quite an advanced technique.
Denormalization is a strategy used on a previously-normalized database to increase performance. In computing, denormalization is the process of trying to improve the read performance of a database, at the expense of losing some write performance, by adding redundant copies of data or by grouping data.
It's the opposite of database normalization, and you should always start your application with a normalized database.
If you don't have any serious performance problems, I'd advise against it. If you do, try other solutions first to improve your app's speed.
For reference, the First Normal Form (1NF) requires that a table only have single (atomic) valued attributes/columns.
A very basic example of the disadvantage: an UPDATE statement. Whenever the first or last name changes, you need to touch both of those columns plus recalculate and write full_name as well.
Anyway, your full_name example is so simple that you should definitely do this with @property.
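For example, a minimal sketch (the model and field names are illustrative):

from django.db import models

class Person(models.Model):
    first_name = models.CharField(max_length=100)
    last_name = models.CharField(max_length=100)

    @property
    def full_name(self):
        # Computed in Python on access; nothing extra is stored in or fetched from the DB
        return f"{self.first_name} {self.last_name}"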
More on this topic:
Difference Between Normalization and Denormalization
I am dealing with a relatively large dataset (>400 GB) for analytics purposes but have somewhat limited memory (256 GB). I am using Python. So far I have been using pandas on a subset of the data, but it is becoming obvious that I need a solution that lets me access data from the entire dataset.
A little bit about the data: right now it is spread over a set of flat files that are pandas dataframes. The files consist of columns indexed by 2 keys: the primary key, let's call it "record", which I want to use to access the data, and a secondary key, which is basically the row number within the primary key. As in, I want to access row 2 in record "A".
The dataset is used for training a NN (keras/tf). So the task is to partition the entire set into train/dev/test by record, and then pass the data to train/predict generators (I implement keras.utils.Sequence(), which I have to do because the data is variable length sequences that need to be padded for batch learning).
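For context, my Sequence implementation looks roughly like this (record loading and padding are simplified; the names are illustrative):

import numpy as np
from tensorflow import keras


class RecordSequence(keras.utils.Sequence):
    def __init__(self, record_ids, batch_size, load_record):
        self.record_ids = list(record_ids)   # primary keys ("record")
        self.batch_size = batch_size
        self.load_record = load_record       # callable: record_id -> (rows, label)

    def __len__(self):
        return int(np.ceil(len(self.record_ids) / self.batch_size))

    def __getitem__(self, idx):
        batch_ids = self.record_ids[idx * self.batch_size:(idx + 1) * self.batch_size]
        examples, labels = zip(*(self.load_record(r) for r in batch_ids))
        # Pad the variable-length sequences to the longest example in this batch
        padded = keras.preprocessing.sequence.pad_sequences(
            examples, padding="post", dtype="float32"
        )
        return padded, np.asarray(labels)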
Given my desire to pass examples to the NN as fast as possible and my inability to store all of the examples in memory, should I use a database (MongoDB, SQLite, or something else?) and query examples as needed, or should I continue to store things in flat files and load/delete them (and hope that the Python garbage collector works)?
Another complication is that there are about 3 million "records". Right now the pandas dataframes store them in batches of ~10k, but it would benefit me to split training/test/validation randomly, which means I really need to be able to access some but not all of the records in a particular batch. In pandas this seems hard (as far as I know, I need to read an entire flat file to access a particular record, since I don't know in which chunk of the file the data is located); on the other hand, I don't think generating 3 million individual files is smart either.
A further complication is that the model is relatively simple and, due to various bottlenecks, I am unable to saturate my compute power during training, so if I could stream the training data to several different models, that would help with hyperparameter search, since otherwise I am wasting cycles.
What do you think is the correct (fast, simple) back-end to handle my data needs?
Best,
Ilya
This is a good use case for writing a custom generator, then using Keras' model.fit_generator. Here's something I wrote the other day in conjunction with Pandas.
Note that I first split my main dataframe into training and validation splits (merged was my original dataframe), but you may have to move things around on disk and specify them when selecting in the generator.
Lots of the reshaping and lookup/loading is custom to my problem, but you'll see the pattern.
import numpy as np  # (the dataframe `merged`, plus `npf` and `labels`, come from my own pipeline)

msk = np.random.rand(len(merged)) < 0.8
train = merged[msk]
valid = merged[~msk]

def train_generator(batch_size):
    while True:
        # Sample batch_size rows; the npf/match_id filtering is specific to my data
        sample_rows = train[train['match_id'].isin(npf.id.values)].sample(n=batch_size)
        sample_file_ids = sample_rows.FILE_NAME.tolist()
        # Load the pre-computed spectrogram for each sampled file
        sample_data = [np.load('/Users/jeff/spectro/' + x.split(".")[0] + ".npy").T for x in sample_file_ids]
        sample_data = [x.reshape(x.shape[0], x.shape[1]) for x in sample_data]
        # Randomly keep 128 rows per example so every item in the batch has the same length
        sample_data = np.asarray([x[np.random.choice(x.shape[0], 128, replace=False)] for x in sample_data])
        sample_labels = np.asarray([labels.get(x) for x in sample_file_ids])
        yield (sample_data, sample_labels)
It essentially yields batch_size samples each time it is called. Keras requires your generator to return a tuple of length 2, where the first element is your data in the expected shape (whatever your neural network input shape is) and the second element is your labels, also in the expected shape (N_classes, or whatever).
Here's another relatively useful link regarding generators, which may help you determine when you've truly exhausted all examples. My generator just samples randomly, but the dataset is sufficiently large that I don't care.
https://github.com/keras-team/keras/issues/7729#issuecomment-324627132
Don't forget to write a valid_generator as well, which reads from some set of files or dataframes that you randomly set aside for validation purposes.
Lastly, here's calling the generator:
model.fit_generator(train_generator(32),
samples_per_epoch=10000, nb_epoch=20,
validation_data=valid_generator(32), validation_steps=500)
Depending on the Keras version, you may find the arg names have changed slightly (e.g. samples_per_epoch/nb_epoch became steps_per_epoch/epochs in Keras 2), but a few searches should get you fixed up quickly.