Incomplete feed dictionary for graph consisting of multiple separate parts? - python

I read somewhere around here that running multiple Tensorflow graphs in a single process is considered bad practice. Therefore, I now have a single graph which consists of multiple separate "sub-graphs" of the same structure. Their purpose is to generate specific models that describe production tolerances of multiple sensors of the same type. The tolerances are different for each sensor.
I'm trying to use TF to optimize a loss function in order to come up with a numerical description (i.e. a tensor) of that production tolerance for each sensor separately.
In order to achieve that and avoid having to deal with multiple graphs (i.e. avoid bad practice), I built a graph that contains a distinct sub-graph for each sensor.
The problem is that I only get data from a single sensor at a time. So, I cannot build a feed_dict that has all placeholders for all sub-graphs filled with numbers (all zeros wouldn't make sense).
TF now complains about missing values for certain placeholders, namely those of the other sensors that I don't have yet. So basically I would like to calculate a sub-graph without feeding the other sub-graphs.
Is that at all possible and, if yes, what will I have to do in order to hand an incomplete feed_dict to the graph?
If it's not possible to train only parts of a graph, even if they have no connection to other parts, what's the royal road to create models with the same structure but different weights that can get trained separately but don't use multiple graphs?

Related

How can I ensure that all users and all items appear in the training set of my recommender system?

I am building a recommender system in Python using the MovieLens dataset (https://grouplens.org/datasets/movielens/latest/). In order for my system to work correctly, I need all the users and all the items to appear in the training set. However, I have not found a way to do that yet. I tried using sklearn.model_selection.train_test_split on the partition of the dataset relevant to each user and then concatenated the results, thus succeeding in creating training and test datasets that contain at least one rating given by each user. What I need now is to find a way to create training and test datasets that also contain at least one rating for each movie.
This requirement is quite reasonable, but is not supported by the data ingestion routines for any framework I know. Most training paradigms presume that your data set is populated sufficiently that there is a negligible chance of missing any one input or output.
Since you need to guarantee this, you need to switch to an algorithmic solution, rather than a probabilistic one. I suggest that you tag each observation with the input and output, and then apply the "set coverage problem" to the data set.
You can continue with as many distinct covering sets as needed to populate your training set (which I recommend). Alternately, you can set a lower threshold of requirement -- say get three sets of total coverage -- and then revert to random methods for the remainder.

Multidimensional LSTM

I'm having a problem to make at least one functional machine learning model, the examples I found all over the network are either off topic or good but incomplete (missing dataset, explanations...).
The closest example related to my problem is this.
I'm trying to create a model based on accelerometer & gyroscope sensor, each one has its own 3 axis, for example if I lift the sensor parallel to the gravity then return it back to his initial position, then I should have a table like this.
Example
Now this whole table correspond to one movement which I call it "Fade_away", and the duration for this same movement is variable.
I have only two main questions:
In which format I need to save my dataset, because I don't think an array could arrange this kind of data?
How can I implement a simple model at least with one hidden layer?
To make it easier, let's say that I have 3 outputs, "Fade_away", "Punch" and "Rainbow".

Is there a way to cluster transactions (journals) data using python if a transaction is represented by two or more rows?

In accounting the data set representing the transactions is called a 'general ledger' and takes following form:
Note that a 'journal' i.e. a transaction consists of two line items. E.g. transaction (Journal Number) 1 has two lines. The receipt of cash and the income. Companies could also have transactions (journals) which can consist of 3 line items or even more.
Will I first need to cleanse the data to only have one line item for each journal? I.e. cleanse the above 8 rows into 4.
Are there any python machine learning algorithms which will allow me to cluster the above data without further manipulation?
The aim of this is to detect anomalies in transactions data. I do not know what anomalies look like so this would need to be unsupervised learning.
Use gaussians on each dimension of the data to determine what is an anomaly. Mean and variance are backed out per dimension, and if the value of a new datapoint on that dimension is below a threshold, it is considered an outlier. This creates one gaussian per dimension. You can use some feature engineering here, rather than just fit gaussians on the raw data.
If features don't look gaussian (plot their histogram), use data transformations like log(x) or sqrt(x) to change them until they look better.
Use anomaly detection if supervised learning is not available, or if you want to find new, previously unseen kind of anomalies (such as the failure of a power plant, or someone acting suspiciously rather than whether someone is male/female)
Error analysis: However, what if p(x), the probability the an example is not an anomaly, is large for all examples? Add another dimension, and hope it helps to show the anomaly. You could create this dimension by combining some of the others.
To fit the gaussian a bit more to the shape of your data, you can make it multivariate. It then takes a matrix mean and variance, and you can vary parameters to change its shape. It will also show feature correlations, if your features are not all independent.
https://stats.stackexchange.com/questions/368618/multivariate-gaussian-distribution

Should I drop a variable that has the same value in the whole column for building machine learning models?

For instance, column x has 50 values and all of these values are the same.
Is it a good idea to delete variables like these for building machine learning models? If so, how can I spot these variables in a large data set?
I guess a formula/function might be required to do so. I am thinking of using nunique that can take account of the whole dataset.
You should be deleting such columns because it will provide no extra information about how each data point is different from another. It's fine to leave the column for some machine learning models (due to the nature of how the algorithms work), like random forest, because this column will actually not be selected to split the data.
To spot those, especially for categorical or nominal variables (with fixed number of possible values), you can count the occurrence of each unique value, and if the mode is larger than a certain threshold (say 95%), then you delete that column from your model.
I personally will go through variables one by one if there aren't any so that I can fully understand each variable in the model, but the above systematic way is possible if the feature size is too large.

Database vs flat files in python (need speed but can't fit in memory) to be used with generator for NN training

I am dealing with a relatively large dataset (>400 GB) for analytics purposes but have somewhat limited memory (256 GB). I am using python. So far I have been using pandas on a subset of the data but it is becoming obvious that I need a solution that allows me to access data from the entire dataset.
A little bit about the data. Right now the data is segregated over a set of flat files that are pandas dataframes. The files consist of column that have 2 keys. The primary key, let's call it "record", which I want to be able to use to access the data, and a secondary key, which is basically row number within the primary key. As in I want to access row 2 in record "A".
The dataset is used for training a NN (keras/tf). So the task is to partition the entire set into train/dev/test by record, and then pass the data to train/predict generators (I implement keras.utils.Sequence(), which I have to do because the data is variable length sequences that need to be padded for batch learning).
Given my desire to pass examples to the NN as fast as possible and my inability to store all of the examples in memory, should I use a database (mongodb or sqlite or something else?) and query examples as needed, or should I continue to store things in flat files and load them/delete them (and hope that python garbage collector works)?
Another complication is that there are about 3mil "records". Right now the pandas dataframes store them in batches of ~10k, but it would benefit me to split the training/test/validation randomly, which means I really need to be able to access some but not all of the records in a particular batch. In pandas this seems hard (as in as far as I know I need to read the entire flat file to then access the particular record since I don't know in which chunk of the file the data is located), on the other hand I don't think generating 3mil individual files is smart either.
A further complication is that the model is relatively simple and I am unable due to various bottlenecks to saturate my compute power during training, so if I could stream the training to several different models that would help with hyperparameter search, since otherwise I am wasting cycles.
What do you think is the correct (fast, simple) back-end to handle my data needs?
Best,
Ilya
This is a good use case for writing a custom generator, then using Keras' model.fit_generator. Here's something I wrote the other day in conjunction with Pandas.
Note that I first split my main dataframe into training and validation splits (merged was my original dataframe), but you may have to move things around on disk and specify them when selecting in the generator
Lots of the reshaping and lookup/loading is all custom to my problem, but you see the pattern.
msk = np.random.rand(len(merged)) < 0.8
train = merged[msk]
valid = merged[~msk]
def train_generator(batch_size):
sample_rows = train[train['match_id'].isin(npf.id.values)].sample(n=batch_size)
sample_file_ids = sample_rows.FILE_NAME.tolist()
sample_data = [np.load('/Users/jeff/spectro/' + x.split(".")[0] + ".npy").T for x in sample_file_ids]
sample_data = [x.reshape(x.shape[0], x.shape[1]) for x in sample_data]
sample_data = np.asarray([x[np.random.choice(x.shape[0], 128, replace=False)] for x in sample_data])
sample_labels = np.asarray([labels.get(x) for x in sample_file_ids])
while True:
yield (sample_data, sample_labels)
It essentially returns batch_size samples whenever you call it. Keras requires your generator to return a tuple of length 2, where the first element is your data in the expected shape (whatever your neural network input shape is) and the labels to also map to the expected shape (N_classes, or whatever).
Here's another relatively useful link regarding generator, which may help you determine when you've truly exhausted all examples. My generator just randomly samples, but the dataset is sufficiently large that I don't care.
https://github.com/keras-team/keras/issues/7729#issuecomment-324627132
Don't forget to write a validation_generator as well, which is reading from some set of files or dataframes which you randomly put in some other place, for validation purposes.
Lastly, here's calling the generator:
model.fit_generator(train_generator(32),
samples_per_epoch=10000, nb_epoch=20,
validation_data=valid_generator(32), validation_steps=500)
depending on the keras version, you may find arg names have changed slightly, but a few searches should get you fixed up quickly.

Categories

Resources