When it comes to normal ANNs, or any of the standard machine learning techniques, I understand what the training, testing, and validation sets should be (both conceptually, and the rule-of-thumb ratios). However, for a bidirectional LSTM (BLSTM) net, how to split the data is confusing me.
I am trying to improve prediction on individual subject data that consists of monitored health values. In the simplest case, for each subject, there is one long time series of values (>20k values), and contiguous parts of that time series are labeled from a set of categories, depending on the current health of the subject. For a BLSTM, the net is trained on all of the data going forwards and backwards simultaneously. The problem then is, how does one split a time series for one subject?
I can't just take the last 2,000 values (for example), because they might all fall into a single category.
And I can't chop the time series up randomly, because then both the learning and testing phases would be made of disjointed chunks.
Finally, each of the subjects (as far as I can tell) has slightly different (but similar) characteristics. So, maybe, since I have thousands of subjects, do I train on some, test on some, and validate on others? However, since there are inter-subject differences, how would I set up the tests if I was only considering one subject to start?
I think this has more to do with your particular dataset than Bi-LSTMs in general.
You're confusing splitting a dataset for training/testing with splitting a sequence within a particular sample. It seems like you have many different subjects, each of which constitutes a separate sample. For a standard training/testing split, you would split your dataset between subjects, as you suggested in your last paragraph.
For any sort of RNN application, you do NOT split along your temporal sequence; you input your entire sequence as a single sample to your Bi-LSTM. So the question really becomes whether such a model is well-suited to your problem, which has multiple labels at specific points in the sequence. You can use a sequence-to-sequence variant of the LSTM model to predict which label each time point in the sequence belongs to, but again you would NOT be splitting the sequence into multiple parts.
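For illustration, here is a minimal sequence-to-sequence Bi-LSTM sketch in Keras; the layer sizes, number of classes, and sequence length are assumptions for the sake of example, not part of the original question:

    import numpy as np
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Bidirectional, LSTM, TimeDistributed, Dense

    n_timesteps = 20000   # length of one subject's series (assumed)
    n_features = 1        # monitored health value per time step
    n_classes = 5         # number of health categories (assumed)

    model = Sequential([
        # return_sequences=True gives one output per time step, forwards and backwards
        Bidirectional(LSTM(64, return_sequences=True),
                      input_shape=(n_timesteps, n_features)),
        # a softmax classification at every time step
        TimeDistributed(Dense(n_classes, activation="softmax")),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

    # X: (n_subjects, n_timesteps, n_features), y: (n_subjects, n_timesteps)
    # model.fit(X_train, y_train, epochs=10, batch_size=8)

Note that each subject's whole series is one sample here; the train/test split happens across subjects, not along the time axis.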
I am working with time series datasets where I have two different cases: one where my sequences are all the same length, and one where the sequences are of different lengths. When I have same-length sequences, I can merge all the datasets and then fit the model once.
But for different-length sequences, I was wondering how differently Keras model.fit will behave
if the model is fitted with each different-length sequence one by one, with batch size = length of the sequence;
if the model is fitted once with all the sequences merged together, with a fixed batch size.
And based on the given scenario, what would be the correct or better course of action?
In the first scenario, the weights will be optimized on the first dataset and then changed (updated) for the second dataset, and so on. In the second scenario, you are simultaneously asking the model to learn patterns from all the datasets. This means the weights will be adjusted according to all the datasets at once. I would prefer the second approach, because neural networks have a tendency to forget/break when trained on new datasets sequentially (catastrophic forgetting): they are more likely to focus on the data they have seen most recently.
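As a minimal sketch of the second approach in Keras (the toy data, layer sizes, and loss below are assumptions, not a recipe), the different-length sequences can be zero-padded to a common length and masked so the padding does not influence training:

    import numpy as np
    from tensorflow.keras.preprocessing.sequence import pad_sequences
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Masking, LSTM, Dense

    # toy stand-ins: three sequences of different lengths, 2 features each
    sequences = [np.random.rand(5, 2), np.random.rand(8, 2), np.random.rand(3, 2)]
    y = np.array([0.1, 0.7, 0.3])   # one target per sequence (made up)

    # pad to a common length with zeros at the end
    padded = pad_sequences(sequences, padding="post", dtype="float32", value=0.0)

    model = Sequential([
        # Masking tells the LSTM to skip the zero-padded time steps
        Masking(mask_value=0.0, input_shape=(padded.shape[1], padded.shape[2])),
        LSTM(32),
        Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

    # one fit over all sequences together, with a fixed batch size
    model.fit(padded, y, epochs=20, batch_size=2)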
I have encountered a coding problem. In my dataset, an instance consists of several sentences (a different number in different instances). They cannot be concatenated to serve as a single sentence. How can I process this kind of data efficiently with PyTorch, or do I have to process the instances one by one?
This is a very broad question. However, I can think of two less complex solutions.
Use a dummy sentence to pad the instances, and mask the dummy sentences while learning representations for the instance (see the sketch after this list).
You can group instances with the same number of sentences to create mini-batches and avoid padding entirely. If that is not possible, at least try to group instances that are similar in their number of sentences, to minimize the amount of padding.
You can also study existing implementations of learning document representations, for example the Hierarchical Attention Networks for Document Classification paper.
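As a minimal PyTorch sketch of the first option (the sentence representations and sizes here are made up for illustration): pad every instance to the same number of sentences and keep a boolean mask so the dummy sentences can be ignored later, e.g. in masked mean pooling.

    import torch
    from torch.nn.utils.rnn import pad_sequence

    embed_dim = 128   # made-up size for illustration
    # each instance: (n_sentences_i, embed_dim) sentence representations
    instances = [torch.randn(3, embed_dim),
                 torch.randn(5, embed_dim),
                 torch.randn(2, embed_dim)]

    lengths = torch.tensor([inst.size(0) for inst in instances])
    padded = pad_sequence(instances, batch_first=True)   # (batch, max_len, embed_dim)

    # boolean mask: True where a real sentence sits, False on padding
    mask = torch.arange(padded.size(1))[None, :] < lengths[:, None]

    # example use: masked mean pooling over sentences, per instance
    pooled = (padded * mask.unsqueeze(-1)).sum(dim=1) / lengths[:, None]

The same padded tensor and mask can be fed to an RNN or attention layer instead of mean pooling, as long as the layer is told to ignore the masked positions.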
I do not understand how model.predict(...) works on a time series forecasting problem. I usually use it with a CNN, where it is pretty straightforward, but for time series I don't understand what it returns.
For example, I am currently doing an exercise where I have to forecast power consumption from past data using an LSTM. I managed to train my model, but when I want to know what the power consumption will be tomorrow (so no data except past values), I don't know what input to use.
Traditional ML algorithms, which you might be more used to, generally expect the data in a 2D structure like this:
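(column names are illustrative; one row per independent example, one column per feature)

    id   feature_1   feature_2   label
    1    0.5         12          0
    2    0.1         7           1
    3    0.9         30          0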
For sequential data, such as a stream of timed events associated with each user, it’s also possible to create a lagged 2D dataset, where the history of different features for different IDs is aligned into single rows, with this structure:
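(again with illustrative names; each row collapses one ID's recent history into lagged columns)

    id   feature_1_t   feature_1_t-1   feature_2_t   feature_2_t-1   label
    1    0.5           0.4             12            9               0
    2    0.1           0.3             7             11              1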
This can be a good way to work because, once your data is in the correct shape, you can use it with models that are fast to set up and train. However, models using features engineered this way generally don't have any capacity to "learn" anything about the natural sequence of the data. To something like a tree-based ensemble model receiving this format, feature 1 at time t and at time t-1 in the example above are treated as completely independent, and this can severely limit the model's predictive power.
There are types of deep learning architecture specifically designed for modelling sequence data, called recurrent neural nets (RNNs). Two of the most popular cells to use in these are long short-term memory (LSTM) and gated recurrent units (GRU). There's a good post here on how LSTM cells work, but the TL;DR is that they have a structure that allows them to learn from sequences of data.
Cells like the LSTM expect a 3D tensor of input data. We arrange it so that one axis holds the data features, a second axis holds the sequence steps (like time ticks), and a third axis stacks the different examples we want to predict a single "y" value for. Using the same type of dataset as the lagged example above, it would look something like this:
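(a minimal numpy illustration, using the common Keras convention of (examples, timesteps, features); the numbers are made up)

    import numpy as np

    # axis 0: examples (one "y" value per example)
    # axis 1: sequence steps (t-2, t-1, t)
    # axis 2: features
    X = np.array([
        [[0.4, 9.0],  [0.5, 12.0], [0.6, 10.0]],   # history for id 1
        [[0.3, 11.0], [0.1, 7.0],  [0.2, 8.0]],    # history for id 2
    ])
    print(X.shape)   # (2, 3, 2) -> (examples, timesteps, features)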
The ability to learn patterns in sequences of data like this is particularly beneficial for both time series and text data, which are naturally ordered.
To return to your original question: when you want to predict something in your test set, you'll need to pass the model sequences represented just like the ones it was trained on (this is a reasonably good rule of supervised learning in general). For example, if the model is trained on data like the last example above, you'll need to pass it a 2D slice (timesteps × features) for each ID you want to make a prediction for.
You should explore the way the original training data is represented and make sure you understand it well, as you'll need to create the same shape of data to make predictions. X_train.shape is a great place to start, if you have your training data in a pandas dataframe or numpy arrays, to see what the dimensionality is, and then you can inspect entries along each axis until you get a good feel for the data it contains.
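As a hedged sketch of what that looks like in practice (X_train, model, and recent_history are placeholders for your own objects, not defined names):

    import numpy as np

    print(X_train.shape)            # e.g. (5000, 24, 3): 5000 examples, 24 time steps, 3 features

    # to forecast for one new ID, build the same window from its most recent history
    x_new = recent_history          # assumed shape (24, 3), same feature order as training
    x_new = x_new[np.newaxis, ...]  # add the examples axis -> (1, 24, 3)

    y_pred = model.predict(x_new)   # one prediction per example passed in

So to forecast tomorrow's consumption, the input is simply the most recent window of past values, shaped exactly like one training example.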
I am making a convolutional network model with which I want to classify EEG data. The data comes from an experiment where participants were presented with images from 3 different classes, with 2 subclasses each. To give a brief idea of the dataset size: each subclass has ±300 epochs per participant (this applies to all the subclasses). The three classes are:
Object
Color
Number
Now my question is:
I have 5 participants in my training dataset; I took 15% of each participant's data and put it in the testing dataset. Can I consider that 15% to be unseen data, even though the same participants were used to train the model?
Any input is welcome!
It depends on what you want to test. A test set is used to estimate the generalization (i.e. performance on unseen data). So the question is:
Do you want to estimate the generalization to unseen data from the same participants (whose data was used to train the classifier)?
Or do you want to estimate the generalization to unseen participants (the general population)?
This really depends on your goal or the claim you are trying to make. I can think of situations for both approaches:
Think of BCIs which need to be retrained for every user. Here, you would test on data from the same individual.
On the other hand, if you want to make a very general claim (e.g. "I can decode some relevant signal from a certain brain region across the population"), then having a test set consisting of participants who were not included in the training set lends much stronger support to your claim. (The question is whether this works in practice, though.)
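If you go the second route, scikit-learn's GroupShuffleSplit keeps all of a participant's epochs on one side of the split (a sketch with made-up data shapes):

    import numpy as np
    from sklearn.model_selection import GroupShuffleSplit

    # toy stand-ins: 20 epochs, 4 participants (5 epochs each)
    X = np.random.rand(20, 64)              # e.g. flattened EEG features
    y = np.random.randint(0, 3, size=20)    # 3 classes
    groups = np.repeat([1, 2, 3, 4], 5)     # participant ID per epoch

    # hold out whole participants: none appear in both sets
    splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
    train_idx, test_idx = next(splitter.split(X, y, groups=groups))

    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]

For the first case, a plain stratified split within each participant, like the 15% split in the question, is the natural counterpart.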
I have been trying to build a prediction model using a user's data. The model's input is documents' metadata (date published, title, etc.) and the document label is that user's preference (like/dislike). I would like to ask some questions that have come up, hoping for some answers:
There are way more liked documents than disliked ones. I read somewhere that if somebody trains a model using far more inputs of one label than the other, this hurts performance (the model tends to classify everything into the label/outcome that has the majority of inputs).
Is it possible for the input to an ML algorithm (e.g. logistic regression) to be hybrid in terms of numbers and words, and how could that be done? Something like:
input = [18, 23, 1, 0, 'cryptography'] with label = ['Like']
Also, can we use a vector that represents a word (using TF-IDF etc.) as an input feature (e.g. a 50-dimensional vector)?
In order to construct a prediction model using textual data, is the only way to derive a dictionary out of every word mentioned in our documents and then construct a binary input that indicates whether a term is mentioned or not? With such a representation, though, we lose the weight of the term in the collection, right?
Can we use something like a word2vec vector as a single input in a supervised learning model?
Thank you for your time.
You either need to under-sample the bigger class (take a small random sample to match the size of the smaller class), over-sample the smaller class (bootstrap sample), or use an algorithm that supports unbalanced data - and for that you'll need to read the documentation.
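For example, under-sampling the majority class can be done with scikit-learn (the dataframe layout here is an assumption):

    import pandas as pd
    from sklearn.utils import resample

    # df: one row per document, with a 'label' column of 'Like'/'Dislike' (assumed)
    liked = df[df["label"] == "Like"]
    disliked = df[df["label"] == "Dislike"]

    # under-sample: shrink the majority class to the minority class's size
    liked_down = resample(liked, replace=False, n_samples=len(disliked), random_state=0)
    balanced = pd.concat([liked_down, disliked]).sample(frac=1, random_state=0)  # shuffle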
You need to turn your words into a document-term matrix: columns are all the unique words in your corpus, rows are the documents, and cell values are one of: whether the word appears in the document, the number of times it appears, the relative frequency of its appearance, or its TF-IDF score. You can then use these columns alongside your other, non-word columns.
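A sketch of this with scikit-learn, combining a TF-IDF matrix with the numeric columns (the column names are assumptions):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from scipy.sparse import hstack, csr_matrix

    # df: one row per document, with a text column and numeric columns (names assumed)
    vectorizer = TfidfVectorizer()
    X_words = vectorizer.fit_transform(df["title"])   # (n_docs, n_unique_words), sparse

    X_numeric = csr_matrix(df[["day", "month", "flag_a", "flag_b"]].values)
    X = hstack([X_words, X_numeric])   # word and non-word columns side by side

    # X stays sparse and can be fed to e.g. LogisticRegression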
Now you probably have more columns than rows, meaning you'll get a singular matrix with matrix-based algorithms (e.g. unregularized logistic regression), in which case you need something like SVM or Naive Bayes.