I am using a VAR model to forecast a multivariate time series with lag 2. I have three features and would like to forecast several timestamps forward. Instead of forecasting all three features, I actually know the values of two of them and would like to forecast only the remaining one.
If I wanted to forecast all three features 5 timestamps ahead, I could do that as follows (this is a toy example):
import pandas as pd
from statsmodels.tsa.api import VAR
data = pd.DataFrame({'Date': ['1959-06-01', '1959-06-02', '1959-06-03', '1959-06-04'],
                     'a': [1, 2, 3, 5], 'b': [2, 3, 5, 8], 'c': [3, 4, 7, 11]})
data.set_index('Date', inplace=True)
model = VAR(data)
results = model.fit(2)
results.forecast(data.values[-2:], 5)
Note that data is
a b c
Date
1959-06-01 1 2 3
1959-06-02 2 3 4
1959-06-03 3 5 7
1959-06-04 5 8 11
And the forecast gives me
array([[ 8.01388889, 12.90277778, 17.79166667],
[ 12.93113426, 20.67650463, 28.421875 ],
[ 20.73343461, 33.12405961, 45.51468461],
[ 33.22366195, 52.98948789, 72.75531383],
[ 53.15895736, 84.72805652, 116.29715569]])
Let's say I knew that the next 5 values for a should have actually been 8,13,21,34,55 and the next 5 values for b should have been 13,21,34,55,89. Is there a way to incorporate that into the model in statsmodels.tsa (or any other python package) to forecast only the 5 values of c? I know that R has such an option, by incorporating "hard" conditions into cpredict.VAR, but I was wondering if this can be done in python as well.
The above is a toy example. In reality I have several dozen features, but I still know all of them and would like to predict only one of them using a VAR model.
I had a similar issue when solving this problem. Here is a makeshift method to accomplish what you are asking:
prediction = model_fit.forecast(model_fit.y, steps=len(test))
predictions = prediction[:,0]
where the 0 in prediction[:,0] refers to the column that contains the value you want to forecast.
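To make this concrete with the toy data from the question, a minimal sketch of this forecast-everything-then-select approach might look as follows (the column index 2 for c assumes the column order a, b, c shown above):

import pandas as pd
from statsmodels.tsa.api import VAR

data = pd.DataFrame({'Date': ['1959-06-01', '1959-06-02', '1959-06-03', '1959-06-04'],
                     'a': [1, 2, 3, 5], 'b': [2, 3, 5, 8], 'c': [3, 4, 7, 11]})
data.set_index('Date', inplace=True)

results = VAR(data).fit(2)                                   # VAR(2) on all three series
all_forecasts = results.forecast(data.values[-2:], steps=5)  # forecasts for a, b and c
c_forecast = all_forecasts[:, 2]                             # keep only the column for c

Note that this does not condition on the known future values of a and b; it simply discards their forecasts.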
I'm trying to wrap my head around the whole feature column thing. I can build the Iris training model etc. with Keras, but I want to understand it a little more.
What I'm having trouble with is the "feature column" aspect. For example, let's say my data set looks like:
Length Width Weight Label
1.5 2 .07 1
1 5 .09 3
3 6 .19 4
I don't get a feature "column", since the "row" is the example, isn't it? So when I read about a feature column, what do they mean by that?
From what I can see it's a feature row, no? [1.5, 2, .07] = 1 <= that's a row of features with a label of 1.
I don't know, I'm just confused :). Is anyone willing to explain what a feature column is when they transform the data for a feature column?
This may be a basic question. I have categorical data and I want to feed it into my machine learning model, but my ML model accepts only numerical data. What is the correct way to convert this categorical data into numerical data?
My Sample DF:
T-size Gender Label
0 L M 1
1 L M 1
2 M F 1
3 S F 0
4 M M 1
5 L M 0
6 S F 1
7 S F 0
8 M M 1
I know the following code converts my categorical data into numerical codes.
Type-1:
df['T-size'] = df['T-size'].cat.codes
The line above simply maps the categories to 0 through N-1; it doesn't capture any relationship between them.
For this example I know that S < M < L. What should I do when I want to convert data like this while preserving that order?
Type-2:
In this case there is no ordering between M and F, but I can see that M has a higher probability of the label being 1 than F, i.e., (number of samples with label 1) / (total number of samples):
for Male, 4/5
for Female, 2/4
and we know that 4/5 > 2/4.
How should I replace the values in this kind of column? Can I replace M with 4/5 and F with 2/4 for this problem? What is the proper way of dealing with such a column? Please help me understand this better.
There are many ways to encode categorical data, and some of them depend on exactly what you plan to do with it. For example, one-hot encoding, which is easily the most popular choice, is an extremely poor choice if you're planning on using a decision tree / random forest / GBM.
Regarding your t-shirts above, you can give a pandas categorical type an order:
df['T-size'] = df['T-size'].astype(pd.api.types.CategoricalDtype(['S', 'M', 'L'], ordered=True))
If you had set up your t-shirt categorical like that, then your .cat.codes method would work perfectly. It also means you can easily use scikit-learn's LabelEncoder, which fits neatly into pipelines.
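A short sketch of that setup, assuming the sample DF from the question:

import pandas as pd

df = pd.DataFrame({'T-size': ['L', 'L', 'M', 'S', 'M', 'L', 'S', 'S', 'M'],
                   'Gender': ['M', 'M', 'F', 'F', 'M', 'M', 'F', 'F', 'M'],
                   'Label':  [1, 1, 1, 0, 1, 0, 1, 0, 1]})

# Declare the order explicitly so that S < M < L
size_type = pd.api.types.CategoricalDtype(['S', 'M', 'L'], ordered=True)
df['T-size'] = df['T-size'].astype(size_type)

df['T-size_code'] = df['T-size'].cat.codes   # S -> 0, M -> 1, L -> 2, respecting the declared order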
Regarding your encoding of gender, you need to be very careful when using your target variable (your Label). You don't want to do this encoding before your train-test split, otherwise you're using knowledge of your unseen data, making it not truly unseen. This gets even more complicated if you're using cross-validation, as you'll need to do the encoding within each CV iteration (i.e. a new encoding per fold). If you want to do this, I recommend you check out TargetEncoder from scikit-learn-contrib's Category Encoders, but again, be sure to use it within an sklearn Pipeline or you will mess up the train-test splits and leak information from your test set into your training set.
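A rough sketch of how that might look, assuming the category_encoders package is installed and using a logistic regression purely for illustration:

from category_encoders import TargetEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
import pandas as pd

df = pd.DataFrame({'Gender': ['M', 'M', 'F', 'F', 'M', 'M', 'F', 'F', 'M'],
                   'Label':  [1, 1, 1, 0, 1, 0, 1, 0, 1]})
X = df[['Gender']]
y = df['Label']

# Because the encoder sits inside the pipeline, it is re-fit on the training
# fold of every CV split, so the held-out fold never leaks into the encoding.
pipe = Pipeline([
    ('encode_gender', TargetEncoder(cols=['Gender'])),
    ('clf', LogisticRegression()),
])
scores = cross_val_score(pipe, X, y, cv=3)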
If you want to have a hierarchy in your size parameter, you may consider using a linear mapping for it. This would be:
size_mapping = {"S": 1, "M":2 , "L":3}
#mapping to the DataFrame
df['T-size_num'] = df['T-size'].map(size_mapping)
This allows you to treat the input as numerical data while preserving the hierarchy.
As for the gender, you are conflating the class distribution with the preprocessing. If you bake the distribution into the input, you will introduce a bias into your data. You must treat male and female as two distinct categories regardless of their observed distribution. You should map them to two different numbers, but without introducing the proportions.
df['Gender_num'] = df['Gender'].map({'M':0 , 'F':1})
For a more detailed explanation, and coverage of more cases than your question raises, I suggest reading this article on categorical data in machine learning.
For the first question, if you have a small number of categories, you could map the column with a dictionary. In this way you can set an order:
d = {'L':2, 'M':1, 'S':0}
df['T-size'] = df['T-size'].map(d)
Output:
T-size Gender Label
0 2 M 1
1 2 M 1
2 1 F 1
3 0 F 0
4 1 M 1
5 2 M 0
6 0 F 1
7 0 F 0
8 1 M 1
For the second question, you can use the same method, but I would leave the two values for males and females as 0 and 1. If you only need the category and you don't have to do arithmetic with the values, one value is as good as another.
It might be overkill for the M/F example, since it's binary - but if you are ever concerned about mapping a categorical variable into a numerical form, then consider one-hot encoding. It basically stretches your single column containing n categories into n binary columns.
So a dataset of:
Gender
M
F
M
M
F
Would become
Gender_M Gender_F
1 0
0 1
1 0
1 0
0 1
This takes away any notion of one thing being more "positive" than another - an absolute must for categorical data with more than 2 options, where there's no transitive A > B > C relationship and you don't want to smear your results by forcing one into your encoding scheme.
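In pandas, a convenient way to get this expansion is get_dummies; a quick sketch on a Gender column like the one above (dtype=int just forces 0/1 output):

import pandas as pd

df = pd.DataFrame({'Gender': ['M', 'F', 'M', 'M', 'F']})

# One binary column per category level
encoded = pd.get_dummies(df, columns=['Gender'], dtype=int)
print(encoded)
#    Gender_F  Gender_M
# 0         0         1
# 1         1         0
# 2         0         1
# 3         0         1
# 4         1         0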
I have a dataset which looks like this:
val
1
1
3
4
6
6
9
...
I can't load it into a pandas dataframe due to its huge size. So I aggregate the data using Spark to form:
val occurrences
1 2
3 1
4 1
6 2
9 1
...
and load that into a pandas dataframe. The "val" column never exceeds 100, so it doesn't take much memory.
My problem is that I can't easily operate on such a structure, e.g. find the mean or median using pandas, or plot a boxplot with seaborn. I can only do it with explicit formulas I write myself, not with the ready-made builtin methods. Is there a pandas structure, or any other way, that allows me to cope with such data?
For example:
1,1,3,4,6,6,9
would be:
df = pd.DataFrame({'val': [1,3,4,6,9], "occurrences" : [2,1,1,2,1]})
The median is 4. I'm looking for a method to extract the median directly from the given df.
No, pandas does not operate on such objects the way you would expect. Elsewhere on StackOverflow, even computing a median for that table structure takes at least a few lines of code.
If you wanted to make your own seaborn hooks/wrappers, a good place to start would probably be an efficient percentiles(df, p) method. The median is then just percentiles(df, [50]). A box plot would just be percentiles(df, [0, 25, 50, 75, 100]), and so on. Your development time could then be fairly minimal (depending on how complicated the statistics you need are).
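A minimal sketch of what such a helper could look like, using nearest-rank percentiles computed from the cumulative occurrence counts (the function name and signature below are just the ones suggested above, not an existing API):

import numpy as np
import pandas as pd

def percentiles(df, ps, val_col='val', count_col='occurrences'):
    """Nearest-rank percentiles of data stored as (value, count) pairs."""
    d = df.sort_values(val_col)
    values = d[val_col].to_numpy()
    cum = np.cumsum(d[count_col].to_numpy())      # cumulative number of observations
    n = cum[-1]                                   # total number of observations
    result = []
    for p in ps:
        rank = max(1, int(np.ceil(p / 100 * n)))          # 1-based rank of the percentile
        result.append(values[np.searchsorted(cum, rank)])
    return result

df = pd.DataFrame({'val': [1, 3, 4, 6, 9], 'occurrences': [2, 1, 1, 2, 1]})
print(percentiles(df, [50]))                  # [4] -> the median of 1,1,3,4,6,6,9
print(percentiles(df, [0, 25, 50, 75, 100]))  # the five numbers a box plot needs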
I am a beginner in machine learning. I am confused about how to combine different features of a data set into one single feature.
For example, I have a data set in Python Pandas data frame of features like this:
movie unknown action adventure animation fantasy horror romance sci-fi
Toy Story 0 1 1 0 1 0 0 1
Golden Eye 0 1 0 0 0 0 1 0
Four Rooms 1 0 0 0 0 0 0 0
Get Shorty 0 0 0 1 1 0 1 0
Copy Cat 0 0 1 0 0 1 0 0
I would like to convert these n features into one single feature named "movie_genre". One solution would be to assign an integer value to each genre (unknown = 0, action = 1, adventure = 2, etc.) and create a data frame like this:
movie genre
Toy Story 1,2,4,7
Golden Eye 1,6
Four Rooms 0
Get Shorty 3,4,6
Copy Cat 2,5
But in this case the entries in the column will no longer be single integer/float values. Will that affect my later steps in the machine learning process, like fitting models and evaluating the algorithms?
Convert each series of zeros and ones into an 8-bit binary number.
Toy Story = 01101001
which is 105 in decimal
similarly, Golden Eye = 01000010, which is 66 in decimal
you can do the rest here manually: http://www.binaryhexconverter.com/binary-to-decimal-converter
It's relatively straightforward to do programmatically - just loop through each label, assign it the appropriate power of two, then sum them up.
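A possible way to do that with pandas (a sketch; the column order and bit weights assume the layout shown in the question, with the leftmost column as the most significant bit):

import pandas as pd

# Genre-flag frame shaped like the one in the question (first two movies only)
df = pd.DataFrame(
    {'unknown':   [0, 0], 'action':    [1, 1], 'adventure': [1, 0], 'animation': [0, 0],
     'fantasy':   [1, 0], 'horror':    [0, 0], 'romance':   [0, 1], 'sci-fi':    [1, 0]},
    index=['Toy Story', 'Golden Eye'])

# Weight each column by a power of two and sum across the row
weights = 2 ** pd.Series(range(len(df.columns) - 1, -1, -1), index=df.columns)
df['movie_genre'] = df.mul(weights, axis=1).sum(axis=1)
print(df['movie_genre'])   # Toy Story -> 105, Golden Eye -> 66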
It may be effective to leave them in their current multi-feature format and perform some sort of dimensionality reduction technique on that data.
This is very similar to a classic question: how do we treat categorical variables? One answer is one-hot or dummy encoding, which your original DataFrame is very similar to. With one-hot encoding, you start with a single, categorical feature. Using that feature, you make a column for each level, and assign a binary value to that column. The encoded result looks quite similar to what you are starting with. This sort of encoding is popular and many find it quite effective. Yours takes this one step further as each movie could be multiple genres. I'm not sure reversing that is a good idea.
Simply having more features is not always a bad thing if it is representing the data appropriately, and if you have enough observations. If you end up with a prohibitive number of features, there are many ways of reducing dimensionality. There is a wealth of knowledge on this topic out there, but one common technique is to apply principal component analysis (PCA) to a higher-dimensional dataset to find a lower-dimensional representation.
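A rough sketch of that PCA idea on the genre matrix from the question (the choice of 2 components here is arbitrary):

import numpy as np
from sklearn.decomposition import PCA

# 0/1 genre matrix: rows are the movies, columns are the 8 genre flags
X = np.array([[0, 1, 1, 0, 1, 0, 0, 1],
              [0, 1, 0, 0, 0, 0, 1, 0],
              [1, 0, 0, 0, 0, 0, 0, 0],
              [0, 0, 0, 1, 1, 0, 1, 0],
              [0, 0, 1, 0, 0, 1, 0, 0]])

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)        # project the 8 genre flags onto 2 components
print(X_reduced.shape)                  # (5, 2)
print(pca.explained_variance_ratio_)    # fraction of variance each component keeps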
Since you're using python, you might want to check out what's available in scikit-learn for more ideas. A few resources in their documentation can be found here and here.
One thing you can do is to make a matrix of all possible combinations and reshape it into a single vector. If you want to account for all combinations it will have the same length as the original. If there are combinations that you don't need simply don't take them into account. Your network is label-agnostic and it won't mind.
But why is that a problem? Your dataset looks small enough.
INTRO
I have a Pandas DataFrame that represents a segmented time series of different users (i.e., user1 & user2). I want to train a scikit-learn classifier with the mentioned DataFrames, but I can't understand the shape of the scikit-learn dataset that I must create.
Since my series are segmented, my DataFrame has a 'segID' column that contains IDs of a specific segment. I'll skip the description of the segmentation since it is provided by an algorithm.
Let's take an example where both user1 and user2 have 2 segments: print df
username voltage segID
0 user1 -0.154732 0
1 user1 -0.063169 0
2 user1 0.554732 1
3 user1 -0.641311 1
4 user1 -0.653732 1
5 user2 0.446469 0
6 user2 -0.655732 0
7 user2 0.646769 0
8 user2 -0.646369 1
9 user2 0.257732 1
10 user2 -0.346369 1
QUESTIONS:
scikit-learn dataset API says to create a dict containing data and target, but how can I shape my data since they are segments and not just a list?
I can't figure out my segments fitting into the n_samples * n_features structure.
I have two ideas:
1) Every data sample is a list representing a segment; on the other hand, target is different for each data entry since the entries are grouped. What about target_names? Could this work?
{
'data': array([
[[-0.154732, -0.063169]],
[[ 0.554732, -0.641311, -0.653732]],
[[ 0.446469, -0.655732, 0.646769]],
[[-0.646369, 0.257732, -0.346369]]
]),
'target':
array([0, 1, 2, 3]),
'target_names': array(['user1seg1', 'user1seg2', 'user2seg1', 'user2seg2'], dtype='|S10')
}
2) data is (simply) the nparray returned by df.values, and target contains the segments' IDs, different for each user.... does that make sense?
{
'data': array([
[-0.154732],
[-0.063169],
[ 0.554732],
[-0.641311],
[-0.653732],
[ 0.446469],
[-0.655732],
[ 0.646769],
[-0.646369],
[ 0.257732],
[-0.346369]
]),
'target':
array([0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]),
'target_names': array(['user1seg1', 'user1seg1', 'user1seg2', 'user1seg2', .....], dtype='|S10')
}
I think the main problem is that I can't figure out what to use as labels...
EDIT:
OK, it's clear... the labels are given by my ground truth; they are just the users' names.
elyase's answer is exactly what I was looking for.
In order to better state the problem, I'm going to explain here the segID meaning.
In time series pattern recognition, segmenting could be useful in order to isolate meaningful segments.
At testing time I want to recognize segments and not the entire series, because the series is rather long and segments are supposed to be meaningful in my context.
Have a look at the following example from this implementation based on "An Online Algorithm for Segmenting Time Series".
My segID is just a column representing the id of a chunk.
This is not trivial and there might be several ways of formulating the problem for consumption by an ML algorithm. You should try them all and find which gives you the best results.
As you already found, you need two things: a matrix X of shape n_samples * n_features and a column vector y of length n_samples. Let's start with the target y.
Target:
As you want to predict a user from a discrete pool of usernames, you have a classification problem, and your target will be a vector with np.unique(y) == ['user1', 'user2', ...].
Features
Your features are the information that you provide the ML algorithm for each label/user/target. Unfortunately most algorithms require this information to have a fixed length, but variable length time series don't fit well into this description. So if you want to stick to classic algorithms, you need some way to condense the time series information for a user into a fixed length vector. Some possibilities are the mean, min, max, sum, first, last values, histogram, spectral power, etc. You will need to come up with the ones that make sense for your given problem.
So if you ignore the SegID information your X matrix will look like this:
y/features
min max ... sum
user1 0.1 1.2 ... 1.1 # <-first time series for user 1
user1 0.0 1.3 ... 1.1 # <-second time series for user 1
user2 0.3 0.4 ... 13.0 # <-first time series for user 2
As SegID is itself a time series, you also need to encode it as fixed-length information, for example a histogram/counts of all possible values, the most frequent value, etc.
In this case you will have:
y/features
min max ... sum segID_most_freq segID_min
user1 0.1 1.2 ... 1.1 1 1
user1 0.3 0.4 ... 13 2 1
user2 0.3 0.4 ... 13 5 3
The algorithm will look at this data and will "think": for user1 the minimum segID is always 1, so if I see a user at prediction time whose time series has a minimum segID of 1, then it should be user1. If it is around 3, it is probably user2, and so on.
Keep in mind that this is only a possible approach. Sometimes it is useful to ask: what information will I have at prediction time that will allow me to identify which user I am seeing, and why will this information lead to the given user?
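As a minimal sketch of building such a fixed-length feature matrix with pandas (purely as an illustration, here treating each (user, segment) pair as one sample and summarizing its voltage values; the choice of aggregates is an assumption):

import pandas as pd

# df from the question: columns username, voltage, segID
df = pd.DataFrame({
    'username': ['user1'] * 5 + ['user2'] * 6,
    'voltage':  [-0.154732, -0.063169, 0.554732, -0.641311, -0.653732,
                 0.446469, -0.655732, 0.646769, -0.646369, 0.257732, -0.346369],
    'segID':    [0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1],
})

# One row per (user, segment): fixed-length summary features of that segment
features = (df.groupby(['username', 'segID'])['voltage']
              .agg(['min', 'max', 'mean', 'sum'])
              .reset_index())

X = features[['min', 'max', 'mean', 'sum']].to_numpy()   # n_samples x n_features
y = features['username'].to_numpy()                      # the labels: user names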