I'm learning about time series and am trying to predict closing stock price for the next two weeks, given the data I already have (about a year).
I've created 7 lag features using Pandas shift, so I have features t-7, t-6, ..., t-1 and the current day's closing stock price for my whole DataFrame, df. I've made a test_df which is just the last two weeks of data. test_df has the true values for each of its row's lagged features.
I want to mimic predicting future values by limiting myself to values from my training set (everything in df before the last two weeks) and my predictions.
So I was going to do something like:
# for each row in test_df
# prediction = model.predict(row)
# row["t"] = prediction
I think this is close, but it doesn't fix other lagged features like t-1, t-2, ..., t-7. I need to do this:
row 2, t = prediction for row 1
row 2, t-1 = t for row 1
...
row 2, t-i = t-i+1 for row 1
And I would repeat this for all rows in my test_df.
I could do this by writing my own function, but I'm wondering if there's a way to take advantage of Pandas to do this more easily.
Edit for clarity:
Suppose I'm looking at my first test row. I don't have the closing_price, so I use my model to predict based on the lagged features. Before prediction, my df looks like this:
closing_price t-1 t-2 t-3 t-4 t-5
0 None 7 6 5 4 3
Suppose my prediction for closing_price is 15. Then my updated DataFrame should look like this:
closing_price t-1 t-2 t-3 t-4 t-5
0 15.0 7.0 6.0 5.0 4.0 3.0
1 NaN 15.0 7.0 6.0 5.0 4.0
Thanks!
Edited: So you won't actually need time series split for this at all, since you're only trying to predict the value for one row. It seems you know how to create the shifted dataframe, so suppose you've stored your train data in a dataframe df where the 'closing_price' element of the last row is None. You'll use:
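Xtrain = df.iloc[:-1].copy()  # every row with a known closing_price
ytrain = Xtrain.pop('closing_price')
Xtest = df.tail(1).drop(columns='closing_price')  # lag features of the row to predict
reg.fit(Xtrain, ytrain)  # reg is any scikit-learn regressor
prediction = reg.predict(Xtest)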
From there you can either put the prediction into your existing dataframe with df.at[row_label, 'closing_price'] = prediction[0] (df.set_value was removed in pandas 1.0) or make a new dataframe altogether if you're doing this incrementally.
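For the multi-step case in the question, where each prediction becomes t-1 for the next row, one possible sketch is below. It keeps a running list of prices instead of rewriting DataFrame rows. The synthetic prices, the choice of LinearRegression, and the helper make_lagged are assumptions made for illustration, not part of the original code.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

N_LAGS = 7     # matches the t-1 ... t-7 features in the question
HORIZON = 14   # two weeks of predictions

# Synthetic closing prices standing in for the real data (assumption).
rng = np.random.default_rng(0)
prices = pd.Series(100 + rng.normal(0, 1, 250).cumsum(), name="closing_price")


def make_lagged(series, n_lags):
    """Hypothetical helper: target column plus lag columns t-1 ... t-n_lags."""
    frame = pd.DataFrame({"closing_price": series})
    for lag in range(1, n_lags + 1):
        frame[f"t-{lag}"] = series.shift(lag)
    return frame.dropna()


lag_cols = [f"t-{lag}" for lag in range(1, N_LAGS + 1)]
train = prices.iloc[:-HORIZON]          # everything before the last two weeks
train_df = make_lagged(train, N_LAGS)

reg = LinearRegression()
reg.fit(train_df[lag_cols], train_df["closing_price"])

# history starts with known prices and grows by one prediction per step.
history = train.tolist()
forecast = []
for _ in range(HORIZON):
    # Newest value is t-1, the one before it t-2, and so on.
    row = pd.DataFrame([history[-1:-N_LAGS - 1:-1]], columns=lag_cols)
    pred = float(reg.predict(row)[0])
    forecast.append(pred)
    history.append(pred)

forecast = pd.Series(forecast, index=prices.index[-HORIZON:], name="forecast")
print(forecast.head())
```

Because only training values and earlier predictions ever enter history, the true test values are never used, which is the constraint described in the question.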
If I'm understanding your question correctly (please comment if I'm not!), I think you're looking for the scikit-learn Time Series Split. That will let you create multiple predictions at different points in time using only historical data.
From their documentation:
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4])
tscv = TimeSeriesSplit(n_splits=3)
print(tscv)
for train_index, test_index in tscv.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
Related
I have a relatively large dataframe (~24000 rows and 15 columns) which has 2D coordinate data of rat movements, outputted by a neural network (DeepLabCut).
As part of this output data, there is a p-value score that is a measure of how certain the neural network was when applying that label. I'm trying to filter low quality predictions by copying the previous row into its place, each time that a low p-value is encountered, which assumes that the rat remained still for that frame.
Here's my function thus far:
def checkPVals(DataFrame, CutOff):
    for Cols in DataFrame.columns.values:
        if Cols % 3 == 0:
            for Vals in DataFrame.index.values:
                if float(DataFrame[Cols][Vals]) < CutOff:
                    if (Vals != 0):
                        PreviousRow = DataFrame.loc[Vals - 1, Cols - 3:Cols]
                        DataFrame.loc[Vals, Cols - 3:Cols] = PreviousRow
    return(DataFrame)
Here is a sample of the input data frame:
pd.DataFrame(data={
"x":[1, 2, 3, 4],
"y":[5, 4, 3, 2],
"likelihood":[1, 1, 0.3, 1]
})
Here is a sample of the desired output:
x y Pval
0 1 5 1.0
1 2 4 1.0
2 2 4 1.0
3 4 2 1.0
With the idea being that row index 2 is replaced with values from row index 1, such that when the inter-frame Euclidean distance between these coordinates is calculated, the distance is 0, implying the label (rat) has not moved.
Clearly, my current implementation is very inefficient. I was looking at iterrows(), but that converts my data into a series and messes with it. My other thought was to convert the p-value columns into np.arrays, iterate through those, take the index of the p-values below threshold and then swap the rows for the previous one in an iterative manner. However, I feel like that'll take just as long.
Any help is very much appreciated. Thank you!
I'm pretty sure I understood what you are attempting to do. If you could update your question to have a sample output that's paired with your sample input, that would be greatly beneficial.
If I understood correctly, you should be using a vectorized approach instead of explicit looping (this will massively speed up your data wrangling). Essentially you can mask the rows of the dataframe depending on whether or not the "likelihood" column is above a certain value. Once you mask the low likelihoods away (i.e. replace those values with NaN), you can simply forward fill the entire dataframe to fill in the "bad" rows with the previous row's values.
df = pd.DataFrame(data={
"x":[1, 2, 3, 4],
"y":[5, 4, 3, 2],
"likelihood":[1, 1, 0.3, 1]
})
cutoff = 0.5
new_df = df.mask(df["likelihood"] < cutoff).ffill()
print(new_df)
x y likelihood
0 1.0 5.0 1.0
1 2.0 4.0 1.0
2 2.0 4.0 1.0
3 4.0 2.0 1.0
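If the frame holds several (x, y, likelihood) triplets, as the 15-column frame in the question suggests, the same mask-and-forward-fill idea can be applied per triplet so that a bad label for one body part does not overwrite the others. This is only a sketch: the column names (nose_x, tail_likelihood, ...) are hypothetical, and the likelihood columns are left untouched here.

```python
import numpy as np
import pandas as pd

# Hypothetical flat column layout: <part>_x, <part>_y, <part>_likelihood.
df = pd.DataFrame({
    "nose_x": [1, 2, 3, 4],
    "nose_y": [5, 4, 3, 2],
    "nose_likelihood": [1, 1, 0.3, 1],
    "tail_x": [9, 8, 7, 6],
    "tail_y": [1, 1, 1, 1],
    "tail_likelihood": [1, 0.2, 1, 1],
})
cutoff = 0.5
body_parts = ["nose", "tail"]

cleaned = df.astype(float)
for part in body_parts:
    coords = [f"{part}_x", f"{part}_y"]
    low = cleaned[f"{part}_likelihood"] < cutoff
    # Blank out low-confidence coordinates, then copy the previous row down.
    cleaned.loc[low, coords] = np.nan
    cleaned[coords] = cleaned[coords].ffill()

print(cleaned)
```

A low-likelihood value in the very first row has nothing before it to copy, so it stays NaN and needs separate handling.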
I'm using a simple RandomForestRegressor script to predict a target variable. I'm trying to write a new CSV based on my training / validation data to include the actual value and the predicted value. However, when I export the data, the "Predicted Values" column is missing about half the values, and the values that do show up don't correlate well with the features / actual values. It seems like the values are randomized and then assigned to the first half of the rows.
To test, I've tried not splitting the data between validation and training data in the first place. I'm still finding the same problem.
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
#file path
My_File_Path = "//path.csv"
#read the file
My_Data = pd.read_csv(My_File_Path)
#drop the null values
My_Data = My_Data.dropna(axis=0)
#define the target variable
y = My_Data.Annualized_2018_Payments
my_features = ['feature1','feature2','feature3']
#define the features
x = My_Data[my_features]
#set the split data
train_x, val_x, train_y, val_y = train_test_split(x, y, random_state = 1)
forest_model = RandomForestRegressor(random_state = 1)
forest_model.fit(train_x, train_y)
WA_My_preds = forest_model.predict(val_x)
print("MAE for validation data is ", mean_absolute_error(val_y, WA_My_preds))
#print(My_Data.columns)
My_Data_Predicted = My_Data
#My_Data_Predicted.append(prediction_column, ignore_index = False, sort=None)
My_Data_Predicted['Predicted_Value'] = pd.DataFrame(data = forest_model.predict(My_Data_Predicted[my_features]))
print("The average predicted value is ", My_Data_Predicted['Predicted_Value'].mean())
print("The average true value is ", My_Data_Predicted['Annualized_2018_Payments'].mean())
#write to csv
My_Data_Predicted.to_csv("//path….Preds.csv")
I expect every row to have a column that reads "predicted values" with the values predicted by the random forest regressor. But the last half of the rows are missing that value.
For a short answer and resolution:
Based on testing your code, you should try this line instead:
My_Data_Predicted['Predicted_Value'] = forest_model.predict(My_Data_Predicted[my_features])
And here's why this is happening:
I tested this using my own dataset and it looks like the issue is this line:
My_Data_Predicted['Predicted_Value'] = pd.DataFrame(data = forest_model.predict(My_Data_Predicted[my_features]))
What is happening, it would seem, is that when you drop the null rows here:
My_Data = My_Data.dropna(axis=0)
you are also dropping the indexes along with the rows, which is not wrong, but important for your issue. To test this, try My_Data_Predicted.index.max() to get the highest index and compare that to My_Data_Predicted.shape and you will see that there are skipped indexes.
The reason this is a problem is that by making your predicted column a dataframe instead of a series, it is automatically trying to merge the new data based on indexes. The issue is that the original dataframe has a higher max index with some gaps, and this new one for predictions has sequential indexes, so some of your predictions are getting dropped in the process of merging.
Here is a simplified example of what's going on (pay attention to the indexes):
My_Data_Predicted      predictions     My_Data_Predicted (merged)
index  a  b  c         index  d        index  a  b  c  d
0      1  4  3         0      1        0      1  4  3  1
3      3  2  7         1      2        3      3  2  7  4
4      2  2  2         2      3        4      2  2  2  5
6      4  3  5         3      4        6      4  3  5  NaN
8      6  2  1         4      5        8      6  2  1  NaN
Notice that in the merged dataframe the last two are NaN because there is no index 6 or 8 in the predictions dataframe.
All of this should resolve by passing in the result of the predictions directly, like this:
My_Data_Predicted['Predicted_Value'] = forest_model.predict(My_Data_Predicted[my_features])
since the type is a numpy array and will not try to merge on the index.
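The difference between label-based and positional assignment can be reproduced on a toy frame. The sketch below uses made-up values and a pd.Series instead of a one-column DataFrame, on the assumption that both are aligned on the index in the same way during column assignment.

```python
import numpy as np
import pandas as pd

# Index with gaps, as left behind by dropna().
frame = pd.DataFrame({"a": [1, 3, 2, 4, 6]}, index=[0, 3, 4, 6, 8])
preds = np.array([1, 2, 3, 4, 5])

aligned = frame.copy()
aligned["d"] = pd.Series(preds)   # matched on index labels 0..4

positional = frame.copy()
positional["d"] = preds           # assigned row by row, labels ignored

print(aligned)
print(positional)
```

Another option under the same assumptions is to call reset_index(drop=True) right after dropna(), so that the index is sequential again before any column is added.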
Regression algorithms seem to be working on features represented as numbers.
For example:
This data set doesn't contain categorical features/variables. It's quite clear how to do regression on this data and predict price.
But now I want to do a regression analysis on data that contain categorical features:
There are 5 features: District, Condition, Material, Security, Type
How can I do a regression on this data? Do I have to transform all the string/categorical data to numbers manually? I mean, do I have to create some encoding rules and transform all the data to numeric values according to those rules?
Is there any simple way to transform string data to numbers without having to create my own encoding rules manually? Maybe there are some libraries in Python that can be used for that? Are there some risks that the regression model will be somehow incorrect due to "bad encoding"?
Yes, you will have to convert everything to numbers. That requires thinking about what these attributes represent.
Usually there are three possibilities:
One-Hot encoding for categorical data
Arbitrary numbers for ordinal data
Use something like group means for categorical data (e. g. mean prices for city districts).
You have to be careful not to introduce information that you will not have when the model is actually applied.
One hot encoding
If you have categorical data, you can create dummy variables with 0/1 values for each possible value.
E. g.
idx color
0 blue
1 green
2 green
3 red
to
idx blue green red
0 1 0 0
1 0 1 0
2 0 1 0
3 0 0 1
This can easily be done with pandas:
import pandas as pd
data = pd.DataFrame({'color': ['blue', 'green', 'green', 'red']})
print(pd.get_dummies(data))
will result in:
color_blue color_green color_red
0 1 0 0
1 0 1 0
2 0 1 0
3 0 0 1
Numbers for ordinal data
Create a mapping of your sortable categories, e. g.
old < renovated < new → 0, 1, 2
This is also possible with pandas:
data = pd.DataFrame({'q': ['old', 'new', 'new', 'ren']})
data['q'] = data['q'].astype('category')
data['q'] = data['q'].cat.reorder_categories(['old', 'ren', 'new'], ordered=True)
data['q'] = data['q'].cat.codes
print(data['q'])
Result:
0 0
1 2
2 2
3 1
Name: q, dtype: int8
Using categorical data for groupby operations
You could use the mean for each category over past (known) events.
Say you have a DataFrame with the last known mean prices for cities:
prices = pd.DataFrame({
'city': ['A', 'A', 'A', 'B', 'B', 'C'],
'price': [1, 1, 1, 2, 2, 3],
})
mean_price = prices.groupby('city').mean()
data = pd.DataFrame({'city': ['A', 'B', 'C', 'A', 'B', 'A']})
print(data.merge(mean_price, on='city', how='left'))
Result:
  city  price
0    A    1.0
1    B    2.0
2    C    3.0
3    A    1.0
4    B    2.0
5    A    1.0
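A related sketch of the group-mean idea maps the means onto the new data and falls back to the overall mean for a category that never appeared in the known prices. The fallback rule and the column name city_mean_price are assumptions for illustration; the means should be computed from known (training) data only.

```python
import pandas as pd

prices = pd.DataFrame({
    "city": ["A", "A", "A", "B", "B", "C"],
    "price": [1, 1, 1, 2, 2, 3],
})

# Series indexed by city: A -> 1.0, B -> 2.0, C -> 3.0
mean_price = prices.groupby("city")["price"].mean()
overall_mean = prices["price"].mean()

# "D" does not occur in the known prices.
data = pd.DataFrame({"city": ["A", "B", "D"]})
data["city_mean_price"] = data["city"].map(mean_price).fillna(overall_mean)
print(data)
```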
In linear regression with categorical variables you should be careful of the dummy variable trap. The dummy variable trap is a scenario in which the independent variables are multicollinear: two or more variables are highly correlated, so in simple terms one variable can be predicted from the others. This can make the model singular, meaning the coefficients cannot be estimated uniquely.
The idea is to use dummy variable encoding with drop_first=True, which omits one column from each category after converting the categorical variables into dummy/indicator variables. You will not lose any relevant information by doing that, because the dropped column is fully determined by the remaining dummy columns of the same variable.
Here is complete code on how you can do it for your housing dataset
So you have categorical features:
District, Condition, Material, Security, Type
And one numerical feature that you are trying to predict:
Price
First you need to split your initial dataset into input variables and the prediction target. Assuming it's a pandas dataframe, it would look like this:
Input variables:
X = housing[['District','Condition','Material','Security','Type']]
Prediction:
Y = housing['Price']
Convert categorical variable into dummy/indicator variables and drop one in each category:
X = pd.get_dummies(data=X, drop_first=True)
So now if you check the shape of X with drop_first=True, you will see that it has 5 fewer columns: one for each of your categorical variables.
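The effect of drop_first on the column count can be checked on a small made-up frame. The two columns and their values below are illustrative assumptions, not the real housing data.

```python
import pandas as pd

# Two categorical variables with three levels each (made-up values).
X = pd.DataFrame({
    "Condition": ["old", "new", "renovated", "new"],
    "Material": ["brick", "wood", "brick", "panel"],
})

full = pd.get_dummies(X)                      # one column per level
reduced = pd.get_dummies(X, drop_first=True)  # first level of each variable dropped

print(full.columns.tolist())
print(reduced.columns.tolist())
```

With two categorical variables, the reduced frame has two fewer columns than the full one.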
You can now continue to use them in your linear model. For scikit-learn implementation it could look like this:
from sklearn import linear_model
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = .20, random_state = 40)
regr = linear_model.LinearRegression() # Do not use fit_intercept = False if you have removed 1 column after dummy encoding
regr.fit(X_train, Y_train)
predicted = regr.predict(X_test)
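An alternative sketch keeps the encoding inside a scikit-learn pipeline, so the same categories are applied at fit and predict time. The tiny housing frame is made up, and using OneHotEncoder(drop="first") in place of pd.get_dummies is a substitution made for this example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up stand-in for the housing data.
housing = pd.DataFrame({
    "District": ["North", "South", "North", "East", "South", "East"],
    "Condition": ["old", "new", "new", "old", "old", "new"],
    "Price": [100, 180, 150, 90, 120, 160],
})
cat_cols = ["District", "Condition"]

# drop="first" plays the same role as drop_first=True in get_dummies.
encoder = ColumnTransformer([("onehot", OneHotEncoder(drop="first"), cat_cols)])
model = make_pipeline(encoder, LinearRegression())

model.fit(housing[cat_cols], housing["Price"])
predicted = model.predict(housing[cat_cols])
print(predicted)
```

Here the model is fitted and evaluated on the same rows only to keep the example short; with real data you would still hold out a test set as in the code above.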
You can use "Dummy Coding" in this case.
There are Python libraries to do dummy coding, you have a few options:
You may use the scikit-learn library.
Or, if you are working with pandas, it has a built-in function to create dummy variables.
An example with pandas is below:
import pandas as pd
sample_data = [[1,2,'a'],[3,4,'b'],[5,6,'c'],[7,8,'b']]
df = pd.DataFrame(sample_data, columns=['numeric1','numeric2','categorical'])
dummies = pd.get_dummies(df.categorical)
df.join(dummies)
One way to achieve regression with categorical variables as independent variables is as mentioned above - Using encoding.
Another way is to use an R-like statistical formula with the statsmodels library. Here is a code snippet:
import seaborn as sns
from statsmodels.formula.api import ols
tips = sns.load_dataset("tips")
# C(...) tells the formula to treat a column as categorical
model = ols('tip ~ total_bill + C(sex) + C(day) + size', data=tips)
fitted_model = model.fit()
fitted_model.summary()
Dataset
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
Summary of regression
I should start by saying that I am quite new to pandas and numpy (and machine learning in general).
I am trying to learn some basic machine learning algorithms and am doing linear regression. I have completed this problem using matlab, but wanted to try implementing it in python - as that is a more practically used language. I am having a very difficult time doing basic matrix operations with these libraries and I think it's down to a lack of understanding of how pandas is indexing the dataframe...
I have found several posts talking about the differences between iloc and ix and that ix is being deprecated so use iloc, but iloc is causing me loads of issues. I am simply trying to pull the first n-1 columns out of a dataframe into a new dataframe, then the final column into another dataframe to get my label values. Then I want to perform the cost function one time to see what my current cost is with theta = 0. Currently, my dataset has only one label - but I'd like to code as if I had more. Here is my code and my output:
import os
import numpy as np
import pandas as pd
path = os.path.join(os.getcwd(), 'ex1data1.txt')
data = pd.read_csv(path, header=None)
numRows = data.shape[0]
numCols = data.shape[1]
X = data.iloc[:,0:numCols-1].copy()
theta = pd.DataFrame(np.zeros((X.shape[1], 1)))
y = data.iloc[:,-1].copy()
#start computing cost sum((X-theta)-y).^2)
predictions = X.dot(theta)
print("predictions shape: {0}".format(predictions.shape))
print(predictions.head())
print("y shape: {0}".format(y.shape))
print(y.head())
errors = predictions.subtract(y)
print("errors shape: {0}".format(errors.shape))
print(errors.head())
output:
predictions shape: (97, 1)
0
0 0.0
1 0.0
2 0.0
3 0.0
4 0.0
y shape: (97, 1)
1
0 17.5920
1 9.1302
2 13.6620
3 11.8540
4 6.8233
errors shape: (97, 2)
0 1
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
I can see that y and X have the same shape, but for some reason when I display them, it seems that y is beginning its indexing at column 1 (its original position in the first dataframe) and X has its original column of 0. As a result, pandas is properly doing the subtraction and replacing any missing values with NaN. As y has no column 0 values, they are all NaN, and as X has no column 1 values, they are all NaN, resulting in a 97x2 NaN matrix.
If I use y = data.ix[:,-1:0] - the above code does the correct calculations. Output:
errors shape: (97, 1)
0
0 -6.1101
1 -5.5277
2 -8.5186
3 -7.0032
4 -5.8598
But I am trying to stay away from ix, as it is being deprecated.
How do I tell pandas that the new matrix has a start column of 0, and why is this not the default behavior?
Looks like the calculation you actually want to do is on the series (individual columns). So you should be able to do:
predictions[0].subtract(y[1])
To get the value you want. This looks kind of confusing because you have numbers as DataFrame columns, you are selecting the columns you want (0 and 1) and performing the subtraction between them.
Or using iloc as you originally suggested, which gives you more like matrix style indexing you could do this:
predictions.iloc[:, 0].subtract(y.iloc[:, 0])
Because in each DataFrame you want all the rows and the first column
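If the goal is plain matrix arithmetic, another option is to drop the labels altogether with to_numpy(), so pandas never tries to align columns 0 and 1. The sketch below uses three sample rows in place of ex1data1.txt and assumes the usual squared-error cost, sum of squared errors divided by 2m.

```python
import numpy as np
import pandas as pd

# Three sample rows standing in for ex1data1.txt (feature, label).
data = pd.DataFrame([[6.1101, 17.5920], [5.5277, 9.1302], [8.5186, 13.6620]])

X = data.iloc[:, :-1]   # all columns except the last, still a DataFrame
y = data.iloc[:, -1:]   # last column kept as a one-column DataFrame
theta = np.zeros((X.shape[1], 1))

# NumPy arrays carry no column labels, so the subtraction is purely positional.
errors = X.to_numpy() @ theta - y.to_numpy()
cost = (errors ** 2).sum() / (2 * len(y))

print(errors.shape)
print(cost)
```

Relabeling with y.columns = [0] before subtracting would also make the two frames line up, at the cost of hiding where the column came from.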
I would like to get the feature names of a data set after it has been transformed by SKLearn OneHotEncoder.
In active_features_ attribute in OneHotEncoder one can see a very good explanation how the attributes n_values_, feature_indices_ and active_features_ get filled after transform() was executed.
My question is:
For e.g. DataFrame based input data:
data = pd.DataFrame({"a": [0, 1, 2,0], "b": [0,1,4, 5], "c":[0,1,4, 5]}).as_matrix()
How does the code look like to get from the original feature names a, b and c to a list of the transformed feature names
(like e.g:
a-0,a-1, a-2, b-0, b-1, b-2, b-3, c-0, c-1, c-2, c-3
or
a-0,a-1, a-2, b-0, b-1, b-2, b-3, b-4, b-5, b-6, b-7, b-8
or anything that helps to see the assignment of encoded columns to the original columns).
Background: I would like to see the feature importances of some of the algorithms to get a feeling for which feature have the most effect on the algorithm used.
You can use pd.get_dummies():
pd.get_dummies(data["a"],prefix="a")
will give you:
a_0 a_1 a_2
0 1 0 0
1 0 1 0
2 0 0 1
3 1 0 0
which automatically generates the column names. You can apply this to all your columns and then get the column names. No need to convert them to a numpy matrix.
So with:
df = pd.DataFrame({"a": [0, 1, 2,0], "b": [0,1,4, 5], "c":[0,1,4, 5]})
data = df.as_matrix()
the solution looks like:
columns = df.columns
my_result = pd.DataFrame()
temp = pd.DataFrame()
for runner in columns:
    temp = pd.get_dummies(df[runner], prefix=runner)
    my_result[temp.columns] = temp
print(my_result.columns)
>>Index(['a_0', 'a_1', 'a_2', 'b_0', 'b_1', 'b_4', 'b_5', 'c_0', 'c_1', 'c_4',
'c_5'],
dtype='object')
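The same column names can also be obtained without the explicit loop by passing the column list to pd.get_dummies, which then encodes numeric columns as well. A minimal sketch using the same frame:

```python
import pandas as pd

df = pd.DataFrame({"a": [0, 1, 2, 0], "b": [0, 1, 4, 5], "c": [0, 1, 4, 5]})

# columns=... forces encoding of the listed columns even though they are numeric.
encoded = pd.get_dummies(df, columns=list(df.columns))
print(encoded.columns.tolist())
```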
If I understand correctly you can use feature_indices_ to identify which columns correspond to which feature.
e.g.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
data = pd.DataFrame({"a": [0, 1, 2,0], "b": [0,1,4, 5], "c":[0,1,4, 5]}).as_matrix()
ohe = OneHotEncoder(sparse=False)
ohe_fitted = ohe.fit_transform(data)
print(ohe_fitted)
print(ohe.feature_indices_) # [ 0 3 9 15]
From the above feature_indices_ we know that if we slice the OneHotEncoded data from 0:3, we get the features corresponding to the first column in data, like so:
print(ohe_fitted[:,0:3])
Each column in the sliced data represents a value in the first feature. The first column is 0, the second 1 and the third column is 2. To illustrate this on the sliced data, the column labels would look like:
a_0 a_1 a_2
[[ 1. 0. 0.]
[ 0. 1. 0.]
[ 0. 0. 1.]
[ 1. 0. 0.]]
Note that features are sorted first before they are encoded.
You can do that with the open source package feature-engine:
import pandas as pd
from sklearn.model_selection import train_test_split
from feature_engine.encoding import OneHotEncoder
# load titanic data from openML
data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')
# divide into train and test
X_train, X_test, y_train, y_test = train_test_split(
    data[['sex', 'embarked']],  # predictors for this example
    data['survived'],  # target
    test_size=0.3,  # percentage of obs in test set
    random_state=0)  # seed to ensure reproducibility
ohe_enc = OneHotEncoder(
    top_categories=None,
    variables=['sex', 'embarked'],
    drop_last=True)
ohe_enc.fit(X_train)
X_train = ohe_enc.transform(X_train)
X_test = ohe_enc.transform(X_test)
X_train.head()
You should see this output returned:
sex_female embarked_S embarked_C embarked_Q
501 1 1 0 0
588 1 1 0 0
402 1 0 1 0
1193 0 0 0 1
686 1 0 0 1
More details about feature engine here:
https://www.trainindata.com/feature-engine
https://github.com/feature-engine/feature_engine
https://feature-engine.readthedocs.io/en/latest/
There is a OneHotEncoder that does all the work for you.
Package sksurv has a OneHotEncoder that will return a pandas DataFrame with all the column names set up for you. Check it out. Make sure you set up a separate environment to play with the encoder to ensure it doesn't break your current environment. This encoder saved me a lot of time and effort.
scikit-survival GitHub
OneHotEncoder Documentation
OneHotEncoder now has a method get_feature_names (renamed get_feature_names_out in scikit-learn 1.0, with the old name removed in 1.2). You can use input_features=data.columns to match to the training data.