I have this dataset of agricultural raw material prices from 1990 to 2017, and I am trying to make some price predictions for the sake of learning:
Here are all the columns:
Now I want to split the dataset into training and test sets so I can apply some machine learning models to the prediction. However, it is not clear in my head what my target variable y should be, considering that each of the columns has its own prices and they are all independent of each other. How should I split this dataset if I want to make price predictions?
As I can see from your data, there are several raw material prices available for prediction. Since these raw material prices are independent of each other, you can create a dataset with just one dependent variable (for example Copra_Price) and the rest as independent variables, removing the other price-related columns from the data. Once you have this dataset, you can easily split it into train and test sets with Copra_Price as the target. This can be repeated for each of the price variables.
One more consideration: if none of the price variables has anomalies in it, then you could use any one of them to define the split, since a random selection on one of them would in all probability be a random selection across the group.
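A minimal sketch of what that could look like with pandas and scikit-learn (the file name and the price-column naming are just placeholders; since the data is a yearly time series you may prefer to split by year instead of randomly):

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("agri_raw_materials.csv")        # hypothetical file name

target = "Copra_Price"                            # the one price you want to predict
other_prices = [c for c in df.columns if c.endswith("_Price") and c != target]

X = df.drop(columns=[target] + other_prices)      # keep only non-price predictors
y = df[target]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)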
This is the dataset. I want to create a time series to forecast the last row (EURUSD).
Is it possible to forecast the last variable based on the other financial indicators present in the dataset?
You can use multiple linear regression for the prediction.
With your independent variables (interest rate, etc.) you can estimate the dependent variable (EURUSD in your case); a rough sketch is shown after the links below.
For further instructions and examples of how to write it, you can visit these links:
Basic One
More intuitive
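A rough sketch of such a regression with scikit-learn (the file and column names other than EURUSD are assumptions; shuffle=False keeps the time order intact for the hold-out set):

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("fx_indicators.csv")             # hypothetical file

X = df.drop(columns=["EURUSD"])                   # independent variables (interest rate, etc.)
y = df["EURUSD"]                                  # dependent variable

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

model = LinearRegression().fit(X_train, y_train)
print(model.coef_, model.intercept_)
print("R^2 on the hold-out period:", model.score(X_test, y_test))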
I am trying to get a sense of the relationship between some independent variables and a dependent variable, and to quantify their importance for that dependent variable. I came across methods like random forests that can quantify the importance of variables and then predict the outcome. However, I have an issue with the nature of the data to be used with a random forest or similar methods. An example of the data structure is provided below, and as you can see the time series has some variables, like population and age, that do not change with time, though they differ between the cities, while other variables such as temperature and number of internet users change through time and within the cities. My question is: how can I quantify the importance of these variables on the "Y" variable? BTW, I prefer to apply the method in a Python environment.
"How can I quantity the importance" is very common question also known as "feature-importance".
The feature importance depends on your model: with a regression the importance is in your coefficients; in a random forest you can use the built-in feature_importances_ (though some would not recommend it) or, better, SHAP values. Furthermore, you can use some correlation measure, e.g. Spearman/Pearson correlation between your features and your target.
Unfortunately there is no "free lunch"; you will need to decide based on what you want to use it for, what your data looks like, etc.
I think the one you came across might be Boruta, where you shuffle copies of your variables, add them to your dataset, and then create a threshold based on the "best shuffled variable" in a Random Forest.
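For the Python part, here is a small sketch of the simpler options above (the file and column names are guesses based on your description; SHAP would need the separate shap package):

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

df = pd.read_csv("cities.csv")                                       # hypothetical file with the panel data
X = df[["Population", "Age_mean", "Temperature", "Internet_users"]]  # placeholder feature names
y = df["Y"]

rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X, y)

# Built-in impurity-based importances (fast, but the ones some would not recommend)
print(pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False))

# Permutation importance (model-agnostic and usually more reliable)
perm = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
print(pd.Series(perm.importances_mean, index=X.columns).sort_values(ascending=False))

# Spearman correlation between each feature and the target
print(X.corrwith(y, method="spearman"))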
My idea is as follows. Your outcome variable 'Y' has only a few possible values. You can build a classifier (Random Forest is one of many existing classifiers) to predict, say, 'Y in [25-94, 95-105, 106-150]'. You will then have three different outcomes that are mutually exclusive. (Other interval limits than 95 and 105 are possible, if that better suits your application.)
Some of your predictive variables are time series whereas others are constant, as you explain. You could use a sliding-window technique where your classifier predicts 'Y' based on the time-related variables in, say, the month of January. It doesn't matter that some variables are constant, as the categorical variable 'City' simply has the four outcomes [City_1, City_2, City_3, City_4]. Similarly, use 'Population' and 'Age_mean' directly as variables.
Once you use classifiers, many approaches to feature ranking and feature selection have been developed. You can use a web service like insight classifiers to do it for you, or download a package like Weka for that.
The key point is that you organize your model and its predictive variables such that a classifier can learn correctly.
If city and month are also your independent variables, you should convert them from the index into columns. After using pandas to read your file, df.reset_index() can do the job for you, as in the sketch below.
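A tiny sketch (the file name and index columns are assumptions):

import pandas as pd

df = pd.read_csv("data.csv", index_col=["City", "Month"])   # hypothetical file / index names
df = df.reset_index()            # "City" and "Month" become ordinary columns again
print(df.columns)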
I've recently built a multi-class classification machine learning model with sklearn and I want to transfer the learnings from one dataset to another.
I have our first-party data (let's call it Sales), which includes the names of thousands of textbooks and the disciplines they belong to (i.e. Biology 101 (title) is a Biology (discipline) textbook). I was able to get the machine to fairly accurately predict the discipline of a textbook based on the title of the book.
I now have a second dataset which contains competitor textbook titles, but no disciplines. I want to have the machine guess the disciplines for the competitor textbooks based on what it learned from the Sales dataset.
The Sales Machine Learning model works well on the Sales side. So here is what I want to do:
1) Transfer the learnings from the Sales model to the Competitor set.
2) Export the results of that transfer to a CSV.
3) In order to build the machine learning model for Sales and Competitor I stripped out all other columns of data; ideally I'd like to export the predicted discipline for both datasets.
If anyone could even point me in the right direction of documentation on transferring my model I would appreciate it.
If you are already familiar with scikit-learn then this should be an easy task.
Here is some high-level pseudo-code:
sales_data = preprocess_data(raw_data_sales)             # normalization, vectorization, etc.
model = make_model()                                     # e.g. any scikit-learn classifier of your choice
model.fit(sales_data, sales_labels)                      # potentially with cross-validation, hyperparameter tuning, etc.
competitor_data = preprocess_data(competitor_raw_data)   # same preprocessing as for the training data
sales_predictions = model.predict(sales_data)
competitor_predictions = model.predict(competitor_data)
export_to_CSV(sales_predictions) # export predictions to CSV
export_to_CSV(competitor_predictions)
There is actually no need for 'transfer learning' here since you don't have any labels for your competitor data. What you would like to achieve sounds like simple inference.
export_to_CSV() could be a numpy (np.savetxt()) or a pandas (df.to_csv()) function, whichever you prefer. To map your non-numeric labels (the disciplines) back and forth between text and numbers you can use scikit-learn's LabelEncoder.
Note: since your data comes from two different sources and you can only train the model on your own sales data (you have no labels for your competitor data), the performance on the competitor data may be worse than on your sales data. If you did have additional labels from your competitor, this would become a transfer learning task, since you could take your initial model and continue training it.
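If it helps, here is a self-contained sketch of the pipeline above. The TF-IDF vectorizer, the logistic regression and the file/column names ("title", "discipline") are placeholders for whatever you actually used, not a statement about your model:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

sales = pd.read_csv("sales.csv")                  # columns: title, discipline
competitors = pd.read_csv("competitors.csv")      # column: title

# Vectorize the titles and fit a classifier on the labelled sales data
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(sales["title"], sales["discipline"])

# Plain inference on both datasets (no transfer learning needed)
sales["predicted_discipline"] = model.predict(sales["title"])
competitors["predicted_discipline"] = model.predict(competitors["title"])

# Export the predictions
sales.to_csv("sales_with_predictions.csv", index=False)
competitors.to_csv("competitor_predictions.csv", index=False)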
I used one dataset to train a Random Forest Regressor and now I have another dataset with a smaller number of features (a subset of the previous set).
Is there a function which allows you to get the list of names of the columns used during the training of the Random Forest Regressor model?
If not, is there a function which would assign Nulls to the missing columns?
Is there a function which allows you to get the list of names of the columns used during the training of the Random Forest Regressor model?
RF uses all features from your dataset. At each split, a tree considers only a random subset of the features, of size sqrt(num_of_features) or log2(num_of_features) or whatever you configure, so across all its trees RF usually covers every column of your dataset.
There may be an edge case when you use a small number of estimators in the RF and some features are never considered. In that case, RandomForestRegressor.feature_importances_ (a zero value may be an indicator here) or diving into each tree in RandomForestRegressor.estimators_ may help.
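For example, with a recent scikit-learn version (>= 1.0) you can inspect the training columns directly (sketch, assuming rf is your fitted RandomForestRegressor and it was fitted on a pandas DataFrame):

import pandas as pd

print(rf.n_features_in_)         # number of columns seen during fit
print(rf.feature_names_in_)      # their names, if the model was fitted on a DataFrame

# Importances are in the same order as the training columns
print(pd.Series(rf.feature_importances_, index=rf.feature_names_in_).sort_values(ascending=False))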
If not, is there a function which would assign Nulls to the missing columns?
RF does not accept missing values. Either you need to encode the missing value as a separate category (and use it during training too), or switch to something like XGBoost, which handles missing values natively.
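If you still want to align the smaller dataset with the training columns, pandas can add the missing ones as NaN; you then have to impute them before feeding them to scikit-learn's random forest (sketch, assuming new_data is the smaller DataFrame and rf the fitted regressor):

expected_cols = list(rf.feature_names_in_)                     # or the column list you saved at training time
new_data_aligned = new_data.reindex(columns=expected_cols)     # missing columns become NaN
new_data_filled = new_data_aligned.fillna(new_data_aligned.median())   # crude imputation, just as an example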
[Preface: I now realize I should've used a classification model (maybe decision tree) instead but I ended up using a linear regression model.]
I had a pandas dataframe as such:
And I want to predict audience score using genre, year, and Tomato-meter score. But as constructed, the genres for each movie come in a list, so I felt the need to isolate each genre and pass it into my model as a separate variable.
After doing so, my modified dataframe looks like this, with duplicate rows for each movie but each genre element of that movie isolated (just one movie pulled from the dataframe to show):
Now, my question is, can I pass in this second dataframe as is to Patsy and statsmodel linear regression, or will the row duplication introduce bias into my model?
from patsy import dmatrices

y1, X1 = dmatrices('Q("Audience Score") ~ Year + Q("Tomato-meter") + Genre',
                   data=DF2, return_type='dataframe')
In summary, I'm looking for a way for patsy and my model to recognize and treat each genre as a separate variable, but I want to make sure I'm not fudging the numbers/model by passing in a dataframe in this format (as not every movie has the same number of genres).
I see two problems with the approach:
Parameter estimates:
If there are different numbers of repeated observations, the weight on observations with multiple categories will be larger than on observations with only a single category. This could be corrected by using weights in the linear model: use WLS with weights equal to the inverse of the number of repetitions (or the square root of it?). Weights are not available for other models like Poisson, Logit or GLM-Binomial. This will not make a large difference for the parameter estimates if the "pattern", i.e. the underlying parameters, is not systematically different across movies with different numbers of categories.
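As a sketch of the weighting idea (assuming a hypothetical column n_genres in DF2 holding the number of repeated rows per movie, and y1, X1 from the dmatrices call above):

import statsmodels.api as sm

weights = 1.0 / DF2["n_genres"]               # inverse of the number of repetitions (hypothetical column)
res = sm.WLS(y1, X1, weights=weights).fit()
print(res.summary())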
Inference, standard error of parameter estimates:
All basic models like OLS, Poisson and so on assume that each row is an independent observation. The total number of rows will be larger than the number of actual observations, and the estimated standard errors of the parameters will be underestimated. (We could use cluster-robust standard errors, but I never checked how well they work with duplicate observations, i.e. when the response is identical across several observations.)
Alternative
As an alternative to repeating observations, I would encode the categories into non-exclusive dummy variables. For example, if we have three levels of the categorical variable, movie categories in this case, then we add a 1 in each corresponding column if the observation is "in" that category.
Patsy doesn't have premade support for this, so the design matrix for the movie category would need to be built by hand or as the sum of the individual dummy matrices (without dropping a reference category).
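One way to build such a design matrix by hand (sketch; assumes the original, non-duplicated DataFrame DF has a "Genre" column holding a list of genres per movie):

import pandas as pd
import statsmodels.api as sm

genre_dummies = DF["Genre"].apply(lambda g: pd.Series(1, index=g)).fillna(0).astype(int)
# or, if Genre is a single string like "Mystery|Comedy":
# genre_dummies = DF["Genre"].str.get_dummies(sep="|")

X_design = sm.add_constant(pd.concat([DF[["Year", "Tomato-meter"]], genre_dummies], axis=1))
# note: no reference genre is dropped, the constant stays in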
Alternative model
This is not directly related to the issue of multiple categories in the explanatory variables.
The response variable, movie rating, is bounded between 0 and 100. A linear model will work well as a local approximation, but it will not take into account that observed ratings lie in a limited range and will not enforce it in prediction.
Poisson regression could be used to take the non-negativity into account, but it wouldn't use the upper bound. Two alternatives that would be more appropriate are a GLM with Binomial family and a total count for each observation set to 100 (the maximum possible rating), or a binary model, e.g. Logit or Probit, after rescaling the ratings to be between 0 and 1.
The latter corresponds to estimating a model for proportions, which can be estimated with the statsmodels binary response models. To have inference that is correct even if the data are not binary, we can use robust standard errors. For example:
result = sm.Logit(y_proportion, x).fit(cov_type='HC0')
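And the Binomial GLM variant mentioned above, again just as a sketch with the same hypothetical column names:

import numpy as np
import statsmodels.api as sm

score = DF2["Audience Score"]
x = sm.add_constant(DF2[["Year", "Tomato-meter"]])

# GLM with Binomial family: "successes" and "failures" out of a total count of 100
endog = np.column_stack([score, 100 - score])
result_glm = sm.GLM(endog, x, family=sm.families.Binomial()).fit()

# Logit on proportions with robust standard errors, as in the line above
result_logit = sm.Logit(score / 100.0, x).fit(cov_type='HC0')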
Patsy doesn't have any built-in way to separate out a "multi-category" like your Genre variable, and as far as I know there's no direct way to represent it in Pandas either.
I'd break Genre into a bunch of boolean columns, one per category: Mystery = True/False, Comedy = True/False, etc. That fits better with both pandas's and patsy's way of representing things.