I have a dataset that contains customers' visit history.
It has three columns: customer ID, AM/PM (whether the visit was in the AM or PM), and Weekday/Weekend (whether the visit was on a weekday or the weekend).
I want to learn from this dataset and select the top 50 customers who are most likely to visit for a specified input (e.g., AM / Weekday).
For now, I build a model for each customer using a one-class SVM (I only have positive (visit) data). Since the one-class SVM gives only a binary output, I can only tell whether a given customer will visit for the specified input, rather than rank and select the top 50 customers.
I was wondering if there is an algorithm that can learn from a positive-only dataset and give a score or probability-like output?
That's a recognized subproblem within machine learning, usually called one-class classification. You can learn a lot by reading this survey: "One-Class Classification: Taxonomy of Study and Review of Techniques" (http://arxiv.org/pdf/1312.0049.pdf). Hope it helps.
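If you stay with the one-class SVM, note that scikit-learn's OneClassSVM also exposes a continuous score via decision_function, which you could use to rank customers instead of relying only on the binary predict output. A minimal sketch; the 0/1 feature encoding and the toy data are my assumptions, not part of your setup:

import numpy as np
from sklearn.svm import OneClassSVM

# Assumed encoding: each visit is [AM=0/PM=1, Weekday=0/Weekend=1]
visits = np.array([[0, 0], [0, 0], [1, 0], [0, 1]])   # one customer's visit history

model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1)
model.fit(visits)

query = np.array([[0, 0]])                 # e.g. AM / Weekday
score = model.decision_function(query)[0]  # signed distance to the boundary; higher = more typical for this customer

# Fit one model per customer, score each customer on the query, and keep the 50 highest scores.

Note that this score is not a calibrated probability, but it is enough for ranking.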
I've recently built a multi-class classification machine learning model with sklearn, and I want to transfer the learnings from one dataset to another.
I have our first-party data (let's call it Sales), which includes the titles of thousands of textbooks and the disciplines they belong to (e.g., Biology 101 (title) is a Biology (discipline) textbook). I was able to get the model to fairly accurately predict the discipline of a textbook based on the title of the book.
I now have a second dataset that contains competitor textbook titles, but no disciplines. I want the model to guess the disciplines for the competitor textbooks based on what it learned from the Sales dataset.
The Sales machine learning model works well on the Sales side. So here is what I want to do:
1) Transfer the learnings from the Sales model to the Competitor set.
2) Export the results of that transfer to a CSV.
3) To build the model I stripped all other columns from the Sales and Competitor data; ideally, I'd like to export the predicted discipline for both datasets.
If anyone could even point me in the right direction of documentation on transferring my model, I would appreciate it.
If you are already familiar with scikit-learn then this should be an easy task.
Here is some high-level pseudo-code:
sales_data = preprocess_data(raw_data_sales)            # normalization, vectorization, etc.
model = SomeClassifier()                                 # any scikit-learn estimator, e.g. LogisticRegression
model.fit(sales_data, sales_labels)                      # potentially with cross-validation, hyperparameter tuning, etc.
competitor_data = preprocess_data(competitor_raw_data)   # same preprocessing as for the training data
sales_predictions = model.predict(sales_data)
competitor_predictions = model.predict(competitor_data)
export_to_CSV(sales_predictions)                         # export predictions to CSV
export_to_CSV(competitor_predictions)
There is actually no need for 'transfer learning' here, since you don't have any labels for your competitor data. What you'd like to achieve sounds like simple inference.
export_to_CSV() could be a numpy (np.savetxt()) or a pandas (df.to_csv()) function, whichever you prefer. To map your non-numeric labels (the disciplines) back and forth between text and numbers, you can use scikit-learn's LabelEncoder.
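As a more concrete illustration, here is a minimal sketch of the whole flow; the column names ('title', 'discipline'), the file names, and the TF-IDF plus logistic regression model are assumptions for the example, not a prescription:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder

# Assumed columns: 'title' (textbook title) and, in the Sales data only, 'discipline' (label)
sales = pd.read_csv("sales.csv")
competitors = pd.read_csv("competitors.csv")

le = LabelEncoder()
y = le.fit_transform(sales["discipline"])

# TF-IDF on the titles plus a linear classifier is a reasonable text baseline
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(sales["title"], y)

# Predict on both datasets and map the numeric predictions back to discipline names
sales["predicted_discipline"] = le.inverse_transform(model.predict(sales["title"]))
competitors["predicted_discipline"] = le.inverse_transform(model.predict(competitors["title"]))

sales.to_csv("sales_with_predictions.csv", index=False)
competitors.to_csv("competitor_with_predictions.csv", index=False)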
Note: since your data comes from two different sources and you can only train the model on your own Sales data (you have no labels for the competitor data), performance on the competitor data may be worse than on your Sales data. If you did have additional labels from your competitor, this would become a transfer learning task, since you could take your initial model and continue training it.
First of all, I must say I'm a beginner in AI. I followed most of the tutorials about stock market prediction, and all of them are pretty much the same. These tutorials use a dataset and split it into two sets: the first is the training set and the second is the test set. They use the closing price of the stocks to train a model. They then feed the test set, which also contains the closing price, into that model and show two graphs. Then they say the actual and the predicted graphs are pretty much the same.
The GitHub repo of the tutorial:
https://github.com/surajr/Stock-Predictor-using-LSTM/blob/master/Stock-Predictor-using-LSTM.ipynb
These are my questions:
1. Why do all those tutorials put the closing price in the test set as well? Aren't they only supposed to insert dates, since we are predicting the closing price? This is confusing; please explain.
2. No one explains how to predict the next 7 days' values. Given a trained model, how do I get the closing values for the next 7 days?
Please help me to clarify this. Thanks a lot.
Take a look at this link. I think it will get you going in the right direction.
https://www.datacamp.com/community/tutorials/lstm-python-stock-market
Why do all those tutorials put the closing price in the test set as well?
The ultimate goal is to predict the movement (growth), which is the closing price minus the opening price. The best model is the one whose calculated growth on the test set is closest to the actual growth. Growth is the main quantity the model is trying to solve for, and it is the point of reference when you calculate the accuracy of the trained model.
Aren't they only supposed to insert dates, since we are predicting the closing price?
The model predicts the growth based on given factors. For a company, you have many factors that are quantified per day. I suspect the tutorial you followed uses a test set extracted for one particular day across different stocks, e.g., extracting all parameters for all companies on January 10th only and then checking how accurate the trained model is. The training set, on the other hand, usually contains the stock data for more than one day.
No one explains how to predict the next 7 days' values. Given a trained model, how do I get the closing values for the next 7 days?
To predict the stock price relatively accurately, you need a well-trained model. To get one, you need to train your model on many, many factors. The same model cannot predict stocks in different countries, and one model might be suitable for predicting technology stocks (e.g., AAPL) but not stocks in other sectors.
Overall, this is a complicated subject. Financial advisers pay a massive amount of money just to use reliable models, and most of them use multiple models depending on their clients' portfolios. These tutorials introduce the subject to you and teach you the main concepts. IMHO, the next step would be more learning and then competing on Kaggle.
In the training set, the closing value is included as an input because it is relevant to the next day's price, or the price in X days (for models that predict price movement over more than 1 day).
Note that in the training data, the future price (today + 1 day) is typically the target value (train_Y).
In the test data, the closing price is included because the model is again predicting the future price from it.
To determine the accuracy of the model, the price prediction for (today + X days) is compared against the actual future value (test_Y) to determine the effectiveness of the prediction. Just like a human stock trader, if you are guessing/predicting whether the FUTURE price will be Y (i.e., up or down), you would have access to the current day's closing price, which is why it is a relevant input. Obviously, in a real-world model, the accuracy of the prediction would only be known AFTER X days pass. When training and then testing a model, the data is typically historical, so the out-of-sample value (the price at today + X days) is used for the accuracy determination, though that FUTURE value should definitely not be an input.
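As a small illustration of this target construction, here is a minimal pandas sketch; the file name, the 'Close' column name, and the 1-day horizon are assumptions for the example:

import pandas as pd

# Assumed: a CSV with a 'Close' column of daily closing prices
df = pd.read_csv("prices.csv")

horizon = 1                                  # predict the price 'horizon' days ahead
df["target"] = df["Close"].shift(-horizon)   # the future price becomes train_Y
df = df.dropna()                             # drop the last rows that have no future price

X = df[["Close"]].to_numpy()   # today's close is an input feature (train_X)
y = df["target"].to_numpy()    # the future close is the target (train_Y)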
Why do all those tutorials put the closing price in the test set as well?
-> The closing price is an input variable that the model needs in order to calculate the next stock price.
Looking at the code, it seems to predict the stock price from a 22-day history:
X_train (1173, 22, 3)
y_train (1173,)
X_test (130, 22, 3)
y_test (130,)
I think you should re-train with shape (~~~, 7, 3) to predict the price 7 days after today.
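An alternative to changing the input window is to keep the 22-day history and make the network output all 7 future closing prices at once. A minimal Keras sketch under that assumption; the layer sizes are arbitrary, and the random arrays only stand in for the real sliding windows from the notebook:

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

# Dummy data with the shapes from the notebook; replace with the real windows.
# y now has 7 columns: the closing prices for the 7 days following each window.
X_train = np.random.rand(1173, 22, 3)
y_train = np.random.rand(1173, 7)

model = Sequential([
    LSTM(64, input_shape=(22, 3)),
    Dense(7)                      # one output per future day
])
model.compile(optimizer="adam", loss="mse")
model.fit(X_train, y_train, epochs=20, batch_size=32)

next_7_days = model.predict(X_train[-1:])   # 7 predicted closing values for the last window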
I have a dataset with daily observations of sales for 1000 company shops over the last 3 years (apart from the sales figure itself, I have features such as promotions, shop type, assortment type, etc.).
The goal is to build a model to predict future sales. How would you build a model from 1000 time series and generalize it such that it could be used to predict sales for 1 shop with certain features?
The dataset is similar to : https://www.kaggle.com/c/rossmann-store-sales/notebooks.
Based on the solutions (in Python) provided on Kaggle for this dataset, I noticed that nearly everyone is using XGBoost, but I have some doubts about these solutions and I'd be thankful for some clarification. In particular:
1. How can people just load the data, with daily observations for over 1000 shops over 3.5 years for each shop, into the model without one-hot encoding the store IDs first? Isn't the model going to fail at some point because it will learn that shop #1040 is better than shop #35 just because of the shop ID?
2. If we used traditional one-hot encoding, it would create 1000 new columns, which is unmanageable; nevertheless, is there a way of solving this problem with one-hot encoding?
3. Why are people extracting the "date" feature by adding day, week, and month as separate variables? Isn't that misleading to the model? Why aren't people assigning the "date" as an index instead?
1 - This explains data preparation for XGBoost. You could be hurting the model's ability to learn by not one-hot encoding, but it seems like you have enough features that your model can still learn the correct correlations.
2 - You pretty much answer this question yourself. The one-hot encoding approach would explode the number of features you have, overwhelming the model, slowing the training process, and, worst of all, it might not help that much.
3 - This is a common feature engineering approach to give the model more information to work with while not completely overshadowing other features. For example, Store A might make most of its sales on Mondays at 5.
Basically you have to try it to see if it works.
I'm sure your fellow Kagglers have tried many more combinations of feature engineering, one-hot encoding of categorical data, different models, and more.
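To make points 2 and 3 concrete, here is a minimal pandas/scikit-learn sketch; the column names 'Date' and 'Store' follow the Rossmann data, and the rest is an assumption for illustration:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.read_csv("train.csv", parse_dates=["Date"])

# Point 3: turn the date into numeric features the model can use, instead of keeping it as an index
df["DayOfWeek"] = df["Date"].dt.dayofweek
df["Month"] = df["Date"].dt.month
df["WeekOfYear"] = df["Date"].dt.isocalendar().week.astype(int)

# Point 2: one-hot encode the store ID as a sparse matrix, so 1000+ columns stay cheap to store
encoder = OneHotEncoder(handle_unknown="ignore")
store_onehot = encoder.fit_transform(df[["Store"]])   # sparse output, not 1000 dense columns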
I have 2 years of historical data on customers, items ordered, and the number of orders. Based on this data, I am trying to predict future sales at the customer-item level. I tried an ARIMA model, which didn't give me the expected results. Any suggestions or implementation references? I am interested in trying an LSTM and am looking for good retail references.
ARIMA models can be challenging to work with as there are a lot of parameters which need to be set in a reasonable manner, especially if you're trying to model seasonality.
Here's a decent starter discussion (with code) on using an LSTM for retail sales problem: https://machinelearningmastery.com/time-series-forecasting-long-short-term-memory-network-python/
Also, consider working to identify some dependent variables which may impact sales to add as auxiliary inputs. For example - some items could sell more during certain seasons, around national holidays, etc.
Forecasting at this level of detail is always challenging, so the more variables you can find that explain the observed phenomena, the better you may be able to do.
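If you do stay in the ARIMA family, the auxiliary-input idea can be expressed directly as exogenous regressors in a seasonal ARIMA (SARIMAX) model. A minimal statsmodels sketch with entirely made-up data and arbitrary model orders, just to show the mechanics:

import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Made-up daily order counts for one customer-item pair, plus a 0/1 holiday flag as an exogenous input
rng = pd.date_range("2022-01-01", periods=730, freq="D")
orders = pd.Series(np.random.poisson(5, len(rng)), index=rng)
holiday = pd.Series((rng.month == 12).astype(int), index=rng)   # toy holiday indicator

model = SARIMAX(orders, exog=holiday, order=(1, 0, 1), seasonal_order=(1, 0, 1, 7))
fit = model.fit(disp=False)

future_holiday = np.zeros((30, 1))                     # exogenous values for the forecast horizon
forecast = fit.forecast(steps=30, exog=future_holiday)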
I have a dataframe that contains products, with features such as: brand, cat1, cat2, cat3, city, desc, image_count, mileage, price, title, year.
The goal is to predict the category of each product. I have 1 billion training rows, and the most important features for prediction are the title and description, which are text.
I'd like to know which algorithm is best for this prediction. I'm a beginner in machine learning and am confused by the different algorithms. Thanks.
This question fits here
But just to give you a start, you should look into concepts like:
Decision Trees
SVMs
Logistic regression
Also, while creating a model, please keep in mind:
Overfitting
Hyperparameters (learning rate, epochs, dropout, etc.)
Performance evaluation (accuracy, precision, recall, etc.)
Good video playlists for beginner-level machine learning are available here and here.
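Since the most informative features are the title and description text, a reasonable first baseline is a bag-of-words/TF-IDF representation fed into a linear classifier. A minimal scikit-learn sketch; the column names and file name are assumptions, and at the scale of a billion rows you would train on a sample or move to an out-of-core approach, which is why an SGD-trained linear model is used here:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Assumed columns: 'title' and 'desc' (text) and 'category' (the label to predict)
df = pd.read_csv("products.csv")
text = df["title"].fillna("") + " " + df["desc"].fillna("")

X_train, X_test, y_train, y_test = train_test_split(
    text, df["category"], test_size=0.2, random_state=42
)

# TF-IDF plus a linear model trained with SGD scales well to very large text datasets
model = make_pipeline(TfidfVectorizer(max_features=200_000), SGDClassifier())
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))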