I have just started using scikit-learn for machine learning in Python. I want to do unsupervised learning on labelled text data. However, I cannot figure out what format the .csv file containing the dataset should have before I convert it to a NumPy array. For example:
I stored each string under the column 'String' and a boolean value denoting whether the string is acceptable under the column 'Status'. I understand this kind of labelling is wrong, but I haven't found articles that clearly specify what to do.
Try importing one of the example datasets e.g.:
from sklearn.datasets import make_moons
make_moons()
The sklearn example code is pretty good in my experience, e.g. this one
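To see what format scikit-learn expects, it can help to inspect what one of these generators returns. The following is a minimal sketch; the argument values (n_samples, noise, random_state) are arbitrary choices for illustration.

```python
from sklearn.datasets import make_moons

# make_moons returns a feature matrix X and a label vector y.
X, y = make_moons(n_samples=100, noise=0.1, random_state=0)

print(X.shape)  # one row per sample, one column per feature
print(y.shape)  # one label per sample
print(X[:3])    # first three samples
print(y[:3])    # their class labels
```

Whatever the .csv looks like, the goal is to end up with the same two objects: a 2D array of numeric features and a 1D array of labels.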
I have a number of datasets and have named each one with a number. Say there are 20 datasets, so the files are 1.csv, 2.csv and so on.
I want to give a number (the name of the dataset) as input so that my code reads and works on that dataset. How do I make that possible?
I tried taking the input, converting it to a string and calling pandas read_csv(string+".csv"), but the code is not working.
Can anyone help out?
pandas read_csv(string+".csv")
I have done this and it works; I had to convert the integer to a string first.
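The same idea can be wrapped in a small helper. This is only a sketch: the function name load_numbered_dataset and the folder argument are hypothetical, and it assumes the files are named 1.csv, 2.csv and so on, as in the question.

```python
from pathlib import Path

import pandas as pd


def load_numbered_dataset(number, folder="."):
    """Read '<number>.csv' from folder (hypothetical helper)."""
    # int() accepts both 3 and "3", and rejects input that is not a number.
    path = Path(folder) / f"{int(number)}.csv"
    return pd.read_csv(path)


# Example usage (commented out so the sketch does not block on input):
# df = load_numbered_dataset(input("Dataset number: "))
```

Building the file name with an f-string avoids the "integer plus string" error that makes the explicit conversion necessary.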
Okay, so I am doing research on how to do time-series prediction. As always, preprocessing the data is the difficult part. I understand that I have to convert the "time-stamp" in a data file into a "datetime" or "timestep", and I did that.
df = pd.read_csv("airpassengers.csv")
month = pd.to_datetime(df['Month'])
(I may have parsed the datetime incorrectly; I have seen people use pd.read_csv() to parse the dates instead. If I did, please advise on how to do it properly.)
I also understand the part where I scale my data. (Could someone explain to me how the scaling works? I know that it maps all my data into the range I give it, but would the output of my prediction also be scaled?)
Lastly, once I have scaled and parsed the data and timestamps, how would I actually predict with the trained model? I don't know what to pass to (for example) model.predict()
I did some research and it seems like I have to shift my dataset or something. I don't really understand what the documentation is saying, and the example isn't directly related to time-series prediction.
I know this is a lot, and you might not be able to answer all the questions. I am fairly new to this. Just help with whatever you can. Thank you!
So, because you're working with airpassengers.csv and asking about predictive modeling, I'm going to assume you're working through this github
There are a couple of things I want to make sure you know before I dive into the answers to your questions.
There are lots of different types of predictive models used in forecasting. You can find all about them here
You're asking a lot of broad questions, but I'll break down the main questions into two steps and describe what's happening using the example that I believe you're trying to replicate
Let's break it down
Loading and parsing the data
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
%matplotlib inline
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 15, 6
air_passengers = pd.read_csv("./data/AirPassengers.csv", header = 0, parse_dates = [0], names = ['Month', 'Passengers'], index_col = 0)
This section of code loads the data from a .csv (comma-separated values) file and saves it into the data frame air_passengers. The arguments to read_csv state that the header is in the first row (header = 0), that the first column contains dates to parse (parse_dates = [0]), that the columns are named 'Month' and 'Passengers' (names), and that the first column is used as the index of the data frame (index_col = 0).
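To check that the dates were parsed, you can look at the dtype of the index. The sketch below uses a few inline rows instead of the real file, so the sample values and the io.StringIO stand-in are assumptions for illustration only.

```python
import io

import pandas as pd

# A tiny stand-in for the real .csv file (illustrative values).
sample = io.StringIO(
    "Month,Passengers\n"
    "1949-01,112\n"
    "1949-02,118\n"
    "1949-03,132\n"
)

# Parse the 'Month' column as dates and use it as the index.
df = pd.read_csv(sample, parse_dates=["Month"], index_col="Month")

print(df.index.dtype)  # a datetime64 dtype means the parse worked
print(df.head())
```

If the dtype comes back as object instead, the date strings were not recognised and pd.to_datetime with an explicit format is the usual fallback.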
Scaling the data
log_air_passengers = np.log(air_passengers.Passengers)
This is done to make the math make sense. Logarithms are the inverse of exponentiation (if 2^x = y, then log2(y) = x). numpy's log function gives us the natural log (log base e). Predictions made on the logged data are on the log scale too, so apply np.exp to bring them back to passenger counts. Once the logged values are differenced, as in the next step, they are close enough to percent changes that you can use them as such.
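A small sketch of both points, using made-up passenger counts rather than the real dataset: np.exp undoes np.log, and the difference of consecutive logs is close to the percent change.

```python
import numpy as np
import pandas as pd

# Illustrative values only, not the real dataset.
passengers = pd.Series([112.0, 118.0, 132.0, 129.0])

log_passengers = np.log(passengers)

# np.exp is the inverse of np.log, so this recovers the original scale.
restored = np.exp(log_passengers)

# Difference of logs versus the actual fractional change.
log_change = log_passengers.diff().dropna()
pct_change = passengers.pct_change().dropna()

print(pd.DataFrame({"log_change": log_change, "pct_change": pct_change}))
```

The approximation is good for small changes and gets worse as the jumps between consecutive points grow.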
Now that the data has been scaled, we can prep it for statistical modeling
log_air_passengers_diff = log_air_passengers - log_air_passengers.shift()
log_air_passengers_diff.dropna(inplace=True)
This replaces each value in the series with the difference between that data point and the previous one, instead of the log value itself. The first row has no previous point, so it becomes NaN and is dropped.
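Because a model fitted on the differenced series predicts differences, its output has to be transformed back. The sketch below shows the inverse on made-up values: a cumulative sum undoes the differencing, and np.exp undoes the log. It stands in for the idea only and is not taken from the linked example.

```python
import numpy as np
import pandas as pd

# Illustrative values only, not the real dataset.
passengers = pd.Series([112.0, 118.0, 132.0, 129.0])

log_passengers = np.log(passengers)
log_diff = (log_passengers - log_passengers.shift()).dropna()

# Undo the differencing: add the running total of the differences
# to the first log value, then undo the log with np.exp.
restored_log = log_diff.cumsum() + log_passengers.iloc[0]
restored = np.exp(restored_log)

print(restored)  # matches passengers from the second point onward
```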
The last part of your question contains too many steps to cover here. It is also not as simple as calling a single function. I encourage you to learn more from here
I've never used python before and I find myself in the dire need of using sklearn module in my node.js project for machine learning purposes.
I have been all day trying to understand the code examples in said module and now that I kind of understand how they work, I don't know how to use my own data set.
Each of the built-in data sets has its own function (load_iris, load_wine, load_breast_cancer, etc.) and they all load data from a .csv and an .rst file. I can't find a function that will allow me to load my own data set. (There's a load_data function, but it seems to be for internal use by the three I mentioned, because I can't import it.)
How could I do that? What's the proper way to use sklearn with any other data set? Does it always have to be a .csv file? Could it be programmatically provided data (array, object, etc)?
In case it's important: all those built-in data sets have numeric features, my data set has both numeric and string features to be used in the decision tree.
Thanks
You can load whatever you want and then use sklearn models.
If you have a .csv file, pandas would be the best option.
import pandas as pd
mydataset = pd.read_csv("dataset.csv")
X = mydataset.values[:, 0:10]  # let's assume that the first 10 columns are the features/variables
y = mydataset.values[:, 10]  # let's assume that the 11th column (index 10) has the target values/classes
...
sklearn_model.fit(X,y)
Similarly, you can load .txt or .xls files.
The important thing in order to use sklearn models is this:
X should always be a 2D array with shape [n_samples, n_variables]
y should be the target variable.
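The data does not have to come from a file at all. Below is a minimal sketch with a DataFrame built in code that mixes a numeric and a string feature; the column names and values are made up, and one-hot encoding the string column with pd.get_dummies is one possible choice, not the only one.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data built in code rather than read from a file.
mydataset = pd.DataFrame({
    "age": [25, 40, 31, 58],
    "colour": ["red", "blue", "red", "green"],
    "label": [0, 1, 0, 1],
})

# sklearn trees need numeric input, so turn the string column
# into one 0/1 column per category.
X = pd.get_dummies(mydataset[["age", "colour"]], columns=["colour"], dtype=float)
y = mydataset["label"]

model = DecisionTreeClassifier(random_state=0)
model.fit(X, y)

predictions = model.predict(X)
print(predictions)
```

Selecting columns by name instead of by position makes it harder to pick the wrong target column by accident.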
I'm using the sample Python Machine Learning "IRIS" dataset (for starting point of a project). These data are POSTed into a Flask web service. Thus, the key difference between what I'm doing and all the examples I can find is that I'm trying to load a Pandas DataFrame from a variable and not from a file or URL which both seem to be easy.
I extract the IRIS data from the Flask's POST request.values. All good. But at that point, I can't figure out how to get the pandas dataframe like the "pd.read_csv(....)" does. So far, it seems the only solution is to parse each row and build up several series I can use with the DataFrame constructor? There must be something I'm missing since reading this data from a URL is a simple one-liner.
I'm assuming reading a variable into a Pandas DataFrame should not be difficult since it seems like an obvious use-case.
I tried wrapping with io.StringIO(csv_data), then following up with read_csv on that variable, but that doesn't work either.
Note: I also tried things like ...
data = pd.DataFrame(csv_data, columns=['....'])
but got nothing but errors (for example, "constructor not called correctly!")
I am hoping for a simple method to call that can infer the columns and names and create the DataFrame for me, from a variable, without me needing to know a lot about Pandas (just to read and load a simple CSV data set, anyway).
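For reference, a minimal sketch of the io.StringIO route. It assumes csv_data is CSV text (a str, or UTF-8 bytes) whose first line is a header row; the helper name dataframe_from_csv_text and the sample rows are hypothetical. If the variable holds something else, such as a dict of form fields, it has to be turned into text first.

```python
import io

import pandas as pd

# Illustrative CSV text, standing in for the posted data.
csv_data = (
    "sepal_length,sepal_width,petal_length,petal_width,species\n"
    "5.1,3.5,1.4,0.2,setosa\n"
    "7.0,3.2,4.7,1.4,versicolor\n"
)


def dataframe_from_csv_text(csv_data):
    """Build a DataFrame from CSV text held in a variable (hypothetical helper)."""
    if isinstance(csv_data, bytes):
        # Assumes UTF-8; adjust if the request uses another encoding.
        csv_data = csv_data.decode("utf-8")
    # read_csv accepts any file-like object, so wrap the text in StringIO.
    return pd.read_csv(io.StringIO(csv_data))


df = dataframe_from_csv_text(csv_data)
print(df.dtypes)
```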
First, thanks for reading, and thanks a lot if you can give any clue to help me solve this.
As I'm new to scikit-learn, don't hesitate to provide any advice that can help me improve the process and make it more professional.
My goal is to classify data between two categories. I would like to find a solution that would give me the most precise result. At the moment, I'm still looking for the most suitable algorithm and data preprocessing.
In my data I have 24 values : 13 are nominal, 6 are binarized and the others are continuous. Here is an example of a line
"RENAULT";"CLIO III";"CLIO III (2005-2010)";"Diesel";2010;"HOM";"_AAA";"_BBB";"_CC";0;668.77;3;"Fevrier";"_DDD";0;0;0;1;0;0;0;0;0;0;247.97
I have around 900K lines for learning and I test on 100K lines
As I want to compare several algorithm implementations, I wanted to encode all the nominal values so that they can be used with several classifiers.
I tried several things:
LabelEncoder : this was quite good, but it gives me ordered values that would be misinterpreted by the classifier.
OneHotEncoder : if I understand correctly, it is nearly perfect for my needs because I could select the columns to binarize. But as I have a lot of nominal values, it always ends in a MemoryError. Moreover, its input must be numerical, so it is compulsory to LabelEncode everything first.
StandardScaler : this is quite useful, but not for what I need. I decided to use it to scale my continuous values.
FeatureHasher : at first I didn't understand what it does. Then I saw that it is mainly used for text analysis. I tried to use it for my problem. I cheated by creating a new array containing the result of the transformation. I think it was not built to work that way, and it was not even logical.
DictVectorizer : could be useful, but it looks like OneHotEncoder and puts even more data in memory.
partial_fit : this method is provided by only 5 classifiers. I would like to be able to do it with Perceptron, KNearest and RandomForest at least, so it doesn't match my needs
I looked at the documentation and found this information on the Preprocessing and Feature Extraction pages.
I would like to have a way to encode all the nominal values so that they will not be considered as ordered. The solution should work on large datasets with a lot of categories and limited resources.
Is there any way I didn't explore that can fit my needs?
Thanks for any clue and piece of advice.
To convert unordered categorical features you can try get_dummies in pandas; see its documentation for more details. Another way is to use catboost, which can handle categorical features directly without transforming them into a numerical type.
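A minimal sketch of the get_dummies route on a few made-up rows shaped like the example line. Passing sparse=True is an addition not mentioned above; it asks pandas to store the dummy columns as sparse arrays, which may help with memory when there are many categories.

```python
import pandas as pd

# Illustrative rows only; the column names are assumptions.
cars = pd.DataFrame({
    "brand": ["RENAULT", "PEUGEOT", "RENAULT"],
    "fuel": ["Diesel", "Essence", "Diesel"],
    "year": [2010, 2008, 2011],
})

# One column per category, with no implied order between categories.
# Columns not listed in `columns` (here 'year') are left unchanged.
encoded = pd.get_dummies(cars, columns=["brand", "fuel"], sparse=True)

print(list(encoded.columns))
print(encoded.dtypes)
```

Whether the sparse columns are accepted directly depends on the classifier, so some models may still need a conversion step before fitting.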