use python's sklearn module with custom dataset - python

I've never used Python before, and I find myself in dire need of using the sklearn module in my node.js project for machine learning purposes.
I have spent all day trying to understand the code examples in said module, and now that I sort of understand how they work, I don't know how to use my own data set.
Each of the built-in data sets has its own function (load_iris, load_wine, load_breast_cancer, etc.), and they all load data from a .csv and an .rst file. I can't find a function that will allow me to load my own data set. (There's a load_data function, but it seems to be for internal use by the three I mentioned, because I can't import it.)
How could I do that? What's the proper way to use sklearn with any other data set? Does it always have to be a .csv file? Could the data be provided programmatically (array, object, etc.)?
In case it's important: all those built-in data sets have numeric features, while my data set has both numeric and string features to be used in the decision tree.
Thanks

You can load whatever you want and then use sklearn models.
If you have a .csv file, pandas would be the best option.
import pandas as pd
mydataset = pd.read_csv("dataset.csv")
X = mydataset.values[:,0:10] # let's assume that the first 10 columns (indices 0-9) are the features/variables
y = mydataset.values[:,10] # let's assume that the 11th column (index 10) holds the target values/classes
...
sklearn_model.fit(X,y)
Similarly, you can load .txt or .xls files.
The important thing when using sklearn models is this:
X should always be a 2D array with shape [n_samples, n_features]
y should be the target variable.
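Since your data set mixes numeric and string features and sklearn's decision trees expect numeric input, the string columns need to be encoded first. A minimal sketch using pandas' get_dummies for one-hot encoding (the file name "dataset.csv" and the column name "target" are placeholders):
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

mydataset = pd.read_csv("dataset.csv")                  # placeholder file name
y = mydataset["target"]                                 # placeholder target column
X = pd.get_dummies(mydataset.drop(columns=["target"]))  # one-hot encodes string columns, leaves numeric ones unchanged

model = DecisionTreeClassifier()
model.fit(X, y)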

Related

Slicing a percentage of data points within the tables of a nested dictionary

I am trying to run a time-series analysis on a set of coral growth datasets. It is one big nested dictionary that I loaded in from a multi-tabbed Excel file:
dataset = pd.read_excel('path/to/file.xls', sheet_name=None)  # sheet_name=None returns a dict of DataFrames, one per tab
Each dataset within the dictionary contains info for a different coral, with columns for the year, extension, density, and calcification.
Structured like this: [screenshot of the data structure]
I converted it to an array with:
import numpy as np
dataset1 = dataset.items()
datalist = list(dataset1)
data_array = np.array(datalist, dtype=object)
My goal is to slice it into training and testing sets - easy and doable, done with:
from sklearn.model_selection import train_test_split
train, test = train_test_split(list(dataset.values()), train_size=0.8)
The part I'm struggling with is that I also want to slice up to a specific percentage of each dataframe in the dictionary, i.e. the first 90% of rows from each coral to use as my X_train and the last 10% to use as my y_train, but I cannot figure out how to make it work. I have tried a lot of suggestions from googling and looking through answers on here, but I can't seem to find anything that helps. I am fairly new to the programming world, so I might be missing something simple here too.
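For what it's worth, one way to take the first 90% of rows from each DataFrame, sketched under the assumption that `dataset` is the dict of per-coral DataFrames returned by read_excel (note that a row-wise 90/10 split is really a train/test split; X and y would still be column selections within each part):
first_parts, last_parts = {}, {}
for name, df in dataset.items():
    cutoff = int(len(df) * 0.9)           # row index at the 90% mark
    first_parts[name] = df.iloc[:cutoff]  # first 90% of rows
    last_parts[name] = df.iloc[cutoff:]   # last 10% of rows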

Kubeflow, passing Python dataframe across components?

I am writing a Kubeflow component which reads an input query and creates a dataframe, roughly as:
from kfp.v2.dsl import component

@component(...)
def read_and_write():
    # read the input query
    # transform it to a dataframe
    sql.to_dataframe()
I was wondering how I can pass this dataframe to the next operation in my Kubeflow pipeline.
Is this possible? Or do I have to save the dataframe as a CSV or in another format and then pass its output path along?
Thank you
You need to use the concept of an Artifact. Quoting the documentation:
Artifacts represent large or complex data structures like datasets or models, and are passed into components as a reference to a file path.
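A minimal sketch of that pattern in KFP v2, assuming the dataframe can round-trip through CSV (the component and variable names here are made up):
from kfp.v2.dsl import component, Dataset, Input, Output

@component(packages_to_install=["pandas"])
def make_dataframe(output_data: Output[Dataset]):
    import pandas as pd
    df = pd.DataFrame({"a": [1, 2, 3]})       # stand-in for your query result
    df.to_csv(output_data.path, index=False)  # write to the artifact's file path

@component(packages_to_install=["pandas"])
def use_dataframe(input_data: Input[Dataset]):
    import pandas as pd
    df = pd.read_csv(input_data.path)         # read it back from the artifact path
In the pipeline you would then wire use_dataframe(input_data=make_dataframe().outputs["output_data"]), and KFP handles materializing the file between steps.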

Preprocessing data for Time-Series prediction

Okay, so I am doing research on how to do time-series prediction. As always, it's preprocessing the data that's the difficult part. I get that I have to convert the "time-stamp" in a data file into a "datetime" or "timestep", and I did that:
df = pd.read_csv("airpassengers.csv")
month = pd.to_datetime(df['Month'])
(I may have parsed the datetime incorrectly; I've seen people use pd.read_csv() itself to parse the dates instead. If so, please advise on how to do it properly.)
I also understand the part where I scale my data. (Could someone explain to me how the scaling works? I know that it puts all my data within the range I give it, but would the output of my prediction also be scaled?)
Lastly, once I have scaled and parsed data and timestamps, how would I actually predict with the trained model? I don't know what to enter into (for example) model.predict().
I did some research and it seemed like I have to shift my dataset or something; I don't really understand what the documentation is saying, and the example isn't directly related to time-series prediction.
I know this is a lot, and you might not be able to answer all the questions. I am fairly new to this. Just help with whatever you can. Thank you!
So, because you're working with airpassengers.csv and asking about predictive modeling, I'm going to assume you're working through this GitHub example.
There are a couple of things I want to make sure you know before I dive into the answers to your questions.
There are lots of different types of predictive models used in forecasting; you can find out all about them here.
You're asking a lot of broad questions, but I'll break the main ones down into two steps and describe what's happening using the example I believe you're trying to replicate.
Let's break it down
Loading and parsing the data
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
%matplotlib inline
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 15, 6
air_passengers = pd.read_csv("./data/AirPassengers.csv", header = 0, parse_dates = [0], names = ['Month', 'Passengers'], index_col = 0)
This section of code loads the data from a .csv (comma-separated values) file into the data frame air_passengers. The arguments to read_csv state that the first row is a header (header=0), that the first column should be parsed as dates (parse_dates=[0]), what the columns should be named (names=['Month', 'Passengers']), and that the first column should be used as the data frame's index (index_col=0).
Scaling the data
log_air_passengers = np.log(air_passengers.Passengers)
This is done to make the math better behaved. Logarithms are the inverse of exponentiation (if y = 2^x, then x = log2(y)); numpy's log function gives us the natural log (base e). Once the series is differenced below, your predicted values will be so close to percent changes that you can use them as such.
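To see that last point concretely, a quick check (my own illustration, not from the notebook):
import numpy as np

a, b = 100.0, 110.0
print(np.log(b) - np.log(a))  # ~0.0953, the difference of the logs
print((b - a) / a)            # 0.10, the actual percent change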
Now that the data has been scaled, we can prep it for statistical modeling
log_air_passengers_diff = log_air_passengers - log_air_passengers.shift()
log_air_passengers_diff.dropna(inplace=True)
This changes the data frame to hold the difference between each data point and the previous one (the first difference of the log values) instead of the log values themselves.
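And to your question about whether the prediction output would still be scaled: yes, a model trained on this series predicts in log-difference space, and you undo both transforms afterwards. A sketch reusing the names above:
import numpy as np

# undo the differencing: cumulative sum, then add back the first log value
log_values = log_air_passengers_diff.cumsum() + log_air_passengers.iloc[0]
# undo the log transform: exponentiate back to passenger counts
passenger_counts = np.exp(log_values)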
The last part of your question contains too many steps to cover here, and it is not as simple as calling a single function. I encourage you to learn more here.

Python - How To read IRIS CSV Data (a variable) into a Pandas DataFrame

I'm using the sample Python machine learning "IRIS" dataset (as the starting point of a project). The data are POSTed into a Flask web service. Thus, the key difference between what I'm doing and all the examples I can find is that I'm trying to load a Pandas DataFrame from a variable, not from a file or URL, both of which seem to be easy.
I extract the IRIS data from Flask's POST request.values. All good. But at that point, I can't figure out how to get a pandas dataframe the way "pd.read_csv(....)" does. So far, it seems the only solution is to parse each row and build up several Series to use with the DataFrame constructor? There must be something I'm missing, since reading this data from a URL is a simple one-liner.
I'm assuming reading a variable into a Pandas DataFrame should not be difficult since it seems like an obvious use-case.
I tried wrapping with io.StringIO(csv_data), then following up with read_csv on that variable, but that doesn't work either.
Note: I also tried things like ...
data = pd.DataFrame(csv_data, columns=['....'])
but got nothing but errors (for example, "constructor not called correctly!")
I am hoping for a simple method to call that can infer the columns and names and create the DataFrame for me, from a variable, without me needing to know a lot about Pandas (just to read and load a simple CSV data set, anyway).
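For context, the io.StringIO attempt described above is the standard pattern for exactly this case, so if it fails the variable may not contain raw CSV text. A sketch with hypothetical data:
import io
import pandas as pd

csv_data = "sepal_length,sepal_width,petal_length,petal_width,species\n5.1,3.5,1.4,0.2,setosa"
df = pd.read_csv(io.StringIO(csv_data))  # read_csv accepts any file-like object
print(df)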

Format of Text Based Dataset in Python

So I have just started using scikit-learn for machine learning in Python, and have gone for unsupervised learning on labelled text data. However, I cannot figure out what format the .csv file containing the dataset should have before it is converted to a NumPy array. For example:
I stored a string under the 'String' column, and boolean values denoting whether each string is acceptable or not under the 'Status' column. I understand this kind of labelling may be wrong, but I haven't found articles that clearly specify what to do.
Try importing one of the example datasets, e.g.:
from sklearn.datasets import make_moons
X, y = make_moons()
The sklearn example code is pretty good in my experience, e.g. this one.
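For the layout the question describes, a plain CSV with one text column and one label column is enough, vectorized before fitting. A sketch (the file name is a placeholder; the column names follow the question):
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.read_csv("labelled_text.csv")              # columns: String, Status
X = CountVectorizer().fit_transform(df["String"])  # turn the text into a numeric matrix
y = df["Status"]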
