Python: Split data set according to specific column

I am currently trying to build a classification model for which I am using this dataset for training and testing. It is extracted from the TIMIT database and contains digitized frequencies of five different phoneme classes. The frequencies are under the 256 columns labelled "x.1" - "x.256", while the phoneme class itself is labelled "g". Furthermore, there is also a "speakers" column identifying the different speakers.
My question is: is it possible to split this dataset into a 50:50 ratio of training and test data while taking the speakers column into account? Specifically, I want to divide the data so that no speaker appears in both sets, so that I do not validate the trained model on test data containing speakers that are already in the training data.
My approach was to extract all speakers from the original dataset using NumPy and make use of the stratify parameter of train_test_split:
X_train, X_test, y_train, y_test = train_test_split(input_data, phonemes, random_state=42, test_size=0.5, stratify=speakers)
But this is most likely not the solution, since stratifying by speaker distributes each speaker's samples across both sets, which is the opposite of what I want. I would greatly appreciate any help in solving this issue!

Hi, you can use the pandas library to load the CSV into a dataframe:
import pandas as pd
df = pd.read_csv(path_to_csv)
Then you can get all unique values of the speakers column:
arrayOfSpeaker = df['speakers'].unique()
Now you can use arrayOfSpeaker to split your data into training and test sets.
I would also recommend shuffling arrayOfSpeaker before slicing it, so the assignment of speakers to each set is random.
I normally split data into a 70:20:10 ratio for train:validation:test, though, so I don't quite see the point of a 50:50 split!
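For illustration, here is a minimal sketch of that approach, assuming the dataframe has the "speakers" and "g" columns described in the question:

import numpy as np
import pandas as pd

df = pd.read_csv(path_to_csv)

# Shuffle the unique speakers, then assign half to train and half to test.
rng = np.random.default_rng(42)
speakers = df['speakers'].unique()
rng.shuffle(speakers)

train_speakers = set(speakers[:len(speakers) // 2])
train_mask = df['speakers'].isin(train_speakers)
train_df, test_df = df[train_mask], df[~train_mask]

X_train, y_train = train_df.drop(columns=['speakers', 'g']), train_df['g']
X_test, y_test = test_df.drop(columns=['speakers', 'g']), test_df['g']

scikit-learn's GroupShuffleSplit (in sklearn.model_selection) does the same speaker-disjoint split directly if you pass the speaker column as groups.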

Related

Apply machine learning model to another dataset

I'm trying to get into machine learning, and I've been following this tutorial:
https://www.analyticsvidhya.com/blog/2021/05/classification-algorithms-in-python-heart-attack-prediction-and-analysis/
Near the end, we split the dataset into training and testing using train_test_split
x = data3.drop("output", axis=1)
y = data3["output"]
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.3)
That is, we use the same dataset for training and testing, 70% for training and 30% for testing.
But how can I use another dataset to test my model?
One scenario came to mind: "You trained your model in 250 patients, now test it against these 3 patients that we have, so we can see the chances of them having a heart attack".
How can I, instead of splitting the data, use another csv/dataframe as a test? Assume this test data has the same format as the training data, just fewer rows.
train_test_split(x, y, test_size=0.3) only divides the current data into training and test sets. After training the model on the training data, you can use any other data for testing: this function is mainly for splitting the data at hand. You just have to make sure the attributes and their types are the same as in the training data. If you have to test on 3 patients, all you have to do is pass the patients' data into model.predict() as a dataframe or an array, depending on the data.
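For the 3-patient scenario, a minimal sketch, continuing from the x_train/y_train produced above and assuming a hypothetical new_patients.csv with the same feature columns (and no "output" column):

import pandas as pd
from sklearn.linear_model import LogisticRegression   # stand-in classifier

model = LogisticRegression(max_iter=1000).fit(x_train, y_train)
new_patients = pd.read_csv("new_patients.csv")   # hypothetical file with 3 rows
print(model.predict(new_patients))               # one prediction per patient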
Just as you load one dataframe from a file:
data1 = pd.read_csv("heart.csv")
If you had two separate data files, you'd load them into separate dataframes and skip the train_test_split step.
train_df = pd.read_csv("heart_train.csv")
test_df = pd.read_csv("heart_test.csv")
Since the two dataframes are already separate, you just have to make sure you do any cleaning and pre-processing steps on both of them, including removal of the target variable (y).
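As a sketch (LogisticRegression is a stand-in, since the tutorial's actual model may differ, and the file names are the hypothetical ones above):

import pandas as pd
from sklearn.linear_model import LogisticRegression

train_df = pd.read_csv("heart_train.csv")
test_df = pd.read_csv("heart_test.csv")

# Apply the same pre-processing to both frames, then separate the target.
x_train, y_train = train_df.drop("output", axis=1), train_df["output"]
x_test, y_test = test_df.drop("output", axis=1), test_df["output"]

model = LogisticRegression(max_iter=1000)
model.fit(x_train, y_train)
print(model.score(x_test, y_test))   # accuracy on the separate test file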

Understanding Groups in scikit-learn for XGBoost Ranking

I feel like I'm really missing something obvious when it comes to groups in sklearn data preparation and XGBoost regression parameters.
I've gone over this tutorial: https://medium.com/predictly-on-tech/learning-to-rank-using-xgboost-83de0166229d
as well as the XGBRanker documentation.
What exactly is a group? Is it an arbitrary chunk of the dataset? The tutorial says groups matter because you need "A column in your datasets that tells us which datapoints should be compared to what", but my understanding is that sklearn's train_test_split keeps the rows aligned across both the features (X) and the labels (y) in the train and test sets.
My code uses train_test_split() like your standard data prep process would for classification, i.e.:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=X[<label column name>].values)
What do I have to change to add groups? It also mentions that query IDs can be used instead; should I just add a column with randomly generated query IDs to the data before splitting?
In learning-to-rank, you only care about rankings within each group. This is usually described in the context of search results: the groups are matches for a given query. In your linked article, a group is a given race.
If you don't know what your groups are, you might not be in a learning-to-rank situation, and perhaps a more straightforward classification or regression would be better suited.
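To make the mechanics concrete, here is a minimal sketch with made-up toy data. XGBRanker's fit takes a group argument listing how many contiguous rows belong to each group, so the rows must be sorted so that each group is contiguous:

import numpy as np
from xgboost import XGBRanker

rng = np.random.default_rng(0)
X_train = rng.random((10, 4))         # toy features
y_train = rng.integers(0, 5, 10)      # toy relevance labels
group_sizes = [4, 3, 3]               # three groups: 4 + 3 + 3 = 10 rows

ranker = XGBRanker(objective="rank:pairwise", n_estimators=50)
ranker.fit(X_train, y_train, group=group_sizes)

scores = ranker.predict(X_train)      # only compare scores within a group

When splitting, keep each group intact: sklearn's GroupShuffleSplit (passing the group or query IDs as groups) does that, whereas a plain train_test_split can scatter one group across both sets.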

How to retain Scikit-learn OneHotEncoding from model generation to use on new data?

I'm using OneHotEncoding to generate dummies for a classification problem. When used on the training data, I get ~300 dummy columns, which is fine. However, when I input new data (which is fewer rows), the OneHotEncoding only generates ~250 dummies, which isn't surprising considering the smaller dataset, but then I can't use the new data with the model because the features don't align.
Is there a way to retain the OneHotEncoding schema to use on new incoming data?
I think you are using fit_transform on both the training and test datasets, which is not the right approach: the encoding schema has to be consistent across both datasets for the model to interpret the features correctly.
The correct way is to call fit_transform on the training data and then transform on the test data. That way you will get a consistent number of columns.
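A minimal sketch of that pattern, also persisting the fitted encoder so the same schema can be reused on new incoming data (X_train_raw and X_test_raw stand for your categorical columns; handle_unknown='ignore' keeps transform from failing on categories never seen during fit):

from sklearn.preprocessing import OneHotEncoder
import joblib

encoder = OneHotEncoder(handle_unknown="ignore")
X_train_enc = encoder.fit_transform(X_train_raw)   # learns the ~300-column schema
X_test_enc = encoder.transform(X_test_raw)         # same columns, even with fewer rows

joblib.dump(encoder, "encoder.pkl")   # later: encoder = joblib.load("encoder.pkl")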

How does decision tree recognize the features from a given text dataset?

I have binary classification text data with 10 text features.
I use various techniques like bag-of-words, TF-IDF, etc. to convert them to numerical features.
I use hstack() to stack all those processed features back together.
Each feature turns into a large number of columns, so after conversion my dataset has around 3000 columns.
My question is: when I fit this dataset to a decision tree classifier (sklearn), how does the classifier recognize which columns belong to a particular feature?
For example, say the first 51 columns out of 3000 belong to the US_states bag-of-words.
How will the DT recognize that?
PS: Before processing, the data is in a pandas DataFrame.
After processing, it is a stacked numpy array that is fed to the classifier.
The decision tree won't recognize which original features the columns came from. It just sees 3000 anonymous numeric columns and picks split points among them individually; the mapping from columns back to features is something only you keep track of.
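If you want that mapping for interpretation, you can reconstruct it yourself from the vectorizers and the hstack order. A rough sketch, with hypothetical column names 'us_state' and 'description':

from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

vec_state = CountVectorizer()
vec_desc = CountVectorizer()

X_state = vec_state.fit_transform(df["us_state"])
X_desc = vec_desc.fit_transform(df["description"])
X = hstack([X_state, X_desc])   # the column order defines the mapping

n_state = X_state.shape[1]      # columns [0, n_state) come from 'us_state'
feature_names = (list(vec_state.get_feature_names_out())
                 + list(vec_desc.get_feature_names_out()))

clf = DecisionTreeClassifier().fit(X, y)   # the tree itself only sees column indices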

How can I train an SVM classifier from sklearn multiple times in Python?

I was wondering if it is possible to train the SVM classifier from sklearn in Python many times inside a for loop. I have in mind something like the following:
for i in range(0, 10):
    data = np.load(somedata)
    labels = np.load(somelabels)
    C = SVC()
    C.fit(data, labels)
    joblib.dump(C, 'somefolderpath/Model.pkl')
I want my model to be trained on each of the 10 datasets and their labels. Is that possible this way, or do I have to append all the data and labels into two corresponding arrays containing the whole data and labels from my 10 items?
EDIT: What if I want to train a separate classifier for each subject? What would the above syntax look like then? Is my edit correct?
And when I want to load the trained classifier for a specific subject, can I do:
C = joblib.load('somefolderpath/Model.pkl')
idx = C.predict(data)
?
Calling fit on any scikit-learn estimator makes it forget all previously seen data. So if you want to make predictions using all of your data (all ten patients), you need to concatenate it first.
In particular, if each somelabels contains only a single label, the code doesn't make sense and might even error because only one class is present.
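If instead you do want one separate classifier per subject (your edit), save each model under a distinct file name so the loop does not overwrite it. A sketch with hypothetical per-subject file names:

import numpy as np
import joblib
from sklearn.svm import SVC

for i in range(10):
    data = np.load(f"data_{i}.npy")       # hypothetical per-subject files
    labels = np.load(f"labels_{i}.npy")   # must contain at least two classes
    C = SVC()
    C.fit(data, labels)
    joblib.dump(C, f"somefolderpath/Model_{i}.pkl")

# Later, load the classifier for a specific subject and predict:
C = joblib.load("somefolderpath/Model_3.pkl")
idx = C.predict(data)   # data: that subject's feature array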
