Support Vector Machine to classify whole DataFrames in Python

I would like to create a Support Vector Machine to classify whole DataFrames. So in each cell there would be one DataFrame with a set of data.
I am working with Python.
Do you know if this is in any way possible?
Thank you in advance!
I have not been able to find any examples.
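A hedged sketch of one way this could be approached with scikit-learn, assuming every per-sample DataFrame has the same shape: flatten each DataFrame into a single feature vector and stack them into the 2-D matrix an SVC expects (all names and data below are placeholders, not a confirmed solution):

import numpy as np
import pandas as pd
from sklearn.svm import SVC

# toy example: each sample is a small DataFrame with an identical shape
sample_dfs = [pd.DataFrame(np.random.rand(4, 3)) for _ in range(20)]
labels = np.random.randint(0, 2, size=20)

# scikit-learn expects a 2-D matrix, so each DataFrame becomes one flattened row
X = np.vstack([df.to_numpy().ravel() for df in sample_dfs])

clf = SVC(kernel="rbf")
clf.fit(X, labels)
print(clf.predict(X[:3]))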

Related

Lib for creating ROC and DET curves by giving it the match scores and non-match scores?

I'm comparing some open-source face-recognition frameworks running with Python (dlib), and for that I wanted to create ROC and DET curves. For creating match scores I'm using the CASIA-FaceV5 dataset. Everything is for educational purposes only.
My question is:
What's the best way to generate these kinds of curves? (Any good libs for that?)
I found scikit-learn via Google, but I still don't know how I should use it for face recognition.
I mean, which information do I have to pass? I know that a ROC curve uses the true match rate and the false match rate, but from a developer's point of view I just don't know how to feed that information into the scikit-learn function.
My Test:
I'm creating genuine match scores for every person in the CASIA dataset. For this I compare different pictures of the same person. I save these scores in the array "genuineScores".
Example:
Person1_Picture1.jpg comparing with Person1_Picture2.jpg
Person2_Picture1.jpg comparing with Person2_Picture2.jpg etc.
I'm also creating impostor match scores. For these I compare pictures of two different persons. I save these scores in the array "impostorScores".
Example:
Person1_Picture1.jpg comparing with Person2_Picture1.jpg
Person2_Picture1.jpg comparing with Person3_Picture1.jpg etc.
Now I'm just looking for a lib where I can pass the two arrays and it creates a ROC curve for me.
Or is there another method for doing so?
I appreciate any kind of help. Thank you.
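A minimal sketch of how this could look with scikit-learn, assuming genuineScores and impostorScores are 1-D arrays of similarity scores where higher means "more similar" (the example values below are placeholders; in practice the scores come from your matcher):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# placeholder scores standing in for the real comparison results
genuineScores = np.array([0.91, 0.84, 0.78, 0.95])
impostorScores = np.array([0.32, 0.41, 0.28, 0.50])

# label genuine comparisons as 1 and impostor comparisons as 0
y_true = np.concatenate([np.ones_like(genuineScores), np.zeros_like(impostorScores)])
y_score = np.concatenate([genuineScores, impostorScores])

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # false/true match rates
print("AUC:", roc_auc_score(y_true, y_score))

plt.plot(fpr, tpr)
plt.xlabel("False match rate")
plt.ylabel("True match rate")
plt.show()

For a DET curve, newer scikit-learn versions also provide sklearn.metrics.det_curve, which takes the same y_true and y_score arrays.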

Data analysis: compare two datasets for devising useful features for population segmentation

Say I have two pandas DataFrames, one containing data for the general population and one containing the same data for a target group.
I assume this is a very common use case of population segmentation. My first idea for exploring the data would be to perform some visualization using e.g. seaborn FacetGrid, barplot and scatterplot, or something like that, to get a general idea of the trends and differences.
However, I found out that this operation is not as straightforward as I thought, since seaborn is made to analyze one dataset, not to compare two datasets.
I found this SO answer which provides a solution. But I am wondering how people would go about it if the DataFrames were huge and a concat operation were not possible?
Datashader does not seem to provide such features, as far as I have seen.
Thanks for any ideas on how to go about such a task.
I would use the library Dask when data is too big for pandas. Dask comes from the same PyData ecosystem as pandas and is a bit more advanced, since it is a big-data tool, but it offers many of the same features, including concat. I found Dask easy enough to use and am using it for a couple of projects with dozens of columns and tens of millions of rows.
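A minimal sketch of that idea with Dask, assuming the two datasets live in CSV files with the same columns (the file names and the "age" column are made up for illustration):

import dask.dataframe as dd

# lazily load both datasets; nothing is read fully into memory yet
general = dd.read_csv("general_population.csv")
target = dd.read_csv("target_group.csv")

# tag each dataset so plots and group-bys can tell them apart
general = general.assign(group="general")
target = target.assign(group="target")

combined = dd.concat([general, target])

# aggregate out-of-core, then pull the small result into pandas for plotting
summary = combined.groupby("group")["age"].mean().compute()
print(summary)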

How to select only missing values for testing the model?

I am working on a logistic regression project where I have 850 observations and 8 variables. In this data I found 150 missing values, and I have decided to use the rows containing them as test data. How can I take only the rows with missing values as test data in Python?
I am still learning data science, so if there's a mistake in this approach please let me know.
Thank you :)
You could use pd.isna() from the pandas library.
It will return a boolean array that you can use for filtering your data.
You can select all rows having any missing value using the following code:
df[df.isnull().values.any(axis=1)]
I do not recommend using only the rows with missing values for testing. You should either impute the missing values completely, or at least partially fill them in the test dataset.
Let's see what other machine learning experts advise.
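A minimal sketch of that split, using a tiny toy DataFrame in place of the real 850-row dataset (the column names are placeholders):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "income": [50000, 60000, np.nan, 45000],
    "target": [1, 0, 1, 0],
})

missing_mask = df.isnull().any(axis=1)   # True for rows with at least one NaN
train_df = df[~missing_mask]             # complete rows for fitting the model
test_df = df[missing_mask]               # rows with missing values

print(train_df)
print(test_df)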

How to convert time series data into images?

I have a dataset with 12,000+ data points and 25 features, of which the last feature is the class label. This is a classification problem. Now I want to convert every data point into an image, but I have no idea how to do that. Please help. I work in Python. If anyone could provide sample code I would be grateful. Thanks in advance.
There is already some work on that: you can use either Gramian Angular Fields (GAF) or Markov Transition Fields (MTF); a good description is in Imaging Time-Series to Improve Classification and Imputation. Some other works use recurrence plots, such as the Deep-Gap deep learning framework. Imaging time series is an interesting way to think about them, so you can easily use e.g. CNNs. But which method would you like to use? BTW, be aware this might not be an "efficient" way to classify time series :)
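A minimal sketch using the pyts library (one package that implements GAF and MTF; using it here is an assumption, and the data is random stand-in data):

import numpy as np
from pyts.image import GramianAngularField

# stand-in for the real data: 100 samples, each a series of 24 values
X = np.random.rand(100, 24)

gaf = GramianAngularField(image_size=24, method="summation")
X_img = gaf.fit_transform(X)   # shape (100, 24, 24): one image per sample

print(X_img.shape)

Each resulting 24x24 array can then be fed to a CNN like any other single-channel image.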

Preprocess large datafile with categorical and continuous features

First, thanks for reading this, and thanks a lot if you can give any clue to help me solve it.
As I'm new to scikit-learn, don't hesitate to provide any advice that can help me improve the process and make it more professional.
My goal is to classify data into two categories. I would like to find a solution that gives me the most precise result. At the moment, I'm still looking for the most suitable algorithm and data preprocessing.
In my data I have 24 values: 13 are nominal, 6 are binarized and the others are continuous. Here is an example of a line:
"RENAULT";"CLIO III";"CLIO III (2005-2010)";"Diesel";2010;"HOM";"_AAA";"_BBB";"_CC";0;668.77;3;"Fevrier";"_DDD";0;0;0;1;0;0;0;0;0;0;247.97
I have around 900K lines for training, and I test on 100K lines.
As I want to compare several algorithm implementations, I wanted to encode all the nominal values so they can be used with several classifiers.
I tried several things:
LabelEncoder: this was quite good, but it gives me ordered values that would be misinterpreted by the classifier.
OneHotEncoder: if I understand correctly, it is almost perfect for my needs because I can select which columns to binarize. But as I have a lot of nominal values, it always ends in a MemoryError. Moreover, its input must be numerical, so everything has to be label-encoded first.
StandardScaler: this is quite useful, but not for what I need here. I decided to use it to scale my continuous values.
FeatureHasher: at first I didn't understand what it does. Then I saw that it is mainly used for text analysis. I tried to use it for my problem by creating a new array containing the result of the transformation, but I don't think it was built to work that way, and it didn't feel logical.
DictVectorizer: could be useful, but it behaves like OneHotEncoder and puts even more data in memory.
partial_fit: this method is offered by only 5 classifiers. I would like to be able to use it with at least Perceptron, KNearest and RandomForest, so it doesn't match my needs.
I looked through the documentation and found this information on the Preprocessing and Feature Extraction pages.
I would like a way to encode all the nominal values so that they will not be considered ordered, a solution that can be applied to large datasets with a lot of categories and limited resources.
Is there any way I didn't explore that can fit my needs?
Thanks for any clue and piece of advice.
To convert unordered categorical features you can try get_dummies in pandas; refer to its documentation for more details. Another way is to use CatBoost, which can handle categorical features directly without transforming them into a numerical type.
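A minimal sketch of both suggestions, assuming the nominal columns are listed in cat_cols (the column names and values below are made up, and catboost is a separate install):

import pandas as pd
from catboost import CatBoostClassifier

# toy stand-in for the 900K-line file
df = pd.DataFrame({
    "brand": ["RENAULT", "PEUGEOT", "RENAULT", "PEUGEOT"],
    "fuel": ["Diesel", "Essence", "Diesel", "Essence"],
    "price": [668.77, 702.10, 650.00, 731.50],
    "label": [0, 1, 0, 1],
})
cat_cols = ["brand", "fuel"]

# pandas one-hot encoding; sparse=True keeps the memory footprint down
encoded = pd.get_dummies(df, columns=cat_cols, sparse=True)
print(encoded.head())

# CatBoost handles the categorical columns itself, no manual encoding needed
model = CatBoostClassifier(iterations=50, verbose=False)
model.fit(df.drop(columns="label"), df["label"], cat_features=cat_cols)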
