Pandas preprocessing data and labelling - python

I want to split my data into features and a label: the first 6 columns determine the 7th column. I have selected the first 6 columns, and that part works perfectly:
import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report, confusion_matrix
#Assign column names to the dataset
names=['buying', 'maint', 'doors', 'persons', 'lug_boot','safety', 'class']
# load the dataset in csv format into the pandas dataframe
cardata= pd.read_csv(r'C:\Users\user\Downloads\car.data', names=names)
X = cardata.iloc[:, 0:6]
The code above works, and when I run
print(X.head())
it prints the first 6 columns, without the last column, which is the one to be predicted.
But the code below does not work; its output looks the same as the output above:
y = cardata.select_dtypes(include=[object])
print(y.head())
Please help. I need to assign to the variable y only the last (seventh) column.
The output is the same as before, which is not what I want: when I run print(y.head()), it should print only the last column.

Try this
X,y = cardata.iloc[:,:-1],cardata.iloc[:,-1]
This selects all rows and splits the frame into X (every column except the last) and y (the last column, index -1). This should give you the result you are looking for.
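A minimal sketch of that split, assuming a tiny hand-made stand-in for car.data (the two rows are illustrative values, not the real file). In the real file every column is presumably read as strings (object dtype), which would explain why select_dtypes(include=[object]) returned all seven columns rather than just the label.

```python
import pandas as pd

# Column names taken from the question
names = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class']

# Illustrative stand-in rows (assumption: not the real car.data contents)
rows = [
    ['vhigh', 'vhigh', '2', '2', 'small', 'low', 'unacc'],
    ['low', 'med', '4', 'more', 'big', 'high', 'vgood'],
]
cardata = pd.DataFrame(rows, columns=names)

# Positional split: everything except the last column, then the last column
X = cardata.iloc[:, :-1]
y = cardata.iloc[:, -1]

# Selecting the label by name gives the same Series
y_by_name = cardata['class']

print(X.columns.tolist())
print(y.head())
```

Selecting by name (cardata['class']) is arguably more robust than a position if the column order ever changes.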

Related

scikit preprocessing across entire dataframe

I have a dataframe:
df = pd.DataFrame({'Company': ['abc', 'xyz', 'def'],
'Q1-2019': [9.05, 8.64, 6.3],
'Q2-2019': [8.94, 8.56, 7.09],
'Q3-2019': [8.86, 8.45, 7.09],
'Q4-2019': [8.34, 8.61, 7.25]})
The data is an average response of the same question asked across 4 quarters.
I am trying to create a benchmark index from this data. To do so, I want to preprocess it first by either standardizing or normalizing it.
How would I standardize or normalize across the entire dataframe? What is the best way to go about this?
I can do this for a single row or column using the code below, but I am struggling to do it across the whole dataframe.
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
#define scaler
scaler = MinMaxScaler() #or StandardScaler
X = df.loc[1].T
X = X.to_numpy()
#transform data
scaled = scaler.fit_transform(X)
If I understood your need correctly, you can use ColumnTransformer to apply the same transformation (e.g. scaling) separately to different columns.
As you can read in the linked documentation, you need to provide a tuple containing:
- a name for the step
- the chosen transformer (e.g. StandardScaler), or a Pipeline
- a list of the columns to which the selected transformation is applied
Code example
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
# specify columns
columns = ['Q1-2019', 'Q2-2019', 'Q3-2019', 'Q4-2019']
# create a ColumnTransformer instance
ct = ColumnTransformer([
('scaler', StandardScaler(), columns)
])
# fit and transform the input dataframe
ct.fit_transform(df)
array([[ 0.86955718, 0.93177476, 0.96056682, 0.46493449],
[ 0.53109031, 0.45544147, 0.41859563, 0.92419906],
[-1.40064749, -1.38721623, -1.37916245, -1.38913355]])
ColumnTransformer outputs a numpy array with the transformed values, scaled using statistics fitted on the input dataset df. Even though the column names are gone, the array columns are still in the same order as in the input dataframe, so it is easy to convert the array back to a pandas dataframe if you need to.
In addition to @RicS's answer, note that scikit-learn functions return a numpy array, not a dataframe. The Company column is also not included. You may consider this to convert the results into a dataframe again:
scaler = StandardScaler()
x = scaler.fit_transform(df.drop("Company",axis=1)) # scale all columns except Company
y = pd.concat([df["Company"],pd.DataFrame(x, columns=df.columns[1:])],axis=1) # adds results and company into dataframe again
y.head()
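Both answers above scale each quarter column separately. If "across the entire dataframe" is meant literally, one reading (an assumption on my part) is to use a single mean and standard deviation, or a single min and max, computed over all twelve numbers. A minimal sketch of that reading using plain pandas arithmetic:

```python
import pandas as pd

df = pd.DataFrame({'Company': ['abc', 'xyz', 'def'],
                   'Q1-2019': [9.05, 8.64, 6.3],
                   'Q2-2019': [8.94, 8.56, 7.09],
                   'Q3-2019': [8.86, 8.45, 7.09],
                   'Q4-2019': [8.34, 8.61, 7.25]})

# Numeric part only; Company is a label, not a measurement
values = df.drop(columns='Company')
flat = values.to_numpy()

# Standardize with ONE mean and ONE std taken over every cell
# (population std, ddof=0, the same convention StandardScaler uses)
standardized = (values - flat.mean()) / flat.std()

# Min-max normalize with ONE min and ONE max taken over every cell
normalized = (values - flat.min()) / (flat.max() - flat.min())

# Put the Company column back in front of the scaled values
result = pd.concat([df['Company'], standardized], axis=1)
print(result)
```

Unlike per-column scaling, this keeps the quarters comparable with each other, which may or may not be what a benchmark index needs.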

Change pandas DataFrame to numpy array but keeping column names

I have a pandas DataFrame from the sklearn.datasets Boston house price data and am trying to convert this to a numpy array but keeping column names. Here is the code I tried:
from sklearn import datasets ## imports datasets from scikit-learn
import numpy as np
import pandas as pd
data = datasets.load_boston() ## loads Boston dataset from datasets library
df = pd.DataFrame(data.data, columns=data.feature_names)
X = df.to_numpy()
print(X.dtype.names)
However, this returns None, so the column names are not kept. Does anyone understand why?
Thanks
Try this:
w = (data.feature_names).reshape(13,1)
X = np.vstack((w.T, data.data))
print (X)
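A note on why dtype.names is None: a plain array from to_numpy() has one homogeneous dtype with no named fields, so there is nowhere for the names to live. Stacking the names on top, as above, turns every value into a string. One alternative, sketched below, is a record array from DataFrame.to_records, which does carry field names. Since load_boston was removed in scikit-learn 1.2, the sketch uses a small stand-in frame (the values are illustrative; only the column names are borrowed from the Boston data).

```python
import pandas as pd

# Stand-in frame (assumption: two columns are enough to show the idea)
df = pd.DataFrame({'CRIM': [0.00632, 0.02731], 'ZN': [18.0, 0.0]})

# Plain ndarray: homogeneous float64, no field names
plain = df.to_numpy()
print(plain.dtype.names)    # None

# Record array: one named field per column
records = df.to_records(index=False)
print(records.dtype.names)  # ('CRIM', 'ZN')

# Fields can then be accessed by name
crim = records['CRIM']
print(crim)
```

Keep in mind that scikit-learn estimators expect an ordinary 2-D array or a DataFrame, so a record array is mainly useful for your own bookkeeping.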

Numpy Array for SVM model rather than a DataFrame

from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
import pandas as pd
import numpy as np
# Read the data.
data = np.asarray(pd.read_csv('data.csv', header=None))
# Assign the features to the variable X, and the labels to the variable y.
X = data[:,0:2]
y = data[:,2]
# TODO: Create the model and assign it to the variable model.
# Find the right parameters for this model to achieve 100% accuracy on the dataset.
model = SVC()
model.fit(X,y)
Two questions:
1. The data is converted from a pandas DataFrame (read by pd.read_csv) into a numpy array. Is that better? Is there a good reason for it? Why not stay with the DataFrame?
2. I do not understand this notation:
X = data[:,0:2]
y = data[:,2]
What does it do?
Thank you.
The data consists of a CSV file with many rows like this:
0.28917,0.65643,0
It has three columns: the first two are the coordinates of the points, and the third is the label.
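On the notation, a small sketch may help. In data[rows, columns], a bare colon means "all rows", 0:2 means columns 0 and 1 (the stop index is excluded), and a single 2 picks column 2 as a 1-D array. Only the first row below comes from the question; the other two are made up for illustration.

```python
import numpy as np

data = np.array([
    [0.28917, 0.65643, 0],  # row from the question
    [0.5, 0.25, 1],         # made-up row
    [0.75, 0.1, 1],         # made-up row
])

# All rows, columns 0 and 1: the two coordinates
X = data[:, 0:2]

# All rows, column 2 only: the labels, as a 1-D array
y = data[:, 2]

print(X.shape)  # (3, 2)
print(y)        # [0. 1. 1.]
```

As for the first question, scikit-learn estimators generally accept a DataFrame as well; the equivalent selection there would be df.iloc[:, 0:2] and df.iloc[:, 2].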

How to get column names from my numpy array?

I used L1-based feature selection shown here in order to select suitable columns from pandas DataFrame X.
from sklearn.svm import LinearSVC
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
iris = load_iris()
X, y = iris.data, iris.target
lsvc = LinearSVC(C=0.01, penalty="l1", dual=False).fit(X, y)
model = SelectFromModel(lsvc, prefit=True)
X_new = model.transform(X)
However, it is not clear to me how I can get the column names. Since X_new is a numpy array, I tried this:
X_new.dtype.names
But it returns nothing. So how can I find out which columns have been selected?
Once you have converted your data into a csv file, you will want to use pd.read_csv to get that file into a dataframe.
You can then use the columns attribute to access the column names.
Furthermore, you could call the to_list method on it to get the column names as a list.
Alternatively, you could use Ahmad's method:
import re
# read only the header line of the CSV file
with open('f.csv', 'r') as f:
    header = f.readline()
columns = re.sub(' +', ' ', header)  # collapse runs of spaces into a single space
columns = columns.strip().split(',')  # split the header on commas
print(columns)
EDIT: The OP solved the problem by calling get_support on the selector instance (model.get_support) instead of on the class (SelectFromModel.get_support).
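For completeness, a minimal sketch of how get_support can be mapped back to names, reusing the code from the question. The feature names come from iris.feature_names here; with a DataFrame you would index X.columns with the same mask.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC

iris = load_iris()
X, y = iris.data, iris.target

# Same L1-based selection as in the question
lsvc = LinearSVC(C=0.01, penalty="l1", dual=False).fit(X, y)
model = SelectFromModel(lsvc, prefit=True)
X_new = model.transform(X)

# Boolean mask with one entry per original column: True means "kept"
mask = model.get_support()

# Apply the mask to the original names to see which columns survived
selected = np.array(iris.feature_names)[mask]
print(list(selected))
```

Calling get_support(indices=True) returns the integer positions of the kept columns instead of a boolean mask, if that is more convenient.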

Error in my dataset

Currently, I am getting this error in my code:
'ValueError: Input contains NaN, infinity or a value too large for dtype('float64')'
when I run this code:
import pandas as pd
train=pd.read_csv(r'C:\Users\ABDILLAH\Desktop\datasets\Rails\RailsDataset.csv')  # raw string so the backslashes are not treated as escapes
features_col=['Num_comments', 'Num_Commits','Changed_files']
X=train.loc[:,features_col]
y=train.classes
from sklearn.linear_model import LogisticRegression
logreg=LogisticRegression()
logreg.fit(X,y)
If you need a sample of my dataset to check what is really happening, please let me know.
I've loaded the sample set and the code below ran in my computer:
import pandas as pd
from sklearn.linear_model import LogisticRegression
train = pd.read_csv('RailsDataset_bis.csv')
features_col = ['Num_Comments', 'Num_Commits', 'Changed_files']
X = train[features_col].dropna()
y = train['class'].dropna()
logreg = LogisticRegression()
logreg.fit(X, y)
I have corrected issues such as:
- There is no Num_comments column, only a Num_Comments column, because pandas column names are case-sensitive. The line X=train.loc[:,features_col] didn't give you an error, but generated a column full of NaN (recent pandas versions raise a KeyError here instead). Selecting columns with X = train[features_col] throws an error if a column name doesn't exist.
- There is no train.classes, because the column is named class, not classes.
- There was a row full of NaN at the bottom of the set that needed to be removed with dropna().
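One caveat about the code above: calling dropna() separately on X and y only keeps them the same length if the missing values sit in the same rows, as they do when a whole row is NaN. A slightly more defensive sketch drops incomplete rows once, before splitting. The frame below is a made-up stand-in for RailsDataset_bis.csv; only the column names come from the answer.

```python
import numpy as np
import pandas as pd

# Made-up stand-in data with an all-NaN row, as described above
train = pd.DataFrame({
    'Num_Comments': [1, 4, 2, 7, np.nan, 3],
    'Num_Commits': [2, 1, 5, 3, np.nan, 4],
    'Changed_files': [3, 2, 8, 1, np.nan, 6],
    'class': [0, 1, 0, 1, np.nan, 1],
})
features_col = ['Num_Comments', 'Num_Commits', 'Changed_files']

# Count missing values per column before cleaning
print(train[features_col + ['class']].isna().sum())

# Drop every row that is missing a feature or the label, in one step
clean = train.dropna(subset=features_col + ['class'])

# X and y now share the same index, so rows stay aligned
X = clean[features_col]
y = clean['class'].astype(int)
print(len(X), len(y))
```

X and y from this sketch can then be passed to logreg.fit(X, y) as in the answer.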
