I've written the following code to read a CSV file and run a column-wise normalization:
from sklearn import preprocessing
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# reading Train values
Training ='Training.csv'
df = pd.read_csv(Training)
# drop the last column, twice, i.e. remove the two trailing columns
df = df.drop(df.columns[len(df.loc[1])-1], axis=1)
df = df.drop(df.columns[len(df.loc[1])-1], axis=1)
df.describe()
minmax_scaler= preprocessing.MinMaxScaler()
np_scaled = minmax_scaler.fit_transform(df)
normalized = pd.DataFrame(np_scaled)
normalized.describe()
np.shape(df)
np.shape(normalized)
My question is: why can't I see the headers in the normalized DataFrame, despite it having the same shape as df? I've also tried to read the CSV file without a header, but then the program crashes:
df = pd.read_csv(Training,header=None)
which delivers the following:
ValueError Traceback (most recent call last)
<ipython-input-15-dd18ba2a6204> in <module>()
14 df.describe()
15 minmax_scaler= preprocessing.MinMaxScaler()
---> 16 np_scaled = minmax_scaler.fit_transform(df)
17 normalized = pd.DataFrame(np_scaled)
18 normalized.describe()
c:\users\engine\appdata\local\programs\python\python35\lib\site-packages\sklearn\base.py in fit_transform(self, X, y, **fit_params)
492 if y is None:
493 # fit method of arity 1 (unsupervised transformation)
--> 494 return self.fit(X, **fit_params).transform(X)
495 else:
496 # fit method of arity 2 (supervised transformation)
c:\users\engine\appdata\local\programs\python\python35\lib\site-packages\sklearn\preprocessing\data.py in fit(self, X, y)
290 # Reset internal state before fitting
291 self._reset()
--> 292 return self.partial_fit(X, y)
293
294 def partial_fit(self, X, y=None):
c:\users\engine\appdata\local\programs\python\python35\lib\site-packages\sklearn\preprocessing\data.py in partial_fit(self, X, y)
316
317 X = check_array(X, copy=self.copy, ensure_2d=False, warn_on_dtype=True,
--> 318 estimator=self, dtype=FLOAT_DTYPES)
319
320 if X.ndim == 1:
c:\users\engine\appdata\local\programs\python\python35\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
380 force_all_finite)
381 else:
--> 382 array = np.array(array, dtype=dtype, order=order, copy=copy)
383
384 if ensure_2d:
ValueError: could not convert string to float: 'Feature458'
I'd be grateful for any hint about how I might solve this!
Well, that is because preprocessing.MinMaxScaler().fit_transform() returns a NumPy array, not a DataFrame.
When you then create a DataFrame from that array, it knows nothing about your original columns.
You could try something like
normalized = pd.DataFrame(np_scaled, columns=df.columns)
As for the second example (with header=None), your header line simply becomes the first data row. When sklearn then tries to convert a column name such as 'Feature458' into a float, you get the ValueError.
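Putting the pieces together, a minimal sketch of the fix (assuming Training.csv is read with its header row and the two trailing columns are the ones you meant to drop):
import pandas as pd
from sklearn import preprocessing

df = pd.read_csv('Training.csv')              # keep the header row as column names
df = df.drop(df.columns[-2:], axis=1)         # drop the two trailing columns, as in the question

minmax_scaler = preprocessing.MinMaxScaler()
np_scaled = minmax_scaler.fit_transform(df)   # plain NumPy array, no column labels

# re-attach the column names (and row index) that the array lost
normalized = pd.DataFrame(np_scaled, columns=df.columns, index=df.index)
normalized.describe()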
I'm trying to perform a support vector regression on my dataset, but I have run into a problem when trying to standard-scale it.
Originally I was facing "ValueError: Expected 2D array, got 1D array instead".
I tried to change the shape of my data, but now I'm facing "ValueError: Found array with dim 3. StandardScaler expected <= 2".
Could anyone please help me figure out how to resolve this issue?
Below is what I have done:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
dataset=pd.read_excel(r'C:\Users\Sammy\OneDrive - International Campus, Zhejiang University\Desktop\Data\BAYC Data.xlsx')
print(dataset)
dataset.columns
x=dataset.iloc[:,1].values
y=dataset.iloc[:,2].values
x=dataset.iloc[:,0].to_frame()
x=np.expand_dims(x, axis = -1)
from sklearn.preprocessing import StandardScaler
st_x=StandardScaler()
st_y=StandardScaler()
X=st_x.fit_transform(x)
Y=st_y.fit_transform(y)
And this is the error I'm receiving:
ValueError Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_11140\1840045772.py in <module>
----> 1 X=st_x.fit_transform(x)
2 Y=st_y.fit_transform(y)
~\anaconda3\lib\site-packages\sklearn\base.py in fit_transform(self, X, y, **fit_params)
850 if y is None:
851 # fit method of arity 1 (unsupervised transformation)
--> 852 return self.fit(X, **fit_params).transform(X)
853 else:
854 # fit method of arity 2 (supervised transformation)
~\anaconda3\lib\site-packages\sklearn\preprocessing\_data.py in fit(self, X, y, sample_weight)
804 # Reset internal state before fitting
805 self._reset()
--> 806 return self.partial_fit(X, y, sample_weight)
807
808 def partial_fit(self, X, y=None, sample_weight=None):
~\anaconda3\lib\site-packages\sklearn\preprocessing\_data.py in partial_fit(self, X, y, sample_weight)
839 """
840 first_call = not hasattr(self, "n_samples_seen_")
--> 841 X = self._validate_data(
842 X,
843 accept_sparse=("csr", "csc"),
~\anaconda3\lib\site-packages\sklearn\base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
564 raise ValueError("Validation should be done on X, y or both.")
565 elif not no_val_X and no_val_y:
--> 566 X = check_array(X, **check_params)
567 out = X
568 elif no_val_X and not no_val_y:
~\anaconda3\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
792 ) from e
793 if not allow_nd and array.ndim >= 3:
--> 794 raise ValueError(
795 "Found array with dim %d. %s expected <= 2."
796 % (array.ndim, estimator_name)
ValueError: Found array with dim 3. StandardScaler expected <= 2.
You are trying to "push" a 3D array into a function that needs a 2D array.
I would recommend deleting rows 9 and 10, and in rows 7 and 8 editing the values inside [], so that dataset.iloc gives you a 2D array :)
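A rough sketch of that suggestion, assuming (as in the question) that the feature is in column 1 and the target in column 2:
# selecting with a list of column positions keeps the result 2D
x = dataset.iloc[:, [1]].values      # shape (n_samples, 1)
y = dataset.iloc[:, [2]].values      # shape (n_samples, 1)

from sklearn.preprocessing import StandardScaler
st_x = StandardScaler()
st_y = StandardScaler()
X = st_x.fit_transform(x)            # works now: the input is 2D
Y = st_y.fit_transform(y)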
Hey, I've been stuck on this problem for two hours; can someone explain why I get this error?
I'm supposed to turn string values (representing 10,000 columns of 3 states and a gender)
into numeric values, and I don't know what the problem is. I saw someone on Udemy do it and it worked fine.
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
Error:
Input In [15], in <cell line: 1>()
----> 1 X_train = sc.fit_transform(X_train)
File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\sklearn\base.py:867, in TransformerMixin.fit_transform(self, X, y, **fit_params)
863 # non-optimized default implementation; override when a better
864 # method is possible for a given clustering algorithm
865 if y is None:
866 # fit method of arity 1 (unsupervised transformation)
--> 867 return self.fit(X, **fit_params).transform(X)
868 else:
869 # fit method of arity 2 (supervised transformation)
870 return self.fit(X, y, **fit_params).transform(X)
File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\sklearn\preprocessing\_data.py:809, in StandardScaler.fit(self, X, y, sample_weight)
807 # Reset internal state before fitting
808 self._reset()
--> 809 return self.partial_fit(X, y, sample_weight)
File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\sklearn\preprocessing\_data.py:844, in StandardScaler.partial_fit(self, X, y, sample_weight)
812 """Online computation of mean and std on X for later scaling.
813
814 All of X is processed as a single batch. This is intended for cases
(...)
841 Fitted scaler.
842 """
843 first_call = not hasattr(self, "n_samples_seen_")
--> 844 X = self._validate_data(
845 X,
846 accept_sparse=("csr", "csc"),
847 dtype=FLOAT_DTYPES,
848 force_all_finite="allow-nan",
849 reset=first_call,
850 )
851 n_features = X.shape[1]
853 if sample_weight is not None:
File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\sklearn\base.py:577, in BaseEstimator._validate_data(self, X, y, reset, validate_separately, **check_params)
575 raise ValueError("Validation should be done on X, y or both.")
576 elif not no_val_X and no_val_y:
--> 577 X = check_array(X, input_name="X", **check_params)
578 out = X
579 elif no_val_X and not no_val_y:
File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\sklearn\utils\validation.py:856, in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
854 array = array.astype(dtype, casting="unsafe", copy=False)
855 else:
--> 856 array = np.asarray(array, order=order, dtype=dtype)
857 except ComplexWarning as complex_warning:
858 raise ValueError(
859 "Complex data not supported\n{}\n".format(array)
860 ) from complex_warning
ValueError: could not convert string to float: 'Spain'
You need to encode your string columns (categorical features) first. Use OrdinalEncoder(), LabelEncoder() or OneHotEncoder() to convert categorical columns to numeric. You can only scale numerical variables.
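A minimal sketch of that idea, assuming the geography and gender strings sit in columns 1 and 2 of X_train (adjust the indices to your data):
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# one-hot encode the categorical columns, pass the numeric ones through untouched
ct = ColumnTransformer(
    [('ohe', OneHotEncoder(), [1, 2])],
    remainder='passthrough',
    sparse_threshold=0)              # force a dense array so StandardScaler can center it
X_train_encoded = ct.fit_transform(X_train)

# every column is numeric now, so scaling works
sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train_encoded)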
Ok, I figured it out.
# Perform label encoding for the gender variable
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X_2 = LabelEncoder()
X[:, 2] = labelencoder_X_2.fit_transform(X[:, 2])
# Perform one-hot encoding for the geography variable
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer([('ohe', OneHotEncoder(), [1])], remainder='passthrough')
X = np.array(ct.fit_transform(X), dtype=str)   # np.str is removed in newer NumPy; plain str behaves the same
X = X[:, 1:]                                   # drop the first one-hot column (avoids the dummy variable trap)
I'm building a logistic regression model and want to understand which features contribute the most to my output (1 or 0). I'm trying to understand, when a customer comes back to my website, which features are making them come back. I am getting stuck with this fit function: it errors out on me and I can't figure out why. It seems to indicate that I have some null values, but I've scrubbed through my data and removed them.
import pandas as pd
import numpy as np
from sklearn import preprocessing
import matplotlib.pyplot as plt
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
#load data
df = pd.read_csv('jupyter.csv', header = 0)
array = df.values
X = array[:,1:13]
Y = array[:,14]
print (X.shape)
print (Y.shape)
(544219, 12)
(544219,)
# feature extraction
test = SelectKBest(score_func=chi2, k=4)
fit = test.fit(X, Y)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-63-f91db4d08897> in <module>
1 # feature extraction
2 test = SelectKBest(score_func=chi2, k=4)
----> 3 fit = test.fit(X, Y)
4 # summarize scores
5 #numpy.set_printoptions(precision=3)
/opt/anaconda3/lib/python3.7/site-packages/sklearn/feature_selection/univariate_selection.py in fit(self, X, y)
339 self : object
340 """
--> 341 X, y = check_X_y(X, y, ['csr', 'csc'], multi_output=True)
342
343 if not callable(self.score_func):
/opt/anaconda3/lib/python3.7/site-packages/sklearn/utils/validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, warn_on_dtype, estimator)
720 if multi_output:
721 y = check_array(y, 'csr', force_all_finite=True, ensure_2d=False,
--> 722 dtype=None)
723 else:
724 y = column_or_1d(y, warn=True)
/opt/anaconda3/lib/python3.7/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
540 if force_all_finite:
541 _assert_all_finite(array,
--> 542 allow_nan=force_all_finite == 'allow-nan')
543
544 if ensure_min_samples > 0:
/opt/anaconda3/lib/python3.7/site-packages/sklearn/utils/validation.py in _assert_all_finite(X, allow_nan)
58 elif X.dtype == np.dtype('object') and not allow_nan:
59 if _object_dtype_isnan(X).any():
---> 60 raise ValueError("Input contains NaN")
61
62
ValueError: Input contains NaN
You can impute missing values using SimpleImputer, or using pandas fillna(). For example, the following will fill all missing values with the column mean:
X.fillna(X.mean())
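Note that fillna works on a DataFrame or Series, so it would go on df before .values is taken. Since X in the question is already a NumPy slice, a hedged SimpleImputer sketch may be more direct:
from sklearn.impute import SimpleImputer

# replace every NaN in the feature matrix with its column mean before SelectKBest
imputer = SimpleImputer(strategy='mean')
X = imputer.fit_transform(X)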
I am trying to one-hot encode several categorical columns using pandas' pd.get_dummies() and it is returning an error that I don't understand. The error says ValueError: could not convert string to float: 'Warm Cool'. What might be causing this issue and how can I successfully one-hot encode all of the columns with dtype == object?
My dataset comes from the DC_Properties.CSV file found here.
My code and the error message:
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""
Import packages section
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""
Read data section
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""
df = pd.read_csv('DC_Properties.csv', index_col=0)
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""
Preprocess data section
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""
# remove rows without sales prices
df = df[df.PRICE.notnull()]
# create month sold column
df['MONTHSOLD'] = [i[:i.find('/')] if type(i) == str else i for i in df.SALEDATE]
# create year sold column
df['YEARSOLD'] = [i[-4:] if type(i) == str else i for i in df.SALEDATE]
# join GBA and Living GBA
df['GBA'] = df['GBA'].fillna(df['LIVING_GBA'])
# remove unused columns
unused_cols = ['SALEDATE',
'GIS_LAST_MOD_DTTM',
'CMPLX_NUM',
'LIVING_GBA',
'FULLADDRESS',
'CITY',
'STATE',
'NATIONALGRID',
'ASSESSMENT_SUBNBHD',
'CENSUS_TRACT',
'CENSUS_BLOCK',
'X',
'Y']
df = df.drop(unused_cols, axis=1)
# one-hot encode categorical variables
pd.get_dummies(df, dummy_na=True)
# standardize the data
scaler = StandardScaler()
dataset = scaler.fit_transform(df)
# specify x and y variables
x = dataset[:,-y_idx]
y = dataset[:,'PRICE']
# split data into a train and test set
np.random.seed(123)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-81-62c3931b3dfa> in <module>
33 # standardize the data
34 scaler = StandardScaler()
---> 35 dataset = scaler.fit_transform(df)
36
37 # specify x and y variables
~\Anaconda3\lib\site-packages\sklearn\base.py in fit_transform(self, X, y, **fit_params)
551 if y is None:
552 # fit method of arity 1 (unsupervised transformation)
--> 553 return self.fit(X, **fit_params).transform(X)
554 else:
555 # fit method of arity 2 (supervised transformation)
~\Anaconda3\lib\site-packages\sklearn\preprocessing\data.py in fit(self, X, y)
637 # Reset internal state before fitting
638 self._reset()
--> 639 return self.partial_fit(X, y)
640
641 def partial_fit(self, X, y=None):
~\Anaconda3\lib\site-packages\sklearn\preprocessing\data.py in partial_fit(self, X, y)
661 X = check_array(X, accept_sparse=('csr', 'csc'), copy=self.copy,
662 estimator=self, dtype=FLOAT_DTYPES,
--> 663 force_all_finite='allow-nan')
664
665 # Even in the case of `with_mean=False`, we update the mean anyway
~\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
494 try:
495 warnings.simplefilter('error', ComplexWarning)
--> 496 array = np.asarray(array, dtype=dtype, order=order)
497 except ComplexWarning:
498 raise ValueError("Complex data not supported\n"
~\Anaconda3\lib\site-packages\numpy\core\_asarray.py in asarray(a, dtype, order)
83
84 """
---> 85 return array(a, dtype, copy=False, order=order)
86
87
ValueError: could not convert string to float: 'Warm Cool'
It's actually the StandardScaler that throws an error because it encounters strings.
The reason is that you are using pd.get_dummies, but you never assign the returned DataFrame.
# one-hot encode categorical variables
pd.get_dummies(df, dummy_na=True) # <------ is lost
To fix it change it to:
# one-hot encode categorical variables
df = pd.get_dummies(df, dummy_na=True)
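A quick way to double-check, as a hedged sketch, that no string columns are left before the scaler runs:
# any columns still of dtype object would make StandardScaler fail again
leftover = df.select_dtypes(include='object').columns
print(leftover)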
I don't think your one-hot encoder (get_dummies) is doing what you think it is.
Replace it with the line below, or explicitly state what you want get_dummies to do (name the columns that require one-hot encoding) and then drop the original columns.
df = pd.concat([df,pd.get_dummies(df, prefix='dummy',drop_first = False, dummy_na=True)],axis=1)
Set drop_first=True if you want to drop one of the dummies, as required by certain statistical methods.
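A sketch of the explicit variant; the column names here are only an assumption about DC_Properties.csv, so check df.dtypes first:
# encode only the named categorical columns; get_dummies replaces them with dummy columns
categorical_cols = ['HEAT', 'AC', 'QUALIFIED']   # hypothetical names taken from df.dtypes
df = pd.get_dummies(df, columns=categorical_cols, dummy_na=True, drop_first=False)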
Hope that helps. Good luck!
I'm trying to classify mobiles according to their features, but when I apply the Gaussian NB code through sklearn, I'm unable to do so because of the following error.
The code:
clf = GaussianNB()
clf.fit(X_train,y_train)
GaussianNB()
accuracy = clf.score(X_test,y_test)
print(accuracy)
The error:
ValueError Traceback (most recent call last)
<ipython-input-18-e9515ccc2439> in <module>()
2 clf.fit(X_train,y_train)
3 GaussianNB()
----> 4 accuracy = clf.score(X_test,y_test)
5 print(accuracy)
/Users/kiran/anaconda/lib/python3.6/site-packages/sklearn/base.py in score(self, X, y, sample_weight)
347 """
348 from .metrics import accuracy_score
--> 349 return accuracy_score(y, self.predict(X), sample_weight=sample_weight)
350
351
/Users/kiran/anaconda/lib/python3.6/site-packages/sklearn/naive_bayes.py in predict(self, X)
63 Predicted target values for X
64 """
---> 65 jll = self._joint_log_likelihood(X)
66 return self.classes_[np.argmax(jll, axis=1)]
67
/Users/kiran/anaconda/lib/python3.6/site-packages/sklearn/naive_bayes.py in _joint_log_likelihood(self, X)
422 check_is_fitted(self, "classes_")
423
--> 424 X = check_array(X)
425 joint_log_likelihood = []
426 for i in range(np.size(self.classes_)):
/Users/kiran/anaconda/lib/python3.6/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
380 force_all_finite)
381 else:
--> 382 array = np.array(array, dtype=dtype, order=order, copy=copy)
383
384 if ensure_2d:
ValueError: could not convert string to float:
My dataset has been scraped, so it contains string as well as float values. It would be helpful if someone could suggest how I can clean the data and avoid the error.
Try the following:
accuracy = clf.score(X_test.astype('float'),y_test.astype('float'))
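Note that astype('float') only succeeds if every string actually parses as a number. If X_test is still a DataFrame, a hedged sketch to locate the offending columns first could be:
import pandas as pd

# coerce every column to numeric; strings that don't parse become NaN
X_test_numeric = X_test.apply(pd.to_numeric, errors='coerce')
print(X_test_numeric.isna().sum())   # columns with a count > 0 held non-numeric strings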
ValueError: could not convert string to float
I think this says it all. You need to have float as a consistent datatype in your dataset.
To convert a string in python to float:
>>> a = "123.345"
>>> float(a)
>>> 123.345
>>> int(float(a))
>>> 123