Hello, I have tried several ways to normalize the data in my DataFrame column elnino_1["air_temp"], but I always get an error such as "Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample." or "'int' object is not callable".
I tried this code:
(1)
elnino_1["air_temp"].min=-1
elnino_1["air_temp"].max=1
elnino_1_std = (elnino_1["air_temp"] - elnino_1["air_temp"].min(axis=0)) / (elnino_1["air_temp"].max(axis=0) - elnino_1["air_temp"].min(axis=0))
elnino_1_scaled = elnino_1_std * (max - min) + min
(2)
XD=elnino_1["air_temp"]
scaler = MinMaxScaler(feature_range=(-1, 1))
In both options I used these imports:
from sklearn.preprocessing import scale
from sklearn import preprocessing
What should I do to normalize this data, please?
Since I don't have access to your dataset, I'm using make_classification here to generate some synthetic data. Please run through it in a notebook to gain understanding. (Note there may be slight differences, as I'm using a NumPy array as the dataset while yours is a DataFrame.)
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=100, n_features=2, n_redundant=0, n_informative=1)
pd.DataFrame(X).head()
Next, we fit a MinMaxScaler to the data. MinMaxScaler expects a 2D array as input, in other words a 'table'. Throughout these examples, call X.shape to understand how array shapes work. For example, in the above, X.shape returns (100, 2), i.e. (n_rows, n_columns).
scaler = MinMaxScaler(feature_range=(-1, 1))
X_norm = scaler.fit_transform(X)
pd.DataFrame(X_norm).head()
In your case, you are only trying to fit/scale a single column. elnino_1["air_temp"] on its own is a 1D array, with a shape like (100,), so we have to reshape it into a 2D array.
x1_norm = scaler.fit_transform(X[:, 1].reshape(-1,1))
pd.DataFrame(x1_norm)
For example, if xyz.shape is (100,) and I want it to be (100,1), I can use xyz.reshape(100,1) if I'm being specific.
The length of a dimension set to -1 is determined automatically by inferring it from the other specified dimensions, which is useful when you don't want to hard-code the size. Thus xyz.reshape(-1, 1) achieves the same as above.
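Applied to your own data, a minimal sketch (assuming elnino_1 is a pandas DataFrame with an "air_temp" column) could look like this; selecting the column with double brackets keeps it two-dimensional, so no reshape is needed:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(-1, 1))
# double brackets select a one-column DataFrame (2D), which the scaler accepts directly;
# the result is stored in a new column (name chosen here only for illustration)
elnino_1["air_temp_scaled"] = scaler.fit_transform(elnino_1[["air_temp"]])
# equivalently, reshape the underlying 1D values yourself:
# scaler.fit_transform(elnino_1["air_temp"].values.reshape(-1, 1))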
I'm not sure if I'm doing something wrong, or if this is not the correct way to do this.
I'm encoding variables in a dataset for a model, and I'm using a Normalizer() from sklearn.preprocessing to normalize one of my variables, which is numerical.
My dataset is split in two: one part for training and one for inference. My goal is to normalize this numerical variable (let's call it column x) in the training subset, and then use the fitted normalization parameters to normalize the same variable in the inference dataset. The two subsets don't have the same number of rows, so what I'm doing is:
nr = Normalizer()
nr.fit([df1.x])
new_col = nr.transform(df1.x)
The problem is that when I try to use the same normalizer on column x in the inference subset, which has a different number of rows:
new_col1 = nr.transform(df2.x)
I get:
X has 10 features, but Normalizer is expecting 697 features as input.
I'm not sure if it's a reshape problem or if Normalizer() shouldn't be used in that way, so any advice would be more than welcome.
Normalizer is used to normalize rows (samples), whereas StandardScaler is used to normalize columns (features). Judging from your question, you want to scale a column, so you should use StandardScaler.
scikit-learn transformers expect a 2D array of shape (n_samples, n_features) as input, but a pandas.Series is a one-dimensional ndarray with axis labels.
You can fix that by passing a pandas.DataFrame to the transformer.
As follows:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
df1 = pd.DataFrame({'x' : np.random.uniform(low=0, high=10, size=1000)})
df2 = pd.DataFrame({'x' : np.random.uniform(low=0, high=10, size=850)})
scaler = StandardScaler()
new_col = scaler.fit_transform(df1[['x']])
new_col1 = scaler.transform(df2[['x']])
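To see why Normalizer behaved so differently, here is a minimal sketch reusing df1, df2, and the fitted scaler from above: Normalizer rescales each row to unit norm, so fitting it on [df1.x] treats the whole column as a single sample with one "feature" per row, which is why transform() later demanded that exact number of features.
from sklearn.preprocessing import Normalizer
nr = Normalizer()
# [df1['x']] is a single row with 1000 "features", so transform() will insist on 1000 columns
nr.fit([df1['x']])
nr.transform([df1['x']]).shape      # (1, 1000)
# StandardScaler learns per-column statistics instead, so it can transform any number of rows
scaler.transform(df2[['x']]).shape  # (850, 1)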
Here is the problem:
Extract just the median_income column from the independent variables (from X_train and X_test).
Perform Linear Regression to predict housing values based on median_income.
Predict output for test dataset using the fitted model.
Plot the fitted model for training data as well as for test data to check if the fitted model satisfies the test data.
I did a linear regression earlier. Following is the code:
import pandas as pd
import os
os.getcwd()
os.chdir('/Users/saurabhsaha/Documents/PGP-AI:ML-Purdue/New/datasets')
df=pd.read_excel('California_housing.xlsx')
df.total_bedrooms=df.total_bedrooms.fillna(df.total_bedrooms.mean())
x = df.iloc[:,2:8]
y = df.median_house_value
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=.20)
from sklearn.linear_model import LinearRegression
california_model = LinearRegression().fit(x_train,y_train)
california_model.predict(x_test)
Prdicted_values = pd.DataFrame(california_model.predict(x_test),columns=['Pred'])
Prdicted_values
Final = pd.concat([x_test.reset_index(drop=True),y_test.reset_index(drop=True),Prdicted_values],axis=1)
Final['Err_pct'] = abs(Final.median_house_value - Final.Pred) / Final.median_house_value
Here is my dataset- https://docs.google.com/spreadsheets/d/1vYngxWw7tqX8FpwkWB5G7Q9axhe9ipTu/edit?usp=sharing&ouid=114925088866643320785&rtpof=true&sd=true
Following is my code.
x1_train=x_train.median_income
x1_train
x1_train.shape
x1_test=x_test.median_income
x1_test
type(x1_test)
x1_test.shape
from sklearn.linear_model import LinearRegression
california_model_new = LinearRegression().fit(x1_train,y_train)
I get an error right here, and when I try converting the data to a 2D array as follows, it doesn't work:
import numpy as np
x1_train= x1_train.reshape(-1, 1)
x1_test = x1_train.reshape(-1, 1)
This is the error I get
AttributeError: 'Series' object has no attribute 'reshape'
I am new to data science, so if you could explain a bit it would be really helpful.
x1_train and x1_test are pandas Series objects, whereas the reshape() method applies to NumPy arrays.
Do this instead:
x1_train = x1_train.to_numpy().reshape(-1, 1)
x1_test = x1_test.to_numpy().reshape(-1, 1)
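Alternatively, you can avoid the reshape entirely by selecting the column with double brackets, which gives a one-column DataFrame (already 2D). A minimal sketch, reusing the variables from your code:
from sklearn.linear_model import LinearRegression
# [['median_income']] returns an (n_rows, 1) DataFrame instead of a 1D Series
x1_train = x_train[['median_income']]
x1_test = x_test[['median_income']]
california_model_new = LinearRegression().fit(x1_train, y_train)
income_predictions = california_model_new.predict(x1_test)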
When I try to use .predict on my linear regression, I get thrown the following error:
ValueError: Expected 2D array, got scalar array instead:
array=80.
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
I don't really understand the reshape operation and why it's needed. Can somebody please explain to me what it does, and how to apply it to get a prediction from my model?
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
x = np.array([95,85,80,70,60])
y = np.array([85,95,70,65,70])
x = x.reshape(-1,1)
y = y.reshape(-1,1)
plt.scatter(x,y)
plt.show()
reg = LinearRegression()
reg.fit(x,y)
reg.predict(80)
The input to predict() is a 2D array; you are passing a scalar value, which is why you are getting the error. You need to pass 80 as a 2D list, [[80]]:
reg.predict([[80]])
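If you prefer to be explicit about the shape, the same single sample can be passed as a NumPy array; a minimal sketch using the reg fitted above:
import numpy as np
x_new = np.array([[80]])   # shape (1, 1): one sample with one feature
reg.predict(x_new)         # equivalent to reg.predict([[80]])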
I'm using sklearn's PolynomialFeatures to preprocess data into various degree transformations in order to compare their model fit.
Below is my code:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
np.random.seed(0)
# x and y are the original data
n = 100
x = np.linspace(0,10,n) + np.random.randn(n)/5
y = np.sin(x)+n/6 + np.random.randn(n)/10
# using .PolynomialFeatures and fit_transform to transform original data to degree 2
poly1 = PolynomialFeatures(degree=2)
x_D2_poly = poly1.fit_transform(x)
#check out their dimensions
x.shape
x_D2_poly.shape
However, the above transformation returned an array of shape (1, 5151) from the original x of (100, 1). This is not what I expected, and I can't figure out what's wrong with my code. It would be great if someone could point out the error in my code or the misconception on my part.
Should I use an alternative method to transform the original data instead?
Thank you.
[update]
So after I used x = x.reshape(-1, 1) to transform the original x, Python does give me the desired output dimension (100, 1) via poly1.fit_transform(x). However, when I did a train_test_split, fitted the data, and tried to obtain predicted values:
x_poly1_train, x_poly1_test, y_train, y_test = train_test_split(x_poly1, y, random_state = 0)
linreg = LinearRegression().fit(x_poly1_train, y_train)
poly_predict = LinearRegression().predict(x)
Python returned an error message:
shapes (1,100) and (2,) not aligned: 100 (dim 1) != 2 (dim 0)
Apparently, there must be somewhere I got the dimensional thing wrong again. Could anyone shed some light on this?
Thank you.
I think you need to reshape your x like
x=x.reshape(-1,1)
Your x had shape (100,), not (100, 1), and fit_transform expects 2 dimensions.
The reason you were getting 5151 features is that you were seeing one feature for each distinct pair of features (100*99/2 = 4950), one for each feature squared (100), one for the first power of each feature (100), and one for the 0th power (1).
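You can check this directly by transforming both shapes; a minimal sketch reusing the x from your code:
from sklearn.preprocessing import PolynomialFeatures
poly1 = PolynomialFeatures(degree=2)
# 1D x interpreted as one sample with 100 features:
# bias (1) + linear terms (100) + squares and pairwise products (100*101/2 = 5050) = 5151
poly1.fit_transform(x.reshape(1, -1)).shape   # (1, 5151)
# reshaped to (100, 1), x is 100 samples with a single feature:
# bias, x, x**2 -> 3 columns
poly1.fit_transform(x.reshape(-1, 1)).shape   # (100, 3)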
Response to your edited question:
You need to call transform to convert the data you wish to predict on.
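A minimal sketch of that step, reusing the names from your code (exact numbers depend on your data):
x_poly1 = poly1.fit_transform(x.reshape(-1, 1))      # (100, 3) for degree 2
x_poly1_train, x_poly1_test, y_train, y_test = train_test_split(x_poly1, y, random_state=0)
linreg = LinearRegression().fit(x_poly1_train, y_train)
# predict on data that went through the same polynomial transform, not on the raw 1D x
poly_predict = linreg.predict(x_poly1_test)
# for brand-new raw values (x_new here is a hypothetical array of inputs), transform them first:
# poly_predict_new = linreg.predict(poly1.transform(x_new.reshape(-1, 1)))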
First, I looked at all the related questions; there are very similar problems there.
I followed the suggestions from the links, but none of them worked for me:
Data Conversion Error while applying a function to each row in pandas Python
Getting deprecation warning in Sklearn over 1d array, despite not having a 1D array
I also tried to follow the error message, but that didn't work either.
The code looks like this:
# Importing the libraries
import numpy as np
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('Position_Salaries.csv')
X = dataset.iloc[:, 1:2].values
y = dataset.iloc[:, 2].values
# avoid DataConversionError
X = X.astype(float)
y = y.astype(float)
## Attempt to avoid DeprecationWarning for sklearn.preprocessing
#X = X.reshape(-1,1) # attempt 1
#X = np.array(X).reshape((len(X), 1)) # attempt 2
#X = np.array([X]) # attempt 3
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
sc_y = StandardScaler()
X = sc_X.fit_transform(X)
y = sc_y.fit_transform(y)
# Fitting SVR to the dataset
from sklearn.svm import SVR
regressor = SVR(kernel = 'rbf')
regressor.fit(X, y)
# Predicting a new result
y_pred = regressor.predict(sc_X.transform(np.array([6.5])))
y_pred = sc_y.inverse_transform(y_pred)
The data looks like this:
Position,Level,Salary
Business Analyst,1,45000
Junior Consultant,2,50000
Senior Consultant,3,60000
Manager,4,80000
Country Manager,5,110000
Region Manager,6,150000
Partner,7,200000
Senior Partner,8,300000
C-level,9,500000
CEO,10,1000000
The full error log goes like this:
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/preprocessing/data.py:586: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/preprocessing/data.py:649: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/preprocessing/data.py:649: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/utils/validation.py:395: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
DeprecationWarning)
I am using only the second and third columns, so there is no need for one-hot encoding of the first column. The only problem is the DeprecationWarning.
I tried all the suggestions given, but none of them worked, so any help would be truly appreciated.
This was a strange one. The code I used to get rid of the deprecation warnings is below, with a slight modification to how StandardScaler() is fitted and transform() is called. The solution involved painstakingly reshaping and raveling the arrays according to the warning messages. I'm not sure if this is the best way, but it removed the warnings.
# Importing the libraries
import numpy as np
import pandas as pd
from io import StringIO
from sklearn.preprocessing import StandardScaler
# Setting up data string to be read in as a .csv
data = StringIO("""Position,Level,Salary
Business Analyst,1,45000
Junior Consultant,2,50000
Senior Consultant,3,60000
Manager,4,80000
Country Manager,5,110000
Region Manager,6,150000
Partner,7,200000
Senior Partner,8,300000
C-level,9,500000
CEO,10,1000000""")
dataset = pd.read_csv(data)
# Importing the dataset
#dataset = pd.read_csv('Position_Salaries.csv')
# Deprecation warnings call for reshaping of single feature arrays with reshape(-1,1)
X = dataset.iloc[:, 1:2].values.reshape(-1,1)
y = dataset.iloc[:, 2].values.reshape(-1,1)
# avoid DataConversionError
X = X.astype(float)
y = y.astype(float)
#sc_X = StandardScaler()
#sc_y = StandardScaler()
X_scaler = StandardScaler().fit(X)
y_scaler = StandardScaler().fit(y)
X_scaled = X_scaler.transform(X)
y_scaled = y_scaler.transform(y)
# Fitting SVR to the dataset
from sklearn.svm import SVR
regressor = SVR(kernel = 'rbf')
# One of the warnings called for ravel()
regressor.fit(X_scaled, y_scaled.ravel())
# Predicting a new result
# The warnings called for single samples to reshaped with reshape(1,-1)
X_new = np.array([6.5]).reshape(1,-1)
X_new_scaled = X_scaler.transform(X_new)
y_pred = regressor.predict(X_new_scaled)
y_pred = y_scaler.inverse_transform(y_pred)