Replacing Manual Standardization with Standard Scaler Function - python

I want to replace the manual standardization of the monthly data with the StandardScaler class from sklearn. I tried the line of code below the commented-out line, but I am receiving the following error.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
arr = pd.DataFrame(np.arange(1,21), columns=['Output'])
arr2 = pd.DataFrame(np.arange(10, 210, 10), columns=['Output2'])
index2 = pd.date_range('20180928 10:00am', periods=20, freq="W")
# index3 = pd.DataFrame(index2, columns=['Date'])
df2 = pd.concat([pd.DataFrame(index2, columns=['Date']), arr, arr2], axis=1)
print(df2)
cols = df2.columns[1:]
# df2_grouped = df2.groupby(['Date'])
df2.set_index('Date', inplace=True)
df2_grouped = df2.groupby(pd.Grouper(freq='M'))
for c in cols:
    # df2[c] = df2_grouped[c].apply(lambda x: (x - x.mean()) / x.std())
    df2[c] = df2_grouped[c].apply(lambda x: StandardScaler().fit_transform(x))
print(df2)
ValueError: Expected 2D array, got 1D array instead:
array=[1.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

The error message says that StandardScaler().fit_transform only accepts a 2-D array, while groupby passes each group to the lambda as a 1-D Series.
So you could replace:
df2[c] = df2_grouped[c].apply(lambda x: StandardScaler().fit_transform(x))
with:
from sklearn.preprocessing import scale
df2[c] = df2_grouped[c].transform(lambda x: scale(x.astype(float)))
as a workaround.
From the sklearn.preprocessing.scale documentation:
Standardize a dataset along any axis. Center to the mean and component-wise scale to unit variance.
So it should work as a standard scaler.
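Alternatively, if you would rather keep StandardScaler itself, a minimal sketch (using the same setup as above) is to reshape each 1-D group into the 2-D column the scaler expects and flatten the result back:
# Reshape each group to (n, 1) for the scaler, then flatten back to 1-D
# so the result fits the original Series.
for c in cols:
    df2[c] = df2_grouped[c].transform(
        lambda x: StandardScaler().fit_transform(x.values.reshape(-1, 1)).ravel()
    )
Note that StandardScaler and scale divide by the population standard deviation (ddof=0), so the results will differ slightly from the commented-out pandas version, whose x.std() uses ddof=1.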

Related

How to scale all columns except last column?

I'm using Python 3.7.6 and working on a classification problem.
I want to scale the feature columns of my data frame (df).
The dataframe contains 56 columns (55 feature columns; the last column is the target column).
I'm doing it as follows:
y = df.iloc[:,-1]
target_name = df.columns[-1]
from FeatureScaling import feature_scaling
df = feature_scaling.scale(df.iloc[:,0:-1], standardize=False)
df[target_name] = y
but this seems inefficient, because I need to rebuild the dataframe (adding the target column back to the scaling result).
Is there a way to scale just some columns without changing the others, in an efficient way?
(i.e. the result of the scaling should contain the scaled columns plus the one column that was not scaled)
Using column indices for scaling or other pre-processing operations is not a very good idea, as the code breaks every time you create a new feature. Use column names instead, e.g.
using scikit-learn:
from sklearn.preprocessing import StandardScaler
features = [<features to standardize>]
scaler = StandardScaler()
# fit_transform returns a 2-D numpy.array; we cast it back to a pd.DataFrame
standardized_features = pd.DataFrame(scaler.fit_transform(df[features].copy()), columns=features)
old_shape = df.shape
# drop the unscaled features from the dataframe
df.drop(features, axis=1, inplace=True)
# join the scaled features back
df = pd.concat([df, standardized_features], axis=1)
assert old_shape == df.shape, "something went wrong!"
Or, if you prefer not to split and join the data back, you can use a function like this, applied to the whole column at once (np.mean and np.std work on a Series, so no element-wise map is needed):
import numpy as np
def normalize(x):
    if np.std(x) == 0:
        raise ValueError('Constant column')
    return (x - np.mean(x)) / np.std(x)
for col in features:
    df[col] = normalize(df[col])
You can slice the columns you want:
df.iloc[:, :-1] = feature_scaling.scale(df.iloc[:, :-1], standardize=False)
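If you're open to scikit-learn's own tooling for this, another option is ColumnTransformer with remainder='passthrough', which scales only the named columns and passes the rest through untouched. A minimal sketch, assuming feature_cols holds your 55 feature column names and target_name is as defined in the question:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
# Scale only feature_cols; remainder='passthrough' keeps the target column as-is.
ct = ColumnTransformer([('scale', StandardScaler(), feature_cols)],
                       remainder='passthrough')
df = pd.DataFrame(ct.fit_transform(df), columns=feature_cols + [target_name])
Passthrough columns come after the transformed ones in the output, which is why target_name goes last in the column list.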

Python pandas.dataframe.isin returning unexpected results

I have encountered this several times when trying to filter a dataframe using a column from another dataframe: isin incorrectly returns True for every row. It is probably just a misunderstanding on my part as to how it should work. Why is it doing this, and is there a better way to code it?
#Read the data into a pandas dataframe
ar_data = pd.read_excel('~/data/Accounts-Receivable.xlsx')
ar_data.set_index('customerID', inplace=True)
#randomly select records for 70/30 train/test split
train = ar_data.sample(frac=.7, random_state = 1)
mask = ~ar_data.index.isin(list(train.index)) #why does this return False for every value?
test = ar_data[mask]
ar_data.shape #returns (2466, 11)
train.shape #(1726, 11)
test.shape #returns (0, 11). Should return 740 rows!
I tried to execute your code with a sample DataFrame, and it works:
import pandas as pd
ar_data = [[10,20],[11,2],[9,3]]
df = pd.DataFrame(ar_data,columns=["1","2"])
df.set_index("1", inplace=True)
train = df.sample(frac=.7, random_state = 1)
mask = ~df.index.isin(list(train.index))
test = df[mask]
train.shape #shape = (2,1)
test.shape #shape = (1,1)
The problem you most likely have is that the index you used is not unique, so there are multiple rows with the same customerID. In that case the 70% sample can cover every distinct index value, so isin returns True for every row and the mask is all False. In fact, executing your code with duplicated indexes reproduces the bug you encountered:
import pandas as pd
ar_data = [[10,20],[10,2],[10,3]]
df = pd.DataFrame(ar_data,columns=["1","2"])
df.set_index("1", inplace=True)
train = df.sample(frac=.7, random_state = 1)
mask = ~df.index.isin(list(train.index))
test = df[mask]
train.shape #shape = (2,1)
test.shape #shape = (0,1)
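If you want to keep a manual split that stays correct even with duplicate index values, one sketch is to split by row position rather than by index label:
import numpy as np
# Shuffle row positions instead of index labels, so duplicates can't collide.
rng = np.random.RandomState(1)
positions = rng.permutation(len(ar_data))
cut = int(len(ar_data) * 0.7)
train = ar_data.iloc[positions[:cut]]
test = ar_data.iloc[positions[cut:]]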
Anyway, an easier and faster way to split your dataset would be:
from sklearn.model_selection import train_test_split
train, test = train_test_split(ar_data, test_size=0.3, random_state=1)
train_test_split can also split features and targets together in a single call (pass X and y and it returns X_train, X_test, y_train, y_test), and it doesn't rely on the index at all.

How to create a new data frame

Shape of passed values is (1000, 10), indices imply (1000, 11)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(df.drop('TARGET CLASS',axis=1))
scaled_features = scaler.transform(df.drop('TARGET CLASS',axis=1))
df_feat = pd.DataFrame(scaled_features,columns=df.columns)
The error,
Shape of passed values is (1000, 10), indices imply (1000, 11)
occurs on this line
df_feat = pd.DataFrame(scaled_features,columns=df.columns)
because scaled_features has 10 columns, but df.columns has length 11.
Notice that df.drop('TARGET CLASS',axis=1) was called twice to drop the TARGET CLASS column from df. It seems likely that this is the extra column in df that you would want to drop from the list of new columns.
The problem can be fixed by saving a reference to df.drop('TARGET CLASS',axis=1)
(let's call it df_minus_target), and passing df_minus_target.columns as the new list of columns:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df_minus_target = df.drop('TARGET CLASS',axis=1)
scaler.fit(df_minus_target)
scaled_features = scaler.transform(df_minus_target)
df_feat = pd.DataFrame(scaled_features,columns=df_minus_target.columns)
You forgot to remove the last column from df when extracting the column names for the df_feat dataframe (it should be pd.DataFrame(scaled_features, columns=df.drop('TARGET CLASS', axis=1).columns)); see the whole reproducible example below:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
# Mock your dataset:
df = pd.DataFrame(np.random.rand(5, 10))
df = pd.concat([df, pd.Series([1, 1, 0, 0, 1], name='TARGET CLASS')], axis=1)
scaler = StandardScaler()
scaler.fit(df.drop('TARGET CLASS',axis=1))
scaled_features = scaler.transform(df.drop('TARGET CLASS',axis=1))
df_feat = pd.DataFrame(scaled_features,columns=df.drop('TARGET CLASS',axis=1).columns)
print(df_feat)
Or in order to prevent this kind of mistakes in the future - extract feature columns that you want to work on into a separate dataframe first:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
# Mock your dataset:
df = pd.DataFrame(np.random.rand(5, 10))
df = pd.concat([df, pd.Series([1, 1, 0, 0, 1], name='TARGET CLASS')], axis=1)
# Extract raw features columns first.
df_feat = df.drop('TARGET CLASS', axis=1)
# Do transformations.
scaler = StandardScaler()
scaler.fit(df_feat)
scaled_features = scaler.transform(df_feat)
df_feat_scaled = pd.DataFrame(scaled_features, columns=df_feat.columns)
print(df_feat_scaled)
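One further caveat worth noting: pd.DataFrame(scaled_features, ...) gives the new frame a fresh RangeIndex. If your df has a meaningful index, pass it along explicitly so the rows stay aligned with the original (a one-line sketch using the variables from the example above):
# Preserve the original row index so df_feat_scaled aligns with df.
df_feat_scaled = pd.DataFrame(scaled_features, columns=df_feat.columns, index=df_feat.index)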

K-Means Python Syntax When Records Represented by a Cnt Column (in Aggregate)

I'm trying to run K-Means in Python on aggregated data files. For example, instead of a data frame where 3 identical records appear as 3 rows, a single row represents all 3, with a column like cnt (arbitrarily named) holding the number 3 for those 3 instances.
Below is some basic starter code that does NOT use the aggregated representation of the rows. Please let me know if you would like me to post the .csv too, but it should be pretty basic:
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt
data = pd.read_csv('../Data/wholesale_data.csv')
data.head()
categorical_features = ['Channel', 'Region']
continuous_features = ['Fresh', 'Milk', 'Grocery', 'Frozen',
                       'Detergents_Paper', 'Delicassen']
for col in categorical_features:                     # for each categorical col
    dummies = pd.get_dummies(data[col], prefix=col)  # one-hot encoding
    data = pd.concat([data, dummies], axis=1)        # append to data
    data.drop(col, axis=1, inplace=True)             # drop original column
data.head()
mms = MinMaxScaler()
mms.fit(data)
data_transformed = mms.transform(data)
sum_of_squared_distances = []
K = range(1,15)
for k in K:
    km = KMeans(n_clusters=k)                     # init model
    km = km.fit(data_transformed)                 # fit model
    sum_of_squared_distances.append(km.inertia_)  # overall SSE
plt.plot(K, sum_of_squared_distances, 'bx-')
plt.xlabel('k')
plt.ylabel('Sum_of_squared_distances')
plt.title('Elbow Method For Optimal k')
plt.show()
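On the aggregated representation itself: scikit-learn's KMeans.fit accepts a sample_weight argument, so a row with cnt = 3 can be made to count as 3 observations without expanding it back into 3 rows. A minimal sketch, assuming the count column is literally named cnt (the file name here is hypothetical):
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler
data = pd.read_csv('../Data/wholesale_data_aggregated.csv')  # hypothetical aggregated file
weights = data['cnt']
features_only = data.drop('cnt', axis=1)  # don't scale or cluster on the count itself
mms = MinMaxScaler()
data_transformed = mms.fit_transform(features_only)
sum_of_squared_distances = []
for k in range(1, 15):
    km = KMeans(n_clusters=k)
    km.fit(data_transformed, sample_weight=weights)  # each row counts cnt times
    sum_of_squared_distances.append(km.inertia_)
The elbow plot code from the question then works unchanged.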

Linear regression using Python (Pandas and Numpy)

I am trying to implement linear regression using python.
I did the following steps:
import pandas as p
import numpy as n
data = p.read_csv("...path\Housing.csv", usecols=[1]) # I want the first col
data1 = p.read_csv("...path\Housing.csv", usecols=[3]) # I want the 3rd col
x = data
y = data1
Then I try to obtain the co-efficients, and use the following:
regression_coeff = n.polyfit(x,y,1)
And then I get the following error:
raise TypeError("expected 1D vector for x")
TypeError: expected 1D vector for x
I am unable to get my head around this, as when I print x and y, I can very clearly see that they are both 1D vectors.
Can someone please help?
Dataset can be found here: DataSets
The original code is:
import pandas as p
import numpy as n
data = p.read_csv('...\housing.csv', usecols=[1])
data1 = p.read_csv('...\housing.csv', usecols=[3])
x = data
y = data1
regression = n.polyfit(x, y, 1)
This should work:
np.polyfit(data.values.flatten(), data1.values.flatten(), 1)
data is a dataframe and its values are 2D:
>>> data.values.shape
(546, 1)
flatten() turns it into a 1D array:
>>> data.values.flatten().shape
(546,)
which is needed for polyfit().
Simpler alternative:
df = pd.read_csv("Housing.csv")
np.polyfit(df['price'], df['bedrooms'], 1)
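np.polyfit returns the coefficients with the highest power first, so for a degree-1 fit you can unpack the slope and intercept directly. A quick usage sketch with the column names from the answer above:
import numpy as np
import pandas as pd
df = pd.read_csv("Housing.csv")
# polyfit returns [slope, intercept] for degree 1 (highest power first)
slope, intercept = np.polyfit(df['price'], df['bedrooms'], 1)
predicted = slope * df['price'] + intercept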
pandas.read_csv() returns a DataFrame, which has two dimensions while np.polyfit wants a 1D vector for both x and y for a single fit. You can simply convert the output of read_csv() to a pd.Series to match the np.polyfit() input format using .squeeze():
data = pd.read_csv('../Housing.csv', usecols=[1]).squeeze()
data1 = pd.read_csv('../Housing.csv', usecols=[3]).squeeze()
Python is telling you that the data is not in the right format: x must be a 1-D array, while in your case it is a 2-D pandas DataFrame.
You can convert your data to a numpy array and squeeze it to fix the problem.
import pandas as pd
import numpy as np
data = pd.read_csv('../Housing.csv', usecols = [1])
data1 = pd.read_csv('../Housing.csv', usecols = [3])
data = np.squeeze(np.array(data))
data1 = np.squeeze(np.array(data1))
x = data
y = data1
regression = np.polyfit(x, y, 1)
