labels column shape is inconsistent after splitting data into features and labels - python

I have a dataset in a pandas DataFrame that needs to be split into a feature set and labels. At the moment I am splitting the columns as below:
features = df2.drop('case_of_injury_group', axis=1)
labels = df2['case_of_injury_group']
but the shape of labels is not what I expected.
features.shape
gives (39778, 12) and
labels.shape
gives (39778,), but I want it to be (39778, 1). Please let me know what I am doing wrong here.

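If you want a one-column DataFrame, select with a one-element list (double brackets). Selecting with a single label returns a Series, which is one-dimensional, hence the shape (39778,):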
labels = df2[['case_of_injury_group']]
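A minimal sketch of the difference between the two selections, using a made-up three-row frame in place of df2 (the "age" column and all values are assumptions for illustration only):

```python
import pandas as pd

# Toy frame standing in for df2; the values are invented.
df2 = pd.DataFrame({
    "age": [25, 40, 31],
    "case_of_injury_group": ["a", "b", "a"],
})

labels_series = df2["case_of_injury_group"]    # single label -> Series (1-D)
labels_frame = df2[["case_of_injury_group"]]   # list of labels -> DataFrame (2-D)
labels_frame_alt = labels_series.to_frame()    # same result via Series.to_frame()

print(labels_series.shape)  # (3,)
print(labels_frame.shape)   # (3, 1)
```

Either form of the 2-D selection should work wherever a (n, 1) shape is required.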

Related

MinMax Scaler using column transformer ( the transformed columns are shifted front)

I am trying to build a model on the House Prices - Advanced Regression Techniques data set (1460, 80). It has 37 numerical features and 43 categorical features.
I want to scale the numerical features first, then one-hot encode the categorical features.
I am using MinMaxScaler along with a column transformer.
After scaling the data, the DataFrame does not retain the column names.
Here is my code:
columns_transform_sc=make_column_transformer((MinMaxScaler(),['MSSubClass',
'LotFrontage',
'LotArea',
'OverallQual',
'OverallCond',
'YearBuilt',
'YearRemodAdd',
'MasVnrArea',
'BsmtFinSF1',
'BsmtFinSF2',
'BsmtUnfSF',
'TotalBsmtSF',
'1stFlrSF',
'2ndFlrSF',
'LowQualFinSF',
'GrLivArea',
'BsmtFullBath',
'BsmtHalfBath',
'FullBath',
'HalfBath',
'BedroomAbvGr',
'KitchenAbvGr',
'TotRmsAbvGrd',
'Fireplaces',
'GarageYrBlt',
'GarageCars',
'GarageArea',
'WoodDeckSF',
'OpenPorchSF',
'EnclosedPorch',
'3SsnPorch',
'ScreenPorch',
'PoolArea',
'MiscVal',
'MoSold',
'YrSold']),remainder="passthrough")
sc_df=columns_transform_sc.fit_transform(x_train)
I used the original DataFrame's (x_train) columns for the scaled DataFrame (sc_df):
sc_df=pd.DataFrame(sc_df,index=x_train.index,columns=x_train.columns)
The problem I'm facing is that the column transformer moves all the columns it has transformed to the front and pushes the passthrough columns to the back, so I can't use x_train.columns as the columns of sc_df.
All the categorical features have been shifted to the back. Is there a way to retain the column names, or to get the column names from the transformer?
Also, should I encode the categorical features (one-hot or label encoding) first and then scale (standardize or normalize) everything, including the encoded data, or scale first and then encode?
I think you can do the scaling first, and sometimes you have to. I suggest trying this:
qt = QuantileTransformer(n_quantiles=50, output_distribution='normal', random_state=0)
df.Betrag = qt.fit_transform(df.Betrag.values.reshape(-1, 1))
Note: You can replace the single column with a subset of columns, using the standard syntax for selecting several pandas DataFrame columns:
age_sex = titanic[["Age", "Sex"]]
In this case, you would pass age_sex to the fit and transform functions, assuming these are the columns you want to transform. You are also not restricted to the QuantileTransformer; the code should work generically for all transformers.
EDIT: Sorry, quick side note: the reshape operation is only necessary if you pass a tensor with a single feature to the QuantileTransformer. In the case of a multi-feature tensor, whatever the transformer, it should not be necessary.
I suggest performing some kind of encoding first and then scaling all values. This would not only help you retain your columns, but the encoded values would also end up on the same scale.

normalizing my timeseries dataset then setting the timestamp as Index

Here is my code for normalizing my dataset. The code works, but when I create the new DataFrame (the last line of my code) it does not include the timestamp column, because it contains only the scaled values.
data_consumption2 = pd.read_excel(r"C:\Users\user\Desktop\Thesis\Tarek\Parent.xlsx", sheet_name="Consumption")
data_consumption2['Timestamp'] = pd.to_datetime(data_consumption2['Timestamp'], unit='s')
data_consumption2.fillna(0,inplace=True)
data_consumption2 = data_consumption2.set_index('Timestamp')
#returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(data_consumption2.values)
data_consumption2 = pd.DataFrame(x_scaled)
I hope someone can help me get my original DataFrame back, with the timestamps and the scaled values in it.
You have to set the index of the new DataFrame you create.
min_max_scaler.fit_transform returns a NumPy array of the scaled values, so the index is lost.
So you could do:
data_consumption2 = pd.DataFrame(data=x_scaled, index=data_consumption2.index)
If you also want to keep the column names, you can pass them along too:
data_consumption2 = pd.DataFrame(data=x_scaled,
                                 index=data_consumption2.index,
                                 columns=data_consumption2.columns)
More details in the DataFrame documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html
These are basic pandas manipulations; you should find the answers in the documentation.
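For reference, a self-contained sketch of the same idea on invented data (the column names load_a and load_b and the epoch-second timestamps are assumptions, not taken from the original spreadsheet):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Invented sample with epoch-second timestamps and one missing value.
raw = pd.DataFrame({
    "Timestamp": [1600000000, 1600000900, 1600001800],
    "load_a": [1.0, 3.0, 5.0],
    "load_b": [10.0, None, 30.0],
})
raw["Timestamp"] = pd.to_datetime(raw["Timestamp"], unit="s")
raw = raw.fillna(0).set_index("Timestamp")

scaler = MinMaxScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(raw),  # plain NumPy array, no index or column names
    index=raw.index,            # reattach the timestamps
    columns=raw.columns,        # reattach the column names
)
```

Keeping the unscaled frame under a separate name (raw here) avoids losing the index before it has been copied over.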

How to scale all columns except last column?

I'm using Python 3.7.6 and working on a classification problem.
I want to scale the feature columns of my DataFrame (df).
The DataFrame contains 56 columns (55 feature columns, and the last column is the target column).
I'm doing it as follows:
y = df.iloc[:,-1]
target_name = df.columns[-1]
from FeatureScaling import feature_scaling
df = feature_scaling.scale(df.iloc[:,0:-1], standardize=False)
df[target_name] = y
but it does not seem efficient, because I need to recreate the DataFrame (add the target column back to the scaling result).
Is there an efficient way to scale just some columns without changing the others?
(i.e. the result of scale would contain the scaled columns plus the one column that is not scaled)
Using column indices for scaling or other pre-processing operations is not a very good idea, as the code breaks every time you add a new feature. Use column names instead, e.g.
using scikit-learn:
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler
features = [<features to standardize>]
scaler = StandardScaler()
# fit_transform returns a 2-D numpy.array; wrap it in a pd.DataFrame,
# keeping df's index so that concat aligns the rows correctly
standardized_features = pd.DataFrame(scaler.fit_transform(df[features]), index=df.index, columns=features)
old_shape = df.shape
# drop the unstandardized features from the dataframe
df.drop(features, axis=1, inplace=True)
# join the standardized features back (they end up as the last columns)
df = pd.concat([df, standardized_features], axis=1)
assert old_shape == df.shape, "something went wrong!"
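Or, if you prefer not to split the data and join it back, you can use a function like this:
import numpy as np
def normalize(x):
    # x is a whole column (Series); standardize it to zero mean and unit variance
    if np.std(x) == 0:
        raise ValueError('Constant column')
    return (x - np.mean(x)) / np.std(x)
for col in features:
    # pass the whole column; Series.map would call normalize on each scalar
    df[col] = normalize(df[col])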
You can slice the columns you want:
df.iloc[:, :-1] = feature_scaling.scale(df.iloc[:, :-1], standardize=False)
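Since FeatureScaling is not a standard package, here is a hedged sketch of the same in-place idea with scikit-learn's MinMaxScaler swapped in; the frame and its column names are invented for illustration:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy frame: two feature columns followed by the target as the last column.
df = pd.DataFrame({
    "f1": [1.0, 2.0, 3.0],
    "f2": [10.0, 20.0, 40.0],
    "target": [0, 1, 0],
})

feature_cols = df.columns[:-1]  # every column except the last
# Overwrite only the feature columns; the target column is left untouched.
df[feature_cols] = MinMaxScaler().fit_transform(df[feature_cols])
```

Selecting the columns once by name (here derived from position) means the target never has to be removed and re-attached.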

Converting an array structure to a dataframe to get the column names

I have a DataFrame which I converted to arrays in order to model the data with a regression algorithm. I used the following code:
X=df.iloc[:, 0:345].values
Y=df.iloc[:,345].values
Hence X and Y are now arrays. There are many columns because the categorical variables have been converted into dummy variables. Next, I create the train and test split:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
X_train,X_test,y_train,y_test=train_test_split(X,Y,test_size=0.25,random_state=0)
Now that I have built the model and made predictions, I want to get back the values of my categorical variables (X and Y were created after making dummy variables for all categorical variables). For this, I am trying to convert X_test back to a DataFrame with the column names of the original DataFrame df. I tried the following code:
dff=df.iloc[:, 0:345]
The statement above selects the first 345 columns of the DataFrame.
Then,
pd.DataFrame(X_test, index=dff.index, columns=dff.columns)
gives the following error:
ValueError: Shape of passed values is (345, 25000), indices imply (345, 100000)
I don't understand why the number of rows matters. I have fewer rows because the data was split 75%-25% into train and test sets, and I performed the split after converting the data to arrays. How do I now convert the array data into a DataFrame with the column names from the dff DataFrame?
The statement
pd.DataFrame(X_test, index=dff.index, columns=dff.columns)
fails because X_test is a numpy.ndarray holding only the test rows (25000), while dff.index still has one entry per row of the full DataFrame (100000), so the index length does not match the data.
I changed the statement to just this:
df_new=pd.DataFrame(X_test)
df_new.columns=list(dff.columns)
The new DataFrame contains the X_test data, with the column names taken from the dff DataFrame.
I would recommend passing the DataFrame to train_test_split, and then passing arrays to your algorithm using NumPy:
my_algorithm(np.asarray(X_train), np.asarray(y_train))
This way you can inspect your data as you would any DataFrame, but run the model on the arrays. I'm not sure which library you are using, but I'm pretty sure some can now take DataFrames directly for modeling.
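A small sketch of that recommendation, on an invented eight-row frame (the column names x1, x2 and y are placeholders): when train_test_split receives pandas objects, the pieces it returns keep their column names and original row labels.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: two feature columns and a target as the last column.
df = pd.DataFrame({
    "x1": range(8),
    "x2": range(8, 16),
    "y": [0, 1] * 4,
})
X = df.iloc[:, :-1]  # still a DataFrame, not .values
y = df.iloc[:, -1]   # still a Series

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# X_test is a DataFrame with the original column names and row labels,
# so no manual reconstruction is needed after the split.
print(X_test.columns.tolist())
```

The row labels in X_test.index can then be used to look up the matching rows in the original DataFrame, for example to recover the categorical values from before the dummy encoding.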

What is the meaning of the error ValueError: cannot copy sequence with size 205 to array axis with dimension 26 and how to solve it

This is the code I wrote; I am trying to convert the non-numerical data to numeric. However, it returns the error ValueError: cannot copy sequence with size 205 to array axis with dimension 26
The data is from http://archive.ics.uci.edu/ml/datasets/Automobile
automobile = pd.read_csv('imports-85.csv', names = ["symboling",
    "normalized-losses", "make", "fuel", "aspiration", "num-of-doors",
    "body-style", "drive-wheels", "engine-location", "wheel-base", "length",
    "width", "height", " curb-weight", "engine-type", "num-of-cylinders",
    "engine-size", "fuel-system", "bore", "stroke", " compression-ratio",
    "horsepower", "peak-rpm", "city-mpg", "highway-mpg", "price"])
X = automobile.drop('symboling',axis=1)
y = automobile['symboling']
le = preprocessing.LabelEncoder()
le.fit([automobile])
print (le)
The fit method takes a 1-D array of shape [n_samples] (see the docs). You are passing the entire DataFrame wrapped in a list. I'm pretty sure that if you print the shape of your DataFrame (automobile.shape) it will show (205, 26).
If you want to encode your data you need to do it one column at a time, e.g.
le.fit(automobile['make'])
Note that this is not the correct way to encode categorical features: as the name suggests, LabelEncoder is designed for labels, not input features. In scikit-learn's current state you should use OneHotEncoder. There are plans for a categorical encoder in the next release.
