Shape of passed values is (1000, 10), indices imply (1000, 11)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(df.drop('TARGET CLASS',axis=1))
scaled_features = scaler.transform(df.drop('TARGET CLASS',axis=1))
df_feat = pd.DataFrame(scaled_features,columns=df.columns)
The error,
Shape of passed values is (1000, 10), indices imply (1000, 11)
occurs on this line
df_feat = pd.DataFrame(scaled_features,columns=df.columns)
because scaled_features has 10 columns, but df.columns has length 11.
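You can confirm the mismatch directly; the values in the comments follow from the error message:
print(scaled_features.shape)  # (1000, 10) - TARGET CLASS was dropped before scaling
print(len(df.columns))        # 11 - df.columns still includes TARGET CLASS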
Notice that df.drop('TARGET CLASS',axis=1) was called twice to remove the TARGET CLASS column before scaling, but df.columns still includes that column. TARGET CLASS is the extra column you also need to drop from the list of new column names.
The problem can be fixed by saving a reference to df.drop('TARGET CLASS',axis=1)
(let's call it df_minus_target), and passing df_minus_target.columns as the new list of columns:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df_minus_target = df.drop('TARGET CLASS',axis=1)
scaler.fit(df_minus_target)
scaled_features = scaler.transform(df_minus_target)
df_feat = pd.DataFrame(scaled_features,columns=df_minus_target.columns)
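Since the scaler is fit on and applied to the same data, the fit and transform steps can also be combined; a slightly shorter variant of the same fix:
df_minus_target = df.drop('TARGET CLASS', axis=1)
scaled_features = StandardScaler().fit_transform(df_minus_target)
df_feat = pd.DataFrame(scaled_features, columns=df_minus_target.columns)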
You forgot to remove the TARGET CLASS column from df when extracting the column names for the df_feat dataframe (it should be pd.DataFrame(scaled_features, columns=df.drop('TARGET CLASS', axis=1).columns)); see the whole reproducible example below:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
# Mock your dataset:
df = pd.DataFrame(np.random.rand(5, 10))
df = pd.concat([df, pd.Series([1, 1, 0, 0, 1], name='TARGET CLASS')], axis=1)
scaler = StandardScaler()
scaler.fit(df.drop('TARGET CLASS',axis=1))
scaled_features = scaler.transform(df.drop('TARGET CLASS',axis=1))
df_feat = pd.DataFrame(scaled_features,columns=df.drop('TARGET CLASS',axis=1).columns)
print(df_feat)
Or, to prevent this kind of mistake in the future, extract the feature columns that you want to work on into a separate dataframe first:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
# Mock your dataset:
df = pd.DataFrame(np.random.rand(5, 10))
df = pd.concat([df, pd.Series([1, 1, 0, 0, 1], name='TARGET CLASS')], axis=1)
# Extract raw features columns first.
df_feat = df.drop('TARGET CLASS', axis=1)
# Do transformations.
scaler = StandardScaler()
scaler.fit(df_feat)
scaled_features = scaler.transform(df_feat)
df_feat_scaled = pd.DataFrame(scaled_features, columns=df_feat.columns)
print(df_feat_scaled)
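As a side note, if your scikit-learn version is 1.2 or newer, you can also ask the scaler to return a DataFrame directly via the set_output API, which avoids the column-name bookkeeping entirely; a minimal sketch using the same mock df as above:
scaler = StandardScaler().set_output(transform="pandas")
df_feat_scaled = scaler.fit_transform(df.drop('TARGET CLASS', axis=1))
print(df_feat_scaled)  # a DataFrame that keeps the original feature column names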
Related
I currently have a data frame with ~350 columns. I want to impute the NaNs in one of these columns based on several other columns, using IterativeImputer with ExtraTreesRegressor. I've created a smaller data frame containing the features of interest. My dataframe looks like:
I want to impute the NaNs in first_seen_days; however, the issue I'm having is that all the NaNs are imputed with the same value. I'm expecting each NaN to be imputed with a different value. This is my code:
data_interpolation = df_sample[["first_seen_days","domain_relevant_info_id",
"reason_id", "score.1", "status_id"]]
imp = IterativeImputer(random_state = 0)
imp.fit(data_interpolation)
X = data_interpolation
data_interpolation["first_seen_days"] = imp.transform(X)
I have tried replicating your problem and I am able to impute different values using ExtraTreesRegressor. Based on this, your problem could be due to an inherent property of your data.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.ensemble import ExtraTreesRegressor
np.random.seed(0)
X = np.random.rand(20, 5)
df = pd.DataFrame(X, columns=["A", "B", "C", "D", "E"])
# randomly set these indexes to NaN in column "A"
for i in [3, 5, 7, 15]:
    df.loc[i, "A"] = np.nan
## imputation - part of code from the question
imp = IterativeImputer(estimator=ExtraTreesRegressor(), random_state=0)
imp.fit(df)
# transform returns all five columns; keep only the imputed "A" column
df["A"] = imp.transform(df)[:, 0]
# imputed values
print(df.iloc[[3, 5, 7, 15]]["A"])
#output
3 0.706066
5 0.561352
7 0.776586
15 0.550094
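Applied to your own frame, the same pattern would look roughly like this (a sketch reusing the column names from your question, with df_sample as your frame). Note that transform returns every column that was fed to the imputer, so you pick out first_seen_days rather than assigning the full 2-D result to a single column:
cols = ["first_seen_days", "domain_relevant_info_id",
        "reason_id", "score.1", "status_id"]
imp = IterativeImputer(estimator=ExtraTreesRegressor(), random_state=0)
imputed = imp.fit_transform(df_sample[cols])
# keep only the column that corresponds to first_seen_days
df_sample["first_seen_days"] = imputed[:, cols.index("first_seen_days")]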
Hi everybody, I'm new to the Python world and I'm trying to learn pandas and tensorflow.
At the moment I have a dataframe with positive and negative values that I want to rescale.
For example
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler
df = pd.read_excel ('/Users/dataset.xlsx')
print(df[:])
scaler = MinMaxScaler(feature_range=(0,1))
df_absolute = df.abs()
df_scaled = scaler.fit_transform(df_absolute)
#df_mod = df_scaled.loc[(df<0)] = df_scaled*-1
df_normalized = pd.DataFrame(df_mod)
print(df_normalized[:])
I get an error on the commented-out line (the one starting with #), mentioning 'numpy.ndarray'.
How can I resolve this?
In the line
df = pd.read_excel ('/Users/dataset.xlsx')
there is a whitespace ' ' before the parenthesis; you should remove it:
df = pd.read_excel('/Users/dataset.xlsx')
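The 'numpy.ndarray' error itself comes from the commented-out line: scaler.fit_transform returns a NumPy array, and an array has no .loc. If the intent of that line was to scale the absolute values and then restore the original signs, a rough sketch of one way to do it (the sign logic is an assumption about what you meant; the column and index names are reused from df):
df_scaled = pd.DataFrame(scaler.fit_transform(df_absolute),
                         columns=df.columns, index=df.index)
# reapply the original signs (assumed intent of the commented-out line)
df_normalized = df_scaled * np.sign(df)
print(df_normalized[:])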
I have a CSV file. I want to apply one-hot encoding and then save the new dataframe (dataset) as a new CSV file. But when the new file is saved, it only contains 5 rows of dummies alongside all rows of the original dataset!
I just want to save all rows and columns in the new file.csv; the final shape of the dataset should be (237124, 417).
My code contains:
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
import csv
dataset=pd.read_csv("C:/Users/User/Desktop/data.csv",encoding='cp1252')
dataset.shape
#output: (237124, 37)
dummies = pd.get_dummies(dataset, columns=["name","mark",....... ]).head()
dummies.shape
#output : (5, 380)
dataset = pd.concat([dataset, dummies], axis=1)
dataset.shape
#output: (237124, 417)
# i want this shape(original+dummies)
dataset.to_csv('OneHotEncodnig.csv', index=False)
You call .head() in this line:
dummies = pd.get_dummies(dataset, columns=["name","mark",....... ]).head()
This is why you only get 5 dummy rows. Remove the .head() and you get all rows.
The head() function returns only the first 5 rows by default. Apply get_dummies() without the head() call:
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
import csv
dataset=pd.read_csv("C:/Users/User/Desktop/data.csv",encoding='cp1252')
dataset.shape
#output: (237124, 37)
dummies = pd.get_dummies(dataset, columns=["name","mark",....... ])
dummies.shape
#output : (237124, 380)
dataset = pd.concat([dataset, dummies], axis=1)
dataset.shape
#output: (237124, 417)
# i want this shape(original+dummies)
dataset.to_csv('OneHotEncodnig.csv', index=False)
I want to split the following pivot table into training and testing sets (to evaluate a recommendation system), and was thinking of extracting two tables with non-overlapping indices (userID) and column values (ISBN). How can I split it properly? Thank you.
As suggested by #moys, you can use train_test_split from scikit-learn after first splitting your dataframe columns to get the non-overlapping column sets.
Example:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
Generate data:
df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
Split the df columns in some way, e.g. in half:
cols = int(len(df.columns)/2)
df_A = df.iloc[:, 0:cols]
df_B = df.iloc[:, cols:]
Use train_test_split:
train_A, test_A = train_test_split(df_A, test_size=0.33)
train_B, test_B = train_test_split(df_B, test_size=0.33)
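With the mock 100-row frame above and test_size=0.33, the splits should come out with roughly these shapes (the rows of A and B are sampled independently, so the two splits will generally not share the same row indices):
print(train_A.shape, test_A.shape)  # (67, 2) (33, 2)
print(train_B.shape, test_B.shape)  # (67, 2) (33, 2)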
I want to replace the manual standardization of the monthly data with StandardScaler from sklearn. I tried the line of code below the commented-out line, but I am receiving the following error.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
arr = pd.DataFrame(np.arange(1,21), columns=['Output'])
arr2 = pd.DataFrame(np.arange(10, 210, 10), columns=['Output2'])
index2 = pd.date_range('20180928 10:00am', periods=20, freq="W")
# index3 = pd.DataFrame(index2, columns=['Date'])
df2 = pd.concat([pd.DataFrame(index2, columns=['Date']), arr, arr2], axis=1)
print(df2)
cols = df2.columns[1:]
# df2_grouped = df2.groupby(['Date'])
df2.set_index('Date', inplace=True)
df2_grouped = df2.groupby(pd.Grouper(freq='M'))
for c in cols:
    #df2[c] = df2_grouped[c].apply(lambda x: (x-x.mean()) / (x.std()))
    df2[c] = df2_grouped[c].apply(lambda x: StandardScaler().fit_transform(x))
print(df2)
ValueError: Expected 2D array, got 1D array instead:
array=[1.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
The error message says that StandardScaler().fit_transform only accepts a 2-D argument.
So you could replace:
df2[c] = df2_grouped[c].apply(lambda x: StandardScaler().fit_transform(x))
with:
from sklearn.preprocessing import scale
df2[c] = df2_grouped[c].transform(lambda x: scale(x.astype(float)))
as a workaround.
From sklearn.preprocessing.scale:
Standardize a dataset along any axis
Center to the mean and component wise scale to unit variance.
So it should work as a standard scaler.
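If you would rather keep StandardScaler itself, another option is to follow the error message's hint and reshape each group to a 2-D column before scaling; a minimal sketch of the loop, assuming the same df2_grouped as above:
for c in cols:
    df2[c] = df2_grouped[c].transform(
        lambda x: StandardScaler().fit_transform(x.values.reshape(-1, 1)).ravel()
    )
print(df2)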