How can I change dots to commas? - python

Good morning! I'm new to Python; I use Spyder 4.0 to build a neural network.
In the script below I use a random forest to compute feature importances, so the values in importances tell me the importance of each feature. Unfortunately I can't upload the dataset, but I can tell you that there are 18 features and 1 label, all physical quantities, and it's a regression problem.
I want to export the variable importances to an Excel file, but when I do it (by simply copying the vector) the numbers use a dot as the decimal separator (e.g. 0.012, 0.015, ... etc.). To use them in the Excel file I would prefer a comma instead of the dot.
I tried to use .replace('.',',') but it doesn't work; the error is:
AttributeError: 'numpy.ndarray' object has no attribute 'replace'
I think it happens because the vector importances is an array of float64 with shape (18,).
What can I do?
Thanks.
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt
dataset = pd.read_csv('Dataset.csv', decimal=',', delimiter = ";")
label = dataset.iloc[:,-1]
features = dataset.drop(columns = ['Label'])
y_max_pre_normalize = max(label)
y_min_pre_normalize = min(label)
def denormalize(y):
    final_value = y*(y_max_pre_normalize-y_min_pre_normalize)+y_min_pre_normalize
    return final_value
X_train1, X_test1, y_train1, y_test1 = train_test_split(features, label, test_size = 0.20, shuffle = True)
y_test2 = y_test1.to_frame()
y_train2 = y_train1.to_frame()
scaler1 = preprocessing.MinMaxScaler()
scaler2 = preprocessing.MinMaxScaler()
X_train = scaler1.fit_transform(X_train1)
X_test = scaler2.fit_transform(X_test1)
scaler3 = preprocessing.MinMaxScaler()
scaler4 = preprocessing.MinMaxScaler()
y_train = scaler3.fit_transform(y_train2)
y_test = scaler4.fit_transform(y_test2)
sel = RandomForestRegressor(n_estimators = 200,max_depth = 9, max_features = 5, min_samples_leaf = 1, min_samples_split = 2,bootstrap = False)
sel.fit(X_train, y_train)
importances = sel.feature_importances_
# sel.fit(X_train, y_train)
# a = []
# for feature_list_index in sel.get_support(indices=True):
#     a.append(feat_labels[feature_list_index])
#     print(feat_labels[feature_list_index])
# X_important_train = sel.transform(X_train1)
# X_important_test = sel.transform(X_test1)

I will try to show you an example of what you should do using some random values. I ran this in the Python shell, which is why you also see the ">>>".
>>> import numpy as np # first I import numpy as "np"
# I generate 10 random values and I store them in "importance"
>>> importance=np.random.rand(10)
# here I just want to see the content of "importance"
>>> importance
array([0.77609076, 0.97746829, 0.56946118, 0.23986983, 0.93655692,
       0.22003531, 0.7711095 , 0.36083248, 0.58277805, 0.57865248])
# here is your error, which I reproduce for teaching purposes
>>> importance.replace(".", ",")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'numpy.ndarray' object has no attribute 'replace'
What you need to do is to convert the elements of "importance" to a list of strings:
>>> imp_astr=[str(i) for i in importance]
>>> imp_astr
['0.7760907642658763', '0.9774682868805988', '0.569461184647781', '0.23986982589422634', '0.9365569207431337', '0.22003531170279356', '0.7711094966708247', '0.3608324767276052', '0.5827780487688116', '0.5786524781334242']
# at the end, for each string, you can use the "replace" function
>>> imp_astr=[i.replace(".", ",") for i in imp_astr]
>>> imp_astr
['0,7760907642658763', '0,9774682868805988', '0,569461184647781', '0,23986982589422634', '0,9365569207431337', '0,22003531170279356', '0,7711094966708247', '0,3608324767276052', '0,5827780487688116', '0,5786524781334242']
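If the goal is only to get comma decimals into a file for Excel, there is also a shorter route (a sketch, assuming a semicolon-separated CSV suits your Excel locale): pandas can write the comma decimals itself through the decimal argument of to_csv, so no string manipulation is needed.
>>> import pandas as pd
# write 'importance' with ',' as decimal separator and ';' as column separator
>>> pd.Series(importance, name='importance').to_csv('importances.csv', sep=';', decimal=',')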

Related

Apply Cross-Validation to transformer classification

I have a classification script using simpletransformers. I am using an imbalanced dataset with three labels (many 0s, far fewer 1s and 2s), so the roberta classifier often decides to predict only 0 instead of the minority classes.
To get an overall estimate of the classifier's performance I would like to use 10-fold cross-validation instead of the held-out train/test split.
For this I am using the code below.
import pandas as pd
import numpy as np
from simpletransformers.classification import ClassificationModel
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight
from sklearn.metrics import classification_report
df = pd.read_excel('Classifications_Output_NEW.xlsx')
df.shape
df["Q2"] = df["Q2"]. replace(np. nan,0)
df["Q2"] = df["Q2"].replace(3,0)
df["Q2"] = df["Q2"].replace(4,0)
# rename columns
df["text"] = df["SNIPPET"]
df["labels"] = df["Q2"]
# Function to replace the token before the sentiment word with the Q1 info
def replace_token_with_number(row):
    string = row['SNIPPET']
    number = row['Q1']
    return string.replace('[[', "xxproj " + str(number) + " ")
# Apply the function to every row in the dataframe
df["text"] = df.apply(replace_token_with_number, axis=1)
# remove [+] and [-] from text column and replace with ++ and --
df["text"] = df["text"].str.replace("[+]", " xxpositive", regex=False)
df["text"] = df["text"].str.replace("[-]", " xxnegative", regex=False)
# remove [[ and ]] from text column
df["text"] = df["text"].str.replace("[[", "", regex=False)
df["text"] = df["text"].str.replace("]]", "", regex=False)
# replace the Q1 number with a string
#df["text"] = df["text"].str.replace("xxproj1.0", "xxyesPJ", regex=False)
#df["text"] = df["text"].str.replace("xxproj0.0", "xxnoPJ", regex=False)
# prepare cross validation
from simpletransformers.classification import ClassificationModel
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score
import pandas as pd
n=10
kf = KFold(n_splits=n, shuffle=True)
results = []
for train_index, val_index in kf.split(df):
    # splitting Dataframe (dataset not included)
    train_df = df["text"][train_index]
    val_df = df["labels"][val_index]
    # Defining Model
    model = ClassificationModel('roberta', 'roberta-base', num_labels=3, weight=class_weights.tolist(),
                                use_cuda=True, args={'reprocess_input_data': True, 'overwrite_output_dir': True,
                                                     "num_train_epochs": 10})
    # train the model
    model.train_model(train_df)
    # validate the model
    result, model_outputs, wrong_predictions = model.eval_model(val_df, acc=accuracy_score)
    print(result['acc'])
    # append model score
    results.append(result['acc'])
print("results", results)
print(f"Mean-Precision: {sum(results) / len(results)}")
Using this code produces an AttributeError: 'Series' object has no attribute 'columns'. I believe it has to do with the way the script accesses the columns of my dataframe but I am not able to solve the error.
I am grateful for any advice!
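A likely fix, as a sketch only (it assumes simpletransformers expects a two-column DataFrame with 'text' and 'labels' columns for both train_model and eval_model, and it fills in class_weights, which the code imports a helper for but never computes): pass DataFrame slices instead of single-column Series.
class_weights = compute_class_weight(class_weight="balanced",
                                     classes=np.unique(df["labels"]),
                                     y=df["labels"])
for train_index, val_index in kf.split(df):
    # two-column DataFrames, not single-column Series
    train_df = df.iloc[train_index][["text", "labels"]]
    val_df = df.iloc[val_index][["text", "labels"]]
    model = ClassificationModel('roberta', 'roberta-base', num_labels=3,
                                weight=class_weights.tolist(), use_cuda=True,
                                args={'reprocess_input_data': True,
                                      'overwrite_output_dir': True,
                                      "num_train_epochs": 10})
    model.train_model(train_df)
    result, model_outputs, wrong_predictions = model.eval_model(val_df, acc=accuracy_score)
    results.append(result['acc'])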

Mlxtend: SequentialFeatureSelector with ColumnTransformer gives AttributeError

I'm trying to use mlxtend's SequentialFeatureSelector() in combination with a pipeline using ColumnTransformer(). I use the ColumnTransformer() to apply power transformations (via PowerTransformer()) only to the numeric variables, not to the binary variables. My problem is that I either get the error AttributeError: 'numpy.ndarray' object has no attribute 'columns', or the results make no sense.
If I define numeric_features like this: numeric_features = ['feature_1','feature_2', 'feature_3', 'feature_4', 'feature_5'], then I get the AttributeError: 'numpy.ndarray' object has no attribute 'columns':
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PowerTransformer
from sklearn.model_selection import RepeatedStratifiedKFold
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LogisticRegression
# RepeatedStratifiedKFold
rskf = RepeatedStratifiedKFold(n_splits=3, n_repeats=10, random_state=123)
cv_define = rskf.split(X, y)
cv = list(cv_define)
# Define pipeline (transformation & model)
numeric_features = ['feature_1','feature_2', 'feature_3', 'feature_4', 'feature_5']
# numeric_features = [0,1,2,3,4,]
numeric_transformer = make_pipeline(PowerTransformer(method = 'yeo-johnson', standardize=False))
preprocessor = make_pipeline(ColumnTransformer(transformers = [('num', numeric_transformer, numeric_features)], remainder = 'drop'))
pipe = make_pipeline((preprocessor), (LogisticRegression(max_iter = 10000, penalty = 'none')))
# Run SFS (Sequential Feature Selector)
sfs1 = SFS(pipe, k_features = 1, scoring = "neg_log_loss", forward = False, floating = False, cv = cv, verbose = 2, n_jobs = -1)
sfs_result = sfs1.fit(X, y)
If I define numeric_features like this: numeric_features = [0,1,2,3,4], then I get incorrect results (the same value for every iteration), which also include NaNs. See the pictures below.
Results of the SequentialFeatureSelector() - Part 1
Results of the SequentialFeatureSelector() - Part 2
The problem lies only in preprocessor = make_pipeline(ColumnTransformer(transformers = [('num', numeric_transformer, numeric_features)], remainder = 'drop')). If, for example, I remove the column-specific transformation and perform the power transformation on the whole dataset, the code works and the results are:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PowerTransformer
from sklearn.model_selection import RepeatedStratifiedKFold
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LogisticRegression
# RepeatedStratifiedKFold
rskf = RepeatedStratifiedKFold(n_splits=3, n_repeats=10, random_state=123)
cv_define = rskf.split(X, y)
cv = list(cv_define)
# Define pipeline (transformation & model)
numeric_features = ['feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5']
numeric_transformer = make_pipeline(PowerTransformer(method = 'yeo-johnson', standardize=False))
pipe = make_pipeline((numeric_transformer), (LogisticRegression(max_iter = 10000, penalty = 'none')))
# Run SFS (Sequential Feature Selector)
sfs1 = SFS(pipe, k_features = 1, scoring = "neg_log_loss", forward = False, floating = False, cv = cv, verbose = 2, n_jobs = -1)
sfs_result = sfs1.fit(X, y)
Results:
Working results of SFS - Part 1
Working result of SFS - Part 2
Does anyone have an idea how this could be solved? Any help would be much appreciated.
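One workaround worth trying, as a sketch rather than a definitive fix (it assumes X is a pandas DataFrame, and fitting the transformer once on the full data before cross-validation leaks a little information across the folds): mlxtend's SequentialFeatureSelector hands plain NumPy column subsets to the pipeline, so name-based ColumnTransformer selection fails, and fixed positional indices stop matching the right columns once SFS starts dropping features. Applying the power transformation to the numeric columns up front sidesteps both problems.
from sklearn.preprocessing import PowerTransformer
from sklearn.linear_model import LogisticRegression
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
numeric_features = ['feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5']
X_pre = X.copy()
# transform only the numeric columns; the binary columns pass through untouched
X_pre[numeric_features] = PowerTransformer(method='yeo-johnson',
                                           standardize=False).fit_transform(X_pre[numeric_features])
sfs1 = SFS(LogisticRegression(max_iter=10000, penalty='none'),
           k_features=1, forward=False, floating=False,
           scoring='neg_log_loss', cv=cv, verbose=2, n_jobs=-1)
sfs_result = sfs1.fit(X_pre.values, y)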

How to convert str into float? ValueError: could not convert string to float: '0,25691372'

I'm using XGBoost for feature importance. I want to select the features that give me 90% of the importance, so first I build a DataFrame because I need it for Excel, and then I write a while loop to accumulate the features that reach 90% of the importance. After this there is a neural network (not shown in the code below). I know there may be easier ways to do this, but it gives me an error:
ValueError: could not convert string to float: '0,25691372'
The code is
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor
from matplotlib import pyplot as plt
dataset = pd.read_csv('CompleteDataSet_original_Clean_CONC.csv', decimal=',', delimiter = ";")
from sklearn.metrics import r2_score
label = dataset.iloc[:,-1]
features = dataset.drop(columns = ['Label'])
y_max_pre_normalize = max(label)
y_min_pre_normalize = min(label)
def denormalize(y):
    final_value = y*(y_max_pre_normalize-y_min_pre_normalize)+y_min_pre_normalize
    return final_value
X_train1, X_test1, y_train1, y_test1 = train_test_split(features, label, test_size = 0.20, random_state = 1, shuffle = True)
y_test2 = y_test1.to_frame()
y_train2 = y_train1.to_frame()
scaler1 = preprocessing.MinMaxScaler()
scaler2 = preprocessing.MinMaxScaler()
X_train = scaler1.fit_transform(X_train1)
X_test = scaler2.fit_transform(X_test1)
scaler3 = preprocessing.MinMaxScaler()
scaler4 = preprocessing.MinMaxScaler()
y_train = scaler3.fit_transform(y_train2)
y_test = scaler4.fit_transform(y_test2)
sel = XGBRegressor(colsample_bytree= 0.7, learning_rate = 0.005, max_depth = 5, min_child_weight = 3, n_estimators = 1000)
sel.fit(X_train, y_train)
importances = sel.feature_importances_
importances = [str(i) for i in importances]
importances = [i.replace(".", ",") for i in importances]
df1 = pd.DataFrame(features.columns)
df1.columns = ['Features']
df2 = pd.DataFrame(importances)
df2.columns = ['Importances [%]']
result = pd.concat([df1,df2],axis = 1)
result = result.sort_values(by='Importances [%]', ascending=False)
result.to_excel("Feature_Results.xlsx")
i = 0
somma = 0
feature = []
while somma <= 0.9:
    a = result.iloc[i,-1]
    somma = float(a) + somma
    feature.append(result.iloc[i,-2])
    i = i + 1
float('0,25691372'.replace(",", "."))
You could use locale.atof() to handle , being used as the decimal separator.
import locale
locale.setlocale(locale.LC_ALL, 'fr_FR')
...
somma = locale.atof(a) + somma
Try converting "0,0001" into "0.0001" and then converting the string to float.
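A different angle on the same problem (a sketch of an alternative, reusing the variable names from the question): the ValueError only appears because the importances were turned into comma strings before sorting and summing. Keeping the values numeric until the final export avoids every conversion and makes the sort genuinely numeric.
df1 = pd.DataFrame(features.columns, columns=['Features'])
df2 = pd.DataFrame(sel.feature_importances_, columns=['Importances [%]'])  # keep floats
result = pd.concat([df1, df2], axis=1).sort_values(by='Importances [%]', ascending=False)
i = 0
somma = 0
feature = []
while somma <= 0.9:
    somma += result.iloc[i, -1]        # plain float arithmetic, no conversion needed
    feature.append(result.iloc[i, -2])
    i += 1
# switch to comma decimals only at export time, if Excel really needs them
export = result.copy()
export['Importances [%]'] = export['Importances [%]'].map(lambda v: str(v).replace('.', ','))
export.to_excel("Feature_Results.xlsx")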

Getting No loop matching the specified signature and casting error

I'm a beginner in Python and machine learning. I get the error below when I try to fit data with statsmodels.formula.api OLS.fit():
Traceback (most recent call last):
  File "", line 47, in <module>
    regressor_OLS = sm.OLS(y , X_opt).fit()
  File "E:\Anaconda\lib\site-packages\statsmodels\regression\linear_model.py", line 190, in fit
    self.pinv_wexog, singular_values = pinv_extended(self.wexog)
  File "E:\Anaconda\lib\site-packages\statsmodels\tools\tools.py", line 342, in pinv_extended
    u, s, vt = np.linalg.svd(X, 0)
  File "E:\Anaconda\lib\site-packages\numpy\linalg\linalg.py", line 1404, in svd
    u, s, vt = gufunc(a, signature=signature, extobj=extobj)
TypeError: No loop matching the specified signature and casting was found for ufunc svd_n_s
code
#Importing Libraries
import numpy as np # linear algebra
import pandas as pd # data processing
import matplotlib.pyplot as plt #Visualization
#Importing the dataset
dataset = pd.read_csv('Video_Games_Sales_as_at_22_Dec_2016.csv')
#dataset.head(10)
#Encoding categorical data using pandas get_dummies function. Easier and more straightforward than OneHotEncoder in sklearn
#dataset = pd.get_dummies(data = dataset , columns=['Platform' , 'Genre' , 'Rating' ] , drop_first = True ) #drop_first used to avoid the dummy variable trap
dataset=dataset.replace('tbd',np.nan)
#Separating Independent & Dependent Variables
#X = pd.concat([dataset.iloc[:,[11,13]], dataset.iloc[:,13: ]] , axis=1).values #Getting important variables
X = dataset.iloc[:,[10,12]].values
y = dataset.iloc[:,9].values #Dependent Variable (Global sales)
#Taking care of missing data
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values = 'NaN' , strategy = 'mean' , axis = 0)
imputer = imputer.fit(X[:,0:2])
X[:,0:2] = imputer.transform(X[:,0:2])
#Splitting the dataset into the Training set and Test set
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.2 , random_state = 0)
#Fitting Mutiple Linear Regression to the Training Set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train,y_train)
#Predicting the Test set Result
y_pred = regressor.predict(X_test)
#Building the optimal model using Backward Elimination (p=0.050)
import statsmodels.formula.api as sm
X = np.append(arr = np.ones((16719,1)).astype(float) , values = X , axis = 1)
X_opt = X[:, [0,1,2]]
regressor_OLS = sm.OLS(y , X_opt).fit()
regressor_OLS.summary()
Dataset
dataset link
I couldn't find anything helpful to solve this issue on Stack Overflow or Google.
Try specifying
dtype = 'float'
when the matrix is created.
Example:
a = np.matrix([[1,2],[3,4]], dtype='float')
Hope this works!
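(As a side note, NumPy's documentation nowadays recommends np.array over np.matrix; the dtype argument works the same way there: a = np.array([[1, 2], [3, 4]], dtype=float).)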
Faced a similar problem. Solved it by specifying the dtype and flattening the array.
numpy version: 1.17.3
a = np.array(a, dtype=np.float)
a = a.flatten()
As suggested previously, you need to ensure X_opt is a float type.
For example in your code, it would look like this:
X_opt = X[:, [0,1,2]]
X_opt = X_opt.astype(float)
regressor_OLS = sm.OLS(endog=y, exog=X_opt).fit()
regressor_OLS.summary()
Was facing a similar problem. I used df.values[]:
y = df.values[:, 4]
and fixed the issue by using the df.iloc[].values function instead:
y = dataset.iloc[:, 4].values
The df.values[] function returns an object-dtype array:
array([192261.83, 191792.06, 191050.39, 182901.99, 166187.94, 156991.12,
156122.51, 155752.6, 152211.77, 149759.96, 146121.95, 144259.4,
141585.52, 134307.35, 132602.65, 129917.04, 126992.93, 125370.37,
124266.9, 122776.86, 118474.03, 111313.02, 110352.25, 108733.99,
108552.04, 107404.34, 105733.54, 105008.31, 103282.38, 101004.64,
99937.59, 97483.56, 97427.84, 96778.92, 96712.8, 96479.51,
90708.19, 89949.14, 81229.06, 81005.76, 78239.91, 77798.83,
71498.49, 69758.98, 65200.33, 64926.08, 49490.75, 42559.73,
35673.41, 14681.4], dtype=object)
but
df.iloc[:, 4].values returns a float array,
which is what
regressor_OLS = sm.OLS(endog=y, exog=X_opt).fit()
OLS() accepts.
Alternatively, you can just change the datatype of y before passing it into OLS():
y = np.array(y, dtype = float)
Downgrading from NumPy 1.18.4 to 1.15.2 worked for me:
pip install --upgrade numpy==1.15.2

Decision tree: how to replace NaN with a value so that I can fit the model

The error is
TypeError: float() argument must be a string or a number
raised at:
clf = clf.fit(model_train,y_train)
My code is below:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import Imputer
from sklearn import tree
Model_Dev_Val = pd.read_excel("fuckdata.xlsx")
target = Model_Dev_Val[['source_2']]
model_train, model_test, y_train, y_test = train_test_split(Model_Dev_Val, target, test_size = 0.5, random_state = 40, stratify = target)
imp = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0)
model_train = imp.fit(model_train)
y_train = imp.fit(y_train)
clf = tree.DecisionTreeClassifier()
clf = clf.fit(model_train,y_train)
clf.predict(model_test)
Looks like my NaN doesn't turn into the mean. Anyway, please help; I have been searching all day. Thanks!
Try using imp.fit_transform instead of just imp.fit - the latter just returns a fitted model, not an actual new array.
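A minimal sketch of that change (assuming the old scikit-learn version from the question, where Imputer still lives in sklearn.preprocessing; note also that mean-imputation makes no sense for the class labels, so only the feature matrix is imputed here, and rows with missing labels should be dropped beforehand):
imp = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0)
model_train = imp.fit_transform(model_train)  # fit on the training split and return the imputed array
model_test = imp.transform(model_test)        # reuse the fitted means on the test split
clf = tree.DecisionTreeClassifier()
clf = clf.fit(model_train, y_train)
clf.predict(model_test)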
