Setting an array element with a sequence - machine learning (logistic regression) in Python

I am trying to run logistic regression on my y and X, but I keep getting the error 'setting an array element with a sequence'. I think I might have to reshape my data, but I am not sure what array dimensions to use. I tried reshaping k to (3, 1) and g to (4000000, 1), but that did not work either. My code is attached below (without the reshaping). The data is a NetCDF file. I would appreciate any help, thank you.
Screenshot of final_df.head(5)
import pandas as pd
import geopandas as gpd
from netCDF4 import Dataset
from osgeo import gdal, ogr

f = Dataset('C:\\filename.nc', 'r')
# Extract pixel 'coords'
B01_DATA = f.variables['B01_DATA'][:]
B02_DATA = f.variables['B02_DATA'][:]
VIS_DATA = f.variables['VIS_DATA'][:]
# These are look-up tables
B01_LUT = f.variables['B01_LUT'][:]
B02_LUT = f.variables['B02_LUT'][:]
VIS_LUT = f.variables['VIS_LUT'][:]

min_lat = -15
min_lon = 90
res = 0.009  # resolution

import numpy as np

lst = []
for x in range(0, 2000):
    for y in range(0, 2000):
        B01 = B01_LUT[B01_DATA[x, y]]
        B02 = B02_LUT[B02_DATA[x, y]]
        VIS = VIS_LUT[VIS_DATA[x, y]]
        k = np.array([B01, B02, VIS], dtype=np.float32)
        lst.append(k)
df = pd.DataFrame()
df['X'] = lst
#print(df)
lst1 = []
lst2 = []
for x in range(0, 2000):
    for y in range(0, 2000):
        lon = min_lat + x * res
        lat = min_lon + y * res
        lst1.append(lat)
        lst2.append(lon)

df1 = pd.DataFrame()
df1['Latitude'] = lst1
df1['Longitude'] = lst2
df1['Coords'] = list(zip(df1.Latitude, df1.Longitude))
print(df1)
import shapefile
from shapely.geometry import shape, Point

# read your shapefile
r = shapefile.Reader("C:\\shapefile.shp")
# get the shapes
shapes = r.shapes()
# build a shapely polygon from your shape
hold = []
for k in range(20, 22):  # I am only taking a subset of layers in the polygon
    polygon = shape(shapes[k])
    for x in df1.Coords:
        if polygon.contains(Point(x)):
            hold.append(x)
#print(len(hold))
g = np.where(df1['Coords'].isin(hold), 1,0)
g.tolist()
df1['y'] = g
final_df = df.join(df1)
print(final_df)
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
X = final_df.X
y = final_df.y
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
logmodel = LogisticRegression()
logmodel.fit(X_train, y_train)
This is the full error message:
ValueError Traceback (most recent call last)
<ipython-input-12-f189af4819e6> in <module>()
2 from sklearn.linear_model import LogisticRegression
3 logmodel = LogisticRegression()
----> 4 logmodel.fit(X_train, y_train)
~\Anaconda2\envs\python3env\lib\site-packages\sklearn\linear_model\logistic.py in fit(self, X, y, sample_weight)
1214
1215 X, y = check_X_y(X, y, accept_sparse='csr', dtype=_dtype,
-> 1216 order="C")
1217 check_classification_targets(y)
1218 self.classes_ = np.unique(y)
~\Anaconda2\envs\python3env\lib\site-packages\sklearn\utils\validation.py in check_X_y(X, y, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, warn_on_dtype, estimator)
571 X = check_array(X, accept_sparse, dtype, order, copy, force_all_finite,
572 ensure_2d, allow_nd, ensure_min_samples,
--> 573 ensure_min_features, warn_on_dtype, estimator)
574 if multi_output:
575 y = check_array(y, 'csr', force_all_finite=True, ensure_2d=False,
~\Anaconda2\envs\python3env\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
431 force_all_finite)
432 else:
--> 433 array = np.array(array, dtype=dtype, order=order, copy=copy)
434
435 if ensure_2d:
ValueError: setting an array element with a sequence.

It looks to me like your error is caused by the fact that you have a column which contains lists, which is not a valid input format for a model. Try something like this (taken from here: Pandas split column of lists into multiple columns):
X = pd.DataFrame(final_df.X.values.tolist(), columns=['x1','x2','x3'])
This should return a three-column dataframe with your coordinates.
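A minimal follow-up sketch of refitting on the expanded features, assuming the split above (the x1/x2/x3 column names are illustrative) and the final_df and imports from the question:
X = pd.DataFrame(final_df.X.values.tolist(), columns=['x1', 'x2', 'x3'])
y = final_df.y
# With one scalar per cell, sklearn can now build a 2-D float array from X
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
logmodel = LogisticRegression()
logmodel.fit(X_train, y_train)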

Related

I've added a OneHotEncoder to a ColumnTransformer that I'm using in a Pipeline; I get an error because it can't convert one of the columns to a float

I'm working with the Loan Approvals Dataset from Analytics Vidhya (https://www.kaggle.com/datasets/anmolkumar/analytics-vidhya-loan-prediction?select=train.csv), a dataframe with a mixture of categorical and numerical data, aiming to predict whether a loan is approved or not.
I'm trying to preprocess the data by building a ColumnTransformer so that I can perform specific preprocessing steps on the numerical and categorical columns. The last step of the ColumnTransformer is a OneHotEncoder to one-hot encode the categorical data. I then add this ColumnTransformer to a Pipeline with a LogisticRegression as a baseline model. When I try to fit this Pipeline, I get 'ValueError: could not convert string to float: 'Male''. I must be doing something wrong here, but I can't figure out what.
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler, FunctionTransformer, OneHotEncoder
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.base import BaseEstimator, TransformerMixin
import xgboost as xgb
import warnings
warnings.filterwarnings('ignore')
path = "https://raw.githubusercontent.com/richrussell1991/datasets/main/analytics_vidhya_loan_approval_practice_train.csv"
df = pd.read_csv(path, on_bad_lines='skip')
df.drop('Loan_ID', axis=1, inplace=True)
X_train, X_test, y_train, y_test = train_test_split(df.drop('Loan_Status', axis=1), df['Loan_Status'], test_size=0.33)
mean_imputer = SimpleImputer(strategy='mean')
mode_imputer = SimpleImputer(strategy='most_frequent')
standard_scaler = StandardScaler()
one_hot_encoder = OneHotEncoder()
ct = ColumnTransformer(
    transformers=[
        ('mean_imputer', mean_imputer, ['LoanAmount', 'Loan_Amount_Term']),
        ('mode_imputer', mode_imputer, ['Credit_History']),
        ('standard_scaler', standard_scaler, ['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term']),
        ('impute_most_common', mode_imputer, make_column_selector(dtype_include='object')),
        ('one_hot_encode', one_hot_encoder, make_column_selector(dtype_include='object'))
    ],
    remainder='passthrough'
)
clf = Pipeline(
    [('ct', ct), ('classifier', LogisticRegression())]
)
clf.fit(X_train, y_train)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-8-4f66baab8f93> in <module>
----> 1 clf.fit(X_train, y_train)
4 frames
/usr/local/lib/python3.8/dist-packages/sklearn/pipeline.py in fit(self, X, y, **fit_params)
392 if self._final_estimator != "passthrough":
393 fit_params_last_step = fit_params_steps[self.steps[-1][0]]
--> 394 self._final_estimator.fit(Xt, y, **fit_params_last_step)
395
396 return self
/usr/local/lib/python3.8/dist-packages/sklearn/linear_model/_logistic.py in fit(self, X, y, sample_weight)
1506 _dtype = [np.float64, np.float32]
1507
-> 1508 X, y = self._validate_data(
1509 X,
1510 y,
/usr/local/lib/python3.8/dist-packages/sklearn/base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
579 y = check_array(y, **check_y_params)
580 else:
--> 581 X, y = check_X_y(X, y, **check_params)
582 out = X, y
583
/usr/local/lib/python3.8/dist-packages/sklearn/utils/validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
962 raise ValueError("y cannot be None")
963
--> 964 X = check_array(
965 X,
966 accept_sparse=accept_sparse,
/usr/local/lib/python3.8/dist-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
744 array = array.astype(dtype, casting="unsafe", copy=False)
745 else:
--> 746 array = np.asarray(array, order=order, dtype=dtype)
747 except ComplexWarning as complex_warning:
748 raise ValueError(
ValueError: could not convert string to float: 'Male'
I've built this ColumnTransformer in order to apply certain transformations to the different columns, i.e. standard scaling to the numerical columns and one-hot encoding to the categorical columns, but fitting the overall Pipeline on X_train, y_train fails with the error above.
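A minimal sketch of one likely fix, assuming the column names and imports above: ColumnTransformer applies its transformers in parallel on the original input columns, not in sequence, so the imputed-but-unencoded string columns from 'impute_most_common' reach the LogisticRegression as-is. Chaining impute-then-scale and impute-then-encode inside nested Pipelines, one per column group, avoids that:
numeric_cols = ['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term']
numeric_pipe = Pipeline([
    ('impute', SimpleImputer(strategy='mean')),
    ('scale', StandardScaler())
])
categorical_pipe = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('one_hot', OneHotEncoder(handle_unknown='ignore'))
])
ct = ColumnTransformer(
    transformers=[
        ('num', numeric_pipe, numeric_cols),
        ('cat', categorical_pipe, make_column_selector(dtype_include='object'))
    ],
    remainder='passthrough'
)
clf = Pipeline([('ct', ct), ('classifier', LogisticRegression())])
clf.fit(X_train, y_train)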

Building a custom RandomSearchCV using Python

I am trying to build a custom K-fold RandomSearchCV from scratch. I understand how RandomSearchCV works, and I'm trying to implement it from scratch on a randomly generated dataset. When I run the code I get the following error. I think it has something to do with how I've created the groups in my x_train list. What is this error, and how do I fix it?
ValueError Traceback (most recent call last)
<ipython-input-12-229cc493eeb9> in <module>
41
42 classifier = KNeighborsClassifier()
---> 43 RandomSearchCV(X_train,y_train, classifier, folds = 3)
44
45
<ipython-input-12-229cc493eeb9> in RandomSearchCV(x_train, y_train, classifier, folds)
26 #classifier (K-NN)
27 classifier.n_neighbors = parameter
---> 28 classifier.fit(x_train_group, y_train_group)
29
30 #Predicton
~\anaconda3\lib\site-packages\sklearn\neighbors\_base.py in fit(self, X, y)
1128 """
1129 if not isinstance(X, (KDTree, BallTree)):
-> 1130 X, y = check_X_y(X, y, "csr", multi_output=True)
1131
1132 if y.ndim == 1 or y.ndim == 2 and y.shape[1] == 1:
~\anaconda3\lib\site-packages\sklearn\utils\validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, warn_on_dtype, estimator)
753 ensure_min_features=ensure_min_features,
754 warn_on_dtype=warn_on_dtype,
--> 755 estimator=estimator)
756 if multi_output:
757 y = check_array(y, 'csr', force_all_finite=True, ensure_2d=False,
~\anaconda3\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
572 if not allow_nd and array.ndim >= 3:
573 raise ValueError("Found array with dim %d. %s expected <= 2."
--> 574 % (array.ndim, estimator_name))
575
576 if force_all_finite:
ValueError: Found array with dim 3. Estimator expected <= 2.
Here's my implementation:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import numpy
from tqdm import tqdm
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances
import random
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
x,y = make_classification(n_samples=10000, n_features=2, n_informative=2, n_redundant= 0, n_clusters_per_class=1, random_state=60)
X_train, X_test, y_train, y_test = train_test_split(x,y,stratify=y,random_state=42)
def RandomSearchCV(x_train, y_train, classifier, folds):
    train_scores = []
    test_scores = []
    # 1. Generate 10 unique hyperparameter values from the given range
    params = random.sample(range(0, 50), 10)
    x_train_split = []
    y_train_split = []
    # 2. Divide x_train into groups
    for i in range(0, len(x_train), int(len(x_train) / folds)):
        x_train_split.append(x_train[i:i + int(len(x_train) / folds)])
        y_train_split.append(y_train[i:i + int(len(y_train) / folds)])
    # 3. For each hyperparameter generated in step 1, split the groups into training and CV sets
    for parameter in params:
        trainscores_folds = []
        testscores_folds = []
        for group in range(len(x_train_split)):
            x_train_group = x_train_split[0:group] + x_train_split[group + 1:]
            x_cv_group = [x_train_split[group]]
            y_train_group = y_train_split[0:group] + y_train_split[group + 1:]
            y_cv_group = [y_train_split[group]]
            # Classifier (K-NN)
            classifier.n_neighbors = parameter
            classifier.fit(x_train_group, y_train_group)
            # Prediction
            y_pred = classifier.predict(x_cv_group)
            testscores_folds.append(accuracy_score(y_cv_group, y_pred))
            y_pred = classifier.predict(x_train_group)
            trainscores_folds.append(accuracy_score(y_train_group, y_pred))
        train_scores.append(np.mean(np.array(trainscores_folds)))
        test_scores.append(np.mean(np.array(testscores_folds)))
    return train_scores, test_scores
classifier = KNeighborsClassifier()
RandomSearchCV(X_train,y_train, classifier, folds = 3)
Thank you for your help.
x_train_group is a list of arrays, which makes it 3-dimensional (as mentioned in the error). That does not work for fitting the classifier, which expects 2-dimensional input. Try calling np.concatenate(x_train_group) to concatenate the folds into a single 2-dimensional array.
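A tiny illustrative sketch of why the list of folds becomes 3-D (the shapes here are hypothetical, matching 3 folds of a 2-feature training set):
import numpy as np
folds = [np.zeros((2500, 2))] * 3
print(np.asarray(folds).shape)      # (3, 2500, 2) -- 3-D, rejected by check_X_y
print(np.concatenate(folds).shape)  # (7500, 2)    -- 2-D, accepted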
As the error states, you are passing an array of three dimensions while the classifier.fit() method expects a two-dimensional matrix. All you need to fix this issue is to change your train/CV groups to the following:
for group in range(len(x_train_split)):
    x_train_group = np.concatenate(x_train_split[0:group] + x_train_split[group + 1:])
    x_cv_group = x_train_split[group]
    y_train_group = np.concatenate(y_train_split[0:group] + y_train_split[group + 1:])
    y_cv_group = y_train_split[group]
    ....

Sklearn fitting SVM with StandardScaler

I am fitting an SVR on my dataset and getting the error message below. It worked before I included the StandardScaler. I have tried everything, but it is still not working.
import numpy as np
from sklearn.preprocessing import StandardScaler

sc_X = StandardScaler()
sc_y = StandardScaler()
X = sc_X.fit_transform(X)
y = sc_y.fit_transform(np.array(y).reshape(1, -1))

from sklearn.svm import SVR
regressor = SVR(kernel='rbf')
regressor.fit(X, y)
--------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-14-75416c35e495> in <module>
2 from sklearn.svm import SVR
3 regressor = SVR(kernel = 'rbf') # rbf means radial basis function
----> 4 regressor.fit(X,y)
C:\anconda\lib\site-packages\sklearn\svm\_base.py in fit(self, X, y, sample_weight)
146 X, y = check_X_y(X, y, dtype=np.float64,
147 order='C', accept_sparse='csr',
--> 148 accept_large_sparse=False)
149 y = self._validate_targets(y)
150
C:\anconda\lib\site-packages\sklearn\utils\validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, warn_on_dtype, estimator)
758 dtype=None)
759 else:
--> 760 y = column_or_1d(y, warn=True)
761 _assert_all_finite(y)
762 if y_numeric and y.dtype.kind == 'O':
C:\anconda\lib\site-packages\sklearn\utils\validation.py in column_or_1d(y, warn)
795 return np.ravel(y)
796
--> 797 raise ValueError("bad input shape {0}".format(shape))
798
799
ValueError: bad input shape (1, 10)
You are feeding the SVM a target vector with dimension (1, 10), which means one row and ten columns. This is wrong, and it is caused by your use of reshaping in
y = sc_y.fit_transform(np.array(y).reshape(1,-1))
Please note that this line is also conceptually wrong: the standardisation should be applied only to the training features, not to the target vector, so you can avoid defining
sc_y = StandardScaler()
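A minimal sketch of the corrected fit, assuming the X and y from the question (scale only the features, and keep y 1-D as SVR expects):
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

sc_X = StandardScaler()
X = sc_X.fit_transform(X)
regressor = SVR(kernel='rbf')
regressor.fit(X, np.ravel(y))  # np.ravel flattens y in case it is a (n, 1) column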

Question on Python Feature Selection (null values)

I'm building a logistic regression model and want to understand the features that contribute the most to my output (1 or 0): when a customer comes back to my website, which features are making them come back? I am getting stuck on this fit function. It errors out and I can't figure out why. It seems to indicate that I have some null values, but I've scrubbed through my data and removed the null values.
import pandas as pd
import numpy as np
from sklearn import preprocessing
import matplotlib.pyplot as plt
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
#load data
df = pd.read_csv('jupyter.csv', header=0)
array = df.values
X = array[:, 1:13]
Y = array[:, 14]
print (X.shape)
print (Y.shape)
(544219, 12)
(544219,)
# feature extraction
test = SelectKBest(score_func=chi2, k=4)
fit = test.fit(X, Y)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-63-f91db4d08897> in <module>
1 # feature extraction
2 test = SelectKBest(score_func=chi2, k=4)
----> 3 fit = test.fit(X, Y)
4 # summarize scores
5 #numpy.set_printoptions(precision=3)
/opt/anaconda3/lib/python3.7/site-packages/sklearn/feature_selection/univariate_selection.py in fit(self, X, y)
339 self : object
340 """
--> 341 X, y = check_X_y(X, y, ['csr', 'csc'], multi_output=True)
342
343 if not callable(self.score_func):
/opt/anaconda3/lib/python3.7/site-packages/sklearn/utils/validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, warn_on_dtype, estimator)
720 if multi_output:
721 y = check_array(y, 'csr', force_all_finite=True, ensure_2d=False,
--> 722 dtype=None)
723 else:
724 y = column_or_1d(y, warn=True)
/opt/anaconda3/lib/python3.7/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
540 if force_all_finite:
541 _assert_all_finite(array,
--> 542 allow_nan=force_all_finite == 'allow-nan')
543
544 if ensure_min_samples > 0:
/opt/anaconda3/lib/python3.7/site-packages/sklearn/utils/validation.py in _assert_all_finite(X, allow_nan)
58 elif X.dtype == np.dtype('object') and not allow_nan:
59 if _object_dtype_isnan(X).any():
---> 60 raise ValueError("Input contains NaN")
61
62
ValueError: Input contains NaN
You can impute the missing values using SimpleImputer, or using pandas fillna(). For example, the following will impute all missing values with the column mean (assuming X is a DataFrame):
X.fillna(X.mean())
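A minimal sketch with SimpleImputer, assuming X is the feature array from the question (note that chi2 additionally requires non-negative features):
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='mean')  # replace each NaN with its column mean
X = imputer.fit_transform(X)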

Unpack Dictionaries for Logistic Regression in Python

I'm trying to run some sentiment analysis on product reviews, and I'm getting tripped up getting my model to read the word-count dictionaries.
import pandas as pd
import numpy as np
from sklearn import linear_model, model_selection, metrics
products = pd.read_csv('data.csv')
def count_words(s):
    d = {}
    wl = str(s).split()
    for w in wl:
        d[w] = wl.count(w)
    return d
products['word_count'] = products['review'].apply(count_words)
products = products[products['rating'] != 3]
products['sentiment'] = (products['rating'] >= 4) * 1
train_data, test_data = model_selection.train_test_split(products, test_size = 0.2, random_state=0)
sentiment_model = linear_model.LogisticRegression()
sentiment_model.fit(X = train_data['word_count'], y =train_data['sentiment'])
When I run that last line I get the following error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-51-0c3f47af3a6e> in <module>()
----> 1 sentiment_model.fit(X = train_data['word_count'], y =
train_data['sentiment'])
C:\ProgramData\anaconda_3\lib\site-packages\sklearn\linear_model\logistic.py
in fit(self, X, y, sample_weight)
1171
1172 X, y = check_X_y(X, y, accept_sparse='csr', dtype=np.float64,
-> 1173 order="C")
1174 check_classification_targets(y)
1175 self.classes_ = np.unique(y)
C:\ProgramData\anaconda_3\lib\site-packages\sklearn\utils\validation.py in
check_X_y(X, y, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, warn_on_dtype, estimator)
519 X = check_array(X, accept_sparse, dtype, order, copy, force_all_finite,
520 ensure_2d, allow_nd, ensure_min_samples,
--> 521 ensure_min_features, warn_on_dtype, estimator)
522 if multi_output:
523 y = check_array(y, 'csr', force_all_finite=True, ensure_2d=False,
C:\ProgramData\anaconda_3\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
380 force_all_finite)
381 else:
--> 382 array = np.array(array, dtype=dtype, order=order, copy=copy)
383
384 if ensure_2d:
TypeError: float() argument must be a string or a number, not 'dict'
It seems like the model is pulling in the dictionaries as the x variables instead of the entries in the dictionaries. I think I need to unpack the dictionaries into arrays (?) but haven't had any luck doing so.
Update:
Here is what products looks like after running word_count and defining sentiment:
products.head()
If you want to just correct the error, first use DictVectorizer on train_data['word_count'] to convert it into the acceptable format, which is an array of shape [n_samples, n_features].
Add the below to your code before sentiment_model.fit():
from sklearn.feature_extraction import DictVectorizer
dictVectorizer = DictVectorizer()
train_data_dict = dictVectorizer.fit_transform(train_data['word_count'])
Then call sentiment_model.fit() like this:
sentiment_model.fit(X = train_data_dict, y =train_data['sentiment'])
Note:
Instead of implementing your own count_words method, I would recommend using CountVectorizer.
from sklearn.feature_extraction.text import CountVectorizer
countVec = CountVectorizer()
train_data_vectorizer = countVec.fit_transform(train_data['review'])
sentiment_model.fit(X = train_data_vectorizer, y =train_data['sentiment'])
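A hedged follow-up sketch: the fitted vectorizer must transform (not fit_transform) the held-out reviews before predicting, so that train and test matrices share the same vocabulary:
test_data_vectorizer = countVec.transform(test_data['review'])
predictions = sentiment_model.predict(test_data_vectorizer)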
Try
X = train_data['word_count'].apply(lambda d: list(d.values()))
This should return the list of word counts (numbers) for each item in train_data['word_count'], if that is what you are looking for.
