SelectKBest is not working when I read CSV files - python

How can I use the SelectKBest function when I read a CSV file from my desktop with pandas?
(I'm a beginner, so please be patient with me.)
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2
data = pd.read_csv(r"pima.csv")
X, y = data(return_X_y=True)
X.shape
X_new = SelectKBest(chi2, k=20).fit_transform(X, y)
X_new.shape
I've tried writing the filename with single quotes (') and double quotes ("), with and without the r prefix; nothing changed.
The file is the well-known Pima Indians Diabetes dataset, which is widely available online.
I get this error when I try to run it:
'DataFrame' object is not callable
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_4116\4011967154.py in <module>
2 from sklearn.feature_selection import SelectKBest, chi2
3 data = pd.read_csv(r"pima.csv")
----> 4 X, y = data(return_X_y=True)
5 X.shape
6
TypeError: 'DataFrame' object is not callable

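pd.read_csv returns a DataFrame, which is not a function, so data(return_X_y=True) fails; return_X_y is an argument of scikit-learn's dataset loaders. If you're loading a DataFrame with pandas, your X and y need to be selected as columns, probably like this: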
X = data.drop(['Outcome'], axis=1)
y = data['Outcome']
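As a minimal sketch of how the whole thing might fit together: the small in-memory DataFrame and its column names below are made up to stand in for pima.csv, and k=2 is picked only because this toy data has three feature columns.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical stand-in for pd.read_csv("pima.csv"); the column names are assumed.
data = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137, 116],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1, 25.6],
    "Age": [50, 31, 32, 21, 33, 30],
    "Outcome": [1, 0, 1, 0, 1, 0],
})

# Select features and target by column instead of calling the DataFrame.
X = data.drop(columns=["Outcome"])
y = data["Outcome"]

# Keep the 2 highest-scoring features; k is chosen to be smaller than the
# number of feature columns in this toy data.
selector = SelectKBest(chi2, k=2)
X_new = selector.fit_transform(X, y)

# Names of the columns that were kept.
selected = list(X.columns[selector.get_support()])
print(X.shape, X_new.shape, selected)
```

Note that chi2 expects non-negative feature values, which holds for the toy data here.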

Related

Why am I getting an IndexError on this one-hot encoding?

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
dataset = pd.read_csv('netflixprice.csv')
x = dataset.iloc[:,0].values
y = dataset.iloc[:, 1:6].values
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
x = np.array(ct.fit_transform(x))
IndexError Traceback (most recent call last)
Input In [8], in <cell line: 4>()
2 from sklearn.preprocessing import OneHotEncoder
3 ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
----> 4 x = np.array(ct.fit_transform(x))
[Image: data structure]
I'm new to this. Also, is there anywhere I can learn more about data processing?
It's hard to tell anything without knowing the structure of your data. However, it seems like you may want to reshape your x:
x = dataset.iloc[:, 0].values.reshape(-1, 1)
I found a dataset that might be similar to yours and tried it; it worked.
As for learning how to process data: I personally refer to the documentation of the method I want to apply. In your case it's here. However, I found a clue to where the problem was in the error message:
def _get_column_indices(X, key):
    """Get feature column indices for input data X and key.
    For accepted values of `key`, see the docstring of
    :func:`_safe_indexing_column`.
    """
--> n_columns = X.shape[1] # this is where the problem is
    key_dtype = _determine_key_type(key)
    if isinstance(key, (list, tuple)) and not key:
        # we get an empty list
IndexError: tuple index out of range
That made me suspect that you got an ndarray of shape (n,) when you sliced x, which lacks the column dimension that ColumnTransformer requires.
It also seems like you intended x to be the target rather than the only feature. With the remaining columns (1:6) assigned to y, you may want to swap x and y. You can still encode your target as you planned.
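To make the shape issue concrete, here is a small sketch. The DataFrame is a made-up stand-in for netflixprice.csv (its column names and values are assumptions), but the slicing and the ColumnTransformer call mirror the ones in the question.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for pd.read_csv('netflixprice.csv').
dataset = pd.DataFrame({
    "Country": ["US", "UK", "US", "IN"],
    "Price": [9.99, 8.50, 9.99, 3.00],
})

x_1d = dataset.iloc[:, 0].values                 # shape (4,): no column axis
x_2d = dataset.iloc[:, 0].values.reshape(-1, 1)  # shape (4, 1): one column

ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [0])],
    remainder="passthrough",
)

# The 2-D input has a column 0 for the encoder to pick up.
encoded = ct.fit_transform(x_2d)
# The result may be sparse or dense depending on its density, so normalise it.
if hasattr(encoded, "toarray"):
    encoded = encoded.toarray()

print(x_1d.shape, x_2d.shape, encoded.shape)
```

Selecting with a list, as in dataset.iloc[:, [0]].values, should also give a two-dimensional array without the explicit reshape.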

TypeError: 'DataFrame' object is not callable

I have been trying to split the dataset into train and test data for deployment using Streamlit.
import streamlit as st
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, KFold,cross_val_score
from sklearn.cluster import KMeans
import xgboost as xgb
from xgboost import XGBClassifier
def load_dataset():
    df = pd.read_csv('txn.csv')
    return df
df = load_dataset()
#create X and y, X will be feature set and y is the label - LTV
X = df.drop(['LTVCluster','m1_Revenue'],axis=1)
y = df(['LTVCluster'])
But I'm getting this error while executing the file:
TypeError: 'DataFrame' object is not callable
Traceback:
File "c:\users\anish\anaconda3\lib\site-packages\streamlit\script_runner.py", line 333, in _run_script
exec(code, module.__dict__)
File "C:\Users\Anish\Desktop\myenv\P52 - Retail Ecommerce\new1.py", line 25, in <module>
y = df(['LTVCluster'],axis=1)
What could be causing the error?
You have an extra set of parentheses in your last line, so Python thinks you're calling df. To select columns in pandas, you use square brackets, so remove the parentheses.
y = df['LTVCluster']
To select a column, remove the () from df(['LTVCluster']):
y = df['LTVCluster']
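A short sketch of the selection followed by the split the question was aiming for. The DataFrame contents are invented to stand in for txn.csv, with only the LTVCluster and m1_Revenue column names taken from the question; test_size and random_state are arbitrary choices.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for pd.read_csv('txn.csv').
df = pd.DataFrame({
    "Recency": [10, 200, 35, 5, 90, 300, 12, 60],
    "Frequency": [12, 1, 7, 20, 3, 1, 15, 4],
    "m1_Revenue": [500.0, 20.0, 310.0, 900.0, 75.0, 10.0, 650.0, 120.0],
    "LTVCluster": [2, 0, 1, 2, 0, 0, 2, 1],
})

X = df.drop(["LTVCluster", "m1_Revenue"], axis=1)
y = df["LTVCluster"]          # square brackets: a Series
y_frame = df[["LTVCluster"]]  # list inside brackets: a one-column DataFrame

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
print(X_train.shape, X_test.shape)
```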

dataset is not callable problems

I'm trying to impute NaN values, but first I want to find the best method for estimating them. I'm new to these methods, so I want to use some code I found to compare the different regressors and choose the best one. The original code is this:
import numpy as np
from sklearn.experimental import enable_iterative_imputer # noqa
from sklearn.datasets import fetch_california_housing
from sklearn.impute import SimpleImputer
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score
N_SPLITS = 5
rng = np.random.RandomState(0)
X_full, y_full = fetch_california_housing(return_X_y=True)
# ~2k samples is enough for the purpose of the example.
# Remove the following two lines for a slower run with different error bars.
X_full = X_full[::10]
y_full = y_full[::10]
n_samples, n_features = X_full.shape
fetch_california_housing is the dataset used in that example.
When I tried to adapt this code to my case, I wrote this:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from numpy import genfromtxt
data = genfromtxt('documents/datasets/df.csv', delimiter=',')
features = data[:, :2]
targets = data[:, 2]
N_SPLITS = 5
rng = np.random.RandomState(0)
X_full, y_full = data(return_X_y= True)
# ~2k samples is enough for the purpose of the example.
# Remove the following two lines for a slower run with different error bars.
X_full = X_full[::10]
y_full = y_full[::10]
n_samples, n_features = X_full.shape
I always get the same error:
AttributeError: 'numpy.ndarray' object is not callable
And before that, when I used my DataFrame loaded from the CSV (df.csv), I got the same kind of error:
AttributeError: 'Dataset' object is not callable
The complete error is this:
TypeError Traceback (most recent call last)
<ipython-input-8-3b63ca34361e> in <module>
3 rng = np.random.RandomState(0)
4
----> 5 X_full, y_full = df(return_X_y=True)
6 # ~2k samples is enough for the purpose of the example.
7 # Remove the following two lines for a slower run with different error bars.
TypeError: 'DataFrame' object is not callable
I don't know how to make either of these errors go away.
I hope I have explained my problem well, since my English is not very good.
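One possible way to adapt the example, sketched under the assumption that the first two columns of the CSV are features and the third is the target (as in the features/targets lines of the question). fetch_california_housing(return_X_y=True) is a loader function that returns the pair, whereas an array that is already loaded is sliced rather than called. The CSV text below is invented to stand in for documents/datasets/df.csv.

```python
import io
import numpy as np

# Hypothetical CSV content; the empty field becomes NaN when parsed.
csv_text = "1.0,2.0,10.0\n3.0,,20.0\n5.0,6.0,30.0\n7.0,8.0,40.0\n"
data = np.genfromtxt(io.StringIO(csv_text), delimiter=",")

N_SPLITS = 5
rng = np.random.RandomState(0)

# Slice the array instead of calling it with return_X_y=True.
X_full = data[:, :2]
y_full = data[:, 2]
n_samples, n_features = X_full.shape
print(n_samples, n_features)
```

With a pandas DataFrame the same idea would apply, selecting columns by name or position instead of calling the object.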

unable to transform the categorical variable, showing categories=auto error

Python 3.7, Spyder 3.3.6. I always get an error, and I have also tried different Python versions:
import pandas as pa
import numpy as np
X=0
y=0
dataset = 0
#import the data set and separate the
dataset = pa.read_csv("50_Startups.csv")
X = dataset.iloc[:,:-1].values
y = dataset.iloc[:,4].values
#categorical variable
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer(
    [('one_hot_encoder',OneHotEncoder(),[0])],
    remainder = 'passthrough'
)
X = np.array(ct.fit_transform(X), dtype=np.float64)
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)
The error is:
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\preprocessing\_encoders.py:415: FutureWarning: The handling of integer data will change in version 0.22. Currently, the categories are determined based on the range [0, max(values)], while in the future they will be determined based on the unique values.
If you want the future behaviour and silence this warning, you can specify "categories='auto'".
In case you used a LabelEncoder before this OneHotEncoder to convert the categories to integers, then you can now use the OneHotEncoder directly.
warnings.warn(msg, FutureWarning)
Traceback (most recent call last):
File "<ipython-input-5-139c661c06f7>", line 25, in <module>
X = np.array(ct.fit_transform(X), dtype=np.float64)
File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\compose\_column_transformer.py", line 490, in fit_transform
return self._hstack(list(Xs))
File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\compose\_column_transformer.py", line 541, in _hstack
raise ValueError("For a sparse output, all columns should"
ValueError: For a sparse output, all columns should be a numeric or convertible to a numeric.
Build the matrix of features X and the dependent variable Y (converting the DataFrame to NumPy arrays):
X = dataset.iloc[:,:-1].values
Y = dataset.iloc[:,-1].values
Encode the categorical variable:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
en = LabelEncoder()
X[:,3] = en.fit_transform(X[:,3])
oh = OneHotEncoder(categorical_features=[3])
X = oh.fit_transform(X)
#converting from matrix to array
X = X.toarray()
#Dummy variable trap ---- Removing one dummy variable
X = X[:,1:]
Here you are selecting all the columns as features, and only the categorical column (index 3) is encoded. You fit the encoder on that column only and then transform it, and finally remove one dummy variable.
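Recent scikit-learn releases no longer accept the categorical_features argument of OneHotEncoder, so the same idea might be sketched with ColumnTransformer instead. The rows below are invented, and the column layout (the categorical State column at index 3) is an assumption taken from the index used in the answer above. sparse_threshold=0 is used so that the transformer returns a dense array.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical rows standing in for 50_Startups.csv; layout is assumed.
dataset = pd.DataFrame({
    "R&D Spend": [165349.2, 162597.7, 153441.5, 144372.4],
    "Administration": [136897.8, 151377.6, 101145.6, 118671.9],
    "Marketing Spend": [471784.1, 443898.5, 407934.5, 383199.6],
    "State": ["New York", "California", "Florida", "New York"],
    "Profit": [192261.8, 191792.1, 191050.4, 182902.0],
})

X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

# Encode the categorical column (index 3), pass the numeric columns through.
ct = ColumnTransformer(
    [("one_hot_encoder", OneHotEncoder(), [3])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)
X_encoded = np.array(ct.fit_transform(X), dtype=np.float64)

# Dummy variable trap: drop one of the dummy columns.
X_encoded = X_encoded[:, 1:]
print(X_encoded.shape)
```

The encoded columns come first in the output, followed by the passed-through numeric columns.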

Linear Regression issues

I'm trying to run a linear regression for 2 columns of data (IMF_VALUES, BBG_FV)
I have this code:
import numpy as np
from sklearn import linear_model
import matplotlib.pyplot as plt
import pandas as pd
raw_data = pd.read_csv("IMF and BBG Fair Values.csv")
ISO_TH = raw_data[["IMF_VALUE","BBG_FV"]]
filtered_TH = ISO_TH[np.isfinite(raw_data['BBG_FV'])]
npMatrix = np.matrix(filtered_TH)
IMF_VALUE, BBG_FV = npMatrix[:,0], npMatrix[:,1]
regression = linear_model.LinearRegression
regression.fit(IMF_VALUE, BBG_FV)
When I run this as a test, I get this error and I really have no idea why:
TypeError Traceback (most recent call last)
<ipython-input-28-1ee2fa0bbed1> in <module>()
1 regression = linear_model.LinearRegression
----> 2 regression.fit(IMF_VALUE, BBG_FV)
TypeError: fit() missing 1 required positional argument: 'y'
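The error comes from calling fit on the LinearRegression class itself rather than on an instance, so create the estimator with parentheses first. Also make sure that X is a two-dimensional column array:
regression = linear_model.LinearRegression()
regression.fit(np.array(IMF_VALUE).reshape(-1,1), np.array(BBG_FV).reshape(-1,1))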
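A self-contained sketch of the fit, assuming a tiny made-up table in place of "IMF and BBG Fair Values.csv" (only the two column names come from the question). The message about a missing 'y' appears when fit is called on the class, because the first argument is then taken as self; instantiating with LinearRegression() avoids that.

```python
import numpy as np
import pandas as pd
from sklearn import linear_model

# Hypothetical data standing in for the CSV in the question.
raw_data = pd.DataFrame({
    "IMF_VALUE": [1.0, 2.0, 3.0, 4.0],
    "BBG_FV": [2.0, 4.0, np.nan, 8.0],
})

# Keep only rows where BBG_FV is finite, as in the question.
filtered = raw_data[np.isfinite(raw_data["BBG_FV"])]

X = filtered[["IMF_VALUE"]].to_numpy()  # 2-D: (n_samples, 1)
y = filtered["BBG_FV"].to_numpy()       # 1-D: (n_samples,)

regression = linear_model.LinearRegression()  # note the parentheses
regression.fit(X, y)
print(regression.coef_, regression.intercept_)
```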
