AttributeError: 'numpy.ndarray' object has no attribute 'drop' - python

I'm trying to delete the first 24 rows of my pandas dataframe.
Searching on the web has led me to believe that the best way to do this is by using the pandas 'drop' function.
However, whenever I try to use it, I get the error:
AttributeError: 'numpy.ndarray' object has no attribute 'drop'
This is how I created my pandas dataframe:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
%matplotlib inline
import os
cwd = os.getcwd()
df = pd.read_csv('C:/Users/.../Datasets/Weather/temperature4.csv')
Then:
df.fillna(df.mean())
df.dropna()
The head of my dataframe looks like this: [screenshot of df.head() omitted]
And then:
df = StandardScaler().fit_transform(df)
df.drop(df.index[0, 23], inplace=True)
This is where I get the AttributeError.
Not sure what I should do to delete the first 24 rows of my dataframe.
(This was all done using Python 3 on a Jupyter notebook on my local machine)

The problem lies in the following line:
df = StandardScaler().fit_transform(df)
It returns a NumPy array (see the docs), which does not have a drop method.
You would have to convert it back into a pd.DataFrame first:
new_df = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns, index=df.index)
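Putting it together, a minimal sketch of the corrected flow (assuming a numeric-only CSV; note that fillna and dropna return new DataFrames, so assign the result back or pass inplace=True, and that df.index[0, 23] is not valid indexing, use df.index[:24] to target the first 24 rows):
df = pd.read_csv('temperature4.csv')  # hypothetical path
df = df.fillna(df.mean())             # assign the result back, or the NaNs remain
scaled = pd.DataFrame(StandardScaler().fit_transform(df),
                      columns=df.columns, index=df.index)
scaled = scaled.drop(scaled.index[:24])  # drop the first 24 rows
# equivalently: scaled = scaled.iloc[24:]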

Related

Why is my pandas dataframe data type turning into 'None' type?

I am working with a DataFrame. After running the code below and calling head(), I get the error "AttributeError: 'NoneType' object has no attribute 'head'".
The relevant piece of code is below:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.model_selection import train_test_split

rfilepath = "Advertising.csv"

def loaddata(rfilepath):
    data = pd.read_csv(rfilepath)
    return data

try:
    data_df = loaddata(rfilepath)
    print(data_df)
except:
    print("error")

data_df.head()  # Here no error is showing

def processdata(data_df):
    for (columnName, columnData) in data_df.iteritems():
        print(columnName)
        sns.boxplot(data_df[columnName])
        plt.show()
        q1 = stats.scoreatpercentile(data_df[columnName], 25)
        print("Q1", q1)
        q3 = stats.scoreatpercentile(data_df[columnName], 75)
        print("Q3", q3)
        iqr = stats.iqr(data_df[columnName])
        print("iqr", iqr)
        lower_bound = q1 - 1.5 * iqr
        print("Lowerbound", lower_bound)
        upper_bound = q3 + 1.5 * iqr
        print("upperbound", upper_bound)
        print("\n")
        outliers = data_df[columnName][(data_df[columnName] < lower_bound) | (data_df[columnName] > upper_bound)]
        median = stats.scoreatpercentile(data_df[columnName], 99)
        for i in outliers:
            data_df[columnName] = np.where(data_df[columnName] == i, median, data_df[columnName])
        sns.boxplot(data_df[columnName])
        plt.show()

try:
    data_df = processdata(data_df)
except:
    print("error")

data_df.head()  # after calling processdata(data_df), this raises "AttributeError: 'NoneType' object has no attribute 'head'"
I think the issue is with the function processdata(data_df). Does anyone know what exactly the issue is?
Your function does not return anything; it just prints intermediate results. You are then assigning that None to data_df, so it is no longer a DataFrame. Add return data_df at the end of the function, outside the column for loop, to define what should be assigned back to data_df.
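A minimal sketch of the fix (function body abbreviated, same logic as above):
def processdata(data_df):
    for columnName, columnData in data_df.iteritems():
        # ... boxplots and outlier replacement as above ...
        pass
    return data_df  # without this line, the function returns None

data_df = processdata(data_df)
data_df.head()  # works now, because data_df is still a DataFrame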

Applying scikitlearn preprocessing to pandas without causing warnings

I'm trying to use scikit-learn's preprocessing to min-max scale a column of a pandas DataFrame. My solution works but gives me two warnings, and I am wondering if there is a better way to do it.
Here is my function, which does the min-max scaling given a dataframe and columns:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

def minMaxScale(df, cols):
    scaler = MinMaxScaler()
    return scaler.fit_transform(df[cols])
This is where I use it
df.loc[:,'pct_mm'] = minMaxScale(df,['pct'])
Where the column 'pct' exists and 'pct_mm' does not exist.
I get the following warning twice:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
How should I do this the way pandas wants me to?
Cannot reproduce the warnings:
import pandas as pd
import seaborn as sns
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

def minMaxScale(df, cols):
    scaler = MinMaxScaler()
    return scaler.fit_transform(df[cols])

df = sns.load_dataset('iris')
df.loc[:, 'newcolumn'] = minMaxScale(df, ['sepal_length'])
However, if I do this:
df = sns.load_dataset('iris')
df2 = df[:]
df2.loc[:, 'newcolumn'] = minMaxScale(df, ['sepal_length'])
then I get two warnings as well.
Probably you derived df from another dataframe somewhere in your code. I recommend finding the lines where df was created and making an explicit copy, e.g. df = old_df.copy().
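For illustration, a sketch of how an explicit copy makes the warning go away (continuing the iris example from above):
df = sns.load_dataset('iris')
df2 = df[:].copy()  # explicit copy instead of an ambiguous slice
df2.loc[:, 'newcolumn'] = minMaxScale(df2, ['sepal_length'])  # no SettingWithCopyWarning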

Time Series Regression Model Issue

I am new to Python and trying to build a time series regression model. I have 3 columns: X, Y, and the date. I imported everything below, but I am getting stuck with an error.
import numpy as np
from sklearn import linear_model
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 15, 6
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.tsa.stattools import adfuller
raw_data = pd.read_csv("IMF and BBG Fair Values.csv")
ISO_TH = raw_data[["IMF_VALUE", "BBG_FV", "IMF_DATE"]]
Filtering to get rid of NaN:
filtered_TH = ISO_TH[np.isfinite(raw_data['BBG_FV'])]
I get this warning:
C:\Program Files\Anaconda3\lib\site-packages\pandas\core\generic.py:2698: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation
Your problem has exactly the origin described in the pandas documentation linked in the warning. Look at the minimal example provided there:
def do_something(df):
    foo = df[['bar', 'baz']]  # Is foo a view? A copy? Nobody knows!
    # ... many lines here ...
    foo['quux'] = value  # We don't know whether this will modify df or not!
    return foo
The problem is that foo might either be a copy of the dataframe df or a view. If it is a view, then changes on foo will also affect the original dataframe df. If foo is a copy, then the line foo['quux'] = value will have no effect on df.
How does this now translate to your problem?
You start with creating a dataframe from a *.csv file:
raw_data = pd.read_csv("IMF and BBG Fair Values.csv")
Then you select the columns "IMF_VALUE", "BBG_FV", "IMF_DATE" from the dataframe raw_data in the following way:
ISO_TH = raw_data[["IMF_VALUE", "BBG_FV", "IMF_DATE"]]
Now, this looks very similar to the second line from the documentation:
foo = df[['bar', 'baz']]
Is your ISO_TH a view or a copy of raw_data? We don't know! So what happens if we change a column of ISO_TH? Does raw_data also change or not? We don't know, and hence the warning.
Toy example:
import pandas as pd
import numpy as np

raw_data = pd.DataFrame([[np.inf, 22, 333, 44], [3, 4, 5, 2], [1, 2, 3, 4], [np.inf, 0, 0, 0]],
                        columns=["BBG_FV", "IMF_VALUE", "IMF_DATE", "unused"])
ISO_TH = raw_data[["IMF_VALUE", "BBG_FV", "IMF_DATE"]]
# if we now change ISO_TH, we get a warning
ISO_TH.IMF_VALUE = [0, 0, 0, 0]  # SettingWithCopyWarning
The fact that you create an intermediate object filtered_TH from ISO_TH changes nothing here.
How can we solve this? Easy, we read the docs and do what is written there!
ISO_TH = raw_data.loc[:,["IMF_VALUE", "BBG_FV", "IMF_DATE"]]
And continue as before.
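Alternatively, if you prefer an explicit copy over .loc selection, a sketch under the same setup:
ISO_TH = raw_data[["IMF_VALUE", "BBG_FV", "IMF_DATE"]].copy()
filtered_TH = ISO_TH[np.isfinite(raw_data['BBG_FV'])].copy()
ISO_TH.IMF_VALUE = [0, 0, 0, 0]  # no warning: ISO_TH is unambiguously its own object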
Additional information: What rules does Pandas use to generate a view vs a copy?

Loading SKLearn cancer dataset into Pandas DataFrame

I'm trying to load a sklearn dataset, and a column is missing, according to the keys (target_names, target & DESCR). I have tried various methods to include the last column, but with errors.
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
print(cancer.keys())
the keys are ['target_names', 'data', 'target', 'DESCR', 'feature_names']
data = pd.DataFrame(cancer.data, columns=cancer.feature_names)
print(data.describe())
With the code above, it only returns 30 columns, when I need 31. What is the best way to load scikit-learn datasets into a pandas DataFrame?
Another option, a one-liner this time, to create the dataframe including both the features and the target variable:
import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
df = pd.DataFrame(np.c_[cancer['data'], cancer['target']],
                  columns=np.append(cancer['feature_names'], ['target']))
If you want a target column you will need to add it, because it's not in cancer.data. cancer.target holds the values 0 or 1, and cancer.target_names holds their labels. I hope the following is what you want:
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
print(cancer.keys())
data = pd.DataFrame(cancer.data, columns=cancer.feature_names)
print(data.describe())
data = data.assign(target=pd.Series(cancer.target))
print(data.describe())
# In case you want labels instead of numbers.
data.replace(to_replace={'target': {0: cancer.target_names[0]}}, inplace=True)
data.replace(to_replace={'target': {1: cancer.target_names[1]}}, inplace=True)
print(data.shape)  # data.describe() won't show the "target" column here because its values were converted to strings.
This works too, also using pd.Series.
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
print(cancer.keys())
data = pd.DataFrame(cancer.data, columns=cancer.feature_names)
data['Target'] = pd.Series(data=cancer.target, index=data.index)
print(data.keys())
print(data.shape)
Only the target column is missing, so you can just add one.
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
df['target'] = cancer.target
Mapping target names can be handled elegantly using map():
data["target"] = pd.Categorical(pd.Series(cancer.target).map(lambda x: cancer.target_names[x]))
As of scikit-learn 0.23 you can do the following to get a DataFrame with the target column included.
cancer = load_breast_cancer(as_frame=True)
df = cancer.frame
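For example, a quick usage check (the Bunch returned with as_frame=True also exposes data as a DataFrame and target as a Series):
X, y = cancer.data, cancer.target  # DataFrame and Series, respectively
print(df.shape)                    # (569, 31): 30 feature columns plus 'target'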

ufunc 'add' did not contain a loop with signature matching types dtype('<U23') dtype('<U23') dtype('<U23')

When trying to convert the sklearn dataset into a pandas dataframe with the following code, I am getting this error: "ufunc 'add' did not contain a loop with signature matching types dtype('<U23') dtype('<U23') dtype('<U23')"
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
data = pd.DataFrame(data=np.c_[cancer['data'], cancer['target']], columns=cancer['feature_names'] + cancer['target'])
Here is how I converted a sklearn dataset (Boston, in my case) to a pandas dataframe. The target column name needs to be appended:
bostonData = pd.DataFrame(data=np.c_[boston['data'], boston['target']],
                          columns=np.append(boston['feature_names'], ['target']))
You have a NumPy array of strings. Please provide the full error so we can figure out what's missing. For example, assuming you got dtype('<U9'), add dtype=float to your array (note the columns also need np.append rather than +, as shown above). Something like this, not certain:
data = pd.DataFrame(data=np.c_[cancer['data'], cancer['target']], columns=np.append(cancer['feature_names'], ['target']), dtype=float)
Sometimes it's just easier to keep it simple. Create a DataFrame for the data and one for the target, then merge them using pandas.
data_df = pd.DataFrame(data=cancer['data'], columns=cancer['feature_names'])
target_df = pd.DataFrame(data=cancer['target'], columns=['target']).reset_index(drop=True)
df = pd.concat([data_df, target_df], axis=1)
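A quick sanity check of the merged frame (569 samples, 30 features plus the target):
print(df.shape)                     # (569, 31)
print(df['target'].value_counts()) # class balance of the 0/1 target column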
