Time Series Regression Model Issue - python

I am new to Python and trying to build a time series regression model. I have three columns: X, Y, and the date. I imported everything below, but I am getting stuck with a warning.
import numpy as np
from sklearn import linear_model
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 15, 6
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.tsa.stattools import adfuller
raw_data = pd.read_csv("IMF and BBG Fair Values.csv")
ISO_TH = raw_data[["IMF_VALUE", "BBG_FV", "IMF_DATE"]]
Filtering to get rid of NaN:
filtered_TH = ISO_TH[np.isfinite(raw_data['BBG_FV'])]
I get this warning:
C:\Program Files\Anaconda3\lib\site-packages\pandas\core\generic.py:2698: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation

Your problem has exactly the origin described in the pandas documentation linked in the warning. Look at the minimal example provided there:
def do_something(df):
    foo = df[['bar', 'baz']]  # Is foo a view? A copy? Nobody knows!
    # ... many lines here ...
    foo['quux'] = value  # We don't know whether this will modify df or not!
    return foo
The problem is that foo might be either a view of the dataframe df or a copy. If it is a view, then changes to foo will also affect the original dataframe df. If foo is a copy, then the line foo['quux'] = value will have no effect on df.
How does this now translate to your problem?
You start by creating a dataframe from a *.csv file:
raw_data = pd.read_csv("IMF and BBG Fair Values.csv")
Then you select the columns "IMF_VALUE", "BBG_FV", "IMF_DATE" from the dataframe raw_data in the following way:
ISO_TH = raw_data[["IMF_VALUE", "BBG_FV", "IMF_DATE"]]
Now, this looks very similar to the second line from the documentation:
foo = df[['bar', 'baz']]
Is your ISO_TH a view or a copy of raw_data? We don't know! So what happens if we change a column of ISO_TH? Does raw_data also change or not? We don't know, and hence the warning.
Toy example:
import pandas as pd
import numpy as np

raw_data = pd.DataFrame(
    [[np.inf, 22, 333, 44], [3, 4, 5, 2], [1, 2, 3, 4], [np.inf, 0, 0, 0]],
    columns=["BBG_FV", "IMF_VALUE", "IMF_DATE", "unused"],
)
ISO_TH = raw_data[["IMF_VALUE", "BBG_FV", "IMF_DATE"]]

# If we now change ISO_TH, we get a warning
ISO_TH.IMF_VALUE = [0, 0, 0, 0]  # SettingWithCopyWarning
The fact that you create an intermediate object filtered_TH from ISO_TH changes nothing here.
How can we solve this? Easy, we read the docs and do what is written there!
ISO_TH = raw_data.loc[:,["IMF_VALUE", "BBG_FV", "IMF_DATE"]]
And continue as before.
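Applied to the code from the question, a minimal sketch of the fix might look like this (assuming the same CSV and column names as in the question); an explicit .copy() after the boolean filter keeps later writes to filtered_TH warning-free as well:

import numpy as np
import pandas as pd

raw_data = pd.read_csv("IMF and BBG Fair Values.csv")

# .loc makes the selection explicit, so ISO_TH is a proper new object
ISO_TH = raw_data.loc[:, ["IMF_VALUE", "BBG_FV", "IMF_DATE"]]

# .copy() guarantees filtered_TH does not share data with ISO_TH
filtered_TH = ISO_TH[np.isfinite(ISO_TH["BBG_FV"])].copy()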
Additional information: What rules does Pandas use to generate a view vs a copy?

Related

Applying scikit-learn preprocessing to pandas without causing warnings

I'm trying to use scikit-learn's preprocessing to min-max scale a column of a pandas DataFrame. My solution works but gives me two warnings, and I am wondering if there is a better way to do it.
Here is my function, which does the min-max scaling given a dataframe and a list of columns:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

def minMaxScale(df, cols):
    scaler = MinMaxScaler()
    return scaler.fit_transform(df[cols])
This is where I use it:
df.loc[:,'pct_mm'] = minMaxScale(df,['pct'])
Where the column 'pct' exists and 'pct_mm' does not exist.
I get the following warning 2 times:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
How should I do this the way pandas wants me to?
Cannot reproduce the warnings:
import pandas as pd
import seaborn as sns
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

def minMaxScale(df, cols):
    scaler = MinMaxScaler()
    return scaler.fit_transform(df[cols])

df = sns.load_dataset('iris')
df.loc[:, 'newcolumn'] = minMaxScale(df, ['sepal_length'])
However, if I do this:
df = sns.load_dataset('iris')
df2 = df[:]
df2.loc[:, 'newcolumn'] = minMaxScale(df, ['sepal_length'])
then I get two warnings as well.
Probably you derived df from another dataframe somewhere in your code. I recommend finding the line where df is created and making an explicit copy, like df = old_df.copy().
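As a quick sanity check, making the copy explicit in the reproduction above silences the warning; a minimal sketch reusing the iris example:

import seaborn as sns
from sklearn.preprocessing import MinMaxScaler

def minMaxScale(df, cols):
    scaler = MinMaxScaler()
    return scaler.fit_transform(df[cols])

df = sns.load_dataset('iris')
df2 = df[:].copy()  # explicit copy: df2 no longer shares data with df
df2.loc[:, 'newcolumn'] = minMaxScale(df2, ['sepal_length'])  # no warning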

Cross-validation in random forest using Anaconda

I'm using the Titanic data set to predict whether a passenger survived or not, using a random forest. This is my code:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn import cross_validation
import matplotlib.pyplot as plt
%matplotlib inline
data = pd.read_csv("C:\\Users\\kabala\\Downloads\\Titanic.csv")
data.isnull().any()
data["Age"] = data["Age"].fillna(data["Age"].median())
data["PClass"] = data["PClass"].fillna("3rd")
data["PClass"].isnull().any()
data.isnull().any()
pd.get_dummies(data.Sex)
# choosing the predictive variables
x = data[["PClass", "Age", "Sex"]]
# the target variable is y
y = data["Survived"]
modelrandom = RandomForestClassifier(max_depth=3)
modelrandom = cross_validation.cross_val_score(modelrandom, x, y, cv=5)
But I keep getting this error:
ValueError: could not convert string to float: 'female'
and I don't understand what the problem is, because I changed the Sex feature to a dummy variable.
Thanks :)
pd.get_dummies returns a new data frame; it does not do the operation in place. Therefore you really are still passing a string in the Sex column.
So you need something like X = pd.get_dummies(data[['Sex','PClass','Age']], columns=['Sex','PClass']), and this should fix your problem. Note that PClass is also a string column that needs dummy variables, since you fill it with '3rd'.
There are also several places where you call data.isnull().any(); that does not change the underlying dataframe. I left those as they were, but just FYI, they may not be doing what you intended.
Full code would be:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn import cross_validation
import matplotlib.pyplot as plt
%matplotlib inline

data = pd.read_csv("C:\\Users\\kabala\\Downloads\\Titanic.csv")
data.isnull().any()  # <-- Beware: this is not doing anything to the data
data["Age"] = data["Age"].fillna(data["Age"].median())
data["PClass"] = data["PClass"].fillna("3rd")
data["PClass"].isnull().any()  # <-- Beware: this is not doing anything to the data
data.isnull().any()  # <-- Beware: this is not doing anything to the data

# ******** Fix for your code *******
X = pd.get_dummies(data[['Sex', 'PClass', 'Age']], columns=['Sex', 'PClass'])
# choosing the predictive variables
# x = data[["PClass", "Age", "Sex"]]  # replaced by the dummy-encoded X above
# the target variable is y
y = data["Survived"]
modelrandom = RandomForestClassifier(max_depth=3)
scores = cross_validation.cross_val_score(modelrandom, X, y, cv=5)
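As a side note, the sklearn.cross_validation module was deprecated and has since been removed; cross_val_score now lives in sklearn.model_selection. A minimal sketch of the same fix on a current scikit-learn version (same assumed CSV path and column names as above):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

data = pd.read_csv("C:\\Users\\kabala\\Downloads\\Titanic.csv")
data["Age"] = data["Age"].fillna(data["Age"].median())
data["PClass"] = data["PClass"].fillna("3rd")

# One-hot encode the categorical predictors so everything is numeric
X = pd.get_dummies(data[["Sex", "PClass", "Age"]], columns=["Sex", "PClass"])
y = data["Survived"]

scores = cross_val_score(RandomForestClassifier(max_depth=3), X, y, cv=5)
print(scores.mean())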

AttributeError: 'numpy.ndarray' object has no attribute 'drop'

I'm trying to delete the first 24 rows of my pandas dataframe.
Searching on the web has led me to believe that the best way to do this is by using the pandas 'drop' function.
However, whenever I try to use it, I get the error:
AttributeError: 'numpy.ndarray' object has no attribute 'drop'
This is how I created my pandas dataframe:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
%matplotlib inline
import os
cwd = os.getcwd()
df = pd.read_csv('C:/Users/.../Datasets/Weather/temperature4.csv')
Then:
df.fillna(df.mean())
df.dropna()
The head of my dataframe looks like this: [screenshot of df.head() omitted]
And then:
df = StandardScaler().fit_transform(df)
df.drop(df.index[0, 23], inplace=True)
This is where I get the attributeerror.
Not sure what I should do to delete the first 24 rows of my dataframe.
(This was all done using Python 3 on a Jupyter notebook on my local machine)
The problem lies in the following line:
df = StandardScaler().fit_transform(df)
It returns a numpy array (see docs), which does not have a drop function.
You would have to convert it into a pd.DataFrame first!
new_df = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns, index=df.index)
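From there, dropping the first 24 rows is straightforward on the rebuilt DataFrame; a small sketch (note that the earlier df.fillna(df.mean()) and df.dropna() calls also return new frames, so their results need to be assigned back to take effect):

# Keep everything except the first 24 rows (positions 0 through 23)
new_df = new_df.iloc[24:]

# Equivalent, using drop with the first 24 index labels
# new_df = new_df.drop(new_df.index[:24])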

Why does Pandas say this data frame has only one column?

I began a Python course on linear and logistic regression, but I am encountering what is probably a stupid error. I have to work with this data frame:
http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv
And this is my code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
rwq = pd.read_csv('*filepath*/winequality-red.csv')
rows = len(rwq.index)
cols = rwq.shape[1]
When I print rows and cols, rows correctly prints 1599, but for some reason cols always equals 1 (when in fact there are 12).
I also tried len(rwq.columns) and I still get 1.
Am I doing something wrong or is the problem with the file provided?
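For what it's worth: the UCI wine-quality file is semicolon-separated, not comma-separated, so pd.read_csv parses each row as one big column unless you pass sep=';'. A minimal sketch (keeping the placeholder path from the question):

import pandas as pd

# The wine-quality CSV uses ';' as its field separator
rwq = pd.read_csv('*filepath*/winequality-red.csv', sep=';')

print(rwq.shape)  # expected: (1599, 12)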

Label Encoder is not creating dummy variables

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
dataset=pd.read_csv('Churn_Modelling.csv')
X=dataset.iloc[:, 3:13]
Y=dataset.iloc[:, 13]
from sklearn.preprocessing import LabelEncoder
label_en1=LabelEncoder()
X.values[:, 1]=label_en1.fit_transform(X.values[:, 1])
label_en2=LabelEncoder()
X.values[:, 2]=label_en2.fit_transform(X.values[:, 2])
I tried creating dummy variables, but it is not happening. I am using X.values in the encoding section because the version of Spyder I have does not support object arrays, so I kept X and Y as dataframes; I added .values because dataframes do not support that slicing syntax directly. Where might I have gone wrong?
I created a similar program before for creating dummy variables and it worked then. I don't understand why it is not happening for this one.
Edit:
Can you pass in a slice of your slice? Like so:
X.iloc[:, 1] = label_en1.fit_transform(X.iloc[:, 1])
You would essentially trim your dataframe down to a single column, which behaves like an array.
Instead of accessing X.values, try accessing the feature / column name directly:
X['col_name'] = label_en1.fit_transform(X['col_name'])
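More generally, LabelEncoder only replaces each category with an integer code; it never creates dummy columns by itself. If dummy variables are the actual goal, pd.get_dummies does it in one step. A sketch, assuming the two categorical columns in the usual Churn_Modelling.csv are named 'Geography' and 'Gender':

import pandas as pd

dataset = pd.read_csv('Churn_Modelling.csv')
X = dataset.iloc[:, 3:13]

# get_dummies creates the indicator columns directly;
# drop_first=True avoids the redundant reference category
X = pd.get_dummies(X, columns=['Geography', 'Gender'], drop_first=True)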
