I am trying to fit a linear regression but am getting this error:
ValueError: cannot reshape array of size 2246 into shape (2,2246)
and
C:\Users\Brian\Anaconda3\lib\site-packages\ipykernel_launcher.py:2: FutureWarning: reshape is deprecated and will raise in a subsequent release. Please use .values.reshape(...) instead
This is my code.
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
df = pd.read_csv(r'C:\Users\Brian\Desktop\GOOGTICKER.CSV')
df
times = pd.DatetimeIndex(df['Date'])
grouped= df.groupby([times.year]).mean()
from sklearn import linear_model
x_val= times
y_val= df['GOOGL']
body_reg =linear_model.LinearRegression()
body_reg.fit(x_val, y_val)
I have imported numpy as py and have tried reshaping, but I still get an error. Any advice would be greatly appreciated. Thank you for your time.
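A likely cause is that LinearRegression.fit expects X to be a 2D numeric array of shape (n_samples, n_features), while times is a 1D DatetimeIndex. Here is a minimal sketch of one possible fix. It assumes the dates can be turned into "days since the first date", and it uses a small made-up frame in place of GOOGTICKER.CSV, so the values are illustrative only:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in for the CSV; only the column names come from the question.
df = pd.DataFrame({
    "Date": pd.date_range("2020-01-01", periods=5, freq="D"),
    "GOOGL": [100.0, 102.0, 104.0, 106.0, 108.0],
})

times = pd.DatetimeIndex(df["Date"])

# Convert dates to a numeric feature (days since the first date),
# then reshape to a single-feature 2D array: (n_samples, 1).
x_val = (times - times[0]).days.to_numpy().reshape(-1, 1)
y_val = df["GOOGL"].to_numpy()

body_reg = LinearRegression()
body_reg.fit(x_val, y_val)

slope = body_reg.coef_[0]
print(x_val.shape, slope)
```

reshape(-1, 1) is the "single feature" case: one column, with as many rows as there are samples.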
Related
Basic setup: I'm trying to run a logit regression in Python on the probability of founding a business (the founder variable). The exogenous variables are year, age, edu_cat (education category), and sex.
The X matrix is (4, 650) and the y matrix is (1, 650). All of the variables within the X matrix have 650 non-NaN observations.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
x=np.array ([ df_all['Year'], df_all['Age'], df_all['Edu_cat'], df_all['sex']])
y= np.array([df_all['founder']])
logit_model = sm.Logit(y, x)
result = logit_model.fit()
print(result)
So I'm tracking that the shape is good, but python is telling me otherwise. Am I missing something basic?
I believe the issue is with the y array: wrapping the column in a list makes it (1, 650), when it should be (650,), which is the shape the column has by default. Additionally, I needed to transpose the x array to make it (650, 4).
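To make the shape problem concrete, here is a small sketch using made-up data (only the column names come from the question; the values are random). It compares the arrays built in the question with the samples-in-rows layout:

```python
import numpy as np
import pandas as pd

# Hypothetical data with the question's column names; values are random.
rng = np.random.default_rng(0)
n = 650
df_all = pd.DataFrame({
    "Year": rng.integers(2000, 2020, size=n),
    "Age": rng.integers(18, 70, size=n),
    "Edu_cat": rng.integers(1, 5, size=n),
    "sex": rng.integers(0, 2, size=n),
    "founder": rng.integers(0, 2, size=n),
})

# As in the question: each column becomes a ROW of the array.
x_wrong = np.array([df_all["Year"], df_all["Age"], df_all["Edu_cat"], df_all["sex"]])
y_wrong = np.array([df_all["founder"]])
print(x_wrong.shape, y_wrong.shape)  # (4, 650) (1, 650)

# Samples in rows, variables in columns; y stays 1D.
x = df_all[["Year", "Age", "Edu_cat", "sex"]].to_numpy()
y = df_all["founder"].to_numpy()
print(x.shape, y.shape)  # (650, 4) (650,)

# These arrays can then be passed as sm.Logit(y, x).
```

Selecting the columns straight from the data frame gives the same result as transposing x_wrong, without the extra step.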
I'm getting the following error from my code:
ValueError: Expected 2D array, got scalar array instead:
array=99.
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
Here is the code used:
#importing libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import linear_model
Physical_activity_df = pd.read_excel('C:/Users/Usuario/Desktop/LW_docs/Physical_activity_nopass.xlsx')
prediction_df = Physical_activity_df[['Activity_Score','Calories']]
prediction_df.plot(kind='scatter', x= 'Activity_Score', y= 'Calories')
plt.show()
#change to df variables
activity_score = pd.DataFrame(prediction_df['Activity_Score'])
calories = pd.DataFrame(prediction_df['Calories'])
lm = linear_model.LinearRegression()
model = lm.fit(activity_score,calories)
#predict new values for calories (FROM HERE COMES THE ERROR)
activity_score_new = 99
calories_predict = model.predict(activity_score_new)
calories_predict
Any idea about how to fix this issue? Thanks!
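The error message points at the cause: predict expects a 2D input (rows are samples, columns are features), and 99 is a bare scalar. A minimal sketch of one way around it, using a small made-up frame in place of the Excel file (the numbers are illustrative assumptions):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in for Physical_activity_nopass.xlsx.
prediction_df = pd.DataFrame({
    "Activity_Score": [10, 20, 30, 40],
    "Calories": [100.0, 200.0, 300.0, 400.0],
})

# Double brackets keep the feature as a 2D (n_samples, 1) DataFrame.
activity_score = prediction_df[["Activity_Score"]]
calories = prediction_df["Calories"]

model = LinearRegression().fit(activity_score, calories)

# Wrap the new value in a one-row, one-column frame instead of passing a scalar.
activity_score_new = pd.DataFrame({"Activity_Score": [99]})
calories_predict = model.predict(activity_score_new)
print(calories_predict)
```

Passing [[99]] or np.array(99).reshape(1, -1) should work as well; the one-row DataFrame just keeps the feature name consistent with what the model was fitted on.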
I am completing an assignment for my Data Cleaning class where I must perform a PCA. You shouldn't need the data frame to help with this particular issue. I am a Python noob. I've spent the last 3 hours trying to figure this out on my own with no success. The data frame has been cleaned and anomalies are sorted. I'm trying to create a PCA.
Here is my code:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
from pylab import rcParams
from sklearn.decomposition import PCA
df_pcs = pd.read_csv('/Users/personal/Desktop/Data set/churn_raw_data.csv', usecols=['Churn','Age', 'Income', 'Outage_sec_perweek','Yearly_equip_failure', 'Tenure', 'MonthlyCharge'])
data_numeric = df_pcs[['Age', 'Income', 'Outage_sec_perweek','Yearly_equip_failure', 'Tenure', 'MonthlyCharge']]
data_normed = (df_pcs - df_pcs.mean()) / df_pcs.std()
pca = PCA(n_components=2)
pca.fit(data_normed)
data_pca = pd.DataFrame(pca.transform(data_normed),columns = data_numeric)
The error I'm getting: ValueError: Index data must be 1-dimensional
I'm not entirely sure how to make the index data 1 dimensional. Any help would be greatly appreciated.
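The columns argument of pd.DataFrame needs a 1D list of names, but data_numeric is a whole DataFrame, which is most likely what triggers the error. With n_components=2 the transformed data has two columns, so it needs two names. A minimal sketch with random stand-in data (the names PC1 and PC2 and the reduced column list are assumptions for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical numeric data in place of churn_raw_data.csv.
rng = np.random.default_rng(0)
cols = ["Age", "Income", "Tenure", "MonthlyCharge"]
data_numeric = pd.DataFrame(rng.normal(size=(50, 4)), columns=cols)

# Standardize the numeric columns only.
data_normed = (data_numeric - data_numeric.mean()) / data_numeric.std()

pca = PCA(n_components=2)
pca.fit(data_normed)

# Two components -> two column names, given as a plain list.
data_pca = pd.DataFrame(
    pca.transform(data_normed),
    columns=["PC1", "PC2"],
    index=data_normed.index,
)

# The original variable names belong on the loadings instead:
# one row per component, one column per input variable.
loadings = pd.DataFrame(pca.components_, columns=data_numeric.columns, index=["PC1", "PC2"])
print(data_pca.head())
print(loadings)
```

Note that the sketch standardizes data_numeric rather than df_pcs. If Churn is a text column, including it in the normalization would probably cause a separate problem.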
I'm using the titanic data set to predict if a passenger survived or not using random forest. This is my code:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn import cross_validation
import matplotlib.pyplot as plt
%matplotlib inline
data=pd.read_csv("C:\\Users\\kabala\\Downloads\\Titanic.csv")
data.isnull().any()
data["Age"]=data1["Age"].fillna(data1["Age"].median())
data["PClass"]=data["PClass"].fillna("3rd")
data["PClass"].isnull().any()
data1.isnull().any()
pd.get_dummies(data.Sex)
# choosing the predictive variables
x=data[["PClass","Age","Sex"]]
# the target variable is y
y=data["Survived"]
modelrandom=RandomForestClassifier(max_depth=3)
modelrandom=cross_validation.cross_val_score(modelrandom,x,y,cv=5)
But, I keep on getting this error:
ValueError: could not convert string to float: 'female'
and I don't understand what the problem is, because I changed the Sex feature to a dummy variable.
Thanks:)
pd.get_dummies returns a new data frame and does not operate in place. Therefore you really are still sending a string in the Sex column.
So you would need something like X = pd.get_dummies(data[['Sex','PClass','Age']], columns=['Sex','PClass']), and this should fix your problem. I think PClass is also a string column that needs dummy variables, since you are filling it with '3rd'.
There are also a few places where you call data.isnull().any(), which does nothing to the underlying dataframe. I left those as they were, but just FYI they may not be doing what you intended.
Full code would be:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn import cross_validation
import matplotlib.pyplot as plt
%matplotlib inline
data=pd.read_csv("C:\\Users\\kabala\\Downloads\\Titanic.csv")
data.isnull().any()  # <----- Beware: this is not doing anything to the data
data["Age"]=data1["Age"].fillna(data1["Age"].median())
data["PClass"]=data["PClass"].fillna("3rd")
data["PClass"].isnull().any()  # <----- Beware: this is not doing anything to the data
data1.isnull().any()  # <----- Beware: this is not doing anything to the data
#********Fix for your code*******
X = pd.get_dummies(data[['Sex','PClass','Age']], columns=['Sex','PClass'])
# choosing the predictive variables
# x=data[["PClass","Age","Sex"]]
# the target variable is y
y=data["Survived"]
modelrandom=RandomForestClassifier(max_depth=3)
modelrandom=cross_validation.cross_val_score(modelrandom,X,y,cv=5)
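To see what get_dummies actually returns, here is a small self-contained sketch on a made-up frame (the rows are invented; only the column names follow the question). It shows that the encoded result has to be assigned to a variable, and that no string columns are left afterwards:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the Titanic data.
data = pd.DataFrame({
    "PClass": ["1st", "2nd", "3rd", None, "1st", "3rd"],
    "Age": [29.0, np.nan, 40.0, 22.0, 35.0, 18.0],
    "Sex": ["female", "male", "female", "male", "female", "male"],
    "Survived": [1, 0, 1, 0, 1, 0],
})

data["Age"] = data["Age"].fillna(data["Age"].median())
data["PClass"] = data["PClass"].fillna("3rd")

# Calling get_dummies without assigning the result leaves data unchanged.
pd.get_dummies(data.Sex)
print(data["Sex"].dtype)  # still a string (object) column

# Assign the encoded frame and use it as the feature matrix.
X = pd.get_dummies(data[["Sex", "PClass", "Age"]], columns=["Sex", "PClass"])
y = data["Survived"]
print(list(X.columns))
```

X can then be passed to cross_val_score in place of the raw frame. In recent scikit-learn versions that function is imported from sklearn.model_selection rather than sklearn.cross_validation.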
My Code:
### Working with NaN using sklearn
import numpy as np
from sklearn.preprocessing import Imputer
### Mean strategy
imp = Imputer(missing_values='NaN', strategy='mean', axis=1)
imp.fit([1,5,9,np.NaN])
X = [1,5,9,np.NaN]
y = imp.transform(X)
print (y)
After running I am getting below warning message:
C:\Users\Admin\Anaconda3\lib\site-packages\sklearn\utils\validation.py:386: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and willraise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample. DeprecationWarning)
How do I solve it? I tried reshape, but it gives an error message saying:
'list' object has no attribute 'reshape'
Please help.
So I ran your code and changed X to a 2D list. It turns out the warning was raised because you were passing a 1D array to transform, so I made it a 2D list:
import numpy as np
from sklearn.preprocessing import Imputer
### Mean strategy
imp = Imputer(missing_values='NaN', strategy='mean', axis=1)
imp.fit([[1,5,9,np.NaN]])  # fit needs 2D input as well
X = [[1,5,9,np.NaN]] # < =========== The change that I made
y = imp.transform(X)
print(y)
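On recent scikit-learn versions, Imputer is no longer available and SimpleImputer from sklearn.impute takes its place. It has no axis argument and always imputes per column. Here is a sketch of an equivalent under that assumption, treating the four values as one feature with four samples:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# A list has no .reshape, so convert to a NumPy array first.
# reshape(-1, 1) gives one column (a single feature) with four rows.
X = np.array([1, 5, 9, np.nan]).reshape(-1, 1)

imp = SimpleImputer(missing_values=np.nan, strategy="mean")
y = imp.fit_transform(X)

print(y.ravel())  # [1. 5. 9. 5.]
```

The missing value is replaced by the mean of 1, 5 and 9, which is 5. The sketch uses np.nan rather than np.NaN, since the capitalized alias was removed in NumPy 2.0.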