I am completing an assignment for my Data Cleaning class where I must perform a PCA. You shouldn't need the data frame to help with this particular issue. I am a Python noob and have spent the last 3 hours trying to figure this out on my own with no success. The data frame has been cleaned and anomalies are sorted.
Here is my code:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
from pylab import rcParams
from sklearn.decomposition import PCA
df_pcs = pd.read_csv('/Users/personal/Desktop/Data set/churn_raw_data.csv', usecols=['Churn','Age', 'Income', 'Outage_sec_perweek','Yearly_equip_failure', 'Tenure', 'MonthlyCharge'])
data_numeric = df_pcs[['Age', 'Income', 'Outage_sec_perweek','Yearly_equip_failure', 'Tenure', 'MonthlyCharge']]
data_normed = (df_pcs - df_pcs.mean()) / df_pcs.std()
pca = PCA(n_components=2)
pca.fit(data_normed)
data_pca = pd.DataFrame(pca.transform(data_normed),columns = data_numeric)
The error I'm getting: ValueError: Index data must be 1-dimensional
I'm not entirely sure how to make the index data 1 dimensional. Any help would be greatly appreciated.
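A likely cause, judging from the code above: columns=data_numeric passes a whole DataFrame where pandas expects a 1-D list of column names, which matches the "Index data must be 1-dimensional" message. The normalisation line also uses df_pcs, which still contains the non-numeric Churn column, rather than data_numeric. The following is a minimal sketch of one way around both, not a definitive fix. It uses random numbers in place of the CSV, and the names PC1, PC2 and loadings are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Stand-in for the numeric columns of the CSV (random values, for illustration only)
cols = ['Age', 'Income', 'Outage_sec_perweek',
        'Yearly_equip_failure', 'Tenure', 'MonthlyCharge']
rng = np.random.default_rng(0)
data_numeric = pd.DataFrame(rng.normal(size=(50, len(cols))), columns=cols)

# Normalise only the numeric columns (not the frame that still holds 'Churn')
data_normed = (data_numeric - data_numeric.mean()) / data_numeric.std()

pca = PCA(n_components=2)
# Two components give two output columns, so pass two 1-D names
data_pca = pd.DataFrame(pca.fit_transform(data_normed),
                        columns=['PC1', 'PC2'],
                        index=data_numeric.index)

# The original variable names belong on the loadings, one row per variable
loadings = pd.DataFrame(pca.components_.T, index=cols, columns=['PC1', 'PC2'])
```

The transformed data has one column per component, so the six original column names cannot label it; they label the rows of the loadings table instead.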
Let's say that I have two 1-D arrays with two different statistical distributions. Now, I want to match both distributions using one of them as the "target".
In the example, I "shifted" one of the distributions using MinMaxScaler() from scikit-learn to match it with the other one, but I am sure I can achieve an "automatic" and "better" match with some API or some code.
In the example I have both arrays in the same DataFrame (and both have the same length), but I'd be very pleased if somebody knows a way to achieve it using 2 different DataFrames and/or 2 arrays with different lengths.
Thank you!!
CODE
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
import plotly.figure_factory as ff
################## DATA ######################
np.random.seed(54)
crv = np.random.uniform(1,99,(1,100)).flatten()
np.random.seed(115)
crv_target = np.random.uniform(51,149,(1,100)).flatten()
# Create DataFrame
df = pd.DataFrame(data=[crv, crv_target]).T
df = df.rename(columns={0: "crv", 1: "crv_target"})
# Scaler
scale = MinMaxScaler(feature_range=(50,150))
df['crv_shifted'] = scale.fit_transform(X=df['crv'].values.reshape(-1, 1),y=df['crv_target'].values.reshape(-1, 1))
# Create distplot
data = [df['crv_shifted'],df['crv_target'],df['crv']]
labels = ['crv_shifted','crv_target','crv']
colors = ['#F8C471', '#22D2E6','#CD6155']
fig = ff.create_distplot(data, labels,show_hist=False,show_rug=False,colors=colors)
fig.show()
LINK TO PLOT
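One common approach to this kind of problem is quantile mapping (sometimes called histogram matching): replace each source value with the target value at the same empirical quantile. The sketch below is only an illustration of that idea, not necessarily what the asker's tooling offers. The function name match_distribution is hypothetical, and the target array deliberately has a different length from the source.

```python
import numpy as np

def match_distribution(source, target):
    """Map each source value to the target value at the same empirical quantile."""
    source = np.asarray(source, dtype=float)
    target = np.asarray(target, dtype=float)
    # Rank of every source value (0 = smallest), scaled to a quantile in [0, 1]
    ranks = np.argsort(np.argsort(source))
    quantiles = ranks / max(len(source) - 1, 1)
    # Read off the target distribution at those quantiles
    return np.quantile(target, quantiles)

rng = np.random.default_rng(54)
crv = rng.uniform(1, 99, 100)            # source, 100 samples
crv_target = rng.uniform(51, 149, 250)   # target, different length on purpose
crv_matched = match_distribution(crv, crv_target)
```

Because only ranks and quantiles are used, the two arrays do not need to live in the same DataFrame or have the same length, and the ordering of the source values is preserved.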
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from statsmodels.formula.api import ols
import statsmodels.api as sm
import scipy
import scipy.stats
import seaborn as sns
import numpy.random as npr
import math
from scipy.stats import norm
import sqlite3 as sql
import seaborn
from numba import jit, prange
df = pd.read_csv('ODI-2021.edited.csv')
df.info()
sr_targets = pd.Series(df['What is your stress level (0-100)?'])
sr_targets.describe()
df_features = df.drop('What is your stress level (0-100)?', axis=1)
print (df_features)
df_features.describe()
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
df.isnull().sum()
df_missing = df.dropna()
df_missing.shape
X = df["What is your stress level (0-100)?"]
y = df["Time you went to be Yesterday"]
est = sm.OLS(y, X.astype(float)).fit()
model = sm.OLS(y, X).fit()
predictions = model.predict(X)
-ValueError: could not convert string to float: 'over 9000'
model.summary()
-AttributeError: 'LinearRegression' object has no attribute 'summary'
from sklearn import preprocessing
def convert(df):
    number = preprocessing.LabelEncoder()
    data['Date'] = number.fit_transform(df['Date'])
    data=data.fillna(-999)
    return data
model = LinearRegression(fit_intercept=True)
result = model.fit(df_features, sr_targets)
-ValueError: could not convert string to float: '3/16/2021'
sr_coef = pd.Series(result.coef_, index=df_features.columns)
sr_coef
-NameError: name 'result' is not defined
sr_endog = sr_targets.copy()
df_exog = sm.add_constant(df_features)
model = sm.OLS(sr_endog, df_exog)
result = model.fit()
-ValueError: Pandas data cast to numpy dtype of object. Check input data with
np.asarray(data)
result.summary()
-NameError: name 'result' is not defined
cross validation
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold
nb_folds = 10
cv = KFold(n_splits=nb_folds)
model = LassoCV(fit_intercept=True, cv=cv, n_alphas=200, max_iter=2000)
result = model.fit(df_features_rescaled, sr_targets)
-NameError: name 'df_features_rescaled' is not defined
fig = plt.figure(figsize=[16,15])
xvalues = np.log10(result.alphas_)
rmse_path = np.sqrt(result.mse_path_)
for k in range (nb_folds):
    yvalues = rmse_path[:,k]
    plt.plot(xvalues, yvalues)
    pos_ymin = yvalues.argmin()
    plt.plot(xvalues[pos_ymin], yvalues[pos_ymin], marker='o')
plt.axvline(np.log10(result.alpha_))
plt.title('RMSE for different alpha', fontsize=20)
plt.grid()
-NameError: name 'result' is not defined
sr_coef = pd.Series(result.coef_, index=df_features.columns)
sr_coef
This is my code and these are the errors I'm getting. Could someone help me with what I'm doing wrong? I have looked up the errors and I have no clue how to fix them. My data set has numbers but also dates, yes/no answers, and education-level responses (such as university level), which I have no clue how to convert to float. I have been trying to run a regression with two columns that consist of numbers, and I get these errors. For the cross-validation I am dropping one column and using the rest, and I'm getting the error that I haven't defined the variable result, which I have. I'm clueless.
Thanks in advance!
Each of your errors means something. Learning to read the errors is extremely important in understanding what is going on. For example,
est = sm.OLS(y, X.astype(float)).fit()
model = sm.OLS(y, X).fit()
predictions = model.predict(X)
-ValueError: could not convert string to float: 'over 9000'
This suggests that someone placed the phrase "over 9000" in a cell of the CSV file you are opening, so Python cannot work out how to convert it to a float. The same thing would happen if you tried to run
float("over 9000")
The data needs to be cleaned up before statsmodels (your sm) can use it. Python is telling you the same thing here too:
-ValueError: could not convert string to float: '3/16/2021'
The string "3/16/2021" contains characters that cannot be part of a float, namely the "/" symbol.
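As a rough sketch of what that clean-up could look like, the example below builds a tiny frame with the same kinds of problem values and converts each column with pandas. The column names (stress, date, answer) and the choice of day of year as the numeric form of a date are assumptions made for illustration; they do not come from the actual data set.

```python
import pandas as pd

# Hypothetical miniature of the survey data, with the same kinds of bad cells
df = pd.DataFrame({
    "stress": ["10", "over 9000", "55", "70"],
    "date": ["3/16/2021", "3/17/2021", "not a date", "3/19/2021"],
    "answer": ["yes", "no", "yes", "no"],
})

# Text that is not a number becomes NaN instead of raising ValueError
stress = pd.to_numeric(df["stress"], errors="coerce")

# Parse dates explicitly; unparseable cells become NaT
dates = pd.to_datetime(df["date"], format="%m/%d/%Y", errors="coerce")
day_of_year = dates.dt.dayofyear.rename("day_of_year")

# Turn the categorical answers into indicator columns
dummies = pd.get_dummies(df["answer"], prefix="answer")

# Keep only the rows where every conversion succeeded
clean = pd.concat([stress, day_of_year, dummies], axis=1).dropna()
```

Whether dropping the bad rows is acceptable, or whether they should be imputed instead, depends on the analysis; the sketch only shows how to get every column into a numeric form that a regression can accept.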
I think it would be helpful if you broke up your errors and concerns into separate questions, then people could tackle them one at a time for you.
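The NameError messages can be read the same way. If the line that assigns result raises an exception, the assignment never happens, so any later line that uses result fails too. The snippet below is a small illustration of that chain; failing_fit is a hypothetical stand-in for the model.fit call that raised the ValueError.

```python
def failing_fit():
    # Stand-in for a fit call that fails on uncleaned data
    raise ValueError("could not convert string to float: '3/16/2021'")

try:
    result = failing_fit()   # raises, so 'result' is never assigned
except ValueError as exc:
    first_error = str(exc)

try:
    result                   # the name was never bound
except NameError as exc:
    second_error = str(exc)
```

In a notebook the first error stops only the cell it occurs in, which is why the following cells then complain about an undefined name. Fixing the first error usually makes the later ones disappear.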
I'm using the titanic data set to predict if a passenger survived or not using random forest. This is my code:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn import cross_validation
import matplotlib.pyplot as plt
%matplotlib inline
data=pd.read_csv("C:\\Users\\kabala\\Downloads\\Titanic.csv")
data.isnull().any()
data["Age"]=data1["Age"].fillna(data1["Age"].median())
data["PClass"]=data["PClass"].fillna("3rd")
data["PClass"].isnull().any()
data1.isnull().any()
pd.get_dummies(data.Sex)
# choosing the predictive variables
x=data[["PClass","Age","Sex"]]
# the target variable is y
y=data["Survived"]
modelrandom=RandomForestClassifier(max_depth=3)
modelrandom=cross_validation.cross_val_score(modelrandom,x,y,cv=5)
But, I keep on getting this error:
ValueError: could not convert string to float: 'female'
and I don't understand what is the problem because I changed the Sex feature to a dummy
Thanks:)
pd.get_dummies returns a new data frame and does not do the operation in place. Therefore you really are still sending a string in the Sex column.
So you would need something like X = pd.get_dummies(data[['Sex','PClass','Age']], columns=['Sex','PClass']), and this should fix your problem. I think PClass is also a string column that needs dummy variables, since you are filling it with '3rd'.
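To make the "not in place" point concrete, here is a small self-contained sketch on made-up rows (the values are invented for illustration and are not from the Titanic file):

```python
import pandas as pd

# Made-up rows with the same three columns as in the question
data = pd.DataFrame({
    "Sex": ["female", "male", "female"],
    "PClass": ["1st", "3rd", "2nd"],
    "Age": [29.0, 40.0, 35.0],
})

# The result is computed and then thrown away; 'data' is untouched
pd.get_dummies(data.Sex)
sex_still_present = "Sex" in data.columns

# Keeping the returned frame gives indicator columns in place of the strings
X = pd.get_dummies(data[["Sex", "PClass", "Age"]], columns=["Sex", "PClass"])
encoded_columns = list(X.columns)
```

After the last line, X holds Age plus one indicator column per category, and every column can be converted to float, which is what the classifier needs.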
There are still some places where you call data.isnull().any(), which does nothing to the underlying dataframe. I left those as they were, but just FYI they may not be doing what you intended.
Full code would be:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
# sklearn.cross_validation was removed in scikit-learn 0.20; cross_val_score now lives in model_selection
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt
%matplotlib inline
data=pd.read_csv("C:\\Users\\kabala\\Downloads\\Titanic.csv")
data.isnull().any()  # Beware: this only reports missing values, it does not change the data
# 'data1' is never defined in the question; assuming it was meant to be 'data'
data["Age"]=data["Age"].fillna(data["Age"].median())
data["PClass"]=data["PClass"].fillna("3rd")
data["PClass"].isnull().any()  # Beware: this only reports missing values, it does not change the data
#********Fix for your code*******
# get_dummies returns a new frame, so keep the result in X
X = pd.get_dummies(data[['Sex','PClass','Age']], columns=['Sex','PClass'])
# choosing the predictive variables (original selection, which still contains strings)
# x=data[["PClass","Age","Sex"]]
# the target variable is y
y=data["Survived"]
modelrandom=RandomForestClassifier(max_depth=3)
# pass the encoded X (capital X) and keep the scores separate from the model
scores=cross_val_score(modelrandom,X,y,cv=5)
I began a python course in linear and logistic regression but I am encountering what is probably a stupid error. I have to work with this data frame:
http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv
And this is my code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
rwq = pd.read_csv('*filepath*/winequality-red.csv')
rows = len(rwq.index)
cols = rwq.shape[1]
When I print rows and cols, rows correctly prints 1599 but for some reason cols always equals 1 (when in fact there are 12 columns).
I also tried 'len(rwq.columns)' and I still get 1.
Am I doing something wrong or is the problem with the file provided?
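One plausible explanation is the delimiter: the UCI wine-quality files separate fields with semicolons rather than commas, so a default pd.read_csv call reads each row as a single column. The sketch below reproduces the effect on a tiny made-up sample held in memory (three columns only, values invented), so it runs without the real file.

```python
import io
import pandas as pd

# Tiny semicolon-delimited sample, made up to mimic the file's layout
sample = (
    "fixed acidity;volatile acidity;quality\n"
    "7.4;0.7;5\n"
    "7.8;0.88;5\n"
)

# Default separator is a comma, so the whole line lands in one column
default_read = pd.read_csv(io.StringIO(sample))

# Telling pandas about the semicolon splits the fields correctly
semicolon_read = pd.read_csv(io.StringIO(sample), sep=";")

default_cols = default_read.shape[1]
semicolon_cols = semicolon_read.shape[1]
```

If this is indeed the cause, adding sep=';' to the original read_csv call should be enough to get all 12 columns.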
I am trying to fit a linear regression but am getting this error:
ValueError: cannot reshape array of size 2246 into shape (2,2246)
and
C:\Users\Brian\Anaconda3\lib\site-packages\ipykernel_launcher.py:2: FutureWarning: reshape is deprecated and will raise in a subsequent release. Please use .values.reshape(...) instead
This is my code.
import pandas as pd
import matplotlib.pyplot as plt
% matplotlib inline
df = pd.read_csv(r'C:\Users\Brian\Desktop\GOOGTICKER.CSV')
df
times = pd.DatetimeIndex(df['Date'])
grouped= df.groupby([times.year]).mean()
from sklearn import linear_model
x_val= times
y_val= df['GOOGL']
body_reg =linear_model.LinearRegression()
body_reg.fit(x_val, y_val)
I have imported numpy as py and have tried reshaping, but I still get an error. Any advice would be greatly appreciated. Thank you for your time.
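For context, scikit-learn's LinearRegression.fit expects X as a 2-D numeric array of shape (n_samples, n_features), while a DatetimeIndex is 1-D and not numeric. A minimal sketch of one way to satisfy both requirements follows. It uses a made-up price series instead of GOOGTICKER.CSV, and counting days since the first date is just one possible numeric encoding, chosen here as an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up stand-in for the CSV: ten daily prices rising by 1 per day
df = pd.DataFrame({
    "Date": pd.date_range("2020-01-01", periods=10, freq="D"),
    "GOOGL": np.linspace(100.0, 109.0, 10),
})

times = pd.DatetimeIndex(df["Date"])

# Encode dates as days since the first date, then make it a column vector:
# reshape(-1, 1) means "as many rows as needed, exactly one feature column"
x_val = np.asarray((times - times[0]).days).reshape(-1, 1)
y_val = df["GOOGL"].to_numpy()

body_reg = LinearRegression()
body_reg.fit(x_val, y_val)
slope_per_day = body_reg.coef_[0]
```

The reshape error in the question is consistent with asking for a shape whose total size differs from the array's: 2246 values fit into (2246, 1) or (-1, 1), but not into (2, 2246).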