I would like to identify arrays with a band-like structure (first image) as compared to the more homogeneous structure shown in the second image.
I have so far used some skewness and RMS techniques to test for this, but they don't work well if the bands are evenly spaced. Are there any more refined ways of identifying such arrays in Python?
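One possible refinement (a sketch, not from the original post) is to look at the frequency content of the row means: evenly spaced bands show up as a strong peak at a nonzero spatial frequency, which skewness and RMS can miss. The band_score helper and the test arrays below are hypothetical and assume the bands run along one axis:
import numpy as np

def band_score(arr):
    # Collapse each row to its mean, remove the DC component, and measure how much
    # of the remaining spectral power sits in the single strongest frequency.
    row_means = arr.mean(axis=1)
    spectrum = np.abs(np.fft.rfft(row_means - row_means.mean()))
    return spectrum[1:].max() / (spectrum[1:].sum() + 1e-12)

# Hypothetical examples: a banded array scores close to 1, a homogeneous one much lower
banded = np.tile(np.sin(np.linspace(0, 8 * np.pi, 100))[:, None], (1, 100))
homogeneous = np.random.rand(100, 100)
print(band_score(banded), band_score(homogeneous))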
Try sns.pairplot.
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# The dataset: wages
# We fetch the data from OpenML. Note that setting the parameter as_frame to True will retrieve the data as a pandas dataframe.
from sklearn.datasets import fetch_openml
survey = fetch_openml(data_id=534, as_frame=True)
# Then, we identify features X and targets y: the column WAGE is our target variable (i.e., the variable which we want to predict).
X = survey.data[survey.feature_names]
X.describe(include="all")
y = survey.target.values.ravel()
survey.target.head()
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
train_dataset = X_train.copy()
train_dataset.insert(0, "WAGE", y_train)
sns.pairplot(train_dataset, kind='reg', diag_kind='kde')
Here is the problem
Extract just the median_income column from the independent variables (from X_train and X_test).
Perform Linear Regression to predict housing values based on median_income.
Predict output for test dataset using the fitted model.
Plot the fitted model for training data as well as for test data to check if the fitted model satisfies the test data.
I did a linear regression earlier. Following is the code:
import pandas as pd
import os
os.getcwd()
os.chdir('/Users/saurabhsaha/Documents/PGP-AI:ML-Purdue/New/datasets')
df=pd.read_excel('California_housing.xlsx')
df.total_bedrooms=df.total_bedrooms.fillna(df.total_bedrooms.mean())
x = df.iloc[:,2:8]
y = df.median_house_value
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=.20)
from sklearn.linear_model import LinearRegression
california_model = LinearRegression().fit(x_train,y_train)
california_model.predict(x_test)
Predicted_values = pd.DataFrame(california_model.predict(x_test), columns=['Pred'])
Predicted_values
Final = pd.concat([x_test.reset_index(drop=True), y_test.reset_index(drop=True), Predicted_values], axis=1)
Final['Err_pct'] = abs(Final.median_house_value - Final.Pred) / Final.median_house_value
Here is my dataset- https://docs.google.com/spreadsheets/d/1vYngxWw7tqX8FpwkWB5G7Q9axhe9ipTu/edit?usp=sharing&ouid=114925088866643320785&rtpof=true&sd=true
Following is my code.
x1_train=x_train.median_income
x1_train
x1_train.shape
x1_test=x_test.median_income
x1_test
type(x1_test)
x1_test.shape
from sklearn.linear_model import LinearRegression
california_model_new = LinearRegression().fit(x1_train, y1_train)
I get an error right here, and when I try reshaping my 1-D data into 2-D arrays as follows, I cannot:
import numpy as np
x1_train = x1_train.reshape(-1, 1)
x1_test = x1_test.reshape(-1, 1)
This is the error I get
AttributeError: 'Series' object has no attribute 'reshape'
I am new to data science, so if you can explain a bit it would be really helpful.
x1_train and x1_test are pandas Series objects, whereas the reshape() method applies to NumPy arrays.
Do this instead:
x1_train = x1_train.to_numpy().reshape(-1, 1)
x1_test = x1_test.to_numpy().reshape(-1, 1)
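With both columns reshaped to (n_samples, 1), the remaining steps of the exercise could look roughly like this; this is a sketch reusing the variable names from the snippets above, not the asker's actual code:
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Fit on the single feature (scikit-learn needs a 2-D X)
california_model_new = LinearRegression().fit(x1_train, y_train)

# Predict for the test set
y_pred = california_model_new.predict(x1_test)

# Plot the fitted line over the training data and over the test data
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
ax1.scatter(x1_train, y_train, s=5, alpha=0.5)
ax1.plot(x1_train[:, 0], california_model_new.predict(x1_train), color='red')
ax1.set_title('Training data')
ax2.scatter(x1_test, y_test, s=5, alpha=0.5)
ax2.plot(x1_test[:, 0], y_pred, color='red')
ax2.set_title('Test data')
plt.show()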
I am using this dataset:
https://filebin.net/wr2jy0ass7rsl0vt
There are three columns: "Date", "Temperature", "Anomaly". I use "Date" to predict "Temperature". The code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
data_df = pd.read_csv("ave_yearly_temp_nyc_1895-2017.csv")
data_df.columns= ["Date","Temperature","Anomaly"]
data_df["Date"] = data_df["Date"]//100
regressor = LinearRegression()
X_train,X_test, y_train,y_test = train_test_split(data_df.iloc[:,0],data_df.iloc[:,1],test_size=0.2, random_state=0)
regressor.fit(X_train,y_train) #training the algorithm
The data_df preview and the full traceback were posted as screenshots; the error says a 2-D array was expected but a 1-D array was passed, and suggests reshaping the data.
How to fix it?
It needs a 2-D array; with iloc[:,0] you are getting a 1-D array.
Instead, select the column with double brackets so it stays a DataFrame (and therefore 2-D).
Try using:
X_train, X_test, y_train, y_test = train_test_split(data_df[['Date']], data_df['Temperature'], test_size=0.2, random_state=0)
Try to do what the error message tells you. The estimator expects X to be 2-D, i.e. of shape (n_samples, n_features), even when there is only one feature. Hence you'll need to transform it like this:
X_train, X_test, y_train, y_test = train_test_split(np.array(data_df.iloc[:,0]).reshape(-1, 1),data_df.iloc[:,1],test_size=0.2, random_state=0)
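Either way, once X is two-dimensional the fit goes through. A minimal end-to-end sketch (assuming the imports and data_df from the question) might look like:
# Keep 'Date' as a one-column DataFrame so X stays 2-D
X = data_df[['Date']]
y = data_df['Temperature']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
# Plot actual vs fitted temperatures for the test years
plt.scatter(X_test['Date'], y_test, label='actual')
plt.plot(X_test['Date'], y_pred, color='red', label='fitted')
plt.xlabel('Year')
plt.ylabel('Temperature')
plt.legend()
plt.show()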
from sklearn import datasets
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Perceptron
data = pd.read_csv('student_selection.csv')
x = data[['Average','Pass','Division','Domicile']]
y = data[['Selected']]
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=1, random_state=0)
ppn = Perceptron(eta0=1.0, fit_intercept=True, max_iter=1000, n_iter_no_change=5, random_state=0)
ppn.fit(x_train, y_train)
y_pred = ppn.predict(x_train)
x_train['Predicted'] = pd.Series(y_pred)
How can I see actual vs. predicted as a table, along with a plot? x_train now holds the predicted values, but I am unable to merge them with the actual data to see the deviation.
How can I see actual vs. predicted as a table, along with a plot?
Just run:
y_predict = ppn.predict(x)
data['y_predict'] = y_predict
to add the column to your dataframe. If you want to plot it, you can use:
import matplotlib.pyplot as plt
plt.scatter(data['Selected'], data['y_predict'])
plt.show()
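If you want the comparison for the training split specifically, note that x_train keeps the shuffled index from train_test_split, which is what breaks the pd.Series(y_pred) assignment in the question. A sketch (reusing the names from the question) that aligns by position instead:
import matplotlib.pyplot as plt

# Predictions for the training split
y_pred = ppn.predict(x_train)

# Reset the shuffled index so predictions line up by position
comparison = x_train.reset_index(drop=True).copy()
comparison['Actual'] = y_train.values.ravel()
comparison['Predicted'] = y_pred
print(comparison.head())

# Quick visual check of actual vs predicted classes
plt.scatter(comparison['Actual'], comparison['Predicted'])
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.show()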
I am trying to imitate some code I found on Kaggle on plotting SVM decision boundaries. I am using my own dataset with 608 samples and 10 features, with 2 classes; the 2 classes indicate, for instance, whether you're diabetic or not. I copied the SVM part of the code from this link (you can find it when you scroll way down to the bottom), where decision boundary visualisation is discussed. Here's the link to my reference.
However, I get this error saying that "X must be a NumPy array". Can someone explain to me what this means?
The code below is what I've done. Take note that my dataset has been normalised beforehand. Also, I'm splitting the data into a 70:30 ratio.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import matplotlib.pyplot as show
import matplotlib as cm
import matplotlib.colors as colors
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn import svm
from mlxtend.plotting import plot_decision_regions
autism = pd.read_csv('diabetec.csv')
x = autism.drop(['TARGET'], axis = 1)
y = autism['TARGET']
x_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.30, random_state=1)
t = np.array(y_train)
t = t.astype(int)
clf_svm = SVC(C=1.3, gamma=0.8, kernel='rbf')
clf_svm.fit(x_train, t)
plt.figure(figsize=[15,10])
plot_decision_regions(x_train, t, clf = clf_svm, hide_spines = False, colors = 'purple,limegreen', markers = ['x','o'])
plt.title('Support Vector Machine')
plot_decision_regions expects a NumPy array, but x_train is a pandas DataFrame. Try with x_train.values, i.e.
plot_decision_regions(x_train.values, t, clf = clf_svm, ...
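One more caveat: with 10 features the decision surface cannot be drawn directly in 2-D. mlxtend lets you pick two features via feature_index and pin the remaining ones with filler_feature_values / filler_feature_ranges. A rough sketch follows; the choice of features 0 and 1 and the mean / 3-sigma fillers are placeholders, not from the original post:
import numpy as np
from mlxtend.plotting import plot_decision_regions

X_np = x_train.values                    # plot_decision_regions needs NumPy arrays
t = y_train.values.astype(int)

# Hold every feature except 0 and 1 fixed at its training mean, and only scatter
# points whose filler features fall within 3 standard deviations of that mean
filler_values = {i: X_np[:, i].mean() for i in range(2, X_np.shape[1])}
filler_ranges = {i: 3 * X_np[:, i].std() for i in range(2, X_np.shape[1])}

plot_decision_regions(X_np, t, clf=clf_svm,
                      feature_index=(0, 1),
                      filler_feature_values=filler_values,
                      filler_feature_ranges=filler_ranges,
                      hide_spines=False)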
I'm a big fan of mlxtend's plot_decision_regions function (http://rasbt.github.io/mlxtend/#examples, https://stackoverflow.com/a/43298736/1870832).
It accepts an X (just two columns at a time), y, and a (fitted) classifier clf object, and then provides a pretty awesome visualization of the relationship between model predictions, true y-values, and a pair of independent variables.
A couple of restrictions:
X and y have to be numpy arrays, and clf needs to have a predict() method. Fair enough. My problem is that, in my case, the classifier clf object I would like to visualize has already been fitted on a Pandas DataFrame...
import numpy as np
import pandas as pd
import xgboost as xgb
import matplotlib
matplotlib.use('Agg')
from mlxtend.plotting import plot_decision_regions
import matplotlib.pyplot as plt
# Create arbitrary dataset for example
df = pd.DataFrame({'Planned_End': np.random.uniform(low=-5, high=5, size=50),
                   'Actual_End': np.random.uniform(low=-1, high=1, size=50),
                   'Late': np.random.randint(low=0, high=3, size=50)}  # high is exclusive: labels 0, 1, 2
                  )
# Fit a Classifier to the data
# This classifier is fit on the data as a Pandas DataFrame
X = df[['Planned_End', 'Actual_End']]
y = df['Late']
clf = xgb.XGBClassifier()
clf.fit(X, y)
So now when I try to use plot_decision_regions passing X/y as numpy arrays...
# Plot Decision Region using mlxtend's awesome plotting function
plot_decision_regions(X=X.values,
y=y.values,
clf=clf,
legend=2)
I (understandably) get an error that the model can't find the column names of the dataset it was trained on
ValueError: feature_names mismatch: ['Planned_End', 'Actual_End'] ['f0', 'f1']
expected Planned_End, Actual_End in input data
training data did not have the following fields: f1, f0
In my actual case, it would be a big deal to move our training pipeline off Pandas DataFrames. Is there a way to still produce decision_regions plots for a classifier trained on a Pandas DataFrame?
Try to change:
X = df[['Planned_End', 'Actual_End']].values
y = df['Late'].values
and proceed to:
clf = xgb.XGBClassifier()
clf.fit(X, y)
plot_decision_regions(X=X,
y=y,
clf=clf,
legend=2)
OR fit & plot using X.values and y.values
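If retraining on NumPy arrays is really not an option, another workaround is a thin wrapper whose predict() rebuilds a DataFrame with the original column names before delegating to the fitted model; plot_decision_regions only needs a working predict(). The wrapper class below is a sketch, not part of mlxtend or xgboost:
import pandas as pd

class DataFrameWrapper:
    # Make a DataFrame-trained classifier accept the plain NumPy grids
    # that plot_decision_regions generates internally.
    def __init__(self, model, columns):
        self.model = model
        self.columns = columns

    def predict(self, X):
        return self.model.predict(pd.DataFrame(X, columns=self.columns))

wrapped = DataFrameWrapper(clf, X.columns)
plot_decision_regions(X=X.values, y=y.values, clf=wrapped, legend=2)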