Binning Data Using Numpy To Simplify Linear Regression - python

I have training data in the form of images taken by a PiCamera mounted on a Raspberry Pi RC car while I drive it between two lane lines.
Each image is labelled with the left and right motor controls. I've plotted them in the graph below.
I am using ConvNets to do the regression, with Keras and TensorFlow as the backend.
It's clearly visible that the regression would be much simpler if I could remove the training samples that lie to the left of the visible regression trend.
The code for loading the images and labels is very simple and is below:
import glob

filenames = glob.glob("../data/*.jpg")
labels = []
images = []
for filename in filenames:
    # Filename pattern: Timestamp-LeftMotorControl-RightMotorControl.jpg
    filename = filename.replace('.jpg', '')
    parts = filename.split('-')
    if float(parts[1]) == 0. or float(parts[2]) == 0.:
        continue  # skip samples where either motor control is zero
    images.append(filename)
    labels.append([float(parts[1]), float(parts[2])])
Firstly, is there a good approach for removing training samples that fall outside the visible regression trend?
Also, I have a different approach in mind: create 100 bins with edges from 0 to 1, then take 50 samples from each bin so that my data is balanced.
Is there a numpy way to put the data into bins, so that I don't need to write a custom function for it?

Answering the first question:
1. Do the linear regression fit on the whole dataset.
2. Remove the points with the largest residuals.
3. Repeat steps 1 and 2 until all residuals are comfortably small.
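A minimal numpy sketch of this fit-and-trim loop might look like the following (assuming left and right are arrays built from the labels list loaded above; the residual threshold and the 5% trim fraction are arbitrary values you would tune):
import numpy as np
labels_arr = np.asarray(labels)            # shape (n_samples, 2): [left, right]
left, right = labels_arr[:, 0], labels_arr[:, 1]
keep = np.ones(len(left), dtype=bool)      # mask of samples we keep
for _ in range(5):                         # a few fit/trim rounds
    slope, intercept = np.polyfit(left[keep], right[keep], deg=1)
    residuals = np.abs(right[keep] - (slope * left[keep] + intercept))
    if residuals.max() < 0.05:             # "comfortably small" threshold (tune this)
        break
    cutoff = np.quantile(residuals, 0.95)  # drop the worst 5% of the kept points
    kept_idx = np.flatnonzero(keep)
    keep[kept_idx[residuals > cutoff]] = False
clean_images = [img for img, k in zip(images, keep) if k]
clean_labels = labels_arr[keep]
For the second (binning) question, numpy already has this built in: for example, np.digitize(left, np.linspace(0.0, 1.0, 101)) assigns each sample to one of 100 equal-width bins between 0 and 1, from which you can then draw 50 samples per bin.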

Related

T-SNE for better data visualization

My dataset shape is (248857, 11).
This is how it looks before StandardScaler. I applied scaling because clustering algorithms such as K-Means need feature scaling before the data is fed to them.
After scaling:
I performed K-Means with three clusters and I am trying to find a way to show these clusters.
I found t-SNE as a solution, but I am stuck.
This is how I implemented it:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE

# save the cluster labels into a variable l
l = df_scale['clusters']
d = df_scale.drop("clusters", axis=1)
standardized_data = StandardScaler().fit_transform(d)
# t-SNE: picking the top 100000 points
data_points = standardized_data[0:100000, :]
labels_80 = l[0:100000]
model = TSNE(n_components=2, random_state=0)
tsne_data = model.fit_transform(data_points)
# creating a new data frame which helps us in plotting the result data
tsne_data = np.vstack((tsne_data.T, labels_80)).T
tsne_df = pd.DataFrame(data=tsne_data,
                       columns=("Dimension1", "Dimension2", "Clusters"))
# plotting the result of t-SNE
sns.FacetGrid(tsne_df, hue="Clusters", height=6).map(
    plt.scatter, 'Dimension1', 'Dimension2').add_legend()
plt.show()
As you can see, the result is not that good. How can I visualize this better?
It seems you need to tune the perplexity hyperparameter, which is:
"a tunable parameter that says (loosely) how to balance attention between local and global aspects of your data. The parameter is, in a sense, a guess about the number of close neighbors each point has. The perplexity value has a complex effect on the resulting pictures."
Read more about it in this post and, more specifically, here.
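A quick way to explore this is to fit t-SNE for a few perplexity values and compare the embeddings side by side. This is only a sketch, reusing the standardized_data and labels_80 variables from the snippet above and taking a small subsample, since t-SNE is slow on 100,000 points:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
sample = standardized_data[:5000]
sample_labels = np.asarray(labels_80)[:5000]
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
for ax, perp in zip(axes, [5, 30, 100]):   # try a range of perplexity values
    emb = TSNE(n_components=2, perplexity=perp, random_state=0).fit_transform(sample)
    ax.scatter(emb[:, 0], emb[:, 1], c=sample_labels, s=3, cmap='viridis')
    ax.set_title('perplexity = %d' % perp)
plt.show()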

Logistic Regression vs predicting probability by splitting data into bins

So I am exploring using a logistic regression model to predict the probability of a shot resulting in a goal. I have two predictors, but for simplicity let's assume I have one predictor: distance from the goal. When doing some data exploration I decided to investigate the relationship between distance and the result of a shot. I did this graphically by splitting the data into equal-size bins and then taking the mean of all the results (0 for a miss and 1 for a goal) within each bin. Then I plotted the average distance from goal for each bin vs the probability of scoring. I did this in Python:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# use the seaborn library to inspect the distribution of the shots by result (goal or no goal)
fig, axes = plt.subplots(1, 2, figsize=(11, 5))
# first we want to create bins to calculate our probability;
# pandas has a function qcut that evenly distributes the data
# into n bins based on a desired column value
df['Goal'] = df['Goal'].astype(int)
df['Distance_Bins'] = pd.qcut(df['Distance'], q=50)
# now we want to find the mean of the Goal column (our probability) for each bin
# and the mean of the distance for each bin
dist_prob = df.groupby('Distance_Bins', as_index=False)['Goal'].mean()['Goal']
dist_mean = df.groupby('Distance_Bins', as_index=False)['Distance'].mean()['Distance']
dist_trend = sns.scatterplot(x=dist_mean, y=dist_prob, ax=axes[0])
dist_trend.set(xlabel="Avg. Distance of Bin",
               ylabel="Probability of Goal",
               title="Probability of Scoring Based on Distance")
[Plot: Probability of Scoring Based on Distance]
So my question is: why would we go through the process of creating a logistic regression model when I could fit a curve to the plot in the image? Would that not provide a function that predicts the probability of a goal for a shot at distance x?
I guess the problem would be that we are reducing, say, 40,000 data points into 50, but I'm not entirely sure why this would be a problem for predicting future shots. Could we increase the number of bins, or would that just add variability? Is this a case of the bias-variance trade-off? I'm just a little confused about why this would not be as good as a logistic model.
The binning method is a bit more finicky than logistic regression, since you need to try different functional forms for the fitted curve (e.g. inverse, log, square, etc.), while for logistic regression you only need to adjust the learning rate to see results.
If you are using one feature (your "Distance" predictor), I wouldn't see much difference between the binning method and the logistic regression. However, when you are using two or more features (I see "Distance" and "Angle" in the image you provided), how would you plan to combine the probabilities from each to make a final 0/1 classification? It can be tricky. For one, perhaps "Distance" is a more useful predictor than "Angle". Logistic regression handles that for you, because it learns a weight for each feature.
Regarding your binning method, if you use fewer bins you might see more bias since the data may be more complicated than you think, but this is not that likely because your data looks quite simple at first glance. However, if you use more bins that would not significantly increase variance, assuming that you fit the curve without varying the order of the curve. If you change the order of the curve you fit, then yes, it will increase variance. However, your data seems like it is amenable to a very simple fit if you go with this method.
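As a concrete illustration of the comparison, here is a sketch (assuming a DataFrame df with the 'Distance' and 'Goal' columns from the question) that fits the logistic model directly on the raw shots and overlays its curve on the binned averages; no bin count or curve family has to be chosen:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
X = df[['Distance']].values
y = df['Goal'].values
logit = LogisticRegression().fit(X, y)
grid = np.linspace(X.min(), X.max(), 200).reshape(-1, 1)
plt.scatter(dist_mean, dist_prob, label='binned averages')            # from the code above
plt.plot(grid, logit.predict_proba(grid)[:, 1], label='logistic regression')
plt.xlabel('Distance')
plt.ylabel('Probability of Goal')
plt.legend()
plt.show()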

Techniques for analyzing clusters after performing k-means clustering on dataset

I was recently introduced to clustering techniques because I was given the task of finding "profiles" or "patterns" of professors at my university based on a survey they had to answer. I've been studying some of the available options to perform this and I came across the k-means clustering algorithm. Since most of my data is categorical, I had to perform one-hot encoding (transforming each categorical variable into 0-1 single-column vectors), and right after that I did a correlation analysis in Excel in order to exclude some redundant variables. After this I used Python with the pandas, numpy, matplotlib and sklearn libraries to perform an optimal-cluster-number check (elbow method) and then, finally, run k-means.
This is the code I used to import the .csv with the data from the professors survey and to run the elbow method:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# loads the .csv dataframe (DF)
df = pd.read_csv('./dados_selecionados.csv', sep=",")
# prints the df
print(df)
# list for the sum of squared distances
SQD = []
# cluster number for testing in elbow method
num_clusters = 10
# runs k-means for each cluster number
for k in range(1, num_clusters + 1):
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(df)
    SQD.append(kmeans.inertia_)
# sets up the plot and shows it
plt.figure(figsize=(16, 8))
plt.plot(range(1, num_clusters + 1), SQD, 'bx-')
plt.xlabel('Number of clusters')
plt.ylabel('Sum of squared distances from each point to its cluster center')
plt.title('Elbow method')
plt.show()
According to the figure, I decided to go with 3 clusters. After that I ran k-means with 3 clusters and sent the cluster data to a .xlsx file with the following code:
import numpy as np

# runs k-means
kmeans = KMeans(n_clusters=3, max_iter=100, verbose=2)
# fits the model and returns the cluster label of each row
clusters = kmeans.fit_predict(df)
# prints the cluster assignments
print(clusters)
# adds the cluster information as a column in the df
df['cluster'] = clusters
# saves the df as a .xlsx
df.to_excel("3_clusters_k_means_selecionado.xlsx")
# shows the resulting df
print(df)
# shows the first rows of each separate cluster
for c in np.unique(clusters):
    print(df[df['cluster'] == c].head(10))
My main doubt right now is how to perform a reasonable analysis of each cluster's data to understand how the clusters were formed. I began by taking the mean of each variable and using conditional formatting in Excel to see if some patterns would show up, and they kind of did, but I think this is not the best option.
I'm also going to use this post to ask for any recommendations on the whole method. Maybe some of the steps I took were not the best.
If you're using scikit-learn's KMeans, there is a parameter called n_init, which is the number of times the k-means algorithm is run with different centroid seeds. By default it is set to 10, so essentially it does 10 different runs and outputs the single result with the lowest sum of squares. Another parameter you could play with is random_state, which is a seed number used to initialize the centroids randomly. This gives you better reproducibility because you choose the seed number, so if you see an optimal result you know which seed corresponds to that result.
You may also want to consider testing several different clustering algorithms. Here is a list of some of the popular ones:
https://scikit-learn.org/stable/modules/clustering.html
I think there are over 100 different clustering algorithms out there now.
Also, some clustering algorithms will automatically select the optimal number of clusters for you, so you don't have to 'guess'. I say guess because the silhouette and elbow techniques help you quantify the choice of K, but you yourself still need to do some kind of guess-work.
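If you want something a bit less subjective than eyeballing the elbow, the silhouette score gives a single number per candidate K. A minimal sketch, reusing the df from the question and the n_init / random_state parameters mentioned above:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
for k in range(2, 11):                     # silhouette needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(df)
    print(k, silhouette_score(df, labels))
Higher scores indicate better-separated clusters, which you can weigh against the elbow plot when picking K.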

How can I initialize K means clustering on a data matrix with 569 rows (samples), and 30 columns (features)?

I'm having trouble understanding how to begin my solution. I have a matrix with 569 rows, each representing a single sample of my data, and 30 columns representing the features of each sample. My intuition is to plot each individual row, and see what the clusters (if any) look like, but I can't figure out how to do more than 2 rows on a single scatter plot.
I've spent several hours looking through tutorials, but have not been able to understand how to apply them to my data. I know a scatter plot takes 2 vectors as parameters, so how could I possibly plot all 569 samples to cluster them? Am I missing something fundamental here?
import matplotlib.pyplot as plt

# our_data is a 2-dimensional matrix of size 569 x 30
# this plots the 30 feature values of the first sample against those of the second sample
plt.scatter(our_data[0, :], our_data[1, :], s=40)
My goal is to start k means clustering on the 569 samples.
Since you have a 30-dimensional feature space, it is difficult to plot such data in 2D (i.e. on a canvas). In such cases, dimensionality reduction techniques are usually applied first; this can help you understand the structure of the data. You can try applying, for example, PCA (principal component analysis) first:
# your_matrix.shape = (569, 30)
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
projected_data = pca.fit_transform(your_matrix)
# the 2D projection can be very helpful for understanding the data structure
plt.scatter(projected_data[:, 0], projected_data[:, 1])
plt.show()
You can also look at other (including non-linear) dimensionality reduction techniques, such as t-SNE.
From there you can apply k-means (or something else) to the original data, or apply k-means to the projected data.
If by initialize you mean picking the k initial cluster centers, one of the common ways of doing so is k-means++, described here, which was developed to avoid poor clusterings.
It essentially entails semi-randomly choosing centers based upon a probability distribution of distances away from a first center that is chosen completely at random.
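Putting the two answers together, here is a minimal sketch (assuming your_matrix is the 569 x 30 array and that you want, say, 3 clusters, which is an arbitrary choice for illustration) that runs k-means with the k-means++ initialization and colors the PCA projection by cluster:
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
kmeans = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(your_matrix)
projected = PCA(n_components=2).fit_transform(your_matrix)
plt.scatter(projected[:, 0], projected[:, 1], c=cluster_ids, s=20, cmap='tab10')
plt.title('k-means clusters on the 2D PCA projection')
plt.show()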

Is there a way to get the probability of a prediction using XGBoostRegressor?

I have built an XGBoostRegressor model using around 200 categorical features to predict a continuous time variable.
But I would like to get both the actual prediction and the probability of that prediction as output. Is there any way to get this from the XGBoostRegressor model?
So I want both $\hat{y}$ and $P(Y|X)$ as output. Any idea how to do this?
There is no probability in regression. In regression, the only output you get is a predicted value; that's why it is called regression. So for any regressor, a probability for a prediction is not available; that only exists in classification.
As mentioned before, there is no probability associated with regression.
However, you could probably add a confidence interval to that regression, to see whether or not your regression can be trusted.
One thing to note, though, is that the variance might not be the same across the data.
Let's assume you study a time-based phenomenon. Specifically, you have the temperature (y) inside an oven after (x) seconds. At x = 0 s it is at 20 °C, you start heating it, and you want to know the evolution in order to predict the temperature after x seconds. The variance could be the same after 20 seconds and after 5 minutes, or it could be completely different. This is called heteroscedasticity.
If you want to use a confidence interval, you probably want to make sure you have taken care of heteroscedasticity, so your interval is valid for all the data.
You can probably try to get the distribution of your known outputs and compare the prediction against that curve, and check the p-value. But that would only give you a measure of how realistic it is to get that output, without taking the input into consideration. If you know your inputs/outputs are in a specific interval, this could work.
EDIT
This is how I would do it. Obviously, the outputs here would be your real outputs.
import numpy as np
import matplotlib.pyplot as plt
from scipy import integrate
from scipy.interpolate import interp1d

N = 1000  # the number of samples
mean = 0
std = 1
outputs = np.random.normal(loc=mean, scale=std, size=N)
# We want a density-normalized histogram (since this is a PDF, it must integrate to 1)
nbins = N / 10
n = int(N / nbins)
p, x = np.histogram(outputs, bins=n, density=True)
plt.hist(outputs, bins=n, density=True)
x = x[:-1] + (x[1] - x[0]) / 2  # converting bin edges to centers
# Now we want to interpolate:
# f = CubicSpline(x=x, y=p, bc_type='not-a-knot')
f = interp1d(x=x, y=p, kind='quadratic', fill_value='extrapolate')
x = np.linspace(-2.9 * std, 2.9 * std, 10000)
plt.plot(x, f(x))
plt.show()
# To check:
area = integrate.quad(f, x[0], x[-1])
print(area)  # (should be close to 1)
Now, the interpolation method is not great for outliers: if a predicted value is extremely far (more than 3 times the std) from your distribution, it won't work. Other than that, you can now use the PDF to get meaningful results.
It is not perfect, but it is the best I came up with in the time I had. I'm sure there are better ways to do it. If your data follows a normal law, it becomes trivial.
I suggest you look into NGBoost (essentially a wrapper around XGBoost which ultimately provides a probabilistic model).
Here you can find slides on how NGBoost works, as well as the seminal NGBoost paper.
The basic idea is to assume a specific distribution for $P(Y|X=x)$ (by default the Gaussian distribution) and fit an XGBoost model to estimate the best parameters of that distribution (for the Gaussian, $\mu$ and $\sigma$). The model will split the variables' space into different regions with different distributions, i.e. the same family (e.g. Gaussian) but different parameters.
After training the model, you're provided with the method pred_dist, which returns the estimated distribution $P(Y|X=x)$ for a given set of values $x$.
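A minimal usage sketch, assuming the ngboost package is installed and that X_train, y_train, X_test are placeholders for your encoded features and continuous target (with the default Gaussian distribution, the fitted parameters are a per-row mean and standard deviation):
from ngboost import NGBRegressor
ngb = NGBRegressor()                  # Gaussian distribution by default
ngb.fit(X_train, y_train)
point_pred = ngb.predict(X_test)      # point estimate, as with any regressor
dist = ngb.pred_dist(X_test)          # estimated distribution P(Y | X = x)
print(dist.params['loc'][:5], dist.params['scale'][:5])   # mu and sigma for the first rows
The exact attribute names of the returned distribution object may differ between NGBoost versions, so check pred_dist's output on your installation.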
