Create clusters depending on score performance - Python

I have data from students who took a test with 2 sections: the first section tests their digital skills at level 2, and the second section tests their digital skills at level 3. I need to come up with 3 clusters of students depending on their scores, to place them in 3 different skill levels (1, 2 and 3). Code sample below:
import pandas as pd

# Initialize data of lists.
data = {'Name': ['Marc', 'Fay', 'Emile', 'bastian', 'Karine', 'kathia', 'John', 'moni'],
        'Scores_section1': [12, 24, 14, 20, 8, 10, 5, 23],
        'Scores_section2': [20, 4, 1, 0, 18, 9, 12, 10],
        'Sum_all_scores': [32, 28, 15, 20, 26, 19, 17, 33]}

# Create DataFrame.
df = pd.DataFrame(data)

# Print the DataFrame.
df
I thought about using k-means clustering, but following a tutorial online, I'd need to use x, y coordinates. Should I use Scores_section1 as x and Scores_section2 as y, or vice versa, and why?
Many thanks in advance for your help!

Try it this way.
import pandas as pd

# Initialize data of lists.
data = {'Name': ['Marc', 'Fay', 'Emile', 'bastian', 'Karine', 'kathia', 'John', 'moni'],
        'Scores_section1': [12, 24, 14, 20, 8, 10, 5, 23],
        'Scores_section2': [20, 4, 1, 0, 18, 9, 12, 10],
        'Sum_all_scores': [32, 28, 15, 20, 26, 19, 17, 33]}

# Create DataFrame.
df = pd.DataFrame(data)
df

# Import the required module.
from sklearn.cluster import KMeans

# Initialize the estimator with 3 clusters.
kmeans = KMeans(n_clusters=3)

# Keep only the numeric score columns and predict the cluster label of each student.
df = df[['Scores_section1', 'Scores_section2', 'Sum_all_scores']]
label = kmeans.fit_predict(df)
label

# Attach the cluster labels to the DataFrame.
df['kmeans'] = label
df

# K-means clustering may be the most widely known clustering algorithm and involves assigning examples to
# clusters in an effort to minimize the variance within each cluster.
# "The main purpose of this paper is to describe a process for partitioning an N-dimensional population into k sets
# on the basis of a sample. The process, which is called 'k-means,' appears to give partitions which are reasonably
# efficient in the sense of within-class variance."

# Plot the x and y coordinates and color by cluster number.
import plotly.express as px
fig = px.scatter(df, x="Scores_section1", y="Scores_section2", color="kmeans",
                 size='Sum_all_scores', hover_data=['kmeans'])
fig.show()
Feel free to modify the code to suit your needs.
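One additional point, not from the original answer: the three score columns are on different scales and k-means is distance based, so it can help to standardize the features first and to fix random_state so the clustering is reproducible. A minimal sketch, assuming scikit-learn's StandardScaler (the kmeans_scaled column name is just illustrative):
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Standardize the score columns so no single column dominates the Euclidean distance.
X = df[['Scores_section1', 'Scores_section2', 'Sum_all_scores']]
X_scaled = StandardScaler().fit_transform(X)

# random_state makes the cluster assignment reproducible between runs.
kmeans = KMeans(n_clusters=3, random_state=0)
df['kmeans_scaled'] = kmeans.fit_predict(X_scaled)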

Related

How to shift a distribution to match it with a "target" distribution

Let's say that I have two 1-D arrays with two different statistical distributions. Now, I want to match both distributions, using one of them as the "target".
In the example, I "shifted" one of the distributions using MinMaxScaler() from scikit-learn to match it with the other one, but I am sure I can achieve an "automatic" and "better" match with some API or some code...
In the example I have both arrays in the same DataFrame (and both have the same length), but I'd be very pleased if somebody knows a way to achieve it using 2 different DataFrames and/or 2 arrays with different lengths.
Thank you!!
CODE
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
import plotly.figure_factory as ff

################## DATA ######################
np.random.seed(54)
crv = np.random.uniform(1, 99, (1, 100)).flatten()
np.random.seed(115)
crv_target = np.random.uniform(51, 149, (1, 100)).flatten()

# Create DataFrame
df = pd.DataFrame(data=[crv, crv_target]).T
df = df.rename(columns={0: "crv", 1: "crv_target"})

# Scaler: rescale crv into the hard-coded range (50, 150).
# Note: MinMaxScaler ignores the y argument; only X is used to fit the scaler.
scale = MinMaxScaler(feature_range=(50, 150))
df['crv_shifted'] = scale.fit_transform(X=df['crv'].values.reshape(-1, 1),
                                        y=df['crv_target'].values.reshape(-1, 1))

# Create distplot
data = [df['crv_shifted'], df['crv_target'], df['crv']]
labels = ['crv_shifted', 'crv_target', 'crv']
colors = ['#F8C471', '#22D2E6', '#CD6155']
fig = ff.create_distplot(data, labels, show_hist=False, show_rug=False, colors=colors)
fig.show()
LINK TO PLOT
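One way to make the match "automatic" rather than hard-coding feature_range=(50, 150) is to derive the range from the target array itself. A minimal sketch under that assumption (the crv_auto_shifted column name is just illustrative, and this only aligns the min/max of the two arrays, not the full shape of the distributions):
# Derive the scaler range from the observed target distribution.
target_range = (df['crv_target'].min(), df['crv_target'].max())
auto_scale = MinMaxScaler(feature_range=target_range)
df['crv_auto_shifted'] = auto_scale.fit_transform(df['crv'].values.reshape(-1, 1))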

how can I drop low correlated features

I am writing preprocessing code for my LSTM training. My CSV contains more than 30 variables. After applying some EDA techniques, I found that half of the features can be dropped without any effect on training.
Right now I am dropping such features manually using pandas.
I want to write code which can drop such features automatically.
I wrote code to visualize the heat map and the correlations this way:
# I am writing a class, so this part is from the preprocessing step.
# self.data is a DataFrame which contains all of the CSV data.
def calculateCorrelationByPearson(self):
    columns = self.data.columns
    plt.figure(figsize=(12, 8))
    sns.heatmap(data=self.data.corr(method='pearson'), annot=True, fmt='.2f',
                linewidths=0.5, cmap='Blues')
    plt.show()
    for column in columns:
        corr = stats.spearmanr(self.data['total'], self.data[column])
        print(f'{column} - corr coefficient:{corr[0]}, p-value:{corr[1]}')
This gives me a clear view of my features and their relationships with each other.
Now I want to drop the columns which are not important.
Let's say those with a correlation of less than 0.4.
How can I apply this logic in my code?
Here is an approach to remove variables with a correlation coefficient below some threshold:
import pandas as pd
from scipy.stats import spearmanr

data = pd.DataFrame([{"A": 1, "B": 2, "C": 3}, {"A": 2, "B": 3, "C": 1}, {"A": 3, "B": 4, "C": 0},
                     {"A": 4, "B": 4, "C": 1}, {"A": 5, "B": 6, "C": 2}])
targetVar = "A"
corr_threshold = 0.4

# Spearman correlation matrix over all columns.
corr = spearmanr(data)

# Series with column names and their correlation coefficients against the target.
# corr[0][:, 0] is the first column of the matrix, which corresponds to "A" because it is the first column of data.
corrSeries = pd.Series(corr[0][:, 0], index=data.columns)

# Apply the threshold, excluding the target variable itself.
corrSeries = corrSeries[(corrSeries.index != targetVar) & (corrSeries > corr_threshold)]

# List of variables to keep, with the target variable added back in.
vars_to_keep = list(corrSeries.index.values)
vars_to_keep.append(targetVar)
data2 = data[vars_to_keep]
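Note that the filter above keeps only variables with a positive correlation above the threshold, so strongly negatively correlated variables are dropped as well. If those should be kept, a small variation (using the same names as above) is to compare the absolute value of the coefficient:
# Keep variables whose absolute Spearman correlation with the target exceeds the threshold.
corrSeries = pd.Series(corr[0][:, 0], index=data.columns)
corrSeries = corrSeries[(corrSeries.index != targetVar) & (corrSeries.abs() > corr_threshold)]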

Summing time series after k-means clustering

I am trying out different values of K in k-means clustering on a set of time series data.
For each experiment I want to sum up the time series for each cluster label and perform predictions on them.
So for example:
If I cluster the time series into 3 clusters, I want to sum all the time series (column-wise) belonging to cluster 1, all the time series belonging to cluster 2, and the same for cluster 3. After that I will make predictions on each aggregated time-series cluster, but I do not need help with the prediction part.
I was thinking of adding the cluster labels to the original DataFrame and then using .loc and a loop to extract the time series corresponding to the same clusters. But I am wondering if there is a more efficient way?
import pandas as pd
from datetime import datetime
import numpy as np
from sklearn.cluster import KMeans

# Create a DataFrame with time series.
date_rng = pd.date_range(start='1/1/2018', end='1/08/2018', freq='H')
df = pd.DataFrame(date_rng, columns=['date'])
for i in range(20):
    df['ts' + str(i)] = np.random.randint(0, 100, size=(len(date_rng)))
df_pivot = df.pivot_table(columns='date', values=df.columns)

# Cluster for several values of k.
K = range(1, 10, 2)
for k in K:
    km = KMeans(n_clusters=k)
    km = km.fit(df_pivot)
    print(km.labels_)
    # Sum/aggregate all ts in each cluster column-wise.
    # Forecast the next step for each cluster (don't need help with this part).
You can access data points for every cluster and then sum their values.
Something like this:
labels = km.labels_
centroids = km.cluster_centers_

cluster_sums_dict = {}  # cluster number: column-wise sum of its time series
for i in range(k):
    # Select the rows (time series) assigned to cluster i.
    temp_cluster = df_pivot[labels == i]
    # Sum them column-wise, i.e. per timestamp.
    cluster_sums_dict[i] = temp_cluster.sum()
On a side note, instead of aggregating the cluster values, could you use the centroid of each cluster for prediction?
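As a possibly more efficient alternative to the explicit loop (a sketch, assuming df_pivot has one row per time series as in the question), pandas can compute the per-cluster column-wise sums in a single groupby call:
# Group the rows of df_pivot by their k-means label and sum column-wise.
# The result has one aggregated time series (row) per cluster.
cluster_sums = df_pivot.groupby(km.labels_).sum()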

Why does sklearn's train/test split plus PCA make my labelling incorrect?

I am exploring PCA in Scikit-learn (0.20 on Python 3) using Pandas for structuring my data. When I apply a test/train split (and only when), my input labels seem to no longer match up with the PCA output.
import pandas
import sklearn.datasets
from matplotlib import pyplot
import seaborn

def load_bc_as_dataframe():
    data = sklearn.datasets.load_breast_cancer()
    df = pandas.DataFrame(data.data, columns=data.feature_names)
    df['diagnosis'] = pandas.Series(data.target_names[data.target])
    return data.feature_names.tolist(), df

feature_names, bc_data = load_bc_as_dataframe()

from sklearn.model_selection import train_test_split
# bc_train, _ = train_test_split(bc_data, test_size=0)
bc_train = bc_data

from sklearn.decomposition import PCA
pca = PCA(n_components=2)
bc_pca_raw = pca.fit_transform(bc_train[feature_names])
bc_pca = pandas.DataFrame(bc_pca_raw, columns=('PCA 1', 'PCA 2'))
bc_pca['diagnosis'] = bc_train['diagnosis']

seaborn.scatterplot(
    data=bc_pca,
    x='PCA 1',
    y='PCA 2',
    hue='diagnosis',
    style='diagnosis'
)
pyplot.show()
This looks reasonable, and that's borne out by accurate classification results. If I replace the bc_train = bc_data with a train_test_split() call (even with test_size=0), my labels seem to no longer correspond to the original ones.
I realise that train_test_split() is shuffling my data (which I want it to, in general), but I don't see why that would be a problem, since the PCA and the label assignment use the same shuffled data. PCA's transformation is just a projection, and while it obviously doesn't retain the same features (columns), it shouldn't change which label goes with which frame.
How can I correctly relabel my PCA output?
The issue has three parts:
The shuffling in train_test_split() causes the indices in bc_train to be in a random order (compared to the row location).
PCA operates on numerical matrices, and effectively strips the indices from the input. Creating a new DataFrame recreates the indices to be sequential (compared to the row location).
Now we have random indices in bc_train and sequential indices in bc_pca. When I do bc_pca['diagnosis'] = bc_train['diagnosis'], bc_train is reindexed with bc_pca's indices. This reorders the bc_train data so that its indices match bc_pca's.
To put it another way, Pandas does a left join on the indices when I assign with bc_pca['diagnosis'] (i.e. __setitem__()), not a row-by-row copy (similar to update()).
I don't find this intuitive, and I couldn't find documentation on __setitem__()'s behaviour beyond the source code, but I expect it makes sense to a more experienced Pandas user, and maybe it's documented at a higher level somewhere I haven't seen.
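A minimal illustration of that alignment behaviour (a toy sketch, not from the original post): assigning a Series to a DataFrame column matches on index labels, not on row position.
import pandas as pd

# A Series with a shuffled index, like bc_train['diagnosis'] after train_test_split().
shuffled = pd.Series(['a', 'b', 'c'], index=[2, 0, 1])

# A DataFrame with a fresh sequential index, like bc_pca.
frame = pd.DataFrame({'x': [10, 20, 30]})  # index is 0, 1, 2

# The assignment aligns on index labels, so row 0 receives the value labelled 0 ('b'),
# not the first value in the Series ('a').
frame['label'] = shuffled
print(frame)   # rows get 'b', 'c', 'a' rather than 'a', 'b', 'c'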
There are a number of ways to avoid this. I can reset the index of the training/test data:
bc_train, _ = train_test_split(bc_data, test_size=0)
bc_train.reset_index(inplace=True)
Alternatively I could assign from the values member:
bc_pca['diagnosis'] = bc_train['diagnosis'].values
I could also do a similar thing before constructing the DataFrame (arguably more sensible, since PCA is effectively operating on bc_train[feature_names].values).
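One reading of that last option (a sketch, not from the original post) is to carry bc_train's index over when constructing the PCA DataFrame, so the later column assignment aligns the rows correctly:
bc_pca = pandas.DataFrame(bc_pca_raw, columns=('PCA 1', 'PCA 2'), index=bc_train.index)
bc_pca['diagnosis'] = bc_train['diagnosis']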

Put a fixed quantity of missing values in a dataset - Azure ML

I'm working with Azure ML, and my goal is to see what happens if I have a fixed quantity (as a percentage) of missing values in my dataset.
My idea could be:
Starting from a dataset (take the Adult dataset as an example), duplicate the original dataset and, by convention, call the copy X. Dataset X will contain randomly placed missing values amounting to 20% of the data. Once we have the original dataset and the duplicated dataset X, we can use a neural net algorithm, create training and test sets, and then train this neural net with dataset X as input. What would be interesting to see is the global error produced. After that we could expand the proportion of missing values in dataset X: starting from 20%, then 40%, and so on... I think the hardest part is duplicating the original dataset and creating dataset X with these missing values.
How can I do it? Using modules in Azure ML, or maybe R/Python scripts?
Just sharing my idea; please see the sample code and comments below.
import numpy as np
import pandas as pd

# Original DataFrame
df = pd.DataFrame(np.random.randn(6, 4))

# Copy the data by flattening the data matrix into a 1-D array.
array = df.values.flatten()

# Insert missing data by percentage.
# Define the percentage of missing data.
percent = 0.2
size = len(array)

# Generate a random list of indices whose values will be set to NaN.
# replace=False ensures distinct positions, so exactly int(size*percent) cells go missing.
chosen = np.random.choice(size, int(size * percent), replace=False)
array[chosen] = np.nan

# Create a new DataFrame with the missing data.
df2 = pd.DataFrame(np.reshape(array, (6, 4)))
Hope it helps.
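To run the experiment at several missing-data levels (20%, 40%, and so on), the same idea can be wrapped in a small helper. A sketch, assuming the setup above (the function name and the list of percentages are illustrative, not from the original answer):
def add_missing_values(df, percent, seed=None):
    """Return a copy of df with approximately `percent` of its cells set to NaN."""
    rng = np.random.default_rng(seed)
    array = df.values.astype(float).flatten()
    n_missing = int(len(array) * percent)
    # Choose distinct cell positions so the requested fraction is hit exactly (up to rounding).
    chosen = rng.choice(len(array), n_missing, replace=False)
    array[chosen] = np.nan
    return pd.DataFrame(array.reshape(df.shape), index=df.index, columns=df.columns)

# Build one corrupted copy of the dataset per missing-data level.
corrupted = {p: add_missing_values(df, p, seed=0) for p in (0.2, 0.4, 0.6)}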
