Clustering with mixed data types - Python

Currently my data frame consists of both numerical and categorical values (mixed data types). It looks like this -
id  age  txn_duration  Statename     amount  gender  religion
1   27   275           bihar         110     m       hindu
2   33   163           maharashtra   50      f       muslim
3   53   63            delhi         50      f       muslim
4   47   100           up            50      m       hindu
5   39   263           punjab        100     m       punjabi
6   41   303           delhi         50      m       punjabi
There are 20 states (Statename) and 7 religions. I have done get_dummies for both Statename and religion but got a lot of noise. I also detected outliers. My questions are -
1. How to find the optimum number of clusters for mixed data types?
2. In this case I am using the k-means algorithm. Can I use k-modes or any other method that will improve my results? I am not getting good results using k-means.
3. How to interpret my cluster results? I have used
print (cluster_data[clmns].groupby(['clusters']).mean())
Is there any other way I can inspect or plot them? Please provide the code.
My code is -
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import numpy as np
#Importing libraries
import os
import matplotlib.pyplot as plt  # visualization
from PIL import Image
%matplotlib inline
import seaborn as sns  # visualization
import itertools
import warnings
warnings.filterwarnings("ignore")
import io
from scipy import stats
from sklearn.cluster import KMeans
from kmodes.kprototypes import KPrototypes
cluster_data = pd.read_csv("cluster.csv")
cluster_data = pd.get_dummies(cluster_data, columns=['StateName'])
cluster_data = pd.get_dummies(cluster_data, columns=['gender'])
cluster_data = pd.get_dummies(cluster_data, columns=['religion'])
clmns = ['mobile', 'age', 'txn_duration', 'amount', 'StateName_Bihar',
         'StateName_Delhi', 'StateName_Gujarat', 'StateName_Karnataka',
         'StateName_Maharashtra', 'StateName_Punjab', 'StateName_Rajasthan',
         'StateName_Telangana', 'StateName_Uttar Pradesh',
         'StateName_West Bengal', 'gender_female',
         'gender_male', 'religion_buddhist',
         'religion_christian', 'religion_hindu',
         'religion_jain', 'religion_muslim',
         'religion_other', 'religion_sikh']
df_tr_std = stats.zscore(cluster_data[clmns])
#Cluster the data
kmeans = KMeans(n_clusters=3, random_state=0).fit(df_tr_std)
labels = kmeans.labels_
#Glue back to original data
cluster_data['clusters'] = labels
clmns.extend(['clusters'])
#Let's analyze the clusters
print(cluster_data[clmns].groupby(['clusters']).mean())

You can run something like this code:
In the elbow plot this code produces, you can see that having more than 3 clusters (for the dataset it was run on) does not provide a significant decrease in distortion. So the optimum cluster number would be 3 in that case (simple synthetic data). For noisy data the decision might be harder.
Reference: A. Mueller's scipy notes on sklearn
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# X is your (numeric) feature matrix
distortions = []
for i in range(1, 11):
    km = KMeans(n_clusters=i, random_state=0)
    km.fit(X)
    distortions.append(km.inertia_)
plt.plot(range(1, 11), distortions, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Distortion')
plt.show()
Edit for ValueError:
For the ValueError you need just numericals, so you can do this:
df_numerics = df.drop(['Statename', 'gender', 'religion'], axis=1)
You can also drop other columns that you don't want included in the clustering analysis.
With df_numerics, try the elbow method above to find a good cluster number; a silhouette-based check is sketched right below as an alternative.
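As an alternative (or complement) to the elbow, you could score each candidate k with the silhouette coefficient. A minimal sketch, assuming df_numerics from above:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# silhouette score lies in [-1, 1]; higher means better-separated clusters
for k in range(2, 11):
    km = KMeans(n_clusters=k, random_state=0)
    labels = km.fit_predict(df_numerics)
    print(k, silhouette_score(df_numerics, labels))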
Then, let's say you found out that 3 clusters was good, you can run:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3, random_state=42)
labels = kmeans.fit_predict(df_numerics)
labels contains the cluster numbers (0, 1, 2 for 3 clusters) for each row in your dataframe. You can also save this as a column in your dataframe:
df['cluster_labels'] = labels
Then to visualize it you can pick 2 columns (more than that is difficult to visualize). Let's say you picked 'txn_duration' and 'amount'; you can plot those columns and add the cluster labels as color like this:
import matplotlib.pyplot as plt
plt.scatter(df['txn_duration'],df['amount'], c=df['cluster_labels'])
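On question 2: since the data is mixed (numeric plus categorical), k-prototypes from the kmodes package you already import often works better than k-means on dummy variables, because it uses a proper dissimilarity for the categorical part. A minimal sketch, assuming the raw (un-dummied) columns from your sample frame:

import pandas as pd
from kmodes.kprototypes import KPrototypes

# re-read the raw file so the categorical columns are still intact
raw = pd.read_csv("cluster.csv")
X_mixed = raw[['age', 'txn_duration', 'amount',
               'Statename', 'gender', 'religion']].values

# categorical= lists the positions of the categorical columns in X_mixed
kproto = KPrototypes(n_clusters=3, random_state=0)
raw['clusters'] = kproto.fit_predict(X_mixed, categorical=[3, 4, 5])
print(raw.groupby('clusters').mean(numeric_only=True))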

Related

How to label data in a CSV file as outliers detected by DBSCAN clustering

I am using DBSCAN for clustering data so I can label the anomalous records. Here is my code. I want to print a 1 in front of the outlier records in my CSV file, but for now my code just tells me the record numbers and prints the records that are outliers.
# data wrangling
import pandas as pd
# visualization
import matplotlib.pyplot as plt
# algorithm
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
import numpy as np
from sklearn import metrics

# import data
df = pd.read_csv("C:/Users/user1/Desktop/4005_20200101_20200331.csv")
print(df.head())

# set up data to cluster
X = df

# scale and standardize data
X = StandardScaler().fit_transform(X)

# Instantiate our DBSCAN model. In the code below, epsilon = 3 and
# min_samples is the minimum number of points needed to constitute a cluster.
dbscan = DBSCAN(eps=3, min_samples=4)

# fit model
model = dbscan.fit(X)

# Store the labels formed by the DBSCAN
labels = model.labels_

# Identify which points make up our "core points"
core_samples = np.zeros_like(labels, dtype=bool)
core_samples[dbscan.core_sample_indices_] = True
print(core_samples)

# Calculate the number of clusters (noise, labelled -1, is not a cluster)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters)

# Compute the silhouette score
# print("Silhouette Coefficient: %0.3f" % metrics.silhouette_score(X, labels))

outliers = df[model.labels_ == -1]
print(outliers)
I want to print a 1 in front of the outlier records in my CSV file.
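One way to get that flag, sketched under the assumption that df still holds the original rows and labels is the array from the DBSCAN fit above (the output file name here is made up):

# 1 marks DBSCAN noise points (label -1), 0 marks clustered points
df['outlier'] = (labels == -1).astype(int)
df.to_csv("C:/Users/user1/Desktop/4005_labeled.csv", index=False)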

How can I use SVM classifier to detect outliers in percentage changes?

I have a pandas dataframe that is in the following format:
This contains the % change in stock prices each day for 3 companies MSFT, F and BAC.
I would like to use a OneClassSVM classifier to detect whether the data is an outlier or not. I have tried the following code, which I believe detects the rows that contain outliers.
#Import libraries
from sklearn.svm import OneClassSVM
import matplotlib.pyplot as plt
from numpy import where

#Create SVM classifier
svm = OneClassSVM(kernel='rbf', gamma=0.001, nu=0.03)
#Use svm to fit and predict
svm.fit(delta)
pred = svm.predict(delta)
#If a value is an outlier the prediction would be -1
outliers = where(pred == -1)
#Print rows with outliers
print(outliers)
This gives the following output:
I would like to then add a new column to my dataframe that indicates whether each row is an outlier or not. I have tried the following code, but I get an error because the lists are different lengths, as shown below.
condition = (delta.index.isin(outliers))
assigned_value = "outlier"
df['isoutlier'] = np.select(condition, assigned_value)
Would you be able to let me know how I could add this column, given that the list of rows containing outliers is much shorter?
It's not very clear what delta and df are in your code; I am assuming they are the same data frame.
You can use the result from svm.predict; here we leave the label blank ('') if the row is not an outlier:
import numpy as np
import pandas as pd
from sklearn.svm import OneClassSVM

df = pd.DataFrame(np.random.uniform(0, 1, (100, 3)), columns=['A', 'B', 'C'])
svm = OneClassSVM(kernel='rbf', gamma=0.001, nu=0.03)
svm.fit(df)
pred = svm.predict(df)
df['isoutlier'] = np.where(pred == -1, 'outlier', '')
A B C isoutlier
0 0.869475 0.752420 0.388898
1 0.177420 0.694438 0.129073
2 0.011222 0.245425 0.417329
3 0.791647 0.265672 0.401144
4 0.538580 0.252193 0.142094
.. ... ... ... ...
95 0.742192 0.079426 0.676820 outlier
96 0.619767 0.702513 0.734390
97 0.872848 0.251184 0.887500 outlier
98 0.950669 0.444553 0.088101
99 0.209207 0.882629 0.184912
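If you also want to see where the flagged rows sit, a quick scatter over two of the columns works; a sketch on the random data above:

import matplotlib.pyplot as plt

# color the flagged rows red, the rest blue
plt.scatter(df['A'], df['B'], c=np.where(pred == -1, 'red', 'blue'))
plt.xlabel('A')
plt.ylabel('B')
plt.show()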

Can we cluster Multivariate Time Series dataset in Python

I have a dataset with many financial signal values for different stocks at different times. For example:
StockName Date Signal1 Signal2
----------------------------------
Stock1 1/1/20 a b
Stock1 1/2/20 c d
.
.
.
Stock2 1/1/20 e f
Stock2 1/2/20 g h
.
.
.
I would like to build a time series table look like below and cluster stocks based on both signal1 and signal2 (2 variables)
StockName 1/1/20 1/2/20 ........ Cluster#
----------------------------------------------------
Stock1 [a,b] [c,d] 0
Stock2 [e,f] [g,h] 1
Stock3 ...... ..... 0
.
.
.
1) Are there any ways to do this (clustering stocks based on multiple time-series variables)? I tried to search online, but everything I found is about clustering time series based on one variable.
2) Also, are there any ways to cluster different stocks at different times as well (so maybe Stock1 at time1 ends up in the same cluster as Stock2 at time3)?
I am revising my answer here, based on the new information that you last posted.
from utils import *
import time
import numpy as np
from mxnet import nd, autograd, gluon
from mxnet.gluon import nn, rnn
import mxnet as mx
import datetime
import seaborn as sns
import matplotlib.pyplot as plt
# %matplotlib inline
from sklearn.decomposition import PCA
import math
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
import xgboost as xgb
from sklearn.metrics import accuracy_score
import warnings
warnings.filterwarnings("ignore")
context = mx.cpu(); model_ctx=mx.cpu()
mx.random.seed(1719)
# Note: The purpose of this section (3. The Data) is to show the data preprocessing and to give rationale for using different sources of data, hence I will only use a subset of the full data (that is used for training).
def parser(x):
    return datetime.datetime.strptime(x, '%Y-%m-%d')
# dataset_ex_df = pd.read_csv('data/panel_data_close.csv', header=0, parse_dates=[0], date_parser=parser)
import yfinance as yf
# Get the data for the stock AAPL
start = '2018-01-01'
end = '2020-04-22'
data = yf.download('GS', start, end)
data = data.reset_index()
data
data.dtypes
# re-name field from 'Adj Close' to 'Adj_Close'
data = data.rename(columns={"Adj Close": "Adj_Close"})
data
num_training_days = int(data.shape[0]*.7)
print('Number of training days: {}. Number of test days: {}.'.format(num_training_days, data.shape[0]-num_training_days))
# TECHNICAL INDICATORS
#def get_technical_indicators(dataset):
# Create 7 and 21 days Moving Average
data['ma7'] = data['Adj_Close'].rolling(window=7).mean()
data['ma21'] = data['Adj_Close'].rolling(window=21).mean()
# Create exponential weighted moving average
data['26ema'] = data['Adj_Close'].ewm(span=26).mean()
data['12ema'] = data['Adj_Close'].ewm(span=12).mean()
data['MACD'] = (data['12ema']-data['26ema'])
# Create Bollinger Bands
data['20sd'] = data['Adj_Close'].rolling(window=20).std()
data['upper_band'] = data['ma21'] + (data['20sd']*2)
data['lower_band'] = data['ma21'] - (data['20sd']*2)
# Create Exponential moving average
data['ema'] = data['Adj_Close'].ewm(com=0.5).mean()
# Create Momentum
data['momentum'] = data['Adj_Close']-1
dataset_TI_df = data
dataset = data
def plot_technical_indicators(dataset, last_days):
    plt.figure(figsize=(16, 10), dpi=100)
    shape_0 = dataset.shape[0]
    xmacd_ = shape_0 - last_days
    dataset = dataset.iloc[-last_days:, :]
    x_ = range(3, dataset.shape[0])
    x_ = list(dataset.index)
    # Plot first subplot
    plt.subplot(2, 1, 1)
    plt.plot(dataset['ma7'], label='MA 7', color='g', linestyle='--')
    plt.plot(dataset['Adj_Close'], label='Closing Price', color='b')
    plt.plot(dataset['ma21'], label='MA 21', color='r', linestyle='--')
    plt.plot(dataset['upper_band'], label='Upper Band', color='c')
    plt.plot(dataset['lower_band'], label='Lower Band', color='c')
    plt.fill_between(x_, dataset['lower_band'], dataset['upper_band'], alpha=0.35)
    plt.title('Technical indicators for Goldman Sachs - last {} days.'.format(last_days))
    plt.ylabel('USD')
    plt.legend()
    # Plot second subplot
    plt.subplot(2, 1, 2)
    plt.title('MACD')
    plt.plot(dataset['MACD'], label='MACD', linestyle='-.')
    plt.hlines(15, xmacd_, shape_0, colors='g', linestyles='--')
    plt.hlines(-15, xmacd_, shape_0, colors='g', linestyles='--')
    # plt.plot(dataset['log_momentum'], label='Momentum', color='b', linestyle='-')
    plt.legend()
    plt.show()
plot_technical_indicators(dataset_TI_df, 400)
This will give you some signals to work with. Of course, these features can be anything you want. I'm sure you know this is technical analysis, and not fundamental analysis. Now, you can do your clustering, and whatever else you want, at this point.
Here is a good link for clustering.
https://www.pythonforfinance.net/2018/02/08/stock-clusters-using-k-means-algorithm-in-python/
Good material to read (Title: Time Series Clustering and Dimensionality Reduction)
https://towardsdatascience.com/time-series-clustering-and-dimensionality-reduction-5b3b4e84f6a3
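To tie this back to the actual question (clustering on two signals at once), one common trick is to flatten each stock's multivariate series into a single feature vector and cluster those vectors. A minimal sketch, assuming a frame df with the StockName/Date/Signal1/Signal2 columns from the question and a reasonably recent pandas (pivot with a list of values):

import pandas as pd
from sklearn.cluster import KMeans

# one row per stock, one column per (signal, date) pair
wide = df.pivot(index='StockName', columns='Date', values=['Signal1', 'Signal2'])
wide = wide.dropna(axis=1)  # keep only dates observed for every stock

km = KMeans(n_clusters=3, random_state=0)
labels = km.fit_predict(wide.values)
print(pd.Series(labels, index=wide.index, name='Cluster'))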

Extract rows of clusters in hierarchical clustering using seaborn clustermap

I am using hierarchical clustering from seaborn.clustermap to cluster my data. This works fine to nicely visualize the clusters in a heatmap. However, now I would like to extract all row values that are assigned to the different clusters.
This is what my data looks like:
import pandas as pd
# load DataFrame
df = pd.read_csv('expression_data.txt', sep='\t', index_col=0)
df
log_HU1 log_HU2
EEF1A1 13.439499 13.746856
HSPA8 13.169191 12.983910
FTH1 13.861164 13.511200
PABPC1 12.142340 11.885885
TFRC 11.261368 10.433607
RPL26 13.837205 13.934710
NPM1 12.381585 11.956855
RPS4X 13.359880 12.588574
EEF2 11.076926 11.379336
RPS11 13.212654 13.915813
RPS2 12.910164 13.009184
RPL11 13.498649 13.453234
CA1 9.060244 13.152061
RPS3 11.243343 11.431791
YBX1 12.135316 12.100374
ACTB 11.592359 12.108637
RPL4 12.168588 12.184330
HSP90AA1 10.776370 10.550427
HSP90AB1 11.200892 11.457365
NCL 11.366145 11.060236
Then I perform the clustering using seaborn as follows:
fig = sns.clustermap(df)
Which produces the following clustermap:
For this example I may be able to manually interpret the values belonging to each cluster (e.g. that TFRC and HSP90AA1 cluster). However I am planning to do these clustering analysis on much bigger data sets.
So my question is: does anyone know how to get the row values belonging to each cluster?
Thanks,
Using the scipy.cluster.hierarchy module with fcluster allows cluster retrieval:
import pandas as pd
import seaborn as sns
import scipy.cluster.hierarchy as sch
df = pd.read_csv('expression_data.txt', sep='\t', index_col=0)
# retrieve clusters using fcluster
d = sch.distance.pdist(df)
L = sch.linkage(d, method='complete')
# 0.2 can be modified to retrieve more stringent or relaxed clusters
clusters = sch.fcluster(L, 0.2 * d.max(), 'distance')
# cluster indices correspond to indices of the original df
for i, cluster in enumerate(clusters):
    print(df.index[i], cluster)
Out:
EEF1A1 2
HSPA8 1
FTH1 2
PABPC1 3
TFRC 5
RPL26 2
NPM1 3
RPS4X 1
EEF2 4
RPS11 2
RPS2 1
RPL11 2
CA1 6
RPS3 4
YBX1 3
ACTB 3
RPL4 3
HSP90AA1 5
HSP90AB1 4
NCL 4
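If you want the row names collected per cluster rather than printed one by one, a small follow-up sketch on the clusters array from above:

import pandas as pd

# map each row label to its cluster id, then collect the labels per cluster
members = pd.Series(clusters, index=df.index)
for c, rows in members.groupby(members):
    print(c, list(rows.index))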

SMOTETomek - how to set ratio as dictionary for fixed balance

I've tried to use this technique to correct very imbalanced classes.
My data set has classes e.g.:
In [123]:
data['CON_CHURN_TOTAL'].value_counts()
Out[123]:
0 100
1 10
Name: CON_CHURN_TOTAL, dtype: int64
I wanted to use SMOTETomek to under-sample the 0 class and over-sample the 1 class to achieve a ratio of 80:20. However, I cannot find a way to set the dictionary correctly. Of course, in the full code the 80:20 ratio will be calculated based on the number of rows.
When I am trying:
from imblearn.combine import SMOTETomek
smt = SMOTETomek(ratio={1:20, 0:80})
I have error:
ValueError: With over-sampling methods, the number of samples in a
class should be greater or equal to the original number of samples.
Originally, there is 100 samples and 80 samples are asked.
But this method should be suitable for doing both under- and over-sampling at the same time.
Unfortunately the documentation link is not working right now due to a 404 error.
I ran into this problem again, so I asked the question directly on the imbalanced-learn GitHub.
Here is the full answer: github.com/scikit-learn-contrib/imbalanced-learn
Most important:
SMOTETomek is not doing what you are thinking about.
SMOTETomek applies SMOTE followed by removing the Tomek link and not
both over-sampling and under-sampling at the same time.
Be aware that you cannot define the number of samples to use when
using Tomek:
http://imbalanced-learn.org/en/stable/under_sampling.html#tomek-s-links
If you really want to have an under-sampling you could pipeline 2 samplers:
from sklearn.datasets import load_breast_cancer
import pandas as pd
from imblearn.pipeline import make_pipeline
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import NearMiss
data = load_breast_cancer()
X = pd.DataFrame(data=data.data, columns=data.feature_names)
count_class_0 = 300
count_class_1 = 300
pipe = make_pipeline(
    SMOTE(sampling_strategy={0: count_class_0}),
    NearMiss(sampling_strategy={1: count_class_1}),
)
X_smt, y_smt = pipe.fit_resample(X, data.target)
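A quick sanity check of the resulting balance (a sketch; y_smt comes from the pipeline above):

import numpy as np

# both classes should now sit at the requested counts
print(np.bincount(y_smt))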
