Clustering binary data

Clustering binary data - python

I want to perform cluster analysis on the following data (sample):
ID CODE1 CODE2 CODE3 CODE4 CODE5 CODE6
------------------------------------------------------------------
00001 0 1 1 0 0 0
00002 1 0 0 0 1 1
00003 0 1 0 1 1 1
00004 1 1 1 0 1 0
...
Here 1 indicates the presence of that code for a person, and 0 its absence.
Is k-means or hierarchical clustering more appropriate for clustering the codes for this kind of data (about a million distinct ids), and with which distance measure? If neither of these methods is appropriate, what do you think is most appropriate?
Thank you

No, k-means does not make a lot of sense for binary data, because k-means computes means. What is the mean vector of binary data?
Your cluster "centers" will not be part of your data space, and will look nothing like your input data. That doesn't seem like a proper "center" to me, when it's totally different from your objects.
Most likely, your cluster "centers" will end up being more similar to each other than to the actual cluster members, because they lie somewhere in the middle, while all your data sits in the corners.
Seriously, investigate similarity functions for your data type (for presence/absence data, Jaccard or Hamming distance are common choices). Then choose a clustering algorithm that works with this distance function. Hierarchical clustering is quite general, but really slow: the standard algorithms need at least quadratic time and memory in the number of objects, which is prohibitive for a million ids. But you don't have to use a 40 year old algorithm; you may want to look into more modern approaches.
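To make the point about means concrete, here is a minimal sketch using only the four sample rows from the question. The helper names `mean_vector` and `jaccard_distance` are made up for illustration, and Jaccard distance is just one candidate similarity for presence/absence data, not necessarily the right one for this dataset.

```python
from itertools import combinations

# The four sample rows from the question (CODE1..CODE6).
rows = {
    "00001": [0, 1, 1, 0, 0, 0],
    "00002": [1, 0, 0, 0, 1, 1],
    "00003": [0, 1, 0, 1, 1, 1],
    "00004": [1, 1, 1, 0, 1, 0],
}

def mean_vector(vectors):
    """Column-wise mean, i.e. what k-means would use as a cluster centre."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def jaccard_distance(a, b):
    """1 - |codes present in both| / |codes present in either|."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return 0.0 if either == 0 else 1.0 - both / either

# The centre contains fractions, so it is not a valid binary record.
centroid = mean_vector(list(rows.values()))

# Pairwise distances that only look at shared presences.
distances = {
    (i, j): jaccard_distance(rows[i], rows[j])
    for i, j in combinations(rows, 2)
}

print(centroid)
for pair, dist in distances.items():
    print(pair, round(dist, 3))
```

The centre comes out with fractional entries, which no person in the data can have, while the pairwise distances stay interpretable as "share of codes not in common". A full pairwise table like this does not scale to a million ids, so it would only be usable on a sample.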

Related

Identifying outliers in an event sequence using a Python Dataframe

I'm experimenting with Machine Learning and LSTM models for river level prediction based on the current level and rainfall within the upstream catchment. I'm currently using TensorFlow and a Keras LSTM model.
I have hourly rainfall data in mm from eight sites within the catchment, and the river stage (level) in meters at a site downstream from the catchment area. The problem I face is that every now and then the rainfall sites are tested by pouring water into them. This creates a significant spike in the data that I would like to filter out.
Here's an example of what a typical rainfall event looks like within the dataframe:
[Image: DataFrame showing a typical sequence of rainfall and river stage data]
And here is an example of what it looks like when two sites have been tested:
[Image: DataFrame showing abnormal rainfall data due to two sites being tested]
I've come across several ways to statistically cluster data and identify outliers; however, none of these really worked on a timed sequence of events. Also, the rainfall site columns are listed in the DataFrame in the order in which they are located within the catchment, so there is a loose spatial coupling moving across the columns.
I was thinking of using something a little like a 3x3 or 3x4 convolution matrix, but rather than calculating new cell values it would find outliers by comparing the values from the central cells with the values in the outer cells. Any significant difference would indicate abnormal data.
The Pandas DataFrame API is quite large and I'm still getting familiar with it. Any suggestions on specific functions or methods I should be looking at would be much appreciated.
In the following example the 10:00:00 reading for Site 2 would be an obvious anomaly.
Timestamp  Site 1  Site 2  Site 3
09:00:00   0       0       0
10:00:00   0       20      0
11:00:00   0       0       0
20mm of rainfall at one site with zero rainfall at the adjacent sites, or at the same site for the hour before and hour after is a very clear and obvious case.
This is what a normal rainfall pattern might look like:
Timestamp  Site 1  Site 2  Site 3
09:00:00   6       4       0
10:00:00   0       20      2
11:00:00   0       0       11
This is a less obvious example:
Timestamp  Site 1  Site 2  Site 3
09:00:00   1       0       0
10:00:00   0       20      2
11:00:00   0       3       1
One possibility might be to compare the central cell value to the maximum of the surrounding cell values and flag the cell if the difference is greater than 15 (or some other arbitrary threshold value).
The exact criteria will probably change as I discover more about the data. The mechanism for applying those criteria to the DataFrame is what I'm looking for. For example, if the criteria were implemented as a lambda function that could reference adjacent cells, is there a way to apply that lambda function across the DataFrame?
An extra complication is how to deal with checking values for Site 1, where there is no preceding site to the left, and Site 3, where there is no following site to the right.
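A minimal sketch of the neighbour comparison described above, under some assumptions: the function name `flag_spikes` is made up, the threshold of 15 is the arbitrary value mentioned in the question, and the missing neighbours at the edges (before the first hour, left of Site 1, right of the last site) are treated as zero rainfall by padding. That padding choice is only one option and may not suit real data.

```python
import numpy as np
import pandas as pd

def flag_spikes(rain, threshold=15.0):
    """Flag cells that exceed the max of their 8 neighbours by more than threshold.

    Rows are consecutive timestamps, columns are sites in catchment order.
    Neighbours that fall outside the frame are assumed to be 0 mm.
    """
    values = rain.to_numpy(dtype=float)
    n_rows, n_cols = values.shape
    padded = np.pad(values, 1, mode="constant", constant_values=0.0)

    neighbour_max = np.full(values.shape, -np.inf)
    for row_offset in (0, 1, 2):
        for col_offset in (0, 1, 2):
            if row_offset == 1 and col_offset == 1:
                continue  # skip the central cell itself
            window = padded[row_offset:row_offset + n_rows,
                            col_offset:col_offset + n_cols]
            neighbour_max = np.maximum(neighbour_max, window)

    mask = (values - neighbour_max) > threshold
    return pd.DataFrame(mask, index=rain.index, columns=rain.columns)

sites = ["Site 1", "Site 2", "Site 3"]
hours = ["09:00:00", "10:00:00", "11:00:00"]
obvious = pd.DataFrame([[0, 0, 0], [0, 20, 0], [0, 0, 0]], index=hours, columns=sites)
normal = pd.DataFrame([[6, 4, 0], [0, 20, 2], [0, 0, 11]], index=hours, columns=sites)
subtle = pd.DataFrame([[1, 0, 0], [0, 20, 2], [0, 3, 1]], index=hours, columns=sites)

obvious_flags = flag_spikes(obvious)
normal_flags = flag_spikes(normal)
subtle_flags = flag_spikes(subtle)
print(obvious_flags)
```

Working on the underlying array with shifted windows avoids calling a Python function per cell. With this particular threshold the "less obvious" table is also flagged (20 against a neighbour maximum of 3), so the criterion would need tuning if that case should pass.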

How can I search for anomalies in each column in a Dataframe?

I have a Dataframe and my goal is to find anomalies for each different column. So I am looking for univariate anomalies.
Let's assume this is my Dataframe:
df=pd.DataFrame(np.random.rand(100, 6) * 1, columns=['A','B','C','D','E','F'])
I am faced with two questions:
Which algorithms are adequate for this goal? E.g. Isolation Forest?
How could I run an algorithm (E.g. Isolation Forest) over all columns, rather than doing it column per column? Can I use a for loop?
Thanks for your help!

Q2, for example:
import pandas as pd

df = pd.DataFrame({"bytes":[1,2,3,4,5], "flow":[1,2,3,4,5], "userid":[1,2,3,4,5]}).set_index("userid")

def get_anomaly(arr):
    # your algorithm goes here; this toy rule only illustrates the mechanism
    if arr.bytes < 3 and arr.flow < 3:
        return -1
    elif arr.bytes > 3 and arr.flow > 3:
        return 1
    else:
        return 0

# axis=1 passes each row to get_anomaly as a Series
df['is_anomaly'] = df.apply(get_anomaly, axis=1)
>>> df
        bytes  flow  is_anomaly
userid
1           1     1          -1
2           2     2          -1
3           3     3           0
4           4     4           1
5           5     5           1
We can talk a little bit about Q1.
Level 0: Linear relationships or other rules of thumb
Box plot: low outlier < Q1 - 1.5ΔQ <= normal data <= Q3 + 1.5ΔQ < high outlier, where ΔQ = Q3 - Q1 is the interquartile range
Scott's rule: bin width Δb = 3.5σ / n^(1/3). Bin the data with this width and look at the resulting distribution.
Other summary statistics: mean, standard deviation, and so on.
Level 1: Statistical algorithms
Good algorithms:
CMP
https://www.sciencedirect.com/science/article/abs/pii/S1389128616301633
Beehive
https://nds2.ccs.neu.edu/papers/Beehive.pdf
CBLOF
https://www.goldiges.de/publications/Anomaly_Detection_Algorithms_for_RapidMiner.pdf
There are also the AR, MA, and ARMA models, which I don't know much about.
Level 2: Unsupervised learning
K-means and so on (there are actually quite a lot of these).
Level 3: Supervised learning
From Elasticsearch (docs):
EWMA
s_2 = α*x_2 + (1-α)*s_1
Holt-Linear
s_2 = α*x_2 + (1-α)*(s_1 + t_1)
t_2 = β*(s_2 - s_1) + (1-β)*t_1
Holt-Winters (with period k)
s_i = α*(x_i - p_{i-k}) + (1-α)*(s_{i-1} + t_{i-1})
t_i = β*(s_i - s_{i-1}) + (1-β)*t_{i-1}
p_i = γ*(x_i - s_i) + (1-γ)*p_{i-k}
From ML:
CNN, RNN, LSTM, PrefixSpan, AutoML, Bayes and so on. (There are a few scenarios where you can use these.)
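As a rough illustration of the box-plot rule from Level 0, and of running one rule over every column without an explicit loop, here is a minimal sketch. The function name `iqr_outliers` and the toy data are assumptions for illustration only; the factor 1.5 is the conventional box-plot whisker length.

```python
import pandas as pd

def iqr_outliers(df, k=1.5):
    """Boolean mask: True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR].

    Quartiles are computed separately for each column, so every column is
    tested against its own fences (univariate anomalies).
    """
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    too_low = df.lt(q1 - k * iqr, axis="columns")
    too_high = df.gt(q3 + k * iqr, axis="columns")
    return too_low | too_high

# Toy data: column A contains one extreme value, column B does not.
df = pd.DataFrame({"A": [1, 2, 3, 4, 100], "B": [1, 2, 3, 4, 5]})
mask = iqr_outliers(df)
print(mask.sum())  # number of flagged values per column
```

Because the quartiles come back as a Series indexed by column name, the comparison is aligned column by column, which is what makes the per-column loop unnecessary here. A model such as Isolation Forest would still need to be fitted once per column if strictly univariate scores are wanted.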
Many more are left unlisted: there are too many algorithms, too many that would be appropriate, and too many details to write down.
UEBA's thinking is important when analyzing anomalies.
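The EWMA and Holt-Linear recurrences listed above can be sketched in a few lines of plain Python. The initial values (first level equal to the first observation, first trend equal to zero) are an assumption made here for illustration; the formulas themselves do not fix them, and the function names are made up.

```python
def ewma(values, alpha):
    """s_i = alpha * x_i + (1 - alpha) * s_{i-1}, starting from s_1 = x_1."""
    smoothed = [values[0]]
    for x in values[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

def holt_linear(values, alpha, beta):
    """Level/trend smoothing, starting from s_1 = x_1 and t_1 = 0.

    s_i = alpha * x_i + (1 - alpha) * (s_{i-1} + t_{i-1})
    t_i = beta * (s_i - s_{i-1}) + (1 - beta) * t_{i-1}
    """
    levels = [values[0]]
    trends = [0.0]
    for x in values[1:]:
        level = alpha * x + (1 - alpha) * (levels[-1] + trends[-1])
        trend = beta * (level - levels[-1]) + (1 - beta) * trends[-1]
        levels.append(level)
        trends.append(trend)
    return levels, trends

ewma_result = ewma([0, 0, 10], alpha=0.5)
levels, trends = holt_linear([1, 2, 3], alpha=1.0, beta=1.0)
print(ewma_result, levels, trends)
```

For anomaly detection, one common use is to compare each new observation with the value the smoother predicted for it and flag large residuals; how large is "large" depends on the data.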

How to handle categorical data for preprocessing in Machine Learning

This may be a basic question. I have categorical data that I want to feed into my machine learning model, but my ML model accepts only numerical data. What is the correct way to convert this categorical data into numerical data?
My Sample DF:
T-size Gender Label
0 L M 1
1 L M 1
2 M F 1
3 S F 0
4 M M 1
5 L M 0
6 S F 1
7 S F 0
8 M M 1
I know the following code converts my categorical data into numerical data.
Type-1:
df['T-size'] = df['T-size'].cat.codes
The line above simply maps the categories to the integers 0 to N-1. It doesn't respect any relationship between them.
For this example I know S < M < L. What should I do when I want to convert ordered data like this?
Type-2:
In this type there is no ordering relationship between M and F, but I can tell that M has a higher probability than F of having Label 1, i.e. (number of samples with Label 1) / (total number of samples):
for Male,
(4/5)
for Female,
(2/4)
We know that
(4/5) > (2/4)
How should I encode this kind of column?
Can I replace M with (4/5) and F with (2/4) for this problem?
What is the proper way of dealing with such a column?
Please help me understand this better.

There are many ways to encode categorical data, and some of them depend on exactly what you plan to do with it. For example, one-hot encoding, which is easily the most popular choice, can be an extremely poor choice if you're planning on using a decision tree / random forest / GBM.
Regarding your t-shirts above, you can give a pandas categorical type an order:
df['T-size'] = df['T-size'].astype(pd.api.types.CategoricalDtype(['S','M','L'], ordered=True))
If you set up your t-shirt categorical like that, then your .cat.codes method will work perfectly. It also means you can easily use scikit-learn's LabelEncoder, which fits neatly into pipelines (note that LabelEncoder itself sorts classes alphabetically and is meant for target labels; OrdinalEncoder with an explicit categories list is its counterpart for feature columns).
Regarding your encoding of gender, you need to be very careful when using your target variable (your Label). You don't want to do this encoding before your train-test split, otherwise you're using knowledge of your unseen data, making it not truly unseen. This gets even more complicated if you're using cross-validation, as you'll need to do the encoding within each CV iteration (i.e. a new encoding per fold). If you want to do this, I recommend you check out TargetEncoder from scikit-learn-contrib's Category Encoders, but again, be sure to use it within an sklearn Pipeline or you will mess up the train-test splits and leak information from your test set into your training set.
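A minimal sketch of both ideas on the sample data from the question. The ordered-category part follows the approach described above; the target-encoding part is a hand-rolled stand-in using plain group means, not the Category Encoders TargetEncoder (which applies its own smoothing), and the 6/3 row split is an arbitrary assumption purely to show where the statistics must come from.

```python
import pandas as pd

df = pd.DataFrame({
    "T-size": ["L", "L", "M", "S", "M", "L", "S", "S", "M"],
    "Gender": ["M", "M", "F", "F", "M", "M", "F", "F", "M"],
    "Label":  [1, 1, 1, 0, 1, 0, 1, 0, 1],
})

# Ordered categorical: codes now follow S < M < L instead of alphabetical order.
size_dtype = pd.api.types.CategoricalDtype(["S", "M", "L"], ordered=True)
df["T-size_code"] = df["T-size"].astype(size_dtype).cat.codes

# Split first (hypothetical split: first six rows train, last three test).
train, test = df.iloc[:6], df.iloc[6:]

# Learn the per-gender label rate from the training rows only ...
gender_rate = train.groupby("Gender")["Label"].mean()
fallback = train["Label"].mean()  # used for categories unseen in training

# ... then apply that mapping to both splits, so no test labels leak in.
train_gender_encoded = train["Gender"].map(gender_rate)
test_gender_encoded = test["Gender"].map(gender_rate).fillna(fallback)

print(df["T-size_code"].tolist())
print(test_gender_encoded.tolist())
```

Note that the rates learned from the training rows (0.75 and 0.5) differ from the 4/5 and 2/4 computed on the full table in the question, which is exactly the information that would leak if the encoding were done before splitting.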

If you want to keep the ordering in your size parameter, you may consider using a linear mapping for it. This would be:
size_mapping = {"S": 1, "M":2 , "L":3}
#mapping to the DataFrame
df['T-size_num'] = df['T-size'].map(size_mapping)
This allows you to treat the input as numerical data while preserving the ordering.
As for gender, you are confusing the class distribution with the preprocessing. If you feed the distribution in as an input, you will introduce a bias into your data. You must treat male and female as two distinct categories regardless of their observed distribution. You should map them to two different numbers, but without introducing proportions.
df['Gender_num'] = df['Gender'].map({'M':0 , 'F':1})
For a more detailed explanation and a coverage of more specificities than your question, I suggest reading this article regarding categorical data in Machine Learning

For the first question, if you have a small number of categories, you could map the column with a dictionary. In this way you can set an order:
d = {'L':2, 'M':1, 'S':0}
df['T-size'] = df['T-size'].map(d)
Output:
T-size Gender Label
0 2 M 1
1 2 M 1
2 1 F 1
3 0 F 0
4 1 M 1
5 2 M 0
6 0 F 1
7 0 F 0
8 1 M 1
For the second question, you can use the same method, but I would simply use the two values 0 and 1 for males and females. If you only need the category and you don't have to do arithmetic with the values, one value is as good as another.

It might be overkill for the M/F example, since it's binary - but if you are ever concerned about mapping a categorical into a numerical form, then consider one hot encoding. It basically stretches your single column containing n categories, into n binary columns.
So a dataset of:
Gender
M
F
M
M
F
Would become
Gender_M Gender_F
1 0
0 1
1 0
1 0
0 1
This takes away any notion of one thing being more "positive" than another - an absolute must for categorical data with more than 2 options, where there's no transitive A > B > C relationship and you don't want to smear your results by forcing one into your encoding scheme.
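The transformation shown above can be reproduced with `pandas.get_dummies`; this is a minimal sketch on the five-row example, with `dtype=int` passed so the columns hold 0/1 rather than booleans.

```python
import pandas as pd

df = pd.DataFrame({"Gender": ["M", "F", "M", "M", "F"]})

# One binary column per category; the prefix yields Gender_F and Gender_M.
one_hot = pd.get_dummies(df["Gender"], prefix="Gender", dtype=int)
print(one_hot)
```

The resulting columns are ordered alphabetically by category (Gender_F before Gender_M), so the column order differs from the table above even though the content is the same.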

Efficient calculation of point mutual information in the text corpus in Python

I have a corpus in which I calculate the frequency of unigrams and skipgrams, normalize the values by dividing them by the sum of all frequencies, and feed them into pandas data frames. Now I would like to calculate the pointwise mutual information (PMI) of each skipgram, which is the log of the normalized frequency of the skipgram divided by the product of the normalized frequencies of both unigrams in the skipgram.
My data frames look like this:
unigram_df.head()
word count prob
0 nordisk 1 0.000007
1 lments 1 0.000007
2 four 91 0.000593
3 travaux 1 0.000007
4 cancerestimated 1 0.000007
skipgram_df.head()
words count prob
0 (o, odds) 1 0.000002
1 (reported, pretreatment) 1 0.000002
2 (diagnosis, simply) 1 0.000002
3 (compared, sbx) 1 0.000002
4 (imaging, or) 1 0.000002
For now, I calculate the PMI values of each skipgram, by iterating through each row of skipgram_df, extracting the prob value of the skipgram, extracting prob values of both unigrams, and then calculating the log, and appending the results into the list.
The code looks like this, and it works fine:
import math

pmi_list = []
for row in skipgram_df.itertuples():
    # row[0] is the index, row[1] the (x, y) word pair, row[3] the skipgram prob
    skipgram_prob = float(row[3])
    x_unigram_prob = float(unigram_df.loc[unigram_df['word'] == str(row[1][0]), 'prob'].iloc[0])
    y_unigram_prob = float(unigram_df.loc[unigram_df['word'] == str(row[1][1]), 'prob'].iloc[0])
    pmi = math.log10(skipgram_prob/(x_unigram_prob*y_unigram_prob))
    pmi_list.append(pmi)
The problem is that it takes a long time to iterate through the whole dataframe (around 30 minutes for 300,000 skipgrams). I will have to work on corpora that are 10-20 times bigger than that, so I am looking for a more efficient way to do this. Can anyone suggest a quicker solution? Thank you.

I am also trying to solve something similar. I do not know how to improve the performance of the code, but you could parallelize it because each calculation is independent from the other.
Pandas df.iterrow() parallelization
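As an alternative to parallelising the loop (this is a different technique from the one suggested above), the per-row boolean scans of `unigram_df` could be replaced by a single lookup table plus column-wise arithmetic. A minimal sketch, assuming the column names from the question and that every word in a skipgram appears exactly once in `unigram_df`; the tiny frames here are invented purely so the example runs.

```python
import numpy as np
import pandas as pd

# Invented toy frames with the same column names as in the question.
unigram_df = pd.DataFrame({"word": ["a", "b", "c"], "prob": [0.5, 0.25, 0.25]})
skipgram_df = pd.DataFrame({
    "words": [("a", "b"), ("b", "c")],
    "prob": [0.125, 0.25],
})

# Build the word -> probability lookup once instead of filtering per row.
unigram_prob = unigram_df.set_index("word")["prob"]

# Look up both members of every skipgram in one pass each.
x_prob = skipgram_df["words"].map(lambda pair: pair[0]).map(unigram_prob)
y_prob = skipgram_df["words"].map(lambda pair: pair[1]).map(unigram_prob)

# PMI = log10( p(x, y) / (p(x) * p(y)) ), computed for the whole column.
skipgram_df["pmi"] = np.log10(skipgram_df["prob"] / (x_prob * y_prob))
print(skipgram_df)
```

Each `unigram_df['word'] == ...` comparison in the original loop scans the whole unigram table, so the loop does two full scans per skipgram; an indexed lookup avoids that. Words missing from `unigram_df` would produce NaN here rather than raising an error.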

Machine Learning: combining features into single feature

I am a beginner in machine learning. I am confused how to combine different features of a data set into one single feature.
For example, I have a data set in Python Pandas data frame of features like this:
movie unknown action adventure animation fantasy horror romance sci-fi
Toy Story 0 1 1 0 1 0 0 1
Golden Eye 0 1 0 0 0 0 1 0
Four Rooms 1 0 0 0 0 0 0 0
Get Shorty 0 0 0 1 1 0 1 0
Copy Cat 0 0 1 0 0 1 0 0
I would like to convert these n features into one single feature named "movie_genre". One solution would be to assign an integer value to each genre (unknown = 0, action = 1, adventure = 2, etc.) and create a data frame like this:
movie genre
Toy Story 1,2,4,7
Golden Eye 1,6
Four Rooms 0
Get Shorty 3,4,6
Copy Cat 2,5
But in this case the entries in the column will no longer be integer/float values. Will that affect my future steps in the machine learning process, like fitting the model and evaluating the algorithms?

Convert each series of zeros and ones into an 8-bit number:
Toy Story = 01101001
In decimal, that's 105.
Similarly, Golden Eye = 01000010 = 66.
You can do the rest manually here: http://www.binaryhexconverter.com/binary-to-decimal-converter
It's relatively straightforward to do programmatically: just loop through each label, assign it the appropriate power of two, then sum them up.
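A minimal sketch of that conversion, using the genre columns and rows from the question. The function names `genre_code` and `decode_genres` are made up for illustration; the first column is treated as the most significant bit, matching the worked example for Toy Story.

```python
GENRES = ["unknown", "action", "adventure", "animation",
          "fantasy", "horror", "romance", "sci-fi"]

movies = {
    "Toy Story":  [0, 1, 1, 0, 1, 0, 0, 1],
    "Golden Eye": [0, 1, 0, 0, 0, 0, 1, 0],
    "Four Rooms": [1, 0, 0, 0, 0, 0, 0, 0],
    "Get Shorty": [0, 0, 0, 1, 1, 0, 1, 0],
    "Copy Cat":   [0, 0, 1, 0, 0, 1, 0, 0],
}

def genre_code(flags):
    """Read the 0/1 flags as one binary number, first column = highest bit."""
    code = 0
    for flag in flags:
        code = (code << 1) | flag
    return code

def decode_genres(code):
    """Recover the genre names from an integer produced by genre_code."""
    width = len(GENRES)
    return [name for position, name in enumerate(GENRES)
            if (code >> (width - 1 - position)) & 1]

codes = {title: genre_code(flags) for title, flags in movies.items()}
print(codes)
print(decode_genres(codes["Toy Story"]))
```

The encoding is lossless, but the resulting integers are identifiers rather than quantities: two movies with close codes need not share genres, so a model that treats the column as a number may read meaning into the magnitude.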

It may be effective to leave them in their current multi-feature format and perform some sort of dimensionality reduction technique on that data.
This is very similar to a classic question: how do we treat categorical variables? One answer is one-hot or dummy encoding, which your original DataFrame is very similar to. With one-hot encoding, you start with a single, categorical feature. Using that feature, you make a column for each level, and assign a binary value to that column. The encoded result looks quite similar to what you are starting with. This sort of encoding is popular and many find it quite effective. Yours takes this one step further as each movie could be multiple genres. I'm not sure reversing that is a good idea.
Simply having more features is not always a bad thing if it is representing the data appropriately, and if you have enough observations. If you end up with a prohibitive number of features, there are many ways of reducing dimensionality. There is a wealth of knowledge on this topic out there, but one common technique is to apply principal component analysis (PCA) to a higher-dimensional dataset to find a lower-dimensional representation.
Since you're using python, you might want to check out what's available in scikit-learn for more ideas. A few resources in their documentation can be found here and here.
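To give a feel for the dimensionality-reduction route, here is a minimal sketch of PCA on the five sample movies, written directly with NumPy's SVD so that the steps are visible; scikit-learn provides a ready-made PCA as well. The function name `pca` and the choice of two components are assumptions for illustration, and five rows are far too few for the result to mean much.

```python
import numpy as np

# Genre indicator matrix from the question (rows: movies, columns: genres).
genres = np.array([
    [0, 1, 1, 0, 1, 0, 0, 1],   # Toy Story
    [0, 1, 0, 0, 0, 0, 1, 0],   # Golden Eye
    [1, 0, 0, 0, 0, 0, 0, 0],   # Four Rooms
    [0, 0, 0, 1, 1, 0, 1, 0],   # Get Shorty
    [0, 0, 1, 0, 0, 1, 0, 0],   # Copy Cat
], dtype=float)

def pca(X, n_components):
    """Project X onto its first n_components principal directions."""
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    explained = singular_values ** 2 / np.sum(singular_values ** 2)
    return centered @ components.T, components, explained[:n_components]

reduced, components, explained_ratio = pca(genres, n_components=2)
print(reduced.shape)       # one row per movie, two derived features
print(explained_ratio)     # share of variance kept by each component
```

Each movie is now described by two continuous features instead of eight binary ones, at the cost of the features no longer corresponding to a single named genre.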

One thing you can do is to make a matrix of all possible combinations and reshape it into a single vector. If you want to account for all combinations it will have the same length as the original. If there are combinations that you don't need simply don't take them into account. Your network is label-agnostic and it won't mind.
But why is that a problem? Your dataset looks small enough.
