Analyzing and Visualizing Similarity with Binary Data (Dimensionality Reduction) - python

I have a dataset of users and locations they are affiliated with, binary encoded as columns.
I'd like to visualize each user on a 2-D axis based on the similarity of their affiliated locations: the closer two users are in the vector space, the more similar the sets of locations they're affiliated with.
Here is an example of what I'd like to create, where each dot represents a user and users sit closer together or further apart based on their location profile.
I am trying to think through methods to collapse the location information (many dimensions) into 2 dimensions.
The ask:
Are there any techniques that are well suited for this problem?
A few ideas so far:
1) PCA (or similar): Conduct dimensionality reduction via PCA, with an eye toward techniques that work with binary features (looking into Kernel PCA)
2) Neural Network Embedding: Apply techniques similar to word embeddings to this problem. Create an embedding layer where each user is translated into a continuous vector space (which can then be reduced down to 2 dimensions).
Reproducible data below (a minimal sketch of idea 1 follows the sample output). The actual dataset is ~5k users and 50 locations, but I'd like the solution to be scalable.
import names
import pandas as pd
import numpy as np
names_list = []
for i in range(1, 100):
    single_name = names.get_full_name()
    names_list.append(single_name)
df = pd.DataFrame(names_list,columns=['Names'])
df['Var1'] = np.random.randint(0,2, size=len(df))
df['Var2'] = np.random.randint(0,2, size=len(df))
df['Var100'] = np.random.randint(0,2, size=len(df))
print(df)
#sample data
Names Var1 Var2 Var100
0 Clayton Stocks 1 1 1
1 Gary Beavers 0 0 1
2 Kristal Feagin 0 1 1
3 Crystal Barb 0 0 1
4 William Wilburn 1 0 0
.. ... ... ... ...
94 Jennifer Cool 0 0 0
95 Roberta Larsen 0 0 0
96 Malcom Mosley 1 0 1
97 Hazel Wilkins 1 1 0
98 Chanell Jaremka 1 0 1
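(As an aside, here is a minimal sketch of idea 1 applied to this sample data. It assumes plain PCA directly on the binary columns; Kernel PCA, MCA, or UMAP with a Jaccard metric could be swapped in the same way.)
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# binary location columns only (everything except Names)
X = df.drop(columns=['Names']).values

# collapse the location profile into 2 components
coords = PCA(n_components=2).fit_transform(X)

plt.scatter(coords[:, 0], coords[:, 1], s=10)
plt.xlabel('component 1')
plt.ylabel('component 2')
plt.title('Users positioned by location-profile similarity')
plt.show()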

Related

Correlation analysis with multiple data in a single cell

I have a dataset with some rows containing singular answers and others having multiple answers. Like so:
year length Animation
0 1971 121 1,2,3
1 1939 71 1,3
2 1941 7 0,2
3 1996 70 1,2,0
4 1975 71 3,2,0
With the single-answer rows I managed to create a heatmap using df.corr(), but I can't figure out the best approach for the rows with multiple answers.
I could split them and add additional columns for each answer like:
year length Animation
0 1971 121 1
1 1971 121 2
2 1971 121 3
3 1939 71 1
4 1939 71 3 ...
and then do the exact same df.corr(), or add additional Animation_01, Animation_02 ... columns, but surely there is a smarter way to work around this issue?
EDIT: Actual data snippet
You should compute a frequency table between the two categorical variables using pd.crosstab() and perform subsequent analyses based on this table. df.corr() is NOT mathematically meaningful when one of the two variables is categorical, whether or not it is encoded as a number.
N.B.1 If x is categorical but y is numerical, there are two options to describe the linkage between them:
Group y into quantiles (bins) and treat it as categorical
Perform a linear regression of y against one-hot encoded dummy variables of x
Option 2 is more precise in general, but the statistics are beyond the scope of this question. This post will focus on the case of two categorical variables.
N.B.2 For sparse matrix output please see this post.
Sample Solution
Data & Preprocessing
import pandas as pd
import io
import matplotlib.pyplot as plt
from seaborn import heatmap
df = pd.read_csv(io.StringIO("""
year length Animation
0 1971 121 1,2,3
1 1939 71 1,3
2 1941 7 0,2
3 1996 70 1,2,0
4 1975 71 3,2,0
"""), sep=r"\s+", engine="python")
# convert string to list
df["Animation"] = df["Animation"].str.split(',')
# expand list column into new rows
df = df.explode("Animation")
# (optional) cast the labels back to integers
df["Animation"] = df["Animation"].astype(int)
Frequency Table
Note: grouping of length is ignored for simplicity
ct = pd.crosstab(df["Animation"], df["length"])
print(ct)
# Out[65]:
# length 7 70 71 121
# Animation
# 0 1 1 1 0
# 1 0 1 1 1
# 2 1 1 1 1
# 3 0 0 2 1
Visualization
ax = heatmap(ct, cmap="viridis",
             yticklabels=df["Animation"].drop_duplicates().sort_values(),
             xticklabels=df["length"].drop_duplicates().sort_values(),
             )
ax.set_title("Title", fontsize=20)
plt.show()
Example Analysis
Based on the frequency table, you can ask questions about the distribution of y given a certain (subset of) x value(s), or vice versa. This should better describe the linkage between two categorical variables, as the categorical variables have no order.
For example,
Q: What length does Animation=3 produce?
A: 66.7% chance to give 71
33.3% chance to give 121
otherwise unobserved
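Those percentages can be read straight off a row-normalized version of the frequency table; a minimal sketch, continuing with the ct computed above:
# conditional distribution of length given each Animation value
cond = ct.div(ct.sum(axis=1), axis=0)
print(cond.loc[3])
# length 7: 0.000, 70: 0.000, 71: 0.667, 121: 0.333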
You want to break Animation (or Preferred_positions in your data snippet) up into a series of one-hot columns, one for every unique string in the original column. Every column will have values of either zero or one, with a one for rows where that string appeared in the original column.
First, you need to get all the unique substrings in Preferred_positions (see this answer for how to deal with a column of lists).
positions = pd.unique(df.Preferred_positions.str.split(',').sum())
Then you can create the positions columns in a loop based on whether the given position is in Preferred_positions for each row.
for position in positions:
    df[position] = df.Preferred_positions.apply(
        lambda x: 1 if position in x else 0
    )
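As a side note (not part of the original answer), if Preferred_positions is a plain comma-separated string column, pandas can build the same one-hot columns in a single call; a minimal sketch:
# one indicator column per unique position, then joined back onto df
dummies = df.Preferred_positions.str.get_dummies(sep=',')
df = df.join(dummies)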

Should I use tf.keras.utils.normalize to normalize my targets?

I am working on a machine learning regression problem where the model predicts a score.
Usually, when using a scaler for normalization, for example MinMaxScaler, you keep a reference to the scaler so you can later invert your data back to its original values.
When using tf.keras.utils.normalize, which (as far as I understand it) is an L2 normalization, on the following data:
val target
0 1 10
1 2 20
2 3 30
3 4 40
4 5 50
You get this output:
val target
0 0.13484 0.13484
1 0.26968 0.26968
2 0.40452 0.40452
3 0.53936 0.53936
4 0.67420 0.67420
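(For reference, those values are simply each entry divided by the column's L2 norm; a quick check:)
import numpy as np

vals = np.array([1, 2, 3, 4, 5], dtype=float)
print(vals / np.linalg.norm(vals))  # ≈ [0.13484 0.26968 0.40452 0.53936 0.67420]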
So I see no possible way to go back to the original series of 10,20,30,40,50
Q: If I want to invert the predicted targets back to their original scale, should I normalize the scores separately using MinMaxScaler?
Neural networks generally train better when their inputs are normalized; normalizing the inputs to the nodes in a network helps prevent the so-called vanishing (and exploding) gradient problems.
Generally, Batch Normalization is performed on the layer inputs, but it has its own drawbacks, such as slower predictions due to the extra computation. Instead, you can use any other normalization technique, as you mentioned.
In your example, instead of normalizing both the input and the target, normalize only the input, as shown below.
Dataframe:
val target
0 1 10
1 2 20
2 3 30
3 4 40
4 5 50
Normalizing input:
df["val"] = tf.keras.utils.normalize(df["val"].values,axis=-1, order=2 )[0]
Input Normalized Dataframe:
val target
0 0.13484 10
1 0.26968 20
2 0.40452 30
3 0.53936 40
4 0.67420 50
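If the target really does need scaling, the question's own MinMaxScaler idea works, because the fitted scaler can be kept around and used to invert predictions; a minimal sketch (not part of the original answer; model is hypothetical):
from sklearn.preprocessing import MinMaxScaler

target_scaler = MinMaxScaler()
# fit on the training targets and keep the scaler for later
df["target"] = target_scaler.fit_transform(df[["target"]]).ravel()

# ... train, then map predictions back to the original scale
# preds = model.predict(X)                                        # hypothetical model
# preds_original = target_scaler.inverse_transform(preds.reshape(-1, 1))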

How to replace scikit-learn (make_circles) with my own dataset?

I am trying to integrate my own dataset into scikit-learn. My dataset was categorical data that I encoded as numerical data; it has 3 columns and 100 rows. The current scikit-learn dataset is created using make_circles().
X, Y = make_circles(n_samples=n, noise=0.07, factor=0.4)
What I did:
I read my dataset using pandas.
col_names = ['Relation', 'Entity1', 'Entity2']
# load dataset
pima = pd.read_csv("encode.csv", header=None, names=col_names)
pima.head()
Current Output:
Relation Entity1 Entity2
3 0 0
0 1 2
2 9 0
3 5 3
1 4 1
2 6 0
3 3 4
But I want to project this dataset, like the make_circles() data, into a 2-dimensional space.
You have to apply dimensionality reduction to bring it down to 2 dimensions.
You can use something like PCA or UMAP.
Check this post. It should be very useful.
Using UMAP:
import umap
reduced = umap.UMAP().fit_transform(pima)
Using PCA:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
reduced = pca.fit_transform(pima)
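Either way, reduced is an (n_samples, 2) array that can be plotted directly; a minimal sketch of the visualization step, assuming matplotlib:
import matplotlib.pyplot as plt

# each row of `reduced` is one sample in the 2-D space
plt.scatter(reduced[:, 0], reduced[:, 1], s=15)
plt.xlabel("dimension 1")
plt.ylabel("dimension 2")
plt.show()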

Machine learning random forest classifier

data=pd.DataFrame({'gender':['m','f','m'],'icds':[['i10'],['i20','i30'],['i40']],'med':[[1,2,4,5],[3,4,6],[5,6,7]]})
Which machine learning algorithm should I use for this type of data? I am concerned about the inconsistent length of the arrays in the med column, which is what I want to pass to the random forest classifier; the med column is basically the labels.
Yes, you are right: a random forest should work, and logistic regression should also be fine. The issue is the inconsistent length of the data in the 'med' column. If the individual values are not needed, you can use the following functions to average or sum out the numerical data in the med column's arrays:
import numpy as np
import pandas as pd

def sum_out(x):
    return np.nansum(x)

def avg_out(x):
    return np.nanmean(x)

data = pd.DataFrame({'gender': ['m', 'f', 'm'],
                     'icds': [['i10'], ['i20', 'i30'], ['i40']],
                     'med': [[1, 2, 4, 5], [3, 4, 6], [5, 6, 7]]})
data['med_sum'] = data['med'].map(sum_out)
data['med_avg'] = data['med'].map(avg_out)
You can actually add those meds as features, something like this:
from itertools import chain

data = pd.DataFrame({'gender': ['m', 'f', 'm'],
                     'icds': [['i10'], ['i20', 'i30'], ['i40']],
                     'med': [['xanex', 'isotopin'], ['cz3', 'hicet', 't-montair'], ['t-montair', 'xanex']]})
# every unique medication across all rows
all_med = list(np.unique(list(chain.from_iterable(data['med'].values))))
for meds in all_med:
    med_list = []
    for i in range(len(data)):
        d = data['med'][i]
        if meds in d:
            med_list.append(1)
        else:
            med_list.append(0)
    data[meds] = med_list
Output:
gender icds med cz3 hicet isotopin \
0 m [i10] [xanex, isotopin] 0 0 1
1 f [i20, i30] [cz3, hicet, t-montair] 1 1 0
2 m [i40] [t-montair, xanex] 0 0 0
t-montair xanex
0 0 1
1 1 0
2 1 1
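As an aside (not from the original answer), scikit-learn's MultiLabelBinarizer builds the same multi-hot encoding in one call and could replace the manual loop above; a minimal sketch:
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
# same 0/1 indicator columns as the loop above, built in one call
med_onehot = pd.DataFrame(mlb.fit_transform(data['med']),
                          columns=mlb.classes_, index=data.index)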

Memory usage in creating Term Density Matrix from pandas dataFrame

I have a DataFrame which I save to / read from a csv file, and I want to create a Term Density Matrix DataFrame from it. Following herrfz's suggestion here, I use CountVectorizer from sklearn. I wrapped that code in a function:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from scipy.sparse import coo_matrix, csc_matrix, hstack

countvec = CountVectorizer()

def df2tdm(df, titleColumn, placementColumn):
    '''
    Takes in a DataFrame with at least two columns, and returns a DataFrame with the term density matrix
    of the words appearing in the titleColumn.
    Inputs: df, a DataFrame containing titleColumn, placementColumn among other columns
    Outputs: tdm_df, a DataFrame containing placementColumn and columns with all the words appearing in df.titleColumn
    Credits:
    https://stackoverflow.com/questions/22205845/efficient-way-to-create-term-density-matrix-from-pandas-dataframe
    '''
    tdm_df = pd.DataFrame(countvec.fit_transform(df[titleColumn]).toarray(),
                          columns=countvec.get_feature_names())
    tdm_df = tdm_df.join(pd.DataFrame(df[placementColumn]))
    return tdm_df
Which returns the TDM as a DataFrame, for example:
df = pd.DataFrame({'title': ['Delicious boiled egg', 'Fried egg ', 'Potato salad', 'Split orange', 'Something else'],
                   'page': [1, 1, 2, 3, 4]})
print(df.head())
tdm_df = df2tdm(df, 'title', 'page')
tdm_df.head()
boiled delicious egg else fried orange potato salad something \
0 1 1 1 0 0 0 0 0 0
1 0 0 1 0 1 0 0 0 0
2 0 0 0 0 0 0 1 1 0
3 0 0 0 0 0 1 0 0 0
4 0 0 0 1 0 0 0 0 1
split page
0 0 1
1 0 1
2 0 2
3 1 3
4 0 4
This implementation suffers from bad memory scaling: When I use a DataFrame which occupies 190 kB saved as utf8, the function uses ~200 MB to create the TDM dataframe. When the csv file is 600 kB, the function uses 700 MB, and when the csv is 3.8 MB the function uses up all of my memory and swap file (8 GB) and crashes.
I also made an implementation using sparse matrices and sparse DataFrames (below), but the memory usage is pretty much the same, only it is considerably slower
def df2tdm_sparse(df, titleColumn, placementColumn):
    '''
    Takes in a DataFrame with at least two columns, and returns a DataFrame with the term density matrix
    of the words appearing in the titleColumn. This implementation uses sparse DataFrames.
    Inputs: df, a DataFrame containing titleColumn, placementColumn among other columns
    Outputs: tdm_df, a DataFrame containing placementColumn and columns with all the words appearing in df.titleColumn
    Credits:
    https://stackoverflow.com/questions/22205845/efficient-way-to-create-term-density-matrix-from-pandas-dataframe
    https://stackoverflow.com/questions/17818783/populate-a-pandas-sparsedataframe-from-a-scipy-sparse-matrix
    https://stackoverflow.com/questions/6844998/is-there-an-efficient-way-of-concatenating-scipy-sparse-matrices
    '''
    pm = df[[placementColumn]].values
    tm = countvec.fit_transform(df[titleColumn])  # .toarray()
    m = csc_matrix(hstack([pm, tm]))
    dfout = pd.SparseDataFrame([pd.SparseSeries(m[i].toarray().ravel())
                                for i in np.arange(m.shape[0])])
    dfout.columns = [placementColumn] + countvec.get_feature_names()
    return dfout
Any suggestions on how to improve memory usage? I wonder if this is related to the memory issues of scikit, e.g. here
I also think that the problem might be with the conversion from sparse matrix to sparse data frame.
Try this function (or something similar):
def SparseMatrixToSparseDF(xSparseMatrix):
    import numpy as np
    import pandas as pd

    def ElementsToNA(x):
        x[x == 0] = np.nan
        return x

    xdf1 = pd.SparseDataFrame([pd.SparseSeries(ElementsToNA(xSparseMatrix[i].toarray().ravel()))
                               for i in np.arange(xSparseMatrix.shape[0])])
    return xdf1
You can check that it reduces the size by looking at the DataFrame's density:
df1.density
I hope it helps.
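As a side note (not part of the original answer), in current pandas the SparseDataFrame/SparseSeries classes have been removed; a similar result can be obtained without ever densifying by building the frame straight from the scipy sparse matrix, which is where most of the memory was going. A minimal sketch, assuming a recent pandas and scikit-learn:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

def df2tdm_sparse_modern(df, titleColumn, placementColumn):
    vec = CountVectorizer()
    tm = vec.fit_transform(df[titleColumn])  # stays a scipy sparse matrix
    tdm_df = pd.DataFrame.sparse.from_spmatrix(tm, index=df.index,
                                               columns=vec.get_feature_names_out())
    tdm_df[placementColumn] = df[placementColumn]
    return tdm_df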
