Document-term matrix dimensionality reduction in Python

I am working on text document clustering in Python, using a hierarchical clustering approach.
I have a corpus of 10k documents and have constructed a document-term matrix over a dictionary built from the terms classified as 'keywords' for the entire corpus.
The matrix (let's call it dtm) has shape [10000 x 2000] and is very sparse:
id      0   1   2   4  ...  1998  1999
0       0   0   0   1  ...     0     0
1       0   1   0   0  ...     0     1
2       1   0   0   0  ...     1     0
...    ..  ..  ..  ..  ...    ..    ..
9999    0   0   0   0  ...     0     0
I think that applying a dimensionality reduction technique could improve the precision of the clustering.
I have tried an MDS-based approach like this:
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import MDS
import pandas as pd

def select_n_components(var_ratio, goal_var: float) -> int:
    # Set initial variance explained so far
    total_variance = 0.0
    # Set initial number of features
    n_components = 0
    # For the explained variance of each feature:
    for explained_variance in var_ratio:
        # Add the explained variance to the total
        total_variance += explained_variance
        # Add one to the number of components
        n_components += 1
        # If we reach our goal level of explained variance
        if total_variance >= goal_var:
            # End the loop
            break
    # Return the number of components
    return n_components

def do_MDS(dtm):
    # scale dtm into the range [0, 1] for better variance maximization
    scl = MinMaxScaler(feature_range=(0, 1))
    data_rescaled = scl.fit_transform(dtm)
    tsvd = TruncatedSVD(n_components=data_rescaled.shape[1] - 1)
    tsvd.fit(data_rescaled)
    # List of explained variance ratios
    tsvd_var_ratios = tsvd.explained_variance_ratio_
    optimal_components = select_n_components(tsvd_var_ratios, 0.95)
    mds = MDS(n_components=optimal_components, dissimilarity="euclidean", random_state=1)
    pos = mds.fit_transform(dtm.values)
    U_df = pd.DataFrame(pos)
    U_df_transposed = U_df.T  # for consistency with the pipeline workflow, export a tdm (term-document) style matrix
    return U_df_transposed
The objective is to automatically detect an optimal number of components and apply the dimensionality reduction, but the output has not shown any tangible improvement in the clustering.
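For reference, here is a minimal sketch of how the reduced matrix could feed the hierarchical clustering step; the AgglomerativeClustering call and the cluster count are my assumptions, not part of the original pipeline:

from sklearn.cluster import AgglomerativeClustering

# Hypothetical downstream step: cluster the documents in the reduced space.
# do_MDS returns a components-by-documents frame, so transpose back to
# documents-by-components before clustering.
reduced = do_MDS(dtm).T

clusterer = AgglomerativeClustering(n_clusters=20, linkage="ward")  # n_clusters is a placeholder
labels = clusterer.fit_predict(reduced.values)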

Related

Sum the predictions of a Linear Regression from Scikit-Learn

I need to make a linear regression and sum all the predictions. Maybe this isn't a question for Scikit-Learn but for NumPy because I get an array at the end and I am unable to turn it into a float.
df
   rank  Sales
0     1  18000
1     2  17780
2     3  17870
3     4  17672
4     5  17556
from sklearn.linear_model import LinearRegression

x = df['rank'].to_numpy()
y = df['Sales'].to_numpy()
X = x.reshape(-1, 1)
regression = LinearRegression().fit(X, y)
I am getting it right up to this point. The next part (which is a while loop to sum all the values) is not working:
number_predictions = 100
x_current_prediction = 1
total_sales = 0
while x_current_prediction <= number_predictions:
    variable_sum = x_current_prediction*regression.coef_
    variable_sum_float = variable_sum.astype(np.float_)
    total_sales = total_sales + variable_sum_float
    x_current_prediction =+1
return total_sales
I think the problem is getting regression.coef_ to be a float, but even when I use astype it does not work.
You don't need to loop like this, and you don't need to use the coefficient to compute the prediction (don't forget there may be an intercept as well).
Instead, make an array of all the values of x you want to predict for, and ask sklearn for the predictions:
X_new = np.arange(1, 101).reshape(-1, 1) # X must be 2D.
y_pred = regression.predict(X_new)
If you want to add all these numbers together, use y_pred.sum() or np.sum(y_pred), or if you want a cumulative sum, np.cumsum(y_pred) will do it.
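Putting the answer together with the data from the question, a complete sketch (the column names come from the question; the 100-prediction horizon follows number_predictions):

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({'rank': [1, 2, 3, 4, 5],
                   'Sales': [18000, 17780, 17870, 17672, 17556]})

X = df['rank'].to_numpy().reshape(-1, 1)
y = df['Sales'].to_numpy()
regression = LinearRegression().fit(X, y)

# Predict ranks 1..100 and collapse the predictions into a single float.
X_new = np.arange(1, 101).reshape(-1, 1)
total_sales = float(regression.predict(X_new).sum())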

Find high correlations in a large coefficient matrix

I have a dataset with 56 numerical features. Loading it into pandas, I can easily generate a correlation coefficient matrix.
However, due to its size, I'd like to find the coefficients that are higher (or lower) than a certain threshold, e.g. > 0.8 or < -0.8, and list the corresponding pairs of variables. Is there a way to do it? I figure it would require selecting by value across all columns and then returning not the row, but the column name and row index of the value, but I have no idea how to do either!
Thanks!
I think you can do this with where and stack():
import numpy as np
import pandas as pd

np.random.seed(1)
df = pd.DataFrame(np.random.rand(10, 3))

coeff = df.corr()

# 0.3 is used for illustration
# replace with your actual value
thresh = 0.3

mask = coeff.abs().lt(thresh)
# or mask = coeff < thresh

coeff.where(mask).stack()
Output:
0 2 -0.089326
2 0 -0.089326
dtype: float64
This approach will work if you're looking to also deduplicate the correlation results.
thresh = 0.8

# get correlation matrix
df_corr = df.corr().abs().unstack()

# filter
df_corr_filt = df_corr[(df_corr > thresh) | (df_corr < -thresh)].reset_index()

# deduplicate
df_corr_filt.iloc[
    df_corr_filt[['level_0', 'level_1']]
    .apply(lambda r: ''.join(map(str, sorted(r))), axis=1)
    .drop_duplicates()
    .index
]
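As an alternative sketch for the same goal, an upper-triangle mask avoids the string-sorting deduplication step; the np.triu approach here is my suggestion, not part of the original answer:

import numpy as np
import pandas as pd

def high_corr_pairs(df: pd.DataFrame, thresh: float = 0.8) -> pd.Series:
    corr = df.corr()
    # Keep only the strict upper triangle, so each pair appears once
    # and the diagonal of 1.0 self-correlations is dropped.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs.abs() > thresh]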

Accuracy metric of a subsection of categories in Keras

I've got a 3-class classification problem. Let's define them as classes 0,1 and 2. In my case, class 0 is not important - that is, whatever gets classified as class 0 is irrelevant. What's relevant, however, is accuracy, precision, recall, and error rate only for classes 1 and 2. I would like to define an accuracy metric that only looks at a subsection of the data that relates to 1 and 2 and gives me a measure of that as the model is training. I am not asking for code for accuracy or f1 or precision/recall - those I've found and can implement myself. What I'm asking is for code that can help select a subsection of the categories to perform these metrics on.
Visually, with a confusion matrix:
Given:
     0   1   2
0   10   3   4
1    2   5   1
2    8   5   9
I would like to only perform an accuracy measure in-training for the following subset only:
     1   2
1    5   1
2    5   9
Possible idea:
Concatenate the argmaxed y_pred and argmaxed y_true, drop all instances where class 0 appears, convert them back into a one-hot array, and do a simple binary accuracy on what remains?
Edit:
I've tried to exclude the 0-class with the code below, but the result doesn't make sense: the 0-category effectively gets wrapped into the 1-category (that is, the true positives of both 0 and 1 end up being labeled as 1). Still looking for help - can anybody help out please?
# this solution does not work :(
from keras import backend as Ker

def my_acc(y_true, y_pred):
    # excluding the 0-category
    y_true_cust = y_true[:, np.r_[1:3]]
    y_pred_cust = y_pred[:, np.r_[1:3]]
    # binary accuracy source code, slightly edited
    y_pred_cat = Ker.round(y_pred_cust)
    eql_cust = Ker.equal(y_true_cust, y_pred_cust)
    return Ker.mean(eql_cust, axis=-1)
# Ashwin Geet D'Sa's worked example:

# Accuracy over all 3 categories:
correct_guesses_3cat = 10 + 5 + 9                           # 24
total_guesses_3cat = 10 + 3 + 4 + 2 + 5 + 1 + 8 + 5 + 9     # 47
accuracy_3cat = 24 / 47                                     # ~51.1 %

# Accuracy over categories 1 and 2 only:
correct_guesses_2cat = 5 + 9            # 14
total_guesses_2cat = 5 + 1 + 5 + 9      # 20
accuracy_2cat = 14 / 20                 # 70.0 %
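Building on the 2x2-submatrix idea above, here is a hedged sketch of a custom Keras metric that drops every sample whose true or predicted class is 0; the function name and implementation are my own, not code from the thread:

import tensorflow as tf

def acc_excluding_class0(y_true, y_pred):
    # y_true and y_pred are one-hot / probability vectors over 3 classes.
    true_labels = tf.argmax(y_true, axis=-1)
    pred_labels = tf.argmax(y_pred, axis=-1)
    # Keep only samples where neither the true nor the predicted class is 0,
    # mirroring the 2x2 confusion submatrix (14 / 20 in the worked example).
    keep = tf.logical_and(tf.not_equal(true_labels, 0),
                          tf.not_equal(pred_labels, 0))
    true_kept = tf.boolean_mask(true_labels, keep)
    pred_kept = tf.boolean_mask(pred_labels, keep)
    matches = tf.cast(tf.equal(true_kept, pred_kept), tf.float32)
    # Guard against batches that contain only class-0 samples.
    return tf.math.divide_no_nan(tf.reduce_sum(matches),
                                 tf.cast(tf.size(matches), tf.float32))

# model.compile(optimizer='adam', loss='categorical_crossentropy',
#               metrics=[acc_excluding_class0])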

Sampling rows in data frame with an empirical probability distribution of a variable

I have the following problem.
Assume we have a data frame with a few variables. Moreover, one variable (var_A) is a probability score, with values ranging from 0 to 1. I want to sample rows from this data frame so that rows with a higher value of var_A are more likely to be picked, so I guess I have to draw from the empirical distribution of var_A. I know how to implement the ECDF of var_A as suggested here, but I have no idea how to use this distribution for sampling rows.
Can you please help me with this?
Thanks
You can use numpy.random.choice to sample in this manner:
import numpy as np
num_dists = 4
num_samples = 10
var_A = np.random.uniform(0, 1, num_dists)
# ensure var_A sums to 1
var_A /= np.sum(var_A)
samples = np.random.choice(len(var_A), num_samples, p=var_A)
print('var_A: ', var_A)
print('samples: ', samples)
Sample output:
var_A: [ 0.23262621 0.02990421 0.22357316 0.51389642]
samples: [3 0 0 2 0 0 2 3 3 2]
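Applied to an actual data frame, the same idea looks like the sketch below; the column name var_A comes from the question, while df and the sample size are placeholders:

import numpy as np
import pandas as pd

# Placeholder data frame with a var_A probability score.
df = pd.DataFrame({'var_A': np.random.uniform(0, 1, 1000),
                   'other': np.arange(1000)})

# Normalize var_A into sampling probabilities and draw row indices.
probs = df['var_A'] / df['var_A'].sum()
idx = np.random.choice(df.index, size=100, replace=True, p=probs)
sampled = df.loc[idx]

# Equivalently, pandas can weight the draw directly:
sampled_alt = df.sample(n=100, replace=True, weights='var_A')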

Weighted data problems, mean is fine, but Covar and std look wrong, how do I adjust?

I'm trying to apply a weighted filter to the data, rather than use the raw data, before calculating stats (mean, std and covariance), but the results clearly need adjusting.
# generate some data and a filter
import numpy as np
from pandas import DataFrame

f_n = 100
np.random.seed(seed=101)
foo = np.random.rand(f_n, 3)
foo = DataFrame(foo).add(1).pct_change()
f_filter = np.arange(f_n, .0, -1)
f_filter = 1.0 / (f_filter**(f_filter/f_n))
# normalise the filter ... This could be where I'm going wrong?
f_filter = f_filter * (f_n / f_filter.sum())
Now we are ready to look at some results:
print(foo.mul(f_filter, axis=0).mean())
print(foo.mean())
0 0.039147
1 0.039013
2 0.037598
dtype: float64
0 0.035006
1 0.042244
2 0.041956
dtype: float64
The means all look in line, but when we look at the covariance and std they are significantly different in both scale and direction:
print(foo.mul(f_filter, axis=0).cov())
print(foo.cov())
          0         1         2
0  0.124766 -0.038954  0.027256
1 -0.038954  0.204269  0.056185
2  0.027256  0.056185  0.203934

          0         1         2
0  0.070063 -0.014926  0.010434
1 -0.014926  0.099249  0.015573
2  0.010434  0.015573  0.087060
print(foo.mul(f_filter, axis=0).std())
print(foo.std())
0 0.353223
1 0.451961
2 0.451590
dtype: float64
0 0.264694
1 0.315037
2 0.295060
dtype: float64
Any ideas why? How can we adjust the filter, or adjust the covariance matrix, to make the results more comparable?
The problem is your weighting function. (Do you want Gaussian random numbers or uniform random variables?) See the plot produced by the code below:
import numpy as np
import matplotlib.pyplot as plt
from pandas import DataFrame

f_n = 100
np.random.seed(seed=101)
# ??? do you want a uniform random variable, or is this just a typo and you want a normal random variable?
foo = np.random.rand(f_n, 3)
foo = DataFrame(foo)
f_filter = np.arange(f_n, .0, -1)
# here is the problem: uneven weights create an artificial trend, causing non-stationarity,
# and covariance only works for stationary data
f_filter = 1.0 / (f_filter**(f_filter/f_n))
fig, ax = plt.subplots()
ax.plot(f_filter)
Uneven weights create an artificial trend (your random numbers are all positive uniforms), which makes the data non-stationary, and covariance only works for stationary data. Take a look at the resulting weighted data.
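If the goal is weighted statistics rather than pre-multiplied data, one option (my suggestion, not from the answer above) is to keep foo unchanged and pass the filter as observation weights, for example via numpy's aweights; a sketch:

import numpy as np

# Weighted mean, std and covariance computed from the raw data `foo`
# and the weights `f_filter`, instead of multiplying the data first.
w = f_filter / f_filter.sum()
X = foo.dropna().values          # pct_change() leaves a NaN in the first row
w = w[-len(X):]                  # align weights with the rows that remain

weighted_mean = np.average(X, axis=0, weights=w)
weighted_cov = np.cov(X, rowvar=False, aweights=w)
weighted_std = np.sqrt(np.diag(weighted_cov))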
