I have a sparse matrix
from scipy.sparse import csr_matrix
M = csr_matrix((data_np, (rows_np, columns_np)))
then I'm clustering it this way
from sklearn.cluster import KMeans
km = KMeans(n_clusters=n, init='random', max_iter=100, n_init=1, verbose=1)
km.fit(M)
My question is very basic: how do I print the clustering result without any extra information? I don't care about plotting or distances; I just need the clustered rows, looking like this:
Cluster 1
row 1
row 2
row 3
Cluster 2
row 4
row 20
row 1000
...
How can I get it? Excuse me for this question.
Time to help myself. After
km.fit(M)
we run
labels = km.predict(M)
which returns labels, a numpy.ndarray. The number of elements in this array equals the number of rows, and each element says which cluster the corresponding row belongs to. (Note that km.labels_ already holds the same labels for the data the model was fitted on, so the extra predict call is optional.)
For example: if the first element is 5, it means that row 1 belongs to cluster 5.
Let's put our rows into a dictionary of lists looking like this: {cluster_number: [row1, row2, row3], ...}
# row_dict stores the actual meaning of each row; in my case these are Russian words
clusters = {}
for n, label in enumerate(labels):
    if label in clusters:
        clusters[label].append(row_dict[n])
    else:
        clusters[label] = [row_dict[n]]
and print the result
for item in clusters:
    print("Cluster", item)
    for i in clusters[item]:
        print(i)
Update:
You can do it the following way:
# data = the data to be clustered, retrieved however you like
# model = the KMeans model fitted on the data
# cluster = the cluster labels produced by the model
from sklearn.cluster import KMeans
from sklearn.preprocessing import scale

data = clusteredData()  # placeholder: load the data you want to cluster
model = KMeans(n_clusters=5, init='random', max_iter=100, n_init=1, verbose=1)
cluster = model.fit_predict(scale(data))
dictionary = {}
for index in range(len(data)):
    if cluster[index] in dictionary:
        dictionary[cluster[index]].append(data[index])
    else:
        dictionary[cluster[index]] = [data[index]]  # start a new list for this cluster
This creates a dictionary with the cluster number as the key and the list of data points belonging to that cluster as the value.
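For completeness, a minimal sketch that prints exactly the "Cluster / row" layout asked for, working directly from the fitted model's labels_ attribute (it assumes only the km object fitted above):
import numpy as np

# group row indices by cluster label straight from the fitted model
for cluster_id in np.unique(km.labels_):
    print("Cluster", cluster_id)
    for row_idx in np.where(km.labels_ == cluster_id)[0]:
        print("row", row_idx)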
Related
I want to calculate skill co-occurrence so that it can be used as the edges for constructing a graph. I have a column skill that contains the list of skills for each id in the dataset; I want to calculate the co-occurrence counts and use them as edge weights. The data format is:
skill
Product Management,RPM,Progress 4GL,IP,CAMEL,Prince2 Foundation,Continuous Integration,GSM(HLR,MSC),Programming,SS7,INAP,ClearCase,SS7 protocol,Software Development,Shell Scripting,GPRS(SGSN,GGSN),MySQL,VOIP,Linux,Agile,SIP,Diameter,Test,Oracle,Software
User Experience,Interaction Design,3D rendering,Event,Team,Graphic Design,Engineering,User Experience Design,Sales,3D Modeling,Product Marketing,Employee Training,business plan,3D,Business Development,Creative Problem Solving,Product Design,renewable energy,Electronics,news paper,Project Management,Product Development,Social Enterprise
The above are the skill lists of two ids in the dataset.
Now I want my output in the format of three columns (source, target, and weight count), so that I can use them for graph construction in the next step:
Source_elt Target_elt WeightCount
Can anyone share insights that would be helpful? My end goal is to use this weight count for community detection.
I am using the following code for the co-occurrence calculation:
import itertools
from collections import Counter

import numpy as np
import pandas as pd

document = nested_list
# unique job titles
fnc_names = unique_jobtitle

# Get a list of all of the combinations you have
expanded = [tuple(itertools.combinations(d, 2)) for d in document]
expanded = itertools.chain(*expanded)
# Sort the combinations so that A,B and B,A are treated the same
expanded = [tuple(sorted(d)) for d in expanded]
# count the combinations
c = Counter(expanded)

# initialize an NxN matrix with zeros
table = np.zeros((len(fnc_names), len(fnc_names)), dtype=int)
for i, v1 in enumerate(fnc_names):
    for j, v2 in enumerate(fnc_names[i:]):
        j = j + i
        table[i, j] = c[v1, v2]
        table[j, i] = c[v1, v2]

df_cooccMatrix = pd.DataFrame(table, index=fnc_names, columns=fnc_names)
df_cooccMatrix.head()
and later, for the weight count:
# Assign the count as the edge weight
weight_cout = df_cooccMatrix.stack()
weight_cout = pd.DataFrame(weight_cout.rename_axis(('Source_elt', 'Target_elt')).reset_index(name='WeightCount'))
#weight_cout.sort_values(by=['WeightCount'], inplace=True, ascending=False)
#weight_cout.head(10)
But when I calculate the weight count I get a memory error:
MemoryError: Unable to allocate 6.25 GiB for an array with shape (1678704784,) and data type int32
Can anyone help me solve this issue?
Thanks in advance
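A note on where the memory goes: stacking the full NxN co-occurrence matrix materializes len(fnc_names)**2 entries at once, which is what triggers the allocation error. One way to sidestep that is to build the three-column edge list straight from the Counter instead of the dense matrix; a minimal sketch, reusing only the Counter c computed above:
import pandas as pd

# each key in c is an already-sorted (skill_a, skill_b) pair; the value is its co-occurrence count
edges = pd.DataFrame(
    [(a, b, count) for (a, b), count in c.items()],
    columns=["Source_elt", "Target_elt", "WeightCount"],
)
edges = edges.sort_values(by="WeightCount", ascending=False)
print(edges.head(10))
This only stores pairs that actually co-occur, so it stays far smaller than the stacked NxN table.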
In my dataframe highlighting product sales on the internet, I have a column that contains the description of each product sold.
I would like to create an algorithm to check whether the combination and/or redundancy of words correlates with the number of sales.
But I would like to be able to filter out words that are too generic, such as the product type. For example, my dataframe deals with the sale of wines, so the algorithm must not take the word "wine" in the description into account.
In my df I have 700 rows consisting of 4 columns:
product_id: id for each product
product_price: product price
total_sales: total number of product sales
product_description: product description (e.g.: "Fruity wine, perfect as a starter"; "Dry and full-bodied wine"; "Fresh and perfect wine as a starter"; "Wine combining strength and character"; "Wine with a ruby color, full-bodied "; etc...)
Edit:
I added:
the column 'CA': the total sales by product * the product's price
an example of my df
My DataFrame example:
import pandas as pd

data = {'Product_id': [1, 2, 3, 4, 5],
        'Price': [24, 13.5, 12.9, 34, 26],
        'Total_sales': [28, 34, 29, 42, 10],
        'CA': [672, 459, 374.1, 1428, 260],
        'Product_description': ["Fruity wine, perfect as a starter",
                                "Dry and full-bodied wine",
                                "Fresh and perfect wine as a starter",
                                "Wine combining strength and character",
                                "Wine with a ruby color, full-bodied "]}
df = pd.DataFrame(data)
df
Edit 2:
I want to find out whether certain words (and/or combinations of words) have an impact on the number of sales. I thought I could create a heatmap with the distinct values of my ["total_sales"] column on the y-axis and a list of the most-used words from the ["Product_description"] column on the x-axis. I thought an ANOVA, or perhaps a Chi-square test, could help me verify a correlation between these two variables...
My action process:
find the number of unique values in my ["total_sales"] column; I have 43 different ones.
create a list stopwords = [list of redundant words (e.g. 'the', 'by', etc.)]
split the words of all my lines for the column ["description"]
wordslist = df["description"].str.split()
I can't manage to filter the results in my wordslist variable with stopwords:
comp = re.compile('|'.join(stopwords))
z = [re.sub(comp, '', i).strip() for i in wordslist]
print(z)
I get
TypeError: expected string or bytes-like object
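The error most likely comes from the fact that df["description"].str.split() returns a Series of lists, so each i handed to re.sub is a list rather than a string. A minimal sketch of a token-by-token filter instead (it assumes the wordslist Series and stopwords list defined above):
# wordslist holds one list of tokens per row, so filter each list element-wise
filtered = wordslist.apply(lambda tokens: [w for w in tokens if w.lower() not in stopwords])
print(filtered)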
After that I intend to get the frequency of each word in the df["description"] column.
The words with a significant frequency should appear on the x-axis of my heatmap, with the sales numbers on the y-axis.
Is this a good method (provided I find the solution to my error) to check whether the use of a word or combination of words has an impact on the sales of a product?
Could you please give me some hints?
Edit 3:
Thanks to @maaniB for the great help; I took a big step towards the final solution, but I still have a little way to go. Here is where I am:
I'm French, so for the cleaning method with stop_words I replaced nltk with spaCy.
import re
from spacy.lang.fr.stop_words import STOP_WORDS as fr_stop
from spacy.lang.en.stop_words import STOP_WORDS as en_stop
# to lowercase
df['Product_description'] = df['Product_description'].str.lower()
# replace_stop_characters
stop_chars = [
"\\/",
"\\[",
"\\]",
"\\:",
"\\|",
'\\"',
"\\?",
"\\<",
"\\>",
"\\,",
"\\(",
"\\)",
"\\\\",
"\\.",
"\\+",
"\\-",
"\\!",
"\\$",
"\\`",
"\\،",
"\\_",
]
stop_chars_pattern = "|".join(stop_chars)
df['Product_description'] = df.apply(
    lambda row: re.sub(stop_chars_pattern, "", row["Product_description"]),
    axis=1
)
# replace stop words using the two lists
stop_words = list(fr_stop) + list(en_stop)
stop_words.extend(['wine'])  # extend the list as you wish
df['Product_description'] = df['Product_description'].map(
    lambda x: ' '.join([w for w in x.split() if w not in stop_words])
)
print(df)
To extract the features I tried both CountVectorizer and TfidfVectorizer (I had confused the latter with TfidfTransformer), and I got better results with TfidfVectorizer:
from sklearn.feature_extraction.text import TfidfVectorizer
# change the ngram_range to make combinations of words
tfidf_vector = TfidfVectorizer(stop_words=stop_words,
                               ngram_range=(1, 4),
                               encoding="utf-8")
tpl_cntvec = tfidf_vector.fit_transform(df_produits_en_ligne['post_excerpt'])
df_cntvec = pd.DataFrame(tpl_cntvec.toarray(),
                         columns=tfidf_vector.get_feature_names(),
                         index=df_produits_en_ligne.index)
df_total_bow = pd.concat([df_produits_en_ligne['total_sales'], df_cntvec],
                         axis=1)
df_total_bow
And I'm stuck at the last step. I tried @maaniB's version with ordinary least squares:
import statsmodels.api as sm
# Here, I used ordinary least square regression method
x = df_total_bow[df_total_bow.drop('total_sales', 1).columns].to_numpy()
y = df_total_bow['total_sales'].to_numpy()
ols = sm.OLS(y, x)
fit = ols.fit()
print(fit.summary())
To run it and get a result in a Jupyter notebook, I had to change --NotebookApp.iopub_data_rate_limit from the command line:
jupyter notebook --NotebookApp.iopub_data_rate_limit=1.0e10
It finished after 3 minutes of processing, but I'm totally lost with the result: it returned 46987 lines and I don't know how to interpret them.
Here is a screenshot of my results.
Could someone please explain to me how to interpret it?
I tried another method, but after an hour of processing without a result I cancelled it:
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
# define dataset
x = df_total_bow[df_total_bow.drop('total_sales', 1).columns].to_numpy()
y = df_total_bow['total_sales'].to_numpy()
# create pipeline
rfe = RFE(estimator=DecisionTreeClassifier(), n_features_to_select=5)
model = DecisionTreeClassifier()
pipeline = Pipeline(steps=[('s',rfe),('m',model)])
# evaluate model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(pipeline, x, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
Edit 4:
I tried, in vain, to make a heatmap with df_total_bow:
import seaborn as sns
import matplotlib.pyplot as plt
tx = df_total_bow[df_total_bow.drop('total_sales', 1).columns].to_numpy()
ty = df_total_bow['total_sales'].to_numpy()
n = len(df_produits_en_ligne)
indep = tx.dot(ty) / n
c = df_total_bow.fillna(0)
measure = (c - indep)**2 / indep
xi_n = measure.sum().sum()
table = measure / xi_n
sns.heatmap(table.iloc[:-1, :-1], annot=c.iloc[:-1, :-1])
plt.show()
But I get
ValueError: shapes (714,46987) and (714,) not aligned: 46987 (dim 1) != 714 (dim 0)
Your question is a combination of text mining tasks, which I will try to briefly address here. The first step is, as always in NLP and text mining projects, cleaning: removing stop words, stop characters, etc.:
import re
import pandas as pd
from nltk.corpus import stopwords
# to lowercase
df['Product_description'] = df['Product_description'].str.lower()
# replace_stop_characters
stop_chars = [
"\\/",
"\\[",
"\\]",
"\\:",
"\\|",
'\\"',
"\\?",
"\\<",
"\\>",
"\\,",
"\\(",
"\\)",
"\\\\",
"\\.",
"\\+",
"\\-",
"\\!",
"\\$",
"\\`",
"\\،",
"\\_",
]
stop_chars_pattern = "|".join(stop_chars)
df['Product_description'] = df.apply(
    lambda row: re.sub(stop_chars_pattern, "", row["Product_description"]),
    axis=1
)
# replace stop words
stop_words = stopwords.words('english')
stop_words.extend(['wine'])  # extend the list as you wish
df['Product_description'] = df['Product_description'].map(
    lambda x: ' '.join([w for w in x.split() if w not in stop_words])
)
print(df)
# Product_id Price Total_sales CA Product_description
# 0 1 24.0 28 672.0 fruity perfect starter
# 1 2 13.5 34 459.0 dry fullbodied
# 2 3 12.9 29 374.1 fresh perfect starter
# 3 4 34.0 42 1428.0 combining strength character
# 4 5 26.0 10 260.0 ruby color fullbodied
Next, you need to extract features (you mentioned count of words, phrases).
from sklearn.feature_extraction.text import CountVectorizer
# change the ngram_range to make combinations of words
count_vector = CountVectorizer(ngram_range=(1, 4), encoding="utf-8")
tpl_cntvec = count_vector.fit_transform(df['Product_description'])
df_cntvec = pd.DataFrame(
tpl_cntvec.toarray(), columns=count_vector.get_feature_names(), index=df.index
)
df_total_bow = pd.concat([df['Total_sales'], df_cntvec], axis = 1)
df_total_bow
#    Total_sales  character  color  color fullbodied  combining  ...  ruby color  ruby color fullbodied  starter  strength  strength character
# 0           28          0      0                 0          0  ...           0                      0        1         0                   0
# 1           34          0      0                 0          0  ...           0                      0        0         0                   0
# 2           29          0      0                 0          0  ...           0                      0        1         0                   0
# 3           42          1      0                 0          1  ...           0                      0        0         1                   1
# 4           10          0      1                 1          0  ...           1                      1        0         0                   0
Finally, you can make your models on the data:
import statsmodels.api as sm
# Here, I used ordinary least square regression method
x = df_total_bow[df_total_bow.drop('Total_sales', 1).columns].to_numpy()
y = df_total_bow['Total_sales'].to_numpy()
ols = sm.OLS(y, x)
fit = ols.fit()
print(fit.summary())
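As an optional follow-up, a sketch that maps the fitted coefficients back to the n-gram names and ranks them, which makes the summary easier to read (it assumes the count_vector and fit objects from the code above):
import pandas as pd

# pair each n-gram with its fitted coefficient and rank by absolute effect size
coefs = pd.Series(fit.params, index=count_vector.get_feature_names())
print(coefs.reindex(coefs.abs().sort_values(ascending=False).index).head(10))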
Regarding your other questions:
There are various statistical methods to find the importance of words in a text and their correlation with some other variable. CountVectorizer is just a simple feature_extraction method; there are better ones, such as TfidfTransformer.
The type of statistical test or model depends on the problem. Since you just need to find the correlation of word combinations with the sales statistics, simple regression-based methods with feature extraction are helpful. To rank features (i.e. find the word combinations with the highest correlation and significance), recursive feature elimination (sklearn.feature_selection.RFE) might be practical; a sketch follows below.
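A minimal sketch of that RFE idea (it reuses the x, y arrays and count_vector built above; the plain linear regressor is an assumption you may want to swap for something better suited to your data):
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# keep the 10 n-grams that recursive feature elimination ranks as most useful for predicting sales
selector = RFE(estimator=LinearRegression(), n_features_to_select=10, step=0.1)
selector.fit(x, y)

selected = [name for name, keep in zip(count_vector.get_feature_names(), selector.support_) if keep]
print(selected)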
I have the function below, which performs sentiment analysis on a phrase and returns a tuple (sentiment, NB classifier probability), like (sadness, 0.78).
I want to apply this function to a pandas dataframe column df.Message and then create two more columns, df.Sentiment and df.Prob.
The code is below:
def avalia(teste):
    testeStemming = []
    stemmer = nltk.stem.RSLPStemmer()
    for palavras_treinamento in teste.split():
        comStem = [p for p in palavras_treinamento.split()]
        testeStemming.append(str(stemmer.stem(comStem[0])))
    novo = extrator_palavras(testeStemming)
    distribuicao = classificador.prob_classify(novo)
    classe_array = [(classe, distribuicao.prob(classe)) for classe in distribuicao.samples()]
    inverse = [(value, key) for key, value in classe_array]
    max_key = max(inverse)[1]
    for each in classe_array:
        if each[0] == max_key:
            a = each[0]  # returns the sentiment
            b = each[1]  # returns the probability
            # print(each)
    return a, b
Example on a single string:
avalia('i am sad today!')
(sadness, 0.98)
Now I have a dataframe with 13k rows and one column: Message.
I can apply my function to the dataframe column and get a pandas.Series like:
0 (surpresa, 0.27992165905522154)
1 (medo, 0.5632686358414051)
2 (surpresa, 0.2799216590552195)
3 (alegria, 0.5429940754962914)
I want to use this information to create 2 new columns in the same dataframe, like below:
Message Sentiment Probability
0 I am sad surpresa 0.2799
1 I am happy medo 0.56
I can't get this last part done. Any help, please?
Try returning both values at the end of the function, and saving them into separate columns with an apply():
def avalia(teste):
    testeStemming = []
    stemmer = nltk.stem.RSLPStemmer()
    for palavras_treinamento in teste.split():
        comStem = [p for p in palavras_treinamento.split()]
        testeStemming.append(str(stemmer.stem(comStem[0])))
    novo = extrator_palavras(testeStemming)
    distribuicao = classificador.prob_classify(novo)
    classe_array = [(classe, distribuicao.prob(classe)) for classe in distribuicao.samples()]
    inverse = [(value, key) for key, value in classe_array]
    max_key = max(inverse)[1]
    for each in classe_array:
        if each[0] == max_key:
            a = each[0]  # returns the sentiment
            b = each[1]  # returns the probability
    return a, b
results = df['Message'].apply(avalia)
df[['Sentiment', 'Prob']] = pd.DataFrame(results.tolist(), index=df.index)
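An equivalent one-liner, sketched under the same assumptions, unpacks the returned tuples with zip instead:
# unzip the (sentiment, probability) tuples into two columns in one pass
df['Sentiment'], df['Prob'] = zip(*df['Message'].apply(avalia))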
My dataframe urm has a shape of (96438, 3):
user_id anime_id user_rating
0 1 20 7.808497
1 3 20 8.000000
2 5 20 6.000000
3 6 20 7.808497
4 10 20 7.808497
I'm trying to build an item-user rating matrix:
import numpy as np

X = urm[["user_id", "anime_id"]].as_matrix()
y = urm["user_rating"].values
n_u = len(urm["user_id"].unique())
n_m = len(urm["anime_id"].unique())
R = np.zeros((n_u, n_m))
for idx, row in enumerate(X):
    R[row[0]-1, row[1]-1] = y[idx]
If the code succeeds, the matrix looks like this (I filled NaN with 0):
with user_id as the index, anime_id as the columns and the rating as the value (I got this matrix from a pivot_table).
In some tutorials this works, but here I get an
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-278-0e06bd0f3133> in <module>()
15 R = np.zeros((n_u, n_m))
16 for idx, row in enumerate(X):
---> 17 R[row[0]-1, row[1]-1] = y[idx]
IndexError: index 5276 is out of bounds for axis 1 with size 5143
I tried dennlinger's second suggestion and it worked for me.
This is the code I wrote:
def id_to_index(df):
    """
    Maps the values to the lowest consecutive values.
    :param df: pandas DataFrame with columns user, item, rating
    :return: pandas DataFrame with the extra columns index_item and index_user
    """
    index_item = np.arange(0, len(df.item.unique()))
    index_user = np.arange(0, len(df.user.unique()))
    df_item_index = pd.DataFrame(df.item.unique(), columns=["item"])
    df_item_index["new_index"] = index_item
    df_user_index = pd.DataFrame(df.user.unique(), columns=["user"])
    df_user_index["new_index"] = index_user
    df["index_item"] = df["item"].map(df_item_index.set_index('item')["new_index"]).fillna(0)
    df["index_user"] = df["user"].map(df_user_index.set_index('user')["new_index"]).fillna(0)
    return df
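For what it's worth, a more compact sketch of the same re-indexing using pd.factorize (assuming the same user and item column names):
# factorize returns consecutive integer codes starting at 0, one per unique value
df["index_user"], _ = pd.factorize(df["user"])
df["index_item"], _ = pd.factorize(df["item"])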
I am assuming you have non-consecutive user IDs (or movie IDs), which means that there exist indices that either have
no rating, or
no movie
In your case, you are setting up your matrix dimensions on the assumption that every value will be consecutive (since you are defining the dimensions by the number of unique values), which causes some non-consecutive values to reach out of bounds.
In that case, you have two options:
You can define your matrix to be of size urm["user_id"].max() by urm["anime_id"].max()
Create a dictionary that maps your values to the lowest consecutive values.
The disadvantage of the first approach is obviously that it requires you to store a bigger matrix. Also, you can use scipy.sparse to create a matrix from the data format you have (commonly referred to as the coordinate matrix format).
Potentially, you can do something like this:
from scipy import sparse

# scipy expects the data in the form (values, (row_indices, col_indices))
mat = sparse.coo_matrix((urm["user_rating"], (urm["user_id"], urm["anime_id"])))
# if you want it as a dense matrix
dense_mat = mat.todense()
You can then also work your way to the second suggestion, as I have previously asked here
I have a couple of files 100 MB each. The format for those files looks like this:
0 1 2 5 8 67 9 122
1 4 5 2 5 8
0 2 1 5 6
.....
(note: the actual file does not have the alignment spaces added in; only one space separates each element. The alignment was added for aesthetic effect.)
The first element in each row is its binary classification, and the rest of the row are the indices of the features whose value is 1. For instance, the third row says that its second, first, fifth and sixth features are 1 and the rest are zeros.
I tried to read each line from each file and use sparse.coo_matrix to create a sparse matrix, like this:
import numpy as np
from scipy import sparse
from scipy.io import mmwrite

for train in train_files:
    with open(train) as f:
        row = []
        col = []
        for index, line in enumerate(f):
            record = line.rstrip().split(' ')
            row = row + [index] * (len(record) - 4)
            col = col + record[4:]
        row = np.array(row)
        col = np.array(col, dtype=int)  # feature indices must be integers
        data = np.array([1] * len(row))
        mtx = sparse.coo_matrix((data, (row, col)), shape=(n_row, max_feature))
        mmwrite(train + 'trans', mtx)
but this took forever to finish. I started reading the data at night and let the computer run after I went to sleep, and when I woke up it still hadn't finished the first file!
What are the better ways to process this kind of data?
I think this would be a bit faster than your method because it does not read the file line by line. You can try this code with a small portion of one file and compare it with your code.
This code also requires knowing the number of features in advance. If we don't know it, we would need the extra line of code that is commented out below.
import pandas as pd
from scipy.sparse import lil_matrix
from functools import partial

def writeMx(result, row):
    # zero-based matrix requires the feature number minus 1
    col_ind = (row.dropna().values - 1).astype(int)
    # Assign values without duplicating row index and values
    result[row.name, col_ind] = 1

def fileToMx(f):
    # number of features
    col_n = 136
    df = pd.read_csv(f, names=list(range(0, col_n + 2)), sep=' ')
    # This is the label of the binary classification
    label = df.pop(0)
    # Or get the feature number with the line below,
    # but it would not be the same across different files
    # col_n = df.max().max()
    # Number of rows
    row_n = len(label)
    # Generate the feature matrix for one file
    result = lil_matrix((row_n, col_n))
    # Save the features in the matrix
    # DataFrame.apply() is usually faster than normal looping
    df.apply(partial(writeMx, result), axis=1)  # apply row by row so row.name is the row index
    return result
for train in train_files:
    # result is the sparse matrix you can further save or use
    result = fileToMx(train)
    print(result.shape, result.nnz)
    # The shape of the matrix and the number of nonzero values,
    # e.g. ((420, 136), 15)
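If you also want to persist each matrix the way the original attempt did, a small sketch (it assumes scipy.io.mmwrite as before and would sit inside the loop above):
from scipy.io import mmwrite

# convert the lil_matrix to COO before writing it in Matrix Market format
mmwrite(train + 'trans', result.tocoo())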