k-means on structured data using Python - more than one column

How does one do k-means on multiple columns in structured data?
In the example below it's been done on one column (name):
tfidf_matrix = tfidf_vectorizer.fit_transform(df_new['name'])
Here only name is used, but say we wanted to use name and country, should I be adding country to the same column as follows?
df_new['name'] = df_new['name'] + " " + df_new['country']
tfidf_matrix = tfidf_vectorizer.fit_transform(df_new['name'])
It works from a code perspective and I am still trying to understand the results (I actually have tons of columns), but I wonder whether that is the right way to fit when there is more than one column.
import os
import pandas as pd
import re
import numpy as np

df = pd.read_csv('sample-data.csv')

def split_description(string):
    # name
    string_split = string.split(' - ', 1)
    name = string_split[0]
    return name

df_new = pd.DataFrame()
df_new['name'] = df.loc[:, 'description'].apply(lambda x: split_description(x))
df_new['id'] = df['id']

def remove(name):
    new_name = re.sub("[0-9]", '', name)
    new_name = ' '.join(new_name.split())
    return new_name

df_new['name'] = df_new.loc[:, 'name'].apply(lambda x: remove(x))

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(
    use_idf=True,
    stop_words='english',
    ngram_range=(1, 4), min_df=0.01, max_df=0.8)
tfidf_matrix = tfidf_vectorizer.fit_transform(df_new['name'])
print(tfidf_matrix.shape)
print(tfidf_vectorizer.get_feature_names())

from sklearn.metrics.pairwise import cosine_similarity
dist = 1.0 - cosine_similarity(tfidf_matrix)
print(dist)

from sklearn.cluster import KMeans
num_clusters = range(1, 20)
KM = [KMeans(n_clusters=k, random_state=1).fit(tfidf_matrix) for k in num_clusters]

No, that is not the right way to fit on multiple columns. You are simply jamming multiple features into one string and expecting it to behave as if k-means had been applied to those columns as separate features.
You need to use other methods, such as combining separate vectorizers in a Pipeline along with TfidfVectorizer, to do this on multiple columns. You can check out this link for more information.
Additionally, you can check out this answer for a possible alternative solution to your problem.
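As an illustration of that idea, here is a minimal sketch that gives each text column its own TfidfVectorizer and combines them with scikit-learn's ColumnTransformer before clustering. The column names ('name', 'country'), the toy data, and the choice of two clusters are assumptions for illustration, not part of the question:
# Minimal sketch (assumed column names and toy data): one TfidfVectorizer per
# text column, combined with a ColumnTransformer, then clustered with KMeans.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

df_new = pd.DataFrame({
    'name': ['acme widgets', 'globex parts', 'initech tools'],
    'country': ['US', 'DE', 'US'],
})

# each column gets its own vocabulary and idf weights
features = ColumnTransformer([
    ('name_tfidf', TfidfVectorizer(), 'name'),
    ('country_tfidf', TfidfVectorizer(), 'country'),
])

pipe = Pipeline([
    ('features', features),
    ('kmeans', KMeans(n_clusters=2, random_state=1, n_init=10)),
])

labels = pipe.fit_predict(df_new)
print(labels)
Keeping the columns separate this way gives each one its own vocabulary and idf weighting, instead of mixing country tokens into the name vocabulary as the concatenation approach does.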

Related

Why can't I replace null values in this excel sheet?

In my code, I run a t-test which sometimes yields "NaN" or "nan" when running a test on two zero-value groups. I have tried making new data frames, tried replacing with .replace, and also tried fillna(), but nothing was successful. I also get errors when trying to define a new dataframe or read the file again after adding new calculations.
How do I replace the nulls and "nan" in these files: "significant_report2.xls" or "quant_report2.xls"?
import json
import os, sys
import numpy as np
import pandas as pd
import scipy.stats

# control_indices, treatment_indices, quant_columns and quantitative_data_frame
# are defined elsewhere in the original script
output_report = "quant_report2.xls"
significant_report = "significant_report2.xls"
output_report_writer = open(output_report, "w")
significant_writer = open(significant_report, "w")

# set up samples grouped by control and treatment
header = []
for idx in control_indices:
    header.append(quant_columns[idx])
for idx in treatment_indices:
    header.append(quant_columns[idx])

output_report_writer.write("Feature\t%s\tP-value\tctrl_means\tctrl_stdDev\ttx_means\ttx_stdDev\n" % "\t".join(header))
significant_writer.write("Feature\t%s\tP-value\tctrl_means\tctrl_stdDev\ttx_means\ttx_stdDev\n" % "\t".join(header))

feature_list = list(quantitative_data_frame.index)
for feature_idx in range(len(feature_list)):
    feature_name = feature_list[feature_idx]
    control_values = quantitative_data_frame.iloc[feature_idx, control_indices]
    treatment_values = quantitative_data_frame.iloc[feature_idx, treatment_indices]
    ttest_stat, ttest_pvalue = scipy.stats.ttest_ind(control_values, treatment_values, equal_var=False)
    ctrl_means = np.mean(control_values, axis=0)
    ctrl_stdDev = scipy.stats.tstd(control_values)
    tx_means = np.mean(treatment_values, axis=0)
    tx_stdDev = scipy.stats.tstd(treatment_values)
    output_report_writer.write("%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n" % (feature_name,
        "\t".join([str(x) for x in list(control_values)]),
        "\t".join([str(x) for x in list(treatment_values)]),
        ttest_pvalue, ctrl_means, ctrl_stdDev, tx_means, tx_stdDev))
    significant_writer.write("%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n" % (feature_name,
        "\t".join([str(x) for x in list(control_values)]),
        "\t".join([str(x) for x in list(treatment_values)]),
        ttest_pvalue, ctrl_means, ctrl_stdDev, tx_means, tx_stdDev))
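A minimal sketch of one way to clean the finished reports, assuming NaN (and the literal string "nan") should simply become 0; note that the files are tab-separated text despite the .xls extension:
# Sketch under the assumption that NaN should be replaced with 0.
import pandas as pd

for report in ("quant_report2.xls", "significant_report2.xls"):
    cleaned = pd.read_csv(report, sep="\t")
    # catch both real NaN values and the literal strings "nan"/"NaN"
    cleaned = cleaned.replace(["nan", "NaN"], pd.NA).fillna(0)
    cleaned.to_csv(report, sep="\t", index=False)
Alternatively, a NaN p-value could be replaced right after scipy.stats.ttest_ind, before the row is ever written.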

For loop looping multiple times

At the request of many, I have simplified the problem as far as I can imagine (my imagination doesn't go that far), and I think it's reproducible as well. The two different Excel files I've been using are called "first apples.xlsx" and "second apples.xlsx". I've been using the following data:
import os
import numpy as np
import pandas as pd
import glob
#%%
path = os.getcwd() + r"\apples"
file_locations = glob.glob(path + r"\*.xlsx")
#%%
df = {}
for i, file in enumerate(file_locations):
    df[i] = pd.read_excel(file, usecols=['Description', 'Price'])
#%%
price_standard_apple = []
price_special_apple = []
special_apple_description = ['Golden', 'Diamond', 'Titanium']
#%%
for file in range(len(df)):
    df_description = pd.DataFrame(df[file].iloc[:, -2])
    df_prices = pd.DataFrame(df[file].iloc[:, -1])
    for description in df_description['Description']:
        if description in special_apple_description:
            description_index = df_description.loc[df_description['Description'] == description].index.values
            price = df_prices['Price'].iloc[description_index]
            price_sum = np.sum(price)
            price_special_apple.append(price_sum)
        elif description not in special_apple_description:
            description_index = df_description.loc[df_description['Description'] == description].index.values
            price = df_prices['Price'].iloc[description_index]
            price_sum = np.sum(price)
            price_standard_apple.append(price_sum)
I would expect the sum of the red-colored cells (the special apples, so to speak) to be 97 and that of the standard apples to be 224. This is not the case, and the problem seems to be in the second loop. Python prints the following values: standard: 234, special: 209.
I think you are making this harder than it needs to be.
Given your test data from above you can simply do:
import pandas as pd
test_data = pd.read_csv(r'path\to\file', sep=',')
special = ['Golden', 'Diamond', 'Titanium']
mask = test_data['Description'].isin(special)
specials_price = sum(test_data[mask]['Price']) # -> 97
other = sum(test_data[~mask]['Price']) # -> 224
This is what test_data looks like:
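Applied to the multi-file setup from the question, the same isin() mask might look roughly like this; the apples folder and column names are taken from the question, and the expected totals are only the ones the question mentions:
# Sketch: the isin() mask applied per Excel file, accumulating the two totals.
import glob
import os
import pandas as pd

special = ['Golden', 'Diamond', 'Titanium']
price_special_apple = 0
price_standard_apple = 0

for file in glob.glob(os.path.join(os.getcwd(), "apples", "*.xlsx")):
    data = pd.read_excel(file, usecols=['Description', 'Price'])
    mask = data['Description'].isin(special)
    price_special_apple += data.loc[mask, 'Price'].sum()
    price_standard_apple += data.loc[~mask, 'Price'].sum()

print(price_special_apple, price_standard_apple)  # expected: 97 and 224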
I think the break statement would be useful in your case. It would allow you to break out of the inner for loop when you've retrieved the value you wanted, continuing the outer for loop.
Also covered in this question: Python: Continuing to next iteration in outer loop

How to use groupby and CountVectorizer() together in a pandas DataFrame?

I have this sample data in a CSV file. I want to create feature vectors for the 'Questions' and 'Replies' columns using the bag-of-words method (CountVectorizer()) and then calculate the cosine similarity between each question and its replies.
So far I have this Python code:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

topFeaturesValueListColumns = ['cosinSimilarityIpostRpost', 'Class']
topFeaturesValueList = []
featureVectorsPD = pd.DataFrame()

df = pd.read_csv("test1.csv", usecols=['ThreadID', 'Title', 'UserID_inipst', 'Questions', 'UserID', 'Replies', 'Class'])
df = pd.DataFrame(df)
df = df.apply(lambda x: x.astype(str).str.lower())
for column in df:
    # 'stop' is a stop-word list defined elsewhere in the original script
    df[column] = df[column].apply(lambda x: " ".join(x for x in x.split() if x not in stop))

cv = CountVectorizer()
features = cv.fit(df['Title'] + ' ' + df['UserID_inipst'] + ' ' + df['Questions'] + ' ' + df['UserID'] + ' ' + df['Replies'])
print(features.vocabulary_)

featureVectorsPD['Questions'] = cv.transform(df['Questions']).toarray().tolist()
featureVectorsPD['Replies'] = cv.transform(df['Replies']).toarray().tolist()
featureVectorsPD['Class'] = df['Class']

for i in range(len(featureVectorsPD)):
    q = np.array([featureVectorsPD['Questions'][i]])
    r = np.array([featureVectorsPD['Replies'][i]])
    label = featureVectorsPD['Class'][i]
    res = cosine_similarity(q, r, dense_output=True)
    res = float(np.asscalar(res[0]))
    row = [res, label]
    topFeaturesValueList.append(row)

topQDFValuesPD = pd.DataFrame(topFeaturesValueList, columns=topFeaturesValueListColumns)
print(topQDFValuesPD)
The problem in this code is that the line
features = cv.fit(df['Questions'] + ' ' + df['Replies'])
creates a word dictionary (features.vocabulary_) from the whole "Questions" and "Replies" columns, but my requirement is to calculate the "vocabulary" for each thread individually and then create feature vectors based on that individual dictionary. In other words, whenever the value in the "ThreadID" column changes, a new vocabulary should be created.
I think the "groupby" function should be used here, but how? I hope the question is clear.
Please help me. I will be very thankful to you.
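A minimal sketch of the groupby idea, fitting one CountVectorizer per ThreadID group so each thread gets its own vocabulary; the toy data and the per-group fitting are assumptions about what "per-thread vocabulary" should mean here:
# Sketch: one CountVectorizer (and vocabulary) per ThreadID group, then
# question/reply cosine similarity computed inside the group.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

df = pd.DataFrame({
    'ThreadID': [1, 1, 2],
    'Questions': ['how to cluster text', 'how to cluster text', 'replace nan values'],
    'Replies': ['use kmeans on tfidf', 'try hierarchical clustering', 'use fillna'],
})

rows = []
for thread_id, group in df.groupby('ThreadID'):
    cv = CountVectorizer()
    # vocabulary built only from this thread's questions and replies
    cv.fit(pd.concat([group['Questions'], group['Replies']]))
    q_vecs = cv.transform(group['Questions'])
    r_vecs = cv.transform(group['Replies'])
    for i in range(len(group)):
        rows.append({'ThreadID': thread_id,
                     'cosine': cosine_similarity(q_vecs[i], r_vecs[i])[0, 0]})

print(pd.DataFrame(rows))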

How do I apply CountVectorizer to each row in a dataframe?

I have a dataframe, say df, which has 3 columns. Columns A and B contain strings. Column C is a numeric variable.
Dataframe
I want to convert this to a feature matrix by passing it to a CountVectorizer.
I define my countVectorizer as:
cv = CountVectorizer(input='content', encoding='iso-8859-1',
                     decode_error='ignore', analyzer='word',
                     ngram_range=(1), tokenizer=my_tokenizer, stop_words='english',
                     binary=True)
Next I pass the entire dataframe to cv.fit_transform(df) which doesn't work.
I get this error:
cannot unpack non-iterable int object
Next I convert each row of the dataframe to
sample = pdt_items["A"] + "," + pdt_items["C"].astype(str) + "," + pdt_items["B"]
Then I apply
cv_m = sample.apply(lambda row: cv.fit_transform(row))
I still get error:
ValueError: Iterable over raw text documents expected, string object received.
Please let me know where I am going wrong, or if I need to take some other approach?
Try this:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
A = ['very good day', 'a random thought', 'maybe like this']
B = ['so fast and slow', 'the meaning of this', 'here you go']
C = [1, 2, 3]
pdt_items = pd.DataFrame({'A':A,'B':B,'C':C})
cv = CountVectorizer()
# use pd.DataFrame here to avoid your error and add your column name
sample = pd.DataFrame(pdt_items['A']+','+pdt_items['B']+','+pdt_items['C'].astype('str'), columns=['Output'])
vectorized = cv.fit_transform(sample['Output'])
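To sanity-check the result, the learned vocabulary and the dense count matrix can be printed (continuing from the snippet above, and assuming a recent scikit-learn for get_feature_names_out):
# Optional check, continuing from cv and vectorized above.
print(cv.get_feature_names_out())  # learned vocabulary
print(vectorized.toarray())        # one count vector per dataframe row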
With the help of @QuantStats's comment, I applied the cv on each row of the dataframe as follows:
row_input = df['column_name'].tolist()
kwds = []
for i in range(len(row_input)):
    cell_input = [row_input[i]]
    full_set = row_keywords(cell_input, 1, 1)
    candidates = [x for x in full_set if x[1] > 1]  # to extract frequencies more than 1
    kwds.append(candidates)
kwds_col = pd.Series(kwds)
df['Keywords'] = kwds_col
("row_keywords" is a function for CountVectorizer.)

python/pandas/sklearn: getting closest matches from pairwise_distances

I have a dataframe and am trying to get the closest matches using Mahalanobis distance across three columns, like:
from io import StringIO
from sklearn import metrics
import pandas as pd
stringdata = StringIO(u"""pid,ratio1,pct1,rsp
0,2.9,26.7,95.073615
1,11.6,29.6,96.963660
2,0.7,37.9,97.750412
3,2.7,27.9,102.750412
4,1.2,19.9,93.750412
5,0.2,22.1,96.750412
""")
stats = ['ratio1','pct1','rsp']
df = pd.read_csv(stringdata)
d = metrics.pairwise.pairwise_distances(df[stats].as_matrix(),
                                        metric='mahalanobis')
print(df)
print(d)
Where that pid column is a unique identifier.
What I need to do is take the ndarray returned by the pairwise_distances call and update the original dataframe so each row has some kind of list of its closest N matches (so pid 0 might have an ordered list by distance like 2, 1, 5, 3, 4, or whatever it actually is), but I'm totally stumped as to how this is done in Python.
from io import StringIO
import numpy as np
import pandas as pd
from sklearn import metrics

stringdata = StringIO(u"""pid,ratio1,pct1,rsp
0,2.9,26.7,95.073615
1,11.6,29.6,96.963660
2,0.7,37.9,97.750412
3,2.7,27.9,102.750412
4,1.2,19.9,93.750412
5,0.2,22.1,96.750412
""")

stats = ['ratio1', 'pct1', 'rsp']
df = pd.read_csv(stringdata)
dist = metrics.pairwise.pairwise_distances(df[stats].to_numpy(),
                                           metric='mahalanobis')
dist = pd.DataFrame(dist)
# argsort each row so the closest pids come first (position 0 is the row itself)
ranks = pd.DataFrame(np.argsort(dist.to_numpy(), axis=1))
df["rankcol"] = ranks.apply(lambda row: ','.join(map(str, row)), axis=1)
df
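As a follow-up, continuing from the df and dist defined above, this keeps only the N closest other rows (each row's zero-distance match with itself is skipped); N = 3 is just an illustrative choice:
# Follow-up sketch: the N closest *other* pids per row; column 0 of the
# argsort is the row itself, so it is skipped.
N = 3  # illustrative
nearest = np.argsort(dist.to_numpy(), axis=1)[:, 1:N + 1]
df["top_matches"] = [','.join(df['pid'].iloc[idx].astype(str)) for idx in nearest]
print(df[['pid', 'top_matches']])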
