python/pandas/sklearn: getting closest matches from pairwise_distances - python

I have a dataframe and am trying to get the closest matches using mahalanobis distance across three categories, like:
from io import StringIO
from sklearn import metrics
import pandas as pd
stringdata = StringIO(u"""pid,ratio1,pct1,rsp
0,2.9,26.7,95.073615
1,11.6,29.6,96.963660
2,0.7,37.9,97.750412
3,2.7,27.9,102.750412
4,1.2,19.9,93.750412
5,0.2,22.1,96.750412
""")
stats = ['ratio1','pct1','rsp']
df = pd.read_csv(stringdata)
d = metrics.pairwise.pairwise_distances(df[stats].as_matrix(),
metric='mahalanobis')
print(df)
print(d)
Where that pid column is a unique identifier.
What I need to do is take that ndarray returned by the pairwise_distances call and update the original dataframe so each row has some kind of list of its closest N matches (so pid 0 might have an ordered list by distance of like 2, 1, 5, 3, 4 (or whatever it actually is), but I'm totally stumped how this is done in python.

from io import StringIO
from sklearn import metrics
stringdata = StringIO(u"""pid,ratio1,pct1,rsp
0,2.9,26.7,95.073615
1,11.6,29.6,96.963660
2,0.7,37.9,97.750412
3,2.7,27.9,102.750412
4,1.2,19.9,93.750412
5,0.2,22.1,96.750412
""")
stats = ['ratio1','pct1','rsp']
df = pd.read_csv(stringdata)
dist = metrics.pairwise.pairwise_distances(df[stats].as_matrix(),
metric='mahalanobis')
dist = pd.DataFrame(dist)
ranks = np.argsort(dist, axis=1)
df["rankcol"] = ranks.apply(lambda row: ','.join(map(str, row)), axis=1)
df

Related

Why can't I replace null values in this excel sheet?

In my code, I run a t-test which sometimes yields "NaN" or "nan" when running a test on two zero value groups. I have tried making new data frames, tried replacing using .replace and also tried fillna() but nothing was successful. I get errors when also trying to define a new dataframe or read the file again after adding new calculations.
How do I replace the nulls and "nan" in these files: "significant_report2.xls" or "quant_report2.xls"
import json
import os, sys
import numpy as np
import pandas as pd
import scipy.stats
output_report = "quant_report2.xls"
significant_report = "significant_report2.xls"
output_report_writer = open(output_report, "w")
significant_writer = open(significant_report, "w")
# setup samples grouped by control and treatment
header = []
for idx in control_indices:
header.append(quant_columns[idx])
for idx in treatment_indices:
header.append(quant_columns[idx])
output_report_writer.write("Feature\t%s\tP-value\tctrl_means\tctrl_stdDev\ttx_means\ttx_stdDev\n"%"\t".join(header))
significant_writer.write("Feature\t%s\tP-value\tctrl_means\tctrl_stdDev\ttx_means\ttx_stdDev\n"%"\t".join(header))
feature_list = list(quantitative_data_frame.index)
for feature_idx in range(len(feature_list)):
feature_name = feature_list[feature_idx]
control_values = quantitative_data_frame.iloc[feature_idx, control_indices]
treatment_values = quantitative_data_frame.iloc[feature_idx, treatment_indices]
ttest_stat, ttest_pvalue = scipy.stats.ttest_ind(control_values, treatment_values, equal_var=False)
ctrl_means = scipy.mean(control_values,0)
ctrl_stdDev = scipy.stats.tstd(control_values)
tx_means= scipy.mean(treatment_values,axis=0)
tx_stdDev1 = scipy.stats.tstd(treatment_values)
output_report_writer.write("%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n"%(feature_name,
"\t".join([str(x) for x in list(control_values)]),
"\t".join([str(x) for x in list(treatment_values)]), ttest_pvalue, ctrl_means,ctrl_stdDev,tx_means,tx_stdDev))
significant_writer.write("%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n"%(feature_name,"\t".join([str(x) for x in list(control_values)]), "\t".join([str(x) for x in list(treatment_values)]),ttest_pvalue,ctrl_means,ctrl_stdDev,tx_means,tx_stdDev))

Filling out a dataframe column using parallel processing in Python

Trying to compute a value for each row of dataframe in parallel using the following code, but getting errors either when I pass individual input ranges or the combination:
#!pip install pyblaze
import itertools
import pyblaze
import pyblaze.multiprocessing as xmp
import pandas as pd
inputs = [range(2),range(2),range(3)]
inputs_list = list(itertools.product(*inputs))
Index = pd.MultiIndex.from_tuples(inputs_list,names={"a", "b", "c"})
df = pd.DataFrame(index = Index)
df['Output'] = 0
print(df)
def Addition(A,B,C):
df.loc[A,B,C]['Output']=A+B+C
return df
def parallel(inputs_list):
tokenizer = xmp.Vectorizer(Addition, num_workers=8)
return tokenizer.process(inputs_list)
parallel(inputs_list)

Replace cells in a dataframe with a range of values

I have a large dataframe that has certain cells which have values like: <25-27>. Is there a simple way to convert these into something like:25|26|27 ?
Source data frame:
import pandas as pd
import numpy as np
f = {'function':['2','<25-27>','200'],'CP':['<31-33>','210','4001']}
filter = pd.DataFrame(data=f)
filter
Output Required
output = {'function':['2','25|26|27','200'],'CP':['31|32|33','210','4001']}
op = pd.DataFrame(data=output)
op
thanks a lot !
import re
def convert_range(x):
m = re.match("<([0-9]+)+\-([0-9]+)>", x)
if m is None:
return x
s1, s2 = m.groups()
return "|".join([str(s) for s in range(int(s1), int(s2)+1)])
op = filter.applymap(convert_range)

k means on structured data using python - more than one column

how does one do k means on multiple columns in structured data ?
In the example below its been done on 1 column (name)
tfidf_matrix = tfidf_vectorizer.fit_transform(df_new['name'])
here only name is used but say we wanted to use name and country, should I be adding country to the same column as follows ?
df_new['name'] = df_new['name'] + " " + df_new['country']
tfidf_matrix = tfidf_vectorizer.fit_transform(df_new['name'])
It works from a code perspective and am still trying to understand the results (I actually have tons of columns) the data but I wonder if that is the right way to fit when there is more than one columns
import os
import pandas as pd
import re
import numpy as np
df = pd.read_csv('sample-data.csv')
def split_description(string):
# name
string_split = string.split(' - ',1)
name = string_split[0]
return name
df_new = pd.DataFrame()
df_new['name'] = df.loc[:,'description'].apply(lambda x: split_description(x))
df_new['id'] = df['id']
def remove(name):
new_name = re.sub("[0-9]", '', name)
new_name = ' '.join(new_name.split())
return new_name
df_new['name'] = df_new.loc[:,'name'].apply(lambda x: remove(x))
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(
use_idf=True,
stop_words = 'english',
ngram_range=(1,4), min_df = 0.01, max_df = 0.8)
tfidf_matrix = tfidf_vectorizer.fit_transform(df_new['name'])
print (tfidf_matrix.shape)
print (tfidf_vectorizer.get_feature_names())
from sklearn.metrics.pairwise import cosine_similarity
dist = 1.0 - cosine_similarity(tfidf_matrix)
print (dist)
from sklearn.cluster import KMeans
num_clusters = range(1,20)
KM = [KMeans(n_clusters=k, random_state = 1).fit(tfidf_matrix) for k in num_clusters]
No, that is an incorrect way to fit multiple columns. You are basically simply jamming together multiple features together and expecting it to behave correctly as if kmeans was applied on these multiple columns as separate features.
You need to use other methods like Vectorizor and Pipelines along with tfidifVectorizor to do this on multiple columns. You can check out this link for more information.
Additionally, you can check out this answer for a possible alternate solution to your problem.

Pandas assign each row the mean of its bin

I have the following dataframe (p1.head(7)):
ColA
0 6.286333
1 3.317000
2 13.24889
3 26.20667
4 26.25556
5 60.59000
6 79.59000
7 1.361111
I can get the bin ranges using:
pandas.qcut(p1.ColA, 4)
Is there a way I can create a new column where each value corresponds to the mean value of the bin? I.e for each bin, (a,b], I want (a+b)/2
The key here is the retbins option on qcut.
import pandas
df = pandas.DataFrame(np.random.random(100)*100, columns=['val1'])
pctiles = pandas.qcut(df['val1'],4,retbins=True)
pctile_object = pctiles[0]
pctile_boundaries = pctiles[1]
Here pctile_object is just what qcut would return if you hadn't passed retbins=True, and pctile_boundaries is a numpy array of the interval boundaries.
import numpy
bin_halfway = pctile_boundaries[:-1] + (numpy.diff(pctile_boundaries)/2)
This gives us the halfway points of the bins.
Now we make a dataframe with just the interval names (as strings) and the halfway points.
df2 = pandas.DataFrame({'quartile boundaries': pctile_object.levels,
'midway point': bin_halfway})
Finally, merge the bin halfway points back into the original dataframe.
df['quartile boundaries'] = pctile_object
pandas.merge(df,df2,on='quartile boundaries')
Then you can drop quartile boundaries if you want.
I wrote a function to utilize #exp1orer 's logic:
def midway_quantiles(feature_series,q=4):
import pandas as pd
pctiles = pd.qcut(feature_series,q,retbins=True)
pctile_object = pctiles[0]
df1= pd.DataFrame({"feature":feature_series,"q_bound": pctile_object})
pctile_boundaries = pctiles[1]
import numpy as np
bin_halfway = pctile_boundaries[:-1] + (np.diff(pctile_boundaries)/2)
df2 = pd.DataFrame({"q_bound": pctile_object.cat.categories,
"midpoint": bin_halfway})
df3=pd.merge(df1,df2,on="q_bound",how="left")
return df3["midpoint"]

Categories

Resources