Distribution fitting of multiple columns - Python
I am trying to find the best-fitting distribution for my data using scipy.stats. The data contains multiple columns (col_1, col_2, col_3) in a single CSV file.
The problem is that the fitting only takes a single column at a time to identify the best-fitting distributions, as I have shown in the code below.
How can I get the distribution fitting for all columns at the same time, e.g. the distribution fitting of col_1, col_2 and col_3? (A sketch of one possible approach follows the code.)
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import numpy as np
import scipy
from sklearn.preprocessing import StandardScaler
import scipy.stats
import matplotlib.pyplot as plt
# Load example data (used here in place of the CSV)
from sklearn import datasets
data_set = datasets.load_breast_cancer()
# Multiple columns of csv
col_1=data_set.data[:,0]
col_2=data_set.data[:,1]
col_3=data_set.data[:,2]
# Create an index array (x) for data
x = np.arange(len(col_1))
size = len(col_1)
plt.hist(col_1)
plt.show()
sc = StandardScaler()
yy = col_1.reshape(-1, 1)
sc.fit(yy)
y_std = sc.transform(yy)
y_std = y_std.flatten()
y_std
del yy
dist_names = ['beta',
              'expon',
              'gamma',
              'lognorm',
              'norm',
              'pearson3',
              'triang',
              'uniform',
              'weibull_min',
              'weibull_max']
# Set up empty lists to store results
chi_square = []
p_values = []
# Set up 50 bins for the chi-square test
# Observed data will be approximately evenly distributed across all bins
percentile_bins = np.linspace(0,100,51)
percentile_cutoffs = np.percentile(y_std, percentile_bins)
observed_frequency, bins = np.histogram(y_std, bins=percentile_cutoffs)
cum_observed_frequency = np.cumsum(observed_frequency)
# Loop through candidate distributions
for distribution in dist_names:
    # Set up distribution and get fitted distribution parameters
    dist = getattr(scipy.stats, distribution)
    param = dist.fit(y_std)
    # Obtain the KS test p-value, rounded to 5 decimal places
    p = scipy.stats.kstest(y_std, distribution, args=param)[1]
    p = np.around(p, 5)
    p_values.append(p)
    # Get expected counts in percentile bins
    # This is based on the cumulative distribution function (CDF)
    cdf_fitted = dist.cdf(percentile_cutoffs, *param[:-2], loc=param[-2],
                          scale=param[-1])
    expected_frequency = []
    for bin in range(len(percentile_bins) - 1):
        expected_cdf_area = cdf_fitted[bin + 1] - cdf_fitted[bin]
        expected_frequency.append(expected_cdf_area)
    # Calculate the chi-squared statistic
    expected_frequency = np.array(expected_frequency) * size
    cum_expected_frequency = np.cumsum(expected_frequency)
    ss = sum(((cum_expected_frequency - cum_observed_frequency) ** 2) / cum_observed_frequency)
    chi_square.append(ss)
# Collate results and sort by goodness of fit (best at top)
results = pd.DataFrame()
results['Distribution'] = dist_names
results['chi_square'] = chi_square
results['p_value'] = p_values
results.sort_values(['chi_square'], inplace=True)
# Report results
print('\nDistributions sorted by goodness of fit:')
print('----------------------------------------')
print(results)
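One possible way to handle all columns at once (a minimal sketch, not a confirmed solution): wrap the per-column logic above in a helper function and loop over the columns, keeping one sorted results table per column. The function name fit_distributions and the columns dictionary below are illustrative choices of mine, not part of scipy or sklearn.
def fit_distributions(values, dist_names, n_bins=50):
    # Standardise the column, exactly as done for col_1 above
    sc = StandardScaler()
    y_std = sc.fit_transform(values.reshape(-1, 1)).flatten()
    size = len(y_std)
    # Percentile-based bins so the observed counts are roughly even
    percentile_bins = np.linspace(0, 100, n_bins + 1)
    percentile_cutoffs = np.percentile(y_std, percentile_bins)
    observed_frequency, _ = np.histogram(y_std, bins=percentile_cutoffs)
    cum_observed_frequency = np.cumsum(observed_frequency)
    chi_square, p_values = [], []
    for distribution in dist_names:
        dist = getattr(scipy.stats, distribution)
        param = dist.fit(y_std)
        p_values.append(np.around(scipy.stats.kstest(y_std, distribution, args=param)[1], 5))
        cdf_fitted = dist.cdf(percentile_cutoffs, *param[:-2], loc=param[-2], scale=param[-1])
        expected_frequency = np.diff(cdf_fitted) * size
        cum_expected_frequency = np.cumsum(expected_frequency)
        chi_square.append(sum((cum_expected_frequency - cum_observed_frequency) ** 2
                              / cum_observed_frequency))
    results = pd.DataFrame({'Distribution': dist_names,
                            'chi_square': chi_square,
                            'p_value': p_values})
    return results.sort_values('chi_square')
# Fit every column and keep one results table per column
columns = {'col_1': col_1, 'col_2': col_2, 'col_3': col_3}
all_results = {name: fit_distributions(vals, dist_names) for name, vals in columns.items()}
for name, res in all_results.items():
    print('\nDistributions sorted by goodness of fit for', name)
    print(res)
If the columns come straight from the CSV as a pandas DataFrame, the same dictionary can be built with {name: df[name].values for name in df.columns}.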