Cutting outliers in a histogram (Python)

I would like to know whether there is a method that tells me how long my x-axis should be. My data contains various outliers. I can simply cut them off with plt.xlim(), but is there a statistical method to compute a sensible x-axis limit? In the attached picture, a logical cut would be after 150 km driven distance. Computing that cut-off threshold automatically would be perfect.
The dataframe passed to the function is a standard pandas DataFrame.
Code:
def yearly_distribution(dataframe):
    df_distr = dataframe
    h = sorted(df_distr['Distance'])
    l = len(h)
    fig, ax = plt.subplots(figsize=(16, 9))
    binwidth = np.arange(0, 501, 0.5)
    n, bins, patches = plt.hist(h, bins=binwidth, normed=1, facecolor='#023d6b', alpha=0.5, histtype='bar')
    lnspc = np.arange(0, 500.5, 0.5)
    gevfit = gev.fit(h)
    pdf_gev = gev.pdf(lnspc, *gevfit)
    plt.plot(lnspc, pdf_gev, label="GEV")
    logfit = stats.lognorm.fit(h)
    pdf_lognorm = stats.lognorm.pdf(lnspc, *logfit)
    plt.plot(lnspc, pdf_lognorm, label="LogNormal")
    weibfit = stats.weibull_min.fit(h)
    pdf_weib = stats.weibull_min.pdf(lnspc, *weibfit)
    plt.plot(lnspc, pdf_weib, label="Weibull")
    burrfit = stats.burr.fit(h)
    pdf_burr = stats.burr.pdf(lnspc, *burrfit)
    plt.plot(lnspc, pdf_burr, label="Burr Distribution")
    genparetofit = stats.genpareto.fit(h)
    pdf_genpareto = stats.genpareto.pdf(lnspc, *genparetofit)
    plt.plot(lnspc, pdf_genpareto, label="Generalized Pareto")
    myarray = np.array(h)
    clf = GMM(8, n_iter=500, random_state=3)
    myarray.shape = (myarray.shape[0], 1)
    clf = clf.fit(myarray)
    lnspc.shape = (lnspc.shape[0], 1)
    pdf_gmm = np.exp(clf.score(lnspc))
    plt.plot(lnspc, pdf_gmm, label="GMM")
    plt.xlim(0, 500)
    plt.xlabel('Distance')
    plt.ylabel('Probability')
    plt.title('Histogram')
    plt.ylim(0, 0.05)

You should remove the outliers from your data before doing any plotting or fitting:
h=sorted(df_distr['Distance'])
out_threshold= 150.0
h=[i for i in h if i<out_threshold]
EDIT
This may not be the fastest way, but using numpy.std():
# std of the data mirrored around zero, i.e. spread measured about 0 rather than the mean
out_threshold = 2.0 * np.std(h + [-a for a in h])
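Putting it together, a minimal sketch of how that threshold could be computed and then reused as the x-axis limit (this assumes df_distr is the dataframe from the question; the commented percentile line is only an alternative suggestion, not part of the original answer):
import numpy as np
import matplotlib.pyplot as plt

h = sorted(df_distr['Distance'])

# threshold suggested above: 2 * std of the data mirrored around zero
out_threshold = 2.0 * np.std(h + [-a for a in h])
# alternative, quantile-based cut: keep everything below e.g. the 99th percentile
# out_threshold = np.percentile(h, 99)

h = [i for i in h if i < out_threshold]

plt.hist(h, bins=np.arange(0, out_threshold + 0.5, 0.5), density=True)
plt.xlim(0, out_threshold)
plt.show()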

Related

How to set widgets to link to arrays in Jupyter notebooks

I am trying to set up an interactive notebook that plots some interpolated GPS data. I have the plotting working by itself, but I am trying to use the IPython widgets to make it more interactive for others.
Currently, my plotting looks like this:
def create_grid(array, spacing=.01):
    '''
    creates evenly spaced grid from the min and max of an array
    '''
    grid = np.arange(np.amin(array), np.amax(array), spacing)
    return grid

def interpolate(x, y, z, grid_spacing=.01, model='spherical', returngrid=False):
    '''Interpolates z value and uses create_grid to create a grid of values based on min and max of x and y'''
    grid_x = create_grid(x, spacing=grid_spacing)
    grid_y = create_grid(y, spacing=grid_spacing)
    OK = OrdinaryKriging(x, y, z, variogram_model=model, verbose=False,
                         enable_plotting=False, nlags=20)
    z1, ss1 = OK.execute('grid', grid_x, grid_y, mask=False)
    print('Interpolation Complete')
    vals = np.ma.getdata(z1)
    sigma = np.ma.getdata(ss1)
    if returngrid == False:
        return vals, sigma
    else:
        return vals, sigma, grid_x, grid_y

mesh_x, mesh_y = np.meshgrid(grid_x, grid_y)
plot = plt.scatter(mesh_x, mesh_y, c=z1, cmap=cm.hsv)
cb = plt.colorbar(plot)
cb.set_label('Northing Change')
plt.show()
This works currently, but I am trying to set up a widget to change the variogram model in the kriging interpolation, as well as change the field to be interpolated.
Currently, to do that I have:
def update_plot(zfield, variogram):
    plt.clf()
    z1, ss1, grid_x, grid_y = interpolate(lon, lat, zfield, returngrid=True, model=variogram)
    mesh_x, mesh_y = np.meshgrid(grid_x, grid_y)
    plot = plt.scatter(mesh_x, mesh_y, c=z1, cmap=cm.hsv)
    cb = plot.colorbar(plot)
    cb.set_label('Interpolated Value')

variogram = widgets.Dropdown(options=['linear', 'power', 'gaussian', 'spherical', 'exponential', 'hole-effect'],
                             value='spherical', description="Variogram model for interpolation")
zfield = widgets.Dropdown(options={'Delta N': delta_n, 'Delta E': delta_e, 'Delta V': delta_v}, value='Delta N',
                          description='Interpolated value')
widgets.interactive(update_plot, variogram=variogram, zfield=zfield)
Which brings up the error
TraitError: Invalid selection: value not found
The values delta_n, delta_e and delta_v are numpy arrays. I have tried looking at the documentation, but it is not as detailed as, say, matplotlib's, so I feel like I am flying blind here.
Thank you
In this line, you specify the possible values of the Dropdown as:
zfield = widgets.Dropdown(options = {'Delta N':delta_n, 'Delta E': delta_e,'Delta V':delta_v}
When a mapping is used, the values of the dict are interpreted as the possible options. So value = 'Delta N' causes an error as this is not one of the possible values of the Dropdown (although it is one of the keys in the mapping dict). I believe you want value = delta_n instead.
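Following that suggestion, a minimal sketch of the corrected widget setup (assuming delta_n, delta_e and delta_v are the numpy arrays from the question and update_plot is defined as above):
import ipywidgets as widgets

variogram = widgets.Dropdown(options=['linear', 'power', 'gaussian', 'spherical', 'exponential', 'hole-effect'],
                             value='spherical', description='Variogram model for interpolation')
zfield = widgets.Dropdown(options={'Delta N': delta_n, 'Delta E': delta_e, 'Delta V': delta_v},
                          value=delta_n,  # a value of the mapping, not one of its keys
                          description='Interpolated value')
widgets.interactive(update_plot, variogram=variogram, zfield=zfield)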

Why does my function seem to integrate and not differentiate? (pywt.cwt)

I am really confused by the function pywt.cwt, as I've not been able to get it to work. The function seems to integrate instead of differentiating. I would like it to work like this example: Example CWT, but my graph looks like this: My CWT. The idea is to integrate the raw signal (av) with cumtrapz, then differentiate with a Gaussian CWT (=> S1), and then differentiate once more with a Gaussian CWT (=> S2).
As you can see in the pictures, the bottom peaks of the red line should line up with the valleys, but they land under the top peaks for me, and the green line should move 1/4 of a period to the left but moves to the right... which makes me think it integrates for some reason.
I currently have no idea what causes this... Does anyone happen to know what is going on?
Thanks in advance!
#Get data from pandas
av = dfRange['y']
#remove gravity & turns av right way up
av = av - dfRange['y'].mean()
av = av * -1
#Filter
[b,a] = signal.butter(4, [0.9/(55.2/2), 20/(55.2/2)], 'bandpass')
av = signal.filtfilt(b,a, av)
#Integrate and differentiate av => S1
integrated_av = integrate.cumtrapz(av)
[CWT_av1, frequency1] = pywt.cwt(integrated_av, 8.8 , 'gaus1', 1/55.2)
CWT_av1 = CWT_av1[0]
CWT_av1 = CWT_av1 * 0.05
#differentiate S1 => S2
[CWT_av2, frequency2] = pywt.cwt(CWT_av1, 8.8 , 'gaus1', 1/55.2)
CWT_av2 = CWT_av2[0]
CWT_av2 = CWT_av2 * 0.8
#Find Peaks
inv_CWT_av1 = CWT_av1 * -1
av1_min, _ = signal.find_peaks(inv_CWT_av1)
av2_max, _ = signal.find_peaks(CWT_av2)
#Plot
plt.style.use('seaborn')
plt.figure(figsize=(25, 7), dpi = 300)
plt.plot_date(dfRange['recorded_naive'], av, linestyle = 'solid', marker = None, color = 'steelblue')
plt.plot_date(dfRange['recorded_naive'][:-1], CWT_av1[:], linestyle = 'solid', marker = None, color = 'red')
plt.plot(dfRange['recorded_naive'].iloc[av1_min], CWT_av1[av1_min], "ob", color = 'red')
plt.plot_date(dfRange['recorded_naive'][:-1], CWT_av2[:], linestyle = 'solid', marker = None, color = 'green')
plt.plot(dfRange['recorded_naive'].iloc[av2_max], CWT_av2[av2_max], "ob", color = 'green')
plt.gcf().autofmt_xdate()
plt.show()
I'm not sure this is your answer, but an observation from playing with pywt...
From the documentation the wavelets are basically given by the differentials of a Gaussian but there is an order dependent normalisation constant.
Plotting the differentials of a Gaussian against the wavelets (extracted by putting in an impulse response) gives the following:
The interesting observation is that the order dependent normalisation constant sometimes seems to include a '-1'. In particular, it does for the first order gaus1.
So, my question is, could you actually have differentiation as you expect, but also multiplication by -1?
Code for the graph:
import numpy as np
import matplotlib.pyplot as plt
import pywt

dt = 0.01
t = dt * np.arange(100)

# Calculate the differentials of a gaussian by quadrature:
# start with the gaussian y = exp(-(x - x_0) ^ 2 / dt)
ctr = t[len(t) // 2]
gaus = np.exp(-np.power(t - ctr, 2) / dt)
gaus_quad = [np.gradient(gaus, dt)]
for i in range(7):
    gaus_quad.append(np.gradient(gaus_quad[-1], dt))

# Extract the wavelets using the impulse half way through the dataset
y = np.zeros(len(t))
y[len(t) // 2] = 1
gaus_cwt = list()
for i in range(1, 9):
    cwt, cwt_f = pywt.cwt(y, 10, f'gaus{i}', dt)
    gaus_cwt.append(cwt[0])

fig, axs = plt.subplots(4, 2)
for i, ax in enumerate(axs.flatten()):
    ax.plot(t, gaus_cwt[i] / np.max(np.abs(gaus_cwt[i])))
    ax.plot(t, gaus_quad[i] / np.max(np.abs(gaus_quad[i])))
    ax.set_title(f'gaus {i+1}', x=0.2, y=1.0, pad=-14)
    ax.axhline(0, c='k')
    ax.set_xticks([])
    ax.set_yticks([])
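If that is the case, a quick (purely speculative) check on the original data would be to flip the sign of the gaus1 output and see whether the peaks then line up as expected:
# hypothetical check: undo the suspected -1 factor in the 'gaus1' normalisation
coefs, freqs = pywt.cwt(integrated_av, 8.8, 'gaus1', 1/55.2)
CWT_av1 = -coefs[0] * 0.05   # same scaling as in the question, sign flipped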

How to calculate precision-recall curves for semantic (image) segmentation?

I am performing semantic segmentation (with materials as classes) on images, and wish to calculate precision-recall curves of my accuracy. Currently, I calculate the true positives, false positives and false negatives for each class by summing the number of pixels for which ground truth and prediction agree with that class, for which only the prediction agrees, and for which only the ground truth agrees, respectively. Then I calculate precision and recall accordingly:
pixel_probs = np.array(pixel_probs)  # shape (num_pixels), the classification certainty for each pixel
pixel_labels_pred, pixel_labels_gt = np.array(pixel_labels_pred).astype(bool), np.array(pixel_labels_gt).astype(bool)  # shape (num_pixels, num_classes), one-hot labels for each pixel

precision_mat, recall_mat = np.array([]).reshape(num_labels, 0), np.array([]).reshape(num_labels, 0)  # stores the precision-recall pairs for each certainty threshold

prev_num_pixels = sum(pixel_probs > 0.0)
for threshold in sorted(thresholds):
    pixel_mask = pixel_probs > threshold
    if sum(pixel_mask) == prev_num_pixels: continue
    prev_num_pixels == sum(pixel_mask)

    pixel_labels_pred_msk = pixel_labels_pred[pixel_mask]
    pixel_labels_gt_msk = pixel_labels_gt[pixel_mask]

    tps = np.sum(np.logical_and(pixel_labels_gt_msk, pixel_labels_pred_msk), axis=0)
    fps = np.sum(np.logical_and(np.logical_not(pixel_labels_gt_msk), pixel_labels_pred_msk), axis=0)
    fns = np.sum(np.logical_and(pixel_labels_gt_msk, np.logical_not(pixel_labels_pred_msk)), axis=0)

    precisions = tps / (tps + fps)
    recalls = tps / (tps + fns)

    precision_mat = np.concatenate([precision_mat, np.expand_dims(precisions, axis=-1)], axis=-1)
    recall_mat = np.concatenate([recall_mat, np.expand_dims(recalls, axis=-1)], axis=-1)

fig = plt.figure()
fig.set_size_inches(12, 5)

for label_index in range(precision_mat.shape[0]):
    r = recall_mat[label_index]
    p = precision_mat[label_index]
    sort_order = np.argsort(r)
    r = r[sort_order]
    p = p[sort_order]
    plt.plot(r, p, '-o', markersize=2, label=labels[label_index])

plt.title("Precision-recall curve")
plt.legend(loc='upper left', fontsize=8.5, ncol=1, bbox_to_anchor=(1, 1))
plt.xlabel('recall', fontsize=12)
plt.ylabel('precision', fontsize=12)
plt.savefig(dir + "test/pr_curves.png")
However, this produces some very strange looking graphs:
It is true that my segmentator is performing rather horribly, but I would at least expect the curves to follow more or less a downward slope.
Am I calculating my PR-curve correctly? Are there alternative ways of calculating such curves that I should consider? Is there perhaps a bug in my plotting code?
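If per-class scores are available (one score per class per pixel, rather than only the certainty of the predicted class), one common alternative is to compute the curves per class with scikit-learn's precision_recall_curve. A rough sketch, where class_scores is a hypothetical (num_pixels, num_classes) array of per-class scores and the other names are taken from the code above:
from sklearn.metrics import precision_recall_curve
import matplotlib.pyplot as plt

for label_index in range(num_labels):
    y_true = pixel_labels_gt[:, label_index]   # one-hot ground truth for this class
    y_score = class_scores[:, label_index]     # hypothetical per-class score
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    plt.plot(recall, precision, '-o', markersize=2, label=labels[label_index])

plt.xlabel('recall')
plt.ylabel('precision')
plt.legend()
plt.show()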

How to fix this likelihood in Python and plot it?

For a linear logistic regression model formulated as follows:
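The likelihood, as implemented in the code below, is the standard Bernoulli logistic likelihood
$$L(\beta) = \prod_i p_i^{y_i}(1 - p_i)^{1 - y_i} = \frac{\exp\left(\sum_i y_i \, x_i^T \beta\right)}{\prod_i \left(1 + \exp(x_i^T \beta)\right)}, \qquad p_i = \frac{1}{1 + \exp(-x_i^T \beta)}.$$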
I was trying to code the likelihood and plot it to see what it looks like. This is how I've implemented it:
# Define likelihood function
def likelihood(beta):
    # reshape correctly
    beta = np.array(beta).reshape(-1, 1)
    # X beta has x_i^T * \beta in every entry
    Xbeta = np.matmul(X, beta).flatten()
    exp_argument = np.multiply(newrows.y.values, Xbeta)
    numerator = np.exp(exp_argument.sum())
    denominator = np.prod((1 + np.exp(Xbeta)))
    return numerator / denominator
# Prepare a grid
b1 = np.linspace(-10, -8.5, 100) # these values chosen by trial and error
b2 = np.linspace(4.5, 6.2, 100) # these values chosen by trial and error
B1, B2 = np.meshgrid(b1, b2)
# Evaluate function at the gridpoints
Zbeta = np.array([likelihood(thing) for thing in zip(B1.ravel(), B2.ravel())])
Zbeta = Zbeta.reshape(B1.shape)
# Surface plot
fig = plt.figure(figsize=(15, 6))
ax = fig.add_subplot(121, projection='3d')
ax.plot_surface(B1, B2, Zbeta)
ax.set_xlabel('x')
ax.set_ylabel('y')
ax = fig.add_subplot(122)
ax.contour(B1, B2, Zbeta)
plt.show()
Here the matrix X contains "fake" Bernoulli observations. It comes from a well-known dataset called BeetleMortality, which looks like this and which I've stored in a Pandas DataFrame called data.
These are basically Binomial observations; I've created Bernoulli observations from them as follows:
newrows = np.zeros((data.NumberExposed.sum(), 2))
for index, row in data.iterrows():
    obs = np.zeros(int(row.NumberExposed), dtype=int)
    obs[:int(row.NumberKilled)] = 1
    np.random.shuffle(obs)
    obs = obs.reshape(-1, 1)
    doses = np.repeat(row.Dose, len(obs)).reshape(-1, 1)
    obs = np.hstack((doses, obs))
    from_index = int(row.CumulativeN - row.NumberExposed)
    to_index = int(row.CumulativeN)
    newrows[from_index:to_index] = obs

# Save it as a dataframe
newrows = pd.DataFrame(newrows, columns=['x', 'y'])
# Notice that if you wanted to check that these calculations are correct use: newrows.groupby('x').sum()
newrows.head()
Everything seems to work fine, except that when I try to plot this I obtain the following plot:
Which makes absolutely no sense.. Any help? Maybe I'm implementing the likelihood function in the wrong way?
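One thing that might be worth checking (a guess, not a confirmed diagnosis): with several hundred Bernoulli factors, the product in the denominator can easily overflow in floating point, which would flatten the plotted surface. A rough sketch of the same quantity computed in log space, using the X and newrows defined above:
import numpy as np

def log_likelihood(beta):
    # same model as likelihood() above, evaluated in log space to avoid overflow
    beta = np.array(beta).reshape(-1, 1)
    Xbeta = np.matmul(X, beta).flatten()
    return np.sum(newrows.y.values * Xbeta) - np.sum(np.logaddexp(0.0, Xbeta))

# The log-likelihood (or np.exp of it after subtracting its maximum over the grid)
# can then be evaluated on the same (B1, B2) grid for plotting.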

Find an easier way to cluster 2-d scatter data into grid array data

I have figured out a method to cluster dispersed point data into a structured 2-d array (like a rasterize function), and I hope there are better ways to achieve that target.
My work
1. Intro
1000 data points with three properties (lon, lat, emission), each representing one factory located at (x, y) that emits a certain amount of CO2 into the atmosphere
grid network: a predefined 2-d array with shape 20x20
http://i4.tietuku.com/02fbaf32d2f09fff.png
The code reproduced here:
#### define the map area
xc1,xc2,yc1,yc2 = 113.49805889531724,115.5030664238035,37.39995194888143,38.789235929357105
map = Basemap(llcrnrlon=xc1,llcrnrlat=yc1,urcrnrlon=xc2,urcrnrlat=yc2)
#### reading the point data and scatter plot by their position
df = pd.read_csv("xxxxx.csv")
px,py = map(df.lon, df.lat)
map.scatter(px, py, color = "red", s= 5,zorder =3)
#### predefine the grid networks
lon_grid,lat_grid = np.linspace(xc1,xc2,21), np.linspace(yc1,yc2,21)
lon_x,lat_y = np.meshgrid(lon_grid,lat_grid)
grids = np.zeros(20*20).reshape(20,20)
plt.pcolormesh(lon_x,lat_y,grids,cmap = 'gray', facecolor = 'none',edgecolor = 'k',zorder=3)
2. My target
Find the nearest grid point for each factory
Add the emission data to that grid point
3. Algorithm realization
3.1 Raster grid
note: 20x20 grid points are distributed in this area, represented by the blue dots.
http://i4.tietuku.com/8548554587b0cb3a.png
3.2 KD-tree
Find the nearest blue dot to each red point
sh = (20*20, 2)
grids = np.zeros(20*20*2).reshape(*sh)
sh_emission = (20*20)
grids_em = np.zeros(20*20).reshape(sh_emission)

k = 0
for j in range(0, yy.shape[0], 1):
    for i in range(0, xx.shape[0], 1):
        grids[k] = np.array([lon_grid[i], lat_grid[j]])
        k += 1

T = KDTree(grids)
x_delta = (lon_grid[2] - lon_grid[1])
y_delta = (lat_grid[2] - lat_grid[1])
R = np.sqrt(x_delta**2 + y_delta**2)

for i in range(0, len(df.lon), 1):
    idx = T.query_ball_point([df.lon.iloc[i], df.lat.iloc[i]], r=R)
    # there are more than one blue dot which are founded sometimes,
    # So I'll calculate the distances between the factory(red point)
    # and all blue dots which are listed
    if (idx > 1):
        distance = []
        for k in range(0, len(idx), 1):
            distance.append(np.sqrt((df.lon.iloc[i] - grids[k][0])**2 + (df.lat.iloc[i] - grids[k][1])**2))
        pos_index = distance.index(min(distance))
        pos = idx[pos_index]
    # Only find 1 point
    else:
        pos = idx
    grids_em[pos] += df.so2[i]
4. Result
co2 = grids_em.reshape(20,20)
plt.pcolormesh(lon_x,lat_y,co2,cmap =plt.cm.Spectral_r,zorder=3)
http://i4.tietuku.com/6ded65c4ac301294.png
5. My question
Can someone point out some drawbacks or errors of this method?
Are there algorithms better aligned with my target?
Thanks a lot!
There are many for-loops in your code; that's not the numpy way.
Make some sample data first:
import numpy as np
import pandas as pd
from scipy.spatial import KDTree
import pylab as pl
xc1, xc2, yc1, yc2 = 113.49805889531724, 115.5030664238035, 37.39995194888143, 38.789235929357105
N = 1000
GSIZE = 20
x, y = np.random.multivariate_normal([(xc1 + xc2)*0.5, (yc1 + yc2)*0.5], [[0.1, 0.02], [0.02, 0.1]], size=N).T
value = np.ones(N)
df_points = pd.DataFrame({"x":x, "y":y, "v":value})
For equally spaced grids you can use hist2d():
pl.hist2d(df_points.x, df_points.y, weights=df_points.v, bins=20, cmap="viridis");
Here is the output:
Here is the code to use KDTree:
X, Y = np.mgrid[x.min():x.max():GSIZE*1j, y.min():y.max():GSIZE*1j]
grid = np.c_[X.ravel(), Y.ravel()]
points = np.c_[df_points.x, df_points.y]
tree = KDTree(grid)
dist, indices = tree.query(points)
grid_values = df_points.groupby(indices).v.sum()
df_grid = pd.DataFrame(grid, columns=["x", "y"])
df_grid["v"] = grid_values
fig, ax = pl.subplots(figsize=(10, 8))
ax.plot(df_points.x, df_points.y, "kx", alpha=0.2)
mapper = ax.scatter(df_grid.x, df_grid.y, c=df_grid.v,
                    cmap="viridis",
                    linewidths=0,
                    s=100, marker="o")
pl.colorbar(mapper, ax=ax);
the output is:
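To get back to a pcolormesh plot like the one in the question, the summed values can be reshaped onto the grid. A small sketch using the variables from the snippet above (empty cells filled with 0):
Z = df_grid["v"].fillna(0).values.reshape(GSIZE, GSIZE)
pl.pcolormesh(X, Y, Z, cmap="viridis")
pl.colorbar()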
