Some questions on dendrogram - python (Scipy)

I am new to SciPy, but I managed to get the expected dendrogram. I have some more questions:
1. In the dendrogram, the distance between some points is 0, but this is not visible because of the image border. How can I remove the border and set the lower limit of the y-axis to -1, so that it is clearly visible? For example, the distance between these points is 0: (13,17), (2,10), (4,8,19).
2. How can I prune/truncate at a particular distance, e.g. prune at 0.4?
3. How can I write these clusters (after pruning) to a file?
My python code:
import scipy
import pylab
import scipy.cluster.hierarchy as sch
import numpy as np
D = np.genfromtxt('LtoR.txt', dtype=None)
def llf(id):
    return str(id)
fig = pylab.figure(figsize=(10,10))
Y = sch.linkage(D, method='single')
Z1 = sch.dendrogram(Y,leaf_label_func=llf,leaf_rotation=90)
fig.show()
fig.savefig('dendrogram.png')
Dendrogram:
thank you.

1. fig.gca().set_ylim(-0.4, 1.2)
Here gca() returns the current axes object, so you can give it a name and keep the existing upper limit:
ax=fig.gca()
ax.set_ylim(-0.4,ax.get_ylim()[1])

You can prune the dendrogram and obtain your clusters using fcluster.
To prune at a distance of 0.4:
clusters = sch.fcluster(Y,t = 0.4,criterion = 'distance')
The resulting array (clusters) contains the cluster label for every observation in your data. You can write the array using numpy.savetxt:
np.savetxt('clusters.txt', clusters, delimiter=',')
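If the file should list the members of each cluster rather than one label per observation, the labels can be grouped first. The following is only a sketch under assumptions: the toy array D stands in for LtoR.txt, and the output file name and line format are made up for illustration.

```python
import os
import tempfile
from collections import defaultdict

import numpy as np
import scipy.cluster.hierarchy as sch

# Hypothetical toy data: six 1-D observations standing in for LtoR.txt
D = np.array([[0.0], [0.0], [0.1], [1.0], [1.05], [3.0]])

Y = sch.linkage(D, method='single')
# One flat-cluster label per observation, cutting the tree at distance 0.4
labels = sch.fcluster(Y, t=0.4, criterion='distance')

# Collect the observation indices that share a label
groups = defaultdict(list)
for obs_index, label in enumerate(labels):
    groups[int(label)].append(obs_index)

# Write one line per cluster (file name and format are illustrative)
out_path = os.path.join(tempfile.gettempdir(), 'clusters_by_group.txt')
with open(out_path, 'w') as fh:
    for label in sorted(groups):
        members = ' '.join(str(i) for i in groups[label])
        fh.write(f'cluster {label}: {members}\n')
```

The indices written here are row numbers of the input array; if the dendrogram uses custom leaf labels, map the indices through the same labelling function before writing.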

The border is drawn by the axes, so you can remove it by turning the axes off (plt here is matplotlib.pyplot):
fig = plt.figure(figsize=(10, 8))
ax2 = fig.add_axes([0.3, 0.71, 0.6, 0.2])
Y = sch.linkage(D, method='ward')
Z2 = sch.dendrogram(Y)
ax2.set_xticks([])
ax2.set_yticks([])
ax2.axis('off')
ax2.axis('off') hides the border, along with the ticks and tick labels.
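If the y-axis scale should stay visible, an alternative to axis('off') is to hide only some of the frame lines (spines) and extend the lower y-limit a little below zero. This is a sketch with made-up toy data in place of LtoR.txt; the -0.1 offset is an arbitrary choice.

```python
import numpy as np
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as sch

# Hypothetical toy data; rows 0 and 1 coincide, so they merge at distance 0
D = np.array([[0.0], [0.0], [0.2], [1.0], [1.1]])
Y = sch.linkage(D, method='single')

fig, ax = plt.subplots(figsize=(6, 4))
sch.dendrogram(Y, ax=ax, leaf_rotation=90)

# Lower the bottom limit so zero-height merges are not drawn on the frame
ax.set_ylim(-0.1, ax.get_ylim()[1])

# Hide the top, right and bottom frame lines; the left axis keeps the scale
for side in ('top', 'right', 'bottom'):
    ax.spines[side].set_visible(False)
```

Call fig.savefig(...) or plt.show() afterwards as in the original script.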

Related

How do I specify the number of axis points in matplotlib and how do I extract these points?

I have a small script that creates a matplotlib graph with 2000 random points following a random walk.
I'm wondering if there is a simple way to change the number of points on the y-axis as well as how I can extract these values?
When I run the code below, I get 5 points on the Y-axis but I'm looking for a way to expand this to 20 points as well as creating an array or series with these values. Many thanks in advance.
import matplotlib.pyplot as plt
dims = 1
step_n = 2000
step_set = [-1, 0, 1]
origin = np.zeros((1,dims))
random.seed(30)
step_shape = (step_n,dims)
steps = np.random.choice(a=step_set, size=step_shape)
path = np.concatenate([origin, steps]).cumsum(0)
plt.plot(path)
import matplotlib.pyplot as plt
import numpy as np
import random
dims = 1
step_n = 2000
step_set = [-1, 0, 1]
origin = np.zeros((1,dims))
np.random.seed(30)  # seed NumPy's generator; random.seed() does not affect np.random.choice
step_shape = (step_n,dims)
steps = np.random.choice(a=step_set, size=step_shape)
path = np.concatenate([origin, steps]).cumsum(0)
#first variant
plt.plot(path)
plt.locator_params(axis='x', nbins=20)
plt.locator_params(axis='y', nbins=20)
You can use locator_params to specify the number of ticks. Note that nbins is an upper bound on the number of intervals, so you may get fewer ticks than requested. You can also retrieve the tick positions: create the subplot explicitly so that you have an Axes object (ax), then call ax.get_yticks().
#second variant
# create subplot
fig, ax = plt.subplots(1,1, figsize=(20, 11))
img = ax.plot(path)
plt.locator_params(axis='y', nbins=20)
y_values = ax.get_yticks() # y_values is a numpy array with your y values
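If exactly 20 tick positions are required rather than "at most about 20", one option is to set the ticks explicitly. This is a sketch, not part of the answer above; it uses NumPy's default_rng for the random walk, and the choice to spread the ticks evenly between the current y-limits is an assumption.

```python
import numpy as np
import matplotlib.pyplot as plt

# Random walk similar to the question's, using NumPy's Generator API
rng = np.random.default_rng(30)
steps = rng.choice([-1, 0, 1], size=2000)
path = np.concatenate([[0], steps]).cumsum()

fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(path)

# Place exactly 20 evenly spaced ticks between the current y-limits
y_low, y_high = ax.get_ylim()
y_values = np.linspace(y_low, y_high, 20)
ax.set_yticks(y_values)

# The tick positions can be read back as a NumPy array
tick_positions = ax.get_yticks()
```

Evenly spaced ticks chosen this way usually land on non-round numbers, which is the trade-off against letting the locator pick them.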

How to use geopandas to plot latitude and longitude on a more detailed map using basemaps?

I am trying to plot some latitudes and longitudes on a map of Delhi, which I am able to do using a shapefile in Python 3.8 with geopandas.
Here is the link to the shapefile:
https://drive.google.com/file/d/1CEScjlcsKFCgdlME21buexHxjCbkb3WE/view?usp=sharing
Following is my code to plot points on the map:
lo=[list of longitudes]
la=[list of latitudes]
delhi_map = gpd.read_file(r'C:\Users\Desktop\Delhi_Wards.shp')
fig,ax = plt.subplots(figsize = (15,15))
delhi_map.plot(ax = ax)
geometry = [Point(xy) for xy in zip(lo,la)]
geo_df = gpd.GeoDataFrame(geometry = geometry)
print(geo_df)
g = geo_df.plot(ax = ax, markersize = 20, color = 'red',marker = '*',label = 'Delhi')
plt.show()
Following is the result:
Now this map is not very clear, and nobody would be able to recognise the places marked, so I tried to use a basemap for a more detailed map with the following code:
df = gpd.read_file(r'C:\Users\Jojo\Desktop\Delhi_Wards.shp')
new_df = df.to_crs(epsg=3857)
print(df.crs)
print(new_df.crs)
ax = new_df.plot()
ctx.add_basemap(ax)
plt.show()
And following is the result:
I am getting the basemap, but my shapefile is overlapping it. Can I get a map on which to plot my latitudes and longitudes that is much more detailed, with names of places or roads, similar to Google Maps, or even just the map that is currently being covered by the blue shapefile?
Is it possible to plot on a map like this?
https://www.researchgate.net/profile/P_Jops/publication/324715366/figure/fig3/AS:618748771835906#1524532611545/Map-of-Delhi-reproduced-from-Google-Maps-12.png
Use the zorder parameter to adjust the order of the layers (a lower zorder means a lower layer), and alpha to make the polygons semi-transparent. Also, I guess you're plotting df twice, which is why it's overlapping.
Here's my script and the result:
import geopandas as gpd
import matplotlib.pyplot as plt
import contextily as ctx
from shapely.geometry import Point
long =[77.2885437011719, 77.231931, 77.198767, 77.2750396728516]
lat = [28.6877899169922, 28.663863, 28.648287, 28.5429172515869]
geometry = [Point(xy) for xy in zip(long,lat)]
wardlink = "New Folder/wards delimited.shp"
ward = gpd.read_file(wardlink, bbox=None, mask=None, rows=None)
geo_df = gpd.GeoDataFrame(geometry = geometry)
ward.crs = {'init':"epsg:4326"}
geo_df.crs = {'init':"epsg:4326"}
# plot the polygon
ax = ward.plot(alpha=0.35, color='#d66058', zorder=1)
# plot the boundary only (without fill), just uncomment
#ax = gpd.GeoSeries(ward.to_crs(epsg=3857)['geometry'].unary_union).boundary.plot(ax=ax, alpha=0.5, color="#ed2518",zorder=2)
ax = gpd.GeoSeries(ward['geometry'].unary_union).boundary.plot(ax=ax, alpha=0.5, color="#ed2518",zorder=2)
# plot the marker
ax = geo_df.plot(ax = ax, markersize = 20, color = 'red',marker = '*',label = 'Delhi', zorder=3)
ctx.add_basemap(ax, crs=geo_df.crs.to_string(), source=ctx.providers.OpenStreetMap.Mapnik)
plt.show()
I don't think Google Maps is available in contextily. Alternatively, you can use the OpenStreetMap basemap, which shows much the same place names, or explore any of the other basemaps. Use the `source` keyword argument, for example `ctx.add_basemap(ax, source=ctx.providers.OpenStreetMap.Mapnik)`. Here's how to check the available providers and the maps each provider offers:
>>> ctx.providers.keys()
dict_keys(['OpenStreetMap', 'OpenSeaMap', 'OpenPtMap', 'OpenTopoMap', 'OpenRailwayMap', 'OpenFireMap', 'SafeCast', 'Thunderforest', 'OpenMapSurfer', 'Hydda', 'MapBox', 'Stamen', 'Esri', 'OpenWeatherMap', 'HERE', 'FreeMapSK', 'MtbMap', 'CartoDB', 'HikeBike', 'BasemapAT', 'nlmaps', 'NASAGIBS', 'NLS', 'JusticeMap', 'Wikimedia', 'GeoportailFrance', 'OneMapSG'])
>>> ctx.providers.OpenStreetMap.keys()
dict_keys(['Mapnik', 'DE', 'CH', 'France', 'HOT', 'BZH'])
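The to_crs(epsg=3857) call in the question, and the crs argument to add_basemap in the script above, matter because the web tiles are drawn in Web Mercator (EPSG:3857), while the points are longitude/latitude in degrees (EPSG:4326). As a rough illustration of what that reprojection does to a single point, here is the spherical Web Mercator formula in plain Python. The function name is made up for this sketch; in practice geopandas' to_crs handles this.

```python
import math

# Radius used by spherical Web Mercator (the WGS 84 semi-major axis, in metres)
EARTH_RADIUS_M = 6378137.0

def lonlat_to_web_mercator(lon_deg, lat_deg):
    """Convert longitude/latitude in degrees to Web Mercator x/y in metres."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y

# One of the Delhi points from the script above
x, y = lonlat_to_web_mercator(77.231931, 28.663863)
print(x, y)
```

Points left in degrees while the axes are in metres end up far from the basemap, which is a common reason for a basemap and an overlay not lining up.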
I don't know geopandas. The idea I'm suggesting uses only basic python and matplotlib. I hope you can adapt it to your needs.
The background is the following map. I figured out the GPS coordinates of its corners using google-maps.
The code follows the three points of my remark. Note that the use of imread and imshow reverses the y coordinate. This is why the function coordinatesOnFigure looks non-symmetrical in x and y.
Running the code yields the map with a red bullet near Montijo (there is a small test at the end).
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib import patches
from matplotlib.widgets import Button
NE = (-8.9551, 38.8799)
SE = (-8.9551, 38.6149)
SW = (-9.4068, 38.6149)
NW = (-9.4068, 38.8799)
fig = plt.figure(figsize=(8, 6))
axes = fig.add_subplot(1,1,1, aspect='equal')
img_array = plt.imread("lisbon_2.jpg")
axes.imshow(img_array)
xmax = axes.get_xlim()[1]
ymin = axes.get_ylim()[0] # the y coordinates are reversed, ymax=0
# print(axes.get_xlim(), xmax)
# print(axes.get_ylim(), ymin)
def coordinatesOnFigure(long, lat, SW=SW, NE=NE, xmax=xmax, ymin=ymin):
    px = xmax/(NE[0]-SW[0])
    qx = -SW[0]*xmax/(NE[0]-SW[0])
    py = -ymin/(NE[1]-SW[1])
    qy = NE[1]*ymin/(NE[1]-SW[1])
    return px*long + qx, py*lat + qy
# plotting a red bullet that corresponds to a GPS location on the map
x, y = coordinatesOnFigure(-9, 38.7)
print("test: on -9, 38.7 we get", x, y)
axes.scatter(x, y, s=40, c='red', alpha=0.9)
plt.show()
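The mapping in coordinatesOnFigure is a linear interpolation between the corner coordinates, with the y direction flipped because image rows count downwards. A stand-alone sketch of the same idea, without the image, might look like this. The function name and the 800 x 600 image size are assumptions for illustration, and treating longitude/latitude as linear in pixels is only reasonable for a small area like this one.

```python
def lonlat_to_pixel(lon, lat, sw, ne, width, height):
    """Map (lon, lat) linearly onto pixel coordinates, with y growing downwards."""
    x = (lon - sw[0]) / (ne[0] - sw[0]) * width
    # Latitude grows upwards but pixel rows grow downwards, so measure from the north edge
    y = (ne[1] - lat) / (ne[1] - sw[1]) * height
    return x, y

SW = (-9.4068, 38.6149)   # south-west corner (lon, lat), from the code above
NE = (-8.9551, 38.8799)   # north-east corner (lon, lat), from the code above
WIDTH, HEIGHT = 800, 600  # hypothetical image size in pixels

print(lonlat_to_pixel(-9.0, 38.7, SW, NE, WIDTH, HEIGHT))
```

The south-west corner maps to the bottom-left pixel and the north-east corner to the top-right one, which is a quick way to check that the corner coordinates are in the right order.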

Defining a 2D object and using its area as Boolean

I have defined two space dimensions (x and z) and I was able to manually "draw" an object to use it as a boolean for solving an equation. I defined it as follows:
A = np.zeros((nz,nx))
object = np.ones_like(A)
object[ int(5/dz):int(10/dz) , int(5/dx):int(10/dx) ] = 2
object = object == 2
By doing that I can define a square spanning 5 to 10 in the z dimension and 5 to 10 in the x dimension, and the algorithm treats this as an area, I think. But when it comes to drawing complex areas, building them from little squares and rectangles becomes hard.
So I want to automate area generation by mouse clicking, and I want to be able to use this area as a boolean.
I was able to draw a polygon using:
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.patches import Polygon
fig, ax = plt.subplots()
object = np.array(plt.ginput(n=-100,mouse_stop=2))
p = Polygon(object, alpha=0.5)
plt.gca().add_artist(p)
plt.draw()
plt.show()
But this outputs the z and x coordinates of the vertices. I tried to use them as a boolean, but I couldn't write it so that Python understands it as the area defined by those points.
Is this problem easy to solve?
If you just want to calculate the area of a general polygon, you can use for example the Shapely python package like this:
import numpy as np
import matplotlib.pyplot as plt
from shapely.geometry import Polygon
from matplotlib.patches import Polygon as PltPolygon
# Get the coordinate input
canvas_size = np.array([1, 1])
canvas_lim = np.array([[0, canvas_size[0]], [0, canvas_size[1]]])
fig, ax = plt.subplots()
plt.xlim(canvas_lim[0])
plt.ylim(canvas_lim[1])
ax.set_aspect("equal")
coordinates = np.array(plt.ginput(n=-100, mouse_stop=2))
# Use shapely's Polygon to calculate the area
poly = Polygon(coordinates)
area = poly.area
print("The area is {} units^2".format(area))
# Draw the polygon
p = PltPolygon(coordinates, alpha=0.5)
ax.add_artist(p)
plt.show()
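If installing Shapely is not an option, the area of a simple (non-self-intersecting) polygon can also be computed from its vertices with the shoelace formula. This is a swapped-in technique, not what the Shapely code above does internally; the function name is made up for the sketch.

```python
import numpy as np

def polygon_area(vertices):
    """Area of a simple polygon from its (x, y) vertices via the shoelace formula."""
    pts = np.asarray(vertices, dtype=float)
    x, y = pts[:, 0], pts[:, 1]
    # Sum of cross products of consecutive vertices, wrapping around to the first
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

# Example with a right triangle whose legs are 4 and 3
print(polygon_area([(0, 0), (4, 0), (0, 3)]))
```

The array returned by plt.ginput can be passed in directly, since it already has one (x, y) row per clicked vertex.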
If you definitely need the mask, here's one way to rasterize it using numpy and matplotlib.path. For details see the comments in the code:
import numpy as np
import matplotlib.path as mpltPath
import matplotlib.pyplot as plt
# Define the limits of our polygon
canvas_desired_size = np.array([110, 100])
# The pixel size with which we calculate (number of points to consider)
# The higher this number, the more we have to calculate, but the
# closer the approximation will be
pixel_size = 0.1
# Calculate the actual size of the canvas
num_pxiels = np.ceil(canvas_desired_size / pixel_size).astype(int)
canvas_actual_size = num_pxiels * pixel_size
# Let's create a grid where each pixel's value is its position in our 2d image
x_coords = np.linspace(
    start=0,
    stop=canvas_actual_size[0],
    endpoint=False,
    num=num_pxiels[0],  # num must be an integer
)
y_coords = np.linspace(
    start=0,
    stop=canvas_actual_size[1],
    endpoint=False,
    num=num_pxiels[1],  # num must be an integer
)
# Since it makes more sense to check if the middle of the pixel is in the
# polygon, we shift everything by half the pixel size
pixel_offset = pixel_size / 2
x_centers = x_coords + pixel_offset
y_centers = y_coords + pixel_offset
xx, yy = np.meshgrid(x_centers, y_centers, indexing="ij")
# Flatten our xx and yy matrices to an N * 2 array, which contains
# every point in our grid
pixel_centers = np.array(
    list(zip(xx.flatten(), yy.flatten())), dtype=np.dtype("float64")
)
# Now prompt for the input shape
canvas_lim = np.array([[0, canvas_actual_size[0]], [0, canvas_actual_size[1]]])
fig, ax = plt.subplots()
plt.xlim(canvas_lim[0])
plt.ylim(canvas_lim[1])
ax.set_aspect("equal")
shape_points = np.array(plt.ginput(n=-100, mouse_stop=2))
# Create a Path object
shape = mpltPath.Path(shape_points)
# Use Path.contains_points to calculate if each point is
# within our shape
shape_contains = shape.contains_points(pixel_centers)
# Reshape the result to be a matrix again
mask = np.reshape(shape_contains, num_pxiels)
# Calculate area
print(
    "The shape area is roughly {} units^2".format(
        np.sum(shape_contains) * pixel_size ** 2
    )
)
# Show the rasterized shape to confirm it looks correct
plt.imshow(np.transpose(mask), aspect="equal", origin="lower")
plt.xlim([0, num_pxiels[0]])
plt.ylim([0, num_pxiels[1]])
plt.show()
Alternatively, a simpler solution would be to save your plot as an image and threshold it to get a boolean mask. There should be plenty of examples of how to do this online.
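To tie the rasterization back to the (nz, nx) grid in the question, the same Path.contains_points idea can be applied to the cell centres of that grid. This is a sketch under assumptions: the grid sizes and spacings are made up, and the fixed square vertices stand in for the points that plt.ginput would return.

```python
import numpy as np
from matplotlib.path import Path

# Hypothetical grid: nz rows and nx columns with spacings dz and dx
nx, nz = 15, 15
dx, dz = 1.0, 1.0

# Coordinates of the cell centres, arranged as (nz, nx) arrays
x_centers = (np.arange(nx) + 0.5) * dx
z_centers = (np.arange(nz) + 0.5) * dz
xx, zz = np.meshgrid(x_centers, z_centers)

# Polygon vertices as (x, z) pairs; in practice these come from plt.ginput
vertices = np.array([[5, 5], [10, 5], [10, 10], [5, 10]], dtype=float)

# Test every cell centre against the polygon and reshape back to the grid
polygon = Path(vertices)
points = np.column_stack([xx.ravel(), zz.ravel()])
mask = polygon.contains_points(points).reshape(nz, nx)
```

The resulting mask has the same shape as A in the question and can be used directly for boolean indexing, for example A[mask] = 2.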

How to make standard deviation and percentile bands in a python scatter plot

I have data for a scatter plot (for reference, x values are labelled sm, y values are labelled bhm). My three goals are to find the medians of binned data, create standard deviation bands, and create bands at the 90th and 10th percentiles. I've managed to do the first, and I've been able to make vertical bars indicating the standard deviation, but I can't figure out how to make filled-in bands. Every time I try to set parameters with the fill_between function, it says operators with sm/bhm are incompatible, since they're datasets and I'm comparing them to singular values (the mean line). I copied all of my code below, with a comment pointing out the relevant part. I kept all of it because the variable names are a bit important and because some parts of the plot don't show up properly without the seemingly extraneous code.
To create the bands at 90/10 percent, I tried the following code, binning the mean as I did for the median and then filling above and below the line to cover 90% of the data, but I keep getting
patsy.PatsyError: model is missing required outcome variables
#stuff that really doesn't work
model = smf.quantreg(bhm, sm)
quantiles = [0.1, 0.9]
fits = [model.fit(q=q) for q in quantiles]
figure, axes = plt.subplots()
_sm = np.linspace(min(sm), max(sm))
for index, quantile in enumerate(quantiles):
    _bhm = fits[index].params['world'] * _sm + fits[index].params['Intercept']
    axes.plot(_sm, _bhm, label = quantile)
axes.plot(_sm, _sm, 'g--', label = 'i guess this line is the mean')
#stuff that also doesn't really work
import matplotlib.pyplot as plt
import numpy as np
import matplotlib.patches as mpatches
import h5py
import statistics as stat
import pandas as pd
import statsmodels.formula.api as smf
#my files and labels for things
f=h5py.File(r'C:\Users\hanna\Downloads\CatalogueGalsz0p0.hdf5', 'r')
sm = f['StellarMass']
bhm = f['BHMass']
bt = f['BtoT']
dt = f['DtoT']
nbins = 125
#titles and scaling for the plot
plt.title('Relationships Between Stellar Mass, Black Hole Mass, and Bulge to Total Ratios')
plt.xlabel('Stellar Mass')
plt.ylabel('Black Hole Mass')
plt.xscale('log')
plt.yscale('log')
axes = plt.gca()
axes.set_ylim([500000,max(bhm)])
axes.set_xlim([min(sm),max(sm)])
#labels for the legend and how I colored the points in the plot
DtoT = np.copy(f['DtoT'].value)
colour = np.zeros(len(DtoT),dtype=str)
for i in np.arange(0, len(bt)):
    if bt[i]>=0.5:
        colour[i]='green'
    else:
        colour[i]='red'
redbt = mpatches.Patch(color = 'red', label = 'Bulge to Total Ratios Below 0.5')
greenbt = mpatches.Patch(color = 'green', label = 'Bulge to Total Ratios Above 0.5')
plt.legend(handles = [(redbt), (greenbt)])
#the important part - this is how I binned my data to make the median line, and this part works but not the standard deviation bands
bins = np.linspace(0, max(sm), nbins)
delta = bins[1]-bins[0]
idx = np.digitize(sm, bins)
runningmedian = [np.median(bhm[idx==k]) for k in range(nbins)]
runningstd = [bhm[idx==k].std() for k in range(nbins)]
plt.plot(bins-delta/2, runningmedian, c = 'b', lw=1)
plt.scatter(sm, bhm, c=colour, s=.2)
plt.show()
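One possible approach to the bands, sketched with synthetic data because the HDF5 catalogue is not available here: compute one value per bin for the median, the percentiles and the standard deviation, then pass those per-bin arrays (not the raw sm/bhm datasets) to fill_between along with the bin centres. The synthetic masses, the bin count, log-spaced bins, and taking the standard deviation in log space (so the band stays positive on log axes) are all assumptions of this sketch. With real h5py data, read the datasets into arrays first, e.g. f['StellarMass'][:].

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-ins for the catalogue columns (hypothetical values)
rng = np.random.default_rng(0)
sm = 10 ** rng.uniform(8, 12, size=5000)
bhm = 1e-3 * sm * 10 ** rng.normal(0, 0.3, size=sm.size)

# Log-spaced bins suit log axes; centres are geometric means of the edges
nbins = 25
edges = np.logspace(np.log10(sm.min()), np.log10(sm.max()), nbins + 1)
centers = np.sqrt(edges[:-1] * edges[1:])
idx = np.clip(np.digitize(sm, edges) - 1, 0, nbins - 1)

# One statistic per bin; empty bins stay NaN
median = np.full(nbins, np.nan)
p10 = np.full(nbins, np.nan)
p90 = np.full(nbins, np.nan)
log_mean = np.full(nbins, np.nan)
log_std = np.full(nbins, np.nan)
for k in range(nbins):
    in_bin = bhm[idx == k]
    if in_bin.size == 0:
        continue
    median[k] = np.median(in_bin)
    p10[k], p90[k] = np.percentile(in_bin, [10, 90])
    log_mean[k] = np.log10(in_bin).mean()
    log_std[k] = np.log10(in_bin).std()

fig, ax = plt.subplots()
ax.scatter(sm, bhm, s=0.2, c='grey')
ax.plot(centers, median, c='b', lw=1, label='median')
# Bands: x is the bin centres, the bounds are per-bin arrays of the same length
ax.fill_between(centers, p10, p90, color='b', alpha=0.2, label='10th-90th percentile')
ax.fill_between(centers, 10 ** (log_mean - log_std), 10 ** (log_mean + log_std),
                color='orange', alpha=0.3, label='mean +/- 1 std (log space)')
ax.set_xscale('log')
ax.set_yscale('log')
ax.legend()
```

The key point is that all three arguments of fill_between have one entry per bin, which avoids comparing the full datasets with a single value.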

Python irregular x,y data to contour plot on original domain

I have a file containing points under the columns "x-cord", "y-cord", "value". These are irregularly spaced. I am trying to make a contour plot of "value" and overlay this on the original domain. I gave up trying to do this in both pgfplots and Matlab and thought I would give Python a go. An answer in any of these languages would be fine. The Python script is as follows:
import numpy as np
from scipy.interpolate import griddata
import matplotlib.pyplot as plt
import numpy.ma as ma
from numpy.random import uniform, seed
from scipy.spatial import ConvexHull
#
# Loading data
filename = "strain.dat"
coordinates = []
x_c = []
y_c = []
z_c = []
xyz = open(filename)
title = xyz.readline()
for line in xyz:
    x,y,z = line.split()
    coordinates.append([float(x), float(y), float(z)])
    x_c.append([float(x)])
    y_c.append([float(y)])
    z_c.append([float(z)])
xyz.close()
#
# Reshaping and translating data
x_c=np.ravel(np.array(x_c))
y_c=np.ravel(np.array(y_c))
z_c=np.ravel(np.array(z_c))
x_c = x_c-100.0
y_c = y_c-100.0
#
# Checking the convex hull
points=np.column_stack((x_c,y_c))
hull = ConvexHull(points);
plt.plot(points[hull.vertices,0], points[hull.vertices,1], 'r--', lw=2)
plt.scatter(x_c, y_c, marker='o', s=5, zorder=10)
#
# Mapping the irregular data onto a regular grid and plotting
xic = np.linspace(min(x_c), max(x_c), 1000)
yic = np.linspace(min(y_c), max(y_c), 1000)
zic = griddata((x_c, y_c), z_c, (xic[None,:], yic[:,None]))
CS = plt.contour(xic,yic,zic,15,linewidths=0.5,colors='k')
CS = plt.contourf(xic,yic,zic,15,cmap=plt.cm.summer)
plt.colorbar() # draw colorbar
#
#plt.scatter(x_c, y_c, marker='o', s=5, zorder=10)
plt.axis('equal')
plt.savefig('foo.pdf', bbox_inches='tight')
plt.show()
and the output looks like
The problem is that griddata uses a convex hull and this convex hull exceeds the edges of the irregular data. Is there any way to set the values of the griddata points which are outside the edges of the boundary of the original points to zero?
Edit
In the end I threw in the towel and reverted back to Matlab. I'll have to export the data to pgfplots to get a nice plot. The code I came up with was
x = strain.x;
y = strain.y;
z = strain.eps;
% Get the alpha shape (couldn't do this in python easily)
shp = alphaShape(x,y,.001);
% Get the boundary nodes
[bi, xy] = boundaryFacets(shp);
no_grid = 500;
xb=xy(:,1);
yb=xy(:,2);
[X,Y] = ndgrid(linspace(min(x),max(x),no_grid),linspace(min(y),max(y),no_grid));
Z = griddata(x,y,z,X,Y,'v4');
% Go through the regular grid and set the values which are outside the boundary of the original domain to NaNs
for j = 1:no_grid
    [in,on] = inpolygon(X(:,j),Y(:,j),xb,yb);
    Z(~in,j) = NaN;
end
contourf(X,Y,Z,10),axis equal
colorbar
hold on
plot(xb,yb)
axis equal
hold off
Here is the resulting image.
If someone can do something similar in Python I'll happily accept the answer.
I had to plot interpolated data on a complex geometry (see the blue points on figure) P(x,z) (z is the horizontal coordinate). I used mask operations and it worked well. Without mask, the whole square (x=0..1 ; z=0..17.28) is covered by contourf.
## limiting values for geometry
xmax1=0.408
zmin1=6.
xmax2=0.064
zmin2=13.12
xmin=0.
xmax=1.
zmin=0.
zmax=17.28
# Grid for points
x1 = np.arange(xmin,xmax+dx,dx)
z1 = np.arange(zmin,zmax+dz,dz)
zi2,xi2 = np.meshgrid(z1,x1)
mask = (((zi2 > zmin2) & (xi2 > xmax2)) | ((zi2 > zmin1) & (zi2 <= zmin2) & (xi2 > xmax1)))
zim=np.ma.masked_array(zi2,mask)
xim=np.ma.masked_array(xi2,mask)
# Grid for P values
# npz=z coordinates of data, npx is the x coordinates and npp is P values
grid_p = scipy.interpolate.griddata((npz, npx), npp, (zim,xim),method='nearest')
pm=np.ma.masked_array(grid_p,mask)
# plot
plt.contour(zim, xim, pm, 25, linewidths=0.5, colors='k',corner_mask=False)
plt.contourf(zim, xim, pm, 25,vmax=grid_p.max(), vmin=grid_p.min(),corner_mask=False)
plt.colorbar()
# Scatter plot to check
plt.scatter(npz, npx, marker='x', s=2)
plt.show()
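A Python counterpart to the inpolygon step in the Matlab code can be sketched with matplotlib.path: interpolate onto the regular grid, then set every grid point outside the domain boundary to NaN so that contourf leaves it blank. This assumes the boundary polygon is already known; the L-shaped boundary and the random sample points below are made up, and finding the boundary of real scattered data (the alpha shape) is not covered here.

```python
import numpy as np
from scipy.interpolate import griddata
from matplotlib.path import Path

# Hypothetical non-convex (L-shaped) domain boundary as (x, y) vertices
boundary = np.array([[0, 0], [2, 0], [2, 1], [1, 1], [1, 2], [0, 2]], dtype=float)
domain = Path(boundary)

# Synthetic irregular sample points inside the domain, with a simple value field
rng = np.random.default_rng(1)
candidates = rng.uniform(0, 2, size=(2000, 2))
pts = candidates[domain.contains_points(candidates)]
values = pts[:, 0] + pts[:, 1]

# Interpolate onto a regular grid; this fills the whole convex hull
xi = np.linspace(0, 2, 81)
yi = np.linspace(0, 2, 81)
X, Y = np.meshgrid(xi, yi)
Z = griddata(pts, values, (X, Y), method='linear')

# Keep only grid points inside the boundary; everything else becomes NaN
grid_points = np.column_stack([X.ravel(), Y.ravel()])
inside = domain.contains_points(grid_points).reshape(X.shape)
Z_masked = np.where(inside, Z, np.nan)
```

Passing Z_masked to plt.contourf(xi, yi, Z_masked, 15) should then leave the notch of the L empty, since contouring skips NaN cells.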
