I have some data giving the number of readings at each point on a 5x10 grid, in the following format:
X = [1, 2, 3, 4,..., 5]
Y = [1, 1, 1, 1,...,10]
Z = [9,8,14,0,89,...,0]
I would like to plot this as a heatmap/density map viewed from above, but all of the matplotlib functions I have found (including contourf) require a 2D array for Z, and I don't understand why.
EDIT:
I have now collected the actual coordinates that I want to plot. They are not as regular as the ones above:
X = [8,7,7,7,8,8,8,9,9.5,9.5,9.5,11,11,11,10.5,
10.5,10.5,10.5,9,9,8, 8,8,8,6.5,6.5,1,2.5,4.5,
4.5,2,2,2,3,3,3,4,4.5,4.5,4.5,4.5,3.5,2.5,2.5,
1,1,1,2,2,2]
Y = [5.5,7.5,8,9,9,8,7.5,6,6.5,8,9,9,8,6.5,5.5,
5,3.5,2,2,1,2,3.5,5,1,1,2,4.5,4.5,4.5,4,3,
2,1,1,2,3,4.5,3.5,2.5,1.5,1,5.5,5.5,6,7,8,9,
9,8,7]
z = [286,257,75,38,785,3074,1878,1212,2501,1518,419,33,
3343,1808,3233,5943,10511,3593,1086,139,565,61,61,
189,155,105,120,225,682,416,30632,2035,165,6777,
7223,465,2510,7128,2296,1659,1358,204,295,854,7838,
122,5206,6516,221,282]
From what I understand you can't use floats as indices into a np.array, so I have tried multiplying all values by 10 so that they are all integers, but I am still running into some issues. Am I trying to do something that will not work?
They expect a 2D array because they use the row and column indices to set the position of each value. For example, if array[2, 3] = 5, the heatmap will use the value 5 at the position given by row 2 and column 3.
So, let's try transforming your current data into a single array:
>>> array = np.empty((len(set(X)), len(set(Y))))
>>> for x, y, z in zip(X, Y, Z):
...     array[x-1, y-1] = z
If X and Y are np.arrays, you can do the same in one step with fancy indexing (SO answer). Note that the array's shape must be the number of distinct values along each axis, not the number of points:
>>> array = np.empty((len(np.unique(X)), len(np.unique(Y))))
>>> array[X - 1, Y - 1] = Z
And now just plot the array as you prefer:
>>> plt.imshow(array, cmap="hot", interpolation="nearest")
>>> plt.show()
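For the irregular coordinates in the edit: floats are fine as values in a np.array, they just can't be used directly as indices. A sketch of one way around that (assuming the X, Y, z lists from the edit), using np.unique with return_inverse to map each distinct coordinate to an integer index:
>>> xi_vals, xi = np.unique(X, return_inverse=True)
>>> yi_vals, yi = np.unique(Y, return_inverse=True)
>>> grid = np.full((len(xi_vals), len(yi_vals)), np.nan)  # NaN marks empty cells
>>> grid[xi, yi] = z  # if several points share a cell, the last one wins
>>> plt.imshow(grid.T, origin="lower", cmap="hot", interpolation="nearest")
>>> plt.show()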
I am trying to find the best way to do the following. I have a scatter plot of y versus x, where x is income per capita. After plotting all the values, I would like to find the highest y value at each x value (i.e., at each income level) and then connect these maxima with a line.
How can I do this in Python?
You could use pandas, because it has a convenient groupby method and plays well with matplotlib:
import pandas as pd
# example data
df = pd.DataFrame({'x': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
                   'y': [3, 7, 9, 4, 1, 2, 8, 6, 4, 4, 3, 1]})
# standard scatter plot
ax = df.plot.scatter('x', 'y')
# find max. y value for each x
groupmax = df.groupby('x').max()
# connect max. values with lines
groupmax.plot(ax=ax, legend=False);
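For reference, df.groupby('x').max() turns x into the index and keeps the per-x maximum of y; for the example data above it produces:
>>> df.groupby('x').max()
   y
x
1  9
2  4
3  8
4  4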
You have two parallel lists, x and y, and you want to group the points by x and take the maximum of y within each group. First, sort the lists together: zip them into a list of tuples and sort:
xy = sorted(zip(x, y))
Now, group the sorted list by the first element (the x). groupby yields pairs where the first element is x and the second is an iterator over all points with that x. Each point is itself a tuple, and its first element is that same x:
from itertools import groupby
grouped = groupby(xy, lambda item: item[0])
Finally, take the x and the maximum point of each group; tuples compare lexicographically and x is constant within a group, so max picks the point with the largest y:
envelope = [(xp, max(points)[1]) for xp, points in grouped]
envelope is a list of xy tuples that form the upper envelope of your scatter plot. You can further unzip it into xs and ys:
x1, y1 = zip(*envelope)
Putting it all together:
x1, y1 = zip(*[(xp, max(points)[1])
               for xp, points
               in groupby(sorted(zip(x, y)), lambda item: item[0])])
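A quick end-to-end check with made-up numbers (hypothetical data, just to show the shape of the result):
>>> from itertools import groupby
>>> x = [1, 1, 2, 2, 3]
>>> y = [3, 9, 4, 1, 8]
>>> x1, y1 = zip(*[(xp, max(points)[1])
...                for xp, points
...                in groupby(sorted(zip(x, y)), lambda item: item[0])])
>>> x1
(1, 2, 3)
>>> y1
(9, 4, 8)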
I would like to create a scatter plot in matplotlib to measure the performance of my algorithm.
An example of my data is as follows:
x = [1, 2, 3, 4, 5]
y1 = [1, 2, 3] # corresponding to x = 1
y2 = [4, 5, 6] # corresponding to x = 2
y3 = [7, 8, 9] # corresponding to x = 3
y4 = [10, 11, 12] # corresponding to x = 4
y5 = [13, 14, 15] # corresponding to x = 5
What data type would be best to represent multiple y values with one x value?
In my example the relation is exponential. Is there a way to plot an exponential regression line in matplotlib?
I think this is really a data-analysis question. If I understand correctly, you want to compare the time efficiency of each test, but each run should happen in the same test environment (same machine, same input data, etc.). As a suggestion, you could use each test's average run time as a reference value when showing your results. Here is some code you can use:
import numpy as np
import matplotlib.pyplot as plt

data_dim = 4       # number of tests
data_points = 100  # number of data points per test
data_set = np.random.rand(data_dim, data_points)
time = [list(range(len(i))) for i in data_set]

# each test's average value, repeated so it plots as a horizontal line
aver = [np.full(data_points, data_set[ndx].mean()) for ndx in range(data_dim)]

fig = plt.figure(figsize=(10, 10))
ndx = 1
for i in range(2):
    for j in range(2):
        ax = fig.add_subplot(2, 2, ndx)
        ax.plot(time[ndx-1], data_set[ndx-1], 'ko')  # raw measurements
        ax.plot(time[ndx-1], aver[ndx-1], 'r')       # per-test average
        ax.set_ylim(-1, 2)
        ndx += 1
plt.show()
The following is the resulting plot. Note that the solid red line is the average run time, which gives a sense of each test's behaviour.
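The question also asked about plotting an exponential regression line. That is not covered above; a common sketch (my assumption, not part of the original answer) is to fit a straight line to log(y) with np.polyfit, which assumes all y > 0:
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1, 2, 3, 4, 5])
y = np.array([2.7, 7.4, 20.1, 54.6, 148.4])  # hypothetical data, roughly e**x

# fit log(y) = b*x + log(a), i.e. y = a * e**(b*x)
b, log_a = np.polyfit(x, np.log(y), 1)  # polyfit returns highest degree first

xs = np.linspace(x.min(), x.max(), 100)
plt.scatter(x, y)
plt.plot(xs, np.exp(log_a + b * xs), 'r')
plt.show()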
I want to plot some finite element results using tricontourf in matplotlib. To do so, a for loop needs to be created that accounts for the data at the nodes of each element (something like the fill command in MATLAB). This works nicely, but the problem lies in the display of the colorbar. It only shows the values of the last element (after the loop finishes) and it does not include the values of the whole domain.
Therefore, is there a way to 'accumulate' the data of the tricontourf so that the colorbar includes the values of the whole domain, or maybe a way to manipulate the colorbar to do so? I tried to append the data using contour_sets.append() and then use this as input to colorbar, but it does not work.
The code looks like this:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm

# nodes and coordinates
IEN = np.array([[1,2,5,6],[2,3,4,5],[5,4,9,8],[6,5,8,7]]) # element connectivity
xx = np.array([1, 2, 3, 3, 2, 1, 1, 2, 3])                # x coordinates
yy = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3])                # y coordinates
zz = np.array([50, 40, 30, 45, 20, 30, 30, 15, 10])       # magnitude
nen, nfe = np.shape(IEN)  # nodes per element, number of elements
#contour_sets = []
fig = plt.figure(1)
for e in range(nfe):
    idx = np.squeeze(IEN[:,e] - 1)  # take the element nodes
    x = xx[idx]  # x coordinates of element e
    y = yy[idx]  # y coordinates of element e
    z = zz[idx]  # magnitude values of element e
    # additional node in the middle to be able to use tricontourf
    x5 = np.mean(x)  # x coordinate
    y5 = np.mean(y)  # y coordinate
    z5 = np.mean(z)  # magnitude value
    # set up the triangle nodes (rectangular element as two triangles)
    triangles = np.array([[2, 4, 1],
                          [2, 3, 4]]) - 1
    # insert the additional middle node into the coordinates
    xt = np.insert(x, nen, [x5]).squeeze()
    yt = np.insert(y, nen, [y5]).squeeze()
    zt = np.insert(z, nen, [z5]).squeeze()
    # plot the data
    fil = plt.tricontourf(xt, yt, triangles, zt,
                          norm=plt.Normalize(vmax=zz.max(), vmin=zz.min()),
                          cmap=cm.pink)
    #contour_sets.append(fil)
# organize the plot
cb = plt.colorbar(fil, shrink=0.95, aspect=15, pad=0.05)
cb.ax.yaxis.set_offset_position('left')
cb.update_ticks()
plt.xticks(np.arange(1, 3.1, 0.5))
plt.yticks(np.arange(1, 3.1, 0.5))
plt.xlim(1, 3)
plt.ylim(1, 3)
plt.gca().set_aspect('equal', adjustable='box')
plt.tight_layout()
plt.show()
which gives this figure (edited for clarification purposes): the colorbar only shows the values of zz associated with element 4.
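One standard way to get a colorbar that spans the whole domain (a sketch of the shared-norm idea, not from the original thread) is to pass the same explicit levels to every tricontourf call and build the colorbar from a ScalarMappable that carries the shared norm:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm

# two separate patches drawn with separate tricontourf calls (toy data)
x1, y1, z1 = [0, 1, 0], [0, 0, 1], [10, 20, 30]
x2, y2, z2 = [1, 2, 2], [1, 1, 2], [40, 50, 60]

zmin, zmax = 10, 60
levels = np.linspace(zmin, zmax, 11)  # same levels for every call

plt.tricontourf(x1, y1, z1, levels=levels, cmap=cm.pink)
plt.tricontourf(x2, y2, z2, levels=levels, cmap=cm.pink)

# colorbar built from the shared norm, independent of any single contour set
sm = cm.ScalarMappable(norm=plt.Normalize(vmin=zmin, vmax=zmax), cmap=cm.pink)
sm.set_array([])
plt.colorbar(sm, ax=plt.gca())
plt.show()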
I have a function here that visualises the classification made by a certain classifier like Logistic Regression or simply the perceptron model. But I don't get several things:
X has n examples and just 2 features.
1. Why do I have to use xx1.ravel() and xx2.ravel() and then transpose the resulting array for classifier.predict? Why can't I simply predict the outcomes using the original dimensions?
2. Why do I need to reshape Z back to the original xx1 shape?
3. Why is there a need to create a meshgrid for plotting a scatter plot? Do the points in the meshgrid act like 'pixels' that represent certain positions on the grid? Why is this needed anyway?
4. What is the idx value in idx, cl in enumerate(np.unique(y)), when all np.unique gives me is the unique ids of the outcomes?
5. What is the use of c = cmap(idx) in the scatter function? Why can cmap take an argument?
I apologise that the later questions may not fit the topic question.
The code is taken from the Python Machine Learning book.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

def plot_decision_regions(X, y, classifier, test_idx=None, resolution=0.002):
    # setup marker generator and color map
    markers = ('s', 'x', 'o', '^', 'v')
    colors = ('red', 'blue', 'green', 'gray', 'cyan')
    cmap = ListedColormap(colors[:len(np.unique(y))])
    # MESHGRID - plot decision surface
    x1_min, x1_max = X[:, 0].min(), X[:, 0].max()
    x2_min, x2_max = X[:, 1].min(), X[:, 1].max()
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
                           np.arange(x2_min, x2_max, resolution))
    # CLASSIFIER PREDICT
    Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)
    plt.contourf(xx1, xx2, Z, alpha=0.4, cmap=cmap)
    plt.xlim(xx1.min(), xx1.max())
    plt.ylim(xx2.min(), xx2.max())
    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y == cl, 0], y=X[y == cl, 1], alpha=0.8,
                    c=cmap(idx), marker=markers[idx], label=cl)
    # highlight test samples
    if test_idx:
        XTest, yTest = X[test_idx, :], y[test_idx]
        plt.scatter(XTest[:, 0], XTest[:, 1], c='none', alpha=1.0,  # hollow markers
                    linewidth=1, marker='o', s=55, label='test set')
This business with meshgrid and ravel is simply a way of taking the cartesian product of the coordinate ranges in order to get a set of (x, y) coordinate pairs representing individual points in a region.
The classifier expects its input to be an Nx2 array, where N is the number of samples (i.e., cases whose class you want to predict). It wants two columns because there are two features.
Meshgrid produces two arrays, one containing the X coordinates of points in a specified rectangular region and the other containing the Y coordinates of those points. .ravel() then flattens these arrays into flat lists of coordinates. This is just a somewhat confusing way of taking the cartesian product of the desired coordinate ranges. In other words, this:
xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution), np.arange(x2_min, x2_max, resolution))
coord1, coord2 = xx1.ravel(), xx2.ravel()
Is effectively the same as this:
coord1, coord2 = zip(*itertools.product(np.arange(x1_min, x1_max, resolution), np.arange(x2_min, x2_max, resolution)))
You can see this with a simple example:
>>> xx1, xx2 = np.meshgrid(np.arange(3), np.arange(2))
>>> coord1, coord2 = xx1.ravel(), xx2.ravel()
>>> coord1
array([0, 1, 2, 0, 1, 2])
>>> coord2
array([0, 0, 0, 1, 1, 1])
>>> import itertools
>>> coord1, coord2 = zip(*itertools.product(np.arange(3), np.arange(2)))
>>> coord1
(0, 0, 1, 1, 2, 2)
>>> coord2
(0, 1, 0, 1, 0, 1)
You can see that the same x/y pairs are generated there (although they are generated in different orders).
The meshgrid approach was probably chosen here because it's needed for contourf. contourf essentially takes an "XY plane" as input (consisting of arrays of X and Y coordinates) along with an array of Z values for each point in that plane.
The upshot is that the classifier and the contour plot expect input in different formats. The classifier takes two individual values (the two input features) and returns a single value (the predicted class), while contourf requires a rectangular grid of points. In other words, loosely speaking, predict wants one X coordinate and one Y coordinate at a time, but contourf wants all the X coordinates first and then all the Y coordinates. The code you posted does some reshaping to convert between these two formats: you generate X and Y in the format contourf wants and reshape them into the format predict wants, predict returns the Z data as a flat array, and you reshape that back into the grid format contourf wants.
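To make the round trip concrete, here is a minimal sketch with a hypothetical stand-in for classifier.predict (the function and its decision rule are made up for illustration):
import numpy as np

def predict(points):
    # stand-in for classifier.predict: takes an Nx2 array, returns N labels
    return (points[:, 0] + points[:, 1] > 1).astype(int)

xx1, xx2 = np.meshgrid(np.arange(0, 1, 0.25), np.arange(0, 1, 0.25))
pts = np.array([xx1.ravel(), xx2.ravel()]).T  # grid -> Nx2 rows of (x, y)
Z = predict(pts)                              # one prediction per point
Z = Z.reshape(xx1.shape)                      # flat -> grid, ready for contourf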