Coloring a scatter plot from pandas itertuples?


I have a dataframe that has coordinates in one column and an orientation in the other.
I'm trying to scatter plot the coordinates and then colour them by their orientation:
for row in df.itertuples():
    x, y = row.coords[:,1], row.coords[:,0]
    plt.scatter(x,y, c=df.orientation)
This plots the coordinates fine but does not colour them by orientation, since the whole orientation column is passed on every pass through the itertuples loop. Does anyone know how to get around this problem?

As each row has only one color, you need to explicitly set that color for the row. In order to get a color from a certain numeric orientation value, you need to create a colormap and a norm. The colormap can be any of your choice. The norm needs to be set using the complete range of the 'orientation' column.
Using the norm (to get a value between 0 and 1), you can call the colormap and obtain an RGBA value.
As plt.scatter tries to work out whether you are giving one single color for all points together or one value per point, a bare RGBA tuple can cause confusion. Therefore, it is safest to wrap the color in a list (so c=[cmap(norm(row.orientation))] instead of just c=cmap(norm(row.orientation))).
The colormap and norm can also be used to create an accompanying colorbar.
Here is some example code to get you started:
from matplotlib import pyplot as plt
from matplotlib.cm import ScalarMappable
import numpy as np
import pandas as pd
N = 30
# each row holds an array of (y, x) points and a single orientation value
df = pd.DataFrame({'coords': [np.random.normal(0, 1, size=(np.random.randint(5, 50), 2)) + np.random.uniform(0, 50, 2)
                              for _ in range(N)],
                   'orientation': np.random.uniform(-1, 1, N)})
cmap = plt.get_cmap('magma')
# the norm spans the full range of the column, so all rows share one color scale
norm = plt.Normalize(df.orientation.min(), df.orientation.max())
for row in df.itertuples():
    coords = np.array(row.coords)
    x, y = coords[:, 1], coords[:, 0]
    plt.scatter(x, y, c=[cmap(norm(row.orientation))])
# recent matplotlib versions need an explicit ax for a standalone ScalarMappable
plt.colorbar(ScalarMappable(cmap=cmap, norm=norm), ax=plt.gca(), label='orientation')
plt.show()
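As an alternative to one scatter call per row, all points can be stacked into a single array and drawn with one call, letting scatter apply the colormap itself. The following is only a minimal sketch: the random data and the names points, counts and values are made up for illustration, and it assumes every entry of the coords column is an (n, 2) array.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
N = 30
# Made-up data in the same layout as above: one (n, 2) array per row
df = pd.DataFrame({
    'coords': [rng.normal(0, 1, size=(rng.integers(5, 50), 2)) + rng.uniform(0, 50, 2)
               for _ in range(N)],
    'orientation': rng.uniform(-1, 1, N),
})

# Stack all points and repeat each row's orientation once per point
points = np.vstack(df['coords'].to_list())
counts = [len(c) for c in df['coords']]
values = np.repeat(df['orientation'].to_numpy(), counts)

fig, ax = plt.subplots()
# One numeric value per point: scatter normalizes and color-maps them together
sc = ax.scatter(points[:, 1], points[:, 0], c=values, cmap='magma')
fig.colorbar(sc, ax=ax, label='orientation')
```

Because the colorbar is built from the scatter result itself, no separate norm or ScalarMappable is needed here. Call plt.show() afterwards to display the figure.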

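You iterate by row, so you should take the color from the row as well, just as you did for x and y: c=row.orientation. Note that a bare number is not accepted as a color for several points at once, so the value still has to be mapped through a colormap and norm first.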

Related

How to define a color in matplotlib with combination of different percentage of different colors?

I have several data series and would like to plot a line for each of them with matplotlib. I want to define the color for each condition as a combination of two different colors, so that the lines are distinguishable but the colors change gradually.
For example, I would like to define a color like m% * blue + n% * red, where m and n could be 10%, 20%, 30%, etc., something like the way to define a customized color in LaTeX, but I could not find anything by searching the manual or the internet. Could you please tell me how to do that?
My original data is large, so to keep the question simple I will use the following data as a minimal example. For example, the first line could be in color 20% * blue + 20% * red and the second line in color 50% * blue + 40% * red; any combination is fine. I think this keeps the main aspect of the problem.
import numpy as np
import matplotlib.pyplot as plt
x1 = np.linspace(1, 10, 10)
y1 = np.random.rand(10)
x2 = np.linspace(1, 10, 20)
y2 = np.random.rand(20)
plt.plot(x1, y1)
plt.plot(x2, y2)

You can pass the color parameter of plt.plot an RGB tuple, color=(r, g, b), where r, g and b are floats in [0, 1] giving the fraction of red, green and blue. Here is some example code:
import numpy as np
import matplotlib.pyplot as plt
def myfunction(x, a=1):
    return np.sin(x+a)
x = np.linspace(-10,10,101)
plt.figure(figsize=(10,5))
# a runs from 0 to 2, so a/2 fades red in while 1-a/2 fades green out
for a in np.linspace(0,2,21):
    plt.plot(x, myfunction(x,a), color=(a/2, 1-a/2,0))
plt.show()
In this example I use the parameter a to change the red and green components of the line color.
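For the "m% * blue + n% * red" notation from the question, one possible reading is a weighted sum of the two colors in RGB space. The sketch below assumes that reading; mix_colors is a hypothetical helper, not a matplotlib function, and it only relies on matplotlib.colors.to_rgb to look up named colors.

```python
import numpy as np
from matplotlib.colors import to_rgb

def mix_colors(color_a, color_b, weight_a, weight_b):
    """Weighted sum of two colors in RGB space, clipped to [0, 1]."""
    rgb = weight_a * np.array(to_rgb(color_a)) + weight_b * np.array(to_rgb(color_b))
    return tuple(float(v) for v in np.clip(rgb, 0.0, 1.0))

# 20% blue + 20% red, and 50% blue + 40% red, as in the question
first = mix_colors('blue', 'red', 0.2, 0.2)
second = mix_colors('blue', 'red', 0.5, 0.4)
```

The resulting tuples can be passed directly, e.g. plt.plot(x1, y1, color=first). When the weights sum to less than 1, the remainder is effectively black, so such mixes come out dark. A linear mix in RGB is also not perceptually uniform, which may or may not matter for telling the lines apart.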

Color mixing is non-trivial, even if we extract the RGBA values of defined colors with to_rgba(). However, we can cheat and let matplotlib do the calculations by plotting the same curve twice - first with x% x_color (x% represented as a value between 0 and 1 with 1 meaning 100%), then again with y% y_color:
import matplotlib.pyplot as plt
import numpy as np
fig, ax = plt.subplots()
x_color = "orange"
y_color = "orchid"
for x in np.linspace(0, 1, 5):
    for y in np.linspace(0, 1, 4):
        ax.plot((x, x+0.1), (y, y), color=x_color, alpha=x)
        ax.plot((x, x+0.1), (y, y), color=y_color, alpha=y)
ax.set_xlabel(f"color {x_color}")
ax.set_ylabel(f"color {y_color}")
plt.show()

How to plot color as function of a third variable using matplotlib.scatter?

I want to plot points with x and y values and colour them depending on a corresponding time value. The data is stored in a DataFrame.
The solution should be the c parameter of matplotlib's scatter function, but for some reason it's not working for me.
The times column is a list of float values between 0 and 3.
Plotting the points without the c parameter works.
import matplotlib.pyplot as plt
c=list(df_result_local['times'])
for i in range(len(df_result_local['Points'])):
    plt.scatter(df_result_local['Points'][i].x, df_result_local['Points'][i].y, c=c, alpha = 0.5)
Here I get a ValueError: 'c' argument has 1698 elements, which is not acceptable for use with 'x' with size 1, 'y' with size 1.

Try collecting the coordinates first and then calling scatter once, so that x, y and c all have the same length:
import matplotlib.pyplot as plt
c=list(df_result_local['times'])
x = []
y = []
for i in range(len(df_result_local['Points'])):
    x.append(df_result_local['Points'][i].x)
    y.append(df_result_local['Points'][i].y)
plt.scatter(x, y, c=c, alpha = 0.5)

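I think you need to index c as well, so that each call receives only the value belonging to its point. Each call then normalizes its colors independently, so also pass a shared vmin and vmax (here 0 and 3, the stated range of the times) to keep the colors comparable:
plt.scatter(df_result_local['Points'][i].x, df_result_local['Points'][i].y, c=[c[i]], vmin=0, vmax=3, alpha = 0.5)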
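A self-contained version of the single-call approach might look like the sketch below. The Point class is a hypothetical stand-in for whatever objects the Points column really holds (only .x and .y attributes are assumed), and the data is random.

```python
from dataclasses import dataclass
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

@dataclass
class Point:
    # Hypothetical stand-in; only the .x and .y attributes are assumed
    x: float
    y: float

rng = np.random.default_rng(0)
n = 200
df_result_local = pd.DataFrame({
    'Points': [Point(px, py) for px, py in rng.uniform(0, 10, size=(n, 2))],
    'times': rng.uniform(0, 3, n),
})

# Unpack the coordinates so that x, y and c all have length n
x = [p.x for p in df_result_local['Points']]
y = [p.y for p in df_result_local['Points']]

fig, ax = plt.subplots()
# vmin/vmax pin the color scale to the known range of the times column
sc = ax.scatter(x, y, c=df_result_local['times'], vmin=0, vmax=3, alpha=0.5)
fig.colorbar(sc, ax=ax, label='times')
```

A single call also gives a mappable that the colorbar can use directly, which the per-point loop does not. Call plt.show() afterwards to display the figure.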

Plot 2D histogram data with pcolormesh

I need to plot a binned statistic, as one would get from scipy.stats.binned_statistic_2d. Basically, that means I have edge values and within-bin data. This also means I cannot (to my knowledge) use plt.hist2d. Here's a code snippet to generate the sort of data I might need to plot:
import numpy as np
x_edges = np.arange(6)
y_edges = np.arange(6)
bin_values = np.random.randn(5, 5)
One would imagine that I could use pcolormesh for this, but the issue is that pcolormesh does not allow for bin edge values. The following will only plot the values in bins 1 through 4. The 5th value is excluded, since while pcolormesh "knows" that the value at 4.0 is some value, there is no later value to plot, so the width of the 5th bin is zero.
import matplotlib.pyplot as plt
X, Y = np.broadcast_arrays(x_edges[:5, None], y_edges[None, :5])
plt.figure()
plt.pcolormesh(X, Y, bin_values)
plt.show()
I can get around this with an ugly hack by adding an additional set of values equal to the last values:
import matplotlib.pyplot as plt
X, Y = np.broadcast_arrays(x_edges[:, None], y_edges[None, :])
dummy_bin_values = np.zeros([6, 6])
dummy_bin_values[:5, :5] = bin_values
dummy_bin_values[5, :] = dummy_bin_values[4, :]
dummy_bin_values[:, 5] = dummy_bin_values[:, 4]
plt.figure()
plt.pcolormesh(X, Y, dummy_bin_values)
plt.show()
However, this is an ugly hack. Is there any cleaner way to plot 2D histogram data with bin edge values? "No" is possibly the correct answer, but convince me that's the case if it is.

I do not understand the problem with either of the two options. So here is simply some code that uses both: numpy-histogrammed data with pcolormesh, as well as plt.hist2d.
import numpy as np
import matplotlib.pyplot as plt
x_edges = np.arange(6)
y_edges = np.arange(6)
data = np.random.rand(340,2)*5
### using numpy.histogram2d
bin_values,_,__ = np.histogram2d(data[:,0],data[:,1],bins=(x_edges, y_edges) )
X, Y = np.meshgrid(x_edges,y_edges)
fig, (ax,ax2) = plt.subplots(ncols=2)
ax.set_title("numpy.histogram2d \n + plt.pcolormesh")
ax.pcolormesh(X, Y, bin_values.T)
### using plt.hist2d
ax2.set_title("plt.hist2d")
ax2.hist2d(data[:,0],data[:,1],bins=(x_edges, y_edges))
plt.show()
Of course this would equally work with scipy.stats.binned_statistic_2d.
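For already-binned values like those in the question, pcolormesh can be given the edge arrays directly: with flat shading it accepts X and Y that are one element longer than the value array in each dimension. A minimal sketch, using random numbers in place of real bin statistics:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x_edges = np.arange(6)                # 6 edges delimit 5 bins
y_edges = np.arange(6)
bin_values = rng.normal(size=(5, 5))  # indexed [x_bin, y_bin], as in the question

fig, ax = plt.subplots()
# pcolormesh expects C[row=y, col=x], hence the transpose
mesh = ax.pcolormesh(x_edges, y_edges, bin_values.T, shading='flat')
fig.colorbar(mesh, ax=ax)
```

All 25 bins are drawn, and no padding with dummy values is needed. Call plt.show() afterwards to display the figure.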

How do I shift categorical scatter markers to left and right above xticks (multiple data sets per category)?

I have a simple pandas dataframe that I want to plot with matplotlib:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_excel('SAT_data.xlsx', index_col = 'State')
plt.figure()
plt.scatter(df['Year'], df['Reading'], c = 'blue', s = 25)
plt.scatter(df['Year'], df['Math'], c = 'orange', s = 25)
plt.scatter(df['Year'], df['Writing'], c = 'red', s = 25)
I'd like to shift the blue data points a bit to the left, and the red ones a bit to the right, so each year on the x-axis has three mini-columns of scatter data above it instead of all three datasets overlapping. I tried and failed to use the 'verts' argument properly. Is there a better way to do this?

Using an offset transform lets you shift the scatter points by some amount in units of points instead of data units. The advantage is that the markers always sit tight against each other, independent of the figure size, zoom level, etc.
import matplotlib.pyplot as plt
import numpy as np; np.random.seed(0)
import matplotlib.transforms as transforms
year = np.random.choice(np.arange(2006,2017), size=(300) )
values = np.random.rand(300, 3)
plt.figure()
offset = lambda p: transforms.ScaledTranslation(p/72.,0, plt.gcf().dpi_scale_trans)
trans = plt.gca().transData
sc1 = plt.scatter(year, values[:,0], c = 'blue', s = 25, transform=trans+offset(-5))
plt.scatter(year, values[:,1], c = 'orange', s = 25)
plt.scatter(year, values[:,2], c = 'red', s = 25, transform=trans+offset(5))
plt.show()
(Figures: the resulting plot as a broad figure, as a normal figure, and zoomed in.)
Some explanation:
The problem is that we want to add an offset in points to some data in data coordinates. While data coordinates are automatically transformed to display coordinates using transData (which we normally don't even see on the surface), adding an offset requires us to change the transform.
We do this by adding an offset transform to it. While we could just add an offset in pixels (display coordinates), it is more convenient to add the offset in points, the same unit in which the size of the scatter markers is given (their size is actually in points squared).
So how many pixels are p points? Divide p by the ppi (points per inch) to obtain inches, then multiply by the dpi (dots per inch) to obtain display pixels. This calculation is done by the ScaledTranslation.
While the dots per inch are in principle variable (and taken care of by the dpi_scale_trans transform), the points per inch are fixed. Matplotlib uses 72 ppi, which is a typesetting standard.
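The points-to-pixels conversion can be checked numerically. The sketch below compares the calculation done by hand with what ScaledTranslation produces; the offset of 5 points and the dpi of 100 are arbitrary values chosen for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.transforms as transforms

fig = plt.figure(dpi=100)
p = 5  # offset in points

# By hand: points -> inches (divide by 72) -> pixels (multiply by dpi)
pixels_by_hand = p / 72.0 * fig.dpi

# ScaledTranslation passes the (p/72, 0) inch offset through dpi_scale_trans
offset = transforms.ScaledTranslation(p / 72.0, 0, fig.dpi_scale_trans)
pixels_by_transform = offset.transform((0, 0))[0]

print(pixels_by_hand, pixels_by_transform)
```

Both numbers should agree. Because the transform refers to the figure's dpi_scale_trans rather than a fixed number, the offset follows along if the figure's dpi changes later.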

A quick and dirty way would be to create a small offset dx and subtract it from x values of blue points and add to x values of red points.
dx = 0.1
plt.scatter(df['Year'] - dx, df['Reading'], c = 'blue', s = 25)
plt.scatter(df['Year'], df['Math'], c = 'orange', s = 25)
plt.scatter(df['Year'] + dx, df['Writing'], c = 'red', s = 25)
One more option could be to use the stripplot function from the seaborn library. It is necessary to melt the original dataframe into long form so that each row contains a year, a test and a score. Then make a stripplot specifying year as x, score as y and test as hue. The dodge keyword argument (called split in older seaborn versions) is what controls plotting the categories as separate stripes for each x. There's also the jitter argument, which adds some noise to the x values so that they take up a small area instead of sitting on a single vertical line.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# make up example data
np.random.seed(2017)
df = pd.DataFrame(columns = ['Reading','Math','Writing'],
                  data = np.random.normal(540,30,size=(1000,3)))
df['Year'] = np.random.choice(np.arange(2006,2016),size=1000)
# melt the data into long form
df1 = pd.melt(df, var_name='Test', value_name='Score',id_vars=['Year'])
# make a stripplot
fig, ax = plt.subplots(figsize=(10,7))
sns.stripplot(data = df1, x='Year', y = 'Score', hue = 'Test',
              jitter = True, dodge = True, alpha = 0.7,
              palette = ['blue','orange','red'])

Here is how the given code can be adapted to work with multiple subplots, and also to a situation without a "middle column".
To adapt the given code, ax[n,p].transData is needed instead of plt.gca().transData. plt.gca() refers to the current subplot (by default the last one created), while now you'll need the transform of each individual subplot.
Another problem is that when plotting only via a transform, matplotlib doesn't automatically set the lower and upper limits of the subplot. The given example plots the points "in the middle" without setting a specific transform, and the plot gets "zoomed out" around these points (orange in the example).
If you don't have points at the center, the limits need to be set in another way. The way I came up with is to plot some dummy points in the middle (which sets the limits) and then remove them again.
Also note that the size of the scatter dots is given as the square of their diameter (measured in points). To have touching dots, you'd need to use the square root for their offset.
import matplotlib.pyplot as plt
from matplotlib import transforms
import numpy as np
# Set up data for reproducible example
year = np.random.choice(np.arange(2006, 2017), size=(100))
data = np.random.rand(4, 100, 3)
data2 = np.random.rand(4, 100, 3)
# Create plot and set up subplot ax loop
fig, axs = plt.subplots(2, 2, figsize=(18, 14))
# Set up offset with transform
offset = lambda p: transforms.ScaledTranslation(p / 72., 0, plt.gcf().dpi_scale_trans)
# Plot data in a loop
for ax, q, r in zip(axs.flat, data, data2):
    temp_points = ax.plot(year, q, ls=' ')
    for pnt in temp_points:
        pnt.remove()
    ax.plot(year, q, marker='.', ls=' ', ms=10, c='b', transform=ax.transData + offset(-np.sqrt(10)))
    ax.plot(year, r, marker='.', ls=' ', ms=10, c='g', transform=ax.transData + offset(+np.sqrt(10)))
plt.show()

How do I use axvfill with a boolean series

I have a boolean time series that I want to use to determine the parts of the plot that should be shaded.
Currently I have:
ax1.fill_between(data.index, r_min, r_max, where=data['USREC']==True, alpha=0.2)
where r_min and r_max are just the min and max of the y-axis.
But fill_between doesn't fill all the way to the top and bottom of the plot, so I wanted to use axvspan() instead.
Is there an easy way to do this, given that axvspan only takes coordinates? The only way I can think of is to group all the adjacent dates that are True, then take the first and last date of each group and pass them into axvspan.
Thank you

You can still use fill_between, if you like. However, instead of specifying the y-coordinates in data coordinates (for which it is not clear a priori how large they need to be), you can specify them in axes coordinates. This can be achieved using a transform where the x part is in data coordinates and the y part is in axes coordinates. The xaxis transform is such a transform. (This is not very surprising, since the xaxis is always independent of the y-coordinates.) So
ax.fill_between(data.index, 0,1, where=data['USREC'], transform=ax.get_xaxis_transform())
would do the job.
Here is a complete example:
import matplotlib.pyplot as plt
import numpy as np; np.random.seed(0)
x = np.linspace(0,100,350)
y = np.cumsum(np.random.normal(size=len(x)))
bo = np.zeros(len(y))
bo[y>5] = 1
fig, ax = plt.subplots()
ax.fill_between(x, 0, 1, where=bo, alpha=0.4, transform=ax.get_xaxis_transform())
plt.plot(x,y)
plt.show()
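The grouping idea from the question (find each run of adjacent True values and pass its first and last date to axvspan) can also be sketched with pandas. true_runs is a hypothetical helper written for this illustration, and the small usrec series stands in for data['USREC'].

```python
import pandas as pd
import matplotlib.pyplot as plt

def true_runs(mask):
    """Return (start, end) index labels for each run of consecutive True values."""
    mask = mask.astype(bool)
    # A new run id starts wherever the value differs from the previous one
    run_id = mask.astype(int).diff().ne(0).cumsum()
    runs = []
    for _, run in mask.groupby(run_id):
        if run.iloc[0]:
            runs.append((run.index[0], run.index[-1]))
    return runs

# Made-up stand-in for data['USREC']
idx = pd.date_range("2020-01-01", periods=8, freq="D")
usrec = pd.Series([False, True, True, False, False, True, True, True], index=idx)

fig, ax = plt.subplots()
for start, end in true_runs(usrec):
    # axvspan always covers the full height of the axes
    ax.axvspan(start, end, alpha=0.2)
```

A run consisting of a single sample yields start == end and therefore a span of zero width, so in that case the end may need to be extended by one sampling interval. The fill_between approach above avoids the grouping step altogether.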
