I've got a dataset of about 3 million circles (each with an x, y, and od property) in a pandas dataframe. I'd like to plot them over each other to visualize patterns.
I had done this previously with a smaller dataset (about 15k circles), but now it seems to be choking: memory usage climbs to 16 GB by the time I've added only a few hundred thousand circles.
df is the dataframe; plt is matplotlib.pyplot.

ax2 = plt.gca()
ax2.set(xlim=(-0.25, 0.25), ylim=(-0.25, 0.25))
for i, row in df.iterrows():
    x = row.X_delta
    y = row.Y_delta
    od = float(row.OD)
    circle = plt.Circle((x, y), od / 2, color='r', fill=False, lw=5, alpha=0.01)
    ax2.add_artist(circle)
Any thoughts on a more memory-efficient way to do this?
Drawing all 3 million circles in one plot doesn't seem like a viable approach. Here's an example with just 1000 circles (following the example by matt_s):
Instead, I suggest reducing the number of circles drawn to some sensible value, e.g. 50 or 100. One approach is to run KMeans on your dataset to cluster the circles by coordinate and diameter. The chart below shows the clustering of 100,000 random circles as an example; this should scale comfortably to 3 million circles.
The marker sizes represent the diameter (s, scaled to fit the chart), and the color indicates the number of circles per cluster center (c). YMMV.
Code used to plot the first chart (IPython):

%matplotlib inline
import pandas as pd
import numpy as np

n = 1000
circles = pd.DataFrame({'x': np.random.random(n),
                        'y': np.random.random(n),
                        'r': np.random.random(n)})
# s scales the marker area from the radius; c colors by radius
circles.plot(kind='scatter', x='x', y='y',
             s=circles['r'] * 1000, c=circles['r'] * 10,
             facecolors='none')
Code used to plot the second chart (IPython):

%matplotlib inline
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# parameters
n = 100000
n_clusters = 50

# dummy data
circles = pd.DataFrame({'x': np.random.random(n),
                        'y': np.random.random(n),
                        'r': np.random.random(n)})

# cluster by coordinate and radius using k-means
# (n_jobs was removed from KMeans in scikit-learn 1.0)
km = KMeans(n_clusters=n_clusters)
circles['cluster'] = km.fit_predict(circles.to_numpy())  # .as_matrix() no longer exists in pandas

# bin by cluster
cluster_size = circles.groupby('cluster').cluster.count()

# plot, using circles-per-cluster as the color weight
clusters = km.cluster_centers_
fig = plt.figure()
sc = plt.scatter(x=clusters[:, 0], y=clusters[:, 1],  # cluster centers x, y
                 c=cluster_size,            # color: circles per cluster
                 s=clusters[:, 2] * 1000,   # radius of cluster center, scaled
                 facecolors='none')         # don't fill markers
plt.colorbar()
fig.suptitle('clusters by #circles, c/d = size')
plt.xlabel('x')
plt.ylabel('y')
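If fitting plain KMeans on the full 3 million rows turns out to be slow, scikit-learn's MiniBatchKMeans is a near drop-in replacement. A sketch (only the estimator line changes; the batch_size value is just an assumption to tune):

from sklearn.cluster import MiniBatchKMeans

# Same fit/predict interface as KMeans, but trained on small random
# batches, which scales much better to millions of rows.
km = MiniBatchKMeans(n_clusters=n_clusters, batch_size=10000)
circles['cluster'] = km.fit_predict(circles[['x', 'y', 'r']].to_numpy())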
Have you tried the pandas scatter plot?

import pandas as pd
import numpy as np

n = 100000
df = pd.DataFrame({'x': np.random.random(n),
                   'y': np.random.random(n),
                   'r': np.random.random(n)})
df.plot(kind='scatter', x='x', y='y', s=df['r'] * 1000, facecolor='none')
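If you need true circles drawn in data units rather than scatter markers (whose sizes are in points), a single collection is far lighter in memory than millions of individual Circle artists. A sketch using matplotlib's EllipseCollection, with the column names from the question assumed (older matplotlib versions call the offset_transform keyword transOffset):

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from matplotlib.collections import EllipseCollection

# Dummy stand-in for the original dataframe (X_delta, Y_delta, OD assumed)
n = 100000
df = pd.DataFrame({'X_delta': np.random.uniform(-0.25, 0.25, n),
                   'Y_delta': np.random.uniform(-0.25, 0.25, n),
                   'OD': np.random.uniform(0.005, 0.02, n)})

fig, ax = plt.subplots()
ax.set(xlim=(-0.25, 0.25), ylim=(-0.25, 0.25))
# One artist holds every circle; widths/heights are diameters in x data units
ec = EllipseCollection(df['OD'], df['OD'], np.zeros(n), units='x',
                       offsets=np.column_stack([df['X_delta'], df['Y_delta']]),
                       offset_transform=ax.transData,
                       facecolors='none', edgecolors='r', alpha=0.05)
ax.add_collection(ec)
plt.show()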
I have a dataset with lots of numerical columns. I want to draw a histogram for each column, but also add a QQ plot beneath each histogram to check more thoroughly whether the data follow a normal distribution. So for each column I would like the histogram with its QQ plot directly underneath, something like this:
I tried to do this using the following code, but the two plots overlap each other:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm

num_cols = df.select_dtypes(include=np.number)
cols = num_cols.columns.tolist()
df_sample = df.sample(n=5000)
fig, axes = plt.subplots(4, 5, figsize=(15, 12), layout='constrained')
for col, axs in zip(cols, axes.flat):
    sns.histplot(data=df_sample[col], kde=True, stat='density', ax=axs, alpha=.4)
    sm.qqplot(df_sample[col], line='45', ax=axs)  # drawn into the same axes -> overlap
plt.show()
How can I generate the histogram and QQ plot one under the other for each column?
Another issue is that my QQ plots look strange; I'm wondering if I need to standardize all my columns before making the QQ plots.
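One possible layout (a sketch, assuming 10 numeric columns here; adjust the grid to your column count): give each dataframe column a pair of stacked axes, histogram on top and QQ plot directly underneath. Passing fit=True to sm.qqplot standardizes the sample against the fitted normal, which usually cures the "strange" look without standardizing the columns yourself.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.api as sm

# Dummy stand-in for df_sample (assumption: 10 numeric columns)
rng = np.random.default_rng(0)
df_sample = pd.DataFrame(rng.normal(size=(5000, 10)),
                         columns=[f'col{i}' for i in range(10)])

cols = df_sample.columns.tolist()
n_grid_cols = 5
fig, axes = plt.subplots(4, n_grid_cols, figsize=(15, 12), layout='constrained')
for i, col in enumerate(cols):
    r, c = divmod(i, n_grid_cols)
    ax_hist = axes[2 * r, c]    # histogram on top ...
    ax_qq = axes[2 * r + 1, c]  # ... QQ plot directly underneath
    sns.histplot(data=df_sample[col], kde=True, stat='density',
                 ax=ax_hist, alpha=.4)
    sm.qqplot(df_sample[col], fit=True, line='45', ax=ax_qq)
    ax_hist.set_title(col)
plt.show()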
I want to use matplotlib's single-color colormaps (e.g. Blues), but I want the color to "pop" more. I'm not sure what the technical term is for this - higher contrast, increased brightness, something else.
My question: how, in matplotlib, can I make a single-color colormap more vibrant?
There's a toy script and output below. In the output, I want both blue and red to be less dull.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

num_units_per_rf = 1000

# Emulate Gaussian DF
gaussian_place_cell_rf_list = []
gaussian_score_90_by_neuron_list = []
for place_cell_rf in np.arange(start=1, stop=4, step=0.5):
    score_90_by_neuron = np.random.normal(loc=place_cell_rf, size=num_units_per_rf)
    gaussian_score_90_by_neuron_list.append(score_90_by_neuron)
    gaussian_place_cell_rf_list.append(np.full(fill_value=place_cell_rf, shape=num_units_per_rf))
gaussian_df = pd.DataFrame({
    'place_cell_rf': np.concatenate(gaussian_place_cell_rf_list),
    'score_90_by_neuron': np.concatenate(gaussian_score_90_by_neuron_list),
})

noise_place_cell_rf_list = []
noise_score_90_by_neuron_list = []
for place_cell_rf in np.arange(start=1, stop=4, step=0.5):
    score_90_by_neuron = np.random.normal(loc=-place_cell_rf, size=num_units_per_rf)
    noise_score_90_by_neuron_list.append(score_90_by_neuron)
    noise_place_cell_rf_list.append(np.full(fill_value=place_cell_rf, shape=num_units_per_rf))
noise_df = pd.DataFrame({
    'place_cell_rf': np.concatenate(noise_place_cell_rf_list),
    'score_90_by_neuron': np.concatenate(noise_score_90_by_neuron_list),
})

fig, ax = plt.subplots(figsize=(12, 8))
# Plot Gaussians and noise as cumulative KDEs.
g = sns.kdeplot(
    data=gaussian_df,
    x='score_90_by_neuron',
    common_norm=False,  # ensure each sweep is normalized separately
    cumulative=True,
    hue='place_cell_rf',
    palette='Reds',
    ax=ax)
sns.kdeplot(
    data=noise_df,
    x='score_90_by_neuron',
    common_norm=False,  # ensure each sweep is normalized separately
    cumulative=True,
    hue='place_cell_rf',
    palette='Blues',
    ax=g)
plt.show()
I want the blues and the reds to "glow" more, like these subsets of the hsv colormap:
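One way to get that effect (a sketch, not the only approach): slice off the pale low end of each colormap and rebuild it, then pass the truncated colormap as the palette, since seaborn accepts Colormap objects. The names vivid_blues/vivid_reds and the 0.5 cut point are my assumptions:

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import LinearSegmentedColormap

# Keep only the saturated upper half of each colormap, so even the
# lightest line drawn from it stays clearly visible.
vivid_blues = LinearSegmentedColormap.from_list(
    'vivid_blues', plt.get_cmap('Blues')(np.linspace(0.5, 1.0, 256)))
vivid_reds = LinearSegmentedColormap.from_list(
    'vivid_reds', plt.get_cmap('Reds')(np.linspace(0.5, 1.0, 256)))
# These can then replace palette='Blues' / palette='Reds' in the
# kdeplot calls above, e.g. palette=vivid_blues.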
In the question Have gradient colours in sns.pairplot for one column of dataframe so that I can see which datapoints are connected to each other, very good answers were given for the challenge of recognizing which data points belong to the same observations across subplots.
To make this question self-contained, I state my requirement here (it is an extension of the linked question):
- I would like to see the interdependence of my data.
- To that end, I want a gradual color gradient for one column of my DataFrame (so that low numerical values of that column are e.g. yellow and high values are blue).
- For a second column of my data, I would like marker sizes that increase with the values of that column.
- These colors and marker sizes should be visible in all non-diagonal subplots of my plot, based on the data points of a and b.
The solution for the gradient color is given in the linked question; I include both presently existing solutions here:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

f, axes = plt.subplots(1, 1)
np.random.seed(1)
a = np.arange(0, 10, 0.1)

def myFunc(x):
    myReturn = +10 + 10*x - x**2 + 1*np.random.random(x.shape[0])
    return myReturn

b = myFunc(a)
c = a * np.sin(a)
df = pd.DataFrame({'a': a, 'b': b, 'c': c})

if False:
    sns.pairplot(
        df,
        corner=True,
        diag_kws=dict(color=".6"),
        plot_kws=dict(
            hue=df.index,
            palette="blend:gold,dodgerblue",
        ),
    )
else:
    from matplotlib.colors import LinearSegmentedColormap
    cmap = LinearSegmentedColormap.from_list('blue-yellow', ['gold', 'lightblue', 'darkblue'])  # plt.get_cmap('viridis_r')
    g = sns.pairplot(df, corner=True)
    for ax in g.axes.flat:
        if ax is not None and ax not in g.diag_axes:
            for collection in ax.collections:
                collection.set_cmap(cmap)
                collection.set_array(df['a'])

plt.show()
A (basic) solution for the increasing marker sizes would be (using plain matplotlib):
import numpy as np
import matplotlib.pyplot as plt
# Fixing random state for reproducibility
np.random.seed(19680801)
N = 50
x = np.random.rand(N)
y = np.random.rand(N)
colors = np.random.rand(N)
area = (30 * np.random.rand(N))**2 # 0 to 15 point radii
plt.scatter(x, y, s=area, c=colors, alpha=0.5)
plt.show()
My question is:
I could work out a manual solution that iterates over all columns of my DataFrame and builds the subplots myself. Is there a more convenient (and probably more robust) way to do this?
You can modify the sizes and hue for the off-diagonal data easily by adding the parameters you'd use in Matplotlib to the plot_kws dictionary:
sns.pairplot(df, corner=True,
             diag_kws=dict(color=".6"),
             plot_kws=dict(
                 hue=df['a'],
                 palette="blend:gold,dodgerblue",
                 size=df['b']
             ))
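If the default spread of marker sizes is too subtle, seaborn's sizes parameter (min/max marker area) can be forwarded through plot_kws the same way; the (20, 200) range below is just an assumption to tweak:

sns.pairplot(df, corner=True,
             diag_kws=dict(color=".6"),
             plot_kws=dict(
                 hue=df['a'],
                 palette="blend:gold,dodgerblue",
                 size=df['b'],
                 sizes=(20, 200),  # min/max marker size passed to scatterplot
             ))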
I would like to write scout reports on some football players, and for that I need visualizations, one type of which is pie charts. I need pie charts like the one below, with slices of different sizes (proportional to the value each slice represents). Can anyone suggest how to do this, or link to a website where I can learn it?
What you are looking for is called a "Radar Pie Chart". It's analogous to the more commonly used "Radar Chart", but I think it looks better because it highlights the values rather than focusing on meaningless shapes.
The challenge with your football dataset is that each category is on a different scale, so you want to plot each value as a percentage of some maximum. My code accomplishes that, but you'll want to annotate the original values to finish off these charts.
The plot itself can be done with just the standard matplotlib library using polar axes. I borrowed code from here (https://raphaelletseng.medium.com/getting-to-know-matplotlib-and-python-docx-5ee67bad38d2).
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from math import pi
from random import random, seed

seed(12345)

# Generate a dataset with 10 rows and a different max per category
maxes = [5, 5, 5, 2, 2, 10, 10, 10, 10, 10]
df = pd.DataFrame(
    data={
        'categories': ['category_{}'.format(x) for x, _ in enumerate(maxes)],
        'scores': [random() * m for m in maxes],  # avoid shadowing the builtin max
        'max_values': maxes,
    },
)
df['pct'] = df['scores'] / df['max_values']
df = df.set_index('categories')

# Plot the radar pie chart on polar axes
N = df.shape[0]
theta = np.linspace(0.0, 2 * np.pi, N, endpoint=False)
categories = df.index
df['radar_angles'] = theta

ax = plt.subplot(polar=True)
ax.bar(df['radar_angles'], df['pct'], width=2 * pi / N, linewidth=2,
       edgecolor='k', alpha=0.5)
ax.set_xticks(theta)
ax.set_xticklabels(categories)
_ = ax.set_yticklabels([])
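To finish the charts off as mentioned above, the raw scores can be annotated onto the wedges; a minimal sketch (placement is left rough):

# Label each wedge with its raw score at the wedge's outer edge
for angle, pct, score in zip(df['radar_angles'], df['pct'], df['scores']):
    ax.annotate('{:.1f}'.format(score), xy=(angle, pct),
                ha='center', va='bottom')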
I have previously worked with rose (polar bar) charts. Here is an example:
import plotly.express as px

df = px.data.wind()
fig = px.bar_polar(df, r="frequency", theta="direction",
                   color="strength", template="plotly_dark",
                   color_discrete_sequence=px.colors.sequential.Plasma_r)
fig.show()
I am working on a dashboard using Altair. I am creating 4 different plots from the same data, using mark_circle to make scatterplots.
How do I change the size to be size*2, or anything else?
Here is a sample:
bar = alt.Chart(df).mark_point(filled=True).encode(
    x='AGE_GROUP:N',
    y=alt.Y('PERP:N', axis=alt.Axis(values=df['PERP'].unique().tolist())),
    size='count()')
You can do this by adjusting the scale range for the size encoding. For example, this sets the range such that the smallest points have an area of 100 square pixels, and the largest have an area of 500 square pixels:
import altair as alt
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'x': np.random.randn(30),
    'y': np.random.randn(30),
    'count': np.random.randint(1, 5, 30)
})

alt.Chart(df).mark_point().encode(
    x='x',
    y='y',
    size=alt.Size('count', scale=alt.Scale(range=[100, 500]))
)
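Applied to the chart from the question (a sketch; AGE_GROUP and PERP come from the asker's data), the same scale range works unchanged with mark_circle and an aggregate encoding:

bar = alt.Chart(df).mark_circle().encode(
    x='AGE_GROUP:N',
    y=alt.Y('PERP:N', axis=alt.Axis(values=df['PERP'].unique().tolist())),
    size=alt.Size('count()', scale=alt.Scale(range=[100, 500])))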