How to make aggregated point sizes bigger in Altair, Python? - python

I am working on a dashboard using Altair. I am creating 4 different plots using the same data. I am creating scatterplots using mark_circle.
How do I change the size to be size*2, or anything else?
Here is a sample:
bar = alt.Chart(df).mark_point(filled=True).encode(
    x='AGE_GROUP:N',
    y=alt.Y('PERP:N', axis=alt.Axis(values=df['PERP'].unique().tolist())),
    size='count()')

You can do this by adjusting the scale range for the size encoding. For example, this sets the range such that the smallest points have an area of 100 square pixels, and the largest have an area of 500 square pixels:
import altair as alt
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'x': np.random.randn(30),
    'y': np.random.randn(30),
    'count': np.random.randint(1, 5, 30)
})
alt.Chart(df).mark_point().encode(
    x='x',
    y='y',
    size=alt.Size('count', scale=alt.Scale(range=[100, 500]))
)

Related

How to create a seaborn graph that shows probability per bin?

I would like to make a graph using seaborn. I have three types that are called 1, 2 and 3. In each type, there are groups P and F. I would like to present the graph in a way that each bin sums up to 100% and shows how many of each type are of group P and group F. I would also like to show the types as categorical rather than interpreted as numbers.
Could someone give me suggestions how to adapt the graph?
So far, I have used the following code:
sns.displot(data=df, x="TYPE", hue="GROUP", multiple="stack", discrete=1, stat="probability")
And this is the graph:
The option multiple='fill' stretches all bars to sum up to 1 (for 100%). You can use ax.bar_label() (added in matplotlib 3.4) to label each bar.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
np.random.seed(12345)
df = pd.DataFrame({'TYPE': np.random.randint(1, 4, 30),
                   'GROUP': np.random.choice(['P', 'F'], 30, p=[0.8, 0.2])})
g = sns.displot(data=df, x='TYPE', hue='GROUP', multiple='fill', discrete=True, stat='probability')
ax = g.axes.flat[0]
ax.set_xticks(np.unique(df['TYPE']))
for bars in ax.containers:
    ax.bar_label(bars, label_type='center', fmt='%.2f')
plt.show()
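To check the numbers printed on the bars, the per-type shares can also be computed directly with pandas. This is a small sketch using the same random data as above; the variable name shares is just for illustration, and the values should match what multiple='fill' displays.

```python
import numpy as np
import pandas as pd

np.random.seed(12345)
df = pd.DataFrame({'TYPE': np.random.randint(1, 4, 30),
                   'GROUP': np.random.choice(['P', 'F'], 30, p=[0.8, 0.2])})

# Share of each GROUP within every TYPE; normalize='index' makes each row sum to 1.
shares = pd.crosstab(df['TYPE'], df['GROUP'], normalize='index')
print(shares.round(2))
```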

How to create a wind rose or polar bar plot

I would like to write a scouting report on some football players, and for that I need visualizations, one type of which is pie charts. I need pie charts that look like the one below, with slices of different sizes (proportional to the quantity each slice represents). Can anyone suggest how to do this, or link to a website where I can learn it?
What you are looking for is called a "Radar Pie Chart". It's analogous to the more commonly used "Radar Chart", but I think it looks better as it highlights the values, rather than focus on meaningless shapes.
The challenge you face with your football dataset is that each category is on a different scale, so you want to plot each value as a percentage of some max. My code will accomplish that, but you'll want to annotate the original values to finish off these charts.
The plot itself can be done with just the standard matplotlib library using polar axes. I borrowed code from here (https://raphaelletseng.medium.com/getting-to-know-matplotlib-and-python-docx-5ee67bad38d2).
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from math import pi
from random import random, seed
seed(12345)
# Generate dataset with 10 rows, different maxes
maxes = [5, 5, 5, 2, 2, 10, 10, 10, 10, 10]
df = pd.DataFrame(
    data = {
        'categories': ['category_{}'.format(x) for x, _ in enumerate(maxes)],
        'scores': [random()*m for m in maxes],
        'max_values': maxes,
    },
)
df['pct'] = df['scores'] / df['max_values']
df = df.set_index('categories')
# Plot pie radar chart
N = df.shape[0]
theta = np.linspace(0.0, 2*np.pi, N, endpoint=False)
categories = df.index
df['radar_angles'] = theta
ax = plt.subplot(polar=True)
ax.bar(df['radar_angles'], df['pct'], width=2*pi/N, linewidth=2, edgecolor='k', alpha=0.5)
ax.set_xticks(theta)
ax.set_xticklabels(categories)
_ = ax.set_yticklabels([])
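The annotation step mentioned above could look roughly like the following. It is a minimal sketch with made-up scores and maxima; the label format ("score / max") and the 0.05 offset are arbitrary choices, not part of the original answer.

```python
import matplotlib.pyplot as plt
import numpy as np

# Made-up example values: raw scores and the maximum of each category's scale.
scores = [3.2, 1.5, 8.0, 0.6]
maxes = [5, 2, 10, 2]
pct = [s / m for s, m in zip(scores, maxes)]
theta = np.linspace(0.0, 2 * np.pi, len(scores), endpoint=False)

fig = plt.figure()
ax = fig.add_subplot(polar=True)
ax.bar(theta, pct, width=2 * np.pi / len(scores), edgecolor='k', alpha=0.5)

# Write the original value just beyond the tip of each slice.
for angle, height, score, mx in zip(theta, pct, scores, maxes):
    ax.text(angle, height + 0.05, f'{score:g} / {mx:g}', ha='center', va='center')
```

Call plt.show() afterwards to display the figure.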
I have previously worked with rose (polar bar) charts. Here is an example using Plotly Express:
import plotly.express as px
df = px.data.wind()
fig = px.bar_polar(df, r="frequency", theta="direction",
                   color="strength", template="plotly_dark",
                   color_discrete_sequence=px.colors.sequential.Plasma_r)
fig.show()

Plotly: How to make a 3D stacked histogram?

I have several histograms that I succeeded in plotting using plotly like this:
fig.add_trace(go.Histogram(x=np.array(data[key]), name=self.labels[i]))
I would like to create something like this 3D stacked histogram but with the difference that each 2D histogram inside is a true histogram and not just a hardcoded line (my data is of the form [0.5 0.4 0.5 0.7 0.4] so using Histogram directly is very convenient)
Note that what I am asking is not similar to this and therefore also not the same as this. In the matplotlib example, the data is presented directly in a 2D array so the histogram is the 3rd dimension. In my case, I wanted to feed a function with many already computed histograms.
The snippet below takes care of both binning and formatting of the figure so that it appears as a stacked 3D chart using multiple traces of go.Scatter3d and np.histogram.
The input is a dataframe with random numbers using np.random.normal(50, 5, size=(300, 4))
We can talk more about the other details if this is something you can use:
Plot 1: Angle 1
Plot 2: Angle 2
Complete code:
# imports
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
pio.renderers.default = 'browser'
# data
np.random.seed(123)
df = pd.DataFrame(np.random.normal(50, 5, size=(300, 4)), columns=list('ABCD'))
# plotly setup
fig=go.Figure()
# data binning and traces
for i, col in enumerate(df.columns):
    # bin once, then reuse the counts and the bin edges
    counts, edges = np.histogram(df[col], bins=10, density=False)
    # repeat each count and pad with zeros so the outline starts and ends at z=0
    a0 = [0] + np.repeat(counts, 2).tolist() + [0]
    # repeat each bin edge so that a0 and a1 have the same length
    a1 = np.repeat(edges, 2)
    fig.add_traces(go.Scatter3d(x=[i]*len(a0), y=a1, z=a0,
                                mode='lines',
                                name=col
                                )
                   )
fig.show()
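The binning step can be isolated into a small helper that turns raw values into the coordinates of a histogram outline. This is a sketch using NumPy only; the function name step_outline is hypothetical and not part of plotly or NumPy.

```python
import numpy as np

def step_outline(values, bins=10):
    """Return (y, z) coordinates tracing the outline of a histogram."""
    counts, edges = np.histogram(values, bins=bins)
    # Every bin edge appears twice: once for each vertical segment at that edge.
    y = np.repeat(edges, 2)
    # Every count appears twice (top of the bar), padded with zeros at both ends.
    z = np.concatenate(([0], np.repeat(counts, 2), [0]))
    return y, z

rng = np.random.default_rng(123)
y, z = step_outline(rng.normal(50, 5, size=300), bins=10)
```

The two arrays have the same length (2 * bins + 2), so they can be passed as y and z to go.Scatter3d together with a constant x for each histogram.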
Unfortunately, you can't use go.Histogram in a 3D space, so you need an alternative. I used go.Scatter3d and wanted to use its option to fill the area under the line (surfaceaxis), but there appears to be a bug in it, so that option is commented out below:
import numpy as np
import plotly.graph_objs as go
# random mat
m = 6
n = 5
mat = np.random.uniform(size=(m,n)).round(1)
# we want to have the number repeated
mat = mat.repeat(2).reshape(m, n*2)
# and finally plot
x = np.arange(2*n)
y = np.ones(2*n)
fig = go.Figure()
for i in range(m):
    fig.add_trace(go.Scatter3d(x=x,
                               y=y*i,
                               z=mat[i,:],
                               mode="lines",
                               # surfaceaxis=1 # bug
                               )
                  )
fig.show()

Altair bar chart with bars of variable width?

I'm trying to use Altair in Python to make a bar chart where the bars have varying width depending on the data in a column of the source dataframe. The ultimate goal is to get a chart like this one:
The height of the bars corresponds to the marginal cost of each energy technology (given as a column in the source dataframe). The bar width corresponds to the capacity of each energy technology (also given as a column in the source dataframe). Colors are ordinal data, also from the source dataframe. The bars are sorted in increasing order of marginal cost. (A plot like this is called a "generation stack" in the energy industry.) This is easy to achieve in matplotlib, as shown in the code below:
import matplotlib.pyplot as plt
# Make fake dataset
height = [3, 12, 5, 18, 45]
bars = ('A', 'B', 'C', 'D', 'E')
# Choose the width of each bar and their positions
width = [0.1,0.2,3,1.5,0.3]
y_pos = [0,0.3,2,4.5,5.5]
# Make the plot
plt.bar(y_pos, height, width=width)
plt.xticks(y_pos, bars)
plt.show()
(code from https://python-graph-gallery.com/5-control-width-and-space-in-barplots/)
But is there a way to do this with Altair? I would want to do this with Altair so I can still get the other great features of Altair like a tooltip, selectors/bindings as I have lots of other data I want to show alongside the bar-chart.
First 20 rows of my source data looks like this:
(does not match exactly the chart shown above).
In Altair, the way to do this would be to use the rect mark and construct your bars explicitly. Here is an example that mimics your data:
import altair as alt
import pandas as pd
import numpy as np
np.random.seed(0)
df = pd.DataFrame({
    'MarginalCost': 100 * np.random.rand(30),
    'Capacity': 10 * np.random.rand(30),
    'Technology': np.random.choice(['SOLAR', 'THERMAL', 'WIND', 'GAS'], 30)
})
df = df.sort_values('MarginalCost')
df['x1'] = df['Capacity'].cumsum()
df['x0'] = df['x1'].shift(fill_value=0)
alt.Chart(df).mark_rect().encode(
    x=alt.X('x0:Q', title='Capacity'),
    x2='x1',
    y=alt.Y('MarginalCost:Q', title='Marginal Cost'),
    color='Technology:N',
    tooltip=["Technology", "Capacity", "MarginalCost"]
)
To get the same result without preprocessing of the data, you can use Altair's transform syntax:
df = pd.DataFrame({
    'MarginalCost': 100 * np.random.rand(30),
    'Capacity': 10 * np.random.rand(30),
    'Technology': np.random.choice(['SOLAR', 'THERMAL', 'WIND', 'GAS'], 30)
})
alt.Chart(df).transform_window(
    x1='sum(Capacity)',
    sort=[alt.SortField('MarginalCost')]
).transform_calculate(
    x0='datum.x1 - datum.Capacity'
).mark_rect().encode(
    x=alt.X('x0:Q', title='Capacity'),
    x2='x1',
    y=alt.Y('MarginalCost:Q', title='Marginal Cost'),
    color='Technology:N',
    tooltip=["Technology", "Capacity", "MarginalCost"]
)
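The two versions compute the left edge differently (a shifted cumulative sum versus x1 minus Capacity). A quick pandas check, sketched here with the same kind of random data, suggests they agree and that the rectangles tile the x axis without gaps. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({
    'MarginalCost': 100 * np.random.rand(30),
    'Capacity': 10 * np.random.rand(30),
})

df = df.sort_values('MarginalCost')
x1 = df['Capacity'].cumsum()        # right edge: running total of capacity
x0_shift = x1.shift(fill_value=0)   # left edge as in the preprocessing version
x0_calc = x1 - df['Capacity']       # left edge as in the calculate transform

# Both definitions of the left edge should coincide.
same_left_edges = np.allclose(x0_shift, x0_calc)
# Each bar should start exactly where the previous one ends.
no_gaps = np.allclose(x0_shift.iloc[1:].to_numpy(), x1.iloc[:-1].to_numpy())
```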

How to draw 3 million circles in python

I've got a dataset of many (~3M) circles (each has an x, y, and od property) in a pandas dataframe. I'd like to plot them over each other to visualize patterns
I had done this previously with a smaller dataset (about 15k circles), but now it seems to be choking (memory is going up to the 16GB by the time I'm at only a few hundred thousand)
df is the dataframe
plt is matplotlib.pyplot
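ax2 = plt.gca()
# set the limits separately; recent matplotlib versions no longer accept them in gca()
ax2.set(xlim=(-.25, .25), ylim=(-0.25, 0.25))
for i, row in df.iterrows():
    x = row.X_delta
    y = row.Y_delta
    od = float(row.OD)
    circle = plt.Circle((x, y), od/2, color='r', fill=False, lw=5, alpha=0.01)
    ax2.add_artist(circle)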
Any thoughts on a more memory efficient way to do this?
Drawing all 3 million circles in one plot doesn't seem a viable approach. Here's an example with just 1000 circles (following the example by matt_s):
Instead, I suggest reducing the number of circles to draw to some sensible value, e.g. 50 or 100. One approach is to run KMeans on your dataset to cluster the circles by coordinate and diameter. The following chart represents the clustering of 100,000 random circles as an example. This should easily scale to 3 million circles.
The marker's dimensions represent the diameter (s, scaled to fit the chart), and the color indicates the number of circles per cluster center (c). YMMV
Code used to plot the first chart (ipython)
%matplotlib inline
import pandas as pd
import numpy as np
n = 1000
circles = pd.DataFrame({'x': np.random.random(n), 'y': np.random.random(n), 'r': np.random.random(n)},)
circles.plot(kind='scatter', x='x', y='y', s=circles['r']*1000, c=circles.r * 10, facecolors='none')
Code used to plot the second chart (ipython)
%matplotlib inline
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
# parameters
n = 100000
n_clusters = 50
# dummy data
circles = pd.DataFrame({'x': np.random.random(n), 'y': np.random.random(n), 'r': np.random.random(n)})
# cluster using kmeans
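# n_jobs is no longer accepted by recent scikit-learn versions, so it is omitted
km = KMeans(n_clusters=n_clusters)
# as_matrix() was removed from pandas; to_numpy() is the replacement
circles['cluster'] = km.fit_predict(circles[['x', 'y', 'r']].to_numpy())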
# bin by cluster
cluster_size = circles.groupby('cluster').cluster.count()
# plot, using #circles / per cluster as the od weight
clusters = km.cluster_centers_
fig = plt.figure()
ax = plt.scatter(x=clusters[:,0], y=clusters[:,1],  # clusters x,y
                 c=cluster_size,  # color
                 s=clusters[:,2] * 1000,  # diameter, scaled
                 facecolors='none')  # don't fill markers
plt.colorbar()
fig.suptitle('clusters by #circles, c/d = size')
plt.xlabel('x')
plt.ylabel('y')
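If scikit-learn is not available, a cruder reduction is to snap the circles to a coarse grid and average within each cell. Note that this is a swapped-in technique (grid binning rather than KMeans), sketched here with pandas only; the grid resolution of 5 cells per axis and the column names ix, iy, ir are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000
# Dummy data in [0, 1), as in the example above.
circles = pd.DataFrame({'x': rng.random(n), 'y': rng.random(n), 'r': rng.random(n)})

cells = 5  # grid resolution per axis (arbitrary choice)
binned = circles.assign(
    ix=(circles['x'] * cells).astype(int).clip(upper=cells - 1),
    iy=(circles['y'] * cells).astype(int).clip(upper=cells - 1),
    ir=(circles['r'] * cells).astype(int).clip(upper=cells - 1),
)

# One representative circle per occupied cell, plus how many circles it stands for.
summary = binned.groupby(['ix', 'iy', 'ir']).agg(
    x=('x', 'mean'), y=('y', 'mean'), r=('r', 'mean'), count=('x', 'count'),
)
```

The summary frame has at most cells**3 rows and can be passed to plt.scatter in the same way as the cluster centers, with count as the color.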
Have you tried the pandas scatter plot?
import pandas as pd
import numpy as np
n = 100000
df = pd.DataFrame({'x': np.random.random(n), 'y': np.random.random(n), 'r': np.random.random(n)})
df.plot(kind='scatter', x='x', y='y', s=df['r']*1000, facecolor='none')
