Is there a way to apply a data transform or filter using the encoding used by a repeated chart?
If I understand the docs correctly, this is not directly possible:
Currently repeat can only be specified for rows and column (not, e.g., for layers) and the target can only be encodings (not, e.g., data transforms) but there is discussion within the Vega-Lite community about making this pattern more general in the future.
What would be good ways to work around this? For instance below, say I want to plot only the points for which y>0 (or could be another transform, I don't want to just zoom on the y axis). Is there a way to apply something like the line #0 using the repeat target (as attempted in #1, which fails with TypeError: '>' not supported between instances of 'RepeatRef' and 'float')?
import altair as alt
import numpy as np
import pandas as pd
x = np.arange(100)
source = pd.DataFrame({
'x': x,
'f': np.sin(x / 5),
'g': np.cos(x / 3),
})
alt.Chart(source).mark_line().encode(
alt.X('x', type='quantitative'),
alt.Y(alt.repeat('column'), type='quantitative'),
).transform_filter(
# alt.datum.f >= 0. #0 Works, but would like to use f or g depending on the plotted variable
alt.repeat('column') > 0. #1 ERROR HERE
).repeat(
column=['f', 'g']
)
There is no way to reference the repeated field within a transform. The best way to approach this would be to build the chart via concatenation; for example:
alt.hconcat(*(
alt.Chart(source).mark_line().encode(
alt.X('x', type='quantitative'),
alt.Y(col, type='quantitative'),
).transform_filter(
alt.datum[col] >= 0
)
for col in ['f', 'g']
))
I started using Python 6 months ago, so my question may be a naive one. I would like to visualize my data and ANOVA statistics. It is common to do this using a barplot with added lines indicating significant differences and interactions. How do you make a plot like this using Python?
Here is a simple dataframe, with 3 columns (A,B and the p_values already calculated with a t-test)
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
ar = np.array([ [565.0, 81.0, 1.630947e-02],
[1006.0, 311.0, 1.222740e-27],
[2929.0, 1292.0, 5.559912e-12],
[3365.0, 1979.0, 2.507474e-22],
[2260.0, 1117.0, 1.540305e-01]])
df = pd.DataFrame(ar,columns = ['A', 'B', 'p_value'])
ax = plt.subplot()
# I calculate the percentage
(df.iloc[:,0:2]/df.iloc[:,0:2].sum()*100).plot.bar(ax=ax)
for container, p_val in zip(ax.containers,df['p_value']):
    labels = [f"{round(v,1)}%" if (p_val > 0.05) else f"(**)\n{round(v,1)}%" for v in container.datavalues]
    ax.bar_label(container,labels=labels, fontsize=10,padding=8)
plt.show()
Initially I just wanted to add a "**" each time a significant difference is observed between the 2 columns A & B. But the initial code above is not really working.
Now I would prefer to have added lines indicating significant differences and interactions between the A and B columns, but I have no idea how to make that happen.
Regards
JYK
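One possible way to draw such lines, as a sketch only: loop over the A/B bar pairs row by row and draw a small bracket with "**" above each pair whose p-value is below the threshold. The helper name `add_brackets` and its `alpha` and `pad` parameters are made up for this illustration; the data is the frame from the question.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def add_brackets(ax, p_values, alpha=0.05, pad=0.03):
    """Draw a '**' bracket over each A/B bar pair with p < alpha (hypothetical helper)."""
    bars_a, bars_b = ax.containers[0], ax.containers[1]
    y_span = ax.get_ylim()[1] - ax.get_ylim()[0]
    tick = 0.5 * pad * y_span
    n_drawn = 0
    for bar_a, bar_b, p in zip(bars_a, bars_b, p_values):
        if p >= alpha:
            continue  # not significant: leave this pair unmarked
        x0 = bar_a.get_x() + bar_a.get_width() / 2
        x1 = bar_b.get_x() + bar_b.get_width() / 2
        top = max(bar_a.get_height(), bar_b.get_height()) + pad * y_span
        # Bracket: up from A, across, down to B.
        ax.plot([x0, x0, x1, x1], [top, top + tick, top + tick, top], color="black", lw=1)
        ax.text((x0 + x1) / 2, top + tick, "**", ha="center", va="bottom")
        n_drawn += 1
    return n_drawn

ar = np.array([[565.0, 81.0, 1.630947e-02],
               [1006.0, 311.0, 1.222740e-27],
               [2929.0, 1292.0, 5.559912e-12],
               [3365.0, 1979.0, 2.507474e-22],
               [2260.0, 1117.0, 1.540305e-01]])
df = pd.DataFrame(ar, columns=["A", "B", "p_value"])

fig, ax = plt.subplots()
pct = df[["A", "B"]] / df[["A", "B"]].sum() * 100  # column-wise percentages
pct.plot.bar(ax=ax)
n_drawn = add_brackets(ax, df["p_value"])
```

Calling `plt.show()` afterwards displays the figure. Note that the loop pairs each p-value with one row (one A bar and one B bar), rather than with one container.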
The following code produces a column chart in which the y axis grows in the wrong direction.
alt.Chart(df).mark_line().encode(
x = alt.X('pub_date', timeUnit='month'),
y = alt.Y('sum(has_kw)', ),
)
I wanted to correct it as suggested by https://stackoverflow.com/a/58326269, and changed my code to
alt.Chart(df).mark_line().encode(
x = alt.X('pub_date', timeUnit='month'),
y = alt.Y('sum(has_kw)', sort=alt.EncodingSortField('y', order='descending') ),
)
But now altair produces a strange diagram, see 2.
That is, sum(has_kw) is calculated incorrectly. Why is this, and how can I correct it?
It is hard to know exactly without seeing a sample of your data but you could try one of the following (based on the example you linked). This first approach is similar to what you tried already:
import altair as alt
import numpy as np
import pandas as pd
# Compute x^2 + y^2 across a 2D grid
x, y = np.meshgrid(range(0, 3), range(0, 3))
z = x ** 2 + y ** 2
# Convert this grid to columnar data expected by Altair
source = pd.DataFrame({
'x': x.ravel(),
'y': y.ravel(),
'z': z.ravel()
})
alt.Chart(source).mark_rect().encode(
x='x:O',
y=alt.Y('y:O', sort='descending'),
color='z:Q'
)
This second approach simply reverses the axis without sorting it and might be more compatible with your data:
alt.Chart(source).mark_rect().encode(
x='x:O',
y=alt.Y('y:O', scale=alt.Scale(reverse=True)),
color='z:Q'
)
I am trying to find a good way to run a (nonlinear, injective, multivariable) transformation on columns in a pandas dataframe. Transform is a black box with multiple variables in and multiple variables out.
As an easy illustration, let's just consider converting r, theta coordinates to x, y coordinates. Run this for setup/context
# set up example (all this is given in my case)
def blackbox_transform(rtheta):
    x = rtheta[0]*np.cos(rtheta[1])
    y = rtheta[0]*np.sin(rtheta[1])
    return (x, y)
n = 50
r = np.ones(n)
theta = np.linspace(0, np.pi / 2, n)
r_theta = np.concatenate((r[:, None], theta[:, None]), axis=1)
df = pd.DataFrame(data=r_theta, columns=['r', 'theta'])
For the solution, this is the best I can come up with, but the apply and unpacking seems clunky (hoping a pandas wizard has a better approach):
# solution
xy = df[['r', 'theta']].apply(blackbox_transform, axis=1)
df = pd.concat((df, pd.DataFrame(data=[*xy], columns=['x', 'y'], index=xy.index)), axis=1)
I get that using pandas may look a little silly here, but there's a lot of other information I have in the dataframe and I need to transform some numerics columns while keeping all the indices and other info straight.
Here is a slightly more readable approach:
out = df[['r', 'theta']].apply(blackbox_transform, axis=1).apply(pd.Series)
df = df.assign(x=out[0], y=out[1])
Btw your use of lambda is dispensable when you are just forwarding the same argument.
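Another option, sketched under the assumption that the black box accepts any indexable row: pass it plain NumPy rows and attach the outputs by position. `blackbox_transform` below is just the stand-in from the question, and the setup is abbreviated.

```python
import numpy as np
import pandas as pd

def blackbox_transform(rtheta):
    # Stand-in for the black box from the question.
    x = rtheta[0] * np.cos(rtheta[1])
    y = rtheta[0] * np.sin(rtheta[1])
    return (x, y)

n = 50
df = pd.DataFrame({"r": np.ones(n), "theta": np.linspace(0, np.pi / 2, n)})

# Each row of to_numpy() is a 1-D array, so rtheta[0] and rtheta[1] are positional.
xy = np.array([blackbox_transform(row) for row in df[["r", "theta"]].to_numpy()])

# assign() keeps the original index and all other columns untouched.
df = df.assign(x=xy[:, 0], y=xy[:, 1])
```

Because the new columns are assigned by position, this relies on the row order staying unchanged between the `to_numpy()` call and the `assign()`.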
I'm interested in being able to recreate this multidimensional strip plot below, generated by the Missing Numbers python library, using vega-lite, and I'm looking for a few pointers on how I might do this. The code to generate the image below looks a bit like this snippet:
>>> from quilt.data.ResidentMario import missingno_data
>>> collisions = missingno_data.nyc_collision_factors()
>>> collisions = collisions.replace("nan", np.nan)
>>> import missingno as msno
>>> %matplotlib inline
>>> msno.matrix(collisions.sample(250))
For each column, there is a mark shown for a specific combination of the index, and where the data is null, or not null.
When I look through a gallery of charts created by Altair, I see this horizontal strip plot, which seems to be presenting a similar kind of information, but I'm not sure how to express the same idea.
The viz below is showing a mark when there is data that matches a given combination of horse power and cylinder size - the horsepower and cylinder are encoded into the x and y channels.
I'm not sure how I'd express the same for the cool nullity matrix thing, and I think I need some pointers here.
I get that I can reset the index to come up with a y index, but it's not clear to me how the index of the sample would be encoded in the Y channel, and I'm not sure how I'd populate the x-axis with a column listing the null/not null results. Is this a thing I'd need to do before it gets to vega-lite, or does vega support it?
Yes, you can do this after reshaping your data with a Fold Transform. It looks something like this using Altair:
import numpy as np
import quilt
quilt.install("ResidentMario/missingno_data")
from quilt.data.ResidentMario import missingno_data
collisions = missingno_data.nyc_collision_factors()
collisions = collisions.replace("nan", np.nan)
collisions = collisions.set_index("Unnamed: 0")
import altair as alt
alt.Chart(collisions.sample(250)).transform_window(
index='row_number()'
).transform_fold(
collisions.columns.to_list()
).transform_calculate(
defined="isValid(datum.value)"
).mark_rect().encode(
x=alt.X('key:N',
title=None,
sort=collisions.columns.to_list(),
axis=alt.Axis(orient='top', labelAngle=-45)
),
y=alt.Y('index:O', title=None),
color=alt.Color('defined:N',
legend=None,
scale=alt.Scale(domain=["true", "false"], range=["black", "white"])
)
).properties(
width=800, height=400
)
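To see what the window, fold and calculate steps produce, here is a rough pandas equivalent on a tiny made-up frame (the frame and its column names `a` and `b` are hypothetical, not the collisions data). It is only meant to illustrate the reshaped long-form table, not to replace the transforms above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the sampled collisions frame.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", "y", None]})

long = (
    df.assign(index=np.arange(1, len(df) + 1))  # like row_number(), starting at 1
      .melt(id_vars="index", var_name="key", value_name="value")  # like the fold
)
# Like isValid(datum.value): False for missing entries.
long["defined"] = long["value"].notna()
```

Each original cell becomes one row with its row number, its column name in `key`, and a `defined` flag, which is exactly what the rect marks are drawn from.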
I need your help with coding a graph result - plotting a function in an interval.
The question which I got is:
"Plot the following composite function. You probably want to use 'if' statements and a loop to 'build' it. Plot the function in the interval from [-3, 5].
f(x) = {|x| x<0}
{-1 0 <= x < 1}
{+1 1 <= x < 2}
{ln(x) 2 <= x}
Can anyone please write code for me that shows a GRAPH of the above function, without the separate pieces being joined into one continuous line?
Thank you very much in advance!
Using if statements would be a more involved way. You can directly make use of NumPy indexing and masking to get the task done. Below is how I would do it.
Explanation: First you create a mesh of x-data points in the interval [-3, 5]. Then you initialize a zero-filled y-array of the same length. Next, you use the conditions on x to build boolean masks. For example, mask1 = ((x>=0) & (x<1)) defines a condition, and y[mask1] = -1 assigns -1 at exactly the positions where that condition holds. You do this for all 4 conditions. I only named masks for the middle two conditions; you can also use 4 variables (masks) to do the same thing. It's a matter of personal taste.
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(-3, 5, 100)
y = np.zeros(len(x))
mask1 = ((x>=0) & (x<1))
mask2 = ((x>=1) & (x<2))
y[x<0] = np.abs(x[x<0])
y[mask1] = -1
y[mask2] = 1
y[x>=2] = np.log(x[x>=2])
plt.plot(x, y)
plt.xlabel('$x$')
plt.ylabel(r'$f(x)$')
plt.show()
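The same masking idea can also be written with `np.piecewise`, which pairs each condition with a function or constant. This is just a sketch of an equivalent formulation; the function name `f` is my own choice.

```python
import numpy as np

def f(x):
    """Piecewise function from the question, evaluated with np.piecewise."""
    x = np.asarray(x, dtype=float)
    conditions = [x < 0, (x >= 0) & (x < 1), (x >= 1) & (x < 2), x >= 2]
    # Callables are applied only to the elements where their condition holds,
    # so np.log never sees values below 2.
    functions = [np.abs, -1.0, 1.0, np.log]
    return np.piecewise(x, conditions, functions)

x = np.linspace(-3, 5, 100)
y = f(x)
```

Plotting `x` against `y` with `plt.plot` gives the same curve as the masked version.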
Usually, simple composite functions can easily be written like any other function by multiplying by the respective condition(s). The only place one needs to be careful is with the logarithm, which is not defined over the complete interval. This problem is circumvented by taking the absolute value here, because it's only relevant in the range >= 2 anyway. One caveat: at exactly x = 0, np.log(0) is -inf and -inf * 0 is nan, so that single point evaluates to nan.
import numpy as np
import matplotlib.pyplot as plt
f = lambda x: np.abs(x)*(x<0) - ((0<=x) & (x < 1)) + ((1<=x) & (x < 2)) + np.log(np.abs(x))*(2<=x)
x = np.linspace(-3,5,200)
plt.plot(x,f(x))
plt.show()
According to a comment below the answer, one can also evaluate the function in each of the intervals separately,
intervals = [(-3, -1e-6), (0,1-1e-6), (1, 2-1e-6), (2,5)]
for (s,e) in intervals:
    x = np.linspace(s,e,100)
    plt.plot(x,f(x), color="C0")
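A variation on the same idea, as a sketch: keep a single array but insert NaN at the known jump positions, since matplotlib breaks a line wherever it meets NaN. The helper name `break_at` is made up for this example, and `f` is rewritten here with `np.select` so the block stands alone.

```python
import numpy as np

def f(x):
    # First matching condition wins; np.maximum keeps np.log away from x < 2.
    return np.select(
        [x < 0, x < 1, x < 2],
        [np.abs(x), -1.0, 1.0],
        default=np.log(np.maximum(x, 2.0)),
    )

def break_at(x, y, breakpoints):
    """Return copies of x and y with NaN inserted before the first sample at or past each breakpoint."""
    idx = np.searchsorted(x, breakpoints)
    return np.insert(x, idx, np.nan), np.insert(y, idx, np.nan)

x = np.linspace(-3, 5, 801)
xg, yg = break_at(x, f(x), [0, 1, 2])
```

Passing `xg` and `yg` to a single `plt.plot` call should then draw four disconnected pieces in one color.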
Thank you very much for your help, it is really useful :)
In addition, I would like to know how I can eliminate the lines that connect each step of the interval to the next one.
I need to show only 4 separate pieces on the graph, one per step, without the "continuity" of the lines that connect them.