Altair: adding sorting destroys chart - python

The following code produces a column chart in which the y axis grows in the wrong direction.
alt.Chart(df).mark_line().encode(
x = alt.X('pub_date', timeUnit='month'),
y = alt.Y('sum(has_kw)', ),
)
I wanted to correct it as suggested by https://stackoverflow.com/a/58326269, and changed my code to
alt.Chart(df).mark_line().encode(
x = alt.X('pub_date', timeUnit='month'),
y = alt.Y('sum(has_kw)', sort=alt.EncodingSortField('y', order='descending') ),
)
But now altair produces a strange diagram, see 2.
That is, sum(has_kw) is calculated incorrectly. Why does this happen, and how can I correct it?

It is hard to know exactly without seeing a sample of your data but you could try one of the following (based on the example you linked). This first approach is similar to what you tried already:
import altair as alt
import numpy as np
import pandas as pd
# Compute x^2 + y^2 across a 2D grid
x, y = np.meshgrid(range(0, 3), range(0, 3))
z = x ** 2 + y ** 2
# Convert this grid to columnar data expected by Altair
source = pd.DataFrame({
'x': x.ravel(),
'y': y.ravel(),
'z': z.ravel()
})
alt.Chart(source).mark_rect().encode(
x='x:O',
y=alt.Y('y:O', sort='descending'),
color='z:Q'
)
This second approach simply reverses the axis without sorting it and might be more compatible with your data:
alt.Chart(source).mark_rect().encode(
x='x:O',
y=alt.Y('y:O', scale=alt.Scale(reverse=True)),
color='z:Q'
)

Related

How to add Prefix to rSquared extracted from Altair?

I'm adding the rSquared to a chart using the method outlined in this answer:
r2 = alt.Chart(df).transform_regression('x', 'y', params=True
).mark_text().encode(x=alt.value(20), y=alt.value(20), text=alt.Text('rSquared:N', format='.4f'))
But I want to prepend "rSquared = " to the final text.
I've seen this answer involving an f string and a value calculated outside the chart, but I'm not clever enough to figure out how to apply that solution to this scenario.
I've tried, e.g., format='rSquared = .4f', but adding any additional text breaks the output, which I'm sure is the system working as intended.
One possible solution using the posts you linked to would be to extract the value of the parameter using altair_transform and then add the value to the plot. This is not the most elegant solution but should achieve what you want.
# pip install git+https://github.com/altair-viz/altair-transform.git
import altair as alt
import pandas as pd
import numpy as np
import altair_transform
np.random.seed(42)
x = np.linspace(0, 10)
y = x - 5 + np.random.randn(len(x))
df = pd.DataFrame({'x': x, 'y': y})
chart = alt.Chart(df).mark_point().encode(
x='x',
y='y'
)
line = chart.transform_regression('x', 'y').mark_line()
params = chart.transform_regression('x','y', params=True).mark_line()
R2 = altair_transform.extract_data(params)['rSquared'][0]
text = alt.Chart({'values':[{}]}).mark_text(
align="left", baseline="top"
).encode(
x=alt.value(5), # pixels from left
y=alt.value(5), # pixels from top
text=alt.value(f"rSquared = {R2:.4f}"),
)
chart + line + text

Altair-viz repeat and transform

Is there a way to apply a data transform or filter using the encoding used by a repeated chart?
If I understand correctly the docs, it appears it is not directly possible:
Currently repeat can only be specified for rows and column (not, e.g., for layers) and the target can only be encodings (not, e.g., data transforms) but there is discussion within the Vega-Lite community about making this pattern more general in the future.
What would be good ways to work around this? For instance below, say I want to plot only the points for which y>0 (or could be another transform, I don't want to just zoom on the y axis). Is there a way to apply something like the line #0 using the repeat target (as attempted in #1, which fails with TypeError: '>' not supported between instances of 'RepeatRef' and 'float')?
import altair as alt
import numpy as np
import pandas as pd
x = np.arange(100)
source = pd.DataFrame({
'x': x,
'f': np.sin(x / 5),
'g': np.cos(x / 3),
})
alt.Chart(source).mark_line().encode(
alt.X('x', type='quantitative'),
alt.Y(alt.repeat('column'), type='quantitative'),
).transform_filter(
# alt.datum.f >= 0. #0 Works, but would like to use f or g depending on the plotted variable
alt.repeat('column') > 0. #1 ERROR HERE
).repeat(
column=['f', 'g']
)
There is no way to reference the repeated field within a transform. The best way to approach this would be to build the chart via concatenation; for example:
alt.hconcat(*(
alt.Chart(source).mark_line().encode(
alt.X('x', type='quantitative'),
alt.Y(col, type='quantitative'),
).transform_filter(
alt.datum[col] >= 0
)
for col in ['f', 'g']
))

Python - Plotting T_value above barplot

This is a minimal example of what I am doing:
B = np.array([0.6383447, 0.5271385, 1.7721380, 1.7817880])
b_mean = mean(B)
ori_t = stats.ttest_1samp(B, 0)[0]
r1 = [1]
plt.bar(r1,b_mean,width=barWidth, color="blue")
This code produces a bar plot of the mean of the 'B' array. Now I would like to add the t-value (computed on the third line) and display it above the bar. I tried the following:
plt.text(x=r1, y=b_mean+0.1, s=ori_t, size = 6)
each time it returns
TypeError: float() argument must be a string or a number
which I don't understand. Does anyone know how to achieve this or work around the error?
The problem is that you are passing r1 = [1] as the x-position for your text. r1 is a list which cannot be used for specifying the position of the text. x and y arguments in plt.text should be scalars. So either you write x=1 OR you write x=r1[0] both of which are scalars. I have included the missing imports in my answer to make it complete. I have also adjusted the y-limits accordingly.
From the docs:
x, y : scalars
The position to place the text. By default, this is in data coordinates. The coordinate system can be changed using the transform parameter.
from scipy import stats
import numpy as np
import matplotlib.pyplot as plt
B = np.array([0.6383447, 0.5271385, 1.7721380, 1.7817880])
b_mean = np.mean(B)
ori_t = stats.ttest_1samp(B, 0)[0]
r1 = [1]
plt.bar(r1,b_mean,width=0.02, color="blue")
plt.text(x=r1[0], y=b_mean+0.1, s=ori_t, size = 10)
# plt.text(x=1, y=b_mean+0.1, s=ori_t, size = 10)
plt.ylim(0, b_mean+0.2)
plt.show()
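If the goal is a readable label rather than the raw float, the value can be formatted into a string before it is passed as s. The sketch below recomputes the one-sample t statistic with the standard library only (a substitute for scipy.stats.ttest_1samp, used here so the number is easy to check by hand). The label format and the name x_pos are arbitrary choices for illustration.

```python
import math
import statistics

B = [0.6383447, 0.5271385, 1.7721380, 1.7817880]

b_mean = statistics.mean(B)
# One-sample t statistic against a population mean of 0:
# t = mean / (sample standard deviation / sqrt(n))
t_value = b_mean / (statistics.stdev(B) / math.sqrt(len(B)))

r1 = [1]
x_pos = r1[0]                  # scalar x position, as plt.text expects
label = f"t = {t_value:.2f}"   # string to pass as s=...
print(x_pos, label)
```

With these names, plt.text(x=x_pos, y=b_mean + 0.1, s=label) would place the formatted label above the bar.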

Slicing a graph

I have created a graph in Python, but I now need to take a section of the graph and expand it using a small range of the original data. I don't know how to find the row numbers of the results that form the range, or how to create a graph using just these results from the file. This is the code I have for the graph:
import numpy as np
import matplotlib.pyplot as plt
#variable for data to plot
spec_to_plot = "SN2012fr_20121129.42_wifes_BR.dat"
#tells python where to look for the file
spec_directory = '/home/fh1u16/Documents/spectra/'
data = np.loadtxt(spec_directory + spec_to_plot, dtype=np.float)
x = data[:,0]
y = data[:,1]
plt.plot(x, y)
plt.xlabel("Wavelength")
plt.ylabel("Flux")
plt.title(spec_to_plot)
plt.show()
edit: data is between 3.5e+3 and 9.9e+3 in the first column, I need to use just the data between 5.5e+3 and 6e+3 to plot another graph, but this only applies to the first column. Hope this makes a bit more sense?
Python version 2.7
If I understand you correctly, you could do it this way:
my_slice = slice(np.argwhere(x>5.5e3)[0, 0], np.argwhere(x>6e3)[0, 0])
x = data[my_slice,0]
y = data[my_slice,1]
np.argwhere returns a 2-D array of indices, so np.argwhere(x>5.5e3)[0, 0] is the index of the first occurrence of x>5.5e3 as a scalar, and likewise for the end of the slice (assuming your data is sorted).
A more general way working even if your data is not sorted:
mask = (x>5.5e3) & (x<6e3)
x = data[mask, 0]
y = data[mask, 1]
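To see that the two approaches select the same points on sorted data, here is a small sketch on synthetic values. The arrays are made up to stand in for the two columns of the file, and np.searchsorted is swapped in for np.argwhere as one possible way to find the slice bounds.

```python
import numpy as np

# Synthetic, sorted wavelength column standing in for data[:, 0]
x = np.arange(3500, 9901, 10)
y = np.sin(x / 500.0)          # placeholder flux values

# Boolean mask: works whether or not x is sorted
mask = (x > 5.5e3) & (x < 6e3)
x_mask, y_mask = x[mask], y[mask]

# Sorted data only: binary search for the slice bounds
lo = np.searchsorted(x, 5.5e3, side='right')   # first index with x > 5500
hi = np.searchsorted(x, 6e3, side='left')      # first index with x >= 6000
x_slice, y_slice = x[lo:hi], y[lo:hi]

print(x_mask.min(), x_mask.max(), len(x_mask))
```

Either pair (x_mask, y_mask or x_slice, y_slice) can then be passed to plt.plot to draw only the selected range.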
solved by using
plt.axis([5500, 6000, 0, 8e-15])
thanks for help.

Use of pandas.shift() to align datasets based on scipy.signal.correlate

I have datasets that look like the following: data0, data1, data2 (analogous to time versus voltage data)
If I load and plot the datasets using code like:
import pandas as pd
import numpy as np
from scipy import signal
from matplotlib import pylab as plt
data0 = pd.read_csv('data0.csv')
data1 = pd.read_csv('data1.csv')
data2 = pd.read_csv('data2.csv')
plt.plot(data0.x, data0.y, data1.x, data1.y, data2.x, data2.y)
I get something like:
now I try to correlate data0 with data1:
shft01 = np.argmax(signal.correlate(data0.y, data1.y)) - len(data1.y)
print shft01
plt.figure()
plt.plot(data0.x, data0.y,
data1.x.shift(-shft01), data1.y)
fig = plt.gcf()
with output:
-99
and
which works just as expected! But if I try the same thing with data2, I get a plot that looks like:
with a positive shift of 410. I think I am just not understanding how pd.shift() works, but I was hoping that I could use pd.shift() to align my data sets. As far as I understand, the return from correlate() tells me how far off my data sets are, so I should be able to use shift to overlap them.
pandas' shift() is not the correct method to shift a curve along the x-axis. You should adjust the x values of the points instead:
plt.plot(data0.x, data0.y)
for target in [data1, data2]:
dx = np.mean(np.diff(data0.x.values))
shift = (np.argmax(signal.correlate(data0.y, target.y)) - len(target.y)) * dx
plt.plot(target.x + shift, target.y)
here is the output:
@HYRY one correction to your answer: there is an indexing mismatch between len(), which is one-based, and np.argmax(), which is zero-based. The line should read:
shift = (np.argmax(signal.correlate(data0.y, target.y)) - (len(target.y)-1)) * dx
For example, in the case where your signals are already aligned:
len(target.y) = N (one-based)
The cross-correlation function has length 2N-1, so the center value, for aligned data, is:
np.argmax(signal.correlate(data0.y, target.y)) = N - 1 (zero-based)
shift = ((N-1) - N) * dx = (-1) * dx, when we really want 0 * dx
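The off-by-one can be checked on synthetic signals. This is a minimal sketch with made-up Gaussian pulses; it uses np.correlate in place of scipy.signal.correlate to keep the dependencies small, and both place the zero-lag term at index N - 1 for equal-length inputs in 'full' mode.

```python
import numpy as np

N = 50
n = np.arange(N)
a = np.exp(-0.5 * ((n - 20) / 3.0) ** 2)   # pulse centred at sample 20
b = np.exp(-0.5 * ((n - 25) / 3.0) ** 2)   # same pulse, 5 samples later

# Identical signals: the peak of the full cross-correlation is at index N - 1
auto_peak = np.argmax(np.correlate(a, a, mode='full'))

# Offset signals: subtract N - 1 (not N) to get the lag in samples
lag = np.argmax(np.correlate(a, b, mode='full')) - (N - 1)

print(auto_peak, lag)
```

Adding lag * dx to the x values of b moves its pulse from sample 25 back to sample 20, which is the alignment the plotting loop above is after.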
