I'm a newbie to Altair, and I would like to change the number of bars being plotted in a bar plot. I have done some research online, but couldn't find anything helpful. Here is my code:
import altair as alt
import pandas as pd
import numpy as np
# Generate a random np array of size 1000, and our goal is to plot its distribution.
my_numbers = np.random.normal(size = 1000)
my_numbers_df = pd.DataFrame.from_dict({'Integers': my_numbers})
alt.Chart(my_numbers_df).mark_bar(size = 10).encode(
alt.X("Integers",
bin = True,
scale = alt.Scale(domain=(-5, 5))
),
y = 'count()',
)
The plot right now looks something like this
You can increase the number of bins by passing an alt.Bin() object and specifying the maxbins
import altair as alt
import pandas as pd
import numpy as np
# Generate a random np array of size 1000, and our goal is to plot its distribution.
my_numbers = np.random.normal(size = 1000)
my_numbers_df = pd.DataFrame.from_dict({'Integers': my_numbers})
alt.Chart(my_numbers_df).mark_bar(size = 10).encode(
alt.X("Integers",
bin = alt.Bin(maxbins=25),
scale = alt.Scale(domain=(-5, 5))
),
y = 'count()',
)
Related
I have 2 tables a 10 by 110 and a 35 by 110 and both contain random numbers from a exponential distribution function provided by my professor. The assignment is to prove the central limit theorem in statistics.
What I thought to try is:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
"importing data"
df1 = pd.read_excel(r'C:\Users\Henry\Desktop\n10.xlsx')
df2 = pd.read_excel(r'C:\Users\Henry\Desktop\n30.xlsx')
df1avg = pd.read_excel(r'C:\Users\Henry\Desktop\n10avg.xlsx')
df2avg = pd.read_excel(r'C:\Users\Henry\Desktop\n30avg.xlsx')
"plotting n10 histogram"
plt.hist(df1, bins=34)
plt.hist(df1avg, bins=11)
"plotting n30 histogram"
plt.hist(df2, bins=63)
plt.hist(df2avg, bins=11)
Is that ok or do I need to format the tables into a singular column, and if so what is the most efficient way to do that?
I suspect that you will want to flatten your dataframe first, as illustrated below.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
N = np.random.exponential(1, [40, 5])
df = pd.DataFrame(N) # convert to dataframe
bin_edges = np.linspace(0,6,30)
plt.figure()
plt.hist(df, bins = bin_edges, density = True)
plt.xlabel('Value')
plt.ylabel('Probability density')
The multiple (5) colours of lines per bin shows the histograms for each column of the data frame.
Fortunately, this is not hard to adjust. You can convert the data frame to a numpy array and flatten it:
plt.hist(df.to_numpy().flatten(), bins = bin_edges, density = True)
plt.ylabel('Probability density')
plt.xlabel('Value')
I have several histograms that I succeded to plot using plotly like this:
fig.add_trace(go.Histogram(x=np.array(data[key]), name=self.labels[i]))
I would like to create something like this 3D stacked histogram but with the difference that each 2D histogram inside is a true histogram and not just a hardcoded line (my data is of the form [0.5 0.4 0.5 0.7 0.4] so using Histogram directly is very convenient)
Note that what I am asking is not similar to this and therefore also not the same as this. In the matplotlib example, the data is presented directly in a 2D array so the histogram is the 3rd dimension. In my case, I wanted to feed a function with many already computed histograms.
The snippet below takes care of both binning and formatting of the figure so that it appears as a stacked 3D chart using multiple traces of go.Scatter3D and np.Histogram.
The input is a dataframe with random numbers using np.random.normal(50, 5, size=(300, 4))
We can talk more about the other details if this is something you can use:
Plot 1: Angle 1
Plot 2: Angle 2
Complete code:
# imports
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
pio.renderers.default = 'browser'
# data
np.random.seed(123)
df = pd.DataFrame(np.random.normal(50, 5, size=(300, 4)), columns=list('ABCD'))
# plotly setup
fig=go.Figure()
# data binning and traces
for i, col in enumerate(df.columns):
a0=np.histogram(df[col], bins=10, density=False)[0].tolist()
a0=np.repeat(a0,2).tolist()
a0.insert(0,0)
a0.pop()
a1=np.histogram(df[col], bins=10, density=False)[1].tolist()
a1=np.repeat(a1,2)
fig.add_traces(go.Scatter3d(x=[i]*len(a0), y=a1, z=a0,
mode='lines',
name=col
)
)
fig.show()
Unfortunately you can't use go.Histogram in a 3D space so you should use an alternative way. I used go.Scatter3d and I wanted to use the option to fill line doc but there is an evident bug see
import numpy as np
import plotly.graph_objs as go
# random mat
m = 6
n = 5
mat = np.random.uniform(size=(m,n)).round(1)
# we want to have the number repeated
mat = mat.repeat(2).reshape(m, n*2)
# and finally plot
x = np.arange(2*n)
y = np.ones(2*n)
fig = go.Figure()
for i in range(m):
fig.add_trace(go.Scatter3d(x=x,
y=y*i,
z=mat[i,:],
mode="lines",
# surfaceaxis=1 # bug
)
)
fig.show()
I am producing the probability distribution function of my variable, which is temperature:
and I am going to produce several plots with temperature PDF evolution.
For this reason, I would like to link the color of the plot (rainbow-style) with the value of the peak of the temperature distribution.
In this way, it is easy to associate the average value of the temperature just by looking at the color.
Here's the code I have written for producing plots of the PDF evolution:
from netCDF4 import Dataset
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
import seaborn as sns
from scipy.stats import gaussian_kde
my_file = 'tas/tas.nc'
fh = Dataset(my_file, mode='r')
lons = (fh.variables['rlon'][:])
lats = (fh.variables['rlat'][:])
t = (fh.variables['tas'][:])-273
step = len(t[:,0,0])
t_units = fh.variables['tas'].units
fh.close()
len_lon = len(t[0,0,:])
len_lat = len(t[0,:,0])
len_tot = len_lat*len_lon
temperature = np.zeros(len_tot)
for i in range(step):
temperature=t[i,:,:]
temperature_array = temperature.ravel()
density = gaussian_kde(temperature_array)
xs = np.linspace(-80,50,200)
density.covariance_factor = lambda : .25
density._compute_covariance()
plt.title(str(1999+i))
plt.xlabel("Temperature (C)")
plt.ylabel("Frequency")
plt.plot(xs,density(xs))
plt.savefig('temp_'+str(i))
Because the question is lacking a working snippet, I had to come up with some sample data. This creates three datasets, where each one is colored with a specific color between blue (cold) and red (hot) according to their maximum value.
import matplotlib.pyplot as plt
import random
from colour import Color
nrange = 20
mydata1 = random.sample(range(nrange), 3)
mydata2 = random.sample(range(nrange), 3)
mydata3 = random.sample(range(nrange), 3)
colorlist = list(Color('blue').range_to(Color('red'), nrange))
# print(mydata1) print(mydata2) print(mydata3)
plt.plot(mydata1, color='{}'.format(colorlist[max(mydata1)]))
plt.plot(mydata2, color='{}'.format(colorlist[max(mydata2)]))
plt.plot(mydata3, color='{}'.format(colorlist[max(mydata3)]))
plt.show()
I have a series of data that I'm reading in from a tutorial site.
I've managed to plot the distribution of the TV column in that data, however I also want to overlay a normal distribution curve with StdDev ticks on a second x-axis (so I can compare the two curves). I'm struggling to work out how to do it..
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats
import matplotlib.mlab as mlab
import math
# read data into a DataFrame
data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv', index_col=0)
# draw distribution curve
h = sorted(data.TV)
hmean = np.mean(h)
hstd = np.std(h)
pdf = stats.norm.pdf(h, hmean, hstd)
plt.plot(h, pdf)
Here is a diagram close to what I'm after, where x is the StdDeviations. All this example needs is a second x axis to show the values of data.TV
Not sure what you really want, but you could probably use second axis like this
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats
import matplotlib.mlab as mlab
import math
# read data into a DataFrame
data = pd.read_csv('Advertising.csv', index_col=0)
fig, ax1 = plt.subplots()
# draw distribution curve
h = sorted(data.TV)
ax1.plot(h,'b-')
ax1.set_xlabel('TV')
ax1.set_ylabel('Count', color='b')
for tl in ax1.get_yticklabels():
tl.set_color('b')
hmean = np.mean(h)
hstd = np.std(h)
pdf = stats.norm.pdf(h, hmean, hstd)
ax2 = ax1.twinx()
ax2.plot(h, pdf, 'r.')
ax2.set_ylabel('pdf', color='r')
for tl in ax2.get_yticklabels():
tl.set_color('r')
plt.show()
Ok, assuming that you want to plot the distribution of your data, the fitted normal distribution with two x-axes, one way to achieve this is as follows.
Plot the normalized data together with the standard normal distribution. Then use matplotlib's twiny() to add a second x-axis to the plot. Use the same tick positions as the original x-axis on the second axis, but scale the labels so that you get the corresponding original TV values. The result looks like this:
Code
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats
import matplotlib.mlab as mlab
import math
# read data into a DataFrame
data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv', index_col=0)
h = sorted(data.TV)
hmean = np.mean(h)
hstd = np.std(h)
h_n = (h - hmean) / hstd
pdf = stats.norm.pdf( h_n )
# plot data
f,ax1 = plt.subplots()
ax1.hist( h_n, 20, normed=1 )
ax1.plot( h_n , pdf, lw=3, c='r')
ax1.set_xlim( [h_n.min(), h_n.max()] )
ax1.set_xlabel( r'TV $[\sigma]$' )
ax1.set_ylabel( r'Relative Frequency')
ax2 = ax1.twiny()
ax2.grid( False )
ax2.set_xlim( ax1.get_xlim() )
ax2.set_ylim( ax1.get_ylim() )
ax2.set_xlabel( r'TV' )
ticklocs = ax2.xaxis.get_ticklocs()
ticklocs = [ round( t*hstd + hmean, 2) for t in ticklocs ]
ax2.xaxis.set_ticklabels( map( str, ticklocs ) )
I have a data set I wish to plot as scatter plot with matplotlib, and a vector the same size that categorizes and labels the data points (discretely, e.g. from 0 to 3). I want to use different markers for different labels (e.g. 'x' for 0, 'o' for 1 and so on). How can I solve this elegantly? I am quite sure I am just missing out on something, but didn't really find it, and my naive approaches failed so far...
What about iterating over all markers like this:
import numpy as np
import matplotlib.pyplot as plt
x = np.random.rand(100)
y = np.random.rand(100)
category = np.random.random_integers(0, 3, 100)
markers = ['s', 'o', 'h', '+']
for k, m in enumerate(markers):
i = (category == k)
plt.scatter(x[i], y[i], marker=m)
plt.show()
Matplotlib does not accepts different markers per plot.
However, a less verbose and more robust solution for large dataset is using the pandas and seaborn library:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
x = [48.959, 49.758, 49.887, 50.593, 50.683 ]
y = [122.310, 121.29, 120.525, 120.252, 119.509]
z = [136.993, 133.128, 143.710, 129.088, 139.860]
kmean = np.array([0, 1, 0, 2, 2])
df = pd.DataFrame({'x':x,'y':y,'z':z, 'km_z':kmean})
sns.scatterplot(data = df, x='x', y='y', hue='km_z', style='km_z')
which produces the following output
Additionally you can use the pandas.cut function to plot bins (Its something I regularly need to produce graphs where I can use a third continuous value as a parameter). The way to use it is :
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
x = [48.959, 49.758, 49.887, 50.593, 50.683 ]
y = [122.310, 121.29, 120.525, 120.252, 119.509]
z = [136.993, 133.128, 143.710, 129.088, 139.860]
df = pd.DataFrame({'x':x,'y':y,'z':z})
df['bins'] = pd.cut(df.z, bins=3)
sns.scatterplot(data = df, x='x', y='y', hue='bins', style='bins')
and it produces the following example:
I've used the latter method to produce graphs like the following: