Ridgeline/Joyplot across a moving range - python

(Using Python 3.0) In increments of 0.25, I want to calculate and plot PDFs for the given data across specified ranges for easy visualization.
Calculating the individual plot has been done thanks to the SO community, but I cannot quite get the algorithm right to iterate properly across the range of values.
Data: https://www.dropbox.com/s/y78pynq9onyw9iu/Data.csv?dl=0
What I have so far is normalized toy data that looks like a shotgun blast with one of the target areas isolated between the black lines with an increment of 0.25:
import csv
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Data=pd.read_csv("Data.csv")
g = sns.jointplot(x="x", y="y", data=Data)
bottom_lim = 0
top_lim = 0.25
temp = Data.loc[(Data.y>=bottom_lim)&(Data.y<top_lim)]
g.ax_joint.axhline(top_lim, c='k', lw=2)
g.ax_joint.axhline(bottom_lim, c='k', lw=2)
# we have to create a secondary y-axis to the joint-plot, otherwise the kde
# might be very small compared to the scale of the original y-axis
ax_joint_2 = g.ax_joint.twinx()
sns.kdeplot(temp.x, shade=True, color='red', ax=ax_joint_2, legend=False)
ax_joint_2.spines['right'].set_visible(False)
ax_joint_2.spines['top'].set_visible(False)
ax_joint_2.yaxis.set_visible(False)
And now what I want to do is make a ridgeline/joyplot of this data across each 0.25 band of data.
I tried a few techniques from the various Seaborn examples out there, but nothing really accounts for the band or range of values as the y-axis. I'm struggling to translate my written algorithm into working code as a result.

I don't know if this is exactly what you are looking for, but hopefully this gets you in the ballpark. I also know very little about python, so here is some R:
library(tidyverse)
library(ggridges)
data = read_csv("https://www.dropbox.com/s/y78pynq9onyw9iu/Data.csv?dl=1")
data2 = data %>%
  mutate(breaks = cut(x, breaks = seq(-1,7,.5), labels = FALSE))
data2 %>%
  ggplot(aes(x=x,y=breaks)) +
  geom_density_ridges() +
  facet_grid(~breaks, scales = "free")
data2 %>%
  ggplot(aes(x=x,y=y)) +
  geom_point() +
  geom_density() +
  facet_grid(~breaks, scales = "free")
And please forgive the poorly formatted axis.
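For anyone who wants to stay in Python, here is a rough sketch of the same idea applied to the y bands from the question. It is not from the answer above: the function names band_densities and plot_ridgeline are made up, the bandwidth is a simple normal-reference rule of thumb (an assumption, not necessarily what seaborn uses), and the columns are assumed to be called x and y as in the question.

```python
import numpy as np


def band_densities(x, y, width=0.25, grid_size=200):
    """Estimate a Gaussian KDE of x inside each y band [k*width, (k+1)*width)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    grid = np.linspace(x.min(), x.max(), grid_size)
    # Integer band number for every point, e.g. y=0.3 falls in band 1 for width 0.25
    band_index = np.floor(y / width).astype(int)
    curves = {}
    for k in np.unique(band_index):
        xs = x[band_index == k]
        if xs.size < 2 or xs.std() == 0:
            continue  # too few (or identical) points for a density estimate
        # Normal-reference rule-of-thumb bandwidth (assumption of this sketch)
        h = 1.06 * xs.std(ddof=1) * xs.size ** (-1 / 5)
        z = (grid[:, None] - xs[None, :]) / h
        density = np.exp(-0.5 * z ** 2).sum(axis=1) / (xs.size * h * np.sqrt(2 * np.pi))
        curves[round(float(k * width), 10)] = density  # key = lower edge of the band
    return grid, curves


def plot_ridgeline(grid, curves, width=0.25, overlap=1.5):
    """Draw every band's density with its baseline at the band's lower edge."""
    import matplotlib.pyplot as plt  # imported here so the maths above runs without it

    fig, ax = plt.subplots()
    peak = max(d.max() for d in curves.values())
    # Upper bands first, so that lower bands are drawn on top of them
    for start in sorted(curves, reverse=True):
        scaled = curves[start] / peak * width * overlap
        ax.fill_between(grid, start, start + scaled,
                        facecolor="red", edgecolor="k", alpha=0.6)
    ax.set_xlabel("x")
    ax.set_ylabel("y band (lower edge)")
    return fig, ax


# Possible usage with the data from the question:
# Data = pd.read_csv("Data.csv")
# grid, curves = band_densities(Data.x, Data.y, width=0.25)
# plot_ridgeline(grid, curves, width=0.25)
```

All curves share one scale factor here, so sparsely populated bands stay visibly smaller; scaling each band by its own peak would be the alternative if only the shapes matter.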

Why are the quartiles in seaborn boxplot different from plotly? How can I put them to show me the same result?

Seaborn
Import libraries and load data
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
sns.set_theme(style="whitegrid", palette="muted") # Set2, muted, pastel, colorblind
# Load the data
import plotly.express as px
df = px.data.gapminder()
df.head()
Show the boxplot and the quartiles
sns.boxplot(
data=df[df.year==2007],
x='lifeExp',
orient="h",
);
print('q1', df[df.year==2007]['lifeExp'].quantile(.25))
print('median', df[df.year==2007]['lifeExp'].median())
print('q3', df[df.year==2007]['lifeExp'].quantile(.75))
plt.show()
Plotly
Show the boxplot and the quartiles
fig_box = px.box(df[df.year==2007], x='lifeExp', orientation='h',
width=500, height=300)
fig_box.show()
Why are the quartiles different?
I can't explain the statistics in depth, but the difference seems to come from the interpolation method used to compute the 25% and 75% quantiles. Simply put, pandas (and with it seaborn and numpy) and plotly use different calculation methods by default.
import pandas as pd
x = df[df.year==2007]['lifeExp'].values
pd.DataFrame(pd.Series(x.ravel()).describe()).transpose()
   count     mean     std     min      25%      50%      75%     max
0    142  67.0074  12.073  39.613  57.1602  71.9355  76.4133  82.603
From the pd.Series.quantile documentation:
interpolation: {‘linear’, ‘lower’, ‘higher’, ‘midpoint’, ‘nearest’}
This optional parameter specifies the interpolation method to use, when the desired quantile lies between two data points i and j:
linear: i + (j - i) * fraction, where fraction is the fractional part of the index surrounded by i and j.
lower: i.
higher: j.
nearest: i or j whichever is nearest.
midpoint: (i + j) / 2.
pd.Series(x.ravel()).quantile(q=0.75, interpolation='higher')
76.423 <- plotly.box.Q3
pd.Series(x.ravel()).quantile(q=0.25, interpolation='lower')
56.867 <- plotly.box.Q1
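To make the interpolation options concrete, here is a small pure-Python sketch of the rule quoted above. It only illustrates the idea and is not pandas' actual implementation; the name quantile_sketch and the sample list are made up, and the tie-breaking for 'nearest' is a guess that may differ from pandas.

```python
import math


def quantile_sketch(values, q, interpolation="linear"):
    """Quantile of `values` at fraction q (0..1) using the quoted interpolation rules."""
    data = sorted(values)
    pos = q * (len(data) - 1)          # fractional index into the sorted data
    lo, hi = math.floor(pos), math.ceil(pos)
    i, j = data[lo], data[hi]          # the two neighbouring data points
    fraction = pos - lo
    if interpolation == "linear":
        return i + (j - i) * fraction
    if interpolation == "lower":
        return i
    if interpolation == "higher":
        return j
    if interpolation == "midpoint":
        return (i + j) / 2
    if interpolation == "nearest":
        # Ties go to the lower point in this sketch (assumption)
        return i if fraction <= 0.5 else j
    raise ValueError(f"unknown interpolation: {interpolation!r}")


sample = [1, 2, 3, 4, 5, 6]
print(quantile_sketch(sample, 0.25, "linear"))   # 2.25
print(quantile_sketch(sample, 0.25, "lower"))    # 2
print(quantile_sketch(sample, 0.25, "higher"))   # 3
```

Even on six points the first quartile moves between 2 and 3 depending on the method, which is the same kind of gap seen between the seaborn and plotly boxes.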
r-beginners has already answered your primary question, but the secondary question seems to remain unanswered:
How can I put them to show me the same result?
px.box has three built-in options for calculating quartiles (the quartilemethod attribute of the box trace):
['linear', 'exclusive', 'inclusive']
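As far as I understand the Plotly docs, 'exclusive' and 'inclusive' both split the sorted data at the median and take the median of each half; they differ only in whether the median itself is kept in both halves when the sample size is odd. A pure-Python sketch of that reading follows (the function name is hypothetical, and 'linear' is not reproduced here):

```python
from statistics import median


def quartiles_by_halves(values, include_median=False):
    """Return (q1, median, q3) as medians of the lower and upper halves.

    include_median=False mimics the 'exclusive' idea, True the 'inclusive' one.
    For an even number of points both variants give the same result.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    if n % 2 == 0:
        lower, upper = data[:half], data[half:]
    elif include_median:
        lower, upper = data[:half + 1], data[half:]
    else:
        lower, upper = data[:half], data[half + 1:]
    return median(lower), median(data), median(upper)


print(quartiles_by_halves([1, 2, 3, 4, 5, 6, 7]))                       # (2, 4, 6)
print(quartiles_by_halves([1, 2, 3, 4, 5, 6, 7], include_median=True))  # (2.5, 4, 5.5)
```

On a figure, the method can be switched with fig.update_traces(quartilemethod="inclusive"). Note that the 2007 gapminder subset has 142 rows, an even count, so 'exclusive' and 'inclusive' should coincide there.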
If you have pre-computed values or if you need to use a different algorithm than the ones provided, you can specify them for your px.box figure like so:
fig.update_traces(q1=[df['lifeExp'].quantile(.25)],
median=[df['lifeExp'].median()],
q3=[df['lifeExp'].quantile(.75)],
lowerfence=[df['lifeExp'].min()],
upperfence=[df['lifeExp'].max()],
)
But beware that you may experience some irregular behavior if you try to manually set only one of the above. In this case it seems that the underlying calculations for the plot may revert to the default. I'll get back to you if I find out more.
Complete code:
import plotly.graph_objects as go
import plotly.express as px
df = px.data.gapminder()
df = df[df.year==2007]#.tail(8)
fig = px.box(df, x = 'lifeExp', orientation = 'h')
fig.update_traces(q1=[df['lifeExp'].quantile(.25)],
median=[df['lifeExp'].median()],
q3=[df['lifeExp'].quantile(.75)],
lowerfence=[df['lifeExp'].min()],
upperfence=[df['lifeExp'].max()],
)
fig.show()

Generating a smooth line with Pandas dataframe and Matplotlib

I am trying to generate a smooth line using a dataset that contains time (measured as number of days) and a set of numbers that represent a socioeconomic variable.
Here is a sample of my data:
date, data
726,1.2414
727,1.2414
728,1.2414
729,1.2414
730,1.2414
731,1.2414
732,1.2414
733,1.2414
734,1.2414
735,1.2414
736,1.2414
737,1.804597701
738,1.804597701
739,1.804597701
740,1.804597701
741,1.804597701
742,1.804597701
743,1.804597701
744,1.804597701
745,1.804597701
746,1.804597701
747,1.804597701
748,1.804597701
749,1.804597701
750,1.804597701
751,1.804597701
752,1.793103448
753,1.793103448
754,1.793103448
755,1.793103448
756,1.793103448
757,1.793103448
758,1.793103448
759,1.793103448
760,1.793103448
761,1.793103448
762,1.793103448
763,1.793103448
764,1
765,1
This is my code so far:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline
out_file = "path_to_file/file.csv"
df = pd.read_csv(out_file)
time = df['date']
data = df['data']
ax1 = plt.subplot2grid((4,3),(0,0), colspan = 2, rowspan = 2) # Will be adding other plots
plt.plot(time, data)
plt.yticks(np.arange(1,5,1)) # Include classes 1-4 showing only 1 step changes
plt.gca().invert_yaxis() # Reverse y axis
plt.ylabel('Trend', fontsize = 8, labelpad = 10)
This generates the following plot:
Test plot
I have seen posts that answer similar questions (like the ones below), but can't seem to get my code to work. Can anyone suggest an elegant solution?
Generating smooth line graph using matplotlib
Python Matplotlib - Smooth plot line for x-axis with date values
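No answer is recorded here, so the following is only one possible approach, assuming a centered moving average is an acceptable notion of "smooth" for this step-like series. The function name and the window of 7 days are arbitrary choices for illustration.

```python
import numpy as np


def centered_moving_average(values, window=7):
    """Smooth a 1-D series with a centered moving average.

    Assumes `window` is odd and not longer than the series. Near the edges
    the average is taken over the samples that actually exist, so the line
    does not dip towards zero there.
    """
    values = np.asarray(values, dtype=float)
    kernel = np.ones(window)
    sums = np.convolve(values, kernel, mode="same")                   # windowed sums
    counts = np.convolve(np.ones_like(values), kernel, mode="same")   # samples per window
    return sums / counts


# Possible usage with the columns from the question:
# plt.plot(time, data, alpha=0.3)                              # raw steps
# plt.plot(time, centered_moving_average(data, window=7))      # smoothed line
```

A larger window gives a smoother line but also smears the jumps between levels over more days, so it is worth trying a few values.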

Matplotlib plot already binned data

I want to plot the mean local binary patterns histograms of a set of images. Here is what I did:
#calculates the lbp
lbp = feature.local_binary_pattern(image, 24, 8, method="uniform")
#Now I calculate the histogram of LBP Patterns
(hist, _) = np.histogram(lbp.ravel(), bins=np.arange(0, 27))
After that I simply sum up all the LBP histograms and take the mean of them. These are the values found, which are saved in a txt file:
2.962000000000000000e+03
1.476000000000000000e+03
1.128000000000000000e+03
1.164000000000000000e+03
1.282000000000000000e+03
1.661000000000000000e+03
2.253000000000000000e+03
3.378000000000000000e+03
4.490000000000000000e+03
5.010000000000000000e+03
4.337000000000000000e+03
3.222000000000000000e+03
2.460000000000000000e+03
2.495000000000000000e+03
2.599000000000000000e+03
2.934000000000000000e+03
2.526000000000000000e+03
1.971000000000000000e+03
1.303000000000000000e+03
9.900000000000000000e+02
7.980000000000000000e+02
8.680000000000000000e+02
1.119000000000000000e+03
1.479000000000000000e+03
4.355000000000000000e+03
3.112600000000000000e+04
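For reference, the averaging step described above could be sketched as follows. This is an assumption about how it was done, not the asker's actual code: lbp_images stands for a hypothetical list of LBP arrays as returned by feature.local_binary_pattern, and 26 bins matches bins=np.arange(0, 27).

```python
import numpy as np


def mean_histogram(lbp_images, n_bins=26):
    """Average the per-image LBP histograms; every image gets the same bin edges."""
    edges = np.arange(0, n_bins + 1)
    hists = [np.histogram(np.ravel(lbp), bins=edges)[0] for lbp in lbp_images]
    return np.mean(np.stack(hists), axis=0)


# The result has one value per bin, so it can be drawn with one x position per bin:
# mean_hist = mean_histogram(lbp_images)
# plt.bar(np.arange(len(mean_hist)), mean_hist)
```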
I am trying to simply plot these values (don't need to calculate the histogram, because the values are already from a histogram). Here is what I've tried:
import matplotlib
matplotlib.use('Agg')
import numpy as np
import matplotlib.pyplot as plt
import plotly.plotly as py
#load data
data=np.loadtxt('original_dataset1.txt')
#convert to float
data=data.astype('float32')
#define number of Bins
n_bins = data.max() + 1
plt.style.use("ggplot")
(fig, ax) = plt.subplots()
fig.suptitle("Local Binary Patterns")
plt.ylabel("Frequency")
plt.xlabel("LBP value")
plt.bar(n_bins, data)
fig.savefig('lbp_histogram.png')
However, look at the figure these commands produce:
I still don't understand what is happening. I would like to make a figure like the one I produced in Excel using the same data, as follows:
I must confess that I am quite a rookie with matplotlib. So, what was my mistake?
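Try this. Your mistake is that plt.bar(n_bins, data) passes a single number (data.max() + 1) as the x positions, so every bar is drawn at the same x. Pass one x position per bin instead. Here, array holds your mean values from the bins (only the first seven are shown).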
array = [2962,1476,1128,1164,1282,1661,2253]
fig,ax = plt.subplots(nrows=1, ncols=1,)
ax.bar(np.array(range(len(array)))+1,array,color='orangered')
ax.grid(axis='y')
for i, v in enumerate(array):
    ax.text(i+1, v, str(v),color='black',fontweight='bold',
            verticalalignment='bottom',horizontalalignment='center')
plt.savefig('savefig.png',dpi=150)
The plot looks like this.

How to format x-axis time-series tick marks with missing dates

How can I format the x-axis so that the spacing between periods is "to scale"? That is, the distance between 10yr and 30yr should be much larger than the distance between 1yr and 2yr.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import Quandl as ql
yield_ = ql.get("USTREASURY/YIELD")
today = yield_.iloc[-1,:]
month_ago = yield_.iloc[-1000,:]
df = pd.concat([today, month_ago], axis=1)
df.columns = ['today', 'month_ago']
df.plot(style={'today': 'ro-', 'month_ago': 'bx--'},title='Treasury Yield Curve, %');
plt.show()
I want my chart to look like this...
I think doing this while staying purely within pandas might be tricky. You first need to create a new matplotlib figure and axes. The following might not work exactly, but it should give you a good idea.
df['years']=[1/12.,0.25,0.5,1,2,3,5,7,10,20,30]
fig = plt.figure()
ax = fig.add_subplot(1,1,1)
df.plot(x='years',y='today',ax=ax,kind='scatter')
df.plot(x='years',y='month_ago',ax=ax,kind='scatter')
plt.show()
If you want your axis labels to look like the ones in your chart, you'll also need to set the lower and upper limits of your axis so they look good, and then do something like:
ax.set_xticks(df['years'])
ax.set_xticklabels(list(df.index))
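Rather than typing the maturities by hand, they could be derived from the row labels. This sketch assumes the labels look like '1 MO', '3 MO', '1 YR', '10 YR' (which is how I remember the USTREASURY/YIELD columns, so please check against your data); the helper name is made up.

```python
def maturity_in_years(label):
    """Convert a label such as '3 MO' or '10 YR' into a maturity in years."""
    amount, unit = label.split()
    unit = unit.upper()
    if unit.startswith("MO"):
        return float(amount) / 12
    if unit.startswith("YR"):
        return float(amount)
    raise ValueError(f"unrecognised maturity label: {label!r}")


# Possible usage with the DataFrame from the question (index = maturity labels):
# years = [maturity_in_years(label) for label in df.index]
# fig, ax = plt.subplots()
# ax.plot(years, df['today'], 'ro-', label='today')
# ax.plot(years, df['month_ago'], 'bx--', label='month_ago')
# ax.set_xticks(years)            # tick positions are "to scale"
# ax.set_xticklabels(df.index)    # but still show the original labels
# ax.legend()
```

With a linear axis the short maturities will crowd together near zero, so dropping some of their tick labels may be needed for readability.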

Multiple legends and multiple colors/shapes matplotlib

I want to plot data from 20+ files at the same time. I am trying to plot the data from each file in a different color, each with its own legend entry. I have looked at some examples and the matplotlib tutorial, but I am a little lost. How do I add legends and give a different shape to every set?
E.g., the inputs are sets of data from several files with separate thresholds.
Filenames: file1_th0, file1_th0.1 and so on. I want all data with the same threshold, across different files, to share the same shape/color, with proper legends. I can plot whichever data set I need, but I cannot assign separate shapes to the different threshold values. Any suggestions would be great.
Code:
import numpy as np
import matplotlib.pyplot as plt
import numpy as np
import matplotlib.pyplot as plt
from pylab import*
import math
from matplotlib.ticker import LogLocator
for fname in ('file1_th0', 'file1_th0.1','file1_th0.01', 'file1_th0.001', 'file1_th0.001'):
    data=np.loadtxt(fname)
    X=data[:,2]
    sorted_data = np.sort(X)
    cdf=np.arange(len(sorted_data))/float(len(sorted_data))
    ccdf = 1 - cdf
    plt.plot(sorted_data,ccdf,'r-', label = 'label1')
for fname in ('file2_th0', 'file2_th0.1', 'file2_th0.01', 'file2_th0.001','file2_th0.0001'):
    data=np.loadtxt(fname)
    X=data[:,2]
    sorted_data = np.sort(X)
    cdf=np.arange(len(sorted_data))/float(len(sorted_data))
    ccdf = 1 - cdf
    plt.plot(sorted_data,cdf,'b-')
for fname in ('file3_th0','file3_th0.1','file3_th0.01','file3_th0.001', 'file3_th0.0001'):
    data=np.loadtxt(fname)
    X=data[:,4]
    sorted_data = np.sort(X)
    cdf=np.arange(len(sorted_data))/float(len(sorted_data))
    ccdf = 1 - cdf
    plt.plot(sorted_data,cdf,'m-')
for fname in ('file4_th0', 'file4_th0.1', 'file4_th0.01', 'file4_th0.001','file4_th0.0001'):
    data=np.loadtxt(fname)
    X=data[:,4]
    sorted_data = np.sort(X)
    cdf=np.arange(len(sorted_data))/float(len(sorted_data))
    ccdf = 1 - cdf
    plt.plot(sorted_data,cdf,'c--')
plt.xlabel('this is x!')
plt.ylabel('this is y!')
plt.gca().set_xscale("log")
#plt.gca().set_yscale("log")
plt.show()
First of all, you need to add labels and markers to your plot calls and add a legend call, e.g.:
b=np.arange(0,20,1)
c=b*0.5
d=b*2
plt.plot(b,d,color='r',marker='o',label='set 1')
plt.plot(b,c,color='g',marker='*',label='set 2')
plt.legend(loc='upper left')
However in your looped example you will end up with lots of identical legend entries, which I presume you don't want.
To get round it, you could attach the label only on the first pass through each loop:
n=0
for whatever in whatever: # e.g. your for loops
    # do stuff with whatever
    if n==0:
        plt.plot(sorted_data,cdf,color='r',marker='o',label='set 1')
    else:
        plt.plot(sorted_data,cdf,color='r',marker='o')
    n += 1
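If the goal is one shape per threshold across all files, the style could also be derived from the file name instead of being hard-coded per loop. The sketch below assumes the '<file>_th<threshold>' naming from the question; the helper names and the marker/color lists are made up. It relies on matplotlib leaving labels that start with an underscore out of the automatic legend.

```python
from itertools import cycle


def split_name(fname):
    """'file1_th0.1' -> ('file1', '0.1'); assumes the '<file>_th<threshold>' pattern."""
    base, threshold = fname.rsplit("_th", 1)
    return base, threshold


def styles_for(filenames, markers="o*s^vD", colors="rbmcgk"):
    """Assign one marker per threshold and one color per file, in order of appearance."""
    marker_of, color_of = {}, {}
    next_marker, next_color = cycle(markers), cycle(colors)
    for fname in filenames:
        base, threshold = split_name(fname)
        if threshold not in marker_of:
            marker_of[threshold] = next(next_marker)
        if base not in color_of:
            color_of[base] = next(next_color)
    return marker_of, color_of


def legend_label(text, seen):
    """Return `text` the first time it is requested, '_nolegend_' afterwards."""
    if text in seen:
        return "_nolegend_"
    seen.add(text)
    return text


# Possible usage inside a single loop over all files:
# marker_of, color_of = styles_for(all_filenames)
# seen = set()
# for fname in all_filenames:
#     base, th = split_name(fname)
#     ...compute sorted_data and cdf as in the question...
#     plt.plot(sorted_data, cdf, color=color_of[base], marker=marker_of[th],
#              linestyle='-', label=legend_label('threshold ' + th, seen))
# plt.legend(loc='upper left')
```

This gives one legend entry per threshold; the legend line takes the color of whichever file used that threshold first, so a second legend for the file colors may still be wanted.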
