I am trying to run a local level model and a TVP-VAR for a panel in statsmodels. Time is the index and all variables must be in columns to match the state space setup (in contrast to the classical panel setup). I have a number of variables that are common across countries and are differentiated in their names only by an ISO country code prefix. For inflation and GDP in Australia, the USA, and Canada, the columns would be AUS_inflation, USA_inflation, CAN_inflation, and AUS_gdp, USA_gdp, CAN_gdp. There are about one hundred countries and fifty such variables.
What I need is to create a list, a dictionary, or some similar structure that treats them as a single variable when iterating, e.g., 'inflation', 'gdp', etc. Ideally there would be a single handle per variable, allowing me to run the local level model in statsmodels and obtain the results as a panel. How is that done?
I thought that creating such a grouping could help me solve the problem.
For the local level model, I run the following code for the GDP growth rate:
#2
# Construct a local level model for Growth rate
mod = sm.tsa.UnobservedComponents(dta.gdp, 'llevel')
# Fit the model's parameters (sigma2_varepsilon and sigma2_eta)
# via maximum likelihood
res = mod.fit()
print(res.params)
# Create simulation smoother objects
sim_kfs = mod.simulation_smoother() # default method is KFS
sim_cfa = mod.simulation_smoother(method='cfa') # can specify CFA method
#3
nsimulations = 20
simulated_state_kfs = pd.DataFrame(
    np.zeros((mod.nobs, nsimulations)), index=dta.index)
simulated_state_cfa = pd.DataFrame(
    np.zeros((mod.nobs, nsimulations)), index=dta.index)
for i in range(nsimulations):
    # Apply KFS simulation smoothing
    sim_kfs.simulate()
    # Save the KFS simulated state
    simulated_state_kfs.iloc[:, i] = sim_kfs.simulated_state[0]

    # Apply CFA simulation smoothing
    sim_cfa.simulate()
    # Save the CFA simulated state
    simulated_state_cfa.iloc[:, i] = sim_cfa.simulated_state[0]
#4
# Plot the Growth rate of GDP data along with simulated trends
fig, axes = plt.subplots(2, figsize=(15, 6))
# Plot data and KFS simulations
dta.gdp.plot(ax=axes[0], color='k')
axes[0].set_title('Simulations based on KFS approach, MLE parameters')
simulated_state_kfs.plot(ax=axes[0], color='C0', alpha=0.25, legend=False)
# Plot data and CFA simulations
dta.gdp.plot(ax=axes[1], color='k')
axes[1].set_title('Simulations based on CFA approach, MLE parameters')
simulated_state_cfa.plot(ax=axes[1], color='C0', alpha=0.25, legend=False)
# Add a legend, clean up layout
handles, labels = axes[0].get_legend_handles_labels()
axes[0].legend(handles[:2], ['Data', 'Simulated state'])
fig.tight_layout();
#5
fig, ax = plt.subplots(figsize=(15, 3))
# Update the model's parameterization to one that attributes more
# variation in GDP growth to the observation error and so has less
# variation in the trend component
mod.update([4, 0.05])
# Plot simulations
for i in range(nsimulations):
    sim_kfs.simulate()
    ax.plot(dta.index, sim_kfs.simulated_state[0],
            color='C0', alpha=0.25, label='Simulated state')
# Plot data
dta.gdp.plot(ax=ax, color='k', label='Data', zorder=-1)
# Add title, legend, clean up layout
ax.set_title('Simulations with alternative parameterization yielding a smoother trend')
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles[-2:], labels[-2:])
fig.tight_layout();
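One way to get the grouping asked for above (just a sketch, assuming every column follows the ISO_variable naming pattern described in the question):

import statsmodels.api as sm

# A minimal sketch, assuming dta is the wide DataFrame described above,
# with columns named like 'AUS_inflation', 'USA_gdp', etc.
variables = {}
for col in dta.columns:
    iso, name = col.split('_', 1)  # e.g. ('AUS', 'inflation')
    variables.setdefault(name, []).append(col)

# variables['gdp'] now lists every country's GDP column, so the local
# level model can be run per country and the results collected as a panel:
results = {
    col: sm.tsa.UnobservedComponents(dta[col], 'llevel').fit()
    for col in variables['gdp']
}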
Related
I have a model that makes a prediction and intervals for the prediction of wine prices in Python. For this I need some visualization of the finished product.
I'm wondering if it's possible to create a violin plot using only one data point and the interval, as that is what my model produces. I have done this for a boxplot, but wanted to have a different visualization.
I created the boxplot using bxp stats, where price, q1, q3, interval_lower, and interval_upper are all single data points that the model predicts:
stats = [{
    "med": price,
    "q1": q1,
    "q3": q3,
    "whislo": interval_lower,  # required
    "whishi": interval_upper,  # required
    "fliers": []               # required if showfliers=True
}]
fs = 10 # fontsize
fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(6, 6), sharey=True)
bp = axes.bxp(stats, patch_artist=True)
axes.set_title('Boxplot for wine prediction', fontsize=fs, color='Blue')
plt.show()
This works fine for boxplots, but I cannot find a way to use violinplot in the same way. Does anyone know how to do this? I'm using a Jupyter notebook for Python.
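One possible workaround, sketched here under the assumption that a synthetic sample is acceptable: violinplot estimates a density from raw observations, so you could simulate points consistent with the predicted median and interval and pass those in. The numbers below are hypothetical placeholders:

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical single-point prediction, mirroring the names in the question
price, interval_lower, interval_upper = 50.0, 40.0, 60.0

# Treat the interval as roughly +/- 2 standard deviations and draw a
# synthetic sample, since violinplot needs raw data to estimate a density
rng = np.random.default_rng(0)
sigma = (interval_upper - interval_lower) / 4
sample = rng.normal(price, sigma, size=1000)

fig, ax = plt.subplots(figsize=(6, 6))
ax.violinplot([sample], showmedians=True)
ax.set_title('Violin plot for wine prediction', fontsize=10, color='Blue')
plt.show()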
I am using geoplot's kdeplot function to display kernel density maps (aka heatmaps) for different periods of time.
I have code that looks like this:
fig, axs = plt.subplots(n_rows, n_cols, figsize=(20, 10))
for ax, t in zip(axs.flatten(), periods):
    # df is a GeoPandas dataframe
    data = df.loc[(df['period'] == t), ['id', 'geometry']]
    # heatmap plot
    gplt.kdeplot(
        data,
        clip=africa.geometry,
        cmap='Reds',
        shade=True,
        cbar=True,
        ax=ax)
    gplt.polyplot(africa, ax=ax, zorder=1)
    ax.set_title(int(t))
It outputs a grid of KDE maps, one per period, each with its own colorbar.
I would like instead to define a common scale for my entire dataset (regardless of the period), which I can then use in kdeplot and as a single legend for all my subplots.
I know that my data have different densities in different years, but I am trying to find common values that can be used for each of them.
I thought the levels parameter would be what I am looking for (i.e., using the same iso-proportions of the density for all periods, e.g., [0.2, 0.4, 0.6, 0.8, 1]).
However, when I use it in combination with cbar=True to display the legends, the values in each legend differ from the other legends (and from the levels vector).
Am I doing something wrong?
If not, do I need to set the cbar manually?
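One possible direction (a sketch only, not verified against geoplot's internals): pass cbar=False to every kdeplot call, keep the common levels vector, and add a single figure-level colorbar built from a shared normalization:

from matplotlib.cm import ScalarMappable
from matplotlib.colors import Normalize

# Assumes the fig and axs objects from the snippet above, with
# cbar=False passed to each gplt.kdeplot call and a common levels vector
levels = [0.2, 0.4, 0.6, 0.8, 1.0]
norm = Normalize(vmin=levels[0], vmax=levels[-1])
mappable = ScalarMappable(norm=norm, cmap='Reds')
mappable.set_array([])  # needed on older matplotlib versions
fig.colorbar(mappable, ax=axs.ravel().tolist(),
             label='Iso-proportion of density')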
I am analyzing a time series dataset and used the seasonal_decompose function in the statsmodels library to obtain the trend and seasonal behavior. The decomposition also provides a "remainder" component that should be uncorrelated, and I obtained its autocorrelation plot. Looking at that plot, how do we tell whether the autocorrelation function indicates that the remainder is indeed uncorrelated?
I am attaching the code I used to obtain the autocorrelation plot, along with the plot itself.
from statsmodels.graphics.tsaplots import plot_acf

fig, ax = plt.subplots(figsize=(20, 5))
plot_acf(data, ax=ax)
plt.show()
[Autocorrelation plot]
If the autocorrelation values are close to zero, then the observations are not correlated. I use a lag of 40 below, but you will need to adjust this value depending on your data.
from statsmodels.graphics import tsaplots

plt.clf()
fig, ax = plt.subplots(figsize=(12, 4))
plt.style.use('seaborn-pastel')
fig = tsaplots.plot_acf(df['value'], lags=40, ax=ax)
plt.show()
print('Values close to 1 show strong positive correlation; the blue region shows the confidence interval.')
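As a complement to eyeballing the plot, a formal check can help. Here is a minimal sketch using the Ljung-Box test from statsmodels, assuming resid holds the remainder component returned by seasonal_decompose:

from statsmodels.stats.diagnostic import acorr_ljungbox

# Hypothetical: resid is the remainder from seasonal_decompose, with the
# NaN values at the series ends dropped
lb = acorr_ljungbox(resid.dropna(), lags=[10, 20, 40])
print(lb)  # large p-values are consistent with an uncorrelated remainder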
I want to create a plot that looks like the plot attached below.
My data frame is built in this format:
  Playlist       Type  Streams
0        a  classical       94
1        b    hip-hop       12
2        c  classical        8
The 'popularity' category in that plot can be replaced by 'streams'; the only thing is that the streams variable has a high variance of values (it goes from 0 to 10,000+), and therefore I believe the density graph might look weird.
However, my first question is how to plot a similar graph in pandas, grouping by the 'Type' column and then creating the density graph.
I tried various methods but did not find a good one to achieve my goal.
To augment the answer of @Student240, you could make use of the seaborn library, which makes it easy to fit 'kernel density estimates'. In other words, it gives smooth curves similar to those in your question rather than a binned histogram. This is done with the kdeplot function. A related plot type is distplot, which gives the KDE estimate but also shows the histogram bins.
Another difference in my answer is the use of the explicit object-oriented approach in matplotlib/seaborn, which involves initially declaring figure and axes objects with plt.subplots() rather than the implicit pyplot approach of plt.hist. See this really good tutorial for more details.
import random
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## This block of code is copied from Student240's answer:
categories = ['classical', 'hip-hop', 'indiepop', 'indierock', 'jazz',
              'metal', 'pop', 'rap', 'rock']
# NB I use a slightly different random variable assignment to introduce
# a bit more variety in my random numbers.
df = pd.DataFrame({'Type': [random.choice(categories) for _ in range(1000)],
                   'stream': [random.normalvariate(i, random.randint(0, 15))
                              for i in range(1000)]})

### Split the data into groups based on types
g = df.groupby('Type')

## From here things change as I make use of the seaborn library
classical = g.get_group('classical')
hiphop = g.get_group('hip-hop')
indiepop = g.get_group('indiepop')
indierock = g.get_group('indierock')

fig, ax = plt.subplots()
ax = sns.kdeplot(data=classical['stream'], label='classical streams', ax=ax)
ax = sns.kdeplot(data=hiphop['stream'], label='hiphop streams', ax=ax)
ax = sns.kdeplot(data=indiepop['stream'], label='indiepop streams', ax=ax)
# For this final one I use the shade option just to show how it is done:
ax = sns.kdeplot(data=indierock['stream'], label='indierock streams', ax=ax, shade=True)
ax.set_xlabel('Streams')
ax.set_ylabel('Density')
ax.set_title('KDE plot example from seaborn')
ax.legend()
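As a follow-up note on the design: instead of pulling each genre out with get_group, a single loop over the groupby object scales to all categories. A small sketch, reusing the df, sns, and plt names from above:

fig, ax = plt.subplots()
for name, group in df.groupby('Type'):
    sns.kdeplot(data=group['stream'], label=name, ax=ax)
ax.set_xlabel('Streams')
ax.set_ylabel('Density')
ax.legend()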
Hi, you can try the following example. I have used random normals just for this example; obviously it wouldn't be possible to have negative streams. Anyway, disclaimer over, here is the code:
import random
import pandas as pd
import matplotlib.pyplot as plt

categories = ['classical', 'hip-hop', 'indiepop', 'indierock', 'jazz',
              'metal', 'pop', 'rap', 'rock']
df = pd.DataFrame({'Type': [random.choice(categories) for _ in range(10000)],
                   'stream': [random.normalvariate(0, random.randint(0, 15))
                              for _ in range(10000)]})

### Split the data into groups based on types
g = df.groupby('Type')

### Access the classical group
classical = g.get_group('classical')
plt.figure(figsize=(15, 6))
plt.hist(classical.stream, histtype='stepfilled', bins=50, alpha=0.2,
         label="Classical Streams", color="#D73A30", density=True)
plt.legend(loc="upper left")

### Hip hop
hiphop = g.get_group('hip-hop')
plt.hist(hiphop.stream, histtype='stepfilled', bins=50, alpha=0.2,
         label="hiphop Streams", color="#2A3586", density=True)
plt.legend(loc="upper left")

### Indie pop
indiepop = g.get_group('indiepop')
plt.hist(indiepop.stream, histtype='stepfilled', bins=50, alpha=0.2,
         label="indie pop streams", color="#5D271B", density=True)
plt.legend(loc="upper left")

# Indie rock
indierock = g.get_group('indierock')
plt.hist(indierock.stream, histtype='stepfilled', bins=50, alpha=0.2,
         label="indie rock Streams", color="#30A9D7", density=True)
plt.legend(loc="upper left")

## Jazz
jazz = g.get_group('jazz')
plt.hist(jazz.stream, histtype='stepfilled', bins=50, alpha=0.2,
         label="jazz Streams", color="#30A9D7", density=True)
plt.legend(loc="upper left")

#### You can add others here if you wish

## Modify this to control the x-axis; possibly useful for high-variance data
plt.xlim([-20, 20])
plt.title('Distribution of Streams by Genre')
plt.xlabel('Streams')
plt.ylabel('Density')
plt.show()
You can Google 'hex color picker' if you want a specific '#000000'-style color in the format I have used in this example.
Modify the 'alpha' variable if you want to change how dense the color is displayed; you can also play around with 'bins' in the example I provided, as this should allow you to make it look better if 50 is too large or small.
I hope this helps; plotting in matplotlib can be a pain to learn, but it is surely worth it!
After performing a PCA analysis in R we can do:
ggbiplot(pca, choices=1:2, groups=factor(row.names(df_t)))
That will plot the data in the 2-PC space, and the direction and weight of the variables in that space as vectors (with different lengths and directions).
In Python I can plot the data in the 2-PC space, and I can get the weights of the variables, but how do I know the direction?
In other words, how could I plot the variable contributions to both PCs (weight and direction) in Python?
I am not aware of any pre-made implementation of this kind of plot, but it can be created using matplotlib.pyplot.quiver. Here's an example I quickly put together. You can use this as a basis to create a nice plot that works well for your data.
Example Data
This generates some example data. It is reused from this answer.
import numpy as np

# User input
n_samples = 100
n_features = 5

# Prep
data = np.empty((n_samples, n_features))
np.random.seed(42)

# Generate
for i, mu in enumerate(np.random.choice([0, 1, 2, 3], n_samples, replace=True)):
    data[i, :] = np.random.normal(loc=mu, scale=1.5, size=n_features)
PCA
from sklearn.decomposition import PCA

pca = PCA().fit(data)
Variables Factor Map
Here we go:
import matplotlib.pyplot as plt

# Get the PCA components (loadings)
PCs = pca.components_

# Use quiver to generate the basic plot
fig = plt.figure(figsize=(5, 5))
plt.quiver(np.zeros(PCs.shape[1]), np.zeros(PCs.shape[1]),
           PCs[0, :], PCs[1, :],
           angles='xy', scale_units='xy', scale=1)

# Add labels based on feature names (here just numbers)
feature_names = np.arange(PCs.shape[1])
for i, j, z in zip(PCs[1, :] + 0.02, PCs[0, :] + 0.02, feature_names):
    plt.text(j, i, z, ha='center', va='center')

# Add unit circle
circle = plt.Circle((0, 0), 1, facecolor='none', edgecolor='b')
plt.gca().add_artist(circle)

# Ensure correct aspect ratio and axis limits
plt.axis('equal')
plt.xlim([-1.0, 1.0])
plt.ylim([-1.0, 1.0])

# Label axes
plt.xlabel('PC 0')
plt.ylabel('PC 1')

# Done
plt.show()
Being Uncertain
I struggled a bit with the scaling of the arrows. Please make sure they correctly reflect the loadings for your data. A quick check of whether feature 4 really correlates strongly with PC 1 (as this example would suggest) looks promising:
data_pca = pca.transform(data)
plt.scatter(data_pca[:, 1], data[:, 4])
plt.xlabel('PC 1')
plt.ylabel('feature 4')
plt.show()
Thanks to WhoIsJack for the earlier answer.
I adapted their code into the function below, which takes a fitted PCA object and the data it was based on. It produces a figure similar to the one above, but I substituted in real column names for the column indices and then pruned it to only show a certain number of contributing columns.
def plot_pca_vis(pca, df: pd.DataFrame, pc_x: int = 0, pc_y: int = 1, num_dims: int = 5):
    """
    https://stackoverflow.com/questions/45148539/project-variables-in-pca-plot-in-python
    Adapted into function by Tim Cashion
    """
    # Get the PCA components (loadings)
    PCs = pca.components_
    # Keep only the strongest-loading features on each of the two PCs
    PC_x_index = PCs[pc_x, :].argsort()[-num_dims:][::-1]
    PC_y_index = PCs[pc_y, :].argsort()[-num_dims:][::-1]
    combined_index = list(set(list(PC_x_index) + list(PC_y_index)))
    PCs = PCs[:, combined_index]

    # Use quiver to generate the basic plot
    fig = plt.figure(figsize=(5, 5))
    plt.quiver(np.zeros(PCs.shape[1]), np.zeros(PCs.shape[1]),
               PCs[pc_x, :], PCs[pc_y, :],
               angles='xy', scale_units='xy', scale=1)

    # Add labels based on feature names, pruned to the same columns as PCs
    feature_names = df.columns[combined_index]
    for i, j, z in zip(PCs[pc_y, :] + 0.02, PCs[pc_x, :] + 0.02, feature_names):
        plt.text(j, i, z, ha='center', va='center')

    # Add unit circle
    circle = plt.Circle((0, 0), 1, facecolor='none', edgecolor='b')
    plt.gca().add_artist(circle)

    # Ensure correct aspect ratio and axis limits
    plt.axis('equal')
    plt.xlim([-1.0, 1.0])
    plt.ylim([-1.0, 1.0])

    # Label axes
    plt.xlabel('PC ' + str(pc_x))
    plt.ylabel('PC ' + str(pc_y))

    # Done
    plt.show()
Hope this helps someone!
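A hypothetical usage example, assuming df is the feature DataFrame the PCA was fitted on and the imports from the earlier snippets are in scope:

pca = PCA().fit(df.values)
plot_pca_vis(pca, df, pc_x=0, pc_y=1, num_dims=5)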