I am trying to plot a simple boxplot for a large dataset (more than one million records) that I converted from PySpark to pandas to perform some preliminary data analysis. The problem is that when I try to visualize one of the features with a boxplot, the y-axis does not reflect the real values (or at least I think it rescales everything).
# Describe basic statistics for the features (1)
DF.select('#followers', '#friends', '#favorites').describe().show()
import matplotlib.pyplot as plt
df_pandas = DF.toPandas()
fig = plt.figure(figsize=(10, 7))
# Creating plot
plt.boxplot(df_pandas["#followers"])
# show plot
plt.show()
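The question does not show the resulting figure, so the following is only a guess at the cause: with heavy-tailed counts such as follower numbers, a few extreme outliers stretch the y-axis, and matplotlib may also switch the tick labels to scientific notation with a multiplier. The sketch below uses synthetic data as a stand-in for df_pandas["#followers"] (the `followers` array is an assumption, not the real column) and shows three common ways to make the axis readable: plain tick labels, a log scale, and hiding the outliers.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical stand-in for df_pandas["#followers"]: heavy-tailed positive values
rng = np.random.default_rng(0)
followers = rng.lognormal(mean=5, sigma=2.5, size=100_000)

fig, (ax_raw, ax_log, ax_trim) = plt.subplots(1, 3, figsize=(12, 5))

# 1) Same plot, but with scientific notation on the y tick labels disabled
ax_raw.boxplot(followers)
ax_raw.ticklabel_format(axis="y", style="plain")
ax_raw.set_title("plain tick labels")

# 2) Log scale, so the box is not squashed by the outliers
ax_log.boxplot(followers)
ax_log.set_yscale("log")
ax_log.set_title("log scale")

# 3) Hide the outlier points; the axis then covers only box and whiskers
ax_trim.boxplot(followers, showfliers=False)
ax_trim.set_title("showfliers=False")

fig.tight_layout()
```

Call plt.show() afterwards to display the figure. Which variant is appropriate depends on whether the outliers themselves are of interest.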
I am doing a time series project and I am trying to figure out how to overlay a scatter plot on top of a time series plot.
import plotly.express as px
# scatter plot
df = alaska
fig = px.scatter(df, x='week', y='travel_restrictions')
fig.show()
# line plot
df = alaska
fig = px.line(df, x='week', y='depression')
fig.show()
simple plot I would like to combine
I have tried a couple of different ways of adding traces, but I get a ValueError whenever I try to combine the two charts.
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import plotly.io
from greykite.algo.changepoint.adalasso.changepoint_detector import ChangepointDetector
model = ChangepointDetector()
res = model.find_trend_changepoints(
    df=combine.reset_index(),  # data df
    time_col="week",  # time column name
    value_col="Least Restrictions",  # value column name
    yearly_seasonality_order=10,  # yearly seasonality order, fit along with trend
    regularization_strength=0.3,  # between 0.0 and 1.0, greater values imply fewer changepoints, and 1.0 implies no changepoints
    resample_freq="7D",  # data aggregation frequency, eliminate small fluctuation/seasonality
    potential_changepoint_n=25,  # the number of potential changepoints
    yearly_seasonality_change_freq="365D",  # varying yearly seasonality for every year
    no_changepoint_distance_from_end="365D")  # the length of data from the end where changepoints are not allowed
fig = model.plot(
    observation=True,
    trend_estimate=False,
    trend_change=True,
    yearly_seasonality_estimate=False,
    adaptive_lasso_estimate=True,
    plot=False,
)
trace2 = go.Scatter(
    x=df[cat],
    y=df['cumulative_perc'],
    name='Cumulative Percentage',
    yaxis='y2'
)
fig.update_layout(title='Alaska')
plotly.io.show(fig)
Also, I'm not sure if it is possible, but originally I was trying to stack a scatter plot on top of this Greykite plot. I only managed to have the two plots obscure each other.
plot from code
Ideally I'd stack my scatter plot on top of my change plot but I don't know if it's possible
I have a data-frame with soil temperature data from several different models that I want to create a scatterplot matrix of. The data frame looks like this:
dataframe structure
The data is organized by model (or station), and I have also included two columns to differentiate between data from the cold or warm season ['Season'] and the layer ['Layer'] that the data is from.
My goal is to create a scatterplot matrix with the following characteristics:
data color-coded by season (which I have set up in the script so far)
the bottom triangle only consisting of data from the 0cm to 30cm soil layer, and the upper triangle only consisting of data from the 30cm to 300cm soil layer.
I have figured out how to create a scatterplot matrix for one triangle/portion of the dataset at a time, such as in this example:
Scatterplot for top 30cm
However, I am unsure how to have a different portion of the data used in each triangle.
The relevant files can be found here:
dframe_btm
dframe_top
dframe_master
Here is the relevant code
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
dframe_scatter_top = pd.read_csv("dframe_top.csv")
dframe_scatter_btm = pd.read_csv("dframe_btm.csv")
dframe_master = pd.read_csv("dframe_master.csv")
# set font sizes before plotting so that they apply to the figure
sns.set_context(rc={"axes.labelsize": 20, "legend.fontsize": 18}, font_scale=1.0)
# corner=True keeps only the lower triangle
scatter1 = sns.pairplot(dframe_scatter_top, hue='Season', corner=True)
scatter1.set(xlim=(-40, 40), ylim=(-40, 40))
plt.show()
I suspect that the trick is to use PairGrid and have one portion of the data appear via map_upper and the other via map_lower; however, I don't currently see a way to explicitly split the data. For example, is there perhaps a way to do the following?
scatter1 = sns.PairGrid(dframe_master)
scatter1.map_upper(...)  # only plot data from 0-30cm
scatter1.map_lower(...)  # only plot data from 30-300cm
You're close. You'll need to define a custom function that does the splitting:
import seaborn as sns
df = sns.load_dataset("penguins")
def scatter_subset(x, y, hue, mask, **kws):
    sns.scatterplot(x=x[mask], y=y[mask], hue=hue[mask], **kws)
g = sns.PairGrid(df, hue="species", diag_sharey=False)
g.map_lower(scatter_subset, mask=df["island"] == 'Torgersen')
g.map_upper(scatter_subset, mask=df["island"] != 'Torgersen')
g.map_diag(sns.kdeplot, fill=True, legend=False)
g.add_legend()
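Applied to the soil temperature question, the same idea might look like the sketch below. The dataframe here is synthetic: the numeric column names and the 'Layer' labels ("top" for 0cm to 30cm, "bottom" for 30cm to 300cm) are assumptions, since the real file contents are not shown. Only the 'Season' and 'Layer' column names and the axis limits come from the question.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for dframe_master; column names other than
# 'Season' and 'Layer', and the layer labels, are assumptions.
rng = np.random.default_rng(0)
n = 200
dframe_master = pd.DataFrame({
    "station": rng.normal(0, 10, n),
    "model_a": rng.normal(0, 10, n),
    "model_b": rng.normal(0, 10, n),
    "Season": rng.choice(["Cold", "Warm"], n),
    "Layer": rng.choice(["top", "bottom"], n),
})

def scatter_subset(x, y, hue, mask, **kws):
    # Plot only the rows selected by the boolean mask
    sns.scatterplot(x=x[mask], y=y[mask], hue=hue[mask], **kws)

is_top = dframe_master["Layer"] == "top"

g = sns.PairGrid(dframe_master,
                 vars=["station", "model_a", "model_b"],
                 hue="Season")
g.map_lower(scatter_subset, mask=is_top)    # lower triangle: 0cm to 30cm
g.map_upper(scatter_subset, mask=~is_top)   # upper triangle: 30cm to 300cm
g.set(xlim=(-40, 40), ylim=(-40, 40))
g.add_legend()
```

The diagonal is left empty here; g.map_diag(...) can be added as in the penguins example if a per-variable distribution is wanted.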
I have 8 different arrays whose distributions I want to compare using violin plots. This is how I plotted them:
plt.violinplot(alpha_g159)
plt.violinplot(alpha_g108)
plt.violinplot(alpha_g141)
plt.violinplot(alpha_g110)
plt.violinplot(alpha_g115)
plt.violinplot(alpha_g132)
plt.violinplot(alpha_g105)
plt.violinplot(alpha_g126)
And I have this plot:
What I actually want is to shift each plot horizontally (along the x-axis) so that they do not overlap, and then add the label of each plot on the x-axis.
Could anyone guide me on how to do that? I tried adding an offset, for example alpha_108 + x0 with x0 = 2, but that just shifts the plot vertically.
You can achieve this by putting your data into a list. Matplotlib then draws the individual violins side by side.
import matplotlib.pyplot as plt
import numpy as np
# put your data in a list like this:
# data = [alpha_g159, alpha_g108, alpha_g141, alpha_g110, alpha_g115, alpha_g132, alpha_g105, alpha_g126]
# as I do not have your data I created some test data
data = [sorted(np.random.normal(0, std, 100)) for std in range(1, 9)]
plt.violinplot(data)
labels = ["alpha_g159", "alpha_g108", "alpha_g141", "alpha_g110", "alpha_g115", "alpha_g132", "alpha_g105", "alpha_g126"]
# add the labels (rotated by 45 degrees so that they do not overlap)
plt.xticks(range(1, 9), labels, rotation=45)
# Tweak spacing to prevent clipping of tick-labels
plt.subplots_adjust(bottom=0.3)
plt.show()
resulting plot
My seaborn heatmap is showing multiple scales (for each column I presume)
Attached an image showing the code, data & chart.
Wondering how I can remove the multiple scales on the right and show only 1.
clustered_heatmap = clustered_points.groupby("Predicted Clusters").sum()
clustered_heatmap = clustered_heatmap.drop(clustered_heatmap.columns[0], axis = 1)
clustered_heatmap
You can try this:
# Create heatmap
plt.figure(figsize=(16,9))
sns.heatmap(clustered_heatmap)
When I plot single plots with pandas DataFrames, I get an x-axis with tick labels.
However, when I make a subplot and try to share the x-axis the way I would when using NumPy arrays without pandas, there are no tick numbers or labels.
I only want the numbers and the label to appear on the last plot, as all plots share the same x-axis.
The data loaded and the plot produced can be found here:
https://drive.google.com/open?id=1hTmTSkIcYl-usv_CCxLl8U6bAoO6tMRh
This is for combining and plotting the data logged from two different logging devices which represent the same time period.
import pandas as pd
import matplotlib.pyplot as plt
df1=pd.read_csv('data1.csv', sep=',',header=0)
df1.columns.values
cols1 = list(df1.columns.values)
df2=pd.read_csv('data2.dat', sep='\t',header=18)
df2.columns.values
cols2 = list(df2.columns.values)
start =10000
stop = 30000
fig, axes = plt.subplots(nrows=5, ncols=1, sharex=True, figsize=(10, 10))
df1.iloc[start:stop].plot(x=cols1[0], y=cols1[1], ax=axes[0])
df1.iloc[start:stop].plot(x=cols1[0], y=cols1[2], ax=axes[0])
df1.iloc[start:stop].plot(x=cols1[0], y=cols1[3], ax=axes[2])
df1.iloc[start:stop].plot(x=cols1[0], y=cols1[4], ax=axes[2])
df2.iloc[start:stop].plot(x=cols2[0], y=cols2[3], ax=axes[3])
axes[3].set_xlabel("Time [s]")
plt.show()
I expect there to be numbers and a label on the x-axis, but instead it only shows the pandas label "#timestamp".
UPDATE: I have found something that hints at the problem. I think it is caused by the two files not having identical time spacing. The first column of each file is time, sampled at roughly, but not exactly, one sample per second. If I remove the x=cols[x] parts, numbers appear on the x-axis, but then there is a time shift between the two plots, because they are plotted against the dataframe index rather than against time.
I am currently trying to interpolate the data so that they share the same x-axis, but I would not have expected that to be necessary.
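For reference, the interpolation mentioned above could be sketched as follows. The two small dataframes are made-up stand-ins for the logged files (the column names time, a, and b are hypothetical), and plotting through matplotlib's own ax.plot instead of DataFrame.plot is an assumption about what might keep the shared tick labels visible, not a confirmed diagnosis.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the two logs: roughly 1 sample per second,
# but with timestamps that do not line up.
t1 = np.arange(0.0, 10.0, 1.0)
t2 = np.arange(0.3, 10.0, 1.1)
df1 = pd.DataFrame({"time": t1, "a": np.sin(t1)})
df2 = pd.DataFrame({"time": t2, "b": 2.0 * t2})

# Linearly interpolate df2's signal onto df1's timestamps.
# np.interp holds the end values constant outside df2's time range.
df1["b_interp"] = np.interp(df1["time"], df2["time"], df2["b"])

# Plot against time on a shared x-axis using matplotlib directly
fig, axes = plt.subplots(nrows=2, ncols=1, sharex=True, figsize=(10, 6))
axes[0].plot(df1["time"], df1["a"], label="a")
axes[1].plot(df1["time"], df1["b_interp"], label="b (interpolated)")
for ax in axes:
    ax.legend()
axes[-1].set_xlabel("Time [s]")  # label only on the last plot
```

Note that interpolation is not strictly required for a shared x-axis: plotting each dataframe against its own time column with ax.plot also places both signals on the same time scale.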