I have a data-frame with soil temperature data from several different models that I want to create a scatterplot matrix of. The data frame looks like this:
[image: dataframe structure]
The data is organized by model (or station). I have also included a 'Season' column to distinguish cold-season from warm-season data, and a 'Layer' column recording the soil layer that the data is from.
My goal is to create a scatterplot matrix with the following characteristics:
data color-coded by season (which I have set up in the script so far)
the bottom triangle only consisting of data from the 0cm to 30cm soil layer, and the upper triangle only consisting of data from the 30cm to 300cm soil layer.
I have figured out how to create a scatterplot matrix for one triangle/portion of the dataset at a time, such as in this example:
[image: Scatterplot for top 30cm]
However, I am unsure how to use a different portion of the data in each triangle.
The relevant files can be found here:
dframe_btm
dframe_top
dframe_master
Here is the relevant code:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
dframe_scatter_top = pd.read_csv('dframe_top.csv')
dframe_scatter_btm = pd.read_csv('dframe_btm.csv')
dframe_master = pd.read_csv('dframe_master.csv')
# set label and legend font sizes before drawing so they apply to this figure
sns.set_context(rc={"axes.labelsize": 20, "legend.fontsize": 18}, font_scale=1.0)
scatter1 = sns.pairplot(dframe_scatter_top, hue='Season', corner=True)
scatter1.set(xlim=(-40, 40), ylim=(-40, 40))
plt.show()
I suspect that the trick is to use PairGrid and send one portion of the data to map_upper and the other portion to map_lower. However, I don't currently see a way to explicitly split the data. For example, is there a way to do something like the following?
scatter1 = sns.PairGrid(dframe_master)
scatter1.map_upper(#only plot data from 0-30cm)
scatter1.map_lower(#only plot data from 30-300cm)
You're close. You'll need to define a custom function that does the splitting:
import seaborn as sns
df = sns.load_dataset("penguins")
def scatter_subset(x, y, hue, mask, **kws):
    sns.scatterplot(x=x[mask], y=y[mask], hue=hue[mask], **kws)
g = sns.PairGrid(df, hue="species", diag_sharey=False)
g.map_lower(scatter_subset, mask=df["island"] == 'Torgersen')
g.map_upper(scatter_subset, mask=df["island"] != 'Torgersen')
g.map_diag(sns.kdeplot, fill=True, legend=False)
g.add_legend()
I am trying to plot a box plot of the temperature of the 20th Century vs the 21st century.
I want to plot these on one box plot, with the 20th century temperatures in one color and the 21st century temperatures in another.
I don't want two different box plots. I want a single box plot so I can see whether the 21st century values fall in the outlier range or not.
I also want to see the individual points in the box plot, but I am not sure how to do this. I tried Seaborn, but I could not get it to show the individual values with a different color for each century.
Here is the code to generate values of temperature:
import numpy as np
import pandas as pd
def generate(median=20, err=1, outlier_err=25, size=100, outlier_size=10):
    # bulk of the data: random errors of either sign around the median
    errs = err * np.random.rand(size) * np.random.choice((-5, 5), size)
    data = median + errs
    # outliers below and above the median
    lower_errs = outlier_err * np.random.rand(outlier_size)
    lower_outliers = median - err - lower_errs
    upper_errs = outlier_err * np.random.rand(outlier_size)
    upper_outliers = median + err + upper_errs
    data = np.round(np.concatenate((data, lower_outliers, upper_outliers)))
    np.random.shuffle(data)
    return data
data = pd.DataFrame(generate(), columns=['temp'])
data['year'] = '20th Century'
I'm not sure I understood exactly what you want, but since you want individually coloured points and just one box, I suggest you try .swarmplot(). Here's how it might look:
import pandas as pd
import seaborn as sns
# generate data for two centuries in a DataFrame
data = pd.DataFrame({'20_century': generate(),
                     '21_century': generate()})
# transform from wide to long form to plot individual points in a single swarm
data_long = pd.melt(data, value_vars=['20_century', '21_century'])
# rename columns
data_long.columns = ['century', 'temp']
# since .swarmplot() requires categories on one axis, add one dummy category shared by all rows, say, for a timescale
data_long['timescale'] = 'century'
# draw a swarmplot with hue to color centuries, dodge=False to plot in one swarm
sns.swarmplot(data=data_long, x='timescale', y='temp', hue='century', dodge=False)
I got one group of individual points, coloured by century, outliers are visible:
You might want to try .stripplot() as well:
# added alpha=0.7 for less opacity to better show overlapping points
sns.stripplot(data=data_long, x='timescale', y='temp', hue='century', dodge=False, alpha=0.7)
Personally, I like this one better:
This is how a boxplot would look, as I understood your request:
sns.boxplot(data=data_long, x='timescale', y='temp', hue='century', dodge=False)
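If the aim is a single box with the individual points drawn on top of it, one possible combination is sketched below. This is an assumption about what the question asks for, not part of the answer above: the temperatures are made up (they stand in for the output of generate()), the box is computed from both centuries together, and its own outlier markers are hidden because the strip plot already draws every point.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import numpy as np
import pandas as pd
import seaborn as sns

# Made-up temperatures standing in for the output of generate().
rng = np.random.default_rng(1)
data_long = pd.DataFrame({
    "century": np.repeat(["20_century", "21_century"], 120),
    "temp": np.round(np.concatenate([rng.normal(20, 3, 120),
                                     rng.normal(23, 4, 120)])),
})
data_long["timescale"] = "century"  # single dummy category, so a single box

# One gray box computed from all rows; showfliers=False hides the box plot's
# own outlier markers, which would duplicate the points drawn below.
ax = sns.boxplot(data=data_long, x="timescale", y="temp",
                 color="lightgray", showfliers=False)
# Individual points on the same axes, colored by century.
sns.stripplot(data=data_long, x="timescale", y="temp", hue="century",
              dodge=False, alpha=0.7, ax=ax)
```

Points that fall outside the whiskers are then visible together with their century color.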
Good Afternoon All,
I'm attempting to create a contour map of surface elevation by using drilling data from a mineral exploration programme. I am new to programming, so any feedback would be welcome!
Each drill hole has a:
hole id
x co-ordinate (Easting)
y co-ordinate (Northing)
z value (surface elevation).
An excerpt of the data is as follows:
Methodology
I broke the work down into two steps.
1) Checking that the data plots in the correct area
I used pandas to extract the co-ordinates of each drilling hole from the csv file, and plotted the data using plt.scatter from matplotlib.
This is my output. So far it works, so now I want to plot the 3D (z axis) data.
2) Plotting of Surface_Elevation (z axis)
This is where I am having problems. I've read through several contouring guides for matplotlib, all of which depend on plt.contour. The issue is that this function wants a 2D array, while the data that I want to contour is 1D. Am I missing something here?
My attempt
import matplotlib.pyplot as plt # plot data
import pandas as pd # extract data from csv
# access csv and assign as a variable
dataset = pd.read_csv('spreadsheet.csv')
# x_axis values extracted and converted to a list from the csv
x_axis = list(dataset["Orig_East"])
# y_axis values extracted and converted to a list from the csv
y_axis = list(dataset["Orig_North"])
# z_axis values extracted and converted to a list from the csv
z_axis = list(dataset["Surface_Elevation"])
plt.contour(x_axis, y_axis, z_axis, colors='black');
plt.ticklabel_format(useOffset=False, style='plain') # remove exponential axis labels
plt.xlabel('Easting') # label x axis
plt.ylabel('Northing') # label y axis
plt.title('Surface Elevation') # label plot
# plot graph
plt.show()
A possible solution is to encode the elevation of each point in the color of its scatter marker. This can be done by calling plt.scatter(x, y, c=z). You can also specify a desired cmap; see the documentation.
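A minimal sketch of that suggestion, using made-up collar coordinates and elevations in place of spreadsheet.csv. The tricontour call is an addition that the answer does not mention: matplotlib's tricontour draws contour lines directly from scattered 1D x, y, z arrays by triangulating the points, which may be closer to what the question asks for.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Made-up drill-hole collars: easting, northing and surface elevation.
rng = np.random.default_rng(2)
x = rng.uniform(500000, 501000, 50)
y = rng.uniform(7000000, 7001000, 50)
z = 300 + 0.02 * (x - 500000) + 0.01 * (y - 7000000)

fig, ax = plt.subplots()
points = ax.scatter(x, y, c=z, cmap="viridis")  # marker color encodes elevation
fig.colorbar(points, ax=ax, label="Surface elevation")

# Not part of the answer: contour lines from the scattered points themselves.
ax.tricontour(x, y, z, levels=8, colors="black", linewidths=0.5)

ax.ticklabel_format(useOffset=False, style="plain")  # remove exponential axis labels
ax.set_xlabel("Easting")
ax.set_ylabel("Northing")
ax.set_title("Surface Elevation")
```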
I'm trying to plot some X and Z coordinates on an image to show which parts of the image have higher counts. Y values are height in this case, so I am excluding them since I want a 2D plot. Since I have many millions of data points, I have grouped by the combinations of X and Z coordinates and counted how many times each combination occurred. The data should contain almost all combinations of X and Z coordinates. It looks something like this (fake data):
I have experimented with matplotlib.pyplot by using the plt.hist2d(x,y) function but it seems like this takes raw data and not already-summarized data like I've got.
Does anyone know if this is possible?
Note: I can figure out the plotting on an image part later, first I'm trying to get the scatter-plot/heatmap to show aggregated data.
I managed to figure this out. After loading in the data in the format of the original post, step one is pivoting the data so you have x values as columns and z values as rows. Then you plot it using seaborn heatmap. See below:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# pivot so that X_LOC values become columns and Z_LOC values become rows (raw is the loaded DataFrame)
values = pd.pivot_table(raw, values='COUNT_TICKS', index=['Z_LOC'], columns=['X_LOC'], aggfunc='sum')
plt.figure(figsize=(20, 20))
sns.set(rc={'axes.facecolor': 'cornflowerblue', 'figure.facecolor': 'cornflowerblue'})
#ax = sns.heatmap(values, vmin=100, vmax=5000, cmap="Oranges", robust = True, xticklabels = x_labels, yticklabels = y_labels, alpha = 1)
ax = sns.heatmap(values,
                 #vmin=1,
                 vmax=1000,
                 cmap="Greens",  # BrBG is also good
                 robust=True,
                 alpha=1)
plt.show()
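The pivot step can be tried on a tiny made-up table, as in the sketch below; the five rows are invented, and only the column names come from the answer. The second panel shows an alternative that is not in the answer: plt.hist2d accepts a weights argument, so the pre-aggregated counts can be passed as weights instead of expanding them back into raw points. The bin counts are chosen by hand here to match the three X values and two Z values.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Invented pre-aggregated data: one row per (X_LOC, Z_LOC) combination.
raw = pd.DataFrame({
    "X_LOC": [0, 0, 1, 1, 2],
    "Z_LOC": [0, 1, 0, 1, 1],
    "COUNT_TICKS": [5, 3, 7, 1, 4],
})

# Z values become rows, X values become columns; missing combinations count as 0.
values = raw.pivot_table(values="COUNT_TICKS", index="Z_LOC",
                         columns="X_LOC", aggfunc="sum", fill_value=0)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
sns.heatmap(values, cmap="Greens", ax=ax1)

# Alternative (not from the answer): hist2d with the counts as weights.
h, xedges, zedges, img = ax2.hist2d(raw["X_LOC"], raw["Z_LOC"], bins=[3, 2],
                                    weights=raw["COUNT_TICKS"], cmap="Greens")
```

Both routes give the same grid of counts here; note that hist2d indexes its result as h[x_bin, z_bin], the transpose of the pivoted table.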
I have a pandas DataFrame with multiple numeric columns, and the 1st column holds the categorical data. Obviously, I have NaN values and zeroes in multiple rows (though no row is entirely blank) and in different columns.
The rows have valuable data in other columns which are not NaN. And the columns have valuable data in other rows, which are also not NaN.
The problem is that sns.pairplot does not ignore NaN values for correlation and returns errors (such as division by zero, string to float conversion, etc.).
I have seen some people suggest the fillna() method, but I am hoping someone knows a more elegant way to do this, without having to go through that solution and spend numerous hours fixing the plot, axes, filters, etc. afterwards. I didn't like that workaround.
It is similar to what this person has reported:
https://github.com/mwaskom/seaborn/issues/1699
ZeroDivisionError: 0.0 cannot be raised to a negative power
Here is the sample dataset:
Seaborn's PairGrid class will allow you to create your desired plot. PairGrid is much more flexible than sns.pairplot. Any PairGrid has three sections: the upper triangle, the lower triangle and the diagonal.
For each part, you can define a customized plotting function. The upper and lower triangle sections can take any plotting function that accepts two arrays of features (such as plt.scatter) as well as any associated keywords (e.g. marker). The diagonal section accepts a plotting function that has a single feature array as input (such as plt.hist) in addition to the relevant keywords.
For your purpose, you can filter out the NaNs in your customized function(s):
from sklearn import datasets
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
data = datasets.load_iris()
iris = pd.DataFrame(data.data, columns=data.feature_names)
# break iris dataset to create NaNs
iris.iat[1, 0] = np.nan
iris.iat[4, 0] = np.nan
iris.iat[4, 2] = np.nan
iris.iat[5, 2] = np.nan
# create customized scatterplot that first filters out NaNs in feature pair
def scatterFilter(x, y, **kwargs):
    interimDf = pd.concat([x, y], axis=1)
    interimDf.columns = ['x', 'y']
    # keep only the rows where both features are present
    interimDf = interimDf.dropna()
    plt.plot(interimDf.x.values, interimDf.y.values, 'o', **kwargs)
# Create an instance of the PairGrid class.
grid = sns.PairGrid(data=iris, vars=list(iris.columns), height=4)
# Map a scatter plot to the upper triangle
grid = grid.map_upper(scatterFilter, color='darkred')
# Map a histogram to the diagonal
grid = grid.map_diag(plt.hist, bins=10, edgecolor='k', color='darkred')
# Map the same NaN-filtering scatter plot to the lower triangle
grid = grid.map_lower(scatterFilter, color='darkred')
This will yield the following plot:
PairGrid also allows you to plot contour plots, annotate the panels with descriptive statistics, etc. For more details, see here.
Let's look at a swarmplot made with Python 3.5 and Seaborn on some data (stored in a pandas DataFrame df, with the column labels stored in another class; this does not matter for now, just look at the plot):
ax = sns.swarmplot(x=self.dte.label_temperature, y=self.dte.label_current, hue=self.dte.label_voltage, data = df)
The data is more readable when plotted on a log-scale y-axis, because it spans several decades.
So let's change the scaling to logarithmic:
ax.set_yscale("log")
ax.set_ylim(bottom = 5*10**-10)
Now I have a problem with the gaps in the swarms. I guess they are there because the dot positions were computed with a linear axis in mind, where the dots must not overlap. But now they look kind of strange, and there is enough space to form 4 equal-looking swarms.
My question is: How can I force seaborn to recalculate the position of the dots to create better looking swarms?
mwaskom hinted to me in the comments how to solve this.
It is even stated in the swarmplot documentation:
Note that arranging the points properly requires an accurate transformation between data and point coordinates. This means that non-default axis limits should be set before drawing the swarm plot.
Set an existing axis to log scale first, then use it for the plot:
fig = plt.figure() # create figure
rect = 0,0,1,1 # create a rectangle for the new axis
log_ax = fig.add_axes(rect) # create a new axis (or use an existing one)
log_ax.set_yscale("log") # log first
sns.swarmplot(x=self.dte.label_temperature, y=self.dte.label_current, hue=self.dte.label_voltage, data = df, ax = log_ax)
This yields the correct and desired plotting behaviour:
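A self-contained version of the same ordering, for anyone who wants to try it without the original class. The DataFrame and its column names (temperature, current, voltage) are invented for illustration; the point is only that the scale and limits are set on the axes before sns.swarmplot is called.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Invented measurements: currents spanning several decades at four temperatures.
rng = np.random.default_rng(3)
df = pd.DataFrame({
    "temperature": np.repeat([20, 40, 60, 80], 30),
    "current": 10.0 ** rng.uniform(-9, -5, 120),
    "voltage": np.tile(["1 V", "2 V", "3 V"], 40),
})

fig, log_ax = plt.subplots()
log_ax.set_yscale("log")        # log scale first
log_ax.set_ylim(5e-10, 2e-5)    # non-default limits also before plotting
sns.swarmplot(data=df, x="temperature", y="current", hue="voltage", ax=log_ax)
```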