Pandas+seaborn faceting with multidimensional dataframes - python

In Python pandas, I need to do a facet grid from a multidimensional DataFrame.
In columns a and b I hold scalar values, which represent conditions of an experiment.
In columns x and y instead I have two numpy arrays. Column x is the x-axis of the data and column y is the value of a function corresponding to f(x).
Obviously both x and y have the same number of elements.
I now would like to do a facet grid with rows and columns specifying the conditions, and in every cell of the grid, plot the value of column D vs column D.
This could be a minimal working example:
import pandas as pd
d = [0]*4 # initialize a list with 4 elements
d[0] = {'x':[1,2,3],'y':[4,5,6],'a':1,'b':2} # then fill these elements
d[1] = {'x':[3,1,5],'y':[6,5,1],'a':0,'b':3}
d[2] = {'x':[3,1,5],'y':[6,5,1],'a':1,'b':3}
d[3] = {'x':[3,1,5],'y':[6,5,1],'a':0,'b':2}
pd.DataFrame(d) # create the pandas dataframe
How can I use already existing faceting functions to address the issue of plotting y vs x grouped by the conditions a and b?
Since I need to apply this function to general datasets with different column names, I would like to avoid resorting on hard-coded solutions, but rather see whether it is possible to extend seaborn FacetGrid function to this kind of problem.

I think the best way to go is to split the nested arrays first and then create a facet grid with seaborn.
Thanks to this post (Split nested array values from Pandas Dataframe cell over multiple rows) I was able to split the nested array in your dataframe:
unnested_lst = []
for col in df.columns:
unnested_lst.append(df[col].apply(pd.Series).stack())
result = pd.concat(unnested_lst, axis=1, keys=df.columns).fillna(method='ffill')
Then you can make the facet grid with this code:
import seaborn as sbn
fg = sbn.FacetGrid(result, row='b', col='a')
fg.map(plt.scatter, "x", "y", color='blue')

You need a long-form frame to be able to use FacetGrid, so your best bet is to explode the lists, then recombine and apply:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
d = [0]*4
d[0] = {'x':[1,2,3],'y':[4,5,6],'a':1,'b':2} # then fill these elements
d[1] = {'x':[3,1,5],'y':[6,5,1],'a':0,'b':3}
d[2] = {'x':[3,1,5],'y':[6,5,1],'a':1,'b':3}
d[3] = {'x':[3,1,5],'y':[6,5,1],'a':0,'b':2}
df = pd.DataFrame(d)
df.set_index(['a','b'], inplace=True, drop=True)
x_long = pd.melt(df['x'].apply(pd.Series).reset_index(),
id_vars=['a', 'b'], value_name='x')
y_long = pd.melt(df['y'].apply(pd.Series).reset_index(),
id_vars=['a', 'b'], value_name='y')
long_df = pd.merge(x_long, y_long).drop('variable', axis='columns')
grid = sns.FacetGrid(long_df, row='a', col='b')
grid.map(plt.scatter, 'x', 'y')
plt.show()
This will show you the following:

I believe the best, shortest and most comprehensible solution is to define an appositely created lambda function. It has as input the mapping variables specified by the FacetGrid.map method, and takes its values in form of numpy arrays by the .values[0], as they are unique.
import pandas as pd
d = [0]*4 # initialize a list with 4 elements
d[0] = {'x':[1,2,3],'y':[4,5,6],'a':1,'b':2} # then fill these elements
d[1] = {'x':[3,1,5],'y':[6,5,1],'a':0,'b':3}
d[2] = {'x':[3,1,5],'y':[6,5,1],'a':1,'b':3}
d[3] = {'x':[3,1,5],'y':[6,5,1],'a':0,'b':2}
df = pd.DataFrame(d) # create the pandas dataframe
import seaborn as sns
import matplotlib.pyplot as plt
grid = sns.FacetGrid(df,row='a',col='b')
grid.map(lambda _x,_y,**kwargs : plt.scatter(_x.values[0],_y.values[0]),'x','y')

Related

Plotly graph : show number of occurrences in bar-chart

I try to plot a bar-chart from a givin dataframe.
x-axis = dates
y-axis = number of occurences for each month
The result should be a barchart. Each x is an occurrence.
x
xx
x
2020-1
2020-2
2020-3
2020-4
2020-5
I tried but don't get the desired result as above.
import datetime as dt
import pandas as pd
import numpy as np
import plotly.offline as pyo
import plotly.graph_objs as go
# initialize list of lists
data = [['a', '2022-01-05'], ['a', '2022-02-14'], ['a', '2022-02-15'],['a', '2022-05-14']]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['Name', 'Date'])
# print dataframe.
df['Date']=pd.to_datetime(df['Date'])
# plot dataframe
trace1=go.Bar(
#
x = df.Date.dt.month,
y = df.Name.groupby(df.Date.dt.month).count()
)
data=[trace1]
fig=go.Figure(data=data)
pyo.plot(fig)
Remove the last line and write instead:
fig.show()
Edit:
It's unclear to me whether you have 1 dimensional or 2 dimensional data here. Supposing you have 1d data, this is, just a bunch of dates that you want to aggregate in a bar chart, simply do this:
# initialize list of lists
data = ['2022-01-05', '2022-02-14', '2022-02-15', '2022-05-14']
# Create the pandas DataFrame
df = pd.DataFrame(data)
# plot dataframe
fig = px.bar(df)
If, instead, you have 2d data then what you want is a scatter plot, not a bar chart.

Create a box plot from two series

I have two pandas series of numbers (not necessarily in the same size).
Can I create one side by side box plot for both of the series?
I didn't found a way to create a boxplot from a series, and not from 2 series.
For the test I generated 2 Series, of different size:
np.random.seed(0)
s1 = pd.Series(np.random.randn(10))
s2 = pd.Series(np.random.randn(14))
The first processing step is to concatenate them into a single DataFrame
and set some meaningful column names (will be included in the picture):
df = pd.concat([s1, s2], axis=1)
df.columns = ['A', 'B']
And to create the picture, along with a title, you can run:
ax = df.boxplot()
ax.get_figure().suptitle(t='My Boxplot', fontsize=16);
For my source data, the result is:
We can try with an example dataset, two series, unequal length, and defined colors.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
np.random.seed(100)
S1 = pd.Series(np.random.normal(0,1,10))
S2 = pd.Series(np.random.normal(0,1,14))
colors = ['#aacfcf', '#d291bc']
One option is to make a data.frame containing the two series in a column, and provide a label for the series:
fig, ax = plt.subplots(1, 1,figsize=(6,4))
import seaborn as sns
sns.boxplot(x='series',y='values',
data=pd.DataFrame({'values':pd.concat([S1,S2],axis=0),
'series':np.repeat(["S1","S2"],[len(S1),len(S2)])}),
ax = ax,palette=colors,width=0.5
)
The other, is to use matplotlib directly, as the other solutions have suggested. However, there is no need to concat them column wise and create some amounts of NAs. You can directly use plt.boxplot from matplotlib to plot an array of values. The downside is, that it takes a bit of effort to adjust the colors etc, as I show below:
fig, ax = plt.subplots(1, 1,figsize=(6,4))
bplot = ax.boxplot([S1,S2],patch_artist=True,widths=0.5,
medianprops=dict(color="black"),labels =['S1','S2'])
plt.setp(bplot['boxes'], color='black')
for patch, color in zip(bplot['boxes'], colors):
patch.set_facecolor(color)
Try this:
import numpy as np
ser1 = pd.Series(np.random.randn(10))
ser2 = pd.Series(np.random.randn(10))
## solution
pd.concat([ser1, ser2], axis=1).plot.box()

Reorder Catagorical X axis ticks Matplotlib [duplicate]

EDIT: this question arose back in 2013 with pandas ~0.13 and was obsoleted by direct support for boxplot somewhere between version 0.15-0.18 (as per #Cireo's late answer; also pandas greatly improved support for categorical since this was asked.)
I can get a boxplot of a salary column in a pandas DataFrame...
train.boxplot(column='Salary', by='Category', sym='')
...however I can't figure out how to define the index-order used on column 'Category' - I want to supply my own custom order, according to another criterion:
category_order_by_mean_salary = train.groupby('Category')['Salary'].mean().order().keys()
How can I apply my custom column order to the boxplot columns? (other than ugly kludging the column names with a prefix to force ordering)
'Category' is a string (really, should be a categorical, but this was back in 0.13, where categorical was a third-class citizen) column taking 27 distinct values: ['Accounting & Finance Jobs','Admin Jobs',...,'Travel Jobs']. So it can be easily factorized with pd.Categorical.from_array()
On inspection, the limitation is inside pandas.tools.plotting.py:boxplot(), which converts the column object without allowing ordering:
pandas.core.frame.py.boxplot() is a passthrough to
pandas.tools.plotting.py:boxplot()
which instantiates ...
matplotlib.pyplot.py:boxplot() which instantiates ...
matplotlib.axes.py:boxplot()
I suppose I could either hack up a custom version of pandas boxplot(), or reach into the internals of the object. And also file an enhance request.
Hard to say how to do this without a working example. My first guess would be to just add an integer column with the orders that you want.
A simple, brute-force way would be to add each boxplot one at a time.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame(np.random.rand(37,4), columns=list('ABCD'))
columns_my_order = ['C', 'A', 'D', 'B']
fig, ax = plt.subplots()
for position, column in enumerate(columns_my_order):
ax.boxplot(df[column], positions=[position])
ax.set_xticks(range(position+1))
ax.set_xticklabels(columns_my_order)
ax.set_xlim(xmin=-0.5)
plt.show()
EDIT: this is the right answer after direct support was added somewhere between version 0.15-0.18
tl;dr: for recent pandas - use positions argument to boxplot.
Adding a separate answer, which perhaps could be another question - feedback appreciated.
I wanted to add a custom column order within a groupby, which posed many problems for me. In the end, I had to avoid trying to use boxplot from a groupby object, and instead go through each subplot myself to provide explicit positions.
import matplotlib.pyplot as plt
import pandas as pd
df = pd.DataFrame()
df['GroupBy'] = ['g1', 'g2', 'g3', 'g4'] * 6
df['PlotBy'] = [chr(ord('A') + i) for i in xrange(24)]
df['SortBy'] = list(reversed(range(24)))
df['Data'] = [i * 10 for i in xrange(24)]
# Note that this has no effect on the boxplot
df = df.sort_values(['GroupBy', 'SortBy'])
for group, info in df.groupby('GroupBy'):
print 'Group: %r\n%s\n' % (group, info)
# With the below, cannot use
# - sort data beforehand (not preserved, can't access in groupby)
# - categorical (not all present in every chart)
# - positional (different lengths and sort orders per group)
# df.groupby('GroupBy').boxplot(layout=(1, 5), column=['Data'], by=['PlotBy'])
fig, axes = plt.subplots(1, df.GroupBy.nunique(), sharey=True)
for ax, (g, d) in zip(axes, df.groupby('GroupBy')):
d.boxplot(column=['Data'], by=['PlotBy'], ax=ax, positions=d.index.values)
plt.show()
Within my final code, it was even slightly more involved to determine positions because I had multiple data points for each sortby value, and I ended up having to do the below:
to_plot = data.sort_values([sort_col]).groupby(group_col)
for ax, (group, group_data) in zip(axes, to_plot):
# Use existing sorting
ordering = enumerate(group_data[sort_col].unique())
positions = [ind for val, ind in sorted((v, i) for (i, v) in ordering)]
ax = group_data.boxplot(column=[col], by=[plot_by], ax=ax, positions=positions)
Actually I got stuck with the same question. And I solved it by making a map and reset the xticklabels, with code as follows:
df = pd.DataFrame({"A":["d","c","d","c",'d','c','a','c','a','c','a','c']})
df['val']=(np.random.rand(12))
df['B']=df['A'].replace({'d':'0','c':'1','a':'2'})
ax=df.boxplot(column='val',by='B')
ax.set_xticklabels(list('dca'))
Note that pandas can now create categorical columns. If you don't mind having all the columns present in your graph, or trimming them appropriately, you can do something like the below:
http://pandas.pydata.org/pandas-docs/stable/categorical.html
df['Category'] = df['Category'].astype('category', ordered=True)
Recent pandas also appears to allow positions to pass all the way through from frame to axes.
https://github.com/pandas-dev/pandas/blob/master/pandas/core/frame.py
https://github.com/pandas-dev/pandas/blob/master/pandas/plotting/_core.py
https://github.com/matplotlib/matplotlib/blob/master/lib/matplotlib/pyplot.py
https://github.com/matplotlib/matplotlib/blob/master/lib/matplotlib/axes/_axes.py
It might sound kind of silly, but many of the plot allow you to determine the order. For example:
Library & dataset
import seaborn as sns
df = sns.load_dataset('iris')
Specific order
p1=sns.boxplot(x='species', y='sepal_length', data=df, order=["virginica", "versicolor", "setosa"])
sns.plt.show()
If you're not happy with the default column order in your boxplot, you can change it to a specific order by setting the column parameter in the boxplot function.
check the two examples below:
np.random.seed(0)
df = pd.DataFrame(np.random.rand(37,4), columns=list('ABCD'))
##
plt.figure()
df.boxplot()
plt.title("default column order")
##
plt.figure()
df.boxplot(column=['C','A', 'D', 'B'])
plt.title("Specified column order")
Use the new positions= attribute:
df.boxplot(column=['Data'], by=['PlotBy'], positions=df.index.values)
This can be resolved by applying a categorical order. You can decide on the ranking yourself. I'll give an example with days of week.
Provide categorical order to weekday
#List categorical variables in correct order
weekday = ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']
#Assign the above list to category ranking
wDays = pd.api.types.CategoricalDtype(ordered= True, categories=Weekday)
#Apply this to the specific column in DataFrame
df['Weekday'] = df['Weekday'].astype(wDays)
# Then generate your plot
plt.figure(figsize = [15, 10])
sns.boxplot(data = flights_samp, x = 'Weekday', y = 'Y Axis Variable', color = colour)

Creating Pandas Dataframe between two Numpy arrays, then draw scatter plot

I'm relatively new with numpy and pandas (I'm an experimental physicist so I've been using ROOT for years...).
A common plot in ROOT is a 2D scatter plot where, given a list of x- and y- values, makes a "heatmap" type scatter plot of one variable versus the other.
How is this best accomplished with numpy and Pandas? I'm trying to use the Dataframe.plot() function, but I'm struggling to even create the Dataframe.
import numpy as np
import pandas as pd
x = np.random.randn(1,5)
y = np.sin(x)
df = pd.DataFrame(d)
First off, this dataframe has shape (1,2), but I would like it to have shape (5,2).
If I can get the dataframe the right shape, I'm sure I can figure out the DataFrame.plot() function to draw what I want.
There are a number of ways to create DataFrames. Given 1-dimensional column vectors, you can create a DataFrame by passing it a dict whose keys are column names and whose values are the 1-dimensional column vectors:
import numpy as np
import pandas as pd
x = np.random.randn(5)
y = np.sin(x)
df = pd.DataFrame({'x':x, 'y':y})
df.plot('x', 'y', kind='scatter')
Complementing, you can use pandas Series, but the DataFrame must have been created.
import numpy as np
import pandas as pd
x = np.linspace(0,2*np.pi)
y = np.sin(x)
#df = pd.DataFrame()
#df['X'] = pd.Series(x)
#df['Y'] = pd.Series(y)
# You can MIX
df = pd.DataFrame({'X':x})
df['Y'] = pd.Series(y)
df.plot('X', 'Y', kind='scatter')
This is another way that might help
import numpy as np
import pandas as pd
x = np.linspace(0,2*np.pi)
y = np.sin(x)
df = pd.DataFrame(data=np.column_stack((x,y)),columns=['X','Y'])
And also, I find the examples from karlijn (DatacCamp) very helpful
import numpy as np
import pandas as pd
TAB = np.array([['' ,'Col1','Col2'],
['Row1' , 1 , 2 ],
['Row2' , 3 , 4 ],
['Row3' , 5 , 6 ]])
dados = TAB[1:,1:]
linhas = TAB[1:,0]
colunas = TAB[0,1:]
DF = pd.DataFrame(
data=dados,
index=linhas,
columns=colunas
)
print('\nDataFrame:', DF)
In order to do what you want, I wouldn't use the DataFrame plotting methods. I'm also a former experimental physicist, and based on experience with ROOT I think that the Python analog you want is best accomplished using matplotlib. In matplotlib.pyplot there is a method, hist2d(), which will give you the kind of heat map you're looking for.
As for creating the dataframe, an easy way to do it is:
df=pd.DataFrame({'x':x, 'y':y})

Plot a pandas dataframe with vertical lines

I want to plot a dataframe where each data point is not represented as a point but a vertical line from the zero axis like :
df['A'].plot(style='xxx')
where xxx is the style I need.
Also ideally i would like to be able to color each bar based on the values in another column in my dataframe.
I precise that my x axis values are numbers and are not equally spaced.
The pandas plotting tools are convenient wrappers to matplotlib. There is no way I know of to get the functionality you want directly via pandas.
You can get it in a few lines of matplotlib. Most of the code is to do the colour mapping:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import matplotlib.colors as colors
import matplotlib.cm as cmx
#make the dataframe
a = np.random.rand(100)
b = np.random.ranf(100)
df = pd.DataFrame({'a': a, 'b': b})
# do the colour mapping
c_norm = colors.Normalize(vmin=min(df.b), vmax=max(df.b))
scalar_map = cmx.ScalarMappable(norm=c_norm, cmap=plt.get_cmap('jet'))
color_vals = [scalar_map.to_rgba(val) for val in df.b]
# make the plot
plt.vlines(df.index, np.zeros_like(df.a), df.a, colors=color_vals)
I've used the DataFrame index for the x axis values but there is no reason that you could not use irregularly spaced x values.

Categories

Resources