Plot each column of Pandas dataframe pairwise against one column - python

I have a pandas dataframe where one of the columns is a set of labels that I would like to plot each of the other columns against in subplots. In other words, I want the y-axis of each subplot to use the same column, called 'labels', and I want a subplot for each of the remaining columns with the data from each column on the x-axis. I expected the following code snippet to achieve this, but I don't understand why this results in a single nonsensical plot:
examples.plot(subplots=True, layout=(-1, 3), figsize=(20, 20), y='labels', sharey=False)

The problem with that code is that you didn't specify an x value. The plot looks nonsensical because it draws the labels column against the index, which runs from 0 to the number of rows. As far as I know, you can't do what you want in pandas directly. You might want to check out seaborn, though: it's another visualization library with some nice grid-plotting helpers.
Here's an example with your data:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt  # needed for plt.plot below
examples = pd.DataFrame(np.random.rand(10,4), columns=['a', 'b', 'c', 'labels'])
g = sns.PairGrid(examples, x_vars=['a', 'b', 'c'], y_vars='labels')
g = g.map(plt.plot)
This creates a row of three subplots, one per x variable, all with 'labels' on the y-axis.
Obviously it doesn't look great with random data, but hopefully with your data it will look better.
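If you would rather stay with pandas and matplotlib, a plain loop over the columns gets a similar result. This is only a sketch under the same assumptions as the example above (random data, a column called 'labels'); the figure size and the choice of a scatter plot are arbitrary, not something from the question.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Same shape of data as in the example above: three columns plus 'labels'.
examples = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'labels'])

# Every column except 'labels' gets its own subplot.
x_cols = [col for col in examples.columns if col != 'labels']

# One row of subplots; sharey=True because they all show the same 'labels' column.
fig, axes = plt.subplots(1, len(x_cols), figsize=(12, 4), sharey=True, squeeze=False)
for ax, col in zip(axes.ravel(), x_cols):
    ax.scatter(examples[col], examples['labels'])
    ax.set_xlabel(col)
axes[0, 0].set_ylabel('labels')
```

Call plt.show() or fig.savefig(...) afterwards to see the figure.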

Related

plot graphs horizontally when using df.groupby.plot.bar

I want to graph 3 plots horizontally side by side
Three graphs are generated using the code below:
df.groupby(pd.cut(df.col1, [0,1,2])).col2.mean().plot.bar()
df1.groupby(pd.cut(df.col1, [0,1,2])).col2.mean().plot.bar()
df2.groupby(pd.cut(df.col1, [0,1,2])).col2.mean().plot.bar()
I'm not sure where to set axes in this case. Any help would be appreciated.
You may simply use pandas' barh function.
df.groupby(pd.cut(df.col1, [0,1,2])).col2.mean().plot.barh()
This is an example, using this approach to create a dataframe with random samples:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
df.groupby(pd.cut(df.A, [0,10,20,30,40,50,60,70,80,90,100])).A.mean().plot.barh()
This snippet outputs a horizontal bar chart with one bar per bin.
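If the goal is three separate charts next to each other, one way is to create the axes first with plt.subplots and hand each one to pandas through the ax= argument. A minimal sketch, assuming three dataframes that each have col1 and col2 columns; the random frames below are hypothetical stand-ins for df, df1 and df2.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
bins = [0, 1, 2]

# Hypothetical stand-ins for the question's df, df1 and df2.
frames = [
    pd.DataFrame({'col1': rng.uniform(0, 2, 50), 'col2': rng.normal(size=50)})
    for _ in range(3)
]

# One row, three columns of axes: each groupby result is drawn on its own axis.
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, frame in zip(axes, frames):
    # observed=False keeps both bins in the result even if one is empty.
    means = frame.groupby(pd.cut(frame['col1'], bins), observed=False)['col2'].mean()
    means.plot.bar(ax=ax)
fig.tight_layout()
```

Swapping plot.bar for plot.barh works the same way if you prefer horizontal bars.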

Multi Index Seaborn Line Plot

I have a multi index dataframe, with the two indices being Sample and Lithology
Sample 20EC-P 20EC-8 20EC-10-1 ... 20EC-43 20EC-45 20EC-54
Lithology Pd Di-Grd Gb ... Hbl Plag Pd Di-Grd Gb
Rb 7.401575 39.055118 6.456693 ... 0.629921 56.535433 11.653543
Ba 24.610102 43.067678 10.716841 ... 1.073115 58.520532 56.946630
Th 3.176471 19.647059 3.647059 ... 0.823529 29.647059 5.294118
I am trying to put it into a seaborn lineplot as such.
spider = sns.lineplot(data = data, hue = data.columns.get_level_values("Lithology"),
style = data.columns.get_level_values("Sample"),
dashes = False, palette = "deep")
The lineplot comes out as: [image 1]
I have two issues. First, I want to format hues by lithology and style by sample. Outside of the lineplot function, I can successfully access sample and lithology using data.columns.get_level_values, but in the lineplot they don't seem to do anything and I haven't figured out another way to access these values. Also, the lineplot reorganizes the x-axis by alphabetical order. I want to force it to keep the same order as the dataframe, but I don't see any way to do this in the documentation.
To use hue= and style=, seaborn prefers its dataframes in long form. pd.melt() combines all the value columns into one, adding a column for each level of the old column names and a column for the values. The index also needs to be converted to a regular column (with .reset_index()).
Most seaborn functions use order= to set an order on the x-values, but with lineplot the only way is to make the x column categorical with a fixed category order.
from matplotlib import pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
column_tuples = [('20EC-P', 'Pd'), ('20EC-8', 'Di-Grd'), ('20EC-10-1', 'Gb'),
                 ('20EC-43', 'Hbl Plag Pd'), ('20EC-45', 'Di-Grd'), ('20EC-54', 'Gb')]
col_index = pd.MultiIndex.from_tuples(column_tuples, names=["Sample", "Lithology"])
data = pd.DataFrame(np.random.uniform(0, 50, size=(3, len(col_index))), columns=col_index, index=['Rb', 'Ba', 'Th'])
data_long = data.melt(ignore_index=False).reset_index()
data_long['index'] = pd.Categorical(data_long['index'], data.index) # make categorical, use order of the original dataframe
ax = sns.lineplot(data=data_long, x='index', y='value',
                  hue="Lithology", style="Sample", dashes=False, markers=True, palette="deep")
ax.set_xlabel('')
ax.legend(loc='upper left', bbox_to_anchor=(1.01, 1.02))
plt.tight_layout() # fit legend and labels into the figure
plt.show()
The long dataframe looks like:
index Sample Lithology value
0 Rb 20EC-P Pd 6.135005
1 Ba 20EC-P Pd 6.924961
2 Th 20EC-P Pd 44.270570
...

pd.categorical didn't sort bars by specified orders in plot

I was trying to use pd categorical to order the bars in a barplot but the result still didn't get sorted.
import pandas as pd
import numpy as np
np.random.seed(10)
df = pd.DataFrame({'x':np.random.randint(1,10,15),'y': ['x']*15})
df.loc[:,'group'] = df['x'].apply(lambda x:'>=5' if x>=5 else x)
df['group'] = df['group'].astype('string')
sample = df['group'].value_counts().reset_index()
sample['index'] = pd.Categorical(sample['index'],categories=['1','2','3','4','5','6','7','8','9','>=5'], ordered=True)
sample.plot(x='index',kind='bar')
After applying ordered=True, the categories still weren't in order and '>=5' was not at the end of the barplot. Not sure why.
DataFrame.plot.bar() plots the bars in order of occurrence (that is, against the range of row positions) and relabels the ticks with the column specified by x.
This is the case even with numerical data:
pd.DataFrame({'idx': [3,2,1], 'val':[4,5,6]}).plot.bar(x='idx')
would give bars labelled 3, 2, 1 from left to right, i.e. in row order rather than sorted.
In your case, you will need to sort the data before plotting:
sample.sort_values('index').plot(x='index',kind='bar')
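To see that sort_values really follows the category order rather than the order of appearance, here is a small self-contained check. The counts and the column names 'group' and 'count' are made up for illustration; they are not the asker's data.

```python
import pandas as pd

order = ['1', '2', '3', '4', '>=5']

# Hypothetical counts, listed in a deliberately scrambled order.
sample = pd.DataFrame({'group': ['>=5', '3', '1', '4', '2'],
                       'count': [9, 2, 1, 2, 1]})
sample['group'] = pd.Categorical(sample['group'], categories=order, ordered=True)

# sort_values uses the category order, so '>=5' ends up last.
sorted_sample = sample.sort_values('group')
print(sorted_sample['group'].tolist())
```

Plotting sorted_sample with .plot(x='group', kind='bar') then draws the bars in that order, because the bars follow the row order of the frame.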

Separate out (and keep) duplicate categorical data using Seaborn barplot?

I'm trying to plot some hypothetical student testing scores. I'd like to have student lastname on the y-axis and test score on the x-axis (horizontal barplot). Because Student names are non-unique, I'd like to allow duplicates on the y-axis. I've seen ways to get rid of duplicate data in seaborn and/or pandas, but not how to keep. Here's the code I have:
import seaborn as sns
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
sns.set(style="whitegrid")
scores = pd.read_csv('input_file.csv', sep=',').sort_values("score", ascending=True)
sns.set_color_codes("pastel")
sns.barplot(x="score", y="lastName", data=scores, color="b", ci=None)
plt.title('Scores')
sns.despine(left=True, bottom=True)
plt.savefig('path_to_file.pdf')
I thought that maybe I should be using factorplot and setting the orientation to "h" and type to "bar" but that produced a "tight layout" warning and, indeed, a tight/badly-rendered plot.
FYI, currently I have a barplot that looks nice enough, but it groups non-unique lastnames and sums their test scores; that's what I'm looking to fix.
You can plot a bar for each unique row (by using the index as your y-coordinate), and then manually assign y-axis tick labels.
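import pandas as pd
import seaborn as sns
df = pd.DataFrame({
    'name': ['A', 'B', 'A', 'B'],
    'score': [10, 20, 30, 40],
})
# one bar per row: the (unique) index is the y-coordinate
ax = sns.barplot(x=df.score, y=df.index, orient='h')
# relabel the ticks with the possibly duplicated names
ax.set_yticklabels(df.name)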
Note that for this task, Seaborn might actually be overkill; you aren't doing any statistical visualization. Since you don't need to group non-unique values and display confidence intervals, matplotlib.pyplot.barh is sufficient (just import seaborn for good-looking plots).
plt.barh(df.index, df.score, align='center')
plt.yticks(df.index, df.name)
plt.gca().invert_yaxis()
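One thing to watch for: in the question the frame is sorted by score first, so its index is no longer 0..n-1 and using df.index as the y-coordinate would place the bars by their old row numbers. A sketch that uses explicit positions instead; the names and scores are hypothetical, and the column names are taken from the question.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical scores with a repeated last name, sorted as in the question.
scores = pd.DataFrame({'lastName': ['Smith', 'Jones', 'Smith', 'Lee'],
                       'score': [70, 85, 90, 60]}).sort_values('score')

# After sorting, the index is shuffled, so use positions 0..n-1 for the bars.
positions = list(range(len(scores)))

fig, ax = plt.subplots()
ax.barh(positions, scores['score'], align='center')
ax.set_yticks(positions)
ax.set_yticklabels(scores['lastName'])  # duplicates are fine here
ax.invert_yaxis()  # first row at the top
```

Calling scores.reset_index(drop=True) after sorting and then using the index would work just as well.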

Split single column via a condition creating a new pandas DataFrame with two columns

I would like to take a single column containing values, split via a condition into two columns, and then generate the pmf for those distributions and plot as a histogram.
Given a column a what is the best way to split the column via a condition creating a new dataframe with the resulting 2 columns?
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'])
I tried to create a new DataFrame using the filtered Series of the original.. but this doesn't seem to work:
DataFrame([df2[df2.a> 0.5].a, df2[df2.a <= 0.5].a], columns=("a_gt", "a_lt"))
You could use join, but it really depends on what sort of result you're looking for.
Create a joined DataFrame:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.normal(loc=.5,scale=.2,size=(1000, 4)), columns=['a', 'b', 'c', 'd'])
df1 = pd.DataFrame(df[df.a> 0.5].a)
df2 = pd.DataFrame(df[df.a<= 0.5].a)
dfjoined = df1.join(df2, lsuffix='_gt', rsuffix='_lt', how='outer')
Plot on the same axis:
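import matplotlib.pyplot as plt
fig, ax = plt.subplots(1,1)
# the outer join leaves NaN where a row belongs to the other column, so drop them
ax.hist(dfjoined.a_gt.dropna(), bins=10,range=(0,1), color='r')
ax.hist(dfjoined.a_lt.dropna(), bins=10,range=(0,1), color='b')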
I think the current hist() implementation in pandas lacks good control over the bin size and range (?), so I have used the hist function of matplotlib. NumPy also has a histogram function.
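Since the question also asks for the pmf, here is a rough sketch of one way to get it with NumPy's histogram function. It splits the column with Series.where instead of join, which is a swapped-in alternative rather than the approach above, and empirical_pmf is a hypothetical helper name. Values outside the chosen range are ignored, so the pmf is normalized over the in-range values only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(loc=0.5, scale=0.2, size=(1000, 4)),
                  columns=['a', 'b', 'c', 'd'])

# where() keeps values that satisfy the condition and puts NaN elsewhere,
# so both new columns stay aligned with the original index.
split = pd.DataFrame({'a_gt': df['a'].where(df['a'] > 0.5),
                      'a_lt': df['a'].where(df['a'] <= 0.5)})

def empirical_pmf(values, bins=10, value_range=(0, 1)):
    """Bin the non-NaN values and normalize the counts so they sum to 1."""
    counts, edges = np.histogram(values.dropna(), bins=bins, range=value_range)
    total = counts.sum()
    pmf = counts / total if total else counts.astype(float)
    return pmf, edges

pmf_gt, edges = empirical_pmf(split['a_gt'])
pmf_lt, _ = empirical_pmf(split['a_lt'])
```

The result can be drawn with ax.bar(edges[:-1], pmf_gt, width=np.diff(edges), align='edge') if you want bars rather than raw numbers.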
