male[['Gender','Age']].plot(kind='hist', x='Gender', y='Age', bins=50)
female[['Gender','Age']].plot(kind='hist', x='Gender', y='Age', bins=50)
I used data from a file to create two histograms of age, one per gender. I separated the data by gender up front so that I could plot each group. Now I'm having a hard time putting the two histograms together on one plot.
As mentioned in the comment, you can use matplotlib to do this task. I haven't figured out how to plot two histograms using pandas, though (I would like to see how people have done that).
import matplotlib.pyplot as plt
import random
# example data
age = [random.randint(20, 40) for _ in range(100)]
sex = [random.choice(['M', 'F']) for _ in range(100)]
# just give a list of age of male/female and corresponding color here
plt.hist([[a for a, s in zip(age, sex) if s == 'M'],
          [a for a, s in zip(age, sex) if s == 'F']],
         color=['b', 'r'], alpha=0.5, bins=10)
plt.show()
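On the open point about doing this with pandas alone: one possible approach is to let the first plot call create the Axes and pass it to the second call through ax=. The following is a minimal sketch under that assumption. The DataFrame is made-up example data, and the shared bin edges are my own choice so that the two histograms line up.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; drop it for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical example data standing in for the asker's file
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Gender": rng.choice(["M", "F"], size=100),
    "Age": rng.integers(20, 41, size=100),
})
male = df[df["Gender"] == "M"]
female = df[df["Gender"] == "F"]

# Shared bin edges so the two histograms are directly comparable
bins = np.linspace(20, 40, 11)

# The first call creates the Axes; the second call draws onto it via ax=
ax = male["Age"].plot(kind="hist", bins=bins, alpha=0.5, color="b", label="M")
female["Age"].plot(kind="hist", bins=bins, alpha=0.5, color="r", label="F", ax=ax)
ax.set_xlabel("Age")
ax.legend()
```

With an interactive backend, a final plt.show() displays the overlaid histograms.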
Consider converting the dataframes to a two-column numpy matrix, since matplotlib's hist works with this structure rather than with two pandas dataframes of different lengths that contain non-numeric columns. Pandas' join is used to bind the two columns, MaleAge and FemaleAge.
Here, the Gender indicator is dropped and the legend is labeled manually according to the column order.
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
...
# RESET INDEX AND RENAME COLUMN AFTER SUBSETTING
male = df2[df2['Gender'] == "M"].reset_index(drop=True).rename(columns={'Age':'MaleAge'})
female = df2[df2['Gender'] == "F"].reset_index(drop=True).rename(columns={'Age':'FemaleAge'})
# OUTER JOIN TO ACHIEVE SAME LENGTH
gendermat = np.array(male[['MaleAge']].join(female[['FemaleAge']], how='outer'))
plt.hist(gendermat, bins=50, label=['male', 'female'])
plt.legend(loc='upper right')
plt.show()
plt.clf()
plt.close()
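To make the outer-join step concrete, here is a small sketch with made-up ages (the values are assumptions, not the asker's data). It only shows what the join produces: the shorter column is padded with NaN so that both columns end up the same length.

```python
import numpy as np
import pandas as pd

# Hypothetical subsets of unequal length, already reset to a 0..n-1 index
male = pd.DataFrame({"MaleAge": [21, 35, 40]})
female = pd.DataFrame({"FemaleAge": [25, 30]})

# Outer join on the index: rows missing from the shorter frame become NaN
joined = male.join(female, how="outer")
gendermat = np.array(joined)

print(joined)
print(gendermat.shape)
```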
I have a multi index dataframe, with the two indices being Sample and Lithology
Sample 20EC-P 20EC-8 20EC-10-1 ... 20EC-43 20EC-45 20EC-54
Lithology Pd Di-Grd Gb ... Hbl Plag Pd Di-Grd Gb
Rb 7.401575 39.055118 6.456693 ... 0.629921 56.535433 11.653543
Ba 24.610102 43.067678 10.716841 ... 1.073115 58.520532 56.946630
Th 3.176471 19.647059 3.647059 ... 0.823529 29.647059 5.294118
I am trying to put it into a seaborn lineplot as such.
spider = sns.lineplot(data = data, hue = data.columns.get_level_values("Lithology"),
style = data.columns.get_level_values("Sample"),
dashes = False, palette = "deep")
The lineplot comes out as: [image: resulting lineplot]
I have two issues. First, I want to format hues by lithology and style by sample. Outside of the lineplot function, I can successfully access sample and lithology using data.columns.get_level_values, but in the lineplot they don't seem to do anything and I haven't figured out another way to access these values. Also, the lineplot reorganizes the x-axis by alphabetical order. I want to force it to keep the same order as the dataframe, but I don't see any way to do this in the documentation.
To use hue= and style=, seaborn prefers its dataframes in long form. pd.melt() combines all columns, creating new columns that hold the old column names plus one column for the values. The index also needs to be converted to a regular column (with .reset_index()).
Most seaborn functions use order= to set an order on the x-values, but with lineplot the only way is to make the column categorical, applying a fixed order.
from matplotlib import pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
column_tuples = [('20EC-P', 'Pd'), ('20EC-8', 'Di-Grd'), ('20EC-10-1', 'Gb'),
                 ('20EC-43', 'Hbl Plag Pd'), ('20EC-45', 'Di-Grd'), ('20EC-54', 'Gb')]
col_index = pd.MultiIndex.from_tuples(column_tuples, names=["Sample", "Lithology"])
data = pd.DataFrame(np.random.uniform(0, 50, size=(3, len(col_index))), columns=col_index, index=['Rb', 'Ba', 'Th'])
data_long = data.melt(ignore_index=False).reset_index()
data_long['index'] = pd.Categorical(data_long['index'], data.index) # make categorical, use order of the original dataframe
ax = sns.lineplot(data=data_long, x='index', y='value',
                  hue="Lithology", style="Sample", dashes=False, markers=True, palette="deep")
ax.set_xlabel('')
ax.legend(loc='upper left', bbox_to_anchor=(1.01, 1.02))
plt.tight_layout() # fit legend and labels into the figure
plt.show()
The long dataframe looks like:
index Sample Lithology value
0 Rb 20EC-P Pd 6.135005
1 Ba 20EC-P Pd 6.924961
2 Th 20EC-P Pd 44.270570
...
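The categorical trick works because pandas orders a categorical column by its category list instead of alphabetically. A minimal sketch of that difference, using the three element names from the question (the sample values themselves are made up):

```python
import pandas as pd

# Desired order, taken from the original dataframe's index
element_order = ["Rb", "Ba", "Th"]
values = pd.Series(["Th", "Ba", "Rb", "Ba"])

# Plain strings sort alphabetically
plain_sorted = list(values.sort_values())

# An ordered categorical sorts by position in element_order
dtype = pd.CategoricalDtype(categories=element_order, ordered=True)
fixed_sorted = list(values.astype(dtype).sort_values())

print(plain_sorted)
print(fixed_sorted)
```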
I have two pandas Series of numbers (not necessarily of the same size).
Can I create one side-by-side box plot for both of the series?
I only found a way to create a boxplot from a single series, not from two.
For the test I generated 2 Series, of different size:
np.random.seed(0)
s1 = pd.Series(np.random.randn(10))
s2 = pd.Series(np.random.randn(14))
The first processing step is to concatenate them into a single DataFrame
and set some meaningful column names (will be included in the picture):
df = pd.concat([s1, s2], axis=1)
df.columns = ['A', 'B']
And to create the picture, along with a title, you can run:
ax = df.boxplot()
ax.get_figure().suptitle(t='My Boxplot', fontsize=16);
For my source data, the result is: [image: boxplot of columns A and B]
We can try with an example dataset, two series, unequal length, and defined colors.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
np.random.seed(100)
S1 = pd.Series(np.random.normal(0,1,10))
S2 = pd.Series(np.random.normal(0,1,14))
colors = ['#aacfcf', '#d291bc']
One option is to make a DataFrame containing the two series stacked in one column, and provide a label column for the series:
fig, ax = plt.subplots(1, 1, figsize=(6, 4))
import seaborn as sns
# ignore_index=True avoids duplicate index labels after stacking the two series
sns.boxplot(x='series', y='values',
            data=pd.DataFrame({'values': pd.concat([S1, S2], axis=0, ignore_index=True),
                               'series': np.repeat(["S1", "S2"], [len(S1), len(S2)])}),
            ax=ax, palette=colors, width=0.5)
The other option is to use matplotlib directly, as the other solutions have suggested. However, there is no need to concatenate the series column-wise and introduce NaNs. You can pass a list of series straight to plt.boxplot. The downside is that it takes a bit more effort to adjust the colors, as shown below:
fig, ax = plt.subplots(1, 1, figsize=(6, 4))
bplot = ax.boxplot([S1, S2], patch_artist=True, widths=0.5,
                   medianprops=dict(color="black"), labels=['S1', 'S2'])
plt.setp(bplot['boxes'], color='black')
# fill each box with its own color
for patch, color in zip(bplot['boxes'], colors):
    patch.set_facecolor(color)
Try this:
import numpy as np
import pandas as pd
ser1 = pd.Series(np.random.randn(10))
ser2 = pd.Series(np.random.randn(10))
## solution
pd.concat([ser1, ser2], axis=1).plot.box()
I have a list of case and control samples along with the information about what characteristics are present or absent in each of them. A dataframe including the information can be generated by Pandas:
import pandas as pd
df={'Patient':[True,True,False],'Control':[False,True,False]} # Presence/absence data for three genes for each sample
df=pd.DataFrame(df)
df=df.transpose()
df.columns=['GeneA','GeneB','GeneC']
I need to visualize this data as a dot plot/scatter plot in which both the x and y axes are categorical and presence/absence is coded by different shapes, something like the following:
Patient| x x -
Control| - x -
__________________
GeneA GeneB GeneC
I am new to Matplotlib/seaborn and I can plot simple line plots and scatter plots. But searching online I could not find any instructions or plot similar to what I need here.
A quick way would be:
import pandas as pd
import matplotlib.pyplot as plt
df={'Patient':[1,1,0],'Control':[0,1,0]} # Presence/absence data for three genes for each sample
df=pd.DataFrame(df)
df=df.transpose()
df.columns=['GeneA','GeneB','GeneC']
heatmap = plt.imshow(df)
plt.xticks(range(len(df.columns.values)), df.columns.values)
plt.yticks(range(len(df.index)), df.index)
cbar = plt.colorbar(mappable=heatmap, ticks=[0, 1], orientation='vertical')
# vertically oriented colorbar
cbar.ax.set_yticklabels(['Absent', 'Present'])
Thanks to @DEEPAK SURANA for adding labels to the colorbar.
I searched the pyplot documentation and could not find a scatter or dot plot exactly like you described. Here is my take on creating a plot that illustrates what you want. The True records are blue and the False records are red.
# creating dataframe and extra column because index is not numeric
import pandas as pd
df={'Patient':[True,True,False],
'Control':[False,True,False]}
df=pd.DataFrame(df)
df=df.transpose()
df.columns=['GeneA','GeneB','GeneC']
df['level'] = [i for i in range(0, len(df))]
print(df)
# plotting the data
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(10,6))
for idx, gene in enumerate(df.columns[:-1]):
    df_gene = df[[gene, 'level']]
    cList = ['blue' if x == True else 'red' for x in df[gene]]
    for inr_idx, lv in enumerate(df['level']):
        ax.scatter(x=idx, y=lv, c=cList[inr_idx], s=20)
fig.tight_layout()
plt.yticks([i for i in range(len(df.index))], list(df.index))
plt.xticks([i for i in range(len(df.columns)-1)], list(df.columns[:-1]))
plt.show()
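The question asked for presence/absence to be coded by shape rather than color. A minimal sketch of that variant, assuming the marker styles "x" (present) and "_" (absent) are acceptable stand-ins for the symbols in the ASCII mock-up, places one marker per cell at integer positions and then labels the ticks with the gene and sample names:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; drop it for interactive use
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"Patient": [True, True, False],
                   "Control": [False, True, False]}).transpose()
df.columns = ["GeneA", "GeneB", "GeneC"]

genes = list(df.columns)
samples = list(df.index)

fig, ax = plt.subplots(figsize=(4, 2.5))
for yi, sample in enumerate(samples):
    for xi, gene in enumerate(genes):
        # Marker choice is an assumption: "x" for present, "_" (short dash) for absent
        marker = "x" if df.loc[sample, gene] else "_"
        ax.scatter(xi, yi, marker=marker, s=200, color="black")

# Turn the integer positions into categorical-looking axes
ax.set_xticks(range(len(genes)))
ax.set_xticklabels(genes)
ax.set_yticks(range(len(samples)))
ax.set_yticklabels(samples)
ax.margins(0.3)
ax.invert_yaxis()  # put the first row (Patient) on top, as in the mock-up
fig.tight_layout()
```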
Something like this might work:
import pandas as pd
import numpy as np
from matplotlib.ticker import FixedLocator
df={'Patient':[1,1,0],'Control':[0,1,0]} # Presence/absence data for three genes for each sample
df=pd.DataFrame(df)
df=df.transpose()
df.columns=['GeneA','GeneB','GeneC']
plot = df.T.plot()
loc = FixedLocator([0,1,2])
plot.xaxis.set_major_locator(loc)
plot.xaxis.set_ticklabels(df.columns)
See https://matplotlib.org/examples/pylab_examples/major_minor_demo1.html
and https://matplotlib.org/api/ticker_api.html
I think you have to convert the boolean values to zeros and ones to make it work, with something like df.astype(int).
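As a quick check of that conversion, this sketch applies astype(int) to the boolean frame from the question and prints the 0/1 result:

```python
import pandas as pd

df = pd.DataFrame({"Patient": [True, True, False],
                   "Control": [False, True, False]}).transpose()
df.columns = ["GeneA", "GeneB", "GeneC"]

# True -> 1, False -> 0; the index and column labels are unchanged
as_int = df.astype(int)
print(as_int)
```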
EDIT: this question arose back in 2013 with pandas ~0.13 and was obsoleted by direct support for boxplot somewhere between version 0.15-0.18 (as per @Cireo's late answer; also pandas greatly improved support for categorical since this was asked.)
I can get a boxplot of a salary column in a pandas DataFrame...
train.boxplot(column='Salary', by='Category', sym='')
...however I can't figure out how to define the index-order used on column 'Category' - I want to supply my own custom order, according to another criterion:
category_order_by_mean_salary = train.groupby('Category')['Salary'].mean().order().keys()
How can I apply my custom column order to the boxplot columns? (other than ugly kludging the column names with a prefix to force ordering)
'Category' is a string (really, should be a categorical, but this was back in 0.13, where categorical was a third-class citizen) column taking 27 distinct values: ['Accounting & Finance Jobs','Admin Jobs',...,'Travel Jobs']. So it can be easily factorized with pd.Categorical.from_array()
On inspection, the limitation is inside pandas.tools.plotting.py:boxplot(), which converts the column object without allowing ordering:
pandas.core.frame.py.boxplot() is a passthrough to
pandas.tools.plotting.py:boxplot()
which instantiates ...
matplotlib.pyplot.py:boxplot() which instantiates ...
matplotlib.axes.py:boxplot()
I suppose I could either hack up a custom version of pandas boxplot(), or reach into the internals of the object. And also file an enhancement request.
Hard to say how to do this without a working example. My first guess would be to just add an integer column with the orders that you want.
A simple, brute-force way would be to add each boxplot one at a time.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame(np.random.rand(37,4), columns=list('ABCD'))
columns_my_order = ['C', 'A', 'D', 'B']
fig, ax = plt.subplots()
for position, column in enumerate(columns_my_order):
    ax.boxplot(df[column], positions=[position])
ax.set_xticks(range(position+1))
ax.set_xticklabels(columns_my_order)
ax.set_xlim(xmin=-0.5)
plt.show()
EDIT: this is the right answer after direct support was added somewhere between version 0.15-0.18
tl;dr: for recent pandas - use positions argument to boxplot.
Adding a separate answer, which perhaps could be another question - feedback appreciated.
I wanted to add a custom column order within a groupby, which posed many problems for me. In the end, I had to avoid trying to use boxplot from a groupby object, and instead go through each subplot myself to provide explicit positions.
import matplotlib.pyplot as plt
import pandas as pd
df = pd.DataFrame()
df['GroupBy'] = ['g1', 'g2', 'g3', 'g4'] * 6
df['PlotBy'] = [chr(ord('A') + i) for i in range(24)]
df['SortBy'] = list(reversed(range(24)))
df['Data'] = [i * 10 for i in range(24)]
# Note that this has no effect on the boxplot
df = df.sort_values(['GroupBy', 'SortBy'])
for group, info in df.groupby('GroupBy'):
    print('Group: %r\n%s\n' % (group, info))
# With the below, cannot use
# - sort data beforehand (not preserved, can't access in groupby)
# - categorical (not all present in every chart)
# - positional (different lengths and sort orders per group)
# df.groupby('GroupBy').boxplot(layout=(1, 5), column=['Data'], by=['PlotBy'])
fig, axes = plt.subplots(1, df.GroupBy.nunique(), sharey=True)
for ax, (g, d) in zip(axes, df.groupby('GroupBy')):
    d.boxplot(column=['Data'], by=['PlotBy'], ax=ax, positions=d.index.values)
plt.show()
Within my final code, it was even slightly more involved to determine positions because I had multiple data points for each sortby value, and I ended up having to do the below:
to_plot = data.sort_values([sort_col]).groupby(group_col)
for ax, (group, group_data) in zip(axes, to_plot):
    # Use existing sorting
    ordering = enumerate(group_data[sort_col].unique())
    positions = [ind for val, ind in sorted((v, i) for (i, v) in ordering)]
    ax = group_data.boxplot(column=[col], by=[plot_by], ax=ax, positions=positions)
I got stuck on the same question and solved it by building a map and resetting the xticklabels, with the following code:
df = pd.DataFrame({"A":["d","c","d","c",'d','c','a','c','a','c','a','c']})
df['val']=(np.random.rand(12))
df['B']=df['A'].replace({'d':'0','c':'1','a':'2'})
ax=df.boxplot(column='val',by='B')
ax.set_xticklabels(list('dca'))
Note that pandas can now create categorical columns. If you don't mind having all the columns present in your graph, or trimming them appropriately, you can do something like the below:
http://pandas.pydata.org/pandas-docs/stable/categorical.html
df['Category'] = df['Category'].astype(pd.CategoricalDtype(ordered=True))  # astype('category', ordered=True) no longer works in current pandas
Recent pandas also appears to allow positions to pass all the way through from frame to axes.
https://github.com/pandas-dev/pandas/blob/master/pandas/core/frame.py
https://github.com/pandas-dev/pandas/blob/master/pandas/plotting/_core.py
https://github.com/matplotlib/matplotlib/blob/master/lib/matplotlib/pyplot.py
https://github.com/matplotlib/matplotlib/blob/master/lib/matplotlib/axes/_axes.py
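For the original goal of ordering the boxes by mean salary, the categorical route can be sketched as follows. The train frame here is a made-up stand-in for the asker's data, and the expectation that the boxes follow the category order rests on pandas grouping a categorical column in category order. Treat it as a sketch for recent pandas rather than a confirmed recipe for every version.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; drop it for interactive use
import pandas as pd

# Hypothetical data: three of the job categories with two salaries each
train = pd.DataFrame({
    "Category": ["Admin Jobs", "Travel Jobs", "Accounting & Finance Jobs"] * 2,
    "Salary": [30, 20, 50, 32, 22, 60],
})

# Custom order: categories sorted by their mean salary (ascending)
order = train.groupby("Category")["Salary"].mean().sort_values().index

# Encode that order in the column itself
train["Category"] = pd.Categorical(train["Category"], categories=order, ordered=True)

ax = train.boxplot(column="Salary", by="Category")
print(list(train["Category"].cat.categories))
```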
It might sound kind of silly, but many of the plot functions let you specify the order directly. For example:
Library & dataset
import seaborn as sns
import matplotlib.pyplot as plt
df = sns.load_dataset('iris')
Specific order
p1=sns.boxplot(x='species', y='sepal_length', data=df, order=["virginica", "versicolor", "setosa"])
plt.show()
If you're not happy with the default column order in your boxplot, you can change it to a specific order by setting the column parameter in the boxplot function.
Check the two examples below:
np.random.seed(0)
df = pd.DataFrame(np.random.rand(37,4), columns=list('ABCD'))
##
plt.figure()
df.boxplot()
plt.title("default column order")
##
plt.figure()
df.boxplot(column=['C','A', 'D', 'B'])
plt.title("Specified column order")
Use the new positions= argument:
df.boxplot(column=['Data'], by=['PlotBy'], positions=df.index.values)
This can be resolved by applying a categorical order. You can decide on the ranking yourself. I'll give an example with days of week.
Provide categorical order to weekday
# List the categories in the desired order
weekday = ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']
# Build an ordered categorical dtype from that list
wDays = pd.api.types.CategoricalDtype(ordered=True, categories=weekday)
# Apply this to the specific column in the DataFrame
df['Weekday'] = df['Weekday'].astype(wDays)
# Then generate your plot (colour is whatever color you choose)
plt.figure(figsize=[15, 10])
sns.boxplot(data=df, x='Weekday', y='Y Axis Variable', color=colour)
I want to draw a plot of people who are more than 0.5 years old.
When I enter the data in Python and build the DataFrame, my code works:
import pandas as pd
data = {'age': [0.62,0.84,0.78,0.80,0.70,0.25,0.32,0.86,0.75],
'gender': [1,0,0,0,1,0,0,1,0],
'LOS': [0.11,0.37,0.23,-0.02,0.19,0.27,0.37,0.31,0.21],
'WBS': [9.42,4.40,6.80,9.30,5.30,5.90,3.10,4.10,12.07],
'HB': [22.44,10.40,15.60,15.10,11.30,10.60,12.50,10.40,14.10],
'Nothrophil': [70.43,88.40,76.50,87,82,87.59,15.40,77,88]}
df = pd.DataFrame(data, index=[0,1,2,3,4,5,6,7,8])
old = df.query('age > 0.5')
import matplotlib.pyplot as plt
plt.plot(old.age)
plt.show()
But when I use a csv file to build my DataFrame, the code doesn't work:
import pandas as pd
df = pd.read_csv(r'F:\HCSE\sample_data1.csv', sep=';')  # raw string keeps the backslashes literal
old = df.query('age > 0.5')
import matplotlib.pyplot as plt
plt.plot(old.age)
plt.show()
How can I use a csv file and do the same thing?
One more question: is it possible to draw a scatter plot with only one argument?
For example, I want to draw a scatter plot of people who are more than 0.5 years old (the y-axis is the age and the x-axis is the row number in the csv file), using different colors for different genders. How can I do it?
Thanks a lot.
but when I use a csv file to form my data-frame, the code doesn't work:
You might want to share the error message so that we can see what is going on under the hood.
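Without the error message, the cause can only be guessed. One common culprit with semicolon-separated files is a decimal comma, which makes pandas read the age column as text, so the comparison in query() fails. The sketch below assumes that situation: the inline csv text is made up, and the asker's file may fail for a different reason. Inspecting df.dtypes is a cheap first check either way.

```python
import io
import pandas as pd

# Hypothetical file content: semicolon separator and decimal commas
raw = "age;gender\n0,62;1\n0,25;0\n0,84;0\n"

# Read as in the question: 'age' is not parsed as a number
df_bad = pd.read_csv(io.StringIO(raw), sep=";")
print(df_bad.dtypes)

# Telling pandas about the decimal comma yields a float column
df_ok = pd.read_csv(io.StringIO(raw), sep=";", decimal=",")
print(df_ok.dtypes)

old = df_ok.query("age > 0.5")
print(old)
```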
Is it possible to draw a scatter plot with only one argument? As an example I want to draw a scatter plot of people who are more than 0.5 years old (Y axis is the age and the X axis is the number of data points or number of rows in csv file) and I want to use different colors for different genders. how can I do it?
Yes. Please refer to the code below.
colors = ['b' if gender == 1 else 'r' for gender in df.loc[df['age'] >0.5].gender]
df.loc[df['age'] > 0.5].reset_index().plot.scatter('index', 'age', color=colors)
You can also do this very easily using seaborn's lmplot.
import seaborn as sns
sns.lmplot(x="index", y="age", data=df.loc[df['age'] > 0.5].reset_index(), hue="gender", fit_reg=False)
Notice that you can apply colors according to gender with the hue argument. Hope this helps with the visualization.
For the scatter plot, you could simply do:
colors = ['b' if gender == 1 else 'r' for gender in old.gender]
plt.scatter(range(len(old.age)), old.age, color = colors)
plt.show()
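The list comprehension that picks the colors can also be written as a lookup with Series.map, which keeps the gender-to-color mapping in one place. A minimal sketch using the age and gender columns from the question (which gender gets which color is an arbitrary choice here):

```python
import pandas as pd

df = pd.DataFrame({
    "age": [0.62, 0.84, 0.78, 0.80, 0.70, 0.25, 0.32, 0.86, 0.75],
    "gender": [1, 0, 0, 0, 1, 0, 0, 1, 0],
})
old = df.query("age > 0.5")

# Map each gender code to a color; the pairing is arbitrary
colors = old["gender"].map({1: "b", 0: "r"})
print(colors.tolist())
```

The resulting colors.tolist() can be passed as the color argument of plt.scatter in place of the list comprehension.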
About the query, can you share your .csv file? It works with my data.