Categorical axis in Holoviews Curve leads to error - python

I have this data
df = pd.read_csv('https://docs.google.com/spreadsheets/d/e/2PACX-1vQxY-rTQNBbMWtRI2p8m_gj0TvmkHt3_CqNkhRILq6s5xL3mGX-IXtZt9pej-ae3ZNN1EnAP_iC0unx/pub?output=csv')
df.head()
Id Year Period Cumulative YearPeriod YearPeriodStr
0 16432976 1 1 6.0 11 Period 11
1 16432976 1 2 7.0 12 Period 12
2 16432976 1 0 0.0 10 Period 10
3 16879454 1 1 6.0 11 Period 11
4 16879454 1 2 6.0 12 Period 12
What I want to achieve is a combination of two ordinal variables on the x-axis. The first level should be Year, the second Period. The following code is what I thought would lead to the graph I want:
%opts Curve [width=600 height=400 show_grid=True default_tools=['save','hover',] show_legend=False] (color='indianred', alpha=0.3, line_width=0.2, )
ds = hv.Dataset(df).sort(['Year','Period'])
ds.to(hv.Curve, ['Year','Period'],'Cumulative', 'Id').overlay()
But this puts Period on the y-axis. The following code gets me closer to what I want:
%opts Curve [width=600 height=400 show_grid=True default_tools=['save','hover',] show_legend=False] (color='indianred', alpha=0.3, line_width=0.2, )
# Numerical column
df['YearPeriod'] = (df.Year * 10) + (df.Period)
ds = hv.Dataset(df).sort(['Year','Period'])
ds.to(hv.Curve, 'YearPeriod','Cumulative', 'Id').overlay()
But the x-axis now is a continuous variable, while what I want is the second level, Period, to be ordinal.
The following code gives more or less what I want, but fails when the number of rows exceeds 120.
%opts Curve [width=600 height=400 show_grid=True default_tools=['save','hover',] show_legend=False] (color='indianred', alpha=0.3, line_width=0.2, )
# String
df['YearPeriodStr'] = 'Period'+ df['Year'].astype(str) + df['Period'].astype(str)
ds = hv.Dataset(df.head(120)).sort(['Year','Period'])
ds.to(hv.Curve, 'YearPeriodStr','Cumulative', 'Id').overlay()
What am I missing?


How to create a sine curve of positive part only between two integer values

I need to generate only the positive half of a sine curve between two values. My variable, say monthly-averaged RH, has 12 data points per year (a time series) and varies sinusoidally between 50 and 70. The first and last data points should both be 50.
How can I generate this curve so that I get the values of all the intermediate data points? I am trying to use numpy/scipy for this.
Best,
Debayan
This is basic trig.
import math
for i in range(12):
    print(i, 50 + 20 * math.sin(math.pi * i / 12))
Output:
0 50.0
1 55.17638090205041
2 60.0
3 64.14213562373095
4 67.32050807568876
5 69.31851652578136
6 70.0
7 69.31851652578136
8 67.32050807568878
9 64.14213562373095
10 60.0
11 55.17638090205042
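Note that with the divisor 12 used above, the last point (i = 11) is about 55.18 rather than 50. If the first and last of the 12 points should both be 50, as the question states, one option is to divide by n - 1 instead. This is a minimal sketch; `half_sine` and its parameter names are hypothetical, chosen only for illustration:

```python
import math

def half_sine(n=12, low=50.0, high=70.0):
    """Return n points on the positive half of a sine wave, starting and ending at low."""
    amplitude = high - low
    # Dividing by n - 1 maps i = 0 to angle 0 and i = n - 1 to angle pi,
    # so both endpoints sit at `low`.
    return [low + amplitude * math.sin(math.pi * i / (n - 1)) for i in range(n)]

values = half_sine()
for i, v in enumerate(values):
    print(i, round(v, 2))
```

With an even number of points the peak falls between the two middle samples, so the largest sampled value is slightly below 70.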

Show Count and percentage labels for grouped bar chart python

I would like to add count and percentage labels to a grouped bar chart, but I haven't been able to figure it out.
I've seen examples for count or percentage for single bars, but not for grouped bars.
The data looks something like this (not the real numbers):
   age_group  Mis  surv  unk  death  total  surv_pct  death_pct
0        0-9    1     2    0      3      6     100.0        0.0
1      10-19    2     1    0      1      4      99.9        0.0
2      20-29    0     3    0      1      4      99.9        0.0
3      30-39    0     7    1      2     10     100.0        0.0
4      40-49    0     5    0      1      6      99.7        0.3
5      50-59    0     6    0      4     10      99.3        0.3
6      60-69    0     7    1      4     12      98.0        2.0
7      70-79    1     8    2      5     16      92.0        8.0
8        80+    0    10    0      7     17      81.0       19.0
And the chart looks something like this:
I created the chart with this code:
ax = df.plot(y=['death', 'surv'],
             kind='barh',
             figsize=(20,9),
             rot=0,
             title= '\n\n surv and deaths by age group')
ax.legend(['Deaths', 'Survivals']);
ax.set_xlabel('\nCount');
ax.set_ylabel('Age Group\n');
How could I add count and percentage labels to the grouped bars? I would like it to look something like this chart
Since nobody else has suggested anything, here is one way to approach it with your dataframe structure.
from matplotlib import pyplot as plt
import pandas as pd
# read the whitespace-separated table (sep=r"\s+" replaces the deprecated delim_whitespace=True)
df = pd.read_csv("test.txt", sep=r"\s+")
cat = ['death', 'surv']
ax = df.plot(y=cat,
             kind='barh',
             figsize=(20, 9),
             rot=0,
             title='\n\n surv and deaths by age group')
# making space for the annotation
xmin, xmax = ax.get_xlim()
ax.set_xlim(xmin, 1.05 * xmax)
# connecting bar series with df columns
for cont, col in zip(ax.containers, cat):
    # connecting each bar of the series with its absolute and relative values
    for rect, vals, perc in zip(cont.patches, df[col], df[col + "_pct"]):
        # annotating each bar
        ax.annotate(f"{vals} ({perc:.1f}%)",
                    (rect.get_width(), rect.get_y() + rect.get_height() / 2.),
                    ha='left', va='center', fontsize=10, color='black',
                    xytext=(3, 0), textcoords='offset points')
ax.set_yticklabels(df.age_group)
ax.set_xlabel('\nCount')
ax.set_ylabel('Age Group\n')
ax.legend(['Deaths', 'Survivals'], loc="lower right")
plt.show()
Sample output:
If the percentages per category add up to 100, one could also calculate them on the fly; the percentage columns would then not need to follow exactly the same naming pattern. Another limitation is that the font size of the annotation, the axis scaling that makes room for the label on the largest bar, and the distance between bar and annotation are not interactive and may need fine-tuning.
However, I am not fond of this mixing of pandas and matplotlib plotting functions. I had cases where the axis definition by pandas interfered with matplotlib, and datetime objects ... well, let's not talk about that.
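The on-the-fly percentage calculation mentioned above could look roughly like this. It is a minimal sketch that assumes each bar's percentage is its share of death + surv within its age group; the three-row frame is made-up illustration data, not the question's numbers:

```python
import pandas as pd

# made-up illustration data, not the question's real numbers
df = pd.DataFrame({
    "age_group": ["0-9", "10-19", "20-29"],
    "death": [3, 1, 1],
    "surv": [2, 1, 3],
})
cat = ["death", "surv"]

# share of each category within its row, in percent
pct = df[cat].div(df[cat].sum(axis=1), axis=0) * 100

# label text for every bar, one list per plotted column
labels = {col: [f"{n} ({p:.1f}%)" for n, p in zip(df[col], pct[col])] for col in cat}
print(labels["death"])
print(labels["surv"])
```

Each list can then be zipped with `cont.patches` in the annotation loop, in place of the precomputed `df[col + "_pct"]` column.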

Confidence interval does not display on barplot

I want to display the confidence interval for each bar in my plot, but the intervals do not show. I have two dataframes, and I am plotting the average of the NUMBER_GIRLS column from each of them.
For example, consider the two dataframes (shown below).
schools_north_df
ID NAME NUMBER_GIRLS
----------------------------
1 SCHOOL_1 32
2 SCHOOL_2 12
3 SCHOOL_3 26
schools_south_df
ID NAME NUMBER_GIRLS
----------------------------
1 SCHOOL_1 56
2 SCHOOL_2 33
3 SCHOOL_3 34
Therefore, I have used this code (shown below) to plot my barplot with the confidence intervals showing for each bar - but when plotting it, the confidence interval does not show up.
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
objects = ('North', 'South')
y_pos = np.arange(len(objects))
avg_girls = [schools_north_df['NUMBER_GIRLS'].mean(), schools_south_df['NUMBER_GIRLS'].mean()]
sns.barplot(x=y_pos, y=avg_girls, ci=95)
plt.xticks(y_pos, objects)
plt.title('Average Number of Girls')
plt.show()
Could anyone indicate what is wrong with my code? I really need the confidence interval to display on my barplot.
Thank you very much!
If you want seaborn to display the confidence intervals, you need to let seaborn aggregate the data by itself (that is to say, provide the raw data instead of calculating the mean yourself).
I would create a new dataframe with an extra column (region) to indicate whether the data are from the "north" or the "south" and then request seaborn to plot NUMBER_GIRLS vs region:
df = pd.concat([schools_north_df.assign(region='North'), schools_south_df.assign(region='South')])
output:
ID NAME NUMBER_GIRLS region
0 1 SCHOOL_1 32 North
1 2 SCHOOL_2 12 North
2 3 SCHOOL_3 26 North
0 1 SCHOOL_1 56 South
1 2 SCHOOL_2 33 South
2 3 SCHOOL_3 34 South
plot:
sns.barplot(data=df, x='region', y='NUMBER_GIRLS', ci=95)
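To see why pre-computed means leave seaborn nothing to draw, it helps to compute an interval by hand. The sketch below uses a simple percentile bootstrap of the mean in pure Python; it illustrates the idea only and is not seaborn's actual implementation, and `bootstrap_ci` is a hypothetical helper name:

```python
import random
import statistics

def bootstrap_ci(values, n_boot=1000, level=95, seed=0):
    """Percentile bootstrap interval for the mean of `values`."""
    rng = random.Random(seed)
    # resample with replacement and record the mean of every resample
    means = sorted(
        statistics.mean(rng.choices(values, k=len(values)))
        for _ in range(n_boot)
    )
    tail = (100 - level) / 200  # fraction cut off on each side
    low = means[int(tail * n_boot)]
    high = means[int((1 - tail) * n_boot) - 1]
    return low, high

north = [32, 12, 26]  # NUMBER_GIRLS from schools_north_df

# raw observations: the resampled means vary, so the interval has a width
raw_low, raw_high = bootstrap_ci(north)

# a single pre-computed mean: every resample is identical, so the interval collapses
agg_low, agg_high = bootstrap_ci([statistics.mean(north)])

print(raw_low, raw_high)
print(agg_low, agg_high)
```

With one value per bar there is no spread to resample, which is why passing the means yourself produces bars without visible error bars.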

Why don't all of the factor variables appear in the legend?

I'm pretty new to plotting with matplotlib and I'm having a few problems with legends. I have this data set:
Wavelength CD Time
0 250.0 0.00000 1
1 249.8 -0.04278 1
2 249.6 -0.03834 1
3 249.4 -0.02384 1
4 249.2 -0.04817 1
... ... ... ...
3760 200.8 0.99883 15
3761 200.6 0.50277 15
3762 200.4 -0.19228 15
3763 200.2 0.81317 15
3764 200.0 0.90226 15
[3765 rows x 3 columns]
Column types:
Wavelength float64
CD float64
Time int64
dtype: object
Why are not all the values shown in the legend when I plot with Time as the categorical variable?
x = df1['Wavelength']
y = df1['CD']
z = df1['Time']
sns.lineplot(x=x, y=y, hue=z)
plt.tight_layout()
plt.show()
But I can plot using pandas' built-in matplotlib wrapper with a colorbar like this:
df1.plot.scatter('Wavelength', 'CD', c='Time', cmap='RdYlBu')
What's the best way of choosing between discrete and continuous legends using matplotlib/seaborn?
Many thanks!

Stacked Bar Plot By Group Count On Pandas Python

My CSV data looks something like the sample below. I want to create a stacked bar plot with pandas/Python where each bar shows the male and female portions in two colors, with the total count of males and females taking the drug shown on top of the bar. For instance, 7 people are aged 20; 6 of them are male and 1 is female, so the bar for age 20 should have 7 on top and show the 6:1 split in two colors. I managed to group the people by age, count them, and plot the result, but I also want each bar to show the two genders in different colors. Any help will be appreciated. Thank you.
Drug_ID,Age,Gender
FF90E5F780133BBCAEE9BA0E4654C5CA,15,F
2AB2529352E6A1D0BA91E167C5191231,17,M
6E4B2C80EA83C5B51675BC077A182E06,19,M
8AEB19A68DE4C301154973E0C87C774D,19,M
FA533E7DD1FCAB83822D4A2FC514AEE7,20,F
208F39485A8FBA7214CA004B53CC29B7,20,M
3172B95E8A5732D2EDB089A354787612,20,M
38FBAE52AAB04E56AB3A35C4AFBF3813,20,M
62D8A33130574C984EAF3A2E80C243A6,20,M
646AF35E192885FE1336649BA3735FC4,20,M
778D2B1354F30ED3D3BDD7B9437CF670,20,M
891312933FE5FB25703B3E958EF943E3,21,F
5CC63DFC5EF399F65CB9BC583A770DFB,21,M
97C9ED5750EC6FF618A059085F0BF4A6,21,M
C6F03C5E3CC37249E0308A09292F5A58,21,M
0713C427BDC3A90F291AF49599987F38,23,F
518EBD35FCDAB7C744334F993D8CEB28,23,F
A3FC185C4531AFF1E8B7B9F5985028CB,23,F
1837406921314CB182FB0C7BC5565204,23,M
CA0E6274BD39E8DE5C823F4E6F234252,23,M
E2F64030BB011C11541EDC192BAED09A,23,M
EF8F3D028C5759858AB7574864833015,23,M
5DCD427F26E05CC1C3F565BB05EAE10B,24,F
8646ED503722C3C6C6B44208EF1A5202,24,F
F9F45112C472282778E1F437F54B0B70,24,F
0CF4F20B89303CB4C03CF3BD3B27CAF0,24,M
63727039DFF77A46995DA8EDBC4E3497,24,M
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
data = pd.read_csv('/home/MedResearch/Desktop/drug_ids_age_gender.csv')
data.values
df = pd.DataFrame(data)
df2 = pd.merge(df1,df, left_index = True, right_index = True)
temp1 = df2.groupby('Age').Age.count()
df3 = pd.merge(df1,df, left_index = True, right_index = True)
temp2 = df3.groupby('Gender').Age.count()
ax = temp1.plot(kind='bar',stacked = False, colormap = 'Paired')
for p in ax.patches:
    ax.annotate(np.round(p.get_height(),decimals=0), (p.get_x()+p.get_width()/2., p.get_height()), ha='center', va='center', xytext=(2, 10), textcoords='offset points')
plt.ylabel('Numbers Of Patients Taking the drug')
plt.title('Age Distribution Of Patients Based Upon drug Intake')
plt.show()
Got something like this as a result:
This question comes up often, so I decided to write a step-by-step explanation. Note that I'm not a pandas guru, so some things could probably be optimized.
I started by getting a list of ages that I will use for my x-axis:
cvsdata = '''Drug_ID,Age,Gender
FF90E5F780133BBCAEE9BA0E4654C5CA,15,F
2AB2529352E6A1D0BA91E167C5191231,17,M
6E4B2C80EA83C5B51675BC077A182E06,19,M
8AEB19A68DE4C301154973E0C87C774D,19,M
FA533E7DD1FCAB83822D4A2FC514AEE7,20,F
208F39485A8FBA7214CA004B53CC29B7,20,M
3172B95E8A5732D2EDB089A354787612,20,M
38FBAE52AAB04E56AB3A35C4AFBF3813,20,M
62D8A33130574C984EAF3A2E80C243A6,20,M
646AF35E192885FE1336649BA3735FC4,20,M
778D2B1354F30ED3D3BDD7B9437CF670,20,M
891312933FE5FB25703B3E958EF943E3,21,F
5CC63DFC5EF399F65CB9BC583A770DFB,21,M
97C9ED5750EC6FF618A059085F0BF4A6,21,M
C6F03C5E3CC37249E0308A09292F5A58,21,M
0713C427BDC3A90F291AF49599987F38,23,F
518EBD35FCDAB7C744334F993D8CEB28,23,F
A3FC185C4531AFF1E8B7B9F5985028CB,23,F
1837406921314CB182FB0C7BC5565204,23,M
CA0E6274BD39E8DE5C823F4E6F234252,23,M
E2F64030BB011C11541EDC192BAED09A,23,M
EF8F3D028C5759858AB7574864833015,23,M
5DCD427F26E05CC1C3F565BB05EAE10B,24,F
8646ED503722C3C6C6B44208EF1A5202,24,F
F9F45112C472282778E1F437F54B0B70,24,F
0CF4F20B89303CB4C03CF3BD3B27CAF0,24,M
63727039DFF77A46995DA8EDBC4E3497,24,M'''
import io
import pandas as pd
df = pd.read_csv(io.StringIO(cvsdata))
ages = df.Age.unique()
array([15, 17, 19, 20, 21, 23, 24])
Then I generated a grouped dataframe with the counts of each M and F per age:
counts = df.groupby(['Age','Gender']).count()
print(counts)
            Drug_ID
Age Gender
15  F             1
17  M             1
19  M             2
20  F             1
    M             6
21  F             1
    M             3
23  F             3
    M             4
24  F             3
    M             2
Using that, I can easily calculate the total number of individuals per age group:
# sum(level=0) was removed in pandas 2.0; group on the first index level instead
totals = counts.groupby(level=0).sum()
print(totals)
     Drug_ID
Age
15         1
17         1
19         2
20         7
21         4
23         7
24         5
To prepare for plotting, I'll transform my counts dataframe to separate each sex by columns, instead of by index. Here I also drop that 'Drug_ID' column name because the unstack() operation creates a MultiIndex and it's much easier to manipulate the dataframe without that MultiIndex.
counts = counts.unstack(level=1)
counts.columns = counts.columns.droplevel(level=0)
print(counts)
Gender    F    M
Age
15      1.0  NaN
17      NaN  1.0
19      NaN  2.0
20      1.0  6.0
21      1.0  3.0
23      3.0  4.0
24      3.0  2.0
Looks pretty good. I'll just do a final refinement and replace the NaNs with 0.
counts = counts.fillna(0)
print(counts)
Gender    F    M
Age
15      1.0  0.0
17      0.0  1.0
19      0.0  2.0
20      1.0  6.0
21      1.0  3.0
23      3.0  4.0
24      3.0  2.0
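The counting, unstacking and NaN-filling steps above can also be collapsed into one expression. This is a sketch of one possible shortcut, not the only way to do it; the short Drug_ID values are made up, while the ages and genders follow a subset of the question's data:

```python
import io
import pandas as pd

# shortened, made-up IDs; ages and genders follow a subset of the question's data
csv_text = """Drug_ID,Age,Gender
A1,19,M
A2,19,M
B1,20,F
B2,20,M
B3,20,M
B4,20,M
B5,20,M
B6,20,M
B7,20,M
C1,21,F
C2,21,M
C3,21,M
C4,21,M
"""
df = pd.read_csv(io.StringIO(csv_text))

# size() counts rows per (Age, Gender); unstack() moves Gender into columns,
# and fill_value=0 fills missing combinations without going through NaN
counts = df.groupby(["Age", "Gender"]).size().unstack(fill_value=0)

# total per age is the row sum of the F and M columns
totals = counts.sum(axis=1)

print(counts)
print(totals)
```

Because the missing combinations are filled during the unstack, the counts stay integers, and `totals` comes out as a Series, so no flattening is needed when looping over it.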
With this dataframe, it is trivial to plot the stacked bars:
plt.bar(ages, counts['M'], bottom=None, color='blue', label='M')
plt.bar(ages, counts['F'], bottom=counts['M'], color='pink', label='F')
plt.legend()
plt.xlabel('Ages')
plt.ylabel('Count')
To plot the total counts on top of the bars, we'll use the annotate() function. We cannot do it in a single pass; instead we loop over the ages and the totals. (For simplicity, I take totals.values and flatten() it: totals is a one-column DataFrame, so .values is a 2-D array.)
for age, tot in zip(ages, totals.values.flatten()):
    plt.annotate('N={:d}'.format(tot), xy=(age, tot), xytext=(0,5), textcoords='offset points', ha='center', va='bottom')
The coordinates for the annotations are (age, tot) because, in Matplotlib 2.0 and later, bars are centered on x by default (align='center'), while tot is of course the full height of the bar. In older versions, where a bar ran from x to x+width with width=0.8 by default, the center was at x+0.4, so use (age+0.4, tot) there. To offset the text a bit, I shift it by a few (5) points in the y direction. Adjust to your liking.
Check out the documentation for bar() to adjust the parameters of the bar plots.
Check out the documentation for annotate() to customize your annotations
