I have the following code and would like to get a graph like the one labeled 'Want', but instead I am getting one where the colors overlap. I believe pandas may have a built-in graph like the one I am looking for, but perhaps the graph I am generating could be made to do the same.
UPDATE:
I was able to get the graph working; now I need the colors not to repeat. Colors are reused across the 'State' entries (e.g. Delhi). The code has been updated to reflect the changes.
code:
data = Table.read_table('IndiaStatus.csv')#.drop('Discharged', 'Discharge Ratio (%)','Total Cases','Active','Deaths')
data2 = data.to_df()
cols = list(data2.columns)
cols.remove('State/UTs')
# now iterate over the remaining columns and create a new zscore column
for col in cols:
    col_zscore = col + '_zscore'
    data2[col_zscore] = (data2[col] - data2[col].mean())/data2[col].std(ddof=0)
print(data2)
data2.info()
data2["outlier"] = (abs(data2["Total Cases_zscore"])>1).astype(int)
print(data2)
delete_row = data2[data2["outlier"]== 1].index
data2 = data2.drop(delete_row)
print(data2)
data2["outlier2"] = ((data2["Active_zscore"])> 0.00).astype(int)
delete_row = data2[data2["outlier2"]== 1].index
data2 = data2.drop(delete_row)
'''
#Analyzing and removing outliers for Total Cases_zscore
sns.distplot(data2["Active_zscore"], kde = False, bins = 30)
g = sns.jointplot(x='Active_zscore', y='Active_zscore',
data=data2, hue='State/UTs')
plt.subplots_adjust(right=0.75)
g.ax_joint.legend(bbox_to_anchor=(1.25,1), loc='upper left', borderaxespad=0)
'''
print(data2)
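# numeric_only=True skips the text column 'State/UTs', which has no mean or std
print(data2.mean(numeric_only=True))
print(data2.std(numeric_only=True))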
#data2.insert(1, column = "Level", value = np.where(data2["Active"] > 9700, "Severe", data["Active"] < 9700 & data["Active"] > 4850, 'Less_Severe','Not_Severe'))
col = 'Active'
conditions = [ data2['Active']<=600, data2['Active']<= 1200, data2['Active'] >1200 ]
choices = [ 'Not_Severe','Less_Severe',"Severe" ]
data2["Level"] = np.select(conditions, choices, default=np.nan)
print(data2)
ax=data2.pivot_table(index='Level', columns = 'State/UTs', values = 'Total Cases').plot(kind='bar',stacked=True,figsize=(15,15),fontsize=25)
ax.legend(fontsize=25)
#set ylim
#plt.ylim(-1, 20,5)
#plt.xlim(-1,4,8)
#grid on
plt.grid()
# set y=0
ax.axhline(0, color='black', lw=1)
#change size of legend
ax.legend(fontsize=8,loc=(1.0,0.2))
#hiding upper and right axis layout
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
#changing the thickness
ax.spines['bottom'].set_linewidth(3)
ax.spines['left'].set_linewidth(3)
#setlabels
ax.set_xlabel('Level',fontsize=20,color='r')
ax.set_ylabel('Total Cases',fontsize=20,color='r')
#rotation
plt.xticks(rotation=0)
Want:
Actual Output:
UPDATE:
Make the graph background white using
sns.set_style("whitegrid")
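The repeated colors most likely come from the default color cycle, which has only ten entries and wraps around once there are more columns than that. A minimal sketch of one way around it, assuming a hypothetical pivoted table with 25 made-up state columns (not the real IndiaStatus.csv data): sample one color per column from a continuous colormap and pass the list to plot().

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
from matplotlib.colors import to_hex
import numpy as np
import pandas as pd

# Hypothetical stand-in for the pivot table: one row per Level, one column per state
rng = np.random.default_rng(0)
states = [f"State{i}" for i in range(25)]
pivot = pd.DataFrame(rng.integers(1, 100, size=(3, len(states))),
                     index=["Less_Severe", "Not_Severe", "Severe"],
                     columns=states)

# One distinct hue per column; endpoint=False avoids sampling red at both ends of 'hsv'
positions = np.linspace(0, 1, pivot.shape[1], endpoint=False)
colors = [to_hex(c) for c in plt.cm.hsv(positions)]

ax = pivot.plot(kind="bar", stacked=True, color=colors, figsize=(10, 6))
ax.legend(fontsize=8, loc=(1.0, 0.2))
```

Neighbouring hues can still be hard to tell apart with this many categories, so grouping small states into an "Other" column may read better than adding more colors.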
I am a bit new to Python and am playing with a dummy dataset to get some practice with data manipulation. Below is the code that generates the dummy data:
d = {
'SeniorCitizen': [0,1,0,0,0,0,0,1,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0] ,
'CollegeDegree': [0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,1,1,1,1] ,
'Married': [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1] ,
'FulltimeJob': [1,1,1,1,1,0,0,0,1,1,1,1,1,1,1,1,1,0,0,1,1,0,0,0,1] ,
'DistancefromBranch': [7,9,14,20,21,12,22,25,9,9,9,12,13,14,16,25,27,4,14,14,20,19,15,23,2] ,
'ReversedPayment': [0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,1,0,1,0] }
CarWash = pd.DataFrame(data = d)
categoricals = ['SeniorCitizen','CollegeDegree','Married','FulltimeJob','ReversedPayment']
numerical = ['DistancefromBranch']
CarWash[categoricals] = CarWash[categoricals].astype('category')
I am basically struggling with a couple of things:
#1. A stacked barplot with absolute values (like the excel example below)
#2. A stacked barplot with percentage values (like the excel example below)
Below are my target visualizations for # 1 and # 2 using countplot().
#1
#2
For # 1, instead of a stacked barplot, countplot() only lets me make a clustered barplot, like below, and the annotation snippet feels more like a workaround than elegant Python.
# Looping through each categorical column and viewing target variable distribution (ReversedPayment) by value
figure, axes = plt.subplots(2,2,figsize = (10,10))
for i,ax in zip(categoricals[:-1],axes.flatten()):
    sns.countplot(x= i, hue = 'ReversedPayment', data = CarWash, ax = ax)
    for p in ax.patches:
        height = np.nan_to_num(p.get_height()) # gets the height of each patch/bar
        adjust = np.nan_to_num(p.get_width())/2 # a calculation for adjusting the data label later
        label_xy = (np.nan_to_num(p.get_x()) + adjust,np.nan_to_num(p.get_height()) + adjust) #x,y coordinates where we want to put our data label
        ax.annotate(height,label_xy) # final annotation
For # 2, I tried creating a new data frame housing % values but that felt tedious and error-prone.
I feel an option like stacked = True, proportion = True, axis = 1, annotate = True could have been so useful for countplot() to have.
Are there any other libraries that would be more straightforward and less code-intensive? Any comments or suggestions are welcome.
In this case, I think plotly.express may be more intuitive for you.
import plotly.express as px
df_temp = CarWash.groupby(['SeniorCitizen', 'ReversedPayment'])['DistancefromBranch'].count().reset_index().rename({'DistancefromBranch':'count'}, axis=1)
fig = px.bar(df_temp, x="SeniorCitizen", y="count", color="ReversedPayment", title="SeniorCitizen", text='count')
fig.update_traces(textposition='inside')
fig.show()
Basically, if you want more flexibility to adjust your charts, it is hard to avoid writing a lot of code.
I also tried using matplotlib and pandas to create a stacked bar chart for percentages. If you are interested, you can try it.
sns.set()
fig, ax = plt.subplots(nrows=2, ncols=2, figsize=[12,8], dpi=100)
# Convert the axes matrix to a 1-d array
axes = ax.flatten()
for i, col in enumerate(['SeniorCitizen', 'CollegeDegree', 'Married', 'FulltimeJob']):
    # Count the ReversedPayment values within each level of `col`
    df_temp = (CarWash.groupby(col)['ReversedPayment']
               .value_counts()
               .unstack(1).fillna(0)
               .rename({0:'No', 1:'Yes'})
               .rename({0:'No', 1:'Yes'}, axis=1))
    # Divide each ReversedPayment column by its total; for bars that each sum
    # to 100%, use df_temp.div(df_temp.sum(axis=1), axis=0) instead
    df_temp = df_temp / df_temp.sum(axis=0)
    df_temp.plot.bar(stacked=True, ax=axes[i])
    axes[i].set_title(col, y=1.03, fontsize=16)
    rects = axes[i].patches
    # Patches are drawn column by column, so flatten the transposed values
    labels = df_temp.values.T.flatten()
    for rect, label in zip(rects, labels):
        if label == 0: continue
        axes[i].text(rect.get_x() + rect.get_width() / 2, rect.get_y() + rect.get_height() / 3, '{:.2%}'.format(label),
                     ha='center', va='bottom', color='white', fontsize=12)
    axes[i].legend(title='Reversed\nPayment', bbox_to_anchor=(1.05, 1), loc='upper left', title_fontsize = 10, fontsize=10)
    axes[i].tick_params(rotation=0)
plt.tight_layout()
plt.show()
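As a shorter alternative for both target charts, here is a sketch that assumes pd.crosstab for the counts and row-wise shares, and Axes.bar_label (available from matplotlib 3.4) for the annotations. Only two of the dummy columns are reproduced, and the names counts and shares are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Two columns copied from the dummy data in the question
df = pd.DataFrame({
    'SeniorCitizen':   [0,1,0,0,0,0,0,1,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0],
    'ReversedPayment': [0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,1,0,1,0],
})

# Absolute counts, and shares normalised so that every bar sums to 1
counts = pd.crosstab(df['SeniorCitizen'], df['ReversedPayment'])
shares = pd.crosstab(df['SeniorCitizen'], df['ReversedPayment'], normalize='index')

fig, (ax_abs, ax_pct) = plt.subplots(1, 2, figsize=(10, 4))
counts.plot.bar(stacked=True, ax=ax_abs, rot=0, title='Counts')
shares.plot.bar(stacked=True, ax=ax_pct, rot=0, title='Share within each bar')

# bar_label places one label per bar segment; 'center' puts it inside the segment
for container in ax_abs.containers:
    ax_abs.bar_label(container, label_type='center')
for container in ax_pct.containers:
    labels = [f'{bar.get_height():.0%}' for bar in container]
    ax_pct.bar_label(container, labels=labels, label_type='center')
fig.tight_layout()
```

Looping over the other categorical columns and a grid of axes, as in the code above, extends this to all four charts.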
For context, what I'm trying to do is make an emission abatement chart in which the abated emissions are subtracted from the baseline. Mathematically, this is the same as adding the abatement to the residual emission line:
Residual = Baseline - Abated
The expected results should look something like this:
Desired structure of stacked area chart:
I've currently got the stacked area chart to look like this:
As you can see, the stacking in the stacked area chart starts at zero, whereas I'm trying to get the stack to either be added on top of the residual (red) line or be subtracted from the baseline (black).
I would do this in Excel by defining a blank area as the first stacked item, equal to the residual line, so that the stacking occurs on top of that. However, I'm not sure whether there is a Pythonic way to do this in Plotly while maintaining the structure and interactivity of the chart.
The shaping of the pandas DataFrames is pretty simple: a randomly generated series of abatement values for each of the subcategories I've set up, which are then grouped to form the baseline and residual forecasts:
scenario = 'High'
# The baseline data as a line
baseline_line = baselines.loc[baselines['Scenario']==scenario].groupby(['Year']).sum()
# The abatement and residual data
df2 = deepcopy(abatement).drop(columns=['tCO2e'])
df2['Baseline'] = baselines['tCO2e']
df2['Abatement'] = abatement['tCO2e']
df2 = df2.fillna(0)
df2['Residual'] = df2['Baseline'] - df2['Abatement']
df2 = df2.loc[abatement['Scenario']==scenario]
display(df2)
# The residual forecast as a line
emissions_lines = df2.loc[df2['Scenario']==scenario].groupby(['Year']).sum()
The charting is pretty simple as well, using the plotly express functionality:
# Just plotting
fig = px.area(df2,
x = 'Year',
y = 'Abatement',
color = 'Site',
line_group = 'Fuel_type'
)
fig2 = px.line(emissions_lines,
x = emissions_lines.index,
y = 'Baseline',
color_discrete_sequence = ['black'])
fig3 = px.line(emissions_lines,
x = emissions_lines.index,
y = 'Residual',
color_discrete_sequence = ['red'])
fig.add_trace(
fig2.data[0],
)
fig.add_trace(
fig3.data[0],
)
fig.show()
To summarise, I wish to have the Plotly stacked area chart be 'elevated' so that it fits between the residual and baseline forecasts.
NOTE: I've used the term 'baseline' with two meanings here: one generic to stacked area charts (in the title) and one specific to my example. The first usage, in the title, means the series from which the stacked area chart starts. Currently, this series is just the x-axis, or zero; I'd like to customise it so that I can define a series (in this example, the red residual line) that the stacking starts from.
The second usage of the term 'baseline' refers to the 'baseline forecast', or BAU.
I think I've found a workaround. It is not ideal, but it is similar to the approach I have taken in Excel. I've added the 'residual' emissions in the same structure as the categories and concatenated them at the start of the DataFrame, so that they bump everything else up into the space between the residual and baseline forecasts.
Concatenation step:
# Build a 'Base' layer equal to the residual line so the stack starts there
df2b = deepcopy(emissions_lines)
df2b['Fuel_type'] = "Base"
df2b['Site'] = "Base"
df2b['Abatement'] = df2b['Residual']
df2c = pd.concat([df2b.reset_index(),df2],ignore_index=True)
Rejigged plotting step, with some recolouring/reformatting of the chart:
# Just plotting
fig = px.area(df2c,
x = 'Year',
y = 'Abatement',
color = 'Site',
line_group = 'Fuel_type'
)
# Making the baseline invisible and ignorable
fig.data[0].line.color='rgba(255, 255, 255,1)'
fig.data[0].showlegend = False
fig2 = px.line(emissions_lines,
x = emissions_lines.index,
y = 'Baseline',
color_discrete_sequence = ['black'])
fig3 = px.line(emissions_lines,
x = emissions_lines.index,
y = 'Residual',
color_discrete_sequence = ['red'])
fig.add_trace(
fig2.data[0],
)
fig.add_trace(
fig3.data[0],
)
fig.show()
Outcome:
I'm going to leave this unresolved, as this is not what I originally intended. It currently 'works', but it is not ideal and causes some issues when interacting with the legend of the Plotly figure.
I'm using Matplotlib and Pandas to plot x by y, grouped by z. So I have the following:
x = df['ColumnA']
y = df['ColumnB']
fig, ax = plt.subplots(figsize=(20, 10))
# Group by the column name itself (not a one-element list) so each key is a plain label
for key, grp in df.groupby('ColumnC'):
    plt.plot(grp['ColumnA'], grp['ColumnB'].rolling(window=30).mean(), label=key)
I also want to highlight 2 specific values from the total amount of values that will be plotted:
ax.legend(('Value1', 'Value2'))
plt.show()
This works fine. I just have the 2 values in my legend, but all values are actually plotted. What I actually want is to be able to specify the colors for the 2 values above (i.e. red and blue) and have all the other values from ColumnC show on the plot in a single color. The objective is to highlight how Value1 and Value2 are performing compared to everything else.
First, change the colours of the lines of interest.
lines_to_highlight = {
    'Value1': 'red',
    'Value2': 'blue'
}
DEFAULT_COLOR = 'gray'
legend_entries = [] # Lines to show in legend
for line in ax.lines:
    if line.get_label() in lines_to_highlight:
        line.set_color(lines_to_highlight[line.get_label()])
        legend_entries.append(line)
    else:
        line.set_color(DEFAULT_COLOR)
Second, create your legend.
ax.legend(
    legend_entries,
    [entry.get_label() for entry in legend_entries]
)
Notes:
ax.legend(('Value1', 'Value2')) doesn't do what you expect. It simply resets the labels for the first two lines you plotted. It doesn't restrict the legend to lines you created with those labels. (The matplotlib docs themselves say that this mistake is easy to make.)
You must call ax.legend(...) after setting the line colours. Otherwise, the colours in the legend might not match the ones in the plot.
Example
import matplotlib.pyplot as plt
ax = plt.subplot(111)
ax.plot([1, 1, 1], label='one')
ax.plot([2, 2, 2], label='two')
ax.plot([3, 3, 3], label='three')
lines_to_highlight = {
    'one': 'red',
    'three': 'blue'
}
DEFAULT_COLOR = 'gray'
legend_entries = [] # Lines to show in legend
for line in ax.lines:
    if line.get_label() in lines_to_highlight:
        line.set_color(lines_to_highlight[line.get_label()])
        legend_entries.append(line)
    else:
        line.set_color(DEFAULT_COLOR)
ax.legend(
    legend_entries, [entry.get_label() for entry in legend_entries]
)
plt.show()
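An alternative sketch sets the color and the legend entry while plotting, rather than recoloring the lines afterwards. It assumes a made-up DataFrame with the question's column names, and relies on matplotlib skipping labels that start with an underscore when it builds the automatic legend.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical data: four groups of 50 points each
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'ColumnA': np.tile(np.arange(50), 4),
    'ColumnB': rng.normal(size=200).cumsum(),
    'ColumnC': np.repeat(['Value1', 'Value2', 'Value3', 'Value4'], 50),
})

highlight = {'Value1': 'red', 'Value2': 'blue'}

fig, ax = plt.subplots(figsize=(10, 5))
for key, grp in df.groupby('ColumnC'):
    ax.plot(grp['ColumnA'], grp['ColumnB'].rolling(window=5).mean(),
            color=highlight.get(key, 'gray'),
            # Labels starting with an underscore are left out of the legend
            label=key if key in highlight else '_nolegend_',
            # Draw highlighted lines above the gray ones
            zorder=3 if key in highlight else 2)
ax.legend()
legend_labels = [text.get_text() for text in ax.get_legend().get_texts()]
```

With this approach the legend never needs to be rebuilt, because the gray lines are never registered as legend entries in the first place.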