I'm using the below code to get Segment and Year in x-axis and Final_Sales in y-axis but it is throwing me an error.
CODE
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
%matplotlib inline
order = pd.read_excel("Sample.xls", sheet_name = "Orders")
order["Year"] = pd.DatetimeIndex(order["Order Date"]).year
result = order.groupby(["Year", "Segment"]).agg(Final_Sales=("Sales", sum)).reset_index()
bar = plt.bar(x = result["Segment","Year"], height = result["Final_Sales"])
ERROR
Can someone help me correct my code so that I get the output below?
Required Output
Try adding another pair of brackets: result[["Segment","Year"]].
What you wrote tries to retrieve a single column named "Segment","Year",
but what you actually want is to retrieve a list of columns: ["Segment","Year"].
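A minimal sketch of the difference, using a small made-up frame in place of the question's result (the values are invented for illustration only):

```python
import pandas as pd

# Hypothetical stand-in for the grouped `result` frame from the question.
result = pd.DataFrame({
    "Year": [2019, 2019, 2020, 2020],
    "Segment": ["Consumer", "Corporate", "Consumer", "Corporate"],
    "Final_Sales": [100.0, 80.0, 120.0, 95.0],
})

# Single brackets with two names: pandas looks for ONE column whose
# label is the tuple ("Segment", "Year"), which does not exist.
try:
    result["Segment", "Year"]
    raised = False
except KeyError:
    raised = True

# Double brackets: a list of labels selects a two-column DataFrame.
subset = result[["Segment", "Year"]]
print(raised, subset.shape)
```

Note that even with the double brackets, plt.bar expects one sequence of x positions, so a two-column selection still won't give grouped bars on its own.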
There are several problems with your code:
To select several columns from a dataframe, pass a list of column names to [] (see the docs):
result[["Segment","Year"]]
From the figure you provide, it looks like you want to use Year as the hue. matplotlib's bar function (plt.bar) doesn't have a hue argument, so you would have to build the grouped bars manually. Instead, you can use the seaborn library, which you are already importing anyway (see https://seaborn.pydata.org/generated/seaborn.barplot.html):
sns.barplot(x = 'Segment', y = 'Final_Sales', hue = 'Year', data = result)
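If you would rather stay without seaborn, one possible way to get grouped bars is to pivot the frame so that each year becomes its own column and let pandas draw it. This is only a sketch on made-up numbers; the real result frame comes from the Excel file in the question:

```python
import pandas as pd

# Hypothetical stand-in for `result`; the sales figures are invented.
result = pd.DataFrame({
    "Year": [2019, 2019, 2020, 2020],
    "Segment": ["Consumer", "Corporate", "Consumer", "Corporate"],
    "Final_Sales": [100.0, 80.0, 120.0, 95.0],
})

# One row per Segment, one column per Year.
wide = result.pivot(index="Segment", columns="Year", values="Final_Sales")

# pandas draws one group of bars per row and one bar per column,
# which plays the same role as seaborn's hue.
ax = wide.plot.bar(rot=0)
ax.set_ylabel("Final_Sales")
```

In a script you would still call plt.show() (after importing matplotlib.pyplot) to display the figure.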
Related
I'm trying to do a line plot with one line per column. My dataset looks like this:
I'm using this code, but it's giving me the following error:
ValueError: Wrong number of items passed 3, placement implies 27
plot_x = 'bill__effective_due_date'
plot_y = ['RR_bucket1_perc', 'RR_bucket7_perc', 'RR_bucket14_perc']
ax = sns.pointplot(x=plot_x, y=plot_y, data=df_rollrates_plot, marker="o", palette=sns.color_palette("coolwarm"))
display(ax.figure)
Maybe it's a silly question but I'm new to python so I'm not sure how to do this. This is my expected output:
Thanks!!
You can plot the dataframe as follows (edit: I updated the code below to make bill__effective_due_date the index of the dataframe):
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt  # needed for plt.grid() below
rr1 = [20,10,2,10,2,5]
rr7 = [17,8,2,8,2,4]
rr14 = [12,5,2,5,2,3]
x = ['Nov-1','Nov-2','Nov-3','Nov-4','Nov-5','Nov-6']
df_rollrates_plot = pd.DataFrame({'RR_bucket1_perc':rr1,
                                  'RR_bucket7_perc':rr7,
                                  'RR_bucket14_perc':rr14})
df_rollrates_plot.index = x
df_rollrates_plot.index.name = 'bill__effective_due_date'
# wide-form data: seaborn draws one line per column
sns.lineplot(data=df_rollrates_plot)
plt.grid()
Your data is in the wrong shape to take advantage of the hue parameter in seaborn's lineplot. You need to stack it so that the columns become categorical values.
import pandas as pd
import seaborn as sns
rr1 = [20,10,2,10,2,5]
rr7 = [17,8,2,8,2,4]
rr14 = [12,5,2,5,2,3]
x = ['Nov-1','Nov-2','Nov-3','Nov-4','Nov-5','Nov-6']
df = pd.DataFrame({'bill_effective_due_date':x,
                   'RR_bucket1_perc':rr1,
                   'RR_bucket7_perc':rr7,
                   'RR_bucket14_perc':rr14})
# This is where you are reshaping your data to make it work like you want
df = df.set_index('bill_effective_due_date').stack().reset_index()
df.columns=['bill_effective_due_date','roll_rates_perc','roll_rates']
sns.lineplot(data=df, x='bill_effective_due_date',y='roll_rates', hue='roll_rates_perc', marker='o')
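For what it's worth, the same wide-to-long reshape can probably be written in one step with DataFrame.melt instead of set_index/stack/reset_index. A sketch on the same toy numbers (the variable names wide and long_df are my own):

```python
import pandas as pd

wide = pd.DataFrame({
    "bill_effective_due_date": ["Nov-1", "Nov-2", "Nov-3", "Nov-4", "Nov-5", "Nov-6"],
    "RR_bucket1_perc": [20, 10, 2, 10, 2, 5],
    "RR_bucket7_perc": [17, 8, 2, 8, 2, 4],
    "RR_bucket14_perc": [12, 5, 2, 5, 2, 3],
})

# Keep the date column as the identifier; every other column name
# becomes a value in 'roll_rates_perc' and its number goes to 'roll_rates'.
long_df = wide.melt(
    id_vars="bill_effective_due_date",
    var_name="roll_rates_perc",
    value_name="roll_rates",
)
print(long_df.head())
```

The resulting long_df has the same three columns as the stacked frame, so it can be passed to sns.lineplot with the same x, y and hue arguments (the row order differs, which does not matter for the plot).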
I have a DataFrame with multi-index rows and I would like to create a heatmap without repeating the row labels, just as it appears in a pandas DataFrame. Here is code to reproduce my problem:
import pandas as pd
from matplotlib import pyplot as plt
import random
import seaborn as sns
%matplotlib inline
df = pd.DataFrame({'Occupation':['Economist','Economist','Economist','Engineer','Engineer','Engineer',
                                 'Data Scientist','Data Scientist','Data Scientist'],
                   'Sex':['Female','Male','Both']*3, 'UK':random.sample(range(-10,10),9),
                   'US':random.sample(range(-10,10),9),'Brazil':random.sample(range(-10,10),9)})
df = df.set_index(['Occupation','Sex'])
df
sns.heatmap(df, annot=True, fmt="",cmap="YlGnBu")
Besides eliminating the repetition, I would like to customize the y-labels a bit, since this raw form doesn't look good to me.
Is it possible?
AFAIK there's no quick and easy way to do that within seaborn, but hopefully someone corrects me. You can do it manually by resetting the y tick labels to just the values from level 1 of your index. Then you can loop over level 0 of your index and add a text element to your visualization at the correct location:
from collections import OrderedDict
ax = sns.heatmap(df, annot=True, cmap="YlGnBu")
# group the level-1 labels (sex) under their level-0 label (occupation)
ylabel_mapping = OrderedDict()
for occupation, sex in df.index:
    ylabel_mapping.setdefault(occupation, [])
    ylabel_mapping[occupation].append(sex)
hline = []
new_ylabels = []
for occupation, sex_list in ylabel_mapping.items():
    # show the occupation only on the first row of each group
    sex_list[0] = "{} - {}".format(occupation, sex_list[0])
    new_ylabels.extend(sex_list)
    # cumulative row count marks where each group ends
    if hline:
        hline.append(len(sex_list) + hline[-1])
    else:
        hline.append(len(sex_list))
ax.hlines(hline, xmin=-1, xmax=4, color="white", linewidth=5)
ax.set_yticklabels(new_ylabels)
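The label-building part of the loop above could also be sketched with pandas index methods alone. This is just one way to do it, and it assumes the index is ordered so that rows of the same occupation are contiguous (the names labels and boundaries are my own):

```python
import pandas as pd

# Same index layout as the example dataframe.
index = pd.MultiIndex.from_product(
    [["Economist", "Engineer", "Data Scientist"], ["Female", "Male", "Both"]],
    names=["Occupation", "Sex"],
)

occupation = index.get_level_values("Occupation")
sex = index.get_level_values("Sex")

# True on the first row of each occupation group.
first_in_group = ~occupation.duplicated()

# Prefix the occupation only on the first row of its group.
labels = [
    f"{occ} - {s}" if first else s
    for occ, s, first in zip(occupation, sex, first_in_group)
]

# Row positions where a new group starts (candidates for separator lines).
boundaries = [i for i, first in enumerate(first_in_group) if first and i > 0]
print(labels, boundaries)
```

labels can then be passed to ax.set_yticklabels and boundaries to ax.hlines in the same way as new_ylabels and hline above.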
An alternative approach involves using dataframe styling. This leads to a very simple syntax, but you do lose out on the colorbar. It keeps your index and column presentation the same as in a dataframe. Note that you'll need to be working in a notebook or somewhere that can render HTML to view the output:
df.style.background_gradient(cmap="YlGnBu", vmin=-10, vmax=10)
Is there any pandas way to "link" a dataframe column name with a nice description for that name?
See the following snippet, where I have a dataframe with two columns: the weight in kg and the height in meters of ten people.
When I create the dataframe I use this syntax
df = pd.DataFrame({'a':[1,2,3],'b':[4,5,6]})
but when creating the dataframe I would like to "attach" a nice description to column name a and some LaTeX, such as $b_0$, to column name b, so that all the graph items that automatically use those names (legend, tick labels, axis labels and so on) look nice to the user.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
sz = 10
bmi = np.random.normal(25,0.1,sz)
h = np.random.normal(70*2.54/100,4*2.54/100,sz)
w = bmi*h**2
df = pd.DataFrame({'height_m':h,'weight_kg':w})
ax1 = df.plot.scatter(x='height_m',y='weight_kg')
plt.savefig('raw.png')
ax2 = df.plot.scatter(x='height_m',y='weight_kg')
ax2.set_xlabel('$h_0$, Altezza/m')
ax2.set_ylabel('$p_0$, Peso/kg')
plt.savefig('publishable.png')
plt.show()
This is the raw picture straight from pandas:
This is the picture I would like to get, but without having to modify the plot myself by adding set_xlabel, set_ylabel and so on.
You can name your DataFrame correctly from the beginning and plot the dataframe accessing df.columns:
sz = 10
bmi = np.random.normal(25,0.1,sz)
h = np.random.normal(70*2.54/100,4*2.54/100,sz)
w = bmi*h**2
df = pd.DataFrame({'$h_0$, Altezza/m':h,'$p_0$, Peso/kg':w})
df.plot.scatter(x=df.columns[0], y=df.columns[1])
plt.savefig('publishable.png')
plt.show()
Plus, if you are using Jupyter Notebook / Jupyter Lab, it will convert the LaTeX correctly.
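If you'd prefer to keep short column names for the computations, one option (a sketch, not a built-in pandas feature for descriptions) is to keep the pretty labels in a plain dict and rename only at plotting time. The LABELS dict and the fixed BMI of 25 are my own assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical mapping from short column names to display labels.
LABELS = {"height_m": "$h_0$, Altezza/m", "weight_kg": "$p_0$, Peso/kg"}

rng = np.random.default_rng(0)
h = rng.normal(70 * 2.54 / 100, 4 * 2.54 / 100, 10)
w = 25 * h ** 2  # fixed BMI of 25, just to have some data
df = pd.DataFrame({"height_m": h, "weight_kg": w})

# df keeps its short names; only the renamed copy is plotted.
pretty = df.rename(columns=LABELS)
ax = pretty.plot.scatter(x=LABELS["height_m"], y=LABELS["weight_kg"])
```

This way the axis labels come out nicely without calling set_xlabel or set_ylabel, and the rest of your code can keep using df['height_m'].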
I am trying to change the default colour scheme used by Seaborn on plots, I just want something simple, for instance the HLS scheme shown on their documentation. However their methods don't seem to work, and I can only assume that's due to my use of "hue" but I cannot figure out how to make it work properly. Here's the current code, datain is just a text file of the correct number of columns of numbers, with p as an indexing value:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
datain = np.loadtxt("data.txt")
df = pd.DataFrame(data = datain, columns = ["t","p","x","y","z"])
ax3 = sns.lineplot("t", "x", sns.color_palette("hls"), data = df[df['p'].isin([0,1,2,3,4])], hue = "p")
plt.show()
The code plots the first few data sets out of a file, and they come out in that weird purple pastel choice that seaborn seems to default to if I don't include the sns.color_palette function. If I include it I get the error:
TypeError: lineplot() got multiple values for keyword argument 'hue'
Which seems a bit odd given the format accepted for the lineplot function.
First thing: you need to stick to the correct syntax. A palette is supplied via the palette argument. If you just pass it as the third positional argument of lineplot, it is interpreted as that third parameter, which happens to be hue.
Then you will need to make sure the palette has as many colors as you have different p values.
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
datain = np.c_[np.arange(50),
               np.tile(range(5),10),
               np.linspace(0,1)+np.tile(range(5),10)/0.02]
df = pd.DataFrame(data = datain, columns = ["t","p","x"])
# x and y are passed by keyword; recent seaborn versions no longer accept them positionally
ax = sns.lineplot(x = "t", y = "x", data = df, hue = "p",
                  palette=sns.color_palette("hls", len(df['p'].unique())))
plt.show()
I have been struggling for a while with the definition of colors in a bar plot using Pandas and Matplotlib. Let us imagine that we have the following dataframe:
import pandas as pd
pers1 = ["Jesús","lord",2]
pers2 = ["Mateo","apostel",1]
pers3 = ["Lucas","apostel",1]
dfnames = pd.DataFrame(
[pers1,pers2, pers3],
columns=["name","type","importance"]
)
Now, I want to create a bar plot with the importance as the numerical value, the names of the people as ticks and use the type column to assign colors. I have read other questions (for example: Define bar chart colors for Pandas/Matplotlib with defined column) but it doesn't work...
So, first I have to define colors and assign them to different values:
colors = {'apostel':'blue','lord':'green'}
And finally use the .plot() function:
dfnames.plot(
x="name",
y="importance",
kind="bar",
color = dfnames['type'].map(colors)
)
Good. The only problem is that all bars are green:
Why?? I don't know... I am testing it in Spyder and Jupyter... Any help? Thanks!
As per GH16822, this is a regression introduced in version 0.20.3, wherein only the first colour was picked from the list of colours passed. This was not an issue with prior versions.
The reason, according to one of the contributors, was this:
The problem seems to be in _get_colors. I think that BarPlot should
define a _get_colors that does something like
def _get_colors(self, num_colors=None, color_kwds='color'):
    color = self.kwds.get('color')
    if color is None:
        return super()._get_colors(self, num_colors=num_colors, color_kwds=color_kwds)
    else:
        num_colors = len(self.data) # maybe? may not work for some cases
        return _get_standard_colors(color=kwds.get('color'), num_colors=num_colors)
There's a couple of options for you -
The most obvious choice would be to update to the latest version of pandas (currently v0.22)
If you need a workaround, there's one (also mentioned in the issue tracker) whereby you wrap the arguments within an extra tuple -
dfnames.plot(x="name",
             y="importance",
             kind="bar",
             color=[tuple(dfnames['type'].map(colors))])
Though, in the interest of progress, I'd recommend updating your pandas.
I found another solution to your problem, and it works!
I used the matplotlib library directly instead of the plot method of the data frame.
Here is the code:
import pandas as pd
import matplotlib.pyplot as plt
# for Jupyter notebook
%matplotlib inline
pers1 = ["Jesús","lord",2]
pers2 = ["Mateo","apostel",1]
pers3 = ["Lucas","apostel",1]
dfnames = pd.DataFrame([pers1,pers2, pers3], columns=["name","type","importance"])
fig, ax = plt.subplots()
bars = ax.bar(dfnames.name, dfnames.importance)
colors = {'apostel':'blue','lord':'green'}
for index, bar in enumerate(bars):
    # look up the colour for this row's type; fall back to blue ('b') for unknown types
    color = colors.get(dfnames.loc[index, 'type'], 'b')
    bar.set_facecolor(color)
plt.show()
And here is the result:
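The loop can probably be avoided altogether, since ax.bar accepts a list with one colour per bar. A short sketch with the same data (bar_colors is just my name for the mapped list):

```python
import pandas as pd
import matplotlib.pyplot as plt

dfnames = pd.DataFrame(
    [["Jesús", "lord", 2], ["Mateo", "apostel", 1], ["Lucas", "apostel", 1]],
    columns=["name", "type", "importance"],
)
colors = {"apostel": "blue", "lord": "green"}

# One colour per row, in row order.
bar_colors = dfnames["type"].map(colors).tolist()

fig, ax = plt.subplots()
# matplotlib assigns the i-th colour in the list to the i-th bar.
bars = ax.bar(dfnames["name"], dfnames["importance"], color=bar_colors)
```

Call plt.show() afterwards if you are not in a notebook. Because this goes through matplotlib directly, it does not depend on how your pandas version handles the color argument.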