I am trying to change the default colour scheme used by Seaborn on plots. I just want something simple, for instance the HLS scheme shown in their documentation. However, their methods don't seem to work, and I can only assume that's due to my use of "hue", but I cannot figure out how to make it work properly. Here's the current code; datain is just a text file with the correct number of columns of numbers, with p as an indexing value:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
datain = np.loadtxt("data.txt")
df = pd.DataFrame(data = datain, columns = ["t","p","x","y","z"])
ax3 = sns.lineplot("t", "x", sns.color_palette("hls"), data = df[df['p'].isin([0,1,2,3,4])], hue = "p")
plt.show()
The code plots the first few data sets out of a file, and they come out in that weird purple pastel choice that seaborn seems to default to if I don't include the sns.color_palette function. If I include it I get the error:
TypeError: lineplot() got multiple values for keyword argument 'hue'
Which seems a bit odd given the format accepted for the lineplot function.
First, you need to stick to the correct syntax. A palette is supplied via the palette argument. If you pass it as the third positional argument of lineplot, it is interpreted as that function's third parameter, which happens to be hue.
Second, you need to make sure the palette has as many colors as you have different p values.
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
datain = np.c_[np.arange(50),
np.tile(range(5),10),
np.linspace(0,1)+np.tile(range(5),10)/0.02]
df = pd.DataFrame(data = datain, columns = ["t","p","x"])
ax = sns.lineplot(x="t", y="x", data=df, hue="p",
                  palette=sns.color_palette("hls", len(df['p'].unique())))
plt.show()
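If you want to be explicit about which colour goes with which p value, a dict can be passed as the palette as well. The following is only a minimal sketch: p_values is a hypothetical stand-in for df['p'].unique(), and the printout is just there to inspect the mapping.

```python
import seaborn as sns

# Hypothetical set of distinct hue levels, standing in for df['p'].unique()
p_values = [0, 1, 2, 3, 4]

# Ask for exactly as many HLS colours as there are hue levels
colors = sns.color_palette("hls", len(p_values))

# Map each level to its colour explicitly; `palette` also accepts a dict
palette = dict(zip(p_values, colors))

for level, rgb in palette.items():
    print(level, tuple(round(c, 2) for c in rgb))
```

The resulting dict can then be used as palette=palette in the lineplot call above.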
I'm using the below code to get Segment and Year in x-axis and Final_Sales in y-axis but it is throwing me an error.
CODE
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
%matplotlib inline
order = pd.read_excel("Sample.xls", sheet_name = "Orders")
order["Year"] = pd.DatetimeIndex(order["Order Date"]).year
result = order.groupby(["Year", "Segment"]).agg(Final_Sales=("Sales", sum)).reset_index()
bar = plt.bar(x = result["Segment","Year"], height = result["Final_Sales"])
ERROR
Can someone help me correct my code so that the output looks like the one below?
Required Output
Try adding another pair of brackets: result[["Segment","Year"]].
What you wrote tries to retrieve a single column named ("Segment", "Year"),
but what you actually want is to retrieve a list of columns: ["Segment","Year"].
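To see the difference between the two forms of indexing, here is a small sketch. The result frame below is made-up data standing in for the grouped frame from the question, not the real Excel contents.

```python
import pandas as pd

# Hypothetical stand-in for the grouped `result` frame from the question
result = pd.DataFrame({
    "Year": [2019, 2019, 2020, 2020],
    "Segment": ["Consumer", "Corporate", "Consumer", "Corporate"],
    "Final_Sales": [100.0, 80.0, 120.0, 90.0],
})

# A list of labels inside [] selects several columns and returns a DataFrame
subset = result[["Segment", "Year"]]
print(subset.shape)

# Without the inner list, pandas looks for ONE column keyed by the tuple
# ("Segment", "Year"), which does not exist here, so it raises KeyError
try:
    result["Segment", "Year"]
    lookup_failed = False
except KeyError:
    lookup_failed = True
print(lookup_failed)
```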
There are several problems with your code:
When using several columns to index a dataframe you want to pass a list of columns to [] (see the docs) as follows :
result[["Segment","Year"]]
From the figure you provide, it looks like you want to use year as hue. matplotlib's bar function (plt.bar) doesn't have a hue argument, so you would have to build the grouping manually. Instead, you can use the seaborn library that you are already importing anyway (see https://seaborn.pydata.org/generated/seaborn.barplot.html):
sns.barplot(x = 'Segment', y = 'Final_Sales', hue = 'Year', data = result)
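For comparison, building the grouped bars manually in plain matplotlib could look roughly like this. It is only a sketch under assumptions: the result frame is made-up data, and the bar width and offsets are arbitrary choices.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for the grouped `result` frame from the question
result = pd.DataFrame({
    "Year": [2019, 2019, 2020, 2020],
    "Segment": ["Consumer", "Corporate", "Consumer", "Corporate"],
    "Final_Sales": [100.0, 80.0, 120.0, 90.0],
})

# One row per Segment, one column per Year
wide = result.pivot(index="Segment", columns="Year", values="Final_Sales")

positions = np.arange(len(wide.index))   # one slot per Segment
width = 0.8 / len(wide.columns)          # share each slot among the years

fig, ax = plt.subplots()
for i, year in enumerate(wide.columns):
    # Shift each year's bars sideways so they sit next to each other
    ax.bar(positions + i * width, wide[year], width=width, label=str(year))

# Centre the tick labels under each group of bars
ax.set_xticks(positions + width * (len(wide.columns) - 1) / 2)
ax.set_xticklabels(wide.index)
ax.set_ylabel("Final_Sales")
ax.legend(title="Year")
```

Call plt.show() afterwards to display the figure. The seaborn one-liner above does the same bookkeeping for you.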
I am trying to have different point sizes on a seaborn scatterplot depending on the value in the "hue" column of my dataframe.
sns.scatterplot(x="X", y="Y", data=df, hue='value',style='value')
value can take 3 different values (0, 1 and 2) and I would like the points whose value is 2 to be bigger on the graph.
I tried the sizes argument :
sizes=(1,1,4)
But could not get it done this way.
Let's use the s parameter and pass a list of sizes using a function of df['value'] to scale the point sizes:
import pandas as pd
import seaborn as sns
df = pd.DataFrame({'X':[1,2,3],'Y':[1,4,9],'value':[1,0,2]})
_ = sns.scatterplot(x='X',y='Y', data=df, s=df['value']*50+10)
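If only the points with value 2 should be bigger, the sizes passed through s can be computed with a condition instead of a linear scaling. This is a small sketch; the sizes 120 and 30 are arbitrary values picked for illustration.

```python
import numpy as np
import pandas as pd
import seaborn as sns

df = pd.DataFrame({'X': [1, 2, 3], 'Y': [1, 4, 9], 'value': [1, 0, 2]})

# Hypothetical sizes: 120 for rows where value == 2, 30 for everything else
point_sizes = np.where(df['value'] == 2, 120, 30)

# `s` is forwarded to matplotlib's scatter, one size per point
ax = sns.scatterplot(x='X', y='Y', hue='value', data=df, s=point_sizes)
```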
Using seaborn scatterplots arguments:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.DataFrame({'X':[1,2,3,4,5],'Y':[1,2,3,4,5],
'value':[1,1,0,2,2]})
df["size"] = np.where(df["value"] == 2, "Big", "Small")
sns.scatterplot(x="X", y="Y", hue='value', size="size",
data=df, size_order=("Small", "Big"), sizes=(160, 40))
plt.show()
Note that the order of sizes needs to be reversed compared to size_order. I have no idea why that would make sense, though.
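One way to sidestep the ordering question is to pass sizes as a dict that maps each size level to a marker size, which the sizes argument also accepts. A minimal sketch with the same data; the numbers 40 and 160 are taken from the call above.

```python
import numpy as np
import pandas as pd
import seaborn as sns

df = pd.DataFrame({'X': [1, 2, 3, 4, 5], 'Y': [1, 2, 3, 4, 5],
                   'value': [1, 1, 0, 2, 2]})
df["size"] = np.where(df["value"] == 2, "Big", "Small")

# Explicit level -> marker size mapping, so no ordering has to be guessed
size_map = {"Small": 40, "Big": 160}

ax = sns.scatterplot(x="X", y="Y", hue="value", size="size",
                     sizes=size_map, data=df)
```

Call plt.show() (from matplotlib.pyplot) afterwards to display it.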
I've been trying to understand how to accomplish the very simple task of plotting two datasets, each with a different color, but nothing I found online seems to do it. Here is some sample code:
import pandas as pd
import numpy as np
import holoviews as hv
from holoviews import opts
hv.extension('bokeh')
ds1x = np.random.randn(1000)
ds1y = np.random.randn(1000)
ds2x = np.random.randn(1000) * 1.5
ds2y = np.random.randn(1000) + 1
ds1 = pd.DataFrame({'dsx' : ds1x, 'dsy' : ds1y})
ds2 = pd.DataFrame({'dsx' : ds2x, 'dsy' : ds2y})
ds1['source'] = ['ds1'] * len(ds1.index)
ds2['source'] = ['ds2'] * len(ds2.index)
ds = pd.concat([ds1, ds2])
The goal is to combine the two datasets in a single frame, with a categorical column keeping track of the source. Then I try plotting a scatter plot.
scatter = hv.Scatter(ds, 'dsx', 'dsy')
scatter
And that works as expected. But I cannot seem to understand how to color the two datasets differently based on the source column. I tried the following:
scatter = hv.Scatter(ds, 'dsx', 'dsy', color='source')
scatter = hv.Scatter(ds, 'dsx', 'dsy', cmap='source')
Both throw warnings and produce no color. I tried this:
scatter = hv.Scatter(ds, 'dsx', 'dsy')
scatter.opts(color='source')
Which throws an error. I tried converting the thing to a Holoviews dataset, same type of thing.
Why is something that is supposed to be so simple so obscure?
P.S. Yes, I know I can split the data and overlay two scatter plots and that will give different colors. But surely there has to be a way to accomplish this based on categorical data.
You can create a scatterplot in Holoviews with different colors per category as follows. They are all elegant one-liners:
1) By simply using .hvplot() on your dataframe to do this for you.
import hvplot
import hvplot.pandas
df.hvplot(kind='scatter', x='col1', y='col2', by='category_col')
# If you are using bokeh as a backend you can also just use 'color' parameter.
# I like this one more because it creates a hv.Scatter() instead of hv.NdOverlay()
# 'category_col' is here just an extra vdim, which is used for colors
df.hvplot(kind='scatter', x='col1', y='col2', color='category_col')
2) By creating an NdOverlay scatter plot as follows:
import holoviews as hv
hv.Dataset(df).to(hv.Scatter, 'col1', 'col2').overlay('category_col')
3) Or doppler's answer slightly adjusted, which sets 'category_col' as an extra vdim and is then used for the colors:
hv.Scatter(
data=df, kdims=['col1'], vdims=['col2', 'category_col'],
).opts(color='category_col', cmap=['blue', 'orange'])
You need the following sample data if you want to use my example directly:
import numpy as np
import pandas as pd
# create sample dataframe
df = pd.DataFrame({
'col1': np.random.normal(size=30),
'col2': np.random.normal(size=30),
'category_col': np.random.choice(['category_1', 'category_2'], size=30),
})
As an extra:
I find it interesting that there are basically 2 solutions to the problem.
You can create a hv.Scatter() with the category_col as an extra vdim which provides the colors or alternatively 2 separate scatterplots which are put together by hv.NdOverlay().
In the backend the hv.Scatter() solution will look like this:
:Scatter [col1] (col2,category_col)
And the hv.NdOverlay() backend looks like this:
:NdOverlay [category_col] :Scatter [col1] (col2)
This may help: http://holoviews.org/user_guide/Style_Mapping.html
Concretely, you cannot use a dim transform on a dimension that is not declared, not obscure at all :)
scatter = hv.Scatter(ds, 'dsx', ['dsy', 'source']
).opts(color=hv.dim('source').categorize({'ds1': 'blue', 'ds2': 'orange'}))
should get you there (haven't tested it myself).
Related:
Holoviews color per category
Overlay NdOverlays while keeping color / changing marker
I have been struggling for a while with the definition of colors in a bar plot using Pandas and Matplotlib. Let us imagine that we have the following dataframe:
import pandas as pd
pers1 = ["Jesús","lord",2]
pers2 = ["Mateo","apostel",1]
pers3 = ["Lucas","apostel",1]
dfnames = pd.DataFrame(
[pers1,pers2, pers3],
columns=["name","type","importance"]
)
Now, I want to create a bar plot with the importance as the numerical value, the names of the people as ticks and use the type column to assign colors. I have read other questions (for example: Define bar chart colors for Pandas/Matplotlib with defined column) but it doesn't work...
So, first I have to define colors and assign them to different values:
colors = {'apostel':'blue','lord':'green'}
And finally use the .plot() function:
dfnames.plot(
x="name",
y="importance",
kind="bar",
color = dfnames['type'].map(colors)
)
Good. The only problem is that all bars are green:
Why?? I don't know... I am testing it in Spyder and Jupyter... Any help? Thanks!
As per this GH16822, this is a regression bug introduced in version 0.20.3, wherein only the first colour was picked from the list of colours passed. This was not an issue with prior versions.
The reason, according to one of the contributors was this -
The problem seems to be in _get_colors. I think that BarPlot should define a _get_colors that does something like
def _get_colors(self, num_colors=None, color_kwds='color'):
    color = self.kwds.get('color')
    if color is None:
        return super()._get_colors(self, num_colors=num_colors, color_kwds=color_kwds)
    else:
        num_colors = len(self.data)  # maybe? may not work for some cases
        return _get_standard_colors(color=kwds.get('color'), num_colors=num_colors)
There's a couple of options for you -
The most obvious choice would be to update to the latest version of pandas (currently v0.22)
If you need a workaround, there's one (also mentioned in the issue tracker) whereby you wrap the arguments within an extra tuple -
dfnames.plot(x="name",
             y="importance",
             kind="bar",
             color=[tuple(dfnames['type'].map(colors))])
Though, in the interest of progress, I'd recommend updating your pandas.
I found another solution to your problem, and it works!
I used the matplotlib library directly instead of the plot method of the data frame.
Here is the code:
import pandas as pd
import matplotlib.pyplot as plt
# for jupyter notebook
%matplotlib inline
pers1 = ["Jesús","lord",2]
pers2 = ["Mateo","apostel",1]
pers3 = ["Lucas","apostel",1]
dfnames = pd.DataFrame([pers1,pers2, pers3], columns=["name","type","importance"])
fig, ax = plt.subplots()
bars = ax.bar(dfnames.name, dfnames.importance)
colors = {'apostel':'blue','lord':'green'}
for index, bar in enumerate(bars):
    # look up the color for this row's type, falling back to blue
    color = colors.get(dfnames.loc[index, 'type'], 'blue')
    bar.set_facecolor(color)
plt.show()
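The per-bar loop can also be avoided, since matplotlib's bar accepts a sequence with one color per bar. A short sketch using the same sample data as above:

```python
import pandas as pd
import matplotlib.pyplot as plt

dfnames = pd.DataFrame(
    [["Jesús", "lord", 2], ["Mateo", "apostel", 1], ["Lucas", "apostel", 1]],
    columns=["name", "type", "importance"],
)
colors = {"apostel": "blue", "lord": "green"}

# One color per row, looked up from the `type` column
bar_colors = dfnames["type"].map(colors).tolist()

fig, ax = plt.subplots()
# Passing a list to `color` assigns the colors bar by bar
bars = ax.bar(dfnames["name"], dfnames["importance"], color=bar_colors)
```

Call plt.show() afterwards to display the figure.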
Using pandas I can easily make a line plot:
import pandas as pd
import numpy as np
# to use it in jupyter notebooks
%matplotlib inline
df = pd.DataFrame(np.random.randn(50, 4),
index=pd.date_range('1/1/2000', periods=50), columns=list('ABCD'))
df = df.cumsum()
df.plot();
But I can't figure out how to also plot the data as points over the lines, as in this example:
This matplotlib example seems to suggest the direction, but I can't find how to do it using pandas plotting capabilities. And I am especially interested in learning how to do it with pandas because I am always working with dataframes.
Any clues?
You can use the style kwarg to the df.plot command. From the docs:
style : list or dict
matplotlib line style per column
So, you could either just set one linestyle for all the lines, or a different one for each line.
e.g. this does something similar to what you asked for:
df.plot(style='.-')
To define a different marker and linestyle for each line, you can use a list:
df.plot(style=['+-','o-','.--','s:'])
You can also pass the markevery kwarg onto matplotlib's plot command, to only draw markers at a given interval
df.plot(style='.-', markevery=5)
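Since style is documented as "list or dict", the styles can also be keyed by column name, which is handy when the column order may change. A minimal sketch with random data; the style strings are the same ones used in the list example above.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Small illustrative frame (random data, fixed seed for repeatability)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((20, 4)),
                  index=pd.date_range("2000-01-01", periods=20),
                  columns=list("ABCD")).cumsum()

# Map each column name to a matplotlib format string (marker + line style)
styles = {"A": "+-", "B": "o-", "C": ".--", "D": "s:"}

ax = df.plot(style=styles)

# Inspect which marker each plotted line ended up with
markers = [line.get_marker() for line in ax.get_lines()]
print(markers)
```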
You can use markevery argument in df.plot(), like so:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame(np.random.randn(1000, 4), index=pd.date_range('1/1/2000', periods=1000), columns=list('ABCD'))
df = df.cumsum()
df.plot(linestyle='-', markevery=100, marker='o', markerfacecolor='black')
plt.show()
markevery also accepts a list of specific point positions (integer indices), if that's what you want.
To mark specific dates, you can define a function that finds their positions in the index:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame(np.random.randn(1000, 4), index=pd.date_range('1/1/2000', periods=1000), columns=list('ABCD'))
df = df.cumsum()
dates = ["2001-01-01","2002-01-01","2001-06-01","2001-11-11","2001-09-01"]
def find_loc(df, dates):
    marks = []
    for date in dates:
        marks.append(df.index.get_loc(date))
    return marks
df.plot(linestyle='-', markevery=find_loc(df, dates), marker='o', markerfacecolor='black')
plt.show()
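As a possible alternative to the loop, the index can look up all the dates in one call with get_indexer. This is a sketch, not a drop-in replacement: unlike get_loc, get_indexer returns -1 for a date that is not in the index instead of raising, so those entries are filtered out here.

```python
import numpy as np
import pandas as pd

index = pd.date_range("2000-01-01", periods=1000)
df = pd.DataFrame(np.random.randn(1000, 4), index=index,
                  columns=list("ABCD")).cumsum()

dates = ["2001-01-01", "2002-01-01", "2001-06-01", "2001-11-11", "2001-09-01"]

# Positions of all requested dates in one call; a date missing from the
# index comes back as -1, so drop those before using them as markers
positions = df.index.get_indexer(pd.to_datetime(dates))
marks = [int(p) for p in positions if p >= 0]
print(marks)
```

The resulting marks list can then be passed as markevery=marks, as in the plot call above.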