Is there a way to set the marker style in pandas.DataFrame.plot? All other options are available by setting the kind. I would like a marker with error bar but just get a line with an error bar. If I was to do this through the function errorbar I would set fmt='.'
The OP does not specify it, but the answer depends on whether you are plotting the DataFrame or a Series.
Plotting DataFrame
Reusing the example by @unutbu:
from numpy import arange, random
import pandas as pd
# error bar sizes must be non-negative, hence abs()
df = pd.DataFrame({'x': arange(10), 'y': random.randn(10), 'err': abs(random.randn(10))})
df.plot('x', 'y', yerr='err', fmt='.')
Plotting Series in DataFrame
This time it's a bit different:
df.y.plot(fmt='.')
AttributeError: Unknown property fmt
you need:
df.y.plot(style='.')
DataFrame behavior with style
If you pass style to DataFrame.plot, "nothing happens":
df.plot('x', 'y', yerr='err', style='.')
which may not be what you want.
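If the pandas wrapper gets in the way, the same figure can be drawn with matplotlib's errorbar directly, where fmt='.' requests markers without a connecting line. This is only a minimal sketch: the seeded random data and the Agg backend line are assumptions added so that it runs anywhere.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x": np.arange(10),
    "y": rng.normal(size=10),
    "err": np.abs(rng.normal(size=10)),  # error sizes must be non-negative
})

fig, ax = plt.subplots()
# fmt="." draws dot markers and no line between the points
container = ax.errorbar(df["x"], df["y"], yerr=df["err"], fmt=".")
data_line = container[0]  # the Line2D that holds the markers
```

The returned container also holds the cap lines and the bar segments, in case they need separate styling.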
df.plot passes extra keyword parameters along to the underlying matplotlib plotting function. Thus,
import numpy as np
import pandas as pd
df = pd.DataFrame({'x': np.arange(10), 'y': np.random.randn(10),
                   'err': np.abs(np.random.randn(10))})  # error sizes must be non-negative
df.plot('x', 'y', yerr='err', fmt='.')
yields a plot with dot markers and error bars instead of a connected line.
I've been trying to understand how to accomplish the very simple task of plotting two datasets, each with a different color, but nothing I found online seems to do it. Here is some sample code:
import pandas as pd
import numpy as np
import holoviews as hv
from holoviews import opts
hv.extension('bokeh')
ds1x = np.random.randn(1000)
ds1y = np.random.randn(1000)
ds2x = np.random.randn(1000) * 1.5
ds2y = np.random.randn(1000) + 1
ds1 = pd.DataFrame({'dsx' : ds1x, 'dsy' : ds1y})
ds2 = pd.DataFrame({'dsx' : ds2x, 'dsy' : ds2y})
ds1['source'] = ['ds1'] * len(ds1.index)
ds2['source'] = ['ds2'] * len(ds2.index)
ds = pd.concat([ds1, ds2])
The goal is to produce two datasets in a single frame, with a categorical column keeping track of the source. Then I try plotting a scatter plot.
scatter = hv.Scatter(ds, 'dsx', 'dsy')
scatter
And that works as expected. But I cannot seem to understand how to color the two datasets differently based on the source column. I tried the following:
scatter = hv.Scatter(ds, 'dsx', 'dsy', color='source')
scatter = hv.Scatter(ds, 'dsx', 'dsy', cmap='source')
Both throw warnings and produce no color. I tried this:
scatter = hv.Scatter(ds, 'dsx', 'dsy')
scatter.opts(color='source')
Which throws an error. I tried converting the thing to a Holoviews dataset, same type of thing.
Why is something that is supposed to be so simple so obscure?
P.S. Yes, I know I can split the data and overlay two scatter plots, and that will give different colors. But surely there has to be a way to accomplish this based on categorical data.
You can create a scatterplot in Holoviews with different colors per category as follows. They are all elegant one-liners:
1) By simply using .hvplot() on your dataframe to do this for you.
import hvplot
import hvplot.pandas
df.hvplot(kind='scatter', x='col1', y='col2', by='category_col')
# If you are using bokeh as a backend you can also just use 'color' parameter.
# I like this one more because it creates a hv.Scatter() instead of hv.NdOverlay()
# 'category_col' is here just an extra vdim, which is used for colors
df.hvplot(kind='scatter', x='col1', y='col2', color='category_col')
2) By creating an NdOverlay scatter plot as follows:
import holoviews as hv
hv.Dataset(df).to(hv.Scatter, 'col1', 'col2').overlay('category_col')
3) Or doppler's answer slightly adjusted, which sets 'category_col' as an extra vdim and is then used for the colors:
hv.Scatter(
    data=df, kdims=['col1'], vdims=['col2', 'category_col'],
).opts(color='category_col', cmap=['blue', 'orange'])
Resulting plot:
You need the following sample data if you want to use my example directly:
import numpy as np
import pandas as pd
# create sample dataframe
df = pd.DataFrame({
'col1': np.random.normal(size=30),
'col2': np.random.normal(size=30),
'category_col': np.random.choice(['category_1', 'category_2'], size=30),
})
As an extra:
I find it interesting that there are basically 2 solutions to the problem.
You can create a hv.Scatter() with the category_col as an extra vdim which provides the colors or alternatively 2 separate scatterplots which are put together by hv.NdOverlay().
In the backend the hv.Scatter() solution will look like this:
:Scatter [col1] (col2,category_col)
And the hv.NdOverlay() backend looks like this:
:NdOverlay [category_col] :Scatter [col1] (col2)
This may help: http://holoviews.org/user_guide/Style_Mapping.html
Concretely, you cannot use a dim transform on a dimension that is not declared, not obscure at all :)
scatter = hv.Scatter(ds, 'dsx', ['dsy', 'source']).opts(
    color=hv.dim('source').categorize({'ds1': 'blue', 'ds2': 'orange'}))
should get you there (haven't tested it myself).
I am trying to change the default colour scheme used by Seaborn on plots, I just want something simple, for instance the HLS scheme shown on their documentation. However their methods don't seem to work, and I can only assume that's due to my use of "hue" but I cannot figure out how to make it work properly. Here's the current code, datain is just a text file of the correct number of columns of numbers, with p as an indexing value:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
datain = np.loadtxt("data.txt")
df = pd.DataFrame(data = datain, columns = ["t","p","x","y","z"])
ax3 = sns.lineplot("t", "x", sns.color_palette("hls"), data = df[df['p'].isin([0,1,2,3,4])], hue = "p")
plt.show()
The code plots the first few data sets out of a file, and they come out in that weird purple pastel choice that seaborn seems to default to if I don't include the sns.color_palette function. If I include it I get the error:
TypeError: lineplot() got multiple values for keyword argument 'hue'
Which seems a bit odd given the format accepted for the lineplot function.
First thing: you need to stick to the correct syntax. A palette is supplied via the palette argument. If you pass it as the third positional argument of lineplot, it is interpreted as hue, which is the third parameter, so hue ends up being supplied twice.
Then you will need to make sure the palette has as many colors as you have different p values.
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
datain = np.c_[np.arange(50),
               np.tile(range(5), 10),
               np.linspace(0, 1) + np.tile(range(5), 10) / 0.02]
df = pd.DataFrame(data=datain, columns=["t", "p", "x"])
# x and y are passed by keyword; recent seaborn releases no longer accept them positionally
ax = sns.lineplot(x="t", y="x", data=df, hue="p",
                  palette=sns.color_palette("hls", df["p"].nunique()))
plt.show()
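The TypeError from the question can be reproduced without seaborn. In the sketch below, lineplot_like is a hypothetical stand-in (not a seaborn function) whose third positional parameter is hue, mirroring the argument order described above:

```python
def lineplot_like(x=None, y=None, hue=None, palette=None):
    """Hypothetical stand-in: the third positional parameter is hue."""
    return {"x": x, "y": y, "hue": hue, "palette": palette}

try:
    # The list lands in the hue slot, and hue="p" then supplies hue a second time.
    lineplot_like("t", "x", ["red", "green"], hue="p")
    message = ""
except TypeError as exc:
    message = str(exc)

# Passing the palette by keyword removes the clash.
fixed = lineplot_like("t", "x", hue="p", palette=["red", "green"])
```

Naming every argument after x and y avoids this whole class of mistakes.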
So I set up this empty dataframe DF and load data into it according to some conditions. As a result, some of its elements are empty (nan). I noticed that if I don't specify the datatype as float when I create the empty dataframe, DF.boxplot() gives me an 'Index out of range' error.
As I understand it, pandas' DF.boxplot() uses matplotlib's plt.boxplot() function, so naturally I tried using plt.boxplot(DF.iloc[:,0]) to plot the boxplot of the first column. I noticed the reverse behavior: when the dtype of DF is float, it does not work and just shows me an empty plot. See the code below, where DF.boxplot() won't work but plt.boxplot(DF.iloc[:,0]) will plot a boxplot (when I add dtype='float' when first creating the dataframe, plt.boxplot(DF.iloc[:,0]) gives me an empty plot):
import numpy as np
import pandas as pd
DF = pd.DataFrame(index=range(10), columns=range(4))
for i in range(10):
    for j in range(4):
        if i == j:
            continue
        DF.iloc[i, j] = i
I am wondering whether this has to do with how plt.boxplot() handles nan for different data types. If so, why didn't DF.boxplot() work when the dataframe's data type was 'object', if pandas is just using matplotlib's boxplot function?
I think we can agree that neither df.boxplot() nor plt.boxplot can handle dataframes of type "object". Instead they need to be of a numeric datatype.
If the data is numeric, df.boxplot() will work as expected, even with nan values, because they are removed before plotting.
import pandas as pd
import matplotlib.pyplot as plt
df=pd.DataFrame(index=range(10),columns=range(4), dtype=float)
for i in range(10):
    for j in range(4):
        if i != j:
            df.iloc[i, j] = i
df.boxplot()
plt.show()
Using plt.boxplot you would need to remove the nans manually, e.g. by calling dropna() on each column.
import pandas as pd
import matplotlib.pyplot as plt
df=pd.DataFrame(index=range(10),columns=range(4), dtype=float)
for i in range(10):
    for j in range(4):
        if i != j:
            df.iloc[i, j] = i
data = [df[i].dropna() for i in range(4)]
plt.boxplot(data)
plt.show()
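To summarize: both functions need numeric data; df.boxplot() drops the nan values for you, while plt.boxplot needs them removed beforehand.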
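If the frame has already been built with object dtype, as in the question, one option is to convert it before plotting. A minimal sketch, assuming every non-missing entry is a plain number so that astype(float) succeeds; the Agg backend line is only there so the example runs headless:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line in a notebook
import pandas as pd

# Built without a dtype, so every column is of dtype object.
DF = pd.DataFrame(index=range(10), columns=range(4))
for i in range(10):
    for j in range(4):
        if i != j:
            DF.iloc[i, j] = i

numeric = DF.astype(float)  # object -> float64; missing cells stay NaN
ax = numeric.boxplot()      # pandas drops the NaN values per column
```

The four diagonal cells that were never filled remain NaN after the conversion and are simply left out of the boxes.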
Using pandas I can easily make a line plot:
import pandas as pd
import numpy as np
# in a Jupyter notebook, enable inline plots:
%matplotlib inline
df = pd.DataFrame(np.random.randn(50, 4),
                  index=pd.date_range('1/1/2000', periods=50), columns=list('ABCD'))
df = df.cumsum()
df.plot();
But I can't figure out how to also plot the data as points over the lines, as in this example:
This matplotlib example seems to suggest the direction, but I can't find how to do it using pandas plotting capabilities. And I am especially interested in learning how to do it with pandas because I am always working with dataframes.
Any clues?
You can use the style kwarg to the df.plot command. From the docs:
style : list or dict
matplotlib line style per column
So, you could either just set one linestyle for all the lines, or a different one for each line.
e.g. this does something similar to what you asked for:
df.plot(style='.-')
To define a different marker and linestyle for each line, you can use a list:
df.plot(style=['+-','o-','.--','s:'])
You can also pass the markevery kwarg onto matplotlib's plot command, to only draw markers at a given interval
df.plot(style='.-', markevery=5)
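Since the docs quoted above also allow a dict, styles can be keyed by column name instead of by position. A small sketch under that reading; the two-column frame and the seeded data are assumptions made for illustration:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line in a notebook
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(50, 2)), columns=list("AB")).cumsum()

# One matplotlib format string per column name
ax = df.plot(style={"A": "o-", "B": "s--"})

markers = [line.get_marker() for line in ax.lines]
linestyles = [line.get_linestyle() for line in ax.lines]
```

A dict keeps working when the column order changes, which a positional list does not.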
You can use the markevery argument in df.plot(), like so:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame(np.random.randn(1000, 4), index=pd.date_range('1/1/2000', periods=1000), columns=list('ABCD'))
df = df.cumsum()
df.plot(linestyle='-', markevery=100, marker='o', markerfacecolor='black')
plt.show()
markevery also accepts a list of integer positions, so you can mark specific points; dates have to be converted to positions first.
You can also define a function to help finding the correct location:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame(np.random.randn(1000, 4), index=pd.date_range('1/1/2000', periods=1000), columns=list('ABCD'))
df = df.cumsum()
dates = ["2001-01-01","2002-01-01","2001-06-01","2001-11-11","2001-09-01"]
def find_loc(df, dates):
    marks = []
    for date in dates:
        marks.append(df.index.get_loc(date))
    return marks
df.plot(linestyle='-', markevery=find_loc(df, dates), marker='o', markerfacecolor='black')
plt.show()
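The same lookup can be done in one vectorized call with Index.get_indexer, which returns -1 for labels that are not in the index instead of raising. A sketch assuming the same date index as above; the extra date "2010-01-01" is a made-up value added to show the missing case:

```python
import pandas as pd

index = pd.date_range("1/1/2000", periods=1000)
dates = ["2001-01-01", "2002-01-01", "2001-06-01", "2001-11-11", "2001-09-01",
         "2010-01-01"]  # hypothetical date outside the index

# Integer position of each date, or -1 when the date is absent
positions = index.get_indexer(pd.to_datetime(dates))
found = positions[positions >= 0].tolist()
```

The found list can then be passed as markevery=found, and dates outside the index are skipped rather than causing a KeyError.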
I have a pandas dataframe where one of the columns is a set of labels that I would like to plot each of the other columns against in subplots. In other words, I want the y-axis of each subplot to use the same column, called 'labels', and I want a subplot for each of the remaining columns with the data from each column on the x-axis. I expected the following code snippet to achieve this, but I don't understand why this results in a single nonsensical plot:
examples.plot(subplots=True, layout=(-1, 3), figsize=(20, 20), y='labels', sharey=False)
The problem with that code is that you didn't specify an x value. The plot looks nonsensical because it draws the labels column against the row index, from 0 to the number of rows. As far as I know, you can't do what you want in pandas directly. You might want to check out seaborn though; it's another visualization library that has some nice grid plotting helpers.
Here's an example with your data:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
examples = pd.DataFrame(np.random.rand(10,4), columns=['a', 'b', 'c', 'labels'])
g = sns.PairGrid(examples, x_vars=['a', 'b', 'c'], y_vars='labels')
g = g.map(plt.plot)
This creates the following plot:
Obviously it doesn't look great with random data, but hopefully with your data it will look better.
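For anyone who would rather not add a dependency, a plain matplotlib loop gives a similar grid. This is only a sketch: the random frame stands in for the real data, and the 1 x 3 layout, the figure size and the "o" marker are assumptions:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
examples = pd.DataFrame(rng.random((10, 4)), columns=["a", "b", "c", "labels"])

# One subplot per non-label column, all sharing the 'labels' y-axis
feature_cols = [c for c in examples.columns if c != "labels"]
fig, axes = plt.subplots(1, len(feature_cols), figsize=(12, 4), sharey=True)
for ax, col in zip(axes, feature_cols):
    ax.plot(examples[col], examples["labels"], "o")
    ax.set_xlabel(col)
axes[0].set_ylabel("labels")
```

With sharey=True the label axis is drawn once and kept aligned across the panels.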