Line plot with data points in pandas - python

Using pandas I can easily make a line plot:
import pandas as pd
import numpy as np
%matplotlib inline # to use it in jupyter notebooks
df = pd.DataFrame(np.random.randn(50, 4),
index=pd.date_range('1/1/2000', periods=50), columns=list('ABCD'))
df = df.cumsum()
df.plot();
But I can't figure out how to also plot the data as points over the lines, as in this example:
This matplotlib example seems to suggest the direction, but I can't find how to do it using pandas plotting capabilities. And I am specially interested in learning how to do it with pandas because I am always working with dataframes.
Any clues?

You can use the style kwarg to the df.plot command. From the docs:
style : list or dict
matplotlib line style per column
So, you could either just set one linestyle for all the lines, or a different one for each line.
e.g. this does something similar to what you asked for:
df.plot(style='.-')
To define a different marker and linestyle for each line, you can use a list:
df.plot(style=['+-','o-','.--','s:'])
You can also pass the markevery kwarg onto matplotlib's plot command, to only draw markers at a given interval
df.plot(style='.-', markevery=5)

You can use markevery argument in df.plot(), like so:
df = pd.DataFrame(np.random.randn(1000, 4), index=pd.date_range('1/1/2000', periods=1000), columns=list('ABCD'))
df = df.cumsum()
df.plot(linestyle='-', markevery=100, marker='o', markerfacecolor='black')
plt.show()
markevery would accept a list of specific points(or dates), if that's what you want.
You can also define a function to help finding the correct location:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame(np.random.randn(1000, 4), index=pd.date_range('1/1/2000', periods=1000), columns=list('ABCD'))
df = df.cumsum()
dates = ["2001-01-01","2002-01-01","2001-06-01","2001-11-11","2001-09-01"]
def find_loc(df, dates):
marks = []
for date in dates:
marks.append(df.index.get_loc(date))
return marks
df.plot(linestyle='-', markevery=find_loc(df, dates), marker='o', markerfacecolor='black')
plt.show()

Related

Add annotation to specific cells in heatmap

I am plotting a seaborn heatmap and would like to annotate only the specific cells with custom text.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from io import StringIO
data = StringIO(u'''75,83,41,47,19
51,24,100,0,58
12,94,63,91,7
34,13,86,41,77''')
labels = StringIO(u'''7,8,4,,1
5,2,,2,8
1,,6,,7
3,1,,4,7''')
data = pd.read_csv(data, header=None)
data = data.apply(pd.to_numeric)
labels = pd.read_csv(labels, header=None)
#labels = np.ma.masked_invalid(labels)
fig, ax = plt.subplots()
sns.heatmap(data, annot=labels, ax=ax, vmin=0, vmax=100)
plt.show()
The above code generates the following heatmap:
and the commented line generates the following heatmap:
I would like to show only the non-nan (or non-zero) text on the cells. How can that be achieved?
Use a string array for annot instead of a masked array:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from io import StringIO
data = StringIO(u'''75,83,41,47,19
51,24,100,0,58
12,94,63,91,7
34,13,86,41,77''')
labels = StringIO(u'''7,8,4,,1
5,2,,2,8
1,,6,,7
3,1,,4,7''')
data = pd.read_csv(data, header=None)
data = data.apply(pd.to_numeric)
labels = pd.read_csv(labels, header=None)
#labels = np.ma.masked_invalid(labels)
# Convert everything to strings:
annotations = labels.astype(str)
annotations[np.isnan(labels)] = ""
fig, ax = plt.subplots()
sns.heatmap(data, annot=annotations, fmt="s", ax=ax, vmin=0, vmax=100)
plt.show()
To complement the answer by #mrzo, you can use na_filter=False in read_csv() to store nans as empty strings and use pandas.DataFrame.astype() to convert to strings in place:
# ...
labels = pd.read_csv(labels, header=None, na_filter=False).astype(str)
sns.heatmap(data, annot=labels, fmt='s', ax=ax, vmin=0, vmax=100)
Just going to add this as it has taken me some time to work out how to do something similar programmatically for a slightly different application: I wanted to suppress 0-values from the annotation, but because the values were arising as the result of a crosstab operation I couldn't use William Miller's nice approach without writing the crosstab out and then reading it back in which seemed... inelegant.
There may be a yet more elegant way to do this, but for me running it through numpy was ridiculously fast and quite easy.
import numpy as np
import pandas as pd
import seaborn as sns
from io import StringIO
data = StringIO(u'''75,83,41,47,19
51,24,100,0,58
12,94,63,91,7
34,13,86,41,77''')
data = pd.read_csv(data, header=None)
data = data.apply(pd.to_numeric)
# For more complex functions you could write a def instead
# of using this simple lambda function
an = np.vectorize(lambda x: '' if x<50 else str(round(x,-1)))(data.to_numpy())
sns.heatmap(
data=data.to_numpy(), # Note this is now numpy too
cmap='BuPu',
annot=an, # The matching ndarray of annotations
fmt = '', # Formats annotations as strings (i.e. no formatting)
cbar=False, # Seems overkill if you've got annotations
vmin=0,
vmax=data.max().max()
)
This can make life a little more difficult in terms of labelling axes, though it's straightforward enough: ax.set_xticklabels(df.columns.values). And if you had axislabels in, say, the first column then you'd need to use iloc (data.iloc[:,1:]) in your to_numpy call, but combined with a custom colormap (e.g. 0==white) you can create heatmaps that are a lot easier to look at.
Obviously the crude rounding is confusing (why does 80 have different shades?) but you get the point:

How to make a distplot for each column in a pandas dataframe

I 'm using Seaborn in a Jupyter notebook to plot histograms like this:
import numpy as np
import pandas as pd
from pandas import DataFrame
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
df = pd.read_csv('CTG.csv', sep=',')
sns.distplot(df['LBE'])
I have an array of columns with values that I want to plot histogram for and I tried plotting a histogram for each of them:
continous = ['b', 'e', 'LBE', 'LB', 'AC']
for column in continous:
sns.distplot(df[column])
And I get this result - only one plot with (presumably) all histograms:
My desired result is multiple histograms that looks like this (one for each variable):
How can I do this?
Insert plt.figure() before each call to sns.distplot() .
Here's an example with plt.figure():
Here's an example without plt.figure():
Complete code:
# imports
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [6, 2]
%matplotlib inline
# sample time series data
np.random.seed(123)
df = pd.DataFrame(np.random.randint(-10,12,size=(300, 4)), columns=list('ABCD'))
datelist = pd.date_range(pd.datetime(2014, 7, 1).strftime('%Y-%m-%d'), periods=300).tolist()
df['dates'] = datelist
df = df.set_index(['dates'])
df.index = pd.to_datetime(df.index)
df.iloc[0]=0
df=df.cumsum()
# create distplots
for column in df.columns:
plt.figure() # <==================== here!
sns.distplot(df[column])
Distplot has since been deprecated in seaborn versions >= 0.14.0. You can, however, use sns.histplot() to plot histogram distributions of the entire dataframe (numerical features only) in the following way:
fig, axes = plt.subplots(2,5, figsize=(15, 5))
ax = axes.flatten()
for i, col in enumerate(df.columns):
sns.histplot(df[col], ax=ax[i]) # histogram call
ax[i].set_title(col)
# remove scientific notation for both axes
ax[i].ticklabel_format(style='plain', axis='both')
fig.tight_layout(w_pad=6, h_pad=4) # change padding
plt.show()
If, you specifically want a way to estimate the probability density function of a continuous random variable using the Kernel Density Function (mimicing the default behavior of sns.distplot()), then inside the sns.histplot() function call, add kde=True, and you will have curves overlaying the histograms.
Also works when looping with plt.show() inside:
for column in df.columns:
sns.distplot(df[column])
plt.show()

Howto force Pandas and native matplotlib to share axis

I folks,
Consider the following example
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
fig, (ax1,ax2) = plt.subplots(2,1)
dates = pd.date_range("2018-01-01","2019-01-01",freq = "1d")
x = pd.DataFrame(index = dates, data = np.linspace(0,1,len(dates)) )
x.plot(ax=ax1)
y = np.random.random([len(dates),100]) * x.values
ax2.pcolormesh(range(len(x)), np.linspace(-1,1,100), y.T)
plt.show()
At this point, I would like the both axis (ax1,ax2) to share the x-axis, i.e. displaying proper pandas dates on the second axis. sharex=True does not seem to work. How can I achieve that? I tried different possibilities which did not work out.
Edit: Since the pandas date formatting is superior to the native matplotlib formatting, please provide me with a solution where pandas date formatting is used (for instance, zooming with an interactive environment works much better with pandas date formatting). Thanks You!
One way to do it would be to do all the plotting with matplotlib, this way there are no problems with the different time formats being used:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
fig, (ax1,ax2) = plt.subplots(2,1, sharex='col')
dates = pd.date_range("2018-01-01","2019-01-01",freq = "1d")
x = pd.DataFrame(index = dates, data = np.linspace(0,1,len(dates)) )
#x.plot(ax=ax1)
ax1.plot(x.index, x.values)
y = np.random.random([len(dates),100]) * x.values
ax2.pcolormesh(x.index, np.linspace(-1,1,100), y.T)
fig.tight_layout()
plt.show()
This gives the following plot:
What seems to work fine is to first plot the same line into the axes that should host the image, then plot the image, then remove the line again. What this does is that it tells pandas to apply its locators and formatters to that axes; they will stay after removing the line.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
fig, (ax1,ax2) = plt.subplots(2,1, sharex=True)
dates = pd.date_range("2018-01-01","2019-01-01",freq = "1d")
x = pd.DataFrame(index = dates, data = np.linspace(0,1,len(dates)) )
x.plot(ax=ax1)
y = np.random.random([len(dates),100]) * x.values
x.plot(ax=ax2, legend=False)
ax2.pcolormesh(dates, np.linspace(-1,1,100), y.T)
ax2.lines[0].remove()
plt.show()
Note that there may be caveats of this solution when zooming or panning. Consider it more like a hack and use it as long as it works, but don't blame anyone once it doesn't.

Python Pandas Matplotlib Plot Colored by type value defined in single column

I have data of the following format:
import pandas as ps
table={'time':[1,2,3,4,5,1,2,3,4,5,1,2,3,4,5],\
'data':[1,1,2,2,2,1,2,3,4,5,1,2,2,2,3],\
'type':['a','a','a','a','a','b','b','b','b','b','c','c','c','c','c']}
df=ps.DataFrame(table,columns=['time','data','type']
I would like to plot data as a function of time connected as a line, but I would like each line to be a separate color for unique types. In this example, the result would be three lines: a data(time) line for each type a, b, and, c. Any guidance is appreciated.
I have been unable to produce a line with this data--pandas.scatter will produce a plot, while pandas.plot will not. I have been messing with loops to produce a plot for each type, but I have not found a straight forward way to do this. My data typically has an unknown number of unique 'type's. Does pandas and/or matpltlib have a way to create this type of plot?
Pandas plotting capabilities will allow you to do this if everything is indexed properly. However, sometimes it's easier to just use matplotlib directly:
import pandas as pd
import matplotlib.pyplot as plt
table={'time':[1,2,3,4,5,1,2,3,4,5,1,2,3,4,5],
'data':[1,1,2,2,2,1,2,3,4,5,1,2,2,2,3],
'type':['a','a','a','a','a','b','b','b','b','b','c','c','c','c','c']}
df=pd.DataFrame(table, columns=['time','data','type'])
groups = df.groupby('type')
fig, ax = plt.subplots()
for name, group in groups:
ax.plot(group['time'], group['data'], label=name)
ax.legend(loc='best')
plt.show()
If you'd prefer to use the pandas plotting wrapper, you'll need to override the legend labels:
import pandas as pd
import matplotlib.pyplot as plt
table={'time':[1,2,3,4,5,1,2,3,4,5,1,2,3,4,5],
'data':[1,1,2,2,2,1,2,3,4,5,1,2,2,2,3],
'type':['a','a','a','a','a','b','b','b','b','b','c','c','c','c','c']}
df=pd.DataFrame(table, columns=['time','data','type'])
df.index = df['time']
groups = df[['data', 'type']].groupby('type')
fig, ax = plt.subplots()
groups.plot(ax=ax, legend=False)
names = [item[0] for item in groups]
ax.legend(ax.lines, names, loc='best')
plt.show()
Just to throw in the seaborn solution.
import seaborn as sns
import matplotlib.pyplot as plt
g = sns.FacetGrid(df, hue="type", size=5)
g.map(plt.plot, "time", "data")
g.add_legend()

Multiple histograms in Pandas

I would like to create the following histogram (see image below) taken from the book "Think Stats". However, I cannot get them on the same plot. Each DataFrame takes its own subplot.
I have the following code:
import nsfg
import matplotlib.pyplot as plt
df = nsfg.ReadFemPreg()
preg = nsfg.ReadFemPreg()
live = preg[preg.outcome == 1]
first = live[live.birthord == 1]
others = live[live.birthord != 1]
#fig = plt.figure()
#ax1 = fig.add_subplot(111)
first.hist(column = 'prglngth', bins = 40, color = 'teal', \
alpha = 0.5)
others.hist(column = 'prglngth', bins = 40, color = 'blue', \
alpha = 0.5)
plt.show()
The above code does not work when I use ax = ax1 as suggested in: pandas multiple plots not working as hists nor this example does what I need: Overlaying multiple histograms using pandas. When I use the code as it is, it creates two windows with histograms. Any ideas how to combine them?
Here's an example of how I'd like the final figure to look:
As far as I can tell, pandas can't handle this situation. That's ok since all of their plotting methods are for convenience only. You'll need to use matplotlib directly. Here's how I do it:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas
#import seaborn
#seaborn.set(style='ticks')
np.random.seed(0)
df = pandas.DataFrame(np.random.normal(size=(37,2)), columns=['A', 'B'])
fig, ax = plt.subplots()
a_heights, a_bins = np.histogram(df['A'])
b_heights, b_bins = np.histogram(df['B'], bins=a_bins)
width = (a_bins[1] - a_bins[0])/3
ax.bar(a_bins[:-1], a_heights, width=width, facecolor='cornflowerblue')
ax.bar(b_bins[:-1]+width, b_heights, width=width, facecolor='seagreen')
#seaborn.despine(ax=ax, offset=10)
And that gives me:
In case anyone wants to plot one histogram over another (rather than alternating bars) you can simply call .hist() consecutively on the series you want to plot:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas
np.random.seed(0)
df = pandas.DataFrame(np.random.normal(size=(37,2)), columns=['A', 'B'])
df['A'].hist()
df['B'].hist()
This gives you:
Note that the order you call .hist() matters (the first one will be at the back)
A quick solution is to use melt() from pandas and then plot with seaborn.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# make dataframe
df = pd.DataFrame(np.random.normal(size=(200,2)), columns=['A', 'B'])
# plot melted dataframe in a single command
sns.histplot(df.melt(), x='value', hue='variable',
multiple='dodge', shrink=.75, bins=20);
Setting multiple='dodge' makes it so the bars are side-by-side, and shrink=.75 makes it so the pair of bars take up 3/4 of the whole bin.
To help understand what melt() did, these are the dataframes df and df.melt():
From the pandas website (http://pandas.pydata.org/pandas-docs/stable/visualization.html#visualization-hist):
df4 = pd.DataFrame({'a': np.random.randn(1000) + 1, 'b': np.random.randn(1000),
'c': np.random.randn(1000) - 1}, columns=['a', 'b', 'c'])
plt.figure();
df4.plot(kind='hist', alpha=0.5)
You make two dataframes and one matplotlib axis
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
df1 = pd.DataFrame({
'data1': np.random.randn(10),
'data2': np.random.randn(10)
})
df2 = df1.copy()
fig, ax = plt.subplots()
df1.hist(column=['data1'], ax=ax)
df2.hist(column=['data2'], ax=ax)
Here is the snippet, In my case I have explicitly specified bins and range as I didn't handle outlier removal as the author of the book.
fig, ax = plt.subplots()
ax.hist([first.prglngth, others.prglngth], 10, (27, 50), histtype="bar", label=("First", "Other"))
ax.set_title("Histogram")
ax.legend()
Refer Matplotlib multihist plot with different sizes example.
this could be done with brevity
plt.hist([First, Other], bins = 40, color =('teal','blue'), label=("First", "Other"))
plt.legend(loc='best')
Note that as the number of bins increase, it may become a visual burden.
You could also try to check out the pandas.DataFrame.plot.hist() function which will plot the histogram of each column of the dataframe in the same figure.
Visibility is limited though but you can check out if it helps!
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.hist.html

Categories

Resources