I have a pandas data frame that looks like this:
import pandas as pd
import matplotlib.pyplot as plt
data = [{'A': 21, 'B': 23, 'C':19, 'D':26,'E':28,
'F':26,'G':23,'H':22,'I':24,'J':21}]
# Creates DataFrame.
df = pd.DataFrame(data)
plt.figure(figsize=(12,8))
df.iloc[-1].plot(marker='o',markersize=5)
plt.show()
When I try to plot this in Matplotlib, I end up with a very jagged line.
Is there a way I can smooth the line out to make it look more curved and fluid?
I have tried to use scipy's interpolate, but have not been successful.
Thanks
This should do the trick:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.interpolate import make_interp_spline
data = [{'A': 21, 'B': 23, 'C':19, 'D':26,'E':28,
'F':26,'G':23,'H':22,'I':24,'J':21}]
# Creates DataFrame.
df = pd.DataFrame(data)
y = np.array(df.iloc[-1].tolist())
x = np.arange(len(df.iloc[-1]))
xnew = np.linspace(x.min(), x.max(), 300)
spl = make_interp_spline(x, y, k=3)
ysmooth= spl(xnew)
plt.plot(xnew, ysmooth)
This is one option (though not necessarily the ideal answer):
You can try a polynomial approximation of the data. However, you need numeric values for both your x and y axes. I tried the following:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
#note i've changed the A,B,C... to 1,2,3...
data = [{1: 21, 2: 23, 3:19, 4:26,5:28,
6:26,7:23,8:22,9:24,10:21}]
#Creates DataFrame.
df = pd.DataFrame(data)
#define your lists
xlist = df.columns.tolist()
ylist = df.values.tolist()
ylist = ylist[0]
#plot data
plt.figure()
poly = np.polyfit(xlist,ylist,5)
poly_y = np.poly1d(poly)(xlist)
plt.plot(xlist,poly_y)
plt.plot(xlist,ylist)
plt.show()
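One thing to watch in the snippet above: the fitted polynomial is evaluated only at the original x positions, so the plotted line still has a corner at every data point. A minimal sketch, assuming the same ten values and an arbitrary grid of 300 points (`x_dense` and `y_dense` are names introduced here for illustration), evaluates the fit on a finer grid instead:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; remove for interactive use
import matplotlib.pyplot as plt
import numpy as np

# Same values as in the answer above, with numeric x positions 1..10.
x = np.arange(1, 11)
y = np.array([21, 23, 19, 26, 28, 26, 23, 22, 24, 21])

# Fit a degree-5 polynomial, then evaluate it on a dense grid
# rather than only at the ten original x positions.
coeffs = np.polyfit(x, y, 5)
x_dense = np.linspace(x.min(), x.max(), 300)  # 300 is an arbitrary choice
y_dense = np.polyval(coeffs, x_dense)

fig, ax = plt.subplots()
ax.plot(x, y, "o", label="data")
ax.plot(x_dense, y_dense, label="degree-5 fit")
ax.legend()
```

Note that this is an approximation, so the curve does not necessarily pass through the data points.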
Another option is spline interpolation. The s parameter lets you adjust the smoothness of the curve, so you can test several values for s:
from scipy.interpolate import splrep, splev
plt.figure()
bspl = splrep(xlist,ylist,s=25)
bspl_y = splev(xlist,bspl)
plt.plot(xlist,ylist)
plt.plot(xlist,bspl_y)
plt.show()
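To get a feel for s, a rough sketch like the following can help. It assumes the same ten values, compares an interpolating spline (s=0) with the smoothed one (s=25), and evaluates both on a dense grid. The grid size and the names `tck_exact`, `tck_smooth`, and `rss_smooth` are illustrative choices, not part of the answer above.

```python
import numpy as np
from scipy.interpolate import splrep, splev

x = np.arange(1, 11, dtype=float)
y = np.array([21, 23, 19, 26, 28, 26, 23, 22, 24, 21], dtype=float)

# s=0 forces the spline through every data point (pure interpolation).
tck_exact = splrep(x, y, s=0)
# A larger s lets the spline deviate from the data in exchange for smoothness.
tck_smooth = splrep(x, y, s=25)

# Evaluate on a dense grid so the plotted curve has no visible corners.
x_dense = np.linspace(x.min(), x.max(), 300)
y_exact = splev(x_dense, tck_exact)
y_smooth = splev(x_dense, tck_smooth)

# SciPy documents s as an upper bound on the sum of squared residuals
# (with the default unit weights), so this stays at or near 25.
rss_smooth = float(np.sum((y - splev(x, tck_smooth)) ** 2))
```

Plotting `x_dense` against `y_smooth` (or `y_exact`) then gives the curved line; raising s flattens the curve further, while lowering it pulls the curve back toward the data.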
Related
I am trying to generate a weighted empirical CDF in Python. I know statsmodels.distributions.empirical_distribution provides an ECDF function, but it is unweighted. Is there a library I can use, or how can I extend this to write a function that calculates the weighted ECDF (EWCDF), like ewcdf {spatstat} in R?
The Seaborn library has an ecdfplot function that implements a weighted version of the ECDF. I looked into the code to see how Seaborn calculates it.
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
sample = np.arange(100)
weights = np.random.randint(10, size=100)
estimator = sns.distributions.ECDF('proportion', complementary=True)
stat, vals = estimator(sample, weights=weights)
plt.plot(vals, stat)
Seaborn provides ecdfplot, which allows you to plot a weighted CDF. See seaborn.ecdf. Based on deepAgrawal's answer, I adapted it a little so that what is plotted is the CDF rather than 1 - CDF.
import seaborn as sns
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
sample = np.arange(15)
weights = np.random.randint(5, size=15)
df = pd.DataFrame(np.vstack((sample, weights)).T, columns = ['sample', 'weights'])
sns.ecdfplot(data = df, x = 'sample', weights = 'weights', stat = 'proportion', legend = True)
def ecdf(x):
    Sorted = np.sort(x)
    Length = len(x)
    ecdf = np.zeros(Length)
    for i in range(Length):
        ecdf[i] = sum(Sorted <= x[i])/Length
    return ecdf
x = np.array([1, 2, 5, 4, 3, 6, 7, 8, 9, 10])
ecdf(x)
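The function above is unweighted. If you would rather not depend on Seaborn, a weighted version can be sketched in plain NumPy. This is a minimal sketch, assuming 1-D input and non-negative weights; `weighted_ecdf` is a name made up for this example, and the handling of ties (weights of equal values are summed) is one reasonable convention rather than a port of spatstat's ewcdf.

```python
import numpy as np

def weighted_ecdf(values, weights=None):
    """Return (sorted unique values, cumulative weight share at each value)."""
    values = np.asarray(values, dtype=float)
    if weights is None:
        weights = np.ones_like(values)  # falls back to the ordinary ECDF
    weights = np.asarray(weights, dtype=float)
    if values.shape != weights.shape:
        raise ValueError("values and weights must have the same shape")
    if np.any(weights < 0) or weights.sum() <= 0:
        raise ValueError("weights must be non-negative with a positive sum")
    # Collapse ties: np.unique sorts the values, and bincount sums the
    # weights belonging to each unique value.
    xs, inverse = np.unique(values, return_inverse=True)
    w_per_value = np.bincount(inverse, weights=weights)
    return xs, np.cumsum(w_per_value) / w_per_value.sum()

xs, cdf = weighted_ecdf([3, 1, 2, 3], weights=[1, 2, 1, 0])
```

The result can be drawn as a step function, for example with `plt.step(xs, cdf, where="post")`.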
I am trying to plot the accuracy evolution of NN models over time. I have an Excel file with data like the following:
I wrote the following code to extract the data and draw the scatter plot:
import pandas as pd
data = pd.read_excel (r'SOTA DNN.xlsx')
acc1 = pd.DataFrame(data, columns= ['Top-1-Acc'])
para = pd.DataFrame(data, columns= ['Parameters'])
dates = pd.to_datetime(data['Date'], format='%Y-%m-%d')
import matplotlib.pyplot as plt
plt.grid(True)
plt.ylim(40, 100)
plt.scatter(dates, acc1)
plt.show()
Is there a way to draw a line in the same figure that connects only the models achieving a new maximum, and to print their names, as in this example:
Is it also possible to stretch the x-axis (for the dates)?
It is still not clear what you mean by "stretch the x-axis" and you did not provide your dataset, but here is a possible general approach:
import matplotlib.pyplot as plt
import pandas as pd
#fake data generation, this has to be substituted by your .xls import routine
from pandas._testing import rands_array
import numpy as np
np.random.seed(1234)
n = 30
acc = np.concatenate([np.random.randint(0, 10, 10), np.random.randint(0, 30, 10), np.random.randint(0, 100, n-20)])
date_range = pd.date_range("20190101", periods=n)
model = rands_array(5, n)
df = pd.DataFrame({"Model": model, "Date": date_range, "TopAcc": acc})
df = df.sample(frac=1).reset_index(drop=True)
#now to the actual data modification
#first, we extract the dataframe with monotonically increasing values after sorting the date column
df = df.sort_values("Date").reset_index()
df["Max"] = df.TopAcc.cummax().diff()
df.loc[0, "Max"] = 1
dfmax = df[df.Max > 0]
#then, we plot all data, followed by the best performers
fig, ax = plt.subplots(figsize=(10, 5))
ax.scatter(df.Date, df.TopAcc, c="grey")
ax.plot(dfmax.Date, dfmax.TopAcc, marker="x", c="blue")
#finally, we annotate the best performers
for _, xylabel in dfmax.iterrows():
    ax.text(xylabel.Date, xylabel.TopAcc, xylabel.Model, c="blue", horizontalalignment="right", verticalalignment="bottom")
plt.show()
Sample output:
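The key step above is the running maximum. A small sketch on made-up numbers (the model names and accuracies below are invented purely for illustration) shows which rows survive the filter. It uses a shift-and-compare formulation, which is a variation on the cummax().diff() approach in the answer rather than the same code:

```python
import pandas as pd

# Hypothetical data: six models with their release dates and accuracies.
df = pd.DataFrame({
    "Model": ["a", "b", "c", "d", "e", "f"],
    "Date": pd.to_datetime(["2019-01-01", "2019-02-01", "2019-03-01",
                            "2019-04-01", "2019-05-01", "2019-06-01"]),
    "TopAcc": [1, 3, 2, 5, 5, 4],
}).sort_values("Date").reset_index(drop=True)

# Best accuracy seen strictly before each row (NaN for the first row).
prev_best = df["TopAcc"].cummax().shift()
# A row is a record if it is the first row or beats everything before it.
is_record = prev_best.isna() | (df["TopAcc"] > prev_best)
records = df[is_record]
```

Here `records` keeps models a, b, and d. Model e ties the previous best without beating it, so it is dropped; use `>=` instead of `>` if ties should count.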
I have a dataframe with x1 and x2 columns. I want to plot each row as a one-dimensional line where x1 is the start and x2 is the end. My solution below works, but it is not very elegant, and it is slow when plotting 900 lines in the same plot.
Create some example data:
import numpy as np
import pandas as pd
df_lines = pd.DataFrame({'x1': np.linspace(1,50,50)*2, 'x2': np.linspace(1,50,50)*2+1})
My solution:
import matplotlib.pyplot as plt
def plot(dataframe):
    plt.figure()
    for item in dataframe.iterrows():
        x1 = int(item[1]['x1'])
        x2 = int(item[1]['x2'])
        plt.hlines(0,x1,x2)
plot(df_lines)
It actually works but I think it could be improved. Thanks in advance.
You can use DataFrame.apply with axis=1 to process the data row by row:
def plot(dataframe):
    plt.figure()
    dataframe.apply(lambda x: plt.hlines(0,x['x1'],x['x2']), axis=1)
plot(df_lines)
Matplotlib can save a lot of time drawing lines when they are organized in a LineCollection. Instead of drawing 50 individual hlines, like the other answers do, you create one single object.
Such a LineCollection requires an array of the line vertices as input. It needs to be of shape (number of lines, points per line, 2), so in this case (50,2,2).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.collections import LineCollection
df_lines = pd.DataFrame({'x1': np.linspace(1,50,50)*2,
'x2': np.linspace(1,50,50)*2+1})
segs = np.zeros((len(df_lines), 2,2))
segs[:,:,0] = df_lines[["x1","x2"]].values
fig, ax = plt.subplots()
line_segments = LineCollection(segs)
ax.add_collection(line_segments)
ax.set_xlim(0,102)
ax.set_ylim(-1,1)
plt.show()
In addition to @jezrael's nice answer, you can do this in NumPy using numpy.apply_along_axis. Performance-wise, it is equivalent to DataFrame.apply:
def plot(dataframe):
    plt.figure()
    np.apply_along_axis(lambda x: plt.hlines(0,x[0],x[1]), 1,dataframe.values)
    plt.show()
plot(df_lines)
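A further variation, sketched here under the assumption that all segments share one style: `Axes.hlines` accepts array-like `y`, `xmin`, and `xmax` and returns a single LineCollection, so the row-by-row loop can be dropped without building the vertex array by hand. The variable names `y` and `n_segments` are introduced only for this example.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; remove for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df_lines = pd.DataFrame({"x1": np.linspace(1, 50, 50) * 2,
                         "x2": np.linspace(1, 50, 50) * 2 + 1})

fig, ax = plt.subplots()
# One y value per segment; all zeros puts every segment on the same level.
y = np.zeros(len(df_lines))
# A single call draws all segments and returns one LineCollection.
lines = ax.hlines(y, df_lines["x1"].to_numpy(), df_lines["x2"].to_numpy())
n_segments = len(lines.get_segments())
```

Unlike the question's loop, this does not truncate x1 and x2 to integers, so the endpoints are drawn exactly as stored in the dataframe.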
I would like to create the following histogram (see image below) taken from the book "Think Stats". However, I cannot get them on the same plot. Each DataFrame takes its own subplot.
I have the following code:
import nsfg
import matplotlib.pyplot as plt
df = nsfg.ReadFemPreg()
preg = nsfg.ReadFemPreg()
live = preg[preg.outcome == 1]
first = live[live.birthord == 1]
others = live[live.birthord != 1]
#fig = plt.figure()
#ax1 = fig.add_subplot(111)
first.hist(column = 'prglngth', bins = 40, color = 'teal', \
alpha = 0.5)
others.hist(column = 'prglngth', bins = 40, color = 'blue', \
alpha = 0.5)
plt.show()
The above code does not work when I use ax = ax1 as suggested in "pandas multiple plots not working as hists", nor does the example in "Overlaying multiple histograms using pandas" do what I need. When I use the code as it is, it creates two windows with histograms. Any ideas how to combine them?
Here's an example of how I'd like the final figure to look:
As far as I can tell, pandas can't handle this situation. That's ok since all of their plotting methods are for convenience only. You'll need to use matplotlib directly. Here's how I do it:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas
#import seaborn
#seaborn.set(style='ticks')
np.random.seed(0)
df = pandas.DataFrame(np.random.normal(size=(37,2)), columns=['A', 'B'])
fig, ax = plt.subplots()
a_heights, a_bins = np.histogram(df['A'])
b_heights, b_bins = np.histogram(df['B'], bins=a_bins)
width = (a_bins[1] - a_bins[0])/3
ax.bar(a_bins[:-1], a_heights, width=width, facecolor='cornflowerblue')
ax.bar(b_bins[:-1]+width, b_heights, width=width, facecolor='seagreen')
#seaborn.despine(ax=ax, offset=10)
And that gives me:
In case anyone wants to plot one histogram over another (rather than alternating bars) you can simply call .hist() consecutively on the series you want to plot:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas
np.random.seed(0)
df = pandas.DataFrame(np.random.normal(size=(37,2)), columns=['A', 'B'])
df['A'].hist()
df['B'].hist()
This gives you:
Note that the order in which you call .hist() matters (the first one will be at the back).
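When two histograms are overlaid like this, each call picks its own bin edges by default, which can make the bars hard to compare. A possible refinement, sketched with synthetic data (the sample sizes, the shift of 0.5, and the choice of 20 bins are arbitrary assumptions), is to compute one set of edges from both samples and pass it to both calls:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; remove for interactive use
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(loc=0.5, size=150)

# Shared edges computed over the pooled data, so both histograms line up.
edges = np.histogram_bin_edges(np.concatenate([a, b]), bins=20)

fig, ax = plt.subplots()
# alpha makes the histogram at the back visible through the one in front.
counts_a, _, _ = ax.hist(a, bins=edges, alpha=0.5, label="A")
counts_b, _, _ = ax.hist(b, bins=edges, alpha=0.5, label="B")
ax.legend()
```

This is the same idea as passing `bins=a_bins` to the second `np.histogram` call in the earlier answer, applied to overlaid rather than side-by-side bars.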
A quick solution is to use melt() from pandas and then plot with seaborn.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# make dataframe
df = pd.DataFrame(np.random.normal(size=(200,2)), columns=['A', 'B'])
# plot melted dataframe in a single command
sns.histplot(df.melt(), x='value', hue='variable',
multiple='dodge', shrink=.75, bins=20);
Setting multiple='dodge' makes it so the bars are side-by-side, and shrink=.75 makes it so the pair of bars take up 3/4 of the whole bin.
To help understand what melt() did, these are the dataframes df and df.melt():
From the pandas website (http://pandas.pydata.org/pandas-docs/stable/visualization.html#visualization-hist):
df4 = pd.DataFrame({'a': np.random.randn(1000) + 1, 'b': np.random.randn(1000),
'c': np.random.randn(1000) - 1}, columns=['a', 'b', 'c'])
plt.figure();
df4.plot(kind='hist', alpha=0.5)
Make two dataframes and draw both on one Matplotlib axis:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
df1 = pd.DataFrame({
'data1': np.random.randn(10),
'data2': np.random.randn(10)
})
df2 = df1.copy()
fig, ax = plt.subplots()
df1.hist(column=['data1'], ax=ax)
df2.hist(column=['data2'], ax=ax)
Here is the snippet. In my case, I explicitly specified bins and range, as I didn't handle outlier removal the way the author of the book did.
fig, ax = plt.subplots()
ax.hist([first.prglngth, others.prglngth], 10, (27, 50), histtype="bar", label=("First", "Other"))
ax.set_title("Histogram")
ax.legend()
Refer to the Matplotlib multihist plot with different sizes example.
This can be done more concisely:
plt.hist([first.prglngth, others.prglngth], bins = 40, color =('teal','blue'), label=("First", "Other"))
plt.legend(loc='best')
Note that as the number of bins increases, the plot may become visually cluttered.
You could also check out the pandas.DataFrame.plot.hist() function, which plots the histogram of each column of the dataframe in the same figure.
Visibility is limited, though, so check whether it helps!
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.hist.html
I have a data set I wish to plot as scatter plot with matplotlib, and a vector the same size that categorizes and labels the data points (discretely, e.g. from 0 to 3). I want to use different markers for different labels (e.g. 'x' for 0, 'o' for 1 and so on). How can I solve this elegantly? I am quite sure I am just missing out on something, but didn't really find it, and my naive approaches failed so far...
What about iterating over all markers like this:
import numpy as np
import matplotlib.pyplot as plt
x = np.random.rand(100)
y = np.random.rand(100)
category = np.random.randint(0, 4, 100)  # integer labels 0..3 (upper bound is exclusive)
markers = ['s', 'o', 'h', '+']
for k, m in enumerate(markers):
    i = (category == k)
    plt.scatter(x[i], y[i], marker=m)
plt.show()
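If the labels should also appear in a legend, the same loop can be driven by an explicit label-to-marker mapping. This is only a sketch: the dictionary `marker_for` and the "class N" legend text are made up for illustration, and the markers for labels 2 and 3 are arbitrary picks.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; remove for interactive use
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(100)
y = rng.random(100)
category = rng.integers(0, 4, size=100)  # labels 0..3

# Explicit mapping from label to marker ('x' for 0, 'o' for 1, as in the question).
marker_for = {0: "x", 1: "o", 2: "s", 3: "^"}

fig, ax = plt.subplots()
# One scatter call per label that actually occurs in the data.
for label in np.unique(category):
    mask = category == label
    ax.scatter(x[mask], y[mask], marker=marker_for[int(label)],
               label=f"class {label}")
ax.legend()
```

Looping over `np.unique(category)` rather than over the marker list means a label that never occurs produces no empty legend entry.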
Matplotlib does not accept different markers within a single scatter call.
However, a less verbose and more robust solution for large datasets is to use the pandas and seaborn libraries:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
x = [48.959, 49.758, 49.887, 50.593, 50.683 ]
y = [122.310, 121.29, 120.525, 120.252, 119.509]
z = [136.993, 133.128, 143.710, 129.088, 139.860]
kmean = np.array([0, 1, 0, 2, 2])
df = pd.DataFrame({'x':x,'y':y,'z':z, 'km_z':kmean})
sns.scatterplot(data = df, x='x', y='y', hue='km_z', style='km_z')
which produces the following output
Additionally, you can use the pandas.cut function to plot bins (it's something I regularly need when producing graphs where a third continuous value serves as a parameter). It is used like this:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
x = [48.959, 49.758, 49.887, 50.593, 50.683 ]
y = [122.310, 121.29, 120.525, 120.252, 119.509]
z = [136.993, 133.128, 143.710, 129.088, 139.860]
df = pd.DataFrame({'x':x,'y':y,'z':z})
df['bins'] = pd.cut(df.z, bins=3)
sns.scatterplot(data = df, x='x', y='y', hue='bins', style='bins')
and it produces the following example:
I've used the latter method to produce graphs like the following:
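To see what pandas.cut hands to seaborn in the snippet above, it can help to inspect the binned column on its own. The sketch below reuses the z values from the answer; the `labels` argument and the names "low", "mid", and "high" are additions for illustration only.

```python
import pandas as pd

z = pd.Series([136.993, 133.128, 143.710, 129.088, 139.860])

# bins=3 splits the range of z into three equal-width intervals and
# returns a categorical Series of Interval objects.
bins = pd.cut(z, bins=3)

# Optional: replace the interval notation with readable names, which
# also gives shorter legend entries in the plot.
labelled = pd.cut(z, bins=3, labels=["low", "mid", "high"])
```

Passing `labelled` instead of `bins` as the hue and style column should give the same grouping with friendlier legend text.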