1D multiple lines plot with pandas - python

I have a dataframe with x1 and x2 columns. I want to plot each row as a one-dimensional line where x1 is the start and x2 is the end. My solution is below; it is not very elegant, and it is slow when plotting 900 lines in the same plot.
Create some example data:
import numpy as np
import pandas as pd
df_lines = pd.DataFrame({'x1': np.linspace(1,50,50)*2, 'x2': np.linspace(1,50,50)*2+1})
My solution:
import matplotlib.pyplot as plt
def plot(dataframe):
    plt.figure()
    for item in dataframe.iterrows():
        x1 = int(item[1]['x1'])
        x2 = int(item[1]['x2'])
        plt.hlines(0,x1,x2)
plot(df_lines)
It actually works but I think it could be improved. Thanks in advance.

You can use DataFrame.apply with axis=1 to process the frame row by row:
def plot(dataframe):
    plt.figure()
    dataframe.apply(lambda x: plt.hlines(0,x['x1'],x['x2']), axis=1)
plot(df_lines)

Matplotlib can save a lot of time drawing lines when they are organized in a LineCollection. Instead of drawing 50 individual hlines, as the other answers do, you create one single object.
Such a LineCollection requires an array of the line vertices as input; it needs to be of shape (number of lines, points per line, 2), so in this case (50, 2, 2).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.collections import LineCollection
df_lines = pd.DataFrame({'x1': np.linspace(1,50,50)*2,
                         'x2': np.linspace(1,50,50)*2+1})
segs = np.zeros((len(df_lines), 2,2))
segs[:,:,0] = df_lines[["x1","x2"]].values
fig, ax = plt.subplots()
line_segments = LineCollection(segs)
ax.add_collection(line_segments)
ax.set_xlim(0,102)
ax.set_ylim(-1,1)
plt.show()
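As a possible shortcut, Axes.hlines itself accepts array-like arguments and returns a single LineCollection, so the same idea can be sketched without building the vertex array by hand. This is only a minimal sketch: passing y as an explicit array of zeros (same length as the data) is a choice made here to keep the shapes unambiguous, and the Agg backend line is an assumption so that the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; omit for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df_lines = pd.DataFrame({'x1': np.linspace(1, 50, 50) * 2,
                         'x2': np.linspace(1, 50, 50) * 2 + 1})

fig, ax = plt.subplots()
# One call for all rows: y, xmin and xmax are equal-length arrays.
y = np.zeros(len(df_lines))
collection = ax.hlines(y, df_lines['x1'].to_numpy(), df_lines['x2'].to_numpy())
ax.set_ylim(-1, 1)
```

The returned object holds one segment per dataframe row, which is what makes this cheaper than calling hlines once per row.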

To add to the nice answer by @jezrael: this can also be done in the numpy framework using numpy.apply_along_axis. Performance-wise it is equivalent to DataFrame.apply:
def plot(dataframe):
    plt.figure()
    np.apply_along_axis(lambda x: plt.hlines(0,x[0],x[1]), 1,dataframe.values)
    plt.show()
plot(df_lines)

Related

Transposing x and y axes with matplotlib and pandas

I'm trying to use a bar chart to visualize my csv data. The data looks like this:
question,count_1,count_2,count_3,count_4,count_5
Q1,0,0,6,0,0
Q2,6,0,0,0,0
Q3,3,2,1,0,0
Q4,0,0,6,0,0
Q5,6,0,0,0,0
Q6,0,6,0,0,0
Q7,6,0,0,0,0
Q8,0,0,0,5,1
Q9,1,4,0,0,1
Q10,0,0,1,5,0
Here is my code
import pandas as pd
import csv
import matplotlib.pyplot as plt
df = pd.read_csv('example.csv')
ax = df.set_index(['question']).plot.bar(stacked=True)
ax.legend(loc='best')
plt.show()
Which gives me:
What I'm trying to do is flip the x and y axes. I want the bars to be horizontal and y axis to be the questions. I tried to transpose my data frame using:
ax = df.set_index(['question']).T.plot.bar(stacked=True)
but that gives me:
which is not what I want. Can anyone help?
To get horizontal bars (flipping the x and y axes), you need to use barh (horizontal bar) instead of bar. So, the code would be:
import pandas as pd
import csv
import matplotlib.pyplot as plt
df = pd.read_csv('example.csv')
ax = df.set_index(['question']).plot.barh(stacked=True)
ax.legend(loc='best')
plt.show()
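For anyone without example.csv at hand, here is a small self-contained sketch of the same barh call. It assumes the first three rows of the data from the question, read from an in-memory string instead of a file; the invert_yaxis call is an optional addition (not part of the original answer) that puts Q1 at the top, and the Agg backend line is only there so the snippet runs without a display.

```python
import io
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; omit for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# First rows of the question's data, inlined for illustration.
csv_text = """question,count_1,count_2,count_3,count_4,count_5
Q1,0,0,6,0,0
Q2,6,0,0,0,0
Q3,3,2,1,0,0
"""
df = pd.read_csv(io.StringIO(csv_text))

# Questions go on the y axis, stacked counts extend along x.
ax = df.set_index('question').plot.barh(stacked=True)
ax.invert_yaxis()  # optional: list Q1 at the top instead of the bottom
ax.legend(loc='best')
```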

Howto force Pandas and native matplotlib to share axis

Hi folks,
Consider the following example
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
fig, (ax1,ax2) = plt.subplots(2,1)
dates = pd.date_range("2018-01-01","2019-01-01",freq = "1d")
x = pd.DataFrame(index = dates, data = np.linspace(0,1,len(dates)) )
x.plot(ax=ax1)
y = np.random.random([len(dates),100]) * x.values
ax2.pcolormesh(range(len(x)), np.linspace(-1,1,100), y.T)
plt.show()
At this point, I would like both axes (ax1, ax2) to share the x-axis, i.e. to display proper pandas dates on the second axis. sharex=True does not seem to work. How can I achieve that? I tried different possibilities, none of which worked.
Edit: Since the pandas date formatting is superior to the native matplotlib formatting, please provide a solution where pandas date formatting is used (for instance, zooming in an interactive environment works much better with pandas date formatting). Thank you!
One way to do it would be to do all the plotting with matplotlib; this way there are no problems with the different time formats being used:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
fig, (ax1,ax2) = plt.subplots(2,1, sharex='col')
dates = pd.date_range("2018-01-01","2019-01-01",freq = "1d")
x = pd.DataFrame(index = dates, data = np.linspace(0,1,len(dates)) )
#x.plot(ax=ax1)
ax1.plot(x.index, x.values)
y = np.random.random([len(dates),100]) * x.values
ax2.pcolormesh(x.index, np.linspace(-1,1,100), y.T)
fig.tight_layout()
plt.show()
This gives the following plot:
What seems to work fine is to first plot the same line into the axes that should host the image, then plot the image, then remove the line again. What this does is that it tells pandas to apply its locators and formatters to that axes; they will stay after removing the line.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
fig, (ax1,ax2) = plt.subplots(2,1, sharex=True)
dates = pd.date_range("2018-01-01","2019-01-01",freq = "1d")
x = pd.DataFrame(index = dates, data = np.linspace(0,1,len(dates)) )
x.plot(ax=ax1)
y = np.random.random([len(dates),100]) * x.values
x.plot(ax=ax2, legend=False)
ax2.pcolormesh(dates, np.linspace(-1,1,100), y.T)
ax2.lines[0].remove()
plt.show()
Note that there may be caveats of this solution when zooming or panning. Consider it more like a hack and use it as long as it works, but don't blame anyone once it doesn't.

plotting multiple histograms in grid

I am running the following code to draw histograms in a 3 by 3 grid for 9 variables. However, it plots only one variable.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
def draw_histograms(df, variables, n_rows, n_cols):
    fig=plt.figure()
    for i, var_name in enumerate(variables):
        ax=fig.add_subplot(n_rows,n_cols,i+1)
        df[var_name].hist(bins=10,ax=ax)
        plt.title(var_name+"Distribution")
        plt.show()
You're adding subplots correctly but you call plt.show for each added subplot which causes what has been drawn so far to be shown, i.e. one plot. If you're for instance plotting inline in IPython you will only see the last plot drawn.
Matplotlib provides some nice examples of how to use subplots.
Your problem can be fixed like this:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
def draw_histograms(df, variables, n_rows, n_cols):
    fig=plt.figure()
    for i, var_name in enumerate(variables):
        ax=fig.add_subplot(n_rows,n_cols,i+1)
        df[var_name].hist(bins=10,ax=ax)
        ax.set_title(var_name+" Distribution")
    fig.tight_layout() # Improves appearance a bit.
    plt.show()
test = pd.DataFrame(np.random.randn(30, 9), columns=map(str, range(9)))
draw_histograms(test, test.columns, 3, 3)
Which gives a plot like:
If you don't care much about titles, here's a one-liner:
df = pd.DataFrame(np.random.randint(10, size=(100, 9)))
df.hist(color='k', alpha=0.5, bins=10)
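Another way to lay out such a grid, sketched here as an alternative rather than as what the answers above use, is to create all axes up front with plt.subplots and iterate over them. The function name draw_histogram_grid and the column names v0..v8 are made up for this illustration, and the Agg backend line is only there so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; omit for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

def draw_histogram_grid(df, n_rows, n_cols):
    """Draw one histogram per column on an n_rows x n_cols grid (hypothetical helper)."""
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(9, 9), squeeze=False)
    flat_axes = axes.ravel()
    for ax, col in zip(flat_axes, df.columns):
        ax.hist(df[col].dropna(), bins=10)
        ax.set_title(f"{col} Distribution")
    # Hide any grid cells that have no column to show.
    for ax in flat_axes[len(df.columns):]:
        ax.set_visible(False)
    fig.tight_layout()
    return fig, axes

rng = np.random.default_rng(0)
test = pd.DataFrame(rng.normal(size=(30, 9)),
                    columns=[f"v{i}" for i in range(9)])
fig, axes = draw_histogram_grid(test, 3, 3)
```

A single plt.show() (or fig.savefig) after the call then displays the whole grid at once.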

Multiple histograms in Pandas

I would like to create the following histogram (see image below) taken from the book "Think Stats". However, I cannot get them on the same plot. Each DataFrame takes its own subplot.
I have the following code:
import nsfg
import matplotlib.pyplot as plt
df = nsfg.ReadFemPreg()
preg = nsfg.ReadFemPreg()
live = preg[preg.outcome == 1]
first = live[live.birthord == 1]
others = live[live.birthord != 1]
#fig = plt.figure()
#ax1 = fig.add_subplot(111)
first.hist(column = 'prglngth', bins = 40, color = 'teal', \
           alpha = 0.5)
others.hist(column = 'prglngth', bins = 40, color = 'blue', \
            alpha = 0.5)
plt.show()
The above code does not work when I use ax = ax1 as suggested in "pandas multiple plots not working as hists", nor does the example in "Overlaying multiple histograms using pandas" do what I need. When I use the code as it is, it creates two windows with histograms. Any ideas how to combine them?
Here's an example of how I'd like the final figure to look:
As far as I can tell, pandas can't handle this situation. That's ok since all of their plotting methods are for convenience only. You'll need to use matplotlib directly. Here's how I do it:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas
#import seaborn
#seaborn.set(style='ticks')
np.random.seed(0)
df = pandas.DataFrame(np.random.normal(size=(37,2)), columns=['A', 'B'])
fig, ax = plt.subplots()
a_heights, a_bins = np.histogram(df['A'])
b_heights, b_bins = np.histogram(df['B'], bins=a_bins)
width = (a_bins[1] - a_bins[0])/3
ax.bar(a_bins[:-1], a_heights, width=width, facecolor='cornflowerblue')
ax.bar(b_bins[:-1]+width, b_heights, width=width, facecolor='seagreen')
#seaborn.despine(ax=ax, offset=10)
And that gives me:
In case anyone wants to plot one histogram over another (rather than alternating bars) you can simply call .hist() consecutively on the series you want to plot:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas
np.random.seed(0)
df = pandas.DataFrame(np.random.normal(size=(37,2)), columns=['A', 'B'])
df['A'].hist()
df['B'].hist()
This gives you:
Note that the order in which you call .hist() matters (the first one will be at the back).
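When overlaying two histograms this way, each call picks its own bin edges by default, so the bars may not line up. A minimal sketch of one way around that, assuming shared edges computed with numpy.histogram_bin_edges over both columns and some transparency so the histogram at the back stays visible (the Agg backend line is only there so the snippet runs without a display):

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; omit for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(37, 2)), columns=['A', 'B'])

# Common bin edges computed over the values of both columns.
edges = np.histogram_bin_edges(df[['A', 'B']].to_numpy().ravel(), bins=10)

fig, ax = plt.subplots()
ax.hist(df['A'], bins=edges, alpha=0.5, label='A')
ax.hist(df['B'], bins=edges, alpha=0.5, label='B')
ax.legend()
```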
A quick solution is to use melt() from pandas and then plot with seaborn.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# make dataframe
df = pd.DataFrame(np.random.normal(size=(200,2)), columns=['A', 'B'])
# plot melted dataframe in a single command
sns.histplot(df.melt(), x='value', hue='variable',
             multiple='dodge', shrink=.75, bins=20);
Setting multiple='dodge' makes it so the bars are side-by-side, and shrink=.75 makes it so the pair of bars take up 3/4 of the whole bin.
To help understand what melt() did, these are the dataframes df and df.melt():
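As a rough illustration of the reshaping, here is a tiny frame (values invented for the example) before and after melt(). With no arguments, melt() stacks all columns into two: 'variable' holds the original column name and 'value' holds the cell value.

```python
import pandas as pd

# Small made-up frame: wide format, one column per series.
df = pd.DataFrame({'A': [1.0, 2.0, 3.0], 'B': [10.0, 20.0, 30.0]})

# Long format: one row per (column name, value) pair.
melted = df.melt()

print(df)
print(melted)
```

The long format is what lets seaborn use the 'variable' column as hue.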
From the pandas website (http://pandas.pydata.org/pandas-docs/stable/visualization.html#visualization-hist):
df4 = pd.DataFrame({'a': np.random.randn(1000) + 1, 'b': np.random.randn(1000),
                    'c': np.random.randn(1000) - 1}, columns=['a', 'b', 'c'])
plt.figure();
df4.plot(kind='hist', alpha=0.5)
You can make two dataframes and plot both on one matplotlib axis:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
df1 = pd.DataFrame({
    'data1': np.random.randn(10),
    'data2': np.random.randn(10)
})
df2 = df1.copy()
fig, ax = plt.subplots()
df1.hist(column=['data1'], ax=ax)
df2.hist(column=['data2'], ax=ax)
Here is the snippet. In my case I explicitly specified the bins and range, since I didn't handle outlier removal the way the author of the book did.
fig, ax = plt.subplots()
ax.hist([first.prglngth, others.prglngth], 10, (27, 50), histtype="bar", label=("First", "Other"))
ax.set_title("Histogram")
ax.legend()
Refer to the Matplotlib multihist plot with different sizes example.
This could be done more briefly:
plt.hist([First, Other], bins = 40, color =('teal','blue'), label=("First", "Other"))
plt.legend(loc='best')
Note that as the number of bins increases, the plot may become visually cluttered.
You could also check out the pandas.DataFrame.plot.hist() function, which plots the histogram of each column of the dataframe in the same figure.
Visibility is limited, though, but you can check whether it helps!
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.hist.html

Python scatter-plot: Conditions for marker styles?

I have a data set I wish to plot as scatter plot with matplotlib, and a vector the same size that categorizes and labels the data points (discretely, e.g. from 0 to 3). I want to use different markers for different labels (e.g. 'x' for 0, 'o' for 1 and so on). How can I solve this elegantly? I am quite sure I am just missing out on something, but didn't really find it, and my naive approaches failed so far...
What about iterating over all markers like this:
import numpy as np
import matplotlib.pyplot as plt
x = np.random.rand(100)
y = np.random.rand(100)
category = np.random.randint(0, 4, 100)  # labels 0..3; random_integers is deprecated
markers = ['s', 'o', 'h', '+']
for k, m in enumerate(markers):
    i = (category == k)
    plt.scatter(x[i], y[i], marker=m)
plt.show()
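A small variation on the same loop, sketched here with an explicit label-to-marker dictionary so that the mapping is readable and each class gets a legend entry. The name marker_for and the particular marker choices are made up for the illustration, and the Agg backend line is only there so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; omit for interactive use
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(100)
y = rng.random(100)
category = rng.integers(0, 4, size=100)  # labels 0..3 (upper bound exclusive)

# Hypothetical mapping from label to marker style.
marker_for = {0: 'x', 1: 'o', 2: 's', 3: '^'}

fig, ax = plt.subplots()
for label, marker in marker_for.items():
    mask = category == label  # boolean mask selecting one class
    ax.scatter(x[mask], y[mask], marker=marker, label=f"class {label}")
ax.legend()
```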
Matplotlib does not accept different markers within a single scatter call.
However, a less verbose and more robust solution for large datasets is to use the pandas and seaborn libraries:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
x = [48.959, 49.758, 49.887, 50.593, 50.683 ]
y = [122.310, 121.29, 120.525, 120.252, 119.509]
z = [136.993, 133.128, 143.710, 129.088, 139.860]
kmean = np.array([0, 1, 0, 2, 2])
df = pd.DataFrame({'x':x,'y':y,'z':z, 'km_z':kmean})
sns.scatterplot(data = df, x='x', y='y', hue='km_z', style='km_z')
which produces the following output
Additionally, you can use the pandas.cut function to plot bins (it's something I regularly need when producing graphs where a third continuous value is used as a parameter). It is used like this:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
x = [48.959, 49.758, 49.887, 50.593, 50.683 ]
y = [122.310, 121.29, 120.525, 120.252, 119.509]
z = [136.993, 133.128, 143.710, 129.088, 139.860]
df = pd.DataFrame({'x':x,'y':y,'z':z})
df['bins'] = pd.cut(df.z, bins=3)
sns.scatterplot(data = df, x='x', y='y', hue='bins', style='bins')
and it produces the following example:
I've used the latter method to produce graphs like the following:
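To see what pandas.cut does to the z column on its own, here is a minimal sketch using the same five values. The labels 'low', 'mid' and 'high' are an addition for readability and are not part of the answer above, which keeps the default interval labels.

```python
import pandas as pd

z = pd.Series([136.993, 133.128, 143.710, 129.088, 139.860])

# Split the range of z into 3 equal-width bins and name them.
bins = pd.cut(z, bins=3, labels=['low', 'mid', 'high'])

print(bins.tolist())
```

The result is a categorical Series, which is why it can be passed directly as hue and style in sns.scatterplot.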
