modifying scipy stats.probplot plotting function with matplotlib - python

I am not an expert with matplotlib, so I am having a hard time setting the plot parameters of scipy's stats.probplot.
My code takes a pandas df, iterates over its columns, and attempts to plot the values of each column using the stats.probplot function. This is my code:
plt.figure(figsize=(10,5))
for col in model_predictions.columns:
    res = stats.probplot(df[col], plot=plt)
    plt.legend = col
plt.show()
This generates the charts I want, but they are difficult to read (no legends, same colors). While still plotting them on top of each other, I would like to draw each line in a different color and add a legend entry for each line equal to the str in col. Is there any way to do this?
I can always take the tuple output of the function, run it through a new function, and add the outputs to a new pandas df (to plot later with more control), but I was wondering if there is a quicker way.
Thanks

You can plot them manually by taking the output of stats.probplot, i.e.:
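import matplotlib.pyplot as plt
from scipy import stats
for col in model_predictions.columns:
    # probplot returns ((theoretical quantiles, ordered values), fit); plot the first pair
    plt.plot(*stats.probplot(df[col])[0], label=col)
plt.legend(loc='best')
plt.show()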
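If the least-squares reference line that probplot normally draws is also wanted, the second element of its return value holds the slope and intercept. The following is only a sketch under assumptions: the data is synthetic and the column names model_a and model_b are made up for illustration, standing in for model_predictions.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for the real data (hypothetical column names)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "model_a": rng.normal(0, 1, size=50),
    "model_b": rng.normal(2, 3, size=50),
})

fig, ax = plt.subplots(figsize=(10, 5))
for col in df.columns:
    # (osm, osr) are the theoretical quantiles and the ordered sample values
    (osm, osr), (slope, intercept, r) = stats.probplot(df[col])
    # Markers pick up the next color in the cycle and carry the legend label
    (points,) = ax.plot(osm, osr, "o", label=col)
    # Draw the fitted line in the same color, without a legend entry
    ax.plot(osm, slope * osm + intercept, color=points.get_color())
ax.legend(loc="best")
```

Call plt.show() afterwards to display the figure. Each column ends up with one marker series and one fit line sharing a color, and only the marker series appears in the legend.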

Related

How to align bars with tick labels in plt or pandas histogram (when plotting multiple columns)

I have started using python for lots of data problems at work and the datasets are always slightly different. I'm trying to explore more efficient ways of plotting data using the inbuilt pandas function rather than individually writing out the code for each column and editing the formatting to get a nice result.
Background: I'm using Jupyter notebook and looking at histograms where the values are all unique integers.
Problem: I want the xtick labels to align with the centers of the histogram bars when plotting multiple columns of data with the one function e.g. df.hist() to get histograms of all columns at once.
Does anyone know if this is possible?
Or is it recommended to do each graph on its own vs. using the inbuilt function applied to all columns?
I can modify them individually following this post: Matplotlib xticks not lining up with histogram
which gives me what I would like but only for one graph and with some manual processing of the values.
Desired outcome example for one graph:
Basic example of data I have:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# create list of datapoints
data = [[170,30,210],
[170,50,200],
[180,50,210],
[165,35,180],
[170,30,190],
[170,70,190],
[170,50,190]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['height', 'width','weight'])
# print dataframe.
df
Code that displays the graphs in the problem statement
df.hist(figsize=(5,5))
plt.show()
Code that displays the graph for weight how I would like it to be for all
df.hist(column='weight',bins=[175,185,195,205,215])
plt.xticks([180,190,200,210])
plt.yticks([0,1,2,3,4,5])
plt.xlim([170, 220])
plt.show()
Any tips or help would be much appreciated!
Thanks
I hope this helps. Take the column and count the frequency of each value (value_counts), then call sort_index so that the result is ordered by the value rather than by the frequency, then draw the bar plot.
data = [[170,30,210],
[170,50,200],
[180,50,210],
[165,35,180],
[170,30,190],
[170,70,190],
[170,50,190]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['height', 'width','weight'])
df.weight.value_counts().sort_index().plot(kind = 'bar')
plt.show()
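To cover every column at once, as the question asks, the same value_counts approach can be looped over a row of subplots. This is a sketch rather than a built-in pandas feature; the figure size and the one-row layout are arbitrary choices.

```python
import matplotlib.pyplot as plt
import pandas as pd

data = [[170, 30, 210],
        [170, 50, 200],
        [180, 50, 210],
        [165, 35, 180],
        [170, 30, 190],
        [170, 70, 190],
        [170, 50, 190]]
df = pd.DataFrame(data, columns=['height', 'width', 'weight'])

# One subplot per column, laid out side by side
fig, axes = plt.subplots(1, len(df.columns), figsize=(9, 3))
for ax, col in zip(axes, df.columns):
    # Count each distinct value and order the bars by value, not by frequency
    counts = df[col].value_counts().sort_index()
    counts.plot(kind='bar', ax=ax, title=col)
fig.tight_layout()
```

Because a bar plot treats each distinct value as a category, the tick labels always sit under the bar centers. The trade-off is that the x axis is no longer numeric, so values that never occur leave no gap between bars.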

When I plot a graph using matplotlib, how do I increase the linewidth (or make bold) of one of the lines?

I have a time series with many columns, stored in a dataframe called cum_returns. I am currently plotting all of the graphs using cum_returns.plot()
Let's say I want to make the graphs for columns A, C and F darker (or rather, increase the line width for those 3 time series). Is there an easy way to do that?
If you are using the command cum_returns.plot() to plot everything, you can access each of the lines on the plot using:
import matplotlib.pyplot as plt
lines = plt.gca().lines
Then find out which line you want to edit, i.e. col 'A' is lines[0], 'B' is lines[1], etc.
Then change the linewidth using:
lines[x].set_linewidth(width) #x is the index of the line you want to edit, width is the new width
You can also call dir(lines[x]) to get a full list of the things you can do to it.
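Instead of working out the index of each line by hand, the lines can be matched by their label, since pandas labels each line with its column name. A minimal sketch, assuming a synthetic cum_returns with columns A to F and an arbitrary width of 3:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the real cum_returns dataframe
rng = np.random.default_rng(0)
cum_returns = pd.DataFrame(
    rng.normal(size=(100, 6)).cumsum(axis=0), columns=list("ABCDEF")
)

ax = cum_returns.plot()
highlight = {"A", "C", "F"}
for line in ax.get_lines():
    # The label of each line is the name of the column it was drawn from
    if line.get_label() in highlight:
        line.set_linewidth(3)
```

The legend was drawn before the widths changed, so call ax.legend() again if the legend entries should show the thicker lines too.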

python plot how to adjust a lengthy legend [duplicate]

I have a data file which consists of 131 columns and 4 rows. I am plotting it in python as follows:
df = pd.read_csv('data.csv')
df.plot(figsize = (15,10))
Once it is plotted, all 131 legend entries are stacked together like a huge tower over the line plots.
Please see the image here, which I have got :
Link to Image, I have clipped after v82 for better understanding
I have found some solutions on Stackoverflow (SO) for moving the legend anywhere in the plot, but I could not find any solution for breaking this legend tower into multiple smaller pieces and stacking them beside one another.
Moreover, I want my plot to look something like this
My desired plot :
Any help would be appreciated. Thank you.
You can specify the position of the legend in relative (axes) coordinates using loc, and use the ncol parameter to split the single legend column into multiple columns. To do so, you need the axes handle returned by df.plot:
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv('data.csv')
ax = df.plot(figsize = (10,7))
ax.legend(loc=(1.01, 0.01), ncol=4)
plt.tight_layout()
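For a self-contained variant that can be run without data.csv, here is a sketch that places the legend below the axes instead. The random data, the v1 to v131 column names, the ten legend columns and the anchor position are all assumptions chosen for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for data.csv: 4 rows, 131 columns named v1..v131
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((4, 131)),
                  columns=[f"v{i}" for i in range(1, 132)])

# Skip the default legend, then build one with several columns below the axes
ax = df.plot(figsize=(15, 10), legend=False)
legend = ax.legend(loc="upper center", bbox_to_anchor=(0.5, -0.05),
                   ncol=10, fontsize="small")
```

When saving, passing bbox_inches="tight" to savefig should keep a legend that sits outside the axes from being cut off.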

How to show some selected rows with FacetGrid

I have a dataframe with a column called "my_row". It has many values. I only want to see the data on the FacetGrid that belongs to specific values of "my_row" on the rows. I tried to make a subset of my dataframe and visualize that, but somehow seaborn still "knows" that my original dataframe had more values in the "my_row" column and shows empty plots for the rows that I don't want.
So the following code still gives me a figure with the 2 rows of data that I want, followed by many empty plots.
X = df[(df['my_row']=='1') | (df['my_row']=='2')].copy()
g = sns.FacetGrid(X, row='my_row', col='column')
How can I tell python to plot just those 2 rows?
I get plots like this, with many empty plots:
I cannot reproduce this. The code from the question seems to work fine. Here we have a dataframe with four different values in the my_row column. Then filtering out two of them creates a FacetGrid with only two rows.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
df = pd.DataFrame({"my_row" : np.random.choice(list("1234"), size=40),
"column" : np.random.choice(list("AB"), size=40),
"x" : np.random.rand(40),
"y" : np.random.rand(40)})
X = df[(df['my_row']=='1') | (df['my_row']=='2')].copy()
g = sns.FacetGrid(X, row='my_row', col='column')
g.map(plt.scatter, "x", "y")
plt.show()
For anyone encountering this problem: the issue is that my_row is a categorical type, which keeps its full list of categories after filtering. To solve it, convert the column to str, i.e.
X = df[(df['my_row']=='1') | (df['my_row']=='2')].copy()
X['my_row']=X['my_row'].astype(str)
g = sns.FacetGrid(X, row='my_row', col='column')
This should now work! :)
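The leftover categories can be seen, and dropped without leaving the categorical dtype, using plain pandas. This sketch uses a small made-up dataframe and does not call seaborn, so whether FacetGrid then draws only the remaining rows is an assumption based on the explanation above.

```python
import pandas as pd

# Small made-up dataframe with a categorical my_row column
df = pd.DataFrame({
    "my_row": pd.Categorical(list("1234") * 2),
    "x": range(8),
})

X = df[df["my_row"].isin(["1", "2"])].copy()
# Filtering rows does not shrink the list of categories
kept = list(X["my_row"].cat.categories)

# Alternative to astype(str): drop the categories that no longer occur
X["my_row"] = X["my_row"].cat.remove_unused_categories()
trimmed = list(X["my_row"].cat.categories)
```

Here kept still lists all four categories while trimmed lists only the two that remain in X.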
I got inspired by this link:
Plot lower triangle in a seaborn Pairgrid
and changed my code to this:
g = sns.FacetGrid(df, row='my_row', col='column')
for i in range(2, 48):
    for j in range(0, 12):
        g.axes[i, j].set_visible(False)
So I had to iterate over each plot individually and make it invisible. But I think there should be an easier way to do this. And in the end I still don't understand how FacetGrid knows anything about the size of my original dataframe df when I use X as its input.
This is an answer that works, but I think there must be better solutions. One problem with my answer is that when I save the figure, I get a big white space in the saved plot (corresponding to the axes whose visibility I set to False) that I do not see in jupyter notebooks when I am running the code. If FacetGrid plotted only the dataframe that I give it as input (in this case X), there would be no problem. There should be a way to do that.

How to get color of most recent plotted line in pandas df.plot()

I would like to get the color of my last plot
ax = df.plot()
df2.plot(ax=ax)
# how to get the color of this last plot,
#the plot is a single timeseries, there is therefore a single color.
I know how to do it in matplotlib.pyplot, for those interested see for instance here but I can't find a way to do it in pandas. Is there something acting like get_color() in pandas?
You cannot do the same with DataFrame.plot because it doesn't return a list of Line2D objects as pyplot.plot does. But ax.get_lines() will return a list of the lines plotted in the axes so you can look at the color of the last plotted line:
ax.get_lines()[-1].get_color()
Notice (don't know if it was implicit in the answer by Goyo) that calls to pandas objects' .plot() precisely return the ax you're looking for, as in:
plt1 = pd.Series(range(2)).plot()
color = plt1.lines[-1].get_color()
pd.Series(range(2, 4)).plot(color=color)
This is not much nicer, but might allow you to avoid importing matplotlib explicitly.
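Putting the two answers together for the exact case in the question gives the following sketch. The two small dataframes and the horizontal reference line are made up for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up data: df has two series, df2 a single one
df = pd.DataFrame({"a": [1, 2, 3], "b": [3, 2, 1]})
df2 = pd.DataFrame({"c": [2, 2, 2]})

ax = df.plot()
df2.plot(ax=ax)

# The most recently plotted line is the last entry in the axes' line list
color = ax.get_lines()[-1].get_color()

# Reuse that color, here for a dashed horizontal reference line
ax.axhline(2.5, color=color, linestyle="--")
```

Note that any later plotting call on the same axes appends to the list, so read the color immediately after the plot whose color is needed.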
