plotting & formatting seaborn chart from pandas dataframe - python

I have a pandas dataframe al_df that contains the population of Alabama from a recent US census. I created a cumulative function that I plot using seaborn, resulting in this chart:
The code that relates to the plotting is this:
plt.figure(num=None, figsize=(20, 10))
plt.title('Cumulative Distribution Function for ALABAMA population')
plt.xlabel('City')
plt.ylabel('Percentage')
#sns.set_style("whitegrid", {"ytick.major.size": "0.1",})
plt.plot(al_df.pop_cum_perc)
My questions are:
1) How can I change the ticks so that the y axis shows a grid line every 0.1 units instead of the default 0.2?
2) How can I change the x axis to show the actual city names, written vertically, instead of the "rank" of the city (taken from the pandas index)? There are over 300 names, so they will not fit well horizontally.

For question 1), add:
plt.yticks(np.arange(0,1+0.1,0.1))
Question 2), I found this in the matplotlib gallery:
ticks_and_spines example code
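As a rough illustration of the tick spacing suggested above, here is a minimal sketch. The data is a made-up stand-in for al_df.pop_cum_perc, and np.linspace is used instead of np.arange only because it makes the number of ticks explicit; both are assumptions of this sketch, not part of the original answer.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; drop it for interactive use
import matplotlib.pyplot as plt
import numpy as np

# Made-up stand-in for al_df.pop_cum_perc: a cumulative share rising to 1.0.
pop_cum_perc = np.cumsum(np.full(20, 0.05))

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(pop_cum_perc)

# 11 evenly spaced ticks: 0.0, 0.1, ..., 1.0
yticks = np.linspace(0, 1, 11)
ax.set_yticks(yticks)
ax.grid(axis="y")  # one grid line per y tick
```

Calling plt.yticks(yticks) instead of ax.set_yticks(yticks) should have the same effect on the current axes.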

The Matplotlib way would be to use MultipleLocator. The second question is also straightforward:
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator
plt.plot(range(10))
ax=plt.gca()
ax.yaxis.set_major_locator(MultipleLocator(0.5))
plt.xticks(range(10), list('ABCDEFGHIJ'), rotation=90) #would be range(3xx), List_of_city_names, rotation=90
plt.savefig('temp.png')

After some research, and not being able to find a "native" seaborn solution, I came up with the code below, partially based on the suggestions from Pablo Reyes and CT Zhu, and using Matplotlib functions:
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator
plt.figure(num=None, figsize=(20, 10))
plt.title('Cumulative Distribution Function for ALABAMA population')
plt.xlabel('City')
plt.ylabel('Percentage')
plt.plot(al_df.pop_cum_perc)
# set the tick spacing of the y axis
ax = plt.gca()
ax.yaxis.set_major_locator(MultipleLocator(0.1))
# set the tick spacing and labels of the x axis, and the text orientation
ax.xaxis.set_major_locator(MultipleLocator(10))
ax.set_xticklabels(labels, rotation =90)
The solution introduces a new element, "labels", which I had to define before plotting, as an array of city names taken from my pandas DataFrame:
labels = al_df.NAME.values[:]
Producing the following chart:
This requires some tweaking, since displaying every city in the pandas DataFrame, like this:
ax.xaxis.set_major_locator(MultipleLocator(1))
produces a chart that is impossible to read (showing only the x axis):
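One thing to watch for: set_xticklabels assigns the labels, in order, to whatever tick positions currently exist, so a label can end up next to the wrong city when the locator places ticks every 10 units. A hedged alternative is to set the tick positions and the matching names together. The DataFrame below is a hypothetical stand-in for al_df with 50 invented cities, and the step of 10 is just an example value.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop it for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in for al_df: invented city names and cumulative shares.
al_df = pd.DataFrame({
    "NAME": [f"City {i}" for i in range(50)],
    "pop_cum_perc": np.linspace(0.02, 1.0, 50),
})

step = 10  # label every 10th city; tune this for readability
positions = np.arange(0, len(al_df), step)
names = al_df["NAME"].iloc[positions].tolist()  # names at exactly those positions

fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(al_df.index, al_df["pop_cum_perc"])
ax.set_xticks(positions)                # where the ticks go
ax.set_xticklabels(names, rotation=90)  # what each tick says
```

Because positions and names are sliced from the same rows, each label stays attached to its own city whatever step is chosen.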

Related

Overlaying Pandas plot with Matplotlib is sensitive to the plotting order

I have the following problem: I'm trying to overlay two plots, one pandas plot created with plot.area() from a DataFrame, and one standard Matplotlib plot. Depending on the order of the code, the Matplotlib plot is displayed only if it comes before the pandas plot.area() call on the same axes.
Example: I have a pandas DataFrame called revenue that has a DatetimeIndex and a single column with "revenue" values (float). Separately, I have a dataset called projection with data along the same index (revenue.index).
If the code looks like this:
import pandas as pd
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(10, 6))
# First -- Pandas area plot
revenue.plot.area(ax = ax)
# Second -- Matplotlib line plot
ax.plot(revenue.index, projection, color='black', linewidth=3)
plt.tight_layout()
plt.show()
Then the only thing displayed is the pandas plot.area() like this:
1/ Pandas plot.area() and 2/ Matplotlib line plot
However, if the order of the plotting is reversed:
fig, ax = plt.subplots(figsize=(10, 6))
# First -- Matplotlib line plot
ax.plot(revenue.index, projection, color='black', linewidth=3)
# Second -- Pandas area plot
revenue.plot.area(ax = ax)
plt.tight_layout()
plt.show()
Then the plots are overlaid properly, like this:
1/ Matplotlib line plot and 2/ Pandas plot.area()
Can someone please explain what I'm doing wrong, and what I need to do to make the code more robust? Thanks in advance.
The values on the x-axis are different in the two plots. I think DataFrame.plot.area() formats the DatetimeIndex in its own way, which is not compatible with pyplot.plot().
If you plot the projection first, plot.area() can still plot the data and does not reformat the x-axis.
Mixing the two seems tricky to me, so I would use either pyplot or DataFrame.plot for both the area and the line:
import pandas as pd
from matplotlib import pyplot as plt
projection = [1000, 2000, 3000, 4000]
datetime_series = pd.to_datetime(["2021-12","2022-01", "2022-02", "2022-03"])
datetime_index = pd.DatetimeIndex(datetime_series.values)
revenue = pd.DataFrame({"value": [1200, 2200, 2800, 4100]})
revenue = revenue.set_index(datetime_index)
fig, ax = plt.subplots(1, 2, figsize=(10, 4))
# Option 1: only pyplot
ax[0].fill_between(revenue.index, revenue.value)
ax[0].plot(revenue.index, projection, color='black', linewidth=3)
ax[0].set_title("Pyplot")
# Option 2: only DataFrame.plot
revenue["projection"] = projection
revenue.plot.area(y='value', ax=ax[1])
revenue.plot.line(y='projection', ax=ax[1], color='black', linewidth=3)
ax[1].set_title("DataFrame.plot")
The results then look like this, where DataFrame.plot gives a much cleaner looking result:
If you do not want the projection in the revenue DataFrame, you can put it in a separate DataFrame and set the index to match revenue:
projection_df = pd.DataFrame({"projection": projection})
projection_df = projection_df.set_index(datetime_index)
projection_df.plot.line(ax=ax[1], color='black', linewidth=3)
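If the mismatch really does come from pandas applying its own time-series handling to the x-axis, another thing that may be worth trying is the x_compat=True option of pandas plotting, which asks pandas to leave the x values as ordinary Matplotlib dates. The sketch below assumes that diagnosis and uses made-up revenue numbers; it keeps the original "pandas first, Matplotlib second" order.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop it for interactive use
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import pandas as pd

# Made-up data on a monthly DatetimeIndex.
index = pd.to_datetime(["2021-12-01", "2022-01-01", "2022-02-01", "2022-03-01"])
revenue = pd.DataFrame({"revenue": [1200.0, 2200.0, 2800.0, 4100.0]}, index=index)
projection = [1000, 2000, 3000, 4000]

fig, ax = plt.subplots(figsize=(10, 6))
# First: pandas area plot, with pandas' own time-series axis handling switched off
revenue.plot.area(ax=ax, x_compat=True)
# Second: Matplotlib line plot on the same axes
ax.plot(revenue.index, projection, color="black", linewidth=3)
fig.autofmt_xdate()  # tilt the date labels so they do not overlap
```

The date tick labels will look plainer than pandas' default formatting, which is the trade-off of this option.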

How to assign a list of colors to histogram index bars in matplotlib?

I have the following DataFrame "freqs2" with an index (SD to SD17) and associated values (frequencies):
freqs
SD 101
SD2 128
...
SD17 65
I would like to assign a specific list of colors, in order, to the index entries. I've tried the following code:
colors=['#e5243b','#DDA63A', '#4C9F38','#C5192D','#FF3A21','#26BDE2','#FCC30B','#A21942','#FD6925','#DD1367','#FD9D24','#BF8B2E','#3F7E44','#0A97D9','#56C02B','#00689D','#19486A']
freqs2.plot.bar(freqs2.index, legend=False,rot=45,width=0.85, figsize=(12, 6),fontsize=(14),color=colors )
plt.ylabel('Frequency',fontsize=(17))
As a result, all the bars in my chart are drawn in red (the first color of the list).
Based on similar questions, I tried passing "freqs2.index" to indicate that the list of colors applies to the index, but the problem stays the same.
It looks like a bug in pandas. Plotting directly in Matplotlib or using seaborn (which I recommend) works:
import matplotlib.pyplot as plt
import seaborn as sns
# df below is the DataFrame from the question (freqs2)
colors=['#e5243b','#dda63a', '#4C9F38','#C5192D','#FF3A21','#26BDE2','#FCC30B','#A21942','#FD6925','#DD1367','#FD9D24','#BF8B2E','#3F7E44','#0A97D9','#56C02B','#00689D','#19486A']
# # plotting directly with matplotlib works too:
# fig = plt.figure()
# ax = fig.add_axes([0,0,1,1])
# ax.bar(x=df.index, height=df['freqs'], color=colors)
ax = sns.barplot(data=df, x= df.index, y='freqs', palette=colors)
ax.tick_params(axis='x', labelrotation=45)
plt.ylabel('Frequency',fontsize=17)
plt.show()
Edit: an issue already exists on GitHub.
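For reference, the direct Matplotlib route mentioned in the commented-out lines could look roughly like this. The sketch uses only the three rows shown in the question and three colors from the asker's list; the bar_colors variable is added here just to confirm that each bar received its own color.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop it for interactive use
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.colors import to_hex

# The three rows shown in the question; the real freqs2 has 17 rows.
freqs2 = pd.DataFrame({"freqs": [101, 128, 65]}, index=["SD", "SD2", "SD17"])
colors = ["#e5243b", "#dda63a", "#19486a"]  # one color per bar, in index order

fig, ax = plt.subplots(figsize=(12, 6))
bars = ax.bar(freqs2.index, freqs2["freqs"], color=colors, width=0.85)
ax.tick_params(axis="x", labelrotation=45)
ax.set_ylabel("Frequency", fontsize=17)

# Read the colors back from the drawn bars.
bar_colors = [to_hex(bar.get_facecolor()) for bar in bars]
```

With the full data, the colors list needs one entry per index value, 17 in the question.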

How to use a 3rd dataframe column as x axis ticks/labels in matplotlib scatter

I'm struggling to wrap my head around matplotlib with dataframes today. I see lots of solutions but I'm struggling to relate them to my needs. I think I may need to start over. Let's see what you think.
I have a dataframe (ephem) with 4 columns - Time, Date, Altitude & Azimuth.
I produce a scatter for alt & az using:
chart = plt.scatter(ephem.Azimuth, ephem.Altitude, marker='x', color='black', s=8)
What's the most efficient way to set the values in the Time column as the labels/ticks on the x axis?
So:
the scale/gridlines etc all remain the same
the chart still plots alt and az
the y axis ticks/labels remain as is
only the x axis ticks/labels are changed to the Time column.
Thanks
This isn't by any means the cleanest piece of code but the following works for me:
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.scatter(ephem.Azimuth, ephem.Altitude, marker='x', color='black', s=8)
labels = list(ephem.Time)
ax.set_xticklabels(labels)
plt.show()
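This explicitly sets the x tick labels to the values in the DataFrame's Time column. Note that set_xticklabels assigns the labels, in order, to the tick positions that already exist, so the labels only line up with the data points if the ticks sit at those points.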
In other words, you want to change the x-axis tick labels using a list of values.
labels = ephem.Time.tolist()
# make your plot and before calling plt.show()
# insert the following two lines
ax = plt.gca()
ax.set_xticklabels(labels = labels)
plt.show()
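Since set_xticklabels on its own only renames the ticks that are already there, a variant that also places the ticks at the plotted azimuth values may be closer to what the question asks for. This is a sketch with an invented four-row ephem table, not the asker's data.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop it for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# Invented ephemeris rows standing in for the asker's ephem DataFrame.
ephem = pd.DataFrame({
    "Time": ["20:00", "21:00", "22:00", "23:00"],
    "Azimuth": [90.0, 120.0, 150.0, 180.0],
    "Altitude": [10.0, 25.0, 35.0, 40.0],
})

fig, ax = plt.subplots()
ax.scatter(ephem["Azimuth"], ephem["Altitude"], marker="x", color="black", s=8)

# Put one tick at each azimuth value and label it with the matching time.
ax.set_xticks(ephem["Azimuth"].tolist())
ax.set_xticklabels(ephem["Time"].tolist(), rotation=45)
ax.set_xlabel("Time")
```

The points are still plotted against azimuth; only the tick positions and their text change. With many rows, slicing both lists (for example with [::10]) keeps the labels legible.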

Why doesn't Subplot using Pandas show x-axis

When I plot single plots with pandas DataFrames, I get an x-axis.
However, when I make subplots and try to share the x-axis, the way I would when using NumPy arrays without pandas, there are no tick numbers.
I only want the numbers and the label to appear on the last plot, since all plots share the same x-axis.
The data loaded and the plot produced can be found here:
https://drive.google.com/open?id=1hTmTSkIcYl-usv_CCxLl8U6bAoO6tMRh
This is for combining and plotting the data logged from two different logging devices which represent the same time period.
import pandas as pd
import matplotlib.pyplot as plt
df1=pd.read_csv('data1.csv', sep=',',header=0)
df1.columns.values
cols1 = list(df1.columns.values)
df2=pd.read_csv('data2.dat', sep='\t',header=18)
df2.columns.values
cols2 = list(df2.columns.values)
start =10000
stop = 30000
fig, axes = plt.subplots(nrows=5, ncols=1, sharex=True, figsize=(10, 10))
df1.iloc[start:stop].plot(x=cols1[0], y=cols1[1], ax=axes[0])
df1.iloc[start:stop].plot(x=cols1[0], y=cols1[2], ax=axes[0])
df1.iloc[start:stop].plot(x=cols1[0], y=cols1[3], ax=axes[2])
df1.iloc[start:stop].plot(x=cols1[0], y=cols1[4], ax=axes[2])
df2.iloc[start:stop].plot(x=cols2[0], y=cols2[3], ax=axes[3])
axes[3].set_xlabel("Time [s]")
plt.show()
I expect numbers and a label on the x-axis, but instead I only get the pandas label "#timestamp".
UPDATE: I have found something that hints at the problem. I think it is caused by the two files not having identical time spacing: the first column of each file is time, sampled at roughly, but not exactly, one sample per second. If I remove the x=cols[x] parts, numbers appear on the x-axis, but then the two plots are shifted in time relative to each other, because they are plotted against the DataFrame index rather than against time.
I am currently trying to interpolate the data so that both files share the same x-axis, but I would not have expected that to be necessary.
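One way to sidestep the question of how pandas handles shared axes is to pass the time column to Matplotlib's ax.plot directly. With sharex=True, Matplotlib then shows tick numbers only on the bottom subplot, and the two time bases do not need to match. The sketch below uses two invented loggers with different sample spacing; the column names are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop it for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Two hypothetical loggers covering the same period with different spacing.
t1 = np.arange(0.0, 100.0, 1.0)
t2 = np.arange(0.5, 100.0, 1.1)
df1 = pd.DataFrame({"time": t1, "a": np.sin(t1 / 10)})
df2 = pd.DataFrame({"time": t2, "b": np.cos(t2 / 10)})

fig, axes = plt.subplots(nrows=2, ncols=1, sharex=True, figsize=(10, 6))
axes[0].plot(df1["time"], df1["a"], label="a")
axes[1].plot(df2["time"], df2["b"], label="b")
for ax in axes:
    ax.legend()

# Only the bottom subplot gets the axis label.
axes[-1].set_xlabel("Time [s]")
```

No interpolation is needed here, because each line is drawn against its own time values on a common scale.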

y-axis scaling in seaborn vs pandas

I'm plotting a scatter plot using a pandas DataFrame. This works correctly, but I wanted to use seaborn themes and special functions. When I plot the same data points with seaborn, the variation along the y-axis is almost invisible. The x-axis values range from 5000 to 15000, while the y-axis values are in [-6:6]*10^-7.
If I multiply the y-axis values by 10^6, they display correctly, but the actual values remain indistinguishable in a seaborn-generated plot.
How can I configure seaborn so that the y-axis scales automatically in the resulting plot?
Also, some rows contain NaN (not in this case). How can I ignore those when plotting, short of manually weeding out the rows that contain NaN?
Below is the code I've used to plot.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_csv("datascale.csv")
subdf = df.loc[(df.types == "easy") & (df.weight > 1300), ]
subdf = subdf.iloc[1:61, ]
subdf.drop(subdf.index[[25]], inplace=True) #row containing NaN
subdf.plot(x='length', y='speed', style='s') #scales y-axis correctly
sns.lmplot(x="length", y="speed", data=subdf, fit_reg=True, lowess=True)  # doesn't scale y-axis properly
# multiplying by 10^6 displays the plot correctly, in matplotlib
plt.scatter(subdf['length'], 10**6*subdf['speed'])
Strange that seaborn does not scale the axis correctly. Nonetheless, you can correct this behaviour. First, get a reference to the axis object of the plot:
lm = sns.lmplot(x="length", y="speed", data=subdf, fit_reg=True)
After that you can manually set the y-axis limits:
lm.axes[0, 0].set_ylim(subdf.speed.min(), subdf.speed.max())  # pandas min/max skip NaN
The result should look something like this:
Example Jupyter notebook here.
Seaborn and matplotlib should just ignore NaN values when plotting. You should be able to leave them as is.
As for the y scaling: there might be a bug in seaborn.
The most basic workaround is still to scale the data before plotting.
Scale to microspeed in the dataframe before plotting and plot microspeed instead.
subdf['microspeed']=subdf['speed']*10**6
Or transform to log y before plotting, i.e.
import math
df = pd.DataFrame({'speed':[1, 100, 10**-6]})
df['logspeed'] = df['speed'].map(lambda x: math.log(x,10))
then plot logspeed instead of speed.
Another approach would be to use seaborn regplot instead.
Matplotlib correctly scales and plots for me as follows:
plt.plot(subdf['length'], subdf['speed'], 'o')
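To make the two points above concrete (very small y values and NaN rows), here is a small Matplotlib-only sketch. The numbers are invented to match the ranges described in the question, and the 5% padding is an arbitrary choice.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop it for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Invented data in the ranges described: x in 5000-15000, y around 1e-7, one NaN row.
subdf = pd.DataFrame({
    "length": [5000.0, 7500.0, 10000.0, 12500.0, 15000.0],
    "speed": [-6e-7, -2e-7, np.nan, 3e-7, 6e-7],
})

# pandas min() and max() skip NaN by default, so the NaN row needs no special handling.
lo, hi = subdf["speed"].min(), subdf["speed"].max()
pad = 0.05 * (hi - lo)

fig, ax = plt.subplots()
ax.scatter(subdf["length"], subdf["speed"])  # the NaN point is simply not drawn
ax.set_ylim(lo - pad, hi + pad)              # explicit limits around the real data
```

The same set_ylim call can be applied to the axes returned by seaborn, for example lm.axes[0, 0] for an lmplot.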
