Plot gets shifted when using secondary_y - python

I want to plot temperature and precipitation from a weather station in the same plot with two y-axes. However, when I try this, one of the plots gets shifted for no apparent reason. This is my code (so far I have only tried it with two precipitation measurements, but you get the idea):
ax = m_prec_ra.plot()
ax2 = m_prec_po.plot(kind='bar',secondary_y=True,ax=ax)
ax.set_xlabel('Times')
ax.set_ylabel('Left axes label')
ax2.set_ylabel('Right axes label')
This returns the following plot:
My plot is to be found here
I saw someone asking the same question, but I can't seem to figure out how to manually shift one of my datasets.
Here is my data:
print(m_prec_ra,m_prec_po)
Time
1 0.593436
2 0.532058
3 0.676219
4 1.780795
5 4.956048
6 11.909394
7 17.820051
8 14.225257
9 10.261061
10 2.628336
11 0.240568
12 0.431227
Name: Precipitation (mm), dtype: float64 Time
1 0.704339
2 1.225169
3 1.905223
4 4.156270
5 11.531221
6 22.246230
7 30.133800
8 27.634639
9 20.693056
10 5.282412
11 0.659365
12 0.622562
Name: Precipitation (mm), dtype: float64

The explanation for this behaviour is found in this Q & A.
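Here, the solution is to shift the line one position to the left, i.e. to plot it against an index that starts at 0 instead of 1. A pandas bar plot places its bars at the integer positions 0 to N-1 regardless of the index values, whereas a line plot uses the index values themselves.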
import numpy as np; np.random.seed(42)
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({"A" : np.arange(1,11),
"B" : np.random.rand(10),
"C" : np.random.rand(10)})
df.set_index("A", inplace=True)
ax = df.plot(y='B', kind = 'bar', legend = False)
df2 = df.reset_index()
df2.plot(ax = ax, secondary_y = True, y = 'B', kind = 'line')
plt.show()
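Applied to the data from the question, the same idea can be sketched as follows. This is only a sketch under assumptions: it rebuilds the two Series by hand from the printed output, and it uses a plain matplotlib twin axis (ax.twinx()) instead of secondary_y so that the x positions of the line are set explicitly. The name line_x is introduced here for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# Values copied from the printed Series in the question
months = pd.Index(range(1, 13), name="Time")
m_prec_ra = pd.Series(
    [0.593436, 0.532058, 0.676219, 1.780795, 4.956048, 11.909394,
     17.820051, 14.225257, 10.261061, 2.628336, 0.240568, 0.431227],
    index=months, name="Precipitation (mm)")
m_prec_po = pd.Series(
    [0.704339, 1.225169, 1.905223, 4.156270, 11.531221, 22.246230,
     30.133800, 27.634639, 20.693056, 5.282412, 0.659365, 0.622562],
    index=months, name="Precipitation (mm)")

fig, ax = plt.subplots()
# The bars are drawn at positions 0..11; the tick labels show the index 1..12
m_prec_po.plot(kind="bar", ax=ax, color="lightgrey")

# Draw the line on a twin axis at the same positions 0..11
ax2 = ax.twinx()
line_x = list(range(len(m_prec_ra)))
ax2.plot(line_x, m_prec_ra.to_numpy(), color="tab:blue", marker="o")

ax.set_xlabel("Times")
ax.set_ylabel("Left axes label")
ax2.set_ylabel("Right axes label")
# call plt.show() or fig.savefig(...) to look at the result
```

Because both artists now use the positions 0 to 11, each marker of the line sits on the center of its bar.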

What version of pandas are you using for this plotting?
Using 0.23.4 running this code:
df1 = pd.DataFrame({'Data_1':[1,2,4,8,16,12,8,4,1]})
df2 = pd.DataFrame({'Data_2':[1,2,4,8,16,12,8,4,1]})
ax = df1.plot()
ax2 = df2.plot(kind='bar',secondary_y=True,ax=ax)
ax.set_xlabel('Times')
ax.set_ylabel('Left axes label')
ax2.set_ylabel('Right axes label')
I get:
If you want to add sample data we could look at that.

Related

Get the height of the rectangles in a plot

I have the following graph [1], obtained with the code in [2]. As you can see from the first line inside the for loop, I set the height of each rectangle from its standard deviation. But I can't figure out how to get the height of a given rectangle. For example, for the blue rectangle I would like to get back the two bounds of the interval it covers, which are approximately 128.8 and 130.6. How can I do this?
[2] The code I used is the following:
import pandas as pd
import matplotlib.ticker as ticker
import matplotlib.pyplot as plt
import numpy as np
dfLunedi = pd.read_csv( "0.lun.csv", encoding = "ISO-8859-1", sep = ';')
dfSlotMean = dfLunedi.groupby('slotID', as_index=False).agg( NLunUn=('date', 'nunique'),NLunTot = ('date', 'count'), MeanBPM=('tempo', 'mean'), std = ('tempo','std') )
#print(dfSlotMean)
dfSlotMean.drop(dfSlotMean[dfSlotMean.NLunUn < 3].index, inplace=True)
df = pd.DataFrame(dfSlotMean)
df.to_csv('1.silLunedi.csv', sep = ';', index=False)
print(df)
bpmMattino = df['MeanBPM']
std = df['std']
listBpm = bpmMattino.tolist()
limInf = df['MeanBPM'] - df['std']
limSup = df['MeanBPM'] + df['std']
tick_spacing = 1
fig, ax = plt.subplots(1, 1)
for _, r in df.iterrows():
    # draw each slot as a horizontal line whose width encodes the standard deviation
    ax.plot([r['slotID'], r['slotID']+1], [r['MeanBPM']]*2, linewidth = r['std'] )
ax.xaxis.grid(True)
ax.yaxis.grid(True)
ax.yaxis.set_major_locator(ticker.MultipleLocator(tick_spacing))
ax.xaxis.set_major_locator(ticker.MultipleLocator(tick_spacing))
This is the content of the csv:
    slotID  NMonUnique  NMonTot     MeanBPM        std
0        7          11       78  129.700564  29.323091
2       11           6       63  123.372397  24.049397
3       12           6       33  120.625667  24.029006
4       13           5       41  124.516341  30.814985
5       14           4       43  118.904512  26.205309
6       15           3       13  116.380538  24.336491
7       16           3       42  119.670881  27.416843
8       17           5       40  125.424125  32.215865
9       18           6       45  130.540578  24.437559
10      19           9       58  128.180172  32.099529
11      20           5       44  125.596045  28.060657
I would advise against using linewidth to show anything related to your data. The reason is that linewidth is measured in "points" (see the matplotlib documentation), whose size is not related to the xy-space that you plot your data in. To see this in action, try plotting with different linewidths and changing the size of the plotting window. The linewidth will not change with the axes.
Instead, if you do indeed want a rectangle, I suggest using matplotlib.patches.Rectangle. There is a good example of how to do that in the documentation, and I've also added an even shorter example below.
To give the rectangles different colors, you can do as described here and simply get a random tuple with 3 elements and use that for the color. Another option is to take a list of colors, for example the TABLEAU_COLORS from matplotlib.colors, and take consecutive colors from that list. The latter may be better for testing, as the rectangles will get the same color on each run, but notice that there are just 10 colors in TABLEAU_COLORS, so you will have to cycle if you have more than 10 rectangles.
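import matplotlib.pyplot as plt
import matplotlib.patches as ptc
import matplotlib.colors as mcolors  # only needed for the TABLEAU_COLORS option below
import random
y = 4.5
y_std = 0.3
fig, ax = plt.subplots()
for i in range(10):
    # random RGB color for each rectangle
    c = tuple(random.random() for _ in range(3))
    # The other option as comment here
    #c = mcolors.TABLEAU_COLORS[list(mcolors.TABLEAU_COLORS.keys())[i]]
    # each rectangle spans [y - y_std, y + y_std]; x is offset by i so the rectangles do not cover each other
    rect = ptc.Rectangle(xy=(i, y-y_std), width=1, height=2*y_std, color=c)
    ax.add_patch(rect)
ax.set_xlim((0,10))
ax.set_ylim((0,5))
plt.show()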
If you define the height as the standard deviation and the center is at the mean, then the interval should be [mean-(std/2) ; mean+(std/2)] for each rectangle, right? Is it intentional that the rectangles overlap? If not, I think your use of linewidth to size the rectangles is at fault. If the plot is there to visualize the mean and variance of the different categories, something like a boxplot or raincloud plot might be better.
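Once the rectangles are real patches, their bounds can be read back with get_y() and get_height(). A minimal sketch, assuming each rectangle is meant to span mean ± std (the limInf and limSup of the question; halve the height if the rectangle should be one standard deviation tall in total). It uses the first three rows of the table above, and the bounds dictionary is a name introduced here for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line for interactive use
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle
import pandas as pd

# First three rows of the table from the question
df = pd.DataFrame({
    "slotID": [7, 11, 12],
    "MeanBPM": [129.700564, 123.372397, 120.625667],
    "std": [29.323091, 24.049397, 24.029006],
})

fig, ax = plt.subplots()
bounds = {}  # slotID -> (lower bound, upper bound) in data coordinates
for _, r in df.iterrows():
    # Assumption: the rectangle covers [mean - std, mean + std]
    rect = Rectangle((r["slotID"], r["MeanBPM"] - r["std"]),
                     width=1, height=2 * r["std"], alpha=0.5)
    ax.add_patch(rect)
    lower = rect.get_y()
    upper = rect.get_y() + rect.get_height()
    bounds[int(r["slotID"])] = (lower, upper)

ax.autoscale_view()  # patches do not rescale the axes on their own
print(bounds)
```

Unlike linewidth, these values are in data coordinates, so they stay valid when the figure is resized.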

Boxplot and Scatterplot python

I have time series data on which I would like to build an overlaid scatterplot and boxplot. The data looks like this:
TokenUsed date
0 8 2020-01-05
1 8 2020-01-05
2 8 2020-01-05
3 8 2020-01-05
4 8 2020-01-05
... ... ...
51040 7 2020-02-23
51041 7 2020-02-23
51042 7 2020-02-23
51043 7 2020-02-23
51044 7 2020-02-23
This time series can be neatly shown as a boxplot (I had trouble with the x-axis being a date, but solved it by converting it to a string). Now I would like to show only the data whose value exceeds a threshold (> 81 in my case). The code and the resulting image are below:
fig, ax = plt.subplots(figsize = (12,6))
ax = sns.boxplot(x="date", y="TokenUsed", data=df, ax= ax, whis=[0,100])
ax.axhline(81)
plt.locator_params(axis='x', nbins=10)
plt.show()
When I add a scatter plot, I get image (2) and by filtering only those >81 I get image(3). What I don't understand is why it can't seem to match the x-axis between the two graphs!
Code:
fig, ax = plt.subplots(figsize = (12,6))
ax = sns.boxplot(x="date", y="TokenUsed", data=df, ax= ax, whis=[0,100])
# Without filter
ax = sns.scatterplot(x="date", y="TokenUsed", data=df, ax= ax,color=".25")
# Filter
ax = sns.scatterplot(x="date", y="TokenUsed", data=df[df["TokenUsed"]>81], ax= ax,color=".25")
ax.axhline(81)
plt.locator_params(axis='x', nbins=10)
plt.show()
Answer:
Try editing your filtering such that no rows of df are actually removed. That is, apply a mask specifically on the TokenUsed column, such that values are replaced with NaN (rather than the whole row being removed). Here's how I would implement this:
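# make a copy of df and use that for the scatter plot
df2 = df.copy()
# replace values of 81 or less with NaN instead of dropping their rows
df2['TokenUsed'] = df2['TokenUsed'].mask(df2['TokenUsed'] <= 81)
ax = sns.scatterplot(x="date", y="TokenUsed", data=df2, ax= ax,color=".25")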
Explanation
Caveat: this is really my understanding of what is going on from my own observations; I am not actually aware of the implementation behind the scenes.
seaborn is less aware of the dates than you are expecting. When creating the boxplot and using the date column for the x-axis, seaborn groups the data by each unique value in the date column. It orders these strings and then creates an integer position for each of them (starting from 0). The y-data are then plotted against these integer values, and the x-tick-labels are replaced with the corresponding string value. So in your case, there are 8 unique date strings, and they are plotted at x-positions from 0 to 7. Also, it doesn't actually matter that they look like dates. You could add more string values to the date column; their position relative to prior data would depend on their alphabetical order (e.g. I would guess the string '00-00-0000' would appear first and the string '999' would appear last).
The filter df[df["TokenUsed"]>81] removes any rows where the TokenUsed value is 81 or below. This means that the filtered DataFrame will not have as many string date values as the original data. This then creates the unexpected result when plotting. In your filtered data, the first date with values above 81 is 2020-02-09. So in the scatterplot call, those values get plotted at x=0, which is confusing because the values from 2020-01-05 were plotted at x=0 in the call to boxplot.
The fix is to make sure all the original dates are still present in the filtered data, but to replace the filtered out values with NaN so nothing gets plotted.
Here is the example I used to test this:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# fake data, only one date has values over 80
dr = ['01-05-2020'] * 100 + ['01-12-2020'] * 100 + ['01-19-2020'] * 100
data = list(np.random.randint(0,80,200)) + list(np.random.randint(50,150,100))
df = pd.DataFrame({'date':dr, 'TokenUsed':data})
fig, ax = plt.subplots(figsize = (12,6))
ax = sns.boxplot(x="date", y="TokenUsed", data=df, ax=ax, whis=[0,100])
# the fix: keep every row (and so every date) and blank out values of 81 or less
df2 = df.copy()
df2['TokenUsed'] = df2['TokenUsed'].mask(df2['TokenUsed'] <= 81)
ax = sns.scatterplot(x="date", y="TokenUsed", data=df2, ax= ax,color=".25")
ax.axhline(81)
plt.locator_params(axis='x', nbins=10)
plt.show()
If I use the same filter that you applied, I get the same issue.
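The mismatch described in the explanation can be reproduced with pandas alone. The sketch below is a simplified model, not seaborn's actual implementation: the hypothetical helper category_positions assigns one integer slot to each distinct date string, which is what the explanation assumes a categorical axis does. The small DataFrame is made up for illustration.

```python
import pandas as pd

# Made-up data: only the last date has values above 81
df = pd.DataFrame({
    "date": ["2020-01-05"] * 3 + ["2020-01-12"] * 3 + ["2020-01-19"] * 3,
    "TokenUsed": [10, 20, 30, 40, 50, 60, 70, 90, 100],
})

def category_positions(frame):
    """Hypothetical helper: map each distinct date string to an integer slot."""
    return {d: i for i, d in enumerate(frame["date"].unique())}

# Row filter: dates without any value above 81 disappear entirely
filtered = df[df["TokenUsed"] > 81]

# Mask: every row (and so every date) stays, small values become NaN
masked = df.copy()
masked["TokenUsed"] = masked["TokenUsed"].mask(masked["TokenUsed"] <= 81)

print(category_positions(df))        # all three dates, slots 0, 1, 2
print(category_positions(filtered))  # only the last date, now in slot 0
print(category_positions(masked))    # same slots as the original data
```

With the row filter, the one remaining date moves to slot 0 and lands on top of the first box. With the mask, the slots are unchanged and the NaN values are simply not drawn.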

How do I plot stacked barplots side by side in python? (preferentially seaborn)

I'm looking for a way to plot stacked barplots side by side to compare the host composition of positive (Condition==True) and total cases in each country from my dataframe.
Here is a sample of the DataFrame.
id  Location       Host             genus_name    #ofGenes  Condition
1   Netherlands    Homo sapiens     Escherichia   4.0       True
2   Missing        Missing          Klebsiella    3.0       True
3   Missing        Missing          Aeromonas     2.0       True
4   Missing        Missing          Glaciecola    2.0       True
5   Antarctica     Missing          Alteromonas   2.0       True
6   Indian Ocean   Missing          Epibacterium  2.0       True
7   Missing        Missing          Klebsiella    2.0       True
8   China          Homo sapiens     Escherichia   0         False
9   Missing        Missing          Escherichia   2.0       True
10  China          Plantae kingdom  Pantoea       0         False
11  China          Missing          Escherichia   2.0       True
12  Pacific Ocean  Missing          Halomonas     0         False
I need something similar to the image below, but I want to plot percentages.
Can anyone help me?
I guess what you want is a stacked categorical bar plot, which cannot be directly plotted using seaborn. But you can achieve it by customizing one.
Import some necessary packages.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.patches import Patch
Read the dataset. Since your sample data is too small, I randomly generate some more to make the plot look good.
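def gen_fake_data(data, size=400):
    # collect the distinct values of every column
    unique_values = []
    for c in data.columns:
        unique_values.append(data[c].unique())
    # draw `size` random rows, sampling each column independently
    new_data = pd.DataFrame({c: np.random.choice(unique_values[i], size=size)
                             for i, c in enumerate(data.columns)})
    # ignore_index avoids duplicate index labels, so the ids below are unique
    new_data = pd.concat([data, new_data], ignore_index=True)
    new_data['id'] = new_data.index + 1
    return new_data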
data = pd.read_csv('data.csv')
new_data = gen_fake_data(data)
Define the stacked categorical bar plot
def stack_catplot(x, y, cat, stack, data, palette=sns.color_palette('Reds')):
    ax = plt.gca()
    # pivot the data based on categories and stacks
    df = data.pivot_table(values=y, index=[cat, x], columns=stack,
                          dropna=False, aggfunc='sum').fillna(0)
    ncat = data[cat].nunique()
    nx = data[x].nunique()
    nstack = data[stack].nunique()
    range_x = np.arange(nx)
    width = 0.8 / ncat  # width of each bar
    for i, c in enumerate(data[cat].unique()):
        # iterate over categories, i.e., Conditions
        # calculate the location of each bar
        loc_x = (0.5 + i - ncat / 2) * width + range_x
        bottom = 0
        for j, s in enumerate(data[stack].unique()):
            # iterate over stacks, i.e., Hosts
            # obtain the height of each stack of a bar
            height = df.loc[c][s].values
            # plot the bar, you can customize the color yourself
            ax.bar(x=loc_x, height=height, bottom=bottom, width=width,
                   color=palette[j + i * nstack], zorder=10)
            # change the bottom attribute to achieve a stacked barplot
            bottom += height
    # make xlabel
    ax.set_xticks(range_x)
    # use the pivot table's (sorted) order for the labels, since the bar heights follow that order
    ax.set_xticklabels(df.index.get_level_values(x).unique(), rotation=45)
    ax.set_ylabel(y)
    # make legend
    plt.legend([Patch(facecolor=palette[i]) for i in range(ncat * nstack)],
               [f"{c}: {s}" for c in data[cat].unique() for s in data[stack].unique()],
               bbox_to_anchor=(1.05, 1), loc='upper left', borderaxespad=0.)
    plt.grid()
    plt.show()
Let's plot!
plt.figure(figsize=(6, 3), dpi=300)
stack_catplot(x='Location', y='#ofGenes', cat='Condition', stack='Host', data=new_data)
If you want to plot percentages, calculate them in the raw dataset first.
total_genes = new_data.groupby(['Location', 'Condition'], as_index=False)['#ofGenes'].sum().rename(
columns={'#ofGenes': 'TotalGenes'})
new_data = new_data.merge(total_genes, how='left')
new_data['%ofGenes'] = new_data['#ofGenes'] / new_data['TotalGenes'] * 100
plt.figure(figsize=(6, 3), dpi=300)
stack_catplot(x='Location', y='%ofGenes', cat='Condition', stack='Host', data=new_data)
You didn't specify how you would like to stack the bars, but you should be able to do something like this...
df = pd.read_csv('data.csv')
agg_df = df.pivot_table(index='Location', columns='Host', values='Condition', aggfunc='count')
agg_df.plot(kind='bar', stacked=True)
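If counting rows per host is enough (rather than summing #ofGenes), the percentages can also be computed with pd.crosstab and normalize="index". This is a different, shorter technique than the stack_catplot function above, shown only as a sketch: it rebuilds the Location, Host and Condition columns of the sample by hand, and the names total and positive are introduced here for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# Location, Host and Condition columns of the 12 sample rows
df = pd.DataFrame({
    "Location": ["Netherlands", "Missing", "Missing", "Missing", "Antarctica",
                 "Indian Ocean", "Missing", "China", "Missing", "China",
                 "China", "Pacific Ocean"],
    "Host": ["Homo sapiens", "Missing", "Missing", "Missing", "Missing",
             "Missing", "Missing", "Homo sapiens", "Missing",
             "Plantae kingdom", "Missing", "Missing"],
    "Condition": [True, True, True, True, True, True, True, False, True,
                  False, True, False],
})

# Share of each host per location, in percent, for all cases
total = pd.crosstab(df["Location"], df["Host"], normalize="index") * 100

# Same for positive cases only; locations or hosts without positives become 0
pos_rows = df[df["Condition"]]
positive = pd.crosstab(pos_rows["Location"], pos_rows["Host"], normalize="index") * 100
positive = positive.reindex(index=total.index, columns=total.columns, fill_value=0)

# Two stacked bar charts side by side with a shared y-axis
fig, (ax_pos, ax_tot) = plt.subplots(1, 2, sharey=True, figsize=(10, 4))
positive.plot(kind="bar", stacked=True, ax=ax_pos, legend=False,
              title="Condition == True")
total.plot(kind="bar", stacked=True, ax=ax_tot, title="All cases")
ax_pos.set_ylabel("% of cases")
fig.tight_layout()
```

Each bar in the "All cases" panel adds up to 100; a location with no positive cases (Pacific Ocean in the sample) has an empty bar in the other panel.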

Matplotlib Name points on plots

I have searched and found that using annotate in matplotlib for jupyter, we can name the x and y of a point.
I have retried doing as you suggested.
import matplotlib.pyplot as plt
import pandas as pd
def fit_data():
    fig = plt.figure(1,figsize=(20,6))
    plt.subplot(111)
    data1 = pd.DataFrame({"ID" : list(range(11)),
                          "R" : list(range(11)),
                          "Theta" : list(range(11))})
    plt.scatter(data1['R'], data1['Theta'], marker='o', color='b', s=15)
    for i, row in data1.iterrows():
        plt.annotate(row["ID"], xy=(row["R"],row["Theta"]))
    plt.xlabel('R',size=20)
    plt.ylabel('Theta',size=20)
    plt.show()
    plt.close()
fit_data()
It still doesn't take the ID from my data. It is still plotting an arbitrary plot.
This is the image after using the revised code.
My data is as follows:
1 19.177 24.642
2 9.398 12.774
3 9.077 12.373
4 15.287 19.448
5 4.129 5.41
6 2.25 3.416
7 11.674 15.16
8 10.962 14.469
9 1.924 3.628
10 2.087 3.891
11 9.706 13.186
I suppose the confusion comes from the fact that scatter can plot all points at once, while an annotation is a single object. You would hence need one annotation per row in the dataframe.
import matplotlib.pyplot as plt
import pandas as pd
df = pd.DataFrame({"ID" : list(range(6)), # Do not copy this part.
"R" : [5,4,1,2,3,4], # Use your own data
"Theta" : [20,15,40,60,51,71]}) # instead.
fig = plt.figure(1,figsize=(20,6))
plt.subplot(111)
plt.scatter(df['R'], df['Theta'], marker='o', color='b', s=15)
for i, row in df.iterrows():
    plt.annotate(row["ID"], xy=(row["R"],row["Theta"]))
plt.xlabel('R',size=20)
plt.ylabel('Theta',size=20)
plt.show()
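With the data from the question, the same loop could look like the sketch below. It assumes the three unnamed columns are ID, R and Theta in that order (the question does not say so explicitly) and reads them from an inline string instead of a file.

```python
import io
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

raw = """\
1 19.177 24.642
2 9.398 12.774
3 9.077 12.373
4 15.287 19.448
5 4.129 5.41
6 2.25 3.416
7 11.674 15.16
8 10.962 14.469
9 1.924 3.628
10 2.087 3.891
11 9.706 13.186
"""

# Assumption: the columns are ID, R, Theta
df = pd.read_csv(io.StringIO(raw), sep=r"\s+", header=None,
                 names=["ID", "R", "Theta"])

fig, ax = plt.subplots(figsize=(20, 6))
ax.scatter(df["R"], df["Theta"], marker="o", color="b", s=15)

# One annotation per point. zip keeps ID an integer, whereas iterrows()
# would upcast it to float (1.0, 2.0, ...) because R and Theta are floats.
labels = []
for point_id, r, theta in zip(df["ID"], df["R"], df["Theta"]):
    labels.append(ax.annotate(str(point_id), xy=(r, theta)))

ax.set_xlabel("R", size=20)
ax.set_ylabel("Theta", size=20)
```

Each of the 11 points then carries its own ID as a text label placed at its (R, Theta) position.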

how to plot two barh in one axis?

There is a pandas Series with more than 500 items. I pick the top 10 and bottom 10 to plot on one matplotlib axis. Here is the picture I drew manually:
data is here:
bottom10
Out[12]:
0 -9.823127e+08
1 -8.069270e+08
2 -6.030317e+08
3 -5.709379e+08
4 -5.224355e+08
5 -4.755464e+08
6 -4.095561e+08
7 -3.989287e+08
8 -3.885740e+08
9 -3.691114e+08
Name: amount, dtype: float64
top10
Out[13]:
0 9.360520e+08
1 9.078776e+08
2 6.603838e+08
3 4.967611e+08
4 4.409362e+08
5 3.914972e+08
6 3.547471e+08
7 3.538894e+08
8 3.368558e+08
9 3.189895e+08
Name: amount, dtype: float64
top10.barh(top10.index,top10.amount,color='red',align='edge')
bottom10.barh(bottom10.index,bottom10.amount,color='green',align='edge')
Now it shows like this, which is not what I want:
What is the right way to plot?
You can do this by creating a twiny Axes, and plotting the bottom10 DataFrame on there.
For example:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
# Random data
bottom10 = pd.DataFrame({'amount':-np.sort(np.random.rand(10))})
top10 = pd.DataFrame({'amount':np.sort(np.random.rand(10))[::-1]})
# Create figure and axes for top10
fig,axt = plt.subplots(1)
# Plot top10 on axt
top10.plot.barh(color='red',edgecolor='k',align='edge',ax=axt,legend=False)
# Create twin axes
axb = axt.twiny()
# Plot bottom10 on axb
bottom10.plot.barh(color='green',edgecolor='k',align='edge',ax=axb,legend=False)
# Set some sensible axes limits
axt.set_xlim(0,1.5)
axb.set_xlim(-1.5,0)
# Add some axes labels
axt.set_ylabel('Best items')
axb.set_ylabel('Worst items')
# Need to manually move axb label to right hand side
axb.yaxis.set_label_position('right')
plt.show()
