Pandas pie plot actual values for multiple graphs - python

I'm trying to print actual values in pies instead of percentages. For a one-dimensional series this helps:
Matplotlib pie-chart: How to replace auto-labelled relative values by absolute values
But when I try to create multiple pies it won't work.
d = {'Yes':pd.Series([825, 56], index=["Total", "Last 2 Month"]), 'No':pd.Series([725, 73], index=["Total", "Last 2 Month"])}
df = pd.DataFrame(d)
df = df.T
def absolute_value(val):
    a = np.round(val/100.*df.values, 0)
    return a
df.plot.pie(subplots=True, figsize=(12, 6),autopct=absolute_value)
plt.show()
How can I make this right?
Thanks.

A hacky solution would be to index the dataframe within the absolute_value function, considering that this function is called exactly once per value in that dataframe.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
d = {'Yes':pd.Series([825, 56], index=["Total", "Last 2 Month"]),
'No':pd.Series([725, 73], index=["Total", "Last 2 Month"])}
df = pd.DataFrame(d)
df = df.T
i = [0]
def absolute_value(val):
    # walk through the dataframe column by column, one cell per call
    a = df.iloc[i[0]%len(df),i[0]//len(df)]
    i[0] += 1
    return a
df.plot.pie(subplots=True, figsize=(12, 6),autopct=absolute_value)
plt.show()
The other option is to plot the pie charts individually by looping over the columns.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
d = {'Yes':pd.Series([825, 56], index=["Total", "Last 2 Month"]),
'No':pd.Series([725, 73], index=["Total", "Last 2 Month"])}
df = pd.DataFrame(d)
df = df.T
print(df.iloc[:, 0].sum())
def absolute_value(val, summ):
    # convert the percentage back to an absolute value using the column sum
    a = np.round(val/100.*summ, 0)
    return a
fig, axes = plt.subplots(ncols=len(df.columns))
for i, ax in enumerate(axes):
    df.iloc[:, i].plot.pie(ax=ax, autopct=lambda x: absolute_value(x, df.iloc[:, i].sum()))
plt.show()
In both cases the output would look similar to this
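The percentage-to-count conversion used by the second option can also be wrapped in a small closure, so that each pie gets its own callback bound to its own total. This is only a minimal sketch: the name make_autopct is made up for illustration, and the callback is called by hand here with the "Total" column values from the question instead of being passed to matplotlib.

```python
def make_autopct(total):
    """Return an autopct-style callback bound to one pie's total."""
    def autopct(pct):
        # pct is the wedge size as a percentage of `total`;
        # convert it back and format it as a whole number
        return "{:.0f}".format(pct * total / 100.0)
    return autopct


# "Total" column from the question: 825 (Yes) and 725 (No)
total = 825 + 725
label = make_autopct(total)

print(label(100.0 * 825 / total))  # 825
print(label(100.0 * 725 / total))  # 725
```

In the loop variant, autopct=make_autopct(df.iloc[:, i].sum()) could take the place of the lambda.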

Related

Seaborn catplot Sort by Count column

Problem
I have data that looks like the following:
Month  Product  SalesCount
1      4        94
1      6        38
1      2        56
1      7        47
I would like:
Display a histogram sorted by SalesCount, from highest to lowest.
Display all labels and titles.
What I've Tried
import numpy as np
import pandas as pd
import seaborn as sns
rng = np.random.default_rng()
dft = pd.DataFrame({'Month': 1,
                    'Product': rng.choice(30, size=30, replace=False),
                    'SalesCount': np.random.randint(1, 100, 30),
                    })
# Try to sort the dataframe
#dft = dft.sort_values(by=['SalesCount'])
print(dft)
g = sns.catplot(data=dft, kind='bar', x='Product', y='SalesCount', height=6, aspect=1.8, facecolor=(0.3,0.3,0.7,1))
#, order=dft[['Product', 'SalesCount']].index
(g.set_axis_labels('Product', 'Count')
.set_titles('test'))
Which shows a chart similar to this:
I have tried sorting the dataframe first (dft = dft.sort_values(by=['SalesCount'])) and also adding the order parameter (order=dft[['Product', 'SalesCount']].index) to the sns.catplot call. Neither attempt sorts the histogram.
The second issue is adding the title. I have tried .set_titles('test') on the FacetGrid instance returned by sns.catplot, but the title does not show up.
Thanks!
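You may need to make your Product column a string instead of an integer. Seaborn appears to sort numeric categories by value, while string categories keep the order in which they occur in the data, so the sorted dataframe then determines the bar order. This should work.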
import numpy as np
import pandas as pd
import seaborn as sns
rng = np.random.default_rng()
dft = pd.DataFrame({'Month': 1,
                    'Product': rng.choice(30, size=30, replace=False),
                    'SalesCount': np.random.randint(1, 100, 30),
                    })
# Sort the dataframe from highest to lowest SalesCount
dft = dft.sort_values(by=['SalesCount'], ascending=False)
dft['Product'] = dft['Product'].astype(str)
print(dft)
g = sns.catplot(data=dft, kind='bar', x='Product', y='SalesCount', height=6, aspect=1.8, facecolor=(0.3,0.3,0.7,1))
#, order=dft[['Product', 'SalesCount']].index
(g.set_axis_labels('Product', 'Count')
.set_titles('test'))
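An alternative to the string conversion is to build the bar order explicitly and hand it to the order parameter of sns.catplot. The sketch below only computes that list with pandas, using the four rows shown in the question; the name bar_order is illustrative and not part of the original code.

```python
import pandas as pd

# The four sample rows from the question
dft = pd.DataFrame({
    "Month": [1, 1, 1, 1],
    "Product": [4, 6, 2, 7],
    "SalesCount": [94, 38, 56, 47],
})

# Product labels ordered by SalesCount, highest first
bar_order = dft.sort_values("SalesCount", ascending=False)["Product"].tolist()

print(bar_order)  # [4, 2, 7, 6]
```

Passing order=bar_order to sns.catplot should then fix the bar order without changing the column type. The attempt in the question passed the dataframe index, which holds row labels and not Product values. As for the title, set_titles seems to fill in per-facet title templates only, so on a plot with a single facet something like g.ax.set_title('test') is probably the more reliable choice.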

Scatter and curve plot using matplotlib

I am trying to plot the accuracy evolution of NN models over time. I have an Excel file with data like the following:
and I wrote the following code to extract data and plot the scatter:
import pandas as pd
data = pd.read_excel (r'SOTA DNN.xlsx')
acc1 = pd.DataFrame(data, columns= ['Top-1-Acc'])
para = pd.DataFrame(data, columns= ['Parameters'])
dates = pd.to_datetime(data['Date'], format='%Y-%m-%d')
import matplotlib.pyplot as plt
plt.grid(True)
plt.ylim(40, 100)
plt.scatter(dates, acc1)
plt.show()
Is there a way to draw a line in the same figure that connects only the models achieving a new maximum, and to print their names, as in this example:
Is it also possible to stretch the x-axis (for the dates)?
It is still not clear what you mean by "stretch the x-axis" and you did not provide your dataset, but here is a possible general approach:
import matplotlib.pyplot as plt
import pandas as pd
#fake data generation, this has to be substituted by your .xls import routine
import string
import numpy as np
np.random.seed(1234)
n = 30
acc = np.concatenate([np.random.randint(0, 10, 10), np.random.randint(0, 30, 10), np.random.randint(0, 100, n-20)])
date_range = pd.date_range("20190101", periods=n)
#random 5-letter model names
model = ["".join(np.random.choice(list(string.ascii_letters), 5)) for _ in range(n)]
df = pd.DataFrame({"Model": model, "Date": date_range, "TopAcc": acc})
df = df.sample(frac=1).reset_index(drop=True)
#now to the actual data modification
#first, we extract the dataframe with monotonically increasing values after sorting the date column
df = df.sort_values("Date").reset_index()
df["Max"] = df.TopAcc.cummax().diff()
df.loc[0, "Max"] = 1
dfmax = df[df.Max > 0]
#then, we plot all data, followed by the best performers
fig, ax = plt.subplots(figsize=(10, 5))
ax.scatter(df.Date, df.TopAcc, c="grey")
ax.plot(dfmax.Date, dfmax.TopAcc, marker="x", c="blue")
#finally, we annotate the best performers
for _, xylabel in dfmax.iterrows():
    ax.text(xylabel.Date, xylabel.TopAcc, xylabel.Model, c="blue", horizontalalignment="right", verticalalignment="bottom")
plt.show()
Sample output:
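To see the record-selection step on its own, here is a minimal sketch with a handful of made-up accuracy values that are assumed to be sorted by date already. The names running_max and is_record are illustrative.

```python
import pandas as pd

# Made-up accuracies, assumed to be in date order
acc = pd.Series([10, 8, 25, 25, 40, 31])

running_max = acc.cummax()          # 10, 10, 25, 25, 40, 40
is_record = running_max.diff() > 0  # True where the running maximum increases
is_record.iloc[0] = True            # the first entry counts as a record

print(acc[is_record].tolist())  # [10, 25, 40]
```

Ties with the current maximum (the second 25) are not treated as new records, which matches the Max > 0 filter in the answer.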

for-Loop to create LinePlots with seaborn in DataFrame

I am a beginner at coding in Python and I have a question.
This code works fine to create a chart for a column:
The Main DF is:
1- Removing Outliers:
def remove_outliers(df_in, col):
    q1 = df_in[col].quantile(0.25)
    q3 = df_in[col].quantile(0.75)
    iqr = q3-q1
    lower_bound = q1-1.5*iqr
    upper_bound = q3+1.5*iqr
    # keep only the rows inside the 1.5*IQR fences
    df_out = df_in.loc[(df_in[col] > lower_bound) & (df_in[col] < upper_bound)]
    return df_out
2- Define the Format of the Lineplot
rc={'axes.labelsize': 20, 'font.size': 20, 'legend.fontsize':20,'axes.titlesize':20,'xtick.labelsize': 14,'ytick.labelsize': 14, 'lines.linewidth':1, 'lines.markersize':7, 'xtick.major.pad':10}
sns.set(rc=rc)
3- Create a Lineplot with seaborn:
df1_DH001= remove_outliers(main_df, 'DH001')[['DH 001','Datum']]
df1_DH001_chart= sns.scatterplot(x='Datum', y='DH 001', data=df1_DH001)
df1_DH001_chart= sns.lineplot(x='Datum', y='DH 001', data=df1_DH001, lw=3, color="b")
df1_DH001_chart.set(xlim=('1995','2019'), ylim=(0, 220) ,title='DH 001', ylabel='Nitrat mg/L', xlabel="Jahr")
df1_DH001_chart.xaxis.set_major_locator(mdates.YearLocator(1))
df1_DH001_chart.xaxis.set_major_formatter(mdates.DateFormatter('%Y'))
df1_DH001_chart
So I got this:
Now I would like to write a for-loop that creates the same plot with the same x-axis (Datum) for each of the other columns (there are 22 columns).
Could someone help me?
Import the following:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Create a sample DF:
data = {'day': ['Mon','Tue','Wed','Thu'],
'col1': [22000,25000,27000,35000],
'col2': [2200,2500,2700,3500],
}
df = pd.DataFrame(data)
Select only numeric columns from your DF or alternatively select the columns that you want to consider in the loop:
df1 = df.select_dtypes(include='number')
Iterate through the columns and print a line plot with seaborn:
for i, col in enumerate(df1.columns):
    plt.figure(i)
    sns.lineplot(x='day',y=col, data=df)
Then the following pictures will be shown:
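Applied to the question's setup, the same loop can also run the outlier filter once per column before plotting. The following is a rough sketch under several assumptions: main_df is replaced by a small made-up frame, plain matplotlib is used in place of seaborn to keep the example short, and the kept_rows dictionary exists only to show how many rows survive the filter.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd


def remove_outliers(df_in, col):
    """Keep rows whose `col` value lies inside the 1.5 * IQR fences."""
    q1 = df_in[col].quantile(0.25)
    q3 = df_in[col].quantile(0.75)
    iqr = q3 - q1
    inside = (df_in[col] > q1 - 1.5 * iqr) & (df_in[col] < q3 + 1.5 * iqr)
    return df_in.loc[inside]


# Made-up stand-in for main_df: a date column plus two measurement columns
main_df = pd.DataFrame({
    "Datum": pd.date_range("2000-01-01", periods=6),
    "DH 001": [50, 55, 52, 500, 54, 51],
    "DH 002": [20, 22, 21, 23, 19, 20],
})

kept_rows = {}
for col in main_df.columns.drop("Datum"):
    # filter on the same column that is plotted
    cleaned = remove_outliers(main_df, col)[[col, "Datum"]]
    fig, ax = plt.subplots()
    ax.plot(cleaned["Datum"], cleaned[col], marker="o", lw=3, color="b")
    ax.set(title=col, xlabel="Jahr", ylabel="Nitrat mg/L")
    kept_rows[col] = len(cleaned)

print(kept_rows)  # {'DH 001': 5, 'DH 002': 6}
```

Each column gets its own figure, and the value 500 in "DH 001" is dropped by the filter before that column is drawn.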

Why my Seaborn line plot x-axis shifts one unit?

I am trying to compare two simple, summarized pandas dataframes with a line plot from the Seaborn library, but one of the lines is shifted by one unit on the x-axis. What's wrong with it?
The dataframes are:
Here is my code:
df = pd.read_csv('/home/gazelle/Documents/m3inference/m3_result.csv',index_col='id')
df = df.drop("Unnamed: 0",axis=1)
for i, v in df.iterrows():
    if str(i) not in result:
        df.drop(i, inplace=True)
    else:
        df.loc[i, 'estimated'] = result[str(i)]
m3 = pd.read_csv('plot_result.csv').set_index('id')
ids = list(m3.index.values)
m3 = m3['age'].value_counts().to_frame().reset_index().sort_values('index')
m3 = m3.rename(columns={m3.columns[0]:'bucket', m3.columns[1]:'age'})
df_estimated = df[df.index.isin(ids)]['estimated'].value_counts().to_frame().reset_index().sort_values('index')
df_estimated = df_estimated.rename(columns={df_estimated.columns[0]:'bucket', df_estimated.columns[1]:'age'})
sns.lineplot(x='bucket', y='age', data=m3)
sns.lineplot(x='bucket', y='age', data=df_estimated)
And the result is:
As has been pointed out in the comments, the data and code you provide appear to produce the correct result:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
sns.set()
m3 = pd.DataFrame({"index": [2, 3, 4, 1], "age": [123, 116, 66, 33]})
df_estimated = pd.DataFrame({"index": [3, 2, 4, 1], "estimated": [200, 100, 37, 1]})
sns.lineplot(x="index", y="age", data=m3)
sns.lineplot(x="index", y="estimated", data=df_estimated)
plt.show()
This gives a plot which is different from the one you posted above:
From your screenshots it looks like you are working in a Jupyter notebook. You are probably suffering from the issue that at the time you plot, the dataframe m3 no longer has the values you printed above, but has been modified.

Create a plot for each unique ID

Given the dataframe df, I could use some help creating two different scatter plots: one with the x, y coordinates for the id "aa" and one with the x, y coordinates for the id "bb". In each plot the c value is used for the color map. With the actual data there are over 1000 unique ids.
import numpy as np
import matplotlib.pyplot as plt
import pyodbc
import pandas as pd
#need to add the
data = {'x':[2,4,6, 8,10, 12], 'y':[2,4,6, 8,10, 12], 'c': [.2,.5,.5,.7,.8,.9], 'id':['aa','aa','aa','bb','bb','bb']}
df = pd.DataFrame(data)
print (df)
for d in df.groupby(df['id']):
    plt.scatter(d[1][['x']],d[1][['y']], c=d[1][['c']], s=10, alpha=0.3, cmap='viridis')
    clb = plt.colorbar();
    plt.show()
Returns this error: ValueError: RGBA values should be within 0-1 range
Try this:
df = pd.DataFrame(data)
for d in df.groupby(df['id']):
    plt.plot(d[1][['x','y']])
    plt.show()
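If the color mapping should be kept, a likely cause of the ValueError is the double-bracket selection: d[1][['c']] is a one-column DataFrame (2-D), which some matplotlib versions may try to read as RGBA colors, whereas single brackets give a 1-D Series that is mapped through the colormap. A minimal sketch under that assumption, with one figure per id (the figures dictionary is only there for illustration):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

data = {'x': [2, 4, 6, 8, 10, 12], 'y': [2, 4, 6, 8, 10, 12],
        'c': [.2, .5, .5, .7, .8, .9],
        'id': ['aa', 'aa', 'aa', 'bb', 'bb', 'bb']}
df = pd.DataFrame(data)

figures = {}
for group_id, group in df.groupby('id'):
    fig, ax = plt.subplots()
    # single brackets select 1-D Series for x, y and c
    points = ax.scatter(group['x'], group['y'], c=group['c'],
                        s=10, alpha=0.3, cmap='viridis')
    fig.colorbar(points, ax=ax)
    ax.set_title(group_id)
    figures[group_id] = fig
```

With over 1000 ids it is probably better to save each figure with fig.savefig(...) and release it with plt.close(fig) inside the loop than to keep them all open.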
