I have a dataframe which looks like this:
df = pd.DataFrame({'Pred': [10, 9.5, 9.8], 'Actual': [10.2, 9.9, 9.1], 'STD': [0.1, 0.2, 0.6]})
Pred Actual STD
0 10.0 10.2 0.1
1 9.5 9.9 0.2
2 9.8 9.1 0.6
I want to make a bar plot with error bars using STD only on the Pred column, and not on the Actual column. So far I have this:
df.plot.bar(yerr='STD', capsize=4)
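but this adds the error bars to both Actual and Pred. Is there a straightforward way to tell pandas to add the error bars to a single column?
You can do this by passing yerr a DataFrame whose only column is named after the column that should get the error bars: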
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
errors=df.STD.to_frame('Pred')
df[['Actual','Pred']].plot.bar(yerr=errors, ax=ax)
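For comparison, the same figure can be drawn with matplotlib directly, passing yerr only to the bars that need it. This is a minimal sketch, not the pandas approach above; the bar width and the x offsets are arbitrary choices made for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df = pd.DataFrame({'Pred': [10, 9.5, 9.8], 'Actual': [10.2, 9.9, 9.1], 'STD': [0.1, 0.2, 0.6]})

x = np.arange(len(df))   # one group of bars per row
width = 0.4              # assumed bar width, chosen for illustration

fig, ax = plt.subplots()
# Actual: plain bars, no error bars
ax.bar(x - width / 2, df['Actual'], width, label='Actual')
# Pred: error bars taken from the STD column
ax.bar(x + width / 2, df['Pred'], width, yerr=df['STD'], capsize=4, label='Pred')
ax.set_xticks(x)
ax.set_xticklabels(df.index)
ax.legend()
```

Call plt.show() afterwards to display the figure.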
Consider the following pandas DataFrame:
value other_value cluster
1382 2.1 0
10 3.9 1
104 5.9 1
82 -1.1 0
100 0.9 2
1003 0.85 2
232 4.1 0
19 0.6 3
1434 0.3 3
23 1.6 3
Using the seaborn module, I want to display a set of boxplots for each column of values, showing the comparative information per value of the cluster column.
That is, for the above DataFrame, it would show a first graph for the 'value' column with 4 boxplots, one for each cluster value. The second graph would include information for the 'other_value' column also showing 1 boxplot for each cluster.
My idea is to do the same, but instead of in R language, in python: Boxplots of different variables by cluster assigned on one graph in ggplot
My code only produces one plot at a time; I would like a single figure that combines all of them, as in the link above:
sns.boxplot(y='value', x='cluster',
data=df,
palette="colorblind",
hue='cluster')
Thanks for the help offered.
Most seaborn functions work best with the data in "long form".
Here is what the code could look like:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
df = pd.read_html('https://stackoverflow.com/questions/72301993/')[0]
df_long = df.melt(id_vars='cluster', value_vars=df.columns[:-1], var_name='variable', value_name='values')
sns.catplot(kind='box', data=df_long,
            col='variable', y='values', x='cluster', hue='cluster', palette="colorblind", sharey=False, col_wrap=2)
plt.tight_layout()
plt.show()
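To make "long form" concrete, here is a small sketch that builds the question's table by hand (instead of reading it from the web page) and melts it. The column names 'variable' and 'values' are the same arbitrary choices used above.

```python
import pandas as pd

# The table from the question, typed in directly
df = pd.DataFrame({
    'value': [1382, 10, 104, 82, 100, 1003, 232, 19, 1434, 23],
    'other_value': [2.1, 3.9, 5.9, -1.1, 0.9, 0.85, 4.1, 0.6, 0.3, 1.6],
    'cluster': [0, 1, 1, 0, 2, 2, 0, 3, 3, 3],
})

# Wide -> long: one row per (cluster, variable) observation
df_long = df.melt(id_vars='cluster', value_vars=['value', 'other_value'],
                  var_name='variable', value_name='values')
print(df_long.head())
```

Each original row becomes two rows, one per measured column, which is the layout catplot needs in order to split the data into one panel per variable.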
I am producing a pandas bar plot of raw counts, but I would like to annotate each bar with the percentage that count represents of the whole. I have seen many people use ax.patches to annotate, but my values are unrelated to the get_height of the actual bars.
Here is some toy data. The plot shows the individual counts of each type. I want to add an annotation above each bar giving that type's percentage of all types for that person's name.
Let me know if you need any more clarification.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
d = {'ID': [1,1,1,2,2,3,3,3,4],
'name': ['bob','bob','bob','shelby','shelby','jordan','jordan','jordan','jeff'],
'type': ['type1','type2','type4','type1','type6','type5','type8','type2',None]}
df: pd.DataFrame = pd.DataFrame(data=d)
df_pivot: pd.DataFrame = df.pivot_table(index='type', columns=['name'], values='ID', aggfunc={'ID': np.sum}).fillna(0)
# create percent totals of the specific type's row of the total
df_pivot['bob_pct_total']: pd.Series = (df_pivot['bob']/df_pivot['bob'].sum()).mul(100).round(1)
df_pivot['shelby_pct_total']: pd.Series = (df_pivot['shelby']/df_pivot['shelby'].sum()).mul(100).round(1)
df_pivot['jordan_pct_total']: pd.Series = (df_pivot['jordan']/df_pivot['jordan'].sum()).mul(100).round(1)
df_pivot.head(10)
name bob jordan shelby bob_pct_total shelby_pct_total jordan_pct_total
type
type1 1.0 0.0 2.0 33.3 50.0 0.0
type2 1.0 3.0 0.0 33.3 0.0 33.3
type4 1.0 0.0 0.0 33.3 0.0 0.0
type5 0.0 3.0 0.0 0.0 0.0 33.3
type6 0.0 0.0 2.0 0.0 50.0 0.0
type8 0.0 3.0 0.0 0.0 0.0 33.3
fig, ax = plt.subplots(figsize=(15,15))
df_pivot.plot(kind='bar', y=['bob','jordan','shelby'], ax=ax)
You can use the old approach, looping through the bars, using the height to position whatever text you want. Since matplotlib 3.4.0 there also is a new function bar_label that removes much of the boilerplate:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
d = {'ID': [1, 1, 1, 2, 2, 3, 3, 3, 4],
'name': ['bob', 'bob', 'bob', 'shelby', 'shelby', 'jordan', 'jordan', 'jordan', 'jeff'],
'type': ['type1', 'type2', 'type4', 'type1', 'type6', 'type5', 'type8', 'type2', None]}
df: pd.DataFrame = pd.DataFrame(data=d)
df_pivot: pd.DataFrame = df.pivot_table(index='type', columns=['name'], values='ID', aggfunc={'ID': np.sum}).fillna(0)
# create percent totals of the specific type's row of the total
df_pivot['bob_pct_total']: pd.Series = (df_pivot['bob'] / df_pivot['bob'].sum()).mul(100).round(1)
df_pivot['shelby_pct_total']: pd.Series = (df_pivot['shelby'] / df_pivot['shelby'].sum()).mul(100).round(1)
df_pivot['jordan_pct_total']: pd.Series = (df_pivot['jordan'] / df_pivot['jordan'].sum()).mul(100).round(1)
fig, ax = plt.subplots(figsize=(12, 5))
columns = ['bob', 'jordan', 'shelby']
df_pivot.plot(kind='bar', y=columns, rot=0, ax=ax)
# ax.containers holds one group of bars per plotted column, in the same order
for bars, col in zip(ax.containers, ['bob_pct_total', 'jordan_pct_total', 'shelby_pct_total']):
    ax.bar_label(bars, labels=['' if val == 0 else f'{val}' for val in df_pivot[col]])
plt.tight_layout()
plt.show()
PS: To skip labeling the first bars, you could experiment with:
for bars, col in zip(ax.containers, ['bob_pct_total', 'jordan_pct_total', 'shelby_pct_total']):
    labels = ['' if val == 0 else f'{val}' for val in df_pivot[col]]
    labels[0] = ''
    ax.bar_label(bars, labels=labels)
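The per-name percentage columns can also be computed in one step instead of one line per person. The following is a sketch under the same toy data; the names counts and pct are hypothetical, and keeping the percentages in a separate frame (rather than as extra columns of the pivot) is a design choice of this example.

```python
import matplotlib.pyplot as plt
import pandas as pd

d = {'ID': [1, 1, 1, 2, 2, 3, 3, 3, 4],
     'name': ['bob', 'bob', 'bob', 'shelby', 'shelby', 'jordan', 'jordan', 'jordan', 'jeff'],
     'type': ['type1', 'type2', 'type4', 'type1', 'type6', 'type5', 'type8', 'type2', None]}
df = pd.DataFrame(d)

# Same pivot as above: sum of ID per (type, name); the row without a type is dropped
counts = df.pivot_table(index='type', columns='name', values='ID', aggfunc='sum').fillna(0)

# Divide every column by its own total to get each type's share per name
pct = counts.div(counts.sum()).mul(100).round(1)

fig, ax = plt.subplots(figsize=(12, 5))
counts.plot(kind='bar', rot=0, ax=ax)
# One bar container per column, in column order (requires matplotlib >= 3.4)
for bars, name in zip(ax.containers, counts.columns):
    ax.bar_label(bars, labels=['' if v == 0 else f'{v}%' for v in pct[name]])
plt.tight_layout()
```

Because pct has the same index and columns as counts, the labels line up with the bars without any manual bookkeeping.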
I have a Pandas series that has an index and the values are the counts for each value of the index. I want to plot a CDF (preferably just the line, not the full histogram) where the x-axis represents the index.
For example, if my series is s, then s.index is the array of values that should be represented on the x-axis and s.values are the counts. I have tried s.plot(cumulative=True, ...), but that puts the values on the x-axis, not the index.
Example: s.index yields an array of values from 0 to 1, with 0.01 increments (0.00, 0.01, 0.02, ... 1.00). s.values yields an array of the counts, for example (4372, 1340, 205,...), where each one corresponds to the index (0.01 has a count of 1340). I would like the x-axis to be the 0.00, 0.01,... and the y-axis goes from 0 to 1 as the cumulative distribution based on the counts.
Using the seaborn package, you can achieve that:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
x = np.arange(0,.1,0.01)
df = pd.DataFrame({'value':[1340,1200,1300,1150,1421,1175,1232,1432,1123,1231]},index=x)
df
value
0.00 1340
0.01 1200
0.02 1300
0.03 1150
0.04 1421
0.05 1175
0.06 1232
0.07 1432
0.08 1123
0.09 1231
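# Note: distplot is deprecated since seaborn 0.11 (kdeplot and rugplot replace it).
# This call draws a density estimate of the index values themselves.
sns.distplot(df.index, rug=True, hist=False)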
plt.show()
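If the goal is a cumulative curve that uses the counts as weights, a plain pandas alternative is to take the running sum of the counts and divide by the total. This is a sketch of that different technique, not the seaborn call above; it reuses the sample counts from this answer, and the step drawing style is an assumption about how the curve should look.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

x = np.linspace(0, 0.09, 10).round(2)
s = pd.Series([1340, 1200, 1300, 1150, 1421, 1175, 1232, 1432, 1123, 1231], index=x)

# Empirical CDF: cumulative count up to each index value, normalised to end at 1
cdf = s.sort_index().cumsum() / s.sum()

# The index goes on the x-axis, the cumulative fraction on the y-axis
ax = cdf.plot(drawstyle='steps-post')
ax.set_xlabel('value')
ax.set_ylabel('cumulative fraction')
```

Call plt.show() to display the line; drop drawstyle for a simple connected line instead of steps.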
I have an Excel file consisting of 4 sheets, each representing a period of time. Each sheet has 3 columns of data, which are 'subject', 'measure', and 'frequency' (the data describes students' level of interest in each 10-year period).
E.g., sheet 1970-1980
frequency score
math 3.4 1
english 2.5 0.95
art 0.4 0.8
sheet 1981-1990
frequency score
math 4.7 0.5
english 2.3 0.48
art -0.4 0.13
sheet 1991-2000
frequency score
math 4.2 0.6
english 2.1 0.77
art -0.2 0.24
sheet 2000-2010
frequency score
math 4.5 0.55
english 1.9 0.66
art -0.23 0.19
I have created a scatter plot for each period, but I would like to see how the data moves over time. For example, the x-axis would represent the time period and the y-axis the frequency and score.
Are there any suggestions?
First of all, I will reproduce the tables you have here as pandas DataFrames, for three decades:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
data_80s = {'math':[ 3.4, 1], 'english':[2.5, 0.95],'art':[0.4, 0.8]}
df_80s = pd.DataFrame.from_dict(data_80s, orient = 'index', columns=['frequency',
'score'])
# first period ends in 1980, so label it 1980 (not 1990, which is the next frame)
df_80s['decade'] = pd.to_datetime(1980, format='%Y')
df_80s['index'] = df_80s.index
data_90s = {'math':[ 4.7, 0.5], 'english':[2.3, 0.48],'art':[-0.4, 0.13]}
df_90s = pd.DataFrame.from_dict(data_90s, orient = 'index', columns=['frequency',
'score'])
df_90s['decade'] = pd.to_datetime(1990, format='%Y')
df_90s['index'] = df_90s.index
data_20s = {'math':[ 4.2, 0.6], 'english':[2.1, 0.77],'art':[-0.2, 0.24]}
df_20s = pd.DataFrame.from_dict(data_20s, orient = 'index', columns=['frequency',
'score'])
df_20s['decade'] = pd.to_datetime(2000, format='%Y')
df_20s['index'] = df_20s.index
You will probably just need to load your Excel sheets into pandas DataFrames. Just don't forget to add the extra index and decade columns.
Then you can concatenate the DataFrames (DataFrame.append was removed in pandas 2.0, so pd.concat is used here):
frames = [df_80s, df_90s, df_20s]
result = pd.concat(frames)
And finally plot whatever you want:
f, (ax1, ax2) = plt.subplots(2, figsize=(15,10))
sns.lineplot(x='decade', y='score', hue = 'index', data=result, ax=ax1)
sns.lineplot(x='decade', y='frequency', hue = 'index', data=result, ax=ax2)
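If the sheets are loaded straight from the workbook, the per-sheet bookkeeping can be wrapped in a small helper. This is only a sketch: combine_sheets is a hypothetical function, the file name in the comment is made up, and it assumes the subject names are the row index of each sheet. The sample frames below copy the four tables from the question.

```python
import pandas as pd

def combine_sheets(sheets):
    """Stack per-period frames into one long frame with 'subject' and 'period' columns."""
    frames = []
    for period, frame in sheets.items():
        tidy = frame.rename_axis('subject').reset_index()
        tidy['period'] = period  # remember which sheet each row came from
        frames.append(tidy)
    return pd.concat(frames, ignore_index=True)

# With a real workbook (hypothetical file name), the dict of sheets could come from:
# sheets = pd.read_excel('interest.xlsx', sheet_name=None, index_col=0)
subjects = ['math', 'english', 'art']
sheets = {
    '1970-1980': pd.DataFrame({'frequency': [3.4, 2.5, 0.4], 'score': [1, 0.95, 0.8]}, index=subjects),
    '1981-1990': pd.DataFrame({'frequency': [4.7, 2.3, -0.4], 'score': [0.5, 0.48, 0.13]}, index=subjects),
    '1991-2000': pd.DataFrame({'frequency': [4.2, 2.1, -0.2], 'score': [0.6, 0.77, 0.24]}, index=subjects),
    '2000-2010': pd.DataFrame({'frequency': [4.5, 1.9, -0.23], 'score': [0.55, 0.66, 0.19]}, index=subjects),
}
result = combine_sheets(sheets)
print(result)
```

The combined frame can then be passed to sns.lineplot with x='period' and hue='subject', in the same way as the plots above.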
My data from my 'combos' data frame looks like this:
pr = [1.0,2.0,3.0,4.0,1.0,2.0,3.0,4.0,1.0,2.0,3.0,4.0,.....1.0,2.0,3.0,4.0]
lmi = [200, 200, 200, 250, 250,.....780, 780, 780, 800, 800, 800]
pred = [0.16, 0.18, 0.25, 0.43, 0.54......., 0.20, 0.34, 0.45, 0.66]
I plot the results like this:
fig,ax = plt.subplots()
for pr in [1.0,2.0,3.0,4.0]:
    ax.plot(combos[combos.pr==pr].lmi, combos[combos.pr==pr].pred, label=pr)
ax.set_xlabel('lmi')
ax.set_ylabel('pred')
ax.legend(loc='best')
And I get this plot:
How can I plot means of "pred" for each "lmi" data point when keeping the pairs (lmi, pr) intact?
You can first group your DataFrame by lmi then compute the mean for each group just as your title suggests:
combos.groupby('lmi').pred.mean().plot()
In one line we:
Group the combos DataFrame by the lmi column
Get the pred column for each lmi
Compute the mean across the pred column for each lmi group
Plot the mean for each lmi group
After your update to the question, it is clear that you want to calculate the means for each pair (pr, lmi). This can be done by grouping over these columns and then simply calling mean(). With reset_index(), we then restore the DataFrame to its previous form.
$ combos.groupby(['lmi', 'pr']).mean().reset_index()
lmi pr pred
0 200 1.0 0.16
1 200 2.0 0.18
2 200 3.0 0.25
3 250 1.0 0.54
4 250 4.0 0.43
5 780 2.0 0.20
6 780 3.0 0.34
7 780 4.0 0.45
8 800 1.0 0.66
In this new DataFrame, pred contains the means, so you can use the same plotting procedure as before.
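Put together, the grouping and plotting steps might look like the sketch below. The combos values here are invented toy numbers (the real data in the question is elided), and pivoting to one column per pr is just one convenient way to get one line per pr.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy stand-in for the real 'combos' frame: two observations per (lmi, pr) pair
combos = pd.DataFrame({
    'pr':   [1.0, 2.0, 1.0, 2.0, 1.0, 2.0, 1.0, 2.0],
    'lmi':  [200, 200, 200, 200, 250, 250, 250, 250],
    'pred': [0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80],
})

# Mean of pred for every (lmi, pr) pair, kept as ordinary columns
means = combos.groupby(['lmi', 'pr'], as_index=False)['pred'].mean()

# One column per pr, indexed by lmi, so that plot() draws one line per pr
wide = means.pivot(index='lmi', columns='pr', values='pred')
ax = wide.plot()
ax.set_xlabel('lmi')
ax.set_ylabel('mean pred')
```

Call plt.show() to display the figure.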