I would like to print the DataFrame beside the plot. What would be a pythonic way to do that?
import matplotlib.pyplot as plt
import pandas as pd
df = pd.DataFrame({'Age':[21,22,23,24,25,26,27,28,29,30],'Count':[4,1,3,7,2,3,5,1,1,5]})
print(df)
Age Count
0 21 4
1 22 1
2 23 3
3 24 7
4 25 2
5 26 3
6 27 5
7 28 1
8 29 1
9 30 5
plt.rcParams['figure.figsize']=(10,6)
fig,ax = plt.subplots()
font_used={'fontname':'pristina', 'color':'Black'}
ax.set_ylabel('Count',fontsize=20,**font_used)
ax.set_xlabel('Age',fontsize=20,**font_used)
plt.plot(df['Age'],df['Count'])
I would like to have a graph like this. How can I print the DataFrame's plotted values alongside it?
You can use ax.text to add the DataFrame to the plot. DataFrames have a .to_string method which makes formatting nice. Supply index=False to remove the row index.
plt.rcParams['figure.figsize']=(10, 6)
fig,ax = plt.subplots()
font_used={'fontname':'pristina', 'color':'Black'}
ax.set_ylabel('Count',fontsize=20,**font_used)
ax.set_xlabel('Age',fontsize=20,**font_used)
# Adjust to where you want.
ax.text(x=28.5, y=4.5, s=df.to_string(index=False))
plt.plot(df['Age'],df['Count'])
plt.show()
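As a variation on the snippet above, here is a minimal sketch that positions the text in axes coordinates instead of data coordinates, so the position does not depend on the data range. The margin value (right=0.75) and the offset (1.03) are assumptions chosen for this example, not values from the original answer.

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({'Age': [21, 22, 23, 24, 25, 26, 27, 28, 29, 30],
                   'Count': [4, 1, 3, 7, 2, 3, 5, 1, 1, 5]})

fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(df['Age'], df['Count'])
ax.set_xlabel('Age')
ax.set_ylabel('Count')

# Leave room to the right of the axes for the text block (assumed margin).
fig.subplots_adjust(right=0.75)

# With transform=ax.transAxes, (0, 0) is the lower-left corner of the axes
# and (1, 1) the upper-right, so x slightly above 1 lands just outside them.
# A monospace font keeps the columns produced by to_string() aligned.
table_text = ax.text(1.03, 1.0, df.to_string(index=False),
                     transform=ax.transAxes, va='top', ha='left',
                     family='monospace')
```

Call plt.show() (or fig.savefig) afterwards to display or save the figure.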
Another option is to use ax.table() (also available as plt.table()):
import matplotlib.pyplot as plt
import pandas as pd
df = pd.DataFrame({'Age':[21,22,23,24,25,26,27,28,29,30],'Count':[4,1,3,7,2,3,5,1,1,5]})
plt.rcParams['figure.figsize']=(10,15)
fig,ax = plt.subplots()
plt.subplots_adjust(left=0.1, right=0.85, top=0.9, bottom=0.1)
font_used={'fontname':'pristina', 'color':'Black'}
ax.set_ylabel('Count',fontsize=20,**font_used)
ax.set_xlabel('Age',fontsize=20,**font_used)
plt.plot(df['Age'],df['Count'])
ax.table(cellText=df['Count'].map(str),
         rowLabels=df['Age'].map(str),
         colWidths=[0.2,0.25],
         loc='right')
plt.show()
This approach draws the values as a table with cell borders to the right of the axes. Just make sure to leave room for it with subplots_adjust(), as done above.
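A possible variant, shown only as a sketch: passing cellText as an explicit list of rows and using colLabels for the header shows both columns. The margins and column widths below are assumed values that may need tuning for other data.

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({'Age': [21, 22, 23, 24, 25, 26, 27, 28, 29, 30],
                   'Count': [4, 1, 3, 7, 2, 3, 5, 1, 1, 5]})

fig, ax = plt.subplots(figsize=(10, 6))
# Shrink the axes so the table fits inside the figure (assumed margins).
fig.subplots_adjust(left=0.1, right=0.7)
ax.plot(df['Age'], df['Count'])

# cellText is a list of rows, each row a list of strings;
# colLabels adds a header row above the data rows.
tbl = ax.table(cellText=df.astype(str).values.tolist(),
               colLabels=list(df.columns),
               colWidths=[0.15, 0.15],
               loc='right')
```

The header occupies row 0 of the table, so the data cells start at row 1.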
pandas also has a to_html method; you can use it and place the resulting HTML next to the plot. What are you placing the graph and DataFrame into?
df.to_html()
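If the target is an HTML page, one way this could look is sketched below: the figure is embedded as a base64 PNG next to the table produced by to_html. The flexbox wrapper and the variable names are assumptions made for illustration; the original comment does not say how the page is built.

```python
import base64
import io

import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({'Age': [21, 22, 23, 24, 25, 26, 27, 28, 29, 30],
                   'Count': [4, 1, 3, 7, 2, 3, 5, 1, 1, 5]})

fig, ax = plt.subplots()
ax.plot(df['Age'], df['Count'])

# Render the figure to an in-memory PNG and encode it for a data URI.
buf = io.BytesIO()
fig.savefig(buf, format='png')
png_b64 = base64.b64encode(buf.getvalue()).decode('ascii')

# Hypothetical layout: image and table side by side in a flex container.
html = (
    '<div style="display: flex; align-items: center;">'
    f'<img src="data:image/png;base64,{png_b64}">'
    f'{df.to_html(index=False)}'
    '</div>'
)
```

The resulting string can be written to a file or displayed in a notebook.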
I try to display a histogram with this dataframe.
gr_age weighted_cost
0 1 2272.985462
1 2 2027.919360
2 3 1417.617779
3 4 946.568598
4 5 715.731002
5 6 641.716770
I want to use gr_age column as the X axis and weighted_cost as the Y axis. Here is an example of what I am looking for with Excel:
I tried with the following code, and with discrete=True, but it gives another result, and I didn't do better with displot.
sns.histplot(data=df, x="gr_age", y="weighted_cost")
plt.show()
Thanking you for your ideas!
You want a barplot (x vs y values) not a histplot which plots the distribution of a dataset:
import seaborn as sns
ax = sns.barplot(data=df, x='gr_age', y='weighted_cost', color='#4473C5')
ax.set_title('Values by age group')
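For comparison, a minimal sketch of the same bar chart using matplotlib alone, with the data from the question typed in by hand. Converting gr_age to strings is a choice made here so the x axis is treated as categorical.

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    'gr_age': [1, 2, 3, 4, 5, 6],
    'weighted_cost': [2272.985462, 2027.919360, 1417.617779,
                      946.568598, 715.731002, 641.716770],
})

fig, ax = plt.subplots()
# One bar per row: the bar height is the weighted_cost value itself,
# not a count of how many values fall into a bin (as in a histogram).
bars = ax.bar(df['gr_age'].astype(str), df['weighted_cost'], color='#4473C5')
ax.set_xlabel('gr_age')
ax.set_ylabel('weighted_cost')
ax.set_title('Values by age group')
```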
I am trying to plot the following data as a horizontal stacked barplot. I would like to show Week 1 and Week 2 as bars, with the largest bar ('Total') at the top and the rest in descending order. The actual data is 100 lines, so I arrived at using Seaborn catplots with kind='bar'. I'm not sure whether it is possible to stack them (as in Matplotlib), so I opted to create two charts and overlay 'Week 1' on top of 'Total' for the same stacked effect.
However, when I run the code below I get two separate plots, and the chart title and axis appear on only one of them. Can I combine these into one stacked horizontal chart? If there is an easier way, I would appreciate hearing about it.
Company           Week 1  Week 2  Total
Stanley Atherton       0       1      1
Dennis Auton           1       1      2
David Bailey           3       8     11
Alan Ball              5       2      7
Philip Barker          3       0      3
Mark Beirne            0       1      1
Phyllis Blitz          3       0      3
Simon Blower           4       2      6
Steven Branton         5       7     12
Rebecca Brown          0       4      4
(Names created from random name generator)
Code:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
data = pd.read_csv('Sample1.csv', delimiter="\t", error_bad_lines=False)
data_rank = data.sort_values(["Attending", "Company"], ascending=[False,True])
sns.set(style="ticks")
g = sns.catplot(y='Company', x='Total', data=data_rank, kind='bar', height=4, color='red', aspect=0.8, ax=ax)
ax2 =ax.twinx()
g = sns.catplot(y='Company', x='Week 1', data=data_rank, kind='bar', height=4, color='blue', aspect=0.8, ax=ax2)
for ax in g.axes[0]:
    ax.xaxis.tick_top()
    ax.xaxis.set_label_position('top')
    ax.spines['bottom'].set_visible(True)
    ax.spines['top'].set_visible(True)
plt.title("Company by week ", size=7)
I think something like this works.
# Draw the totals first, then overlay 'Week 1' on the same axes.
ax = sns.barplot(y='Company', x='Total', data=data_rank, color='red', label='Total')
sns.barplot(y='Company', x='Week 1', data=data_rank, color='blue', label='Week 1', ax=ax)
plt.title("Company by week ", size=12)
plt.xlabel('Frequency')
plt.legend()
plt.show()
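An alternative that stacks the two weeks directly, rather than overlaying 'Week 1' on 'Total', is sketched here with pandas plotting. The data is typed in from the table in the question, and Total is recomputed as Week 1 + Week 2, which is an assumption that happens to match every row shown.

```python
import matplotlib.pyplot as plt
import pandas as pd

data = pd.DataFrame({
    'Company': ['Stanley Atherton', 'Dennis Auton', 'David Bailey',
                'Alan Ball', 'Philip Barker', 'Mark Beirne',
                'Phyllis Blitz', 'Simon Blower', 'Steven Branton',
                'Rebecca Brown'],
    'Week 1': [0, 1, 3, 5, 3, 0, 3, 4, 5, 0],
    'Week 2': [1, 1, 8, 2, 0, 1, 0, 2, 7, 4],
})
data['Total'] = data['Week 1'] + data['Week 2']

# barh draws the first row at the bottom, so sort ascending by Total
# to end up with the largest bar at the top of the chart.
ranked = data.sort_values('Total', ascending=True).set_index('Company')

ax = ranked[['Week 1', 'Week 2']].plot.barh(stacked=True,
                                             color=['blue', 'red'],
                                             figsize=(6, 4))
ax.set_xlabel('Frequency')
ax.set_title('Company by week')
```

Each company gets one bar whose blue and red segments add up to its total.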
I have the following dataframe
Class Age Percentage
0 2004 3 43.491170
1 2004 2 29.616607
2 2004 4 13.838925
3 2004 6 10.049712
4 2004 5 2.637445
5 2004 1 0.366142
6 2005 2 51.267369
7 2005 3 19.589268
8 2005 6 13.730432
9 2005 4 11.155305
10 2005 5 3.343524
11 2005 1 0.913590
12 2005 9 0.000511
I would like to make a bar plot using seaborn with 'Percentage' on the y-axis and 'Class' on the x-axis, with the bars labelled by the 'Age' column. I would also like to arrange the bars in descending order, i.e. from the largest to the smallest bar.
In order to do that I thought of the following: I will change the hue_order parameter based on the order of the 'Percentage' variable. For example, if I sort the 'Percentage' column in descending order for the Class == 2004, then the hue_order = [3, 2, 4, 6, 5, 1].
Here is my code:
import matplotlib.pyplot as plt
import seaborn as sns
def hue_order():
    for cls in dataset.Class.unique():
        temp_df = dataset[dataset['Class'] == cls]
        order = temp_df.sort_values('Percentage', ascending = False)['Age']
    return order
sns.barplot(x="Class", y="Percentage", hue = 'Age',
            hue_order= hue_order(),
            data=dataset)
plt.show()
However, the bars are in descending order only for the Class == 2005. Any help?
In my question, I am using the hue parameter, thus, it is not a duplicate as proposed.
The seaborn hue parameter adds another dimension to the plot. The hue_order determines in which order this dimension is handled. However you cannot split that order. This means you may well change the order such that Age == 2 is in the third place in the plot. But you cannot change it partially, such that in some part it is in the first and in some other it'll be in the third place.
In order to achieve what is desired here, namely to use different orders of the auxiliary dimension within the same axes, you need to handle this manually.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
df = pd.DataFrame({"Class" : [2004]*6+[2005]*7,
                   "Age" : [3,2,4,6,5,1,2,3,6,4,5,1,9],
                   "Percentage" : [50,40,30,20,10,30,20,35,40,50,45,30,15]})
def sortedgroupedbar(ax, x,y, groupby, data=None, width=0.8, **kwargs):
    order = np.zeros(len(data))
    df = data.copy()
    # For each x group, compute every row's rank by descending y.
    for xi in np.unique(df[x].values):
        group = data[df[x] == xi]
        a = group[y].values
        b = sorted(np.arange(len(a)),key=lambda x:a[x],reverse=True)
        c = sorted(np.arange(len(a)),key=lambda x:b[x])
        order[data[x] == xi] = c
    df["order"] = order
    # Map each x value to an integer tick position.
    u, df["ind"] = np.unique(df[x].values, return_inverse=True)
    step = width/len(np.unique(df[groupby].values))
    # One bar call per hue value; the slot within the group comes from the rank.
    for xi,grp in df.groupby(groupby):
        ax.bar(grp["ind"]-width/2.+grp["order"]*step+step/2.,
               grp[y],width=step, label=xi, **kwargs)
    ax.legend(title=groupby)
    ax.set_xticks(np.arange(len(u)))
    ax.set_xticklabels(u)
    ax.set_xlabel(x)
    ax.set_ylabel(y)
fig, ax = plt.subplots()
sortedgroupedbar(ax, x="Class",y="Percentage", groupby="Age", data=df)
plt.show()
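The per-group rank that the loop above builds with two sorted() calls could probably also be obtained with pandas' groupby rank. This is only a sketch of that one step, using the same example data; it is not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({"Class": [2004]*6 + [2005]*7,
                   "Age": [3, 2, 4, 6, 5, 1, 2, 3, 6, 4, 5, 1, 9],
                   "Percentage": [50, 40, 30, 20, 10, 30,
                                  20, 35, 40, 50, 45, 30, 15]})

# Rank rows within each Class by descending Percentage.
# method="first" gives tied values distinct consecutive ranks,
# so every bar gets its own slot inside the group.
rank = df.groupby("Class")["Percentage"].rank(method="first", ascending=False)

# Zero-based position of each bar inside its Class group.
df["order"] = rank.astype(int) - 1
```

The "order" column plays the same role as the one computed inside sortedgroupedbar.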
I have a pandas dataframe of 434300 rows with the following structure:
x y p1 p2
1 8.0 1.23e-6 10 12
2 7.9 4.93e-6 10 12
3 7.8 7.10e-6 10 12
...
.
...
4576 8.0 8.85e-6 5 16
4577 7.9 2.95e-6 5 16
4778 7.8 3.66e-6 5 16
...
...
...
434300 ...
with the key point being that for every block of varying x,y data there are p1 and p2 that do not vary. Note that these blocks of constant p1,p2 are of varying length so it is not simply a matter of slicing the data every n rows.
I would like to plot the values p1 vs p2 in a graph, but would only like to plot the unique points.
If I plot p1 vs p2 using:
In [1]: fig=plt.figure()
In [2]: ax=plt.subplot(111)
In [3]: ax.plot(df['p1'],df['p2'])
In [4]: len(ax.lines[0].get_xdata())
Out[4]: 434300
I see that matplotlib is plotting each individual line of data which is to be expected.
What is the neatest way to plot only the unique points from columns p1 and p2?
Here is a csv of a small example dataset that has all of the important features of my dataset.
Just drop the duplicates and plot:
df.drop_duplicates(subset=['p1', 'p2'])[['p1', 'p2']].plot(x='p1', y='p2')
You can slice the p1 and p2 columns from the data frame and then drop duplicates before plotting.
sub_df = df[['p1','p2']].drop_duplicates()
fig, ax = plt.subplots(1,1)
ax.plot(sub_df['p1'],sub_df['p2'])
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv('exampleData.csv')
d = data[['p1', 'p2']].drop_duplicates()
plt.plot(d['p1'], d['p2'], 'o')
plt.show()
After looking at this answer to a similar question in R (whose data frames are what pandas DataFrames are based on), I found the pandas method pandas.DataFrame.drop_duplicates. If we modify my example code as follows:
In [1]: fig=plt.figure()
In [2]: ax=plt.subplot(111)
In [3]: df.drop_duplicates(subset=['p1','p2'],inplace=True)
In [4]: ax.plot(df['p1'],df['p2'])
In [5]: len(ax.lines[0].get_xdata())
Out[5]: 15
We see that this restricts df to only the unique points to be plotted. An important point is that you must pass a subset to drop_duplicates so that it only uses those columns to determine duplicate rows.
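To illustrate why the subset matters, here is a small sketch on made-up data with the same shape as the question's frame (the values are invented for this example). Without subset, rows are compared on every column, so the varying x and y keep all rows distinct.

```python
import pandas as pd

# Hypothetical miniature of the dataset: x and y vary, p1 and p2 come in blocks.
df = pd.DataFrame({
    'x':  [8.0, 7.9, 7.8, 8.0, 7.9, 8.0],
    'y':  [1.23e-6, 4.93e-6, 7.10e-6, 8.85e-6, 2.95e-6, 3.66e-6],
    'p1': [10, 10, 10, 5, 5, 7],
    'p2': [12, 12, 12, 16, 16, 16],
})

# Compares all four columns: no row is dropped.
all_columns = df.drop_duplicates()

# Compares only p1 and p2: one row per unique (p1, p2) pair is kept.
by_params = df.drop_duplicates(subset=['p1', 'p2'])
```

By default the first row of each duplicate group is kept.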
I'm trying to visualise a large (pandas) dataframe in Python as a heatmap. This dataframe has two types of variables: strings ("Absent" or "Unknown") and floats.
I want the heatmap to show cells with "Absent" in black and "Unknown" in red, and the rest of the dataframe as a normal heatmap, with the floats in a scale of greens.
I can do this easily in Excel with conditional formatting of cells, but I can't find any help online on how to do this in Python, whether with matplotlib, seaborn, or ggplot. What am I missing?
Thank you for your time.
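You could use cmap_custom.set_under('red') and cmap_custom.set_over('black') to apply custom colors to values below vmin and above vmax, respectively. In the code below, "Unknown" is replaced by a value below vmin (shown in red) and "Absent" by a value above vmax (shown in black):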
import numpy as np
import matplotlib.pyplot as plt
import mpl_toolkits.axes_grid1 as axes_grid1
import pandas as pd
# make a random DataFrame
np.random.seed(1)
arr = np.random.choice(['Absent', 'Unknown']+list(range(10)), size=(5,7))
df = pd.DataFrame(arr)
# find the largest and smallest finite values
finite_values = pd.to_numeric(list(set(np.unique(df.values))
                                   .difference(['Absent', 'Unknown'])))
vmin, vmax = finite_values.min(), finite_values.max()
# change Absent and Unknown to numeric values
df2 = df.replace({'Absent': vmax+1, 'Unknown': vmin-1})
# make sure the values are numeric
for col in df2:
    df2[col] = pd.to_numeric(df2[col])
fig, ax = plt.subplots()
cmap_custom = plt.get_cmap('Greens')
cmap_custom.set_under('red')
cmap_custom.set_over('black')
im = plt.imshow(df2, interpolation='nearest', cmap = cmap_custom,
                vmin=vmin, vmax=vmax)
# add a colorbar (https://stackoverflow.com/a/18195921/190597)
divider = axes_grid1.make_axes_locatable(ax)
cax = divider.append_axes("right", size="5%", pad=0.05)
plt.colorbar(im, cax=cax, extend='both')
plt.show()
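To see how the under and over colors are picked, the following sketch looks up individual values through a Normalize object and the colormap, without drawing anything. The range 0 to 9 matches the random example above; the helper cell_color is a hypothetical name introduced here, and copying the colormap before modifying it is a precaution added in this sketch.

```python
import copy

import matplotlib.pyplot as plt
from matplotlib.colors import Normalize, to_rgba

vmin, vmax = 0, 9

# Work on a copy so the registered 'Greens' colormap is left untouched.
cmap = copy.copy(plt.get_cmap('Greens'))
cmap.set_under('red')
cmap.set_over('black')

# Normalize maps [vmin, vmax] to [0, 1]; values outside that range
# fall below 0 or above 1 and receive the under / over colors.
norm = Normalize(vmin=vmin, vmax=vmax)


def cell_color(value):
    """Return the RGBA color a heatmap cell with this value would get."""
    return cmap(float(norm(value)))


unknown_color = cell_color(vmin - 1)   # stands in for "Unknown"
absent_color = cell_color(vmax + 1)    # stands in for "Absent"
in_range_color = cell_color(5)         # an ordinary value, shade of green
```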
The DataFrame
In [117]: df
Out[117]:
0 1 2 3 4 5 6
0 3 9 6 7 9 3 Absent
1 Absent Unknown 5 4 7 0 2
2 3 0 2 9 8 0 2
3 5 5 7 Unknown 5 Absent 4
4 7 7 5 4 7 Unknown Absent
becomes