How to draw cumulative density plot from pandas?


I have a dataframe:
count_single count_multi column_names
0 11345 7209 e
1 11125 6607 w
2 10421 5105 j
3 9840 4478 r
4 9561 5492 f
5 8317 3937 i
6 7808 3795 l
7 7240 4219 u
8 6915 3854 s
9 6639 2750 n
10 6340 2465 b
11 5627 2834 y
12 4783 2384 c
13 4401 1698 p
14 3305 1753 g
15 3283 1300 o
16 2767 1697 t
17 2453 1276 h
18 2125 1140 a
19 2090 929 q
20 1330 518 d
I want to visualize count_single and count_multi together, with column_names as the shared category for both. I am looking for something like this:
What I've tried:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_context('paper')
f, ax = plt.subplots(figsize = (6,15))
sns.set_color_codes('pastel')
sns.barplot(x = 'count_single', y = 'column_names', data = df,
            label = 'Type_1', color = 'orange', edgecolor = 'w')
sns.set_color_codes('muted')
sns.barplot(x = 'count_multi', y = 'column_names', data = df,
            label = 'Type_2', color = 'green', edgecolor = 'w')
ax.legend(ncol = 2, loc = 'lower right')
sns.despine(left = True, bottom = True)
plt.show()
It gives me a plot like this:
How can I visualize these two columns the way the expected image shows?
I really appreciate any help you can provide.

# instantiate figure with two rows and one column
fig, axes = plt.subplots(nrows=2, figsize=(10,5))
# plot barplot in the first row
df.set_index('column_names').plot.bar(ax=axes[0], color=['rosybrown', 'tomato'])
# scale each column by dividing by its sum, then take the cumulative sum to get
# the cumulative distribution; plot it on the second axes
df.set_index('column_names').apply(lambda x: x/x.sum()).cumsum().plot(ax=axes[1], color=['rosybrown', 'tomato'])
# change ticks in first plot:
axes[0].set_yticks(np.linspace(0, 12000, 7)) # this means: make 7 ticks between 0 and 12000
# adjust the axislabels for the second plot
axes[1].set_xticks(range(len(df)))
axes[1].set_xticklabels(df['column_names'], rotation=90)
plt.tight_layout()
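The normalisation step in the answer above can be checked on its own, without drawing anything. The following is a minimal sketch that assumes only the first five rows of the question's dataframe; the name `cumulative_share` is introduced here for illustration and does not appear in the original answer.

```python
import pandas as pd

# First five rows of the question's dataframe (subset chosen for brevity).
df = pd.DataFrame({
    "column_names": ["e", "w", "j", "r", "f"],
    "count_single": [11345, 11125, 10421, 9840, 9561],
    "count_multi": [7209, 6607, 5105, 4478, 5492],
})

counts = df.set_index("column_names")

# Divide every column by its own total, then accumulate down the rows.
# Each column therefore climbs monotonically and ends at 1.0.
cumulative_share = counts.div(counts.sum()).cumsum()

print(cumulative_share.round(3))
```

Calling `.plot()` on the resulting frame should give the same kind of line chart as the second axes in the answer, since it is the same data computed with `div` instead of `apply`.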

Related

Python Seaborn: how to plot all columns and use index as hue?

I have a dataframe that looks like this:
index      9     1     8     3     7     6     2      5     0     4
    0  32941  3545  2829  2423  1945  1834  1213   1205  1096   969
    1  24352  2738  2666  2432  1388  7937   682   3539  2705  1561
    2   2137  1271  2401   540  3906  1446  3432  24855  1885  8127
I want to use barplot to plot these values, and use the index as hue. How can I do that? It can be matplotlib or seaborn or any tool, but I prefer the first two.

Use:
df = df.melt(id_vars='index')
sns.barplot(x='variable', y='value', data=df, hue='index')
NOTE: If you want to add the values on top of each bar, use:
plt.figure(figsize=(20, 8))
ax = sns.barplot(x='variable', y='value', data=df, hue='index')
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x() + p.get_width() / 2., height + 300, int(height), ha="center", fontsize='small')
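To see what `melt` hands to `sns.barplot`, it can help to look at the reshaped frame directly. This is a small sketch that assumes only three of the question's ten value columns, to keep the output short; the names `wide` and `long` are illustrative.

```python
import pandas as pd

# Three of the question's value columns, keyed by the 'index' column.
wide = pd.DataFrame({
    "index": [0, 1, 2],
    "9": [32941, 24352, 2137],
    "1": [3545, 2738, 1271],
    "8": [2829, 2666, 2401],
})

# melt keeps 'index' as an identifier and stacks every other column into
# two new columns: 'variable' (the old column name) and 'value'.
long = wide.melt(id_vars="index")

print(long)
```

Each row of `long` corresponds to one bar: `variable` chooses the x position, `index` chooses the hue, and `value` is the bar height.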

Set color-palette in Seaborn Grouped Barplot depending on values

I have a dataframe with positive and negative values from three kind of variables.
labels variable value
0 -10e5 nat -38
1 2e5 nat 50
2 10e5 nat 16
3 -10e5 agr -24
4 2e5 agr 35
5 10e5 agr 26
6 -10e5 art -11
7 2e5 art 43
8 10e5 art 20
When the values are negative, I want the barplot to follow this color sequence:
n_palette = ["#ff0000","#ff0000","#00ff00"]
Instead when positive I want it to reverse the palette:
p_palette = ["#00ff00","#00ff00","#ff0000"]
I've tried this:
palette = ["#ff0000","#ff0000","#00ff00",
"#00ff00","#00ff00","#ff00",
"#00ff00","#00ff00","#ff00"]
ax = sns.barplot(x=melted['labels'], y=melted['value'], hue = melted['variable'],
                 linewidth=1,
                 palette=palette)
But I get the following output:
What I'd like is for the first two bars of each group to be green and the last one red when the values are positive.

You seem to want to do the coloring depending on a criterion on two columns. It seems suitable to add a new column which uniquely labels that criterion.
Further, seaborn allows the palette to be a dictionary telling exactly which hue label gets which color. Adding barplot(..., order=[...]) would define a fixed order.
Here is some example code:
from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
from io import StringIO
data_str = ''' labels variable value
0 -10e5 nat -38
1 2e5 nat 50
2 10e5 nat 16
3 -10e5 agr -24
4 2e5 agr 35
5 10e5 agr 26
6 -10e5 art -11
7 2e5 art 43
8 10e5 art 20
'''
# sep=r'\s+' splits on whitespace; delim_whitespace is deprecated in recent pandas
melted = pd.read_csv(StringIO(data_str), sep=r'\s+', dtype={'labels': str})
melted['legend'] = np.where(melted['value'] < 0, '-', '+')
melted['legend'] = melted['variable'] + melted['legend']
palette = {'nat-': "#ff0000", 'agr-': "#ff0000", 'art-': "#00ff00",
           'nat+': "#00ff00", 'agr+': "#00ff00", 'art+': "#ff0000"}
ax = sns.barplot(x=melted['labels'], y=melted['value'], hue=melted['legend'],
                 linewidth=1, palette=palette)
ax.axhline(0, color='black')
plt.show()
PS: To remove the legend: ax.legend_.remove(). Or to have a legend with multiple columns: ax.legend(ncol=3).
A different approach, directly with the original dataframe, is to create two bar plots: one for the negative values and one for the positive. For this to work well, it is necessary that the 'labels' column (the x=) is explicitly made categorical. Also adding pd.Categorical(..., categories=['nat', 'agr', 'art']) for the 'variable' column could fix an order.
This will generate a legend with the labels twice with different colors. Depending on what you want, you can remove it or create a more custom legend.
An idea is to add the labels under the positive and on top of the negative bars:
sns.set()
melted = pd.read_csv(StringIO(data_str), sep=r'\s+', dtype={'labels': str})
palette_pos = {'nat': "#00ff00", 'agr': "#00ff00", 'art': "#ff0000"}
palette_neg = {'nat': "#ff0000", 'agr': "#ff0000", 'art': "#00ff00"}
melted['labels'] = pd.Categorical(melted['labels'])
ax = sns.barplot(data=melted[melted['value'] < 0], x='labels', y='value', hue='variable',
                 linewidth=1, palette=palette_neg)
sns.barplot(data=melted[melted['value'] >= 0], x='labels', y='value', hue='variable',
            linewidth=1, palette=palette_pos, ax=ax)
ax.legend_.remove()
ax.axhline(0, color='black')
ax.set_xlabel('')
ax.set_ylabel('')
for bar_container in ax.containers:
    label = bar_container.get_label()
    for p in bar_container:
        x = p.get_x() + p.get_width() / 2
        h = p.get_height()
        if not np.isnan(h):
            ax.text(x, 0, label + '\n\n' if h < 0 else '\n\n' + label, ha='center', va='center')
plt.show()
Still another option involves sns.catplot() which could be clearer when a lot of data is involved:
sns.set()
melted = pd.read_csv(StringIO(data_str), sep=r'\s+', dtype={'labels': str})
melted['legend'] = np.where(melted['value'] < 0, '-', '+')
melted['legend'] = melted['variable'] + melted['legend']
palette = {'nat-': "#ff0000", 'agr-': "#ff0000", 'art-': "#00ff00",
           'nat+': "#00ff00", 'agr+': "#00ff00", 'art+': "#ff0000"}
g = sns.catplot(kind='bar', data=melted, col='labels', y='value', x='legend',
                linewidth=1, palette=palette, sharex=False, sharey=True)
for ax in g.axes.flat:
    ax.axhline(0, color='black')
    ax.set_xlabel('')
    ax.set_ylabel('')
plt.show()
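The idea shared by the first and third approaches, one hue key per (variable, sign) pair looked up in a dictionary palette, can be verified before any plotting. The sketch below rebuilds the question's data by hand; the names `sign`, `missing` and `bar_colors` are introduced here purely for illustration.

```python
import numpy as np
import pandas as pd

# The question's nine rows, typed in directly instead of parsed from text.
melted = pd.DataFrame({
    "labels": ["-10e5", "2e5", "10e5"] * 3,
    "variable": ["nat"] * 3 + ["agr"] * 3 + ["art"] * 3,
    "value": [-38, 50, 16, -24, 35, 26, -11, 43, 20],
})

# Build one hue key per row: the variable name plus the sign of its value.
sign = pd.Series(np.where(melted["value"] < 0, "-", "+"), index=melted.index)
melted["legend"] = melted["variable"] + sign

palette = {"nat-": "#ff0000", "agr-": "#ff0000", "art-": "#00ff00",
           "nat+": "#00ff00", "agr+": "#00ff00", "art+": "#ff0000"}

# Every key that occurs in the data should have a color in the palette.
missing = set(melted["legend"]) - set(palette)

# The color each individual bar would receive.
bar_colors = melted["legend"].map(palette)

print(melted.assign(color=bar_colors))
```

If `missing` is non-empty, the dictionary palette lacks a color for some hue level, so a check like this is a cheap guard when the set of variables changes.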

How to create grouped bars charts with matplotlib with data in DataFrame

This is my current output:
Now I want the second set of bars to appear next to the bars that are already plotted.
My DataFrame has 3 columns: 'Block', 'Cluster', and 'District'.
'Block' and 'Cluster' contain the numbers for plotting, and the grouping is based on the strings in 'District'.
How can I plot the other bars next to the existing bars?
df=pd.read_csv("main_ds.csv")
fig = plt.figure(figsize=(20,8))
ax = fig.add_subplot(111)
plt.xticks(rotation=90)
bwidth=0.30
indic1=ax.bar(df["District"],df["Block"], width=bwidth, color='r')
indic2=ax.bar(df["District"],df["Cluster"], width=bwidth, color='b')
ax.autoscale(tight=False)
def autolabel(rects):
    for rect in rects:
        h = rect.get_height()
        ax.text(rect.get_x()+rect.get_width()/2., 1.05*h, '%d'%int(h),
                ha='center', va='top')
autolabel(indic1)
autolabel(indic2)
plt.show()
Data:
District Block Cluster Villages Schools Decadal_Growth_Rate Literacy_Rate Male_Literacy Female_Literacy Primary ... Govt_School Pvt_School Govt_Sch_Rural Pvt_School_Rural Govt_Sch_Enroll Pvt_Sch_Enroll Govt_Sch_Enroll_Rural Pvt_Sch_Enroll_Rural Govt_Sch_Teacher Pvt_Sch_Teacher
0 Dimapur 5 30 278 494 23.2 85.4 88.1 82.5 147 ... 298 196 242 90 33478 57176 21444 18239 3701 3571
1 Kiphire 3 3 94 142 -58.4 73.1 76.5 70.4 71 ... 118 24 118 24 5947 7123 5947 7123 853 261
2 Kohima 5 5 121 290 22.7 85.6 89.3 81.6 128 ... 189 101 157 49 10116 26464 5976 8450 2068 2193
3 Longleng 2 2 37 113 -30.5 71.1 75.6 65.4 60 ... 90 23 90 23 3483 4005 3483 4005 830 293
4 Mon 5 5 139 309 -3.8 56.6 60.4 52.4 165 ... 231 78 219 58 18588 16578 17108 8665 1667 903
5 rows × 26 columns

Try using pandas.DataFrame.plot:
import pandas as pd
from io import StringIO
import matplotlib.pyplot as plt
def add_value_labels(ax, spacing=5):
    # Annotate every bar in `ax` with its height
    for rect in ax.patches:
        y_value = rect.get_height()
        x_value = rect.get_x() + rect.get_width() / 2
        space = spacing
        # Vertical alignment for positive values
        va = 'bottom'
        # If value of bar is negative: place label below bar
        if y_value < 0:
            # Invert space to place label below
            space *= -1
            # Vertically align label at top
            va = 'top'
        # Use Y value as label and format number with one decimal place
        label = "{:.1f}".format(y_value)
        # Create annotation
        ax.annotate(
            label,                       # Use `label` as label
            (x_value, y_value),          # Place label at end of the bar
            xytext=(0, space),           # Vertically shift label by `space`
            textcoords="offset points",  # Interpret `xytext` as offset in points
            ha='center',                 # Horizontally center label
            va=va)                       # Vertically align label differently for
                                         # positive and negative values.
# First three columns of the question's data
first3columns = StringIO("""District Block Cluster
Dimapur 5 30
Kiphire 3 3
Kohima 5 5
Longleng 2 2
Mon 5 5
""")
df_plot = pd.read_csv(first3columns, sep=r'\s+')
fig, ax = plt.subplots()
#df_plot.set_index(['District'], inplace=True)
df_plot[['Block', 'Cluster']].plot.bar(ax=ax, color=['r', 'b'])
ax.set_xticklabels(df_plot['District'])
add_value_labels(ax)
plt.show()

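df["District"] holds strings, so the bars cannot be shifted by subtracting a number from it. Use numeric positions instead and attach the district names as tick labels. Try changing
indic1=ax.bar(df["District"],df["Block"], width=bwidth, color='r')
indic2=ax.bar(df["District"],df["Cluster"], width=bwidth, color='b')
to
x = np.arange(len(df))  # needs: import numpy as np
indic1=ax.bar(x-bwidth/2,df["Block"], width=bwidth, color='r')
indic2=ax.bar(x+bwidth/2,df["Cluster"], width=bwidth, color='b')
ax.set_xticks(x)
ax.set_xticklabels(df["District"])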

How to groupby column, and then create a scatterplot of counts

I have a dataframe similar to the one below:
id date available
0 1944 2019-07-11 f
1 1944 2019-07-11 t
2 159454 2019-07-12 f
3 159454 2019-07-13 f
4 159454 2019-07-14 f
I would like to form a scatter plot where each id has a corresponding point; the x value is the number of t occurrences, and the y value is the number of f occurrences in the available column.
I have tried:
grouped = df.groupby(['listing_id'])['available'].value_counts().to_frame()
grouped.head()
This gives me something like
                      available
listing_id available
1944       t                364
           f                  1
2015       f                184
           t                181
3176       t                279
           f                 10
But I'm not sure how to proceed from here. How can I get my desired plot? Is there a better way to do this?

Assuming you don't need the date column:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Generate example data
N = 100
np.random.seed(1)
df = pd.DataFrame({'id': np.random.choice(list(range(1, 6)), size=N),
                   'available': np.random.choice(['t', 'f'], size=N)})
df = df.sort_values('id').reset_index(drop=True)
# For each id: get t and f counts, unstack into columns, ensure
# column order is ['t', 'f']
counts = df.groupby(['id', 'available']).size().unstack()[['t', 'f']]
# Plot
fig, ax = plt.subplots()
counts.plot(x='t', y='f', kind='scatter', ax=ax)
# Optional: label each data point with its id.
# This is rough and might not look good beyond a few data points
for label, (t, f) in counts.iterrows():
    ax.text(t + .05, f + .05, str(label))
plt.show()
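One edge case worth checking with the groupby/unstack approach is an id that never takes one of the two values, as with id 159454 in the question's sample, which has no `t` rows. A plain `unstack()` leaves NaN there. The sketch below assumes that a count of zero is the desired value and uses `fill_value=0` plus `reindex`; both are standard pandas calls, but choosing them is an assumption rather than part of the original answer.

```python
import pandas as pd

# The five sample rows shown in the question.
df = pd.DataFrame({
    "id": [1944, 1944, 159454, 159454, 159454],
    "date": ["2019-07-11", "2019-07-11", "2019-07-12", "2019-07-13", "2019-07-14"],
    "available": ["f", "t", "f", "f", "f"],
})

counts = (
    df.groupby(["id", "available"])
      .size()                                      # rows per (id, available) pair
      .unstack(fill_value=0)                       # one column per available value
      .reindex(columns=["t", "f"], fill_value=0)   # fixed column order, even if one is absent
)

print(counts)
```

With the counts as integers rather than NaN, every id gets a point in the scatter plot, including those that sit on one of the axes.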

You can group by both listing_id and available, do a count and then unstack and then plot using seaborn.
Below I used some random numbers, the image is only for illustration.
import seaborn as sns
data = df.groupby(['listing_id', 'available'])['date'].count().unstack()
sns.scatterplot(x=data.t, y=data.f, hue=data.index, legend='full')

Using your data, reset the index:
df.reset_index(inplace=True)
id available count
1944 t 364
1944 f 1
2015 f 184
2015 t 181
3176 t 279
3176 f 10
Create separate t and f dataframes:
t = df[df.available == 't'].reset_index(drop=True)
id available count
0 1944 t 364
1 2015 t 181
2 3176 t 279
f = df[df.available == 'f'].reset_index(drop=True)
id available count
0 1944 f 1
1 2015 f 184
2 3176 f 10
Plot the data:
plt.scatter(x=t['count'], y=f['count'])
plt.xlabel('t')
plt.ylabel('f')
for i, txt in enumerate(f['id'].tolist()):
    plt.annotate(txt, (t['count'].loc[i] + 3, f['count'].loc[i]))

Plotting multiple lines in the same graph for every different entry in a column

My dataset looks like this:
Town week price sales
A 1 1.1 101
A 2 1.2 303
A 3 1.3 234
B 1 1.2 987
B 2 1.5 213
B 3 3.9 423
C 1 2.4 129
C 2 1.3 238
C 3 1.3 132
Now I need to make a single figure with 3 lines (each representing a different town), where I plot the sales and price per week. I know how to do it when I take the mean over the towns, but I can't figure out how to do it per town.
data = pd.read_excel("data.xlsx")
dfEuroAvg = data[data['Product'] == "Euro"].groupby('Week').mean()
t = np.arange(1, 50, 1)
y3 = dfEuroAvg['Sales']
y4 = dfEuroAvg['Price']
fig, ax2 = plt.subplots()
color = 'tab:green'
ax2.set_xlabel('Week')
ax2.set_ylabel('Sales', color = color)
ax2.plot(t, y3, color = color)
ax2.tick_params(axis = 'y', labelcolor = color)
ax3 = ax2.twinx()
color = 'tab:orange'
ax3.set_ylabel('Price', color=color)
ax3.plot(t, y4, color=color)
ax3.tick_params(axis='y', labelcolor=color)
ax2.set_title("product = Euro, Sales vs. Price")
EDIT: On the X-axis are the weeks and on the Y-axis are the price and sales.

This is one way of doing it: use groupby to form groups based on Town, then plot price on the primary y-axis and sales on a secondary y-axis.
fig, ax = plt.subplots(figsize=(8, 6))
# select several columns with a list; tuple-style selection fails in recent pandas
df_group = data.groupby('Town')[['week', 'price', 'sales']]
colors = ['r', 'g', 'b']
for i, key in enumerate(df_group.groups.keys()):
    df_group.get_group(key).plot('week', 'price', color=colors[i], ax=ax, label=key)
    df_group.get_group(key).plot('week', 'sales', color=colors[i], linestyle='--', secondary_y=True, ax=ax)
handles, labels = ax.get_legend_handles_labels()
legends = ax.legend()
legends.remove()
plt.legend(handles, labels)
# pandas exposes the secondary axis as ax.right_ax
ax.set_ylabel('Price')
ax.right_ax.set_ylabel('Sales')

You will have to fetch the data for each town separately by filtering the dataframe.
# df = your dataframe with all the data
towns = ['A', 'B', 'C']
for town in towns:
    town_df = df[df['Town'] == town]
    plt.plot(town_df['week'], town_df['price'], label=town)
plt.legend()
plt.xlabel('Week')
plt.ylabel('Price')
plt.title('Price Graph')
plt.show()
I have done this for the price graph; you can create a graph with sales on the y-axis by following the same steps.
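If the list of towns is not known in advance, the same per-town split can come from iterating over `groupby` rather than hard-coding the names. This is a sketch under the assumption that the columns are named exactly as in the question's table; `series_by_town` is an illustrative name.

```python
import pandas as pd

# The question's dataset.
df = pd.DataFrame({
    "Town": list("AAABBBCCC"),
    "week": [1, 2, 3] * 3,
    "price": [1.1, 1.2, 1.3, 1.2, 1.5, 3.9, 2.4, 1.3, 1.3],
    "sales": [101, 303, 234, 987, 213, 423, 129, 238, 132],
})

# One price series per town, indexed by week. Iterating over a groupby
# yields (key, sub-frame) pairs, so the towns never have to be listed by hand.
series_by_town = {
    town: group.sort_values("week").set_index("week")["price"]
    for town, group in df.groupby("Town")
}

for town, prices in series_by_town.items():
    print(town, prices.tolist())
```

Each entry can then be drawn with `plt.plot(prices.index, prices.values, label=town)` inside the same loop.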

You may plot the pivoted data directly with pandas.
ax = df.pivot(index="week", columns="Town", values="price").plot()
ax2 = df.pivot(index="week", columns="Town", values="sales").plot(secondary_y=True, ax=ax)
Complete example:
import io
import pandas as pd
import matplotlib.pyplot as plt
u = """Town week price sales
A 1 1.1 101
A 2 1.2 303
A 3 1.3 234
B 1 1.2 987
B 2 1.5 213
B 3 3.9 423
C 1 2.4 129
C 2 1.3 238
C 3 1.3 132"""
df = pd.read_csv(io.StringIO(u), sep=r"\s+")
ax = df.pivot(index="week", columns="Town", values="price").plot(linestyle="--", legend=False)
ax.set_prop_cycle(None)
ax2 = df.pivot(index="week", columns="Town", values="sales").plot(secondary_y=True, ax=ax, legend=False)
ax.set_ylabel('Price')
ax2.set_ylabel('Sales')
ax2.legend()
plt.show()
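It may help to look at the table that the pivot step produces before it is plotted, because pandas then draws one line per column. The sketch below rebuilds the question's data and pivots only the price column; it passes `index`, `columns` and `values` as keywords, which is assumed here to be the form that works across pandas versions. The name `wide_price` is illustrative.

```python
import pandas as pd

# The question's dataset.
df = pd.DataFrame({
    "Town": list("AAABBBCCC"),
    "week": [1, 2, 3] * 3,
    "price": [1.1, 1.2, 1.3, 1.2, 1.5, 3.9, 2.4, 1.3, 1.3],
    "sales": [101, 303, 234, 987, 213, 423, 129, 238, 132],
})

# Reshape long -> wide: one row per week, one column per town.
wide_price = df.pivot(index="week", columns="Town", values="price")

print(wide_price)
```

Note that `pivot` raises an error when a (week, Town) pair occurs more than once; in that situation `pivot_table` with an aggregation function is the usual alternative.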
