Plotting graph with categorical axes - python

I have the following dataframe. I want to plot both the max and the min data on the same graph, using Month_Day as the x-axis but labelling the ticks only with the month names ('Jan', 'Feb', 'Mar', etc.).
Month_Day max min
0 Jan-01 243 86
1 Jan-02 230 90
2 Jan-03 233 104
3 Jan-04 220 73
4 Jan-05 224 71
But once I include the dates, it raises an error.
dates = pd.date_range('1/1/2015','31/12/2015', freq='D')
plt.plot(tmax, '-r', tmin, '-b')
#plt.plot(dates, tmax, '-r', dates, tmin, '-b') <- this is the line where I use dates as the x-axis
plt.fill_between(range(len(tmin)), tmin, tmax, facecolor='gray', alpha=0.25)
plt.grid(True)
gives the error:
error: ordinal must be >= 1

You could use xaxis.set_major_formatter().
Here's a simple example of this:
import datetime
import random
import matplotlib.pyplot as plt
import matplotlib.dates as mdate
# make up some data
x = [datetime.datetime.now() + datetime.timedelta(days=i) for i in range(180)]
y = [i+random.gauss(0,1) for i,_ in enumerate(x)]
p1 = plt.subplot(211)
p1.xaxis.set_major_formatter(mdate.DateFormatter('%b', None))
# plot
plt.plot(x,y)
# beautify the x-labels
plt.gcf().autofmt_xdate()
plt.show()
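Applied to the data in the question, the same formatter could be used roughly as follows. This is a minimal sketch, not the asker's actual script: it rebuilds only the five sample rows, assumes every Month_Day value belongs to 2015 (the year in the question's date_range), adds a MonthLocator so that one tick is placed per month, and saves to an arbitrary file name.

```python
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import pandas as pd

# The five sample rows from the question
df = pd.DataFrame({
    "Month_Day": ["Jan-01", "Jan-02", "Jan-03", "Jan-04", "Jan-05"],
    "max": [243, 230, 233, 220, 224],
    "min": [86, 90, 104, 73, 71],
})

# Assumption: all rows are from 2015, so prepend the year before parsing
dates = pd.to_datetime("2015-" + df["Month_Day"], format="%Y-%b-%d")

fig, ax = plt.subplots()
ax.plot(dates, df["max"], "-r", label="max")
ax.plot(dates, df["min"], "-b", label="min")
# Use the same x values for the shaded band as for the two lines
ax.fill_between(dates, df["min"], df["max"], facecolor="gray", alpha=0.25)

# One tick per month, labelled with the abbreviated month name
ax.xaxis.set_major_locator(mdates.MonthLocator())
ax.xaxis.set_major_formatter(mdates.DateFormatter("%b"))
ax.grid(True)
ax.legend()
fig.savefig("min_max_by_month.png")  # or plt.show() in an interactive session
```

Passing the same dates to plot and fill_between keeps the lines and the shaded band on one date axis, instead of mixing dates with range(len(tmin)).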

Related

How to put dataframe labels on a density plot in matplotlib

#dataframe
a=
timestamp count
2021-08-16 20
2021-08-17 60
2021-08-18 35
2021-08-19 1
2021-08-20 0
2021-08-21 1
2021-08-22 50
2021-08-23 36
2021-08-24 68
2021-08-25 125
2021-08-26 54
I applied this code:
a.plot(kind="density")
This is not what I want.
I want count on the y-axis and timestamp on the x-axis in a density plot,
just like I can do with plt.bar(a['timestamp'], a['count']).
Or is this not possible with density plotting?
The following code creates a density histogram. The total area sums to 1, supposing each of the timestamps counts as 1 unit. To get the timestamps as x-axis, they are set as the index. To get the total area to sum to 1, all count values are divided by their total sum.
A KDE is calculated from the same data.
from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
from scipy.stats import gaussian_kde
from io import StringIO
a_str = '''timestamp count
2021-08-16 20
2021-08-17 60
2021-08-18 35
2021-08-19 1
2021-08-20 0
2021-08-21 1
2021-08-22 50
2021-08-23 36
2021-08-24 68
2021-08-25 125
2021-08-26 54'''
a = pd.read_csv(StringIO(a_str), sep=r'\s+')
ax = (a.set_index('timestamp') / a['count'].sum()).plot.bar(width=0.9, rot=0, figsize=(12, 5))
kde = gaussian_kde(np.arange(len(a)), bw_method=0.2, weights=a['count'])
xs = np.linspace(-1, len(a), 200)
ax.plot(xs, kde(xs), lw=2, color='crimson', label='kde')
ax.set_xlim(xs[0], xs[-1])
ax.legend(labels=['kde', 'density histogram'])
ax.set_xlabel('')
ax.set_ylabel('density')
plt.tight_layout()
plt.show()
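The claim that both the bars and the KDE cover a total area of 1 can be checked numerically. The sketch below rebuilds the sample data and uses gaussian_kde.integrate_box_1d over the whole real line; the variable names bar_area and kde_area are only illustrative.

```python
import numpy as np
import pandas as pd
from scipy.stats import gaussian_kde

# Same sample data as in the question
a = pd.DataFrame({
    "timestamp": pd.date_range("2021-08-16", periods=11, freq="D").strftime("%Y-%m-%d"),
    "count": [20, 60, 35, 1, 0, 1, 50, 36, 68, 125, 54],
})

# Bar heights of the density histogram: each count divided by the total
density = a["count"] / a["count"].sum()
bar_area = density.sum()  # each bar is treated as 1 unit wide

# Weighted KDE over the row positions 0..10, as in the answer
kde = gaussian_kde(np.arange(len(a)), bw_method=0.2, weights=a["count"].to_numpy())
kde_area = kde.integrate_box_1d(-np.inf, np.inf)

print(f"area of the bars:   {bar_area:.6f}")
print(f"area under the kde: {kde_area:.6f}")
```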
If you just want to plot the kde curve, you can leave out the histogram. Optionally you can fill the area under the curve.
fig, ax = plt.subplots(figsize=(12, 5))
kde = gaussian_kde(np.arange(len(a)), bw_method=0.2, weights=a['count'])
xs = np.linspace(-1, len(a), 200)
# plot the kde curve
ax.plot(xs, kde(xs), lw=2, color='crimson', label='kernel density estimation')
# optionally fill the area below the curve
ax.fill_between(xs, kde(xs), color='crimson', alpha=0.2)
ax.set_xticks(np.arange(len(a)))
ax.set_xticklabels(a['timestamp'])
ax.set_xlim(xs[0], xs[-1])
ax.set_ylim(ymin=0)
ax.legend()
ax.set_xlabel('')
ax.set_ylabel('density')
plt.tight_layout()
plt.show()
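If a real date axis is preferred over row positions with relabelled ticks, one possible variation is to feed matplotlib's float date numbers to the KDE. This is a sketch under the assumption that the timestamps parse as daily dates; it is a swapped-in alternative, not the code used above, and the output file name is arbitrary.

```python
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.stats import gaussian_kde

a = pd.DataFrame({
    "timestamp": pd.date_range("2021-08-16", periods=11, freq="D"),
    "count": [20, 60, 35, 1, 0, 1, 50, 36, 68, 125, 54],
})

# Convert the timestamps to matplotlib's float date numbers (unit: days)
x_num = mdates.date2num(a["timestamp"].to_numpy())
kde = gaussian_kde(x_num, bw_method=0.2, weights=a["count"].to_numpy())
xs = np.linspace(x_num.min() - 1, x_num.max() + 1, 200)

fig, ax = plt.subplots(figsize=(12, 5))
ax.plot(xs, kde(xs), lw=2, color="crimson", label="kernel density estimation")
ax.fill_between(xs, kde(xs), color="crimson", alpha=0.2)

# Tell matplotlib that the x values are dates and put one tick on each day
ax.xaxis_date()
ax.xaxis.set_major_locator(mdates.DayLocator())
ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y-%m-%d"))
ax.set_xlim(xs[0], xs[-1])
ax.set_ylim(bottom=0)
ax.set_ylabel("density")
ax.legend()
fig.autofmt_xdate()
fig.savefig("kde_by_date.png")  # or plt.show() in an interactive session
```

Because the x unit is one day in both versions, the density values are on the same scale as with row positions.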
To plot multiple similar curves, for example using more count columns, you can use a loop. A list of colors that go well together could be obtained from the Set2 colormap:
from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
from scipy.stats import gaussian_kde
a = pd.DataFrame({'timestamp': ['2021-08-16', '2021-08-17', '2021-08-18', '2021-08-19', '2021-08-20', '2021-08-21',
                                '2021-08-22', '2021-08-23', '2021-08-24', '2021-08-25', '2021-08-26']})
for i in range(1, 5):
    a[f'count{i}'] = (np.random.uniform(0, 12, len(a)) ** 2).astype(int)
xs = np.linspace(-1, len(a), 200)
fig, ax = plt.subplots(figsize=(12, 4))
for column, color in zip(a.columns[1:], plt.cm.Set2.colors):
    kde = gaussian_kde(np.arange(len(a)), bw_method=0.2, weights=a[column])
    ax.plot(xs, kde(xs), lw=2, color=color, label=f"kde of '{column}'")
    ax.fill_between(xs, kde(xs), color=color, alpha=0.2)
ax.set_xticks(np.arange(len(a)))
ax.set_xticklabels(a['timestamp'])
ax.set_xlim(xs[0], xs[-1])
ax.set_ylim(ymin=0)
ax.legend()
ax.set_xlabel('Date')
ax.set_ylabel('Density of Counts')
plt.tight_layout()
plt.show()

Set color-palette in Seaborn Grouped Barplot depending on values

I have a dataframe with positive and negative values from three kinds of variables.
labels variable value
0 -10e5 nat -38
1 2e5 nat 50
2 10e5 nat 16
3 -10e5 agr -24
4 2e5 agr 35
5 10e5 agr 26
6 -10e5 art -11
7 2e5 art 43
8 10e5 art 20
When the values are negative, I want the barplot to follow this color sequence:
n_palette = ["#ff0000","#ff0000","#00ff00"]
When they are positive, I want the palette reversed instead:
p_palette = ["#00ff00","#00ff00","#ff0000"]
I've tried this:
palette = ["#ff0000","#ff0000","#00ff00",
           "#00ff00","#00ff00","#ff0000",
           "#00ff00","#00ff00","#ff0000"]
ax = sns.barplot(x=melted['labels'], y=melted['value'], hue = melted['variable'],
                 linewidth=1,
                 palette=palette)
But I get the following output:
What I'd like is for the first two bars of each group to be green and the last one red when the values are positive.
You seem to want to do the coloring depending on a criterion on two columns. It seems suitable to add a new column which uniquely labels that criterion.
Further, seaborn allows the palette to be a dictionary telling exactly which hue label gets which color. Adding barplot(..., order=[...]) would define a fixed order.
Here is some example code:
from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
from io import StringIO
data_str = ''' labels variable value
0 -10e5 nat -38
1 2e5 nat 50
2 10e5 nat 16
3 -10e5 agr -24
4 2e5 agr 35
5 10e5 agr 26
6 -10e5 art -11
7 2e5 art 43
8 10e5 art 20
'''
melted = pd.read_csv(StringIO(data_str), sep=r'\s+', dtype={'labels': str})
melted['legend'] = np.where(melted['value'] < 0, '-', '+')
melted['legend'] = melted['variable'] + melted['legend']
palette = {'nat-': "#ff0000", 'agr-': "#ff0000", 'art-': "#00ff00",
'nat+': "#00ff00", 'agr+': "#00ff00", 'art+': "#ff0000"}
ax = sns.barplot(x=melted['labels'], y=melted['value'], hue=melted['legend'],
linewidth=1, palette=palette)
ax.axhline(0, color='black')
plt.show()
PS: To remove the legend: ax.legend_.remove(). Or to have a legend with multiple columns: ax.legend(ncol=3).
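To see which color each row ends up with, the combined key and the dictionary lookup can be inspected without plotting. This is a small sketch on the question's data; the extra 'color' column and the missing variable are only for illustration and are not needed by seaborn.

```python
import numpy as np
import pandas as pd

melted = pd.DataFrame({
    "labels": ["-10e5", "2e5", "10e5"] * 3,
    "variable": ["nat"] * 3 + ["agr"] * 3 + ["art"] * 3,
    "value": [-38, 50, 16, -24, 35, 26, -11, 43, 20],
})

# Build the combined key: variable name plus the sign of the value
melted["legend"] = np.where(melted["value"] < 0, "-", "+")
melted["legend"] = melted["variable"] + melted["legend"]

palette = {"nat-": "#ff0000", "agr-": "#ff0000", "art-": "#00ff00",
           "nat+": "#00ff00", "agr+": "#00ff00", "art+": "#ff0000"}

# Every key that occurs in the data should have an entry in the palette
missing = set(melted["legend"]) - set(palette)
print("keys without a color:", missing)

# Illustrative only: look up the color each row would get
melted["color"] = melted["legend"].map(palette)
print(melted[["labels", "variable", "value", "legend", "color"]])
```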
A different approach, directly with the original dataframe, is to create two bar plots: one for the negative values and one for the positive. For this to work well, it is necessary that the 'labels' column (the x=) is explicitly made categorical. Also adding pd.Categorical(..., categories=['nat', 'agr', 'art']) for the 'variable' column could fix an order.
This will generate a legend with the labels twice with different colors. Depending on what you want, you can remove it or create a more custom legend.
An idea is to add the labels under the positive and on top of the negative bars:
sns.set()
melted = pd.read_csv(StringIO(data_str), sep=r'\s+', dtype={'labels': str})
palette_pos = {'nat': "#00ff00", 'agr': "#00ff00", 'art': "#ff0000"}
palette_neg = {'nat': "#ff0000", 'agr': "#ff0000", 'art': "#00ff00"}
melted['labels'] = pd.Categorical(melted['labels'])
ax = sns.barplot(data=melted[melted['value'] < 0], x='labels', y='value', hue='variable',
linewidth=1, palette=palette_neg)
sns.barplot(data=melted[melted['value'] >= 0], x='labels', y='value', hue='variable',
linewidth=1, palette=palette_pos, ax=ax)
ax.legend_.remove()
ax.axhline(0, color='black')
ax.set_xlabel('')
ax.set_ylabel('')
for bar_container in ax.containers:
    label = bar_container.get_label()
    for p in bar_container:
        x = p.get_x() + p.get_width() / 2
        h = p.get_height()
        if not np.isnan(h):
            ax.text(x, 0, label + '\n\n' if h < 0 else '\n\n' + label, ha='center', va='center')
plt.show()
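Why the 'labels' column should be categorical for the two-plot approach can be seen on the pandas side alone: after filtering, a categorical column still carries every category, including those with no rows left. The sketch below only demonstrates that property; the explicit category order is an assumption made for illustration.

```python
import pandas as pd

melted = pd.DataFrame({
    "labels": ["-10e5", "2e5", "10e5"] * 3,
    "variable": ["nat"] * 3 + ["agr"] * 3 + ["art"] * 3,
    "value": [-38, 50, 16, -24, 35, 26, -11, 43, 20],
})

# Assumed order for illustration; pd.Categorical(melted['labels']) would sort the strings instead
melted["labels"] = pd.Categorical(melted["labels"], categories=["-10e5", "2e5", "10e5"])

negative = melted[melted["value"] < 0]

# Values that actually occur in the filtered rows
present = negative["labels"].astype(str).unique().tolist()
# Categories the column still knows about
known = list(negative["labels"].cat.categories)

print("present in negative rows:", present)
print("categories still defined:", known)
```

Since both subsets keep the same categories, the two barplot calls can place their bars at the same x positions.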
Still another option involves sns.catplot() which could be clearer when a lot of data is involved:
sns.set()
melted = pd.read_csv(StringIO(data_str), sep=r'\s+', dtype={'labels': str})
melted['legend'] = np.where(melted['value'] < 0, '-', '+')
melted['legend'] = melted['variable'] + melted['legend']
palette = {'nat-': "#ff0000", 'agr-': "#ff0000", 'art-': "#00ff00",
'nat+': "#00ff00", 'agr+': "#00ff00", 'art+': "#ff0000"}
g = sns.catplot(kind='bar', data=melted, col='labels', y='value', x='legend',
linewidth=1, palette=palette, sharex=False, sharey=True)
for ax in g.axes.flat:
    ax.axhline(0, color='black')
    ax.set_xlabel('')
    ax.set_ylabel('')
plt.show()

How to draw cumulative density plot from pandas?

I have a dataframe:
count_single count_multi column_names
0 11345 7209 e
1 11125 6607 w
2 10421 5105 j
3 9840 4478 r
4 9561 5492 f
5 8317 3937 i
6 7808 3795 l
7 7240 4219 u
8 6915 3854 s
9 6639 2750 n
10 6340 2465 b
11 5627 2834 y
12 4783 2384 c
13 4401 1698 p
14 3305 1753 g
15 3283 1300 o
16 2767 1697 t
17 2453 1276 h
18 2125 1140 a
19 2090 929 q
20 1330 518 d
I want to visualize count_single and count_multi, with column_names as the common axis for both. I am looking for something like this:
What I've tried:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_context('paper')
f, ax = plt.subplots(figsize = (6,15))
sns.set_color_codes('pastel')
sns.barplot(x = 'count_single', y = 'column_names', data = df,
label = 'Type_1', color = 'orange', edgecolor = 'w')
sns.set_color_codes('muted')
sns.barplot(x = 'count_multi', y = 'column_names', data = df,
label = 'Type_2', color = 'green', edgecolor = 'w')
ax.legend(ncol = 2, loc = 'lower right')
sns.despine(left = True, bottom = True)
plt.show()
It gives me a plot like this:
How can I visualize these two columns as in the expected image?
I really appreciate any help you can provide.
# instantiate figure with two rows and one column
fig, axes = plt.subplots(nrows=2, figsize=(10,5))
# plot barplot in the first row
df.set_index('column_names').plot.bar(ax=axes[0], color=['rosybrown', 'tomato'])
# scale each column by dividing by its sum, then take the cumulative sum to get the cumulative density; plot it on the second ax
df.set_index('column_names').apply(lambda x: x/x.sum()).cumsum().plot(ax=axes[1], color=['rosybrown', 'tomato'])
# change ticks in first plot:
axes[0].set_yticks(np.linspace(0, 12000, 7)) # this means: make 7 ticks between 0 and 12000
# adjust the axislabels for the second plot
axes[1].set_xticks(range(len(df)))
axes[1].set_xticklabels(df['column_names'], rotation=90)
plt.tight_layout()
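The normalize-then-cumsum step can be checked on a few rows without plotting. This sketch uses only the first five rows of the question's dataframe, so the shares differ from those of the full data; the names shares and cumulative are illustrative.

```python
import numpy as np
import pandas as pd

# First five rows of the question's dataframe
df = pd.DataFrame({
    "count_single": [11345, 11125, 10421, 9840, 9561],
    "count_multi": [7209, 6607, 5105, 4478, 5492],
    "column_names": list("ewjrf"),
})

# Each value as a fraction of its column total
shares = df.set_index("column_names").apply(lambda x: x / x.sum())
# Running total of the fractions: never decreases and ends at 1 in every column
cumulative = shares.cumsum()

print(cumulative.round(3))
```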

Plot Multiple Y axis + 'hue' scatterplot in python

Dataframe
df
Sample Type y1 y2 y3 y4
S1 H 1000 135 220 171
S2 H 2900 1560 890 194
S3 P 678 350 127 255
S4 P 179 510 154 275
I want to plot y1, y2, y3, y4 vs Sample scatterplot with hue as Type.
Is there any way to do it in Seaborn?
Since you want just one plot, you can use sns.scatterplot:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
#df = pd.read_csv('yourfile.csv')
#plotting
df1 = df.melt(['Type','Sample'])
sns.scatterplot(data=df1, x="Sample", y="value",hue="Sample",style="Type")
plt.show()
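The key step is df.melt, which turns the four y columns into one 'variable' column and one 'value' column so that seaborn can plot them together. A small sketch with the question's data shows the reshaped frame; the dataframe is rebuilt by hand here instead of being read from a file.

```python
import pandas as pd

df = pd.DataFrame({
    "Sample": ["S1", "S2", "S3", "S4"],
    "Type": ["H", "H", "P", "P"],
    "y1": [1000, 2900, 678, 179],
    "y2": [135, 1560, 350, 510],
    "y3": [220, 890, 127, 154],
    "y4": [171, 194, 255, 275],
})

# Keep 'Type' and 'Sample' as identifiers; stack y1..y4 into long format
df1 = df.melt(["Type", "Sample"])

print(df1.head())
print(df1.shape)
```

Each of the 4 samples now appears once per y column, which is why a single x="Sample", y="value" call draws all four series.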
In case you want multiple scatter plots, you can use sns.relplot:
#some preprocessing
df1 = df.melt(['Type','Sample'])
#plotting
sns.relplot(data=df1, x="Sample", y="value", hue="Type", col="variable", height=2, aspect=1.5)
plt.show()
In case you want a 2x2 grid:
df1 = df.melt(['Type','Sample'])
#plotting
sns.relplot(data=df1, x="Sample", y="value", hue="Type", col="variable",col_wrap=2, height=2, aspect=1.5)
plt.show()
In case you want a single-column grid (col_wrap=1 stacks the four plots vertically):
df1 = df.melt(['Type','Sample'])
#plotting
sns.relplot(data=df1, x="Sample", y="value", hue="Type", col="variable",col_wrap=1, height=2, aspect=1.5)
plt.show()

Add horizontal lines to plot based on sort_values criteria

Question:
How do I add horizontal lines to a plot based on the sort_values criteria below, captured in the top_5 variable?
Data:
Here is a slice of the data in a CSV:
This is the current plot.
axnum = today_numBars_slice[['High','Low']].plot()
axnum.yaxis.set_major_formatter(FormatStrFormatter('%.2f'))
This is the data I want to add to this plot (the High and Low values from each row):
top_5 = today_numBars_slice[['High','Low','# of Trades']].sort_values(by='# of Trades',ascending=False).head()
top_5
High Low # of Trades
Timestamp
2017-01-02 12:55:09.100 164.88 164.84 470
2017-01-02 12:10:12.000 164.90 164.86 465
2017-01-02 12:38:59.000 164.90 164.86 431
2017-01-02 11:54:49.100 164.87 164.83 427
2017-01-02 10:52:26.000 164.60 164.56 332
Desired output:
This is an example of the desired output showing two of the lines from top_5:
You can use the faster DataFrame.nlargest to get the top 5 rows, then loop over them with iterrows and draw each line with axhline:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
df = pd.read_csv('for_stack_nums')
#print (df.head())
top_5 = df[['High','Low','# of Trades']].nlargest(5, '# of Trades')
print (top_5)
High Low # of Trades
94 164.88 164.84 470
90 164.90 164.86 465
93 164.90 164.86 431
89 164.87 164.83 427
65 164.60 164.56 332
axnum = df[['High','Low']].plot()
axnum.yaxis.set_major_formatter(ticker.FormatStrFormatter('%.2f'))
for idx, l in top_5.iterrows():
    plt.axhline(y=l['High'], color='r')
    plt.axhline(y=l['Low'], color='b')
plt.show()
The intermediate top_5 subset is also not necessary:
df = pd.read_csv('for_stack_nums.csv')
#print (df.head())
axnum = df[['High','Low']].plot()
axnum.yaxis.set_major_formatter(ticker.FormatStrFormatter('%.2f'))
for idx, l in df.nlargest(5, '# of Trades').iterrows():
    plt.axhline(y=l['High'], color='r')
    plt.axhline(y=l['Low'], color='b')
plt.show()
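That nlargest selects the same rows as sort_values(...).head() can be checked on a small frame. The sketch below reuses the five rows printed above as stand-in data (the full CSV is not shown in the question) and compares the top 3; there are no ties in '# of Trades' here, and with ties the two approaches may order rows differently.

```python
import pandas as pd

# Stand-in data: the five rows shown in the printed top_5 output
df = pd.DataFrame(
    {
        "High": [164.88, 164.90, 164.90, 164.87, 164.60],
        "Low": [164.84, 164.86, 164.86, 164.83, 164.56],
        "# of Trades": [470, 465, 431, 427, 332],
    },
    index=[94, 90, 93, 89, 65],
).sort_index()

by_nlargest = df.nlargest(3, "# of Trades")
by_sort = df.sort_values(by="# of Trades", ascending=False).head(3)

# equals() compares values, index and dtypes
same = by_nlargest.equals(by_sort)
print(by_nlargest)
print("same rows in the same order:", same)
```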
Would pyplot.axhline be what you're looking for?
axnum = today_numBars_slice[['High','Low']].plot()
axnum.yaxis.set_major_formatter(FormatStrFormatter('%.2f'))
top_5 = today_numBars_slice[['High','Low','# of Trades']].sort_values(by='# of Trades',ascending=False).head()
# iterrows() yields (index, row) pairs, so unpack them; column names are case-sensitive
for _, row in top_5.iterrows():
    plt.axhline(row['High'], color='r')
    plt.axhline(row['Low'], color='b')
plt.show()
