How to plot 3 or more values in plot.bar() - python

I can draw a bar chart with plot.bar() from two lists of values, but I can't work out how to plot three.
I tried plot.bar(x, y, z), but it didn't work.
ce_data = ce_data.drop(
    ['pchangeinOpenInterest', 'totalTradedVolume', 'impliedVolatility',  # this removes unnecessary items
     'pChange', 'totalBuyQuantity', 'totalSellQuantity', 'bidQty',
     'bidprice', 'askQty', 'askPrice', 'identifier', 'lastPrice', 'change', 'expiryDate',
     'underlying'], axis=1)[
    ['openInterest', 'changeinOpenInterest', 'strikePrice', 'underlyingValue']]
style.use('ggplot')
ce_data.to_csv('kumar.csv')
df = pd.read_csv('kumar.csv', parse_dates=True, index_col=0)
pivot = df.iloc[2, 3] # row 2, column 3 (the 'underlyingValue' column, given the column order above)
pivot_round = round(pivot, -2) # round the price to the nearest 100
x = df['strikePrice'].tolist()
y = df['changeinOpenInterest'].tolist()
z = df['openInterest'].tolist()
for i in range(len(x)):
    if int(x[i]) >= pivot_round - 400:
        xleftpos = i
        break
for i in range(len(x)):
    if int(x[i]) >= pivot_round + 400:
        xrightpos = i
        break
x = x[xleftpos:xrightpos]
y = y[xleftpos:xrightpos]
z = z[xleftpos:xrightpos]
plot.bar([value for value in range(len(x))],y)
plot.set_xticks([idx + 0.5 for idx in range(len(x))])
plot.set_xticklabels(x, rotation=35, ha='right', size=10)
I am expecting the strike price on the x-axis, with y and z (change in OI and OI) drawn as bars.

IIUC, here's how I'd do it. This should have a single x-axis w/ 'strikePrice' and two bars of 'changeinOpenInterest' and 'openInterest'.
disp_df = df.set_index('strikePrice')[['changeinOpenInterest', 'openInterest']]
disp_df.plot(kind='bar')
You can add the bells and whistles you want to the plot, but this avoids a lot of the manipulation you did above.
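For the windowing step in the question (keeping only strikes within 400 points of the rounded pivot), a boolean mask can stand in for the two loops. The following is a minimal sketch rather than the asker's exact pipeline: the helper names `strikes_near` and `plot_grouped_bars` and the sample numbers are made up for illustration, and matplotlib is imported only when the plotting helper is called.

```python
import pandas as pd


def strikes_near(df, pivot, width=400):
    """Keep strikes in [center - width, center + width), indexed by strike.

    `center` is `pivot` rounded to the nearest 100, as in the question.
    """
    center = round(pivot, -2)
    mask = (df["strikePrice"] >= center - width) & (df["strikePrice"] < center + width)
    cols = ["changeinOpenInterest", "openInterest"]
    return df.loc[mask].set_index("strikePrice")[cols]


def plot_grouped_bars(frame):
    """Draw one group of two bars per strike price (hypothetical helper)."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for drawing

    ax = frame.plot(kind="bar", rot=35)
    ax.set_xlabel("strikePrice")
    plt.show()


# Made-up sample data, only to show the shape of the result.
sample = pd.DataFrame({
    "strikePrice": [11000, 11500, 11800, 12000, 12300, 12500],
    "changeinOpenInterest": [10, -5, 20, 35, -15, 8],
    "openInterest": [100, 80, 150, 300, 120, 60],
})
window = strikes_near(sample, pivot=12040.5)
```

Calling `plot_grouped_bars(window)` would then draw the two columns side by side for each remaining strike.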

Related

Reorder Sankey diagram vertically based on label value

I'm trying to plot patient flows between 3 clusters in a Sankey diagram. I have a pd.DataFrame counts with from-to values, see below. To reproduce this DF, here is the counts dict that should be loaded into a pd.DataFrame (which is the input for the visualize_cluster_flow_counts function).
from to value
0 C1_1 C1_2 867
1 C1_1 C2_2 405
2 C1_1 C0_2 2
3 C2_1 C1_2 46
4 C2_1 C2_2 458
... ... ... ...
175 C0_20 C0_21 130
176 C0_20 C2_21 1
177 C2_20 C1_21 12
178 C2_20 C0_21 0
179 C2_20 C2_21 96
The from and to values in the DataFrame represent the cluster number (either 0, 1, or 2) and the amount of days for the x-axis (between 1 and 21). If I plot the Sankey diagram with these values, this is the result:
Code:
import plotly.graph_objects as go
def visualize_cluster_flow_counts(counts):
    all_sources = list(set(counts['from'].values.tolist() + counts['to'].values.tolist()))
    froms, tos, vals, labs = [], [], [], []
    for index, row in counts.iterrows():
        froms.append(all_sources.index(row.values[0]))
        tos.append(all_sources.index(row.values[1]))
        vals.append(row[2])
        labs.append(row[3])
    fig = go.Figure(data=[go.Sankey(
        arrangement='snap',
        node = dict(
            pad = 15,
            thickness = 5,
            line = dict(color = "black", width = 0.1),
            label = all_sources,
            color = "blue"
        ),
        link = dict(
            source = froms,
            target = tos,
            value = vals,
            label = labs
        ))])
    fig.update_layout(title_text="Patient flow between clusters over time: 48h (2 days) - 504h (21 days)", font_size=10)
    fig.show()
visualize_cluster_flow_counts(counts)
However, I would like to vertically order the bars so that the C0's are always on top, the C1's are always in the middle, and the C2's are always at the bottom (or the other way around, doesn't matter). I know that we can set node.x and node.y to manually assign the coordinates. So, I set the x-values to the amount of days * (1/range of days), which is an increment of +- 0.045. And I set the y-values based on the cluster value: either 0, 0.5 or 1. I then obtain the image below. The vertical order is good, but the vertical margins between the bars are obviously way off; they should be similar to the first result.
The code to produce this is:
import plotly.graph_objects as go
def find_node_coordinates(sources):
    x_nodes, y_nodes = [], []
    for s in sources:
        # Shift each x with +- 0.045
        x = float(s.split("_")[-1]) * (1/21)
        x_nodes.append(x)
        # Choose either 0, 0.5 or 1 for the y-value
        cluster_number = s[1]
        if cluster_number == "0": y = 1
        elif cluster_number == "1": y = 0.5
        else: y = 1e-09
        y_nodes.append(y)
    return x_nodes, y_nodes
def visualize_cluster_flow_counts(counts):
    all_sources = list(set(counts['from'].values.tolist() + counts['to'].values.tolist()))
    node_x, node_y = find_node_coordinates(all_sources)
    froms, tos, vals, labs = [], [], [], []
    for index, row in counts.iterrows():
        froms.append(all_sources.index(row.values[0]))
        tos.append(all_sources.index(row.values[1]))
        vals.append(row[2])
        labs.append(row[3])
    fig = go.Figure(data=[go.Sankey(
        arrangement='snap',
        node = dict(
            pad = 15,
            thickness = 5,
            line = dict(color = "black", width = 0.1),
            label = all_sources,
            color = "blue",
            x = node_x,
            y = node_y,
        ),
        link = dict(
            source = froms,
            target = tos,
            value = vals,
            label = labs
        ))])
    fig.update_layout(title_text="Patient flow between clusters over time: 48h (2 days) - 504h (21 days)", font_size=10)
    fig.show()
visualize_cluster_flow_counts(counts)
Question: how do I fix the margins of the bars, so that the result looks like the first result? So, for clarity: the bars should be pushed to the bottom. Or is there another way that the Sankey diagram can vertically re-order the bars automatically based on the label value?
Firstly, I don't think there is a way to achieve your goal smoothly with the currently exposed API; you can check the source code here.
Try changing your find_node_coordinates function as follows (note that you should pass the counts DataFrame to it too):
counts = pd.DataFrame(counts_dict)
def find_node_coordinates(sources, counts):
    x_nodes, y_nodes = [], []
    flat_on_top = False
    range = 1 # The y range
    total_margin_width = 0.15
    y_range = 1 - total_margin_width
    margin = total_margin_width / 2 # From number of Cs
    srcs = counts['from'].values.tolist()
    dsts = counts['to'].values.tolist()
    values = counts['value'].values.tolist()
    max_acc = 0
    def _calc_day_flux(d=1):
        _max_acc = 0
        for i in [0,1,2]:
            # The first ones
            from_source = 'C{}_{}'.format(i,d)
            indices = [i for i, val in enumerate(srcs) if val == from_source]
            for j in indices:
                _max_acc += values[j]
        return _max_acc
    def _calc_node_io_flux(node_str):
        c,d = int(node_str.split('_')[0][-1]), int(node_str.split('_')[1])
        _flux_src = 0
        _flux_dst = 0
        indices_src = [i for i, val in enumerate(srcs) if val == node_str]
        indices_dst = [j for j, val in enumerate(dsts) if val == node_str]
        for j in indices_src:
            _flux_src += values[j]
        for j in indices_dst:
            _flux_dst += values[j]
        return max(_flux_dst, _flux_src)
    max_acc = _calc_day_flux()
    graph_unit_per_val = y_range / max_acc
    print("Graph Unit per Acc Val", graph_unit_per_val)
    for s in sources:
        # Shift each x with +- 0.045
        d = int(s.split("_")[-1])
        x = float(d) * (1/21)
        x_nodes.append(x)
        print(s, _calc_node_io_flux(s))
        # Choose either 0, 0.5 or 1 for the y-value
        cluster_number = s[1]
        # Flat on Top
        if flat_on_top:
            if cluster_number == "0":
                y = _calc_node_io_flux('C{}_{}'.format(2, d))*graph_unit_per_val + margin + _calc_node_io_flux('C{}_{}'.format(1, d))*graph_unit_per_val + margin + _calc_node_io_flux('C{}_{}'.format(0, d))*graph_unit_per_val/2
            elif cluster_number == "1": y = _calc_node_io_flux('C{}_{}'.format(2, d))*graph_unit_per_val + margin + _calc_node_io_flux('C{}_{}'.format(1, d))*graph_unit_per_val/2
            else: y = 1e-09
        # Flat On Bottom
        else:
            if cluster_number == "0": y = 1 - (_calc_node_io_flux('C{}_{}'.format(0,d))*graph_unit_per_val / 2)
            elif cluster_number == "1": y = 1 - (_calc_node_io_flux('C{}_{}'.format(0,d))*graph_unit_per_val + margin + _calc_node_io_flux('C{}_{}'.format(1,d)) * graph_unit_per_val /2 )
            elif cluster_number == "2": y = 1 - (_calc_node_io_flux('C{}_{}'.format(0,d))*graph_unit_per_val + margin + _calc_node_io_flux('C{}_{}'.format(1,d)) * graph_unit_per_val + margin + _calc_node_io_flux('C{}_{}'.format(2,d)) * graph_unit_per_val /2 )
        y_nodes.append(y)
    return x_nodes, y_nodes
Sankey diagrams are supposed to scale the width of each connection by its normalized value, right? I do the same here: first I calculate the flux of each node, then I use the normalized flux to calculate the center coordinate of each node.
Here is the sample output of your code with the modified function. Note that I tried to stay as close to your code as possible, so it's a bit unoptimized (for example, one could store the flux of the nodes above each source node to avoid recalculating it).
With flag flat_on_top = True
With flag flat_on_top = False
There is a bit of inconsistency in the flat_on_top = False version, which I think is caused by the padding or other internals of Plotly.
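The flux idea can also be sketched more compactly with pandas. This is an illustrative variant, not the function above: the helper names `node_flux` and `stacked_centers` are made up, node labels are assumed to look like `C<cluster>_<day>`, and each day is normalized on its own total (the answer's code instead scales everything by day 1's total).

```python
import pandas as pd


def node_flux(counts):
    """Flux of each node: the larger of its total outgoing and incoming value."""
    out_flux = counts.groupby("from")["value"].sum()
    in_flux = counts.groupby("to")["value"].sum()
    # Nodes that appear on only one side are treated as 0 on the other side.
    return out_flux.combine(in_flux, max, fill_value=0)


def stacked_centers(flux, day, clusters=(0, 1, 2), total_margin=0.15):
    """Normalized y centers for one day's nodes, stacked in `clusters` order."""
    labels = ["C{}_{}".format(c, day) for c in clusters]
    heights = [float(flux.get(label, 0.0)) for label in labels]
    total = sum(heights)
    unit = (1 - total_margin) / total if total else 0.0
    gap = total_margin / (len(labels) - 1) if len(labels) > 1 else 0.0
    centers, edge = {}, 0.0
    for label, height in zip(labels, heights):
        centers[label] = edge + height * unit / 2  # middle of this node's bar
        edge += height * unit + gap                # move past the bar and one margin
    return centers


# Made-up flows between two days, only to exercise the helpers.
counts = pd.DataFrame({
    "from": ["C0_1", "C0_1", "C1_1"],
    "to": ["C0_2", "C1_2", "C1_2"],
    "value": [30, 10, 60],
})
flux = node_flux(counts)
centers = stacked_centers(flux, day=1)
```

Reversing the `clusters` tuple flips the stacking order, which corresponds to switching between the two layouts shown above.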

Getting RSI in python

I've been trying to calculate the 14-day RSI of stocks. I managed to get it to work, somewhat, but it gives me inaccurate numbers.
import pandas as pd
import datetime as dt
import pandas_datareader as web
ticker = 'TSLA'
start = dt.datetime(2018, 1, 1)
end = dt.datetime.now()
data = web.DataReader(ticker, 'yahoo', start, end)
delta = data['Adj Close'].diff(1)
delta.dropna(inplace=True)
positive = delta.copy()
negative = delta.copy()
positive[positive < 0] = 0
negative[negative > 0] = 0
days = 14
average_gain = positive.rolling(window=days).mean()
average_loss = abs(negative.rolling(window=days).mean())
relative_strenght = average_gain / average_loss
rsi = 100.0 - (100.0 / (1.0 + relative_strenght))
print(ticker + str(rsi))
It ends up giving me 77.991564 (14-day RSI) when I should be getting 70.13. Does anyone know what I'm doing wrong?
And yes, I've read "Calculating RSI in Python", but it doesn't help me with what I need.
Here is one way to calculate the RSI yourself. The code could be optimized, but I prefer to keep it easy to understand and let you optimize it.
For the example, we assume that you have a DataFrame called df with a column called 'Close' holding the close prices. By the way, if you compare your RSI results with those of a trading station, make sure you are comparing the same values. For example, if the station shows the bid close and you calculate on the mid or the ask, the results will not match.
Let's see the code:
import matplotlib.pyplot as plt
import seaborn as sns
def rsi(df,_window=14,_plot=0,_start=None,_end=None):
    """[RSI function]
    Args:
        df ([DataFrame]): [DataFrame with a column 'Close' for the close price]
        _window ([int]): [The lookback window.](default : {14})
        _plot ([int]): [1 if you want to see the plot](default : {0})
        _start ([Date]):[if _plot=1, start of plot](default : {None})
        _end ([Date]):[if _plot=1, end of plot](default : {None})
    """
    ##### Diff for the differences between last close and now
    df['Diff'] = df['Close'].transform(lambda x: x.diff())
    ##### In 'Up', just keep the positive values
    df['Up'] = df['Diff']
    df.loc[(df['Up']<0), 'Up'] = 0
    ##### In 'Down', just keep the negative values (as absolute values)
    df['Down'] = df['Diff']
    df.loc[(df['Down']>0), 'Down'] = 0
    df['Down'] = abs(df['Down'])
    ##### Moving average on Up & Down
    df['avg_up'+str(_window)] = df['Up'].transform(lambda x: x.rolling(window=_window).mean())
    df['avg_down'+str(_window)] = df['Down'].transform(lambda x: x.rolling(window=_window).mean())
    ##### RS is the ratio of the means of Up & Down
    df['RS_'+str(_window)] = df['avg_up'+str(_window)] / df['avg_down'+str(_window)]
    ##### RSI Calculation
    ##### 100 - (100/(1 + RS))
    df['RSI_'+str(_window)] = 100 - (100/(1+df['RS_'+str(_window)]))
    ##### Drop useless columns
    df = df.drop(['Diff','Up','Down','avg_up'+str(_window),'avg_down'+str(_window),'RS_'+str(_window)],axis=1)
    ##### If asked, plot it!
    if _plot == 1:
        sns.set()
        fig = plt.figure(facecolor = 'white', figsize = (30,5))
        ax0 = plt.subplot2grid((6,4), (1,0), rowspan=4, colspan=4)
        ax0.plot(df[(df.index<=_end)&(df.index>=_start)]['Close'])
        ax0.set_facecolor('ghostwhite')
        ax0.legend(['Close'],ncol=3, loc = 'upper left', fontsize = 15)
        plt.title("Close from "+str(_start)+' to '+str(_end), fontsize = 20)
        ax1 = plt.subplot2grid((6,4), (5,0), rowspan=1, colspan=4, sharex = ax0)
        ax1.plot(df[(df.index<=_end)&(df.index>=_start)]['RSI_'+str(_window)], color = 'blue')
        ax1.legend(['RSI_'+str(_window)],ncol=3, loc = 'upper left', fontsize = 12)
        ax1.set_facecolor('silver')
        plt.subplots_adjust(left=.09, bottom=.09, right=1, top=.95, wspace=.20, hspace=0)
        plt.show()
    return(df)
To call the function with its default values, you just have to type
df = rsi(df)
or pass _window and/or _plot to change them.
Note that if you pass _plot=1, you need to supply the start and end of the plot, as a string or a datetime.
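One possible source of a mismatch like the one in the question: Wilder's original RSI smooths gains and losses with an exponential-style average (smoothing factor 1/N) rather than a simple rolling mean, and many charting tools follow that convention. Whether this explains the asker's exact numbers cannot be confirmed from the post. The sketch below shows that variant; the function name `rsi_wilder` is made up, and the smoothing is seeded from the first observation instead of a simple average of the first N values, so early values can differ slightly from other implementations.

```python
import pandas as pd


def rsi_wilder(close, window=14):
    """RSI with Wilder-style smoothing (alpha = 1 / window) on a price Series."""
    delta = close.diff()
    gain = delta.clip(lower=0)        # positive moves, 0 elsewhere
    loss = delta.clip(upper=0).abs()  # size of negative moves, 0 elsewhere
    # adjust=False gives the recursive form: avg = (1 - alpha) * prev + alpha * x
    avg_gain = gain.ewm(alpha=1.0 / window, adjust=False, min_periods=window).mean()
    avg_loss = loss.ewm(alpha=1.0 / window, adjust=False, min_periods=window).mean()
    rs = avg_gain / avg_loss
    return 100.0 - 100.0 / (1.0 + rs)
```

With the question's data this would be called as `rsi_wilder(data['Adj Close'])`; the last element of the result is the most recent RSI value.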

How to color markers based on another column in the dataframe in Plotly?

I have a dataframe as shown below with 3 columns. I am using clump as my x values and Unif size as my y values to form a scatterplot. But I want to color the individual points based on the third column, class: points with class value 2 should be green and points with class value 4 should be blue.
So taking the first and last points in the dataframe as examples: the first point will have an x-value of 5 and a y-value of 1 with color green, while the last point will have an x-value of 4 and a y-value of 8 with color blue.
I tried using an if statement as shown, but I get syntax errors. Any ideas on how to do this?
fig = go.Figure()
fig.update_layout(width = 400, height = 400, template = 'plotly_white',xaxis_title = 'clump', yaxis_title = 'Unif Size')
fig.add_trace(go.Scatter(x = data.Clump,
y = data.UnifSize,
mode = 'markers',
if data.Class == 2:
marker = duct(
color = 'green'
)
if data.Class == 4:
marker = dict(
color = 'yellow'
)
)))
For example, you can do this:
Create example x and y data, along with an array containing the condition that the color will depend on:
import numpy as np
x = [x for x in range(100)]
y = [3*each*np.random.normal(loc=1.0, scale=0.1) for each in range(100)]
condition = [np.random.randint(0,2) for x in range(100)]
The x and y points which have an index which corresponds to a 0 in the condition array are:
[eachx for indexx, eachx in enumerate(x) if condition[indexx]==0]
[eachy for indexy, eachy in enumerate(y) if condition[indexy]==0]
If we want the elements in the x and y arrays which have an index corresponding to a 1 in the condition array we just change the 0 to 1:
[eachx for indexx, eachx in enumerate(x) if condition[indexx]==1]
[eachy for indexy, eachy in enumerate(y) if condition[indexy]==1]
Alternatively, you could use zip:
[eachx for eachx, eachcondition in zip(x, condition) if eachcondition==0]
And so on for the others.
This is list comprehension with a condition, well explained here: https://stackoverflow.com/a/4260304/8565438.
Then plot the two pairs of arrays with two go.Scatter calls.
The whole thing together:
import numpy as np
x = [x for x in range(100)]
y = [3*each*np.random.normal(loc=1.0, scale=0.1) for each in range(100)]
condition = [np.random.randint(0,2) for x in range(100)]
import plotly.graph_objects as go
fig = go.Figure()
fig.update_layout(width = 400, height = 400, template = 'plotly_white',xaxis_title = 'clump', yaxis_title = 'Unif Size')
fig.add_trace(go.Scatter(x = [eachx for indexx, eachx in enumerate(x) if condition[indexx]==0],
y = [eachy for indexy, eachy in enumerate(y) if condition[indexy]==0],
mode = 'markers',marker = dict(color = 'green')))
fig.add_trace(go.Scatter(x = [eachx for indexx, eachx in enumerate(x) if condition[indexx]==1],
y = [eachy for indexy, eachy in enumerate(y) if condition[indexy]==1],
mode = 'markers',marker = dict(color = 'yellow')))
fig.show()
This will give you:
Which is what we wanted, I believe.
To convert a DataFrame column to a list, I recommend this: get list from pandas dataframe column.
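As an alternative to splitting the data into two traces, the marker color of a single go.Scatter trace can be given as a list with one color per point. The sketch below assumes the column names Clump, UnifSize, and Class from the question; the sample rows, `color_map`, and the helper `make_figure` are made up for illustration, and plotly is imported only when the helper is called.

```python
import pandas as pd

# Made-up rows shaped like the question's DataFrame.
data = pd.DataFrame({
    "Clump": [5, 3, 8, 4],
    "UnifSize": [1, 1, 10, 8],
    "Class": [2, 2, 4, 4],
})

color_map = {2: "green", 4: "blue"}  # class value -> marker color
colors = data["Class"].map(color_map).tolist()


def make_figure(frame, marker_colors):
    """Build one scatter trace whose points are colored individually."""
    import plotly.graph_objects as go  # imported lazily; only needed for drawing

    fig = go.Figure()
    fig.add_trace(go.Scatter(
        x=frame["Clump"],
        y=frame["UnifSize"],
        mode="markers",
        marker=dict(color=marker_colors),  # one color per point
    ))
    return fig
```

`make_figure(data, colors).show()` would then display the plot. Two separate traces, as in the answer, have the advantage of producing one legend entry per class.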

Matplotlib plot with multiple colors based on values on x-axis

I want to get a plot similar to the following plot that has different colors based on values for x-axis. Ignore the u and f letters and also the blue curve and gray lines. I only need the green and red lines. So, if you use my code, you will get a plot that is all one color. What I want is to have different color when x is between 0 and the turning point (in this case it is x=50%) and then a different color for the rest.
Code:
import matplotlib.pyplot as plt
def GRLC(values):
    n = len(values)
    assert(n > 0), 'Empty list of values'
    sortedValues = sorted(values) #Sort smallest to largest
    #Find cumulative totals
    cumm = [0]
    for i in range(n):
        cumm.append(sum(sortedValues[0:(i + 1)]))
    #Calculate Lorenz points
    LorenzPoints = [[], []]
    sumYs = 0 #Sum of all y values
    robinHoodIdx = -1 #Robin Hood index max(x_i - y_i)
    for i in range(1, n + 2):
        x = 100.0 * (i - 1)/n
        y = 100.0 * (cumm[i - 1]/float(cumm[n]))
        LorenzPoints[0].append(x)
        LorenzPoints[1].append(y)
        sumYs += y
        maxX_Y = x - y
        if maxX_Y > robinHoodIdx: robinHoodIdx = maxX_Y
    giniIdx = 100 + (100 - 2 * sumYs)/n #Gini index
    return [giniIdx, giniIdx/100, robinHoodIdx, LorenzPoints]
reg=[400,200]
result_reg = GRLC(reg)
print('Gini Index Reg', result_reg[0])
print('Gini Coefficient Reg', result_reg[1])
print('Robin Hood Index Reg', result_reg[2])
#Plot
plt.plot(result_reg[3][0], result_reg[3][1], [0, 100], [0, 100], '--')
plt.legend(['Reg-ALSRank#10','Equity-Line'], loc='upper left',prop={'size':16})
plt.xlabel('% of items ')
plt.ylabel('% of times being recommended')
plt.show()
This is how you would plot two lines of different colors, knowing the index in the array at which the color should change.
import matplotlib.pyplot as plt
import numpy as np
x = np.linspace(0,49, num=50)
y = x**2
x0 = 23
plt.plot(x[:x0+1], y[:x0+1])
plt.plot(x[x0:], y[x0:])
plt.show()
This works because, by default, subsequent line plots get a different color, but you could of course set the color yourself:
plt.plot(x[:x0+1], y[:x0+1], color="cornflowerblue")
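Applied to the Lorenz curve in the question, the split index has to be found first. The sketch below assumes the turning point is where x - y is largest (the point the question's code tracks as the Robin Hood index); the helper names are made up, and matplotlib is imported only inside the plotting helper.

```python
def split_at_turning_point(xs, ys):
    """Split a curve where x - y is largest; both parts share that point."""
    gaps = [x - y for x, y in zip(xs, ys)]
    k = gaps.index(max(gaps))
    return (xs[:k + 1], ys[:k + 1]), (xs[k:], ys[k:])


def plot_two_colors(xs, ys):
    """Draw the two parts in green and red, plus the dashed equality line."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for drawing

    (x1, y1), (x2, y2) = split_at_turning_point(xs, ys)
    plt.plot(x1, y1, color="green")
    plt.plot(x2, y2, color="red")
    plt.plot([0, 100], [0, 100], "--")
    plt.show()


# Lorenz points that the question's GRLC([400, 200]) produces.
lorenz_x = [0.0, 50.0, 100.0]
lorenz_y = [0.0, 100.0 * 200 / 600, 100.0]
first, second = split_at_turning_point(lorenz_x, lorenz_y)
```

Because both slices include the turning point, the two colored lines join without a gap, just as `x[:x0+1]` and `x[x0:]` do in the answer.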

Stacked bar graph with variable width elements?

In Tableau I'm used to making graphs like the one below. It has for each day (or some other discrete variable), a stacked bar of categories of different colours, heights and widths.
You can imagine the categories to be different advertisements that I show to people. The heights correspond to the percentage of people I've shown the advertisement to, and the widths correspond to the rate of acceptance.
It allows me to see very easily which advertisements I should probably show more often (short, but wide bars, like the 'C' category on September 13th and 14th) and which I should show less often (tall, narrow bars, like the 'H' category on September 16th).
Any ideas on how I could create a graph like this in R or Python?
Unfortunately, this is not so trivial to achieve with ggplot2 (I think), because geom_bar does not really support changing widths for the same x position. But with a bit of effort, we can achieve the same result:
Create some fake data
set.seed(1234)
d <- as.data.frame(expand.grid(adv = LETTERS[1:7], day = 1:5))
d$height <- runif(7*5, 1, 3)
d$width <- runif(7*5, 0.1, 0.3)
My data doesn't add up to 100%, cause I'm lazy.
head(d, 10)
# adv day height width
# 1 A 1 1.227407 0.2519341
# 2 B 1 2.244599 0.1402496
# 3 C 1 2.218549 0.1517620
# 4 D 1 2.246759 0.2984301
# 5 E 1 2.721831 0.2614705
# 6 F 1 2.280621 0.2106667
# 7 G 1 1.018992 0.2292812
# 8 A 2 1.465101 0.1623649
# 9 B 2 2.332168 0.2243638
# 10 C 2 2.028502 0.1659540
Make a new variable for stacking
We can't easily use position_stack I think, so we'll just do that part ourselves. Basically, we need to calculate the cumulative height for every bar, grouped by day. Using dplyr we can do that very easily.
library(dplyr)
d2 <- d %>% group_by(day) %>% mutate(cum_height = cumsum(height))
Make the plot
Finally, we create the plot. Note that the x and y refer to the middle of the tiles.
library(ggplot2)
ggplot(d2, aes(x = day, y = cum_height - 0.5 * height, fill = adv)) +
geom_tile(aes(width = width, height = height), show.legend = FALSE) +
geom_text(aes(label = adv)) +
scale_fill_brewer(type = 'qual', palette = 2) +
labs(title = "Views and other stuff", y = "% of views")
If you don't want to play around with correctly scaling the widths (to something < 1), you can use facets instead:
ggplot(d2, aes(x = 1, y = cum_height - 0.5 * height, fill = adv)) +
geom_tile(aes(width = width, height = height), show.legend = FALSE) +
geom_text(aes(label = adv)) +
facet_grid(~day) +
scale_fill_brewer(type = 'qual', palette = 2) +
labs(title = "Views and other stuff", y = "% of views", x = "")
Result
set.seed(1)
days <- 5
cats <- 8
dat <- prop.table(matrix(rpois(days * cats, days), cats), 2)
bp1 <- barplot(dat, col = seq(cats))
## some width for rect
rate <- matrix(runif(days * cats, .1, .5), cats)
## calculate xbottom, xtop, ybottom, ytop
bp <- rep(bp1, each = cats)
ybot <- apply(rbind(0, dat), 2, cumsum)[-(cats + 1), ]
ytop <- apply(dat, 2, cumsum)
plot(extendrange(bp1), c(0,1), type = 'n', axes = FALSE, ann = FALSE)
rect(bp - rate, ybot, bp + rate, ytop, col = seq(cats))
text(bp, (ytop + ybot) / 2, LETTERS[seq(cats)])
axis(1, bp1, labels = format(Sys.Date() + seq(days), '%d %b %Y'), lwd = 0)
axis(2)
Probably not very useful, but you can invert the color you are plotting so that you can actually see the labels:
inv_col <- function(color) {
paste0('#', apply(apply(rbind(abs(255 - col2rgb(color))), 2, function(x)
format(as.hexmode(x), 2)), 2, paste, collapse = ''))
}
inv_col(palette())
# [1] "#ffffff" "#00ffff" "#ff32ff" "#ffff00" "#ff0000" "#00ff00" "#0000ff" "#414141"
plot(extendrange(bp1), c(0,1), type = 'n', axes = FALSE, ann = FALSE)
rect(bp - rate, ybot, bp + rate, ytop, col = seq(cats), xpd = NA, border = NA)
text(bp, (ytop + ybot) / 2, LETTERS[seq(cats)], col = inv_col(seq(cats)))
axis(1, bp1, labels = format(Sys.Date() + seq(days), '%d %B\n%Y'), lwd = 0)
axis(2)
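Since the question also asks about Python, the same rectangle-stacking idea can be sketched with matplotlib's `bar`, which accepts a `width` and a `bottom` for each bar. The helper names and the `days` layout below are made up for illustration, and matplotlib is imported only inside the plotting helper.

```python
def tile_rectangles(heights, widths):
    """For one day's stack, return (bottom, height, width) for each category."""
    rects, bottom = [], 0.0
    for h, w in zip(heights, widths):
        rects.append((bottom, h, w))
        bottom += h  # the next category starts where this one ends
    return rects


def plot_variable_width_stacks(days):
    """`days` maps an x position to a (heights, widths) pair of lists."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for drawing

    fig, ax = plt.subplots()
    for x, (heights, widths) in days.items():
        for i, (bottom, h, w) in enumerate(tile_rectangles(heights, widths)):
            # Bars are centered on x; "C0", "C1", ... keep one color per category.
            ax.bar(x, h, width=w, bottom=bottom, color="C{}".format(i % 10))
    plt.show()


rects = tile_rectangles([0.5, 0.3, 0.2], [0.2, 0.6, 0.1])
```

A call such as `plot_variable_width_stacks({1: ([0.5, 0.3, 0.2], [0.2, 0.6, 0.1])})` would draw one stack at x = 1; the widths need to stay below the spacing between days so that neighbouring stacks do not overlap.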
