Compare stock indices of different sizes in Python

I am using Python to do some macroeconomic analysis of different stock markets, and I was wondering how to properly compare indices of varying sizes. For instance, the Dow Jones sits around 25,000 on the y-axis, while the Russell 2000 is only around 1,500. I know that the website TradingView makes it possible to compare these two in its online charting tool: it shrinks/enlarges a background chart so that it matches the other on a new y-axis. Is there some statistical method that lets me do the same thing in Python?

I know that the website TradingView makes it possible to compare these two in its online charting tool. What it does is shrink/enlarge a background chart so that it matches the other on a new y-axis.
These websites rescale the series by fixing the initial starting value of both indices at, say, 100. I.e., if the Dow is at 25,000 points and the S&P at 2,500, the Dow is divided by 250 and the S&P by 25 so that both start at 100. You then have two indices that start at 100 and can be compared side by side.
The other method (which works well only if you have two series) is to put the y-axis on the right-hand side for one series and on the left-hand side for the other.
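A minimal sketch of this rebase-to-100 idea, using made-up Dow/S&P values rather than real data:

```python
# Rebase a price series so that it starts at a common base value (100),
# which lets series of very different magnitudes share one y-axis.
def rebase(series, base=100.0):
    """Scale a series so its first value equals `base`."""
    first = series[0]
    return [base * v / first for v in series]

dow = [25000, 25500, 24800, 26000]   # made-up illustration values
sp = [2500, 2540, 2490, 2600]

dow_rebased = rebase(dow)  # starts at 100.0
sp_rebased = rebase(sp)    # starts at 100.0
```

Both rebased series now begin at 100, so subsequent values read directly as "percent of the starting level".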

You have multiple possibilities here. Let's say you define your axis by the following call
fig, ax = plt.subplots()
Then, you can change the scale of the y axis to logarithmic using
ax.set_yscale('log')
You can also define two y axes inside the same plot with different scales with the call
ax2 = ax.twinx()
and then plot, let's say, big values on ax and small ones on ax2. That will only work well if you have two ranges of values at most.
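For instance, a small twinx sketch with made-up values for the two indices:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend for this sketch
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax2 = ax.twinx()  # second y-axis sharing the same x-axis

# Hypothetical data: one large-valued series, one small-valued series.
x = range(5)
ax.plot(x, [25000, 25200, 24900, 25400, 25100], 'b-', label='Dow')
ax2.plot(x, [1500, 1510, 1495, 1520, 1505], 'r-', label='Russell 2000')

ax.set_ylabel('Dow', color='b')
ax2.set_ylabel('Russell 2000', color='r')
```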
Another solution is to create a new inset axis that zooms inside your plot (the zoom factor, anchor position, location, and border padding below are placeholders you will have to fill in for your case):
from mpl_toolkits.axes_grid1.inset_locator import zoomed_inset_axes
ax2 = zoomed_inset_axes(ax, zoom=2, bbox_to_anchor=(0.6, 0.4),
                        bbox_transform=ax.transAxes, loc='upper left', borderpad=1)
A last option is to scale your data directly. For example, if DowJones varies between 20,000 and 30,000, then you can apply the following transformation
DowJones = (DowJones - min(DowJones)) / (max(DowJones) - min(DowJones))
and then your values will vary between 0 and 1. Applying similar transformations to other variables will then allow you to compare variations more easily on the same graph without making any change to the axes.
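A short sketch of that min-max scaling applied to two made-up series:

```python
def min_max_scale(values):
    """Map a list of numbers onto [0, 1] (assumes the values are not all equal)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

dow = [20000, 25000, 30000]    # made-up values
russell = [1200, 1500, 1800]

dow_scaled = min_max_scale(dow)          # [0.0, 0.5, 1.0]
russell_scaled = min_max_scale(russell)  # [0.0, 0.5, 1.0]
```

After scaling, both series live on [0, 1] and can be drawn on the same axis, at the cost of losing their absolute levels.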

Related

Python stacked barchart where y-axis scale is linear but the bar fill is logarithmic in the order of 10s

As the title explains, I am trying to reproduce a stacked barchart where the y-axis scale is linear but the inside fill of the plot (i.e. the stacked bars) is logarithmic and grouped in orders of 10.
I have made this plot before in RStudio with an in-house package; however, I am trying to reproduce the plot with other programs (Python) to validate and confirm my analysis.
Quick description of the data w/ more detail:
I have thousands of entries of clonal cell information. They have multiple identifiers, such as "Strain", "Sample", "cloneID", as well as a frequency value ("cloneFraction") for each clone.
This is the .head() of the dataset I am working with to give you an idea of my data
I am trying to reproduce the following plot, which I made in RStudio (see the linked image).
This plot divides the dataset into groups based on frequency, with the top 10 most frequent grouped in red, followed by the next 100, the next 1,000, etc. The y-axis runs from 0.00 to 1.00, though a 100% scale would mean the same thing in this context.
This is just to get an idea and visualize whether I have big clones (the top 10) and how much of the overall dataset they occupy in frequency; i.e., the bigger the red stack, the larger my top clones are, signifying a significant clonal expansion in my sample from a few selected cells.
What I have done so far:
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
%matplotlib inline
(MYDATAFRAME.groupby(['Sample', 'cloneFraction']).size()
 .groupby(level=0)
 .apply(lambda x: 100 * x / x.sum())
 .unstack()
 .plot(kind='bar', stacked=True, legend=None))
plt.yscale('log')
plt.gca().yaxis.set_major_formatter(mtick.PercentFormatter())
plt.show()
And I get this plot here
Now, I realize there is no order in the stacked plot, so the most frequent aren't on top - it's just stacking in the order of the entries in my dataset (which I assume I can just fix by sorting my dataframe by the column of interest).
Other than the axis messing up and not giving me a % when I use log scale (which is a secondary issue), I can't seem to figure out how to group the data entries by frequency as I mentioned above.
I have tried things such as:
temp = X.SOME_IDENTIFIER.value_counts()
temp2 = temp.head(10)
if len(temp) > 10:
    temp2['remaining {0} items'.format(len(temp) - 10)] = sum(temp[10:])
temp2.plot(kind='pie')
This was just to see if I could separate them correctly, but it does not achieve what I want (besides being a pie chart, though I changed that in my code).
I have also tried using iloc[n:n] to select specific entries, but I can't get that working either: I get errors when I add it to the plotting code I've used above. If I leave out the other fancy stuff (the % scale, etc.), the stacked barplot gets confused and just plots the top 10 across all 4 samples in my data combined, rather than the top 10 per sample. I also wouldn't know how to get the next 100, 1,000, etc.
If you have any suggestions and can help in any way, that would be much appreciated!
Thanks
I fixed what I wanted to do with the following:
I created a new column with the category my samples fall in, based on their value (i.e. whether they're among the top 10 most frequent, the next 100, etc.).
df['category'] = '10001+'
for sampleref in df.sample_ref.unique().tolist():
    print(f'Setting sample {sampleref}')
    df.loc[df[df.sample_ref == sampleref].nlargest(10000, 'cloneCount')['category'].index, 'category'] = '1001-10000'
    df.loc[df[df.sample_ref == sampleref].nlargest(1000, 'cloneCount')['category'].index, 'category'] = '101-1000'
    df.loc[df[df.sample_ref == sampleref].nlargest(100, 'cloneCount')['category'].index, 'category'] = '11-100'
    df.loc[df[df.sample_ref == sampleref].nlargest(10, 'cloneCount')['category'].index, 'category'] = 'top10'
This code starts from the biggest group (10001+) and works down to smaller and smaller groups, so clones that also fall into a more frequent group get overwritten with that group's label.
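An alternative sketch of the same per-sample bucketing using groupby + rank; the column names (sample_ref, cloneCount) follow the code above, but the data here is a tiny made-up example:

```python
import pandas as pd

# Made-up example: one sample with 15 clones, counts 15 down to 1.
df = pd.DataFrame({
    'sample_ref': ['A'] * 15,
    'cloneCount': range(15, 0, -1),
})

def bucket(rank):
    """Map a 1-based frequency rank to its category label."""
    if rank <= 10:
        return 'top10'
    elif rank <= 100:
        return '11-100'
    elif rank <= 1000:
        return '101-1000'
    elif rank <= 10000:
        return '1001-10000'
    return '10001+'

# Rank clones within each sample, largest count = rank 1, then label them.
ranks = df.groupby('sample_ref')['cloneCount'].rank(method='first', ascending=False)
df['category'] = ranks.map(bucket)
```

This assigns each row one label in a single pass instead of repeatedly overwriting from the largest group down.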
Following this, I plotted the samples with the following code:
fig, ax = plt.subplots(figsize=(15,7))
df.groupby(['Sample','category']).sum()['cloneFraction'].unstack().plot(ax=ax, kind="bar", stacked=True)
plt.xticks(rotation=0)
plt.gca().yaxis.set_major_formatter(mtick.PercentFormatter(1))
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles[::-1], labels[::-1], title='Clonotype',bbox_to_anchor=(1.04,0), loc="lower left", borderaxespad=0)
And here are the results:
I hope this helps anyone struggling with the same issue!

Matplotlib plotting data that doesn't exist

I am trying to plot three lines on one figure. I have three years of data for three sites, and I am simply trying to plot them with the same x-axis and same y-axis. The first two lines span all three years of data, while the third dataset is usually more sparse. Using the object-oriented axes matplotlib format, when I try to plot my third set of data, I get points at the end of the graph that are out of the range of my third dataset. My third dataset is structured as tuples of dates and values, such as:
data=
[('2019-07-15', 30.6),
('2019-07-16', 20.88),
('2019-07-17', 16.94),
('2019-07-18', 11.99),
('2019-07-19', 13.76),
('2019-07-20', 16.97),
('2019-07-21', 19.9),
('2019-07-22', 25.56),
('2019-07-23', 18.59),
...
('2020-08-11', 8.33),
('2020-08-12', 10.06),
('2020-08-13', 12.21),
('2020-08-15', 6.94),
('2020-08-16', 5.51),
('2020-08-17', 6.98),
('2020-08-18', 6.17)]
where the data ends in August 2020, yet the graph includes points at the end of 2020. This happens with all my sites, while the first two datasets (knowndf['DATE'] and the knowndf value columns below) stay constant.
Here is the problematic graph.
And here is what I have for the plotting:
fig, ax=plt.subplots(1,1,figsize=(15,12))
fig.tight_layout(pad=6)
ax.plot(knowndf['DATE'], knowndf['Value1'],'b',alpha=0.7)
ax.plot(knowndf['DATE'], knowndf['Value2'],color='red',alpha=0.7)
ax.plot(*zip(*data), 'g*', markersize=8) #when i plot this set of data i get nonexistent points
ax.tick_params(axis='x', rotation=45) #rotating for aesthetic
ax.set_xticks(ax.get_xticks()[::30]) #only want every 30th tick instead of every daily tick
I've tried ax.twinx(), but that gives me two y-axes, which doesn't help since I want to use the same x-axis and y-axis for all three sites. I've tried not using the axes approach, but there are things that come with axes that I need to plot with. Please help!
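One possible cause of such phantom points, assuming the dates in `data` really are plotted as strings: matplotlib treats each string as a category rather than a point in time, which can misalign them against the datetime axis of the other series. A sketch of converting before plotting (this is a guess at the cause, not confirmed by the question):

```python
from datetime import datetime

# A shortened stand-in for the (date-string, value) tuples from the question.
data = [('2019-07-15', 30.6), ('2019-07-16', 20.88), ('2020-08-18', 6.17)]

# Parse the strings into real datetime objects so matplotlib places the
# points on a proper time axis instead of a categorical one.
dates = [datetime.strptime(d, '%Y-%m-%d') for d, _ in data]
values = [v for _, v in data]
# ax.plot(dates, values, 'g*', markersize=8)  # then plot as before
```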

Graphing a continuous bar plot with different colors/width

I am working on a relatively large dataset (5,000 rows) in pandas and would like to draw a bar plot, but continuous and with different colors.
For every depth value there is a value of SBT.
Initially, I thought to generate a bar for each depth, but due to the amount of data, the graph does not display it very well and it takes a really long time to load.
In the meantime, I generated a plot of the data, but with lines.
I added the code and a picture of this plot below.
fig, SBTcla = plt.subplots()
SBTcla.plot(SBT,datos['Depth (m)'], color='black',label='SBT')
plt.xlim(0, 10)
plt.grid(color='grey', linestyle='--', linewidth=1)
plt.title('SBT');
plt.xlabel('SBT');
plt.ylabel('Profundidad (mts)');
plt.gca().invert_yaxis();
Your graph consists of a lot of points that carry no information. Consecutive rows containing the same SBT can be eliminated. Grouping consecutive rows with equal content can be done with a shift and a cumulative sum: the boolean expression looks for steps from one region to the next; at each step it returns True and the sum increases by one.
x = datos.groupby((datos['SBT'].shift() != datos['SBT']).cumsum())
Each group can be plotted on its own, with a filled style
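For instance, a sketch of plotting each constant-SBT region as a filled horizontal band with fill_betweenx, using a made-up stand-in for `datos`:

```python
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # headless backend for this sketch
import matplotlib.pyplot as plt

# Made-up stand-in for `datos`: depth with a few constant-SBT regions.
datos = pd.DataFrame({
    'Depth (m)': [0, 1, 2, 3, 4, 5, 6, 7],
    'SBT':       [3, 3, 3, 5, 5, 2, 2, 2],
})

# Group consecutive rows with equal SBT (the shift/cumsum trick above).
groups = datos.groupby((datos['SBT'].shift() != datos['SBT']).cumsum())

fig, ax = plt.subplots()
for _, g in groups:
    # Each constant-SBT region becomes one filled band from 0 to its SBT value.
    ax.fill_betweenx(g['Depth (m)'], 0, g['SBT'])
ax.invert_yaxis()  # depth increases downward
```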

Plotting a pandas DataFrame - group by - how to add a different title to each produced plot

I have a pandas dataframe and want to plot one value versus another, based on a particular field.
So I have 5 different types in 'Topic' and I want to plot each. Code as below at present.
dfCombinedToPlot.groupby('Topic').plot(x='DataValue', y='CasesPer100kPop', style='o')
# plt.title() Want this to equal "'Topic' vs number of cases"
# plt.xlabel() Want this to equal 'Topic'
# plt.ylabel()
plt.show()
I have 3 questions.
1. Can I add a title/xlabel to each of these which matches that from the Topic? So if the topic was "Asthma", I want the title/x label to be "Asthma", and then the next one to be "Bronchitis", and so on.
2. I want to add these on the same plot if possible, I will decide how many looks well when I see them. How do I do this?
3. (Bonus question!!) Can I easily add a "best fit" line to each plot?
Thanks all.
The groupby.plot method returns a sequence of axes you can use to set titles; for example, if there are only two axes:
axes = dfCombinedToPlot.groupby('Topic').plot(x='DataValue', y='CasesPer100kPop', style='o')
titles = ['Asthma', 'Bronchitis']
for ax, title in zip(axes, titles):
    ax.set_title(title)
    ax.set_xlabel(title)
plt.show()
For the best fit, I assume you mean a linear regression, so it might be worth checking seaborn's regplot.
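If seaborn is not an option, the best-fit line can also be computed directly with numpy.polyfit; a sketch with made-up x/y values:

```python
import numpy as np

# Least-squares fit of a degree-1 polynomial (a straight line) to made-up data.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])
slope, intercept = np.polyfit(x, y, 1)
# ax.plot(x, slope * x + intercept)  # overlay the fitted line on the scatter
```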

Seaborn: Violinplot experiences difficulty with too many variables?

I wanted to use seaborn to visualize my entire pandas dataframe with violinplots, and I thought I had made the necessary corrections to generate a graph large enough for the sizable number of 270 variables my dataframe possesses.
However, no matter what I do, the violinplots only display their inner mini-boxplots for each variable (as another question here describes), and not their KDEs:
fig, ax = plt.subplots(figsize=(50,5))
ax.set_ylim(-6, 6)
a = sns.violinplot(x='variable', y='value', data=pd.melt(train_norm), ax=ax)
a.set_xticklabels(a.get_xticklabels(), rotation=90);
plt.savefig('massive_violinplot.png', figsize=(50,5), dpi=220)
(apologies for the cropped graph, the whole thing is too big to post)
Whereas the following code, using the same pd.Dataframe, but only showing the first six variables, displays correctly:
fig, ax = plt.subplots(figsize=(10,5))
ax.set_ylim(-6, 6)
a = sns.violinplot(x='variable', y='value', data=pd.melt(train_norm.iloc[:,:6]), ax=ax)
a.set_xticklabels(a.get_xticklabels(), rotation=90);
plt.savefig('massive_violinplot.png', figsize=(10,5), dpi=220)
How could I get a graph like the above for all the variables, filled with proper violinplots showing their kde's?
This is not related to the number of variables or the plot size but to the huge differences in the distributions of the variables. I can't access your data right now, so I will illustrate with a made-up dataset. You can follow along with your own dataset by selecting the three variables with the most dispersion and the three with the least. As a dispersion measure you can use the variance, or even the data range (if you don't have crazy long tails), or something else; I am not sure what would work best.
rs = np.random.RandomState(42)
data = rs.randn(100, 6)
data[:, :3] *= 20
df = pd.DataFrame(data)
See what happens if we plot the density with common axes so they are directly comparable.
df.plot(kind='kde', subplots=True, layout=(3, 2), sharex=True, sharey=True)
plt.tight_layout()
This is more or less the same you can see in the seaborn violin plot but of course transposed.
sns.violinplot(x='variable', y='value', data=pd.melt(df))
This is usually great for comparing the variables because you can read the differences in width as differences in density. Unfortunately, the violins for the variables with more dispersion are so narrow that you can't see the width at all, and you lose any sense of the shape. On the other hand, the variables with less dispersion appear too short (in your dataset some of them are actually just horizontal lines).
For the first problem you can make the violins use all the available horizontal space by using scale='width' but then you no longer can compare the density across variables. The width is the same at the peaks but the density is not.
sns.violinplot(x='variable', y='value', data=pd.melt(df), scale='width')
By the way, this is what matplotlib's violin plot does by default.
plt.violinplot(df.T)
For the second problem I think your only option is to normalize or standardize the variables in some way.
sns.violinplot(x='variable', y='value', data=pd.melt((df - df.mean()) / df.std()))
Now you have a clearer view of each variable separately (how many modes they have, how skewed they are, how long the tails are...) but you can compare neither the scale nor the dispersion across variables.
The moral of the story is that you can't see everything at once, you have to pick and choose depending on what you are looking for in the data.
