boxplot (from seaborn) would not plot as expected - python

The boxplot would not plot as expected.
This is what it actually plotted:
This is what it is supposed to plot:
This is the code and data:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import cross_val_score
scores = []
for ne in range(1, 41):  # ne is the number of trees
    clf = RandomForestClassifier(n_estimators=ne)
    score_list = cross_val_score(clf, X, Y, cv=10)
    scores.append(score_list)
sns.boxplot(scores)  # scores is a list of arrays
plt.xlabel('Number of trees')
plt.ylabel('Classification score')
plt.title('Classification score as a function of the number of trees')
plt.show()
scores =
[array([ 0.8757764 , 0.86335404, 0.75625 , 0.85 , 0.86875 ,
0.81875 , 0.79375 , 0.79245283, 0.8490566 , 0.85534591]),
array([ 0.89440994, 0.8447205 , 0.79375 , 0.85 , 0.8625 ,
0.85625 , 0.86875 , 0.88050314, 0.86792453, 0.8427673 ]),
array([ 0.91304348, 0.9068323 , 0.83125 , 0.84375 , 0.8875 ,
0.875 , 0.825 , 0.83647799, 0.83647799, 0.87421384]),
array([ 0.86956522, 0.86956522, 0.85 , 0.875 , 0.88125 ,
0.86875 , 0.8625 , 0.8490566 , 0.86792453, 0.89308176]),
....]

I would first create a pandas DataFrame out of scores:
import pandas as pd
In [15]: scores
Out[15]:
[array([ 0.8757764 , 0.86335404, 0.75625 , 0.85 , 0.86875 , 0.81875 , 0.79375 , 0.79245283, 0.8490566 , 0.85534591]),
array([ 0.89440994, 0.8447205 , 0.79375 , 0.85 , 0.8625 , 0.85625 , 0.86875 , 0.88050314, 0.86792453, 0.8427673 ]),
array([ 0.91304348, 0.9068323 , 0.83125 , 0.84375 , 0.8875 , 0.875 , 0.825 , 0.83647799, 0.83647799, 0.87421384]),
array([ 0.86956522, 0.86956522, 0.85 , 0.875 , 0.88125 , 0.86875 , 0.8625 , 0.8490566 , 0.86792453, 0.89308176])]
In [16]: df = pd.DataFrame(scores)
In [17]: df
Out[17]:
0 1 2 3 4 5 6 7 8 9
0 0.875776 0.863354 0.75625 0.85000 0.86875 0.81875 0.79375 0.792453 0.849057 0.855346
1 0.894410 0.844720 0.79375 0.85000 0.86250 0.85625 0.86875 0.880503 0.867925 0.842767
2 0.913043 0.906832 0.83125 0.84375 0.88750 0.87500 0.82500 0.836478 0.836478 0.874214
3 0.869565 0.869565 0.85000 0.87500 0.88125 0.86875 0.86250 0.849057 0.867925 0.893082
Now we can easily plot the boxplots:
In [18]: sns.boxplot(data=df)
Out[18]: <matplotlib.axes._subplots.AxesSubplot at 0xd121128>
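One caveat worth checking: with wide-form input, seaborn draws one box per DataFrame column. pd.DataFrame(scores) puts each forest size in a row, so sns.boxplot(data=df) gives one box per cross-validation fold. If the goal is one box per number of trees, as the x-axis label suggests, transposing the frame should do it. A minimal sketch, using synthetic scores in place of the real cross-validation output (the random data and the plot_scores helper are illustrative assumptions, not part of the original answer):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the cross-validation output: one array of 10 fold
# scores per forest size (4 sizes here; the question uses 40).
rng = np.random.default_rng(0)
scores = [rng.uniform(0.75, 0.92, size=10) for _ in range(4)]

# Label each row with its number of trees, then transpose so that every
# column holds the 10 fold scores for one forest size.
df = pd.DataFrame(scores, index=range(1, len(scores) + 1)).T
print(df.shape)


def plot_scores(frame):
    """Draw one box per column of `frame` (hypothetical helper)."""
    # Imported here so the reshaping above runs without a plotting stack.
    import matplotlib.pyplot as plt
    import seaborn as sns

    ax = sns.boxplot(data=frame)
    ax.set_xlabel("Number of trees")
    ax.set_ylabel("Classification score")
    plt.show()
```

Calling plot_scores(df) should then produce one box for each number of trees.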

Related

array is 1-dimensional, but 2 were indexed when using numpy and recfromcsv

I am looping through a bunch of files and importing their contents as numpy arrays:
# get the dates for our gaps
import os.path
import glob
from pathlib import Path
from numpy import recfromcsv
folder = "daily_bars_filtered/*.csv"
df_gapper_list = []
df_intraday_analysis = []
# loop through the daily gappers
for fname in glob.glob(folder)[0:2]:
    ticker = Path(fname).stem
    daily_bars_arr = recfromcsv(fname, delimiter=',')
    print(ticker)
    print(daily_bars_arr)
Output:
AACG
[(b'2021-07-15', 43796169., 2.98, 3.83, 4.75, 2.9401, 2.98, 59.39597315)
(b'2022-01-04', 14934689., 1.25, 2.55, 2.59, 1.25 , 1.19, 117.64705882)
(b'2022-01-05', 8067429., 1.8 , 2.3 , 2.64, 1.72 , 2.55, 3.52941176)
(b'2022-01-07', 9718034., 1.93, 2.64, 2.94, 1.85 , 1.98, 48.48484848)]
AAL
[(b'2022-03-04', 76218689., 15.27 , 14.59, 15.4799, 14.42 , 15.71, 1.46467218)
(b'2022-03-07', 89360330., 14.32 , 12.84, 14.62 , 12.77 , 14.59, 0.20562029)
(b'2022-03-08', 88067102., 13.035, 13.51, 14.27 , 12.4401, 12.84, 11.13707165)
(b'2022-03-09', 88884229., 14.44 , 14.3 , 14.75 , 14.05 , 13.51, 9.17838638)
(b'2022-03-10', 56463182., 13.82 , 14.2 , 14.44 , 13.46 , 14.3 , 0.97902098)
(b'2022-03-11', 48342029., 14.4 , 14.02, 14.56 , 13.9 , 14.2 , 2.53521127)
(b'2022-03-14', 53284254., 14.04 , 14.25, 14.83 , 13.7 , 14.02, 5.77746077)]
What I then try to do is target the first column where my dates are, by doing:
print(daily_bars_arr[:,[0]])
But then I get the following error:
IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed
What am I doing wrong?
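A likely explanation, sketched under assumptions: recfromcsv returns a one-dimensional record array in which each element is a whole row, so there is no second axis to index with [:, [0]]. Columns are selected by field name instead. The example below uses np.genfromtxt with names=True, which builds a comparable structured array, on a small in-memory CSV. The header names are hypothetical, since the real file's header is not shown.

```python
import io

import numpy as np

# Small in-memory CSV standing in for one of the daily-bar files.
# The header names are hypothetical; only a few columns are reproduced.
csv_text = """date,volume,open,close
2021-07-15,43796169,2.98,3.83
2022-01-04,14934689,1.25,2.55
2022-01-05,8067429,1.8,2.3
"""

bars = np.genfromtxt(
    io.StringIO(csv_text), delimiter=",", names=True, dtype=None, encoding="utf-8"
)

# Only one axis: each element is a full row (a record), not a 2-D cell.
print(bars.ndim, bars.shape)

# 2-D style indexing fails, as in the question.
try:
    bars[:, [0]]
except IndexError as exc:
    print("IndexError:", exc)

# Select the first column by its field name instead.
first_field = bars.dtype.names[0]
dates = bars[first_field]
print(dates)
```

With the array from the question, daily_bars_arr[daily_bars_arr.dtype.names[0]] should return the date column in the same way.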

Using seaborn to plot pre-grouped line data

I have data that I have pre-grouped. Specifically they are PR-curves for 3 different classes and I want to plot them on the same axes:
import numpy as np
data_groups = {
'ap=0.16: cat_3 (4/19)': {
'precision': np.array([0. , 0. , 0. , 0. , 0.2 ,
0.16666667, 0.14285714, 0.25 , 0.22222222, 0.2 ,
0.18181818, 0.16666667, 0.15384615, 0.14285714, 0.13333333,
0.21052632], dtype=np.float64),
'recall': np.array([0. , 0. , 0. , 0. , 0.25, 0.25, 0.25, 0.5 , 0.5 , 0.5 , 0.5 ,
0.5 , 0.5 , 0.5 , 0.5 , 1. ], dtype=np.float64),
},
'ap=0.20: cat_1 (3/19)': {
'precision': np.array([0. , 0.5 , 0.33333333, 0.25 , 0.2 ,
0.16666667, 0.14285714, 0.25 , 0.22222222, 0.2 ,
0.18181818, 0.16666667, 0.15384615, 0.14285714, 0.13333333,
0.15789474], dtype=np.float64),
'recall': np.array([0. , 0.33333333, 0.33333333, 0.33333333, 0.33333333,
0.33333333, 0.33333333, 0.66666667, 0.66666667, 0.66666667,
0.66666667, 0.66666667, 0.66666667, 0.66666667, 0.66666667,
1. ], dtype=np.float64),
},
'ap=0.54: cat_2 (8/19)': {
'precision': np.array([0. , 0.5 , 0.33333333, 0.5 , 0.6 ,
0.66666667, 0.71428571, 0.75 , 0.66666667, 0.6 ,
0.63636364, 0.58333333, 0.53846154, 0.5 , 0.46666667,
0.42105263], dtype=np.float64),
'recall': np.array([0. , 0.125, 0.125, 0.25 , 0.375, 0.5 , 0.625, 0.75 , 0.75 ,
0.75 , 0.875, 0.875, 0.875, 0.875, 0.875, 1. ], dtype=np.float64),
},
}
I would like to use seaborn to plot these multiple lines in a single plot, but to do so I seem to need to transform this grouped data into a single long-form pandas table.
import pandas as pd
longform = []
for key, subdata in data_groups.items():
    subdata = pd.DataFrame.from_dict(subdata)
    subdata['label'] = key
    longform.append(subdata)
data = pd.concat(longform)
Which effectively duplicates this "label" attribute for each item in the list:
recall precision label
0 0.000000 0.000000 ap=0.54: cat_2 (8/19)
1 0.125000 0.500000 ap=0.54: cat_2 (8/19)
2 0.125000 0.333333 ap=0.54: cat_2 (8/19)
...
0 0.000000 0.000000 ap=0.20: cat_1 (3/19)
1 0.333333 0.500000 ap=0.20: cat_1 (3/19)
2 0.333333 0.333333 ap=0.20: cat_1 (3/19)
3 0.333333 0.250000 ap=0.20: cat_1 (3/19)
...
0 0.000000 0.000000 ap=0.16: cat_3 (4/19)
1 0.000000 0.000000 ap=0.16: cat_3 (4/19)
2 0.000000 0.000000 ap=0.16: cat_3 (4/19)
At which point I can plot it:
import seaborn as sns
sns.lineplot(
data=data, x='recall', y='precision',
hue='label', style='label')
But I was wondering if there is a more efficient way to send the pre-grouped data to seaborn. I would like to avoid duplicating the "label" attribute, and I imagine seaborn must effectively be inverting the pd.concat operation I just performed.
In the data structures accepted by seaborn (https://seaborn.pydata.org/tutorial/data_structure.html) they only mention this long-form (which I understand pretty well) and wide-form data (which makes much less sense to me).
This pre-grouped data isn't a wide-form variant, right? I just want to verify that performing the extra concat is currently the only way to do this.
You don't have to send all the data to seaborn at once. You can plot line by line, and the lines will still appear on the same plot. Seaborn handles NumPy arrays well (as long-form vectors), so you can pass each group to the plotting function separately and it still works:
from matplotlib import pyplot as plt
import seaborn as sns
for key, subdata in data_groups.items():
    sns.lineplot(x=subdata['recall'], y=subdata['precision'], label=key)
plt.show()
result:
Of course, you need to take care of extra styling, such as legend position and confidence intervals, but essentially this plots each group directly, without converting to a DataFrame.

Time Series with Pandas, Python, and Plotly

I'm trying to create a data visualization that's essentially a time series chart. But I have to use pandas, Python, and Plotly, and I'm stuck on how to actually label the dates. Right now, the x labels are just integers from 1 to 60, and when you hover over the chart, you get that integer instead of the date.
I'm pulling values from a Google spreadsheet, and for now, I'd like to avoid parsing csv things.
I'd really like some help on how to label x as dates! Here's what I have so far:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import bpr
%matplotlib inline
import chart_studio.plotly as pl
import plotly.express as px
import plotly.graph_objects as go
f = open("../credentials.txt")
u = f.readline()
plotly_user = str(u[:-1])
k = f.readline()
plotly_api_key = str(k)
pl.sign_in(username = plotly_user, api_key = plotly_api_key)
rand_x = np.arange(61)
rand_x = np.flip(rand_x)
rand_y = np.array([0.91 , 1 , 1.24 , 1.25 , 1.4 , 1.36 , 1.72 , 1.3 , 1.29 , 1.17 , 1.57 , 1.95 , 2.2 , 2.07 , 2.03 , 2.14 , 1.96 , 1.87 , 1.25 , 1.34 , 1.13 , 1.31 , 1.35 , 1.54 , 1.38 , 1.53 , 1.5 , 1.32 , 1.26 , 1.4 , 1.89 , 1.55 , 1.98 , 1.75 , 1.14 , 0.57 , 0.51 , 0.41 , 0.24 , 0.16 , 0.08 , -0.1 , -0.24 , -0.05 , -0.15 , 0.34 , 0.23 , 0.15 , 0.12 , -0.09 , 0.13 , 0.24 , 0.22 , 0.34 , 0.01 , -0.08 , -0.27 , -0.6 , -0.17 , 0.28 , 0.38])
test_data = pd.DataFrame(columns=['X', 'Y'])
test_data['X'] = rand_x
test_data['Y'] = rand_y
test_data.head()
def create_line_plot(data, x, y, chart_title="Rate by Date", labels_dict={}, c=["indianred"]):
    fig = px.line(
        data,
        x = x,
        y = y,
        title = chart_title,
        labels = labels_dict,
        color_discrete_sequence = c
    )
    fig.show()
    return fig
fig = create_line_plot(test_data, 'X', 'Y', labels_dict={'X': 'Date', 'Y': 'Rate (%)'})
"Right now, the x labels are just integers from 1 to 60, and when you hover over the chart, you get that integer instead of the date."
This happens because you are passing rand_x as the x values, and rand_x is an array of integers. Setting labels_dict={'X': 'Date', 'Y': 'Rate (%)'} only puts the text Date in front of the x value. What you need to do is pass an array of datetime values as x. For example:
rand_x = np.array(['2020-01-01','2020-01-02','2020-01-03'], dtype='datetime64')
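For all 61 points, typing the dates by hand is impractical. One possible way to build them is pd.date_range, sketched below. The start date and the placeholder y values are assumptions for illustration; the real dates would come from the spreadsheet.

```python
import numpy as np
import pandas as pd

# Placeholder y values; in the question these are the 61 rates in rand_y.
rand_y = np.linspace(0.9, 0.4, 61)

# One calendar day per observation. The start date is an assumption.
dates = pd.date_range(start="2020-01-01", periods=len(rand_y), freq="D")

test_data = pd.DataFrame({"X": dates, "Y": rand_y})
print(test_data.dtypes)
print(test_data.head())
```

Because column X now holds datetime values rather than integers, passing this frame to create_line_plot should give a date axis and dates in the hover text.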

How can I select single item from one list and doing operation on all items of second list using Python

For example, suppose I have one list whose items should be selected one by one:
a = [0.11, 0.22, 0.13, 6.7, 2.5, 2.8]
and another list whose items should all be used each time:
b = [1.2, 1.4, 2.6, 2.3, 5.7, 9.9]
If I select 0.11 from a and perform an operation such as addition with every item of b, then save the result in a new array or list, how is that possible in Python? ...
I am sorry for the question as I am trying to learn python on my own, kindly tell me how is this thing possible.
Thank you in advance.
You need a nested loop. You can do it in a list comprehension to produce a list of lists:
[[item_a + item_b for item_b in b] for item_a in a]
If you want the end result to be a list of lists it could go like this:
c = [[x + y for x in b] for y in a]
If you want the end result to be a single flat list, with each group of sums appended after the previous one, you could write:
c = []
for y in a:
    c += [y + x for x in b]
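The flat version can also be written as a single comprehension with two for clauses. A small sketch using the lists from the question (the variable names flat and nested are only illustrative):

```python
a = [0.11, 0.22, 0.13, 6.7, 2.5, 2.8]
b = [1.2, 1.4, 2.6, 2.3, 5.7, 9.9]

# The first `for` is the outer loop and the second is the inner loop, so the
# results are ordered a[0]+b[0], a[0]+b[1], ..., then a[1]+b[0], and so on.
flat = [item_a + item_b for item_a in a for item_b in b]

# The nested form keeps one sub-list per item of a.
nested = [[item_a + item_b for item_b in b] for item_a in a]

# Flattening the nested result gives the same sequence as `flat`.
assert flat == [value for row in nested for value in row]
print(len(flat), len(nested))
```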
Another option is to convert your list into numpy array and then exploit the broadcasting property of numpy arrays:
import numpy as np
npA = np.array(a)
npB = np.array(b)
npA[:, None] + npB
array([[ 1.31, 1.51, 2.71, 2.41, 5.81, 10.01],
[ 1.42, 1.62, 2.82, 2.52, 5.92, 10.12],
[ 1.33, 1.53, 2.73, 2.43, 5.83, 10.03],
[ 7.9 , 8.1 , 9.3 , 9. , 12.4 , 16.6 ],
[ 3.7 , 3.9 , 5.1 , 4.8 , 8.2 , 12.4 ],
[ 4. , 4.2 , 5.4 , 5.1 , 8.5 , 12.7 ]])
Multiplication broadcasts the same way, giving the product of every item of a with every item of b:
npA[:, None] * npB
which returns:
array([[ 0.132, 0.154, 0.286, 0.253, 0.627, 1.089],
[ 0.264, 0.308, 0.572, 0.506, 1.254, 2.178],
[ 0.156, 0.182, 0.338, 0.299, 0.741, 1.287],
[ 8.04 , 9.38 , 17.42 , 15.41 , 38.19 , 66.33 ],
[ 3. , 3.5 , 6.5 , 5.75 , 14.25 , 24.75 ],
[ 3.36 , 3.92 , 7.28 , 6.44 , 15.96 , 27.72 ]])
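To see why this works, it can help to look at the shapes involved. Indexing with [:, None] turns the first array into a column, and NumPy then broadcasts the column against the second array. The sketch below also checks the result against np.add.outer and np.multiply.outer, which compute the same tables directly:

```python
import numpy as np

a = np.array([0.11, 0.22, 0.13, 6.7, 2.5, 2.8])
b = np.array([1.2, 1.4, 2.6, 2.3, 5.7, 9.9])

col = a[:, None]   # shape (6, 1): one row per item of a
sums = col + b     # (6, 1) combined with (6,) broadcasts to (6, 6)
print(col.shape, b.shape, sums.shape)

# Row i of `sums` is a[i] added to every item of b.
assert np.allclose(sums[0], a[0] + b)

# The ufunc .outer methods build the same tables without manual reshaping.
assert np.allclose(sums, np.add.outer(a, b))
assert np.allclose(col * b, np.multiply.outer(a, b))
```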

creating timesliced array in numpy

I want to create a numpy array.
T = 200
I want to create an array from 0 to 199, in which each value will be divided by 200.
l = [0, 1/200, 2/200, ...]
Does NumPy have a method for this calculation?
Alternatively one can use linspace:
>>> np.linspace(0, 1., 200, endpoint=False)
array([ 0. , 0.005, 0.01 , 0.015, 0.02 , 0.025, 0.03 , 0.035,
0.04 , 0.045, 0.05 , 0.055, 0.06 , 0.065, 0.07 , 0.075,
...
0.92 , 0.925, 0.93 , 0.935, 0.94 , 0.945, 0.95 , 0.955,
0.96 , 0.965, 0.97 , 0.975, 0.98 , 0.985, 0.99 , 0.995])
Use np.arange:
>>> import numpy as np
>>> np.arange(200, dtype=float)/200
array([ 0. , 0.005, 0.01 , 0.015, 0.02 , 0.025, 0.03 , 0.035,
0.04 , 0.045, 0.05 , 0.055, 0.06 , 0.065, 0.07 , 0.075,
0.08 , 0.085, 0.09 , 0.095, 0.1 , 0.105, 0.11 , 0.115,
...
0.88 , 0.885, 0.89 , 0.895, 0.9 , 0.905, 0.91 , 0.915,
0.92 , 0.925, 0.93 , 0.935, 0.94 , 0.945, 0.95 , 0.955,
0.96 , 0.965, 0.97 , 0.975, 0.98 , 0.985, 0.99 , 0.995])
T = 200.0
l = [x / float(T) for x in range(200)]
import numpy as np
T = 200
np.linspace(0.0, 1.0 - 1.0 / float(T), T)
Personally I prefer linspace for creating evenly spaced arrays in general. It is more complex in this case as the endpoint depends on the number of points T.
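As a quick check, the approaches above can be compared directly. This sketch (variable names are illustrative) confirms that they produce the same 200 values up to floating-point rounding:

```python
import numpy as np

T = 200

by_arange = np.arange(T) / T                               # 0/T, 1/T, ..., (T-1)/T
by_linspace = np.linspace(0.0, 1.0, T, endpoint=False)     # stop value excluded
by_linspace_manual = np.linspace(0.0, 1.0 - 1.0 / T, T)    # endpoint computed by hand
by_list = np.array([x / float(T) for x in range(T)])       # plain Python version

assert np.allclose(by_arange, by_linspace)
assert np.allclose(by_arange, by_linspace_manual)
assert np.allclose(by_arange, by_list)
print(by_arange[:3], by_arange[-1])
```

Using endpoint=False avoids having to compute the last value from T by hand.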
