Set confidence intervals for error bars plot in matplotlib - python

I have this dataset:
mydf = pd.DataFrame({'Feature':['Pysch','Physio'],'log_or':[0.3126,0.2022],
'se':[0.0712,0.0568], 'conf_low':[0.1729,0.0907], 'conf_high':[0.4522, 0.3136]})
mydf = mydf.sort_values(by='log_or')
mydf
Feature log_or se conf_low conf_high
1 Physio 0.2022 0.0568 0.0907 0.3136
0 Pysch 0.3126 0.0712 0.1729 0.4522
And I want to create an error bar plot using my calculated confidence intervals in conf_low and conf_high.
I tried this at first, but I can see that the error bars don't match my calculated confidence intervals:
plt.errorbar(mydf['log_or'], mydf['Feature'],
xerr=mydf['se'], marker='s', mfc='Tomato')
plt.show()
You can see that, for example, in the Physio variable the error bar goes from 0.14 to 0.26 in the image approximately, but my tabulated confidence intervals go from 0.091 to 0.316.
So I tried to set up my custom intervals, with this:
lowr = mydf['conf_low'].to_numpy()
uppr = mydf['conf_high'].to_numpy()
intervals = [lowr, uppr]
plt.errorbar(mydf['log_or'], mydf['Feature'], xerr=intervals, marker='s', mfc='Tomato')
plt.show()
Now the interval for my Physio variable goes from approximately 0.1 to 0.5, which is wrong. What am I doing wrong? How can I use my custom intervals in this plot?

I think you are misunderstanding what the values passed to xerr are meant to represent. Have a look at the plt.errorbar documentation (under xerr, yerr).
From your first attempt: xerr=mydf['se'] will be used as follows:
shape(N,): Symmetric +/-values for each data point.
From your second attempt, xerr=intervals will be used as follows:
shape(2, N): Separate - and + values for each bar. First row contains the lower errors, the second row contains the upper errors.
So, the values you are passing here are used to measure the length of the error (+/- for each data point). However, your values in mydf.conf_low and mydf.conf_high do not represent length, they are simply x-values. As you mention for Physio:
my tabulated confidence intervals go from 0.091 to 0.316.
The solution then is to calculate the length on both sides and pass those values to xerr. Like so:
import pandas as pd
import matplotlib.pyplot as plt
mydf = pd.DataFrame({'Feature':['Pysch','Physio'],'log_or':[0.3126,0.2022],
'se':[0.0712,0.0568], 'conf_low':[0.1729,0.0907], 'conf_high':[0.4522, 0.3136]})
mydf = mydf.sort_values(by='log_or')
mydf
plt.errorbar(mydf['log_or'], mydf['Feature'],
xerr=((mydf.log_or - mydf.conf_low),(mydf.conf_high-mydf.log_or)), marker='s', mfc='Tomato')
plt.show()
Result:
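The conversion from interval endpoints to error lengths can also be written as a small helper. This is only an illustrative sketch and not part of the original answer: the name ci_to_xerr is hypothetical, and it assumes every point estimate lies between its lower and upper bound.

```python
import numpy as np

def ci_to_xerr(center, low, high):
    """Convert absolute interval endpoints into the (2, N) offsets that errorbar expects."""
    center = np.asarray(center, dtype=float)
    low = np.asarray(low, dtype=float)
    high = np.asarray(high, dtype=float)
    lower = center - low    # distance from the point down to the lower bound
    upper = high - center   # distance from the point up to the upper bound
    if (lower < 0).any() or (upper < 0).any():
        raise ValueError("each center must lie between its low and high bound")
    return np.vstack([lower, upper])

# Values from the question, sorted by log_or (Physio first, then Pysch)
xerr = ci_to_xerr([0.2022, 0.3126], [0.0907, 0.1729], [0.3136, 0.4522])
print(xerr)
```

The returned array can be passed directly as xerr=xerr to plt.errorbar. For Physio the offsets are about 0.1115 and 0.1114, so the bar spans 0.0907 to 0.3136, matching the tabulated interval.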

Related

How can I calculate the time lag between two similar time series?

I'm trying to compute/visualize the time lag between 2 time series (I want to know the time lag between the humidity progression of outside and inside a room).
Each data point of my series was taken hourly. Plotting the 2 series together, I can clearly see a shift between them: Sorry for hiding the axis
Here is part of my time series data. I will pack it into 2 arrays:
inside_humidity =
[11.77961297, 11.59755268, 12.28761522, 11.88797553, 11.78122077, 11.5694668,
11.70421932, 11.78122077, 11.74272005, 11.78122077, 11.69438733, 11.54126933,
11.28460592, 11.05624965, 10.9611012, 11.07527934, 11.25417308, 11.56040908,
11.6657186, 11.51171572, 11.49246536, 11.78594142, 11.22968373, 11.26840678,
11.26840678, 11.29447992, 11.25553344, 11.19711371, 11.17764047, 11.11922075,
11.04132778, 10.86996123, 10.67410607, 10.63493504, 10.74922916, 10.74922916,
10.6294765, 10.61011497, 10.59075345, 10.80373021, 11.07479154, 11.15223764,
11.19711371, 11.17764047, 11.15816723, 11.22250051, 11.22250051, 11.202915,
11.18332948, 11.16374396, 11.14415845, 11.12457293, 11.10498742, 11.14926578,
11.16896413, 11.16896413, 11.14926578, 10.8307902, 10.51742195, 10.28187137,
10.12608544, 9.98977276, 9.62267727, 9.31289289, 8.96438546, 8.77077022,
8.69332413, 8.51907042, 8.30609366, 8.38353975, 8.4513867, 8.47085994,
8.50980642, 8.52927966, 8.50980642, 8.55887037, 8.51969934, 8.48052831,
8.30425867, 8.2177078, 7.98402891, 7.92560918, 7.89950166, 7.83489682,
7.75789537, 7.5984808, 7.28426807, 7.39778913, 7.71943214, 8.01149931,
8.18276652, 8.23009255, 8.16215295, 7.93822471, 8.00350215, 7.93843482,
7.85072729, 7.49778011, 7.31782649, 7.29862668, 7.60162032, 8.29665484,
8.58797834, 8.50011383, 8.86757784, 8.76600556, 8.60491125, 8.4222628,
8.24923231, 8.14470714, 8.17351638, 8.52530093, 8.72220151, 9.26745883,
9.1580007, 8.61762692, 8.22187405, 8.43693644, 8.32414835, 8.32463974,
8.46833012, 8.55865487, 8.72647164, 9.04112806, 9.35578449, 9.59465974,
10.47339785, 11.07218093, 10.54091351, 10.56138918, 10.46099958, 10.38129168,
10.16434831, 10.10612612, 10.009246, 10.53502351, 10.8307902, 11.13420052,
11.64337309, 11.18958511, 10.49630791, 10.60856932, 10.37029108, 9.86281478,
9.64699826, 9.95341012, 10.24329812, 10.6848196, 11.47604231, 11.30505352,
10.72194974, 10.30058448, 10.05022037, 10.06318411, 9.90118897, 9.68530059,
9.47790657, 9.48585784, 9.61639418, 9.86244265, 10.29009361, 10.28297229,
10.32073088, 10.65389513, 11.09656351, 11.20188562, 11.24124169, 10.40503955,
9.74632512, 9.07606098, 8.85145589, 9.37080152, 9.65082743, 10.0707891,
10.68776091, 11.25879751, 11.0416348, 10.89558456, 10.7908258, 10.66539685,
10.7297755, 10.77571398, 10.9268264, 11.16021492, 11.60961709, 11.43827534,
11.96155427, 12.16116437, 12.80412266, 12.52540805, 11.96752965, 11.58099292]
outside_humidity =
[10.17449206, 10.4823292, 11.06818167, 10.82768699, 11.27582592, 11.4196233,
10.99393027, 11.4122507, 11.18192837, 10.87247831, 10.68664321, 10.37949651,
9.57155882, 10.86611665, 11.62547196, 11.32004266, 11.75537602, 11.51292063,
11.03107569, 10.7297755, 10.4345622, 10.61271497, 9.49271162, 10.15594248,
9.99053828, 9.80915398, 9.6452438, 10.06900573, 11.18075689, 11.8289847,
11.83334752, 11.27480708, 11.14370467, 10.88149985, 10.73930381, 10.7236597,
10.26210496, 11.01260226, 11.05428228, 11.58321342, 12.70523808, 12.5181118,
11.90023799, 11.67756426, 11.28859471, 10.86878222, 9.73984486, 10.18253902,
9.80915398, 10.50980784, 11.38673459, 11.22751685, 10.94171823, 10.56484228,
10.38220753, 10.05388847, 9.96147203, 9.90698862, 9.7732203, 9.85262125,
8.7412938, 8.88281702, 8.07919545, 8.02883587, 8.32341424, 8.07357711,
7.27302616, 6.73660684, 6.66722819, 7.29408637, 7.00046542, 6.46322019,
6.07150988, 6.00207234, 5.8818402, 6.82443881, 7.20212882, 7.52167696,
7.88857771, 8.351627, 8.36547023, 8.24802846, 8.18520693, 7.92420816,
7.64926024, 7.87944972, 7.82118727, 8.02091833, 7.93071882, 7.75789457,
7.5416447, 6.94430133, 6.65907535, 6.67454591, 7.25493614, 7.76939457,
7.55357806, 6.61479472, 7.17641357, 7.24664082, 8.62732387, 8.66913548,
8.70925667, 9.0477017, 8.24558224, 8.4330502, 8.44366397, 8.17995798,
8.1875752, 9.33296518, 9.66567041, 9.88581085, 8.95449382, 8.3587624,
9.20584448, 8.90605388, 8.87494884, 9.12694892, 8.35055177, 7.91879933,
7.78867253, 8.22800878, 9.03685287, 12.49630018, 11.11819755, 10.98869374,
10.65897176, 10.36444573, 10.052609, 10.87627021, 10.07379564, 10.02233847,
9.62022856, 11.21575473, 10.85483543, 11.67324627, 11.89234248, 11.10068132,
10.06942096, 8.50405894, 8.13168561, 8.83616476, 8.35675085, 8.33616802,
8.35675085, 9.02209801, 9.5530404, 9.44738836, 10.89645958, 11.44771721,
11.79943601, 10.7765335, 11.1453622, 10.74874776, 10.55195175, 10.34494483,
9.83813522, 11.26931785, 11.20641798, 10.51555027, 10.90808954, 11.80923545,
11.68300879, 11.60313809, 7.95163365, 7.77213815, 7.54209557, 7.30603673,
7.17842173, 8.25899805, 8.56494995, 10.44245578, 11.08542758, 11.74129079,
11.67979686, 12.94362214, 11.96285343, 11.8289847, 11.01388413, 10.6793698,
11.20662595, 11.97684701, 12.46383177, 11.34178655, 12.12477078, 12.48698059,
12.89325064, 12.07470295, 12.6777319, 10.91689448, 10.7676326, 10.66710434]
I know cross correlation is the right term to use, but after a while I still don't get the idea of using scipy.signal.correlate and numpy.correlate, because all I got is an array full of NaNs. So clearly I need some more knowledge in this area.
What I expect to achieve is probably a plot like those in the answer section of this thread How to make a correlation plot with a certain lag of two time series where I can see at how many hours the time lag is most likely.
Thank you a lot in advance!
With the given data, you can use NumPy and Matplotlib to plot one series against the other and fit a line through the points.
So, you can do something like this:
import numpy as np
from matplotlib import pyplot as plt
x = np.array(inside_humidity)
y = np.array(outside_humidity)
fig = plt.figure()
# fit a curve of your choice
a, b = np.polyfit(inside_humidity, outside_humidity, 1)
y_fit = a * x + b
# scatter plot, and fitted plot (best fit used)
plt.scatter(inside_humidity, outside_humidity)
plt.plot(x, y_fit)
plt.show()
which gives this:
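The scatter plot and linear fit show how strongly the two series are related, but not the lag between them. A cross-correlation sketch along the lines the question asks about could look like the following. The function name estimate_lag and the synthetic demo data are assumptions made for illustration. The sketch also assumes equally spaced samples without NaN values: NaNs propagate through np.correlate, which may be why the question reports an array full of NaNs.

```python
import numpy as np

def estimate_lag(a, b):
    """Return the lag (in samples) that maximizes the cross-correlation of a and b.

    A positive lag means that a trails b by that many samples.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Remove the means so the peak reflects shape, not the overall level
    a = a - a.mean()
    b = b - b.mean()
    corr = np.correlate(a, b, mode="full")
    # In "full" mode the lags run from -(len(b) - 1) to len(a) - 1
    lags = np.arange(-(len(b) - 1), len(a))
    return int(lags[np.argmax(corr)]), lags, corr

# Synthetic check: the "inside" series is the "outside" series delayed by 5 samples
rng = np.random.default_rng(0)
base = rng.normal(size=205)
outside_demo = base[5:]
inside_demo = base[:200]
lag, lags, corr = estimate_lag(inside_demo, outside_demo)
print(lag)
```

With hourly data the returned lag is directly a number of hours. Plotting corr against lags (for example with plt.plot(lags, corr)) gives the kind of correlation-versus-lag picture described in the question.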

How to create a histogram over time?

I'm trying to visualize how a distribution changes over time -- each vertical slice should be the distribution at that timestep.
I want it to look something like this (there are two such curves/temporal-histograms here).
The closest I've found is this seaborn time series, but I want the distribution or at least the min, mean, and max -- this band is a confidence interval, which I can't use (it's also prohibitively slow).
https://seaborn.pydata.org/examples/errorband_lineplots.html
Update:
Here's a snippet to produce some sample data.
import numpy as np
num_timesteps = 20
samples_per_timestep = 100
timesteps = np.arange(num_timesteps)
def get_std(t):
    return t if t < num_timesteps//2 else abs(num_timesteps - t)
samples = np.stack([np.random.normal(t, get_std(t), samples_per_timestep) for t in timesteps])
samples[t] is a sample of the distribution at timestep t. The distribution starts as constant (std = 0), widens, then narrows again.
Update: So I asked about this in the Seaborn repository, and this can be way simpler using Seaborn.
Here's an example:
import seaborn as sns
import seaborn.objects as so
fmri = sns.load_dataset("fmri").query("region == 'parietal'")
p = so.Plot(fmri, "timepoint", "signal")
for tail in [25, 10, 5, 1]:
    p = p.add(so.Band(), so.Perc([tail, 100 - tail]))
p.add(so.Line(), so.Agg("median"))
Which will result in this plot:
You can read more about it in Statistical estimation and error bars.
This is a lot less work and scales better. Hope it helps!
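Without seaborn, the per-timestep min, mean, and max mentioned in the question can also be computed directly with NumPy. The following is a minimal sketch under stated assumptions: the name summarize_timesteps is hypothetical, the data is laid out as in the question (one row per timestep), and a seeded generator replaces np.random.normal only to make the demo reproducible.

```python
import numpy as np

def summarize_timesteps(samples, percentiles=(0, 25, 50, 75, 100)):
    """Return the per-timestep mean and a dict mapping percentile -> per-timestep values."""
    samples = np.asarray(samples, dtype=float)
    # axis=1 reduces over the samples of each timestep (rows are timesteps)
    bands = np.percentile(samples, percentiles, axis=1)
    return samples.mean(axis=1), dict(zip(percentiles, bands))

# Sample data shaped like the question's: std grows, then shrinks again
rng = np.random.default_rng(0)
num_timesteps = 20
timesteps = np.arange(num_timesteps)
widths = np.minimum(timesteps, num_timesteps - timesteps)
samples = np.stack([rng.normal(t, w, 100) for t, w in zip(timesteps, widths)])

mean, bands = summarize_timesteps(samples)
```

Here bands[0] and bands[100] are the per-timestep minimum and maximum, so a call such as ax.fill_between(timesteps, bands[0], bands[100], alpha=0.3) followed by ax.plot(timesteps, mean) would draw a min-to-max band with the mean on top.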
I had the exact same problem, and took quite a few detours, but this is most definitely possible!
Imports
We need to import matplotlib, NumPy and Pandas.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
Input data
I assume you have the data as a Pandas Series, with the time/step on the index and the values as values.
Step Wealth
0 0
0 0
0 0
0 0
1 7.89338
1 7.50838
1 2.00948
1 8.74963
I load my data from a pickle (in the format specified above):
wealth = pd.read_pickle("wealth.pickle")
You can download a ZIP with this pickle file here: wealth.zip.
Aggregate the data to percentiles
This part is a bit ugly. We first define a small helper function for each percentile we want to calculate:
# Define functions to calculate percentiles
def q1(x):
    return x.quantile(0.01)
def q5(x):
    return x.quantile(0.05)
def q25(x):
    return x.quantile(0.25)
def q50(x):
    return x.quantile(0.50)
def q75(x):
    return x.quantile(0.75)
def q95(x):
    return x.quantile(0.95)
def q99(x):
    return x.quantile(0.99)
If anyone reading this has a better/cleaner way to aggregate this, please let me know!
Note that these helpers call the pandas Series.quantile method, because (in my case) it works better with the data in the series, but numpy.quantile (or numpy.percentile with values from 0 to 100) should give equivalent results.
The data now needs to be grouped by the index (level=0), and then aggregated using the functions defined above:
w_agg = wealth.groupby(level=0).agg([q1, q5, q25, q50, q75, q95, q99])
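One possibly cleaner alternative to the seven helper functions is to pass all quantiles to the groupby quantile method at once and unstack the result. This is a sketch rather than the answer author's method: the name aggregate_quantiles is hypothetical, and the demo series only reuses the few rows shown in the input table above.

```python
import pandas as pd

def aggregate_quantiles(series, quantiles=(0.01, 0.05, 0.25, 0.50, 0.75, 0.95, 0.99)):
    """Group by the first index level and return one column per quantile."""
    # quantile() with a list yields a (group, quantile) MultiIndex; unstack pivots it to columns
    agg = series.groupby(level=0).quantile(list(quantiles)).unstack()
    # Rename the float column labels (0.01, 0.05, ...) to q1, q5, ...
    agg.columns = [f"q{int(round(q * 100))}" for q in agg.columns]
    return agg

demo = pd.Series(
    [0.0, 0.0, 0.0, 0.0, 7.89338, 7.50838, 2.00948, 8.74963],
    index=pd.Index([0, 0, 0, 0, 1, 1, 1, 1], name="Step"),
    name="Wealth",
)
w_agg_demo = aggregate_quantiles(demo)
print(w_agg_demo)
```

The resulting frame has the same column names as w_agg (q1 through q99), so the plotting code further down should work with it unchanged.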
And now w_agg looks like this:
Step q1 q5 q25 q50 q75 q95 q99
0 0 0 0 0 0 0 0
1 -3.2311 0.759751 3.2881 6.03641 8.43206 11.9663 15.4515
2 -3.22888 -1.15079 3.13756 6.41804 8.43206 12.7269 15.4515
3 -5.31614 -1.91156 3.22287 6.54126 8.77544 14.644 15.5798
4 -5.64095 -2.52143 2.65959 6.22455 9.40699 14.6545 15.9647
Plotting
Now we can start with the plotting. Aside from the regulars, we're using the Axes.fill_between method for this.
We create a figure, and then start with the widest percentile range: From 1 to 99. Then we draw 5 to 95 on top of that, then 25 to 75 on top of that and finally the median as a line.
Play a bit with the alpha and color values to make it look nice!
# Create a figure and axes
fig, ax = plt.subplots(figsize=(10, 6))
# Add the bands to the axes
ax.fill_between(x=w_agg.index, y1=w_agg["q1"], y2=w_agg["q99"], alpha=0.3, color="tab:blue")
ax.fill_between(x=w_agg.index, y1=w_agg["q5"], y2=w_agg["q95"], alpha=0.3, color="tab:blue")
ax.fill_between(x=w_agg.index, y1=w_agg["q25"], y2=w_agg["q75"], alpha=0.3, color="tab:blue")
# Plot the median as line
ax.plot(w_agg.index, w_agg["q50"], '-', color="tab:blue")
# Add title, legend and plot
ax.set_title("Distribution of wealth between population over time")
ax.set_xlabel("Time")
ax.set_ylabel("Wealth")
ax.legend([f"{n}% distribution" for n in [98, 90, 50]] + ["Median"], loc="upper left")
fig.tight_layout()
The result
For my dataset, this was the result:

I want to detect ranges with the same numerical boundaries of a dataset using matplotlib or pandas in python 3.7

I have a ton of ranges. They all consist of numbers. The range has a maximum and a minimum which can not be exceeded, but given the example that you have two ranges and one max point of the range reaches above the min area of the other. That would mean that you have a small area that covers both of them. You can write one range that includes the others.
I want to see if some ranges overlap or if I can find some ranges that cover most of the other. The goal would be to see if I can simplify them by using one smaller range that fits inside the other. For example 7,8 - 9,6 and 7,9 - 9,6 can be covered with one range.
You can see my attempt to visualize them. But when I use my entire dataset consisting of hundreds of ranges my graph is not longer useful.
I know that I can detect recurrent ranges using Python. But I don't want to know how often a range occurs. I want to know how many ranges lie within the same numerical boundaries. I want to see if I can have a couple of ranges covering all of them. Finally, my goal is to have the master ranges sorted into categories, meaning that I have range 1 covering 50 other ranges, then range 2 covering 25 ranges, and so on.
My current program shows the overlap of ranges visually, but I also want that as printed output with the exact digits.
It would be nice if you could share some ideas for solving this problem, or if you have any suggestions on tools within Python 3.7.
import matplotlib.pyplot as plt
intervals = [[3.6,4.5],
[3.6,4.5],
[7.8,9.6],
[7.9,9.6],
[7.8,9.6],
[3.4,4.1],
[2.8,3.4],
[8.25,9.83],
[3.62,3.96],
[8.25,9.83],
[0.62,0.68],
[2.15,2.49],
[0.8,1.0],
[0.8,1.0],
[3.1,3.9],
[6.7,8.3],
[1,1.5],
[1,1.2],
[1.5,1.8],
[1.8,2.5],
[3,4.0],
[6.5,8.0],
[1.129,1.35],
[2.82,3.38],
[1.69,3.38],
[3.38,6.21],
[2.25,2.82],
[5.649,6.214],
[1.920,6.214]
]
# Draw each interval as a thick translucent bar; overlaps show up darker
for interval in intervals:
    plt.plot(interval, [0, 0], 'b', alpha=0.2, linewidth=100)
plt.show()
Here is an idea: make a pandas DataFrame from the array, then subtract the values in Column1 from those in Column2 (Column1 holds each range's lower bound, Column2 its upper bound) to get the width of each range. After that, create a histogram of those widths and their frequencies.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
intervals = [[3.6,4.5],
[3.6,4.5],
[7.8,9.6],
[7.9,9.6],
[7.8,9.6],
[3.4,4.1],
[2.8,3.4],
[8.25,9.83],
[3.62,3.96],
[8.25,9.83],
[0.62,0.68],
[2.15,2.49],
[0.8,1.0],
[0.8,1.0],
[3.1,3.9],
[6.7,8.3],
[1,1.5],
[1,1.2],
[1.5,1.8],
[1.8,2.5],
[3,4.0],
[6.5,8.0],
[1.129,1.35],
[2.82,3.38],
[1.69,3.38],
[3.38,6.21],
[2.25,2.82],
[5.649,6.214],
[1.920,6.214]]
intervals_ar = np.array(intervals)
df = pd.DataFrame({'Column1': intervals_ar[:, 0], 'Column2': intervals_ar[:, 1]})
df['Ranges'] = df['Column2'] - df ['Column1']
print(df)
frecuency_range = df['Ranges'].value_counts().sort_index()
print(frecuency_range)
df.Ranges.value_counts().sort_index().plot(kind = 'hist', bins = 5)
plt.title("Histogram Frequency vs Range (Column2 - Column1)")
plt.show()
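The histogram above groups ranges by their width, not by where they lie. To get printed "master ranges" together with the number of original ranges each one covers, one option is to sort the intervals and merge the overlapping ones. The sketch below is an assumption about what is wanted, not part of the original answer: merge_ranges is a hypothetical name, ranges that merely touch at an endpoint are treated as overlapping, and the sample list is a subset of the question's data.

```python
def merge_ranges(intervals):
    """Merge overlapping [low, high] ranges.

    Returns (low, high, count) tuples, sorted so that the merged range
    covering the most original ranges comes first.
    """
    merged = []
    for low, high in sorted(intervals):
        if merged and low <= merged[-1][1]:
            # Overlaps (or touches) the current master range: extend it
            merged[-1][1] = max(merged[-1][1], high)
            merged[-1][2] += 1
        else:
            # Gap before this range: start a new master range
            merged.append([low, high, 1])
    return sorted((tuple(m) for m in merged), key=lambda m: m[2], reverse=True)

sample = [[3.6, 4.5], [7.8, 9.6], [7.9, 9.6], [8.25, 9.83], [0.8, 1.0], [3.4, 4.1]]
for low, high, count in merge_ranges(sample):
    print(f"{low} - {high} covers {count} ranges")
```

Because merging is transitive, a long chain of slightly overlapping ranges collapses into one wide master range. If that is too coarse for the full dataset, a stricter rule (for example, requiring a minimum overlap) would be needed.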

Pyplot - show x-axis labels according to y-axis value

I have 1min 20s long video record of 23.813 FPS. More precisely, I have 1923 frames in which I've been scanning desired features. I've detected some specific behavior via neural network and using chosen metric I calculated a value for each frame.
So, now, I have X-Y values to plot a graph:
X: time (each step is 0.041993869 s)
Y: a value measured by neural network
In the default state, the plot looks like this:
So, I've tried to limit the number of bins in the hope that the bins would be spread over all my values. But they are not. As you can see, only the first fifteen x-values are rendered:
pyplot.locator_params(axis='x', nbins=15)
But neither one is the desired state. The desired state should render x-axis labels only for those bins whose y-value is higher than e.g. 1.2. So, it should look like this:
Is it possible to achieve such a result?
Code:
# draw plot
from pandas import read_csv
from matplotlib import pyplot
test_video_fps = 23.813
df = read_csv('/path/to/csv/file/file.csv', header=None)
df.columns = ['anomaly']
df['time'] = [round((i + 1) / test_video_fps, 2) for i in range(df.shape[0])]
axes = df.plot.bar(x='time', y='anomaly', rot='0')
# pyplot.locator_params(axis='x', nbins=15)
# axes.get_xaxis().set_visible(False)
fig = pyplot.gcf()
fig.set_size_inches(16, 10)
fig.savefig('/path/to/output/plot.png', dpi=100)
# pyplot.show()
Example:
Simple example with a subset of original data.
0.379799
0.383786
0.345488
0.433286
0.469474
0.431993
0.474253
0.418843
0.491070
0.447778
0.384890
0.410994
0.898229
1.872756
2.907009
3.691382
4.685749
4.599612
3.738768
8.043357
7.660785
2.311198
1.956096
2.877326
3.467511
3.896339
4.250552
6.485533
7.452986
7.103761
2.684189
2.516134
1.512196
1.435303
0.852047
0.842551
0.957888
0.983085
0.990608
1.046679
1.082040
1.119655
0.962391
1.263255
1.371034
1.652812
2.160451
2.646674
1.460051
1.163745
0.938030
0.862976
0.734119
0.567076
0.417270
Desired plot:
Your question has become a two-part problem, but it is interesting enough that I will answer both.
I will answer this in Matplotlib object oriented notation with numpy data rather than pandas. This will make things easier to explain, and can be easily generalized to pandas.
I will assume that you have the following two data arrays:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import ticker
dt = 0.041993869
x = np.arange(0.0, 15 * dt, dt)
y = np.array([1., 1.1, 1.3, 7.6, 2.4, 0.8, 0.7, 0.8, 1.0, 1.5, 10.0, 4.5, 3.2, 0.9, 0.7])
Part 1: Identifying the locations where you want labels
The data can be masked to get the locations of the peaks:
mask = y > 1.2
Consecutive peaks can be easily collapsed by computing the diff. A diff of a boolean mask will be True at the locations where the mask changes value. You then take every other element to get the locations where it goes from False to True. The following code handles the corner cases where you start with a peak or end in the middle of a peak (it assumes the mask changes value at least once, otherwise d[0] raises an IndexError):
d = np.flatnonzero(np.diff(mask))
if mask[d[0]]:  # First diff is end of peak: True to False
    d = np.concatenate(([0], d[1::2] + 1))
else:
    d = d[::2] + 1
d is now an array of indices into x and y that represent the first element of each run of peaks. You can get the last element by swapping the indices [1::2] and [::2] in the if-else statement, and removing the + 1 in both cases.
The locations of the labels are now simply x[d].
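The same run-start detection can be written without the if-else by padding the mask with a leading False, so that a peak at index 0 counts as a rising edge. This is an alternative sketch, not the code from the answer above: peak_starts is a hypothetical helper name, and the demo array repeats the y values assumed earlier.

```python
import numpy as np

def peak_starts(y, threshold):
    """Return the index of the first element of each run where y > threshold."""
    mask = np.asarray(y) > threshold
    # Prepend False so a run that begins at index 0 is detected as False -> True
    padded = np.concatenate(([False], mask))
    # A run starts where the previous element is False and the current one is True
    return np.flatnonzero(~padded[:-1] & padded[1:])

y_demo = np.array([1., 1.1, 1.3, 7.6, 2.4, 0.8, 0.7, 0.8, 1.0, 1.5, 10.0, 4.5, 3.2, 0.9, 0.7])
d_demo = peak_starts(y_demo, 1.2)
print(d_demo)
```

This version also returns an empty array when no value exceeds the threshold, so it can be used without a special case for flat data.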
Part 2: Locating and formatting the labels
For this part, you will need to access Matplotlib's object oriented API via the Axes object you are plotting on. You already have this in the pandas form, making the transfer easy. Here is a sample in raw Matplotlib:
fig, axes = plt.subplots()
axes.plot(x, y)
Now use the ticker API to easily set the locations and labels. You actually set the locations directly (not with a Locator) since you have a very fixed list of ticks:
axes.set_xticks(x[d])
axes.xaxis.set_major_formatter(ticker.StrMethodFormatter('{x:0.01g}s'))
For the sample data shown here, you get

plotting high precision data

I have an array which contains error values as a function of two different quantities (alpha and eigRange).
I fill my array like this :
for j in range(n):
    for i in range(alphaLen):
        alpha = alpha_list[i]
        c = train.eig(xt_, yt_, m-j, m, alpha, "cpu")
        costListTrain[j, i] = cost.err(xt_, xt_, yt_, c)
normedValues = costListTrain/np.max(costListTrain.ravel())
where
n = 20
alpha_list = [0.0001,0.0003,0.0008,0.001,0.003,0.006,0.01,0.03,0.05]
My costListTrain array contains some values that have very small differences, e.g.:
2.809458902485728 2.809458905776425 2.809458913576337 2.809459011062461
2.030326752376704 2.030329906064879 2.030337351188699 2.030428976282031
1.919840839066182 1.919846470077076 1.919859731440199 1.920021453630778
1.858436351617677 1.858444223016128 1.858462730482461 1.858687054377165
1.475871326997542 1.475901926855846 1.475973476249240 1.476822830933632
1.475775410801635 1.475806023102173 1.475877601316863 1.476727286424228
1.475774284270633 1.475804896751524 1.475876475382906 1.476726165223209
1.463578292548192 1.463611627166494 1.463689466240788 1.464609083309240
1.462859608038034 1.462893157900139 1.462971489632478 1.463896516033939
1.461912706143012 1.461954067956570 1.462047793798572 1.463079574605320
1.450581041157659 1.452770209885761 1.454835202839513 1.459676311335618
1.450581041157643 1.452770209885764 1.454835202839484 1.459676311335624
1.450581041157651 1.452770209885735 1.454835202839484 1.459676311335610
1.450581041157597 1.452770209885784 1.454835202839503 1.459676311335620
1.450581041157575 1.452770209885757 1.454835202839496 1.459676311335619
1.450581041157716 1.452770209885711 1.454835202839499 1.459676311335613
1.450581041157667 1.452770209885744 1.454835202839509 1.459676311335625
1.450581041157649 1.452770209885750 1.454835202839476 1.459676311335617
1.450581041157655 1.452770209885708 1.454835202839442 1.459676311335622
1.450581041157571 1.452770209885700 1.454835202839498 1.459676311335622
As you can see here, the values are very, very close together!
I am trying to plot this data in a way where I have the two quantities on the x and y axes and the error value is represented by the dot color.
This is how I'm plotting my data:
alpha_list = np.log(alpha_list)
eigenvalues, alphaa = np.meshgrid(eigRange, alpha_list)
vMin = np.min(costListTrain)
vMax = np.max(costListTrain)
plt.scatter(x, y, s=70, c=normedValues, vmin=vMin, vmax=vMax, alpha=0.50)
but the result is not correct.
I tried to normalize my error values by dividing all values by the max, but it didn't work!
The only way that I could make it work (which is incorrect) is to normalize my data in two different ways. One is based on each column (which means factor 1 is constant and factor 2 is changing), and the other is based on each row (which means factor 2 is constant and factor 1 is changing). But it doesn't really make sense, because I need a single plot to show the tradeoff between the two quantities on the error values.
UPDATE
This is what I mean by the last paragraph.
Normalizing values based on the max of each row, which corresponds to eigenvalues:
maxsEigBasedTrain= np.amax(costListTrain.T,1)[:,np.newaxis]
maxsEigBasedTest= np.amax(costListTest.T,1)[:,np.newaxis]
normEigCostTrain=costListTrain.T/maxsEigBasedTrain
normEigCostTest=costListTest.T/maxsEigBasedTest
Normalizing values based on the max of each column, which corresponds to alphas:
maxsAlphaBasedTrain= np.amax(costListTrain,1)[:,np.newaxis]
maxsAlphaBasedTest= np.amax(costListTest,1)[:,np.newaxis]
normAlphaCostTrain=costListTrain/maxsAlphaBasedTrain
normAlphaCostTest=costListTest/maxsAlphaBasedTest
plot 1:
where no. of eigenvalues = 10 and alpha changes (should correspond to column 10 of plot 1):
where alpha = 0.0001 and the eigenvalues change (should correspond to the first row of plot 1):
But as you can see, the results are different from plot 1!
UPDATE:
just to clarify more stuff this is how I read my data:
from sklearn.datasets.samples_generator import make_regression
rng = np.random.RandomState(0)
diabetes = datasets.load_diabetes()
X_diabetes, y_diabetes = diabetes.data, diabetes.target
X_diabetes=np.c_[np.ones(len(X_diabetes)),X_diabetes]
ind = np.arange(X_diabetes.shape[0])
rng.shuffle(ind)
#===============================================================================
# Split Data
#===============================================================================
import math
cross= math.ceil(0.7*len(X_diabetes))
ind_train = ind[:cross]
X_train, y_train = X_diabetes[ind_train], y_diabetes[ind_train]
ind_val=ind[cross:]
X_val,y_val= X_diabetes[ind_val], y_diabetes[ind_val]
I also uploaded .csv files HERE
log.csv contain the original value before normalization for plot 1
normalizedLog.csv for plot 1
eigenConst.csv for plot 2
alphaConst.csv for plot 3
I think I found the answer. First of all, there was a problem in my code: I was expecting the "No. of eigenvalue" to correspond to rows, but in my for loop they fill the columns. The correct version is this:
for i in range(alphaLen):
    for j in range(n):
        alpha = alpha_list[i]
        c = train.eig(xt_, yt_, m-j, m, alpha, "cpu")
        costListTrain[i, j] = cost.err(xt_, xt_, yt_, c)
        costListTest[i, j] = cost.err(xt_, xv_, yv_, c)
After asking questions from friends and colleagues I got this answer :
I would assume on default imshow and other plotting commands you
might want to use, do equally sized intervals on the values you are
plotting. if you can set that to logarithmic you should be fine.
Ideally, equally "populated bins" would proof most effective, i guess.
For plotting, I just subtract the min value from the error, then add a small number, and at the end take the log.
temp=costListTrain- costListTrain.min()
temp+=0.00000001
extent = [0, 20,alpha_list[0], alpha_list[-1]]
# Note: the 'spectral' colormap was renamed 'nipy_spectral' in newer Matplotlib releases
plt.imshow(np.log(temp),interpolation="nearest",cmap=plt.get_cmap('nipy_spectral'), extent = extent, origin="lower")
plt.colorbar()
and result is :
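The effect of the shift-and-log transform used above can be checked on a handful of the error values listed in the question. This is a small sketch for illustration only: log_rescale is a hypothetical name, and the epsilon of 1e-8 is simply the constant used in the code above.

```python
import numpy as np

def log_rescale(values, eps=1e-8):
    """Shift values so the minimum maps to eps, then take the natural log."""
    values = np.asarray(values, dtype=float)
    return np.log(values - values.min() + eps)

# One row of closely spaced errors from the question, plus the largest error
values = np.array([
    1.450581041157659, 1.452770209885761, 1.454835202839513,
    1.459676311335618, 2.809458902485728,
])
scaled = log_rescale(values)

# Fraction of the total color range used by the first four values
linear_share = (values[3] - values[0]) / (values[-1] - values[0])
log_share = (scaled[3] - scaled[0]) / (scaled[-1] - scaled[0])
print(linear_share, log_share)
```

On a linear scale the four close values occupy less than one percent of the color range, while after the transform they are spread over most of it. The ordering of the values is preserved, so the colors remain comparable across the whole plot.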
