matplotlib: grouping error bars for each x-axis tick - python

I am trying to use matplotlib to plot error bars, but have a slightly different requirement. The setup is as follows:
I have 3 different methods that I am comparing across 10 different parameter settings. So, on the y-axis I have the model fitting errors as given by the 3 methods, and on the x-axis I have the different parameter settings.
So, for each parameter setting, I would like to get 3 error bars corresponding to the three methods. Ideally, I would like to plot the 95% confidence interval and also the minimum and maximum for each method at each parameter setting.
Some example data can be simulated as:
parameters = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
mean_1 = [10.1, 12.1, 13.6, 14.5, 18.8, 11.8, 28.5]
std_1 = [2.6, 5.7, 4.3, 8.5, 11.8, 5.3, 2.5]
mean_2 = [10.1, 12.1, 13.6, 14.5, 18.8, 11.8, 28.5]
std_2 = [2.6, 5.7, 4.3, 8.5, 11.8, 5.3, 2.5]
mean_3 = [10.1, 12.1, 13.6, 14.5, 18.8, 11.8, 28.5]
std_3 = [2.6, 5.7, 4.3, 8.5, 11.8, 5.3, 2.5]
I have kept the values the same, as it does not change anything from the plotting point of view. I see the matplotlib errorbar method, but I do not know how to extend it to multiple methods over a single x-axis value, as in my case. Additionally, I am not sure how to add the [min, max] markers for each of the methods.

Taking your parameters list as the x axis, mean_1 as the y values and std_1 as the errors, you can plot an errorbar chart with
pylab.errorbar(parameters, mean_1, yerr=std_1, fmt='bo')
In case the error bars are not symmetric, i.e. you have lower_err and upper_err, the statement reads
pylab.errorbar(parameters, mean_1, yerr=[lower_err, upper_err], fmt='bo')
The same works with the keyword xerr for errors in the x direction, which is hopefully self-explanatory by now.
To show several (in your case 3) different datasets, you can proceed as follows:
# import pylab and numpy
import numpy as np
import pylab as pl
# define datasets
parameters = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
mean_1 = [10.1, 12.1, 13.6, 14.5, 18.8, 11.8, 28.5]
std_1 = [2.6, 5.7, 4.3, 8.5, 11.8, 5.3, 2.5]
mean_2 = [10.1, 12.1, 13.6, 14.5, 18.8, 11.8, 28.5]
std_2 = [2.6, 5.7, 4.3, 8.5, 11.8, 5.3, 2.5]
mean_3 = [10.1, 12.1, 13.6, 14.5, 18.8, 11.8, 28.5]
std_3 = [2.6, 5.7, 4.3, 8.5, 11.8, 5.3, 2.5]
# here comes the plotting;
# to achieve a grouping, two extra things are done here:
# 1. don't use a line plot but circular markers and a different
#    marker color for each dataset
# 2. slightly displace the datasets in the x direction to avoid
#    overlap and create a visual grouping
pl.errorbar(np.array(parameters)-0.01, mean_1, yerr=std_1, fmt='bo')
pl.errorbar(parameters, mean_2, yerr=std_2, fmt='go')
pl.errorbar(np.array(parameters)+0.01, mean_3, yerr=std_3, fmt='ro')
pl.show()
This is all about pylab.errorbar, where you have to give the errors explicitly. An alternative approach is to use pylab.boxplot and produce a boxplot for each method, but for that I guess you would need the full distribution per method per parameter setting instead of just mean and std.
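To address the [min, max] markers from the question: one way is to draw the 95% confidence interval as the error bar and overlay the extremes as separate markers. A minimal sketch for one method, assuming the 95% interval is approximated as 1.96 times the standard error over a hypothetical n_runs repetitions per setting, and with placeholder min/max arrays (replace them with your real per-setting extremes):
import numpy as np
import pylab as pl

parameters = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7])
mean_1 = np.array([10.1, 12.1, 13.6, 14.5, 18.8, 11.8, 28.5])
std_1 = np.array([2.6, 5.7, 4.3, 8.5, 11.8, 5.3, 2.5])

n_runs = 10                               # hypothetical repetitions per setting
ci95 = 1.96 * std_1 / np.sqrt(n_runs)     # 95% CI half-width, assuming normality
min_1 = mean_1 - 2 * std_1                # placeholder extremes for illustration only
max_1 = mean_1 + 2 * std_1                # replace with your measured min/max

pl.errorbar(parameters, mean_1, yerr=ci95, fmt='bo', capsize=4)  # mean with 95% CI
pl.plot(parameters, min_1, 'bv')          # downward triangles mark the minima
pl.plot(parameters, max_1, 'b^')          # upward triangles mark the maxima
pl.show()
Repeat for the other two methods with the small x offsets shown above to keep the groups readable.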

Related

Reduce python list size while preserving information

I have multiple long lists in my program. Each list has approximately 3000 float values,
and there are around 100 such lists.
I want to reduce the size of each list to, say, 500, while preserving the information in the original list. I know that it is not possible to completely preserve the information, but I would like the elements of the original list to contribute to the values of the smaller list.
Let's say we have the following list, and want to shorten each inner list to size 3 or 4.
myList = [[4.3, 2.3, 5.1, 6.4, 3.2, 7.7, 1.5, 6.5, 7.4, 4.1],
          [7.3, 3.5, 6.2, 7.4, 2.6, 3.7, 2.6, 7.1, 3.4, 7.1],
          [4.7, 2.6, 5.6, 7.4, 3.7, 7.7, 3.5, 6.5, 7.2, 4.1],
          [7.3, 7.3, 4.1, 6.6, 2.2, 3.9, 1.6, 3.0, 2.3, 4.6],
          [4.7, 2.3, 5.7, 6.4, 3.4, 6.8, 7.2, 6.9, 8.4, 7.1]]
Is there some way to do this? Maybe by some sort of averaging?
You can do something like this:
from statistics import mean, stdev
myList = [[4.3, 2.3, 5.1, 6.4, 3.2, 7.7, 1.5, 6.5, 7.4, 4.1], [2.3, 6.4, 3.2, 7.7, 1.5, 6.5, 7.4, 4.1]]
# summarize each inner list by its range, mean and standard deviation
shorten_list = [[max(i) - min(i), mean(i), round(stdev(i), 5)] for i in myList]
You can also include information such as the sum of the list or the mode. If you just want to take the mean of each list within your list, you can just do this:
from statistics import mean
mean_list = list(map(mean, myList))
Batching may work. Take a look at this question:
How do I split a list into equally-sized chunks?
It shows how to split the list into equally-sized batches, which you can then collapse to one value each, as in the sketch below.
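For example, if each batch is collapsed to its mean, a 3000-element list shrinks to 500 values while every original element still contributes. A minimal sketch using numpy, assuming the target size divides the list length evenly:
import numpy as np

original = np.random.rand(3000)              # stand-in for one of your lists
target_size = 500
chunks = original.reshape(target_size, -1)   # 500 chunks of 6 elements each
reduced = chunks.mean(axis=1)                # one mean per chunk
print(reduced.shape)                         # (500,)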
Alternatively, you can reduce the dimension of the list using a max pooling layer:
import numpy as np
from keras.models import Sequential
from keras.layers import MaxPooling2D

# treat the list of lists as one 5x10 single-channel "image"
image = np.array([[4.3, 2.3, 5.1, 6.4, 3.2, 7.7, 1.5, 6.5, 7.4, 4.1],
                  [7.3, 3.5, 6.2, 7.4, 2.6, 3.7, 2.6, 7.1, 3.4, 7.1],
                  [4.7, 2.6, 5.6, 7.4, 3.7, 7.7, 3.5, 6.5, 7.2, 4.1],
                  [7.3, 7.3, 4.1, 6.6, 2.2, 3.9, 1.6, 3.0, 2.3, 4.6],
                  [4.7, 2.3, 5.7, 6.4, 3.4, 6.8, 7.2, 6.9, 8.4, 7.1]])
image = image.reshape(1, 5, 10, 1)  # (batch, height, width, channels)
# pool each full row of 10 values down to its maximum
model = Sequential([MaxPooling2D(pool_size=(1, 10), strides=1)])
output = model.predict(image)
print(output)
This gives the output:
[[[[7.7]]
  [[7.4]]
  [[7.7]]
  [[7.3]]
  [[8.4]]]]
If you want to change the output size, you can change the pool size.
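If pulling in Keras just for this feels heavy, the same max pooling can be done with a plain numpy reshape, again assuming the pool size divides the row length evenly (a sketch, shown on two of the rows):
import numpy as np

data = np.array([[4.3, 2.3, 5.1, 6.4, 3.2, 7.7, 1.5, 6.5, 7.4, 4.1],
                 [7.3, 3.5, 6.2, 7.4, 2.6, 3.7, 2.6, 7.1, 3.4, 7.1]])
pool = 5  # 10 columns -> 2 values per row
pooled = data.reshape(data.shape[0], -1, pool).max(axis=2)
print(pooled)  # [[6.4 7.7]
               #  [7.4 7.1]]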

How to sum up elements in list in a moving range

I'm trying to sum up elements from a list over a moving range. For instance, when the user inputs a customized range n, list[0] to list[n] are added up and stored in a new list, followed by list[1] to list[n+1], and so on until the end. Finally the maximum value in the new list is printed. However, in my code the sums seem to keep accumulating across windows.
Thanks a lot for your help.
The list is:
[5.8, 1.2, 5.8, 1.0, 6.9, 0.8, 6.0, 18.4, 18.6, 1.0, 0.8, 6.4, 12.2, 18.2, 1.4, 6.8, 41.8, 3.6, 5.2, 5.2, 4.6, 8.6, 16.6, 13.2, 9.6, 41.6, 37.2, 110.0, 30.0, 34.8, 24.6, 7.0, 13.4, 0.5, 37.0, 18.8, 20.4, 0.6, 6.4, 2.4, 1.0, 7.6, 6.6, 4.4, 2.4, 0.6, 3.2, 21.2, 28.2, 3.2, 2.4, 14.4, 0.6, 1.6, 4.4, 0.8, 0.6, 1.6, 1.0, 27.0, 52.6, 10.2, 1.0, 4.2]
My code:
days = int(input('Enter customized range: '))
n = np.arange(days)
total = 0
count = 1
max_total = []
while (count + len(n) - 2) <= (len(rain_b) - 1):
    for i in range(count + len(n) - 4, count + len(n) - 2):
        total += rain_c[i]
        #print(rain_b[count+number-1])
        #total = sum([(rain_c(count+number-4)) : (count+number-2)])
    max_total.append(total)
    count += 1
print(max_total)
Since you're already using numpy, you can use np.convolve() with an array of ones with length n:
>>> n = 5
>>> x = np.arange(10)
>>> np.max(np.convolve(x, np.ones(n, dtype=x.dtype), mode="valid"))
35
This has the effect of performing the dot product of np.ones(n) with each n-element "window" of the array x. The sliding_window_view() from numpy.lib.stride_tricks is analogous and helps explain:
>>> x
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> windows = np.lib.stride_tricks.sliding_window_view(x, n)
>>> windows
array([[0, 1, 2, 3, 4],
       [1, 2, 3, 4, 5],
       [2, 3, 4, 5, 6],
       [3, 4, 5, 6, 7],
       [4, 5, 6, 7, 8],
       [5, 6, 7, 8, 9]])
>>> windows.sum(axis=1)
array([10, 15, 20, 25, 30, 35])
>>> np.convolve(x, np.ones(n, dtype=x.dtype), mode="valid")
array([10, 15, 20, 25, 30, 35])
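Applied to the rain data from the question (a sketch; here the window size is hard-coded to 3 so the result can be checked against the other answers):
import numpy as np

rain = np.array([5.8, 1.2, 5.8, 1.0, 6.9, 0.8, 6.0, 18.4, 18.6, 1.0, 0.8, 6.4, 12.2, 18.2, 1.4, 6.8, 41.8, 3.6, 5.2, 5.2, 4.6, 8.6, 16.6, 13.2, 9.6, 41.6, 37.2, 110.0, 30.0, 34.8, 24.6, 7.0, 13.4, 0.5, 37.0, 18.8, 20.4, 0.6, 6.4, 2.4, 1.0, 7.6, 6.6, 4.4, 2.4, 0.6, 3.2, 21.2, 28.2, 3.2, 2.4, 14.4, 0.6, 1.6, 4.4, 0.8, 0.6, 1.6, 1.0, 27.0, 52.6, 10.2, 1.0, 4.2])
days = 3  # the user-supplied window size
windows = np.convolve(rain, np.ones(days), mode="valid")
print(windows.max())  # 188.8 (41.6 + 37.2 + 110.0)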
Try this (lst being your list and n being your range):
print(max(sum(lst[i:i+n+1]) for i in range(len(lst)-n)))
So for example:
>>> lst = [5.8, 1.2, 5.8, 1.0, 6.9, 0.8, 6.0, 18.4]
>>> n = 5
>>> print([sum(lst[i:i+n+1]) for i in range(len(lst)-n)])
[21.5, 21.7, 38.9]
>>> print(max(sum(lst[i:i+n+1]) for i in range(len(lst)-n)))
38.9
I would clean up your loop conditionals to be clearer and more idiomatic.
I believe the problem is that you're not zeroing total out between iterations.
What are rain_b and rain_c? There should only be one input list and one output list.
Why not store n as a plain integer instead of an array? I don't have numpy on my PC, so I just removed that part.
Here's a runnable version of how I would do this:
# input_list is your data list and n the window size (as a plain int)
output_list = []
for x in range(len(input_list) - n + 1):  # one window per starting index
    window_total = 0                      # reset the total for each window
    for y in range(x, x + n):             # sum the n elements of this window
        window_total += input_list[y]
    output_list.append(window_total)
print(max(output_list))
Based on an iterator/array containing the cumulative sum of the numbers, you can get the rolling sum of n values by subtracting the cumulative value that lies n positions behind. This approach has O(N) time complexity, as opposed to computing the sum of every subrange, which is O(N x W) where W is the rolling window size. (Note that the first n-1 outputs correspond to windows shorter than n; with non-negative data like yours this does not affect the maximum.)
Without numpy:
L = [5.8, 1.2, 5.8, 1.0, 6.9, 0.8, 6.0, 18.4, 18.6, 1.0, 0.8, 6.4, 12.2, 18.2, 1.4, 6.8, 41.8, 3.6, 5.2, 5.2, 4.6, 8.6, 16.6, 13.2, 9.6, 41.6, 37.2, 110.0, 30.0, 34.8, 24.6, 7.0, 13.4, 0.5, 37.0, 18.8, 20.4, 0.6, 6.4, 2.4, 1.0, 7.6, 6.6, 4.4, 2.4, 0.6, 3.2, 21.2, 28.2, 3.2, 2.4, 14.4, 0.6, 1.6, 4.4, 0.8, 0.6, 1.6, 1.0, 27.0, 52.6, 10.2, 1.0, 4.2]
n = 3
from itertools import accumulate
S = (a-b for a,b in zip(accumulate(L),accumulate([0]*n+L)))
print(max(S)) # 188.8
Using numpy
import numpy as np
L = np.array([5.8, 1.2, 5.8, 1.0, 6.9, 0.8, 6.0, 18.4, 18.6, 1.0, 0.8, 6.4, 12.2, 18.2, 1.4, 6.8, 41.8, 3.6, 5.2, 5.2, 4.6, 8.6, 16.6, 13.2, 9.6, 41.6, 37.2, 110.0, 30.0, 34.8, 24.6, 7.0, 13.4, 0.5, 37.0, 18.8, 20.4, 0.6, 6.4, 2.4, 1.0, 7.6, 6.6, 4.4, 2.4, 0.6, 3.2, 21.2, 28.2, 3.2, 2.4, 14.4, 0.6, 1.6, 4.4, 0.8, 0.6, 1.6, 1.0, 27.0, 52.6, 10.2, 1.0, 4.2])
n = 3
S = np.cumsum(L)
S[n:] -= S[:-n]
print(np.max(S)) # 188.8

How to combine these two numpy arrays?

How would I combine these two arrays:
x = np.asarray([[1.0, 1.1, 1.2, 1.3], [2.0, 2.1, 2.2, 2.3], [3.0, 3.1, 3.2, 3.3],
                [4.0, 4.1, 4.2, 4.3], [5.0, 5.1, 5.2, 5.3]])
y = np.asarray([[0.1], [0.2], [0.3], [0.4], [0.5]])
Into something like this:
xy = [[0.1, [1.0, 1.1, 1.2, 1.3]], [0.2, [2.0, 2.1, 2.2, 2.3]...
Thank you for the assistance!
Someone suggested I post the code that I have tried, and I realized I had forgotten to:
xy = np.array(list(zip(x, y)))
This is my current solution; however, it is extremely inefficient.
You can use zip to combine
[[a,b] for a,b in zip(y,x)]
Out:
[[array([0.1]), array([1. , 1.1, 1.2, 1.3])],
 [array([0.2]), array([2. , 2.1, 2.2, 2.3])],
 [array([0.3]), array([3. , 3.1, 3.2, 3.3])],
 [array([0.4]), array([4. , 4.1, 4.2, 4.3])],
 [array([0.5]), array([5. , 5.1, 5.2, 5.3])]]
A pure numpy solution will be much faster than a list comprehension for large arrays.
I do have to say your use case makes little sense, as there is no obvious logic in putting these arrays into a single data structure like that, and I believe you should re-check your design.
As @user2357112 supports Monica was subtly implying, this is very likely an XY problem. Check whether this is really what you are trying to solve, and not something else. If you want something else, try asking about that.
I strongly suggest sorting out what you want to do before moving on, as you will otherwise put yourself in a place with bad design.
That aside, here's a solution:
import numpy as np
x = np.asarray([[1.0, 1.1, 1.2, 1.3], [2.0, 2.1, 2.2, 2.3], [3.0, 3.1, 3.2, 3.3],
                [4.0, 4.1, 4.2, 4.3], [5.0, 5.1, 5.2, 5.3]])
y = np.asarray([[0.1], [0.2], [0.3], [0.4], [0.5]])
xy = np.hstack([y, x])
print(xy)
prints
[[0.1 1.  1.1 1.2 1.3]
 [0.2 2.  2.1 2.2 2.3]
 [0.3 3.  3.1 3.2 3.3]
 [0.4 4.  4.1 4.2 4.3]
 [0.5 5.  5.1 5.2 5.3]]
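If you later need the pieces back separately, plain slicing on the combined array above recovers them, since y sits in column 0 and the x values in columns 1 onward:
print(xy[0, 0])   # 0.1  -- the y value of the first row
print(xy[0, 1:])  # [1.  1.1 1.2 1.3]  -- the x values of the first row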

Find median for each element in list

I have a large list of data, between 1000 and 10000 elements. Now I want to filter out some peak values with the help of the median function.
import statistics

# example list with just 10 elements
my_list = [4.5, 4.7, 5.1, 3.9, 9.9, 5.6, 4.3, 0.2, 5.0, 4.6]
# list of medians calculated from 3 elements
my_median_list = []
for i in range(len(my_list)):
    if i == 0:
        my_median_list.append(statistics.median([my_list[0], my_list[1], my_list[2]]))
    elif i == (len(my_list) - 1):
        my_median_list.append(statistics.median([my_list[-1], my_list[-2], my_list[-3]]))
    else:
        my_median_list.append(statistics.median([my_list[i-1], my_list[i], my_list[i+1]]))
print(my_median_list)
# [4.7, 4.7, 4.7, 5.1, 5.6, 5.6, 4.3, 4.3, 4.6, 4.6]
This works so far, but I think it looks ugly and is maybe inefficient. Is there a way with statistics or NumPy to do it faster? Or another solution altogether? Also, I am looking for a solution where I can pass an argument for how many elements the median is calculated from. In my example I always used the median of 3 elements, but with my real data I want to play with that setting and maybe use the median of 10 elements.
You are calculating too many values, since:
my_median_list.append(statistics.median([my_list[i-1], my_list[i], my_list[i+1]]))
and
my_median_list.append(statistics.median([my_list[0], my_list[1], my_list[2]]))
are the same when i == 1. The same duplication happens at the end, so you get one extra value at each end.
It's easier and less error-prone to do this with zip() which will make the three element tuples for you:
from statistics import median
my_list = [4.5, 4.7, 5.1, 3.9, 9.9, 5.6, 4.3, 0.2, 5.0, 4.6]
[median(l) for l in zip(my_list, my_list[1:], my_list[2:])]
# [4.7, 4.7, 5.1, 5.6, 5.6, 4.3, 4.3, 4.6]
For groups of arbitrary size, collections.deque is super handy because you can set a max size: you keep pushing items onto one end and it removes items from the other to maintain the size. Here's a generator example that takes your group size as n:
from statistics import median
from collections import deque
def rolling_median(l, n):
    d = deque(l[0:n], n)
    yield median(d)
    for num in l[n:]:
        d.append(num)
        yield median(d)
my_list = [4.5, 4.7, 5.1, 3.9, 9.9, 5.6, 4.3, 0.2, 5.0, 4.6]
list(rolling_median(my_list, 3))
# [4.7, 4.7, 5.1, 5.6, 5.6, 4.3, 4.3, 4.6]
list(rolling_median(my_list, 5))
# [4.7, 5.1, 5.1, 4.3, 5.0, 4.6]
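Since the question also asks about NumPy: for large lists a vectorized rolling median can be built from sliding_window_view (available in NumPy 1.20+), which gives the same result as the answers above. A minimal sketch:
import numpy as np

my_list = [4.5, 4.7, 5.1, 3.9, 9.9, 5.6, 4.3, 0.2, 5.0, 4.6]
windows = np.lib.stride_tricks.sliding_window_view(np.array(my_list), 3)
print(np.median(windows, axis=1))
# [4.7 4.7 5.1 5.6 5.6 4.3 4.3 4.6]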

Merging arrays and plots

Let's say I have 2 arrays like these:
x1 = [ 1.2, 1.8, 2.3, 4.5, 20.0]
y1 = [10.3, 11.8, 12.3, 11.5, 11.5]
and another two that represent the same function but are sampled at different values
x2 = [ 0.2, 1.8, 5.3, 15.5, 17.2, 18.3, 20.0]
y2 = [10.3, 11.8, 12.3, 12.5, 15.2, 10.3, 10.0]
is there a way with numpy to merge x1 and x2 and, according to the result, also merge the related values of y, without explicitly looping over all the arrays? (for example by taking the average of the y values, or the max for that interval)
I don't know whether numpy offers something for this directly, but here is a solution using pandas instead. (Pandas uses numpy behind the scenes, so there isn't much data conversion.)
import numpy as np
import pandas as pd
x1 = np.asarray([ 1.2, 1.8, 2.3, 4.5, 20.0])
y1 = np.asarray([10.3, 11.8, 12.3, 11.5, 11.5])
x2 = np.asarray([ 0.2, 1.8, 5.3, 15.5, 17.2, 18.3, 20.0])
y2 = np.asarray([10.3, 11.8, 12.3, 12.5, 15.2, 10.3, 10.0])
c1 = pd.DataFrame({'x': x1, 'y': y1})
c2 = pd.DataFrame({'x': x2, 'y': y2})
c = pd.concat([c1, c2]).groupby('x').mean().reset_index()
x = c['x'].values
y = c['y'].values
# Result:
x = array([ 0.2, 1.2, 1.8, 2.3, 4.5, 5.3, 15.5, 17.2, 18.3, 20. ])
y = array([10.3 , 10.3, 11.8, 12.3, 11.5, 12.3, 12.5, 15.2, 10.3, 10.75])
Here I concatenate the two vectors and do a groupby operation over the equal values of 'x'. For these "groups" I then take the mean(). reset_index() will then move the index 'x' back to a column. To get the result back as a numpy array I use .values. (Use to_numpy() for pandas version 0.24.0 and higher.)
How about using numpy.hstack followed by sorting with numpy.sort?
In [101]: x1_arr = np.array(x1)
In [102]: x2_arr = np.array(x2)
In [103]: y1_arr = np.array(y1)
In [104]: y2_arr = np.array(y2)
In [111]: np.sort(np.hstack((x1_arr, x2_arr)))
Out[111]:
array([ 0.2,  1.2,  1.8,  1.8,  2.3,  4.5,  5.3, 15.5, 17.2, 18.3, 20. ,
       20. ])
In [112]: np.sort(np.hstack((y1_arr, y2_arr)))
Out[112]:
array([10. , 10.3, 10.3, 10.3, 11.5, 11.5, 11.8, 11.8, 12.3, 12.3, 12.5,
       15.2])
If you want to get rid of the duplicates, you can apply numpy.unique on top of the above results.
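Note, however, that sorting x and y independently breaks the pairing between each x and its y. To keep the pairs aligned while dropping duplicate x values, one option is np.unique with return_index, which keeps the first y seen for each x; a minimal sketch:
import numpy as np

x = np.hstack(([1.2, 1.8, 2.3, 4.5, 20.0], [0.2, 1.8, 5.3, 15.5, 17.2, 18.3, 20.0]))
y = np.hstack(([10.3, 11.8, 12.3, 11.5, 11.5], [10.3, 11.8, 12.3, 12.5, 15.2, 10.3, 10.0]))
x_unique, idx = np.unique(x, return_index=True)  # sorted unique x, first occurrences
y_matched = y[idx]                               # y values still paired with their x
print(x_unique)   # [ 0.2  1.2  1.8  2.3  4.5  5.3 15.5 17.2 18.3 20. ]
print(y_matched)  # [10.3 10.3 11.8 12.3 11.5 12.3 12.5 15.2 10.3 11.5]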
I'd propose a solution based on the accepted answer of this question:
import numpy as np
import pylab as plt
x1 = [1.2, 1.8, 2.3, 4.5, 20.0]
y1 = [10.3, 11.8, 12.3, 11.5, 11.5]
x2 = [0.2, 1.8, 5.3, 15.5, 17.2, 18.3, 20.0]
y2 = [10.3, 11.8, 12.3, 12.5, 15.2, 10.3, 10.0]
# create a merged and sorted x array
x = np.concatenate((x1, x2))
ids = x.argsort(kind='mergesort')
x = x[ids]
# find unique values
flag = np.ones_like(x, dtype=bool)
np.not_equal(x[1:], x[:-1], out=flag[1:])
# discard duplicated values
x = x[flag]
# merge, sort and select values for y
y = np.concatenate((y1, y2))[ids][flag]
plt.plot(x, y, marker='s', color='b', ls='-.')
plt.xlabel('x')
plt.ylabel('y')
plt.show()
This is the result:
x = [ 0.2 1.2 1.8 2.3 4.5 5.3 15.5 17.2 18.3 20. ]
y = [10.3 10.3 11.8 12.3 11.5 12.3 12.5 15.2 10.3 11.5]
As you can see, this code keeps only one y value when several are available for the same x; this also makes the code faster.
Bonus solution: the following solution is based on a loop and mainly standard Python functions and objects (not numpy), so I know that it is not acceptable; however, it is very concise and elegant and it handles multiple values for y, so I decided to include it here as a plus:
x = sorted(set(x1 + x2))
y = np.nanmean([[d.get(i, np.nan) for i in x]
for d in map(lambda a: dict(zip(*a)), ((x1, y1), (x2, y2)))], axis=0)
In this case, you get the following results:
x = [0.2, 1.2, 1.8, 2.3, 4.5, 5.3, 15.5, 17.2, 18.3, 20.0]
y = [10.3 10.3 11.8 12.3 11.5 12.3 12.5 15.2 10.3 10.75]
