I have a set of lists (about 100) of the form [6, 17, 5, 1, 4, 7, 14, 19, 0, 10], and I want to get a single box plot built from the averages of the box-plot statistics (median, max, min, Q1, Q3, outliers) of all the lists.
For example, if I have 2 lists
l1 = [6, 17, 5, 1, 4, 7, 14, 19, 0, 10]
l2 = [4, 12, 3, 5, 16, 0, 14, 7, 8, 15]
I can get averages of max, median, and min of the lists as follows:
import numpy as np

maxs = np.array([])
mins = np.array([])
medians = np.array([])
for l in [l1, l2]:
    medians = np.append(medians, np.median(l))
    maxs = np.append(maxs, np.max(l))
    mins = np.append(mins, np.min(l))
averMax = np.mean(maxs)
averMin = np.mean(mins)
averMedian = np.mean(medians)
I would do the same for the other box-plot statistics, such as average Q1 and average Q3. I then need to use this information (averMax, averMin, etc.) to plot just one single box plot (not multiple box plots in one graph).
I know from Draw Box-Plot with matplotlib that you don't have to calculate the values for a normal box plot. You just need to specify the data as a variable.
Is it possible to do the same for my case instead of manually calculating the averages of the values of all the lists?
df.describe() will get the quartiles, so you can make a graph based on them. I customized the calculated numbers with the help of this answer and the example graph from the official reference.
import pandas as pd
import numpy as np
l1 = [6, 17, 5, 1, 4, 7, 14, 19, 0, 10]
l2 = [4, 12, 3, 5, 16, 0, 14, 7, 8, 15]
df = pd.DataFrame({'l1':l1, 'l2':l2}, index=np.arange(len(l1)))
df.describe()
l1 l2
count 10.000000 10.000000
mean 8.300000 8.400000
std 6.532823 5.561774
min 0.000000 0.000000
25% 4.250000 4.250000
50% 6.500000 7.500000
75% 13.000000 13.500000
max 19.000000 16.000000
import matplotlib.pyplot as plt
# rows 4, 5, 6 of the describe() output are the 25%, 50% and 75% quantiles
s1 = df.describe()['l1'].values
s2 = df.describe()['l2'].values
# four fake data points per box: [low flier, Q1, median, high flier],
# with the fliers placed 1.5 * IQR below Q1 and above the median
x1 = [s1[4] - 1.5*(s1[6] - s1[4]), s1[4], s1[5], s1[5] + 1.5*(s1[6] - s1[4])]
x2 = [s2[4] - 1.5*(s2[6] - s2[4]), s2[4], s2[5], s2[5] + 1.5*(s2[6] - s2[4])]
fig = plt.figure(figsize=(8,6))
plt.boxplot([x1, x2], 0, 'rs', 1)
plt.xticks([1, 2], ['x1', 'x2'])
plt.xlabel('measurement x')
t = plt.title('Box plot')
plt.show()
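Matplotlib can also draw a box directly from precomputed statistics, which avoids faking data points: Axes.bxp takes a list of stat dicts, and matplotlib.cbook.boxplot_stats computes such a dict for each list. A minimal sketch, where lists standing in for the ~100 real lists is an assumption:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cbook

lists = [l1, l2]  # placeholder for the ~100 real lists
# one dict of box-plot statistics (med, q1, q3, whislo, whishi, ...) per list
stats = [cbook.boxplot_stats(l)[0] for l in lists]
# average each statistic over all the lists
avg = {k: np.mean([s[k] for s in stats])
       for k in ('med', 'q1', 'q3', 'whislo', 'whishi')}
avg['fliers'] = []  # outliers of an "averaged" box are not well defined
fig, ax = plt.subplots()
ax.bxp([avg])  # a single box drawn from the averaged statistics
plt.show()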
I am a manufacturing engineer, very new to Python and Matplotlib. Currently, I am trying to plot a scatter time graph where, for every single record, I have the data (read from a sensor) plus upper and lower limits for that data; the tool stops if the data is not between them.
So for a simple set of data like this:
time = [1, 2, 3, 7, 8, 9, 10]
data = [5, 6, 5, 5, 6, 7, 8]
lower_limit = [4, 4, 5, 5, 5, 5, 5]
upper_limit = [6, 6, 6, 7, 7, 7, 7]
When the tool is not working, nothing is recorded, hence the gap between 3 and 7 in the time records.
The desired graph would look like this:
A few rules that I am trying to stick to:
All three series (data, upper_limit, and lower_limit) must be drawn as scatter points, not lines, with the x-axis (time) shared among them. - required.
A green highlight that fills between the upper and lower limits, considering only the two points with the same time for each highlight. - highly recommended.
(I tried matplotlib's fill_between, but it creates a polygon between the trend lines rather than straight vertical lines between matching pairs of L.L. & U.L. dots, so it won't be accurate, and it fills up the gap between times 3 and 7, which is not desired. I also tried matplotlib's bar for the limits alongside the scatter plot of the data, but I was not able to set a minimum = lower_limit for the bars.)
When the value of data is not between (or equal to) the limits, the corresponding dot should appear in red rather than the original color. - highly recommended.
So, with all of that in mind, and with thousands of records per day, a regular graph for a 24-hour time span should look like the following (notice the gap due to a possible lack of records in a time span, as well as the vertical green lines for the limits):
Thanks for your time and help!
This is a version using numpy's masking and matplotlib's errorbar:
import matplotlib.pyplot as plt
import numpy as np

time = np.array([0, 1, 2, 3, 7, 8, 9, 10])
data = np.array([2, 5, 6, 5, 5, 6, 7, 8])
lower = np.array([4, 4, 4, 5, 5, 5, 5, 5])
upper = np.array([6, 6, 6, 6, 7, 7, 7, 7])
nn = len(lower)
delta = upper - lower
### creating masks
inside = ((upper - data) >= 0) & ((data - lower) >= 0)
outside = np.logical_not(inside)
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
# an asymmetric yerr of (0, delta) draws a vertical bar from lower to upper
# at each time stamp; ls='' suppresses the connecting line
ax.errorbar(time, lower, yerr=(nn*[0], delta), ls='', ecolor="#00C023")
ax.scatter(time[inside], data[inside], c='k')    # in-range points in black
ax.scatter(time[outside], data[outside], c='r')  # out-of-range points in red
plt.show()
Something like this should work, plotting each component separately:
import pandas as pd
import matplotlib.pyplot as plt

time = [1, 2, 3, 7, 8, 9, 10]
data = [5, 6, 5, 5, 6, 7, 8]
lower_limit = [4, 4, 5, 5, 5, 5, 5]
upper_limit = [6, 6, 6, 7, 7, 7, 7]
# put the data into a dataframe and flag which points are out of range
# (not between the lower and upper limits)
df = pd.DataFrame({'time': time, 'data': data, 'll': lower_limit, 'ul': upper_limit})
df.loc[:, 'in_range'] = 0
df.loc[(df['data'] >= df['ll']) & (df['data'] <= df['ul']), 'in_range'] = 1
# make the plot
fig, ax = plt.subplots()
# plot lower-limit and upper-limit points
plt.scatter(df['time'], df['ll'], c='green')
plt.scatter(df['time'], df['ul'], c='green')
# plot data points in range (black)
plt.scatter(df.loc[df['in_range'] == 1, 'time'], df.loc[df['in_range'] == 1, 'data'], c='black')
# plot data points out of range (red)
plt.scatter(df.loc[df['in_range'] == 0, 'time'], df.loc[df['in_range'] == 0, 'data'], c='red')
# plot vertical lines between each lower/upper limit pair
plt.plot((df['time'], df['time']), (df['ll'], df['ul']), c='lightgreen')
plt.show()
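As a side note, matplotlib's vlines draws the same vertical segments directly; a sketch equivalent to the plt.plot call above:
plt.vlines(df['time'], df['ll'], df['ul'], colors='lightgreen')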
Suppose I have two Pandas dataframes, df1 and df2, each with two columns, hour and value. Some of the hours are missing in the two dataframes.
import pandas as pd
import matplotlib.pyplot as plt
data1 = [
('hour', [0, 1, 2, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]),
('value', [12.044324085714285, 8.284134466666668, 9.663580800000002,
18.64010145714286, 15.817029916666664, 13.242198508695651,
10.157177889201877, 9.107153674476985, 10.01193336545455,
16.03340384878049, 16.037368506666674, 16.036160044827593,
15.061596637500001, 15.62831551764706, 16.146087032608694,
16.696574719512192, 16.02603831463415, 17.07469460470588,
14.69635686969697, 16.528905725581396, 12.910250661111112,
13.875522341935481, 12.402971938461539])
]
df1 = pd.DataFrame(dict(data1))  # DataFrame.from_items was removed from pandas; dict works here
df1.head()
# hour value
# 0 0 12.044324
# 1 1 8.284134
# 2 2 9.663581
# 3 4 18.640101
# 4 5 15.817030
data2 = [
('hour', [0, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
15, 16, 17, 18, 19, 20, 21, 22, 23]),
('value', [27.2011904, 31.145661266666668, 27.735570511111113,
18.824297487999996, 17.861847334275623, 25.3033003254902,
22.855934450000003, 31.160574200000003, 29.080220000000004,
30.987719745454548, 26.431310216666663, 30.292641480000004,
27.852885586666666, 30.682682472727276, 29.43023531764706,
24.621718962500005, 33.92878745, 26.873105866666666,
34.06412232, 32.696606333333335])
]
df2 = pd.DataFrame(dict(data2))
df2.head()
# hour value
# 0 0 27.201190
# 1 5 31.145661
# 2 6 27.735571
# 3 7 18.824297
# 4 8 17.861847
I would like to join them together using the key of hour and then produce a side-by-side barplot of the data. The x-axis would be hour, and the y-axis would be value.
I can create a bar plot of one dataframe at a time.
_ = plt.bar(df1.hour.tolist(), df1.value.tolist())
_ = plt.xticks(df1.hour, rotation=0)
_ = plt.grid()
_ = plt.show()
_ = plt.bar(df2.hour.tolist(), df2.value.tolist())
_ = plt.xticks(df2.hour, rotation=0)
_ = plt.grid()
_ = plt.show()
However, what I want is to create a bar chart with them side by side, like this:
Thank you for any help.
You can do it all in one line, if you wish, making use of the pandas plotting wrapper and the fact that plotting a dataframe with several columns groups the bars. Given the definitions of df1 and df2 from the question, you can call
pd.merge(df1,df2, how='outer', on=['hour']).set_index("hour").plot.bar()
plt.show()
resulting in
Note that this leaves out hour 3 in this case, as it is not part of the hour column of either dataframe. To include it, reindex over the full range of hours:
pd.merge(df1,df2, how='outer', on=['hour']).set_index("hour").reindex(range(24)).plot.bar()
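One caveat: both dataframes name their column value, so the merged columns show up in the legend as value_x and value_y. The suffixes argument of pd.merge gives them clearer names; a small variation (the suffix labels are just an illustration):
pd.merge(df1, df2, how='outer', on=['hour'], suffixes=('_df1', '_df2')).set_index('hour').reindex(range(24)).plot.bar()
plt.show()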
First reindex the dataframes, then create the two bar plots from the data. The positioning of each rectangle is given by (x - width/2, x + width/2, bottom, bottom + height).
import numpy as np
import matplotlib.pyplot as plt
index = np.arange(0, 24)
bar_width = 0.3
df1 = df1.set_index('hour').reindex(index)
df2 = df2.set_index('hour').reindex(index)
plt.figure(figsize=(10, 5))
plt.bar(index - bar_width / 2, df1.value, bar_width, label='df1')
plt.bar(index + bar_width / 2, df2.value, bar_width, label='df2')
plt.xticks(index)
plt.legend()
plt.tight_layout()
plt.show()
I want to generate "category intervals" from categories.
For example, suppose I have the following:
>>> df['start'].describe()
count 259431.000000
mean 10.435858
std 5.504730
min 0.000000
25% 6.000000
50% 11.000000
75% 15.000000
max 20.000000
Name: start, dtype: float64
and the unique values of my column are:
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20], dtype=int8)
but I want to use the following list of intervals:
>>> intervals
[[0, 2.2222222222222223],
[2.2222222222222223, 4.4444444444444446],
[4.4444444444444446, 6.666666666666667],
[6.666666666666667, 8.8888888888888893],
[8.8888888888888893, 11.111111111111111],
[11.111111111111111, 13.333333333333332],
[13.333333333333332, 15.555555555555554],
[15.555555555555554, 17.777777777777775],
[17.777777777777775, 20]]
I want to change my column 'start' into values x, where x is the index of the interval that contains df['start'] (so x will vary from 0 to 8 in my case).
Is there a more or less simple way to do it using pandas/numpy?
Thanks a lot in advance for the help.
Regards.
You can use np.digitize:
import numpy as np
import pandas as pd
df = pd.DataFrame(dict(start=np.random.randint(0, 21, 10000)))  # random_integers is deprecated; randint's upper bound is exclusive
# the left-hand edges of each "interval"
intervals = np.linspace(0, 20, 9, endpoint=False)
print(intervals)
# [ 0. 2.22222222 4.44444444 6.66666667 8.88888889
# 11.11111111 13.33333333 15.55555556 17.77777778]
df['start_idx'] = np.digitize(df['start'], intervals) - 1
print(df.head())
# start start_idx
# 0 8 3
# 1 16 7
# 2 0 0
# 3 7 3
# 4 0 0
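Since the question mentions pandas as well: pd.cut does the same binning and, with labels=False, returns the interval index directly. A sketch using the same nine equal-width intervals:
import numpy as np
import pandas as pd

edges = np.linspace(0, 20, 10)  # 10 edges define the 9 intervals
df['start_idx'] = pd.cut(df['start'], bins=edges, labels=False, include_lowest=True)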
I am generating a heat map for my data.
Everything works fine, but I have a little problem. My data (numbers) range from 0 to 10,000.
0 means nothing (no data), and at the moment fields with 0 just take the lowest color of my color scale. My problem is how to give the fields with 0 a totally different color (e.g. black or white).
See the picture to better understand what I mean:
My code (snippet) looks like this:
import matplotlib.pyplot

matplotlib.pyplot.imshow(results, interpolation='none')
matplotlib.pyplot.colorbar()
matplotlib.pyplot.xticks([0, 1, 2, 3, 4, 5, 6, 7, 8], [10, 15, 20, 25, 30, 35, 40, 45, 50])
matplotlib.pyplot.xlabel('Population')
matplotlib.pyplot.yticks([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10], [1, 2, 3, 4, 5, 10, 15, 20, 25, 30, 'serial'])
matplotlib.pyplot.ylabel('Communication Step')
axis = matplotlib.pyplot.gca()  # 'axis' was defined elsewhere in the original snippet
axis.xaxis.tick_top()
matplotlib.pyplot.savefig('./results_' + optimisationProblem + '_dim' + str(numberOfDimensions) + '_' + statisticType + '.png')
matplotlib.pyplot.close()
If you are not interested in a smooth transition between the values 0 and 0.0001, you can just set every value that equals 0 to NaN. This will result in a white color whereas 0.0001 will still be deep blue-ish.
The following code includes an example. The data is generated randomly; I select a single element of the array and set it to NaN, which results in the color white. I also include a commented-out line that sets every data point equal to 0 to NaN.
import numpy
import matplotlib.pyplot as plt

# Random data
data = numpy.random.random((10, 10))
# Set all data points that equal zero to NaN
#data[data == 0.] = numpy.nan
# Set a single data value to NaN
data[2, 2] = numpy.nan
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.imshow(data, interpolation="nearest")
plt.show()
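The NaN cells come out white because matplotlib simply leaves them undrawn. If you would rather have them in a specific color such as black, you can additionally set the colormap's "bad" color; a sketch, assuming matplotlib 3.4+ for Colormap.copy():
cmap = plt.cm.viridis.copy()  # copy so the shared colormap is not mutated
cmap.set_bad(color='black')   # the color used for NaN / masked cells
ax.imshow(data, interpolation="nearest", cmap=cmap)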
I am trying to shrink a numpy array by averaging over blocks of its elements, e.g. taking the average of each 5x5 sub-array in a 100x100 array to create a 20x20 array. As I have a huge amount of data to manipulate, is there an efficient way to do this?
I have only tried this on smaller arrays, so test it with yours:
import numpy as np
nbig = 100
nsmall = 20
big = np.arange(nbig * nbig).reshape([nbig, nbig]) # 100x100
# reshape to (20, 5, 20, 5), then average over both block axes (3 and 1)
small = big.reshape([nsmall, nbig//nsmall, nsmall, nbig//nsmall]).mean(3).mean(1)
An example with 6x6 -> 3x3:
nbig = 6
nsmall = 3
big = np.arange(36).reshape([6,6])
array([[ 0, 1, 2, 3, 4, 5],
[ 6, 7, 8, 9, 10, 11],
[12, 13, 14, 15, 16, 17],
[18, 19, 20, 21, 22, 23],
[24, 25, 26, 27, 28, 29],
[30, 31, 32, 33, 34, 35]])
small = big.reshape([nsmall, nbig//nsmall, nsmall, nbig//nsmall]).mean(3).mean(1)
array([[ 3.5, 5.5, 7.5],
[ 15.5, 17.5, 19.5],
[ 27.5, 29.5, 31.5]])
This is pretty straightforward, although I feel like it could be faster:
from __future__ import division
import numpy as np
Norig = 100
Ndown = 20
step = Norig//Ndown
assert step == Norig/Ndown # ensure Ndown is an integer factor of Norig
x = np.arange(Norig*Norig).reshape((Norig, Norig)) # for testing
y = np.empty((Ndown, Ndown)) # for testing
for yr, xr in enumerate(np.arange(0, Norig, step)):
    for yc, xc in enumerate(np.arange(0, Norig, step)):
        y[yr, yc] = np.mean(x[xr:xr+step, xc:xc+step])
You might also find scipy.signal.decimate interesting. It applies a more sophisticated low-pass filter than simple averaging before downsampling the data, although you'd have to decimate one axis, then the other.
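A sketch of that two-pass use of decimate (the cast to float is for the IIR low-pass filter; because of that filter, the result is smoother than a plain block mean):
from scipy.signal import decimate
# downsample by a factor of 5 along each axis in turn: 100x100 -> 20x20
y = decimate(decimate(x.astype(float), 5, axis=0), 5, axis=1)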
Average a 2D array over subarrays of size NxN:
from numpy import average, split

height, width = data.shape
data = average(split(average(split(data, width // N, axis=1), axis=-1), height // N, axis=1), axis=-1)
Note that eumiro's approach does not work for masked arrays, as .mean(3).mean(1) assumes that each mean along axis 3 was computed from the same number of values. If there are masked elements in your array, this assumption no longer holds. In that case, you have to keep track of the number of values used to compute .mean(3) and replace .mean(1) by a weighted mean, with the weights being the normalized number of values used to compute .mean(3).
Here is an example:
import numpy as np
def gridbox_mean_masked(data, Nbig, Nsmall):
    # Reshape data
    rshp = data.reshape([Nsmall, Nbig//Nsmall, Nsmall, Nbig//Nsmall])
    # Compute the mean along axis 3 and remember the number of values each
    # mean was computed from
    mean3 = rshp.mean(3)
    count3 = rshp.count(3)
    # Compute the weighted mean along axis 1
    mean1 = (count3*mean3).sum(1)/count3.sum(1)
    return mean1

# Define test data
big = np.ma.array([[1, 1, 2],
                   [1, 1, 1],
                   [1, 1, 1]])
big.mask = [[0, 0, 0],
            [0, 0, 1],
            [0, 0, 0]]
Nbig = 3
Nsmall = 1

# Compute the gridbox mean
print(gridbox_mean_masked(big, Nbig, Nsmall))
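For this test array, the weighted version gives (1+1+2+1+1+1+1+1)/8 = 1.125, whereas blindly applying .mean(3).mean(1) would average the per-row means (4/3, 1, 1) to about 1.111 — exactly the bias the weighting corrects.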