I know how to plot a histogram when individual datapoints are given like:
(33, 45, 54, 33, 21, 29, 15, ...)
by simply using something like matplotlib.pyplot.hist(x, bins=10)
but what if I only have grouped data like:
Marks    Number of students
0-10     8
10-20    12
and so on.
I know that I can mimic a histogram with a bar plot by changing the xticks, but what if I want to do this using only the hist function of matplotlib.pyplot?
Is it possible to do this?
You can build the hist() params manually and use the existing value counts as weights.
Say you have this df:
>>> df = pd.DataFrame({'Marks': ['0-10', '10-20', '20-30', '30-40'], 'Number of students': [8, 12, 24, 26]})
   Marks  Number of students
0   0-10                   8
1  10-20                  12
2  20-30                  24
3  30-40                  26
The bins are all the unique boundary values in Marks:
>>> bins = pd.unique(df.Marks.str.split('-', expand=True).astype(int).values.ravel())
array([ 0, 10, 20, 30, 40])
Choose one x value per bin, e.g. the left edge to make it easy:
>>> x = bins[:-1]
array([ 0, 10, 20, 30])
Use the existing value counts (Number of students) as weights:
>>> weights = df['Number of students'].values
array([ 8, 12, 24, 26])
Then plug these into hist():
>>> plt.hist(x=x, bins=bins, weights=weights)
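Putting the steps together, a minimal end-to-end sketch (assuming the usual import pandas as pd and import matplotlib.pyplot as plt aliases):
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'Marks': ['0-10', '10-20', '20-30', '30-40'],
                   'Number of students': [8, 12, 24, 26]})
# bin edges from the group labels, left edges as x, counts as weights
bins = pd.unique(df.Marks.str.split('-', expand=True).astype(int).values.ravel())
plt.hist(x=bins[:-1], bins=bins, weights=df['Number of students'].values)
plt.show()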
One possibility is to “ungroup” the data yourself.
For example, for the 8 students with a mark between 0 and 10, you can generate 8 data points with the value 5 (the midpoint of the interval). For the 12 students with a mark between 10 and 20, you can generate 12 data points with the value 15.
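For illustration, a minimal sketch of this ungrouping (assuming numpy, and using the bin midpoints 5, 15, 25, 35 for the example data above):
import numpy as np
import matplotlib.pyplot as plt

midpoints = [5, 15, 25, 35]                 # one representative value per bin
counts = [8, 12, 24, 26]                    # number of students per bin
ungrouped = np.repeat(midpoints, counts)    # 8 fives, 12 fifteens, ...
plt.hist(ungrouped, bins=[0, 10, 20, 30, 40])
plt.show()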
However, the “ungrouped” data will only be an approximation of the real data. Thus, it is probably better to just use a matplotlib.pyplot.bar plot.
I have a single list as this:
prices = [10, 10, 10, 40, 40, 50]
and I would like to calculate the weighted average from this list, so the weight of the number 10 would be 3, the weight of 40 would be 2 and the weight of 50 would be 1. How do I do this without having 2 separate lists?
You need some form of lookup for your weights; a dictionary keyed by value works well, and since the weights are just the value counts, you can even build it from the list itself.
Compute it using a dictionary of weights:
l = [10, 10, 10, 40, 40, 50]
w = {10: 3, 40: 2, 50: 1}
avg_w = sum(v * w[v] for v in w) / sum(w.values())  # (10*3 + 40*2 + 50*1) / 6
print(avg_w)
or derive the weights from the list itself, so no second list is needed:
from collections import Counter
w = Counter(l)  # Counter({10: 3, 40: 2, 50: 1})
avg_w = sum(v * c for v, c in w.items()) / sum(w.values())
print(avg_w)
Output:
26.666666666666668
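If numpy is available, np.average does the same bookkeeping; a small sketch (the plain mean works too, because the repeats in the list already encode the weights):
import numpy as np

prices = [10, 10, 10, 40, 40, 50]
print(np.mean(prices))                               # 26.666..., repeats act as weights
print(np.average([10, 40, 50], weights=[3, 2, 1]))   # same value from explicit weights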
I would like to make a bar chart in Python using matplotlib.pyplot. The data consists of an index, which is a list of datetimes, and a number corresponding to each datetime. I have several samples that belong to the same day. However, when making the bar chart, it only shows the first sample for each datetime instead of all of them. How can I make a bar chart showing every entry?
The index has the following structure:
ind = [datetime.datetime(2017, 3, 1, 0, 0), datetime.datetime(2017, 3, 1, 0, 0),
datetime.datetime(2017, 3, 15, 0, 0), datetime.datetime(2017, 3, 15, 0, 0)]
and the values are just integers:
values = [10, 20, 30, 40]
So when plotting, it only shows the bars 2017-3-1 with value 10 and 2017-3-15 with value 30. How can I make it show all of them?
You can group by the dates, sum the values and then plot the bar chart from the resulting dataframe:
df = pd.DataFrame(data=values, index=ind)
df = df.groupby(df.index).sum()
df.plot(kind='bar')
If what you want is all values to appear in the plot, regardless of the date, you can simply use:
df.plot(kind='bar')
Entries with the same date will then be plotted independently.
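Putting both variants together in a minimal runnable sketch (using the data from the question):
import datetime
import pandas as pd
import matplotlib.pyplot as plt

ind = [datetime.datetime(2017, 3, 1), datetime.datetime(2017, 3, 1),
       datetime.datetime(2017, 3, 15), datetime.datetime(2017, 3, 15)]
values = [10, 20, 30, 40]

df = pd.DataFrame(data=values, index=ind)
df.groupby(df.index).sum().plot(kind='bar')   # one bar per date: 30 and 70
df.plot(kind='bar')                           # one bar per entry, duplicates kept
plt.show()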
I am working with a column whose values should change only slightly from row to row. The values are physical measurements, and because of environmental factors a measurement can be wrong, showing a very large jump between consecutive samples. The rate of change is an input to the problem; it can be adjusted to match the precision needs of the outlier detection.
The detection method could either (a) compute the mean of the values seen so far and mark as outliers the values that deviate from it by more than the given rate of change, or (b) look at the changes between consecutive rows, mark the index where the jump exceeds the rate of change, and mark the index where the values return to within the accepted rate of change relative to the last value before the outliers. The first approach is harder, since the mean should be computed only from correct values, i.e. values marked as outliers must not enter the calculation of the mean.
The correct solution should return the list of indices of the outliers; these would then be used to set the corresponding values to e.g. NaN, or to fill them in with an interpolation method.
Example
df = pd.DataFrame({'small_changing': [5.14, 5.18, 5.22, 5.18, 5.20, 5.17, 5.25, 5.55, 5.62, 5.78, 6.21, 6.13, 5.71, 5.35, 5.29, 5.24, 5.16, 5.18, 5.20, 5.15, 5.17, 5.00, 4.96, 4.88, 4.71, 4.65, 4.73, 4.79, 4.89, 4.92, 5.05, 5.11, 5.14, 5.17, 5.22, 5.24, 5.18, 5.20]})
Assuming a rate of change of 0.15 and the second detection approach (the one based on row-to-row differences), there are two outlier groups to detect.
The first group is the index range [7, 12]: the difference between rows 6 and 7 is 0.3, which exceeds the 0.15 limit, and the difference between rows 6 and 13 is 0.1, row 13 being the first row back within the 0.15 limit.
The second group is the index range [21, 29]: the difference between rows 20 and 21 is 0.17, which exceeds the 0.15 limit, and the difference between rows 20 and 30 is 0.12, row 30 being the first row back within the 0.15 limit.
Result for this example: [7, 8, 9, 10, 11, 12, 21, 22, 23, 24, 25, 26, 27, 28, 29]
I hope it will be helpful.
I think it isn't pythonic, but it works:
def outlier_detection(points, limit):
    outliers_index = list()
    k = 0  # number of consecutive outliers seen so far
    for i in range(0, len(points) - 1):
        # compare the next point against the last non-outlier point (i - k)
        if abs(points[i - k] - points[i + 1]) >= limit:
            k += 1
            outliers_index.append(i + 1)
        else:
            k = 0
    return outliers_index
outlier_detection(df['small_changing'].values, 0.15)
OUT: [7, 8, 9, 10, 11, 12, 21, 22, 23, 24, 25, 26, 27, 28, 29]
This might save time on sparsely distributed outliers on a big dataset -
def df_outlier(df, threshold=0.15):
    column = df.columns[0]
    df["outlier"] = False
    # absolute difference to the next row, stored on the row where the jump starts
    df_difference = df.copy()
    df_difference["difference"] = abs(df[column] - df[column].shift(1)).shift(-1)
    df_difference = df_difference.loc[df_difference["difference"] > threshold]
    for index in df_difference.index:
        row = df.loc[index]
        # skip anchor rows that were already flagged as outliers themselves
        if not row["outlier"]:
            # flag everything after the jump until the values come back
            # within the threshold of the last good value
            df_check = df[index+1:].copy()
            df_check["a_difference"] = abs(df_check[column] - row[column])
            df_check.loc[df_check["a_difference"] > threshold, "outlier"] = True
            first_ok = df_check["outlier"].ne(True).idxmax()
            df.loc[(df.index >= df_check.index[0]) & (df.index < first_ok), "outlier"] = True
    return list(df.loc[df["outlier"] == True].index)
I am using this.
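For the example DataFrame above, a quick check (note that the function adds an 'outlier' column to the frame it receives, so pass a copy if you want df unchanged):
print(df_outlier(df.copy(), threshold=0.15))
# [7, 8, 9, 10, 11, 12, 21, 22, 23, 24, 25, 26, 27, 28, 29]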
I'm new to numpy and I have a 2D array of objects that I need to bin into a smaller matrix and then get a count of the number of objects in each bin to make a heatmap. I followed the answer on this thread to create the bins and do the counts for a simple array but I'm not sure how to extend it to 2 dimensions. Here's what I have so far:
data_matrix = numpy.ndarray((500,500),dtype=float)
# fill array with values.
bins = numpy.linspace(0,50,50)
digitized = numpy.digitize(data_matrix, bins)
binned_data = numpy.ndarray((50,50))
for i in range(0,len(bins)):
    for j in range(0,len(bins)):
        k = len(data_matrix[digitized == i:digitized == j])  # <- does not work
        binned_data[i:j] = k
P.S. the [digitized == i] notation on an array will return an array of binary values. I cannot find documentation on this notation anywhere. A link would be appreciated.
You can reshape the array to a four dimensional array that reflects the desired block structure, and then sum along both axes within each block. Example:
>>> a = np.arange(24).reshape(4, 6)
>>> a
array([[ 0, 1, 2, 3, 4, 5],
[ 6, 7, 8, 9, 10, 11],
[12, 13, 14, 15, 16, 17],
[18, 19, 20, 21, 22, 23]])
>>> a.reshape(2, 2, 2, 3).sum(3).sum(1)
array([[ 24, 42],
[ 96, 114]])
If a has shape (m, n), the reshape should have the form
a.reshape(m_bins, m // m_bins, n_bins, n // n_bins)
followed by summing over axes 1 and 3.
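Applied to the 500x500 array from the question (a sketch, assuming 10x10 blocks so the result is 50x50, and a numpy recent enough to accept a tuple for axis):
>>> binned_data = data_matrix.reshape(50, 10, 50, 10).sum(axis=(1, 3))
>>> binned_data.shape
(50, 50)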
At first I was also going to suggest that you use np.histogram2d rather than reinventing the wheel, but then I realized that it would be overkill to use that and would need some hacking still.
If I understand correctly, you just want to sum over submatrices of your input. That's pretty easy to brute force: loop over the entries of the output matrix and sum up the corresponding sub-block of the input:
import numpy as np
def submatsum(data, n, m):
    # return a matrix of shape (n, m)
    bs = data.shape[0] // n, data.shape[1] // m  # block size summed over
    return np.reshape(
        np.array([np.sum(data[k1*bs[0]:(k1+1)*bs[0], k2*bs[1]:(k2+1)*bs[1]])
                  for k1 in range(n) for k2 in range(m)]),
        (n, m))
# set up dummy data
N,M = 4,6
data_matrix = np.reshape(np.arange(N*M),(N,M))
# set up size of 2x3-reduced matrix, assume congruity
n,m = N//2,M//3
reduced_matrix = submatsum(data_matrix,n,m)
# check output
print(data_matrix)
print(reduced_matrix)
This prints
print(data_matrix)
[[ 0 1 2 3 4 5]
[ 6 7 8 9 10 11]
[12 13 14 15 16 17]
[18 19 20 21 22 23]]
print(reduced_matrix)
[[ 24 42]
[ 96 114]]
which is indeed the result for summing up submatrices of shape (2,3).
Note that I'm using // for integer division to make sure it's python3-compatible, but in case of python2 you can just use / for division (due to the numbers involved being integers).
Another solution is to have a look at the binArray function in the comments here:
Binning a numpy array
To use your example:
data_matrix = numpy.ndarray((500,500),dtype=float)
binned_data = binArray(data_matrix, 0, 10, 10, np.sum)
binned_data = binArray(binned_data, 1, 10, 10, np.sum)
The result sums each 10x10 square of data_matrix (500x500) into a single value, giving binned_data of size 50x50.
Hope this helps!
I am generating a heat map for my data.
Everything works fine, but I have a little problem. My data (numbers) range from 0 to 10,000.
0 means nothing (no data), and at the moment the fields with 0 just take the lowest color of my color scale. My problem is how to give the fields with 0 a completely different color (e.g. black or white).
My code (snippet) looks like this:
matplotlib.pyplot.imshow(results, interpolation='none')
matplotlib.pyplot.colorbar();
matplotlib.pyplot.xticks([0, 1, 2, 3, 4, 5, 6, 7, 8], [10, 15, 20, 25, 30, 35, 40, 45, 50]);
matplotlib.pyplot.xlabel('Population')
matplotlib.pyplot.yticks([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10], [1, 2, 3, 4, 5, 10, 15, 20, 25, 30, 'serial']);
matplotlib.pyplot.ylabel('Communication Step');
axis.xaxis.tick_top();
matplotlib.pyplot.savefig('./results_' + optimisationProblem + '_dim' + str(numberOfDimensions) + '_' + statisticType + '.png');
matplotlib.pyplot.close();
If you are not interested in a smooth transition between the values 0 and 0.0001, you can just set every value that equals 0 to NaN. This will result in a white color whereas 0.0001 will still be deep blue-ish.
The following code includes an example with randomly generated data. I select a single element of the array and set it to NaN, which results in a white cell. I also included a (commented-out) line that sets every data point equal to 0 to NaN.
import numpy
import matplotlib.pyplot as plt
#Random data
data = numpy.random.random((10, 10))
#Set all data points equal to zero to NaN
#data[data == 0.] = float("NaN")
#Set single data value to nan
data[2][2] = float("NaN")
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.imshow(data, interpolation = "nearest")
plt.show()
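If you would rather have the NaN cells drawn in black instead of white, one option (a sketch, assuming a reasonably recent matplotlib, reusing data and ax from above) is to set the colormap's "bad" color, which is what imshow uses for NaN values:
import copy
cmap = copy.copy(plt.cm.viridis)    # copy so the shared colormap is not modified
cmap.set_bad(color="black")         # NaN cells are drawn in black
ax.imshow(data, interpolation="nearest", cmap=cmap)
plt.show()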