I have a list (intensityList) with 1354 numbers. They range from 25941.9 to 1639980000.0, so there is a very big spread, and I expect that most points are closer to 1639980000.0 than to 25941.9. When I make a histogram out of this with
plt.hist(intensityList,20)
plt.title('Amount of features per intensity')
plt.xlabel('intensity')
plt.ylabel('frequency')
plt.show()
it puts almost all data in one bar and messes up the x-axis. It works with a test set (random normal numbers) so I'm pretty sure it has to do with the broad range. How can I deal with a dataset like this?
edit:
The data is likely very skewed; the standard deviation is much larger than the mean (mean = 6501401.54114, standard deviation = 49423145.7749).
Quite an obvious answer in hindsight; it shows that it helps to write the question down. I took the log of the values and it's all dandy now.
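For anyone else hitting this, a minimal sketch of that log transform (assuming, as here, that all values are positive):
import numpy as np
import matplotlib.pyplot as plt

plt.hist(np.log10(intensityList), 20)   # compress the huge range onto a log scale
plt.title('Amount of features per intensity')
plt.xlabel('log10(intensity)')
plt.ylabel('frequency')
plt.show()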
You can increase the number of bins, or keep only the values in a range you find interesting. Note that this kind of boolean indexing needs a NumPy array rather than a plain list:
import numpy as np
intensityList = np.asarray(intensityList)   # boolean indexing needs an array, not a list
intensityList = intensityList[intensityList < maxVal]
intensityList = intensityList[intensityList > minVal]
I am a beginner in Python and I am making separate histograms of travel distance per departure hour. The data I'm using has about 2500 rows; Distance is float64 and Departuretime is str. For further calculations I'd like to have the value of each bin in a histogram, for all histograms.
Up until now, I have the following:
df['Distance'].hist(by=df['Departuretime'], color = 'red',
edgecolor = 'black',figsize=(15,15),sharex=True,density=True)
This creates, in my case, a figure with 21 small histograms, one per departure hour.
Of all these histograms I want to know the y-axis value of each bar, preferably in a dataframe with the distance binning as rows and the hours as columns.
With a single histogram, I'd put counts, bins, bars = in front of the line and the variable counts would contain the data I was looking for; however, in this case that does not work.
Ideally I'd like a dataframe or list of some sort for each histogram, containing the density values of the bins. I hope someone can help me out! Big thanks in advance!
First of all, note that the bins used in the different histograms you are generating don't have the same edges (you can see this because, with sharex=True, the resulting bars don't have the same width). In all cases you get 10 bins (the default), but they are not the same 10 bins.
This makes it impossible to combine them all in a single table in any meaningful way. You could provide a fixed list of bin edges as the bins parameter to standardize this.
Alternatively, I suggest you compute a new column that records which bin each row belongs to; this way the bin calculation is unified as well.
You can do this with pandas' cut function, which gives you the same freedom to choose the number of bins or the specific bin edges as hist does.
df['DistanceBin'] = pd.cut(df['Distance'], bins=10)
Then you can use pivot_table to obtain a table with the counts for each combination, with DistanceBin as rows and Departuretime as columns, as you asked.
df.pivot_table(index='DistanceBin', columns='Departuretime', aggfunc='count')
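A minimal end-to-end sketch of that idea (Distance and Departuretime come from the question; passing values='Distance' and fill_value=0 are my additions so the result is a single flat table of counts):
import pandas as pd

df['DistanceBin'] = pd.cut(df['Distance'], bins=10)   # same fixed bins for every hour
table = df.pivot_table(index='DistanceBin', columns='Departuretime',
                       values='Distance', aggfunc='count', fill_value=0)
print(table)   # rows: distance bins, columns: departure hours, cells: counts
Since the question asks for density values, dividing each column by its sum (table / table.sum()) would turn the counts into per-hour proportions.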
I'm trying to represent three different data sets on the same histogram, but one has 100 data points, one has 362, and one has 289. I'd like to scale the latter two down by factors of 3.62 and 2.89 respectively so they don't overshadow the 100-point one. I feel like this should be easy, but I'm not sure where to put my division; I feel like I've tried all the spots you can try. Here's how it is now:
plt.figure(figsize=(10,6))
scale_pc = (1 / 3.62) #this is the math I'd like to use, but where to put it?
scale_ar = (1 / 2.89) #this is the math I'd like to use, but where to put it?
alldf2[alldf2['playlist']==1]['danceability'].hist(bins=35, color='orange', label='PopConnoisseur', alpha=0.6)
alldf2[alldf2['playlist']==2]['danceability'].hist(bins=35, color='green',label='Ambient',alpha=0.6)
alldf2[alldf2['playlist']==0]['danceability'].hist(bins=35, color='blue',label='Billboard',alpha=0.6)
plt.legend()
plt.xlabel('Danceability')
I've tried variations on this but none work:
alldf2[alldf2['playlist']==1]['danceability'].hist(bins=35, color='orange', label='PopConnoisseur', alpha=0.6)
alldf2[alldf2['playlist']==2]['danceability'/ 3.62].hist(bins=35, color='green',label='Ambient',alpha=0.6)
alldf2[alldf2['playlist']==0]['danceability'/ 2.89].hist(bins=35, color='blue',label='Billboard',alpha=0.6)
Any thoughts?
Edit: Here's the plot as it currently is:
The second one definitely won't work, because this part
'danceability' / 3.62
tries to divide the column name, a string, by a number, which raises an error rather than selecting anything. And even if something like that did work, it would divide the values in that column by 3.62, not return 100 data points...
Also, I am not sure what the problem is with having more data points in the other histograms; that's kind of the thing you want a histogram to show, i.e. how many elements have a particular value.
As Blazej said in the comment, give an example of the data so we can understand a bit better what you are trying to do, and specify what you want to achieve by using just 100 points.
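For what it's worth, one place the division from the question can go is matplotlib's weights argument, which scales each point's contribution to its bar. A rough sketch, assuming the same alldf2 DataFrame from the question:
import numpy as np
import matplotlib.pyplot as plt

pop = alldf2[alldf2['playlist'] == 1]['danceability']
amb = alldf2[alldf2['playlist'] == 2]['danceability']
bil = alldf2[alldf2['playlist'] == 0]['danceability']

plt.hist(pop, bins=35, color='orange', label='PopConnoisseur', alpha=0.6)
plt.hist(amb, bins=35, weights=np.full(len(amb), 1 / 3.62), color='green', label='Ambient', alpha=0.6)
plt.hist(bil, bins=35, weights=np.full(len(bil), 1 / 2.89), color='blue', label='Billboard', alpha=0.6)
plt.legend()
plt.xlabel('Danceability')
plt.show()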
I'm trying to get the local maxima in a one-dimensional array that traces out several curves. To do this I use scipy.signal.argrelextrema along with np.greater, given here where the array is y:
argrelextrema(y, np.greater)
The issue is that this one-dimensional data has inaccuracies due to the way y was gathered. As a result there are a lot of "false positives" at the bottom of the curve, where technically there is a maximum just because one value happens to be bigger than the surrounding ones.
For reference, here's y plotted over x (which is just the index of each y value) to show the array I'm working with; the inaccuracies at the bottom are not visible. Ignore the axis labels, I used what I had in the code.
Also, here's the result of using the found maxima to calculate a value. As you can see this is not what I want, since the expected result should be a smoothly falling curve. The graph was made with one point per maximum, in increasing order, and it's clearly wrong, as one can see from the actual graph.
So, what's the best way to avoid this? I failed to find something that could approximate the graph well enough for me to be able to use it. I looked into smoothing, but the methods I found, like savgol_filter from scipy.signal, were something I could not understand.
The current workaround is to ignore values of y below 5, which is roughly a bit above the bottom of the curve, but that is not an ideal solution at all.
Update:
I found out that find_peaks_cwt from scipy.signal works for this too. It's a tad more complex, and I have absolutely no clue how most of it works even after reading up on it a bit. However, I managed to make a slightly better graph, I think, using find_peaks_cwt(y, [3], noise_perc=2). The result seen below only came from dropping noise_perc from 10 to 2, without really knowing how that affects the result.
Edit:
Here is the 1D array I'm working on: https://pastebin.com/GZrBBRce
Sorry for the awkward representation, but each line is the next value in the list. It's a bit large.
Edit: Minimal working example; y comes from the pastebin above, which is a bit too large to include here:
import matplotlib.pyplot as plt
from scipy.signal import find_peaks_cwt

energy = []
for i in find_peaks_cwt(y, [3], noise_perc=2):   # indices of the detected peaks
    energy.append(y[i])                          # peak heights, in order of appearance
plt.plot(range(len(energy)), energy)
plt.show()
This was made with some guessing, and the result is seen in the last image in this question.
Update 2: Further progress. I smoothed out the y function using numpy.polyfit with a degree-15 approximation, and it's surprisingly accurate. Since that fit is smooth, I can go back to the first approach, argrelextrema(y, np.greater), and it gives me a pretty decent answer without the false positives (in the graphs above I got 30-40 maxima, when my graph only has a little over 20 of them).
I'll let the question stand a bit before marking it solved, in case anyone wants a go at a better solution than approximating the graph with numpy.polyfit. However, this solved the issue for my use case.
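A minimal sketch of what Update 2 describes (the degree of 15 and the switch back to argrelextrema come from the update; the variable names are mine):
import numpy as np
from scipy.signal import argrelextrema

x = np.arange(len(y))
coeffs = np.polyfit(x, y, 15)        # degree-15 polynomial approximation of y
y_smooth = np.polyval(coeffs, x)     # smooth version of the curve

peak_indices = argrelextrema(y_smooth, np.greater)[0]
print(peak_indices)                  # indices of the local maxima of the smoothed curve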
I would use: scipy.signal.find_peaks_cwt().
From its documentation:
Attempt to find the peaks in a 1-D array.
The general approach is to smooth vector by convolving it with wavelet(width) for each width in widths. Relative maxima which appear at enough length scales, and with sufficiently high SNR, are accepted.
UPDATE (with actual data)
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
import scipy.signal
y_arr = np.loadtxt('/home/raid1/metere/Downloads/1d_array.txt')
print('array size: ', y_arr.shape)
arr_size = len(y_arr)
expected_num = 30  # rough guess at how many peaks the curve contains
expected_width = arr_size // expected_num // 2  # rough width of a single peak, in samples
print('expected width of peaks: ', expected_width)
peaks = sp.signal.find_peaks_cwt(y_arr, np.linspace(2, expected_width, 10))  # try a range of widths
print('num peaks: ', len(peaks))
print('peaks: ', peaks)
plt.plot(y_arr)
for peak in peaks:
    plt.axvline(peak)
plt.show()
This can probably be tweaked further, for example to increase the accuracy.
If it's not too big an array, the simplest way I can think of right now is just one loop over the array's values. Something like:
if ar[i-1] < ar[i] and ar[i] > ar[i+1]:
    maxArray.append(ar[i])
You also need to check whether ar[i] != ar[i+1]; if two neighbouring values are equal, take the first of the equal values.
Edit:
count = 1
leng = len(list1) - 2   # stop before the last element so count+1 stays in range
while count < leng:
    count = count + 1
    if list1[count-1] < list1[count] and list1[count] > list1[count+1]:
        print(list1[count])
21.55854026
4.205178829
16.6062412
16.60490417
13.14358751
11.76675489
10.71131948
10.34922208
9.703966466
4.440605216
9.557176225
9.163999162
4.530660664
9.067259599
4.482917884
8.628552441
4.443787789
8.340760319
7.9779415
4.411471328
4.415029139
7.840661767
7.858075487
4.413923668
7.555398794
7.533918443
4.445146914
7.58446368
7.56834833
7.264249919
7.34901701
7.349173404
7.315796894
7.235120197
4.577840109
7.24188544
7.243943576
7.205527364
4.480817125
4.483523464
4.526264151
6.90592723
6.903067763
6.905932124
4.513352307
4.464000858
6.848673936
6.831810008
6.819620162
4.485243384
6.606738091
Your data is a bit noisy, so I got some "additional" values.
EDIT 2:
So you can add a filter. I'm sure you can find a better way to do it, but the simplest for me right now is the following (a NumPy version of the same idea is sketched after the output below):
# smooth with a 7-point moving average, then repeat the peak scan on the smoothed list
list2 = []
count = 3
leng = len(list1) - 4
while count < leng:
    count = count + 1
    avr = (list1[count-3] + list1[count-2] + list1[count-1] + list1[count]
           + list1[count+1] + list1[count+2] + list1[count+3]) / 7
    list2.append(avr)

count = 1
leng = len(list2) - 2
while count < leng:
    count = count + 1
    if list2[count-1] < list2[count] and list2[count] > list2[count+1]:
        print(list2[count])
21.5255838514
16.5808501743
13.1294409014
11.75618281
10.7026162129
10.3274025343
9.68175366729
9.53899509229
9.15257714671
9.06056034386
8.615976868
8.33681455
7.971226556
7.84655894214
7.54856005157
7.57721360586
7.34372518657
7.23259654857
6.90384834786
6.83781572657
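For reference, a sketch of the same 7-point moving average written with NumPy (np.convolve and argrelextrema are my choices here, not part of the code above):
import numpy as np
from scipy.signal import argrelextrema

arr = np.asarray(list1)
smoothed = np.convolve(arr, np.ones(7) / 7, mode='valid')   # 7-point moving average
peaks = argrelextrema(smoothed, np.greater)[0]              # indices of local maxima in the smoothed array
print(smoothed[peaks])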
I want to visualize 200k-300k lines, maybe up to 1 million, where each line is a cumulative sequence of integer values that grows over time, one value per day over on the order of 1000 days. The final values of each line range from 0 to 500.
It's likely that some lines will appear in my population thousands of times, others hundreds of times, others tens of times, and some outliers will be unique. For plotting large numbers of points in an xy plane, alpha transparency can be a solution in some cases, but it isn't great if you want to robustly distinguish overplot density. A solution that scales better is something like hexbin, which bins the space and lets you use a color map to plot the density of points in each bin.
I haven’t been able to find a ready-made solution in python (ideally) or R for doing the analogous thing for plotting lines instead of points.
The following code demonstrates the issue using a small sample (n=1000 lines): can anyone propose how I might drop the alpha value approach in favor of a solution that allows me to introduce a color map for line density, using a transform I can control?
df = pd.DataFrame(np.random.randint(2,size=(100,1000)))
df.cumsum().plot(legend=False, color='grey', alpha=.1, figsize=(12,8))
In response to a request: this is what a sample plot looks like now. In the wide dark band, 10 overplots fully saturate the line, so segments of lines overplotted 10, 100, and 1000 times are indistinguishable.
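One way to approximate this, as a rough sketch: rasterize every (day, value) sample of every line into a 2D histogram and display the counts with a color map on a log scale. This only counts sampled points, not the segments between them; it reuses the random-walk DataFrame from the snippet above, and all other names are mine:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm

df = pd.DataFrame(np.random.randint(2, size=(100, 1000)))
cum = df.cumsum()                                       # rows = days, columns = individual lines

days = np.tile(np.arange(cum.shape[0]), cum.shape[1])   # x coordinate of every sample
vals = cum.to_numpy().ravel(order='F')                  # matching y coordinate, column by column

counts, xedges, yedges = np.histogram2d(days, vals, bins=(cum.shape[0], 60))
plt.pcolormesh(xedges, yedges, counts.T, norm=LogNorm(), cmap='viridis')
plt.colorbar(label='lines per bin (log scale)')
plt.show()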
I generated the figure below using a call to matplotlib.pyplot.hist in which I passed the kwarg normed=True:
Upon further research, I realized that this kind of normalization works in such a way that the integral of the histogram is equal to 1. How can I plot this same data such that the sum of the heights of the bars equals 1?
In other words, I want each bar to represent the proportion of the whole that its values contain.
I'm not sure if there's a straightforward way, but
you can manually divide all bar heights by the length of the input (the following is made in ipython --pylab to skip the imports):
inp = normal(size=1000)
h = hist(inp)
Which gives you
Now, you can do:
bar(h[1][:-1], h[0]/float(len(inp)), diff(h[1]))
and get
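For reference, a sketch of an alternative that gets the "heights sum to 1" result directly from hist, by giving every sample a weight of 1/N (plain matplotlib, not the approach shown above):
import numpy as np
import matplotlib.pyplot as plt

inp = np.random.normal(size=1000)
weights = np.ones_like(inp) / len(inp)   # each sample contributes 1/N to its bar
plt.hist(inp, weights=weights)           # bar heights now sum to 1
plt.show()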