Python: Compare histograms with different bin sizes

I want to compare 2 histograms that come from an evaluation board, which already bins the counted events into a histogram. I am taking data from 2 channels with different numbers of events (in fact, one is background only, one is background + signal, a pretty usual experimental setting), and with different numbers of bins, different bin widths, and different bin center positions.
The datafile looks like this:
HSlice [CH1]
...
44.660 46.255 6
46.255 47.850 10
47.850 49.445 18
49.445 51.040 8
51.040 52.635 28
52.635 54.230 4
54.230 55.825 18
55.825 57.421 183
57.421 59.016 582
59.016 60.611 1786
...
HSlice [CH2]
...
52.022 53.880 0
53.880 55.738 9
55.738 57.596 213
57.596 59.454 728
59.454 61.312 2944
61.312 63.170 9564
...
The first two columns give the boundaries of the respective bin (that is time) and the last column represents the number of events within this timeframe.
Now I want to perform a kind of background reduction, that is, subtract the background histogram from the "background + signal" histogram to obtain the time trace of the actual signal. I cannot do this line by line since the histograms are quite different. Is there a simple function in Python, or an elegant way, to make the data comparable (for example by interpolating between two data points in one histogram to match the bin positions of the other histogram) without messing up the time resolution given by the experiment (neither making it worse than it is, nor pretending a better time resolution)?
Thank you,
lepakk

Channel 2 has a bigger bin size than channel 1 (1.858 vs. 1.595), so I would transfer the values from the smaller bins into the bigger bins. That leads to a loss of resolution, but I think that's more honest than transferring from bigger bins into smaller bins and thereby pretending a better resolution than you have.
My approach would be to take all the values from the bins in channel 1 and assign them to the center of their time bin. You don't really know exactly where in the bin they were originally measured, so this is the point where you cheat a little bit.
Then fill the values of channel 1 into the bins of channel 2 according to their new time value, as in the sketch below.
That would be my first approach.
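A rough sketch of that rebinning with NumPy, using only the bins shown in the excerpt above (in a real analysis you would parse the whole file, and which channel is the background depends on your setup):

import numpy as np

# Channel 1 (finer bins) and channel 2 (coarser bins), taken from the
# excerpt above; in practice these arrays would be parsed from the file.
edges1 = np.array([44.660, 46.255, 47.850, 49.445, 51.040, 52.635,
                   54.230, 55.825, 57.421, 59.016, 60.611])
counts1 = np.array([6, 10, 18, 8, 28, 4, 18, 183, 582, 1786])

edges2 = np.array([52.022, 53.880, 55.738, 57.596, 59.454, 61.312, 63.170])
counts2 = np.array([0, 9, 213, 728, 2944, 9564])

# Assign each channel-1 count to the center of its bin ...
centers1 = 0.5 * (edges1[:-1] + edges1[1:])

# ... and re-histogram those centers onto the channel-2 edges, weighting
# by the counts (centers outside the channel-2 range are dropped).
counts1_on_ch2, _ = np.histogram(centers1, bins=edges2, weights=counts1)

# Background subtraction is now bin-by-bin on the coarser grid; which
# histogram is the background depends on your experiment.
signal = counts2 - counts1_on_ch2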

Related

bin value of histograms from grouped data

I am a beginner in Python and I am making separate histograms of travel distance per departure hour. The data I'm using has about 2500 rows like this; Distance is float64 and Departuretime is str. However, for making further calculations I'd like to have the value of each bin in a histogram, for all histograms.
Up until now, I have the following:
df['Distance'].hist(by=df['Departuretime'], color='red',
                    edgecolor='black', figsize=(15, 15), sharex=True, density=True)
In my case this creates a figure with 21 small histograms.
Of all these histograms I want to know the y-axis value of each bar, preferably in a dataframe with the distance binning as rows and the hours as columns.
With single histograms, I'd paste counts, bins, bars = in front of the entire line and the variable counts would contain the data I was looking for; however, in this case that does not work.
Ideally I'd like a dataframe or list of some sort for each histogram, containing the density values of the bins. I hope someone can help me out! Big thanks in advance!
First of all, note that the bins used in the different histograms you are generating don't have the same edges (you can see this because, with sharex=True, the resulting bars don't have the same width). In all cases you are getting 10 bins (the default), but they are not the same 10 bins.
This makes it impossible to combine them all in a single table in any meaningful way. You could provide a fixed list of bin edges as the bins parameter to standardize this.
Alternatively, I suggest you calculate a new column that describes which bin each row belongs to; this way we also unify the bin calculation.
You can do this with the cut function, which also gives you the same freedom to choose the number of bins or the specific bin edges the same way as with hist.
df['DistanceBin'] = pd.cut(df['Distance'], bins=10)
Then, you can use pivot_table to obtain a table with the counts for each combination of DistanceBin and Departuretime as rows and columns respectively as you asked.
df.pivot_table(index='DistanceBin', columns='Departuretime', values='Distance', aggfunc='count')
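If what you ultimately want are density values (to match density=True in your hist call) rather than raw counts, a hedged follow-up is to normalize each column by its total and by the bin width; this assumes the equal-width bins produced by pd.cut(..., bins=10) and the df from your question:

# 'counts' is the pivoted table of counts from above.
counts = df.pivot_table(index='DistanceBin', columns='Departuretime',
                        values='Distance', aggfunc='count').fillna(0)

# With pd.cut(..., bins=10) all bins have (almost) the same width.
bin_width = (df['Distance'].max() - df['Distance'].min()) / 10

# Divide each hour's counts by its total and by the bin width so every
# column integrates to 1, mimicking density=True.
densities = counts / (counts.sum(axis=0) * bin_width)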

Numpy Correlate is not providing an offset

I am trying to look at astronomical spectra using Python, and I'm using numpy.correlate to try and find a radial velocity shift. I'm comparing each spectrum I have to one template spectrum. The problem that I'm encountering is that, no matter which spectra I use, numpy.correlate states that the maximal value of the correlation function occurs with a shift of zero pixels, i.e. the spectra already line up, which is very clearly not true. Here is some of the relevant code:
corr = np.correlate(temp_data, imag_data, mode='same')
ax1.plot(delta_data, corr, c='g')
ax1.plot(delta_data, 100*temp_data, c='b')
ax1.plot(delta_data, 100*imag_data, c='r')
The output of this code is shown here:
What I Have
Note that the cross-correlation function peaks at an offset of zero pixels despite the template (blue) and observed (red) spectra clearly showing an offset. What I would expect to see would be something a bit like (albeit not exactly like; this is merely the closest representation I could produce):
What I Want
Here I have introduced an artificial offset of 50 pixels in the template data, and they more or less line up now. What I would like is, for a case like this, for a peak to appear at an offset of 50 pixels rather than at zero (I don't care if the spectra at the bottom appear lined up; that is merely for visual representation). However, despite several hours of work and research online, I can't find anyone who even describes this problem, let alone a solution. I've attempted to use SciPy's correlate and Matplotlib's xcorr, and both show the same thing (although I'm led to believe that they are essentially the same function).
Why is the cross-correlation not acting the way I expect, and how do I get it to act in a useful way?
The issue you're experiencing is probably because your spectra are not zero-centered; their RMS value looks to be about 100 in whichever units you're plotting, instead of 0. The reason this is an issue is that numpy.correlate works by "sliding" imag_data over temp_data to get their dot product at each possible offset between the two series. (See the Wikipedia article on cross-correlation to understand the operation itself.) When using mode='same' to produce an output that is the same length as your first input (temp_data), NumPy has to "pad" a bunch of dummy values, zeroes, onto the ends of imag_data in order to be able to calculate the dot products of all the shifted versions of imag_data. When there is any non-zero offset between the spectra, some of the values in temp_data are multiplied by those dummy zero-padding values instead of the values in imag_data. If the values in the spectra were centered around zero (RMS=0), this zero-padding would not affect the expected dot product, but because these spectra have RMS values around 100 units, that dot product (our correlation) is largest when we lay the two spectra on top of one another with no offset.
Notice that your cross-correlation result looks like a triangular pulse, which is what you might expect from the cross-correlation of two square pulses (cf. the convolution of a rectangular pulse with itself). That's because your spectra, once padded, look like a step function from zero up to a pulse of slightly noisy values around 100. You can try mode='full' to see the entire response of the two spectra you're correlating, or notice that with mode='valid' you should only get one value in return, since your two spectra are exactly the same length, so there is only one offset (zero!) where you can entirely line them up.
To sidestep this issue, you can either subtract away the RMS value of the spectra so that they are zero-centered, or manually pad the beginning and end of imag_data with (len(temp_data)/2 - 1) dummy values equal to np.sqrt(np.mean(imag_data**2)).
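A minimal sketch of the first option, with synthetic spectra standing in for temp_data and imag_data (the 50-pixel shift and baseline are only for illustration): zero-center, correlate, and read the shift off the location of the peak.

import numpy as np

# Synthetic "spectra" with a non-zero baseline (~100) and a known
# 50-pixel shift between them, standing in for temp_data and imag_data.
x = np.arange(500)
temp_data = 100 + 20 * np.exp(-((x - 200) ** 2) / 50) + np.random.randn(500)
imag_data = 100 + 20 * np.exp(-((x - 250) ** 2) / 50) + np.random.randn(500)

# Zero-center both spectra before correlating.
corr = np.correlate(temp_data - temp_data.mean(),
                    imag_data - imag_data.mean(), mode='same')

# With mode='same', index len(corr)//2 corresponds to zero lag, so the
# pixel shift is the distance of the peak from that midpoint.
lags = np.arange(len(corr)) - len(corr) // 2
print(lags[np.argmax(corr)])  # about -50 here; the sign flips if you swap the arguments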
Edit:
In response to your questions in the comments, I thought I'd include a graphic to make the point I'm trying to describe a little clearer.
Say we have two vectors of values, not entirely unlike your spectra, each with some large non-zero mean.
import numpy as np
import matplotlib.pyplot as plt

# Generate two noisy, but correlated series
t = np.linspace(0, 250, 250)  # time domain from 0 to 250 steps
# signal_model = narrow_peak + gaussian_noise + constant
f = 10*np.exp(-((t-90)**2)/8) + np.random.randn(250) + 40
g = 10*np.exp(-((t-180)**2)/8) + np.random.randn(250) + 40
f has a spike around t=90, and g has a spike around t=180. So we expect the correlation of g and f to have a spike around a lag of 90 timesteps (in the case of spectra, frequency bins instead of timesteps.)
But in order to get an output that is the same shape as our inputs, as in np.correlate(g,f,mode='same'), we have to "pad" f on either side with half its length in dummy values: np.correlate pads with zeroes. If we don't pad f (as in np.correlate(g,f,mode='valid')), we will only get one value in return (the correlation with zero offset), because f and g are the same length, and there is no room to shift one of the signals relative to the other.
When you calculate the correlation of g and f after that padding, you find that it peaks when the non-zero portions of the signals align completely, that is, when there is no offset between the original f and g. This is because the RMS value of the signals is so much higher than zero: the size of the overlap of f and g depends much more strongly on the number of elements overlapping at this high RMS level than on the relatively small fluctuations each function has around it. We can remove this large contribution to the correlation by subtracting the RMS level from each series. In the graph below, the gray line on the right shows the cross-correlation of the two series before zero-centering, and the teal line shows the cross-correlation after. The gray line is, like your first attempt, triangular with the overlap of the two non-zero signals. The teal line better reflects the correlation between the fluctuations of the two signals, as we desired.
xcorr = np.correlate(g, f, 'same')
xcorr_rms = np.correlate(g - 40, f - 40, 'same')

fig, axes = plt.subplots(5, 2, figsize=(18, 18), gridspec_kw={'width_ratios': [5, 2]})
for n, axis in enumerate(axes):
    offset = (0, 75, 125, 215, 250)[n]
    # Pad f at a varying position so it can be shown shifted against a centrally padded g.
    fp = np.pad(f, [offset, 250 - offset], mode='constant', constant_values=0.)
    gp = np.pad(g, [125, 125], mode='constant', constant_values=0.)
    axis[0].plot(fp, color='purple', lw=1.65)
    axis[0].plot(gp, color='orange', lw=1.65)
    axis[0].axvspan(max(125, offset), min(375, offset + 250), color='blue', alpha=0.06)
    axis[0].axvspan(0, max(125, offset), color='brown', alpha=0.03)
    axis[0].axvspan(min(375, offset + 250), 500, color='brown', alpha=0.03)
    if n == 0:
        axis[0].legend(['f', 'g'])
    axis[0].set_title('offset={}'.format(offset - 125))
    axis[1].plot(xcorr / (40 * 40), color='gray')
    axis[1].plot(xcorr_rms, color='teal')
    axis[1].axvline(offset, -100, 350, color='maroon', lw=5, alpha=0.5)
    if n == 0:
        axis[1].legend([r"$g \star f$", r"$g' \star f'$", "offset"], loc='upper left')
plt.show()

plot to show large data points on x axis using python

How can I show the variance of these data points over time? I used this plot to show them, but because the time runs from 0 to 20,000 seconds it is difficult to see all the points properly and to observe the variance or invariance; the problem is that the points overlap each other.
after zoom in
I finally managed to solve this problem by subtracting the minimum time from each time, for each subject. Now all the times start from 0 and the variance between subjects can be seen easily.
Normalize your axis to 1 by dividing by the maximum value. Afterwards you can scale your axis by a factor X.
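A minimal sketch of both ideas (shifting each subject's times to start at 0, and normalizing the time axis by its maximum), assuming a pandas DataFrame with hypothetical 'subject', 'time' and 'value' columns; the names and data are only for illustration:

import pandas as pd

# Hypothetical example data: one row per measurement, per subject.
df = pd.DataFrame({'subject': ['a', 'a', 'b', 'b'],
                   'time':    [1000, 1060, 5000, 5060],
                   'value':   [0.20, 0.30, 0.25, 0.35]})

# Option 1: start every subject's time axis at 0 (the fix described above).
df['time_shifted'] = df['time'] - df.groupby('subject')['time'].transform('min')

# Option 2: normalize the time axis to [0, 1] by dividing by the maximum.
df['time_norm'] = df['time'] / df['time'].max()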

Plotting in ggplot with non-discrete x and y

I want to create a ggplot where the x-axis is a distance (currently the distances are continuous values that range between 0 and 45 feet) that can be binned and the y-axis is whether or not the basket was made (0 is missed, 1 is made). Here is a subset of the dataframe, which is a pandas dataframe. EDIT: Not sure this is helpful, but I have also added a column that represents the bucket/bin for each attempt's shot distance.
distance(ft) outcome category
----------- --------- --------
9.5 1 (9,18]
23.3 1 (18,27]
18.7 0 (18,27]
10.8 0 (9,18]
43.6 1 (36,45]
I could just make a scatterplot where the x-axis is distance and the y-axis is miss/made. However, I don't want to visualize every shot attempt as a point. Let's say I want the x-axis to be bins (where each bin is every 9 ft: 0-9 ft, 9-18 ft, 18-27 ft, 27-36 ft, 36-45 ft), and the y-axis to be the proportion of shots that were made in that bin.
What is the best way to achieve this in ggplot? How much preprocessing do I have to do before leveraging ggplot capabilities? I can imagine doing all the necessary computation myself to find the proportion of shots made per bin and then plotting those values easily, but I feel there should be some built-in capabilities to help me with this (although I am new to ggplot and unsure at this point). Any guidance would be greatly appreciated. Thanks!
You are likely using a Pandas DataFrame (or Series) to hold your data, right?
If so, you can bin your values using pandas' built-in functionality, specifically the pandas.cut function.
e.g.
bins = 9  # can be an int (number of bins) or a sequence of bin edges
out, bin_edges = pd.cut(df['distance(ft)'], bins, retbins=True, include_lowest=True)
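Going one step further, here is a hedged sketch of the whole pipeline: bin the distances every 9 ft with pd.cut, take the mean of the 0/1 outcome per bin (which is exactly the make proportion), and plot it. It assumes the plotnine package as the Python ggplot implementation and uses the sample rows from the question.

import pandas as pd
from plotnine import ggplot, aes, geom_col

# Sample rows from the question.
df = pd.DataFrame({'distance(ft)': [9.5, 23.3, 18.7, 10.8, 43.6],
                   'outcome':      [1, 1, 0, 0, 1]})

# Bin every 9 ft; the mean of a 0/1 outcome per bin is the proportion made.
df['category'] = pd.cut(df['distance(ft)'], bins=[0, 9, 18, 27, 36, 45])
made = (df.groupby('category', observed=False)['outcome']
          .mean()
          .reset_index())
made['category'] = made['category'].astype(str)  # intervals as plain labels for the x-axis

p = (ggplot(made, aes(x='category', y='outcome'))
     + geom_col())
print(p)  # or p.save('shots_by_bin.png')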

Reduce dataset to smaller size, keep the gist of information in the dataset

I'm developing a line chart. The data is being generated by a sensor and is a tuple (timestamp, value). The sensor creates a new datapoint every 60 seconds or so.
Now I want to display it in a graph, and my limitation is about 900 points on the graph. In a daily view of that graph I'd get about 1440 points, and that's too much.
I'm looking for a general way how to shrink my dataset of any size to fixed size (in my case 900) while it keeps the timestamp distribution linear.
Thanks
I believe you are trying to resample your data. Your current sample rate is 1/60 samples per second and you are trying to get to 1/96 samples per second (900 / (24*60*60)). The ratio between the two rates is 5/8.
If you search for "python resample" you will find other similar questions and articles involving numpy and pandas, which have built-in routines for it.
To do it manually you can first upsample by 5 to get to 7200 samples per day and then downsample by 8 to get down to 900 samples per day.
To upsample you can make a new list five times as long and fill in every fifth element with your existing data. Then you can do, say, linear interpolation to fill in the gaps.
Once you do that, you can downsample by simply taking every eighth element, as in the sketch below.
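A minimal sketch of that manual route, assuming values is a NumPy array holding one day of readings (1440 samples, one per minute); the names are only for illustration:

import numpy as np

# Hypothetical day of sensor readings, one every 60 seconds.
values = np.random.rand(1440)

# Upsample by 5: place the originals on every fifth slot and fill the
# gaps by linear interpolation.
n = len(values)
coarse_idx = np.arange(n) * 5      # positions of the original samples
fine_idx = np.arange(n * 5)        # positions of the upsampled series
upsampled = np.interp(fine_idx, coarse_idx, values)

# Downsample by 8: simply keep every eighth element (7200 / 8 = 900).
downsampled = upsampled[::8]
print(len(downsampled))  # 900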
Here's my final solution using pandas:
import pandas as pd

df = pd.read_json('co2.json')  # column 0 holds the unix timestamps

# Calculate the 'rule' parameter for resampling: the total time span in
# seconds divided by the desired number of points.
seconds = int(df.tail(1)[0]) - int(df.head(1)[0])
rule = seconds // 960

df.index = pd.to_datetime(df[0], unit='s')
df = df.resample('%sS' % rule).mean()
