Scale histograms by a certain factor? - python

I'm trying to represent three different data sets on the same histogram but one is 100 data points, one is 362 and one is 289. I'd like to scale the latter two by factors of 3.62 and 2.89 respectively so they don't overshadow the 100 point one. I feel like this should be easy but I'm not sure where to put my division. I feel like I've tried all the spots you can try. Here's how it is now:
plt.figure(figsize=(10,6))
scale_pc = (1 / 3.62) #this is the math I'd like to use, but where to put it?
scale_ar = (1 / 2.89) #this is the math I'd like to use, but where to put it?
alldf2[alldf2['playlist']==1]['danceability'].hist(bins=35, color='orange', label='PopConnoisseur', alpha=0.6)
alldf2[alldf2['playlist']==2]['danceability'].hist(bins=35, color='green',label='Ambient',alpha=0.6)
alldf2[alldf2['playlist']==0]['danceability'].hist(bins=35, color='blue',label='Billboard',alpha=0.6)
plt.legend()
plt.xlabel('Danceability')
I've tried variations on this but none work:
alldf2[alldf2['playlist']==1]['danceability'].hist(bins=35, color='orange', label='PopConnoisseur', alpha=0.6)
alldf2[alldf2['playlist']==2]['danceability'/ 3.62].hist(bins=35, color='green',label='Ambient',alpha=0.6)
alldf2[alldf2['playlist']==0]['danceability'/ 2.89].hist(bins=35, color='blue',label='Billboard',alpha=0.6)
Any thoughts?
Edit: Here's the plot as it currently is:

The second attempt cannot work because of this expression:
'danceability'/ 3.62
Inside the square brackets you are selecting the column by name, so this tries to divide the string 'danceability' by a number, which raises a TypeError. Moreover, even if something like that worked, it would divide the values in that column by 3.62; it would not give you 100 data points.
Also, I am not sure what the problem is with having more data points in the other histograms. That is what a histogram is meant to show, i.e. how many elements have a particular value.
Also, as Blazej said in the comment, give an example of the data so we can understand better what you are trying to do. Specify what you want to achieve by using just 100 points.
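If the goal is only to make the bar heights comparable, one common approach is to weight each observation rather than divide the column values. The sketch below is an assumption about what the asker wants, not something confirmed in the thread. It uses synthetic data and np.histogram, and the variable names are made up for illustration. matplotlib's hist accepts the same weights argument, and density=True is an alternative that normalizes each histogram to unit area.

```python
import numpy as np

# Synthetic stand-ins for two playlists of different sizes (illustrative only).
rng = np.random.default_rng(0)
small = rng.random(100)   # e.g. the 100-point data set
large = rng.random(362)   # e.g. the 362-point data set

# Shared bin edges so the two histograms are directly comparable.
bins = np.linspace(0, 1, 36)  # 35 bins

counts_small, _ = np.histogram(small, bins=bins)

# Each point in the larger set counts for 100/362 of a point,
# so its bars sum to the same total as the smaller set.
weights = np.full(len(large), len(small) / len(large))
counts_large, _ = np.histogram(large, bins=bins, weights=weights)

print(counts_small.sum(), counts_large.sum())
```

With pandas, the same weights array can be passed through the hist call, since extra keyword arguments are forwarded to matplotlib.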

Related

bin value of histograms from grouped data

I am a beginner in Python and I am making separate histograms of travel distance per departure hour. The data I'm using has about 2500 rows. Distance is float64 and Departuretime is str. For further calculations I'd like to have the value of each bin in a histogram, for all histograms.
Up until now, I have the following:
df['Distance'].hist(by=df['Departuretime'], color = 'red',
edgecolor = 'black',figsize=(15,15),sharex=True,density=True)
In my case this creates a figure with 21 small histograms.
Of all these histograms I want to know the y-axis value of each bar, preferably in a dataframe with the distance binning as rows and the hours as columns.
With single histograms, I'd paste counts, bins, bars = in front of the entire line and the variable counts would contain the data I was looking for, however, in this case it does not work.
Ideally I'd like a dataframe or list of some sort for each histogram, containing the density values of the bins. I hope someone can help me out! Big thanks in advance!
First of all, note that the different histograms you are generating don't share the same bin edges. You can see this because you are using sharex=True and the resulting bars don't have the same width. In every case you get 10 bins (the default), but they are not the same 10 bins.
This makes it impossible to combine them all in a single table in any meaningful way. You could provide a fixed list of bin edges as the bins parameter to standardize this.
Alternatively, I suggest you calculate a new column that describes which bin each row belongs to. This also unifies the bin calculation.
You can do this with the cut function, which also gives you the same freedom to choose the number of bins or the specific bin edges the same way as with hist.
df['DistanceBin'] = pd.cut(df['Distance'], bins=10)
Then, you can use pivot_table to obtain a table with the counts for each combination of DistanceBin and Departuretime as rows and columns respectively as you asked.
df.pivot_table(index='DistanceBin', columns='Departuretime', aggfunc='count')
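The question asks for density values rather than raw counts. A minimal sketch of one way to get there follows, assuming fixed, equal-width bin edges and a tiny made-up data set. The edges, the sample values, and the use of pd.crosstab are illustrative choices, not part of the original answer. Density here means count divided by the column total and by the bin width, which matches what density=True computes for a histogram.

```python
import numpy as np
import pandas as pd

# Tiny illustrative data set; the real one has about 2500 rows.
df = pd.DataFrame({
    "Distance": [1.0, 2.5, 4.0, 7.5, 9.0, 3.0, 6.0, 8.5],
    "Departuretime": ["07", "07", "07", "07", "08", "08", "08", "08"],
})

# Fixed edges shared by every departure hour (assumed range 0..10, width 2).
edges = np.linspace(0, 10, 6)
df["DistanceBin"] = pd.cut(df["Distance"], bins=edges, include_lowest=True)

# Counts per (distance bin, departure hour).
counts = pd.crosstab(df["DistanceBin"], df["Departuretime"])

# Convert counts to densities: divide by the column total and the bin width.
bin_width = np.diff(edges)[0]
density = counts / counts.sum(axis=0) / bin_width

print(density)
```

Each column of density then integrates to 1 over the bins, so the hours can be compared even when they contain different numbers of trips.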

Interpolating a line between two other lines in python [duplicate]

I'm sorry for the somewhat confusing title, but I wasn't sure how to sum this up any clearer.
I have two sets of X,Y data, each set corresponding to a general overall value. They are fairly densely sampled from the raw data. What I'm looking for is a way to find an interpolated X for any given Y for a value in between the sets I already have.
The graph makes this more clear:
In this case, the red line is from a set corresponding to 100, the yellow line is from a set corresponding to 50.
I want to be able to say, assuming these sets correspond to a gradient of values (even though they are clearly made up of discrete X,Y measurements), how do I find, say, where the X would be if the Y was 500 for a set that corresponded to a value of 75?
In the example here I would expect my desired point to be somewhere around here:
I do not need this function to be overly fancy — it can be simple linear interpolation of data points. I'm just having trouble thinking it through.
Note that neither the Xs nor the Ys of the two sets overlap perfectly. However it is rather trivial to say, "where are the nearest X points these sets share," or "where are the nearest Y points these sets share."
I have used simple interpolation between known values (e.g. find the X for corresponding Ys for set "50" and "100", then average those to get "75") and I end up with something that looks like this:
So clearly I am doing something wrong here. Obviously in this case X is (correctly) returning as 0 for all of those cases where the Y is higher than the maximum Y of the "lowest" set. Things start out great but somewhere around when one starts to approach the maximum Y for the lowest set it starts going haywire.
It's easy to see why mine is going wrong. Here's another way to look at the problem:
In the "correct" version, X ought to be about 250. Instead, what I'm doing is essentially averaging 400 and 0 so X is 200. How do I solve for X in such a situation? I was thinking that bilinear interpolation might hold the answer but nothing I've been able to find on that has made it clear how I'd go about this sort of thing, because they all seem to be structured for somewhat different problems.
Thank you for your help. Note that while I have obviously graphed the above data in R to make it easy to see what I'm talking about, the final work for this is in Javascript and PHP. I'm not looking for something heavy duty; simple is better.
Good lord, I finally figured it out. Here's the end result:
Beautiful! But what a lot of work it was.
My code is too cobbled and too specific to my project to be of much use to anyone else. But here's the underlying logic.
You have to have two sets of data to interpolate from. I am calling these the "outer" curve and the "inner" curve. The "outer" curve is assumed to completely encompass, and not intersect with, the "inner" curve. The curves are really just sets of X,Y data, and correspond to a set of values defined as Z. In the example used here, the "outer" curve corresponds to Z = 50 and the "inner" curve corresponds to Z = 100.
The goal, just to reiterate, is to find X for any given Y where Z is some number in between our known points of data.
1. Start by figuring out the percentage between the two curve sets that the unknown Z represents. So if Z = 75 in our example then that works out to be 0.5. If Z = 60 that would be 0.2. If Z = 90 then that would be 0.8. Call this proportion P.
2. Select the data point on the "outer" curve where Y = your desired Y. Imagine a line segment between that point and 0,0. Define that as AB.
3. We want to find where AB intersects with the "inner" curve. To do this, we iterate through each point on the inner curve. Define the line segment between the chosen point and the point+1 as CD. Check if AB and CD intersect. If not, continue iterating until they do.
4. When we find an AB-CD intersection, we now look at the line created by the intersection and our original point on the "outer" curve from step 2. This line segment, then, is a line between the inner and outer curve where the slope of the line, were it to be continued "down" the chart, would intersect with 0,0. Define this new line segment as EF.
5. Find the position at P percent (from step 1) of the length of EF. Check the Y value. Is it our desired Y value? If it is (unlikely), return the X of that point. If not, see if Y is less than the goal Y. If it is, store the position of that point in a variable, which I'll dub lowY. Then go back to step 2 again for the next point on the outer curve. If it is greater than the goal Y, see if lowY has a value in it. If it does, interpolate between the two values and return the interpolated X. (We have "boxed in" our desired coordinate, in other words.)
The above procedure works pretty well. It fails in the case of Y=0 but it is easy to do that one since you can just do interpolation on those two specific points. In places where the number of sample is much less, it produces kind of jaggy results, but I guess that's to be expected (these are Z = 5000,6000,7000,8000,9000,10000, where only 5000 and 10000 are known points and they have only 20 datapoints each — the rest are interpolated):
I am under no pretensions that this is an optimized solution, but solving for gobs of points is practically instantaneous on my computer so I assume it is not too taxing for a modern machine, at least with the number of total points I have (30-50 per curve).
Thanks for everyone's help; it helped a lot to talk this through a bit and realize that what I was really going for here was not any simple linear interpolation but a kind of "radial" interpolation along the curve.
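For readers who want to try the idea, here is a rough Python sketch of the "radial" interpolation described above. It is not the author's code, which was written in JavaScript and PHP and was not posted. The function names are hypothetical, the two curves are simple straight lines chosen so the result is easy to check by hand, and the sketch builds the whole intermediate curve first and then looks up X for a given Y, which is a slight reordering of the steps.

```python
import numpy as np

def segment_intersection(a, b, c, d):
    """Return the point where segment AB crosses segment CD, or None."""
    r = b - a
    s = d - c
    denom = r[0] * s[1] - r[1] * s[0]
    if abs(denom) < 1e-12:  # parallel or degenerate segments
        return None
    t = ((c[0] - a[0]) * s[1] - (c[1] - a[1]) * s[0]) / denom
    u = ((c[0] - a[0]) * r[1] - (c[1] - a[1]) * r[0]) / denom
    if -1e-9 <= t <= 1 + 1e-9 and -1e-9 <= u <= 1 + 1e-9:
        return a + t * r
    return None

def radial_curve(outer, inner, p):
    """For each outer point, move a fraction p toward the inner curve
    along the line that joins the outer point to the origin (0, 0)."""
    origin = np.zeros(2)
    points = []
    for a in outer:
        for c, d in zip(inner[:-1], inner[1:]):
            hit = segment_intersection(a, origin, c, d)
            if hit is not None:
                points.append(a + p * (hit - a))
                break
    return np.array(points)

def x_at_y(curve, y):
    """Linearly interpolate X on the first pair of points that brackets y."""
    for lo, hi in zip(curve[:-1], curve[1:]):
        y0, y1 = lo[1], hi[1]
        if y0 != y1 and min(y0, y1) <= y <= max(y0, y1):
            frac = (y - y0) / (y1 - y0)
            return lo[0] + frac * (hi[0] - lo[0])
    return None

# Illustrative curves: outer (Z = 50) is x + y = 4, inner (Z = 100) is x + y = 2.
t_outer = np.linspace(0.0, 4.0, 9)
outer = np.column_stack([4.0 - t_outer, t_outer])
t_inner = np.linspace(0.0, 2.0, 5)
inner = np.column_stack([2.0 - t_inner, t_inner])

p = (75 - 50) / (100 - 50)  # proportion P for Z = 75
curve = radial_curve(outer, inner, p)
x = x_at_y(curve, 1.0)
print(x)
```

With these two straight lines the Z = 75 curve should be x + y = 3, so the X found for Y = 1 should be close to 2.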

Plotting millions of data points in Python?

I have written some complicated code that produces a set of numbers I want to plot. The problem is that I cannot put those numbers in a list, since there are 2 700 000 000 of them.
So I need to plot one point and then produce the second point (the first point is replaced by the second, so the first one is erased because I cannot store them). These numbers are generated in different sections of the code, so I need something like MATLAB's hold for the figure.
To make this more concrete, here is a simple piece of code. I would like you to show me how to plot it.
import matplotlib.pyplot as plt
i=0
j=10
while i<2700000000:
    plt.stem(i, j, '-')
    i = i + 1
    j = j + 2
plt.show()
Suppose I have billions of i and j!
Hmm I'm not sure if I understood you correctly but this:
import matplotlib.pyplot as plt
i=0
j=10
fig=plt.figure()
ax=fig.gca()
while i<10000: # Fewer points for speed.
    ax.stem([i], [j]) # Need to provide iterable arguments to ax.stem
    i = i + 1
    j = j + 2
fig.show()
generates the following figure:
Isn't this what you're trying to achieve? After all the input numbers aren't stored anywhere, just added to the figure as soon as they are generated. You don't really need Matlab's hold equivalent, the figure won't be shown until you call fig.show() or plt.show() to show the current figure.
Or are you trying to overcome the problem that you can't hold the matplotlib.figure in your RAM? In that case my answer doesn't answer your question. You would then either have to save partial figures (only parts of the data) as pictures and combine them, as suggested in the comments, or think about an alternative way to show the data, as suggested in the other answer.
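One possible "alternative way to show the data" is to never store or draw individual points at all, and instead accumulate counts on a fixed grid while the numbers are generated. This is only a sketch under assumptions: the generator below is a stand-in for the real code, the grid size of 200 by 200 is arbitrary, and the point count is reduced so the example runs quickly.

```python
import numpy as np

def generate_chunks(n_total, chunk_size):
    """Stand-in for the real computation: yields (i, j) values in chunks."""
    start = 0
    while start < n_total:
        stop = min(start + chunk_size, n_total)
        i = np.arange(start, stop)
        j = 10 + 2 * i
        yield i, j
        start = stop

n_total = 1_000_000  # far fewer than 2.7 billion, for speed

# Fixed grid chosen up front; memory use depends on the grid, not on n_total.
x_edges = np.linspace(0, n_total, 201)
y_edges = np.linspace(10, 10 + 2 * n_total, 201)
counts = np.zeros((200, 200))

for i, j in generate_chunks(n_total, 100_000):
    h, _, _ = np.histogram2d(i, j, bins=[x_edges, y_edges])
    counts += h  # each chunk is discarded after it is counted

print(counts.sum())
```

The accumulated counts array can then be drawn once, for example with imshow or pcolormesh, which keeps the figure small no matter how many points were generated.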

Opacity misleading when plotting two histograms at the same time with matplotlib

Let's say I have two histograms and I set the opacity using the hist parameter 'alpha=0.5'.
I have plotted two histograms, yet I get three colors! I understand this makes sense from an opacity point of view.
But it is very confusing to show someone a graph of two things with three colors. Can I somehow set the smallest bar for each bin to be in front, with no opacity?
Example graph
The usual way this issue is handled is to have the plots with some small separation. This is done by default when plt.hist is given multiple sets of data:
import pylab as plt
x = 200 + 25*plt.randn(1000)
y = 150 + 25*plt.randn(1000)
n, bins, patches = plt.hist([x, y])
You instead wish to stack them (this could be done above using the argument histtype='barstacked'), but notice that the ordering is incorrect.
This can be fixed by individually checking each pair of points to see which is larger and then using zorder to set which one comes first. For simplicity I am using the output of the code above (e.g. n is two stacked arrays of the number of points in each bin for x and y):
n_x = n[0]
n_y = n[1]
for i in range(len(n[0])):
    if n_x[i] > n_y[i]:
        zorder=1
    else:
        zorder=0
    plt.bar(bins[:-1][i], n_x[i], width=10)
    plt.bar(bins[:-1][i], n_y[i], width=10, color="g", zorder=zorder)
Here is the resulting image:
Changing the ordering like this makes the image look very odd, which is probably why it is not implemented and needs a hack. I would stick with the small separation method; anyone used to these plots assumes the paired bars share the same x-value.

How to handle huge difference in values when plotting a histogram?

I have a list (intensityList) with 1354 numbers. They range from 25941.9 to 1639980000.0, so there is a very big difference, and I expect that most points are closer to 1639980000.0 than 25941.9. When I make a histogram out of this
plt.hist(intensityList,20)
plt.title('Amount of features per intensity')
plt.xlabel('intensity')
plt.ylabel('frequency')
plt.show()
it puts almost all data in one bar and messes up the x-axis. It works with a test set (random normal numbers) so I'm pretty sure it has to do with the broad range. How can I deal with a dataset like this?
edit:
The data is likely very skewed; the standard deviation is much larger than the mean (mean = 6501401.54114, standard deviation = 49423145.7749).
Quite an obvious answer, which shows that it helps to write a question down. I took the logarithm of the values and now it all looks fine.
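You can increase the number of bins or keep only the values in a range you find interesting. The boolean indexing below assumes intensityList is a NumPy array (convert a plain list with np.array first), and minVal and maxVal are bounds you choose:
intensityList = intensityList[intensityList < maxVal]
intensityList = intensityList[intensityList > minVal]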
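The log approach mentioned by the asker can be sketched as follows. The data here is synthetic (a log-normal sample of the same length, which is an assumption about the shape of the real intensities), and the variable names are illustrative. The idea is to histogram log10 of the values so that the bins cover orders of magnitude evenly.

```python
import numpy as np

# Synthetic, heavily skewed stand-in for intensityList (1354 values).
rng = np.random.default_rng(1)
values = rng.lognormal(mean=13.0, sigma=2.0, size=1354)

# Histogram the base-10 logarithm instead of the raw values.
log_values = np.log10(values)
counts, edges = np.histogram(log_values, bins=20)

# The edges are in log10 units; 10**edges gives the raw intensity boundaries.
raw_edges = 10 ** edges

print(counts)
```

When plotting, passing the log-transformed values to plt.hist and labelling the axis as log10(intensity) gives the same picture.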
