I want to make a plot that splits a dataset and shows the number of observations per category on the left axis and, on the right axis, a confidence interval (e.g. 90%) together with the mean of a certain observed value.
It should look like this:
I know how to use ax.hist() or ax.bar() for the first job. A second axis is easily made using ax.twinx(). However, after trying both ax.boxplot() and ax.violinplot(), I believe neither could do the job (plotting the confidence interval + mean) correctly. Any suggestions?
I am a beginner in Python and I am making separate histograms of travel distance per departure hour. The data I'm using has about 2500 rows; Distance is float64 and Departuretime is str. For further calculations, I'd like to have the value of each bin in a histogram, for all histograms.
Up until now, I have the following:
df['Distance'].hist(by=df['Departuretime'], color='red', edgecolor='black',
                    figsize=(15, 15), sharex=True, density=True)
In my case this creates a figure with 21 small histograms.
Of all these histograms I want to know the y-axis value of each bar, preferably in a dataframe with the distance binning as rows and the hours as columns.
With a single histogram, I'd put counts, bins, bars = in front of the entire line and the variable counts would contain the data I was looking for. In this case, however, that does not work.
Ideally I'd like a dataframe or list of some sort for each histogram, containing the density values of the bins. I hope someone can help me out! Big thanks in advance!
First of all, note that the histograms you are generating don't share the same bin edges. You can see this because you are using sharex=True and the resulting bars don't have the same width. In every case you get 10 bins (the default), but they are not the same 10 bins.
This makes it impossible to combine them in a single table in any meaningful way. You could pass a fixed list of bin edges as the bins parameter to standardize this.
Alternatively, I suggest you calculate a new column that records which bin each row belongs to; this also unifies the bin calculation.
You can do this with the pd.cut function, which lets you choose the number of bins or the specific bin edges, just as hist does.
df['DistanceBin'] = pd.cut(df['Distance'], bins=10)
Then, you can use pivot_table to obtain a table with the counts for each combination of DistanceBin and Departuretime as rows and columns respectively as you asked.
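# values= names the column to count, so the result has one column per departure hour
df.pivot_table(index='DistanceBin', columns='Departuretime', values='Distance', aggfunc='count')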
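As a rough sketch of how the shared-bin idea could produce the density table the question asks for: the data below is a made-up stand-in (only the column names Distance and Departuretime come from the question), and groupby(...).size() is used here in place of pivot_table so that empty bins show up as 0. Treat it as one possible approach under those assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the asker's data; only the column names are from the question.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Distance": rng.uniform(0, 50, size=300),
    "Departuretime": rng.choice(["07", "08", "09"], size=300),
})

# One shared set of 11 edges, so every hour uses the same 10 bins.
edges = np.linspace(df["Distance"].min(), df["Distance"].max(), 11)
df["DistanceBin"] = pd.cut(df["Distance"], bins=edges, include_lowest=True)

# Rows: distance bins, columns: departure hours, cells: number of rows.
# observed=False keeps bins that happen to be empty.
counts = (
    df.groupby(["DistanceBin", "Departuretime"], observed=False)
      .size()
      .unstack(fill_value=0)
)

# Density in the sense of hist(density=True):
# share of that hour's rows in the bin, divided by the bin width.
widths = np.diff(edges)
density = counts.div(counts.sum(axis=0), axis=1).div(widths, axis=0)
```

Each column of density then integrates to 1 over the shared bins, which is what makes the hours comparable row by row.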
I've got a time series of sunspot numbers, where the mean number of sunspots is counted per month, and I'm trying to use a Fourier Transform to convert from the time domain to the frequency domain. The data used is from https://wwwbis.sidc.be/silso/infosnmtot.
The first thing I'm confused about is how to express the sampling frequency as once per month. Do I need to convert it to seconds, e.g. 1/(seconds in 30 days)? Here's what I've got so far:
fs = 1/2592000
#the sampling frequency is 1/(seconds in a month)
fourier = np.fft.fft(sn_value)
#sn_value is the mean number of sunspots measured each month
freqs = np.fft.fftfreq(sn_value.size,d=fs)
power_spectrum = np.abs(fourier)
plt.plot(freqs,power_spectrum)
plt.xlim(0,max(freqs))
plt.title("Power Spectral Density of the Sunspot Number Time Series")
plt.grid(True)
I don't think this is correct, mainly because I don't know what the scale of the x-axis is. However, I do know that there should be a peak at (11 years)^-1.
The second thing I'm wondering about in this graph is why there seem to be two lines, one being a horizontal line just above y=0. It's clearer when I change the x-axis bounds to plt.xlim(0,1).
Am I using the fourier transform functions incorrectly?
You can use any units you want. Feel free to express your sampling frequency as fs=12 (samples/year); the x-axis will then be in units of 1/year. Or use fs=1 (sample/month); the units will then be 1/month. Note that the d argument of np.fft.fftfreq is the sample spacing, i.e. 1/fs, not fs itself.
The extra line you spotted comes from the way you plot your data. Look at the output of the np.fft.fftfreq call. The first half of that array contains positive values from 0 to 1.2e6 or so; the other half contains negative values from -1.2e6 to almost 0. By plotting all your data, you get a data line from 0 to the right, then a straight line from the rightmost point to the leftmost point, then the rest of the data line back to zero. Your xlim call makes it so you don’t see half the data plotted.
Typically you’d plot only the first half of your data: just crop the freqs and power_spectrum arrays.
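A minimal sketch of the advice above, using a synthetic series with an 11-year cycle as a stand-in for sn_value (the real SILSO data is not loaded here). Instead of cropping the arrays by hand, it uses np.fft.rfft and np.fft.rfftfreq, which return only the non-negative frequencies; that swap is a choice made for this example, not something the answer prescribes.

```python
import numpy as np

# Synthetic monthly series with an 11-year cycle (assumed stand-in for sn_value).
months = np.arange(12 * 110)                      # 110 years of monthly samples
sn_value = 80 + 60 * np.sin(2 * np.pi * months / (12 * 11))

fs = 12                                           # samples per year

# Remove the mean so the zero-frequency bin does not dominate the spectrum.
spectrum = np.abs(np.fft.rfft(sn_value - sn_value.mean()))

# d is the sample spacing (1/fs), so freqs is in cycles per year.
freqs = np.fft.rfftfreq(sn_value.size, d=1 / fs)

peak_freq = freqs[np.argmax(spectrum)]
peak_period_years = 1 / peak_freq
```

With fs=12 the peak lands at a frequency of 1/11 per year, so peak_period_years can be read directly in years.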
I have a dataframe containing confidence intervals of means for the parameters 'likes', 'retweets', 'followers', 'pics' for 4 samples: ikke-aktant, laser, umbrella, mask. Each value is a list containing the interval bounds, e.g. [8.339078253365264, 9.023388831788864], which is the confidence interval for likes in the laser sample. A picture of the dataframe can be seen here: https://imgur.com/a/NkDckII
I want to plot it in a seaborn pointplot, where y represents the four samples, and x is likes.
So far I have:
ax = sns.pointplot(x="likes", data=df_boot, hue='sample', join=False)
Which returns the error:
TypeError: Horizontal orientation requires numeric `x` variable.
I guess this is because x is a list. Is there a way to plot my confidence intervals using pointplot?
I think the problem is that you are using data that are already confidence intervals. Pointplot expects 'raw' data like the example dataset found here: github.com/mwaskom/seaborn-data/blob/master/tips.csv. So why not use the data that you used to calculate those confidence intervals?
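If the raw data is not at hand, one alternative (not proposed in the answer above) would be to draw the precomputed intervals with matplotlib's ax.errorbar rather than seaborn. The sketch below is only that: the interval values are invented except for the laser one quoted in the question, and the marker is placed at the interval midpoint, which is an assumption, since a bootstrap interval need not be symmetric around the mean.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical intervals for 'likes'; only the laser values come from the question.
intervals = {
    "ikke-aktant": [5.0, 6.0],
    "laser": [8.339078253365264, 9.023388831788864],
    "umbrella": [3.5, 4.5],
    "mask": [6.0, 7.5],
}

samples = list(intervals)
bounds = np.array([intervals[s] for s in samples])  # shape (4, 2): lower, upper
centers = bounds.mean(axis=1)                       # midpoints (assumed marker position)
half_widths = (bounds[:, 1] - bounds[:, 0]) / 2

fig, ax = plt.subplots()
positions = np.arange(len(samples))
# fmt="o" draws markers with no connecting line, similar in spirit to join=False.
ax.errorbar(centers, positions, xerr=half_widths, fmt="o", capsize=4)
ax.set_yticks(positions)
ax.set_yticklabels(samples)
ax.set_xlabel("likes")
# Call fig.savefig(...) or plt.show() to view the result.
```

If the sample means are known, they could replace centers, with xerr given as a 2-row array of lower and upper distances.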
I am calculating trendlines for the stock market and want to know the angle between two lines.
The X-axis is the epoch timestamp (in ms) and the Y-axis is the price.
The problem is that because the epoch timestamp is so large (let's say 1,591,205,309,000 ms) and the price per share can vary from $0.078 to $10,000, the scales are not proportional.
I am also a trader, and when I trade I see charts as described in the picture below:
So the plotting is probably scaling the axes to fit in some way (compressing the X axis and stretching the Y axis).
This scaling is also generic: whether I am looking at a 5 minute chart or a 1 day chart, when I draw lines (in the same timeframe), I see them in a comfortable way.
If you take those lines and plot them on a raw ts/price graph, you will probably see 2 parallel lines.
I also must keep the line equation in ts because I need to forecast where the trade will be in the future (I give it the ts, and it returns the price at that time).
Right now, when calculating this angle I get around 0.0003 degrees. I want to get the angle between the lines as it appears in the chart above.
How can I show the variance of these data points over time? I used this plot to show them, but because the time runs from 0 to 20 000 seconds, it is difficult to see all the points properly and judge the variance or invariance. The problem is that the points overlap each other.
I finally solved this problem by subtracting, for each subject, that subject's minimum time from each of its times. Now all the times start from 0 and the variance between subjects can be seen easily.
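The per-subject shift described above might look like this in pandas. The column names subject, time and value, and the numbers themselves, are assumptions made for the sketch, since the original data is not shown.

```python
import pandas as pd

# Hypothetical data; column names and values are assumptions.
df = pd.DataFrame({
    "subject": ["a", "a", "a", "b", "b", "b"],
    "time": [12000, 12010, 12025, 19500, 19520, 19530],
    "value": [1.0, 1.2, 0.9, 1.1, 1.0, 1.3],
})

# Subtract each subject's earliest timestamp so every series starts at 0.
df["elapsed"] = df["time"] - df.groupby("subject")["time"].transform("min")
```

Plotting value against elapsed instead of time then puts every subject on the same 0-based axis.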
Normalize your axes to 1 by dividing by the maximum value. Afterwards you can scale each axis by a factor X.
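One way to read this suggestion for the trendline-angle question is sketched below. It keeps the line equation in raw timestamps and only rescales the slope when computing the angle. Note two assumptions: the helper visual_angle_deg is hypothetical, and it divides by the visible span of each axis rather than by the maximum value, because epoch timestamps are all close to their maximum.

```python
import math

def visual_angle_deg(slope, x_span, y_span, aspect=1.0):
    """Angle of a line as it would appear in a chart window of x_span by y_span.

    slope is in price per millisecond; aspect is chart height / chart width
    (the scale factor applied after normalizing both axes to 1).
    """
    # Rescale both axes to unit length, then apply the on-screen aspect ratio.
    normalized_slope = slope * x_span / y_span * aspect
    return math.degrees(math.atan(normalized_slope))

# Hypothetical window: one day of data, prices spanning 10 dollars.
x_span = 24 * 60 * 60 * 1000   # milliseconds
y_span = 10.0                  # dollars

# A line that climbs the full price span across the window, and a flat line.
rising = visual_angle_deg(10.0 / x_span, x_span, y_span)
flat = visual_angle_deg(0.0, x_span, y_span)
angle_between = rising - flat
```

In raw ts/price units the rising line has an angle of practically zero degrees; after normalization it sits at about 45 degrees in a square window, which matches how it looks on a chart.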