What are xscale and yscale? - python

I am reading some code and noticed a big change when it uses pyplot.xscale('log') and pyplot.yscale('log'). The basic scatter plot looks like this:
and after adding:
plt.xscale('log')
plt.yscale('log')
the graph looks like this:
I went to look at the documentation, but there was not enough explanation. I am absolutely new to graphs and matplotlib, so can you please explain what xscale and yscale are and what their respective options like log, linear, symlog and logit do?
Thank you for the help.

You are setting the x and y axes of your plot to a logarithmic scale.
On a linear scale, the value increases by a fixed amount for every unit of distance along the axis, for example:
0 at 0cm
1 at 1cm
2 at 2cm
...
1000 at 10m
With a logarithmic scale, each step along the axis multiplies the value by a fixed factor. Example for powers of 10:
0.1 at 0cm (0 itself cannot appear on a log scale)
1 at 1cm
10 at 2cm
100 at 3cm
1000 at 4cm etc.
It is a way to display widely spread data in a more compact format.
See logarithmic scale on Wikipedia.
Your data has a cluster of values and an outlier: with a logarithmic scale the cluster is spread out over a visible range of the plot, while the large gap between the cluster and the outlier takes up less screen area.
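A minimal sketch of the effect (the clustered values and the single outlier below are invented just for illustration; ax.set_xscale / ax.set_yscale do the same thing as plt.xscale / plt.yscale):
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# a tight cluster of small values plus one large outlier
x = np.concatenate([rng.uniform(1, 10, 200), [5000]])
y = np.concatenate([rng.uniform(1, 10, 200), [8000]])

fig, (ax_lin, ax_log) = plt.subplots(1, 2, figsize=(10, 4))
ax_lin.scatter(x, y, s=10)
ax_lin.set_title('linear scale')

ax_log.scatter(x, y, s=10)
ax_log.set_xscale('log')   # same effect as plt.xscale('log') on the current axes
ax_log.set_yscale('log')
ax_log.set_title('log scale')
plt.show()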
Other examples of log plots:
https://plot.ly/python/log-plot/
matplotlib.pyplot.xscale
matplotlib.pyplot.yscale
matplotlib.scale.LogScale

Related

Is it possible to generate data given a peak's x-y location and height?

I am trying to create a 3D surface plot like the one linked here:
https://plotly.com/python/3d-surface-plots/
But the problem is that I only have limited data: I only have the peak location and the height of the peak, and the rest of the data is missing. In the example, the z-data needs 25 × 25 values (625 data points) to generate a valid surface plot.
My data looks something like this:
So my question is: is it possible to use some polynomial function, with the peak location as a constraint, to generate the z-data from the information I have?
I am open to any discussion; any suggestion is appreciated.
Though I don't like this form of interpolation, which is pretty artificial, you can use the following trick:
F(P) = (Σ Fk / d(P, Pk)) / (Σ 1 / d(P, Pk))
P is the point where you interpolate and Pk are the known peak positions. d is the Euclidean distance. (This gives sharp peaks; the squared distance gives smooth ones.)
Unfortunately, far from the peaks this formula tends towards the average of the Fk, giving a horizontal surface that lies above some of the Fk and turns them into downward peaks. You can work around this by adding fake peaks of negative height around your data set to lower that average.
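A minimal sketch of that weighted average, with invented peak positions and heights standing in for your data (the 25 × 25 grid mirrors the plotly example; the small eps only avoids division by zero exactly at a peak):
import numpy as np
import matplotlib.pyplot as plt

# hypothetical peaks: (x, y) positions and heights F_k
peaks_xy = np.array([[5.0, 5.0], [15.0, 10.0], [10.0, 20.0]])
peaks_f = np.array([3.0, 7.0, 5.0])

def idw(px, py, positions, values, power=1, eps=1e-9):
    # power=1 gives sharp peaks, power=2 (squared distance) smoother ones
    d = np.hypot(positions[:, 0] - px, positions[:, 1] - py) ** power + eps
    return np.sum(values / d) / np.sum(1.0 / d)

xs = np.linspace(0, 25, 25)
ys = np.linspace(0, 25, 25)
Z = np.array([[idw(x, y, peaks_xy, peaks_f) for x in xs] for y in ys])

X, Y = np.meshgrid(xs, ys)
ax = plt.figure().add_subplot(projection='3d')
ax.plot_surface(X, Y, Z, cmap='viridis')
plt.show()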

Pandas linear interpolation for geometrical X-Y data seems to ignore points

I am trying to upsample my dataframe in pandas (from 50 Hz to 2500 Hz). I have to upsample to match a sensor that was sampled at this higher frequency. I have points in x, y, z coming from a milling machine.
When I plot the original data, the lines look straight, as I would expect.
I am interpolating the dataframe like this:
df.drop_duplicates(subset='time', inplace=True)
df.set_index('time', inplace=True)
df.index = pd.DatetimeIndex(df.index)
upsampled = df.resample('0.4ms').interpolate(method='linear')
plt.scatter(upsampled['X[mm]'], upsampled['Y[mm]'], s=0.5)
plt.show()
I also tried with
upsampled = df.resample('0.4L').interpolate(method='linear')
I expect the new points to always fall between the original points. Since I am going from 50 Hz to 2500 Hz, I expect about 50 uniformly spaced points between each pair of original points. However, it seems that some of the original points are ignored, as can be seen in the picture below (the second picture is zoomed in on a particularly troublesome spot).
This figure shows the original points in orange and the upsampled, interpolated points in blue (both are drawn with scatter, although the upsampled points are so dense they appear as a continuous line). The code for this is shown below.
upsampled = df.resample('0.4ms').interpolate(method='linear')
plt.scatter(upsampled['X[mm]'], upsampled['Y[mm]'], s=0.5, c='blue')
plt.scatter(df['X[mm]'], df['Y[mm]'], s=0.5, c='orange')
plt.gca().set_aspect('equal', adjustable='box')
plt.show()
Any ideas how I could make the interpolation work?
Most likely the problem is that the timestamps in the original and resampled DataFrames are not aligned, so when resampling we need to specify how to deal with that.
Since the original is at 50 Hz and the resampled is at 2500 Hz, simply taking the mean first should fix it:
upsampled = df.resample('0.4ms').mean().interpolate(method='linear')
Unfortunately, without any sample data I cannot verify that this works. Please let me know if it helps.
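A self-contained sketch of that suggestion on synthetic 50 Hz data (the 'X[mm]' / 'Y[mm]' column names come from the question; the timestamps and values are invented):
import numpy as np
import pandas as pd

# 100 samples at 50 Hz, i.e. one row every 20 ms
t = pd.date_range('2021-01-01', periods=100, freq='20ms')
df = pd.DataFrame({'X[mm]': np.linspace(0, 10, 100),
                   'Y[mm]': np.linspace(0, 5, 100)}, index=t)

# bin onto the 2500 Hz grid first (mean collapses off-grid timestamps),
# then fill the empty 0.4 ms slots by linear interpolation
upsampled = df.resample('0.4ms').mean().interpolate(method='linear')
print(len(df), '->', len(upsampled))  # roughly 50 times more rows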

Plot to show data points spread over a large range on the x axis using Python

How can I show the variance of these data points over time? I used this plot to show them, but because the time axis runs from 0 to 20,000 seconds it is difficult to see the points clearly enough to judge variance or invariance: the points overlap each other.
After zooming in:
I finally solved this by subtracting each subject's minimum time from its timestamps. Now all the times start from 0 and the variance between subjects can be seen easily.
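A minimal sketch of that fix, assuming a long-format DataFrame with hypothetical 'subject' and 'time' columns:
import pandas as pd

df = pd.DataFrame({
    'subject': ['a', 'a', 'a', 'b', 'b', 'b'],
    'time': [10000, 10050, 10120, 18000, 18030, 18100],  # seconds
    'value': [1.0, 1.2, 0.9, 2.1, 2.0, 2.3],
})

# subtract each subject's own minimum time so every subject starts at 0
df['time_rel'] = df['time'] - df.groupby('subject')['time'].transform('min')
print(df)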
Normalize your axis to 1 by dividing by its maximum value; afterwards you can scale the axis by any factor X.

Visualizing line density using a color map, versus alpha transparency

I want to visualize 200k-300k lines, maybe up to 1 million lines, where each line is a cumulative sequence of integer values that grows over time, one value per day, over on the order of 1000 days. The final values of each line range from 0 to 500.
It's likely that some lines will appear in my population 1000s of times, others 100s of times, others 10s of times, and some outliers will be unique. For plotting large numbers of points in an x-y plane, alpha transparency can be a solution in some cases, but it isn't great if you want to robustly distinguish overplot density. A solution that scales better is something like hexbin, which bins the space and lets you use a color map to plot the density of points in each bin.
I haven't been able to find a ready-made solution in Python (ideally) or R that does the analogous thing for lines instead of points.
The following code demonstrates the issue using a small sample (n = 1000 lines): can anyone propose how I might drop the alpha-value approach in favor of a solution that lets me introduce a color map for line density, using a transform I can control?
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(2, size=(100, 1000)))  # 1000 lines of 0/1 steps
df.cumsum().plot(legend=False, color='grey', alpha=.1, figsize=(12, 8))
In response to a request, this is what a sample plot looks like now: in the wide dark band, about ten overplots fully saturate a line, so segments overplotted 10, 100, and 1,000 times are indistinguishable.
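One possible direction, sketched under the assumption that every line contributes exactly one value per day: bin the (day, value) points of all lines into a 2D histogram and map the counts to color with a normalization you control (LogNorm here; PowerNorm or a custom transform would work the same way).
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm

rng = np.random.default_rng(0)
walks = rng.integers(0, 2, size=(100, 1000)).cumsum(axis=0)  # days x lines

# count how many lines pass through each (day, value) cell
days = np.repeat(np.arange(walks.shape[0]), walks.shape[1])
vals = walks.ravel()
counts, xedges, yedges = np.histogram2d(
    days, vals, bins=[walks.shape[0], int(vals.max()) + 1])

fig, ax = plt.subplots(figsize=(12, 8))
mesh = ax.pcolormesh(xedges, yedges, counts.T, norm=LogNorm(vmin=1), cmap='viridis')
fig.colorbar(mesh, label='lines per cell')
ax.set_xlabel('day')
ax.set_ylabel('cumulative value')
plt.show()
The trade-off is that line connectivity is lost; only the per-cell density of line values remains.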

How to plot the output of scipy's k-means clustering with large number of clusters

My data is 2250 × 100. I would like to plot the output like http://glowingpython.blogspot.com/2012/04/k-means-clustering-with-scipy.html. However, all the examples use only a small number of clusters, usually 2 or 3. How would you plot the output of SciPy's kmeans if you wanted more clusters, like 100?
Here's what I got:
from scipy.cluster.vq import kmeans, vq

# get the centroids and assign each observation to its nearest one
centroids, _ = kmeans(data, 100)
idx, _ = vq(data, centroids)
# do some plotting here...
Maybe with 10 colors and 10 point types?
Or you could plot each in a 10 x 10 grid of plots. The first would show the relationships better. The second would allow easier inspection of an arbitrary cluster.
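A minimal sketch of the first suggestion (random whitened data stands in for the 2250 × 100 matrix, and the first two columns are used purely for display; a PCA projection would likely give a better 2D view):
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.vq import kmeans, vq, whiten

rng = np.random.default_rng(0)
data = whiten(rng.normal(size=(2250, 100)))  # stand-in for the real data

centroids, _ = kmeans(data, 100)
idx, _ = vq(data, centroids)

# 100 clusters as 10 colours x 10 marker shapes
colors = plt.cm.tab10.colors
markers = ['o', 's', '^', 'v', 'D', 'P', 'X', '*', '<', '>']

fig, ax = plt.subplots(figsize=(8, 8))
for k in range(100):
    pts = data[idx == k]
    ax.scatter(pts[:, 0], pts[:, 1], s=8,
               color=colors[k % 10], marker=markers[k // 10])
plt.show()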
