I have to plot a large amount of data in Python (a list of 3 million values). Are there any methods or libraries that make this easy? matplotlib does not seem to work.
What do you mean matplotlib does not work? It worked when I tried it. Is your data one-dimensional or multi-dimensional? Are you expecting to see 3 million ticks on the x axis? That would not be possible.
import numpy as np
import matplotlib.pyplot as plt

d = 3 * 10**6
a = np.random.rand(d)
a[0] = 5
a[-1] = -5
print(a.shape)
plt.plot(a)
plt.show()
The resulting plot:
I use matplotlib quite intensively to plot arrays of size n > 10**6.
You can use plt.xscale('log'), which lets you display your results across a wide range of x values.
Furthermore, if your dataset shows great disparity in value, you can use plt.yscale('log') to plot it nicely when you use the plt.plot() function.
If not (i.e. if you use imshow, hist2d and so on), you can put from matplotlib.colors import LogNorm in your preamble and pass the optional argument norm=LogNorm().
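For instance, a minimal sketch of the LogNorm approach (the synthetic data here is just an illustration):

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm

# Synthetic positive data spanning several orders of magnitude.
data = np.random.rand(200, 200) ** 6 + 1e-9
plt.imshow(data, norm=LogNorm())  # colour scale becomes logarithmic
plt.colorbar()
plt.show()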
One last thing: you shouldn't use numpy.loadtxt if the size of the text file is greater than your available RAM. In that case, the best option is to read the file line by line, even if it takes more time. You can speed up the subsequent numeric processing with from numba import jit and the decorator @jit(nopython=True, parallel=True).
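A minimal sketch of the line-by-line approach (the file name and row count are assumptions for illustration):

import numpy as np

n = 3_000_000                     # assumed known number of rows
a = np.empty(n)
with open("data.txt") as f:       # hypothetical file name
    for i, line in enumerate(f):  # never holds the whole file in RAM
        a[i] = float(line)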
With that in mind, you should be able to plot arrays of around ten million elements in a reasonably short time.
I'm trying to use imshow to plot a 2D Fourier transform of my data. However, imshow plots the data against its index in the array. I would like to plot the data against the arrays of corresponding frequency values I already have (one array for each dimension), but can't figure out how.
I have a 2D array of data (a Gaussian pulse signal) that I Fourier transform with np.fft.fft2. This all works fine. I then get the corresponding frequency bins for each dimension with np.fft.fftfreq(len(data))*sampling_rate. I can't figure out how to use imshow to plot the data against these frequencies, though. The 1D equivalent of what I'm trying to do is using plt.plot(x, y) rather than just plt.plot(y).
My first attempt was to use imshow's extent argument, but as far as I can tell that just changes the axis limits, not the actual bins.
My next solution was to use np.fft.fftshift to arrange the data in numerical order and then simply re-scale the axis using this answer: Change the axis scale of imshow. However, the mapping from index to frequency bin is not a pure scaling factor; there's typically a constant offset as well.
My next attempt was to use hist2d instead of imshow, but that doesn't work, since hist2d plots the number of times an ordered pair occurs, while I want to plot a scalar value corresponding to specific ordered pairs (i.e. the power of the signal at specific frequency combinations).
import numpy as np
import matplotlib.pyplot as plt
from scipy import signal

f = 200    # signal frequency
st = 2500  # sampling rate
x = np.linspace(-1, 1, 2 * st)
y = signal.gausspulse(x, fc=f, bw=0.05)
data = np.outer(np.ones(len(y)), y)  # a simple example with constant y
Fdata = np.abs(np.fft.fft2(data))**2
freqx = np.fft.fftfreq(len(x)) * st  # what I want to plot my data against
freqy = np.fft.fftfreq(len(y)) * st
plt.imshow(Fdata)
plt.show()
I should see a peak at (200, 0), corresponding to the frequency of my signal (with some fall-off around it corresponding to the bandwidth), but instead my maximum occurs at some seemingly random position corresponding to the frequency's index in my data array. If anyone has any ideas, fixes, or other functions to use, I would greatly appreciate it!
I cannot run your code, but I think you are looking for the extent= argument to imshow(). See the page on origin and extent for more information.
Something like this may work?
plt.imshow(Fdata, extent=(freqx[0], freqx[-1], freqy[0], freqy[-1]))
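Since np.fft.fftfreq returns frequencies in unshifted order (positive frequencies first, then negative), a fuller sketch, going beyond what the original answer states, would put everything in numerical order with fftshift before setting the extent:

Fshift = np.fft.fftshift(Fdata)  # shifts both axes to numerical order
fx = np.fft.fftshift(freqx)
fy = np.fft.fftshift(freqy)
plt.imshow(Fshift, extent=(fx[0], fx[-1], fy[0], fy[-1]), origin='lower')
plt.show()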
I have written a complicated piece of code. The code produces a set of numbers which I want to plot. The problem is that I cannot put those numbers in a list, since there are 2,700,000,000 of them.
So I need to plot one point, then produce the second point (the first point is replaced by the second, so the first one is erased, because I cannot store them). These numbers are generated in different sections of the code, so I need to hold the figure, as MATLAB's hold does.
To make this more concrete, I've written a simple version of the code here, and I want you to show me how to plot it.
import matplotlib.pyplot as plt

i = 0
j = 10
while i < 2700000000:
    plt.stem(i, j, '-')
    i = i + 1
    j = j + 2
plt.show()
Suppose I have billions of i and j!
Hmm, I'm not sure if I understood you correctly, but this:
import matplotlib.pyplot as plt

i = 0
j = 10
fig = plt.figure()
ax = fig.gca()
while i < 10000:       # fewer points for speed
    ax.stem([i], [j])  # need to provide iterable arguments to ax.stem
    i = i + 1
    j = j + 2
fig.show()
generates the following figure:
Isn't this what you're trying to achieve? After all, the input numbers aren't stored anywhere; they are just added to the figure as soon as they are generated. You don't really need an equivalent of MATLAB's hold: the figure won't be shown until you call fig.show() or plt.show().
Or are you trying to overcome the problem that you can't hold the matplotlib figure in your RAM? In that case my answer doesn't answer your question. Then you either have to save partial figures (only parts of the data) as pictures and combine them, as suggested in the comments, or think about an alternative way to show the data, as suggested in the other answer; one such alternative is sketched below.
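As a minimal sketch of such an alternative (my illustration, not part of either answer): bin each point into a fixed-size 2D histogram as it is generated, so memory use stays constant no matter how many points are produced.

import numpy as np
import matplotlib.pyplot as plt

nbins = 512
counts = np.zeros((nbins, nbins))
x_max, y_min, y_span = 10000, 10, 20000  # assumed known data ranges
i, j = 0, 10
while i < 10000:  # fewer points for speed, as above
    ix = min(int(i / x_max * nbins), nbins - 1)
    iy = min(int((j - y_min) / y_span * nbins), nbins - 1)
    counts[iy, ix] += 1  # the point itself is never stored
    i = i + 1
    j = j + 2
plt.imshow(counts, origin='lower',
           extent=(0, x_max, y_min, y_min + y_span), aspect='auto')
plt.show()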
I'm trying to get a nice upsampler in Python when I have non-uniformly spaced inputs. Any suggestions would be helpful. I've tried a number of interpolation functions. Here's an example:
from scipy.interpolate import InterpolatedUnivariateSpline
from numpy import linspace, arange, append
from matplotlib.pyplot import plot, show

F = [0, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500, 5000, 5500, 22050]
M = [0., 2.85, 2.49, 1.65, 1.55, 1.81, 1.35, 1.00, 1.13, 1.58, 1.21, 0.]
ff = linspace(F[0], F[1], 10)
for i in arange(2, len(F)):
    ff = append(ff, linspace(F[i-1], F[i], 10))
aa = InterpolatedUnivariateSpline(x=F, y=M, k=2)
mm = aa(ff)
plot(F, M, 'r-o'); plot(ff, mm, 'bo'); show()
This is the plot I get:
I need to get interpolated values that don't go below 0. Note that the blue dots go below zero. The red line represents the original F vs. M data. If I use k=1 (piecewise linear interpolation) then I get good values, as shown here:
aa = InterpolatedUnivariateSpline(x=F, y=M, k=1)
mm = aa(ff); plot(F, M, 'r-o'); plot(ff, mm, 'bo'); show()
The problem is that I need a "smooth" interpolation, not the piecewise-linear values. Does anyone know if the bbox argument of InterpolatedUnivariateSpline helps fix that? I can't find any documentation on what bbox does. Is there another, easier way to accomplish this?
Thanks in advance for any help.
Positivity-preserving interpolation is hard (if it wasn't, there wouldn't be a bunch of papers written about it). The splines of low degree (2, 3) usually do pretty well in this regard, but your data has that large gap in it, and it happens to be at the end of data range, making things worse.
One solution is to do the interpolation in two steps: first upsample the data by piecewise linear interpolation, then interpolate the new data with a smooth spline (I'll use a cubic spline below, though quadratic also works).
The gap_size array records how large each gap is relative to the smallest one. In the subsequent loop, large gaps (those at least twice the size of the smallest one) are filled with uniformly spaced points. The result is F_new, a nearly-uniform, finer grid that still includes the original points. The corresponding M values for it are generated by a piecewise linear spline.
Subsequent cubic interpolation produces a smooth curve that stays positive.
import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import InterpolatedUnivariateSpline

F = [0, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500, 5000, 5500, 22050]
M = [0., 2.85, 2.49, 1.65, 1.55, 1.81, 1.35, 1.00, 1.13, 1.58, 1.21, 0.]

# Size of each gap, relative to the smallest one.
gap_size = np.diff(F) // np.diff(F).min()
F_new = []
for i in range(len(F) - 1):
    F_new.extend(np.linspace(F[i], F[i+1], gap_size[i], endpoint=False))
F_new.append(F[-1])

# Step 1: piecewise linear interpolation onto the finer grid.
pl_spline = InterpolatedUnivariateSpline(F, M, k=1)
M_new = pl_spline(F_new)

# Step 2: smooth cubic interpolation through the upsampled data.
smooth_spline = InterpolatedUnivariateSpline(F_new, M_new, k=3)

ff = np.linspace(F[0], F[-1], 100)
plt.plot(F, M, 'ro')
plt.plot(ff, smooth_spline(ff), 'b')
plt.show()
Of course, no trick can hide the fact that we don't know what happens between 5500 and 22050 (Hz, I presume); the nearly-linear part there is just a placeholder.
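As an aside that goes beyond the original answer, scipy also provides a shape-preserving interpolator, PchipInterpolator, which by construction never overshoots the local data range, so non-negative data stays non-negative. A minimal sketch:

import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import PchipInterpolator

F = [0, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500, 5000, 5500, 22050]
M = [0., 2.85, 2.49, 1.65, 1.55, 1.81, 1.35, 1.00, 1.13, 1.58, 1.21, 0.]
pchip = PchipInterpolator(F, M)  # C1-smooth, no overshoot
ff = np.linspace(F[0], F[-1], 500)
plt.plot(F, M, 'ro')
plt.plot(ff, pchip(ff), 'b')
plt.show()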
I have a 3D ndarray object which contains spectral data (i.e. spatial x and y dimensions, and an energy dimension). I would like to extract the spectrum from each individual pixel and plot them all in a line plot. At present, I am doing this using np.ndenumerate along the axis I'm interested in, but it's quite slow. I was hoping to try np.apply_along_axis, to see if it was faster, but I keep getting a strange error.
What works:
# Set up the environment and generate sample data (much smaller than the real thing!)
import numpy as np
import matplotlib.pyplot as plt

ax = range(0, 10)              # the scale to use when plotting the axis of interest
ar = np.random.rand(4, 4, 10)  # the 3D data volume

# Plot all lines along axis 2 (i.e. the spectrum contained in each pixel)
# on a single line plot:
for (x, y) in np.ndenumerate(ar[:, :, 1]):
    plt.plot(ax, ar[x[0], x[1], :], alpha=0.5, color='black')
It is my understanding that this is basically a loop, which is less efficient than array-based methods, so I would like to try an approach using np.apply_along_axis to see if it's faster. This is my first attempt at Python, however, and I'm still finding out how it works, so please put me right if this idea is fundamentally flawed!
What I would like to try:
# Define a function to pass to apply_along_axis
def pa(y, x):
    if not np.all(np.isnan(y)):  # only plot if there is actually data there...
        plt.plot(x, y, alpha=0.15, color='black')
    return

# Check that the function actually works...
pa(ar[1, 1, :], ax)  # should produce a plot - does for me :)

# Try to apply it to the whole array, along the axis of interest:
np.apply_along_axis(pa, 2, ar, ax)  # does not work... booo!
The resulting error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-109-5192831ba03c> in <module>()
12 # pa(ar[1,1,:],ax)
13
---> 14 np.apply_along_axis(pa,2,ar,ax)
//anaconda/lib/python2.7/site-packages/numpy/lib/shape_base.pyc in apply_along_axis(func1d, axis, arr, *args)
101 holdshape = outshape
102 outshape = list(arr.shape)
--> 103 outshape[axis] = len(res)
104 outarr = zeros(outshape, asarray(res).dtype)
105 outarr[tuple(i.tolist())] = res
TypeError: object of type 'NoneType' has no len()
Any ideas on what's going wrong here, or advice on how to do this better, would be great.
Thanks!
apply_along_axis creates a new array from the output of your function.
You're returning None (by not returning anything). Thus the error. Numpy checks the length of the returned output to see if it makes sense for the new array.
Because you're not constructing a new array from the results, there's no reason to use apply_along_axis. It's not going to be any faster.
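For illustration (my example, not from the original answer), apply_along_axis is meant for functions that return a value for each 1D slice:

import numpy as np

ar = np.random.rand(4, 4, 10)
# np.mean returns a scalar for each 1D slice, so a 4x4 array is assembled.
means = np.apply_along_axis(np.mean, 2, ar)
print(means.shape)  # (4, 4)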
However, your current ndenumerate statement is exactly equivalent to:
import numpy as np
import matplotlib.pyplot as plt

ar = np.random.rand(4, 4, 10)  # the 3D data volume
# Flatten the spatial dimensions and transpose so each spectrum is a column;
# plt.plot draws one line per column.
plt.plot(ar.reshape(-1, 10).T, alpha=0.5, color='black')
In general, you probably want to do something like:
for pixel in ar.reshape(-1, ar.shape[-1]):
    plt.plot(x_values, pixel, ...)
That way you can easily iterate over the spectra at each pixel in your hyperspectral array.
Your bottleneck here probably isn't how you're using the array. Plotting each line separately with identical parameters like this is somewhat inefficient in matplotlib.
It will take slightly longer to construct, but a LineCollection will render much faster. (Basically, using a LineCollection tells matplotlib to not bother checking what the properties of each line are, and just pass them all to the low-level renderer to be drawn in the same way. You bypass a bunch of individual draw calls in favor of a single draw of a large object.)
On the downside, the code will be a bit less readable.
I'll add an example in a bit.
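In the meantime, a minimal sketch of what such a LineCollection version might look like (my sketch, reusing the 4x4x10 array from the question, not the answerer's promised example):

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.collections import LineCollection

ar = np.random.rand(4, 4, 10)
x = np.arange(ar.shape[-1])
# One (npoints, 2) array of vertices per spectrum.
segments = [np.column_stack([x, spec]) for spec in ar.reshape(-1, ar.shape[-1])]
fig, axes = plt.subplots()
# All lines are handed to the renderer in a single draw call.
axes.add_collection(LineCollection(segments, colors='black', alpha=0.5))
axes.autoscale()  # add_collection does not rescale the view automatically
plt.show()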
I am trying to create an array of random numbers using Numpy's random exponential distribution. I've got this working fine; however, I have one extra requirement for my project: the ability to specify precisely how many array elements have a certain value.
Let me explain (code is below, but I'll have a go at explaining it here): I generate my random exponential distribution and plot a histogram of the data, producing a nice exponential curve. What I really want is to be able to use a variable to specify the y-intercept of this curve (the point where the curve meets the y-axis). I can achieve this in a basic way by changing the number of bins in my histogram, but this only changes the plot, not the original data.
I have inserted the bones of my code here. For context, I am trying to create the exponential disc of a galaxy, so the random array I want to generate is an array of radii, and the variable I want to be able to specify is the number density in the centre of the galaxy:
import numpy as N
import matplotlib.pyplot as P

n = 1000
scale_radius = 2
# I would like this to be the controlling variable,
# even if its specification has knock-on effects on n.
central_surface_density = 100
radius_array = N.random.exponential(scale_radius, (n, 1))

P.figure()
nbins = 100
number_density, radii = N.histogram(radius_array, bins=nbins, density=False)
P.plot(radii[0:-1], number_density)
P.xlabel('$R$')
P.ylabel(r'$\Sigma$')
P.ylim(0, central_surface_density)
P.show()
This code creates the following histogram:
So, to summarise, I would like to be able to specify where this plot intercepts the y-axis by controlling how I've generated the data, not by changing how the histogram has been plotted.
Any help or requests for further clarification would be very much appreciated.
According to the docs for numpy.random.exponential, the input parameter beta is the scale, i.e. 1/lambda in the definition of the exponential distribution described on Wikipedia.
What you want is this density evaluated at x = 0, which is f(0) = lambda = 1/beta. Therefore, for a density-normalised distribution, the y-intercept is just the inverse of the scale you pass to the numpy function:
import numpy as np
import matplotlib.pyplot as plt

target = 250
beta = 1.0 / target
Y = np.random.exponential(beta, 5000)
plt.hist(Y, density=True, bins=200, lw=0, alpha=.8)
plt.plot([0, max(Y)], [target, target], 'r--')  # expected y-intercept
plt.ylim(0, target * 1.1)
plt.show()
Yes, the y-intercept of the histogram will change with different bin sizes, but that number doesn't mean anything by itself. The only thing you can reasonably talk about here is the underlying probability distribution (hence density=True).
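As a quick numerical sanity check (my addition, not part of the original answer), the first density-normalised bin should approach f(0) = 1/beta as the bins get narrower:

import numpy as np

target = 250
Y = np.random.exponential(1.0 / target, 200000)
counts, edges = np.histogram(Y, bins=200, density=True)
print(counts[0])  # ~250 for narrow bins (slightly below, since the bin has finite width)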