I have written a complicated piece of code. It produces a set of numbers that I want to plot. The problem is that I cannot put those numbers in a list, since there are 2,700,000,000 of them.
So I need to plot one point, then produce the next point (the previous point is overwritten, i.e. erased, because I cannot store them all). These numbers are generated in different sections of the code, so I need the equivalent of MATLAB's hold for the figure.
To make this easier to follow, here is a simplified version of the code; I want to know how to plot it.
import matplotlib.pyplot as plt

i = 0
j = 10
while i < 2700000000:
    plt.stem(i, j, '-')
    i = i + 1
    j = j + 2
plt.show()
Suppose I have billions of i and j!
Hmm, I'm not sure if I understood you correctly, but this:
import matplotlib.pyplot as plt

i = 0
j = 10
fig = plt.figure()
ax = fig.gca()
while i < 10000:       # Fewer points for speed.
    ax.stem([i], [j])  # ax.stem needs iterable arguments.
    i = i + 1
    j = j + 2
fig.show()
generates the following figure:
Isn't this what you're trying to achieve? After all, the input numbers aren't stored anywhere; they are just added to the figure as soon as they are generated. You don't really need MATLAB's hold equivalent: the figure won't be shown until you call fig.show() or plt.show() to show the current figure.
Or are you trying to overcome the problem that you can't hold the matplotlib figure in your RAM? In that case my answer doesn't answer your question. Then you either have to save partial figures (only parts of the data) as pictures and combine them, as suggested in the comments, or think about an alternative way to show the data, as suggested in the other answer.
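For the partial-figure route, here is a minimal sketch (not from the original answer); the chunk size and the toy i/j generator are placeholders for the real code:

import matplotlib.pyplot as plt

chunk_size = 100000  # placeholder: how many points to accumulate per partial figure
xs, ys = [], []
i, j = 0, 10
part = 0

while i < 2700000000:
    xs.append(i)
    ys.append(j)
    i = i + 1
    j = j + 2
    if len(xs) == chunk_size:
        fig, ax = plt.subplots()
        ax.plot(xs, ys, '.')
        fig.savefig("partial_{:05}.png".format(part))  # one picture per chunk
        plt.close(fig)           # free the figure's memory
        xs, ys = [], []          # drop the chunk; only chunk_size points live at a time
        part = part + 1

The individual PNG files can then be combined in an image editor or stitched together with an image library.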
Related
I have to plot a large amount of data in Python (a list of 3 million values). Are there any methods/libraries to plot it easily, since matplotlib does not seem to work?
What do you mean, matplotlib does not work? It worked when I tried it. Is your data 1-dimensional or multi-dimensional? Are you expecting to see 3 million ticks on the x axis? That would not be possible.
import numpy as np
import matplotlib.pyplot as plt

d = 3 * 10**6
a = np.random.rand(d)  # 3 million random values
a[0] = 5               # add two outliers so the full value range is visible
a[-1] = -5
print(a.shape)
plt.plot(a)
plt.show()
The resulting plot:
I use matplotlib quite intensively to plot arrays of size n > 10**6.
You can use plt.xscale('log'), which allows you to display your results.
Furthermore, if your dataset shows great disparity in value, you can use plt.yscale('log') to plot it nicely if you use the plt.plot() function.
If not (i.e. you use imshow, hist2d and so on), you can add from matplotlib.colors import LogNorm to your imports and pass the optional argument norm=LogNorm().
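As a rough illustration of both tricks, with made-up random data standing in for the real arrays:

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm

n = 10**6
data = np.abs(np.random.randn(n)) * 10.0**np.random.randint(0, 5, n)  # values spanning several orders of magnitude

# Line plot with a logarithmic y axis.
plt.figure()
plt.plot(data)
plt.yscale('log')

# 2D histogram with a logarithmic colour scale.
plt.figure()
x = np.random.randn(n)
y = np.random.randn(n)
plt.hist2d(x, y, bins=200, norm=LogNorm())
plt.colorbar()
plt.show()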
One last thing: you shouldn't use numpy.loadtxt if the size of the text file is greater than your available RAM. In that case, the best option is to read the file line by line, even if it takes more time. You can speed up the processing with from numba import jit and the decorator @jit(nopython=True, parallel=True).
With that in mind, you should be able to plot arrays of about ten million points in a reasonably short time.
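A minimal sketch of the line-by-line idea (without the numba part), assuming a hypothetical file huge.txt with one value per line and a known line count:

import numpy as np

n_lines = 10**7                      # assumed to be known in advance
values = np.empty(n_lines)

with open("huge.txt") as f:          # read line by line instead of numpy.loadtxt
    for i, line in enumerate(f):
        values[i] = float(line)      # only one line is held in memory at a time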
I'm trying to represent three different data sets on the same histogram, but one has 100 data points, one has 362, and one has 289. I'd like to scale the latter two by factors of 3.62 and 2.89 respectively, so they don't overshadow the 100-point one. I feel like this should be easy, but I'm not sure where to put my division; I feel like I've tried all the spots you can try. Here's how it is now:
plt.figure(figsize=(10, 6))
scale_pc = (1 / 3.62)  # this is the math I'd like to use, but where to put it?
scale_ar = (1 / 2.89)  # this is the math I'd like to use, but where to put it?
alldf2[alldf2['playlist'] == 1]['danceability'].hist(bins=35, color='orange', label='PopConnoisseur', alpha=0.6)
alldf2[alldf2['playlist'] == 2]['danceability'].hist(bins=35, color='green', label='Ambient', alpha=0.6)
alldf2[alldf2['playlist'] == 0]['danceability'].hist(bins=35, color='blue', label='Billboard', alpha=0.6)
plt.legend()
plt.xlabel('Danceability')
I've tried variations on this but none work:
alldf2[alldf2['playlist']==1]['danceability'].hist(bins=35, color='orange', label='PopConnoisseur', alpha=0.6)
alldf2[alldf2['playlist']==2]['danceability'/ 3.62].hist(bins=35, color='green',label='Ambient',alpha=0.6)
alldf2[alldf2['playlist']==0]['danceability'/ 2.89].hist(bins=35, color='blue',label='Billboard',alpha=0.6)
Any thoughts?
Edit: Here's the plot as it currently is:
The 2nd one for sure won't work, because this part is not valid:
'danceability' / 3.62
Inside the brackets you are selecting a column by its name; you can't divide the name string by a number like that. Moreover, even if something like that did work, it would divide the values in that column by 3.62, not return 100 data points...
Also, I am not sure what the problem is with having more data points in one histogram than the others; that's exactly what a histogram is meant to show, i.e. how many elements have a particular value.
Also, as Blazej said in the comment, give an example of the data so we can understand a bit better what you are trying to do. Specify what you want to achieve by using just 100 points.
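If the goal really is to scale the bar heights rather than the data, one option (not mentioned in the answer above) is the weights argument of hist; here is a rough sketch, reusing alldf2 and the scale factors from the question:

import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))

pc = alldf2[alldf2['playlist'] == 1]['danceability']
am = alldf2[alldf2['playlist'] == 2]['danceability']
bb = alldf2[alldf2['playlist'] == 0]['danceability']

# Each sample contributes 1/scale to its bar, so the 362- and 289-point sets
# are drawn as if they also contained roughly 100 points.
plt.hist(pc, bins=35, color='orange', label='PopConnoisseur', alpha=0.6)
plt.hist(am, bins=35, weights=[1 / 3.62] * len(am), color='green', label='Ambient', alpha=0.6)
plt.hist(bb, bins=35, weights=[1 / 2.89] * len(bb), color='blue', label='Billboard', alpha=0.6)

plt.legend()
plt.xlabel('Danceability')
plt.show()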
I'm calculating the flow in a lid-driven cavity, and I'm plotting the result with a quiver. I want to save the plot at every time step, but obviously, since the file name is always the same, only the last one is kept. How can I do it?
for n in range(nt):
    # Here I do all the calculation to obtain the new u and v
    uC = 0.5 * (u[:, 1:] + u[:, :-1])
    vC = 0.5 * (v[1:, :] + v[:-1, :])
    plt.cla()
    plt.quiver(x, y, uC, vC)
    plt.draw()
    plt.savefig("Instant1.png")
So, imagine nt = 10: I want to get ten different png files. Any ideas?
I appreciate all your help.
You could change the file name each time:
plt.savefig("Instant{}.png".format(n))
Also, if you have more than ten plots, it might be a good idea to use some leading zeroes, e.g. so that "Instant5.png" doesn't come after "Instant10.png" in lexicographic order:
plt.savefig("Instant{:03}.png".format(n))
You could also do:
plt.savefig("Instand"+str(n)+".png")
Let's say I have two histograms and I set the opacity using the hist parameter alpha=0.5.
I have plotted two histograms, yet I get three colors! I understand this makes sense from an opacity point of view.
But it makes it very confusing to show someone a graph of two things with three colors. Can I just somehow set the smallest bar for each bin to be in front, with no opacity?
Example graph
The usual way this issue is handled is to draw the bars with a small separation. This is done by default when plt.hist is given multiple sets of data:
import numpy as np
import matplotlib.pyplot as plt

x = 200 + 25 * np.random.randn(1000)
y = 150 + 25 * np.random.randn(1000)
n, bins, patches = plt.hist([x, y])
You instead wish to stack them on top of one another (this could be done above using the argument histtype='barstacked'), but notice that the ordering is incorrect.
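For reference, the stacked variant is just a different histtype; a minimal standalone sketch with the same made-up data as above:

import numpy as np
import matplotlib.pyplot as plt

x = 200 + 25 * np.random.randn(1000)
y = 150 + 25 * np.random.randn(1000)

# Stacked bars instead of the default side-by-side bars.
n, bins, patches = plt.hist([x, y], histtype='barstacked')
plt.show()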
This can be fixed by individually checking each pair of bins to see which is larger and then using zorder to set which one is drawn in front. For simplicity I am using the output of the code above (e.g. n is two stacked arrays containing the number of points in each bin for x and y):
n_x = n[0]
n_y = n[1]

for i in range(len(n[0])):
    # When the x bar is taller, draw the y bar in front (zorder=1); otherwise push it behind (zorder=0).
    if n_x[i] > n_y[i]:
        zorder = 1
    else:
        zorder = 0
    plt.bar(bins[:-1][i], n_x[i], width=10)
    plt.bar(bins[:-1][i], n_y[i], width=10, color="g", zorder=zorder)
Here is the resulting image:
By changing the ordering like this, the image looks very weird indeed; this is probably why it is not implemented and needs a hack to achieve. I would stick with the small-separation method: anyone used to these plots assumes the side-by-side bars share the same x-values.
I would like to create a stack of line plots using a LineCollection. The following code draws two identical sine curves offset from one another by (0, 0.2):
import matplotlib.pyplot as plt
import matplotlib.collections
import numpy as np

x = np.arange(1000)
y = np.sin(x / 50.)
l = list(zip(x, y))  # list() so the same coordinates can be reused for both lines

f = plt.figure()
a = f.add_subplot(111)
lines = matplotlib.collections.LineCollection((l, l), offsets=(0, 0.2))
a.add_collection(lines)
a.autoscale_view(True, True, True)
plt.show()
So far so good. The problem is that I'd like to be able to adjust that offset after creation. Using set_offsets doesn't seem to behave as I expect it to. The following, for instance, has no effect on the graph
a.collections[0].set_offsets((0, 0.5))
BTW, the other set commands (e.g. set_color) work as I expect. How do I change the spacing between curves after they have been created?
I think you found a bug in matplotlib, but I have a couple of workarounds. It looks like lines._paths gets generated in LineCollection.__init__ using the offsets you provide, and lines._paths is not properly updated when you call lines.set_offsets(). In your simple example, you can regenerate the paths, since you still have the originals lying around.
lines.set_offsets( (0., 0.2))
lines.set_segments( (l,l) )
You can also manually apply your offsets. Remember that you're modifying the offset points. So to get an offset of 0.2, you add 0.1 to your pre-existing offset of 0.1.
lines._paths[1].vertices[:,1] += 1
Thanks @matt for your suggestion. Based on that, I've hacked together the following, which shifts the curves according to the new offset values but takes the old offset values into account. This means I don't have to retain the original curve data. Something similar might be done to correct the set_offsets method of LineCollection, but I don't understand the details of the class well enough to risk it.
def set_offsets(newoffsets, ax=None, c_num=0):
    '''
    Modifies the offsets between curves of a LineCollection.
    '''
    if ax is None:
        ax = plt.gca()
    lcoll = ax.collections[c_num]
    oldoffsets = lcoll.get_offsets()
    if len(newoffsets) == 1:
        # Expand a single offset into one offset per path.
        newoffsets = [i * np.array(newoffsets[0])
                      for (i, j) in enumerate(lcoll.get_paths())]
    if len(oldoffsets) == 1:
        oldoffsets = [i * oldoffsets[0] for (i, j) in enumerate(newoffsets)]
    verts = [path.vertices for path in lcoll.get_paths()]
    for (oset, nset, vert) in zip(oldoffsets, newoffsets, verts):
        # Remove the old offset and apply the new one.
        vert[:, 0] += (-oset[0] + nset[0])
        vert[:, 1] += (-oset[1] + nset[1])
    lcoll.set_offsets(newoffsets)
    lcoll.set_paths(verts)
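A usage sketch, reusing the figure from the question (so a and the two sine curves already exist), under the assumption that the helper behaves as intended on the matplotlib version in question:

# Shift the second curve to sit 0.5 above the first, instead of the original 0.2.
set_offsets([(0, 0.5)], ax=a)
a.autoscale_view(True, True, True)
plt.draw()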