I'm trying to plot 20 million data points however it's taking an extremely long time (over an hour) using matplotlib,
Is there something in my code that is making this unusually slow?
import csv
import matplotlib.pyplot as plt
import numpy as np
import Tkinter
from Tkinter import *
import tkSimpleDialog
from tkFileDialog import askopenfilename
plt.clf()
root = Tk()
root.withdraw()
listofparts = askopenfilename() # asks user to select file
root.destroy()
my_list1 = []
my_list2 = []
k = 0
csv_file = open(listofparts, 'rb')
for line in open(listofparts, 'rb'):
current_part1 = line.split(',')[0]
current_part2 = line.split(',')[1]
k = k + 1
if k >= 2: # skips the first line
my_list1.append(current_part1)
my_list2.append(current_part2)
csv_file.close()
plt.plot(my_list1 * 10, 'r')
plt.plot(my_list2 * 10, 'g')
plt.show()
plt.close()
There is no reason whatsoever to have a line plot of 20000000 points in matplotlib.
Let's consider printing first:
The maximum figure size in matplotlib is 50 inch. Even having a high-tech plotter with 3600 dpi would give a maximum number of 50*3600 = 180000 points which are resolvable.
For screen applications it's even less: Even a high-tech 4k screen has a limited resolution of 4000 pixels. Even if one uses aliasing effects, there are a maximum of ~3 points per pixel that would still be distinguishable for the human eye. Result: maximum of 12000 points makes sense.
Therefore the question you are asking rather needs to be: How do I subsample my 20000000 data points to a set of points that still produces the same image on paper or screen.
The solution to this strongly depends on the nature of the data. If it is sufficiently smooth, you can just take every nth list entry.
sample = data[::n]
If there are high frequency components which need to be resolved, this would require more sophisticated techniques, which will again depend on how the data looks like.
One such technique might be the one shown in How can I subsample an array according to its density? (Remove frequent values, keep rare ones).
The following approach might give you a small improvement. It removes doing the split twice per row (by using Python's CSV library) and also removes the if statement by skipping over the two header lines before doing the loop:
import matplotlib.pyplot as plt
import csv
l1, l2 = [], []
with open('input.csv', 'rb') as f_input:
csv_input = csv.reader(f_input)
# Skip two header lines
next(csv_input)
next(csv_input)
for cols in csv_input:
l1.append(cols[0])
l2.append(cols[1])
plt.plot(l1, 'r')
plt.plot(l2, 'g')
plt.show()
I would say the main slow down though will still be the plot itself.
I would recommend switching to pyqtgraph. I switched to it because of speed issues while I was trying to make matplotlib plot real time data. Worked like a charm. Here's my real time plotting example.
Related
I have the following code, which reads in a set of (small) observations, runs a cross-correlation calculation on them, and then saves some plots:
import matplotlib.pyplot as plt
import numpy as np
import astropy.units as u
from sunkit_image.time_lag import cross_correlation, get_lags, max_cross_correlation, time_lag
time=np.linspace(0,43200,num=int(43200/12))
timeu = time * u.s
for i in range(len(folders)): # loop over all dates
os.chdir('/Volumes/LaCie/timelags/RARs/'+folders[i])
print(folders[i])
for j in range(len(pairs)): # iterates over every pair of data sets
for x in range(36): # sets up a sliding 2-hour window that shifts 20 min at a time
ch_a = np.load('dc'+pairs[j][0]+'.npy',allow_pickle=True)[()][100*x:(100*x)+600,:,:] # read in only necessary data (but entire file is only ~6 Gb)
ch_b = np.load('dc'+pairs[j][1]+'.npy',allow_pickle=True)[()][100*x:(100*x)+600,:,:] # read in only necessary data (but entire file is only ~6 Gb)
ctime= timeu[100*x:(100*x)+600] # sets up the correct time array
print('ctime range:',ctime[0],ctime[-1],len(ctime))
max_cc_map = max_cross_correlation(ch_a, ch_b, ctime)
tl_map = time_lag(ch_a, ch_b, ctime)
del ch_a # trying to deal with memory issue
del ch_b # trying to deal with memory issue
plt.close('all') # making sure I don't just create endless open plots
fig = plt.figure()
ax = fig.add_subplot()
im = ax.imshow(np.flip(tl_map,axis=0), cmap="cubehelix", vmin=-6000, vmax=6000)
cax = make_axes_locatable(ax).append_axes("right", size="5%", pad="10%")
fig.colorbar(im, cax=cax,label=r"$\tau_{AB}$ [s]")
plt.tight_layout()
fig.savefig('timelag_'+pairs[j][0]+'_'+pairs[j][1]+'_'+str(x)+'.png',dpi=400)
fig = plt.figure()
ax = fig.add_subplot()
im = ax.imshow(np.flip(max_cc_map,axis=0), cmap="plasma",vmin=0,vmax=1)
cax = make_axes_locatable(ax).append_axes("right", size="5%", pad="10%")
fig.colorbar(im, cax=cax,label=r"Max Cross-correlation")
plt.tight_layout()
fig.savefig('maxcc_'+pairs[j][0]+'_'+pairs[j][1]+'_'+str(x)+'.png',dpi=400)
fig=plt.figure(figsize=(10,6))
values_tl, bins_tl, bars = plt.hist(np.ravel(np.asarray(tl_map)),bins=np.arange(-6000,6000,12000/50),log=True,label='Time Lags')
values_masked, bins_masked, bars = plt.hist(np.ravel(np.asarray(tl_map)[np.where(np.asarray(max_cc_map) > 0.25)])
,bins=np.arange(-6000,6000,12000/50),log=True,label='Masked CC > 0.25')
values_masked2, bins_masked2, bars = plt.hist(np.ravel(np.asarray(tl_map)[np.where(np.asarray(max_cc_map) > 0.5)])
,bins=np.arange(-6000,6000,12000/50),log=True,label='Masked CC > 0.5')
values_masked3, bins_masked3, bars = plt.hist(np.ravel(np.asarray(tl_map)[np.where(np.asarray(max_cc_map) > 0.75)])
,bins=np.arange(-6000,6000,12000/50),log=True,label='Masked CC > 0.75')
plt.ylabel('Pixel Occurrence')
plt.legend()
fig.savefig('hist_tl_cc_'+pairs[j][0]+'_'+pairs[j][1]+'_'+str(x)+'.png',dpi=400)
As noted in the comments, I've inserted a few lines to try to dump unnecessary data between iterations; I know a 3-deep for loop isn't the most efficient way to code, but the loop over the dates and channel pairs are very short -- almost all of the time/memory is spent in the innermost loop. The problem is that after a few minutes, the memory usage is oscillating between 30-55 GB. My Mac is becoming sluggish, and it's only at the beginning of the dataset. Is there something I'm missing here? Even if the entire files were being read in at the beginning instead of a subset, it's only ~ 12 Gb of data, and the code would crash if I was reading in the whole thing (i.e., it's definitely only reading in part of the raw data). I tried a with statement but that didn't take up less memory. Any suggestions would be very welcome!
Per loop you create 3 figures but you never close them. After each fig.savefig(...), you should close the figure with plt.close(fig).
I am working with the pyaudio and matplotlib packages for the first time and I am attempting to plot live audio data from microphone input, transform it to frequency domain information, and then output peaks with an input distance. This project is a modification of the three-part guide to build a spectrum analyzer found here.
Currently the code is formatted in a class as I have alternative methods that I am applying to the audio but I am only posting the class with the relevant methods as they don't make reference to each and are self-contained. Another quirk of the program is that it calls upon a local file though it only uses input from the user microphone; this is a leftover from the original functionality of plotting a sound file's intensity while it played and is no longer integral to the code.
import pyaudio
import wave
import struct
import pandas as pd
from scipy.fftpack import fft
from scipy.signal import find_peaks
import matplotlib.pyplot as plt
import numpy as np
class Wave:
def __init__(self, file) -> None:
self.CHUNK = 1024 * 4
self.obj = wave.open(file, "r")
self.callback_output = []
self.data = self.obj.readframes(self.CHUNK)
self.rate = 44100
# Initiate an instance of PyAudio
self.p = pyaudio.PyAudio()
# Open a stream with the file specifications
self.stream = self.p.open(format = pyaudio.paInt16,
channels = self.obj.getnchannels(),
rate = self.rate,
output = True,
input = True,
frames_per_buffer = self.CHUNK)
def fft_plot(self, distance: float):
x_fft = np.linspace(0, self.rate, self.CHUNK)
fig, ax = plt.subplots()
line_fft, = ax.semilogx(x_fft, np.random.rand(self.CHUNK), "-", lw = 2)
# Bind plot window sizes
ax.set_xlim(20, self.rate / 2)
plot_data = self.stream.read(self.CHUNK)
self.data_int = pd.DataFrame(struct.unpack(\
str(self.CHUNK * 2) + 'h', plot_data)).astype(dtype = "b")[::2]
y_fft = fft(self.data_int)
line_fft.set_ydata(np.abs(y_fft[0:self.CHUNK]) / (256 * self.CHUNK))
plt.show(block = False)
while True:
# Read incoming audio data
data = self.stream.read(self.CHUNK)
# Convert data to bits then to array
self.data_int = struct.unpack(str(4 * self.CHUNK) + 'B', data)
# Recompute FFT and update line
yf = fft(self.data_int)
line_data = np.abs(yf[0:self.CHUNK]) / (128 * self.CHUNK)
line_fft.set_ydata(line_data)
# Find all values above threshold
peaks, _ = find_peaks(line_data, distance = distance)
# Update the plot
plt.plot(peaks, line_data[peaks], "x")
fig.canvas.draw()
fig.canvas.flush_events()
# Exit program when plot window is closed
fig.canvas.mpl_connect('close_event', exit)
test_file = "C:/Users/Tam/Documents/VScode/Final Project/PrismGuitars.wav"
audio_test = Wave(test_file)
audio_test.fft_plot(2000)
The code does not throw any errors and runs fine with an okay framerate and only terminates when the plot window is closed, all of which is good. The issue I'm encountering is with the determination and plotting of the peaks of line_data as when I run this code the output over time looks like this matplotlib graph instance.
It seems that the peaks (or peak) are being found but at a lower frequency than the x of line_data and as such are shifted comparatively. The other, more minor, issue is that since this is a live plot I would like to clear the previous instance of the peak marker so that it only shows the current instance and not all of the ones plotted prior.
I have attempted in prior fixes to use the line_fft in the peak detection but as it is cast to a Line2D format the peak detection algorithm isn't able to deal with the data type. I have also tried implementing a list comprehension as seen in this post but the time to cast to list is prohibitively slow and did not return any peak markers when I ran it.
EDIT: Following Jody's input the program now returns the proper values as I was only printing an index for the x-coordinate of the peak marker. Nevertheless I would still appreciate some insight as to whether it is possible to update per marker rather than having all the previous ones constantly displayed.
As for the marker updating I have attempted to clear the plot in the while loop both before and after drawing the markers (in different tests of course) but I only ever end up with a completely blank graph.
Please let me know if there is anything I should clarify and thank you for your time.
As Jody pointed out the peaks variable contains indexes for the detected peaks that then need to be retrieved from x_fft and line_data in order to match up with the displayed data.
First we create a scatter plot:
scat = ax.scatter([], [], c = "purple", marker = "x")
This data can then be stacked using a container variable in the while loop as such:
array_peaks = np.c_[x_fft[peaks], line_data[peaks]]
and update the data in the while loop with:
scat.set_offsets(array_peaks)
got an issue with plotting x,y data. x is the time series, y the value as y(x). Data3.txt is a text file simply containing all the data with no headers in a matrix with 20 columns and 65534 rows.
Here is the code
import csv
import numpy as np
import matplotlib.pyplot as plt
dates = []
with open('Data3.txt') as csvDataFile:
csvReader = csv.reader(csvDataFile,quoting=csv.QUOTE_NONNUMERIC)
for row in csvReader:
dates.append(row)
np.array(map(float, dates))
time=[]
value=[]
samples=8000
for row in dates:
time.append(row[0])
value.append(row[1])
print(len(time))
print(len(time[:samples]))
plt.plot(time[:samples], value[:samples])
plt.ylim(0,40)
plt.xlim(0,1200)
plt.show()
The plot is shown until I set samples to 7000 - see attached Figure_1. As soon as I set samples to 8000 it takes a lot longer to plot & the outcome is Figure_2.
print(len(time))
gives 65543
print(len(time[:samples]))
gives 8000
Really confused by that. Can anyone explain where this error might come from? Any hint concerning a smarter way of plotting is much appreciated, as well. I would now align every column one listname & make the plots.
Here is a link to the file with the information in 'sunspots.txt'. With the exception of external modules matploblib.pyplot and seaborn, how could one compute the running average without importing external modules like numpy and future? (If it helps, I can linspace and loadtxt without numpy.)
If it helps, my code thus far is posted below:
## open/read file
f2 = open("/Users/location/sublocation/sunspots.txt", 'r')
## extract data
lines = f2.readlines()
## close file
f2.close()
t = [] ## time
n = [] ## number
## col 1 == col[0] -- number identifying which month
## col 2 == col[1] -- number of sunspots observed
for col in lines: ## 'col' can be replaced by 'line' iff change below is made
new_data = col.split() ## 'col' can be replaced by 'line' iff change above is made
t.append(float(new_data[0]))
n.append(float(new_data[1]))
## extract data ++ close file
## check ##
# print(t)
# print(n)
## check ##
## import
import matplotlib.pyplot as plt
import seaborn as sns
## plot
sns.set_style('ticks')
plt.figure(figsize=(12,6))
plt.plot(t,n, label='Number of sunspots oberved monthly' )
plt.xlabel('Time')
plt.ylabel('Number of Sunspots Observed')
plt.legend(loc='best')
plt.tight_layout()
plt.savefig("/Users/location/sublocation/filename.png", dpi=600)
The question is from the weblink from this university (p.11 of the PDF, p.98 of the book, Exercise 3-1).
Before marking this as a duplicate:
A similar question was posted here. The difference is that all posted answers require importing external modules like numpy and future whereas I am trying to do without external imports (with the exceptions above).
Noisy data that needs to be smoothed
y = [1.0016, 0.95646, 1.03544, 1.04559, 1.0232,
1.06406, 1.05127, 0.93961, 1.02775, 0.96807,
1.00221, 1.07808, 1.03371, 1.05547, 1.04498,
1.03607, 1.01333, 0.943, 0.97663, 1.02639]
Try a running average with a window size of n
n = 3
Each window can by represented by a slice
window = y[i:i+n]
Need something to store the averages in
averages = []
Iterate over n-length slices of the data; get the average of each slice; save the average in another list.
from __future__ import division # For Python 2
for i in range(len(y) - n):
window = y[i:i+n]
avg = sum(window) / n
print(window, avg)
averages.append(avg)
When you plot the averages you'll notice there are fewer averages than there are samples in the data.
Maybe you could import an internal/built-in module and make use of this SO answer -https://stackoverflow.com/a/14884062/2823755
Lots of hits searching with running average algorithm python
I'm working with a Geiger counter which can be hooked up to a computer and which records its output in the form of a .txt file, NC.txt, where it records the time since starting and the 'value' of the radiation it recorded. It looks like
import pylab
import scipy.stats
import numpy as np
import matplotlib.pyplot as plt
x1 = []
y1 = []
#Define a dictionary: counts
f = open("NC.txt", "r")
for line in f:
line = line.strip()
parts = line.split(",") #the columns are separated by commas and spaces
time = float(parts[1]) #time is recorded in the second column of NC.txt
value = float(parts[2]) #and the value it records is in the third
x1.append(time)
y1.append(value)
f.close()
xv = np.array(x1)
yv = np.array(y1)
#Statistics
m = np.mean(yv)
d = np.std(yv)
#Strip out background radiation
trueval = yv - m
#Basic plot of counts
num_bins = 10000
plt.hist(trueval,num_bins)
plt.xlabel('Value')
plt.ylabel('Count')
plt.show()
So this code so far will just create a simple histogram of the radiation counts centred at zero, so the background radiation is ignored.
What I want to do now is perform a chi-squared test to see how well the data fits, say, Poisson statistics (and then go on to compare it with other distributions later). I'm not really sure how to do that. I have access to scipy and numpy, so I feel like this should be a simple task, but just learning python as I go here, so I'm not a terrific programmer.
Does anyone know of a straightforward way to do this?
Edit for clarity: I'm not asking so much about if there is a chi-squared function or not. I'm more interested in how to compare it with other statistical distributions.
Thanks in advance.
You can use SciPy library, here is documentation and examples.