Problem:
Here I plot two datasets stored in text files (in the list dataset), each containing 21.8 billion data points. That is too much data to hold in memory as an array. I can still graph them as histograms, but I'm unsure how to compare the two distributions via a two-sample KS test, because I cannot figure out how to access each histogram from the plt object.
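(For reference on that last point: plt.hist itself returns the bin counts, bin edges, and patches, so the counts never have to be pulled back out of the figure afterwards. A minimal illustration:)
import numpy as np
import matplotlib.pyplot as plt

# plt.hist returns (counts, bin_edges, patches) for the data it just binned.
counts, edges, _ = plt.hist(np.random.normal(100, 30, 10000), bins=100)
print(counts.sum(), edges[0], edges[-1])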
Example:
Here is some code to generate dummy data:
import numpy as np

mu = [100, 120]
sigma = 30
dataset = ['gsl_test_1.txt', 'gsl_test_2.txt']
for idx, file in enumerate(dataset):
    dist = np.random.normal(mu[idx], sigma, 10000)
    with open(file, 'w') as g:
        for s in dist:
            g.write('{}\t{}\t{}\n'.format('stuff', 'stuff', str(s)))
This builds my two histograms (approach adapted from here):
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

chunksize = 1000
dataset = ['gsl_test_1.txt', 'gsl_test_2.txt']
for fh in dataset:
    # find the min, max, line qty, for bins
    low = np.inf
    high = -np.inf
    loop = 0
    for chunk in pd.read_table(fh, header=None, chunksize=chunksize, delimiter='\t'):
        low = np.minimum(chunk.iloc[:, 2].min(), low)
        high = np.maximum(chunk.iloc[:, 2].max(), high)
        loop += 1
    lines = loop * chunksize  # upper bound; the last chunk may be shorter
    nbins = math.ceil(math.sqrt(lines))
    bin_edges = np.linspace(low, high, nbins + 1)
    total = np.zeros(nbins, np.int64)  # np.ndarray of zeros; was np.uint32, changed to int64
    for chunk in pd.read_table(fh, header=None, chunksize=chunksize, delimiter='\t'):
        # compute bin counts over the 3rd column
        subtotal, e = np.histogram(chunk.iloc[:, 2], bins=bin_edges)  # int64 counts
        # accumulate bin counts over chunks
        total += subtotal
    plt.hist(bin_edges[:-1], bins=bin_edges, weights=total)
plt.savefig('gsl_test_hist.svg')
Question:
Most examples of the KS statistic work on two arrays of raw observations, but I don't have enough memory for that approach. Given the example above, how can I use these precomputed bins (from 'gsl_test_1.txt' and 'gsl_test_2.txt') to compute the KS statistic between the two distributions?
Bonus karma:
Record the KS statistic and p-value on the graph!
I cleaned up your code a bit: it writes to StringIO instead of an actual file, which is more streamlined, and it sets the default style with seaborn so the plots look a bit more modern. The bin edges need to be the same for both samples if you want the statistical test to line up, and building the bins by iterating through each file twice takes longer than it needs to.
A collections.Counter could be useful here because you would only have to loop through once, and you'd get the same bin size for both files (convert the floats to ints before counting, since you're binning them anyway): from collections import Counter, then C = Counter() and C[value] += 1. At the end you have a dict and can build the bins from list(C.keys()). That would help, given how large your data is. You should also see whether you can read chunks with numpy instead of pandas, because numpy is much faster at indexing; try %timeit on DF.iloc[i, j] versus ARRAY[i, j] and you'll see what I mean. I moved most of the work into a function to make it more modular.
import numpy as np
import pandas as pd
import math
import matplotlib.pyplot as plt
from io import StringIO
from scipy.stats import ks_2samp
import seaborn as sns; sns.set()
%matplotlib inline

# Added seaborn because it looks nicer
mu = [100, 120]
sigma = 30

def write_random(file, mu, sigma=30):
    dist = np.random.normal(mu, sigma, 10000)
    for i, s in enumerate(dist):
        file.write('{}\t{}\t{}\n'.format("label_A-%d" % i, "label_B-%d" % i, str(s)))
    return file

# Writing to StringIO instead of an actual file
gs1_test_1 = write_random(StringIO(), mu=100)
gs1_test_2 = write_random(StringIO(), mu=120)

chunksize = 1000

def make_hist(fh, ax):
    # find the min, max, line qty, for bins
    low = np.inf
    high = -np.inf
    loop = 0
    fh.seek(0)
    for chunk in pd.read_table(fh, header=None, chunksize=chunksize, sep='\t'):
        low = np.minimum(chunk.iloc[:, 2].min(), low)    # iloc is much slower than numpy array indexing
        high = np.maximum(chunk.iloc[:, 2].max(), high)  # you might want to read the chunks with numpy instead
        loop += 1
    lines = loop * chunksize
    nbins = math.ceil(math.sqrt(lines))
    bin_edges = np.linspace(low, high, nbins + 1)
    total = np.zeros(nbins, np.int64)  # np.ndarray of zeros; was np.uint32, changed to int64
    fh.seek(0)
    for chunk in pd.read_table(fh, header=None, chunksize=chunksize, sep='\t'):
        # compute bin counts over the 3rd column
        subtotal, e = np.histogram(chunk.iloc[:, 2], bins=bin_edges)  # int64 counts
        # accumulate bin counts over chunks
        total += subtotal
    ax.hist(bin_edges[:-1], bins=bin_edges, weights=total, alpha=0.5)
    return ax, bin_edges, total

# Make the plot canvas up front and hand it to the function
fig, ax = plt.subplots()
test_1_data = make_hist(gs1_test_1, ax)
test_2_data = make_hist(gs1_test_2, ax)
# test_1_data[1] == test_2_data[1] -- the bin edges should be the same if you're going to compare them
ax.set_title("ks: %f, p-value: %f" % ks_2samp(test_1_data[2], test_2_data[2]))
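Note that ks_2samp expects raw observations, so passing it the two arrays of bin counts (as above) is not a true two-sample KS test of the underlying data. Because both histograms share the same bin edges, you can instead build the two empirical CDFs from the counts with a cumulative sum and take their maximum absolute difference; this recovers the KS statistic up to the bin resolution. A minimal sketch (the p-value line is my own addition, using the usual large-sample approximation, not something from the code above):
from scipy.stats import kstwobign

def ks_from_binned_counts(counts_a, counts_b):
    # Empirical CDFs from bin counts that share the same bin edges.
    cdf_a = np.cumsum(counts_a) / counts_a.sum()
    cdf_b = np.cumsum(counts_b) / counts_b.sum()
    d = np.max(np.abs(cdf_a - cdf_b))
    # Asymptotic two-sided p-value for the two-sample test (large-n approximation).
    n_a, n_b = counts_a.sum(), counts_b.sum()
    en = np.sqrt(n_a * n_b / (n_a + n_b))
    return d, kstwobign.sf(en * d)

d, p = ks_from_binned_counts(test_1_data[2], test_2_data[2])
ax.set_title("ks: %f, p-value: %f" % (d, p))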
Related
Earlier today I posted this question. I now have an MRE that reproduces the issue.
In short, this piece of code seems to use much more memory than it should. The idea is to average a number of time traces into a certain number of bins; the traces are arranged in a matrix indexed by a pd.MultiIndex.
import numpy as np
import pandas as pd

# Length of each trace
trace_len = 2500
# Number of bins
bin_num = 300
# Traces matrix dimensions
L1 = 70
L2 = 100

index = pd.MultiIndex.from_product([range(L1), range(L2)])
traces = np.random.random((L1*L2, trace_len))
traces_df = pd.DataFrame(traces, index=index)

# Let's make 300 random bins
bins = [index.to_frame().sample(frac=1, replace=True) for _ in range(bin_num)]
bins = [pd.MultiIndex.from_frame(bin) for bin in bins]

def bin_single(traces: pd.DataFrame, bin_idx: pd.Index) -> np.ndarray:
    """Cumulative sum of all shots that are both in traces and bin_idx"""
    bin_idx = bin_idx.intersection(traces.index)
    binned = traces.reindex(bin_idx, copy=False)
    return binned.sum(axis=0, skipna=False).to_numpy()

output = np.empty((bin_num, trace_len))
for n, bin in enumerate(bins):
    output[n] = bin_single(traces_df, bin)

print(output.nbytes)
This is the memory allocation over time:
The issue cannot be due to lazy allocation of output, since that array is only 6 MB, as reported by output.nbytes, while the overall memory allocation grows by more than 200 MB over the for loop.
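For reference, one way to track the allocation growth across the loop is with tracemalloc (a sketch; memory_profiler/mprof would work just as well, and the exact numbers will depend on the pandas version):
import tracemalloc

tracemalloc.start()
for n, bin in enumerate(bins):
    output[n] = bin_single(traces_df, bin)
    if n % 50 == 0:
        current, peak = tracemalloc.get_traced_memory()
        print(f"bin {n}: current={current / 1e6:.1f} MB, peak={peak / 1e6:.1f} MB")
tracemalloc.stop()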
I think the problem might be hidden in the pd.MultiIndex usage, since this very similar program that does not use MultiIndex does not show the memory increase:
import numpy as np
import pandas as pd

# Length of each trace
trace_len = 2500
# Number of bins
bin_num = 300
# Traces matrix dimensions
L1 = 70
L2 = 100

# index = pd.MultiIndex.from_product([range(L1), range(L2)])
traces = np.random.random((L1*L2, trace_len))
traces_df = pd.DataFrame(traces)
index = traces_df.index

# Let's make 300 random bins
bins = [index.to_frame().sample(frac=1, replace=True) for _ in range(bin_num)]
bins = [pd.MultiIndex.from_frame(bin) for bin in bins]

def bin_single(traces: pd.DataFrame, bin_idx: pd.Index) -> np.ndarray:
    """Cumulative sum of all shots that are both in traces and bin_idx"""
    bin_idx = bin_idx.intersection(traces.index)
    binned = traces.reindex(bin_idx, copy=False)
    return binned.sum(axis=0, skipna=False).to_numpy()

output = np.empty((bin_num, trace_len))
for n, bin in enumerate(bins):
    output[n] = bin_single(traces_df, bin)

print(output.nbytes)
I tend to think that there might be a bug somewhere in pd.MultiIndex, but maybe I'm just overlooking something.
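For what it's worth, one way to take the MultiIndex machinery out of the inner loop is to translate each bin into integer row positions once (via Index.get_indexer) and sum the underlying numpy array directly. This is only a sketch that mirrors the intersection-then-sum semantics of bin_single above, not a confirmed fix for the memory growth:
values = traces_df.to_numpy()   # one dense float array, indexed positionally
row_index = traces_df.index

output = np.empty((bin_num, trace_len))
for n, bin in enumerate(bins):
    # get_indexer returns -1 for labels not present in the index; drop those
    # and de-duplicate, which mirrors intersection() in the original version.
    pos = row_index.get_indexer(bin)
    pos = np.unique(pos[pos >= 0])
    output[n] = values[pos].sum(axis=0)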
Thanks a lot!
The script below is a mixture of Stack Overflow answers on different topics, all closely related to finding peaks in signals. Finding peaks based on prominence, as noted here, works incredibly well, but my issue is that I need to find the lowest point immediately after each peak. The dataset is the fluorescence signal of a plant captured over 14 continuous hours, and the peaks are saturating pulses used to determine saturation under light conditions. A picture of the dataset (a 68 MB CSV file) is below:
This is my python script:
import pandas as pd
import numpy as np
from datetime import datetime
import matplotlib.pyplot as plt
from scipy.signal import find_peaks
# A parser is required to translate the timestamp
custom_date_parser = lambda x: datetime.strptime(x, "%d-%m-%Y %H:%M_%S.%f")
df = pd.read_csv('15-01-2022_05_00.csv', parse_dates=[ 'Timestamp'], date_parser=custom_date_parser)
x = df['Timestamp']
y = df['Mean_values']
# As per accepted answer here:
#https://stackoverflow.com/questions/1713335/peak-finding-algorithm-for-python-scipy
peaks, _ = find_peaks(y, prominence=1)
# Invert the data to find the lowest points of peaks as per answer here:
#https://stackoverflow.com/questions/61365881/is-there-an-opposite-version-of-scipy-find-peaks
valleys, _ = find_peaks(-y, prominence=1)
print(y[peaks])
print(y[valleys])
plt.subplot(2, 1, 1)
plt.plot(peaks, y[peaks], "ob"); plt.plot(y); plt.legend(['Prominence'])
plt.subplot(2, 1, 2)
plt.plot(valleys, y[valleys], "vg"); plt.plot(y); plt.legend(['Prominence Inverted'])
plt.show()
As you can see in the picture, not all of the 'prominence inverted' points fall below their respective peak; some sit adjacent to the previous peak (difficult to see in the picture). The prominence-inverted approach comes from the post linked above, and it simply inverts the dataset. Peaks and valleys below:
Peaks
1817 109.587178
3674 89.191393
56783 72.779385
111593 77.868118
166403 83.288949
221213 84.955026
276023 84.340550
330833 83.186605
385643 81.134827
440453 79.060960
495264 77.457803
550074 76.292243
604884 75.867575
659694 75.511924
714504 74.221657
769314 73.830941
824125 76.977637
878935 78.826169
933745 77.819844
988555 77.298089
1043365 77.188105
1098175 75.340765
1152985 74.311185
1207796 73.163844
1262606 72.613317
1317416 73.460068
1372226 70.388324
1427036 70.835355
1481845 70.154085
Valleys
2521 4.669368
56629 26.551585
56998 26.184984
111791 26.288734
166620 27.717165
221434 28.312708
330432 28.235397
385617 27.535091
440341 26.886589
495174 26.379043
549353 26.040947
550239 25.760023
605051 25.594147
714352 25.354300
714653 25.008184
769472 24.883584
824284 25.135316
879075 25.477464
933907 25.265173
988711 25.160046
1097917 25.058851
1098333 24.626667
1153134 24.357835
1207943 23.982878
1262750 23.938298
1371013 23.766077
1372381 23.351263
1427187 23.368314
Any ideas about this awkward result on the valleys?
You complicate your task by trying to find all valleys. This will always be difficult because they do not stand out as well as your peaks in comparison to the surrounding data. Whatever your parameters for find_peaks, sometimes it will identify two valleys after a peak, sometimes none. Instead, just identify the local minimum after each peak:
import pandas as pd
import matplotlib.pyplot as plt
from scipy.signal import find_peaks
#sample data
from scipy.misc import electrocardiogram
x = electrocardiogram()[2000:4000]
date_range = pd.date_range("20210116", periods=x.size, freq="10ms")
df = pd.DataFrame({"Timestamp": date_range, "Mean_values": x})
x = df['Timestamp']
y = df['Mean_values']
fig, (ax1, ax2, ax3) = plt.subplots(3, figsize=(12, 8))
#peak finding
peaks, _ = find_peaks(y, prominence=1)
ax1.plot(x[peaks], y[peaks], "ob")
ax1.plot(x, y)
ax1.legend(['Prominence'])
#valley finder general
valleys, _ = find_peaks(-y, prominence=1)
ax2.plot(x[valleys], y[valleys], "vg")
ax2.plot(x, y)
ax2.legend(['Valleys without filtering'])
#valley finding restricted to a short time period after a peak
#set time window, e.g., for 200 ms
time_window_size = pd.Timedelta(200, unit="ms")
time_of_peaks = x[peaks]
peak_end = x.searchsorted(time_of_peaks + time_window_size)
#in case of evenly spaced data points, this can be simplified
#and you just add n data points to your peak index array
#peak_end = peaks + n
true_valleys = peaks.copy()
for i, (start, stop) in enumerate(zip(peaks, peak_end)):
    true_valleys[i] = start + y[start:stop].argmin()
ax3.plot(x[true_valleys], y[true_valleys], "sr")
ax3.plot(x, y)
ax3.legend(['Valleys after events'])
plt.show()
Sample output:
I am not sure what you intend to do with these minima, but if you are only interested in baseline shifts, you can directly calculate the peak-wise baseline values like
baseline_per_peak = peaks.copy().astype(float)
for i, (start, stop) in enumerate(zip(peaks, peak_end)):
    baseline_per_peak[i] = y[start:stop].mean()
print(baseline_per_peak)
Sample output:
[-0.71125 -0.203 0.29225 0.72825 0.6835 0.79125 0.51225 0.23
0.0345 -0.3945 -0.48125 -0.4675 ]
This can, of course, also easily be adapted to the period before the peak:
#valley in the short time period before a peak
#set time window, e.g., for 200 ms
time_window_size = pd.Timedelta(200, unit="ms")
time_of_peaks = x[peaks]
peak_start = x.searchsorted(time_of_peaks - time_window_size)
#in case of evenly spaced data points, this can be simplified
#and you just add n data points to your peak index array
#peak_start = peaks - n
true_valleys = peaks.copy()
for i, (start, stop) in enumerate(zip(peak_start, peaks)):
    true_valleys[i] = start + y[start:stop].argmin()
I am very new to coding in Python and I am working with a .CSV file that stores a 32x32 matrix as a 1024-column row together with a timestamp. I reshaped the data into 32x32 arrays and looped through each row, appending the matrices to a numpy array.
i = 0
while i < len(df_array):
    if i == 0:
        spec = np.reshape(df_array[i][np.arange(1, 1025)], (32, 32))
        spectrum_matrix = spec
    else:
        spec = np.reshape(df_array[i][np.arange(1, 1025)], (32, 32))
        spectrum_matrix = np.concatenate((spectrum_matrix, spec), axis=0)
    i = i + 1
print("job done")
What I would like to do is take the timestamps from the original data file and attach them to each of the matrices, which would let me resample the data as a 5-minute average. I would also like to plot the bins to get a plot similar to this Drop size distribution.
As a reference I am reading in the data .CSV with pandas and here is an example of a portion of the raw data: 01.06.2017;18:22:20;0.122;0.00;51;7.401;10375;18745;57;27;0.00;23.6;0.110;0;
<SPECTRUM>;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
The semicolon-separated values after <SPECTRUM> make up the 32x32 matrix.
Thanks in advance for any help!
Python and associated packages can do many things without loops
From my understanding of your data, you have an (8640 x 32 x 32) data structure (time x size x velocity).
Pandas works very well with 2D data structures; however, for higher-dimensional data I would recommend getting familiar with xarray. With that package, along with pandas, you can create and manipulate your data without resorting to loops.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import xarray as xr
import seaborn as sns
%matplotlib inline

# create random data
data = (np.random.binomial(n=5, p=0.2, size=(8640, 32, 32)) * 1000).astype(int)

# create labels for the data
sizes = np.linspace(1, 5, 32)
velocities = np.linspace(1, 1000, num=32)

# make a time range of 24 hours with 10 s intervals
ind = pd.date_range(start='2014-01-01', periods=8640, freq='10s')

# convert the data to an xarray 3D data structure
df = xr.DataArray(data, coords=[ind, sizes, velocities],
                  dims=['time', 'size', 'speed'])

# make a 5 min average of the data
# (older xarray versions used df.resample('300s', dim='time', how='mean'))
min_average = df.resample(time='300s').mean()

# plot a sample of the data and the 5 min average
my1d = min_average.isel(size=5, speed=10)
my1d.plot(label='5 min avg')
df.isel(size=5, speed=10).plot(alpha=0.3, c='r', label='raw_data')
plt.legend()
As for making a distribution plot like the one you linked, things become a bit trickier, but it is possible:
# reduce the data to the mean over speed for each time and size,
# and convert to a pandas dataframe
mean_speed = min_average.mean(dim=['speed'])
# xarray makes you name the new column when you convert to a pandas
# dataframe; the extra empty index level is then dropped with a list comprehension
df = mean_speed.to_dataframe('').unstack().T
df.index = np.array([np.array(i)[1].astype(float) for i in df.index])
# make a contour plot of the new data
plt.contourf(df.columns, df.index, df.values, cmap='PuBu_r')
plt.title('mean speed')
plt.ylabel('size')
plt.xlabel('time')
plt.colorbar()
I'm reading a specific column of a csv file as a numpy array. When I try to do the fft of this array I get an array of NaNs. How do I get the fft to work? Here's what I have so far:
#!/usr/bin/env python
from __future__ import division
import numpy as np
from numpy import fft
import matplotlib.pyplot as plt
fileName = '/Users/Name/Documents/file.csv'
#read csv file
df = np.genfromtxt(fileName, dtype = float, delimiter = ',', names = True)
X = df['X'] #get X from file
rate = 1000. #rate of data collection in points per second
Hx = abs(fft.fft(X))
freqX = fft.fftfreq(len(Hx), 1/rate)
plt.plot(freqX,Hx) #plot freqX vs Hx
Presumably there are some missing values in your csv file. By default, np.genfromtxt will replace the missing values with NaN.
If there are any NaNs or Infs in an array, the fft will be all NaNs or Infs.
For example:
import numpy as np
x = [0.1, 0.2, np.nan, 0.4, 0.5]
print(np.fft.fft(x))
And we'll get:
array([ nan +0.j, nan+nanj, nan+nanj, nan+nanj, nan+nanj])
However, because an FFT operates on a regularly spaced series of values, you can't simply drop the non-finite values from the array; they have to be replaced with something sensible.
pandas has several specialized operations to do this, if you're open to using it (e.g. fillna). However, it's not too difficult to do with "pure" numpy.
First, I'm going to assume that you're working with a continuous series of data because you're taking the FFT of the values. In that case, we'd want to interpolate the NaN values based on the values around them. Linear interpolation (np.interp) may not be ideal in all situations, but it's not a bad default choice:
For example:
import numpy as np
x = np.array([0.1, 0.2, np.nan, 0.4, 0.5])
xi = np.arange(len(x))
mask = np.isfinite(x)
xfiltered = np.interp(xi, xi[mask], x[mask])
And we'll get:
In [18]: xfiltered
Out[18]: array([ 0.1, 0.2, 0.3, 0.4, 0.5])
We can then calculate the FFT normally:
In [19]: np.fft.fft(xfiltered)
Out[19]:
array([ 1.50+0.j , -0.25+0.34409548j, -0.25+0.08122992j,
-0.25-0.08122992j, -0.25-0.34409548j])
...and get a valid result.
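Tying this back to your snippet, here is a hedged sketch of the same interpolate-then-FFT step wrapped as a helper and applied to the column read from the CSV (assuming the column really is named 'X', as in the question):
import numpy as np
from numpy import fft

def fft_ignoring_nans(values, rate):
    """Linearly interpolate across NaNs, then return (frequencies, |FFT|)."""
    values = np.asarray(values, dtype=float)
    idx = np.arange(len(values))
    mask = np.isfinite(values)
    filled = np.interp(idx, idx[mask], values[mask])
    spectrum = np.abs(fft.fft(filled))
    freqs = fft.fftfreq(len(filled), 1 / rate)
    return freqs, spectrum

# e.g. freqX, Hx = fft_ignoring_nans(df['X'], rate=1000.)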
If your data contains NaN values, you need to interpolate them. Alternatively, you can calculate the spectrum by evaluating the discrete Fourier transform directly, with np.sum replaced by np.nansum. With this approach you don't need to interpolate the NaN values, although the amount of missing data will affect the spectrum: more missing data results in a noisier spectrum and hence less accurate spectral values.
Below is an MWE to illustrate the concept, with a graph showing the result. It calculates the single-sided amplitude spectrum of a simple reference signal containing a number of missing values.
#!/usr/bin/python
# Python code to plot amplitude spectrum of signal containing NaN values
# Python version 2.7.13
from __future__ import division
import numpy as np
import pylab as pl
import random
LW = 2 #line width
AC = 0.5 #alpha channel
pi = np.pi
def periodogramSS(inputsignal, fsamp):
    N = len(inputsignal)
    N_notnan = np.count_nonzero(~np.isnan(inputsignal))
    hr = fsamp/N #frequency resolution
    t = np.arange(0, N*Ts, Ts)
    #flow,fhih = -fsamp/2,(fsamp/2)+hr #Double-sided spectrum
    flow, fhih = 0, fsamp/2 + hr #Single-sided spectrum
    #flow,fhih = hr,fsamp/2
    frange = np.arange(flow, fhih, hr)
    fN = len(frange)
    Aspec = np.zeros(fN)
    n = 0
    for f in frange:
        Aspec[n] = np.abs(np.nansum(inputsignal*np.exp(-2j*pi*f*t)))/N_notnan
        n += 1
    Aspec *= 2 #single-sided spectrum
    Aspec[0] /= 2 #DC component restored (i.e. halved)
    return (frange, Aspec)
#construct reference signal:
f1 = 10 #Hz
T = 1/f1
fs = 10*f1
Ts = 1/fs
t = np.arange(0,20*T,Ts)
DC = 3.0
x = DC + 1.5*np.cos(2*pi*f1*t)
#randomly delete values from signal x:
ndel = 10 #number of samples to replace with NaN
random.seed(0)
L = len(x)
randidx = random.sample(range(0,L),ndel)
for idx in randidx:
    x[idx] = np.nan
(fax,Aspectrum) = periodogramSS(x,fs)
fig1 = pl.figure(1,figsize=(6*3.13,4*3.13)) #full screen
pl.ion()
pl.subplot(211)
pl.plot(t, x, 'b.-', lw=LW, ms=2, label='ref', alpha=AC)
#mark NaN values:
for (t_, x_) in zip(t, x):
    if np.isnan(x_):
        pl.axvline(x=t_, color='g', alpha=AC, ls='-', lw=2)
pl.grid()
pl.xlabel('Time [s]')
pl.ylabel('Reference signal')
pl.subplot(212)
pl.stem(fax, Aspectrum, basefmt=' ', markerfmt='r.', linefmt='r-')
pl.grid()
pl.xlabel('Frequency [Hz]')
pl.ylabel('Amplitude spectrum')
fig1name = './signal.png'
print 'Saving Fig. 1 to:', fig1name
fig1.savefig(fig1name)
The reference signal (real) is shown in blue with missing values marked with green. The single-sided amplitude spectrum is shown in red. The DC component and amplitude value at 10 Hz are clearly visible. The other values are caused by the reference signal being broken up by the missing data.
I am using matplotlib and I'm running into problems when trying to plot large vectors; sometimes I get a "MemoryError".
My question is whether there is any way to reduce the amount of data I need to plot. In this example I'm plotting a vector of size 2647296! Is there any way to plot the same values at a smaller scale?
It is very unlikely that you have so much resolution on your display that you can see 2.6 million data points in your plot. A simple way to plot less data is to sample, e.g., every 1000th point: plt.plot(x[::1000]). If that loses too much and it is important to see, e.g., the extremal values, you could split the long vector into pieces, take the minimum and maximum of each piece, and plot those:
import matplotlib.pyplot as plt

tmp = x[:len(x) - len(x) % 1000]  # drop a few points so the length is a multiple of 1000
tmp = tmp.reshape((-1, 1000))     # split into consecutive pieces of 1000 points
plt.figure()                      # plot the minimum and maximum of each piece in the same figure
plt.plot(tmp.min(axis=1))
plt.plot(tmp.max(axis=1))
You can use a min/max for each block of data to subsample the signal.
The window size should be chosen based on how accurately you want to display the signal, i.e. how long each window is relative to the overall signal length.
Example code:
from scipy.io import wavfile
import matplotlib.pyplot as plt
def value_for_window_min_max(data, start, stop):
    # Track the minimum and maximum within the window [start, stop).
    window_min = data[start]
    window_max = data[start]
    for i in range(start, stop):
        if data[i] < window_min:
            window_min = data[i]
        if data[i] > window_max:
            window_max = data[i]
    # Keep whichever extreme has the larger magnitude.
    if abs(window_min) > abs(window_max):
        return window_min
    else:
        return window_max

# If window_size doesn't divide len(data) evenly, the trailing samples are ignored.
def subsample_data(data, window_size):
    print(len(data))
    print(len(data) // window_size)
    out_data = []
    for i in range(0, len(data) // window_size):
        out_data.append(value_for_window_min_max(data, i * window_size, (i + 1) * window_size))
    return out_data
sample_rate, data = wavfile.read('<path_to_wav_file>')
sub_amt = 10
sub_data = subsample_data(data, sub_amt)
print(len(data))
print(len(sub_data))
fig = plt.figure(figsize=(8,6), dpi=100)
fig.add_subplot(211)
plt.plot(data)
plt.title('Original')
plt.xlim([0,len(data)])
fig.add_subplot(212)
plt.plot(sub_data)
plt.xlim([0,len(sub_data)])
plt.title('Subsampled by %d'%sub_amt)
plt.show()
Output:
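As a side note, the same min/max-per-window idea can be done without a Python-level loop by reshaping the signal into windows with numpy (a sketch, assuming the data is a 1-D numpy array):
import numpy as np

def subsample_min_max(data, window_size):
    data = np.asarray(data)
    n_windows = len(data) // window_size                  # any ragged tail is dropped
    blocks = data[:n_windows * window_size].reshape(n_windows, window_size)
    mins = blocks.min(axis=1)
    maxs = blocks.max(axis=1)
    # keep whichever extreme has the larger magnitude, like the loop version above
    return np.where(np.abs(mins) > np.abs(maxs), mins, maxs)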