NumPy efficiency in dataset preprocessing

NumPy efficiency in dataset preprocessing - python

I am currently working on a research project relating to the use of neural networks operating on EEG datasets. I am using the BCICIV 2a dataset, which consists of a series of files containing trial data from subjects. Each file contains a set of 25 channels and a very long ~600000 time step array of signals. I have been working on writing code to preprocess this data into something I can pass into the neural network, but have run into some efficiency issues. Currently, I have written code that determines the location in the array of all the trials in a file, then attempts to extract a 3D NumPy array that is stored in another array. When I attempt to run this code however, it is ridiculously slow. I am not very familiar with NumPy, the majority of my experience at this point being in C. My intention had been to write the results of the preprocessing to a separate file that can be loaded to avoid the preprocessing. From a C perspective, all that would be necessary is to move the pointers around to format the data appropriately, so I am not sure why NumPy is so slow. Any suggestions would be very helpful since currently for 1 file it takes ~2 minutes to extract 1 trial, with 288 trials in a file and 9 files, this would take much longer than I would like. I am not very comfortable with my knowledge of how to make good use of NumPy's efficiency improvements over generic lists. Thanks!
import glob, os
import numpy as np
import mne
DURATION = 313
XDIM = 7
YDIM = 6
IGNORE = ('EOG-left', 'EOG-central', 'EOG-right')
def getIndex(raw, tagIndex):
return int(raw.annotations[tagIndex]['onset']*250)
def isEvent(raw, tagIndex, events):
for event in events:
if (raw.annotations[tagIndex]['description'] == event):
return True
return False
def getSlice1D(raw, channel, dur, index):
if (type(channel) == int):
channel = raw.ch_names[channel]
return raw[channel][0][0][index:index+dur]
def getSliceFull(raw, dur, index):
trial = np.zeros((XDIM, YDIM, dur))
for channel in raw.ch_names:
if not channel in IGNORE:
x, y = convertIndices(channel)
trial[x][y] = getSlice1D(raw, channel, dur, index)
return trial
def convertIndices(channel):
xDict = {'EEG-Fz':3, 'EEG-0':1, 'EEG-1':2, 'EEG-2':3, 'EEG-3':4, 'EEG-4':5, 'EEG-5':0, 'EEG-C3':1, 'EEG-6':2, 'EEG-Cz':3, 'EEG-7':4, 'EEG-C4':5, 'EEG-8':6, 'EEG-9':1, 'EEG-10':2, 'EEG-11':3, 'EEG-12':4, 'EEG-13':5, 'EEG-14':2, 'EEG-Pz':3, 'EEG-15':4, 'EEG-16':3}
yDict = {'EEG-Fz':0, 'EEG-0':1, 'EEG-1':1, 'EEG-2':1, 'EEG-3':1, 'EEG-4':1, 'EEG-5':2, 'EEG-C3':2, 'EEG-6':2, 'EEG-Cz':2, 'EEG-7':2, 'EEG-C4':2, 'EEG-8':2, 'EEG-9':3, 'EEG-10':3, 'EEG-11':3, 'EEG-12':3, 'EEG-13':3, 'EEG-14':4, 'EEG-Pz':4, 'EEG-15':4, 'EEG-16':5}
return xDict[channel], yDict[channel]
data_files = glob.glob('../datasets/BCICIV_2a_gdf/*.gdf')
try:
raw = mne.io.read_raw_gdf(data_files[0], verbose='ERROR')
except IndexError:
print("No data files found")
event_times = []
for i in range(len(raw.annotations)):
if (isEvent(raw, i, ('769', '770', '771', '772'))):
event_times.append(getIndex(raw, i))
data = np.empty((len(event_times), XDIM, YDIM, DURATION))
print(len(event_times))
for i, event in enumerate(event_times):
data[i] = getSliceFull(raw, DURATION, event)
EDIT:
I wanted to come back and add some more details on the structure of the dataset. There is the 25x~600000 array that contains the data and a much shorter annotation object that includes event tags and relates those to times within the larger array. Specific events indicate a motor imagery cue which is the trial that my network is being trained on, I am attempting to extract a 3D slice which includes the relevant channels formatted appropriately with a temporal dimension, which is found to be 313 timesteps long. The annotations gives me the relevant timesteps to investigate. The results of the profiling recommended by Ian showed that the main compute time is located in the getSlice1D() function. Particularly where I index into the raw object. The code that is extracting the event times from the annotations is comparably negligible.

This is a partial answer bc the formatting in comments is kind of garbage but
def getIndex(raw, tagIndex):
return int(raw.annotations[tagIndex]['onset']*250)
def isEvent(raw, tagIndex, events):
for event in events:
if (raw.annotations[tagIndex]['description'] == event):
return True
return False
for i in range(len(raw.annotations)):
if (isEvent(raw, i, ('769', '770', '771', '772'))):
event_times.append(getIndex(raw, i))
Notice how you're iterating over the I. What you could do is instead
def isEvent(raw_annotations_desc, raw_annotations_onset, events):
flag_container = []
for event in events: # Iterate through all the events
# Do a vectorized comparison across all the indices
flag_container.append(raw_annotations_desc == event)
# At this point flag_container will be of shape (|events|, len(raw_annotations_desc)
"""
Assuming I understand correctly, for a given index if
ANY of the events is true, we return true and get the index, correct?
def getIndex(raw, tagIndex):
return int(raw.annotations[tagIndex]['onset']*250)
"""
flag_container = np.asarray(flag_container) # Change raw list to np array
# Python treats False as 0 and True as 1. So, we sum over the cols
# and we now have an array of shape (1, len(raw_annotations))
flag_container = flag_container.sum(1)
# Add indices because we will use these later
flag_container = np.asarray(np.arange(len(raw_annotations)), flag_container)
# Almost there. Now, flag_container has 2 cols: the index AND the number of True in a given row
# Gets us all the indices where the sum was greater than 1 (aka one positive)
flag_container = flag_container[flag_container[1,:] > 0]
# Now, an array of shape (2, x <= len(raw_annotations_desc))
flag_container = flag_container[0, :] # We only care about the indices, not the actual count of positives now so we slice out the 0th-col
return int(raw_annotations_onset[flag_container] * 250)
Something to that effect :) That should speed things up a bit

Related

How do I improve the speed of this parser using python?

I am currently parsing historic delay data from a public transport network in Sweden. I have ~5700 files (one from every 15 seconds) from the 27th of January containing momentary delay data for vehicles on active trips in the network. It's, unfortunately, a lot of overhead / duplicate data, so I want to parse out the relevant stuff to do visualizations on it.
However, when I try to parse and filter out the relevant delay data on a trip level using the script below it performs really slow. It has been running for over 1,5 hours now (on my 2019 Macbook Pro 15') and isn't finished yet.
How can I optimize / improve this python parser?
Or should I reduce the number of files, and i.e. the frequency of the data collection, for this task?
Thank you so much in advance. 💗
from google.transit import gtfs_realtime_pb2
import gzip
import os
import datetime
import csv
import numpy as np
directory = '../data/tripu/27/'
datapoints = np.zeros((0,3), int)
read_trips = set()
# Loop through all files in directory
for filename in os.listdir(directory)[::3]:
try:
# Uncompress and parse protobuff-file using gtfs_realtime_pb2
with gzip.open(directory + filename, 'rb') as file:
response = file.read()
feed = gtfs_realtime_pb2.FeedMessage()
feed.ParseFromString(response)
print("Filename: " + filename, "Total entities: " + str(len(feed.entity)))
for trip in feed.entity:
if trip.trip_update.trip.trip_id not in read_trips:
try:
if len(trip.trip_update.stop_time_update) == len(stopsOnTrip[trip.trip_update.trip.trip_id]):
print("\t","Adding delays for",len(trip.trip_update.stop_time_update),"stops, on trip_id",trip.trip_update.trip.trip_id)
for i, stop_time_update in enumerate(trip.trip_update.stop_time_update[:-1]):
# Store the delay data point (arrival difference of two ascending nodes)
delay = int(trip.trip_update.stop_time_update[i+1].arrival.delay-trip.trip_update.stop_time_update[i].arrival.delay)
# Store contextual metadata (timestamp and edgeID) for the unique delay data point
ts = int(trip.trip_update.stop_time_update[i+1].arrival.time)
key = int(str(trip.trip_update.stop_time_update[i].stop_id) + str(trip.trip_update.stop_time_update[i+1].stop_id))
# Append data to numpy array
datapoints = np.append(datapoints, np.array([[key,ts,delay]]), axis=0)
read_trips.add(trip.trip_update.trip.trip_id)
except KeyError:
continue
else:
continue
except OSError:
continue

I suspect the problem here is repeatedly calling np.append to add a new row to a numpy array. Because the size of a numpy array is fixed when it is created, np.append() must create a new array, which means that it has to copy the previous array. On each loop, the array is bigger and so all these copies add a quadratic factor to your execution time. This becomes significant when the array is quite big (which apparently it is in your application).
As an alternative, you could just create an ordinary Python list of tuples, and then if necessary convert that to a complete numpy array at the end.
That is (only the modified lines):
datapoints = []
# ...
datapoints.append((key,ts,delay))
# ...
npdata = np.array(datapoints, dtype=int)

I still think the parse routine is your bottleneck (even if it did come from Google), but all those '.'s were killing me! (And they do slow down performance somewhat.) Also, I converted your i, i+1 iterating to using two iterators zipping through the list of updates, this is a little more advanced style of working through a list. Plus the cur/next_update names helped me keep straight when you wanted to reference one vs. the other. Finally, I remove the trailing "else: continue", since you are at the end of the for loop anyway.
for trip in feed.entity:
this_trip_update = trip.trip_update
this_trip_id = this_trip_update.trip.trip_id
if this_trip_id not in read_trips:
try:
if len(this_trip_update.stop_time_update) == len(stopsOnTrip[this_trip_id]):
print("\t", "Adding delays for", len(this_trip_update.stop_time_update), "stops, on trip_id",
this_trip_id)
# create two iterators to walk through the list of updates
cur_updates = iter(this_trip_update.stop_time_update)
nxt_updates = iter(this_trip_update.stop_time_update)
# advance the nxt_updates iter so it is one ahead of cur_updates
next(nxt_updates)
for cur_update, next_update in zip(cur_updates, nxt_updates):
# Store the delay data point (arrival difference of two ascending nodes)
delay = int(nxt_update.arrival.delay - cur_update.arrival.delay)
# Store contextual metadata (timestamp and edgeID) for the unique delay data point
ts = int(next_update.arrival.time)
key = "{}/{}".format(cur_update.stop_id, next_update.stop_id)
# Append data to numpy array
datapoints = np.append(datapoints, np.array([[key, ts, delay]]), axis=0)
read_trips.add(this_trip_id)
except KeyError:
continue
This code should be equivalent to what you posted, and I don't really expect major performance gains either, but perhaps this will be more maintainable when you come back to look at it in 6 months.
(This probably is more appropriate for CodeReview, but I hardly ever go there.)

Numpy (n, 1, m) to (n,m)

I am working on a problem which involves a batch of 19 tokens each with 400 features. I get the shape (19,1,400) when concatenating two vectors of size (1, 200) into the final feature vector. If I squeeze the 1 out I am left with (19,) but I am trying to get (19,400). I have tried converting to list, squeezing and raveling but nothing has worked.
Is there a way to convert this array to the correct shape?
def attn_output_concat(sample):
out_h, state_h = get_output_and_state_history(agent.model, sample)
attns = get_attentions(state_h)
inner_outputs = get_inner_outputs(state_h)
if len(attns) != len(inner_outputs):
print 'Length err'
else:
tokens = [np.zeros((400))] * largest
print(tokens.shape)
for j, (attns_token, inner_token) in enumerate(zip(attns, inner_outputs)):
tokens[j] = np.concatenate([attns_token, inner_token], axis=1)
print(np.array(tokens).shape)
return tokens

The easiest way would be to declare tokens to be a numpy.shape=(19,400) array to start with. That's also more memory/time efficient. Here's the relevant portion of your code revised...
import numpy as np
attns_token = np.zeros(shape=(1,200))
inner_token = np.zeros(shape=(1,200))
largest = 19
tokens = np.zeros(shape=(largest,400))
for j in range(largest):
tokens[j] = np.concatenate([attns_token, inner_token], axis=1)
print(tokens.shape)
BTW... It makes it difficult for people to help you if you don't include a self-contained and runnable segment of code (which is probably why you haven't gotten a response on this yet). Something like the above snippet is preferred and will help you get better answers because there's less guessing at what your trying to accomplish.

Python, multiprocessing.pool took about the same amount of time as a for loop

I am trying to use python to process some large data sets from several data stations. My idea is to use multiprocessing.pool to assign each CPU the data from a single station, since the data from each station are independent from each other.
However, it seems that my calculation time does not really go down, comparing to single for loop.
Here is part of my code:
#function calculating the square of each data point, and taking the cumulative sum
def get_cumdd(data):
#if not isinstance(data, list):
# data = [data]
dd = np.zeros((len(data),1))
cum_dd = np.zeros((len(data),1))
for i in range(len(data)):
dd[i] = data[i]**2
cum_dd=np.cumsum(dd)
return cum_dd
#parallelization between each station
if __name__ == '__main__':
n_proc = np.min([mp.cpu_count(),nstation]) #nstation = 10
p = mp.Pool(processes=int(n_proc))
result = p.map(get_cumdd,data)
p.close()
p.join()
cum_dd = np.zeros((nstation,len(data[0])))
for i in range(nstation):
cum_dd[i] = result[i].T
I do not use chunksize because cum_dd takes the summation of all the previous data^2. I am essentially dividing my data into 10 equal pieces because there is no communication between processes. I wonder if I missed anything here.
My data has 2 million points per station per day, and I need to process years of data.

This doesn't address your multiprocessing question directly, but (as Ugur MULUK and Iguananaut mention) I think your get_cumdd function is inefficient. Numpy provides np.cumsum. Reimplementing your function I get more than 1000x speedup for an array with 10k elements. With 100k elements it's about 7000x faster. With 2M elements I didn't bother to let it finish.
# your function
def cum_dd(data):
#if not isinstance(data, list):
# data = [data]
dd = np.zeros((len(data),1))
cum_dd = np.zeros((len(data),1))
for i in range(len(data)):
dd[i] = data[i]**2
cum_dd[i]=np.sum(dd[0:i])
return cum_dd
# numpy implementation
def cum_dd2(data):
# adding an axis to match the shape of the output of your cum_dd function
return np.cumsum(data**2)[:, np.newaxis]
For 2e6 points this implementation takes ~11ms on my computer. I think that's about 30 seconds for 10 years of data for a single station.

NumPy already implements efficient parallel processing on CPUs and GPUs. The processing algorithms use Single Instruction Multiple Data (SIMD) instructions.
By pooling computations manually, you are reducing the efficiency. You can improve performance by vectorizing your explicit for loop.
See the video below for more information about vectorization.
https://www.youtube.com/watch?v=qsIrQi0fzbY
If you are having difficulties, I will be around for updates or help. Good luck!

Thanks a lot for all the comments and answers! After applying vectorization and pooling, I reduced the calculation time from one hour to 3 second (10*1.7 million data points). I have my code here in case anyone is interested,
def get_cumdd(data):
#if not isinstance(data, list):
# data = [data]
dd = np.zeros((len(data),1))
for i in range(len(data)):
dd[i] = data[i]**2
cum_dd=np.cumsum(dd)
return dd,cum_dd
if __name__ == '__main__':
n_proc = np.min([mp.cpu_count(),nstation])
p = mp.Pool(processes=int(n_proc))
result = p.map(CC.get_cumdd,d)
p.close()
p.join()
I'm not using shared memory Queue because all my processes are independent from each other.

Creating loop with particular behaviour depending on data length

In my program I have a part of code that uses an Estimated Moving Average (EMA) 4 times, but each time with different length. The program uses one or more EMAs depending on how much data it gets.
For now the code is not looped, just copy pasted with minor tweeks. That makes making changes difficult because I have to change everything 4 times.
Can somebody help me loop the code in such a way it wont loose it behaviour pattern. The mock-up code is presented here:
import random
import numpy as np
zakres=[5,10,15,20]
data=[]
def SI_sma(data, zakres):
weights=np.ones((zakres,))/zakres
smas=np.convolve(data, weights, 'valid')
return smas
def SI_ema(data, zakres):
weights_ema = np.exp(np.linspace(-1.,0.,zakres))
weights_ema /= weights_ema.sum()
ema=np.convolve(data,weights_ema)[:len(data)]
ema[:zakres]=ema[zakres]
return ema
while True:
data.append(random.uniform(0,100))
print(len(data))
if len(data)>zakres[0]:
smas=SI_sma(data=data, zakres=zakres[0])
ema=SI_ema(data=data, zakres=zakres[0])
print(smas[-1]) #calc using smas
print(ema[-1]) #calc using ema1
if len(data)>zakres[1]:
ema2=SI_ema(data=data, zakres=zakres[1])
print(ema2[-1]) #calc using ema2
if len(data)>zakres[2]:
ema3=SI_ema(data=data, zakres=zakres[2])
print(ema3[-1]) #calc using ema3
if len(data)>zakres[3]:
ema4=SI_ema(data=data, zakres=zakres[3])
print(ema4[-1]) #calc using ema4
input("press a key")

A variable number of variables is usually a bad idea. As you have found, it can make maintaining code cumbersome and error-prone. Instead, you can define a dict of results and use a for loop to iterate scenarios, defining len(data) just once.
ema = {}
while True:
data.append(random.uniform(0,100))
n = len(data)
for i, val in enumerate(zakres):
if n > val:
if i == 1:
smas = SI_sma(data=data, zakres=val)
ema[i] = SI_ema(data=data, zakres=val)
You can then access results via ema[0], ..., ema[3] as required.

Multiple Notes Simultaneously - Overflow Error

I have successfully generated .wav files in python of sine waves at different frequencies.
If I wish to generate harmonies, for example, a C major tirade, am I supposed to add the each sine wave of the individual notes together?
When adding 2 notes together, say C and G, the program creates the correct harmony. When I attempt to add a third note though, there is an overflow error. How might this be successfully accomplished.
The Code:
I am placing the data for the sine waves into an array of signed short integers.
wave = array.array('h')
And then adding multiple waves togeather to generate the harmonies.
for i in range(len(data)):
wave1[i] += wave2[i]
This works!
But when I add a third array, (wave3), it overflows.
This is because the signed short integer has reached its maximum. I am working with a 16 bit rate. Is the problem simply that the bit rate is too low? When creating complex audio with lots of harmonies, does the bit rate simply need to be much higher? Have I approached the problem in the absolute wrong direction?
Full Source

I don't think it's the byterate. You just have to normalize the values so that they fit properly. I've rewritten your code a bit so it uses lists at first, then arranges all the values from 0 to 32767, taking into account the volume, and puts it into the array.
def normalize(nmin, nmax, nums): #this could probably be done a bit shorter
orange = max(nums)-min(nums)
nrange = nmax-nmin
nums = [float(num)/orange*nrange for num in nums]
omin = min(nums)
nums = [num-omin+nmin for num in nums]
return nums
if __name__ == '__main__':
data, data2, data3 = [], [], []
data.extend(create_data(getTime(1), getFreq('C', 4)))
data2.extend(create_data(getTime(1), getFreq('Bb', 4)))
data3.extend(create_data(getTime(1), getFreq('G', 4)))
for i in range(len(data)):
data[i] += data2[i]
data[i] += data3[i]
data = array.array('h', [int(val) for val in normalize(0, 32767*VOLUME/100, data)])
write_wave(data, int(len(data)/SAMPLE_RATE))
winsound.PlaySound('output.wav', winsound.SND_FILENAME)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

NumPy efficiency in dataset preprocessing - python

Related

How do I improve the speed of this parser using python?

Numpy (n, 1, m) to (n,m)

Python, multiprocessing.pool took about the same amount of time as a for loop

Creating loop with particular behaviour depending on data length

Multiple Notes Simultaneously - Overflow Error

Categories

Resources