How to speed up data readings from dataframe columns in python? - python

My data frame looks like this
Good day,
I'm trying to read the values of every row from 4 different columns in my data frame and store them in a single NumPy array (see attached picture). Each column has 150,000 rows, so the single NumPy array ends up with 600,000 rows. I have to do this 4 times, which means I have to create 4 arrays of 600,000 rows each. I used a basic for-loop in my Python code, but it took almost 5 minutes to compute.
Does anyone know a better way to do this in order to improve its performance?
Thank you,
Here is my Python Code:
def oversampling(self):
    # Oversampling restructuring
    sh = self.df[['nSensor01_00']].values.shape
    nSensor01 = np.zeros(shape=(sh[0] * 4, 1))
    nSensor02 = np.zeros(shape=(sh[0] * 4, 1))
    nSensor03 = np.zeros(shape=(sh[0] * 4, 1))
    nSensor04 = np.zeros(shape=(sh[0] * 4, 1))
    temp = np.arange(4, sh[0] * 4, 4)
    ttime = np.arange(0, sh[0] / 500, 0.0005)
    names = ['nSensor01', 'nSensor02', 'nSensor03', 'nSensor04']
    for i in temp:
        ind_begin = i - 4
        ind_end = ind_begin + 4
        a = int((i - 1) / 4)
        nSensor01[ind_begin:ind_end] = self.df.iloc[a, 55:59].values.flatten().reshape((4, 1))
        nSensor02[ind_begin:ind_end] = self.df.iloc[a, 59:63].values.flatten().reshape((4, 1))
        nSensor03[ind_begin:ind_end] = self.df.iloc[a, 63:67].values.flatten().reshape((4, 1))
        nSensor04[ind_begin:ind_end] = self.df.iloc[a, 67:71].values.flatten().reshape((4, 1))
    d = np.hstack((nSensor01, nSensor02, nSensor03, nSensor04))
    self.dfkHz = pd.DataFrame(data=d, columns=names)
    self.dfkHz.insert(0, 'Time', ttime)

Does this work for you?
sh = self.df[['nSensor01_00']].values.shape
df_kHz = pd.DataFrame()
df_kHz["time"] = (np.arange(0, sh[0] / 500, 0.0005))
df_kHz["nSensor01"] = self.df.iloc[:, 55:59].values.flatten()
df_kHz["nSensor02"] = self.df.iloc[:, 59:63].values.flatten()
df_kHz["nSensor03"] = self.df.iloc[:, 63:67].values.flatten()
df_kHz["nSensor04"] = self.df.iloc[:, 67:71].values.flatten()
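To see why flattening replaces the loop, here is a small sketch on toy data. The column names other than nSensor01_00 and the 5-row frame are assumptions made for illustration. A row-major flatten of an (n, 4) slice emits row 0's four values, then row 1's, and so on, which is the order the loop fills in. Building the time axis from an integer range is also a bit safer than np.arange with a float step, whose length can be off by one because of rounding.

```python
import numpy as np
import pandas as pd

n = 5  # toy row count; the question has 150,000
# Hypothetical column names: only 'nSensor01_00' appears in the question.
cols = [f"nSensor01_{k:02d}" for k in range(4)]
df = pd.DataFrame(np.arange(n * 4).reshape(n, 4), columns=cols)

# Loop version: copy the 4 sub-samples of each row one row at a time.
looped = np.zeros(n * 4)
for row in range(n):
    looped[row * 4:(row + 1) * 4] = df.iloc[row, 0:4].values

# Vectorized version: one row-major flatten of the whole (n, 4) block.
vectorized = df.iloc[:, 0:4].to_numpy().ravel()

# Time axis with a 0.0005 s step, built from integers so the length is exact.
time = np.arange(n * 4) * 0.0005

print(np.array_equal(looped, vectorized))
```

Note that the vectorized version also covers the last row, which the loop in the question skips because its range stops one block early.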

Related

Why is this code using both flatten() and reshape(1,-1)?

I have two questions:
Why has the programmer used both reshape(1, -1) and flatten? I think they do the same thing (convert rows to columns).
Is the role of axis=0 to add the remaining n-20 rows vertically to X?
k = 5 # supervising time (k next interval closing price that we are gonna predict)
N = np.size(data , axis=0) # N = data.shape[0]
winSize = 20 # the size of sliding window
f = data.shape[1] # the number of features
X = np.zeros((0, winSize * f))
Ytr = np.zeros((0, 1))
for i in tqdm(range(N-winSize-k)):
    X = np.concatenate((X, data[i: i+winSize].flatten().reshape(1,-1)), axis = 0)
    Ytr = np.concatenate((Ytr, label[i+winSize+k].reshape(-1, 1)), axis=0)
I have tried to use the code with only reshape(1,-1) or flatten but got error.
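A small sketch of the shapes involved may help here. It assumes data is a plain 2-D NumPy array; the toy sizes are made up. flatten() returns a 1-D array, reshape(1, -1) returns a 2-D array with a single row, and np.concatenate along axis=0 stacks rows, so every input has to be 2-D with the same number of columns.

```python
import numpy as np

win_size, f = 3, 2  # toy window size and feature count
window = np.arange(win_size * f).reshape(win_size, f)

flat = window.flatten()                  # 1-D, shape (6,)
row_a = window.flatten().reshape(1, -1)  # 2-D, shape (1, 6)
row_b = window.reshape(1, -1)            # same result for an ndarray

# axis=0 appends each new window as one more row of X.
X = np.zeros((0, win_size * f))
X = np.concatenate((X, row_b), axis=0)
X = np.concatenate((X, row_b + 10), axis=0)

# A 1-D array cannot be concatenated with the 2-D X.
try:
    np.concatenate((X, flat), axis=0)
    flatten_only_fails = False
except ValueError:
    flatten_only_fails = True

print(flat.shape, row_a.shape, X.shape, flatten_only_fails)
```

For an ndarray, then, flatten() before reshape(1, -1) looks redundant, while flatten() alone fails because the result is 1-D.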

Python's `.loc` is really slow on selecting subsets of Data

I have a large multi-indexed (y,t) single-valued DataFrame df. Currently, I select a subset via df.loc[(Y,T), :] and create a dictionary from it. The following MWE works, but the selection is very slow for large subsets.
import numpy as np
import pandas as pd
# Full DataFrame
y_max = 50
Y_max = range(1, y_max+1)
t_max = 100
T_max = range(1, t_max+1)
idx_max = tuple((y,t) for y in Y_max for t in T_max)
df = pd.DataFrame(np.random.sample(y_max*t_max), index=idx_max, columns=['Value'])
# Create Dictionary of Subset of Data
y1 = 4
yN = 10
Y = range(y1, yN+1)
t1 = 5
tN = 9
T = range(t1, tN+1)
idx_sub = tuple((y,t) for y in Y for t in T)
data_sub = df.loc[(Y,T), :] #This is really slow
dict_sub = dict(zip(idx_sub, data_sub['Value']))
# result, e.g. (y,t) = (5,7)
dict_sub[5,7] == df.loc[(5,7), 'Value']
I was thinking of using df.loc[(y1,t1):(yN,tN), :], but it does not work properly, as the second index is only bounded in the final year yN.
One idea is to use Index.isin with itertools.product in boolean indexing:
from itertools import product
idx_sub = tuple(product(Y, T))
dict_sub = df.loc[df.index.isin(idx_sub),'Value'].to_dict()
print(dict_sub)
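As a hedged sketch, the isin approach can be compared with per-level label slices via pd.IndexSlice. This assumes the index is a sorted MultiIndex, so the sketch builds it explicitly with MultiIndex.from_product (the level names "y" and "t" are made up), and it assumes Y and T are contiguous ranges, which is true in the MWE.

```python
from itertools import product

import numpy as np
import pandas as pd

# Same layout as the MWE, with an explicitly built, sorted MultiIndex.
Y_max, T_max = range(1, 51), range(1, 101)
index = pd.MultiIndex.from_product([Y_max, T_max], names=["y", "t"])
df = pd.DataFrame({"Value": np.random.sample(len(index))}, index=index)

Y, T = range(4, 11), range(5, 10)

# Boolean mask built from the wanted (y, t) pairs, as in the answer.
dict_isin = df.loc[df.index.isin(list(product(Y, T))), "Value"].to_dict()

# Alternative: slice each index level by label (both bounds inclusive).
idx = pd.IndexSlice
dict_slice = df.loc[idx[Y[0]:Y[-1], T[0]:T[-1]], "Value"].to_dict()

print(len(dict_isin), dict_isin == dict_slice)
```

The slice form only works when the wanted labels form a contiguous block on each level; for arbitrary sets of pairs, the isin mask is the more general choice.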

Repeating a function 1000 times and saving each iteration result in a list

I have tried to simulate event onsets and predictors for an experiment. I have two predictors (circles and squares). The stimuli ('events') last 1 second and the ISI (interstimulus interval) is 8 seconds. I am also interested in both contrasts against baseline (circles against baseline; squares against baseline). In the end, I want to run the function I have defined (simulate_data_fixed_ISI, where N=420 is a fixed parameter) 1000 times, calculate an efficiency score at each iteration, and store the efficiency scores in a list.
def simulate_data_fixed_ISI(N=420):
    dg_hrf = glover_hrf(tr=1, oversampling=1)
    # Create indices in regularly spaced intervals (9 seconds, i.e. 1 sec stim + 8 ISI)
    stim_onsets = np.arange(10, N - 15, 9)
    stimcodes = np.repeat([1, 2], stim_onsets.size // 2)  # create codes for two conditions
    np.random.shuffle(stimcodes)  # random shuffle
    stim = np.zeros((N, 1))
    c = np.array([[0, 1, 0], [0, 0, 1]])
    # Fill stim array with codes at onsets
    for i, stim_onset in enumerate(stim_onsets):
        stim[stim_onset] = 1 if stimcodes[i] == 1 else 2
    stims_A = (stim == 1).astype(int)
    stims_B = (stim == 2).astype(int)
    reg_A = np.convolve(stims_A.squeeze(), dg_hrf)[:N]
    reg_B = np.convolve(stims_B.squeeze(), dg_hrf)[:N]
    X = np.hstack((np.ones((reg_B.size, 1)), reg_A[:, np.newaxis], reg_B[:, np.newaxis]))
    dvars = [(c[i, :].dot(np.linalg.inv(X.T.dot(X))).dot(c[i, :].T))
             for i in range(c.shape[0])]
    eff = c.shape[0] / np.sum(dvars)
    return eff
However, I want to run this entire chunk 1000 times and store each 'eff' in an array, so that I can later display the values as a histogram. How can I do this?
If I understand you correctly you should be able just to run
EFF = [simulate_data_fixed_ISI() for i in range(1000)] #1000 repeats
As @theonlygusti clarified, this line runs your function simulate_data_fixed_ISI() 1000 times and puts each return value in the list EFF.
Test
import numpy as np
def simulate_data_fixed_ISI(n=1):
    """
    Returns 'n' random numbers
    """
    return np.random.rand(n)
EFF = [simulate_data_fixed_ISI() for i in range(5)]
EFF
#[array([0.19585137]),
# array([0.91692933]),
# array([0.49294667]),
# array([0.79751017]),
# array([0.58294512])]
Your question seems to boil down to:
I am trying to run the function that I have defined for 1000, at each iteration I would like to calculate an efficiency score in the end and store the efficiency scores in a list
I guess "the function that I have defined" is the simulate_data_fixed_ISI in your question?
Then you can simply run it 1000 times using a basic for loop, and add the results into a list:
def simulate_data_fixed_ISI(N=420):
    dg_hrf = glover_hrf(tr=1, oversampling=1)
    # Create indices in regularly spaced intervals (9 seconds, i.e. 1 sec stim + 8 ISI)
    stim_onsets = np.arange(10, N - 15, 9)
    stimcodes = np.repeat([1, 2], stim_onsets.size // 2)  # create codes for two conditions
    np.random.shuffle(stimcodes)  # random shuffle
    stim = np.zeros((N, 1))
    c = np.array([[0, 1, 0], [0, 0, 1]])
    # Fill stim array with codes at onsets
    for i, stim_onset in enumerate(stim_onsets):
        stim[stim_onset] = 1 if stimcodes[i] == 1 else 2
    stims_A = (stim == 1).astype(int)
    stims_B = (stim == 2).astype(int)
    reg_A = np.convolve(stims_A.squeeze(), dg_hrf)[:N]
    reg_B = np.convolve(stims_B.squeeze(), dg_hrf)[:N]
    X = np.hstack((np.ones((reg_B.size, 1)), reg_A[:, np.newaxis], reg_B[:, np.newaxis]))
    dvars = [(c[i, :].dot(np.linalg.inv(X.T.dot(X))).dot(c[i, :].T))
             for i in range(c.shape[0])]
    eff = c.shape[0] / np.sum(dvars)
    return eff
eff_results = []
for _ in range(1000):
    efficiency_score = simulate_data_fixed_ISI()
    eff_results.append(efficiency_score)
Now eff_results contains 1000 entries, each of which is the return value of one call to your function simulate_data_fixed_ISI.
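For the histogram part of the question, a minimal sketch could look like the following. The function simulate_stand_in is a hypothetical placeholder that just draws a random number, so the snippet runs without glover_hrf; swap in simulate_data_fixed_ISI in real use.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_stand_in():
    """Hypothetical stand-in for simulate_data_fixed_ISI: returns one float."""
    return float(rng.normal(loc=1.0, scale=0.1))

# Run the simulation 1000 times and collect the scores in a NumPy array.
eff = np.array([simulate_stand_in() for _ in range(1000)])

# Bin the scores; counts[i] is the number of scores in bin i.
counts, bin_edges = np.histogram(eff, bins=20)

print(eff.shape, counts.sum())
```

To draw the histogram rather than just compute it, matplotlib's plt.hist(eff, bins=20) takes the same array directly.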

How to improve performance of coincidence filtering of a time-series?

I'm working on unsteady (non-stationary) experimental data from fluid dynamics. We have measured data on three channels, but the samples are not directly coincident (measured at the same time). I want to filter them with a window scheme to keep coincident samples and discard all others.
Unfortunately, I cannot upload the original data set due to company restrictions. But I tried to set up a minimal example, which generates a similar (smaller) dataset. The original dataset consists of 500,000 values per channel, each tagged with an arrival time. Coincidence is checked using these timestamps.
At the moment, I loop over each sample from the first channel and look at the time differences to the other channels. If a difference is smaller than the specified window width, the index is saved. It would probably be a little faster if I specified an interval in which to check for the differences (like 100 or 1000 samples in the neighborhood). But the data rate can differ significantly between channels, so this is not implemented yet. I would prefer to get rid of looping over each sample, if possible.
def filterCoincidence(df, window = 50e-6):
    '''
    Filters the dataset with arbitrary different data rates on different channels to coincident samples.
    The coincidence is checked with regard to a time window specified as argument.
    '''
    AT_cols = [col for col in df.columns if 'AT' in col]
    if len(AT_cols) == 1:
        print('only one group available')
        return
    used_ix = np.zeros( (df.shape[0], len(AT_cols)))
    used_ix.fill(np.nan)
    for ix, sample in enumerate(df[AT_cols[0]]):
        used_ix[ix, 0] = ix
        test_ix = np.zeros(2)
        for ii, AT_col in enumerate(AT_cols[1:]):
            diff = np.abs(df[AT_col] - sample)
            index = diff[diff <= window].sort_values().index.values
            if len(index) == 0:
                test_ix[ii] = None
                continue
            test_ix[ii] = [ix_use if (ix_use not in used_ix[:, ii+1] or ix == 0) else None for ix_use in index][0]
        if not np.any(np.isnan(test_ix)):
            used_ix[ix, 1:] = test_ix
        else:
            used_ix[ix, 1:] = [None, None]
    used_ix = used_ix[~np.isnan(used_ix).any(axis=1)]
    print(used_ix.shape)
    return
no_points = 10000
no_groups = 3
meas_duration = 60
df = pd.DataFrame(np.transpose([np.sort(np.random.rand(no_points)*meas_duration) for _ in range(no_groups)]), columns=['AT {}'.format(i) for i in range(no_groups)])
filterCoincidence(df, window=1e-3)
Is there a module that already implements this sort of filtering? Otherwise, it would be great if you could give me some hints to improve the performance of the code.
Just to update this thread in case somebody else has a similar problem: after several code revisions, I think I have found a workable solution.
def filterCoincidence(self, AT1, AT2, AT3, window = 0.05e-3):
    '''
    Filters the dataset with arbitrary different data rates on different channels to coincident samples.
    The coincidence is checked with regard to a time window specified as argument.
    - arguments:
        - three times series AT1, AT2 and AT3 (arrival times of particles in my case)
        - window size (50 microseconds as default setting)
    - output: indices of combined samples
    '''
    start_time = datetime.datetime.now()
    AT_list = [AT1, AT2, AT3]
    # take the shortest period of time
    min_EndArrival = np.max(AT_list)
    max_BeginArrival = np.min(AT_list)
    for i, col in enumerate(AT_list):
        min_EndArrival = min(min_EndArrival, np.max(col))
        max_BeginArrival = max(max_BeginArrival, np.min(col))
    for i, col in enumerate(AT_list):
        AT_list[i] = np.delete(AT_list[i], np.where((col < max_BeginArrival - window) | (col > min_EndArrival + window)))
    # get channel with lowest datarate
    num_points = np.zeros(len(AT_list))
    datarate = np.zeros(len(AT_list))
    for i, AT in enumerate(AT_list):
        num_points[i] = AT.shape[0]
        datarate[i] = num_points[i] / (AT[-1]-AT[0])
    used_ref = np.argmin(datarate)
    # process coincidence
    AT_ref_val = AT_list[used_ref]
    AT_list = list(np.delete(AT_list, used_ref))
    overview = np.zeros( (AT_ref_val.shape[0], 3), dtype=int)
    overview[:,0] = np.arange(AT_ref_val.shape[0], dtype=int)
    borders = np.empty(2, dtype=object)
    max_diff = np.zeros(2, dtype=int)
    for i, AT in enumerate(AT_list):
        neighbors_lower = np.searchsorted(AT, AT_ref_val - window, side='left')
        neighbors_upper = np.searchsorted(AT, AT_ref_val + window, side='left')
        borders[i] = np.transpose([neighbors_lower, neighbors_upper])
        coinc_ix = np.where(np.diff(borders[i], axis=1).flatten() != 0)[0]
        max_diff[i] = np.max(np.diff(borders[i], axis=1))
        overview[coinc_ix, i+1] = 1
    use_ix = np.where(~np.any(overview==0, axis=1))
    borders[0] = borders[0][use_ix]
    borders[1] = borders[1][use_ix]
    overview = overview[use_ix]
    # create all possible combinations refer to the reference
    combinations = np.prod(max_diff)
    test = np.empty((overview.shape[0]*combinations, 3), dtype=object)
    for i, [ref_ix, at1, at2] in enumerate(zip(overview[:, 0], borders[0], borders[1])):
        test[i * combinations:i * combinations + combinations, 0] = ref_ix
        at1 = np.arange(at1[0], at1[1])
        at2 = np.arange(at2[0], at2[1])
        test[i*combinations:i*combinations+at1.shape[0]*at2.shape[0],1:] = np.asarray(list(itertools.product(at1, at2)))
    test = test[~np.any(pd.isnull(test), axis=1)]
    # check distances
    ix_ref = test[:,0]
    test = test[:,1:]
    test = np.insert(test, used_ref, ix_ref, axis=1)
    test = test.astype(int)
    AT_list.insert(used_ref, AT_ref_val)
    AT_mat = np.zeros(test.shape)
    for i, AT in enumerate(AT_list):
        AT_mat[:,i] = AT[test[:,i]]
    distances = np.zeros( (test.shape[0], len(list(itertools.combinations(range(3), 2)))))
    for i, AT in enumerate(itertools.combinations(range(3), 2)):
        distances[:,i] = np.abs(AT_mat[:,AT[0]]-AT_mat[:,AT[1]])
    ix = np.where(np.all(distances <= window, axis=1))[0]
    test = test[ix,:]
    distances = distances[ix,:]
    # check duplicates
    # use sum of differences as similarity factor
    dist_sum = np.max(distances, axis=1)
    unique_sorted = np.argsort([np.unique(test[:,i]).shape[0] for i in range(test.shape[1])])[::-1]
    test = np.hstack([test, dist_sum.reshape(-1, 1)])
    test = test[test[:,-1].argsort()]
    for j in unique_sorted:
        _, ix = np.unique(test[:,j], return_index=True)
        test = test[ix, :]
    test = test[:,:3]
    test = test[test[:,used_ref].argsort()]
    # check that all values are after each other
    ix = np.where(np.any(np.diff(test, axis=0) > 0, axis=1))[0]
    ix = np.append(ix, test.shape[0]-1)
    test = test[ix,:]
    print('{} coincident samples obtained in {}.'.format(test.shape[0], datetime.datetime.now()-start_time))
    return test
I'm certain that there is a better solution, but for me it works now. And I know, the variable names should definitely be chosen with more clarity (e.g. test), but I will clean up my code at the end of my master thesis... perhaps :-)
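For readers who only want the core idea, here is a minimal sketch of the np.searchsorted step, assuming every channel's arrival times are sorted. The helper name has_neighbor and the sample values are made up for illustration. It only checks each reference sample against each other channel; it does not do the pairwise distance check or the duplicate removal of the full function above, and it treats the window as inclusive on both sides.

```python
import numpy as np

def has_neighbor(ref_times, other_times, window):
    """True where other_times has a sample within +/- window of a ref sample.

    Both arrays must be sorted in ascending order.
    """
    lower = np.searchsorted(other_times, ref_times - window, side="left")
    upper = np.searchsorted(other_times, ref_times + window, side="right")
    # upper > lower means at least one sample falls inside the window.
    return upper > lower

# Toy arrival times in seconds (hypothetical values).
ref = np.array([0.10, 0.20, 0.30, 0.40])
ch2 = np.array([0.1004, 0.25, 0.3009, 0.45])
ch3 = np.array([0.0995, 0.2002, 0.2996])
window = 1e-3

# Keep reference samples that have a partner on every other channel.
mask = has_neighbor(ref, ch2, window) & has_neighbor(ref, ch3, window)
coincident_ix = np.flatnonzero(mask)

print(coincident_ix)
```

Each searchsorted call is a batch of binary searches, so the cost grows roughly as n log m instead of the n times m of comparing every sample with every other one.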

Modifying a numpy array after conversion from pandas dataframe

I have the following code, which I am writing as part of a simple movie recommender in Python so that I can reproduce the results from Coursera's Machine Learning course taught by Andrew Ng.
I want to modify the numpy.ndarray that I get after calling as_matrix() on the pandas DataFrame and add a column vector to it, as we can in MATLAB:
Y = [ratings Y]
Following is my python code
dataFile='/filepath/'
userItemRatings = pd.read_csv(dataFile, sep="\t", names=['userId', 'movieId', 'rating','timestamp'])
movieInfoFile = '/filepath/'
movieInfo = pd.read_csv(movieInfoFile, sep="|", names=['movieId','Title','Release Date','Video Release Date','IMDb URL','Unknown','Action','Adventure','Animation','Childrens','Comedy','Crime','Documentary','Drama','Fantasy','Film-Noir','Horror','Musical','Mystery','Romance','Sci-Fi','Thriller','War','Western'], encoding = "ISO-8859-1")
userMovieMatrix=pd.merge(userItemRatings, movieInfo, left_on='movieId', right_on='movieId')
userMovieSubMatrix = userMovieMatrix[['userId', 'movieId', 'rating','timestamp','Title']]
Y = pd.pivot_table(userMovieSubMatrix, values='rating', index=['movieId'], columns=['userId'])
Y.fillna(0,inplace=True)
movies = Y.shape[0]
users = Y.shape[1] +1
ratings = np.zeros((1682, 1))
ratings[0] = 4
ratings[6] = 3
ratings[11] = 5
ratings[53] = 4
ratings[63] = 5
ratings[65] = 3
ratings[68] = 5
ratings[97] = 2
ratings[182] = 4
ratings[225] = 5
ratings[354] = 5
features = 10
theta = pd.DataFrame(np.random.rand(users,features))# users 943*3
X = pd.DataFrame(np.random.rand(movies,features))# movies 1682 * 3
X = X.as_matrix()
theta = theta.as_matrix()
Y = Y.as_matrix()
"""want to insert a column vector into this Y to get a new Y of dimension
1682*944, but only seeing 1682*943 after the following statement
"""
np.insert(Y, 0, ratings, axis=1)
R = Y.copy()
R[R!=0] = 1
Ymean = np.zeros((movies, 1))
Ynorm = np.zeros((movies, users))
for i in range(movies):
    idx = np.where(R[i,:] == 1)[0]
    Ymean[i] = Y[i,idx].mean()
    Ynorm[i,idx] = Y[i,idx] - Ymean[i]
print(type(Ymean), type(Ynorm), type(Y), Y.shape)
Ynorm[np.isnan(Ynorm)] = 0.
Ymean[np.isnan(Ymean)] = 0.
I have added an inline comment in the code, but my problem is this: when I create a new NumPy array and call insert, it works just fine. However, it does not work on the NumPy array I get from calling as_matrix() on the pandas DataFrame produced by pivot_table(). Is there an alternative?
np.insert does not operate in place; you need to assign its output to a variable. Try:
Y = np.insert(Y, 0, ratings, axis=1)
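A small sketch on toy shapes (5 movies and 3 users, made up for illustration) shows the difference between discarding and assigning the result. It passes the ratings to np.insert as a 1-D array, which avoids any ambiguity about how an (n, 1) column is broadcast when the index is a scalar, and it shows np.hstack as a direct counterpart of MATLAB's [ratings Y].

```python
import numpy as np

Y = np.zeros((5, 3))                             # toy movies x users matrix
ratings = np.arange(1, 6, dtype=float).reshape(-1, 1)  # column vector, (5, 1)

# The return value is discarded here, so Y keeps its original shape.
np.insert(Y, 0, ratings.ravel(), axis=1)
shape_unchanged = Y.shape

# Assigning the result gives the widened matrix.
Y_inserted = np.insert(Y, 0, ratings.ravel(), axis=1)

# Equivalent of MATLAB's [ratings Y] using horizontal stacking.
Y_stacked = np.hstack((ratings, Y))

print(shape_unchanged, Y_inserted.shape, Y_stacked.shape)
```

Both forms return a new array and leave the original Y untouched, so downstream code has to use the returned value.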
