Efficiently assign arrays along an axis in Python - python

I have an image which I subsample
Count = 0
classim = np.zeros([temp1.shape[0], temp1.shape[1]])
for rows in range(int(np.floor(temp1.shape[0]/SAMPLE_SIZE))):
    for cols in range(int(np.floor(temp1.shape[1]/SAMPLE_SIZE))):
        classim[np.multiply(rows, SAMPLE_SIZE):np.multiply(rows+1, SAMPLE_SIZE),
                np.multiply(cols, SAMPLE_SIZE):np.multiply(cols+1, SAMPLE_SIZE)] = predict.argmax(axis=-1)[Count]
        Count = np.add(Count, 1)
This is terribly slow. I get the labels from predict.argmax(axis=-1)[Count], but I can of course have them in vector form.
In other words, how can I vectorise the above loop?

Taking your row calculations outside the inner loop would help a little, since those calculations would then only be made once per row.
A few other tidy-ups give:
classim = np.zeros_like(temp1)
predict_args = predict.argmax(axis=-1)
Count = 0  # needs initialising before the loop
for rows in range(temp1.shape[0]//SAMPLE_SIZE):
    row_0 = rows * SAMPLE_SIZE
    row_1 = (rows+1) * SAMPLE_SIZE
    for cols in range(temp1.shape[1]//SAMPLE_SIZE):
        col_0 = cols * SAMPLE_SIZE
        col_1 = (cols+1) * SAMPLE_SIZE
        classim[row_0:row_1, col_0:col_1] = predict_args[Count]
        Count += 1
You would need to tell us more about the predict object before I could do much more. But these changes will help a little.
--EDIT--
You could take advantage of the numpy.repeat function. Then there is no need to iterate through the whole classim:
SAMPLE_SIZE = 2
temp1 = np.arange(20*20).reshape((20,20))
sample_shape = (temp1.shape[0]//SAMPLE_SIZE, temp1.shape[1]//SAMPLE_SIZE)

# This line should work as per your question, but returns a single value
# predict_args = predict.argmax(axis=-1)
# Use this for illustration purposes
predict_args = np.arange(sample_shape[0] * sample_shape[1])

subsampled = predict_args.reshape(sample_shape)
classim = np.repeat(np.repeat(subsampled, SAMPLE_SIZE, axis=1), SAMPLE_SIZE, axis=0)
print(subsampled)
print(classim)
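If predict really does hold one prediction per SAMPLE_SIZE x SAMPLE_SIZE tile, visited row by row as in your loop (an assumption on my part), the same repeat trick can be fed directly from it, roughly like this:

# Sketch only: assumes predict.argmax(axis=-1) yields one label per tile,
# in the same row-major order the original double loop visits them.
n_rows = temp1.shape[0] // SAMPLE_SIZE
n_cols = temp1.shape[1] // SAMPLE_SIZE
labels = predict.argmax(axis=-1)[:n_rows * n_cols].reshape(n_rows, n_cols)
classim = np.repeat(np.repeat(labels, SAMPLE_SIZE, axis=0), SAMPLE_SIZE, axis=1)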

Related

Python: how do I add two values together from the same column in numpy and iterate an add function on that specific column

I'm working on some Python Binance data and have run into a problem. I want to do simple math on a single column and iterate the math on that same column.
The data I'm working with is volume from a binance feed.
This is my code:
# Imports assumed from context:
# from binance.client import Client
# from pandas import DataFrame as df
# from datetime import datetime
# import numpy as np
candles = client.get_historical_klines("BTCUSDT", Client.KLINE_INTERVAL_1HOUR, "1 day ago UTC")
candle_dataframe = df(candles)
candle_dataframe_date = candle_dataframe[0]
date_init = []
for time in candle_dataframe_date.unique():
    readable = datetime.fromtimestamp(int(time/1000))
    date_init.append(readable)
candle_dataframe.pop(0)
candle_dataframe.pop(11)
dataframe_final_date = df(date_init)
dataframe_final_date.columns = ['date']
final_dataframe = candle_dataframe.join(dataframe_final_date)
final_dataframe.set_index('date', inplace=True)
final_dataframe.columns = ['open', 'high', 'low', 'close', 'volume', 'close_time', 'asset_volume', 'trade_number', 'taker_buy_base', 'taker_buy_quote']
list_volume = final_dataframe.iloc[:, [4]]
np_array = list_volume.to_numpy()
arr = np_array.astype('float64')
np.transpose(arr)
#print(np.transpose(arr))

# example 1
b = np_array[::2]
a = np_array[1::2]
print(np.add(a, b))

# example two
#for i, values, values1 in np_array:
#    values[0] += 1
#    if i == values[1]:
#        np.multiply(values, values1)
#result = np.add(a + b)
Example two kind of shows what I'm trying to do, I hope.
But is there a way to control the iteration by index, so that I take the last volume and the next one, add those together, then add the next two values, and so on?
It seems impossible to do simple math on single column entries by index, or is there a way through it?
I just think it would be silly to have to copy the column and work with redundant data just to perform the math.
Why loop? Shouldn't this work?
arr = np.random.rand(10)
d = arr[:-1] + arr[1:]
The answer to my question was in a comment linking to an article, and it is this function.
d = np.zeros(arr.size - 1)
for i in range(len(arr) - 1):
    d[i] = arr[i + 1] + arr[i]
Thank you so much.
Mathias
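For reference, a small sketch (not from the thread itself) contrasting the two readings of "adding the next two values" that appear above: overlapping sums of neighbours, which the accepted loop computes, versus non-overlapping pairs as in example 1.

import numpy as np

arr = np.arange(1.0, 9.0)            # stand-in for the volume column
neighbours = arr[:-1] + arr[1:]      # arr[i] + arr[i+1] for every i (what the loop computes)
pairs = arr[::2] + arr[1::2]         # (arr[0]+arr[1]), (arr[2]+arr[3]), ... as in example 1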

Trying to loop through multiple arrays and getting error: ValueError: cannot reshape array of size 2 into shape (44,1)

New to for loops and I cannot seem to get this one to work. I have multiple arrays that I want to run through my code. It works for individual arrays, but when I try to run it through a list of arrays it tries to join the arrays together.
I have tried looping in pandas and multiple attempts at looping in numpy.
Min regret matrix:
for i in [a],[b],[c],[d],[e]:
    # sum columns and rows:
    suma0 = np.sum(a, axis=0)
    suma1 = np.sum(a, axis=1)
    # find the minimum values for rows and columns:
    col_min = np.min(a)
    col_min0 = data.min(0)
    row_min = np.min(a[:44])
    row_min0 = data.min(1)
    # difference or least regret between scenarios and policies:
    p = np.array(a)
    q = np.min(p, axis=0)
    r = np.min(p, axis=1)
    cidx = np.argmin(p, axis=0)
    ridx = np.argmin(p, axis=1)
    cdif = p - q
    rdif = p - r[:, None]
    # find the sum of the rows and columns for the difference arrays:
    sumc = np.sum(cdif, axis=0)
    sumr = np.sum(rdif, axis=1)
    sumr1 = np.reshape(sumr, (44, 1))
    # append the scenario array with the column sums:
    sumcol = np.zeros((45, 10))
    sumcol = np.append([cdif], [sumc])
    sumcol.shape = (45, 10)
    # rank columns:
    order0 = sumc.argsort()
    rank0 = order0.argsort()
    rankcol = np.zeros((46, 10))
    rankcol = np.append([sumcol], [rank0])
    rankcol.shape = (46, 10)
    # append the policy array with row sums:
    sumrow = np.zeros((44, 11))
    sumrow = np.hstack((rdif, sumr1))
    # rank rows:
    order1 = sumr.argsort()
    rank1 = order1.argsort()
    rank1r = np.reshape(rank1, (44, 1))
    rankrow = np.zeros((44, 12))
    rankrow = np.hstack((sumrow, rank1r))
    print(sumrow)
    print(rankrow)
    # Add row and column headers for least regret for df0:
    RCP = np.zeros((47, 11))
    RCP = pd.DataFrame(rankcol, columns=column_names1, index=row_names1)
    print(RCP)
    # Add row and column headers for least regret for df1:
    RCP1 = np.zeros((45, 13))
    RCP1 = pd.DataFrame(rankrow, columns=column_names2, index=row_names2)
    print(RCP1)
    # Export loops to CSV in output folder:
    filepath = os.path.join(output_path, 'out_' + str(index) + '.csv')
    RCP.to_csv(filepath)
    filepath = os.path.join(output_path, 'out1_' + str(index) + '.csv')
    RCP1.to_csv(filepath)
As per your question, please highlight the input, expected output and error; here is a base case example.
x = np.random.randn(2)
x.shape = (2,)
and if we attempt:
x.reshape(44,1)
The error we get is:
ValueError: cannot reshape array of size 2 into shape (44,1)
The reason for this error is simple: we are trying to reshape an array of size 2 into a shape that requires 44 elements. As per the error highlighted, please check the dimensions of your input and expected output.
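To address the looping part of the question, here is a minimal sketch (my assumption about the intended setup, with hypothetical stand-in arrays) of iterating over several arrays one at a time, so each pass works on a single array, and reshaping with the array's own length instead of a hard-coded 44:

import numpy as np

a, b, c, d, e = (np.random.rand(44, 10) for _ in range(5))  # hypothetical stand-in arrays

for index, data in enumerate([a, b, c, d, e]):
    sum_cols = data.sum(axis=0)             # column sums of this array only
    sum_rows = data.sum(axis=1)             # row sums of this array only
    sum_rows_col = sum_rows.reshape(-1, 1)  # shape follows the array, avoiding the (44,1) mismatch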

How to vectorize this peak finding for loop in Python?

Basically I'm writing a peak finding function that needs to be able to beat scipy.argrelextrema in benchmarking. Here is a link to the data I'm using, and the code:
https://drive.google.com/open?id=1U-_xQRWPoyUXhQUhFgnM3ByGw-1VImKB
If this link expires, the data can be found at dukascopy bank's online historical data downloader.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('EUR_USD.csv')
data.columns = ['Date', 'open', 'high', 'low', 'close', 'volume']
data.Date = pd.to_datetime(data.Date, format='%d.%m.%Y %H:%M:%S.%f')
data = data.set_index(data.Date)
data = data[['open', 'high', 'low', 'close']]
data = data.drop_duplicates(keep=False)
price = data.close.values

def fft_detect(price, p=0.4):
    trans = np.fft.rfft(price)
    trans[round(p*len(trans)):] = 0
    inv = np.fft.irfft(trans)
    dy = np.gradient(inv)
    peaks_idx = np.where(np.diff(np.sign(dy)) == -2)[0] + 1
    valleys_idx = np.where(np.diff(np.sign(dy)) == 2)[0] + 1
    patt_idx = list(peaks_idx) + list(valleys_idx)
    patt_idx.sort()
    label = [x for x in np.diff(np.sign(dy)) if x != 0]
    # Look for Better Peaks
    l = 2
    new_inds = []
    for i in range(0, len(patt_idx[:-1])):
        search = np.arange(patt_idx[i]-(l+1), patt_idx[i]+(l+1))
        if label[i] == -2:
            idx = price[search].argmax()
        elif label[i] == 2:
            idx = price[search].argmin()
        new_max = search[idx]
        new_inds.append(new_max)
    plt.plot(price)
    plt.plot(inv)
    plt.scatter(patt_idx, price[patt_idx])
    plt.scatter(new_inds, price[new_inds], c='g')
    plt.show()
    return peaks_idx, price[peaks_idx]
It basically smooths the data using a fast Fourier transform (FFT), then takes the derivative to find the minimum and maximum indices of the smoothed data, and then finds the corresponding peaks on the unsmoothed data. Sometimes the peaks it finds are not ideal due to smoothing effects, so I run this for loop to search for higher or lower points around each index, within the bounds specified by l. I need help vectorizing this for loop! I have no idea how to do it. Without the for loop, my code is about 50% faster than scipy.argrelextrema, but the for loop slows it down. So if I can find a way to vectorize it, it would be a very quick and very effective alternative to scipy.argrelextrema. These two images represent the data without and with the for loop respectively.
This may do it. It's not perfect, but hopefully it obtains what you want and shows you a bit of how to vectorize. Happy to hear any improvements you think up.
label = np.array(label[:-1]) # not sure why this is 1 unit longer than search.shape[0]?
# the idea is to make the index matrix you're for looping over row by row all in one go.
# This part is sloppy and you can improve this generation.
search = np.vstack((np.arange(patt_idx[i]-(l+1),patt_idx[i]+(l+1)) for i in range(0,len(patt_idx[:-1])))) # you can refine this.
# then you can make the price matrix
price = price[search]
# and you can swap the sign of elements so you only need to do argmin instead of both argmin and argmax
price[label==-2] = - price[label==-2]
# now find the indices of the minimum price on each row
idx = np.argmin(price,axis=1)
# and then extract the refined indices from the search matrix
new_inds = search[np.arange(idx.shape[0]),idx] # this too can be cleaner.
# not sure what's going on here so that search[:,idx] doesn't work for me
# probably just a misunderstanding
I find that this reproduces your result but I did not time it. I suspect the search generation is quite slow but probably still faster than your for loop.
Edit:
Here's a better way to produce search:
patt_idx = np.array(patt_idx)
starts = patt_idx[:-1]-(l+1)
stops = patt_idx[:-1]+(l+1)
ds = stops-starts
s0 = stops.shape[0]
s1 = ds[0]
search = np.reshape(np.repeat(stops - ds.cumsum(), ds) + np.arange(ds.sum()),(s0,s1))
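An equivalent way to build the same search matrix, offered here as a suggestion rather than part of the original answer, is plain broadcasting:

# Assumes l and patt_idx as defined above; each row covers
# patt_idx[i]-(l+1) .. patt_idx[i]+l, i.e. np.arange(patt_idx[i]-(l+1), patt_idx[i]+(l+1)).
patt_idx = np.asarray(patt_idx)
offsets = np.arange(-(l + 1), l + 1)      # 2*l + 2 offsets per pattern index
search = patt_idx[:-1, None] + offsets    # one row of candidate indices per peak/valley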
Here is an alternative... it uses list comprehension which is generally faster than for-loops
l = 2
# Define the bounds beforehand, its marginally faster than doing it in the loop
upper = np.array(patt_idx) + l + 1
lower = np.array(patt_idx) - l - 1
# List comprehension...
new_inds = [price[low:hi].argmax() + low if lab == -2 else
            price[low:hi].argmin() + low
            for low, hi, lab in zip(lower, upper, label)]
# Find maximum within each interval
new_max = price[new_inds]
new_global_max = np.max(new_max)
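One caveat worth adding (my own observation, not from either answer): near the ends of price the search windows can run past the valid index range, so it may be safer to clip the bounds before slicing, for example:

# Sketch: clamp the window bounds to the valid index range of price.
lower = np.clip(np.array(patt_idx) - l - 1, 0, len(price))
upper = np.clip(np.array(patt_idx) + l + 1, 0, len(price))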

Optimizing Python Code: Faster groupby and for loops

I want to make the for loop given below faster in Python.
import pandas as pd
import numpy as np
import scipy
import scipy.stats
import math

np.random.seed(1)
xl = pd.DataFrame({'Concat' : np.arange(101,999), 'ships_x' : np.random.randint(1001,3000,size=898)})
yl = pd.DataFrame({'PickDate' : np.random.randint(1,8,size=10000),'Concat' : np.random.randint(101,999,size=10000), 'ships_x' : np.random.randint(101,300,size=10000), 'ships_y' : np.random.randint(1001,3000,size=10000)})
tempno = [np.random.randint(1,100,size=5)]
k = 1
p = pd.DataFrame(0, index=np.arange(len(xl)), columns=['temp','cv']).astype(object)
for ib in [xb for xb in range(0,len(xl))]:
    tempno1 = np.append(tempno, ib)
    temp = list(set(tempno1))
    temptab = yl[yl['Concat'].isin(np.array(xl['Concat'][tempno1]))].groupby('PickDate')['ships_x','ships_y'].sum().reset_index()
    temptab['contri'] = temptab['ships_x']/temptab['ships_y']
    p.ix[k-1,'cv'] = 1 if math.isnan(scipy.stats.variation(temptab['contri'])) else scipy.stats.variation(temptab['contri'])
    p.ix[k-1,'temp'] = temp
    k = k+1
where:
xl, yl - the two data frames I am working on, with columns like Concat, ships_x and ships_y.
tempno - an initial list of indices of the xl dataframe, referring to a list of 'Concat' values.
So, in the for loop we add one extra index to tempno in each iteration and then subset the 'yl' dataframe based on 'Concat' values matching those of the 'xl' dataframe. Then we find the "coefficient of variation" (taken from scipy) and record it in the new dataframe 'p'.
The problem is that it takes too much time, as the number of iterations of the for loop runs into the thousands. The 'groupby' line takes the most time. I have tried and made a few changes; the code now looks like below, with the changes mentioned in comments. There is a slight improvement, but this doesn't solve my purpose. Please suggest the fastest way possible to implement this. Many thanks.
# Getting all tempno1 into a list with one step
tempno1 = [np.append(tempno,ib) for ib in [xb for xb in range(0,len(xl))]]
temp = [list(set(tempk)) for tempk in tempno1]
# Taking only needed columns from x and y dfs
xtemp = xl[['Concat']]
ytemp = yl[['Concat','ships_x','ships_y','PickDate']]
#Shortlisting y df and groupby in two diff steps
ytemp = [ytemp[ytemp['Concat'].isin(np.array(xtemp['Concat'][tempnokk]))] for tempnokk in tempno1]
temptab = [ytempk.groupby('PickDate')['ships_x','ships_y'].sum().reset_index() for ytempk in ytemp]
tempkcontri = [tempk['ships_x']/tempk['ships_y'] for tempk in temptab]
tempkcontri = [pd.DataFrame(tempkcontri[i],columns=['contri']) for i in range(0,len(tempkcontri))]
temptab = [temptab[i].join(tempkcontri[i]) for i in range(0,len(temptab))]
pcv = [1 if math.isnan(scipy.stats.variation(temptabkk['contri'])) else scipy.stats.variation(temptabkk['contri']) for temptabkk in temptab]
p = pd.DataFrame({'temp' : temp,'cv': pcv})
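Since only one index changes per iteration, one possible direction (a sketch under the assumption that the sums over the union of Concat values can be split into a fixed part plus a per-value increment, not a tested drop-in) is to run the expensive groupby once for the fixed part of tempno and once per Concat value, then just add the two precomputed tables inside the loop:

import math
import numpy as np
import scipy.stats

# Precompute once: sums for the fixed Concat values, and sums per individual Concat value.
base_vals = set(np.array(xl['Concat'][np.array(tempno).ravel()]))
base_sums = (yl[yl['Concat'].isin(base_vals)]
             .groupby('PickDate')[['ships_x', 'ships_y']].sum())
per_concat = yl.groupby(['Concat', 'PickDate'])[['ships_x', 'ships_y']].sum()
known_concats = set(per_concat.index.get_level_values(0))

cv = []
for ib in range(len(xl)):
    val = xl['Concat'].iloc[ib]
    if val in base_vals or val not in known_concats:
        tab = base_sums                                       # adding ib changes nothing
    else:
        tab = base_sums.add(per_concat.loc[val], fill_value=0)
    contri = tab['ships_x'] / tab['ships_y']
    v = scipy.stats.variation(contri)
    cv.append(1 if math.isnan(v) else v)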

pandas: setting last N rows of multi-index to Nan for speeding up groupby with shift

I am trying to speed up my groupby.apply + shift, and thanks to this previous question and answer, How to speed up Pandas multilevel dataframe shift by group?, I can show that it does indeed speed things up when you have many groups.
From that question I now have the following code to set the first entry in each multi-index group to NaN, so that I can do my shift globally rather than per group.
df.iloc[df.groupby(level=0).size().cumsum()[:-1]] = np.nan
but I want to look forward, not backwards, and need to do calculations across N rows. So I am trying to use some similar code to set the last N entries to NaN, but obviously I am missing some important indexing knowledge as I just can't figure it out.
I figure I want to convert this so that every entry is a range rather than a single integer. How would I do that?
# the start of each group, ignoring the first entry
df.groupby(level=0).size().cumsum()[1:]
Test setup (for backwards shift) if you want to try it:
length = 5
groups = 3
rng1 = pd.date_range('1/1/1990', periods=length, freq='D')
frames = []
for x in xrange(0, groups):
    tmpdf = pd.DataFrame({'date':rng1,'category':int(10000000*abs(np.random.randn())),'colA':np.random.randn(length),'colB':np.random.randn(length)})
    frames.append(tmpdf)
df = pd.concat(frames)
df.sort(columns=['category','date'], inplace=True)
df.set_index(['category','date'], inplace=True, drop=True)
df['tmpShift'] = df['colB'].shift(1)
df.iloc[df.groupby(level=0).size().cumsum()[:-1]] = np.nan
# Yay this is so much faster.
df['newColumn'] = df['tmpShift'] / df['colA']
df.drop('tmpShift', 1, inplace=True)
Thanks!
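A minimal sketch (my own, building on the cumulative group sizes already used above) of the purely positional version being asked about: turn each group's end position into a range covering its last N rows, then blank those rows with iloc.

# Assumes df and the 'tmpShift' column from the test setup above; N is the
# number of trailing rows to blank in each group (an assumed parameter).
N = 1
ends = df.groupby(level=0).size().cumsum().values            # one past the last row of each group
last_n = np.concatenate([np.arange(e - N, e) for e in ends])  # positions of the last N rows per group
df.iloc[last_n, df.columns.get_loc('tmpShift')] = np.nan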
I ended up doing it using a groupby apply as follows (and coded to work forwards or backwards):
def replace_tail(grp, col, N, value):
    if N > 0:
        grp[col][:N] = value
    else:
        grp[col][N:] = value
    return grp

df = df.groupby(level=0).apply(replace_tail, 'tmpShift', 2, np.nan)
So the final code is:
def replace_tail(grp, col, N, value):
    if N > 0:
        grp[col][:N] = value
    else:
        grp[col][N:] = value
    return grp

length = 5
groups = 3
rng1 = pd.date_range('1/1/1990', periods=length, freq='D')
frames = []
for x in xrange(0, groups):
    tmpdf = pd.DataFrame({'date':rng1,'category':int(10000000*abs(np.random.randn())),'colA':np.random.randn(length),'colB':np.random.randn(length)})
    frames.append(tmpdf)
df = pd.concat(frames)
df.sort(columns=['category','date'], inplace=True)
df.set_index(['category','date'], inplace=True, drop=True)
shiftBy = -1
df['tmpShift'] = df['colB'].shift(shiftBy)
df = df.groupby(level=0).apply(replace_tail, 'tmpShift', shiftBy, np.nan)
# Yay this is so much faster.
df['newColumn'] = df['tmpShift'] / df['colA']
df.drop('tmpShift', 1, inplace=True)
