How can I easily Iterate NumPy arrays using approaches other than #nditer? - python

I would like to Iterate each cell value of arrays. I tried it using np.nditer methods (for i in np.nditer(bar_st_1). However, even with 64 GB RAM laptop it tooks alot of computational time and runs out of memory. Do you know what will be the easiest and fastets way to extract each array values? Thanks
#Assign the crop specific irrigated area of each array for each month accoridng to the crop calander
#Barley
for Barley in df_dist.Crop:
for i in np.nditer(bar_st_1):
for j in df_area.Month:
for k in df_dist.Planting_month:
for l in df_dist.Maturity_month:
if (j>= min(k,l)) and (j<= max(k,l)):
df_area.Barley=i
else:
df_area.Barley=0
My goal is to extract a value of each array and assign it for each growing season (month). df_dist is a district-level data frame containing the growing area for each month. bar_st_1 is an array (7*7) that contains an irrigated area of a specific district. For each specific cell, I would like to extract the value of the corresponding array and assign it for a specific month based on the growing season (if the condition is stated above)

for j in df_area.Month:
for k in df_dist.Planting_month:
for l in df_dist.Maturity_month:
if (j>= min(k,l)) and (j<= max(k,l)):
df_area.Barley=i
else:
df_area.Barley=0
This code block seems to be wasting a lot of effort. If you changed the order of the iterations, you could write
for k in df_dist.Planting_month:
for l in df_dist.Maturity_month:
for j in range(min(k,l), max(k,l)+1):
df_area.Barley=i
Then you avoid making a lot of comparisons and calculating a lot of max(k,l)'s that aren't necessary.
The loop over i is also wasting effort, since you write certain entries of df_area.Barley to i, but then in a later iteration you overwrite them with a different value of i, without ever (in the code you've shared) using df_area with the first value of i.
So you could reduce your code to
for Barley in df_dist.Crop:
# Initialize the df_area array for this crop with zeros:
df_area.Barley = np.zeros(df_area.Month.max())
r, c = bar_st_1.shape
# Choose the last element in bar_st_1:
i = bar_st_1[r-1, c-1]
for k in df_dist.Planting_month:
for l in df_dist.Maturity_month:
for j in range(min(k,l), max(k,l)+1):
df_area.Barley=i
Now you've eliminated one level from your nested loop structure and shortened the iteration in another level, so you're likely to get 10x or better improvement in speed.

Related

Appending to a numpy array in for loop

I'm trying to create a Monte Carlo simulation to simulate future stock prices using Numpy arrays.
My current approach is: create a For Loop which fills an array, stock_price_array, with simulated stock prices. These stock prices are generated by taking the last stock price, then multiplying it by 1 + an annual return. The annual returns are drawn randomly from a normal distribution and stored in the array annual_ret.
My problem is that although the "stock price" variables I print from my For Loop appear to be correct, I simply cannot figure out how to Append these stock price variables to stock_price_array.
I've tried various methods, including initializing the stock_price_array using .full instead of .empty, changing the order of where the array appears in the For Loop, and checking the size of the array.
I've read other Stack Overflow posts on similar topics but can't figure out what I'm doing wrong.
Thank you in advance for your help!
annual_mean = .06
annual_stdev = .15
start_stock_price = 100
numYears = 3
numSimulations = 4
stock_price_array = np.empty(numYears)
# draw an annual return from a normal distribution; this annual return will be random
annual_ret = np.random.normal(annual_mean, annual_stdev, numSimulations)
for i in range(numYears):
stock_price = np.multiply(start_stock_price, (1 + annual_ret[i]))
np.append(stock_price_array, [stock_price])
start_stock_price = stock_price
The 1st rule of numpy is: never iterate your array yourself. Use numpy function that does all the computation in batch (and for doing so, they iterate the array, sure. But that iteration is not a python iteration, so it is way faster).
No-for solution
For example, here, you could do something like this
np.cumprod(np.hstack([start_stock_price, annual_ret+1]))
What it does is 1st building an array of a initial value, and some factors.
So if initial value is 100, and interest rate are 0.1, -0.1, 0.2, 0.2 (for example), then hstack build and array of values 100, 1.1, 0.9, 1.2, 1.2.
And the cumprod just build the cumulative product of those
100, 100×1.1=110, 100×1.1×0.9=110×0.9=99, 100×1.1×0.9×1.2=99×1.2=118.8, 100×1.1×0.9×1.2×1.2=118.8×1.2=142.56
Correction of yours
To answer to your initial question anyway (even if I strongly advise that you try to use solutions like the usage of cumprod I've shown), you have 2 choices:
Either you allocate in advance an array, as you did (your stock_price_array = np.empty(numYears)). And then, instead of trying to append the new stock_price to stock_price_array, you should simply fill one of the empty place that are already there. By simply doing stock_price_array[i] = stock_price
Or you don't. And then you replace the np.empty line by a stock_price_array=[]. And then, at each step, you do append the result to create a new stock_price_array, like this stock_price_array = np.append(stock_price_array, [stock_price])
I strongly advise against the 2nd solution. Since you already know the final size of the array, it is way better to create it once. Because np.append recreate a brand new array, then copies the input data it it. It does not just extend the existing array (generally speaking, we can't do that anyway).
But, well, anyway, I advise against both solution, since I find mine (with cumprod) preferable. for is the taboo word in numpy. And it is even more so, when what inside this for is the creation of a new array, like append is.
Monte-Carlo
Since you've mentioned Monte-Carlo, and then shown a code that compute only one result (you draw 1 set of annual ret, and perform one computation of future values), I am wondering if that is really what you want.
In particular, I see that you have numSimulation and numYears, that appear to be playing redundant roles in your code (and therefore in mines).
The only reason why it doesn't just throw a index error, is because numSimulation is used only to decide how many annual_ret you draw. And since numSimulation > numYears, you have more than enough annual_ret to compute the result.
Wasn't your initial intention to redo the simulation over the years numSimulation time, to have numSimulation results ?
In which case, you probably need numSimulation sets of numYears annual rate. So a 2D array. And like wise, you should be computing numSimulation series of numYears results.
If my guess is not completely off, I surmise that what you really wanted to do was rather in the effect of:
annual_ret = np.random.normal(annual_mean, annual_stdev, (numSimulations, numYears)) # 2d array of interest rate. 1 simulation per row, 1 year per column
t = np.pad(annual_ret+1, ((0,0), (1,0)), constant_values=start_stock_price) # Add 1 as we did earlier. And pad with an initial 100 (`start_stock_price`) at the beginning of each simulation
res = np.cumprod(t, axis=1) # cumulative multiplication. `axis=1` means that it is done along axis 1 (along years) for each row (for each simulation)

Python: map two arrays with similar values

I have two arrays, lets say A and B. I want to find all elements in B that are within a certain value of all elements in A.
To be exact, I am working with database of galaxies and I want to compare simulated galaxies (S) with observed (O).
I have 2 arrays: brightness (Br) and distance (D) for both simulated and observations. (so 4 arrays in total)
for each simulated galaxy, i want to find a set of galaxies from the observed list, that has similar brightness and distance.
I have already made a code that could work but my program says it will take around 9-10 hours to run the code. Is there any other way I can efficiently change the code to speed up the process:
Here is my code
list1= flux_hb
list2= flux_inp_ha
flux_MC=[[]]*len(list1) # map flux from MAMBO to COSMOS (mambo--> (list of cosmos galaxies))
noise=[[]]*len(list1)
z_MC=[[]]*len(list1)
index=np.zeros(len(list1))
flux_ha_temp=np.zeros(len(list1))
for i in tqdm(range(len(list1))):
temp_flux_diff = abs(list1[i] - list2)/list1[i] #Matrix subtraction
temp_z_diff = abs(z_geo[i] - zinp)/z_geo[i]
for j in range(len(temp_flux_diff)):
if (temp_flux_diff[j]<0.01 and temp_z_diff[j]<0.01):
flux_MC[i].append(list2[j])
noise[i].append(snr_ha[j])
z_MC[i].append(zinp[j])
My approach is to make an empty list to save the information I need.
For each element in list1, find the difference in values of list1 from list2 (temp_flux_diff)
Now for each element in list2, if any of this difference is smaller than required difference, i call them similar values and save them in the empty list.
Hopefully, I was able to explain what I need and how I did it. I just want to speed up the code.
Update 1 As per the suggestion in comments by #СергейКох, i changed my second loop to:
index = np.where(np.logical_and(temp_flux_diff<0.01, temp_z_diff<0.01))[0]
flux_MC[i].append(flux_inp_ha[index])
noise[i].append(snr_ha[index])
z_MC[i].append(zinp[index])
and my code now takes around 1 hour instead of 9 hours.

Numpy notation to replace an enumerate(zip(....))

I'm starting to use numpy. I get the slice notations and element-wise computations, but I can't understand this:
for i, (I,J) in enumerate(zip(data_list[0], data_list[1])):
joint_hist[int(np.floor(I/self.bin_size))][int(np.floor(J/self.bin_size))] += 1
Variables:
data_list contains two np.array().flatten() images (eventually more)
joint_hist[] is the joint histogram of those two images, it's displayed later with plt.imshow()
bin_size is the number of slots in the histogram
I can't understand why the coordinate in the final histogram is I,J. So it's not just that the value at a position in joint_hist[] is the result of some slicing/element-wise computation. I need to take the result of that computation and use THAT as the indices in joint_hist...
EDIT:
I indeed do not use the i in the loop actually - it's a leftover from previous iterations and I simply hadn't noticed I didn't need it anymore
I do want to remain in control of the bin sizes & the details of how this is done, so not particularly looking to use histogramm2D. I will later be using that for further image processing, so I'd rather have the flexibility to adapt my approach than have to figure out if/how to do particular things with built-in functions.
You can indeed gussy up that for loop using some numpy notation. Assuming you don't actually need i (since it isn't used anywhere):
for I,J in (data_list.T // self.bin_size).astype(int):
joint_hist[I, J] += 1
Explanation
data_list.T flips data_list on its side. Each row of data_list.T will contain the data for the pixels at a particular coordinate.
data_list.T // self.bin_size will produce the same result as np.floor(I/self.bin_size), only it will operate on all of the pixels at once, instead of one at a time.
.astype(int) does the same thing as int(...), but again operates on the entire array instead of a single element.
When you iterate over a 2D array with a for loop, the rows are returned one at a time. Thus, the for I,J in arr syntax will give you back one pair of pixels at a time, just like your zip statement did originally.
Alternative
You could also just use histogramdd to calculate joint_hist, in place of your for loop. For your application it would look like:
import numpy as np
joint_hist,edges = np.histogramdd(data_list.T)
This would have different bins than the ones you specified above, though (numpy would determine them automatically).
If I understand, your goal is to make an histogram or correlated values in your images? Well, to achieve the right bin index, the computation that you used is not valid. Instead of np.floor(I/self.bin_size), use np.floor(I/(I_max/bin_size)).astype(int). You want to divide I and J by their respective resolution. The result that you will get is a diagonal matrix for joint_hist if both data_list[0] and data_list[1] are the same flattened image.
So all put together:
I_max = data_list[0].max()+1
J_max = data_list[1].max()+1
joint_hist = np.zeros((I_max, J_max))
bin_size = 256
for i, (I, J) in enumerate(zip(data_list[0], data_list[1])):
joint_hist[np.floor(I / (I_max / bin_size)).astype(int), np.floor(J / (J_max / bin_size)).astype(int)] += 1

Memory problems for multiple large arrays

I'm trying to do some calculations on over 1000 (100, 100, 1000) arrays. But as I could imagine, it doesn't take more than about 150-200 arrays before my memory is used up, and it all fails (at least with my current code).
This is what I currently have now:
import numpy as np
toxicity_data_path = open("data/toxicity.txt", "r")
toxicity_data = np.array(toxicity_data_path.read().split("\n"), dtype=int)
patients = range(1, 1000, 1)
The above is just a list of 1's and 0's (indicating toxicity or not) for each array (in this case one array is data for one patient). So in this case roughly 1000 patients.
I then create two lists from the above code so I have one list with patients having toxicity and one where they have not.
patients_no_tox = [i for i, e in enumerate(toxicity_data.astype(np.str)) if e in set("0")]
patients_with_tox = [i for i, e in enumerate(toxicity_data.astype(np.str)) if e in set("1")]
I then write this function, which takes an already saved-to-disk array ((100, 100, 1000)) for each patient, and then remove some indexes (which is also loaded from a saved file) on each array that will not work later on, or just needs to be removed. So it is essential to do so. The result is a final list of all patients and their 3D arrays of data. This is where things start to eat memory, when the function is used in the list comprehension.
def log_likely_list(patient, remove_index_list):
array_data = np.load("data/{}/array.npy".format(patient)).ravel()
return np.delete(array_data, remove_index_list)
remove_index_list = np.load("data/remove_index_list.npy")
final_list = [log_likely_list(patient, remove_index_list) for patient in patients]
Next step is to create two lists that I need for my calculations. I take the final list, with all the patients, and remove either patients that have toxicity or not, respectively.
patients_no_tox_list = np.column_stack(np.delete(final_list, patients_with_tox, 0))
patients_with_tox_list = np.column_stack(np.delete(final_list, patients_no_tox, 0))
The last piece of the puzzle is to use these two lists in the following equation, where I put the non-tox list into the right side of the equation, and with tox on the left side. It then sums up for all 1000 patients for each individual index in the 3D array of all patients, i.e. same index in each 3D array/patient, and then I end up with a large list of values pretty much.
log_likely = np.sum(np.log(patients_with_tox_list), axis=1) +
np.sum(np.log(1 - patients_no_tox_list), axis=1)
My problem, as stated is, that when I get around 150-200 (in the patients range) my memory is used, and it shuts down.
I have obviously tried to save stuff on the disk to load (that's why I load so many files), but that didn't help me much. I'm thinking maybe I could go one array at a time and into the log_likely function, but in the end, before summing, I would probably just have just as large an array, plus, the computation might be a lot slower if I can't use the numpy sum feature and such.
So is there any way I could optimize/improve on this, or is the only way to but a hell of lot more RAM ?
Each time you use a list comprehension, you create a new copy of the data in memory. So this line:
final_list = [log_likely_list(patient, remove_index_list) for patient in patients]
contains the complete data for all 1000 patients!
The better choice is to utilize generator expressions, which process items one at a time. To form a generator, surround your for...in...: expression with parentheses instead of brackets. This might look something like:
with_tox_data = (log_likely_list(patient, remove_index_list) for patient in patients_with_tox)
with_tox_log = (np.log(data, axis=1) for data in with_tox_data)
no_tox_data = (log_likely_list(patient, remove_index_list) for patient in patients_no_tox)
no_tox_log = (np.log(1 - data, axis=1) for data in no_tox_data)
final_data = itertools.chain(with_tox_log, no_tox_log)
Note that no computations have actually been performed yet: generators don't do anything until you iterate over them. The fastest way to aggregate all the results in this case is to use reduce:
log_likely = functools.reduce(np.add, final_data)

Iterating over elements, finding minima per each element

First time posting, so I apologize for any confusion.
I have two numpy arrays which are time stamps for a signal.
chan1,chan2 looks like:
911.05, 7.7
1055.6, 455.0
1513.4, 1368.15
4604.6, 3004.4
4970.35, 3344.25
13998.25, 4029.9
15008.7, 6310.15
15757.35, 7309.75
16244.2, 8696.1
16554.65, 9940.0
..., ...
and so on, (up to 65000 elements per chan. pre file)
Edit : The lists are already sorted but the issue is that they are not always equal in spacing. There are gaps that could show up, which would misalign them, so chan1[3] could be closer to chan2[23] instead of, if the spacing was qual chan2[2 or 3 or 4] : End edit
For each elements in chan1, I am interested in finding the closest neighbor in chan2, which is done with:
$ np.min(np.abs(chan2-chan1[i]))
and to keep track of positive or neg. difference:
$ index=np.where( np.abs( chan2-chan1[i]) == res[i])[0][0]
$ if chan2[index]-chan1[i] <0.0 : res[i]=res[i]*(-1.0)
Lastly, I create a histogram of all the differences, in a range I am interested in.
My concern is that I do this in the for loop. I usually try to avoid for loops when I can by utilizing the numpy arrays, as each operation can be performed on the entire array. However, in this case I am unable to find a solution or a build in function (which I understand run significantly faster than anything I can make).
The routine takes about 0.03 seconds per file. There are a few more things happening outside of the function but not a significant number, mostly plotting after everything is done, and a loop to read in files.
I was wondering if anyone has seen a similar problem, or is familiar enough with the python libraries to suggest a solution (maybe a build in function?) to obtain the data I am interested in? I have to go over hundred of thousands of files, and currently my data analysis is about 10 slower than data acquisition. We are also in the middle of upgrading our instruments to where we will be able to obtain data 10-100 times faster, and so the analysis speed is going to become an serious issue.
I would prefer not to use a cluster to brute force the problem, and not too familiar with parallel processing, although I would not mind dabbling in it. It would take me a while to write it in C, and I am not sure if I would be able to make it faster.
Thank you in advance for your help.
def gen_hist(chan1,chan2):
res=np.arange(1,len(chan1)+1,1)*0.0
for i in range(len(chan1)):
res[i]=np.min(np.abs(chan2-chan1[i]))
index=np.where( np.abs( chan2-chan1[i]) == res[i])[0][0]
if chan2[index]-chan1[i] <0.0 : res[i]=res[i]*(-1.0)
return np.histogram(res,bins=np.arange(time_range[0]-interval,\
time_range[-1]+interval,\
interval))[0]
After all the files are cycled through I obtain a plot of the data:
Example of the histogram
Your question is a little vague, but I'm assuming that, given two sorted arrays, you're trying to return an array containing the differences between each element of the first array and the closest value in the second array.
Your algorithm will have a worst case of O(n^2) (np.where() and np.min() are O(n)). I would tackle this by using two iterators instead of one. You store the previous (r_p) and current (r_c) value of the right array and the current (l_c) value of the left array. For each value of the left array, increment the right array until r_c > l_c. Then append min(abs(r_p - l_c), abs(r_c - l_c)) to your result.
In code:
l = [ ... ]
r = [ ... ]
i = 0
j = 0
result = []
r_p = r_c = r[0]
while i < len(l):
l_c = l[i]
while r_c < l and j < len(r):
j += 1
r_c = r[j]
r_p = r[j-1]
result.append(min(abs(r_c - l_c), abs(r_p - l_c)))
i += 1
This runs in O(n). If you need additional speed out of it, try writing it in C or running it in Cython.

Categories

Resources