Fast code for finding the max value of a 2d array - python

I've read a lot of answers, including this one, so it's not a duplicate (especially given the way I read the value). Is there faster code for this:
mx = 0
for i in range(0, len(self.board)):
    for j in range(0, len(self.board[i])):
        for k in range(0, len(self.board[i][j]['b'])):
            l = self.board[i][j]['b'][k]
            mx = max([mx, l.get('id', 0)])
in Python? Maybe with map, but I don't see how?
Each "cell" of the board is a dict like that
'b' = array of dicts: each dicts contains an information about a piece of a game, example: {'id':3, 'nb':1, 'kind':'bee'}. We can have many pieces on the same cell (one piece on top of another one)
'p' = array of ids of the pieces above if we can put them on this cell
'h' = array of 'kind' of pieces that are not yet on the board but we can put on this cell
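For illustration, a single cell following that description could look something like this (the concrete ids and kinds below are invented):
cell = {
    'b': [{'id': 3, 'nb': 1, 'kind': 'bee'},
          {'id': 7, 'nb': 1, 'kind': 'beetle'}],  # pieces stacked on this cell, last one on top
    'p': [12, 15],                                # ids of pieces that could be put on this cell
    'h': ['ant', 'spider'],                       # kinds not yet on the board that fit here
}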
FYI, I pre-calculate the whole board before sending it as JSON to a JavaScript client, so that I can do all the pre-calculation in Python and write as little code as possible in JavaScript.

It seems you have a board which has rows of cells, and each cell holds multiple pieces.
You can use a generator expression to gather all the ids and then take the `max` of it:
max(piece.get('id', 0) for row in self.board for cell in row for piece in cell['b'])
I am not sure how fast it would be (it will of course depend on how big your board is), but I am pretty sure it would be faster than the three for loops that recalculate max on every iteration.
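One caveat: if the board can ever be empty, max() on an empty generator raises a ValueError, while the original loop returned 0. On Python 3.4+ the default argument keeps that behaviour (assuming the same self.board structure as above):
mx = max(
    (piece.get('id', 0)
     for row in self.board
     for cell in row
     for piece in cell['b']),
    default=0,  # returned when the generator yields nothing
)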

Related

How can I easily iterate over NumPy arrays using approaches other than np.nditer?

I would like to iterate over each cell value of the arrays. I tried it using np.nditer (for i in np.nditer(bar_st_1)). However, even on a laptop with 64 GB of RAM it takes a lot of computational time and runs out of memory. Do you know what would be the easiest and fastest way to extract each array value? Thanks
# Assign the crop-specific irrigated area of each array for each month according to the crop calendar
# Barley
for Barley in df_dist.Crop:
    for i in np.nditer(bar_st_1):
        for j in df_area.Month:
            for k in df_dist.Planting_month:
                for l in df_dist.Maturity_month:
                    if (j >= min(k, l)) and (j <= max(k, l)):
                        df_area.Barley = i
                    else:
                        df_area.Barley = 0
My goal is to extract a value from each array and assign it to each growing-season month. df_dist is a district-level data frame containing the growing area for each month. bar_st_1 is a 7x7 array that contains the irrigated area of a specific district. For each cell, I would like to extract the value of the corresponding array and assign it to a specific month based on the growing season (the condition stated above).
for j in df_area.Month:
    for k in df_dist.Planting_month:
        for l in df_dist.Maturity_month:
            if (j >= min(k, l)) and (j <= max(k, l)):
                df_area.Barley = i
            else:
                df_area.Barley = 0
This code block seems to be wasting a lot of effort. If you changed the order of the iterations, you could write
for k in df_dist.Planting_month:
    for l in df_dist.Maturity_month:
        for j in range(min(k, l), max(k, l) + 1):
            df_area.Barley = i
Then you avoid making a lot of comparisons and calculating a lot of max(k,l)'s that aren't necessary.
The loop over i is also wasted effort: you assign i to certain entries of df_area.Barley, but then in a later iteration you overwrite them with a different value of i, without ever (in the code you've shared) using df_area with the first value of i.
So you could reduce your code to
for Barley in df_dist.Crop:
    # Initialize the df_area array for this crop with zeros:
    df_area.Barley = np.zeros(df_area.Month.max())
    r, c = bar_st_1.shape
    # Choose the last element in bar_st_1:
    i = bar_st_1[r - 1, c - 1]
    for k in df_dist.Planting_month:
        for l in df_dist.Maturity_month:
            for j in range(min(k, l), max(k, l) + 1):
                df_area.Barley = i
Now you've eliminated one level from your nested loop structure and shortened the iteration in another level, so you're likely to get 10x or better improvement in speed.

Remove array from a list if another array with the same entry at a certain position is already in said list

My question might sound a little confusing, but what I am trying to do is find an efficient method to iterate over a list that contains arrays and remove arrays from the list if they have the same entries at certain positions. Let's say I have a list of 2x3 arrays and want to make the list unique with regard to the last two elements of the bottom row, for example. What I came up with so far is the following:
import numpy as np
my_array_list = [np.array([[1,2,3],[4,5,6]]), np.array([[9,8,7],[6,5,4]]),
                 np.array([[2,3,4],[5,6,7]]), np.array([[1,7,8],[0,5,6]])]
i = 0
while i < len(my_array_list):
    j = i + 1
    while j < len(my_array_list):
        if my_array_list[i][1,1] == my_array_list[j][1,1] and my_array_list[i][1,2] == my_array_list[j][1,2]:
            del my_array_list[j]
        else:
            j += 1
    i += 1
print(my_array_list)
>>> my_array_list = [np.array([[1,2,3],[4,5,6]]), np.array([[9,8,7],[6,5,4]]),
np.array([[2,3,4],[5,6,7]])] ## Since 5,6 is already in the last two columns of the first array the last array got deleted
This loop does what I want, but the problem is that it is very slow. The data will be generated by a Monte Carlo simulation, so there will most likely be millions of arrays in the list. I was wondering if there is a faster way to do this, for instance by somehow remembering which combinations have already been encountered, so I only have to loop over the list once and not len(my_array_list) times for every single combination.
Thanks in advance, any help would be appreciated.
Cheers!
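A minimal sketch of that single-pass idea, keeping a set of the (arr[1,1], arr[1,2]) pairs already seen (illustrative names, same example data as above):
import numpy as np

my_array_list = [np.array([[1,2,3],[4,5,6]]), np.array([[9,8,7],[6,5,4]]),
                 np.array([[2,3,4],[5,6,7]]), np.array([[1,7,8],[0,5,6]])]

seen = set()        # (arr[1,1], arr[1,2]) combinations encountered so far
unique_arrays = []
for arr in my_array_list:
    key = (arr[1, 1], arr[1, 2])
    if key not in seen:
        seen.add(key)
        unique_arrays.append(arr)

print(unique_arrays)  # the last array is dropped because (5, 6) was already seen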

Is there a better way to send multiple arguments to itertools.product?

I am trying to create itertools.product from a 2D list containing many rows. For example, consider a list s:
[[0.7168573116730971,
  1.3404415914042531,
  1.8714268721791336,
  11.553051251803975],
 [0.6702207957021266,
  1.2476179147860895,
  1.7329576877705954,
  10.635778602978927],
 [0.6238089573930448,
  1.1553051251803976,
  1.5953667904468385,
  9.725277699842893],
 [0.5776525625901988,
  1.0635778602978927,
  1.4587916549764335,
  8.822689900641748]]
I want to compute itertools.product between the 4 rows of the list:
pr = []
for j in it.product(s[0], s[1], s[2], s[3]):
    pr.append(j)
This gives the necessary result: pr has dimensions (256, 4), where 256 is number_of_columns ** number_of_rows. But is there a better way to pass all the rows of the list s as arguments without writing out each row's name? This would be annoying for a larger list.
I guess numpy.meshgrid could be used if s were a numpy array. But even there, I'd have to write out each row one by one as an argument.
You can use the unpacking notation * in Python for this:
import itertools as it
s = [[0.7168573116730971,
      1.3404415914042531,
      1.8714268721791336,
      11.553051251803975],
     [0.6702207957021266,
      1.2476179147860895,
      1.7329576877705954,
      10.635778602978927],
     [0.6238089573930448,
      1.1553051251803976,
      1.5953667904468385,
      9.725277699842893],
     [0.5776525625901988,
      1.0635778602978927,
      1.4587916549764335,
      8.822689900641748]]
pr = []
for j in it.product(*s):
    pr.append(j)
This unpacks the list s and sends each of its rows as a separate argument to the product function.
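As a small follow-up, the loop can be collapsed into a single call, and the same unpacking trick also works for numpy.meshgrid if you go the NumPy route:
pr = list(it.product(*s))   # same 256 tuples, without the explicit loop
# the unpacking also saves listing the rows for meshgrid:
# grids = np.meshgrid(*s)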

Memory problems for multiple large arrays

I'm trying to do some calculations on over 1000 arrays of shape (100, 100, 1000). But, as one might expect, it takes no more than about 150-200 arrays before my memory is used up and it all fails (at least with my current code).
This is what I currently have now:
import numpy as np
toxicity_data_path = open("data/toxicity.txt", "r")
toxicity_data = np.array(toxicity_data_path.read().split("\n"), dtype=int)
patients = range(1, 1000, 1)
The above is just a list of 1's and 0's (indicating toxicity or not), one per array (one array holds the data for one patient), so roughly 1000 patients in this case.
I then create two lists from the above, so I have one list of the patients with toxicity and one of the patients without it.
patients_no_tox = [i for i, e in enumerate(toxicity_data.astype(np.str)) if e in set("0")]
patients_with_tox = [i for i, e in enumerate(toxicity_data.astype(np.str)) if e in set("1")]
I then write this function, which loads an already saved-to-disk (100, 100, 1000) array for each patient and removes some indexes (also loaded from a saved file) that either won't work later on or just need to be removed, so this step is essential. The result is a final list of all patients and their (flattened) 3D arrays of data. This is where things start to eat memory, when the function is used in the list comprehension.
def log_likely_list(patient, remove_index_list):
    array_data = np.load("data/{}/array.npy".format(patient)).ravel()
    return np.delete(array_data, remove_index_list)

remove_index_list = np.load("data/remove_index_list.npy")
final_list = [log_likely_list(patient, remove_index_list) for patient in patients]
The next step is to create the two lists I need for my calculations. I take the final list with all the patients and remove either the patients with toxicity or the patients without it, respectively.
patients_no_tox_list = np.column_stack(np.delete(final_list, patients_with_tox, 0))
patients_with_tox_list = np.column_stack(np.delete(final_list, patients_no_tox, 0))
The last piece of the puzzle is to use these two lists in the following equation, with the no-tox list on the right side of the sum and the with-tox list on the left. For each individual index it sums over all patients (i.e. the same index in each patient's 3D array), and I end up with one large list of values.
log_likely = (np.sum(np.log(patients_with_tox_list), axis=1) +
              np.sum(np.log(1 - patients_no_tox_list), axis=1))
My problem, as stated, is that when I get to around 150-200 patients, my memory is used up and it shuts down.
I have obviously tried saving things to disk and loading them back (that's why I load so many files), but that didn't help much. I'm thinking I could maybe feed one array at a time into the log_likely function, but in the end, before summing, I would probably still have just as large an array; plus, the computation might be a lot slower if I can't use NumPy's sum and the like.
So is there any way I could optimize/improve on this, or is the only way to buy a hell of a lot more RAM?
A list comprehension builds the entire result list in memory at once. So this line:
final_list = [log_likely_list(patient, remove_index_list) for patient in patients]
contains the complete data for all 1000 patients!
The better choice is to use generator expressions, which process items one at a time. To form a generator expression, surround your for ... in ... expression with parentheses instead of brackets. This might look something like:
import functools
import itertools

with_tox_data = (log_likely_list(patient, remove_index_list) for patient in patients_with_tox)
with_tox_log = (np.log(data) for data in with_tox_data)       # np.log has no axis argument; the per-index sum happens in the reduce below
no_tox_data = (log_likely_list(patient, remove_index_list) for patient in patients_no_tox)
no_tox_log = (np.log(1 - data) for data in no_tox_data)
final_data = itertools.chain(with_tox_log, no_tox_log)
Note that no computations have actually been performed yet: generators don't do anything until you iterate over them. A memory-friendly way to aggregate all the results in this case, one patient at a time, is to use reduce:
log_likely = functools.reduce(np.add, final_data)
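As a side note, assuming each generated array has the same length (which it does after np.delete with the same remove_index_list), the built-in sum would accumulate them one at a time as well, as an alternative to reduce:
log_likely = sum(final_data)   # adds the per-patient arrays element-wise, one generator item at a time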

Creating variable number of variables in python

I am trying to create a variable number of variables (arrays) in python.
I have a database from experiments and I am extracting data from it. I do not have control over the database or how the data is written. I am extracting data in the form of a table: the first column (the zeroth, from Python's perspective) has location ids and the subsequent columns have readings over several iterations. The location ids (in the 0th column) span over a million rows, and the readings of the iterations are captured in the subsequent columns. So I read over the database and create this giant table.
In the next step, I loop over column indexes 1 to n (the 0th column has the locations) and try to do this: if the difference between two readings is more than 0.001, write the location id to an array.
if (A[i][j+1] - A[i][j]) > 0.001:  # 1 <= j <= n, 0 <= i <= max rows in the table
    # then write A[i][0], i.e. the location id, to an array: arr1[m][n] = A[i][0]
The problem is creating a dynamic number of variables like arr1. I am storing the result of each loop iteration in an array, and the number of columns j is known only at runtime. So how can I create a variable number of variables like arr1? Secondly, each of these variables like arr1 can have a different size.
I took a look at similar questions, but multi-dimensional arrays won't work since each arr1 can have a different size. Also, performance is important, so I am guessing NumPy arrays would be better. I am guessing that a dictionary would be slow for such a huge amount of data.
I didn't understand much from your explanation of the problem, but from what you wrote it sounds like a normal list would do the job:
arr1 = []
if (A[i][j+1] - A[i][j]) > 0.001:  # your condition here
    arr1.append(A[i][0])
Memory management is dynamic, i.e. the list allocates new memory as needed; afterwards, if you need a NumPy array, just make numpy_array = np.asarray(arr1).
A (very) small primer on lists in Python:
A list in Python is a mutable container that stores references to objects of any kind. Unlike a C++ array, the items in a Python list can be anything, and you don't have to specify the list size when you define it.
In the example above, arr1 is initially defined as empty, and every time you call arr1.append() a new reference to A[i][0] is pushed onto the end of the list.
For example:
a = []
a.append(1)
a.append('a string')
b = {'dict_key':'my value'}
a.append(b)
print(a)
displays:
[1, 'a string', {'dict_key': 'my value'}]
As you can see, the list doesn't really care what you append; it will store a reference to the item and increase its size by 1.
I strongly suggest you take a look at the data structures documentation for further insight into how lists work and some of their caveats.
I took a look at similar questions, but multi-dimensional arrays won't work since each arr1 can have a different size.
-- but a list of arrays will work, because items in a list can be anything, including arrays of different sizes.
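To tie this back to the question: since the number of reading columns is only known at runtime, a list of lists (or a dict keyed by column index) avoids inventing separate variable names, and each inner list can grow to its own size. A minimal sketch with a small made-up table A (column 0 = location ids, the rest = readings):
import numpy as np

# A stands in for the big table from the question (made-up values).
A = np.array([[101, 0.10, 0.10, 0.20],
              [102, 0.50, 0.50, 0.50],
              [103, 0.30, 0.40, 0.40]])

n_rows, n_cols = A.shape

# one list of location ids per pair of consecutive reading columns;
# each list can end up with a different length
changed_locations = [[] for _ in range(n_cols - 2)]

for j in range(1, n_cols - 1):
    for i in range(n_rows):
        if (A[i][j + 1] - A[i][j]) > 0.001:
            changed_locations[j - 1].append(A[i][0])

# convert each list to a NumPy array at the end if needed
changed_locations = [np.asarray(col) for col in changed_locations]
print(changed_locations)  # [array([103.]), array([101.])]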
