Why initialize an array in NumPy? - Python

I'm doing a DataCamp course on statistical thinking in Python. At one point in the course, the instructor advises initializing an empty array before filling it with random floats, e.g.
rand_nums = np.empty(100_000)
for i in range(100_000):
    rand_nums[i] = np.random.random()
In theory, is there any reason to initialize an empty array before filling it? Does it save memory? What advantage does the code above have compared to simply writing the following:
rand_nums = np.random.random(size=100_000)

There is absolutely no reason to do this. The second way is faster, more readable, and semantically correct.
Besides that, np.empty actually does NOT initialize the array: it only allocates memory, so the array contains whatever arbitrary data was already left in that memory by this and other programs.
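To make the difference concrete, here is a minimal sketch (the printed values and timings will vary):
import numpy as np

# np.empty only reserves memory; the contents are whatever bytes were
# already there, so the printed values are arbitrary (not guaranteed zeros).
garbage = np.empty(5)
print(garbage)

# The vectorized call allocates and fills the array in one C-level pass,
# which is both simpler and faster than a Python-level loop.
rand_nums = np.random.random(size=100_000)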

If all the code they provided is exactly like the above, your way of initializing is better.
Their version might be building toward something else later on, for example:
rand_nums = np.empty(100_000)
for i in range(100_000):
    rand_nums[i] = np.random.random()
    # maybe they will do something else in here later with rand_nums[i]


I set 3 arrays to the same thing; changing a single entry in one of them also changes the other two. How can I make the three arrays separate?

I am making a puzzle game in a command terminal. I have three arrays for the level: originlevel, which is the unaltered level that the game will return to if you restart the level; emptylevel, which is the level without the player; and level, which is just the level. I need all 3, because I will be changing the space around the player.
def Player(matrix, width):
    originlevel = matrix
    emptylevel = matrix
    emptylevel[PlayerPositionFind(matrix)] = "#"
    level = matrix
The expected result is that it would set one entry to "#" in the emptylevel array, but it actually sets all 3 arrays to the same thing! My theory is that the arrays are somehow linked because they are originally set to the same thing, but this ruins my code entirely! How can I make the arrays separate, so that changing one does not change the others?
I should note that matrix is an array; it is not an actual matrix.
I tried a function which would take the array matrix, and then just return it, thinking that this layer would unlink the arrays. It did not. (I called the function IHATEPYTHON).
I've also read that setting them to the same array is supposed to do this, but I didn't actually find an answer on how to make them NOT do that. Do I make a function which is just something like:
for i in range(0, len(array)):
    newarray.append(array[i])
return newarray
I feel like that would solve the issue, but that's so stupid; can I not do it in another way?
This issue is caused by the way variables work in Python. If you want more background on why this is happening, you should look up 'pass by value versus pass by reference'.
In order for each of these arrays to be independent, you need to create a copy each time you assign it. The easiest way to do that is to use an array slice. This means you will get a new copy of the array each time.
def Player(matrix, width):
    originlevel = matrix[:]
    emptylevel = matrix[:]
    emptylevel[PlayerPositionFind(matrix)] = "#"
    level = matrix[:]
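A small sketch of the difference between an alias and a slice copy (the names here are illustrative):
level = ["#", ".", ".", "#"]

alias = level        # same list object; changes made through alias show up in level
shallow = level[:]   # new list containing the same elements

shallow[1] = "@"
print(level)                              # ['#', '.', '.', '#'] - unchanged
print(alias is level, shallow is level)   # True False
Note that a slice is only a shallow copy: if the entries were themselves lists, the inner lists would still be shared, and copy.deepcopy would be needed.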

Alternatives to copy.deepcopy; or is it even necessary?

To undo moves in my chess engine I rely heavily on a good caching system, which is updated every time a move is made or undone. It is also used to "go back in time" and see earlier positions. The update code looks something like:
self.cache.append([copy.deepcopy(self.board), ..., copy.deepcopy(self.brokenCastles)])
with self.brokenCastles being a set and self.board a 2D list representing the board. There are some other lists and variables that get stored (like legal moves for each side) so they don't have to be calculated each time the same board is used. They haven't caused any problems so far (or I didn't notice them).
The big problem is with self.board and self.brokenCastles. If I just append them like all the other variables, massive problems appear (like kings getting taken and some weird stuff), which are fixed by making a deepcopy of the list/set respectively. Just using Python's built-in .copy() or a slice like [:] didn't help.
I don't quite know why deepcopy is necessary and couldn't replicate the issue in a smaller environment. So my question is whether deepcopy is even needed and, if so, whether there is a way to make it faster, since it is the biggest bottleneck in my system right now.
The cache read function looks like this:
def undo_move(self):
    self.board = self.cache[-1][0]
    ...  # A lot more
    self.brokenCastles = self.cache[-1][4]
    self.cache.pop()
If there are any details missing, let me know. Thanks for any help.
Whole code (it's a mess) is available at GitHub.
I used an incredibly jank bypass solution for now:
Convert the list to a string
Use the built-in eval function to retrieve its information
Still quite slow, but a lot better than deepcopy.
EDIT: I made it a lot faster by using my own conversion to string and back:
def to_lst(my_string):
    temp = my_string.split("|")
    return [temp[i: i+8] for i in range(0, 64, 8)]

def to_str(my_list):
    return "|".join([a for b in my_list for a in b])

Confusion about numpy's apply_along_axis and list comprehensions

Alright, so I apologize ahead of time if I'm just asking something silly, but I really thought I understood how apply_along_axis worked. I just ran into something that might be an edge case that I just didn't consider, but it's baffling me. In short, this is the code that is confusing me:
class Leaf(object):
    def __init__(self, location):
        self.location = location

    def __len__(self):
        return self.location.shape[0]

def bulk_leaves(child_array, axis=0):
    test = np.array([Leaf(location) for location in child_array])  # This is what I want
    check = np.apply_along_axis(Leaf, 0, child_array)  # This returns an array of individual leafs with the same shape as child_array
    return test, check

if __name__ == "__main__":
    test, check = bulk_leaves(np.random.rand(100, 50))
    test == check  # False
I always feel silly using a list comprehension with numpy and then casting back to an array, but I'm just not sure of another way to do this. Am I just missing something obvious?
apply_along_axis is pure Python code that you can look at and decode yourself. In this case it essentially does:
check = np.empty(child_array.shape, dtype=object)
for i in range(child_array.shape[1]):
    check[:, i] = Leaf(child_array[:, i])
In other words, it preallocates the container array, and then fills in the values with an iteration. That certainly is better than appending to the array, but rarely better than appending values to a list (which is what the comprehension is doing).
You could take the above template and adjust it to produce the array that you really want.
check = np.empty(child_array.shape[0], dtype=object)   # one slot per row this time
for i in range(check.shape[0]):
    check[i] = Leaf(child_array[i, :])
In quick tests this iteration times about the same as the comprehension. apply_along_axis, besides producing the wrong result here, is slower.
The problem seems to be that apply_along_axis uses isscalar to determine whether the returned object is a scalar, but isscalar returns False for user-defined classes. The documentation for apply_along_axis says:
The shape of outarr is identical to the shape of arr, except along the axis dimension, where the length of outarr is equal to the size of the return value of func1d.
Since your class's __len__ returns the length of the array it wraps, numpy "expands" the resulting array into the original shape. If you don't define a __len__, you'll get an error, because numpy doesn't think user-defined types are scalars, so it will still try to call len on it.
As far as I can see, there is no way to make this work with a user-defined class. You can return 1 from __len__, but then you'll still get an Nx1 2D result, not a 1D array of length N. I don't see any way to make Numpy see a user-defined instance as a scalar.
There is a numpy bug about the apply_along_axis behavior, but surprisingly I can't find any discussion of the underlying issue that isscalar returns False for non-numpy objects. It may be that numpy just decided to punt and not guess whether user-defined types are vector or scalar. Still, it might be worth asking about this on the numpy mailing list, as it seems odd to me that things like isscalar(object()) return False.
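For reference, the isscalar behavior described above can be checked directly:
import numpy as np

print(np.isscalar(3.0))       # True
print(np.isscalar(object()))  # False: user-defined instances are not treated as scalars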
However, if, as you say, you don't care about performance anyway, it doesn't really matter. Just use your first way with the list comprehension, which already does what you want.

How to optimize operations on large (75,000 items) sets of booleans in Python?

There's this script called svnmerge.py that I'm trying to tweak and optimize a bit. I'm completely new to Python though, so it's not easy.
The current problem seems to be related to a class called RevisionSet in the script. In essence what it does is create a large hashtable(?) of integer-keyed boolean values. In the worst case - one for each revision in our SVN repository, which is near 75,000 now.
After that it performs set operations on such huge arrays - addition, subtraction, intersection, and so forth. The implementation is the simplest O(n) implementation, which, naturally, gets pretty slow on such large sets. The whole data structure could be optimized because there are long spans of continuous values. For example, all keys from 1 to 74,000 might contain true. Also the script is written for Python 2.2, which is a pretty old version and we're using 2.6 anyway, so there could be something to gain there too.
I could try to cobble this together myself, but it would be difficult and take a lot of time - not to mention that it might be already implemented somewhere. Although I'd like the learning experience, the result is more important right now. What would you suggest I do?
You could try doing it with numpy instead of plain Python. I found it to be very fast for operations like these.
For example:
import numpy

# Create 1000000 numbers between 0 and 1000, takes 21ms
x = numpy.random.randint(0, 1000, 1000000)

# Get all items that are larger than 500, takes 2.58ms
y = x > 500

# Add 10 to those items, takes 26.1ms
x[y] += 10
Since that's with far more elements, I think that 75,000 should not be a problem either :)
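As a rough sketch of how the revision flags could be held as boolean masks (the array layout here is an assumption, not how svnmerge.py stores them), the set operations then become single vectorized expressions:
import numpy as np

N = 75000
a = np.zeros(N + 1, dtype=bool)   # index = revision number
b = np.zeros(N + 1, dtype=bool)
a[1:74001] = True                 # a long contiguous span of revisions
b[50000:60000] = True

union        = a | b
intersection = a & b
difference   = a & ~b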
Here's a quick replacement for RevisionSet that makes it into a set. It should be much faster. I didn't fully test it, but it worked with all of the tests that I did. There are undoubtedly other ways to speed things up, but I think that this will really help because it actually harnesses the fast implementation of sets rather than doing loops in Python which the original code was doing in functions like __sub__ and __and__. The only problem with it is that the iterator isn't sorted. You might have to change a little bit of the code to account for this. I'm sure there are other ways to improve this, but hopefully it will give you a good start.
import re

class RevisionSet(set):
    """
    A set of revisions, held in dictionary form for easy manipulation. If we
    were to rewrite this script for Python 2.3+, we would subclass this from
    set (or UserSet). As this class does not include branch
    information, it's assumed that one instance will be used per
    branch.
    """
    def __init__(self, parm):
        """Constructs a RevisionSet from a string in property form, or from
        a dictionary whose keys are the revisions. Raises ValueError if the
        input string is invalid."""
        revision_range_split_re = re.compile('[-:]')
        if isinstance(parm, set):
            self.update(parm.copy())
        elif isinstance(parm, list):
            self.update(set(parm))
        else:
            parm = parm.strip()
            if parm:
                for R in parm.split(","):
                    rev_or_revs = re.split(revision_range_split_re, R)
                    if len(rev_or_revs) == 1:
                        self.add(int(rev_or_revs[0]))
                    elif len(rev_or_revs) == 2:
                        self.update(set(range(int(rev_or_revs[0]),
                                              int(rev_or_revs[1]) + 1)))
                    else:
                        raise ValueError('Ill formatted revision range: ' + R)

    def sorted(self):
        return sorted(self)

    def normalized(self):
        """Returns a normalized version of the revision set, which is an
        ordered list of couples (start, end), with the minimum number of
        intervals."""
        revnums = sorted(self)
        revnums.reverse()
        ret = []
        while revnums:
            s = e = revnums.pop()
            while revnums and revnums[-1] in (e, e + 1):
                e = revnums.pop()
            ret.append((s, e))
        return ret

    def __str__(self):
        """Convert the revision set to a string, using its normalized form."""
        L = []
        for s, e in self.normalized():
            if s == e:
                L.append(str(s))
            else:
                L.append(str(s) + "-" + str(e))
        return ",".join(L)
Addition:
By the way, I compared doing unions, intersections and subtractions of the original RevisionSet and my RevisionSet above, and the above code is from 3x to 7x faster for those operations when operating on two RevisionSets that have 75000 elements. I know that other people are saying that numpy is the way to go, but if you aren't very experienced with Python, as your comment indicates, then you might not want to go that route because it will involve a lot more changes. I'd recommend trying my code, seeing if it works and if it does, then see if it is fast enough for you. If it isn't, then I would try profiling to see what needs to be improved. Only then would I consider using numpy (which is a great package that I use quite frequently).
"For example, all keys from 1 to 74,000 contain true"
Why not work on a subset? Just revisions 74,001 to the end.
Pruning 74/75ths of your data is far easier than trying to write an algorithm more clever than O(n).
You should rewrite RevisionSet to have a set of revisions. I think the internal representation for a revision should be an integer and revision ranges should be created as needed.
There is no compelling reason to use code that supports python 2.3 and earlier.
Just a thought. I used to do this kind of thing using run coding (run-length encoding) in binary image manipulation. That is, store each set as a series of numbers: number of bits off, number of bits on, number of bits off, and so on.
Then you can do all sorts of boolean operations on them as decorations on a simple merge algorithm.
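A minimal sketch of that run-coding idea (run_encode is a hypothetical helper, not part of svnmerge.py): the boolean sequence is stored as alternating run lengths, starting with an "off" run, so a span of 74,000 identical values costs a single number.
def run_encode(bits):
    """Encode a sequence of booleans as alternating run lengths, off first."""
    runs = []
    current, count = False, 0
    for b in bits:
        if b == current:
            count += 1
        else:
            runs.append(count)
            current, count = b, 1
    runs.append(count)
    return runs

print(run_encode([True, True, True, False, False]))   # [0, 3, 2]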

Mapping function to numpy array, varying a parameter

First, let me show you the codez:
a = array([...])
for n in range(10000):
    func_curry = functools.partial(func, y=n)
    result = array(map(func_curry, a))
    do_something_else(result)
...
What I'm doing here is trying to apply func to an array, changing the value of func's second parameter on every iteration. This is SLOOOOW (creating a new function every iteration surely does not help), and I also feel I'm missing the Pythonic way of doing it. Any suggestions?
Could a solution that gives me a 2D array be a good idea? I don't know, but maybe it is.
Answers to possible questions:
Yes, this is (using a broad definition) an optimization problem (do_something_else() hides this)
No, scipy.optimize hasn't worked because I'm dealing with boolean values and it never seems to converge.
Did you try numpy.vectorize?
...
vfunc_curry = vectorize(functools.partial(func, y=n))
result = vfunc_curry(a)
...
If a is of significant size the bottleneck should not be the creation of the function, but the duplication of the array.
Can you rewrite the function? If possible, write it to take two numpy arrays, a and numpy.arange(n), and let broadcasting do the work; you may need to reshape to get the arrays to line up, as in the sketch below.
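A broadcasting sketch of that idea (func here is a stand-in for the real function, which is assumed to work elementwise on arrays):
import numpy as np

def func(a, y):
    # Placeholder for the real computation, written with array operations.
    return a * 2 + y

a = np.random.random(50)
ys = np.arange(10000)

# a has shape (50,) and ys[:, None] has shape (10000, 1); broadcasting
# produces a (10000, 50) result - one row per value of the second parameter.
result = func(a, ys[:, None])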
