Following the StackOverflow post "Elegantly calculate mean of first three values of a list", I have tweaked the code to find the maximum.
However, I also need to know the position/index of the max.
The code below calculates the max value for the first 3 numbers, then the max value for the next 3 numbers, and so on.
For example, take the list of values [6 3 7 4 6 9 2 6 7 4 3 7 7 2 5 4 1 7 5 1]. The code takes the first 3 values 6, 3, 7 and outputs the max as 7, then for the next 3 values 4, 6, 9 outputs 9, and so on.
But I also want to find which position/index they are at, i.e. 7 is at position 2 and 9 at position 5. The final result would be [2,5,8,11,12,...]. Any ideas on how to calculate the index? Thanks in advance.
import numpy as np
np.random.seed(42)
test_data = np.random.randint(low = 0, high = 10, size = 20)
maxval = [max(test_data[i:i+3]) for i in range(0,len(test_data),3)]
print(test_data)
print(maxval)
output: test_data : [6 3 7 4 6 9 2 6 7 4 3 7 7 2 5 4 1 7 5 1]
output: [7, 9, 7, 7, 7, 7, 5]
import numpy as np
np.random.seed(42)
test_data = np.random.randint(low = 0, high = 10, size = 20)
maxval = [max(test_data[i:i+3]) for i in range(0,len(test_data),3)]
index = [(np.argmax(test_data[i: i+3]) + i) for i in range(0,len(test_data),3)]
print(test_data)
print(maxval)
print(index)
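As a quick check of the argmax approach, here is a minimal sketch that hard-codes the values from the question, so the result does not depend on the random generator. The name `chunk` is introduced here only for illustration.

```python
import numpy as np

# Values copied from the question, so no random seed is involved.
test_data = np.array([6, 3, 7, 4, 6, 9, 2, 6, 7, 4, 3, 7, 7, 2, 5, 4, 1, 7, 5, 1])
chunk = 3  # hypothetical name for the group size

starts = range(0, len(test_data), chunk)
# Maximum of each group of three (the last group may be shorter).
maxval = [int(test_data[i:i + chunk].max()) for i in starts]
# argmax is relative to the slice, so add the slice offset i back.
index = [int(np.argmax(test_data[i:i + chunk])) + i for i in starts]

print(maxval)  # [7, 9, 7, 7, 7, 7, 5]
print(index)   # [2, 5, 8, 11, 12, 17, 18]
```

When a group contains the maximum more than once, np.argmax reports the first occurrence.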
I'm new to Python from Matlab.
I want to create a new variable from a subset of an existing numpy array based on equality to some condition specified by a third numpy array, an ID in this case.
This works fine for one equality.
new_x = old_x[someID == 1]
But if I try to extend it to several equalities at once, it no longer works:
new_x = old_x[someID == 1:3]
Ideally I want to be able to choose many equalities, like:
new_x = old_x[someID == 1:3,7]
I could loop through each number I want to check but is there a simpler way of doing this?
You could use np.isin + np.r_:
import numpy as np
# for reproducible results
np.random.seed(42)
# toy data
old_x = np.random.randint(10, size=100)
# create new array by filtering on boolean mask
new_x = old_x[np.isin(old_x, np.r_[1:3,7])]
print(new_x)
Output
[7 2 7 7 7 2 1 7 1 2 2 2 1 1 1 7 2 1 7 1 1 1 7 7 1 7 7 7 7 2 7 2 2 7]
You could substitute np.r_ by something like [1, 2, 7] and use it as below:
new_x = old_x[np.isin(old_x, [1, 2, 7])]
Additionally, if the array is 1-dimensional you could use np.in1d (note that newer NumPy releases deprecate np.in1d in favor of np.isin):
new_x = old_x[np.in1d(old_x, [1, 2, 7])]
print(new_x)
Output (from in1d)
[7 2 7 7 7 2 1 7 1 2 2 2 1 1 1 7 2 1 7 1 1 1 7 7 1 7 7 7 7 2 7 2 2 7]
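The question filters one array by a separate ID array, so a sketch closer to that setup may help. The small `old_x` and `someID` arrays below are made-up illustration data, not taken from the original post.

```python
import numpy as np

# Made-up data: values and a parallel array of IDs.
old_x = np.array([10, 20, 30, 40, 50, 60])
someID = np.array([1, 2, 3, 7, 8, 1])

# np.r_[1:3, 7] expands to array([1, 2, 7]); the slice end is exclusive.
wanted = np.r_[1:3, 7]

# Build the boolean mask from the ID array, then apply it to the values.
new_x = old_x[np.isin(someID, wanted)]
print(new_x)  # [10 20 40 60]
```

Coming from Matlab, note that `1:3` inside np.r_ follows Python slicing and stops before 3, so the Matlab range 1:3 would be written `np.r_[1:4]`.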
Let's say I have some arrays/lists that contain a lot of values, so loading several of them into memory at once would eventually raise a memory error. One way to circumvent this is to wrap these arrays/lists in a generator and consume them only when needed. However, with generators you don't have as much control as with arrays/lists, and that is my problem.
Let me explain.
As an example I have the following code, which produces a generator with some small lists. So yeah, this is not memory intensive at all, just an example:
import numpy as np
np.random.seed(10)
number_of_lists = range(0, 5)
generator_list = (np.random.randint(0, 10, 10) for i in number_of_lists)
If I iterate over this generator I get the following:
for i in generator_list:
    print(i)
>> [9 4 0 1 9 0 1 8 9 0]
>> [8 6 4 3 0 4 6 8 1 8]
>> [4 1 3 6 5 3 9 6 9 1]
>> [9 4 2 6 7 8 8 9 2 0]
>> [6 7 8 1 7 1 4 0 8 5]
What I would like to do is sum element wise for all the lists (axis = 0). So the above should in turn result in:
[36, 22, 17, 17, 28, 16, 28, 31, 29, 14]
To do this I could use the following:
total = np.zeros(10, dtype=int)  # an array, so += adds element-wise (a plain list would be extended instead)
for i in generator_list:
    total += i
where 10 is the length of one of the lists.
So far so good. I am not sure if there is a better/more optimized way of doing it, but it works.
My problem is that I would like to determine which lists in the generator_list I want to use. For example, what if I wanted to sum two copies of the first list (index 0), one of the third, and two of the last, i.e.:
[9 4 0 1 9 0 1 8 9 0]
[9 4 0 1 9 0 1 8 9 0]
[4 1 3 6 5 3 9 6 9 1]
[6 7 8 1 7 1 4 0 8 5]
[6 7 8 1 7 1 4 0 8 5]
>> [34, 23, 19, 10, 37, 5, 19, 22, 43, 11]
How would I go about doing that ?
And before any questions arise why I want to do it this way, the reason is that in my real case, getting the arrays into the generator takes some time. I could then in principle just generate a new generator where I put in the order of lists as seen in the new list, but again, that would mean I would have to wait to get them in a new generator. And if this is to happen thousands of times (as seen with bootstrapping), well, it would take some time. With the first generator I have ALL lists that are available. Now I just wish to use them selectively so I don't have to create a new generator every time I want to mix it up, and sum a new set of arrays/lists.
import numpy as np
np.random.seed(10)
number_of_lists = range(5)
generator_list = (np.random.randint(0, 10, 10) for i in number_of_lists)
indices = [0, 0, 2, 4, 4]
assert sorted(indices) == indices, "only works for sorted list"
# sum_ = [0] * 10
# I prefer this:
sum_ = np.zeros((10,), dtype=int)
generator_index = -1
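for index in indices:
    # advance the generator until it reaches the requested index
    while generator_index < index:
        vector = next(generator_list)
        generator_index += 1
    # repeated indices reuse the vector that was fetched last
    sum_ += vector
print(sum_)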
outputs
[34 23 19 10 37 5 19 22 43 11]
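If the indices are not sorted, one possible variation is to count how often each index is requested and multiply each vector by that count during a single pass over the generator. This is only a sketch: `weighted_sum` is a hypothetical helper, and the rows are hard-coded from the question so the example does not depend on the random seed.

```python
from collections import Counter

import numpy as np

# Rows copied from the question's printed output.
rows = [
    [9, 4, 0, 1, 9, 0, 1, 8, 9, 0],
    [8, 6, 4, 3, 0, 4, 6, 8, 1, 8],
    [4, 1, 3, 6, 5, 3, 9, 6, 9, 1],
    [9, 4, 2, 6, 7, 8, 8, 9, 2, 0],
    [6, 7, 8, 1, 7, 1, 4, 0, 8, 5],
]
generator_list = (np.array(r) for r in rows)


def weighted_sum(vectors, indices, length):
    """Sum the vectors at the given positions; repeated positions count multiple times."""
    weights = Counter(indices)  # position -> number of times it was requested
    total = np.zeros(length, dtype=int)
    for position, vector in enumerate(vectors):
        if position in weights:
            total += weights[position] * vector
    return total


# The indices are deliberately unsorted here.
result = weighted_sum(generator_list, [4, 0, 2, 0, 4], 10)
print(result.tolist())  # [34, 23, 19, 10, 37, 5, 19, 22, 43, 11]
```

The generator is still consumed once per call, so this does not avoid rebuilding it for every bootstrap sample; it only removes the need for sorted indices.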
I want a matrix to be printed with random columns(0, 9) and random rows(0, 9) with random elements(0, 9)
Where (0, 9) is any random number between 0 and 9.
First, randomize your number of columns and rows:
import numpy as np
rows, cols = np.random.randint(10, size = 2)
If you want a matrix of integers just try:
m = np.random.randint(10, size = (rows,cols))
This will output a rows x cols matrix with random integers in the closed interval [0, 9].
If you want a matrix of float numbers just try:
m = np.random.rand(rows,cols) * 9
This will output a rows x cols matrix with random floats in the half-open interval [0, 9).
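The same idea can be sketched with NumPy's newer Generator interface. The seed value and the names `m_int` and `m_float` are assumptions made for illustration only.

```python
import numpy as np

# A seeded generator, only so the sketch is repeatable.
rng = np.random.default_rng(0)

# Random number of rows and columns, each between 0 and 9 inclusive.
rows, cols = rng.integers(0, 10, size=2)

# Integers in the closed interval [0, 9] (the upper bound 10 is exclusive).
m_int = rng.integers(0, 10, size=(rows, cols))

# Floats in the half-open interval [0, 9).
m_float = rng.uniform(0, 9, size=(rows, cols))

print(m_int.shape, m_float.shape)
```

Because the size can come out as 0, either matrix may be empty; draw the size from `rng.integers(1, 10, size=2)` if at least one row and column is required.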
If what you're looking for is a matrix of random size (up to 9x9) filled with random integers between 0 and 9, here's what you want:
# this randomizes the size of the matrix (each dimension is between 0 and 9).
rows, cols = np.random.randint(10, size=2)
# this prints a matrix filled with random integers from 0 to 9, with the given size.
print(np.random.randint(10, size=(rows, cols)))
Output:
[[1 7 1 4 4 4 4 3]
[1 4 7 3 0 5 3 5]
[6 3 3 7 5 7 6 1]
[3 8 5 7 2 0 1 6]
[5 0 8 5 0 1 5 1]
[1 3 3 7 3 7 5 6]
[3 7 4 1 8 3 7 8]
[8 8 8 5 8 4 7 1]]
Sample Data:
id cluster
1 3
2 3
3 3
4 3
5 1
6 1
7 2
8 2
9 2
10 4
11 4
12 5
13 6
What I would like to do is replace the id of the largest cluster (the one with the most rows) with 0, the second largest with 1, and so on. The output would be as shown below.
id cluster
1 0
2 0
3 0
4 0
5 2
6 2
7 1
8 1
9 1
10 3
11 3
12 4
13 5
I'm not quite sure where to start with this. Any help would be much appreciated.
The objective is to relabel groups defined in the 'cluster' column by the corresponding rank of that group's total value count within the column. We'll break this down into several steps:
Integer factorization. Find an integer representation where each unique value in the column gets its own integer. We'll start with zero.
We then need the counts of each of these unique values.
We need to rank the unique values by their counts.
We assign the ranks back to the positions of the original column.
Approach 1
Using Numpy's numpy.unique + argsort
TL;DR
u, i, c = np.unique(
df.cluster.values,
return_inverse=True,
return_counts=True
)
(-c).argsort()[i]
Turns out, numpy.unique performs the task of integer factorization and counting values in one go. In the process, we get unique values as well, but we don't really need those. Also, the integer factorization isn't obvious. That's because per the numpy.unique function, the return value we're looking for is called the inverse. It's called the inverse because it was intended to act as a way to get back the original array given the array of unique values. So if we let
u, i, c = np.unique(
    df.cluster.values,
    return_inverse=True,
    return_counts=True
)
You'll see i looks like:
array([2, 2, 2, 2, 0, 0, 1, 1, 1, 3, 3, 4, 5])
And if we did u[i] we get back the original df.cluster.values
array([3, 3, 3, 3, 1, 1, 2, 2, 2, 4, 4, 5, 6])
But we are going to use it as integer factorization.
Next, we need the counts c
array([2, 3, 4, 2, 1, 1])
I'm going to propose the use of argsort but it's confusing. So I'll try to show it:
np.vstack([c, (-c).argsort()])
array([[2, 3, 4, 2, 1, 1],
[2, 1, 0, 3, 4, 5]])
What argsort does in general is place, in the top spot (position 0), the position to draw from in the originating array.
#               position 2
#               is best
#                |
#                v
# array([[2, 3, 4, 2, 1, 1],
#        [2, 1, 0, 3, 4, 5]])
#          ^
#          |
#          top spot
#          from
#          position 2
#            position 1
#            goes to
#            second spot
#             |
#             v
# array([[2, 3, 4, 2, 1, 1],
#        [2, 1, 0, 3, 4, 5]])
#             ^
#             |
#            second spot
#            from
#            position 1
What this allows us to do is to slice this argsort result with our integer factorization to arrive at a remapping of the ranks.
# i is
# [2 2 2 2 0 0 1 1 1 3 3 4 5]
# (-c).argsort() is
# [2 1 0 3 4 5]
# argsort
# slice
# \ / This is our integer factorization
# a i
# [[0 2] <-- 0 is second position in argsort
# [0 2] <-- 0 is second position in argsort
# [0 2] <-- 0 is second position in argsort
# [0 2] <-- 0 is second position in argsort
# [2 0] <-- 2 is zeroth position in argsort
# [2 0] <-- 2 is zeroth position in argsort
# [1 1] <-- 1 is first position in argsort
# [1 1] <-- 1 is first position in argsort
# [1 1] <-- 1 is first position in argsort
# [3 3] <-- 3 is third position in argsort
# [3 3] <-- 3 is third position in argsort
# [4 4] <-- 4 is fourth position in argsort
# [5 5]] <-- 5 is fifth position in argsort
We can then drop it into the column with pd.DataFrame.assign
u, i, c = np.unique(
df.cluster.values,
return_inverse=True,
return_counts=True
)
df.assign(cluster=(-c).argsort()[i])
id cluster
0 1 0
1 2 0
2 3 0
3 4 0
4 5 2
5 6 2
6 7 1
7 8 1
8 9 1
9 10 3
10 11 3
11 12 4
12 13 5
Approach 2
I'm going to leverage the same concepts. However, I'll use pandas.factorize to get the integer factorization and numpy.bincount to count values. The reason to use this approach is that Numpy's unique actually sorts the values in the midst of factorizing and counting, while pandas.factorize does not. For larger data sets, big O is our friend: the factorizing and counting remain O(n), while the Numpy approach is O(n log n).
i, u = pd.factorize(df.cluster.values)
c = np.bincount(i)
df.assign(cluster=(-c).argsort()[i])
id cluster
0 1 0
1 2 0
2 3 0
3 4 0
4 5 2
5 6 2
6 7 1
7 8 1
8 9 1
9 10 3
10 11 3
11 12 4
12 13 5
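One caveat worth checking before reusing this pattern: argsort returns the positions that would sort the counts, not the rank of each group. For the sample data that permutation happens to be its own inverse, so indexing it with the factorization gives the ranks. The sketch below uses made-up labels where the two differ, and inverts the permutation explicitly; the names `order` and `ranks` are introduced here for illustration.

```python
import numpy as np

# Made-up labels: 'a' appears once, 'b' three times, 'c' twice.
cluster = np.array(['a', 'b', 'b', 'b', 'c', 'c'])
u, i, c = np.unique(cluster, return_inverse=True, return_counts=True)

# Positions that sort the counts in descending order.
order = np.argsort(-c, kind='stable')

# Invert the permutation: ranks[g] is the rank of group g.
ranks = np.empty_like(order)
ranks[order] = np.arange(len(order))

print(order.tolist())     # [1, 2, 0]
print(ranks.tolist())     # [2, 0, 1]
print(ranks[i].tolist())  # [2, 0, 0, 0, 1, 1]  largest group 'b' gets 0
print(order[i].tolist())  # [1, 2, 2, 2, 0, 0]  not the ranks in this case
```

Calling argsort twice, as in `(-c).argsort().argsort()[i]`, is another common way to obtain the same ranks.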
You can use groupby, transform, and rank:
df['cluster'] = df.groupby('cluster').transform('count')\
.rank(ascending=False, method='dense')\
.sub(1).astype(int)
Output:
id cluster
0 1 0
1 2 0
2 3 0
3 4 0
4 5 2
5 6 2
6 7 1
7 8 1
8 9 1
9 10 3
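On the full 13-row sample from the question, clusters 1 and 4 have the same size, as do clusters 5 and 6. With method='dense', tied clusters receive the same label, which may or may not be what is wanted. A small sketch to check this, selecting the 'cluster' column explicitly so the result is a Series (the names `sizes` and `dense` are illustrative):

```python
import pandas as pd

# The 13-row sample data from the question.
df = pd.DataFrame({
    'id': range(1, 14),
    'cluster': [3, 3, 3, 3, 1, 1, 2, 2, 2, 4, 4, 5, 6],
})

# Size of the cluster each row belongs to.
sizes = df.groupby('cluster')['cluster'].transform('count')

# Dense rank of the sizes, largest first, shifted to start at 0.
dense = sizes.rank(ascending=False, method='dense').sub(1).astype(int)

print(dense.tolist())
# [0, 0, 0, 0, 2, 2, 1, 1, 1, 2, 2, 3, 3]
```

Here clusters 1 and 4 both become 2, and clusters 5 and 6 both become 3, which differs from the desired output in the question where every cluster keeps a distinct label.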
By using category and value_counts
df.cluster.map((-df.cluster.value_counts()).astype('category').cat.codes
)
Out[151]:
0 0
1 0
2 0
3 0
4 2
5 2
6 1
7 1
8 1
9 3
Name: cluster, dtype: int8
This isn't the cleanest solution but it does work. Feel free to suggest improvements:
valueCounts = df.groupby('cluster')['cluster'].count()
valueCounts_sorted = valueCounts.sort_values(ascending=False)
# compare against a copy of the original labels, so rows that were
# already relabelled are not matched again by a later iteration
original = df['cluster'].copy()
count = 0
for i in valueCounts_sorted.index.values:
    df.loc[original == i, 'cluster'] = count
    count += 1
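A loop-free variation on the same idea is to build a dictionary from each cluster label to its position in the sorted counts, then map the column through it. This is a sketch under the assumption that the order among equally sized clusters does not matter; `mapping` and `relabelled` are illustrative names.

```python
import pandas as pd

# The 13-row sample data from the question.
df = pd.DataFrame({
    'id': range(1, 14),
    'cluster': [3, 3, 3, 3, 1, 1, 2, 2, 2, 4, 4, 5, 6],
})

# value_counts sorts by count, largest first.
counts = df['cluster'].value_counts()

# Old label -> new label, where the new label is the position in that ordering.
mapping = {label: rank for rank, label in enumerate(counts.index)}

# Build the new column without modifying df in place.
relabelled = df['cluster'].map(mapping)
print(relabelled.tolist())
```

Because the mapping is computed up front, no row can be matched twice, and every cluster keeps a distinct label even when sizes are tied.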
Let's say I have this array:
np.arange(9)
[0 1 2 3 4 5 6 7 8]
I would like to shuffle the elements with np.random.shuffle but certain numbers have to be in the original order.
I want that 0, 1, 2 have the original order.
I want that 3, 4, 5 have the original order.
And I want that 6, 7, 8 have the original order.
The number of elements in the array would be multiple of 3.
For example, some possible outputs would be:
[ 3 4 5 0 1 2 6 7 8]
[ 0 1 2 6 7 8 3 4 5]
But this one:
[2 1 0 3 4 5 6 7 8]
Would not be valid because 0, 1, 2 are not in the original order
I think that maybe zip() could be useful here, but I'm not sure.
Short solution using numpy.random.shuffle and numpy.ndarray.flatten functions:
arr = np.arange(9)
arr_reshaped = arr.reshape((3,3)) # reshaping the input array to size 3x3
np.random.shuffle(arr_reshaped)
result = arr_reshaped.flatten()
print(result)
One possible random result:
[3 4 5 0 1 2 6 7 8]
Naive approach:
num_indices = len(array_to_shuffle) // 3 # use normal / in python 2
indices = np.arange(num_indices)
np.random.shuffle(indices)
shuffled_array = np.empty_like(array_to_shuffle)
cur_idx = 0
for idx in indices:
    # copy block number idx into the next free block of the output
    shuffled_array[cur_idx:cur_idx+3] = array_to_shuffle[idx*3:(idx+1)*3]
    cur_idx += 3
Faster (and cleaner) option:
num_indices = len(array_to_shuffle) // 3 # use normal / in python 2
indices = np.arange(num_indices)
np.random.shuffle(indices)
tmp = array_to_shuffle.reshape([-1,3])
tmp = tmp[indices,:]
shuffled_array = tmp.reshape([-1])
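The reshape-based idea can also be wrapped in a small function that leaves the input untouched. This is a minimal sketch: `shuffle_blocks` and its `block` and `seed` parameters are hypothetical names, and it uses NumPy's Generator.permutation, which shuffles along the first axis and returns a copy.

```python
import numpy as np


def shuffle_blocks(arr, block=3, seed=None):
    """Return a copy of arr with consecutive blocks shuffled, order inside each block kept."""
    rng = np.random.default_rng(seed)
    # One row per block; len(arr) must be a multiple of block.
    blocks = arr.reshape(-1, block)
    # permutation shuffles the rows and returns a new array.
    return rng.permutation(blocks).reshape(-1)


arr = np.arange(9)
result = shuffle_blocks(arr, seed=0)
print(result)
```

Unlike np.random.shuffle on a reshaped view, this does not modify `arr` in place.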