NumPy - Format array by max value - python

I have the following values:
student_list = [521, 597, 624, 100]  # IDs of the students
grade_list = [[99, 73, 97, 98], [98, 71, 70, 99]]  # Grades per exercise: the first inner list holds all four students' grades for exercise #1
My goal is to return a multidimensional array that for each student, will get the max grade he got in all exercises.
desired output example:
[[521 597 624 100] [ 99 73 97 99]]
[521 597 624 100] - the IDS of the students
[ 99 73 97 99] - the maximum grade per student: for student #521 the highest between 99 and 98 is 99, next is 73, and so on.
How I can return it using NumPy? I have looked after argmax() but not sure how to put it together.

You can try np.amax:
grade_list = np.array([[99, 73, 97, 98], [98, 71, 70, 99]])
np.amax(grade_list, axis=0)
output:
array([99, 73, 97, 99])
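To reproduce the exact two-row layout from the question, you can stack the column-wise maxima under the student IDs; a sketch using the question's data:

```python
import numpy as np

student_list = np.array([521, 597, 624, 100])
grade_list = np.array([[99, 73, 97, 98], [98, 71, 70, 99]])

# column-wise maximum (one value per student), stacked under the IDs
result = np.vstack([student_list, grade_list.max(axis=0)])
print(result)
# [[521 597 624 100]
#  [ 99  73  97  99]]
```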

Related

How to sum across n elements of numpy array

I hope that someone can help me with my problem since I'm not used to python and numpy yet. I have the following array with 24 elements:
load = np.array([10, 12, 9, 13, 17, 23, 25, 28, 26, 24, 22, 20, 18, 20, 22, 24, 26, 28, 23, 24, 21, 18, 16, 13])
I want to create a new array with the same length as "load" and calculate for each element in the array the sum of the current and the next two numbers, so that my objective array would look like this:
[31, 34, 39, 53, 65, 76, 79, 78, 72, 66, 60, 58, 60, 66, 72, 78, 77, 75, 68, 63, 55, 47, 29, 13]
I tried to solve this with the following code:
output = np.empty(len(load))
for i in range(len(output) - 2):
    output[i] = load[i] + load[i + 1] + load[i + 2]
print(output)
The output array looks like this:
array([31. , 34. , 39. , 53. , 65. , 76. , 79. , 78. , 72. , 66. , 60. ,
58. , 60. , 66. , 72. , 78. , 77. , 75. , 68. , 63. , 55. , 47. ,
6. , 4.5])
The last two numbers are not right. For the 23rd element I want the sum of just 16 and 13, and I want the last number to stay 13 since the array ends there. I don't understand how Python calculated these numbers. Also, I would prefer the numbers to be integers without the dot.
Does anyone have a better solution in mind? I know that this probably is easy to solve, I just don't know all the functionalities of numpy.
Thank you very much!
np.empty creates an array containing uninitialized data. In your code, you initialize an array output of length 24 but assign only 22 values to it. The last 2 values contain arbitrary values (i.e. garbage). Unless performance is of importance, np.zeros is usually the better choice for initializing arrays since all values will have a consistent value of 0.
You can solve this without a for loop by padding the input array with zeros, then computing a vectorized sum.
import numpy as np
load = np.array([10, 12, 9, 13, 17, 23, 25, 28, 26, 24, 22, 20, 18, 20, 22, 24, 26, 28, 23, 24, 21, 18, 16, 13])
tmp = np.pad(load, [0, 2])
output = load + tmp[1:-1] + tmp[2:]
print(output)
Output
[31 34 39 53 65 76 79 78 72 66 60 58 60 66 72 78 77 75 68 63 55 47 29 13]
If the array is not super long, and you don't care too much about memory utilization you could use:
from itertools import zip_longest
output = [sum([x, y, z]) for x, y, z in zip_longest(load, load[1:], load[2:], fillvalue=0)]
Output is:
[31, 34, 39, 53, 65, 76, 79, 78, 72, 66, 60, 58, 60, 66, 72, 78, 77, 75, 68, 63, 55, 47, 29, 13]
I'll address the question of how Python calculated those two numbers: it didn't.
If you look closely, your main loop stops two elements short of the end of the array, so those two positions are never written. They therefore hold whatever happened to be in the memory that np.empty() handed back: np.empty() only acquires the memory, without initializing it (i.e. without changing its contents).
A simple approach is to loop through and sum different views of the original array:
def running_sum_loop(arr, k):
    result = arr.copy()
    for i in range(1, k):
        result[:-i] += arr[i:]
    return result
This is quite fast for relatively small values of k, but as k gets larger one may want to avoid the relatively slow explicit looping.
One way to do this is to use strides to create a view of the array that can be summed along an extra dimension.
A plain windowed view does not by itself produce the partial sums needed at the end of the input, so one can either start from a zero-padded input:
import numpy as np
import numpy.lib.stride_tricks
def running_sum_strides(arr, k):
    n = arr.size
    result = np.zeros(arr.size + k - 1, dtype=arr.dtype)
    result[:n] = arr
    window = (k,) * result.ndim
    reduced_shape = tuple(dim - w + 1 for dim, w in zip(result.shape, window))
    view = np.lib.stride_tricks.as_strided(
        result, shape=reduced_shape + window, strides=arr.strides * 2,
        writeable=False)
    result = np.sum(view, axis=-1)
    return result
or, more memory efficiently, construct the tail afterwards with np.cumsum():
import numpy as np
import numpy.lib.stride_tricks
def running_sum_strides_cs(arr, k):
    n = arr.size
    window = (k,) * arr.ndim
    reduced_shape = tuple(dim - w + 1 for dim, w in zip(arr.shape, window))
    view = np.lib.stride_tricks.as_strided(
        arr, shape=reduced_shape + window, strides=arr.strides * 2,
        writeable=False)
    result = np.empty_like(arr)
    result[:n - k + 1] = np.sum(view, axis=-1)
    result[n - k:] = np.cumsum(arr[-1:-(k + 1):-1])[::-1]
    return result
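On NumPy 1.20 and later, the same windowed view can be built without raw as_strided by using np.lib.stride_tricks.sliding_window_view, which is bounds-checked and read-only by construction; a sketch (not part of the original answers):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def running_sum_swv(arr, k):
    # pad the tail with k - 1 zeros so every position gets a full window
    padded = np.pad(arr, (0, k - 1))
    # view of shape (n, k): one length-k window per input element
    return sliding_window_view(padded, k).sum(axis=-1)

load = np.array([10, 12, 9, 13, 17, 23, 25, 28, 26, 24, 22, 20,
                 18, 20, 22, 24, 26, 28, 23, 24, 21, 18, 16, 13])
print(running_sum_swv(load, 3))
# [31 34 39 53 65 76 79 78 72 66 60 58 60 66 72 78 77 75 68 63 55 47 29 13]
```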
Note that looping through the input size instead of k is not going to be fast, no matter the inputs, because k is limited by the size of the input.
Alternatively, one could use np.convolve(), which computes exactly what you are after but with both tails, so that you just need to slice out the starting tail:
def running_sum_conv(arr, k):
    return np.convolve(arr, (1,) * k)[(k - 1):]
Finally, one could write a fully explicit looping solution accelerated with Numba:
import numpy as np
import numba as nb
@nb.njit
def running_sum_nb(arr, k):
    n = arr.size
    m = n - k + 1
    o = k - 1
    result = np.zeros(n, dtype=arr.dtype)
    # fill the bulk (positions with a full window)
    for j in range(m):
        tot = arr[j]
        for i in range(1, k):
            tot += arr[j + i]
        result[j] = tot
    # fill the tail (partial windows)
    for j in range(o):
        tot = 0
        for i in range(j, o):
            tot += arr[m + i]
        result[m + j] = tot
    return result
To check that all the solutions give the same result as the expected output:
funcs = running_sum_loop, running_sum_strides, running_sum_strides_cs, running_sum_conv, running_sum_nb
load = np.array([10, 12, 9, 13, 17, 23, 25, 28, 26, 24, 22, 20, 18, 20, 22, 24, 26, 28, 23, 24, 21, 18, 16, 13])
tgt = np.array([31, 34, 39, 53, 65, 76, 79, 78, 72, 66, 60, 58, 60, 66, 72, 78, 77, 75, 68, 63, 55, 47, 29, 13])
print(f"{'Input':>24} {load}")
print(f"{'Target':>24} {tgt}")
for func in funcs:
    print(f"{func.__name__:>24} {func(load, 3)}")
Input [10 12 9 13 17 23 25 28 26 24 22 20 18 20 22 24 26 28 23 24 21 18 16 13]
Target [31 34 39 53 65 76 79 78 72 66 60 58 60 66 72 78 77 75 68 63 55 47 29 13]
running_sum_loop [31 34 39 53 65 76 79 78 72 66 60 58 60 66 72 78 77 75 68 63 55 47 29 13]
running_sum_strides_cs [31 34 39 53 65 76 79 78 72 66 60 58 60 66 72 78 77 75 68 63 55 47 29 13]
running_sum_strides [31 34 39 53 65 76 79 78 72 66 60 58 60 66 72 78 77 75 68 63 55 47 29 13]
running_sum_conv [31 34 39 53 65 76 79 78 72 66 60 58 60 66 72 78 77 75 68 63 55 47 29 13]
running_sum_nb [31 34 39 53 65 76 79 78 72 66 60 58 60 66 72 78 77 75 68 63 55 47 29 13]
Benchmarking all these for varying input size:
import pandas as pd
timeds_n = {}
for p in range(6):
    n = 10 ** p
    k = 3
    arr = np.array(load.tolist() * n)
    print(f"N = {n * len(load)}")
    base = funcs[0](arr, k)
    timeds_n[n] = []
    for func in funcs:
        res = func(arr, k)
        timed = %timeit -r 8 -n 8 -q -o func(arr, k)
        timeds_n[n].append(timed.best)
        print(f"{func.__name__:>24} {np.allclose(base, res)} {timed.best:.9f}")
pd.DataFrame(data=timeds_n, index=[func.__name__ for func in funcs]).transpose().plot()
and varying k:
timeds_k = {}
for p in range(1, 10):
    n = 10 ** 5
    k = 2 ** p
    arr = np.array(load.tolist() * n)
    print(f"k = {k}")
    timeds_k[k] = []
    base = funcs[0](arr, k)
    for func in funcs:
        res = func(arr, k)
        timed = %timeit -q -o func(arr, k)
        timeds_k[k].append(timed.best)
        print(f"{func.__name__:>24} {np.allclose(base, res)} {timed.best:.9f}")
pd.DataFrame(data=timeds_k, index=[func.__name__ for func in funcs]).transpose().plot()
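Note that %timeit is an IPython magic, so the benchmark above only runs inside IPython/Jupyter. Outside a notebook, the standard-library timeit module gives comparable measurements; a minimal sketch using the convolution variant (redefined here so the snippet is self-contained):

```python
import timeit
import numpy as np

def running_sum_conv(arr, k):
    return np.convolve(arr, (1,) * k)[(k - 1):]

arr = np.arange(100_000)
# best-of-5 runs, 10 calls each, reported per call
times = timeit.repeat(lambda: running_sum_conv(arr, 3), number=10, repeat=5)
print(f"best: {min(times) / 10:.6f} s per call")
```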

Splitting series with single column that contains list, into multiple columns with single values

Given a Series object which I have pulled from a dataframe, for example through:
columns = list(df)
for col in columns:
    s = df[col]  # the Series object
The Series contains a <class 'list'> in each row, making it look like this:
0 [116, 66]
2 [116, 66]
4 [116, 66]
6 [116, 66]
8 [116, 66]
...
1498 [117, 66]
1500 [117, 66]
1502 [117, 66]
1504 [117, 66]
1506 [117, 66]
How could I split this up, so it becomes two columns in the Series instead?
0 116 66
2 116 66
...
1506 116 66
And then append it back to the original df?
From Ch3steR's comment suggesting pd.DataFrame(s.tolist()), I managed to get the answer I was looking for, including renaming the columns in the new dataframe to include the name of the original Series.
columns = list(df)
for col in columns:
    df2 = pd.DataFrame(df[col].tolist())
    df2.columns = [col + "_" + str(y) for y in range(len(df2.columns))]
    print(df2)
To keep this shorter, as also suggested by Ch3steR, we can simplify the above to:
columns = list(df)
for col in columns:
    df2 = pd.DataFrame(df[col].tolist()).add_prefix(col + "_")
    print(df2)
Which in my case, gives the following output:
FrameLen_0 FrameLen_1
0 116 66
1 116 66
2 116 66
3 116 66
4 116 66
.. ... ...
749 117 66
750 117 66
751 117 66
752 117 66
753 117 66
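To cover the "append it back to the original df" part of the question, the expanded columns can be joined onto the original frame with pd.concat, keeping the original (non-consecutive) index; a sketch assuming a single FrameLen column of two-element lists (hypothetical data):

```python
import pandas as pd

# hypothetical frame mirroring the question's even-numbered index
df = pd.DataFrame({"FrameLen": [[116, 66], [116, 66], [117, 66]]},
                  index=[0, 2, 4])

# expand the list column, preserving the original index for alignment
expanded = pd.DataFrame(df["FrameLen"].tolist(), index=df.index)
expanded.columns = [f"FrameLen_{i}" for i in range(len(expanded.columns))]

# drop the list column and attach the new ones
df = pd.concat([df.drop(columns=["FrameLen"]), expanded], axis=1)
print(df)
```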

Python Generate unique ranges of a specific length and categorize them

I have a dataframe column which specifies how many times a user has performed an activity.
eg.
>>> df['ActivityCount']
Users ActivityCount
User0 220
User1 190
User2 105
User3 109
User4 271
User5 265
...
User95 64
User96 15
User97 168
User98 251
User99 278
Name: ActivityCount, Length: 100, dtype: int32
>>> activities = sorted(df['ActivityCount'].unique())
[9, 15, 16, 17, 20, 23, 25, 26, 28, 31, 33, 34, 36, 38, 39, 43, 49, 57, 59, 64, 65, 71, 76, 77, 78,
83, 88, 94, 95, 100, 105, 109, 110, 111, 115, 116, 117, 120, 132, 137, 138, 139, 140, 141, 144, 145, 148, 153, 155, 157, 162, 168, 177, 180, 182, 186, 190, 192, 194, 197, 203, 212, 213, 220, 223, 231, 232, 238, 240, 244, 247, 251, 255, 258, 260, 265, 268, 269, 271, 272, 276, 278, 282, 283, 285, 290]
According to their ActivityCount, I have to divide users into 5 different categories eg A, B, C, D and E.
The Activity Count range varies from time to time. In the above example it is roughly 9-290 (the lowest and highest of the series); it could just as well be 5-500 or 5-30.
In the above example, I could take the maximum number of activities and divide it by 5, categorizing each user into ranges of width 58 (from 290/5), like Range A: 0-58, Range B: 59-116, Range C: 117-174... etc.
Is there any other way to achieve this using pandas or numpy, so that I can directly categorize the column in the given categories?
Expected output: -
>>> df
Users ActivityCount Category/Range
User0 220 D
User1 190 D
User2 105 B
User3 109 B
User4 271 E
User5 265 E
...
User95 64 B
User96 15 A
User97 168 C
User98 251 E
User99 278 E
The natural way to do this is to split the full range of the data into 5 equal-width bins and assign each user to a bin. Luckily, pandas lets you do that directly:
df["category"] = pd.cut(df.Activity, 5, labels= ["a","b", "c", "d", "e"])
The output is something like:
Activity Category
34 115 b
15 43 a
57 192 d
78 271 e
26 88 b
6 25 a
55 186 d
63 220 d
1 15 a
76 268 e
An alternative view - clustering
In the above method, we split the data into 5 bins of equal width. An alternative, more sophisticated approach is to split the data into 5 clusters, aiming to make the data points within each cluster as similar to each other as possible. In machine learning, this is known as clustering.
One classic clustering algorithm is k-means. It is typically used for data with multiple dimensions (e.g. monthly activity, age, gender, etc.), so a single column like this is a very simplistic case of clustering.
In this case, k-means clustering can be done in the following way:
from scipy.cluster.vq import vq, kmeans, whiten

df = pd.DataFrame({"Activity": l})  # l: the list of activity counts shown above
features = np.array([[x] for x in df.Activity])
whitened = whiten(features)
codebook, distortion = kmeans(whitened, 5)
code, dist = vq(whitened, codebook)
df["Category"] = code
And the output looks like:
Activity Category
40 138 1
79 272 0
72 255 0
13 38 3
41 139 1
65 231 0
26 88 2
59 197 4
76 268 0
45 145 1
A couple of notes:
The cluster labels are arbitrary; in this case label '2' happens to refer to higher activity than label '1'.
I didn't map the labels from 0-4 to A-E. This can easily be done using pandas' map.
Try the below solution:
df['Categ'] = pd.cut(df.ActivityCount, bins=5, labels=list('ABCDE'))
It creates the Categ column as the result of dividing ActivityCount into 5 bins, labelled A through E.
The bin borders are set by dividing the full range into 5 subranges of equal size.
You can also see the borders of each bin, calling:
pd.cut(df.ActivityCount, bins=5, labels=list('ABCDE'), retbins=True)[1]
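Both answers use pd.cut, which makes bins of equal width. If you instead want five groups with roughly equal numbers of users (quantile-based bins), pd.qcut is the analogous call; a minimal sketch with made-up, evenly spaced activity counts:

```python
import numpy as np
import pandas as pd

# hypothetical data: 100 evenly spaced activity counts
df = pd.DataFrame({"ActivityCount": np.arange(5, 305, 3)})

# five quantile-based bins: ~20 users per category
df["Categ"] = pd.qcut(df["ActivityCount"], q=5, labels=list("ABCDE"))
print(df["Categ"].value_counts())
```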

Fastest way to do cumulative totals in Pandas dataframe

I've got a pandas dataframe of golfers' round scores going back to 2003 (approx 300000 rows). It looks something like this:
Date        Golfer          Tournament              Score  Player Total Rounds Played
2008-01-01  Tiger Woods     Invented Tournament R1  72     50
2008-01-01  Phil Mickelson  Invented Tournament R1  73     108
I want the 'Player Total Rounds Played' column to be a running total of the number of rounds (i.e. instance in the dataframe) that a player has played up to that date. Is there a quick way of doing it? My current solution (basically using iterrows and then a one-line function) works fine but will take approx 11hrs to run.
Thanks,
Tom
Here is one way:
df = df.sort_values('Date')
df['Rounds CumSum'] = df.groupby('Golfer')['Rounds'].cumsum()
For example:
import pandas as pd
df = pd.DataFrame([['A', 70, 50],
                   ['B', 72, 55],
                   ['A', 73, 45],
                   ['A', 71, 60],
                   ['B', 74, 55],
                   ['A', 72, 65]],
                  columns=['Golfer', 'Rounds', 'Played'])
df['Rounds CumSum'] = df.groupby('Golfer')['Rounds'].cumsum()
# Golfer Rounds Played Rounds CumSum
# 0 A 70 50 70
# 1 B 72 55 72
# 2 A 73 45 143
# 3 A 71 60 214
# 4 B 74 55 146
# 5 A 72 65 286
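Note that cumsum adds up the values in the Rounds column. If, as the question describes, 'Player Total Rounds Played' should be a running count of rows per golfer, groupby().cumcount() would be the variant to reach for; a sketch on toy data:

```python
import pandas as pd

df = pd.DataFrame({"Golfer": ["A", "B", "A", "A", "B", "A"],
                   "Score": [70, 72, 73, 71, 74, 72]})

# running count of appearances per golfer (cumcount is 0-based, hence + 1)
df["Rounds Played"] = df.groupby("Golfer").cumcount() + 1
print(df)
#   Golfer  Score  Rounds Played
# 0      A     70              1
# 1      B     72              1
# 2      A     73              2
# 3      A     71              3
# 4      B     74              2
# 5      A     72              4
```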

Inconsistent python print output

(Python 2.7.12) - I have created an NxN array, when I print it I get the exact following output:
Sample a:
SampleArray=np.random.randint(1,100, size=(5,5))
[[49 72 88 56 41]
[30 73 6 43 53]
[83 54 65 16 34]
[25 17 73 10 46]
[75 77 82 12 91]]
Nice and clean.
However, when I go to sort this array by the elements in the 4th column using the code:
SampleArray=sorted(SampleArray, key=lambda x: x[4])
I get the following output:
Sample b:
[array([90, 9, 77, 63, 48]), array([43, 97, 47, 74, 53]), array([60, 64, 97, 2, 73]), array([34, 20, 42, 80, 76]), array([86, 61, 95, 21, 82])]
How can I get my output to stay in the format of 'Sample a'. It will make debugging much easier if I can see the numbers in a straight column.
Simply use the numpy.argsort() routine:
import numpy as np
a = np.random.randint(1,100, size=(5,5))
print(a) # initial array
print(a[np.argsort(a[:, -1])]) # sorted array
The output for # initial array:
[[21 99 34 33 55]
[14 81 92 44 97]
[68 53 35 46 22]
[64 33 52 40 75]
[65 35 35 78 43]]
The output for # sorted array:
[[68 53 35 46 22]
[65 35 35 78 43]
[21 99 34 33 55]
[64 33 52 40 75]
[14 81 92 44 97]]
You just need to convert SampleArray back to a NumPy array using
SampleArray = np.array(SampleArray)
Sample code:
import numpy as np
SampleArray=np.random.randint(1,100, size=(5,5))
print (SampleArray)
SampleArray=sorted(SampleArray, key=lambda x: x[4])
print (SampleArray)
SampleArray = np.array(SampleArray)
print (SampleArray)
output:-
[[28 25 33 56 54]
[77 88 10 68 61]
[30 83 77 87 82]
[83 93 70 1 2]
[27 70 76 28 80]]
[array([83, 93, 70, 1, 2]), array([28, 25, 33, 56, 54]), array([77, 88, 10, 68, 61]), array([27, 70, 76, 28, 80]), array([30, 83, 77, 87, 82])]
[[83 93 70 1 2]
[28 25 33 56 54]
[77 88 10 68 61]
[27 70 76 28 80]
[30 83 77 87 82]]
This can help:
from pprint import pprint
pprint(SampleArray)
The output is a little bit different from the one for Sample A but it still looks neat and debugging will be easier.
Edit: here's my output
[[92 8 41 64 61]
[18 67 91 80 35]
[68 37 4 6 43]
[26 81 57 26 52]
[ 6 82 95 15 69]]
[array([18, 67, 91, 80, 35]),
array([68, 37, 4, 6, 43]),
array([26, 81, 57, 26, 52]),
array([92, 8, 41, 64, 61]),
array([ 6, 82, 95, 15, 69])]
