Ranking groups based on size - python

Sample Data:
id cluster
1 3
2 3
3 3
4 3
5 1
6 1
7 2
8 2
9 2
10 4
11 4
12 5
13 6
What I would like to do is replace the largest cluster id with 0 and the second largest with 1 and so on and so forth. Output would be as shown below.
id cluster
1 0
2 0
3 0
4 0
5 2
6 2
7 1
8 1
9 1
10 3
11 3
12 4
13 5
I'm not quite sure where to start with this. Any help would be much appreciated.

The objective is to relabel groups defined in the 'cluster' column by the corresponding rank of that group's total value count within the column. We'll break this down into several steps:
1. Integer factorization: find an integer representation where each unique value in the column gets its own integer, starting from zero.
2. Get the counts of each of these unique values.
3. Rank the unique values by their counts.
4. Assign the ranks back to the positions of the original column.
Approach 1
Using Numpy's numpy.unique + argsort
TL;DR
u, i, c = np.unique(
    df.cluster.values,
    return_inverse=True,
    return_counts=True
)
(-c).argsort()[i]
Turns out, numpy.unique performs the task of integer factorization and counting values in one go. In the process, we get unique values as well, but we don't really need those. Also, the integer factorization isn't obvious. That's because per the numpy.unique function, the return value we're looking for is called the inverse. It's called the inverse because it was intended to act as a way to get back the original array given the array of unique values. So if we let
u, i, c = np.unique(
    df.cluster.values,
    return_inverse=True,
    return_counts=True
)
You'll see i looks like:
array([2, 2, 2, 2, 0, 0, 1, 1, 1, 3, 3, 4, 5])
And if we do u[i], we get back the original df.cluster.values:
array([3, 3, 3, 3, 1, 1, 2, 2, 2, 4, 4, 5, 6])
But we are going to use it as integer factorization.
Next, we need the counts c
array([2, 3, 4, 2, 1, 1])
I'm going to propose the use of argsort, but it's confusing, so I'll try to show how it works:
np.row_stack([c, (-c).argsort()])
array([[2, 3, 4, 2, 1, 1],
       [2, 1, 0, 3, 4, 5]])
What argsort does, in general, is place in each spot (starting with the top spot, position 0) the position to draw from in the originating array.
#               position 2
#                is best
#               |
#               v
# array([[2, 3, 4, 2, 1, 1],
#        [2, 1, 0, 3, 4, 5]])
#         ^
#         |
#      top spot
#      from
#      position 2

#            position 1
#            goes to
#            runner-up spot
#            |
#            v
# array([[2, 3, 4, 2, 1, 1],
#        [2, 1, 0, 3, 4, 5]])
#            ^
#            |
#      runner-up spot
#      from
#      position 1
What this allows us to do is to slice this argsort result with our integer factorization to arrive at a remapping of the ranks.
# i is
# [2 2 2 2 0 0 1 1 1 3 3 4 5]
# (-c).argsort() is
# [2 1 0 3 4 5]
# argsort
#  slice     This is our integer factorization
#      \     /
#   a i
# [[0 2] <-- 0 is second position in argsort
# [0 2] <-- 0 is second position in argsort
# [0 2] <-- 0 is second position in argsort
# [0 2] <-- 0 is second position in argsort
# [2 0] <-- 2 is zeroth position in argsort
# [2 0] <-- 2 is zeroth position in argsort
# [1 1] <-- 1 is first position in argsort
# [1 1] <-- 1 is first position in argsort
# [1 1] <-- 1 is first position in argsort
# [3 3] <-- 3 is third position in argsort
# [3 3] <-- 3 is third position in argsort
# [4 4] <-- 4 is fourth position in argsort
# [5 5]] <-- 5 is fifth position in argsort
We can then drop it into the column with pd.DataFrame.assign
u, i, c = np.unique(
    df.cluster.values,
    return_inverse=True,
    return_counts=True
)
df.assign(cluster=(-c).argsort()[i])
    id  cluster
0    1        0
1    2        0
2    3        0
3    4        0
4    5        2
5    6        2
6    7        1
7    8        1
8    9        1
9   10        3
10  11        3
11  12        4
12  13        5
Approach 2
I'm going to leverage the same concepts. However, I'll use pandas.factorize to get the integer factorization and numpy.bincount to count the values. The reason to use this approach is that numpy.unique actually sorts the values in the midst of factorizing and counting, while pandas.factorize does not. For larger data sets, big O is our friend: this approach remains O(n), while the NumPy approach is O(n log n).
i, u = pd.factorize(df.cluster.values)
c = np.bincount(i)
df.assign(cluster=(-c).argsort()[i])
    id  cluster
0    1        0
1    2        0
2    3        0
3    4        0
4    5        2
5    6        2
6    7        1
7    8        1
8    9        1
9   10        3
10  11        3
11  12        4
12  13        5
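A caveat worth adding to both approaches (this note is mine, not part of the original answer): (-c).argsort() coincides with the ranks here only because that particular permutation happens to be its own inverse. In general, the rank of each unique value is the inverse of the argsort permutation, which a second argsort produces. A minimal sketch with made-up data where the two differ:
import numpy as np
import pandas as pd

# counts are [1, 3, 2]; the permutation (-c).argsort() == [1, 2, 0]
# is not its own inverse, so a single argsort mislabels the groups
df = pd.DataFrame({'cluster': [7, 8, 8, 8, 9, 9]})
i, u = pd.factorize(df.cluster.values)
c = np.bincount(i)

print((-c).argsort()[i])            # [1 2 2 2 0 0]  <- wrong labels
print((-c).argsort().argsort()[i])  # [2 0 0 0 1 1]  <- correct ranks
On the sample data in this question the single argsort happens to give the right answer, which is why the outputs above still hold.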

You can use groupby, transform, and rank:
df['cluster'] = df.groupby('cluster')['cluster'].transform('count')\
                  .rank(ascending=False, method='dense')\
                  .sub(1).astype(int)
Output:
    id  cluster
0    1        0
1    2        0
2    3        0
3    4        0
4    5        2
5    6        2
6    7        1
7    8        1
8    9        1
9   10        2
10  11        2
11  12        3
12  13        3
Note that method='dense' assigns equally sized groups the same label (clusters 1 and 4 both have two rows, so both map to 2), which differs from the expected output, where ties receive distinct labels.

By using category codes and value_counts:
df.cluster.map((-df.cluster.value_counts()).astype('category').cat.codes)
Out[151]:
0     0
1     0
2     0
3     0
4     2
5     2
6     1
7     1
8     1
9     2
10    2
11    3
12    3
Name: cluster, dtype: int8
This collapses ties as well: the codes come from the distinct (negated) counts, so clusters 1 and 4 again share the label 2.
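Both the dense-rank and the category-codes approaches label equally sized groups identically. To reproduce the expected output, where ties get distinct labels, one option is to enumerate value_counts directly. A sketch of mine, with the caveat that the ordering of equal counts within value_counts is an implementation detail:
import pandas as pd

df = pd.DataFrame({'id': range(1, 14),
                   'cluster': [3, 3, 3, 3, 1, 1, 2, 2, 2, 4, 4, 5, 6]})

# value_counts sorts by count descending; enumerating its index
# hands out one distinct label per group
sizes = df['cluster'].value_counts()
mapping = {k: rank for rank, k in enumerate(sizes.index)}
df['cluster'] = df['cluster'].map(mapping)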

This isn't the cleanest solution but it does work. Feel free to suggest improvements:
valueCounts = df.groupby('cluster')['cluster'].count()
valueCounts_sorted = valueCounts.sort_values(ascending=False)
count = 0
for i in valueCounts_sorted.index.values:
    # select the rows of the current cluster and relabel them
    temp = df[df.cluster == i].copy()
    temp["random"] = count
    idx = temp.index.values
    df.loc[idx, "cluster"] = temp.random.values
    count += 1

Related

Convert a dataframe to an array

I have a dataframe like the following.
i  j  element
0  0  1
0  1  2
0  2  3
1  0  4
1  1  5
1  2  6
2  0  7
2  1  8
2  2  9
How can I convert it to the 3*3 array below?
1 2 3
4 5 6
7 8 9
Assuming that the dataframe is called df, one can use pandas.DataFrame.pivot with .to_numpy() (recommended) or .values:
array = df.pivot(index='i', columns='j', values='element').to_numpy()
# or
array = df.pivot(index='i', columns='j', values='element').values
[Out]:
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]], dtype=int64)
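One caveat, not from the original answer: pivot raises a ValueError if any (i, j) pair occurs more than once. If duplicates are possible, pivot_table with an explicit aggregation function tolerates them; a sketch:
# pivot_table aggregates duplicate (i, j) pairs instead of raising;
# aggfunc='first' keeps the first occurrence of each pair
array = df.pivot_table(index='i', columns='j', values='element',
                       aggfunc='first').to_numpy()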
If you transform your dataframe into three lists, where the first contains the "i" values, the second the "j" values, and the third the data, you can create the NumPy array "manually":
i, j, v = zip(*df.itertuples(index=False, name=None))
# np.zeros(df.shape) would give a 9x3 array; we want one cell per (i, j) pair
arr = np.zeros((max(i) + 1, max(j) + 1), dtype=int)
arr[i, j] = v

Pyramid fill - how to write input 2 and output [1,1][1,1]

Read an integer n from standard input and fill the square n×n with integers as follows:
1 2 3 2 1
2 3 4 3 2
3 4 5 4 3
2 3 4 3 2
1 2 3 2 1
or
1 2 2 1
2 3 3 2
2 3 3 2
1 2 2 1
That is, numbers in each column / row increase by one when moving along the column/row towards the center of the matrix.
Implement the matrix as a list of lists of integers, and print the resulting list of lists on standard output.
Here is my code, but for input 2 it does not output [[1, 1], [1, 1]]; it outputs [].
MY CODE:
N = int(input("Enter N value:"))
k = (N) - 1
matrix = [[0 for i in range(k)] for j in range(k)]
for i in range(k):
    matrix = []
    for j in range(k):
        print(matrix)
The code you provided does nothing; I'm assuming that's a copy error on your part.
Anyway, the following should work. Bear in mind that the reversals do not create deep copies, so if you try to edit the matrix once it is built, it may not behave the way you expect (see the sketch after the code).
import math

N = int(input("Enter N value:"))
matrix = []
# iterate over the first math.ceil(N/2) rows
for i in range(math.ceil(N/2)):
    matrix.append([])
    # create the core numbers
    # if i = 0, N = 4, range would be 1,2
    # if i = 1, N = 5, range would be 2,3,4
    for j in range(i+1, i+1+math.ceil(N/2)):
        matrix[i].append(j)
    # create copy and reverse it excluding the center element of the row
    matrix[i] += matrix[i][0:N//2][::-1]
# reverse and append the existing matrix excluding the center row
matrix += matrix[0:N//2][::-1]
print(matrix)
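If you do need to edit the matrix after building it, a small change avoids the row aliasing mentioned above. This sketch replaces the mirroring line (matrix += ...) so that it copies each mirrored row instead of appending a reference to it:
# the rows hold ints, so a shallow copy of each row list
# is enough to decouple the two halves
matrix += [row[:] for row in matrix[0:N//2][::-1]]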
You never populate matrix; you are only looping and reassigning matrix to an empty list.
You can use a list comprehension to get the first half, then append its reverse. You will need some logic to determine what to do when n is even or odd, but that's fairly trivial:
n = int(input("Enter N value: "))
half = n // 2
is_odd = n % 2
matrix = [[*range(1 + i, half + i + is_odd + 1),
           *range(half + i, i, -1)] for i in range(half + is_odd)]
matrix += matrix[:-is_odd or None][::-1]
Result: (formatted as your expected output)
>>> Enter N value: 4
[[1, 2, 2, 1],
[2, 3, 3, 2],
[2, 3, 3, 2],
[1, 2, 2, 1]]
>>> Enter N value: 5
[[1, 2, 3, 2, 1],
[2, 3, 4, 3, 2],
[3, 4, 5, 4, 3],
[2, 3, 4, 3, 2],
[1, 2, 3, 2, 1]]
You can build a ramp list R that increases towards the center and then decreases, and fill each cell with the sum of its row and column ramp values plus one:
n = 4
R = [(n-abs(i))//2 for i in range(1-n,n,2)] # [0, 1, 1, 0]
M = [ [r+c+1 for c in R] for r in R]
print(*M,sep="\n")
[1, 2, 2, 1]
[2, 3, 3, 2]
[2, 3, 3, 2]
[1, 2, 2, 1]
Visually (looking at R)
r+c+1 for n=4        r+c+1 for n=5
  R | 0 1 1 0          R | 0 1 2 1 0
  -----------          -------------
  0 | 1 2 2 1          0 | 1 2 3 2 1
  1 | 2 3 3 2          1 | 2 3 4 3 2
  1 | 2 3 3 2          2 | 3 4 5 4 3
  0 | 1 2 2 1          1 | 2 3 4 3 2
                       0 | 1 2 3 2 1
You could also combine this in a single list comprehension:
M = [[n-(abs(r)+abs(c))//2 for c in range(1-n,n,2)] for r in range(1-n,n,2)]
or in a more basic for-loop:
for r in range(1-n,n,2):                    # r\c | -3 -1  1  3
    for c in range(1-n,n,2):                #  -3 |  1  2  2  1
        print(n-(abs(r)+abs(c))//2,end=" ") #  -1 |  2  3  3  2
    print()                                 #   1 |  2  3  3  2
                                            #   3 |  1  2  2  1

Pandas Mapping Numbers to another Number

I have ~5000 rows and all values in my 'Round' column go from -1 to 7. I'm trying to create a new column where -1 maps to 0 and anything from 1-7 maps to 1. I tried a simple map listing all the mappings, but it doesn't work.
combine['Drafted'] = combine.Round.map({'-1':0,'1':1,'2':1,'3':1,'4':1,'5':1,'6':1,'7':1})
Is there something wrong with the logic above that it wouldn't work?
You can achieve it with the code below:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Round': [-1, 1, 0, 7, -1, 2, 3, 5, -1, 4, 6]})
df['Drafted'] = np.where(df['Round'] == -1, 0, 1)
print(df)
And the output is as below:
    Round  Drafted
0      -1        0
1       1        1
2       0        1
3       7        1
4      -1        0
5       2        1
6       3        1
7       5        1
8      -1        0
9       4        1
10      6        1
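As for why the original attempt returned NaN: the dictionary keys are strings ('-1', '1', ...) while the Round column holds integers, so no key ever matches. With integer keys covering every value, map works as well; a sketch on the same sample frame:
import pandas as pd

df = pd.DataFrame({'Round': [-1, 1, 0, 7, -1, 2, 3, 5, -1, 4, 6]})

# integer keys match the integer column; unlisted values would become
# NaN, so cover 0 through 7 explicitly
mapping = {-1: 0, **{r: 1 for r in range(0, 8)}}
df['Drafted'] = df['Round'].map(mapping)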

Creating intervaled ramp array based on a threshold - Python / NumPy

I would like to measure the length of a sub-array fulfilling some condition (like a stop clock), but as soon as the condition is no longer fulfilled, the value should reset to zero. So the resulting array should tell me how many consecutive values fulfilled the condition (e.g. value > 1):
[0, 0, 2, 2, 2, 2, 0, 3, 3, 0]
should result in the following array:
[0, 0, 1, 2, 3, 4, 0, 1, 2, 0]
One can easily define a function in Python which returns the corresponding numpy array:
import numpy as np

def StopClock(signal, threshold=1):
    clock = []
    current_time = 0
    for item in signal:
        if item > threshold:
            current_time += 1
        else:
            current_time = 0
        clock.append(current_time)
    return np.array(clock)

StopClock([0, 0, 2, 2, 2, 2, 0, 3, 3, 0])
However, I really do not like this for-loop, especially since this counter should run over a longer dataset. I thought of some np.cumsum solution in combination with np.diff, but I cannot work out the reset part. Is someone aware of a more elegant numpy-style solution to the above problem?
This solution uses pandas to perform a groupby:
s = pd.Series([0, 0, 2, 2, 2, 2, 0, 3, 3, 0])
threshold = 0
>>> np.where(
        s > threshold,
        s
        .to_frame()         # Convert series to dataframe.
        .assign(_dummy_=1)  # Add column of ones.
        .groupby((s.gt(threshold) != s.gt(threshold).shift()).cumsum())['_dummy_']  # shift-cumsum pattern
        .transform(lambda x: x.cumsum()),  # Cumsum the ones per group.
        0)  # Fill value with zero where threshold not exceeded.
array([0, 0, 1, 2, 3, 4, 0, 1, 2, 0])
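A more compact pandas variant of the same shift-cumsum idea (my sketch, not part of the original answer):
import pandas as pd

s = pd.Series([0, 0, 2, 2, 2, 2, 0, 3, 3, 0])
m = s.gt(1)

# (~m).cumsum() starts a new group at every reset; cumsumming the
# 0/1 mask within each group counts consecutive values above threshold
result = m.astype(int).groupby((~m).cumsum()).cumsum()
print(result.tolist())  # [0, 0, 1, 2, 3, 4, 0, 1, 2, 0]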
Yes, we can use diff-style differentiation along with cumsum to create such intervaled ramps in a vectorized manner, and that should be pretty efficient, especially with large input arrays. The resetting is taken care of by assigning appropriate values at the end of each interval, such that the cumulative sum comes back to zero there.
Here's one implementation to accomplish all that -
def intervaled_ramp(a, thresh=1):
    mask = a > thresh

    # Get start, stop indices of each interval
    mask_ext = np.concatenate(([False], mask, [False]))
    idx = np.flatnonzero(mask_ext[1:] != mask_ext[:-1])
    s0, s1 = idx[::2], idx[1::2]

    # Ones inside intervals; at each interval's stop index, write a
    # negative offset so the cumulative sum resets to zero
    out = mask.astype(int)
    valid_stop = s1[s1 < len(a)]
    out[valid_stop] = s0[:len(valid_stop)] - valid_stop
    return out.cumsum()
Sample runs -
Input (a) :
[5 3 1 4 5 0 0 2 2 2 2 0 3 3 0 1 1 2 0 3 5 4 3 0 1]
Output (intervaled_ramp(a, thresh=1)) :
[1 2 0 1 2 0 0 1 2 3 4 0 1 2 0 0 0 1 0 1 2 3 4 0 0]
Input (a) :
[1 1 1 4 5 0 0 2 2 2 2 0 3 3 0 1 1 2 0 3 5 4 3 0 1]
Output (intervaled_ramp(a, thresh=1)) :
[0 0 0 1 2 0 0 1 2 3 4 0 1 2 0 0 0 1 0 1 2 3 4 0 0]
Input (a) :
[1 1 1 4 5 0 0 2 2 2 2 0 3 3 0 1 1 2 0 3 5 4 3 0 5]
Output (intervaled_ramp(a, thresh=1)) :
[0 0 0 1 2 0 0 1 2 3 4 0 1 2 0 0 0 1 0 1 2 3 4 0 1]
Input (a) :
[1 1 1 4 5 0 0 2 2 2 2 0 3 3 0 1 1 2 0 3 5 4 3 0 5]
Output (intervaled_ramp(a, thresh=0)) :
[1 2 3 4 5 0 0 1 2 3 4 0 1 2 0 1 2 3 0 1 2 3 4 0 1]
Runtime test
One way to do fair benchmarking is to take the posted sample from the question, tile it a large number of times, and use that as the input array. With that setup, here are the timings:
In [841]: a = np.array([0, 0, 2, 2, 2, 2, 0, 3, 3, 0])

In [842]: a = np.tile(a,10000)

# @Alexander's soln
In [843]: %timeit pandas_app(a, threshold=1)
1 loop, best of 3: 3.93 s per loop

# @Psidom's soln
In [844]: %timeit stop_clock(a, threshold=1)
10 loops, best of 3: 119 ms per loop

# Proposed in this post
In [845]: %timeit intervaled_ramp(a, thresh=1)
1000 loops, best of 3: 527 µs per loop
Another numpy solution:
import numpy as np

a = np.array([0, 0, 2, 2, 2, 2, 0, 3, 3, 0])

def stop_clock(signal, threshold=1):
    mask = signal > threshold
    indices = np.flatnonzero(np.diff(mask)) + 1
    return np.concatenate(list(map(np.cumsum, np.array_split(mask, indices))))

stop_clock(a)
# array([0, 0, 1, 2, 3, 4, 0, 1, 2, 0])
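For comparison, a pure-Python take on the same split idea using itertools.groupby (my sketch, not from the original answers):
from itertools import groupby

def stop_clock_py(signal, threshold=1):
    # group consecutive elements by whether they exceed the threshold
    out = []
    for above, grp in groupby(signal, key=lambda x: x > threshold):
        n = len(list(grp))
        out.extend(range(1, n + 1) if above else [0] * n)
    return out

print(stop_clock_py([0, 0, 2, 2, 2, 2, 0, 3, 3, 0]))
# [0, 0, 1, 2, 3, 4, 0, 1, 2, 0]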

Shuffle "coupled" elements in python array

Let's say I have this array:
np.arange(9)
[0 1 2 3 4 5 6 7 8]
I would like to shuffle the elements with np.random.shuffle but certain numbers have to be in the original order.
I want that 0, 1, 2 have the original order.
I want that 3, 4, 5 have the original order.
And I want that 6, 7, 8 have the original order.
The number of elements in the array would be multiple of 3.
For example, some possible outputs would be:
[ 3 4 5 0 1 2 6 7 8]
[ 0 1 2 6 7 8 3 4 5]
But this one:
[2 1 0 3 4 5 6 7 8]
Would not be valid because 0, 1, 2 are not in the original order
I think that maybe zip() could be useful here, but I'm not sure.
Short solution using numpy.random.shuffle and numpy.ndarray.flatten functions:
arr = np.arange(9)
arr_reshaped = arr.reshape((3,3)) # reshaping the input array to size 3x3
np.random.shuffle(arr_reshaped)
result = arr_reshaped.flatten()
print(result)
One of possible random results:
[3 4 5 0 1 2 6 7 8]
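The same idea also works with the newer Generator API (a sketch, assuming a NumPy version that provides numpy.random.default_rng):
import numpy as np

rng = np.random.default_rng()
arr = np.arange(9)
# permutation with axis=0 shuffles whole rows and returns a copy
result = rng.permutation(arr.reshape(-1, 3), axis=0).ravel()
print(result)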
Naive approach:
num_indices = len(array_to_shuffle) // 3 # use normal / in python 2
indices = np.arange(num_indices)
np.random.shuffle(indices)
shuffled_array = np.empty_like(array_to_shuffle)
cur_idx = 0
for idx in indices:
shuffled_array[cur_idx:cur_idx+3] = array_to_shuffle[idx*3:(idx+1)*3]
cur_idx += 3
Faster (and cleaner) option:
num_indices = len(array_to_shuffle) // 3  # integer division
indices = np.arange(num_indices)
np.random.shuffle(indices)
tmp = array_to_shuffle.reshape([-1, 3])
tmp = tmp[indices, :]
shuffled_array = tmp.reshape([-1])  # assign the result; reshape does not work in place
