Replace duplicate elements in a numpy array based on a range - python

I have a 1-D NumPy array arr as below:
arr = np.array([9, 7, 0, 4, 7, 4, 2, 2, 3, 7])
For the duplicate elements, I wish to randomly select any one of the indices containing the same element and replace it with a value that is missing from the range between 0 and arr.shape[0].
For example, in the given array, 7 is present at indices 1, 4 and 9. Hence, I wish to randomly select an index among 1, 4 and 9 and set its value to a randomly selected element such as 8, which is absent from the array. In the end, arr should contain arr.shape[0] unique elements lying between 0 and arr.shape[0] - 1 (both inclusive).
How can I do this efficiently using Numpy (possibly without needing to use any explicit loop)?

Here's one based on np.isin -
def create_uniques(arr):
    # Get the unique values and their respective counts
    unq, c = np.unique(arr, return_counts=True)
    # Mask the positions in arr whose value occurs more than once,
    # i.e. the positions holding duplicates
    m = np.isin(arr, unq[c > 1])
    # Collect the values not held by the non-duplicate positions and shuffle them
    newvals = np.setdiff1d(np.arange(len(arr)), arr[~m])
    np.random.shuffle(newvals)
    # Assign the shuffled values to the duplicate positions (modifies arr in place)
    arr[m] = newvals
    return arr
Sample runs -
In [53]: arr = np.array([9, 7, 0, 4, 7, 4, 2, 2, 3, 7])
In [54]: create_uniques(arr)
Out[54]: array([9, 7, 0, 1, 6, 4, 8, 2, 3, 5])
In [55]: arr = np.array([9, 7, 0, 4, 7, 4, 2, 2, 3, 7])
In [56]: create_uniques(arr)
Out[56]: array([9, 4, 0, 5, 6, 2, 7, 1, 3, 8])
In [57]: arr = np.array([9, 7, 0, 4, 7, 4, 2, 2, 3, 7])
In [58]: create_uniques(arr)
Out[58]: array([9, 4, 0, 1, 7, 2, 6, 8, 3, 5])
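If the goal is read as "keep one randomly chosen occurrence of each duplicated value in place and overwrite only the others", a variant along the following lines might work. This is a minimal sketch, not the answerer's method: the name make_unique and the rng parameter are hypothetical, and it assumes every value of arr already lies in 0..len(arr)-1.

```python
import numpy as np

def make_unique(arr, rng=None):
    """Return a copy of arr in which each value of 0..n-1 appears once.

    Assumes (hypothetically) that all values of arr lie in 0..n-1.
    """
    rng = np.random.default_rng() if rng is None else rng
    out = np.asarray(arr).copy()
    n = out.size
    # Visit the positions in random order, so that the "first occurrence"
    # found below is a uniformly random one among each value's occurrences
    perm = rng.permutation(n)
    _, first = np.unique(out[perm], return_index=True)
    keep = np.zeros(n, dtype=bool)
    keep[perm[first]] = True
    # Values of 0..n-1 that never appear, handed out in random order
    missing = np.setdiff1d(np.arange(n), out)
    out[~keep] = rng.permutation(missing)
    return out

arr = np.array([9, 7, 0, 4, 7, 4, 2, 2, 3, 7])
result = make_unique(arr, np.random.default_rng(0))
print(result)
```

Unlike create_uniques above, this works on a copy, so the input array is left untouched, and exactly one 7, one 4 and one 2 stay where they were.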

Extending Divakar's answer (I have basically no experience in Python, so this is likely a very roundabout and unpythonic way of doing it):
import numpy as np
def create_uniques(arr):
    np.random.seed()
    indices = []
    for i, x in enumerate(arr):
        indices.append([arr[i], [j for j, y in enumerate(arr) if y == arr[i]]])
        indices[i].append(np.random.choice(indices[i][1]))
        indices[i][1].remove(indices[i][2])
    sidx = arr.argsort()
    b = arr[sidx]
    new_vals = np.setdiff1d(np.arange(len(arr)), arr)
    arr[sidx[1:][b[:-1] == b[1:]]] = new_vals
    for i, x in enumerate(arr):
        if x == indices[i][0] and i != indices[i][2]:
            arr[i] = arr[indices[i][2]]
            arr[indices[i][2]] = x
    return arr
Sample:
arr = np.array([9, 7, 0, 4, 7, 4, 2, 2, 3, 7])
print(arr)
print(create_uniques(arr))
arr = np.array([9, 7, 0, 4, 7, 4, 2, 2, 3, 7])
print(create_uniques(arr))
arr = np.array([9, 7, 0, 4, 7, 4, 2, 2, 3, 7])
print(create_uniques(arr))
arr = np.array([9, 7, 0, 4, 7, 4, 2, 2, 3, 7])
print(create_uniques(arr))
Outputs:
[9 7 0 4 7 4 2 2 3 7]
[9 7 0 4 6 5 2 1 3 8]
[9 8 0 4 6 5 1 2 3 7]
[9 8 0 4 6 5 2 1 3 7]
[9 7 0 5 6 4 2 1 3 8]

Related

How to ravel numpy array in 's' path?

Given a 2D np.array:
arr = np.array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
How do I ravel it in an s-path such that I get
>>> sravel(arr)
array([1, 2, 3, 6, 5, 4, 7, 8, 9])
Additionally, I would like the option of going down the 0-axis first as well, i.e.
>>> sravel(arr, [0,1])
array([1, 4, 7, 8, 5, 2, 3, 6, 9])
Here the second argument indicates the order of the axes.
I don't think there is any direct way to do that, but it's not hard to get that result:
import numpy as np
arr = np.array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
arr2 = arr.copy()
arr2[1::2] = np.flip(arr[1::2], 1)
print(arr2.ravel())
# [1 2 3 6 5 4 7 8 9]
arr3 = arr.T.copy()
arr3[1::2] = np.flip(arr.T[1::2], 1)
print(arr3.ravel())
# [1 4 7 8 5 2 3 6 9]
EDIT: As pointed out by scleronomic, the second case can also be done by means of an F-contiguous array:
import numpy as np
arr = np.array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
# The array is copied with F order so ravel does not require another copy later
arr3 = arr.copy(order='F')
arr3[:, 1::2] = np.flip(arr3[:, 1::2], 0)
print(arr3.ravel(order='F'))
# [1 4 7 8 5 2 3 6 9]
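The two cases above could be wrapped into the sravel function the question asks for. The following is only a sketch for 2-D input: the parameter name axis_order and its default are assumptions, chosen so that sravel(arr) and sravel(arr, [0, 1]) match the outputs requested in the question.

```python
import numpy as np

def sravel(arr, axis_order=(1, 0)):
    """Ravel a 2-D array along a snake ("s") path.

    axis_order=(1, 0) walks along axis 1 first (row by row);
    axis_order=(0, 1) walks down axis 0 first (column by column).
    """
    a = np.asarray(arr)
    if a.ndim != 2:
        raise ValueError("sravel expects a 2-D array")
    order = tuple(axis_order)
    if order == (0, 1):
        a = a.T                      # columns become rows
    elif order != (1, 0):
        raise ValueError("axis_order must be (1, 0) or (0, 1)")
    out = a.copy()                   # C-ordered copy, input stays untouched
    out[1::2] = a[1::2, ::-1]        # reverse every second row
    return out.ravel()

arr = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])
print(sravel(arr))          # [1 2 3 6 5 4 7 8 9]
print(sravel(arr, [0, 1]))  # [1 4 7 8 5 2 3 6 9]
```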

How do I shift col of numpy matrix to last col?

Say I have a numpy matrix as such:
[[1, 3, 4, 7, 8]
[5, 6, 8, 2, 6]
[2, 9, 3, 3, 6]
[7, 1, 9, 3, 5]]
I want to shift column 2 of the matrix to the last column:
[[1, 4, 7, 8, 3]
[5, 8, 2, 6, 6]
[2, 3, 3, 6, 9]
[7, 9, 3, 5, 1]]
How exactly do I do this?
Use numpy.roll:
arr[:, 1:] = np.roll(arr[:, 1:], -1, 1)
Output:
array([[1, 4, 7, 8, 3],
[5, 8, 2, 6, 6],
[2, 3, 3, 6, 9],
[7, 9, 3, 5, 1]])
How:
np.roll takes three arguments: a, shift and axis:
np.roll(a = arr[:, 1:], shift = -1, axis = 1)
This means: take arr[:, 1:] (all rows, all columns from index 1 onward) and shift it one unit to the left (-1; a shift to the right would be +1) along axis 1 (i.e. a column-wise shift; axis 0 would be a row-wise shift).
np.roll, as the name suggests, is a circular shift. A shift of +1 moves the last column to the front, while the shift of -1 used here moves the first column of the slice to the end.
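To make the wrap-around concrete, here is a small sketch on a 1-D array, followed by the column shift from the question applied to a copy (the names left, right and shifted are only for illustration):

```python
import numpy as np

a = np.arange(5)                 # [0 1 2 3 4]
left = np.roll(a, -1)            # first element wraps around to the end
right = np.roll(a, 1)            # last element wraps around to the front
print(left)                      # [1 2 3 4 0]
print(right)                     # [4 0 1 2 3]

arr = np.array([[1, 3, 4, 7, 8],
                [5, 6, 8, 2, 6],
                [2, 9, 3, 3, 6],
                [7, 1, 9, 3, 5]])
# Roll only the columns from index 1 onward; column 0 stays where it is
shifted = arr.copy()
shifted[:, 1:] = np.roll(arr[:, 1:], -1, axis=1)
print(shifted)
```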
Create a list of columns, then use that to index the array. Here, new_column_order uses a range to get all columns before col, another range to get all columns after col, then puts col at the end. Each range object is unpacked via * into the new column list.
x = np.array([[1, 3, 4, 7, 8],
[5, 6, 8, 2, 6],
[2, 9, 3, 3, 6],
[7, 1, 9, 3, 5]])
col = 1 # 2nd column
new_column_order = [*range(col), *range(col + 1, x.shape[-1]), col]
x_new = x[:, new_column_order]
print(x_new)
Output:
[[1 4 7 8 3]
[5 8 2 6 6]
[2 3 3 6 9]
[7 9 3 5 1]]

How can I get the index of an element of a diagonal in a matrix?

To explain further, I will give an example. I have a 8x8 grid made up of random numbers,
m = [
[1, 5, 2, 8, 6, 9, 6, 8],
[2, 2, 2, 2, 8, 2, 2, 1],
[9, 5, 9, 6, 8, 2, 7, 2],
[2, 8, 8, 6, 4, 1, 8, 1],
[2, 5, 5, 5, 4, 4, 7, 9],
[3, 9, 8, 8, 9, 4, 1, 1],
[8, 9, 2, 4, 2, 8, 4, 3],
[4, 4, 7, 8, 7, 5, 3, 6]]
I have written code that gives me the list of the diagonal given an x and y value. For example, if an x of 2 and a y of 3 is given, the diagonal [2,5,8,5,9,8,3] will be returned. This is the code:
def main():
    m = [[1, 5, 2, 8, 6, 9, 6, 8],
         [2, 2, 2, 2, 8, 2, 2, 1],
         [9, 5, 9, 6, 8, 2, 7, 2],
         [2, 8, 8, 6, 4, 1, 8, 1],
         [2, 5, 5, 5, 4, 4, 7, 9],
         [3, 9, 8, 8, 9, 4, 1, 1],
         [8, 9, 2, 4, 2, 8, 4, 3],
         [4, 4, 7, 8, 7, 5, 3, 6]]
    x = 2
    y = 3
    for i in m:
        print(i)
    # diagonal() is a generator, so materialize it before printing
    print(list(diagonal(m, x, y)))

def diagonal(m, x, y):
    # row where the diagonal through (x, y) starts
    row = max((y - x, 0))
    # column where the diagonal through (x, y) starts
    col = max((x - y, 0))
    while row < len(m) and col < len(m[row]):
        yield m[row][col]
        row += 1
        col += 1

main()
My question is: how can I get the index of the given element in the diagonal list? In the example, the coordinates are x=2 and y=3 (which is the number 8), and the resulting diagonal is [2,5,8,5,9,8,3], so the index of the element is 2. Also, I cannot use numpy.
First, the case where x < y:
if x < y:
    row = y - x
    idx = y - row
This simplifies to idx = x, and by symmetry the general case is
idx = min(x, y)
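The formula can be checked against the question's own generator. The sketch below uses only plain Python (no numpy); the helper name diagonal_index is hypothetical.

```python
def diagonal(m, x, y):
    # Same generator as in the question: start where the diagonal
    # through (x, y) meets the top or left edge of the grid
    row = max(y - x, 0)
    col = max(x - y, 0)
    while row < len(m) and col < len(m[row]):
        yield m[row][col]
        row += 1
        col += 1

def diagonal_index(x, y):
    # Position of the element at column x, row y within its diagonal
    return min(x, y)

m = [[1, 5, 2, 8, 6, 9, 6, 8],
     [2, 2, 2, 2, 8, 2, 2, 1],
     [9, 5, 9, 6, 8, 2, 7, 2],
     [2, 8, 8, 6, 4, 1, 8, 1],
     [2, 5, 5, 5, 4, 4, 7, 9],
     [3, 9, 8, 8, 9, 4, 1, 1],
     [8, 9, 2, 4, 2, 8, 4, 3],
     [4, 4, 7, 8, 7, 5, 3, 6]]

x, y = 2, 3
diag = list(diagonal(m, x, y))
idx = diagonal_index(x, y)
print(diag, idx, diag[idx])  # [2, 5, 8, 5, 9, 8, 3] 2 8
```

Because the index comes from the coordinates rather than from the value, it stays correct even when the same number appears more than once on the diagonal.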
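You can get the index of an element in a list by using list.index(element). Note that it returns the index of the first match, so it only gives the right answer when the same value does not appear earlier in the diagonal.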
For example:
diagonal = [2,5,8,5,9,8,3]
theIndex = diagonal.index(8)
print(theIndex)
I hope this helps. Good luck!
I would suggest changing your function (or making a variant) to yield a tuple with the coordinates and the number instead of just the number (similar to what enumerate() does). It will then be easier to map these to numbers and to find the coordinates of numbers afterward.
In other words, if you:
yield (row, col, m[row][col])
you will be able to obtain just the numbers with:
numbers = [num for row, col, num in diagonal(m, 2, 3)]
but you will also be able to manipulate the coordinates when you need to.
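A possible version of that variant, as a sketch: the name diagonal_with_coords is hypothetical, and the lookup at the end is just one way to use the coordinates to find the position of the requested cell.

```python
def diagonal_with_coords(m, x, y):
    # Like the question's diagonal(), but yields (row, col, value) tuples
    row = max(y - x, 0)
    col = max(x - y, 0)
    while row < len(m) and col < len(m[row]):
        yield (row, col, m[row][col])
        row += 1
        col += 1

m = [[1, 5, 2, 8, 6, 9, 6, 8],
     [2, 2, 2, 2, 8, 2, 2, 1],
     [9, 5, 9, 6, 8, 2, 7, 2],
     [2, 8, 8, 6, 4, 1, 8, 1],
     [2, 5, 5, 5, 4, 4, 7, 9],
     [3, 9, 8, 8, 9, 4, 1, 1],
     [8, 9, 2, 4, 2, 8, 4, 3],
     [4, 4, 7, 8, 7, 5, 3, 6]]

x, y = 2, 3
entries = list(diagonal_with_coords(m, x, y))
numbers = [num for _, _, num in entries]
# Find the entry whose coordinates match the requested cell (row y, column x)
index = next(i for i, (row, col, _) in enumerate(entries) if (row, col) == (y, x))
print(numbers, index)  # [2, 5, 8, 5, 9, 8, 3] 2
```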

Groupby and reduce pandas dataframes with numpy arrays as entries

I have a pandas.DataFrame with the following structure:
>>> data
a b values
1 0 [1, 2, 3, 4]
2 0 [3, 4, 5, 6]
1 1 [1, 3, 7, 9]
2 1 [2, 4, 6, 8]
('values' has the type of numpy.array). What I want to do is to group the data by column 'a' and then combine the list of values.
My goal is to end up with the following:
>>> data
a values
1 [1, 2, 3, 4, 1, 3, 7, 9]
2 [3, 4, 5, 6, 2, 4, 6, 8]
Note that the order of the values does not matter. How do I achieve this? I thought about something like
>>> grps = data.groupby(['a'])
>>> grps['values'].agg(np.concatenate)
but this fails with a KeyError. I'm sure there is a pandaic way to achieve this - but how?
Thanks.
Similar to John Galt's answer, you can group and then apply np.hstack:
In [278]: df.groupby('a')['values'].apply(np.hstack)
Out[278]:
a
1 [1, 2, 3, 4, 1, 3, 7, 9]
2 [3, 4, 5, 6, 2, 4, 6, 8]
Name: values, dtype: object
To get back your frame, you'll need pd.Series.to_frame and pd.DataFrame.reset_index:
In [311]: df.groupby('a')['values'].apply(np.hstack).to_frame().reset_index()
Out[311]:
a values
0 1 [1, 2, 3, 4, 1, 3, 7, 9]
1 2 [3, 4, 5, 6, 2, 4, 6, 8]
Performance
df_test = pd.concat([df] * 10000) # setup
%timeit df_test.groupby('a')['values'].apply(np.hstack) # mine
1 loop, best of 3: 219 ms per loop
%timeit df_test.groupby('a')['values'].sum() # John's
1 loop, best of 3: 4.44 s per loop
sum is very inefficient for lists, and it does not give the concatenated result when the values are np.array objects, since adding arrays is element-wise.
You can use sum to join lists.
In [640]: data.groupby('a')['values'].sum()
Out[640]:
a
1 [1, 2, 3, 4, 1, 3, 7, 9]
2 [3, 4, 5, 6, 2, 4, 6, 8]
Name: values, dtype: object
Or,
In [653]: data.groupby('a', as_index=False).agg({'values': 'sum'})
Out[653]:
a values
0 1 [1, 2, 3, 4, 1, 3, 7, 9]
1 2 [3, 4, 5, 6, 2, 4, 6, 8]
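For comparison, the grouping can also be spelled out with an explicit loop over the groups, which makes the concatenation step visible. This is only a sketch, not taken from either answer: the frame is rebuilt from the question's table, and the name combined is hypothetical.

```python
import numpy as np
import pandas as pd

# Rebuild the frame from the question; 'values' holds one array per row
data = pd.DataFrame({
    "a": [1, 2, 1, 2],
    "b": [0, 0, 1, 1],
    "values": [np.array([1, 2, 3, 4]), np.array([3, 4, 5, 6]),
               np.array([1, 3, 7, 9]), np.array([2, 4, 6, 8])],
})

# Iterating over the grouped column yields (key, Series) pairs;
# each Series is turned into a list of arrays and concatenated
combined = {
    key: np.concatenate(group.to_list())
    for key, group in data.groupby("a")["values"]
}
for key, values in combined.items():
    print(key, values)
```

Passing a plain list of arrays to np.concatenate avoids any dependence on how the group's index labels are interpreted.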

Duplicating specific elements in lists or Numpy arrays

I work with large data sets in my research.
I need to duplicate an element in a Numpy array. The code below achieves this, but is there a function in Numpy that performs the operation in a more efficient manner?
"""
Example output
>>> (executing file "example.py")
Choose a number between 1 and 10:
2
Choose number of repetitions:
9
Your output array is:
[1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 4, 5, 6, 7, 8, 9, 10]
>>>
"""
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = int(input('Choose the number you want to repeat (1-10):\n'))
repetitions = int(input('Choose number of repetitions:\n'))
output = []
for i in range(len(x)):
    if x[i] != y:
        output.append(x[i])
    else:
        for j in range(repetitions):
            output.append(x[i])
print('Your output array is:\n', output)
One approach would be to find the index of the element to be repeated with np.searchsorted (this assumes x is sorted, which np.searchsorted requires). Use that index to slice the left and right sides of the array and insert the repeated array in between.
Thus, one solution would be -
idx = np.searchsorted(x,y)
out = np.concatenate(( x[:idx], np.repeat(y, repetitions), x[idx+1:] ))
Let's consider a bit more generic sample case with x as -
x = [2, 4, 5, 6, 7, 8, 9, 10]
Let the number to be repeated be y = 5, with repetitions = 7.
Now, use the proposed code -
In [57]: idx = np.searchsorted(x,y)
In [58]: idx
Out[58]: 2
In [59]: np.concatenate(( x[:idx], np.repeat(y, repetitions), x[idx+1:] ))
Out[59]: array([ 2, 4, 5, 5, 5, 5, 5, 5, 5, 6, 7, 8, 9, 10])
For the specific case of x always being [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], we would have a more compact/elegant solution, like so -
np.r_[x[:y-1], [y]*repetitions, x[y:]]
There is the numpy.repeat function:
>>> np.repeat(3, 4)
array([3, 3, 3, 3])
>>> x = np.array([[1,2],[3,4]])
>>> np.repeat(x, 2)
array([1, 1, 2, 2, 3, 3, 4, 4])
>>> np.repeat(x, 3, axis=1)
array([[1, 1, 1, 2, 2, 2],
[3, 3, 3, 4, 4, 4]])
>>> np.repeat(x, [1, 2], axis=0)
array([[1, 2],
[3, 4],
[3, 4]])
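Applied to the question, np.repeat can take one count per element, so the chosen number gets the requested count and everything else gets 1. A minimal sketch, with the inputs hard-coded instead of read via input():

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = 2              # number to repeat
repetitions = 9    # how many times it should appear

# One repeat count per element: `repetitions` where x equals y, else 1
counts = np.where(x == y, repetitions, 1)
output = np.repeat(x, counts)
print(output)
# [ 1  2  2  2  2  2  2  2  2  2  3  4  5  6  7  8  9 10]
```

This does not require x to be sorted, and it repeats every occurrence of y if the value appears more than once.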
