Finding the distance to the next higher value in pandas dataframe - python

I have a data frame containing floating point values
my_df = pd.DataFrame([1,2,1,4,3,2,5,4,7])
For each number, I'm trying to find how many positions I need to move forward until I reach the next number larger than the current one. If there is no larger number, I mark it with some value (like 999999).
So for the example above, the correct answer should be
result = [1,2,1,3,2,1,2,1,999999]
Currently I solve it with a very slow double loop over itertuples (O(n^2)).
Is there a smarter way to do it?

Here's a numpy based one leveraging broadcasting:
a = my_df.squeeze().to_numpy() # use my_df.squeeze().values for pandas versions < 0.24.0
diff_mat = a - a[:,None]
result = (np.triu(diff_mat)>0).argmax(1) - np.arange(diff_mat.shape[1])
result[result <= 0] = 99999
print(result)
array([ 1, 2, 1, 3, 2, 1, 2, 1, 99999],
dtype=int64)
Where diff_mat is the matrix of pairwise differences (diff_mat[i, j] = a[j] - a[i]), and we're looking for the values from the main diagonal onwards that are greater than 0:
array([[ 0, 1, 0, 3, 2, 1, 4, 3, 6],
[-1, 0, -1, 2, 1, 0, 3, 2, 5],
[ 0, 1, 0, 3, 2, 1, 4, 3, 6],
[-3, -2, -3, 0, -1, -2, 1, 0, 3],
[-2, -1, -2, 1, 0, -1, 2, 1, 4],
[-1, 0, -1, 2, 1, 0, 3, 2, 5],
[-4, -3, -4, -1, -2, -3, 0, -1, 2],
[-3, -2, -3, 0, -1, -2, 1, 0, 3],
[-6, -5, -6, -3, -4, -5, -2, -3, 0]], dtype=int64)
We have np.triu for that:
np.triu(diff_mat)
array([[ 0, 1, 0, 3, 2, 1, 4, 3, 6],
[ 0, 0, -1, 2, 1, 0, 3, 2, 5],
[ 0, 0, 0, 3, 2, 1, 4, 3, 6],
[ 0, 0, 0, 0, -1, -2, 1, 0, 3],
[ 0, 0, 0, 0, 0, -1, 2, 1, 4],
[ 0, 0, 0, 0, 0, 0, 3, 2, 5],
[ 0, 0, 0, 0, 0, 0, 0, -1, 2],
[ 0, 0, 0, 0, 0, 0, 0, 0, 3],
[ 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int64)
And by checking which are greater than 0, and taking the argmax of the boolean ndarray we'll find the first value greater than 0 in each row:
(np.triu(diff_mat)>0).argmax(1)
array([1, 3, 3, 6, 6, 6, 8, 8, 0], dtype=int64)
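Finally, we subtract each row's index (the column of its main-diagonal entry) from the argmax to turn the column position into a distance. Rows with no larger value ahead have an all-False mask, so argmax returns 0 and the difference is zero or negative; those entries are replaced with the sentinel value.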
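Putting the steps of this answer together, a minimal self-contained sketch might look like the following. The function name next_larger_distance and the fill parameter are illustrative and not part of the original answer. Note that the difference matrix needs O(n^2) memory, so this is only practical for moderately sized inputs.

```python
import numpy as np

def next_larger_distance(values, fill=999999):
    """Distance to the next strictly larger value, or `fill` if there is none."""
    a = np.asarray(values)
    # diff_mat[i, j] = a[j] - a[i]
    diff_mat = a - a[:, None]
    # Keep only columns j >= i; the diagonal is 0, so it never counts as larger.
    larger_ahead = np.triu(diff_mat) > 0
    # argmax on a boolean row returns the first True column (0 if none is True).
    first_hit = larger_ahead.argmax(axis=1)
    result = first_hit - np.arange(len(a))
    # Rows without a larger value ahead end up <= 0.
    result[result <= 0] = fill
    return result

a = np.array([1, 2, 1, 4, 3, 2, 5, 4, 7])
print(next_larger_distance(a).tolist())
# [1, 2, 1, 3, 2, 1, 2, 1, 999999]
```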

Related

matrix python numpy with positive and negative value

I want to generate a diagonal matrix of size n x n.
This is a Toeplitz matrix; you can use SciPy's linalg.toeplitz to construct such a pattern. You can look at its implementation code, which uses np.lib.stride_tricks.as_strided under the hood.
>>> import numpy as np
>>> from scipy.linalg import toeplitz
>>> toeplitz(-np.arange(3), np.arange(3))
array([[ 0, 1, 2],
[-1, 0, 1],
[-2, -1, 0]])
>>> toeplitz(-np.arange(6), np.arange(6))
array([[ 0, 1, 2, 3, 4, 5],
[-1, 0, 1, 2, 3, 4],
[-2, -1, 0, 1, 2, 3],
[-3, -2, -1, 0, 1, 2],
[-4, -3, -2, -1, 0, 1],
[-5, -4, -3, -2, -1, 0]])
It's quite easy to write as a custom function:
def diagonal(N):
    a = np.arange(N)
    return a-a[:,None]
diagonal(3)
array([[ 0, 1, 2],
[-1, 0, 1],
[-2, -1, 0]])
diagonal(6)
array([[ 0, 1, 2, 3, 4, 5],
[-1, 0, 1, 2, 3, 4],
[-2, -1, 0, 1, 2, 3],
[-3, -2, -1, 0, 1, 2],
[-4, -3, -2, -1, 0, 1],
[-5, -4, -3, -2, -1, 0]])
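To see why the broadcasting version works, here is a small sketch that spells out the shapes involved and checks the result against an explicit double loop. The comparison array named expected is only there for illustration.

```python
import numpy as np

def diagonal(n):
    a = np.arange(n)
    # a has shape (n,) and a[:, None] has shape (n, 1).
    # Broadcasting expands both to (n, n), so entry [i, j] is a[j] - a[i] = j - i.
    return a - a[:, None]

# Reference built with plain Python loops.
expected = np.array([[j - i for j in range(4)] for i in range(4)])
print(np.array_equal(diagonal(4), expected))  # True
```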

List gets changed when Scope gets changed

When I append to a list inside a for loop, its value changes correctly,
but when I print it outside the for loop, its value is different.
arr=[]
b=[1,2,3,4,5,6,7]
for i in range(0,len(b)):
    b[i]=0
    arr.append(b)
    print(arr[i])
Here output is
[0, 2, 3, 4, 5, 6, 7]
[0, 0, 3, 4, 5, 6, 7]
[0, 0, 0, 4, 5, 6, 7]
[0, 0, 0, 0, 5, 6, 7]
[0, 0, 0, 0, 0, 6, 7]
[0, 0, 0, 0, 0, 0, 7]
[0, 0, 0, 0, 0, 0, 0]
And here
arr=[]
b=[1,2,3,4,5,6,7]
for i in range(0,len(b)):
    b[i]=0
    arr.append(b)
print(arr)
Output is
[[0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0]]
On each iteration, you are adding a reference to the same list b to your arr, which means that when you later set new values to zero, you are modifying all of the lists inside arr simultaneously. To avoid this, you can append a copy of b to arr instead by using list(b), i.e.:
arr = []
b = [1, 2, 3, 4, 5, 6, 7]
for i in range(len(b)):
    b[i] = 0
    arr.append(list(b))
print(arr)
This outputs:
[[0, 2, 3, 4, 5, 6, 7],
[0, 0, 3, 4, 5, 6, 7],
[0, 0, 0, 4, 5, 6, 7],
[0, 0, 0, 0, 5, 6, 7],
[0, 0, 0, 0, 0, 6, 7],
[0, 0, 0, 0, 0, 0, 7],
[0, 0, 0, 0, 0, 0, 0]]
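The aliasing described in this answer can be checked directly with the `is` operator. The sketch below uses a shorter list, and the names aliased and copied are made up for the illustration.

```python
b = [1, 2, 3]
aliased = []  # will hold references to the same list object
copied = []   # will hold independent snapshots

for i in range(len(b)):
    b[i] = 0
    aliased.append(b)         # appends a reference to b itself
    copied.append(b.copy())   # appends a shallow copy, like list(b)

print(all(item is b for item in aliased))  # True
print(aliased)  # [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
print(copied)   # [[0, 2, 3], [0, 0, 3], [0, 0, 0]]
```

A shallow copy is enough here because the elements are integers; for nested lists, copy.deepcopy would be the safer choice.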

Is there a way to find the largest change in a pandas dataframe column?

I'm trying to find the largest difference x[i] - x[j] in a series, where j must come after i. Is there an efficient way to do this in pandas?
x = [1, 2, 5, 4, 2, 4, 2, 1, 7]
largest_change = 0
for i in range(len(x)):
    for j in range(i+1, len(x)):
        change = x[i] - x[j]
        print(x[i], x[j], change)
        if change > largest_change:
            largest_change = change
The output would just be the value, in this case 4 from 5 to 1.
Try numpy broadcast with np.triu and max
arr = np.array(x)
np.triu(arr[:,None] - arr)
array([[ 0, -1, -4, -3, -1, -3, -1, 0, -6],
[ 0, 0, -3, -2, 0, -2, 0, 1, -5],
[ 0, 0, 0, 1, 3, 1, 3, 4, -2],
[ 0, 0, 0, 0, 2, 0, 2, 3, -3],
[ 0, 0, 0, 0, 0, -2, 0, 1, -5],
[ 0, 0, 0, 0, 0, 0, 2, 3, -3],
[ 0, 0, 0, 0, 0, 0, 0, 1, -5],
[ 0, 0, 0, 0, 0, 0, 0, 0, -6],
[ 0, 0, 0, 0, 0, 0, 0, 0, 0]])
np.triu(arr[:,None] - arr).max()
Out[758]: 4
Besides Andy's smart method, here is another one that propagates the minimum value backward. Its advantage is linear rather than quadratic time complexity, which matters if you handle a large amount of data.
a = np.flipud(np.array(x))
largest_change = (a - np.minimum.accumulate(a)).max()
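As a sanity check on the linear-time idea, here is a sketch that computes the suffix minimum without an explicit flip of the input and compares the result with the brute-force double loop from the question. The function names largest_drop and largest_drop_bruteforce are illustrative only.

```python
import numpy as np

def largest_drop(x):
    a = np.asarray(x)
    # suffix_min[i] = min(a[i:]), built by a running minimum over the reversed array.
    suffix_min = np.minimum.accumulate(a[::-1])[::-1]
    # a[i] - suffix_min[i] is the largest drop starting at position i (0 if none).
    return int((a - suffix_min).max())

def largest_drop_bruteforce(x):
    best = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            best = max(best, x[i] - x[j])
    return best

x = [1, 2, 5, 4, 2, 4, 2, 1, 7]
print(largest_drop(x), largest_drop_bruteforce(x))  # 4 4
```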
How about this?
x = [1, 2, 5, 4, 2, 4, 2, 1, 7]
largest_change = 0
position = 0
for i in range(len(x)-1):
    change = x[i] - min(x[i+1:])
    if change > largest_change:
        largest_change = change
        position = i
print(x[position], min(x[position+1:]), largest_change)
Why don't you just take the diff then the max of that?
x = [1, 2, 5, 4, 2, 4, 2, 1, 7]
s = pd.Series(x)
z = abs(s.diff())
idx_max_val = z[z==z.max()].index[0]
print(f'Max difference in value ({z.max()}) occurs at the indices of {idx_max_val-1}:{idx_max_val}')
I would suggest rolling window:
import pandas
df = pandas.DataFrame({'col1': [1, 2, 5, 4, 2, 4, 2, 1, 7]})
df["diff"] = df['col1'].rolling(window=2).apply(lambda x: x[1] - x[0], raw=True)
print(df["diff"].max())
Output: 6.0
Or did I misunderstand you and you just want the largest difference between any two values?
This would be:
import pandas
df = pandas.DataFrame({'col1': [1, 2, 5, 4, 2, 4, 2, 1, 7]})
max_diff = df["col1"].max() - df["col1"].min()
print("Min:", df["col1"].min(), "Max:", df["col1"].max(), "Diff:", max_diff)
Output:
Min: 1 Max: 7 Diff: 6

Numpy - even more position-based array modification

I have a large 2-dimensional numpy array, with every value being a 0 or a 1. I'd like to create a function which takes this array as an input, and returns a new array of the same size in which each element is based on the elements above, below and to either side. The returned array should have 0's stay 0, and each 1 will get +1 if there is a 1 to the north, +2 for a 1 to the right, +4 for a 1 below and +8 for a 1 to the left. These all stack, so a 1 surrounded by 1's should end up as a 16. Diagonals do not matter. This might also be faster with explicit bitwise operations (with 4 bits, each bit corresponding to a direction and whether there is a 1 or 0 in it).
I would like this operation to be done as quickly as possible. I played around with a for loop but it is too slow, and I don't understand masking in numpy well enough to use that.
The operation you describe can be expressed as linear convolution followed by zeroing out the spots that were zero before:
>>> import numpy as np
>>> from scipy import signal
>>>
>>> kernel = np.array([[0,1,0], [8,1,2], [0,4,0]])[::-1, ::-1]
>>>
>>> pattern = np.random.randint(0, 2, (10, 10))
>>>
>>> pattern
array([[0, 1, 1, 1, 1, 1, 1, 0, 1, 1],
[1, 0, 1, 0, 1, 0, 0, 0, 0, 0],
[0, 1, 0, 1, 1, 1, 0, 1, 0, 1],
[1, 0, 1, 0, 0, 1, 1, 0, 0, 0],
[0, 1, 0, 1, 1, 1, 1, 1, 1, 1],
[0, 1, 0, 0, 0, 1, 0, 1, 1, 0],
[0, 0, 0, 1, 1, 0, 1, 0, 1, 0],
[0, 1, 0, 0, 0, 0, 1, 0, 1, 0],
[0, 1, 1, 1, 0, 1, 0, 1, 1, 0],
[0, 1, 0, 1, 1, 1, 0, 1, 0, 0]])
>>>
>>> pattern * signal.convolve(pattern, kernel, 'same')
array([[ 0, 3, 15, 11, 15, 11, 9, 0, 3, 9],
[ 1, 0, 2, 0, 6, 0, 0, 0, 0, 0],
[ 0, 1, 0, 3, 12, 13, 0, 1, 0, 1],
[ 1, 0, 1, 0, 0, 8, 13, 0, 0, 0],
[ 0, 5, 0, 3, 11, 16, 12, 15, 15, 9],
[ 0, 2, 0, 0, 0, 2, 0, 4, 14, 0],
[ 0, 0, 0, 3, 9, 0, 5, 0, 6, 0],
[ 0, 5, 0, 0, 0, 0, 2, 0, 6, 0],
[ 0, 8, 11, 13, 0, 5, 0, 7, 10, 0],
[ 0, 2, 0, 4, 11, 10, 0, 2, 0, 0]])
I hope this can help. I start by copying the original matrix, then add the contribution from each direction. For example, to add the contribution of the elements on the right, every column but the last one may be modified, so I can write result[:,:-1] += 2 * m[:,1:]. The final multiplication by m ensures that only cells whose starting value was one, not zero, keep a nonzero result, as you required.
import numpy as np
def f(m):
    result = np.copy(m)
    # propagation from the four directions
    result[1:,:] += m[:-1,:] # north
    result[:,:-1] += 2 * m[:,1:] # east
    result[:-1,:] += 4 * m[1:,:] # south
    result[:,1:] += 8 * m[:,:-1] # west
    return result * m
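To check the slicing approach, the sketch below compares it with a straightforward double loop on a random 0/1 pattern and prints the result for a 3x3 block of ones. The reference function f_loop and the fixed random seed are assumptions added for the illustration.

```python
import numpy as np

def f(m):
    result = np.copy(m)
    result[1:, :] += m[:-1, :]       # +1 for a 1 to the north
    result[:, :-1] += 2 * m[:, 1:]   # +2 for a 1 to the east
    result[:-1, :] += 4 * m[1:, :]   # +4 for a 1 to the south
    result[:, 1:] += 8 * m[:, :-1]   # +8 for a 1 to the west
    return result * m                # cells that were 0 stay 0

def f_loop(m):
    # Slow reference implementation, one cell at a time.
    rows, cols = m.shape
    out = np.zeros_like(m)
    for r in range(rows):
        for c in range(cols):
            if m[r, c] == 0:
                continue
            total = 1
            if r > 0 and m[r - 1, c]:
                total += 1
            if c < cols - 1 and m[r, c + 1]:
                total += 2
            if r < rows - 1 and m[r + 1, c]:
                total += 4
            if c > 0 and m[r, c - 1]:
                total += 8
            out[r, c] = total
    return out

rng = np.random.default_rng(0)
pattern = rng.integers(0, 2, size=(6, 7))
print(np.array_equal(f(pattern), f_loop(pattern)))  # True
print(f(np.ones((3, 3), dtype=int)).tolist())
# [[7, 15, 13], [8, 16, 14], [4, 12, 10]]
```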

Label regions with unique combinations of values in two numpy arrays?

I have two labelled 2D numpy arrays a and b with identical shapes. I would like to re-label the array b by something similar to a GIS geometric union of the two arrays, such that cells with unique combination of values in array a and b are assigned new unique IDs:
I'm not concerned with the specific numbering of the regions in the output, so long as the values are all unique. I have attached sample arrays and desired outputs below: my real datasets are much larger, with both arrays having integer labels which range from "1" to "200000". So far I've experimented with concatenating the array IDs to form unique combinations of values, but ideally I would like to output a simple set of new IDs in the form of 1, 2, 3..., etc.
import numpy as np
import matplotlib.pyplot as plt
# Example labelled arrays a and b
input_a = np.array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 0],
[0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 0],
[0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 0],
[0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 0],
[0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 0],
[0, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 0],
[0, 0, 3, 3, 3, 3, 2, 2, 2, 2, 0, 0],
[0, 0, 3, 3, 3, 3, 2, 2, 2, 2, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
input_b = np.array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 1, 1, 1, 3, 3, 3, 3, 3, 0, 0],
[0, 0, 1, 1, 1, 3, 3, 3, 3, 3, 0, 0],
[0, 0, 1, 1, 1, 2, 2, 2, 2, 2, 0, 0],
[0, 0, 1, 1, 1, 2, 2, 2, 2, 2, 0, 0],
[0, 0, 1, 1, 1, 2, 2, 2, 2, 2, 0, 0],
[0, 0, 1, 1, 1, 2, 2, 2, 2, 2, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
# Plot inputs
plt.imshow(input_a, cmap="nipy_spectral", interpolation='nearest') # "spectral" was removed in newer matplotlib
plt.imshow(input_b, cmap="nipy_spectral", interpolation='nearest')
# Desired output, union of a and b
output = np.array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 1, 1, 1, 2, 3, 3, 3, 3, 0, 0],
[0, 0, 1, 1, 1, 2, 3, 3, 3, 3, 0, 0],
[0, 0, 1, 1, 1, 4, 7, 7, 7, 7, 0, 0],
[0, 0, 5, 5, 5, 6, 7, 7, 7, 7, 0, 0],
[0, 0, 5, 5, 5, 6, 7, 7, 7, 7, 0, 0],
[0, 0, 5, 5, 5, 6, 7, 7, 7, 7, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
# Plot desired output
plt.imshow(output, cmap="nipy_spectral", interpolation='nearest') # "spectral" was removed in newer matplotlib
If I understood the circumstances correctly, you are looking to have unique pairings from a and b. So, 1 from a and 1 from b would have one unique tag in the output; 1 from a and 3 from b would have another unique tag in the output. Also looking at the desired output in the question, it seems that there is an additional conditional situation here that if b is zero, the output is to be zero as well irrespective of the unique pairings.
The following implementation tries to solve all of that -
c = a*(b.max()+1) + b
c[b==0] = 0
_,idx = np.unique(c,return_inverse= True)
out = idx.reshape(b.shape)
Sample run -
In [21]: a
Out[21]:
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 0],
[0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 0],
[0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 0],
[0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 0],
[0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 0],
[0, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 0],
[0, 0, 3, 3, 3, 3, 2, 2, 2, 2, 0, 0],
[0, 0, 3, 3, 3, 3, 2, 2, 2, 2, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
In [22]: b
Out[22]:
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 1, 1, 1, 3, 3, 3, 3, 3, 0, 0],
[0, 0, 1, 1, 1, 3, 3, 3, 3, 3, 0, 0],
[0, 0, 1, 1, 1, 2, 2, 2, 2, 2, 0, 0],
[0, 0, 1, 1, 1, 2, 2, 2, 2, 2, 0, 0],
[0, 0, 1, 1, 1, 2, 2, 2, 2, 2, 0, 0],
[0, 0, 1, 1, 1, 2, 2, 2, 2, 2, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
In [23]: out
Out[23]:
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 1, 1, 1, 3, 5, 5, 5, 5, 0, 0],
[0, 0, 1, 1, 1, 3, 5, 5, 5, 5, 0, 0],
[0, 0, 1, 1, 1, 2, 4, 4, 4, 4, 0, 0],
[0, 0, 6, 6, 6, 7, 4, 4, 4, 4, 0, 0],
[0, 0, 6, 6, 6, 7, 4, 4, 4, 4, 0, 0],
[0, 0, 6, 6, 6, 7, 4, 4, 4, 4, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
Sample plot -
# Plot inputs
plt.figure()
plt.imshow(a, cmap="nipy_spectral", interpolation='nearest') # "spectral" was removed in newer matplotlib
plt.figure()
plt.imshow(b, cmap="nipy_spectral", interpolation='nearest')
# Plot output
plt.figure()
plt.imshow(out, cmap="nipy_spectral", interpolation='nearest')
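The pairing trick from this answer can be wrapped in a small function and tried on a tiny input. This is only a sketch: the name label_pairs is made up, and the cast to int64 is an added assumption to avoid overflow when labels run up to 200000, as in the question.

```python
import numpy as np

def label_pairs(a, b):
    # Cast to int64 (assumption): 200000 * 200001 does not fit in int32.
    a = np.asarray(a, dtype=np.int64)
    b = np.asarray(b, dtype=np.int64)
    # Encode each (a, b) pair as a single integer; pairs with b != 0 give c >= 1.
    c = a * (b.max() + 1) + b
    # Wherever b is 0 the output should be 0, regardless of a.
    c[b == 0] = 0
    # Replace each distinct code by its rank: 0, 1, 2, ...
    _, idx = np.unique(c, return_inverse=True)
    return idx.reshape(b.shape)

a = np.array([[0, 1, 1],
              [2, 2, 1]])
b = np.array([[0, 1, 2],
              [1, 0, 2]])
print(label_pairs(a, b).tolist())  # [[0, 1, 2], [3, 0, 2]]
```

The background stays at 0 only if b actually contains a zero; otherwise the smallest pair code is ranked 0.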
Here is a way to do it conceptually in terms of a set union, though not a GIS geometric union, since that was mentioned after I answered.
Make a list of all possible unique 2-tuples of values with one from a and the other from b in that order. Map each tuple in that list to its index in it. Create the union array using that map.
For example say a and b are arrays each containing values in range(4) and assume for simplicity they have the same shape. Then:
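v = range(4)
from itertools import permutations
p = list(permutations(v,2))
# Note: permutations omits pairs of equal values such as (1, 1), so the lookup
# below raises KeyError where a and b hold the same value in the same cell;
# itertools.product(v, repeat=2) would cover those pairs too.
m = {}
for i,x in enumerate(p):
    m[x] = i
union = np.empty_like(a)
for i,x in np.ndenumerate(a):
    union[i] = m[(x,b[i])]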
For demonstration, generating a and b with
np.random.randint(4, size=(3, 3))
produced:
a = array([[3, 0, 3],
[1, 3, 2],
[0, 0, 3]])
b = array([[1, 3, 1],
[0, 0, 1],
[2, 3, 0]])
m = {(0, 1): 0,
(0, 2): 1,
(0, 3): 2,
(1, 0): 3,
(1, 2): 4,
(1, 3): 5,
(2, 0): 6,
(2, 1): 7,
(2, 3): 8,
(3, 0): 9,
(3, 1): 10,
(3, 2): 11}
union = array([[10, 2, 10],
[ 3, 9, 7],
[ 1, 2, 9]])
In this case the property that a union should be bigger than or equal to its components is reflected in increased numerical values rather than in an increased number of elements.
An issue with using itertools permutations is that the number of permutations could be much larger than needed. It would be much larger if the number of overlaps per area is much smaller than the number of areas.
The question uses Union but the picture shows an Intersection. Divakar's answer replicates the pictured Intersection, and is more elegant than my solution below, which produces the Union.
One could make a dictionary of only the actual overlaps, and then work from that. Flattening the input arrays first makes this easier for me to see, I'm not sure if that is feasible for you:
shp = np.shape(input_a)
a = input_a.flatten()
b = input_b.flatten()
s = set(((i,j) for i,j in zip(a,b))) # unique pairings
d = {p:i for i,p in enumerate(sorted(list(s)))} # dict{pair:index}
output_c = np.array([d[i,j] for i,j in zip(a,b)]).reshape(shp)
array([[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[ 0, 1, 1, 1, 1, 1, 5, 5, 5, 5, 5, 0],
[ 0, 1, 1, 1, 1, 1, 5, 5, 5, 5, 5, 0],
[ 0, 1, 2, 2, 2, 4, 7, 7, 7, 7, 5, 0],
[ 0, 1, 2, 2, 2, 4, 7, 7, 7, 7, 5, 0],
[ 0, 1, 2, 2, 2, 3, 6, 6, 6, 6, 5, 0],
[ 0, 8, 9, 9, 9, 10, 6, 6, 6, 6, 5, 0],
[ 0, 0, 9, 9, 9, 10, 6, 6, 6, 6, 0, 0],
[ 0, 0, 9, 9, 9, 10, 6, 6, 6, 6, 0, 0],
[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
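For large arrays, the dictionary lookup above could probably be replaced by a vectorized variant. The sketch below is not the answer's method: it uses np.unique with axis=0 on the stacked pairs, which sorts the pairs lexicographically and so should assign the same indices as the sorted dictionary. The name label_all_pairs is made up for the illustration.

```python
import numpy as np

def label_all_pairs(a, b):
    # One row per cell: (value in a, value in b).
    pairs = np.stack([a.ravel(), b.ravel()], axis=1)
    # Unique rows come back sorted; inverse maps each cell to its row's rank.
    _, inverse = np.unique(pairs, axis=0, return_inverse=True)
    return inverse.reshape(a.shape)

a = np.array([[0, 1],
              [1, 2]])
b = np.array([[0, 1],
              [0, 1]])
print(label_all_pairs(a, b).tolist())  # [[0, 2], [1, 3]]

# Cross-check against the dictionary approach on the same input.
flat = list(zip(a.ravel().tolist(), b.ravel().tolist()))
d = {p: i for i, p in enumerate(sorted(set(flat)))}
via_dict = np.array([d[p] for p in flat]).reshape(a.shape)
print(np.array_equal(label_all_pairs(a, b), via_dict))  # True
```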
