Replacing chunks of elements in numpy array - python

I have an np.array like this one:
x = [1,2,3,4,5,6,7,8,9,10 ... N]. I need to replace the first n chunks with a certain element, like so:
for i in np.arange(0, 125):
    x[i] = x[0]
for i in np.arange(125, 250):
    x[i] = x[125]
for i in np.arange(250, 375):
    x[i] = x[250]
This is obviously not the way to go, but I wrote it this way so I can show you what I need to achieve.

One way would be -
In [47]: x
Out[47]: array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21])
In [49]: n = 5
In [50]: x[::n][np.arange(len(x))//n]
Out[50]: array([10, 10, 10, 10, 10, 15, 15, 15, 15, 15, 20, 20])
Another with np.repeat -
In [67]: np.repeat(x[::n], n)[:len(x)]
Out[67]: array([10, 10, 10, 10, 10, 15, 15, 15, 15, 15, 20, 20])
For an in-place edit, we can reshape and assign in a broadcasted manner, like so -
m = (len(x)-1)//n
x[:n*m].reshape(-1,n)[:] = x[:n*m:n,None]
x[n*m:] = x[n*m]
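A quick check of the in-place variant on the same x and n = 5 as above (an illustrative session continuing the example):
In [68]: x = np.array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21])
In [69]: m = (len(x)-1)//n
In [70]: x[:n*m].reshape(-1,n)[:] = x[:n*m:n,None]
In [71]: x[n*m:] = x[n*m]
In [72]: x
Out[72]: array([10, 10, 10, 10, 10, 15, 15, 15, 15, 15, 20, 20])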

import numpy as np
x = np.arange(0,1000)
a = x[0]
b = x[125]
c = x[250]
x[0:125] = a
x[125:250] = b
x[250:375] = c
No need to write loops; you can replace a bunch of values using slicing.
If the chunks are of equal size, you can compute the start and end positions in a loop instead of hard-coding them, as sketched below.
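A minimal sketch of that loop, assuming equal chunks of size n (the variable names here are illustrative):

import numpy as np

x = np.arange(0, 1000)
n = 125
for start in range(0, 375, n):       # the first three chunks, as in the question
    x[start:start + n] = x[start]    # fill each chunk with its own first element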

To keep flexibility in the number of slice/value pairs you can write something like:
def chunk_replace(array, slice_list, value_list):
    for s, v in zip(slice_list, value_list):
        array[s] = v
    return array

array = np.arange(1000)
slice_list = [slice(0, 125), slice(125, 250), slice(250, 375)]
value_list = [array[0], array[125], array[250]]
result = chunk_replace(array, slice_list, value_list)
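A quick sanity check across the first chunk boundary (illustrative output for this input):

print(result[120:130])  # [  0   0   0   0   0 125 125 125 125 125]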

Efficient way of splitting or deleting the dataframe rows based on range filtering

I have 2 dataframes of unequal lengths; the 1st one's rows will be filtered based on the ranges in the 2nd dataframe. For better context on the input/output, please refer to this post:
Efficient solution of dataframe range filtering based on another ranges
For example:
M = (x,y) = [(10,20), (10,20), (10,20), (10,20), (10,20), (10,20)]
E = (m,n) = [(5,7), (15,16), (15,18), (21,25), (5,25), (5,15)]
Case-1:
M = [(10,20)]
E = [(5,7)]
out: M = [(10,20)] (no change, because out of E range)
Case-2:
M = [(10,20)]
E = [(15,16)]
out: M = [(10,14),(17,20)] (split (10,20) into 2 rows to remove E range inside it)
Case-3:
M = [(10,20)]
E = [(21,25)]
out: M = [(10,20)] (no change, because out of E range)
Case-4:
M = [(10,20)]
E = [(5,25)]
out: M = [] (delete because totally inclusive within range of E)
Case-5:
M = [(10,20)]
E = [(5,15)]
out: M = [(16,20)] (because (16,20) isn't E range inclusive)
Case-6:
M = [(10,20)]
E = [(13,20)]
out: M = [(10,12)] (because (10,12) isn't E range inclusive)
I have formulated the following algorithm for the above-stated cases:
M = (x, y)
E = (m, n)

if m <= x:
    if y <= n:
        delete the row
    elif x <= n:
        (start, end) = (n + 1, y)
    else:
        continue
else:
    if y >= n:
        (start, end) = (x, m - 1)
        (start, end) = (n + 1, y)  # second output row (right remainder)
    elif y >= m:
        (start, end) = (x, m - 1)
    else:
        continue
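For reference, that per-pair logic written out as a plain Python function (a sketch; the name clip_range and the list-of-pairs return convention are mine):

def clip_range(x, y, m, n):
    # Return the parts of [x, y] not covered by [m, n], as (start, end) pairs.
    if n < x or y < m:             # no overlap: keep the row unchanged
        return [(x, y)]
    if m <= x and y <= n:          # fully covered by E: delete the row
        return []
    pieces = []
    if x < m:                      # left remainder survives
        pieces.append((x, m - 1))
    if n < y:                      # right remainder survives
        pieces.append((n + 1, y))
    return pieces

# clip_range(10, 20, 15, 16) -> [(10, 14), (17, 20)]  (Case-2)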
But I wanted to implement it using a NumPy and pandas combination:
import numpy as np
import pandas as pd

df1 = pd.read_csv('a.tsv', sep='\t')  # main dataframe which I want to filter
temp_bed = bedfile  # 2nd dataframe based on which I need to filter

# for array broadcasting
m = temp_bed['first.start'].to_numpy()[:, None]
n = temp_bed['first.end'].to_numpy()[:, None]

# A chunk_size that is too small or too big will lower performance.
# Experiment to find a sweet spot
chunk_size = 100_000
offset = 0
mask = []
while offset < len(df1):
    x = df1['first.start'].to_numpy()[offset:offset + chunk_size]  # main
    y = df1['first.end'].to_numpy()[offset:offset + chunk_size]
    mask.append(
        # necessary logical conditions,
        # but the problem is with splitting the rows or ranges
        ((m <= x) & (n >= y)).any(axis=0)
    )
    offset += chunk_size

mask = np.hstack(mask)
df1[mask]
Could anyone give me an efficient solution for splitting or deleting or ignoring the dataframe rows based on the above conditions?
Your case can be solved using Numpy and Pandas together.
To get results for all your cases, I extended M and E by one pair:
M = [(10,20), (10,20), (10,20), (10,20), (10,20), (10,20), (10,20)]
E = [( 5, 7), (15,16), (15,18), (21,25), ( 5,25), ( 5,15), (13,20)]
To get the result, both as a list and as a sequence of pairs (ranges of consecutive numbers, inclusive), you can run:
for m1, m2, e1, e2 in np.hstack([M, E]):
    s = pd.Index(np.arange(m1, m2 + 1)).difference(
        pd.Index(np.arange(e1, e2 + 1))).to_series()
    rng = s.groupby(s.subtract(s.shift()).gt(1).cumsum()).apply(
        lambda grp: (grp.iloc[0], grp.iloc[-1]))
    print(f'{m1:2}, {m2:2}, {e1:2}, {e2:2}\n{s.tolist()}\n{rng.values}\n')
For the above source data I got:
10, 20, 5, 7
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
[(10, 20)]
10, 20, 15, 16
[10, 11, 12, 13, 14, 17, 18, 19, 20]
[(10, 14) (17, 20)]
10, 20, 15, 18
[10, 11, 12, 13, 14, 19, 20]
[(10, 14) (19, 20)]
10, 20, 21, 25
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
[(10, 20)]
10, 20, 5, 25
[]
[]
10, 20, 5, 15
[16, 17, 18, 19, 20]
[(16, 20)]
10, 20, 13, 20
[10, 11, 12]
[(10, 12)]
Each three-row block contains:
- the original data (a pair from M and a pair from E),
- the result as a list of numbers,
- the result as a sequence of ranges.
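The grouping step relies on s.subtract(s.shift()).gt(1).cumsum(): the cumulative sum increments whenever the gap between consecutive surviving numbers exceeds 1, so each run of consecutive numbers gets its own group id. A tiny illustration with a toy series of my own:

s = pd.Series([10, 11, 12, 17, 18])
print(s.subtract(s.shift()).gt(1).cumsum().tolist())  # [0, 0, 0, 1, 1]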

Sample irregular list of numbers with a set delta

Is there a simpler way, using e.g. numpy, to get samples for a given X and delta than the below code?
>>> X = [1, 4, 5, 6, 11, 13, 15, 20, 21, 22, 25, 30]
>>> delta = 5
>>> samples = [X[0]]
>>> for x in X:
...     if x - samples[-1] >= delta:
...         samples.append(x)
>>> samples
[1, 6, 11, 20, 25, 30]
If you are aiming to "vectorize" the process for performance reasons (e.g. using numpy), you could compute, for each element, the number of elements that are less than that element plus the delta. This gives you indices for the items to select, with the items that need to be skipped getting the same index as the preceding item to be kept.
import numpy as np
X = np.array([1, 4, 5, 6, 11, 13, 15, 20, 21, 22, 25, 30])
delta = 5
i = np.sum(X<X[:,None]+delta,axis=1) # index of first to keep
i = np.insert(i[:-1],0,0) # always want the first, never the last
Y = X[np.unique(i)] # extract values as unique indexes
print(Y)
[ 1 6 11 20 25 30]
This assumes that the numbers are in ascending order.
[EDIT]
As indicated in my comment, the above solution is flawed and will only work some of the time. Although vectorizing a Python function does not fully leverage parallelism (and is slower than the Python loop), it is possible to implement the filter like this:
X = np.array([1, 4, 5, 6, 10, 11, 12, 13, 15, 20, 21, 22, 25, 30])
delta = 5
fdelta = np.frompyfunc(lambda a, b: a if a + delta > b else b, 2, 1)
Y = X[X == fdelta.accumulate(X, dtype=object)]
print(Y)
[ 1 6 11 20 25 30]
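Applied to the question's original X, the same accumulate-based filter reproduces the expected samples (a quick check of my own, not part of the original answer):

X = np.array([1, 4, 5, 6, 11, 13, 15, 20, 21, 22, 25, 30])
Y = X[X == fdelta.accumulate(X, dtype=object)]
print(Y)  # [ 1  6 11 20 25 30]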

Summing up arrays without doubles

I would like to know, if I have generated the 3 arrays in the manner below, how I can sum up all the numbers from all 3 arrays without double-counting the ones that appear in more than one array.
(I would like to sum 10 only once, but I can't simply add arrays X_1 and X_2 because they both contain 10 and 20; I only want to sum those numbers once.)
Maybe this can be done by creating a new array out of X_1, X_2 and X_3 that leaves out doubles?
def get_divisible_by_n(arr, n):
    return arr[arr % n == 0]
x = np.arange(1,21)
X_1=get_divisible_by_n(x, 2)
#we get array([ 2, 4, 6, 8, 10, 12, 14, 16, 18, 20])
X_2=get_divisible_by_n(x, 5)
#we get array([ 5, 10, 15, 20])
X_3=get_divisible_by_n(x, 3)
#we get array([3, 6, 9, 12, 15, 18])
It is me again! Here is my solution using numpy, since I had more time this time:

import numpy as np

arr = np.arange(1, 21)
divisible_by = lambda x: arr[np.where(arr % x == 0)]

n_2 = divisible_by(2)
n_3 = divisible_by(3)
n_5 = divisible_by(5)

what_u_want = np.unique(np.concatenate((n_2, n_3, n_5)))
# [ 2, 3, 4, 5, 6, 8, 9, 10, 12, 14, 15, 16, 18, 20]
print(what_u_want.sum())  # 142
Not really efficient and not using numpy, but here is one solution:
def get_divisible_by_n(arr, n):
    return [i for i in arr if i % n == 0]
x = [i for i in range(21)]
X_1 = get_divisible_by_n(x, 2)
X_2 = get_divisible_by_n(x, 5)
X_3 = get_divisible_by_n(x, 3)
X_all = X_1+X_2+X_3
y = set(X_all)
print(sum(y)) # 142

store two for loop output in a 2D matrix

I have a 3D matrix 'DATA' whose dimensions are 100(L) x 200(B) x 50(H). The values are random at each grid point.
I want to find the number of points where the values are between 10 and 20 in each vertical column. The output will be a 2D matrix.
For this I used the following code:
out = []
for i in range(np.shape(DATA)[0]):
    for j in range(np.shape(DATA)[1]):
        a = DATA[i,j,:]
        b = a[(a>25) & (a<30)]
        c = len(b)
        out.append(c)
but I am not getting the 2D matrix; instead I am getting a flat list.
Please help.
If you want to leverage numpy functionality:
import numpy as np
data = np.random.randint(0, 50, size=(100,200,50))
range_sum = np.sum(np.logical_and(np.less_equal(data, 20),
                                  np.greater_equal(data, 10)),
                   axis=-1)
range_sum.shape
Out[6]: (100, 200)
range_sum
Out[7]:
array([[11, 12, 12, ..., 13, 9, 10],
[ 6, 12, 11, ..., 10, 14, 5],
[11, 11, 16, ..., 10, 12, 15],
...,
[11, 17, 9, ..., 12, 12, 11],
[ 9, 8, 10, ..., 7, 15, 12],
[12, 10, 11, ..., 12, 11, 19]])
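The same count can be written more compactly with comparison operators; this line is equivalent to the logical_and form above:

range_sum = ((data >= 10) & (data <= 20)).sum(axis=-1)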
You're using out as a list, and appending each value. Here's a quick modification to your code that should give you the desired result:
out = []
for i in range(np.shape(DATA)[0]):
    out.append([])  # make a second dim for each i
    for j in range(np.shape(DATA)[1]):
        a = DATA[i,j,:]
        b = a[(a>25) & (a<30)]
        c = len(b)
        out[i].append(c)
The change is that I made out a list of lists. In each iteration over i, we append a new list. Then in the inner loop, we append values to the list at index i.
Update
If you want a numpy.ndarray instead, you can modify your code as follows:
import numpy as np

out = np.empty(np.shape(DATA)[:2], dtype=int)  # initialize to the desired 2D shape
for i in range(np.shape(DATA)[0]):
    for j in range(np.shape(DATA)[1]):
        a = DATA[i,j,:]
        b = a[(a>25) & (a<30)]
        c = len(b)
        out[i][j] = c

Find common numbers in Python

I have 2 lists:
a = [1, 9]   # signifies the start point and end point, i.e. numbers 1,2,3,4,5,6,7,8,9
b = [4, 23]  # same for this
Now I need to find whether the numbers from a intersect with the numbers from b.
I can do it by making a list of the numbers from a and b and then intersecting the two lists, but I'm looking for a more pythonic solution. Is there any better solution?
My output should be 4,5,6,7,8,9.
This is done by intersecting two sets:
c = list(set(range(a[0],a[1]+1)) & set(range(b[0],b[1]+1)))
>>> print c
[4,5,6,7,8,9]
This is using min and max:
>>> c = range(max([a[0],b[0]]), min([a[1],b[1]])+1)
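For the question's a = [1, 9] and b = [4, 23] this yields the expected overlap (shown via list() since range is lazy in Python 3):
>>> list(c)
[4, 5, 6, 7, 8, 9]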
a = [1, 2, 3, 4, 5, 6, 7, 8, 9]
b = [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]
The most efficient way is using sets:
result = set(a).intersection(b)
Of course you can use a generator (a pythonic way of applying your logic)
result = (x for x in a if x in b)
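Note that x in b scans b linearly for every element of a; converting b to a set first keeps the generator lazy while making each membership test O(1) on average (a small refinement of my own, not from the answer):

b_set = set(b)
result = (x for x in a if x in b_set)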
You need to get [] or None or something if the ranges do not intersect. Something like this would be most efficient:
def intersect(l1, l2):
    bg = max(l1[0], l2[0])
    end = min(l1[1], l2[1])  # the overlap ends at the smaller endpoint
    return [bg, end] if bg <= end else []
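A quick check with the question's inputs:
>>> intersect([1, 9], [4, 23])
[4, 9]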
