I have a 2D numpy array called arm_resets that holds positive integers. The first column contains only positive integers < 360. For every column other than the first, I need to replace all values of 360 or more with the value in the same row of the first column. I thought this would be a relatively easy thing to do; here's what I have:
i = 300
over_360 = arm_resets[:, [i]] >= 360
print(arm_resets[:, [i]][over_360])
print(arm_resets[:, [0]][over_360])
arm_resets[:, [i]][over_360] = arm_resets[:, [0]][over_360]
print(arm_resets[:, [i]][over_360])
And here's what prints:
[3600 3609 3608 ... 3600 3611 3605]
[ 0 9 8 ... 0 11 5]
[3600 3609 3608 ... 3600 3611 3605]
Since all the numbers shown in the first print (first three and last three) are at least 360, the third print should show the replacement values from the second print. Why is this not working?
edit: reproducible example:
import numpy as np
import pandas as pd

df = pd.DataFrame({"start": [1, 2, 5, 6], "freq": [1, 5, 6, 9]})
periods = 6
arm_resets = df[["start"]].values
freq = df[["freq"]].values
arm_resets = np.pad(arm_resets, ((0, 0), (0, periods - 1)))
for i in range(1, periods):
    arm_resets[:, [i]] = arm_resets[:, [i - 1]] + freq
    #over_360 = arm_resets[:, [i]] >= periods
    #arm_resets[:, [i]][over_360] = arm_resets[:, [0]][over_360]
arm_resets
With the reset lines commented out, here's what prints:
array([[ 1,  2,  3,  4,  5,  6],
       [ 2,  7, 12, 17, 22, 27],
       [ 3,  9, 15, 21, 27, 33],
       [ 4, 13, 22, 31, 40, 49]])
What I would expect:
array([[1, 2, 3, 4, 5, 1],
       [2, 2, 2, 2, 2, 2],
       [3, 3, 3, 3, 3, 3],
       [4, 4, 4, 4, 4, 4]])
Now if it helps, the final 2D array I'm actually trying to create is a 1/0 array that indicates which entries are filled in, so in this example I'd want this:
array([[0, 1, 1, 1, 1, 1],
       [0, 0, 1, 0, 0, 0],
       [0, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 1, 0]])
The code I use to achieve this from the above arm_resets is this:
fin = np.zeros((len(arm_resets), periods), dtype=int)
for i in range(len(arm_resets)):
    fin[i, arm_resets[i]] = 1
The slice arm_resets[:, [i]] is a fancy index, and therefore makes a copy of the i-th column of the data. arm_resets[:, [i]][over_360] = ... therefore calls __setitem__ on a temporary array that is discarded as soon as the statement executes. If you want the assignment to stick, call __setitem__ on arm_resets directly (with over_360 computed as a 1-D mask, e.g. over_360 = arm_resets[:, i] >= 360):
arm_resets[over_360, [i]] = ...
You also don't need to wrap the column index in a list. It's generally better to use simple indices, since basic indexing creates views rather than copies:
arm_resets[over_360, i] = ...
With slicing, even the following should work, since it calls __setitem__ on a view:
arm_resets[:, i][over_360] = ...
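If the copy-versus-view distinction is new, here is a tiny standalone demonstration (the array is made up just for illustration):
import numpy as np

a = np.arange(6).reshape(2, 3)
a[:, [1]][:] = 99   # fancy index -> copy; the write goes to a temporary and is lost
print(a[:, 1])      # [1 4] -- unchanged
a[:, 1][:] = 99     # basic index -> view; the write sticks
print(a[:, 1])      # [99 99]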
This still doesn't process the whole matrix, though, since i picks out a single column. In fact, you can process the entire matrix in one step, without looping, if you use integer indices rather than a boolean mask. Indices are useful here because they let you match each offending element with the item from the correct row of the first column:
rows, cols = np.nonzero(arm_resets[:, 1:] >= 360)
arm_resets[rows, cols + 1] = arm_resets[rows, 0]  # cols is offset by 1 because of the [:, 1:] slice
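As a quick check, here is the fix applied to the reproducible example above (a sketch; there the threshold is periods rather than 360, and the reset has to happen inside the loop so that later columns build on the reset values):
import numpy as np
import pandas as pd

df = pd.DataFrame({"start": [1, 2, 5, 6], "freq": [1, 5, 6, 9]})
periods = 6
arm_resets = np.pad(df[["start"]].values, ((0, 0), (0, periods - 1)))
freq = df[["freq"]].values.ravel()
for i in range(1, periods):
    arm_resets[:, i] = arm_resets[:, i - 1] + freq
    over = arm_resets[:, i] >= periods         # 1-D boolean mask
    arm_resets[over, i] = arm_resets[over, 0]  # assign through arm_resets directly
print(arm_resets)
# [[1 2 3 4 5 1]
#  [2 2 2 2 2 2]
#  [3 3 3 3 3 3]
#  [4 4 4 4 4 4]]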
You can use np.where():
first_col = arm_resets[:, 0]  # first column
first_col = first_col.reshape(first_col.size, 1)  # reshape into a 2-D column
arm_resets = np.where(arm_resets >= 360, first_col, arm_resets)
You can see in detail how np.where works here, but basically it compares arm_resets >= 360; where the condition is true it takes the value from first_col (broadcast across the columns, which is the other detail here), and where it is false it keeps the arm_resets value.
Edit: as suggested by Mad Physicist, you can use arm_resets[:, 0, None] directly instead of creating the first_col variable.
arm_resets = np.where(arm_resets >= 360, arm_resets[:, 0, None], arm_resets)
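To make the broadcasting detail concrete, a tiny standalone check (the numbers are made up for illustration):
import numpy as np

a = np.array([[  1, 400,   2],
              [  3,   5, 500]])
first = a[:, 0, None]                # shape (2, 1): one value per row
print(np.where(a >= 360, first, a))  # the (2, 1) column broadcasts against (2, 3)
# [[1 1 2]
#  [3 5 3]]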
Related
I'm looking to slice out the minimum value of each row of an array.
For example, in the code below, I want to print out np.array([13, 0, 12, 3]).
However, the slicing isn't behaving as I'd expect.
(I do need the argmin array later and don't want to just use np.min(g, axis=1))
import numpy as np
g = np.array([[13, 23, 14], [12, 23, 0], [39, 12, 92], [19, 4, 3]])
min_ = np.argmin(g, axis=1)
print(g[:, min_])
What is happening here?
Why does my code print
[[13 14 23 14]
 [12  0 23  0]
 [39 92 12 92]
 [19  3  4  3]]
Other details:
Python 3.10.2
Numpy 1.22.1
If you want to use np.argmin, you can try this:
min_ = np.argmin(g, axis=1)
g[range(len(min_)), min_]  # same as np.min(g, axis=1)
For more explanation: min_ gives you array([0, 2, 1, 2]), but to pick one element per row you need the index pairs ((0, 1, 2, 3), (0, 2, 1, 2)); that is why you pair it with range.
Output:
array([13, 0, 12, 3])
Your code is printing the first, third, second, and third columns of the g array, in that order.
>>> np.argmin(g, axis=1)
array([0, 2, 1, 2]) # first, third, second, third
If you want to get the minimum value of each row, use np.min:
>>> np.min(g, axis=1)
array([13, 0, 12, 3])
When you write g[:, min_], you're saying: "give me all of the rows (shorthand :) for columns at indices min_ (namely 0, 2, 1, 2)".
What you wanted to say was: "give me the values at these rows and these columns" - in other words, you're missing the corresponding row indices to match the column indices in min_.
Since your desired row indices are simply a range of numbers from 0 to g.shape[0] - 1, you could technically write it as:
print(g[range(g.shape[0]), min_])
# output: [13 0 12 3]
But #richardec's solution is better overall if your goal is to extract the row-wise min value.
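As an aside, numpy also has a helper that pairs up the row indices for you; a small sketch of the same selection:
min_ = np.argmin(g, axis=1)                          # shape (4,)
vals = np.take_along_axis(g, min_[:, None], axis=1)  # shape (4, 1)
print(vals.ravel())
# [13  0 12  3]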
I am having trouble creating a function which takes a matrix M as input, deletes BOTH the rows and the columns containing the number 0, and outputs the remaining numbers. Any help is much appreciated as I have my programming exam coming up soon.
By "deleting both rows and columns" this is what I mean:
import numpy as np

x = np.array([[ 1,  2,  3,  4,  5],
              [ 6,  0,  8,  9, 10],
              [11, 12, 13, 14, 15],
              [16,  0,  0, 19, 20]])

idxs_array = list(np.where(x == 0))
idxs_array = [list(dict.fromkeys(idxs)) for idxs in idxs_array]
for axis, idxs in enumerate(idxs_array):
    sub_factor = 0
    for idx in idxs:
        x = np.delete(x, idx - sub_factor, axis)
        sub_factor += 1
print(x)
# x = [[ 1,  4,  5],
#      [11, 14, 15]]
1. Locate zero elements
First of all, we need to identify the locations of the zero elements in the matrix, which can be done easily with np.where().
np.where returns the row and column indices of the elements that match a specific condition (doc).
row_idx, col_idx = np.where(arr == 0)
2. Remove corresponding rows/columns
To remove the corresponding rows and columns, there is an easy way to do this: boolean indexing (doc).
That is, you mark each row (or column) you want to keep with True and each one you want to drop with False.
print(np.arange(4)[[True, False, True, False]])
# [0 2]
3. Put two things together
Here is a minimal example.
arr = np.array([[ 1,  2,  3,  4,  5],
                [ 6,  0,  8,  9, 10],
                [11, 12, 13, 14, 15],
                [16,  0,  0, 19, 20]])
row_idx, col_idx = np.where(arr == 0)
rm_row_idx = set(row_idx.tolist())
rm_col_idx = set(col_idx.tolist())
row_mask = [i not in rm_row_idx for i in range(arr.shape[0])]
col_mask = [i not in rm_col_idx for i in range(arr.shape[1])]
arr = arr[row_mask, :]
arr = arr[:, col_mask]
print(arr)
# Shall be:
# array([[ 1, 4, 5],
# [11, 14, 15]])
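For what it's worth, the same result can be written more compactly with boolean reductions and np.ix_ (a sketch, starting again from the original arr):
arr = np.array([[ 1,  2,  3,  4,  5],
                [ 6,  0,  8,  9, 10],
                [11, 12, 13, 14, 15],
                [16,  0,  0, 19, 20]])
keep_rows = ~(arr == 0).any(axis=1)  # True for rows containing no zeros
keep_cols = ~(arr == 0).any(axis=0)  # True for columns containing no zeros
print(arr[np.ix_(keep_rows, keep_cols)])
# [[ 1  4  5]
#  [11 14 15]]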
I have a problem that has to be solved as efficiently as possible. My current approach kind of works, but is extremely slow.
I have a dataframe with multiple columns; in this case I only care about one of them. It contains positive continuous numbers and some zeros.
My goal is to find the row after which almost no zeros appear.
To make clear what I mean I wrote this example to replicate my problem:
import pandas as pd

df = pd.DataFrame([0,0,0,0,1,0,1,0,0,2,0,0,0,1,1,0,1,2,3,4,0,4,0,5,1,0,1,2,3,4,
                   0,0,1,2,1,1,1,1,2,2,1,3,6,1,1,5,1,2,3,4,4,4,3,5,1,2,1,2,3,4],
                  index=pd.date_range('2018-01-01', periods=60, freq='15T'))
There are some zeros at the beginning, but they get less after some time.
Here comes my unoptimized code to visualize the number of zeros:
zerosum = 0  # counter for all zeros that have appeared so far
for i in range(len(df)):
    if df[0][i] == 0.0:
        df.loc[df.index[i], 'zerosum'] = zerosum
        zerosum += 1
    else:
        df.loc[df.index[i], 'zerosum'] = zerosum
df['zerosum'].plot()
With that unoptimized code I can see the distribution of zeros over time.
My expected output in this example would be the date 2018-01-01 08:00, because no zeros appear after that date.
The problem when dealing with my real data is that single zeros can still appear later, so I can't just pick the last row that contains a zero. I have to somehow inspect the distribution of zeros and ignore late outliers.
Note: the visualization is not necessary to solve my problem; I just included it to explain the problem as well as possible. Thanks.
OK, second go:
import pandas as pd
import numpy as np
import math
df = pd.DataFrame([0,0,0,0,1,0,1,0,0,2,0,0,0,1,1,0,1,2,3,4,0,4,0,5,1,0,1,2,3,4,
                   0,0,1,2,1,1,1,1,2,2,1,3,6,1,1,5,1,2,3,4,4,4,3,5,1,2,1,2,3,4],
                  index=pd.date_range('2018-01-01', periods=60, freq='15T'),
                  columns=['values'])
We create a column that contains the rank of each zero, and zero where the value is non-zero:
df['zero_idx'] = np.where(df['values']==0,np.cumsum(np.where(df['values']==0,1,0)), 0)
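A quick look at the first few rows shows how the ranks land (zeros get ranks 1, 2, 3, ...; non-zeros get 0):
print(df.head(6))
#                      values  zero_idx
# 2018-01-01 00:00:00       0         1
# 2018-01-01 00:15:00       0         2
# 2018-01-01 00:30:00       0         3
# 2018-01-01 00:45:00       0         4
# 2018-01-01 01:00:00       1         0
# 2018-01-01 01:15:00       0         5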
We can use this column to get the location of any zero by its rank. I don't know what your criterion is for calling a zero an outlier, but let's say we want to make sure we are past at least 90% of all zeros...
# Total number of zeros
n_zeros = max(df['zero_idx'])
# Get past at least this percentage
tolerance = 0.9
# The rank of the abovementioned zero
rank_tolerance = math.ceil(tolerance * n_zeros)
df[df['zero_idx']==rank_tolerance].index
Out[44]: DatetimeIndex(['2018-01-01 07:30:00'], dtype='datetime64[ns]', freq='15T')
Okay, if you need to get the index right after the last zero occurs, you can try this:
last = 0
for i in range(len(df)):
    if df[0][i] == 0:
        last = i
print(df.iloc[last + 1])
Or by filtering:
new = df.loc[df[0]==0]
last = df.index.get_loc(new.index[-1])
print(df.iloc[last+1])
Here's my solution using a filter and cumsum:
df = pd.DataFrame([0, 0, 0, 0, 1, 0, 1, 0, 0, 2, 0, 0, 0, 1, 1, 0, 1, 2, 3, 4, 0, 4, 0, 5, 1, 0, 1, 2, 3, 4,
                   0, 0, 1, 2, 1, 1, 1, 1, 2, 2, 1, 3, 6, 1, 1, 5, 1, 2, 3, 4, 4, 4, 3, 5, 1, 2, 1, 2, 3, 4],
                  index=pd.date_range('2018-01-01', periods=60, freq='15T'))
a = df[0] == 0
df['zerosum'] = a.cumsum()
maxval = max(df['zerosum'])
firstdate = df[df['zerosum'] == maxval].index[1]
print(firstdate)
output:
2018-01-01 08:00:00
Given two arrays, one representing a stream of data, and another representing group counts, such as:
import numpy as np
# given group counts:   3 4 3 2
# given flattened data: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
group_counts = np.array([3,4,3,2])
data = np.arange(group_counts.sum()) # placeholder data, real life application will be a very large array
I want to generate matrices based on the group counts for the streamed data, such as:
target_count = 3  # I want to make a matrix of all data items whose group_counts == target_count
# Expected result
# [[ 0 1 2]
# [ 7 8 9]]
To do this I wrote the following:
# Find all matches
match = np.where(group_counts == target_count)[0]
i1 = np.cumsum(group_counts)[match]  # end index for slicing
i0 = i1 - group_counts[match]        # start index for slicing

# Prep the blank matrix and fill with results
matched_matrix = np.empty((match.size, target_count))
# Is it possible to get rid of this loop?
for i in range(match.size):
    matched_matrix[i] = data[i0[i]:i1[i]]
matched_matrix
# Result: array([[ 0,  1,  2],
#                [ 7,  8,  9]])
This works, but I would like to get rid of the loop and I can't figure out how.
Doing some research I found numpy.split and numpy.array_split:
match = np.where(group_counts == target_count)[0]
match = np.array(np.split(data, np.cumsum(group_counts)), dtype=object)[match]
# Result: array([array([0, 1, 2]), array([7, 8, 9])], dtype=object)
But numpy.split produces an object-dtype array of arrays that I have to convert.
Is there an elegant way to produce the desired result without a loop?
You can repeat group_counts so it has the same size as data, then filter and reshape based on the target:
group_counts = np.array([3,4,3,2])
data = np.arange(group_counts.sum())
target = 3
data[np.repeat(group_counts, group_counts) == target].reshape(-1, target)
#array([[0, 1, 2],
# [7, 8, 9]])
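To see why this works, it may help to print the intermediate array (using the same names as above):
labels = np.repeat(group_counts, group_counts)
print(labels)
# [3 3 3 4 4 4 4 3 3 3 2 2]
# Each element of data is now tagged with the size of its own group, so
# labels == target selects exactly the elements that sit in groups of size 3,
# and reshape(-1, target) folds them back into one row per matching group.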
I have:
import numpy as np
position = np.array([4, 4.34, 4.69, 5.02, 5.3, 5.7, ..., 4])
x = (B/position**2)*dt
A = np.cumsum(x)
assert A[0] == 0 # I want this to be true.
Where B and dt are scalar constants. This is for a numerical integration problem with initial condition of A[0] = 0. Is there a way to set A[0] = 0 and then do a cumsum for everything else?
I don't understand what exactly your problem is, but here are some things you can do to get A[0] = 0.
You can make A longer by one entry so that the zero is the first element:
# initialize example data
import numpy as np
B = 1
dt = 1
position = np.array([4, 4.34, 4.69, 5.02, 5.3, 5.7])
# do calculation
A = np.zeros(len(position) + 1)
A[1:] = np.cumsum((B/position**2)*dt)
Result:
A = [ 0. 0.0625 0.11559096 0.16105356 0.20073547 0.23633533 0.26711403]
len(A) == len(position) + 1
Alternatively, you can manipulate the calculation to subtract the first entry of the result:
# initialize example data
import numpy as np
B = 1
dt = 1
position = np.array([4, 4.34, 4.69, 5.02, 5.3, 5.7])
# do calculation
A = np.cumsum((B/position**2)*dt)
A = A - A[0]
Result:
[ 0. 0.05309096 0.09855356 0.13823547 0.17383533 0.20461403]
len(A) == len(position)
As you can see, the results have different lengths. Is one of them what you expect?
1D cumsum
A wrapper around np.cumsum that sets first element to 0:
def cumsum(pmf):
    cdf = np.empty(len(pmf) + 1, dtype=pmf.dtype)
    cdf[0] = 0
    np.cumsum(pmf, out=cdf[1:])
    return cdf
Example usage:
>>> np.arange(1, 11)
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
>>> cumsum(np.arange(1, 11))
array([ 0, 1, 3, 6, 10, 15, 21, 28, 36, 45, 55])
N-D cumsum
A wrapper around np.cumsum that sets first element to 0, and works with N-D arrays:
def cumsum(pmf, axis=None, dtype=None):
    if axis is None:
        pmf = pmf.reshape(-1)
        axis = 0
    if dtype is None:
        dtype = pmf.dtype
    idx = [slice(None)] * pmf.ndim
    # Create array with extra element along cumsummed axis.
    shape = list(pmf.shape)
    shape[axis] += 1
    cdf = np.empty(shape, dtype)
    # Set first element to 0.
    idx[axis] = 0
    cdf[tuple(idx)] = 0
    # Perform cumsum on remaining elements.
    idx[axis] = slice(1, None)
    np.cumsum(pmf, axis=axis, dtype=dtype, out=cdf[tuple(idx)])
    return cdf
Example usage:
>>> np.arange(1, 11).reshape(2, 5)
array([[ 1, 2, 3, 4, 5],
[ 6, 7, 8, 9, 10]])
>>> cumsum(np.arange(1, 11).reshape(2, 5), axis=-1)
array([[ 0, 1, 3, 6, 10, 15],
[ 0, 6, 13, 21, 30, 40]])
I totally understand your pain; I wonder why Numpy doesn't allow this with np.cumsum. Anyway, though I'm really late and there's already another good answer, I prefer this one a bit more:
np.cumsum(np.pad(array, (1, 0), "constant"))
where array in your case is (B/position**2)*dt. You can swap the order of np.pad and np.cumsum as well; I'm just adding a zero to the start of the array and then calling np.cumsum.
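Both orders give the same result; a quick check on a small made-up array:
x = np.array([1, 2, 3])
np.cumsum(np.pad(x, (1, 0), "constant"))  # pad, then cumsum -> array([0, 1, 3, 6])
np.pad(np.cumsum(x), (1, 0), "constant")  # cumsum, then pad -> array([0, 1, 3, 6])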
You can use np.roll (shift right by 1) and then set the first entry to zero.
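A minimal sketch of that approach; note the result keeps the same length as the input, so the grand total at the end is dropped:
A = np.cumsum(x)   # x = (B/position**2)*dt, as in the question
A = np.roll(A, 1)  # shift right by 1; the final total wraps around to the front
A[0] = 0           # overwrite the wrapped value with the initial condition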