Regarding efficiency: how can we create a large numpy array where the values are floats within a specific range?
For example, for a 1-D numpy array of fixed size where the values are between 0 and 200,000,000.00 (i.e. values in [0, 200,000,000.00]), I can create the array using the smallest data type for floats (float16) and then validate any new value (from user input) before inserting it into the array:
import numpy as np

a = np.empty(shape=(1000,), dtype=np.float16)
pos = 0
new_value = input('Enter new value: ')
# validate
new_value = round(new_value, 2)
if new_value in np.arange(0.00, 200000000.00, 0.01):
    # fill in new value
    a[pos] = new_value
    pos = pos + 1
The question is, can we enforce the validity of new_value (in terms of the already-known minimum/maximum values and number of decimals) based on the dtype of the array?
In other words, given that we know the range and the number of decimals at the time of creating the array, does this give us any opportunity to insert valid values into the array (more) efficiently?
I am a bit confused about how your code even runs, because it does not work as presented here.
It is also a bit unclear why you want to append new values to an empty array you have created beforehand. Did you mean to fill the created array with the new incoming values instead of appending?
np.arange(0.00, 200000000.00, 0.01)
This line is causing problems, as it creates a huge array (leading to a MemoryError in my environment) just to check whether new_value is in a certain range.
Extending my comment and fixing the issues with your code, my solution would look like this:
import numpy as np
max_value = 200000000
arr = np.empty(shape=(1000,), dtype=np.float16)
new_value = float(input('Enter new value: ')) # more checks might be useful to ensure the input is numeric
# validate
if 0 <= new_value <= max_value:
    new_value = round(new_value, 2) # round only if the range criterion is fulfilled
    arr = np.append(arr, new_value) # in case you really want to append your value
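If the intent is to fill the preallocated array rather than grow it with np.append, here is a minimal sketch (my own, not part of the original answer); note that I use float64 because float16 tops out around 65504 and cannot hold values near 200,000,000:
import numpy as np

max_value = 200000000
arr = np.empty(shape=(1000,), dtype=np.float64)  # float16 overflows for values this large
pos = 0  # index of the next free slot (hypothetical counter, as in the question)

new_value = float(input('Enter new value: '))
# validate against the known range, then write in place instead of appending
if 0 <= new_value <= max_value and pos < arr.shape[0]:
    arr[pos] = round(new_value, 2)
    pos += 1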
So I have an array I am trying to slice and index using two other boolean arrays and then set a value on that subset of the array. I saw this post:
Setting values in a numpy arrays indexed by a slice and two boolean arrays
and suspect I am getting a copy instead of a view of my array, so it isn't saving the values I am setting on the array. I think I managed to reproduce the problem with much shorter code, but I am very out of my depth.
import numpy as np

#first array
a = np.arange(0,100).reshape(10,10)
#conditional array of same size
b = np.random.rand(10,10)
b = b < 0.8
#create out array of same size as a
out = np.zeros(a.shape)
#define neighborhood to slice values from
nhood = tuple([slice(3,6),slice(5,7)])
#define subset where b == True in neighborhood
subset = b[nhood]
#define values in out that are in the neighborhood but not excluded by b
candidates = out[nhood][subset]
#get third values from neighborhood using math
c = np.random.rand(len(candidates))
#this is in a for loop so this is checking to see a value has already been changed earlier - returns all true now
update_these = candidates < c
#set sliced, indexed subset of array with the values from c that are appropriate
out[nhood][subset][update_these] = c[update_these]
print(out) ##PRODUCES - ARRAY OF ALL ZEROS STILL
I have also tried chaining the boolean index with
out[nhood][(subset)&(update_these)] = c[update_these]
But that made an array of the wrong size.
Help?
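For reference, here is a minimal sketch (my own, not part of the original post) of how the two boolean conditions could be folded into a single mask, so the assignment goes through a view obtained by plain slicing rather than through a chained-indexing copy:
mask = np.zeros_like(subset)    # boolean mask with the neighborhood's shape
mask[subset] = update_these     # True where b allows it AND the update applies
region = out[nhood]             # plain slicing returns a view, not a copy
region[mask] = c[update_these]  # writes through the view into out
print(out[nhood])               # the neighborhood now holds the new values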
Okay, so I have a 1000x100 array of random numbers. I want to threshold this array with a list of numbers ranging from 3 to 9. If the values are higher than the threshold, I want the sum of the row appended to a list.
I have tried many ways, including a triply nested for loop with a conditional. Right now I have found a way to compare an array to a list of numbers, but each time that happens the random numbers are generated again.
import numpy as np

xpatient=5
sd_healthy=2
xhealthy=7
sd_patient=2
thresholdvalue1=(xpatient-sd_healthy)*10
thresholdvalue2=(xhealthy+sd_patient)*10
thresholdlist=[]
x1=[]
Ahealthy=np.random.randint(10,size=(1000,100))
Apatient=np.random.randint(10,size=(1000,100))
TParray=np.random.randint(10,size=(1,61))

def thresholding(A,B):
    for i in range(A,B):
        thresholdlist.append(i)

thresholding(thresholdvalue1,thresholdvalue2+1)
thresholdarray=np.asarray(thresholdlist)
thedivisor=10
newthreshold=(thresholdarray/thedivisor)

for x in range(61):
    Apatient=np.random.randint(10,size=(1000,100))
    Apatient=[Apatient>=newthreshold[x]]*Apatient
    x1.append([sum(x) for x in zip(*Apatient)])
So my for loop regenerates the random array inside it, but if I don't do that, I don't get to apply each threshold in turn. I want the threshold for the whole array to be 3, 3.1, 3.2, and so on.
I hope I got my point across. Thanks in advance.
You can solve your problem using this approach:
import numpy as np
def get_sums_by_threshold(data, threshold, axis): # axis=0 sums over rows (one value per column), axis=1 sums over columns (one value per row)
    result = list(np.where(data >= threshold, data, 0).sum(axis=axis))
    return result
xpatient=5
sd_healthy=2
xhealthy=7
sd_patient=2
thresholdvalue1=(xpatient-sd_healthy)*10
thresholdvalue2=(xhealthy+sd_patient)*10
np.random.seed(100) # to keep the generated array reproducible
data = np.random.randint(10,size=(1000,100))
thresholds = [num / 10.0 for num in range(thresholdvalue1, thresholdvalue2+1)]
sums = list(map(lambda x: get_sums_by_threshold(data, x, axis=0), thresholds))
But you should know that your initial array includes only integer values, so you will get the same result for multiple thresholds that share the same integer part (e.g. 3.0, 3.1, 3.2, ..., 3.9). If you want to store float numbers from 0 to 9 in the initial array with the specified shape, you can do the following:
data = np.random.randint(90,size=(1000,100)) / 10.0
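For reference, a quick shape check (my own sketch) of the result from the code above: thresholdvalue1 is 30 and thresholdvalue2 is 90, so there are 61 thresholds from 3.0 to 9.0, and each entry of sums holds one sum per column:
print(len(thresholds))          # 61 thresholds: 3.0, 3.1, ..., 9.0
print(len(sums), len(sums[0]))  # 61 result lists, each with 100 column sums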
I have a DataFrame which has a column like (0.12, 0.14, 0.16, 0.13, 0.23, 0.25, 0.28, 0.32, 0.33). I want a new column that only records a value when it changes by more than 0.1 (or less than -0.1); otherwise the previously recorded value is kept.
So the new column should be like (0.12, 0.12, 0.12, 0.12, 0.23, 0.23, 0.23, 0.32, 0.32).
Does anyone know how to write this in a simple way?
Thanks in advance.
Not really sure what you're trying to achieve by rounding the data to arbitrary numbers. You might want to consider either rounding to the midpoint, or using a ceiling/floor function after multiplying the array by 10.
What you're trying to achieve, however, can be done like this:
import numpy as np

def cookdata(data):
    # assuming your data is sorted as in the example array in your question
    data = np.asarray(data)
    i = 0
    startidx = 0
    while np.unique(data).size > np.ceil((data.max() - data.min()) / 0.1):
        lastidx = startidx + np.where(data[startidx:] < np.unique(data)[0] + 0.1 * (i + 1))[0].size
        data[startidx:lastidx] = np.unique(data)[i]
        startidx = lastidx
        i += 1
    return data
This returns a dataset as asked in your question. I am sure there are better ways to do it:
data = np.sort(np.random.uniform(0.12, 0.5, 10))
data
array([ 0.12959374, 0.14192312, 0.21706382, 0.27638412, 0.27745105,
0.28516701, 0.37941334, 0.4037809 , 0.41016534, 0.48978927])
cookdata(data)
array([ 0.12959374, 0.12959374, 0.12959374, 0.27638412, 0.27638412,
0.27638412, 0.37941334, 0.37941334, 0.37941334, 0.48978927])
The function returns an array whose values are based on the first value.
You might, however, want to consider simpler operations that do not require rounding values to arbitrary data points. Consider np.round(data, decimals=1). In your case you could also use the floor function, as in np.floor(data/0.1)*0.1, or, if you want to keep the initial value:
data = np.asarray(data)
datamin = data.min()
data = np.floor((data-datamin)/0.1)*0.1+datamin
data
array([ 0.12959374, 0.12959374, 0.12959374, 0.22959374, 0.22959374,
0.22959374, 0.32959374, 0.32959374, 0.32959374, 0.42959374])
Here the values are snapped to multiples of 0.1 above the first value, rather than to arbitrary values between those multiples.
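For comparison, a small sketch (my addition) of the simpler alternatives mentioned above, applied to a fresh copy of the sorted sample:
data = np.sort(np.random.uniform(0.12, 0.5, 10))  # fresh sample, as above
np.round(data, decimals=1)   # snaps each value to the nearest multiple of 0.1
np.floor(data / 0.1) * 0.1   # snaps each value down to the previous multiple of 0.1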
I have a dataset dod, and I want to take the mean of the values in dod. In dod there are fill values of -9.99, as well as values of 0. I would like to ignore those values when taking the mean.
So far I can only ignore the fill values:
import numpy as np

dod = f.variables['dod_modis_flg1'][i]

def nan_if(arr, value):
    return np.where(arr == value, np.nan, arr)

mean = np.nanmean(nan_if(dod, -9.99))
print(mean)
Does anyone know how I can ignore the values of 0 as well, while taking the mean?
You can either replace unwanted values with nans, but that will change your original array:
import numpy

dod[numpy.logical_or(numpy.isclose(dod, 0), numpy.isclose(dod, -9.99))] = numpy.nan
numpy.nanmean(dod)
Three additional boolean arrays need to be allocated to perform the operation, but no additional float arrays are created.
Or only select the values you want and take the mean:
tmp = dod[~numpy.logical_or(numpy.isclose(dod, 0), numpy.isclose(dod, -9.99))]
numpy.mean(tmp)
but that will additionally create an intermediate float array
numpy.may_share_memory(tmp, dod) # False
Third option is to create a masked array like:
tmp2 = numpy.ma.masked_where(
numpy.logical_or(numpy.isclose(dod, 0), numpy.isclose(dod, -9.99)),
dod,
copy=False
)
numpy.mean(tmp2)
which only creates the additional boolean arrays but no intermediate float array:
numpy.may_share_memory(tmp2.data, dod) # True
Conclusion
If you are allowed to modify the input array, do so (option 1)
If you are not allowed to modify the input array, create a temporary masked array (option 3)
This should work. You can do it all in one line - I'm just creating the new_array variable for clarity:
# get the dod array but without -9.99 or 0.
new_array = dod[~np.isin(dod, [-9.99, 0])]
np.nanmean(new_array)
Note: requires numpy >= 1.13
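A quick self-contained check (my own sketch, with a made-up dod array) of what this selects:
import numpy as np

dod = np.array([0.1, 0.4, -9.99, 0.0, 0.7])   # hypothetical sample with fill values
new_array = dod[~np.isin(dod, [-9.99, 0])]
print(new_array)              # [0.1 0.4 0.7]
print(np.nanmean(new_array))  # approximately 0.4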
I am trying to do the following in NumPy without using a loop:
I have a matrix X of dimensions N*d and a vector y of dimension N.
y contains integers ranging from 1 to K.
I am trying to get a matrix M of size K*d, where M[i,:]=np.mean(X[y==i,:],0)
Can I achieve this without using a loop?
With a loop, it would go something like this.
import numpy as np

N=3
d=3
K=2
X=np.eye(N)
y=np.random.randint(1,K+1,N)
M=np.zeros((K,d))
for i in np.arange(0,K):
    line=X[y==i+1,:]
    if line.size==0:
        M[i,:]=np.zeros(d)
    else:
        M[i,:]=np.mean(line,0)
Thank you in advance.
The code is basically collecting specific rows of X and adding them up, for which we have a NumPy builtin in np.add.reduceat. So, with that in focus, the steps to solve it in a vectorized way could be as listed next:
# Get sort indices of y
sidx = y.argsort()
# Collect rows off X based on their IDs so that they come in consecutive order
Xr = X[np.arange(N)[sidx]]
# Get unique row IDs, start positions of each unique ID
# and their counts to be used for average calculations
unq,startidx,counts = np.unique((y-1)[sidx],return_index=True,return_counts=True)
# Add rows off Xr based on the slices signified by the start positions
vals = np.true_divide(np.add.reduceat(Xr,startidx,axis=0),counts[:,None])
# Setup output array and set row summed values into it at unique IDs row positions
out = np.zeros((K,d))
out[unq] = vals
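As a quick sanity check (my own addition), the vectorized out can be compared against the loop-based M computed from the question's small example above:
# Assumes X, y, N, K, d are the small example values defined in the question
print(np.allclose(out, M))  # True: both hold the per-class means (zeros for empty classes)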
This solves the question, but creates an intermediate K×N boolean matrix, and doesn't use the built-in mean function. This may lead to worse performance or worse numerical stability in some cases. I'm letting the class labels range from 0 to K-1 rather than 1 to K.
import numpy as np

# Define constants
K, N, d = 10, 1000, 3

# Sample data
Y = np.random.randint(0, K - 1, N)  # K-1 to omit one class, to test the no-examples case
X = np.random.randn(N, d)

# Calculate means for each class, vectorized
# Map samples to labels by taking a logical "outer product"
mark = Y[None, :] == np.arange(0, K)[:, None]
# Count the number of examples in each class
count = mark.sum(1)
# Avoid divide by zero if a class has no examples
count += count == 0
# Sum within each class and normalize
M = (np.dot(mark, X).T / count).T
print(M, M.shape, mark.shape)
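For completeness, here is a sketch of an alternative (my addition, not from the answer above) that avoids building the K x N boolean matrix by scatter-adding each row directly into its class slot:
M2 = np.zeros((K, d))
np.add.at(M2, Y, X)                   # M2[Y[i]] += X[i] for every sample i
count2 = np.bincount(Y, minlength=K)  # number of examples per class
M2 /= np.maximum(count2, 1)[:, None]  # guard against empty classes, as above
print(np.allclose(M, M2))             # should agree with the boolean-matrix version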