I'm trying to write code with numpy that outputs the maximum value between indexes. I think argmax could be usable, but I don't know how to use slices without a Python for loop. If there is a pandas function for this, that would work too. I want the computation to be as fast as possible.
list_ = np.array([9887.89, 9902.99, 9902.99, 9910.23, 9920.79, 9911.34, 9920.01, 9927.51, 9932.3, 9932.33, 9928.87, 9929.22, 9929.22, 9935.24, 9935.24, 9935.26, 9935.26, 9935.68, 9935.68, 9940.5])
indexes = np.array([0, 5, 10, 19])
Expected result:
Max number between index (0 - 5): 9920.79 at index 4
Max number between index (5 - 10): 9932.33 at index 9
Max number between index (10 - 19): 9940.5 at index 19
You can use reduceat directly on your array without the need to slice/split it:
np.maximum.reduceat(list_, indexes[:-1])
output:
array([9920.79, 9932.33, 9940.5 ])
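Note that reduceat only returns the maxima themselves; if you also need the positions of the maxima, see the split-based approach below.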
Assuming that the first (zero) index and the last index are specified in the indexes array:
import numpy as np
list_ = np.array([9887.89, 9902.99, 9902.99, 9910.23, 9920.79, 9911.34, 9920.01, 9927.51, 9932.3, 9932.33, 9928.87, 9929.22, 9929.22, 9935.24, 9935.24, 9935.26, 9935.26, 9935.68, 9935.68, 9940.5])
indexes = np.array([0, 5, 10, 19])
chunks = np.split(list_, indexes[1:-1])   # split at the interior indexes
print([c.max() for c in chunks])          # max of each chunk
max_ind = [c.argmax() for c in chunks]    # argmax within each chunk
print(max_ind + indexes[:-1])             # shift back to positions in list_
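With the sample data, the maxima come out as 9920.79, 9932.33, 9940.5, and the second print gives their positions in list_: [ 4  9 19].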
With an arbitrary specification of indices, the chunks won't all have the same size, so the vectorization benefits of numpy are going to be lost one way or another (you can't have a numpy array whose elements have different sizes in memory and still keep all the benefits of vectorization). At least one for loop is going to be necessary, I think. However, you can use split to make the splitting itself a numpy-optimized operation.
Code:
import numpy as np
ray = [1,22,33,42,51], [61,71,812,92,103], [113,121,132,143,151], [16,172,183,19,201]
ray = np.asarray(ray)
type(ray)
ray[np.ix_([-2:],[3:4])]
I'd like to use index slicing to get a subarray consisting of the last two rows and the 3rd/4th columns, but my current code (above) produces a syntax error. I'd also like to sum each column. What am I doing wrong? I cannot post a picture because I need at least 10 reputation points.
So you want to make a slice of an array. The most straightforward way to do it is... slicing:
slice = ray[-2:,3:]
or if you want it explicitly
slice = ray[-2:,3:5]
See it explained in Understanding slicing
But if you do want to use np.ix_ for some reason, you need
slice = ray[np.ix_([-2,-1],[3,4])]
You can't use : here, because [] here doesn't make a slice; it constructs a list, and you have to specify explicitly every row number and every column number you want in the result. If there are too many consecutive indices, you may use range:
slice = ray[np.ix_(range(-2, 0),range(3, 5))]
And to sum each column:
slice.sum(0)
0 means you want to reduce the 0th dimension (rows) by summation and keep the other dimensions (columns in this case).
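For instance, with the ray defined above, the slice and its column sums come out as:

slice = ray[-2:, 3:]
print(slice)         # [[143 151]
                     #  [ 19 201]]
print(slice.sum(0))  # [162 352]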
I am brute force calculating the shortest distance from one point to many others on a 2D plane with data coming from pandas dataframes using df['column'].to_numpy().
Currently, I am doing this using nested for loops on numpy arrays to fill up a list, taking the minimum value of that list, and storing that value in another list.
Checking 1000 points (from df_point) against 25,000 (from df_compare) takes about one minute, as this is understandably an inefficient process. My code is below.
point_x = df_point['x'].to_numpy()
compare_x = df_compare['x'].to_numpy()
point_y = df_point['y'].to_numpy()
compare_y = df_compare['y'].to_numpy()
dumarr = []
minvals = []
# Brute force calculate the closest point by using the Pythagorean theorem,
# comparing each point to every other point
for k in range(len(point_x)):
    for i, j in np.nditer([compare_x, compare_y]):
        dumarr.append((point_x[k] - i)**2 + (point_y[k] - j)**2)
    minvals.append(df_compare['point_name'][dumarr.index(min(dumarr))])
    # Clear dummy array (otherwise it will continuously append)
    dumarr = []
This isn't particularly pythonic. Is there a way to do this with vectorization, or at least without using nested for loops?
The approach is to create a 1000 x 25000 matrix, and then find the indices of the row minimums.
# squared distances for all combinations (1000x25000 matrix);
# the square root isn't needed just to find the minimum
dum_arr = (point_x[:, None] - compare_x)**2 + (point_y[:, None] - compare_y)**2
# indices of minimums along rows
idx = np.argmin(dum_arr, axis=1)
# Not sure what is needed from the indices, this get the values
# from `point_name` dataframe using found indices
min_vals = df_compare['point_name'].iloc[idx]
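One thing to keep in mind: the intermediate matrix holds 1000 x 25000 float64 values, roughly 200 MB, so for much larger inputs you may want to process the points in chunks.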
I'm going to give you the approach:

1. Create a DataFrame with columns pointID, CoordX, CoordY.
2. Create a secondary DataFrame shifted by an offset of 1 (oldDF.iloc[pointIDx] = newDF.iloc[pointIDx - 1]).
3. Loop this offset value from 1 up to the number of coordinates - 1.
4. tempDF["Euclid Dist"] = sqrt(square(oldDf["CoordX"] - newDF["CoordX"]) + square(oldDf["CoordY"] - newDF["CoordY"]))
5. Append this tempDF to a list.

Reasons why this will be faster:

- Only one loop, iterating the offset from 1 to the number of coordinates - 1
- Vectorization is taken care of by step 4 (see the sketch below)
- numpy's square root and square functions are used for best performance
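A rough sketch of those steps, assuming a DataFrame with the column names above (the data here is made up for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame({'pointID': [0, 1, 2, 3],
                   'CoordX': [0.0, 3.0, 1.0, 4.0],
                   'CoordY': [0.0, 4.0, 1.0, 2.0]})

dists = []
for offset in range(1, len(df)):
    shifted = df.shift(-offset)  # pairs each point with the one `offset` rows later
    d = np.sqrt(np.square(df['CoordX'] - shifted['CoordX']) +
                np.square(df['CoordY'] - shifted['CoordY']))
    dists.append(d)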
Instead of computing the full distance to find the closest point, you could try finding the closest value in the x and y directions separately and then comparing those two to find which is closer, using the built-in min function like the top answer from this question:
min(myList, key=lambda x:abs(x-myNumber))
from list of integers, get number closest to a given value
EDIT:
Your loop would end up something like this if you do it all in one function call (the key takes a single candidate point, so the compare coordinates are zipped into pairs). Also, I'm not sure whether min will end up looping through the compare arrays in a way that takes the same amount of time as your current code:
for k, m in np.nditer([point_x, point_y]):
    closest = min(zip(compare_x, compare_y), key=lambda p: (p[0] - k)**2 + (p[1] - m)**2)
Another alternative could be to pre-compute the distance from (0,0) or another point like (-1000,1000) for all the points in the compare array, sort the compare array based on that, then only check points with a similar distance from the reference.
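A rough sketch of that last idea (this is a heuristic: tol is a made-up window size you would have to tune, and by the triangle inequality the true nearest neighbour is only guaranteed to fall inside the window if it lies within tol of the query point):

# distance of every compare point from the reference, precomputed once
ref_dist = np.sqrt(compare_x**2 + compare_y**2)
order = np.argsort(ref_dist)
ref_sorted = ref_dist[order]

# for one query point, only examine compare points whose reference
# distance lies within +/- tol of the query's reference distance
tol = 10.0  # hypothetical window size
d = np.sqrt(point_x[0]**2 + point_y[0]**2)
lo, hi = np.searchsorted(ref_sorted, [d - tol, d + tol])
candidates = order[lo:hi]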
Here’s an example using scipy cdist, which is ideal for this type of problem:
import numpy as np
from scipy.spatial.distance import cdist
point = np.array([[1, 2], [3, 5], [4, 7]])
compare = np.array([[3, 2], [8, 5], [4, 1], [2, 2], [8, 9]])
# create 3x5 distance matrix
dm = cdist(point, compare)
# get row-wise mins
mins = dm.min(axis=1)
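If you also need to know which compare point is closest (as in the original loop), argmin on the same distance matrix gives the column indices:

# index of the closest compare point for each point
idx = dm.argmin(axis=1)
# these indices could then look up names, e.g. df_compare['point_name'].iloc[idx]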
Is there any way to basically take a column of a numpy array and, whenever the absolute value is greater than a number, set the value to that signed number?
i.e.
for val in col:
    if abs(val) > max:
        val = (signed) max
I know this can be done by looping and such, but I was wondering if there is a cleaner/built-in way to do it.
I see there is something like
arr[arr > 255] = x
which is kind of what I want, but I want to do this by column instead of on the whole array. As a bonus, maybe a way to do absolute values instead of having to do two separate operations for positive and negative.
The other answer is good, but it doesn't get you all the way there. Frankly, this is somewhat of an RTFM situation. But you'd be forgiven for not grokking the Numpy indexing docs on your first try, because they are dense and the data model will be alien if you are coming from a more traditional programming environment. You will have to use np.clip on the column you want to clip, like so:
x[:,2] = np.clip(x[:,2], 0, 255)
This applies np.clip to column 2 of the array (the third column), "slicing" down all rows, then reassigns the result to that column. The : is Python syntax meaning "give me all elements of an indexable sequence".
More generally, you can use the boolean subsetting index that you discovered in the same fashion, by slicing across rows and selecting the desired columns:
x[x[:,2] > 255, 2] = -1
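For the bonus part of the question (capping by absolute value), clipping with symmetric bounds handles both signs in one call:

x[:,2] = np.clip(x[:,2], -255, 255)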
Try calling clip on your numpy array:
import numpy as np
values = np.array([-3,-2,-1,0,1,2,3])
values.clip(-2,2)
Out[292]:
array([-2, -2, -1, 0, 1, 2, 2])
Maybe it's a little late, but I think this is a good option:
import numpy as np
values = np.array([-3,-2,-1,0,1,2,3])
values = np.clip(values,-2,2)
I have an array of integers.
data = [10,20,30,40,50,60,70,80,90,100]
I want to extract a range of integers from the array and get a smaller array.
data_extracted = [20,30,40]
I tried numpy.take.
data = [10,20,30,40,50,60,70,80,90,100]
start = 1 # index of starting data entry (20)
end = 3 # index of ending data entry (40)
data_extracted = np.take(data,[start:end])
I get a syntax error pointing to the : in numpy.take.
Is there a better way to use numpy.take to store part of an array in a separate array?
You can directly slice the list (note that the end of a slice is exclusive, so to include index 3 you slice up to 4):
import numpy as np
data = [10,20,30,40,50,60,70,80,90,100]
data_extracted = np.array(data[1:4])
Also, you do not need to use numpy.array; you could just store the data in another list:
data_extracted = data[1:4]
If you want to use numpy.take, you have to pass it a list of the desired indices as second argument:
import numpy as np
data = [10,20,30,40,50,60,70,80,90,100]
data_extracted = np.take(data, [1, 2, 3])
I do not think numpy.take is needed for this application though.
You ought to just use a slice to get a range of indices; there is no need for numpy.take, which is intended as a shortcut for fancy indexing.
data_extracted = data[1:4]
As others have mentioned, you can simply slice in this case. However, if you need to use np.take because e.g. the axis you're slicing over is variable, you might try:
axis = 0
np.take(data, range(1, 4), axis=axis)
Note: this might be slower than:
data_extracted = data[1:4]
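For example, on a 2-D array the same call selects columns instead of rows when axis=1 (a small illustration, not part of the original answer):

import numpy as np
arr = np.arange(12).reshape(3, 4)
print(np.take(arr, range(1, 3), axis=1))
# [[ 1  2]
#  [ 5  6]
#  [ 9 10]]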
I have a 2d array with dimensions array[x][9], where x varies because the data is read from a file. I want to find the sum of each column of the array, but for 24 rows at a time, and put the results into a new array; the equivalent of sum(array2[0:24]) but for a 2d array. Is there a special syntax I just don't know about, or do I have to do it manually? I know if it was a 1d array I could iterate through it by doing
for x in range(len(array2) // 24):
    total.append(sum(array2[x*24:(x+1)*24]))  # so I get an array of the sums
What is the equivalent for a 2d array, doing it column by column? I can imagine doing it by storing each column in its own separate 1d array and then finding the sums, or with a mess of for and while loops, neither of which sounds even slightly elegant.
It sounds like you perhaps are working with time series data, with a file containing hourly values and you want a daily sum (hence the 24). The pandas library will do this really nicely:
Suppose you have your data in data.csv:
import pandas
df = pandas.read_csv('data.csv')
If one of your columns was a timestamp, you could use that, but if you only have raw data, you can create a time index:
df.index = pandas.date_range(pandas.Timestamp.today().date(),
                             periods=df.shape[0], freq='H')
Now the summing of all columns on daily basis is very easy:
daily = df.resample('D').sum()
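If you'd rather not build a time index at all, grouping on integer division of the row number gives the same 24-row sums (assuming the default RangeIndex):

daily = df.groupby(df.index // 24).sum()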
You can use zip to transpose your array and use a comprehension to sum each column separately:
>>> array = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
>>> [sum(a) for a in zip(*array)]
[111, 222, 333]
Please try this:
import math

step = 24
x = len(a)  # x is the number of rows in a
# get the number of iterations you need to do
n = int(math.ceil(float(x) / step))
# transpose each block of `step` rows with zip and sum its columns
new_a = [[sum(col) for col in zip(*a[i * step:(i + 1) * step])]
         for i in range(n)]
If x is not a multiple of 24, then the last row in new_a will have the sum of the remaining rows (of which there will be fewer than 24).
This also assumes that the values in a are numbers, so I have not done any conversions.
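If the data is in a numpy array and the number of rows happens to be a multiple of 24, a reshape avoids the Python loop entirely (a sketch under that assumption):

import numpy as np

a = np.arange(48 * 9).reshape(48, 9)  # stand-in for the file data
block_sums = a.reshape(-1, 24, a.shape[1]).sum(axis=1)  # shape (2, 9): per-column sums of each 24-row block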