I have a 2D array with dimensions array[x][9], x because it's reading from a file of varying length. I want to find the sum of each column of the array, but for 24 rows at a time, and put the results into a new array; equivalent to sum(array2[0:24]) but for a 2D array. Is there special syntax I just don't know about, or do I have to do it manually? I know that if it was a 1D array I could iterate through it by doing
total = []
for x in range(len(array2) // 24):
    total.append(sum(array2[x * 24:(x + 1) * 24]))  # so I get an array of the sums
What is the equivalent for a 2D array, doing it column by column? I can imagine storing each column in its own separate 1D array and then finding the sums, or writing a mess of for and while loops, but neither of those sounds even slightly elegant.
It sounds like you perhaps are working with time series data, with a file containing hourly values and you want a daily sum (hence the 24). The pandas library will do this really nicely:
Suppose you have your data in data.csv:
import pandas
df = pandas.read_csv('data.csv')
If one of your columns was a timestamp, you could use that, but if you only have raw data, you can create a time index:
df.index = pandas.date_range(pandas.Timestamp.today().date(),
                             periods=df.shape[0], freq='H')
Now the summing of all columns on daily basis is very easy:
daily = df.resample('D').sum()
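If the file does have a timestamp column, as mentioned above, that variant might look like this (the column name 'timestamp' is an assumption, not something from the question):
import pandas
# parse the assumed 'timestamp' column and use it directly as the index
df = pandas.read_csv('data.csv', parse_dates=['timestamp'], index_col='timestamp')
daily = df.resample('D').sum()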
You can use zip to transpose your array and use a comprehension to sum each column separately:
>>> array = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
>>> [sum(a) for a in zip(*array)]
[111, 222, 333]
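Combined with the 24-row chunking from the question, a minimal sketch (a smaller step is used here so the example stays short; use 24 for the hourly data):
>>> array = [[1, 2, 3], [10, 20, 30], [100, 200, 300], [1, 1, 1]]
>>> step = 2
>>> [[sum(col) for col in zip(*array[i:i + step])] for i in range(0, len(array), step)]
[[11, 22, 33], [101, 201, 301]]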
Please try this:
import math

step = 24
x = len(a)  # x is the number of rows in a
# get the number of iterations you need to do
n = int(math.ceil(float(x) / step))
# sum each column within each block of `step` rows
new_a = [[sum(col) for col in zip(*a[i * step:(i + 1) * step])]
         for i in range(n)]
If x is not a multiple of 24, then the last row in new_a will hold the sums of the remaining rows (of which there will be fewer than 24).
This also assumes that the values in a are numbers so I have not done any conversions.
I'm trying to write some code with numpy where it outputs the maximum value between indexes. I think argmax could be useful here. However, I do not know how to use slices without a for loop in Python. If there is a pandas function for this, it could be usable too. I want to make the computation as fast as possible.
list_ = np.array([9887.89, 9902.99, 9902.99, 9910.23, 9920.79, 9911.34, 9920.01, 9927.51, 9932.3, 9932.33, 9928.87, 9929.22, 9929.22, 9935.24, 9935.24, 9935.26, 9935.26, 9935.68, 9935.68, 9940.5])
indexes = np.array([0, 5, 10, 19])
Expected result:
Max number between index(0 - 5): 9920.79 at index 5
Max number between index(5 - 10): 9932.33 at index 10
Max number between index(10 - 19): 9940.5 at index 19
You can use reduceat directly on your array without the need to slice/split it:
np.maximum.reduceat(list_,indexes[:-1])
output:
array([9920.79, 9932.33, 9940.5 ])
Assuming that the first (zero) index and the last index are specified in the indexes array,
import numpy as np
list_ = np.array([9887.89, 9902.99, 9902.99, 9910.23, 9920.79, 9911.34, 9920.01, 9927.51, 9932.3, 9932.33, 9928.87, 9929.22, 9929.22, 9935.24, 9935.24, 9935.26, 9935.26, 9935.68, 9935.68, 9940.5])
indexes = np.array([0, 5, 10, 19])
chunks = np.split(list_, indexes[1:-1])
print([c.max() for c in chunks])
max_ind = [c.argmax() for c in chunks]
print(max_ind + indexes[:-1])
It's not necessary that each chunk will have the same size with an arbitrary specification of indices, so the vectorization benefits of numpy are going to be lost one way or another (you can't have a numpy array whose elements are of different sizes in memory and still keep all the benefits of vectorization).
At least one for loop is going to be necessary, I think. However, you can use split to make the splitting itself a numpy-optimized operation.
I am brute force calculating the shortest distance from one point to many others on a 2D plane with data coming from pandas dataframes using df['column'].to_numpy().
Currently, I am doing this using nested for loops on numpy arrays to fill up a list, taking the minimum value of that list, and storing that value in another list.
Checking 1000 points (from df_point) against 25,000 (from df_compare) takes about one minute, as this is understandably an inefficient process. My code is below.
point_x = df_point['x'].to_numpy()
compare_x = df_compare['x'].to_numpy()
point_y = df_point['y'].to_numpy()
compare_y = df_compare['y'].to_numpy()

dumarr = []
minvals = []
# Brute force calculate the closest point by using the Pythagorean theorem, comparing each
# point to every other point
for k in range(len(point_x)):
    for i, j in np.nditer([compare_x, compare_y]):
        dumarr.append((point_x[k] - i)**2 + (point_y[k] - j)**2)
    minvals.append(df_compare['point_name'][dumarr.index(min(dumarr))])
    # Clear dummy array (otherwise it will keep appending)
    dumarr = []
This isn't particularly pythonic. Is there a way to do this with vectorization, or at least without nested for loops?
The approach is to create a 1000 x 25000 matrix, and then find the indices of the row minimums.
# distances for all combinations (1000x25000 matrix)
dum_arr = (point_x[:, None] - compare_x)**2 + (point_y[:, None] - compare_y)**2
# indices of minimums along rows
idx = np.argmin(dum_arr, axis=1)
# Not sure what is needed from the indices; this gets the values
# from the `point_name` column using the found indices
min_vals = df_compare['point_name'].iloc[idx]
I'm gonna give you the approach; a rough sketch follows after the steps:
1. Create a DataFrame with columns pointID, CoordX, CoordY.
2. Create a secondary DataFrame with an offset value of 1 (oldDF.iloc[pointIDx] = newDF.iloc[pointIDx]-1).
3. This offset value needs to be looped from 1 till the number of coordinates - 1.
4. tempDF["Euclid Dist"] = sqrt(square(oldDf["CoordX"] - newDF["CoordX"]) + square(oldDf["CoordY"] - newDF["CoordY"]))
5. Append this tempDF to a list.
Reasons why this will be faster:
- Only one loop, to iterate the offset from 1 till the number of coordinates - 1
- Vectorization has been taken care of by step 4
- numpy's square root and square functions are used to ensure the best results
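Read literally, these steps pair each point with the point `offset` rows later in the same frame; a minimal sketch of that reading (column names taken from the steps, data made up purely for illustration):
import numpy as np
import pandas as pd

# made-up coordinates; the answer itself does not include any data
df = pd.DataFrame({'pointID': range(6),
                   'CoordX': np.random.rand(6),
                   'CoordY': np.random.rand(6)})

dist_frames = []
for offset in range(1, len(df)):              # step 3: offsets 1 .. number of coordinates - 1
    shifted = df.shift(-offset)               # step 2: the frame offset by `offset` rows
    tempDF = df[['pointID']].copy()
    tempDF['otherID'] = shifted['pointID']
    # step 4: vectorized Euclidean distance between paired rows
    tempDF['Euclid Dist'] = np.sqrt(np.square(df['CoordX'] - shifted['CoordX'])
                                    + np.square(df['CoordY'] - shifted['CoordY']))
    dist_frames.append(tempDF.dropna())       # step 5: append to a list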
Instead of finding the closest point directly, you could try finding the closest in the x and y directions separately, and then comparing those two to find which is closer, using the built-in min function like the top answer from this question:
min(myList, key=lambda x:abs(x-myNumber))
from list of integers, get number closest to a given value
EDIT:
Your loop would end up something like this if you do it all in one function call. Also, I'm not sure if the min function will end up looping through the compare arrays in a way that would take the same amount of time as your current code:
for k, m in np.nditer([point_x, point_y]):
    closest = min(zip(compare_x, compare_y),
                  key=lambda p: (p[0] - k)**2 + (p[1] - m)**2)
Another alternative could be to pre-compute the distance from (0,0) or another point like (-1000,1000) for all the points in the compare array, sort the compare array based on that, then only check points with a similar distance from the reference.
Here’s an example using scipy cdist, which is ideal for this type of problem:
import numpy as np
from scipy.spatial.distance import cdist
point = np.array([[1, 2], [3, 5], [4, 7]])
compare = np.array([[3, 2], [8, 5], [4, 1], [2, 2], [8, 9]])
# create 3x5 distance matrix
dm = cdist(point, compare)
# get row-wise mins
mins = dm.min(axis=1)
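If, as in the question, the name of the nearest compare point is wanted rather than the distance, the row-wise argmin can index back into the other frame (a sketch assuming a df_compare with a point_name column, as in the question):
# index of the nearest compare point for each point
idx = dm.argmin(axis=1)
nearest_names = df_compare['point_name'].iloc[idx].to_numpy()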
I have a text file with 93 columns and 1699 rows that I have imported into Python. The first three columns do not contain data that is necessary for what I'm currently trying to do. Within each column, I need to divide each element (aka row) in the column by all of the other elements (rows) in that same column. The result I want is an array of 90 elements, where each element holds 1699 sub-arrays of 1699 values each.
A more detailed description of what I'm attempting: I begin with Column3. At Column3, Row1 is to be divided by all the other rows (including the value in Row1) within Column3. That will give Row1 1699 calculations. Then the same process is done for Row2 and so on until Row1699. This gives Column3 1699x1699 calculations. When the calculations of all of the rows in Column 3 have completed, then the program moves on to do the same thing in Column 4 for all of the rows. This is done for all 90 columns which means that for the end result, I should have 90x1699x1699 calculations.
My code as it currently stands is:
import numpy as np
from glob import glob
fnames = glob("NIR_data.txt")
arrays = np.array([np.loadtxt(f, skiprows=1) for f in fnames])
NIR_values = np.concatenate(arrays)
NIR_band = NIR_values.T
C_values = []
for i in range(3, len(NIR_band)):
    for j in range(0, len(NIR_band[3])):
        loop_list = NIR_band[i][j] / NIR_band[i, :]
        C_values.append(loop_list)
What it produces is an array of dimension 1699x1699; each individual array holds the results from the row calculations. Another complaint is that the code takes ages to run. So I have two questions: is it possible to create the type of array I'd like to work with, and is there a faster way of coding this calculation?
Dividing each of the numbers in a given column by each of the other values in the same column can be accomplished in one operation as follows.
result = a[:, numpy.newaxis, :] / a[numpy.newaxis, :, :]
Because looping over the elements happens in the optimized binary depths of numpy, this is as fast as Python is ever going to get for this operation.
If a.shape was [1699,90] to begin with, then the result will have shape [1699,1699,90]. Assuming dtype=float64, that means you will need nearly 2 GB of memory available to store the result.
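A tiny demo of the same broadcast on made-up data, just to show the shapes involved:
import numpy as np
a = np.arange(1.0, 13.0).reshape(4, 3)             # small stand-in for the (1699, 90) data
result = a[:, np.newaxis, :] / a[np.newaxis, :, :]
print(result.shape)                                # (4, 4, 3): result[i, j, k] == a[i, k] / a[j, k]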
First let's focus on the load:
arrays = np.array([np.loadtxt(f, skiprows=1) for f in fnames])
NIR_values = np.concatenate(arrays)
Your text talks about loading a file and manipulating it, but this clip loads multiple files and joins them.
My first change is to collect the arrays in a list, not another array:
alist = [np.loadtxt(f, skiprows=1) for f in fnames]
If you want to skip some columns, look at using the usecols parameter. That may save you work later.
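For instance, if the goal is to drop the first three of the 93 columns at load time (a sketch; the exact column range depends on the file):
alist = [np.loadtxt(f, skiprows=1, usecols=range(3, 93)) for f in fnames]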
The elements of alist will now be 2d arrays (of floats). If they are matching sizes (N,M), they can be joined in various ways. If there are n files, then
arrays = np.array(alist) # (n,N,M) array
arrays = np.concatenate(alist, axis=0) # (n*N, M) array
# similarly for axis=1
Your code does the same, but potentially confuses steps:
In [566]: arrays = np.array([np.ones((3,4)) for i in range(5)])
In [567]: arrays.shape
Out[567]: (5, 3, 4) # (n,N,M) array
In [568]: NIR_values = np.concatenate(arrays)
In [569]: NIR_values.shape
Out[569]: (15, 4) # (n*N, M) array
NIR_band is now (4,15), and its len() is .shape[0], the size of the 1st dimension. len(NIR_band[3]) is .shape[1], the size of the 2nd dimension.
You could skip the columns of NIR_values with NIR_values[:,3:].
I get lost in the rest of the text description.
The NIR_band[i][j]/NIR_band[i,:], I would rewrite as NIR_band[i,j]/NIR_band[i,:]. What's the purpose of that?
As for your subject line, Storing multiple arrays within multiple arrays within an array - that sounds like making a 3d or 4d array. arrays is 3d, NIR_values is 2d.
Creating a (90,1699,1699) from a (93,1699) will probably involve (without iteration) a calculation analogous to:
In [574]: X = np.arange(13*4).reshape(13,4)
In [575]: X.shape
Out[575]: (13, 4)
In [576]: (X[3:,:,None]+X[3:,None,:]).shape
Out[576]: (10, 4, 4)
The last dimension is expanded with None (np.newaxis), and the two versions are broadcast against each other. np.outer does the multiplication version of this calculation.
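Applied to the question's data, a sketch of the division version (assuming NIR_values is the (1699, 93) array built above and the first three columns are dropped):
data = NIR_values[:, 3:]                        # shape (1699, 90)
C = data.T[:, :, None] / data.T[:, None, :]     # shape (90, 1699, 1699); C[k, i, j] == data[i, k] / data[j, k]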
I am missing something simple here. I cannot figure out how to pass multiple columns of a 2D array (matrix) and output them as a single column array.
Here is a simple example:
import numpy as np
Z = lambda x1,x2: (x1*x2 - 3)^2 + 1
# a sample 2D array with 15 rows and 2 columns
x= np.arange(30).reshape((15,2))
answer = [Z(i[0],i[1]) for i in x]
The last line of code is where my problem lies. I would like the output to be a single column array with 15 rows.
As a final note: My code only uses 2 columns as inputs. If it could be further expanded to a flexible number of columns, it would be greatly appreciated.
Could you make your last line:
answer = np.array([Z(i[0],i[1]) for i in x]).reshape(15,1)
which gives:
array([[ -2],
[ 0],
[ 18],
[ 36],
[ 70],
[104],
[154],
[204],
[270],
[336],
[418],
[500],
[598],
[696],
[810]])
You could do
import numpy as np
Z = lambda data, i, j: ((data[:,i]*data[:,j] - 3)**2 + 1)[:,np.newaxis]
# a sample 2D array with 15 rows and 2 columns
x= np.arange(30).reshape((15,2))
answer = Z(x,0,1)
so maybe you don't need the lambda function after all
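And if, as the question's final note asks, a flexible number of columns is wanted, one possible sketch multiplies whichever columns are passed in (this generalisation is an assumption about the intended formula):
import numpy as np

def Z(data, *cols):
    # multiply the selected columns together, then apply the same formula as above
    prod = np.multiply.reduce([data[:, c] for c in cols], axis=0)
    return ((prod - 3) ** 2 + 1)[:, np.newaxis]

x = np.arange(30).reshape((15, 2))
answer = Z(x, 0, 1)   # shape (15, 1), same result as above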
I have a NumPy matrix that contains mostly non-zero values, but occasionally will contain a zero value. I need to be able to:
Count the non-zero values in each row and put that count into a variable that I can use in subsequent operations, perhaps by iterating through row indices and performing the calculations during the iterative process.
Count the non-zero values in each column and put that count into a variable that I can use in subsequent operations, perhaps by iterating through column indices and performing the calculations during the iterative process.
For example, one thing I need to do is to sum each row and then divide each row sum by the number of non-zero values in each row, reporting a separate result for each row index. And then I need to sum each column and then divide the column sum by the number of non-zero values in the column, also reporting a separate result for each column index. I need to do other things as well, but they should be easy after I figure out how to do the things that I am listing here.
The code I am working with is below. You can see that I am creating an array of zeros and then populating it from a csv file. Some of the rows will contain values for all the columns, but other rows will still have some zeros remaining in some of the last columns, thus creating the problem described above.
The last five lines of the code below are from another posting on this forum. These last five lines of code return a printed list of row/column indices for the zeros. However, I do not know how to use that resulting information to create the non-zero row counts and non-zero column counts described above.
ANOVAInputMatrixValuesArray=zeros([len(TestIDs),9],float)
j=0
for j in range(0,len(TestIDs)):
    TestID=str(TestIDs[j])
    ReadOrWrite='Read'
    fileName=inputFileName
    directory=GetCurrentDirectory(arguments that return correct directory)
    inputfile=open(directory,'r')
    reader=csv.reader(inputfile)
    m=0
    for row in reader:
        if m<9:
            if row[0]!='TestID':
                ANOVAInputMatrixValuesArray[(j-1),m]=row[2]
                m+=1
    inputfile.close()

IndicesOfZeros = indices(ANOVAInputMatrixValuesArray.shape)
locs = IndicesOfZeros[:,ANOVAInputMatrixValuesArray == 0]
pts = hsplit(locs, len(locs[0]))
for pt in pts:
    print(', '.join(str(p[0]) for p in pt))
Can anyone help me with this?
import numpy as np
a = np.array([[1, 0, 1],
[2, 3, 4],
[0, 0, 7]])
columns = (a != 0).sum(0)
rows = (a != 0).sum(1)
The expression (a != 0) is a boolean array of the same shape as the original a, and it contains True for all non-zero elements.
The .sum(x) call sums the elements over axis x, and the sum of True/False elements is the number of True elements.
The variables columns and rows contain the number of non-zero (element != 0) values in each column/row of your original array:
columns = np.array([2, 1, 3])
rows = np.array([2, 3, 1])
EDIT: The whole code could look like this (with a few simplifications in your original code):
ANOVAInputMatrixValuesArray = zeros([len(TestIDs), 9], float)
for j, TestID in enumerate(TestIDs):
    ReadOrWrite = 'Read'
    fileName = inputFileName
    directory = GetCurrentDirectory(arguments that return correct directory)
    # use directory or filename to get the CSV file?
    with open(directory, 'r') as csvfile:
        ANOVAInputMatrixValuesArray[j,:] = loadtxt(csvfile, comments='TestId', delimiter=';', usecols=(2,))[:9]
nonZeroCols = (ANOVAInputMatrixValuesArray != 0).sum(0)
nonZeroRows = (ANOVAInputMatrixValuesArray != 0).sum(1)
EDIT 2:
To get the mean value of all columns/rows, use the following:
colMean = a.sum(0) / (a != 0).sum(0)
rowMean = a.sum(1) / (a != 0).sum(1)
What do you want to do if there are no non-zero elements in a column/row? Then we can adapt the code to solve such a problem.
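One possible convention, sketched here, is to leave the mean at 0 where a row or column has no non-zero entries (whether that is appropriate is exactly the question raised above); this reuses the small a array from earlier:
colCounts = (a != 0).sum(0)
rowCounts = (a != 0).sum(1)
# divide by max(count, 1) so empty columns/rows give 0 instead of a nan/warning
colMean = np.where(colCounts > 0, a.sum(0) / np.maximum(colCounts, 1), 0)
rowMean = np.where(rowCounts > 0, a.sum(1) / np.maximum(rowCounts, 1), 0)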
A fast way to count nonzero elements per row in a scipy sparse matrix m is:
np.diff(m.tocsr().indptr)
The indptr attribute of a CSR matrix indicates the indices within the data corresponding to the boundaries between rows, so calculating the difference between consecutive entries gives the number of non-zero elements in each row.
Similarly, for the number of nonzero elements in each column, use:
np.diff(m.tocsc().indptr)
If the data is already in the appropriate form, these will run in O(m.shape[0]) and O(m.shape[1]) respectively, rather than O(m.getnnz()) in Marat and Finn's solutions.
If you need both row and column nonzero counts, and, say, m is already a CSR, you might use:
row_nonzeros = np.diff(m.indptr)
col_nonzeros = np.bincount(m.indices)
which is not asymptotically faster than first converting to CSC (which is O(m.getnnz())) to get col_nonzeros, but is faster because of implementation details.
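A small sketch of both counts on a concrete CSR matrix (data made up for illustration):
import numpy as np
from scipy.sparse import csr_matrix

m = csr_matrix(np.array([[1, 0, 1],
                         [2, 3, 4],
                         [0, 0, 7]]))
row_nonzeros = np.diff(m.indptr)                              # array([2, 3, 1])
col_nonzeros = np.bincount(m.indices, minlength=m.shape[1])   # array([2, 1, 3])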
A faster way is to clone your matrix, keeping ones instead of the real values, and then just sum up by rows or columns:
X_clone = X.tocsc()
X_clone.data = np.ones( X_clone.data.shape )
NumNonZeroElementsByColumn = X_clone.sum(0)
NumNonZeroElementsByRow = X_clone.sum(1)
That worked 50 times faster for me than Finn Årup Nielsen's solution (1 second against 53).
Edit:
Perhaps you will need to translate NumNonZeroElementsByColumn into a 1-dimensional array with
np.array(NumNonZeroElementsByColumn)[0]
For sparse matrices, use the getnnz() function supported by CSR/CSC matrix.
E.g.
a = scipy.sparse.csr_matrix([[0, 1, 1], [0, 1, 0]])
a.getnnz(axis=0)
array([0, 2, 1])
(a != 0) does not work for sparse matrices (scipy.sparse.lil_matrix) in my present version of scipy.
For sparse matrices I did:
(i, j) = X.nonzero()
column_sums = np.zeros(X.shape[1])
for n in np.asarray(j).ravel():
    column_sums[n] += 1.
I wonder if there is a more elegant way.
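A possibly more elegant equivalent of the loop above, using bincount on the column indices (same assumptions about X; the result is an integer count rather than a float array):
import numpy as np

i, j = X.nonzero()
column_sums = np.bincount(np.asarray(j).ravel(), minlength=X.shape[1])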