I want to create a large NumPy array (L) to hold the result of some operations. However, I can only compute one part of L at a time. So, to build L, I create an array of zeros of shape L.shape and fill it using these parts, or subarrays. I'm currently able to do it, but in a very inefficient way.
If the shape of L is (x, y, z, a, b, c), then I'm creating NumPy arrays of shape (x, y, z, 1, b, c), which correspond to the different parts of L, from part 0 to part a-1. I'm forced to create arrays of this particular shape because of the operations involved.
In order to fill the array of zeros, I'm creating one Pandas DataFrame per subarray (or part). Each dataframe contains the indices and the values of one subarray of shape (x, y, z, 1, b, c), like this:
index0 | index1 | index2 | index3 | index4 | index5 | value
------------------------------------------------------------
0 | 0 | 0 | 0 | 0 | 0 | 434.2
0 | 0 | 0 | 0 | 0 | 1 | 234.5
..., and so on.
Because of the shape (x, y, z, 1, b, c), index3 initially contains only zeros. So there's a change to make before the values can be inserted at the right indices of L: the index3 column must hold only 0s for the first subarray, only 1s for the second subarray, and so on. In other words, df['index3'] = subarray_number, where subarray_number runs from 0 to a-1. Only the index3 column is changed.
So, the fifth subarray represented as a dataframe would look like this:
index0 | index1 | index2 | index3 | index4 | index5 | value
------------------------------------------------------------
0 | 0 | 0 | 4 | 0 | 0 | 434.2
0 | 0 | 0 | 4 | 0 | 1 | 234.5
...
x-1 | y-1 | z-1 | 4 | b-1 | c-1 | 371.8
After this, I only have to iterate over the rows of each dataframe using iterrows and assign each value to the corresponding index of the array of zeros, like this:
for subarray_df in subarrays_dfs:
    for i, row in subarray_df.iterrows():
        index0, index1, index2, index3, index4, index5, value = row
        L[index0][index1][index2][index3][index4][index5] = value
The problem is that converting the arrays to dataframes and then assigning the values one by one is expensive, especially for large arrays. I would like to insert the subarrays in L directly without having to go through this intermediate step.
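For what it's worth, even the DataFrame route can avoid iterrows entirely: numpy accepts one integer array per axis, so all of a subarray's values can be assigned in a single vectorized step. A minimal sketch with made-up shapes and two rows of data:

```python
import numpy as np
import pandas as pd

# Tiny stand-ins for L and one subarray's DataFrame (shapes are made up).
L = np.zeros((2, 2, 2, 3, 2, 2))
subarray_df = pd.DataFrame({
    'index0': [0, 1], 'index1': [0, 1], 'index2': [0, 1],
    'index3': [2, 2], 'index4': [0, 1], 'index5': [0, 1],
    'value': [434.2, 234.5],
})

# One integer array per axis: numpy assigns every row's value at once.
cols = ['index0', 'index1', 'index2', 'index3', 'index4', 'index5']
idx = tuple(subarray_df[c].to_numpy() for c in cols)
L[idx] = subarray_df['value'].to_numpy()
```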
I tried using slices but the generated array is not the one I expect. This is what I'm doing:
L[:subarray.shape[0], :subarray.shape[1], :subarray.shape[2],
subarray_number, :subarray.shape[4], :subarray.shape[5]] = subarray
What would be the right way of using slices to fill L the way I need?
Thanks!
Your example is not very clear, but maybe you can adapt something from this code snippet. It looks to me like you are generating your L array, of shape (x, y, z, a, b, c), by computing a slices (one per value of the fourth axis), each of shape (x, y, z, b, c), equivalent to (x, y, z, 1, b, c). Let me know if I'm completely wrong.
import numpy as np

L = np.zeros((10, 10, 10, 2, 10, 10))  # shape (x, y, z, a, b, c)

def compute():
    return np.random.rand(10, 10, 10, 10, 10)  # shape (x, y, z, b, c)

for k in range(L.shape[3]):
    L[:, :, :, k, :, :] = compute()  # select slice of shape (x, y, z, b, c)
Basically, it computes a slice of the array (one part of it) and places it at the desired location.
One thing to note: an array of shape (x, y, z, a, b, c) can quickly get out of hand. For instance, I naively tried L = np.zeros((100, 100, 100, 5, 100, 100)), which would allocate about 373 GiB of RAM. Depending on the size of your data, maybe you could work on one slice at a time and store the others to disk while they are not in use?
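A minimal sketch of that store-to-disk idea using np.memmap, which keeps the big array in a file and only pages slices into RAM as they are touched (the filename and sizes here are illustrative):

```python
import numpy as np

shape = (10, 10, 10, 2, 10, 10)  # (x, y, z, a, b, c)
# File-backed array: the data lives in "L.dat", not in RAM.
L = np.memmap("L.dat", dtype=np.float64, mode="w+", shape=shape)

for k in range(shape[3]):
    # Compute one part and write it straight through to disk.
    L[:, :, :, k, :, :] = np.random.rand(10, 10, 10, 10, 10)

L.flush()  # make sure all pending writes reach the file
```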
Following my comment, here is a snippet to help you get a feel for this dimension problem:
import numpy as np
L = np.zeros((10, 10, 10))
L.shape # (10, 10, 10)
L[:, 0, :].shape # (10, 10)
L[:, 0:3, :].shape # (10, 3, 10)
The slice : selects everything, the slice x:y selects everything from x to y, and indexing with a specific integer k selects only that 'line'/'column' (a 2D analogy), thus returning an array of dimension n-1. In 2D, a line or column would be 1D.
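Applied back to the original question: the parts have shape (x, y, z, 1, b, c), so a plain integer index on the fourth axis produces a 5-d target that the 6-d part cannot broadcast into. Either keep the singleton axis with a k:k+1 slice, or drop it on both sides (sketch with small made-up dimensions):

```python
import numpy as np

L = np.zeros((3, 3, 3, 4, 3, 3))             # (x, y, z, a, b, c)
subarray = np.random.rand(3, 3, 3, 1, 3, 3)  # one part, singleton 4th axis
k = 2

# Option 1: the slice k:k+1 keeps the singleton axis, so shapes match exactly.
L[:, :, :, k:k+1, :, :] = subarray

# Option 2: an integer index drops the axis; drop it on the subarray too.
L[:, :, :, k, :, :] = subarray[:, :, :, 0, :, :]
```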
Related
I have a DataFrame like this:
N1 N2
0 a b
1 b f
2 c d
3 d a
4 e b
I want to get the indices of the values repeated between the two columns, together with the value itself.
From the example, I should get something like these short lists:
(value, idx(N1), idx(N2))
(a, 0, 3)
(b, 1, 0)
(b, 1, 4)
(d, 3, 2)
I have been able to do it with two for-loops, but for a half-million-row dataframe it took hours...
Use a numpy broadcasting comparison and then argwhere to find the indices where the values are equal:
import numpy as np
import pandas as pd

# make a broadcast comparison between the two columns
mat = df['N2'].values == df['N1'].values[:, None]
# find the indices where the values are True
where = np.argwhere(mat)
# select the values
values = df['N1'][where[:, 0]]
# create the DataFrame
res = pd.DataFrame(data=[[val, *row] for val, row in zip(values, where)],
                   columns=['values', 'idx_N1', 'idx_N2'])
print(res)
Output
values idx_N1 idx_N2
0 a 0 3
1 b 1 0
2 b 1 4
3 d 3 2
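A self-contained variant of the same answer (the example frame is reproduced so the snippet runs on its own); building the result columns directly from where avoids the Python-level list comprehension:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'N1': list('abcde'), 'N2': list('bfdab')})

# mat[i, j] is True when N1[i] == N2[j]
mat = df['N1'].to_numpy()[:, None] == df['N2'].to_numpy()
where = np.argwhere(mat)

# build the result columns directly from the index pairs
res = pd.DataFrame({
    'values': df['N1'].to_numpy()[where[:, 0]],
    'idx_N1': where[:, 0],
    'idx_N2': where[:, 1],
})
print(res)
```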
I have a matrix where each column is a different brand.
Each row is a different category.
I have a separate matrix which is the desired outcome.
I need to mix the columns of the first matrix, finding the optimal percentage of each, so that the result equals the second matrix.
First matrix:
C | 105 130 120
P |   1   3   5
F |   2   4   2
The goal is to design a mix that has these attributes:
Optimal:
C | 245
P |   6
F |   7
What formula is this?
If I understand you correctly, you are actually searching for the solution to a set of linear equations. Assuming you want to "mix" each of the columns of the matrix to get to the final target, you are actually looking for a vector x such that target = M @ x. The solution is to multiply by the inverse, x = inv(M) @ target. Using numpy, this translates to
import numpy

M = numpy.array([[105, 130, 120],
                 [1, 3, 5],
                 [2, 4, 2]])
target = numpy.array([[245], [6], [7]])
x = numpy.linalg.inv(M) @ target
x is
array([[0.11940299],
       [1.57462687],
       [0.23134328]])
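As an aside, numpy.linalg.solve is usually preferred over forming the inverse explicitly: it solves the same system with better numerical stability. The same numbers, as a quick check:

```python
import numpy as np

M = np.array([[105, 130, 120],
              [1, 3, 5],
              [2, 4, 2]])
target = np.array([245, 6, 7])

# Solve M @ x = target without computing inv(M).
x = np.linalg.solve(M, target)
print(x)  # approximately [0.1194, 1.5746, 0.2313]
```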
First of all, thank you for any support. This is my first published question, as usually my doubts are resolved by reading through other users' questions.
Here is my question: I have a number (n) of sets with common elements. These elements are usually added sequentially, creating new sets, although I do not have the sequence, and the sequence is what I am trying to find. It is not always perfect, and at some points I have to find the closest match, with some uncertainty, when the sequence is not 'perfect'.
I coded it using set theory: searching sequentially for the set that contains all the other sets, and when I do not reach the last set that way, I start over from the smallest set to the biggest.
I gave some thought to the topic and found what is, in theory, a more robust and generic approach. The idea is to build a square matrix with the n sets as row index (i) and the n sets as column index (j). The element (i, j) is equal to 1 when set j is contained in set i.
Here I have an example with sets A to G:
A={a, b, c, d1, d2, e, f};
B={b, c, d1, d2, e, f};
C={c, d1, d2, e, f};
D={d1, f, g};
E={d2, f, g};
F={f, g};
G={g};
If I create the matrix assuming sequence B, E, C, F, D, A, G, I would have:
B E C F D A G
B 1 1 1 1 1 0 1
E 0 1 0 1 0 0 1
C 0 1 1 1 1 0 1
F 0 0 0 1 0 0 1
D 0 0 0 1 1 0 1
A 1 1 1 1 1 1 1
G 0 0 0 0 0 0 1
I should get this matrix transformed into following matrix:
A B C D E F G
A 1 1 1 1 1 1 1
B 0 1 1 1 1 1 1
C 0 0 1 1 1 1 1
D 0 0 0 1 0 1 1
E 0 0 0 0 1 1 1
F 0 0 0 0 0 1 1
G 0 0 0 0 0 0 1
This shows one of the two possible sequences: A, B, C, D, E, F, G.
My first question is how you would recommend handling this matrix (which data type should I use, with typical functions to swap rows and columns).
And my second question is whether there is already a matrix transformation function for this task.
From my (small) experience, the most commonly used types for matrices are lists and numpy.ndarrays.
For column swaps in particular, I would recommend numpy. There are many array creation routines in numpy: you either give the list with the data explicitly, or you create an array based on the shape you want. Example:
>>> import numpy as np
>>> np.array([1, 2, 3])
array([1, 2, 3])
>>> np.array([[1, 2, 3], [1, 2, 3]])
array([[1, 2, 3],
[1, 2, 3]])
>>> np.zeros((2, 2))
array([[0., 0.],
[0., 0.]])
np.zeros accepts a shape as an argument (number of rows and columns for matrices). Of course, you can create arrays with as many dimensions as you want.
numpy is quite complex regarding indexing its arrays. For a matrix you have:
>>> a = np.arange(6).reshape(2, 3)
>>> a
array([[0, 1, 2],
[3, 4, 5]])
>>> a[0] # row indexing
array([0, 1, 2])
>>> a[1, 1] # element indexing
4
>>> a[:, 2] # column indexing
array([2, 5])
Hopefully the examples are self-explanatory. Regarding the column indexing, : means "all the values", so you specify a column index and the fact that you want all the values in that column.
For swapping rows and columns it's pretty short:
>>> a = np.arange(6).reshape(2, 3)
>>> a
array([[0, 1, 2],
[3, 4, 5]])
>>> a[[0, 1]] = a[[1, 0]] # row swapping
>>> a
array([[3, 4, 5],
[0, 1, 2]])
>>> a[:, [0, 2]] = a[:, [2, 0]] # column swapping
>>> a
array([[5, 4, 3],
[2, 1, 0]])
Here advanced indexing is used. Each dimension (called an axis by numpy) can accept a list of indices, so you can get 2 or more rows/columns at the same time from a matrix.
You don't have to ask for them in a certain order: numpy gives you the values in the order you ask for them.
Swapping rows is done by asking numpy for the two rows in reversed order and saving them back in their original positions. It actually mirrors the pythonic way of swapping values between two variables (although wrapped in a more complex frame):
a, b = b, a
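One caveat worth noting: that tuple-swap idiom does not carry over directly to numpy rows, because a[1] is a view into the array rather than a copy. The fancy-indexed form a[[0, 1]] = a[[1, 0]] works precisely because indexing with a list returns a copy. A small demonstration:

```python
import numpy as np

a = np.arange(6).reshape(2, 3)
a[0], a[1] = a[1], a[0]  # looks like a swap, but a[1] is a view...
print(a)                 # row 0's old data is lost: [[3 4 5], [3 4 5]]

b = np.arange(6).reshape(2, 3)
b[[0, 1]] = b[[1, 0]]    # fancy indexing copies, so this swap is safe
print(b)                 # [[3 4 5], [0 1 2]]
```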
Regarding matrix transformation, it depends on what you are looking for.
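For the triangular form asked about above, one possible sketch (an assumption on my part, not a ready-made transformation function): when a sequence exists, a set that contains more of the others has a larger row sum, so sorting rows and columns by descending row sum already produces the triangular matrix. Using the scrambled B, E, C, F, D, A, G matrix from the question:

```python
import numpy as np

# Containment matrix in the scrambled order B, E, C, F, D, A, G.
M = np.array([[1, 1, 1, 1, 1, 0, 1],
              [0, 1, 0, 1, 0, 0, 1],
              [0, 1, 1, 1, 1, 0, 1],
              [0, 0, 0, 1, 0, 0, 1],
              [0, 0, 0, 1, 1, 0, 1],
              [1, 1, 1, 1, 1, 1, 1],
              [0, 0, 0, 0, 0, 0, 1]])
labels = np.array(list("BECFDAG"))

# Sets containing more of the others come first.
order = np.argsort(-M.sum(axis=1), kind="stable")

# Apply the same permutation to rows and columns.
triangular = M[order][:, order]
print(labels[order])  # one valid sequence: A B C E D F G
```

Ties in the row sums (D and E here) can come out in either order, which matches the two possible sequences the question mentions.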
Using the swapping ideas from the answer above, I made my own functions to find all the swaps needed to reach the triangular matrix.
Here is the code:
def simple_sort_matrix(matrix):
    orden = np.array([i for i in range(len(matrix[0]))])
    change = True
    while change:
        rows_index = row_index_ones(matrix)
        change = False
        # for i in range(len(rows_index)-1):
        i = 0
        while i

def swap_row_and_column(matrix, i, j):
    matrix[[i, j]] = matrix[[j, i]]        # row swapping
    matrix[:, [i, j]] = matrix[:, [j, i]]  # column swapping
    return matrix

def row_index_ones(matrix):
    return np.sum(matrix, axis=1)
Best regards,
Pablo
I have two 2D arrays like:
A = array([[4, 5, 6],
           [0, 7, 8],
           [0, 9, 0]])
B = array([[11, 12, 13],
           [14, 15, 16],
           [17, 18, 19]])
Where an element of array A is 0, I want to replace the value at the same position in array B with 0, store the changed matrix in a new variable, and retain the old B matrix.
Thanks in advance.
import numpy as np

A = np.array([[4, 5, 6],
              [0, 7, 8],
              [0, 9, 0]])
B = np.array([[11, 12, 13],
              [14, 15, 16],
              [17, 18, 19]])

C = B.copy()
C[A == 0] = 0
The expression A == 0 returns a boolean array that is True at the positions where A holds a zero. This boolean array is then used to mask the copy C, assigning 0 to the indices where the mask is True, while the original B stays unchanged.
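An equivalent one-liner is np.where, which builds the new matrix directly and never touches B (same arrays as above):

```python
import numpy as np

A = np.array([[4, 5, 6],
              [0, 7, 8],
              [0, 9, 0]])
B = np.array([[11, 12, 13],
              [14, 15, 16],
              [17, 18, 19]])

# Where A is 0 take 0, everywhere else keep B's value; B itself is untouched.
C = np.where(A == 0, 0, B)
```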
Use case
I get random observations from a population.
Then I group them into bins using pd.cut.
Then I extract the counts with pd.value_counts.
I want to get the computed interval labels and the frequency counts.
I want to 'glue' the labels column to the frequency-counts column to get a 2d array (with 2 columns and n interval rows).
I want to convert the 2d array to a list for COM interop.
I am close to the desired output, but I am a Python newbie, so perhaps someone can optimize my label code.
The problem here is the constraint on the final output, which needs to be a list so it can be marshalled via the COM interop layer to Excel VBA.
import numpy as np
import pandas as pd
from scipy.stats import skewnorm

pop = skewnorm.rvs(0, size=20)
bins = [-5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5]
bins2 = np.array(bins)
bins3 = pd.cut(pop, bins2)
bins4 = [0] * (bins2.size - 1)

# print my own labels, doh!
idx = 0
for binLoop in bins3.categories:
    intervalAsString = "(" + str(binLoop.left) + "," + str(binLoop.right) + "]"
    print(intervalAsString)
    bins4[idx] = intervalAsString
    idx = idx + 1

table = pd.value_counts(bins3, sort=False)
joined = np.vstack((bins4, table.tolist()))
print(joined)
Target output a 2d array convertible to a list
| (-5, -4] | 0 |
| (-4, -3] | 0 |
| (-3, -2] | 0 |
| (-2, -1] | 1 |
| (-1, 0] | 3 |
| (0, 1] | 9 |
| (1, 2] | 4 |
| (2, 3] | 2 |
| (3, 4] | 1 |
| (4, 5] | 0 |
If I understand you correctly, the following should do what you are after:
import pandas as pd
from scipy.stats import skewnorm

pop = skewnorm.rvs(0, size=20)
bins = range(-5, 6)  # edges -5 .. 5
binned = pd.cut(pop, bins)

# create the histogram data; sort=False keeps the bins in interval order
hist = pd.Series(binned).value_counts(sort=False)

# hist is a pandas series with a categorical index describing the bins;
# `index.astype(str)` converts the categories to strings
hist.index = hist.index.astype(str)

# `.reset_index()` turns the index into an ordinary column,
# `.values` gives the underlying numpy array, and
# `.tolist()` converts it to a native python list of lists
print(hist.reset_index().values.tolist())
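A variant of the same idea using np.histogram, which returns the counts directly (note it uses left-closed bins, unlike pd.cut's right-closed intervals, so boundary values can land in a different bin; the label formatting here is my own, chosen to match the target output):

```python
import numpy as np
from scipy.stats import skewnorm

pop = skewnorm.rvs(0, size=20)
edges = np.arange(-5, 6)  # bin edges -5 .. 5

counts, edges = np.histogram(pop, bins=edges)
labels = [f"({lo}, {hi}]" for lo, hi in zip(edges[:-1], edges[1:])]

# Two-column list of lists, ready to marshal through COM to VBA.
joined = [[label, int(count)] for label, count in zip(labels, counts)]
print(joined)
```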