Transformation to simplify matrix with python - python

First of all thank you for any support. This is my first question published as usually my doubts are solved reading through other user's questions.
Here is my question: I have a number (n) of sets with common elements. These elements are usually added sequentially creating new sets although I do not have the sequence and this is what I am trying to find. The sequence is not always perfect and at some points I have to find the closest one with some uncertainty when the sequence is not 'perfect'.
I coded it using theory of Sets searching sequentially the set that contains all the other sets and when I do not reach the last set then I start from the smallest to the bigger.
I gave some thoughts to the topic and I found, in theory, a more robust and generic approach. The idea is to build a square matrix with the n sets as row index (i) and the n sets as column index (j). The element i,j will be equal to 1 when set j is contained in i.
Here I have an example with sets A to G:
A={a, b, c, d1, d2, e, f};
B={b, c, d1, d2, e, f};
C={c, d1, d2, e, f};
D={d1, f, g};
E={d2, f, g};
F={f, g};
G={g};
If I create the matrix assuming sequence B, E, C, F, D, A, G, I would have:
B E C F D A G
B 1 1 1 1 1 0 1
E 0 1 0 1 0 0 1
C 0 1 1 1 1 0 1
F 0 0 0 1 0 0 1
D 0 0 0 1 1 0 1
A 1 1 1 1 1 1 1
G 0 0 0 0 0 0 1
I should get this matrix transformed into following matrix:
A B C D E F G
A 1 1 1 1 1 1 1
B 0 1 1 1 1 1 1
C 0 0 1 1 1 1 1
D 0 0 0 1 0 1 1
E 0 0 0 0 1 1 1
F 0 0 0 0 0 1 1
G 0 0 0 0 0 0 1
Which shows one of the two possible sequence: A, B, C, D, E, F, G
Here I add a picture as I am not sure matrix are shown clearly.
My first question is how you recommend to handle this matrix (which kind of data type should I use with typical functions to swap rows and columns).
And my second question is if there is already a matrix transformation function for this topic.

From my (small) experience, most used types for matrices are lists and numpy.ndarrays.
For columns swaps in particular, I would recommend numpy. There are many array creation routines in numpy. You either give the list with data explicitly or you create an array based on the shape you want. Example
>>> import numpy as np
>>> np.array([1, 2, 3])
array([1, 2, 3])
>>> np.array([[1, 2, 3], [1, 2, 3]])
array([[1, 2, 3],
[1, 2, 3]])
>>> np.zeros((2, 2))
array([[0., 0.],
[0., 0.]])
np.zeros accepts a shape as an argument (number of rows and columns for matrices). Of course, you can create arrays with how many dimensions you want.
numpy is quite complex regarding indexing its arrays. For a matrix you have:
>>> a = np.arange(6).reshape(2, 3)
>>> a
array([[0, 1, 2],
[3, 4, 5]])
>>> a[0] # row indexing
array([0, 1, 2])
>>> a[1, 1] # element indexing
4
>>> a[:, 2] # column indexing
array([2, 5])
Hopefully the examples are self-explanatory. Regarding the column index, : means "over all the values". So you specify a column index and the fact that you want all the values on that column.
For swapping rows and columns it's pretty short:
>>> a = np.arange(6).reshape(2, 3)
>>> a
array([[0, 1, 2],
[3, 4, 5]])
>>> a[[0, 1]] = a[[1, 0]] # row swapping
>>> a
array([[3, 4, 5],
[0, 1, 2]])
>>> a[:, [0, 2]] = a[:, [2, 0]] # column swapping
>>> a
array([[5, 4, 3],
[2, 1, 0]])
Here advance indexing is used. Each dimension (called axis by numpy) can accept a list of indices. So you can get 2 or more rows/columns at the same time from a matrix.
You don't have to ask for them in a certain order. numpy gives you the values in the order you ask for them.
Swapping rows is done by asking numpy for the two rows in reversed order and saving them in their original positions. It actually respects the pythonic way of swapping values between 2 variables (although surrounded by a complex frame):
a, b = b, a
Regarding matrix transformation, it depends on what you are looking for.

Using the swapping ideas from I made my own functions to find all the swapping to do to get the triangular matrix.
Here I write the code:
`def simple_sort_matrix(matrix):
orden=np.array([i for i in range(len(matrix[0]))])
change=True
while change:
rows_index=row_index_ones(matrix)
change=False
#for i in range(len(rows_index)-1):
i=0
while i
def swap_row_and_column(matrix,i,j):
matrix[[i, j]] = matrix[[j, i]] # row swapping
matrix[:, [i, j]] = matrix[:, [j, i]] # column swapping
return matrix
def row_index_ones(matrix):
return(np.sum(matrix,axis=1))`
Best regards,
Pablo

Related

Random shuffle with weight in python

I am currently trying to shuffle an array and am running into some problems.
What I have:
my_array=array([nan, 1, 1, nan, nan, 2, nan, ..., nan, nan, nan])
What I want to do:
I want to shuffle the dataset while keeping the numbers (e.g. the 1,1 in the array) together.
What I did is first converting every naninto an unique negative number.
my_array=array([-1, 1, 1, -2, -3, 2, -4, ..., -2158, -2159, -2160])
Afterward I split everything up with pandas:
df = pd.DataFrame(my_array)
df.rename(columns={0: 'sampleID'}, inplace=True)
groups = [df.iloc[:, 0] for _, df in df.groupby('sampleID')]
If I know shuffle my dataset I will have an equal probability for every group to appear at a given place, but this would neglect the number of elements in each group. If I have a group of several elements like [9,9,9,9,9,9] it should have a higher chance at appearing earlier than some random nan. Correct me on this one if I'm wrong.
One way to get around this problem is numpys choice method.
For this I have to create a probability array
probability_array = np.zeros(len(groups))
for index, item in enumerate(groups):
probability_array[index] = len(item) / len(groups)
All of this to finally call:
groups=np.array(groups,dtype=object)
rng = np.random.default_rng()
shuffled_indices = rng.choice(len(groups), len(groups), replace=False, p=probability_array)
shuffled_array = np.concatenate(groups[shuffled_indices]).ravel()
shuffled_array[shuffled_array < 1] = np.NaN
All of this is quite cumbersome and not very fast. Besides the fact that you can certainly code it better, I feel like I am missing some very simple solution to my problem.
Can somebody point me in the right direction?
One approach:
import numpy as np
from itertools import groupby
# toy data
my_array = np.array([np.nan, 1, 1, np.nan, np.nan, 2, 2, 2, np.nan, 3, 3, 3, np.nan, 4, 4, np.nan, np.nan])
# find groups
groups = np.array([[key, sum(1 for _ in group)] for key, group in groupby(my_array)])
# permute
keys, repetitions = zip(*np.random.permutation(groups))
# recreate new array
res = np.repeat(keys, repetitions)
print(res)
Output (single run)
[ 3. 3. 3. nan nan nan nan 2. 2. 2. 1. 1. nan nan nan 4. 4.]
I have solved your problem under some restrictions
Instead of NaN, I have used zeros as separators
I assumed that an array of yours ALWAYS starts with a sequence of non-zero integers and ends with another sequence of non-zero integers.
With these provisions, I have essentially shuffled a representation of the sequences of integers, and later I have stitched everything in place again.
In [102]: import numpy as np
...: from itertools import groupby
...: a = np.array([int(_) for _ in '1110022220003044440005500000600777'])
...: print(a)
...: n, z = [], []
...: for i,g in groupby(a):
...: if i:
...: n.append((i, sum(1 for _ in g)))
...: else:
...: z.append(sum(1 for _ in g))
...: np.random.shuffle(n)
...: nn = n[0]
...: b = [*[nn[0]]*nn[1]]
...: for zz, nn in zip(z, n[1:]):
...: b += [*[0]*zz, *[nn[0]]*nn[1]]
...: print(np.array(b))
[1 1 1 0 0 2 2 2 2 0 0 0 3 0 4 4 4 4 0 0 0 5 5 0 0 0 0 0 6 0 0 7 7 7]
[7 7 7 0 0 1 1 1 0 0 0 4 4 4 4 0 6 0 0 0 5 5 0 0 0 0 0 2 2 2 2 0 0 3]
Note
The lengths of the runs of separators in the shuffled array is exactly the same as in the original array, but shuffling also the separators is easy. A more difficult problem would be to change arbitrarily the lengths, keepin' the array length unchanged.

Get index from rows with matching values in different columns

I have a set like this:
N1 N2
0 a b
1 b f
2 c d
3 d a
4 e b
I want to get the indexes with the repeated values between the two columns, and the value itself.
From the example, I should get something like these shortlists:
(value, idx(N1), idx(N2))
(a, 0, 3)
(b, 1, 0)
(b, 1, 4)
(d, 3, 2)
I have been able to do it with two for-loops, but for a half-million rows dataframe it took hours...
Use numpy broadcasting comparison and then use argwhere to find the indices where the values where equal:
import numpy as np
# make a broadcasted comparison
mat = df['N2'].values == df['N1'].values[:, None]
# find the indices where the values are True
where = np.argwhere(mat)
# select the values
values = df['N1'][where[:, 0]]
# create the DataFrame
res = pd.DataFrame(data=[[val, *row] for val, row in zip(values, where)], columns=['values', 'idx_N1', 'idx_N2'])
print(res)
Output
values idx_N1 idx_N2
0 a 0 3
1 b 1 0
2 b 1 4
3 d 3 2

How to sort rows of a binary-valued array as if they were long binary numbers?

There is a 2D numpy array of about 500000 rows by 512 values each row:
[
[1,0,1,...,0,0,1], # 512 1's or 0's
[0,1,0,...,0,1,1],
...
[0,0,1,...,1,0,1], # row number 500000
]
How to sort the rows ascending as if each row is a long 512-bit integer?
[
[0,0,1,...,1,0,1],
[0,1,0,...,0,1,1],
[1,0,1,...,0,0,1],
...
]
Instead of converting to strings you can also use a void view (as from #Jaime here) of the data and argsort by that.
def sort_bin(b):
b_view = np.ascontiguousarray(b).view(np.dtype((np.void, b.dtype.itemsize * b.shape[1])))
return b[np.argsort(b_view.ravel())] #as per Divakar's suggestion
Testing
np.random.seed(0)
b = np.random.randint(0, 2, (10,5))
print(b)
print(sort_bin(b))
[[0 1 1 0 1]
[1 1 1 1 1]
[1 0 0 1 0]
...,
[1 0 1 1 0]
[0 1 0 1 1]
[1 1 1 0 1]]
[[0 0 0 0 1]
[0 1 0 1 1]
[0 1 1 0 0]
...,
[1 1 1 0 1]
[1 1 1 1 0]
[1 1 1 1 1]]
Should be much faster and less memory-intensive since b_view is just a view into b
t = np.random.randint(0,2,(2000,512))
%timeit sort_bin(t)
100 loops, best of 3: 3.09 ms per loop
%timeit np.array([[int(i) for i in r] for r in np.sort(np.apply_along_axis(lambda r: ''.join([str(c) for c in r]), 0, t))])
1 loop, best of 3: 3.29 s per loop
About 1000x faster actually
You could sort them in a stable way 512 times, starting with the right-most bit first.
Sort by last bit
Sort by second-last bit, stable (to not mess up results of previous sort)
...
...
Sort by first bit, stable
A smaller example: assume you want to sort these three 2-bit numbers by bits:
11
01
00
In the first step, you sort by the right bit, resulting in:
00
11
01
Now you sort by the first bit, in this case we have two 0s in that column. If your sorting algorithm is not stable it would be allowed to put these equal items in any order in the result, that could cause 01 to appear before 00 which we do not want, so we use a stable sort, keeping the relative order of equal items, for the first column, resulting in the desired:
00
01
11
Creating a string of each row and then applying np.sort()
So if we have an array to test on:
a = np.array([[1,0,0,0],[0,0,0,0],[1,1,1,1],[0,0,1,1]])
We can create strings of each row by using np.apply_along_axis:
a = np.apply_along_axis(lambda r: ''.join([str(c) for c in r]), 0, a)
which would make a now:
array(['1010', '0010', '0011', '0011'], dtype='<U4')
and so now we can sort the strings with np.sort():
a = np.sort(a)
making a:
array(['0010', '0011', '0011', '1010'], dtype='<U4')
we can then convert back to the original format with:
a = np.array([[int(i) for i in r] for r in a])
which makes a:
array([[0, 0, 1, 0],
[0, 0, 1, 1],
[0, 0, 1, 1],
[1, 0, 1, 0]])
And if you wanted to cram this all into one line:
a = np.array([[int(i) for i in r] for r in np.sort(np.apply_along_axis(lambda r: ''.join([str(c) for c in r]), 0, a))])
This is slow but does the job.
def sort_col(arr, col_num=0):
# if we have sorted over all columns return array
if col_num >= arr.shape[1]:
return arr
# sort array over given column
arr_sorted = arr[arr[:, col_num].argsort()]
# if the number of 1s in the given column is not equal to the total number
# of rows neither equal to 0, split on 1 and 0, sort and then merge
if len(arr) > np.sum(arr_sorted[:, col_num]) > 0:
arr_sorted0s = sort_col(arr_sorted[arr_sorted[:, col_num]==0], col_num+1)
arr_sorted1s = sort_col(arr_sorted[arr_sorted[:, col_num]==1], col_num+1)
# change order of stacking if you want ascenting order
return np.vstack((arr_sorted0s, arr_sorted1s))
# if the number of 1s in the given column is equal to the total number
# of rows or equal to 0, just go to the next iteration
return sort_col(arr_sorted, col_num + 1)
np.random.seed(0)
a = np.random.randint(0, 2, (5, 4))
print(a)
print(sort_col(a))
# prints
[[0 1 1 0]
[1 1 1 1]
[1 1 1 0]
[0 1 0 0]
[0 0 0 1]]
[[0 0 0 1]
[0 1 0 0]
[0 1 1 0]
[1 1 1 0]
[1 1 1 1]]
Edit. Or better yet use Daniels solution. I didn't check for new answers before I posted my code.

Conversion of pandas dataframe to sparse key-item matrix with composite key

I have a data frame of 3 columns. Col 1 is a string order number, Col 2 is an integer day, and Col 3 is a product name.
I would like to convert this into a matrix where each row represents a unique order/day combination, and each column represents a 1/0 for the presence of a product name for that combination.
My approach so far makes use of a product dictionary, and a dictionary with a composite key of order # & day.
The final step, which iterates through the original dataframe in order to flip the bits in the matrix to 1s is sloooow. Like 10 minutes for a matrix the size of 363K X 331 and a sparseness of ~97%.
Is there a different approach I should consider?
E.g.,
ord_nb day prod
1 1 A
1 1 B
1 2 B
1 2 C
1 2 D
would become
A B C D
1 1 0 0
0 1 1 1
My approach has been to create a dictionary of order/day pairs:
ord_day_dict = {}
print("Making a dictionary of ord-by-day keys...")
gp = df.groupby(['day', 'ord'])
for i,g in enumerate(gp.groups.items()):
ord_day_dict[g[0][0], g[0][1]] = i
I append the index represention to the original dataframe:
df['ord_day_idx'] = 0 #Create a place holder column
for i, row in df.iterrows(): #populate the column with the index
df.set_value(i,'ord_day_idx',ord_day_dict[(row['day'], row['ord_nb'])])
I then initialize a matrix the size of my ord/day X unique products:
n_items = df.prod_nm.unique().shape[0] #unique number of products
n_ord_days = len(ord_day_dict) #unique number of ord-by-day combos
df_fac_matrix = np.zeros((n_ord_days, n_items), dtype=np.float64)#-1)
I convert my products from strings into an index via a dictionary:
prod_dict = dict()
i = 0
for v in df.prod:
if v not in prod_dict:
prod_dict[v] = i
i = i + 1
And finally iterate through the original dataframe to populate the matrix with 1s where a specific order on a specific day included a specific product.
for line in df.itertuples():
df_fac_matrix[line[4], line[3]] = 1.0 #in the order-by-day index row and the product index column of our ord/day-by-prod matrix, mark a 1
Here is one option you can try:
df.groupby(['ord_nb', 'day'])['prod'].apply(list).apply(lambda x: pd.Series(1, x)).fillna(0)
# A B C D
#ord_nb day
# 1 1 1.0 1.0 0.0 0.0
# 2 0.0 1.0 1.0 1.0
Here's a NumPy based approach to have an array as output -
a = df[['ord_nb','day']].values.astype(int)
row = np.unique(np.ravel_multi_index(a.T,a.max(0)+1),return_inverse=1)[1]
col = np.unique(df.prd.values,return_inverse=1)[1]
out_shp = row.max()+1, col.max()+1
out = np.zeros(out_shp, dtype=int)
out[row,col] = 1
Please note that the third column was assumed to be of name 'prd' instead to avoid name conflict with built-in.
Possible improvements with focus on performance -
If prd has single letter characters only starting from A, we could compute col with simply : df.prd.values.astype('S1').view('uint8')-65.
Alternatively, we could compute row with : np.unique(a[:,0]*(a[:,1].max()+1) + a[:,1],return_inverse=1)[1].
Saving memory with sparse array : For really huge arrays, we could save on memory by storing them as sparse matrices. Thus, the final steps to get such a sparse matrix would be -
from scipy.sparse import coo_matrix
d = np.ones(row.size,dtype=int)
out_sparse = coo_matrix((d,(row,col)), shape=out_shp)
Sample input, output -
In [232]: df
Out[232]:
ord_nb day prd
0 1 1 A
1 1 1 B
2 1 2 B
3 1 2 C
4 1 2 D
In [233]: out
Out[233]:
array([[1, 1, 0, 0],
[0, 1, 1, 1]])
In [241]: out_sparse
Out[241]:
<2x4 sparse matrix of type '<type 'numpy.int64'>'
with 5 stored elements in COOrdinate format>
In [242]: out_sparse.toarray()
Out[242]:
array([[1, 1, 0, 0],
[0, 1, 1, 1]])

Matrix as a dictionary; is it safe?

I know that the order of the keys is not guaranteed and that's OK, but what exactly does it mean that the order of the values is not guaranteed as well*?
For example, I am representing a matrix as a dictionary, like this:
signatures_dict = {}
M = 3
for i in range(1, M):
row = []
for j in range(1, 5):
row.append(j)
signatures_dict[i] = row
print signatures_dict
Are the columns of my matrix correctly constructed? Let's say I have 3 rows and at this signatures_dict[i] = row line, row will always have 1, 2, 3, 4, 5. What will signatures_dict be?
1 2 3 4 5
1 2 3 4 5
1 2 3 4 5
or something like
1 2 3 4 5
1 4 3 2 5
5 1 3 4 2
? I am worried about cross-platform support.
In my application, the rows are words and the columns documents, so can I say that the first column is the first document?
*Are order of keys() and values() in python dictionary guaranteed to be the same?
You will guaranteed have 1 2 3 4 5 in each row. It will not reorder them. The lack of ordering of values() refers to the fact that if you call signatures_dict.values() the values could come out in any order. But the values are the rows, not the elements of each row. Each row is a list, and lists maintain their order.
If you want a dict which maintains order, Python has that too: https://docs.python.org/2/library/collections.html#collections.OrderedDict
Why not use a list of lists as your matrix? It would have whatever order you gave it;
In [1]: matrix = [[i for i in range(4)] for _ in range(4)]
In [2]: matrix
Out[2]: [[0, 1, 2, 3], [0, 1, 2, 3], [0, 1, 2, 3], [0, 1, 2, 3]]
In [3]: matrix[0][0]
Out[3]: 0
In [4]: matrix[3][2]
Out[4]: 2

Categories

Resources