Handling the shape of a specific column in pandas - python

I have a pandas.DataFrame and I'm interested only in the values of its last column.

np.shape(dataframe.iloc[:, :])   # the output is (2190, 460)
# Now here is the shape of one cell in the last column
np.shape(dataframe.iloc[0, -1])  # the output is (20,)
dataframe.iloc[0, -1]  # the output is [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

My question is: how can I get this column saved with the shape (2190, 20)? Running

np.shape(dataframe.iloc[:, -1])  # the output is (2190,)

shows the shape I get instead, and this shape is causing me huge problems.
Solution using a loop:

Test_Labels = []
for i in range(len(dataframe)):
    Test_Labels.append(dataframe.iloc[i, -1])
np.shape(Test_Labels)

If someone can solve it using a pandas function, I will be glad to see it.

You can get the last column from a pandas.DataFrame like:
Code:
df[df.columns[-1]]
Test Code:
import datetime as dt
import pandas as pd

df = pd.DataFrame({"a": [dt.datetime(2017, 1, 3),
                         dt.datetime(2017, 2, 4),
                         dt.datetime(2017, 3, 5)],
                   "b": [[2, 4], [6, 8], [10, 12]]})
print(df)
print(df[df.columns[-1]])
Results:
           a         b
0 2017-01-03    [2, 4]
1 2017-02-04    [6, 8]
2 2017-03-05  [10, 12]

0      [2, 4]
1      [6, 8]
2    [10, 12]
Name: b, dtype: object
But I need the result to be an array of arrays, not lists
If you need to convert the array of lists to an array of arrays, cast the whole thing to a numpy.array; numpy will reach in and convert the inner lists to arrays.
last_col = np.array(list(df[df.columns[-1]]))
print(last_col)
print(last_col.shape)
Results:
[[ 2  4]
 [ 6  8]
 [10 12]]
(3, 2)
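As a further option (a minimal sketch of my own, not part of the answer above): since every cell of the last column holds an equal-length list, np.vstack can stack the column directly into a 2-D array, which yields the (2190, 20) shape the question asks for.

import numpy as np
import pandas as pd

df = pd.DataFrame({"b": [[2, 4], [6, 8], [10, 12]]})
# Stack the per-cell lists of the last column into one 2-D array.
labels = np.vstack(df.iloc[:, -1])
print(labels.shape)  # (3, 2) here; (2190, 20) for the data in the question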

Related

Scipy's linear_sum_assignment giving incorrect result

When I tried using scipy.optimize.linear_sum_assignment as shown, it gives the assignment vector [0 2 3 1] with a total cost of 15.
However, from the cost matrix c you can see that for the second task, the 5th agent has a cost of 1, so the expected assignment should be [0 3 None 2 1] (total cost of 9).
Why is linear_sum_assignment not returning the optimal assignments?
from scipy.optimize import linear_sum_assignment

c = [
    [1, 5, 9, 5],
    [5, 8, 3, 2],
    [3, 2, 6, 8],
    [7, 3, 5, 4],
    [2, 1, 9, 9],
]
results = linear_sum_assignment(c)
print(results[1])  # [0 2 3 1]
linear_sum_assignment returns a tuple of two arrays. These are the row indices and column indices of the assigned values. For your example (with c converted to a numpy array):
In [51]: c
Out[51]:
array([[1, 5, 9, 5],
       [5, 8, 3, 2],
       [3, 2, 6, 8],
       [7, 3, 5, 4],
       [2, 1, 9, 9]])
In [52]: row, col = linear_sum_assignment(c)
In [53]: row
Out[53]: array([0, 1, 3, 4])
In [54]: col
Out[54]: array([0, 2, 3, 1])
The corresponding index pairs from row and col give the selected entries. That is, the indices of the selected entries are (0, 0), (1, 2), (3, 3) and (4, 1). It is these pairs that are the "assignments".
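To list the chosen (row, column) pairs explicitly, here is a small self-contained sketch (my own, using the same cost matrix):

from scipy.optimize import linear_sum_assignment
import numpy as np

c = np.array([[1, 5, 9, 5],
              [5, 8, 3, 2],
              [3, 2, 6, 8],
              [7, 3, 5, 4],
              [2, 1, 9, 9]])
row, col = linear_sum_assignment(c)
# Each (row, col) pair is one assignment of an agent to a task.
print([(int(r), int(k)) for r, k in zip(row, col)])
# [(0, 0), (1, 2), (3, 3), (4, 1)]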
The sum associated with this assignment is 9:
In [55]: c[row, col].sum()
Out[55]: 9
In the original version of the question (but since edited),
it looks like you wanted to know the row index for each column, so you expected [0, 4, 1, 3]. The values that you want are in row, but the order is not what you expect, because the indices in col are not simply [0, 1, 2, 3]. To get the result in the form that you expected, you have to reorder the values in row based on the order of the indices in col. Here are two ways to do that.
First:
In [56]: result = np.zeros(4, dtype=int)
In [57]: result[col] = row
In [58]: result
Out[58]: array([0, 4, 1, 3])
Second:
In [59]: result = row[np.argsort(col)]
In [60]: result
Out[60]: array([0, 4, 1, 3])
Note that the example in the linear_sum_assignment docstring is potentially misleading; because it displays only col_ind in the python session, it gives the impression that col_ind is "the answer". In general, however, the answer involves both of the returned arrays.

Transform an array to 1s and 0s using another array

I have two arrays
arr1 = np.array([[4, 1, 3, 2, 5], [5, 2, 4, 1, 3]])
arr2 = np.array([[2], [1]])
I want to transform arr1 into a binary array using the elements of arr2 in the following way:
For row 1 of arr1, use row 1 of arr2 (i.e. 2) to make the top 2 values of that row 1s and the rest 0s.
Similarly for row 2 of arr1, use row 2 of arr2 (i.e. 1) to make the top 1 value of that row a 1 and the rest 0s.
So arr1 would get transformed as follows
arr1_transformed = np.array([[1, 0, 0, 0, 1], [1, 0, 0, 0, 0]])
Here is what I tried.
arr1_sorted_indices = np.argsort(-arr1)
This gave me the indices of the sorted array
array([[1, 3, 2, 0, 4],
       [3, 1, 4, 2, 0]])
Now I think I need to mask this array with the help of arr2 to get the desired output and I'm not sure how to do it.
This should do the job in the mentioned case:

def transform_arr(arr1, arr2):
    for i in range(0, len(arr1)):
        if i >= len(arr2):
            arr1[i] = [0 for x in arr1[i]]
        else:
            sorted_arr = sorted(arr1[i])[-arr2[i][0]:]
            arr1[i] = [1 if x in sorted_arr else 0 for x in arr1[i]]

arr1 = [[4, 1, 3, 2, 5], [5, 2, 4, 1, 3]]
arr2 = [[2], [1]]
transform_arr(arr1, arr2)
print(arr1)
You can try the following:
import numpy as np

arr1 = np.array([[4, 1, 3, 2, 5], [5, 2, 4, 1, 3]])
arr2 = np.array([[2], [1]])

r, c = arr1.shape
# s[i, j] is the descending rank of arr1[i, j] within its row.
s = np.argsort(np.argsort(-arr1))
# Element (i, j) becomes 1 exactly when its rank is below the row's threshold arr2[i].
out = (np.arange(c) < arr2)[np.c_[0:r], s] * 1
print(out)
It gives:
[[1 0 0 0 1]
 [1 0 0 0 0]]
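For comparison, here is a minimal loop-based sketch of the same idea (my own variable names, assuming ties don't matter): take the descending argsort per row and mark the first k positions.

import numpy as np

arr1 = np.array([[4, 1, 3, 2, 5], [5, 2, 4, 1, 3]])
arr2 = np.array([[2], [1]])

order = np.argsort(-arr1, axis=1)  # column indices, largest value first
out = np.zeros_like(arr1)
for i, k in enumerate(arr2.ravel()):
    out[i, order[i, :k]] = 1       # mark the top-k entries of row i
print(out)
# [[1 0 0 0 1]
#  [1 0 0 0 0]]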

Sorting an array by column sum and excluding the largest element of each column using Numpy

I would like to sort an array by column sum and delete the largest element of each column, then continue the sorting.
# sorted by sum of columns
def sorting(a):
    b = np.sum(a, axis=0)
    idx = b.argsort()
    a = np.take(a, idx, axis=1)
    return a

arr = [[1, 2, 3, 8], [3, 0, 2, 1], [5, 4, 25, 67], [11, 1, 6, 10]]
print(sorting(arr))
Here is the output:
[[ 2  1  3  8]
 [ 0  3  2  1]
 [ 4  5 25 67]
 [ 1 11  6 10]]
I was able to find the max of each column and their indexes, but I couldn't delete them without deleting the whole row/column. Any help is appreciated; I am new to numpy!
Though not very elegant, one way to achieve this uses broadcasting and fancy/advanced indexing:
import numpy as np
arr = np.array([[1, 2, 3, 8], [3, 0, 2, 1], [5, 4, 25, 67], [11, 1, 6, 10]])
First get the intermediate array sorted by column sums.
arr1 = arr[:, arr.sum(axis=0).argsort()]
print(arr1)
# array([[ 2,  1,  3,  8],
#        [ 0,  3,  2,  1],
#        [ 4,  5, 25, 67],
#        [ 1, 11,  6, 10]])
Next get where the maximas occur in each column.
idx = arr1.argmax(axis=0)
print(idx)
# array([2, 3, 2, 2])
Now prepare row and column index arrays to slice from arr1. Note that the line computing rows essentially performs a set difference of {0, 1, 2, 3} (in general, the row indices of arr) against each element of idx above, and stores the results along the columns of the rows matrix.
k = np.arange(arr1.shape[0]) # original number of rows
rows = np.nonzero(k != idx[:, None])[1].reshape(-1, arr1.shape[0] - 1).T
cols = np.arange(arr1.shape[1])
print(rows)
# array([[0, 0, 0, 0],
#        [1, 1, 1, 1],
#        [3, 2, 3, 3]])
Note that cols will be broadcasted to the shape of rows while indexing arr1 by them. For your understanding cols will look like this to be compatible with rows:
print(np.broadcast_to(cols, rows.shape))
# array([[0, 1, 2, 3],
#        [0, 1, 2, 3],
#        [0, 1, 2, 3]])
Basically when you (fancy) index arr1 by them, you get the 0th column for rows 0, 1 and 3; 1st column for rows 0, 1 and 2 and so on. Hope you get the idea.
arr2 = arr1[rows, cols]
print(arr2)
# array([[ 2,  1,  3,  8],
#        [ 0,  3,  2,  1],
#        [ 1,  5,  6, 10]])
You can write a simple function composing these steps for your convenience, to perform the operation multiple times; a sketch follows below.
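For instance (a minimal sketch with my own function name, same logic as the steps above):

import numpy as np

def sort_and_drop_max(arr):
    # Sort columns by their sums.
    arr1 = arr[:, arr.sum(axis=0).argsort()]
    # Row index of each column's maximum.
    idx = arr1.argmax(axis=0)
    k = np.arange(arr1.shape[0])
    # For each column, keep every row except the one holding the maximum.
    rows = np.nonzero(k != idx[:, None])[1].reshape(-1, arr1.shape[0] - 1).T
    cols = np.arange(arr1.shape[1])
    return arr1[rows, cols]

arr = np.array([[1, 2, 3, 8], [3, 0, 2, 1], [5, 4, 25, 67], [11, 1, 6, 10]])
print(sort_and_drop_max(arr))
# [[ 2  1  3  8]
#  [ 0  3  2  1]
#  [ 1  5  6 10]]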

How to understand this data-shuffling code in Numpy

I am learning Numpy and I want to understand the following data-shuffling code:
# x is an m*n np.array
# return a shuffled-rows array
def shuffle_col_vals(x):
    rand_x = np.array([np.random.choice(x.shape[0], size=x.shape[0], replace=False)
                       for i in range(x.shape[1])]).T
    grid = np.indices(x.shape)
    rand_y = grid[1]
    return x[(rand_x, rand_y)]
So I input an np.array object as following:
x1 = np.array([[ 1,  2,  3,  4],
               [ 5,  6,  7,  8],
               [ 9, 10, 11, 12],
               [13, 14, 15, 16]], dtype=int)
And I get a output of shuffle_col_vals(x1) like comments as following:
array([[ 1,  5, 11, 15],
       [ 3,  8,  9, 14],
       [ 4,  6, 12, 16],
       [ 2,  7, 10, 13]], dtype=int64)
I am confused about how rand_x is constructed, and I haven't seen this style of indexing a numpy array before.
I have been thinking about it for a long time, but I still don't understand why return x[(rand_x, rand_y)] gives a shuffled array.
If you don't mind, could anyone explain the code to me?
Thanks in advance.
In indexing Numpy arrays, you can take single elements. Let's use a 3x4 array to be able to differentiate between the axes:
In [1]: x1 = np.array([[1, 2, 3, 4],
   ...:                [5, 6, 7, 8],
   ...:                [9, 10, 11, 12]], dtype=int)
In [2]: x1[0, 0]
Out[2]: 1
If you review Numpy Advanced indexing, you will find that you can do more in indexing by providing lists for each dimension. Consider indexing with x1[rows..., cols...]; let's take two elements.
Pick from the first and second row, but always from the first column:
In [3]: x1[[0, 1], [0, 0]]
Out[3]: array([1, 5])
You can even index with arrays:
In [4]: x1[[[0, 0], [1, 1]], [[0, 1], [0, 1]]]
Out[4]:
array([[1, 2],
       [5, 6]])
np.indices creates a row and a col array that, if used for indexing, give back the original array:
In [5]: grid = np.indices(x1.shape)
In [6]: np.alltrue(x1[grid[0], grid[1]] == x1)
Out[6]: True
Now if you shuffle the values of grid[0] col-wise, but keep grid[1] as-is, and then use these for indexing, you get an array with the values of the columns shuffled.
Each column index vector is [0, 1, 2]. The code now shuffles these column index vectors for each column individually, and stacks them together into rand_x into the same shape as x1.
Create a single shuffled column index vector:
In [7]: np.random.seed(0)
In [8]: np.random.choice(x1.shape[0], size=x1.shape[0], replace=False)
Out[8]: array([2, 1, 0])
The stacking works by (pseudo-code) stacking with [random-index-col-vec for cols in range(x1.shape[1])] and then transposing (.T).
To make it a little clearer we can rewrite i as col and use column_stack instead of np.array([... for col]).T:
In [9]: np.random.seed(0)
In [10]: col_list = [np.random.choice(x1.shape[0], size=x1.shape[0], replace=False)
    ...:             for col in range(x1.shape[1])]
In [11]: col_list
Out[11]: [array([2, 1, 0]), array([2, 0, 1]), array([0, 2, 1]), array([2, 0, 1])]
In [12]: rand_x = np.column_stack(col_list)
In [13]: rand_x
Out[13]:
array([[2, 2, 0, 2],
       [1, 0, 2, 0],
       [0, 1, 1, 1]])
In [14]: x1[rand_x, grid[1]]
Out[14]:
array([[ 9, 10,  3, 12],
       [ 5,  2, 11,  4],
       [ 1,  6,  7,  8]])
Details to note:
the example output you give differs from what the provided function produces; it seems to be transposed.
the use of rand_x and rand_y in the sample code can be confusing if you are used to the convention of x = column index, y = row index
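For reference, a compact per-column shuffle can also be written with np.random.permutation (a sketch of my own, equivalent in effect to the function in the question):

import numpy as np

def shuffle_columns(x):
    out = np.empty_like(x)
    for j in range(x.shape[1]):
        # Independently permute the row order within each column.
        out[:, j] = x[np.random.permutation(x.shape[0]), j]
    return out

x1 = np.arange(1, 17).reshape(4, 4)
print(shuffle_columns(x1))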
See output:
import numpy as np
def shuffle_col_val(x):
    print("----------------------------\n A rand_x\n")
    f = np.random.choice(x.shape[0], size=x.shape[0], replace=False)
    print(f, "\nNow I transpose an array.")
    rand_x = np.array([f]).T
    print(rand_x)
    print("----------------------------\n B rand_y\n")
    print("Grid gives you two possibilities\n you choose second:")
    grid = np.indices(x.shape)
    print(format(grid))
    rand_y = grid[1]
    print("\n----------------------------\n C Our rand_x, rand_y:")
    print("\nThe order of values in the column CHANGE:\n has random order\n{}".format(rand_x))
    print("\nThe order of values in the row NO CHANGE:\n has normal order 0, 1, 2, 3\n{}".format(rand_y))
    return x[(rand_x, rand_y)]

x1 = np.array([[ 1,  2,  3,  4],
               [ 5,  6,  7,  8],
               [ 9, 10, 11, 12],
               [13, 14, 15, 16]], dtype=int)

print("\n----------------------------\n D Our shuffled-rows: \n{}\n".format(shuffle_col_val(x1)))
Output:
----------------------------
 A rand_x

[2 3 0 1]
Now I transpose an array.
[[2]
 [3]
 [0]
 [1]]
----------------------------
 B rand_y

Grid gives you two possibilities
 you choose second:
[[[0 0 0 0]
  [1 1 1 1]
  [2 2 2 2]
  [3 3 3 3]]

 [[0 1 2 3]
  [0 1 2 3]
  [0 1 2 3]
  [0 1 2 3]]]

----------------------------
 C Our rand_x, rand_y:

The order of values in the column CHANGE:
 has random order
[[2]
 [3]
 [0]
 [1]]

The order of values in the row NO CHANGE:
 has normal order 0, 1, 2, 3
[[0 1 2 3]
 [0 1 2 3]
 [0 1 2 3]
 [0 1 2 3]]

----------------------------
 D Our shuffled-rows:
[[ 9 10 11 12]
 [13 14 15 16]
 [ 1  2  3  4]
 [ 5  6  7  8]]

What is a faster way to get the location of unique rows in numpy

I have a list of unique rows and another larger array of data (called test_rows in example). I was wondering if there was a faster way to get the location of each unique row in the data. The fastest way that I could come up with is...
import numpy

uniq_rows = numpy.array([[0, 1, 0],
                         [1, 1, 0],
                         [1, 1, 1],
                         [0, 1, 1]])
test_rows = numpy.array([[0, 1, 1],
                         [0, 1, 0],
                         [0, 0, 0],
                         [1, 1, 0],
                         [0, 1, 0],
                         [0, 1, 1],
                         [0, 1, 1],
                         [1, 1, 1],
                         [1, 1, 0],
                         [1, 1, 1],
                         [0, 1, 0],
                         [0, 0, 0],
                         [1, 1, 0]])
# this gives me the indexes of each group of unique rows
for row in uniq_rows.tolist():
    print(row, numpy.where((test_rows == row).all(axis=1))[0])
This prints...
[0, 1, 0] [ 1  4 10]
[1, 1, 0] [ 3  8 12]
[1, 1, 1] [7 9]
[0, 1, 1] [0 5 6]
Is there a better or more numpythonic (not sure if that word exists) way to do this? I was searching for a numpy group function but could not find it. Basically for any incoming dataset I need the fastest way to get the locations of each unique row in that data set. The incoming dataset will not always have every unique row or the same number.
EDIT:
This is just a simple example. In my application the numbers would not be just zeros and ones, they could be anywhere from 0 to 32000. The size of uniq rows could be between 4 to 128 rows and the size of test_rows could be in the hundreds of thousands.
Numpy
From version 1.13 of numpy you can use numpy.unique on rows like np.unique(test_rows, return_counts=True, return_index=True, axis=0) (note axis=0: axis=1 would operate on columns instead of rows). With return_inverse=True you can also recover the locations of each unique row directly, as in the sketch below.
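A minimal sketch of that (my own, assuming numpy >= 1.13): the inverse mapping from np.unique groups the locations of each unique row.

import numpy as np

test_rows = np.array([[0, 1, 1], [0, 1, 0], [0, 0, 0], [1, 1, 0],
                      [0, 1, 0], [0, 1, 1], [0, 1, 1], [1, 1, 1],
                      [1, 1, 0], [1, 1, 1], [0, 1, 0], [0, 0, 0],
                      [1, 1, 0]])
# inv[i] is the index of test_rows[i] within the sorted unique rows.
uniq, inv = np.unique(test_rows, axis=0, return_inverse=True)
groups = {tuple(map(int, uniq[k])): np.flatnonzero(inv == k)
          for k in range(len(uniq))}
print(groups)
# {(0, 0, 0): array([ 2, 11]), (0, 1, 0): array([ 1,  4, 10]),
#  (0, 1, 1): array([0, 5, 6]), (1, 1, 0): array([ 3,  8, 12]),
#  (1, 1, 1): array([7, 9])}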
Pandas
df = pd.DataFrame(test_rows)
uniq = pd.DataFrame(uniq_rows)
uniq
   0  1  2
0  0  1  0
1  1  1  0
2  1  1  1
3  0  1  1
Or you could generate the unique rows automatically from the incoming DataFrame
uniq_generated = df.drop_duplicates().reset_index(drop=True)
yields
   0  1  2
0  0  1  1
1  0  1  0
2  0  0  0
3  1  1  0
4  1  1  1
and then look for it
d = dict()
for idx, row in uniq.iterrows():
    d[idx] = df.index[(df == row).all(axis=1)].values
This is about the same as your where method
d
{0: array([ 1,  4, 10], dtype=int64),
 1: array([ 3,  8, 12], dtype=int64),
 2: array([7, 9], dtype=int64),
 3: array([0, 5, 6], dtype=int64)}
There are a lot of solutions here, but I'm adding one with vanilla numpy. In most cases numpy will be faster than list comprehensions and dictionaries, although the array broadcasting may cause memory to be an issue if large arrays are used.
np.where((uniq_rows[:, None, :] == test_rows).all(2))
Wonderfully simple, eh? This returns a tuple of two arrays: the unique-row indices and the corresponding test-row indices.
(array([0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 3]),
 array([ 1,  4, 10,  3,  8, 12,  7,  9,  0,  5,  6]))
How it works:
(uniq_rows[:, None, :] == test_rows)
Uses array broadcasting to compare each row of uniq_rows with each row of test_rows, producing a 4x13x3 boolean array. all(2) then determines which row pairs are equal (all three comparisons true), and where returns the indices of those pairs.
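If you want the per-unique-row groups rather than the raw index pairs, a small follow-up sketch (my own names) regroups the output:

import numpy as np

uniq_rows = np.array([[0, 1, 0], [1, 1, 0], [1, 1, 1], [0, 1, 1]])
test_rows = np.array([[0, 1, 1], [0, 1, 0], [0, 0, 0], [1, 1, 0],
                      [0, 1, 0], [0, 1, 1], [0, 1, 1], [1, 1, 1],
                      [1, 1, 0], [1, 1, 1], [0, 1, 0], [0, 0, 0],
                      [1, 1, 0]])
u_idx, t_idx = np.where((uniq_rows[:, None, :] == test_rows).all(2))
# Collect the test-row locations for each unique row.
locations = {i: t_idx[u_idx == i] for i in range(len(uniq_rows))}
print(locations)
# {0: array([ 1,  4, 10]), 1: array([ 3,  8, 12]), 2: array([7, 9]), 3: array([0, 5, 6])}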
With the np.unique from v1.13 (downloaded from the source link on the latest documentation, https://github.com/numpy/numpy/blob/master/numpy/lib/arraysetops.py#L112-L247)
In [157]: aset.unique(test_rows, axis=0, return_inverse=True, return_index=True)
Out[157]:
(array([[0, 0, 0],
        [0, 1, 0],
        [0, 1, 1],
        [1, 1, 0],
        [1, 1, 1]]),
 array([2, 1, 0, 3, 7], dtype=int32),
 array([2, 1, 0, 3, 1, 2, 2, 4, 3, 4, 1, 0, 3], dtype=int32))
In [158]: a,b,c=_
In [159]: c
Out[159]: array([2, 1, 0, 3, 1, 2, 2, 4, 3, 4, 1, 0, 3], dtype=int32)
In [164]: from collections import defaultdict
In [165]: dd = defaultdict(list)
In [166]: for i, v in enumerate(c):
     ...:     dd[v].append(i)
     ...:
In [167]: dd
Out[167]:
defaultdict(list,
            {0: [2, 11],
             1: [1, 4, 10],
             2: [0, 5, 6],
             3: [3, 8, 12],
             4: [7, 9]})
or indexing the dictionary with the unique rows (as hashable tuple):
In [170]: dd = defaultdict(list)
In [171]: for i, v in enumerate(c):
     ...:     dd[tuple(a[v])].append(i)
     ...:
In [172]: dd
Out[172]:
defaultdict(list,
            {(0, 0, 0): [2, 11],
             (0, 1, 0): [1, 4, 10],
             (0, 1, 1): [0, 5, 6],
             (1, 1, 0): [3, 8, 12],
             (1, 1, 1): [7, 9]})
This will do the job:
import numpy as np

uniq_rows = np.array([[0, 1, 0],
                      [1, 1, 0],
                      [1, 1, 1],
                      [0, 1, 1]])
test_rows = np.array([[0, 1, 1],
                      [0, 1, 0],
                      [0, 0, 0],
                      [1, 1, 0],
                      [0, 1, 0],
                      [0, 1, 1],
                      [0, 1, 1],
                      [1, 1, 1],
                      [1, 1, 0],
                      [1, 1, 1],
                      [0, 1, 0],
                      [0, 0, 0],
                      [1, 1, 0]])
indices = np.where(np.sum(np.abs(np.repeat(uniq_rows, len(test_rows), axis=0)
                                 - np.tile(test_rows, (len(uniq_rows), 1))), axis=1) == 0)[0]
loc = indices // len(test_rows)
indices = indices - loc * len(test_rows)
res = [[] for i in range(len(uniq_rows))]
for i in range(len(indices)):
    res[loc[i]].append(indices[i])
print(res)
[[1, 4, 10], [3, 8, 12], [7, 9], [0, 5, 6]]
This will work for all cases, including those in which not all the rows in uniq_rows are present in test_rows. However, if you know ahead of time that all of them are present, you could replace the part

res = [[] for i in range(len(uniq_rows))]
for i in range(len(indices)):
    res[loc[i]].append(indices[i])

with just the line:

res = np.split(indices, np.where(np.diff(loc) > 0)[0] + 1)
Thus avoiding loops entirely.
Not very 'numpythonic', but for a bit of an upfront cost, we can make a dict with the rows as tuple keys and lists of indices as values:

test_rowsdict = {}
for i, j in enumerate(test_rows):
    test_rowsdict.setdefault(tuple(j), []).append(i)
test_rowsdict
{(0, 0, 0): [2, 11],
 (0, 1, 0): [1, 4, 10],
 (0, 1, 1): [0, 5, 6],
 (1, 1, 0): [3, 8, 12],
 (1, 1, 1): [7, 9]}

Then you can filter based on your uniq_rows, with a fast dict lookup: test_rowsdict[tuple(row)]:

out = []
for i in uniq_rows:
    out.append((i, test_rowsdict.get(tuple(i), [])))
For your data, I get 16us for just the lookup, and 66us for building and looking up, versus 95us for your np.where solution.
Approach #1
Here's one approach, not sure about the level of "NumPythonic-ness" though to such a tricky problem -
def get1Ds(a, b):  # Get 1D views of each row from the two inputs
    # check that casting to void will create equal size elements
    assert a.shape[1:] == b.shape[1:]
    assert a.dtype == b.dtype
    # compute dtypes
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    # convert to 1d void arrays
    a = np.ascontiguousarray(a)
    b = np.ascontiguousarray(b)
    a_void = a.reshape(a.shape[0], -1).view(void_dt).ravel()
    b_void = b.reshape(b.shape[0], -1).view(void_dt).ravel()
    return a_void, b_void

def matching_row_indices(uniq_rows, test_rows):
    A, B = get1Ds(uniq_rows, test_rows)
    validA_mask = np.in1d(A, B)
    sidx_A = A.argsort()
    validA_mask = validA_mask[sidx_A]
    sidx = B.argsort()
    sortedB = B[sidx]
    split_idx = np.flatnonzero(sortedB[1:] != sortedB[:-1]) + 1
    all_split_indx = np.split(sidx, split_idx)
    match_mask = np.in1d(B, A)[sidx]
    valid_mask = np.logical_or.reduceat(match_mask, np.r_[0, split_idx])
    locations = [e for i, e in enumerate(all_split_indx) if valid_mask[i]]
    return uniq_rows[sidx_A[validA_mask]], locations
Scope(s) of improvement (on performance) :
np.split could be replaced by a for-loop for splitting using slicing.
np.r_ could be replaced by np.concatenate.
Sample run -
In [331]: unq_rows, idx = matching_row_indices(uniq_rows, test_rows)
In [332]: unq_rows
Out[332]:
array([[0, 1, 0],
       [0, 1, 1],
       [1, 1, 0],
       [1, 1, 1]])
In [333]: idx
Out[333]: [array([ 1,  4, 10]), array([0, 5, 6]), array([ 3,  8, 12]), array([7, 9])]
Approach #2
Another approach to beat the setup overhead from the previous one and making use of get1Ds from it, would be -
A, B = get1Ds(uniq_rows, test_rows)
idx_group = []
for row in A:
    idx_group.append(np.flatnonzero(B == row))
The numpy_indexed package (disclaimer: I am its author) was created to solve problems of this kind in an elegant and efficient manner:
import numpy_indexed as npi
indices = np.arange(len(test_rows))
unique_test_rows, index_groups = npi.group_by(test_rows, indices)
If you don't care about the indices of all rows, but only those present in test_rows, npi has a bunch of simple ways of tackling that problem too, e.g.:
subset_indices = npi.indices(unique_test_rows, unique_rows)
As a sidenote: it might be useful to take a look at the examples in the npi library; in my experience, most of the time people ask a question of this kind, these grouped indices are just a means to an end, not the end goal of the computation. Chances are that using the functionality in npi you can reach that end goal more efficiently, without ever explicitly computing those indices. Do you care to give some more background to your problem?
EDIT: if your arrays are indeed this big, and always consist of a small number of columns with binary values, wrapping them with the following encoding might boost efficiency a lot further still:
def encode(rows):
    return (rows * [[2**i for i in range(rows.shape[1])]]).sum(axis=1, dtype=np.uint8)
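A hypothetical usage of this encode helper (my own sketch; assumes at most 8 binary columns so the codes fit in uint8): compare rows as single integers instead of full rows.

import numpy as np

def encode(rows):
    # Treat each binary row as the bits of one small integer.
    return (rows * [[2**i for i in range(rows.shape[1])]]).sum(axis=1, dtype=np.uint8)

uniq_rows = np.array([[0, 1, 0], [1, 1, 0], [1, 1, 1], [0, 1, 1]])
test_rows = np.array([[0, 1, 1], [0, 1, 0], [0, 0, 0], [1, 1, 0],
                      [0, 1, 0], [0, 1, 1], [0, 1, 1], [1, 1, 1],
                      [1, 1, 0], [1, 1, 1], [0, 1, 0], [0, 0, 0],
                      [1, 1, 0]])
# One integer comparison per row instead of a full row comparison.
groups = [np.flatnonzero(encode(test_rows) == c) for c in encode(uniq_rows)]
print(groups)
# [array([ 1,  4, 10]), array([ 3,  8, 12]), array([7, 9]), array([0, 5, 6])]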
