Why is my Hungarian algorithm function not working? - python

I want to solve the following combinatorial optimization task using the Hungarian algorithm:
Different teams of the same company may have employees in different cities. Employees who hold the same position can be swapped between teams. The goal is to distribute employees to teams in such a way as to maximize the number of teams whose employees are all in the same city.
The data consists of 4 columns: employee's id, employee's team, employee's position, employee's city. The goal is to redistribute employees across teams and get a table with the same 4 columns: employee's id, employee's team, employee's position, employee's city.
Example:
input:
employee_id team_id position city
1 0.0 Manager New York
2 0.0 Manager San Francisco
3 1.0 Engineer Boston
4 1.0 Engineer Boston
5 2.0 Engineer London
6 2.0 Engineer London
7 2.0 Manager New York
output:
employee_id team_id position city
1 0.0 Manager New York
7 0.0 Manager New York
3 1.0 Engineer Boston
4 1.0 Engineer Boston
5 2.0 Engineer London
6 2.0 Engineer London
2 2.0 Manager San Francisco
As you can see, employees with ids 7 and 2 were swapped, and now all employees of team 0.0 are in New York.
I wrote this code:
import numpy as np
import scipy.optimize as opt
def hungarian(cost_matrix):
    row_ind, col_ind = opt.linear_sum_assignment(cost_matrix)
    return row_ind, col_ind

def redistribute_employees(employee_data, cost_matrix):
    n = len(employee_data)
    row_ind, col_ind = hungarian(cost_matrix)
    new_teams = np.zeros(n)
    for i in range(n):
        new_teams[i] = col_ind[int(employee_data[i, 1])]
    return np.column_stack((employee_data[:, 0], new_teams,
                            employee_data[:, 2], employee_data[:, 3]))

employee_data = np.array([[1, 0, 'Manager', 'New York'],
                          [2, 0, 'Manager', 'San Francisco'],
                          [3, 1, 'Engineer', 'Boston'],
                          [4, 1, 'Engineer', 'Boston'],
                          [5, 2, 'Engineer', 'London'],
                          [6, 2, 'Engineer', 'London'],
                          [7, 2, 'Manager', 'New York']])
city_map = {'New York': 0, 'San Francisco': 1, 'Boston': 2, 'London': 3}
n = len(employee_data)
cost_matrix = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        if employee_data[i, 2] == employee_data[j, 2] and employee_data[i, 3] != employee_data[j, 3]:
            cost_matrix[i, j] = 1
redistributed_employees = redistribute_employees(employee_data, cost_matrix)
print(redistributed_employees)
The condition employee_data[i, 2] == employee_data[j, 2] and employee_data[i, 3] != employee_data[j, 3] checks whether two employees have the same position (employee_data[i, 2] == employee_data[j, 2]) but work in different cities (employee_data[i, 3] != employee_data[j, 3]). In other words, it marks pairs of employees that could be swapped without changing any position, while ruling out pairs that already work in the same city.
The print of cost_matrix is:
array([[0., 1., 0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 1., 1., 0.],
       [0., 0., 0., 0., 1., 1., 0.],
       [0., 0., 1., 1., 0., 0., 0.],
       [0., 0., 1., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0.]])
But the output is:
[['1' '0.0' 'Manager' 'New York']
['2' '0.0' 'Manager' 'San Francisco']
['3' '1.0' 'Engineer' 'Boston']
['4' '1.0' 'Engineer' 'Boston']
['5' '2.0' 'Engineer' 'London']
['6' '2.0' 'Engineer' 'London']
['7' '2.0' 'Manager' 'New York']]
Nothing changed and I don't understand why. The steps of the algorithm I use are:
1. Create a cost matrix whose entries represent the cost of assigning an employee to a team. Here the cost is 1 if two employees are in different cities, and 0 if they are in the same city.
2. Run the Hungarian algorithm on the cost matrix to find a minimum-cost matching between employees and teams. The algorithm should find the minimum number of swaps needed to achieve a maximum number of employees in the same city.
3. Based on the minimum-cost matching, reassign employees to teams to maximize the number of employees in the same city.
4. The final output is a table with the reassigned employee ids, teams, positions, and cities.
What am I doing wrong? How can I fix my code?
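For what it's worth, two mechanical issues in the code stand out, sketched below against a purely hypothetical 3x3 benefit matrix (the real matrix has to come from your own city/position logic, and maximizing the number of fully co-located teams may not reduce to a single linear assignment at all):

import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical benefit matrix: 1 means "placing employee i in slot j
# helps keep a team together", 0 means it does not.
benefit = np.array([[1, 0, 0],
                    [0, 0, 1],
                    [0, 1, 0]])

# Issue 1: linear_sum_assignment MINIMIZES. The 0/1 cost matrix in the
# question has an all-zero diagonal, so the identity assignment already
# costs 0 and nothing ever moves. Negate the matrix to maximize instead
# (SciPy >= 1.4 also accepts maximize=True).
row_ind, col_ind = linear_sum_assignment(-benefit)

# Issue 2: col_ind[i] is the column matched to *row* i, so it should be
# indexed by the employee's row, not by the team id as in
# new_teams[i] = col_ind[int(employee_data[i, 1])].
for i in row_ind:
    print('employee row', i, '-> slot', col_ind[i])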

Related

Mask zero values in matrix and reconstruct original matrix using indices

In case we have
indice = [0 0 1 1 0 1];
and
X = [0 0 0; 0 0 0; 5 8 9; 10 11 12; 0 0 0; 20 3 4],
I would like to use indice to mask the 0 rows in X and get Xx = [5 8 9; 10 11 12; 20 3 4], and then, from Xx, go back to the initial dimensions: newX = [0 0 0; 0 0 0; 5 8 9; 10 11 12; 0 0 0; 20 3 4]
for i in range(3):
    a = X[:, i];
    Xx[:, i] = a[indice];
and back to the initial dimensions:
for ii in range(3):
    aa = Xx[:, ii]
    bb[indice] = aa
    newX[:, ii] = bb
Could you please help me solve this with Python?
Using numpy.where, life is much easier.
X = np.array([[0, 0, 0], [0, 0, 0], [5, 8, 9], [10, 11, 12], [0, 0, 0], [20, 3, 4]])
index = np.where(X.any(axis=1))[0]  # find the rows that are not all 0s
print(X[index])
# array([[ 5,  8,  9],
#        [10, 11, 12],
#        [20,  3,  4]])
EDIT:
If you really want to reconstruct it, and based on the fact that you know that you have removed lines with all 0s, then:
Create a new matrix with all 0s:
X_new = np.zeros(X.shape)
and insert the values where they should be:
X_new[index] = X[index]
Now check the X_new:
X_new
array([[ 0.,  0.,  0.],
       [ 0.,  0.,  0.],
       [ 5.,  8.,  9.],
       [10., 11., 12.],
       [ 0.,  0.,  0.],
       [20.,  3.,  4.]])
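A boolean row mask works just as well and can be reused directly for the reconstruction; a small self-contained sketch of that variant:

import numpy as np

X = np.array([[0, 0, 0], [0, 0, 0], [5, 8, 9], [10, 11, 12], [0, 0, 0], [20, 3, 4]])
mask = X.any(axis=1)     # True for rows that contain a nonzero value
Xx = X[mask]             # compacted matrix
newX = np.zeros_like(X)  # rebuild at the original shape
newX[mask] = Xx          # scatter the kept rows back into place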

Ignore dimension when using np.einsum

I use np.einsum to calculate the flow of material in a graph (1 node to 4 nodes in this example). The amount of flow is given by amount (amount.shape == (1, 1, 2); the dimensions define certain criteria, let's call them a, b, c).
The boolean matrix route determines the permissible flow into y based on the a, b, c criteria (route.shape == (4, 1, 1, 2); yabc). I label the dimensions y, a, b, c. abc are equivalent to amount's dimensions abc; y is the direction of the flow (0, 1, 2 or 3). To determine the amount of material in y, I calculate np.einsum('abc,yabc->y', amount, route) and get a y-dim vector with the flows into y. There is also an implicit prioritisation of the routes: wherever route[0, ...] == True, the corresponding entries are False for y=1..3; wherever route[1, ...] == True, they are False for the higher y-dim routes, and so on. route[3, ...] (the last y-index) defines the catch-all route, that is, its values are True exactly where all previous y-index values were False ((route[0] ^ route[1] ^ route[2] ^ route[3]).all() == True).
This works fine. However, when I introduce another criterion (dimension) x which exists only in route, not in amount, this logic seems to break. The code below demonstrates the problem:
>>> import numpy as np
>>> amount = np.asarray([[[5000.0, 0.0]]])
>>> route = np.asarray([[[[[False, True]]], [[[False, True]]], [[[False, True]]]], [[[[True, False]]], [[[False, False]]], [[[False, False]]]], [[[[False, False]]], [[[True, False]]], [[[False, False]]]], [[[[False, False]]], [[[False, False]]], [[[True, False]]]]], dtype=bool)
>>> amount.shape
(1, 1, 2)
>>> # Added dimension `x`
>>> # y,x,a,b,c
>>> route.shape
(4, 3, 1, 1, 2)
>>> # Attempt 1: `5000` can flow into y=1, 2 or 3. I expect
>>> # `flows1.sum() == amount.sum()` as it would be without `x`.
>>> # Correct solution would be `[0, 5000, 0, 0]` because material is routed
>>> # to y=1, and is not available for y=2 and y=3 as they are lower
>>> # priority (higher index)
>>> flows1 = np.einsum('abc,yxabc->y', amount, route)
>>> flows1
array([ 0., 5000., 5000., 5000.])
>>> # Attempt 2: try to collapse `x` => not much different, duplication
>>> np.einsum('abc,yabc->y', amount, route.any(1))
array([ 0., 5000., 5000., 5000.])
>>> # This is the flow by `y` and `x`. I'd only expect a `5000` in the
>>> # 2nd row (`[5000., 0., 0.]`) not the others.
>>> np.einsum('abc,yxabc->yx', amount, route)
array([[   0.,    0.,    0.],
       [5000.,    0.,    0.],
       [   0., 5000.,    0.],
       [   0.,    0., 5000.]])
Is there any feasible operation which I can apply to route (.all(1) doesn't work either) to ignore the x-dimension?
Another example:
>>> amount2 = np.asarray([[[5000.0, 1000.0]]])
>>> np.einsum('abc,yabc->y', amount2, route.any(1))
array([1000., 5000., 5000., 5000.])
can be interpreted as 1000.0 being routed to y=0 (and none of the other y-destinations) and 5000.0 being compatible with destinations y=1, y=2 and y=3; but ideally, I'd only like 5000.0 to show up in y=1 (as that's the lowest index and highest destination priority).
Solution attempt
The below works, but is not very numpy-ish. It would be great if the loop could be eliminated.
# Initialise destination
result = np.zeros((route.shape[0]))
# Calculate flow by maintaining all dimensions (this will cause
# double ups because `x` is not part of `amount2`)
temp = np.einsum('abc,yxabc->yxabc', amount2, route)
temp_ixs = np.asarray(np.where(temp))
# For each original amount, find the destination (`y`)
for a, b, c in zip(*np.where(amount2)):
    # Find where dimensions `abc` are equal in the destination.
    # Take the first vector which contains `yxabc` (we get `yx` as result)
    ix = np.where((temp_ixs[2:].T == [a, b, c]).all(axis=1))[0][0]
    y_ix = temp_ixs.T[ix][0]
    # ignored
    x_ix = temp_ixs.T[ix][1]
    v = amount2[a, b, c]
    # build resulting destination
    result[y_ix] += v
# result == array([1000., 5000., 0., 0.])
In other words, for each value in amount2, I am looking for the lowest indices yx in temp so that the value can be written to result[y] = value (x is ignored).
>>> temp = np.einsum('abc,yxabc->yx', amount2, route)
>>> temp
#           +--- value=1000 at y=0 => result[0] += 1000
#          /
array([[1000., 1000., 1000.],
#           +--- value=5000 at y=1 => result[1] += 5000
#          /
       [5000.,    0.,    0.],
       [   0., 5000.,    0.],
       [   0.,    0., 5000.]])
>>> result
array([1000., 5000., 0., 0.])
>>> amount2
array([[[5000., 1000.]]])
Another attempt to reduce the dimensionality of route is:
>>> r = route.any(1)
>>> for x in xrange(1, route.shape[0]):
...     r[x] = r[x] & (r[:x] == False).all(axis=0)
>>> np.einsum('abc,yabc->y', amount2, r)
array([1000., 5000., 0., 0.])
This essentially preserves the above-mentioned priority given by the first dimension of route: a lower-priority (higher index) array cannot contain a True value where a higher-priority array already has a True at that sub-index. While this is a lot better than my explicit approach, it would be great if the for x in xrange... loop could be expressed as a numpy vector operation.
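For what it's worth, that priority filter can be written without the Python loop by comparing a cumulative sum along the y-axis against 1: an entry survives exactly when it is True and no higher-priority (lower-index) entry was True at the same sub-index. A sketch, reusing route and amount2 from above:

r = route.any(1)                           # collapse `x` as before
r_first = r & (np.cumsum(r, axis=0) == 1)  # keep only the first True per abc index
print(np.einsum('abc,yabc->y', amount2, r_first))
# expected: [1000. 5000.    0.    0.]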
I haven't tried to follow your 'flow' interpretation of the multiplication problem. I'm just focusing on the calculation options.
Stripped of unnecessary dimensions, your arrays are:
In [194]: amount
Out[194]: array([5000., 0.])
In [195]: route
Out[195]:
array([[[0, 1],
        [0, 1],
        [0, 1]],

       [[1, 0],
        [0, 0],
        [0, 0]],

       [[0, 0],
        [1, 0],
        [0, 0]],

       [[0, 0],
        [0, 0],
        [1, 0]]])
And the yx calculation is:
In [197]: np.einsum('a,yxa->yx', amount, route)
Out[197]:
array([[   0.,    0.,    0.],
       [5000.,    0.,    0.],
       [   0., 5000.,    0.],
       [   0.,    0., 5000.]])
which is just this slice of route times 5000.
In [198]: route[:,:,0]
Out[198]:
array([[0, 0, 0],
       [1, 0, 0],
       [0, 1, 0],
       [0, 0, 1]])
Omitting the x on the RHS of the einsum results in summation across that dimension.
Equivalently we can multiply (with broadcasting):
In [200]: (amount*route).sum(axis=2)
Out[200]:
array([[   0.,    0.,    0.],
       [5000.,    0.,    0.],
       [   0., 5000.,    0.],
       [   0.,    0., 5000.]])
In [201]: (amount*route).sum(axis=(1,2))
Out[201]: array([   0., 5000., 5000., 5000.])
Maybe looking at amount*route will help you visualize the problem. You can also use max, min, argmax, etc. instead of sum, or along with it, on one or more of the axes.

How to transform this data for logistic regression?

I have 'y' and 'X' data:
y = [1, 0, 0, 0, 0, 0, 0, 0 ...] (it's OK for my purpose)
and
X = [['reg' '03b' '03e' 'buy']
['reg' '03b' '04e' 'sell']
['pref' '02b' '03e' 'sell']
['cur' '03b' '03e' 'buy']
['val' '03b' '03e' 'buy']
['reg' '03b' '03e' 'buy'] ...]
X[0] may take the values 'reg'/'pref'/'cur'/'val'
X[1]: a string with the number of the month + b (= begin) at the end
X[2]: a string with the number of the month + e (= end) at the end
X[3]: 'buy' or 'sell'
But I can't do
logreg = LogisticRegression()
logreg.fit(X, y)
because I have trouble with the structure of X (it is a list of lists of strings).
I want to fix it and do:
logreg = preprocessing.LabelEncoder()
i = 0
while i < len(X):
    logreg.fit(X[i])
    b[i] = logreg.transform(X[i])
    i = i + 1
But I get this:
[3 0 1 2]
[3 0 1 2]
[3 0 1 2]
[3 0 1 2]
[3 0 1 2]
[3 0 1 2]
...
[3 0 1 2]
All elements are the same. How can I correctly transform my data for .fit(X,y)?
The problem is that you have mixed up the rows and columns of X.
import numpy as np
from sklearn import preprocessing

X = [['reg', '03b', '03e', 'buy'],
     ['reg', '03b', '04e', 'sell'],
     ['pref', '02b', '03e', 'sell'],
     ['cur', '03b', '03e', 'buy'],
     ['val', '03b', '03e', 'buy'],
     ['reg', '03b', '03e', 'buy']]
X = np.array(X)
b = np.zeros(X.shape)

logreg = preprocessing.LabelEncoder()
i = 0
while i < X.shape[1]:
    logreg.fit(X[:, i])
    b[:, i] = logreg.transform(X[:, i])
    i += 1
b
array([[2., 1., 0., 0.],
       [2., 1., 1., 1.],
       [1., 0., 0., 1.],
       [0., 1., 0., 0.],
       [3., 1., 0., 0.],
       [2., 1., 0., 0.]])
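With the columns encoded, the fit from the question should go through. A small sketch continuing the snippet above; the y values here are hypothetical labels, one per row. Note that label encoding imposes an artificial order on the categories, so one-hot encoding (pd.get_dummies or OneHotEncoder) is often a better fit for logistic regression:

from sklearn.linear_model import LogisticRegression

y = [1, 0, 0, 0, 0, 0]  # hypothetical labels, one per row of X
clf = LogisticRegression()
clf.fit(b, y)           # b is the encoded matrix from above
print(clf.predict(b))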

Converting Python Dictionary to 3D Matlab Matrix

I have the following dictionary results_dict in Python 3.2 where the key field is a string value and the value field is a list of 3 arrays. Each array has 400 float values. I want to convert this dictionary into a data structure that can be used in Matlab 2017b. However, if I execute the following:
savemat('GridCellResults.mat', results_dict, oned_as='row');
The command executes successfully, but Matlab is not able to understand the matrix file. For this reason, I wrote the following code to convert the previous dictionary into a 3-dimensional matrix (X, Y, Z), where X is the size of each array (400 elements), Y is the number of arrays for each dictionary key (3 arrays), and Z is the number of elements in the dictionary. However, when I execute the code below, I get the following error:
IndexError: only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices
Here is the code. Any clue why I am getting this error? Even if I try without the transpose function, I keep getting the same error.
import numpy as np

CARDINALITY = 400  # Number of angular domain values.
NUM_COLUMNS = 3
NUM_CELLS = 114

matlab_array = np.zeros((CARDINALITY, NUM_COLUMNS, NUM_CELLS))
for key, value in results_dict.items():
    matlab_array[:, 0, key] = np.transpose(value[0])
    matlab_array[:, 1, key] = np.transpose(value[1])
    matlab_array[:, 2, key] = np.transpose(value[2])
Trying to follow your description, I can successfully write and read such a dictionary.
In an ipython session:
In [48]: from scipy.io import savemat, loadmat
In [49]: adict = {'a':[np.arange(3),np.ones(3),np.array([4,2,1])]}
In [50]: adict['b'] = [np.arange(3),np.ones(3),np.array([4,2,1])]
In [51]: adict
Out[51]:
{'a': [array([0, 1, 2]), array([1., 1., 1.]), array([4, 2, 1])],
'b': [array([0, 1, 2]), array([1., 1., 1.]), array([4, 2, 1])]}
In [52]: pwd
Out[52]: '/home/paul/mypy'
In [53]: savemat('stack48385062.mat',adict, oned_as='row')
In [54]: data = loadmat('stack48385062.mat')
In [55]: data
Out[55]:
{'__globals__': [],
'__header__': b'MATLAB 5.0 MAT-file Platform: posix, Created on: Mon Jan 22 09:15:31 2018',
'__version__': '1.0',
 'a': array([[0., 1., 2.],
        [1., 1., 1.],
        [4., 2., 1.]]),
 'b': array([[0., 1., 2.],
        [1., 1., 1.],
        [4., 2., 1.]])}
The lists of arrays (of constant size) were converted to 2d arrays.
In an Octave session:
>> load stack48385062.mat
>> a
a =
0 1 2
1 1 1
4 2 1
>> b
b =
0 1 2
1 1 1
4 2 1
>>
Or creating your 3d array (using a numeric index rather than string key):
In [56]: M=np.zeros([3, 3, 2])
In [57]: for i in range(len(adict)):
    ...:     for j in range(3):
    ...:         v = adict[list(adict.keys())[i]]
    ...:         M[:, j, i] = v[j]
    ...:
In [58]: M
Out[58]:
array([[[0., 0.],
        [1., 1.],
        [4., 4.]],

       [[1., 1.],
        [1., 1.],
        [2., 2.]],

       [[2., 2.],
        [1., 1.],
        [1., 1.]]])
>> load stack48385062_1.mat
>> M
M =
ans(:,:,1) =
0 1 4
1 1 2
2 1 1
ans(:,:,2) =
0 1 4
1 1 2
2 1 1
I should have made the initial dictionary with lists of 3 or 4 element arrays, so it would be easier to track transpositions. MATLAB and numpy have different axis orders, which can be confusing. savemat tries to compensate.
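Carrying the numeric-index idea back to the original results_dict, a hedged sketch (it assumes, as described in the question, 114 string keys that each map to a list of three 400-element arrays; the output variable name grid_cells is made up). enumerate() supplies the integer position that a string key cannot, which is exactly what the IndexError complains about:

import numpy as np
from scipy.io import savemat

CARDINALITY, NUM_COLUMNS, NUM_CELLS = 400, 3, 114
matlab_array = np.zeros((CARDINALITY, NUM_COLUMNS, NUM_CELLS))
for z, (key, value) in enumerate(results_dict.items()):
    for j in range(NUM_COLUMNS):
        matlab_array[:, j, z] = value[j]  # integer z instead of the string key

# savemat wants a dict mapping variable names to values:
savemat('GridCellResults.mat', {'grid_cells': matlab_array}, oned_as='row')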

Save one-hot-encoded features into Pandas DataFrame the fastest way

I have a Pandas DataFrame with all my features and labels. One of my features is categorical and needs to be one-hot-encoded.
The feature is an integer and can only take values from 0 to 4.
To save those arrays back into my DataFrame, I use the following code:
# enc is my OneHotEncoder object
df['mycol'] = df['mycol'].map(lambda x: enc.transform(x).toarray())
My DataFrame has more than 1 million rows, so the above code takes a while. Is there a faster way to assign the arrays to the DataFrame cells? Since I have just 5 categories, I don't need to call the transform() function 1 million times.
I already tried something like
num_categories = 5
i = 0
while i < num_categories:
    df.loc[df['mycol'] == i, 'mycol'] = enc.transform(i).toarray()
    i += 1
which yields this error:
ValueError: Must have equal len keys and value when setting with an ndarray
You can use pd.get_dummies:
>>> s
0    a
1    b
2    c
3    a
dtype: object
>>> pd.get_dummies(s)
   a  b  c
0  1  0  0
1  0  1  0
2  0  0  1
3  1  0  0
Alternatively:
>>> from sklearn.preprocessing import OneHotEncoder
>>> enc = OneHotEncoder()
>>> a = np.array([1, 1, 3, 2, 2]).reshape(-1, 1)
>>> a
array([[1],
       [1],
       [3],
       [2],
       [2]])
>>> one_hot = enc.fit_transform(a)
>>> one_hot.toarray()
array([[1., 0., 0.],
       [1., 0., 0.],
       [0., 0., 1.],
       [0., 1., 0.],
       [0., 1., 0.]])
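For the original frame, the whole column can be encoded in one call, which sidesteps the per-row transform() entirely. A sketch with a small hypothetical df standing in for the real one ('mycol' holds the integer categories 0..4 from the question):

import pandas as pd

df = pd.DataFrame({'mycol': [0, 3, 1, 4, 2, 0]})  # hypothetical stand-in
dummies = pd.get_dummies(df['mycol'], prefix='mycol')
df = pd.concat([df.drop(columns='mycol'), dummies], axis=1)
print(df.head())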
