In this particular issue, I have an imaginary city divided into squares - basically an MxN grid of squares covering the city. M and N can be relatively big, so I have cases with more than 40,000 square cells overall.
I have a number Z of customers distributed in this grid; some cells will contain many customers while others will be empty. I would like to find a way to place the minimum number of shops (only one per cell) to be able to serve all customers, with the restriction that every customer must be “in reach” of a shop and all customers must be included.
As an additional couple of twists, I have these constraints/issues:
There is a maximum distance that a customer can travel - if the shop is in a cell too far away then the customer cannot be associated with that shop. Edit: it’s not really a distance, it’s a measure of how easy it is for a customer to reach a shop, so I can’t use circles...
While respecting condition (1) above, there may well be multiple shops within reach of the same customer. In this case, the closest shop should win.
At the moment I’m trying to ignore the issue of costs - more customers mean bigger shops and higher costs - but maybe at some point I’ll think about that too. The problem is, I have no idea of the name of the problem I’m looking at or of possible algorithmic solutions for it: can this be solved as a Linear Programming problem?
I normally code in Python, so any suggestions on a possible algorithmic approach and/or some code/libraries to solve it would be very much appreciated.
Thank you in advance.
Edit: as a follow-up, I kind of found out I could solve this problem as a MINLP “uncapacitated facility location problem”, but all the information I have found is way too complex: I don’t care to know which customer is served by which shop, I only care to know if and where a shop is built. I have a secondary way - as post-processing - to associate a customer with the most appropriate shop.
All the code I found sets up a monstrous linear system with one constraint per customer per shop (as “explained” here: https://en.m.wikipedia.org/wiki/Facility_location_problem#Uncapacitated_facility_location), so in a situation like mine I could easily end up with a linear system with millions of rows and columns, which with integer/binary variables will take about the age of the universe to solve.
There must be an easier way to handle this...
I think this can be formulated as a set covering problem.
You say:
in a situation like mine I could easily end up with a linear system
with millions of rows and columns, which with integer/binary variables
will take about the age of the universe to solve
So let's see if that is even remotely true.
Step 1: generate some data
I generated a grid of 200 x 200, yielding 40,000 cells. I placed 500 customers at random. This looks like:
---- 22 PARAMETER cloc customer locations
x y
cust1 35 75
cust2 169 84
cust3 111 18
cust4 61 163
cust5 59 102
...
cust497 196 148
cust498 115 136
cust499 63 101
cust500 92 87
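The listings in this answer appear to come from GAMS. A rough Python equivalent of this data-generation step (a sketch, with made-up variable names) could look like:

import numpy as np

# 200 x 200 grid of cells, 500 customers placed on random cells
rng = np.random.default_rng(seed=42)
M_GRID, N_GRID, N_CUSTOMERS = 200, 200, 500
cust_x = rng.integers(0, M_GRID, size=N_CUSTOMERS)
cust_y = rng.integers(0, N_GRID, size=N_CUSTOMERS)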
Step 2: calculate reach of each customer
The next step is to determine for each customer c the allowed locations (i,j) within reach. I created a large sparse boolean matrix reach(c,i,j) for this. I used the rule: if the Manhattan distance satisfies
|i-cloc(c,'x')|+|j-cloc(c,'y')| <= 10
then the store at (i,j) can service customer c. Only the nonzero entries are stored; this data structure has about 106k elements.
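Continuing the Python sketch above, the sparse reach structure could be built like this (only cells within Manhattan distance 10 of a customer are kept):

MAX_DIST = 10
reach = []  # reach[c] = set of (i, j) cells whose store could serve customer c
for x, y in zip(cust_x, cust_y):
    cells = {(i, j)
             for i in range(max(0, x - MAX_DIST), min(M_GRID, x + MAX_DIST + 1))
             for j in range(max(0, y - MAX_DIST), min(N_GRID, y + MAX_DIST + 1))
             if abs(i - x) + abs(j - y) <= MAX_DIST}
    reach.append(cells)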
Step 3: Form MIP model
We form a simple MIP (set covering) model:

minimize    numStores = sum over all cells (i,j) of placeStore(i,j)
subject to  sum of placeStore(i,j) over the cells (i,j) in reach(c) >= 1, for every customer c
            placeStore(i,j) binary

The inequality constraint says: for each customer we need at least one store within reach. This is a very simple model to formulate and to implement.
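Since the question asks for Python, here is a sketch of the same set covering model with PuLP (any MIP modelling library would do; the data structures are the ones from the sketches above):

import pulp

# only cells within reach of at least one customer need a variable;
# cells that can serve nobody would stay at 0 anyway
candidate_cells = set().union(*reach)
place = {cell: pulp.LpVariable(f"place_{cell[0]}_{cell[1]}", cat="Binary")
         for cell in candidate_cells}

model = pulp.LpProblem("min_stores", pulp.LpMinimize)
model += pulp.lpSum(place.values())                    # minimize the number of stores
for c, cells in enumerate(reach):                      # every customer needs a store in reach
    model += pulp.lpSum(place[cell] for cell in cells) >= 1, f"cover_{c}"

model.solve(pulp.PULP_CBC_CMD(msg=False))
stores = [cell for cell, var in place.items() if var.value() == 1]
print(len(stores), "stores")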
Step 4: Solve
This is a large but easy MIP. It has 40,000 binary variables. It solves very fast. On my laptop it took less than 1 second with a commercial solver (3 seconds with open-source solver CBC).
The solution looks like:
---- 47 VARIABLE numStores.L = 113 number of stores
---- 47 VARIABLE placeStore.L store locations
j1 j6 j7 j8 j9 j15 j16 j17 j18
i4 1
i18 1
i40 1
i70 1
i79 1
i80 1
i107 1
i118 1
i136 1
i157 1
i167 1
i193 1
+ j21 j23 j26 j28 j29 j31 j32 j36 j38
i10 1
i28 1
i54 1
i72 1
i96 1
i113 1
i147 1
i158 1
i179 1
i184 1
i198 1
+ j39 j44 j45 j46 j49 j50 j56 j58 j59
i5 1
i18 1
i39 1
i62 1
i85 1
i102 1
i104 1
i133 1
i166 1
i195 1
+ j62 j66 j67 j68 j69 j73 j74 j76 j80
i11 1
i16 1
i36 1
i61 1
i76 1
i105 1
i112 1
i117 1
i128 1
i146 1
i190 1
+ j82 j84 j85 j88 j90 j92 j95 j96 j97
i17 1
i26 1
i35 1
i48 1
i68 1
i79 1
i97 1
i136 1
i156 1
i170 1
i183 1
i191 1
+ j98 j102 j107 j111 j112 j114 j115 j116 j118
i4 1
i22 1
i36 1
i56 1
i63 1
i68 1
i88 1
i100 1
i101 1
i111 1
i129 1
i140 1
+ j119 j121 j126 j127 j132 j133 j134 j136 j139
i11 1
i30 1
i53 1
i72 1
i111 1
i129 1
i144 1
i159 1
i183 1
i191 1
+ j140 j147 j149 j150 j152 j153 j154 j156 j158
i14 1
i35 1
i48 1
i83 1
i98 1
i117 1
i158 1
i174 1
i194 1
+ j161 j162 j163 j164 j166 j170 j172 j174 j175
i5 1
i32 1
i42 1
i61 1
i69 1
i103 1
i143 1
i145 1
i158 1
i192 1
i198 1
+ j176 j178 j179 j180 j182 j183 j184 j188 j191
i6 1
i13 1
i23 1
i47 1
i61 1
i81 1
i93 1
i103 1
i125 1
i182 1
i193 1
+ j192 j193 j196
i73 1
i120 1
i138 1
i167 1
I think we have debunked your statement that a MIP model is not a feasible approach to this problem.
Note that the age of the universe is 13.7 billion years or 4.3e17 seconds. So we have achieved a speed-up of about 1e17. This is a record for me.
Note that this model does not find the optimal locations for the stores, but only a configuration that minimizes the number of stores needed to service all customers. It is optimal in that sense. But the solution will not minimize the distances between customers and stores.
Related
I have a data bank of PDF files that I've downloaded through webscraping. I can extract the tables from these PDF files and visualise them in jupyter notebook like this:
import os
import camelot.io as camelot

n = 1
arr = os.listdir(r'D:\Test')  # arr is the list of PDF titles
for item in arr:
    tables = camelot.read_pdf(item, pages='all', split_text=True)
    print(f'''DATENBLATT {n}: {item}
''')
    n += 1
    for tabs in tables:
        print(tabs.df, "\n==============================================================================\n")
In this way I get the results for the two PDF files in the data bank as follows (PDF1, PDF2).
Now I would like to ask how I can get only specific data from the tables that contain, for example, "Voltage" and "Current" info. More specifically, I would like to extract user-defined or targeted info and make charts with these values instead of printing them as a whole.
Thanks in advance.
DATENBLATT 1: HY-Energy-Plus-Peak-Pack-HYP-00-2972-R2.pdf
0 1
0 Part Number HYP-00-2972
1 Voltage Nominal 51.8V
2 Voltage Range Min/Max 43.4V/58.1V
3 Charge Current 160A maximum \nDe-rated by BMS message over CA...
4 Discharge Current 300A maximum \nDe-rated by BMS message over CA...
5 Maximum Capacity 5.76kWh/111.4Ah
6 Maximum Energy Density 164Wh/kg
7 Useable capacity Limited to 90% by BMS to improve cell life
8 Dimensions W: 243 x L: 352 x H: 300.5mm
9 Weight 37kg
10 Mounting Fixtures 4x M8 mounting points for easy secure mounting
11
==============================================================================
0 \
0 Communication Protocol
1 Reported Information
2 Pack Protection Mechanism
3 Balancing Method
4 Multi-Pack Behaviour
5 Compatible Chargers as standard
6 Charger Control
7 Auxiliary Connectors
8 Power connectors
9
1
0 CAN bus at user selectable baud rate (propriet...
1 Cell Temperatures and Voltages, Pack Current, ...
2 Interlock to control external protection devic...
3 Actively controlled dissipative balancing
4 BMS implements a single master and multi-slave...
5 Zivan, Victron, Delta-Q, TC-Charger, SPE. For ...
6 Direct current control based on cell voltage/t...
7 Binder 720-Series 8-way male & female
8 4x Amphenol SurLok Plus 8mm \nWhen using batte...
9
==============================================================================
0 \
0 Max no of packs in series
1 Max Number of Parallel Packs
2 External System Requirements
3
1
0 10
1 127
2 External Protection Device (e.g. Contactor) co...
3
==============================================================================
DATENBLATT 2: HY-Energy-Standard-Pack-HYP-00-2889-R2.pdf
0 1
0 Part Number HYP-00-2889
1 Voltage Nominal 44.4V
2 Voltage Range Min/Max 37.2V/49.8V
3 Charge Current 132A maximum \nDe-rated by BMS message over CA...
4 Discharge Current 132A maximum \nDe-rated by BMS message over CA...
5 Maximum Capacity 4.94kWh/111Ah
6 Maximum Energy Density 152Wh/kg
7 Useable capacity Limited to 90% by BMS to improve cell life
8 Dimensions W: 243 x L: 352 x H: 265mm
9 Weight 32kg
10 Mounting Fixtures 4x M8 mounting points for easy secure mounting
11
==============================================================================
0 \
0 Communication Protocol
1 Reported Information
2 Pack Protection Mechanism
3 Balancing Method
4 Multi-Pack Behaviour
5 Compatible Chargers as standard
6 Charger Control
7 Auxiliary Connectors
8 Power connectors
9
1
0 CAN bus at user selectable baud rate (propriet...
1 Cell Temperatures and Voltages, Pack Current, ...
2 Interlock to control external protection devic...
3 Actively controlled dissipative balancing
4 BMS implements a single master and multi-slave...
5 Zivan, Delta-Q, TC-Charger, SPE, Victron, Bass...
6 Direct current control based on cell voltage/t...
7 Binder 720-Series 8-way male & female
8 4x Amphenol SurLok Plus 8mm \nWhen using batte...
9
==============================================================================
0 \
0 Max no of packs in series
1 Max Number of Parallel Packs
2 External System Requirements
3
1
0 12
1 127
2 External Protection Device (e.g. Contactor) co...
3
==============================================================================
You can define a list of the strings of interest;
then select only the tables which contain at least one of these strings.
import os
import camelot.io as camelot

n = 1
# define your strings of interest
interesting_strings = ["voltage", "current"]

arr = os.listdir(r'D:\Test')  # arr is the list of PDF titles
for item in arr:
    tables = camelot.read_pdf(item, pages='all', split_text=True)
    print(f'''DATENBLATT {n}: {item}
''')
    n += 1
    for tabs in tables:
        # select only tables which contain at least one of the interesting strings
        if any(s in tabs.df.to_string().lower() for s in interesting_strings):
            print(tabs.df, "\n==============================================================================\n")
If you want to search for interesting strings only in specific places (for example, in the first column), you can use pandas DataFrame indexing such as iloc:

any(s in tabs.df.iloc[:, 0].to_string().lower() for s in interesting_strings)
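To go from the matching tables to charts, one possible post-processing sketch (assuming the layout of the tables shown above, i.e. labels in column 0 and values in column 1, and applied to the tables of one datasheet) is:

import re
import pandas as pd

records = []
for tabs in tables:
    df = tabs.df
    # rows whose label column mentions one of the interesting strings
    mask = df.iloc[:, 0].str.lower().str.contains('|'.join(interesting_strings), na=False)
    sub = df[mask]
    for label, raw in zip(sub.iloc[:, 0], sub.iloc[:, 1]):
        match = re.search(r'\d+(\.\d+)?', raw)   # first number, e.g. 51.8 from "51.8V"
        if match:
            records.append({'label': label, 'value': float(match.group())})

values = pd.DataFrame(records)
values.plot.bar(x='label', y='value')            # simple chart (needs matplotlib) instead of printing everything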
I have a data file containing different foetal ultrasound measurements. The measurements are collected at different points during pregnancy, like so:
PregnancyID MotherID gestationalAgeInWeeks abdomCirc
0 0 14 150
0 0 21 200
1 1 20 294
1 1 25 315
1 1 30 350
2 2 8 170
2 2 9 180
2 2 18 NaN
As you can see from the table above, I have multiple measurements per pregnancy (between 1 and 26 observations each).
I want to summarise the ultrasound measurements somehow such that I can replace the multiple measurements with a fixed number of features per pregnancy. So I thought of creating 3 new features, one for each trimester of pregnancy, that would hold the maximum measurement recorded during that trimester:
abdomCirc1st: this feature would hold the maximum value of all abdominal circumference measurements measured between 0 to 13 Weeks
abdomCirc2nd: this feature would hold the maximum value of all abdominal circumference measurements measured between 14 to 26 Weeks
abdomCirc3rd: this feature would hold the maximum value of all abdominal circumference measurements measured between 27 to 40 Weeks
So my final dataset would look like this:
PregnancyID MotherID abdomCirc1st abdomCirc2nd abdomCirc3rd
0 0 NaN 200 NaN
1 1 NaN 315 350
2 2 180 NaN NaN
The reason for using the maximum here is that a larger abdominal circumference is associated with the adverse outcome I am trying to predict.
But I am quite confused about how to go about this. I have used the groupby function previously to derive certain statistical features from the multiple measurements; however, this is a more complex task.
What I want to do is the following:
Group all abdominal circumference measurements that belong to the same pregnancy into 3 trimesters based on the gestationalAgeInWeeks value
Compute the maximum value of all abdominal circumference measurements within each trimester, and assign this value to the relevant feature: abdomCirc1st, abdomCirc2nd or abdomCirc3rd.
I think I have to do something along the lines of:
df["abdomCirc1st"] = df.groupby(['MotherID', 'PregnancyID', 'gestationalAgeInWeeks'])["abdomCirc"].transform('max')
But this code does not check what trimester the measurement was taken in (gestationalAgeInWeeks). I would appreciate some help with this task.
You can try this. A bit of a complicated query, but it seems to work:

(df.groupby(['MotherID', 'PregnancyID'])
   .apply(lambda d: d.assign(tm=(d['gestationalAgeInWeeks'] + 13 - 1) // 13)
                     .groupby('tm')['abdomCirc']
                     .apply(max))
   .unstack()
)
produces
tm 1 2 3
MotherID PregnancyID
0 0 NaN 200.0 NaN
1 1 NaN 294.0 350.0
2 2 180.0 NaN NaN
Let's unpick this a bit. First we group by MotherID and PregnancyID. Then we apply a function to each grouped dataframe (d).
For each d, we create a 'trimester' column 'tm' via assign (I assume I got the math right here, but correct it if it is wrong!), then we group by 'tm' and apply max. For each sub-dataframe d we thus obtain a Series mapping tm to max(abdomCirc).
Then unstack() moves tm to the column names.
You may want to rename these columns later, but I did not bother.
Solution 2
Come to think of it you can simplify the above a bit:
(df.assign(tm=(df['gestationalAgeInWeeks'] + 13 - 1) // 13)
   .drop(columns='gestationalAgeInWeeks')
   .groupby(['MotherID', 'PregnancyID', 'tm'])
   .agg('max')
   .unstack()
)
similar idea, same output.
There is a handy DataFrame method called query. This should do your work for now:
abdomCirc1st = df.query('MotherID == 0 and PregnancyID == 0 and gestationalAgeInWeeks <= 13')['abdomCirc'].max()
abdomCirc2nd = df.query('MotherID == 0 and PregnancyID == 0 and gestationalAgeInWeeks >= 14 and gestationalAgeInWeeks <= 26')['abdomCirc'].max()
abdomCirc3rd = df.query('MotherID == 0 and PregnancyID == 0 and gestationalAgeInWeeks >= 27 and gestationalAgeInWeeks <= 40')['abdomCirc'].max()
If you want something more automatic (without manually changing the ID values MotherID and PregnancyID every time for each different group of rows), you have to combine it with groupby (as you did on your own).
Check this as well: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html
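For example, a sketch of that combination using pd.cut to bin the weeks into trimesters, so that every pregnancy is handled at once without hard-coding the IDs:

import pandas as pd

df['trimester'] = pd.cut(df['gestationalAgeInWeeks'],
                         bins=[0, 13, 26, 40],
                         labels=['abdomCirc1st', 'abdomCirc2nd', 'abdomCirc3rd'],
                         include_lowest=True)
result = (df.groupby(['MotherID', 'PregnancyID', 'trimester'])['abdomCirc']
            .max()        # maximum measurement within each trimester
            .unstack())   # one column per trimester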
I have a dataframe that looks something like the one below.
Spent Products bought Target Variable
0 2300 Car/Mortgage/Leisure 0
1 1500 Car/Education 0
2 150 Groceries 1
3 700 Groceries/Education 1
4 900 Mortgage 1
5 180 Education/Sports 1
6 1800 Car/Mortgage/Others 0
7 900 Sports/Groceries 1
8 1000 Self-Enrichment/Car 1
9 140 Car/Groceries 1
I used pd.get_dummies to one-hot encode the "Products bought" column. Now I have a shape of (5000, 150).
I did a train/test split on my data and then applied PCA: I fit_transform the train set and applied only transform to the test set. Following that, I used a decision tree classifier to predict, which got me 90% accuracy.
Now here comes the problem. I have a new set of data. I know my model was trained on a shape of (, 150), and this new data only has a shape of (150, 28) after applying encoding with pd.get_dummies.
I know merging the new data with the old dataset is not a solution. I'm kind of stuck and not sure how to go about solving this. Does anyone have any input? Thanks.
Edit: I tried reindexing the new dataset, but it did not work. There are more unique values in the "Products bought" column of my training set than in my new dataset.
The new dataframe looks more like something like the one below.
Spent Products bought Target Variable
0 230 Leisure 1
1 150 Others 1
2 100 Groceries 1
3 700 Education 1
4 900 Mortgage 0
5 180 Education/Sports 1
6 1800 Car/Mortgage 0
7 400 Groceries 1
8 4000 Car 1
9 140 Car/Groceries 1
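A minimal sketch of the column-alignment idea mentioned in the edit above (hypothetical names train_df and new_df; str.get_dummies is assumed here to split the slash-separated products, and the saved training columns are then forced onto the new data):

# train_df and new_df are hypothetical pandas DataFrames with the columns shown above
train_dummies = train_df['Products bought'].str.get_dummies(sep='/')
train_columns = train_dummies.columns  # the ~150 columns the model was trained on

new_dummies = new_df['Products bought'].str.get_dummies(sep='/')
# keep exactly the training columns: products missing from the new data become 0,
# products the training set never saw are dropped
new_dummies = new_dummies.reindex(columns=train_columns, fill_value=0)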
For a given set of players, player positions, player cost, a budget and a set of constraints, how can I find the "optimal" solution? For example:
ID - Pos - cost - pts
1 1 13 10
2 1 5 13
3 2 10 15
4 2 10 8
5 3 12 12
6 3 7 14
and a budget of 30 (total cost cannot exceed 30), with a limit of 1 player per position.
The real problem I'm trying to solve is: I have estimated points per player in fantasy football. Now given the constraints in fantasy football, i.e.
a budget of 100
1 goalkeeper
a max of 5 defenders, min of 3 defenders
a max of 5 midfielders, min of 3 midfielders
a max of 3 strikers, min of 1 striker
Given these constraints, how do I find the maximum pts?
What libraries and tools are available for something like this? I could imagine myself doing this in Excel solver, but given my dataset with over 1000 players it wouldn't work.
I started writing some custom code, but quickly realized there must be some readymade solutions for this.
Scikit-optimize is a good starting point
https://scikit-optimize.github.io/
This is pretty much the knapsack problem (https://en.wikipedia.org/wiki/Knapsack_problem) with the additional constraint that players in the same position can't be picked together, which has been discussed here before:
Knapsack with items to consider constraint
As stated there the problem is NP-hard.
You might have a look at the itertools module to reduce the runtime of your computation.
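Since the question asks for libraries: this selection problem can also be written directly as a small integer program, for example with PuLP. A sketch using the toy data from the question (budget 30, at most one player per position; the real fantasy rules would become min/max constraints per position group):

import pulp

# (id, position, cost, pts) from the toy example
players = [(1, 1, 13, 10), (2, 1, 5, 13), (3, 2, 10, 15),
           (4, 2, 10, 8), (5, 3, 12, 12), (6, 3, 7, 14)]

prob = pulp.LpProblem("team_selection", pulp.LpMaximize)
pick = {pid: pulp.LpVariable(f"pick_{pid}", cat="Binary") for pid, _, _, _ in players}

prob += pulp.lpSum(pick[pid] * pts for pid, _, _, pts in players)          # maximise points
prob += pulp.lpSum(pick[pid] * cost for pid, _, cost, _ in players) <= 30  # budget
for pos in {p for _, p, _, _ in players}:                                  # at most 1 per position
    prob += pulp.lpSum(pick[pid] for pid, ppos, _, _ in players if ppos == pos) <= 1

prob.solve()
print([pid for pid in pick if pick[pid].value() == 1])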
This is a snippet of my Data Frame in pandas
SubJob DetectorName CategoryID DefectID Image
0 0 NECK:1 79 5
1 0 NECK:2 79 6
2 0 NECK:3 92 4
3 0 NECK:4 99 123
4 0 NECK:5 99 124
5 1 NECK:6 79 47
6 1 NECK:7 91 631
7 1 NECK:8 98 646
8 1 NECK:9 99 7
9 2 NECK:10 79 15
10 2 NECK:11 89 1023
11 2 NECK:12 79 1040
12 2 NECK:13 79 2458
13 3 NECK:14 73 2459
14 3 NECK:15 87 2517
15 3 NECK:15 79 3117
16 3 NECK:16 79 3118
and so on up to row n, where n is very large.
We have multiple SubJobs, which are sorted; inside each SubJob we have multiple CategoryIDs, which are sorted; and inside each CategoryID we have multiple DefectIDs, which are also sorted.
I have a separate nested list
[[CategoryId, DefectId, Image-Link] [CategoryId, DefectId, Image-Link] ...m times]
m is large
Here CategoryID and DefectID are integer values and the image link is a string.
Now I repeatedly pick a CategoryID and DefectID from the list, find the row in the dataframe corresponding to that CategoryID and DefectID, and add the image link to that row.
my current code is
for image_info_list in final_image_info_list:
    # add path of image in Image_Link
    frame_main.ix[(frame_main["CategoryID"].values == image_info_list[0])
                  &
                  (frame_main["DefectID"].values == image_info_list[1]),
                  "Image_Link"] = image_info_list[2]
This is working perfectly, but my issue is that since n and m are very large, it takes a lot of time to compute. Is there any other, more appropriate approach?
Can I apply binary search here? If yes, then how?
For a fixed n, if m is large enough, you can perform queries more efficiently by some preprocessing.
(I would start with Idea 2 below, because Idea 1 is much more work to implement.)
Idea 1
First, sort the dataframe by [CategoryId, DefectId, Image-Link]. Following that, you can find any triplet by a triple application of a bisect algorithm, one per column, on the column's values.
The cost of what you're doing now is O(m n). The cost of my suggestion is O(n log(n) + m log(n)).
This will work better for some values of m and n, and worse for others. E.g., if m = Θ(n), then your current algorithm is Θ(n²) = ω(n log(n)). YMMV.
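A rough pandas sketch of this idea (sorting once on the two lookup columns and then going through the index, instead of scanning every row with a boolean mask for each query; column names as in the question):

frame_sorted = (frame_main.sort_values(["CategoryID", "DefectID"])
                          .set_index(["CategoryID", "DefectID"]))
for cat_id, defect_id, link in final_image_info_list:
    if (cat_id, defect_id) in frame_sorted.index:        # skip pairs with no matching row
        frame_sorted.loc[(cat_id, defect_id), "Image_Link"] = link
frame_main = frame_sorted.reset_index()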
Idea 2
Since Image-link is a string sequence, I'm guessing pandas has a harder time searching for specific values within it. You can preprocess by making a dictionary mapping each value to a list of indices within the Dataframe. In the extreme case, where each Image-link value has O(1) rows, this can reduce the time from Θ(mn) to Θ(n + m).
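A sketch of that preprocessing, keyed on the pair of columns the lookup actually uses (CategoryID, DefectID): one O(n) pass builds the mapping, after which each of the m updates is a dictionary lookup instead of a full scan.

from collections import defaultdict

rows_by_key = defaultdict(list)   # (CategoryID, DefectID) -> list of row labels
for row_label, cat_id, defect_id in zip(frame_main.index,
                                        frame_main["CategoryID"].values,
                                        frame_main["DefectID"].values):
    rows_by_key[(cat_id, defect_id)].append(row_label)

for cat_id, defect_id, link in final_image_info_list:
    for row_label in rows_by_key.get((cat_id, defect_id), []):
        frame_main.loc[row_label, "Image_Link"] = link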
Edit
In the extreme case the OP mentions in the comment, all Image-link values are unique. In this case, it is possible to build a dictionary mapping their values to indices like so:
dict([(k, i) for (i, k) in enumerate(df['Image-link'].values)])