Schedule Optimization Question - Variant of Job Shop - python

I am looking to build code in Python for a variant of the job shop problem.
The differences are:
The goal is to combine as many tasks as possible onto the fewest machines, without overlap, while ensuring the most efficient outcome.
Each machine can run all day; time is measured in minutes, 0-1440.
There are multiple jobs with no ordering between them, but each job has its own fixed schedule.
e.g. job 1: starts at 02:00 (120 minutes) and ends at 08:00 (480 minutes)
e.g. job 2: starts at 07:00 (420 minutes) and ends at 19:00 (1140 minutes)
e.g. job 3: starts at 08:00 (480 minutes) and ends at 20:00 (1200 minutes)
e.g. job 4: starts at 02:00 (120 minutes) and ends at 05:00 (300 minutes)
Can you help with the ideation, or a code variation of the job shop problem, to combine the jobs onto the fewest machines?
Additionally, as an extra request if possible (not too complex): would it be possible to incorporate jobs with different schedules during the week?
e.g.
Job 1: runs daily 02:00 - 08:00
Job 2: runs Monday and Thursday only, 07:00 - 19:00, etc.
In essence: assume I have a weekly Gantt chart / schedule with 50 machines, each with a single job. I want to compress the Gantt chart to reduce the number of machines wherever one has room to run more than one job (a simple illustration would go from schedule A to the more efficient schedule B).
I have tried the job shop problem and researched other problems, but couldn't find a similar problem statement.

This is not a super difficult problem to solve.
Instead of talking about jobs (which may repeat during the week), it is easier to talk about individual sub-jobs. The first thing to do is to find out whether two sub-jobs k and k' overlap; two half-open intervals [s_k, e_k) and [s_k', e_k') overlap iff s_k < e_k' and s_k' < e_k. If two sub-jobs overlap, they cannot run on the same machine. The data structure holding the overlap information can be calculated in advance (before the model runs).
The best thing is to first write down the mathematical model of the problem on a piece of paper. I propose something like:
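In symbols (a reconstruction of the model described next; x_{k,m} = 1 if sub-job k is assigned to machine m, used_m = 1 if machine m receives any sub-job):

\begin{aligned}
\min \; & \sum_{m} \mathit{used}_m \\
\text{s.t. } & \sum_{m} x_{k,m} = 1 && \forall k \\
& x_{k,m} + x_{k',m} \le 1 && \forall m,\ (k,k') \in \mathit{Overlap} \\
& x_{k,m} \le \mathit{used}_m && \forall k, m \\
& x_{k,m},\ \mathit{used}_m \in \{0,1\}
\end{aligned}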
The objective is to minimize the number of machines used (i.e., machines to which we assign sub-jobs). The first constraint assigns each sub-job to exactly one machine. The second says that if sub-jobs k and k' overlap, they cannot run on the same machine. The third says that if a machine is not used, no sub-jobs can be assigned to it. The whole model is expressed in terms of individual sub-jobs instead of (job, day) combinations.
This model can be implemented using any reasonable MIP solver. It is not very difficult to implement (most of the work is getting the data in shape and calculating the overlap data) and not very difficult to solve (at least for the data set I used). The mathematical model above is detailed and precise enough to follow almost line by line in an implementation.
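For instance, here is a minimal sketch in Python using the PuLP package (an assumption on my part; the listing output below comes from a different modeling system), run on the four single-day jobs from the question. The names subjobs, x, used are mine:

import pulp

# sub-jobs as (start, end) minutes -- the four jobs from the question
subjobs = [(120, 480), (420, 1140), (480, 1200), (120, 300)]
K = range(len(subjobs))
M = range(len(subjobs))  # worst case: one machine per sub-job

# half-open intervals [s, e) overlap iff each one starts before the other ends
overlap = [(k, kk) for k in K for kk in K
           if k < kk and subjobs[k][0] < subjobs[kk][1]
           and subjobs[kk][0] < subjobs[k][1]]

model = pulp.LpProblem("min_machines", pulp.LpMinimize)
x = pulp.LpVariable.dicts("x", (K, M), cat="Binary")     # x[k][m]: k runs on m
used = pulp.LpVariable.dicts("used", M, cat="Binary")

model += pulp.lpSum(used[m] for m in M)                  # minimize machines used
for k in K:
    model += pulp.lpSum(x[k][m] for m in M) == 1         # assign exactly once
for k, kk in overlap:
    for m in M:
        model += x[k][m] + x[kk][m] <= 1                 # keep overlapping pairs apart
for k in K:
    for m in M:
        model += x[k][m] <= used[m]                      # only assign to used machines

model.solve(pulp.PULP_CBC_CMD(msg=False))
for k in K:
    m = next(m for m in M if x[k][m].value() > 0.5)
    print(f"sub-job {k} {subjobs[k]} -> machine {m + 1}")
print("machines needed:", int(pulp.value(model.objective)))

For larger instances (like the 93 sub-jobs below) it usually pays to break the machine symmetry, e.g. with constraints used[m] >= used[m+1], so the solver does not explore equivalent relabelings of the machines.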
I tried it out with the following random data:
---- 24 PARAMETER start start time
job1 206.000, job2 1012.000, job3 661.000, job4 361.000, job5 350.000, job6 269.000, job7 420.000
job8 1028.000, job9 80.000, job10 600.000, job11 1198.000, job12 695.000, job13 1190.000, job14 915.000
job15 156.000, job16 768.000, job17 191.000, job18 300.000, job19 803.000, job20 522.000, job21 432.000
job22 422.000, job23 157.000, job24 180.000, job25 707.000, job26 997.000, job27 277.000, job28 799.000
job29 931.000, job30 364.000, job31 132.000, job32 603.000, job33 192.000, job34 1047.000, job35 318.000
job36 343.000, job37 713.000, job38 867.000, job39 754.000, job40 557.000, job41 496.000, job42 141.000
job43 377.000, job44 55.000, job45 406.000, job46 218.000, job47 775.000, job48 673.000, job49 924.000
job50 357.000
---- 24 PARAMETER end finish time
job1 841.663, job2 1721.541, job3 1270.409, job4 702.414, job5 537.411, job6 468.961, job7 1040.176
job8 1573.341, job9 224.589, job10 1338.041, job11 1374.758, job12 952.016, job13 1719.993, job14 1620.162
job15 414.936, job16 914.630, job17 767.402, job18 904.559, job19 1226.702, job20 921.797, job21 741.567
job22 734.209, job23 378.792, job24 1028.091, job25 1123.352, job26 1728.052, job27 631.027, job28 1016.877
job29 1635.122, job30 538.001, job31 409.572, job32 726.951, job33 522.298, job34 1556.884, job35 556.003
job36 598.852, job37 1090.897, job38 1234.187, job39 1125.228, job40 1428.902, job41 1391.010, job42 549.524
job43 787.853, job44 777.143, job45 835.414, job46 1050.215, job47 988.271, job48 1366.674, job49 1087.226
job50 926.514
---- 24 PARAMETER length length of job (minutes)
job1 635.663, job2 709.541, job3 609.409, job4 341.414, job5 187.411, job6 199.961, job7 620.176
job8 545.341, job9 144.589, job10 738.041, job11 176.758, job12 257.016, job13 529.993, job14 705.162
job15 258.936, job16 146.630, job17 576.402, job18 604.559, job19 423.702, job20 399.797, job21 309.567
job22 312.209, job23 221.792, job24 848.091, job25 416.352, job26 731.052, job27 354.027, job28 217.877
job29 704.122, job30 174.001, job31 277.572, job32 123.951, job33 330.298, job34 509.884, job35 238.003
job36 255.852, job37 377.897, job38 367.187, job39 371.228, job40 871.902, job41 895.010, job42 408.524
job43 410.853, job44 722.143, job45 429.414, job46 832.215, job47 213.271, job48 693.674, job49 163.226
job50 569.514
---- 24 SET jd jobs need to run on certain days
        day1  day2  day3  day4  day5  day6  day7
job1    YES   YES                     YES
job2          YES               YES
job3                YES         YES         YES
job4                YES   YES               YES
job6          YES               YES         YES
job7                      YES         YES   YES
job8                YES   YES         YES
job9          YES
job10   YES   YES   YES         YES         YES
job11                           YES
job12         YES   YES   YES   YES   YES   YES
job13               YES   YES
job14               YES
job15                           YES
job16                     YES               YES
job17                           YES
job18         YES
job19   YES                                 YES
job21               YES         YES         YES
job22   YES                           YES
job23                     YES         YES
job24               YES               YES
job25               YES
job26               YES         YES
job27   YES                     YES
job28   YES                     YES
job30   YES                     YES
job31               YES
job32         YES               YES
job33   YES                           YES   YES
job34   YES                           YES   YES
job35         YES                     YES
job36         YES                           YES
job37   YES                     YES
job38   YES         YES         YES
job39                     YES   YES
job40         YES                           YES
job42   YES               YES         YES
job44   YES
job45                           YES
job46         YES
job48                     YES   YES
job49                           YES         YES
---- 29 PARAMETER counts
jobs 43.000, subjobs 93.000
---- 49 SET k subjobs
subjob1 , subjob2 , subjob3 , subjob4 , subjob5 , subjob6 , subjob7 , subjob8 , subjob9 , subjob10
subjob11, subjob12, subjob13, subjob14, subjob15, subjob16, subjob17, subjob18, subjob19, subjob20
subjob21, subjob22, subjob23, subjob24, subjob25, subjob26, subjob27, subjob28, subjob29, subjob30
subjob31, subjob32, subjob33, subjob34, subjob35, subjob36, subjob37, subjob38, subjob39, subjob40
subjob41, subjob42, subjob43, subjob44, subjob45, subjob46, subjob47, subjob48, subjob49, subjob50
subjob51, subjob52, subjob53, subjob54, subjob55, subjob56, subjob57, subjob58, subjob59, subjob60
subjob61, subjob62, subjob63, subjob64, subjob65, subjob66, subjob67, subjob68, subjob69, subjob70
subjob71, subjob72, subjob73, subjob74, subjob75, subjob76, subjob77, subjob78, subjob79, subjob80
subjob81, subjob82, subjob83, subjob84, subjob85, subjob86, subjob87, subjob88, subjob89, subjob90
subjob91, subjob92, subjob93
---- 49 SET map mapping between jobs/days and subjobs
job1 .day1.subjob1 , job1 .day2.subjob2 , job1 .day6.subjob3 , job2 .day2.subjob4 , job2 .day5.subjob5
job3 .day3.subjob6 , job3 .day5.subjob7 , job3 .day7.subjob8 , job4 .day3.subjob9 , job4 .day4.subjob10
job4 .day7.subjob11, job6 .day2.subjob12, job6 .day5.subjob13, job6 .day7.subjob14, job7 .day4.subjob15
job7 .day6.subjob16, job7 .day7.subjob17, job8 .day3.subjob18, job8 .day4.subjob19, job8 .day6.subjob20
job9 .day2.subjob21, job10.day1.subjob22, job10.day2.subjob23, job10.day3.subjob24, job10.day5.subjob25
job10.day7.subjob26, job11.day5.subjob27, job12.day2.subjob28, job12.day3.subjob29, job12.day4.subjob30
job12.day5.subjob31, job12.day6.subjob32, job12.day7.subjob33, job13.day3.subjob34, job13.day4.subjob35
job14.day3.subjob36, job15.day5.subjob37, job16.day4.subjob38, job16.day7.subjob39, job17.day5.subjob40
job18.day2.subjob41, job19.day1.subjob42, job19.day7.subjob43, job21.day3.subjob44, job21.day5.subjob45
job21.day7.subjob46, job22.day1.subjob47, job22.day6.subjob48, job23.day4.subjob49, job23.day6.subjob50
job24.day3.subjob51, job24.day6.subjob52, job25.day3.subjob53, job26.day3.subjob54, job26.day5.subjob55
job27.day1.subjob56, job27.day5.subjob57, job28.day1.subjob58, job28.day5.subjob59, job30.day1.subjob60
job30.day5.subjob61, job31.day3.subjob62, job32.day2.subjob63, job32.day5.subjob64, job33.day1.subjob65
job33.day6.subjob66, job33.day7.subjob67, job34.day1.subjob68, job34.day6.subjob69, job34.day7.subjob70
job35.day2.subjob71, job35.day6.subjob72, job36.day2.subjob73, job36.day7.subjob74, job37.day1.subjob75
job37.day5.subjob76, job38.day1.subjob77, job38.day3.subjob78, job38.day5.subjob79, job39.day4.subjob80
job39.day5.subjob81, job40.day2.subjob82, job40.day7.subjob83, job42.day1.subjob84, job42.day4.subjob85
job42.day6.subjob86, job44.day1.subjob87, job45.day5.subjob88, job46.day2.subjob89, job48.day4.subjob90
job48.day5.subjob91, job49.day5.subjob92, job49.day7.subjob93
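(Sub-job times are laid out along a single weekly clock: a sub-job of job j scheduled on day d has start2 = (d-1)*1440 + start(j) and end2 = (d-1)*1440 + end(j); e.g. subjob2 = job1 on day2 starts at 1440 + 206 = 1646.)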
---- 64 PARAMETER start2 start time of subjob
subjob1 206.000, subjob2 1646.000, subjob3 7406.000, subjob4 2452.000, subjob5 6772.000, subjob6 3541.000
subjob7 6421.000, subjob8 9301.000, subjob9 3241.000, subjob10 4681.000, subjob11 9001.000, subjob12 1709.000
subjob13 6029.000, subjob14 8909.000, subjob15 4740.000, subjob16 7620.000, subjob17 9060.000, subjob18 3908.000
subjob19 5348.000, subjob20 8228.000, subjob21 1520.000, subjob22 600.000, subjob23 2040.000, subjob24 3480.000
subjob25 6360.000, subjob26 9240.000, subjob27 6958.000, subjob28 2135.000, subjob29 3575.000, subjob30 5015.000
subjob31 6455.000, subjob32 7895.000, subjob33 9335.000, subjob34 4070.000, subjob35 5510.000, subjob36 3795.000
subjob37 5916.000, subjob38 5088.000, subjob39 9408.000, subjob40 5951.000, subjob41 1740.000, subjob42 803.000
subjob43 9443.000, subjob44 3312.000, subjob45 6192.000, subjob46 9072.000, subjob47 422.000, subjob48 7622.000
subjob49 4477.000, subjob50 7357.000, subjob51 3060.000, subjob52 7380.000, subjob53 3587.000, subjob54 3877.000
subjob55 6757.000, subjob56 277.000, subjob57 6037.000, subjob58 799.000, subjob59 6559.000, subjob60 364.000
subjob61 6124.000, subjob62 3012.000, subjob63 2043.000, subjob64 6363.000, subjob65 192.000, subjob66 7392.000
subjob67 8832.000, subjob68 1047.000, subjob69 8247.000, subjob70 9687.000, subjob71 1758.000, subjob72 7518.000
subjob73 1783.000, subjob74 8983.000, subjob75 713.000, subjob76 6473.000, subjob77 867.000, subjob78 3747.000
subjob79 6627.000, subjob80 5074.000, subjob81 6514.000, subjob82 1997.000, subjob83 9197.000, subjob84 141.000
subjob85 4461.000, subjob86 7341.000, subjob87 55.000, subjob88 6166.000, subjob89 1658.000, subjob90 4993.000
subjob91 6433.000, subjob92 6684.000, subjob93 9564.000
---- 64 PARAMETER end2 end time of subjob
subjob1 841.663, subjob2 2281.663, subjob3 8041.663, subjob4 3161.541, subjob5 7481.541
subjob6 4150.409, subjob7 7030.409, subjob8 9910.409, subjob9 3582.414, subjob10 5022.414
subjob11 9342.414, subjob12 1908.961, subjob13 6228.961, subjob14 9108.961, subjob15 5360.176
subjob16 8240.176, subjob17 9680.176, subjob18 4453.341, subjob19 5893.341, subjob20 8773.341
subjob21 1664.589, subjob22 1338.041, subjob23 2778.041, subjob24 4218.041, subjob25 7098.041
subjob26 9978.041, subjob27 7134.758, subjob28 2392.016, subjob29 3832.016, subjob30 5272.016
subjob31 6712.016, subjob32 8152.016, subjob33 9592.016, subjob34 4599.993, subjob35 6039.993
subjob36 4500.162, subjob37 6174.936, subjob38 5234.630, subjob39 9554.630, subjob40 6527.402
subjob41 2344.559, subjob42 1226.702, subjob43 9866.702, subjob44 3621.567, subjob45 6501.567
subjob46 9381.567, subjob47 734.209, subjob48 7934.209, subjob49 4698.792, subjob50 7578.792
subjob51 3908.091, subjob52 8228.091, subjob53 4003.352, subjob54 4608.052, subjob55 7488.052
subjob56 631.027, subjob57 6391.027, subjob58 1016.877, subjob59 6776.877, subjob60 538.001
subjob61 6298.001, subjob62 3289.572, subjob63 2166.951, subjob64 6486.951, subjob65 522.298
subjob66 7722.298, subjob67 9162.298, subjob68 1556.884, subjob69 8756.884, subjob70 10196.884
subjob71 1996.003, subjob72 7756.003, subjob73 2038.852, subjob74 9238.852, subjob75 1090.897
subjob76 6850.897, subjob77 1234.187, subjob78 4114.187, subjob79 6994.187, subjob80 5445.228
subjob81 6885.228, subjob82 2868.902, subjob83 10068.902, subjob84 549.524, subjob85 4869.524
subjob86 7749.524, subjob87 777.143, subjob88 6595.414, subjob89 2490.215, subjob90 5686.674
subjob91 7126.674, subjob92 6847.226, subjob93 9727.226
Some jobs, due to my random data, don't appear on any day. So we don't have 50 jobs here, but just 43. The number of sub-jobs is 93.
This model solves very fast: 0.06 seconds. Here is the solution:
---- 106 VARIABLE x.L assign job to machine
(93 rows, subjob1..subjob93, with columns machine1..machine10: each sub-job has exactly one 1.000 entry under its assigned machine; the machine-column alignment was lost in this plain-text listing)
---- 106 VARIABLE used.L
machine1 1.000, machine2 1.000, machine3 1.000, machine4 1.000, machine5 1.000, machine6 1.000
machine7 1.000, machine8 1.000, machine9 1.000, machine10 1.000
---- 106 VARIABLE num.L = 10.000 number of machines needed
---- 110 PARAMETER assignments
(93 (job, day) rows -- the same combinations as in the map above -- with columns machine1..machine10: each row has exactly one 1.000 entry under its assigned machine; the machine-column alignment was lost in this plain-text listing)
We need 10 machines for the assignments to be non-overlapping. The number of sub-jobs on some of the machines is very small; machine 10 has only one sub-job. The results look fine at first sight: in the Gantt chart of the solution, the numbers are the job numbers, and e.g. job 1 appears on days 1, 2 and 6, all on machine 1.
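As a quick sanity check on such a result (a sketch of mine, not part of the original run): the maximum number of sub-jobs active at any single instant is a lower bound on the number of machines, so if a sweep line over the (start2, end2) intervals also returns 10, the MIP solution is provably optimal.

def min_machines_lower_bound(intervals):
    # intervals are half-open [s, e): +1 at each start, -1 at each end; ties
    # sort the end first, so back-to-back sub-jobs don't count as overlapping
    events = sorted([(s, 1) for s, e in intervals] + [(e, -1) for s, e in intervals])
    best = cur = 0
    for _, delta in events:
        cur += delta
        best = max(best, cur)
    return best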

Related

Requests in multiprocessing fail python

I'm trying to query data from a website, but each query takes 6 seconds. For all 3000 of my queries, I'd be sitting around for 5 hours. I heard there is a way to parallelize this, so I tried multiprocessing; it didn't work, and asyncio gave much the same result. Here's the multiprocessing code, since that's simpler.
I have a list of URLs (5 shown here) I want to request tables from:
https://irsa.ipac.caltech.edu/cgi-bin/Gator/nph-query?outfmt=1&searchForm=MO&spatial=cone&catalog=neowiser_p1bs_psd&moradius=5&mobj=smo&mobjstr=1&btime=2014+01+17+18:38:41&etime=2014+01+18+18:38:41&selcols=w1mpro,w1sigmpro,w1snr,w1rchi2,w1flux,w1sigflux,w1sky,w1mag_2,w1sigm,mjd
https://irsa.ipac.caltech.edu/cgi-bin/Gator/nph-query?outfmt=1&searchForm=MO&spatial=cone&catalog=neowiser_p1bs_psd&moradius=5&mobj=smo&mobjstr=2&btime=2014+05+18+23:10:01&etime=2014+05+19+23:10:01&selcols=w1mpro,w1sigmpro,w1snr,w1rchi2,w1flux,w1sigflux,w1sky,w1mag_2,w1sigm,mjd
https://irsa.ipac.caltech.edu/cgi-bin/Gator/nph-query?outfmt=1&searchForm=MO&spatial=cone&catalog=neowiser_p1bs_psd&moradius=5&mobj=smo&mobjstr=3&btime=2014+11+04+06:01:27&etime=2014+11+05+06:01:27&selcols=w1mpro,w1sigmpro,w1snr,w1rchi2,w1flux,w1sigflux,w1sky,w1mag_2,w1sigm,mjd
https://irsa.ipac.caltech.edu/cgi-bin/Gator/nph-query?outfmt=1&searchForm=MO&spatial=cone&catalog=neowiser_p1bs_psd&moradius=5&mobj=smo&mobjstr=4&btime=2014+07+14+10:01:45&etime=2014+07+15+10:01:45&selcols=w1mpro,w1sigmpro,w1snr,w1rchi2,w1flux,w1sigflux,w1sky,w1mag_2,w1sigm,mjd
https://irsa.ipac.caltech.edu/cgi-bin/Gator/nph-query?outfmt=1&searchForm=MO&spatial=cone&catalog=neowiser_p1bs_psd&moradius=5&mobj=smo&mobjstr=5&btime=2014+07+04+20:17:01&etime=2014+07+05+20:17:01&selcols=w1mpro,w1sigmpro,w1snr,w1rchi2,w1flux,w1sigflux,w1sky,w1mag_2,w1sigm,mjd
Here's my request code:
from astropy.io import fits
from astropy.io import ascii as astro_ascii
from astropy.time import Time

def get_real_meta(url):
    df = astro_ascii.read(url)
    df = df.to_pandas()
    print(df)
    return df

import multiprocessing as mp

pool = mp.Pool(processes=10)
results = pool.map(get_real_meta, urls)
When I run this, some of the results are failed requests.
Why is this happening?
This is the full result from the run:
col1 \
0 [struct stat="ERROR"
col2 \
0 msg="ERROR: Cannot process Moving Object Correctly. "
col3
0 logfile="https://irsa.ipac.caltech.edu:443/workspace/TMP_1YL1MV_20034/Gator/irsa/20034/log.20034.html"]
col1 \
0 [struct stat="ERROR"
col2 \
0 msg="ERROR: Cannot process Moving Object Correctly. "
col3
0 logfile="https://irsa.ipac.caltech.edu:443/workspace/TMP_gteaoW_20031/Gator/irsa/20031/log.20031.html"]
col1 \
0 [struct stat="ERROR"
col2 \
0 msg="ERROR: Cannot process Moving Object Correctly. "
col3
0 logfile="https://irsa.ipac.caltech.edu:443/workspace/TMP_yWl2vY_20037/Gator/irsa/20037/log.20037.html"]
cntr_u dist_x pang_x ra_u dec_u ra \
0 1 2.338323 -43.660587 153.298036 13.719689 153.297574
1 2 1.047075 96.475058 153.337711 13.730126 153.338009
2 3 1.709365 -159.072115 153.377399 13.740497 153.377224
3 4 0.903435 84.145005 153.377439 13.740491 153.377696
4 5 0.800164 99.321042 153.397283 13.745653 153.397509
5 6 0.591963 16.330683 153.417180 13.750790 153.417228
6 7 0.642462 63.761302 153.437090 13.755910 153.437255
7 8 1.020182 -123.497531 153.457013 13.761012 153.456770
8 9 1.051963 130.842143 153.476909 13.766102 153.477137
9 11 1.007540 -55.156815 153.516820 13.776216 153.516583
10 12 0.607295 118.463910 153.556744 13.786265 153.556897
11 13 0.227240 -79.964079 153.556784 13.786259 153.556720
12 14 1.526454 -113.268004 153.596760 13.796237 153.596359
dec clon clat w1mpro w1sigmpro w1snr w1rchi2 \
0 13.720159 10h13m11.42s 13d43m12.57s 7.282 0.014 78.4 307.90
1 13.730093 10h13m21.12s 13d43m48.34s 6.925 0.018 59.8 82.35
2 13.740054 10h13m30.53s 13d44m24.19s 6.354 0.012 91.3 203.90
3 13.740517 10h13m30.65s 13d44m25.86s 6.862 0.015 70.3 61.03
4 13.745617 10h13m35.40s 13d44m44.22s 7.005 0.016 68.0 62.28
5 13.750948 10h13m40.13s 13d45m03.41s 6.749 0.015 70.4 26.35
6 13.755989 10h13m44.94s 13d45m21.56s 7.031 0.019 57.5 37.56
7 13.760856 10h13m49.62s 13d45m39.08s 6.729 0.013 84.9 66.91
8 13.765911 10h13m54.51s 13d45m57.28s 6.944 0.022 49.0 44.22
9 13.776376 10h14m03.98s 13d46m34.95s 7.049 0.022 49.1 20.63
10 13.786185 10h14m13.66s 13d47m10.26s 6.728 0.018 58.9 14.40
11 13.786270 10h14m13.61s 13d47m10.57s 6.773 0.024 45.3 10.65
12 13.796069 10h14m23.13s 13d47m45.85s 7.126 0.015 72.2 219.50
w1flux w1sigflux w1sky w1mag_2 w1sigm mjd
0 248830.0 3173.5 24.057 7.719 0.013 56795.965297
1 345700.0 5780.5 27.888 8.348 0.006 56796.096965
2 584870.0 6406.8 24.889 7.986 0.006 56796.228504
3 366210.0 5206.7 27.876 7.653 0.006 56796.228632
4 321210.0 4725.7 26.150 7.867 0.009 56796.294338
5 406400.0 5771.4 25.240 7.711 0.006 56796.360172
6 313360.0 5449.7 26.049 7.988 0.006 56796.426005
7 414100.0 4877.9 25.581 8.022 0.006 56796.491839
8 339610.0 6935.9 25.564 8.029 0.007 56796.557545
9 308370.0 6285.2 25.491 8.331 0.006 56796.689212
10 414410.0 7035.5 27.656 7.851 0.007 56796.820752
11 397500.0 8773.4 27.628 8.015 0.006 56796.820880
12 287270.0 3980.2 24.825 8.310 0.006 56796.952419
cntr_u dist_x pang_x ra_u dec_u ra dec \
0 1 0.570817 137.605512 128.754979 4.242103 128.755086 4.241986
1 2 0.852021 14.819525 128.791578 4.225474 128.791639 4.225703
2 3 1.099000 -4.816139 128.828083 4.208860 128.828057 4.209164
3 4 1.207022 9.485091 128.864456 4.192260 128.864511 4.192591
4 5 0.323112 107.976317 128.882608 4.183966 128.882694 4.183938
5 6 0.627727 99.373708 128.882645 4.183967 128.882817 4.183939
6 7 0.489166 19.732971 128.900773 4.175676 128.900819 4.175804
7 8 0.231292 -139.425350 128.918877 4.167389 128.918835 4.167340
8 9 0.393206 -28.705753 128.936958 4.159106 128.936905 4.159202
9 10 0.466548 -35.199460 128.936995 4.159107 128.936920 4.159213
10 11 1.153921 -100.879703 128.955053 4.150828 128.954737 4.150767
11 12 1.078232 -38.005043 128.973087 4.142552 128.972902 4.142788
12 13 1.172329 -27.290606 128.991097 4.134280 128.990947 4.134569
13 15 1.399220 54.717544 129.027083 4.117750 129.027401 4.117974
clon clat w1mpro w1sigmpro w1snr w1rchi2 w1flux \
0 08h35m01.22s 04d14m31.15s 6.768 0.018 58.9 60.490 395130.0
1 08h35m09.99s 04d13m32.53s 6.706 0.018 59.1 30.780 418160.0
2 08h35m18.73s 04d12m32.99s 6.754 0.024 45.4 20.520 400280.0
3 08h35m27.48s 04d11m33.33s 6.667 0.024 44.9 34.090 433390.0
4 08h35m31.85s 04d11m02.18s 6.782 0.023 47.8 9.326 389870.0
5 08h35m31.88s 04d11m02.18s 6.710 0.035 31.4 11.360 416570.0
6 08h35m36.20s 04d10m32.89s 6.880 0.021 52.7 7.781 356410.0
7 08h35m40.52s 04d10m02.42s 6.653 0.023 46.8 18.900 439130.0
8 08h35m44.86s 04d09m33.13s 6.986 0.023 47.2 8.576 323350.0
9 08h35m44.86s 04d09m33.17s 6.917 0.019 58.5 25.720 344400.0
10 08h35m49.14s 04d09m02.76s 6.782 0.015 70.1 173.800 390170.0
11 08h35m53.50s 04d08m34.04s 6.671 0.016 69.6 70.490 431820.0
12 08h35m57.83s 04d08m04.45s 7.152 0.016 66.6 131.100 277440.0
13 08h36m06.58s 04d07m04.71s 6.436 0.017 63.3 86.350 536630.0
w1sigflux w1sky w1mag_2 w1sigm mjd
0 6711.7 21.115 7.988 0.008 56965.251016
1 7070.1 23.748 7.830 0.007 56965.382556
2 8812.6 20.930 8.456 0.007 56965.514096
3 9649.6 21.350 8.120 0.008 56965.645509
4 8161.1 19.988 7.686 0.007 56965.711215
5 13264.0 22.180 7.902 0.016 56965.711343
6 6769.0 22.962 8.023 0.008 56965.777049
7 9382.2 22.355 8.030 0.007 56965.842755
8 6847.6 23.531 8.024 0.007 56965.908462
9 5882.5 21.256 7.654 0.007 56965.908589
10 5568.7 21.926 8.051 0.007 56965.974295
11 6202.3 23.497 7.950 0.007 56966.040002
12 4165.8 20.094 8.091 0.010 56966.105708
13 8482.9 22.436 8.191 0.008 56966.237248
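The [struct stat="ERROR" rows above suggest the server is rejecting some of the concurrent moving-object queries (throttling is one common cause; the logfile URLs in the error structs would confirm it). A sketch of one way to surface and retry the failures: get_real_meta_safe, the retry counts, and the error check against the printed struct are my assumptions, not a documented API contract, and urls is the list above.

import time
import multiprocessing as mp
from astropy.io import ascii as astro_ascii

def get_real_meta_safe(url, retries=3, delay=5.0):
    # hypothetical retry wrapper: treat the ERROR struct seen above as a
    # failed request and back off before trying again
    for attempt in range(retries):
        df = astro_ascii.read(url).to_pandas()
        if not str(df.iloc[0, 0]).startswith('[struct stat="ERROR"'):
            return df
        time.sleep(delay * (attempt + 1))    # back off, assuming throttling
    return None                              # caller can re-queue this url

if __name__ == "__main__":
    with mp.Pool(processes=3) as pool:       # gentler than the original 10 workers
        results = pool.map(get_real_meta_safe, urls)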

Printing out effect sizes from an anova_lm model

I have the following code. I am running a linear model on the dataframe 'x', with gender and highest education level achieved as categorical variables.
The aim is to assess how well age, gender and highest level of education achieved can predict 'weighteddistance'.
resultmodeldistancevariation2sleep = smf.ols(formula='weighteddistance ~ age + C(gender) + C(highest_education_level_acheived)',data=x).fit()
summarymodel = resultmodeldistancevariation2sleep.summary()
print(summarymodel)
This gives me the output:
                                             coef   std err         t     P>|t|    [0.025    0.975]
Intercept                                  6.3693     1.391     4.580     0.000     3.638     9.100
C(gender)[T.2.0]                           0.2301     0.155     1.489     0.137    -0.073     0.534
C(gender)[T.3.0]                           0.0302     0.429     0.070     0.944    -0.812     0.872
C(highest_education_level_acheived)[T.3]   1.1292     0.501     2.252     0.025     0.145     2.114
C(highest_education_level_acheived)[T.4]   1.0876     0.513     2.118     0.035     0.079     2.096
C(highest_education_level_acheived)[T.5]   1.0692     0.498     2.145     0.032     0.090     2.048
C(highest_education_level_acheived)[T.6]   1.2995     0.525     2.476     0.014     0.269     2.330
C(highest_education_level_acheived)[T.7]   1.7391     0.605     2.873     0.004     0.550     2.928
However, I want to calculate the main effect of each categorical variable on distance, which is not shown in the summary above, so I passed the model fit into an ANOVA using anova_lm.
anovaoutput = sm.stats.anova_lm(resultmodeldistancevariation2sleep)
anovaoutput['PR(>F)'] = anovaoutput['PR(>F)'].round(4)
This gives me the output below, which, as I wanted, shows the main effect of each categorical variable (gender and highest education level achieved) rather than the separate groups within each variable (so there is no gender[2.0] or gender[3.0] in the output).
df sum_sq mean_sq F PR(>F)
C(gender) 2.0 4.227966 2.113983 5.681874 0.0036
C(highest_education_level_acheived) 5.0 11.425706 2.285141 6.141906 0.0000
age 1.0 8.274317 8.274317 22.239357 0.0000
Residual 647.0 240.721120 0.372057 NaN NaN
However, this output no longer shows me the confidence intervals or the coefficients for each variable.
In other words, I would like the bottom ANOVA table to have a 'coef' column and a '[0.025 0.975]' column like the first table.
How can I achieve this?
I would be so grateful for a helping hand!
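One point worth knowing: an ANOVA row aggregates several dummy coefficients (e.g. all the education levels at once), so there is no single coefficient or confidence interval per categorical factor to add to that table; the per-term values live on the fitted OLS results. A sketch (assuming pandas is imported as pd) that builds the coefficient table next to the ANOVA output:

import pandas as pd

# params holds the coefficients; conf_int() returns the 0.025/0.975 bounds
coef_table = pd.concat(
    [resultmodeldistancevariation2sleep.params.rename('coef'),
     resultmodeldistancevariation2sleep.conf_int().rename(columns={0: '[0.025', 1: '0.975]'})],
    axis=1)
print(coef_table)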

Why using group by makes some id disappear

I was working on a machine learning project, and while extracting features I found that some consumers' LCLid values disappear from the dataset when I group by LCLid.
Dataset: SmartMeter Energy Consumption Data in London Households (the original data set).
Here is the code that I used to extract some features:
LCLid = []
for i in range(68):
    LCLid.append('MAC0000' + str(228 + i))

consommation = data.groupby('LCLid')['KWH/hh'].sum()
consommation_min = data.groupby('LCLid')['KWH/hh'].min()
consommation_max = data.groupby('LCLid')['KWH/hh'].max()
consommation_mean = data.groupby('LCLid')['KWH/hh'].mean()
consommation_evening = data.groupby(['LCLid', 'period'])['KWH/hh'].mean()

# create the dataframe
list_of_tuples = list(zip(LCLid, consommation, consommation_min, consommation_max, consommation_mean))
data2 = pd.DataFrame(list_of_tuples, columns=['LCLid', 'Consumption', 'Consumption_min', 'Consumption_max', 'Consumption_mean'])
As you can see, after executing the code the resulting dataset stops at LCLid MAC000282, while the original one also contains LCLid MAC000283 to MAC000295.
Using low-carbon-london-data from SmartMeter Energy Consumption Data in London Households:
The issue is that LCLid does not uniformly increment by 1 from MAC000228 to MAC000295.
print(data.LCLid.unique())
array(['MAC000228', 'MAC000229', 'MAC000230', 'MAC000231', 'MAC000232',
'MAC000233', 'MAC000234', 'MAC000235', 'MAC000237', 'MAC000238',
'MAC000239', 'MAC000240', 'MAC000241', 'MAC000242', 'MAC000243',
'MAC000244', 'MAC000245', 'MAC000246', 'MAC000248', 'MAC000249',
'MAC000250', 'MAC000251', 'MAC000252', 'MAC000253', 'MAC000254',
'MAC000255', 'MAC000256', 'MAC000258', 'MAC000260', 'MAC000262',
'MAC000263', 'MAC000264', 'MAC000267', 'MAC000268', 'MAC000269',
'MAC000270', 'MAC000271', 'MAC000272', 'MAC000273', 'MAC000274',
'MAC000275', 'MAC000276', 'MAC000277', 'MAC000279', 'MAC000280',
'MAC000281', 'MAC000282', 'MAC000283', 'MAC000284', 'MAC000285',
'MAC000287', 'MAC000289', 'MAC000291', 'MAC000294', 'MAC000295'],
dtype=object)
print(len(data.LCLid.unique()))
>>> 55
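This matters because zip() silently truncates to its shortest argument: the 68 generated labels are paired with only 55 groupby rows, so data2 ends at MAC000228 + 54 = MAC000282, and every label after the first missing id (MAC000236) is attached to the wrong group's values. A minimal illustration:

# labels generated by counting vs. values for the ids that actually exist
labels = ['MAC000235', 'MAC000236', 'MAC000237']
values = [10.0, 30.0]              # MAC000236 has no data in this toy example
print(list(zip(labels, values)))   # [('MAC000235', 10.0), ('MAC000236', 30.0)]
# MAC000236 silently receives MAC000237's value, and MAC000237 is dropped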
To resolve the issue:
import pandas as pd
import numpy as np
df = pd.read_csv('Power-Networks-LCL-June2015(withAcornGps)v2.csv')
# determine the rows needed for the MAC000228 - MAC000295
df[df.LCLid == 'MAC000228'].iloc[0, :] # first row of 228
df[df.LCLid == 'MAC000295'].iloc[-1, :] # last row of 295
# create a dataframe with the desired data
data = df[['LCLid', 'DateTime', 'KWH/hh (per half hour) ']].iloc[6989700:9032044, :].copy()
# fix the data
data.DateTime = pd.to_datetime(data.DateTime)
data.rename(columns={'KWH/hh (per half hour) ': 'KWH/hh'}, inplace=True)
data['KWH/hh'] = data['KWH/hh'].str.replace('Null', 'NaN')
data['KWH/hh'].fillna(np.nan, inplace=True)
data['KWH/hh'] = data['KWH/hh'].astype('float')
data.reset_index(drop=True, inplace=True)
# aggregate your functions
agg_data = data.groupby('LCLid')['KWH/hh'].agg(['sum', 'min', 'max', 'mean']).reset_index()
print(agg_data)
LCLid sum min max mean
0 MAC000228 5761.288000 0.021 1.616 0.146356
1 MAC000229 6584.866999 0.008 3.294 0.167456
2 MAC000230 8911.154000 0.029 2.750 0.226384
3 MAC000231 3174.314000 0.000 1.437 0.080663
4 MAC000232 2083.042000 0.005 0.736 0.052946
5 MAC000233 2241.591000 0.000 3.137 0.056993
6 MAC000234 9700.328001 0.029 2.793 0.246646
7 MAC000235 8473.999003 0.011 3.632 0.223194
8 MAC000237 22263.294998 0.036 4.450 0.598299
9 MAC000238 7814.889998 0.016 2.835 0.198781
10 MAC000239 6113.029000 0.015 1.346 0.155481
11 MAC000240 7280.662000 0.000 3.146 0.222399
12 MAC000241 4181.169999 0.024 1.733 0.194963
13 MAC000242 1654.336000 0.000 1.481 0.042088
14 MAC000243 11057.366999 0.009 3.588 0.281989
15 MAC000244 5894.271000 0.005 1.884 0.149939
16 MAC000245 22788.699005 0.037 4.743 0.580087
17 MAC000246 13787.060005 0.014 3.516 0.351075
18 MAC000248 10192.239001 0.000 4.351 0.259536
19 MAC000249 24401.468995 0.148 5.242 0.893042
20 MAC000250 5850.003000 0.000 2.185 0.148999
21 MAC000251 8400.234000 0.035 3.505 0.213931
22 MAC000252 21748.489004 0.135 4.171 0.554978
23 MAC000253 9739.408999 0.009 1.714 0.248201
24 MAC000254 9351.614001 0.009 2.484 0.238209
25 MAC000255 14142.974002 0.097 3.305 0.360220
26 MAC000256 20398.665001 0.049 3.019 0.520680
27 MAC000258 6646.485998 0.017 2.319 0.169666
28 MAC000260 5952.563001 0.006 2.192 0.151952
29 MAC000262 13909.603999 0.000 2.878 0.355181
30 MAC000263 3753.997000 0.015 1.060 0.095863
31 MAC000264 7022.967000 0.020 0.910 0.179432
32 MAC000267 8797.094000 0.029 2.198 0.224898
33 MAC000268 3734.252001 0.000 1.599 0.095359
34 MAC000269 2395.232000 0.000 1.029 0.061167
35 MAC000270 15569.711002 0.131 2.249 0.397501
36 MAC000271 7244.860000 0.028 1.794 0.184974
37 MAC000272 8703.658998 0.034 3.295 0.222446
38 MAC000273 3622.199002 0.005 5.832 0.092587
39 MAC000274 28724.718997 0.032 3.927 0.734422
40 MAC000275 5564.004999 0.012 1.840 0.161290
41 MAC000276 11060.774001 0.000 1.709 0.315724
42 MAC000277 8446.528999 0.027 1.938 0.241075
43 MAC000279 3444.160999 0.016 1.846 0.098354
44 MAC000280 12595.780001 0.125 1.988 0.360436
45 MAC000281 6282.568000 0.024 1.433 0.179538
46 MAC000282 4457.989001 0.030 1.830 0.127444
47 MAC000283 5024.917000 0.011 2.671 0.143627
48 MAC000284 1293.503000 0.000 0.752 0.047975
49 MAC000285 2399.018000 0.006 0.931 0.068567
50 MAC000287 1407.290000 0.000 2.372 0.045253
51 MAC000289 4767.490999 0.000 2.287 0.136436
52 MAC000291 13456.678999 0.072 3.354 0.385060
53 MAC000294 9477.966000 0.053 2.438 0.271264
54 MAC000295 7750.128000 0.010 1.839 0.221774

I am trying to interpolate a value from a pandas dataframe using numpy.interp but it continuously returns a wrong interpolation

import pandas as pd
import numpy as np

# defining a function for interpolation
def interpolate(x, df, xcol, ycol):
    return np.interp([x], df[xcol], df[ycol])

# function call
print(interpolate(0.4, freq_data, 'Percent_cum_freq', 'cum_OGIP'))
Trying a more direct method:
print(np.interp(0.4, freq_data['Percent_cum_freq'], freq_data['cum_OGIP']))
Output:
from function [2.37197912e+10]
from direct 23719791158.266743
For any value of x that I pass (0.4, 0.6, or 0.9), it gives the same result: 2.37197912e+10.
freq_data dataframe
Percent_cum_freq cum_OGIP
0 0.999 4.455539e+07
1 0.981 1.371507e+08
2 0.913 2.777860e+08
3 0.824 4.664612e+08
4 0.720 7.031764e+08
5 0.615 9.879315e+08
6 0.547 1.320727e+09
7 0.464 1.701562e+09
8 0.396 2.130436e+09
9 0.329 2.607351e+09
10 0.285 3.132306e+09
11 0.245 3.705301e+09
12 0.199 4.326336e+09
13 0.167 4.995410e+09
14 0.136 5.712525e+09
15 0.115 6.477680e+09
16 0.085 7.290874e+09
17 0.072 8.152108e+09
18 0.056 9.061383e+09
19 0.042 1.001870e+10
20 0.034 1.102405e+10
21 0.027 1.207745e+10
22 0.022 1.317888e+10
23 0.015 1.432835e+10
24 0.013 1.552587e+10
25 0.010 1.677142e+10
26 0.007 1.806502e+10
27 0.002 1.940665e+10
28 0.002 2.079632e+10
29 0.002 2.223404e+10
30 0.001 2.371979e+10
What is wrong? How can I solve the problem?
I was surprised by the results as well when I ran the code you provided. After a little searching in the documentation for np.interp, I found that the x-coordinates must always be increasing.
np.interp(x, list_of_x_coordinates, list_of_y_coordinates)
where x is the value you want y at;
list_of_x_coordinates is df[xcol] -> this must always be increasing. Since your dataframe's x column is decreasing, it will never give the correct result;
list_of_y_coordinates is df[ycol] -> this must have the same dimension as, and be in order with, df[xcol].
My reproduced code:
import numpy as np

list_1 = np.interp([0.1, 0.5, 0.8],
                   [0.999, 0.547, 0.199, 0.056, 0.013, 0.001],
                   [4.455539e+07, 1.320727e+09, 4.326336e+09, 9.061383e+09, 1.552587e+10, 2.371979e+10])
list_2 = np.interp([0.1, 0.5, 0.8],
                   [0.001, 0.013, 0.056, 0.199, 0.547, 0.999],
                   [2.371979e+10, 1.552587e+10, 9.061383e+09, 4.326336e+09, 1.320727e+09, 4.455539e+07])
print("In decreasing order -> as in your case", list_1)
print("In increasing order of x coordinates", list_2)
Output:
In decreasing order -> as in your case [2.371979e+10 2.371979e+10 2.371979e+10]
In increasing order of x coordinates [7.60444546e+09 1.72665695e+09 6.06409705e+08]
As you can see now, you have to sort df[xcol] and pass df[ycol] reordered accordingly.
By default np.interp needs the x values to be sorted. If you do not want to sort your dataframe, a workaround is to set the period argument to np.inf:
print(np.interp(0.4, freq_data['Percent_cum_freq'], freq_data['cum_OGIP'], period=np.inf))
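Either way, with the dataframe itself the sort can be done once up front (a sketch using pandas' sort_values):

# sort_values gives an increasing Percent_cum_freq, which np.interp requires
freq_sorted = freq_data.sort_values('Percent_cum_freq')
print(np.interp(0.4, freq_sorted['Percent_cum_freq'], freq_sorted['cum_OGIP']))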

how to set a variable dynamically - python, pandas

>>> import pandas as pd
>>> import numpy as np
>>> from pandas import Series, DataFrame
>>> rawData = pd.read_csv('wow.txt')
>>> rawData
time mean
0 0.005 0
1 0.010 258.64
2 0.015 258.43
3 0.020 253.72
4 0.025 0
5 0.030 0
6 0.035 253.84
7 0.040 254.17
8 0.045 0
9 0.050 0
10 0.055 0
11 0.060 254.73
12 0.065 254.90
.
.
.
489 4.180 167.46
I want to apply the formula below and get 'y' when I enter an 'x' value dynamically, in order to plot a graph:
y = y0 + (y1 - y0) * (x - x0) / (x1 - x0)
If the 'mean' value is 0 (for example indexes 4, 5 and 8, 9, 10):
1) ask the question "Do you want to interpolate?"
2) if yes, enter the 'x' value
3) calculate using the formula (repeat 1-3 until the answer is no)
4) if the answer is no, finish the program.
time(x-axis) mean(y-axis)
0 0.005 0
1 0.010 258.64
2 0.015 258.43
3 0.020 <--x0 253.72 <-- y0
4 0.025 0
5 0.030 0
6 0.035 <--x1 253.84 <-- y1
7 0.040 <--x0 254.17 <-- y0
8 0.045 0
9 0.050 0
10 0.055 0
11 0.060 <--x1 254.73 <-- y1
12 0.065 254.90
.
.
.
489 4.180 167.46
The variables x0, x1, y0, y1 are the points from the non-zero rows immediately surrounding a run of '0' values.
How can I get the variables dynamically and calculate?
Do you have a good idea for designing the program?
for i in df.index:
    if df['mean'][i] == 0:                      # note: df.mean is the DataFrame method, so use df['mean']
        answer = input("Do you want to interpolate? ")
        if answer != "Y":
            break                               # answer is no -> finish the program
        nz = df.index[(df['mean'] > 0).values]  # rows with a non-zero mean
        prev, nxt = nz[nz < i], nz[nz > i]
        if len(prev) == 0 or len(nxt) == 0:
            continue                            # no bracketing points (e.g. row 0)
        x0, y0 = df.loc[prev[-1], 'time'], df.loc[prev[-1], 'mean']
        x1, y1 = df.loc[nxt[0], 'time'], df.loc[nxt[0], 'mean']
        x = df.loc[i, 'time']
        df.loc[i, 'mean'] = y0 + (y1 - y0) * (x - x0) / (x1 - x0)
Excuse typos, working on mobile phone.
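If the interactive prompt isn't essential, pandas can also fill all the zero rows in one pass (a sketch, with df being the question's rawData: treat zeros as missing, then interpolate linearly against the 'time' values):

import numpy as np

# replace 0 with NaN, interpolate on the time index, and write the result back;
# a leading zero (like row 0 here) has no left neighbour and stays NaN
filled = (df.set_index('time')['mean']
            .replace(0, np.nan)
            .interpolate(method='index'))
df['mean_interp'] = filled.to_numpy()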
