Removing outliers from lists/XY scatter - python

I have two lists containing heart beat intervals (Y-axis, in ms; IBIs below) and their absolute timepoints (X-axis, in ms; RR_times below). There are some misreadings, such that the first list contains outliers that need to be removed, and the second one their corresponding timepoints. It would be optimal if the outliers in the first list are NaN-ed so that the total time for the recording remains the same.
RR_times = [411, 827, 1241, 1653, 2066, 2481, 2894, 3308,
3714, 4126, 4532, 4938, 5343, 5751, 6156, 6552,
6951, 7346, 7749, 8149, 8546, 8944, 9338, 9735,
10123, 10511, 10905, 11290, 11675, 12060, 12441, 12825,
13205, 13581, 13960, 14342, 14717, 15087, 15462, 15829,
16204, 16531, 16902, 17304, 17670, 18040, 18398, 18762,
19127, 19465, 19823, 20196, 20554, 20906, 21256, 21609,
21959, 22264, 22637, 22995, 23308, 23649, 24012, 24352,
24687, 25026, 25390, 25681, 26014, 26347, 26680, 27330,
27985, 28628, 28951, 29596, 29915, 30238, 30562, 31191,
31826, 32141, 32461, 32775, 33095, 33382, 33695, 34029,
34341, 34654, 34967, 35281, 35595, 36220, 36530, 36844,
37150, 37462, 37775, 38084, 38395, 38703, 39014, 39324,
39632, 39937, 40246, 40554, 40862, 41169, 41479, 41787,
42095, 42406, 42714, 43019, 43330, 43642, 43945, 44254,
44563, 44871, 45183, 45491, 45796, 46101, 46410, 46713,
47327, 47632, 47937, 48244, 48555, 48867, 49177, 49488,
49792, 50094, 50398, 50707, 50993, 51324, 51626, 51931,
52239, 52550, 52857, 53161, 53773, 54080, 54387, 54693,
54998, 55311, 55617, 55924, 56235, 56547, 56852, 57159,
57470, 57781, 58091, 58400, 58709, 59020, 59331, 59644,
59955, 60265, 60579, 60890, 61206, 61521, 61833, 62149,
62463, 62772, 63088, 63403, 63716, 64034, 64352, 64665,
64984, 65624, 65940, 66262, 66578, 66900, 67221, 67543,
67861, 68179, 68504, 68819, 69145, 69459, 69782, 70111,
70428, 70747, 71070, 71389, 71710, 72036, 72358, 72680,
73003, 73326, 73648, 73973, 74296, 74620, 74944, 75269,
75592, 75916, 76241, 76566, 76889, 77216, 77543, 77869,
78191, 78518, 78843, 79165, 79496, 79823, 80148, 80479,
80803, 81128, 81459, 81783, 82110, 82439, 82771, 83095,
83426, 83757, 84086, 84416, 84741, 85074, 85400, 85729,
86060, 86390, 86719, 87051, 87380, 87711, 88041, 88373,
88705, 89029, 89365, 89698, 90023, 90356, 90690, 91019,
91352, 91684, 92014, 92347, 92681, 93014, 93349, 93678,
94011, 94344, 94675, 95009, 95339, 95673, 96007, 96341,
96668, 97002, 97337, 97665, 98003, 98335, 98668, 99003,
99339, 99673, 100007, 100346, 100684, 101017, 101357, 101693,
102028, 102368, 102705, 103043, 103380, 103718, 104061, 104403,
104736, 105077, 105421, 105756, 106096, 106437, 106777, 107118,
107461, 107800, 108141, 108485, 108822, 109167, 109507, 109848,
110196, 110538, 110884, 111230, 111571, 111918, 112263, 112606,
112952, 113639, 113987, 114336, 114680, 115025, 115372, 115722,
116068, 116418, 116766, 117114, 117464, 117811, 118158, 118511,
118858, 119208, 119557, 119904, 120257, 120606, 120952, 121303,
121655, 122003, 122354, 122707, 123057, 123408, 123760, 124114,
124466, 124815, 125172, 125523, 125879, 126231, 126586, 126946,
127298, 127653, 128014, 128369, 128724, 129084, 129441, 129794,
130150, 130504, 130863, 131219, 131576, 131937, 132297, 132653,
133012, 133375, 133731, 134091, 134455, 134813, 135174, 135534,
135897, 136258, 136621, 136986, 137349, 137711, 138073, 138439,
138799, 139164, 139526, 139887, 140253, 140617, 140977, 141344,
141706, 142071, 142438, 142803, 143170, 143537, 143904, 144274,
144641, 145011, 145382, 145749, 146124, 146493, 146864, 147235,
147605, 147977, 148346, 148718, 149085, 149455, 149826, 150195,
150566, 150936, 151310, 151676, 152048, 152423, 152795, 153167,
153539, 153916, 154290, 154661, 155036, 155408, 155782, 156159,
156530, 156905, 157280, 157655, 158029, 158404, 158783, 159157,
159532, 159910, 160290, 160660, 161037, 161415, 161786, 162161,
162538, 162913, 163289, 163665, 164040, 164415, 164789, 165164,
165539, 165911, 166286, 166661, 167040, 167418, 167791, 168169,
168545, 168922, 169300, 169676, 170053, 170429, 170811, 171195,
171571, 171952, 172335, 172717, 173098, 173484, 173869, 174254,
174637, 175020, 175403, 175785, 176167, 176552, 176933, 177316,
177698, 178080, 178463, 178840, 179224, 179603, 179979, 180360,
180739, 181114, 181492, 181870, 182248, 182626, 183001, 183378,
183752, 184128, 184503, 184876, 185252, 185629, 186003, 186384,
186760, 187134, 187515, 187900, 188281, 188656, 189031, 189415,
189798, 190176, 190555, 190936, 191313, 191692, 192069, 192448,
192824, 193203, 193578, 193953, 194330, 194707]
IBIs = [411,416,414,412,413,415, 413, 414, 406, 412, 406, 406, 405,
408, 405, 396, 399, 395, 403, 400, 397, 398, 394, 397, 388, 388,
394, 385, 385, 385, 381, 384, 380, 376, 379, 382, 375, 370, 375,
367, 375, 327, 371, 402, 366, 370, 358, 364, 365, 338, 358, 373,
358, 352, 350, 353, 350, 305, 373, 358, 313, 341, 363, 340, 335,
339, 364, 291, 333, 333, 333, 650, 655, 643, 323, 645, 319, 323,
324, 629, 635, 315, 320, 314, 320, 287, 313, 334, 312, 313, 313,
314, 314, 625, 310, 314, 306, 312, 313, 309, 311, 308, 311, 310,
308, 305, 309, 308, 308, 307, 310, 308, 308, 311, 308, 305, 311,
312, 303, 309, 309, 308, 312, 308, 305, 305, 309, 303, 614, 305,
305, 307, 311, 312, 310, 311, 304, 302, 304, 309, 286, 331, 302,
305, 308, 311, 307, 304, 612, 307, 307, 306, 305, 313, 306, 307,
311, 312, 305, 307, 311, 311, 310, 309, 309, 311, 311, 313, 311,
310, 314, 311, 316, 315, 312, 316, 314, 309, 316, 315, 313, 318,
318, 313, 319, 640, 316, 322, 316, 322, 321, 322, 318, 318, 325,
315, 326, 314, 323, 329, 317, 319, 323, 319, 321, 326, 322, 322,
323, 323, 322, 325, 323, 324, 324, 325, 323, 324, 325, 325, 323,
327, 327, 326, 322, 327, 325, 322, 331, 327, 325, 331, 324, 325,
331, 324, 327, 329, 332, 324, 331, 331, 329, 330, 325, 333, 326,
329, 331, 330, 329, 332, 329, 331, 330, 332, 332, 324, 336, 333,
325, 333, 334, 329, 333, 332, 330, 333, 334, 333, 335, 329, 333,
333, 331, 334, 330, 334, 334, 334, 327, 334, 335, 328, 338, 332,
333, 335, 336, 334, 334, 339, 338, 333, 340, 336, 335, 340, 337,
338, 337, 338, 343, 342, 333, 341, 344, 335, 340, 341, 340, 341,
343, 339, 341, 344, 337, 345, 340, 341, 348, 342, 346, 346, 341,
347, 345, 343, 346, 687, 348, 349, 344, 345, 347, 350, 346, 350,
348, 348, 350, 347, 347, 353, 347, 350, 349, 347, 353, 349, 346,
351, 352, 348, 351, 353, 350, 351, 352, 354, 352, 349, 357, 351,
356, 352, 355, 360, 352, 355, 361, 355, 355, 360, 357, 353, 356,
354, 359, 356, 357, 361, 360, 356, 359, 363, 356, 360, 364, 358,
361, 360, 363, 361, 363, 365, 363, 362, 362, 366, 360, 365, 362,
361, 366, 364, 360, 367, 362, 365, 367, 365, 367, 367, 367, 370,
367, 370, 371, 367, 375, 369, 371, 371, 370, 372, 369, 372, 367,
370, 371, 369, 371, 370, 374, 366, 372, 375, 372, 372, 372, 377,
374, 371, 375, 372, 374, 377, 371, 375, 375, 375, 374, 375, 379,
374, 375, 378, 380, 370, 377, 378, 371, 375, 377, 375, 376, 376,
375, 375, 374, 375, 375, 372, 375, 375, 379, 378, 373, 378, 376,
377, 378, 376, 377, 376, 382, 384, 376, 381, 383, 382, 381, 386,
385, 385, 383, 383, 383, 382, 382, 385, 381, 383, 382, 382, 383,
377, 384, 379, 376, 381, 379, 375, 378, 378, 378, 378, 375, 377,
374, 376, 375, 373, 376, 377, 374, 381, 376, 374, 381, 385, 381,
375, 375, 384, 383, 378, 379, 381, 377, 379, 377, 379, 376, 379,
375, 375, 377, 377]
Plotting the whole dataset gives: [scatter plot of IBIs over RR_times; figure omitted]
I previously used a simple above/below threshold filter, but that does not work for longer recordings in which the trace spans a larger range of values (in some recordings the intervals span from 300 ms during training to 1500 ms after a period of rest).
What is the best way to remove the outliers in this case, and how would one go about implementing it? A moving average, exclusion based on standard deviation, a median filter...?

Here's an ugly approach that seems to work:
import numpy as np

RR_times = np.array([411, 827, 1241, ...])
IBIs = np.array([411, 416, 414, ...])

diffs = np.abs(np.diff(IBIs))  # absolute differences between successive intervals
IBIs_cleaned = np.full(IBIs.shape, np.nan)  # create an array full of NaNs
IBIs_cleaned[0] = IBIs[0]  # assume the first value isn't an outlier
for i in range(1, len(IBIs)):
    if np.abs(IBIs[i] - IBIs[i - 1]) < np.mean(diffs) and IBIs[i] < 1.6 * np.mean(IBIs):
        IBIs_cleaned[i] = IBIs[i]
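Among the options mentioned in the question (moving average, standard-deviation exclusion, median filter), a rolling-median filter tends to cope well with the baseline drift that breaks a fixed threshold, because the median is largely insensitive to the outliers themselves. A minimal sketch; the window size and the 30% deviation threshold are illustrative guesses to tune, not validated defaults:

```python
import numpy as np

def median_filter_outliers(ibis, window=11, threshold=0.3):
    """NaN out samples that deviate from the local median by more than
    `threshold` (as a fraction of that local median). The window and
    threshold here are assumptions to tune against real recordings."""
    ibis = np.asarray(ibis, dtype=float)
    cleaned = ibis.copy()
    half = window // 2
    for i in range(len(ibis)):
        lo, hi = max(0, i - half), min(len(ibis), i + half + 1)
        local_median = np.median(ibis[lo:hi])
        if abs(ibis[i] - local_median) > threshold * local_median:
            cleaned[i] = np.nan  # keep the timeline length, drop the value
    return cleaned

# Toy data: two obvious misreadings (650 and 200) in a ~410 ms band
cleaned = median_filter_outliers([411, 416, 414, 412, 650, 413, 415, 200, 414])
print(cleaned)  # the 650 and 200 entries become nan
```

Because the result keeps the same shape and only NaNs the bad samples, the total recording duration is preserved, as requested.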

Related

How to implement different sequences in shell sort in python?

Hi, I have the following code for implementing Shell sort in Python. How can I implement the following gap sequences in Shell sort using the code below? (Note: this is not the list I want to sort.)
1, 4, 13, 40, 121, 364, 1093, 3280, 9841, 29524 (Knuth’s sequence)
1, 5, 17, 53, 149, 373, 1123, 3371, 10111, 30341
1, 10, 30, 60, 120, 360, 1080, 3240, 9720, 29160
interval = n // 2
while interval > 0:
    for i in range(interval, n):
        temp = array[i]
        j = i
        while j >= interval and array[j - interval] > temp:
            array[j] = array[j - interval]
            j -= interval
        array[j] = temp
    interval //= 2
You could modify the pseudo-code provided in the Wikipedia article for Shellsort to take in the gap sequence as a parameter:
from random import choices
from timeit import timeit

RAND_SEQUENCE_SIZE = 500
GAP_SEQUENCES = {
    'CIURA_A102549': [701, 301, 132, 57, 23, 10, 4, 1],
    'KNUTH_A003462': [29524, 9841, 3280, 1093, 364, 121, 40, 13, 4, 1],
    'SPACED_OUT_PRIME_GAPS': [30341, 10111, 3371, 1123, 373, 149, 53, 17, 5, 1],
    'SPACED_OUT_EVEN_GAPS': [29160, 9720, 3240, 1080, 360, 120, 60, 30, 10, 1],
}

def shell_sort(seq: list[int], gap_sequence: list[int]) -> None:
    n = len(seq)
    # Start with the largest gap and work down to a gap of 1. Similar to
    # insertion sort, but gap is used in each step instead of 1.
    for gap in gap_sequence:
        # Do a gapped insertion sort for every element in gaps.
        # Each gap sort includes (0..gap-1) offset interleaved sorting.
        for offset in range(gap):
            for i in range(offset, n, gap):
                # Save seq[i] in temp and make a hole at position i.
                temp = seq[i]
                # Shift earlier gap-sorted elements up until the correct
                # location for seq[i] is found.
                j = i
                while j >= gap and seq[j - gap] > temp:
                    seq[j] = seq[j - gap]
                    j -= gap
                # Put temp (the original seq[i]) in its correct location.
                seq[j] = temp

def main() -> None:
    seq = choices(population=range(1000), k=RAND_SEQUENCE_SIZE)
    print(f'{seq = }')
    print(f'{len(seq) = }')
    for name, gap_sequence in GAP_SEQUENCES.items():
        print(f'Shell sort using {name} gap sequence: {gap_sequence}')
        print(f'Time taken to sort 100 times: '
              f'{timeit(lambda: shell_sort(seq.copy(), gap_sequence), number=100)} seconds')

if __name__ == '__main__':
    main()
Example Output:
seq = [331, 799, 153, 700, 373, 38, 203, 535, 894, 500, 922, 939, 507, 506, 89, 40, 442, 108, 112, 359, 280, 946, 395, 708, 140, 435, 588, 306, 202, 23, 6, 189, 570, 600, 857, 949, 606, 617, 556, 863, 521, 776, 436, 801, 501, 588, 927, 279, 210, 72, 460, 52, 340, 632, 385, 965, 730, 360, 88, 216, 991, 520, 74, 112, 770, 853, 483, 787, 229, 812, 259, 349, 967, 227, 957, 728, 780, 51, 604, 748, 3, 679, 33, 488, 130, 203, 493, 471, 397, 53, 49, 172, 7, 306, 613, 519, 575, 64, 168, 161, 376, 903, 338, 800, 58, 729, 421, 238, 967, 294, 967, 218, 456, 823, 649, 569, 144, 103, 970, 780, 859, 719, 15, 536, 263, 917, 0, 54, 370, 703, 911, 518, 78, 41, 106, 452, 355, 571, 249, 58, 274, 327, 500, 341, 743, 536, 432, 799, 597, 681, 301, 856, 219, 63, 653, 680, 891, 725, 537, 673, 815, 504, 720, 573, 60, 91, 909, 892, 964, 119, 793, 540, 303, 538, 130, 717, 755, 968, 46, 229, 837, 398, 182, 303, 99, 808, 56, 780, 415, 33, 511, 771, 875, 593, 120, 727, 505, 905, 619, 295, 958, 566, 8, 291, 811, 529, 789, 523, 545, 5, 631, 28, 107, 292, 831, 657, 952, 239, 814, 862, 912, 2, 147, 750, 132, 528, 408, 916, 718, 261, 488, 621, 261, 963, 880, 625, 151, 982, 819, 749, 224, 572, 690, 766, 278, 417, 248, 987, 664, 515, 691, 940, 860, 172, 898, 321, 381, 662, 293, 354, 642, 219, 133, 133, 854, 162, 254, 816, 630, 21, 577, 486, 792, 731, 714, 581, 633, 794, 120, 386, 874, 177, 652, 159, 264, 414, 417, 730, 728, 716, 973, 688, 106, 345, 153, 909, 382, 505, 721, 363, 230, 588, 765, 340, 142, 549, 558, 189, 547, 728, 974, 468, 182, 255, 637, 317, 40, 775, 696, 135, 985, 884, 131, 797, 84, 89, 962, 810, 520, 843, 24, 400, 717, 834, 170, 681, 333, 68, 159, 688, 422, 198, 621, 386, 391, 839, 283, 167, 655, 314, 820, 432, 412, 181, 440, 864, 828, 217, 491, 593, 298, 885, 831, 535, 92, 305, 510, 90, 949, 461, 627, 851, 606, 280, 413, 624, 916, 16, 517, 700, 776, 323, 161, 329, 25, 868, 258, 97, 219, 620, 69, 24, 794, 981, 361, 691, 20, 90, 825, 442, 531, 562, 240, 0, 440, 418, 338, 526, 34, 230, 
381, 598, 734, 925, 209, 231, 980, 122, 374, 752, 144, 105, 920, 780, 828, 948, 515, 443, 810, 81, 303, 751, 779, 516, 394, 455, 116, 448, 652, 293, 327, 367, 793, 47, 946, 653, 927, 910, 583, 845, 442, 989, 393, 490, 564, 54, 656, 689, 626, 531, 941, 575, 628, 865, 705, 219, 42, 19, 10, 155, 436, 319, 510, 520, 869, 101, 918, 170, 826, 146, 389, 200, 992, 404, 982, 889, 818, 684, 524, 642, 991, 973, 561, 104, 418, 207, 963, 192, 410, 33]
len(seq) = 500
Shell sort using CIURA_A102549 gap sequence: [701, 301, 132, 57, 23, 10, 4, 1]
Time taken to sort 100 times: 0.06717020808719099 seconds
Shell sort using KNUTH_A003462 gap sequence: [29524, 9841, 3280, 1093, 364, 121, 40, 13, 4, 1]
Time taken to sort 100 times: 0.34870366705581546 seconds
Shell sort using SPACED_OUT_PRIME_GAPS gap sequence: [30341, 10111, 3371, 1123, 373, 149, 53, 17, 5, 1]
Time taken to sort 100 times: 0.3563524999190122 seconds
Shell sort using SPACED_OUT_EVEN_GAPS gap sequence: [29160, 9720, 3240, 1080, 360, 120, 60, 30, 10, 1]
Time taken to sort 100 times: 0.38147866702638566 seconds

use string to reference already assigned local variable [duplicate]

This question already has answers here:
How can I select a variable by (string) name?
(5 answers)
Closed 8 months ago.
I want to use form_state_data, which is a string (str), to reference the same-named local variable inside the is_valid_phone function, as you can see in the print call.
form_state_data will always be a two-character state code that also exists as a local variable inside is_valid_phone containing a list of area codes as integers.
form_state_data = 'AL'
form_phone_data_sliced = 205
def is_valid_phone(form_state_data, form_phone_data_sliced):
    # Some codes are not correct.
    AL = [205, 251, 256, 334, 938]
    AK = [907]
    AZ = [480, 520, 602, 623, 928]
    AR = [479, 501, 870]
    CA = [209, 213, 279, 310, 323, 408, 415, 424, 442, 510, 530, 559, 562, 619, 626, 628, 650, 657, 661, 669, 707, 714, 747, 760, 805, 818, 820, 831, 858, 909, 916, 925, 949, 951]
    CO = [303, 719, 720, 970]
    CT = [203, 475, 860, 959]
    DE = [302]
    DC = [202]
    FL = [239, 305, 321, 352, 386, 407, 561, 727, 754, 772, 786, 813, 850, 863, 904, 941, 954]
    GA = [229, 404, 470, 478, 678, 706, 762, 770, 912]
    HI = [808]
    ID = [208, 986]
    IL = [217, 224, 309, 312, 331, 618, 630, 708, 773, 779, 815, 847, 872]
    IN = [219, 260, 317, 463, 574, 765, 812, 930]
    IA = [319, 515, 563, 641, 712]
    KS = [316, 620, 785, 913]
    KY = [270, 364, 502, 606, 859]
    LA = [225, 318, 337, 504, 985]
    ME = [207]
    MT = [339, 351, 413, 508, 617, 774, 781, 857, 978]
    NE = [308, 402, 531]
    NV = [702, 725, 775]
    NH = [603]
    NJ = [201, 551, 609, 640, 732, 848, 856, 862, 908, 973]
    NM = [505, 575]
    NY = [212, 315, 332, 347, 516, 518, 585, 607, 631, 646, 680, 716, 718, 838, 845, 914, 917, 929, 934]
    NC = [252, 336, 704, 743, 828, 910, 919, 980, 984]
    ND = [701]
    OH = [216, 220, 234, 330, 380, 419, 440, 513, 567, 614, 740, 937]
    OK = [405, 539, 580, 918]
    OR = [458, 503, 541, 971]
    MD = [240, 301, 410, 443, 667]
    MA = [218, 320, 507, 612, 651, 763, 952]
    MI = [228, 601, 662, 769]
    MN = [218, 320, 507, 612, 651, 763, 952]
    MS = [314, 417, 573, 636, 660, 816]
    MO = [406]
    PA = [215, 223, 267, 272, 412, 445, 484, 570, 610, 717, 724, 814, 878]
    RI = [401]
    SC = [803, 843, 854, 864]
    SD = [605]
    TN = [423, 615, 629, 731, 865, 901, 931]
    TX = [210, 214, 254, 281, 325, 346, 361, 409, 430, 432, 469, 512, 682, 713, 726, 737, 806, 817, 830, 832, 903, 915, 936, 940, 956, 972, 979]
    UT = [385, 435, 801]
    VT = [802]
    VA = [276, 434, 540, 571, 703, 757, 804]
    WA = [206, 253, 360, 425, 509, 564]
    WV = [304, 681]
    WI = [262, 414, 534, 608, 715, 920]
    WY = [307]
    print(form_phone_data_sliced in form_state_data)

is_valid_phone(form_state_data, form_phone_data_sliced)
You should use a dictionary to store the state codes; below is an example of how to achieve this.
states = {
'AL': [205, 251, 256, 334, 938],
'AK': [907],
'AZ': [480, 520, 602, 623, 928],
'AR': [479, 501, 870],
'CA': [209, 213, 279, 310, 323, 408, 415, 424, 442, 510, 530, 559, 562, 619, 626, 628, 650, 657, 661, 669, 707, 714, 747, 760, 805, 818, 820, 831, 858, 909, 916, 925, 949, 951],
'CO': [303, 719, 720, 970],
'CT': [203, 475, 860, 959],
'DE': [302],
'DC': [202],
'FL': [239, 305, 321, 352, 386, 407, 561, 727, 754, 772, 786, 813, 850, 863, 904, 941, 954],
'GA': [229, 404, 470, 478, 678, 706, 762, 770, 912],
'HI': [808],
'ID': [208, 986],
'IL': [217, 224, 309, 312, 331, 618, 630, 708, 773, 779, 815, 847, 872],
'IN': [219, 260, 317, 463, 574, 765, 812, 930],
'IA': [319, 515, 563, 641, 712],
'KS': [316, 620, 785, 913],
'KY': [270, 364, 502, 606, 859],
'LA': [225, 318, 337, 504, 985],
'ME': [207],
'MT': [339, 351, 413, 508, 617, 774, 781, 857, 978],
'NE': [308, 402, 531],
'NV': [702, 725, 775],
'NH': [603],
'NJ': [201, 551, 609, 640, 732, 848, 856, 862, 908, 973],
'NM': [505, 575],
'NY': [212, 315, 332, 347, 516, 518, 585, 607, 631, 646, 680, 716, 718, 838, 845, 914, 917, 929, 934],
'NC': [252, 336, 704, 743, 828, 910, 919, 980, 984],
'ND': [701],
'OH': [216, 220, 234, 330, 380, 419, 440, 513, 567, 614, 740, 937],
'OK': [405, 539, 580, 918],
'OR': [458, 503, 541, 971],
'MD': [240, 301, 410, 443, 667],
'MA': [218, 320, 507, 612, 651, 763, 952],
'MI': [228, 601, 662, 769],
'MN': [218, 320, 507, 612, 651, 763, 952],
'MS': [314, 417, 573, 636, 660, 816],
'MO': [406],
'PA': [215, 223, 267, 272, 412, 445, 484, 570, 610, 717, 724, 814, 878],
'RI': [401],
'SC': [803, 843, 854, 864],
'SD': [605],
'TN': [423, 615, 629, 731, 865, 901, 931],
'TX': [210, 214, 254, 281, 325, 346, 361, 409, 430, 432, 469, 512, 682, 713, 726, 737, 806, 817, 830, 832, 903, 915, 936, 940, 956, 972, 979],
'UT': [385, 435, 801],
'VT': [802],
'VA': [276, 434, 540, 571, 703, 757, 804],
'WA': [206, 253, 360, 425, 509, 564],
'WV': [304, 681],
'WI': [262, 414, 534, 608, 715, 920],
'WY': [307],
}
is_valid_phone = lambda state, code : code in states[state]
print(is_valid_phone('AL', 205))
print(is_valid_phone('AL', 2000005))
If you really want to assign these to bare names (bad practice), you can use a class instead of a function and then look the attributes up by string with vars():
class Phone:
    AL = [1, 2]

print(vars(Phone)['AL'])
vars() takes an object and returns its attribute dictionary, so the attributes become accessible with strings. (Note that you must call vars() on the class itself here: calling it on a fresh instance would return an empty dict, because the lists are class attributes.)
As people in the comments have said, what you want to do is make all of the states into a dictionary and then use this code:
if form_phone_data_sliced in states[form_state_data]:
    return True

Fill area of overlap between two normal distributions in seaborn / matplotlib

I want to fill the area overlapping between two normal distributions. I've got the x min and max, but I can't figure out how to set the y boundaries.
I've looked at the plt documentation and some examples. I think this related question and this one come close, but no luck. Here's what I have so far.
import numpy as np
import seaborn as sns
import scipy.stats as stats
import matplotlib.pyplot as plt
pepe_calories = np.array([361, 291, 263, 284, 311, 284, 282, 228, 328, 263, 354, 302, 293,
254, 297, 281, 307, 281, 262, 302, 244, 259, 273, 299, 278, 257,
296, 237, 276, 280, 291, 278, 251, 313, 314, 323, 333, 270, 317,
321, 307, 256, 301, 264, 221, 251, 307, 283, 300, 292, 344, 239,
288, 356, 224, 246, 196, 202, 314, 301, 336, 294, 237, 284, 311,
257, 255, 287, 243, 267, 253, 257, 320, 295, 295, 271, 322, 343,
313, 293, 298, 272, 267, 257, 334, 276, 337, 325, 261, 344, 298,
253, 302, 318, 289, 302, 291, 343, 310, 241])
modern_calories = np.array([310, 315, 303, 360, 339, 416, 278, 326, 316, 314, 333, 317, 357,
304, 363, 387, 279, 350, 367, 321, 366, 311, 308, 303, 299, 363,
335, 357, 392, 321, 361, 285, 321, 290, 392, 341, 331, 338, 326,
314, 327, 320, 293, 333, 297, 315, 365, 408, 352, 359, 312, 300,
263, 358, 345, 360, 336, 378, 315, 354, 318, 300, 372, 305, 336,
286, 296, 413, 383, 328, 418, 388, 416, 371, 313, 321, 321, 317,
402, 290, 328, 344, 330, 319, 309, 327, 351, 324, 278, 369, 416,
359, 381, 324, 306, 350, 385, 335, 395, 308])
ax = sns.distplot(pepe_calories, fit_kws={"color":"blue"}, kde=False,
fit=stats.norm, hist=None, label="Pepe's");
ax = sns.distplot(modern_calories, fit_kws={"color":"orange"}, kde=False,
fit=stats.norm, hist=None, label="Modern");
# Get the two lines from the axes to generate shading
l1 = ax.lines[0]
l2 = ax.lines[1]
# Get the xy data from the lines so that we can shade
x1 = l1.get_xydata()[:,0]
y1 = l1.get_xydata()[:,1]
x2 = l2.get_xydata()[:,0]
y2 = l2.get_xydata()[:,1]
x2min = np.min(x2)
x1max = np.max(x1)
ax.fill_between(x1,y1, where = ((x1 > x2min) & (x1 < x1max)), color="red", alpha=0.3)
#> <matplotlib.collections.PolyCollection at 0x1a200510b8>
plt.legend()
#> <matplotlib.legend.Legend at 0x1a1ff2e390>
plt.show()
Any ideas?
Created on 2018-12-01 by the reprexpy package
import reprexpy
print(reprexpy.SessionInfo())
#> Session info --------------------------------------------------------------------
#> Platform: Darwin-18.2.0-x86_64-i386-64bit (64-bit)
#> Python: 3.6
#> Date: 2018-12-01
#> Packages ------------------------------------------------------------------------
#> matplotlib==2.1.2
#> numpy==1.15.4
#> reprexpy==0.1.1
#> scipy==1.1.0
#> seaborn==0.9.0
While gathering the pdf data from get_xydata is clever, you are now at the mercy of matplotlib's rendering / segmentation algorithm. Having x1 and x2 span different ranges also makes comparing y1 and y2 difficult.
You can avoid these problems by fitting the normals yourself instead of
letting sns.distplot do it. Then you have more control over the values you are
looking for.
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
norm = stats.norm
pepe_calories = np.array([361, 291, 263, 284, 311, 284, 282, 228, 328, 263, 354, 302, 293,
254, 297, 281, 307, 281, 262, 302, 244, 259, 273, 299, 278, 257,
296, 237, 276, 280, 291, 278, 251, 313, 314, 323, 333, 270, 317,
321, 307, 256, 301, 264, 221, 251, 307, 283, 300, 292, 344, 239,
288, 356, 224, 246, 196, 202, 314, 301, 336, 294, 237, 284, 311,
257, 255, 287, 243, 267, 253, 257, 320, 295, 295, 271, 322, 343,
313, 293, 298, 272, 267, 257, 334, 276, 337, 325, 261, 344, 298,
253, 302, 318, 289, 302, 291, 343, 310, 241])
modern_calories = np.array([310, 315, 303, 360, 339, 416, 278, 326, 316, 314, 333, 317, 357,
304, 363, 387, 279, 350, 367, 321, 366, 311, 308, 303, 299, 363,
335, 357, 392, 321, 361, 285, 321, 290, 392, 341, 331, 338, 326,
314, 327, 320, 293, 333, 297, 315, 365, 408, 352, 359, 312, 300,
263, 358, 345, 360, 336, 378, 315, 354, 318, 300, 372, 305, 336,
286, 296, 413, 383, 328, 418, 388, 416, 371, 313, 321, 321, 317,
402, 290, 328, 344, 330, 319, 309, 327, 351, 324, 278, 369, 416,
359, 381, 324, 306, 350, 385, 335, 395, 308])
pepe_params = norm.fit(pepe_calories)
modern_params = norm.fit(modern_calories)
xmin = min(pepe_calories.min(), modern_calories.min())
xmax = max(pepe_calories.max(), modern_calories.max())
x = np.linspace(xmin, xmax, 100)
pepe_pdf = norm(*pepe_params).pdf(x)
modern_pdf = norm(*modern_params).pdf(x)
y = np.minimum(modern_pdf, pepe_pdf)
fig, ax = plt.subplots()
ax.plot(x, pepe_pdf, label="Pepe's", color='blue')
ax.plot(x, modern_pdf, label="Modern", color='orange')
ax.fill_between(x, y, color='red', alpha=0.3)
plt.legend()
plt.show()
If, let's say, sns.distplot (or some other plotting function) made a plot that you did not want to have to reproduce, then you could use the data from get_xydata this way:
import numpy as np
import seaborn as sns
import scipy.stats as stats
import matplotlib.pyplot as plt
pepe_calories = np.array([361, 291, 263, 284, 311, 284, 282, 228, 328, 263, 354, 302, 293,
254, 297, 281, 307, 281, 262, 302, 244, 259, 273, 299, 278, 257,
296, 237, 276, 280, 291, 278, 251, 313, 314, 323, 333, 270, 317,
321, 307, 256, 301, 264, 221, 251, 307, 283, 300, 292, 344, 239,
288, 356, 224, 246, 196, 202, 314, 301, 336, 294, 237, 284, 311,
257, 255, 287, 243, 267, 253, 257, 320, 295, 295, 271, 322, 343,
313, 293, 298, 272, 267, 257, 334, 276, 337, 325, 261, 344, 298,
253, 302, 318, 289, 302, 291, 343, 310, 241])
modern_calories = np.array([310, 315, 303, 360, 339, 416, 278, 326, 316, 314, 333, 317, 357,
304, 363, 387, 279, 350, 367, 321, 366, 311, 308, 303, 299, 363,
335, 357, 392, 321, 361, 285, 321, 290, 392, 341, 331, 338, 326,
314, 327, 320, 293, 333, 297, 315, 365, 408, 352, 359, 312, 300,
263, 358, 345, 360, 336, 378, 315, 354, 318, 300, 372, 305, 336,
286, 296, 413, 383, 328, 418, 388, 416, 371, 313, 321, 321, 317,
402, 290, 328, 344, 330, 319, 309, 327, 351, 324, 278, 369, 416,
359, 381, 324, 306, 350, 385, 335, 395, 308])
ax = sns.distplot(pepe_calories, fit_kws={"color":"blue"}, kde=False,
fit=stats.norm, hist=None, label="Pepe's");
ax = sns.distplot(modern_calories, fit_kws={"color":"orange"}, kde=False,
fit=stats.norm, hist=None, label="Modern");
# Get the two lines from the axes to generate shading
l1 = ax.lines[0]
l2 = ax.lines[1]
# Get the xy data from the lines so that we can shade
x1, y1 = l1.get_xydata().T
x2, y2 = l2.get_xydata().T
xmin = max(x1.min(), x2.min())
xmax = min(x1.max(), x2.max())
x = np.linspace(xmin, xmax, 100)
y1 = np.interp(x, x1, y1)
y2 = np.interp(x, x2, y2)
y = np.minimum(y1, y2)
ax.fill_between(x, y, color="red", alpha=0.3)
plt.legend()
plt.show()
I suppose not using seaborn is often a useful strategy in cases where you want full control over the resulting plot. So just calculate the fits yourself, plot them, and fill between the curves up to the point where they cross each other.
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
pepe_calories = np.array(...)
modern_calories = np.array(...)
x = np.linspace(150,470,1000)
y1 = stats.norm.pdf(x, *stats.norm.fit(pepe_calories))
y2 = stats.norm.pdf(x, *stats.norm.fit(modern_calories))
cross = x[y1-y2 <= 0][0]
fig, ax = plt.subplots()
ax.fill_between(x,y1,y2, where=(x<=cross), color="red", alpha=0.3)
ax.plot(x,y1, label="Pepe's")
ax.plot(x,y2, label="Modern")
ax.legend()
plt.show()

assign value of arbitrary line in 2-d array to nans

I have a 2D numpy array, z, in which I would like to assign values to nan based on the equation of a line +/- a width of 20. I am trying to implement the Raman 2nd scattering correction as it is done by the eem_remove_scattering method in the eemR package listed here:
https://cran.r-project.org/web/packages/eemR/vignettes/introduction.html
but the method isn't visible.
import numpy as np
ex = np.array([240, 245, 250, 255, 260, 265, 270, 275, 280, 285, 290, 295, 300,
305, 310, 315, 320, 325, 330, 335, 340, 345, 350, 355, 360, 365,
370, 375, 380, 385, 390, 395, 400, 405, 410, 415, 420, 425, 430,
435, 440, 445, 450])
em = np.array([300, 302, 304, 306, 308, 310, 312, 314, 316, 318, 320, 322, 324,
326, 328, 330, 332, 334, 336, 338, 340, 342, 344, 346, 348, 350,
352, 354, 356, 358, 360, 362, 364, 366, 368, 370, 372, 374, 376,
378, 380, 382, 384, 386, 388, 390, 392, 394, 396, 398, 400, 402,
404, 406, 408, 410, 412, 414, 416, 418, 420, 422, 424, 426, 428,
430, 432, 434, 436, 438, 440, 442, 444, 446, 448, 450, 452, 454,
456, 458, 460, 462, 464, 466, 468, 470, 472, 474, 476, 478, 480,
482, 484, 486, 488, 490, 492, 494, 496, 498, 500, 502, 504, 506,
508, 510, 512, 514, 516, 518, 520, 522, 524, 526, 528, 530, 532,
534, 536, 538, 540, 542, 544, 546, 548, 550, 552, 554, 556, 558,
560, 562, 564, 566, 568, 570, 572, 574, 576, 578, 580, 582, 584,
586, 588, 590, 592, 594, 596, 598, 600])
X, Y = np.meshgrid(ex, em)
z = np.sin(X) + np.cos(Y)
The equation of the line that I would like to apply is em = -2*ex / (0.00036*ex - 1) + 500.
I want to set every value in the array that intersects this line (+/- 20) to NaN. It's simple enough to set a single element to NaN, but I haven't been able to locate a Python function to apply this equation to the array and set only the values that intersect the line to NaN.
The desired output would be a new array with the same dimensions as z, but with the values that intersect the line set to NaN. Any suggestions on how to proceed are greatly appreciated.
Use np.where in the form np.where( "condition for intersection", np.nan, z):
zi = np.where( np.abs(-2*X/(0.00036*X-1) + 500 - Y) <= 20, np.nan, z)
As a matter of fact, there are no intersections here because (0.00036*ex-1) is close to -1 for all your values, which makes - 2*ex/(0.00036*ex-1) close to 2*ex, and adding 500 brings this over any values you have in em. But in principle this works.
Also, I suspect that the goal you plan to achieve by setting those values to NaN would be better achieved by using a masked array.
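To illustrate the masked-array suggestion, here is a minimal sketch using the same grid as in the question. With these ex/em ranges nothing actually gets masked (the line lies above every em value, as noted above), but the mechanics carry over to data where the line does cross the grid:

```python
import numpy as np

# Same grid construction as in the question
ex = np.arange(240, 455, 5)    # 240..450 in steps of 5
em = np.arange(300, 602, 2)    # 300..600 in steps of 2
X, Y = np.meshgrid(ex, em)
z = np.sin(X) + np.cos(Y)

# Mask every cell within +/- 20 of the line em = -2*ex/(0.00036*ex - 1) + 500
line = -2 * X / (0.00036 * X - 1) + 500
zm = np.ma.masked_where(np.abs(line - Y) <= 20, z)

# Unlike NaN, masked cells are simply skipped by reductions such as mean()
print(zm.shape, np.ma.count_masked(zm))
```

The advantage over NaN is that reductions like `zm.mean()` or `zm.sum()` ignore masked cells automatically, with no need for the `nan*` function family.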

Why does my Python multiprocessing code return the same result with randomized numbers? [duplicate]

This question already has answers here:
Python Multiprocessing Numpy Random [duplicate]
(2 answers)
Closed 7 years ago.
I'm analyzing a large graph. So, I divide the graph into chunks and hopefully with multi-core CPU it would be faster. However, my model is a randomized model so there's a chance that the results of each run won't be the same. I'm testing the idea and I get the same result all the time so I'm wondering if my code is correct.
Here's my code
from multiprocessing import Process, Queue

# split a list into evenly sized chunks
def chunks(l, n):
    return [l[i:i+n] for i in range(0, len(l), n)]

def multiprocessing_icm(queue, nodes):
    queue.put(independent_cascade_igraph(twitter_igraph, nodes, steps=1))

def dispatch_jobs(data, job_number):
    total = len(data)
    chunk_size = total // job_number  # integer division; range() needs an int
    slice = chunks(data, chunk_size)
    jobs = []
    queue = Queue()
    for i, s in enumerate(slice):
        j = Process(target=multiprocessing_icm, args=(queue, s))
        jobs.append(j)
    for j in jobs:
        j.start()
    for j in jobs:
        j.join()
    return queue

dispatch_jobs(['121817564', '121817564'], 2)
If you're wondering what independent_cascade_igraph is, here's the code:
import copy
import random

def independent_cascade_igraph(G, seeds, steps=0):
    # init activation probabilities
    for e in G.es():
        if 'act_prob' not in e.attributes():
            e['act_prob'] = 0.1
        elif e['act_prob'] > 1:
            raise Exception("edge activation probability:", e['act_prob'], "cannot be larger than 1")
    # perform diffusion
    A = copy.deepcopy(seeds)  # prevent side effect
    if steps <= 0:
        # perform diffusion until no more nodes can be activated
        return _diffuse_all(G, A)
    # perform diffusion for at most "steps" rounds
    return _diffuse_k_rounds(G, A, steps)

def _diffuse_all(G, A):
    tried_edges = set()
    layer_i_nodes = []
    layer_i_nodes.append([i for i in A])  # prevent side effect
    while True:
        len_old = len(A)
        (A, activated_nodes_of_this_round, cur_tried_edges) = _diffuse_one_round(G, A, tried_edges)
        layer_i_nodes.append(activated_nodes_of_this_round)
        tried_edges = tried_edges.union(cur_tried_edges)
        if len(A) == len_old:
            break
    return layer_i_nodes

def _diffuse_k_rounds(G, A, steps):
    tried_edges = set()
    layer_i_nodes = []
    layer_i_nodes.append([i for i in A])
    while steps > 0 and len(A) < G.vcount():
        len_old = len(A)
        (A, activated_nodes_of_this_round, cur_tried_edges) = _diffuse_one_round(G, A, tried_edges)
        layer_i_nodes.append(activated_nodes_of_this_round)
        tried_edges = tried_edges.union(cur_tried_edges)
        if len(A) == len_old:
            break
        steps -= 1
    return layer_i_nodes

def _diffuse_one_round(G, A, tried_edges):
    activated_nodes_of_this_round = set()
    cur_tried_edges = set()
    for s in A:
        for nb in G.successors(s):
            if nb in A or (s, nb) in tried_edges or (s, nb) in cur_tried_edges:
                continue
            if _prop_success(G, s, nb):
                activated_nodes_of_this_round.add(nb)
            cur_tried_edges.add((s, nb))
    activated_nodes_of_this_round = list(activated_nodes_of_this_round)
    A.extend(activated_nodes_of_this_round)
    return A, activated_nodes_of_this_round, cur_tried_edges

def _prop_success(G, src, dest):
    '''
    act_prob = 0.1
    for e in G.es():
        if (src, dest) == e.tuple:
            act_prob = e['act_prob']
            break
    '''
    return random.random() <= 0.1
Here's the result of the multiprocessing run; note that both jobs produced identical output:
[['121817564'], [1538, 1539, 4, 517, 1547, 528, 2066, 1623, 1540, 538, 1199, 31, 1056, 1058, 547, 1061, 1116, 1067, 1069, 563, 1077, 1591, 1972, 1595, 1597, 1598, 1088, 1090, 1608, 1656, 1098, 1463, 1105, 1619, 1622, 1111, 601, 1627, 604, 1629, 606, 95, 612, 101, 1980, 618, 1652, 1897, 1144, 639, 640, 641, 647, 650, 1815, 1677, 143, 1170, 1731, 660, 1173, 1690, 1692, 1562, 1563, 1189, 1702, 687, 689, 1203, 1205, 1719, 703, 1219, 1229, 1744, 376, 1746, 211, 1748, 213, 1238, 218, 221, 735, 227, 1764, 741, 230, 1769, 1258, 1780, 1269, 1783, 761, 763, 1788, 1789, 1287, 769, 258, 1286, 263, 264, 780, 1298, 1299, 1812, 473, 1822, 1828, 806, 811, 1324, 814, 304, 478, 310, 826, 1858, 1349, 326, 327, 1352, 329, 1358, 336, 852, 341, 854, 1879, 1679, 868, 2022, 1385, 1902, 1904, 881, 1907, 1398, 1911, 888, 1940, 1402, 1941, 1920, 1830, 387, 1942, 905, 1931, 1411, 399, 1426, 915, 916, 917, 406, 407, 1433, 1947, 1441, 419, 1445, 1804, 428, 1454, 1455, 948, 1973, 951, 1466, 443, 1468, 1471, 1474, 1988, 966, 1479, 1487, 976, 467, 1870, 2007, 985, 1498, 990, 1504, 1124, 485, 486, 489, 492, 2029, 2033, 1524, 1534, 2038, 1018, 1535, 510, 1125]]
[['121817564'], [1538, 1539, 4, 517, 1547, 528, 2066, 1623, 1540, 538, 1199, 31, 1056, 1058, 547, 1061, 1116, 1067, 1069, 563, 1077, 1591, 1972, 1595, 1597, 1598, 1088, 1090, 1608, 1656, 1098, 1463, 1105, 1619, 1622, 1111, 601, 1627, 604, 1629, 606, 95, 612, 101, 1980, 618, 1652, 1897, 1144, 639, 640, 641, 647, 650, 1815, 1677, 143, 1170, 1731, 660, 1173, 1690, 1692, 1562, 1563, 1189, 1702, 687, 689, 1203, 1205, 1719, 703, 1219, 1229, 1744, 376, 1746, 211, 1748, 213, 1238, 218, 221, 735, 227, 1764, 741, 230, 1769, 1258, 1780, 1269, 1783, 761, 763, 1788, 1789, 1287, 769, 258, 1286, 263, 264, 780, 1298, 1299, 1812, 473, 1822, 1828, 806, 811, 1324, 814, 304, 478, 310, 826, 1858, 1349, 326, 327, 1352, 329, 1358, 336, 852, 341, 854, 1879, 1679, 868, 2022, 1385, 1902, 1904, 881, 1907, 1398, 1911, 888, 1940, 1402, 1941, 1920, 1830, 387, 1942, 905, 1931, 1411, 399, 1426, 915, 916, 917, 406, 407, 1433, 1947, 1441, 419, 1445, 1804, 428, 1454, 1455, 948, 1973, 951, 1466, 443, 1468, 1471, 1474, 1988, 966, 1479, 1487, 976, 467, 1870, 2007, 985, 1498, 990, 1504, 1124, 485, 486, 489, 492, 2029, 2033, 1524, 1534, 2038, 1018, 1535, 510, 1125]]
But here's the output if I run independent_cascade_igraph twice in the same process:
independent_cascade_igraph(twitter_igraph, ['121817564'], steps=1)
[['121817564'],
[514,
1773,
1540,
1878,
2057,
1035,
1550,
2064,
1042,
533,
1558,
1048,
1054,
544,
545,
1061,
1067,
1885,
1072,
350,
1592,
1460,...
independent_cascade_igraph(twitter_igraph, ['121817564'], steps=1)
[['121817564'],
[1027,
2055,
8,
1452,
1546,
1038,
532,
1045,
542,
546,
1059,
549,
1575,
1576,
2030,
1067,
1068,
1071,
564,
573,
575,
1462,
584,
1293,
1105,
595,
599,
1722,
1633,
1634,
614,
1128,
1131,
1286,
621,
1647,
1648,
627,
636,
1662,
1664,
1665,
130,
1671,
1677,
656,
1169,
148,
1686,
1690,
667,
1186,
163,
1700,
1191,
1705,
1711,...
So, what I'm hoping to get out of this: if I have a list of 500 ids, I'd like the first CPU to calculate the first 250, the second CPU to calculate the last 250, and then merge the results. I'm not sure I understand multiprocessing correctly.
As mentioned e.g. in this SO answer, on *nix child processes inherit the parent's RNG state. Call random.seed() in every child process to initialize it yourself, either with a per-process value (such as the PID) or with fresh entropy.
I haven't read your program in detail, but my general feeling is that you have a random-number-generator seeding problem. If you run the program twice in the same process, the generator's state is different the second time, so the results differ. But when you spawn two worker processes, both generators may start from the same inherited seed, which is why the two jobs give identical results.
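Putting both pieces together, a sketch of the split-and-merge workflow the question asks for, using multiprocessing.Pool with an initializer that re-seeds each worker. Here cascade_stub is a hypothetical stand-in for a call like independent_cascade_igraph(G, [seed_id]); the real graph object is not available in this sketch.

```python
from multiprocessing import Pool
import os
import random

def init_worker():
    # Runs once per worker process: give each worker its own RNG seed.
    random.seed(os.getpid())

def cascade_stub(seed_id):
    # Hypothetical stand-in for independent_cascade_igraph(G, [seed_id]).
    return (seed_id, random.random())

def dispatch(ids, workers=2):
    # Pool.map splits `ids` into chunks across the workers and merges the
    # per-chunk results back in input order, so no manual slicing is needed.
    with Pool(processes=workers, initializer=init_worker) as pool:
        return pool.map(cascade_stub, ids)

if __name__ == '__main__':
    merged = dispatch([str(i) for i in range(500)], workers=2)
    print(len(merged))  # 500, in the same order as the input ids
```

With 500 ids and 2 workers, Pool does the "first half / second half" division for you (in several chunks per worker by default), and pool.map already returns the merged list, so there is no need to pass a shared Queue around by hand.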
