Pandas: Eliminating for loops - python

I have two Pandas dataframes, namely: habitat_family and habitat_species. I want to populate habitat_species based on the taxonomical lookupMap and the values in habitat_family:
import pandas as pd
import numpy as np
species = ['tiger', 'lion', 'mosquito', 'ladybug', 'locust', 'seal', 'seabass', 'shark', 'dolphin']
families = ['mammal','fish','insect']
lookupMap = {'tiger':'mammal', 'lion':'mammal', 'mosquito':'insect', 'ladybug':'insect', 'locust':'insect',
             'seal':'mammal', 'seabass':'fish', 'shark':'fish', 'dolphin':'mammal'}
habitat_family = pd.DataFrame({'id': range(1,11),
                               'mammal': [101,123,523,562,546,213,562,234,987,901],
                               'fish'  : [625,254,929,827,102,295,174,777,123,763],
                               'insect': [345,928,183,645,113,942,689,539,789,814]
                               }, index=range(1,11), columns=['id','mammal','fish','insect'])
habitat_species = pd.DataFrame(0.0, index=range(1,11), columns=species)
# My highly inefficient solution:
for id in habitat_family.index:          # loop through habitat ids
    for spec in species:                 # loop through species
        corresp_family = lookupMap[spec]
        habitat_species.loc[id, spec] = habitat_family.loc[id, corresp_family]
The nested for loops above do the job, but in reality my dataframes are massive and using for loops is not feasible.
Is there a more efficient method to achieve this using maybe dataframe.apply() or a similar function?
EDIT: The desired output habitat_species is:
habitat_species
tiger lion mosquito ladybug locust seal seabass shark dolphin
1 101 101 345 345 345 101 625 625 101
2 123 123 928 928 928 123 254 254 123
3 523 523 183 183 183 523 929 929 523
4 562 562 645 645 645 562 827 827 562
5 546 546 113 113 113 546 102 102 546
6 213 213 942 942 942 213 295 295 213
7 562 562 689 689 689 562 174 174 562
8 234 234 539 539 539 234 777 777 234
9 987 987 789 789 789 987 123 123 987
10 901 901 814 814 814 901 763 763 901

You don't need any loops at all. Check it out:
In [12]: habitat_species = habitat_family[pd.Series(species).replace(lookupMap)]
In [13]: habitat_species.columns = species
In [14]: habitat_species
Out[14]:
tiger lion mosquito ladybug locust seal seabass shark dolphin
1 101 101 345 345 345 101 625 625 101
2 123 123 928 928 928 123 254 254 123
3 523 523 183 183 183 523 929 929 523
4 562 562 645 645 645 562 827 827 562
5 546 546 113 113 113 546 102 102 546
6 213 213 942 942 942 213 295 295 213
7 562 562 689 689 689 562 174 174 562
8 234 234 539 539 539 234 777 777 234
9 987 987 789 789 789 987 123 123 987
10 901 901 814 814 814 901 763 763 901
[10 rows x 9 columns]

First of all, fantastically written question. Thanks.
I would suggest making a DataFrame for each family, and concatenating at the end:
You'll need to reverse your lookupMap:
In [80]: d = {'mammal': ['dolphin', 'lion', 'seal', 'tiger'], 'insect': ['ladybug', 'locust', 'mosquito'],
   ....:      'fish': ['seabass', 'shark']}
So as an example:
In [83]: k, v = 'mammal', d['mammal']
In [86]: pd.DataFrame([habitat_family[k] for _ in v], index=v).T
Out[86]:
dolphin lion seal tiger
1 101 101 101 101
2 123 123 123 123
3 523 523 523 523
4 562 562 562 562
5 546 546 546 546
6 213 213 213 213
7 562 562 562 562
8 234 234 234 234
9 987 987 987 987
10 901 901 901 901
[10 rows x 4 columns]
Now do that for each family:
In [87]: results = []
In [88]: for k, v in d.iteritems():
   ....:     results.append(pd.DataFrame([habitat_family[k] for _ in v], index=v).T)
And concat:
In [89]: habitat_species = pd.concat(results, axis=1)
In [90]: habitat_species
Out[90]:
dolphin lion seal tiger ladybug locust mosquito seabass shark
1 101 101 101 101 345 345 345 625 625
2 123 123 123 123 928 928 928 254 254
3 523 523 523 523 183 183 183 929 929
4 562 562 562 562 645 645 645 827 827
5 546 546 546 546 113 113 113 102 102
6 213 213 213 213 942 942 942 295 295
7 562 562 562 562 689 689 689 174 174
8 234 234 234 234 539 539 539 777 777
9 987 987 987 987 789 789 789 123 123
10 901 901 901 901 814 814 814 763 763
[10 rows x 9 columns]
You might consider passing the families as the keys parameter to concat if you want a hierarchical index for the columns with (family, species) pairs.
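For illustration, a minimal sketch of that variant (assuming results holds the per-family frames in the same order as d's keys):

# hierarchical (family, species) columns via the keys argument
habitat_species = pd.concat(results, axis=1, keys=list(d))

habitat_species['mammal']            # the sub-frame for one family
habitat_species[('mammal', 'lion')]  # a single species column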
Some profiling, since you said performance matters:
# Mine
In [97]: %%timeit
....: for k, v in d.iteritems():
....:     results.append(pd.DataFrame([habitat_family[k] for _ in v], index=v).T)
....: habitat_species = pd.concat(results, axis=1)
....:
1 loops, best of 3: 296 ms per loop
# Yours
In [98]: %%timeit
....: for id in habitat_family.index: # loop through habitat ids
....:     for spec in species: # loop through species
....:         corresp_family = lookupMap[spec]
....:         habitat_species.loc[id,spec] = habitat_family.loc[id,corresp_family]
10 loops, best of 3: 21.5 ms per loop
# Dan's
In [102]: %%timeit
.....: habitat_species = habitat_family[pd.Series(species).replace(lookupMap)]
.....: habitat_species.columns = species
.....:
100 loops, best of 3: 2.55 ms per loop
Looks like Dan wins by a long shot!

This might be the most pandonic:
In [1]: habitat_species.apply(lambda x: habitat_family[lookupMap[x.name]])
Out[1]:
tiger lion mosquito ladybug locust seal seabass shark dolphin
1 101 101 345 345 345 101 625 625 101
2 123 123 928 928 928 123 254 254 123
3 523 523 183 183 183 523 929 929 523
4 562 562 645 645 645 562 827 827 562
5 546 546 113 113 113 546 102 102 546
6 213 213 942 942 942 213 295 295 213
7 562 562 689 689 689 562 174 174 562
8 234 234 539 539 539 234 777 777 234
9 987 987 789 789 789 987 123 123 987
10 901 901 814 814 814 901 763 763 901
%timeit habitat_species.apply(lambda x: habitat_family[lookupMap[x.name]])
1000 loops, best of 3: 1.57 ms per loop

As far as I can tell, the data in the columns don't change; the columns are merely repeated for each corresponding animal.
I.e., if you just had a tiger and a lion, you would want a resulting dataframe with the mammal column repeated twice and the headers changed?
In that case, you can do:
habitat_species = pd.DataFrame(index=range(1,11))
for key, value in lookupMap.iteritems():
    habitat_species[key] = habitat_family[value]
This creates a new column in the habitat_species dataframe with the name given by key and assigns to it the values of the corresponding column in the habitat_family dataframe, whose name is given by value.
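The same idea can also be written without the explicit loop; a minimal sketch (using .items(), the Python 3 spelling of .iteritems()):

# each species column is a copy of its family's column in habitat_family;
# columns=species keeps the original column order
habitat_species = pd.DataFrame({spec: habitat_family[fam] for spec, fam in lookupMap.items()},
                               columns=species)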

Related

How do I filter based on Indices in Python?

I am having an issue with manipulating indices once I have used the groupby command. My problem is similar to this code:
import pandas as pd
import numpy as np
np.random.seed(0)
df=pd.DataFrame(np.random.randint(0,10,size=(1000000,5)),columns=list('ABCDE'))
M=df.groupby(['A','B','D','E'])['C'].sum().unstack()
M
E 0 1 2 3 4 5 6 7 8 9
A B D
0 0 0 464 414 553 420 499 394 528 423 415 443
1 407 479 392 441 433 472 520 421 484 384
2 545 546 523 356 386 434 531 534 486 417
3 408 511 422 424 477 351 452 395 341 492
4 502 462 403 434 428 444 506 414 418 328
... ... ... ... ... ... ... ... ... ... ...
9 9 5 419 416 485 386 581 330 408 489 394 454
6 416 475 469 490 357 523 418 514 555 499
7 528 419 462 486 565 388 438 445 469 521
8 390 454 566 341 459 463 478 463 426 499
9 414 436 441 462 403 415 362 472 433 430
[1000 rows x 10 columns]
I am wondering how to filter down to only situations where B is greater than A, when they are both in the index here. If they weren't in the index then I would be doing something like M=M[M['A']<M['B']].
You can temporarily convert the index to_frame:
out = M.loc[M.index.to_frame().query('B>A').index]
Or use Index.get_level_values:
A = M.index.get_level_values('A')
B = M.index.get_level_values('B')
out = M.loc[B>A]
Output:
E 0 1 2 3 4 5 6 7 8 9
A B D
0 1 0 489 452 421 455 442 377 440 476 477 451
1 468 448 473 443 557 492 471 460 476 469
2 576 472 465 355 503 448 491 437 546 425
3 404 438 474 516 410 446 411 459 467 450
4 500 418 441 445 420 605 467 580 479 377
... ... ... ... ... ... ... ... ... ... ...
8 9 5 390 466 436 493 446 508 375 390 485 393
6 457 478 476 417 458 460 361 397 432 403
7 516 587 379 406 396 449 430 433 357 432
8 390 460 489 427 346 490 498 454 395 345
9 474 510 466 336 484 577 443 428 459 406
[450 rows x 10 columns]
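A further option, not from the answer above but worth a sketch: DataFrame.query can refer to MultiIndex level names directly as long as no column shares the name, so on a reasonably recent pandas this also works:

# 'A' and 'B' resolve to the index levels here, since the columns are the E values
out = M.query('B > A')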

Python list alignment

I have an assignment I am trying to complete.
I have 100 random ints. From those 100 I have to create a 10x10 table. DONE.
Within that table I have to align my values to the right side of each column. That is the part I'm missing.
Below is the code for that:
print(num, end=(" " if counter < 10 else "\n"))
Late answer, but you can also use:
import random

rl = random.sample(range(100, 999), 100)
max_n = 10
for n, x in enumerate(rl, 1):
    print(x, end=("\n" if n % max_n == 0 else " "))
440 688 758 837 279 736 510 706 392 631
588 511 610 792 535 526 335 842 247 124
552 329 245 689 832 407 919 302 592 385
542 890 406 898 189 116 495 764 664 471
851 728 292 314 839 503 691 355 350 213
661 489 800 649 521 958 123 205 983 219
321 633 120 388 632 187 158 576 294 835
673 470 699 908 456 270 220 878 376 884
816 525 147 104 602 637 249 763 494 127
981 524 262 915 267 873 886 397 922 932
You can just format the number before printing it.
print(f"{num:>5}", end=(" " if counter < 10 else "\n"))
Alternatively, if you want to cast the numbers to strings, you can use the rjust method of str.
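For illustration, a small self-contained sketch of the rjust variant (the 5-character width and the sample numbers are just assumptions):

import random

nums = [random.randint(0, 999) for _ in range(100)]
for counter, num in enumerate(nums, 1):
    # right-justify each number in a 5-character field, 10 numbers per row
    print(str(num).rjust(5), end=(" " if counter % 10 else "\n"))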
There is a simple way to do it; I hope the comments make it clear.
import random

# Generate 100 random numbers in range 1 to 1000.
random_numbers = list(map(lambda x: random.randrange(1, 1000), range(100)))

# Create an empty string to store the pretty output.
pretty_txt = ''

# Loop through random_numbers and use enumerate to get the iteration count.
for index, i in enumerate(random_numbers):
    # Add the current number into the string with a space.
    pretty_txt += str(i) + ' '
    # Add a newline after every tenth number.
    if (index + 1) % 10 == 0:
        pretty_txt += '\n'

print(pretty_txt)
The output is:
796 477 578 458 284 287 43 535 514 504
91 411 288 980 85 233 394 313 263 135
770 793 394 362 433 370 725 472 981 932
398 275 626 631 817 82 551 775 211 755
202 81 750 695 402 809 477 925 347 31
313 514 363 115 144 341 684 662 522 236
219 142 114 621 940 241 110 851 997 699
685 434 813 983 710 124 443 569 613 456
232 80 927 445 179 49 871 821 428 750
792 527 799 878 731 221 780 16 779 333

Clustering inconsistencies by time difference in a timeseries df

I have a pandas dataframe that looks like this:
df
Out[94]:
nr aenvm aenhm ... naenhs naesvs naeshs
date ...
2019-11-16 08:44:24 1 388 776 ... 402 305 566
2019-11-16 08:44:25 2 383 767 ... 407 304 561
2019-11-16 08:44:26 3 378 762 ... 410 301 570
2019-11-16 08:44:27 4 376 766 ... 403 304 567
2019-11-16 08:44:28 5 374 773 ... 398 297 569
The data is inconsistent across events.
Sometimes there are around 6 minutes of data (let's call it an "event") and then no data for maybe some hours or some days. See, for example, the structural break in the timestamps:
df.iloc[1056:1065]
Out[95]:
nr aenvm aenhm aesvm ... naenvs naenhs naesvs naeshs
date ...
2019-11-17 05:18:49 1057 276 707 477 ... 244 136 247 525
2019-11-17 05:18:50 1058 268 703 470 ... 238 138 228 504
2019-11-17 05:56:45 1059 304 717 508 ... 295 157 282 519
2019-11-17 05:56:46 1060 304 715 507 ... 302 159 289 508
All I want to do is to "index" or "categorize" those events, so that the rows between two structural breaks are grouped under one number [1, 2, 3, ...] in a new column.
My goal is to create a new column like "nr" that separates the "events":
Out[95]:
nr aenvm aenhm aesvm ... naenvs naenhs naesvs naeshs
date ...
2019-11-17 05:18:49 1 276 707 477 ... 244 136 247 525
2019-11-17 05:18:50 1 268 703 470 ... 238 138 228 504
2019-11-17 05:56:45 2 304 717 508 ... 295 157 282 519
2019-11-17 05:56:46 2 304 715 507 ... 302 159 289 508
To be honest, I am a complete Python newbie. I tried some classification with the timestamp datetime64 and .asfreq, but with little to no success...
I would be very thankful for any good advice! :)
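A minimal sketch of one common way to label such events, assuming the DatetimeIndex shown above and an arbitrary gap threshold (60 seconds here) that separates one event from the next:

import pandas as pd

# a new event starts wherever the gap to the previous row exceeds the threshold
new_event = df.index.to_series().diff() > pd.Timedelta(seconds=60)

# the cumulative sum of the "new event" flags numbers the blocks 1, 2, 3, ...
df['event'] = new_event.cumsum() + 1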

append pandas data frames in columns

I am searching for a way to append data frames as new columns.
df = pd.DataFrame([])
perf = [650, 875, 400, 200, 630, 950, 850, 800]
for i in range(0, 8):
    perf = [650+i, 875+i, 400+i, 200+i, 630+i, 950+i, 850+i, 800+i]  # in my real code perf doesn't come from i; this just simulates having 8 different lists
    df = df.append(pd.DataFrame({'Pp': [p for p in perf]}))
print(df)
Pp
0 650
1 875
2 400
3 200
4 630
.. ...
3 207
4 637
5 957
6 857
7 807
That gives 64 rows x 1 column, but I am searching for a way to get 8 rows x 8 columns:
Pp Pp Pp
0 650 651 ...
1 875 876 ...
2 400 401 ...
3 200 201 ...
4 630 631 ...
.. ... ... ...
Try this:
import pandas as pd
import random

perf = [650, 875, 400, 200, 630, 950, 850, 800]  # from the question

df = pd.DataFrame([])
for i in range(0, 8):
    df['Pp' + str(i)] = [random.randint(100, 1000) for val in perf]
print(df)
Output:
Pp0 Pp1 Pp2 Pp3 Pp4 Pp5 Pp6 Pp7
0 963 394 165 750 918 687 637 164
1 642 217 154 455 173 807 995 649
2 508 399 833 853 686 834 529 992
3 688 178 328 101 469 559 455 844
4 145 113 416 927 503 882 725 326
5 171 548 394 952 459 725 460 625
6 189 129 136 541 280 131 956 356
7 906 562 779 773 412 423 429 769
There's actually no need to use a for loop/appending for this - simply pass the list when creating the DataFrame:
import pandas as pd
perf = [650, 875, 400, 200, 630, 950, 850, 800]
df = pd.DataFrame(perf)
Then to create the other 7 columns, simply create a new column using the list:
df["1"] = perf
df["2"] = perf
and so on. Hope this helps!
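If the eight lists really do follow the question's perf + i pattern, a dict comprehension builds all columns in a single constructor call; a sketch (the column names Pp0 ... Pp7 are just an assumption):

import pandas as pd

perf = [650, 875, 400, 200, 630, 950, 850, 800]

# one column per offset i, all built in one DataFrame call
df = pd.DataFrame({'Pp' + str(i): [p + i for p in perf] for i in range(8)})
print(df.shape)  # (8, 8)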

Efficient for loop in python

I want to write a for loop in Python which iterates, for example, 111, 112, 113, 114, 121, 122, 123, 124, 131, ... up to 444. Is there an efficient way to do so?
I tried converting between the decimal and base-4 systems, but is there a better way?
>>> from itertools import chain
>>> for k in chain.from_iterable(range(i+1, i+5) for i in range(110, 450, 10) if 1 <= (i // 10) % 10 <= 4):
...     print(k)
...
111
112
113
114
121
122
123
124
131
132
133
134
141
142
.
.
.
423
424
431
432
433
434
441
442
443
444
So, like this:
[i for i in range(111, 445) if all('0' < ch < '5' for ch in str(i))]
or you can use:
[i for i in range(111, 445) if 0 < i % 10 < 5 and 0 < (i // 10) % 10 < 5]
You can convert a range of integers to base 4 using base_repr from numpy:
import numpy
for i in range(64):
    print(int(numpy.base_repr(i, base=4)) + 111)
Output:
111
112
113
114
121
122
123
124
131
132
133
134
141
142
143
144
211
212
213
214
221
222
223
224
231
232
233
234
241
242
243
244
311
312
313
314
321
322
323
324
331
332
333
334
341
342
343
344
411
412
413
414
421
422
423
424
431
432
433
434
441
442
443
444
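Not from the answers above, just a sketch worth noting: itertools.product yields the same sequence directly from the allowed digits:

from itertools import product

# Cartesian product of the digits 1-4, three places wide, in ascending order
for digits in product('1234', repeat=3):
    print(int(''.join(digits)))   # 111, 112, ..., 444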
