append pandas data frames in columns - python

I am searching for a way to append data frames as new columns.
df = pd.DataFrame([])
perf = [650, 875, 400, 200, 630, 950, 850, 800]
for i in range(0, 8):
    # perf does not really depend on i; this just shows that I have 8 different lists
    perf = [650+i, 875+i, 400+i, 200+i, 630+i, 950+i, 850+i, 800+i]
    # note: DataFrame.append was removed in pandas 2.0; pd.concat([df, other]) stacks rows the same way
    df = df.append(pd.DataFrame({'Pp': perf}))
print(df)
Pp
0 650
1 875
2 400
3 200
4 630
.. ...
3 207
4 637
5 957
6 857
7 807
This gives 64 rows x 1 column, but I am searching for a way to get 8 rows x 8 columns:
Pp Pp Pp
0 650 651 ...
1 875 876 ...
2 400 401 ...
3 200 201 ...
4 630 631 ...
.. ... ... ...

Try this
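import pandas as pd
import random
perf = [650, 875, 400, 200, 630, 950, 850, 800]  # from the question; only its length is used here
df = pd.DataFrame([])
for i in range(0, 8):
    # assigning to a new column name adds a column instead of appending rows
    df['Pp' + str(i)] = [random.randint(100, 1000) for val in perf]
print(df)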
Output:
Pp0 Pp1 Pp2 Pp3 Pp4 Pp5 Pp6 Pp7
0 963 394 165 750 918 687 637 164
1 642 217 154 455 173 807 995 649
2 508 399 833 853 686 834 529 992
3 688 178 328 101 469 559 455 844
4 145 113 416 927 503 882 725 326
5 171 548 394 952 459 725 460 625
6 189 129 136 541 280 131 956 356
7 906 562 779 773 412 423 429 769

There's actually no need to use a for loop/appending for this - simply pass the list when creating the DataFrame:
import pandas as pd
perf = [650, 875, 400, 200, 630, 950, 850, 800]
df = pd.DataFrame(perf)
Then to create the other 7 columns, simply create a new column using the list:
df["1"] = perf
df["2"] = perf
and so on. Hope this helps!
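If the eight lists already exist, the frame can also be built in one call rather than column by column. The following is a minimal sketch, assuming the lists are generated the way the question generates them; the names `base`, `lists`, `by_name` and `same_name`, and the `Pp0` to `Pp7` labels, are made up for this example.

```python
import pandas as pd

base = [650, 875, 400, 200, 630, 950, 850, 800]
# Eight different lists, generated as in the question (base values shifted by i)
lists = [[v + i for v in base] for i in range(8)]

# One column per list, with distinct labels Pp0 ... Pp7
by_name = pd.DataFrame({f"Pp{i}": lst for i, lst in enumerate(lists)})

# Alternative: keep the same label "Pp" on every column by concatenating side by side
same_name = pd.concat([pd.DataFrame({"Pp": lst}) for lst in lists], axis=1)

print(by_name.shape)
print(same_name.shape)
```

Distinct labels are usually easier to work with afterwards, since selecting `same_name["Pp"]` returns all eight columns at once.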

Related

How do I filter based on indices in Python?

I am having an issue with manipulating indices once I have used the groupby command. My problem is similar to this code:
import pandas as pd
import numpy as np
np.random.seed(0)
df=pd.DataFrame(np.random.randint(0,10,size=(1000000,5)),columns=list('ABCDE'))
M=df.groupby(['A','B','D','E'])['C'].sum().unstack()
M
E 0 1 2 3 4 5 6 7 8 9
A B D
0 0 0 464 414 553 420 499 394 528 423 415 443
1 407 479 392 441 433 472 520 421 484 384
2 545 546 523 356 386 434 531 534 486 417
3 408 511 422 424 477 351 452 395 341 492
4 502 462 403 434 428 444 506 414 418 328
... ... ... ... ... ... ... ... ... ... ...
9 9 5 419 416 485 386 581 330 408 489 394 454
6 416 475 469 490 357 523 418 514 555 499
7 528 419 462 486 565 388 438 445 469 521
8 390 454 566 341 459 463 478 463 426 499
9 414 436 441 462 403 415 362 472 433 430
[1000 rows x 10 columns]
I am wondering how to filter down to only situations where B is greater than A, when they are both in the index here. If they weren't in the index then I would be doing something like M=M[M['A']<M['B']].
You can temporarily convert the index to_frame:
out = M.loc[M.index.to_frame().query('B>A').index]
Or use Index.get_level_values:
A = M.index.get_level_values('A')
B = M.index.get_level_values('B')
out = M.loc[B>A]
Output:
E 0 1 2 3 4 5 6 7 8 9
A B D
0 1 0 489 452 421 455 442 377 440 476 477 451
1 468 448 473 443 557 492 471 460 476 469
2 576 472 465 355 503 448 491 437 546 425
3 404 438 474 516 410 446 411 459 467 450
4 500 418 441 445 420 605 467 580 479 377
... ... ... ... ... ... ... ... ... ... ...
8 9 5 390 466 436 493 446 508 375 390 485 393
6 457 478 476 417 458 460 361 397 432 403
7 516 587 379 406 396 449 430 433 357 432
8 390 460 489 427 346 490 498 454 395 345
9 474 510 466 336 484 577 443 428 459 406
[450 rows x 10 columns]
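The level-values approach can be checked on a frame small enough to verify by hand. This is only a sketch: the index below is a hypothetical 3 x 3 x 2 product rather than the million-row groupby result, and the cell values are arbitrary.

```python
import numpy as np
import pandas as pd

# Hypothetical small stand-in for M: index levels A, B, D and columns named E
idx = pd.MultiIndex.from_product([range(3), range(3), range(2)], names=["A", "B", "D"])
M = pd.DataFrame(
    np.arange(len(idx) * 2).reshape(len(idx), 2),
    index=idx,
    columns=pd.Index([0, 1], name="E"),
)

# Compare two index levels element-wise and use the result as a row mask
A = M.index.get_level_values("A")
B = M.index.get_level_values("B")
out = M.loc[B > A]

print(out)
```

Only the (A, B) pairs (0, 1), (0, 2) and (1, 2) survive, each with both values of D.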

pandas Excel read gives incorrect output: not getting all the tabular data from Excel, plus a pandas "FutureWarning" from the "usecols" parameter

I wrote the following function (which could be made more efficient) to traverse my project directory: ' ../data/test_input' using os.listdir() and read my data files (10 in total) with the shapes of the data matrixes ranging from 4X4, 6X6, 8X8, ..., 22X22.
Below is a snippet of the excel tabular data. The same tabular set goes for the 6X6, 8X8, ..., 22X22
My goal is for the function to return a tuple of df_4, df_6, df_8, df_10, df_12, df_14, df_16, df_18, df_20, df_22, which I can loop over to do some preprocessing before feeding each one individually to my model.
import pandas as pd
import numpy as np
import os
import re
def read_files(file_name, loc_list=None):
    if loc_list is None:
        loc_list = []
    for itm in loc_list:
        if itm == 4:
            df_4 = pd.read_excel(file_name, sheet_name='Sheet1', skiprows=1, nrows=4, usecols=range(1, 5))
            df_4.columns = [k for k in range(1, len(df_4.columns) + 1)]
            df_4.index = df_4.index + 1
            # loc_list.remove(itm)
        elif itm == 6:
            df_6 = pd.read_excel(file_name, sheet_name='Sheet1', skiprows=1, nrows=6, usecols=range(1, 7))
            df_6.columns = [k for k in range(1, len(df_6.columns) + 1)]
            df_6.index = df_6.index + 1
            # loc_list.remove(itm)
        elif itm == 8:
            df_8 = pd.read_excel(file_name, sheet_name='Sheet1', skiprows=1, nrows=8, usecols=range(1, 9))
            df_8.columns = [k for k in range(1, len(df_8.columns) + 1)]
            df_8.index = df_8.index + 1
        elif itm == 10:
            df_10 = pd.read_excel(file_name, sheet_name='Sheet1', skiprows=1, nrows=10, usecols=range(1, 11))
            df_10.columns = [k for k in range(1, len(df_10.columns) + 1)]
            df_10.index = df_10.index + 1
        elif itm == 12:
            df_12 = pd.read_excel(file_name, sheet_name='Sheet1', skiprows=1, nrows=12, usecols=range(1, 13))
            df_12.columns = [k for k in range(1, len(df_12.columns) + 1)]
            df_12.index = df_12.index + 1
        elif itm == 14:
            df_14 = pd.read_excel(file_name, sheet_name='Sheet1', skiprows=1, nrows=14, usecols=range(1, 15))
            df_14.columns = [k for k in range(1, len(df_14.columns) + 1)]
            df_14.index = df_14.index + 1
        elif itm == 16:
            df_16 = pd.read_excel(file_name, sheet_name='Sheet1', skiprows=1, nrows=16, usecols=range(1, 17))
            df_16.columns = [k for k in range(1, len(df_16.columns) + 1)]
            df_16.index = df_16.index + 1
        elif itm == 18:
            df_18 = pd.read_excel(file_name, sheet_name='Sheet1', skiprows=1, nrows=18, usecols=range(1, 19))
            df_18.columns = [k for k in range(1, len(df_18.columns) + 1)]
            df_18.index = df_18.index + 1
        elif itm == 20:
            df_20 = pd.read_excel(file_name, sheet_name='Sheet1', skiprows=1, nrows=20, usecols=range(1, 21))
            df_20.columns = [k for k in range(1, len(df_20.columns) + 1)]
            df_20.index = df_20.index + 1
        elif itm == 22:
            df_22 = pd.read_excel(file_name, sheet_name='Sheet1', skiprows=1, nrows=22, usecols=range(1, 23))
            df_22.columns = [k for k in range(1, len(df_22.columns) + 1)]
            df_22.index = df_22.index + 1
    return df_4, df_6, df_8, df_10, df_12, df_14, df_16, df_18, df_20, df_22
breaking_point = 0
loca_list = []
[loca_list.append(int(z)) for fname in os.listdir('../data/test_input') for z in re.findall('[0-9]+', fname)]
loca_list = sorted(loca_list)
breaking_point = 0
# TODO - perhaps consider mass read of data from excel in the dir/listdir
for fname in os.listdir('../data/test_input'):
    if fname.endswith('.xlsx') and re.findall('[0-9]+', fname) and 'ex' in fname:
        df_tuple = read_files('../data/test_input/' + fname, loc_list=loca_list)  # TODO
breaking_point = 1
# print the shape of df_tuple to inspect
for tuP in df_tuple:
    print(tuP.shape)
breaking_point = 2
for tuP in df_tuple:
    print('------------------ \n')
    print(tuP)
My expected output is a pandas df for each of the returned values listed above. Instead, I am getting the following result, which is not what I want.
(4, 4)
(6, 6)
(8, 8)
(8, 8)
(8, 8)
(8, 8)
(8, 8)
(8, 8)
(8, 8)
(8, 8)
------------------ below is correct as expected:
1 2 3 4
1 9999 1606 1410 330
2 1096 9999 531 567
3 485 2322 9999 1236
4 960 496 700 9999
------------------ also correct as expected:
1 2 3 4 5 6
1 9999 1606 1410 330 42 539
2 1096 9999 531 567 1359 29
3 485 2322 9999 1236 28 290
4 960 496 700 9999 650 904
5 626 780 1367 696 9999 220
6 631 1218 1486 1163 24 9999
------------------ correct as expected:
1 2 3 4 5 6 7 8
1 9999 1606 1410 330 42 539 626 652
2 1096 9999 531 567 1359 29 846 481
3 485 2322 9999 1236 28 290 742 180
4 960 496 700 9999 650 904 416 1149
5 626 780 1367 696 9999 220 329 828
6 631 1218 1486 1163 24 9999 416 1057
7 657 460 819 733 761 1265 9999 463
8 1102 376 566 1324 409 1168 743 9999
------------------ below is from wrong:
1 2 3 4 5 6 7 8
1 9999 1606 1410 330 42 539 626 652
2 1096 9999 531 567 1359 29 846 481
3 485 2322 9999 1236 28 290 742 180
4 960 496 700 9999 650 904 416 1149
5 626 780 1367 696 9999 220 329 828
6 631 1218 1486 1163 24 9999 416 1057
7 657 460 819 733 761 1265 9999 463
8 1102 376 566 1324 409 1168 743 9999
------------------ below is from wrong; expected 10 X 10:
1 2 3 4 5 6 7 8
1 9999 1606 1410 330 42 539 626 652
2 1096 9999 531 567 1359 29 846 481
3 485 2322 9999 1236 28 290 742 180
4 960 496 700 9999 650 904 416 1149
5 626 780 1367 696 9999 220 329 828
6 631 1218 1486 1163 24 9999 416 1057
7 657 460 819 733 761 1265 9999 463
8 1102 376 566 1324 409 1168 743 9999
------------------ below is from wrong; expected 12 X 12:
1 2 3 4 5 6 7 8
1 9999 1606 1410 330 42 539 626 652
2 1096 9999 531 567 1359 29 846 481
3 485 2322 9999 1236 28 290 742 180
4 960 496 700 9999 650 904 416 1149
5 626 780 1367 696 9999 220 329 828
6 631 1218 1486 1163 24 9999 416 1057
7 657 460 819 733 761 1265 9999 463
8 1102 376 566 1324 409 1168 743 9999
------------------ below is from wrong; expected 14 X14:
1 2 3 4 5 6 7 8
1 9999 1606 1410 330 42 539 626 652
2 1096 9999 531 567 1359 29 846 481
3 485 2322 9999 1236 28 290 742 180
4 960 496 700 9999 650 904 416 1149
5 626 780 1367 696 9999 220 329 828
6 631 1218 1486 1163 24 9999 416 1057
7 657 460 819 733 761 1265 9999 463
8 1102 376 566 1324 409 1168 743 9999
------------------ below is from wrong; expected 16 X16:
1 2 3 4 5 6 7 8
1 9999 1606 1410 330 42 539 626 652
2 1096 9999 531 567 1359 29 846 481
3 485 2322 9999 1236 28 290 742 180
4 960 496 700 9999 650 904 416 1149
5 626 780 1367 696 9999 220 329 828
6 631 1218 1486 1163 24 9999 416 1057
7 657 460 819 733 761 1265 9999 463
8 1102 376 566 1324 409 1168 743 9999
------------------ below is from wrong; expected 18 X 18:
1 2 3 4 5 6 7 8
1 9999 1606 1410 330 42 539 626 652
2 1096 9999 531 567 1359 29 846 481
3 485 2322 9999 1236 28 290 742 180
4 960 496 700 9999 650 904 416 1149
5 626 780 1367 696 9999 220 329 828
6 631 1218 1486 1163 24 9999 416 1057
7 657 460 819 733 761 1265 9999 463
8 1102 376 566 1324 409 1168 743 9999
------------------ below is from wrong; expected 20 X 20:
1 2 3 4 5 6 7 8
1 9999 1606 1410 330 42 539 626 652
2 1096 9999 531 567 1359 29 846 481
3 485 2322 9999 1236 28 290 742 180
4 960 496 700 9999 650 904 416 1149
5 626 780 1367 696 9999 220 329 828
6 631 1218 1486 1163 24 9999 416 1057
7 657 460 819 733 761 1265 9999 463
8 1102 376 566 1324 409 1168 743 9999
------------------ below is from wrong; expected 22 X 22:
1 2 3 4 5 6 7 8
1 9999 1606 1410 330 42 539 626 652
2 1096 9999 531 567 1359 29 846 481
3 485 2322 9999 1236 28 290 742 180
4 960 496 700 9999 650 904 416 1149
5 626 780 1367 696 9999 220 329 828
6 631 1218 1486 1163 24 9999 416 1057
7 657 460 819 733 761 1265 9999 463
8 1102 376 566 1324 409 1168 743 9999
Also, I am getting the following pandas "FutureWarning" message:
FutureWarning: Defining usecols with out of bounds indices is deprecated and will raise a ParserError in a future version.
df_12 = pd.read_excel(file_name, sheet_name='Sheet1', skiprows=1, nrows=12, usecols=range(1, 13))
FutureWarning: Defining usecols with out of bounds indices is deprecated and will raise a ParserError in a future version.
df_14 = pd.read_excel(file_name, sheet_name='Sheet1', skiprows=1, nrows=14, usecols=range(1, 15))
...
FutureWarning: Defining usecols with out of bounds indices is deprecated and will raise a ParserError in a future version.
df_22 = pd.read_excel(file_name, sheet_name='Sheet1', skiprows=1, nrows=22, usecols=range(1, 23))
I have also looked up this "FutureWarning" online several times but did not find the correct remedy for my issue.
I shall be glad if someone could help point out my mistake to me as I have already spent a lot of time tracking the error but to no avail.
import pandas as pd
import os
import re
def create_tuple_list(fpath):
    tuple_list = [(fname, int(z)) for fname in os.listdir(fpath) for z in re.findall('[0-9]+', fname) if
                  fname.endswith('.xlsx') and 'ex' in fname and re.findall('[0-9]+', fname)]
    return tuple_list

def main():
    # define file path
    dirpath = '../data/test/'
    # function call
    dtup_list = create_tuple_list(dirpath)
    for tuP in dtup_list:
        fname = tuP[0]
        nbr = tuP[1]
        df_c = pd.read_excel(dirpath + fname, sheet_name='Sheet1', skiprows=1, usecols=range(nbr + 1))
        df_c.index = df_c.index + 1

if __name__ == '__main__':
    main()
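The idea in the code above is to take each file's matrix size from its own name, so that no file is ever asked for more columns than it holds. Judging by the warning text, that is what happens when every size is read from the same 8 x 8 file. The sketch below isolates just that step, without touching Excel. The helpers `matrix_size` and `read_kwargs` and the sample file names are hypothetical, and the `range(1, n + 1)` column choice assumes, as in the question, that column 0 holds row labels.

```python
import re

def matrix_size(fname):
    """Return the first integer embedded in a file name, or None if there is none."""
    match = re.search(r"[0-9]+", fname)
    return int(match.group()) if match else None

def read_kwargs(fname):
    """Build keyword arguments for pd.read_excel sized to this particular file."""
    n = matrix_size(fname)
    return {
        "sheet_name": "Sheet1",
        "skiprows": 1,
        "nrows": n,
        # assumption: skip column 0 (row labels) and take the next n columns
        "usecols": list(range(1, n + 1)),
    }

# Hypothetical directory listing
for name in ["ex_4.xlsx", "ex_10.xlsx", "notes.txt"]:
    if name.endswith(".xlsx") and matrix_size(name) is not None:
        print(name, read_kwargs(name)["usecols"])
```

Each frame would then be read with `pd.read_excel(path, **read_kwargs(name))` and collected in a list or dict keyed by size, instead of ten separate variables.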

Slicing a 2D array using indices 1D array

I have a 2D array of (10,24) and a 1D array of (10,) shape.
I want to slice a 2D array using 1D array such that my resultant array will be (10,24) but the values are sliced from indices in 1D array onwards.
import numpy as np
x1 = np.random.randint(1,20,10)
print(x1)
[ 8, 13, 13, 13, 14, 3, 14, 14, 11, 16]
y1 = np.random.randint(low = 1, high = 999, size = 240).reshape(10,24)
print(y1)
[[152 128 251 282 334 776 650 247 990 803 700 323 250 262 552 220 744 50
684 695 600 293 138 5]
[830 917 148 612 801 746 623 794 435 469 610 598 29 452 188 688 364 56
246 991 554 33 716 712]
[603 16 838 65 312 764 676 392 187 476 878 229 555 558 58 194 565 764
48 579 447 202 81 300]
[315 562 276 993 859 145 82 484 134 59 397 566 573 263 340 465 728 406
767 408 294 115 394 941]
[422 891 475 174 720 672 526 52 938 347 114 613 186 151 925 482 315 373
856 155 5 60 65 746]
[978 621 543 785 663 32 817 497 615 897 713 459 396 154 220 221 171 589
571 587 248 668 413 553]
[227 188 4 874 975 586 93 179 356 740 645 723 558 814 64 922 748 457
249 688 799 239 708 516]
[230 556 563 55 390 666 304 661 218 744 502 720 418 581 839 772 818 278
190 997 553 71 897 909]
[631 928 606 111 927 912 81 38 529 956 759 6 725 325 944 174 62 804
82 358 305 291 454 34]
[193 661 452 54 816 251 750 183 60 563 787 283 599 182 823 546 629 527
667 614 615 3 790 124]]
I want my resultant array to be:
[[990 803 700 323 250 262 552 220 744 50 684 695 600 293 138 5 0 0 0 0 0 0 0 0]
[452 188 688 364 56 246 991 554 33 716 712 0 0 0 0 0 0 0 0 0 0 0 0 0]
[558 58 194 565 764 48 579 447 202 81 300 0 0 0 0 0 0 0 0 0 0 0 0 0]
[263 340 465 728 406 767 408 294 115 394 941 0 0 0 0 0 0 0 0 0 0 0 0 0]
[925 482 315 373 856 155 5 60 65 746 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[785 663 32 817 497 615 897 713 459 396 154 220 221 171 589 571 587 248 668 413 553 0 0 0]
[64 922 748 457 249 688 799 239 708 516 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ]
[839 772 818 278 190 997 553 71 897 909 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[6 725 325 944 174 62 804 82 358 305 291 454 34 0 0 0 0 0 0 0 0 0 0 0 ]
[546 629 527 667 614 615 3 790 124 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]
Here's a vectorized one with masking and also leveraging broadcasting -
def select_gt_indices(a, idx):
    r = np.arange(a.shape[1])
    select_mask = idx[:,None] <= r
    put_mask = (a.shape[1]-idx-1)[:,None] >= r
    # or put_mask = np.sort(select_mask,axis=1)[:,::-1]
    out = np.zeros_like(a)
    out[put_mask] = a[select_mask]
    return out
Sample run -
In [92]: np.random.seed(0)
...: a = np.random.randint(0,999,(4,5))
...: idx = np.array([2,4,3,0])
In [93]: a
Out[93]:
array([[684, 559, 629, 192, 835],
[763, 707, 359, 9, 723],
[277, 754, 804, 599, 70],
[472, 600, 396, 314, 705]])
In [94]: idx
Out[94]: array([2, 4, 3, 0])
In [95]: select_gt_indices(a, idx)
Out[95]:
array([[629, 192, 835, 0, 0],
[723, 0, 0, 0, 0],
[599, 70, 0, 0, 0],
[472, 600, 396, 314, 705]])
I don't think you can do this with plain slicing, since you are padding with 0's. You can create an array of zeros and populate it, for example:
y1_result = np.zeros_like(y1)
for row, x_i in enumerate(x1):
    for j, element in enumerate(y1[row, x_i:]):
        y1_result[row, j] = element
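The masked version and the explicit loop should agree, and that can be checked on the 4 x 5 sample from the run above. This is a sketch with function names chosen for the example (`shift_masked`, `shift_loop`). The masked assignment works because both masks contain the same number of True cells in every row and boolean indexing visits cells in row-major order.

```python
import numpy as np

def shift_masked(a, idx):
    """Move a[row, idx[row]:] to the start of each row, zero-padding the rest."""
    r = np.arange(a.shape[1])
    select_mask = idx[:, None] <= r                     # cells to read
    put_mask = (a.shape[1] - idx - 1)[:, None] >= r     # cells to write
    out = np.zeros_like(a)
    out[put_mask] = a[select_mask]
    return out

def shift_loop(a, idx):
    """Same result with an explicit loop over rows."""
    out = np.zeros_like(a)
    for row, start in enumerate(idx):
        tail = a[row, start:]
        out[row, :len(tail)] = tail
    return out

a = np.array([[684, 559, 629, 192, 835],
              [763, 707, 359,   9, 723],
              [277, 754, 804, 599,  70],
              [472, 600, 396, 314, 705]])
idx = np.array([2, 4, 3, 0])

print(shift_masked(a, idx))
print(np.array_equal(shift_masked(a, idx), shift_loop(a, idx)))
```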

Adding consecutive x values to a list

Suppose I have a list with items [123, 124, 125, ... 9820] and from that list I want to append to a second list with a string of every 8 items separated by a space up until the end. For example the list would have:
["123 124 125 126 127 128 129 130", "131, 132, 133, 134, 135, 136, 137, 138",..] etc.
What is the best way to do this in python? I have tried a naive solution of looping from 123 to 9820 but this takes way too much runtime and times out some of my simple tests I have set up. Are there any functions that would be useful to me?
Collect the elements into chunks of length 8 and use join(). Here's an example using an adapted recipe from itertools:
from itertools import zip_longest
lst = [str(x) for x in range(123, 9821)]
def grouper(iterable, n, fillvalue=""):
    "Collect data into fixed-length chunks or blocks"
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)
# rstrip() drops the trailing spaces left by the "" padding in the last, shorter chunk
lst2 = [" ".join(x).rstrip() for x in grouper(lst, 8)]
We have to step through the items list 8 indices at a time.
Demo
Consider an items list holding the numbers 1 to 999, so its length is 999.
Use a for loop with range() and a step of 8 to jump through the list.
Use string concatenation to build the final result.
Code:
>>> items = range(1, 1000)
>>> len(items)
999
>>> output_str = ""
>>> for i in range(0, 999, 8):
...     output_str += " " + str(items[i])
...
>>> output_str.strip()
'1 9 17 25 33 41 49 57 65 73 81 89 97 105 113 121 129 137 145 153 161 169 177 185 193 201 209 217 225 233 241 249 257 265 273 281 289 297 305 313 321 329 337 345 353 361 369 377 385 393 401 409 417 425 433 441 449 457 465 473 481 489 497 505 513 521 529 537 545 553 561 569 577 585 593 601 609 617 625 633 641 649 657 665 673 681 689 697 705 713 721 729 737 745 753 761 769 777 785 793 801 809 817 825 833 841 849 857 865 873 881 889 897 905 913 921 929 937 945 953 961 969 977 985 993'
>>>
I think this does the work you want:
The code:
items = [str(x) for x in range(123, 9821)]  # renamed from "list" to avoid shadowing the built-in
results = []
for index in range(0, len(items), 8):
    results.append(" ".join(items[index:index+8]))
print(results)
The output:
[
'123 124 125 126 127 128 129 130',
'131 132 133 134 135 136 137 138',
'139 140 141 142 143 144 145 146',
'147 148 149 150 151 152 153 154',
'155 156 157 158 159 160 161 162',
...
'9795 9796 9797 9798 9799 9800 9801 9802',
'9803 9804 9805 9806 9807 9808 9809 9810',
'9811 9812 9813 9814 9815 9816 9817 9818',
'9819 9820'
]
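The slicing approach can be wrapped in a small reusable function. This is a sketch; the name `join_chunks` and its parameters are invented for the example, and a short input is used so the result is easy to check by eye.

```python
def join_chunks(values, size, sep=" "):
    """Join consecutive groups of `size` values into strings; the last group may be shorter."""
    items = [str(v) for v in values]
    return [sep.join(items[i:i + size]) for i in range(0, len(items), size)]

print(join_chunks(range(1, 11), 4))
# prints ['1 2 3 4', '5 6 7 8', '9 10']
```

Slicing past the end of a list simply returns the remaining items, so the final short chunk needs no special handling.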

Pandas: Eliminating for loops

I have two Pandas dataframes, namely: habitat_family and habitat_species. I want to populate habitat_species based on the taxonomical lookupMap and the values in habitat_family:
import pandas as pd
import numpy as np
species = ['tiger', 'lion', 'mosquito', 'ladybug', 'locust', 'seal', 'seabass', 'shark', 'dolphin']
families = ['mammal','fish','insect']
lookupMap = {'tiger':'mammal', 'lion':'mammal', 'mosquito':'insect', 'ladybug':'insect', 'locust':'insect',
             'seal':'mammal', 'seabass':'fish', 'shark':'fish', 'dolphin':'mammal' }
habitat_family = pd.DataFrame({'id': range(1,11),
                               'mammal': [101,123,523,562,546,213,562,234,987,901],
                               'fish' : [625,254,929,827,102,295,174,777,123,763],
                               'insect': [345,928,183,645,113,942,689,539,789,814]
                               }, index=range(1,11), columns=['id','mammal','fish','insect'])
habitat_species = pd.DataFrame(0.0, index=range(1,11), columns=species)
# My highly inefficient solution:
for id in habitat_family.index: # loop through habitat id's
    for spec in species: # loop through species
        corresp_family = lookupMap[spec]
        habitat_species.loc[id,spec] = habitat_family.loc[id,corresp_family]
The nested for loops above do the job. But in reality my dataframes are massive, and using for loops is not feasible.
Is there a more efficient method to achieve this using maybe dataframe.apply() or a similar function?
EDIT: The desired output habitat_species is:
habitat_species
tiger lion mosquito ladybug locust seal seabass shark dolphin
1 101 101 345 345 345 101 625 625 101
2 123 123 928 928 928 123 254 254 123
3 523 523 183 183 183 523 929 929 523
4 562 562 645 645 645 562 827 827 562
5 546 546 113 113 113 546 102 102 546
6 213 213 942 942 942 213 295 295 213
7 562 562 689 689 689 562 174 174 562
8 234 234 539 539 539 234 777 777 234
9 987 987 789 789 789 987 123 123 987
10 901 901 814 814 814 901 763 763 901
You don't need any loops at all. Check it out:
In [12]: habitat_species = habitat_family[pd.Series(species).replace(lookupMap)]
In [13]: habitat_species.columns = species
In [14]: habitat_species
Out[14]:
tiger lion mosquito ladybug locust seal seabass shark dolphin
1 101 101 345 345 345 101 625 625 101
2 123 123 928 928 928 123 254 254 123
3 523 523 183 183 183 523 929 929 523
4 562 562 645 645 645 562 827 827 562
5 546 546 113 113 113 546 102 102 546
6 213 213 942 942 942 213 295 295 213
7 562 562 689 689 689 562 174 174 562
8 234 234 539 539 539 234 777 777 234
9 987 987 789 789 789 987 123 123 987
10 901 901 814 814 814 901 763 763 901
[10 rows x 9 columns]
First of all, fantastically written question. Thanks.
I would suggest making a DataFrame for each family, and concatenating at the end:
You'll need to reverse your lookupMap:
In [80]: d = {'mammal': ['dolphin', 'lion', 'seal', 'tiger'], 'insect': ['ladybug', 'locust', 'mosquito'], 'fish':
['seabass', 'shark']}
So as an example:
In [83]: k, v = 'mammal', d['mammal']
In [86]: pd.DataFrame([habitat_family[k] for _ in v], index=v).T
Out[86]:
dolphin lion seal tiger
1 101 101 101 101
2 123 123 123 123
3 523 523 523 523
4 562 562 562 562
5 546 546 546 546
6 213 213 213 213
7 562 562 562 562
8 234 234 234 234
9 987 987 987 987
10 901 901 901 901
[10 rows x 4 columns]
Now do that for each family:
In [87]: results = []
In [88]: for k, v in d.items():  # .items() replaces Python 2's .iteritems()
....:     results.append(pd.DataFrame([habitat_family[k] for _ in v], index=v).T)
And concat:
In [89]: habitat_species = pd.concat(results, axis=1)
In [90]: habitat_species
Out[90]:
dolphin lion seal tiger ladybug locust mosquito seabass shark
1 101 101 101 101 345 345 345 625 625
2 123 123 123 123 928 928 928 254 254
3 523 523 523 523 183 183 183 929 929
4 562 562 562 562 645 645 645 827 827
5 546 546 546 546 113 113 113 102 102
6 213 213 213 213 942 942 942 295 295
7 562 562 562 562 689 689 689 174 174
8 234 234 234 234 539 539 539 777 777
9 987 987 987 987 789 789 789 123 123
10 901 901 901 901 814 814 814 763 763
[10 rows x 9 columns]
You might consider passing the families as the keys parameter to concat if you want a hierarchical index for the columns with (family, species) pairs.
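As a rough illustration of the hierarchical-column idea, here is a sketch on a two-row, two-family subset of the data. The trimmed `habitat_family` and `d`, and the names `parts` and `out`, are assumptions made for brevity.

```python
import pandas as pd

# Trimmed, illustrative subset of the question's data
habitat_family = pd.DataFrame({"mammal": [101, 123], "fish": [625, 254]}, index=[1, 2])
d = {"mammal": ["lion", "tiger"], "fish": ["shark"]}

# One small frame per family: the family column repeated once per species
parts = [
    pd.concat([habitat_family[family]] * len(names), axis=1, keys=names)
    for family, names in d.items()
]

# keys= adds the family as the outer column level
out = pd.concat(parts, axis=1, keys=list(d))
print(out)
```

Columns can then be selected by family, for example `out["mammal"]`, or by a full `(family, species)` pair.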
Some profiling, since you said performance matters:
# Mine
In [97]: %%timeit
....: for k, v in d.items():
....:     results.append(pd.DataFrame([habitat_family[k] for _ in v], index=v).T)
....: habitat_species = pd.concat(results, axis=1)
....:
1 loops, best of 3: 296 ms per loop
# Yours
In [98]: %%timeit
....: for id in habitat_family.index: # loop through habitat id's
....:     for spec in species: # loop through species
....:         corresp_family = lookupMap[spec]
....:         habitat_species.loc[id,spec] = habitat_family.loc[id,corresp_family]
10 loops, best of 3: 21.5 ms per loop
# Dan's
In [102]: %%timeit
.....: habitat_species = habitat_family[pd.Series(species).replace(lookupMap)]
.....: habitat_species.columns = species
.....:
100 loops, best of 3: 2.55 ms per loop
Looks like Dan wins by a long shot!
This might be the most pandonic:
In [1]: habitat_species.apply(lambda x: habitat_family[lookupMap[x.name]])
Out[1]:
tiger lion mosquito ladybug locust seal seabass shark dolphin
1 101 101 345 345 345 101 625 625 101
2 123 123 928 928 928 123 254 254 123
3 523 523 183 183 183 523 929 929 523
4 562 562 645 645 645 562 827 827 562
5 546 546 113 113 113 546 102 102 546
6 213 213 942 942 942 213 295 295 213
7 562 562 689 689 689 562 174 174 562
8 234 234 539 539 539 234 777 777 234
9 987 987 789 789 789 987 123 123 987
10 901 901 814 814 814 901 763 763 901
%timeit habitat_species.apply(lambda x: habitat_family[lookupMap[x.name]])
1000 loops, best of 3: 1.57 ms per loop
As far as I can tell, the data in the columns doesn't change; the columns are merely repeated for each corresponding animal.
I.e., if you just had a tiger and a lion, you would want a resulting dataframe with the mammal column repeated twice and the header changed?
In that case, you can do:
habitat_species = pd.DataFrame(index=range(1,11))  # empty frame; a scalar fill value would need columns too
for key, value in lookupMap.items():  # .items() replaces Python 2's .iteritems()
    habitat_species[key] = habitat_family[value]
This will create a new column in the habitat_species dataframe with the name given by key, and assign all the values in the corresponding column in the habitat_family dataframe, whose name is given by value
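The same lookup can also be written as a single column selection, since selecting with a list that repeats a column name returns that column once per occurrence. The sketch below keeps only the first three habitats from the question to stay short; `family_of_each_species` is a name introduced here.

```python
import pandas as pd

species = ['tiger', 'lion', 'mosquito', 'ladybug', 'locust', 'seal', 'seabass', 'shark', 'dolphin']
lookupMap = {'tiger': 'mammal', 'lion': 'mammal', 'mosquito': 'insect', 'ladybug': 'insect',
             'locust': 'insect', 'seal': 'mammal', 'seabass': 'fish', 'shark': 'fish',
             'dolphin': 'mammal'}
# First three habitats from the question, to keep the example short
habitat_family = pd.DataFrame({'mammal': [101, 123, 523],
                               'fish': [625, 254, 929],
                               'insect': [345, 928, 183]}, index=range(1, 4))

# One family name per species, in species order
family_of_each_species = [lookupMap[s] for s in species]

# Select the (repeated) family columns, then relabel them with the species names
habitat_species = habitat_family[family_of_each_species].copy()
habitat_species.columns = species
print(habitat_species)
```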
