How to split column data and create new DataFrame with multiple columns - python

I'd like to split the data in the following DataFrame
import numpy as np
import pandas as pd

df = pd.DataFrame(data={'per': np.repeat([10, 20, 30], 32),
                        'r': 12 * list(range(8)),
                        'cnt': np.random.randint(300, 400, 96)}); df
cnt per r
0 355 10 0
1 359 10 1
2 347 10 2
3 390 10 3
4 304 10 4
5 306 10 5
.. ... ... ..
87 357 30 7
88 371 30 0
89 396 30 1
90 357 30 2
91 353 30 3
92 306 30 4
93 301 30 5
94 329 30 6
95 312 30 7
[96 rows x 3 columns]
such that for each r value there is a new column cnt_r{r} in the DataFrame, while also keeping the corresponding per column.
The following piece of code almost does what I want, except that it loses the per column:
pd.DataFrame({'cnt_r{}'.format(i): df[df.r==i].reset_index()['cnt'] for i in range(8)})
cnt_r0 cnt_r1 cnt_r2 cnt_r3 cnt_r4 cnt_r5 cnt_r6 cnt_r7
0 355 359 347 390 304 306 366 310
1 394 331 384 312 380 350 318 396
2 340 336 360 389 352 370 353 319
...
9 341 300 386 334 386 314 358 326
10 357 386 311 382 356 339 375 357
11 371 396 357 353 306 301 329 312
I need a way to build the following DataFrame:
per cnt_r0 cnt_r1 cnt_r2 cnt_r3 cnt_r4 cnt_r5 cnt_r6 cnt_r7
0 10 355 359 347 390 304 306 366 310
1 10 394 331 384 312 380 350 318 396
2 10 340 336 360 389 352 370 353 319
...
7 20 384 385 376 323 345 339 339 347
9 30 341 300 386 334 386 314 358 326
10 30 357 386 311 382 356 339 375 357
11 30 371 396 357 353 306 301 329 312
Note that by construction my dataset has the same number of rows for each r within every per group. Obviously my real dataset is much larger than this example (about 800 million records).
Many thanks for your time.

If possible, use reshape to a 2d array and then insert the new column per:
np.random.seed(1256)
df = pd.DataFrame(data={'per': np.repeat([10, 20, 30], 32),
                        'r': 12 * list(range(8)),
                        'cnt': np.random.randint(300, 400, 96)})
df1 = pd.DataFrame(df['cnt'].values.reshape(-1, 8)).add_prefix('cnt_r')
df1.insert(0, 'per', np.repeat([10,20,30], 4))
print (df1)
per cnt_r0 cnt_r1 cnt_r2 cnt_r3 cnt_r4 cnt_r5 cnt_r6 cnt_r7
0 10 365 358 305 311 393 343 340 313
1 10 393 319 358 351 322 387 316 359
2 10 360 301 337 333 322 337 393 396
3 10 320 344 325 310 338 381 314 339
4 20 323 305 342 340 343 319 332 371
5 20 398 308 350 320 340 319 305 369
6 20 344 340 345 332 373 334 304 331
7 20 323 349 301 334 344 374 300 336
8 30 357 375 396 354 309 391 304 334
9 30 311 395 372 359 370 342 351 330
10 30 378 302 306 341 308 392 387 332
11 30 350 373 316 376 338 351 398 304
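The reshape approach assumes the rows come in complete, ordered blocks of r = 0..7, with one per value per block. A minimal sanity check in that spirit (my own sketch, not part of the original answer, assuming the df built above):
# Sketch only: cheap checks before relying on reshape(-1, 8).
n_r = df['r'].nunique()                    # expected: 8
assert len(df) % n_r == 0, 'incomplete block of r values'
assert (df['r'].values.reshape(-1, n_r) == np.arange(n_r)).all(), 'rows not ordered 0..7 within blocks'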
Or use cumcount to create new groups, then reshape with set_index and unstack:
df = (df.set_index([df.groupby('r').cumcount(), 'per', 'r'])['cnt']
        .unstack()
        .add_prefix('cnt_r')
        .reset_index(level=1)
        .rename_axis(None, axis=1))
print (df)
per cnt_r0 cnt_r1 cnt_r2 cnt_r3 cnt_r4 cnt_r5 cnt_r6 cnt_r7
0 10 365 358 305 311 393 343 340 313
1 10 393 319 358 351 322 387 316 359
2 10 360 301 337 333 322 337 393 396
3 10 320 344 325 310 338 381 314 339
4 20 323 305 342 340 343 319 332 371
5 20 398 308 350 320 340 319 305 369
6 20 344 340 345 332 373 334 304 331
7 20 323 349 301 334 344 374 300 336
8 30 357 375 396 354 309 391 304 334
9 30 311 395 372 359 370 342 351 330
10 30 378 302 306 341 308 392 387 332
11 30 350 373 316 376 338 351 398 304
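To see what the cumcount key contributes here, a small illustration (my own sketch; run it on the long-format df from above, before the set_index/unstack step): it numbers the rows within each r group, and that counter becomes the new row index after unstack.
# Sketch only: the grouping key used above.
key = df.groupby('r').cumcount()
# With 12 complete blocks of r = 0..7, key is 0 for rows 0-7,
# 1 for rows 8-15, ..., 11 for rows 88-95.
print(key.head(16).tolist())   # [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]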

Related

How do I filter based on indices in Python?

I am having an issue with manipulating indices once I have used the groupby command. My problem can be reproduced with this code:
import pandas as pd
import numpy as np
np.random.seed(0)
df=pd.DataFrame(np.random.randint(0,10,size=(1000000,5)),columns=list('ABCDE'))
M=df.groupby(['A','B','D','E'])['C'].sum().unstack()
M
E 0 1 2 3 4 5 6 7 8 9
A B D
0 0 0 464 414 553 420 499 394 528 423 415 443
1 407 479 392 441 433 472 520 421 484 384
2 545 546 523 356 386 434 531 534 486 417
3 408 511 422 424 477 351 452 395 341 492
4 502 462 403 434 428 444 506 414 418 328
... ... ... ... ... ... ... ... ... ... ...
9 9 5 419 416 485 386 581 330 408 489 394 454
6 416 475 469 490 357 523 418 514 555 499
7 528 419 462 486 565 388 438 445 469 521
8 390 454 566 341 459 463 478 463 426 499
9 414 436 441 462 403 415 362 472 433 430
[1000 rows x 10 columns]
I am wondering how to filter down to only situations where B is greater than A, when they are both in the index here. If they weren't in the index then I would be doing something like M=M[M['A']<M['B']].
You can temporarily convert the index to a DataFrame with to_frame:
out = M.loc[M.index.to_frame().query('B>A').index]
Or use Index.get_level_values:
A = M.index.get_level_values('A')
B = M.index.get_level_values('B')
out = M.loc[B>A]
Output:
E 0 1 2 3 4 5 6 7 8 9
A B D
0 1 0 489 452 421 455 442 377 440 476 477 451
1 468 448 473 443 557 492 471 460 476 469
2 576 472 465 355 503 448 491 437 546 425
3 404 438 474 516 410 446 411 459 467 450
4 500 418 441 445 420 605 467 580 479 377
... ... ... ... ... ... ... ... ... ... ...
8 9 5 390 466 436 493 446 508 375 390 485 393
6 457 478 476 417 458 460 361 397 432 403
7 516 587 379 406 396 449 430 433 357 432
8 390 460 489 427 346 490 498 454 395 345
9 474 510 466 336 484 577 443 428 459 406
[450 rows x 10 columns]
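A side note (not from the original answers): DataFrame.query also resolves index level names, so the same filter can be written in one line; a minimal sketch, assuming the levels are still named A and B:
# query() resolves unqualified names against column names and index level
# names, so 'A' and 'B' refer to the index levels of M here.
out = M.query('B > A')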

Finding substrings in a DNA sequence; script returns higher values than expected

I'm struggling with a really frustrating problem; I've spent the past 2.5 hours trying to find the bug, but I can't manage. The task is this: I have to count the number of occurrences of each combination of 4 DNA nucleotides (AAAA-TTTT) in a string. The problem is that my script returns wrong values.
The idea is: create a dict of all possible 4-mers (AAAA-TTTT), then iterate over every position in the DNA sequence and check which 4-mer in the dict matches the slice starting at that position. Add 1 to that k-mer's count in the dict, and finally return the dict.
Seems simple enough, but I really can't figure out why it doesn't work.
Input file: https://drive.google.com/file/d/1hDRaQ76hhhCQO4mFocC6yIKSY5mhgDjt/view?usp=sharing
Output:
340 319 348 337 331 329 343 348 336 345 347 370 307 356 313 368 324 315 365 338 322 327 332 341 336 352 350 354 381 339 330 377 346 318 337 346 383 326 311 335 343 326 354 349 326 367 355 344 313 314 320 356 370 347 327 369 340 337 335 340 368 308 363 346 331 324 341 324 344 330 326 382 323 360 355 355 326 360 341 357 329 342 313 360 335 354 320 359 331 350 311 355 350 345 335 338 308 359 321 316 332 348 331 354 312 351 340 339 356 353 365 343 384 331 363 379 341 329 346 378 356 329 316 342 354 371 357 320 345 331 346 347 350 337 359 343 334 324 338 319 319 327 344 336 322 376 339 332 340 346 360 333 317 308 337 365 355 351 328 330 338 344 313 345 331 333 339 340 345 338 293 333 326 357 319 325 331 374 335 339 378 333 344 351 340 354 307 343 330 340 341 365 329 368 366 339 318 326 359 342 364 320 338 346 351 366 356 326 357 361 375 351 343 336 328 336 317 361 340 350 375 356 357 354 377 367 348 319 317 363 343 342 333 321 317 302 367 340 315 368 378 326 346 321 348 336 348 344 379 342 319 372 324 353 362 358
Expected output: I don't know, I only know that the output I'm currently getting is wrong.
from sys import argv

# functions
def kmer():
    """Returns a dict of all possible 4-mers, and sets the occurrences to 0"""
    kmers = {}
    bases = ['A', 'C', 'G', 'T']
    for i in range(4**4):
        kmer = bases[i // 64] + bases[(i % 64) // 16] + bases[(i % 16) // 4] + bases[(i % 4)]
        kmers[kmer] = 0
    return kmers

def count_kmer(seq, kmer_dict):
    """Returns occurrences of each k-mer in the DNA sequence"""
    for nt in range(len(seq)):
        if seq[nt : nt + 4] in kmer_dict:
            kmer_dict[seq[nt : nt + 4]] += 1
    return kmer_dict

def input_parser(filename):
    """Returns contents of input filename as a string"""
    lines = open(filename, 'r').readlines()[1:]
    seq = ""
    for line in lines:
        line.replace('\n', '')
        seq += line
    return seq

def write_output(result):
    output = open('kmers.txt', 'w')
    for i in result:
        value = str(result[i]) + ' '
        output.write(value)

# main
if __name__ == '__main__':
    # Parse input
    seq = input_parser('rosalind_kmer.txt')
    # Create k-mer dict
    kmer_dict = kmer()
    # Count k-mer occurrences
    result = count_kmer(seq, kmer_dict)
    # Print results
    write_output(result)
After 3 hours I found the problem. It was quite simple:
Because I read the sequence from a file, newline characters were included. (str.replace returns a new string, so line.replace('\n', '') without assigning the result did nothing.) These newline characters broke the sequence up, while the sequence should have been one continuous string. After rewriting the parser with a small for loop (shown below), the script performed as expected.
def input_parser(filename):
    """Returns contents of input filename as a string"""
    text = open(filename, 'r').read()
    seq = ''
    for ch in text:
        if ch in ['A', 'C', 'G', 'T']:
            seq += ch
    return seq
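As an aside (my own sketch, not part of the original post), the 4-mer table and the cleanup step can also be written with the standard library, assuming the same file layout (one header line followed by sequence lines):
from itertools import product

def kmer():
    """All 256 possible 4-mers, each mapped to a count of 0."""
    return {''.join(p): 0 for p in product('ACGT', repeat=4)}

def input_parser(filename):
    """Concatenate the sequence lines, dropping the header line and newlines."""
    with open(filename) as fh:
        return ''.join(line.strip() for line in fh.readlines()[1:])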

Python list alignment

I have an assignment I am trying to complete.
I have 100 random ints. From those 100 I have to create a 10x10 table. DONE.
Within that table I have to align my values to the right side of each column. That is the part I'm missing.
Below is the code I use to print the table:
print(num, end=(" " if counter < 10 else "\n"))
Late answer, but you can also use:
import random
rl = random.sample(range(100, 999), 100)
max_n = 10
for n, x in enumerate(rl, 1):
    print(x, end=("\n" if n % max_n == 0 else " "))
440 688 758 837 279 736 510 706 392 631
588 511 610 792 535 526 335 842 247 124
552 329 245 689 832 407 919 302 592 385
542 890 406 898 189 116 495 764 664 471
851 728 292 314 839 503 691 355 350 213
661 489 800 649 521 958 123 205 983 219
321 633 120 388 632 187 158 576 294 835
673 470 699 908 456 270 220 878 376 884
816 525 147 104 602 637 249 763 494 127
981 524 262 915 267 873 886 397 922 932
You can just format the number before printing it.
print(f"{num:>5}", end=(" " if counter < 10 else "\n"))
Alternatively, if you want to cast the numbers to strings, you can use the rjust method of str.
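For completeness, a tiny sketch of that rjust variant (my own illustration, reusing the num and counter names from the question):
# Right-justify each number in a field of width 5; same row logic as above.
print(str(num).rjust(5), end=(" " if counter < 10 else "\n"))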
Here is another simple way to do it; I hope the comments make it clear.
import random
# Generate 100 random numbers in the range 1 to 999.
random_numbers = list(map(lambda x: random.randrange(1, 1000), range(100)))
# Create an empty string to collect the pretty output.
pretty_txt = ''
# Loop through random_numbers and use enumerate to get the iteration count.
for index, i in enumerate(random_numbers):
    # Add the current number to the string, followed by a space.
    pretty_txt += str(i) + ' '
    # Add a newline after every tenth number so each row holds 10 values.
    if (index + 1) % 10 == 0:
        pretty_txt += '\n'
print(pretty_txt)
The output is:
796 477 578 458 284 287 43 535 514 504
91 411 288 980 85 233 394 313 263 135
770 793 394 362 433 370 725 472 981 932
398 275 626 631 817 82 551 775 211 755
202 81 750 695 402 809 477 925 347 31
313 514 363 115 144 341 684 662 522 236
219 142 114 621 940 241 110 851 997 699
685 434 813 983 710 124 443 569 613 456
232 80 927 445 179 49 871 821 428 750
792 527 799 878 731 221 780 16 779 333

Clustering inconsistencies by time difference in a timeseries df

I have a pandas dataframe that looks like this:
df
Out[94]:
nr aenvm aenhm ... naenhs naesvs naeshs
date ...
2019-11-16 08:44:24 1 388 776 ... 402 305 566
2019-11-16 08:44:25 2 383 767 ... 407 304 561
2019-11-16 08:44:26 3 378 762 ... 410 301 570
2019-11-16 08:44:27 4 376 766 ... 403 304 567
2019-11-16 08:44:28 5 374 773 ... 398 297 569
The data comes in inconsistent events.
Sometimes there are around 6 minutes of data (let's call that an "event") and then no data for maybe some hours or some days. See, for example, the structural break in the timestamps:
df.iloc[1056:1065]
Out[95]:
nr aenvm aenhm aesvm ... naenvs naenhs naesvs naeshs
date ...
2019-11-17 05:18:49 1057 276 707 477 ... 244 136 247 525
2019-11-17 05:18:50 1058 268 703 470 ... 238 138 228 504
2019-11-17 05:56:45 1059 304 717 508 ... 295 157 282 519
2019-11-17 05:56:46 1060 304 715 507 ... 302 159 289 508
All I want to do is to "index" or "categorize" those events, so that the rows between two structural breaks are grouped under one number [1, 2, 3, ...] in a new column.
My goal is to create a new column like "nr" that separates the "events".
Out[95]:
nr aenvm aenhm aesvm ... naenvs naenhs naesvs naeshs
date ...
2019-11-17 05:18:49 1 276 707 477 ... 244 136 247 525
2019-11-17 05:18:50 1 268 703 470 ... 238 138 228 504
2019-11-17 05:56:45 2 304 717 508 ... 295 157 282 519
2019-11-17 05:56:46 2 304 715 507 ... 302 159 289 508
To be honest I am a complete Python newbie. I tried some classification with the datetime64 timestamps and .asfreq, but with little to no success...
I would be very thankful for good advice! :)
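No answer is recorded in this thread, but a common pandas pattern for this kind of gap-based grouping is to diff consecutive timestamps and start a new group whenever the gap exceeds a threshold. A minimal sketch, assuming the DatetimeIndex shown above and an illustrative 10-minute threshold:
import pandas as pd

# Sketch only: gap-based event numbering; the threshold is an assumption.
gap = pd.Timedelta(minutes=10)
# diff() of the index gives the time since the previous row; a gap larger
# than the threshold marks the start of a new event, and cumsum() numbers
# the events 1, 2, 3, ...
df['event'] = (df.index.to_series().diff() > gap).cumsum() + 1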

Issue with choosing columns when creating max/min columns in dataframe

I have the following code:
def minmaxdata():
    Totrigs, TotOrDV, TotOrH, TotGas, TotOil, TotBit, ABFl, ABOr, SKFl, SKOr, BCOr, MBFl, MBOr = dataforgraphs()
    tr = Totrigs
    tr['year'] = tr.index.year
    tr['week'] = tr.groupby('year').cumcount() + 1
    tr2 = tr.pivot_table(index='week', columns='year')
    tr2['max07_13'] = tr2.max(axis=1)
    tr2['min07_13'] = tr2.min(axis=1)
    print(tr2)
Which gives me this:
Total Rigs max07_13 min07_13
year 2007 2008 2009 2010 2011 2012 2013 2014
week
1 408 333 303 322 419 382 270 477 477 270
2 539 449 357 382 495 541 460 514 541 357
3 581 482 355 419 511 554 502 509 581 355
4 597 485 356 441 514 568 502 502 597 356
5 587 496 340 462 522 570 503 500 587 340
6 590 521 304 457 526 564 506 512 590 304
7 586 539 294 465 517 571 519 530 586 294
8 555 529 282 455 517 555 517 NaN 555 282
9 550 534 232 437 532 519 518 NaN 550 232
10 510 502 160 366 528 419 472 NaN 528 160
11 396 411 107 259 466 296 405 NaN 466 107
...But I would like the two max/min columns on the right to only take the max/min over 2007-2013. I have tried several indexing methods, but they all result in errors.
Any suggestions?
EDIT:
Tried the scalable solution but got the following error:
KeyError: "['2007' '2008' '2009' '2010' '2011' '2012' '2013'] not in index"
EDIT2:
tr2.columns output is the following:
Year
Total Rigs 2007
2008
2009
2010
2011
2012
2013
2014
max07_13
min07_13
EDIT3:
This was the solution:
gcols=[('Total Rigs',2007),('Total Rigs',2008),('Total Rigs',2009),('Total Rigs',2010),('Total Rigs',2011),('Total Rigs',2012),('Total Rigs',2013)]
tr2['Max 2007-2013']=tr2[gcols].max(axis=1)
tr2['Min 2007-2013']=tr2[gcols].min(axis=1)
A bit of a non-scalable solution would be to drop 2014 and then call max and min -
tr2['max07_13']=tr2.drop('2014', axis=1).max(axis=1)
If you know the columns of interest, you can also use that -
columns_of_interest = ['2007', '2008', '2009', '2010', '2011', '2012', '2013']
tr2['max07_13']=tr2[columns_of_interest].max(axis=1)
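A follow-up note (mine, not from the original answer): the KeyError in the edits is consistent with pivot_table producing a column MultiIndex whose year level holds integers, which is why the EDIT3 solution with ('Total Rigs', 2007) tuples works. Those tuples can be generated rather than typed out; a small sketch assuming that column layout:
# Build the ('Total Rigs', year) column keys for 2007-2013 programmatically.
gcols = [('Total Rigs', year) for year in range(2007, 2014)]
tr2['Max 2007-2013'] = tr2[gcols].max(axis=1)
tr2['Min 2007-2013'] = tr2[gcols].min(axis=1)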
