Separate columns by sequential N rows in Python

I have a dataframe with two columns (A and B) that I need to split into chunks of N sequential rows (for example, 100 rows), so the output has the first 100 rows in columns A and B, the next 100 rows in columns C and D, and so on. Is there a specific function for this purpose?

Input data:
df = pd.DataFrame(np.arange(1, 2001).reshape((-1, 2)), columns=["A", "B"])
print(df)
A B
0 1 2
1 3 4
2 5 6
3 7 8
4 9 10
.. ... ...
995 1991 1992
996 1993 1994
997 1995 1996
998 1997 1998
999 1999 2000
[1000 rows x 2 columns]
Use np.array_split
out = np.concatenate(np.array_split(df, range(100, len(df), 100)), axis=1)
print(out)
array([[ 1, 2, 201, ..., 1602, 1801, 1802],
[ 3, 4, 203, ..., 1604, 1803, 1804],
[ 5, 6, 205, ..., 1606, 1805, 1806],
...,
[ 195, 196, 395, ..., 1796, 1995, 1996],
[ 197, 198, 397, ..., 1798, 1997, 1998],
[ 199, 200, 399, ..., 1800, 1999, 2000]])
Build your dataframe:
df1 = pd.DataFrame(out, columns=list(map(chr, range(65, out.shape[1]+65))))
print(df1)
A B C D E F ... O P Q R S T
0 1 2 201 202 401 402 ... 1401 1402 1601 1602 1801 1802
1 3 4 203 204 403 404 ... 1403 1404 1603 1604 1803 1804
2 5 6 205 206 405 406 ... 1405 1406 1605 1606 1805 1806
3 7 8 207 208 407 408 ... 1407 1408 1607 1608 1807 1808
4 9 10 209 210 409 410 ... 1409 1410 1609 1610 1809 1810
.. ... ... ... ... ... ... ... ... ... ... ... ... ...
95 191 192 391 392 591 592 ... 1591 1592 1791 1792 1991 1992
96 193 194 393 394 593 594 ... 1593 1594 1793 1794 1993 1994
97 195 196 395 396 595 596 ... 1595 1596 1795 1796 1995 1996
98 197 198 397 398 597 598 ... 1597 1598 1797 1798 1997 1998
99 199 200 399 400 599 600 ... 1599 1600 1799 1800 1999 2000
[100 rows x 20 columns]
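The same call is easy to check on a toy frame (n = 3 here instead of 100; the data is purely illustrative):

```python
import numpy as np
import pandas as pd

# 6 rows of two columns, split into groups of 3 rows each.
df = pd.DataFrame(np.arange(1, 13).reshape(-1, 2), columns=["A", "B"])

n = 3
# Split every n rows, then stack the pieces side by side.
out = np.concatenate(np.array_split(df, range(n, len(df), n)), axis=1)
print(out)
```

Each group of 3 rows becomes its own pair of columns, so the result has shape (3, 4) with the first row [1, 2, 7, 8].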

Assuming n always evenly divides the length of the frame, vsplit + hstack is an option:
n = 5
new_df = pd.DataFrame(np.hstack(np.vsplit(df.values, len(df) // n)))
new_df.columns = new_df.columns.map(lambda c: chr(c + ord('A')))
Complete Working Example:
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': np.arange(1, 16),
'B': np.arange(101, 116)})
n = 5
new_df = pd.DataFrame(np.hstack(np.vsplit(df.values, len(df) // n)))
new_df.columns = new_df.columns.map(lambda c: chr(c + ord('A')))
df:
A B
0 1 101
1 2 102
2 3 103
3 4 104
4 5 105
5 6 106
6 7 107
7 8 108
8 9 109
9 10 110
10 11 111
11 12 112
12 13 113
13 14 114
14 15 115
new_df:
A B C D E F
0 1 101 6 106 11 111
1 2 102 7 107 12 112
2 3 103 8 108 13 113
3 4 104 9 109 14 114
4 5 105 10 110 15 115
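If n does not evenly divide the length of the frame, one option is to pad it to a multiple of n with NaN rows first and then reuse the same vsplit + hstack idea. A minimal sketch (the `split_columns` helper is hypothetical, not part of the answer above):

```python
import numpy as np
import pandas as pd

# Hypothetical helper: pad with NaN rows so n always divides the length,
# then reuse the vsplit + hstack approach.
def split_columns(df, n):
    pad = (-len(df)) % n                  # rows needed to reach a multiple of n
    values = df.to_numpy(dtype=float)     # float so NaN padding is representable
    if pad:
        values = np.vstack([values, np.full((pad, df.shape[1]), np.nan)])
    return pd.DataFrame(np.hstack(np.vsplit(values, len(values) // n)))

df = pd.DataFrame({'A': range(1, 8), 'B': range(11, 18)})  # 7 rows
out = split_columns(df, 3)   # 3 row-groups of 3; the last group is NaN-padded
```

Note the cast to float: padding an integer frame with NaN forces a float result either way.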

Related

Vectorization Pandas DataFrame

Assume the following simplified framework:
I have a 3D Pandas dataframe of parameters composed of 100 rows, 4 classes and 4 features for each instance:
iterables = [list(range(100)), [0,1,2,3]]
index = pd.MultiIndex.from_product(iterables, names=['instances', 'classes'])
columns = ['a', 'b', 'c', 'd']
np.random.seed(42)
parameters = pd.DataFrame(np.random.randint(1, 2000, size=(len(index), len(columns))), index=index, columns=columns)
parameters
instances classes a b c d
0 0 1127 1460 861 1295
1 1131 1096 1725 1045
2 1639 122 467 1239
3 331 1483 88 1397
1 0 1124 872 1688 131
... ... ... ... ...
98 3 1321 1750 779 1431
99 0 1793 814 1637 1429
1 1370 1646 420 1206
2 983 825 1025 1855
3 1974 567 371 936
Let df be a dataframe that for each instance and each feature (column), report the observed class.
np.random.seed(42)
df = pd.DataFrame(np.random.randint(0, 3, size=(100, len(columns))), index=list(range(100)),
columns=columns)
a b c d
0 2 0 2 2
1 0 0 2 1
2 2 2 2 2
3 0 2 1 0
4 1 1 1 1
.. .. .. .. ..
95 1 2 0 1
96 2 1 2 1
97 0 0 1 2
98 0 0 0 1
99 1 2 2 2
I would like to create a third dataframe (let's call it new_df) of shape (100, 4) containing the parameters in the dataframe parameters based on the observed classes on the dataframe df.
For example, in the first row of df for the first column (a) I observe class 2, so the value I am interested in is the class-2 row of the first instance of the parameters dataframe, namely 1639, which will populate the first row and column of new_df. Following this method, the first observation for column "b" is class 0, so in the first row, column b of new_df I would like to observe 1460, and so on.
With a for loop I can obtain the desired result:
new_df = pd.DataFrame(0, index=list(range(100)), columns=columns)  # initialize the df
for i in range(len(df)):
    for c in df.columns:
        new_df.iloc[i][c] = parameters.loc[i][c][df.iloc[i][c]]
new_df
a b c d
0 1639 1460 467 1239
1 1124 872 806 344
2 1083 511 1706 1500
3 958 1155 1268 563
4 14 242 777 1370
.. ... ... ... ...
95 1435 1316 1709 755
96 346 712 363 815
97 1234 985 683 1348
98 127 1130 1009 1014
99 1370 825 1025 1855
However, the original dataset contains millions of rows and hundreds of columns, and proceeding with for loop is unfeasible.
Is there a way to vectorize such a problem in order to avoid for loops? (at least over 1 dimension)
Reshape both DataFrames, using stack, into a long format, then perform the merge and reshape, with unstack, back to the wide format. There's a bunch of renaming just so we can reference and align the columns in the merge.
(df.rename_axis(index='instances', columns='cols').stack().to_frame('classes')
.merge(parameters.rename_axis(columns='cols').stack().rename('vals'),
on=['instances', 'classes', 'cols'])
.unstack(-1)['vals']
.rename_axis(index=None, columns=None)
)
a b c d
0 1639 1460 467 1239
1 1124 872 806 344
2 1083 511 1706 1500
3 958 1155 1268 563
4 14 242 777 1370
.. ... ... ... ...
95 1435 1316 1709 755
96 346 712 363 815
97 1234 985 683 1348
98 127 1130 1009 1014
99 1370 825 1025 1855
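An alternative that stays entirely in NumPy is fancy indexing: reshape the parameters into an (instance, class, feature) cube and pick one element per cell in a single indexing call. This sketch rebuilds both frames with the question's seeds and assumes exactly 100 instances, 4 classes, and 4 features as stated:

```python
import numpy as np
import pandas as pd

# Rebuild the question's frames with the same seeds.
iterables = [list(range(100)), [0, 1, 2, 3]]
index = pd.MultiIndex.from_product(iterables, names=['instances', 'classes'])
columns = ['a', 'b', 'c', 'd']
np.random.seed(42)
parameters = pd.DataFrame(np.random.randint(1, 2000, size=(len(index), len(columns))),
                          index=index, columns=columns)
np.random.seed(42)
df = pd.DataFrame(np.random.randint(0, 3, size=(100, len(columns))), columns=columns)

vals = parameters.to_numpy().reshape(100, 4, 4)   # (instance, class, feature) cube
rows = np.arange(100)[:, None]                    # broadcasts against the 4 features
cols = np.arange(4)
# Element [i, j] of the result is vals[i, observed_class, j].
new_df = pd.DataFrame(vals[rows, df.to_numpy(), cols],
                      index=df.index, columns=columns)
```

Because nothing is merged or re-sorted, this tends to scale well to the millions of rows mentioned in the question, at the cost of assuming every instance has the same number of classes.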

Removing duplicate entries and extracting desired information

I have a whitespace-separated table that looks like this:
DNA_pol3_beta_3 121 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 1.2e+03 16 44 23 49
DNA_pol3_beta_3 121 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 6.3e-27 2 121 264 383
DNA_pol3_beta_2 116 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 3.7 2 96 5 95
DNA_pol3_beta_2 116 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 5e-20 3 115 133 260
DNA_pol3_beta_2 116 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 1.3e+03 3 21 277 295
DNA_pol3_beta_2 116 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 4.1e+03 14 29 345 360
DNA_pol3_beta 121 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 6.9e-18 1 121 1 121
DNA_pol3_beta 121 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 4.1e+02 30 80 157 209
DNA_pol3_beta 121 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 0.94 2 101 273 369
SMC_N 220 Paja_0002_peg_[locus_tag=BCY86_RS00010] 378 1.2e-14 3 199 19 351
AAA_21 303 Paja_0002_peg_[locus_tag=BCY86_RS00010] 378 0.00011 1 32 40 68
AAA_21 303 Paja_0002_peg_[locus_tag=BCY86_RS00010] 378 0.0015 231 300 279 352
AAA_15 369 Paja_0002_peg_[locus_tag=BCY86_RS00010] 378 4e-05 4 53 19 67
AAA_15 369 Paja_0002_peg_[locus_tag=BCY86_RS00010] 378 8.8e+02 347 363 332 348
AAA_23 200 Paja_0002_peg_[locus_tag=BCY86_RS00010] 378 0.0014 3 41 22 60
I want to filter the results so that, for example, for the item "DNA_pol3_beta_3" there are 2 entries; out of these two entries, I want to extract only the row whose value in the 5th column is the lowest. So out of the two entries:
DNA_pol3_beta_3 121 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 6.3e-27 2 121 264 383
the above one should be in the result. Similarly, for "DNA_pol3_beta_2" there are 4 entries, and the program should extract only
DNA_pol3_beta_2 116 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 5e-20 3 115 133 260
because it has the lowest value in the 5th column among the four. Also, the program should ignore entries whose value in the 5th column is less than 1E-5.
I tried the following code:
for i in lines:
    if lines[i+1] == lines[i]:
        if lines[i+1][4] > lines[i][4]:
            evalue = lines[i][4]
        else:
            evalue = lines[i+1][4]
You would be better off using pandas for this. See below:
import pandas as pd
df = pd.read_csv('yourfile.txt', sep=' ', skipinitialspace=True, names=range(9))
df = df[df[4] >= 0.00001]
result = df.loc[df.groupby(0)[4].idxmin()].sort_index().reset_index(drop=True)
Output:
>>> print(result)
0 1 2 3 4 5 6 7 8
0 DNA_pol3_beta_3 121 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 1200.00000 16 44 23 49
1 DNA_pol3_beta_2 116 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 3.70000 2 96 5 95
2 DNA_pol3_beta 121 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 0.94000 2 101 273 369
3 AAA_21 303 Paja_0002_peg_[locus_tag=BCY86_RS00010] 378 0.00011 1 32 40 68
4 AAA_15 369 Paja_0002_peg_[locus_tag=BCY86_RS00010] 378 0.00004 4 53 19 67
5 AAA_23 200 Paja_0002_peg_[locus_tag=BCY86_RS00010] 378 0.00140 3 41 22 60
If you want to write the result back out as CSV, you can save it with result.to_csv().
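The recipe can be tried end-to-end without a file by feeding an inline sample through io.StringIO in place of 'yourfile.txt' (locus tags shortened to 'x' for brevity):

```python
import io
import pandas as pd

# Two items from the question, with rows both above and below the 1e-5 cutoff.
data = """DNA_pol3_beta_3 121 x 384 1.2e+03 16 44 23 49
DNA_pol3_beta_3 121 x 384 6.3e-27 2 121 264 383
DNA_pol3_beta_2 116 x 384 3.7 2 96 5 95
DNA_pol3_beta_2 116 x 384 5e-20 3 115 133 260
"""
df = pd.read_csv(io.StringIO(data), sep=' ', skipinitialspace=True, names=range(9))
df = df[df[4] >= 0.00001]                      # drop rows below the 1e-5 cutoff
result = df.loc[df.groupby(0)[4].idxmin()].sort_index().reset_index(drop=True)
```

With the sub-cutoff rows removed first, the groupwise idxmin keeps 1200.0 for DNA_pol3_beta_3 and 3.7 for DNA_pol3_beta_2, matching the output shown above.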

for-loop generates 'cannot insert {}, already exists' error depending on the pandas dataframe definition

I have a series of astronomical observations that are placed into program-generated DataFrames, one per observation year (i.e., df2015, df2016, etc.). These DataFrames need to be modified in subsequent processing, so I put all of them in a list. The method used to define the list makes a difference. An explicitly defined list
dfs = [df2015, df2016, df2017, df2018, df2019]
allows further df modifications, but it does not agree with the purpose of the code - to automate the processing of standard data sets regardless of the number of years. A program-generated list
for yr in years:
    exec('dfs = [df' + yr + ' for yr in years]')
seems to be working most of the time, as in :
for df in dfs:
    dfX = df.dtypes
    for index, val2 in dfX.items():
        if val2 == 'float64':
            df.iloc[:, index] = df.iloc[:, index].fillna(0).astype('int64')
, but fails in some cases, as in:
for df in dfs:
    i = 1
    for i in range(1, 13):
        ncol = i + (i-1) * 2
        if i < 10:
            nmon = '0' + str(i)
        else:
            nmon = '' + str(i)
        df.insert(ncol, 'M' + nmon, nmon)
        i += 1
when a for loop with an insert statement results in an error:
ValueError: cannot insert M01, already exists
I've tried list comprehension instead of for loops, tried changing the loop nesting order (just in case), etc.
The intent of the above reference step is to convert this:
0 1 2 3 4 5 6 7 8 9 ... 15 16 17 18 19 20 21 22 23 24
0 1 713 1623 658.0 1659.0 619 1735 526.0 1810.0 439 ... 437 1903 510.0 1818.0 542 1725 618.0 1637.0 654 1613
1 2 714 1624 657.0 1700.0 618 1736 525.0 1812.0 438 ... 438 1902 511.0 1816.0 543 1724 619.0 1636.0 655 1613
2 3 714 1625 655.0 1702.0 616 1737 523.0 1813.0 437 ... 439 1901 512.0 1814.0 544 1722 620.0 1635.0 656 1612
3 4 714 1626 654.0 1703.0 614 1738 521.0 1814.0 435 ... 440 1900 513.0 1813.0 545 1720 622.0 1634.0 657 1612
4 5 713 1627 653.0 1704.0 613 1739 520.0 1815.0 434 ... 441 1859 514.0 1811.0 546 1719 623.0 1633.0 658 1612
into this
0 M01 D01 1 2 M02 D02 3 4 M03 ... 19 20 M11 D11 21 22 M12 D12 23 24
0 1 01 1 713 1623 02 1 658 1659 03 ... 542 1725 11 1 618 1637 12 1 654 1613
1 2 01 2 714 1624 02 2 657 1700 03 ... 543 1724 11 2 619 1636 12 2 655 1613
2 3 01 3 714 1625 02 3 655 1702 03 ... 544 1722 11 3 620 1635 12 3 656 1612
3 4 01 4 714 1626 02 4 654 1703 03 ... 545 1720 11 4 622 1634 12 4 657 1612
4 5 01 5 713 1627 02 5 653 1704 03 ... 546 1719 11 5 623 1633 12 5 658 1612
You create a list of references to the last year's dataframe. If your years list is, e.g., ['2015', '2016', '2017', '2018'], then the exec generates dfs as [df2018, df2018, df2018, df2018], so the insert loop runs over the same frame four times and fails on the second pass with the "already exists" error.
This will get you the correct result:
dfs = [eval('df' + yr) for yr in years]
It forms the required dataframe names and evaluates them so you get a list of dataframes.
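A common way to sidestep exec/eval altogether is to keep the per-year frames in a dict keyed by year from the start, so no variable name ever has to be assembled from a string. A sketch, with dummy one-row frames standing in for the real observation tables:

```python
import pandas as pd

# Hypothetical setup: the program that generates the yearly frames stores
# them in a dict instead of creating df2015, df2016, ... variables.
years = ['2015', '2016', '2017']
frames = {yr: pd.DataFrame({'obs': [int(yr)]}) for yr in years}

# Same role as dfs = [df2015, df2016, df2017], with no eval needed.
dfs = [frames[yr] for yr in years]
```

Each list entry is then a distinct dataframe, so the column-insert loop from the question touches every frame exactly once.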

How to mask a data frame by the month?

I have a dataframe df1 with a column dates which contains dates. I want to plot the dataframe for just a certain month. The dataframe looks like:
Unnamed: 0 Unnamed: 0.1 dates DPD weekday
0 0 1612 2007-06-01 23575.0 4
1 3 1615 2007-06-04 28484.0 0
2 4 1616 2007-06-05 29544.0 1
3 5 1617 2007-06-06 29129.0 2
4 6 1618 2007-06-07 27836.0 3
5 7 1619 2007-06-08 23434.0 4
6 10 1622 2007-06-11 28893.0 0
7 11 1623 2007-06-12 28698.0 1
8 12 1624 2007-06-13 27959.0 2
9 13 1625 2007-06-14 28534.0 3
10 14 1626 2007-06-15 23974.0 4
.. ... ... ... ... ...
513 721 2351 2009-06-09 54658.0 1
514 722 2352 2009-06-10 51406.0 2
515 723 2353 2009-06-11 48255.0 3
516 724 2354 2009-06-12 40874.0 4
517 727 2357 2009-06-15 77085.0 0
518 728 2358 2009-06-16 77989.0 1
519 729 2359 2009-06-17 75209.0 2
520 730 2360 2009-06-18 72298.0 3
521 731 2361 2009-06-19 60037.0 4
522 734 2364 2009-06-22 69348.0 0
523 735 2365 2009-06-23 74086.0 1
524 736 2366 2009-06-24 69187.0 2
525 737 2367 2009-06-25 68912.0 3
526 738 2368 2009-06-26 57848.0 4
527 741 2371 2009-06-29 72718.0 0
528 742 2372 2009-06-30 72306.0 1
And I just want to have June 2007 for example.
df1 = pd.read_csv('DPD.csv')
df1['dates'] = pd.to_datetime(df1['dates'])
df1['month'] = pd.PeriodIndex(df1.dates, freq='M')
nov_mask=df1['month'] == 2007-06
plot_data= df1[nov_mask].pivot(index='dates', values='DPD')
plot_data.plot()
plt.show()
I don't know what's wrong with my code. The error shows that there is something wrong with 2007-06 where I define nov_mask; I think the data type is wrong, but I have tried a lot and nothing works.
You don't need PeriodIndex if you just want to get June 2007 data. I have no access to IPython right now but this should point you in the right direction.
df1 = pd.read_csv('DPD.csv')
df1['dates'] = pd.to_datetime(df1['dates'])
df1['year'] = df1['dates'].dt.year
df1['month'] = df1['dates'].dt.month
june_mask = (df1['year'] == 2007) & (df1['month'] == 6)
filtered = df1[june_mask]
# ... Do something with filtered.
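Equivalently, a single mask can be built from a monthly period, without adding helper columns (the sample rows here are copied from the question's output):

```python
import pandas as pd

# Three of the question's rows, two in June 2007 and one in June 2009.
df1 = pd.DataFrame({'dates': pd.to_datetime(['2007-06-01', '2007-06-04', '2009-06-09']),
                    'DPD': [23575.0, 28484.0, 54658.0]})

# Compare month-level periods; pd.Period('2007-06', freq='M') is the target month.
june_2007 = df1[df1['dates'].dt.to_period('M') == pd.Period('2007-06', freq='M')]
```

This also fixes the original error: the mask in the question compared against the bare expression 2007-06 (the integer 2001), where a Period or the string '2007-06' was needed.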

Values in pandas dataframe not getting sorted

I have a dataframe as shown below:
Category 1 2 3 4 5 6 7 8 9 10 11 12 13
A 424 377 161 133 2 81 141 169 297 153 53 50 197
B 231 121 111 106 4 79 68 70 92 93 71 65 66
C 480 379 159 139 2 116 148 175 308 150 98 82 195
D 88 56 38 40 0 25 24 55 84 36 24 26 36
E 1084 1002 478 299 7 256 342 342 695 378 175 132 465
F 497 246 283 206 4 142 151 168 297 224 194 198 148
H 8 5 4 3 0 2 3 2 7 5 3 2 0
G 3191 2119 1656 856 50 826 955 739 1447 1342 975 628 1277
K 58 26 27 51 1 18 22 42 47 35 19 20 14
S 363 254 131 105 6 82 86 121 196 98 81 57 125
T 54 59 20 4 0 9 12 7 36 23 5 4 20
O 554 304 207 155 3 130 260 183 287 204 98 106 195
P 756 497 325 230 5 212 300 280 448 270 201 140 313
PP 64 43 26 17 1 15 35 17 32 28 18 9 27
R 265 157 109 89 1 68 68 104 154 96 63 55 90
S 377 204 201 114 5 112 267 136 209 172 147 90 157
St 770 443 405 234 5 172 464 232 367 270 290 136 294
Qs 47 33 11 14 0 18 14 19 26 17 5 6 13
Y 1806 626 1102 1177 14 625 619 1079 1273 981 845 891 455
W 123 177 27 28 0 18 62 34 64 27 14 4 51
Z 2770 1375 1579 1082 17 900 1630 1137 1465 1383 861 755 1201
I want to sort the dataframe by values in each row. Once done, I want to sort the index also.
For example the values in first row corresponding to category A, should appear as:
2 50 53 81 133 141 153 161 169 197 297 377 424
I have tried df.sort_values(by=df.index.tolist(), ascending=False, axis=1) but this doesn't work. The values don't appear in sorted order at all
np.sort + sort_index
You can use np.sort along axis=1, then sort_index:
cols, idx = df.columns[1:], df.iloc[:, 0]
res = pd.DataFrame(np.sort(df.iloc[:, 1:].values, axis=1), columns=cols, index=idx)\
.sort_index()
print(res)
1 2 3 4 5 6 7 8 9 10 11 12 \
Category
A 2 50 53 81 133 141 153 161 169 197 297 377
B 4 65 66 68 70 71 79 92 93 106 111 121
C 2 82 98 116 139 148 150 159 175 195 308 379
D 0 24 24 25 26 36 36 38 40 55 56 84
E 7 132 175 256 299 342 342 378 465 478 695 1002
F 4 142 148 151 168 194 198 206 224 246 283 297
G 50 628 739 826 856 955 975 1277 1342 1447 1656 2119
H 0 0 2 2 2 3 3 3 4 5 5 7
K 1 14 18 19 20 22 26 27 35 42 47 51
O 3 98 106 130 155 183 195 204 207 260 287 304
P 5 140 201 212 230 270 280 300 313 325 448 497
PP 1 9 15 17 17 18 26 27 28 32 35 43
Qs 0 5 6 11 13 14 14 17 18 19 26 33
R 1 55 63 68 68 89 90 96 104 109 154 157
S 6 57 81 82 86 98 105 121 125 131 196 254
S 5 90 112 114 136 147 157 172 201 204 209 267
St 5 136 172 232 234 270 290 294 367 405 443 464
T 0 4 4 5 7 9 12 20 20 23 36 54
W 0 4 14 18 27 27 28 34 51 62 64 123
Y 14 455 619 625 626 845 891 981 1079 1102 1177 1273
Z 1 17 755 861 900 1082 1137 1375 1383 1465 1579 1630
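The np.sort recipe can be sanity-checked on a tiny frame (made-up values, three categories deliberately out of order):

```python
import numpy as np
import pandas as pd

# Miniature version of the question's layout: a Category column plus
# numeric columns whose values should end up sorted within each row.
df = pd.DataFrame({'Category': ['B', 'A', 'C'],
                   '1': [9, 2, 7],
                   '2': [1, 8, 7],
                   '3': [5, 5, 0]})

# Sort each row's values, keep Category as the index, then sort the index.
res = pd.DataFrame(np.sort(df.iloc[:, 1:].to_numpy(), axis=1),
                   columns=df.columns[1:], index=df['Category']).sort_index()
```

Row B (9, 1, 5) becomes (1, 5, 9), and the index order becomes A, B, C, mirroring the full output above.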
One way is to apply sorted setting 1 as axis, applying pd.Series to return a dataframe instead of a list, and finally sorting by Category:
(df.loc[:, '1':].apply(sorted, axis=1).apply(pd.Series)
   .set_index(df.Category).sort_index())
Category 0 1 2 3 4 5 6 7 8 9 10 ...
0 A 2 50 53 81 133 141 153 161 169 197 297 ...
1 B 4 65 66 68 70 71 79 92 93 106 111 ...
