Assume the following simplified framework:
I have a 3D Pandas dataframe of parameters composed of 100 rows, 4 classes and 4 features for each instance:
iterables = [list(range(100)), [0,1,2,3]]
index = pd.MultiIndex.from_product(iterables, names=['instances', 'classes'])
columns = ['a', 'b', 'c', 'd']
np.random.seed(42)
parameters = pd.DataFrame(np.random.randint(1, 2000, size=(len(index), len(columns))), index=index, columns=columns)
parameters
instances classes a b c d
0 0 1127 1460 861 1295
1 1131 1096 1725 1045
2 1639 122 467 1239
3 331 1483 88 1397
1 0 1124 872 1688 131
... ... ... ... ...
98 3 1321 1750 779 1431
99 0 1793 814 1637 1429
1 1370 1646 420 1206
2 983 825 1025 1855
3 1974 567 371 936
Let df be a dataframe that, for each instance and each feature (column), reports the observed class.
np.random.seed(42)
df = pd.DataFrame(np.random.randint(0, 3, size=(100, len(columns))), index=list(range(100)),
columns=columns)
a b c d
0 2 0 2 2
1 0 0 2 1
2 2 2 2 2
3 0 2 1 0
4 1 1 1 1
.. .. .. .. ..
95 1 2 0 1
96 2 1 2 1
97 0 0 1 2
98 0 0 0 1
99 1 2 2 2
I would like to create a third dataframe (let's call it new_df) of shape (100, 4) containing the values from the parameters dataframe selected according to the classes observed in df.
For example, in the first row of df for the first column (a) I observe class 2, so the value I am interested in is the one for class 2 of instance 0 in the parameters dataframe, namely 1639, which will populate the first row and column of new_df. Likewise, the first observation for column "b" is class 0, so the first row, column b of new_df should hold 1460, and so on.
With a for loop I can obtain the desired result:
new_df = pd.DataFrame(0, index=list(range(100)), columns=columns)  # initialize the df
for i in range(len(df)):
    for c in df.columns:
        # parameter for instance i, feature c, at the class observed in df
        new_df.loc[i, c] = parameters.loc[(i, df.loc[i, c]), c]
new_df
a b c d
0 1639 1460 467 1239
1 1124 872 806 344
2 1083 511 1706 1500
3 958 1155 1268 563
4 14 242 777 1370
.. ... ... ... ...
95 1435 1316 1709 755
96 346 712 363 815
97 1234 985 683 1348
98 127 1130 1009 1014
99 1370 825 1025 1855
However, the original dataset contains millions of rows and hundreds of columns, so proceeding with for loops is infeasible.
Is there a way to vectorize this problem to avoid the for loops (at least over one dimension)?
Reshape both DataFrames into long format with stack, perform the merge, then reshape back to wide format with unstack. There's a bunch of renaming just so we can reference and align the columns in the merge.
(df.rename_axis(index='instances', columns='cols').stack().to_frame('classes')
.merge(parameters.rename_axis(columns='cols').stack().rename('vals'),
on=['instances', 'classes', 'cols'])
.unstack(-1)['vals']
.rename_axis(index=None, columns=None)
)
a b c d
0 1639 1460 467 1239
1 1124 872 806 344
2 1083 511 1706 1500
3 958 1155 1268 563
4 14 242 777 1370
.. ... ... ... ...
95 1435 1316 1709 755
96 346 712 363 815
97 1234 985 683 1348
98 127 1130 1009 1014
99 1370 825 1025 1855
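If, as in this example, the instances and classes are both contiguous 0-based ranges, a plain NumPy fancy-indexing sketch avoids the merge machinery entirely; the reshape below relies on that layout assumption, which the merge version does not need:
import numpy as np
import pandas as pd

# Sketch: assumes parameters is sorted by (instance, class) and every
# instance has classes 0..k-1, so the flat table reshapes cleanly.
k = 4  # classes per instance
vals = parameters.to_numpy().reshape(-1, k, len(columns))  # (instance, class, feature)
rows = np.arange(len(df))[:, None]       # instance index, broadcast over features
cols = np.arange(len(columns))[None, :]  # feature index, broadcast over instances
new_df = pd.DataFrame(vals[rows, df.to_numpy(), cols],
                      index=df.index, columns=df.columns)
This does one vectorized gather instead of building a long intermediate frame, which matters at the millions-of-rows scale mentioned in the question.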
Related
I have a dataframe with two columns, and I need to split them (columns A and B) into chunks of N sequential rows (for example, 100 rows), so the output will be the first 100 rows in columns A and B, the next 100 rows in columns C and D, and so on. Is there a specific function that can deal with this purpose?
Input data:
df = pd.DataFrame(np.arange(1, 2001).reshape((-1, 2)), columns=["A", "B"])
print(df)
A B
0 1 2
1 3 4
2 5 6
3 7 8
4 9 10
.. ... ...
995 1991 1992
996 1993 1994
997 1995 1996
998 1997 1998
999 1999 2000
[1000 rows x 2 columns]
Use np.array_split to cut the frame into 100-row chunks, then concatenate the chunks side by side:
out = np.concatenate(np.array_split(df, range(100, len(df), 100)), axis=1)
print(out)
array([[ 1, 2, 201, ..., 1602, 1801, 1802],
[ 3, 4, 203, ..., 1604, 1803, 1804],
[ 5, 6, 205, ..., 1606, 1805, 1806],
...,
[ 195, 196, 395, ..., 1796, 1995, 1996],
[ 197, 198, 397, ..., 1798, 1997, 1998],
[ 199, 200, 399, ..., 1800, 1999, 2000]])
Build your dataframe:
df1 = pd.DataFrame(out, columns=list(map(chr, range(65, out.shape[1]+65))))
print(df1)
A B C D E F ... O P Q R S T
0 1 2 201 202 401 402 ... 1401 1402 1601 1602 1801 1802
1 3 4 203 204 403 404 ... 1403 1404 1603 1604 1803 1804
2 5 6 205 206 405 406 ... 1405 1406 1605 1606 1805 1806
3 7 8 207 208 407 408 ... 1407 1408 1607 1608 1807 1808
4 9 10 209 210 409 410 ... 1409 1410 1609 1610 1809 1810
.. ... ... ... ... ... ... ... ... ... ... ... ... ...
95 191 192 391 392 591 592 ... 1591 1592 1791 1792 1991 1992
96 193 194 393 394 593 594 ... 1593 1594 1793 1794 1993 1994
97 195 196 395 396 595 596 ... 1595 1596 1795 1796 1995 1996
98 197 198 397 398 597 598 ... 1597 1598 1797 1798 1997 1998
99 199 200 399 400 599 600 ... 1599 1600 1799 1800 1999 2000
[100 rows x 20 columns]
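If you'd rather stay in pandas and keep labeled objects throughout, the same chunking works with pd.concat (a sketch under the same setup as above):
import numpy as np
import pandas as pd

# Split into 100-row chunks, reset each chunk's index so they align,
# then concatenate side by side and relabel the columns A, B, C, ...
chunks = np.array_split(df, range(100, len(df), 100))
df1 = pd.concat([c.reset_index(drop=True) for c in chunks], axis=1, ignore_index=True)
df1.columns = df1.columns.map(lambda c: chr(c + ord('A')))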
Assuming n always evenly divides the length of the frame, vsplit + hstack is an option:
n = 5
new_df = pd.DataFrame(np.hstack(np.vsplit(df.values, len(df) // n)))
new_df.columns = new_df.columns.map(lambda c: chr(c + ord('A')))
Complete Working Example:
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': np.arange(1, 16),
'B': np.arange(101, 116)})
n = 5
new_df = pd.DataFrame(np.hstack(np.vsplit(df.values, len(df) // n)))
new_df.columns = new_df.columns.map(lambda c: chr(c + ord('A')))
df:
A B
0 1 101
1 2 102
2 3 103
3 4 104
4 5 105
5 6 106
6 7 107
7 8 108
8 9 109
9 10 110
10 11 111
11 12 112
12 13 113
13 14 114
14 15 115
new_df:
A B C D E F
0 1 101 6 106 11 111
1 2 102 7 107 12 112
2 3 103 8 108 13 113
3 4 104 9 109 14 114
4 5 105 10 110 15 115
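Under the same evenly-divisible assumption, a single reshape/transpose is an equivalent sketch that skips the Python-level list of chunks entirely:
import numpy as np
import pandas as pd

n = 5
chunks = len(df) // n
# (chunks, n, width) -> (n, chunks, width) -> (n, chunks * width):
# row i of the result holds row i of every chunk, side by side.
out = df.to_numpy().reshape(chunks, n, df.shape[1]).transpose(1, 0, 2).reshape(n, -1)
new_df = pd.DataFrame(out)
new_df.columns = new_df.columns.map(lambda c: chr(c + ord('A')))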
I have a series of astronomical observations that are placed into program-generated DataFrames for each observation year (i.e., df2015, df2016, etc.). These DataFrames need to be modified in subsequent processing steps, so I put all of them in a list. The method used to define the list makes a difference. An explicitly defined list
dfs = [df2015, df2016, df2017, df2018, df2019]
allows further df modifications, but it defeats the purpose of the code - to automate the processing of standard data sets regardless of the number of years. A program-generated list
for yr in years:
    exec('dfs = [df' + yr + ' for yr in years]')
seems to be working most of the time, as in:
for df in dfs:
    dfX = df.dtypes
    for index, val2 in dfX.items():
        if val2 == 'float64':
            df.iloc[:, index] = df.iloc[:, index].fillna(0).astype('int64')
but it fails in some cases, as in:
for df in dfs:
    i = 1
    for i in range(1, 13):
        ncol = i + (i - 1) * 2
        if i < 10:
            nmon = '0' + str(i)
        else:
            nmon = '' + str(i)
        df.insert(ncol, 'M' + nmon, nmon)
        i += 1
when a for loop with an insert statement results in an error:
ValueError: cannot insert M01, already exists
I've tried list comprehension instead of for loops, tried changing the loop nesting order (just in case), etc.
The intent of the step referenced above is to convert this:
0 1 2 3 4 5 6 7 8 9 ... 15 16 17 18 19 20 21 22 23 24
0 1 713 1623 658.0 1659.0 619 1735 526.0 1810.0 439 ... 437 1903 510.0 1818.0 542 1725 618.0 1637.0 654 1613
1 2 714 1624 657.0 1700.0 618 1736 525.0 1812.0 438 ... 438 1902 511.0 1816.0 543 1724 619.0 1636.0 655 1613
2 3 714 1625 655.0 1702.0 616 1737 523.0 1813.0 437 ... 439 1901 512.0 1814.0 544 1722 620.0 1635.0 656 1612
3 4 714 1626 654.0 1703.0 614 1738 521.0 1814.0 435 ... 440 1900 513.0 1813.0 545 1720 622.0 1634.0 657 1612
4 5 713 1627 653.0 1704.0 613 1739 520.0 1815.0 434 ... 441 1859 514.0 1811.0 546 1719 623.0 1633.0 658 1612
into this
0 M01 D01 1 2 M02 D02 3 4 M03 ... 19 20 M11 D11 21 22 M12 D12 23 24
0 1 01 1 713 1623 02 1 658 1659 03 ... 542 1725 11 1 618 1637 12 1 654 1613
1 2 01 2 714 1624 02 2 657 1700 03 ... 543 1724 11 2 619 1636 12 2 655 1613
2 3 01 3 714 1625 02 3 655 1702 03 ... 544 1722 11 3 620 1635 12 3 656 1612
3 4 01 4 714 1626 02 4 654 1703 03 ... 545 1720 11 4 622 1634 12 4 657 1612
4 5 01 5 713 1627 02 5 653 1704 03 ... 546 1719 11 5 623 1633 12 5 658 1612
You create a list that repeats the last year's dataframe: the string passed to exec embeds the current loop variable, so each pass builds a list of len(years) references to one and the same frame, and only the last pass survives the loop. If your years list is e.g. ['2015', '2016', '2017', '2018'], then you generate dfs as [df2018, df2018, df2018, df2018], and inserting the same columns into the same frame repeatedly leads to the error.
This will get you the correct result:
dfs = [eval('df' + yr) for yr in years]
It forms the required dataframe names and evaluates them so you get a list of dataframes.
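That said, building variable names for eval stays fragile. If you control the code that creates the per-year frames, a dict keyed by year avoids generated names altogether; a sketch, where load_year is a hypothetical stand-in for however each df<year> currently gets built:
# Hypothetical sketch: keep the frames in a dict instead of separate variables.
frames = {yr: load_year(yr) for yr in years}  # load_year: your per-year loader
dfs = list(frames.values())                   # same list the eval version produces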
I have a dataframe df1 with a column dates that contains dates. I want to plot the dataframe for just a certain month. The dataframe looks like:
Unnamed: 0 Unnamed: 0.1 dates DPD weekday
0 0 1612 2007-06-01 23575.0 4
1 3 1615 2007-06-04 28484.0 0
2 4 1616 2007-06-05 29544.0 1
3 5 1617 2007-06-06 29129.0 2
4 6 1618 2007-06-07 27836.0 3
5 7 1619 2007-06-08 23434.0 4
6 10 1622 2007-06-11 28893.0 0
7 11 1623 2007-06-12 28698.0 1
8 12 1624 2007-06-13 27959.0 2
9 13 1625 2007-06-14 28534.0 3
10 14 1626 2007-06-15 23974.0 4
.. ... ... ... ... ...
513 721 2351 2009-06-09 54658.0 1
514 722 2352 2009-06-10 51406.0 2
515 723 2353 2009-06-11 48255.0 3
516 724 2354 2009-06-12 40874.0 4
517 727 2357 2009-06-15 77085.0 0
518 728 2358 2009-06-16 77989.0 1
519 729 2359 2009-06-17 75209.0 2
520 730 2360 2009-06-18 72298.0 3
521 731 2361 2009-06-19 60037.0 4
522 734 2364 2009-06-22 69348.0 0
523 735 2365 2009-06-23 74086.0 1
524 736 2366 2009-06-24 69187.0 2
525 737 2367 2009-06-25 68912.0 3
526 738 2368 2009-06-26 57848.0 4
527 741 2371 2009-06-29 72718.0 0
528 742 2372 2009-06-30 72306.0 1
And I just want to have June 2007 for example.
df1 = pd.read_csv('DPD.csv')
df1['dates'] = pd.to_datetime(df1['dates'])
df1['month'] = pd.PeriodIndex(df1.dates, freq='M')
nov_mask=df1['month'] == 2007-06
plot_data= df1[nov_mask].pivot(index='dates', values='DPD')
plot_data.plot()
plt.show()
I don't know what's wrong with my code. The error points at the 2007-06 in the definition of nov_mask; I think the data type is wrong, but nothing I've tried has worked.
You don't need PeriodIndex if you just want to get June 2007 data. I have no access to IPython right now but this should point you in the right direction.
df1 = pd.read_csv('DPD.csv')
df1['dates'] = pd.to_datetime(df1['dates'])
df1['year'] = df1['dates'].dt.year
df1['month'] = df1['dates'].dt.month
june_mask = (df1['year'] == 2007) & (df1['month'] == 6)
filtered = df1[june_mask]
# ... Do something with filtered.
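Alternatively, the PeriodIndex approach from the question does work once the comparison value is quoted; a bare 2007-06 is evaluated as the integer expression 2007 minus 6, which is what the error was complaining about. A sketch of the fix:
import matplotlib.pyplot as plt
import pandas as pd

df1 = pd.read_csv('DPD.csv')
df1['dates'] = pd.to_datetime(df1['dates'])
df1['month'] = df1['dates'].dt.to_period('M')
june_mask = df1['month'] == '2007-06'   # a string, not the integer 2007 - 6
df1[june_mask].plot(x='dates', y='DPD')
plt.show()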
I have a txt file with values like these:
108,612,620,900
168,960,680,1248
312,264,768,564
516,1332,888,1596
I need to read all of this into a single row of a data frame:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 108 612 620 900 168 960 680 1248 312 264 768 564 516 1332 888 1596
I have many such files and so I'll keep appending rows to this data frame.
I believe we need some kind of regex, but I'm not able to figure it out. For now this is what I have:
df = pd.read_csv(f, sep=",| ", header=None)
But this takes , and (space) as separators, whereas I want it to treat the newline as a separator as well.
First, read the data:
df = pd.read_csv('test/t.txt', header=None)
It gives you a DataFrame shaped like the CSV. Then concatenate:
s = pd.concat((df.loc[i] for i in df.index), ignore_index=True)
It gives you a Series:
0 108
1 612
2 620
3 900
4 168
5 960
6 680
7 1248
8 312
9 264
10 768
11 564
12 516
13 1332
14 888
15 1596
dtype: int64
Finally, if you really want a horizontal DataFrame:
pd.DataFrame([s])
Gives you:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 108 612 620 900 168 960 680 1248 312 264 768 564 516 1332 888 1596
Since you've mentioned in a comment that you have many such files, you should simply store all the Series in a list, and construct a DataFrame with all of them at once when you're finished loading them all.
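A sketch of that pattern, where file_paths is a hypothetical list of your txt files; ravel flattens each grid row-major, replacing the per-row concat:
import pandas as pd

rows = []
for path in file_paths:  # hypothetical: the txt files to load
    # Read the grid and flatten it row-major into one long row.
    rows.append(pd.read_csv(path, header=None).to_numpy().ravel())
result = pd.DataFrame(rows)  # one row per file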
Long story short
I have a nested dictionary. When I turn it into a dataframe with
import pandas
pdf = pandas.DataFrame(nested_dict)
I get:
95 96 97 98 99 100 101 102 103 104 105 \
A 70019 102 4243 3083 3540 6311 4851 5938 4140 4659 3100
C 0 185 427 433 1190 910 3898 3869 2861 2149 3065
D 8 9 23463 1237 2574 4174 3640 4747 3557 4582 5934
E 141 89 5034 1576 2303 3416 2377 1252 1204 1703 718
F 7 12 1937 2246 1687 1154 1317 3473 1881 2221 3060
G 343 1550 13497 10659 12343 8213 9251 7341 6354 9058 9022
H 1 1978 1829 1394 1945 1003 1382 1489 4182 932 556
I 5 772 1361 3914 3255 3242 2808 3765 3284 2127 3120
K 3 10353 540 2364 1196 882 3439 2107 803 743 621
L 6 14 1599 11759 4571 4821 3450 5071 4364 1891 3677
M 1 6 158 211 524 2738 686 443 612 509 1721
N 6 186 299 2971 791 1440 2028 1163 1689 4296 1535
P 54 31 726 6208 7160 5494 6184 4282 3587 3727 3821
Q 10 87 1228 2233 1016 1801 1768 1693 3414 515 563
R 7 53939 3030 8904 6712 6134 5127 3223 4764 3768 6429
S 76 5213 3676 7480 9831 7666 5410 8185 7508 11237 8298
T 4369 1253 3087 2487 6559 4572 6863 3184 7352 6068 4756
V 732 5 7595 4331 5216 5444 5187 6013 4245 4545 4761
W 0 6 103 1225 598 888 601 713 1298 1323 908
Y 12 9 1968 1085 2787 5489 5529 7840 8691 9745 10136
Eventually I want to melt this dataframe down to look like the following:
residue residue_num count
A 95 70019
A 96 102
A 97 4243
....
The residue column is being used as the index, so I don't know how to replace it with a plain 0, 1, 2, 3 index and give the "A C D E F.." column another name.
EDIT
Answered myself as per suggestion
Answered from here and here
import pandas
pdf = pandas.DataFrame(the_matrix)
pdf = pdf.reset_index()
pdf.rename(columns={'index':'aa'},inplace=True)
pandas.melt(pdf,id_vars='aa',var_name="position",value_name="counts")
aa position counts
0 A 95 70019
1 C 95 0
2 D 95 8
3 E 95 141
4 F 95 7
5 G 95 343
6 H 95 1
7 I 95 5
8 K 95 3
Your pdf looks like a pivot table. Let's assume we have a dataframe with three columns. We can pivot it with a single function like this:
pivoted = df.pivot(index='col1',columns='col2',values='col3')
Unpivoting it back without losing the index requires a reset_index dance:
pivoted.reset_index().melt(id_vars=pivoted.index.name)
To get the exact original df:
pivoted.reset_index().melt(id_vars=pivoted.index.name, var_name='col2', value_name='col3')
PS. To my surprise, melt does not get a kwarg like keep_index=True. Enhancement suggestion is still open: https://github.com/pandas-dev/pandas/issues/17440
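For reference, a minimal round trip with the placeholder names from the snippets above:
import pandas as pd

# col1/col2/col3 are the placeholder column names used above.
df = pd.DataFrame({'col1': ['A', 'A', 'C', 'C'],
                   'col2': [95, 96, 95, 96],
                   'col3': [70019, 102, 0, 185]})
pivoted = df.pivot(index='col1', columns='col2', values='col3')
restored = (pivoted.reset_index()
                   .melt(id_vars=pivoted.index.name,
                         var_name='col2', value_name='col3'))
# restored holds the same rows as df, up to row order.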