Drop range of columns by labels - python

Suppose I had this large data frame:
In [31]: df
Out[31]:
A B C D E F G H I J ... Q R S T U V W X Y Z
0 0 1 2 3 4 5 6 7 8 9 ... 16 17 18 19 20 21 22 23 24 25
1 26 27 28 29 30 31 32 33 34 35 ... 42 43 44 45 46 47 48 49 50 51
2 52 53 54 55 56 57 58 59 60 61 ... 68 69 70 71 72 73 74 75 76 77
[3 rows x 26 columns]
which you can create using
alphabet = [chr(letter_i) for letter_i in range(ord('A'), ord('Z')+1)]
df = pd.DataFrame(np.arange(3*26).reshape(3, 26), columns=alphabet)
What's the best way to drop all columns between column 'D' and 'R' using the labels of the columns?
I found one ugly way to do it:
df.drop(df.columns[df.columns.get_loc('D'):df.columns.get_loc('R')+1], axis=1)

Here's my entry:
>>> df.drop(df.columns.to_series()["D":"R"], axis=1)
A B C S T U V W X Y Z
0 0 1 2 18 19 20 21 22 23 24 25
1 26 27 28 44 45 46 47 48 49 50 51
2 52 53 54 70 71 72 73 74 75 76 77
By converting df.columns from an Index to a Series, we can take advantage of the ["D":"R"]-style selection:
>>> df.columns.to_series()["D":"R"]
D D
E E
F F
G G
H H
I I
J J
... ...
Q Q
R R
dtype: object
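On newer pandas versions you can avoid the Index-to-Series conversion entirely: slice the columns with .loc (label slices are inclusive on both ends) and pass the resulting labels to drop. A minimal sketch, assuming a pandas version that supports the columns= keyword of drop:

```python
import numpy as np
import pandas as pd

alphabet = [chr(letter_i) for letter_i in range(ord('A'), ord('Z') + 1)]
df = pd.DataFrame(np.arange(3 * 26).reshape(3, 26), columns=alphabet)

# .loc label slices include both endpoints, so 'D':'R' covers D through R
result = df.drop(columns=df.loc[:, 'D':'R'].columns)
print(result.columns.tolist())  # ['A', 'B', 'C', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z']
```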

Here you are:
print(df.loc[:, 'A':'C'].join(df.loc[:, 'S':'Z']))
A B C S T U V W X Y Z
0 0 1 2 18 19 20 21 22 23 24 25
1 26 27 28 44 45 46 47 48 49 50 51
2 52 53 54 70 71 72 73 74 75 76 77

Here's another way ...
low = df.columns.get_slice_bound('D', 'left')
high = df.columns.get_slice_bound('R', 'right')
drops = df.columns[low:high]
print(df.drop(drops, axis=1))
A B C S T U V W X Y Z
0 0 1 2 18 19 20 21 22 23 24 25
1 26 27 28 44 45 46 47 48 49 50 51
2 52 53 54 70 71 72 73 74 75 76 77

Use numpy for more flexibility ... numpy compares strings element-wise and lexicographically (character by character, by code point):
import numpy as np
b = np.array(['A', 'B', 'C', 'D'])
print(b)
print(b > 'B')
gives:
['A' 'B' 'C' 'D']
[False False  True  True]
More involved selections are also easy:
b[np.logical_and(b > 'B', b < 'D')]
gives:
array(['C'], dtype='<U1')
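The same element-wise comparison works on a DataFrame's column Index, so the drop from the original question can also be written as a boolean mask over the columns. A sketch of the idea, not one of the posted answers:

```python
import numpy as np
import pandas as pd

alphabet = [chr(letter_i) for letter_i in range(ord('A'), ord('Z') + 1)]
df = pd.DataFrame(np.arange(3 * 26).reshape(3, 26), columns=alphabet)

# keep every column whose label falls outside the 'D'..'R' range
mask = (df.columns >= 'D') & (df.columns <= 'R')
result = df.loc[:, ~mask]
print(result.columns.tolist())  # ['A', 'B', 'C', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z']
```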

Related

percentile of the last value of a row

I am trying to get the percentile value of the last value in each row and store it in a new column, but I haven't been able to (I'm new to Python). So far I've only managed to compute the percentile values for a single row through indexing, which is not my desired output.
Here is the code:
df = pd.DataFrame(np.random.randint(20,60,size=(10, 7)), columns=list('ABCDEFG'))
values = df.loc[1][0:]
min_value = values.min()
max_value = values.max()
percentiles = ((values - min_value) / (max_value - min_value) * 100)
print(percentiles)
current output:
A B C D
0 35 45 25 38
2 35 31 28 55
3 59 38 44 40
4 40 57 30 52
5 20 51 31 48
6 52 24 39 49
7 47 59 39 47
8 20 42 21 26
9 27 53 38 56
This way I am getting the percentile values for one row only:
A 61.538462
B 65.384615
C 100.000000
D 61.538462
E 50.000000
F 96.153846
G 0.000000
desired output:
A B C D E F G Per
0 52 41 23 53 22 22 39 23.6
1 48 49 58 48 45 57 32 23.5
2 38 49 48 25 32 22 27 56.2
3 46 34 43 52 50 32 30 63.5
4 59 47 49 22 53 31 38 65.9
5 49 49 58 37 28 31 34 50.2
6 31 29 28 41 39 36 47 90.2
7 34 55 52 39 32 25 55 85.6
8 34 21 48 22 22 53 42 80.5
9 44 23 57 52 29 54 43 90.6
Per is the percentile of column G's value within the range of values in that row.
Try:
def perc_func(r):
    last_val = r.iloc[-1]
    min_val = r.min()
    max_val = r.max()
    return (last_val - min_val) / (max_val - min_val) * 100

df['Per'] = df.apply(perc_func, axis=1)
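For large frames, the same per-row percentile can be computed in one vectorized pass over the underlying array instead of applying a Python function row by row. A sketch, assuming (as in the question) that the last column is the one being ranked:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randint(20, 60, size=(10, 7)), columns=list('ABCDEFG'))

v = df.to_numpy()
row_min = v.min(axis=1)
row_max = v.max(axis=1)
# percentile of the last column within each row's min..max range
df['Per'] = (v[:, -1] - row_min) / (row_max - row_min) * 100
print(df['Per'].round(1).tolist())
```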

Finding the maximum difference for a subset of columns with pandas

I have a dataframe:
A B C D E
0 a 34 55 43 aa
1 b 53 77 65 bb
2 c 23 100 34 cc
3 d 54 43 23 dd
4 e 23 67 54 ee
5 f 43 98 23 ff
I need to get, for each row, the maximum difference between columns B, C and D, and store the result in column A. For example, in row 0 (the 'a' row) the maximum difference is 55 - 34 = 21. The data is in a DataFrame.
The expected result is
A B C D E
0 21 34 55 43 aa
1 24 53 77 65 bb
2 77 23 100 34 cc
3 31 54 43 23 dd
4 44 23 67 54 ee
5 75 43 98 23 ff
Use np.ptp:
# df['A'] = np.ptp(df.loc[:, 'B':'D'], axis=1)
df['A'] = np.ptp(df[['B', 'C', 'D']], axis=1)
df
A B C D E
0 21 34 55 43 aa
1 24 53 77 65 bb
2 77 23 100 34 cc
3 31 54 43 23 dd
4 44 23 67 54 ee
5 75 43 98 23 ff
Or, find the max and min yourself:
df['A'] = df[['B', 'C', 'D']].max(axis=1) - df[['B', 'C', 'D']].min(axis=1)
df
A B C D E
0 21 34 55 43 aa
1 24 53 77 65 bb
2 77 23 100 34 cc
3 31 54 43 23 dd
4 44 23 67 54 ee
5 75 43 98 23 ff
If performance is important, you can do this in NumPy space:
v = df[['B', 'C', 'D']].values
df['A'] = v.max(1) - v.min(1)
df
A B C D E
0 21 34 55 43 aa
1 24 53 77 65 bb
2 77 23 100 34 cc
3 31 54 43 23 dd
4 44 23 67 54 ee
5 75 43 98 23 ff

Pandas DataFrame query

I'd like to retrieve data based on a column name and that column's minimum and maximum values, but I can't figure out how to get that result. I am able to select data by column name, but I don't understand how to apply the limits.
The column names and their corresponding min and max values are given as a list of tuples.
import pandas as pd
import numpy as np
def c_cutoff(data_frame, column_cutoff):
    selected_data = data_frame.loc[:, [X[0] for X in column_cutoff]]
    return selected_data
np.random.seed(5)
df = pd.DataFrame(np.random.randint(100, size=(100, 6)),
columns=list('ABCDEF'),
index=['R{}'.format(i) for i in range(100)])
column_cutoffdata = [('B',27,78),('E',44,73)]
newdata_cutoff = c_cutoff(df,column_cutoffdata)
print(df.head())
print(newdata_cutoff)
result
B E
R0 78 73
R1 27 7
R2 53 44
R3 65 84
R4 9 1
...
Expected output
I want all values in B less than 27 or greater than 78 to be discarded, and likewise for E with its own bounds.
You can be rather explicit and do the following:
limiters = [('B', 27, 78), ('E', 44, 73)]
for lim in limiters:
    df = df[(df[lim[0]] >= lim[1]) & (df[lim[0]] <= lim[2])]
Yields:
A B C D E F
R0 99 78 61 16 73 8
R2 15 53 80 27 44 77
R8 30 62 11 67 65 55
R11 90 31 9 38 47 16
R15 16 64 8 90 44 37
R16 94 75 5 22 52 69
R46 11 30 26 8 51 61
R48 39 59 22 80 58 44
R66 55 38 5 49 58 15
R70 36 78 5 13 73 69
R72 70 58 52 99 67 11
R75 20 59 57 33 53 96
R77 32 31 89 49 69 41
R79 43 28 17 16 73 54
R80 45 34 90 67 69 70
R87 9 50 16 61 65 30
R90 43 56 76 7 47 62
pipe + where + between
You can't discard values from a single column on its own; that would change the column's length, and a DataFrame's columns must all have the same size.
But you can iterate and use pd.Series.where to replace out-of-range values with NaN. Note the Pandas way to feed a DataFrame through a function is via pipe:
import pandas as pd
import numpy as np
def c_cutoff(data_frame, column_cutoff):
    for col, min_val, max_val in column_cutoff:
        data_frame[col] = data_frame[col].where(data_frame[col].between(min_val, max_val))
    return data_frame
np.random.seed(5)
df = pd.DataFrame(np.random.randint(100, size=(100, 6)),
columns=list('ABCDEF'),
index=['R{}'.format(i) for i in range(100)])
column_cutoffdata = [('B',27,78),('E',44,73)]
print(df.head())
# A B C D E F
# R0 99 78 61 16 73 8
# R1 62 27 30 80 7 76
# R2 15 53 80 27 44 77
# R3 75 65 47 30 84 86
# R4 18 9 41 62 1 82
newdata_cutoff = df.pipe(c_cutoff, column_cutoffdata)
print(newdata_cutoff.head())
# A B C D E F
# R0 99 78.0 61 16 73.0 8
# R1 62 27.0 30 80 NaN 76
# R2 15 53.0 80 27 44.0 77
# R3 75 65.0 47 30 NaN 86
# R4 18 NaN 41 62 NaN 82
If you want to drop rows with any NaN values, you can then use dropna:
newdata_cutoff = newdata_cutoff.dropna()
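When dropping rows outright is what you want (rather than masking values with NaN), DataFrame.query expresses both range filters in one readable expression. A sketch with the bounds hard-coded from the question:

```python
import numpy as np
import pandas as pd

np.random.seed(5)
df = pd.DataFrame(np.random.randint(100, size=(100, 6)),
                  columns=list('ABCDEF'),
                  index=['R{}'.format(i) for i in range(100)])

# chained comparisons are allowed inside query strings
result = df.query('27 <= B <= 78 and 44 <= E <= 73')
print(result.head())
```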

Split a Pandas Dataframe into multiple Dataframes based on Triangular Number Series

I have a DataFrame (df) and I need to split it into n number of Dataframes based on the column numbers. But, it has to follow the Triangular Series pattern:
df1 = df[[0]]
df2 = df[[1,2]]
df3 = df[[3,4,5]]
df4 = df[[6,7,8,9]]
etc.
Consider the dataframe df
df = pd.DataFrame(
np.arange(100).reshape(10, 10),
columns=list('ABCDEFGHIJ')
)
df
A B C D E F G H I J
0 0 1 2 3 4 5 6 7 8 9
1 10 11 12 13 14 15 16 17 18 19
2 20 21 22 23 24 25 26 27 28 29
3 30 31 32 33 34 35 36 37 38 39
4 40 41 42 43 44 45 46 47 48 49
5 50 51 52 53 54 55 56 57 58 59
6 60 61 62 63 64 65 66 67 68 69
7 70 71 72 73 74 75 76 77 78 79
8 80 81 82 83 84 85 86 87 88 89
9 90 91 92 93 94 95 96 97 98 99
i_s, j_s = np.arange(4).cumsum(), np.arange(1, 5).cumsum()
df1, df2, df3, df4 = [
df.iloc[:, i:j] for i, j in zip(i_s, j_s)
]
Verify
pd.concat(dict(enumerate([df.iloc[:, i:j] for i, j in zip(i_s, j_s)])), axis=1)
0 1 2 3
A B C D E F G H I J
0 0 1 2 3 4 5 6 7 8 9
1 10 11 12 13 14 15 16 17 18 19
2 20 21 22 23 24 25 26 27 28 29
3 30 31 32 33 34 35 36 37 38 39
4 40 41 42 43 44 45 46 47 48 49
5 50 51 52 53 54 55 56 57 58 59
6 60 61 62 63 64 65 66 67 68 69
7 70 71 72 73 74 75 76 77 78 79
8 80 81 82 83 84 85 86 87 88 89
9 90 91 92 93 94 95 96 97 98 99
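The cumsum trick above hard-codes four slices; it can be generalized to any column count by solving k(k+1)/2 >= n for the number of blocks. A sketch:

```python
import numpy as np
import pandas as pd

def triangular_split(df):
    """Split df column-wise into blocks of width 1, 2, 3, ..."""
    n = df.shape[1]
    # smallest k with k*(k+1)/2 >= n
    k = int(np.ceil((np.sqrt(8 * n + 1) - 1) / 2))
    bounds = np.arange(k + 1).cumsum()  # 0, 1, 3, 6, 10, ...
    return [df.iloc[:, i:j] for i, j in zip(bounds, bounds[1:])]

df = pd.DataFrame(np.arange(100).reshape(10, 10), columns=list('ABCDEFGHIJ'))
parts = triangular_split(df)
print([p.shape[1] for p in parts])  # [1, 2, 3, 4]
```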
First compute the triangular-number boundaries, then slice the dataframe with them:
n = len(df.columns)
bounds = []
end, i = 0, 1
while end < n:
    begin = end
    end = i * (i + 1) // 2
    bounds.append((begin, min(end, n)))
    i += 1
dfs = [df.iloc[:, b:e] for b, e in bounds]

Shuffle DataFrame rows except the first row

I am trying to randomize all rows in a data frame except for the first. I would like for the first row to always appear first, and the remaining rows can be in any randomized order.
My data frame is:
df = pd.DataFrame(np.random.randn(10, 5), columns=['a', 'b', 'c', 'd', 'e'])
Any suggestions as to how I can approach this?
try this:
df = pd.concat([df[:1], df[1:].sample(frac=1)]).reset_index(drop=True)
test:
In [38]: df
Out[38]:
a b c d e
0 2.070074 2.216060 -0.015823 0.686516 -0.738393
1 -1.213517 0.994057 0.634805 0.517844 -0.128375
2 0.937532 0.814923 -0.231120 1.970019 1.438927
3 1.499967 0.105707 1.255207 0.929084 -3.359826
4 0.418702 -0.894226 -1.088968 0.631398 0.152026
5 1.214119 -0.122633 0.983818 -0.445202 -0.807955
6 0.252078 -0.258703 -0.445209 -0.179094 1.180077
7 1.428827 -0.569009 -0.718485 0.161108 1.300349
8 -1.403100 2.154548 -0.492264 -0.544538 -0.061745
9 0.468671 0.004839 -0.738240 -0.385624 -0.532640
In [39]: df = pd.concat([df[:1], df[1:].sample(frac=1)]).reset_index(drop=True)
In [40]: df
Out[40]:
a b c d e
0 2.070074 2.216060 -0.015823 0.686516 -0.738393
1 0.468671 0.004839 -0.738240 -0.385624 -0.532640
2 0.418702 -0.894226 -1.088968 0.631398 0.152026
3 -1.213517 0.994057 0.634805 0.517844 -0.128375
4 1.428827 -0.569009 -0.718485 0.161108 1.300349
5 0.937532 0.814923 -0.231120 1.970019 1.438927
6 0.252078 -0.258703 -0.445209 -0.179094 1.180077
7 1.499967 0.105707 1.255207 0.929084 -3.359826
8 -1.403100 2.154548 -0.492264 -0.544538 -0.061745
9 1.214119 -0.122633 0.983818 -0.445202 -0.807955
Use numpy's shuffle
import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(100).reshape(20, 5), columns=list('ABCDE'))
np.random.shuffle(df.values[1:, :])
print(df)
A B C D E
0 0 1 2 3 4
1 55 56 57 58 59
2 10 11 12 13 14
3 80 81 82 83 84
4 90 91 92 93 94
5 70 71 72 73 74
6 25 26 27 28 29
7 40 41 42 43 44
8 65 66 67 68 69
9 5 6 7 8 9
10 45 46 47 48 49
11 85 86 87 88 89
12 15 16 17 18 19
13 30 31 32 33 34
14 60 61 62 63 64
15 20 21 22 23 24
16 35 36 37 38 39
17 95 96 97 98 99
18 75 76 77 78 79
19 50 51 52 53 54
np.random.shuffle shuffles an ndarray in place along its first axis. The dataframe is just a wrapper on an ndarray, which you can access with the values attribute (mutating it in place only works while values is a view, i.e. when all columns share a single dtype). To specify that all but the first row get shuffled, operate on the array slice [1:, :].
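An alternative that avoids mutating the frame's internal array (and therefore also works for mixed-dtype frames, where .values is a copy) is to build a row-position permutation that keeps position 0 fixed and reindex with iloc. A sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(100).reshape(20, 5), columns=list('ABCDE'))

# positions 1..n-1 in random order, with 0 kept at the front
order = np.r_[0, 1 + np.random.permutation(len(df) - 1)]
shuffled = df.iloc[order].reset_index(drop=True)
print(shuffled.head())
```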
