percentile of the last value of a row - python

I am trying to get the percentile value of the last value in each row and store it in a new column, but I have been unable to (I'm new to Python). What I have been able to achieve is the percentile value of a single row through indexing, which is not my desired output.
Here is the code:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(20, 60, size=(10, 7)), columns=list('ABCDEFG'))
values = df.loc[1]  # all values of row 1
min_value = values.min()
max_value = values.max()
percentiles = (values - min_value) / (max_value - min_value) * 100
print(percentiles)
current output:
A B C D
0 35 45 25 38
2 35 31 28 55
3 59 38 44 40
4 40 57 30 52
5 20 51 31 48
6 52 24 39 49
7 47 59 39 47
8 20 42 21 26
9 27 53 38 56
This is the percentile value I am getting for a single row:
A 61.538462
B 65.384615
C 100.000000
D 61.538462
E 50.000000
F 96.153846
G 0.000000
desired output:
A B C D E F G Per
0 52 41 23 53 22 22 39 23.6
1 48 49 58 48 45 57 32 23.5
2 38 49 48 25 32 22 27 56.2
3 46 34 43 52 50 32 30 63.5
4 59 47 49 22 53 31 38 65.9
5 49 49 58 37 28 31 34 50.2
6 31 29 28 41 39 36 47 90.2
7 34 55 52 39 32 25 55 85.6
8 34 21 48 22 22 53 42 80.5
9 44 23 57 52 29 54 43 90.6
Per is the percentile of the value in column G relative to the other values in that row.

Try:
def perc_func(row):
    last_val = row.iloc[-1]   # value of the last column
    min_val = row.min()
    max_val = row.max()
    percentile = (last_val - min_val) / (max_val - min_val) * 100
    return percentile

df['Per'] = df.apply(perc_func, axis=1)
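For larger frames, the same result can be computed without apply. A minimal vectorized sketch, assuming column G is always the last column being ranked:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(20, 60, size=(10, 7)), columns=list('ABCDEFG'))

# Scale the last column into a 0-100 range relative to each row's min and max.
row_min = df.min(axis=1)
row_max = df.max(axis=1)
df['Per'] = (df['G'] - row_min) / (row_max - row_min) * 100
print(df)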

Related

Place data from a Pandas DF into a Grid or Template

I have a process whose end product is a pandas DataFrame. The output, which varies in data and length, is structured like this example:
9 80340796
10 80340797
11 80340798
12 80340799
13 80340800
14 80340801
15 80340802
16 80340803
17 80340804
18 80340805
19 80340806
20 80340807
21 80340808
22 80340809
23 80340810
24 80340811
25 80340812
26 80340813
27 80340814
28 80340815
29 80340816
30 80340817
31 80340818
32 80340819
33 80340820
34 80340821
35 80340822
36 80340823
37 80340824
38 80340825
39 80340826
40 80340827
41 80340828
42 80340829
43 80340830
44 80340831
45 80340832
46 80340833
I need to get the numbers in the second column above, into the following grid format based on the numbers in the first column above.
1 2 3 4 5 6 7 8 9 10 11 12
A 1 9 17 25 33 41 49 57 65 73 81 89
B 2 10 18 26 34 42 50 58 66 74 82 90
C 3 11 19 27 35 43 51 59 67 75 83 91
D 4 12 20 28 36 44 52 60 68 76 84 92
E 5 13 21 29 37 45 53 61 69 77 85 93
F 6 14 22 30 38 46 54 62 70 78 86 94
G 7 15 23 31 39 47 55 63 71 79 87 95
H 8 16 24 32 40 48 56 64 72 80 88 96
So the end result in this example would be that grid, populated with the corresponding values from the second column.
Any advice on how to go about this would be much appreciated. I've been asked for this by a colleague, so the data is easy to read for their team (as it matches the layout of a physical test) but I have no idea how to produce it.
A pandas pivot table can do what you want, but first you have to create two auxiliary columns: one determining which column each value goes in, and one determining which row. You can get them as shown in the following example:
import numpy as np
import pandas as pd
df = pd.DataFrame({'num': list(range(9, 28)), 'val': list(range(80001, 80020))})
max_rows = 8
df['row'] = (df['num'] - 1) % max_rows
df['col'] = np.ceil(df['num'] / max_rows).astype(int)
df.pivot_table(values=['val'], columns=['col'], index=['row'])
val
col 2 3 4
row
0 80001.0 80009.0 80017.0
1 80002.0 80010.0 80018.0
2 80003.0 80011.0 80019.0
3 80004.0 80012.0 NaN
4 80005.0 80013.0 NaN
5 80006.0 80014.0 NaN
6 80007.0 80015.0 NaN
7 80008.0 80016.0 NaN
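To get the letter row labels A-H from the question's grid rather than 0-7, one hedged variation (assuming the grid always has 8 rows) maps the sequence number to a letter before pivoting:
import string

import numpy as np
import pandas as pd

df = pd.DataFrame({'num': list(range(9, 28)), 'val': list(range(80001, 80020))})

# Map each sequence number onto a letter row (A-H) and a 1-based column number.
df['row'] = [string.ascii_uppercase[(n - 1) % 8] for n in df['num']]
df['col'] = (df['num'] - 1) // 8 + 1

grid = df.pivot_table(values='val', index='row', columns='col')
print(grid)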

Pandas DataFrame query

I'd like to retrieve data based on a column name and its minimum and maximum values, but I am not able to figure out how to get that result. I am able to select data based on column name, but I don't understand how to apply the limits.
The column names and their corresponding min and max values are given as a list of tuples.
import pandas as pd
import numpy as np
def c_cutoff(data_frame, column_cutoff):
    selected_data = data_frame.loc[:, [X[0] for X in column_cutoff]]
    return selected_data

np.random.seed(5)
df = pd.DataFrame(np.random.randint(100, size=(100, 6)),
                  columns=list('ABCDEF'),
                  index=['R{}'.format(i) for i in range(100)])
column_cutoffdata = [('B',27,78),('E',44,73)]
newdata_cutoff = c_cutoff(df,column_cutoffdata)
print(df.head())
print(newdata_cutoff)
result
B E
R0 78 73
R1 27 7
R2 53 44
R3 65 84
R4 9 1
..
.
Expected output
I want all values in B less than 27 or greater than 78 to be discarded, and the same for E (anything outside 44 to 73).
You can be rather explicit and do the following:
limiters = [('B', 27, 78), ('E', 44, 73)]
for lim in limiters:
    df = df[(df[lim[0]] >= lim[1]) & (df[lim[0]] <= lim[2])]
Yields:
A B C D E F
R0 99 78 61 16 73 8
R2 15 53 80 27 44 77
R8 30 62 11 67 65 55
R11 90 31 9 38 47 16
R15 16 64 8 90 44 37
R16 94 75 5 22 52 69
R46 11 30 26 8 51 61
R48 39 59 22 80 58 44
R66 55 38 5 49 58 15
R70 36 78 5 13 73 69
R72 70 58 52 99 67 11
R75 20 59 57 33 53 96
R77 32 31 89 49 69 41
R79 43 28 17 16 73 54
R80 45 34 90 67 69 70
R87 9 50 16 61 65 30
R90 43 56 76 7 47 62
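Since the question title mentions query, here is a minimal sketch that builds a query string from the same limits list instead of chaining boolean masks; pandas' query supports chained comparisons like 27 <= B <= 78:
import numpy as np
import pandas as pd

np.random.seed(5)
df = pd.DataFrame(np.random.randint(100, size=(100, 6)),
                  columns=list('ABCDEF'),
                  index=['R{}'.format(i) for i in range(100)])

limiters = [('B', 27, 78), ('E', 44, 73)]

# Builds "27 <= B <= 78 and 44 <= E <= 73" and lets query() evaluate it.
expr = ' and '.join('{lo} <= {col} <= {hi}'.format(col=col, lo=lo, hi=hi)
                    for col, lo, hi in limiters)
print(df.query(expr))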
pipe + where + between
You can't simply discard individual values in a column, since that would change its length and a DataFrame's columns must all be the same size.
But you can iterate and use pd.Series.where to replace out-of-range values with NaN. Note that the Pandas way to feed a DataFrame through a function is via pipe:
import pandas as pd
import numpy as np
def c_cutoff(data_frame, column_cutoff):
    for col, min_val, max_val in column_cutoff:
        data_frame[col] = data_frame[col].where(data_frame[col].between(min_val, max_val))
    return data_frame

np.random.seed(5)
df = pd.DataFrame(np.random.randint(100, size=(100, 6)),
                  columns=list('ABCDEF'),
                  index=['R{}'.format(i) for i in range(100)])
column_cutoffdata = [('B', 27, 78), ('E', 44, 73)]
print(df.head())
# A B C D E F
# R0 99 78 61 16 73 8
# R1 62 27 30 80 7 76
# R2 15 53 80 27 44 77
# R3 75 65 47 30 84 86
# R4 18 9 41 62 1 82
newdata_cutoff = df.pipe(c_cutoff, column_cutoffdata)
print(newdata_cutoff.head())
# A B C D E F
# R0 99 78.0 61 16 73.0 8
# R1 62 27.0 30 80 NaN 76
# R2 15 53.0 80 27 44.0 77
# R3 75 65.0 47 30 NaN 86
# R4 18 NaN 41 62 NaN 82
If you want to drop rows with any NaN values, you can then use dropna:
newdata_cutoff = newdata_cutoff.dropna()

Split a Pandas Dataframe into multiple Dataframes based on Triangular Number Series

I have a DataFrame (df) and I need to split it into n DataFrames based on column position. The split has to follow the triangular number series pattern:
df1 = df[[0]]
df2 = df[[1,2]]
df3 = df[[3,4,5]]
df4 = df[[6,7,8,9]]
etc.
Consider the dataframe df
df = pd.DataFrame(
    np.arange(100).reshape(10, 10),
    columns=list('ABCDEFGHIJ')
)
df
A B C D E F G H I J
0 0 1 2 3 4 5 6 7 8 9
1 10 11 12 13 14 15 16 17 18 19
2 20 21 22 23 24 25 26 27 28 29
3 30 31 32 33 34 35 36 37 38 39
4 40 41 42 43 44 45 46 47 48 49
5 50 51 52 53 54 55 56 57 58 59
6 60 61 62 63 64 65 66 67 68 69
7 70 71 72 73 74 75 76 77 78 79
8 80 81 82 83 84 85 86 87 88 89
9 90 91 92 93 94 95 96 97 98 99
i_s, j_s = np.arange(4).cumsum(), np.arange(1, 5).cumsum()
df1, df2, df3, df4 = [
    df.iloc[:, i:j] for i, j in zip(i_s, j_s)
]
Verify
pd.concat(dict(enumerate([df.iloc[:, i:j] for i, j in zip(i_s, j_s)])), axis=1)
0 1 2 3
A B C D E F G H I J
0 0 1 2 3 4 5 6 7 8 9
1 10 11 12 13 14 15 16 17 18 19
2 20 21 22 23 24 25 26 27 28 29
3 30 31 32 33 34 35 36 37 38 39
4 40 41 42 43 44 45 46 47 48 49
5 50 51 52 53 54 55 56 57 58 59
6 60 61 62 63 64 65 66 67 68 69
7 70 71 72 73 74 75 76 77 78 79
8 80 81 82 83 84 85 86 87 88 89
9 90 91 92 93 94 95 96 97 98 99
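The hard-coded 4 above works because the example has 10 columns. A sketch that derives the number of groups from the column count instead, assuming that count is exactly a triangular number:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(100).reshape(10, 10), columns=list('ABCDEFGHIJ'))

# Infer how many triangular groups fit: k*(k+1)/2 == number of columns.
n_cols = df.shape[1]
k = int((np.sqrt(8 * n_cols + 1) - 1) // 2)

starts = np.arange(k).cumsum()        # 0, 1, 3, 6
stops = np.arange(1, k + 1).cumsum()  # 1, 3, 6, 10
sub_dfs = [df.iloc[:, i:j] for i, j in zip(starts, stops)]

for sub in sub_dfs:
    print(sub.columns.tolist())       # ['A'], ['B', 'C'], ['D', 'E', 'F'], ...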
First compute the triangular number series boundaries, then use them to slice the dataframe:
n = len(df.columns)
res = []
end = 0
i = 1
while end < n:
    begin = end
    end = i * (i + 1) // 2
    res.append((begin, min(end, n)))
    i += 1

sub_dfs = [df.iloc[:, begin:end] for begin, end in res]
for sub_df in sub_dfs:
    print(sub_df)

Python For Loop Triangle [closed]

I'm trying to make a triangle that looks like this
10
11 12
13 14 15
16 17 18 19
20 21 22 23 24
25 26 27 28 29 30
31 32 33 34 35 36 37
38 39 40 41 42 43 44 45
46 47 48 49 50 51 52 53 54
I am trying to use two for loops with one nested. Here is as close as I have gotten so far.
for j in range(11):
    print(end='\n')
    for i in range(j+1):
        print(i+j, '', end='')
    print(end='\n')
I'm pretty sure I need to create a variable, but not really sure how to incorporate it into the loop.
Here you go:
>>> a = range(10, 55)
>>> for i in range(10):
...     print(' '.join(repr(e) for e in a[:i+1]))
...     a = a[i+1:]
...
10
11 12
13 14 15
16 17 18 19
20 21 22 23 24
25 26 27 28 29 30
31 32 33 34 35 36 37
38 39 40 41 42 43 44 45
46 47 48 49 50 51 52 53 54
How about short and simple like this:
k = 10
for i in range(9):
    for j in range(i+1):
        print(k, end=' ')
        k += 1
    print('')
Here is another single for loop based solution:
number = 10
for line_length in range(9):
    print(*range(number, number + line_length + 1))
    number += line_length + 1
Giving:
10
11 12
13 14 15
16 17 18 19
20 21 22 23 24
25 26 27 28 29 30
31 32 33 34 35 36 37
38 39 40 41 42 43 44 45
46 47 48 49 50 51 52 53 54
I like Maelstrom's short and sweet answer, but if you want to look at it mathematically, you might do something like this instead:
>>> for i in range(1, 10):
...     j = 10 + i * (i - 1) // 2
...     print(*range(j, j + i))  # This line edited per lvc's comment
...
10
11 12
13 14 15
16 17 18 19
20 21 22 23 24
25 26 27 28 29 30
31 32 33 34 35 36 37
38 39 40 41 42 43 44 45
46 47 48 49 50 51 52 53 54
Use one variable to keep track of the current number and one to keep track of the tier you are on:
num = 10
tier = 1
tiers = 9
for i in range(tiers):
    for j in range(tier):
        print(num, end=' ')
        num = num + 1
    print()
    tier = tier + 1
You can change the triangle height by adjusting the triangle_height variable and the starting element by changing print_number.
print_number = 10
triangle_height = 9

for level_element_count in range(triangle_height):
    print('\n')
    while level_element_count > -1:
        print(print_number, '', end='')
        print_number += 1
        level_element_count -= 1
print('\n')
Just for fun.
This is related to a common pattern where you divide a given sequence (here, the numbers from 10 to 54, inclusive) into non-overlapping 'windows', to do some analysis on, say, 10 values at a time. The twist here is that each window is one element larger than the last.
This looks like a job for itertools!
import itertools as it

def increasing_windows(i, start=1, step=1):
    '''Yield non-overlapping windows from iterable `i`,
    increasing in size from `start` by `step`.
    '''
    pos = 0
    for size in it.count(start, step):
        yield it.islice(i, pos, pos + size)
        pos += size

for line in it.islice(increasing_windows(range(10, 55)), 9):
    print(*line)
Try this
counter = 10
for i in range(1, 10):
    output = ""
    for j in range(i):
        output = output + " " + str(counter)
        counter += 1
    print(output.strip())
Output:
10
11 12
13 14 15
16 17 18 19
20 21 22 23 24
25 26 27 28 29 30
31 32 33 34 35 36 37
38 39 40 41 42 43 44 45
46 47 48 49 50 51 52 53 54
Explanation:
The outer loop controls the height of the triangle and the inner loop builds the content of each row, and hence its width. We need to convert each integer to a string before concatenating. We build each row in a string variable during the inner loop and print it once that loop finishes. The key is to run the inner loop as many times as the current iteration number of the outer loop.
You could write a generator:
def number_triangle(start, nrows):
    current = start
    for length in range(1, nrows+1):
        yield range(current, current+length)
        current += length

>>> for row in number_triangle(10, 9):
...     print(*row)
10
11 12
13 14 15
16 17 18 19
20 21 22 23 24
25 26 27 28 29 30
31 32 33 34 35 36 37
38 39 40 41 42 43 44 45
46 47 48 49 50 51 52 53 54
>>> for row in number_triangle(1, 12):
...     print(*row)
1
2 3
4 5 6
7 8 9 10
11 12 13 14 15
16 17 18 19 20 21
22 23 24 25 26 27 28
29 30 31 32 33 34 35 36
37 38 39 40 41 42 43 44 45
46 47 48 49 50 51 52 53 54 55
56 57 58 59 60 61 62 63 64 65 66
67 68 69 70 71 72 73 74 75 76 77 78
Or you could have an infinite generator and leave it up to the caller to control how many rows to generate:
def number_triangle(start=0):
    length = 1
    while True:
        yield range(start, start+length)
        start += length
        length += 1

>>> nt = number_triangle()
>>> for i in range(15):
...     print(*next(nt))
0
1 2
3 4 5
6 7 8 9
10 11 12 13 14
15 16 17 18 19 20
21 22 23 24 25 26 27
28 29 30 31 32 33 34 35
36 37 38 39 40 41 42 43 44
45 46 47 48 49 50 51 52 53 54
55 56 57 58 59 60 61 62 63 64 65
66 67 68 69 70 71 72 73 74 75 76 77
78 79 80 81 82 83 84 85 86 87 88 89 90
91 92 93 94 95 96 97 98 99 100 101 102 103 104
105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
No need for a nested loop:
a = range(10, 55)
flag = 0
current = 0
for i, e in enumerate(a):
    print(e, end=' ')
    if flag == i:
        current += 1
        flag = i + 1 + current
        print()

Drop range of columns by labels

Suppose I had this large data frame:
In [31]: df
Out[31]:
A B C D E F G H I J ... Q R S T U V W X Y Z
0 0 1 2 3 4 5 6 7 8 9 ... 16 17 18 19 20 21 22 23 24 25
1 26 27 28 29 30 31 32 33 34 35 ... 42 43 44 45 46 47 48 49 50 51
2 52 53 54 55 56 57 58 59 60 61 ... 68 69 70 71 72 73 74 75 76 77
[3 rows x 26 columns]
which you can create using
alphabet = [chr(letter_i) for letter_i in range(ord('A'), ord('Z')+1)]
df = pd.DataFrame(np.arange(3*26).reshape(3, 26), columns=alphabet)
What's the best way to drop all columns between column 'D' and 'R' using the labels of the columns?
I found one ugly way to do it:
df.drop(df.columns[df.columns.get_loc('D'):df.columns.get_loc('R')+1], axis=1)
Here's my entry:
>>> df.drop(df.columns.to_series()["D":"R"], axis=1)
A B C S T U V W X Y Z
0 0 1 2 18 19 20 21 22 23 24 25
1 26 27 28 44 45 46 47 48 49 50 51
2 52 53 54 70 71 72 73 74 75 76 77
By converting df.columns from an Index to a Series, we can take advantage of the ["D":"R"]-style selection:
>>> df.columns.to_series()["D":"R"]
D D
E E
F F
G G
H H
I I
J J
... ...
Q Q
R R
dtype: object
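On newer pandas versions, a similar label slice can also be taken straight from .loc; a minimal sketch of that variant:
import numpy as np
import pandas as pd

alphabet = [chr(letter_i) for letter_i in range(ord('A'), ord('Z') + 1)]
df = pd.DataFrame(np.arange(3 * 26).reshape(3, 26), columns=alphabet)

# .loc label slices are inclusive of both endpoints, so this drops 'D' through 'R'.
result = df.drop(columns=df.loc[:, 'D':'R'].columns)
print(result)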
Here you are:
print(df.loc[:, 'A':'C'].join(df.loc[:, 'S':'Z']))
Out[1]:
A B C S T U V W X Y Z
0 0 1 2 18 19 20 21 22 23 24 25
1 26 27 28 44 45 46 47 48 49 50 51
2 52 53 54 70 71 72 73 74 75 76 77
Here's another way ...
low, high = df.columns.slice_locs('D', 'R')
drops = df.columns[low:high]
print(df.drop(drops, axis=1))
A B C S T U V W X Y Z
0 0 1 2 18 19 20 21 22 23 24 25
1 26 27 28 44 45 46 47 48 49 50 51
2 52 53 54 70 71 72 73 74 75 76 77
Use numpy for more flexibility: numpy compares strings element-wise and lexicographically, so you can compare letters directly:
import numpy as np

array = np.array(['A', 'B', 'C', 'D'])
print(array)
print(array > 'B')
gives:
['A' 'B' 'C' 'D']
[False False  True  True]
More difficult selections are also easily possible:
array[np.logical_and(array > 'B', array < 'D')]
gives:
array(['C'], dtype='<U1')
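Applied to the original frame, a short sketch of the same element-wise comparison on the column Index to drop 'D' through 'R':
import numpy as np
import pandas as pd

alphabet = [chr(letter_i) for letter_i in range(ord('A'), ord('Z') + 1)]
df = pd.DataFrame(np.arange(3 * 26).reshape(3, 26), columns=alphabet)

# Keep only columns lexicographically outside the 'D'..'R' range.
mask = (df.columns < 'D') | (df.columns > 'R')
print(df.loc[:, mask])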
