I have the following DataFrame:
1-A-873 2-A-129 3-A-123
12/12/20 45 32 41
13/12/20 94 56 87
14/12/20 12 42 84
15/12/20 73 24 25
Each column represents a piece of equipment. Each piece of equipment has a size that is declared in the code:
size_1A = 5
size_2A = 3
size_3A = 7
Every column needs to be divided by its equipment size, that is, value / size.
This is what I am using:
df["1A-NewValue"] = df["1-A-873"] / size_1A
df["2A-NewValue"] = df["2-A-129"] / size_2A
df["3A-NewValue"] = df["3-A-123"] / size_3A
End result:
1-A-873 2-A-129 3-A-123 1A-NewValue 2A-NewValue 3A-NewValue
12/12/20 45 32 41 9 10.67 5.86
13/12/20 94 56 87 18.8 18.67 12.43
14/12/20 12 42 84 2.4 14 12
15/12/20 73 24 25 14.6 8 3.57
This works perfectly and does what I want, adding three extra columns at the end of the DataFrame.
However, this will become tedious if my total number of equipment increases to 250 instead of 3; I will need 250 lines for the equipment sizes and 250 lines for the formula.
Naturally, the first thing that comes to mind is a for loop, but is there a more pandas-like way of doing this efficiently?
Thanks!
You can create a dictionary of sizes, rename the columns by splitting each name on - and joining the first two parts so that they match the dictionary keys, and then divide:
d = {'1A': 5, '2A':3, '3A':7}
f = lambda x: ''.join(x.split('-')[:2])
df = df.join(df.rename(columns=f).div(d).add_suffix(' NewValue'))
print (df)
1-A-873 2-A-129 3-A-123 1A NewValue 2A NewValue 3A NewValue
12/12/20 45 32 41 9.0 10.666667 5.857143
13/12/20 94 56 87 18.8 18.666667 12.428571
14/12/20 12 42 84 2.4 14.000000 12.000000
15/12/20 73 24 25 14.6 8.000000 3.571429
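For reference, here is a self-contained sketch of the same idea that can be pasted and run as is. It assumes the sample data from the question; the names sizes, short, new_cols and result are only illustrative, and the sizes are wrapped in a Series so that the division aligns on column labels explicitly.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "1-A-873": [45, 94, 12, 73],
        "2-A-129": [32, 56, 42, 24],
        "3-A-123": [41, 87, 84, 25],
    },
    index=["12/12/20", "13/12/20", "14/12/20", "15/12/20"],
)

# Equipment sizes keyed by the shortened name (values from the question).
sizes = pd.Series({"1A": 5, "2A": 3, "3A": 7})

# Shorten "1-A-873" to "1A" so the column labels match the keys in sizes.
short = df.rename(columns=lambda c: "".join(c.split("-")[:2]))

# Column-wise division aligns on labels; the suffix marks the new columns.
new_cols = short.div(sizes, axis="columns").add_suffix("-NewValue")
result = df.join(new_cols)
print(result.round(2))
```

With 250 pieces of equipment only the sizes mapping grows; the three lines that do the work stay the same.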
My dataframe is called df_results.
It looks like this:
Cust Code Price
=====================
1 A 98
1 B 25
1 C 74
1 D 55
1 E 15
1 F 32
1 G 71
2 A 10
2 K 52
2 M 33
2 S 14
99 K 10
99 N 24
99 S 26
99 A 49
99 W 50
99 J 52
99 Q 55
99 U 68
99 C 73
99 Z 74
99 P 82
99 E 92
. . .
. . .
. . .
I am trying to break each customer's prices into categories by percentile.
Cust 99's prices are 10 24 26 49 50 52 55 68 73 74 82 92
Therefore, for this customer, the percentile values are:
25% ==> 31.75
50% ==> 53.5
75% ==> 73.75
100% ==> 92
The prices for each customer then have to be averaged within the percentile bucket they belong to:
Cust Code Price Perctile Perce_Value Perc_Avg
=====================================================
99 K 10 25% 31.75 20
99 N 24 25% 31.75 20
99 S 26 25% 31.75 20
99 A 49 50% 53.5 50.33
99 W 50 50% 53.5 50.33
99 J 52 50% 53.5 50.33
99 Q 55 75% 73.75 65.33
99 U 68 75% 73.75 65.33
99 C 73 75% 73.75 65.33
99 Z 74 100% 92 82.66
99 P 82 100% 92 82.66
99 E 92 100% 92 82.66
I managed to do that by looping through the dataframe multiple times,
which is not efficient, and I believe there must be a better solution.
Is there a better way to do that?
EDIT
I tried using a lambda function.
Step 1: find Percentile_Value
df_results["Percentile_Value"] = df_results.apply(lambda x: np.percentile(x["Price"],25), axis=1)
but this did not give me the percentile; it just repeated Price into Percentile_Value as is.
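The lambda repeats Price because apply with axis=1 hands over one row at a time, so x["Price"] is a single number and its 25th percentile is the number itself. One possible vectorised approach, sketched below, is to bucket prices per customer with groupby and pd.qcut and then average within each bucket. The column name Bucket is made up for this sketch, and only customers 2 and 99 from the question are included. The 31.75 and 73.75 thresholds in the question appear to follow a different percentile convention than pandas' default linear interpolation, so the sketch does not try to reproduce the Perce_Value column, only the grouping and the averages.

```python
import pandas as pd

df_results = pd.DataFrame({
    "Cust": [2] * 4 + [99] * 12,
    "Code": list("AKMS") + list("KNSAWJQUCZPE"),
    "Price": [10, 52, 33, 14, 10, 24, 26, 49, 50, 52, 55, 68, 73, 74, 82, 92],
})

# Quartile bucket (0..3) of each price within its own customer.
df_results["Bucket"] = df_results.groupby("Cust")["Price"].transform(
    lambda prices: pd.qcut(prices, q=4, labels=False)
)

# Mean price of each (customer, bucket) pair, broadcast back to every row.
df_results["Perc_Avg"] = (
    df_results.groupby(["Cust", "Bucket"])["Price"].transform("mean")
)

print(df_results[df_results["Cust"] == 99])
```

pd.qcut raises an error when a customer's prices produce duplicate bin edges (for example, many identical prices); passing duplicates="drop" is one way to handle that, at the cost of fewer buckets.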
I'm running into a strange issue with scipy's percentileofscore function.
In Excel, I have the following rows:
0
1
3
3
3
3
3
4
6
8
9
11
11
11
12
45
Next, I have a column that calculates PERCENTRANK.INC for each row:
=100 * (1-PERCENTRANK.INC($A:$A,A1))
The results are as follows:
100
94
87
87
87
87
87
54
47
40
34
27
27
27
7
0
I then take the same data, put it into an array, and calculate percentileofscore using scipy:
100 - stats.percentileofscore(array, score, kind='strict')
However, my results are as follows:
100
94
88
88
88
88
88
56
50
44
38
31
31
31
13
7
Here are the results side by side to show the differences:
Data Excel Scipy
0 100 100
1 94 94
3 87 88
3 87 88
3 87 88
3 87 88
3 87 88
4 54 56
6 47 50
8 40 44
9 34 38
11 27 31
11 27 31
11 27 31
12 7 13
45 0 7
There are clearly some differences in the results. Some of them are off by 4 points or more.
Any thoughts on how to mimic Excel's PERCENTRANK.INC function?
I'm using scipy 1.0.0, numpy 1.13.3, python 3.5.2, Excel 2016
Edit
If I do not include the max value of 45, the numbers match. Could this be how PERCENTRANK.INC works?
The Excel function PERCENTRANK.INC effectively excludes the max value (in my case 45), which is why it shows 0 where scipy shows 6.25.
To rectify this, I modified my function to remove the max values of the array like so:
largest = max(array)  # compute once instead of on every filter call
array = [a for a in array if a != largest]
return 100 - int(stats.percentileofscore(array, score, kind='strict'))
This gave me the correct results, and all my other tests passed.
Additional information based on Brian Pendleton's comment. Here is a link to the Excel functions explaining PERCENTRANK.INC as well as other ranking functions. Thanks for this.
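As far as I understand it, for a value that occurs in the data PERCENTRANK.INC is the number of smaller values divided by n - 1, which would explain why dropping one maximum makes scipy's kind='strict' result line up. Here is a minimal sketch under that assumption, without scipy. percent_rank_inc is a hypothetical helper name; it does not interpolate for scores that are missing from the data, and it does not imitate Excel's truncation of the result.

```python
def percent_rank_inc(values, score):
    """Share of values strictly below score, over n - 1.

    Assumes score is one of the values; no interpolation is attempted.
    """
    below = sum(1 for v in values if v < score)
    return below / (len(values) - 1)


data = [0, 1, 3, 3, 3, 3, 3, 4, 6, 8, 9, 11, 11, 11, 12, 45]

for score in sorted(set(data)):
    # Same shape as the spreadsheet formula: 100 * (1 - rank).
    print(score, round(100 * (1 - percent_rank_inc(data, score)), 2))
```

Unlike filtering out every element equal to max(array), this keeps working when the maximum appears more than once, because only the denominator changes.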
I have a quick question: how would one get the last cell value in every column of an Excel spreadsheet when working with it as a pandas dataframe? I'm having some difficulty with this. I know the last index can be found with len(), but I can't quite wrap my head around it. Any help would be greatly appreciated.
If by the last cell of a dataframe you mean the bottom-right cell, you can use .iloc:
df = pd.DataFrame(np.arange(1,101).reshape((10,-1)))
df
Output:
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9 10
1 11 12 13 14 15 16 17 18 19 20
2 21 22 23 24 25 26 27 28 29 30
3 31 32 33 34 35 36 37 38 39 40
4 41 42 43 44 45 46 47 48 49 50
5 51 52 53 54 55 56 57 58 59 60
6 61 62 63 64 65 66 67 68 69 70
7 71 72 73 74 75 76 77 78 79 80
8 81 82 83 84 85 86 87 88 89 90
9 91 92 93 94 95 96 97 98 99 100
Use .iloc with -1 index selection on both rows and columns.
df.iloc[-1,-1]
Output:
100
DataFrame.head(n) gets the top n results from the dataframe. DataFrame.tail(n) gets the bottom n results from the dataframe.
If your dataframe is named df, you could use df.tail(1) to get the last row of the dataframe. The returned value is also a dataframe.
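Since the question asks for the last value of every column, here is a short sketch of two readings of that: the last row as it is, and the last non-empty cell per column (useful when spreadsheet columns have different lengths). The sample frame and the names last_row and last_filled are made up for illustration.

```python
import numpy as np
import pandas as pd

# Columns of different effective length, as often happens in a spreadsheet.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0],
    "b": [4.0, 5.0, np.nan],
    "c": ["x", None, None],
})

# Last row as a Series: one value per column, missing where the cell is empty.
last_row = df.iloc[-1]

# Last non-empty value in each column.
last_filled = df.apply(lambda col: col.dropna().iloc[-1])
print(last_filled)
```

The second form raises an IndexError if a column is entirely empty, so such columns would need to be dropped or handled first.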
The request is simple: I want to select all rows which contain a value greater than a threshold.
If I do it like this:
df[(df > threshold)]
I get these rows, but values below that threshold are simply NaN. How do I avoid selecting these rows?
There is absolutely no need for the double transposition - you can simply call any along the column axis (supplying axis=1 or axis='columns') on your Boolean matrix.
df[(df > threshold).any(axis=1)]
Example
>>> df = pd.DataFrame(np.random.randint(0, 100, 50).reshape(5, 10))
>>> df
0 1 2 3 4 5 6 7 8 9
0 45 53 89 63 62 96 29 56 42 6
1 0 74 41 97 45 46 38 39 0 49
2 37 2 55 68 16 14 93 14 71 84
3 67 45 79 75 27 94 46 43 7 40
4 61 65 73 60 67 83 32 77 33 96
>>> df[(df > 95).any(axis=1)]
0 1 2 3 4 5 6 7 8 9
0 45 53 89 63 62 96 29 56 42 6
1 0 74 41 97 45 46 38 39 0 49
4 61 65 73 60 67 83 32 77 33 96
Transposing as your self-answer does is just an unnecessary performance hit.
df = pd.DataFrame(np.random.randint(0, 100, 10**8).reshape(10**4, 10**4))
# standard way
%timeit df[(df > 95).any(axis=1)]
1 loop, best of 3: 8.48 s per loop
# transposing
%timeit df[df.T[(df.T > 95)].any()]
1 loop, best of 3: 13 s per loop
This is actually very simple:
df[df.T[(df.T > 0.33)].any()]
I have a Pandas dataframe with several columns that range from 0 to 100. I would like to add a column on to the dataframe that contains the name of the column from among these that has the greatest value for each row. So:
one two three four COLUMN_I_WANT_TO_CREATE
5 40 12 19 two
90 15 58 23 one
74 95 34 12 two
44 81 22 97 four
10 59 59 44 [either two or three, selected randomly]
etc.
Bonus points if the solution can resolve ties randomly.
You can use idxmax with parameter axis=1:
print(df)
one two three four
0 5 40 12 19
1 90 15 58 23
2 74 95 34 12
3 44 81 22 97
df['COLUMN_I_WANT_TO_CREATE'] = df.idxmax(axis=1)
print(df)
one two three four COLUMN_I_WANT_TO_CREATE
0 5 40 12 19 two
1 90 15 58 23 one
2 74 95 34 12 two
3 44 81 22 97 four
With ties for the maximum resolved randomly, it is more complicated.
You can first find all max values with x[(x == x.max())]. Then you need the index values, to which you apply sample. But sample works only with a Series, so the index is converted to
a Series by to_series. Finally, you can select only the first value of the Series with iloc:
print(df)
one two three four
0 5 40 12 19
1 90 15 58 23
2 74 95 34 12
3 44 81 22 97
4 10 59 59 44
5 59 59 59 59
6 10 59 59 59
7 59 59 59 59
#first run
df['COL']=df.apply(lambda x:x[(x==x.max())].index.to_series().sample(frac=1).iloc[0], axis=1)
print(df)
one two three four COL
0 5 40 12 19 two
1 90 15 58 23 one
2 74 95 34 12 two
3 44 81 22 97 four
4 10 59 59 44 three
5 59 59 59 59 one
6 10 59 59 59 two
7 59 59 59 59 three
#one of the next runs (drop the existing COL first so only the numeric columns are compared)
df['COL']=df.drop(columns='COL').apply(lambda x:x[(x==x.max())].index.to_series().sample(frac=1).iloc[0], axis=1)
print(df)
one two three four COL
0 5 40 12 19 two
1 90 15 58 23 one
2 74 95 34 12 two
3 44 81 22 97 four
4 10 59 59 44 two
5 59 59 59 59 one
6 10 59 59 59 three
7 59 59 59 59 four
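A variation on the same idea, sketched here with a NumPy random generator instead of sample. The names value_cols, rng and random_idxmax are made up for this sketch, and the seed is there only so the result is repeatable. Restricting apply to the numeric columns means the line can be re-run after COL already exists.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "one": [5, 90, 74, 44, 10],
    "two": [40, 15, 95, 81, 59],
    "three": [12, 58, 34, 22, 59],
    "four": [19, 23, 12, 97, 44],
})
value_cols = ["one", "two", "three", "four"]

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable


def random_idxmax(row):
    # Labels of every column that ties for the row maximum.
    winners = row[row == row.max()].index.to_numpy()
    # Pick one of them uniformly at random.
    return rng.choice(winners)


df["COL"] = df[value_cols].apply(random_idxmax, axis=1)
print(df)
```

Rows without ties always get the same label as idxmax(axis=1); only the tied row (index 4 here) can vary between seeds.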