How to style my dataframe by column with conditions? - python

I want to paint the Share Price cell green if it is higher than the target price, and red if it is lower than the alert price, but my code is not working; it keeps raising errors.
This is the code that I use:
temp_df.style.apply(lambda x: ["background: red" if v < x.iloc[:,1:] and x.iloc[:,1:] != 0 else "" for v in x], subset=['Share Price'], axis = 0)
temp_df.style.apply(lambda x: ["background: green" if v > x.iloc[:,2:] and x.iloc[:,2:] != 0 else "" for v in x], subset=['Share Price'], axis = 0)
Can anyone give me an idea on how to do it?
Index Share Price Alert/Entry Target
0 622.0 424.0 950.0
1 6880.0 5200.0 7450.0
2 62860.0 40000.0 60000.0
3 7669.0 5500.0 8000.0
4 5295.0 3500.0 5500.0
5 227.0 165.0 250.0
6 3970.0 3200.0 4250.0
7 1300.0 850.0 1650.0
8 8480.0 6500.0 8500.0
9 11.3 0.0 0.0
10 66.0 58.0 75.0
11 7.3 6.4 9.6
12 114.8 75.0 130.0
13 172.3 90.0 0.0
14 2.6 2.4 3.2
15 76.8 68.0 85.0
16 19.6 15.4 21.0
17 21.9 11.0 18.6
18 35.4 29.0 42.0
19 12.5 9.2 0.0
20 15.5 0.0 0.0
21 449.8 0.0 0.0
22 4.3 3.6 5.0
23 47.4 40.0 55.0
24 0.6 0.5 0.6
25 49.2 45.0 72.0
26 13.9 0.0 0.0
27 3.0 2.4 4.5
28 2.4 1.8 4.2
29 54.0 0.0 0.0
30 293.5 100.0 250.0
31 190000.0 140000.0 220000.0
32 52200.0 46000.0 58000.0
33 100500.0 75000.0 115000.0
34 4.9 3.8 6.5
35 0.2 0.0 0.0
36 1430.0 980.0 1450.0
37 1585.0 0.0 0.0
38 15.6 11.0 18.0
39 3.3 2.8 6.0
40 52.5 45.0 68.0
41 46.5 35.0 0.0
42 193.6 135.0 0.0
43 122.8 90.0 0.0
44 222.6 165.0 265.0

Provided that "Index" is also a column:
temp_df.style.apply(lambda x: ["background: green" if (i==1 and v > x.iloc[3] and x.iloc[3] != 0) else ("background: red" if (i==1 and v < x.iloc[2]) else "") for i, v in enumerate(x)], axis=1)
i identifies the Share Price column to be styled (position 1, counting the Index column as position 0).
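A more readable alternative is to compare whole columns at once and build the style strings with numpy — a sketch, assuming the column names shown above (Share Price, Alert/Entry, Target) and treating a zero threshold as "not set":

import numpy as np
import pandas as pd

def highlight_share_price(df):
    # Start with no styling anywhere, same shape as the data.
    styles = pd.DataFrame("", index=df.index, columns=df.columns)
    green = (df["Share Price"] > df["Target"]) & (df["Target"] != 0)
    red = (df["Share Price"] < df["Alert/Entry"]) & (df["Alert/Entry"] != 0)
    styles["Share Price"] = np.select(
        [green, red], ["background: green", "background: red"], default=""
    )
    return styles

temp_df.style.apply(highlight_share_price, axis=None)  # axis=None passes the whole DataFrame

With axis=None, Styler.apply expects the function to return a DataFrame of CSS strings with the same shape as the data, which avoids positional indexing entirely.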

Related

How to do similar to conditional countifs on a dataframe

I am trying to replicate Excel's COUNTIFS to get a rank between two unique values that are listed in my dataframe. I have attached the expected output, calculated in Excel using COUNTIF and LET/RANK functions.
I am trying to generate the "Avg Rank of Gas & Coal plants" column, which takes the number from the "Average Rank" column and then ranks the two unique types from Technology (CCGT or COAL) as two new ranks (Gas or Coal), so that I can get the relevant quantiles. In case you are wondering why I need to do this when there are only two coal plants: when I run this model on a larger dataset it will be useful to know how to do it in code rather than manually on my dataset.
Ideally the output will return two ranks: 1-47 for all units with Technology == CCGT and 1-2 for all units with Technology == COAL.
This is the column I am looking to create; the expected output is below (blank cells are missing daily values):
Unit ID | Technology | 03/01/2022 | 04/01/2022 | 05/01/2022 | 06/01/2022 | 07/01/2022 | 08/01/2022 | Average Rank | Unit Rank | Avg Rank of Gas & Coal plants | Gas Quintiles | Coal Quintiles | Quintiles
FAWN-1 | CCGT | 1.0 | 5.0 | 1.0 | 5.0 | 2.0 | 1.0 | 2.5 | 1 | 1 | 1 | 0 | Gas_1
GRAI-6 | CCGT | 4.0 | 18.0 | 2.0 | 4.0 | 3.0 | 3.0 | 5.7 | 2 | 2 | 1 | 0 | Gas_1
EECL-1 | CCGT | 5.0 | 29.0 | 4.0 | 1.0 | 1.0 | 2.0 | 7.0 | 3 | 3 | 1 | 0 | Gas_1
PEMB-21 | CCGT | 7.0 | 1.0 | 6.0 | 13.0 | 8.0 | 8.0 | 7.2 | 4 | 4 | 1 | 0 | Gas_1
PEMB-51 | CCGT | 3.0 | 3.0 | 3.0 | 11.0 | 16.0 | | 7.2 | 5 | 5 | 1 | 0 | Gas_1
PEMB-41 | CCGT | 9.0 | 4.0 | 7.0 | 7.0 | 10.0 | 13.0 | 8.3 | 6 | 6 | 1 | 0 | Gas_1
WBURB-1 | CCGT | 6.0 | 9.0 | 22.0 | 2.0 | 7.0 | 5.0 | 8.5 | 7 | 7 | 1 | 0 | Gas_1
PEMB-31 | CCGT | 14.0 | 6.0 | 13.0 | 6.0 | 4.0 | 9.0 | 8.7 | 8 | 8 | 1 | 0 | Gas_1
GRMO-1 | CCGT | 2.0 | 7.0 | 10.0 | 24.0 | 11.0 | 6.0 | 10.0 | 9 | 9 | 1 | 0 | Gas_1
PEMB-11 | CCGT | 21.0 | 2.0 | 9.0 | 10.0 | 9.0 | 14.0 | 10.8 | 10 | 10 | 2 | 0 | Gas_2
STAY-1 | CCGT | 19.0 | 12.0 | 5.0 | 23.0 | 6.0 | 7.0 | 12.0 | 11 | 11 | 2 | 0 | Gas_2
GRAI-7 | CCGT | 10.0 | 27.0 | 15.0 | 9.0 | 15.0 | 11.0 | 14.5 | 12 | 12 | 2 | 0 | Gas_2
DIDCB6 | CCGT | 28.0 | 11.0 | 11.0 | 8.0 | 19.0 | 15.0 | 15.3 | 13 | 13 | 2 | 0 | Gas_2
SCCL-3 | CCGT | 17.0 | 16.0 | 31.0 | 3.0 | 18.0 | 10.0 | 15.8 | 14 | 14 | 2 | 0 | Gas_2
STAY-4 | CCGT | 12.0 | 8.0 | 20.0 | 18.0 | 14.0 | 23.0 | 15.8 | 14 | 14 | 2 | 0 | Gas_2
CDCL-1 | CCGT | 13.0 | 22.0 | 8.0 | 25.0 | 12.0 | 16.0 | 16.0 | 16 | 16 | 2 | 0 | Gas_2
STAY-3 | CCGT | 8.0 | 17.0 | 17.0 | 20.0 | 13.0 | 22.0 | 16.2 | 17 | 17 | 2 | 0 | Gas_2
MRWD-1 | CCGT | 19.0 | 26.0 | 5.0 | 19.0 | | | 17.3 | 18 | 18 | 2 | 0 | Gas_2
WBURB-3 | CCGT | 24.0 | 14.0 | 17.0 | 17.0 | | | 18.0 | 19 | 19 | 3 | 0 | Gas_3
WBURB-2 | CCGT | 14.0 | 21.0 | 12.0 | 31.0 | 18.0 | | 19.2 | 20 | 20 | 3 | 0 | Gas_3
GYAR-1 | CCGT | 26.0 | 14.0 | 17.0 | 20.0 | 21.0 | | 19.6 | 21 | 21 | 3 | 0 | Gas_3
STAY-2 | CCGT | 18.0 | 20.0 | 18.0 | 21.0 | 24.0 | 20.0 | 20.2 | 22 | 22 | 3 | 0 | Gas_3
KLYN-A-1 | CCGT | 24.0 | 12.0 | 19.0 | 27.0 | | | 20.5 | 23 | 23 | 3 | 0 | Gas_3
SHOS-1 | CCGT | 16.0 | 15.0 | 28.0 | 15.0 | 29.0 | 27.0 | 21.7 | 24 | 24 | 3 | 0 | Gas_3
DIDCB5 | CCGT | 10.0 | 35.0 | 22.0 | | | | 22.3 | 25 | 25 | 3 | 0 | Gas_3
CARR-1 | CCGT | 33.0 | 26.0 | 27.0 | 22.0 | 4.0 | | 22.4 | 26 | 26 | 3 | 0 | Gas_3
LAGA-1 | CCGT | 15.0 | 13.0 | 29.0 | 32.0 | 23.0 | 24.0 | 22.7 | 27 | 27 | 3 | 0 | Gas_3
CARR-2 | CCGT | 24.0 | 25.0 | 27.0 | 29.0 | 21.0 | 12.0 | 23.0 | 28 | 28 | 3 | 0 | Gas_3
GRAI-8 | CCGT | 11.0 | 28.0 | 36.0 | 16.0 | 26.0 | 25.0 | 23.7 | 29 | 29 | 4 | 0 | Gas_4
SCCL-2 | CCGT | 29.0 | 16.0 | 28.0 | 25.0 | | | 24.5 | 30 | 30 | 4 | 0 | Gas_4
LBAR-1 | CCGT | 19.0 | 25.0 | 31.0 | 28.0 | | | 25.8 | 31 | 31 | 4 | 0 | Gas_4
CNQPS-2 | CCGT | 20.0 | 32.0 | 32.0 | 26.0 | | | 27.5 | 32 | 32 | 4 | 0 | Gas_4
SPLN-1 | CCGT | 23.0 | 30.0 | 30.0 | | | | 27.7 | 33 | 33 | 4 | 0 | Gas_4
DAMC-1 | CCGT | 23.0 | 21.0 | 38.0 | 34.0 | | | 29.0 | 34 | 34 | 4 | 0 | Gas_4
KEAD-2 | CCGT | 30.0 | | | | | | 30.0 | 35 | 35 | 4 | 0 | Gas_4
SHBA-1 | CCGT | 26.0 | 23.0 | 35.0 | 37.0 | | | 30.3 | 36 | 36 | 4 | 0 | Gas_4
HUMR-1 | CCGT | 22.0 | 30.0 | 37.0 | 37.0 | 33.0 | 28.0 | 31.2 | 37 | 37 | 4 | 0 | Gas_4
CNQPS-4 | CCGT | 27.0 | 33.0 | 35.0 | 30.0 | | | 31.3 | 38 | 38 | 5 | 0 | Gas_5
CNQPS-1 | CCGT | 25.0 | 40.0 | 33.0 | | | | 32.7 | 39 | 39 | 5 | 0 | Gas_5
SEAB-1 | CCGT | 32.0 | 34.0 | 36.0 | 29.0 | | | 32.8 | 40 | 40 | 5 | 0 | Gas_5
PETEM1 | CCGT | 35.0 | | | | | | 35.0 | 41 | 41 | 5 | 0 | Gas_5
ROCK-1 | CCGT | 31.0 | 34.0 | 38.0 | 38.0 | | | 35.3 | 42 | 42 | 5 | 0 | Gas_5
SEAB-2 | CCGT | 31.0 | 39.0 | 39.0 | 34.0 | | | 35.8 | 43 | 43 | 5 | 0 | Gas_5
WBURB-43 | COAL | 32.0 | 37.0 | 40.0 | 39.0 | 31.0 | | 35.8 | 44 | 1 | 0 | 1 | Coal_1
FDUNT-1 | CCGT | 36.0 | | | | | | 36.0 | 45 | 44 | 5 | 0 | Gas_5
COSO-1 | CCGT | 30.0 | 42.0 | 36.0 | | | | 36.0 | 45 | 44 | 5 | 0 | Gas_5
WBURB-41 | COAL | 33.0 | 38.0 | 41.0 | 40.0 | 32.0 | | 36.8 | 47 | 2 | 0 | 1 | Coal_1
FELL-1 | CCGT | 34.0 | 39.0 | 43.0 | 41.0 | 33.0 | | 38.0 | 48 | 46 | 5 | 0 | Gas_5
KEAD-1 | CCGT | 43.0 | | | | | | 43.0 | 49 | 47 | 5 | 0 | Gas_5
I have tried to do it the same way I got Average Rank (which is a rank of the average of the inputs in the dataframe), but it doesn't seem to work with additional conditions.
Thank you!!
import pandas as pd
df = pd.read_csv("gas.csv")
display(df['Technology'].value_counts())
print('------')
display(df['Technology'].value_counts().iloc[0]) # This is how you access the count of CCGT (positional access via .iloc)
display(df['Technology'].value_counts().iloc[1]) # count of COAL
Output:
CCGT 47
COAL 2
Name: Technology, dtype: int64
------
47
2
By the way: pd.cut or pd.qcut can be used to calculate quantiles. You don't have to manually define what a quantile is.
Refer to the documentation and other websites:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.qcut.html
https://www.geeksforgeeks.org/how-to-use-pandas-cut-and-qcut/
There are many methods you can pass to rank. Refer to documentation:
https://pandas.pydata.org/docs/reference/api/pandas.Series.rank.html
df['rank'] = df.groupby("Technology")["Average Rank"].rank(method = "dense", ascending = True)
df
method : {'average', 'min', 'max', 'first', 'dense'}, default 'average'
How to rank the group of records that have the same value (i.e. ties):
average: average rank of the group
min: lowest rank in the group
max: highest rank in the group
first: ranks assigned in order they appear in the array
dense: like ‘min’, but rank always increases by 1 between groups.
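Putting the pieces together, a minimal sketch of the grouped rank plus per-group quintiles (column names as in the CSV above; "Quintile" is a hypothetical name for the new output column):

import pandas as pd

df = pd.read_csv("gas.csv")

# Dense rank within each technology group: 1-47 for CCGT, 1-2 for COAL.
df["Avg Rank of Gas & Coal plants"] = (
    df.groupby("Technology")["Average Rank"].rank(method="dense", ascending=True)
)

# Per-group quintiles with qcut; tiny groups (COAL has 2 rows) cannot fill
# five bins, so cap the bin count at the number of distinct values.
df["Quintile"] = df.groupby("Technology")["Average Rank"].transform(
    lambda s: pd.qcut(s, q=min(5, s.nunique()), labels=False, duplicates="drop") + 1
)

The labels=False argument makes qcut return integer bin codes (0-based, hence the +1); prefixing them with "Gas_"/"Coal_" to mirror the Quintiles column is then a simple string concatenation.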

Add New Column that returns the min value for unique values in another column

I'm attempting to create a new column called 'jolly' that will be populated with the min value from the HIR_EveningPrice column for each unique value in RH_RNo.
Here is my current attempt:
def evepricefav(races):
    for race in races:
        clv.loc[clv['RH_RNo'] == race]['HIR_EveningPrice']
clv['jolly'] = clv.apply(evepricefav, axis=1)
Here is a sample of the dataframe. You can see a failed attempt has populated the jolly column with 1.4.
RH_RNo HIR_BSP HIR_EveningPrice value jolly
794565 189631 28.75 26.0 -0.269565 1.4
794566 189631 15.38 13.0 -0.414824 1.4
794567 189631 15.00 6.0 -0.533333 1.4
794568 189631 4.80 5.0 0.458333 1.4
794569 189631 9.85 13.0 0.522843 1.4
794570 189631 4.30 9.0 0.627907 1.4
794571 189631 5.45 6.0 0.467890 1.4
794572 189631 34.00 17.0 -0.500000 1.4
794573 189631 13.00 11.0 -0.153846 1.4
794574 189634 31.77 9.0 -0.527856 1.4
794575 189634 60.00 26.0 -0.433333 1.4
794576 189634 13.50 17.0 0.925926 1.4
794577 189634 9.20 11.0 -0.130435 1.4
794578 189634 9.80 8.0 -0.081633 1.4
794579 189634 10.00 17.0 0.700000 1.4
794580 189634 11.79 17.0 0.102629 1.4
794581 189634 29.60 21.0 0.148649 1.4
794582 189634 2.99 3.5 0.337793 1.4
794583 189634 8.48 6.0 -0.292453 1.4
794584 189637 18.24 11.0 -0.396930 1.4
You can group by the RH_RNo column and then use .transform('min') on 'HIR_EveningPrice':
df['jolly'] = df.groupby('RH_RNo')['HIR_EveningPrice'].transform('min')
print(df)
Prints:
id RH_RNo HIR_BSP HIR_EveningPrice value jolly
0 794565 189631 28.75 26.0 -0.269565 5.0
1 794566 189631 15.38 13.0 -0.414824 5.0
2 794567 189631 15.00 6.0 -0.533333 5.0
3 794568 189631 4.80 5.0 0.458333 5.0
4 794569 189631 9.85 13.0 0.522843 5.0
5 794570 189631 4.30 9.0 0.627907 5.0
6 794571 189631 5.45 6.0 0.467890 5.0
7 794572 189631 34.00 17.0 -0.500000 5.0
8 794573 189631 13.00 11.0 -0.153846 5.0
9 794574 189634 31.77 9.0 -0.527856 3.5
10 794575 189634 60.00 26.0 -0.433333 3.5
11 794576 189634 13.50 17.0 0.925926 3.5
12 794577 189634 9.20 11.0 -0.130435 3.5
13 794578 189634 9.80 8.0 -0.081633 3.5
14 794579 189634 10.00 17.0 0.700000 3.5
15 794580 189634 11.79 17.0 0.102629 3.5
16 794581 189634 29.60 21.0 0.148649 3.5
17 794582 189634 2.99 3.5 0.337793 3.5
18 794583 189634 8.48 6.0 -0.292453 3.5
19 794584 189637 18.24 11.0 -0.396930 11.0
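For contrast, a plain aggregation returns one value per group and needs a merge back, which is why transform is the neater fit here — a sketch, assuming the stale jolly column is dropped first:

race_min = df.groupby('RH_RNo')['HIR_EveningPrice'].min().rename('jolly').reset_index()
df = df.drop(columns='jolly').merge(race_min, on='RH_RNo')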
Alternatively, use the following code (the demo below uses max; for your case, substitute min):
data1 = [1,1,2,1,2]
data2 = [7,2,8,1,3]
import pandas as pd
df = pd.DataFrame(columns=["a","b"])
df['a'] = data1
df['b'] = data2
dfc = df.groupby('a')['b']
df = df.assign(jolly=dfc.transform(max))
print(df)
Of course, set your own variable names there :)
Output for the sample data:
a b jolly
0 1 7 7
1 1 2 7
2 2 8 8
3 1 1 7
4 2 3 8

Unable to understand the difference between two statements

I recently tried to strip ("\t") values from a dataframe.
import numpy as np  # needed for np.nan
import pandas as pd

def FillMissing(dataFrame):
    dataFrame.replace("\?", np.nan, inplace=True, regex=True)
    dataFrame[:] = dataFrame.apply(lambda x: x.str.strip("\t")
                                   if x.dtype == "object" else x)
    print(id(dataFrame))
    .......

(test_df, obj, int_, float_) = FillMissing(test_df)
print(id(test_df))
Output:
ID(dataFrame) - 4647827832
htn-->['yes' 'no']
dm-->['yes' 'no' ' yes']
cad-->['no' 'yes']
appet-->['good' 'poor']
pe-->['no' 'yes']
ane-->['no' 'yes']
ID(test_df) - 4647827832
And by slightly modifying the method (plain assignment instead of slice assignment), the output changes:
def FillMissing(dataFrame):
    dataFrame.replace("\?", np.nan, inplace=True, regex=True)
    dataFrame = dataFrame.apply(lambda x: x.str.strip("\t")
                                if x.dtype == "object" else x)
    print(id(dataFrame))
.......
(test_df,obj,int_,float_)= FillMissing(test_df)
print(id(test_df))
Output:
ID(dataFrame) - 4647827832
htn-->['yes' 'no' nan]
dm-->['yes' 'no' ' yes' nan]
cad-->['no' 'yes' nan]
appet-->['good' 'poor' nan]
pe-->['no' 'yes' nan]
ID(test_df) - 4517467528
I'm unable to understand the difference between the two statements.
Also, executing the strip outside the function on test_df works completely fine.
Snapshot of the data used:
id age bp sg al su rbc pc pcc \
0 0 48.0 80.0 1.020 1.0 0.0 normal normal notpresent
1 1 7.0 50.0 1.020 4.0 0.0 normal normal notpresent
2 2 62.0 80.0 1.010 2.0 3.0 normal normal notpresent
3 3 48.0 70.0 1.005 4.0 0.0 normal abnormal present
4 4 51.0 80.0 1.010 2.0 0.0 normal normal notpresent
5 5 60.0 90.0 1.015 3.0 0.0 normal normal notpresent
6 6 68.0 70.0 1.010 0.0 0.0 normal normal notpresent
7 7 24.0 80.0 1.015 2.0 4.0 normal abnormal notpresent
8 8 52.0 100.0 1.015 3.0 0.0 normal abnormal present
9 9 53.0 90.0 1.020 2.0 0.0 abnormal abnormal present
10 10 50.0 60.0 1.010 2.0 4.0 abnormal abnormal present
11 11 63.0 70.0 1.010 3.0 0.0 abnormal abnormal present
12 12 68.0 70.0 1.015 3.0 1.0 normal normal present
13 13 68.0 70.0 1.020 0.0 0.0 normal abnormal notpresent
14 14 68.0 80.0 1.010 3.0 2.0 normal abnormal present
15 15 40.0 80.0 1.015 3.0 0.0 abnormal normal notpresent
16 16 47.0 70.0 1.015 2.0 0.0 abnormal normal notpresent
17 17 47.0 80.0 1.020 0.0 0.0 abnormal normal notpresent
18 18 60.0 100.0 1.025 0.0 3.0 abnormal normal notpresent
19 19 62.0 60.0 1.015 1.0 0.0 abnormal abnormal present
20 20 61.0 80.0 1.015 2.0 0.0 abnormal abnormal notpresent
21 21 60.0 90.0 1.020 0.0 0.0 normal abnormal notpresent
22 22 48.0 80.0 1.025 4.0 0.0 normal abnormal notpresent
23 23 21.0 70.0 1.010 0.0 0.0 normal normal notpresent
24 24 42.0 100.0 1.015 4.0 0.0 normal abnormal notpresent
25 25 61.0 60.0 1.025 0.0 0.0 normal normal notpresent
26 26 75.0 80.0 1.015 0.0 0.0 normal normal notpresent
27 27 69.0 70.0 1.010 3.0 4.0 normal abnormal notpresent
28 28 75.0 70.0 1.020 1.0 3.0 abnormal abnormal notpresent
29 29 68.0 70.0 1.005 1.0 0.0 abnormal abnormal present
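The difference comes down to Python assignment semantics: dataFrame[:] = ... writes the new values into the existing object (same id, so the caller's test_df is modified), while dataFrame = ... merely rebinds the local name to a brand-new DataFrame (new id, and the caller's test_df is left untouched). A minimal sketch with a toy frame, not the original data:

import pandas as pd

df = pd.DataFrame({"a": ["x\t", "y\t"]})
print(id(df))

def strip_inplace(frame):
    # Slice assignment mutates the object the caller passed in.
    frame[:] = frame.apply(lambda x: x.str.strip("\t"))
    print(id(frame))  # same id as the caller's df

def strip_rebind(frame):
    # Plain assignment rebinds the local name; the caller's object is untouched.
    frame = frame.apply(lambda x: x.str.strip("\t"))
    print(id(frame))  # a new id

strip_inplace(df)
strip_rebind(df)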

Dropping multiple columns in pandas at once

I have a data set consisting of 135 columns. I am trying to drop the columns in which more than 60% of the data is empty; there are approximately 40 such columns. I wrote a function to drop these empty columns, but I am getting a "Not contained in axis" error. Could someone help me solve this, or suggest another way to drop these 40 columns at once?
My function:
list_drop = df.isnull().sum()/(len(df))
def empty(df):
    if list_drop > 0.5:
        df.drop(list_drop, axis=1, inplace=True)
    return df
Another method I tried:
df.drop(df.count()/len(df) < 0.5, axis=1, inplace=True)
You could use isnull + sum and then use the mask to filter df.columns.
m = df.isnull().sum(0) / len(df) < 0.6
df = df[df.columns[m]]
Demo
df
A B C
0 29.0 NaN 26.6
1 NaN NaN 23.3
2 23.0 94.0 28.1
3 35.0 168.0 43.1
4 NaN NaN 25.6
5 32.0 88.0 31.0
6 NaN NaN 35.3
7 45.0 543.0 30.5
8 NaN NaN NaN
9 NaN NaN 37.6
10 NaN NaN 38.0
11 NaN NaN 27.1
12 23.0 846.0 30.1
13 19.0 175.0 25.8
14 NaN NaN 30.0
15 47.0 230.0 45.8
16 NaN NaN 29.6
17 38.0 83.0 43.3
18 30.0 96.0 34.6
m = df.isnull().sum(0) / len(df) < 0.3 # 0.3 as an example
m
A False
B False
C True
dtype: bool
df[df.columns[m]]
C
0 26.6
1 23.3
2 28.1
3 43.1
4 25.6
5 31.0
6 35.3
7 30.5
8 NaN
9 37.6
10 38.0
11 27.1
12 30.1
13 25.8
14 30.0
15 45.8
16 29.6
17 43.3
18 34.6
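The same mask logic also fits in one line with .loc — a sketch, using the 60% threshold from the question:

df = df.loc[:, df.isnull().mean() < 0.6]  # keep columns with fewer than 60% missing values

Since the mean of a boolean column is the share of True values, isnull().mean() gives the missing fraction per column directly.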

Convert specific string to a numeric value in pandas

I am trying to do data analysis of some rainfall data. An example of the data looks like this:
10 18/05/2016 26.9 40 20.8 34 52.2 20.8 46.5 45
11 19/05/2016 25.5 32 0.3 41.6 42 0.3 56.3 65.2
12 20/05/2016 8.5 29 18.4 9 36 18.4 28.6 46
13 21/05/2016 24.5 18 TRACE 3.5 17 TRACE 4.4 40
14 22/05/2016 0.6 18 0 6.5 14 0 8.6 20
15 23/05/2016 3.5 9 0.6 4.3 14 0.6 7 15
16 24/05/2016 3.6 25 T 3 12 T 14.9 9
17 25/05/2016 25 21 2.2 25.6 50 2.2 25 9
The rainfall data contain a specific string, 'TRACE' or 'T' (both meaning a non-measurable rainfall amount). For the analysis, I would like to convert these strings into 1.0 (float). My desired data should look like this, so that I can plot the values as a line diagram:
10 18/05/2016 26.9 40 20.8 34 52.2 20.8 46.5 45
11 19/05/2016 25.5 32 0.3 41.6 42 0.3 56.3 65.2
12 20/05/2016 8.5 29 18.4 9 36 18.4 28.6 46
13 21/05/2016 24.5 18 1.0 3.5 17 1.0 4.4 40
14 22/05/2016 0.6 18 0 6.5 14 0 8.6 20
15 23/05/2016 3.5 9 0.6 4.3 14 0.6 7 15
16 24/05/2016 3.6 25 1.0 3 12 1.0 14.9 9
17 25/05/2016 25 21 2.2 25.6 50 2.2 25 9
Can someone point me in the right direction?
You can use df.replace, and then convert the numeric columns to float using df.astype (otherwise the datatype would remain object, so any operations on these columns would still suffer from performance issues):
df = df.replace('^T(RACE)?$', 1.0, regex=True)
df.iloc[:, 1:] = df.iloc[:, 1:].astype(float) # converting object columns to floats
This will replace all T or TRACE elements with 1.0.
Output:
10 18/05/2016 26.9 40 20.8 34.0 52.2 20.8 46.5 45.0
11 19/05/2016 25.5 32 0.3 41.6 42.0 0.3 56.3 65.2
12 20/05/2016 8.5 29 18.4 9.0 36.0 18.4 28.6 46.0
13 21/05/2016 24.5 18 1 3.5 17.0 1 4.4 40.0
14 22/05/2016 0.6 18 0 6.5 14.0 0 8.6 20.0
15 23/05/2016 3.5 9 0.6 4.3 14.0 0.6 7.0 15.0
16 24/05/2016 3.6 25 1 3.0 12.0 1 14.9 9.0
17 25/05/2016 25.0 21 2.2 25.6 50.0 2.2 25.0 9.0
Use replace with a dict:
df = df.replace({'T':1.0, 'TRACE':1.0})
And then if necessary convert columns to float:
cols = df.columns.difference(['Date','another cols dont need convert'])
df[cols] = df[cols].astype(float)
df = df.replace({'T':1.0, 'TRACE':1.0})
cols = df.columns.difference(['Date','a'])
df[cols] = df[cols].astype(float)
print (df)
a Date 2 3 4 5 6 7 8 9
0 10 18/05/2016 26.9 40.0 20.8 34.0 52.2 20.8 46.5 45.0
1 11 19/05/2016 25.5 32.0 0.3 41.6 42.0 0.3 56.3 65.2
2 12 20/05/2016 8.5 29.0 18.4 9.0 36.0 18.4 28.6 46.0
3 13 21/05/2016 24.5 18.0 1.0 3.5 17.0 1.0 4.4 40.0
4 14 22/05/2016 0.6 18.0 0.0 6.5 14.0 0.0 8.6 20.0
5 15 23/05/2016 3.5 9.0 0.6 4.3 14.0 0.6 7.0 15.0
6 16 24/05/2016 3.6 25.0 1.0 3.0 12.0 1.0 14.9 9.0
7 17 25/05/2016 25.0 21.0 2.2 25.6 50.0 2.2 25.0 9.0
print (df.dtypes)
a int64
Date object
2 float64
3 float64
4 float64
5 float64
6 float64
7 float64
8 float64
9 float64
dtype: object
Extending the answer from @jezrael, you can replace and convert to floats in a single statement (this assumes the first column is Date and the remaining columns are the desired numeric columns):
df.iloc[:, 1:] = df.iloc[:, 1:].replace({'T':1.0, 'TRACE':1.0}).astype(float)
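If other stray strings might appear besides T/TRACE, pd.to_numeric with errors='coerce' is a more forgiving converter than astype(float) — a sketch; run it after the replacement, since it turns anything unparseable into NaN rather than 1.0:

df.iloc[:, 1:] = df.iloc[:, 1:].apply(pd.to_numeric, errors="coerce")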
