Plotting based on column criteria in Pandas - python

If I have a dataframe, say
df = {'carx' : ['merc','rari','merc','hond','fia','merc'],
      'cary' : ['bent','maz','ben','merc','fia','fia'],
      'milesx' : [0,100,2,22,5,6],
      'milesy' : [10,3,18,2,19,2]}
I then would like to plot the value from column milesx if the corresponding index of carx has the value 'merc'. The same criteria apply to cary and milesy; otherwise nothing should be plotted. How can I do this?
milesy and milesx should be plotted on the x-axis. The y-axis should just be some continuous values (1,2...).

IIUC, assuming you have the following dataframe:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
# input dictionary
df = {'carx' : ['merc','rari','merc','hond','fia','merc'],
      'cary' : ['bent','maz','ben','merc','fia','fia'],
      'milesx' : [0,100,2,22,5,6],
      'milesy' : [10,3,18,2,19,2]}
# creating input dataframe
dataframe = pd.DataFrame(df)
print(dataframe)
Result:
carx cary milesx milesy
0 merc bent 0 10
1 rari maz 100 3
2 merc ben 2 18
3 hond merc 22 2
4 fia fia 5 19
5 merc fia 6 2
Then, to plot the values given the condition, define a function and apply it row-wise with apply:
def my_function(row):
    if row['carx'] == 'merc':
        return row['milesx']
    elif row['cary'] == 'merc':
        return row['milesy']
    else:
        return None
# filter those with only 'merc'
filtered = dataframe.apply(lambda row: my_function(row), axis=1)
print(filtered)
Result:
0 0.0
1 NaN
2 2.0
3 2.0
4 NaN
5 6.0
dtype: float64
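As a side note, the same filtering can be written without apply; a vectorized sketch using Series.where, equivalent to my_function above:
# Vectorized sketch: take milesx where carx is 'merc'; otherwise fall back to
# milesy where cary is 'merc'; everything else becomes NaN.
filtered = dataframe['milesx'].where(
    dataframe['carx'].eq('merc'),
    dataframe['milesy'].where(dataframe['cary'].eq('merc'))
)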
You do not want to plot the rows where neither column matched (those came back as NaN), so dropna() may be used:
# plotting
filtered.dropna().plot(kind='bar', legend=None);
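The question asks for the miles on the x-axis and a running counter on the y-axis; a minimal sketch for that layout, reusing filtered from above:
# Sketch: matched miles on the x-axis, a continuous counter (1, 2, ...) on the y-axis.
values = filtered.dropna()
plt.scatter(values, range(1, len(values) + 1))
plt.xlabel('miles')
plt.ylabel('position')
plt.show()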

Related

Change value in pandas after chained loc and iloc

I have the following problem: in a df, I want to select specific rows and a specific column and in this selection take the first n elements and assign a new value to them. Naively, I thought that the following code should do the job:
import seaborn as sns
import pandas as pd
df = sns.load_dataset('tips')
df.loc[df.day=="Sun", "smoker"].iloc[:4] = "Yes"
Both of the loc and iloc should return a view into the df and the value should be overwritten. However, the dataframe does not change. Why?
I know how to go around it -- creating a new df first just with the loc, then changing the value using iloc and updating back the original df (as below).
But a) I do not think it's optimal, and b) I would like to know why the top solution does not work. Why does it return a copy and not a view of a view?
The alternative solution:
df = sns.load_dataset('tips')
tmp = df.loc[df.day=="Sun", "smoker"]
tmp.iloc[:4] = "Yes"
df.loc[df.day=="Sun", "smoker"] = tmp
Note: I have read the docs, this really great post and this question but they don't explain this. Their concern is the difference between df.loc[mask, "z"] and the chained df["z"][mask].
I believe df.loc[].iloc[] is a chained assignment case and pandas doesn't guarantee that you will get a view at the end. From the docs:
Whether a copy or a reference is returned for a setting operation, may depend on the context. This is sometimes called chained assignment and should be avoided.
Since you have a filtering condition in loc, pandas will create a new pd.Series and then apply the assignment to it. For example, the following will work, because df.loc[:, "smoker"] is the same underlying series as df["smoker"]:
df.loc[:, "smoker"].iloc[:4] = 'Yes'
But you will get a SettingWithCopyWarning.
You need to rewrite your code so that pandas handles it as a single loc call.
Another possible workaround:
df.loc[df[df.day=="Sun"].index[:4], "smoker"] = 'Yes'
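For completeness, a sketch of this workaround applied to the tips dataset from the question. With copy-on-write (the default behavior in pandas 3.0), chained assignment never updates the original frame, so a single loc statement like this is the reliable form:
import seaborn as sns

df = sns.load_dataset('tips')
# single .loc call: select the first four Sunday row labels, then assign
df.loc[df[df.day == "Sun"].index[:4], "smoker"] = "Yes"
print(df.loc[df.day == "Sun", "smoker"].head(4))  # all "Yes"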
In your case, you can define the columns to impute. Let's suppose the following dataset:
df = pd.DataFrame(data={'State': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                        'Sno Center': ["Guntur", "Nellore", "Visakhapatnam", "Biswanath", "Nellore", "Guwahati", "Nellore", "Numaligarh", "Sibsagar", "Munger-Jamalpu"],
                        'Mar-21': [121, 118.8, 131.6, 123.7, 127.8, 125.9, 114.2, 114.2, 117.7, 117.7],
                        'Apr-21': [121.1, 118.3, 131.5, 124.5, 128.2, 128.2, 115.4, 115.1, 117.3, 118.3]})
df
State Sno Center Mar-21 Apr-21
0 1 Guntur 121.0 121.1
1 2 Nellore 118.8 118.3
2 3 Visakhapatnam 131.6 131.5
3 4 Biswanath 123.7 124.5
4 5 Nellore 127.8 128.2
5 6 Guwahati 125.9 128.2
6 7 Nellore 114.2 115.4
7 8 Numaligarh 114.2 115.1
8 9 Sibsagar 117.7 117.3
9 10 Munger-Jamalpu 117.7 118.3
So, I would like to change all the date columns to 0 where Sno Center equals 'Nellore':
mask = df["Sno Center"] == "Nellore"
df.loc[mask, ["Mar-21", "Apr-21"]] = 0
The result
df
State Sno Center Mar-21 Apr-21
0 1 Guntur 121.0 121.1
1 2 Nellore 0.0 0.0
2 3 Visakhapatnam 131.6 131.5
3 4 Biswanath 123.7 124.5
4 5 Nellore 0.0 0.0
5 6 Guwahati 125.9 128.2
6 7 Nellore 0.0 0.0
7 8 Numaligarh 114.2 115.1
8 9 Sibsagar 117.7 117.3
9 10 Munger-Jamalpu 117.7 118.3
Another option is to define the columns as a list:
COLS = ["Mar-21", "Apr-21"]
df.loc[mask, COLS] = 0
There are also options using iloc:
COLS = df.iloc[:, 2:4].columns.tolist()
df.loc[mask, COLS] = 0
Or
df.loc[mask, df.iloc[:, 2:4].columns.tolist()] = 0
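If you prefer to avoid loc entirely, a NumPy-based sketch of the same update; the broadcast row mask zeroes both columns on the matching rows:
import numpy as np

COLS = ["Mar-21", "Apr-21"]
# reshape the row mask to a column vector so it broadcasts across both columns
df[COLS] = np.where(mask.to_numpy()[:, None], 0, df[COLS])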

How to identify and highlight outliers in each row of a pandas dataframe

I want to do the following to my dataframe:
For each row identify outliers/anomalies
Highlight/color the identified outliers' cells (preferably 'red' color)
Count the number of identified outliers in each row (store in a column 'anomaly_count')
Export the output as an xlsx file
See below for sample data
import numpy as np
import pandas as pd

np.random.seed([5, 1591])
df = pd.DataFrame(
    np.random.normal(size=(16, 5)),
    columns=list('ABCDE')
)
df
A B C D E
0 -1.685112 -0.432143 0.876200 1.626578 1.512677
1 0.401134 0.439393 1.027222 0.036267 -0.655949
2 -0.074890 0.312793 -0.236165 0.660909 0.074468
3 0.842169 2.759467 0.223652 0.432631 -0.484871
4 -0.619873 -1.738938 -0.054074 0.337663 0.358380
5 0.083653 0.792835 -0.643204 1.182606 -1.207692
6 -1.168773 -1.456870 -0.707450 -0.439400 0.319728
7 2.316974 -0.177750 1.289067 -2.472729 -1.310188
8 2.354769 1.099483 -0.653342 -0.532208 0.269307
9 0.431649 0.666982 0.361765 0.419482 0.531072
10 -0.124268 -0.170720 -0.979012 -0.410861 1.000371
11 -0.392863 0.933516 -0.502608 -0.759474 -1.364289
12 1.405442 -0.297977 0.477609 -0.046791 -0.126504
13 -0.711799 -1.042558 -0.970183 -1.672715 -0.524283
14 0.029966 -0.579152 0.648176 0.833141 -0.942752
15 0.824767 0.974580 0.363170 0.428062 -0.232174
The desired outcome should look something like this:
## I want to ONLY identify the outliers NOT remove or substitute them. I only used NaN to depict the outlier value. Ideally, the outlier values cell should be colored/highlighted 'red'.
## Please note: the outliers NaN in the sample are randomly assigned.
A B C D E Anomaly_Count
0 NaN -0.432143 0.876200 NaN 1.512677 2
1 0.401134 0.439393 1.027222 0.036267 -0.655949 0
2 -0.074890 0.312793 -0.236165 0.660909 0.074468 0
3 0.842169 NaN 0.223652 0.432631 -0.484871 1
4 -0.619873 -1.738938 -0.054074 0.337663 0.358380 0
5 0.083653 0.792835 -0.643204 NaN NaN 2
6 -1.168773 -1.456870 -0.707450 -0.439400 0.319728 0
7 2.316974 -0.177750 1.289067 -2.472729 -1.310188 0
8 2.354769 1.099483 -0.653342 -0.532208 0.269307 0
9 0.431649 0.666982 0.361765 0.419482 0.531072 0
10 -0.124268 -0.170720 -0.979012 -0.410861 NaN 1
11 -0.392863 0.933516 -0.502608 -0.759474 -1.364289 0
12 1.405442 -0.297977 0.477609 -0.046791 -0.126504 0
13 -0.711799 -1.042558 -0.970183 -1.672715 -0.524283 0
14 0.029966 -0.579152 0.648176 0.833141 -0.942752 0
15 0.824767 NaN 0.363170 0.428062 -0.232174 1
See below for my attempt; I am open to other approaches.
import numpy as np
from scipy import stats
def outlier_detection(data):
    # step I: identify the outliers in each row
    df[(np.abs(stats.zscore(df)) < 3).all(axis=0)]  # unfortunately this removes the outliers, which I don't want
    # step II: color/highlight the outlier cell
    df = df.style.highlight_null('red')
    # step III: count the number of outliers in each row
    df['Anomaly_count'] = df.isnull().sum(axis=1)
    # step IV: export as xlsx file
    df.to_excel(r'Path to store the exported excel file\File Name.xlsx', sheet_name='Your sheet name', index=False)

outlier_detection(df)
Thanks for your time.
This works for me:
import numpy as np
import pandas as pd
from scipy import stats
np.random.seed([5, 1591])
df = pd.DataFrame(
    np.random.normal(size=(16, 5)),
    columns=list('ABCDE')
)

mask = pd.DataFrame(abs(stats.zscore(df)) > 1, columns=df.columns)
df["Count"] = mask.sum(axis=1)
mask["Count"] = False
style_df = mask.applymap(lambda x: "background-color: red" if x else "")

sheet_name = "Values"
with pd.ExcelWriter("score_test.xlsx", engine="openpyxl") as writer:
    df.style.apply(lambda x: style_df, axis=None).to_excel(writer,
                                                           sheet_name=sheet_name,
                                                           index=False)
Here, mask is the boolean DataFrame that is True wherever the z-score exceeds the limit. Based on this mask I create a string dataframe style_df with the value 'background-color: red' on the deviating cells. The last statement imposes the values of style_df on the style of the df dataframe.
The resulting Excel file now shows the deviating cells highlighted in red.
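Note that DataFrame.applymap is deprecated since pandas 2.1 in favor of DataFrame.map; on recent versions the style frame can be built with the drop-in replacement:
# pandas >= 2.1: DataFrame.map replaces the deprecated DataFrame.applymap
style_df = mask.map(lambda x: "background-color: red" if x else "")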

lambda function referencing a column value not specified in function

I have a situation where I want to use the results of a groupby in my training set to fill in results for my test set.
I don't think there's a straightforward way to do this in pandas, so I'm trying to use the apply method on the column in my test set.
MY SITUATION:
I want to use the average values from my MSZoning column to infer the missing value for my LotFrontage column.
If I use the groupby method on my training set I get this:
train.groupby('MSZoning')['LotFrontage'].agg(['mean', 'count'])
giving the mean and count of LotFrontage for each MSZoning value.
Now, I want to use these values to impute missing values on my test set, so I can't just use the transform method.
Instead, I created a function that I wanted to pass into the apply method, which can be seen here:
def fill_MSZoning(row):
    if row['MSZoning'] == 'C':
        return 69.7
    elif row['MSZoning'] == 'FV':
        return 59.49
    elif row['MSZoning'] == 'RH':
        return 58.92
    elif row['MSZoning'] == 'RL':
        return 74.68
    else:
        return 52.4
I call the function like this:
test['LotFrontage'] = test.apply(lambda x: x.fillna(fill_MSZoning), axis=1)
Now, the results for the LotFrontage column are the same as the Id column, even though I didn't specify this.
Any idea what is happening?
You can do it like this:
import pandas as pd
import numpy as np
## creating dummy data
np.random.seed(100)
raw = {
    "group": np.random.choice("A B C".split(), 10),
    "value": [np.nan if np.random.rand() > 0.8 else np.random.choice(100) for _ in range(10)]
}
df = pd.DataFrame(raw)
display(df)
## calculate mean
means = df.groupby("group").mean()
display(means)
Fill With Group Mean
## fill with mean value
def fill_group_mean(x):
    group_mean = means["value"].loc[x["group"].max()]
    return x["value"].mask(x["value"].isna(), group_mean)
r= df.groupby("group").apply(fill_group_mean)
r.reset_index(level=0)
Output
group value
0 A NaN
1 A 24.0
2 A 60.0
3 C 9.0
4 C 2.0
5 A NaN
6 C NaN
7 B 83.0
8 C 91.0
9 C 7.0
group value
0 A 42.00
1 A 24.00
2 A 60.00
5 A 42.00
7 B 83.00
3 C 9.00
4 C 2.00
6 C 27.25
8 C 91.00
9 C 7.00
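Back to the train/test situation in the question: the hard-coded function can be avoided by computing the group means on the training set and mapping them onto the test set. A sketch, assuming train and test both carry the MSZoning and LotFrontage columns:
# Sketch: group means learned on train, applied to fill gaps in test.
means = train.groupby('MSZoning')['LotFrontage'].mean()
test['LotFrontage'] = test['LotFrontage'].fillna(test['MSZoning'].map(means))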

Applying a function to a pandas col

I would like to map the function GetPermittedFAR to my dataframe (df) so that I can test whether a value in the col zonedist1 equals a certain value and then build new cols such as df['FAR_Permitted'], etc.
I have tried various means of map() etc. but haven't gotten this to work. I feel this should be a pretty simple thing to do?
Ideally, I would use a simple list comprehension / lambda as I have many of these test conditional values resulting in col data to create.
import pandas as pd
import numpy as np
def GetPermittedFAR():
    if df['zonedist1'] == 'R7-3':
        df['FAR_Permitted'] = 0.5
        df['Building Height Max'] = 35
    if df['zonedist1'] == 'R3-2':
        df['FAR_Permitted'] = 0.5
        df['Building Height Max'] = 35
    if df['zonedist1'] == 'R1-1':
        df['FAR_Permitted'] = 0.7
        df['Building Height Max'] = 100
    # etc... if statement for each unique value in 'zonedist1'
df = pd.DataFrame({'zonedist1': ['R7-3', 'R3-2', 'R1-1',
                                 'R1-2', 'R2', 'R2A', 'R2X',
                                 'R1-1', 'R7-3', 'R3-2', 'R7-3',
                                 'R3-2', 'R1-1', 'R1-2']})
df = df.apply(lambda x: GetPermittedFAR(), axis=1)
How about using pd.merge()?
Let df be your dataframe
In [612]: df
Out[612]:
zonedist1
0 R7-3
1 R3-2
2 R1-1
3 R1-2
4 R2
5 R2A
6 R2X
and let merge be another dataframe with the conditions:
In [613]: merge
Out[613]:
zonedist1 FAR_Permitted Building Height Max
0 R7-3 0.5 35
1 R3-2 0.5 35
Then, merge df with merge using how='left':
In [614]: df.merge(merge, how='left')
Out[614]:
zonedist1 FAR_Permitted Building Height Max
0 R7-3 0.5 35
1 R3-2 0.5 35
2 R1-1 NaN NaN
3 R1-2 NaN NaN
4 R2 NaN NaN
5 R2A NaN NaN
6 R2X NaN NaN
Later you can replace NaN values.
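For instance, a sketch that fills the unmatched rows with defaults after the merge (the numbers here are placeholders, not values from the question):
# Sketch: left-merge, then fill rows with no match using placeholder defaults.
result = df.merge(merge, how='left')
result = result.fillna({'FAR_Permitted': 0.0, 'Building Height Max': 0})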

Python Pandas: Get row by median value

I'm trying to get the row of the median value for a column.
I'm using data.median() to get the median value for 'column'.
id 30444.5
someProperty 3.0
numberOfItems 0.0
column 70.0
And the median of that column is subsequently:
data.median()['performance']
>>> 70.0
How can I get the row or index of the median value?
Is there anything similar to idxmax / idxmin?
I tried filtering, but it's not reliable in cases where multiple rows have the same value.
Thanks!
You can use rank and idxmin and apply it to each column:
import numpy as np
import pandas as pd
def get_median_index(d):
    ranks = d.rank(pct=True)
    close_to_median = abs(ranks - 0.5)
    return close_to_median.idxmin()
df = pd.DataFrame(np.random.randn(13, 4))
df
0 1 2 3
0 0.919681 -0.934712 1.636177 -1.241359
1 -1.198866 1.168437 1.044017 -2.487849
2 1.159440 -1.764668 -0.470982 1.173863
3 -0.055529 0.406662 0.272882 -0.318382
4 -0.632588 0.451147 -0.181522 -0.145296
5 1.180336 -0.768991 0.708926 -1.023846
6 -0.059708 0.605231 1.102273 1.201167
7 0.017064 -0.091870 0.256800 -0.219130
8 -0.333725 -0.170327 -1.725664 -0.295963
9 0.802023 0.163209 1.853383 -0.122511
10 0.650980 -0.386218 -0.170424 1.569529
11 0.678288 -0.006816 0.388679 -0.117963
12 1.640222 1.608097 1.779814 1.028625
df.apply(get_median_index, 0)
0 7
1 7
2 3
3 4
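To recover the full row nearest one column's median, a sketch reusing get_median_index:
# Sketch: find the row whose value in column 0 ranks closest to the median.
idx = get_median_index(df[0])
print(df.loc[idx])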
Maybe just: data[data.performance == data.median()['performance']]. Note that this exact match can fail when the column has an even number of rows, since the median is then interpolated between two values.
