I want to flag anomalies in the desired columns (desired_D to desired_L). Here, an anomaly is defined as any value < 1500 or > 400000 in each row.
See below for the dataset:
import pandas as pd

# initialise data of lists
data = {
    'A': ['L1', 'L2', 'L3', 'L4', 'L5'],
    'B': [1, 1, 1, 1, 1],
    'C': [1, 2, 3, 5, 9],
    'desired_D': [12005, 18190, 1021, 13301, 31119],
    'desired_E': [11021, 19112, 19021, 15, 24509],
    'desired_F': [10022, 19910, 19113, 449999, 25519],
    'desired_G': [14029, 29100, 39022, 24509, 412271],
    'desired_H': [52119, 32991, 52883, 69359, 57835],
    'desired_J': [41218, 52991, 55121, 69152, 79355],
    'desired_K': [43211, 7672991, 56881, 211, 77342],
    'desired_L': [31211, 42901, 53818, 62158, 69325],
}

# Create the DataFrame
df = pd.DataFrame(data)

# Print the output
df
Currently, my code also flags columns B and C (I want to exclude them).
The revised code looks like this:
# function to flag the anomaly in each row - this flags columns B and C as well (I want to exclude these columns)
dont_format_cols = ['B', 'C']

def flag_outliers(s, dont_format_cols):
    if s.name in dont_format_cols:
        return ''  # or None, or whatever df.style() needs
    else:
        s = pd.to_numeric(s, errors='coerce')
        indexes = (s < 1500) | (s > 400000)
        return ['background-color: red' if v else '' for v in indexes]

styled = df.style.apply(flag_outliers, axis=1)
styled
This produces an error after the edits.
Desired output: it should exclude B and C; refer to the image below.
df.style.apply(..., axis=1) applies your outlier-styling function to every row, across all of df's columns. If you only want to apply it to some columns, use the subset argument.
EDIT: I wasn't aware df.style.apply() had a subset argument; I had proposed these hacky approaches:
1: Inspect the series name s.name inside the styling function, as in the answer to "Pandas style function to highlight specific columns".
### Hack solution: just hardwire it into the body of `flag_outliers()` without adding an extra arg `dont_format_cols`

def flag_outliers(s):
    dont_format_cols = ['B', 'C']
    if s.name in dont_format_cols:
        return [''] * len(s)  # df.style needs a list-like of the same length
    else:
        # code to apply formatting, as in the question
        s = pd.to_numeric(s, errors='coerce')
        indexes = (s < 1500) | (s > 400000)
        return ['background-color: red' if v else '' for v in indexes]
2: Another hack: add a second arg dont_format_cols to your function flag_outliers(s, dont_format_cols). Now you have to pass it in the apply call, so you'll need a lambda:

styled = df.style.apply(lambda s: flag_outliers(s, dont_format_cols), axis=1)

and:

def flag_outliers(s, dont_format_cols):
    if s.name in dont_format_cols:
        return [''] * len(s)  # df.style needs a list-like of the same length
    else:
        ...  # code to apply formatting, as in hack 1
Use the subset argument; that is precisely its purpose: to isolate styles to specific regions.
i.e. df.style.apply(flag_outliers, axis=1, subset=<list of columns to style>)
You can see examples in the pandas Styler user guide, in the section on finer slicing.
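For the data above, a minimal sketch of the subset approach might look like this (selecting columns by the desired_ prefix is just one way to build the column list):

desired_cols = [c for c in df.columns if c.startswith('desired_')]

def flag_outliers(s):
    s = pd.to_numeric(s, errors='coerce')
    mask = (s < 1500) | (s > 400000)
    return ['background-color: red' if v else '' for v in mask]

# subset restricts the styling to desired_cols, so B and C are never touched
styled = df.style.apply(flag_outliers, axis=1, subset=desired_cols)
styled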
I need to replace the missing values inside the dataset. I used the fillna() method; the function runs, but when I check, the data is still null.
import pandas as pd
import numpy as np

dataset = pd.read_csv('mamografia.csv')
dataset

mamografia = dataset
mamografia

malignos = mamografia[mamografia['Severidade'] == 0].isnull().sum()
print('Missing values:')
print()
print('Malignant values:', malignos)
print()
belignos = mamografia[mamografia['Severidade'] == 1].isnull().sum()
print('Benign values:', belignos)

def substitui_ausentes(lista_col):
    for lista in lista_col:
        if lista != 'Idade':
            mamografia[lista].fillna(value=mamografia[lista][(mamografia['Severidade'] == 0)].mode())
            mamografia[lista].fillna(value=mamografia[lista][(mamografia['Severidade'] == 1)].mode())
        else:
            mamografia[lista].fillna(value=mamografia[lista][(mamografia['Severidade'] == 0)].mean())
            mamografia[lista].fillna(value=mamografia[lista][(mamografia['Severidade'] == 1)].mean())

mamografia.columns
substitui_ausentes(mamografia.columns)
mamografia
I'm trying to replace the null values using fillna().
By default fillna does not work in place but returns the result of the operation.
You can either set the new value manually using
df = df.fillna(...)
Or overwrite the default behaviour by setting the parameter inplace=True
df.fillna(... , inplace=True)
However, your code will still not work, since you want to fill the different severities separately.
Since the function is being rewritten, let's also make it more idiomatic by not having it change the DataFrame by default:
def substitui_ausentes(dfc, reglas, inplace=False):
    if inplace:
        df = dfc
    else:
        df = dfc.copy()
    # per-severity fill values, e.g. {0: {'Idade': ..., 'Densidade': ...}, 1: {...}}
    fill_values = df.groupby('Severidade').agg(reglas).to_dict(orient='index')
    for k in fill_values:
        df.loc[df['Severidade'] == k] = df.loc[df['Severidade'] == k].fillna(fill_values[k])
    return df
Note that you now need to call the function using
reglas = {
    'Idade': lambda x: pd.Series.mode(x)[0],
    'Densidade': 'mean'
}
substitui_ausentes(df, reglas, inplace=True)
and the reglas dictionary needs to include only the columns you want to fill and how you want to fill them.
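As a sanity check, here is a minimal sketch with made-up data (the values are invented for illustration; only the column names follow the question):

import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Severidade': [0, 0, 0, 1, 1, 1],
    'Idade': [40, np.nan, 40, 60, 62, np.nan],
    'Densidade': [3.0, 3.0, np.nan, 2.0, np.nan, 2.0],
})
reglas = {
    'Idade': lambda x: pd.Series.mode(x)[0],
    'Densidade': 'mean',
}
print(substitui_ausentes(toy, reglas))
# NaNs are filled per Severidade group: Idade with the group mode, Densidade with the group mean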
Let's say I have the following data of a match in a CSV file:
name,match1,match2,match3
Alice,2,4,3
Bob,2,3,4
Charlie,1,0,4
I'm writing a Python program. Somewhere in the program, the scores collected for a match are stored in a list, say x = [1,0,4]. I have found where these scores exist in the data using pandas, and I can print "found" or "not found". However, I want my code to print the name these scores correspond to. In this case the program should output "Charlie", since Charlie has exactly the values [1,0,4]. How can I do that?
I will have a large set of data, so I must be able to tell which name corresponds to the numbers I pass to the program.
Yes, here's how to compare entire rows in a dataframe:
df[(df == x).all(axis=1)].index # where x is the pd.Series we're comparing to
Also, it makes life easiest if you directly set name as the index column when you read in the CSV.
import pandas as pd
from io import StringIO

csv_data = """\
name,match1,match2,match3
Alice,2,4,3
Bob,2,3,4
Charlie,1,0,4"""
df = pd.read_csv(StringIO(csv_data), index_col='name')
x = pd.Series({'match1': 1, 'match2': 0, 'match3': 4})
Now you can see that doing df == x, or equivalently df.eq(x), is not quite what you want, because it compares element-wise and returns a frame of True/False values. So you need to aggregate each row with .all(axis=1), which finds the rows where all comparison results were True...
df.eq(x).all(axis=1)
df[ (df == x).all(axis=1) ]
# match1 match2 match3
# name
# Charlie 1 0 4
...and then finally since you only want the name of such rows:
df[ (df == x).all(axis=1) ].index
# Index(['Charlie'], dtype='object', name='name')
df[ (df == x).all(axis=1) ].index.tolist()
# ['Charlie']
which is what you wanted. (I only added the spaces inside the expression for clarity).
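As a side note, comparing against a plain Python list also works via broadcasting, as long as the list order matches the column order:

df[(df == [1, 0, 4]).all(axis=1)].index.tolist()
# ['Charlie']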
You need to use DataFrame.loc, which would work like this (assuming name is still a regular column rather than the index):
print(df.loc[(df.match1 == 1) & (df.match2 == 0) & (df.match3 == 4), 'name'])
Maybe try something like this:

import pandas as pd
import numpy as np

# Make sample data
match1 = np.array([2, 2, 1])
match2 = np.array([4, 4, 0])
match3 = np.array([3, 3, 4])
name = np.array(['Alice', 'Bob', 'Charlie'])
df = pd.DataFrame({'name': name, 'match1': match1, 'match2': match2, 'match3': match3})
df

# Example of the list you want to get the data from
x = [1, 0, 4]
# x = [2, 4, 3]

# Should return the name Charlie as well as the index (based on the values in the list x)
df['name'].loc[(df['match1'] == x[0]) & (df['match2'] == x[1]) & (df['match3'] == x[2])]

# Make a new DataFrame out of the above
mydf = pd.DataFrame(df['name'].loc[(df['match1'] == x[0]) & (df['match2'] == x[1]) & (df['match3'] == x[2])])

# Loop that prints the names based on the index of mydf
# (if more than one name matches, it prints all of them; if only one, just that one)
for i in range(len(mydf)):
    print(mydf['name'].iloc[i])
You can use this. Here data is your DataFrame; change the name to match your own frame.

Assuming the values [1, 0, 4] are of int type:

data = data[(data['match1'] == 1) & (data['match2'] == 0) & (data['match3'] == 4)].index
print(data[0])

If the data is of object (string) type, then use this:

data = data[(data['match1'] == "1") & (data['match2'] == "0") & (data['match3'] == "4")].index
print(data[0])
I have a dataframe as given below:

import pandas as pd

df = pd.DataFrame({
    'date': ['2173-04-03 12:35:00', '2173-04-03 17:00:00', '2173-04-03 20:00:00', '2173-04-04 11:00:00', '2173-04-04 12:00:00', '2173-04-04 11:30:00', '2173-04-04 16:00:00', '2173-04-04 22:00:00', '2173-04-05 04:00:00'],
    'subject_id': [1, 1, 1, 1, 1, 1, 1, 1, 1],
    'val': [5, 5, 5, 10, 10, 5, 5, 8, 8]
})

I would like to apply a couple of logical conditions (logic_1 on the val column and logic_2 on the date column). Please find the logic below:
logic_1 = lambda x: (x.shift(2).ge(x.shift(1))) & (x.ge(x.shift(2).add(3))) & (x.eq(x.shift(-1)))
logic_2 = lambda y: (y.shift(1).ge(1)) & (y.shift(2).ge(2)) & (y.shift(-1).ge(1))
Credit to SO users for helping me with the logic.
This is what I tried:
df['label'] = ''
df['date'] = pd.to_datetime(df['date'])
df['tdiff'] = df['date'].shift(-1) - df['date']
df['tdiff'] = df['tdiff'].dt.total_seconds()/3600
df['lo_1'] = df.groupby('subject_id')['val'].transform(logic_1).map({True:'1',False:''})
df['lo_2'] = df.groupby('subject_id')['tdiff'].transform(logic_2).map({True:'1',False:''})
How can I make both logic_1 and logic_2 part of one logic statement? Is that even possible? I might have more than two logics as well; instead of writing one line for each logic, is it possible to couple all the logics together in one statement?
I expect my output to have the label column filled with 1 when both logic_1 and logic_2 are satisfied.
You have a few things to fix.
First, in logic_2 you had lambda x but used y inside, so you have to change that, as below:
logic_2 = lambda y: (y.shift(1).ge(1)) & (y.shift(2).ge(2)) & (y.shift(-1).ge(1))
Then you can use the logics together as below.
There is no need to create a blank label column; you can create the 'label' column directly, as below:
df['label'] = ((df.groupby('subject_id')['val'].transform(logic_1))
               & (df.groupby('subject_id')['tdiff'].transform(logic_2))).map({True: '0', False: '1'})
Note: your logic produces all False values, so you will only get 1s if False (not True) is mapped to '1'.
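If more logics are added later, one way to avoid writing a line per logic is to collect them in a dict and reduce the masks with &. This is a sketch, assuming each logic is keyed by the column it applies to:

from functools import reduce
import operator

logics = {'val': logic_1, 'tdiff': logic_2}  # add further column -> logic pairs here
masks = [df.groupby('subject_id')[col].transform(fn) for col, fn in logics.items()]
df['label'] = reduce(operator.and_, masks).map({True: '1', False: ''})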
I am writing a function that will serve as a filter for the rows I want to use.
The sample data frame is as follows:
import pandas as pd

df = pd.DataFrame()
df['Xstart'] = [1, 2.5, 3, 4, 5]
df['Xend'] = [6, 8, 9, 10, 12]
df['Ystart'] = [0, 1, 2, 3, 4]
df['Yend'] = [6, 8, 9, 10, 12]
df['GW'] = [1, 1, 2, 3, 4]
def filter(data, Game_week):
    pass_data = data[(data['GW'] == Game_week)]

When I call the function filter as follows, I get an error:
df1 = filter(df,1)
The error message is
AttributeError: 'NoneType' object has no attribute 'head'
But when I filter manually, it works:
pass_data = df[(df['GW'] == [1])]
This is my first issue.
My second issue is that I want to filter the rows with multiple GW values (1, 2, 3), etc.
I can do that manually as follows:
pass_data = df[(df['GW'] == [1]) | (df['GW'] == [2]) | (df['GW'] == [3])]
If I want the function input to be a list like [1, 2, 3], how can I write the function so that I can input a range of 1 to 3?
Could anyone please advise?
Thanks,
Zep
Use isin to pass a list of values instead of a scalar. Also, filter is an existing built-in function in Python, so it is better to change the function name:

def filter_vals(data, Game_week):
    return data[data['GW'].isin(Game_week)]

df1 = filter_vals(df, range(1, 4))
Because you don't return anything from the function, it returns None rather than the desired dataframe, so do this (note also that no parentheses are needed inside data[...]):

def filter(data, Game_week):
    return data[data['GW'] == Game_week]

Also, isin may well be better:

def filter(data, Game_week):
    return data[data['GW'].isin(Game_week)]
Use return to return data from the function for the first part. For the second, use:

def filter(data, Game_week):
    return data[data['GW'].isin(Game_week)]

Now apply the filter function:

df1 = filter(df, [1, 2])
I have a function that takes in some complex parameters and is expected to return a filter to be used on a pandas dataframe.
filters = build_filters(df, ...)
filtered_df = df[filters]
For example, if the dataframe has series Gender and Age, build_filters could return (df.Gender == 'M') & (df.Age == 100)
If, however, build_filters determines that there should be no filters applied, is there anything that I can return (i.e. the "identity filter") that will result in df not being filtered?
I've tried the obvious things like None, True, and even a generator that returns True for every call to next().
The closest I've come is
operator.ne(df.iloc[:, 0], np.nan)
which I think is silly and likely to cause bugs I can't yet foresee.
You can return slice(None). Here's a trivial demonstration:

df = pd.DataFrame([[1, 2, 3]])
df2 = df[slice(None)]  # equivalent to df2 = df[:], i.e. select every row
assert df2.equals(df)
Alternatively, use pd.DataFrame.pipe and return df if no filters need to be applied:

def apply_filters(df):
    # some logic to decide whether filtering is needed
    if not filter_flag:
        return df
    else:
        # mask = ....
        return df[mask]

filtered_df = df.pipe(apply_filters)
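For completeness, here is a sketch of how build_filters itself could always return a usable filter: start from an all-True boolean mask, which is another identity filter, and narrow it down. The gender/age parameters are hypothetical, following the example above:

import pandas as pd

def build_filters(df, gender=None, age=None):
    # the all-True mask is an identity filter: df[mask] returns every row
    mask = pd.Series(True, index=df.index)
    if gender is not None:
        mask &= df['Gender'] == gender
    if age is not None:
        mask &= df['Age'] == age
    return mask

# filtered_df = df[build_filters(df, gender='M')]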