I need to replace the null values inside the dataset. I used the fillna() method and the function runs, but when I check, the data is still null.
import pandas as pd
import numpy as np
dataset = pd.read_csv('mamografia.csv')
dataset
mamografia = dataset
mamografia
malignos = mamografia[mamografia['Severidade'] == 0].isnull().sum()
print('Valores ausentes: ')
print()
print('Valores Malignos: ', malignos)
print()
belignos = mamografia[mamografia['Severidade'] == 1].isnull().sum()
print('Valores Belignos:', belignos)
def substitui_ausentes(lista_col):
    for lista in lista_col:
        if lista != 'Idade':
            mamografia[lista].fillna(value=mamografia[lista][(mamografia['Severidade'] == 0)].mode())
            mamografia[lista].fillna(value=mamografia[lista][(mamografia['Severidade'] == 1)].mode())
        else:
            mamografia[lista].fillna(value=mamografia[lista][(mamografia['Severidade'] == 0)].mean())
            mamografia[lista].fillna(value=mamografia[lista][(mamografia['Severidade'] == 1)].mean())
mamografia.columns
substitui_ausentes(mamografia.columns)
mamografia
I'm trying to replace the null values using fillna().
By default fillna does not work in place but returns the result of the operation.
You can either set the new value manually using
df = df.fillna(...)
Or overwrite the default behaviour by setting the parameter inplace=True
df.fillna(... , inplace=True)
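As a minimal sketch of that gotcha on a toy frame (the frame here is made up for illustration):

import numpy as np
import pandas as pd

toy = pd.DataFrame({'x': [1.0, np.nan]})
toy['x'].fillna(0)             # returns a filled copy; toy is unchanged
toy['x'] = toy['x'].fillna(0)  # assigning the result back persists it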
However, your code will still not work, since you want to fill the different severities separately.
Since the function is being rewritten, let's also make it more idiomatic pandas by not having it change the DataFrame by default:
def substitui_ausentes(dfc, reglas, inplace=False):
    if inplace:
        df = dfc
    else:
        df = dfc.copy()
    # One fill value per column, per severity group
    fill_values = df.groupby('Severidade').agg(reglas).to_dict(orient='index')
    for k in fill_values:
        df.loc[df['Severidade'] == k] = df.loc[df['Severidade'] == k].fillna(fill_values[k])
    return df
Note that you now need to call the function using
reglas = {
    'Idade': 'mean',
    'Densidade': lambda x: pd.Series.mode(x)[0]
}
substitui_ausentes(mamografia, reglas, inplace=True)
and the reglas dictionary needs to include only the columns you want to fill and how you want to fill them.
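For intuition, a minimal sketch (on a made-up toy frame) of the intermediate fill_values dictionary the function builds:

import pandas as pd

toy = pd.DataFrame({'Severidade': [0, 0, 1, 1],
                    'Idade': [50, 60, 70, 80],
                    'Densidade': [2, 2, 3, 4]})
reglas = {'Idade': 'mean', 'Densidade': lambda x: pd.Series.mode(x)[0]}
print(toy.groupby('Severidade').agg(reglas).to_dict(orient='index'))
# {0: {'Idade': 55.0, 'Densidade': 2}, 1: {'Idade': 75.0, 'Densidade': 3}}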
I want to flag the anomalies in the desired_ columns (desired_D to desired_L). Here, an anomaly is defined as any value below 1500 or above 400000 in each row.
See below for the dataset
import pandas as pd
# initialise data of lists
data = {
'A':['L1', 'L2', 'L3', 'L4', 'L5'],
'B':[1,1,1,1,1],
'C':[1,2,3,5,9],
'desired_D':[12005, 18190, 1021, 13301, 31119],
'desired_E':[11021, 19112, 19021, 15, 24509 ],
'desired_F':[10022,19910, 19113,449999, 25519],
'desired_G':[14029, 29100, 39022, 24509, 412271],
'desired_H':[52119,32991,52883,69359,57835],
'desired_J':[41218, 52991,55121,69152,79355],
'desired_K': [43211,7672991,56881,211,77342],
'desired_L': [31211,42901,53818,62158,69325],
}
# Create DataFrame
df = pd.DataFrame(data)
# Print the output.
df
Currently, my code flags columns B and C as well (I want to exclude them).
The revised code looks like this:
# function to flag the anomaly in each row - this flags columns B and C as well (I want to exclude these columns)
dont_format_cols = ['B', 'C']

def flag_outliers(s, dont_format_cols):
    if s.name in dont_format_cols:
        return ''  # or None, or whatever df.style() needs
    else:
        s = pd.to_numeric(s, errors='coerce')
        indexes = (s < 1500) | (s > 400000)
        return ['background-color: red' if v else '' for v in indexes]

styled = df.style.apply(flag_outliers, axis=1)
styled
After these edits I still get an error. Desired output: it should exclude B and C; refer to the image below.
df.style.apply(..., axis=1) applies your outlier-styling function to every row, across all of df's columns. If you only want to apply it to some columns, use the subset argument.
EDIT: I wasn't aware df.style.apply() had a subset argument; I had proposed these hacky approaches:
1: Inspect the series name s.name inside the styling function, as in the solution to Pandas style function to highlight specific columns.
### Hack solution: just hardwire it into the body of `flag_outliers()` without adding an extra arg `dont_format_cols`

def flag_outliers(s):
    dont_format_cols = ['B', 'C']
    if s.name in dont_format_cols:
        return ''  # or None, or whatever df.style() needs
    else:
        # code to apply formatting
        s = pd.to_numeric(s, errors='coerce')
        mask = (s < 1500) | (s > 400000)
        return ['background-color: red' if v else '' for v in mask]
2: Another hack approach: add a second arg dont_format_cols to your function flag_outliers(s, dont_format_cols). Now you have to pass it in the apply call, so you'll need a lambda:

styled = df.style.apply(lambda s: flag_outliers(s, dont_format_cols), axis=1)

and:

def flag_outliers(s, dont_format_cols):
    if s.name in dont_format_cols:
        return ''  # or None, or whatever df.style() needs
    else:
        # code to apply formatting
        s = pd.to_numeric(s, errors='coerce')
        mask = (s < 1500) | (s > 400000)
        return ['background-color: red' if v else '' for v in mask]
Use the subset argument; that is precisely its purpose: to isolate styles to specific regions.
i.e. df.style.apply(flag_outliers, axis=1, subset=<list of used columns>)
You can see examples in the pandas Styler user guide documentation, under "Finer slicing".
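A minimal sketch of the subset approach, assuming the frame from the question (the column list is derived here from the desired_ prefix, and the name check becomes unnecessary once subset restricts which columns the function sees):

import pandas as pd

def flag_outliers(s):
    s = pd.to_numeric(s, errors='coerce')
    mask = (s < 1500) | (s > 400000)
    return ['background-color: red' if v else '' for v in mask]

desired_cols = [c for c in df.columns if c.startswith('desired_')]
styled = df.style.apply(flag_outliers, axis=1, subset=desired_cols)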
I know how to apply an IF condition in a Pandas DataFrame.
However, my question is how to do the following:
if (df[df['col1'] == 0]):
    sys.path.append("/desktop/folder/")
    import self_module as sm
    df = sm.call_function(df)
What I really want to do is: when the value in col1 equals 0, call the function call_function().
def call_function(ds):
    ds['new_age'] = (ds['age'] * 0.012345678901).round(12)
    return ds
I provide a simple example above for call_function().
Since your function interacts with multiple columns and returns a whole data frame, run conditional logic inside the method:
import numpy as np

def call_function(ds):
    ds['new_age'] = np.nan
    ds.loc[ds['col1'] == 0, 'new_age'] = ds['age'].mul(0.012345678901).round(12)
    return ds

df = call_function(df)
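Equivalently, a compact sketch of the same conditional in one step with np.where (column names col1 and age taken from the question):

import numpy as np

def call_function(ds):
    ds['new_age'] = np.where(ds['col1'] == 0,
                             (ds['age'] * 0.012345678901).round(12),
                             np.nan)
    return ds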
If you are unable to modify the function, run the method on splits of the data frame and concat or append them together. Any new columns in the other split will have their values filled with NaN.
def call_function(ds):
    ds['new_age'] = (ds['age'] * 0.012345678901).round(12)
    return ds

df = pd.concat([call_function(df[df['col1'] == 0].copy()),
                df[df['col1'] != 0].copy()])
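One note on this approach: the concatenated result comes back in split order, so if the original row order matters it can be restored from the preserved index (a one-line sketch):

df = df.sort_index()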
I have a concern about performing a window operation on a PySpark dataframe. I want to get the latest records from the input table with the condition below, but I want to exclude the for loop:
groupby_col = ["col('customer_id')"]
orderby_col = ["col('process_date').desc()", "col('load_date').desc()"]
window_spec = Window.partitionBy(*groupby_col).orderBy([eval(x) for x in orderby_col])
df = df.withColumn("rank", rank().over(window_spec))
df = df.filter(col('rank') == '1')
My concern is that I'm using eval() and a for loop over orderby_col to convert each string in the list into a descending Column expression.
Could you please let me know how to pass multiple columns to orderBy in descending order without the eval()/for-loop machinery?
import pyspark.sql.functions as f
from pyspark.sql import Window

# Plain column names; no eval() needed
groupby_col = ['customer_id']
orderby_col = ['process_date', 'load_date']

# f.desc() takes a single column name, so wrap each one
window_spec = Window.partitionBy(*groupby_col).orderBy(*[f.desc(c) for c in orderby_col])
df = df.withColumn("rank", f.rank().over(window_spec))
df = df.filter(f.col('rank') == 1)
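If the column list is fixed anyway, a sketch with the ordering written out explicitly and no list machinery at all (column names taken from the question):

window_spec = Window.partitionBy('customer_id').orderBy(f.desc('process_date'),
                                                        f.desc('load_date'))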
I wrote a function that depends only on a dataframe; the function's output is also a dataframe. I would like to make different dataframes according to a condition and save them as different datasets with different names. However, I couldn't save them as dataframes with different names; instead I do the process manually. Is there code that would do the same? It would be very helpful.
import os
import numpy as np
import pandas as pd
data1 = pd.read_csv('C:/Users/Oz/Desktop/vintage/vintage1.csv', encoding='latin-1')
product_list= data1['product_types'].unique()
def vintage_table(df):
    df['Disbursement_Date'] = pd.to_datetime(df.Disbursement_Date)
    df['Closing_Date'] = pd.to_datetime(df.Closing_Date)
    df['NPL_date'] = pd.to_datetime(df.NPL_date, errors='ignore')
    df['NPL_date_period'] = df.loc[df.NPL_date > '2015-01-01', 'NPL_date'].apply(lambda x: x.strftime('%Y-%m'))
    df['Dis_date_period'] = df.Disbursement_Date.apply(lambda x: x.strftime('%Y-%m'))
    df['diff'] = ((df.NPL_date - df.Disbursement_Date) / np.timedelta64(3, 'M')).round(0)
    df = df.groupby(['Dis_date_period', 'NPL_date_period']).agg({'Dis_amount': 'sum', 'NPL_amount': 'sum', 'diff': 'mean'})
    df.reset_index(level=0, inplace=True)
    df['Vintage_Ratio'] = df['NPL_amount'] / df['Dis_amount']
    table = pd.pivot_table(df, values='Vintage_Ratio', index='Dis_date_period', columns=['diff']).fillna(0)
    return
The above is the function
# for e in product_list:
#     sub = data1[data1['product_types'] == e]
#     print(sub)
consumer = data1[data1['product_types'] == product_list[0]]
mortgage = data1[data1['product_types'] == product_list[1]]
vehicle = data1[data1['product_types'] == product_list[2]]
table_con = vintage_table(consumer)
table_mor = vintage_table(mortgage)
table_veh = vintage_table(vehicle)
I would like to improve this part. Is there a better way to do the same process?
You could have your vintage_table() function return the dataframe it builds (end it with return table instead of a bare return) rather than just modifying a dataframe and returning nothing, and that way you could do this in the second code block:
table_con = vintage_table(consumer)
table_mor = vintage_table(mortgage)
table_veh = vintage_table(vehicle)
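If you also want to avoid writing one line per product, a sketch that builds a dict of result tables keyed by product type (this assumes vintage_table() now ends with return table):

tables = {p: vintage_table(data1[data1['product_types'] == p].copy())
          for p in product_list}
# e.g. tables[product_list[0]] is the consumer table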
I am writing a function that will serve as a filter for the rows that I want to use.
The sample data frame is as follow:
df = pd.DataFrame()
df['Xstart'] = [1, 2.5, 3, 4, 5]
df['Xend'] = [6, 8, 9, 10, 12]
df['Ystart'] = [0, 1, 2, 3, 4]
df['Yend'] = [6, 8, 9, 10, 12]
df['GW'] = [1, 1, 2, 3, 4]
def filter(data, Game_week):
    pass_data = data[(data['GW'] == Game_week)]
When I call the function filter as follows, I get an error.
df1 = filter(df,1)
The error message is
AttributeError: 'NoneType' object has no attribute 'head'
but when I use a manual filter, it works:
pass_data = df[(df['GW'] == [1])]
This is my first issue.
My second issue is that I want to filter rows with multiple GW values (1, 2, 3), etc.
I can do that manually as follows:
pass_data = df[(df['GW'] == [1]) | (df['GW'] == [2]) | (df['GW'] == [3])]
If I want the function input to be a list like [1, 2, 3],
how can I write the function so that I can pass a range of 1 to 3?
Could anyone please advise?
Thanks,
Zep
Use isin to pass a list of values instead of a scalar. Also, filter is an existing built-in function in Python, so it is better to rename your function:
def filter_vals(data, Game_week):
    return data[data['GW'].isin(Game_week)]

df1 = filter_vals(df, range(1, 4))
Because you don't return anything from the function, it returns None instead of the desired dataframe. So do this (note also that the extra parentheses inside data[...] are unnecessary):
def filter(data, Game_week):
    return data[data['GW'] == Game_week]
Also, isin may well be better:
def filter(data, Game_week):
    return data[data['GW'].isin(Game_week)]
For the first part, use return to return data from the function. For the second, use -
def filter(data, Game_week):
    return data[data['GW'].isin(Game_week)]

Now apply the filter function -

df1 = filter(df, [1, 2])