I know how to apply an IF condition in Pandas DataFrame. link
However, my question is how to do the following:
if (df[df['col1'] == 0]):
sys.path.append("/desktop/folder/")
import self_module as sm
df = sm.call_function(df)
What I really want to do is when value in col1 equals to 0 then call function call_function().
def call_function(ds):
ds['new_age'] = (ds['age']* 0.012345678901).round(12)
return ds
I provide a simple example above for call_function().
Since your function interacts with multiple columns and returns a whole data frame, run conditional logic inside the method:
def call_function(ds):
ds['new_age'] = np.nan
ds.loc[ds['col'] == 0, 'new_age'] = ds['age'].mul(0.012345678901).round(12)
return ds
df = call_function(df)
If you are unable to modify the function, run method on splits of data frame and concat or append together. Any new columns in other split will be have values filled with NAs.
def call_function(ds):
ds['new_age'] = (ds['age']* 0.012345678901).round(12)
return ds
df = pd.concat([call_function(df[df['col'] == 0].copy()),
df[df['col'] != 0].copy()])
Related
I need to replace the values inside the dataset, I used the fillna() method, the function runs, but when I check the data is still null
import pandas as pd
import numpy as np
dataset = pd.read_csv('mamografia.csv')
dataset
mamografia = dataset
mamografia
malignos = mamografia[mamografia['Severidade'] == 0].isnull().sum()
print('Valores ausentes: ')
print()
print('Valores Malignos: ', malignos)
print()
belignos = mamografia[mamografia['Severidade'] == 1].isnull().sum()
print('Valores Belignos:', belignos)
def substitui_ausentes(lista_col):
for lista in lista_col:
if lista != 'Idade':
mamografia[lista].fillna(value = mamografia[lista][(mamografia['Severidade'] == 0)].mode())
mamografia[lista].fillna(value = mamografia[lista][(mamografia['Severidade'] == 1)].mode())
else:
mamografia[lista].fillna(value = mamografia[lista][(mamografia['Severidade'] == 0)].mean())
mamografia[lista].fillna(value = mamografia[lista][(mamografia['Severidade'] == 1)].mean())
mamografia.columns
substitui_ausentes(mamografia.columns)
mamografia
I'm trying to replace the null values, using fillna()
By default fillna does not work in place but returns the result of the operation.
You can either set the new value manually using
df = df.fillna(...)
Or overwrite the default behaviour by setting the parameter inplace=True
df.fillna(... , inplace=True)
However your code will still not work since you want to fill the different severities separately.
Since the function is being rewritten lets also make it more pandonic by not making it change the Dataframe by default
def substitui_ausentes(dfc, reglas, inplace = False):
if inplace:
df = dfc
else:
df = dfc.copy()
fill_values = df.groupby('Severidade').agg(reglas).to_dict(orient='index')
for k in fill_values:
df.loc[df['Severidade'] == k] = df.loc[df['Severidade'] == k].fillna(fill_values[k])
return df
Note that you now need to call the function using
reglas = {
'Idade':lambda x: pd.Series.mode(x)[0],
'Densidade':'mean'
}
substitui_ausentes(df,reglas, inplace=True)
and the reglas dictionary needs to include only the columns you want to fill and how you want to fill them.
I want my code to:
read data from a CSV and make a dataframe: "source_df"
see if the dataframe contains any columns specified in a list:
"possible_columns"
call a unique function to replace the values in each column whose header is found in the "possible_columns" the list, then insert the modified values in a new dataframe: "destination_df"
Here it is:
import pandas as pd
#creates source_df
file = "yes-no-true-false.csv"
data = pd.read_csv(file)
source_df = pd.DataFrame(data)
#creates destination_df
blanklist = []
destination_df = pd.DataFrame(blanklist)
#create the column header lists for comparison in the while loop
columns = source_df.head(0)
possible_columns = ['yes/no','true/false']
#establish the functions list and define the functions to replace column values
fix_functions_list = ['yes_no_fix()','true_false_fix()']
def yes_no_fix():
destination_df['yes/no'] = destination_df['yes/no fixed'].replace("No","0").replace("Yes","1")
def true_false_fix():
destination_df['true/false'] = destination_df['true/false fixed'].replace('False', '1').replace('True', '0')
'''use the counter to call a unique function from the function list to replace the values in each column whose header is found in the "possible_columns" the list, insert the modified values in "destination_df, then advance the counter'''
counter = 0
while counter < len(possible_columns):
if possible_columns[counter] in columns:
destination_df.insert(counter, possible_columns[counter], source_df[possible_columns[counter]])
fix_functions_list[counter]
counter = counter + 1
#see if it works
print(destination_df.head(10))
When I print(destination_df), I see the unmodified column values from source_df. When I call the functions independently they work, which makes me think something is going wrong in my while loop.
Your issue is that you are trying to call a function that is stored in a list as a string.
fix_functions_list[cnt]
This will not actually run the function just access the string value.
I would try and find another way to run these functions.
def yes_no_fix():
destination_df['yes/no'] = destination_df['yes/no fixed'].replace("No","0").replace("Yes","1")
def true_false_fix():
destination_df['true/false'] = destination_df['true/false fixed'].replace('False', '1').replace('True', '0')
fix_functions_list = {0:yes_no_fix,1:true_false_fix}
and change the function calling to like below
fix_functions_list[counter]()
#creates source_df
file = "yes-no-true-false.csv"
data = pd.read_csv(file)
source_df = pd.DataFrame(data)
possible_columns = ['yes/no','true/false']
mapping_dict={'yes/no':{"No":"0","Yes":"1"} ,'true/false': {'False':'1','True': '0'}
old_columns=[if column not in possible_columns for column in source_df.columns]
existed_columns=[if column in possible_columns for column in source_df.columns]
new_df=source_df[existed_columns]
for column in new_df.columns:
new_df[column].map(mapping_dict[column])
new_df[old_columns]=source_df[old_columns]
I just had a below concern in performing window operation on pyspark dataframe. I want to get the latest records from the input table with the below condition, but I want to exclude the for loop:
groupby_col = ["col('customer_id')"]
orderby_col = ["col('process_date').desc()", "col('load_date').desc()"]
window_spec = Window.partitionBy(*groupby_col).orderBy([eval(x) for x in orderby_col])
df = df.withColumn("rank", rank().over(window_spec))
df = df.filter(col('rank') == '1')
My concern, is I'm using the orderby_col and evaluating to covert in columner way using eval() and for loop to check all the orderby columns in the list.
Could you please let me know how we can pass multiple columns in order by without having a for loop to do the descending order??
import pyspark.sql.functions as f
groupby_col = ["col('customer_id')"]
orderby_col = ["col('process_date')", "col('load_date')"]
window_spec = Window.partitionBy(*groupby_col).orderBy(f.desc(*orderby_col))
df = df.withColumn("rank", f.rank().over(window_spec))
df = df.filter(col('rank') == '1')
I wrote an function which only depends on a dataframe. The functions output is also a dataframe. I would like make different dataframes according a condition and save them as different datasets with different names. However I couldnt save them as dataframes with different names. Instead i manually do the process. Is there a code which would do the same. It would be much beneficial.
import os
import numpy as np
import pandas as pd
data1 = pd.read_csv('C:/Users/Oz/Desktop/vintage/vintage1.csv', encoding='latin-1')
product_list= data1['product_types'].unique()
def vintage_table(df):
df['Disbursement_Date']=pd.to_datetime(df.Disbursement_Date)
df['Closing_Date']=pd.to_datetime(df.Closing_Date)
df['NPL_date']=pd.to_datetime(df.NPL_date, errors='ignore')
df['NPL_date_period']=df.loc[df.NPL_date > '2015-01-01', 'NPL_date'].apply(lambda x: x.strftime('%Y-%m'))
df['Dis_date_period'] = df.Disbursement_Date.apply(lambda x: x.strftime('%Y-%m'))
df['diff']=((df.NPL_date-df.Disbursement_Date) / np.timedelta64(3, 'M')).round(0)
df=df.groupby(['Dis_date_period','NPL_date_period']).agg({'Dis_amount' : 'sum', 'NPL_amount' : 'sum', 'diff' : 'mean'})
df.reset_index(level=0, inplace=True)
df['Vintage_Ratio']=df['NPL_amount']/df['Dis_amount']
table=pd.pivot_table(df,values='Vintage_Ratio',index='Dis_date_period',columns=['diff'],).fillna(0)
return
The above is the function
#for e in product_list:
# sub = data1[data1['product_types'] == e]
# print(sub)
consumer = data1[data1['product_types'] == product_list[0]]
mortgage = data1[data1['product_types'] == product_list[1]]
vehicle = data1[data1['product_types'] == product_list[2]]
table_con = vintage_table(consumer)
table_mor = vintage_table(mortgage)
table_veh = vintage_table(vehicle)
I would like to improve this part is there a better way to do the same process?
You could have your vintage_table() function return a dataframe instead of just modifying one dataframe over and over and that way you could do this in the second code block:
table_con = vintage_table(consumer)
table_mor = vintage_table(mortgage)
table_veh = vintage_table(vechicle)
I am writing a function that will serve as filter for rows that I wanted to use.
The sample data frame is as follow:
df = pd.DataFrame()
df ['Xstart'] = [1,2.5,3,4,5]
df ['Xend'] = [6,8,9,10,12]
df ['Ystart'] = [0,1,2,3,4]
df ['Yend'] = [6,8,9,10,12]
df ['GW'] = [1,1,2,3,4]
def filter(data,Game_week):
pass_data = data [(data['GW'] == Game_week)]
when I recall the function filter as follow, I got an error.
df1 = filter(df,1)
The error message is
AttributeError: 'NoneType' object has no attribute 'head'
but when I use manual filter, it works.
pass_data = df [(df['GW'] == [1])]
This is my first issue.
My second issue is that I want to filter the rows with multiple GW (1,2,3) etc.
For that I can manually do it as follow:
pass_data = df [(df['GW'] == [1])|(df['GW'] == [2])|(df['GW'] == [3])]
if I want to use in function input as list [1,2,3]
how can I write it in function such that I can input a range of 1 to 3?
Could anyone please advise?
Thanks,
Zep
Use isin for pass list of values instead scalar, also filter is existing function in python, so better is change function name:
def filter_vals(data,Game_week):
return data[data['GW'].isin(Game_week)]
df1 = filter_vals(df,range(1,4))
Because you don't return in the function, so it will be None, not the desired dataframe, so do (note that also no need parenthesis inside the data[...]):
def filter(data,Game_week):
return data[data['GW'] == Game_week]
Also, isin may well be better:
def filter(data,Game_week):
return data[data['GW'].isin(Game_week)]
Use return to return data from the function for the first part. For the second, use -
def filter(data,Game_week):
return data[data['GW'].isin(Game_week)]
Now apply the filter function -
df1 = filter(df,[1,2])