return various variables in nested function based on conditional statement - python

I have an original dataframe from which I can create a modified dataframe. In some cases, though, I want to work with only a subset of my data rather than the dataframe as a whole, and I want all of this handled inside a single function. Is it possible to return different variables based on a conditional, or would this be incorrect?
The function below works fine when I run
modified_df = modify_data(protein_embeddings, protein_df, subset = False)
but when I try executing:
gal_subset_first, gal_subset_second = modify_data(protein_embeddings, protein_df, subset = True)
I get the error:
ValueError: too many values to unpack (expected 2)
The Function
def modify_data(embeddings, df, subset=False):
    """
    Modifies the original dataframe with its respective embeddings
    :return: Final dataframe to be used in the data split and modelling
    """
    # Original DF
    OD_df = df.copy(deep=True)
    OD_df = df.reset_index()
    OD_df.loc[:, 'task'] = 'stability'
    # Embeddings DF
    embeddings_df = pd.DataFrame(data=embeddings)
    embeddings_df = embeddings_df.reset_index()
    embedded_df = pd.merge(embeddings_df, OD_df, on='index')
    embedded_df = embedded_df.drop(['index', 'sequence', 'temperature'], axis=1)

    def subsetting(embedded_df, sample_no, row_no):
        "Select a subset of rows desired from the original dataframe"
        # Selecting subset
        embedded_df = embedded_df.sample(n=sample_no)
        subset_first = embedded_df[:row_no]
        subset_second = embedded_df[row_no:]
        return subset_first, subset_second

    if subset == True:
        gal_subset_first, gal_subset_second = subsetting(embedded_df, sample_no=2000, row_no=1000)
    else:
        pass
    return embedded_df

Your function always returns a single data frame, which is iterable. When you assign the result to one variable, the whole data frame is bound to that variable. However, if you assign the result to multiple variables, Python iterates over the returned value and checks whether the number of variables matches the number of items the iterator yields.
Compare the code samples:
def f():
    return (1, 2, 3)

a = f()        # a is the tuple (1, 2, 3)
a, b = f()     # raises the same ValueError: too many values to unpack (expected 2)
a, b, c = f()  # a=1, b=2, c=3 because the number of returned values matches the number of assigned variables
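Applied to your function, this means the subset branch computes two dataframes but then falls through to return embedded_df, a single DataFrame, which is what the caller then tries to unpack. A minimal sketch of a fix (keeping the names from your question) is to return the tuple from that branch:

def modify_data(embeddings, df, subset=False):
    ...  # same preprocessing and inner subsetting() as above
    if subset:
        # subsetting() returns a 2-tuple, so the caller can unpack two values
        return subsetting(embedded_df, sample_no=2000, row_no=1000)
    return embedded_df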

How to return arrays from a function in python

I have a function that reads some data files and makes pandas dataframes. I have 4 paths I want to read and turn into dataframes.
def read_files(paths: np.array, scalings: np.array):
    names = ['E', 'I']
    for p, s in zip(paths, scalings):
        df = pd.read_csv(p, engine='python', sep='\s+', names=names)
        energy_ = df['E']
        intensity_ = df['I']
    return energy_, intensity_
I want to make two arrays that hold all of the dataframes, to use in other functions, where energy_0 is the df['E'] from the first path in the paths array, and so on:

energy = [energy_0, energy_1, energy_2, energy_3]
intensity = [intensity_0, intensity_1, intensity_2, intensity_3]

to use in

fig = plotfunction(energy, intensity, etc)

How can I call each specific dataframe so I can make an array with them?
Edit: What if I want to use the energy and intensity dataframes from path3, with paths = [path0, path1, path2, path3]?
Here's an example of how to return two arrays from a function:

def function():
    array = [1, 2, 3]
    array2 = [4, 5, 6]
    return array, array2

a, a2 = function()
print(a)
print(a2)
"How can I call each specific dataframe"

It is not entirely clear what you mean, but I can only guess you want:
def read_files(paths: np.array, scalings: np.array):
    names = ['E', 'I']
    energy_ = []     # create an array here
    intensity_ = []  # create another array here
    for p, s in zip(paths, scalings):
        df = pd.read_csv(p, engine='python', sep='\s+', names=names)
        energy_.append(df['E'])     # append to that array
        intensity_.append(df['I'])  # append to that other array
    return energy_, intensity_     # return both arrays
Note that scalings is unpacked into s but never actually used in the code you have shown us.
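To address the edit: with this version the results are plain lists, so the dataframes for path3 are just the elements at index 3. A hypothetical usage sketch (path0 ... path3, scalings, and plotfunction are names from the question, assumed to be defined elsewhere):

paths = [path0, path1, path2, path3]
energy, intensity = read_files(paths, scalings)
fig = plotfunction(energy, intensity)            # all four at once
energy_3, intensity_3 = energy[3], intensity[3]  # just the path3 data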

pandas: while loop to simultaneously advance through multiple lists and call functions

I want my code to:

- read data from a CSV and make a dataframe: "source_df"
- see if the dataframe contains any columns specified in a list: "possible_columns"
- call a unique function to replace the values in each column whose header is found in the "possible_columns" list, then insert the modified values in a new dataframe: "destination_df"
Here it is:
import pandas as pd

# creates source_df
file = "yes-no-true-false.csv"
data = pd.read_csv(file)
source_df = pd.DataFrame(data)

# creates destination_df
blanklist = []
destination_df = pd.DataFrame(blanklist)

# create the column header lists for comparison in the while loop
columns = source_df.head(0)
possible_columns = ['yes/no', 'true/false']

# establish the functions list and define the functions to replace column values
fix_functions_list = ['yes_no_fix()', 'true_false_fix()']

def yes_no_fix():
    destination_df['yes/no'] = destination_df['yes/no fixed'].replace("No", "0").replace("Yes", "1")

def true_false_fix():
    destination_df['true/false'] = destination_df['true/false fixed'].replace('False', '1').replace('True', '0')

'''use the counter to call a unique function from the function list to replace the values in
each column whose header is found in the "possible_columns" list, insert the modified values
in "destination_df", then advance the counter'''
counter = 0
while counter < len(possible_columns):
    if possible_columns[counter] in columns:
        destination_df.insert(counter, possible_columns[counter], source_df[possible_columns[counter]])
        fix_functions_list[counter]
    counter = counter + 1

# see if it works
print(destination_df.head(10))
When I print(destination_df), I see the unmodified column values from source_df. When I call the functions independently they work, which makes me think something is going wrong in my while loop.
Your issue is that you are trying to call a function that is stored in a list as a string:

fix_functions_list[counter]

This will not actually run the function; it just accesses the string value. I would try to find another way to run these functions:
def yes_no_fix():
    destination_df['yes/no'] = destination_df['yes/no fixed'].replace("No", "0").replace("Yes", "1")

def true_false_fix():
    destination_df['true/false'] = destination_df['true/false fixed'].replace('False', '1').replace('True', '0')

fix_functions_list = {0: yes_no_fix, 1: true_false_fix}

and change the function call to the one below:

fix_functions_list[counter]()
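Putting it together, the while loop from the question would then look like this (a sketch assuming the rest of the setup stays the same):

counter = 0
while counter < len(possible_columns):
    if possible_columns[counter] in columns:
        destination_df.insert(counter, possible_columns[counter], source_df[possible_columns[counter]])
        fix_functions_list[counter]()  # the parentheses actually invoke the function
    counter = counter + 1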
Alternatively, you can avoid the functions entirely and replace the values with a mapping dict:

# creates source_df
file = "yes-no-true-false.csv"
data = pd.read_csv(file)
source_df = pd.DataFrame(data)

possible_columns = ['yes/no', 'true/false']
mapping_dict = {'yes/no': {"No": "0", "Yes": "1"},
                'true/false': {'False': '1', 'True': '0'}}

old_columns = [column for column in source_df.columns if column not in possible_columns]
existing_columns = [column for column in source_df.columns if column in possible_columns]

new_df = source_df[existing_columns].copy()
for column in new_df.columns:
    new_df[column] = new_df[column].map(mapping_dict[column])
new_df[old_columns] = source_df[old_columns]

Note that map returns NaN for any value missing from mapping_dict[column].

if statement and call function for dataframe

I know how to apply an IF condition in Pandas DataFrame. link
However, my question is how to do the following:
if (df[df['col1'] == 0]):
    sys.path.append("/desktop/folder/")
    import self_module as sm
    df = sm.call_function(df)
What I really want to do is call call_function() when the value in col1 equals 0.
def call_function(ds):
    ds['new_age'] = (ds['age'] * 0.012345678901).round(12)
    return ds
I provide a simple example above for call_function().
Since your function interacts with multiple columns and returns a whole data frame, run the conditional logic inside the method:

import numpy as np

def call_function(ds):
    ds['new_age'] = np.nan
    ds.loc[ds['col'] == 0, 'new_age'] = ds['age'].mul(0.012345678901).round(12)
    return ds

df = call_function(df)
If you are unable to modify the function, run the method on splits of the data frame and concat or append the pieces together. Any new columns in the other split will have their values filled with NaN.
def call_function(ds):
    ds['new_age'] = (ds['age'] * 0.012345678901).round(12)
    return ds

df = pd.concat([call_function(df[df['col'] == 0].copy()),
                df[df['col'] != 0].copy()])

How to make my modified pandas/numpy .where function adaptable to different sizes of a list parameter?

I want to create my own function that scans a number of user-specified columns in a dataframe; that function will create a new variable and assign it '1' if all the specified columns equal 1, otherwise '0'.
In the following code, I only handle the case where users input exactly two columns to be scanned.
import numpy as np

class Tagger:
    def __init__(self):
        pass

    def summing_all_tagger(self, df, tag_var_list, tag_value=1):
        # This tagger creates a tag='1' if all variables in tag_var_list equal tag_value; otherwise '0'
        self.df = df
        self.tag_var_list = tag_var_list
        self.tag_value = tag_value
        self.df['temp'] = np.where((self.df[self.tag_var_list[0]] == self.tag_value) &
                                   (self.df[self.tag_var_list[1]] == self.tag_value), 1, 0)
        return self.df['temp']
Then I can call it in the main.py file:
import pandas as pd
import datetime
import feature_tagger.feature_tagger as ft

tagger_obj = ft.Tagger()
df_pin['PIN_RX&TIME_TAG'] = tagger_obj.summing_all_tagger(df_pin, tag_var_list=['PIN_RX_TAG', 'PIN_TIME_TAG'], tag_value=1)
How can I modify it so users can enter as many column names for tag_var_list as they want?
Such as
df_pin['PIN_RX&TIME_TAG'] = tagger_obj.summing_all_tagger(df_pin, tag_var_list=['PIN_RX_TAG', 'PIN_TIME_TAG', 'PIN_NAME_TAG'], tag_value=1)
# or
df_pin['PIN_RX&TIME_TAG'] = tagger_obj.summing_all_tagger(df_pin, tag_var_list=['PIN_RX_TAG'], tag_value=1)
np.all() is your friend:
self.df['temp'] = np.where(np.all(self.df[self.tag_var_list] == self.tag_value, axis=1), 1, 0)
I think you can create a list comprehension for a list of boolean masks and then reduce the masks to one, casting to integer for a 0/1 column:
L = [self.df[x]==self.tag_value for x in tag_var_list]
self.df['temp'] = np.logical_and.reduce(L).astype(int)
Or use DataFrame.all and cast the boolean mask to integers:
self.df['temp'] = (self.df[self.tag_var_list] == self.tag_value).all(axis=1).astype(int)
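For completeness, the whole method from the question could be rewritten with any of these; a sketch using the DataFrame.all variant (class, method, and parameter names taken from the question):

class Tagger:
    def summing_all_tagger(self, df, tag_var_list, tag_value=1):
        # 1 where every column in tag_var_list equals tag_value, otherwise 0,
        # for however many columns the caller passes
        df['temp'] = (df[tag_var_list] == tag_value).all(axis=1).astype(int)
        return df['temp']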

Subsetting a dataframe to use with fmin yielding unexpected errors

I'm currently using fmin() to try and fit an equation to my data. The data in the file is just a 2-column list of floats. This is my code:
import numpy as np
import pandas as pd
from scipy.optimize import fmin

filename = 'HAB_30_Master_Overall_no2100'
data = pd.read_csv(filename + '.csv', header=0, usecols=['Wavelength', '2.5'])

def fitFunc(x):
    global B, A, data, sumresids
    wave = data['Wavelength']
    modelforfit = x[0] * wave ** -x[1]
    data['model'] = modelforfit
    data['Residuals'] = abs(data['2.5'] - data['model'])
    sumresids = data['Residuals'].sum()
    return sumresids

def fitData():
    global xopt
    B = 2
    A = 1
    x0 = np.array([B, A])
    xopt, fopt, iter, funcalls, warnflag = fmin(fitFunc, x0, maxiter=10000, full_output=True, disp=False)
    print(xopt[0], xopt[1])

fitFunc(data['Wavelength'])
fitData()
And this code works when I use all the values in the file. What I'm trying to do, though, is subset the dataframe so that I can see how the fit changes when only some data points are included. If the only thing I change is adding nrows=10 to the read_csv call, even though there are >90 rows in the file, I get the error:
ValueError: Integers to negative integer powers are not allowed.
And if I try doing something like making a new dataframe with .iloc to subset rows like this:
filename = 'HAB_30_Master_Overall_no2100'
data = pd.read_csv(filename + '.csv', header=0, usecols=['Wavelength', '2.5'])
newdata = data.iloc[:10]

def fitFunc(x):
    global B, A, data, sumresids
    wave = newdata['Wavelength']
    modelforfit = x[0] * wave ** -x[1]
    newdata['model'] = modelforfit
    newdata['Residuals'] = abs(newdata['2.5'] - newdata['model'])
    sumresids = newdata['Residuals'].sum()
    return sumresids

def fitData():
    global xopt
    B = 2
    A = 1
    x0 = np.array([B, A])
    xopt, fopt, iter, funcalls, warnflag = fmin(fitFunc, x0, maxiter=10000, full_output=True, disp=False)
    print(xopt[0], xopt[1])

fitFunc(newdata['Wavelength'])
fitData()
I get a warning like this:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  newdata['model'] = modelforfit

/var/folders/j8/1fzjf9cj3slcmyy1t89sth5w0000gp/T/tmpZRuvLX.py:20: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  newdata['Residuals'] = abs(newdata['2.5'] - newdata['model'])
And it crashes. Ideally I'd like to be able to use non-contiguous rows too, like the first 5 and last 5, but I'd even settle for just one chunk at a time. It would be really helpful if someone could tell me why neither of the above works and also provide a solution.
Edit: This is a snippet of what my data looks like.
To get this figured out, I've just been importing one column in the read_csv call, but the end goal is to have it loop through segments of this larger file, either with a subset of rows (the problem) and/or column by column (already figured out).
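No answer was recorded here, but two things stand out in the code above, so here is a hedged guess rather than an accepted solution. First, the stray call fitFunc(data['Wavelength']) passes the Wavelength column itself as the parameter vector x; if pandas infers that column as integers (which can change when nrows=10 selects only whole-number rows), wave ** -x[1] becomes an integer array raised to a negative integer power, which numpy forbids with exactly this ValueError. Second, newdata = data.iloc[:10] is a slice of data, which is what SettingWithCopyWarning complains about when you later assign columns to it. A minimal sketch of both fixes:

newdata = data.iloc[:10].copy()  # an explicit copy avoids SettingWithCopyWarning
fitData()                        # drop the stray fitFunc(data['Wavelength']) call;
                                 # fmin supplies float parameter vectors itself

For non-contiguous rows, the same pattern works with a position list, e.g. data.iloc[[0, 1, 2, 3, 4, -5, -4, -3, -2, -1]].copy() for the first and last five rows.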
