The cell value is not returned by pandas' where() function - Python

I used the following code to:
map the values from another data frame (using map())
finalize the values in the column (using where())
Requirement: the VT_Final column should hold one of these values: V_Team1, V_Team2, or Non-PA.
Issue: VT_Final returns empty cells (blanks).
Please advise.
Code:
Bookings['V_Team1'] = Bookings.Marker1.map(Manpower_1.set_index('Marker1')['Vertical Team'].to_dict())
Bookings['V_Team2'] = Bookings.Marker1.map(Attrition_1.set_index('Marker1')['Vertical Team'].to_dict())
Bookings['VT_Final'] = Bookings['V_Team1']
Bookings['VT_Final'].where(Bookings['V_Team1'] !='')
Bookings['VT_Final'] = Bookings['V_Team2']
Bookings['VT_Final'].where(Bookings['V_Team1'] =='')
Bookings['VT_Final'] = 'Non PA'
Bookings['VT_Final'].where((Bookings['V_Team1'] =='')&(Bookings['V_Team2']==''))
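Two things are likely going wrong here: columns produced by .map() contain NaN (not empty strings) where Marker1 had no match, and Series.where returns a new Series rather than modifying in place, so each plain assignment above simply overwrites VT_Final. A minimal sketch of the intended fallback logic with numpy.select, using toy data in place of the real Bookings frame:

```python
import numpy as np
import pandas as pd

# Toy stand-in for Bookings after the two .map() calls;
# unmatched rows come back as NaN, not ''
Bookings = pd.DataFrame({
    "V_Team1": ["Alpha", np.nan, np.nan],
    "V_Team2": [np.nan, "Beta", np.nan],
})

# Pick V_Team1 when present, else V_Team2, else the 'Non PA' default
conditions = [Bookings["V_Team1"].notna(), Bookings["V_Team2"].notna()]
choices = [Bookings["V_Team1"], Bookings["V_Team2"]]
Bookings["VT_Final"] = np.select(conditions, choices, default="Non PA")
```

np.select evaluates the conditions in order, which matches the V_Team1-first priority in the question.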


Pandas check value in column

I'm not used to pandas's syntax; I'm just trying to check whether a value is in a column.
I've tested this (data is my dataframe, ini is the column name):
data = pdsql.read_sql_query("select id_bdcarth, id_nd_ini::int ini, id_nd_fin::int fin, v from tempturbi.tmp_somme_v19",connection)
exists = 973237173 in data.ini
print(exists)
and I get False as the result, but the value is in the column. Is my method wrong?
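The likely cause: the `in` operator on a pandas Series tests membership in the *index labels*, not the values. A small sketch with a toy frame standing in for the SQL result:

```python
import pandas as pd

# Toy stand-in for the query result; the default index is 0, 1, 2
data = pd.DataFrame({"ini": [973237173, 5, 9]})

# `in` on a Series checks the index labels, so this is False
exists_in_index = 973237173 in data.ini

# Check the values instead
exists_in_values = 973237173 in data.ini.values
exists_via_isin = data.ini.isin([973237173]).any()
```

Series.isin is the idiomatic choice when checking several candidate values at once.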

pandas: while loop to simultaneously advance through multiple lists and call functions

I want my code to:
read data from a CSV and make a dataframe: "source_df"
see if the dataframe contains any columns specified in a list: "possible_columns"
call a unique function to replace the values in each column whose header is found in the "possible_columns" list, then insert the modified values into a new dataframe: "destination_df"
Here it is:
import pandas as pd
#creates source_df
file = "yes-no-true-false.csv"
data = pd.read_csv(file)
source_df = pd.DataFrame(data)
#creates destination_df
blanklist = []
destination_df = pd.DataFrame(blanklist)
#create the column header lists for comparison in the while loop
columns = source_df.head(0)
possible_columns = ['yes/no','true/false']
#establish the functions list and define the functions to replace column values
fix_functions_list = ['yes_no_fix()','true_false_fix()']
def yes_no_fix():
    destination_df['yes/no'] = destination_df['yes/no fixed'].replace("No","0").replace("Yes","1")

def true_false_fix():
    destination_df['true/false'] = destination_df['true/false fixed'].replace('False', '1').replace('True', '0')
'''use the counter to call a unique function from the function list to replace the values in each column whose header is found in the "possible_columns" list, insert the modified values in "destination_df", then advance the counter'''
counter = 0
while counter < len(possible_columns):
    if possible_columns[counter] in columns:
        destination_df.insert(counter, possible_columns[counter], source_df[possible_columns[counter]])
        fix_functions_list[counter]
    counter = counter + 1
#see if it works
print(destination_df.head(10))
When I print(destination_df), I see the unmodified column values from source_df. When I call the functions independently they work, which makes me think something is going wrong in my while loop.
Your issue is that you are trying to call a function that is stored in a list as a string:
fix_functions_list[counter]
This will not actually run the function; it just accesses the string value.
I would find another way to run these functions:
def yes_no_fix():
    destination_df['yes/no'] = destination_df['yes/no fixed'].replace("No","0").replace("Yes","1")

def true_false_fix():
    destination_df['true/false'] = destination_df['true/false fixed'].replace('False', '1').replace('True', '0')
fix_functions_list = {0:yes_no_fix,1:true_false_fix}
and change the function call as below:
fix_functions_list[counter]()
#creates source_df
file = "yes-no-true-false.csv"
data = pd.read_csv(file)
source_df = pd.DataFrame(data)
possible_columns = ['yes/no','true/false']
mapping_dict = {'yes/no': {"No": "0", "Yes": "1"}, 'true/false': {'False': '1', 'True': '0'}}
old_columns = [column for column in source_df.columns if column not in possible_columns]
existed_columns = [column for column in source_df.columns if column in possible_columns]
new_df = source_df[existed_columns]
for column in new_df.columns:
    new_df[column] = new_df[column].map(mapping_dict[column])
new_df[old_columns] = source_df[old_columns]
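Note that pandas can also apply the whole mapping in one step: DataFrame.replace accepts a nested {column: {old: new}} dict and leaves columns not named in it untouched. A small sketch with toy data standing in for the CSV:

```python
import pandas as pd

# Toy stand-in for the contents of yes-no-true-false.csv
source_df = pd.DataFrame({
    "yes/no": ["Yes", "No"],
    "true/false": ["True", "False"],
    "other": [1, 2],
})

# Same mapping as the answer above, applied in a single call;
# the 'other' column passes through unchanged
mapping_dict = {
    "yes/no": {"No": "0", "Yes": "1"},
    "true/false": {"False": "1", "True": "0"},
}
destination_df = source_df.replace(mapping_dict)
```

This avoids both the manual loop and the string-to-function dispatch entirely.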

Any differences between iterating over values in columns of dataframe and assigning variable to data in column?

I ran the following code but Spyder returned "float division by zero":
import pandas as pd
file = pd.read_csv(r"data_ET.csv")
def normalise(df, columnName):
    for value in df[columnName]:
        df[columnName] = (value - df[columnName].min())/(df[columnName].max()-df[columnName].min())  # this line raises the error
    return df[columnName]
#b)
normalised_RTfirstpass = normalise(file, 'RTfirstpass')
normalised_logfreq = normalise(file, 'log_freq')
file['normalised RTfirstpass'] = normalised_RTfirstpass
file['normalised logfreq'] = normalised_logfreq
print(file)
When I changed it to this, it works (the change is assigning the column values to a variable):
import pandas as pd
file = pd.read_csv(r"data_ET.csv")
def normalise(df, columnName):
    value = df[columnName]
    df[columnName] = (value - df[columnName].min())/(df[columnName].max()-df[columnName].min())
    return df[columnName]
#b)
normalised_RTfirstpass = normalise(file, 'RTfirstpass')
normalised_logfreq = normalise(file, 'log_freq')
file['normalised RTfirstpass'] = normalised_RTfirstpass
file['normalised logfreq'] = normalised_logfreq
print(file)
Can anybody explain why the latter works but the former does not?
In the first version, value is a single scalar taken from the column, so df[columnName] = (value - min)/(max - min) overwrites the entire column with one broadcast scalar on the very first loop iteration. On the next iteration df[columnName].max() equals df[columnName].min(), the denominator is zero, and you get "float division by zero".
In the second version, value = df[columnName] is the whole Series. pandas broadcasts the scalar df[columnName].min() (an int/float) across the Series and performs the arithmetic element-wise in one vectorized operation, which is why you do not need to iterate over every value in the column.
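A small demonstration of the element-wise broadcast the working version relies on, using toy data in place of data_ET.csv:

```python
import pandas as pd

# Toy stand-in for one column of data_ET.csv
df = pd.DataFrame({"RTfirstpass": [2.0, 4.0, 6.0]})

# min() and max() are scalars; pandas broadcasts them across the
# whole Series, so the min-max normalisation happens in one operation
col = df["RTfirstpass"]
df["normalised"] = (col - col.min()) / (col.max() - col.min())
```

No loop is needed, and the source column is read once before being written, so the min/max are never computed from already-overwritten values.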

return various variables in nested function based on conditional statement

I have an original dataframe from which I can create a modified dataframe. In some cases I want to select only a subset of my data rather than use the whole dataframe, and I want this all to happen inside a single function. Is it possible to return different variables based on a conditional, or would this be incorrect?
The function below works fine when I run
modified_df = modify_data(protein_embeddings, protein_df, subset = False)
but when I try executing:
gal_subset_first, gal_subset_second = modify_data(protein_embeddings, protein_df, subset = True)
I get the error:
ValueError: too many values to unpack (expected 2)
The Function
def modify_data(embeddings, df, subset=False):
    """
    Modifies the original dataframe with the respective embeddings
    :return: final dataframe to be used in the data split and modelling
    """
    # Original DF
    OD_df = df.copy(deep=True)
    OD_df = df.reset_index()
    OD_df.loc[:, 'task'] = 'stability'
    # Embeddings DF
    embeddings_df = pd.DataFrame(data=embeddings)
    embeddings_df = embeddings_df.reset_index()
    embedded_df = pd.merge(embeddings_df, OD_df, on='index')
    embedded_df = embedded_df.drop(['index', 'sequence', 'temperature'], axis=1)

    def subsetting(embedded_df, sample_no, row_no):
        "Select a subset of rows from the original dataframe"
        # Selecting subset
        embedded_df = embedded_df.sample(n=sample_no)
        subset_first = embedded_df[:row_no]
        subset_second = embedded_df[row_no:]
        return subset_first, subset_second

    if subset == True:
        gal_subset_first, gal_subset_second = subsetting(embedded_df, sample_no=2000, row_no=1000)
    else:
        pass
    return embedded_df
Your function always returns a single DataFrame: the if subset branch computes the two subsets but never returns them. When you assign the result to one variable, the whole DataFrame is bound to it. When you assign it to multiple variables, Python iterates over the returned value (for a DataFrame, over its column labels) and checks that the number of items matches the number of variables.
Compare these code samples:
def f():
    return (1, 2, 3)

a = f()        # a is the tuple (1, 2, 3)
a, b = f()     # raises ValueError: too many values to unpack (expected 2)
a, b, c = f()  # a=1 b=2 c=3: the number of returned values matches the number of variables
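One option, sketched here on a toy frame with the question's embeddings merge simplified away, is to make the return value itself depend on the subset flag, so each call site unpacks exactly what it expects:

```python
import pandas as pd

def modify_data(df, subset=False):
    # Simplified stand-in for the question's function (no embeddings merge)
    embedded_df = df.copy()
    if subset:
        # Return the two halves of a random sample when a subset is requested
        sample = embedded_df.sample(n=2, random_state=0)
        return sample[:1], sample[1:]
    # Otherwise return the single full frame
    return embedded_df

df = pd.DataFrame({"x": range(4)})
whole = modify_data(df)                        # one DataFrame
first, second = modify_data(df, subset=True)   # two DataFrames
```

Returning different shapes from one function works, but callers must know which mode they invoked; an alternative is to always return a tuple, or to split the behaviour into two functions.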

TypeError: string indices must be integers, not str in python

Here is my Python code, which throws an error when executed.
def split_cell(s):
    a = s.split(".")
    b = a[1].split("::=")
    return (a[0].lower(), b[0].lower(), b[1].lower())

logic_tbl, logic_col, logic_value = split_cell(rules['logic_1'][ith_rule])
mems = logic_tbl[logic_tbl[logic_col]==logic_value]['mbr_id'].tolist()
The split_cell function works fine, and all the columns in logic_tbl are of object dtype.
Here is the traceback:
Got this corrected!
logic_tbl contains the name of the pandas dataframe,
logic_col contains the name of a column in that dataframe, and
logic_value contains a value from the logic_col column of the logic_tbl dataframe.
mems = logic_tbl[logic_tbl[logic_col]==logic_value]['mbr_id'].tolist()
I was trying the above, but Python treats logic_tbl as a string, so no dataframe-level operations happen.
So I created a dictionary like this:
dt_dict = {}
dt_dict['a_med_clm_diag'] = a_med_clm_diag
And modified my code as below:
mems = dt_dict[logic_tbl][dt_dict[logic_tbl][logic_col]==logic_value]['mbr_id'].tolist()
This works as expected. I came to this idea when I wrote:
mems = logic_tbl[logic_tbl[logic_col]==logic_value,'mbr_id']
which threw a message like "'logic_tbl' is a string. Nothing to filter".
Try writing that last statement like the code below:
import numpy
filt = numpy.array([a == logic_value for a in logic_col])
mems = [i for indx, i in enumerate(logic_col) if filt[indx]]
Does this work?
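For reference, the dictionary lookup the asker settled on can be sketched end to end. The table name a_med_clm_diag comes from the question; the diag column and its values are hypothetical stand-ins:

```python
import pandas as pd

# Hypothetical table; in the real code this frame already exists
a_med_clm_diag = pd.DataFrame({
    "diag": ["x10", "y20"],
    "mbr_id": [1, 2],
})

# Map table *names* (strings produced by split_cell) to DataFrame objects
dt_dict = {"a_med_clm_diag": a_med_clm_diag}

# Values that split_cell would have parsed out of the rule string
logic_tbl, logic_col, logic_value = "a_med_clm_diag", "diag", "x10"

# Resolve the string to the actual frame, then filter with a boolean mask
tbl = dt_dict[logic_tbl]
mems = tbl[tbl[logic_col] == logic_value]["mbr_id"].tolist()
```

This is the key idea behind the fix: a string can never stand in for a DataFrame, but it can be used as a key to look one up.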
