Python: Trying to encode all DF strings into ints

I am trying to build a function that checks every column of the DF and, if it finds a string in a column, encodes that string as an int. The function also ensures that the binary model label is rounded to 1 or 0.
However, this raises a ValueError:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
The error is triggered by the IF statement, which makes me wonder whether there is a better way to approach this problem.
This is what I have so far:
credit_key = "consumer"
model_label = "sc"

def df_column_headers(df):
    headers = []
    for column in df.columns:
        headers.append(column)
    return headers

def control_df_modelling(path_df_modelling, credit_key, model_label):
    path_df_modelling = path_df_modelling + f"{file_name_modelling}_{credit_key}.{file_format_modelling}"
    df_modelling = pd.read_csv(path_df_modelling)
    print(f"Conducting Quality Checks: DF Modelling {credit_key}")
    count_string_values = 0
    df_modelling[model_label] = [round(x) for x in df_modelling[model_label]]
    columns_df_modelling = df_column_headers(df_modelling)
    for column in df_modelling[columns_df_modelling]:
        column_element = df_modelling[column]
        if column_element != np.number:
            print(f"String Value Found: {column_element}")
            print(f"String Value {column_element} is Converted into {hash(column_element)}")
            count_string_values += 1
            column_element = hash(column_element)
        else:
            print(f"No String Values in DF {credit_key} Found")
    print(f"Total Number of String Values Found in DF Modelling {credit_key}: {count_string_values}")
    return df_modelling
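
One way to sidestep the error, offered as a minimal sketch rather than a definitive fix: the comparison column_element != np.number is evaluated for every value in the column and produces a Series, which an if statement cannot reduce to a single True or False. Checking the dtype of the column instead yields one boolean per column. The helper name encode_string_columns and the sample frame below are hypothetical. pd.factorize is swapped in for hash() because string hashes differ between interpreter runs by default, and hash() cannot be applied to a whole Series anyway.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def encode_string_columns(df, label_column):
    """Return a copy of df with non-numeric columns replaced by integer codes."""
    encoded = df.copy()
    # Round the binary label to 0/1 in one vectorized step.
    encoded[label_column] = encoded[label_column].round().astype(int)
    for column in encoded.columns:
        # The dtype check gives a single True/False per column,
        # so it is safe to use inside an if statement.
        if not is_numeric_dtype(encoded[column]):
            codes, _ = pd.factorize(encoded[column])
            encoded[column] = codes
    return encoded


# Hypothetical sample data, for illustration only.
df_example = pd.DataFrame({
    "sc": [0.2, 0.9, 1.0],
    "region": ["north", "south", "north"],
    "income": [1000, 2500, 1800],
})
df_encoded = encode_string_columns(df_example, "sc")
print(df_encoded)
```

pd.factorize assigns codes in order of first appearance, so equal strings always receive the same integer within one column.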

Related

How to iterate through and compare a string with manual input in pandas data frame?

I have the following pandas data table:
The above image is just the data I got from yfinance for AAPL stock
newsdf is the pandas data frame that has bunch of dates from another API call that has dates for specific news
I have the following code:
df['Boolean'] = df['Open'] < df['Close']
print(df)
if df['Boolean'] == 'False':
    for h in range(0, k):
        if newsdf[h] == df['Date']:
            print('Bearish signal ')
            print(h)
        else:
            print('Signal bullish')
I am getting the error: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Edit: I see that I am comparing the whole Boolean column at once, which I can't do. What would be a way to iterate through the Boolean column, check whether each value is True or False, compare the row's date with newsdf, and print the matching index?

Try this; comments explain each line of code.
# package import
import pandas as pd
from io import StringIO

# data setup
newdf = pd.DataFrame({"h": ['2021-01-04']})
raw_data = \
'''
Date Open Boolean
2021-01-04 132 False
2021-01-05 120 True
2021-01-06 123 False
'''
df = pd.read_csv(StringIO(raw_data), sep=" ")

# function holding the use-case logic; it receives one row at a time
def check_the_signal(df_row, newdf=newdf):
    if not df_row['Boolean']:  # this is false
        if df_row['Date'] in list(newdf['h']):  # caution: the list will be large when newdf is a large dataset
            return 'Bearish signal '
        else:
            return 'Signal bullish'
    else:
        return "Neutral"  # added for demo only!

# axis=1 sends the data row by row; the result is saved in df in case it is needed
df['single'] = df.apply(check_the_signal, axis=1)
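
The same row logic can also be expressed without a Python-level loop. The following is a sketch under the assumption that the dates are stored as plain strings in both frames; the names news_dates and signal are illustrative and not from the original post.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Date": ["2021-01-04", "2021-01-05", "2021-01-06"],
    "Boolean": [False, True, False],
})
news_dates = pd.Series(["2021-01-04"])  # hypothetical stand-in for newsdf

is_down = ~df["Boolean"]                 # rows where Open >= Close
has_news = df["Date"].isin(news_dates)   # rows whose date appears in the news list

# np.select picks the first matching condition for each row.
df["signal"] = np.select(
    [is_down & has_news, is_down],
    ["Bearish signal", "Signal bullish"],
    default="Neutral",
)
print(df)
```

Each condition is a boolean Series combined with &, so no single if statement ever has to evaluate a whole column.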

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()/ select subset from data [duplicate]

My input file has the following form:
gold,Program,MethodType,CallersT,CallersN,CallersU,CallersCallersT,CallersCallersN,CallersCallersU,CalleesT,CalleesN,CalleesU,CalleesCalleesT,CalleesCalleesN,CalleesCalleesU,CompleteCallersCallees,classGold
T,chess,Inner,Low,-1,-1,Low,-1,-1,High,-1,-1,-1,-1,Low,1,Trace,
N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,High,-1,-1,-1,Low,1,NoTrace,
N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,High,-1,-1,-1,Low,1,NoTrace,
N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,High,-1,-1,-1,Low,1,Trace,
N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,High,-1,-1,-1,Low,1,NoTrace,
N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,High,-1,-1,-1,Low,1,Trace,
N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,High,-1,-1,-1,Low,1,Trace,
N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,High,-1,-1,-1,Low,1,NoTrace,
N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,Medium,Medium,-1,High,High,0,Trace,
N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,Medium,Medium,-1,High,High,0,NoTrace,
N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,Medium,Medium,-1,High,High,0,NoTrace,
N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,Medium,Medium,-1,High,High,0,Trace,
N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,Medium,Medium,-1,High,High,0,NoTrace,
T,chess,Inner,Low,-1,-1,Low,-1,-1,Medium,-1,Medium,High,-1,High,0,Trace,
T,chess,Inner,Low,-1,-1,Low,-1,-1,Medium,-1,Medium,High,-1,High,0,Trace,
N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,Medium,Medium,-1,High,High,0,NoTrace,
N,chess,Inner,-1,Low,-1,-1,-1,-1,Low,Low,High,Medium,-1,Medium,0,Trace,
N,chess,Inner,-1,Low,-1,-1,-1,-1,-1,Medium,High,Low,Low,Medium,0,NoTrace,
N,chess,Inner,-1,Low,-1,-1,-1,-1,-1,Medium,High,-1,Medium,Medium,0,NoTrace,
T,chess,Inner,-1,Low,-1,-1,-1,-1,-1,Medium,High,Low,Low,Medium,0,Trace,
N,chess,Inner,-1,Low,-1,-1,-1,-1,-1,Medium,High,-1,Medium,Medium,0,NoTrace,
N,chess,Inner,-1,Low,-1,-1,-1,-1,Low,Low,High,Low,Low,Medium,0,Trace,
N,chess,Inner,Low,-1,-1,-1,-1,-1,Low,Low,High,Low,Low,Medium,0,Trace,
N,chess,Inner,-1,Low,-1,-1,-1,-1,-1,Medium,High,-1,Medium,Medium,0,NoTrace,
....
N,chess,Inner,-1,Low,-1,-1,Medium,-1,-1,Low,Low,-1,-1,-1,0,Trace,
N,chess,Inner,-1,Low,-1,-1,Medium,-1,-1,Low,Low,-1,-1,-1,0,NoTrace,
T,chess,Inner,Low,-1,-1,Low,Low,-1,Low,-1,Low,-1,-1,-1,0,Trace,
T,chess,Inner,Low,-1,-1,Medium,-1,-1,Low,-1,Low,-1,-1,-1,0,Trace,
N,chess,Inner,-1,Low,-1,-1,Medium,-1,-1,Low,Low,-1,-1,-1,0,NoTrace,
I would like to select the rows where CallersU is equal to either Low or -1 AND CalleesU is equal to either Low or -1.
Here is the code I am using:
import pandas as pd

SeparateProjectLearning=False
CompleteCallersCallees=False
PartialTrainingSetCompleteCallersCallees=True

def main():
    dataset = pd.read_csv( 'InputData.txt', sep= ',', index_col=False)
    #convert strings into 1 and N into 0
    dataset['gold'] = dataset['gold'].astype('category').cat.codes
    dataset['Program'] = dataset['Program'].astype('category').cat.codes
    dataset['classGold'] = dataset['classGold'].astype('category').cat.codes
    dataset['MethodType'] = dataset['MethodType'].astype('category').cat.codes
    dataset['CallersT'] = dataset['CallersT'].astype('category').cat.codes
    dataset['CallersN'] = dataset['CallersN'].astype('category').cat.codes
    dataset['CallersU'] = dataset['CallersU'].astype('category').cat.codes
    dataset['CallersCallersT'] = dataset['CallersCallersT'].astype('category').cat.codes
    dataset['CallersCallersN'] = dataset['CallersCallersN'].astype('category').cat.codes
    dataset['CallersCallersU'] = dataset['CallersCallersU'].astype('category').cat.codes
    dataset['CalleesT'] = dataset['CalleesT'].astype('category').cat.codes
    dataset['CalleesN'] = dataset['CalleesN'].astype('category').cat.codes
    dataset['CalleesU'] = dataset['CalleesU'].astype('category').cat.codes
    dataset['CalleesCalleesT'] = dataset['CalleesCalleesT'].astype('category').cat.codes
    dataset['CalleesCalleesN'] = dataset['CalleesCalleesN'].astype('category').cat.codes
    dataset['CalleesCalleesU'] = dataset['CalleesCalleesU'].astype('category').cat.codes
    print(dataset)
    CompleteSet = dataset[(dataset['CallersU']==0 or dataset['CallersU']==2)
                          and (dataset['CalleesU']==0 or dataset['CalleesU']==2)]
    print(CompleteSet)

if __name__=="__main__":
    main()
I am using the line dataset['CallersU'] = dataset['CallersU'].astype('category').cat.codes to convert the string values of CallersU into integer codes, and the equivalent line for CalleesU. The four values that CallersU and CalleesU can take are -1, Low, Medium and High. The call ...astype('category').cat.codes automatically makes the following conversions: -1 becomes 0, High becomes 1, Low becomes 2 and Medium becomes 3.
Thus, I am using the line CompleteSet = dataset[(dataset['CallersU']==0 or dataset['CallersU']==2) and (dataset['CalleesU']==0 or dataset['CalleesU']==2)] to select only the rows with (CallersU==0 OR CallersU==2) AND (CalleesU==0 OR CalleesU==2).
The problem is that executing this line raises ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all(). How can I fix that and perform the selection I need?

Replace and with & and or with |, and wrap each comparison in parentheses, because & and | bind more tightly than ==:
CompleteSet = dataset[((dataset['CallersU'] == 0) | (dataset['CallersU'] == 2)) & ((dataset['CalleesU']==0) | (dataset['CalleesU']==2))]
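
As an alternative sketch, Series.isin can express "equal to either of these values" directly, which also makes it possible to filter on the original labels instead of the integer codes. This assumes the columns still hold the raw strings (a CSV column that mixes -1 with Low, Medium and High is read as strings); the small frame and the names wanted and complete_set are illustrative.

```python
import pandas as pd

# Tiny illustrative frame with the two columns of interest.
dataset = pd.DataFrame({
    "CallersU": ["-1", "Low", "High", "-1"],
    "CalleesU": ["Low", "Medium", "-1", "-1"],
})

wanted = ["Low", "-1"]

# isin returns a boolean Series; & combines the two masks element-wise.
mask = dataset["CallersU"].isin(wanted) & dataset["CalleesU"].isin(wanted)
complete_set = dataset[mask]
print(complete_set)
```

If the columns have already been converted with cat.codes, the same pattern works with wanted = [0, 2].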

SMOTE in python

I am trying to use SMOTE in Python and wondering whether there is a way to manually specify the number of minority samples.
Suppose we have 100 records of one class and 10 records of another class. If we use ratio = 1 we get 100:100; if we use ratio 1/2, we get 100:200. But I am looking for a way to manually specify the number of instances to be generated for both classes.
Ndf_class_0_records = trainData[trainData['DIED'] == 0]
Ndf_class_1_records = trainData[trainData['DIED'] == 1]
Ndf_class_0_record_counts = Ndf_class_0_records.DIED.value_counts()
Ndf_class_1_record_counts = Ndf_class_1_records.DIED.value_counts()
X_smote = trainData.drop("DIED", axis=1)
y_smote = trainData["DIED"]
smt = SMOTE(ratio={0:Ndf_class_0_record_counts, 1:Ndf_class_1_record_counts*2})
X_smote_res, y_smote_res = smt.fit_sample(X_smote, y_smote)
In the above code, I am trying to manually specify the number for each of the classes, but I am getting the following error at the last line of code
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

If I understand you and the documentation correctly, you are not passing numbers as ratio; you are passing Series objects instead, because value_counts() returns a Series.
The accepted types for ratio are:
float, str, dict or callable, (default=’auto’)
Please try doing:
Ndf_class_0_records = trainData[trainData['DIED'] == 0]
Ndf_class_1_records = trainData[trainData['DIED'] == 1]
Ndf_class_0_record_counts = len(Ndf_class_0_records) ##### CHANGED THIS
Ndf_class_1_record_counts = len(Ndf_class_1_records) ##### CHANGED THIS
X_smote = trainData.drop("DIED", axis=1)
y_smote = trainData["DIED"]
smt = SMOTE(ratio={0:Ndf_class_0_record_counts, 1:Ndf_class_1_record_counts*2})
X_smote_res, y_smote_res = smt.fit_sample(X_smote, y_smote)
This should now work, please try!
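
To illustrate why the original call failed, here is a small pandas-only sketch with made-up data: value_counts() returns a Series even when it holds a single count, while the dictionary values need to be plain integers. The names train_data and target_counts are hypothetical.

```python
import pandas as pd

# Made-up labels: four records of class 0, two of class 1.
train_data = pd.DataFrame({"DIED": [0, 0, 0, 0, 1, 1]})

counts = train_data["DIED"].value_counts()
print(type(counts))  # a pandas Series, not a number

# Look up each class by label and convert to a plain int.
target_counts = {
    0: int(counts.loc[0]),
    1: int(counts.loc[1]) * 2,
}
print(target_counts)
```

A dictionary like target_counts can then be passed where the answer passes ratio. Note that in more recent imbalanced-learn releases the parameter is called sampling_strategy and the method fit_resample, so the exact call depends on the installed version.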

How to apply a function in Pandas to a cell in every row where a different cell in that same row meets a condition?

I am trying to use the pandas string method "str.zfill" to add leading zeros to a cell in every row of the dataframe where another cell in that row meets a certain condition. So for any given row in my DataFrame "excodes", when the value in column "LOB_SVC_CD" is "MTG", apply the str.zfill(5) method to the cell in column "PS_CD". When the value in "LOB_SVC_CD" is not "MTG", leave the value in "PS_CD" as is.
I've tried a few custom functions, "np.where" and a few apply/map lambdas. I'm getting errors on all of them.
#Custom Function
def add_zero(column):
    if excodes.loc[excodes.LOB_SVC_CD == 'MTG']:
        excodes.PS_CD.str.zfill(5)
    else:
        return excodes.PS_CD
excodes['code'] = excodes.apply(add_zero)

#Custom Function with For Loop
def add_zero2(column):
    code = []
    for row(i) in column:
        if excodes.LOB_SVC_CD == 'MTG':
            code.append(excodes.PS_CD.str.zfill(5))
        else:
            code.append(excodes.PS_CD)
    excodes['Code'] = code
excodes['code'] = excodes.apply(add_zero)

#np.Where
mask = excodes[excodes.LOB_SVC_CD == 'MTG']
excodes['code'] = pd.DataFrame[np.where(mask, excodes.PS_CD.str.zfill(5), excodes.PS_CD)]

#Lambda
excodes['code'] = excodes['LOB_SVC_CD'].map(lambda x: excodes.PS_CD.str.zfill(5)) if x[excodes.LOB_SVC_CD == 'MTG'] else excodes.PS_CD)

#Assign with a "Where"
excodes.assign((excodes.PS_CD.str.zfill(5)).where(excodes.LOB_SVC_CD == 'MTG'))
The expected result is either of the following:
creating a new column called "code" in which the values from "PS_CD" are given leading zeroes in rows where excodes.LOB_SVC_CD == 'MTG'
adding leading zeroes to the values in excodes["PS_CD"] in rows where excodes['LOB_SVC_CD'] == 'MTG'
These are the error messages I'm getting for each of the approaches I've tried:
#Custom Function:
"ValueError: ('The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().', 'occurred at index PS_CD')"
# Custom Function with For Loop:
"SyntaxError: can't assign to function call"
#np.Where:
"ValueError: operands could not be broadcast together with shapes (152,7) (720,) (720,)"
#Apply Lambda:
"string indices must be integers"
#Assign with a "Where":
"TypeError: assign() takes 1 positional argument but 2 were given"

This seems to work :)
# Ensure the data in the PS_CD are strings
data["PS_CD"] = data["PS_CD"].astype(str)
# Iterate over all rows
for index in data.index:
    # If the LOB_SVC_CD is "MTG"
    if (data.loc[index, "LOB_SVC_CD"] == "MTG"):
        # Apply the zfill(5) in the PS_CD on the same row (index)
        data.loc[index, "PS_CD"] = data.loc[index, "PS_CD"].zfill(5)
# Print the result
print(data)
Alternative way (maybe a bit more Python-ish) :)
# Ensure the data in the PS_CD are strings
data["PS_CD"] = data["PS_CD"].astype(str)
# Custom function for applying the zfill
def my_zfill(x, y):
    return y.zfill(5) if x == "MTG" else y
# Iterate over the data applying the custom function on each row.
# A plain list is assigned by position, so a non-default index on data cannot misalign the values.
data["PS_CD"] = [my_zfill(x, y) for x, y in zip(data["LOB_SVC_CD"], data["PS_CD"])]

My take:
>>> import pandas
>>> df = pandas.DataFrame(data = [['123', 'MTG'],['321', 'CLOC']], columns = ['PS_CD', 'LOB_SVC_CD'])
>>> df
PS_CD LOB_SVC_CD
0 123 MTG
1 321 CLOC
>>>
>>> df['PS_CD'] = df.apply(lambda row: row['PS_CD'].zfill(5) if row['LOB_SVC_CD'] == 'MTG' else row['PS_CD'], axis='columns')
>>> df
PS_CD LOB_SVC_CD
0 00123 MTG
1 321 CLOC
With axis='columns' the lambda is called once per row and returns the zero-filled PS_CD if LOB_SVC_CD is MTG, and the original PS_CD otherwise.
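
For completeness, the np.where attempt from the question can be made to work if the condition is a boolean Series rather than a filtered DataFrame. This is a minimal sketch using the two-row example from the answer above; the variable names is_mtg and padded are illustrative.

```python
import numpy as np
import pandas as pd

excodes = pd.DataFrame(
    data=[["123", "MTG"], ["321", "CLOC"]],
    columns=["PS_CD", "LOB_SVC_CD"],
)

# Boolean Series with one True/False per row (not a filtered DataFrame).
is_mtg = excodes["LOB_SVC_CD"] == "MTG"

# Zero-fill every value once, then pick per row between padded and original.
padded = excodes["PS_CD"].astype(str).str.zfill(5)
excodes["code"] = np.where(is_mtg, padded, excodes["PS_CD"])
print(excodes)
```

All three arguments to np.where have the same length here, which avoids the broadcasting error reported in the question.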

Grabbing characters after a certain value

Below is the dataframe. PIC_1 and Wgt are strings and p.lgth and p_lgth are integers. If p_lgth is not equal to 30, I want to find 42 in PIC_1 and grab 42 and the 15 digits that come after it.
PIC_1 Wgt p.lgth p_lgth
**PARTIAL-DECODE***P / 42011721930018984390078... 112 53 53
So the output from above should be 42011721930018984
My code that does not work follows:
def pic_mod(row):
    if row['p_lgth'] !=30:
        PIC_loc = row['PIC_1'].find('42')
        PIC_2 = row['PIC_1'].str[PIC_loc:PIC_loc + 15]
    elif row['p_lgth']==30:
        PIC_2=PIC_1
    return PIC_2
row_1 is just a row from the larger df that is identical to the example row given above
row_1 = df71[2:3]
pic_mod(row_1)
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I did type() on the variables and got
type(df71['PIC_1']) = pandas.core.series.Series
type(df71['p_lgth']) = pandas.core.series.Series
type(df71['Wgt']) = pandas.core.series.Series
I'm fairly new to Python. Should these data types come back as int and str? df71 is a df.

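The error arises because df71[2:3] is a one-row DataFrame, so row['p_lgth'] is a Series rather than an int. Extract the scalar values first, perhaps like this:
def pic_mod(row):
    # row is a one-row DataFrame, so pull out the single values
    p_lgth = row['p_lgth'].iloc[0]
    PIC_1 = row['PIC_1'].iloc[0]
    if p_lgth != 30:
        PIC_loc = PIC_1.find('42')
        # '42' plus the 15 digits after it is 17 characters
        PIC_2 = PIC_1[PIC_loc:PIC_loc + 17]
    else:
        PIC_2 = PIC_1
    return PIC_2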
However, if your data is already structured in a pandas dataframe, you normally wouldn't write such an explicit function.
E.g. the initial filtering of all rows in the dataset where p_lgth is not equal to 30 would be a single line like:
df_fltrd = df[df['p_lgth']!=30]
Having done this, you can apply any arbitrary function to the entries in the PIC_1 column, e.g. in your case taking the substring of length 17 starting with '42':
df_fltrd['PIC_1'].apply(lambda x: x[x.find('42'):x.find('42')+17])
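
A vectorized variant, sketched under the assumption that "42 and the 15 digits after it" can be described by the regular expression 42 followed by exactly 15 digits. The first sample value is the visible part of the truncated example from the question; the second row and the column name PIC_2 are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "PIC_1": ["**PARTIAL-DECODE***P / 42011721930018984390078", "ABC"],
    "p_lgth": [53, 30],
})

# Capture '42' followed by 15 digits; rows without a match get NaN.
extracted = df["PIC_1"].str.extract(r"(42\d{15})", expand=False)

# Keep PIC_1 where p_lgth is 30, otherwise use the extracted code.
df["PIC_2"] = df["PIC_1"].where(df["p_lgth"] == 30, extracted)
print(df["PIC_2"])
```

Unlike find('42'), the pattern only matches a 42 that is actually followed by 15 digits, so the two approaches can differ when an earlier 42 appears in the text.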
