Python: Trying to encode all DF strings into ints
I am trying to build a function that checks every column of a DataFrame and, if a column contains strings, encodes those strings as ints. The function also ensures that the binary model label is rounded to 1 or 0.
However, this raises a ValueError:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
The error is raised by the if statement, which makes me wonder whether there is a better way to approach this problem.
This is what I have so far:
credit_key = "consumer"
model_label = "sc"

def df_column_headers(df):
    headers = []
    for column in df.columns:
        headers.append(column)
    return headers

def control_df_modelling(path_df_modelling, credit_key, model_label):
    path_df_modelling = path_df_modelling + f"{file_name_modelling}_{credit_key}.{file_format_modelling}"
    df_modelling = pd.read_csv(path_df_modelling)
    print(f"Conducting Quality Checks: DF Modelling {credit_key}")
    count_string_values = 0
    df_modelling[model_label] = [round(x) for x in df_modelling[model_label]]
    columns_df_modelling = df_column_headers(df_modelling)
    for column in df_modelling[columns_df_modelling]:
        column_element = df_modelling[column]
        if column_element != np.number:
            print(f"String Value Found: {column_element}")
            print(f"String Value {column_element} is Converted into {hash(column_element)}")
            count_string_values += 1
            column_element = hash(column_element)
        else:
            print(f"No String Values in DF {credit_key} Found")
    print(f"Total Number of String Values Found in DF Modelling {credit_key}: {count_string_values}")
    return df_modelling
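The error comes from `column_element != np.number`: `column_element` is a whole Series, so the comparison yields a Series of booleans, and `if` cannot reduce that to a single True or False. One possible approach, sketched below, is to test each column's dtype instead of its values and to encode non-numeric columns with `pd.factorize`. The function name `encode_string_columns` and the demo data are hypothetical, and the sketch assumes the label column has no missing values. It also swaps `hash()` for `pd.factorize`, because string hashes are not stable between interpreter runs and a Series itself is not hashable.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def encode_string_columns(df, model_label):
    """Return (encoded copy, list of encoded column names).

    Hypothetical helper: rounds the label to 0/1 and replaces every
    non-numeric column with integer codes.
    """
    out = df.copy()
    # Assumes the label has no NaN values; astype(int) would fail otherwise.
    out[model_label] = out[model_label].round().astype(int)

    encoded = []
    for column in out.columns:
        # Check the dtype of the column, not the values inside it.
        if not is_numeric_dtype(out[column]):
            codes, _ = pd.factorize(out[column])
            out[column] = codes
            encoded.append(column)
    return out, encoded


# Made-up demo data standing in for the CSV from the question.
demo = pd.DataFrame({
    "sc": [0.2, 0.9, 1.0],
    "region": ["north", "south", "north"],
    "income": [10.5, 20.0, 30.25],
})
encoded_df, encoded_columns = encode_string_columns(demo, "sc")
print(encoded_columns)
print(encoded_df)
```

`pd.factorize` numbers the distinct values in order of first appearance, so the same string always maps to the same int within one call. If the mapping has to be reused on new data, the second return value of `pd.factorize` (the unique values) would need to be kept.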
Related
How to iterate through and compare a string with manual input in pandas data frame?
I have a pandas data frame, df, with the data I got from yfinance for AAPL stock. newsdf is a pandas data frame with a bunch of dates, from another API call, for specific news. I have the following code:

df['Boolean'] = df['Open'] < df['Close']
print(df)
if df['Boolean'] == 'False':
    for h in range(0, k):
        if newsdf[h] == df['Date']:
            print('Bearish signal ')
            print(h)
        else:
            print('Signal bullish')

I am getting the error:
The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Edit: I see that I am comparing the whole Boolean column and I can't do that, but what would be a way to iterate through the Boolean column, check whether each value is True or False, compare the row's date with newsdf, and print that index?
Try this; comments have been added to the lines of code.

# package import
import pandas as pd
from io import StringIO

# data setup
newdf = pd.DataFrame({"h": ['2021-01-04']})
raw_data = \
'''
Date Open Boolean
2021-01-04 132 False
2021-01-05 120 True
2021-01-06 123 False
'''
df = pd.read_csv(StringIO(raw_data), sep=" ")

# function holding the use-case logic
def check_the_signal(df_row, newdf=newdf):
    if not df_row['Boolean']:  # Boolean is False
        if df_row['Date'] in list(newdf['h']):  # caution: when newdf is a large dataset, this list will be large
            return 'Bearish signal '
        else:
            return 'Signal bullish'
    else:
        return "Neutral"  # added for demo only!

# axis=1 sends the data at row level; the value is saved in df in case it is needed
df['single'] = df.apply(check_the_signal, axis=1)
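The same row-level logic can also be written without `apply`, which may matter on larger frames. The sketch below is one possible vectorized variant, not the answer's method: it builds two boolean masks and lets `np.select` pick a label per row. The name `news_dates` is a hypothetical stand-in for the date column of newsdf, and the labels drop the trailing space used in the answer.

```python
import numpy as np
import pandas as pd

# Same demo rows as the answer above, built directly instead of via read_csv.
df = pd.DataFrame({
    "Date": ["2021-01-04", "2021-01-05", "2021-01-06"],
    "Open": [132, 120, 123],
    "Boolean": [False, True, False],
})
news_dates = pd.Series(["2021-01-04"])  # hypothetical stand-in for newsdf

is_down = ~df["Boolean"]                # rows where Boolean is False
has_news = df["Date"].isin(news_dates)  # rows whose date appears in the news dates

# Conditions are checked in order; the first one that is True wins.
df["signal"] = np.select(
    [is_down & has_news, is_down],
    ["Bearish signal", "Signal bullish"],
    default="Neutral",
)
print(df)
```

Each mask is a whole Series of booleans, combined element-wise with `&`, so no single `if` ever has to decide the truth value of a Series.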
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()/ select subset from data [duplicate]
My input file has the following form:

gold,Program,MethodType,CallersT,CallersN,CallersU,CallersCallersT,CallersCallersN,CallersCallersU,CalleesT,CalleesN,CalleesU,CalleesCalleesT,CalleesCalleesN,CalleesCalleesU,CompleteCallersCallees,classGold
T,chess,Inner,Low,-1,-1,Low,-1,-1,High,-1,-1,-1,-1,Low,1,Trace,
N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,High,-1,-1,-1,Low,1,NoTrace,
N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,High,-1,-1,-1,Low,1,NoTrace,
N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,High,-1,-1,-1,Low,1,Trace,
N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,High,-1,-1,-1,Low,1,NoTrace,
N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,High,-1,-1,-1,Low,1,Trace,
N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,High,-1,-1,-1,Low,1,Trace,
N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,High,-1,-1,-1,Low,1,NoTrace,
N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,Medium,Medium,-1,High,High,0,Trace,
N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,Medium,Medium,-1,High,High,0,NoTrace,
N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,Medium,Medium,-1,High,High,0,NoTrace,
N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,Medium,Medium,-1,High,High,0,Trace,
N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,Medium,Medium,-1,High,High,0,NoTrace,
T,chess,Inner,Low,-1,-1,Low,-1,-1,Medium,-1,Medium,High,-1,High,0,Trace,
T,chess,Inner,Low,-1,-1,Low,-1,-1,Medium,-1,Medium,High,-1,High,0,Trace,
N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,Medium,Medium,-1,High,High,0,NoTrace,
N,chess,Inner,-1,Low,-1,-1,-1,-1,Low,Low,High,Medium,-1,Medium,0,Trace,
N,chess,Inner,-1,Low,-1,-1,-1,-1,-1,Medium,High,Low,Low,Medium,0,NoTrace,
N,chess,Inner,-1,Low,-1,-1,-1,-1,-1,Medium,High,-1,Medium,Medium,0,NoTrace,
T,chess,Inner,-1,Low,-1,-1,-1,-1,-1,Medium,High,Low,Low,Medium,0,Trace,
N,chess,Inner,-1,Low,-1,-1,-1,-1,-1,Medium,High,-1,Medium,Medium,0,NoTrace,
N,chess,Inner,-1,Low,-1,-1,-1,-1,Low,Low,High,Low,Low,Medium,0,Trace,
N,chess,Inner,Low,-1,-1,-1,-1,-1,Low,Low,High,Low,Low,Medium,0,Trace,
N,chess,Inner,-1,Low,-1,-1,-1,-1,-1,Medium,High,-1,Medium,Medium,0,NoTrace,
....
N,chess,Inner,-1,Low,-1,-1,Medium,-1,-1,Low,Low,-1,-1,-1,0,Trace,
N,chess,Inner,-1,Low,-1,-1,Medium,-1,-1,Low,Low,-1,-1,-1,0,NoTrace,
T,chess,Inner,Low,-1,-1,Low,Low,-1,Low,-1,Low,-1,-1,-1,0,Trace,
T,chess,Inner,Low,-1,-1,Medium,-1,-1,Low,-1,Low,-1,-1,-1,0,Trace,
N,chess,Inner,-1,Low,-1,-1,Medium,-1,-1,Low,Low,-1,-1,-1,0,NoTrace,

I would like to select the rows where (CallersU equals either Low OR -1) AND (CalleesU equals either Low OR -1). Here is the code I am using:

import pandas as pd

SeparateProjectLearning=False
CompleteCallersCallees=False
PartialTrainingSetCompleteCallersCallees=True

def main():
    dataset = pd.read_csv( 'InputData.txt', sep= ',', index_col=False)
    #convert strings into 1 and N into 0
    dataset['gold'] = dataset['gold'].astype('category').cat.codes
    dataset['Program'] = dataset['Program'].astype('category').cat.codes
    dataset['classGold'] = dataset['classGold'].astype('category').cat.codes
    dataset['MethodType'] = dataset['MethodType'].astype('category').cat.codes
    dataset['CallersT'] = dataset['CallersT'].astype('category').cat.codes
    dataset['CallersN'] = dataset['CallersN'].astype('category').cat.codes
    dataset['CallersU'] = dataset['CallersU'].astype('category').cat.codes
    dataset['CallersCallersT'] = dataset['CallersCallersT'].astype('category').cat.codes
    dataset['CallersCallersN'] = dataset['CallersCallersN'].astype('category').cat.codes
    dataset['CallersCallersU'] = dataset['CallersCallersU'].astype('category').cat.codes
    dataset['CalleesT'] = dataset['CalleesT'].astype('category').cat.codes
    dataset['CalleesN'] = dataset['CalleesN'].astype('category').cat.codes
    dataset['CalleesU'] = dataset['CalleesU'].astype('category').cat.codes
    dataset['CalleesCalleesT'] = dataset['CalleesCalleesT'].astype('category').cat.codes
    dataset['CalleesCalleesN'] = dataset['CalleesCalleesN'].astype('category').cat.codes
    dataset['CalleesCalleesU'] = dataset['CalleesCalleesU'].astype('category').cat.codes
    print(dataset)
    CompleteSet = dataset[(dataset['CallersU']==0 or dataset['CallersU']==2) and (dataset['CalleesU']==0 or dataset['CalleesU']==2)]
    print(CompleteSet)

if __name__=="__main__":
    main()

I am using the line
dataset['CallersU'] = dataset['CallersU'].astype('category').cat.codes
to convert the string values that CallersU can take into digits. Similarly, I am using the line
dataset['CalleesU'] = dataset['CalleesU'].astype('category').cat.codes
to convert the string values that CalleesU can take into digits.

The four values that CallersU/CalleesU can take are -1, Low, Medium and High. The line ...astype('category').cat.codes automatically makes the following conversions: -1 corresponds to 0, High corresponds to 1, Low corresponds to 2 and Medium corresponds to 3. Thus, I am using the line
CompleteSet = dataset[(dataset['CallersU']==0 or dataset['CallersU']==2) and (dataset['CalleesU']==0 or dataset['CalleesU']==2)]
to specify that I only want to select rows with (CallersU==0 OR CallersU==2) AND (CalleesU==0 OR CalleesU==2).

The problem is that executing that line gives the error
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
How can I fix that and perform what's needed?
Replace and with & and or with |, and wrap each comparison in its own parentheses, because & and | bind more tightly than ==:

CompleteSet = dataset[((dataset['CallersU'] == 0) | (dataset['CallersU'] == 2)) & ((dataset['CalleesU']==0) | (dataset['CalleesU']==2))]
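When a column is tested against several allowed values, `Series.isin` can stand in for the chain of `|` comparisons. The sketch below uses a small made-up frame in which both columns contain all four values, so the category codes match the mapping described in the question (-1 is 0, High is 1, Low is 2, Medium is 3). If a value were missing from a column, the codes would shift, so filtering on the original strings before encoding may be the safer order.

```python
import pandas as pd

# Made-up rows; each column contains all four possible values.
dataset = pd.DataFrame({
    "CallersU": ["-1", "Low", "Medium", "High", "Low"],
    "CalleesU": ["Low", "High", "-1", "Medium", "-1"],
})

for column in ["CallersU", "CalleesU"]:
    # Categories are sorted, so "-1" -> 0, "High" -> 1, "Low" -> 2, "Medium" -> 3.
    dataset[column] = dataset[column].astype("category").cat.codes

wanted = [0, 2]  # codes for -1 and Low
mask = dataset["CallersU"].isin(wanted) & dataset["CalleesU"].isin(wanted)
complete_set = dataset[mask]
print(complete_set)
```

The two `isin` calls each return a boolean Series, and `&` combines them element by element, which is what the Python keywords `and` and `or` cannot do.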
SMOTE in python
I am trying to use SMOTE in Python and am looking for a way to manually specify the number of minority samples. Suppose we have 100 records of one class and 10 records of another class. If we use ratio = 1 we get 100:100; if we use ratio 1/2, we get 100:200. But I am looking for a way to manually specify the number of instances to be generated for both classes.

Ndf_class_0_records = trainData[trainData['DIED'] == 0]
Ndf_class_1_records = trainData[trainData['DIED'] == 1]
Ndf_class_0_record_counts = Ndf_class_0_records.DIED.value_counts()
Ndf_class_1_record_counts = Ndf_class_1_records.DIED.value_counts()
X_smote = trainData.drop("DIED", axis=1)
y_smote = trainData["DIED"]
smt = SMOTE(ratio={0:Ndf_class_0_record_counts, 1:Ndf_class_1_record_counts*2})
X_smote_res, y_smote_res = smt.fit_sample(X_smote, y_smote)

In the above code, I am trying to manually specify the number for each of the classes, but I am getting the following error at the last line of code:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
If I understand you and the documentation correctly, you are not passing numbers as ratio. You are instead passing a Series object. The accepted types for ratio are: float, str, dict or callable (default='auto').

Please try doing:

Ndf_class_0_records = trainData[trainData['DIED'] == 0]
Ndf_class_1_records = trainData[trainData['DIED'] == 1]
Ndf_class_0_record_counts = len(Ndf_class_0_records)  ##### CHANGED THIS
Ndf_class_1_record_counts = len(Ndf_class_1_records)  ##### CHANGED THIS
X_smote = trainData.drop("DIED", axis=1)
y_smote = trainData["DIED"]
smt = SMOTE(ratio={0:Ndf_class_0_record_counts, 1:Ndf_class_1_record_counts*2})
X_smote_res, y_smote_res = smt.fit_sample(X_smote, y_smote)

This should now work, please try!
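To see why the original dict held Series objects: `value_counts()` always returns a Series, even when the filtered frame contains a single class. The pandas-only sketch below builds the per-class target dict with plain ints from one `value_counts()` call. The frame `train_data` and the name `sampling_targets` are made up for illustration. As far as I know, newer imbalanced-learn releases take such a dict through a `sampling_strategy` argument and use `fit_resample` in place of `ratio` and `fit_sample`, but that should be checked against the installed version.

```python
import pandas as pd

# Made-up training data: 8 records of class 0 and 2 records of class 1.
train_data = pd.DataFrame({"DIED": [0] * 8 + [1] * 2, "age": range(10)})

counts = train_data["DIED"].value_counts()  # a Series indexed by class label
n_class_0 = int(counts.loc[0])              # plain Python int, not a Series
n_class_1 = int(counts.loc[1])

# Desired number of samples per class after resampling (hypothetical name).
sampling_targets = {0: n_class_0, 1: n_class_1 * 2}
print(sampling_targets)
```

Because every value in the dict is now an int, nothing downstream has to evaluate the truth value of a Series.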
How to apply a function in Pandas to a cell in every row where a different cell in that same row meets a condition?
I am trying to use the pandas string method "str.zfill" to add leading zeros to a cell in every row of the dataframe where another cell in that row meets a certain condition. So for any given row in my DataFrame "excodes", when the value in column "LOB_SVC_CD" is "MTG", apply the str.zfill(5) method to the cell in column "PS_CD". When the value in "LOB_SVC_CD" is not "MTG", leave the value in "PS_CD" as is.

I've tried a few custom functions, "np.where" and a few apply/map lambdas. I'm getting errors on all of them.

#Custom Function
def add_zero(column):
    if excodes.loc[excodes.LOB_SVC_CD == 'MTG']:
        excodes.PS_CD.str.zfill(5)
    else:
        return excodes.PS_CD
excodes['code'] = excodes.apply(add_zero)

#Custom Function with For Loop
def add_zero2(column):
    code = []
    for row(i) in column:
        if excodes.LOB_SVC_CD == 'MTG':
            code.append(excodes.PS_CD.str.zfill(5))
        else:
            code.append(excodes.PS_CD)
    excodes['Code'] = code
excodes['code'] = excodes.apply(add_zero)

#np.Where
mask = excodes[excodes.LOB_SVC_CD == 'MTG']
excodes['code'] = pd.DataFrame[np.where(mask, excodes.PS_CD.str.zfill(5), excodes.PS_CD)]

#Lambda
excodes['code'] = excodes['LOB_SVC_CD'].map(lambda x: excodes.PS_CD.str.zfill(5)) if x[excodes.LOB_SVC_CD == 'MTG'] else excodes.PS_CD)

#Assign with a "Where"
excodes.assign((excodes.PS_CD.str.zfill(5)).where(excodes.LOB_SVC_CD == 'MTG'))

The expected result is either a new column called "code" in which the values from "PS_CD" are given leading zeroes in rows where excodes.LOB_SVC_CD == 'MTG', or leading zeroes added in place to the values in excodes["PS_CD"] for rows where excodes['LOB_SVC_CD'] == 'MTG'.

The error messages I'm getting for each of the approaches I've tried:

Custom Function: "ValueError: ('The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().', 'occurred at index PS_CD')"
Custom Function with For Loop: "SyntaxError: can't assign to function call"
np.Where: "ValueError: operands could not be broadcast together with shapes (152,7) (720,) (720,)"
Apply Lambda: "string indices must be integers"
Assign with a "Where": "TypeError: assign() takes 1 positional argument but 2 were given"
This seems to work :)

# Ensure the data in PS_CD are strings
data["PS_CD"] = data["PS_CD"].astype(str)
# Iterate over all rows
for index in data.index:
    # If the LOB_SVC_CD is "MTG"
    if (data.loc[index, "LOB_SVC_CD"] == "MTG"):
        # Apply zfill(5) to PS_CD on the same row (index)
        data.loc[index, "PS_CD"] = data.loc[index, "PS_CD"].zfill(5)
# Print the result
print(data)

An alternative way (maybe a bit more Pythonic) :)

# Ensure the data in PS_CD are strings
data["PS_CD"] = data["PS_CD"].astype(str)
# Custom function for applying the zfill
def my_zfill(x, y):
    return y.zfill(5) if x == "MTG" else y
# Iterate over the data, applying the custom function to each row;
# index=data.index keeps the values aligned when data does not use the default 0..n-1 index
data["PS_CD"] = pd.Series([my_zfill(x, y) for x, y in zip(data["LOB_SVC_CD"], data["PS_CD"])], index=data.index)
My take:

>>> import pandas
>>> df = pandas.DataFrame(data = [['123', 'MTG'],['321', 'CLOC']], columns = ['PS_CD', 'LOB_SVC_CD'])
>>> df
  PS_CD LOB_SVC_CD
0   123        MTG
1   321       CLOC
>>>
>>> df['PS_CD'] = df.apply(lambda row: row['PS_CD'].zfill(5) if row['LOB_SVC_CD'] == 'MTG' else row['PS_CD'], axis='columns')
>>> df
   PS_CD LOB_SVC_CD
0  00123        MTG
1    321       CLOC

The lambda returns a value for every row: the zero-filled PS_CD if LOB_SVC_CD is MTG, otherwise the original PS_CD.
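For completeness, the same result can be had without a Python-level loop or `apply`. The sketch below is one possible vectorized variant: it zero-pads the whole column once and then uses `Series.where` to keep the padded value only in rows where LOB_SVC_CD is "MTG". The sample rows are made up, and PS_CD is assumed to be convertible to strings.

```python
import pandas as pd

# Made-up sample standing in for the excodes frame from the question.
excodes = pd.DataFrame({
    "PS_CD": [123, 321, 45],
    "LOB_SVC_CD": ["MTG", "CLOC", "MTG"],
})

ps_cd = excodes["PS_CD"].astype(str)     # zfill only works on strings
is_mtg = excodes["LOB_SVC_CD"] == "MTG"  # boolean mask, one entry per row

# where() keeps the padded value where the mask is True
# and falls back to the unpadded string elsewhere.
excodes["code"] = ps_cd.str.zfill(5).where(is_mtg, ps_cd)
print(excodes)
```

`np.where(is_mtg, ps_cd.str.zfill(5), ps_cd)` would give the same values. The np.where attempt in the question failed because its mask was a filtered DataFrame, not a boolean Series.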
Grabbing characters after a certain value
Below is the dataframe. PIC_1 and Wgt are strings, and p.lgth and p_lgth are integers. If p_lgth is not equal to 30, I want to find 42 in PIC_1 and grab 42 and the 15 digits that come after it.

PIC_1                                               Wgt  p.lgth  p_lgth
**PARTIAL-DECODE***P / 42011721930018984390078...   112  53      53

So the output from above should be 42011721930018984.

My code, which does not work, follows:

def pic_mod(row):
    if row['p_lgth'] !=30:
        PIC_loc = row['PIC_1'].find('42')
        PIC_2 = row['PIC_1'].str[PIC_loc:PIC_loc + 15]
    elif row['p_lgth']==30:
        PIC_2=PIC_1
    return PIC_2

row_1 is just a row from the larger df that is identical to the example row given above:

row_1 = df71[2:3]
pic_mod(row_1)

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

I ran type() on the variables and got:

type(df71['PIC_1']) = pandas.core.series.Series
type(df71['p_lgth']) = pandas.core.series.Series
type(df71['Wgt']) = pandas.core.series.Series

I'm fairly new to Python. Should these data types come back as int and str? df71 is a df.
Going by the error message in your post, perhaps try this:

def pic_mod(row):
    # row is a one-row DataFrame here, so reduce the comparison to a single bool
    if (row['p_lgth'] != 30).any():
        PIC_loc = row['PIC_1'].str.find('42').iloc[0]
        PIC_2 = row['PIC_1'].str[PIC_loc:PIC_loc + 17]
    else:
        PIC_2 = row['PIC_1']
    return PIC_2

However, if your data is already in a pandas dataframe, you normally wouldn't write such an explicit function. E.g. the initial filtering of all rows in the dataset with p_lgth not equal to 30 is a single line:

df_fltrd = df[df['p_lgth']!=30]

Having done this, you can apply an arbitrary function to the entries in the PIC_1 column, e.g. in your case taking the substring of length 17 starting with '42':

df_fltrd['PIC_1'].apply(lambda x: x[x.find('42'):x.find('42')+17])
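Another option is to let a regular expression do both the search and the slicing. The sketch below uses `Series.str.extract` with the pattern `42\d{15}` and then `Series.where` to keep the original PIC_1 for rows where p_lgth is 30. The two sample rows are made up (the first reuses the truncated value shown in the question, without the trailing dots). Unlike the slicing approach, the pattern only matches when 42 is followed by 15 digits, and rows with no match come back as NaN.

```python
import pandas as pd

# Made-up rows; the second one is a hypothetical 30-character value.
df71 = pd.DataFrame({
    "PIC_1": [
        "**PARTIAL-DECODE***P / 42011721930018984390078",
        "42" + "0" * 28,
    ],
    "p_lgth": [53, 30],
})

# First occurrence of "42" followed by exactly 15 more digits (17 characters total).
extracted = df71["PIC_1"].str.extract(r"(42\d{15})", expand=False)

needs_extract = df71["p_lgth"] != 30
# Use the extracted code where p_lgth != 30, otherwise keep PIC_1 unchanged.
df71["PIC_2"] = extracted.where(needs_extract, df71["PIC_1"])
print(df71["PIC_2"])
```

This works on the whole column at once, so there is no per-row function and no `if` applied to a Series.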