I'm working with the Wisconsin breast cancer dataset found here. Feature engineering is important in machine learning, so a teacher of mine recommended the MeanEncoder from a library found here. The dataframe looks like the following:
I specifically changed the diagnosis feature/column to category because one of the errors suggested that might have been the issue, but apparently it's not, since the error isn't solved.
I want to mean encode the target feature/column using MeanEncoder from the library linked above. Here's my function attempting to do so:
def MeanEncoding(self):
    # Get the columns besides the target variable at the front, which is diagnosis, as recommended by my teacher.
    cols = self.m_df.iloc[:, 1:].columns.to_list()
    # Save the target variable specifically, too.
    target = self.m_df.iloc[:, 0]
    # Now get the encoder ready.
    encoder = MeanEncoder(variables=cols)
    print('---Fitting---')
    encoder.fit(self.m_df.drop('diagnosis', axis=1), target)
In this code:
m_df - just the dataframe, hence the "df"
I drop the diagnosis column/feature in the first argument of encoder.fit, since it's already provided as the second argument of the same call. But it changes nothing, because I still get the error: "TypeError: Some of the variables are not categorical. Please cast them as object or category before calling this transformer"
Now with #2, I'm thinking, "No way, do I really have to transform the numeric features, which are 'radius_mean', 'texture_mean', etc., into category or object? That makes zero sense." But I google this error, of course, and it brings me to this SO thread. That individual is having similar concerns to mine, except with a different function. The suggestion for him was "Just change the dtype of grade column to object before using imputer", so I change my types to object as well with the following code:
for i in range(1, len(self.m_df.columns)):
    columnName = self.m_df.columns[i]
    self.m_df[columnName] = self.m_df[columnName].astype('object')
This doesn't make sense to me because it converts the types of genuinely numeric columns/features. I then get this error, which is KIND of expected:
pandas.core.base.DataError: No numeric types to aggregate
Now I'm thinking it just wants a few numeric types, so I slightly alter the code:
for i in range(1, len(self.m_df.columns) - 2):
    columnName = self.m_df.columns[i]
    self.m_df[columnName] = self.m_df[columnName].astype('object')
This literally just leaves the last 2 columns as float64, and therefore all the others are type object (besides the diagnosis column, which is category, but I doubt that matters). So now some numeric types ARE present. Yet I still get the error again:
TypeError: Some of the variables are not categorical. Please cast them as object or category before calling this transformer
I am clearly missing something but not sure what. No matter how I alter the types to satisfy the function, it's wrong.
The MeanEncoder from Feature-engine, like all other Feature-engine encoders, works by default only on variables cast as object or category.
So the variables captured in the list cols in this line of code: cols = self.m_df.iloc[:, 1:].columns.to_list() should only contain categorical variables (object or category).
When you set up the encoder here: encoder = MeanEncoder(variables=cols), in variables, you indicate the variables to encode. If you pass cols, it means you want to encode all the variables within the cols list. So you need to ensure that all of them are of type category or object.
If you get the error "TypeError: Some of the variables are not categorical. Please cast them as object or category before calling this transformer", it means that some of the variables in cols are not of type object or category.
If you want to encode numerical variables, there are 2 options: 1) recast the variables you want to encode as object, or 2) set the parameter ignore_format=True, as per the transformer's documentation. That should solve your problem.
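For intuition, mean encoding just replaces each category with the mean of the target within that category. A minimal pandas sketch of the idea (toy data and made-up column names, not Feature-engine's internals):

```python
import pandas as pd

# Toy frame: binary target plus one categorical feature.
df = pd.DataFrame({
    "diagnosis": [1, 0, 1, 1, 0, 0],
    "grade": ["a", "a", "b", "b", "b", "a"],
})

# Map each category of "grade" to the mean of the target within that category.
means = df.groupby("grade")["diagnosis"].mean()
df["grade_encoded"] = df["grade"].map(means)
```

Setting ignore_format=True simply lets MeanEncoder apply this kind of mapping to the columns you list in variables regardless of their dtype.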
Related
More of a conceptual question.
When I import files into Python (without specifying the data types) -- just straight up df = pd.read_csv("blah.csv") or df = pd.read_excel("blah.xls") -- pandas guesses the data types of the columns.
No issues here.
However, sometimes when I am working with one of the columns, say, an object column, and I know for certain that pandas guessed correctly, my .str functions sometimes don't work as intended, or I get an error. Yet, if I cast the column to str after importing, everything works as intended.
I also noticed that if I cast one of the object columns to str after importing, the size of the object increases. So I am guessing pandas' object type is different from a "string object" datatype? What causes this discrepancy?
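A small experiment showing what I mean, with a toy series of my own (not the real data). As I understand it, the object dtype just means "column of arbitrary Python objects", while .astype(str) calls str() on each element, so afterwards every element really is a string (each carrying Python string overhead, which would explain the size increase):

```python
import pandas as pd

s = pd.Series(["a", 1])   # pandas reports dtype "object": arbitrary Python objects
s2 = s.astype(str)        # str() is applied to every element: "a", "1"

# Both series still report dtype "object"; what changed is the elements inside.
same_dtype = (s.dtype == s2.dtype)
print(type(s[1]), type(s2[1]))  # the int became a str
```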
I am trying to fine-tune Tapas following the instructions here: https://huggingface.co/transformers/v4.3.0/model_doc/tapas.html#usage-fine-tuning (Weak supervision for aggregation, WTQ), using https://www.microsoft.com/en-us/download/details.aspx?id=54253 , which follows the required SQA dataset format: tsv files with most of the named columns. But there is no float_answer column. And as mentioned,
float_answer: the float answer to the question, if there is one (np.nan if there isn’t). Only required in case of weak supervision for aggregation (such as WTQ and WikiSQL)
Since I am using WTQ, I need the float_answer column. I tried populating float_answer based on answer_text as suggested here, using the parse_question(table, question, mode) function from https://github.com/google-research/tapas/blob/master/tapas/utils/interaction_utils_parser.py. However, I am getting errors.
I copied everything from here and put these args:
.
But, I get this error: TypeError: Parameter to CopyFrom() must be instance of same class: expected language.tapas.Question got str.
1) Can you please help me understand what args I should use, or how else I can populate float_answer?
I am using table_csv and a question whose answer is present in the given table:
2) We also tried simply adding a float_answer column and making all the values np.nan. That crashed, too.
Is there a tutorial for WTQ fine-tuning? Thanks!
My problem is that I need to change some sets of categorized columns into numbers for machine learning.
I don't want to use LabelEncoder because I heard it's not as effective as OneHotEncoder.
So I used this code:
X = df.drop("SalePrice", axis=1)
y = df['SalePrice']
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot", one_hot, categorical_features)], remainder="passthrough")
transformed_X = transformer.fit_transform(df)
where categorical_features is the list of columns I want to apply the OneHotEncoder to.
But I get a multiple line error as an output with the overall problem stating:
TypeError: Encoders require their input to be uniformly strings or numbers. Got ['float', 'str']
Someone had similar issues and was asked to clean their data to remove NaN values; I have done that already, but no change. I have also been asked to change the datatypes of my columns to strings, and I wrote a loop to do that like here:
This error is pretty self-explanatory: you cannot mix str AND float in your columns and still use the encoder.
Make sure that all values within each column share the same type too.
You can try the following to force everything to be a string:
for e in categorical_features:
    df[e] = df[e].astype(str)
Or maybe you have another issue with your data, if everything 'should' be float. In that case, use checks like str.isnumeric to find the offending values.
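A self-contained version of that coercion loop, with toy data standing in for the asker's frame (the column names here are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "Street": ["Pave", 2.0, "Grvl"],       # mixed str/float: exactly what trips the encoder
    "LotArea": [8450.0, 9600.0, 11250.0],  # genuinely numeric, left alone
})
categorical_features = ["Street"]

# Force every value in the categorical columns to be a string.
for e in categorical_features:
    df[e] = df[e].astype(str)
```

After the loop, every entry in "Street" is a str ("Pave", "2.0", "Grvl"), so OneHotEncoder no longer sees a mix of types in that column.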
If I analyze these two datasets individually, I don't get any error, and I also get the viz of all the integer columns.
But when I try to compare these dataframe, I get the below error.
Cannot convert series 'Web Visit' in COMPARED from its TYPE_CATEGORICAL
to the desired type TYPE_BOOL.
I also tried the FeatureConfig to skip it, but to no avail.
pid_compare = sweetviz.compare([pdf,"234_7551009"],[pdf_2,"215_220941058"])
Maintainer of the lib here; this question was also asked on GitHub, but it will be useful to detail the answer here too.
After looking at your data provided in the link above, it looks like the first dataframe (pdf) only contains 0 & 1, so it is classified as boolean and cannot be compared against the second one, which is categorical (that one has 0, 1, 2, 3, as you probably know!).
The system will be able to handle it if you use FeatureConfig to force the first dataframe to be considered CATEGORICAL.
I just tried the following and it seems to work, let me know if it helps!
feature_config = sweetviz.FeatureConfig(force_cat = ["Web Desktop Interaction"])
report = sweetviz.compare(pdf, pdf_2, None, feature_config)
This is similar to another question.
I'm running a 2-stage least square regression with set of categorical variables.
I've run the model successfully once, but when I tried to replicate it, I ran into this error: ValueError: instruments [exog instruments] do not have full column rank
As far as I can tell, this is related to my exogenous variables missing data or having fewer rows than my other variables. However, the constant was created using this method:
df['const'] = 1
controls = ['const'] + controls
which simply adds column called 'const' to my dataframe and then adds this to my list of control variables (instruments).
I've also checked the dataframe with the new 'const' column was created and added correctly.
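From what I've read, this error usually means some of the instrument/exogenous columns are linearly dependent, e.g. the dummy-variable trap: a constant plus a full set of dummies that sum to one. A quick numpy sketch of such a rank-deficient matrix (toy data, not my actual variables):

```python
import numpy as np

const = np.ones(6)
d1 = np.array([1, 0, 1, 0, 1, 0])
d2 = 1 - d1                      # complement dummy, so const == d1 + d2

# Stack the three columns and check whether they are linearly independent.
Z = np.column_stack([const, d1, d2])
rank = np.linalg.matrix_rank(Z)
full_rank = rank == Z.shape[1]   # False: only 2 independent columns out of 3
```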
Any insight would be much, much appreciated.
thanks!