Linearmodels value error - do not have full column rank - python

This is similar to another question.
I'm running a two-stage least squares (2SLS) regression with a set of categorical variables.
I ran the model successfully once, but when I tried to replicate it I got this error: ValueError: instruments [exog instruments] do not have full column rank
As far as I can tell, this is related to my exogenous variable missing data or having fewer rows than my other variables. However, the constant was created using this method:
df['const'] = 1
controls = ['const'] + controls
which simply adds a column called 'const' to my dataframe and then adds it to my list of control variables (instruments).
I've also checked that the new 'const' column was created and added to the dataframe correctly.
Any insight would be much, much appreciated.
thanks!
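One common cause of this error, which may or may not be what is happening here, is combining a constant with a full set of dummy variables: the dummies sum to the constant, so the matrix loses rank. The sketch below checks the rank directly with NumPy. The column names `region` and `x` and the data are hypothetical and only illustrate the check.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one categorical variable and one numeric control.
df = pd.DataFrame({
    "region": ["a", "b", "c", "a", "b", "c"],
    "x": [1.0, 2.0, 0.5, 3.0, 1.5, 2.5],
})
df["const"] = 1

# One dummy per category plus a constant: the dummies add up to 'const'.
full = pd.get_dummies(df["region"], dtype=float)
X_bad = pd.concat([df[["const", "x"]], full], axis=1)
rank_bad = np.linalg.matrix_rank(X_bad.to_numpy(dtype=float))

# Dropping one category removes the linear dependence.
reduced = pd.get_dummies(df["region"], drop_first=True, dtype=float)
X_ok = pd.concat([df[["const", "x"]], reduced], axis=1)
rank_ok = np.linalg.matrix_rank(X_ok.to_numpy(dtype=float))

print(X_bad.shape[1], rank_bad)  # more columns than rank: not full column rank
print(X_ok.shape[1], rank_ok)    # columns equal rank: full column rank
```

Running the same rank check on the actual exogenous and instrument columns should show whether the rank falls short of the column count.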

Related

Machine Learning feature-engine MeanEncoder gives error with Cancer dataset

I'm working with the wisconsin breast cancer dataset found here. Feature engineering is important in machine learning so a teacher of mine recommended the MeanEncoder part of a library found here. The dataframe looks like the following:
I specifically changed the diagnosis feature/column to category because one of the errors suggested that might have been the issue, but apparently not, as it's still not solved.
I want to mean encode the target feature/column using the MeanEncoder found in the library linked above. Here's my function to attempt to do so:
def MeanEncoding(self):
    # Get the columns besides the target variable at the front, which is diagnosis, as recommended by teacher.
    cols = self.m_df.iloc[:, 1:].columns.to_list()
    # Save specifically the target variable too.
    target = self.m_df.iloc[:, 0]
    # Now get the object ready.
    encoder = MeanEncoder(variables=cols)
    print('---Fitting---')
    encoder.fit(self.m_df.drop('diagnosis', axis=1), target)
In this code:
1. m_df is just the dataframe, hence the "df".
2. I drop the diagnosis column/feature in the first argument of encoder.fit, since it's provided in the second argument of the same function. But it makes no difference, because I still get the error: "TypeError: Some of the variables are not categorical. Please cast them as object or category before calling this transformer"
Now with #2, I'm thinking, "No way, I have to transform the numeric features, which are 'radius_mean', 'texture_mean', etc., into category or object? That makes 0 sense". But I googled this error, of course, and it brought me to this SO thread. This individual has concerns similar to mine, except with a different function. The suggestion for him was "Just change the dtype of grade column to object before using imputer", so I changed the types to object as well with the following code:
for i in range(1, len(self.m_df.columns)):
    columnName = self.m_df.columns[i]
    self.m_df[columnName] = self.m_df[columnName].astype('object')
Doesn't make sense to me because it's converting the types of genuine numeric columns/features. I get this error which is KIND of expected:
pandas.core.base.DataError: No numeric types to aggregate
Now I'm thinking it just wants a few numeric types, so I slightly alter the code:
for i in range(1, len(self.m_df.columns) - 2):
    columnName = self.m_df.columns[i]
    self.m_df[columnName] = self.m_df[columnName].astype('object')
Which literally just leaves the last 2 columns as float64 types and therefore all others are type object (besides the diagnosis column which is category but I doubt that matters). Now some numeric types ARE present. Yet I still get the error again
TypeError: Some of the variables are not categorical. Please cast
them as object or category before calling this transformer
I am clearly missing something but not sure what. No matter how I alter the types to satisfy the function, it's wrong.
By default, the MeanEncoder from Feature-engine, like all other Feature-engine encoders, works only on variables cast as object or category.
So the variables captured in the list cols in this line of code: cols = self.m_df.iloc[:, 1:].columns.to_list() should only contain categorical variables (object or category).
When you set up the encoder here: encoder = MeanEncoder(variables=cols), in variables, you indicate the variables to encode. If you pass cols, it means you want to encode all the variables within the cols list. So you need to ensure that all of them are of type category or object.
If you get the error: "TypeError: Some of the variables are not categorical. Please cast them as object or category before calling this transformer", it means that some of the variables in cols are not of type object or category.
If you want to encode numerical variables, there are 2 options: 1) recast the variables you want to encode as object. 2) set the parameter ignore_format=True as per the transformer's documentation. That should solve your problem.
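To make the dtype requirement concrete, here is a minimal plain-pandas sketch of what mean encoding does. It is not Feature-engine's implementation, and the `grade` column and the data are hypothetical. It assumes the target must be numeric to take a mean, so the M/B labels are mapped to 1/0 first, and only category-typed columns are selected for encoding.

```python
import pandas as pd

# Hypothetical miniature of the dataset: target first, then features.
df = pd.DataFrame({
    "diagnosis": ["M", "B", "M", "B"],
    "grade": ["high", "low", "high", "high"],  # hypothetical categorical feature
    "radius_mean": [17.9, 11.4, 20.5, 12.4],   # genuinely numeric feature
})
df["grade"] = df["grade"].astype("category")

# A mean needs a numeric target, so map the labels to 1/0 (assumption: M is the positive class).
y = (df["diagnosis"] == "M").astype(int)

# Only categorical columns are candidates for mean encoding.
features = df.drop(columns="diagnosis")
cat_cols = features.select_dtypes(include=["category"]).columns.tolist()

# Mean encoding: replace each category with the mean of the target within that category.
keys = df["grade"].astype(str)
means = y.groupby(keys).mean()
encoded = keys.map(means)

print(cat_cols)
print(encoded.tolist())
```

Passing a list like `cat_cols`, rather than every column, as `variables` is the first option described above; numeric columns such as `radius_mean` are left alone.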

Showing integer columns as categorical and throwing error in sweetviz compare

If I analyze these two datasets individually, I don't get any error, and I also get the visualizations for all the integer columns.
But when I try to compare these dataframe, I get the below error.
Cannot convert series 'Web Visit' in COMPARED from its TYPE_CATEGORICAL
to the desired type TYPE_BOOL.
I also tried using FeatureConfig to skip it, but to no avail.
pid_compare = sweetviz.compare([pdf,"234_7551009"],[pdf_2,"215_220941058"])
Maintainer of the lib here; this question was asked in the git also, but it will be useful to detail the answer here.
After looking at your data provided in the link above, it looks like the first dataframe (pdf) only contains 0 & 1, so it is classified as boolean so it cannot be compared against the second one which is categorical (that one has 0,1,2,3 as you probably know!).
The system will be able to handle it if you use FeatureConfig to force the first dataframe to be considered CATEGORICAL.
I just tried the following and it seems to work, let me know if it helps!
feature_config = sweetviz.FeatureConfig(force_cat = ["Web Desktop Interaction"])
report = sweetviz.compare(pdf, pdf_2, None, feature_config)
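To find which columns are likely to hit this mismatch before running the comparison, a plain-pandas check along the following lines may help. This is only a sketch: `binary_only_columns` is a hypothetical helper, not part of sweetviz, the data is made up, and the 0/1 test is an approximation of the type inference described above.

```python
import pandas as pd

def binary_only_columns(left, right):
    """Return shared columns that hold only 0/1 in exactly one of the two frames."""
    mismatched = []
    for col in left.columns.intersection(right.columns):
        left_binary = set(left[col].dropna().unique()) <= {0, 1}
        right_binary = set(right[col].dropna().unique()) <= {0, 1}
        if left_binary != right_binary:
            mismatched.append(col)
    return mismatched

# Made-up frames: 'Web Visit' is 0/1 in the first and 0-3 in the second.
pdf = pd.DataFrame({"Web Visit": [0, 1, 1, 0], "Clicks": [0, 1, 0, 1]})
pdf_2 = pd.DataFrame({"Web Visit": [0, 1, 2, 3], "Clicks": [1, 0, 1, 1]})

candidates = binary_only_columns(pdf, pdf_2)
print(candidates)
```

The resulting list is the kind of value that could be passed as `force_cat` in the FeatureConfig shown above.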

Python Pandas groupby mean with "No numeric types to aggregate" error

I am trying to calculate the mean of one column from an Excel file.
I delete all the null values and '-' entries in the column called 'TFD' and form a new dataframe by selecting three columns. I want to calculate the mean from the new dataframe with groupby, but I get the error "No numeric types to aggregate". I don't know why I get this error or how to fix it.
sheet=pd.read_excel(file)
sheet_copy=sheet
sheet_copy=sheet_copy[(~sheet_copy['TFD'].isin(['-']))&(~sheet_copy['TFD'].isnull())]
sheet_copy=sheet_copy[['Participant ID','Paragraph','TFD']]
means=sheet_copy['TFD'].groupby([sheet_copy['Participant ID'],sheet_copy['Paragraph']]).mean()
From your spreadsheet snippet above, it looks as though your Participant ID and Paragraph columns are formatted as Text, so they are probably strings inside your dataframe. I suspect this is where the "No numeric types to aggregate" exception comes from.
Following this, here are some good examples of group by with the mean clause from the pandas documentation:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.GroupBy.mean.html
If the dataset had been available, I would have tried it out myself and provided a snippet of the code used.
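String group keys are normally fine for groupby, so another likely culprit is the TFD column itself. Assuming it was read as object dtype because of the '-' placeholders, filtering those rows out does not change the dtype, and the mean then has nothing numeric to aggregate. The sketch below uses made-up data with the question's column names and converts TFD with `pd.to_numeric` before grouping.

```python
import pandas as pd

# Made-up stand-in for the Excel sheet; '-' and None mark missing TFD values.
sheet = pd.DataFrame({
    "Participant ID": ["p1", "p1", "p2", "p2", "p2"],
    "Paragraph": ["a", "a", "a", "b", "b"],
    "TFD": [100, "-", 300, 250, None],
})

clean = sheet[["Participant ID", "Paragraph", "TFD"]].copy()

# Turn anything non-numeric (such as '-') into NaN, then drop those rows.
clean["TFD"] = pd.to_numeric(clean["TFD"], errors="coerce")
clean = clean.dropna(subset=["TFD"])

means = clean.groupby(["Participant ID", "Paragraph"])["TFD"].mean()
print(means)
```

Checking `sheet["TFD"].dtype` on the real data would confirm whether the column is object rather than a numeric type.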

Unable to fit the linear multivariate regression model due to missing values in CSV file

I am trying to fit a linear regression model with multiple variables. I have a CSV file with the attributes 'area', 'bedrooms', 'age' and 'price', but bedrooms has a missing value (i.e. NaN). I want to fit the model and predict the price from the other three attributes, but because of the one missing value I get the error "Input contains NaN, infinity or a value too large for dtype('float64')". I found the median and filled in the missing value, but the value is not replaced in the CSV file, and I still get the error while fitting the model.
The CSV file is as follows:
I have used the following code:
df=pd.read_csv(r"C:\Users\rohit\Desktop\homeprices4.csv")
df
m=math.floor(df.bedrooms.median())  # m is the median which I have calculated
m
df.bedrooms.fillna(m)
reg=linear_model.LinearRegression()
reg.fit(df[['area','bedrooms','age']],df.price)
After this line I am getting the error because bedrooms has one missing value. If I remove bedrooms from this line and use only area and age for prediction then NO error, and I am getting correct results.
So my question is: how do I replace the missing value with the median in the CSV file? What is the code for that?
Why am I getting the error?
Look at the screenshot for the error:
There are two questions in your post:
1. Your changes are not saved to the dataframe because fillna returns a new object; you have to assign the result back (or pass inplace=True). This is why you still get the error.
2. To save the changes to a CSV file you need to use DataFrame.to_csv(...), but given my previous point maybe you don't need this.
I would enrich your pipeline with a data cleaning step and save the cleaned data. I would do 2 separate scripts.
Data cleaning:
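import math
import pandas as pd

path_raw_data = r"C:\Users\rohit\Desktop\homeprices4.csv"
path_clean_data = r"C:\Users\rohit\Desktop\homeprices4_clean.csv"
df = pd.read_csv(path_raw_data)
m = math.floor(df.bedrooms.median())  # m is the median of the bedrooms column
# Assign the result back so the filled values are kept in the dataframe.
df['bedrooms'] = df['bedrooms'].fillna(m)
# index=False avoids writing the row index as an extra column.
df.to_csv(path_clean_data, index=False)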
Linear regression:
import pandas as pd
from sklearn import linear_model

path_clean_data = r"C:\Users\rohit\Desktop\homeprices4_clean.csv"
df = pd.read_csv(path_clean_data)
reg = linear_model.LinearRegression()
reg.fit(df[['area', 'bedrooms', 'age']], df.price)
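The point about fillna can be checked on a small in-memory frame without touching any files. The sketch below uses made-up values with the question's column names and shows that calling fillna without keeping the result leaves the NaN in place, while assigning the result back removes it.

```python
import math

import numpy as np
import pandas as pd

# Made-up rows standing in for the CSV; one bedrooms value is missing.
df = pd.DataFrame({
    "area": [2600, 3000, 3200],
    "bedrooms": [3.0, np.nan, 4.0],
})

m = math.floor(df["bedrooms"].median())

# The filled Series is returned and discarded, so df is unchanged.
df["bedrooms"].fillna(m)
still_missing = int(df["bedrooms"].isna().sum())

# Assigning the result back stores the filled values.
df["bedrooms"] = df["bedrooms"].fillna(m)
now_missing = int(df["bedrooms"].isna().sum())

print(still_missing, now_missing)
```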

Unable to implement MICE in Python

I'm trying to use the MICE implementation in the statsmodels package to impute values for my columns, but I can't figure out exactly how to use it. Whatever I run, it throws the error: ValueError: variable to be imputed has no observed values
Code:
df=pd.read_csv('contacts.csv', engine='c',low_memory=False)
from statsmodels.imputation.mice import MICEData as md
md(df)
What am I doing wrong?
At least one of the columns in the generated dataframe (and hence in the CSV) is empty.
Check the dataframe; you may have to clean it up or normalize it.
Also, don't be afraid to look into the code base.
What you are looking for is the _split_indices method of MICEData.
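A quick way to look for completely empty columns before handing the frame to MICEData is sketched below in plain pandas. The column names and data are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame in which the 'fax' column has no observed values at all.
df = pd.DataFrame({
    "age": [30.0, np.nan, 45.0],
    "fax": [np.nan, np.nan, np.nan],
    "score": [1.0, 2.0, np.nan],
})

# Columns where every value is missing cannot be imputed from observed data.
empty_cols = df.columns[df.isna().all()].tolist()

# Drop them and keep only columns with at least one observed value.
cleaned = df.dropna(axis=1, how="all")

print(empty_cols)
print(cleaned.columns.tolist())
```

The cleaned frame still contains the partially missing columns, which are the ones imputation is meant to fill.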
