feature_names must be unique - Xgboost - python

I am running the xgboost model for a very sparse matrix.
I am getting this error. ValueError: feature_names must be unique
How can I deal with this?
This is my code.
yprob = bst.predict(xgb.DMatrix(test_df))[:,1]

According to the xgboost source code, this error is raised in exactly one place - an internal DMatrix validation function. Here's the excerpt:
if len(feature_names) != len(set(feature_names)):
    raise ValueError('feature_names must be unique')
So, the error text is pretty literal here; your test_df has at least one duplicate feature/column name.
You've tagged pandas on this post, which suggests test_df is a Pandas DataFrame. In that case, DMatrix takes feature_names directly from df.columns. Check your test_df for repeated column names, remove or rename them, and then try DMatrix() again.

Assuming the problem is indeed that columns are duplicated, the following line should solve your problem:
test_df = test_df.loc[:,~test_df.columns.duplicated()]
Source: python pandas remove duplicate columns
This line should identify which columns are duplicated:
duplicate_columns = test_df.columns[test_df.columns.duplicated()]
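Putting the two snippets above together, here is a minimal sketch with a hypothetical DataFrame (the column names are made up for illustration) showing how to find and drop duplicated column names:

```python
import pandas as pd

# Hypothetical frame with a duplicated column name "a"
test_df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["a", "b", "a"])

# Names that appear more than once (every occurrence after the first)
duplicate_columns = test_df.columns[test_df.columns.duplicated()]
print(list(duplicate_columns))  # ['a']

# Keep only the first occurrence of each column name
test_df = test_df.loc[:, ~test_df.columns.duplicated()]
print(list(test_df.columns))  # ['a', 'b']
```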

One way around this is to make sure the column names are unique while preparing the data; then it should work.

I converted them with np.array(df), which drops the column names entirely. My problem was solved.


Problems with DataFrame indexing with pandas

Using pandas, I have to modify a DataFrame so that it only has the indexes that are also present in a vector, which was acquired by performing operations in one of the df's columns. Here's the specific line of code used for that (please do not mind me picking the name 'dataset' instead of 'dataframe' or 'df'):
dataset = dataset.iloc[list(set(dataset.index).intersection(set(vector.index)))]
It worked, and the image attached here shows the df and some of its indexes. However, when I try to access a specific value by index in the new 'dataset', such as in the line below, I get an error: single positional indexer is out-of-bounds
print(dataset.iloc[:, 21612])
Note: I've also tried the following, to make sure it isn't simply an issue with me not knowing how to use iloc:
print(dataset.iloc[21612, :])
and
print(dataset.iloc[21612])
Do I have to create another column to "mimic" the actual indexes? What am I doing wrong? Note that it's necessary for me to keep the indexes unchanged even though the size of the DataFrame changes: e.g. if the DataFrame originally had 21000 rows and the new one only 15000, I still need to be able to use the number 20999 as an index if it passed the intersection check shown in the first code snippet. Thanks in advance.
Try this:
print(dataset.loc[21612, :])
After you have eliminated some of the original rows, the positional argument to iloc[] must be less than the new length (i.e., at most len(dataset) - 1). loc[], by contrast, looks up rows by their original index labels, which is what you need here.
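A small sketch with made-up data illustrates the difference: after filtering, iloc counts positions in the remaining rows, while loc still uses the original labels.

```python
import pandas as pd

df = pd.DataFrame({"x": range(5)})     # index labels 0..4
subset = df[df["x"] % 2 == 0]          # keeps labels 0, 2, 4 -> only 3 rows

# iloc is purely positional: valid positions are now 0..2
print(subset.iloc[2]["x"])   # third remaining row, which carries label 4

# loc uses the original index labels, so label 4 is still reachable
print(subset.loc[4, "x"])    # 4

# subset.iloc[4] would raise IndexError: single positional indexer is out-of-bounds
```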

How can I fix type error for One hot encoder

My problem is that I need to change some sets of categorized columns into numbers for machine learning.
I don't want to use LabelEncoder because I heard it's not as effective as OneHotEncoder.
So I used this code:
X = df.drop("SalePrice", axis=1)
y = df['SalePrice']
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot", one_hot, categorical_features)], remainder="passthrough")
transformed_X = transformer.fit_transform(df)
where categorical_features is the list of columns I want to apply the OneHotEncoder to.
But I get a multiple line error as an output with the overall problem stating:
TypeError: Encoders require their input to be uniformly strings or numbers. Got ['float', 'str']
Someone with a similar issue was asked to clean their data to remove NaN values; I have already done that, with no change. I have also been asked to change the datatypes of my columns to strings, and I wrote a loop to do that.
This error is pretty self-explanatory: you cannot mix str and float values in a column you pass to the encoder.
Make sure that all your columns share the same type too.
You can try the following to force everything to be a string:
for e in categorical_features:
df[e]=df[e].astype(str)
Or maybe you have another issue with your data if everything 'should' be float; in that case, use something like str.isnumeric to find the offending values.
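A minimal sketch of that loop, using a made-up DataFrame whose "categorical" column mixes strings and floats (column names here are hypothetical):

```python
import pandas as pd

# Hypothetical frame: the 'quality' column mixes str and float values,
# which is exactly what triggers the encoder's TypeError
df = pd.DataFrame({"quality": ["good", 3.0, "bad"],
                   "SalePrice": [100, 200, 300]})
categorical_features = ["quality"]

# Force every categorical column to a single dtype before encoding
for e in categorical_features:
    df[e] = df[e].astype(str)

print(df["quality"].tolist())  # ['good', '3.0', 'bad'] - now uniformly str
```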

List index out of range when specifying column in numpy

I have been tasked with extracting data from a specific column of a csv file using numpy and loadtxt. This data is in column D of the attached image. By my logic, I should use the numpy parameter usecols=3 to obtain only the 4th column, which is the one I want. But my output keeps telling me that the index is out of range, when there is clearly a column there. From prior searching, the general consensus seems to be that one of the rows doesn't have any data in that column, but I have checked and all the rows do. Here is the code I'm using. Can anyone tell me why this is happening?
data = open("suttonboningtondata_moodle.csv","r")
min_temp = loadtxt(data,usecols=(3),skiprows=5,dtype=str,delimiter=" ")
print(min_temp)
I suggest you use another library to extract your data. The pandas library works well in this regard.
Here is a documentation link to guide you:
pandas docs
I used a comma instead of whitespace for the delimiter value and it worked. I had no idea why at first, but the reason is simple: a .csv file is comma-separated, so with delimiter=" " each line is treated as a single field and column index 3 never exists.
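A small sketch of the fix, using a made-up comma-separated snippet (the column names and values are illustrative, not the actual file):

```python
import numpy as np
from io import StringIO

# Stand-in for the CSV: comma-separated, with one header row to skip
csv_text = "station,year,month,tmin\nSutton,2020,1,2.3\nSutton,2020,2,3.1\n"

# With delimiter="," each line splits into 4 fields, so usecols=3 exists;
# with delimiter=" " each line would be one field and column 3 would be out of range
min_temp = np.loadtxt(StringIO(csv_text), usecols=3, skiprows=1,
                      dtype=str, delimiter=",")
print(min_temp)  # ['2.3' '3.1']
```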

Showing integer columns as categorical and throwing error in sweetviz compare

If I analyze these two datasets individually, I don't get any error, and I also get the viz of all the integer columns.
But when I try to compare these dataframe, I get the below error.
Cannot convert series 'Web Visit' in COMPARED from its TYPE_CATEGORICAL
to the desired type TYPE_BOOL.
I also tried the FeatureConfig to skip it, but to no avail.
pid_compare = sweetviz.compare([pdf,"234_7551009"],[pdf_2,"215_220941058"])
Maintainer of the lib here; this question was also asked on GitHub, but it will be useful to detail the answer here.
After looking at the data provided in the link above, it looks like the first dataframe (pdf) only contains 0 & 1 in that column, so it is classified as boolean and cannot be compared against the second one, which is categorical (that one has 0, 1, 2, 3, as you probably know!).
The system will be able to handle it if you use FeatureConfig to force the first dataframe to be considered CATEGORICAL.
I just tried the following and it seems to work, let me know if it helps!
feature_config = sweetviz.FeatureConfig(force_cat = ["Web Desktop Interaction"])
report = sweetviz.compare(pdf, pdf_2, None, feature_config)

Reshaping pandas data frame - unique row error

I have a data frame as the following;
I am trying to use the reshape function from the pandas package, and it keeps giving me the error
" the id variables need to uniquely identify each row".
This is my code to reshape:
link to the data: https://pastebin.com/GzujhX3d
GG_long=pd.wide_to_long(data_GG,stubnames='time_',i=['Customer', 'date'], j='Cons')
The combination of 'Customer' and 'Date' is unique per row within my data, so I don't understand why it throws this error or how I can fix it. Any help is appreciated.
I could identify the issue: the error was due to two things. First, the column names had ":" in them; second, the format of the date mattered - for some reason it doesn't like dd-mm-yy, but it works with dd/mm/yy.
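For reference, here is a minimal sketch of wide_to_long succeeding when the id columns really are unique; the data is made up to mimic the shape (Customer/date ids, time_1/time_2 value columns), not taken from the pastebin:

```python
import pandas as pd

# Hypothetical wide frame: one row per (Customer, date) pair
data_GG = pd.DataFrame({
    "Customer": ["A", "A", "B"],
    "date":     ["01/01/20", "02/01/20", "01/01/20"],
    "time_1":   [10, 20, 30],
    "time_2":   [11, 21, 31],
})

# Works because (Customer, date) uniquely identifies each row; duplicated
# pairs would raise "the id variables need to uniquely identify each row"
GG_long = pd.wide_to_long(data_GG, stubnames="time_",
                          i=["Customer", "date"], j="Cons")
print(GG_long)  # 6 rows, indexed by (Customer, date, Cons)
```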
