I am building an XGBoost model. I did my train-test split on a dataframe with 91 columns. I want to use my model on a new dataframe which has different columns than my training set. I have removed the extra columns and added the ones which were present in the train dataset but not in the new one.
However, I cannot use the model because the new set does not have the same number of columns, yet when I compute the list of column differences between the two sets, the list is empty.
Do you have an idea of how I could correct this problem?
Thanks in advance for your time !
You can check it like this:
import pandas as pd
X_PAU = pd.DataFrame({'test1': ['A', 'A'], 'test2': [0, 0]})
print(len(X_PAU.columns))
X = pd.DataFrame({'test1': ['A', 'A']})
print(len(X.columns))
# Your implementation
print(set(X.columns) - set(X_PAU.columns))  # this should be an empty set
#
print(X_PAU.columns.difference(X.columns).tolist())  # this will print the missing column names
print(len(X_PAU.columns.difference(X.columns).tolist()))  # this will print the number of missing columns
Output:
2
1
set()
['test2']
1
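If the goal is to make the new dataframe usable with the trained model, one option (a sketch, not your exact frames; here X stands in for the training features and X_new for the new data) is to reindex the new frame against the saved training columns. This also enforces the same column order, which XGBoost cares about:
import pandas as pd
X = pd.DataFrame({'test1': ['A', 'A'], 'test2': [0, 0]})   # training features
X_new = pd.DataFrame({'test2': [1, 2], 'extra': [9, 9]})   # new data with different columns
# keep exactly the training columns, in the training order;
# columns missing from the new data are created and filled with 0
X_new_aligned = X_new.reindex(columns=X.columns, fill_value=0)
print(X_new_aligned.columns.tolist())   # ['test1', 'test2']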
I'm sorry if my title is confusing, but I wasn't sure how to describe the situation I'm trying to understand. Basically, I stumbled upon this question while working with the train_test_split procedure from the sklearn module.
So let me show you an example of what has been confusing me for a couple of hours already.
Let's create a simple dataframe with 3 columns:
'Letter' - a letter of the alphabet;
'Number' - the serial number of the letter;
'Type' - the type (odd or even) of the number.
import pandas as pd
data = [['A', 1, 'Odd'], ['B', 2, 'Even'], ['C', 3, 'Odd'],
        ['D', 4, 'Even'], ['E', 5, 'Odd'], ['F', 6, 'Even'], ['G', 7, 'Odd']]
df = pd.DataFrame(data, columns=['Letter', 'Number', 'Type'])
We can create 4 samples to work with using train_test_split:
from sklearn.model_selection import train_test_split
target = df['Type']
features = df.drop('Type', axis=1)
features_train, features_valid, target_train, target_valid = train_test_split(
    features, target, test_size=0.4, random_state=12)
And now if we want to see the rows of features_train with the odd numbers we can write the following code:
features_odds = features_train[target_train == 'Odd']
features_odds
And we get this:
[output screenshot omitted]
And there we have it: the new dataframe contains exactly the rows with the odd numbers.
How does that work, when features_train can get the info from target_train even though those are two separate dataframes?
I think there should be an easy answer but for some reason I'm not able to understand the mechanics of this right now.
I have also tried a different approach (not using train_test_split) and it works just as well:
target_dummy = df['Type']
features_dummy = df.drop('Type', axis=1)
features_dumb_odds = features_dummy[target_dummy == 'Odd']
features_dumb_odds
Would appreciate any help in understanding it a lot!
target_train == 'Odd' is a Series of boolean values. As a Series, it also has an index. That index is used to align with features_train that you index into, and it's compatible.
As a first step of exploration, start with print(target_train == 'Odd')
It's good to think about how the pieces fit together. In this case, the boolean Series and the object you index into need to have matching index labels, otherwise pandas raises an exception.
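To make the alignment visible, here is a small sketch reusing the question's data; the boolean mask keeps the index labels of the training split, and pandas matches them against features_train by label, not by position:
import pandas as pd
from sklearn.model_selection import train_test_split
data = [['A', 1, 'Odd'], ['B', 2, 'Even'], ['C', 3, 'Odd'],
        ['D', 4, 'Even'], ['E', 5, 'Odd'], ['F', 6, 'Even'], ['G', 7, 'Odd']]
df = pd.DataFrame(data, columns=['Letter', 'Number', 'Type'])
target = df['Type']
features = df.drop('Type', axis=1)
features_train, features_valid, target_train, target_valid = train_test_split(
    features, target, test_size=0.4, random_state=12)
mask = target_train == 'Odd'
print(mask)                  # boolean Series carrying the original row labels
print(features_train[mask])  # rows are selected by matching those labels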
I am doing it this way:
1- dropping the columns from the main dataframe that don't need feature scaling
2- the obtained dataframe now only has columns that require feature scaling
3- concatenating the dropped columns with the scaled columns to get the final dataframe
but
I want to do it without dropping any columns, by using a command that scales the first 14 columns while the others are preserved in the output dataframe.
Look into DataFrame.apply(). Setting the axis parameter to 1 will apply the function to each row; inside that function you can filter so that only the columns you want to scale are changed.
For example:
import pandas as pd
def scaling_function(x, col_to_scale):
    for col in x.index:
        if col in col_to_scale:
            # your scaling operation here
            x[col] = x[col] * 2
    return x
df = pd.DataFrame([[4, 9, 2]] * 3, columns=['A', 'B', 'C'])
col_to_scale = ['A', 'B']
scaled_df = df.apply(lambda x: scaling_function(x, col_to_scale), axis=1)
This will double the values in columns A and B while leaving C as is.
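If the question is literally "scale the first 14 columns and keep the rest untouched", a positional slice with iloc may be simpler. This is a sketch that assumes a scikit-learn MinMaxScaler, but any scaler with fit_transform would do:
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
df = pd.DataFrame(np.random.rand(5, 16), columns=[f'col{i}' for i in range(16)])
scaler = MinMaxScaler()
# overwrite only the first 14 columns; the remaining columns stay as they are
df.iloc[:, :14] = scaler.fit_transform(df.iloc[:, :14])
print(df.head())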
I'm fairly new to Python and am working with large dataframes of upwards of 40 million rows. I would like to be able to add another 'label' column based on the value of another column.
If I have a pandas dataframe (much smaller here for detailing the problem):
import pandas as pd
import numpy as np
# using random values (as my data is not sorted)
my_df = pd.DataFrame(np.random.randint(0, 100, 1000), columns=['col1'])
I then have another dictionary containing ranges associated with a specific label, similar to something like:
my_label_dict ={}
my_label_dict['label1'] = np.array([[0,10],[30,40],[50,55]])
my_label_dict['label2'] = np.array([[11,15],[45,50]])
Any data in my_df should be 'label1' if it is between 0 and 10, or 30 and 40, or 50 and 55,
and any data should be 'label2' if it is between 11 and 15, or 45 and 50.
I have only managed to isolate data based on the labels and retrieve an index through something like:
idx_save = np.full(len(my_df['col1']), False, dtype=bool)
for rng in my_label_dict['label1']:
    idx_temp = np.logical_and(my_df['col1'] > rng[0], my_df['col1'] < rng[1])
    idx_save = idx_save | idx_temp
and then use this index to access the label1 values from my_df, and then repeat for label2.
Ideally I would like to add another column on my_label_dict named 'labels' which would add 'label1' for all datapoints that satisfy the given ranges etc. Or just a quick method to retrieve all values from the dataframe that satisfy the ranges in the labels.
I'm new to generator functions and haven't completely gotten my head around them, but maybe they could be used here?
Thanks for any help!!
You can do the task in a more "pandasonic" way.
Start by creating a Series, named labels, initially filled with empty strings:
labels = pd.Series([''] * 100).rename('label')
The length is 100, just as the upper limit of your values.
Then fill it with proper labels:
for key, val in my_label_dict.items():
    for v in val:
        labels[v[0]:v[1]+1] = key
And the only thing to do is to merge your DataFrame with labels:
my_df = my_df.merge(labels, how='left', left_on='col1', right_index=True)
I also noticed a contradiction in my_label_dict:
you have label1 for range between 50 and 55 (I assume inclusive),
you have also label2 for range between 45 and 50,
so for value of 50 you have two definitions.
My program acts on the "last decision takes precedence" principle, so the label
for 50 is label2. Maybe you should change one of these range borders?
Edit
A modified solution if the upper limit of col1 is "unpredictable":
Define labels the following way:
import itertools
rngMax = max(np.array(list(itertools.chain.from_iterable(
    my_label_dict.values())))[:, 1])
labels = pd.Series([np.nan] * (rngMax + 1)).rename('label')
for key, val in my_label_dict.items():
    for v in val:
        labels[v[0]:v[1]+1] = key
labels.dropna(inplace=True)
Add .fillna('') to my_df.merge(...).
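In other words, the final merge line from the original solution would become:
my_df = my_df.merge(labels, how='left', left_on='col1', right_index=True).fillna('')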
Here is a solution that would also work for float ranges, where you can't create a mapping for all possible values. This solution requires re-sorting your dataframes.
# build a dataframe you can join and sort it for the from-field
join_df=pd.DataFrame({
'from': [ 0, 30, 50, 11, 45],
'to': [10, 40, 55, 15, 50],
'label': ['label1', 'label1', 'label1', 'label2', 'label2']
})
join_df.sort_values('from', axis='index', inplace=True)
# calculate the maximum range length (but you could alternatively set it to any value larger than your largest range as well)
max_tolerance=(join_df['to'] - join_df['from']).max()
# sort your value dataframe for the column to join on and do the join
my_df.sort_values('col1', axis='index', inplace=True)
result= pd.merge_asof(my_df, join_df, left_on='col1', right_on='from', direction='backward', tolerance=max_tolerance)
# now you just have to remove the labels for the rows where the value passed the end of the range, and drop the two range columns
result.loc[result['to'] < result['col1'], 'label'] = np.nan
result.drop(['from', 'to'], axis='columns', inplace=True)
The merge_asof(..., direction='backward', ...) joins, for each row in my_df, the row in join_df with the maximum value in from that still satisfies from <= col1. It doesn't look at the to column at all. This is why we remove the labels where the to boundary is violated, by assigning np.nan in the line with the .loc.
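As a minimal illustration of that behaviour (toy data, not the question's frames): the backward join picks the closest from that is below or equal to col1, and the label is cleared afterwards wherever col1 overshoots to:
import numpy as np
import pandas as pd
values = pd.DataFrame({'col1': [5, 12, 47, 60]})          # already sorted on col1
ranges = pd.DataFrame({'from':  [0, 11, 30, 45, 50],
                       'to':    [10, 15, 40, 50, 55],
                       'label': ['label1', 'label2', 'label1', 'label2', 'label1']})
out = pd.merge_asof(values, ranges, left_on='col1', right_on='from', direction='backward')
out.loc[out['to'] < out['col1'], 'label'] = np.nan        # col1=60 falls past to=55, so its label is dropped
print(out)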
I have a problem where I produce a pandas dataframe by concatenating along the row axis (stacking vertically).
Each of the constituent dataframes has an autogenerated index (ascending numbers).
After concatenation, my index is screwed up: it counts up to n (where n is the shape[0] of the corresponding dataframe), and restarts at zero at the next dataframe.
I am trying to "re-calculate the index, given the current order", or "re-index" (or so I thought). Turns out that isn't exactly what DataFrame.reindex seems to be doing.
Here is what I tried to do:
train_df = pd.concat(train_class_df_list)
train_df = train_df.reindex(index=[i for i in range(train_df.shape[0])])
It failed with "cannot reindex from a duplicate axis." I don't want to change the order of my data... just need to delete the old index and set up a new one, with the order of rows preserved.
If your index is autogenerated and you don't want to keep it, you can use the ignore_index option.
train_df = pd.concat(train_class_df_list, ignore_index=True)
This will autogenerate a new index for you, and my guess is that this is exactly what you are after.
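A quick sketch of the difference, on toy frames rather than your train_class_df_list:
import pandas as pd
parts = [pd.DataFrame({'a': [1, 2]}), pd.DataFrame({'a': [3, 4]})]
print(pd.concat(parts).index.tolist())                      # [0, 1, 0, 1] -- duplicated labels
print(pd.concat(parts, ignore_index=True).index.tolist())   # [0, 1, 2, 3] -- fresh RangeIndex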
After vertical concatenation, if you get an index of [0, n) followed by [0, m), all you need to do is call reset_index:
train_df.reset_index(drop=True)
(you can do this in place using inplace=True).
>>> import pandas as pd
>>> pd.concat([
...     pd.DataFrame({'a': [1, 2]}),
...     pd.DataFrame({'a': [1, 2]})]).reset_index(drop=True)
a
0 1
1 2
2 1
3 2
This should work:
train_df.reset_index(inplace=True, drop=True)
Set drop to True to avoid an additional column in your dataframe.
I have a data frame called active and it has 10 unique POS column values.
Then I group by POS values and mean-normalize the OPW column, and store the normalized values as a separate column ['resid'].
If I group by POS values, shouldn't the new active dataframe's POS column contain only unique POS values?
For example:
df2 = pd.DataFrame({'X' : ['B', 'B', 'A', 'A'], 'Y' : [1, 2, 3, 4]})
print(df2)
df2.groupby(['X']).sum()
I get an output like this:
   Y
X
A  7
B  3
In my example, shouldn't I get a column with only unique POS values, as shown below?
POS    Other Columns
RF     values
2B     values
LF     values
2B     values
OF     values
I can't be 100% sure without the actual data, but I'm pretty sure that the problem here is that you are not aggregating the data.
Let's go through the groupby step by step.
When you do active.groupby('POS'), what's actually happening is that you are slicing the dataframe per each unique POS, and passing each of these slices, sequentially, to the applied function.
You can get a better view of what's happening by using get_group (e.g. active.groupby('POS').get_group('RF')).
So you're applying your meanNormalizeOPW function to each of those slices. That function creates a mean-normalized 'resid' value for each row of the passed dataframe, and you return that dataframe, ending up with a shape similar to what was passed in.
So if you just add an aggregation function to the returned df, it should work fine. I guess here you want a mean, so just change return df into return df.mean()
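To illustrate the difference, here is a small sketch with a made-up normalization function (the real meanNormalizeOPW isn't shown in the question), comparing returning the whole slice versus returning an aggregate:
import pandas as pd
df = pd.DataFrame({'POS': ['RF', 'RF', '2B', '2B'],
                   'OPW': [10.0, 14.0, 6.0, 8.0]})
def normalize_slice(g):
    g = g.copy()
    g['resid'] = g['OPW'] - g['OPW'].mean()
    return g                               # same shape as the slice -> POS values repeat
def normalize_agg(g):
    g = g.copy()
    g['resid'] = g['OPW'] - g['OPW'].mean()
    return g.mean(numeric_only=True)       # one row per group -> unique POS in the index
print(df.groupby('POS').apply(normalize_slice))
print(df.groupby('POS').apply(normalize_agg))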