I am trying to do some feature selection using mutual_info_regression with the SelectKBest wrapper. However, I keep running into an error saying that my list of features needs to be reshaped into a 2D array, and I'm not sure why I keep getting this message:
#feature selection before linear regression benchmark test
import sklearn
from sklearn.feature_selection import mutual_info_regression, SelectKBest
features = list(housing_data[housing_data.columns.difference(['sale_price'])])
target = 'sale_price'
new = SelectKBest(mutual_info_regression, k=20).fit_transform(features, target)
This is my traceback:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-18-8c778124066c> in <module>()
3 features = list(housing_data[housing_data.columns.difference(['sale_price'])])
4 target = 'sale_price'
----> 5 new = SelectKBest(mutual_info_regression, k=20).fit_transform(features, target)
/usr/local/lib/python3.6/dist-packages/sklearn/base.py in fit_transform(self, X, y, **fit_params)
463 else:
464 # fit method of arity 2 (supervised transformation)
--> 465 return self.fit(X, y, **fit_params).transform(X)
466
467
/usr/local/lib/python3.6/dist-packages/sklearn/feature_selection/univariate_selection.py in fit(self, X, y)
339 self : object
340 """
--> 341 X, y = check_X_y(X, y, ['csr', 'csc'], multi_output=True)
342
343 if not callable(self.score_func):
/usr/local/lib/python3.6/dist-packages/sklearn/utils/validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, warn_on_dtype, estimator)
754 ensure_min_features=ensure_min_features,
755 warn_on_dtype=warn_on_dtype,
--> 756 estimator=estimator)
757 if multi_output:
758 y = check_array(y, 'csr', force_all_finite=True, ensure_2d=False,
/usr/local/lib/python3.6/dist-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
550 "Reshape your data either using array.reshape(-1, 1) if "
551 "your data has a single feature or array.reshape(1, -1) "
--> 552 "if it contains a single sample.".format(array))
553
554 # in the future np.flexible dtypes will be handled like object dtypes
ValueError: Expected 2D array, got 1D array instead:
array=['APPBBL' 'APPDate' 'Address' 'AreaSource' 'AssessLand' 'AssessTot' 'BBL'
'BldgArea' 'BldgClass' 'BldgDepth' 'BldgFront' 'BoroCode' 'Borough'
'BsmtCode' 'BuiltFAR' 'CB2010' 'CD' 'CT2010' 'ComArea' 'CommFAR'
'CondoNo' 'Council' 'EDesigNum' 'Easements' 'ExemptLand' 'ExemptTot'
'Ext' 'FIRM07_FLA' 'FacilFAR' 'FactryArea' 'FireComp' 'GarageArea'
'HealthArea' 'HealthCent' 'HistDist' 'IrrLotCode' 'LandUse' 'Landmark'
'LotArea' 'LotDepth' 'LotFront' 'LotType' 'LtdHeight' 'MAPPLUTO_F'
'NumBldgs' 'NumFloors' 'OfficeArea' 'OtherArea' 'Overlay1' 'Overlay2'
'OwnerName' 'OwnerType' 'PFIRM15_FL' 'PLUTOMapID' 'PolicePrct' 'ProxCode'
'ResArea' 'ResidFAR' 'RetailArea' 'SHAPE_Area' 'SHAPE_Leng' 'SPDist1'
'SPDist2' 'SPDist3' 'Sanborn' 'SanitBoro' 'SanitDistr' 'SanitSub'
'SchoolDist' 'SplitZone' 'StrgeArea' 'TaxMap' 'Tract2010' 'UnitsRes'
'UnitsTotal' 'Unnamed: 0' 'Version' 'XCoord' 'YCoord' 'YearAlter1'
'YearAlter2' 'YearBuilt' 'ZMCode' 'ZipCode' 'ZoneDist1' 'ZoneDist2'
'ZoneDist3' 'ZoneDist4' 'ZoneMap' 'address' 'apartment_number' 'block'
'borough' 'building_class' 'building_class_at_sale'
'building_class_category' 'commercial_units' 'easement' 'gross_sqft'
'land_sqft' 'lot' 'neighborhood' 'price_range' 'residential_units'
'sale_date' 'tax_class' 'tax_class_at_sale' 'total_units' 'year_built'
'year_of_sale' 'zip_code'].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
Here is a sample of my data:
housing_data = pd.DataFrame({'Unnamed: 0': {0: 1, 1: 2, 2: 3, 3: 4}, 'borough': {0: 3, 1: 3, 2: 3, 3: 3}, 'neighborhood': {0: 'DOWNTOWN-METROTECH', 1: 'DOWNTOWN-FULTON FERRY', 2: 'BROOKLYN HEIGHTS', 3: 'MILL BASIN'}, 'building_class_category': {0: '28 COMMERCIAL CONDOS', 1: '29 COMMERCIAL GARAGES', 2: '21 OFFICE BUILDINGS', 3: '22 STORE BUILDINGS'}, 'tax_class': {0: '4', 1: '4', 2: '4', 3: '4'}, 'block': {0: 140, 1: 54, 2: 204, 3: 8470}, 'lot': {0: 1001, 1: 1, 2: 1, 3: 55}, 'easement': {0: nan, 1: nan, 2: nan, 3: nan}, 'building_class': {0: 'R5', 1: 'G7', 2: 'O6', 3: 'K6'}, 'address': {0: '330 JAY STREET', 1: '85 JAY STREET', 2: '29 COLUMBIA HEIGHTS', 3: '5120 AVENUE U'}, 'apartment_number': {0: 'COURT', 1: nan, 2: nan, 3: nan}, 'zip_code': {0: 11201, 1: 11201, 2: 11201, 3: 11234}, 'residential_units': {0: 0, 1: 0, 2: 0, 3: 0}, 'commercial_units': {0: 1, 1: 0, 2: 0, 3: 123}, 'total_units': {0: 1, 1: 0, 2: 0, 3: 123}, 'land_sqft': {0: 0.0, 1: 134988.0, 2: 32000.0, 3: 905000.0}, 'gross_sqft': {0: 0.0, 1: 0.0, 2: 304650.0, 3: 2548000.0}, 'year_built': {0: 2002, 1: 0, 2: 1924, 3: 1970}, 'tax_class_at_sale': {0: 4, 1: 4, 2: 4, 3: 4}, 'building_class_at_sale': {0: 'R5', 1: 'G7', 2: 'O6', 3: 'K6'}, 'sale_price': {0: 499401179.0, 1: 345000000.0, 2: 340000000.0, 3: 276947000.0}, 'sale_date': {0: '2008-04-23', 1: '2016-12-20', 2: '2016-08-03', 3: '2012-11-28'}, 'year_of_sale': {0: 2008, 1: 2016, 2: 2016, 3: 2012}, 'Borough': {0: nan, 1: 'BK', 2: 'BK', 3: 'BK'}, 'CD': {0: nan, 1: 302.0, 2: 302.0, 3: 318.0}, 'CT2010': {0: nan, 1: 21.0, 2: 1.0, 3: 698.0}, 'CB2010': {0: nan, 1: 3017.0, 2: 1003.0, 3: 2005.0}, 'SchoolDist': {0: nan, 1: 13.0, 2: 13.0, 3: 22.0}, 'Council': {0: nan, 1: 33.0, 2: 33.0, 3: 46.0}, 'ZipCode': {0: nan, 1: 11201.0, 2: 11201.0, 3: 11234.0}, 'FireComp': {0: nan, 1: 'L118', 2: 'E205', 3: 'E323'}, 'PolicePrct': {0: nan, 1: 84.0, 2: 84.0, 3: 63.0}, 'HealthCent': {0: nan, 1: 36.0, 2: 38.0, 3: 35.0}, 'HealthArea': {0: nan, 1: 1000.0, 2: 2300.0, 3: 8822.0}, 
'SanitBoro': {0: nan, 1: 3.0, 2: 3.0, 3: 3.0}, 'SanitDistr': {0: nan, 1: 2.0, 2: 2.0, 3: 18.0}, 'SanitSub': {0: nan, 1: '1B', 2: '1A', 3: '4E'}, 'Address': {0: nan, 1: '87 JAY STREET', 2: '29 COLUMBIA HEIGHTS', 3: '5120 AVENUE U'}, 'ZoneDist1': {0: nan, 1: 'M1-2/R8', 2: 'M2-1', 3: 'M3-1'}, 'ZoneDist2': {0: nan, 1: nan, 2: nan, 3: nan}, 'ZoneDist3': {0: nan, 1: nan, 2: nan, 3: nan}, 'ZoneDist4': {0: nan, 1: nan, 2: nan, 3: nan}, 'Overlay1': {0: nan, 1: nan, 2: nan, 3: nan}, 'Overlay2': {0: nan, 1: nan, 2: nan, 3: nan}, 'SPDist1': {0: nan, 1: 'MX-2', 2: nan, 3: nan}, 'SPDist2': {0: nan, 1: nan, 2: nan, 3: nan}, 'SPDist3': {0: nan, 1: nan, 2: nan, 3: nan}, 'LtdHeight': {0: nan, 1: nan, 2: nan, 3: nan}, 'SplitZone': {0: nan, 1: 'N', 2: 'N', 3: 'N'}, 'BldgClass': {0: nan, 1: 'G7', 2: 'O6', 3: 'K6'}, 'LandUse': {0: nan, 1: 10.0, 2: 5.0, 3: 5.0}, 'Easements': {0: nan, 1: 0.0, 2: 0.0, 3: 1.0}, 'OwnerType': {0: nan, 1: 'P', 2: nan, 3: nan}, 'OwnerName': {0: nan, 1: '85 JAY STREET BROOKLY', 2: '25-30 COLUMBIA HEIGHT', 3: 'BROOKLYN KINGS PLAZA'}, 'LotArea': {0: nan, 1: 134988.0, 2: 32000.0, 3: 905000.0}, 'BldgArea': {0: nan, 1: 0.0, 2: 304650.0, 3: 2548000.0}, 'ComArea': {0: nan, 1: 0.0, 2: 304650.0, 3: 2548000.0}, 'ResArea': {0: nan, 1: 0.0, 2: 0.0, 3: 0.0}, 'OfficeArea': {0: nan, 1: 0.0, 2: 264750.0, 3: 0.0}, 'RetailArea': {0: nan, 1: 0.0, 2: 0.0, 3: 1263000.0}, 'GarageArea': {0: nan, 1: 0.0, 2: 0.0, 3: 1285000.0}, 'StrgeArea': {0: nan, 1: 0.0, 2: 0.0, 3: 0.0}, 'FactryArea': {0: nan, 1: 0.0, 2: 0.0, 3: 0.0}, 'OtherArea': {0: nan, 1: 0.0, 2: 39900.0, 3: 0.0}, 'AreaSource': {0: nan, 1: 7.0, 2: 2.0, 3: 2.0}, 'NumBldgs': {0: nan, 1: 0.0, 2: 1.0, 3: 4.0}, 'NumFloors': {0: nan, 1: 0.0, 2: 13.0, 3: 2.0}, 'UnitsRes': {0: nan, 1: 0.0, 2: 0.0, 3: 0.0}, 'UnitsTotal': {0: nan, 1: 0.0, 2: 0.0, 3: 123.0}, 'LotFront': {0: nan, 1: 490.5, 2: 92.42, 3: 930.0}, 'LotDepth': {0: nan, 1: 275.33, 2: 335.92, 3: 859.0}, 'BldgFront': {0: nan, 1: 0.0, 2: 335.0, 3: 0.0}, 'BldgDepth': {0: nan, 1: 0.0, 
2: 92.0, 3: 0.0}, 'Ext': {0: nan, 1: nan, 2: nan, 3: nan}, 'ProxCode': {0: nan, 1: 0.0, 2: 0.0, 3: 0.0}, 'IrrLotCode': {0: nan, 1: 'N', 2: 'Y', 3: 'Y'}, 'LotType': {0: nan, 1: 5.0, 2: 3.0, 3: 3.0}, 'BsmtCode': {0: nan, 1: 5.0, 2: 5.0, 3: 5.0}, 'AssessLand': {0: nan, 1: 1571850.0, 2: 1548000.0, 3: 36532350.0}, 'AssessTot': {0: nan, 1: 1571850.0, 2: 25463250.0, 3: 149792400.0}, 'ExemptLand': {0: nan, 1: 1571850.0, 2: 0.0, 3: 0.0}, 'ExemptTot': {0: nan, 1: 1571850.0, 2: 0.0, 3: 0.0}, 'YearBuilt': {0: nan, 1: 0.0, 2: 1924.0, 3: 1970.0}, 'YearAlter1': {0: nan, 1: 0.0, 2: 1980.0, 3: 0.0}, 'YearAlter2': {0: nan, 1: 0.0, 2: 0.0, 3: 0.0}, 'HistDist': {0: nan, 1: nan, 2: nan, 3: nan}, 'Landmark': {0: nan, 1: nan, 2: nan, 3: nan}, 'BuiltFAR': {0: nan, 1: 0.0, 2: 9.52, 3: 2.82}, 'ResidFAR': {0: nan, 1: 7.2, 2: 0.0, 3: 0.0}, 'CommFAR': {0: nan, 1: 2.0, 2: 2.0, 3: 2.0}, 'FacilFAR': {0: nan, 1: 6.5, 2: 0.0, 3: 0.0}, 'BoroCode': {0: nan, 1: 3.0, 2: 3.0, 3: 3.0}, 'BBL': {0: nan, 1: 3000540001.0, 2: 3002040001.0, 3: 3084700055.0}, 'CondoNo': {0: nan, 1: 0.0, 2: 0.0, 3: 0.0}, 'Tract2010': {0: nan, 1: 21.0, 2: 1.0, 3: 698.0}, 'XCoord': {0: nan, 1: 988208.0, 2: 985952.0, 3: 1006597.0}, 'YCoord': {0: nan, 1: 195011.0, 2: 195007.0, 3: 161424.0}, 'ZoneMap': {0: nan, 1: '12d', 2: '12d', 3: '23b'}, 'ZMCode': {0: nan, 1: nan, 2: nan, 3: nan}, 'Sanborn': {0: nan, 1: '302 016', 2: '302 004', 3: '319 077'}, 'TaxMap': {0: nan, 1: 30101.0, 2: 30106.0, 3: 32502.0}, 'EDesigNum': {0: nan, 1: nan, 2: nan, 3: nan}, 'APPBBL': {0: nan, 1: 3000540001.0, 2: 0.0, 3: 0.0}, 'APPDate': {0: nan, 1: '12/06/2002', 2: nan, 3: nan}, 'PLUTOMapID': {0: nan, 1: 1.0, 2: 1.0, 3: 1.0}, 'FIRM07_FLA': {0: nan, 1: nan, 2: nan, 3: 1.0}, 'PFIRM15_FL': {0: nan, 1: nan, 2: nan, 3: 1.0}, 'Version': {0: nan, 1: '17V1.1', 2: '17V1.1', 3: '17V1.1'}, 'MAPPLUTO_F': {0: nan, 1: 0.0, 2: 0.0, 3: 0.0}, 'SHAPE_Leng': {0: nan, 1: 1559.88914353, 2: 890.718521021, 3: 3729.78685686}, 'SHAPE_Area': {0: nan, 1: 140131.577176, 2: 34656.4472405, 
3: 797554.847834}, 'price_range': {0: nan, 1: nan, 2: nan, 3: nan}})
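The transformer expects a 2D array of shape (n, m), where n is the number of samples and m is the number of features. Your features variable is a list of m column names, so when it is converted to an array I imagine its shape will be (m,), which is the 1D array the error is complaining about.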
Reshaping arrays
In general, for a feature array of shape (n,), you can do as the error message suggests and call .reshape(-1, 1) on it; the -1 lets NumPy infer that dimension from the length of the array. The resulting shape is (n, m), where m = 1 in the single-feature case.
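As a small illustration of the reshape described above, here is a sketch on a made-up NumPy array (the values and the name single_feature are hypothetical, not taken from the question's data):

```python
import numpy as np

# Hypothetical single feature observed for 4 samples: shape (4,)
single_feature = np.array([1.0, 5.0, 1.0, 10.0])

# -1 lets NumPy infer the number of rows; 1 is the number of features
as_column = single_feature.reshape(-1, 1)

# reshape(1, -1) would instead treat the values as one sample with 4 features
as_row = single_feature.reshape(1, -1)

print(single_feature.shape, as_column.shape, as_row.shape)
```

Note that reshaping alone would not fix the code in the question, because features there holds column names rather than the feature values themselves.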
Sklearn transformers
That said, I think there are additional errors in your code and in your understanding of the API.
I would print features and check that it is what you want; it looks like it is a list of all the column names except sale_price.
I am not familiar with SelectKBest, but it requires an (n, m) feature array, not a list of the feature column names.
Additionally, target should not be the name of the target column but an array of shape (n,) whose values are the observed target values of the training instances.
I would suggest checking the documentation (previously referenced) while you are writing your code, to make sure you are passing the correct arguments and using the function as intended.
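To make the points above concrete, here is a minimal sketch of passing values rather than names to SelectKBest. The frame demo, its columns f0 to f3, and the synthetic target are assumptions made up for illustration; only SelectKBest and mutual_info_regression come from the question.

```python
from functools import partial

import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, mutual_info_regression

# Hypothetical numeric data: 50 samples, 4 candidate features, 1 target
rng = np.random.RandomState(0)
demo = pd.DataFrame(rng.normal(size=(50, 4)), columns=['f0', 'f1', 'f2', 'f3'])
demo['sale_price'] = 3.0 * demo['f0'] + rng.normal(scale=0.1, size=50)

X = demo.drop('sale_price', axis=1)   # (n, m) feature values, not column names
y = demo['sale_price']                # (n,) observed target values

# Fix random_state so the mutual information estimate is repeatable
score_func = partial(mutual_info_regression, random_state=0)
selector = SelectKBest(score_func, k=2)
X_new = selector.fit_transform(X, y)

# get_support() returns a boolean mask over the input columns
selected = X.columns[selector.get_support()]
print(X_new.shape, list(selected))
```

On the real housing data this would still fail until the text columns and NaN values are dealt with, since the selector needs finite numeric input.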
Extracting features
Your data seems to be in a strange format (dictionaries nested in a pandas DataFrame). Here is an explicit example of how I would extract features from a pd.DataFrame for use with functions from the scikit-learn framework.
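import pandas as pd

housing_data = pd.DataFrame({'age': [1, 5, 1, 10],
                             'size': [0, 1, 2, 0],
                             'price': [190, 100, 50, 100]})
# 2D array of shape (n_samples, n_features)
feature_arr = housing_data.drop('price', axis=1).values
# 1D Series of length n_samples
target_values = housing_data['price']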
Print feature_arr and you will hopefully see your issue. Normally you would then have to preprocess the data to, for example, drop NaN values or perform feature scaling.
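As one possible sketch of that preprocessing step, using pandas only: the tiny frame below borrows a few values from the question's sample, and the choices made (keeping numeric columns, dropping all-NaN columns, filling the remaining gaps with the column median) are illustrative assumptions rather than a recommendation for this dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical frame mixing numeric, text, and missing values
raw = pd.DataFrame({
    'land_sqft': [0.0, 134988.0, 32000.0, 905000.0],
    'LotFront': [np.nan, 490.5, 92.42, 930.0],
    'easement': [np.nan, np.nan, np.nan, np.nan],
    'neighborhood': ['A', 'B', 'C', 'D'],
    'sale_price': [499401179.0, 345000000.0, 340000000.0, 276947000.0],
})

numeric = raw.select_dtypes(include='number')   # drop text columns
numeric = numeric.dropna(axis=1, how='all')     # drop columns that are entirely NaN
numeric = numeric.fillna(numeric.median())      # fill remaining gaps per column

feature_arr = numeric.drop('sale_price', axis=1).values
target_values = numeric['sale_price'].values
print(feature_arr.shape, target_values.shape)
```

Whether to impute, drop rows, or encode the text columns instead depends on the data, so treat this only as a starting point.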
I am trying to create a new column in a Pandas dataframe using multiple conditional statements based on other info within the dataframe. I have tried iterating using .iteritems(). This works, but seems inelegant and returns a notice that I don't know how to understand and/or correct.
My code snippet is:
proj_file_pq['pd_pq'] = 0
for key, value in proj_file_pq['pd_pq'].iteritems():
    if proj_file_pq['qualifying'].iloc[key] - \
            proj_file_pq['avg_pd'].iloc[key] < 1:
        proj_file_pq['pd_pq'].iloc[key] = \
            proj_file_pq['qualifying'].iloc[key] - 1
    elif proj_file_pq['qualifying'].iloc[key] > \
            proj_file_pq['avg_start'].iloc[key]:
        proj_file_pq['pd_pq'].iloc[key] = \
            proj_file_pq['qualifying'].iloc[key] - \
            proj_file_pq['avg_finish'].iloc[key]
    elif proj_file_pq['qualifying'].iloc[key] + \
            proj_file_pq['avg_pd'].iloc[key] > 40:
        proj_file_pq['pd_pq'].iloc[key] = \
            40 - proj_file_pq['qualifying'].iloc[key]
    else:
        proj_file_pq['pd_pq'].iloc[key] = proj_file_pq['avg_pd'].iloc[key]
print(proj_file_pq[['Driver', 'avg_start', 'avg_finish', 'qualifying',
                    'avg_pd', 'pd_pq']].head())
And here is the resulting output:
C:\Python36\lib\site-packages\pandas\core\indexing.py:189: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self._setitem_with_indexer(indexer, value)
Driver avg_start avg_finish qualifying avg_pd pd_pq
0 A.J. Allmendinger 18.000 21.875 16 3.875 3.875
1 Alex Bowman 14.500 18.000 8 3.500 3.500
2 Aric Almirola 21.250 19.250 13 -2.000 -2.000
3 Austin Dillon 18.875 18.375 17 -0.500 -0.500
4 B.J. McLeod 33.500 33.500 36 0.000 2.500
The original dataframe has the following head:
{'Driver': {0: 'A.J. Allmendinger', 1: 'Alex Bowman', 2: 'Aric Almirola', 3: 'Austin Dillon', 4: 'B.J. McLeod'}, 'qualifying': {0: 16, 1: 8, 2: 13, 3: 17, 4: 36}, 'races': {0: 8, 1: 6, 2: 8, 3: 8, 4: 2}, 'avg_start': {0: 18.0, 1: 14.5, 2: 21.25, 3: 18.875, 4: 33.5}, 'avg_finish': {0: 21.875, 1: 18.0, 2: 19.25, 3: 18.375, 4: 33.5}, 'avg_pd': {0: 3.875, 1: 3.5, 2: -2.0, 3: -0.5, 4: 0.0}, 'percent_fl': {0: 0.0036250647332988096, 1: 0.0071770334928229675, 2: 0.03655483224837256, 3: 0.006718346253229974, 4: 0.0}, 'percent_ll': {0: 0.0031071983428275505, 1: 0.001594896331738437, 2: 0.03505257886830245, 3: 0.006718346253229974, 4: 0.0}, 'percent_lc': {0: 0.9587884806355512, 1: 0.6226415094339622, 2: 0.9915590863952334, 3: 0.9607745779543198, 4: 0.2398212512413108}, 'finish_rank': {0: 25.0, 1: 17.0, 2: 20.5, 3: 19.0, 4: 35.0}, 'pd_rank': {0: 7.0, 1: 9.0, 2: 26.0, 3: 23.0, 4: 19.5}, 'fl_rank': {0: 28.0, 1: 21.0, 2: 8.0, 3: 22.0, 4: 35.0}, 'll_rank': {0: 19.0, 1: 24.0, 2: 6.0, 3: 16.0, 4: 31.0}, 'overall': {0: 79.0, 1: 71.0, 2: 60.5, 3: 80.0, 4: 120.5}, 'overall_rank': {0: 22.0, 1: 20.0, 2: 13.0, 3: 24.0, 4: 34.0}, 'pd_pts': {0: 3.875, 1: 3.5, 2: -2.0, 3: -0.5, 4: 0.0}, 'fl_pts': {0: 0.5455722423614707, 1: 1.0801435406698563, 2: 5.50150225338007, 3: 1.0111111111111108, 4: 0.0}, 'll_pts': {0: 0.2338166752977732, 1: 0.12001594896331738, 2: 2.6377065598397595, 3: 0.5055555555555555, 4: 0.0}, 'finish_pts': {0: 22.0, 1: 30.0, 2: 26.5, 3: 28.0, 4: 12.0}, 'total_pts': {0: 26.654388917659244, 1: 34.70015948963317, 2: 32.63920881321983, 3: 29.016666666666666, 4: 12.0}}
Advice on improving this is appreciated.
Set up your conditions:
c1 = (df.qualifying - df.avg_pd).lt(1)
c2 = (df.qualifying.gt(df.avg_start))
c3 = (df.qualifying.add(df.avg_pd).gt(40))
And your corresponding outputs:
o1 = df.qualifying.sub(1)
o2 = df.qualifying.sub(df.avg_finish)
o3 = 40 - df.qualifying
Using np.select:
df['pd_pq'] = np.select([c1, c2, c3], [o1, o2, o3], df.avg_pd)
Driver qualifying finish_pts total_pts pd_pq
0 A.J. Allmendinger 0.233817 ... 22.0 26.654389 3.875
1 Alex Bowman 0.120016 ... 30.0 34.700159 3.500
2 Aric Almirola 2.637707 ... 26.5 32.639209 -2.000
3 Austin Dillon 0.505556 ... 28.0 29.016667 -0.500
4 B.J. McLeod 0.000000 ... 12.0 12.000000 2.500
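For anyone who wants to run this end to end, here is a self-contained sketch of the same np.select approach. It rebuilds only the five rows and the columns the rules need from the question's sample; the names conditions and choices are just illustrative.

```python
import numpy as np
import pandas as pd

# The five rows shown in the question, limited to the columns the rules use
df = pd.DataFrame({
    'Driver': ['A.J. Allmendinger', 'Alex Bowman', 'Aric Almirola',
               'Austin Dillon', 'B.J. McLeod'],
    'qualifying': [16, 8, 13, 17, 36],
    'avg_start': [18.0, 14.5, 21.25, 18.875, 33.5],
    'avg_finish': [21.875, 18.0, 19.25, 18.375, 33.5],
    'avg_pd': [3.875, 3.5, -2.0, -0.5, 0.0],
})

conditions = [
    (df.qualifying - df.avg_pd) < 1,
    df.qualifying > df.avg_start,
    (df.qualifying + df.avg_pd) > 40,
]
choices = [
    df.qualifying - 1,
    df.qualifying - df.avg_finish,
    40 - df.qualifying,
]

# The first condition that is True wins; rows matching none get avg_pd
df['pd_pq'] = np.select(conditions, choices, default=df.avg_pd)
print(df[['Driver', 'qualifying', 'avg_pd', 'pd_pq']])
```

Because np.select checks the conditions in order, it mirrors the if/elif/else chain from the loop without iterating over rows.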
I didn't run this as I didn't have the test data, but this should work, presuming I was correct with my parentheses and you import numpy as np
import numpy as np
proj_file_pq['pd_pq'] = np.where(proj_file_pq['qualifying'] - proj_file_pq['avg_pd'] < 1, proj_file_pq['qualifying'] - 1,
    np.where(proj_file_pq['qualifying'] > proj_file_pq['avg_start'], proj_file_pq['qualifying'] - proj_file_pq['avg_finish'],
    np.where(proj_file_pq['qualifying'] + proj_file_pq['avg_pd'] > 40, 40 - proj_file_pq['qualifying'],
    proj_file_pq['avg_pd'])))
print(proj_file_pq[['Driver', 'avg_start', 'avg_finish', 'qualifying',\
'avg_pd', 'pd_pq']].head())
With this method you don't need to create proj_file_pq['pd_pq'] beforehand and set it to 0.
One heads-up I want to give you is about the warning: C:\Python36\lib\site-packages\pandas\core\indexing.py:189: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame
This usually happens for me when I've created multiple data frames without calling reset_index() at the end of the command that creates the data frame. You may want to add that when creating your table to see if it gets rid of the warning. I normally use reset_index(drop=True) when the frame already has an ID column, to avoid creating a redundant one.
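For what it's worth, the loop in the question assigns through proj_file_pq['pd_pq'].iloc[key] = ..., which is chained indexing, and the pandas page linked in the warning describes that pattern as a common trigger. Below is a minimal sketch of two usual remedies on a made-up frame (all_drivers and subset are hypothetical names): take an explicit .copy() when deriving one frame from another, and assign with a single .loc call.

```python
import pandas as pd

# Hypothetical parent frame and a filtered subset derived from it
all_drivers = pd.DataFrame({
    'qualifying': [16, 8, 36],
    'avg_pd': [3.875, 3.5, 0.0],
    'races': [8, 6, 2],
})

# .copy() makes the subset an independent frame; reset_index(drop=True)
# renumbers the rows 0..n-1 without keeping the old index as a column
subset = all_drivers[all_drivers['races'] > 5].copy().reset_index(drop=True)

subset['pd_pq'] = subset['avg_pd']
# One .loc call selects the rows and the column together, so the write
# goes into subset itself rather than into a temporary intermediate object
subset.loc[subset['qualifying'] > 10, 'pd_pq'] = 0.0

print(subset)
```

Whether this removes the warning in your case depends on how proj_file_pq was created, which the question does not show.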
I hope this helps clear that up!