'forward replace' with pandas - python

I have a df where every row looks something like this:
series = pd.Series({0: 1.0,
1: 1.0,
10: nan,
11: nan,
2: 1.0,
3: 1.0,
4: nan,
5: nan,
6: nan,
7: nan,
8: nan,
9: nan,
'B': 3.0,
'D': 1.0})
For every value in D I want series[series.D] to be 2 and, all numerical values that are higher than series.D, also to be 2. It's kinda like a 'forward replace'.
So what I want is:
target = pd.Series({0: 1.0,
1: 2.0,
10: nan,
11: nan,
2: 2.0,
3: 2.0,
4: nan,
5: nan,
6: nan,
7: nan,
8: nan,
9: nan,
'B': 3.0,
'D': 1.0})
So far I've got:
def forward_replace(series):
if pd.notnull(series['D']):
cols = [0,1,2,3,4,5,6,7,8,9,10,11]
target_cols = [x for x in cols if x > series.D]
series.loc[target_cols].replace({1:2}, inplace=True)
return series
It seems like it's not possible to use label based indexing with numerical column labels?

Related

Explode data in a list not separated by comma

Let's say have the following database:
{'docdb_family_id': {0: 569328,
1: 574660,
2: 1187498,
3: 1226468,
4: 1236571,
5: 1239098,
6: 1239277,
7: 1239483,
8: 1239622,
9: 1239624,
10: 1239749,
11: 1334477,
12: 1340405,
13: 1340418,
14: 1340462,
15: 1340471,
16: 1340485,
17: 1340488,
18: 1340508,
19: 1340519,
20: 1340541},
'newa_cited_docdb': {0: '[ 596005 4321416 5802640 6031690 6043910 8600475 8642629 9203255 9345445 10177065 10455451 13428248 22139349 22591458 24627241 24750476 26261826 26405611 27079105 27096884]',
1: '[ 5956195 11260528 22181831 22437920 22642946 23278096 23407037 23458128 24244657 24355363 25014714 25115774 25156886 27047688 27089078 27398716]',
2: '[ 5855196 7755392 11183886 22894980 24648618 27185399]',
3: '[ 3573464 6279285 6294985 6542463 6981930 7427770 10325811 14970234 16878329 17935009 21811002 22329817 23543436 23907898 24456108 25283772]',
4: '[ 2777078 2826073 5944733 10484188 11052747 14682645 15688752 22333410 22614097 22646501 22783765 22978728 23231683 24259740 24605606 24839432 25492752 27009992 27044704]',
5: '[ 5777407 10417156 23463145 23845079 24397163 24426379 24916732 25216234 25296619 27054560 27509152]',
6: '[ 4136523 12578497 21994155 22418792 22626616 22655464 22694825 22779403 23081767 23309829 23379411 23621952 24130698 24236071 24267003 24790872 24841797 25343500 27006578]',
7: '[21722194 23841261 23870348 24749080 26713455 26884023 26892256 27123571]',
8: '[ 3770167 9249538 20340153 21805004 21826650 23074051 23211424 23586695 23664858 24139881 24669345 24951262 25109266 25172355 25351735 26158421 27074633]',
9: '[ 3773931 10400885 23825854 24863945 24904226 25372210 26673422 27108903]',
10: '[ 6245732 6270984 6282047 6313094 6323632 6357314 12700997 14934415]',
11: '[1331950 5937719 5950928 6032897 6737094 8103287]',
12: '[22536768 23111794 23827356 24148953 24483064 24636228 26369896 26722884]',
13: '[ 4096597 6452385 9164095 19820980 22468583 23758517 24922228]',
14: '[ 6273193 6365448 9349940 10531948 13589721 20897840 21818345 22422049 23234586 23722349 24282964 24466601 25476838 26223504 26685774 26756449 26812104 26900843 27088150]',
15: '[ 3770297 6285357 21272262 21883292 22392025 23100861 23160290 23827496 24060758 25448672 26918320]',
16: '[21808322 25167492 25401922 26858065]',
17: '[ 6293130 12621423 12977043 14043576 14524083 22013480 23070753 23360636 23672818 24210016 24396413 24505095 25447453 26335550 27560125]',
18: '[21923978 23414619 23700077 23916998 23917011 23917023 24227869]',
19: '[ 3029629 3461742 8589904 10338953 10633369 16254362 22248316 22635394 24392987 25416705 26671842 27391491 27406148]',
20: None},
'paperid': {0: nan,
1: nan,
2: nan,
3: nan,
4: nan,
5: nan,
6: nan,
7: nan,
8: nan,
9: nan,
10: nan,
11: nan,
12: nan,
13: nan,
14: nan,
15: nan,
16: nan,
17: nan,
18: nan,
19: nan,
20: 1998988989.0},
'fronteer': {0: 0,
1: 0,
2: 0,
3: 0,
4: 0,
5: 0,
6: 0,
7: 0,
8: 0,
9: 0,
10: 0,
11: 0,
12: 0,
13: 0,
14: 0,
15: 0,
16: 0,
17: 0,
18: 0,
19: 0,
20: 1},
'distance': {0: nan,
1: nan,
2: nan,
3: nan,
4: nan,
5: nan,
6: nan,
7: nan,
8: nan,
9: nan,
10: nan,
11: nan,
12: nan,
13: nan,
14: nan,
15: nan,
16: nan,
17: nan,
18: nan,
19: nan,
20: 0.0},
'cited_docdb_ls': {0: '[ 596005 4321416 5802640 6031690 6043910 8600475 8642629 9203255 9345445 10177065 10455451 13428248 22139349 22591458 24627241 24750476 26261826 26405611 27079105 27096884]',
1: '[ 5956195 11260528 22181831 22437920 22642946 23278096 23407037 23458128 24244657 24355363 25014714 25115774 25156886 27047688 27089078 27398716]',
2: '[ 5855196 7755392 11183886 22894980 24648618 27185399]',
3: '[ 3573464 6279285 6294985 6542463 6981930 7427770 10325811 14970234 16878329 17935009 21811002 22329817 23543436 23907898 24456108 25283772]',
4: '[ 2777078 2826073 5944733 10484188 11052747 14682645 15688752 22333410 22614097 22646501 22783765 22978728 23231683 24259740 24605606 24839432 25492752 27009992 27044704]',
5: '[ 5777407 10417156 23463145 23845079 24397163 24426379 24916732 25216234 25296619 27054560 27509152]',
6: '[ 4136523 12578497 21994155 22418792 22626616 22655464 22694825 22779403 23081767 23309829 23379411 23621952 24130698 24236071 24267003 24790872 24841797 25343500 27006578]',
7: '[21722194 23841261 23870348 24749080 26713455 26884023 26892256 27123571]',
8: '[ 3770167 9249538 20340153 21805004 21826650 23074051 23211424 23586695 23664858 24139881 24669345 24951262 25109266 25172355 25351735 26158421 27074633]',
9: '[ 3773931 10400885 23825854 24863945 24904226 25372210 26673422 27108903]',
10: '[ 6245732 6270984 6282047 6313094 6323632 6357314 12700997 14934415]',
11: '[1331950 5937719 5950928 6032897 6737094 8103287]',
12: '[22536768 23111794 23827356 24148953 24483064 24636228 26369896 26722884]',
13: '[ 4096597 6452385 9164095 19820980 22468583 23758517 24922228]',
14: '[ 6273193 6365448 9349940 10531948 13589721 20897840 21818345 22422049 23234586 23722349 24282964 24466601 25476838 26223504 26685774 26756449 26812104 26900843 27088150]',
15: '[ 3770297 6285357 21272262 21883292 22392025 23100861 23160290 23827496 24060758 25448672 26918320]',
16: '[21808322 25167492 25401922 26858065]',
17: '[ 6293130 12621423 12977043 14043576 14524083 22013480 23070753 23360636 23672818 24210016 24396413 24505095 25447453 26335550 27560125]',
18: '[21923978 23414619 23700077 23916998 23917011 23917023 24227869]',
19: '[ 3029629 3461742 8589904 10338953 10633369 16254362 22248316 22635394 24392987 25416705 26671842 27391491 27406148]',
20: []}}
what I would like to do is to explode the variable cited_docdb_ls which contains lists separated by space rather than a comma.
How can I do that? If it is not possible, is there a way to separate them by comma rather than space and then explode them?
The resulting database should either contain cited_docdb_ls with traditional lists separated by comma and not by spaces or the exploded database. I have checked the df.explode() documentation but couldd not find any hint on how to manage situations where the list is separated by space.
Thank you
I would use str.findall with a (\d+) regex for numbers to convert the strings to lists, then explode:
out = (df.assign(newa_cited_docdb=df['newa_cited_docdb'].str.findall('\d+'),
cited_docdb_ls=df['cited_docdb_ls'].str.findall('(\d+)'))
.explode(['newa_cited_docdb', 'cited_docdb_ls'])
)
NB. if you don't have only digits a (\w+) regex will be more generic, however if the strings also contain [/] other than the in first and last character (e.g. [abc 12]3 45d]), then #jezrael's anwser will be an alternative.
output:
docdb_family_id newa_cited_docdb paperid fronteer distance \
0 569328 596005 NaN 0 NaN
0 569328 4321416 NaN 0 NaN
0 569328 5802640 NaN 0 NaN
0 569328 6031690 NaN 0 NaN
0 569328 6043910 NaN 0 NaN
.. ... ... ... ... ...
19 1340519 25416705 NaN 0 NaN
19 1340519 26671842 NaN 0 NaN
19 1340519 27391491 NaN 0 NaN
19 1340519 27406148 NaN 0 NaN
20 1340541 None 1.998989e+09 1 0.0
cited_docdb_ls
0 596005
0 4321416
0 5802640
0 6031690
0 6043910
.. ...
19 25416705
19 26671842
19 27391491
19 27406148
20 NaN
[239 rows x 6 columns]
Use Series.str.strip with Series.str.split for both columns and then DataFrame.explode:
df = (df.assign(newa_cited_docdb=df['newa_cited_docdb'].str.strip('[]').str.split(),
cited_docdb_ls=df['cited_docdb_ls'].str.strip('[]').str.split())
.explode(['newa_cited_docdb','cited_docdb_ls']))

Filtering by rows Pandas DataFrame [duplicate]

This question already has answers here:
Use a list of values to select rows from a Pandas dataframe
(8 answers)
Closed 12 months ago.
I want to filer the pandas DataFrame where it filters out every other column out of the DataFrame except the rows stated within the rows values. How would I be able to do that and get the Expected Output.
import pandas as pd
data = pd.DataFrame({'Symbol': {0: 'ABNB', 1: 'DKNG', 2: 'EXPE', 3: 'MPNGF', 4: 'RDFN', 5: 'ROKU', 6: 'VIACA', 7: 'Z'},
'Number of Buy s': {0: nan, 1: 2.0, 2: nan, 3: 1.0, 4: 2.0, 5: 1.0, 6: 1.0, 7: nan},
'Number of Sell s': {0: 1.0, 1: nan, 2: 1.0, 3: nan, 4: nan, 5: nan, 6: nan, 7: 1.0},
'Gains/Losses': {0: 2106.0, 1: -1479.2, 2: 1863.18, 3: -1980.0, 4: -1687.7, 5: -1520.52, 6: -1282.4, 7: 1624.59}, 'Percentage change': {0: 0.0, 1: 2.0, 2: 0.0, 3: 0.0, 4: 1.5, 5: 0.0, 6: 0.0, 7: 0.0}})
rows = ['ABNB','DKNG','EXPE']
Expected Output:
Use .isin()
data[data['Symbol'].isin(rows)]

Pandas : Fillna for all columns, except two

I wonder how can we fill the NaNs from all columns of a dataframe, except some.
For example, I have a dataframe with 20 columns, I want to fill the NaN for all except two columns (in my case, NaN are replaced by the mean).
df = df.drop(['col1','col2], 1).fillna(df.mean())
I tried this, but I don't think it's the best way to achieve this (also, i want to avoid the inplace=true arg).
Thank's
You can select which columns to use fillna on. Assuming you have 20 columns and you want to fill all of them except 'col1' and 'col2' you can create a list with the ones you want to fill:
f = [c for c in df.columns if c not in ['col1','col2']]
df[f] = df[f].fillna(df[f].mean())
print(df)
col1 col2 col3 col4 ... col17 col18 col19 col20
0 1.0 1.0 1.000000 1.0 ... 1.000000 1 1.000000 1
1 NaN NaN 2.666667 2.0 ... 2.000000 2 2.000000 2
2 NaN 3.0 3.000000 1.5 ... 2.333333 3 2.333333 3
3 4.0 4.0 4.000000 1.5 ... 4.000000 4 4.000000 4
(2.66666) was the mean
# Initial DF:
{'col1': {0: 1.0, 1: nan, 2: nan, 3: 4.0},
'col2': {0: 1.0, 1: nan, 2: 3.0, 3: 4.0},
'col3': {0: 1.0, 1: nan, 2: 3.0, 3: 4.0},
'col4': {0: 1.0, 1: 2.0, 2: nan, 3: nan},
'col5': {0: 1.0, 1: nan, 2: 3.0, 3: 4.0},
'col6': {0: 1, 1: 2, 2: 3, 3: 4},
'col7': {0: nan, 1: 2.0, 2: 3.0, 3: 4.0},
'col8': {0: 1.0, 1: 2.0, 2: nan, 3: 4.0},
'col9': {0: 1, 1: 2, 2: 3, 3: 4},
'col10': {0: 1.0, 1: 2.0, 2: nan, 3: 4.0},
'col11': {0: 1.0, 1: nan, 2: 3.0, 3: 4.0},
'col12': {0: 1, 1: 2, 2: 3, 3: 4},
'col13': {0: 1.0, 1: 2.0, 2: nan, 3: 4.0},
'col14': {0: 1.0, 1: nan, 2: 3.0, 3: 4.0},
'col15': {0: 1, 1: 2, 2: 3, 3: 4},
'col16': {0: 1.0, 1: nan, 2: 3.0, 3: nan},
'col17': {0: 1.0, 1: 2.0, 2: nan, 3: 4.0},
'col18': {0: 1, 1: 2, 2: 3, 3: 4},
'col19': {0: 1.0, 1: 2.0, 2: nan, 3: 4.0},
'col20': {0: 1, 1: 2, 2: 3, 3: 4}}

Extract strings from a Dataframe looping over a single row

I'm reading multiple PDFs (using tabula) into data frames like this:
nan = float('nan')
DataFrame_as_dict = {'Unnamed: 0': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan},
'Unnamed: 1': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan},
'NOTA DE NEGOCIAÇÃO': {0: nan,
1: nan,
2: 'Rico Investimentos - Grupo XP',
3: 'Av. Presidente Juscelino Kubitschek - Torre Sul, 1909 - 25o ANDAR VILA OLIMPIA 4543-907',
4: 'Tel. 3003-5465Fax: (55 11) 4007-2465',
5: 'Internet: www.rico.com.vc SAC: 0800-774-0402 e-mail: atendimento#rico.com.vc'},
'Unnamed: 3': {0: 'Nr. nota Folha',
1: '330736 1',
2: nan,
3: 'SÃO PAULO - SP',
4: nan,
5: nan},
'Unnamed: 4': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan},
'Unnamed: 5': {0: 'Data pregão',
1: '30/09/2015',
2: nan,
3: nan,
4: nan,
5: nan}}
df = pd.DataFrame(DataFrame_as_dict)
dataframe figure
My intention is to use that value '330736 1' into the variable "number" and '30/09/2015' into a variable "date".
The issue is that, although these values will always be located in row 1, the columns vary in an unpredictable way across the multiple PDFs.
Therefore, I tried to loop over the different columns of row 1, in order to extract these data regardless the columns they are:
list_columns = df.columns
for i in range(len(list_columns)):
if isinstance(df.iloc[1:2,i], str):
if df.iloc[1:2,i].str.contains("/",na=False,regex=False).any():
date = str(df.iloc[1:2,i]).strip()
else:
n_nota = str(df.iloc[1:2,i]).strip()
However, without success... Any thoughts?
In your original code, if isinstance(df.iloc[1:2,i], str) will never evaluate to True for two reasons:
Strings inside DataFrames are of type object
df.iloc[1:2,i] will always be a pandas Series.
Since object is such a flexible type, it's not as useful as str for identifying the data you want. In the code below, I simply used a space character to differentiate the data you want for n_nota. If this doesn't work with your data, a regex pattern may be a good approach.
list_columns = df.columns
for i in range(len(list_columns)):
if isinstance(df.iloc[1:2,i].values, object):
(df.iloc[1:2,i].values)
if "/" in str(df.iloc[1:2,i].values):
date = str(df.iloc[1:2,i].values[0]).strip()
elif " " in str(df.iloc[1:2,i].values):
n_nota = str(df.iloc[1:2,i].values[0]).strip()
Edit: As noted below, the original code in the question text would have worked if each df.iloc[1:2,i] were replaced with df.iloc[1,i] as in:
list_columns = df.columns
for i in range(len(list_columns)):
if isinstance(df.iloc[1,i], str):
if df.iloc[1:2,i].str.contains("/",na=False,regex=False).any():
date = str(df.iloc[1,i]).strip()
else:
n_nota = str(df.iloc[1,i]).strip()

Reshape error when using mutual_info regression for feature selection

I am trying to do some feature selection using mutual_info_regression with SelectKBest wrapper. However I keep running into an error indicating that my list of features needs to be reshaped into a 2D array, not quite sure why I keep getting this message-
#feature selection before linear regression benchmark test
import sklearn
from sklearn.feature_selection import mutual_info_regression, SelectKBest
features = list(housing_data[housing_data.columns.difference(['sale_price'])])
target = 'sale_price'
new = SelectKBest(mutual_info_regression, k=20).fit_transform(features, target)
This is my traceback:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-18-8c778124066c> in <module>()
3 features = list(housing_data[housing_data.columns.difference(['sale_price'])])
4 target = 'sale_price'
----> 5 new = SelectKBest(mutual_info_regression, k=20).fit_transform(features, target)
/usr/local/lib/python3.6/dist-packages/sklearn/base.py in fit_transform(self, X, y, **fit_params)
463 else:
464 # fit method of arity 2 (supervised transformation)
--> 465 return self.fit(X, y, **fit_params).transform(X)
466
467
/usr/local/lib/python3.6/dist-packages/sklearn/feature_selection/univariate_selection.py in fit(self, X, y)
339 self : object
340 """
--> 341 X, y = check_X_y(X, y, ['csr', 'csc'], multi_output=True)
342
343 if not callable(self.score_func):
/usr/local/lib/python3.6/dist-packages/sklearn/utils/validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, warn_on_dtype, estimator)
754 ensure_min_features=ensure_min_features,
755 warn_on_dtype=warn_on_dtype,
--> 756 estimator=estimator)
757 if multi_output:
758 y = check_array(y, 'csr', force_all_finite=True, ensure_2d=False,
/usr/local/lib/python3.6/dist-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
550 "Reshape your data either using array.reshape(-1, 1) if "
551 "your data has a single feature or array.reshape(1, -1) "
--> 552 "if it contains a single sample.".format(array))
553
554 # in the future np.flexible dtypes will be handled like object dtypes
ValueError: Expected 2D array, got 1D array instead:
array=['APPBBL' 'APPDate' 'Address' 'AreaSource' 'AssessLand' 'AssessTot' 'BBL'
'BldgArea' 'BldgClass' 'BldgDepth' 'BldgFront' 'BoroCode' 'Borough'
'BsmtCode' 'BuiltFAR' 'CB2010' 'CD' 'CT2010' 'ComArea' 'CommFAR'
'CondoNo' 'Council' 'EDesigNum' 'Easements' 'ExemptLand' 'ExemptTot'
'Ext' 'FIRM07_FLA' 'FacilFAR' 'FactryArea' 'FireComp' 'GarageArea'
'HealthArea' 'HealthCent' 'HistDist' 'IrrLotCode' 'LandUse' 'Landmark'
'LotArea' 'LotDepth' 'LotFront' 'LotType' 'LtdHeight' 'MAPPLUTO_F'
'NumBldgs' 'NumFloors' 'OfficeArea' 'OtherArea' 'Overlay1' 'Overlay2'
'OwnerName' 'OwnerType' 'PFIRM15_FL' 'PLUTOMapID' 'PolicePrct' 'ProxCode'
'ResArea' 'ResidFAR' 'RetailArea' 'SHAPE_Area' 'SHAPE_Leng' 'SPDist1'
'SPDist2' 'SPDist3' 'Sanborn' 'SanitBoro' 'SanitDistr' 'SanitSub'
'SchoolDist' 'SplitZone' 'StrgeArea' 'TaxMap' 'Tract2010' 'UnitsRes'
'UnitsTotal' 'Unnamed: 0' 'Version' 'XCoord' 'YCoord' 'YearAlter1'
'YearAlter2' 'YearBuilt' 'ZMCode' 'ZipCode' 'ZoneDist1' 'ZoneDist2'
'ZoneDist3' 'ZoneDist4' 'ZoneMap' 'address' 'apartment_number' 'block'
'borough' 'building_class' 'building_class_at_sale'
'building_class_category' 'commercial_units' 'easement' 'gross_sqft'
'land_sqft' 'lot' 'neighborhood' 'price_range' 'residential_units'
'sale_date' 'tax_class' 'tax_class_at_sale' 'total_units' 'year_built'
'year_of_sale' 'zip_code'].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
Here is a sample of my data:
housing_data = pd.DataFrame({'Unnamed: 0': {0: 1, 1: 2, 2: 3, 3: 4}, 'borough': {0: 3, 1: 3, 2: 3, 3: 3}, 'neighborhood': {0: 'DOWNTOWN-METROTECH', 1: 'DOWNTOWN-FULTON FERRY', 2: 'BROOKLYN HEIGHTS', 3: 'MILL BASIN'}, 'building_class_category': {0: '28 COMMERCIAL CONDOS', 1: '29 COMMERCIAL GARAGES', 2: '21 OFFICE BUILDINGS', 3: '22 STORE BUILDINGS'}, 'tax_class': {0: '4', 1: '4', 2: '4', 3: '4'}, 'block': {0: 140, 1: 54, 2: 204, 3: 8470}, 'lot': {0: 1001, 1: 1, 2: 1, 3: 55}, 'easement': {0: nan, 1: nan, 2: nan, 3: nan}, 'building_class': {0: 'R5', 1: 'G7', 2: 'O6', 3: 'K6'}, 'address': {0: '330 JAY STREET', 1: '85 JAY STREET', 2: '29 COLUMBIA HEIGHTS', 3: '5120 AVENUE U'}, 'apartment_number': {0: 'COURT', 1: nan, 2: nan, 3: nan}, 'zip_code': {0: 11201, 1: 11201, 2: 11201, 3: 11234}, 'residential_units': {0: 0, 1: 0, 2: 0, 3: 0}, 'commercial_units': {0: 1, 1: 0, 2: 0, 3: 123}, 'total_units': {0: 1, 1: 0, 2: 0, 3: 123}, 'land_sqft': {0: 0.0, 1: 134988.0, 2: 32000.0, 3: 905000.0}, 'gross_sqft': {0: 0.0, 1: 0.0, 2: 304650.0, 3: 2548000.0}, 'year_built': {0: 2002, 1: 0, 2: 1924, 3: 1970}, 'tax_class_at_sale': {0: 4, 1: 4, 2: 4, 3: 4}, 'building_class_at_sale': {0: 'R5', 1: 'G7', 2: 'O6', 3: 'K6'}, 'sale_price': {0: 499401179.0, 1: 345000000.0, 2: 340000000.0, 3: 276947000.0}, 'sale_date': {0: '2008-04-23', 1: '2016-12-20', 2: '2016-08-03', 3: '2012-11-28'}, 'year_of_sale': {0: 2008, 1: 2016, 2: 2016, 3: 2012}, 'Borough': {0: nan, 1: 'BK', 2: 'BK', 3: 'BK'}, 'CD': {0: nan, 1: 302.0, 2: 302.0, 3: 318.0}, 'CT2010': {0: nan, 1: 21.0, 2: 1.0, 3: 698.0}, 'CB2010': {0: nan, 1: 3017.0, 2: 1003.0, 3: 2005.0}, 'SchoolDist': {0: nan, 1: 13.0, 2: 13.0, 3: 22.0}, 'Council': {0: nan, 1: 33.0, 2: 33.0, 3: 46.0}, 'ZipCode': {0: nan, 1: 11201.0, 2: 11201.0, 3: 11234.0}, 'FireComp': {0: nan, 1: 'L118', 2: 'E205', 3: 'E323'}, 'PolicePrct': {0: nan, 1: 84.0, 2: 84.0, 3: 63.0}, 'HealthCent': {0: nan, 1: 36.0, 2: 38.0, 3: 35.0}, 'HealthArea': {0: nan, 1: 1000.0, 2: 2300.0, 3: 8822.0}, 'SanitBoro': {0: nan, 1: 3.0, 2: 3.0, 3: 3.0}, 'SanitDistr': {0: nan, 1: 2.0, 2: 2.0, 3: 18.0}, 'SanitSub': {0: nan, 1: '1B', 2: '1A', 3: '4E'}, 'Address': {0: nan, 1: '87 JAY STREET', 2: '29 COLUMBIA HEIGHTS', 3: '5120 AVENUE U'}, 'ZoneDist1': {0: nan, 1: 'M1-2/R8', 2: 'M2-1', 3: 'M3-1'}, 'ZoneDist2': {0: nan, 1: nan, 2: nan, 3: nan}, 'ZoneDist3': {0: nan, 1: nan, 2: nan, 3: nan}, 'ZoneDist4': {0: nan, 1: nan, 2: nan, 3: nan}, 'Overlay1': {0: nan, 1: nan, 2: nan, 3: nan}, 'Overlay2': {0: nan, 1: nan, 2: nan, 3: nan}, 'SPDist1': {0: nan, 1: 'MX-2', 2: nan, 3: nan}, 'SPDist2': {0: nan, 1: nan, 2: nan, 3: nan}, 'SPDist3': {0: nan, 1: nan, 2: nan, 3: nan}, 'LtdHeight': {0: nan, 1: nan, 2: nan, 3: nan}, 'SplitZone': {0: nan, 1: 'N', 2: 'N', 3: 'N'}, 'BldgClass': {0: nan, 1: 'G7', 2: 'O6', 3: 'K6'}, 'LandUse': {0: nan, 1: 10.0, 2: 5.0, 3: 5.0}, 'Easements': {0: nan, 1: 0.0, 2: 0.0, 3: 1.0}, 'OwnerType': {0: nan, 1: 'P', 2: nan, 3: nan}, 'OwnerName': {0: nan, 1: '85 JAY STREET BROOKLY', 2: '25-30 COLUMBIA HEIGHT', 3: 'BROOKLYN KINGS PLAZA'}, 'LotArea': {0: nan, 1: 134988.0, 2: 32000.0, 3: 905000.0}, 'BldgArea': {0: nan, 1: 0.0, 2: 304650.0, 3: 2548000.0}, 'ComArea': {0: nan, 1: 0.0, 2: 304650.0, 3: 2548000.0}, 'ResArea': {0: nan, 1: 0.0, 2: 0.0, 3: 0.0}, 'OfficeArea': {0: nan, 1: 0.0, 2: 264750.0, 3: 0.0}, 'RetailArea': {0: nan, 1: 0.0, 2: 0.0, 3: 1263000.0}, 'GarageArea': {0: nan, 1: 0.0, 2: 0.0, 3: 1285000.0}, 'StrgeArea': {0: nan, 1: 0.0, 2: 0.0, 3: 0.0}, 'FactryArea': {0: nan, 1: 0.0, 2: 0.0, 3: 0.0}, 'OtherArea': {0: nan, 1: 0.0, 2: 39900.0, 3: 0.0}, 'AreaSource': {0: nan, 1: 7.0, 2: 2.0, 3: 2.0}, 'NumBldgs': {0: nan, 1: 0.0, 2: 1.0, 3: 4.0}, 'NumFloors': {0: nan, 1: 0.0, 2: 13.0, 3: 2.0}, 'UnitsRes': {0: nan, 1: 0.0, 2: 0.0, 3: 0.0}, 'UnitsTotal': {0: nan, 1: 0.0, 2: 0.0, 3: 123.0}, 'LotFront': {0: nan, 1: 490.5, 2: 92.42, 3: 930.0}, 'LotDepth': {0: nan, 1: 275.33, 2: 335.92, 3: 859.0}, 'BldgFront': {0: nan, 1: 0.0, 2: 335.0, 3: 0.0}, 'BldgDepth': {0: nan, 1: 0.0, 2: 92.0, 3: 0.0}, 'Ext': {0: nan, 1: nan, 2: nan, 3: nan}, 'ProxCode': {0: nan, 1: 0.0, 2: 0.0, 3: 0.0}, 'IrrLotCode': {0: nan, 1: 'N', 2: 'Y', 3: 'Y'}, 'LotType': {0: nan, 1: 5.0, 2: 3.0, 3: 3.0}, 'BsmtCode': {0: nan, 1: 5.0, 2: 5.0, 3: 5.0}, 'AssessLand': {0: nan, 1: 1571850.0, 2: 1548000.0, 3: 36532350.0}, 'AssessTot': {0: nan, 1: 1571850.0, 2: 25463250.0, 3: 149792400.0}, 'ExemptLand': {0: nan, 1: 1571850.0, 2: 0.0, 3: 0.0}, 'ExemptTot': {0: nan, 1: 1571850.0, 2: 0.0, 3: 0.0}, 'YearBuilt': {0: nan, 1: 0.0, 2: 1924.0, 3: 1970.0}, 'YearAlter1': {0: nan, 1: 0.0, 2: 1980.0, 3: 0.0}, 'YearAlter2': {0: nan, 1: 0.0, 2: 0.0, 3: 0.0}, 'HistDist': {0: nan, 1: nan, 2: nan, 3: nan}, 'Landmark': {0: nan, 1: nan, 2: nan, 3: nan}, 'BuiltFAR': {0: nan, 1: 0.0, 2: 9.52, 3: 2.82}, 'ResidFAR': {0: nan, 1: 7.2, 2: 0.0, 3: 0.0}, 'CommFAR': {0: nan, 1: 2.0, 2: 2.0, 3: 2.0}, 'FacilFAR': {0: nan, 1: 6.5, 2: 0.0, 3: 0.0}, 'BoroCode': {0: nan, 1: 3.0, 2: 3.0, 3: 3.0}, 'BBL': {0: nan, 1: 3000540001.0, 2: 3002040001.0, 3: 3084700055.0}, 'CondoNo': {0: nan, 1: 0.0, 2: 0.0, 3: 0.0}, 'Tract2010': {0: nan, 1: 21.0, 2: 1.0, 3: 698.0}, 'XCoord': {0: nan, 1: 988208.0, 2: 985952.0, 3: 1006597.0}, 'YCoord': {0: nan, 1: 195011.0, 2: 195007.0, 3: 161424.0}, 'ZoneMap': {0: nan, 1: '12d', 2: '12d', 3: '23b'}, 'ZMCode': {0: nan, 1: nan, 2: nan, 3: nan}, 'Sanborn': {0: nan, 1: '302 016', 2: '302 004', 3: '319 077'}, 'TaxMap': {0: nan, 1: 30101.0, 2: 30106.0, 3: 32502.0}, 'EDesigNum': {0: nan, 1: nan, 2: nan, 3: nan}, 'APPBBL': {0: nan, 1: 3000540001.0, 2: 0.0, 3: 0.0}, 'APPDate': {0: nan, 1: '12/06/2002', 2: nan, 3: nan}, 'PLUTOMapID': {0: nan, 1: 1.0, 2: 1.0, 3: 1.0}, 'FIRM07_FLA': {0: nan, 1: nan, 2: nan, 3: 1.0}, 'PFIRM15_FL': {0: nan, 1: nan, 2: nan, 3: 1.0}, 'Version': {0: nan, 1: '17V1.1', 2: '17V1.1', 3: '17V1.1'}, 'MAPPLUTO_F': {0: nan, 1: 0.0, 2: 0.0, 3: 0.0}, 'SHAPE_Leng': {0: nan, 1: 1559.88914353, 2: 890.718521021, 3: 3729.78685686}, 'SHAPE_Area': {0: nan, 1: 140131.577176, 2: 34656.4472405, 3: 797554.847834}, 'price_range': {0: nan, 1: nan, 2: nan, 3: nan}})
The transformer expects a 2D array, of shape (n x m) where n is the number of samples and m the number of features and if you look at the shape of features I imagine it will display: (m,).
Reshaping arrays
In general for a feature array of shape (n,), you can do as the error code suggests and call .reshape(-1,1) on your feature array, the -1 lets it infer the additional dimension: The shape of the array will be (n,m), where for a 1 feature case m = 1.
Sklearn transformers
The above being said, I think there is additional errors with your code and understanding.
I would print features to screen and check it is what you want, it looks like you are printing a list of all the column names except sale_price.
I am not familiar with SelectKBest but it requires an (n,m) feature array not a list of column names of the features.
Additionally, target should not be the name of the target column, but an array of shape (n,), where its values are the observed target values of the training instances.
I would suggest checking the documentation (previously referenced) while you are writing your code to make sure you are using the correct arguments and utilising the function as it is intended.
Extracting features
Your data seems in a strange format (dictionary's nested in a pandas DF). However is a explicit example of how I would extract features from a pd.DataFrame for use with functions from the SKlearn framework.
housing_data = pd.DataFrame({'age': [1,5,1,10], 'size':[0,1,2,0],
'price':[190,100,50,100]
})
feature_arr = housing_data.drop('price', axis=1).values
target_values = housing_data['price']
Print feature_arr and you will hopefully see your issue. Normally you would then have to preprocess the data to, for example, drop NaN values or perform feature scaling.

Categories

Resources