Optimized search in pandas - python

I have the following database:
{'docdb': {0: 2, 1: 4, 2: 6, 3: 7, 4: 9, 5: 14, 6: 18},
'cited_docdb': {0: [4],
1: [0, 0, 0],
2: [],
3: [2],
4: [4],
5: [18, 6],
6: [7]},
'fronteer': {0: nan, 1: nan, 2: 9.0, 3: nan, 4: 7.0, 5: 3.0, 6: 1.0},
'distance': {0: nan, 1: nan, 2: 0.0, 3: nan, 4: 0.0, 5: 0.0, 6: 0.0}}
and would basically like to do the following in an optimized way: whenever at least one 0 occurs in the list cited_docdb and the distance variable is NaN, replace the distance with 1.
Now there is a naive way to do it which is the following:
m1= [0 in x for x in df['cited_docdb']]
df.loc[m1&df['distance'].isna(), 'distance'] = 1
This is fine for small databases, but it is very slow when the database has millions of observations and the operation is repeated iteratively. The reason is that at every iteration m1 checks all the values of df['cited_docdb'], whereas it only needs to look at the rows where distance is NaN (i.e. something along the lines of m1_new = [0 in x for x in df.loc[df['distance'].isna(), 'cited_docdb']]). So my question is: is there a way to combine m1_new = [0 in x for x in df.loc[df['distance'].isna(), 'cited_docdb']] with df['distance'].isna() and assign a 1 whenever both conditions are satisfied? If not, is there another, faster way to obtain the desired result shown below? (Again, this is a mock example: in reality the database has millions of observations and the operation must be executed 10000 times.)
{'docdb': {0: 2, 1: 4, 2: 6, 3: 7, 4: 9, 5: 14, 6: 18},
'cited_docdb': {0: [4],
1: [0, 0, 0],
2: [],
3: [2],
4: [4],
5: [18, 6],
6: [7]},
'fronteer': {0: nan, 1: nan, 2: 9.0, 3: nan, 4: 7.0, 5: 3.0, 6: 1.0},
'distance': {0: nan, 1: 1.0, 2: 0.0, 3: nan, 4: 0.0, 5: 0.0, 6: 0.0}}
Thank you

In this specific case you can use the fact that 0 is the smallest value (after abs() and fillna(1)) and compute m1 without a Python loop:
m1 = (pd.DataFrame(list(df['cited_docdb'].values)).abs().fillna(1).min(axis=1)==0).values
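The restriction the question asks about can also be written directly. This is a minimal sketch, assuming the frame is called df as in the question; the names candidates and has_zero are illustrative and not from the original post. It runs the membership test only on rows whose distance is still NaN:

```python
import numpy as np
import pandas as pd

# Mock data from the question
df = pd.DataFrame({
    "docdb": [2, 4, 6, 7, 9, 14, 18],
    "cited_docdb": [[4], [0, 0, 0], [], [2], [4], [18, 6], [7]],
    "fronteer": [np.nan, np.nan, 9.0, np.nan, 7.0, 3.0, 1.0],
    "distance": [np.nan, np.nan, 0.0, np.nan, 0.0, 0.0, 0.0],
})

# Index labels of the rows that still have a missing distance
candidates = df.index[df["distance"].isna()]

# Membership test restricted to those rows only
has_zero = np.array(
    [0 in cited for cited in df.loc[candidates, "cited_docdb"]], dtype=bool
)

# Assign 1 where both conditions hold
df.loc[candidates[has_zero], "distance"] = 1
```

As the number of NaN rows shrinks over iterations, the list comprehension has less work to do each time.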

Related

Filtering by rows Pandas DataFrame [duplicate]

This question already has answers here:
Use a list of values to select rows from a Pandas dataframe
I want to filter the pandas DataFrame so that only the rows whose Symbol appears in the rows list are kept and every other row is dropped. How can I do that and get the expected output?
import pandas as pd
from numpy import nan
data = pd.DataFrame({'Symbol': {0: 'ABNB', 1: 'DKNG', 2: 'EXPE', 3: 'MPNGF', 4: 'RDFN', 5: 'ROKU', 6: 'VIACA', 7: 'Z'},
'Number of Buy s': {0: nan, 1: 2.0, 2: nan, 3: 1.0, 4: 2.0, 5: 1.0, 6: 1.0, 7: nan},
'Number of Sell s': {0: 1.0, 1: nan, 2: 1.0, 3: nan, 4: nan, 5: nan, 6: nan, 7: 1.0},
'Gains/Losses': {0: 2106.0, 1: -1479.2, 2: 1863.18, 3: -1980.0, 4: -1687.7, 5: -1520.52, 6: -1282.4, 7: 1624.59}, 'Percentage change': {0: 0.0, 1: 2.0, 2: 0.0, 3: 0.0, 4: 1.5, 5: 0.0, 6: 0.0, 7: 0.0}})
rows = ['ABNB','DKNG','EXPE']
Expected Output:
Use .isin()
data[data['Symbol'].isin(rows)]
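A short sketch of the same idea on a trimmed-down version of the question's data (only two columns and five rows are kept, for brevity). The variable names mask, selected and excluded are illustrative:

```python
import pandas as pd

data = pd.DataFrame({
    "Symbol": ["ABNB", "DKNG", "EXPE", "MPNGF", "RDFN"],
    "Gains/Losses": [2106.0, -1479.2, 1863.18, -1980.0, -1687.7],
})
rows = ["ABNB", "DKNG", "EXPE"]

# Boolean mask: True where Symbol is one of the wanted values
mask = data["Symbol"].isin(rows)

selected = data[mask]    # rows whose Symbol is in the list
excluded = data[~mask]   # the complement, if it is ever needed
```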

Pandas : Fillna for all columns, except two

I wonder how can we fill the NaNs from all columns of a dataframe, except some.
For example, I have a dataframe with 20 columns, I want to fill the NaN for all except two columns (in my case, NaN are replaced by the mean).
df = df.drop(['col1', 'col2'], axis=1).fillna(df.mean())
I tried this, but I don't think it's the best way to achieve this (also, I want to avoid the inplace=True argument).
Thanks
You can select which columns to use fillna on. Assuming you have 20 columns and you want to fill all of them except 'col1' and 'col2' you can create a list with the ones you want to fill:
f = [c for c in df.columns if c not in ['col1','col2']]
df[f] = df[f].fillna(df[f].mean())
print(df)
col1 col2 col3 col4 ... col17 col18 col19 col20
0 1.0 1.0 1.000000 1.0 ... 1.000000 1 1.000000 1
1 NaN NaN 2.666667 2.0 ... 2.000000 2 2.000000 2
2 NaN 3.0 3.000000 1.5 ... 2.333333 3 2.333333 3
3 4.0 4.0 4.000000 1.5 ... 4.000000 4 4.000000 4
(2.666667 is the mean of col3.)
# Initial DF:
{'col1': {0: 1.0, 1: nan, 2: nan, 3: 4.0},
'col2': {0: 1.0, 1: nan, 2: 3.0, 3: 4.0},
'col3': {0: 1.0, 1: nan, 2: 3.0, 3: 4.0},
'col4': {0: 1.0, 1: 2.0, 2: nan, 3: nan},
'col5': {0: 1.0, 1: nan, 2: 3.0, 3: 4.0},
'col6': {0: 1, 1: 2, 2: 3, 3: 4},
'col7': {0: nan, 1: 2.0, 2: 3.0, 3: 4.0},
'col8': {0: 1.0, 1: 2.0, 2: nan, 3: 4.0},
'col9': {0: 1, 1: 2, 2: 3, 3: 4},
'col10': {0: 1.0, 1: 2.0, 2: nan, 3: 4.0},
'col11': {0: 1.0, 1: nan, 2: 3.0, 3: 4.0},
'col12': {0: 1, 1: 2, 2: 3, 3: 4},
'col13': {0: 1.0, 1: 2.0, 2: nan, 3: 4.0},
'col14': {0: 1.0, 1: nan, 2: 3.0, 3: 4.0},
'col15': {0: 1, 1: 2, 2: 3, 3: 4},
'col16': {0: 1.0, 1: nan, 2: 3.0, 3: nan},
'col17': {0: 1.0, 1: 2.0, 2: nan, 3: 4.0},
'col18': {0: 1, 1: 2, 2: 3, 3: 4},
'col19': {0: 1.0, 1: 2.0, 2: nan, 3: 4.0},
'col20': {0: 1, 1: 2, 2: 3, 3: 4}}
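An alternative sketch that avoids assigning back into a column subset: fillna accepts a Series of per-column fill values and leaves columns that are not in that Series untouched. The four-column frame below is a cut-down version of the initial data, and fill_cols and filled are illustrative names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col1": [1.0, np.nan, np.nan, 4.0],
    "col2": [1.0, np.nan, 3.0, 4.0],
    "col3": [1.0, np.nan, 3.0, 4.0],
    "col4": [1.0, 2.0, np.nan, np.nan],
})

# All columns except the two to leave alone
fill_cols = df.columns.difference(["col1", "col2"])

# Column means only for the columns to fill; fillna returns a new frame
filled = df.fillna(df[fill_cols].mean())
```

Because fillna returns a new DataFrame here, neither inplace=True nor a drop is needed, and col1 and col2 keep their NaNs.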

df.apply() raises IndexingError: Unalignable boolean Series provided as indexer

I am performing df.apply() on a dataframe and I am getting the following error:
IndexingError: ('Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).', 'occurred at index 4061')
This error comes from the following line of my df (at index 4061)
The relevant code is:
i = pd.DataFrame()
i = df1.apply(
    lambda row: i.append(
        df.loc[
            (df1["ID"] == row["ID"])
            & (df1["Date"] >= (row["Date"] + timedelta(-5)))
            & (df1["Date"] <= (row["Date"] + timedelta(20)))
        ],
        ignore_index=True,
        inplace=True,
    )
    if row["Flag"] == 1
    else None,
    axis=1,
)
And an example of the first 5 rows of the df on which I am using the function:
{'ID': {1: 'A US Equity',
2: 'A US Equity',
3: 'A US Equity',
4: 'A US Equity',
5: 'A US Equity'},
'Date': {1: Timestamp('2020-12-22 00:00:00'),
2: Timestamp('2020-12-23 00:00:00'),
3: Timestamp('2020-12-24 00:00:00'),
4: Timestamp('2020-12-28 00:00:00'),
5: Timestamp('2020-12-29 00:00:00')},
'PX_Last': {1: 117.37, 2: 117.3, 3: 117.31, 4: 117.83, 5: 117.23},
'Short_Int': {1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0, 5: 0.0},
'Total_Call_Volume': {1: 187.0, 2: 353.0, 3: 141.0, 4: 467.0, 5: 329.0},
'Total_Put_Volume': {1: 54.0, 2: 30.0, 3: 218.0, 4: 282.0, 5: 173.0},
'Put_OI': {1: 13354.0, 2: 13350.0, 3: 13522.0, 4: 13678.0, 5: 13785.0},
'Call_OI': {1: 8923.0, 2: 8943.0, 3: 8973.0, 4: 9075.0, 5: 9040.0},
'pct_chng': {1: -0.34810663949736975,
2: -0.059640453267451043,
3: 0.008525149190119485,
4: 0.4432699684596253,
5: -0.5092081812781091},
'Short_Int_Category': {1: nan, 2: nan, 3: nan, 4: nan, 5: nan},
'Put/Call': {1: 0.2887700534759358,
2: 0.08498583569405099,
3: 1.5460992907801419,
4: 0.6038543897216274,
5: 0.5258358662613982},
'10% + Pop Flag': {1: 0, 2: 0, 3: 0, 4: 0, 5: 0},
'10%-20% Pop Flag': {1: 0, 2: 0, 3: 0, 4: 0, 5: 0},
'20%-30% Pop Flag': {1: 0, 2: 0, 3: 0, 4: 0, 5: 0},
'30% + Pop Flag': {1: 0, 2: 0, 3: 0, 4: 0, 5: 0},
'Flag': {1: 0, 2: 0, 3: 0, 4: 0, 5: 0},
'Time_to_pop': {1: nan, 2: nan, 3: nan, 4: nan, 5: nan}}
The row at index 4061 that is causing the error is:
ID ADI US Equity
Date 2021-02-24 00:00:00
PX_Last 161.76
Short_Int 15.1847
Total_Call_Volume 52502
Total_Put_Volume 1929
Put_OI 32219
Call_OI 45557
pct_chng 2.57451
Short_Int_Category 15-20
Put/Call 0.0367415
10% + Pop Flag 0
10%-20% Pop Flag 0
20%-30% Pop Flag 0
30% + Pop Flag 0
Flag 1
Time_to_pop NaN
Name: 4061, dtype: object
How do I perform the function without getting the error mentioned above?
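A hedged observation, going only by the snippet above: the boolean mask is built from df1 but is used to index df. If the two frames do not share the same index, pandas can raise this IndexingError, and it would only surface at the first row where Flag is 1, because the lambda returns None otherwise. The sketch below assumes the intent was to index df1 throughout. It uses made-up sample data and collects the matching windows in a list before a single pd.concat, since DataFrame.append has no inplace option and was removed in pandas 2.0.

```python
from datetime import timedelta

import pandas as pd

# Made-up sample data with the three columns the snippet relies on
df1 = pd.DataFrame({
    "ID": ["A", "A", "A", "B", "B"],
    "Date": pd.to_datetime(
        ["2021-01-06", "2021-01-10", "2021-03-01", "2021-01-05", "2021-01-06"]
    ),
    "Flag": [0, 1, 0, 0, 1],
})

windows = []
# Only visit flagged rows instead of applying over every row
for _, row in df1[df1["Flag"] == 1].iterrows():
    mask = (
        (df1["ID"] == row["ID"])
        & (df1["Date"] >= row["Date"] - timedelta(days=5))
        & (df1["Date"] <= row["Date"] + timedelta(days=20))
    )
    # Mask and indexed frame are the same object, so the indexes align
    windows.append(df1.loc[mask])

# One concatenation at the end; empty frame if nothing was flagged
result = pd.concat(windows, ignore_index=True) if windows else df1.iloc[0:0]
```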

Reshape error when using mutual_info regression for feature selection

I am trying to do some feature selection using mutual_info_regression with the SelectKBest wrapper. However, I keep running into an error indicating that my list of features needs to be reshaped into a 2D array, and I am not sure why I keep getting this message:
#feature selection before linear regression benchmark test
import sklearn
from sklearn.feature_selection import mutual_info_regression, SelectKBest
features = list(housing_data[housing_data.columns.difference(['sale_price'])])
target = 'sale_price'
new = SelectKBest(mutual_info_regression, k=20).fit_transform(features, target)
This is my traceback:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-18-8c778124066c> in <module>()
3 features = list(housing_data[housing_data.columns.difference(['sale_price'])])
4 target = 'sale_price'
----> 5 new = SelectKBest(mutual_info_regression, k=20).fit_transform(features, target)
/usr/local/lib/python3.6/dist-packages/sklearn/base.py in fit_transform(self, X, y, **fit_params)
463 else:
464 # fit method of arity 2 (supervised transformation)
--> 465 return self.fit(X, y, **fit_params).transform(X)
466
467
/usr/local/lib/python3.6/dist-packages/sklearn/feature_selection/univariate_selection.py in fit(self, X, y)
339 self : object
340 """
--> 341 X, y = check_X_y(X, y, ['csr', 'csc'], multi_output=True)
342
343 if not callable(self.score_func):
/usr/local/lib/python3.6/dist-packages/sklearn/utils/validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, warn_on_dtype, estimator)
754 ensure_min_features=ensure_min_features,
755 warn_on_dtype=warn_on_dtype,
--> 756 estimator=estimator)
757 if multi_output:
758 y = check_array(y, 'csr', force_all_finite=True, ensure_2d=False,
/usr/local/lib/python3.6/dist-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
550 "Reshape your data either using array.reshape(-1, 1) if "
551 "your data has a single feature or array.reshape(1, -1) "
--> 552 "if it contains a single sample.".format(array))
553
554 # in the future np.flexible dtypes will be handled like object dtypes
ValueError: Expected 2D array, got 1D array instead:
array=['APPBBL' 'APPDate' 'Address' 'AreaSource' 'AssessLand' 'AssessTot' 'BBL'
'BldgArea' 'BldgClass' 'BldgDepth' 'BldgFront' 'BoroCode' 'Borough'
'BsmtCode' 'BuiltFAR' 'CB2010' 'CD' 'CT2010' 'ComArea' 'CommFAR'
'CondoNo' 'Council' 'EDesigNum' 'Easements' 'ExemptLand' 'ExemptTot'
'Ext' 'FIRM07_FLA' 'FacilFAR' 'FactryArea' 'FireComp' 'GarageArea'
'HealthArea' 'HealthCent' 'HistDist' 'IrrLotCode' 'LandUse' 'Landmark'
'LotArea' 'LotDepth' 'LotFront' 'LotType' 'LtdHeight' 'MAPPLUTO_F'
'NumBldgs' 'NumFloors' 'OfficeArea' 'OtherArea' 'Overlay1' 'Overlay2'
'OwnerName' 'OwnerType' 'PFIRM15_FL' 'PLUTOMapID' 'PolicePrct' 'ProxCode'
'ResArea' 'ResidFAR' 'RetailArea' 'SHAPE_Area' 'SHAPE_Leng' 'SPDist1'
'SPDist2' 'SPDist3' 'Sanborn' 'SanitBoro' 'SanitDistr' 'SanitSub'
'SchoolDist' 'SplitZone' 'StrgeArea' 'TaxMap' 'Tract2010' 'UnitsRes'
'UnitsTotal' 'Unnamed: 0' 'Version' 'XCoord' 'YCoord' 'YearAlter1'
'YearAlter2' 'YearBuilt' 'ZMCode' 'ZipCode' 'ZoneDist1' 'ZoneDist2'
'ZoneDist3' 'ZoneDist4' 'ZoneMap' 'address' 'apartment_number' 'block'
'borough' 'building_class' 'building_class_at_sale'
'building_class_category' 'commercial_units' 'easement' 'gross_sqft'
'land_sqft' 'lot' 'neighborhood' 'price_range' 'residential_units'
'sale_date' 'tax_class' 'tax_class_at_sale' 'total_units' 'year_built'
'year_of_sale' 'zip_code'].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
Here is a sample of my data:
housing_data = pd.DataFrame({'Unnamed: 0': {0: 1, 1: 2, 2: 3, 3: 4}, 'borough': {0: 3, 1: 3, 2: 3, 3: 3}, 'neighborhood': {0: 'DOWNTOWN-METROTECH', 1: 'DOWNTOWN-FULTON FERRY', 2: 'BROOKLYN HEIGHTS', 3: 'MILL BASIN'}, 'building_class_category': {0: '28 COMMERCIAL CONDOS', 1: '29 COMMERCIAL GARAGES', 2: '21 OFFICE BUILDINGS', 3: '22 STORE BUILDINGS'}, 'tax_class': {0: '4', 1: '4', 2: '4', 3: '4'}, 'block': {0: 140, 1: 54, 2: 204, 3: 8470}, 'lot': {0: 1001, 1: 1, 2: 1, 3: 55}, 'easement': {0: nan, 1: nan, 2: nan, 3: nan}, 'building_class': {0: 'R5', 1: 'G7', 2: 'O6', 3: 'K6'}, 'address': {0: '330 JAY STREET', 1: '85 JAY STREET', 2: '29 COLUMBIA HEIGHTS', 3: '5120 AVENUE U'}, 'apartment_number': {0: 'COURT', 1: nan, 2: nan, 3: nan}, 'zip_code': {0: 11201, 1: 11201, 2: 11201, 3: 11234}, 'residential_units': {0: 0, 1: 0, 2: 0, 3: 0}, 'commercial_units': {0: 1, 1: 0, 2: 0, 3: 123}, 'total_units': {0: 1, 1: 0, 2: 0, 3: 123}, 'land_sqft': {0: 0.0, 1: 134988.0, 2: 32000.0, 3: 905000.0}, 'gross_sqft': {0: 0.0, 1: 0.0, 2: 304650.0, 3: 2548000.0}, 'year_built': {0: 2002, 1: 0, 2: 1924, 3: 1970}, 'tax_class_at_sale': {0: 4, 1: 4, 2: 4, 3: 4}, 'building_class_at_sale': {0: 'R5', 1: 'G7', 2: 'O6', 3: 'K6'}, 'sale_price': {0: 499401179.0, 1: 345000000.0, 2: 340000000.0, 3: 276947000.0}, 'sale_date': {0: '2008-04-23', 1: '2016-12-20', 2: '2016-08-03', 3: '2012-11-28'}, 'year_of_sale': {0: 2008, 1: 2016, 2: 2016, 3: 2012}, 'Borough': {0: nan, 1: 'BK', 2: 'BK', 3: 'BK'}, 'CD': {0: nan, 1: 302.0, 2: 302.0, 3: 318.0}, 'CT2010': {0: nan, 1: 21.0, 2: 1.0, 3: 698.0}, 'CB2010': {0: nan, 1: 3017.0, 2: 1003.0, 3: 2005.0}, 'SchoolDist': {0: nan, 1: 13.0, 2: 13.0, 3: 22.0}, 'Council': {0: nan, 1: 33.0, 2: 33.0, 3: 46.0}, 'ZipCode': {0: nan, 1: 11201.0, 2: 11201.0, 3: 11234.0}, 'FireComp': {0: nan, 1: 'L118', 2: 'E205', 3: 'E323'}, 'PolicePrct': {0: nan, 1: 84.0, 2: 84.0, 3: 63.0}, 'HealthCent': {0: nan, 1: 36.0, 2: 38.0, 3: 35.0}, 'HealthArea': {0: nan, 1: 1000.0, 2: 2300.0, 3: 8822.0}, 
'SanitBoro': {0: nan, 1: 3.0, 2: 3.0, 3: 3.0}, 'SanitDistr': {0: nan, 1: 2.0, 2: 2.0, 3: 18.0}, 'SanitSub': {0: nan, 1: '1B', 2: '1A', 3: '4E'}, 'Address': {0: nan, 1: '87 JAY STREET', 2: '29 COLUMBIA HEIGHTS', 3: '5120 AVENUE U'}, 'ZoneDist1': {0: nan, 1: 'M1-2/R8', 2: 'M2-1', 3: 'M3-1'}, 'ZoneDist2': {0: nan, 1: nan, 2: nan, 3: nan}, 'ZoneDist3': {0: nan, 1: nan, 2: nan, 3: nan}, 'ZoneDist4': {0: nan, 1: nan, 2: nan, 3: nan}, 'Overlay1': {0: nan, 1: nan, 2: nan, 3: nan}, 'Overlay2': {0: nan, 1: nan, 2: nan, 3: nan}, 'SPDist1': {0: nan, 1: 'MX-2', 2: nan, 3: nan}, 'SPDist2': {0: nan, 1: nan, 2: nan, 3: nan}, 'SPDist3': {0: nan, 1: nan, 2: nan, 3: nan}, 'LtdHeight': {0: nan, 1: nan, 2: nan, 3: nan}, 'SplitZone': {0: nan, 1: 'N', 2: 'N', 3: 'N'}, 'BldgClass': {0: nan, 1: 'G7', 2: 'O6', 3: 'K6'}, 'LandUse': {0: nan, 1: 10.0, 2: 5.0, 3: 5.0}, 'Easements': {0: nan, 1: 0.0, 2: 0.0, 3: 1.0}, 'OwnerType': {0: nan, 1: 'P', 2: nan, 3: nan}, 'OwnerName': {0: nan, 1: '85 JAY STREET BROOKLY', 2: '25-30 COLUMBIA HEIGHT', 3: 'BROOKLYN KINGS PLAZA'}, 'LotArea': {0: nan, 1: 134988.0, 2: 32000.0, 3: 905000.0}, 'BldgArea': {0: nan, 1: 0.0, 2: 304650.0, 3: 2548000.0}, 'ComArea': {0: nan, 1: 0.0, 2: 304650.0, 3: 2548000.0}, 'ResArea': {0: nan, 1: 0.0, 2: 0.0, 3: 0.0}, 'OfficeArea': {0: nan, 1: 0.0, 2: 264750.0, 3: 0.0}, 'RetailArea': {0: nan, 1: 0.0, 2: 0.0, 3: 1263000.0}, 'GarageArea': {0: nan, 1: 0.0, 2: 0.0, 3: 1285000.0}, 'StrgeArea': {0: nan, 1: 0.0, 2: 0.0, 3: 0.0}, 'FactryArea': {0: nan, 1: 0.0, 2: 0.0, 3: 0.0}, 'OtherArea': {0: nan, 1: 0.0, 2: 39900.0, 3: 0.0}, 'AreaSource': {0: nan, 1: 7.0, 2: 2.0, 3: 2.0}, 'NumBldgs': {0: nan, 1: 0.0, 2: 1.0, 3: 4.0}, 'NumFloors': {0: nan, 1: 0.0, 2: 13.0, 3: 2.0}, 'UnitsRes': {0: nan, 1: 0.0, 2: 0.0, 3: 0.0}, 'UnitsTotal': {0: nan, 1: 0.0, 2: 0.0, 3: 123.0}, 'LotFront': {0: nan, 1: 490.5, 2: 92.42, 3: 930.0}, 'LotDepth': {0: nan, 1: 275.33, 2: 335.92, 3: 859.0}, 'BldgFront': {0: nan, 1: 0.0, 2: 335.0, 3: 0.0}, 'BldgDepth': {0: nan, 1: 0.0, 
2: 92.0, 3: 0.0}, 'Ext': {0: nan, 1: nan, 2: nan, 3: nan}, 'ProxCode': {0: nan, 1: 0.0, 2: 0.0, 3: 0.0}, 'IrrLotCode': {0: nan, 1: 'N', 2: 'Y', 3: 'Y'}, 'LotType': {0: nan, 1: 5.0, 2: 3.0, 3: 3.0}, 'BsmtCode': {0: nan, 1: 5.0, 2: 5.0, 3: 5.0}, 'AssessLand': {0: nan, 1: 1571850.0, 2: 1548000.0, 3: 36532350.0}, 'AssessTot': {0: nan, 1: 1571850.0, 2: 25463250.0, 3: 149792400.0}, 'ExemptLand': {0: nan, 1: 1571850.0, 2: 0.0, 3: 0.0}, 'ExemptTot': {0: nan, 1: 1571850.0, 2: 0.0, 3: 0.0}, 'YearBuilt': {0: nan, 1: 0.0, 2: 1924.0, 3: 1970.0}, 'YearAlter1': {0: nan, 1: 0.0, 2: 1980.0, 3: 0.0}, 'YearAlter2': {0: nan, 1: 0.0, 2: 0.0, 3: 0.0}, 'HistDist': {0: nan, 1: nan, 2: nan, 3: nan}, 'Landmark': {0: nan, 1: nan, 2: nan, 3: nan}, 'BuiltFAR': {0: nan, 1: 0.0, 2: 9.52, 3: 2.82}, 'ResidFAR': {0: nan, 1: 7.2, 2: 0.0, 3: 0.0}, 'CommFAR': {0: nan, 1: 2.0, 2: 2.0, 3: 2.0}, 'FacilFAR': {0: nan, 1: 6.5, 2: 0.0, 3: 0.0}, 'BoroCode': {0: nan, 1: 3.0, 2: 3.0, 3: 3.0}, 'BBL': {0: nan, 1: 3000540001.0, 2: 3002040001.0, 3: 3084700055.0}, 'CondoNo': {0: nan, 1: 0.0, 2: 0.0, 3: 0.0}, 'Tract2010': {0: nan, 1: 21.0, 2: 1.0, 3: 698.0}, 'XCoord': {0: nan, 1: 988208.0, 2: 985952.0, 3: 1006597.0}, 'YCoord': {0: nan, 1: 195011.0, 2: 195007.0, 3: 161424.0}, 'ZoneMap': {0: nan, 1: '12d', 2: '12d', 3: '23b'}, 'ZMCode': {0: nan, 1: nan, 2: nan, 3: nan}, 'Sanborn': {0: nan, 1: '302 016', 2: '302 004', 3: '319 077'}, 'TaxMap': {0: nan, 1: 30101.0, 2: 30106.0, 3: 32502.0}, 'EDesigNum': {0: nan, 1: nan, 2: nan, 3: nan}, 'APPBBL': {0: nan, 1: 3000540001.0, 2: 0.0, 3: 0.0}, 'APPDate': {0: nan, 1: '12/06/2002', 2: nan, 3: nan}, 'PLUTOMapID': {0: nan, 1: 1.0, 2: 1.0, 3: 1.0}, 'FIRM07_FLA': {0: nan, 1: nan, 2: nan, 3: 1.0}, 'PFIRM15_FL': {0: nan, 1: nan, 2: nan, 3: 1.0}, 'Version': {0: nan, 1: '17V1.1', 2: '17V1.1', 3: '17V1.1'}, 'MAPPLUTO_F': {0: nan, 1: 0.0, 2: 0.0, 3: 0.0}, 'SHAPE_Leng': {0: nan, 1: 1559.88914353, 2: 890.718521021, 3: 3729.78685686}, 'SHAPE_Area': {0: nan, 1: 140131.577176, 2: 34656.4472405, 
3: 797554.847834}, 'price_range': {0: nan, 1: nan, 2: nan, 3: nan}})
The transformer expects a 2D array of shape (n, m), where n is the number of samples and m the number of features. If you look at the shape of features, I imagine it will display (m,).
Reshaping arrays
In general, for a feature array of shape (n,), you can do as the error message suggests and call .reshape(-1, 1) on your feature array; the -1 lets NumPy infer that dimension from the array's length. The shape of the array will then be (n, m), where m = 1 in the single-feature case.
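A tiny illustration of that reshape, using a made-up four-element array:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])   # shape (4,): one feature, four samples
X = x.reshape(-1, 1)                 # -1 is inferred from the length
print(x.shape, X.shape)
```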
Sklearn transformers
That being said, I think there are additional errors in your code and understanding.
I would print features to the screen and check it is what you want; it looks like you are passing a list of all the column names except sale_price.
I am not familiar with SelectKBest, but it requires an (n, m) feature array, not a list of the feature column names.
Additionally, target should not be the name of the target column but an array of shape (n,) whose values are the observed target values of the training instances.
I would suggest checking the documentation (previously referenced) while you are writing your code to make sure you are using the correct arguments and utilising the function as it is intended.
Extracting features
Your data seems to be in a strange format (dictionaries nested in a pandas DataFrame). However, here is an explicit example of how I would extract features from a pd.DataFrame for use with functions from the sklearn framework.
import pandas as pd
housing_data = pd.DataFrame({'age': [1, 5, 1, 10], 'size': [0, 1, 2, 0],
                             'price': [190, 100, 50, 100]
                             })
# 2D feature array of shape (n, m): every column except the target
feature_arr = housing_data.drop('price', axis=1).values
# 1D target of shape (n,)
target_values = housing_data['price']
Print feature_arr and you will hopefully see your issue. Normally you would then have to preprocess the data to, for example, drop NaN values or perform feature scaling.
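Putting these points together for the original call, here is a sketch under stated assumptions: the synthetic housing frame and its values are invented for illustration, non-numeric columns are simply dropped with select_dtypes rather than encoded, and k is reduced to 2 to fit the toy data.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, mutual_info_regression

rng = np.random.default_rng(0)

# Synthetic stand-in for housing_data (values are made up)
housing = pd.DataFrame({
    "gross_sqft": rng.uniform(500, 5000, 50),
    "land_sqft": rng.uniform(500, 5000, 50),
    "year_built": rng.integers(1900, 2020, 50),
    "neighborhood": ["A", "B"] * 25,   # non-numeric, excluded below
})
housing["sale_price"] = 300 * housing["gross_sqft"] + rng.normal(0, 1000, 50)

# X is the 2D block of numeric feature values, y the 1D target values
X = housing.drop(columns="sale_price").select_dtypes("number")
y = housing["sale_price"]

selector = SelectKBest(mutual_info_regression, k=2)
X_new = selector.fit_transform(X, y)

# Names of the columns that were kept
kept = X.columns[selector.get_support()]
```

On the real data, NaN values would also need to be handled before fitting, as noted above.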

Creating Column in Dataframe Using Multiple Conditions

I am trying to create a new column in a Pandas dataframe using multiple conditional statements based on other info within the dataframe. I have tried iterating using .iteritems(). This works, but seems inelegant and returns a notice that I don't know how to understand and/or correct.
My code snippet is:
proj_file_pq['pd_pq'] = 0
for key, value in proj_file_pq['pd_pq'].iteritems():
    if proj_file_pq['qualifying'].iloc[key] - \
            proj_file_pq['avg_pd'].iloc[key] < 1:
        proj_file_pq['pd_pq'].iloc[key] = \
            proj_file_pq['qualifying'].iloc[key] - 1
    elif proj_file_pq['qualifying'].iloc[key] > \
            proj_file_pq['avg_start'].iloc[key]:
        proj_file_pq['pd_pq'].iloc[key] = \
            proj_file_pq['qualifying'].iloc[key] - \
            proj_file_pq['avg_finish'].iloc[key]
    elif proj_file_pq['qualifying'].iloc[key] + \
            proj_file_pq['avg_pd'].iloc[key] > 40:
        proj_file_pq['pd_pq'].iloc[key] = \
            40 - proj_file_pq['qualifying'].iloc[key]
    else:
        proj_file_pq['pd_pq'].iloc[key] = proj_file_pq['avg_pd'].iloc[key]
print(proj_file_pq[['Driver', 'avg_start', 'avg_finish', 'qualifying',
                    'avg_pd', 'pd_pq']].head())
And here is the resulting output:
C:\Python36\lib\site-packages\pandas\core\indexing.py:189: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self._setitem_with_indexer(indexer, value)
Driver avg_start avg_finish qualifying avg_pd pd_pq
0 A.J. Allmendinger 18.000 21.875 16 3.875 3.875
1 Alex Bowman 14.500 18.000 8 3.500 3.500
2 Aric Almirola 21.250 19.250 13 -2.000 -2.000
3 Austin Dillon 18.875 18.375 17 -0.500 -0.500
4 B.J. McLeod 33.500 33.500 36 0.000 2.500
The original dataframe has the following head:
{'Driver': {0: 'A.J. Allmendinger', 1: 'Alex Bowman', 2: 'Aric Almirola', 3: 'Austin Dillon', 4: 'B.J. McLeod'}, 'qualifying': {0: 16, 1: 8, 2: 13, 3: 17, 4: 36}, 'races': {0: 8, 1: 6, 2: 8, 3: 8, 4: 2}, 'avg_start': {0: 18.0, 1: 14.5, 2: 21.25, 3: 18.875, 4: 33.5}, 'avg_finish': {0: 21.875, 1: 18.0, 2: 19.25, 3: 18.375, 4: 33.5}, 'avg_pd': {0: 3.875, 1: 3.5, 2: -2.0, 3: -0.5, 4: 0.0}, 'percent_fl': {0: 0.0036250647332988096, 1: 0.0071770334928229675, 2: 0.03655483224837256, 3: 0.006718346253229974, 4: 0.0}, 'percent_ll': {0: 0.0031071983428275505, 1: 0.001594896331738437, 2: 0.03505257886830245, 3: 0.006718346253229974, 4: 0.0}, 'percent_lc': {0: 0.9587884806355512, 1: 0.6226415094339622, 2: 0.9915590863952334, 3: 0.9607745779543198, 4: 0.2398212512413108}, 'finish_rank': {0: 25.0, 1: 17.0, 2: 20.5, 3: 19.0, 4: 35.0}, 'pd_rank': {0: 7.0, 1: 9.0, 2: 26.0, 3: 23.0, 4: 19.5}, 'fl_rank': {0: 28.0, 1: 21.0, 2: 8.0, 3: 22.0, 4: 35.0}, 'll_rank': {0: 19.0, 1: 24.0, 2: 6.0, 3: 16.0, 4: 31.0}, 'overall': {0: 79.0, 1: 71.0, 2: 60.5, 3: 80.0, 4: 120.5}, 'overall_rank': {0: 22.0, 1: 20.0, 2: 13.0, 3: 24.0, 4: 34.0}, 'pd_pts': {0: 3.875, 1: 3.5, 2: -2.0, 3: -0.5, 4: 0.0}, 'fl_pts': {0: 0.5455722423614707, 1: 1.0801435406698563, 2: 5.50150225338007, 3: 1.0111111111111108, 4: 0.0}, 'll_pts': {0: 0.2338166752977732, 1: 0.12001594896331738, 2: 2.6377065598397595, 3: 0.5055555555555555, 4: 0.0}, 'finish_pts': {0: 22.0, 1: 30.0, 2: 26.5, 3: 28.0, 4: 12.0}, 'total_pts': {0: 26.654388917659244, 1: 34.70015948963317, 2: 32.63920881321983, 3: 29.016666666666666, 4: 12.0}}
Advice on improving this is appreciated.
Set up your conditions:
c1 = (df.qualifying - df.avg_pd).lt(1)
c2 = (df.qualifying.gt(df.avg_start))
c3 = (df.qualifying.add(df.avg_pd).gt(40))
And your corresponding outputs:
o1 = df.qualifying.sub(1)
o2 = df.qualifying.sub(df.avg_finish)
o3 = 40 - df.qualifying
Using np.select:
df['pd_pq'] = np.select([c1, c2, c3], [o1, o2, o3], df.avg_pd)
Driver qualifying finish_pts total_pts pd_pq
0 A.J. Allmendinger 0.233817 ... 22.0 26.654389 3.875
1 Alex Bowman 0.120016 ... 30.0 34.700159 3.500
2 Aric Almirola 2.637707 ... 26.5 32.639209 -2.000
3 Austin Dillon 0.505556 ... 28.0 29.016667 -0.500
4 B.J. McLeod 0.000000 ... 12.0 12.000000 2.500
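For a self-contained check, the np.select approach can be run against the five rows shown in the question. This sketch assumes only the four input columns matter and uses df as the frame name, as in the answer above:

```python
import numpy as np
import pandas as pd

# Head of the original dataframe, reduced to the columns the rules use
df = pd.DataFrame({
    "Driver": ["A.J. Allmendinger", "Alex Bowman", "Aric Almirola",
               "Austin Dillon", "B.J. McLeod"],
    "qualifying": [16, 8, 13, 17, 36],
    "avg_start": [18.0, 14.5, 21.25, 18.875, 33.5],
    "avg_finish": [21.875, 18.0, 19.25, 18.375, 33.5],
    "avg_pd": [3.875, 3.5, -2.0, -0.5, 0.0],
})

# Conditions, in the same order as the if/elif chain
conditions = [
    (df["qualifying"] - df["avg_pd"]) < 1,
    df["qualifying"] > df["avg_start"],
    (df["qualifying"] + df["avg_pd"]) > 40,
]
# Matching outputs, one per condition
choices = [
    df["qualifying"] - 1,
    df["qualifying"] - df["avg_finish"],
    40 - df["qualifying"],
]

# Rows matching no condition fall back to avg_pd
df["pd_pq"] = np.select(conditions, choices, default=df["avg_pd"])
print(df[["Driver", "qualifying", "avg_pd", "pd_pq"]])
```

The conditions are evaluated in order and the first match wins, which mirrors the if/elif chain. Only the last row triggers a rule here (qualifying 36 is greater than avg_start 33.5), giving 36 - 33.5 = 2.5.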
I didn't run this as I didn't have the test data, but it should work, provided you import numpy as np:
import numpy as np
proj_file_pq['pd_pq'] = np.where(proj_file_pq['qualifying'] - proj_file_pq['avg_pd'] < 1, proj_file_pq['qualifying'] - 1,
    np.where(proj_file_pq['qualifying'] > proj_file_pq['avg_start'], proj_file_pq['qualifying'] - proj_file_pq['avg_finish'],
    np.where(proj_file_pq['qualifying'] + proj_file_pq['avg_pd'] > 40, 40 - proj_file_pq['qualifying'],
    proj_file_pq['avg_pd'])))
print(proj_file_pq[['Driver', 'avg_start', 'avg_finish', 'qualifying',
                    'avg_pd', 'pd_pq']].head())
You don't need to create proj_file_pq['pd_pq'] beforehand and set it equal to 0 with this method.
One heads-up I want to give you is about the warning: C:\Python36\lib\site-packages\pandas\core\indexing.py:189: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame
This usually happens for me when I've created multiple data frames without using reset_index() at the end of the command that creates the data frame. You may want to use that when creating your table to see if it gets rid of the warning. I normally use reset_index(drop=True) if there is already an ID column, to avoid creating redundant ID columns.
I hope this helps clear that up!
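As a side note, the pattern in the question, proj_file_pq['pd_pq'].iloc[key] = ..., is chained indexing, which is a common trigger for SettingWithCopyWarning. A minimal sketch of the single-step alternative, using a small made-up frame:

```python
import pandas as pd

df = pd.DataFrame({"qualifying": [16, 8, 36], "avg_pd": [3.875, 3.5, 0.0]})
df["pd_pq"] = 0.0

# Chained form from the question (two indexing steps, may write to a copy):
#     df["pd_pq"].iloc[2] = 2.5
# Single-step form: one .loc call selects the row and the column together.
df.loc[2, "pd_pq"] = 2.5
```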
