I want to split the following pivot table into training and testing sets (to evaluate a recommendation system), and was thinking of extracting two tables with non-overlapping indices (userID) and column values (ISBN). How can I split it properly? Thank you.
As suggested by #moys, you can use train_test_split from scikit-learn after first splitting your dataframe into two sets of non-overlapping columns.
Example:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
Generate data:
df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
Split the df columns in some way, e.g. in half:
cols = int(len(df.columns)/2)
df_A = df.iloc[:, 0:cols]
df_B = df.iloc[:, cols:]
Use train_test_split:
train_A, test_A = train_test_split(df_A, test_size=0.33)
train_B, test_B = train_test_split(df_B, test_size=0.33)
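If you also need the row indices (userID) to be disjoint between the two sets, a minimal sketch (assuming the pivot table is a plain DataFrame with userID as the index and ISBN as the columns) could split the rows and columns independently and then slice:
row_train, row_test = train_test_split(df.index.to_numpy(), test_size=0.33)
col_train, col_test = train_test_split(df.columns.to_numpy(), test_size=0.33)
train = df.loc[row_train, col_train]  # training users x training items
test = df.loc[row_test, col_test]     # disjoint users x disjoint items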
I'm trying to normalize a dataset to the range [-1, +1], and the code I wrote can normalize column by column. Could you tell me how to normalize row by row?
from sklearn import preprocessing
import pandas as pd
df = pd.read_csv('/-----.csv')
df_max_scaled = df.copy()
for column in df.columns:
    df_max_scaled[column] = df_max_scaled[column] / df_max_scaled[column].abs().max()
You could use apply with axis=1 which will process the DF row-by-row:
df.apply(lambda x: x/x.abs().max(), axis=1)
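For example, on a small frame of random (hypothetical) data you can assign the result back; every row then lies within [-1, +1]:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(3, 4), columns=list('ABCD'))
df_row_norm = df.apply(lambda x: x / x.abs().max(), axis=1)  # divide each row by its largest absolute value
print(df_row_norm.abs().max(axis=1))  # prints 1.0 for every row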
I'm using PySpark's ChiSqSelector to select the most important features. The code is running well, however I can't verify what my features are in terms of index or name.
So my question is: How can I identify what the values in selectedFeatures are referring to?
In the sample code below I use only four columns to make the result easier to visualize; however, I have to do this for a DataFrame with almost 100 columns.
df=df.select("IsBeta","AVProductStatesIdentifier","IsProtected","Firewall","HasDetections")
from pyspark.ml.feature import VectorAssembler, ChiSqSelector
vec_assembler = VectorAssembler(inputCols = ["IsBeta","AVProductStatesIdentifier","IsProtected","Firewall"], outputCol="features")
vec_df = vec_assembler.transform(df)
selector = ChiSqSelector(featuresCol='features', fpr=0.05, outputCol="selectedFeatures",labelCol= "HasDetections")
result = selector.fit(vec_df).transform(vec_df)
result.show()
And yet, even after trying to apply the solution I found in this question, I still cannot tell which columns are selected, by name or by index. That is, which features are actually being selected?
model = selector.fit(vec_df)
model.selectedFeatures
First: please don't use one-hot encoded features; ChiSqSelector should be used directly on categorical (non-encoded) columns, as you can see here.
Without the one-hot encoding, the selector usage is straightforward. Let's look at how the ChiSqSelector is used and how to find the relevant features by name.
As example usage, I'll create a df with only 2 relevant columns (AVProductStatesIdentifier and Firewall); the other 2 (IsBeta and IsProtected) will be constant:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
from pyspark.sql.functions import col, create_map, lit
from itertools import chain
import numpy as np
import pandas as pd
#create df
df_p = pd.DataFrame([np.ones(1000, dtype=int),
np.ones(1000, dtype=int),
np.random.randint(0,500, 1000, dtype=int),
np.random.randint(0,2, 1000, dtype=int)
], index=['IsBeta', 'IsProtected', 'Firewall', 'HasDetections']).T
df_p['AVProductStatesIdentifier'] = np.random.choice(['a', 'b', 'c'], 1000)
schema=StructType([StructField("IsBeta",IntegerType(),True),
StructField("AVProductStatesIdentifier",StringType(),True),
StructField("IsProtected",IntegerType(),True),
StructField("Firewall",IntegerType(),True),
StructField("HasDetections",IntegerType(),True),
])
df = spark.createDataFrame(
df_p[['IsBeta', 'AVProductStatesIdentifier', 'IsProtected', 'Firewall', 'HasDetections']],
schema
)
First, let's make the column AVProductStatesIdentifier categorical:
mapping = {l.AVProductStatesIdentifier:i for i,l in enumerate(df.select('AVProductStatesIdentifier').distinct().collect())}
mapping_expr = create_map([lit(x) for x in chain(*mapping.items())])
df = df.withColumn("AVProductStatesIdentifier", mapping_expr.getItem(col("AVProductStatesIdentifier")))
Now, let's assemble that and select the 2 most important columns
from pyspark.ml.feature import VectorAssembler, ChiSqSelector
vec_assembler = VectorAssembler(inputCols = ["IsBeta","AVProductStatesIdentifier","IsProtected","Firewall"], outputCol="features")
vec_df = vec_assembler.transform(df)
selector = ChiSqSelector(numTopFeatures=2,featuresCol='features', fpr=0.05, outputCol="selectedFeatures",labelCol= "HasDetections")
model = selector.fit(vec_df)
Now execute:
np.array(df.columns)[model.selectedFeatures]
which results in
array(['AVProductStatesIdentifier', 'Firewall'], dtype='<U25')
The two non-constant columns.
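Note that indexing into df.columns only works because the label column HasDetections happens to come after the four feature columns. A safer sketch (reusing the vec_assembler defined above) is to index the assembler's input columns instead:
# map the selected indices back to names via the assembler's input columns
feature_cols = np.array(vec_assembler.getInputCols())
print(feature_cols[model.selectedFeatures])  # ['AVProductStatesIdentifier' 'Firewall']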
I'm trying to use scikit-learn's preprocessing to min-max scale a column of a pandas DataFrame. My solution works but gives me 2 warnings and I am wondering if there is a better way to do it.
Here is my function, which does the min-max scaling given a dataframe and a list of columns:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
def minMaxScale(df, cols):
    scaler = MinMaxScaler()
    return scaler.fit_transform(df[cols])
This is where I use it
df.loc[:,'pct_mm'] = minMaxScale(df,['pct'])
Where the column 'pct' exists and 'pct_mm' does not exist.
I get the following warning 2 times:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
How should I do this the way pandas wants me to?
Cannot reproduce the warnings:
import pandas as pd
import seaborn as sns
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
def minMaxScale(df, cols):
    scaler = MinMaxScaler()
    return scaler.fit_transform(df[cols])
df = sns.load_dataset('iris')
df.loc[:, 'newcolumn'] = minMaxScale(df, ['sepal_length'])
However, if I do this:
df = sns.load_dataset('iris')
df2 = df[:]
df2.loc[:, 'newcolumn'] = minMaxScale(df, ['sepal_length'])
then I get two warnings as well.
Probably you derived df from another dataframe somewhere in your code. I recommend finding the lines where you created df and making sure to take a copy, e.g. df = old_df.copy().
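For example, a minimal sketch (reusing the minMaxScale helper above) shows that the warning disappears once the derived frame is an explicit copy:
import seaborn as sns
df = sns.load_dataset('iris')
df2 = df[:].copy()  # explicit copy instead of a view of df
df2.loc[:, 'newcolumn'] = minMaxScale(df2, ['sepal_length'])  # no SettingWithCopyWarning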
I'm trying to load a sklearn dataset and, judging by the keys (target_names, target & DESCR), a column is missing. I have tried various ways to include the last column, but I get errors.
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
print(cancer.keys())
The keys are ['target_names', 'data', 'target', 'DESCR', 'feature_names'].
data = pd.DataFrame(cancer.data, columns=cancer.feature_names)
print(data.describe())
With the code above, it only returns 30 columns, when I need 31. What is the best way to load scikit-learn datasets into a pandas DataFrame?
Another option, as a one-liner, to create a dataframe including both the features and the target variable:
import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
df = pd.DataFrame(np.c_[cancer['data'], cancer['target']],
                  columns=np.append(cancer['feature_names'], ['target']))
If you want a target column you will need to add it yourself, because it's not in cancer.data. cancer.target holds the values 0 or 1, and cancer.target_names holds the label names. I hope the following is what you want:
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
print(cancer.keys())
data = pd.DataFrame(cancer.data, columns=cancer.feature_names)
print(data.describe())
data = data.assign(target=pd.Series(cancer.target))
print(data.describe())
# In case you want labels instead of numbers.
data.replace(to_replace={'target': {0: cancer.target_names[0]}}, inplace=True)
data.replace(to_replace={'target': {1: cancer.target_names[1]}}, inplace=True)
print(data.shape)  # data.describe() won't show the "target" column here because I converted its values to strings.
This works too, also using pd.Series.
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
print(cancer.keys())
data = pd.DataFrame(cancer.data, columns=cancer.feature_names)
data['Target'] = pd.Series(data=cancer.target, index=data.index)
print(data.keys())
print(data.shape)
Only the target column is missing, so you can just add it:
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
df['target'] = cancer.target
Mapping the target names can be handled elegantly using map():
data["target"] = pd.Categorical(pd.Series(cancer.target).map(lambda x: cancer.target_names[x]))
As of scikit-learn 0.23 you can do the following to get a DataFrame with the target column included.
df = load_breast_cancer(as_frame=True)
df.frame
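For example (assuming scikit-learn >= 0.23), the returned Bunch exposes the complete frame with the target column already appended:
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer(as_frame=True)
df = data.frame   # 30 feature columns plus 'target'
print(df.shape)   # (569, 31)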
Suppose I have a pandas data frame surveyData:
I want to normalize the data in each column by performing:
surveyData_norm = (surveyData - surveyData.mean()) / (surveyData.max() - surveyData.min())
This would work fine if my data table only contained the columns I wanted to normalize. However, the table starts with some columns containing string data, like:
Name State Gender Age Income Height
Sam CA M 13 10000 70
Bob AZ M 21 25000 55
Tom FL M 30 100000 45
I only want to normalize the Age, Income, and Height columns, but my method above does not work because of the string data in the Name, State, and Gender columns.
You can perform operations on a subset of rows or columns in pandas in a number of ways. One useful way is indexing:
# Assuming same lines from your example
cols_to_norm = ['Age','Height']
survey_data[cols_to_norm] = survey_data[cols_to_norm].apply(lambda x: (x - x.min()) / (x.max() - x.min()))
This will apply it to only the columns you desire and assign the result back to those columns. Alternatively you could set them to new, normalized columns and keep the originals if you want.
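A small sketch of that second option, keeping the originals and writing the scaled values to new columns (the _norm suffix is just illustrative):
for c in cols_to_norm:
    survey_data[c + '_norm'] = (survey_data[c] - survey_data[c].min()) / (survey_data[c].max() - survey_data[c].min())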
I think it's better to use sklearn.preprocessing in this case, which gives us many more scaling options. In your case, using StandardScaler, it would look like:
from sklearn.preprocessing import StandardScaler
cols_to_norm = ['Age','Height']
surveyData[cols_to_norm] = StandardScaler().fit_transform(surveyData[cols_to_norm])
A simple and more efficient way is to pre-calculate the mean, max, and min once.
dropna() avoids problems with missing data.
mean_age = survey_data.Age.dropna().mean()
max_age = survey_data.Age.dropna().max()
min_age = survey_data.Age.dropna().min()
survey_data['Age'] = survey_data['Age'].apply(lambda x: (x - mean_age) / (max_age - min_age))
This way it will work.
I think it's really nice to use the built-in functions:
# Assuming same lines from your example
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
cols_to_norm = ['Age','Height']
survey_data[cols_to_norm] = scaler.fit_transform(survey_data[cols_to_norm])
MinMax normalize all numeric columns with minmax_scale
import numpy as np
from sklearn.preprocessing import minmax_scale
# cols = ['Age', 'Height']
cols = df.select_dtypes(np.number).columns
df[cols] = minmax_scale(df[cols])
Note: this keeps the index, column names, and non-numerical columns unchanged.
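A quick demonstration of that note on a tiny hypothetical frame (reusing the imports above): the string column passes through untouched while the numeric columns end up in [0, 1]:
import pandas as pd
df = pd.DataFrame({'Name': ['Sam', 'Bob', 'Tom'],
                   'Age': [13, 21, 30],
                   'Height': [70, 55, 45]})
cols = df.select_dtypes(np.number).columns
df[cols] = minmax_scale(df[cols])
print(df)  # 'Name' unchanged; 'Age' and 'Height' scaled to [0, 1]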
import pandas as pd
import numpy as np
# let `dataset` be your DataFrame
from sklearn.preprocessing import MinMaxScaler
minmax = MinMaxScaler()
for x in dataset.columns[dataset.dtypes == 'int64']:
    dataset[x] = minmax.fit_transform(np.array(dataset[x]).reshape(-1, 1))