Pandas sklearn one-hot encoding dataframe or numpy? - python

How can I transform a pandas data frame to sklearn one-hot-encoded (dataframe / numpy array) where some columns do not require encoding?
mydf = pd.DataFrame({'Target':[0,1,0,0,1, 1,1],
'GroupFoo':[1,1,2,2,3,1,2],
'GroupBar':[2,1,1,0,3,1,2],
'GroupBar2':[2,1,1,0,3,1,2],
'SomeOtherShouldBeUnaffected':[2,1,1,0,3,1,2]})
columnsToEncode = ['GroupFoo', 'GroupBar']
This is an already label-encoded data frame, and I would like to one-hot encode only the columns listed in columnsToEncode.
My problem is that I am unsure whether a pd.DataFrame or the numpy array representation is better, and how to re-merge the encoded part with the remaining columns.
My attempts so far:
myEncoder = OneHotEncoder(sparse=False, handle_unknown='ignore')
myEncoder.fit(X_train)
df = pd.concat([
df[~columnsToEncode], # select all other / numeric
# select category to one-hot encode
pd.Dataframe(encoder.transform(X_train[columnsToEncode]))#.toarray() # not sure what this is for
], axis=1).reindex_axis(X_train.columns, axis=1)
Note: I am aware of pandas get_dummies (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html), but it does not play well with a train / test split where I need such an encoding per fold.

This library provides several categorical encoders which make sklearn / numpy play nicely with pandas: https://github.com/wdm0006/categorical_encoding
However, they do not yet support a "handle unknown category" option.
For now I will use:
myEncoder = OneHotEncoder(handle_unknown='ignore')
myEncoder.fit(df[columnsToEncode])
# transform() returns a sparse matrix by default; toarray() makes it dense
encoded = pd.DataFrame(myEncoder.transform(df[columnsToEncode]).toarray(), index=df.index)
pd.concat([df.drop(columns=columnsToEncode), encoded], axis=1)
This handles categories that were not seen during fitting. For now, I will stick with half pandas, half numpy because of the nice pandas labels for the numeric columns.

For One Hot Encoding I recommend using ColumnTransformer and OneHotEncoder instead of get_dummies. That's because OneHotEncoder returns an object which can be used to encode unseen samples using the same mapping that you used on your training data.
The following code encodes all the columns provided in the columns_to_encode variable:
import pandas as pd
import numpy as np
df = pd.DataFrame({'cat_1': ['A1', 'B1', 'C1'], 'num_1': [100, 200, 300],
'cat_2': ['A2', 'B2', 'C2'], 'cat_3': ['A3', 'B3', 'C3'],
'label': [1, 0, 0]})
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
columns_to_encode = [0, 2, 3] # Change here
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), columns_to_encode)], remainder='passthrough')
X = np.array(ct.fit_transform(X))
X:
array([[1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 100],
[0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 200],
[0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 300]], dtype=object)
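To illustrate the point about reusing the fitted mapping on unseen samples, here is a minimal sketch. The train and test frames are made up for illustration (they only borrow the column names from the question), and I am assuming handle_unknown='ignore' is the behaviour you want for categories that only show up in the test fold:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical train / test split; GroupFoo == 4 never appears in train.
train = pd.DataFrame({'GroupFoo': [1, 1, 2, 3],
                      'GroupBar': [2, 1, 1, 0],
                      'Other': [2, 1, 1, 0]})
test = pd.DataFrame({'GroupFoo': [2, 4],
                     'GroupBar': [1, 2],
                     'Other': [5, 6]})

columns_to_encode = ['GroupFoo', 'GroupBar']
ct = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(handle_unknown='ignore'), columns_to_encode)],
    remainder='passthrough',  # keep the other columns untouched
    sparse_threshold=0,       # always return a dense array
)

ct.fit(train)                 # learn the categories on the training fold only
X_train = ct.transform(train)
X_test = ct.transform(test)   # same column layout; the unseen category becomes all zeros
```

Both arrays end up with the same number of columns, which is the property get_dummies does not give you when it is called separately per fold.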
To avoid multicollinearity due to the dummy variable trap, I would also suggest removing one of the dummy columns produced for each column that you encoded. The following code encodes all the columns provided in the columns_to_encode variable AND removes the last dummy column of each one-hot encoded column:
import pandas as pd
import numpy as np
def sum_prev(l_in):
    l_out = []
    l_out.append(l_in[0])
    for i in range(len(l_in)-1):
        l_out.append(l_out[i] + l_in[i+1])
    return [e - 1 for e in l_out]
df = pd.DataFrame({'cat_1': ['A1', 'B1', 'C1'], 'num_1': [100, 200, 300],
'cat_2': ['A2', 'B2', 'C2'], 'cat_3': ['A3', 'B3', 'C3'],
'label': [1, 0, 0]})
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
columns_to_encode = [0, 2, 3] # Change here
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), columns_to_encode)], remainder='passthrough')
columns_to_encode = [df.iloc[:, del_idx].nunique() for del_idx in columns_to_encode]
columns_to_encode = sum_prev(columns_to_encode)
X = np.array(ct.fit_transform(X))
X = np.delete(X, columns_to_encode, 1)
X:
array([[1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 100],
[0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 200],
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 300]], dtype=object)
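As a possibly simpler alternative, OneHotEncoder has a drop parameter that removes one dummy per feature for you. The sketch below assumes a scikit-learn version that has this parameter, and note that drop='first' removes the first category of each column rather than the last, so the columns differ from the output above:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'cat_1': ['A1', 'B1', 'C1'], 'num_1': [100, 200, 300],
                   'cat_2': ['A2', 'B2', 'C2'], 'cat_3': ['A3', 'B3', 'C3']})

columns_to_encode = ['cat_1', 'cat_2', 'cat_3']
ct = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(drop='first'), columns_to_encode)],
    remainder='passthrough',  # num_1 is appended unchanged
    sparse_threshold=0,       # always return a dense array
)

# Each 3-category column yields 2 dummies, plus the passthrough column.
X = ct.fit_transform(df)
```

This avoids the manual index bookkeeping done by sum_prev and np.delete.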

I believe that this update to the initial answer is an even better way to perform dummy coding:
import logging
import pandas as pd
from sklearn.base import TransformerMixin
log = logging.getLogger(__name__)
class CategoricalDummyCoder(TransformerMixin):
    """Identifies categorical columns by dtype of object and dummy codes them. Optionally a pandas.DataFrame
    can be returned where categories are of pandas.Category dtype and not binarized for better coding strategies
    than dummy coding."""

    def __init__(self, only_categoricals=False):
        self.categorical_variables = []
        self.categories_per_column = {}
        self.only_categoricals = only_categoricals

    def fit(self, X, y=None):
        # y is optional so that fit_transform(X) also works
        self.categorical_variables = list(X.select_dtypes(include=['object']).columns)
        log.debug(f'identified the following categorical variables: {self.categorical_variables}')
        for col in self.categorical_variables:
            # remember the categories seen during fitting
            self.categories_per_column[col] = X[col].astype('category').cat.categories
        log.debug('fitted categories')
        return self

    def transform(self, X):
        X = X.copy()  # do not modify the caller's data frame in place
        for col in self.categorical_variables:
            log.debug(f'transforming cat col: {col}')
            # values not seen during fit become NaN
            X[col] = pd.Categorical(X[col], categories=self.categories_per_column[col])
            if self.only_categoricals:
                X[col] = X[col].cat.codes
        if not self.only_categoricals:
            return pd.get_dummies(X, sparse=True)
        else:
            return X
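The core trick in the class above can be shown in isolation. This is only a sketch with made-up data (the color and n columns are hypothetical): fixing the category list learned on the training data makes get_dummies produce the same columns for any later frame, and values that were never seen end up as all-zero rows.

```python
import pandas as pd

# Hypothetical data; 'purple' never occurs in the training frame.
train = pd.DataFrame({'color': ['red', 'green', 'blue'], 'n': [1, 2, 3]})
test = pd.DataFrame({'color': ['green', 'purple'], 'n': [4, 5]})

# Categories learned on the training data only.
categories = train['color'].astype('category').cat.categories

test_fixed = test.copy()
# Unseen values become NaN because they are not in the fixed category list.
test_fixed['color'] = pd.Categorical(test_fixed['color'], categories=categories)

# One dummy column per learned category, even for categories absent from test.
dummies = pd.get_dummies(test_fixed)
dummy_cols = ['color_blue', 'color_green', 'color_red']
```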

Related

Converting rdd of numpy arrays to pyspark dataframe using Vectors [duplicate]

I have a DenseVector RDD like this
>>> frequencyDenseVectors.collect()
[DenseVector([1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0]), DenseVector([1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]), DenseVector([1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]), DenseVector([0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0])]
I want to convert this into a Dataframe. I tried like this
>>> spark.createDataFrame(frequencyDenseVectors, ['rawfeatures']).collect()
It gives an error like this
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/BIG-DATA/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/session.py", line 520, in createDataFrame
rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
File "/opt/BIG-DATA/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/session.py", line 360, in _createFromRDD
struct = self._inferSchema(rdd, samplingRatio)
File "/opt/BIG-DATA/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/session.py", line 340, in _inferSchema
schema = _infer_schema(first)
File "/opt/BIG-DATA/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/types.py", line 991, in _infer_schema
fields = [StructField(k, _infer_type(v), True) for k, v in items]
File "/opt/BIG-DATA/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/types.py", line 968, in _infer_type
raise TypeError("not supported type: %s" % type(obj))
TypeError: not supported type: <type 'numpy.ndarray'>
old Solution
frequencyVectors.map(lambda vector: DenseVector(vector.toArray()))
Edit 1 - Reproducible code
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext, Row
from pyspark.sql.functions import split
from pyspark.ml.feature import CountVectorizer
from pyspark.mllib.clustering import LDA, LDAModel
from pyspark.mllib.linalg import Vectors
from pyspark.ml.feature import HashingTF, IDF, Tokenizer
from pyspark.mllib.linalg import SparseVector, DenseVector
sqlContext = SQLContext(sparkContext=spark.sparkContext, sparkSession=spark)
sc.setLogLevel('ERROR')
sentenceData = spark.createDataFrame([
(0, "Hi I heard about Spark"),
(0, "I wish Java could use case classes"),
(1, "Logistic regression models are neat")
], ["label", "sentence"])
sentenceData = sentenceData.withColumn("sentence", split("sentence", r"\s+"))  # raw string avoids an invalid escape sequence
sentenceData.show()
vectorizer = CountVectorizer(inputCol="sentence", outputCol="rawfeatures").fit(sentenceData)
countVectors = vectorizer.transform(sentenceData).select("label", "rawfeatures")
idf = IDF(inputCol="rawfeatures", outputCol="features")
idfModel = idf.fit(countVectors)
tfidf = idfModel.transform(countVectors).select("label", "features")
frequencyDenseVectors = tfidf.rdd.map(lambda vector: [vector[0],DenseVector(vector[1].toArray())])
frequencyDenseVectors.map(lambda x: (x, )).toDF(["rawfeatures"])
You cannot convert RDD[Vector] directly. It should be mapped to an RDD of objects which can be interpreted as structs, for example RDD[Tuple[Vector]]:
frequencyDenseVectors.map(lambda x: (x, )).toDF(["rawfeatures"])
Otherwise Spark will try to convert the object's __dict__ and will end up using an unsupported NumPy array as a field.
from pyspark.ml.linalg import DenseVector
from pyspark.sql.types import _infer_schema
v = DenseVector([1, 2, 3])
_infer_schema(v)
TypeError Traceback (most recent call last)
...
TypeError: not supported type: <class 'numpy.ndarray'>
vs.
_infer_schema((v, ))
StructType(List(StructField(_1,VectorUDT,true)))
Notes:
In Spark 2.0 you have to use correct local types:
pyspark.ml.linalg when working with the DataFrame-based pyspark.ml API.
pyspark.mllib.linalg when working with the RDD-based pyspark.mllib API.
These two namespaces are no longer compatible and require explicit conversions (for example How to convert from org.apache.spark.mllib.linalg.VectorUDT to ml.linalg.VectorUDT).
The code provided in the edit is not equivalent to the one from the original question. You should be aware that tuple and list don't have the same semantics. If you map each vector to a pair, use a tuple and convert directly to a DataFrame:
tfidf.rdd.map(
lambda row: (row[0], DenseVector(row[1].toArray()))
).toDF()
Using a tuple (product type) would work for a nested structure as well, but I doubt this is what you want:
(tfidf.rdd
.map(lambda row: (row[0], DenseVector(row[1].toArray())))
.map(lambda x: (x, ))
.toDF())
A list anywhere other than the top-level row is interpreted as an ArrayType.
It is much cleaner to use an UDF for conversion (Spark Python: Standard scaler error "Do not support ... SparseVector").
I believe the problem here is that createDataFrame does not accept a DenseVector as an argument. Please try to convert the DenseVector into a corresponding collection (i.e. an Array or List). In Scala and Java the toArray() method is available, so you can convert the DenseVector to an array or list and then try to create the DataFrame.

Creating a new tensor based on the old

I have the following code:
import numpy as np
import tensorflow as tf
a = np.array([0.5, 0.5])
b = np.array([0.2, 0.2, 0.0, 0.0])
non_zeros = ~tf.equal(b, 0.)
cast_op = tf.cast(non_zeros, tf.float64)
new_vec = tf.multiply(a, cast_op) # won't work
# the required output is [0.5, 0.5, 0.0, 0.0]
I am trying to obtain the vector [0.5, 0.5, 0.0, 0.0] as explained in the code. Does anyone know how to do this? I also looked at tf.fill but that takes a scalar value, so won't work for me.
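You get an error because tf.multiply needs shapes that are either identical or broadcast-compatible, and (2,) and (4,) are neither. What you could do, however, is pad a with zeros in NumPy and multiply it by a 0/1 mask built from b:
import numpy as np
a = np.array([0.5, 0.5])
b = np.array([0.2, 0.2, 0.0, 0.0])
# 1.0 where b is non-zero, 0.0 elsewhere
b = np.logical_and(b, np.ones(b.shape)).astype(float)
# pad a with zeros up to the length of b
a = np.concatenate((a, np.zeros(b.shape[0] - a.shape[0])))
new_vec = a * b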
You can exploit the broadcasting capability of the tf.multiply op.
I've added the shape of each tensor next to every line. Please note the use of tf.expand_dims to add a dimension of size 1 to the tensor a so that, after the multiplication, we get a tensor with shape (2, 4).
The two rows of this tensor are identical (each holds the same 4 values), hence we can just take the first row:
import numpy as np
import tensorflow as tf
a = np.array([0.5, 0.5]) #(2)
b = np.array([0.2, 0.2, 0.0, 0.0]) #(4)
non_zeros = ~tf.equal(b, 0.) #(4)
cast_op = tf.cast(non_zeros, tf.float64) # (4)
new_vec = tf.multiply(tf.expand_dims(a, axis=[1]),
cast_op) # (2, 1) * (4) = (2, 4)
new_vec = new_vec[0, :] # (4)
print(new_vec)
sess = tf.InteractiveSession()
print(sess.run(new_vec))
This code produces [0.5 0.5 0. 0.]
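The shape arithmetic can be checked without a TensorFlow session. The sketch below uses plain NumPy, which follows the same broadcasting rules; it is meant as an illustration of the shapes, not as a replacement for the TensorFlow code.

```python
import numpy as np

a = np.array([0.5, 0.5])            # shape (2,)
b = np.array([0.2, 0.2, 0.0, 0.0])  # shape (4,)

# 1.0 where b is non-zero, 0.0 elsewhere; shape (4,)
mask = (b != 0).astype(np.float64)

# (2, 1) * (4,) broadcasts to (2, 4): row i is a[i] * mask
product = a[:, np.newaxis] * mask

# Row 0 only involves a[0]; the rows coincide here because a[0] == a[1].
new_vec = product[0, :]
```

Keep in mind that taking the first row relies on all entries of a being equal, as they are in the question.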

Printing the average of floats from a list that contains both floats and strings

I'm performing sentiment analysis on Tweets I've collected, and each outcome looks like this, depending on the amount of tweets:
['pos', 0.8, 'neg', 1.0, 'pos', 1.0, 'pos', 1.0, 'pos', 1.0, 'pos', 1.0, 'neg', 1.0]
The floats stand for the confidence%, and I want to calculate & print the average of all of them from this list, but I'm having quite some trouble with it.
This is one way:
A = ['pos', 0.8, 'neg', 1.0, 'pos', 1.0, 'pos', 1.0, 'pos', 1.0, 'pos', 1.0, 'neg', 1.0]
res = sum(A[1::2]) / (len(A) / 2)
print(res)
0.9714285714285714
Or if you would rather not create a new list:
from itertools import islice
res = sum(islice(A, 1, None, 2)) / (len(A) / 2)
Alternatively, you can use statistics.mean, also in the standard library:
from statistics import mean
res = mean(A[1::2])
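If you cannot rely on the list alternating strictly between label and score, a variant is to pick the numbers by type instead of by position. This is a sketch that assumes every confidence value is stored as a float:

```python
from statistics import mean

A = ['pos', 0.8, 'neg', 1.0, 'pos', 1.0, 'pos', 1.0, 'pos', 1.0, 'pos', 1.0, 'neg', 1.0]

# Keep only the float entries, wherever they sit in the list.
confidences = [x for x in A if isinstance(x, float)]
res = mean(confidences)
print(res)
```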

What is the simplest way to convert vector to Toeplitz matrix in TensorFlow?

I would like to convert a vector to a symmetric Toeplitz matrix using Tensorflow operations like this:
a = tf.placeholder(tf.float32, shape=[vector_size])
A = some_tensorflow_operation(a)
where the shape of A is [vector_size, vector_size]. The relation between the two variables is as below.
a = [a1,a2,a3]
A = [[a1,a2,a3],[a2,a1,a2],[a3,a2,a1]]
What is the simplest way to do it?
In case vector_size=3:
>>> a = tf.placeholder(tf.float32, shape=[vector_size])
>>> A = [[a[0],a[1],a[2]],[a[1],a[0],a[1]],[a[2],a[1],a[0]]]
>>> sess = tf.Session()
>>> sess.run(A, {a: [1, 2, 3]})
[[1.0, 2.0, 3.0], [2.0, 1.0, 2.0], [3.0, 2.0, 1.0]]
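For an arbitrary vector_size, one option is to build the index matrix |i - j| and use it to look up entries of the vector. The sketch below shows the idea in NumPy so it runs without a session; in TensorFlow I would expect tf.gather(a, idx) to play the role of the fancy indexing, but treat that as an assumption rather than tested code.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
n = a.shape[0]

# idx[i, j] = |i - j|, i.e. the distance from the main diagonal
positions = np.arange(n)
idx = np.abs(positions[:, np.newaxis] - positions[np.newaxis, :])

# Entry (i, j) of the symmetric Toeplitz matrix is a[|i - j|]
A = a[idx]
```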

When normalizing list of numbers, result is all zeros

I need to normalise the values in a list to produce a (cumulative) probability distribution, but currently I'm just getting 0s out.
Here's what I'm doing:
tests = []
#some code to populate tests which simulates
count = [x[0] for x in tests]
found = [x[1] for x in tests]
found.sort()
num = Counter(found)
freqs = [x for x in num.values()]
cumsum = [sum(item for item in freqs[0:rank+1]) for rank in xrange(len(freqs))]
normcumsum = [float(x/numtests) for x in cumsum]
Currently cumsum and normcumsum are:
cumsum = [1, 2, 6, 12, 28, 39, 64, 85, 96, 98, 99, 100]
normcumsum = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
How do I get normcumsum to contain cumsum/100?
N.B. Yes, these variable names are a little stupid.
x/numtests returns 0 whenever x is smaller than numtests, much like 1/2 returns 0, because both operands are integers and Python 2 then performs integer division.
You must do float(x)/numtests, or do:
from __future__ import division
This is only necessary in python2, not python3.
Demo:
>>> [1/2, 3/2, 5/2]
[0, 1, 2]
>>> from __future__ import division
>>> [1/2, 3/2, 5/2]
[0.5, 1.5, 2.5]
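For completeness, here is how the whole computation could look in Python 3, where / is always true division. This is a sketch: the freqs values are reconstructed from the differences of the cumsum shown in the question, and numtests = 100 is assumed from the "cumsum/100" remark.

```python
from itertools import accumulate

# Frequencies reconstructed from the cumsum in the question (assumption).
freqs = [1, 1, 4, 6, 16, 11, 25, 21, 11, 2, 1, 1]
numtests = 100  # assumed total number of tests

# Running total of the frequencies
cumsum = list(accumulate(freqs))

# In Python 3, int / int yields a float, so no explicit cast is needed.
normcumsum = [x / numtests for x in cumsum]
```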
When both operands of a division are integers, Python 2 automatically rounds the result down to an integer, so you need to make one of them a float. For example, change "float(x/numtests)" to "float(x)/numtests".
