
Accepted answer

Feature selection allows selecting the most relevant features for use in model construction. It reduces the size of the vector space and, in turn, the complexity of any subsequent operation on vectors. The number of features to select can be tuned using a held-out validation set.

One way to do what you are seeking is to use ElementwiseProduct.

ElementwiseProduct multiplies each input vector by a provided "weight" vector, using element-wise multiplication. In other words, it scales each column of the dataset by a scalar multiplier. This is the Hadamard product between the input vector, v, and the transforming vector, w, yielding a result vector.

So if we set the weight of the features we want to keep to 1.0 and the others to 0.0, the ElementwiseProduct of the original vector and the 0-1 weight vector will select exactly the features we need:
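Before the Spark version, here is a plain-Scala sketch (no Spark required) of what this 0-1 masking does; the array values are taken from the first row of the example data below:

```scala
// Element-wise multiplication of an input vector by a 0-1 weight vector:
// a weight of 1.0 keeps a feature, a weight of 0.0 zeroes it out.
val input   = Array(1.0, 0.0, 3.0, 5.0, 1.0)
val weights = Array(0.0, 0.0, 0.0, 1.0, 1.0)

val masked = input.zip(weights).map { case (v, w) => v * w }
// masked == Array(0.0, 0.0, 0.0, 5.0, 1.0): only features at indices 3 and 4 survive
```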

import org.apache.spark.mllib.feature.ElementwiseProduct
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Create a dummy LabeledPoint RDD
val data = sc.parallelize(Array(
  LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0, 5.0, 1.0)),
  LabeledPoint(1.0, Vectors.dense(4.0, 5.0, 6.0, 1.0, 2.0)),
  LabeledPoint(0.0, Vectors.dense(4.0, 2.0, 3.0, 0.0, 2.0))))

data.toDF.show

// +-----+--------------------+
// |label|            features|
// +-----+--------------------+
// |  1.0|[1.0,0.0,3.0,5.0,...|
// |  1.0|[4.0,5.0,6.0,1.0,...|
// |  0.0|[4.0,2.0,3.0,0.0,...|
// +-----+--------------------+

// You'll need to know how many features you have; I have used 5 for the example
val numFeatures = 5

// The indices represent the features we want to keep.
// Note: indices start at 0, so here you are actually keeping the 4th and 5th features.
val indices = List(3, 4).toArray

// Now we can create our weight vector
val weights = Array.fill[Double](indices.size)(1.0)

// Create the sparse vector of the features we want to keep.
val transformingVector = Vectors.sparse(numFeatures, indices, weights)

// Init our vector transformer
val transformer = new ElementwiseProduct(transformingVector)

// Apply it to the data.
val transformedData = data.map(x => LabeledPoint(x.label, transformer.transform(x.features).toSparse))

transformedData.toDF.show

// +-----+-------------------+
// |label|           features|
// +-----+-------------------+
// |  1.0|(5,[3,4],[5.0,1.0])|
// |  1.0|(5,[3,4],[1.0,2.0])|
// |  0.0|      (5,[4],[2.0])|
// +-----+-------------------+

Note:

  • I used the sparse vector representation for space optimization.
  • The resulting features are therefore sparse vectors.
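To read the output above: a sparse vector like `(5,[3,4],[5.0,1.0])` is a (size, indices, values) triple, the same layout `Vectors.sparse` takes. A small plain-Scala sketch, using a hypothetical `toDense` helper (not part of Spark) to expand such a triple back to its dense form:

```scala
// Hypothetical helper: expand a (size, indices, values) sparse triple
// into the dense array it represents. Unlisted positions are 0.0.
def toDense(size: Int, indices: Array[Int], values: Array[Double]): Array[Double] = {
  val dense = Array.fill(size)(0.0)
  indices.zip(values).foreach { case (i, v) => dense(i) = v }
  dense
}

// The first output row (5,[3,4],[5.0,1.0]) expands to:
toDense(5, Array(3, 4), Array(5.0, 1.0))
// Array(0.0, 0.0, 0.0, 5.0, 1.0)
```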
