Feature selection allows selecting the most relevant features for use in model construction. It reduces the size of the vector space and, in turn, the complexity of any subsequent operation on vectors. The number of features to select can be tuned using a held-out validation set.
One way to do what you are seeking is to use ElementwiseProduct.
ElementwiseProduct multiplies each input vector by a provided "weight" vector, using element-wise multiplication; in other words, it scales each column of the dataset by a scalar multiplier. This is the Hadamard product of the input vector v and the transforming vector w, yielding a result vector.
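For intuition, the Hadamard product can be sketched in plain Scala, independent of Spark (the vector values below are arbitrary example numbers):

```scala
// Minimal sketch of the Hadamard (element-wise) product:
// each component of v is multiplied by the matching component of w.
val v = Array(1.0, 2.0, 3.0) // input vector
val w = Array(0.5, 2.0, 1.0) // transforming ("weight") vector
val result = v.zip(w).map { case (a, b) => a * b }
println(result.mkString("[", ", ", "]")) // prints [0.5, 4.0, 3.0]
```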
So if we set the weights of the features we want to keep to 1.0 and the others to 0.0, the ElementwiseProduct of the original vector and the 0-1 weight vector will select exactly the features we need:
import org.apache.spark.mllib.feature.ElementwiseProduct
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Create a dummy LabeledPoint RDD
val data = sc.parallelize(Array(
  LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0, 5.0, 1.0)),
  LabeledPoint(1.0, Vectors.dense(4.0, 5.0, 6.0, 1.0, 2.0)),
  LabeledPoint(0.0, Vectors.dense(4.0, 2.0, 3.0, 0.0, 2.0))))

data.toDF.show
// +-----+--------------------+
// |label| features|
// +-----+--------------------+
// | 1.0|[1.0,0.0,3.0,5.0,...|
// | 1.0|[4.0,5.0,6.0,1.0,...|
// | 0.0|[4.0,2.0,3.0,0.0,...|
// +-----+--------------------+
// You'll need to know how many features you have; I have used 5 for the example
val numFeatures = 5

// The indices represent the features we want to keep
// Note: indices start at 0, so here you are actually keeping features 4 and 5.
val indices = List(3, 4).toArray

// Now we can create our weights vector
val weights = Array.fill[Double](indices.size)(1.0)

// Create the sparse vector of the features we need to keep.
val transformingVector = Vectors.sparse(numFeatures, indices, weights)

// Init our vector transformer
val transformer = new ElementwiseProduct(transformingVector)

// Apply it to the data.
val transformedData = data.map(x => LabeledPoint(x.label, transformer.transform(x.features).toSparse))

transformedData.toDF.show
// +-----+-------------------+
// |label| features|
// +-----+-------------------+
// | 1.0|(5,[3,4],[5.0,1.0])|
// | 1.0|(5,[3,4],[1.0,2.0])|
// | 0.0| (5,[4],[2.0])|
// +-----+-------------------+
Note:
- As you may have noticed, I used the sparse vector representation for space optimization.
- The resulting features are sparse vectors.
Source: stackoverflow.com