score:73
I had the same issue, and the following syntax worked for me:
df.filter(df("state")==="TX").show()
I'm using Spark 1.6.
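For completeness, a minimal self-contained sketch with toy data (a hedged illustration; the same call works on later Spark versions too):
import spark.implicits._ // assumes a SparkSession named spark
val df = Seq(("Austin", "TX"), ("Boston", "MA")).toDF("city", "state")
df.filter(df("state") === "TX").show() // keeps only the Austin/TX row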
score:-1
In Spark 2.4:
To compare with a single value:
df.filter(lower(trim($"col_name")) === "<value>").show()
To compare against a collection of values (the original snippet used Java's new HashSet<>(Arrays.asList(...)), which is not valid Scala; isInCollection takes a Scala Iterable):
df.filter($"col_name".isInCollection(Seq("value1", "value2"))).show()
score:1
Here is a complete example using Spark 2.2+, taking the data in as JSON:
val myjson = "[{\"name\":\"Alabama\",\"abbreviation\":\"AL\"},{\"name\":\"Alaska\",\"abbreviation\":\"AK\"},{\"name\":\"American Samoa\",\"abbreviation\":\"AS\"},{\"name\":\"Arizona\",\"abbreviation\":\"AZ\"},{\"name\":\"Arkansas\",\"abbreviation\":\"AR\"},{\"name\":\"California\",\"abbreviation\":\"CA\"},{\"name\":\"Colorado\",\"abbreviation\":\"CO\"},{\"name\":\"Connecticut\",\"abbreviation\":\"CT\"},{\"name\":\"Delaware\",\"abbreviation\":\"DE\"},{\"name\":\"District Of Columbia\",\"abbreviation\":\"DC\"},{\"name\":\"Federated States Of Micronesia\",\"abbreviation\":\"FM\"},{\"name\":\"Florida\",\"abbreviation\":\"FL\"},{\"name\":\"Georgia\",\"abbreviation\":\"GA\"},{\"name\":\"Guam\",\"abbreviation\":\"GU\"},{\"name\":\"Hawaii\",\"abbreviation\":\"HI\"},{\"name\":\"Idaho\",\"abbreviation\":\"ID\"},{\"name\":\"Illinois\",\"abbreviation\":\"IL\"},{\"name\":\"Indiana\",\"abbreviation\":\"IN\"},{\"name\":\"Iowa\",\"abbreviation\":\"IA\"},{\"name\":\"Kansas\",\"abbreviation\":\"KS\"},{\"name\":\"Kentucky\",\"abbreviation\":\"KY\"},{\"name\":\"Louisiana\",\"abbreviation\":\"LA\"},{\"name\":\"Maine\",\"abbreviation\":\"ME\"},{\"name\":\"Marshall Islands\",\"abbreviation\":\"MH\"},{\"name\":\"Maryland\",\"abbreviation\":\"MD\"},{\"name\":\"Massachusetts\",\"abbreviation\":\"MA\"},{\"name\":\"Michigan\",\"abbreviation\":\"MI\"},{\"name\":\"Minnesota\",\"abbreviation\":\"MN\"},{\"name\":\"Mississippi\",\"abbreviation\":\"MS\"},{\"name\":\"Missouri\",\"abbreviation\":\"MO\"},{\"name\":\"Montana\",\"abbreviation\":\"MT\"},{\"name\":\"Nebraska\",\"abbreviation\":\"NE\"},{\"name\":\"Nevada\",\"abbreviation\":\"NV\"},{\"name\":\"New Hampshire\",\"abbreviation\":\"NH\"},{\"name\":\"New Jersey\",\"abbreviation\":\"NJ\"},{\"name\":\"New Mexico\",\"abbreviation\":\"NM\"},{\"name\":\"New York\",\"abbreviation\":\"NY\"},{\"name\":\"North Carolina\",\"abbreviation\":\"NC\"},{\"name\":\"North Dakota\",\"abbreviation\":\"ND\"},{\"name\":\"Northern Mariana Islands\",\"abbreviation\":\"MP\"},{\"name\":\"Ohio\",\"abbreviation\":\"OH\"},{\"name\":\"Oklahoma\",\"abbreviation\":\"OK\"},{\"name\":\"Oregon\",\"abbreviation\":\"OR\"},{\"name\":\"Palau\",\"abbreviation\":\"PW\"},{\"name\":\"Pennsylvania\",\"abbreviation\":\"PA\"},{\"name\":\"Puerto Rico\",\"abbreviation\":\"PR\"},{\"name\":\"Rhode Island\",\"abbreviation\":\"RI\"},{\"name\":\"South Carolina\",\"abbreviation\":\"SC\"},{\"name\":\"South Dakota\",\"abbreviation\":\"SD\"},{\"name\":\"Tennessee\",\"abbreviation\":\"TN\"},{\"name\":\"Texas\",\"abbreviation\":\"TX\"},{\"name\":\"Utah\",\"abbreviation\":\"UT\"},{\"name\":\"Vermont\",\"abbreviation\":\"VT\"},{\"name\":\"Virgin Islands\",\"abbreviation\":\"VI\"},{\"name\":\"Virginia\",\"abbreviation\":\"VA\"},{\"name\":\"Washington\",\"abbreviation\":\"WA\"},{\"name\":\"West Virginia\",\"abbreviation\":\"WV\"},{\"name\":\"Wisconsin\",\"abbreviation\":\"WI\"},{\"name\":\"Wyoming\",\"abbreviation\":\"WY\"}]"
import spark.implicits._
val df = spark.read.json(Seq(myjson).toDS)
df.show
scala> df.show
+------------+--------------------+
|abbreviation| name|
+------------+--------------------+
| AL| Alabama|
| AK| Alaska|
| AS| American Samoa|
| AZ| Arizona|
| AR| Arkansas|
| CA| California|
| CO| Colorado|
| CT| Connecticut|
| DE| Delaware|
| DC|District Of Columbia|
| FM|Federated States ...|
| FL| Florida|
| GA| Georgia|
| GU| Guam|
| HI| Hawaii|
| ID| Idaho|
| IL| Illinois|
| IN| Indiana|
| IA| Iowa|
| KS| Kansas|
+------------+--------------------+
// equals matching
scala> df.filter(df("abbreviation") === "TX").show
+------------+-----+
|abbreviation| name|
+------------+-----+
| TX|Texas|
+------------+-----+
// or using lit
scala> df.filter(df("abbreviation") === lit("TX")).show
+------------+-----+
|abbreviation| name|
+------------+-----+
| TX|Texas|
+------------+-----+
// negation using not()
scala> df.filter(not(df("abbreviation") === "TX")).show
+------------+--------------------+
|abbreviation| name|
+------------+--------------------+
| AL| Alabama|
| AK| Alaska|
| AS| American Samoa|
| AZ| Arizona|
| AR| Arkansas|
| CA| California|
| CO| Colorado|
| CT| Connecticut|
| DE| Delaware|
| DC|District Of Columbia|
| FM|Federated States ...|
| FL| Florida|
| GA| Georgia|
| GU| Guam|
| HI| Hawaii|
| ID| Idaho|
| IL| Illinois|
| IN| Indiana|
| IA| Iowa|
| KS| Kansas|
+------------+--------------------+
only showing top 20 rows
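One note on the helpers used above: the spark-shell (Spark 2.x) imports org.apache.spark.sql.functions._ automatically, but in a compiled application lit and not need an explicit import:
import org.apache.spark.sql.functions.{lit, not}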
score:1
Let's create a sample dataset and do a deep dive into exactly why OP's code didn't work.
Here's our sample data:
val df = Seq(
  ("Rockets", 2, "TX"),
  ("Warriors", 6, "CA"),
  ("Spurs", 5, "TX"),
  ("Knicks", 2, "NY")
).toDF("team_name", "num_championships", "state")
We can pretty print our dataset with the show() method:
+---------+-----------------+-----+
|team_name|num_championships|state|
+---------+-----------------+-----+
| Rockets| 2| TX|
| Warriors| 6| CA|
| Spurs| 5| TX|
| Knicks| 2| NY|
+---------+-----------------+-----+
Let's examine the results of df.select(df("state")==="TX").show():
+------------+
|(state = TX)|
+------------+
| true|
| false|
| true|
| false|
+------------+
It's easier to understand this result by simply appending a column - df.withColumn("is_state_tx", df("state")==="TX").show():
+---------+-----------------+-----+-----------+
|team_name|num_championships|state|is_state_tx|
+---------+-----------------+-----+-----------+
| Rockets| 2| TX| true|
| Warriors| 6| CA| false|
| Spurs| 5| TX| true|
| Knicks| 2| NY| false|
+---------+-----------------+-----+-----------+
The other code OP tried (df.select(df("state")=="TX").show()) returns this error:
<console>:27: error: overloaded method value select with alternatives:
[U1](c1: org.apache.spark.sql.TypedColumn[org.apache.spark.sql.Row,U1])org.apache.spark.sql.Dataset[U1] <and>
(col: String,cols: String*)org.apache.spark.sql.DataFrame <and>
(cols: org.apache.spark.sql.Column*)org.apache.spark.sql.DataFrame
cannot be applied to (Boolean)
df.select(df("state")=="TX").show()
^
The === operator is defined in the Column class. Column doesn't define a == operator, so Scala falls back to the standard == defined on Any, which simply returns a Boolean; select has no overload that accepts a Boolean, which is why this code errors out.
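A quick sketch of the type difference (value names are illustrative):
// == is Scala's universal equality defined on Any, so it returns a plain Boolean:
val asBoolean: Boolean = df("state") == "TX" // always false: a Column never equals a String
// === is a method on Column, so it returns a Column expression Spark evaluates per row:
val asColumn: org.apache.spark.sql.Column = df("state") === "TX"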
Here's the accepted answer that works:
df.filter(df("state")==="TX").show()
+---------+-----------------+-----+
|team_name|num_championships|state|
+---------+-----------------+-----+
| Rockets| 2| TX|
| Spurs| 5| TX|
+---------+-----------------+-----+
As other posters have mentioned, the === method takes an argument of type Any, so this isn't the only solution that works. For example, this works too:
df.filter(df("state") === lit("TX")).show
+---------+-----------------+-----+
|team_name|num_championships|state|
+---------+-----------------+-----+
| Rockets| 2| TX|
| Spurs| 5| TX|
+---------+-----------------+-----+
The Column equalTo method can also be used:
df.filter(df("state").equalTo("TX")).show()
+---------+-----------------+-----+
|team_name|num_championships|state|
+---------+-----------------+-----+
| Rockets| 2| TX|
| Spurs| 5| TX|
+---------+-----------------+-----+
It's worthwhile studying this example in detail. Scala's syntax seems magical at times, especially when methods are invoked without dot notation. It's hard for the untrained eye to see that === is a method defined in the Column class!
score:5
We can chain multiple filter/where conditions on a DataFrame.
For example:
table1_df
  .filter($"Col_1_name" === "buddy")            // equality against a string literal
  .filter($"Col_2_name" === "A")
  .filter(not($"Col_2_name".contains(" .sql"))) // drop rows containing an irrelevant substring
  .filter("Col_2_name is not null")             // SQL-string syntax: filter out nulls
  .take(5).foreach(println)
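Chained filter calls are ANDed together, so the same predicate can be written as a single call; a sketch using the same hypothetical column names:
table1_df.filter(
  $"Col_1_name" === "buddy" &&
    $"Col_2_name" === "A" &&
    not($"Col_2_name".contains(" .sql")) &&
    $"Col_2_name".isNotNull
).take(5).foreach(println)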
score:6
Worked on Spark V2.*
import sqlContext.implicits._
df.filter($"state" === "TX")
if needs to be compared against a variable (e.g., var):
import sqlContext.implicits._
df.filter($"state" === var)
Note :
import sqlContext.implicits._
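For instance, a minimal sketch (stateValue is an illustrative name):
val stateValue = "TX"
df.filter($"state" === stateValue).show()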
score:9
df.filter($"state" like "T%%")
for pattern matching
df.filter($"state" === "TX")
or df.filter("state = 'TX'")
for equality
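The SQL expression-string form accepts the same LIKE syntax, so the pattern match can equivalently be written as:
df.filter("state like 'T%'").show()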
score:10
To get the negation, do this ...
df.filter(not( ..expression.. ))
e.g.
df.filter(not($"state" === "TX"))
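Since Spark 2.0, Column also defines an inequality operator, =!=, so the same negation can be written without not():
df.filter($"state" =!= "TX")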
score:16
You should be using where; select is a projection that returns the output of the statement, which is why you get boolean values. where is a filter that keeps the structure of the DataFrame, but keeps only the rows for which the predicate is true.
Along the same lines, per the documentation, you can write this in 3 different ways:
// The following are equivalent:
peopleDf.filter($"age" > 15)
peopleDf.where($"age" > 15)
peopleDf($"age" > 15)
score:29
There is another simple sql like option. With Spark 1.6 below also should work.
df.filter("state = 'TX'")
This is a new way of specifying sql like filters. For a full list of supported operators, check out this class.
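The expression string can also carry compound conditions; a small sketch against the sample team dataset shown earlier:
df.filter("state = 'TX' AND num_championships > 2").show()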
Source: stackoverflow.com