score:1

Accepted answer

Your problem is that your if expression returns either String (in case of a match) or Unit (in case of a miss). You can fix your filter easily:

val pairs = lines
  .map(l => (if (l.split(",")(1).toInt < 60) "rest"
             else if (l.split(",")(1).toInt > 110) "sport", 10))
  .filter(_._1 != ())

() in Scala is the only value of type Unit.

But this is not the right way, really. You still get tuples of (Any, Int) as the result, where the first element is () for rows that match neither condition. You lose type safety with this if expression.
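To see the inference concretely, here is a plain-Scala sketch (the heart-rate value is made up for illustration):

```scala
// A non-exhaustive if widens the result type: the missing else branch yields ().
val hr = 80  // hypothetical heart rate that matches neither condition
val label = if (hr < 60) "rest" else if (hr > 110) "sport"
// `label` is inferred as Any, and its value here is the Unit value ()
assert(label == ())
```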

The correct way is either to filter your data first and use an exhaustive if:

val pairs =
  lines.map(_.split(",")(1).toInt)
    .filter(hr => hr < 60 || hr > 110)
    .map(hr => (if (hr < 60) "rest" else "sport", 10))
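For example, with a few made-up "id,heartRate" lines (plain Scala collections here, but the same chain works on an RDD):

```scala
// Hypothetical input lines in "id,heartRate" format
val lines = List("u1,55", "u2,80", "u3,120")

val pairs = lines
  .map(_.split(",")(1).toInt)                         // extract the heart rate
  .filter(hr => hr < 60 || hr > 110)                  // keep only interesting rows
  .map(hr => (if (hr < 60) "rest" else "sport", 10))  // if is now exhaustive

// pairs contains cleanly typed (String, Int) tuples
assert(pairs == List(("rest", 10), ("sport", 10)))
```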

Or to use collect which, when given a partial function, is a shortcut for .filter followed by .map:

val pairs =
  lines.map(_.split(",")(1).toInt)
    .collect{
      case hr if hr < 60 => "rest" -> 10
      case hr if hr > 110 => "sport" -> 10
    }

This variant is probably the more readable one.

Also, please note how I moved the split into a separate step. This avoids calling split a second time for the second if branch.

UPD. Another approach is to use flatMap, as suggested in the comments:

val pairs =
  lines.flatMap(_.split(",")(1).toInt match {
      case hr if hr < 60 => Some("rest" -> 10)
      case hr if hr > 110 => Some("sport" -> 10)
      case _ => None
    })

It may or may not be more efficient: it avoids the filter step, but adds wrapping and unwrapping of elements in Option. You can benchmark the different approaches and tell us the results.
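As a quick sanity check with plain Scala collections and made-up "id,heartRate" rows, the collect and flatMap variants produce identical pairs:

```scala
val lines = List("u1,55", "u2,80", "u3,120")  // hypothetical sample rows

// collect with a partial function: non-matching elements are silently dropped
val viaCollect = lines.map(_.split(",")(1).toInt).collect {
  case hr if hr < 60  => "rest" -> 10
  case hr if hr > 110 => "sport" -> 10
}

// flatMap over Option: None entries disappear from the result
val viaFlatMap = lines.flatMap(_.split(",")(1).toInt match {
  case hr if hr < 60  => Some("rest" -> 10)
  case hr if hr > 110 => Some("sport" -> 10)
  case _              => None
})

assert(viaCollect == viaFlatMap)
```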

score:0

Note: not a direct answer to this question, but it might be useful for users who use DataFrames.

With a DataFrame, you can use the na.drop function to drop rows that do not contain values for the specified columns.

In this case, you can construct the DataFrame with sc.parallelize and toDF, and then just use df.na.drop() (note that plain df.drop() removes columns, not rows).

