score:1

Accepted answer

I suppose this is what you were trying to do:

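import org.apache.spark.sql.{DataFrame, Dataset}
import spark.implicits._ // encoders needed by .as and flatMap; assumes a SparkSession named spark

val dfHome: DataFrame = ???
val dsHome: Dataset[General] = dfHome.as[General]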

val dsMary1: Dataset[Mary] = dsHome.flatMap { case General(id, name, addrs, _) =>
  addrs.map { case AddressMary(street, house) => Mary(id, name, street, house) }
}
val dsJohn1: Dataset[John] = dsHome.flatMap { case General(id, name, _, addrs) =>
  addrs.map { case AddressJohn(street, house) => John(id, name, street, house) }
}
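
To see the shape of this transformation without a Spark session, here is a minimal sketch on plain Scala collections. The case class fields and types are assumptions (the real definitions come from the question), and a Seq stands in for the Dataset:

```scala
// Hypothetical model: field names and types are assumptions for illustration only.
final case class AddressMary(street: String, house: Int)
final case class AddressJohn(street: String, house: Int)
final case class General(
  id: Long,
  name: String,
  addressesMary: Seq[AddressMary],
  addressesJohn: Seq[AddressJohn]
)
final case class Mary(id: Long, name: String, street: String, house: Int)
final case class John(id: Long, name: String, street: String, house: Int)

object FlattenSketch {
  // A plain Seq stands in for Dataset[General] so the sketch runs without Spark.
  val home: Seq[General] = Seq(
    General(
      1L,
      "home-1",
      Seq(AddressMary("Main St", 1), AddressMary("Oak Ave", 2)),
      Seq(AddressJohn("Elm Rd", 3))
    )
  )

  // One output row per element of the nested array,
  // with the parent's id and name copied into each row.
  val marys: Seq[Mary] = home.flatMap { case General(id, name, addrs, _) =>
    addrs.map { case AddressMary(street, house) => Mary(id, name, street, house) }
  }

  val johns: Seq[John] = home.flatMap { case General(id, name, _, addrs) =>
    addrs.map { case AddressJohn(street, house) => John(id, name, street, house) }
  }
}
```

Each General row yields as many Mary rows as it has Mary addresses, and likewise for John.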

You can also write it as a for-comprehension:

val dsMary2: Dataset[Mary] =
  for {
    General(id, name, addrs, _) <- dsHome
    AddressMary(street, house) <- addrs
  } yield Mary(id, name, street, house)

val dsJohn2: Dataset[John] =
  for {
    General(id, name, _, addrs) <- dsHome
    AddressJohn(street, house) <- addrs
  } yield John(id, name, street, house)

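but you will need the better-monadic-for compiler plugin: in Scala 2, a pattern on the left-hand side of a generator desugars to a withFilter call, and Dataset doesn't implement withFilter. The plugin drops that call when the pattern cannot fail.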
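
For reference, enabling the plugin in an sbt build could look like the fragment below. The version number is an assumption; check the plugin's README for the current release.

```scala
// build.sbt (Scala 2 only)
// The version shown is an assumption, not taken from the answer.
addCompilerPlugin("com.olegpy" %% "better-monadic-for" % "0.3.1")
```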

EDIT: The author asked for a way to get the Datasets of John and Mary in one go. We could zip the inner arrays, but this requires every element of one array to have a corresponding element in the other array, in the same order (zip stops at the shorter array, so any extra elements are dropped). We could also nest the flatMaps, but that pairs every Mary address with every John address in a row, which is equivalent to a cartesian join of the two arrays.

val dsMaryJohnZip1: Dataset[(Mary, John)] = dsHome.flatMap { case General(id, name, addrMs, addrJs) =>
  addrMs.zip(addrJs).map { case (AddressMary(sM, hM), AddressJohn(sJ, hJ)) => (Mary(id, name, sM, hM), John(id, name, sJ, hJ)) }
}
val dsMaryJohnZip2: Dataset[(Mary, John)] =
  for {
    General(id, name, addrMs, addrJs) <- dsHome
    (AddressMary(sM, hM), AddressJohn(sJ, hJ)) <- addrMs.zip(addrJs)
  } yield (Mary(id, name, sM, hM), John(id, name, sJ, hJ))

val dsMaryJohnCartesian1: Dataset[(Mary, John)] = dsHome.flatMap { case General(id, name, addrMs, addrJs) =>
  addrMs.flatMap { case AddressMary(sM, hM) =>
    addrJs.map { case AddressJohn(sJ, hJ) =>
      (Mary(id, name, sM, hM), John(id, name, sJ, hJ))
    }
  }
}
val dsMaryJohnCartesian2: Dataset[(Mary, John)] =
  for {
    General(id, name, addrMs, addrJs) <- dsHome
    AddressMary(sM, hM) <- addrMs
    AddressJohn(sJ, hJ) <- addrJs
  } yield (Mary(id, name, sM, hM), John(id, name, sJ, hJ))
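
The difference between the two approaches is easiest to see on small plain lists. This is only a sketch: the strings are hypothetical stand-ins for the AddressMary and AddressJohn values.

```scala
object ZipVsCartesianSketch {
  // Hypothetical stand-ins for the two nested address arrays of one row.
  val maryAddrs: List[String] = List("m1", "m2", "m3")
  val johnAddrs: List[String] = List("j1", "j2")

  // zip pairs elements by position and stops at the shorter list,
  // so "m3" has no partner and is silently dropped.
  val zipped: List[(String, String)] = maryAddrs.zip(johnAddrs)

  // Nested generators pair every Mary address with every John address.
  val cartesian: List[(String, String)] =
    for {
      m <- maryAddrs
      j <- johnAddrs
    } yield (m, j)
}
```

With three Mary addresses and two John addresses, zipping gives 2 pairs while the nested version gives 3 * 2 = 6, so the right choice depends on whether the arrays really line up element by element.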
