
I can't think of an easy way to achieve what you want using only native Spark functions, but it can be done with a user-defined function (UDF), as below:

import org.apache.spark.sql.functions.{col, collect_list, explode, udf}

val groupBy2 = udf((s: Seq[Int]) => s.grouped(2).toList)
ss.udf.register("groupBy2", groupBy2)
val dsGrouped: Dataset[AggrBook] = bookDS.groupBy("city", "state")
  .agg(collect_list("bookingId") as "books")
  .withColumn("books", explode(groupBy2(col("books"))))
  .as[AggrBook]

The UDF takes a Seq[Int] and returns a Seq[Seq[Int]] where each inner sequence has a length of two or less. This is then 'expanded' using the native Spark explode function, giving you (potentially) multiple rows for each city-state pair, but at most two ids in the 'books' column.
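For illustration, here is a minimal self-contained sketch of the same approach. The Booking case class, the sample data, and the SparkSession setup are assumptions made for this example, not part of the original question:

import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions.{col, collect_list, explode, udf}

// Hypothetical input and output shapes, assumed for this sketch.
case class Booking(city: String, state: String, bookingId: Int)
case class AggrBook(city: String, state: String, books: Seq[Int])

object GroupBy2Example extends App {
  val ss = SparkSession.builder().master("local[*]").appName("groupBy2").getOrCreate()
  import ss.implicits._

  val bookDS: Dataset[Booking] = Seq(
    Booking("Austin", "TX", 1),
    Booking("Austin", "TX", 2),
    Booking("Austin", "TX", 3),
    Booking("Dallas", "TX", 4)
  ).toDS()

  // Split each collected list of booking ids into chunks of at most two.
  val groupBy2 = udf((s: Seq[Int]) => s.grouped(2).toList)

  val dsGrouped: Dataset[AggrBook] = bookDS
    .groupBy("city", "state")
    .agg(collect_list("bookingId") as "books")
    .withColumn("books", explode(groupBy2(col("books"))))
    .as[AggrBook]

  // Austin/TX yields two rows: books = [1, 2] and books = [3];
  // Dallas/TX yields one row: books = [4].
  dsGrouped.show(false)

  ss.stop()
}

One caveat: collect_list does not guarantee element order after a shuffle, so if which ids end up paired together matters, sort the list (e.g. with sort_array) before applying the UDF.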

