
Allow me to start with two short observations:

  1. I believe it is safer to avoid naming columns with bare numbers. Consider evaluating the expression 1 is not null: it is ambiguous whether 1 refers to the column named 1 or to the literal value 1 (see the sketch after this list).
  2. As far as I am aware, storing and processing the target columns through a DataFrame is not performant. That overhead is easily avoided by collecting them into a plain Scala collection, e.g. Seq, Array, or Set.

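To make the first point concrete, here is a minimal sketch (assuming an active SparkSession with spark.implicits._ in scope; the column named "1" is purely illustrative):

import org.apache.spark.sql.functions.{col, expr}

val df = Seq("a", "b").toDF("1")

df.select(expr("1 is not null")).show()   // 1 is parsed as the literal integer, so this is always true
df.select(expr("`1` is not null")).show() // backticks are required to reach the column named 1
df.select(col("1").isNotNull).show()      // the DataFrame API is unambiguous
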
And here is the solution to your problem:

import org.apache.spark.sql.functions.col
import spark.implicits._ // required for toDF; assumes a SparkSession in scope named spark

val id_df = Seq(
  ("c1","gender"),
  ("c2","city"),
  ("c3","state"),
  ("c4","age")
).toDF("id","type")

val main_df = Seq(
  ("male", "los angeles", null),
  ("female", "new york", "new york"),
  ("trans", null, "new york")
).toDF("c1","c2","c3")

val targetCols = id_df.collect()
                      .map{_.getString(0)} //get id
                      .toSet //convert current sequence to a set (required for the intersection)
                      .intersect(main_df.columns.toSet) //get common columns with main_df
                      .map(col(_).isNotNull) //convert c1,..cN to col(c[i]).isNotNull
                      .reduce(_ && _) // apply the AND operator between items

// (((c1 IS NOT NULL) AND (c2 IS NOT NULL)) AND (c3 IS NOT NULL))

main_df.withColumn("meets_conditions", targetCols).show(false)

// +------+-----------+--------+----------------+
// |c1    |c2         |c3      |meets_conditions|
// +------+-----------+--------+----------------+
// |male  |los angeles|null    |false           |
// |female|new york   |new york|true            |
// |trans |null       |new york|false           |
// +------+-----------+--------+----------------+
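
Two optional follow-ups. First, the same Column can be reused as a filter if you only want the rows that satisfy the condition:

main_df.filter(targetCols).show(false)

// +------+--------+--------+
// |c1    |c2      |c3      |
// +------+--------+--------+
// |female|new york|new york|
// +------+--------+--------+

Second, reduce throws on an empty collection, so the snippet above would fail if id_df happened to share no columns with main_df. A defensive sketch (falling back to lit(true) is an assumption about the desired semantics, i.e. that having no target columns means every row passes):

import org.apache.spark.sql.functions.lit

val safeCondition = id_df.collect()
  .map(_.getString(0))
  .toSet
  .intersect(main_df.columns.toSet)
  .map(col(_).isNotNull)
  .reduceOption(_ && _)  // None when the intersection is empty
  .getOrElse(lit(true))  // assumed fallback: accept every row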
