score:1

scikit-learn's train_test_split actually has a parameter for that. While it doesn't let you pick the exact number of samples per class, its stratify argument makes the split preserve the class proportions of the labels you pass in.

from sklearn.model_selection import train_test_split

X_train_stratified, X_test_stratified, y_train_strat, y_test_strat = train_test_split(
    X_train, y_train, test_size=0.2, stratify=y_train)

If you want to do cross validation you can also use StratifiedShuffleSplit (or StratifiedKFold) from the same module, as in the sketch below.
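Here's a minimal sketch of StratifiedShuffleSplit, where X and y stand in for your own feature and label arrays (the n_splits, test_size and random_state values are just illustrative):

from sklearn.model_selection import StratifiedShuffleSplit

# 5 independent, stratified 80/20 splits
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
for train_idx, test_idx in sss.split(X, y):
    X_tr, X_te = X[train_idx], X[test_idx]
    y_tr, y_te = y[train_idx], y[test_idx]
    # fit and evaluate your model on each split here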

I hope I understood your question correctly

score:2

One possible approach is to proceed as follows:

  1. Load the data from X_train & Y_train into a single tf.data Dataset, so that each X stays matched with the correct Y
  2. .shuffle(), then split the dataset into one dataset per class using filter()
  3. Write our get_batch function to return the correct number of samples from each class dataset, shuffle() the sample, then split it back into X & Y

Something like this:

import tensorflow as tf

# 1: Load the data into a Dataset, keeping each X paired with its Y
raw_data = tf.data.Dataset.zip(
    (
        tf.data.Dataset.from_tensor_slices(X_train),
        tf.data.Dataset.from_tensor_slices(Y_train)
    )
).shuffle(7000)  # a buffer of at least the dataset size gives a full shuffle


# 2: Split into one dataset per category (this assumes one-hot labels)
def get_filter_fn(n):
  def filter_fn(x, y):
    # keep only samples whose label has a 1 at index n
    return tf.equal(1.0, y[n])
  return filter_fn

n_0s = raw_data.filter(get_filter_fn(0))
n_1s = raw_data.filter(get_filter_fn(1))
n_2s = raw_data.filter(get_filter_fn(2))

# 3: Take the requested number of samples per class, shuffle, and split back into X & Y
def get_batch(n_0, n_1, n_2):
  sample = n_0s.take(n_0).concatenate(n_1s.take(n_1)).concatenate(n_2s.take(n_2))
  # fix the shuffle order so the two map() calls below yield x and y in the same order
  shuffled = sample.shuffle(n_0 + n_1 + n_2, reshuffle_each_iteration=False)
  return shuffled.map(lambda x, y: x), shuffled.map(lambda x, y: y)

So now we can do:

x_batch, y_batch = get_batch(100, 150, 125)
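
Since get_batch returns two tf.data Datasets rather than tensors, you may still need to materialize them before feeding a model; a minimal sketch, assuming eager execution and samples of uniform shape:

# stack the dataset elements into plain tensors
x_tensor = tf.stack(list(x_batch))
y_tensor = tf.stack(list(y_batch))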

Note that I've used some potentially wasteful operations here in pursuit of an approach I find intuitive and straightforward (specifically, reading the raw_data dataset three times for the filter operations), so I make no claim that this is the most efficient way to accomplish what you need. For a dataset that fits in memory like the one you describe, though, such inefficiencies should be negligible.