score:1
Scikit-learn's train_test_split actually has a parameter for that. While it doesn't let you pick the exact number of samples, it selects them from the classes in proportion to the class frequencies.
from sklearn.model_selection import train_test_split

X_train_stratified, X_test_stratified, y_train_strat, y_test_strat = train_test_split(X_train, y_train, test_size=0.2, stratify=y_train)
If you want to do cross-validation you can also use StratifiedShuffleSplit, for example:
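A minimal sketch, assuming X and y are array-likes and y holds the class labels (my assumption, not stated in the question):

from sklearn.model_selection import StratifiedShuffleSplit

sss = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
for train_idx, test_idx in sss.split(X, y):
    # each split preserves the class proportions of y
    X_tr, X_te = X[train_idx], X[test_idx]
    y_tr, y_te = y[train_idx], y[test_idx]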
I hope I understood your question correctly.
score:2
One possible approach is to proceed as follows:
- Load the data from X_train & Y_train into a single tf.data Dataset so that we ensure we keep each X matched with the correct Y
- .shuffle(), then split the dataset into one dataset per class n_i using a filter()
- Write our get_batch function to return the correct number of samples from each dataset, shuffle() the sample, then split it back into X & Y
Something like this:
import tensorflow as tf

# 1: Load the data into a single Dataset so each X stays paired with its Y
raw_data = tf.data.Dataset.zip(
    (
        tf.data.Dataset.from_tensor_slices(X_train),
        tf.data.Dataset.from_tensor_slices(Y_train),
    )
).shuffle(7000)

# 2: Split into one dataset per category (assumes one-hot labels,
# so y[n] == 1.0 means the example belongs to class n)
def get_filter_fn(n):
    def filter_fn(x, y):
        return tf.equal(1.0, y[n])
    return filter_fn

n_0s = raw_data.filter(get_filter_fn(0))
n_1s = raw_data.filter(get_filter_fn(1))
n_2s = raw_data.filter(get_filter_fn(2))

# 3: Take the requested number of samples per class, shuffle, then split
# back into X and Y. reshuffle_each_iteration=False keeps the order
# identical across the two map() iterations so X and Y stay aligned.
def get_batch(n_0, n_1, n_2):
    sample = n_0s.take(n_0).concatenate(n_1s.take(n_1)).concatenate(n_2s.take(n_2))
    shuffled = sample.shuffle(n_0 + n_1 + n_2, reshuffle_each_iteration=False)
    return shuffled.map(lambda x, y: x), shuffled.map(lambda x, y: y)
So now we can do:
x_batch, y_batch = get_batch(100, 150, 125)
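Note that get_batch returns two tf.data.Dataset objects rather than tensors. If you need concrete tensors, one possible follow-up (a sketch assuming TF 2.x eager execution, not part of the original answer) is to batch each dataset and pull a single element:

# materialize the whole sample as one tensor per side
x_tensor = next(iter(x_batch.batch(100 + 150 + 125)))
y_tensor = next(iter(y_batch.batch(100 + 150 + 125)))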
Note that I've used some potentially wasteful operations here in pursuit of an approach I find intuitive and straightforward (specifically, reading the raw_data dataset three times for the filter operations), so I make no claim that this is the most efficient way to accomplish what you need. But for a dataset that fits in memory, like the one you describe, I'm sure such inefficiencies will be negligible.
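If that overhead ever did matter, one possible mitigation (my suggestion, not the original author's) is to cache the zipped dataset in memory so the three filter() passes reuse the same materialized elements:

# caching replays the first shuffled pass from memory, so each filter()
# reads cached elements instead of re-reading the source tensors
raw_data = raw_data.cache()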
Credit To: stackoverflow.com