score:0

What I understand is that if the loss is calculated at the individual word level, there is no sense of the sequence as a whole. A bad sequence (with mostly random words) can end up with a loss similar to a better sequence (with mostly connected words), because the loss can be spread in different ways over the vocabulary.
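To make that concrete, here is a tiny illustrative sketch (the numbers are made up): because per-token cross-entropy only depends on the product of the probabilities assigned to the gold tokens, two models that behave very differently across positions can get exactly the same sequence loss.

```python
import math

# Per-token cross-entropy only looks at the probability the model assigns to
# the gold token at each position, independently of the other positions.
def sequence_loss(gold_token_probs):
    """Sum of -log p(gold token) over the positions of one sequence."""
    return sum(-math.log(p) for p in gold_token_probs)

# Model A: moderately confident at every position.
loss_a = sequence_loss([0.5, 0.5, 0.5])

# Model B: very confident at two positions, nearly lost at the third
# (probabilities chosen so the product, and hence the loss, matches Model A).
loss_b = sequence_loss([0.9, 0.9, 0.125 / 0.81])

print(loss_a, loss_b)  # both ~2.079 -- same loss, very different behaviour
```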

score:0

No, we do not need to use beam search in the training stage. When training modern-day seq-to-seq models such as Transformers, we use the teacher forcing training mechanism, where the right-shifted target sequence is fed to the decoder side. Beam search could in principle improve generalization, but it is not practical to use in the training stage. There are alternatives, though, such as using a label-smoothed cross-entropy loss.
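For concreteness, here is a minimal sketch of a teacher-forced training step in PyTorch. The names `model`, `src`, `tgt`, `bos_id`, `pad_id`, and `vocab_size` are my own assumptions, not from the original answer; the point is only that the decoder sees the right-shifted gold sequence and the loss is per-token (optionally label-smoothed) cross-entropy, with no search involved.

```python
import torch
import torch.nn as nn

def training_step(model, src, tgt, bos_id, pad_id, vocab_size):
    # Right-shift the target: the decoder sees <bos>, y1, ..., y_{n-1}
    # and is trained to predict y1, ..., yn at each position.
    bos = torch.full((tgt.size(0), 1), bos_id, dtype=tgt.dtype, device=tgt.device)
    decoder_input = torch.cat([bos, tgt[:, :-1]], dim=1)

    # Hypothetical model interface: (src, decoder_input) -> logits (B, T, V).
    logits = model(src, decoder_input)

    # Per-token cross-entropy, here with label smoothing as the alternative
    # mentioned above; no beam search happens during training.
    criterion = nn.CrossEntropyLoss(ignore_index=pad_id, label_smoothing=0.1)
    loss = criterion(logits.reshape(-1, vocab_size), tgt.reshape(-1))
    return loss
```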

score:1

Sequence-to-Sequence Learning as Beam-Search Optimization is a paper that describes the steps necessary to use beam search in the training process. https://arxiv.org/abs/1606.02960

The following issue contains a script that can perform beam search; however, it does not contain any of the training logic: https://github.com/tensorflow/tensorflow/issues/654
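For reference, this is roughly what such an inference-time beam search loop does. It is not the script from the linked issue, just a framework-agnostic sketch; `step_fn`, `bos_id`, and `eos_id` are hypothetical, and `step_fn(prefix)` is assumed to return next-token log-probabilities given a prefix.

```python
def beam_search(step_fn, bos_id, eos_id, beam_size=4, max_len=20):
    """Generic beam search over next-token log-probabilities.

    step_fn(prefix) is assumed to return a dict {token_id: log_prob}
    for the next token; it stands in for a trained decoder.
    """
    beams = [([bos_id], 0.0)]            # (token sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos_id:        # finished hypotheses carry over unchanged
                candidates.append((seq, score))
                continue
            for tok, logp in step_fn(seq).items():
                candidates.append((seq + [tok], score + logp))
        # Keep only the beam_size highest-scoring partial sequences.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(seq[-1] == eos_id for seq, _ in beams):
            break
    return beams[0]                      # best hypothesis found
```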

score:13

As Oliver mentioned, in order to use beam search in the training procedure we have to use beam search optimization, which is clearly described in the paper Sequence-to-Sequence Learning as Beam-Search Optimization.

We can't use beam search in the training procedure with the current loss function, because the current loss function is a log loss taken at each time step, which is a greedy approach. This is also clearly explained in the paper Sequence to Sequence Learning with Neural Networks; Section 3.2 describes this case neatly.

The training objective in that section is to maximize the average log probability of a correct translation T given the source sentence S over the training set:

$$\frac{1}{|\mathcal{S}|} \sum_{(T,S) \in \mathcal{S}} \log p(T \mid S)$$

"where S is the training set. Once training is complete, we produce translations by finding the most likely translation according to the LSTM:"

So the original seq2seq architecture uses beam search only at test time. If we want to use beam search at training time, we have to use another loss and optimization method, as in the paper.
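To give a flavour of what "another loss" means there, here is a rough, simplified sketch of a sequence-level margin penalty in the spirit of beam-search optimization: the model is penalized at time steps where the gold prefix does not outscore the lowest-ranked surviving beam hypothesis by a margin. The function and its inputs are illustrative assumptions, not the paper's exact formulation (the paper, for instance, also resets the beam with the gold prefix after a violation).

```python
import torch

def bso_style_margin_loss(gold_prefix_scores, kth_beam_scores, margin=1.0):
    """Simplified sequence-level margin loss.

    gold_prefix_scores[t] : model score of the gold prefix y_{1:t}
    kth_beam_scores[t]    : score of the K-th (lowest surviving) beam
                            hypothesis at step t, obtained by running beam
                            search with the current model
    Both are assumed to be 1-D float tensors of equal length.
    """
    # Hinge: positive only where the gold prefix fails to beat the K-th
    # hypothesis by the margin, i.e. where it would have fallen off the beam.
    violations = torch.clamp(margin - gold_prefix_scores + kth_beam_scores, min=0.0)
    return violations.sum()
```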