e.g.,
>>> torch.max(model(tok('I like pizza', return_tensors='pt')['input_ids']).logits)
tensor(-4.5862)
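So, if you add a new word and randomly init its embedding, its dot product with the hidden states is ~0, so its logit is ~0. Every existing logit is negative (the max above is about -4.6), so softmax([-4, -4, ..., 0]) shifts the probability mass onto the element with 0!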
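To see the effect in numbers, here is a minimal sketch in plain Python. The logit values are made up for illustration (only the "max around -4.6" comes from the output above), and `softmax` is a small hand-written helper, not a library call.

```python
import math

def softmax(logits):
    # Subtract the max before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for the existing vocabulary: all negative, max near -4.6.
old_logits = [-4.6, -6.0, -7.5, -9.0, -12.0]

# Assumption: the new token's randomly initialized embedding gives a logit of ~0.
new_token_logit = 0.0

probs = softmax(old_logits + [new_token_logit])
new_token_prob = probs[-1]
print(f"probability assigned to the new token: {new_token_prob:.3f}")
```

With these toy numbers the new token ends up with the bulk of the probability, even though the model has never been trained to predict it. A real vocabulary has far more entries, so the exact share depends on how the other logits are distributed.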