I think it depends on the positional encoding (PE). We didn't explore PE much; we just used the same RoPE.
For the novel-view-synthesis experiment, Figure 4(a) and (b) show results with different numbers of input images (chunk size = the number of tokens across all input images). The model is only trained with 8 input images at the tested resolution, so there is some zero-shot generalization. But the novel-view-synthesis task takes posed images as input, and those come with "natural" and "physically correct" PEs: a Plücker ray for each pixel! Transformers that use such Plücker rays as PE also show zero-shot length generalization.
So my takeaway is: if the PE is correct, chunk-size generalization and length generalization are possible.
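
In case it helps, here is a minimal sketch of what a per-pixel Plücker-ray encoding can look like. This is an illustration, not necessarily the exact code used in the experiments: the function name `plucker_rays`, the pinhole intrinsics `K`, the camera-to-world matrix `c2w`, and the `(d, o × d)` channel ordering are all assumptions (some codebases store `(o × d, d)` instead).

```python
import numpy as np

def plucker_rays(K, c2w, height, width):
    """Per-pixel Plucker coordinates (d, o x d), shape (height, width, 6).

    Hypothetical helper: assumes a pinhole camera with 3x3 intrinsics K
    and a 4x4 camera-to-world matrix c2w.
    """
    # Pixel centers in image coordinates (u = column, v = row).
    u, v = np.meshgrid(np.arange(width) + 0.5,
                       np.arange(height) + 0.5, indexing="xy")
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)  # (H, W, 3)

    # Back-project pixels to camera-space directions, then rotate to world space.
    dirs_cam = pix @ np.linalg.inv(K).T
    dirs_world = dirs_cam @ c2w[:3, :3].T
    d = dirs_world / np.linalg.norm(dirs_world, axis=-1, keepdims=True)

    # All rays of one image share the camera center as origin.
    o = np.broadcast_to(c2w[:3, 3], d.shape)
    m = np.cross(o, d)  # moment vector of each ray

    return np.concatenate([d, m], axis=-1)

# Toy camera, values chosen only for illustration.
K = np.array([[100.0, 0.0, 32.0],
              [0.0, 100.0, 32.0],
              [0.0, 0.0, 1.0]])
c2w = np.eye(4)
c2w[:3, 3] = [1.0, 2.0, 3.0]

rays = plucker_rays(K, c2w, 64, 64)
print(rays.shape)  # (64, 64, 6)
```

Note that the encoding of a token depends only on the camera and the pixel location, not on where the image sits in the input sequence. My guess is that this is a big part of why adding more input images does not push the PE out of the training distribution.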