Sure! Another way to think of it is texture vs. structure, or sometimes people call this "stuff vs. things".
In an image of a dog in a field, the grass texture (stuff) is high-entropy, but we do not perceive individual realisations of this texture; we just perceive it as "grass".
If the realisation of this texture is subtly different, we often cannot tell, unless the images are layered directly on top of each other. This is a fun experiment to try with an adversarial autoencoder: when comparing an original image and its reconstruction side by side, they often look identical. But layering them on top of each other and flipping back and forth often reveals just how different the images are, especially in areas with a lot of texture.
For objects (things), on the other hand, such as the dog's eyes, differences of a similar magnitude would be immediately obvious.
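A rough way to put numbers on the flipping experiment is to compare reconstruction error in textured and flat regions. The sketch below is only an illustration on synthetic data, not output from a real autoencoder. The helper names (`block_stat`, `error_by_region`), the 8x8 block size and the median threshold are all my own assumptions, and uniform noise is a very crude stand-in for "grass".

```python
import numpy as np

def block_stat(img, k, stat):
    """Apply `stat` over non-overlapping k x k blocks of a 2-D array."""
    h, w = img.shape
    h, w = h - h % k, w - w % k  # crop so the image tiles evenly
    blocks = img[:h, :w].reshape(h // k, k, w // k, k)
    return stat(blocks, axis=(1, 3))

def error_by_region(original, reconstruction, k=8):
    """Mean squared error in high-variance (textured) vs. low-variance blocks.

    Blocks are split at the median of the original's local variance, which is
    an arbitrary choice made for this illustration.
    """
    variance = block_stat(original, k, np.var)
    error = block_stat((original - reconstruction) ** 2, k, np.mean)
    textured = variance > np.median(variance)
    return error[textured].mean(), error[~textured].mean()

# Synthetic stand-in: left half is noisy "grass", right half is a flat region.
rng = np.random.default_rng(0)
original = np.full((64, 64), 0.5)
original[:, :32] = rng.uniform(0.0, 1.0, size=(64, 32))

# Hypothetical "reconstruction": a different realisation of the same texture,
# with the flat region preserved exactly.
reconstruction = original.copy()
reconstruction[:, :32] = rng.uniform(0.0, 1.0, size=(64, 32))

textured_err, flat_err = error_by_region(original, reconstruction)
print(f"textured MSE: {textured_err:.4f}, flat MSE: {flat_err:.4f}")
```

Side by side, both left halves would just read as "the same noise", even though the pixel-wise error there is large. With a real model you would pass in the actual image and its reconstruction (converted to grayscale floats) instead of the synthetic arrays.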
A good adversarial autoencoder will abstract away texture while trying to preserve structure. That way, the realisation of the grassy texture in the reconstruction can differ from the original without noticeably affecting the perceived fidelity of the reconstruction. This enables the autoencoder to drop a lot of modes (i.e. other realisations of the same texture) and represent the presence of this texture more compactly in its latent space.
This in turn should make generative modelling in the latent space easier as well, because the generative model now only has to model the absence/presence of a texture, rather than having to capture all the entropy associated with that texture.
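To get a feel for the size of that gap, here is a toy calculation. It assumes, purely for illustration, that a texture realisation is an independent fair coin flip per pixel, which real grass certainly is not. The function names and the 64x64 patch size are made up for the example.

```python
import math

def bernoulli_entropy_bits(p):
    """Entropy in bits of a binary variable that is 1 with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def texture_realisation_entropy_bits(num_pixels, p=0.5):
    """Entropy of one realisation, assuming independent binary pixels (toy model)."""
    return num_pixels * bernoulli_entropy_bits(p)

patch_pixels = 64 * 64
realisation_bits = texture_realisation_entropy_bits(patch_pixels)
presence_bits = bernoulli_entropy_bits(0.5)  # "is there grass here or not?"

print(f"exact realisation: {realisation_bits:.0f} bits")
print(f"presence/absence:  {presence_bits:.0f} bit")
```

Under these assumptions, modelling the exact realisation of one patch costs thousands of bits, while modelling whether the texture is there at all costs at most one.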
This is a bit of a caricature, and what happens in reality is probably a bit more complicated, but this is roughly my intuition for why two-stage training is actually preferable to end-to-end training, at least in the visual domain.