We present a novel variant of the transformer encoder architecture designed to improve predictive performance through sequential bottom-up reasoning with cross-resolution communication. The architecture combines three components: (i) cross-scale attention, which enables interactions between different hierarchical levels of resolution; (ii) auxiliary scale-specific predictors, which guide each encoder layer to adopt a distinct resolution in its output embedding space; and (iii) a variational autoencoder over the input embeddings, which establishes a continuous latent space that supports sampling and the analysis of how perturbations propagate through the network across resolution levels.
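To make the three components concrete, the following is a minimal PyTorch sketch of one way such an architecture could be wired together. All module names, dimensions, and the classification-style auxiliary heads are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class EmbeddingVAE(nn.Module):
    """VAE over the input embeddings: maps each token embedding to a
    Gaussian latent and decodes it back, giving a continuous latent space
    for sampling and perturbation analysis (names are illustrative)."""
    def __init__(self, d_model: int, d_latent: int):
        super().__init__()
        self.to_stats = nn.Linear(d_model, 2 * d_latent)  # -> (mu, logvar)
        self.decode = nn.Linear(d_latent, d_model)

    def forward(self, x):  # x: (batch, seq, d_model)
        mu, logvar = self.to_stats(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).mean()
        return self.decode(z), kl

class CrossScaleLayer(nn.Module):
    """One per-resolution encoder layer: self-attention within the scale,
    plus cross-attention to the previous (finer) scale's output, so that
    coarser levels can read finer-level features."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.self_block = nn.TransformerEncoderLayer(
            d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(
            d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, prev_scale=None):
        x = self.self_block(x)
        if prev_scale is not None:  # cross-resolution communication
            attended, _ = self.cross_attn(x, prev_scale, prev_scale)
            x = self.norm(x + attended)
        return x

class HierarchicalEncoder(nn.Module):
    """Stack of scale-specific layers with one auxiliary predictor per
    scale; each head's loss encourages that layer's output space to
    specialize to one resolution of the target (a sketch, not the paper's
    exact design)."""
    def __init__(self, d_model=256, n_heads=8, num_scales=4,
                 d_latent=64, num_classes=10):
        super().__init__()
        self.vae = EmbeddingVAE(d_model, d_latent)
        self.layers = nn.ModuleList(
            CrossScaleLayer(d_model, n_heads) for _ in range(num_scales))
        self.aux_heads = nn.ModuleList(
            nn.Linear(d_model, num_classes) for _ in range(num_scales))

    def forward(self, embeddings):
        x, kl = self.vae(embeddings)  # continuous latent over the inputs
        prev, aux_logits = None, []
        for layer, head in zip(self.layers, self.aux_heads):
            x = layer(x, prev_scale=prev)  # sequential bottom-up pass
            aux_logits.append(head(x.mean(dim=1)))  # per-scale prediction
            prev = x
        return aux_logits, kl

# Usage: feed pre-computed token embeddings; a plausible objective is the
# sum of per-scale auxiliary losses plus a weighted KL term from the VAE.
model = HierarchicalEncoder()
emb = torch.randn(2, 32, 256)  # (batch, seq, d_model)
logits_per_scale, kl = model(emb)
```

Under these assumptions, sampling from the VAE latent (or perturbing `z` before decoding) and re-running the forward pass would let one trace how a perturbation to the input embeddings shifts the predictions at each resolution level.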