The rapid expansion of multimodal data, encompassing text, images, and audio, presents significant challenges for efficiently processing and interpreting incomplete or partially defined information. The proposed framework introduces a novel approach to token propagation that addresses the inherent difficulty of grounding partially defined tokens across multiple modalities without human intervention. Using an open-source LLM, Llama, the study demonstrated how cross-modal interactions can resolve ambiguities in token definitions, particularly between text and image data, while also identifying the limitations associated with audio processing. The methodology showed that contextual information from different modalities can be integrated to refine token representations, improving accuracy on multimodal tasks. The evaluation of token-grounding performance highlights the potential for further extending LLM capabilities to real-world applications, where heterogeneous data sources are increasingly prevalent.
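The core idea, integrating contextual evidence from several modalities to ground a partially defined token, can be illustrated with a minimal sketch. The `PartialToken` structure, the `refine_token` function, and the simple word-overlap scoring used here are assumptions introduced for clarity; the overlap score merely stands in for the Llama-based cross-modal resolution described above, and none of these names come from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class PartialToken:
    """A token whose meaning is only partially specified."""
    surface: str                 # the token as it appears in text
    candidate_senses: list[str]  # competing definitions to choose between
    context: dict[str, str] = field(default_factory=dict)  # modality -> contextual description

def refine_token(token: PartialToken) -> str:
    """Pick the candidate sense best supported by the combined cross-modal context.

    A simple word-overlap score is used here as a placeholder for the
    LLM-based resolution step described in the paper.
    """
    # Merge contextual evidence from all available modalities (text, image captions, ...)
    context_vocab = set(" ".join(token.context.values()).lower().split())

    def support(sense: str) -> int:
        # Count how many words of the candidate definition appear in the context.
        return sum(1 for word in sense.lower().split() if word in context_vocab)

    # The sense with the strongest cross-modal support becomes the grounded definition.
    return max(token.candidate_senses, key=support)

# Example: "bank" is ambiguous in text alone; an image caption resolves it.
token = PartialToken(
    surface="bank",
    candidate_senses=["financial institution for deposits", "sloping land beside a river"],
    context={
        "text": "They walked along the bank after the rain.",
        "image_caption": "a muddy river with grass along the sloping land at its edge",
    },
)
print(refine_token(token))  # -> "sloping land beside a river"
```

In this toy example, the image-derived caption supplies the evidence that the textual context alone cannot, mirroring how cross-modal interactions resolve ambiguities in token definitions.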