We explore the application of multimodal large language models (LLMs) to formation control of unmanned aerial vehicle (UAV) swarms, using images and text as input. To validate the feasibility of LLM-based UAV control, we first pre-train an LLM on a dataset of 5,000 samples for a single UAV, achieving a success rate of 78.0% in command extraction. We then extend this framework to a UAV swarm, expanding the dataset to 20,000 samples and pre-training a swarm LLM on it; this model achieves a command-extraction success rate of 82.7%. For image understanding, we employ the multimodal GPT-4 to process images captured by the leading UAV, providing the swarm with environmental recognition and semantic understanding and enabling visual-feature-based formation planning with a planning success rate of 83.8%. Finally, in real-world experiments, LLMs exhibit strong performance on both individual and collective UAV planning problems, demonstrating the broad potential of LLMs in multi-agent formation control and collective swarm tasks. Our experimental video is available at https://youtu.be/WWQToKf8iyM.
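To make the image-plus-text pipeline concrete, the following is a minimal sketch (not the paper's exact implementation) of querying a multimodal LLM with the leading UAV's camera image and a natural-language instruction, then parsing the reply into a formation command. The model name, prompt wording, and JSON command schema are illustrative assumptions.

```python
# Hypothetical sketch: extract a formation command from an image + text
# instruction via a multimodal LLM. Model name, prompt, and command
# schema are assumptions, not the paper's exact pipeline.
import base64
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def extract_formation_command(image_path: str, instruction: str) -> dict:
    """Send the leader UAV's camera view and an instruction to the model;
    expect a JSON formation command back."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o",  # assumed multimodal model; the paper uses GPT-4
        messages=[
            {
                "role": "system",
                "content": (
                    "You plan UAV swarm formations. Reply with JSON only: "
                    '{"formation": "<shape>", "spacing_m": <float>, '
                    '"heading_deg": <float>}'
                ),
            },
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": instruction},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{image_b64}"
                        },
                    },
                ],
            },
        ],
    )
    return json.loads(response.choices[0].message.content)

# Example usage: request a formation that fits the scene the leader sees.
# cmd = extract_formation_command("leader_view.jpg",
#                                 "Form a line to pass through the gap ahead.")
```

In this sketch, the structured JSON reply stands in for the "command extraction" step whose success rates are reported above; a downstream controller would map the parsed command onto per-UAV waypoints.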