In the field of remote sensing, semantic segmentation of Unmanned Aerial Vehicle (UAV) imagery is crucial for tasks such as land resource management, urban planning, precision agriculture, and economic assessment. Traditional methods use Convolutional Neural Networks (CNNs) for hierarchical feature extraction but are limited by their local receptive fields, which restrict comprehensive contextual understanding. To overcome these limitations, we propose combining transformer and attention mechanisms to improve object classification, leveraging their strong global context modeling capabilities to enhance scene understanding. In this paper, we present SwinFAN (Swin-based Focal Axial attention Network), a U-Net framework with a Swin Transformer encoder and a novel decoder that introduces two new components for enhanced semantic segmentation of urban remote sensing images. The first component is a Guided Focal-Axial (GFA) attention module that combines local and global contextual information, improving the model's ability to discern intricate details and complex structures. The second is an Attention-based Feature Refinement Head (AFRH) designed to improve the precision and clarity of segmentation outputs through self-attention and convolutional techniques. Comprehensive experiments demonstrate that our proposed architecture significantly outperforms state-of-the-art models in accuracy. More specifically, our method achieves mean Intersection over Union (mIoU) improvements of 1.9% on UAVid, 3.6% on Potsdam, 1.9% on Vaihingen, and 0.8% on LoveDA.
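To make the two decoder components concrete, the following is a minimal PyTorch sketch of how a GFA-style block (local convolutional branch fused with row/column axial attention) and an AFRH-style head (self-attention followed by convolutional refinement and a per-pixel classifier) could be structured. All module names, channel sizes, and internal details here are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only; layer choices and dimensions are assumptions.
import torch
import torch.nn as nn

class GuidedFocalAxialAttention(nn.Module):
    """Hypothetical GFA block: fuses a local (depthwise conv) branch with a
    global (axial attention along rows, then columns) branch."""
    def __init__(self, channels, heads=4):
        super().__init__()
        self.local = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels),
            nn.BatchNorm2d(channels), nn.GELU())
        self.row_attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.col_attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.fuse = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        local = self.local(x)
        # Global branch: attention along each row, then along each column.
        rows = x.permute(0, 2, 3, 1).reshape(b * h, w, c)
        rows, _ = self.row_attn(rows, rows, rows)
        rows = rows.reshape(b, h, w, c)
        cols = rows.permute(0, 2, 1, 3).reshape(b * w, h, c)
        cols, _ = self.col_attn(cols, cols, cols)
        glob = cols.reshape(b, w, h, c).permute(0, 3, 2, 1)  # back to (b, c, h, w)
        return self.fuse(torch.cat([local, glob], dim=1))

class AttentionFeatureRefinementHead(nn.Module):
    """Hypothetical AFRH: global self-attention over decoder features,
    followed by convolutional refinement and a 1x1 classifier."""
    def __init__(self, channels, num_classes, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.refine = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels), nn.GELU())
        self.classifier = nn.Conv2d(channels, num_classes, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)          # (b, h*w, c)
        tokens, _ = self.attn(tokens, tokens, tokens)  # self-attention over all positions
        x = x + tokens.transpose(1, 2).reshape(b, c, h, w)
        return self.classifier(self.refine(x))

# Example usage with an assumed 96-channel decoder feature map and 8 classes.
gfa = GuidedFocalAxialAttention(96)
head = AttentionFeatureRefinementHead(96, num_classes=8)
feats = torch.randn(2, 96, 64, 64)
logits = head(gfa(feats))  # (2, 8, 64, 64)
```

In the full SwinFAN architecture these blocks would sit in the U-Net decoder on top of multi-scale Swin Transformer encoder features; the sketch above only shows the per-stage computation pattern described in the abstract.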