Voice Liveness Detection (VLD) has become an active research topic in the Internet of Things era. Conventional VLD methods are centralized solutions trained on abundant data collected from local clients and stored on a central server; however, they carry the risk of data islands and privacy leakage. To address this problem, we propose a novel word-level VLD framework based on asynchronous federated learning (FL) with pop noise, named FL-VLD. Structurally, FL-VLD uses the preprocessed voice for local model training and constructs a global model by transmitting only the learned weights, protected with differential privacy, to the FL central server in an asynchronous manner. In addition, the local network of the framework incorporates a residual network and a spatial grouping enhancement module to balance the complexity and accuracy of the global model. With the advantage of FL's distributed structure, FL-VLD solves the data island problem in the VLD scenario without threatening users' privacy. Experimental results on the popular POCO dataset show that our proposal is clearly superior to traditional centralized methods and outperforms other federated schemes in terms of fairness, stability, accuracy, and lightweight design. Further, FL-VLD generalizes well to attacks involving far-field replay, synthesis, and conversion. Finally, an ablation study attests to its efficacy.
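
To make the transmission pipeline described above concrete, the Python sketch below illustrates one plausible realization of client-side differential privacy on the learned weights and asynchronous merging at the central server. The Gaussian noise mechanism, the clipping bound `clip_norm`, and the staleness-decayed mixing coefficient `base_mix` are illustrative assumptions for this sketch, not FL-VLD's actual hyperparameters or aggregation rule.

```python
# Minimal sketch (not the authors' implementation) of the workflow in the abstract:
# a client trains locally, perturbs its weights with differential-privacy noise,
# and the server merges each update asynchronously as it arrives.
import numpy as np

def dp_perturb(weights, clip_norm=1.0, noise_std=0.05, rng=np.random.default_rng(0)):
    """Clip the weight vector and add Gaussian noise before it leaves the client."""
    norm = np.linalg.norm(weights)
    clipped = weights * min(1.0, clip_norm / (norm + 1e-12))
    return clipped + rng.normal(0.0, noise_std, size=weights.shape)

class AsyncServer:
    """Updates the global model as soon as any single client update arrives."""
    def __init__(self, dim, base_mix=0.5):
        self.global_weights = np.zeros(dim)
        self.version = 0            # number of merges applied to the global model
        self.base_mix = base_mix

    def merge(self, client_weights, client_version):
        # Down-weight stale updates: the older the global model the client started
        # from, the smaller its influence (a common asynchronous-FL heuristic).
        staleness = self.version - client_version
        alpha = self.base_mix / (1.0 + staleness)
        self.global_weights = (1 - alpha) * self.global_weights + alpha * client_weights
        self.version += 1
        return self.global_weights

# Toy usage: two clients push DP-noised locally trained weights out of order.
server = AsyncServer(dim=4)
w_client_a = dp_perturb(np.array([0.9, -0.2, 0.4, 0.1]))  # trained from the round-0 model
w_client_b = dp_perturb(np.array([0.7,  0.1, 0.3, 0.2]))  # also started from round 0
server.merge(w_client_a, client_version=0)
server.merge(w_client_b, client_version=0)                # now stale by one merge
print(server.global_weights)
```

In this sketch, only noised weight vectors ever reach the server, which mirrors the abstract's claim that raw voice data stays on the client; the staleness-aware mixing coefficient is one standard way to keep asynchronous updates stable.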