Ruiyi Yan et al.

Generative linguistic steganography aims to embed information into natural language text imperceptibly to achieve covert transmission, yet segmentation ambiguity in unsegmented languages can induce decoding errors. Existing disambiguation approaches remove tokens from candidate pools, which could degrade statistical imperceptibility and potentially incur security risks. To avoid damaging imperceptibility, we focus on keeping candidate pools lossless while tackling segmentation ambiguity. In this paper, we propose SegFree, a segmentation-free generative linguistic steganographic approach for unsegmented languages. First, we present an adaptive-checksum verification method to filter out errors caused by segmentation ambiguity: the sender appends an adaptive-length checksum to the original covert message and embeds the whole message into the steganographic text, and the receiver correspondingly extracts and verifies all potential results through our proposed all-case extraction. The checksum length is adaptive and can be determined by both sides, which minimizes checksum overhead and adapts to various cases. Further, to transmit large-scale messages and deter knowledgeable adversaries from extracting them, we present a key-based, anti-extraction steganographic mode in which the input parameters for each grouped message are accessible only to the two communicating parties. Experiments show that SegFree, with lossless candidate pools, achieves about 40% higher imperceptibility and over 30% higher embedding capacity than linguistic steganography equipped with the existing disambiguation approach. Moreover, SegFree achieves a success rate of more than 99.60% in preventing adversaries' extraction.
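
To make the checksum step concrete, below is a minimal Python sketch of the sender-side append and the receiver-side all-case verification. The CRC32-based checksum, the bit-string representation, and the helper names (`append_checksum`, `verify_candidate`, `all_case_extract`) are illustrative assumptions rather than the paper's actual construction, and the rule that makes the checksum length adaptive and known to both sides is left abstract here.

```python
import zlib


def append_checksum(message_bits: str, checksum_len: int) -> str:
    """Sender side: append a checksum_len-bit checksum to the covert bit string."""
    # Illustrative checksum: the low checksum_len bits of CRC32 over the message bits.
    crc = zlib.crc32(message_bits.encode("ascii"))
    checksum = format(crc, "032b")[-checksum_len:]
    return message_bits + checksum


def verify_candidate(candidate_bits: str, checksum_len: int) -> bool:
    """Receiver side: check whether one extracted candidate is self-consistent."""
    if len(candidate_bits) <= checksum_len:
        return False
    body, checksum = candidate_bits[:-checksum_len], candidate_bits[-checksum_len:]
    return format(zlib.crc32(body.encode("ascii")), "032b")[-checksum_len:] == checksum


def all_case_extract(candidate_bit_strings, checksum_len: int):
    """Filter the bit strings recovered from every plausible segmentation,
    keeping only the message bodies whose appended checksum verifies."""
    return [c[:-checksum_len]
            for c in candidate_bit_strings
            if verify_candidate(c, checksum_len)]


# Example: the sender embeds append_checksum("1011010", 8) into the stego text;
# the receiver enumerates candidate segmentations, converts each to a bit string,
# and keeps only those passing all_case_extract.
```

In this sketch, candidates produced by wrong segmentations are rejected because their trailing bits fail the checksum test, so no tokens ever need to be removed from the candidate pools.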