Ruiyi Yan et al.

Generative linguistic steganography aims to embed information into natural language text imperceptibly to achieve covert transmission, yet segmentation ambiguity in unsegmented languages can induce decoding errors. Existing disambiguation approaches remove tokens from candidate pools, which could degrade statistical imperceptibility and potentially incur security risks. To avoid damaging imperceptibility, we focus on keeping candidate pools lossless while tackling segmentation ambiguity. In this paper, we propose SegFree, a segmentation-free generative linguistic steganographic approach for unsegmented languages. First, we present an adaptive-checksum verification method to filter out errors caused by segmentation ambiguity: the sender appends an adaptive-length checksum to the original covert message and embeds the whole message into the steganographic text, and the receiver correspondingly extracts and verifies all potential results through our proposed all-case extraction. The checksum length is adaptive and can be determined by both sides, which minimizes checksum overhead and adapts to various cases. Further, to transmit large-scale messages and deter knowledgeable adversaries from extracting them, we present a key-based, anti-extraction steganographic mode in which the input parameters for each grouped message are accessible only to the two communicating parties. Experiments show that SegFree, with lossless candidate pools, achieves about 40% higher imperceptibility and over 30% higher embedding capacity than linguistic steganography equipped with the existing disambiguation approach. Moreover, SegFree achieves a success rate of more than 99.60% in preventing adversaries' extraction.
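
To make the checksum step concrete, below is a minimal Python sketch of the sender-side append and the receiver-side all-case verification. The CRC32-based checksum, the bit-string representation, and the helper names (`append_checksum`, `verify_candidate`, `all_case_extract`) are illustrative assumptions rather than the paper's actual construction, and the rule that makes the checksum length adaptive and known to both sides is left abstract here.

```python
import zlib


def append_checksum(message_bits: str, checksum_len: int) -> str:
    """Sender side: append a checksum_len-bit checksum to the covert bit string."""
    # Illustrative checksum: the low checksum_len bits of CRC32 over the message bits.
    crc = zlib.crc32(message_bits.encode("ascii"))
    checksum = format(crc, "032b")[-checksum_len:]
    return message_bits + checksum


def verify_candidate(candidate_bits: str, checksum_len: int) -> bool:
    """Receiver side: check whether one extracted candidate is self-consistent."""
    if len(candidate_bits) <= checksum_len:
        return False
    body, checksum = candidate_bits[:-checksum_len], candidate_bits[-checksum_len:]
    return format(zlib.crc32(body.encode("ascii")), "032b")[-checksum_len:] == checksum


def all_case_extract(candidate_bit_strings, checksum_len: int):
    """Filter the bit strings recovered from every plausible segmentation,
    keeping only the message bodies whose appended checksum verifies."""
    return [c[:-checksum_len]
            for c in candidate_bit_strings
            if verify_candidate(c, checksum_len)]


# Example: the sender embeds append_checksum("1011010", 8) into the stego text;
# the receiver enumerates candidate segmentations, converts each to a bit string,
# and keeps only those passing all_case_extract.
```

In this sketch, candidates produced by wrong segmentations are rejected because their trailing bits fail the checksum test, so no tokens ever need to be removed from the candidate pools.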