Articles
A Simple Discussion of Escape Encoding
By: Ao Chen
Published on: 9/22/2025
We rigorously discuss the encoding of escape characters from a mathematical perspective.
Tags: programming, coding, theoretical CS
Escape Encoding
We consider the alphabet , where denotes nonspecial characters and denotes the characters that need escaping. Choose a particular as the escape character. We first define the escape sequence
and then specify the encoding for each character
Here, the angle brackets ⟨·⟩ denote a string (no comma separators; you may formally view ⟨·⟩ as ”·”), denotes string concatenation, and denotes disjoint union, i.e., we require . Finally, for a string, we simply concatenate its per-character encodings:
Propositional Logic Expressions
As an example, consider an encoding for a simplified form of propositional logic expressions. The alphabet of propositional logic is
We do not introduce conjunction and disjunction , since both can be defined from and . We extend the above alphabet to
and stipulate
We then encode
The result looks like
Conditions
This encoding scheme depends entirely on the choice of . To ensure that the encoding map admits an inverse , we need to impose conditions on so that is injective. In that case, can be defined as the inverse of after restricting the codomain of to its image (so that becomes a bijection).
Of course, we must ensure that is injective, but that alone is not sufficient. In fact, we need a stronger condition on .
Prefix-Free Encoding
Here we need to mention prefix-free codes, sometimes simply called prefix codes.
Let be a language over the alphabet . If is prefix-free, then for all , if , then is not a prefix of , i.e., there does not exist such that .
We note the following:
If the range of is prefix-free, then is called a prefix-free encoder, and the encoder constructed according to (1)–(3) necessarily admits a unique decoder such that
We prove this in five steps.
Proof
Single-letter encoding
First, consider
Clearly, the codewords in fall into three categories:
- Single-character codes: for every there is a length-1 codeword .
- Escape code: the codeword represents the escape character .
- Extension codes: for each , the codeword is , where .
In other words, we can write
We claim:
is prefix-free.
Take any with . If were not prefix-free, then necessarily ; by (7), in this case we must have .
If , then . The only such is , hence . Thus there is no codeword that has as a prefix.
If
then if is a prefix of , the suffix is also a prefix of . But is prefix-free, a contradiction.
Therefore, is prefix-free.
Unique parsability
If a codeword set is prefix-free, then any has a unique -factorization (if it exists). That is, if
then and .
Consider the first blocks and in the two factorizations. If , then since both are elements of and both start at the beginning of , one must be a prefix of the other, contradicting the prefix-free property of . Hence . Removing the common first block from both sides, the remainder satisfies the same condition; proceed recursively to conclude all blocks are equal.
Thus we have the unique factorization property.
Existence of a decoder
The encoding map maps each source letter to a codeword in and concatenates them. For any source string ,
which is a concatenation of elements of . By the unique parsability lemma, any two different source strings have encodings with different -factorizations; hence . Therefore, is injective.
Since is injective, there exists its inverse on the image, , and for any source string we have .
Constructing the decoder
We can give a simple greedy decoding algorithm that uses the prefix-free property of and scans left to right, taking the shortest matching element of (or deciding directly by the shape of ):
For an input encoded string :
- Let point to the current first position of the remaining string;
- If , then must be the length-1 codeword for some . Output that letter and set ;
- Otherwise ():
- If and , then the current block is . Output and set ;
- Otherwise (i.e., ), read from a nonempty string (consisting of letters not equal to ) such that . Because is prefix-free and comes from , there exists a unique shortest such prefix . Output and set ;
- Repeat until passes the end of the string.
Because is prefix-free, the test in steps 2 and 3 is unambiguous at each step. Each output letter indeed corresponds to the codeword of some original single letter. By induction (peeling off one leading block each time), the algorithm recovers for any .
Uniqueness of the decoder
If there exists another map with , then for any we have . The constructed above also satisfies the same equality on the image, so and agree on . Therefore, the inverse on is unique.