CA77's Site.


Articles

A Simple Discussion of Escape Encoding

By: Ao Chen

Published on: 9/22/2025

We rigorously discuss the encoding of escape characters from a mathematical perspective.

Tags: programming, coding, theoretical CS


Escape Encoding

We consider the alphabet , where denotes nonspecial characters and denotes the characters that need escaping. Choose a particular as the escape character. We first define the escape sequence

and then specify the encoding for each character

Here, the angle brackets ⟨·⟩ denote a string (no comma separators; you may formally view ⟨·⟩ as ”·”), denotes string concatenation, and denotes disjoint union, i.e., we require . Finally, for a string, we simply concatenate its per-character encodings:

Propositional Logic Expressions

As an example, consider an encoding for a simplified form of propositional logic expressions. The alphabet of propositional logic is

We do not introduce conjunction and disjunction , since both can be defined from and . We extend the above alphabet to

and stipulate

We then encode

The result looks like

Conditions

This encoding scheme depends entirely on the choice of . To ensure that the encoding map admits an inverse , we need to impose conditions on so that is injective. In that case, can be defined as the inverse of after restricting the codomain of to its image (so that becomes a bijection).

Of course, we must ensure that is injective, but that alone is not sufficient. In fact, we need a stronger condition on .

Prefix-Free Encoding

Here we need to mention prefix-free codes, sometimes simply called prefix codes.

Definition

Let be a language over the alphabet . If is prefix-free, then for all , if , then is not a prefix of , i.e., there does not exist such that .

We note the following:

Theorem

If the range of is prefix-free, then is called a prefix-free encoder, and the encoder constructed according to (1)–(3) necessarily admits a unique decoder such that

We prove this in five steps.

Proof

Single-letter encoding

First, consider

Clearly, the codewords in fall into three categories:

  • Single-character codes: for every there is a length-1 codeword .
  • Escape code: the codeword represents the escape character .
  • Extension codes: for each , the codeword is , where .

In other words, we can write

We claim:

Lemma

is prefix-free.

Take any with . If were not prefix-free, then necessarily ; by (7), in this case we must have .

If , then . The only such is , hence . Thus there is no codeword that has as a prefix.

If

then if is a prefix of , the suffix is also a prefix of . But is prefix-free, a contradiction.

Therefore, is prefix-free.

Unique parsability

Lemma

If a codeword set is prefix-free, then any has a unique -factorization (if it exists). That is, if

then and .

Consider the first blocks and in the two factorizations. If , then since both are elements of and both start at the beginning of , one must be a prefix of the other, contradicting the prefix-free property of . Hence . Removing the common first block from both sides, the remainder satisfies the same condition; proceed recursively to conclude all blocks are equal.

Thus we have the unique factorization property.

Existence of a decoder

The encoding map maps each source letter to a codeword in and concatenates them. For any source string ,

which is a concatenation of elements of . By the unique parsability lemma, any two different source strings have encodings with different -factorizations; hence . Therefore, is injective.

Since is injective, there exists its inverse on the image, , and for any source string we have .

Constructing the decoder

We can give a simple greedy decoding algorithm that uses the prefix-free property of and scans left to right, taking the shortest matching element of (or deciding directly by the shape of ):

For an input encoded string :

  1. Let point to the current first position of the remaining string;
  2. If , then must be the length-1 codeword for some . Output that letter and set ;
  3. Otherwise ():
    • If and , then the current block is . Output and set ;
    • Otherwise (i.e., ), read from a nonempty string (consisting of letters not equal to ) such that . Because is prefix-free and comes from , there exists a unique shortest such prefix . Output and set ;
  4. Repeat until passes the end of the string.

Because is prefix-free, the test in steps 2 and 3 is unambiguous at each step. Each output letter indeed corresponds to the codeword of some original single letter. By induction (peeling off one leading block each time), the algorithm recovers for any .

Uniqueness of the decoder

If there exists another map with , then for any we have . The constructed above also satisfies the same equality on the image, so and agree on . Therefore, the inverse on is unique.