CA77's Site.

A Simple Discussion of Escape Encoding

Published on: 9/22/2025

We rigorously discuss the encoding of escape characters from a mathematical perspective.

Tags: programming, coding, theoretical CS

Escape Encoding

We consider the alphabet $Γ ⊔ Γ^{'}$, where $Γ$ is the set of ordinary (non-special) characters and $Γ^{'}$ is the set of characters that need escaping. Choose a particular $γ^{0} \in Γ$ as the escape character. We first define the escape-sequence map

$$e_{1} : Γ^{'} ⟶ (Γ ∖ \{γ^{0}\})^{+} \tag{1}$$

and then specify the encoding for each character

$$e_{2} : Γ ⊔ Γ^{'} ⟶ Γ^{+}, \quad \begin{cases} γ^{0} ⟼ ⟨ γ^{0} γ^{0} ⟩ \\ Γ ∖ \{γ^{0}\} ∋ γ ⟼ ⟨ γ ⟩ \\ Γ^{'} ∋ γ ⟼ ⟨ γ^{0} ⟩ * e_{1} (γ) \end{cases} \tag{2}$$

Here, the angle brackets ⟨·⟩ denote a string (no comma separators; you may formally view ⟨·⟩ as "·"), $*$ denotes string concatenation, and $⊔$ denotes disjoint union, i.e., we require $Γ \cap Γ^{'} = \emptyset$. Finally, for a string, we simply concatenate its per-character encodings:

$$e : (Γ ⊔ Γ^{'})^{+} ⟶ Γ^{+}, \quad ⟨ γ_{0} γ_{1} γ_{2} \dots γ_{n - 1} ⟩ ⟼ e_{2} (γ_{0}) * e_{2} (γ_{1}) * e_{2} (γ_{2}) * \dots * e_{2} (γ_{n - 1}) \tag{3}$$
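The maps (1)–(3) can be sketched in a few lines of Python. This is only an illustration, assuming each alphabet symbol is a single Python character; the function name `encode` and the sample table `E1` (backslash as $γ^{0}$, newline and tab as the special characters) are hypothetical choices, not part of the construction above.

```python
def encode(source: str, escape: str, e1: dict[str, str]) -> str:
    """Encode `source` per (1)-(3); `e1` maps each special character to its escape sequence."""
    # (1) requires every escape sequence to be nonempty and free of the escape character.
    for w in e1.values():
        if not w or escape in w:
            raise ValueError(f"invalid escape sequence: {w!r}")
    out = []
    for ch in source:
        if ch == escape:
            out.append(escape + escape)   # first case of (2): escape character is doubled
        elif ch in e1:
            out.append(escape + e1[ch])   # third case of (2): escape character, then e1(ch)
        else:
            out.append(ch)                # second case of (2): ordinary character, unchanged
    return "".join(out)                   # (3): concatenate the per-character codewords


# Hypothetical instance: backslash escapes newline and tab.
E1 = {"\n": "n", "\t": "t"}
encoded = encode("a\\b\n", "\\", E1)      # source is: a, backslash, b, newline
print(encoded == "a\\\\b\\n")             # prints True
```

The order of the branches mirrors the case distinction in (2): the escape character is tested first because it belongs to $Γ$ but is not encoded as itself.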

Propositional Logic Expressions

As an example, consider an encoding for a simplified form of propositional logic expressions. The alphabet of propositional logic is

$$\{(, ), \neg, \to, A_{n} ∣ n \in \mathbb{N}\} \tag{4a}$$

We do not introduce conjunction $\land$ and disjunction $\lor$ , since both can be defined from $\neg$ and $\to$ . We extend the above alphabet to

$$\{\backslash, n, t, o, (, ), \neg, \to, A_{n} ∣ n \in \mathbb{N}\} \tag{4b}$$

and stipulate

$$Γ = \{\backslash, n, t, o, (, ), A_{n} ∣ n \in \mathbb{N}\}, \quad γ^{0} = \backslash, \quad Γ^{'} = \{\neg, \to\} \tag{4c}$$

We then encode

$$e_{1} (\neg) = ⟨ not ⟩, \quad e_{1} (\to) = ⟨ to ⟩ \tag{4d}$$

The result looks like

$$e (⟨ A_{2} \to (A_{1} \to \neg A_{3}) ⟩) = ⟨ A_{2} \backslash to (A_{1} \backslash to \backslash not A_{3}) ⟩ \tag{4}$$
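The example can be reproduced mechanically. The sketch below is an assumption-laden illustration: it represents a string as a list of tokens so that each $A_{n}$ counts as one symbol (written here as `"A2"`, `"A1"`, ...), and the names `ESC`, `E1`, and `encode_tokens` are hypothetical.

```python
ESC = "\\"                       # the escape character from (4c)
E1 = {"¬": "not", "→": "to"}     # the escape sequences from (4d)


def encode_tokens(tokens: list[str]) -> list[str]:
    """Apply e2 to every token and concatenate the results, as in (3)."""
    out: list[str] = []
    for tok in tokens:
        if tok == ESC:
            out += [ESC, ESC]        # escape character is doubled
        elif tok in E1:
            out += [ESC, *E1[tok]]   # escape character, then the letters of e1(tok)
        else:
            out.append(tok)          # ordinary symbol, unchanged
    return out


formula = ["A2", "→", "(", "A1", "→", "¬", "A3", ")"]
encoded = encode_tokens(formula)
print(encoded == ["A2", "\\", "t", "o", "(", "A1", "\\", "t", "o",
                  "\\", "n", "o", "t", "A3", ")"])   # prints True
```

Each connective expands into the escape character followed by the individual letters of its name, which is why `n`, `t`, and `o` had to be added to the alphabet in (4b).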

Conditions

This encoding scheme depends entirely on the choice of $e_{1}$ . To ensure that the encoding map $e$ admits an inverse $d$ , we need to impose conditions on $e_{1}$ so that $e$ is injective. In that case, $d$ can be defined as the inverse of $e$ after restricting the codomain of $e$ to its image (so that $e$ becomes a bijection).

Of course, we must ensure that $e_{1}$ is injective, but that alone is not sufficient. In fact, we need a stronger condition on $e_{1}$ .
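To see why injectivity alone is not enough, consider a small counterexample. The table below is a hypothetical choice: $Γ = \{\backslash, a, b\}$, $Γ^{'} = \{X, Y\}$, with $e_{1} (X) = ⟨ a ⟩$ and $e_{1} (Y) = ⟨ ab ⟩$. This $e_{1}$ is injective, yet two different source strings receive the same encoding.

```python
def encode(source: str, escape: str, e1: dict[str, str]) -> str:
    """Per-character encoding following (2) and (3)."""
    out = []
    for ch in source:
        if ch == escape:
            out.append(escape + escape)
        elif ch in e1:
            out.append(escape + e1[ch])
        else:
            out.append(ch)
    return "".join(out)


# Hypothetical table: injective, but "a" is a prefix of "ab".
E1 = {"X": "a", "Y": "ab"}

u = "Xb"   # special character X followed by the ordinary character b
v = "Y"    # the single special character Y
print(encode(u, "\\", E1) == encode(v, "\\", E1))   # prints True: e is not injective
```

Both strings encode to a backslash followed by `ab`, so no decoder can tell them apart.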

Prefix-Free Encoding

Here we need to mention prefix-free codes, sometimes simply called prefix codes.

Definition

Let $L \subset Γ^{*}$ be a language over the alphabet $Γ$. We say that $L$ is prefix-free if for all $s_{1}, s_{2} \in L$ with $s_{1} \neq s_{2}$, the string $s_{1}$ is not a prefix of $s_{2}$, i.e., there does not exist $w \in Γ^{*}$ such that $s_{2} = s_{1} w$.
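For a finite language the definition can be checked directly. The following is a minimal sketch; the function name `is_prefix_free` is a hypothetical choice, and the quadratic pairwise test is chosen for clarity rather than speed.

```python
from itertools import permutations


def is_prefix_free(language: set[str]) -> bool:
    """Return True if no word of `language` is a prefix of a different word."""
    # permutations(..., 2) yields every ordered pair of distinct words.
    return not any(s2.startswith(s1) for s1, s2 in permutations(language, 2))


print(is_prefix_free({"not", "to"}))   # prints True
print(is_prefix_free({"a", "ab"}))     # prints False: "a" is a prefix of "ab"
```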

We note the following:

Theorem

Call $e_{1}$ a prefix-free encoder if its range is prefix-free. If $e_{1}$ is an injective prefix-free encoder, then the encoder $e$ constructed according to (1)–(3) admits a unique decoder $d : Ran e \to (Γ ⊔ Γ^{'})^{+}$ such that

$$d \circ e = \mathrm{id}_{(Γ ⊔ Γ^{'})^{+}} \tag{5}$$

We prove this in five steps.

Proof

Single-letter encoding

First, consider

$$C = e_{2} (Γ ⊔ Γ^{'}) = \{e_{2} (α) ∣ α \in Γ ⊔ Γ^{'}\} \tag{6}$$

Clearly, the codewords in $C$ fall into three categories:

- Single-character codes: for every $a \in Γ ∖ \{γ^{0}\}$ there is a length-1 codeword $⟨ a ⟩$.
- Escape code: the codeword $⟨ γ^{0} γ^{0} ⟩$ represents the escape character $γ^{0}$.
- Extension codes: for each $β \in Γ^{'}$, the codeword is $⟨ γ^{0} ⟩ * w$, where $w = e_{1} (β) \in (Γ ∖ \{γ^{0}\})^{+}$.

In other words, we can write

$$C = \{⟨ a ⟩ ∣ a \in Γ ∖ \{γ^{0}\}\} ⊔ \{⟨ γ^{0} γ^{0} ⟩\} ⊔ \{⟨ γ^{0} ⟩ * w ∣ w \in Ran e_{1} \subset (Γ ∖ \{γ^{0}\})^{+}\} \tag{7}$$

We claim:

Lemma

$C$ is prefix-free.

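Take any $c, c^{'} \in C$ with $c \neq c^{'}$ and suppose, for contradiction, that $c$ is a prefix of $c^{'}$. Then necessarily $c [0] = c^{'} [0]$. By (7), the only codeword beginning with a character $a \neq γ^{0}$ is $⟨ a ⟩$ itself, so in this case we must have $c [0] = c^{'} [0] = γ^{0}$.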

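If $c = ⟨ γ^{0} γ^{0} ⟩$, then $c^{'} [1] = γ^{0}$. The only such $c^{'}$ is $⟨ γ^{0} γ^{0} ⟩$, because the second character of every extension code lies in $Γ ∖ \{γ^{0}\}$; hence $c^{'} = c$, a contradiction. Thus no other codeword has $⟨ γ^{0} γ^{0} ⟩$ as a prefix. For the same reason, no extension code is a prefix of $⟨ γ^{0} γ^{0} ⟩$.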

If instead both are extension codes,

$$c, c^{'} \in \{⟨ γ^{0} ⟩ * w ∣ w \in Ran e_{1}\},$$

then, since $c$ is a prefix of $c^{'}$, the suffix $⟨ c [1] c [2] \dots ⟩$ is also a prefix of $⟨ c^{'} [1] c^{'} [2] \dots ⟩$. These two suffixes are distinct elements of $Ran e_{1}$, which is prefix-free, a contradiction.

Therefore, $C$ is prefix-free.
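The lemma can be sanity-checked on a small finite instance. The sketch below builds $C$ as in (7) and tests it pairwise; the helper names `codeword_set` and `is_prefix_free`, the finite alphabet `GAMMA` (with a single letter `A` standing in for the $A_{n}$), and both sample tables are assumptions made for illustration only.

```python
from itertools import permutations


def codeword_set(gamma: set[str], escape: str, e1: dict[str, str]) -> set[str]:
    """Build C as in (7): plain letters, the doubled escape, and escaped sequences."""
    plain = {a for a in gamma if a != escape}
    return plain | {escape + escape} | {escape + w for w in e1.values()}


def is_prefix_free(words: set[str]) -> bool:
    """Return True if no word is a prefix of a different word."""
    return not any(b.startswith(a) for a, b in permutations(words, 2))


GAMMA = {"\\", "n", "t", "o", "(", ")", "A"}

good = codeword_set(GAMMA, "\\", {"¬": "not", "→": "to"})   # Ran e1 is prefix-free
bad = codeword_set(GAMMA, "\\", {"¬": "n", "→": "not"})     # "n" is a prefix of "not"
print(is_prefix_free(good), is_prefix_free(bad))            # prints True False
```

The second table shows that the hypothesis on $Ran e_{1}$ cannot simply be dropped: a prefix relation inside $Ran e_{1}$ carries over to $C$.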

Unique parsability

Lemma

If a codeword set $C \subset Γ^{*}$ is prefix-free, then any $s \in Γ^{*}$ has a unique $C$ -factorization (if it exists). That is, if

$$s = c_{1} * c_{2} * \dots * c_{m} = d_{1} * d_{2} * \dots * d_{n}, \quad c_{i}, d_{j} \in C,$$

then $m = n$ and $c_{i} = d_{i} (\forall i)$ .

Consider the first blocks $c_{1}$ and $d_{1}$ in the two factorizations. If $c_{1} \neq d_{1}$, then since both are elements of $C$ and both start at the beginning of $s$, one must be a prefix of the other, contradicting the prefix-free property of $C$. Hence $c_{1} = d_{1}$. Removing the common first block from both sides, the remainder satisfies the same condition; proceed recursively to conclude all blocks are equal.

Thus we have the unique factorization property.
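A brute-force enumeration makes the lemma concrete. The sketch below lists every $C$-factorization of a string by trying each codeword as the first block; the function name `factorizations` and the two sample codeword sets are hypothetical, and the exponential search is meant only for tiny inputs.

```python
def factorizations(s: str, code: set[str]) -> list[list[str]]:
    """Return all ways to write `s` as a concatenation of words from `code`."""
    if s == "":
        return [[]]                      # the empty string has one (empty) factorization
    result = []
    for c in code:
        if c and s.startswith(c):        # try c as the first block
            for rest in factorizations(s[len(c):], code):
                result.append([c] + rest)
    return result


prefix_free = {"a", "b", "\\\\", "\\not", "\\to"}
ambiguous = {"a", "b", "\\\\", "\\a", "\\ab"}     # "\a" is a prefix of "\ab"

print(len(factorizations("a\\to\\\\b", prefix_free)))   # prints 1
print(len(factorizations("\\ab", ambiguous)))           # prints 2
```

With the prefix-free set there is never more than one candidate for the first block, which is exactly the argument used in the proof.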

Existence of a decoder

The encoding map $e$ maps each source letter to a codeword in $C$ and concatenates them. For any source string $u = ⟨ α_{1} α_{2} \dots α_{k} ⟩ \in (Γ ⊔ Γ^{'})^{+}$ ,

$$e (u) = e_{2} (α_{1}) * e_{2} (α_{2}) * \dots * e_{2} (α_{k}),$$

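which is a concatenation of elements of $C$. By the unique parsability lemma, this is the only $C$-factorization of $e (u)$. Moreover, $e_{2}$ is injective: the three classes in (7) are pairwise disjoint, and $e_{1}$ is injective. Hence any two different source strings $u \neq v$ give different sequences of codewords, so $e (u)$ and $e (v)$ have different $C$-factorizations and $e (u) \neq e (v)$. Therefore, $e$ is injective.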

Since $e$ is injective, there exists its inverse on the image, $d : Ran e \to (Γ ⊔ Γ^{'})^{+}$ , and for any source string $u$ we have $d (e (u)) = u$ .

Constructing the decoder

We can give a simple greedy decoding algorithm that uses the prefix-free property of $C$ and scans left to right, taking the shortest matching element of $C$ (or deciding directly by the shape of $C$ ):

For an input encoded string $s \in Ran e$ :

1. Let $i$ point to the current first position of the remaining string (initially $i = 0$);
2. If $s [i] \neq γ^{0}$, then $s [i]$ must be the length-1 codeword for some $a \in Γ ∖ \{γ^{0}\}$. Output that letter $a$ and set $i \leftarrow i + 1$;
3. Otherwise ($s [i] = γ^{0}$):
   - If $i + 1 < |s|$ and $s [i + 1] = γ^{0}$, then the current block is $⟨ γ^{0} γ^{0} ⟩$. Output $γ^{0}$ and set $i \leftarrow i + 2$;
   - Otherwise (i.e., $s [i + 1] \neq γ^{0}$), read from $i + 1$ a nonempty string $w$ (consisting of letters not equal to $γ^{0}$) such that $w \in e_{1} (Γ^{'})$. Because $e_{1} (Γ^{'})$ is prefix-free and $s$ comes from $e$, there exists a unique such prefix $w$, so taking the shortest match is safe. Output $e_{1}^{- 1} (w)$ and set $i \leftarrow i + 1 + |w|$;
4. Repeat steps 2 and 3 until $i$ passes the end of the string.

Because $C$ is prefix-free, the test in steps 2 and 3 is unambiguous at each step. Each output letter indeed corresponds to the codeword of some original single letter. By induction (peeling off one leading block each time), the algorithm recovers $u$ for any $s = e (u)$ .
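The greedy procedure translates almost line by line into code. The following is a sketch under stated assumptions: symbols are single Python characters, `e1` is injective with a prefix-free range, and the names `encode`, `decode`, and `E1` are hypothetical. Inputs outside $Ran e$ are rejected with an exception, a choice the text above does not prescribe.

```python
def encode(source: str, escape: str, e1: dict[str, str]) -> str:
    """Per-character encoding following (2) and (3)."""
    out = []
    for ch in source:
        if ch == escape:
            out.append(escape + escape)
        elif ch in e1:
            out.append(escape + e1[ch])
        else:
            out.append(ch)
    return "".join(out)


def decode(s: str, escape: str, e1: dict[str, str]) -> str:
    """Greedy left-to-right decoder; assumes e1 is injective with prefix-free range."""
    inverse = {w: ch for ch, w in e1.items()}    # plays the role of e1^{-1}
    out = []
    i = 0
    while i < len(s):
        if s[i] != escape:                       # length-1 codeword
            out.append(s[i])
            i += 1
        elif i + 1 < len(s) and s[i + 1] == escape:
            out.append(escape)                   # doubled escape character
            i += 2
        else:
            # Extend w one character at a time until it lies in Ran e1.
            for end in range(i + 2, len(s) + 1):
                w = s[i + 1:end]
                if w in inverse:
                    out.append(inverse[w])
                    i = end
                    break
            else:
                raise ValueError(f"no escape sequence matches at position {i}")
    return "".join(out)


E1 = {"¬": "not", "→": "to"}
source = "A→(B→¬C)\\"                            # ends with a literal backslash
print(decode(encode(source, "\\", E1), "\\", E1) == source)   # prints True
```

The inner loop stops at the first match, which is the shortest one; by the prefix-free property no longer match can exist, so nothing is lost by stopping early.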

Uniqueness of the decoder

If there exists another map $d^{'} : Ran e \to (Γ ⊔ Γ^{'})^{+}$ with $d^{'} \circ e = id$ , then for any $u \in (Γ ⊔ Γ^{'})^{+}$ we have $d^{'} (e (u)) = u$ . The $d$ constructed above also satisfies the same equality on the image, so $d^{'}$ and $d$ agree on $Ran e$ . Therefore, the inverse on $Ran e$ is unique.