
2.2 Discrete Information Source and Discrete Entropy

Information is the description of the uncertainty of the motion states or existence ways of things. Information is carried and transmitted in the form of messages. As the producer and sender of messages, the essential feature of an information source is uncertainty, so it can be described by a random variable.

According to the type of random variable, a source can be divided into a discrete source or a continuous source, which correspond to discrete random variables and continuous random variables respectively; their probability distributions are as follows.

1. Discrete random variables

$$\begin{pmatrix} X \\ P \end{pmatrix} = \begin{pmatrix} a_1 & a_2 & \cdots & a_n \\ P(a_1) & P(a_2) & \cdots & P(a_n) \end{pmatrix}, \qquad \sum_{i=1}^{n} P(a_i) = 1 \tag{2.2.1}$$

2. Continuous random variables

$$\begin{pmatrix} X \\ P \end{pmatrix} = \begin{pmatrix} (a, b) \\ p(x) \end{pmatrix}, \qquad \int_{a}^{b} p(x)\,\mathrm{d}x = 1 \tag{2.2.2}$$

From the dimension of the message, a source can be divided into a 1-dimensional source or a multi-dimensional source. A 1-dimensional source is also known as a single-symbol source, and an n-dimensional source is also known as an n-dimensional symbol sequence source. Its output message should be described by an n-dimensional random vector X = (X_1, X_2, …, X_n).

For an n-dimensional source, if the output symbols X_1, X_2, …, X_n are statistically independent, that is, P(X_1 X_2 … X_n) = P(X_1)P(X_2)…P(X_n), the source is called a memoryless source; otherwise, it is called a source with memory.
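As a quick numerical illustration of the memoryless condition, the following sketch (with a hypothetical single-symbol distribution chosen only for illustration) builds the joint distribution of a 3-dimensional memoryless source as the product of single-symbol probabilities and checks that it still sums to 1.

```python
from itertools import product

# Hypothetical single-symbol distribution P(X = a_i); the values are assumptions.
P = {"a1": 0.5, "a2": 0.3, "a3": 0.2}

# For a memoryless source, the probability of a length-3 output sequence factors:
# P(x1 x2 x3) = P(x1) P(x2) P(x3).
joint = {seq: P[seq[0]] * P[seq[1]] * P[seq[2]] for seq in product(P, repeat=3)}

print(round(sum(joint.values()), 10))          # 1.0: the 27 sequence probabilities sum to 1
print(round(joint[("a1", "a2", "a3")], 10))    # 0.5 * 0.3 * 0.2 = 0.03
```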

2.2.1 Discrete Entropy

Let X be a discrete random variable with a probability distribution of the form of Equation (2.2.1) or Equation (2.2.2). In a random experiment, the smaller the probability of an event, say X = a_i, the more surprising and incredible it will be once it happens; on the contrary, for an event with a high probability, its occurrence will not be surprising but will be very common. Thus, the information amount contained in an event should be a decreasing function of its probability.

In addition to this decreasing property, the information amount should also be additive. That is, the information amount contained in the simultaneous occurrence of two independent events E_1 and E_2, i.e., I(E_1 E_2), should be equal to the sum of their respective information amounts I(E_1) + I(E_2). Furthermore, the probability of the impossible event φ is 0, so its information amount is ∞; the probability of the inevitable event Ω is 1, so its information amount is 0. As a result, the logarithmic definition of the information amount is proposed as follows.

Definition 2.2.1 The amount of information contained in an event X = a_i is denoted by I(X = a_i) and calculated by

$$I(X = a_i) = -\log P(a_i) = \log \frac{1}{P(a_i)} \tag{2.2.3}$$

Generally, the logarithm in the formula takes base 2, and the unit of information is the bit. Then another problem arises: "How to measure the information amount of the random variable X?" A natural idea is to use the statistical averaging method often used in probability, that is, to take the expectation of the information amounts of all basic events X = a_i as the measure of the information amount of X.

Definition 2.2.2 The information amount of a random variable X is called its entropy, denoted by H(X) and calculated as the expectation of the information amount I(X = a_i), i.e.,

$$H(X) = E[I(X = a_i)] = -\sum_{i=1}^{n} P(a_i)\log P(a_i) \tag{2.2.4}$$
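Definitions 2.2.1 and 2.2.2 translate directly into code. A minimal sketch in Python, using an illustrative distribution that is not from the text:

```python
import math

def self_information(p: float) -> float:
    """I(X = a_i) = -log2 P(a_i), measured in bits."""
    return -math.log2(p)

def entropy(probs) -> float:
    """H(X) = sum_i P(a_i) * I(X = a_i) = -sum_i P(a_i) * log2 P(a_i), in bits.
    Symbols with zero probability contribute nothing by convention."""
    return sum(p * self_information(p) for p in probs if p > 0)

# Illustrative distribution (an assumption, not an example from the text).
print(self_information(0.25))        # 2.0 bits
print(entropy([0.5, 0.25, 0.25]))    # 1.5 bits
```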

The physical meanings of entropy include:

(1) The entropy H(X) measures the information amount of X.

(2) The entropy H(X) depicts the a priori uncertainty of the information source X.

(3) The entropy H(X) depicts the average information amount of the symbols {X = a_i}.

(4) The entropy H(X) equals the minimum average number of bits needed to describe X.

The last one is an important property of entropy. Source coding is a main research field of information theory. Its main aim is to compress data, that is, to describe a source with as few bits as possible. The fourth property says that entropy is the lower bound of data compression. If one describes the source with fewer bits than H(X), distortion will inevitably occur; on the contrary, if the number of bits is greater than H(X), the description can be further compressed through reasonable coding design.

Example 2.2.1 A random variable X has the following probability distribution.

$$\begin{pmatrix} X \\ P \end{pmatrix} = \begin{pmatrix} 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 \\ \dfrac{1}{2} & \dfrac{1}{4} & \dfrac{1}{8} & \dfrac{1}{16} & \dfrac{1}{64} & \dfrac{1}{64} & \dfrac{1}{64} & \dfrac{1}{64} \end{pmatrix}$$

Its entropy is $H(X) = \dfrac{1}{2}\times 1 + \dfrac{1}{4}\times 2 + \dfrac{1}{8}\times 3 + \dfrac{1}{16}\times 4 + 4\times\dfrac{1}{64}\times 6 = 2$ bits/symbol.

Two coding schemes are given below.

Code 1: 1→000; 2→001; 3→010; 4→011; 5→100; 6→101; 7→110; 8→111.

Code 2: 1→0; 2→10; 3→110; 4→1110; 5→111100; 6→111101; 7→111110; 8→111111.

For every code, the average number of code symbols per source symbol is called the average length of the code. For Code 1 the average length is $\bar{L}_1 = \sum_{i=1}^{8} P(a_i)\,l_i = 3$ bits. For Code 2, the average length is

$$\bar{L}_2 = \frac{1}{2}\times 1 + \frac{1}{4}\times 2 + \frac{1}{8}\times 3 + \frac{1}{16}\times 4 + 4\times\frac{1}{64}\times 6 = 2 \text{ bits}$$

Code 2 thus describes the 8 symbols with an average of 2 bits per symbol, and its average code length equals the entropy of the source, so Code 2 achieves the maximum compression of the source X.
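The numbers in Example 2.2.1 can be checked mechanically. The sketch below assumes the distribution reconstructed above (1/2, 1/4, 1/8, 1/16 and four symbols of probability 1/64) and the two codes listed, and compares the entropy with the two average code lengths.

```python
import math

probs = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]   # P(X = 1), ..., P(X = 8)
code1_lengths = [3] * 8                                  # Code 1: fixed 3-bit codewords
code2_lengths = [1, 2, 3, 4, 6, 6, 6, 6]                 # Code 2: variable-length codewords

H = -sum(p * math.log2(p) for p in probs)
L1 = sum(p * l for p, l in zip(probs, code1_lengths))
L2 = sum(p * l for p, l in zip(probs, code2_lengths))

print(H)    # 2.0 bits/symbol
print(L1)   # 3.0 bits/symbol
print(L2)   # 2.0 bits/symbol, equal to the entropy H(X)
```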

According to Definition 2.2.2 of discrete entropy, entropy is related only to the number of messages and the probability distribution of the source. If the number of messages is fixed, entropy is only a function of the probability distribution. For brevity, the probability distribution of the source is denoted by

P = (P(a_1), P(a_2), …, P(a_n)) = (p_1, p_2, …, p_n), with p_i ≥ 0 (i = 1, 2, …, n) and $\sum_{i=1}^{n} p_i = 1$.

P = (p_1, p_2, …, p_n) is called the probability distribution vector. The entropy H(X) is related only to P, so we have

$$H(X) = H(P) = H(p_1, p_2, \ldots, p_n) = -\sum_{i=1}^{n} p_i \log p_i$$

H(P) is a function of P, so it is called the entropy function, which has the following properties.

(1) Symmetry: H(P) is invariant under any permutation of p_1, p_2, …, p_n.

(2) Non-negativity: H(P) = H(p_1, p_2, …, p_n) ≥ 0.

(3) Determinacy: H(P) = 0 if and only if some p_i = 1. In that case the random variable takes the value a_i with probability 1, so it becomes a deterministic event without uncertainty.

(4) Increasing: If any p_i is split into the sum of m probabilities p_{i1}, p_{i2}, …, p_{im}, the entropy increases. This is because the number of possible values of the information source increases, so the uncertainty increases.

(5) Extremum: When the number of messages n is fixed, the source entropy H(P) = H(p_1, p_2, …, p_n) takes its maximum value under the equiprobable distribution, namely

$$H(p_1, p_2, \ldots, p_n) \le H\!\left(\frac{1}{n}, \frac{1}{n}, \ldots, \frac{1}{n}\right) = \log n$$

(6) Convexity: H(P) = H(p_1, p_2, …, p_n) is a concave (convex-∩) function of P.

For example, the probability distribution of a binary random variable is

$$\begin{pmatrix} X \\ P \end{pmatrix} = \begin{pmatrix} a_1 & a_2 \\ p & 1-p \end{pmatrix}$$

Then the entropy is H(X) = H(p, 1-p) = -[p log p + (1-p) log(1-p)], p ∈ [0, 1]. Its curve is shown in Figure 2.2.1.
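The extremum property can be seen directly on the binary entropy function: evaluating H(p, 1-p) at a few points shows the maximum of 1 bit (log 2) at the equiprobable point p = 0.5. A small sketch:

```python
import math

def binary_entropy(p: float) -> float:
    """H(p, 1-p) = -[p*log2(p) + (1-p)*log2(1-p)], with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

for p in (0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0):
    print(f"p = {p:.1f}   H(p, 1-p) = {binary_entropy(p):.4f} bits")
# The maximum value, 1 bit = log2(2), occurs at p = 0.5.
```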

Definition 2.2.3 Let the joint probability distribution of the 2-dimensional source (X_1, X_2) be as shown in Figure 2.2.2.

The joint entropy of (X_1, X_2) is defined by

$$H(X_1X_2) = -\sum_{x_1}\sum_{x_2} P(x_1 x_2)\log P(x_1 x_2)$$

H(X_2|X_1) is called the conditional entropy of X_2 given X_1, which is calculated by

$$H(X_2|X_1) = -\sum_{x_1}\sum_{x_2} P(x_1 x_2)\log P(x_2|x_1)$$
Figure 2.2.1 Curve of the binary entropy H(X)

Figure 2.2.2 Joint probability distribution of the 2-dimensional source (X_1, X_2)

The joint entropy H(X_1X_2) represents the total uncertainty of the 2-dimensional random variables (X_1, X_2), so its unit is bits/2 symbols, while the conditional entropy H(X_2|X_1) represents the uncertainty of X_2 given that X_1 is known. The existence of X_1 provides a certain amount of information about X_2, so the uncertainty of X_2 is reduced to some extent, and one gets

$$H(X_2|X_1) \le H(X_2) \tag{2.2.8}$$

Inequation (2.2.8) becomes an equation if and only if X_1 and X_2 are independent.

Theorem 2.2.1 For two random variables X_1 and X_2, H(X_1X_2), H(X_2|X_1), and H(X_1) satisfy

$$H(X_1X_2) = H(X_1) + H(X_2|X_1) \tag{2.2.9}$$

$$H(X_1X_2) = H(X_2) + H(X_1|X_2)$$

$$H(X_1X_2) \le H(X_1) + H(X_2)$$

In particular, if X_1 and X_2 are statistically independent, then H(X_2|X_1) = H(X_2), so H(X_1X_2) = H(X_1) + H(X_2); this equality holds if and only if X_1 and X_2 are independent.

Theorem 2.2.1 can be generalized to the case of more random variables:

$$H(X_1X_2\cdots X_N) = H(X_N|X_{N-1}\cdots X_2X_1) + H(X_{N-1}|X_{N-2}\cdots X_2X_1) + \cdots + H(X_2|X_1) + H(X_1)$$

This is called the chain rule.
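The identities of Theorem 2.2.1 and the chain rule can be verified numerically on a small joint distribution. The 2x2 table below is an illustrative assumption, not an example from the text.

```python
import math

# Assumed joint distribution P(x1, x2) over X1, X2 taking values 0 and 1.
P = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

def H(dist):
    """Entropy in bits of a probability table given as {outcome: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Marginal distributions P(x1) and P(x2).
P1 = {v: sum(p for (x1, _), p in P.items() if x1 == v) for v in (0, 1)}
P2 = {v: sum(p for (_, x2), p in P.items() if x2 == v) for v in (0, 1)}

# Conditional entropy H(X2|X1) = -sum P(x1, x2) * log2 P(x2|x1).
H2_given_1 = -sum(p * math.log2(p / P1[x1]) for (x1, x2), p in P.items() if p > 0)

print(round(H(P), 6))                 # joint entropy H(X1 X2)
print(round(H(P1) + H2_given_1, 6))   # equals H(X1) + H(X2|X1): the chain rule
print(H2_given_1 <= H(P2))            # True: H(X2|X1) <= H(X2), Inequation (2.2.8)
```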

2.2.2 Average Mutual Information

For a communication system, if X is the message symbol sent by the source and Y is the message symbol received by the sink, then the entropy H(X) describes the a priori uncertainty of the source. When the sink receives Y, the posterior uncertainty about X is the conditional entropy H(X|Y). According to Inequation (2.2.8), there must be H(X|Y) ≤ H(X), which implies that the sink gains some information about X upon receiving Y and thus reduces the uncertainty of X. The amount of information transferred from the source to the sink is called the average mutual information.

Definition 2.2.4 I(X;Y) = H(X) - H(X|Y) is called the average mutual information.

If the probability distributions of X, Y, and (X, Y) are briefly denoted by p(x), p(y), and p(x, y), then I(X;Y) can be calculated as Equation (2.2.10).

$$I(X;Y) = \sum_{x}\sum_{y} p(x,y)\log\frac{p(x,y)}{p(x)p(y)} \tag{2.2.10}$$
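The two forms of the average mutual information, H(X) - H(X|Y) from Definition 2.2.4 and the double sum in Equation (2.2.10), can be compared directly. The joint distribution below is again an illustrative assumption.

```python
import math

# Assumed joint distribution p(x, y).
p_xy = {("x1", "y1"): 0.3, ("x1", "y2"): 0.2,
        ("x2", "y1"): 0.1, ("x2", "y2"): 0.4}

# Marginal distributions p(x) and p(y).
p_x, p_y = {}, {}
for (x, y), p in p_xy.items():
    p_x[x] = p_x.get(x, 0.0) + p
    p_y[y] = p_y.get(y, 0.0) + p

# Form 1: I(X;Y) = H(X) - H(X|Y).
H_X = -sum(p * math.log2(p) for p in p_x.values())
H_X_given_Y = -sum(p * math.log2(p / p_y[y]) for (x, y), p in p_xy.items())
I_1 = H_X - H_X_given_Y

# Form 2: Equation (2.2.10), the double sum of p(x,y) * log2[p(x,y) / (p(x)p(y))].
I_2 = sum(p * math.log2(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy.items())

print(round(I_1, 10) == round(I_2, 10))   # True: the two forms agree
print(round(I_1, 6))                      # the average mutual information in bits
```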

The relationship between entropy and average mutual information is shown in Figure 2.2.3.

Figure 2.2.3 The relationship between entropy and mutual information

2.2.3 Discrete Lossless Source Coding

Source coding is actually the transformation of source symbols according to certain rules. Its function shows in two aspects. On one hand, it matches the channel by making the symbols sent out from the source more suitable for channel transmission. The most typical example is the representation of source symbols (text, voice, image) in the form of 0-1 bit strings in a digital communication system. On the other hand, the important role of source coding is data compression, that is, to use as few code symbols (usually bits) as possible to represent source symbols, improving the efficiency of coding and transmission.

Assume the probability distribution of the source S to be

$$\begin{pmatrix} S \\ P \end{pmatrix} = \begin{pmatrix} s_1 & s_2 & \cdots & s_q \\ P(s_1) & P(s_2) & \cdots & P(s_q) \end{pmatrix}$$

Assume the set of code symbols to be {x_1, x_2, …, x_r}. The most common example is the binary code symbols {0, 1}. In order to encode the source S and its extension source, two strategies can be adopted. One is to encode each symbol s_i of S first, and then concatenate the codewords of the s_i to obtain the code of the extension source of S. The execution order is to encode first and then extend. This strategy is hereinafter referred to as single symbol coding. The other is to first obtain the extension source of S, and then encode the symbol sequences in the extension source. The execution order is to extend first and then encode. This strategy is hereinafter referred to as symbol sequence coding.

1. Single symbol coding

The single symbol coding is shown in Figure 2.2.4. The source encoder maps each of the q source symbols {s_1, s_2, …, s_q} to one of the code symbol sequences {W_1, W_2, …, W_q}. Each code symbol sequence W_i is composed of l_i code symbols, written as $W_i = x_{i_1}x_{i_2}\cdots x_{i_{l_i}}$ and called a codeword; l_i is the code length of W_i. The set of all codewords {W_1, W_2, …, W_q} is called the code or code book. Thus, source coding is actually a mapping from source symbols to codewords. To achieve lossless coding, this mapping must be bijective and hence reversible. In addition, by concatenating the codeword of each symbol, it is easy to obtain the code of the extension source of S. For example, the codeword corresponding to the source symbol sequence s_1 s_2 s_3 is W_1 W_2 W_3, and its code length is equal to l_1 + l_2 + l_3 (see the sketch after Figure 2.2.4). In the single symbol coding strategy, the code corresponding to the N-times extension source of S is called the N-times extension code of S.

Figure 2.2.4 Single symbol coding
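A tiny sketch of the concatenation step in single symbol coding; the codebook below is an assumption used only for illustration.

```python
# Assumed codebook {s_i: W_i} for a source with q = 3 symbols.
codebook = {"s1": "0", "s2": "10", "s3": "11"}

def encode_sequence(symbols):
    """Extension code: concatenate the codeword W_i of each source symbol s_i."""
    return "".join(codebook[s] for s in symbols)

word = encode_sequence(["s1", "s2", "s3"])   # W1 W2 W3
print(word)        # "01011"
print(len(word))   # l1 + l2 + l3 = 1 + 2 + 2 = 5
```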

2. Symbol sequence coding

The symbol sequence coding is shown in Figure 2.2.5. Taking memoryless extension as an example, it first makes the N-times extension of the source S to obtain the memoryless extension source S^N and then encodes each sequence in S^N.

Figure 2.2.5 Symbol sequence coding

Definition 2.2.5 A code in which all codewords W_i have equal length is called a constant length code; otherwise it is called a variable length code. A code in which all codewords W_i are different from each other is called a nonsingular code; otherwise it is called a singular code. For the extension code of a single symbol code, if any finite-length sequence of code symbols can be translated into only one source symbol sequence, then the code is called a uniquely decodable code. Among the uniquely decodable codes, there is a class of codes that can be decoded immediately without reference to the code symbols of subsequent codewords. The characteristic of such a code is that no complete codeword is a prefix of another codeword, so it is called a prefix code or instant code. For given sets of source symbols and code symbols, if there are many uniquely decodable coding schemes, then the code with the shortest average code length is called the optimal code.

Example 2.2.2 Among the four codes in Figure 2.2.6, Code 1 is singular and Code 2 is not uniquely decodable; Code 3 and Code 4 are both uniquely decodable, but only the latter is an instant code.

Figure 2.2.6 Comparison of four codes
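The prefix (instant) condition of Definition 2.2.5 can be tested mechanically: no codeword may be a prefix of another. A minimal sketch, applied to the two codes of Example 2.2.1 since the four codes of Figure 2.2.6 are not reproduced here:

```python
def is_prefix_code(codewords) -> bool:
    """True if no codeword is a prefix of another (so the code is instantaneous)."""
    words = sorted(codewords)
    # After sorting, any prefix relation must appear between adjacent codewords.
    return all(not words[i + 1].startswith(words[i]) for i in range(len(words) - 1))

code1 = ["000", "001", "010", "011", "100", "101", "110", "111"]
code2 = ["0", "10", "110", "1110", "111100", "111101", "111110", "111111"]

print(is_prefix_code(code1))   # True: distinct fixed-length codewords
print(is_prefix_code(code2))   # True: an instant (prefix) code
```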

Kraft proved in 1949 that for a prefix code, the lengths l_i of the codewords W_i must satisfy a certain constraint, which is called the Kraft inequality. In 1956, McMillan proved that the code lengths of any uniquely decodable code must also satisfy the Kraft inequality.

Theorem 2.2.2 (Kraft Inequality) For a uniquely decodable code or prefix code, the code lengths l_i, the number of source symbols q, and the number of code symbols r must satisfy

$$\sum_{i=1}^{q} r^{-l_i} \le 1$$
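A small sketch that evaluates the Kraft sum for binary codes (r = 2), using the code lengths of Code 1 and Code 2 from Example 2.2.1 and one set of lengths for which no prefix code exists.

```python
def kraft_sum(lengths, r=2):
    """Sum of r**(-l_i); a uniquely decodable (or prefix) code requires this to be <= 1."""
    return sum(r ** (-l) for l in lengths)

print(kraft_sum([3] * 8))                    # 1.0  (Code 1 of Example 2.2.1)
print(kraft_sum([1, 2, 3, 4, 6, 6, 6, 6]))   # 1.0  (Code 2 of Example 2.2.1)
print(kraft_sum([1, 2, 2, 2]))               # 1.25 > 1: no prefix code with these lengths
```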

Theorem 2.2.3 (The Lossless Variable Length Source Coding Theorem) Let $S^N = \{\alpha_1, \alpha_2, \ldots, \alpha_{q^N}\}$ be the N-times memoryless extension of the discrete source S = {s_1, s_2, …, s_q}. Using the set of code symbols X = {x_1, x_2, …, x_r} to encode S^N, there exists a uniquely decodable coding scheme in which the average code length per symbol s_i of S satisfies

$$\frac{H(S)}{\log r} \le \frac{\bar{L}_N}{N} < \frac{H(S)}{\log r} + \frac{1}{N}$$

where $\bar{L}_N$ denotes the average code length of the symbol sequences α_i in S^N, and $\bar{L}_N/N$ denotes the average code length per symbol s_i of S.
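The bound of Theorem 2.2.3 can be illustrated numerically. The sketch below assumes a binary code (r = 2) and a hypothetical two-symbol source with probabilities 0.9 and 0.1; encoding pairs (N = 2) with a prefix code already moves the per-symbol length from 1 bit toward the lower bound H(S).

```python
import math

# Hypothetical source (an assumption): P(s1) = 0.9, P(s2) = 0.1.
p = {"s1": 0.9, "s2": 0.1}
H_S = -sum(q * math.log2(q) for q in p.values())   # lower bound H(S)/log2(2) ~= 0.469 bits

# N = 1: any uniquely decodable binary code needs at least 1 bit per symbol here.
avg_len_N1 = 1.0

# N = 2: encode pairs with the prefix code s1s1 -> 0, s1s2 -> 10, s2s1 -> 110, s2s2 -> 111.
pair_prob = {("s1", "s1"): 0.81, ("s1", "s2"): 0.09, ("s2", "s1"): 0.09, ("s2", "s2"): 0.01}
pair_len = {("s1", "s1"): 1, ("s1", "s2"): 2, ("s2", "s1"): 3, ("s2", "s2"): 3}
avg_len_N2 = sum(pair_prob[a] * pair_len[a] for a in pair_prob) / 2   # per source symbol

print(round(H_S, 4), avg_len_N1, round(avg_len_N2, 4))   # 0.469 ... 1.0 ... 0.645
```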

2.2.4 Huffman Code

Huffman coding is the simplest and most representative discrete lossless variable length source coding method. It generates an optimal code. Its basic idea is "large probability, small code length". Although the Huffman coding scheme is not unique, the average code length is the same. The encoding procedure is as follows.

$$\begin{pmatrix} S \\ P \end{pmatrix} = \begin{pmatrix} s_1 & s_2 & \cdots & s_q \\ P(s_1) & P(s_2) & \cdots & P(s_q) \end{pmatrix}$$

For the source S given above, arrange the symbols s_i in a column with their probabilities in decreasing order, sum the two smallest probabilities at the bottom, and then reorder the probabilities. At this point the number of probabilities has been reduced by one. Repeat the summing and reordering process until only two probabilities remain, then assign the bits 0 and 1 to the upper and lower probabilities in each sum and to the final two probabilities on the right, for example upper 0 and lower 1. Finally, find the path on which each symbol s_i lies and read off the 0-1 bits on that path from right to left; this is the generated codeword.

Example 2.2.3 The probability distribution of a discrete source is given below.

$$\begin{pmatrix} S \\ P \end{pmatrix} = \begin{pmatrix} s_1 & s_2 & s_3 & s_4 & s_5 \\ 0.4 & 0.2 & 0.2 & 0.1 & 0.1 \end{pmatrix}$$

The first scheme of Huffman coding is illustrated in Figure 2.2.7, which generates the code

s_1 → 1, s_2 → 01, s_3 → 000, s_4 → 0010, s_5 → 0011

The second scheme is illustrated in Figure 2.2.8, which generates the code

s_1 → 00, s_2 → 10, s_3 → 11, s_4 → 010, s_5 → 011

It can be verified that the two generated codes have the same average code length of 2.2 bits/symbol. A programmatic sketch of the Huffman procedure follows the figures below.

Figure 2.2.7 Scheme 1 of the Huffman code

Figure 2.2.8 Scheme 2 of the Huffman code
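A compact sketch of the Huffman procedure described above, written with Python's heapq. Applied to the distribution assumed in Example 2.2.3, it produces a code with the same average length of 2.2 bits/symbol (tie-breaking may give codewords different from Scheme 1 or Scheme 2, but the average length is unchanged).

```python
import heapq
import math

def huffman_code(probs):
    """Build a binary Huffman code; probs maps symbol -> probability."""
    # Heap entries: (probability, tie-break counter, {symbol: partial codeword}).
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)   # smallest remaining probability
        p2, _, c2 = heapq.heappop(heap)   # second smallest
        merged = {s: "0" + w for s, w in c2.items()}         # prepend 0 on one branch
        merged.update({s: "1" + w for s, w in c1.items()})   # prepend 1 on the other
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

probs = {"s1": 0.4, "s2": 0.2, "s3": 0.2, "s4": 0.1, "s5": 0.1}
code = huffman_code(probs)
avg_len = sum(probs[s] * len(w) for s, w in code.items())
H_S = -sum(p * math.log2(p) for p in probs.values())

print(code)
print(avg_len)          # 2.2 bits/symbol
print(round(H_S, 4))    # 2.1219 bits/symbol <= average code length
```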
