The α-representations of the Fisher Information Matrix
— On equivalent expressions of the FIM —

Frank Nielsen

19 September 2017

This column is also available in PDF: alpha-FIM.pdf

The Fisher Information Matrix [1] (FIM) for a family of parametric probability models {p(x;θ)}_θ (densities p(x;θ) expressed with respect to a base measure ν) indexed by a D-dimensional parameter vector θ := (θ_1, …, θ_D) is originally defined by

$$ I(\theta) := (I_{ij}(\theta)), \qquad I_{ij}(\theta) := E_{p(x;\theta)}\left[\partial_i l(x;\theta)\, \partial_j l(x;\theta)\right], \tag{1} $$

where l(x;θ) := log p(x;θ) is the log-likelihood function, and ∂_i := ∂/∂θ_i (by notational convention). The FIM is a D × D positive semi-definite matrix for a family of order D.
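As a quick sanity check of Eq. 1, here is a minimal sympy sketch (the Bernoulli family and all identifiers below are my own illustration, not from the column) that recovers the textbook Fisher information 1/(θ(1−θ)) of a Bernoulli variable:

```python
# Sketch: Fisher information of the Bernoulli family p(x;theta) = theta^x (1-theta)^(1-x),
# computed from the definition of Eq. 1 (expectation of the squared score).
import sympy as sp

theta, x = sp.symbols('theta x', positive=True)
p = theta**x * (1 - theta)**(1 - x)      # density w.r.t. the counting measure on {0, 1}
score = sp.diff(sp.log(p), theta)        # the score, partial_theta l(x; theta)

# E_{p(x;theta)}[score^2]: the integral reduces to a sum over the support {0, 1}
I = sum((score**2 * p).subs(x, v) for v in (0, 1))
print(sp.simplify(I))                    # equals 1/(theta*(1 - theta))
```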

The FIM is a cornerstone in statistics and appears in many places, for example in the celebrated Cramér-Rao lower bound [2] for an unbiased estimator θ̂:

$$ \mathrm{Var}_{p(x;\theta)}[\hat{\theta}] \succeq I^{-1}(\theta), $$

where ⪰ denotes the Löwner ordering of positive semi-definite matrices: A ⪰ B iff A − B ⪰ 0 is positive semi-definite. Another use of the FIM is in gradient descent methods for deep learning using the natural gradient [4].
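As a classical illustration (a worked example of mine, not spelled out in the column): for n i.i.d. Bernoulli trials with parameter θ, the FIM is I_n(θ) = n/(θ(1−θ)), and the unbiased estimator θ̂ = x̄ has Var[θ̂] = θ(1−θ)/n = I_n^{-1}(θ), so the sample mean attains the Cramér-Rao lower bound.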

Yet, it is common to encounter another, equivalent expression of the FIM in the literature [2, 1]:

$$ I'_{ij}(\theta) := 4 \int \partial_i \sqrt{p(x;\theta)}\, \partial_j \sqrt{p(x;\theta)}\, \mathrm{d}\nu(x). \tag{2} $$

This form of the FIM is well suited to prove that the FIM is always a positive semi-definite matrix [1]: I(θ) ⪰ 0.
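The same illustrative Bernoulli computation confirms that Eq. 2 gives the identical result (again a sketch under the assumptions above, not code from the column):

```python
# Sketch: square-root representation of Eq. 2 on the Bernoulli family;
# the base measure is the counting measure, so the integral is a sum over {0, 1}.
import sympy as sp

theta, x = sp.symbols('theta x', positive=True)
p = theta**x * (1 - theta)**(1 - x)
d_sqrt = sp.diff(sp.sqrt(p), theta)      # partial_theta sqrt(p(x; theta))

I_prime = 4 * sum((d_sqrt**2).subs(x, v) for v in (0, 1))
print(sp.simplify(I_prime))              # equals 1/(theta*(1 - theta)), matching Eq. 1
```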

It turns out that one can define a family of equivalent representations of the FIM using the α-embedding of the parametric family. We define the α-representation of densities l^(α)(x;θ) := k_α(p(x;θ)) with

$$ k_\alpha(u) := \begin{cases} \frac{2}{1-\alpha}\, u^{\frac{1-\alpha}{2}}, & \text{if } \alpha \neq 1, \\ \log u, & \text{if } \alpha = 1. \end{cases} \tag{3} $$

The function l^(α)(x;θ) is called the α-likelihood function.

The α-representation of the FIM is

$$ \boxed{\; I^{(\alpha)}_{ij}(\theta) := \int \partial_i l^{(\alpha)}(x;\theta)\, \partial_j l^{(-\alpha)}(x;\theta)\, \mathrm{d}\nu(x) \;} \tag{4} $$

In compact notation, we have I^(α)_ij(θ) = ∫ ∂_i l^(α) ∂_j l^(−α) dν(x) (this is the α-FIM). We can expand the α-FIM expression as follows:

$$ I^{(\alpha)}_{ij}(\theta) = \begin{cases} \frac{4}{1-\alpha^2} \int \partial_i\, p(x;\theta)^{\frac{1-\alpha}{2}}\, \partial_j\, p(x;\theta)^{\frac{1+\alpha}{2}}\, \mathrm{d}\nu(x), & \text{for } \alpha \neq \pm 1, \\ \int \partial_i \log p(x;\theta)\, \partial_j p(x;\theta)\, \mathrm{d}\nu(x), & \text{for } \alpha \in \{-1, 1\}. \end{cases} $$

The proof that I^(α)_ij(θ) = I_ij(θ) follows from the fact that

$$ \partial_i l^{(\alpha)} = p^{-\frac{1+\alpha}{2}}\, \partial_i p = p^{\frac{1-\alpha}{2}}\, \partial_i l, $$
since ∂_i l = ∂_i p / p.

Therefore we get

$$ \partial_i l^{(\alpha)}\, \partial_j l^{(-\alpha)} = p\, \partial_i l\, \partial_j l, $$
and I^(α)_ij(θ) = E[∂_i l ∂_j l] = I_ij(θ).
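This equivalence is easy to confirm symbolically. The sketch below (the helper names k_alpha and alpha_fim are mine) evaluates the α-FIM of the illustrative Bernoulli family for several values of α and obtains the same quantity every time:

```python
# Sketch: the alpha-FIM of Eq. 4 is independent of alpha (Bernoulli family).
import sympy as sp

theta, x = sp.symbols('theta x', positive=True)
p = theta**x * (1 - theta)**(1 - x)

def k_alpha(u, alpha):
    """alpha-embedding of Eq. 3."""
    return sp.log(u) if alpha == 1 else 2 / (1 - alpha) * u**((1 - alpha) / 2)

def alpha_fim(alpha):
    dl_pos = sp.diff(k_alpha(p, alpha), theta)    # alpha-score
    dl_neg = sp.diff(k_alpha(p, -alpha), theta)   # (-alpha)-score
    return sp.simplify(sum((dl_pos * dl_neg).subs(x, v) for v in (0, 1)))

for a in (sp.S(0), sp.S(1) / 2, sp.S(1), sp.S(3)):
    print(a, alpha_fim(a))               # all equal 1/(theta*(1 - theta))
```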

Thus Eq. 1 and Eq. 2 were two examples of the α-representation, namely the 1-representation and the 0-representation, respectively. The 1-representation of Eq. 1 is called the logarithmic representation, and the 0-representation of Eq. 2 is called the square-root representation.

Note that I_ij(θ) = E[∂_i l ∂_j l] = ∫ p ∂_i l ∂_j l dν(x) = ∫ ∂_i p ∂_j l dν(x) = I^(1)_ij(θ), since ∂_i l = ∂_i p / p.

In information geometry [1], the α-scores {∂_i l^(α)}_i play the role of tangent vectors. Geometrically speaking, the tangent plane T_{p(x;θ)} can be described using any α-base. The statistical manifold M = {p(x;θ)}_θ is embedded into a space of functions defined on the support X of the densities.

Under regularity conditions [2, 1], the α-representation of the FIM can further be rewritten as

$$ I^{(\alpha)}_{ij}(\theta) = -\frac{2}{1+\alpha} \int p(x;\theta)^{\frac{1+\alpha}{2}}\, \partial_i \partial_j l^{(\alpha)}(x;\theta)\, \mathrm{d}\nu(x). $$

Since we have

$$ \partial_i \partial_j l^{(\alpha)}(x;\theta) = p^{\frac{1-\alpha}{2}} \left( \partial_i \partial_j l + \frac{1-\alpha}{2}\, \partial_i l\, \partial_j l \right), $$

it follows, using E_{p(x;θ)}[∂_i ∂_j l] = −I_ij(θ) (which holds under the regularity conditions), that

$$ I^{(\alpha)}_{ij}(\theta) = -\frac{2}{1+\alpha} \left( -I_{ij}(\theta) + \frac{1-\alpha}{2}\, I_{ij}(\theta) \right) = I_{ij}(\theta). $$
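One can again verify this second-derivative form on the illustrative Bernoulli family; the sketch below checks it for α = 0 (assumptions as in the earlier sketches):

```python
# Sketch: second-derivative form of the alpha-FIM, checked for alpha = 0.
import sympy as sp

theta, x = sp.symbols('theta x', positive=True)
p = theta**x * (1 - theta)**(1 - x)
alpha = sp.S(0)

l_a = 2 / (1 - alpha) * p**((1 - alpha) / 2)              # alpha-likelihood, Eq. 3
integrand = p**((1 + alpha) / 2) * sp.diff(l_a, theta, 2)
I_a = -2 / (1 + alpha) * sum(integrand.subs(x, v) for v in (0, 1))
print(sp.simplify(I_a))                  # equals 1/(theta*(1 - theta)), the Bernoulli FIM
```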

Notice that when α = 1, we recover the equivalent expression of the FIM (under mild conditions)

$$ I^{(1)}(\theta) = -E\left[\nabla^2 \log p(x;\theta)\right]. $$
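The same check works for this α = 1 form (a sketch on the same illustrative family):

```python
# Sketch: -E[d^2/dtheta^2 log p] on the Bernoulli family equals the FIM.
import sympy as sp

theta, x = sp.symbols('theta x', positive=True)
p = theta**x * (1 - theta)**(1 - x)
hess = sp.diff(sp.log(p), theta, 2)      # second derivative of the log-likelihood

I_hess = -sum((hess * p).subs(x, v) for v in (0, 1))
print(sp.simplify(I_hess))               # equals 1/(theta*(1 - theta))
```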

In particular, when the family is an exponential family [3] with cumulant function F(θ), we have

$$ I(\theta) = \nabla^2 F(\theta). $$
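For instance (my own worked example, not from the column), writing the Bernoulli family in its natural parameter η = log(θ/(1−θ)) with cumulant function F(η) = log(1 + e^η), the Hessian of F coincides with the FIM computed from Eq. 1:

```python
# Sketch: for an exponential family, the FIM is the Hessian of the cumulant function.
# Bernoulli in natural coordinates: p(x; eta) = exp(x*eta - F(eta)), F(eta) = log(1 + e^eta).
import sympy as sp

eta, x = sp.symbols('eta x')
F = sp.log(1 + sp.exp(eta))
p = sp.exp(x * eta - F)                  # density w.r.t. the counting measure on {0, 1}

score = sp.diff(sp.log(p), eta)
I_def = sum((score**2 * p).subs(x, v) for v in (0, 1))   # Eq. 1
I_hessF = sp.diff(F, eta, 2)                             # nabla^2 F
print(sp.simplify(I_def - I_hessF))      # 0: the two expressions agree
```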

Similarly, the coefficients of the α-connection can be expressed using the α-representation as

$$ \Gamma^{(\alpha)}_{ij,k} = \int \partial_i \partial_j l^{(\alpha)}\, \partial_k l^{(-\alpha)}\, \mathrm{d}\nu(x). $$
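As a small illustration of this formula (using a standard fact of information geometry that is not stated in the column): for an exponential family in its natural parameter, the coefficients of the 1-connection vanish, which the following sketch checks for the Bernoulli family:

```python
# Sketch: Gamma^(1)_{11,1} = int (d^2 l^(1)) (d l^(-1)) dnu vanishes for an
# exponential family in natural coordinates (here Bernoulli, eta = natural parameter).
import sympy as sp

eta, x = sp.symbols('eta x')
F = sp.log(1 + sp.exp(eta))
p = sp.exp(x * eta - F)

d2_l_pos = sp.diff(sp.log(p), eta, 2)    # second derivative of the 1-representation
d_l_neg = sp.diff(p, eta)                # derivative of the (-1)-representation, l^(-1) = p
gamma = sum((d2_l_pos * d_l_neg).subs(x, v) for v in (0, 1))
print(sp.simplify(gamma))                # 0: the e-connection is flat in these coordinates
```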

The Riemannian metric tensor g_ij (a geometric object) can be expressed in matrix form as I^(α)(θ) using the α-base, and this tensor is called the Fisher metric tensor.

Initially created 19th September 2017 (last update September 22, 2017).

References

[1]   Ovidiu Calin and Constantin Udrişte. Geometric Modeling in Probability and Statistics. Mathematics and Statistics. Springer International Publishing, 2014.

[2]   Frank Nielsen. Cramér-Rao lower bound and information geometry. arXiv preprint arXiv:1301.3578, 2013.

[3]   Frank Nielsen and Vincent Garcia. Statistical exponential families: A digest with flash cards. arXiv preprint arXiv:0911.4863, 2009.

[4]   Ke Sun and Frank Nielsen. Relative Fisher information and natural gradient for learning large modular models. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3289–3298, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.