\chapter{Background / Theory}
\label{ch:02_xxx}
\lettrine{T}{his} theory chapter reviews the statistical methodologies upon which the contributions are based. Section \fixme{xxx} \ldots.
For the models considered here, exact inference is infeasible, which motivates the Markov chain Monte Carlo techniques outlined in Sec. \fixme{xxx}. Finally, Sec. \fixme{xxx} concludes the background material \ldots.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{The Bayesian Framework}
\label{sec:0201_xxx}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
In this section we provide a brief motivation for the Bayesian approach and establish some concepts that reappear throughout this thesis. The overarching goal of the thesis is to examine the flexibility a Bayesian approach can provide when learning unknown dynamical systems.
\texttodo{Types of probability - classical, frequentist, and Bayesian}
Compared to the so-called frequentist philosophy of statistical analysis, the key difference in Bayesian inference is that the probability of an event does not denote the proportion of occurrences of that event in an infinite number of trials, but rather the uncertainty about the event in a single trial. Because models in Bayesian inference are formulated in terms of probability distributions, the axioms and computation rules of probability theory (see, e.g., Shiryaev, 1996) also apply in Bayesian inference.
%==============================================================================
\subsection{Modeling via Exchangeability}
\label{subsec:020101_xxx}
%==============================================================================
The concept of exchangeability is central to many statistical approaches, and may be viewed as critical in motivating Bayesian statistics. Let us assume that we are aggregating data in an attempt to make predictions about future values of the random process we are observing. If we were to make the strong assumption of the data being independent, we would treat every new data point individually without using past observations to predict future observations since:
\begin{align}
\p{y_1,\ldots,y_n} = \prod_{i=1}^{n}{\p{y_i}}
\end{align}
implies that
\begin{align}
\p{y_{n+1},\ldots,y_m|y_1,\ldots,y_n} = \p{y_{n+1},\ldots,y_m}.
\end{align}
A weaker assumption that often better describes the data we encounter is that of exchangeability, which states that the order in which we encounter the data is inconsequential.
\begin{definition}
A sequence of random variables $y_1,y_2,\ldots,y_n$ is said to be finitely exchangeable if
\begin{align}
y_1,y_2,\ldots,y_n \stackrel{\mathcal{D}}{=} y_{\pi(1)},y_{\pi(2)},\ldots,y_{\pi(n)}
\end{align}
for every permutation $\pi$ on $\{ 1,\ldots,n \}$. Here, we use the notation $\stackrel{\mathcal{D}}{=}$ to mean equality in distribution.
\end{definition}
From this definition, we see that independence implies exchangeability, but not vice versa. We are often in settings where data are continually accumulated, or in which fixing an upper bound $n$ is challenging. We would thus like to formalize a notion of exchangeability for infinite sequences.
\begin{definition}
A sequence $y_1,y_2,\ldots$ is said to be infinitely exchangeable if every finite subsequence is finitely exchangeable [15].
\end{definition}
As is demonstrated in Bernardo and Smith [15], not every finitely exchangeable sequence can be embedded in an infinitely exchangeable sequence.
Exchangeability has simplifying implications for inference since we can simply ignore the order in which the data arrive. Sometimes, exchangeability is too strong an assumption. Relaxations include considering \textit{partially exchangeable} data, where some auxiliary information partitions the data into exchangeable sets. For example, consider a person flipping two biased coins, one on even throws and the other on odd throws. The data are exchangeable within the set of odd or even tosses if these labels are provided. There are many possible extensions and variations on the standard exchangeability model; however, the end goal is to group data into exchangeable, and thus relatively simple, blocks for which inference is more tractable.
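To make the two-coin example concrete, the following minimal Python sketch (with hypothetical biases \texttt{theta\_odd} and \texttt{theta\_even} chosen purely for illustration) verifies that the likelihood of a toss sequence under the two-coin model is unchanged by permutations within the odd- and even-indexed groups, whereas a permutation that mixes the two groups generally changes it.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
theta_odd, theta_even = 0.8, 0.3   # hypothetical coin biases (illustrative only)

def likelihood(y):
    """Likelihood of a 0/1 toss sequence under the two-coin model:
    throws 1,3,5,... use theta_odd and throws 2,4,6,... use theta_even."""
    odd, even = y[0::2], y[1::2]   # 0-based slicing: odd and even throw numbers
    def block(x, theta):
        return theta ** x.sum() * (1.0 - theta) ** (len(x) - x.sum())
    return block(odd, theta_odd) * block(even, theta_even)

y = rng.integers(0, 2, size=10)

# Permute tosses *within* the odd and even groups: likelihood is unchanged.
y_within = y.copy()
y_within[0::2] = rng.permutation(y[0::2])
y_within[1::2] = rng.permutation(y[1::2])
print(np.isclose(likelihood(y), likelihood(y_within)))   # True

# Permute across the two groups: the likelihood generally changes.
print(likelihood(y), likelihood(rng.permutation(y)))
\end{verbatim}
The invariance holds because the likelihood depends only on the number of heads within each group, which any within-group permutation preserves.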
A very important result arising from the assumption of exchangeable data is what is typically referred to as \textit{de Finetti’s theorem}. This theorem states that an infinite sequence of random variables $y_1,y_2,\ldots$ is exchangeable if and only if there exists a random probability measure $\nu$ with respect to which $y_1,y_2,\ldots$ are conditionally \gls{acr:iid} with distribution $\nu$. Furthermore, this random measure can be viewed as the limiting empirical measure. De Finetti originally proved this result in the case of binary random variables [33], with the more general extension to arbitrary real-valued exchangeable sequences made by Hewitt and Savage [66] and Ryll-Nardzewski [146].
From de Finetti’s theorem, we see that for infinitely exchangeable binary sequences there exists a random probability measure $\nu$ that concentrates on $\{0,1\}$, implying that this measure can be uniquely described by a single parameter $\theta$.
One can straightforwardly extend this argument to infinitely exchangeable sequences taking values in $\{1, \ldots, K\}$; here, the random measure yielding the data \gls{acr:iid} concentrates on $\{1, \ldots,K\}$ and is thus uniquely defined by a $(K-1)$-dimensional parameter $ \theta = \{ \theta_1, \ldots , \theta_{K-1} \}$ [15].
Analogous to the coin-flipping example above, possible underlying games include rolling a $K$-sided weighted die or drawing from an urn with $K$ different colored balls.
When moving to infinitely exchangeable sequences taking values in the reals, the random probability measures $\nu$ can be arbitrarily complex and are, in general, defined by infinitely many parameters (i.e., $\nu$ is a generic element of $\mathcal{P}(\R)$).
Some special cases exist in which the parametrization remains finite. For example, if $\nu$ is almost surely a Gaussian distribution, the parametrization solely consists of a mean and variance.
The more general case in which $\theta$ may be an infinite-dimensional parameter motivates the development of Bayesian nonparametric methods, some of which we explore in this thesis. For example, the Dirichlet process of Sec. \fixme{xxx} defines a distribution on probability measures that concentrate at a countably infinite number of elements of the reals (or the more general spaces we consider in Sec. \fixme{xxx}).
When we limit ourselves to the more restrictive class of finite-dimensional $\theta$ (e.g., Bernoulli, multinomial, Gaussian random variables), we can invoke the following corollaries.
\begin{corollary}
Assuming the required densities exist, and assuming the conditions
of de Finetti’s theorem hold, there exists a distribution function ${Q}$ such that the joint density of $y_1, \ldots, y_n$ is of the form
\begin{align}
\p{y_1,\ldots,y_n} = \int_\theta{\prod_{i=1}^{n}{\p{y_i|\vartheta}} \, d {Q \left(\vartheta \right)}},
\end{align}
with $\p{\cdot|\vartheta}$ representing the density function corresponding to the finite-dimensional parameter $\vartheta \in \theta$.
\end{corollary}
From the above corollary, it is simple to see how the de Finetti theorem motivates the concept of a prior distribution $Q(\cdot)$ and a likelihood function $\p{y|\cdot}$.
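As a concrete illustration of this corollary (using the standard Beta-Bernoulli pair, chosen here purely for exposition), let $y_i \in \{0,1\}$ be conditionally Bernoulli$(\vartheta)$ and let $Q$ be a Beta$(a,b)$ distribution. Writing $s_n = \sum_{i=1}^{n}{y_i}$, the mixture above is available in closed form:
\begin{align}
\p{y_1,\ldots,y_n} = \int_0^1{\vartheta^{s_n} \left(1-\vartheta\right)^{n-s_n} \, \frac{\vartheta^{a-1}\left(1-\vartheta\right)^{b-1}}{B(a,b)} \, d\vartheta} = \frac{B\left(a+s_n,\, b+n-s_n\right)}{B(a,b)},
\end{align}
where $B(\cdot,\cdot)$ denotes the Beta function. Note that the joint density depends on the data only through the count $s_n$, a point we revisit when discussing sufficient statistics.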
\begin{corollary}
Given that the conditions of the previous corollary hold, the predictive density is given by
\begin{align}
\p{y_{m+1},\ldots,y_n|y_1,\ldots,y_m} = \int_{\theta}{\prod_{i=m+1}^{n}{\p{y_i|\vartheta} } \, d Q\left( \vartheta | y_1,\ldots,y_m \right) },
\end{align}
where
\begin{align}
d Q\left( \vartheta | y_1,\ldots,y_m \right) = \frac{\prod_{i=1}^{m}{\p{y_i|\vartheta} } \, d Q\left(\vartheta \right)}{\int_{\theta}{\prod_{i=1}^{m}{\p{y_i|\vartheta} } \, d Q\left( \vartheta \right) }}.
\end{align}
\end{corollary}
\textit{Proof.} The result follows directly from employing
\begin{align}
\p{y_{m+1},\ldots,y_n|y_1,\ldots,y_m} = \frac{\p{y_1,\ldots,y_n}}{\p{y_1,\ldots,y_m}},
\end{align}
along with the mixture representation of the joint density established in the first corollary above.
From the form of the predictive density above, we see that our view of the existence of an underlying random parameter $\theta$ yielding the data \gls{acr:iid} has not changed. Instead, we have simply updated our prior belief $Q\left(\theta\right)$ into a posterior belief $Q\left(\theta|y_1,\ldots,y_m\right)$ through an application of Bayes' rule:
\begin{align}
\p{\theta|y} &= \frac{\p{y|\theta}\p{\theta}}{\int_\theta{\p{y|\vartheta}\p{\vartheta} d \vartheta}}\\
&= \frac{\p{y|\theta}\p{\theta}}{\p{y}}.
\end{align}
Here, we have written the rule in its simplest form assuming that a density on $\theta$ exists in addition to the conditional density on $y$. Although one can view the computation of the predictive distribution as the ultimate objective in Bayesian statistics, we will often limit our discussion to the process of forming the posterior distribution from the prior by incorporating observations, since this is a fundamental step in examining the predictive distribution.
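Continuing the Beta-Bernoulli illustration, the short Python sketch below (with a hypothetical Beta$(2,2)$ prior and made-up observations) performs this prior-to-posterior update in closed form and evaluates the resulting one-step-ahead predictive probability.
\begin{verbatim}
import numpy as np

a, b = 2.0, 2.0                         # hypothetical Beta(a, b) prior on theta
y = np.array([1, 0, 1, 1, 0, 1, 1, 1])  # illustrative binary observations
n, s = len(y), int(y.sum())

# Conjugacy: Beta prior + Bernoulli likelihood  =>  Beta posterior.
a_post, b_post = a + s, b + n - s

# Posterior predictive: p(y_{n+1} = 1 | y_1..y_n) = E[theta | data].
p_next = a_post / (a_post + b_post)
print(f"posterior Beta({a_post:.0f}, {b_post:.0f}), "
      f"p(next observation = 1 | data) = {p_next:.3f}")
\end{verbatim}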
From a practical perspective, we never have an infinite sequence of observations from which to characterize our prior distribution. Furthermore, even if we had such a quantity, the probability measure that the de Finetti theorem would suggest as yielding the data \gls{acr:iid} might be arbitrarily complex. Thus, we are left with two competing pragmatic goals when defining our prior:
\begin{enumerate}
\item Tractable inference,
\item Modeling flexibility.
\end{enumerate}
The issue of tractable inference often motivates the use of conjugate priors, as discussed in Sec. \fixme{xxx}. The goal of flexibility in our models leads to the study of Bayesian nonparametric methods. A brief introduction to some specific classes of nonparametric methods that maintain computational tractability is presented in Sec. \fixme{xxx}.
Another key aspect of the Bayesian framework we have established is in characterizing a model, or likelihood distribution, $\p{y|\theta}$ for how our data are generated conditioned on a parameter value $\theta$. This choice, too, is often motivated by practical considerations that are typically coupled with those of choosing a prior distribution. We do not develop a full analysis of model selection in this thesis, but begin the exploration in Sec. \fixme{xxx}.
\textbf{
As practitioners, we do not actually know the underlying generative
process, but we can use a combination of our insight on the process (e.g., we know we are observing heights from a given population and heights tend to be well-modeled as Gaussian) and our adherence to computational limitations to define a model.
}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Exponential Families}
\label{sec:xxx}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Exponential families represent a fundamental class of distributions in statistics. They arise as the answer to numerous, albeit related, questions.
\begin{itemize}
\item Within the Bayesian framework: for what class of models does there exist a prior that leads to computationally tractable inference [15, 141]?
\item Frequentists arrive at the exponential family when asking: if there exists an efficient estimator, can we describe the class of models from which the data could have been generated [87, 184]?
\item Common to both domains: what distribution is maximally random while being consistent with a set of moment constraints [15, 79, 116]? A brief sketch of this maximum-entropy characterization is given below.
\end{itemize}
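To sketch the last of these characterizations (informally, ignoring measure-theoretic and regularity details), maximizing the entropy $-\int{\p{\mv{y}}\log\p{\mv{y}}\, d\mv{y}}$ subject to normalization and the moment constraints $\int{t_i(\mv{y})\p{\mv{y}}\,d\mv{y}} = c_i$, $i=1,\ldots,k$, via Lagrange multipliers $\lambda_1,\ldots,\lambda_k$ yields a stationarity condition of the form
\begin{align}
\p{\mv{y}} \propto \exp{\left\{ \sum_{i=1}^{k}{\lambda_i t_i(\mv{y})} \right\}},
\end{align}
which is precisely the exponential family form defined below, with the multipliers playing the role of the natural parameters.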
\begin{definition}
A parametrized family of distributions $\mathcal{P}_\theta = \{P_\theta\}$ is a $k$-parameter exponential family with natural parameter $\mv{\eta}(\cdot)=\left[\eta_1(\cdot),\ldots,\eta_k(\cdot) \right]^\top$, natural statistic $\mv{t}(\cdot)=\left[t_1(\cdot),\ldots,t_k(\cdot) \right]^\top$, and base distribution $\p{\cdot} \propto e^{\beta(\cdot)}$ if each member $P_\theta$ of the family has a density of the form
\begin{align}
\p{\mv{y}|\mv{\theta}} &= \exp{\left\{\mv{\eta}^\top (\mv{\theta})\mv{t}(\mv{y}) - {\alpha}(\mv{\theta}) + \beta(\mv{y}) \right\}}\\
&= \exp{\left\{\sum_{i=1}^k{\eta_i (\mv{\theta}){t_i}(\mv{y}) - {\alpha}(\mv{\theta}) + \beta(\mv{y}) } \right\} }
\end{align}
with respect to a dominating measure\footnote{The dominating measure is the assumed measure on the considered measurable space, and as such provides the measure with respect to which the Radon-Nikodym derivative is taken when defining densities (amongst other measure-theoretic operations one could examine).} $\mu$. Here, $\mv{y}$\footnote{We use the notation $\mv{y}$ rather than $y$ to indicate that this quantity is allowed to be vector valued.} is a point in the sample space $\mathcal{Y}$, which represents the support of the density. The function $\alpha(\cdot)$ is referred to as the log-partition function and ensures that the probability density integrates to $1$. We will denote this family by $\mv{\epsilon}\left( \mv{\theta}; \mv{\eta}(\cdot), \mv{t}(\cdot), \beta(\cdot) \right)$.
\end{definition}
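For instance (a standard textbook example, stated here only to fix notation), the univariate Gaussian with $\mv{\theta} = (\mu, \sigma^2)$ is a $2$-parameter exponential family:
\begin{align}
\p{y|\mv{\theta}} = \exp{\left\{ \frac{\mu}{\sigma^2} y - \frac{1}{2\sigma^2} y^2 - \alpha(\mv{\theta}) \right\}}, \qquad \alpha(\mv{\theta}) = \frac{\mu^2}{2\sigma^2} + \frac{1}{2}\log\left(2\pi\sigma^2\right),
\end{align}
so that $\mv{\eta}(\mv{\theta}) = \left[\mu/\sigma^2,\, -1/(2\sigma^2)\right]^\top$, $\mv{t}(y) = \left[y,\, y^2\right]^\top$, and $\beta(y) = 0$.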
The set of admissible parameter values, or the natural parameter space, consists of those $\mv{\theta}$ for which a finite normalizing constant $\alpha(\mv{\theta})$ exists, namely those such that
\begin{align}
\int{\exp{\left\{\sum_{i=1}^k{\eta_i (\mv{\theta}){t_i}(\mv{y}) + \beta(\mv{y}) } \right\} }} \, d\mv{y} < \infty.
\end{align}
We could generalize this condition and the results to follow to a general dominating measure $\mu$ rather than the assumed Lebesgue (or, where appropriate, counting) measure. However, we will omit this level of mathematical formality.
It is common to restrict oneself to examining families of distributions whose support, i.e., the set of $\mv{y}$ such that $\p{\mv{y}|\mv{\theta}}>0$, does not depend upon $\mv{\theta}$.
\begin{definition}
An exponential family $\mv{\epsilon}\left( \mv{\theta}; \mv{\eta}(\cdot), \mv{t}(\cdot), \beta(\cdot) \right)$ is called regular if the support of each member of the family does not depend upon the value of the parameter $\mv{ \theta}$.
\end{definition}
Another form of exponential families that deserves a special name arises when the density of each member of the family depends linearly on the parameters, i.e., $\mv{\eta}(\mv{\theta})=[\theta_1,\ldots,\theta_k]^\top$.
\begin{definition}
A canonical exponential family is one in which the density depends linearly on the parameter $\mv{ \theta}$:
\begin{align}
\p{\mv{y}|\mv{ \theta}} &= \exp{ \left\{ \mv{\theta}^\top \mv{t}(\mv{y}) -\alpha(\mv{ \theta}) + \beta(\mv{y}) \right\} }\\
&= \exp{ \left\{ \sum_{i=1}^{k}{ \theta_i t_i (\mv{y})} -\alpha(\mv{ \theta}) + \beta(\mv{y}) \right\} }.
\end{align}
We will denote the canonical exponential family by $\mv{\epsilon}\left( \mv{\theta}; \mv{\oneM}(\cdot), \mv{t}(\cdot), \beta(\cdot) \right)$.
\end{definition}
One can, in theory, describe any exponential family in canonical form by defining a family $\mathcal{P}_\eta$ with the parameters taken to be the possibly nonlinear mapping $\mv{\eta} = \mv{\eta}(\mv{\theta})$. In practice, it might be challenging to find the set of admissible values of $\mv{ \eta}$ and the form of the log-partition function. Note that some references, such as Bernardo and Smith [15], use the term \textit{canonical} to refer to exponential families that also depend linearly on the data.
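As a small numerical sanity check of the canonical form (a sketch only, using the textbook Bernoulli case rather than any model developed in this thesis), the Bernoulli distribution with success probability $p$ has natural parameter $\theta = \log\frac{p}{1-p}$, natural statistic $t(y) = y$, log-partition function $\alpha(\theta) = \log\left(1 + e^{\theta}\right)$, and $\beta(y) = 0$. The snippet below confirms that the canonical form reproduces the usual probability mass function.
\begin{verbatim}
import numpy as np

def bernoulli_canonical(y, theta):
    """Canonical form exp(theta*t(y) - alpha(theta) + beta(y))
    with t(y) = y, alpha(theta) = log(1 + exp(theta)), beta(y) = 0."""
    alpha = np.log1p(np.exp(theta))
    return np.exp(theta * y - alpha)

p = 0.3                           # illustrative success probability
theta = np.log(p / (1.0 - p))     # natural parameter: the log-odds

for y in (0, 1):
    direct = p ** y * (1.0 - p) ** (1 - y)    # usual Bernoulli pmf
    print(y, direct, bernoulli_canonical(y, theta))   # the two values agree
\end{verbatim}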
\begin{definition}
For data $\mv{y}$ distributed according to $\p{\mv{y} | \mv{ \theta}}$, a parameter $\mv{\theta}$ is termed unidentifiable on the basis of $\mv{y}$ if there exists $ \mv{ \theta}_1 \neq \mv{ \theta}_2$ such that $P_{\mv{\theta}_1} = P_{\mv{\theta}_2}$.
\end{definition}
%==============================================================================
\subsection{Properties of the Canonical Exponential Family}
\label{subsec:xxx}
%==============================================================================
%==============================================================================
\subsection{Interpretation as Linearly Constrained Maximum Entropy Distribution}
\label{subsec:xxx}
%==============================================================================
%==============================================================================
\subsection{Examples}
\label{subsec:xxx}
%==============================================================================
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Sufficient Statistics}
\label{sec:xxx}
For the exponential family, we have seen that the densities only depend on the data through the natural statistics $\mv{t}(\mv{y})$ and the base distributions $q(\mv{y}) \propto \exp{ \{ \beta(\mv{y}) \} }$.
This leads one to ask under what conditions inferences using transformations of the data, or \textit{statistics}, are the same as if we had used the data itself. One might additionally ask what set of models yield a compact set of statistics, summarizing an arbitrarily large set of data, that are sufficient for the inferences we wish to make. In the following, we establish a formal framework for this data-processing concept.
\begin{definition}
Given a sequence of random variables $\mv{y}_1,\mv{y}_2,\ldots,$ with $\mv{y}_j \in \mathcal{Y}_j$ and probability measure $P$, a sequence of statistics $\mv{t}_1,\mv{t}_2,\ldots,$ with each function $\mv{t}_j$ defined on the product space $\mathcal{Y}_1 \times \cdots \times \mathcal{Y}_j$, is said to be predictive sufficient for $\mv{y}_1,\mv{y}_2,\ldots$ if
\begin{align}
\p{\mv{y}_{i_1},\ldots,\mv{y}_{i_k} | \mv{y}_1,\ldots,\mv{y}_j} = \p{\mv{y}_{i_1},\ldots,\mv{y}_{i_k} | \mv{t}_j} \qquad \forall j,k
\end{align}
where $\{i_1,\ldots,i_k\}$ is a set of indices not contained in $\{1,\ldots,j\}$. Here, $\p{\cdot|\cdot}$ is the conditional density induced by the measure $P$.
\end{definition}
That is, given $\mv{t}_j = \mv{t}_j(\mv{y}_1,\ldots,\mv{y}_j)$, the values of the data $\mv{y}_1,\ldots,\mv{y}_j$ do not further contribute to the prediction of future values of data.
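To illustrate the definition (a hedged sketch using the standard Gaussian model with known variance, not a model from later chapters), place a $\mathcal{N}(\mu_0, \tau_0^2)$ prior on the unknown mean of $\mathcal{N}(\mu, \sigma^2)$ observations with $\sigma^2$ known. The posterior over $\mu$, and hence every predictive density, then depends on $y_1,\ldots,y_j$ only through the statistic $\mv{t}_j = (j, \bar{y}_j)$; the snippet below checks that the posterior computed from the raw data coincides with the one computed from this statistic alone.
\begin{verbatim}
import numpy as np

sigma2 = 1.0               # known observation variance (illustrative)
mu0, tau0_2 = 0.0, 4.0     # hypothetical N(mu0, tau0_2) prior on the mean

def posterior_from_data(y):
    """Posterior mean and variance of mu given the raw observations."""
    precision = 1.0 / tau0_2 + len(y) / sigma2
    mean = (mu0 / tau0_2 + y.sum() / sigma2) / precision
    return mean, 1.0 / precision

def posterior_from_statistic(n, ybar):
    """The same posterior computed only from the statistic (n, sample mean)."""
    precision = 1.0 / tau0_2 + n / sigma2
    mean = (mu0 / tau0_2 + n * ybar / sigma2) / precision
    return mean, 1.0 / precision

rng = np.random.default_rng(1)
y = rng.normal(loc=2.0, scale=np.sqrt(sigma2), size=25)
print(posterior_from_data(y))
print(posterior_from_statistic(len(y), y.mean()))   # identical up to rounding
\end{verbatim}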
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Incorporating Prior Knowledge}
\label{sec:xxx}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%==============================================================================
\subsection{Conjugate Priors}
\label{subsec:xxx}
%==============================================================================
%==============================================================================
\subsection{Multinomial Observations}
\label{subsec:xxx}
%==============================================================================
%==============================================================================
\subsection{Gaussian Observations}
\label{subsec:xxx}
%==============================================================================
%==============================================================================
\subsection{Multivariate Linear Regression Model}
\label{subsec:xxx}
%==============================================================================
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Graphical Models}
\label{sec:xxx}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%==============================================================================
\subsection{A Brief Overview}
\label{subsec:xxx}
%==============================================================================
%==============================================================================
\subsection{Directed Graphical Models}
\label{subsec:xxx}
%==============================================================================
%==============================================================================
\subsection{Undirected Graphical Models}
\label{subsec:xxx}
%==============================================================================
%==============================================================================
\subsection{Belief Propagation}
\label{subsec:xxx}
%==============================================================================
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Hidden Markov Model}
\label{sec:xxx}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%==============================================================================
\subsection{Forward-Backward Algorithm}
\label{subsec:xxx}
%==============================================================================
%==============================================================================
\subsection{Viterbi Algorithm}
\label{subsec:xxx}
%==============================================================================
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{State Space Models}
\label{sec:xxx}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%==============================================================================
\subsection{Standard Discrete-Time Linear-Gaussian State Space Formulation}
\label{subsec:xxx}
%==============================================================================
%==============================================================================
\subsection{Vector Autoregressive Processes}
\label{subsec:xxx}
%==============================================================================
%==============================================================================
\subsection{Switching Linear Dynamic Systems}
\label{subsec:xxx}
%==============================================================================
%==============================================================================
\subsection{Stochastic Realization Theory}
\label{subsec:xxx}
%==============================================================================
%==============================================================================
\subsection{Kalman Filtering and Smoothing}
\label{subsec:xxx}
%==============================================================================
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Markov Chain Monte Carlo}
\label{sec:xxx}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%==============================================================================
\subsection{Monte Carlo Integration}
\label{subsec:xxx}
%==============================================================================
%==============================================================================
\subsection{The Metropolis-Hastings Algorithm}
\label{subsec:xxx}
%==============================================================================
%==============================================================================
\subsection{Gibbs Sampling}
\label{subsec:xxx}
%==============================================================================
%==============================================================================
\subsection{Auxiliary, Blocked, and Collapsed Gibbs Samplers}
\label{subsec:xxx}
%==============================================================================
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Bayesian Nonparametric Methods}
\label{sec:xxx}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%==============================================================================
\subsection{Dirichlet Processes}
\label{subsec:xxx}
%==============================================================================
%==============================================================================
\subsection{Dirichlet Process Mixture Models}
\label{subsec:xxx}
%==============================================================================
%==============================================================================
\subsection{Hierarchical Dirichlet Processes}
\label{subsec:xxx}
%==============================================================================
%==============================================================================
\subsection{Beta Process}
\label{subsec:xxx}
%==============================================================================