## # 3.1

$p'/p=N_1/\theta-N_0/(1-\theta)=0\quad\Rightarrow \theta=N_1/(N_0+N_1)$

## # 3.2, 3.3 Obvious

(3.90) should be $\Gamma(x+1)=x\Gamma(x)$

## # 3.4

$\Beta(\theta|1,1)=\operatorname {Unif}(0,1)$, $\theta$ is probability for heads. $$P(\theta|X<3)=\frac{P(\theta,X<3)}{P(X<3)}$$

\begin{align} P(\theta,X<3) &= P(X<3|\theta)\\ &=\sum_{i=0}^2\binom{5}{i}\theta^i(1-\theta)^{5-i}\\ &=(1+3\theta+6\theta^2)(1-\theta)^3 \end{align}$$P(X<3)=\int_0^1 (1+3\theta+6\theta^2)(1-\theta)^3d\theta=\frac{1}{2}$$$$P(\theta|X<3)=2(1+3\theta+6\theta^2)(1-\theta)^3$$

## # 3.5

\begin{align} p(\theta)&\propto p(\phi(\theta))\frac{d\phi}{d\theta}\\ &\propto \frac{1}{\theta}+\frac{1}{1-\theta}\\ &\propto \theta^{-1}(1-\theta)^{-1}\\ &\propto\Beta(\theta|0,0) \end{align}

## # 3.6

Suppose we have data set $\mathcal{D}={x_1,x_2,\ldots,x_n}$

Log-Likelihood is $$\ln L=-n\lambda+\sum x_i\ln\lambda-\sum\ln(x_i!)$$ $$d\ln L/d\lambda=\sum x_i/\lambda-n=0\Rightarrow \lambda=\bar x$$

## # 3.7

### # a.

\begin{align} p(\lambda|D)&\propto p(D|\lambda)p(\lambda)\\ &\propto e^{-(n+b)\lambda}\lambda^{n\bar x+a-1}\\ &\propto \operatorname{Ga}(\lambda|n\bar x+a,n+b) \end{align}

### # b.

$$\bar\lambda=\frac{n\bar x+a}{n+b}$$

It will go to $\bar x$, which is the MLE given in 3.6

## # 3.8

The MLE is $\hat a=\max{|x_i|}$. As $$L=\prod_i I(a\geq |x_i|)\frac{1}{2a}=\frac{I(a\geq \max{|x_i|})}{2^na^n}$$

The problem is that $\hat a$ in at the edge of probable $a$. So it is always smaller than real value, i.e. biased.

## # 3.9

There is an error in exercise. Theer should be $I(\theta>b)$ term in the pdf. Define $M=\max\{x_1,x_2,\ldots,x_n,b\}$

$$p(\theta|D)=\frac{p(\theta,D)}{p(D)}=\frac{(N+K)M^{N+K}}{\theta^{N+K+1}}I(\theta>M)$$

It is a Pareto dist start at $M$ with exponent $N+K$

## # 3.15

$$m=\frac{a}{a+b},\quad v=\frac{ab}{(a+b)^2(a+b+1)}$$

We have $$v=\frac{m(1-m)}{a+b+1}\quad\Rightarrow\quad a+b=\frac{m(1-m)-v}{v}$$

As a result $$a=\left[\frac{m(1-m)-v}{v}\right]m, \quad b=\left[\frac{m(1-m)-v}{v}\right](1-m)$$

## # 3.16

$$\frac{a}{m}=\frac{b}{1-m}=k$$

So the probability inside $[l,u]$ is $$p(k)=\int_l^u \frac{x^{mk-1}(1-x)^{(1-m)k-1}}{B[mk, (1-m)k]}dx=0.95$$

With $l=0.05,u=0.3,m=0.15$ given, we can find root $k=30.0404$, that is $$a=4.50606, \quad b=25.5343$$

## # 3.17

$$p(N_1|N,\theta)= \binom{N}{N_1}\theta^{N_1}(1-\theta)^{N-N_1}$$

So we find

\begin{align} p(N_1|N)&=\binom{N}{N_1}\int \theta^{N_1}(1-\theta)^{N-N_1}d\theta\\ &=\binom{N}{N_1}B(N_1+1, N-N_1+1)\\ &=1/(N+1) \end{align}

## # 3.19

### # a.

We have $P(c|x)=P(x|c)P(c)/P(x)$. As $P(x),P(c)=0.5$ won't affect the ratio, $$K=\log\frac{P(c=1|x)}{P(c=2|x)}\propto \log\frac{P(x|c=1)}{P(x|c=2)}=\mathbf{\phi}^T(\mathbf{\beta}_1-\mathbf{\beta}_2)$$

Define $$K_w=x_w\log\frac{\theta_{1w}}{\theta_{2w}}+(1-x_w)\log\frac{1-\theta_{1w}}{1-\theta_{2w}},$$ then $$K=\sum_w K_w$$

### # b.

For indiscrimitive word, $K_w$ should not be changed significantly by $x_w$, so if $$\delta_w=\log \frac{\theta_{1w}(1-\theta_{2w})}{\theta_{2w}(1-\theta_{1w})}=\beta_{1w}-\beta_{2w}$$ is small enough, we may consider $w$ irrevalent.

### # c.

$$\hat\theta_{cw}=\frac{1+n_c}{2+n_c}$$

So, $\beta_{1w}-\beta_{2w}$ will be heavily affected by $n_2/n_1$. It will not be ignored. The reason is that the word is strongly affected by regularization to the class.

Consider the postorior distribution of $\theta$ $$L(\theta)=\operatorname{Beta}(k,n-k), \qquad k=n_{cw}+1, \quad n=n_w+2$$

$$\bar\theta=k/n,\quad \sigma=\frac{1}{n}\sqrt{\frac{k(n-k)}{(n+1)}}$$

We need $$0<\bar\theta\pm A\sigma< 1$$

$$\Rightarrow \sqrt k>A,\quad \sqrt{n-k}>A$$

i.e. $$n_{cw}>A^2, \quad n_c-n_{cw}>A^2$$

### # d.

Now we begin to consider the variance of $K_w$ because of the keyword $w$: $$\sigma^2_w=\delta_w\theta_w(1-\theta_w)$$ The variance of $K$ is $$\sigma^2=\sum_w \sigma_w^2$$ We can select keywords with largest $\sigma_w^2=\delta_w\theta_w(1-\theta_w)$, and (a.) is a special case of this criteria.

## # 3.20

### # a.

$$p(x|y=c)=p(x_1|\theta_{jc})p(x_2|x_1,\theta_{jc})\cdots p(x_D|x_{D-1},\ldots x_1,\theta_{jc})$$

So the number of parameters we need is $$(2+2^2+\cdots + 2^{D})C=(2^{D+1}-2)C$$

### # b. c.

Naive Bayes will be better for small data set, while Full Bayes is more accurate for large data set. Because there are few parameters, so we tend to get better estimation in naive case. Conversely, the full case need much more data to train the parameters.

### # d.

• $O(ND)$ for naive
• $O(N2^D)$ for full

### # e.

• linear $O(D)$ for naive model
• quadratic $O(D^2)$ for full

## # 3.21

\begin{align} I&=\sum_{x_j}\sum_y p(x_j|y)p(y)\log\frac{p(x_j|y)}{p(x_j)}\\ &=\sum_c\pi_c\sum_{x_j} \theta_{jc}^{x_j}(1-\theta)_{jc}^{1-x_j}\log\frac{\theta_{jc}^{x_j}(1-\theta)_{jc}^{1-x_j}}{\sum_c\pi_c \theta_{jc}^{x_j}(1-\theta)_{jc}^{1-x_j}}\\ &=\sum_c \pi_c\left[\theta_{jc}\log\frac{\theta_{jc}}{\sum_c \pi_c\theta_{jc}}+\sum_c (1-\theta_{jc})\log\frac{1-\theta_{jc}}{\sum_c \pi_c(1-\theta_{jc})}\right]\\ &=\sum_c \pi_c\left[\theta_{jc}\log\frac{\theta_{jc}}{\theta_{j}}+\sum_c (1-\theta_{jc})\log\frac{1-\theta_{jc}}{1-\theta_{j}}\right] \end{align}

## # 3.22

In 3 spam messages, we find:

• secret x 2
• dollar x 1

In 4 non-spam messages, we find:

• secret x 1
• sports x 2

As a result, we can get MLEs:

• $\theta_{\mathrm{spam}}=3/7$
• $\theta_{\mathrm{secret|spam}}=2/3$
• $\theta_{\mathrm{secret|non-spam}}=1/4$
• $\theta_{\mathrm{sports|non-spam}}=2/4=1/2$
• $\theta_{\mathrm{dollar|spam}}=1/3$