Date

# Problem

Suppose there are $n$ numbers, use Quickselect to determine $k$-th smallest number. What is average runtime?

# Recursion

For our convenience, we assume there are no equal numbers. Initially, there are $l=k-1$ numbers smaller than target and $r=n-k$ numbers larger than target, which satisfy $n=l+r+1$. We denote expectation value of run time for this situation by $T[l, r]$. For one step of quickselect, it takes $cn$ time to do partition with a random pivot. Then the problem falls into three cases:

  • If pivot is smaller than target, $l$ is reduced and falls in the range $[0, l-1]$
  • If pivot is greater than target, $r$ is reduced and falls in the range $[0, r-1]$
  • If pivot is target, our algorithm halts, which takes constant time $d$.

With this analysis, we find recursion for $T[l, r]$:

$$T[l, r]= cn+\frac{\sum_{k=0}^{l-1} T[k, r]+\sum_{k=0}^{r-1} T[l, k]+d}{n}$$

# Local Difference

The recursion of $T$ is equivalent to $$nT[l, r]= cn^2+d+\sum_{k=0}^{l-1} T[k, r]+\sum_{k=0}^{r-1} T[l, k]$$

Similarly we have \begin{align} (n-1)T[l, r-1]&= c(n-1)^2+d+\sum_{k=0}^{l-1} T[k, r-1]+\sum_{k=0}^{r-2} T[l, k]\\ (n-1)T[l-1, r]&= c(n-1)^2+d+\sum_{k=0}^{l-2} T[k, r]+\sum_{k=0}^{r-1} T[l-1, k]\\ (n-2)T[l-1, r-1]&= c(n-2)^2+d+\sum_{k=0}^{l-2} T[k, r-1]+\sum_{k=0}^{r-2} T[l-1, k] \end{align}

Evaluate $nT[l, r]-(n-1)T[l, r-1]-(n-1)T[l-1, r]+(n-2)T[l-1, r-1]$, and we find local difference term for $(l, r)$:

$$T[l, r]-T[l, r-1]-T[l-1, r]+T[l-1, r-1]=2c/n$$

# Summation

Sum over all local differences for $(i, j)$ points ($0< i\leq l, 0<j\leq r$), and we have

\begin{align} &\phantom{{}={}}T[l, r]-T[l, 0]-T[0, r]+T[0, 0]\\ &=\sum_{i=1}^l\sum_{j=1}^r\frac{2c}{i+j+1}\\ &< 2c\int_{1/2}^{l+1/2}\int_{1/2}^{r+1/2} \frac{dxdy}{x+y+1}\\ &= 2c[(l+r+2)\ln (l+r+2)-(l+2)\ln (l+2)-(r+2)\ln (r+2)+2\ln 2]\\ &= 2c\Big[N[-p\ln (p+1/N)-q\ln (q+1/N)]-\ln (l+2)(r+2)+2\ln 2\Big]\\ &< 2cSN,\quad S=-p\ln p-q\ln q \end{align}

where $N=l+r+2, p=(l+1)/N, q=(r+1)/N$. $S$ is entropy for $(p, q)$ distribution.

# Complexity

It is obvious that $T[n, 0]=T[0, n]=bn, T[0, 0]=d$, which gives

$$T[l, r]= (b+2cS)N$$

It shows that constant factor of computing median is larger than factor for max/min.

# $T/N$ as a function of percentage $p$

Assume $b=c=1$, then we can plot $T/N$ as function of $p$

In [25]:
p=linspace(1e-6, 1-1e-6, 101)
q = 1-p
b = c = 1
plot(p, b-2*c*(p*log(p)+q*log(q)))
grid();
xlabel('Percentage')
ylabel(r'$T/N$');

Comments

comments powered by Disqus