Big Question
How do we match features between two photos once we have computed their descriptors?
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import datasets
from PIL import Image
from scipy.ndimage import convolve, correlate, gaussian_filter, laplace
import cv2 as cv
Used when there is only a small difference between the two images.
Assuming $g(x) = f(x + t)$, where $t$ is small.
$\frac{df(x)}{dx} \approx \frac{f(x+t) - f(x)}{t} = \frac{g(x) - f(x)}{t}$
Error:
$E(t) = \sum_x w(x)\,(f(x+t) - g(x))^2$
$\approx \sum_x w(x)\,(f(x) + t^T\nabla f(x) - g(x))^2$
Minimizing the Error:
$\frac{\partial E}{\partial t} = \sum_x w(x) \cdot 2\nabla f(x)\,(f(x) + t^T\nabla f(x) - g(x)) = 0$
Least squares solution (ignoring summation and other arguments for simplicity):
$w\,\nabla f\,(\nabla f)^T\, t = w\,\nabla f\,(g - f)$, where the unknown we solve for is $t$.
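A minimal NumPy sketch of this least-squares estimate for a single patch, assuming uniform weights $w(x) = 1$ and that `f`, `g` are small grayscale arrays of equal shape (the function name is illustrative):
def estimate_translation(f, g):
    """Estimate a small translation t with g(x) ~ f(x + t) by solving
    (sum grad_f grad_f^T) t = sum grad_f (g - f), with uniform weights."""
    fy, fx = np.gradient(f.astype(float))               # image gradients of f
    grad = np.stack([fx.ravel(), fy.ravel()], axis=1)   # (N, 2) gradient rows
    diff = (g.astype(float) - f.astype(float)).ravel()  # (N,) values of g - f
    A = grad.T @ grad                                   # 2x2 normal matrix
    b = grad.T @ diff
    t, *_ = np.linalg.lstsq(A, b, rcond=None)
    return t                                            # (t_x, t_y)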
Assumptions:
The pixel intensities of an object do not change between consecutive frames.
Neighbouring pixels have similar motion.
from skimage.color import rgb2gray
from skimage.data import stereo_motorcycle
from skimage.transform import warp
from skimage.registration import optical_flow_tvl1
# --- Load the sequence
image0, image1, disp = stereo_motorcycle()
# --- Display the images
fig, (ax0, ax1, ax2) = plt.subplots(1, 3, figsize=(10, 5))
ax0.imshow(image0)
ax0.set_axis_off()
ax1.imshow(image1)
ax1.set_axis_off()
ax2.imshow(disp)
ax2.set_title("Disparities")
ax2.set_axis_off()
fig.tight_layout()
As you can observe, the second image is a shifted version of the first, and we seek to register the two.
# --- Convert the images to gray level: color is not supported.
image0 = rgb2gray(image0)
image1 = rgb2gray(image1)
# --- Compute the optical flow
v, u = optical_flow_tvl1(image0, image1)
# --- Use the estimated optical flow for registration
nr, nc = image0.shape
row_coords, col_coords = np.meshgrid(np.arange(nr), np.arange(nc), indexing='ij')
image1_warp = warp(image1, np.array([row_coords + v, col_coords + u]), mode='edge')
# build an RGB image with the unregistered sequence
seq_im = np.zeros((nr, nc, 3))
seq_im[..., 0] = image1
seq_im[..., 1] = image0
seq_im[..., 2] = image0
# build an RGB image with the registered sequence
reg_im = np.zeros((nr, nc, 3))
reg_im[..., 0] = image1_warp
reg_im[..., 1] = image0
reg_im[..., 2] = image0
# build an RGB image with the target (the reference image in all channels)
target_im = np.zeros((nr, nc, 3))
target_im[..., 0] = image0
target_im[..., 1] = image0
target_im[..., 2] = image0
# --- Show the result
fig, (ax0, ax1, ax2) = plt.subplots(3, 1, figsize=(5, 10))
ax0.imshow(seq_im)
ax0.set_title("Unregistered sequence")
ax0.set_axis_off()
ax1.imshow(reg_im)
ax1.set_title("Registered sequence")
ax1.set_axis_off()
ax2.imshow(target_im)
ax2.set_title("Target")
ax2.set_axis_off()
fig.tight_layout()
Assumption: any part of the first image may appear anywhere in the second image.
Intuition: do pairwise local-descriptor matching, and then try to enforce geometric consistency according to some rigid motion model (zoom, pan, translation, rotation).
Possible Solution:
--> Descriptor Extraction
--> Descriptor Matching
* For each descriptor in the 1st image, find its 2 nearest neighbours in the 2nd image.
* If the ratio of the distance to the 1st neighbour over the distance to the 2nd is small, make a correspondence (the ratio test).
* This yields a list of tentative correspondences, as sketched below.
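A minimal sketch of this pipeline with OpenCV, assuming `img1` and `img2` are grayscale `uint8` images; SIFT is used here only as an example descriptor and 0.75 is an arbitrary ratio threshold:
# Descriptor extraction
sift = cv.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)
# Descriptor matching: 2 nearest neighbours in the 2nd image for each descriptor
bf = cv.BFMatcher(cv.NORM_L2)
knn_matches = bf.knnMatch(des1, des2, k=2)
# Ratio test: keep a tentative correspondence only if the best match is much
# closer than the second best
tentative = [m for m, n in knn_matches if m.distance < 0.75 * n.distance]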
Why is it difficult?
Wide-Baseline Spatial Matching: matches distinct keypoints between images taken from significantly different viewpoints.
Dense Registration through Optical Flow: estimates pixel-wise motion to capture detailed movement across consecutive frames.
Two images $I_1$ and $I_2$ correspond at points $x_1$ and $x_2$ if $I_1(x_1) = I_2(x_2)$ and $x_2 = T(x_1)$,
where $T: \mathbb{R}^2 \rightarrow \mathbb{R}^2$ and $T$ is a bijection.
Translation (2 DOF): $\begin{bmatrix}x'\\y'\\1\end{bmatrix} = \begin{bmatrix}1 & 0 & t_x\\0 & 1 & t_y\\0 & 0 & 1\end{bmatrix}\begin{bmatrix}x\\y\\1\end{bmatrix}$
Rotation (1 DOF): $\begin{bmatrix}x'\\y'\\1\end{bmatrix} = \begin{bmatrix}\cos\theta & -\sin\theta & 0\\\sin\theta & \cos\theta & 0\\0 & 0 & 1\end{bmatrix}\begin{bmatrix}x\\y\\1\end{bmatrix}$
Similarity (4 DOF): $\begin{bmatrix}x'\\y'\\1\end{bmatrix} = \begin{bmatrix}r\cos\theta & -r\sin\theta & t_x\\r\sin\theta & r\cos\theta & t_y\\0 & 0 & 1\end{bmatrix}\begin{bmatrix}x\\y\\1\end{bmatrix}$
Shear (2 DOF): $\begin{bmatrix}x'\\y'\\1\end{bmatrix} = \begin{bmatrix}1 & b_x & 0\\b_y & 1 & 0\\0 & 0 & 1\end{bmatrix}\begin{bmatrix}x\\y\\1\end{bmatrix}$
Affine Transformation (6 DOF): $\begin{bmatrix}x'\\y'\\1\end{bmatrix} = \begin{bmatrix}a_{11} & a_{12} & a_{13}\\a_{21} & a_{22} & a_{23}\\0 & 0 & 1\end{bmatrix}\begin{bmatrix}x\\y\\1\end{bmatrix}$
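As a small illustration of these homogeneous-coordinate transforms, here is a NumPy sketch applying a similarity transform to a few points (the parameter values are arbitrary examples):
theta, r, tx, ty = np.deg2rad(30), 1.5, 5.0, -2.0
# Similarity (4 DOF): rotation by theta, scale r, translation (tx, ty)
T = np.array([[r * np.cos(theta), -r * np.sin(theta), tx],
              [r * np.sin(theta),  r * np.cos(theta), ty],
              [0.0,                0.0,               1.0]])
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # points in I1
pts_h = np.hstack([pts, np.ones((len(pts), 1))])       # homogeneous coordinates
pts_out = (T @ pts_h.T).T[:, :2]                       # transformed points in I2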
The problem of finding how the image is transformed can be written as a linear system $Ax = b$, where $A$ is the transformation matrix, $x$ holds the (homogeneous) points of $I_1$, and $b$ the corresponding points of $I_2$.
We need $n = \lceil \frac{d}{2} \rceil$ correspondences, where $d$ is the DOF of our model.
Therefore, for translation, we just need 1 correspondence.
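A minimal least-squares sketch for the affine case (6 DOF, so at least 3 correspondences), assuming `src` and `dst` are $n \times 2$ arrays of matched points; the helper name is illustrative:
def fit_affine(src, dst):
    """Least-squares affine transform mapping src points to dst points."""
    n = len(src)
    A = np.zeros((2 * n, 6))        # each correspondence gives two equations
    b = dst.reshape(-1)
    A[0::2, 0:2] = src              # x' = a11*x + a12*y + a13
    A[0::2, 2] = 1.0
    A[1::2, 3:5] = src              # y' = a21*x + a22*y + a23
    A[1::2, 5] = 1.0
    params, *_ = np.linalg.lstsq(A, b, rcond=None)
    return np.vstack([params.reshape(2, 3), [0.0, 0.0, 1.0]])  # 3x3 matrix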
When there are outliers, the least-squares fit fails.
Inliers: data whose distribution can be explained by some model parameters. Outliers: noise; points that do not fit any such model.
Limitations
# RANSAC Implementation. Will do it once I figure out matplotlib animations.
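In the meantime, a minimal (non-animated) RANSAC sketch, reusing the hypothetical `fit_affine` helper from above and assuming `src`/`dst` are matched point arrays contaminated by outliers:
def ransac_affine(src, dst, n_iters=1000, thresh=3.0, seed=None):
    """Robust affine fit: sample minimal sets, keep the model with the most
    inliers, then refit on those inliers."""
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(src), dtype=bool)
    src_h = np.hstack([src, np.ones((len(src), 1))])
    for _ in range(n_iters):
        idx = rng.choice(len(src), size=3, replace=False)    # minimal sample
        M = fit_affine(src[idx], dst[idx])
        proj = (M @ src_h.T).T[:, :2]                        # reproject all points
        inliers = np.linalg.norm(proj - dst, axis=1) < thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return fit_affine(src[best_inliers], dst[best_inliers]), best_inliers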
Can we discard geometry for now and bring it back later in some other form?
BoW for Retrieval
Training a bag of words system goes as follows:
Compute the features for each image of the training set
Cluster those features
Label each cluster with the images that have features in that cluster
At this point the training is done and you can start with the testing as follows:
Compute the features of the test image
For each feature, find the nearest cluster
Add a tick for each training image that belongs to this cluster
The image with the highest number of ticks is the best match, the image with the second-highest number of ticks is the second-best match, and so on (see the sketch after this list)
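A minimal retrieval sketch along these lines, using scikit-learn's KMeans for the clustering step; `train_descs` (a list of per-image descriptor arrays) and `test_desc` (the query image's descriptors) are assumed to come from the descriptor extraction above:
from sklearn.cluster import KMeans

all_descs = np.vstack(train_descs)                       # features of all training images
kmeans = KMeans(n_clusters=100, random_state=0).fit(all_descs)
# Label each cluster with the training images that have features in it
cluster_to_images = [set() for _ in range(kmeans.n_clusters)]
for img_id, descs in enumerate(train_descs):
    for c in kmeans.predict(descs):
        cluster_to_images[c].add(img_id)
# Testing: for each query feature, add a tick to every image of its cluster
votes = np.zeros(len(train_descs))
for c in kmeans.predict(test_desc):
    for img_id in cluster_to_images[c]:
        votes[img_id] += 1
ranking = np.argsort(-votes)                             # best match first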
BoW for Classification:
Each image is represented by $Z \in \mathbb{R}^k$, where $Z_i$ is the number of occurrences of visual word $c_i$ in the image.
Same as BoW, but yields a vector per visual word instead of a scalar frequency.
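A minimal sketch of building this $Z$ vector, reusing the fitted `kmeans` codebook from the retrieval sketch above:
def bow_histogram(descs, kmeans):
    """Count occurrences of each visual word among one image's descriptors."""
    words = kmeans.predict(descs)                            # assign descriptors to words
    return np.bincount(words, minlength=kmeans.n_clusters).astype(float)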
## Hierarchical K-Means and BoW
BoW Matching: It matches image-to-image. No partial matching. $M(X_c, Y_c) = \sum_{x \in X_c} \sum_{y \in Y_c} 1$
Nearest Neighbour Matching: several features may get mapped to the same feature in the second image (greedy matching).
Generalized Descriptor Matching with Kernels:
A given image is described by a set of $n$ descriptors:
$X = \{x_1, x_2, x_3, \ldots, x_n\}$, where each $x_i \in \mathbb{R}^d$.
Every descriptor gets mapped to a cluster descriptor (representative descriptor) AKA visual word.
$C = \{c_1, c_2, \ldots, c_k\}$ is a codebook consisting of $k$ visual words.
To compare $X$ and $Y$, define a family of kernels: $K(X, Y) = \underbrace{\gamma(X)\gamma(Y)}_{\text{normalization, so one image does not overpower the other}}\sum_{c \in C}\overbrace{M}^{\text{within-cell matching function}}\underbrace{(X_c, Y_c)}_{\text{sets of descriptors assigned to visual word } c}$
Hamming Embedding for Matching: $M(X_c, Y_c) = \sum_{x\in X_c}\sum_{y \in Y_c}\mathbb{1}[h(b_x, b_y) \leq \tau]$, where $b_x$ is the binary signature of descriptor $x$, $\tau$ is a threshold, and $h$ is the Hamming distance.
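A minimal NumPy sketch of this matching function, assuming `bx` and `by` are boolean arrays holding the binary signatures of the two images' descriptors that fall in the same cell:
def hamming_match(bx, by, tau):
    """Count descriptor pairs whose signatures are within Hamming distance tau.
    bx: (n1, nbits) bool array, by: (n2, nbits) bool array."""
    dists = (bx[:, None, :] != by[None, :, :]).sum(axis=2)   # pairwise Hamming distances
    return int((dists <= tau).sum())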
VLAD Matching: $V(X_c) = \sum_{x \in X_c}(x - c)$, i.e. the aggregated residual of the descriptors assigned to $c$, and $M(X_c, Y_c) = V(X_c)^T V(Y_c)$.
Aggregated Selective Match Kernel (ASMK): a combination of two borrowed ideas: 1) a non-linear selective function (from Hamming Embedding), and 2) pooling of residuals (from VLAD).
$M(X_c, Y_c) = \sigma_{\alpha}(\hat{V}(X_c)^T\hat{V}(Y_c))$, where $\hat{V}$ is the normalized $V$ and $\sigma_\alpha(u) = \begin{cases} \operatorname{sign}(u)\,|u|^\alpha & \text{if } u > \tau \\ 0 & \text{otherwise}\end{cases}$
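A minimal NumPy sketch of the residual pooling and the selective function, assuming `Xc` and `Yc` are the descriptors of the two images assigned to the visual word `c`:
def vlad(Xc, c):
    """Aggregate the residuals of the descriptors assigned to visual word c."""
    return (Xc - c).sum(axis=0)

def asmk_match(Xc, Yc, c, alpha=3.0, tau=0.0):
    """Selective match on normalized aggregated residuals."""
    vx, vy = vlad(Xc, c), vlad(Yc, c)
    vx, vy = vx / np.linalg.norm(vx), vy / np.linalg.norm(vy)
    u = vx @ vy
    return np.sign(u) * np.abs(u) ** alpha if u > tau else 0.0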
Efficient Match Kernels: instead of threshold-based matching functions, we can use a continuous function $k(x, y)$ and avoid computationally intensive codebooks. $K(X, Y)$ from before can then be decomposed into feature maps $\Phi(X), \Phi(Y)$.
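For example, assuming $k(x, y) = \phi(x)^T\phi(y)$ for some feature map $\phi$ (and dropping the per-cell structure for brevity), the kernel factorizes as
$K(X, Y) = \gamma(X)\gamma(Y)\sum_{x \in X}\sum_{y \in Y} k(x, y) = \Big(\gamma(X)\sum_{x \in X}\phi(x)\Big)^T\Big(\gamma(Y)\sum_{y \in Y}\phi(y)\Big) = \Phi(X)^T\Phi(Y)$,
so each image can be represented by a single vector $\Phi(X)$ and compared with a dot product.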