Big Question
How do we match features between two photos once we have computed their descriptors?
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import datasets
from PIL import Image
from scipy.ndimage import convolve, correlate, gaussian_filter, laplace
import cv2 as cv
Used when there is only a small difference between the two images.
Assuming $g(x) = f(x + t)$, where $t$ is small.
$\frac{df(x)}{dx} \approx \frac{f(x+t) - f(x)}{t} = \frac{g(x) - f(x)}{t}$
Error:
$E(t) = \sum_x w(x)\,(f(x+t) - g(x))^2$
$\approx \sum_x w(x)\,(f(x) + t^T\nabla f(x) - g(x))^2$
Minimizing the Error:
$\frac{\partial E}{\partial t} = \sum_x w(x) \cdot 2\nabla f(x)\,(f(x) + t^T\nabla f(x) - g(x)) = 0$
Least squares solution (ignoring summation and other arguments for simplicity):
$w\,\nabla f\,(\nabla f)^T\, t = w\,\nabla f\,(g - f)$, where the unknown we solve for is $t$.
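A minimal NumPy sketch of this least-squares estimate for a single patch, assuming uniform weights $w(x) = 1$ and that `f`, `g` are small grayscale arrays of equal shape (the function name is illustrative):
def estimate_translation(f, g):
    """Estimate a small translation t with g(x) ~ f(x + t) by solving
    (sum grad_f grad_f^T) t = sum grad_f (g - f), with uniform weights."""
    fy, fx = np.gradient(f.astype(float))               # image gradients of f
    grad = np.stack([fx.ravel(), fy.ravel()], axis=1)   # (N, 2) gradient rows
    diff = (g.astype(float) - f.astype(float)).ravel()  # (N,) values of g - f
    A = grad.T @ grad                                   # 2x2 normal matrix
    b = grad.T @ diff
    t, *_ = np.linalg.lstsq(A, b, rcond=None)
    return t                                            # (t_x, t_y)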
Assumptions:
The pixel intensities of an object do not change between consecutive frames.
Neighbouring pixels have similar motion.
from skimage.color import rgb2gray
from skimage.data import stereo_motorcycle
from skimage.transform import warp
from skimage.registration import optical_flow_tvl1
# --- Load the sequence
image0, image1, disp = stereo_motorcycle()
# --- Display the images
fig, (ax0, ax1, ax2) = plt.subplots(1, 3, figsize=(10, 5))
ax0.imshow(image0)
ax0.set_axis_off()
ax1.imshow(image1)
ax1.set_axis_off()
ax2.imshow(disp)
ax2.set_title("Disparities")
ax2.set_axis_off()
fig.tight_layout()
As you can observe, the second image is a shifted version of the first, and we seek to register the two.
# --- Convert the images to gray level: color is not supported.
image0 = rgb2gray(image0)
image1 = rgb2gray(image1)
# --- Compute the optical flow
v, u = optical_flow_tvl1(image0, image1)
# --- Use the estimated optical flow for registration
nr, nc = image0.shape
row_coords, col_coords = np.meshgrid(np.arange(nr), np.arange(nc), indexing='ij')
image1_warp = warp(image1, np.array([row_coords + v, col_coords + u]), mode='edge')
# build an RGB image with the unregistered sequence
seq_im = np.zeros((nr, nc, 3))
seq_im[..., 0] = image1
seq_im[..., 1] = image0
seq_im[..., 2] = image0
# build an RGB image with the registered sequence
reg_im = np.zeros((nr, nc, 3))
reg_im[..., 0] = image1_warp
reg_im[..., 1] = image0
reg_im[..., 2] = image0
# build an RGB image with the target (the reference image in all channels)
target_im = np.zeros((nr, nc, 3))
target_im[..., 0] = image0
target_im[..., 1] = image0
target_im[..., 2] = image0
# --- Show the result
fig, (ax0, ax1, ax2) = plt.subplots(3, 1, figsize=(5, 10))
ax0.imshow(seq_im)
ax0.set_title("Unregistered sequence")
ax0.set_axis_off()
ax1.imshow(reg_im)
ax1.set_title("Registered sequence")
ax1.set_axis_off()
ax2.imshow(target_im)
ax2.set_title("Target")
ax2.set_axis_off()
fig.tight_layout()
Assumption: any part of the first image may appear anywhere in the second image.
Intuition: do pairwise local-descriptor matching, and then try to enforce geometric consistency according to some rigid motion model (zoom, pan, translation, rotation).
Possible Solution:
--> Descriptor Extraction
--> Descriptor Matching
* For each descriptor in the 1st image, find its 2 nearest neighbours in the 2nd image.
* If the ratio of the distance to the 1st neighbour over the distance to the 2nd is small, make a correspondence (the ratio test).
* This yields a list of tentative correspondences, as sketched below.
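A minimal sketch of this pipeline with OpenCV, assuming `img1` and `img2` are grayscale `uint8` images; SIFT is used here only as an example descriptor and 0.75 is an arbitrary ratio threshold:
# Descriptor extraction
sift = cv.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)
# Descriptor matching: 2 nearest neighbours in the 2nd image for each descriptor
bf = cv.BFMatcher(cv.NORM_L2)
knn_matches = bf.knnMatch(des1, des2, k=2)
# Ratio test: keep a tentative correspondence only if the best match is much
# closer than the second best
tentative = [m for m, n in knn_matches if m.distance < 0.75 * n.distance]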
Why is it difficult?
Wide-Baseline Spatial Matching: matches distinct keypoints between images taken from significantly different viewpoints.
Dense Registration through Optical Flow: estimates pixel-wise motion to capture detailed movement across consecutive frames.
Two images $I_1$ and $I_2$ correspond at points $x_1$ and $x_2$ if $I_1(x_1) = I_2(x_2)$ and $x_2 = T(x_1)$,
where $T: \mathbb{R}^2 \rightarrow \mathbb{R}^2$ and $T$ is a bijection.
Translation (2 DOF): $\begin{bmatrix}x'\\y'\\1\end{bmatrix} = \begin{bmatrix}1 & 0 & t_x\\0 & 1 & t_y\\0 & 0 & 1\end{bmatrix}\begin{bmatrix}x\\y\\1\end{bmatrix}$
Rotation (1 DOF): $\begin{bmatrix}x'\\y'\\1\end{bmatrix} = \begin{bmatrix}\cos\theta & -\sin\theta & 0\\\sin\theta & \cos\theta & 0\\0 & 0 & 1\end{bmatrix}\begin{bmatrix}x\\y\\1\end{bmatrix}$
Similarity (4 DOF): $\begin{bmatrix}x'\\y'\\1\end{bmatrix} = \begin{bmatrix}r\cos\theta & -r\sin\theta & t_x\\r\sin\theta & r\cos\theta & t_y\\0 & 0 & 1\end{bmatrix}\begin{bmatrix}x\\y\\1\end{bmatrix}$
Shear (2 DOF): $\begin{bmatrix}x'\\y'\\1\end{bmatrix} = \begin{bmatrix}1 & b_x & 0\\b_y & 1 & 0\\0 & 0 & 1\end{bmatrix}\begin{bmatrix}x\\y\\1\end{bmatrix}$
Affine Transformation (6 DOF): $\begin{bmatrix}x'\\y'\\1\end{bmatrix} = \begin{bmatrix}a_{11} & a_{12} & a_{13}\\a_{21} & a_{22} & a_{23}\\0 & 0 & 1\end{bmatrix}\begin{bmatrix}x\\y\\1\end{bmatrix}$
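As a small illustration of these homogeneous-coordinate transforms, here is a NumPy sketch applying a similarity transform to a few points (the parameter values are arbitrary examples):
theta, r, tx, ty = np.deg2rad(30), 1.5, 5.0, -2.0
# Similarity (4 DOF): rotation by theta, scale r, translation (tx, ty)
T = np.array([[r * np.cos(theta), -r * np.sin(theta), tx],
              [r * np.sin(theta),  r * np.cos(theta), ty],
              [0.0,                0.0,               1.0]])
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # points in I1
pts_h = np.hstack([pts, np.ones((len(pts), 1))])       # homogeneous coordinates
pts_out = (T @ pts_h.T).T[:, :2]                       # transformed points in I2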
The problem of finding how the image is transformed can be written as a linear system $Ax = b$, where $A$ is the transformation matrix, $x$ holds the (homogeneous) points of $I_1$, and $b$ the corresponding points of $I_2$.
We need $n = \lceil \frac{d}{2} \rceil$ correspondences, where $d$ is the DOF of our model.
Therefore, for translation, we just need 1 correspondence.
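A minimal least-squares sketch for the affine case (6 DOF, so at least 3 correspondences), assuming `src` and `dst` are $n \times 2$ arrays of matched points; the helper name is illustrative:
def fit_affine(src, dst):
    """Least-squares affine transform mapping src points to dst points."""
    n = len(src)
    A = np.zeros((2 * n, 6))        # each correspondence gives two equations
    b = dst.reshape(-1)
    A[0::2, 0:2] = src              # x' = a11*x + a12*y + a13
    A[0::2, 2] = 1.0
    A[1::2, 3:5] = src              # y' = a21*x + a22*y + a23
    A[1::2, 5] = 1.0
    params, *_ = np.linalg.lstsq(A, b, rcond=None)
    return np.vstack([params.reshape(2, 3), [0.0, 0.0, 1.0]])  # 3x3 matrix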
When there are outliers, the least-squares fit fails.
Inliers: data whose distribution can be explained by some model parameters. Outliers: noise; points that do not fit any such model.
Limitations
# RANSAC Implementation. Will do it once I figure out matplotlib animations.
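In the meantime, a minimal (non-animated) RANSAC sketch, reusing the hypothetical `fit_affine` helper from above and assuming `src`/`dst` are matched point arrays contaminated by outliers:
def ransac_affine(src, dst, n_iters=1000, thresh=3.0, seed=None):
    """Robust affine fit: sample minimal sets, keep the model with the most
    inliers, then refit on those inliers."""
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(src), dtype=bool)
    src_h = np.hstack([src, np.ones((len(src), 1))])
    for _ in range(n_iters):
        idx = rng.choice(len(src), size=3, replace=False)    # minimal sample
        M = fit_affine(src[idx], dst[idx])
        proj = (M @ src_h.T).T[:, :2]                        # reproject all points
        inliers = np.linalg.norm(proj - dst, axis=1) < thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return fit_affine(src[best_inliers], dst[best_inliers]), best_inliers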
Can we discard geometry for now and bring it back later in some other form?
BoW for Retrieval
Training a bag of words system goes as follows:
Compute the features for each image of the training set
Cluster those features
Label each cluster with the images that have features in that cluster
At this point the training is done and you can start with the testing as follows:
Compute the features of the test image
For each feature, find the nearest cluster
Add a tick for each training image that belongs to this cluster
The image with the highest number of ticks is the best match, the image with the second-highest number of ticks is the second-best match, and so on (see the sketch after this list)
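A minimal retrieval sketch along these lines, using scikit-learn's KMeans for the clustering step; `train_descs` (a list of per-image descriptor arrays) and `test_desc` (the query image's descriptors) are assumed to come from the descriptor extraction above:
from sklearn.cluster import KMeans

all_descs = np.vstack(train_descs)                       # features of all training images
kmeans = KMeans(n_clusters=100, random_state=0).fit(all_descs)
# Label each cluster with the training images that have features in it
cluster_to_images = [set() for _ in range(kmeans.n_clusters)]
for img_id, descs in enumerate(train_descs):
    for c in kmeans.predict(descs):
        cluster_to_images[c].add(img_id)
# Testing: for each query feature, add a tick to every image of its cluster
votes = np.zeros(len(train_descs))
for c in kmeans.predict(test_desc):
    for img_id in cluster_to_images[c]:
        votes[img_id] += 1
ranking = np.argsort(-votes)                             # best match first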
BoW for Classification:
Each image is represented by $Z \in \mathbb{R}^k$, where $Z_i$ is the number of occurrences of visual word $c_i$ in the image.
Same as BoW, but yields a vector per visual word instead of a scalar frequency.
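A minimal sketch of building this $Z$ vector, reusing the fitted `kmeans` codebook from the retrieval sketch above:
def bow_histogram(descs, kmeans):
    """Count occurrences of each visual word among one image's descriptors."""
    words = kmeans.predict(descs)                            # assign descriptors to words
    return np.bincount(words, minlength=kmeans.n_clusters).astype(float)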
## Hierarchical K-Means and BoW
BoW Matching: It matches image-to-image. No partial matching. $M(X_c, Y_c) = \sum_{x \in X_c} \sum_{y \in Y_c} 1$
Nearest Neighbour Matching: several features may get mapped to the same feature in the second image (greedy matching).
Generalized Descriptor Matching with Kernels:
A given image is described by a set of $n$ descriptors:
$X = \{x_1, x_2, x_3, \ldots, x_n\}$, where each $x_i \in \mathbb{R}^d$.
Every descriptor gets mapped to a cluster descriptor (representative descriptor) AKA visual word.
$C = \{c_1, c_2, \ldots, c_k\}$ is a codebook consisting of $k$ visual words.
To compare $X$ and $Y$, define a family of kernels: $K(X, Y) = \underbrace{\gamma(X)\gamma(Y)}_{\text{normalization, so one image does not overpower the other}}\sum_{c \in C}\overbrace{M}^{\text{within-cell matching function}}\underbrace{(X_c, Y_c)}_{\text{sets of descriptors assigned to visual word } c}$
Hamming Embedding for Matching: $M(X_c, Y_c) = \sum_{x\in X_c}\sum_{y \in Y_c}\mathbb{1}[h(b_x, b_y) \leq \tau]$, where $b_x$ is the binary signature of descriptor $x$, $\tau$ is a threshold, and $h$ is the Hamming distance.
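A minimal NumPy sketch of this matching function, assuming `bx` and `by` are boolean arrays holding the binary signatures of the two images' descriptors that fall in the same cell:
def hamming_match(bx, by, tau):
    """Count descriptor pairs whose signatures are within Hamming distance tau.
    bx: (n1, nbits) bool array, by: (n2, nbits) bool array."""
    dists = (bx[:, None, :] != by[None, :, :]).sum(axis=2)   # pairwise Hamming distances
    return int((dists <= tau).sum())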
VLAD Matching: $V(X_c) = \sum_{x \in X_c}(x - c)$, i.e. the aggregated residual of the descriptors assigned to $c$, and $M(X_c, Y_c) = V(X_c)^T V(Y_c)$.
Aggregated Selective Match Kernel (ASMK): a combination of two borrowed ideas: 1) a non-linear selective function (from Hamming Embedding), and 2) pooling of residuals (from VLAD).
$M(X_c, Y_c) = \sigma_{\alpha}(\hat{V}(X_c)^T\hat{V}(Y_c))$, where $\hat{V}$ is the normalized $V$ and $\sigma_\alpha(u) = \begin{cases} \operatorname{sign}(u)\,|u|^\alpha & \text{if } u > \tau \\ 0 & \text{otherwise}\end{cases}$
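A minimal NumPy sketch of the residual pooling and the selective function, assuming `Xc` and `Yc` are the descriptors of the two images assigned to the visual word `c`:
def vlad(Xc, c):
    """Aggregate the residuals of the descriptors assigned to visual word c."""
    return (Xc - c).sum(axis=0)

def asmk_match(Xc, Yc, c, alpha=3.0, tau=0.0):
    """Selective match on normalized aggregated residuals."""
    vx, vy = vlad(Xc, c), vlad(Yc, c)
    vx, vy = vx / np.linalg.norm(vx), vy / np.linalg.norm(vy)
    u = vx @ vy
    return np.sign(u) * np.abs(u) ** alpha if u > tau else 0.0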
Efficient Match Kernels: instead of threshold-based matching functions, we can use a continuous function $k(x, y)$ and avoid computationally intensive codebooks. $K(X, Y)$ from before can then be decomposed into feature maps $\Phi(X), \Phi(Y)$.
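For example, assuming $k(x, y) = \phi(x)^T\phi(y)$ for some feature map $\phi$ (and dropping the per-cell structure for brevity), the kernel factorizes as
$K(X, Y) = \gamma(X)\gamma(Y)\sum_{x \in X}\sum_{y \in Y} k(x, y) = \Big(\gamma(X)\sum_{x \in X}\phi(x)\Big)^T\Big(\gamma(Y)\sum_{y \in Y}\phi(y)\Big) = \Phi(X)^T\Phi(Y)$,
so each image can be represented by a single vector $\Phi(X)$ and compared with a dot product.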