The goal of the Perspective-n-Point (PnP) problem is to estimate the position and orientation of a calibrated camera from known 3D-to-2D point correspondences.
P3P uses the geometric relationships among three points, their 3D-to-2D correspondences, and the known internal parameters of the camera to recover the external parameters (rotation and translation) of the camera.
First, back-project the three 2D image points into direction vectors \(a, b, c\) in the camera coordinate system, i.e., the rays from the camera center \(O\) toward the 3D points \(A, B, C\) (a code sketch follows the equations below). From these we can compute \(\cos\left< a, b \right>, \cos\left< b, c \right>, \cos\left< a, c \right>\). Then, by the law of cosines, we get
\[
\left\{
\begin{aligned}
& OA^2 + OB^2 - 2OA\cdot OB\cdot \cos\left< a, b \right> = AB^2 \\
& OB^2 + OC^2 - 2OB\cdot OC\cdot \cos\left< b, c \right> = BC^2 \\
& OA^2 + OC^2 - 2OA\cdot OC\cdot \cos\left< a, c \right> = AC^2 \\
\end{aligned}
\right.
\]
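As a concrete sketch of this back-projection step (the names `pts_2d`, for the three pixel coordinates, and `K`, for the intrinsic matrix, are hypothetical):

```python
import numpy as np

def bearing_cosines(pts_2d, K):
    """Back-project three pixel coordinates into unit direction vectors
    a, b, c in the camera frame and return their pairwise cosines."""
    pts_h = np.hstack([pts_2d, np.ones((3, 1))])   # homogeneous pixel coords
    rays = (np.linalg.inv(K) @ pts_h.T).T          # K^{-1} [u, v, 1]^T per point
    a, b, c = rays / np.linalg.norm(rays, axis=1, keepdims=True)
    return a @ b, b @ c, a @ c                     # cos<a,b>, cos<b,c>, cos<a,c>
```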
Dividing both sides by \(OC^2\), and letting \(\displaystyle x = \frac{OA}{OC}, y = \frac{OB}{OC}, v = \frac{AB^2}{OC^2}, u = \frac{BC^2}{AB^2}, w = \frac{AC^2}{AB^2}\), the equations become
\[
\left\{
\begin{aligned}
& x^2 + y^2 - 2xy\cos\left< a, b \right> - v = 0 \\
& y^2 + 1 - 2y\cos\left< b, c \right> - uv = 0 \\
& x^2 + 1 - 2x\cos\left< a, c \right> - wv = 0 \\
\end{aligned}
\right.
\]
Solving the first equation for \(v = x^2 + y^2 - 2xy\cos\left< a, b \right>\) and substituting it into the other two, we get
\[
\left\{
\begin{aligned}
& (1 - u)y^2 - ux^2 - 2y\cos\left< b, c \right> + 2uxy\cos\left< a, b \right> + 1 = 0 \\
& (1 - w)x^2 - wy^2 - 2x\cos\left< a, c \right> + 2wxy\cos\left< a, b \right> + 1 = 0 \\
\end{aligned}
\right.
\]
This system of two quadratic equations in two unknowns has up to four solutions, so we use one extra point correspondence to select the correct one.
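For completeness, a minimal sketch using OpenCV's built-in P3P solver (all numbers here are hypothetical; `cv2.SOLVEPNP_P3P` expects exactly four correspondences, the fourth being the disambiguating point):

```python
import numpy as np
import cv2

# Hypothetical setup: four non-coplanar world points and their images
# under a known ground-truth pose, so the recovered pose can be checked.
obj_pts = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0.3, 0.4, 0.5]],
                   dtype=np.float64)
K = np.array([[800, 0, 320], [0, 800, 240], [0, 0, 1]], dtype=np.float64)
rvec_gt = np.array([[0.1], [0.2], [0.05]])
tvec_gt = np.array([[0.2], [-0.1], [5.0]])
img_pts, _ = cv2.projectPoints(obj_pts, rvec_gt, tvec_gt, K, None)

ok, rvec, tvec = cv2.solvePnP(obj_pts, img_pts, K, None,
                              flags=cv2.SOLVEPNP_P3P)
if ok:
    R, _ = cv2.Rodrigues(rvec)       # rotation matrix from axis-angle vector
    print(R, tvec.ravel())           # should match the ground-truth pose
```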
Suppose \(P_i\) is the coordinate of a point in the world coordinate system, \(p_i\) is the homogeneous coordinate of the corresponding point in the camera coordinate system, and \(\omega_i\) is the depth of the corresponding point; then we can define the reprojection error.
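One common least-squares formulation consistent with these symbols (a sketch of the usual definition, under the assumption that for noise-free data \(\omega_i p_i = R P_i + t\) holds exactly) is
\[
(R^*, t^*) = \arg\min_{R,\, t} \frac{1}{2} \sum_{i} \left\| p_i - \frac{1}{\omega_i}\left( R P_i + t \right) \right\|^2 .
\]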
Structure from motion (SfM) is a photogrammetric range imaging technique for estimating three-dimensional structures from two-dimensional image sequences that may be coupled with local motion signals.
Epipolar geometry uses 2D point correspondences and the known internal parameters of the cameras to recover the rotation \(R\) and translation \(t\) between the two cameras; a code sketch of this pipeline follows the definitions below.
Baseline (\(O_lO_r\)): Line connecting the two camera centers.
Epipoles (\(e_l\) and \(e_r\)): Intersections of baseline with image planes.
Epipolar Plane (\(XO_lO_r\)): Plane containing \(X\), \(O_l\), and \(O_r\).
Epipolar Lines (\(l_l\) and \(l_r\)): Intersections of epipolar plane with image planes.
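As mentioned above, here is a minimal sketch of this pipeline with OpenCV (the synthetic scene, intrinsics `K`, and ground-truth pose are all hypothetical, used only to make the example self-contained):

```python
import numpy as np
import cv2

# Hypothetical synthetic scene: random 3D points seen from two cameras.
rng = np.random.default_rng(0)
P = rng.uniform([-1, -1, 4], [1, 1, 8], size=(50, 3))           # world points
K = np.array([[800, 0, 320], [0, 800, 240], [0, 0, 1]], dtype=np.float64)
R_gt, _ = cv2.Rodrigues(np.array([[0.0], [0.1], [0.0]]))        # ground truth
t_gt = np.array([[1.0], [0.0], [0.0]])

def project(Pw, R, t):
    """Pinhole projection of world points into one camera."""
    Xc = (R @ Pw.T + t).T            # camera-frame coordinates
    uv = (K @ Xc.T).T
    return uv[:, :2] / uv[:, 2:3]    # perspective division

pts_l = project(P, np.eye(3), np.zeros((3, 1)))
pts_r = project(P, R_gt, t_gt)

# Estimate E from the correspondences, then recover R, t (t only up to scale).
E, mask = cv2.findEssentialMat(pts_l, pts_r, K, method=cv2.RANSAC,
                               prob=0.999, threshold=1.0)
_, R, t, _ = cv2.recoverPose(E, pts_l, pts_r, K, mask=mask)
```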
The normal vector of the epipolar plane is \(\bm{n} = \bm{t} \times \bm{x}_l\), and since \(\bm{x}_l - \bm{t}\) also lies in this plane, \((\bm{x}_l - \bm{t}) \cdot (\bm{t} \times \bm{x}_l) = 0\). With the convention \(\bm{x}_r = R(\bm{x}_l - \bm{t})\), that is
\[
\bm{x}_r^\top E \, \bm{x}_l = 0, \qquad E = R \, T_\times,
\]
where \(T_\times\) denotes the skew-symmetric matrix such that \(T_\times \bm{v} = \bm{t} \times \bm{v}\).
Given that \(T_\times\) is a skew-symmetric matrix (\(a_{ij} = -a_{ji}\)) and \(R\) is an orthogonal matrix, we can decouple \(T_\times\) and \(R\) from \(E\) using singular value decomposition, so our goal becomes estimating \(E\).
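A minimal NumPy sketch of this SVD step (the helper name is assumed; the standard recipe yields four candidate pairs, and the correct one is chosen by checking that triangulated points land in front of both cameras):

```python
import numpy as np

def decompose_essential(E):
    """Split an essential matrix into its four candidate (R, t) pairs via SVD."""
    U, _, Vt = np.linalg.svd(E)
    # Force proper rotations: flip a factor if its determinant is -1.
    if np.linalg.det(U) < 0:
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    W = np.array([[0.0, -1.0, 0.0],
                  [1.0,  0.0, 0.0],
                  [0.0,  0.0, 1.0]])
    R1, R2 = U @ W @ Vt, U @ W.T @ Vt
    t = U[:, 2]                      # translation direction, up to sign/scale
    return [(R1, t), (R1, -t), (R2, t), (R2, -t)]
```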
Suppose \(\mathbf{P}\) is the coordinate of a point and \(\bm{\hat{u}}_l, \bm{\hat{u}}_r\) are its projections on the two image planes; then we can define the reprojection error.
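One common formulation of this error (a hedged sketch; \(\pi_l\) and \(\pi_r\), introduced here, denote the perspective projections into the left and right images):
\[
e(\mathbf{P}) = \left\| \bm{\hat{u}}_l - \pi_l(\mathbf{P}) \right\|^2 + \left\| \bm{\hat{u}}_r - \pi_r(\mathbf{P}) \right\|^2 .
\]
The triangulated point is the \(\mathbf{P}\) that minimizes this error.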