Correlation Filters with Limited Boundaries
Tracking Code (MATLAB)
Traditional correlation filters suffer from a boundary effect: the filter is learned from an unbalanced set of "real-world" and "synthetic" examples. The synthetic examples are created by applying a circular shift to the real-world examples and are meant to represent those examples at different translational shifts. We use the term synthetic because all of these shifted examples are plagued by circular boundary effects and are not truly representative of a shifted example (see Figure (c) above). As a result, the training set used for learning the template is extremely unbalanced, with one real-world example for every D-1 synthetic examples (where D is the dimensionality of the examples). These boundary effects can dramatically degrade the performance of the estimated template. In this work we propose to learn correlation filters whose size is much smaller than the size of the training images. We show how this technique can efficiently reduce the proportion of training patches affected by boundary effects ((c) and (d) in the Figure above).
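As a quick illustration of this boundary artifact (not taken from the released code), the following MATLAB snippet contrasts a circular shift with a true translation; it assumes the Image Processing Toolbox for imshowpair and the bundled cameraman.tif test image.

    % A circular shift creates a "synthetic" example whose wrapped-around
    % border does not match a true translation of the scene.
    img   = im2double(imread('cameraman.tif'));   % any grayscale test image
    shift = 30;                                   % translation in pixels

    synthetic = circshift(img, [0, shift]);       % wraps content around the border
    % A truly translated example would bring in new scene content instead;
    % the first `shift` columns of `synthetic` are corrupted by wrap-around.
    figure; imshowpair(img, synthetic, 'montage');
    title('Original vs. circularly shifted (note the wrapped left border)');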
- We propose a new correlation filter objective that can drastically reduce the number of examples in a correlation filter that are affected by boundary effects. We further demonstrate, however, that solving this objective in closed form drastically decreases computational efficiency: O(D^3 + N D^2) versus O(N D log D) for the canonical objective, where D is the length of the vectorized image and N is the number of examples.
- We demonstrate how this new objective can be efficiently optimized in an iterative manner through an Augmented Lagrangian Method (ALM), so as to take advantage of inherent redundancies in the frequency domain (see the sketch after this list). The efficiency of this new approach is O([N + K] T log T), where K is the number of iterations and T is the size of the search window.
- We present impressive results for both object detection and tracking, outperforming MOSSE as well as leading non-correlation-filter methods for object tracking.
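To make the ALM/ADMM idea concrete, here is a schematic, single-example MATLAB sketch of one plausible splitting: a frequency-domain g-step with a per-element closed form, a spatial h-step that crops to the filter support, and a dual update. The scaled-dual form, the fixed penalty mu, and all variable names are our simplifications for illustration; the paper's objective is over N examples and this is not the released code.

    T = [128 128];  D = [64 64];                 % window and filter sizes
    x = randn(T);                                 % stand-in training window
    [r, c] = ndgrid(1:T(1), 1:T(2));
    y = exp(-((r-64).^2 + (c-64).^2) / (2*2));    % desired response, variance 2
    xf = fft2(x);  yf = fft2(y);

    pad  = @(z) [z, zeros(D(1), T(2)-D(2)); zeros(T(1)-D(1), T(2))]; % P': zero-pad to T
    crop = @(z) z(1:D(1), 1:D(2));                % P : crop to the filter support
    h = zeros(D);  g = zeros(T);  u = zeros(T);   % filter, split variable, scaled dual
    mu = 1;  lambda = 1e-2;

    for it = 1:4                                  % a handful of iterations suffice
        % g-step: per-frequency closed form -- the cheap Fourier subproblem
        qf = fft2(pad(h) - u);
        gf = (xf .* yf + mu*qf) ./ (abs(xf).^2 + mu);
        g  = real(ifft2(gf));
        % h-step: crop and shrink -- enforces the limited spatial support
        h  = (mu / (lambda + mu)) * crop(g + u);
        % dual ascent on the equality constraint g = pad(h)
        u  = u + g - pad(h);
    end
    resp = real(ifft2(conj(xf) .* fft2(pad(h)))); % correlation output over the window

Each iteration costs only FFTs and element-wise operations over the T-sized window, which is where the O([N + K] T log T) complexity comes from.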
(a) An example of the fixed spatial support within the image from which the peak correlation output should occur. (b) The desired output response, based on (a), of the correlation filter when applied to the entire image. (c) A subset of patch examples used in a canonical correlation filter, where green denotes a non-zero correlation output and red denotes a zero correlation output, in direct accordance with (b). (d) A subset of patch examples used in our proposed correlation filter. Note that our proposed approach uses patches stemming from different parts of the image, whereas the canonical correlation filter simply employs circularly shifted versions of the same single patch. The central dilemma in this work is how to perform (d) efficiently in the Fourier domain. The last two patches of (d) show that only (D-1)/T of the patches, those near the image border, are affected by the circular shift in our method; this fraction can be greatly diminished by choosing D << T, where D and T respectively denote the lengths of the vectorized face patch in (a) and of the whole image.
Learning Demo (ADMM iterations)
Object tracking demo (blue dashed: ground truth; red: our method)
Localization Performance: Comparison with Prior Filters
Dataset: The CMU Multi-PIE face database was used for this experiment, containing 900 frontal faces with neutral expression and normal illumination. We randomly selected 400 of these images for training and used the remainder for testing.
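A minimal sketch of this random split (the index variables are ours):

    rng(0);                                   % fixed seed for reproducibility
    idx      = randperm(900);
    trainIdx = idx(1:400);                    % 400 training images
    testIdx  = idx(401:end);                  % remaining 500 for testing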
Image Preprocessing: All images were cropped to the same size of 128x128, such that the left and right eyes were centered at coordinates (40, 32) and (40, 96), respectively. The cropped images were power-normalized to zero mean and unit standard deviation. Then, a 2D cosine window was employed to reduce the frequency effects caused by the opposite borders of the images in the Fourier domain.
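A minimal sketch of this preprocessing for one 128x128 crop `img` (variable names are ours, not from the released code):

    img = double(img);
    img = (img - mean(img(:))) / std(img(:));   % power-normalize: zero mean, unit std
    c   = 0.5 * (1 - cos(2*pi*(0:127)' / 127)); % 1D cosine (Hann) taper
    img = img .* (c * c');                      % separable 2D cosine window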
Filters Training: We trained a 64x64 filter of the right eye using full face images for our method (T = 128x128 and D = 64x64), and 64x64 cropped patches (centered on the right eye) for the other methods. Similar to ASEF and MOSSE, we defined the desired response as a 2D Gaussian function with a spatial variance of s = 2 whose peak was located at the center of the right eye.
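The desired response can be constructed directly; a sketch under the conventions stated above (peak at the right-eye center (40, 96), variance s = 2):

    [r, c] = ndgrid(1:128, 1:128);
    s = 2;                                            % spatial variance
    y = exp(-((r - 40).^2 + (c - 96).^2) / (2*s));    % peak on the right eye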
An example of eye localization is shown for an image with normal lighting. The outputs (bottom) are produced using 64x64 correlation filters (top). The green box represents the estimated location of the right eye (the output peak). The peak strength, measured by the peak-to-sidelobe ratio (PSR), indicates the sharpness of the output peak.
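PSR here follows the usual MOSSE-style convention: the peak height relative to the mean and standard deviation of the sidelobe, excluding an 11x11 window around the peak (the window size is the common convention, not stated on this page). A sketch:

    function psr = computePSR(resp)
        % Peak-to-sidelobe ratio: peak height vs. sidelobe statistics.
        [peak, k] = max(resp(:));
        [pr, pc]  = ind2sub(size(resp), k);
        mask = true(size(resp));                % sidelobe = everything outside
        mask(max(pr-5,1):min(pr+5,end), ...     % an 11x11 window at the peak
             max(pc-5,1):min(pc+5,end)) = false;
        side = resp(mask);
        psr  = (peak - mean(side)) / std(side);
    end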
The Influence of D and T on Detection Accuracy: We examined the influence of T (the size of the training images) on the performance of eye localization. For this purpose, we employed cropped patches of the right eye with varying sizes of T = {D, 1.5D, 2D, 2.5D, 3D, 3.5D, 4D} to train filters of size D = 32 x 32. The localization results are illustrated in the figure below, showing that the lowest performance was obtained when T is equal to D (32 x 32), and that the localization rate improved as the size of the training patches increased relative to the filter size.
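A sketch of this sweep; trainFilter and evalLocalization are hypothetical stand-ins for the training and evaluation routines, not functions from the released code:

    D = 32;
    for scale = [1 1.5 2 2.5 3 3.5 4]
        T   = round(scale * D);                % training-window side length
        h   = trainFilter(trainImgs, T, D);    % hypothetical training routine
        acc = evalLocalization(h, testImgs);   % hypothetical evaluation routine
        fprintf('T = %3d: localization rate %.3f\n', T, acc);
    end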
Runtime and Convergence Evaluation: Comparison with a Spatial Optimizer
Runtime performance and convergence behavior of our method against a naive iterative baseline (steepest descent) [Zeiler et al., 2010]. Our approach enjoys superior performance in terms of: (a) convergence speed when training two filters of different sizes (32 x 32 and 64 x 64), and (b) the number of iterations required to converge.
Visual Object Tracking
The tracking performance is reported as a tuple of {precision within 20 pixels, average position error in pixels}; our method achieved the best performance on 8 of 10 videos. The best fps was obtained by MOSSE. Our method achieved a real-time tracking speed of 50 fps using four iterations of ADMM. The best result for each video is highlighted in bold.
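A schematic of the per-frame loop behind these numbers; getWindow and updateFilterADMM are hypothetical helpers, and hPad is assumed to hold the zero-padded filter from initialization. This is a sketch of the general correlation-tracking recipe, not the released tracker:

    pos = initPos;                                 % [row col] of the target center
    for f = 1:numFrames
        win  = getWindow(frames{f}, pos);          % hypothetical: crop search window
        resp = real(ifft2(conj(fft2(win)) .* fft2(hPad)));
        [~, k]   = max(resp(:));
        [dr, dc] = ind2sub(size(resp), k);
        pos  = pos + ([dr dc] - size(resp)/2);     % move to the correlation peak
        hPad = updateFilterADMM(win, hPad, 4);     % hypothetical: four ADMM iterations
    end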
Tracking results of our method on test videos with challenging variations in pose, scale, illumination, and partial occlusion. The blue (dashed) and red boxes respectively represent the ground truth and the positions predicted by our method. For each frame, we show the target, the trained filter, and the correlation output.