The flashcards below were created by user
Anonymous
on FreezingBlue Flashcards.

Differences between detection, recognition and tracking
 Detection: find the object without any prior prediction, only with the model (global, slow)
 Recognition: match detected objects that correspond to the same entity
 Tracking: update the object state(s) using the previous state and the dynamical model (dynamic, local, fast)

Advantages of tracking over sequential detection?
 Speed
 Eases the data association problem (multiple candidates for one object)
 More robust to noisy measurements

Assumptions of tracking?
 Smooth camera movement (gradual changes between frames)
 The motion of the object can be modeled (linear, nonlinear)
 The presence of the object at a certain position can be estimated.

Approaches to do tracking?
 Sequential: online, real-time -> cheap, recursive (t-1, t, t+1). Cannot revise past errors, no lookahead
 Batch: offline, expensive. Takes all info into account, can correct past errors.
 Parallel trackers: multiple single trackers. Simple to implement and computationally cheap. Interactions difficult to handle.
 Joint-state tracker: one multi-object tracker. More expensive and complex. Better handling of interactions
 Probabilistic: slower, needs ad hoc interpretation (thresholds, etc.), multimodal and flexible.
 Non-probabilistic: fast convergence, can get stuck at local minima, can focus on one object only.

Main aspects of tracking?
 Models: Object model (2D, 3D, DoF), Sensors (internal, external camera), Context (background, lighting, env. constraints)
 Vision: features (color, texture, motion, edges, blobs), data association (match hypotheses and detections), data fusion (multiple features, CAD models, camera angles), likelihood (how good is the measurement, errors)
 Tracking: Pipeline: image acquisition -> prediction -> measurement -> model matching -> correction

Main Challenges to Tracking?
 Change in appearance (pose), illumination, scale
 Occlusion, clutter

Motivation and main goal for Corner and Edge detection?
 Edges carry lots of semantic information (shape, boundaries, order)
 The idea is to find unusual parts of an image that are also easy to recognize again, i.e. they are robust/invariant to changes in lighting and to transformations (good for tracking).

How can an edge be defined?
 Rapid change in pixel intensity, hence it can be represented as a gradient (w.r.t. pixel differences) with a "big" magnitude.
 The gradient magnitude determines the strength of the edge and the orientation of the gradient is orthogonal to the orientation of the edge.
 To get the derivative of an image (i.e. the differences between neighboring pixel intensities), the kernel [-1, 1] can be convolved with the original image.

How to handle noisy images when detecting edges and what's the problem?
 When differentiating noisy images, edges get difficult to spot (fig1).
 The image should be smoothed first to filter out high frequencies (the small disturbances). Achieved with a Gaussian kernel.
 Now when taking the derivative, nice peaks can be spotted.
 Advantages: d/dx (f * g) = df/dx * g = f * dg/dx (with * as convolution), so the image can be convolved once with the derivative of the Gaussian, which saves an operation. The 2D Gaussian is also separable.
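 A minimal 1D sketch of the identity above, assuming NumPy is available. The step signal, the noise level and the helper name `gaussian_kernel` are illustrative choices, not part of the original notes.

```python
import numpy as np

def gaussian_kernel(sigma):
    """Sampled 1D Gaussian g, normalized so that it sums to 1."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    g = np.exp(-x**2 / (2 * sigma**2))
    return g / g.sum()

sigma = 3.0
g = gaussian_kernel(sigma)
dg = np.convolve(g, [1.0, -1.0])  # finite-difference derivative of the Gaussian

rng = np.random.default_rng(0)
f = np.concatenate([np.zeros(100), np.ones(100)])  # step edge between index 99 and 100
f = f + rng.normal(0.0, 0.1, f.size)               # additive noise

two_pass = np.convolve(np.convolve(f, g), [1.0, -1.0])  # d/dx (f * g)
one_pass = np.convolve(f, dg)                            # f * dg/dx

# 'full' convolution shifts indices by the kernel radius, so subtract it again.
edge_index = int(np.argmax(one_pass)) - int(3 * sigma)
```

 Both results agree up to floating-point error, and the peak of the smoothed derivative lands close to the true edge position.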

Criteria for a good edge detector:
 Detection: Minimize false positives and false negatives
 Localization: Detected points should be as close as possible to the real edge.
 Single response: Return only one point per true edge point in a local neighborhood.

What does the Canny edge detector do additionally to improve edge detection results?
 Non-maximum suppression: along the gradient direction, only consider the pixels whose gradient magnitude is locally greatest.
 Edge linking: take the points that are normal to the gradient and mark them as candidates to follow the edge
 Hysteresis thresholding: use two thresholds. A big one to start a new edge and a small one to continue an edge.
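 The hysteresis step can be sketched as a flood fill from the strong pixels. This is a simplified illustration assuming NumPy and 8-connectivity. The function name `hysteresis_threshold` and the toy magnitudes are made up for the example, and it skips the non-maximum suppression a full Canny detector would run first.

```python
from collections import deque
import numpy as np

def hysteresis_threshold(magnitude, low, high):
    """Keep pixels >= high, plus pixels >= low that are 8-connected to them."""
    strong = magnitude >= high
    weak = magnitude >= low
    edges = np.zeros(magnitude.shape, dtype=bool)
    queue = deque(zip(*np.nonzero(strong)))  # seeds: pixels that start an edge
    for r, c in queue:
        edges[r, c] = True
    rows, cols = magnitude.shape
    while queue:
        r, c = queue.popleft()
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                rr, cc = r + dr, c + dc
                inside = 0 <= rr < rows and 0 <= cc < cols
                if inside and weak[rr, cc] and not edges[rr, cc]:
                    edges[rr, cc] = True      # weak pixel continues the edge
                    queue.append((rr, cc))
    return edges

mag = np.array([
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.9, 0.5, 0.4, 0.0, 0.4],
    [0.0, 0.0, 0.0, 0.0, 0.0],
])
edges = hysteresis_threshold(mag, low=0.3, high=0.8)
```

 In the toy example the two weak pixels next to the strong one are kept, while the isolated weak pixel on the right is dropped.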

How does the basic corner detection work (Kanade-Tomasi / Harris)?
 Take a window and compute the SSD when it is moved within a local neighborhood. ( create H -> [I_x^2, I_xI_y; I_yI_x, I_y^2], summed over the window )
 Take a look at the directions along which the error changes the most (the eigenvectors of H; the eigenvalues give the amount of change)
 Threshold according to the lowest eigenvalue, lambda_min (keep points in the image where the lower eigenvalue is greater than a threshold)
 Choose points where lambda_min is a local maximum.

How does Harris improve over the Kanade-Tomasi corner detection algorithm?
 Instead of thresholding the eigenvalues, he creates a score function R = det(H) - k*(trace(H))^2.
 This gives a high value when both eigenvalues are similarly big and a low value when they are too different or both too low.
 So by thresholding R (and doing non-max. suppression), similar responses can be obtained as with the eigenvalues but without having to compute them directly (costly).
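 A rough sketch of the score R = det(H) - k*(trace(H))^2, assuming NumPy. It uses a plain 3x3 box window instead of a Gaussian one, and k = 0.05 is an assumed value. The helper names `box_sum` and `harris_response` are illustrative.

```python
import numpy as np

def box_sum(img, radius=1):
    """Sum over a (2*radius+1)^2 window, zero-padded at the border."""
    padded = np.pad(img, radius)
    out = np.zeros_like(img, dtype=float)
    size = 2 * radius + 1
    for dr in range(size):
        for dc in range(size):
            out += padded[dr:dr + img.shape[0], dc:dc + img.shape[1]]
    return out

def harris_response(image, k=0.05):
    """R = det(H) - k * trace(H)^2, with H built from windowed gradient products."""
    iy, ix = np.gradient(image.astype(float))  # derivatives along rows (y) and columns (x)
    sxx = box_sum(ix * ix)
    syy = box_sum(iy * iy)
    sxy = box_sum(ix * iy)
    det = sxx * syy - sxy**2
    trace = sxx + syy
    return det - k * trace**2

image = np.zeros((20, 20))
image[10:, 10:] = 1.0  # bright square with one corner at (row 10, col 10)
R = harris_response(image)
corner = tuple(int(i) for i in np.unravel_index(np.argmax(R), R.shape))
```

 On this toy image R is largest at the corner of the square, negative along its straight edges and zero in the flat regions, which matches the behavior described in the card.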

Main characteristics of harris detected features (as the ellipses of the eigenvectors):
 Scale: not invariant (think of a zoomed-in corner... it is no longer a corner but a smooth curve)
 Rotation: invariant. [ f(x) = f(T(x)) = y ]
 Location: covariant [ if f(x) = y THEN f(T(x)) = T(f(x)) = T(y) ]
 Intensity: invariant to an intensity shift (I -> I+b); only partially invariant to a linear intensity scale (I -> a*I), since R grows with a while a fixed threshold does not.

Challenges with blob detection (& solution):
 Achieving scale covariance.
 Solution: find a function that returns a max response at a region size that scales with the image (small image -> small region, bigger image -> bigger region). (see plots)

What's the function used for blob detection (and why)?
 LoG (Laplacian of Gaussian)
 Taking the idea from edge detectors, edges can be represented as zero-crossing ripples (2nd derivative of a Gaussian)
 A blob can hence be represented as a superposition of two ripples (graphic). So when the ripples of the two edges of a blob superpose, the response is maximum. For that, the right scale of the Laplacian has to be found.
 Since the response of a LoG decays as sigma increases, a normalization by the scale has to be done. Since it's the second derivative, the normalization factor is sigma^2 (see graphs).
 For a blob of radius r, the optimum scale (or sigma) is at sigma* = r/sqrt(2), from the zero crossing of the LoG [1 - (x^2+y^2)/(2*sigma^2) = 0] (see graph).
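 A numerical check of the sigma* = r/sqrt(2) rule, assuming NumPy: the scale-normalized LoG is evaluated at the center of a binary disc for a range of sigmas. The grid size, the radius and the sigma range are arbitrary choices for the illustration.

```python
import numpy as np

def normalized_log(x, y, sigma):
    """Scale-normalized Laplacian of Gaussian: sigma^2 * (G_xx + G_yy)."""
    r2 = x**2 + y**2
    gauss = np.exp(-r2 / (2 * sigma**2)) / (2 * np.pi * sigma**2)
    laplacian = (r2 - 2 * sigma**2) / sigma**4 * gauss
    return sigma**2 * laplacian

radius = 10
coords = np.arange(-40, 41)
x, y = np.meshgrid(coords, coords)
disc = (x**2 + y**2 <= radius**2).astype(float)  # bright blob on a dark background

sigmas = np.arange(3.0, 15.0, 0.1)
# Filter response at the blob center for every candidate scale.
responses = [abs(np.sum(disc * normalized_log(x, y, s))) for s in sigmas]
best_sigma = float(sigmas[int(np.argmax(responses))])
expected_sigma = radius / np.sqrt(2)
```

 Up to discretization error, `best_sigma` should come out near `expected_sigma`. Without the sigma^2 factor the response would keep favoring the smallest scales.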

What's the pipeline to do the original blob detection?
 Convolve the image with the scale-normalized LoG at several scales
 Find the points where the response is a maximum across both space and scale.

A way of efficiently implementing a blob detector?
 LoG can be approximated by a DoG (difference of Gaussians), hence
 take Gaussian-blurred images at different sigmas and subtract them.
 Find the max points among the diffed images
 Scale the original image down and repeat the process of blurring, diffing and max-thresholding.
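 Why the DoG works as a stand-in for the LoG can be checked numerically. The sketch below (assuming NumPy, in 1D for brevity) compares G(k*sigma) - G(sigma) with (k - 1) * sigma^2 * G''. The values sigma = 2 and k = 1.05 are arbitrary, and the approximation gets looser as k grows.

```python
import numpy as np

def gaussian(x, sigma):
    """1D Gaussian with unit area."""
    return np.exp(-x**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

def second_derivative_of_gaussian(x, sigma):
    """1D analogue of the LoG."""
    return (x**2 - sigma**2) / sigma**4 * gaussian(x, sigma)

x = np.linspace(-10, 10, 401)
sigma, k = 2.0, 1.05

dog = gaussian(x, k * sigma) - gaussian(x, sigma)
scaled_log = (k - 1) * sigma**2 * second_derivative_of_gaussian(x, sigma)

# Largest deviation, relative to the peak of the scaled LoG.
relative_error = float(np.max(np.abs(dog - scaled_log)) / np.max(np.abs(scaled_log)))
```

 The sigma^2 factor appears on its own here, so the DoG is already (approximately) scale-normalized.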

Basic approach to match features (a.k.a. histograms)? Issues and solutions?
 Cross-correlation: CC(h1, h2) = 1/N sum{h1_i * h2_i}
 For affine transformations on the feature, namely a*I+b, CC is not invariant.
 Solution: make the features zero-mean ( mu_h = 1/N sum(h_i) -> Z_i = h_i - mu_h ) and unit variance ( Z_i / sigma_h ).
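 A small sketch of plain versus normalized cross-correlation, assuming NumPy. The function names and the sample histogram are illustrative, and the invariance only holds for a > 0.

```python
import numpy as np

def cross_correlation(h1, h2):
    """CC(h1, h2) = 1/N * sum(h1_i * h2_i)."""
    return float(np.mean(h1 * h2))

def normalize(h):
    """Zero mean and unit variance."""
    return (h - h.mean()) / h.std()

def normalized_cross_correlation(h1, h2):
    return cross_correlation(normalize(h1), normalize(h2))

h = np.array([1.0, 4.0, 2.0, 8.0, 5.0])
brighter = 3.0 * h + 10.0  # affine intensity change a*I + b with a > 0

cc_same = cross_correlation(h, h)
cc_affine = cross_correlation(h, brighter)               # changes with a and b
ncc_affine = normalized_cross_correlation(h, brighter)   # stays at 1 up to rounding
```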

Describe how to construct SIFT features:
 Take a 16x16 region, subdivide it into 4x4 cells and create a HOG for each cell.
 Quantize the HOGs to 8 orientations only and assign the main orientation to the highest bin (or bins if they are above a threshold)
 Concatenate (128 bins: 16 cells of 4x4 pixels with 8 orientation bins each)
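 A heavily simplified sketch of the 128-bin layout, assuming NumPy. It leaves out parts of real SIFT (Gaussian weighting, interpolation between bins, rotation to the main orientation), so `sift_like_descriptor` is only an illustration of how 16 cells times 8 bins are concatenated.

```python
import numpy as np

def sift_like_descriptor(patch):
    """128-D vector from a 16x16 patch: 16 cells of 4x4 pixels, 8 orientation bins each."""
    assert patch.shape == (16, 16)
    gy, gx = np.gradient(patch.astype(float))
    magnitude = np.hypot(gx, gy)
    orientation = np.arctan2(gy, gx) % (2 * np.pi)
    bins = np.floor(orientation / (2 * np.pi) * 8).astype(int) % 8
    histograms = []
    for r in range(0, 16, 4):
        for c in range(0, 16, 4):
            cell_bins = bins[r:r + 4, c:c + 4].ravel()
            cell_mag = magnitude[r:r + 4, c:c + 4].ravel()
            # Magnitude-weighted orientation histogram of one cell.
            histograms.append(np.bincount(cell_bins, weights=cell_mag, minlength=8))
    descriptor = np.concatenate(histograms)
    norm = np.linalg.norm(descriptor)
    return descriptor / norm if norm > 0 else descriptor

patch = np.tile(np.arange(16.0), (16, 1))  # horizontal intensity ramp
descriptor = sift_like_descriptor(patch)
```

 For the ramp patch every gradient points in the +x direction, so only the first orientation bin of each cell is filled.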

Main ideas regarding object category classification? (first approach to background subtraction)
 Can be part-based or global-based
 Part-based is more flexible (handles occlusion, transformations), deals with moving objects but is more complex
 Global is simpler but only works for small solutions and does detection using a binary classifier
 Classifier relies on some feature extracted from a window (intensity histogram, HOG) and a ground truth. Learns a model and takes decisions based on scoring functions for candidates (queries)

How to compute and optimize gradient histograms?
 Just the plain HOG is very high dimensional (over 4000 dimensions)
 For learning get a GT of cropped images with the objects of interest, encode them into the feature space and learn a binary classifier.
 For the detection, scan queries using a scale space and pass them to the classifier.
 Descriptor usually runs over 'cells' (e.g. 8x8 pixel windows) creating the histograms of gradients. Cells are grouped into overlapping blocks, each block is normalized and, at the end, all blocks are concatenated.
 Values on a particular cell also affect the values in the neighboring cells (trilinear interpolation: GRAPH)
 Gradient magnitudes are finally weighted by a Gaussian function (values in the center are more important).

How to simplify background subtraction (first approach)?
 Assume the camera is static: Assume the background is the image at time t_{1}, diff the current image with the background image and get a foreground mask (whatever the difference leaves above certain threshold)
 Issues with this approach are illumination changes, moving background (trees, parked cars), shaking camera...
 To get rid of noise (single pixels in the foreground mask), apply a median filter (median value of a 3x3 window)
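 A minimal sketch of this first approach, assuming NumPy and grayscale frames. The threshold of 50 and the toy frame are made-up values, and the helper names are illustrative.

```python
import numpy as np

def foreground_mask(frame, background, threshold):
    """Pixels whose absolute difference to the background exceeds the threshold."""
    diff = np.abs(frame.astype(float) - background.astype(float))
    return diff > threshold

def median_filter_3x3(mask):
    """Median of each 3x3 neighborhood (zero-padded), applied to a binary mask."""
    padded = np.pad(mask.astype(np.uint8), 1)
    windows = [padded[r:r + mask.shape[0], c:c + mask.shape[1]]
               for r in range(3) for c in range(3)]
    return np.median(np.stack(windows), axis=0) >= 0.5

background = np.zeros((10, 10), dtype=np.uint8)
frame = background.copy()
frame[3:7, 3:7] = 200   # a 4x4 foreground object
frame[8, 1] = 200       # a single noisy pixel

raw_mask = foreground_mask(frame, background, threshold=50)
clean_mask = median_filter_3x3(raw_mask)
```

 The median filter removes the isolated pixel but also rounds off the corners of the object, which is the usual price of this kind of denoising.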

How to do more advanced background subtraction? (second approach)
 Allow the background to have moving objects -> learn the background (then diff current image, threshold and denoise).
 Take N images and compute the average background image. (average value among the captured images)
 Pros: easy to implement, fast, flexible (can relearn)
 Cons: for good background estimation a lot of memory is needed (lots of images). Still depending on the threshold after diffing BG with the current image.

How to do very advanced background subtraction? (third approach)
 Learn how the background varies at each pixel (the mean and variance, intensity-wise).
 Get the current image and diff it with the learned BG (old)
 Use local information about the neighboring points with a Gaussian (if neighboring points are too different, most likely the center is different too)
 Threshold as follows: if the current pixel in an image is outside the learned background range (mean +/- 2*sigma), then it's most likely a foreground pixel.
 To do a fancier thresholding, model the pixel intensities of the background as a mixture of Gaussians (handles night and day images for example, when different means and variances occur)
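 The per-pixel mean/variance model with the 2*sigma test could look roughly like this, assuming NumPy. It uses a single Gaussian per pixel only (no neighborhood smoothing, no mixture), and `min_std` is an added assumption that keeps pixels with almost no variance from firing on tiny changes.

```python
import numpy as np

def learn_background(frames):
    """Per-pixel mean and standard deviation over a list of background frames."""
    stack = np.stack(frames).astype(float)
    return stack.mean(axis=0), stack.std(axis=0)

def classify_foreground(frame, mean, std, n_sigma=2.0, min_std=1.0):
    """Foreground where the pixel lies outside mean +/- n_sigma * std."""
    return np.abs(frame.astype(float) - mean) > n_sigma * np.maximum(std, min_std)

rng = np.random.default_rng(0)
training = [rng.normal(100.0, 5.0, (8, 8)) for _ in range(50)]  # noisy static scene
mean, std = learn_background(training)

frame = np.full((8, 8), 100.0)
frame[2:4, 2:4] = 200.0  # a 2x2 object entering the scene
mask = classify_foreground(frame, mean, std)
```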

What are the usual motion models to do tracking, given a set of previous positions?
 Depends on the model (how the object is assumed to move)
 0th order: doesn't move (x_t = x_{t-1})
 1st order: linear movement (x_t = m*x_{t-1}+b)
 2nd order: accelerated movement (...)
 Choosing the best one is rather ad hoc but choosing a bad model can screw your tracking!
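 One possible reading of the three models for a 1D position track, as a plain Python sketch. The constant-velocity form x_t = x_{t-1} + v is the special case m = 1, b = v of the linear model above, and the finite-difference estimates of velocity and acceleration are assumptions made for the example.

```python
def predict_constant_position(track):
    """0th order: the object stays where it was."""
    return track[-1]

def predict_constant_velocity(track):
    """1st order: repeat the last displacement."""
    velocity = track[-1] - track[-2]
    return track[-1] + velocity

def predict_constant_acceleration(track):
    """2nd order: the displacement itself keeps changing at a constant rate."""
    velocity = track[-1] - track[-2]
    previous_velocity = track[-2] - track[-3]
    acceleration = velocity - previous_velocity
    return track[-1] + velocity + acceleration

track = [0.0, 1.0, 4.0, 9.0]  # positions of an accelerating object (x = t^2)
p0 = predict_constant_position(track)
p1 = predict_constant_velocity(track)
p2 = predict_constant_acceleration(track)
```

 For this track only the 2nd order model lands on the true next position (16.0), while the lower-order models fall behind, which illustrates how a badly chosen model hurts tracking.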

What are some challenges when doing feature tracking?
 Finding good features
 Change of appearance (even occlusion) of points
 Drifting of tracked features.

How are features tracked nowadays? (General)
 Using the Lucas-Kanade Tracker (LKT)
 Assumptions: Brightness constancy (pts look the same after movement), small motion and spatial coherence (move like neighbors)

How does the LKT define and solve tracking?
 [1] By the brightness constancy equation we have that I(x,y,t) = I(x+u,y+v,t+1)
 Assuming a linear displacement of pixels (first-order Taylor expansion), we can say that I(x+u,y+v,t+1) ~= I(x,y,t) + I_x*u + I_y*v + I_t
 Hence I(x+u,y+v,t+1) - I(x,y,t) = I_x*u + I_y*v + I_t (and by using [1], all = 0). Written as nabla I * [u,v]^T + I_t = 0
 There's one equation with two variables so the solution would have the barber-pole (aperture) problem. The solution is to take neighboring pixels (a window) and assume they all move the same.
 That's now too many equations (overdetermined). On a 5x5 window, there are already 25*3 (RGB) equations. Instead of solving for each, solve the matrix system A d = b -> [I_x(p_0), I_y(p_0); ...; I_x(p_n), I_y(p_n)] * [u,v]^T = -[I_t(p_0); ...; I_t(p_n)] using least squares, which implies solving (A^T A) * d = A^T b (way fewer dimensions)
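 A single Lucas-Kanade step on a synthetic grayscale pair, as a sketch assuming NumPy. The sign convention follows nabla I * [u,v]^T + I_t = 0, so the right-hand side is -I_t. The test pattern, the shift (0.3, -0.2) and the 11x11 window are arbitrary choices, and `lucas_kanade_step` is an illustrative name.

```python
import numpy as np

def lucas_kanade_step(frame0, frame1, row, col, half_window=2):
    """Least-squares estimate of [u, v] for the window centered at (row, col)."""
    iy, ix = np.gradient(frame0)   # spatial derivatives along y (rows) and x (columns)
    it = frame1 - frame0           # temporal derivative
    win = (slice(row - half_window, row + half_window + 1),
           slice(col - half_window, col + half_window + 1))
    A = np.column_stack([ix[win].ravel(), iy[win].ravel()])
    b = -it[win].ravel()
    d, *_ = np.linalg.lstsq(A, b, rcond=None)  # solves (A^T A) d = A^T b
    return d

ys, xs = np.mgrid[0:40, 0:40].astype(float)
u_true, v_true = 0.3, -0.2
frame0 = np.sin(0.4 * xs) + np.sin(0.4 * ys)
# Brightness constancy: the same pattern, moved by (u_true, v_true).
frame1 = np.sin(0.4 * (xs - u_true)) + np.sin(0.4 * (ys - v_true))

u, v = lucas_kanade_step(frame0, frame1, row=20, col=20, half_window=5)
```

 The estimate is only approximate because of the linearization and the finite-difference gradients, which is why practical trackers iterate this step.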

What are good patches to track with LKT?
 A^T A is well conditioned (i.e. eigenvalue ratio is close to 1) -> Same criteria as for Harris corner detection, hence, track corners!
 Flat surfaces or highly textured areas get lost due to noise or jumpy pixels.
 The barber-pole problem gets solved since the edges are characterized by the eigenvectors (along the edge and perpendicular to it, so we know where they are going)

When does LKT fail? Solutions?
 Assumptions are not met: brightness varies, big motion, local inconsistency (no movement like neighbors)
 For big windows, pixels don't behave consistently. Solution: Reduce the scale, solve for small scale, upscale and take a smaller window.
 Tracking over lots of images: drifting. Solution: check against the first detected feature and correct accordingly.

What's optical flow and what's it good for? How is it done?
 Apparent motion of brightness patterns in an image. (apparent because moving light can create similar motion fields)
 It's a good approximation of the projection of the 3D motion of a scene in an image.
 Good to recover image motion at pixel levels.
 Implemented by a LKT at pixel levels. Once the equations are solved, interpolation is done and the process starts again.
 If the image is too big, apply the iterative LKT with the scale pyramid.

