
Tensor Voting:
Review, Applications to Computer Vision and Machine Learning

Gerard Medioni – USC
We first briefly review tensor voting, an efficient, non-iterative
framework for tackling perceptual organization problems in spaces of
arbitrary dimension. It is based on data representation by second-order
symmetric tensors, which allow a unified representation of inliers of
smooth structures, discontinuities and outliers, and data communication by
tensor voting, during which tokens propagate information in their
neighborhood by casting tensor votes. These votes convey the amount of
support of the voter for a structure (such as a curve or a hypersurface)
that goes through the voter and receiver. No parametric models are assumed
for the underlying structure and the criteria for determining whether a
structure goes through the data are proximity and good continuation. Our
framework has proven to be very robust even under extreme noise corruption,
with a single free parameter, the scale of the voting field.
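To make the voting mechanism concrete, here is a minimal 2D sketch of a single stick vote (the helper name is hypothetical; the decay function and the 45-degree angular cutoff follow the standard tensor voting formulation, not any particular implementation from the talk). Voter and receiver are joined by a circular arc, and the vote strength decays with the arc length s and curvature kappa as exp(-(s^2 + c*kappa^2)/sigma^2), where sigma is the single free scale parameter.

```python
import numpy as np

def stick_vote(voter_pos, voter_tangent, receiver_pos, sigma=1.0, c=0.1):
    """Cast a 2D stick tensor vote (illustrative sketch).

    The vote strength decays with the arc length s and curvature kappa of
    the circular arc joining voter and receiver (proximity and good
    continuation); the voted orientation is the arc's tangent at the
    receiver. Returns a 2x2 second-order symmetric tensor.
    """
    v = np.asarray(receiver_pos, float) - np.asarray(voter_pos, float)
    l = np.linalg.norm(v)
    t = np.asarray(voter_tangent, float)
    t = t / np.linalg.norm(t)
    if l == 0:
        return np.outer(t, t)
    # angle between the voter tangent and the chord to the receiver
    cos_th = np.clip(np.dot(t, v) / l, -1.0, 1.0)
    theta = np.arccos(cos_th)
    if theta > np.pi / 4:            # beyond 45 degrees: no vote cast
        return np.zeros((2, 2))
    if np.isclose(theta, 0.0):       # receiver straight ahead: straight arc
        s, kappa, t_recv = l, 0.0, t
    else:
        s = theta * l / np.sin(theta)        # arc length of the circle
        kappa = 2.0 * np.sin(theta) / l      # curvature of the circle
        # tangent at the receiver: voter tangent rotated by 2*theta (signed)
        sign = np.sign(t[0] * v[1] - t[1] * v[0])
        a = 2.0 * theta * sign
        R = np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])
        t_recv = R @ t
    strength = np.exp(-(s**2 + c * kappa**2) / sigma**2)
    return strength * np.outer(t_recv, t_recv)
```

A receiver lying straight ahead along the voter's tangent receives a strong vote with the voter's orientation; a receiver more than 45 degrees off the tangent receives none.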
The second part of the talk focuses on the application of tensor voting to
real computer vision problems. Since many computer vision problems, such as
stereo and motion analysis, can be expressed as the inference of smooth
structures, they can be addressed within a perceptual organization
framework. For instance, potential pixel correspondences generate tokens in
3D and 4D for stereo and motion, respectively. In these spaces, correct
matches should form salient, coherent structures that correspond to the
scene objects, while wrong matches do not align as well as the correct ones
and can be eliminated.
Finally, we show how tensor voting can be applied to problems in higher
dimensions, while keeping the computational complexity at reasonable
levels. Since the tensors can represent all possible structure types, which
range from junctions to hypervolumes, multiple structures of different
dimensionality can be inferred at the same time and interact with each
other. Since all processing is local, computational complexity depends on
the number of neighbors of each input point and remains manageable even for
very large numbers of inputs in high dimensions. Therefore, tensor voting
could be an alternative to methods such as Locally Linear Embedding and
Isomap, which are state-of-the-art algorithms in machine learning.

Combining Holistic
and Local Representations using Kernels over Sets

Amnon Shashua and Tamir Hazan – Hebrew U.

In the area of learning from observations there are two
main paths that are often mutually exclusive: (i) the design of learning
algorithms, and (ii) the data representation scheme. The algorithm designers
take pride in the fact that their algorithm can generalize well given
straightforward data representations, whereas
those who work on data representations often demonstrate remarkable
results with sophisticated data representations using only straightforward
learning algorithms. This dichotomy is probably most pronounced in the area
of computer vision, where image understanding from observations involves data
instances of images or image sequences containing huge amounts of data. Our
work is about bridging the gap between algorithms and representations. The
key is to allow advanced algorithms (which typically require metric structure
on the instance space) to work with advanced data representations (which are
often not easily embedded into a metric space).
I will present a general family of algebraic positive
definite similarity functions over spaces of matrices with varying column
rank. The columns can represent local regions in an image (whereby images
have a varying number of local parts), images of an image sequence, motion
trajectories in a multi-body motion, and so forth. The family of similarity measures will be
shown to be exhaustive, thus providing a cookbook of sorts covering the
possible "wish lists" from similarity measures over sets of varying
cardinality.
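To give a flavor of similarity measures over sets of varying cardinality, here is one simple, positive semi-definite choice (an illustrative example, not necessarily a member of the exhaustive family presented in the talk): compare the subspaces spanned by the two column sets through the cosines of their principal angles.

```python
import numpy as np

def subspace_kernel(A, B):
    """Similarity between two sets of column vectors, possibly of
    different cardinality: orthonormalize each set and sum the squared
    cosines of the principal angles between the spanned subspaces
    (the projection kernel). Illustrative sketch only."""
    Qa, _ = np.linalg.qr(A)              # orthonormal basis for span(A)
    Qb, _ = np.linalg.qr(B)              # orthonormal basis for span(B)
    s = np.linalg.svd(Qa.T @ Qb, compute_uv=False)  # cosines of angles
    return float(np.sum(s**2))
```

Identical subspaces score the subspace dimension; orthogonal subspaces score zero, regardless of how many columns each matrix has.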

Solving geometric PDEs
on manifolds

Alon Spira and Ron Kimmel – Technion

In this talk we present numerical schemes for implementing
geometric flows of curves and images on manifolds. We consider a 2D
parameterization plane that is mapped to an N-dimensional space. Our approach
in devising the schemes is to implement them on the uniform Cartesian grid of
the parameterization plane instead of doing so in the N-dimensional space.
This enhances the efficiency and robustness of the resulting numerical
schemes.
The first numerical scheme is an efficient solution to
the eikonal equation on parametric manifolds. The scheme is based on Kimmel
and Sethian's solution for triangulated manifolds, but uses the metric tensor
of the parametric manifold in order to implement the scheme on the
parameterization plane. The scheme is used to devise a short time kernel for
the Beltrami image enhancing flow. The kernel enables an arbitrary time step
for the flow for regular images as well as images painted on manifolds, such
as face images. The numerical scheme is further used for face recognition by
constructing an invariant face signature from distances calculated on the
face manifold.
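To illustrate the idea of computing distances on the parameterization grid rather than in the N-dimensional space, here is a much cruder stand-in for the fast-marching eikonal solver (a plain Dijkstra sweep with edge lengths measured in the embedding; it ignores the metric-tensor update stencil of the actual scheme and is only a rough approximation):

```python
import heapq
import numpy as np

def grid_geodesic_distances(X, source):
    """Approximate geodesic distances from `source` on a parametric
    manifold sampled on a uniform grid. X has shape (H, W, N): the
    N-dimensional embedding of each grid point. Dijkstra with
    8-neighbor edges whose lengths are embedding-space distances --
    a simpler, less accurate stand-in for a fast-marching solver."""
    H, W, _ = X.shape
    dist = np.full((H, W), np.inf)
    dist[source] = 0.0
    heap = [(0.0, source)]
    nbrs = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
            (0, 1), (1, -1), (1, 0), (1, 1)]
    while heap:
        d, (i, j) = heapq.heappop(heap)
        if d > dist[i, j]:
            continue                      # stale heap entry
        for di, dj in nbrs:
            a, b = i + di, j + dj
            if 0 <= a < H and 0 <= b < W:
                nd = d + np.linalg.norm(X[a, b] - X[i, j])
                if nd < dist[a, b]:
                    dist[a, b] = nd
                    heapq.heappush(heap, (nd, (a, b)))
    return dist
```

On a flat embedded plane this reduces to grid distances along axis and diagonal moves; on a curved manifold (e.g. a face surface) the same code yields the distance approximations used for invariant signatures.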
Another numerical scheme implements curve evolution by
geodesic curvature flow on parametric manifolds. The flow is implemented by
back projecting the curve from the manifold to the parameterization plane,
calculating the flow on the plane by the level sets method and then mapping
it back to the manifold. Combining this flow with geodesic constant flow
enables the implementation of geodesic active contours for images painted on
parametric manifolds.

Multiscale
Segmentation by Combining Motion and Intensity Cues

Meirav Galun, Alexander Apartsin and Ronen Basri – Weizmann

Motion provides a strong cue for segmentation. In this
talk we present a multiscale method for motion segmentation. Our method
begins with local, ambiguous optical flow measurements. It uses a process of
aggregation to resolve the ambiguities and reach reliable estimates of the
motion. In addition, as the process of aggregation proceeds and larger
aggregates are identified, the method employs a progressively more complex
model to describe the motion. In particular, we proceed from translational
motion at fine levels, through affine transformations at intermediate levels,
to 3D motion (described by a fundamental matrix) at the coarsest levels.
Finally, the method is integrated with a segmentation method that uses
intensity cues. We further demonstrate the utility of the method on both
random dot and real motion sequences.

The Canonical
Correlations of Color Images

Yacov Hel-Or – IDC

Over the last decade or so a lot of effort has been
invested in an attempt to study the underlying statistics of natural images.
Most of this effort, however, dealt with grayscale images, and quite a
number of studies attempted to model the *spatial dependencies* existing
between pixel values. Although impressive results have been achieved in a
variety of problems by applying prior models on grayscale images, only a few
studies have dealt with prior models on color images. In the latter there is
a need to characterize spatial as well as *spectral* (color) dependencies.
In this talk I will suggest a new approach that exploits
the spectral dependencies in color images using Canonical Correlation
Analysis (CCA). I will show how this statistical inference can help
solve inverse problems in general and the demosaicing problem in particular.
Interestingly, the resulting statistical inference, derived solely
from the statistical properties of natural images, can also be
derived independently from the characteristics of the human visual system.
This suggests that the human visual system has adapted itself to the
statistical properties of natural color images, and that the proposed
approach is based on a reliable statistical model.
This work was conducted at HP Labs.
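For reference, the CCA computation itself is standard: center and whiten each set of variables and take the SVD of the whitened cross-covariance (a textbook sketch; in the talk, the two variable sets would be, e.g., spatial patches drawn from different color channels of natural images):

```python
import numpy as np

def cca(X, Y):
    """Canonical Correlation Analysis. X: (n, p), Y: (n, q) samples.
    Returns the canonical correlations, the singular values of
    Cxx^{-1/2} Cxy Cyy^{-1/2}. A small ridge keeps the whitening stable."""
    X = X - X.mean(0)
    Y = Y - Y.mean(0)
    n = len(X)
    Cxx = X.T @ X / n + 1e-8 * np.eye(X.shape[1])
    Cyy = Y.T @ Y / n + 1e-8 * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n

    def inv_sqrt(C):
        # inverse matrix square root via eigendecomposition (C is SPD)
        w, V = np.linalg.eigh(C)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    M = inv_sqrt(Cxx) @ Cxy @ inv_sqrt(Cyy)
    return np.linalg.svd(M, compute_uv=False)
```

When one variable set is an exact linear function of the other, all canonical correlations are (numerically) one; spectral dependencies between color channels show up as correlations well above those of independent channels.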

An algorithm based
on Biological Gain control for
High Dynamic Range Compression

Hedva Spitzer – Tel-Aviv U.

The visual system has the ability to see and extract detailed
information from high dynamic range scenes. For example, a person can, in a
single glance, observe items both inside a dim room and outside through a
window. We present an algorithm for high dynamic
range compression that can be applied to both still and video images.
The algorithm is based on a biological model that has also been suggested
to account for wide dynamic range vision and lightness
constancy. It succeeds in automatically compressing
the dynamic range of images to a 'human vision' appearance (as is commonly
required in cameras and displays) while maintaining, and even improving,
contrast. The biological basis is retinal mechanisms of adaptation (gain
control): 'local' and 'remote'. These mechanisms enable video
applications, since they take into account the dynamics of human adaptation.
The results indicate that the contribution of the adaptation
mechanisms to image appearance is significant and robust, and fits
next-generation high dynamic range (CMOS-based) cameras.
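The gain-control idea can be sketched as follows (a generic, simplified stand-in rather than the talk's actual model): divide each pixel by itself plus a semi-saturation term that adapts to a local mean, which compresses bright regions while preserving local contrast.

```python
import numpy as np

def compress_dynamic_range(I, k=3, sigma_scale=1.0):
    """Toy retinal-style gain control: R = I / (I + sigma_local), with the
    semi-saturation term sigma_local following a local mean ('local' and
    'remote' adaptation collapsed into one box blur). Maps arbitrary
    intensities into [0, 1); uniform regions map to ~0.5."""
    I = np.asarray(I, float)
    pad = k // 2
    P = np.pad(I, pad, mode='edge')
    kernel = np.ones(k) / k
    # separable box blur as the local adaptation signal
    local = np.apply_along_axis(
        lambda r: np.convolve(r, kernel, mode='valid'), 1, P)
    local = np.apply_along_axis(
        lambda col: np.convolve(col, kernel, mode='valid'), 0, local)
    sigma_local = sigma_scale * local
    return I / (I + sigma_local + 1e-12)
```

Because sigma_local tracks the surround, a pixel that is bright relative to its neighborhood keeps a high response even in an overall dim region, which is the lightness-constancy behavior the biological model aims for.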

Dynamosaics:
Dynamic Mosaics with Non-Chronological Time

Alex Rav-Acha and Shmuel Peleg – Hebrew U.

With the limited field of view of human vision, our
perception of most scenes is built over time while our eyes are scanning the
scenes. In the case of static scenes this process can be modeled by panoramic
mosaicing: stitching together images into a panoramic view. Can a dynamic
scene, scanned by a video camera, be represented with a dynamic panoramic
video?
When a video camera is scanning a dynamic scene,
different regions are visible at different times. The chronological time when
a region becomes visible in the input video is not part of the scene
dynamics, and may be ignored. Only the "local time" during the visibility
period of each region is relevant for the dynamics of the scene, and should
be used for building the dynamic mosaics.
We use the space-time volume, in which 2D image frames are
stacked along the time axis to form a 3D volume, as a basic representation
that enables the creation of dynamic mosaics.
Various 2D slices of the space-time volume can manipulate the
chronological time and generate panoramic movies. The chronological time can
even be reversed without affecting the local time. For example, given a video
camera scanning a waterfall from left to right, we can generate a video
scanning the falls from right to left; yet, unlike a reversal of the video
sequence, the water will still flow down!
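The slicing of the space-time volume can be sketched as follows (a hypothetical helper illustrating the idea): output frame t takes column x from input frame t + round(slope * x), so a nonzero slope tilts the time front across the image, and a negative slope reverses the chronological scanning direction while each column keeps its own local time.

```python
import numpy as np

def slanted_slice(volume, slope):
    """Cut slanted time-front slices out of a space-time volume (T, H, W).
    Output frame t takes column x from input frame t + round(slope * x).
    slope = 0 reproduces the input; only the fully covered time range
    is kept, so the output may have fewer frames than the input."""
    T, H, W = volume.shape
    shifts = np.round(slope * np.arange(W)).astype(int)
    base = -shifts.min()
    T_out = T - (shifts.max() - shifts.min())
    out = np.empty((T_out, H, W), volume.dtype)
    for x in range(W):
        for t in range(T_out):
            out[t, :, x] = volume[t + base + shifts[x], :, x]
    return out
```

With a panning camera, choosing the slope against the pan direction yields a mosaic video that scans the scene in the reverse order while the in-scene dynamics (the falling water) remain forward in local time.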

Pixels
Correlated to Sound

Einat Kidron, Yoav Schechner, and Michael Elad – Technion

People and animals fuse auditory and visual information
to obtain robust perception. A particular benefit of such cross-modality
analysis is the ability to localize visual events associated with sound
sources. We are interested in a computer-vision approach that localizes the
image pixels associated with sound, aided by a single microphone. Past
efforts encountered problems stemming from the huge gap between the
dimensions involved and the available data. This has led to solutions
suffering from low spatio-temporal resolutions. We present a rigorous
analysis of the fundamental problems associated with audio-visual
localization. We then present a stable and robust algorithm which overcomes
past deficiencies. It grasps dynamic events with high spatial resolution, and
derives a unique and stable result. It exploits the fact that such events are
typically spatially sparse. The algorithm is simple and efficient thanks to
its reliance on linear programming. The formulation is convex and free of
user-defined parameters. Its capabilities are demonstrated in experiments,
where the algorithm overcomes substantial visual distractions and audio
noise.
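The linear-programming core of such a sparse recovery step can be sketched as basis pursuit (an illustrative formulation under the sparsity assumption, not necessarily the authors' exact program): minimize the L1 norm of the pixel-activity vector subject to agreement with the audio-derived measurements, using the standard split of the signed unknown into two nonnegative parts.

```python
import numpy as np
from scipy.optimize import linprog

def sparse_localization(A, b):
    """Recover a sparse activity vector w from measurements b = A @ w by
    L1 minimization:  min ||w||_1  s.t.  A w = b.
    Splitting w = u - v with u, v >= 0 gives a standard-form LP,
    which is convex and has no user-defined parameters."""
    m, n = A.shape
    c = np.ones(2 * n)                       # objective: sum(u) + sum(v)
    A_eq = np.hstack([A, -A])                # A(u - v) = b
    res = linprog(c, A_eq=A_eq, b_eq=b,
                  bounds=[(0, None)] * (2 * n), method='highs')
    uv = res.x
    return uv[:n] - uv[n:]
```

When the true activity is spatially sparse (few pixels genuinely move with the sound), far fewer measurements than pixels suffice for exact recovery, which is what lifts the spatial resolution past earlier dimensionality-limited approaches.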

Space-Time Video
Completion

Yonatan Wexler, Eli Shechtman and Michal Irani – Weizmann

We present a method for space-time completion of large
space-time holes in video sequences of complex dynamic scenes. The missing
portions are filled in by sampling spatio-temporal patches from the available
parts of the video, while enforcing global spatio-temporal consistency
between all patches in and around the hole. This is obtained by posing the
task of video completion and synthesis as a global optimization problem with
a well-defined objective function. The consistent completion of static scene
parts simultaneously with dynamic behaviors leads to realistic-looking video
sequences. Space-time video completion is useful for a variety of tasks,
including, but not limited to:
(i) sophisticated video removal (of undesired static or
dynamic objects) by completing the appropriate static or dynamic background
information;
(ii) correction of missing or corrupted video frames in old
movies;
(iii) synthesis of new video frames to add a visual
story, modify it, or generate a new one.
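A greatly simplified, spatial-only toy version conveys the "sample patches from the available parts" ingredient (function and interface are hypothetical; the actual method instead enforces global spatio-temporal consistency over 3D patches via a well-defined objective):

```python
import numpy as np

def fill_hole_by_patch(img, mask, p=3):
    """Fill each masked pixel with the center of the fully known p-by-p
    patch whose overlap with the pixel's known surroundings has the
    lowest mean SSD. Brute force, 2D only -- a toy sketch of
    exemplar-based completion, not the talk's global optimization."""
    out = img.astype(float).copy()
    H, W = out.shape
    r = p // 2
    known = ~mask
    for i, j in np.argwhere(mask):
        best_ssd, best_val = np.inf, out[i, j]
        for a in range(r, H - r):
            for b in range(r, W - r):
                if not known[a - r:a + r + 1, b - r:b + r + 1].all():
                    continue          # source patches must be fully known
                ssd, cnt = 0.0, 0
                for di in range(-r, r + 1):
                    for dj in range(-r, r + 1):
                        y, x = i + di, j + dj
                        if 0 <= y < H and 0 <= x < W and known[y, x]:
                            d = out[y, x] - out[a + di, b + dj]
                            ssd += d * d
                            cnt += 1
                if cnt and ssd / cnt < best_ssd:
                    best_ssd, best_val = ssd / cnt, out[a, b]
        out[i, j] = best_val
    return out
```

In video, the same matching runs over space-time patches, and the global objective ties all overlapping completions together so static parts and dynamic behaviors stay mutually consistent.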


Dynamic
Visual Search Using Inner-Scene Similarity:
Algorithms and Inherent
Limitations

Tamar Avraham and Micha Lindenbaum – Technion

A dynamic visual search framework based mainly on
inner-scene similarity is proposed. Algorithms, as well as measures
quantifying the difficulty of search tasks, are suggested.
Given a number of candidates (e.g. subimages), our basic
hypothesis is that more visually similar candidates are more likely to have
the same identity. Both deterministic and stochastic approaches, relying on
this hypothesis, are used to quantify this intuition.
Under the deterministic approach, we suggest a measure
similar to Kolmogorov's $\epsilon$-covering that quantifies the difficulty of
a search task and bounds the performance of all search algorithms. We also
suggest a simple algorithm that meets this bound.
Under the stochastic approach, we model the identities of
the candidates as correlated random variables and characterize the task using
its second order statistics. We derive a search procedure based on minimum
MSE linear estimation. Simple
extensions enable the algorithm to use topdown and/or bottomup information,
when available. Both approaches are
evaluated experimentally.
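The minimum-MSE linear estimation step can be sketched as follows (hypothetical interface; identities are modeled as zero-mean correlated random variables whose covariance would be built from inner-scene visual similarity):

```python
import numpy as np

def lmmse_priority(C, observed_idx, observed_vals):
    """Minimum-MSE linear estimate of the unchecked candidates' identities
    from those already checked: x_u = C_uo C_oo^{-1} x_o (zero-mean model).
    C: candidates' covariance matrix. Returns estimates for all
    candidates; the unchecked one with the highest estimate is the
    natural candidate to examine next."""
    n = C.shape[0]
    o = np.asarray(observed_idx)
    u = np.setdiff1d(np.arange(n), o)
    est = np.zeros(n)
    est[o] = observed_vals
    Coo = C[np.ix_(o, o)]
    Cuo = C[np.ix_(u, o)]
    est[u] = Cuo @ np.linalg.solve(Coo, np.asarray(observed_vals, float))
    return est
```

A candidate strongly correlated with one already confirmed as a target inherits a high estimate, which is exactly the "similar candidates likely share identity" hypothesis turned into a search order.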

Height from
moving shadows

Yaron Caspi and Mike Werman – Hebrew U.

Plane + parallax has been touted as an excellent
representation for 3D reconstruction. Several ways to recover 3D parallax
have been proposed in the past, most of them relying on point
matches. In this talk we describe how shadows or
light stripes may be used to compute
a plane + parallax representation, where the 3D parallax refers to the
height from the ground plane. The
method is based on analyzing shadows of vertical poles (e.g., a tall
building's contour) that sweep the object twice.
Existing beam scanning approaches (shadow or light
stripes) will be reviewed, and the differences and similarities with the
proposed method will be discussed. We
show that, in contrast to existing methods, which recover the distance of a
point from the camera, our approach measures the height from the ground plane
directly. This is particularly useful
when the camera cannot face the scene orthogonally and the object is very
far from the camera.
