Israel Computer Vision Day

2004 Israeli Computer Vision Day
Sunday, December 12, 2004

The Efi Arazi School of Computer Science

I.D.C. Herzliya

Sponsored by the Israeli Ministry of Science and Technology

Time	Speaker and Collaborators	Affiliation	Title
09:00-09:30	Gathering
09:30-09:45	Opening – Ministry of Science
09:45-10:30	Guest Lecture: Prof. Gerard Medioni	Chairman - Comp. Science Department - U.S.C.	Tensor Voting: Review, Applications to Computer Vision and Machine Learning
10:30-11:00	Amnon Shashua, Tamir Hazan	HUJI	Combining Holistic and Local Representations using Kernels over Sets
11:00-11:30	Coffee Break
11:30-12:00	Alon Spira, Ron Kimmel	Technion	Solving geometric PDEs on manifolds
12:00-12:30	Meirav Galun, Alexander Apartsin, Ronen Basri	Weizmann	Multiscale Segmentation by Combining Motion and Intensity Cues
12:30-13:00	Yacov Hel-Or	IDC	The Canonical Correlations of Color Images
13:00-14:00	Lunch break
14:00-14:30	Hedva Spitzer	TAU	An algorithm based on Biological Gain control for High Dynamic Range Compression
14:30-15:00	Alex Rav-Acha, Shmuel Peleg	HUJI	Dynamosaics: Dynamic Mosaics with Non-Chronological Time
15:00-15:30	Yoav Schechner, Michael Elad, Einat Kidron	Technion	Pixels correlated to Sound
15:30-16:00	Coffee Break
16:00-16:30	Yonatan Wexler, Eli Shechtman, Michal Irani	Weizmann	Space-Time Video Completion
16:30-17:00	Tamar Avraham, Micha Lindenbaum	Technion	Dynamic Visual Search Using Inner-Scene Similarity: Algorithms and Inherent Limitations
17:00-17:30	Yaron Caspi, Mike Werman	HUJI	Height from moving shadows

General: This is the second Israeli Computer Vision Day. It will be hosted at IDC.

We hope it will be an academically fruitful and pleasant conference.

Location and Directions: The Vision Day will take place at the Interdisciplinary Center (IDC), Herzliya, in the Ivtzer Auditorium. For driving instructions see map.

A convenient way to arrive is by train; see the time schedule here. Get off at the Herzliya Station and order a taxi by phone. Two taxi stations provide this service: Moniyot Av-Yam (09 9501263 or 09 9563111) and Moniyot Pituach (09 9582288 or 09 9588001). The fare for a taxi ride from the railway station to IDC is around 20 NIS.

Last year’s Vision Day Web Page can be found here.

Abstracts

Tensor Voting: Review, Applications to Computer Vision and Machine Learning

Gerard Medioni – U.S.C.

We first briefly review tensor voting, which is an efficient, non-iterative
framework for tackling perceptual organization problems in arbitrary
dimension spaces. It is based on data representation by second-order
symmetric tensors, which allow a unified representation of inliers of
smooth structures, discontinuities and outliers, and data communication by
tensor voting, during which tokens propagate information in their
neighborhood by casting tensor votes. These votes convey the amount of
support of the voter for a structure (such as a curve or a hyper-surface)
that goes through the voter and receiver. No parametric models are assumed
for the underlying structure and the criteria for determining whether a
structure goes through the data are proximity and good continuation. Our
framework has proven to be very robust even under extreme noise corruption,
with a single free parameter, the scale of the voting field.
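
As a rough illustration of how proximity and good continuation can be folded into a single vote strength, the sketch below uses the decay function commonly quoted in the tensor voting literature for 2-D stick votes, exp(-(s^2 + c*kappa^2) / sigma^2), where s is the arc length and kappa the curvature of the circular arc joining voter and receiver. The abstract does not state this formula; the function name, the weight `c`, and the 45-degree cutoff are assumptions made for the example.

```python
import math


def stick_vote_strength(dx, dy, sigma, c=1.0):
    """Strength of the vote cast by a voter at the origin whose tangent lies
    along the x-axis, as received at offset (dx, dy).

    sigma is the scale of the voting field; c (an assumed weight) trades off
    proximity against curvature.
    """
    dist = math.hypot(dx, dy)
    if dist == 0.0:
        return 1.0  # voter and receiver coincide
    # Angle between the voter's tangent and the line to the receiver.
    theta = math.atan2(abs(dy), abs(dx))
    if theta > math.pi / 4:
        return 0.0  # assumed cutoff: no vote beyond 45 degrees
    if theta == 0.0:
        arc_len, curvature = dist, 0.0  # straight continuation
    else:
        # Circular arc tangent to the voter that passes through the receiver.
        arc_len = theta * dist / math.sin(theta)
        curvature = 2.0 * math.sin(theta) / dist
    return math.exp(-(arc_len ** 2 + c * curvature ** 2) / sigma ** 2)
```

With this form, a receiver straight ahead of the voter gets a stronger vote than one at the same distance off the tangent, and increasing `sigma` widens the neighborhood that receives meaningful support.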

The second part of the talk focuses on the application of tensor voting to
real computer vision problems. Since many computer vision problems, such as
stereo and motion analysis, can be expressed as the inference of smooth
structures, they can be addressed within a perceptual organization
framework. For instance, potential pixel correspondences generate tokens in
3- and 4-D for stereo and motion respectively. In that space, correct
matches should form salient, coherent structures that correspond to the
scene objects, while wrong matches do not align as well as the correct ones
and can be eliminated.

Finally, we show how tensor voting can be applied to problems in higher
dimensions, while keeping the computational complexity at reasonable
levels. Since the tensors can represent all possible structure types, which
range from junctions to hyper-volumes, multiple structures of different
dimensionality can be inferred at the same time and interact with each
other. Since all processing is local, computational complexity depends on
the number of neighbors of each input point and remains manageable even for
very large numbers of inputs in high dimensions. Therefore, tensor voting
could be an alternative to methods such as Locally Linear Embedding and
Isomap, which are state-of-the-art algorithms in machine learning.

Combining Holistic and Local Representations using Kernels over Sets

Amnon Shashua and Tamir Hazan – Hebrew U.

In the area of learning from observations there are two main paths that are often mutually exclusive: (i) the design of learning algorithms, and (ii) the data representation scheme. The algorithm designers take pride in the fact that their algorithm can generalize well given straightforward data representations, whereas those who work on data representations often demonstrate remarkable results with sophisticated data representations using only straightforward learning algorithms. This dichotomy is probably most emphasized in the area of computer vision, where image understanding from observations involves data instances of images or image sequences containing huge amounts of data. Our work is about bridging the gap between algorithms and representations. The key is to allow advanced algorithms (which typically require metric structure on the instance space) to work with advanced data representations (which are often not easily embedded into a metric space).

I will present a general family of algebraic positive definite similarity functions over spaces of matrices with varying column rank. The columns can represent local regions in an image (whereby images have a varying number of local parts), images of an image sequence, motion trajectories in a multibody motion, and so forth. The family of similarity measures will be shown to be exhaustive, thus providing a cook-book of sorts covering the possible "wish lists" from similarity measures over sets of varying cardinality.

Solving geometric PDEs on manifolds

Alon Spira and Ron Kimmel – Technion

In this talk we present numerical schemes for implementing geometric flows of curves and images on manifolds. We consider a 2D parameterization plane that is mapped to an N-dimensional space. Our approach in devising the schemes is to implement them on the uniform Cartesian grid of the parameterization plane instead of doing so in the N-dimensional space. This enhances the efficiency and robustness of the resulting numerical schemes.

The first numerical scheme is an efficient solution to the eikonal equation on parametric manifolds. The scheme is based on Kimmel and Sethian's solution for triangulated manifolds, but uses the metric tensor of the parametric manifold in order to implement the scheme on the parameterization plane. The scheme is used to devise a short time kernel for the Beltrami image enhancing flow. The kernel enables an arbitrary time step for the flow for regular images as well as images painted on manifolds, such as face images. The numerical scheme is further used for face recognition by constructing an invariant face signature from distances calculated on the face manifold.

Another numerical scheme implements curve evolution by geodesic curvature flow on parametric manifolds. The flow is implemented by back projecting the curve from the manifold to the parameterization plane, calculating the flow on the plane by the level sets method and then mapping it back to the manifold. Combining this flow with geodesic constant flow enables the implementation of geodesic active contours for images painted on parametric manifolds.

Multiscale Segmentation by Combining Motion and Intensity Cues

Meirav Galun, Alexander Apartsin and Ronen Basri – Weizmann

Motion provides a strong cue for segmentation. In this talk we present a multiscale method for motion segmentation. Our method begins with local, ambiguous optical flow measurements. It uses a process of aggregation to resolve the ambiguities and reach reliable estimates of the motion. In addition, as the process of aggregation proceeds and larger aggregates are identified it employs a progressively more complex model to describe the motion. In particular, we proceed by recovering translational motion at fine levels, through affine transformation at intermediate levels, to 3D motion (described by a fundamental matrix) at the coarsest levels. Finally, the method is integrated with a segmentation method that uses intensity cues. We further demonstrate the utility of the method on both random dot and real motion sequences.

The Canonical Correlations of Color Images

Yacov Hel-Or – IDC

Over the last decade or so a lot of effort has been invested in an attempt to study the underlying statistics of natural images. Most of this effort, however, dealt with gray-scale images, and quite a number of studies attempted to model the *spatial dependencies* existing between pixel values. Although impressive results have been achieved in a variety of problems by applying prior models on gray-scale images, only a few studies have dealt with prior models on color images. In the latter there is a need to characterize spatial as well as *spectral* (color) dependencies.

In this talk I will suggest a new approach that exploits the spectral dependencies in color images using the Canonical Correlation Analysis (CCA). I will show how this statistical inference can help solve Inverse Problems in general and the Demosaicing problem in particular.

Interestingly, the resulting statistical inference, which is derived solely from the statistical properties of natural images, can also be derived independently from the characteristics of the human visual system. This suggests that the human visual system has adapted itself to the statistical properties of natural color images, and that the proposed approach is based on a reliable statistical model.

This work was conducted at HP labs.

An algorithm based on Biological Gain control for High Dynamic Range Compression

Hedva Spitzer – Tel-Aviv U.

The visual system is able to see and extract detailed information from a high dynamic range scene. For example, a person in a dim room can, in a single view, observe items both inside the room and outside through a window. We present an algorithm for high dynamic range compression that can be applied to still and video images. The algorithm is based on a biological model that has also been suggested for wide dynamic range and lightness constancy. It automatically compresses the dynamic range of images to a ‘human vision’ appearance (as is commonly required in cameras and displays) while maintaining contrast and even improving it. The biological basis is the retinal mechanisms of adaptation (gain control): ‘local’ and ‘remote’. These mechanisms enable video image applications, since they take into account the dynamics of human adaptation mechanisms. The results indicate that the contribution of adaptation mechanisms to image appearance is significant and robust, and the approach has been shown to fit next-generation high dynamic range (CMOS-based) cameras.

Dynamosaics: Dynamic Mosaics with Non-Chronological Time

Alex Rav-Acha and Shmuel Peleg – Hebrew U.

With the limited field of view of human vision, our perception of most scenes is built over time while our eyes are scanning the scenes. In the case of static scenes this process can be modeled by panoramic mosaicing: stitching together images into a panoramic view. Can a dynamic scene, scanned by a video camera, be represented with a dynamic panoramic video?

When a video camera is scanning a dynamic scene, different regions are visible at different times. The chronological time when a region becomes visible in the input video is not part of the scene dynamics, and may be ignored. Only the "local time" during the visibility period of each region is relevant for the dynamics of the scene, and should be used for building the dynamic mosaics.

We use the space-time volume, in which 2D image frames are stacked along the time axis to form a 3D volume, as the basic representation for creating dynamic mosaics. Various 2D slices of the space-time volume can manipulate the chronological time and generate panoramic movies. The chronological time can even be reversed without affecting the local time. For example, given a video camera scanning a waterfall from left to right, we can generate a video scanning the falls from right to left; unlike a simple reversal of the video sequence, the water will still flow down!
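
The idea of slicing a space-time volume can be sketched on toy data. The code below is an illustrative assumption and not the authors' implementation: it stores an already aligned video as `volume[t][y][x]`, cuts a slice by choosing one chronological time per column, and builds a panoramic movie in which every column advances in its own local time. All function names, and the `visible_from` map giving the first frame in which a column is seen, are hypothetical.

```python
def make_toy_volume(num_frames, height, width):
    """Space-time volume indexed as volume[t][y][x].

    Each sample stores its own (t, y, x) so that slices are easy to inspect.
    """
    return [[[(t, y, x) for x in range(width)]
             for y in range(height)]
            for t in range(num_frames)]


def slice_volume(volume, time_of_column):
    """Cut a 2D slice: column x is taken from frame time_of_column(x)."""
    height = len(volume[0])
    width = len(volume[0][0])
    return [[volume[time_of_column(x)][y][x] for x in range(width)]
            for y in range(height)]


def panoramic_movie(volume, visible_from, duration):
    """Frame k shows every column x at local time k, that is, at
    chronological time visible_from(x) + k."""
    return [slice_volume(volume, lambda x, k=k: visible_from(x) + k)
            for k in range(duration)]


# Toy left-to-right scan: column x first becomes visible in frame x.
volume = make_toy_volume(num_frames=6, height=2, width=4)
movie = panoramic_movie(volume, visible_from=lambda x: x, duration=3)
```

In each frame of `movie`, the columns come from different chronological times, yet all of them share the same local time `k`, which is the quantity the abstract identifies as relevant to the scene dynamics.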

Pixels Correlated to Sound

Einat Kidron, Yoav Schechner, and Michael Elad – Technion

We present a computer-vision approach for localizing image pixels that are associated with sound. This task is prompted by evidence that fusion of auditory and visual information is exploited by people and animals for enhancing perception. We present a rigorous analysis of the fundamental problems associated with this task. Ignoring those problems leads to solutions that suffer from low spatio-temporal resolution. We thus present a stable and robust algorithm which overcomes these problems. It detects audio-visual dynamic events with high spatial resolution. Moreover, it is simple and efficient, relying on linear programming, and it does not require tweaking of user-defined parameters. We demonstrate the capabilities of our algorithm in experiments, where it overcomes significant visual and auditory distractions.

Space-Time Video Completion

Yonatan Wexler, Eli Shechtman and Michal Irani – Weizmann

We present a method for space-time completion of large space-time holes in video sequences of complex dynamic scenes. The missing portions are filled in by sampling spatio-temporal patches from the available parts of the video, while enforcing global spatio-temporal consistency between all patches in and around the hole. This is obtained by posing the task of video completion and synthesis as a global optimization problem with a well-defined objective function. The consistent completion of static scene parts simultaneously with dynamic behaviors leads to realistic-looking video sequences. Space-time video completion is useful for a variety of tasks, including, but not limited to:

(i) Sophisticated video removal (of undesired static or dynamic objects) by completing the appropriate static or dynamic background information

(ii) Correction of missing/corrupted video frames in old movies

(iii) Synthesis of new video frames to add a visual story, modify it, or generate a new one.

Dynamic Visual Search Using Inner-Scene Similarity: Algorithms and Inherent Limitations

Tamar Avraham and Micha Lindenbaum – Technion

A dynamic visual search framework based mainly on inner-scene similarity is proposed. Algorithms as well as measures quantifying the difficulty of search tasks are suggested.

Given a number of candidates (e.g. sub-images), our basic hypothesis is that more visually similar candidates are more likely to have the same identity. Both deterministic and stochastic approaches, relying on this hypothesis, are used to quantify this intuition.

Under the deterministic approach, we suggest a measure similar to Kolmogorov's $\epsilon$-covering that quantifies the difficulty of a search task and bounds the performance of all search algorithms. We also suggest a simple algorithm that meets this bound.

Under the stochastic approach, we model the identities of the candidates as correlated random variables and characterize the task using its second order statistics. We derive a search procedure based on minimum MSE linear estimation. Simple extensions enable the algorithm to use top-down and/or bottom-up information, when available. Both approaches are evaluated experimentally.
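
To make the stochastic approach concrete, here is a minimal sketch of a linear minimum-MSE update after a single candidate has been examined. It assumes each identity is a 0/1 indicator (1 meaning target) and that the means and covariance matrix of the identities are known; the function names and the greedy choice of the next candidate are assumptions for illustration, not the procedure presented in the talk.

```python
def lmmse_update(means, cov, observed_index, observed_value):
    """Linear MMSE estimate of every candidate's identity after one
    candidate has been examined.

    For candidate j: x_hat_j = mu_j + C[j][i] / C[i][i] * (x_i - mu_i),
    where i is the examined candidate.
    """
    i = observed_index
    var_i = cov[i][i]
    if var_i <= 0:
        raise ValueError("the observed candidate must have positive variance")
    innovation = observed_value - means[i]
    return [observed_value if j == i
            else means[j] + cov[j][i] / var_i * innovation
            for j in range(len(means))]


def next_candidate(estimates, already_checked):
    """Greedy rule (an assumption): examine the unchecked candidate whose
    estimated probability of being a target is highest."""
    remaining = [j for j in range(len(estimates)) if j not in already_checked]
    return max(remaining, key=lambda j: estimates[j])


# Candidates 0 and 1 look alike (high covariance); candidate 2 is unrelated.
means = [0.5, 0.5, 0.5]
cov = [[0.25, 0.20, 0.00],
       [0.20, 0.25, 0.00],
       [0.00, 0.00, 0.25]]
estimates = lmmse_update(means, cov, observed_index=0, observed_value=1.0)
best = next_candidate(estimates, already_checked={0})
```

After candidate 0 is found to be a target, the estimate for the visually similar candidate 1 rises while that of the unrelated candidate 2 stays at its prior mean, so the search proceeds to candidate 1.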

Height from moving shadows

Yaron Caspi and Mike Werman – Hebrew U.

Plane + parallax has been touted as an excellent representation for 3D reconstruction. Several ways to recover 3D parallax have been proposed in the past, most of which rely on point matches. In this talk we describe how shadows or light stripes may be used to compute a plane + parallax representation, where the 3D parallax refers to the height from the ground plane. The method is based on analyzing shadows of vertical poles (e.g., a tall building's contour) that sweep the object twice.

Existing beam scanning approaches (shadow or light stripes) will be reviewed, and the differences and similarities with the proposed method will be discussed. We show that, in contrast to existing methods, which recover the distance of a point from the camera, our approach measures the height from the ground plane directly. This is particularly useful when the camera cannot face the scene orthogonally and the object is very far from the camera.