Recognition and Classification in Images and Video *

203.4780

Meeting Times: Monday 9-12, Room 462

Instruction Hour: Wednesday 11:00-12:00, Room 410 (Jacobs)

Instructor: Dr. Rita Osadchy

e-mail: rita [at]cs [dot]haifa.ac.il
Office: Jacobs 410

Announcements:

§ No class 5.5;

§ The reviews for the topic Patch-based Representations are due on 5.5;

§ The reviews for the topic Detection as a binary decision are due on 12.5;

§ Both topics will be presented on 12.5 (if we don’t have enough time for the second topic, we will continue it on 19.5);

§ All announcements and guidelines will be distributed by email.

§ Those who do not send their contact address on time will not be added to the contact list!!!

§ You must send an email to me (rita[at]cs[dot]haifa.ac.il) by March 1, from your active address, with the subject "course 4780".

Course overview:

General: This is a graduate course in computer vision. We will survey and discuss vision papers relating to object and activity recognition and scene understanding. The goal of the course is to understand classical and modern approaches to some important problems, analyzing their strengths and weaknesses, and identifying interesting open questions.

Requirements: Students will be responsible for writing a paper review each week, participating in discussions, completing a programming project, and presenting one topic in class.

Note that presentations are due one week before the slot for which your presentation is scheduled. This means you will need to read the papers, create slides, etc. one week before the date you are signed up for, to leave time for improvement. Note also that you must get my approval for your presentation.

More details on the requirements and grading breakdown are here.

Syllabus:

A. Recognizing specific objects

Global features:

1. Linear Subspaces

2. Detection as a binary decision

Local features:

3. Local features, matching for object instances

4. Visual Vocabularies and Bag of Words

Region-based methods:

5. Mid-Level Representations

B. Beyond single objects (using additional information)

1. Saliency

2. Attributes

3. Context

C. Scalability problems

1. Scaling with a large number of categories

2. Large-scale search

D. Action recognition in video and images

Schedule and papers:

Note: * = required reading.
Additional papers are provided for reference, and as a starting point for background reading for projects.
Paper presentations: Cover the starred papers.

Date

Topics

Papers and links

Presenters

3.3

Course intro

[slides]

Instructor

10.3

Introduction to Object and Event Recognition

[slides]

Instructor

17.3

Introduction to Object and Event Recognition

Instructor

24.3

No class

31.3

Linear Subspaces

Global appearance models for object recognition, dimensionality reduction.

o *Eigenfaces for Recognition, Turk and Pentland, 1991. [pdf]

o *P.N. Belhumeur, J.P. Hespanha, D.J. Kriegman, Eigenfaces vs. Fisherfaces: Recognition using Class Specific Linear Projection, 1996 [pdf]

o Face Database [here]

Additional Material

Shimon Ullman and Ronen Basri, Recognition by Linear Combinations of Models, IEEE Transactions on Pattern Analysis and Machine Intelligence, 1991 [pdf]

T.F. Cootes and C.J. Taylor, "Statistical models of appearance for medical image analysis and computer vision", Proc. SPIE Medical Imaging 2001. [pdf]

Mor [pdf]

7.4

Cyber Day

28.4

Local features and matching for object instances:

Invariant local features, instance recognition

o *Object Recognition from Local Scale-Invariant Features, Lowe, ICCV 1999. [pdf] [code] [other implementations of SIFT] [IJCV]

o *Selected pages from: Local Invariant Feature Detectors: A Survey, Tuytelaars and Mikolajczyk. Foundations and Trends in Computer Graphics and Vision, 2008. [pdf] [Oxford code] [Read pp. 178-188, 216-220, 254-255]

o Oxford group interest point software

o Andrea Vedaldi's VLFeat code, including SIFT, MSER, hierarchical k-means.

o INRIA LEAR team's software, including interest points, shape features

o FLANN - Fast Library for Approximate Nearest Neighbors. Marius Muja et al.

o Google Goggles

o Kooaba

Additional Material

For more background on feature extraction: Szeliski book: Sec 3.2 Linear filtering, 4.1 Points and patches, 4.2 Edges

Scalable Recognition with a Vocabulary Tree, D. Nister and H. Stewenius, CVPR 2006. [pdf]

SURF: Speeded Up Robust Features, Bay, Ess, Tuytelaars, and Van Gool, CVIU 2008. [pdf] [code]

Robust Wide Baseline Stereo from Maximally Stable Extremal Regions, J. Matas, O. Chum, M. Urban, and T. Pajdla, BMVC 2002. [pdf]

A Performance Evaluation of Local Descriptors. K. Mikolajczyk and C. Schmid. CVPR 2003 [pdf]

Guy [pdf]

12.5

Patch-based Representations

Visual vocabularies, bag-of-words and the spatial pyramid kernel (SPK) for scene classification

o *Visual Categorization with Bags of Keypoints, C. Dance, J. Willamowski, L. Fan, C. Bray, and G. Csurka, ECCV International Workshop on Statistical Learning in Computer Vision, 2004. [pdf]

o *Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories, Lazebnik, Schmid, and Ponce, CVPR 2006. [pdf] [code] [data]

LIBPMK feature extraction code, includes dense sampling
LIBSVM library for support vector machines
PASCAL VOC Visual Object Classes Challenge

Additional Material

Video Google: A Text Retrieval Approach to Object Matching in Videos, Sivic and Zisserman, ICCV 2003. [pdf] [demo]

Pedestrian Detection in Crowded Scenes, Leibe, Seemann, and Schiele, CVPR 2005. [pdf]

Sampling Strategies for Bag-of-Features Image Classification. E. Nowak, F. Jurie, and B. Triggs. ECCV 2006. [pdf]

Scalable Recognition with a Vocabulary Tree, D. Nister and H. Stewenius, CVPR 2006. [pdf]

Assaf [pdf]

12.5

Detection as a binary decision

Sliding window detection, detection as a binary decision problem.

o *Histograms of Oriented Gradients for Human Detection, Dalal and Triggs, CVPR 2005. [pdf] [code] [PASCAL datasets]

o *Rapid Object Detection Using a Boosted Cascade of Simple Features, Viola and Jones, CVPR 2001. [pdf] [code]

o LIBSVM library for support vector machines

o PASCAL VOC Visual Object Classes Challenge

o Face data

Additional Material

Beyond Sliding Windows: Object Localization by Efficient Subwindow Search. C. Lampert, M. Blaschko, and T. Hofmann. CVPR 2008.

A Discriminatively Trained, Multiscale, Deformable Part Model, by P. Felzenszwalb, D. McAllester and D. Ramanan. CVPR 2008. [pdf] [code]

A Trainable System for Object Detection, C. Papageorgiou and T. Poggio, IJCV 2000. [pdf]

Class-specific Hough Forests for Object Detection. J. Gall and V. Lempitsky. CVPR 2009. [pdf] [slides] [code]

Majd [pdf]

19.5

Context and scenes

Multi-object scenes, inter-object relationships, understanding scenes' spatial layout, 3d context

o *Using the Forest to See the Trees: Exploiting Context for Visual Object Detection and Localization. Torralba, Murphy, and Freeman. CACM 2009. [pdf] [related code]

o *Object Bank: A High-Level Image Representation for Scene Classification & Semantic Feature Sparsification. L-J. Li, H. Su, E. Xing, L. Fei-Fei. NIPS 2010. [pdf] [code]

LabelMe Database
Scene Understanding Symposium

Multi-Class Segmentation with Relative Location Prior. S. Gould, J. Rodgers, D. Cohen, G. Elidan and D. Koller. IJCV 2008. [pdf] [code]

Estimating Spatial Layout of Rooms using Volumetric Reasoning about Objects and Surfaces. D. Lee, A. Gupta, M. Hebert, and T. Kanade. NIPS 2010. [pdf] [code]

Contextual Priming for Object Detection, A. Torralba. IJCV 2003. [pdf] [web] [code]

Recognition Using Visual Phrases. M. Sadeghi and A. Farhadi. CVPR 2011. [pdf]

Thinking Inside the Box: Using Appearance Models and Context Based on Room Geometry. V. Hedau, D. Hoiem, and D. Forsyth. ECCV 2010 [pdf] [code and data]

Putting Objects in Perspective, by D. Hoiem, A. Efros, and M. Hebert, CVPR 2006. [pdf] [web]

Learning Spatial Context: Using Stuff to Find Things, by G. Heitz and D. Koller, ECCV 2008. [pdf] [code]

Context Based Object Categorization: A Critical Survey. C. Galleguillos and S. Belongie. [pdf]

[pdf]

26.5

Describing objects with attributes

Visual properties, learning from natural language descriptions, intermediate representations