Object Perception

Learning Outcomes

1. Describe these theories of object perception:

• template approach

• prototype approach

• Pandemonium

• Marr’s approach

• Recognition By Components

2. What are the pros and cons of each theory?

3. What is the basis of the structural description approach?

4. How does Stanley’s vision work? What is the current state of robotic vision in autonomous vehicles?

Object Perception

• ___________: perceiving something as previously experienced

• ______________: naming or classifying an object

How is this done?

• ______-__ (data-driven) processing: low-level stimulus information combined into larger wholes to create representation

• ___-____ (concept-driven) processing: higher-level cognitive processes (e.g., memories, beliefs, expectations) affect interpretations of the stimulus input

Template Matching

- compare input to a model or ________ stored in memory

- stimulus categorized by exact match

Pros & Cons:

successfully used by ________

e.g., reading MICR numbers at the bottom of a cheque

cannot handle novel stimuli

cannot handle __________ within a stimulus:

letter A

too many templates required

cannot handle _______:

13 14 15 or BED
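
A minimal Python sketch of exact template matching, assuming toy 3×3 grids (the templates and function name are illustrative, not taken from any real reader). It shows why a small variation in the stimulus defeats an exact match:

```python
# Toy 3x3 binary "images": '#' is ink, '.' is blank. Hypothetical templates.
TEMPLATES = {
    "T": ("###",
          ".#.",
          ".#."),
    "L": ("#..",
          "#..",
          "###"),
}

def match_template(image):
    """Return the label whose stored template equals the input exactly, else None."""
    for label, template in TEMPLATES.items():
        if image == template:
            return label
    return None

exact_t = ("###", ".#.", ".#.")
variant_t = ("###", "#..", "#..")  # same strokes, stem moved one column left

print(match_template(exact_t))    # T
print(match_template(variant_t))  # None: no stored template covers this variation
```

Handling the variant would need yet another stored template, which is the "too many templates" problem.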

Prototype Approach

- individual instances not stored

- represented as prototype: abstraction of _______ or best example of an object

- categorization based on “distance” between perceived item and prototype

Pros & Cons:

more ________ than templates

cannot handle _______
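
A rough sketch of prototype-based categorization, assuming each object is reduced to two made-up numeric features (the feature values and category names are hypothetical). The examples are used only to compute each prototype; afterwards only the prototypes are consulted:

```python
import math

# Hypothetical (height/width ratio, roundness) measurements for seen examples.
EXAMPLES = {
    "cup":  [(1.0, 0.8), (1.2, 0.9), (0.8, 0.7)],
    "bowl": [(0.4, 0.9), (0.5, 1.0), (0.3, 0.8)],
}

def prototype(instances):
    """Abstract a set of instances into their average feature vector."""
    n = len(instances)
    return tuple(sum(dim) / n for dim in zip(*instances))

# Only the prototypes are kept; individual instances are not stored.
PROTOTYPES = {label: prototype(items) for label, items in EXAMPLES.items()}

def categorize(item):
    """Assign the item to the category whose prototype is nearest."""
    return min(PROTOTYPES, key=lambda label: math.dist(item, PROTOTYPES[label]))

print(categorize((0.9, 0.75)))  # cup
```

A novel item never seen before is still categorized, because only its distance to each prototype matters.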

Feature Analysis

e.g., Pandemonium (Selfridge, 1959):

Stage 1: “Image Demon” gets sensory input

e.g., R

Stage 2: “Feature Demons” analyze input in terms of ________; each activated by its specific feature

e.g.,

Stage 3: “Cognitive Demons” determine which patterns of features are present, corresponding to known ________

e.g., P, R, and T activated more than A or X

Stage 4: “Decision Demon” identifies the pattern by listening for the Cognitive Demon shouting the loudest

e.g., “R”

Pros & Cons:

can identify a wide range of stimuli--just specify component features

feature-detectors physiologically relate to _______ in visual system

doesn’t define “features”

(two lines forming an angle? single line segment?)

cannot handle ______________ principles (Gestalt laws)

(e.g., when is a row of dots a line?)

cannot handle _______ effects (no “Context Demon”)

cannot be applied to 3-D objects
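
The four stages can be sketched in Python as below. The feature inventory and the scoring rule (matched features minus missing ones) are assumptions made for illustration; Selfridge's model does not specify them:

```python
# Hypothetical feature inventory; the model itself does not define "features".
LETTER_FEATURES = {
    "P": {"vertical_line", "closed_curve"},
    "R": {"vertical_line", "closed_curve", "oblique_line"},
    "T": {"vertical_line", "horizontal_line"},
    "A": {"oblique_line", "horizontal_line"},
    "X": {"oblique_line"},
}

def image_demon(stimulus):
    """Stage 1: hold the sensory input (here, already a collection of strokes)."""
    return set(stimulus)

def feature_demons(image):
    """Stage 2: one demon per feature; each fires only if its feature is present."""
    known = set().union(*LETTER_FEATURES.values())
    return {feature for feature in known if feature in image}

def cognitive_demons(active):
    """Stage 3: each letter demon shouts louder the better its features fit."""
    return {
        letter: len(expected & active) - len(expected - active)
        for letter, expected in LETTER_FEATURES.items()
    }

def decision_demon(shouts):
    """Stage 4: pick the letter whose cognitive demon shouts the loudest."""
    return max(shouts, key=shouts.get)

def recognize(stimulus):
    return decision_demon(cognitive_demons(feature_demons(image_demon(stimulus))))

print(recognize(["vertical_line", "closed_curve", "oblique_line"]))  # R
```

Nothing in this pipeline looks at neighbouring letters or expectations, which is the missing "Context Demon" noted above.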

Structural Description Approach

_____-based models:

- these traditional models of visual perception focus on analyzing aspects of the (2-D) retinal image (e.g., junctions, features, etc.)

- they rely on a viewpoint-dependent frame of reference

- as a result, it is difficult to represent a fully 3-D world

__________-description models:

- structural description: a set of symbolic propositions about a particular configuration

(figure: structural description)

- these are different in the picture domain (2-D) but are the same in the object domain (3-D)

- _____________ among components are important

e.g., brick joined at midpoint to another brick
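
One way to picture a structural description is as a set of symbolic propositions, sketched here in Python. The relation names are hypothetical; the point is only that two objects built from the same parts differ when the relationships differ:

```python
def describe(parts, relations):
    """Build a structural description: a set of propositions about parts and relations."""
    props = {("part", name, kind) for name, kind in parts.items()}
    props |= {(rel, a, b) for rel, a, b in relations}
    return frozenset(props)

# Same two bricks, different relationships (relation names are made up).
t_shape = describe({"a": "brick", "b": "brick"},
                   [("end_joined_to_midpoint", "a", "b")])
l_shape = describe({"a": "brick", "b": "brick"},
                   [("end_joined_to_end", "a", "b")])

# A newly seen object is described the same way, then compared symbolically.
seen = describe({"b": "brick", "a": "brick"},
                [("end_joined_to_midpoint", "a", "b")])

print(seen == t_shape)  # True
print(seen == l_shape)  # False
```

The comparison never refers to 2-D image coordinates, so the description does not change when the viewpoint does.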

Marr & Nishihara (1978):

- problem: defining an object with an ________ frame of reference

- solution: define object’s characteristics with respect to object itself (object-centred)

- determine object’s primary axis using generalized _____

e.g., pyramids, spheres, cylinders, oblongs, as well as “arms” and “legs”

• have an axis of orientation

• a certain location or centre of mass

• overall size

- create shape descriptions of the object at different levels of detail

e.g., human body, limbs, fingers

- each level of hierarchy contains information about:

• axes of cones

• arrangement of axes of component cones (how cones connect)

• internal reference to 3-D description of component models (i.e., “name”)

- this comprises the ___ _____ description

- 3-D model description is object-centred, and thus invariant over changes in position of the viewer

(viewpoint invariance: ability to identify an object from different points of view)

(figure: viewpoint invariance)

- object identification: finds match between 3-D model description and a stored catalog of known objects

• ___________ index (“level of detail”):

▸ searches through hierarchy of stored information until the information in the 3-D model and in the catalog have the same level of specificity

▸ starts with overall shape information and then goes to more and more specific detail

e.g., object → biped → human → male or female

(figure: specificity)

(more detail needed to differentiate David vs. Venus than human vs. tree)

▸ is a bottom-up process because it is based on incoming sensory information, processing generalized cones in terms of how much detail they provide

• _______ (or subcomponent) index (“whole-to-parts reference”):

▸ relates information about components (locations, orientation, relative sizes) to help determine object

e.g., human → arm → forearm → hand → David

(figure: adjunct index)

▸ is top-down because of the nature of the information it is using: knowledge of human beings who have arms, that have forearms, that have hands, that belong to certain people

• ______ (or supercomponent) index (“parts-to-whole reference”):

▸ as each component is identified, it provides information on what the whole object is likely to be

e.g., hand → forearm → arm → human → David

(figure: parent index)

▸ is also top-down because it is also using learned knowledge about objects
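
The coarse-to-fine catalog search can be sketched as a walk down a tree, as below. This is a loose illustration, not Marr and Nishihara's algorithm: the catalog entries, the `requires` properties, and the `identify` function are all hypothetical stand-ins for real 3-D model descriptions:

```python
from dataclasses import dataclass, field

@dataclass
class Entry:
    name: str
    requires: dict                      # properties that must hold at this level
    children: list = field(default_factory=list)

# Hypothetical catalog, ordered from overall shape to more specific detail.
CATALOG = Entry("object", {}, [
    Entry("biped", {"legs": 2}, [
        Entry("human", {"arms": 2}),
        Entry("bird", {"wings": 2}),
    ]),
    Entry("quadruped", {"legs": 4}),
])

def identify(description, entry=CATALOG):
    """Descend while the description has enough detail to select a child entry."""
    for child in entry.children:
        if all(description.get(k) == v for k, v in child.requires.items()):
            return identify(description, child)
    return entry.name  # stop at the most specific level the description supports

print(identify({"legs": 2}))             # biped
print(identify({"legs": 2, "arms": 2}))  # human
```

A coarse description stops early in the hierarchy; adding detail lets the same search reach a more specific entry.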

Pros & Cons:

doesn’t rely on a list of ________

is economical

handles variation & novel stimuli

allows for top-down processing

accounts for ______________ principles

physiological evidence?

identifies objects by gross features, not details

Recognition By Components

(Biederman, 1987):

- assumption: visual scene can be decomposed into constant, basic elements

- components called _____ (geometrical ions): 36 basic volumetric shapes that can be modified (length, width, etc.), and yet remain identifiable (e.g., cylinder, brick, cone)

(figure: geons)

- different geons have different ___-__________ properties: not an artefact of viewing position, but rather reflect a property of the world

(figure: non-accidental properties)

• curvature: curves in the image imply curved edges in the object

e.g., image of sphere is round, because a sphere has a rounded contour

• ____________: straight lines in the image imply straight lines in the object

• symmetry: symmetry in the image implies symmetry in the object

• parallel: parallel lines/curves in the image imply the same in the object

• ______________: lines in the image ending at a common point imply edges of object end at common point

- _________ is important:

(figure: concavity; (a) complete stimuli, (b) stimuli preserving concavity information)

- Principle of ____________ ________: if an object’s geons can be determined, then the object can be recognized or identified--even if the object is partially obscured

- overview of model:

	edge extraction

	→ detection of non-accidental properties, in parallel with parsing of regions of concavity

	→ determination of components

	→ matching of components to ______ representations
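
The final matching step, and the idea that a partially obscured object can still be recognized from the geons that remain visible, can be sketched as follows. The object catalog, geon labels, and relation labels are hypothetical; real geon extraction from edges is not attempted:

```python
# Hypothetical stored representations: each object is a set of (geon, relation) pairs.
OBJECTS = {
    "mug":  {("cylinder", "body"), ("curved_cylinder", "side_attached")},
    "pail": {("cylinder", "body"), ("curved_cylinder", "top_attached")},
    "lamp": {("cone", "top"), ("cylinder", "stem"), ("brick", "base")},
}

def recognize(visible):
    """Return the stored objects sharing the most components with what is visible."""
    scores = {name: len(parts & visible) for name, parts in OBJECTS.items()}
    best = max(scores.values())
    if best == 0:
        return []
    return sorted(name for name, score in scores.items() if score == best)

# Lamp with its base hidden: the two visible geons are still enough.
print(recognize({("cone", "top"), ("cylinder", "stem")}))  # ['lamp']

# A bare cylinder body is ambiguous until the handle's arrangement is seen.
print(recognize({("cylinder", "body")}))                   # ['mug', 'pail']
```

The mug and pail share the same geons and differ only in how they are arranged, so the relation is part of what gets matched.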

- evidence: _______ studies (Biederman, 1995)

1) prime: present object

e.g., teacup

2a) viewpoint-invariant contour change: present object from same category made of different geons

e.g., cylindrical mug

2b) metric change: present object from same category made of same geons, but stretched

e.g., latte bowl

- responses were ______ for 2b than 2a

(figure: metric changes)

Pros & Cons:

has well-defined __________

can handle variation & novel stimuli

is __________

geons not always reliably determined (e.g., a puddle)

may be too broad--objects also differ in their details

is viewpoint-invariant; however, objects are most easily identified from a _________ (or typical) viewpoint

(figure: canonical vs. non-canonical viewpoints)

(the same is true of Marr & Nishihara's model)

Computer Vision & Stanley

(Sebastian Thrun et al., 2006)

DARPA Grand Challenge:

- $1M prize for __________ (self-driving) vehicle completing course up to 270 km long through Mojave desert in no more than 10 hours

- goal: military applications?

- in 2004: none of 15 teams finished; maximum completed was 5%

- in 2005: 23 teams competed over a 212 km course for $2M prize

Stanford University’s Stanley

- based on a turbocharged 4WD 2004 Volkswagen Touareg R5 TDI

- finished _____ in 6 hours, 53 minutes, 58 seconds

Hardware:

- environment sensors:

• 5 _____ (laser imaging, detection, and ranging) range finders (25 m range)

• colour video camera

• 2 radar sensors (200 m range) for detecting large obstacles (not used in race due to technical problems and a lack of large obstacles on the course)

- positioning sensors:

• GPS positioning system

• GPS compass

• inertial measurement unit

- 6 Pentium M computers running Linux

• 1 dedicated to _____ processing

• 2 for all other software

Software:

- environment state consists of multiple ____ (laser, vision, radar)

- these construct 2-D environment map

Vision system:

- lasers can detect obstacles up to 22 m--insufficient range for travel at 55 km/h

- video camera’s effective range: 70 m--but classifying terrain into drivable and non-drivable is __________

- solution: drivable area determined by laser analysis projected into visual image

- _____________ made to similar visual areas out of laser range

e.g., vision initially classifies grass as nondrivable (green area) until lasers scan it (blue trapezoid); lasers conclude grass is drivable; then all grass areas in visual range reclassified as drivable (red area)

(figure: terrain classification)

- data continually evaluated by a learning algorithm, which can adapt to new terrain

- vision not used for steering control, but for ________ control
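
The laser-to-vision idea can be sketched in Python as below. This is a simplified stand-in, not Stanley's actual software: it assumes a single average colour and a fixed distance tolerance, and all pixel values are made up:

```python
import math

def learn_drivable_colour(pixels, laser_drivable):
    """Average the colours of nearby pixels that the lasers marked drivable.

    Assumes at least one pixel is marked drivable.
    """
    samples = [rgb for rgb, ok in zip(pixels, laser_drivable) if ok]
    n = len(samples)
    return tuple(sum(channel) / n for channel in zip(*samples))

def classify_far(pixels, drivable_colour, tolerance=40.0):
    """Label out-of-laser-range pixels drivable if they look like the learned colour."""
    return [math.dist(rgb, drivable_colour) <= tolerance for rgb in pixels]

# Hypothetical RGB values: two grass pixels and one dark rock within laser range.
near_pixels = [(90, 160, 70), (95, 150, 75), (40, 40, 40)]
near_laser = [True, True, False]   # lasers found the grass drivable

# Pixels beyond laser range: more grass, another rock, bright sky.
far_pixels = [(92, 158, 72), (45, 42, 38), (200, 200, 210)]

model = learn_drivable_colour(near_pixels, near_laser)
print(classify_far(far_pixels, model))  # [True, False, False]
```

Re-running `learn_drivable_colour` on each new frame is what would let such a classifier adapt as the terrain changes.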

Pros and Cons:

showed the power of AI that uses machine learning

fastest autonomous vehicle, by 11 minutes (average 30 km/h)

limited obstacle processing (can’t differentiate tall grass vs. _____)

unable to navigate in _______

little generalizability to _____ vision (weak equivalence)

Sequel: DARPA Urban Challenge, 2007

- autonomous vehicles required to obey all traffic regulations, negotiate with other traffic and obstacles, and merge into traffic in a mock urban environment

- 6 (out of 35) teams completed the course

- Stanford Racing Team: “Junior,” a 2006 VW Passat wagon

• 4× Stanley’s computer power

• 360° LIDAR

• 6 cameras for omnidirectional video

• finished in ______ place

- winner: Tartan Racing (CMU/GM): “Boss”

Robotic Vision Today

• Waymo autonomous vehicles/robotaxis

- founded by Sebastian Thrun as the Google self-driving car project

- autonomously driven over 200 million km

- sensing technology includes LIDAR, radar, and cameras

• Tesla Enhanced Autopilot

- has 8 cameras (up to 250 m range), 12 ultrasonic sensors, and forward-facing radar processed by Tesla Vision neural network

- implicated in over 50 serious __________

• consumer technologies

- autonomous ______ control: adjusts vehicle speed/braking based on data from laser/radar sensors that detect distance to vehicle ahead

e.g., Ford’s Adaptive Cruise Control

- automatic _______: sensors used to aid parallel parking

e.g., Lexus’s Advanced Parking Guidance System

- lane departure warning: signals driver when vehicle moves out of its lane; some systems use computer-processed images from video cameras

e.g., Mercedes-Benz’s Active Lane Keeping Assist

- pre-collision braking and throttle management: slows or stops vehicle if it detects a potentially hazardous object in the way

e.g., Subaru’s EyeSight driver assist technology uses dual cameras

• current technology (Schoettle, 2017):

- no single sensor currently equals human visual perception

- some sensors have capabilities that human drivers do not (e.g., sensing through fog with radar)

- equaling or exceeding human sensing capabilities requires a variety of sensors (e.g., radar, LIDAR, cameras), whose data must be __________ to form a unified representation of the roadway and environment
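
As one illustration of combining data from several sensors, the sketch below merges distance estimates by weighting each one by the inverse of its variance. This is a generic textbook technique chosen for illustration, not a description of any particular vehicle's software, and the readings are invented:

```python
def fuse(estimates):
    """Combine (distance_m, variance) pairs into one estimate.

    More reliable sensors (smaller variance) get proportionally more weight.
    Returns (fused_distance_m, fused_variance).
    """
    weights = [1.0 / variance for _, variance in estimates]
    total = sum(weights)
    fused = sum(w * d for w, (d, _) in zip(weights, estimates)) / total
    return fused, 1.0 / total

# Hypothetical distance-to-obstacle readings: radar, LIDAR, camera.
readings = [(50.0, 4.0), (48.0, 1.0), (53.0, 16.0)]

distance, variance = fuse(readings)
print(round(distance, 2), round(variance, 2))  # 48.62 0.76
```

The combined variance is smaller than that of the best single sensor, which is the sense in which several sensors together can outperform any one of them.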