Object Perception

 

Learning Outcomes

1. Describe these theories of object perception:

• template approach

• prototype approach

• Pandemonium

• Marr’s approach

• Recognition By Components

2. What are the pros and cons of each theory?

3. What is the basis of the structural description approach?

4. How does Stanley’s vision work? What is the current state of robotic vision in autonomous vehicles?

 


 

Object Perception

 

___________: perceiving something as previously experienced

______________: naming or classifying an object

 

How is this done?

______-__ (data-driven) processing: low-level stimulus information combined into larger wholes to create representation

___-____ (concept-driven) processing: higher-level cognitive processes (e.g., memories, beliefs, expectations) affect interpretations of the stimulus input

 

Template Matching

- compare input to a model or ________ stored in memory

- stimulus categorized by exact match

 

Pros & Cons:

☑ successfully used by ________

e.g., reading MICR numbers at the bottom of a cheque

☒ cannot handle novel stimuli

☒ cannot handle __________ within a stimulus:

letter A

☒ too many templates required

☒ cannot handle _______:

13 14 15 or BED
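
A minimal sketch of exact-match template matching, to illustrate why variation is a problem. The 3×3 grids and the names `TEMPLATES` and `match_template` are hypothetical and chosen only for illustration; real templates would be far larger.

```python
# Hypothetical 3x3 binary templates ('#' = ink, '.' = blank).
TEMPLATES = {
    "T": ("###",
          ".#.",
          ".#."),
    "L": ("#..",
          "#..",
          "###"),
}


def match_template(stimulus):
    """Return the label whose stored template equals the stimulus exactly, else None."""
    for label, template in TEMPLATES.items():
        if stimulus == template:
            return label
    return None


exact_t = ("###", ".#.", ".#.")
noisy_t = ("###", ".#.", ".##")   # same letter with one stray pixel

print(match_template(exact_t))   # T
print(match_template(noisy_t))   # None
```

A single changed pixel defeats the match, so every variant would need its own stored template.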

Prototype Approach

- individual instances not stored

- represented as prototype: abstraction of _______ or best example of an object

- categorization based on “distance” between perceived item and prototype

 

Pros & Cons:

☑ more ________ than templates

☒ cannot handle _______
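
A minimal sketch of distance-based categorization. It assumes the prototype is a feature-wise mean of stored examples, which is only one possible reading of "abstraction"; the two features and all values are hypothetical.

```python
import math

# Hypothetical two-feature descriptions: (height-to-width ratio, roundness).
EXEMPLARS = {
    "mug":  [(1.2, 0.9), (1.0, 0.8), (1.4, 1.0)],
    "bowl": [(0.4, 0.9), (0.5, 1.0), (0.6, 0.8)],
}


def build_prototype(instances):
    """Average each feature over the instances; the instances are then discarded."""
    n = len(instances)
    return tuple(sum(values) / n for values in zip(*instances))


PROTOTYPES = {label: build_prototype(items) for label, items in EXEMPLARS.items()}


def categorize(item):
    """Assign the item to the category whose prototype is nearest (Euclidean distance)."""
    return min(PROTOTYPES, key=lambda label: math.dist(item, PROTOTYPES[label]))


print(categorize((1.1, 0.7)))    # mug
print(categorize((0.45, 0.95)))  # bowl
```

Unlike the exact match used by templates, a new item only has to fall closer to one prototype than to the others.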

 

Feature Analysis

e.g., Pandemonium (Selfridge, 1959):

Stage 1: “Image Demon” gets sensory input

e.g., R

 

Stage 2: “Feature Demons” analyze input in terms of ________; each is activated by its specific feature

e.g., features

 

Stage 3: “Cognitive Demons” determine which patterns of features are present, corresponding to known ________

e.g., P R T more than A or X

 

Stage 4: “Decision Demon” identifies the pattern by listening for the Cognitive Demon shouting the loudest

e.g., “R”
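
The four stages above can be sketched roughly as follows. The feature inventory in `LETTER_FEATURES` and the scoring rule (features matched minus features missing) are assumptions made for illustration; Selfridge's model does not fix either.

```python
# Hypothetical feature lists for a handful of letters.
LETTER_FEATURES = {
    "P": {"vertical", "closed_loop"},
    "R": {"vertical", "closed_loop", "diagonal_right"},
    "T": {"vertical", "horizontal"},
    "A": {"diagonal_left", "diagonal_right", "horizontal"},
    "X": {"diagonal_left", "diagonal_right", "crossing"},
}
ALL_FEATURES = set().union(*LETTER_FEATURES.values())


def feature_demons(image_features):
    """Stage 2: each feature demon fires only if its own feature is in the input."""
    return ALL_FEATURES & image_features


def cognitive_demons(active):
    """Stage 3: each letter demon shouts louder the better the active features fit it."""
    return {
        letter: len(needed & active) - len(needed - active)
        for letter, needed in LETTER_FEATURES.items()
    }


def decision_demon(shouts):
    """Stage 4: pick the letter whose cognitive demon shouts loudest."""
    return max(shouts, key=shouts.get)


image = {"vertical", "closed_loop", "diagonal_right"}  # Stage 1: input for the letter R
shouts = cognitive_demons(feature_demons(image))
print(shouts)                  # {'P': 2, 'R': 3, 'T': 0, 'A': -1, 'X': -1}
print(decision_demon(shouts))  # R
```

With these assumed features, P, R, and T shout louder than A or X, and R is loudest.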

 

Pros & Cons:

☑ can identify a wide range of stimuli--just specify component features

☑ feature-detectors physiologically relate to _______ in visual system

☒ doesn’t define “features”

(two lines forming an angle? single line segment?)

☒ cannot handle ______________ principles (Gestalt laws)

(e.g., when is a row of dots a line?)

☒ cannot handle _______ effects (no “Context Demon”)

☒ cannot be applied to 3-D objects

 


 

Structural Description Approach

 

_____-based models:

- these traditional models of visual perception focus on analyzing aspects of the (2-D) retinal image (e.g., junctions, features, etc.)

- they rely on a viewpoint-dependent frame of reference

- as a result, it is difficult to represent a fully 3-D world

 

__________-description models:

- structural description: a set of symbolic propositions about a particular configuration

(figure: structural description)

- these are different in the picture domain (2-D) but are the same in the object domain (3-D)

- _____________ among components are important

e.g., brick joined at midpoint to another brick

 

Marr & Nishihara (1978):

- problem: defining an object with an ________ frame of reference

- solution: define object’s characteristics with respect to object itself (object-centred)

- determine object’s primary axis using generalized _____

e.g., pyramids, spheres, cylinders, oblongs, as well as “arms” and “legs”

• have an axis of orientation

• a certain location or centre of mass

• overall size

- create shape descriptions of the object at different levels of detail

e.g., human body, limbs, fingers

- each level of hierarchy contains information about:

• axes of cones

• arrangement of axes of component cones (how cones connect)

• internal reference to 3-D description of component models (i.e., “name”)

- this comprises the ___ _____ description

- 3-D model description is object-centred, and thus invariant over changes in position of the viewer

(viewpoint invariance: ability to identify an object from different points of view)

 

(figure: viewpoint invariance)

 

- object identification: finds match between 3-D model description and a stored catalog of known objects

• ___________ index (“level of detail”):

searches through hierarchy of stored information until the information in the 3-D model and in the catalog have the same level of specificity

starts with overall shape information and then goes to more and more specific detail

e.g., object → biped → human → male or female

(figure: specificity)

(more detail needed to differentiate David vs. Venus than human vs. tree)

is a bottom-up process because it is based on incoming sensory information, processing generalized cones in terms of how much detail they provide

 

• _______ (or subcomponent) index (“whole-to-parts reference”):

relates information about components (locations, orientation, relative sizes) to help determine object

e.g., human → arm → forearm → hand → David

(figure: adjunct index)

is top-down because of the nature of the information it is using: knowledge of human beings who have arms, that have forearms, that have hands, that belong to certain people

 

• ______ (or supercomponent) index (“parts-to-whole reference”):

as each component is identified, it provides information on what the whole object is likely to be

e.g., hand → forearm → arm → human → David

(figure: parent index)

is also top-down because it is also using learned knowledge about objects
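
A rough sketch of the coarse-to-fine search described for the first index above: the catalog is followed downward only as far as the available detail allows. The `CATALOG` entries and feature names are hypothetical stand-ins; Marr & Nishihara's actual descriptions are built from axes and generalized cones, not labels like these.

```python
# Hypothetical catalog: "needs" is the extra detail required to reach an entry.
CATALOG = {
    "object":    {"needs": set(), "children": ["biped", "quadruped"]},
    "biped":     {"needs": {"two_legs"}, "children": ["human", "bird"]},
    "quadruped": {"needs": {"four_legs"}, "children": []},
    "human":     {"needs": {"two_arms", "upright_torso"}, "children": ["male", "female"]},
    "bird":      {"needs": {"wings"}, "children": []},
    "male":      {"needs": {"male_proportions"}, "children": []},
    "female":    {"needs": {"female_proportions"}, "children": []},
}


def identify(description, node="object"):
    """Descend the hierarchy while exactly one child fits the description."""
    path = [node]
    while True:
        matches = [
            child for child in CATALOG[node]["children"]
            if CATALOG[child]["needs"] <= description
        ]
        if len(matches) != 1:
            return path  # not enough (or ambiguous) detail to go further
        node = matches[0]
        path.append(node)


coarse = {"two_legs"}
detailed = {"two_legs", "two_arms", "upright_torso", "female_proportions"}
print(identify(coarse))    # ['object', 'biped']
print(identify(detailed))  # ['object', 'biped', 'human', 'female']
```

The more detail the description carries, the further down the hierarchy the match can go.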

 

Pros & Cons:

☑ doesn’t rely on a list of ________

☑ is economical

☑ handles variation & novel stimuli

☑ allows for top-down processing

☑ accounts for ______________ principles

☒ physiological evidence?

☒ identifies objects by gross features, not details

 

Recognition By Components

(Biederman, 1987):

- assumption: visual scene can be decomposed into constant, basic elements

- components called _____ (geometric icons): 36 basic volumetric shapes that can be modified (length, width, etc.), and yet remain identifiable (cylinder, brick, cone):

(figure: geons)

- different geons have different ___-__________ properties: not an artefact of viewing position, but rather reflect a property of the world

(figure: non-accidental properties)

• curvature: curves in the image imply curved edges in the object

e.g., image of sphere is round, because a sphere has a rounded contour

• ____________: straight lines in the image imply straight lines in the object

• symmetry: symmetry in the image implies symmetry in the object

• parallel: parallel lines/curves in the image imply the same in the object

• ______________: lines in the image ending at a common point imply edges of object end at common point

 

- _________ is important:

(figure: concavity)

(a) complete stimuli

(b) stimuli preserving concavity information

(c) stimuli lacking concavity information

 

- Principle of ____________ ________: if an object’s geons can be determined, then the object can be recognized or identified--even if the object is partially obscured
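
A minimal sketch of the principle above: an object is described as a set of geons plus their relations, and recognition succeeds if enough of that description is visible. The object descriptions, relation labels, and the `min_overlap` threshold are all hypothetical.

```python
# Hypothetical structural descriptions: (geon, relation) pairs.
OBJECTS = {
    "mug":  {("cylinder", "body"), ("curved_tube", "side_attached")},
    "pail": {("cylinder", "body"), ("curved_tube", "top_attached")},
    "lamp": {("cone", "top"), ("cylinder", "stem"), ("brick", "base")},
}


def recognize(visible, min_overlap=0.5):
    """Return the object with the largest share of its description visible, if any."""
    best_label, best_score = None, 0.0
    for label, description in OBJECTS.items():
        score = len(visible & description) / len(description)
        if score > best_score:
            best_label, best_score = label, score
    return best_label if best_score >= min_overlap else None


# Lamp with its base hidden: two of three components are still recoverable.
print(recognize({("cone", "top"), ("cylinder", "stem")}))                    # lamp
# Same geons as a mug, but the handle arches over the top.
print(recognize({("cylinder", "body"), ("curved_tube", "top_attached")}))    # pail
# A single brick is too little to go on.
print(recognize({("brick", "base")}))                                        # None
```

The mug and pail share the same geons and differ only in how they are joined, which is why the relations are stored with the components.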

 

- overview of model:

 

edge extraction

↙ ↘

detection of non-accidental properties   |   parsing of regions of concavity

↘ ↙

determination of components

↓

matching of components to ______ representations

 

- evidence: _______ studies (Biederman, 1995)

1) prime: present object

 

e.g., teacup

 

2a) viewpoint-invariant contour change: present object from same category made of different geons

 

e.g., cylindrical mug

 

2b) metric change: present object from same category made of same geons, but stretched

 

e.g., latte bowl

 

- responses were ______ for 2b than 2a

 

(figure: metric changes)

 

Pros & Cons:

☑ has well-defined __________

☑ can handle variation & novel stimuli

☑ is __________

☒ geons not always reliably determined (e.g., a puddle)

☒ may be too broad--objects also differ in their details

☒ is viewpoint-invariant; however, objects are most easily identified from a _________ (or typical) viewpoint

(figure: canonical vs. non-canonical viewpoints)

(the same is true of Marr & Nishihara's model)

 


 

Computer Vision & Stanley

(Sebastian Thrun et al., 2006)

 

DARPA Grand Challenge:

- $1M prize for __________ (self-driving) vehicle completing course up to 270 km long through Mojave desert in no more than 10 hours

- goal: military applications?

- in 2004: none of 15 teams finished; maximum distance completed was 5% of the course

- in 2005: 23 teams competed over a 212 km course for $2M prize

 

Stanford University’s Stanley

- based on a turbocharged 4WD 2004 Volkswagen Touareg R5 TDI

- finished _____ in 6 hours, 53 minutes, 58 seconds

 

Hardware:

- environment sensors:

• 5 _____ (laser imaging, detection, and ranging) range finders (25 m range)

• colour video camera

• 2 radar sensors (200 m range) for detecting large obstacles (not used in race due to technical problems and a lack of large obstacles on the course)

- positioning sensors:

• GPS positioning system

• GPS compass

• inertial measurement unit

- 6 Pentium M computers running Linux

• 1 dedicated to _____ processing

• 2 for all other software

 

Software:

- environment state consists of multiple ____ (laser, vision, radar)

- these construct 2-D environment map

 

Vision system:

- lasers can detect obstacles up to 22 m--insufficient range for travel at 55 km/h

- video camera’s effective range: 70 m--but classifying terrain into drivable and non-drivable is __________

- solution: drivable area determined by laser analysis projected into visual image

- _____________ made to similar visual areas out of laser range

e.g., vision initially classifies grass as nondrivable (green area) until lasers scan it (blue trapezoid); lasers conclude grass is drivable; then all grass areas in visual range reclassified as drivable (red area)

(figure: terrain classification)

- data continually evaluated by a learning algorithm, which can adapt to new terrain

- vision not used for steering control, but for ________ control
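
A simplified sketch of the idea above: pixels the lasers have marked drivable teach a colour model, which is then applied to the rest of the image. The single mean colour with a distance `tolerance` is an assumption made for brevity and is not a description of Stanley's actual classifier; the pixel values are hypothetical.

```python
import math


def learn_drivable_colour(pixels, laser_drivable):
    """Average the colours of the pixels that the lasers have marked drivable."""
    samples = [pixels[pos] for pos in laser_drivable]
    n = len(samples)
    return tuple(sum(channel) / n for channel in zip(*samples))


def classify(pixels, model, tolerance=40.0):
    """Mark a pixel drivable if its colour lies within `tolerance` of the model."""
    return {pos: math.dist(colour, model) <= tolerance for pos, colour in pixels.items()}


# Hypothetical RGB image as {(row, col): (r, g, b)}; row 0 is far, row 2 is near.
pixels = {
    (0, 0): (60, 140, 50),    # far grass, beyond laser range
    (0, 1): (120, 120, 125),  # far rock, beyond laser range
    (1, 0): (58, 150, 55),    # mid-distance grass
    (2, 0): (62, 145, 52),    # near grass, scanned by lasers
    (2, 1): (65, 148, 48),    # near grass, scanned by lasers
}
laser_drivable = [(2, 0), (2, 1)]

model = learn_drivable_colour(pixels, laser_drivable)
drivable = classify(pixels, model)
print(drivable[(0, 0)])  # True
print(drivable[(0, 1)])  # False
```

Re-running `learn_drivable_colour` on each new laser scan is what would let such a model adapt as the terrain changes.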

 

Pros and Cons:

☑ showed the power of AI that uses machine learning

☑ fastest autonomous vehicle, by 11 minutes (average 30 km/h)

☒ limited obstacle processing (can’t differentiate tall grass vs. _____)

☒ unable to navigate in _______

☒ little generalizability to _____ vision (weak equivalence)

 

Sequel: DARPA Urban Challenge, 2007

- autonomous vehicles required to obey all traffic regulations, negotiate with other traffic and obstacles, and merge into traffic in a mock urban environment

- 6 (out of 35) teams completed the course

- Stanford Racing Team: “Junior,” a 2006 VW Passat wagon

• 4× Stanley’s computer power

• 360° LIDAR

• 6 cameras for omnidirectional video

• finished in ______ place

- winner: Tartan Racing (CMU/GM): “Boss”

 

Robotic Vision Today

 

• Waymo autonomous vehicles/robotaxis

- began as the Google self-driving car project, led by Sebastian Thrun

- autonomously driven over 200 million km

- sensing technology includes LIDAR, radar, and cameras

 

• Tesla Enhanced Autopilot

- has 8 cameras (up to 250 m range), 12 ultrasonic sensors, and forward-facing radar processed by Tesla Vision neural network

- implicated in over 50 serious __________

 

• consumer technologies

- autonomous ______ control: adjusts vehicle speed/braking based on data from laser/radar sensors that detect distance to vehicle ahead

e.g., Ford’s Adaptive Cruise Control

- automatic _______: sensors used to aid parallel parking

e.g., Lexus’s Advanced Parking Guidance System

- lane departure warning: signals driver when vehicle moves out of its lane; some systems use computer-processed images from video cameras

e.g., Mercedes-Benz’s Active Lane Keeping Assist

- pre-collision braking and throttle management: slows or stops vehicle if it detects a potentially hazardous object in the way

e.g., Subaru’s EyeSight driver assist technology uses dual cameras

 

• current technology (Schoettle, 2017):

- no single sensor currently equals human visual perception

- some sensors have capabilities that human drivers do not (e.g., sensing through fog with radar)

- equaling or exceeding human sensing capabilities requires a variety of sensors (e.g., radar, LIDAR, cameras), whose data must be __________ to form a unified representation of the roadway and environment
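
One common way to combine readings from several sensors is inverse-variance weighting, sketched below for a single distance estimate. This is offered only as an illustration and is not taken from Schoettle (2017); the sensor readings and variances are hypothetical.

```python
def fuse(estimates):
    """Combine (distance_m, variance) pairs, trusting low-variance sensors more."""
    weights = [1.0 / variance for _, variance in estimates]
    total = sum(weights)
    distance = sum(w * d for w, (d, _) in zip(weights, estimates)) / total
    return distance, 1.0 / total  # fused estimate and its (smaller) variance


# Hypothetical readings of the distance to the vehicle ahead.
radar = (50.0, 4.0)
lidar = (48.0, 1.0)
camera = (53.0, 9.0)

distance, variance = fuse([radar, lidar, camera])
print(round(distance, 1))  # 48.8
print(round(variance, 2))  # 0.73
```

The fused variance is lower than that of any single sensor, which is the sense in which several sensors together can outperform each one alone.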