Our senses enable us to take in very diverse information about our surroundings, and they differ from one another in a number of mode-specific features. The eye specializes in the perception of spatial structure, the ear in the perception of temporal processes. Only in the rarest cases, however, are we confronted with sensory stimuli of a single modality; we perceive our world through all five senses and hence multimodally. Consequently, our sense organs are not, as is often assumed, isolated from one another; it is their synergetic interplay that gives human beings their evolutionary advantage. Irrespective of modality, the most reliable stimulus in a given situation dominates all the others; if one sense provides too little or unclear information, the other senses step in as a corrective. The integration of multimodal sensory stimuli into meaningful units is called multimodal integration; to a certain extent it already occurs on the neuronal level and hence unconsciously and passively. Another frequently used way of linking stimuli across the boundaries between the senses is the intermodal analogy. Here one searches consciously and actively for an amodal quality that is present in several sensory domains, such as intensity or brightness, in order to form analogies that transcend the boundaries between the senses. These mechanisms, described only very briefly here, are of course just two elements in the interaction of hearing and seeing. Any consideration of audiovisual perception should at least distinguish between genuine synesthesia and other associative or emotional links between image and sound.
Our faculty of perception is a response to the surplus of information in our environment. According to Konrad Lorenz, every sense organ embodies a theory as to which information about the environment promises an evolutionary advantage.[1] Every sense focuses on a small selection from this nearly infinite variety and thereby opens a gate to another world, so to speak. The perceptual worlds that seeing and hearing open up for us are characterized, to put it simply, by the mode-specific features described here.
By perceiving high-frequency light waves with wavelengths of roughly 380 to 780 nm, which propagate in nearly straight lines, the eye can resolve surface structures in extremely fine detail. This gives us the capacity for very precise spatial orientation, as is evident, for example, when we read tiny structures such as writing. The human auditory system, by contrast, is optimized for understanding the human voice. The sound waves in this range, with wavelengths of about 20 to 30 cm, bend around smaller objects and are therefore comparatively poorly suited to the precise perception of space.
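A short worked equation may help make the comparison concrete. Assuming a speed of sound of about 343 m/s (a standard textbook value, not a figure from this essay), wavelengths of 20 to 30 cm correspond to frequencies of roughly 1.1 to 1.7 kHz, which lies in the region of the vowel formants that carry much of speech:

```latex
f = \frac{v}{\lambda}:\qquad
\frac{343\ \mathrm{m/s}}{0.30\ \mathrm{m}} \approx 1.1\ \mathrm{kHz},\qquad
\frac{343\ \mathrm{m/s}}{0.20\ \mathrm{m}} \approx 1.7\ \mathrm{kHz}
```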
Seeing is a targeted and directed process that takes place actively and consciously. As a rule, visual stimuli are transferred directly to the cerebrum for rational processing, which makes visual perception ideally suited for dealing with highly differentiated inputs. In the case of hearing, by contrast, several nerve tracts lead directly from the ear to the diencephalon, or interbrain, which is responsible, among other things, for controlling the emotions. That is why acoustic stimuli can trigger relatively direct feelings and physical reactions (such as an increased pulse rate). Moreover, the ear cannot be aimed or closed, which is why, like it or not, we register all acoustic events in our surroundings, even when we are sleeping. Hearing thus often occurs unconsciously and passively, and it can be described as totalizing and collectivizing, since acoustic perception will largely coincide for several people in the same room.
Our visual perception is optimized for grasping static objects and can, as a rule, follow no more than one movement, such as a passing car. Acoustic phenomena, by contrast, are practically inconceivable without dynamic changes. When we listen, we have no problem distinguishing between several simultaneous movements, such as the noises of two different cars driving away. Moreover, the sense of hearing is fundamentally faster than the sense of vision in taking in and processing sensory stimuli. The ear thus tends to specialize in the perception of temporal processes and the eye in the detailed resolution of static phenomena, which is probably the basis of the common association of images with permanence and sounds with ephemerality.[2]
Although the significance of such mode-specific qualities cannot be emphasized enough, the particular achievement of our faculty of the senses lies in the linkage and convergence of these supposedly separate domains — it is only their synergetic interaction that has given human beings their evolutionary advantage.
“Despite the creation of a means of segregating information on a sense-by-sense basis, evolution did not eliminate the ability to benefit from the advantages of pooling information across sensory modalities. Rather, it created an interesting duality: some parts of the brain became specialized for dealing with information within individual senses, and others for pooling information across senses.”[3]
Only in the rarest cases are we confronted with sensory stimuli of a single modality, since we perceive our environment with five senses and thus in a multimodal way. In fact, separating our perception into several independent worlds — hearing, seeing, smelling, tasting, feeling — is an enormous task of abstraction that human beings only learn as part of their cultural socialization. In the literature, one sometimes comes across the informative observation that the media and technical apparatuses of the nineteenth century contributed to the singularizing of our senses: “For sound, movement, body, and image had an immediate relationship in the history of culture at least until technical possibilities enforced their separation … People stared as if paralyzed at the horns of phonographs. From that time onward, specializing of certain bodily functions outside the body, and hence a specializing of the senses, was necessitated.”[4] Whether or not one chooses to accept this specific assessment, it cannot be denied that technical development and human perception are closely intertwined. Ever since developments in twentieth-century media made the synchronized recording and playback of image and sound possible, our perception, as well as the use and weighting of our senses, has been transformed once again: the separation of hearing and seeing has been undermined by technology. Whereas the primacy of the eye that dominates Western culture has come to seem increasingly problematic, the interaction of the senses has been recognized as more important and is discussed more and more frequently. This is another reason why it makes sense to speak of audiovisual perception. Several mechanisms involved in the complex interplay of hearing and seeing are presented below.
The possibility of integrating multimodal sensory stimuli into meaningful units has many advantages for us — for example, improved understanding of speech. The enormous selectivity of our sense of hearing makes it possible to follow a voice attentively even in a loud environment. If one also watches the facial and lip movements of the person speaking, understanding improves enormously. Multimodal integration thus means that perception in the realm of one sense is influenced by perception in another, since the two components are integrated into an interpretation that is as consistent as possible. The linkage of visual and auditory stimuli does not result solely — as was long assumed — from mental construction; it has been shown that various sensory stimuli already converge on the neuronal level in so-called multimodal neurons.[5] Multimodal integration thus occurs at the lowest level of perception, sometimes even before the object is recognized.
In contrast to genuine synesthesia, which is considered absolute,[6] multimodal integration is dependent on the context: if one sense provides too little or unclear information, other senses step in as a corrective — in dark surroundings, for example, auditory perception becomes more important for orientation in space. Irrespective of modality, the most reliable stimulus in a given situation will dominate all the others. Factors in evaluating reliability naturally include attentiveness, experience, motivation, and previous knowledge, which makes it clear that multimodal integration cannot be reduced to processes taking place in multimodal neurons. In any case, the key to stable perception is the efficient combination and integration of information from different modalities.
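How the dominance of the most reliable stimulus might look in quantitative terms can be illustrated with the reliability-weighted (maximum-likelihood) model of cue combination familiar from psychophysics. The following Python sketch is only an illustration of that general idea under the assumption of inverse-variance weighting; it is not a model drawn from this essay, and the function name and all numbers are invented.

```python
def combine_estimates(est_visual, var_visual, est_auditory, var_auditory):
    """Fuse a visual and an auditory estimate of the same quantity
    (e.g., the direction of an event) by weighting each sense with
    its reliability, i.e., the inverse of its variance."""
    w_visual = (1 / var_visual) / (1 / var_visual + 1 / var_auditory)
    w_auditory = 1 - w_visual
    fused = w_visual * est_visual + w_auditory * est_auditory
    fused_var = 1 / (1 / var_visual + 1 / var_auditory)  # never worse than the best single cue
    return fused, fused_var

# In daylight the visual estimate is reliable and dominates the fused result:
print(combine_estimates(est_visual=10.0, var_visual=1.0, est_auditory=20.0, var_auditory=16.0))
# In the dark the visual variance rises and the auditory estimate takes over:
print(combine_estimates(est_visual=10.0, var_visual=25.0, est_auditory=20.0, var_auditory=4.0))
```

In the second call the unreliable visual cue contributes little, which mirrors the example of dark surroundings given above.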
Multisensory perception is tied to the fundamental question of which circumstances lead us to perceive qualitatively distinct sensory stimuli as deriving from a common source or as belonging to different objects. Very simple rules have been identified in answer to this question: spatial proximity (the rule of space) and occurrence at the same time (the rule of time) are crucial to the integration of stimuli of different modalities, and even an approximate fulfillment of these conditions is sufficient.[7] This makes sense, since absolute simultaneity will never be possible, if only because light and sound travel, and are perceived, at different speeds. Relative proximity and simultaneity (synchronicity) are thus elementary preconditions for the integration of acoustic and visual information. It has been demonstrated experimentally, for instance, that speech can lag as much as 250 ms behind its visual equivalent — that is, the lip movements — before the lack of simultaneity is noticed. But when the speech precedes the lip movements, the difference is noticed more quickly. This is analogous to physical reality, since sound always arrives after light. The sometimes surprising effects that result when contradictory sensory stimuli of different modalities are synthesized — for example, the fact that we perceive the voice of a ventriloquist as coming from the dummy — are known as crossmodal illusions.
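A minimal sketch of such an asymmetric tolerance window, in Python: only the 250 ms figure for lagging sound comes from the passage above; the tolerance for sound that leads the image is a placeholder value assumed purely for illustration, as are the constant and function names.

```python
AUDIO_LAG_TOLERANCE_MS = 250   # sound after image: figure reported above
AUDIO_LEAD_TOLERANCE_MS = 80   # sound before image: assumed placeholder value

def seems_synchronous(audio_offset_ms: float) -> bool:
    """Return True if an audio/video offset would plausibly go unnoticed.
    Positive offsets mean the sound arrives after the corresponding image."""
    if audio_offset_ms >= 0:
        return audio_offset_ms <= AUDIO_LAG_TOLERANCE_MS
    return -audio_offset_ms <= AUDIO_LEAD_TOLERANCE_MS

print(seems_synchronous(200))   # True: sound trails the lips, as it does in nature
print(seems_synchronous(-200))  # False: sound that precedes the lips is quickly noticed
```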
Now the question arises as to which mechanisms of multimodal integration can lead to improved perception.[8] Stein et al. report an experiment in which the sensitivity of individual multimodal neurons was tested.[9] The reaction to a bimodal stimulus (flash of light plus beep) corresponded approximately to the sum of the reactions to the unimodal stimuli (additive). By analogy to the rule of space and the rule of time, a close connection between spatial and temporal coincidence was confirmed on the neuronal level: the reaction to spatially disparate stimuli was lower than the sum of the two separate stimuli (subadditive), which corresponds to a weakening of the sensation. Stein and Meredith were also able to show particularly strong (superadditive) effects with relatively weak but spatially and temporally coincident bimodal stimuli.[10] Reactions triggered by bimodal stimuli are thus already distinct on the neuronal level from the mode-specific components of the reaction, which explains in part the perception-intensifying effect of combinations of images and sounds. The phenomena described are probably the neurological basis for Michel Chion’s programmatic observation: “We never see the same thing when we also hear; we don’t hear the same thing when we see as well.”[11]
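The additive, subadditive, and superadditive cases described above can be summarized in a small Python sketch; the classification rule, the tolerance, and all response values are invented for illustration, and only the terminology comes from the findings reported here.

```python
def classify_interaction(resp_visual, resp_auditory, resp_bimodal, tolerance=0.1):
    """Compare the response to a bimodal stimulus with the sum of the
    responses to the two unimodal stimuli presented separately."""
    unimodal_sum = resp_visual + resp_auditory
    if resp_bimodal > unimodal_sum * (1 + tolerance):
        return "superadditive"
    if resp_bimodal < unimodal_sum * (1 - tolerance):
        return "subadditive"
    return "approximately additive"

print(classify_interaction(10, 12, 23))  # coincident flash plus beep: roughly additive
print(classify_interaction(10, 12, 14))  # spatially disparate stimuli: subadditive
print(classify_interaction(2, 3, 12))    # weak but coincident stimuli: superadditive
```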
In addition to modal qualities that occur exclusively for just one sense (for example, pitch for the sense of hearing, color for the sense of sight), there are also amodal (or intersensory)[12] qualities that are perceived by multiple senses. The psychologist Heinz Werner examined such phenomena in detail as early as the 1960s: “When we say that a tone is strong or weak, that a pressure is strong or weak, we are no doubt referring in all cases to a property that is the same in all these sensory domains. Recent research has now shown that there are doubtless many more properties than psychology previously assumed that, like intensity, can be called intersensory.”[13] Werner listed the following properties with which it is possible to establish analogies across the boundaries between the senses: intensity, brightness, volume, density, and roughness. According to Michel Chion, these and other amodal qualities are in fact at the center of our perception.[14]
Using these dimensions, it is possible to relate sensory impressions of widely different modalities to one another — that is, to create intermodal analogies. This process, in contrast to multimodal integration, occurs consciously and actively, since one searches for a criterion of comparison, which is usually found in one of the amodal dimensions. For example, the question ‘What sound goes with this color?’ can be decided on the basis of brightness. The formation of such analogies is influenced by the context, since the color or loudness of an object is not an absolute value but can only be assessed in relation to its surroundings. Intermodal analogies tend to be consistent from person to person (small interpersonal variation), whereas synesthetic correspondences differ greatly between individuals (large interpersonal variation). In general, the dimensions of brightness and intensity seem to be of central importance to the formation of intermodal analogies.[15]
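As a toy illustration of an analogy formed over the brightness dimension, the following Python sketch maps pitches onto gray values. The logarithmic mapping, the chosen reference range, and the function name are arbitrary assumptions made for illustration, not values taken from the literature cited here.

```python
import math

def pitch_to_gray(frequency_hz, low_hz=65.0, high_hz=2093.0):
    """Map a pitch to an 8-bit lightness value, judged relative to a chosen
    reference range (here roughly C2 to C7): higher pitch, lighter gray."""
    position = (math.log2(frequency_hz) - math.log2(low_hz)) / (math.log2(high_hz) - math.log2(low_hz))
    position = min(max(position, 0.0), 1.0)  # clamp to the reference range
    return round(position * 255)

print(pitch_to_gray(110.0))   # low A (110 Hz): dark gray
print(pitch_to_gray(1760.0))  # high A (1760 Hz): light gray
```

The need to fix a reference range is precisely the context dependence noted above: the same pitch is rendered darker or lighter depending on the range against which it is judged.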
The psychologist Albert Wellek examined similar connections in the 1920s. Experimental testing enabled him to compile a list of six correspondences, so-called primeval synesthesias, which in his view could be found among all peoples at all times and hence were fixed in human perception: thin forms correspond to high pitches, thick forms to low pitches, and so on. “The historically demonstrable universality of these simplest sensory parallels goes so far that everyone, even today, will consider all six correspondences valid and intelligible, at least in one of the given forms.”[16] Our Western notation clearly corresponds to these primeval synesthesias, since, for example, pitches are depicted, by visual analogy, as high or low, which makes intuitive sense to us.
The principles of multimodal integration and intermodal analogy outlined here are, of course, just two elements in the interaction of hearing and seeing. We should recall that our perception is characterized above all by the complex interaction of all the senses in countless independent processes on different levels and by means of massive parallel processing. Even today — despite the valuable contributions of neurology, physiology, psychology, sociology, and so on[17] — we understand only a small portion of these connections, and only rarely do we attempt to emulate our perception by integrating information from these different strands into a coherent overall picture. On the contrary, the literature often speaks of synesthesia in a relatively undifferentiated way whenever the interplay of the senses is discussed. This essay is intended to encourage readers to differentiate at least between genuine synesthesia, multimodal integration, and intermodal analogy, although it was not possible to discuss in greater detail the associative, symbolic, and metaphorical ways of linking the senses. For these, the reader is referred to the work of Michael Haverkamp, whose model of linkage levels represents an effort to systematize the specific mechanisms of linkage between different sensory modalities and to make them useful for synesthetic design in particular.[18]
This model represents a first step toward raising representations of human perception to a more complex level.
[1] See Michael Giesecke, Die Entdeckung der kommunikativen Welt (Frankfurt am Main: Suhrkamp, 2007), 240ff.
[2] “So, overall, in a first contact with an audiovisual message, the eye is more spatially adept, and the ear more temporally adept.” Michel Chion, Audio-Vision: Sound on Screen, trans. Claudia Gorbman (New York: Columbia University Press, 1994), 11.
[3] Barry Stein et al., “Crossmodal Spatial Interactions in Subcortical and Cortical Circuits,” in Crossmodal Space and Crossmodal Attention, eds. Jon Driver and Charles Spence (Oxford: Oxford University Press, 2004), 25–50.
[4] Susanne Binas, “Audio-Visionen als notwendige Konsequenz des präsentablen ‘common digit’: Einige Gedanken zwischen musikästhetischer Reflexion und Veranstaltungsalltag,” in Techno-Visionen: Neue Sounds, neue Bildräume, eds. Sandro Droschl, Christian Höller, and Harald Wiltsche (Vienna: Folio, 2005), 112–20. — Trans. S. L.
[5] Barry Stein and Alex Meredith, The Merging of the Senses (Cambridge, MA: MIT Press, 1993).
[6] See Klaus-Ernst Behne, “Wirkungen von Musik,” in Kompendium der Musikpädagogik, eds. S. Helms, R. Schneider, and R. Weber (Kassel: Bosse, 1995), 281–332.
[7] The rule of time is exploited a great deal in sound design, for example, when images are paired with sounds that are in fact the wrong ones, yet the two are synthesized in our perception into a harmonious impression.
[8] “[Multisensory stimuli] add depth and complexity to our sensory experiences and, as will be shown below, speed and enhance the accuracy of our judgements of environmental events in a manner that could not have been achieved using only independent channels of sensory information.” Stein et al., “Crossmodal Spatial Interactions,” 25.
[9] Stein et al., “Crossmodal Spatial Interactions,” 25–50.
[10] Stein and Meredith, The Merging of the Senses.
[11] Chion, Audio-Vision: Sound on Screen, xxv.
[12] In my view, the terms amodal and intersensory are used synonymously in the literature.
[13] Heinz Werner, “Intermodale Qualitäten (Synästhesien),” in Handbuch der Psychologie, ed. Wolfgang Metzger (Göttingen: Hogrefe, 1965), 278–303. — Trans. S. L.
[14] Cf. Michel Chion, Audio-Vision. Sound on Screen, 137.
[15] The association of pitch with color via the dimension of brightness is probably the only meaningful analogy that can be made between these two domains. A familiar procedure used by subjects seeking to correlate different sensory domains via the dimension of intensity is ‘cross-modality matching.’ — Trans. N. W.
[16] Albert Wellek, “Die Farbe-Ton-Forschung und ihr erster Kongreß,” in Zeitschrift für Musikwissenschaft 9 (1927): 576–584. — Trans. S. L.
[17] The enormous importance of sociocultural factors in perceptual processes is pointed out in the following citation from Thomas Kuhn: what scientists perceive, and how they perceive it, is always dependent on the paradigm within which they operate. “What is built into the neural process that transforms stimuli to sensations has the following characteristics: it has been transmitted through education; it has, by trial, been found more effective than its historical competitors in a group’s current environment; and, finally, it is subject to change both through further education and through the discovery of misfits with the environment.” Thomas Kuhn, The Structure of Scientific Revolutions (Chicago: University of Chicago Press, 1970), 196.
[18] Michael Haverkamp, Synästhetisches Design: Kreative Produktentwicklung für alle Sinne (Munich: Hanser, 2009).