1.3. Problem, Goal & Approach
1.3.1. Problem & Research Question
1.3.2. Goal Definition
1.3.3. Solution Strategy
1.3.4. Expected Outcome
1.4. Overview of this Paper
2. Human-Computer Interaction
2.1. Computers in everyday life
2.2.2. A new medium evolves
2.3. Affective Computing
2.4. The Essence of Agency
3. Embodied Agents
3.2. Multimodal Interfaces
3.2.1. Speech as input
3.2.2. Speech as output
3.2.4. Visual Speech
3.2.5. Other senses
3.4. Application Areas
3.4.1. Conversational Agents
3.4.2. Pedagogical Agents
4. History of Multimodal Systems
4.1. Previous Embodied Agents
4.1.3. Cosmo — The Internet Advisor
4.1.6. Towards Social Interaction
4.2. User Testing of Agents
4.2.3. COSMO — The Internet Advisor
5. Augmented Reality and Agents
5.1. The Essence of Augmented Reality
5.2. Augmented Reality as Interface
5.2.1. Transitional Interfaces
5.2.2. Tangible User Interface
5.3. Agents in Augmented Reality Interfaces
6. Design Objectives for Embodied Agents
6.2. Human-figure animation
6.2.1. Body animation
6.2.2. Facial animation
6.3. Social Interface
6.6. Application domain
7. The Implementation
7.1. Basic Considerations
7.1.1. Alternative Approaches
7.1.2. Research situation
7.2.1. Preliminary considerations
7.2.2. Software Candidates
7.3. Details of the Implementation
8. Directions for Research
From early on, mankind has been fascinated by the idea of creating a human-like creature from inanimate material. Jewish legend tells of a Golem made from clay and mystically filled with life (Idel 1990). The theme was taken up by modern authors in novels like "Frankenstein" and found its way into more recent productions like "Toy Story". All are fuelled by the same thought: to ease human life by natural interaction with a human-like partner. Today's technology does not rely on clay but on computers and electronic devices. With digital tools at hand, research groups around the world are focusing again on this old idea of natural interaction with a human-like partner; this time the partner is computer generated.
Such interaction might not necessarily be 'better' than others in terms of efficiency, but perhaps in terms of quality. As Maes (1994) pointed out, intelligent interactive agents are useful in "fail-soft" systems, i.e. systems that preserve their essential operability even if parts of them fail. Applied to software agents, this means that a failure of communication with the agent must not cause the breakdown of the whole process. Agents enhance the quality of the application, but they are not essential to it: a "nice to have" gimmick.

Using advanced computer graphics it is possible to generate almost photo-realistic imagery of three-dimensional virtual characters. In films it is now common to see digital characters interacting with real actors, and virtual characters are becoming increasingly indistinguishable from their real counterparts. However, until recently such characters always inhabited the screen space, separated from the real world, and they could not be generated in real time.
Augmented Reality (AR) is a research field concerned with overlaying computer graphics on the real world so that virtual imagery is seamlessly blended with the real surroundings. Unlike in film and television, these graphics are rendered in real time. With this technology virtual characters could co-exist in the same space with real humans. Despite the interesting possibilities that it offers, there has been little research work on Augmented Reality Agents. The goal of this paper is to review and summarise research work on embodied digital characters, to identify promising areas of application of augmented reality interfaces for the addition of embodied agents, to present some results of a prototype implementation, and to envisage further research directions.
Let us imagine we were in a future meeting of the senior managing board of a big car manufacturer. The issue is a new car model and what its engine design should look like. All participants are immersed in an Augmented Reality Environment (ARE), where a computer-generated representation of the car's engine compartment is superimposed over the real world (imagine a virtual three-dimensional model that each participant can see in correct perspective from his individual position). The participants can discuss freely with each other and have an almost hands-on impression of the engine's location and the constraints it has to match. Each of them can interact with it, and everybody else will see the change. As the research & development (R&D) department is thousands of miles away, it has sent a proxy: a virtual agent. It has knowledge of the application domain and problems related to it, is capable of following the discussion, gives sound explanations, and is sensitive to the conversational context. Now the managers will consult the agent for details and background information, and a conversation will develop. In the end, the group will have decided on an innovative car design even though none of the R&D experts were physically present; the agent and its abilities alone provided sound support. Artificial Intelligence has striven for such omnipotent agents for many years, but a sound result has not emerged yet. We might therefore shift our focus to other areas of application, for example education or entertainment.
Transferring the presentation scenario from the business world to, for example, a museum means shifting from decision support to process support. In this area the skills of the agent are helpful in complementing other means, but not crucial. Nonetheless, the skills the agent has and the actions it is able to perform are important to the joy and pleasure of all participants. Presenting interesting stories in a museum involves a great deal of connecting facts into stories that relate back to the exhibited artefacts and the audience. If a museum guide additionally reacts flexibly to the audience's questions and suggestions, people will most likely regard the tour as successful and interesting. The audience's contentment can then be clearly attributed to the convincing presentation of the agent.
Having envisaged such applications, the following chapters give the reader some insight into the problems, the approaches, the application domains, the key questions to be answered, and the possible outcome of this paper.
1.3. Problem, Goal & Approach
1.3.1. Problem & Research Question
Presentation through and collaboration with an agent in AR is a relatively new field of research. First, in recent years research in collaboration has focused on Virtual Reality (VR) and its improvements. The main question was how representations of other users could be made more lifelike in a VR Environment (VRE). Little was done to improve the quality and thoroughness of interaction. Second, augmented reality is a very young field of research, and the chosen domain of presentation has only been explored to a small degree before. Third, computers are used to present complex information to the user. The mismatch between humans ('analogue': senses, thoughts, emotions) and computers (digital: numbers, computation, complexity) suggests the idea of bridging the two sides by using human-like interfaces on the computer. At the time this research was conducted at the Human Interface Technology Laboratory, nobody there had done anything like this before, and there was no expertise in the area of digital characters. The task and the questions behind it were:
"What is the relation between embodied agents, augmented reality and collaboration? Review the literature and compile a comprehensive report with appropriate references about your findings! Implement and possibly test a prototype! What are your recommended directions for further research?"
1.3.2. Goal Definition
From the hypothetical scenarios given above, we can learn that agents might be a good idea for interfacing between humans and computers. Our goal is to find evidence in the literature of how agents have been employed as effective means of communication in conversation-centred areas such as presentations. This includes a review of implemented systems and of their test results. Then we will look at combining AR technology with agents and at the special issues that might arise. With this background we will develop design objectives and requirements for agents in AR. A prototype agent architecture with a digital character as user front-end will be implemented. This will conclude our feasibility study on agents as presenters in augmented reality. From our considerations during the literature review and our practical implementation we want to identify a list of fruitful areas of research and derive a list of challenging new questions.
1.3.3. Solution Strategy
Knowing very little about agents, their incorporation into AREs and how they might affect the interaction between humans and computers, we need to ask some basic but essential questions. What constitutes natural communication among humans, how does it relate to human-computer interaction, and how may we ease this interaction? Which communication cues are essential, and what means can mediate them? How can human behaviour be explained in conversational settings, and what implications can we draw from theories? We will mainly look into the appropriate literature in the fields of psychology and sociology to clarify these questions. The findings are presented in Section 2.
Having set our focus on embodied agents, we will review the literature. What technologies were used to mediate what interaction? We are especially interested in which kinds of systems have been proven effective in the sense of providing natural interaction to the user. The architecture of the beneficial systems will also be of interest. A review of papers, articles and reports of other research groups will be the main source of our conclusions. We expect to substantiate the initial intuition that an AR agent might be a good idea. Read Sections 3 and 4 to learn more about the findings.
Having learned about other approaches, we need to define our own. But what is different in an augmented reality setting, and are there any special conditions we must be cautious about when building an agent for an ARE? What features and abilities constitute an agent, and what measurements are to be taken to evaluate it? What areas of requirements can be identified when designing an agent? We have reviewed some proposed sets of criteria for agents, extended them and built our own. Read about it in Sections 5 and 6.
Concerning the implementation, some questions arise. Should we build a new architecture or use an existing one? What specific advantages and disadvantages do both approaches have? The outcomes are mostly based on our own practical tests. The experience gained during this stage of research greatly contributed to identifying further directions of research. You will find the results in Sections 7 and 8.
1.3.4. Expected Outcome
We want to understand how human communication works, what means affect conversation in what way, and how agents can be designed to use these effects. We expect to build a thorough understanding of the extent to which an agent can be useful in shared spaces and how it has to be built to engage humans in effective or pleasant social interaction. Ideally, a user study should be conducted to confirm our expectations regarding the effectiveness of agents.
1.4. Overview of this Paper
In order to develop a compelling virtual character there are many research problems that need to be addressed: language processing and generation, reasoning and machine learning, computer graphics, artificial intelligence, cognitive and social psychology, philosophy and sociology. Although interesting, most of these topics are beyond the scope of this paper. We will introduce and discuss topics from these fields as far as they contribute to our considerations. For further in-depth elaboration see the appropriate literature mentioned in the related sections and the subsequent chapters.
In the remainder of this work we will give an introduction to user interfaces and some background on human interaction theories. Then the notion of Embodied Agents (EA) will be introduced and some of their properties explained. Next, we will summarise research on previous EAs and discuss several user studies conducted with such agents. Defining Augmented Reality and how agents relate to it will precede our work on design requirements for agents in the presentation domain. After that, we will present our prototype implementation of an agent in AR and the considerations that led to its design. Finally, we will describe our findings about promising directions for future research.
As we could not include all the material produced for this work in this paper, there is a complementary website. Visit http://www.cs.uni-magdeburg.de/-cgraf/NZ/HITLab/Report/ to find out more about adjacent topics, detailed background information and pointers to appropriate literature.
2. HUMAN-COMPUTER INTERACTION
2.1. Computers in everyday life
Fifty years ago, when the UNIVAC (see Figure 1) was introduced as the first commercially available general-purpose computer for civilian use, the computer was utilised to take the burden of vast computations from humans, but it could only be handled by experts. Nowadays, "more than 72 million employed people in the US age 16 and over - about 54 percent of all workers - used a computer on the job" [Web1], and consumer products are commonly computerised. Computer chips can be found in almost any household today, be it in washing machines, mobile phones, VCRs or microwaves - everybody has to cope with computers and handle them.
Figure 1: The U.S. Census Bureau was the first client to order a mainframe electronic computer.
© U.S. Census Bureau
Back in the 1950s, technological constraints (memory, speed, input & output channels) of early computer systems forced a concentration on functionality. Only a few lines of the programming code were concerned with the user interface. The end user had to be an expert to run the system. Lifting the hardware limitations has freed resources for considerable efforts to improve the user interface. "The effect of this rapid increase in the number and availability of computers is that the computer interface must be made for everybody instead of just the professional or computer hobbyist" (Eberts, 1994).
Our world today shows that computer technology runs every sort of process and machinery. To accommodate the user in performing his task, we must consider the users first (so-called 'user-centred design'). Machines should work through routine, tedious and error-prone tasks so that humans can concentrate on critical decisions, planning and coping with unexpected situations. Human judgement is necessary for unpredictable events in the world (the open system) in which actions must be taken to preserve safety, avoid expensive failures and increase product quality. To achieve a higher task effectiveness of humans, we need to decrease the burden of using technology.
Computers are at the foremost position in permeating all levels of our daily life: they work in all kinds of commonly used machinery and equipment. They run office, home and entertainment applications, and they work in industrial and commercial systems as well as in exploratory, creative and co-operative systems. Whenever humans come into contact with machines, they have to get along with the rather emotionless world of modern technology. Addressing this area, Human-Computer Interaction (HCI) is "a discipline concerned with the design, evaluation and implementation of interactive computing systems for human use and with the study of major phenomena surrounding them" [Web2]. As Faulkner (1998) puts it, "Human-Computer Interaction is the study of the relationships that exist between human users and the computer systems they use in the performance of their various tasks". HCI is an interdisciplinary field, relating computer science, psychology, cognitive science, human factors engineering (ergonomics), sociology, design, engineering, art, anthropology, physiology, artificial intelligence and others. See Figure 4 for a schematic illustration of the framework. The focus on the human and his needs becomes even clearer when considering the term Human Factors Design.
Figure 4: Interrelationship among topics in HCI (Courtesy SIGCHI)
Sometimes computers have an important place in life-critical systems, e.g. in power plants or surgery support systems. Intuitively, it becomes clear that such systems should be easy to use - that is the motivation behind HCI. If they are not, devastating consequences can result, as the public had to learn in the 1979 Three Mile Island nuclear power plant accident [Web16][Web19]. The control panel did not give the operators any useful information to remedy the situation when problems arose. Hundreds of alarms (audio and visual) went off at the same time. The system could not provide useful information about its exact current status, displayed too many uncoordinated warnings without priority indicators, and the recovery actions to be taken were not obvious. As a result, the core of the reactor almost melted down and the engineers could hardly stop it. Not only to avoid such fatal incidents, but to ease humans' work with technology in general, the aim of HCI is to develop or improve human-computer interfaces regarding (Shneiderman, 1998)
- safety (reduce errors generated by users, or offer better handling for such errors),
- utility (mapping of interface elements with user's tasks),
- effectiveness (does the system perform the tasks correctly?),
- efficiency (does that improve the user's efficiency/productivity?), and
- usability (can it be used?) of systems in general.
HCI aims at providing users with interfaces that make them more efficient in performing a task using the machine's abilities and advantages. Efficiency is rated by comparing performance on the computer interface to that on an equivalent manual system. This requirement is essential since "all too often computerized applications are produced that do not make the user's task easier and more satisfying, nor do they save time" (Faulkner 1998). Faulkner concludes: "The task of HCI is to design for people, for tasks and environments". The result of the design process is the user interface (UI). It has to be task-appropriate, efficient and suitable for the user. The UI connects the human to an environment that is filled with technology, mostly computer systems. The objective in designing the UI is to minimise the human's workload in handling the UI, so that he can invest as much as possible into solving the task. Thus the UI has to be the mediator between two different worlds: technology and humans.
During the last twenty years there has been significant effort to bridge these two worlds and establish a mediator in between. Research shows that the percentage of code devoted to the UI and the percentage of money for its development have increased (Smith and Mosier 1984; MacIntyre, Estep, and Sieburth 1990; Rosenberg 1989). But merely computerising a formerly manual process will not guarantee an increase in efficiency (Eberts 1994). Findings from Hansen, Doring and Whitlock (1978), Kozar and Dickson (1978), and Gould (1981) show that inappropriate UI design can make tasks more difficult and time-consuming. This is counter-productive to what UI design strives for! Simply writing more code or investing more money seems not to be enough. In fact, concrete "evidence of improved usability is difficult to find", and experience shows that people still have significant problems with computers (Eberts, 1994). There was much hope that the Object-Action Interface (OAI) model would overcome these problems.
In the OAI model objects and actions are mapped from the real-world onto metaphorical objects and actions in the interface. Successful OAI interfaces provide the following features [Web9]:
- high visibility of interface objects;
- high visibility of actions of interest;
- and incremental actions.
The Direct Manipulation (DM) approach is the most prominent representative of OAI. DM is an interface concept that offers, among other things [Web9]:
- Display of current status of tools (overall view of system's status);
- Be as close to reality as possible (e.g. provide an appropriate representation or model of reality);
- Allow maximum control to the user;
- Display the result of an action immediately;
- Provide rapid response and display.
Compared to former UI concepts like command-line interfaces, DM is more suitable to the user and his task. The user can find a solution through a succession of various actions on the interface. He is supported in finding this solution by suitable representations of problems. Think of it as counting on your fingers: it gives you a physical, real representation of numbers. The advantages of a DM object are that it represents the problem in a more intuitive way, that it combines both data entry and data display in the same physical location, and that it gives immediate feedback.
Direct manipulation objects are a nice addition to an interface, but they have their limitations too. The following list provides some hints on key problems [Web9].
- Many objects with many possible actions could be present which is potentially confusing.
- Being bound to a finite screen, DM objects may consume valuable space, making it necessary to hide some information off-screen, which requires scrolling and multiple actions. When more detailed information is needed, the display quickly becomes cluttered.
- Menus hide their content at first. Not all options are displayed at the same time, and the user has to search and retrace them. This takes time and guided attention - it is tedious.
- The meaningfulness of visual components is essential. Users must learn the semantics of the components (e.g. a slider), which may lead to errors. The visual representation (e.g. an icon) itself may be misleading.
- The choice of proper objects (including icons, menus, labels, buttons, other interface elements) and proper actions is difficult. The metaphor should be understood instantly without much attention. It is advisable to use simple metaphors and models with an associated minimal set of concepts.
Summarising these points, the DM approach seems to have reached its limits concerning the problem of bridging the conceptual gap when humans communicate with or through computers.
2.2.2. A new medium evolves
Over the last ten years, computers have gained another meaning. The growing demand for information and global connectivity has pushed the popularity of world-wide computer networks. The Internet has turned computers into a means of communication. They connect people, transport content and thus influence people. Having been tools for humans in former days, they have now transformed into a medium. This new purpose of computers might be a reason to reconsider the old DM metaphor. Such an interface might have been good for computers as tools to perform certain tasks. But the nature of the computer as a medium today is quite different from its nature as an instrument of action - think of a radio in contrast to a hammer. Thus the applicability of DM has to be reconsidered. One consequence could be to think about other metaphors, or at least alternatives to choose from [Web20].
As Reeves and Nass demonstrated in several studies (Reeves 1996), human-computer interaction follows the same principles as human-human interaction. That means we have an inherent tendency to respond to media and technical systems in ways that are social and common among humans. The question is: how do we allow the user to behave and communicate naturally when interacting with technology? One would expect that this lessens the cognitive workload for the user and contributes to user satisfaction. With a proper design, task performance would increase as well. In the following chapter we conceptually suggest such a more human-like way of computing.
2.3. Affective Computing
The usual way humans interact with each other is face-to-face and through language. In such communication at least two humans exchange meaning verbally. But words and sentences are not the sole part of the communication. Potentially, humans can communicate with all five senses: sight, sound, touch, taste and smell. Using these channels, many other cues, verbal and non-verbal, are incorporated when transferring meaning. The sender employs these cues subconsciously, and the receiver understands them in the same fashion - subconsciously. A vital feature is that our everyday interaction with each other and the world around us is a multi-sensory one, each sense providing different information that builds up the whole.
A key element of interaction is speech (including listening). It is not a one-channel but always a two-channel process. The sender utters words, complemented by gaze. The receiver gives back some kind of acknowledgement to the sender, poses questions or takes a turn, i.e. there is a feedback channel (see Figure 6).
Figure 6: The sender-receiver model (a.k.a. discourse model) of human communication
The fluency, prosody and intonation of sentences and words tell us important things about the speaker, e.g. about his inner state or the circumstances he is speaking of, or they give or emphasise a certain meaning.
Aside from the language-centred cues there are multiple sources of sensory information, the so-called non-verbal cues. These cues can support or complement the verbal utterances, e.g. pointing somewhere and saying 'there!'. They include gestures like hand movements, head and eye movements as well as body movement and body orientation. Non-verbal cues can be expressions of their own, i.e. body language. Turning your back on someone and crossing your arms in front of your chest will tell anyone that you do not want to interact with this person, even if you don't say a word. On the other hand, when someone identifies a known person, the typical reaction is tracking with the eyes and orienting the body towards the person in question. That will be understood as a sign of openness and a disposition to start a conversation. Non-verbal cues play an important role for both the speaker and the listener. For the speaker they are a necessary means to accentuate his story. They support the listener in understanding what the speaker wants to tell. Without non-verbal cues the conversation would be 'stiff' and quite unnatural.
An important aspect of the context of a conversation are the emotional states of the participants. They play a key role in truly understanding what each one wants to tell. Thus the ability to recognise, interpret and express emotions - commonly referred to as "emotional intelligence" (Goleman, 1995) - plays a key role in human communication. Related is empathy, another typical property of human-human interaction. It means that someone shows understanding of the other's situation and feels with the individual; e.g. after a friend has suffered a great loss, one could express sympathy by simply hugging him. Without some kind of empathy, human-human conversation is cold and distant, consisting solely of facts - not social at all.
On the other hand, the use of computer technology often has unpleasant side effects, some of which are strong, negative emotional states that arise in humans during interaction with computers. Frustration, confusion, anger, anxiety and similar emotional states can affect not only the interaction itself, but also productivity, learning, social relationships, and overall well-being. Affective Computing takes up the social side of human-computer interaction and tries to actively support human users in their ability to regulate, manage, and recover from their own negative emotional states, particularly frustration. Striving for this goal, affective computing tries to resemble human-human communication, especially on the emotional side of the interaction.
Emotion affects judgement, preference and decision-making in a powerful, yet elusive way. It is therefore an indispensable element in any interaction with technology. Since emotions are an integral part of communication, computers with affect-recognition and expression skills would allow a more natural and thus improved human-computer interaction. An affect-recognising computer can "learn" during an interaction by associating emotional expressions (like pleasure or displeasure) with its own behaviour, as a kind of reward and punishment. The software or system would automatically adapt to the user's needs. An affective computer system could detect user frustration, show sympathy for the user, and offer assistance, encouragement or comfort. Klein (1999) showed that user frustration levels were significantly lower when such assistance was provided.
Since human interaction is improved by multi-sensory input, it makes sense to ask whether multi-sensory information would benefit this endeavour. We can argue that only a multimodal system is a sound foundation for subsequent (re-)actions of an affective UI, because only the sum of multiple cues from different channels indicates what an utterance truly means or how a person feels. If a system knows how its users feel, it can react appropriately to these emotions. It can guide, help, change its appearance or simply be as unobtrusive as possible.
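To make this argument concrete, the following sketch is purely illustrative: the sensor channels, weights and threshold are assumptions for the sake of the example, not part of any system described here. It shows how cues from several modalities might be fused into a single affect estimate that then triggers a supportive reaction:

```python
def estimate_frustration(cues):
    """Fuse several hypothetical sensor channels into one frustration score.

    Each cue is assumed to be normalised to 0..1; only the weighted
    combination of channels counts as evidence, mirroring the multimodal
    argument in the text. The channel names and weights are invented.
    """
    weights = {"error_rate": 0.4, "typing_burstiness": 0.3, "negative_speech": 0.3}
    return sum(weights[name] * cues.get(name, 0.0) for name in weights)

def respond(cues, threshold=0.6):
    # An affective UI could offer help, adapt, or stay unobtrusive.
    if estimate_frustration(cues) >= threshold:
        return "offer assistance and encouragement"
    return "stay unobtrusive"

# A user making many errors, typing erratically and muttering negatively:
print(respond({"error_rate": 0.9, "typing_burstiness": 0.7, "negative_speech": 0.5}))
```

The point of the sketch is that no single channel crosses the threshold on its own; only their combination does, which is exactly why a multimodal foundation matters for an affective UI.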
Bringing affective computing into a more human form led to the development of Social Agents, which resemble humans, i.e. are 'anthropomorphic'. They adapt to the user's context and system status, constructing intelligent responses to users and interacting verbally, complemented by non-verbal cues. Social Agents are not intended to be clones of humans; rather, they apply principles of human intelligence relevant to the specific technological and usage situations.
Equipping an interface with such agents will most likely increase the subjective pleasure of the interaction. Additionally, it might positively influence the user's task performance. A functioning system will make human-computer interaction more intuitive, i.e. communicating with a computer will no longer require special technical skills. We can simply use the same communication skills as in everyday life. The cognitive load of using the interface would vanish, and we could free more time for concentrating on and solving the task.
Before considering how to implement a socially able agent, we need to know what a general agent consists of.
2.4. The Essence of Agency
The essential attributes constituting an agent shall be defined here. A note from a popular textbook on artificial intelligence shows that we should not expect a mathematically sound definition: "The notion of an agent is meant to be a tool for analysing systems, not an absolute characterisation that divides the world into agents and non-agents" (Russell and Norvig 1995).
A common requirement for an agent is that it acts, or can act, independently. In contrast to real-world agents, we deal with agents that 'live' inside computers: software agents. Sanchez (1997) identifies a hierarchy of eight types of software agents (see Figure 7). In his taxonomy he stresses the importance of the entity that perceives and benefits from the notion of agent. For example, software developers use programmer and network agents as abstractions to cope with the complexity of system design and computer networks. In the same manner, the end user shall benefit from user agents. From findings by Friedman (1995) he concludes that end users can better deal with system complexity by viewing programs as animistic entities. The terms interface agents or user interface agents have frequently been associated with this view of agency (see Laurel 1990, Kozierok and Maes 1993, and Wooldridge and Jennings 1995). But "interfaces" also exist between software modules and communicating computers (or between any independent systems). For clarity and to prevent ambiguity we will use "synthetic" to distinguish this class of agents.
illustration not visible in this excerpt
Figure 7: A Taxonomy of Agents (Sanchez 1997)
Given this diversity, one may ask what all agents have in common. Franklin and Graesser (1996) identify the basic properties of a general agent as:
- reactive (sensing and acting), i.e. responds in a timely fashion to changes in the environment;
- autonomous, i.e. exercises control over its own actions;
- goal-oriented (pro-active, purposeful), i.e. does not simply act in response to the environment;
- temporally continuous, i.e. is a continuously running process.
In short, "an autonomous agent is a system situated within and a part of an environment that senses that environment and acts on it, over time, in pursuit of its own agenda and so as to effect what it senses in the future." (Franklin and Graesser 1996)
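Franklin and Graesser's four properties can be made concrete with a minimal sketch. The thermostat-like agent below, and every name in it, is a hypothetical illustration invented for this report, not taken from any system cited here:

```python
class MinimalAgent:
    """A toy agent exhibiting Franklin and Graesser's four properties.

    All class and attribute names are illustrative only.
    """

    def __init__(self, goal_temperature):
        self.goal = goal_temperature   # goal-oriented: pursues its own agenda
        self.heater_on = False         # internal state it controls itself

    def sense(self, environment):
        # reactive: perceives the current state of its environment
        return environment["temperature"]

    def act(self, environment):
        # autonomous: decides on its own whether to switch the heater,
        # acting so as to affect what it will sense in the future
        temperature = self.sense(environment)
        self.heater_on = temperature < self.goal
        if self.heater_on:
            environment["temperature"] += 1.0
        else:
            environment["temperature"] -= 0.5


def run(agent, environment, steps):
    # temporally continuous: a continuously running sense-act loop
    for _ in range(steps):
        agent.act(environment)
    return environment["temperature"]
```

Run long enough, the loop keeps the sensed temperature oscillating around the agent's goal, which is exactly the "acts on it, over time, in pursuit of its own agenda" of the definition above.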
A further classification might be helpful to relate a wide variety of agents to each other, e.g. according to (a) the tasks they perform, (b) the range and sensitivity of their senses, (c) the range and effectiveness of their actions, or (d) how many internal states they possess. Other researchers suggest different taxonomies, e.g. Brustoloni (1991). Focusing on synthetic agents, different researchers provide criteria that programs must pass to be considered "true agents" (e.g. Foner 1993).
Our defining criterion will be whether the agent is communicative and socially able, i.e. whether it communicates with people. Then we speak of a Social Agent (see last chapter), Communicative Agent (Franklin and Graesser 1996), Synthetic Agent (Sanchez 1997), or Embodied Agent (Cassell et al. 2000). Communication is a multi-party, multi-modal process that includes not only the subjects but also their history, affective states and the context of the conversation. For more information on human communication see the accompanying web site.
Agents designed to be perceived directly by end users and to perform tasks on their behalf are an innovative class of agents that can facilitate human-computer interaction. The potential of agent-based user interfaces has been discussed extensively (Kay 1984, 1990; Laurel 1990, 1991; Negroponte 1990, 1995). On the other hand, as we have seen in earlier chapters, applications that are primarily output-oriented might not gain much from such an agent: agents simply need too much time to interact with the human. But applications that rely on user satisfaction, that have a positive relationship between the interacting entities as a central element, or that aim at the user's pleasure can benefit from social interaction with agents. We typically find such applications in learning environments, museums and entertainment. This is exactly the area this paper focuses on in order to unlock the potential of social interaction between humans and computers.
In this section we have seen that humans' interaction with technology has been governed by efficient but not necessarily enjoyable or self-explanatory interfaces. Affective computing tries to employ concepts known from human-human communication to ease the human's effort of interacting with technology. The subclass of affective interfaces this paper focuses on is the class of embodied agents. The next section takes a closer look at those.
3. Embodied Agents
Humans naturally (if unconsciously) attempt to use "a certain system of interaction with other intelligences — a system of social interaction" (Doyle 1999) when interacting with computers. Humans have both evolved and learned that system, and we "implicitly have expectations about the social and emotional behavior of machines, and even treat them in social ways" (Doyle 1999). However, these expectations remain largely neglected in commercially available software and hardware systems. Although users may speak and gesture at their computers, the machines do not gesture back or engage in human-like communication. This led to the idea of 'wrapping' the agent in a 'hull' of appropriate visualisation and behaviours: the Embodied Agent (EA).
In this section we describe research that attempts to develop embodied characters that can engage more naturally with the user.
When considering human-like virtual characters, two terms are often used and confused: "avatar" and "agent". The first is a graphical representation of the user as he or she is immersed in a virtual reality or three-dimensional graphics environment. An avatar is not autonomous: it relies entirely on human input for speech, gestures, movements etc. In contrast, an agent is an autonomous being, producing appropriate (re)actions and utterances itself; the user is not obliged to steer it. Even if avatars and agents resemble each other in outer appearance, in the animations they exhibit and the actions they perform, this key feature distinguishes them. Consequently avatars and agents are typically used for different types of applications. Avatars are often employed in chat systems or online communities to represent a human being. Agents can serve a number of roles, such as pedagogical agents (Johnson 1995; Rickel and Johnson 1999; Rickel and Johnson 2000), which are utilised in educational environments to train users on specific tasks, mostly manual work; or conversational agents, which are designed to engage the user in a conversation to accomplish or facilitate the desired task and are employed mostly in information delivery. Agents and avatars could be implemented in a wide variety of forms, but as we have already seen from the findings of Reeves and Nass (1996), human-like representations are likely to be better suited for social interaction. Therefore we focus on EAs in this paper, especially on the conversational ones, the Embodied Conversational Agents (ECA).
An embodied conversational agent may be defined as a virtual graphical character that understands natural human communication cues and responds with its own speech and gesture output. Developing such an agent is an extremely complex task that encompasses research from a number of different areas, including:
- Multimodal interfaces
- Application Area
- Conversational models
We will discuss the first of these topics in the following chapters in more detail. The remaining topics are interesting in the broader context but not essential, and are thus not elaborated on exhaustively in this report. Some concepts from those areas may be used later in the text with a short explanation. The interested reader can browse the complementary web site for further detailed information.
3.2. Multimodal Interfaces
Among the strengths of social communication are the use of multiple modes and multiple information types and its inherent flexibility. An ECA should mimic human behaviour and responses in a natural way to be believable to its conversation partner. Moreover, its perceptual mechanisms need to support interpretations of real-world events that can result in real-time actions of the kind people produce effortlessly when interacting with their environment. These properties constitute a multimodal system: a system that is able to integrate multiple modalities and thus offers a higher-bandwidth communication interface. As an interface to computers this means that several channels can be used at the same time, providing additional or complementary information.
Considering the five senses, language (as a form of sound) is the ability that distinguishes humans from animals. Recognising speech has the advantage that it happens almost automatically, with little attention. While producing speech the human body can move freely, possibly engaged in some task, e.g. pointing at something. Think also of disabled people who cannot use other means of interaction.
3.2.1. Speech as input
In the English language we can identify 40 phonemes, the atomic elements of speech (Dix et al. 1999). But language is more than simple sounds. Emphasis, stress, pauses and pitch can all be used to alter the meaning and nature of an utterance. The alteration in tone and quality of phonemes is termed prosody; it conveys a great deal of the actual meaning and emotion within a sentence. Prosodic information gives language its richness and texture, but it is very difficult to quantify and thus to reproduce within the computational framework of computers. Another problem is that phonemes sound different when preceded or followed by different phonemes, a phenomenon termed co-articulation. We need these distinctions for later elaboration, and as we can see, using language is anything but easy.
People who do not regard themselves as computer literate find the idea of conversing naturally with the computer appealing. Indeed, sometimes synthesised speech becomes the primary communication channel, e.g. for people with visual impairment. Today single-user, limited-vocabulary systems can work satisfactorily, but there are still problems when employing speech. On the input side, speech can only be applied to very specialised tasks because recognising a complex vocabulary is hard for computers to learn. The speech recognition process poses problems itself: accents, dialects, different intonation of words and 'continuation' noise such as 'Umm' to fill gaps in natural speech confuse the machines. Interference from background noise further complicates the extraction of meaningful sound.
Even if all the technical problems were overcome, it would still be a problem for computers to interpret natural language. It is full of arbitrary meaning, sometimes meaning that is contrary to what the speaker wants to express (think of sarcasm, irony etc.). The computer would have to possess huge world knowledge to correctly interpret all the human diversity in language. And there is a fundamental difference between humans and computers: we concentrate on extracting the meaning from the whole sentence we hear, rather than decomposing sounds into their constituent parts, analysing the structure (i.e. syntax), assigning meaning (i.e. semantics) and intention (i.e. pragmatics).
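The decomposition into syntax, semantics and pragmatics can be illustrated with a deliberately naive sketch. Every function and rule below is a hypothetical toy invented for this report; real language understanding systems are vastly more complex:

```python
def syntax(utterance):
    # toy syntactic analysis: split into word tokens
    # (real parsers build full grammatical structures)
    return utterance.lower().rstrip("?!.").split()

def semantics(tokens):
    # toy semantic analysis: assign a literal meaning
    if tokens[:2] == ["can", "you"]:
        return ("yes_no_question", " ".join(tokens[2:]))
    return ("statement", " ".join(tokens))

def pragmatics(meaning, context):
    # toy pragmatic analysis: in a service dialogue, "Can you X?"
    # is an indirect request to do X, not a question about ability
    kind, content = meaning
    if kind == "yes_no_question" and context == "service_dialogue":
        return ("request", content)
    return meaning

def interpret(utterance, context):
    # the full toy pipeline: sounds/words -> structure -> meaning -> intention
    return pragmatics(semantics(syntax(utterance)), context)
```

The pragmatic step is where world knowledge enters: the same literal meaning yields a different intention depending on context, which is exactly what makes the task so hard for machines.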
3.2.2. Speech as output
Using synthesised speech as output meets challenges as significant as those on the input side. Humans are "highly sensitive to variations and intonation on speech" (Dix et al. 1998). Listening to synthesised speech, they are often intolerant of its imperfections. Speech synthesisers rarely produce natural-sounding speech; they present mostly monotonic, non-prosodic tones, and humans find it hard to adjust to such an impartial and emotionless presentation. Some of today's synthesisers can deliver a degree of prosody, but "in order to decide what intonation to give to a word the system must have an understanding of the domain" (Dix et al. 1998). If the feedback to the user consists only of a relatively small set of unchanging messages, a human speaker can be recorded and the messages played back on demand, yielding much more acceptable speech. But the dynamic production of speech still requires huge efforts and is unlikely to be solved satisfactorily in the near future.
We can conclude that using speech for interfacing between humans and technology is ambivalent, but it seems to be the best channel we can effectively use. On the one hand, we can rely on the highly elaborated functions of the human brain to instantly convey messages. On the other hand, interface designers have to cope with humans' expectations and implicit reasoning, which are not even conscious to them. Non-attentive behaviour displayed through body movement, posture, facial expression and gestures all contributes to what the persons in a conversation perceive from each other. Information transferred through this unconscious behaviour conveys a great deal of meaning, personal feeling and context.
For social interaction, aside from language, humans use the visual system as the predominant channel for information transfer. Thus the computer interface should actively use sight as well. Observations of the surrounding world may complement the information gathered by sound, or they may add entirely new aspects. A person who shrugs the shoulders while saying "Hmmmm" in a monotonous way probably appears disappointed or disinterested. The same person would be considered satisfied and happy when making the same sound while being observed eating ice-cream with a smile on the face. Consequently the agent itself should have the ability to make movements and exhibit certain displays of emotion. This would help others to understand how the character feels and what some words really mean against a certain 'emotional' background.
3.2.4. Visual Speech
Speech as a multimodal phenomenon is supported by experiments indicating that our perception and understanding are influenced by a speaker's face and accompanying gestures, as well as the actual sound of the speech. Many communication environments involve a noisy auditory channel, which degrades speech perception and recognition. Visible speech from the talker's face (or from a reasonably accurate synthetic talking head) improves intelligibility in these situations. Visible speech also is an important communication channel for individuals with hearing loss and others with specific deficits in processing auditory information.
The number of words understood from a degraded auditory message can often be doubled by pairing the message with visible speech from the talker's face. The combination of auditory and visual speech has been called super-additive because it can lead to accuracy that is much greater than the accuracy of either modality alone. Furthermore, the strong influence of visible speech is not limited to situations with degraded auditory input. A perceiver's recognition of an auditory-visual syllable reflects the contribution of both sound and sight. For example, if the ambiguous auditory sentence "My bab pop me poo brive" is paired with the visible sentence "My gag kok me koo grive", the perceiver is likely to hear "My dad taught me to drive". Two ambiguous sources of information are combined to create a meaningful interpretation: the McGurk effect (McGurk and MacDonald 1976).
There are several reasons why the use of auditory and visual information together is so successful. These include a) robustness of visual speech, b) complementarity of auditory and visual speech, and c) optimal integration of these two sources of information. Speechreading, or the ability to obtain speech information from the face, is robust in that perceivers are fairly good at speech reading even when they are not looking directly at the talker's lips. Furthermore, accuracy is not dramatically reduced when the facial image is blurred (because of poor vision, for example), when the face is viewed from above, below, or in profile, or when there is a large distance between the talker and the viewer (Massaro 1998).
Complementarity of auditory and visual information simply means that one source is strong where the other is weak. A distinction between two segments robustly conveyed in one modality is relatively ambiguous in the other modality. For example, the place difference between /ba/ and /da/ is easy to see but relatively difficult to hear. On the other hand, the voicing difference between /ba/ and /pa/ is relatively easy to hear but very difficult to discriminate visually. Two complementary sources of information make their combined use much more informative than would be the case if the two sources were non-complementary, or redundant (McGurk and MacDonald 1976).
The final reason is that perceivers combine or integrate the auditory and visual sources of information in an optimally efficient manner. There are many possible ways to treat two sources of information: use only the most informative source, average the two sources together, or integrate them in such a fashion that both sources are used but the least ambiguous source has the most influence.
Perceivers in fact integrate the information available from each modality to perform as efficiently as possible. Many different empirical results have been accurately predicted by a model that describes an optimally efficient process of combination (Massaro 1998).
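Massaro's model, the Fuzzy Logical Model of Perception, implements this optimal combination by evaluating each source's degree of support for every candidate, multiplying the supports per candidate and renormalising. The sketch below follows that multiplicative rule; the support values are invented purely for illustration:

```python
def integrate(auditory_support, visual_support):
    """FLMP-style integration (after Massaro 1998): per-candidate
    supports in [0, 1] are multiplied and renormalised so the least
    ambiguous source has the most influence on the outcome."""
    combined = {c: auditory_support[c] * visual_support[c]
                for c in auditory_support}
    total = sum(combined.values())
    return {c: v / total for c, v in combined.items()}

# hypothetical supports: both modalities lean towards /da/,
# neither decisively on its own
auditory = {"ba": 0.3, "da": 0.7}
visual   = {"ba": 0.2, "da": 0.8}
result = integrate(auditory, visual)
```

With these made-up numbers the combined probability for /da/ comes out at about 0.90, higher than either single-source support (0.7 and 0.8), which illustrates the super-additive character of audio-visual speech perception described above.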
3.2.5. Other senses
We have seen that sight and sound are the dominant senses that detect and transmit most of the information. Tactile feedback is also important in improving interactivity. Without their hands, humans would not have been able to form the tools that were central to their development. Tactile feedback forms an intrinsic part of their operation, and even today in the electronic office we handle many things that require holding, e.g. pens. Taste and smell, on the other hand, are the least used of our senses. They serve more for receiving information than for communicating it, and they have been difficult to implement in computer systems so far. The secondary nature of these senses "tends to suggest that their incorporation, if it were possible, would lead to only a marginal improvement" (Dix et al. 1998). With this reasoning we do not consider taste and smell any further in this paper.
In the last chapter we saw that non-verbal behaviour conveys a great deal of meaning and context. If we do not use these non-attentive cues, we probably lose the opportunity to set up an efficient communication channel between human and computer. Today's direct manipulation (DM) interfaces cannot fulfil our demands concerning natural interaction. Over the last 50 years, technology has given us massive computational power. With this advantage we can now develop interfaces that were not possible a few years ago: human-like interfaces. With anthropomorphic interfaces we use the natural ability of humans to communicate and co-operate with their own species. No interfacing is needed anymore to translate between technology and human user, given that the interface follows the rules and conventions of human-human conversation. The human is freed from learning the interface and can instead concentrate on the task, or simply enjoy the interaction.
But research on human-like characters is viewed critically by some researchers. They question whether anthropomorphism is appropriate as a user interface paradigm and what function it should serve (Maes and Shneiderman 1997; Shneiderman 1998; Shneiderman 2002). Any new technology should be significantly better than existing solutions. Thus we have to evaluate this new type of human-computer interface.
One argument brought against anthropomorphic interfaces is the confusion they induce in users. With most of them, the appearance promises human-like interaction, whereas their actual behaviour today is highly predictable and seems scripted. Additionally, they lack the whole palette of non-attentive conversation cues such as gaze behaviour, turn-taking and turn-giving, and adaptation to different situations and different interaction partners. Another argument is that they cause slower user response times, take control away from the user in applications where it is essential (e.g. the Microsoft helper agent in former MS Office products) and have thus never been successful in the past.
Cassell argues in favour of giving the interface a human-like appearance, stating that "only conversational embodiment ... will allow us to evaluate the function of embodiment in the interface" (Cassell, Vilhjalmsson et al. 2001). In their opinion well-designed agents have a human-like appearance and thus might address particular needs that are not met in current interfaces. Possible outcomes would include ways to make dialogue systems robust despite imperfect speech recognition, to increase bandwidth at low cost, and to support efficient collaboration between human and machines, and between humans mediated by machines. "This is exactly what bodies bring to conversation" (Cassell, Vilhjalmsson et al. 2001).
Although we will use the term Embodied Conversational Agent, coined by the MIT Media Lab group around Justine Cassell (Cassell, Bickmore et al. 1999), different researchers have suggested other terms such as Social Agents (Parise, Kiessler et al. 1996) and Interactive Virtual Humans (Gratch et al. 2002).
3.4. Application Areas
A misconception about social interaction applied to technology is that we could improve all interfaces by making them explicitly social. The failure of the Office Assistant in Microsoft's Office package is a supporting example. It was meant to provide task-specific help, status information and suggestions through a small animated character and dialogue bubbles in screen space.