
Chapter 1 Introduction

Information is being created and becoming available in ever-growing quantities as the means of accessing it proliferate. There is currently a great deal of excitement and confusion about the promise of an Electronic Information Superhighway that would enable anybody to access these diverse and large information sources. Many information providers are developing on-line services to provide users with an interface to this emerging rich universe of knowledge stored in the form of multimedia documents, business and financial data, games and entertainment, shopping and consumer information. However, the realization of the promise to make any information available to users almost instantly, commonly referred to as the information explosion, is already becoming a mixed blessing in the absence of better methods to filter, retrieve and manage this potentially unlimited influx of information. Users face an information overload problem, and they require tools to explore this vast universe of information in a structured way.

Information visualization techniques can provide better methods for accessing and understanding large information spaces. This thesis develops a novel spatial representation, called the InfoCrystal, that can visualize abstract information spaces, such as document spaces, that do not have explicit spatial properties that simplify the visualization problem. The development of such representations contributes both to the emerging field of information visualization and to the established field of information retrieval. The InfoCrystal embodies new visual representation techniques that can help to solve problems encountered in information retrieval. More generally, the InfoCrystal has broad applications because it offers a "visual machinery" to compare and relate any number of arbitrary data sets.

Highly trained users, who perform complex data explorations, will likely be the first adopters of the tools developed in this thesis. As these tools become more popular, they may be integrated into an interface with broad appeal that enables users to "surf" the information explosion and "cruise" on the Information Superhighway.

1.1 Information Visualization

Researchers at Xerox PARC believe that visual interfaces that recode the information in progressively more abstract and simpler representations will play a central role in the effective management of large information spaces [Card et al. 1991]. Recent work in scientific visualization shows how large sets of data can be visualized in such a way that human perception can detect patterns revealing the underlying structure in the data more readily than by a direct analysis of the numbers [Rosenblum 1994]. When applied to retrieving information, information visualization seeks to reveal structural relationships between documents and their context that would be more difficult to detect by individual retrieval requests [Card et al. 1991].

Humans have a highly developed and versatile ability to extract information from visual stimuli. The field of Computational Vision is trying to determine how the human visual system processes information and what constraints it exploits to arrive at a three-dimensional perception given the two-dimensional nature of its input [Marr 1982]. A major constraint, which the human visual system uses, is that the visible physical world consists mostly of smooth surfaces whose visual properties change smoothly across them, except at object boundaries, and that objects change their position in a continuous fashion. Hence, for visualization to succeed, transformations have to be found, whereby the visual activity on the computer screen reflects a virtual reality that shares many of the laws and principles governing the physical world for which our human perceptual system has been "optimized". In particular, a transformation must lead to visual codes whose features vary smoothly across some portion of the image and lead to visual discontinuities that are meaningful with respect to the data. Ideally, the variables used to create visual codes should not lead to spurious and meaningless perceptual boundaries.

Many abstract concepts seem to be mentally represented by structures originally dedicated to the representation of space and the movement of objects within it [Pinker 1990]. It has long been known that an object's spatial location has a different perceptual status than its color, lightness, texture, or shape, and that people extract information more easily from spatial representations. Spatial data provide a structure for storing and retrieving information and facilitate recall. Hence, visualization should exploit spatial properties of data or provide suitable spatial metaphors to be effective.

Most of the visualization problems that are currently being investigated involve continuous, multi-variate fields over space and time [Rosenblum 1994]. Hence, the transformation problem is simplified, because the data has an explicit spatial structure that can be exploited. This thesis, however, addresses the difficult problem of how to visualize information that is abstract and does not have explicit spatial properties that can be exploited. In particular, it addresses how to access large information spaces, where users usually find it hard to visualize how the contents relate to their interests. This thesis deals with the challenging question of how to visually encode an abstract information space so as to exploit the ability of the human visual system to rapidly recognize spatial patterns and to minimize the cognitive load. In particular, it is the goal to create a representation that provides a spatial overview of the data elements and simultaneously provides visual cues about the content of the data elements. These opposing requirements are difficult to satisfy, especially when the content of the data elements needs to be described along many dimensions, as is the case, for example, with documents that are described by multiple keywords or concepts. This thesis attempts to resolve these opposing requirements by exploiting the grouping principles used by the human visual system to make relationships between different, but related data elements visible and immediate. Further, it creates a visual representation that not only has descriptive power, because it enables users to see large amounts of information in a compact way, but that also has expressive power that enables users, for example, to interact with the data to issue commands.

1.2 Information Retrieval

The domain of information retrieval poses three challenges. First, the currently dominant Boolean or Exact Matching approach needs to become more user-friendly. General users find it difficult to use the Boolean operators and apply parentheses to formulate effective Boolean queries [Borgman 1989, Belkin and Croft 1993]. Further, few have mastered how to fully exploit the expressive power of the Boolean query language [Marcus 1991]. Second, the Partial Matching approaches, which are initially easier to use, present users with a sequential list of the "best" documents. This can create a "tunnel vision" effect, because the ranked list obscures the role that the query terms played in the ranking of the retrieved documents. Users could use this type of feedback to help them decide how to proceed in their search. Third, recent retrieval experiments have shown that the competing Exact and Partial Matching approaches are complementary because the sets of relevant documents retrieved by them do not overlap to a great extent [Belkin et al. 1993]. Hence, there is a growing consensus that a combination of these two approaches is needed to enhance retrieval effectiveness [Belkin et al. 1993]. However, the complementary Exact and Partial Matching approaches need to be combined in a framework that enables users to make effective use of their respective strengths.
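The contrast between the two approaches can be illustrated with a minimal sketch. It assumes a hypothetical toy collection in which each document is reduced to a set of index terms, and it uses a very simple partial matching score (the number of query terms a document contains). The names DOCS, exact_match and partial_match are invented for this illustration and do not come from any particular retrieval system.

```python
# Hypothetical toy collection: document id -> set of index terms.
DOCS = {
    "d1": {"visual", "query", "retrieval"},
    "d2": {"visual", "retrieval"},
    "d3": {"query", "language"},
    "d4": {"human", "factors"},
}


def exact_match(query_terms, docs=DOCS):
    """Exact Matching (Boolean AND): ids of documents containing every query term."""
    terms = set(query_terms)
    return sorted(doc_id for doc_id, index in docs.items() if terms <= index)


def partial_match(query_terms, docs=DOCS):
    """Partial Matching: rank documents by how many query terms they contain."""
    terms = set(query_terms)
    scored = [(len(terms & index), doc_id) for doc_id, index in docs.items()]
    # Highest score first, ties broken by document id; drop documents with no match.
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in ranked if score > 0]


QUERY = ["visual", "query", "retrieval"]
EXACT = exact_match(QUERY)     # only documents that satisfy the whole query
RANKED = partial_match(QUERY)  # an ordered list of partially matching documents
```

In this sketch the ranked list returns more documents than the exact match, but the list by itself does not show which query terms each document matched, which is the kind of feedback the "tunnel vision" effect hides.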

The problems mentioned above and the lack of visual feedback cause users to feel confused while searching for information, which in turn undermines their confidence and effectiveness. There is a growing awareness that, besides the need to develop more versatile retrieval methods, a great deal of leverage can be obtained by developing better visual tools that support users in the search process and that provide them with a more comprehensive overview of an information space [Fox et al. 1993, Kahle et al. 1993].

Metaphorically speaking, it is as if users, using current retrieval methods, have to begin their exploration of a large information space in darkness. On the one hand, they can use a flashlight with a very narrow, but powerful beam of light (i.e., formulating a very specific and complex query: high precision, but low recall) which gives them only a very limited view of the information space. In order to piece together a more comprehensive picture, users need to cast the flashlight in different directions in an orchestrated fashion (i.e., formulating multiple queries guided by a well-developed strategy requiring sufficient expertise). On the other hand, users can use a light source that casts a wide but very dim beam of light (i.e., formulating a simple and broad query: high recall, but low precision) which provides them only with a very murky and undifferentiated view. Instead of being in darkness, they are now surrounded by thick fog, where too much information is presented in a very unstructured way, and it is not clear how the retrieved data really relates to their interests. It is our goal to provide users with a lighting environment that enables them to use multiple light sources at the same time to illuminate the information space, where the emerging structure is clearly perceivable and can be easily interpreted. Further, the proposed tool should allow users to create complex and powerful lighting strategies that reveal areas in the information space that are of great interest to them or provide them with insight into how to proceed in the search process.

1.3 Goal of the Thesis

This thesis demonstrates how information visualization offers ways to accomplish the needed improvements in information retrieval. In particular, this thesis addresses the problem of how to enhance the ability of users to access information by developing better ways for visualizing information and formulating queries graphically. Further, it develops a visual framework that unifies the Exact and the Partial Matching approaches and enables users to take advantage of their respective strengths. As the amount of available information keeps growing at an ever increasing rate, it will become critical to provide users with high-level visual retrieval tools that enable them to explore, manipulate, and relate large information spaces to their interests in an interactive way. We use the term "high-level" because these tools are designed to give users a flexible visual framework for both how to retrieve and how to explore information.

To address the problems outlined above, this thesis develops the InfoCrystal, which is an example of such a high-level retrieval tool. It has the following functionality: 1) Users can explore an information space along several dimensions simultaneously without having to abandon their sense of overview. 2) Users can manipulate the information by creating useful abstractions. 3) Similar to a spreadsheet, users can ask "what-if" questions and observe the effects without having to change the framework of a query. 4) Users receive support in the search process because they receive dynamic visual feedback on how to proceed. 5) Users can formulate queries graphically, and they have flexibility in terms of the particular methods used to retrieve the information.

1.4 Thesis Organization

This thesis is organized as follows: 1) We will consider a concrete retrieval example to set the stage. 2) We will review the major text retrieval paradigms such as the Exact Matching and the Partial Matching approaches. 3) We will introduce the InfoCrystal and proceed to demonstrate how it can be used to visualize and formulate Boolean, weighted and vector space queries. We will also describe a query outlining tool that enables users to create and manage complex queries. 4) We will give a brief overview of the current InfoCrystal software environment. 5) We will report on two evaluation experiments that we conducted to test specific aspects of the InfoCrystal interface by comparing it with a standard Boolean interface. In an appendix we will describe in detail the tutorial that introduced the subjects to the InfoCrystal interface. Further, we will present the feedback received from the experimental subjects. 6) We will review relevant previous research and compare it with the InfoCrystal. 7) We will describe several brief application scenarios of the InfoCrystal. 8) We will outline the research to be conducted in the future. 9) We will provide a summary of the key accomplishments of this thesis. Finally, we will also reflect on the major challenges and opportunities facing the field of information visualization.

1.5 Concrete Example

It is best to consider a concrete example to describe some of the problems a user currently faces when searching for information. For example, if we are interested in documents that talk about "visual query languages for retrieving information and that consider human factors issues", then the first problem we are faced with is the vocabulary problem. Which particular concepts should we use to represent our information need? The following concepts could capture our interest: (Graphical OR Visual), Information Retrieval, Query Language, Human Factors. Most of the existing on-line retrieval systems use Boolean or Exact Matching operators to combine the identified concepts to form a query. Hence, we are faced next with the coordination problem. Which operators should we use, and how should we use them to coordinate the concepts? On the one hand, the most exclusive query would join the concepts by using the AND operator. Such a query, performed on the INSPEC Database for the years 1991-92, retrieved only one document containing all four concepts. On the other hand, the most inclusive query would join the concepts by using the OR operator; it retrieved 19,691 documents. Hence, either too few or too many documents are presented. How should we broaden the exclusive query or narrow the inclusive query to retrieve more relevant documents? We will revisit this example after we have introduced the InfoCrystal, and we will show how it could help users to modify the query successfully.
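The coordination problem can be sketched in a few lines of Python. The example below assumes a small, invented collection of five documents (it is not drawn from the INSPEC Database) and treats each concept as a set of alternative terms, so that (Graphical OR Visual) counts as a single concept. The names DOCS, CONCEPTS, and_query and or_query are hypothetical.

```python
# Hypothetical toy collection; documents and index terms are invented.
DOCS = {
    1: {"visual", "information retrieval", "query language", "human factors"},
    2: {"graphical", "query language"},
    3: {"information retrieval", "human factors"},
    4: {"visual"},
    5: {"databases"},
}

# Each concept is a set of alternative terms; a document matches a concept
# if it is indexed by at least one of them.
CONCEPTS = [
    {"graphical", "visual"},
    {"information retrieval"},
    {"query language"},
    {"human factors"},
]


def matches(concept, index_terms):
    """True if the document's index terms include any term of the concept."""
    return bool(concept & index_terms)


def and_query(concepts, docs):
    """Most exclusive query: documents matching every concept."""
    return sorted(d for d, terms in docs.items()
                  if all(matches(c, terms) for c in concepts))


def or_query(concepts, docs):
    """Most inclusive query: documents matching at least one concept."""
    return sorted(d for d, terms in docs.items()
                  if any(matches(c, terms) for c in concepts))


EXCLUSIVE = and_query(CONCEPTS, DOCS)
INCLUSIVE = or_query(CONCEPTS, DOCS)
```

Even in this toy setting the two extremes behave as described above: the AND query keeps only the single document that covers all four concepts, while the OR query returns almost the entire collection without indicating which concepts each document satisfies.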