The Sleeping Giant

Content-based image retrieval: an essay by Arnold Smeulders, 2000

This text draws from a much bigger text which has appeared as A.W.M. Smeulders, M. Worring, S. Santini, A. Gupta and R. Jain: Content based image retrieval at the end of the early years, IEEE PAMI dec 2000. All references can be found in that paper.

0. The archive will speed us up - and will halt us

If you try to imagine what life will be like a hundred years from now, it must be frightening. For one, everyone will be able to carry in their palm-top an archive bigger than the life-long capacity of the senses. So, the relevant portion of the information reaching us can be stored and processed off-line. This must have the effect, on the one hand, that the processing of information and hence of life will proceed at an even faster pace. And, at the same time, it must have the effect that the development of life will slow down, as all information is stored in an archive with its tendency to conserve what is there. (In 2002, on the one hand there is the effort to write, rehearse and perform a theater play within 24 hours. This is the ultimate expression of the progressive archive-less times we live in. At the same time, the Dutch government now requires that all expenditures be accounted for, leading to extreme archive-conservatism.)

Archives are waiting for us as a huge giant asleep. If we wake the giant, what is it able to tell us? What language will it speak? Will we be able to understand the dreams it has in its mind? And, the giant grows and grows by interconnections. Will it be a force stronger than ourselves?

I cannot answer these questions. What I can do is give an overview of the possibilities and impossibilities of accessing a visual archive and the current bottlenecks.

1. Content based image retrieval - at the end of the early years

There is something about pictures that no words can convey. Consider Munch's The Scream, or a performance by a video artist, or even the average Mondrian. It has to be seen. The same holds for pictures of the Kalahari Desert, a dividing cell or the facial expression of an actor playing King Lear. It is beyond words. Try to imagine an editor taking in pictures without seeing them, or a radiologist deciding on an X-ray from just a verbal description. Pictures have to be seen and searched for as pictures: by object, by style or by purpose.

Research in content-based image retrieval today is a lively discipline, expanding in breadth. As happens during the maturation process of many a discipline, after early successes in a few applications, research is now concentrating on deeper problems, challenging the hard problems at the crossroads of the disciplines from which it was born: computer vision, databases, and information retrieval.

At the current stage of content-based image retrieval research, it is interesting to look back towards the beginning and see which of the original ideas have blossomed, which haven't, and which were made obsolete by the changing landscape of computing. In February 1992, the USA-based National Science Foundation organized a workshop in Redwood, CA, to identify major research areas that should be addressed by researchers for visual information management systems that would be useful in scientific, industrial, medical, environmental, educational, entertainment, and other applications. There were earlier attempts, such as the 1979 conferences on databases and pictorial applications in Florence, but nothing of much interest for today was reported there. In hindsight, the workshop can be marked as the beginning of research in content-based retrieval.

Why did it take so long to get the exploration of visual material started? Before 1995 the machine power, the capacity and reach of the Internet, and the availability of digital sensors were too underdeveloped to bootstrap any serious use of image exploration. But just after the NSF workshop, things were to change quickly. The Mosaic Internet browser was released, spawning the web revolution that very quickly changed all the cards. In the same era a host of new digital vision sensors became available. The number of images that the average user could reach increased dramatically in just a few years. Instantly, indexing tools for the Web or for digital archives became urgent. And, in science, visual image databases and the exploration of visual content have drawn considerable attention ever since. In order to appreciate the current state of affairs we need to discuss some basic observations which hold true for all images and all observers - man or machine alike: the sensory gap and the semantic gap.

The sensory gap is the gap between the object in the world and the information in a numerical/verbal/categorical description derived from an image recording of that scene. A computer can only process the numerical information derived from an image, so it is important to realize how much information is lost when converting an image into a digital description. For narrow image domains with a limited and predictable variability in their appearance, a special digital language might be developed, but still the challenges are considerable. In the narrow domain of frontal views of faces, the faces are usually recorded against a clear background and illuminated with white frontal light. While each face is unique and has large variability in its visual details, there are obvious geometrical, physical and color-related constraints governing the domain. Still, the same person could render a thousand different recorded faces depending on mood, beard, weather, time of day, make-up, glasses, lighting position, incident shadows, clothes, hair-cut, frame of photography and so on. And this is still a relatively narrow domain. The domain would be called slightly wider had the faces been photographed in a crowd or in an outdoor scene. In that case clutter in the scene, occlusion and a non-frontal viewpoint will also have a major impact on the digital description. For a broad class of images, such as the images in a photo-stock, the gap between the feature description and the semantic interpretation is generally wider still. The sensory gap turns the description of objects into an essential uncertainty in what is known about the object. The sensory gap is particularly poignant when a precise knowledge of the recording conditions is missing. The 2D-records of different 3D-objects can be identical. Without further knowledge, one has to accept that they might represent the same object.
Content-based image retrieval systems may provide support in reducing this uncertainty through elimination among several potential explanations, much the same as has been done in natural language processing. In short, the sensory gap introduces an uncertainty in any description of an image, as a thousand (slightly) different images are mapped onto the same description.

I guess most of the current disappointment with standard retrieval systems originates from the semantic gap. The semantic gap is the lack of coincidence between the information extracted from visual data and the interpretation that the visual data have for a user. The semantic gap is best illustrated by pictures each holding a chair. When searching for a chair we may be satisfied with any object under that name. That is, we search for man-defined equality. When we search for all one-leg chairs, we add a geometrical constraint to the general class. The same holds when searching for a red chair, now adding a color constraint to the search rather than a geometrical condition. When we search for a chair perceptually equivalent to a given chair, it must at least be of the same geometrical and color type, and we are down to the sensory gap. Finally, when we search for exactly the same image of that chair, literal equality is requested, still ignoring the variations due to noise in the image, and we are in the realm of image processing. Where a linguistic description is contextual, an image in an archive is not and may live by itself. So closing the semantic gap will at least include contextual search, a topic barely touched upon in science. Systems like [Chang, Rui] collect images from the Internet and insert them into a predefined taxonomy on the basis of the text surrounding them.

When sorted by their purpose of image search, we distinguish three types of systems:

  1. Searches by association at the start have no specific aim other than finding interesting things. They often imply iterative refinement of the search, of the similarity, or of the examples with which the search was started. Systems in this category are typically highly interactive, where the specification may be by sketch or by example images. The oldest realistic example of such a system is probably that of [Kato]. The result of the search can be manipulated interactively by relevance feedback [Hiroike, Frederix]. To support the quest for relevant results, sources other than images are also employed [Swain].
  2. Target search may be for a precise copy of the image in mind, as in searching art catalogues [Qbic95]. Target search may also be for another image of the same object. This is target search by example. Target search may also be applied when the user has a specific image in mind and the target is interactively specified as similar to a group of given examples, [Cox]. These systems are suited to search in catalogues.
  3. Category search aims at retrieving an arbitrary image representative of a specific class. It may be the case that the user has an example and the search is for other elements of the same class. Categories may be derived from labels or emerge from the database [Swets]. In category search, the user may have available a group of images and the search is for additional images of the same class [Ciocca]. A typical application of category search is catalogues of varieties, with a domain specific definition of similarity.


The pivotal point in content-based retrieval is that the user seeks semantic similarity, but the database can only provide similarity by data processing. This is what we called the semantic gap. At the same time, the sensory gap between the properties in an image and the properties of the object plays a limiting role in retrieving the content of the image.

We discussed applications of content-based retrieval in three broad types: target search, category search and search by association. Target search is closest to computer vision research. Category search is much more challenging and requires on-line learning or visual data mining. Search by association is hampered most by the semantic gap. As long as the gap is there, use of content-based retrieval for browsing will not be within the grasp of the general public, as humans are accustomed to rely on the immediate semantic imprint the moment they see an image. New ways of presenting and learning from archives are necessary here. In general, I would formulate the challenge for image search engines as follows: to tailor the engine to the narrow domain the user has in mind, via query specification, via learning from the past, and via current interaction.

2. The state of the art - a few practical tips and a disappointing conclusion

To enhance the image information, retrieval has set the spotlight on color, as color has a high discriminatory power among objects in a scene, much higher than gray levels. The purpose of most image color processing is to reduce the influence of the accidental conditions of the scene and sensing; that is another definition of the sensory gap as discussed above. Progress has been made in tailored color space representations for well-described classes of variant conditions. Also, the application of geometrical descriptions derived from scale-space theory will reveal viewpoint- and scene-independent salient point sets, thus opening the way to similarity of images based on a few most informative regions or points.
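As a minimal sketch of such a tailored color representation (an illustration, not the specific method of the underlying paper): normalized rg-chromaticity discards overall intensity, so the same surface recorded under brighter or dimmer light of the same color maps to the same description. The function name is hypothetical.

```python
def normalized_rgb(pixel):
    """Map an (R, G, B) pixel to intensity-normalized (r, g) chromaticity.

    Scaling all channels by a constant factor (brighter or dimmer light
    of the same color) leaves the result unchanged -- one small step
    toward reducing the sensory gap for accidental lighting conditions.
    """
    r, g, b = (float(c) for c in pixel)
    s = r + g + b
    if s == 0.0:          # pure black: chromaticity is undefined
        return (0.0, 0.0)
    return (r / s, g / s)

# The same surface recorded at half the illumination intensity:
print(normalized_rgb((120, 60, 30)))   # bright recording
print(normalized_rgb((60, 30, 15)))    # dim recording, same chromaticity
```

The invariance is bought at a price, in line with the trade-off noted above: intensity differences between genuinely different objects are discarded as well.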

In the description of the image one usually starts from assuming strong segmentation of the image. The alternative, doing no segmentation at all, is unattractive, as one mixes aspects of all objects in the image into one soup of numerical descriptions. Strong segmentation is the precise outlining in the image of each individual object. This is an incredibly complex task, which humans can do in the second they see the picture, even if they have never seen the topic of the picture before. There is no chance a machine can perform this general task in a hundred years, simply because humans use one third of all their brainpower to achieve it. Luckily - and it took years to realize this - for retrieval a total understanding of the image is rarely needed. Of course, the deeper one goes into the semantics of the pictures, the deeper also the understanding of the picture will have to be, but even understanding the semantic meaning of the image does not require strong segmentation. A weaker version of segmentation has been introduced in content-based retrieval. In weak segmentation the result is a region homogeneous by some criterion, but not necessarily covering the complete object silhouette. It results in a blobby description of objects rather than a precise segmentation. Salient features of the weak segments might capture the essential information of the object in a nutshell. The extreme form of weak segmentation is the selection of the perceptually most salient points, as the ultimately efficient data reduction in the representation of an object, quite likely drawing human focus-of-attention.
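The extreme form of weak segmentation can be sketched as follows. This toy sketch uses local intensity variance as a crude stand-in for perceptual saliency (real systems use richer measures such as scale-space differential structure); the function and its parameters are illustrative assumptions, not the cited systems' algorithms.

```python
import numpy as np

def salient_points(image, k=5, radius=1):
    """Pick the k pixel positions with the highest local intensity variance.

    Instead of a full (strong) segmentation, the object is summarized by
    a handful of informative points -- the extreme form of the weak
    segmentation described above.
    """
    h, w = image.shape
    scores = []
    for y in range(radius, h - radius):
        for x in range(radius, w - radius):
            patch = image[y - radius:y + radius + 1, x - radius:x + radius + 1]
            scores.append((float(patch.var()), (y, x)))
    scores.sort(reverse=True)               # highest variance first
    return [pos for _, pos in scores[:k]]

# A flat image with one bright square: saliency concentrates on its border,
# where the local variance is largest.
img = np.zeros((16, 16))
img[5:10, 5:10] = 1.0
print(salient_points(img, k=3))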

Whenever the image itself permits an obvious interpretation, the ideal content-based system should employ that information. A strong semantic interpretation occurs when a sign can be positively identified in the image. This is rarely the case due to the large variety of signs. As data sets grow big and the processing power matches that growth, the opportunity arises to learn rather than to know the signs. One type of approach is appearance-based modeling, learning from examples. This works, but only if the recording conditions are highly standardized, like frontal well-illuminated faces only. A better approach is one-class classifiers, carefully describing the limits of a class of objects from examples [Tieu, Tax]. An interesting technique to bridge the gap between textual and pictorial descriptions is latent semantic indexing [Sclaroff]. The search is for hidden correlates of features and captions. For a broad class of images, the obstacle remains the variety of signs and the enormity of the task of defining a reliable detection algorithm for each of them.
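A toy sketch of latent semantic indexing in this setting (the term-image count matrix and the chosen rank are illustrative assumptions, not data from the cited work): truncate the SVD of a term-by-image co-occurrence matrix, then compare images in the latent space, where correlated terms such as "sea" and "beach" collapse onto the same hidden dimension.

```python
import numpy as np

# Rows: caption terms, columns: images (toy counts, purely illustrative).
terms = ["sea", "beach", "chair", "table"]
A = np.array([
    [2, 1, 0, 0],   # "sea"   co-occurs with images 0 and 1
    [1, 2, 0, 0],   # "beach" co-occurs with images 0 and 1
    [0, 0, 2, 1],   # "chair" co-occurs with images 2 and 3
    [0, 0, 1, 2],   # "table" co-occurs with images 2 and 3
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                       # keep the top-k latent "topics"
img_latent = (np.diag(s[:k]) @ Vt[:k]).T    # one row per image

def similarity(i, j):
    """Cosine similarity of two images in the truncated latent space."""
    a, b = img_latent[i], img_latent[j]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(similarity(0, 1))   # two seaside images: close in latent space
print(similarity(0, 2))   # seaside vs. furniture: orthogonal topics
```

The hidden correlate here is the "seaside" topic shared by "sea" and "beach": the truncation merges the two terms, so an image captioned only with "sea" still matches a query phrased with "beach".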

Similarity is an interpretation of the image based on the difference between two elements or groups of elements. Any information the user can provide in the search process should be employed to provide the rich context required in establishing the meaning of a picture. The interaction should form an integral component of any image retrieval system, rather than a last resort when the automatic methods fail. Already at the start, interaction can play an important role. Most current systems perform query space initialization irrespective of whether a target search, a category search, or an associative search is requested. But the fact of the matter is that the set of appropriate features and the similarity function depend on the user's goal. Asking the user for the required invariance yields a solution for a specific form of target search. For category search and associative search, the user-driven initialization of query space is still an open issue. We make a plea for human-based similarity rather than general similarity.
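One classical way to let interaction steer the similarity, sketched here in its simplest Rocchio-style form borrowed from text retrieval (the weights alpha, beta, gamma are illustrative, not values from the cited systems): each feedback round moves the query point in feature space toward the examples the user marked relevant and away from those marked non-relevant.

```python
import numpy as np

def refine_query(query, relevant, nonrelevant,
                 alpha=1.0, beta=0.75, gamma=0.25):
    """Rocchio-style relevance feedback on image feature vectors.

    The refined query reflects the user's (semantic) intent better than
    the initial one, which only the user -- not the data -- can supply.
    """
    q = alpha * np.asarray(query, dtype=float)
    if len(relevant):
        q += beta * np.mean(relevant, axis=0)
    if len(nonrelevant):
        q -= gamma * np.mean(nonrelevant, axis=0)
    return q

q0 = np.array([0.5, 0.5])                              # initial query
relevant = [np.array([1.0, 0.0]), np.array([0.8, 0.2])]  # user: "more like these"
nonrelevant = [np.array([0.0, 1.0])]                     # user: "not this"
q1 = refine_query(q0, relevant, nonrelevant)
print(q1)   # shifted toward the relevant cluster
```

Iterating this over several rounds is exactly the "iterative refinement of the search" that characterizes search by association.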

User interaction in image retrieval has, however, some characteristics different from text retrieval. There is no sensory gap, and the semantic gap from keywords to full text in text retrieval is of a different nature. No translation is needed from keywords to pictorial elements. We identify six query classes: exact and approximate queries on spatial content, on images, and on image groups, plus all combinations thereof. When accessing spatial content, some form of (weak) segmentation is required. A balance has to be found between flexibility on the user side and scalability on the system side. Query by image example has been researched most thoroughly, but a single image is only suited when another image of the same object(s) is the aim of the search. In other cases there is simply not sufficient context. Queries based on groups, as well as techniques for prior identification of groups in data sets, are still promising lines of research. Such group-based approaches have the potential to partially bridge the semantic gap while leaving room for efficient solutions.
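A group-based query can be sketched in its most basic form (an assumption-laden illustration: real systems would also model the spread of the group, not just its center): the group of example images defines an implicit class via its centroid in feature space, against which database images are ranked.

```python
import numpy as np

def rank_by_group(query_group, database):
    """Rank database images by distance to the centroid of a query group.

    A group of examples carries more context than a single image: the
    centroid of the group stands in for the class the user has in mind.
    Returns database indices, best match first.
    """
    centroid = np.mean(query_group, axis=0)
    dists = [float(np.linalg.norm(np.asarray(f) - centroid)) for f in database]
    return sorted(range(len(database)), key=lambda i: dists[i])

group = [np.array([1.0, 1.0]), np.array([1.2, 0.8])]     # the user's examples
db = [np.array([5.0, 5.0]),                              # far from the group
      np.array([1.0, 0.9]),                              # close to the group
      np.array([0.0, 0.0])]
print(rank_by_group(group, db))   # closest database image ranked first
```

The same machinery also covers the case where groups are identified in the data set beforehand: each discovered group yields a centroid that can serve as a ready-made category query.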

A critical point in the advancement of content-based retrieval is formed by the sensory and semantic gaps. The size of the sensory gap between the image data and the computer-processed description - for a discussion see above - is enormous, but in recent years some of its structure has become better known and even some partial fillings have been achieved. The size of the semantic gap - the distance between the image and the immediate understanding of a user, not available to a machine - for a discussion see above - is formidable. The scientific progress is interesting but negligible relative to the size of the gap. Use of content-based retrieval for semantic browsing in general domains will not be within the grasp of the general public, as they are accustomed to rely on the immediate semantic imprint the moment they see an image, and they expect a computer to do the same. Specific situations may work. The aim of content-based retrieval systems must be to provide maximum support in bridging the semantic gap between the simplicity of the available visual features and the richness of the user semantics. The best way to resolve the semantic gap comes from sources outside the image, by integrating other sources of information about the image into the query. Information about an image can come from a number of different sources: the image content, labels attached to the image, the text in which the image is embedded, and so on. We still have very primitive ways of integrating this information in order to optimize access to images. Among these, the integration of natural language processing and computer vision should come first.
