Overview
The objective of the Diamond project is to enable interactive search of terabyte-scale, non-indexed collections of complex data, such as photo collections, satellite pictures and medical images. Diamond achieves this goal by distributing the search and leveraging active storage technology: storage devices with processing capability embedded or nearby. The goal is to retrieve buried gems within massive collections, without moving a mountain of data. Diamond is a collaborative effort involving Intel Research and Carnegie Mellon University. The Diamond project is led by Rahul Sukthankar of Intel Research Pittsburgh.
The Diamond system provides a common infrastructure and programming interfaces for building search applications in a variety of domains, such as medical imaging and homeland security. This enables application developers to focus on domain-specific aspects of the problem while relying on Diamond to provide an efficient, parallel implementation of the search task.
Diamond consists of a systems component, designed to make searches of unstructured data efficient, and application-specific algorithms that identify the data to be retrieved. Tasks that meet two criteria are suitable for Diamond. First, the search must involve objects (photos, for example) that can be processed independently and in any order. This allows Diamond to search objects distributed across many storage devices efficiently. Second, Diamond must be able to decompose the search task into a sequence of filter steps. This enables the execution of simpler search steps on active storage devices and more complex steps on the user's machine. Diamond can dynamically balance computation between the available processors to speed up the search process.
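The two criteria can be illustrated with a minimal sketch. It is not Diamond's actual programming interface: the function `run_filter_chain`, the photo attributes and the two filters are hypothetical names chosen for illustration. Each object is examined on its own, and the search is expressed as an ordered sequence of filters.

```python
from typing import Callable, Iterable, Iterator, Sequence

# A filter looks at one object and returns True to keep it, False to discard it.
Filter = Callable[[dict], bool]

def run_filter_chain(objects: Iterable[dict], filters: Sequence[Filter]) -> Iterator[dict]:
    """Yield the objects that pass every filter, evaluated in order."""
    for obj in objects:
        # all() short-circuits, so later (costlier) filters never run on an
        # object that an earlier filter has already rejected.
        if all(f(obj) for f in filters):
            yield obj

# Hypothetical precomputed attributes standing in for real image analysis.
photos = [
    {"name": "beach.jpg", "blue_fraction": 0.62, "has_face": True},
    {"name": "forest.jpg", "blue_fraction": 0.05, "has_face": True},
    {"name": "ocean.jpg", "blue_fraction": 0.80, "has_face": False},
]

def mostly_blue(photo: dict) -> bool:
    return photo["blue_fraction"] > 0.3

def contains_face(photo: dict) -> bool:
    return photo["has_face"]

matches = [p["name"] for p in run_filter_chain(photos, [mostly_blue, contains_face])]
print(matches)  # ['beach.jpg']
```

Because no object depends on any other, the loop body could run on any device and in any order, which is the property that makes the search easy to distribute.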
The Challenge of Searching through Rich Media
Diamond addresses the evolution of information stored on the Web and in consumer and business storage devices, from primarily text and numeric data to rich media: unstructured data such as digital still images, video and audio. Current search technologies can easily comb through structured data but are not well suited to searching rich media. When searching through unstructured data, they are typically restricted to combing through the metadata, that is, text labels such as date stamps affixed to photos. Without such labels, they cannot identify, say, "all photos of my family at the beach", a search request that requires rich semantic knowledge. Unfortunately, current algorithms for processing rich media are unable to understand this semantic information in a fully automated manner. Diamond attempts to close this gap by providing users with the ability to interactively search through this data.
In searching through unstructured data, such as photographs, Diamond can look for color, texture and/or shape, three characteristics of image data that can convey semantics. The goal is to narrow the set of potential "matches" that the user must review in order to identify the desired data.
Filtering and "Early Discard"
Suppose a user wants to search for photos of a whale watching trip. Computer vision algorithms are not yet sophisticated enough to recognize a whale's fluke. But by applying machine learning algorithms, Diamond learns the concept of water (in terms of blue-gray color) from a small number of examples, and quickly discards all other images. The objective is to reduce the set of potential matches by filtering out, as early as possible, photos that are very unlikely to contain water, or a whale.
If needed, the remaining set of images is filtered again, using a slightly more sophisticated search for visual texture (also learned using examples). During this second stage, images containing blue-gray patches of color but without a wave-like texture are eliminated, further reducing the set of potential matches. The much smaller set of remaining photos can then be manually reviewed by the user to identify the desired photos.
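The two-stage whale search can be sketched as follows. This is an illustrative toy, not Diamond's code: the image records, the `mean_rgb` and `texture_score` fields, and the thresholds inside `looks_blue_gray` and `looks_wavy` are all assumptions. The sketch records how many images reach each stage, to show that the second test only sees what the first one kept.

```python
def looks_blue_gray(image):
    """Cheap color test (assumed rule): blue is the dominant channel."""
    r, g, b = image["mean_rgb"]
    return b > r and b >= g and b > 80

def looks_wavy(image):
    """Second-stage texture test on a hypothetical precomputed score."""
    return image["texture_score"] > 0.5

def early_discard(images, stages):
    """Apply stages in order; return survivors and per-stage workload."""
    survivors = list(images)
    evaluated = []
    for name, test in stages:
        evaluated.append((name, len(survivors)))  # images this stage must examine
        survivors = [img for img in survivors if test(img)]
    return survivors, evaluated

images = [
    {"name": "whale.jpg", "mean_rgb": (70, 95, 130), "texture_score": 0.8},
    {"name": "sky.jpg", "mean_rgb": (90, 120, 200), "texture_score": 0.1},
    {"name": "lawn.jpg", "mean_rgb": (60, 140, 50), "texture_score": 0.7},
    {"name": "desert.jpg", "mean_rgb": (200, 170, 120), "texture_score": 0.3},
]

survivors, evaluated = early_discard(
    images, [("color", looks_blue_gray), ("texture", looks_wavy)]
)
print(evaluated)                       # [('color', 4), ('texture', 2)]
print([s["name"] for s in survivors])  # ['whale.jpg']
```

Ordering the stages from cheapest to most expensive means the costlier texture test runs on two images here rather than four.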
Note that the user must provide the search heuristics to the system in an iterative process of searching, reviewing results, and refining the search. We can't yet take the user out of the loop, but Diamond can make the user's time as efficient as possible. As a result of filtering and early discard, the user may have to review, say, 50 images rather than 5,000.
Tradeoffs: Precision versus Recall
In creating machine learning algorithms for searching through unstructured data, researchers face a tradeoff: If a system such as Diamond is too aggressive in discarding objects or files, a desired object might be discarded. If the search is not aggressive enough, too few objects may be discarded, leaving the user with many irrelevant matches. Researchers refer to this as the "precision versus recall tradeoff." Precision refers to the percentage of relevant results (files) out of all results retrieved. Recall refers to the percentage of relevant results retrieved out of all relevant results in the database being searched.
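As a small worked example of these two definitions, the sketch below computes both measures for one query. The function name `precision_recall` and the file identifiers are illustrative only.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant results retrieved / all results retrieved
    recall    = relevant results retrieved / all relevant results in the collection
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = ["a", "b", "c", "d"]  # what the search returned
relevant = ["a", "b", "e"]        # what the user actually wanted

p, r = precision_recall(retrieved, relevant)
print(p)            # 0.5 (2 of the 4 results are relevant)
print(round(r, 3))  # 0.667 (2 of the 3 relevant files were found)
```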
Machine learning algorithms should reflect the level of precision or recall appropriate for a given application. For example, a Department of Homeland Security official searching a database of potential terror suspects wants a high probability of not missing a match; recall is important, so fewer files should be discarded. By contrast, an advertising agency searching for photos to illustrate a concept is more interested in precision. It doesn't need to see every potential match, so early discard can be more aggressive, limiting the search to those images of greatest interest or relevance.
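One way to picture this choice is as a discard threshold on a classifier score. The sketch below is an assumption-laden illustration: the scores, the relevance labels and the `evaluate` function are made up, and the source does not say that Diamond exposes such a threshold. A low threshold discards little and favors recall; a high threshold discards aggressively and favors precision.

```python
# (classifier score, truly relevant?) for eight hypothetical files.
scored = [
    (0.95, True), (0.90, True), (0.70, False), (0.60, True),
    (0.40, False), (0.30, True), (0.20, False), (0.10, False),
]

def evaluate(threshold):
    """Keep files scoring at or above the threshold; return (precision, recall)."""
    kept = [relevant for score, relevant in scored if score >= threshold]
    total_relevant = sum(relevant for _, relevant in scored)
    hits = sum(kept)
    precision = hits / len(kept) if kept else 0.0
    recall = hits / total_relevant
    return precision, recall

# Recall-oriented setting: nothing relevant is lost, but two irrelevant files remain.
lenient = evaluate(0.25)
print(round(lenient[0], 3), lenient[1])  # 0.667 1.0

# Precision-oriented setting: every result is relevant, but half are missed.
print(evaluate(0.8))  # (1.0, 0.5)
```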
Accelerating the Search via Active Storage Technology
Without metadata to organize a rich media collection, the only way a search engine could identify desired data would be to search sequentially through every file. This is a slow process that would make interactive searches of terabyte-size data collections infeasible.
To speed the process, Diamond distributes the search across multiple active storage devices, operating in parallel, each with processing capability embedded or nearby. Files are examined and discarded near their storage location rather than being sent to a central location for processing, significantly accelerating the search. This approach is analogous to efficiently searching for a needle in a haystack. Each storage device may not be able to identify what a needle looks like, but much irrelevant hay can be quickly discarded, making it easier for the human to find the needle.
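The idea of discarding files near where they are stored can be simulated in a few lines. In this sketch, worker threads stand in for active storage devices, and the `devices` dictionary, `search_device` and `distributed_search` are hypothetical names. It illustrates only the data flow (filter locally, return only survivors), not Diamond's implementation or real hardware parallelism.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical devices: each holds its own objects and filters them locally.
devices = {
    "disk0": [{"name": "a.jpg", "blue": 0.7}, {"name": "b.jpg", "blue": 0.1}],
    "disk1": [{"name": "c.jpg", "blue": 0.2}, {"name": "d.jpg", "blue": 0.9}],
    "disk2": [{"name": "e.jpg", "blue": 0.05}],
}

def search_device(objects, keep):
    """Runs 'at the device': only the names of survivors leave it."""
    return [obj["name"] for obj in objects if keep(obj)]

def distributed_search(devices, keep):
    """Run the same filter on every device concurrently and merge the results."""
    with ThreadPoolExecutor(max_workers=len(devices)) as pool:
        futures = [pool.submit(search_device, objs, keep) for objs in devices.values()]
        results = []
        for future in futures:
            results.extend(future.result())
    return sorted(results)

found = distributed_search(devices, lambda obj: obj["blue"] > 0.5)
print(found)  # ['a.jpg', 'd.jpg']
```

Of the five stored objects, only two cross the boundary back to the caller; the rest are discarded where they live.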
Training the Machine Learning System
At the start of a search, the user must supply Diamond with sample images containing the desired features. Referring to our earlier example, to help Diamond locate whale watching photos, the user could supply some sample images containing ocean water. By highlighting regions in those images, the user can indicate which features are important (in terms of color or texture). Diamond then learns a binary classifier that categorizes future images based on these features. Diamond can incorporate both interactively trained classifiers (such as the water classifier described above) and pre-trained classifiers (such as face detectors) into its search process.
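The source does not specify which learning algorithm Diamond uses, so the following is only a stand-in to make the workflow concrete. It "trains" on pixel colors taken from user-highlighted regions by computing their average color, then labels a new color as water or not water depending on its distance from that average. The function `train_color_classifier`, the `slack` parameter and the sample values are all assumptions.

```python
import math

def train_color_classifier(positive_pixels, slack=1.5):
    """Build a yes/no color classifier from example (r, g, b) pixels.

    Stand-in technique: accept any color within `slack` times the largest
    distance observed between a training pixel and the training centroid.
    """
    n = len(positive_pixels)
    centroid = tuple(sum(p[i] for p in positive_pixels) / n for i in range(3))
    radius = slack * max(math.dist(p, centroid) for p in positive_pixels)

    def classify(pixel):
        return math.dist(pixel, centroid) <= radius

    return classify

# Colors sampled from regions the user highlighted as ocean water (made up).
water_samples = [(60, 90, 130), (70, 100, 140), (65, 95, 150), (55, 85, 125)]
is_water = train_color_classifier(water_samples)

print(is_water((62, 93, 135)))   # True: close to the highlighted colors
print(is_water((200, 180, 90)))  # False: sandy yellow
```

A pre-trained classifier, such as a face detector, would plug into the same search as another yes/no function that simply needs no training step from the user.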
Proof-of-Concept Applications
To test the Diamond system, researchers have developed two proof-of-concept applications. One of those applications, SnapFind, allows users to quickly, interactively search large collections of unlabeled photographs. The motivation for choosing this application is that digital cameras allow users to generate thousands of photos, yet few users have the patience to manually index them. SnapFind enables users to create an initial query (e.g., "find all of the images containing some water") and interactively refine the search based on partial search results. In the current implementation, the user can filter images based on color, texture and shape (such as human faces). A prototype version of SnapFind can be downloaded from the Diamond project website.

[Figure: Diamond SnapFind, a proof-of-concept application]
Research Progress
Since the Diamond project was launched in January 2003, researchers have made significant progress, improving the efficiency of distributed storage systems and developing new machine learning algorithms to improve the accuracy of searches. One such algorithm, PCA-SIFT, is a local shape descriptor. It takes small patches from an image and translates their appearance into a distinctive set of numbers that can be used to find similar patches in other images. This enables Diamond to recognize the same object across different images. Researchers are also exploring an object-based retrieval system that can learn to recognize objects from a small set of examples.
Diamond has many potential applications in the medical arena. For example, a doctor in a remote region, faced with an unusual x-ray, could use Diamond to search annotated medical databases for semantically similar images, to help achieve an accurate diagnosis. Today Intel Research Pittsburgh and Carnegie Mellon University are applying the Diamond research to real-world problems in biomedical imaging. In collaboration with the University of Pittsburgh Medical Center, one of the nation's largest and most advanced integrated health systems, they are developing a system for computer-aided diagnosis of dermatopathology images.
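The matching step, using a patch's set of numbers to find a similar patch in another image, can be sketched as a nearest-neighbor lookup. This sketch does not compute PCA-SIFT descriptors: the four-number vectors are made up, and `nearest_match` and `max_distance` are hypothetical names. It shows only how descriptors, once computed, could be compared.

```python
import math

def nearest_match(query, candidates, max_distance):
    """Return the name of the closest candidate descriptor, or None if too far."""
    best_name, best_dist = None, float("inf")
    for name, descriptor in candidates.items():
        d = math.dist(query, descriptor)  # Euclidean distance between vectors
        if d < best_dist:
            best_name, best_dist = name, d
    return best_name if best_dist <= max_distance else None

# Made-up descriptors for three patches found in another image.
other_image = {
    "patch_a": (0.9, 0.1, 0.4, 0.3),
    "patch_b": (0.2, 0.8, 0.5, 0.1),
    "patch_c": (0.1, 0.2, 0.9, 0.7),
}

print(nearest_match((0.22, 0.79, 0.48, 0.12), other_image, max_distance=0.2))  # patch_b
print(nearest_match((0.5, 0.5, 0.0, 0.9), other_image, max_distance=0.2))      # None
```

The distance cutoff keeps a patch with no real counterpart from being matched to whatever happens to be closest.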
The potential for Diamond extends beyond these examples to many fields of scientific endeavor, from botany to astronomy. In general, any researcher who wants to test a hypothesis against a large amount of data could potentially benefit from Diamond's search capabilities.
Those capabilities will continue to evolve. Today, human operators must supply heuristics to Diamond, and the system's semantic understanding is limited to color, texture and shape. In the future, Diamond will become more capable of understanding semantics in rich data, enabling increasingly sophisticated interactive queries of large non-indexed datasets.