HANDBOOK OF MASSIVE DATA SETS 

     
EDITED BY James Abello, Panos M. Pardalos, and Mauricio G. C. Resende
Copyright © 2002 by Kluwer Academic Publishers.
Published by Kluwer Academic Publishers. 











                                  




P.O. Box 17, 3300 AA Dordrecht, The Netherlands.
 

Preface  

The proliferation of massive data sets brings with it a series of special computational challenges. This "data avalanche'' arises in a wide range of scientific and commercial applications. With advances in computer and information technologies, many of these challenges are beginning to be addressed by diverse inter-disciplinary groups, that include computer scientists, mathematicians, statisticians and engineers, working in close cooperation with application domain experts. High profile applications include astrophysics, bio-technology, demographics, finance, geographical information systems, government, medicine, telecommunications, the environment and the internet. John R. Tucker of the Board on Mathematical Sciences has stated:

 "My interest in this problem (Massive Data Sets) is that I see it as the most important cross-cutting problem for the mathematical sciences in practical problem solving for the next decade, because it is so pervasive.'' 

The Handbook of Massive Data Sets is comprised of articles written by experts on selected topics that deal with some major aspect of massive data sets. It contains chapters on information retrieval both in the internet and in the traditional sense, web crawlers, massive graphs, string processing, data compression, clustering methods, wavelets, optimization, external memory algorithms and data structures, the US national cluster project, high performance computing, data warehouses, data cubes, semi-structured data, data squashing, data quality, billing in the large, fraud detection, and data processing in astrophysics, air pollution, biomolecular data, earth observation and the environment. 

We would like to take the opportunity to thank the authors of the chapters, the anonymous referees, AT&T Labs Research, and the Center for Applied Optimization, University of Florida for supporting this effort. The first editor wants to express his appreciation to Dave Belanger for his continued support. Special thanks and appreciation go to Dr. Hong-Xuan Huang for assisting us with LaTeX and other issues in the preparation of the camera-ready copy of this handbook. Finally, we would like to thank the Kluwer Academic Publishers for their assistance. 

James Abello, Panos M. Pardalos, and Mauricio G. C. Resende  

July 2001

 
CONTENTS 
Preface xi
	James Abello, Panos M. Pardalos, and Mauricio G. C. Resende 
PART I: INTERNET AND THE WORLD WIDE WEB
	1. Algorithmic Aspects of Information Retrieval on 
           the Web  3-23 [abstract]
		Andrei Broder and Monika Henzinger
	2. High-Performance Web Crawling  25-45 [abstract]
		Marc Najork and Allan Heydon	
	3. Internet growth: Is there a "Moore's Law" for 
           data traffic?  47-93 [abstract]
		K. G. Coffman and A. M. Odlyzko
PART II: MASSIVE GRAPHS



	
	4. Random Evolution in Massive Graphs  97-122 [abstract]
		William Aiello, Fan Chung, and Linyuan Lu 
	5. Property Testing in Massive Graphs  123-147 [abstract]
		Oded Goldreich
PART III: STRING PROCESSING AND 
          DATA COMPRESSION
	6. String Pattern Matching for a Deluge Survival 
           Kit  151-194 [abstract]
		Alberto Apostolico and Maxime Crochemore
	7. Searching Large Text Collections  195-243 [abstract]
		Ricardo Baeza-Yates, Alistair Moffat, and Gonzalo Navarro
	8. Data Compression  245-309 [abstract]
		David Salomon
PART IV: EXTERNAL MEMORY ALGORITHMS AND 
         DATA STRUCTURES
	9.  External Memory Data Structures  313-357 [abstract]
		Lars Arge	
	10. External Memory Algorithms  359-416 [abstract]
		Jeffrey Scott Vitter
PART V: OPTIMIZATION
	11.  Data Envelopment Analysis (DEA) in Massive 
             Data Sets  419-437 [abstract]
		José H. Dulá and Francisco J. López
	12.  Optimization Methods in Massive Data Sets  439-471 [abstract]
		 P. S. Bradley, O. L. Mangasarian, and D. R. Musicant
	13.  Wavelets and Multiscale Transforms in Astronomical 
            Image Processing  473-500 [abstract]
		Fionn Murtagh and Jean-Luc Starck 
	14.  Clustering in Massive Data Sets  501-543 [abstract]
		Fionn Murtagh
PART VI: DATA MANAGEMENT
	15.  Managing and Analyzing Massive Data Sets with 
             Data Cubes  547-578 [abstract]
		Mirek Riedewald, Divyakant Agrawal, and Amr El Abbadi
	16.  Data Squashing: Constructing Summary Data 
             Sets  579-591 [abstract]
		William DuMouchel
	17.  Mining and Monitoring Evolving Data  593-642 [abstract]
		Venkatesh Ganti and Raghu Ramakrishnan 
	18.  Data Quality in Massive Data Sets  643-659 [abstract]
		 Michael F. Goodchild and Keith C. Clarke
	19.  Data Warehousing   661-710 [abstract]
		Theodore Johnson 
	20.  Aggregate View Management in Data Warehouses   711-741 [abstract]
		Yannis Kotidis 
	21.  Semistructured Data and XML  743-788 [abstract]
		Dan Suciu  
 PART VII: ARCHITECTURE ISSUES
	22.  Overview of High Performance Computers  791-852 [abstract]
		Aad J. van der Steen and Jack Dongarra
	23.  The National Scalable Cluster Project: Three Lessons 
             about High Performance Data Mining and Data Intensive 
             Computing  853-874 [abstract]
		 Robert Grossman and Robert Hollebeek
	24.  Sorting and Selection on Parallel Disk Models  875-892 [abstract]
		Sanguthevar Rajasekaran
PART VIII: APPLICATIONS
	25.  Billing in the Large  895-909 [abstract]
		Andrew Hume
	26.  Detecting Fraud in the Real World  911-929 [abstract]
		Michael H. Cahill, Diane Lambert, José C. Pinheiro, and 
		Don X. Sun
	27.  Massive Datasets in Astronomy  931-979 [abstract]
		Robert J. Brunner, S. George Djorgovski, Thomas A. Prince, and 
  		Alex S. Szalay
	28.  Data Management in Environmental Information 
             Systems  981-1091 [abstract]
		Oliver Günther
	29.  Massive Data Sets Issues in Earth Observing  1093-1140 [abstract]
		Ruixin Yang and Menas Kafatos
	30.  Mining Biomolecular Data using Background Knowledge and 
            Artificial Neural Networks  1141-1168 [abstract]
		 Qicheng Ma, Jason T. L. Wang, and James R. Gattiker
	31.  Massive Data Set Issues in Air Pollution 
             Modelling  1169-1220 [abstract]
		 Zahari Zlatev
Index 1221-1223
 
 
 
CONTRIBUTORS
James Abello
AT&T Labs Research, Florham Park, NJ 07932 USA
Divyakant Agrawal



Computer Science Department, University of California,



Santa Barbara, CA 93106 USA
William Aiello



AT&T Labs Research, Florham Park, NJ 07932, USA
Alberto Apostolico



Department of Computer Sciences, Purdue University,



West Lafayette, IN 47907, USA and



Dipartimento di Elettronica e Informatica,



Universitŕ di Padova,



Padova, Italy
Lars Arge



Department of Computer Science, Duke University,



Durham, NC 27708-0129 USA
Ricardo Baeza-Yates



Department of Computer Science, Universidad de Chile,



Santiago, Chile
P. S. Bradley



Microsoft Research, Redmond, WA 98052, USA
Andrei Broder



AltaVista Company, San Mateo, California, USA
Robert J. Brunner



California Institute of Technology, Pasadena, CA 91125, USA
Michael H. Cahill



Lucent Technologies, New Providence, NJ 07974, USA
Fan Chung



University of California, San Diego, La Jolla, CA 92093, USA
Keith C. Clarke



National Center for Geographic Information and Analysis,



and Department of Geography, University of California, Santa



Barbara, CA 93106-4060, USA
K. G. Coffman



AT&T Labs Research, Florham Park, NJ 07932, USA
Maxime Crochemore



Institut Gaspard Monge,



Université de Marne-la-Vallée,



F-93160 Noisy-le-Grand, France
S. George Djorgovski



California Institute of Technology, Pasadena, CA 91125, USA
Jack Dongarra



Computer Science Department, University of Tennessee, and



Oak Ridge National Laboratory,



Oak Ridge, Tennessee, USA
José H. Dulá



School of Business, University of Mississippi, University, MS 38677, USA
William DuMouchel



AT&T Labs Research, Florham Park, NJ 07932, USA
Amr El Abbadi



Computer Science Department, University of California,



Santa Barbara CA 93106 USA
Venkatesh Ganti



Department of Computer Sciences, University of Wisconsin,



Madison, WI 53706 USA
James R. Gattiker



Los Alamos National Laboratory, Los Alamos, NM 87544, USA
Oded Goldreich



Department of Computer Science and Applied Mathematics,



Weizmann Institute of Science, Rehovot, Israel
Michael F. Goodchild



National Center for Geographic Information and Analysis,



and Department of Geography, University of California, Santa



Barbara, CA 93106-4060, USA
Robert Grossman



University of Illinois at Chicago and Magnify Inc.,



Chicago, IL 60607, USA
Oliver Günther



Institute of Information Systems,



School of Economics and Business Administration,



Humboldt-Universität zu Berlin, Germany
Monika Henzinger



Google Incorporated, Mountain View, California, USA
Allan Heydon



Model N, Inc., South San Francisco, CA 94080, USA
Robert Hollebeek



University of Pennsylvania and Hubs Inc., Philadelphia, PA



19104, USA
Andrew Hume



AT&T Labs Research, Florham Park, NJ 07932, USA
Theodore Johnson



AT&T Labs Research, Florham Park, NJ 07932, USA
Menas Kafatos



Center for Earth Observing & Space Research (CEOSR),



School of  Computational Sciences (SCS), George Mason University,



Fairfax, VA 22030, USA
Yannis Kotidis



AT&T Labs Research, Florham Park, NJ 07932, USA
Diane Lambert



Bell Labs, Lucent Technologies, Murray Hill, NJ 07974, USA
Francisco J. López



School of Business, University of Mississippi, University,



MS 38677, USA
Linyuan Lu



University of California, San Diego, La Jolla, CA 92093, USA
Qicheng Ma



Department of Computer and Information Science, New Jersey



Institute of Technology, Newark, NJ 07102, USA
O. L. Mangasarian



Computer Sciences Dept., University of Wisconsin, Madison,



WI 53706, USA
Alistair Moffat



Dept. Computer Science and Software Engineering,



The University of Melbourne,



Victoria 3010, Australia
Fionn Murtagh



School of Computer Science, The Queen's University of



Belfast, Belfast BT7 1NN, Northern Ireland
D. R. Musicant



Computer Sciences Dept., University of Wisconsin, Madison, WI 53706, USA
Marc Najork



Compaq Computer Corporation Systems Research Center, Palo



Alto, CA 94301, USA
Gonzalo Navarro



Dept. of Computer Science, Universidad de Chile,



Santiago, Chile
A. M. Odlyzko



AT&T Labs Research, Florham Park, NJ 07932, USA
Panos M. Pardalos
University of Florida, Gainesville, FL 32611 USA
José C. Pinheiro



Bell Labs, Lucent Technologies, Murray Hill, NJ 07974, USA
Thomas A. Prince



California Institute of Technology, Pasadena, CA 91125, USA
Sanguthevar Rajasekaran



Department of CISE, University of Florida, Gainesville, FL



32611, USA
Raghu Ramakrishnan



Department of Computer Sciences, University of Wisconsin,



Madison, WI 53706 USA
Mauricio G. C. Resende
AT&T Labs Research, Florham Park, NJ 07932 USA
Mirek Riedewald



Computer Science Department, University of California,



Santa Barbara CA 93106 USA
David Salomon



Computer Science Department, California State University,



Northridge, CA 91330, USA
Jean-Luc Starck



DAPNIA/SEI-SAP, CEA/Saclay, 91191 Gif sur Yvette, France
Aad J. van der Steen



Dept. of Computational Physics, Utrecht University,



3508 TD Utrecht, The Netherlands
Dan Suciu



Computer Science Department, University of Washington,



Seattle, WA 98195, USA
Don X. Sun



Bell Labs, Lucent Technologies, Murray Hill, NJ 07974, USA
Alex S. Szalay



Johns Hopkins University, Baltimore, MD 21218, USA
Jeffrey Scott Vitter



Department of Computer Science, Duke University,



Durham, NC 27708-0129 USA
Jason T. L. Wang



Department of Computer and Information Science, New Jersey



Institute of Technology, Newark, NJ 07102, USA
Ruixin Yang



Center for Earth Observing & Space Research (CEOSR),



School of  Computational Sciences (SCS), George Mason University,



Fairfax, VA 22030, USA
Zahari Zlatev



National Environmental Research Institute,



DK-4000 Roskilde, Denmark