HANDBOOK OF MASSIVE DATA SETS
|
EDITED BY James Abello, Panos M. Pardalos, and Mauricio G. C. Resende Copyright © 2002 by Kluwer Academic Publishers. Published by Kluwer Academic Publishers.
|
P.O. Box 17, 3300 AA Dordrecht, The Netherlands.
The proliferation of massive data sets brings with it a series of special computational challenges. This "data avalanche'' arises in a wide range of scientific and commercial applications. With advances in computer and information technologies, many of these challenges are beginning to be addressed by diverse inter-disciplinary groups, that include computer scientists, mathematicians, statisticians and engineers, working in close cooperation with application domain experts. High profile applications include astrophysics, bio-technology, demographics, finance, geographical information systems, government, medicine, telecommunications, the environment and the internet. John R. Tucker of the Board on Mathematical Sciences has stated:
"My interest in this problem (Massive Data Sets) is that I see it as the most important cross-cutting problem for the mathematical sciences in practical problem solving for the next decade, because it is so pervasive.''
The Handbook of Massive Data Sets is comprised of articles written by experts on selected topics that deal with some major aspect of massive data sets. It contains chapters on information retrieval both in the internet and in the traditional sense, web crawlers, massive graphs, string processing, data compression, clustering methods, wavelets, optimization, external memory algorithms and data structures, the US national cluster project, high performance computing, data warehouses, data cubes, semi-structured data, data squashing, data quality, billing in the large, fraud detection, and data processing in astrophysics, air pollution, biomolecular data, earth observation and the environment.
We would like to take the opportunity to thank the authors of the chapters, the anonymous referees, AT&T Labs Research, and the Center for Applied Optimization, University of Florida for supporting this effort. The first editor wants to express his appreciation to Dave Belanger for his continued support. Special thanks and appreciation go to Dr. Hong-Xuan Huang for assisting us with LaTeX and other issues in the preparation of the camera-ready copy of this handbook. Finally, we would like to thank the Kluwer Academic Publishers for their assistance.
James Abello, Panos M. Pardalos, and Mauricio
G. C. Resende
July 2001
CONTENTS
Preface xi
James Abello, Panos M. Pardalos, and Mauricio G. C. Resende
PART I: INTERNET AND THE WORLD WIDE WEB
1. Algorithmic Aspects of Information Retrieval on
the Web 3-23 [abstract]
Andrei Broder and Monika Henzinger
2. High-Performance Web Crawling 25-45 [abstract]
Marc Najork and Allan Heydon
3. Internet growth: Is there a "Moore's Law" for
data traffic? 47-93 [abstract]
K. G. Coffman and A. M. Odlyzko
PART II: MASSIVE GRAPHS
4. Random Evolution in Massive Graphs 97-122 [abstract]
William Aiello, Fan Chung, and Linyuan Lu
5. Property Testing in Massive Graphs 123-147 [abstract]
Oded Goldreich
PART III: STRING PROCESSING AND
DATA COMPRESSION
6. String Pattern Matching for a Deluge Survival
Kit 151-194 [abstract]
Alberto Apostolico and Maxime Crochemore
7. Searching Large Text Collections 195-243 [abstract]
Ricardo Baeza-Yates, Alistair Moffat, and Gonzalo Navarro
8. Data Compression 245-309 [abstract]
David Salomon
PART IV: EXTERNAL MEMORY ALGORITHMS AND
DATA STRUCTURES
9. External Memory Data Structures 313-357 [abstract]
Lars Arge
10. External Memory Algorithms 359-416 [abstract]
Jeffrey Scott Vitter
PART V: OPTIMIZATION
11. Data Envelopment Analysis (DEA) in Massive
Data Sets 419-437 [abstract]
José H. Dulá and Francisco J. López
12. Optimization Methods in Massive Data Sets 439-471 [abstract]
P. S. Bradley, O. L. Mangasarian, and D. R. Musicant
13. Wavelets and Multiscale Transforms in Astronomical
Image Processing 473-500 [abstract]
Fionn Murtagh and Jean-Luc Starck
14. Clustering in Massive Data Sets 501-543 [abstract]
Fionn Murtagh
PART VI: DATA MANAGEMENT
15. Managing and Analyzing Massive Data Sets with
Data Cubes 547-578 [abstract]
Mirek Riedewald, Divyakant Agrawal, and Amr El Abbadi
16. Data Squashing: Constructing Summary Data
Sets 579-591 [abstract]
William DuMouchel
17. Mining and Monitoring Evolving Data 593-642 [abstract]
Venkatesh Ganti and Raghu Ramakrishnan
18. Data Quality in Massive Data Sets 643-659 [abstract]
Michael F. Goodchild and Keith C. Clarke
19. Data Warehousing 661-710 [abstract]
Theodore Johnson
20. Aggregate View Management in Data Warehouses 711-741 [abstract]
Yannis Kotidis
21. Semistructured Data and XML 743-788 [abstract]
Dan Suciu
PART VII: ARCHITECTURE ISSUES
22. Overview of High Performance Computers 791-852 [abstract]
Aad J. van der Steen and Jack Dongarra
23. The National Scalable Cluster Project: Three Lessons
about High Performance Data Mining and Data Intensive
Computing 853-874 [abstract]
Robert Grossman and Robert Hollebeek
24. Sorting and Selection on Parallel Disk Models 875-892 [abstract]
Sanguthevar Rajasekaran
PART VIII: APPLICATIONS
25. Billing in the Large 895-909 [abstract]
Andrew Hume
26. Detecting Fraud in the Real World 911-929 [abstract]
Michael H. Cahill, Diane Lambert, José C. Pinheiro, and
Don X. Sun
27. Massive Datasets in Astronomy 931-979 [abstract]
Robert J. Brunner, S. George Djorgovski, Thomas A. Prince, and
Alex S. Szalay
28. Data Management in Environmental Information
Systems 981-1091 [abstract]
Oliver Günther
29. Massive Data Sets Issues in Earth Observing 1093-1140 [abstract]
Ruixin Yang and Menas Kafatos
30. Mining Biomolecular Data using Background Knowledge and
Artificial Neural Networks 1141-1168 [abstract]
Qicheng Ma, Jason T. L. Wang, and James R. Gattiker
31. Massive Data Set Issues in Air Pollution
Modelling 1169-1220 [abstract]
Zahari Zlatev
Index 1221-1223
CONTRIBUTORS
James Abello
AT&T Labs Research, Florham Park, NJ 07932 USA
Divyakant Agrawal Computer Science Department, University of California, Santa Barbara, CA 93106 USA
William Aiello AT&T Labs Research, Florham Park, NJ 07932, USA
Alberto Apostolico Department of Computer Sciences, Purdue University, West Lafayette, IN 47907, USA and Dipartimento di Elettronica e Informatica, Universitŕ di Padova, Padova, Italy
Lars Arge Department of Computer Science, Duke University, Durham, NC 27708-0129 USA
Ricardo Baeza-Yates Department of Computer Science, Universidad de Chile, Santiago, Chile
P. S. Bradley Microsoft Research, Redmond, WA 98052, USA
Andrei Broder AltaVista Company, San Mateo, California, USA
Robert J. Brunner California Institute of Technology, Pasadena, CA 91125, USA
Michael H. Cahill Lucent Technologies, New Providence, NJ 07974, USA
Fan Chung University of California, San Diego, La Jolla, CA 92093, USA
Keith C. Clarke National Center for Geographic Information and Analysis, and Department of Geography, University of California, Santa Barbara, CA 93106-4060, USA
K. G. Coffman AT&T Labs Research, Florham Park, NJ 07932, USA
Maxime Crochemore Institut Gaspard Monge, Université de Marne-la-Vallée, F-93160 Noisy-le-Grand, France
S. George Djorgovski California Institute of Technology, Pasadena, CA 91125, USA
Jack Dongarra Computer Science Department, University of Tennessee, and Oak Ridge National Laboratory, Oak Ridge, Tennessee, USA
José H. Dulá School of Business, University of Mississippi, University, MS 38677, USA
William DuMouchel AT&T Labs Research, Florham Park, NJ 07932, USA
Amr El Abbadi Computer Science Department, University of California, Santa Barbara CA 93106 USA
Venkatesh Ganti Department of Computer Sciences, University of Wisconsin, Madison, WI 53706 USA
James R. Gattiker Los Alamos National Laboratory, Los Alamos, NM 87544, USA
Oded Goldreich Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, Rehovot, Israel
Michael F. Goodchild National Center for Geographic Information and Analysis, and Department of Geography, University of California, Santa Barbara, CA 93106-4060, USA
Robert Grossman University of Illinois at Chicago and Magnify Inc., Chicago, IL 60607, USA
Oliver Günther Institute of Information Systems, School of Economics and Business Administration, Humboldt-Universität zu Berlin, Germany
Monika Henzinger Google Incorporated, Mountain View, California, USA
Allan Heydon Model N, Inc., South San Francisco, CA 94080, USA
Robert Hollebeek University of Pennsylvania and Hubs Inc., Philadelphia, PA 19104, USA
Andrew Hume AT&T Labs Research, Florham Park, NJ 07932, USA
Theodore Johnson AT&T Labs Research, Florham Park, NJ 07932, USA
Menas Kafatos Center for Earth Observing & Space Research (CEOSR), School of Computational Sciences (SCS), George Mason University, Fairfax, VA 22030, USA
Yannis Kotidis AT&T Labs Research, Florham Park, NJ 07932, USA
Diane Lambert Bell Labs, Lucent Technologies, Murray Hill, NJ 07974, USA
Francisco J. López School of Business, University of Mississippi, University, MS 38677, USA
Linyuan Lu University of California, San Diego, La Jolla, CA 92093, USA
Qicheng Ma Department of Computer and Information Science, New Jersey Institute of Technology, Newark, NJ 07102, USA
O. L. Mangasarian Computer Sciences Dept., University of Wisconsin, Madison, WI 53706, USA
Alistair Moffat Dept. Computer Science and Software Engineering, The University of Melbourne, Victoria 3010, Australia
Fionn Murtagh School of Computer Science, The Queen's University of Belfast, Belfast BT7 1NN, Northern Ireland
D. R. Musicant Computer Sciences Dept., University of Wisconsin, Madison, WI 53706, USA
Marc Najork Compaq Computer Corporation Systems Research Center, Palo Alto, CA 94301, USA
Gonzalo Navarro Dept. of Computer Science, Universidad de Chile, Santiago, Chile
A. M. Odlyzko AT&T Labs Research, Florham Park, NJ 07932, USA
Panos M. Pardalos
University of Florida, Gainesville, FL 32611 USA
José C. Pinheiro Bell Labs, Lucent Technologies, Murray Hill, NJ 07974, USA
Thomas A. Prince California Institute of Technology, Pasadena, CA 91125, USA
Sanguthevar Rajasekaran Department of CISE, University of Florida, Gainesville, FL 32611, USA
Raghu Ramakrishnan Department of Computer Sciences, University of Wisconsin, Madison, WI 53706 USA
Mauricio G. C. Resende
AT&T Labs Research, Florham Park, NJ 07932 USA
Mirek Riedewald Computer Science Department, University of California, Santa Barbara CA 93106 USA
David Salomon Computer Science Department, California State University, Northridge, CA 91330, USA
Jean-Luc Starck DAPNIA/SEI-SAP, CEA/Saclay, 91191 Gif sur Yvette, France
Aad J. van der Steen Dept. of Computational Physics, Utrecht University, 3508 TD Utrecht, The Netherlands
Dan Suciu Computer Science Department, University of Washington, Seattle, WA 98195, USA
Don X. Sun Bell Labs, Lucent Technologies, Murray Hill, NJ 07974, USA
Alex S. Szalay Johns Hopkins University, Baltimore, MD 21218, USA
Jeffrey Scott Vitter Department of Computer Science, Duke University, Durham, NC 27708-0129 USA
Jason T. L. Wang Department of Computer and Information Science, New Jersey Institute of Technology, Newark, NJ 07102, USA
Ruixin Yang Center for Earth Observing & Space Research (CEOSR), School of Computational Sciences (SCS), George Mason University, Fairfax, VA 22030, USA
Zahari Zlatev National Environmental Research Institute, DK-4000 Roskilde, Denmark