Searching Large Text Collections

Ricardo Baeza-Yates, Alistair Moffat, and Gonzalo Navarro

In this chapter we present the main data structures and algorithms for searching large text collections. We emphasize inverted files, the most used index, but also review suffix arrays, which are useful in a number of specialized applications. We also cover parallel and distributed implementations of these two structures. As an example, we show how mechanisms based upon inverted files can be used to index and search the Web.

Keywords: Inverted files, Index, Search, Web crawlers, Suffix arrays, Supraindex.