(Disclosure: O’Reilly Media provided me with a free ebook copy of this book for the purposes of this review. I have done my best not to let that influence my opinions here.)
When a book bills itself as “The Definitive Guide,” well, that’s a tall order. But, apart from updates as new releases of HBase roll out, I can’t imagine another book surpassing this one by Lars George.
Lars George has been working with HBase since 2007 and has been a full committer to the project since 2009. He now works for Cloudera (a company providing a commercial flavor of Hadoop, as well as Hadoop support). After reading this book, there’s no question in my mind that George has a deep understanding, not only of HBase as a data solution, but also of its internal workings.
George gives the background and history of HBase in the larger context of relational databases and NoSQL, which I found to be very helpful. The many diagrams throughout the book are extremely useful in explaining concepts, especially for those of us coming from a relational database background.
George has an excellent and clear writing style. Take, for example, the section where he discusses “The Problem with Relational Database Systems,” giving a quick rundown of the typical steps for getting an RDBMS to scale up. The flow of his summary reads like the increasing levels of panic that many of us have gone through when dealing with a database-backed application that will not scale.
As an example of how thorough and comprehensive the book is, look at chapter 2, where there is an extensive discussion of the type and class (not desktop PCs!) of machines suitable for running HBase. George gives a truly helpful set of configuration practices, even down to a recommendation for having redundant power supply units.
Another example of his thoroughness comes where George discusses delete methods (chapter 3). He shows how you can use custom versioning, while admitting that the example is somewhat contrived. Indeed, right after the example, a distinct “Warning” box states that custom versioning is not actually recommended. So, even though you may never implement custom versioning, you come away understanding it as a feature that HBase provides.
Many of the programming examples come with excellent remarks or discussions of the tradeoffs implicit in the techniques, including performance and scaling concerns. Java developers will be most comfortable with the majority of the examples, but anyone with some object-oriented programming experience can follow them.
I really appreciated the thorough discussion in chapter 8 (“Architecture”) of subjects like B+ trees vs. Log-Structured Merge Trees (LSMs), the Write-Ahead Log, and seeks vs. transfers, topics which are relevant not only to HBase but to many database systems of varying architectures.
The level of thoroughness is also the book’s only weakness. I’m not sure who the target audience for this book is, because it serves both developers and system or database administrators. While nearly every imaginable HBase topic is touched upon, some would have been better off merely listed, with appropriate references given to sources of more information (for example, all those hardware recommendations). The print edition of the book runs to 552 pages.
Still, a complaint that a book is too detailed shouldn’t be interpreted as much of a complaint. Anyone with an interest in NoSQL databases in general, and HBase in particular, should read and study this book. It’s not likely to be superseded.
The catalog page for “HBase: The Definitive Guide”.