Brief Impressions of Mongo Boston 2011

October 6, 2011

Monday I attended Mongo Boston 2011 at the Microsoft NERD Center in Cambridge.

The opening keynote by 10gen’s CTO and co-founder Eliot Horowitz struck a couple of very interesting notes.

  • 10gen wants MongoDB to be a general-purpose database.
  • One of their key principles in building MongoDB is to reduce the number of “knobs” an administrator needs to turn.

Overall, I would say the conference was valuable, but it could really do with a second day.  For one thing, none of the presentations ran longer than forty-five minutes.  While that length does allow for decent overviews, it's impossible to get into any real depth in such limited time.

A second day could also reduce some of the “drinking from a fire hose” effect.  I attended eight different presentations, which contained a lot of concepts to absorb.

I wouldn’t recommend these conferences for those who have no experience at all using MongoDB.  I’ve worked with it for a little over a year now, so the material was at a good level for my current understanding.

The price was right at $20 or $30, depending on whether you met the early-bird deadline.  In my mind, this pricing is a shrewd strategy by 10gen, as it enables interested students to attend.  Building interest and enthusiasm among up-and-coming developers is a great way to build a community.  That said, it was gratifying to see that the attendees represented a wide range of ages.

If you get an opportunity to attend one of the upcoming conferences, I think you’ll find the day worth your time.

Categories: MongoDB, NoSQL

Book Review: “HBase: The Definitive Guide” by Lars George (O’Reilly Media)

October 5, 2011

Summary

(Disclosure: O’Reilly Media provided me with a free ebook copy of this book for the purposes of this review. I have done my best not to let that influence my opinions here.)

When a book bills itself as “The Definitive Guide,” well, that’s a tall order to fill. But, except for updates as new releases of HBase roll out, I can’t imagine another book surpassing this one by Lars George.

Lars George has been working with HBase since 2007 and has been a full committer on the project since 2009.  He now works for Cloudera (a company providing a commercial distribution of Hadoop, as well as Hadoop support).  After reading this book, there's no question in my mind that George has a deep understanding, not only of HBase as a data solution, but of the internal workings of HBase.

My Reactions

George gives the background and history of HBase in the larger context of relational databases and NoSQL, which I found to be very helpful. The many diagrams throughout the book are extremely useful in explaining concepts, especially for those of us coming from a relational database background.

George has an excellent and clear writing style. Take, for example, the section where he discusses The Problem with Relational Database Systems, giving a quick rundown of the typical steps for getting an RDBMS to scale up.  The flow of his summary reads like the increasing levels of panic that many of us have gone through when dealing with a database-backed application that will not scale.

As an example of how thorough and comprehensive the book is, look at Chapter 2, where there is an extensive discussion of the type and class of machines (not desktop PCs!) suitable for running HBase. George gives a truly helpful set of configuration practices, even down to a recommendation for redundant power supply units.

Another example of his thoroughness comes in Chapter 3, where George discusses delete methods. He shows how you can use custom versioning, while admitting that the example is somewhat contrived. Indeed, right after elaborating the example, there is a distinct "Warning" box noting that custom versioning is not actually recommended.  So, even though you may not implement custom versioning, you do understand it as a feature that HBase provides.
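To make the idea concrete, here's a rough sketch in the HBase shell of the general technique, supplying an application-chosen version number in place of the default timestamp (the table and column names are my own invention, not the book's):

hbase> put 'mytable', 'row1', 'cf:field', 'first draft', 1
hbase> put 'mytable', 'row1', 'cf:field', 'second draft', 2
hbase> get 'mytable', 'row1', {COLUMN => 'cf:field', VERSIONS => 2}

Both versions come back, keyed by the numbers you supplied rather than by wall-clock time, which is exactly the behavior the book's "Warning" box cautions you about relying on.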

Many of the programming examples come with excellent remarks or discussions of the tradeoffs implicit in the techniques, including performance and scaling concerns.  Java developers will be most comfortable with the majority of examples, but they can be followed by anyone with some object-oriented programming experience.

I really appreciated the thorough discussion in Chapter 8 ("Architecture") of subjects like B+ trees vs. Log-Structured Merge (LSM) trees, the Write-Ahead Log, and seeks vs. transfers, topics relevant not only to HBase but to many database systems of varying architectures.

The level of thoroughness is also the book’s only weakness.  I’m not sure who the target audience for this book is, because it serves both developers and system or database administrators.  While nearly every imaginable HBase topic is touched upon, some would have been better off merely listed, with appropriate references given to sources of more information (for example, all those hardware recommendations). The print edition of the book is 552 pages.

Still, a complaint that a book is too detailed shouldn't be interpreted as much of a complaint.  Anyone with an interest in NoSQL databases in general, and HBase in particular, should read and study this book.  It's not likely to be superseded any time soon.

The catalog page for “HBase: The Definitive Guide”.

Categories: Book Reviews, Books, NoSQL

Book Review: “Big Data Glossary” by Pete Warden (O’Reilly Media)

October 2, 2011

“Big Data Glossary” could probably have been titled something like “Big Data Cheat Sheets,” because it's both more and less than a glossary.  Rather than a list of terms with definitions, the book is an excellent summary of tools in the “big data” space.

Warden tackles eleven topics:

  1. Some background on fundamental techniques (e.g., key-value stores)
  2. NoSQL databases
  3. MapReduce
  4. Storage techniques
  5. “Cloud” servers
  6. Data processing technologies (e.g., R and Lucene)
  7. Natural Language Processing
  8. Machine Learning
  9. Visualization
  10. Acquisition
  11. Serialization

He covers none of these topics in great detail, which will no doubt cause carping among some folks.  However, I really like his approach of sketching broad themes, identifying key projects (or products) in each space, and pointing the reader to further research.  Because the field of “big data” is so large, this short book (it’s only 50 pages) serves the extremely useful purpose of tying together the field by providing an overview.

Highly recommended for folks looking to get their feet wet in the great lake of big data.

Compiling and Installing PostgreSQL 9.1 from Source on Fedora 15 (64-bit)

September 23, 2011

Prerequisites

Last week I discussed various reasons why you would want to compile your open-source database from the project's source code.  I wanted to give an example here of the process for compiling a particular database, PostgreSQL.  Version 9.1 of PostgreSQL was released on 2011-09-12, so the example below works with that code base.

I'm assuming in the following instructions that you're starting with a sparkling new Fedora installation.  In other words, you haven't installed any of the dependencies (e.g., gcc, make, bison) that you'll need for building the PostgreSQL source code.  I used Fedora 15, 64-bit.

To grab the most up-to-date Postgres code, you’ll want to check it out from the git repository.  If git is not installed on your system, install it.
$ sudo yum install git
There are a number of libraries and software programs that you'll need if you're going to build PostgreSQL.  The easiest way to get all of them is to use the yum-builddep command.  (This will also install flex, bison, and Perl, if they are not already present on your system.)
$ sudo yum-builddep postgresql
Among other software and libraries, yum should offer to install bison, flex, and tcl.  However, yum-builddep for postgresql does not install the gcc compiler, which you will also need.  To get gcc installed, run the command:
$ sudo yum install gcc
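If you want to verify the toolchain is in place before proceeding, a quick version check of each tool does the trick.
$ gcc --version
$ bison --version
$ flex --version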

Check out the Source Code using Git

Grabbing the code via git is a very simple one-line command.

$ git clone git://git.postgresql.org/git/postgresql.git
This checks the source code out into a sub-directory of your current location, called postgresql.
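One thing to keep in mind: a plain clone leaves you on the development tip (master), which is newer than 9.1.  If you want to build the 9.1 release line specifically, you can switch to its stable branch (following the project's branch-naming convention):

$ (cd postgresql && git checkout REL9_1_STABLE)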

Configure the Build

A full list of the configuration options available at the command line is described in the “PostgreSQL 9.1.0 Documentation”.  For this example, I ran a fairly vanilla configuration, but added features for SQL/XML in the database.
$ cd postgresql
$ ./configure --with-libxml --with-libxslt
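If configure stops with complaints about missing XML headers, the likely fix on Fedora is to install the development packages for libxml2 and libxslt, then re-run configure:
$ sudo yum install libxml2-devel libxslt-devel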

Make and Install

Now that you’ve fine-tuned exactly how you want Postgres built, you’ll run the fairly standard make and install steps.
$ gmake
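If you'd like some reassurance before installing, the build tree includes a regression test suite you can run at this point (as your normal, unprivileged user):
$ gmake check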
Note that you must run the install step as a privileged user, since write access to the /usr/local directory (it's there, at /usr/local/pgsql/bin, that the binary files will live) is restricted by default.
$ sudo gmake install

Post-Installation and Database Configuration

You’ll want to run PostgreSQL as a specific, non-root user, postgres.  So, create this user.

$ sudo adduser postgres
$ sudo passwd postgres

Now, change user to the newborn postgres user to run the database initialization.

$ su - postgres

First, set the environment variables for locating needed libraries and executables.

$ LD_LIBRARY_PATH=/usr/local/pgsql/lib; export LD_LIBRARY_PATH
$ PATH=$PATH:/usr/local/pgsql/bin; export PATH
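These settings only last for the current shell session.  To make them permanent for the postgres user, you can append them to that user's ~/.bash_profile:

$ echo 'export LD_LIBRARY_PATH=/usr/local/pgsql/lib' >> ~/.bash_profile
$ echo 'export PATH=$PATH:/usr/local/pgsql/bin' >> ~/.bash_profile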

Now, create a directory to contain the server’s data files.

$ mkdir <directory_for_database_data_files>

Run the initdb command to initialize the PostgreSQL server.

$ initdb -D <directory_for_database_data_files>

I recommend creating a specific directory for the server log files.

$ mkdir <directory_for_server_log_files>

Starting the Server

Start the server as the postgres user, indicating the location of data files and the location of the server log files.

$ postgres -D <directory_for_database_data_files> ><directory_for_server_log_files>/server.log 2>&1 &
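Alternatively, the pg_ctl wrapper that ships with PostgreSQL handles the backgrounding and log redirection for you:

$ pg_ctl -D <directory_for_database_data_files> -l <directory_for_server_log_files>/server.log start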

There shouldn't be any errors that prevent the server from starting; if you inspect the log file, you should see the following messages.

LOG: database system is ready to accept connections
LOG: autovacuum launcher started

Your server should be up and running, ready to accept your client connections!
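As a quick smoke test, create a database and connect to it with the bundled psql client:

$ createdb test
$ psql test
test=# SELECT version();

If that returns a PostgreSQL 9.1 version string, you're in business.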

Why Compile MySQL?

September 14, 2011

I’ve got an upcoming post that discusses the steps for compiling MySQL from source.  Before I get to that topic, though, I thought a preliminary post on the reasons for doing so would be helpful.

Almost every open-source database project provides an installer for various platforms.  Why then would you bother to compile MySQL (or any other database)?  There are a number of good reasons.

  • You have an opportunity to read the code

If you have the time and inclination, grabbing and compiling the source code is the perfect excuse to spend some time reading the code, particularly the parts of the code that interest you.  How exactly does PostgreSQL implement k-nearest-neighbor indexing (new in version 9.1)?

Sometimes the source code is installed along with binary versions when you use an installer, but this is often not the case.  If you download the source to compile it, you know you’ll have it available to read at your leisure.

  • Some options are only available by compiling the source

You might want to try out an atypical storage engine, but you can only do that by compiling in support for that engine when you build the code itself.  For example, say you want to use the FEDERATED engine in MySQL, but it doesn't come as part of the installer-produced installation.  You really have no choice but to compile MySQL and include support for the FEDERATED engine as part of the configure-and-compile process, as sketched below.
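As a rough sketch (assuming the CMake-based build system that recent MySQL versions use), the configure step would look something like this:

$ cmake . -DWITH_FEDERATED_STORAGE_ENGINE=1
$ make
$ sudo make install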

  • You can more easily set some defaults

You might not care for the default database character set.  Sometimes DBAs or developers forget to set a character set when creating a table.  For those times, you may want to enforce a standard that all database tables use the UTF-8 character set (unless specifically overridden).  When configuring MySQL for compilation, use the -DDEFAULT_CHARSET=utf8 command-line option.
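Again assuming the CMake-based build, a configure line that bakes UTF-8 in as the server-wide default might look like this:

$ cmake . -DDEFAULT_CHARSET=utf8 -DDEFAULT_COLLATION=utf8_general_ci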

  • You want to get an up-to-date version, including the latest patches

Want the latest and greatest version of the server, including the most recently committed changes for bugs and security holes?  No problem!  Just download and compile the most current source, and you will automatically get the latest bug fixes and security patches.

In future posts, I’ll discuss the steps for compiling MySQL and PostgreSQL from source code.

Fundamental Constructs

September 13, 2011

There are a number of very fundamental concepts that crop up again and again in computer science.  This is particularly true when it comes to different database products.  Among these constructs are:

  • B-trees (and variations)
  • Caches
  • Locking and scheduling
  • Hashing
  • Memory management

All “big-time” relational databases, and many non-relational ones, deal with these issues.  Vendors or projects may implement locking in slightly different ways, but you can be pretty sure that a relational database deals with locking.  Having a strong understanding, for example, of locking techniques and how they interact will help you go a long way toward working on almost any relational database.
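For a concrete taste of locking, open two psql sessions against the same PostgreSQL database and run the following in each (the accounts table here is hypothetical):

test=# BEGIN;
test=# SELECT * FROM accounts WHERE id = 1 FOR UPDATE;

The second session blocks on its SELECT until the first session commits or rolls back; that, in miniature, is the scheduling problem every relational database has to solve.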

I’ll be adding posts on these topics in the future, but in the meantime, if you were to do nothing more than search Wikipedia for these topics, and read the articles and references contained in them, you would learn a lot about the internals of databases.

Developers vs. DBAs

August 17, 2011

You’ve probably been involved in the following kind of situation.  I know I have.

It’s late in the day and two people are arguing in the CIO’s office.

Developer: “I’m just trying to do my job!  I need access to the production database so I can troubleshoot this bug.”

DBA: “You shouldn’t need access to production databases to do your job.  You can troubleshoot this just as well in our non-prod database.”

Developer: “No, the data condition that is causing the problem isn’t available in any database outside of production.”

There are two viewpoints at play here: application building versus data guardianship.

To a good DBA, data itself is a “feature”, which needs integrity, consistency, and cleanliness.  DBAs know that lots of groups within an organization may want to access the data that has been written, or to extract reports from it.  They also know that a given application may fade away into the sunset, but that the data itself will live forever in some form, probably to be used by the application that replaces the Visual Basic dinosaur.  The DBA in our scenario is most likely worried that the developer will somehow corrupt what would otherwise be sound data.

Most DBAs want to be proactive in protecting data, but they are often put in the position of being reactive.

To a good application developer, data is what the application works with to provide functionality.  It's more of a by-product than an end in itself.  Capturing, presenting, and manipulating data are all done by applications, and those applications are the soul of what a good developer creates.

I wonder how many DBAs or data administrators take the time to explain the “stewardship” mentality they have around databases.

One of the biggest challenges to communication between developers and DBAs is the “object-relational impedance mismatch”.  The overwhelmingly most popular method for software development is object orientation.  Relational databases are much friendlier to the procedural world.

How can we resolve this clash?  Each side needs to understand where the other is coming from.  For developers, this means understanding that DBAs view data as an asset to be protected.  For DBAs, this means understanding that developers view data as a resource for their applications.

Since I'm coming at this from a database-centric view, I have one specific suggestion for developers and three for DBAs.

First, developers should learn more about relational theory.  It’s not enough to understand how various database engines implement the theory, or to know how to write SQL.  Developers should take some time to understand that relational databases are built on a mathematical foundation dealing with issues of data consistency and completeness.

Second, DBAs should learn an object-oriented language.  I'm not suggesting that the typical DBA become an application developer, but that they can learn the object-oriented mindset by writing programs in an OO language.  Pick one, any one: Python, Java, Ruby; it doesn't matter for this purpose.

Third, DBAs should understand object-relational mapping and its tools.  A DBA should at least understand the concepts behind a framework like Hibernate.  This not only furthers the goal of getting the OO mindset, but helps DBAs understand the kind of code these frameworks generate.  A DBA can then become a collaborator with developers, helping to tweak the generated SQL so that it performs well.
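To illustrate the kind of collaboration I mean: an ORM configured for lazy loading can emit one query per child row, the classic “N+1” pattern, which a DBA can help rework into a single join.  (The schema here is invented for the example.)

SELECT * FROM orders WHERE customer_id = 42;
SELECT * FROM order_lines WHERE order_id = 1001;
SELECT * FROM order_lines WHERE order_id = 1002;
-- ...one query per order, versus a single round trip:
SELECT o.*, l.* FROM orders o JOIN order_lines l ON l.order_id = o.id WHERE o.customer_id = 42;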

Fourth, and most important, DBAs should educate as well as administer.  Get out of the “ivory tower” mindset of the computer room.  Teach developers some basic relational theory.  Help them understand why you view the relationships between data elements as so important.

Categories: Work