I spoke about my workplace's experience of moving from a 2+ TB relational database to a twelve-shard MongoDB cluster. My hope was to convey some of the challenges we encountered and the lessons we learned along the way.
Have you ever had to work with a legacy relational database and found that there was little, if any, documentation about the existing tables and columns? In my career, this has been a common problem, especially when coming up to speed on an existing application’s database. It’s bad enough when there is no ER diagram to give you an overview of the database, but many database developers seem to do everything in their power to make the database itself opaque.
What are some best practices for making your database easier to understand?
Make Your Table and Column Names Long Enough to Be Descriptive
Instead of something like CSTACCT, use a name like CUSTOMER_ACCOUNT. From a self-documenting perspective, there is little point in using short, cryptic names when you have 30 characters (Oracle) or as many as 63 (PostgreSQL) to work with.
Establish Abbreviation Standards
When you must abbreviate names, use consistent abbreviations, and maintain a list of approved abbreviations that all database designers and developers may use. Enforce the standard abbreviations through code reviews.
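One lightweight way to back up those code reviews is a script that scans your DDL files for short tokens that are not on the approved list. Here is a minimal sketch — the file names, the approved list, and the keyword filter are all hypothetical, and a real check would need smarter tokenizing:

```shell
# Hypothetical approved-abbreviation list, one entry per line.
printf 'ACCT\nCUST\nQTY\n' > abbreviations.txt

# A sample DDL file to scan.
cat > schema.sql <<'SQL'
CREATE TABLE CUST_ACCT (ORD_QTY INT);
SQL

# Break identifiers into uppercase tokens, keep likely abbreviations
# (2-4 letters), drop common type keywords, and report anything that
# is not on the approved list.
tr -cs 'A-Z' '\n' < schema.sql |
  awk 'length >= 2 && length <= 4' |
  sort -u |
  grep -vxE 'INT|CHAR|DATE' |
  grep -vxF -f abbreviations.txt
# -> prints ORD, the one unapproved abbreviation
```

Wiring something like this into your build makes the abbreviation standard self-enforcing instead of relying on reviewers to remember the list.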
Be Consistent with Table and Column Naming
Another practice that makes for difficult-to-understand databases is the use of inconsistent naming conventions. Will your “flag” columns be named “completed_yn”, “completed_flag”, or “completed_ind”? It probably doesn’t matter which convention you adopt, as long as you use it consistently throughout your database.
Use Table and Column Comments
Most RDBMSs today allow you to store comments right in the database itself. This makes it easy to generate a data dictionary. Adopt a coding standard for your data definition SQL that requires table and column comments (sometimes called “remarks”).
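In PostgreSQL, for example, this is done with COMMENT ON statements. A sketch, using hypothetical table and column names, that writes a couple of comments to a DDL file you would then apply with psql:

```shell
# Hypothetical schema; apply the file to your database with:
#   psql -d your_db -f comments.sql
cat > comments.sql <<'SQL'
COMMENT ON TABLE customer_account IS
  'One row per billing account held by a customer.';
COMMENT ON COLUMN customer_account.completed_flag IS
  'Y when account setup is complete, otherwise N.';
SQL
```

Keeping the COMMENT ON statements in the same version-controlled file as the CREATE TABLE statements makes it easy to require them at code-review time.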
Create a Data Dictionary
Whenever you’re responsible for database design or maintenance, publish a data dictionary. There are expensive tools like ERwin that make this easy, but I like to use the free and open-source SchemaSpy. It’s critical to keep the data dictionary up to date, so be sure to publish an updated copy whenever your schema changes.
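For illustration, a SchemaSpy run against a PostgreSQL database looks roughly like this — the jar and JDBC driver file names are placeholders, and you should confirm the flags against the help output of your SchemaSpy version:

```
$ java -jar schemaspy.jar -t pgsql -db mydb -host localhost \
    -s public -u dbuser -p dbpass -o ./data-dictionary \
    -dp /path/to/postgresql-jdbc.jar
```

The -o flag names the directory where the HTML data dictionary is written, so publishing a new revision is just a matter of re-running the tool and copying that directory to a web server.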
What Do You Gain?
By adopting the practices I’ve described, you’ll make it far easier for others to understand your databases and the data they contain. In summary:
- New developers can more easily understand where data is stored in your databases.
- Other teams, outside the software engineering department, can find the data they’re interested in, whether for ad hoc queries or for regularly-scheduled reports.
- When moving to a new system, the ETL logic for migrating data is much easier to develop if all existing data is clearly described.
- Along the same lines, in data integration work it will be straightforward for ETL developers to identify source and target database elements.
Last week I discussed various reasons why you would want to compile your open-source database from the project’s source code. I wanted to give an example here of the process for compiling a particular database, PostgreSQL. Version 9.1 of PostgreSQL was released on 2011-09-11, so the example below works with that code base.
I’m assuming in the following instructions that you’re starting with a sparkling, new Fedora installation. In other words, you haven’t installed any of the dependencies (e.g., gcc, make, bison) that you’ll need for building the PostgreSQL source code. I used Fedora 15, 64-bit.
Install the Build Dependencies

First, install git, a compiler, and the yum-utils package (which provides the yum-builddep command); then pull in the build dependencies for PostgreSQL.

$ sudo yum install git gcc yum-utils
$ sudo yum-builddep postgresql
Check out the Source Code using Git
Grabbing the code via git takes a single command.
$ git clone git://git.postgresql.org/git/postgresql.git
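The clone gives you the development tip of the tree. Since this post targets 9.1, you may prefer to build from the matching release tag; PostgreSQL tags its releases in the form REL9_1_0 (an assumption worth confirming with git tag in your clone). From inside the new postgresql directory:

```
$ git tag | grep REL9_1
$ git checkout REL9_1_0
```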
Configure the Build
$ cd postgresql
$ ./configure --with-libxml --with-libxslt
Make and Install
Build the source with GNU make (installed as both make and gmake on Fedora), then install it.

$ gmake
$ sudo gmake install
Post-Installation and Database Configuration
You’ll want to run PostgreSQL as a dedicated, non-root user, conventionally named postgres. So, create this user and give it a password.
$ sudo adduser postgres
$ sudo passwd postgres
Now, switch to the newly created postgres user to run the database initialization.
$ su - postgres
First, set the environment variables for locating needed libraries and executables.
$ LD_LIBRARY_PATH=/usr/local/pgsql/lib; export LD_LIBRARY_PATH
$ PATH=$PATH:/usr/local/pgsql/bin; export PATH
Now, create a directory to contain the server’s data files.
$ mkdir <directory_for_database_data_files>
Run the initdb command to initialize the PostgreSQL server.
$ initdb -D <directory_for_database_data_files>
I recommend creating a specific directory for the server log files.
$ mkdir <directory_for_server_log_files>
Starting the Server
Start the server as the postgres user, indicating the location of data files and the location of the server log files.
$ postgres -D <directory_for_database_data_files> > <directory_for_server_log_files>/server.log 2>&1 &
There shouldn’t be any errors that prevent the server from starting; if you inspect the log file, you should see messages like the following.
LOG: database system is ready to accept connections
LOG: autovacuum launcher started
Your server should be up and running, ready to accept your client connections!
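As a quick smoke test — still as the postgres user, with /usr/local/pgsql/bin on your PATH — you can create a scratch database and ask the server for its version:

```
$ createdb test
$ psql -d test -c 'SELECT version();'
```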
A number of fundamental concepts crop up again and again in computer science, and this is particularly true across database products. Among these concepts are:
- B-trees (and variations)
- Locking and scheduling
- Memory management
All of the “big-time” relational databases, and many non-relational ones, deal with these issues. Vendors and projects may implement locking in slightly different ways, but you can be sure that any relational database implements it somehow. A strong understanding of locking techniques and how they interact, for example, will go a long way toward preparing you to work on almost any relational database.
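You can see the essential behavior of exclusive locking without a database at all. This toy sketch uses flock(1) to mimic two sessions contending for a lock; real database lock managers are far richer (row locks, intent locks, deadlock detection), but the core rule — one holder excludes the others — is the same:

```shell
# Toy illustration of exclusive-lock contention using flock(1).
lockfile=$(mktemp)

# "Session 1": grab an exclusive lock on the file and hold it briefly.
(
  flock -x 9
  sleep 2
) 9>"$lockfile" &

sleep 0.5   # give session 1 time to acquire the lock

# "Session 2": a non-blocking lock request, loosely analogous to
# SELECT ... FOR UPDATE NOWAIT in SQL.
if flock -xn 9 9>"$lockfile"; then
  echo "lock acquired"
else
  echo "lock busy"   # this branch runs while session 1 holds the lock
fi

wait
rm -f "$lockfile"
```

Swap the -n (non-blocking) flag for a plain blocking request and session 2 instead waits until session 1 releases the lock — the same choice a database scheduler makes between failing fast and queuing a transaction.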
I’ll be adding posts on these topics in the future, but in the meantime, if you were to do nothing more than search Wikipedia for these topics, and read the articles and references contained in them, you would learn a lot about the internals of databases.