Interesting Data-Related Blogs and Articles – Week of July 28, 2019

Special this week: Humble Bundle Data Analysis & Machine Learning ebooks from O’Reilly Media.

Donate at least $1 US and you’ll be able to download 5 ebooks. If you donate at least $15 US, you will get 15 ebooks! A sample of titles:

  • Graphing Data with R
  • Learning Apache Drill
  • Architecting Modern Data Platforms

AWS

Best practices for Amazon RDS PostgreSQL replication

Contains recommendations for monitoring, configuration parameters, and locking behavior on both the source and the replica instances.

Cloud Vendor Deep-Dive: PostgreSQL on AWS Aurora

An interesting neutral-party overview of the Aurora PostgreSQL implementation. Covers best practices, compatibility with “vanilla” PostgreSQL, monitoring, and a host of other topics.

Orchestrate an ETL process using AWS Step Functions for Amazon Redshift

Describes the architecture of a solution that relies on a combination of Step Functions, Lambda, and AWS Batch to build an ETL workflow. There’s a link in the article to a CloudFormation template that will launch the needed infrastructure.


PostgreSQL

Combined Indexes vs. Separate Indexes In PostgreSQL

Excellent discussion on when to create composite indexes as opposed to indexes on single columns.
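The distinction is easy to see even from Python, using SQLite’s planner (chosen only because it ships with the standard library; the article itself is about PostgreSQL, whose composite-index syntax is essentially the same). The table and index names below are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE orders (customer_id INTEGER, order_date TEXT, total REAL)")

# One combined (composite) index covering both columns.
cur.execute("CREATE INDEX idx_cust_date ON orders (customer_id, order_date)")

# A query filtering on both columns can use the combined index directly.
plan = cur.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders "
    "WHERE customer_id = 42 AND order_date = '2019-07-28'"
).fetchall()
print(plan)

# A query on the trailing column alone generally cannot use it, which is
# one reason separate single-column indexes sometimes win.
plan2 = cur.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE order_date = '2019-07-28'"
).fetchall()
print(plan2)
```

The first plan searches via idx_cust_date; the second falls back to a full scan because the index’s leading column is not constrained.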

TRUNCATE vs DELETE: Efficiently Clearing Data from a Postgres Table

You might be surprised which method is faster (at least in this set of tests).

Parallelism in PostgreSQL

PostgreSQL has supported various parallel strategies since version 9.6, and an even broader set of parallel operations came with version 10. This article provides an overview of the types of queries that can execute in parallel.
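As a conceptual sketch (plain Python, not PostgreSQL internals), a parallel aggregate has the same shape as a Gather node over Partial Aggregate workers: each worker aggregates its chunk of the data, and the leader combines the partial results:

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    # Each worker computes a partial aggregate over its own chunk,
    # analogous to a Partial Aggregate over a Parallel Seq Scan.
    return sum(chunk)

data = list(range(1_000_000))
n_workers = 4
chunk_size = len(data) // n_workers
chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

with ThreadPoolExecutor(max_workers=n_workers) as pool:
    partials = list(pool.map(partial_sum, chunks))

# The leader "finalizes" the aggregate by combining the partial results,
# analogous to Gather plus Finalize Aggregate.
total = sum(partials)
print(total)
```

The point is the plan shape, not the speedup (Python threads won’t accelerate a pure-Python sum, whereas PostgreSQL’s workers are real OS processes).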


R

Ten more random useful things in R you may not know about

It’s truly a random list, but a number of these look interesting and helpful.


Software Updates

Anaconda 2019.07

Midyear release of the popular Python environment.

Jupyter, PyCharm and Pizza

Last week I linked to the release notes for the latest version of PyCharm (2019.2), the popular Python IDE. This article shows how to use the Jupyter notebook integration in that release.


Practices and Architecture

Overview of Consistency Levels in Database Systems

“Isolation levels” is probably the better-known concept, but as this article explains, database systems have consistency levels as well (consistency here in the distributed-systems sense, not the C in ACID). I recommend reading anything by Daniel Abadi.

Presto at Pinterest

Presto is a query engine for executing SQL against different data sources. The article discusses challenges Pinterest has encountered in using Presto for various use cases and how it overcame them.


General Data-Related

Harvard Data Science Review

A new open-access journal sponsored by the Harvard Data Science Initiative.

Salesforce Completes Acquisition of Tableau

That didn’t take long. The acquisition was originally announced on June 10.

This Week in Neo4j – Exploration from Bloom Canvas, Building your first Graph App, Parallel k-Hop counts, Scala Cypher DSL

Technically, last week in Neo4j. A helpful summary of noteworthy posts from the Neo4j blog.


Upcoming Conferences of Interest

ApacheCon 2019

There’s one in Las Vegas (North America) in September and one in Berlin (Europe) in October.

DCMI 2019 Conference

The Dublin Core Metadata Initiative 2019 Annual Conference. Held this year in Seoul, South Korea.


Classic Paper or Reference of the Week

A Simple Guide to Five Normal Forms in Relational Database Theory

Everyone who designs relational databases ought to be familiar with the concepts in this article. There are other normal forms (Boyce-Codd normal form, anyone?), but these five are the foundation.


Data Technology of the Week

Apache NiFi

“Apache NiFi supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic.” In other words, one can view NiFi as an ETL tool, but it offers features like data provenance and back pressure that are not common in ETL tools, and these, among other capabilities, make NiFi more of a general-purpose dataflow processor. NiFi was originally developed by the US National Security Agency (NSA). You can learn more about flow-based programming in this Wikipedia article. Pronounce the name “nye fye”.
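The back-pressure idea is easy to sketch outside NiFi: a bounded queue between two processing steps makes a fast producer block when a slow consumer falls behind, instead of flooding the system. A toy illustration of the concept (my own sketch, not NiFi code):

```python
import queue
import threading
import time

# A "connection" between two processors with a back-pressure
# threshold of 5 queued items.
flow = queue.Queue(maxsize=5)

def producer():
    for i in range(20):
        flow.put(i)  # blocks whenever the queue is full: back pressure

def consumer(results):
    while len(results) < 20:
        item = flow.get()
        time.sleep(0.001)  # a deliberately slow downstream processor
        results.append(item)

results = []
t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer, args=(results,))
t1.start(); t2.start()
t1.join(); t2.join()
print(results[:5])
```

NiFi applies this per connection (with configurable object-count and data-size thresholds); the bounded queue is just the simplest way to see why the producer slows down rather than overwhelming the consumer.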


Metadata Standard of the Week

Categories for the Description of Works of Art

Created and maintained by the J. Paul Getty Trust, the CDWA is a set of guidelines for describing art, architecture, and other works of culture. One implementation of the CDWA is the Cultural Objects Name Authority.

A New Year and New Technology

Happy New Year!

With a new year here, I’ve been thinking about ways to expand my skill set and the technologies to learn more about in 2012.  As I’m still a data guy at heart, it should be no surprise that the technologies that interest me are related to data and databases.

MongoDB

I’m hoping to do a lot of work with MongoDB this year.  In the past, I’ve played around with it some, but it looks as if there will be at least one project at work that will let me get some deep experience using MongoDB.  I’m currently reading “MongoDB: The Definitive Guide” by Kristina Chodorow and Michael Dirolf and am really enjoying it.  Based on what I’ve read so far, I would say this is one of those classic O’Reilly books: specialized, but so well written that you almost forget you’re reading something highly technical.

Hadoop

I attended a number of talks at QCon SF back in November and one topic that recurred frequently was how companies are using Hadoop.  It seemed as if every presenter described how their company is finding a way to use at least part of the Hadoop ecosystem.  And Hadoop is truly that: an entire ecosystem, encompassing not only the core project, but also Pig, ZooKeeper, Mahout, HBase, and still others.  You can find more information at the Apache Hadoop project page.  I’m hoping to get a proof-of-concept cluster up and running by the middle of the year.

Lucene/Solr

Full-text search technology falls into an area that, like MongoDB, I have had some exposure to in the past but would very much like to learn more about.  I’m planning to spend some time this year on Lucene first, and then move on to the search server, Solr.  I’m fortunate to have some colleagues with a lot of experience in this area, and I intend to mine their knowledge whenever possible.

XML

I believe that those of us steeped in the relational database world can take a great step toward a data-architect mindset by developing a deep understanding of XML and its related technologies.  My plan this year is to get a firmer understanding of XML, XPath, XSLT, XML Schema, etc.  I’ve done quite a bit of work with XML in the past, but typically in relation to Oracle’s handling of XML documents within the database.  I want to gain a broader understanding.

Of course, JSON is coming on strong and is supplanting XML in some of its previous strongholds.  For example, MongoDB’s data model relies on JSON-formatted documents.
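To make the overlap concrete, here is the same record expressed both ways and read with nothing but the Python standard library (illustrative data; note that ElementTree supports only a limited subset of XPath):

```python
import json
import xml.etree.ElementTree as ET

xml_doc = "<book><title>MongoDB: The Definitive Guide</title><year>2010</year></book>"
json_doc = '{"title": "MongoDB: The Definitive Guide", "year": 2010}'

# XML: navigate the tree with an XPath-style expression.
root = ET.fromstring(xml_doc)
title_from_xml = root.find("./title").text

# JSON: parse straight into native dicts, which is much of its appeal
# for document stores like MongoDB.
title_from_json = json.loads(json_doc)["title"]

print(title_from_xml == title_from_json)
```

For simple records the two are interchangeable; XML earns its keep when you need schemas, namespaces, mixed content, or XSLT-style transformation.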

This seems like enough to keep me busy outside of my day job for one year!

Book Review: “HBase: The Definitive Guide” by Lars George (O’Reilly Media)

Summary

(Disclosure: O’Reilly Media provided me with a free ebook copy of this book for the purposes of this review. I have done my best not to let that influence my opinions here.)

When a book bills itself as “The Definitive Guide,” well, that’s a tall order to fill. But, except for updates as new releases of HBase roll out, I can’t imagine another book surpassing this one by Lars George.

Lars George has been working with HBase since 2007 and has been a full committer to the project since 2009.  He now works for Cloudera (a company providing a commercial flavor of Hadoop, as well as Hadoop support).  After reading this book, there’s no question in my mind that George has a deep understanding not only of HBase as a data solution but also of its internal workings.

My Reactions

George gives the background and history of HBase in the larger context of relational databases and NoSQL, which I found to be very helpful. The many diagrams throughout the book are extremely useful in explaining concepts, especially for those of us coming from a relational database background.

George has an excellent and clear writing style. Take, for example, the section where he discusses The Problem with Relational Database Systems, giving a quick rundown of the typical steps for getting an RDBMS to scale up.  The flow of his summary reads like the increasing levels of panic that many of us have gone through when dealing with a database-backed application that will not scale.

As an example of how thorough and comprehensive the book is, look at chapter 2, where there is an extensive discussion of the type and class (not desktop PCs!) of machines suitable for running HBase. George gives a truly helpful set of configuration practices, even down to a recommendation for having redundant power supply units.

Another example of his thoroughness comes where George discusses delete methods (Chapter 3). He shows how you can use custom versioning, while admitting that the example is somewhat contrived. Indeed, right after elaborating the example, there is a distinct “Warning” box that admits that custom versioning is not actually recommended.  So, even though you may not implement custom versioning, you do understand it as a feature that HBase provides.

Many of the programming examples come with excellent remarks or discussions of the tradeoffs implicit in the techniques, including performance and scaling concerns.  Java developers will be most comfortable with the majority of examples, but they can be followed by anyone with some object-oriented programming experience.

I really appreciated the thorough discussion in chapter 8 (“Architecture”) of subjects like B+ trees vs. Log-Structured Merge Trees (LSMs), the Write-Ahead Log, and seeks vs. transfers, topics which are relevant not only to HBase but to many database systems of varying architectures.
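For readers new to the LSM half of that comparison, the write path can be sketched in a few lines. This is a toy of my own devising, not HBase code: writes land in an in-memory memtable, full memtables are flushed as immutable sorted runs, and reads consult the memtable first and then the runs newest-first:

```python
import bisect

class TinyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.runs = []              # immutable sorted runs, newest last
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            # Flush: a sequential write of a sorted run -- the cheap
            # operation that makes LSMs write-friendly vs. B+ trees.
            self.runs.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):  # newest run wins
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

db = TinyLSM()
db.put("a", 1)
db.put("b", 2)   # triggers a flush of {"a": 1, "b": 2}
db.put("a", 3)   # newer value shadows the flushed one
print(db.get("a"), db.get("b"))
```

A real LSM engine also compacts runs in the background and uses Bloom filters to skip runs on reads; the seeks-vs-transfers tradeoff George discusses falls directly out of this structure.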

The level of thoroughness is also the book’s only weakness.  I’m not sure who the target audience for this book is, because it serves both developers and system or database administrators.  While nearly every imaginable HBase topic is touched upon, some would have been better off merely listed, with appropriate references given to sources of more information (for example, all those hardware recommendations). The print edition of the book is 552 pages.

Still, a complaint that a book is too detailed shouldn’t be interpreted as much of a complaint.  Anyone with an interest in NoSQL databases in general, and HBase in particular, should read and study this book.  It’s not likely to be superseded anytime soon.

The catalog page for “HBase: The Definitive Guide”.