Interesting Data-Related Blogs and Articles – Week of August 4, 2019

AWS

Announcing PartiQL: One query language for all your data

AWS has open-sourced a new query language based on SQL (“SQL-compatible”, as the blog post puts it). Essentially, it is a superset of SQL that requires the JRE. This project bears watching, but it will only be successful outside AWS if data store query engines add support for PartiQL. To further that aim, AWS is releasing the specification and a reference implementation (written in Kotlin). The AWS services that support PartiQL thus far are:

  • Amazon S3 Select
  • Amazon Glacier Select
  • Amazon Redshift Spectrum
  • Amazon Quantum Ledger Database

Build highly available MySQL applications using Amazon Aurora Multi-Master

Not yet available for the PostgreSQL-compatible version of Amazon Aurora. Let’s hope it will be.

EBS default volume type updated to GP2

Per the announcement: “GP2 volumes offer lower latency and higher throughput than Standard volumes.”

Techtalk: Best Practices for Running Spark Applications Using Spot Instances on EMR

August 28. This is a 300-level session.

PostgreSQL

PostgreSQL: Regular expressions and pattern matching

The author intends this to be the first in a series of posts on using regular expressions (RE) with PostgreSQL. This post overviews four RE operators.


Python

Python in Visual Studio Code – August 2019 Release

Visual Studio Code has been coming on strong as a popular cross-platform and cross-language IDE. This release furthers the support for Python and Jupyter notebooks.


R

Transitioning into the tidyverse (part 2)

A pair of posts introducing the tidyverse, the ecosystem of R packages (my cat Cookie’s personal favorite being purrr) that promotes a consistent methodology for data analysis.


Software Updates

DBeaver 6.1.4 (2019-08-04)

Among the fixes are these:

  • PostgreSQL:
    • Array data type handler was fixed
    • Indexes metadata reading was fixed
    • Execution plan visualization was fixed for CTE nodes
  • SQL Server:
    • Support of identity columns creation was added
    • Session manager was fixed
  • Oracle:
    • Scheduled jobs metadata reading was fixed
    • Session management was fixed for RAC mode

General Data-Related

DFLib – a lightweight, pure Java implementation of DataFrame

Dataframes (think database tables, spreadsheets) are foundational data structures in Python, R, and Spark. This library provides similar functionality when writing Java.

Liquibase Improving Community Support

In a sign of the continued strength of Liquibase as a community project, this post announces the hiring of an open source community manager.

Traversing the Land of Graph Computing and Databases

Based on the author’s talk at PyCon X, this is a high-level overview of the resurgence of interest in graph-based technology and graph databases.


Upcoming Conferences of Interest

Strata Data Conference – New York, September 23-26

This is one of a series of O’Reilly-sponsored conferences on big data and data science. There are two coming up next year: Strata Data Conference – San Jose (March 15-18, 2020) and Strata Data Conference – London (April 20-23, 2020).


Classic Paper or Reference of the Week

Since I linked to a couple of blog posts above on the R “tidyverse”, I thought some folks might be interested in the paper by Hadley Wickham that started it all: Tidy Data, as published in the Journal of Statistical Software.


Data Technology of the Week

AgensGraph

An Apache 2.0-licensed project that supports a property graph data model on top of PostgreSQL (version 10.3). AgensGraph supports both ANSI SQL and openCypher for querying. There’s an enterprise version from Bitnine.


Metadata Standard of the Week

Friend of a Friend or “FOAF” is an ontology, using RDF and OWL, to describe persons, activities, and relations to other people and objects. An example use of FOAF is to describe a social network.
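
To make this concrete, here is a minimal sketch of a FOAF description built with the Python rdflib library, which ships with a FOAF namespace (the people and URIs below are made up for illustration):

from rdflib import Graph, Literal, URIRef
from rdflib.namespace import FOAF, RDF

# Build a tiny FOAF graph: two people and one "knows" relationship.
g = Graph()
alice = URIRef("http://example.org/people/alice")
bob = URIRef("http://example.org/people/bob")

g.add((alice, RDF.type, FOAF.Person))
g.add((alice, FOAF.name, Literal("Alice")))
g.add((alice, FOAF.knows, bob))
g.add((bob, RDF.type, FOAF.Person))
g.add((bob, FOAF.name, Literal("Bob")))

# Serialize the graph as Turtle to see the FOAF triples.
print(g.serialize(format="turtle"))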

Interesting Data-Related Blogs and Articles – Week of July 28, 2019

Special this week: Humble Bundle Data Analysis & Machine Learning ebooks from O’Reilly Media.

Donate at least $1 US and you’ll be able to download 5 ebooks. If you donate at least $15 US, you will get 15 ebooks! A sample of titles:

  • Graphing Data with R
  • Learning Apache Drill
  • Architecting Modern Data Platforms

AWS

Best practices for Amazon RDS PostgreSQL replication

Contains recommendations for monitoring, configuration parameters, and locking behavior on both the source and the replica instances.

Cloud Vendor Deep-Dive: PostgreSQL on AWS Aurora

An interesting neutral-party overview of the Aurora implementation of PostgreSQL. Covers best practices, compatibility with “vanilla” PostgreSQL, monitoring, and a host of other topics.

Orchestrate an ETL process using AWS Step Functions for Amazon Redshift

Describes the architecture and overview of a solution that relies on a combination of Step Functions, Lambda, and Batch for building an ETL workflow. There’s a link in the article to a CloudFormation template that will launch the needed infrastructure.
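
As a rough illustration (not taken from the article, and the state machine ARN and input payload here are invented), kicking off such a workflow from Python with boto3 might look like this:

import json
import boto3

sfn = boto3.client("stepfunctions")

# Start one run of a Step Functions state machine that drives the ETL.
response = sfn.start_execution(
    stateMachineArn="arn:aws:states:us-east-1:123456789012:stateMachine:redshift-etl",
    name="nightly-load-2019-08-04",
    input=json.dumps({"target_table": "sales_fact", "load_date": "2019-08-04"}),
)
print(response["executionArn"])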


PostgreSQL

Combined Indexes vs. Separate Indexes In PostgreSQL

Excellent discussion on when to create composite indexes as opposed to indexes on single columns.

TRUNCATE vs DELETE: Efficiently Clearing Data from a Postgres Table

You might be surprised which method is faster (at least in this set of tests).

Parallelism in PostgreSQL

PostgreSQL has supported various parallel strategies since version 9.6, and an even broader set of parallel operations came with version 10. This article provides an overview of the types of queries that can execute in parallel.


R

Ten more random useful things in R you may not know about

It’s truly a random list, but a number of these look interesting and helpful.


Software Updates

Anaconda 2019.07

Midyear release of the popular Python environment.

Jupyter, PyCharm and Pizza

Last week I linked to the release notes for the latest version of PyCharm (2019.2), the popular Python IDE. This article shows how to use the Jupyter notebook integration in that release.


Practices and Architecture

Overview of Consistency Levels in Database Systems

“Isolation levels” is probably the better-known concept, but as this article explains, database systems have consistency levels as well (consistency here in the distributed-systems sense, not the C in ACID). I recommend reading anything by Daniel Abadi.

Presto at Pinterest

Presto is a query engine for executing SQL on different data sources. The article discusses challenges that Pinterest has encountered in using Presto for various use cases and how it overcame them.


General Data-Related

Harvard Data Science Review

A new open-access journal sponsored by the Harvard Data Science Initiative.

Salesforce Completes Acquisition of Tableau

That didn’t take long. The acquisition was originally announced on June 10.

This Week in Neo4j – Exploration from Bloom Canvas, Building your first Graph App, Parallel k-Hop counts, Scala Cypher DSL

Technically, last week in Neo4j. A helpful summary of noteworthy posts from the Neo4j blog.


Upcoming Conferences of Interest

ApacheCon 2019

There’s one in Las Vegas (North America) in September and one in Berlin (Europe) in October.

DCMI 2019 Conference

The Dublin Core Metadata Initiative 2019 Annual Conference. Held this year in Seoul, South Korea.


Classic Paper or Reference of the Week

A Simple Guide to Five Normal Forms in Relational Database Theory

Everyone who designs relational databases ought to be familiar with the concepts in this article. There are higher normal forms (Boyce-Codd normal form anyone?) that have been proposed since Kent’s paper, but these five are the foundation.


Data Technology of the Week

Apache NiFi

“Apache NiFi supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic.” In other words, one can view NiFi as an ETL tool, but it offers features like data provenance and back pressure that are not common in ETL tools; these, among other capabilities, make NiFi more of a general-purpose data flow processor. NiFi was originally developed by the US National Security Agency (NSA). You can learn more about flow-based programming in this Wikipedia article. Pronounce the name “nye fye”.


Metadata Standard of the Week

Categories for the Description of Works of Art

Created and maintained by the J. Paul Getty Trust, the CDWA is a set of guidelines for describing art, architecture, and other works of culture. One implementation of the CDWA is the Cultural Objects Name Authority.

Interesting Data-Related Blogs and Articles – Week of July 21, 2019

AWS

AWS Tech Talk (July 31): How to Build Serverless Data Lake Analytics with Amazon Athena

Will discuss using AWS Athena for querying data in S3.

Migrate and deploy your Apache Hive metastore on Amazon EMR

Includes a discussion of using the Glue Catalog. I’ve experimented with Glue. The Glue Catalog might be the most useful part of Glue (in the current state of Glue development).


PostgreSQL

PostgreSQL wins 2019 O’Reilly Open Source Award for Lifetime Achievement

Last year’s winner was Linux, so PostgreSQL is in excellent company. This is only the second year that the award has been presented.


Python

pandas-profiling

This is a library for profiling a data set. I have been playing around with it and so far really like the functionality and simplicity of using pandas-profiling via, for example, a Jupyter Notebook.
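
If you want to try it yourself, a minimal sketch looks something like this (the file names are placeholders, and the API has shifted slightly between releases):

import pandas as pd
import pandas_profiling

# Load any tabular data set; "my_dataset.csv" is a placeholder.
df = pd.read_csv("my_dataset.csv")

# Build the profile and write it out as a standalone HTML report.
# In a Jupyter Notebook you can simply display the ProfileReport object inline.
profile = pandas_profiling.ProfileReport(df)
profile.to_file("my_dataset_profile.html")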

Researchers love PyTorch and TensorFlow

From the O’Reilly AI channel. The finding mentioned in the headline comes from an analysis of papers posted on arXiv.org.

StanfordNLP 0.2.0 – Python NLP Library for Many Human Languages

“StanfordNLP is a Python natural language analysis package. It contains tools, which can be used in a pipeline, to convert a string containing human language text into lists of sentences and words, to generate base forms of those words, their parts of speech and morphological features, and to give a syntactic structure dependency parse, which is designed to be parallel among more than 70 languages, using the Universal Dependencies formalism. In addition, it is able to call the CoreNLP Java package and inherits additonal functionality from there, such as constituency parsing, coreference resolution, and linguistic pattern matching.”
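
The quick start from the project’s documentation is roughly the following (the first call downloads the sizable English models, so it only needs to run once):

import stanfordnlp

stanfordnlp.download('en')    # one-time download of the English models
nlp = stanfordnlp.Pipeline()  # builds the default English pipeline

doc = nlp("Barack Obama was born in Hawaii. He was elected president in 2008.")
doc.sentences[0].print_dependencies()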


R

June 2019 “Top 40” R Packages

These cover Computational Methods, Data, Finance, Genomics, Machine Learning, Science and Medicine, Statistics, Time Series, Utilities, and Visualization.


Software Updates

DBeaver 6.1.3

Excerpts from the release notes (the link shows all changes):

“New project configuration format was implemented.

Major features:

  • Data viewer: “References” panel was added (browse values by foreign and reference keys)

Other:

  • Connection page was redesigned
  • PostgreSQL: struct/array data types support was fixed
  • MySQL: privileges viewer was fixed (global privileges grant/revoke)”

PyCharm 2019.2

With improved integration with Jupyter Notebook, among other improvements.


Practices and Architecture

Five principles that will keep your data warehouse organized

Some are obvious, some less so:

  • Use schemas to logically group together objects
  • Use consistent and meaningful names for objects in a warehouse
  • Use a separate user for each human being and application connecting to your data warehouse
  • Grant privileges systematically
  • Limit access to superuser privileges

Graph Databases Go Mainstream

Given that this article is published in Forbes, it’s hard to argue with the headline. An interesting overview.

The Little Book of Python Anti-Patterns

“Welcome, fellow Pythoneer! This is a small book of Python anti-patterns and worst practices.”
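
To give a flavor of what’s inside, here is one classic anti-pattern of the sort the book catalogs, the mutable default argument, along with the idiomatic fix:

# Anti-pattern: a mutable default argument is evaluated once, so state
# leaks across calls.
def append_bad(item, items=[]):
    items.append(item)
    return items

# Idiomatic fix: default to None and create a fresh list inside the function.
def append_good(item, items=None):
    if items is None:
        items = []
    items.append(item)
    return items

print(append_bad(1), append_bad(2))    # [1, 2] [1, 2]  -- surprising
print(append_good(1), append_good(2))  # [1] [2]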

General Data-Related

What We Learned From The 2018 Liquibase Community Survey

The creator of Liquibase shares information about who uses Liquibase.

You’re very easy to track down, even when your data has been anonymized

Scary stuff: turns out anonymizing data doesn’t protect you from being identified after all.


Podcasts

Acquiring and sharing high-quality data

July 18th episode of the O’Reilly Data Show Podcast.


Upcoming Conferences of Interest

NODES 2019

“Neo4j Online Developer Expo & Summit.” Apparently, this is the first-ever such conference for the Neo4j community.


Classic Paper or Reference of the Week

The Design of Postgres

Written by Michael Stonebraker and Lawrence A. Rowe, this paper describes the architecture of Postgres as a successor to INGRES. Of course, it is the jumping-off point for the PostgreSQL of today.


Data Technologies of the Week

I couldn’t pick just one.

Apache Iceberg

“Apache Iceberg is an open table format for huge analytic datasets. Iceberg adds tables to Presto and Spark that use a high-performance format that works just like a SQL table.” Still incubating, but sounds very cool.

Apache Avro

Avro is a data serialization format. “Avro provides functionality similar to systems such as Thrift, Protocol Buffers, etc. Avro differs from these systems in the following fundamental aspects.

  • Dynamic typing: Avro does not require that code be generated. Data is always accompanied by a schema that permits full processing of that data without code generation, static datatypes, etc. This facilitates construction of generic data-processing systems and languages.
  • Untagged data: Since the schema is present when data is read, considerably less type information need be encoded with data, resulting in smaller serialization size.
  • No manually-assigned field IDs: When a schema changes, both the old and new schema are always present when processing data, so differences may be resolved symbolically, using field names.”
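
Here is a small sketch of what that looks like in practice, using the third-party fastavro package (my choice for brevity; the official avro package works along the same lines):

from fastavro import parse_schema, reader, writer

# The schema is plain JSON-style data; no code generation is involved.
schema = parse_schema({
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
    ],
})

records = [{"name": "Ada", "age": 36}, {"name": "Grace", "age": 45}]

# The schema travels with the data file, so readers need nothing else.
with open("users.avro", "wb") as out:
    writer(out, schema, records)

with open("users.avro", "rb") as fp:
    for rec in reader(fp):
        print(rec)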

Dask

“Dask natively scales Python. Dask provides advanced parallelism for analytics, enabling performance at scale for the tools you love.”
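
A minimal sketch of the dataframe side of the API (the S3 path and column names are placeholders):

import dask.dataframe as dd

# Lazily read many CSV files as one logical dataframe.
ddf = dd.read_csv("s3://my-bucket/events-*.csv")

# Operations build a task graph; compute() triggers the parallel execution.
result = ddf.groupby("user_id")["duration"].mean().compute()
print(result.head())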


Metadata Standard of the Week

BISAC Subject Headings (2018 Edition)

“The BISAC Subject Headings List, also known as the BISAC Subject Codes List, is a standard used by many companies throughout the supply chain to categorize books based on topical content. The Subject Heading applied to a book can determine where the work is shelved in a brick and mortar store or the genre(s) under which it can be searched for in an internal database.” The Book Industry Study Group (BISG) provides a helpful FAQ for deciding what BISAC to use for a book.

Interesting Data-Related Blogs and Articles – Week of July 14, 2019

I’ve added some new sections this week, though I still intend to focus on data and data-related items.

AWS

Announcing the support of Parquet data format in AWS DMS 3.1.3

Apparently the AWS “Database Migration Service” can be used for migrating files, not just databases. The service now supports migrating to S3 in Apache Parquet format. This could be useful if you want to use Amazon Athena or Redshift Spectrum to query the data.

Orchestrating an ETL process using AWS Step Functions for Amazon Redshift

“Modern data lakes depend on extract, transform, and load (ETL) operations to convert bulk information into usable data. This post walks through implementing an ETL orchestration process that is loosely coupled using AWS Step Functions, AWS Lambda, and AWS Batch to target an Amazon Redshift cluster.”

New AWS Public Datasets Available from Facebook, Yale, Allen Institute for Brain Science, NOAA, and others

AWS hosts a large number (114 so far) of open data sets. The registry provides search functionality to help you find what you may be looking for. More information is at the Open Data on AWS page.

Separating queries and managing costs using Amazon Athena workgroups

This post, from the AWS Big Data blog, describes an important way to isolate workloads (for example, ad-hoc vs. reporting) and attribute costs appropriately (by using tags) when querying data via AWS Athena. It’s a helpful companion piece to the item above on Parquet and DMS.
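
As a hedged sketch of the idea (the workgroup name, tags, and results bucket below are invented), creating a workgroup and routing a query to it with boto3 might look like this:

import boto3

athena = boto3.client("athena")

# Create a workgroup for ad-hoc analysis and tag it for cost allocation.
athena.create_work_group(
    Name="adhoc-analysts",
    Description="Ad-hoc exploration, separated from scheduled reporting",
    Tags=[{"Key": "team", "Value": "analytics"},
          {"Key": "workload", "Value": "adhoc"}],
)

# Queries can then be routed to that workgroup explicitly.
athena.start_query_execution(
    QueryString="SELECT 1",
    WorkGroup="adhoc-analysts",
    ResultConfiguration={"OutputLocation": "s3://my-query-results/adhoc/"},
)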


PostgreSQL

BRIN Index for PostgreSQL: Don’t Forget the Benefits

The benefits include smaller sizes than B-Tree indexes, fast scanning of extremely large tables, and more efficient vacuuming. The original proposal, linked in the article above, is here. It provides more rationale for what the proposer, Alvaro Herrera, called “minmax indexes”.


Software Updates

Oracle released its July Critical Patch Update (CPU) (2019-07-16).


Practices and Architecture

A Data Cleaner’s Cookbook

OK, pretty old-school, but pretty cool ways to clean data from the command line. The author has an accompanying blog, called “BASHing data”.

Graph Query Language GQL

This is a proposed ISO standard for querying graph databases. There’s even a GQL Manifesto.

The Rise Of Natural Language Interfaces To Databases

This development seems to be driven by the needs of querying RDF-triple stores, but applies to all models of databases.


Classic Paper or Reference of the Week

Data Cleaning: Problems and Current Approaches

The classification of data quality problems is as helpful today as it was back in 2000, when this paper was first published.


Cool Research Paper of the Week

Towards Multiverse Databases

You can think of a multiverse database as one that extends the concept of a distributed database with an individual view of the data for each user. Multiverse databases have a centralized privacy policy that needs to be implemented only once.


Data Technology of the Week

Apache Superset

Aims to provide “…a modern, enterprise-ready business intelligence web application”. Still incubating, but already has an impressive list of companies using it. Check out the Visualizations Gallery.


Metadata Standard of the Week

MARC is actually a set of formats that was originally created in the 1960s and 1970s. MARC includes formats for bibliographic metadata, authority records (e.g., names, subjects), holdings, classification data, and community information.

Book Review: “Big Data Glossary” by Pete Warden (O’Reilly Media)

“Big Data Glossary” could probably have been titled something like “Big Data Cheat Sheets” because it’s both more and less than a glossary. Rather than a list of terms with definitions, the book is an excellent summary of tools in the “big data” space.

Warden tackles eleven topics:

  1. Some background on fundamental techniques (e.g., key-value stores)
  2. NoSQL databases
  3. MapReduce
  4. Storage techniques
  5. “Cloud” servers
  6. Data processing technologies (e.g., R and Lucene)
  7. Natural Language Processing
  8. Machine Learning
  9. Visualization
  10. Acquisition
  11. Serialization

He covers none of these topics in great detail, which will no doubt cause carping among some folks.  However, I really like his approach of sketching broad themes, identifying key projects (or products) in each space, and pointing the reader to further research.  Because the field of “big data” is so large, this short book (it’s only 50 pages) serves the extremely useful purpose of tying together the field by providing an overview.

Highly recommended for folks looking to get their feet wet in the great lake of big data.

Compiling and Installing PostgreSQL 9.1 from Source on Fedora 15 (64-bit)

Prerequisites

Last week I discussed various reasons why you would want to compile your open-source database from the project’s source code.  I wanted to give an example here of the process for compiling a particular database, PostgreSQL.  Version 9.1 of PostgreSQL was released on 2011-09-11, so the example below works with that code base.

I’m assuming in the following instructions that you’re starting with a sparkling, new Fedora installation.  In other words, you haven’t installed any of the dependencies (e.g., gcc, make, bison) that you’ll need for building the PostgreSQL source code.  I used Fedora 15, 64-bit.

To grab the most up-to-date Postgres code, you’ll want to check it out from the git repository.  If git is not installed on your system, install it.
$ sudo yum install git
There are a number of libraries and software programs that you’ll need if you’re going to build PostgreSQL. The easiest way to get all of them is to use the yum-builddep command. (This will also install flex, bison, and Perl, if they are not already present on your system.)
$ sudo yum-builddep postgresql
Among other software and libraries, yum should offer to install bison, flex, and tcl.  However, yum-builddep for postgresql does not install the gcc compiler, which you will also need.  To get gcc installed, run the command:
$ sudo yum install gcc

Check out the Source Code using Git

Grabbing the code via git is a very simple one-line command.

$ git clone git://git.postgresql.org/git/postgresql.git
This checks the source code out into a sub-directory of your current location, called postgresql.

Configure the Build

A full list of the configuration options available at the command line is described in the “PostgreSQL 9.1.0 Documentation”. For this example, I ran a fairly vanilla configuration, but added features for SQL/XML in the database.
$ cd postgresql
$ ./configure --with-libxml --with-libxslt

Make and Install

Now that you’ve fine-tuned exactly how you want Postgres built, you’ll run the fairly standard make and install steps.
$ gmake
Note that you must run the install step as a privileged user, since access to the /usr/local directory (it’s there, at /usr/local/pgsql/bin, that the binary files will live) is restricted by default.
$ sudo gmake install

Post-Installation and Database Configuration

You’ll want to run PostgreSQL as a specific, non-root user, postgres.  So, create this user.

$ sudo adduser postgres
$ sudo passwd postgres

Now, change user to the newborn postgres user to run the database initialization.

$ su - postgres

First, set the environment variables for locating needed libraries and executables.

$ LD_LIBRARY_PATH=/usr/local/pgsql/lib; export LD_LIBRARY_PATH
$ PATH=$PATH:/usr/local/pgsql/bin; export PATH

Now, create a directory to contain the server’s data files (shown below as <data_directory>; substitute your own path).

$ mkdir <data_directory>

Run the initdb command to initialize the PostgreSQL server.

$ initdb -D <data_directory>

I recommend creating a specific directory for the server log files.

$ mkdir <log_directory>

Starting the Server

Start the server as the postgres user, indicating the location of data files and the location of the server log files.

$ postgres -D <data_directory> > <log_directory>/server.log 2>&1 &

There shouldn’t be any errors that prevent the server from starting, but if you inspect the log file you should see the following messages.

LOG: database system is ready to accept connections
LOG: autovacuum launcher started

Your server should be up and running, ready to accept your client connections!

Why Compile MySQL?

I’ve got an upcoming post that discusses the steps for compiling MySQL from source.  Before I get to that topic, though, I thought a preliminary post on the reasons for doing so would be helpful.

Almost every open-source database project provides an installer for various platforms.  Why then would you bother to compile MySQL (or any other database)?  There are a number of good reasons.

  • You have an opportunity to read the code

If you have the time and inclination, grabbing and compiling the source code is the perfect excuse to spend some time reading the code, particularly the parts of the code that interest you.  How exactly does PostgreSQL implement k-nearest-neighbor indexing (new in version 9.1)?

Sometimes the source code is installed along with binary versions when you use an installer, but this is often not the case.  If you download the source to compile it, you know you’ll have it available to read at your leisure.

  • Some options are only available by compiling the source

You might want to try out an atypical storage engine, but you can only do that by mixing in support for that engine when you compile the code itself.  For example, you want to use the FEDERATED engine in MySQL, but that doesn’t come as part of the installer-produced installation.  You really have no choice but to compile MySQL and include support for the FEDERATED engine as part of the configure-and-compile process.

  • You can more easily set some defaults

You might not care for the default database character set.  Sometimes DBAs or developers forget to set a character set when creating a table.  For those times, you may want to enforce a standard that all database tables use the UTF-8 character set (unless specifically overridden).  When configuring MySQL for compilation, use the -DDEFAULT_CHARSET=utf8 command-line option.

  • You want to get an up-to-date version, including the latest patches

Want the latest and greatest version of the server, including the most recently committed changes for bugs and security holes?  No problem!  Just download and compile the most current source, and you will automatically get the latest bug fixes and security patches.

In future posts, I’ll discuss the steps for compiling MySQL and PostgreSQL from source code.