Interesting Data-Related Blogs and Articles – Week of July 14, 2019

I’ve added some new sections this week, though I still intend to focus on data and data-related items.


Announcing the support of Parquet data format in AWS DMS 3.1.3

Apparently the AWS “Database Migration Service” can be used for migrating files, not just databases. The service now supports migrating to S3 in Apache Parquet format. This could be useful if you want to use Amazon Athena or Redshift Spectrum to query the data.

Orchestrating an ETL process using AWS Step Functions for Amazon Redshift

“Modern data lakes depend on extract, transform, and load (ETL) operations to convert bulk information into usable data. This post walks through implementing an ETL orchestration process that is loosely coupled using AWS Step Functions, AWS Lambda, and AWS Batch to target an Amazon Redshift cluster.”

New AWS Public Datasets Available from Facebook, Yale, Allen Institute for Brain Science, NOAA, and others

AWS hosts a large number (114 so far) of open data sets. The registry provides search functionality to help you find what you may be looking for. More information is at the Open Data on AWS page.

Separating queries and managing costs using Amazon Athena workgroups

This post, from the AWS Big Data blog, describes an important way to isolate workloads (for example, ad-hoc vs. reporting) and attribute costs appropriately (by using tags) when querying data via Amazon Athena. It’s a helpful companion piece to the item above on Parquet and DMS.


BRIN Index for PostgreSQL: Don’t Forget the Benefits

The benefits include smaller sizes than B-Tree indexes, fast scanning of extremely large tables, and more efficient vacuuming. The original proposal, linked in the article above, is here. It provides more rationale for what the proposer, Alvaro Herrera, called “minmax indexes”.
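To see why a "minmax" summary can be so much smaller than a B-Tree and still speed up scans, here is a toy sketch in Python. It is not how PostgreSQL implements BRIN; the function names and the block size are made up for illustration. The idea it shows is only this: keep one (min, max) pair per block of rows, and skip any block whose range cannot contain the value you are looking for.

```python
from typing import List, Tuple


def build_minmax_summary(values: List[int], block_size: int) -> List[Tuple[int, int]]:
    """Summarise each run of `block_size` consecutive values as a (min, max) pair."""
    summary = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        summary.append((min(block), max(block)))
    return summary


def candidate_blocks(summary: List[Tuple[int, int]], low: int, high: int) -> List[int]:
    """Return the block numbers whose [min, max] range overlaps [low, high]."""
    return [i for i, (lo, hi) in enumerate(summary) if hi >= low and lo <= high]


def range_scan(values: List[int], summary: List[Tuple[int, int]],
               block_size: int, low: int, high: int) -> List[int]:
    """Read only the candidate blocks and filter their rows."""
    matches = []
    for i in candidate_blocks(summary, low, high):
        block = values[i * block_size:(i + 1) * block_size]
        matches.extend(v for v in block if low <= v <= high)
    return matches


# Hypothetical data: values that arrive in roughly sorted order,
# such as timestamps or sequence numbers in an append-only table.
readings = list(range(1000, 1100))
summary = build_minmax_summary(readings, block_size=10)

blocks_to_read = candidate_blocks(summary, 1042, 1047)
result = range_scan(readings, summary, 10, 1042, 1047)
```

The summary holds one pair per block rather than one entry per row, which is where the size saving comes from. The sketch also hints at the main caveat: if the values were shuffled, most blocks would overlap most queries and little would be skipped.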

Software Updates

Oracle released its July Critical Patch Update (CPU) (2019-07-16).

Practices and Architecture

A Data Cleaner’s Cookbook

OK, pretty old-school, but pretty cool ways to clean data from the command line. The author has an accompanying blog, called “BASHing data”.

Graph Query Language GQL

This is a proposed ISO standard for querying graph databases. There’s even a GQL Manifesto.

The Rise Of Natural Language Interfaces To Databases

This development seems to be driven by the needs of querying RDF-triple stores, but applies to all models of databases.

Upcoming Conferences of Interest

Classic Paper or Reference of the Week

Data Cleaning: Problems and Current Approaches

The classification of data quality problems is as helpful today as it was back in 2000, when this paper was first published.

Cool Research Paper of the Week

Towards Multiverse Databases

You can think of a multiverse database as one that extends the concept of a distributed database with individual views of that data for each user. Multiverse databases contain a centralized privacy policy that needs to be implemented only once.
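A rough way to picture the "one policy, many views" idea is the Python sketch below. It is only a conceptual illustration, not the paper's design; the table, the `can_see` policy, and the `universe_for` helper are all hypothetical names invented here. The point is that the policy is written once, and every user's view is derived from it rather than re-implemented in each query.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Policy = Callable[[str, Row], bool]

# Hypothetical base table shared by all users.
POSTS: List[Row] = [
    {"id": 1, "author": "alice", "public": True},
    {"id": 2, "author": "alice", "public": False},
    {"id": 3, "author": "bob", "public": False},
]


def can_see(user: str, row: Row) -> bool:
    """The single, central policy: public rows, plus the user's own rows."""
    return bool(row["public"]) or row["author"] == user


def universe_for(user: str, rows: List[Row], policy: Policy) -> List[Row]:
    """Derive one user's view of the data by applying the central policy."""
    return [row for row in rows if policy(user, row)]


alice_ids = [row["id"] for row in universe_for("alice", POSTS, can_see)]
bob_ids = [row["id"] for row in universe_for("bob", POSTS, can_see)]
carol_ids = [row["id"] for row in universe_for("carol", POSTS, can_see)]
```

Application queries would then run against a user's own view, so a forgotten filter in application code cannot expose rows the policy hides.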

Data Technology of the Week

Apache Superset

Aims to provide “…a modern, enterprise-ready business intelligence web application”. Still incubating, but already has an impressive list of companies using it. Check out the Visualizations Gallery.

Metadata Standard of the Week

MARC is actually a set of formats that was originally created in the 1960s and 1970s. MARC includes formats for bibliographic metadata, authority records (e.g., names, subjects), holdings, classifications, communities, and translations.
