I’ve added some new sections this week, though I still intend to focus on data and data-related items.
AWS
Announcing the support of Parquet data format in AWS DMS 3.1.3
Apparently the AWS “Database Migration Service” can be used for migrating files, not just databases. The service now supports migrating to S3 in Apache Parquet format. This could be useful if you want to use Amazon Athena or Redshift Spectrum to query the data.
Orchestrating an ETL process using AWS Step Functions for Amazon Redshift
“Modern data lakes depend on extract, transform, and load (ETL) operations to convert bulk information into usable data. This post walks through implementing an ETL orchestration process that is loosely coupled using AWS Step Functions, AWS Lambda, and AWS Batch to target an Amazon Redshift cluster.”
New AWS Public Datasets Available from Facebook, Yale, Allen Institute for Brain Science, NOAA, and others
AWS hosts a large number (114 so far) of open data sets. The registry provides search functionality to help you find what you may be looking for. More information is at the Open Data on AWS page.
Separating queries and managing costs using Amazon Athena workgroups
This post, from the AWS Big Data blog, describes an important way to isolate workloads (for example, ad-hoc vs. reporting) and attribute costs appropriately (by using tags) when querying data via Amazon Athena. It’s a helpful companion piece to the item above on Parquet and DMS.
PostgreSQL
BRIN Index for PostgreSQL: Don’t Forget the Benefits
The benefits include smaller sizes than B-Tree indexes, fast scanning of extremely large tables, and more efficient vacuuming. The original proposal, linked in the article above, is here. It provides more rationale for what the proposer, Alvaro Herrera, called “minmax indexes”.
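To make the “minmax” idea concrete, here is a toy sketch in Python. It is not PostgreSQL’s implementation, and the names (BlockSummary, build_minmax_index, range_scan) are made up for illustration. The idea it shows: keep only the minimum and maximum value for each range of blocks, then skip any range that cannot contain the values being searched for. This assumes the data is roughly ordered on disk, which is where this kind of index tends to pay off.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class BlockSummary:
    """Hypothetical per-block-range summary: row span plus min and max value."""
    start: int  # index of the first row in the range
    end: int    # one past the last row in the range
    lo: int     # smallest value in the range
    hi: int     # largest value in the range


def build_minmax_index(values: Sequence[int], block_size: int) -> List[BlockSummary]:
    """Store one (min, max) pair per block range instead of one entry per row."""
    summaries = []
    for start in range(0, len(values), block_size):
        chunk = values[start:start + block_size]
        summaries.append(BlockSummary(start, start + len(chunk), min(chunk), max(chunk)))
    return summaries


def range_scan(values: Sequence[int], index: List[BlockSummary],
               lo: int, hi: int) -> Tuple[List[int], int]:
    """Return matching values and the number of block ranges actually read."""
    matches: List[int] = []
    blocks_read = 0
    for summary in index:
        # Skip ranges whose [min, max] interval cannot overlap the query.
        if summary.hi < lo or summary.lo > hi:
            continue
        blocks_read += 1
        matches.extend(v for v in values[summary.start:summary.end] if lo <= v <= hi)
    return matches, blocks_read


# Naturally ordered data, such as an append-only timestamp or sequence column.
values = list(range(1000))
index = build_minmax_index(values, block_size=100)
matches, blocks_read = range_scan(values, index, 250, 260)
```

The index here holds 10 summaries for 1,000 rows, which hints at why such an index can be much smaller than a B-Tree. If the values were randomly scattered, most ranges would overlap most queries and little would be skipped.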
Software Updates
Oracle released its July Critical Patch Update (CPU) (2019-07-16).
Practices and Architecture
A Data Cleaner’s Cookbook
OK, pretty old-school, but pretty cool ways to clean data from the command line. The author has an accompanying blog, called “BASHing data”.
Graph Query Language GQL
This is a proposed ISO standard for querying graph databases. There’s even a GQL Manifesto.
The Rise Of Natural Language Interfaces To Databases
This development seems to be driven by the needs of querying RDF-triple stores, but applies to all models of databases.
Upcoming Conferences of Interest
Classic Paper or Reference of the Week
Data Cleaning: Problems and Current Approaches
The classification of data quality problems is as helpful today as it was back in 2000, when this paper was first published.
Cool Research Paper of the Week
Towards Multiverse Databases
You can think of a multiverse database as one that extends the concept of a distributed database with individual views of that data for each user. Multiverse databases contain a centralized privacy policy that needs to be implemented only once.
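As a rough mental model only, and not the design described in the paper, the sketch below shows one policy function defined in a single place and applied to produce a separate view (a “universe”) for each user. The table, the policy rule, and the names make_universe and posts_policy are all hypothetical.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Policy = Callable[[str, Row], bool]


def posts_policy(user: str, row: Row) -> bool:
    """Hypothetical central policy: public posts, plus the user's own posts."""
    return bool(row["public"]) or row["author"] == user


def make_universe(base_table: List[Row], policy: Policy, user: str) -> List[Row]:
    """Build the view of the base table that a single user is allowed to see."""
    return [row for row in base_table if policy(user, row)]


posts: List[Row] = [
    {"id": 1, "author": "alice", "public": True},
    {"id": 2, "author": "alice", "public": False},
    {"id": 3, "author": "bob", "public": False},
]

# Application queries run against a user's universe, never the base table.
alice_ids = [row["id"] for row in make_universe(posts, posts_policy, "alice")]
bob_ids = [row["id"] for row in make_universe(posts, posts_policy, "bob")]
```

Because application code only ever queries a user’s universe, the privacy rule lives in one function rather than being repeated in every query.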
Data Technology of the Week
Apache Superset
Aims to provide “…a modern, enterprise-ready business intelligence web application”. Still incubating, but already has an impressive list of companies using it. Check out the Visualizations Gallery.
Metadata Standard of the Week
MARC is actually a set of formats originally created in the 1960s and 1970s. MARC includes formats for bibliographic metadata, authority records (e.g., names, subjects), holdings, classification, community information, and translations.