I’ve added some new sections this week, though I still intend to focus on data and data-related items.
“Modern data lakes depend on extract, transform, and load (ETL) operations to convert bulk information into usable data. This post walks through implementing an ETL orchestration process that is loosely coupled using AWS Step Functions, AWS Lambda, and AWS Batch to target an Amazon Redshift cluster.”
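To make the pattern concrete, here is a minimal sketch of what such a Step Functions state machine might look like in the Amazon States Language: a Lambda extract step, an AWS Batch transform step, and a Lambda step that triggers the Redshift load. This is my own illustration, not code from the post; all ARNs, state names, and job names are hypothetical placeholders.

```python
import json

# Sketch of a loosely coupled ETL orchestration as an Amazon States Language
# definition. Every ARN below is a hypothetical placeholder.
definition = {
    "Comment": "Loosely coupled ETL orchestration (sketch)",
    "StartAt": "ExtractWithLambda",
    "States": {
        "ExtractWithLambda": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:extract",
            "Next": "TransformWithBatch",
        },
        "TransformWithBatch": {
            "Type": "Task",
            # The .sync integration makes Step Functions wait for the Batch job.
            "Resource": "arn:aws:states:::batch:submitJob.sync",
            "Parameters": {
                "JobName": "transform",
                "JobQueue": "arn:aws:batch:us-east-1:123456789012:job-queue/etl",
                "JobDefinition": "arn:aws:batch:us-east-1:123456789012:job-definition/transform",
            },
            "Next": "LoadIntoRedshift",
        },
        "LoadIntoRedshift": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:run-copy",
            "End": True,
        },
    },
}

state_machine_json = json.dumps(definition, indent=2)
# Deploying it would look roughly like (requires boto3 and an IAM role ARN):
# boto3.client("stepfunctions").create_state_machine(
#     name="etl-orchestrator", definition=state_machine_json, roleArn=role_arn)
```

Because each step only hands off state to the next one through Step Functions, the Lambda and Batch pieces can be developed and scaled independently, which is the loose coupling the post is after.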
This post, from the AWS Big Data blog, describes an important way to isolate workloads (for example, ad hoc vs. reporting) and attribute costs appropriately (by using tags) when querying data with Amazon Athena. It’s a helpful companion piece to the item above on Parquet and DMS.
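As a rough sketch of the idea, the snippet below builds the parameters for a tagged Athena workgroup: the workgroup isolates the ad hoc workload, and the tags let you attribute its query costs. The names, scan limit, and tag values are my own illustrative assumptions, not taken from the post; the (commented-out) call at the end shows how they would be passed to boto3.

```python
# Sketch: parameters for an isolated, tagged Athena workgroup.
# All names, limits, and tag values here are illustrative assumptions.
create_work_group_params = {
    "Name": "adhoc-analysts",
    "Configuration": {
        "ResultConfiguration": {
            "OutputLocation": "s3://example-athena-results/adhoc/"
        },
        # Stop individual users from overriding the workgroup settings.
        "EnforceWorkGroupConfiguration": True,
        # Cap per-query scan volume for ad hoc use (10 GiB, in bytes).
        "BytesScannedCutoffPerQuery": 10 * 1024**3,
    },
    # Cost-allocation tags for attributing Athena spend to this workload.
    "Tags": [
        {"Key": "team", "Value": "analytics"},
        {"Key": "workload", "Value": "adhoc"},
    ],
}
# Creating it would look roughly like (requires boto3 and AWS credentials):
# boto3.client("athena").create_work_group(**create_work_group_params)
```

A separate workgroup with its own tags (e.g., `workload=reporting`) would then keep scheduled reporting queries, their limits, and their costs apart from the ad hoc ones.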
The benefits include smaller sizes than B-Tree indexes, fast scanning of extremely large tables, and more efficient vacuuming. The original proposal, linked in the article above, is here. It provides more rationale for what the proposer, Alvaro Herrera, called “minmax indexes”.
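The size advantage is easy to see with back-of-the-envelope arithmetic: a B-tree keeps one index entry per row, while a BRIN index keeps one min/max summary per range of heap pages. The row counts and per-entry byte sizes below are rough assumptions for illustration; only `pages_per_range = 128` is PostgreSQL's actual default.

```python
# Rough comparison of B-tree vs. BRIN index size for a large table.
# Row width and per-entry sizes are assumed round numbers, not measurements.
rows = 100_000_000
rows_per_page = 100                     # assume ~80-byte rows on 8 KB pages
pages = rows // rows_per_page
pages_per_range = 128                   # PostgreSQL's default for BRIN

btree_entries = rows                    # one index tuple per row
brin_ranges = pages // pages_per_range  # one min/max summary per page range

btree_bytes = btree_entries * 16        # assume ~16 bytes per B-tree entry
brin_bytes = brin_ranges * 24           # assume ~24 bytes per range summary

print(f"B-tree ≈ {btree_bytes / 1024**2:.0f} MiB, "
      f"BRIN ≈ {brin_bytes / 1024:.0f} KiB")
```

Even with these crude numbers the B-tree comes out thousands of times larger, which is why BRIN works so well for huge, naturally ordered tables (append-only logs, timestamped events) where the per-range min/max is selective.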
Oracle released its July Critical Patch Update (CPU) on 2019-07-16.
Practices and Architecture
OK, pretty old-school, but these are pretty cool ways to clean data from the command line. The author has an accompanying blog called “BASHing data”.
This is a proposed ISO standard for querying graph databases. There’s even a GQL Manifesto.
This development seems to be driven by the needs of querying property graph databases (RDF triple stores already have SPARQL), but the standardization effort has implications for other database models as well.
Upcoming Conferences of Interest
Classic Paper or Reference of the Week
The classification of data quality problems is as helpful today as it was back in 2000, when this paper was first published.
Cool Research Paper of the Week
Data Technology of the Week
Metadata Standard of the Week
MARC is actually a set of formats that was originally created in the 1960s and 1970s. MARC includes formats for bibliographic data, authority records (e.g., names and subjects), holdings, classification, and community information.