Interesting Data-Related Blogs and Articles – Week of August 4, 2019

AWS

Announcing PartiQL: One query language for all your data

AWS has open-sourced a new query language, based on SQL (“SQL-compatible”, as the blog post puts it). Essentially, it is a super-set of SQL that requires the JRE. This project bears watching, but will only be successful outside AWS if data store query engines add support for PartiQL. To further that aim, AWS is releasing the specification and reference implementation(written in Kotlin). The AWS services that support PartiQL thus far are:

  • Amazon S3 Select
  • Amazon Glacier Select
  • Amazon Redshift Spectrum
  • Amazon Quantum Ledger Database

Build highly available MySQL applications using Amazon Aurora Multi-Master

Not yet available for the PostgreSQL-compatible version of Amazon Aurora. Let’s hope it will be.

EBS default volume type updated to GP2

Per the announcement: “GP2 volumes offer lower latency and higher throughput than Standard volumes.”

Techtalk: Best Practices for Running Spark Applications Using Spot Instances on EMR

August 28. This is a 300-level session.

PostgreSQL

PostgreSQL: Regular expressions and pattern matching

The author intends this to be the first in a series of posts on using regular expressions (RE) with PostgreSQL. This post overviews four RE operators.


Python

Python in Visual Studio Code – August 2019 Release

Visual Studio Code has been coming on strong as a popular cross-platform and cross-language IDE. This release furthers the support for Python and Jupyter notebooks.


R

Transitioning into the tidyverse (part 2)

A pair of posts presenting the ecosystem of packages for R (my cat Cookie’s personal favorite being purrr) that promotes a methodology for data analysis.


Software Updates

DBeaver 6.1.4 (2019-08-04)

Among the fixes are these:

  • PostgreSQL:
    • Array data type handler was fixed
    • Indexes metadata reading was fixed
    • Execution plan visualization was fixed for CTE nodes
  • SQL Server:
    • Support of identity columns creation was added
    • Session manager was fixed
  • Oracle:
    • Scheduled jobs metadata reading was fixed
    • Session management was fixed for RAC mode

General Data-Related

DFLib – a lightweight, pure Java implementation of DataFrame

Dataframes (think database tables, spreadsheets) are foundational data structures in Python, R, and Spark. This library provides similar functionality when writing Java.

Liquibase Improving Community Support

In a sign of the continued strength of Liquibase as a community project, this post announces the hiring of an open source community manager.

Traversing the Land of Graph Computing and Databases

Based on the author’s talk at Pycon X, this is a high-level overview of the resurgence of interest in graph-based technology and graph databases.


Upcoming Conferences of Interest

Strata Data Conference – New York, September 23-26

This is one of a series of O’Reilly-sponsored conferences on big data and data science. There are two coming up next year: Strata Data Conference – San Jose (March 15-18, 2020) and Strata Data Conference – London (April 20-23, 2020).


Classic Paper or Reference of the Week

Since I linked to a couple of blog posts above on the R “tidyverse”, I thought some folks might be interested in the paper by Hadley Wickham that started it all: Tidy Data, as published in the Journal of Statistical Software.


Data Technology of the Week

AgensGraph

An Apache 2.0 licensed-project that supports a property graph data model on top of PostgreSQL (version 10.3). AgensGraph supports both ANSI-SQL and openCypher for querying. There’s an enterprise version from Bitnine.


Metadata Standard of the Week

Friend of a Friend or “FOAF” is an ontology, using RDF and OWL, to describe persons, activities, and relations to other people and objects. An example use of FOAF is to describe a social network.

Interesting Data-Related Blogs and Articles – Week of July 28, 2019

Special this week: Humble Bundle Data Analysis & Machine Learning ebooks from O’Reilly Media.

Donate at least $1 US and you’ll be able to download 5 ebooks. If you donate at least $15 US, you will get 15 ebooks! A sample of titles:

  • Graphing Data with R
  • Learning Apache Drill
  • Architecting Modern Data Platforms

AWS

Best practices for Amazon RDS PostgreSQL replication

Contains recommendations for monitoring, configuration parameters, and locking behavior on both the source and the replica instances.

Cloud Vendor Deep-Dive: PostgreSQL on AWS Aurora

An interesting neutral party overview of PostgreSQL on the Aurora implementation. Covers best practices, compatibility with “vanilla” PostgreSQL, monitoring and a host of other topics.

Orchestrate an ETL process using AWS Step Functions for Amazon Redshift

Describes the architecture and overview of a solution that relies on a combination of Step Functions, Lambda, and Batch for building an ETL workflow. There’s a link in the article to a CloudFormation template that will launch the needed infrastructure.


PostgreSQL

Combined Indexes vs. Separate Indexes In PostgreSQL

Excellent discussion on when to create composite indexes as opposed to indexes on single columns.

TRUNCATE vs DELETE: Efficiently Clearing Data from a Postgres Table

You might be surprised which method is faster (at least in this set of tests).

Parallelism in PostgreSQL

PostgreSQL has supported various parallel strategies since version 9.6 and an even broader set of parallel operations came with version 10. This article provides and overview of types of queries that can execute in-parallel.


R

Ten more random useful things in R you may not know about

It’s truly a random list, but a number of these look interesting and helpful.


Software Updates

Anaconda 2019.07

Midyear release of the popular Python environment.

Jupyter, PyCharm and Pizza

Last week I linked to the release notes for the latest version of PyCharm (2019.2), the popular Python IDE. This article shows how to use the Jupyter notebook integration in that release.


Practices and Architecture

Overview of Consistency Levels in Database Systems

“Isolation levels” is probably a more well-known concept, but as this article explains there are consistency levels (the C in ACID) for database systems as well. I recommend reading anything by Daniel Abadi.

Presto at Pinterest

Presto is a query engine for executing SQL on different data sources. The article above discusses challenges that Pinterest has encountered in using Presto for various uses and how it overcame those.


General Data-Related

Harvard Data Science Review

A new open-access journal sponsored by the Harvard Data Science Initiative.

Salesforce Completes Acquisition of Tableau

That didn’t take long. The acquisition was originally announced on June 10.

This Week in Neo4j – Exploration from Bloom Canvas, Building your first Graph App, Parallel k-Hop counts, Scala Cypher DSL

Technically, last week in Neo4j. A helpful summary of noteworthy posts from the Neo4j blog.


Upcoming Conferences of Interest

ApacheCon 2019

There’s one in Las Vegas (North America) in September and one in Berlin (Europe) in October.

DCMI 2019 Conference

The Dublin Core Metadata Initiative 2019 Annual Conference. Held this year in Seoul, South Korea.


Classic Paper or Reference of the Week

A Simple Guide to Five Normal Forms in Relational Database Theory

Everyone who designs relational databases ought to be familiar with the concepts in this article. There are higher normal forms (Boyce-Codd normal form anyone?) that have been proposed since Kent’s paper, but these five are the foundation.


Data Technology of the Week

Apache NiFi

“Apache NiFi supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic.” In other words, one can view NiFi as being an ETL tool, but it offers features like data provenance and back pressure that are not common in ETL tools and these, among functionalities, make NiFi more of a general-purpose data flow processor. NiFi was originally developed by the US National Security Agency (NSA). You can learn more about flow-based programming in this Wikipedia article. Pronounce the name “nye fye”.


Metadata Standard of the Week

Categories for the Description of Works of Art

Created and maintained by the J. Paul Getty Trust, the CDWA is a set of guidelines for describing art, architecture, and other works of culture. One implementation of the CDWA is the Cultural Objects Name Authority.