Special this week: Humble Bundle Data Analysis & Machine Learning ebooks from O’Reilly Media.
Donate at least $1 US and you’ll be able to download 5 ebooks. If you donate at least $15 US, you will get 15 ebooks! A sample of titles:
- Graphing Data with R
- Learning Apache Drill
- Architecting Modern Data Platforms
AWS
Best practices for Amazon RDS PostgreSQL replication
Contains recommendations for monitoring, configuration parameters, and locking behavior on both the source and the replica instances.
Cloud Vendor Deep-Dive: PostgreSQL on AWS Aurora
An interesting neutral-party overview of the Aurora implementation of PostgreSQL. Covers best practices, compatibility with “vanilla” PostgreSQL, monitoring, and a host of other topics.
Orchestrate an ETL process using AWS Step Functions for Amazon Redshift
Describes the architecture and overview of a solution that relies on a combination of Step Functions, Lambda, and Batch for building an ETL workflow. There’s a link in the article to a CloudFormation template that will launch the needed infrastructure.
PostgreSQL
Combined Indexes vs. Separate Indexes In PostgreSQL
Excellent discussion on when to create composite indexes as opposed to indexes on single columns.
TRUNCATE vs DELETE: Efficiently Clearing Data from a Postgres Table
You might be surprised which method is faster (at least in this set of tests).
Parallelism in PostgreSQL
PostgreSQL has supported various parallel strategies since version 9.6, and an even broader set of parallel operations came with version 10. This article provides an overview of the types of queries that can execute in parallel.
R
Ten more random useful things in R you may not know about
It’s truly a random list, but a number of these look interesting and helpful.
Software Updates
Anaconda 2019.07
Midyear release of the popular Python environment.
Jupyter, PyCharm and Pizza
Last week I linked to the release notes for the latest version of PyCharm (2019.2), the popular Python IDE. This article shows how to use the Jupyter notebook integration in that release.
Practices and Architecture
Overview of Consistency Levels in Database Systems
“Isolation levels” is probably the better-known concept, but as this article explains, there are consistency levels (the C in ACID) for database systems as well. I recommend reading anything by Daniel Abadi.
Presto at Pinterest
Presto is a query engine for executing SQL on different data sources. The article discusses the challenges Pinterest has encountered in using Presto for various use cases and how it overcame them.
General Data-Related
The First AI to Beat Pros in 6-Player Poker, Developed by Facebook and Carnegie Mellon
Is no game safe from the AI juggernaut?
Harvard Data Science Review
A new open-access journal sponsored by the Harvard Data Science Initiative.
Salesforce Completes Acquisition of Tableau
That didn’t take long. The acquisition was originally announced on June 10.
This Week in Neo4j – Exploration from Bloom Canvas, Building your first Graph App, Parallel k-Hop counts, Scala Cypher DSL
Technically, last week in Neo4j. A helpful summary of noteworthy posts from the Neo4j blog.
Podcasts
Putting the “science” in data science: the scientific method, the null hypothesis, and p-hacking
This week’s episode from the podcast Linear Digressions.
Simplifying Data Integration Through Eventual Connectivity
This week’s episode from the Data Engineering Podcast.
Upcoming Conferences of Interest
ApacheCon 2019
There’s one in Las Vegas (North America) in September and one in Berlin (Europe) in October.
DCMI 2019 Conference
The Dublin Core Metadata Initiative 2019 Annual Conference. Held this year in Seoul, South Korea.
Classic Paper or Reference of the Week
A Simple Guide to Five Normal Forms in Relational Database Theory
Everyone who designs relational databases ought to be familiar with the concepts in this article. There are other normal forms beyond these (Boyce-Codd normal form, anyone?), but these five are the foundation.
Data Technology of the Week
“Apache NiFi supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic.” In other words, one can view NiFi as an ETL tool, but it offers features such as data provenance and back pressure that are uncommon in ETL tools, and these, among other functionalities, make NiFi more of a general-purpose data flow processor. NiFi was originally developed by the US National Security Agency (NSA). You can learn more about flow-based programming in this Wikipedia article. Pronounce the name “nye fye”.
Metadata Standard of the Week
Categories for the Description of Works of Art
Created and maintained by the J. Paul Getty Trust, the CDWA is a set of guidelines for describing art, architecture, and other works of culture. One implementation of the CDWA is the Cultural Objects Name Authority.