Interesting Data-Related Blogs and Articles – Week of July 21, 2019

AWS

AWS Tech Talk (July 31): How to Build Serverless Data Lake Analytics with Amazon Athena

Covers using Amazon Athena to query data in S3.

Migrate and deploy your Apache Hive metastore on Amazon EMR

Includes a discussion of using the Glue Catalog. I’ve experimented with Glue. The Glue Catalog might be the most useful part of Glue (in the current state of Glue development).


PostgreSQL

PostgreSQL wins 2019 O’Reilly Open Source Award for Lifetime Achievement

Last year’s winner was Linux, so PostgreSQL is in excellent company. This is only the second year that the award has been presented.


Python

pandas-profiling

This is a library for profiling a data set. I have been playing around with it and so far really like the functionality and simplicity of using pandas-profiling via, for example, a Jupyter Notebook.
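For a rough sense of what "profiling" a data set involves, here is a minimal hand-rolled sketch using only the Python standard library. It is not pandas-profiling's API; the `profile_column` helper and the sample rows are made up for illustration, and the real library reports far more than this.

```python
import statistics
from collections import Counter


def profile_column(values):
    """Summarize one column: row count, missing values, distinct values, basic stats."""
    present = [v for v in values if v is not None]
    summary = {
        "count": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
    }
    numeric = [
        v for v in present
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    if present and len(numeric) == len(present):
        # Every non-missing value is a number, so report numeric statistics.
        summary["min"] = min(numeric)
        summary["max"] = max(numeric)
        summary["mean"] = statistics.mean(numeric)
    else:
        # Otherwise report the most frequent value.
        most_common = Counter(present).most_common(1)
        summary["top"] = most_common[0][0] if most_common else None
    return summary


# Hypothetical sample data: each dict is one row.
rows = [
    {"city": "Boston", "temp_f": 71},
    {"city": "Austin", "temp_f": 98},
    {"city": "Boston", "temp_f": None},
]

columns = {name: [row[name] for row in rows] for name in rows[0]}
profile = {name: profile_column(values) for name, values in columns.items()}
for name, summary in profile.items():
    print(name, summary)
```

The appeal of a profiling library is that it produces this kind of per-column summary (and much more) in a single call, without any of the hand-written bookkeeping above.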

Researchers love PyTorch and TensorFlow

From the O’Reilly AI channel. The finding mentioned in the headline comes from an analysis of papers posted on arXiv.org.

StanfordNLP 0.2.0 – Python NLP Library for Many Human Languages

“StanfordNLP is a Python natural language analysis package. It contains tools, which can be used in a pipeline, to convert a string containing human language text into lists of sentences and words, to generate base forms of those words, their parts of speech and morphological features, and to give a syntactic structure dependency parse, which is designed to be parallel among more than 70 languages, using the Universal Dependencies formalism. In addition, it is able to call the CoreNLP Java package and inherits additional functionality from there, such as constituency parsing, coreference resolution, and linguistic pattern matching.”


R

June 2019 “Top 40” R Packages

These cover Computational Methods, Data, Finance, Genomics, Machine Learning, Science and Medicine, Statistics, Time Series, Utilities, and Visualization.


Software Updates

DBeaver 6.1.3

Excerpts from the release notes (the link shows all changes):

“New project configuration format was implemented.

Major features:

  • Data viewer: “References” panel was added (browse values by foreign and reference keys)

Other:

  • Connection page was redesigned
  • PostgreSQL: struct/array data types support was fixed
  • MySQL: privileges viewer was fixed (global privileges grant/revoke)”

PyCharm 2019.2

Improves integration with Jupyter Notebook, among other changes.


Practices and Architecture

Five principles that will keep your data warehouse organized

Some are obvious, some less so:

  • Use schemas to logically group together objects
  • Use consistent and meaningful names for objects in a warehouse
  • Use a separate user for each human being and application connecting to your data warehouse
  • Grant privileges systematically
  • Limit access to superuser privileges

Graph Databases Go Mainstream

Given that this article is published in Forbes, it’s hard to argue with the headline. An interesting overview.

The Little Book of Python Anti-Patterns

“Welcome, fellow Pythoneer! This is a small book of Python anti-patterns and worst practices.”

General Data-Related

What We Learned From The 2018 Liquibase Community Survey

The creator of Liquibase shares information about who uses Liquibase.

You’re very easy to track down, even when your data has been anonymized

Scary stuff: turns out anonymizing data doesn’t protect you from being identified after all.


Podcasts

Acquiring and sharing high-quality data

July 18th episode of the O’Reilly Data Show Podcast.


Upcoming Conferences of Interest

NODES 2019

“Neo4j Online Developer Expo & Summit.” Apparently, this is the first-ever such conference for the Neo4j community.


Classic Paper or Reference of the Week

The Design of Postgres

Written by Michael Stonebraker and Lawrence A. Rowe, this paper describes the architecture of Postgres as a successor to INGRES. Of course, this is the jumping-off point for the PostgreSQL of today.


Data Technologies of the Week

I couldn’t pick just one.

Apache Iceberg

“Apache Iceberg is an open table format for huge analytic datasets. Iceberg adds tables to Presto and Spark that use a high-performance format that works just like a SQL table.” Still incubating, but sounds very cool.

Apache Avro

Avro is a data serialization format. “Avro provides functionality similar to systems such as Thrift, Protocol Buffers, etc. Avro differs from these systems in the following fundamental aspects.

  • Dynamic typing: Avro does not require that code be generated. Data is always accompanied by a schema that permits full processing of that data without code generation, static datatypes, etc. This facilitates construction of generic data-processing systems and languages.
  • Untagged data: Since the schema is present when data is read, considerably less type information need be encoded with data, resulting in smaller serialization size.
  • No manually-assigned field IDs: When a schema changes, both the old and new schema are always present when processing data, so differences may be resolved symbolically, using field names.”
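The "untagged data" and "no field IDs" points above can be sketched in a few lines of Python. This is a simplified illustration only, not Avro's actual binary encoding or API: the schema representation, the `encode`, `decode`, and `resolve` helpers, and the sample schemas are all assumptions made up for this example.

```python
import struct

# A schema here is just an ordered list of (field_name, type_name) pairs.
WRITER_SCHEMA = [("id", "long"), ("name", "string")]
# A newer reader schema: fields reordered, plus a new "email" field with a default.
READER_SCHEMA = [("name", "string"), ("id", "long"), ("email", "string")]
READER_DEFAULTS = {"email": ""}


def encode(record, schema):
    """Write values in schema order, with no field names or type tags in the bytes."""
    out = bytearray()
    for field, type_name in schema:
        value = record[field]
        if type_name == "long":
            out += struct.pack(">q", value)
        elif type_name == "string":
            data = value.encode("utf-8")
            out += struct.pack(">I", len(data)) + data
        else:
            raise ValueError(f"unsupported type: {type_name}")
    return bytes(out)


def decode(payload, schema):
    """Read values back; the writer's schema is the only guide to the byte layout."""
    record, offset = {}, 0
    for field, type_name in schema:
        if type_name == "long":
            (record[field],) = struct.unpack_from(">q", payload, offset)
            offset += 8
        elif type_name == "string":
            (length,) = struct.unpack_from(">I", payload, offset)
            offset += 4
            record[field] = payload[offset:offset + length].decode("utf-8")
            offset += length
        else:
            raise ValueError(f"unsupported type: {type_name}")
    return record


def resolve(record, reader_schema, defaults):
    """Match writer fields to reader fields by name, using defaults for new fields."""
    return {field: record.get(field, defaults.get(field)) for field, _ in reader_schema}


payload = encode({"id": 42, "name": "Ada"}, WRITER_SCHEMA)
written = decode(payload, WRITER_SCHEMA)
resolved = resolve(written, READER_SCHEMA, READER_DEFAULTS)
print(len(payload), resolved)
```

Because the bytes carry no names or tags, the payload stays small, and because both schemas are available at read time, the reordered and added fields are reconciled purely by name.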

Dask

“Dask natively scales Python. Dask provides advanced parallelism for analytics, enabling performance at scale for the tools you love.”


Metadata Standard of the Week

BISAC Subject Headings (2018 Edition)

“The BISAC Subject Headings List, also known as the BISAC Subject Codes List, is a standard used by many companies throughout the supply chain to categorize books based on topical content. The Subject Heading applied to a book can determine where the work is shelved in a brick and mortar store or the genre(s) under which it can be searched for in an internal database.” The Book Industry Study Group (BISG) provides a helpful FAQ for deciding what BISAC to use for a book.

Interesting Data-Related Blogs and Articles – Week of July 14, 2019

I’ve added some new sections this week, though I still intend to focus on data and data-related items.

AWS

Announcing the support of Parquet data format in AWS DMS 3.1.3

Apparently the AWS “Database Migration Service” can be used for migrating files, not just databases. The service now supports migrating to S3 in Apache Parquet format. This could be useful if you want to use Amazon Athena or Redshift Spectrum to query the data.

Orchestrating an ETL process using AWS Step Functions for Amazon Redshift

“Modern data lakes depend on extract, transform, and load (ETL) operations to convert bulk information into usable data. This post walks through implementing an ETL orchestration process that is loosely coupled using AWS Step Functions, AWS Lambda, and AWS Batch to target an Amazon Redshift cluster.”

New AWS Public Datasets Available from Facebook, Yale, Allen Institute for Brain Science, NOAA, and others

AWS hosts a large number (114 so far) of open data sets. The registry provides search functionality to help you find what you may be looking for. More information is at the Open Data on AWS page.

Separating queries and managing costs using Amazon Athena workgroups

This post, from the AWS Big Data blog, describes an important way to isolate workloads (for example, ad-hoc vs. reporting) and attribute costs appropriately (by using tags) when querying data via AWS Athena. It’s a helpful companion piece to the item above on Parquet and DMS.


PostgreSQL

BRIN Index for PostgreSQL: Don’t Forget the Benefits

The benefits include smaller sizes than B-Tree indexes, fast scanning of extremely large tables, and more efficient vacuuming. The original proposal, linked in the article above, is here. It provides more rationale for what the proposer, Alvaro Herrera, called “minmax indexes”.
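To make the “minmax” idea concrete, here is a toy sketch in Python. It is not how PostgreSQL implements BRIN; the function names, block size, and sample table are assumptions for illustration. The index keeps only a (min, max) pair per block of rows, and a lookup scans just the blocks whose range could contain the target.

```python
def build_block_range_index(values, block_size):
    """Record (min, max) for each consecutive block of `block_size` values."""
    summaries = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        summaries.append((min(block), max(block)))
    return summaries


def search(values, summaries, block_size, target):
    """Return positions of `target`, scanning only blocks whose range could hold it."""
    hits, blocks_scanned = [], 0
    for block_no, (low, high) in enumerate(summaries):
        if low <= target <= high:
            blocks_scanned += 1
            start = block_no * block_size
            for offset, value in enumerate(values[start:start + block_size]):
                if value == target:
                    hits.append(start + offset)
    return hits, blocks_scanned


# Hypothetical append-only table: values grow with physical position, like timestamps.
table = list(range(1000, 2000))
BLOCK_SIZE = 100

index = build_block_range_index(table, BLOCK_SIZE)
hits, blocks_scanned = search(table, index, BLOCK_SIZE, 1234)
print(len(index), hits, blocks_scanned)
```

The index holds one small summary per block rather than one entry per row, which is where the size advantage comes from. The approach pays off when values correlate with physical order, as in the sample table; on randomly ordered data most block ranges would overlap the target and little would be skipped.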


Software Updates

Oracle released its July Critical Patch Update (CPU) (2019-07-16).


Practices and Architecture

A Data Cleaner’s Cookbook

OK, pretty old-school, but pretty cool ways to clean data from the command line. The author has an accompanying blog, called “BASHing data”.

Graph Query Language GQL

This is a proposed ISO standard for querying graph databases. There’s even a GQL Manifesto.

The Rise Of Natural Language Interfaces To Databases

This development seems to be driven by the needs of querying RDF-triple stores, but applies to all models of databases.


Upcoming Conferences of Interest

Classic Paper or Reference of the Week

Data Cleaning: Problems and Current Approaches

The classification of data quality problems is as helpful today as it was back in 2000, when this paper was first published.


Cool Research Paper of the Week

Towards Multiverse Databases

You can think of a multiverse database as one that extends the concept of a distributed database with individual views of that data for each user. Multiverse databases contain a centralized privacy policy that needs to be implemented only once.
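A very loose sketch of the idea in Python, with made-up data and function names (this is not the paper's design or implementation): the privacy rule is written once, each user's "universe" is derived from it, and application queries run against that universe with no privacy checks of their own.

```python
# Hypothetical base data shared by all users.
POSTS = [
    {"id": 1, "author": "alice", "body": "public note", "private": False},
    {"id": 2, "author": "alice", "body": "alice's secret", "private": True},
    {"id": 3, "author": "bob", "body": "bob's secret", "private": True},
]


def policy(user, row):
    """The single, central privacy rule: private rows are visible only to their author."""
    return not row["private"] or row["author"] == user


def universe(user, rows):
    """Build the view of the data that `user` is allowed to query."""
    return [row for row in rows if policy(user, row)]


def count_posts(rows):
    """An application query; it contains no privacy logic of its own."""
    return len(rows)


alice_ids = [row["id"] for row in universe("alice", POSTS)]
bob_ids = [row["id"] for row in universe("bob", POSTS)]
print(alice_ids, bob_ids, count_posts(universe("carol", POSTS)))
```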


Data Technology of the Week

Apache Superset

Aims to provide “…a modern, enterprise-ready business intelligence web application”. Still incubating, but already has an impressive list of companies using it. Check out the Visualizations Gallery.


Metadata Standard of the Week

MARC is actually a set of formats that was originally created in the 1960s and 1970s. MARC includes formats for bibliographic metadata, authority records (e.g., names, subjects), holdings, classifications, communities, and translations.

Interesting Data-Related Blogs and Articles – Week of July 7, 2019

 

AWS

Amazon Aurora PostgreSQL Serverless – Now Generally Available

Serverless Aurora MySQL has been around for a while, but this is the first release of serverless Aurora for PostgreSQL.
In related news, the Aurora development team just won the 2019 ACM SIGMOD Systems Award.

How 3M Health Information Systems built a healthcare data reporting tool with Amazon Redshift

A case study of modernizing a legacy data warehouse on AWS, using Redshift, including lessons learned.

Improving Amazon Redshift Performance: Our Data Warehouse Story

From Udemy Engineering, a brief overview of how column stores like Redshift differ from traditional relational databases. The author discusses how to design a database to take advantage of Redshift’s fundamental architecture.

Optimizing Amazon DynamoDB scan latency through schema design.

An overview of improving table scans by paying attention to your attributes.


PostgreSQL

EnterpriseDB Acquired by Great Hill Partners

EnterpriseDB staff make major contributions to the PostgreSQL code base.
In a related development, Michael Stonebraker, the original architect of what is now PostgreSQL, will serve as a technical adviser to the company.

Generated columns in PostgreSQL 12

A cool new feature in the next release of PostgreSQL.
“This feature is known in various other DBMS as ‘calculated columns’, ‘virtual columns’, or ‘generated columns’.”
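Here is a small sketch of what a generated column looks like in practice. To keep it runnable without a database server, it uses SQLite through Python's built-in `sqlite3` module as a stand-in (this assumes the bundled SQLite is version 3.31 or later, which added generated columns). The `GENERATED ALWAYS AS (...) STORED` clause has the same shape in PostgreSQL 12; the table and column names are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE order_lines (
        price_cents INTEGER,
        quantity    INTEGER,
        -- Computed by the database from the other columns; never written directly.
        total_cents INTEGER GENERATED ALWAYS AS (price_cents * quantity) STORED
    )
    """
)
conn.execute(
    "INSERT INTO order_lines (price_cents, quantity) VALUES (?, ?)", (250, 3)
)
(total_before,) = conn.execute("SELECT total_cents FROM order_lines").fetchone()

# Updating an input column causes the generated value to be recomputed.
conn.execute("UPDATE order_lines SET quantity = 4")
(total_after,) = conn.execute("SELECT total_cents FROM order_lines").fetchone()

print(total_before, total_after)
conn.close()
```

The point of the feature is that the derived value lives in the table definition, so every application reading the table sees the same calculation.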

How We Solved a Storage Problem in PostgreSQL Without Adding a Single Byte of Storage

Pretty clever idea: reduce the size of the key used in sorting by hashing it. Probably not specific to PostgreSQL.

Postgresql Interval, Date, Timestamp and Time Data Types

“Does anyone really know what time it is?”
A primer on all the various ways of representing time in PostgreSQL.


Software Updates

AWS RDS for PostgreSQL Supports New Minor Versions (2019-07-03)

PostgreSQL versions 11.4, 10.9, 9.6.14, 9.5.18, and 9.4.23 are now available for RDS.

DBeaver 6.1.2 (Released 2019-07-07)

pgAdmin 4.10 (Released 2019-07-04)


Practices and Architecture

Figuring out the future of distributed data systems

Summary of an interview with Martin Kleppmann, author of Designing Data-Intensive Applications, which is becoming an influential book in the field.

Spark core concepts explained

A brief primer with helpful graphics.


Classic Paper or Reference of the Week

The classic “Red Book” Readings in Database Systems is now in a fifth edition and exclusively on the Web. Peter Bailis, an up-and-coming light in the database community, joins Joe Hellerstein and Michael Stonebraker as editors for this edition.

 

Interesting Data Links, Week Ending 2015-06-27

The most interesting data and database-related articles to come my way this week.

Oracle’s biggest database foe: Could it be Postgres? via Postgres Weekly.

Oracle has such a huge head start that I doubt PostgreSQL is really a threat. Still, I’m glad to see PostgreSQL so popular among start-ups.

Don’t Let Your Data Out of the Database

Another excellent post by Pat Shaughnessy.

Introducing HypoPG, hypothetical indexes for PostgreSQL

Being able to test the usefulness of an index without having to create it on disk is a fantastic tool for developers. Over-indexing can be as bad as not indexing at all.