AWS Tech Talk (July 31): How to Build Serverless Data Lake Analytics with Amazon Athena
Will discuss using Amazon Athena for querying data in S3.
Includes a discussion of the Glue Data Catalog. I’ve experimented with Glue, and in its current state of development the Data Catalog may be its most useful component.
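The Athena-over-S3 pattern amounts to registering a table definition in the Glue Catalog and then querying the files in place. A minimal sketch, assuming a hypothetical bucket, database, and schema (with boto3 installed and AWS credentials configured, the commented `start_query_execution` call would submit it):

```python
# Sketch: register JSON files in S3 as an external table that Athena can
# query. All names (demo_db, page_views, example-bucket) are hypothetical.
create_table_sql = """
CREATE EXTERNAL TABLE IF NOT EXISTS demo_db.page_views (
    user_id   string,
    url       string,
    viewed_at timestamp
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://example-bucket/page-views/'
"""

# Athena runs queries asynchronously; submitting the DDL would look like:
#   import boto3
#   athena = boto3.client("athena")
#   athena.start_query_execution(
#       QueryString=create_table_sql,
#       ResultConfiguration={"OutputLocation": "s3://example-bucket/results/"},
#   )
print(create_table_sql.strip().splitlines()[0])
```

Once the table exists in the catalog, ordinary `SELECT` statements against `demo_db.page_views` scan the S3 objects directly.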
Last year’s winner was Linux, so PostgreSQL is in excellent company. This is only the second year that the award has been presented.
This is a library for profiling a data set. I have been playing around with it and so far really like the functionality and simplicity of using pandas-profiling via, for example, a Jupyter Notebook.
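The kind of per-column summary pandas-profiling produces can be sketched by hand in pure Python (the real library generates a rich HTML report; with pandas installed, it is roughly `from pandas_profiling import ProfileReport; ProfileReport(df).to_file("report.html")`):

```python
# Hand-rolled stand-in for the per-column statistics a profiler reports:
# count, missing values, and distinct values. Toy data, stdlib only.
rows = [
    {"city": "Oslo", "temp": 3},
    {"city": "Oslo", "temp": 5},
    {"city": "Reno", "temp": None},
]

def profile(rows):
    """Per-column count, missing count, and distinct-value count."""
    report = {}
    for col in rows[0]:
        values = [r[col] for r in rows]
        present = [v for v in values if v is not None]
        report[col] = {
            "count": len(present),
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
        }
    return report

print(profile(rows))
# → {'city': {'count': 3, 'missing': 0, 'distinct': 2},
#    'temp': {'count': 2, 'missing': 1, 'distinct': 2}}
```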
From the O’Reilly AI channel. The finding mentioned in the headline comes from an analysis of papers posted on arXiv.org.
“StanfordNLP is a Python natural language analysis package. It contains tools, which can be used in a pipeline, to convert a string containing human language text into lists of sentences and words, to generate base forms of those words, their parts of speech and morphological features, and to give a syntactic structure dependency parse, which is designed to be parallel among more than 70 languages, using the Universal Dependencies formalism. In addition, it is able to call the CoreNLP Java package and inherits additional functionality from there, such as constituency parsing, coreference resolution, and linguistic pattern matching.”
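The pipeline idea in that description can be sketched in the standard library: text flows through stages that each add structure. The stages below are naive stand-ins (regex sentence splitting, `\w+` tokenization), not StanfordNLP’s neural models, which you would get via the real `stanfordnlp.Pipeline()`:

```python
# Minimal pipeline sketch: string -> sentences -> word tokens.
import re

def split_sentences(text):
    """Naive sentence splitter: break on whitespace after ., !, or ?"""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    """Naive tokenizer: runs of word characters."""
    return re.findall(r"\w+", sentence)

def pipeline(text):
    return [tokenize(s) for s in split_sentences(text)]

print(pipeline("StanfordNLP parses text. It supports many languages."))
# → [['StanfordNLP', 'parses', 'text'], ['It', 'supports', 'many', 'languages']]
```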
These cover Computational Methods, Data, Finance, Genomics, Machine Learning, Science and Medicine, Statistics, Time Series, Utilities, and Visualization.
Excerpts from the release notes (the link shows all changes):
“- New project configuration format was implemented
- Data viewer: “References” panel was added (browse values by foreign and reference keys)
- Connection page was redesigned
- PostgreSQL: struct/array data types support was fixed
- MySQL: privileges viewer was fixed (global privileges grant/revoke)”
Includes improved integration with Jupyter Notebook, among other enhancements.
Practices and Architecture
Some are obvious, some less so:
- Use schemas to logically group together objects
- Use consistent and meaningful names for objects in a warehouse
- Use a separate user for each human being and application connecting to your data warehouse
- Grant privileges systematically
- Limit access to superuser privileges
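The “grant privileges systematically” and “separate user per human/application” points combine naturally: grant to group roles, and give each login membership in a role. A sketch in PostgreSQL-style SQL (role and schema names are hypothetical), held in a Python string:

```python
# Hypothetical role/schema names; granting to a group role rather than to
# individual logins keeps privileges consistent and auditable.
grant_sql = """
CREATE ROLE analyst NOLOGIN;
GRANT USAGE ON SCHEMA reporting TO analyst;
GRANT SELECT ON ALL TABLES IN SCHEMA reporting TO analyst;
ALTER DEFAULT PRIVILEGES IN SCHEMA reporting
    GRANT SELECT ON TABLES TO analyst;

-- One login per human being or application, inheriting from the group role:
CREATE ROLE alice LOGIN PASSWORD 'change-me' IN ROLE analyst;
"""
print(grant_sql.strip().splitlines()[0])
```

The `ALTER DEFAULT PRIVILEGES` line is what makes the grants systematic: tables created later in the schema are covered automatically.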
Given that this article is published in Forbes, it’s hard to argue with the headline. An interesting overview.
“Welcome, fellow Pythoneer! This is a small book of Python anti-patterns and worst practices.”
The creator of Liquibase shares information about who uses Liquibase.
Scary stuff: it turns out anonymizing data doesn’t protect you from being identified after all.
Upcoming Conferences of Interest
“Neo4j Online Developer Expo & Summit” Apparently, this is the first-ever such conference for the Neo4j community.
Classic Paper or Reference of the Week
Written by Michael Stonebraker and Lawrence A. Rowe, this paper describes the architecture of Postgres as a successor to INGRES. Of course, it is the jumping-off point for the PostgreSQL of today.
Data Technologies of the Week
“Apache Iceberg is an open table format for huge analytic datasets. Iceberg adds tables to Presto and Spark that use a high-performance format that works just like a SQL table.” Still incubating, but sounds very cool.
Avro is a data serialization format. “Avro provides functionality similar to systems such as Thrift, Protocol Buffers, etc. Avro differs from these systems in the following fundamental aspects.
- Dynamic typing: Avro does not require that code be generated. Data is always accompanied by a schema that permits full processing of that data without code generation, static datatypes, etc. This facilitates construction of generic data-processing systems and languages.
- Untagged data: Since the schema is present when data is read, considerably less type information need be encoded with data, resulting in smaller serialization size.
- No manually-assigned field IDs: When a schema changes, both the old and new schema are always present when processing data, so differences may be resolved symbolically, using field names.”
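The “no manually-assigned field IDs” point can be sketched in pure Python: because both the writer’s and reader’s schemas are available at read time, fields are matched by name and fields new to the reader fall back to defaults. (Real Avro resolution also handles type promotion and aliases; schemas here are simplified to name → default mappings.)

```python
# Writer's schema (used when the record was written) vs. reader's schema
# (which added a "referrer" field with a default). Both are simplified to
# plain dicts for illustration.
old_schema = {"user_id": None, "url": None}
new_schema = {"user_id": None, "url": None, "referrer": ""}

def resolve(record, reader_schema):
    """Project a decoded record onto the reader's schema, by field name."""
    return {
        field: record.get(field, default)
        for field, default in reader_schema.items()
    }

record = {"user_id": "u1", "url": "/home"}  # written under old_schema
print(resolve(record, new_schema))
# → {'user_id': 'u1', 'url': '/home', 'referrer': ''}
```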
“Dask natively scales Python. Dask provides advanced parallelism for analytics, enabling performance at scale for the tools you love.”
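The flavor of Dask’s parallelism can be approximated with the standard library’s `concurrent.futures`, so the sketch runs without Dask installed; with Dask itself, the equivalent would be `dask.delayed` calls composed into a task graph and executed with `dask.compute(...)`:

```python
# Stdlib stand-in for delayed-style parallelism: map a function over inputs
# across a thread pool instead of building a Dask task graph.
from concurrent.futures import ThreadPoolExecutor

def square(x):
    return x * x

with ThreadPoolExecutor() as pool:
    results = list(pool.map(square, range(5)))

print(results)  # → [0, 1, 4, 9, 16]
```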
Metadata Standard of the Week
BISAC Subject Headings (2018 Edition)
“The BISAC Subject Headings List, also known as the BISAC Subject Codes List, is a standard used by many companies throughout the supply chain to categorize books based on topical content. The Subject Heading applied to a book can determine where the work is shelved in a brick and mortar store or the genre(s) under which it can be searched for in an internal database.” The Book Industry Study Group (BISG) provides a helpful FAQ for deciding what BISAC to use for a book.