Interesting Data-Related Blogs and Articles – Week of July 21, 2019

AWS

AWS Tech Talk (July 31): How to Build Serverless Data Lake Analytics with Amazon Athena

This talk will cover using Amazon Athena to query data stored in S3.

Migrate and deploy your Apache Hive metastore on Amazon EMR

Includes a discussion of using the Glue Catalog. I’ve experimented with Glue, and the catalog might be the most useful part of the service in its current state of development.


PostgreSQL

PostgreSQL wins 2019 O’Reilly Open Source Award for Lifetime Achievement

Last year’s winner was Linux, so PostgreSQL is in excellent company. This is only the second year that the award has been presented.


Python

pandas-profiling

This is a library for profiling a data set. I have been playing around with it and so far really like its functionality and the simplicity of running it from, for example, a Jupyter Notebook.
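
A minimal sketch of the kind of usage I have in mind (the CSV path is a placeholder, and the exact API varies a bit between pandas-profiling releases):

    import pandas as pd
    import pandas_profiling

    # Load any tabular data set (the file path is a placeholder).
    df = pd.read_csv("my_data.csv")

    # Build the profile report; in a Jupyter Notebook, evaluating the
    # report object renders it inline, and to_file() saves a standalone
    # HTML version.
    profile = pandas_profiling.ProfileReport(df)
    profile.to_file("my_data_profile.html")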

Researchers love PyTorch and TensorFlow

From the O’Reilly AI channel. The finding mentioned in the headline comes from an analysis of papers posted on arXiv.org.

StanfordNLP 0.2.0 – Python NLP Library for Many Human Languages

“StanfordNLP is a Python natural language analysis package. It contains tools, which can be used in a pipeline, to convert a string containing human language text into lists of sentences and words, to generate base forms of those words, their parts of speech and morphological features, and to give a syntactic structure dependency parse, which is designed to be parallel among more than 70 languages, using the Universal Dependencies formalism. In addition, it is able to call the CoreNLP Java package and inherits additional functionality from there, such as constituency parsing, coreference resolution, and linguistic pattern matching.”
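
A quickstart sketch, adapted from the project’s documentation (the example sentence is mine; the English models download to the user’s home directory by default):

    import stanfordnlp

    # One-time download of the English models.
    stanfordnlp.download('en')

    # The default pipeline performs tokenization, lemmatization,
    # part-of-speech tagging, and dependency parsing.
    nlp = stanfordnlp.Pipeline()
    doc = nlp("The quick brown fox jumped over the lazy dog.")

    # Inspect lemma, universal POS tag, and dependency relation per word.
    for sentence in doc.sentences:
        for word in sentence.words:
            print(word.text, word.lemma, word.upos, word.dependency_relation)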


R

June 2019 “Top 40” R Packages

These cover Computational Methods, Data, Finance, Genomics, Machine Learning, Science and Medicine, Statistics, Time Series, Utilities, and Visualization.


Software Updates

DBeaver 6.1.3

Excerpts from the release notes (the link shows all changes):

“New project configuration format was implemented.

Major features:

  • Data viewer: “References” panel was added (browse values by foreign and reference keys)

Other:

  • Connection page was redesigned
  • PostgreSQL: struct/array data types support was fixed
  • MySQL: privileges viewer was fixed (global privileges grant/revoke)”

PyCharm 2019.2

Features improved Jupyter Notebook integration, among other enhancements.


Practices and Architecture

Five principles that will keep your data warehouse organized

Some are obvious, some less so (a rough code sketch of a few of these follows the list):

  • Use schemas to logically group together objects
  • Use consistent and meaningful names for objects in a warehouse
  • Use a separate user for each human being and application connecting to your data warehouse
  • Grant privileges systematically
  • Limit access to superuser privileges
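
As a rough illustration of how the first few principles might look against a PostgreSQL-style warehouse, here is a sketch using psycopg2 (all schema, role, and connection names below are placeholders):

    import psycopg2

    # Hypothetical administrative connection to the warehouse.
    conn = psycopg2.connect("dbname=warehouse user=admin")
    conn.autocommit = True

    statements = [
        # Group related objects into schemas.
        "CREATE SCHEMA IF NOT EXISTS marketing",
        "CREATE SCHEMA IF NOT EXISTS finance",
        # One role per human and per application.
        "CREATE ROLE alice LOGIN PASSWORD 'change-me'",
        "CREATE ROLE etl_loader LOGIN PASSWORD 'change-me'",
        # Grant privileges systematically, at the schema level.
        "GRANT USAGE ON SCHEMA marketing TO alice",
        "GRANT SELECT ON ALL TABLES IN SCHEMA marketing TO alice",
        "GRANT USAGE, CREATE ON SCHEMA finance TO etl_loader",
    ]

    with conn.cursor() as cur:
        for stmt in statements:
            cur.execute(stmt)

    conn.close()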

Graph Databases Go Mainstream

Given that this article is published in Forbes, it’s hard to argue with the headline. An interesting overview.

The Little Book of Python Anti-Patterns

“Welcome, fellow Pythoneer! This is a small book of Python anti-patterns and worst practices.”

General Data-Related

What We Learned From The 2018 Liquibase Community Survey

The creator of Liquibase shares information about who uses Liquibase.

You’re very easy to track down, even when your data has been anonymized

Scary stuff: turns out anonymizing data doesn’t protect you from being identified after all.


Podcasts

Acquiring and sharing high-quality data

July 18th episode of the O’Reilly Data Show Podcast.


Upcoming Conferences of Interest

NODES 2019

“Neo4j Online Developer Expo & Summit.” Apparently, this is the first-ever such conference for the Neo4j community.


Classic Paper or Reference of the Week

The Design of Postgres

Written by Michael Stonebraker and Lawrence A. Rowe, this paper describes the architecture of Postgres as a successor to INGRES. Of course, it is the jumping-off point for the PostgreSQL of today.


Data Technologies of the Week

I couldn’t pick just one.

Apache Iceberg

“Apache Iceberg is an open table format for huge analytic datasets. Iceberg adds tables to Presto and Spark that use a high-performance format that works just like a SQL table.” Still incubating, but sounds very cool.
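A hedged sketch of what reading an Iceberg table from PySpark might look like, assuming the Iceberg runtime jar for your Spark version is on the classpath and a path-based (Hadoop) table; the table location is a placeholder, and the exact configuration varies by Spark version:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("iceberg-example").getOrCreate()

    # Read an Iceberg table stored at a filesystem path.
    df = spark.read.format("iceberg").load("hdfs:///warehouse/db/events")

    # From here it behaves like any other Spark DataFrame.
    df.groupBy("event_type").count().show()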

Apache Avro

Avro is a data serialization format. “Avro provides functionality similar to systems such as Thrift, Protocol Buffers, etc. Avro differs from these systems in the following fundamental aspects.

  • Dynamic typing: Avro does not require that code be generated. Data is always accompanied by a schema that permits full processing of that data without code generation, static datatypes, etc. This facilitates construction of generic data-processing systems and languages.
  • Untagged data: Since the schema is present when data is read, considerably less type information need be encoded with data, resulting in smaller serialization size.
  • No manually-assigned field IDs: When a schema changes, both the old and new schema are always present when processing data, so differences may be resolved symbolically, using field names.”
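
A small sketch of the “schema travels with the data” idea, using the third-party fastavro package (the record schema and values are made up; the reference avro package works along the same lines):

    from io import BytesIO

    from fastavro import parse_schema, reader, writer

    # The writer's schema is embedded in the Avro container, so readers
    # need no generated code or manually assigned field IDs.
    schema = parse_schema({
        "namespace": "example",
        "type": "record",
        "name": "User",
        "fields": [
            {"name": "name", "type": "string"},
            {"name": "age", "type": "int"},
        ],
    })

    records = [{"name": "Ada", "age": 36}, {"name": "Grace", "age": 45}]

    buf = BytesIO()
    writer(buf, schema, records)

    # Decoding is driven by the embedded schema; field names (not IDs)
    # resolve differences between writer and reader schemas.
    buf.seek(0)
    for record in reader(buf):
        print(record)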

Dask

“Dask natively scales Python. Dask provides advanced parallelism for analytics, enabling performance at scale for the tools you love.”
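A tiny example of the pandas-like API (the file glob and column names are placeholders):

    import dask.dataframe as dd

    # Lazily treat a directory of CSV files as one logical dataframe.
    df = dd.read_csv("data/2019-*.csv")

    # Familiar pandas-style operations build a task graph; nothing runs yet.
    result = df.groupby("customer_id")["amount"].sum()

    # compute() executes the graph in parallel across local cores
    # (or on a distributed cluster, if one is configured).
    print(result.compute())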


Metadata Standard of the Week

BISAC Subject Headings (2018 Edition)

“The BISAC Subject Headings List, also known as the BISAC Subject Codes List, is a standard used by many companies throughout the supply chain to categorize books based on topical content. The Subject Heading applied to a book can determine where the work is shelved in a brick and mortar store or the genre(s) under which it can be searched for in an internal database.” The Book Industry Study Group (BISG) provides a helpful FAQ for deciding what BISAC to use for a book.

Slides from my MongoDB Boston 2013 Talk “No More SQL”

MongoDB was kind enough to ask me to present at MongoDB Boston 2013 last Friday (2013-10-25).  Below are the slides from my talk, entitled “No More SQL”.

I spoke about my workplace’s experience moving from a 2+ TB relational database to a twelve-shard MongoDB cluster.  My hope was to convey some of the challenges we encountered and the lessons we learned while working on this project.

Book Review: MongoDB Applied Design Patterns

MongoDB Applied Design Patterns is a book that I will read again.  I generally don’t say that about technical books, but the strengths of this work are such that many parts merit a second reading.

This book is for folks with some experience using MongoDB.  If you’ve never worked with MongoDB before, you should start with another book.  Python developers, in particular, will benefit from studying this book, as most of the code examples are in that language.  As long as you have some object-oriented programming experience and have worked with the MongoDB shell, though, you’ll have little difficulty following the code examples.

Another group of people who will strongly benefit from this book are those with only relational database experience.  The author does a thorough job, particularly in the early sections of the book, of comparing MongoDB with traditional relational database management systems.

I particularly liked the author’s discussion of transactions, in chapter 3.  The example is complex, and not a simple debit-credit discussion.  You understand through this example that you must write your own transaction management when you give up using a relational database system.  To me, this is an important point, and I’m glad that the author spends so much time on this example.
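
The book’s treatment is richer than this, but to give a flavor of what “write your own transaction management” means, here is a stripped-down sketch of the classic two-phase-commit pattern in pymongo (the database and collection names are illustrative, and the error-recovery steps that make the pattern safe are omitted):

    from pymongo import MongoClient

    db = MongoClient().bank  # hypothetical database

    def transfer(source_id, dest_id, amount):
        # 1. Record the intended transfer in a pending state.
        txn_id = db.transactions.insert_one({
            "source": source_id, "destination": dest_id,
            "amount": amount, "state": "pending",
        }).inserted_id

        # 2. Apply the debit and credit, tagging each account with the
        #    transaction id so an interrupted transfer can be detected.
        db.accounts.update_one(
            {"_id": source_id, "pending_txns": {"$ne": txn_id}},
            {"$inc": {"balance": -amount}, "$push": {"pending_txns": txn_id}},
        )
        db.accounts.update_one(
            {"_id": dest_id, "pending_txns": {"$ne": txn_id}},
            {"$inc": {"balance": amount}, "$push": {"pending_txns": txn_id}},
        )

        # 3. Mark the transaction committed and clear the per-account tags.
        db.transactions.update_one({"_id": txn_id}, {"$set": {"state": "committed"}})
        db.accounts.update_many(
            {"_id": {"$in": [source_id, dest_id]}},
            {"$pull": {"pending_txns": txn_id}},
        )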

Some of the use cases presented, particularly those in chapters four, five, and six, are similar to the ones in the MongoDB manual.  The remaining use cases go beyond what is described in that manual.  All of the discussion of these use cases is thorough.  There is typically an explanation of the data model (schema design) and then of standard CRUD operations.  The author also goes into not-so-typical operations, like aggregation.  I was particularly pleased that each use case includes sharding concerns.

In summary, I highly recommend this book.  It’s great to see MongoDB being adopted for so many different uses.

MongoDB for DBAs Course Impressions

I recently received my final score from 10gen Education’s MongoDB for DBAs course.  I’m pleased to report that I scored in the top tier.  10gen even said that I am “awesome”.

I wanted to give my impressions of the course and to encourage as many interested folks as possible to take the free training that 10gen is providing.  The next DBA course begins on April 29, and the MongoDB for Developers course (for Java developers) starts on May 13.

Course Format

There are six weeks of lectures, and a seventh week for a final exam, which is really a hands-on project.

Each week is divided into six to twelve video “lectures” of varying length.  Some of the lectures are only three or four minutes long, though some are as long as fifteen to twenty minutes.  Most of the video lectures are followed by one or two quizzes. These quizzes are almost all multiple-choice questions.  You’re given up to three chances to get a quiz answer correct, and can even peek at the answer before submitting your solution.  While this may seem like a way to cheat, you’re really cheating yourself if you make no attempt to answer the quiz questions honestly. If you don’t understand the material in the lectures, you will not be able to complete the homework. Think of the quizzes as a way of checking your understanding of the video lectures, and re-watch any lectures where you found the quiz difficult.

There are typically four or five homework problems per week.  These are primarily not multiple-choice, but worked problems that require you to actually perform various operations with MongoDB.  If you haven’t mastered the material in the lectures, you will not complete the homework successfully, as the problems are not trivial.

The Less Good

Like most MOOCs, this is no substitute for an instructor-led class.  You can’t ask questions of the instructor in real time; questions go only through the course message board.

The quizzes are somewhat simplistic.  This may be somewhat attributable to the test engine that 10gen used (edX).  It would be helpful to students if the questions weren’t multiple choice, and required a bit more understanding.

I found myself referring to the excellent on-line documentation for MongoDB to fill in gaps that were left by the lectures themselves.  Many weeks I found myself wanting more information than the lectures provided.

The Good

It’s free!  10gen is very wise to offer this training free to the community.  The more folks who know how to use MongoDB, the better it is for MongoDB and 10gen.

The DBA course is taught by one of the founders of 10gen, Dwight Merriman.  To have someone at that level spending precious time on instruction tells me that 10gen clearly values building up its user base and community.

The homework assignments really test your understanding of the material.  10gen was very ingenious in making it difficult to cheat on the homework questions.  I think the homework is the best part of the course, actually. I’ve referred back to homework questions several times to help me solve a problem at work. The course would be even stronger if similar effort had been put into the quizzes.

I’m really grateful to 10gen for making this training freely available.  I was so impressed by my experience with MongoDB for DBAs that I’ve registered for the MongoDB for Java Developers course (M101J), which begins on May 13.  Can’t wait!

A New Year and New Technology

Happy New Year!

With a new year here, I’ve been thinking about ways to expand my skill set and about which technologies to learn more about in 2012.  As I’m still a data guy at heart, it should be no surprise that the technologies that interest me are related to data and databases.

MongoDB

I’m hoping to do a lot of work with MongoDB this year.  In the past, I’ve played around with it some, but it looks as if there will be at least one project at work that will let me get some deep experience using MongoDB.  I’m currently reading “MongoDB: The Definitive Guide” by Kristina Chodorow and Michael Dirolf and am really enjoying it.  Based on what I’ve read so far, I would say this is one of those classic O’Reilly books: specialized, but so well-written that you almost forget that you’re reading something highly technical.

Hadoop

I attended a number of talks at QCon SF back in November and one topic that recurred frequently was how companies are using Hadoop.  It seemed as if every presenter described how their company is finding a way to use at least part of the Hadoop ecosystem.  And Hadoop is truly that: an entire ecosystem, encompassing not only the core project, but also Pig, ZooKeeper, Mahout, HBase, and still others.  You can find more information at the Apache Hadoop project page.  I’m hoping to get a proof-of-concept cluster up and running by the middle of the year.

Lucene/Solr

Full-text search technology falls into an area that, like MongoDB, I have had some exposure to in the past, but would like very much to learn more about.  I’m planning to spend some time this year on Lucene first, and then move on to the search server, Solr.  I’m fortunate to have some colleagues with a lot of experience in this area, and I intend to mine their knowledge whenever possible.

XML

I believe that those of us steeped in the relational database world can take a great step toward more of a data architect mindset by having a deep understanding of XML and its related technologies.  My plan this year is to get a firmer understanding of XML, XPath, XSLT, XML Schema, etc.  I’ve done quite a bit of work with XML in the past, but this has typically been in relation to Oracle’s handling of XML documents within the database.  I want to gain a broader understanding.

Of course JSON is coming on strong in supplanting XML in some of its previous strongholds.  For example, MongoDB’s data model relies on JSON-formatted documents.

This seems like enough to keep me busy outside of my day job for one year!

Brief Impressions of Mongo Boston 2011

Monday I attended Mongo Boston 2011 at the Microsoft NERD Center in Cambridge.

The opening keynote by 10gen’s CTO and co-founder Eliot Horowitz struck a couple of very interesting notes.

  • 10gen wants MongoDB to be a general-purpose database.
  • One of their key principles in building MongoDB is to reduce the number of “knobs” an administrator needs to turn.

Overall, I would say the conference was valuable, but could really do with a second day.  For one thing, none of the presentations were more than forty-five minutes long.  While that length does allow for decent overviews, it’s impossible to get into any real depth in such a limited time.

A second day could also reduce some of the “drinking from a fire hose” effect.  I attended eight different presentations, which contained a lot of concepts to absorb.

I wouldn’t recommend these conferences for those who have no experience at all using MongoDB.  I’ve worked with it for a little over a year now, so the material was at a good level for my current understanding.

The price was right at $20 or $30, depending on whether you met the early-bird deadline.  In my mind, this pricing is a shrewd strategy by 10gen, as it enables interested students to attend.  Building interest and enthusiasm among up-and-coming developers is a great way to build a community.  However, it was gratifying to see that the attendees represented a wide range of ages.

If you get an opportunity to attend one of the upcoming conferences, I think you’ll find the day worth your time.