Posts Tagged ‘Lucene’

A New Year and New Technology

January 11, 2012

Happy New Year!

With a new year here, I’ve been thinking about ways to expand my skill set and the technologies to learn more about in 2012.  As I’m still a data guy at heart, it should be no surprise that the technologies that interest me are related to data and databases.

MongoDB

I’m hoping to do a lot of work with MongoDB this year.  In the past, I’ve played around with it some, but it looks as if there will be at least one project at work that will let me get some deep experience using MongoDB.  I’m currently reading “MongoDB: The Definitive Guide” by Kristina Chodorow and Michael Dirolf and am really enjoying it.  Based on what I’ve read so far, I would say this is one of those classic O’Reilly books: specialized, but so well-written that you almost forget that you’re reading something highly technical.

Hadoop

I attended a number of talks at QCon SF back in November and one topic that recurred frequently was how companies are using Hadoop.  It seemed as if every presenter described how their company is finding a way to use at least part of the Hadoop ecosystem.  And Hadoop is truly that: an entire ecosystem, encompassing not only the core project, but also Pig, ZooKeeper, Mahout, HBase, and still others.  You can find more information at the Apache Hadoop project page.  I’m hoping to get a proof-of-concept cluster up and running by the middle of the year.

Lucene/Solr

Full-text search is an area that, like MongoDB, I have had some exposure to in the past, but would very much like to learn more about.  I’m planning to spend some time this year on Lucene first, and then move on to the search server, Solr.  I’m fortunate to have some colleagues with a lot of experience in this area, and I intend to mine their knowledge whenever possible.

XML

I believe that those of us steeped in the relational database world can take a great step toward more of a data architect’s mindset by having a deep understanding of XML and its related technologies.  My plan this year is to get a firmer understanding of XML, XPath, XSLT, XML Schema, etc.  I’ve done quite a bit of work with XML in the past, but this has typically been in relation to Oracle’s handling of XML documents within the database.  I want to gain a broader understanding.

Of course, JSON is coming on strong and supplanting XML in some of its former strongholds.  For example, MongoDB’s data model is built on JSON-like documents (stored internally as BSON).
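To make the XML and JSON comparison a bit more concrete, here is a minimal sketch using only the Python standard library.  The record, its field names, and the element_to_dict helper are all made up for illustration; the sketch simply shows the same small record as an XML document, queried with a simple path expression, and then re-expressed as the kind of JSON document a store like MongoDB works with.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical record, invented purely for illustration.
xml_text = (
    "<book>"
    "<title>Big Data Glossary</title>"
    "<author>Pete Warden</author>"
    "</book>"
)


def element_to_dict(element):
    """Convert a flat XML element (children holding only text) into a dict."""
    return {child.tag: child.text for child in element}


# Parse the XML and pull out one value with a simple path expression.
# ElementTree supports only a limited subset of XPath.
root = ET.fromstring(xml_text)
title = root.findtext("title")

# Re-express the same record as a JSON document.
document = {root.tag: element_to_dict(root)}
json_text = json.dumps(document, sort_keys=True)

print(title)
print(json_text)
```

The helper only handles flat elements.  Attributes, repeated tags, and nested children would need a richer mapping, which is part of why moving between the two formats is not always mechanical.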

This seems like enough to keep me busy outside of my day job for one year!

Book Review: “Big Data Glossary” by Pete Warden (O’Reilly Media)

October 2, 2011

“Big Data Glossary” could probably have been titled something like “Big Data Cheat Sheets” because it’s both more and less than a glossary.  Rather than a list of terms with definitions, the book is an excellent summary of tools in the “big data” space.

Warden tackles eleven topics:

  1. Some background on fundamental techniques (e.g., key-value stores)
  2. NoSQL databases
  3. MapReduce
  4. Storage techniques
  5. “Cloud” servers
  6. Data processing technologies (e.g., R and Lucene)
  7. Natural Language Processing
  8. Machine Learning
  9. Visualization
  10. Acquisition
  11. Serialization

He covers none of these topics in great detail, which will no doubt cause carping among some folks.  However, I really like his approach of sketching broad themes, identifying key projects (or products) in each space, and pointing the reader to further research.  Because the field of “big data” is so large, this short book (it’s only 50 pages) serves the extremely useful purpose of tying together the field by providing an overview.

Highly recommended for folks looking to get their feet wet in the great lake of big data.