Comparison of the NoSQL databases MongoDB and Apache Hadoop

NoSQL

Not only SQL (NoSQL) databases use different kinds of techniques for storing and retrieving data than classical relational databases such as DB2 or Oracle. They put more emphasis on scalability and on storing different kinds of data than on strict ACID guarantees, accepting different trade-offs within the CAP theorem instead.

Types of NoSQL databases

NoSQL databases can be divided into five different types:

  • Column
    • Tuple of three values: unique name, value, and timestamp
  • Document
    • Semi-structured data stored in documents
  • Key-value
    • Dictionary of key-value pairs
  • Graph
    • Graph-oriented system of nodes with properties and edges
  • Multi-model
    • Mixture of several data models behind one backend

MongoDB belongs to the document-based NoSQL databases.

HBase, as part of Apache Hadoop, belongs to the column-based NoSQL databases; both data models are illustrated in the sketch below.
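
To make the two data models concrete, the following minimal Java sketch builds the same record once as a MongoDB document and once as HBase column cells. The row key "user-1" and the column family "info" are made-up names for illustration; no server connection is needed to construct these objects.

    import org.bson.Document;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    import java.util.Arrays;

    public class DataModelExample {
        public static void main(String[] args) {
            // Document model (MongoDB): one self-contained, semi-structured document
            Document user = new Document("name", "Alice")
                    .append("age", 30)
                    .append("emails", Arrays.asList("alice@example.org"));
            System.out.println(user.toJson());

            // Column model (HBase): cells addressed by row key, column family,
            // column qualifier, and an implicit timestamp
            Put row = new Put(Bytes.toBytes("user-1"));
            row.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            row.addColumn(Bytes.toBytes("info"), Bytes.toBytes("age"), Bytes.toBytes("30"));
        }
    }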

MongoDB

MongoDB is an open source document database, mainly written in C++ as a native application. MongoDB Inc. is the company behind the product and offers commercial versions, support, and related services.

License is AGPL 3.0

Features

  • Document-oriented database
  • Aggregation (see the sketch below)
  • Sharding
  • Replication
  • Indexing
  • Automatic failover
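
A minimal sketch of how an insert, an indexed query, and an aggregation look from the MongoDB Java driver (3.7+/4.x style). The connection string, database, collection, and field names are assumptions for illustration.

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.model.Accumulators;
    import com.mongodb.client.model.Aggregates;
    import com.mongodb.client.model.Filters;
    import org.bson.Document;

    import java.util.Arrays;

    public class MongoExample {
        public static void main(String[] args) {
            // Connection string is an assumption (local default instance)
            try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
                MongoCollection<Document> orders =
                        client.getDatabase("shop").getCollection("orders");

                // Insert a document
                orders.insertOne(new Document("customer", "Alice").append("total", 42.0));

                // Query with a filter that can use an index on "customer"
                Document first = orders.find(Filters.eq("customer", "Alice")).first();
                System.out.println(first);

                // Aggregation: sum of totals per customer
                for (Document doc : orders.aggregate(Arrays.asList(
                        Aggregates.group("$customer", Accumulators.sum("sum", "$total"))))) {
                    System.out.println(doc.toJson());
                }
            }
        }
    }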

Apache Hadoop

Apache Hadoop is an umbrella for different software projects that store and process large, distributed data sets. It is open source and maintained by the Apache community.

License is Apache License 2.0

HDFS

The Hadoop Distributed File System (HDFS) is a distributed storage system written in Java, suitable for storing large files in a scalable and fault-tolerant cluster.
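
A small sketch of writing and reading a file through the HDFS Java API. The NameNode URI and the file path are assumptions for a local test setup.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URI;
    import java.nio.charset.StandardCharsets;

    public class HdfsExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // NameNode address is an assumption for a local test cluster
            FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);
            Path path = new Path("/tmp/example.txt");

            // Write a file; HDFS replicates its blocks across the cluster
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            }

            // Read it back
            try (FSDataInputStream in = fs.open(path);
                 BufferedReader reader =
                         new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
                System.out.println(reader.readLine());
            }

            fs.close();
        }
    }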

HBase

HBase is a distributed, non-relational database modeled after Google's BigTable.

Facebook used it for its messaging platform, storing over 100 PB of data.

Features

  • Linear and modular scalability.
  • Strictly consistent reads and writes.
  • Automatic and configurable sharding of tables.
  • Automatic failover support between RegionServers.
  • Convenient base classes for backing Hadoop MapReduce jobs with Apache HBase tables.
  • Easy to use Java API for client access (see the sketch below).
  • Block cache and Bloom filters for real-time queries.
  • Query predicate push down via server-side filters.
  • Thrift gateway and a REST-ful Web service that supports XML, Protobuf, and binary data encoding options.
  • Extensible JRuby-based (JIRB) shell.
  • Support for exporting metrics via the Hadoop metrics subsystem to files or Ganglia, or via JMX.
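
A short sketch of the Java client API mentioned above, writing and reading one cell. The table name "messages" and column family "d" are assumptions and would have to exist in the cluster already; cluster settings come from hbase-site.xml on the classpath.

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseExample {
        public static void main(String[] args) throws Exception {
            org.apache.hadoop.conf.Configuration conf = HBaseConfiguration.create();

            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("messages"))) {

                // Write one cell: row key, column family, qualifier, value
                Put put = new Put(Bytes.toBytes("user-1"));
                put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("text"), Bytes.toBytes("hello"));
                table.put(put);

                // Strictly consistent read of the same row
                Result result = table.get(new Get(Bytes.toBytes("user-1")));
                byte[] value = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("text"));
                System.out.println(Bytes.toString(value));
            }
        }
    }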

Hive

Data warehouse software on top of Hadoop that allows data, including HBase tables, to be queried with HiveQL, an SQL-like query language.
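
One common way to run HiveQL from a program is the HiveServer2 JDBC driver. The following sketch assumes a locally running, unsecured HiveServer2 and a hypothetical table page_views.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveExample {
        public static void main(String[] args) throws Exception {
            // Load the HiveServer2 JDBC driver (hive-jdbc on the classpath)
            Class.forName("org.apache.hive.jdbc.HiveDriver");

            // Host, port, and database are assumptions for a local setup
            String url = "jdbc:hive2://localhost:10000/default";

            try (Connection conn = DriverManager.getConnection(url, "", "");
                 Statement stmt = conn.createStatement();
                 // HiveQL looks like SQL; "page_views" is a hypothetical table
                 ResultSet rs = stmt.executeQuery(
                         "SELECT user_id, COUNT(*) AS views FROM page_views GROUP BY user_id")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + " -> " + rs.getLong(2));
                }
            }
        }
    }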

Ambari

Web-based tool to monitor Hadoop installations

Pig

Platform for creating MapReduce programs in the Pig Latin language to analyze large data sets

Chukwa

Monitoring solution for distributed systems

ZooKeeper

Distributed coordination and configuration service

API

For MongoDB there exist

  • mongo shell (CLI)
  • Native drivers for many languages (e.g. Java, C++, Python)

For Hadoop there exist

  • HBase
    • CLI
    • Java
    • REST
  • Hive
    • CLI
    • REST

Integrating Hadoop with MongoDB

There is an adapter, the MongoDB Connector for Hadoop, that makes it possible to use Hadoop's MapReduce functions for aggregations on MongoDB data: Hadoop jobs read the data from MongoDB, aggregate it, and write the results back to MongoDB.
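
A rough sketch of how such a job could be wired up. The class names and configuration keys below follow the MongoDB Connector for Hadoop (mongo-hadoop) and may differ between versions; the collection URIs and the "customer"/"total" field names are assumptions, so treat this as a sketch rather than a drop-in job.

    import com.mongodb.hadoop.MongoInputFormat;
    import com.mongodb.hadoop.MongoOutputFormat;
    import com.mongodb.hadoop.io.BSONWritable;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.bson.BSONObject;
    import org.bson.BasicBSONObject;

    import java.io.IOException;

    public class MongoAggregationJob {

        // Map: read one MongoDB document, emit (customer, total)
        public static class OrderMapper
                extends Mapper<Object, BSONObject, Text, DoubleWritable> {
            @Override
            protected void map(Object key, BSONObject value, Context context)
                    throws IOException, InterruptedException {
                String customer = (String) value.get("customer");    // hypothetical field
                Number total = (Number) value.get("total");          // hypothetical field
                context.write(new Text(customer), new DoubleWritable(total.doubleValue()));
            }
        }

        // Reduce: sum the totals per customer and write a document back to MongoDB
        public static class OrderReducer
                extends Reducer<Text, DoubleWritable, Text, BSONWritable> {
            @Override
            protected void reduce(Text key, Iterable<DoubleWritable> values, Context context)
                    throws IOException, InterruptedException {
                double sum = 0;
                for (DoubleWritable v : values) {
                    sum += v.get();
                }
                BSONObject result = new BasicBSONObject("customer", key.toString())
                        .append("sum", sum);
                context.write(key, new BSONWritable(result));
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Input and output collection URIs are assumptions for a local setup
            conf.set("mongo.input.uri", "mongodb://localhost:27017/shop.orders");
            conf.set("mongo.output.uri", "mongodb://localhost:27017/shop.order_totals");

            Job job = Job.getInstance(conf, "aggregate orders from MongoDB");
            job.setJarByClass(MongoAggregationJob.class);
            job.setInputFormatClass(MongoInputFormat.class);
            job.setOutputFormatClass(MongoOutputFormat.class);
            job.setMapperClass(OrderMapper.class);
            job.setReducerClass(OrderReducer.class);
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(DoubleWritable.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(BSONWritable.class);

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }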

Decision criteria for choosing the right tool

  • MongoDB
    • Best suited as an operational database
    • Not intended for heavy data analysis or data processing
  • Hadoop
    • Offers many data analysis and processing capabilities
    • Well suited for large amounts of mostly read-only data
