MongoDB is the most popular NoSQL database management system (DBMS), according to db-engines.com. It is document-oriented and written in C++. It takes a very flexible approach to how it stores data and does not enforce any schema.
MongoDB was designed to handle high volumes of data. It scales horizontally* through sharding*, and it also lets you run MapReduce* jobs.
Since version 3, MongoDB supports advanced searches such as geospatial and faceted search, as well as full-text search where you can define the language, ignore stop words ("and", "or", "the"… in English, for example) and apply stemming (retrieve a document containing the word "vaccination" when you search for "vaccine").
Documents are stored on disk in BSON (binary JSON), which saves disk space and improves performance.
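As an illustration, here is a minimal sketch of a text index and a full-text query with the pymongo driver; the connection string, database, collection and field names are assumptions made for the example.

    # Minimal sketch of MongoDB text search with pymongo; the database,
    # collection and field names are hypothetical.
    from pymongo import MongoClient, TEXT

    client = MongoClient("mongodb://localhost:27017")
    articles = client["demo"]["articles"]

    # Create a text index on the "body" field; the language drives
    # stop-word removal and stemming ("vaccination" matches "vaccine").
    articles.create_index([("body", TEXT)], default_language="english")

    # Full-text query using the $text operator.
    for doc in articles.find({"$text": {"$search": "vaccine"}}):
        print(doc["_id"])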
*Horizontal scalability: unlike vertical scalability (which can be expensive), this approach splits the load across several servers.
*Sharding: splitting pieces of the index across several servers to spread the disk load. This also spreads the service load: with at least one service per server, each instance is less solicited because they run in parallel.
*MapReduce: an algorithm for processing large amounts of data spread across several servers by aggregating it.
MongoDB does not provide a REST API; you can only access it through its own wire protocol. However, some external projects fill the gap and act as a bridge between a REST API on one side and the protocol on the other.
Full-text search is possible but limited, as MongoDB does not let you tune the indexing parameters: you can only specify the language. Features such as "more like this", which would let you find related documents, are not available in MongoDB.
MongoDB does not support ACID* transactions and documents cannot be related natively. The developer has to run two queries to follow a relation between two documents (or use something like parent/child or nested documents), as sketched just after the footnote.
*ACID: Atomicity, Consistency, Isolation, Durability
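A minimal sketch of the two-query pattern with pymongo; the collections and fields are hypothetical.

    # Relating two documents "by hand": MongoDB has no native join here,
    # so we issue two queries. Collections and fields are hypothetical.
    from pymongo import MongoClient

    db = MongoClient("mongodb://localhost:27017")["demo"]

    order = db.orders.find_one({"_id": 42})                          # first query
    customer = db.customers.find_one({"_id": order["customer_id"]})  # second query

    print(customer["name"])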
Elasticsearch is a document-oriented search engine written in Java on top of Lucene. Created in 2012, it is becoming more and more popular and has a growing community. It is free and open source, though some plugins and tools are paid.
Elasticsearch is very efficient at complex searches on high volumes of data thanks to Lucene. Horizontal scalability* is very easy: you only need to start a new instance and it connects itself to the cluster, with automatic master election and intelligent shard* management. Elasticsearch was designed to have no SPOF (Single Point Of Failure): in a cluster of several Elasticsearch nodes, even if a node goes down (for example after a server crash), the data remains available and the service keeps working.
Elasticsearch has its own DSL based on JSON, which makes building queries through the REST API easier than writing an SQL query.
You can store flat documents as JSON objects; they do not have to match a schema. However, at any time, through the REST API, you can edit the mapping to add indexing rules on certain fields for specific performance needs.
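As a minimal sketch, here is what editing a mapping and querying with the JSON DSL can look like over the REST API using plain HTTP calls, assuming a recent Elasticsearch (7 or later) reachable on localhost; the "articles" index and its fields are hypothetical.

    # Minimal sketch of the Elasticsearch REST API and its JSON DSL;
    # the index name and fields are hypothetical.
    import requests

    ES = "http://localhost:9200"

    requests.put(f"{ES}/articles")                   # create the index
    requests.put(f"{ES}/articles/_mapping", json={   # add an indexing rule on a field
        "properties": {"body": {"type": "text", "analyzer": "english"}}
    })

    # Index a flat JSON document (refresh=true makes it searchable right away),
    # then search it with the query DSL.
    requests.post(f"{ES}/articles/_doc",
                  json={"body": "vaccination campaign"},
                  params={"refresh": "true"})
    resp = requests.post(f"{ES}/articles/_search",
                         json={"query": {"match": {"body": "vaccine"}}})
    print(resp.json()["hits"]["hits"])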
*Horizontal scalability: unlike vertical scalability (which can be expensive), this approach splits the load across several servers.
*Shards: pieces of the index split across several servers to spread the disk load. This also spreads the service load: with at least one service per server, each instance is less solicited because they run in parallel.
The vendor itself says it: you should not use Elasticsearch as your primary database. It is a search engine, not a database.
Indexing takes some time: you must wait for the refresh (1 s by default) before newly indexed data becomes searchable, or trigger the refresh manually.
Like other document-oriented systems, Elasticsearch needs two queries to work with related documents, or has to rely on the parent/child trick (just like MongoDB).
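A minimal sketch of the manual option, reusing the hypothetical "articles" index from the previous sketch:

    # Force a refresh manually instead of waiting for the periodic
    # (default 1 s) refresh; the index name is hypothetical.
    import requests

    requests.post("http://localhost:9200/articles/_refresh")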
OrientDB is a NoSQL solution released in 2010, with a 2.0 version in 2015. It is a graph-document database, combining the advantages of graphs (fast link traversal, the ability to walk relationships…) with the flexibility of documents without a fixed schema. It implements a query language close to SQL. OrientDB is open source and free for any use (contrary to Neo4j).
A graph-oriented database stores "vertices", which are nodes containing information, and "edges", which are links between vertices (and may also contain information). A vertex linked to another can be found more easily than through a relational join, because the edge is a direct pointer to the other node.
With OrientDB the cluster is master-master, i.e. there is no leader and no election among the nodes of the cluster. Data is replicated and sharded* across the nodes to tolerate node failures without service interruption or data loss. On the same topic, to scale more efficiently, OrientDB introduces clusters at the class level: for a User class you can choose a User_fr cluster in which you store French users and a User_us cluster in which you store American users. You can then query the User class to get all users, or query a single cluster to limit the number of results (see the sketch below).
Because OrientDB is a graph database, you can quickly find relationships, and relationships at several degrees, with a native function. This is very useful, for example, in a social network to find and suggest friends of friends at various depths. This is a big difference from the other databases covered in this article, which do not manage relationships natively.
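Here is a minimal sketch of both ideas, class-level clusters and native relationship traversal, through OrientDB's HTTP command API (the endpoint that accepts SQL commands); the server address, credentials, the "social" graph database, the classes and the data are all assumptions made for the example.

    # Minimal sketch against OrientDB's HTTP command API; the credentials,
    # database, classes and data are hypothetical.
    import requests

    def sql(command):
        return requests.post(
            "http://localhost:2480/command/social/sql",  # assumed database "social"
            data=command.encode("utf-8"),
            auth=("root", "root_password"),              # assumed credentials
        ).json()

    # Class-level clusters: one class, one cluster per country.
    sql("CREATE CLASS User EXTENDS V")
    sql("ALTER CLASS User ADDCLUSTER User_fr")
    sql("ALTER CLASS User ADDCLUSTER User_us")
    sql("INSERT INTO CLUSTER:User_fr SET name = 'Lea'")
    sql("INSERT INTO CLUSTER:User_us SET name = 'John'")
    print(sql("SELECT FROM User"))             # every user, whatever the cluster
    print(sql("SELECT FROM CLUSTER:User_fr"))  # only the users stored in User_fr

    # Native relationships: edges are pointers, so friends of friends are
    # reached by following 'FriendOf' twice, without any join.
    sql("CREATE CLASS FriendOf EXTENDS E")
    sql("CREATE EDGE FriendOf FROM (SELECT FROM User WHERE name = 'Lea') "
        "TO (SELECT FROM User WHERE name = 'John')")
    print(sql("SELECT expand(out('FriendOf').out('FriendOf')) "
              "FROM User WHERE name = 'Lea'"))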
OrientDB uses Lucene, which enables advanced full-text search as well as geospatial search.
In spite of the promises OrientDB makes, there is little user feedback from production deployments with large amounts of data. The community around this tool is not very big, which can be worrying if a problem occurs (even though it is growing).
The HTTP API uses the SQL query language by putting the command directly in the URL; a JSON body would be better suited.
Hadoop is a Java framework onto which several tools from the same ecosystem can plug. Hadoop lets you work with strong horizontal scalability* to process very large volumes of data thanks to HDFS (Hadoop Distributed File System). Hadoop runs MapReduce* jobs that fetch data in parallel from different servers and aggregate it, abstracting away the fact that the load is distributed, as if all the data were stored on a single disk.
Hive is a Java tool that plugs into Hadoop and lets you run queries with a syntax close to SQL. When you run a query through Hive, it compiles it into MapReduce* jobs and runs them on the server cluster.
If you use Hadoop, it is mainly to analyse very large volumes of data spread across several servers; Hadoop is a very good fit for "big data". A typical use case is retrieving all the tweets with a specific hashtag to analyse the level of satisfaction towards a brand.
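To give an idea of the model, here is a minimal word-count sketch in the style of Hadoop Streaming, where the map and reduce steps are two small scripts reading from stdin and writing to stdout; the input data and the job-submission details are left out.

    # mapper.py - emits a (word, 1) pair for every word on stdin.
    import sys

    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

    # reducer.py - aggregates the (word, 1) pairs emitted by the mappers;
    # Hadoop Streaming feeds them to the reducer sorted by key.
    import sys

    current, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(value)
    if current is not None:
        print(f"{current}\t{count}")

Such scripts would typically be submitted with the hadoop-streaming jar, reading their input from HDFS and writing their output back to HDFS.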
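A minimal sketch of such an analysis through the PyHive client, assuming a HiveServer2 on localhost and a hypothetical "tweets" table:

    # Minimal sketch of querying Hive from Python with PyHive; the
    # connection details and the "tweets" table are hypothetical.
    from pyhive import hive

    cursor = hive.connect(host="localhost", port=10000).cursor()

    # This SQL-like query is compiled into MapReduce jobs by Hive.
    cursor.execute("""
        SELECT hashtag, COUNT(*) AS mentions
        FROM tweets
        WHERE hashtag = '#somebrand'
        GROUP BY hashtag
    """)
    print(cursor.fetchall())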
The data has to be prepared in order to be retrieved quickly. Since SQL-like queries are compiled into MapReduce* jobs, this tool may not be the best fit for small data sets or for clusters with only a few servers.
Hive is not a search engine, so full-text or faceted search is not possible.
Cassandra is a NoSQL system created by Facebook and released in 2008. It is column-oriented and open source. It is widely used by big companies: CERN, eBay (with 250 TB of data), Netflix (with 420 TB of data and a trillion queries a day), GitHub, Spotify, Instagram… The project is still active and companies such as Facebook take part in its development. A commercial version also exists, published by the company DataStax.
Cassandra was created to offer strong scalability and guarantee high availability. When you add a Cassandra node to a cluster, capacity grows in proportion to the nodes added; in other words, per-node performance stays the same, so adding a node is not a problem as it can be with other DBMS.
Because the system is column-oriented, the schema has to be defined in advance (you can work around this by putting several kinds of data in the same column, but Cassandra was not designed for that). It is strict at query time: you need to create an index on every column you want to use in a WHERE clause.
Querying is limited: no LIKE, no full-text or faceted search.
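A minimal sketch with the DataStax Python driver; the keyspace, table, index and data are hypothetical.

    # Minimal sketch with the DataStax Python driver: the schema is declared
    # up front and a secondary index is needed before filtering on "brand".
    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect()

    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS demo
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
    """)
    session.execute("""
        CREATE TABLE IF NOT EXISTS demo.tweets (
            id uuid PRIMARY KEY,
            brand text,
            body text
        )
    """)
    # Without this index, a WHERE clause on "brand" would be rejected.
    session.execute("CREATE INDEX IF NOT EXISTS ON demo.tweets (brand)")

    rows = session.execute("SELECT body FROM demo.tweets WHERE brand = 'somebrand'")
    for row in rows:
        print(row.body)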
There are many NoSQL DBMS, and each one tries to answer a specific problem. What clearly stands out is that no single system can answer every problem that may arise on a project. The best solution is often to combine several NoSQL DBMS and/or add an SQL solution such as MySQL or PostgreSQL.
You can, for instance, benefit from immediate data persistence in MongoDB or Cassandra combined with advanced full-text search in Elasticsearch. You can also add Hadoop to this architecture to analyse "big data" on high volumes.
Many combinations exist, with gateways to index data from one engine into another, for example a message broker such as RabbitMQ.
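As a minimal sketch, a producer could publish every saved document to RabbitMQ (here with the pika client) and a small consumer could index it into Elasticsearch; the queue name, index name and document shape are assumptions made for the example.

    # Minimal sketch of a gateway between a primary store and Elasticsearch
    # through RabbitMQ; queue, index and document shape are hypothetical.
    import json
    import pika
    import requests

    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="to_index")

    # Producer side: after saving in MongoDB/Cassandra, publish the document.
    doc = {"id": 1, "body": "vaccination campaign"}
    channel.basic_publish(exchange="", routing_key="to_index", body=json.dumps(doc))

    # Consumer side: index every received message into Elasticsearch.
    def on_message(ch, method, properties, body):
        d = json.loads(body)
        requests.put(f"http://localhost:9200/articles/_doc/{d['id']}", json=d)
        ch.basic_ack(delivery_tag=method.delivery_tag)

    channel.basic_consume(queue="to_index", on_message_callback=on_message)
    channel.start_consuming()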
Vertical scalability: improving a single server by adding disk space, RAM or CPU power.
Horizontal scalability: unlike vertical scalability (which can be expensive), this approach splits the load across several servers.
Sharding: splitting pieces of the index across several servers to spread the disk load. This also spreads the service load: with at least one service per server, each instance is less solicited because they run in parallel.
MapReduce: an algorithm for processing large amounts of data spread across several servers by aggregating it.
ACID: Atomicity, Consistency, Isolation, Durability