How to set up a Sharded Cluster in MongoDB on Ubuntu Server 18.04

Introduction

MongoDB is a NoSQL, document-based database system that scales well horizontally and stores data as documents made up of key:value pairs. A popular choice for web applications and websites, MongoDB is easy to implement and access programmatically.

MongoDB achieves scaling through a technique known as "sharding". Sharding is the process of distributing data across multiple servers to spread out the read/write load and the data storage requirements.

MongoDB Sharding Topology

Sharding is implemented through three separate components. Each part performs a specific function:

  • Config Server: Each production sharding implementation must contain at least three configuration servers. This is to ensure redundancy and high availability.

Config servers are used to store the metadata that links requested data with the shard that contains it. They organize the data so that information can be retrieved reliably and consistently.

Config servers are the brains of your cluster: they hold all of the metadata about which servers hold what data. Thus, they must be set up first, and the data they hold is extremely important: make sure that they are running with journaling enabled and that their data is stored on non-ephemeral drives. In production deployments, your config server replica set should consist of at least three members, each on a separate physical machine, preferably geographically distributed.

The config servers must be started before any of the mongos processes, as mongos pulls its configuration from them.


  • Query Routers: The query routers are the machines that your application actually connects to. These machines are responsible for communicating with the config servers to figure out where the requested data is stored. They then access and return the data from the appropriate shard(s).

Each query router runs the "mongos" process.


  • Shard Servers: Shards are responsible for the actual data storage operations. In production environments, a single shard is usually composed of a replica set instead of a single machine, to ensure that data will still be accessible in the event that the shard's primary server goes offline.


MongoDB organizes information into databases. Inside each database, data is further compartmentalized into "collections". A collection is akin to a table in traditional relational database models. Inside a collection, information is stored as "documents". A document is akin to a row in a relational table. Inside each document, the data is stored as a set of key:value pairs separated by commas; a key:value pair is akin to a column/field and its value in the relational model.
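For illustration, a hypothetical document in an artists collection (the field names here are made up) might look like this:

{ "name" : "John Doe", "genre" : "rock", "albums" : 5 }

Here name, genre, and albums are the keys, and the text or number following each colon is its value.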

In this tutorial, we will configure an example sharding cluster that contains:

  • 1 Config Server Replica Set (required in production environments)
  • 3 Query Routers (minimum of 1 necessary)
  • 3 Shard Server Replica Sets (minimum of 2 necessary)

Please note that I am going to set up a sample cluster for demo purposes only, to get you off the ground. In a production system there are other factors you need to care about, such as security. Please read the MongoDB documentation on how to deploy in a production environment.

Each replica set will contain 3 nodes (a.k.a. machines). Hence we would need fifteen machines in a production setup. But for demonstration purposes, and so you can follow along, I will start all the components on a single node/machine, each with a different port number. In a real setup, use different machines and set the appropriate bind_ip and port values for the process you are starting on each node.

Although I am using command line parameters here, I suggest using a configuration file, as that is the better way to manage server startup options. I will show an example of that at the end of the blog, along with what will differ between a real production setup and mine. Till then, read on 😊 or skip to the section "What should be done in a production setup" at the end.

In reality, some of these functions can overlap (for instance, you can run a query router on the same machine you use as a config server or application server) and you only need one query router and a minimum of 2 shard servers.

We will go above this minimum in order to demonstrate adding multiple components of each type, and to show how doing so leads to a more reliable, highly available, and scalable system.

The cluster setup will proceed as follows (assumption: MongoDB binaries are installed on the target nodes). I will run the services in the background with no hangup (nohup). I am running all the processes under my own user on the test machine (a virtual Ubuntu 18.04 machine), hence the directory paths will be relative.

Note: Replace the appropriate values according to your setup (IP address, port number, directory paths, file names, etc.) in the connection strings, directory setup, and so on.

If things do not work for you, check that you are providing the right values for the entities involved (IP addresses, ports, etc.), that the user running the services has the right access privileges to the directories in use, and do any other checks you deem fit according to your setup.
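If a process fails to start, its log file is the first place to look. For example, to inspect the last lines of the first config server's log (using the log paths we create below):

tail -n 50 ~/logs/cfg1/confsvr.log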

Diagram Representation

[Diagram: the sharded cluster topology described above — one config server replica set, three mongos query routers, and three shard replica sets.]

Create Config Server Replica Set

The first components that must be set up and running are the configuration servers. These must be online and operational before the query routers or shards can be configured.

The first thing we need to do is create a data directory (which will store the metadata that associates the location and content of the sharded collections on the different shards) and a log directory (to store the config server log output).

mkdir -p ~/data/cfg{1,2,3}

mkdir -p ~/logs/cfg{1,2,3}

Since we are running 3 instances of the config server to set up our config server replica set, we created three separate directories. Now it is time to start the config server processes. The binary required is mongod.

nohup mongod --configsvr --replSet configRS --dbpath ~/data/cfg1/ --logpath ~/logs/cfg1/confsvr.log --bind_ip 127.0.0.1 --port 20000 --oplogSize 200 &

nohup mongod --configsvr --replSet configRS --dbpath ~/data/cfg2/ --logpath ~/logs/cfg2/confsvr.log --bind_ip 127.0.0.1 --port 20001 --oplogSize 200 &

nohup mongod --configsvr --replSet configRS --dbpath ~/data/cfg3/ --logpath ~/logs/cfg3/confsvr.log --bind_ip 127.0.0.1 --port 20002 --oplogSize 200 &

The servers will start outputting information into their log files and begin listening for connections from other components.

You can check that all three configuration servers are up and running by executing the command below:

ps -eaf | grep -i configsvr | grep -v grep

Now it is time to stitch these three separate config servers into a replica set. Connect to any one of the above config servers and initiate the replica set.

mongo mongodb://127.0.0.1:20000

rs.initiate(
  {
    _id: "configRS",
    configsvr: true,
    members: [
      { _id : 0, host : "127.0.0.1:20000" },
      { _id : 1, host : "127.0.0.1:20001" },
      { _id : 2, host : "127.0.0.1:20002" }
    ]
  }
)

You can check the status by running the rs.status() command.
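For a quick sanity check, you can print just the name and state of each member from the rs.status() output; in a healthy three-member set you should see one PRIMARY and two SECONDARY members:

rs.status().members.forEach(function(m) { print(m.name + " : " + m.stateStr) })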

Create the Shard Replica Sets

As already mentioned, we are creating 3 shard servers, each a replica set of 3 nodes. Hence we are going to follow the steps below:

  1. Create data and log directories to store information.
  2. Run the shard server services.
  3. Stitch all of this together to create a setup of 3 shards, where each shard is a replica set of 3 nodes. This is done by connecting to one member of each shard replica set and initiating it.

The binary required is mongod.

Create data and log directories to store information.

mkdir -p ~/data/rs0{1,2,3}

mkdir -p ~/logs/rs0{1,2,3}

mkdir -p ~/data/rs1{1,2,3}

mkdir -p ~/logs/rs1{1,2,3}

mkdir -p ~/data/rs2{1,2,3}

mkdir -p ~/logs/rs2{1,2,3}

Run the shard server services.

nohup mongod --shardsvr --replSet shardRS0 --dbpath ~/data/rs01/ --logpath ~/logs/rs01/shardsvr.log --bind_ip 127.0.0.1 --port 20005 --oplogSize 200 &

nohup mongod --shardsvr --replSet shardRS0 --dbpath ~/data/rs02/ --logpath ~/logs/rs02/shardsvr.log --bind_ip 127.0.0.1 --port 20006 --oplogSize 200 &

nohup mongod --shardsvr --replSet shardRS0 --dbpath ~/data/rs03/ --logpath ~/logs/rs03/shardsvr.log --bind_ip 127.0.0.1 --port 20007 --oplogSize 200 &

nohup mongod --shardsvr --replSet shardRS1 --dbpath ~/data/rs11/ --logpath ~/logs/rs11/shardsvr.log --bind_ip 127.0.0.1 --port 20008 --oplogSize 200 &

nohup mongod --shardsvr --replSet shardRS1 --dbpath ~/data/rs12/ --logpath ~/logs/rs12/shardsvr.log --bind_ip 127.0.0.1 --port 20009 --oplogSize 200 &

nohup mongod --shardsvr --replSet shardRS1 --dbpath ~/data/rs13/ --logpath ~/logs/rs13/shardsvr.log --bind_ip 127.0.0.1 --port 20010 --oplogSize 200 &

nohup mongod --shardsvr --replSet shardRS2 --dbpath ~/data/rs21/ --logpath ~/logs/rs21/shardsvr.log --bind_ip 127.0.0.1 --port 20011 --oplogSize 200 &

nohup mongod --shardsvr --replSet shardRS2 --dbpath ~/data/rs22/ --logpath ~/logs/rs22/shardsvr.log --bind_ip 127.0.0.1 --port 20012 --oplogSize 200 &

nohup mongod --shardsvr --replSet shardRS2 --dbpath ~/data/rs23/ --logpath ~/logs/rs23/shardsvr.log --bind_ip 127.0.0.1 --port 20013 --oplogSize 200 &

Initialize the 3 shards, each a replica set of 3 nodes.

mongo mongodb://127.0.0.1:20005

rs.initiate(
  {
    _id: "shardRS0",
    members: [
      { _id : 0, host : "127.0.0.1:20005" },
      { _id : 1, host : "127.0.0.1:20006" },
      { _id : 2, host : "127.0.0.1:20007" }
    ]
  }
)

mongo mongodb://127.0.0.1:20008

rs.initiate(
  {
    _id: "shardRS1",
    members: [
      { _id : 0, host : "127.0.0.1:20008" },
      { _id : 1, host : "127.0.0.1:20009" },
      { _id : 2, host : "127.0.0.1:20010" }
    ]
  }
)

mongo mongodb://127.0.0.1:20011

rs.initiate(
  {
    _id: "shardRS2",
    members: [
      { _id : 0, host : "127.0.0.1:20011" },
      { _id : 1, host : "127.0.0.1:20012" },
      { _id : 2, host : "127.0.0.1:20013" }
    ]
  }
)

Configure mongos/Query Router

Once you have the config servers and data shards running, start one or more mongos processes for your application to connect to. The mongos processes need to know where the config servers are, so you must always start mongos with the --configdb option:

The query router service is called mongos.

The configuration string must be exactly the same for every query router you configure. It is composed of the config server replica set name, followed by the address of each configuration server and the port number it is operating on, separated by commas.

Since mongos pulls all the required information from the config servers, it needs no data directory of its own. We are going to run 3 query routers and will create a log directory for each.

mkdir -p ~/logs/mongosqr{1,2,3}

nohup mongos --configdb configRS/127.0.0.1:20000,127.0.0.1:20001,127.0.0.1:20002 --bind_ip 127.0.0.1 --port 20003 --logpath ~/logs/mongosqr1/mongosqr.log &

nohup mongos --configdb configRS/127.0.0.1:20000,127.0.0.1:20001,127.0.0.1:20002 --bind_ip 127.0.0.1 --port 20004 --logpath ~/logs/mongosqr2/mongosqr.log &

nohup mongos --configdb configRS/127.0.0.1:20000,127.0.0.1:20001,127.0.0.1:20002 --bind_ip 127.0.0.1 --port 20014 --logpath ~/logs/mongosqr3/mongosqr.log &

Add shards to mongos/Query Router

Now you’re ready to add your replica sets as shards using the sh.addShard() method. Connect to a mongos service using the mongo shell interface and switch to the admin database:

You can specify all the members of the replica set, but you do not have to: mongos will automatically detect any members that were not included in the seed list. If you run sh.status(), you’ll see all the members listed.

mongo 127.0.0.1:20014/admin

sh.addShard("shardRS0/127.0.0.1:20005,127.0.0.1:20006,127.0.0.1:20007")

sh.addShard("shardRS1/127.0.0.1:20008")

sh.addShard("shardRS2/127.0.0.1:20011,127.0.0.1:20012,127.0.0.1:20013")

sh.status()

Test your Setup

MongoDB won't distribute your data automatically until you tell it how to do so. You must explicitly tell both the database and the collection that you want them to be distributed. For example, suppose you wanted to shard the artists collection in the music database on the "name" shard key. First, you’d enable sharding for the database:

sh.enableSharding("music")

Sharding a database is always a prerequisite to sharding one of its collections.

Once you’ve enabled sharding on the database level, you can shard a collection by running sh.shardCollection():

sh.shardCollection("music.artists", {"name" : 1})

The general form is sh.shardCollection(<namespace>, <shard key>).

Now the artists collection will be sharded by the "name" key. If you are sharding an existing collection, there must be an index on the "name" field; otherwise, the shardCollection call will return an error. If you get an error, create the index (mongos will return the suggested index as part of the error message) and retry the shardCollection command.
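For example, if the artists collection already contained data, you would create the index first and then retry (a sketch reusing the example names from above, and assuming sharding is already enabled on the music database):

mongo 127.0.0.1:20003

use music

db.artists.createIndex({ "name" : 1 })

sh.shardCollection("music.artists", { "name" : 1 })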

If the collection you are sharding does not yet exist, mongos will automatically create the shard key index for you.

For example:

mongo 127.0.0.1:20003

use books

for(i=0;i<10;i++){ db.mybooks.insert( {_id:i,name:"book_name_"+i,author:"Abhishek G"} ) }

sh.enableSharding("books")

sh.shardCollection("books.mybooks",{_id:1},true)
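The third argument to sh.shardCollection() is the unique flag; passing true here tells MongoDB to build the underlying shard key index as a unique index.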

for(i=11;i<50000;i++){ db.mybooks.insert( {_id:i,name:"book_name_"+i,author:"Abhishek G"} ) }

db.mybooks.stats()

sh.status()
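Besides stats() and sh.status(), the mongo shell also provides a per-collection helper that summarizes how the documents and chunks are spread across the shards:

db.mybooks.getShardDistribution()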

What should be done in a production setup

  • Each server can connect to the other servers in the cluster.
  • MongoDB software/binaries are installed on each machine.
  • The required user (to run the service) and the data, log, etc. directories are created for the required service/process to store its information (data + metadata). Preferably a mongodb user is created and given the required privileges on the directories and files needed to run the particular service.
  • A configuration file filled with the required key:value startup options.
  • Root privileges or sudo power to set up and run the cluster.
  • Security and other considerations for running a real-world sharded cluster.

Config Server configuration file

# mongod.conf

# for documentation of all options, see:
#   http://docs.mongodb.org/manual/reference/configuration-options/

# Where and how to store data.
storage:
  dbPath: /var/lib/mongodb
  engine: wiredTiger
  directoryPerDB: true
  journal:
    enabled: true

# where to write logging data.
systemLog:
  destination: file
  logAppend: true
  path: /var/log/mongodb/mongod.log

# network interfaces
net:
  port: 27017
  bindIp: 192.168.11.129
#  ssl:
#    mode: requireSSL
#    PEMKeyFile: /home/userA/mongodb.pem
#    CAFile: /home/userA/rootCA.pem


# how the process runs
processManagement:
  timeZoneInfo: /usr/share/zoneinfo

# security:
#   authorization: "enabled"

replication:
  replSetName: "configRS"
sharding:
  clusterRole: configsvr
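With a configuration file like the one above in place, the long command lines shrink to a single option (assuming the file is saved as /etc/mongod.conf):

mongod --config /etc/mongod.conf

If MongoDB was installed from the official packages, the bundled service typically reads the same file, so sudo systemctl start mongod works as well.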

Shard Server configuration file

 
# mongod.conf

# for documentation of all options, see:
#   http://docs.mongodb.org/manual/reference/configuration-options/

# Where and how to store data.
storage:
  dbPath: /var/lib/mongodb
  engine: wiredTiger
  directoryPerDB: true
  journal:
    enabled: true

# where to write logging data.
systemLog:
  destination: file
  logAppend: true
  path: /var/log/mongodb/mongod.log

# network interfaces
net:
  port: 27017
  bindIp: 192.168.11.129
#  ssl:
#    mode: requireSSL
#    PEMKeyFile: /home/userA/mongodb.pem
#    CAFile: /home/userA/rootCA.pem


# how the process runs
processManagement:
  timeZoneInfo: /usr/share/zoneinfo

# security:
#   authorization: "enabled"

replication:
  replSetName: "shardRS0"

sharding:
  clusterRole: shardsvr
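The tutorial above starts mongos with command line parameters; for completeness, a minimal mongos configuration file might look like the sketch below. The bind address and config server hosts are placeholders for your own values, and note that there is no storage section, since mongos keeps no data of its own.

# mongos.conf

systemLog:
  destination: file
  logAppend: true
  path: /var/log/mongodb/mongos.log

net:
  port: 27017
  bindIp: 192.168.11.129

processManagement:
  timeZoneInfo: /usr/share/zoneinfo

sharding:
  configDB: configRS/192.168.11.129:27017,192.168.11.130:27017,192.168.11.131:27017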
