Abstract and Motivation
The only single point of failure (SPOF) in HDFS sits on its most important node, the NameNode. If it fails, ongoing operations fail and user data may be lost.
In 2008, we (a team from the China Mobile Research Institute) implemented an initial version of Name-Node Cluster (NNC) on Hadoop 0.17. NNC introduced a Synchronization Agent, which synchronizes FSNamesystem updates to slave NNs. We then ported it to Hadoop 0.19 and 0.20 and employed ZooKeeper as a registry of NNs. Related information was covered in our presentation at Hadoop Summit '09.
Here I will post some details on the design and our experience in practice.
- Name-Node Cluster (NNC): a cluster of NNs in which the master NN synchronizes metadata to slave NNs, and a slave NN takes over the master's work if it fails.
- Master NN and Slave NN: in an NN cluster, the master NN synchronizes information to the slaves.
- Synchronization Agent (SyncAgent): a daemon in each NN that manages the synchronization of namesystem information between master and slave NNs.
- SyncMaster: manages the slaves and transfers updates to the slave nodes.
- SyncSlave: receives updates from the master and applies them to its own namesystem.
- SyncUnit (or update): a primitive metadata operation, the unit of transfer in the synchronization procedure. It encapsulates an operation, such as mkdirs or a block report, together with its data.
Synchronization Agent (SyncAgent)
Each NameNode in an NNC has a SyncAgent, either a SyncMaster or a SyncSlave. The SyncAgent maintains synchronization-related status and handles the synchronization data transfer between NameNodes. NNC works like a journaled filesystem: every change made in the master NameNode's namesystem is logged and transferred to the slaves as a SyncUnit, which is replayed at each slave, keeping the slaves' namesystems identical to the master's.
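As a rough illustration of what such a logged-and-replayed record might look like, here is a minimal Java sketch of a SyncUnit. All names and fields here are assumptions for illustration, not the actual NNC code; the key idea is that each unit carries the operation type, its serialized arguments, and a position in the master's update stream so slaves can replay in order.

```java
import java.io.Serializable;

// Hypothetical sketch of a SyncUnit: one logged metadata operation,
// serialized for transfer to the slaves and replayed there in order.
class SyncUnit implements Serializable {
    final int opCode;       // which operation, e.g. a mkdir or a block report
    final long sequenceId;  // position in the master's update stream
    final byte[] payload;   // serialized operation arguments

    SyncUnit(int opCode, long sequenceId, byte[] payload) {
        this.opCode = opCode;
        this.sequenceId = sequenceId;
        this.payload = payload;
    }

    // A unit may only be replayed immediately after its predecessor,
    // which is what keeps the slave's namesystem identical to the master's.
    boolean follows(SyncUnit prev) {
        return sequenceId == prev.sequenceId + 1;
    }
}
```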
A slave in an NNC may transition among four statuses:
- Empty: a newly joined NameNode is in Empty status and will be injected by the master.
- Unsynced: a NameNode that holds namesystem data but has not yet synchronized with the master.
- Synced: a NameNode that has synchronized with the master and needs only the most recent update to keep up.
- Up-to-date: a NameNode with the latest update.
The following figure shows the transitions between these slave statuses.
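The transitions can also be written down as a small state machine. The enum and method names below are my own shorthand, not NNC code; the fall-back to Unsynced on any transfer failure matches the behavior described later in the update-transfer section.

```java
// Hypothetical sketch of the four slave statuses and their transitions:
// EMPTY -> (injection) -> UNSYNCED -> (catch up) -> SYNCED -> (latest update) -> UP_TO_DATE,
// with any failed or timed-out transfer dropping the slave back to UNSYNCED.
enum SlaveStatus {
    EMPTY,       // newly joined, waiting for injection by the master
    UNSYNCED,    // has namesystem data, not yet caught up with the master
    SYNCED,      // caught up except for the most recent update
    UP_TO_DATE;  // has the latest update

    SlaveStatus onInjected()     { return this == EMPTY ? UNSYNCED : this; }
    SlaveStatus onCaughtUp()     { return this == UNSYNCED ? SYNCED : this; }
    SlaveStatus onLatestUpdate() { return this == SYNCED ? UP_TO_DATE : this; }
    SlaveStatus onTransferFail() { return UNSYNCED; } // must re-register and re-sync
}
```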
Initial Injection of a Slave
When a new NameNode joins the NNC, it enters a boot-up procedure. After it has shaken hands with the master, the master injects the whole namespace into the slave node:
1. The slave registers with the master, providing a socket for receiving updates.
2. The master locks the FSNamespace, blocking all metadata write operations.
3. The master dumps the FSDirectory and the DataNodeMap and encapsulates them.
4. The master unlocks the FSNamespace.
5. The master establishes a connection to the slave and sends it all the data.
6. The slave initializes its FSNamespace.
The slave is then marked "Unsynced" and receives the updates that have happened since step 4 above. After it has synchronized with the master, it transitions to "Synced" or "Up-to-date" status.
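The critical part of this procedure is the lock window in steps 2-4: the snapshot must be taken while no metadata writes can interleave, and everything that arrives after the unlock belongs to the update stream the "Unsynced" slave still has to receive. A minimal Java sketch of that window, with all names assumed (a `ReentrantReadWriteLock` stands in for the FSNamespace lock, and the dump is abstracted as a supplier):

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Supplier;

// Hypothetical sketch of injection steps 2-5: take a consistent snapshot
// of the namespace under the write lock, then stream it to the slave
// outside the lock so metadata writes can resume as soon as possible.
class Injector {
    static byte[] snapshotForSlave(ReentrantReadWriteLock nsLock,
                                   Supplier<byte[]> dumpNamespace) {
        nsLock.writeLock().lock();       // step 2: block all metadata writes
        try {
            return dumpNamespace.get();  // step 3: dump FSDirectory and DataNodeMap
        } finally {
            nsLock.writeLock().unlock(); // step 4: unlock; later updates are queued for the slave
        }
        // step 5: the caller streams the returned snapshot to the slave
    }
}
```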
When a metadata-changing operation happens, the SyncMaster transfers it to the SyncSlaves over the established synchronization connections. If there are multiple slaves, each one is served by a separate thread:
- Every metadata-changing operation is hooked by the SyncMaster, which encapsulates the operation and its data and then starts the update routine.
- Before an update is transferred, the SyncMaster ensures that no older updates are still being transferred, i.e. all updates are transferred in sequence. (This is a trade-off between performance and consistency; adopting a larger window might improve performance.)
- A SyncMonitor is started, which manages all the threads transferring SyncUnits to the slaves.
- If a thread fails or times out, the corresponding slave is moved to "Unsynced" status and is required to register and synchronize again.
- After all the threads have finished, the SyncMonitor returns and the SyncMaster releases the lock that may be blocking the next update transfer.
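This fan-out can be sketched in Java as follows. Everything here is an assumption for illustration (the NNC source is not shown in this post): one transfer task per slave, a blocking `broadcast()` that models the window-size-1 sequencing, and a returned list of failed or timed-out slaves that the caller would mark "Unsynced".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch of the SyncMonitor: send one update to every slave
// in parallel, wait for all transfers, and report which slaves failed.
class SyncMonitor {
    interface SlaveLink { boolean receive(byte[] unit); } // false = transfer failed

    // Blocks until every slave has received the unit (or failed/timed out),
    // which serializes updates: the next update cannot start earlier.
    static List<SlaveLink> broadcast(List<SlaveLink> slaves, byte[] unit, long timeoutMs)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, slaves.size()));
        List<SlaveLink> failed = new ArrayList<>();
        try {
            List<Future<Boolean>> results = new ArrayList<>();
            for (SlaveLink s : slaves) {
                results.add(pool.submit(() -> s.receive(unit))); // one thread per slave
            }
            for (int i = 0; i < slaves.size(); i++) {
                boolean ok;
                try {
                    ok = results.get(i).get(timeoutMs, TimeUnit.MILLISECONDS);
                } catch (TimeoutException | ExecutionException e) {
                    ok = false; // failure or timeout: slave falls back to "Unsynced"
                }
                if (!ok) failed.add(slaves.get(i));
            }
        } finally {
            pool.shutdownNow();
        }
        return failed;
    }
}
```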
The following update events are hooked and synchronized:
- NNU_NOP // nothing to do
- NNU_BLK // add or remove a block
- NNU_INODE // add or remove or modify an inode (add or remove file; new block allocation)
- NNU_NEWFILE // start new file
- NNU_CLSFILE // close new file
- NNU_MVRM // move or remove file
- NNU_MKDIR // mkdir
- NNU_LEASE // add/update or release a lease
- NNU_LEASE_BATCH //update batch of leases
- NNU_DNODEHB_BATCH //batch of datanode heartbeat
- NNU_DNODEREG // dnode register
- NNU_DNODEBLK // block report
- NNU_DNODERM // remove dnode
- NNU_BLKRECV // block received message from datanode
- NNU_REPLICAMON //replication monitor work
- NNU_WORLD //bootstrap a slave node
- NNU_MASSIVE //bootstrap a slave node
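Transcribed as a Java enum, purely for illustration (the original may well use plain int constants), the event set looks like this:

```java
// The hooked update-event types listed above, as an enum sketch.
enum NNUpdateType {
    NNU_NOP, NNU_BLK, NNU_INODE, NNU_NEWFILE, NNU_CLSFILE, NNU_MVRM,
    NNU_MKDIR, NNU_LEASE, NNU_LEASE_BATCH, NNU_DNODEHB_BATCH, NNU_DNODEREG,
    NNU_DNODEBLK, NNU_DNODERM, NNU_BLKRECV, NNU_REPLICAMON,
    NNU_WORLD, NNU_MASSIVE
}
```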
ZooKeeper in NNC
In the 0.20 version of NNC, ZooKeeper was introduced as a registry of NameNodes. From ZooKeeper a slave can find the master's location, and DFSClients and DataNodes can also use it to find an available NameNode. We did plan to adopt ZooKeeper in the failover procedure, but have not implemented that yet.
Failover in NNC is managed by Linux-HA (Heartbeat) 2.99. When the master is down, Linux-HA promotes a slave to master. The promoted node updates its status and updates ZooKeeper; DFSClients and DataNodes then connect to it.
A new method, setAsMaster(), has been added to ClientProtocol, and an administrator or a script can call it through dfsadmin. It promotes the specified NameNode from slave to master.
When a slave is promoted to master, it constructs a new SyncMaster from the contents of the existing SyncSlave. The procedure does not affect the FSNamespace data.
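A toy sketch of that hand-over, with all names and the sequence-number detail being my assumptions rather than NNC code: the new SyncMaster is built only from state the SyncSlave already holds, so nothing in the FSNamespace needs to change.

```java
// Hypothetical sketch of slave-to-master promotion: reuse the slave's
// synchronization state (reduced here to the last replayed sequence
// number) to construct the new master's state; namespace data is untouched.
class Promotion {
    static class SyncSlaveState {
        final long lastAppliedSeq; // last SyncUnit this slave has replayed
        SyncSlaveState(long lastAppliedSeq) { this.lastAppliedSeq = lastAppliedSeq; }
    }

    static class SyncMasterState {
        final long nextSequenceId; // where the new master's update stream resumes
        SyncMasterState(long nextSequenceId) { this.nextSequenceId = nextSequenceId; }
    }

    // Invoked from setAsMaster(): continue the update stream where the slave left off.
    static SyncMasterState promote(SyncSlaveState slave) {
        return new SyncMasterState(slave.lastAppliedSeq + 1);
    }
}
```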
Failover with Linux-HA
Another new method, status(), has also been added to ClientProtocol, and a Linux-HA OCF script tracks the status of the NameNodes with it. If the script cannot get the status of the master NameNode, it promotes a slave NameNode to be the new master.
During the failover procedure, DFSClients and DataNodes cannot connect to the master NameNode, and they keep retrying for a while. Once the new master is in service, all pending operations complete.
Practice and Issues
We have deployed NNC in several experimental systems; the synchronization works well in the lab and will be adopted in more experimental systems. Some issues remain:
- The overhead of NameNode synchronization:
  - For typical file I/O and MapReduce jobs (sort, wordcount), the NNC system reaches 95% of the performance of Hadoop without NNC.
  - For metadata-write-only operations (parallel touchz or mkdir), the NNC system reaches only 15% of the performance of Hadoop without NNC.
- The performance gain of multiple NameNodes for read-only operations:
  - Unfortunately, none has been observed so far.
- Other design issues:
  - Why transfer from the master to the slaves directly, without an additional delivery node? Because such a node would introduce another SPOF and make the problem more complex.
  - Why not use ZooKeeper for failover? Linux-HA works well, and we are still evaluating whether to change to ZK. Any suggestions?
The Backup Node (BN) in trunk brings a "warm" standby (HADOOP-4539): it synchronizes namespace information from the main NameNode, but leaves DataNode information, block operations (such as replication), and lease information unsynchronized. It could be a base for our future work.
Since the NN in the current trunk can log all namespace changes as a journal to the BN, the remaining information that needs to be synchronized includes:
- Changes in the DataNode map: information collected from DataNode heartbeats and reports. Together with the namespace information, the blocks map can also be derived from it.
- Lease information: leases are updated by clients periodically, or when a client locks/unlocks a file for writing.
- The status of block operations, such as the pending-replication block list: this information depends on random numbers, so it may become inconsistent if it is not synchronized.