DataNode Volume Refreshment in HDFS-1362

HDFS-1362 introduces a volume refreshment utility for DataNode, so you can modify your dfs.datanode.data.dir configuration after inserting new disks and refresh it without rebooting the machine or restarting the service.

(Here is the Chinese Version)

Motivation

Disk failures happen frequently, and we have to replace a failed disk even while the other disks are still in service. Currently DataNode (actually FSDataset) can remove a failed volume, but it cannot activate new ones. Most modern servers and SATA disks support hot plugging, so why not enable it for DataNode?

At first we added a simple user interface with add, remove, and list operations for volumes. After discussing with Tom White and Todd Lipcon, we decided to switch to a simpler interface, a refresh operation based on the reconfiguration framework introduced in HADOOP-7001, as they suggested. This keeps the running service consistent with the on-disk configuration file.

Volume Management of DataNode

DataNode manages volumes in two steps. At start time, the DataStorage member loads the configured dirs, checks their existence, availability, version, directory structure, and so on, and performs upgrade, rollback, or format if needed. Then it adds the usable dirs into itself and provides an iterator for StorageDirectory access.

Having loaded the directories, DataNode constructs the FSDataset member to manage the volumes and the blocks stored in them. FSDataset manages an array of volumes in FSVolumeSet and a map of blocks. During loading, every block is found and recorded in the map with its block ID as the key. The map is not indexed by volume, so to remove a volume you must iterate over the whole map to find all the blocks stored on it.

As described above, block loading time depends on how many blocks a volume holds, and it can take a long time if a volume is full of blocks.
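To make the volume-to-blocks lookup concrete, here is a minimal sketch of why removing a volume requires a full scan of the block map. The types below are simplified, illustrative stand-ins, not the actual FSDataset code.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-ins for FSDataset internals; names are illustrative only.
class BlockInfo {
    final long blockId;
    final File volumeDir;   // the volume this block's file lives on
    BlockInfo(long blockId, File volumeDir) { this.blockId = blockId; this.volumeDir = volumeDir; }
}

class BlockMapSketch {
    // Keyed by block ID only, so there is no per-volume index.
    private final Map<Long, BlockInfo> blockMap = new HashMap<Long, BlockInfo>();

    // Removing a volume means scanning the whole map to find its blocks.
    List<Long> blocksOnVolume(File volumeDir) {
        List<Long> ids = new ArrayList<Long>();
        for (BlockInfo info : blockMap.values()) {
            if (info.volumeDir.equals(volumeDir)) {
                ids.add(info.blockId);
            }
        }
        return ids;
    }
}
```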

How to Refresh

In HDFS-1362, we refresh the volumes in four steps (a simplified sketch of the comparison logic follows this list):

• First, DataNode reloads the configuration and builds a list of configured dirs.

• It then iterates over the StorageDirectory entries (sd) in DataStorage:

  • If sd is not in FSDataset, it was removed earlier because of a failure, so remove it from DataStorage as well; then

  • If sd is contained in the new dir list, remove it from that list, since it is already in service;

  • Otherwise, if sd is not in the new dir list, an in-service dir has been removed from the configuration. Currently this only produces a warning in the log file.

• Load the remaining, truly new dirs into DataStorage.

• Add them to FSDataset.
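The comparison logic can be sketched roughly as follows. This is a simplified illustration using plain java.io.File collections, not the actual HDFS-1362 patch; the method and parameter names are made up.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

class RefreshSketch {
    /**
     * Decide which directories to add, assuming:
     *  - newDirs     : dirs parsed from the reloaded dfs.datanode.data.dir
     *  - storageDirs : dirs currently tracked by DataStorage
     *  - activeDirs  : dirs currently served by FSDataset
     * Returns the dirs that still need to be loaded into DataStorage and FSDataset.
     */
    static List<File> dirsToAdd(List<File> newDirs, List<File> storageDirs, Set<File> activeDirs) {
        List<File> remaining = new ArrayList<File>(newDirs);
        for (Iterator<File> it = storageDirs.iterator(); it.hasNext();) {
            File sd = it.next();
            if (!activeDirs.contains(sd)) {
                // Previously failed and dropped from FSDataset: forget it in DataStorage too.
                it.remove();
            } else if (remaining.contains(sd)) {
                // Already in service: nothing to do for this dir.
                remaining.remove(sd);
            } else {
                // In service but no longer configured: only warn for now.
                System.err.println("WARN: in-service dir removed from configuration: " + sd);
            }
        }
        return remaining; // these are the truly new dirs to load and add
    }
}
```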

Load Volumes

The load-volumes operation is based on DataStorage.recoverTransitionRead; we copied and modified it as recoverTransitionAdditionalRead. The differences are:

• Do not initialize the StorageID and StorageDirectory list;

• Do not doTransition against all SDs, but only the inserted ones;

• Do not writeAll; write only the inserted ones;

• Return the collection of newly added SDs, which is used for block loading.

We did not unify recoverTransitionAdditionalRead with recoverTransitionRead, because we thought merging them might interfere with the writeAll and transition handling.
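In outline, the additional-read path only touches the inserted directories. The sketch below uses simplified placeholder types (not the real DataStorage/StorageDirectory classes) to show the shape of such a method; the class and method names here are assumptions for illustration.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collection;

// Placeholder for a storage directory entry; the real class in Hadoop is
// org.apache.hadoop.hdfs.server.common.Storage.StorageDirectory.
class SDirSketch {
    final File root;
    SDirSketch(File root) { this.root = root; }
    void analyzeAndTransition() throws IOException { /* check state, upgrade/rollback/format */ }
    void writeVersionFile() throws IOException { /* persist the VERSION file */ }
}

class DataStorageSketch {
    private final Collection<SDirSketch> storageDirs = new ArrayList<SDirSketch>();

    // Unlike the full recoverTransitionRead, this neither re-initializes the
    // StorageID nor rebuilds the whole directory list; it only processes the
    // newly inserted dirs and returns them for block loading.
    Collection<SDirSketch> recoverTransitionAdditionalRead(Collection<File> newDirs) throws IOException {
        Collection<SDirSketch> added = new ArrayList<SDirSketch>();
        for (File dir : newDirs) {
            SDirSketch sd = new SDirSketch(dir);
            sd.analyzeAndTransition();   // doTransition only against the inserted dirs
            sd.writeVersionFile();       // write only the inserted dirs, not writeAll
            storageDirs.add(sd);
            added.add(sd);
        }
        return added;
    }
}
```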

Load Blocks on Disk

Loading blocks means adding volumes to FSDataset; it is nothing more than adding each volume to the array in FSVolumeSet and loading its block info into the map. But we think the loading can be optimized to shorten the loading time.

Asynchronous Services

FSDataset uses FSDatasetAsyncDiskService for asynchronous operations on volumes. It is based on a HashMap, so we only need to move the put operation from the constructor into a separate method and make that method accessible to FSDataset.
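A rough sketch of that change follows. It is a simplified stand-in for FSDatasetAsyncDiskService (which keeps one thread pool per volume in a HashMap); the pool sizes and class name are illustrative assumptions.

```java
import java.io.File;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class AsyncDiskServiceSketch {
    private final Map<File, ThreadPoolExecutor> executors = new HashMap<File, ThreadPoolExecutor>();

    AsyncDiskServiceSketch(File[] volumes) {
        // Previously the per-volume executors were only created here, in the constructor.
        for (File vol : volumes) {
            addVolume(vol);
        }
    }

    // Moving the "put" into a method lets FSDataset register a newly inserted volume later.
    synchronized void addVolume(File volume) {
        if (executors.containsKey(volume)) {
            return;
        }
        ThreadPoolExecutor executor = new ThreadPoolExecutor(
            1, 4, 60L, TimeUnit.SECONDS, new LinkedBlockingQueue<Runnable>());
        executors.put(volume, executor);
    }
}
```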

Next step

The next step might be to support the decommission operation, which is not implemented yet and currently only produces a warning.

NameNode Cluster Code @ GitHub

I have committed the initial code to GitHub:

http://github.com/gnawux/hadoop-cmri

This post gives some explanation of the code.

Disclaimer

Though we will keep improving our distribution, NNC is not yet mature enough for production; it is released as a reference.

The architecture of the code

The NNC code is based on Apache's Hadoop branch-0.20 repository. I had said our distribution was based on ydist, but a few days ago we hit an unresolved MapReduce issue on it. So I had to port the code from ydist to the Apache distribution, which delayed the release work for several days.

The code is constructed as follows:

  • Synchronization Manager: the NN* classes in org.apache.hadoop.hdfs.server.namenode. They manage the synchronization between NameNodes.
  • Hacks on FSNamesystem: FSNamesystem.java, FSImage.java, FSDirectory.java, etc. are modified for two reasons:
    • Hook the modifications made to FSNamesystem.
    • Give the Sync Manager privileges to access some internal elements.
  • Zookeeper access utilities: org.apache.hadoop.hdfs.zookeeper. They help NameNode, DataNode and DFSClient access the Zookeeper registry.
  • DataNode and DFSClient side: access the NameNode via the Zookeeper registry.
  • Management utilities: bin/nnc and DFSAdmin.java, for monitoring, management and failover from other systems.

What Are the Unsolved Issues?

There are still some known issues:

  • No unit tests available: we will add some tests in the future.
  • SetTimes and SetQuota are not synchronized yet; this will be fixed in the coming days.

The Next Step of NNC

Work on the current branch

We will keep fixing known issues on the current branch, but will not contribute new HA-related features to it. Instead, we will pay more attention to trunk.

Work on trunk (0.22)

We are planning a BackupNode-based design, which is closer to the mainline code. We are also interested in participating in the HA work in JIRA.

Stuff beyond NNC

We have also added some management features to DataNode: online volume add and decommission, which help us replace failed hard disks without shutting down a node. This feature will be committed to the GitHub repository this week.

Have Any Questions?

If you have any questions about our Hadoop work, please contact me. Chinese readers can also visit our official website:

[ Simplified Chinese ] http://labs.chinamobile.com/cloud/

A Simple HDFS Performance Test Tool

We have written an HDFS performance test tool, which starts a group of parallel read/write threads to put pressure on HDFS.

The following figure illustrates how it works.

[Figure: htest, architecture of the performance test tool]

The synchronizer is a server written in Python; it accepts requests from the test programs running on the test nodes. Having received requests from all nodes, it lets them start applying pressure simultaneously.

The test program is written in Java; it starts several threads that write or read through DFSClient. Each pressure thread records the amount of data it has written in a variable, and the main thread of the test program collects these counters periodically and writes them into an XML file.

By analyzing the XML output file, we can measure read and write performance.

The test program supports read-only, write-only and read-write modes, and it can be configured to read back the files it wrote itself or to read random files.
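The write-pressure side can be sketched like this. It is a minimal illustration using the standard Hadoop FileSystem API, not the actual tool; the paths, file sizes, thread count and the bytesWritten bookkeeping are made-up assumptions (the real tool writes the samples to XML rather than stdout).

```java
import java.util.concurrent.atomic.AtomicLong;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WritePressureSketch {
    public static void main(String[] args) throws Exception {
        final int numThreads = 4;
        final AtomicLong[] bytesWritten = new AtomicLong[numThreads];
        final FileSystem fs = FileSystem.get(new Configuration());

        for (int i = 0; i < numThreads; i++) {
            bytesWritten[i] = new AtomicLong(0);
            final int id = i;
            new Thread(new Runnable() {
                public void run() {
                    try {
                        byte[] buf = new byte[64 * 1024];
                        FSDataOutputStream out = fs.create(new Path("/htest/file-" + id));
                        for (int n = 0; n < 1024; n++) {            // ~64 MB per thread
                            out.write(buf);
                            bytesWritten[id].addAndGet(buf.length); // each thread records its own progress
                        }
                        out.close();
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            }).start();
        }

        // The main thread samples the counters periodically; the real tool writes them to XML.
        for (int t = 1; t <= 10; t++) {
            Thread.sleep(1000);
            long total = 0;
            for (AtomicLong c : bytesWritten) { total += c.get(); }
            System.out.println("t=" + t + "s total bytes written=" + total);
        }
    }
}
```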

Practice of NameNode Cluster for HDFS HA

Abstract and Motivation

The only single point of failure (SPOF) in HDFS is its most important node, the NameNode. If it fails, ongoing operations fail and user data may be lost.

In 2008, we (a team from China Mobile Research Institute) implemented an initial version of Name-Node Cluster (NNC) on Hadoop 0.17. NNC introduced a Synchronization Agent, which synchronizes updates of the FSNamesystem to slave NNs. We then ported it to Hadoop 0.19 and 0.20, and employed Zookeeper as the registry of NNs. Related information was presented in our talk at Hadoop Summit '09.

Here I will post some details on the design and practice.

Terms

  • Name-Node Cluster (NNC): a cluster of NNs, in which the master NN synchronizes metadata to the slave NNs, and a slave NN will take over the master's work if it fails.
  • Master NN and Slave NN: in an NN cluster, the Master NN synchronizes information to the slaves.
  • Synchronization Agent (SyncAgent): a daemon in each NN, which manages the synchronization of name system information between master and slave NNs.
    • SyncMaster: manages the slaves and transfers updates to the slave nodes.
    • SyncSlave: receives updates from the master and applies them to the local namesystem.
  • SyncUnit (or update): a primary metadata operation, the unit of transfer in the synchronization procedure. It encapsulates an operation, such as mkdirs or a block report, and its data.

Synchronization Agent (SyncAgent)

Each NameNode in NNC has a SyncAgent, either a SyncMaster or a SyncSlave. The SyncAgent maintains the synchronization-related status and handles the synchronization data transfer between NameNodes. NNC works like a journaled filesystem: every change made to the Namesystem of the master NameNode is logged and transferred to the slaves as a SyncUnit, and is replayed at each slave, which keeps the slaves' Namesystems identical to the master's.

A slave in NNC may move among four states:

  • Empty: a newly joined NameNode is in the Empty state and will be injected by the master.
  • Unsynced: a NameNode with Namesystem data, but that has not yet synchronized with the master.
  • Synced: a NameNode that has synchronized with the master and needs only the most recent update to keep up.
  • Up-to-date: a NameNode with the latest update.

The following figure shows the transitions of the slave states.

[Figure: nn-status-en, slave state transitions]
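As a stand-in for the figure, the four states can be written down as a small enum; this is illustrative only, the names simply follow the list above.

```java
// Illustrative only; the state names follow the list above.
enum SlaveStatus {
    EMPTY,       // newly joined; waiting for the master to inject the whole namespace
    UNSYNCED,    // has namespace data, but has not yet caught up with the master
                 // (a slave whose transfer thread fails or times out also falls back here)
    SYNCED,      // caught up except for the most recent update
    UP_TO_DATE   // has applied the latest update
}
```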

Initial Injection of a Slave

When a new NameNode joins NNC, it goes through a bootup procedure. After handshaking with the master, the master injects the whole Namespace into the slave node.

  1. The slave registers with the master, providing a socket for receiving updates.
  2. The master locks the FSNamespace, blocking all metadata write operations.
  3. The master dumps the FSDirectory and DataNodeMap and encapsulates them.
  4. The master unlocks the FSNamespace.
  5. The master establishes the connection to the slave and sends all the data to it.
  6. The slave initializes its FSNamespace.

Then the slave is marked as "Unsynced", and it will receive the updates that have happened since step 4 above. After it has synchronized with the master, it transitions to the "Synced" or "Up-to-date" state.
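On the master side, the injection can be outlined as below. The lock handling, the dump format and all the class names (InjectionSketch, NamespaceDump) are simplified placeholders, not the real implementation.

```java
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.net.Socket;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class InjectionSketch {
    private final ReentrantReadWriteLock namespaceLock = new ReentrantReadWriteLock();

    // Stand-in for the dumped FSDirectory and DataNodeMap contents.
    static class NamespaceDump implements Serializable {
        byte[] fsDirectory;
        byte[] dataNodeMap;
    }

    void injectSlave(Socket slaveSocket) throws IOException {
        NamespaceDump dump;
        namespaceLock.writeLock().lock();          // block all metadata writes
        try {
            dump = dumpNamespace();                // dump FSDirectory and DataNodeMap
        } finally {
            namespaceLock.writeLock().unlock();    // unlock before the (slow) transfer
        }
        // Updates that happen from this point on are delivered later as normal SyncUnits.
        ObjectOutputStream out = new ObjectOutputStream(slaveSocket.getOutputStream());
        out.writeObject(dump);                     // send the whole namespace to the slave
        out.flush();
    }

    private NamespaceDump dumpNamespace() {
        return new NamespaceDump();                // placeholder for the real dump logic
    }
}
```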

Update synchronization

When a metadata-changing operation happens, SyncMaster transfers it to the SyncSlaves over the established synchronization connections. If there are many slaves, each one is served by a separate thread; a simplified sketch follows the numbered list below.

  1. Every metadata-changing operation is hooked by SyncMaster, which encapsulates the operation and its data and then starts the update routine.
  2. Before an update is transferred, SyncMaster ensures that no older updates are still being transferred, i.e. all updates are transferred in sequence. (This is a trade-off between performance and consistency; with a larger window the performance might increase.)
  3. A SyncMonitor is started, which manages all the threads transferring SyncUnits to the slaves.
    1. If a thread fails or times out, the corresponding slave is moved to the "Unsynced" state and is required to register and sync again.
  4. After all the threads have finished, SyncMonitor returns and SyncMaster releases the lock that may be blocking the next update transfer.
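Here is a simplified sketch of that sequencing, using one thread per slave and a lock to keep updates in order. The class names (SyncMasterSketch, SlaveLink), the timeout value and the error handling are assumptions for illustration, not the real SyncMaster/SyncMonitor code.

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

class SyncMasterSketch {
    // Stand-ins for the real SyncUnit and slave connection classes.
    static class SyncUnit { int opCode; byte[] data; }
    interface SlaveLink {
        boolean send(SyncUnit unit);   // returns false on failure/timeout
        void markUnsynced();           // slave must register and sync again
    }

    private final ReentrantLock sequenceLock = new ReentrantLock();
    private final List<SlaveLink> slaves;

    SyncMasterSketch(List<SlaveLink> slaves) { this.slaves = slaves; }

    // Called from the hook on every metadata-changing operation.
    void sendUpdate(final SyncUnit unit) throws InterruptedException {
        sequenceLock.lock();                              // only one update in flight at a time
        try {
            final CountDownLatch done = new CountDownLatch(slaves.size());
            for (final SlaveLink slave : slaves) {        // one transfer thread per slave
                new Thread(new Runnable() {
                    public void run() {
                        try {
                            if (!slave.send(unit)) {
                                slave.markUnsynced();     // failed or timed out
                            }
                        } finally {
                            done.countDown();
                        }
                    }
                }).start();
            }
            done.await(30, TimeUnit.SECONDS);             // SyncMonitor-like wait for all threads
        } finally {
            sequenceLock.unlock();                        // allow the next update to be transferred
        }
    }
}
```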

The following are the update events that are hooked and synchronized:

  • NNU_NOP            // nothing to do
  • NNU_BLK            // add or remove a block
  • NNU_INODE          // add or remove or modify an inode (add or remove file; new block allocation)
  • NNU_NEWFILE        // start new file
  • NNU_CLSFILE        // close new file
  • NNU_MVRM           // move or remove file
  • NNU_MKDIR          // mkdir
  • NNU_LEASE          // add/update or release a lease
  • NNU_LEASE_BATCH    // update batch of leases
  • NNU_DNODEHB_BATCH  // batch of datanode heartbeats
  • NNU_DNODEREG       // dnode register
  • NNU_DNODEBLK       // block report
  • NNU_DNODERM        // remove dnode
  • NNU_BLKRECV        // block received message from datanode
  • NNU_REPLICAMON     // replication monitor work
  • NNU_WORLD          // bootstrap a slave node
  • NNU_MASSIVE        // bootstrap a slave node

Zookeeper in NNC

In the 0.20 version of NNC, Zookeeper was introduced as a registry of NameNodes. From Zookeeper, a slave can find the master's location, and DFSClients and DataNodes can also use it to find an available NameNode. We planned to use Zookeeper in the failover procedure as well, but have not implemented that yet.
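For example, a client-side lookup might look roughly like this, using the standard ZooKeeper client API. The znode path /nnc/master, the connection string and the timeout are made up for illustration; the real registry layout in our code may differ.

```java
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class MasterLookupSketch {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 30000, new Watcher() {
            public void process(WatchedEvent event) { /* ignore session events in this sketch */ }
        });
        // Assume the active master registers its RPC address under a well-known znode.
        byte[] data = zk.getData("/nnc/master", false, null);
        System.out.println("current master NameNode: " + new String(data, "UTF-8"));
        zk.close();
    }
}
```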

Failover

The failover of NNC is managed by Linux-HA (heartbeat) 2.99. When the master is down, Linux-HA promotes a slave to master. The promoted node updates its own status and updates Zookeeper, and then the DFSClients and DataNodes connect to it.

Promote

A new method, setAsMaster(), is added to ClientProtocol, and an administrator or a script can call it through dfsadmin. It promotes the specified NameNode from slave to master.

When a slave is promoted to master, it constructs a new SyncMaster from the contents of the existing SyncSlave. The procedure does not affect the data of the FSNamespace.

Failover with Linux-HA

Another new method, status(), is also added to ClientProtocol, and the Linux-HA OCF script keeps track of the status of the NameNodes through it. If it cannot get the status of the master NameNode, it promotes a slave NameNode to be the new master.

During the failover procedure, DFSClients and DataNodes cannot connect to the master NameNode, and they keep retrying for a while. Once the new master comes into service, all the pending operations complete.
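The two additions to ClientProtocol can be sketched as an interface like this. Only the method names setAsMaster() and status() come from our code; the return types, signatures and the interface name are assumptions for illustration.

```java
import java.io.IOException;

// Illustrative sketch; in our branch these methods were added to
// org.apache.hadoop.hdfs.protocol.ClientProtocol itself.
interface NNCAdminSketch {
    // Promote this NameNode from slave to master; called via dfsadmin or the Linux-HA OCF script.
    void setAsMaster() throws IOException;

    // Report whether this NameNode is currently master or slave and its sync state;
    // Linux-HA polls this to detect a dead master and trigger failover.
    String status() throws IOException;
}
```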

Practice and Issues

We have deployed several experimental NNC systems; the synchronization works well in the lab and will be adopted in some experimental systems. There are still some open issues:

  • The overhead of NameNode synchronization:
    • For typical file I/O and MapReduce jobs (sort, wordcount),
      • the NNC system reaches 95% of the performance of Hadoop without NNC.
    • For metadata-write-only operations (parallel touchz or mkdir),
      • the NNC system reaches 15% of the performance of Hadoop without NNC.
  • Performance gain from multiple NameNodes for read-only operations:
    • none observed so far, unfortunately.
  • Other design questions:
    • Why synchronize from the master to the slaves directly, without an additional delivery node?
      • Such a node would introduce another SPOF and make the problem more complex.
    • Why not use Zookeeper for failover?
      • Linux-HA works well, and we are still evaluating whether to change to ZK. Any suggestions?

Future Work

The Backup Node (BN) in trunk provides a "warm" standby (HADOOP-4539): it synchronizes namespace information from the main NameNode, but leaves DataNode information, block operations (such as replication), and lease information unsynchronized. It could be the base of our future work.

Since the NN in current trunk can already stream all namespace changes to the BN as journal entries, the remaining information that needs to be synchronized includes:

- Changes in the DataNode map: information collected from DataNode heartbeats and reports. Together with the namespace info, the blocks map can then also be reconstructed.

- Lease information: leases are updated by clients periodically, or when a client locks or unlocks a file for writing.

- The status of block operations, such as the pending replication block list: this information depends on random numbers, so it may become inconsistent if it is not synchronized.