Currently, FSDataset is the second long file in HDFS trunk source, and here is an illustration of it. Which is plot with XMind.

Currently, FSDataset is the second long file in HDFS trunk source, and here is an illustration of it. Which is plot with XMind.

HDFS-1362 introduces a volume refreshment utility for DataNode, thus, you can modify your dfs.datanode.data.dir configuration after insert new disks, and refresh it without reboot the machine or restart the service.
Disk failure happens frequently, and we had to replace it even other disks still in service. Currently, DataNode, actually FSDataset, could remove an error volume, but cannot active newer ones. Most modern servers and SATA disks support hotplug, why don’t we enable it for DataNode?
Firstly we added a simple user interface including add, remove, list of volumes. After discussion with Tom White and Todd Lipcon, we decide to change to a simpler interface—refresh based on the reconfiguration framework introduced in HADOOP-7001 as they suggested. It will benefit the consistency of the running service and the on disk configuration file.
DataNode has 2 steps on volume management. In start time, DataStorage member loads the configured dirs, check the existence, availability, version, dir structure, etc. And do upgrade, roll back, and format if needed. Then, it adds usable dirs into itself, and provides an iterator for StorageDirectory access.
Having loaded the directories, DataNode constructs FSDataset member to manage volumes and blocks stored in them. FSDataset manages an array of volumes in FSVolumeSet and a map of blocks. In loading procedure, every blocks is found and recorded into the map with the block ID as key. The may is not indexed by volume, thus you must iterate all the map if you want to remove a volume to find all the blocks stored in it.
As described above, the block loading time depends on how many blocks in a volume, and it will consume much time if a volume is full of blocks.
In HDFS-1362, we have 4 steps to refresh the volumes:
• Firstly DataNode reloads the configuration, and creates the a list of dirs.
• Iterate the StorageDirectory in DataStorage,
• If sd is not in FSDataset, which means it had been removed before because of failure, remove it from DataStorage; then
• If sd is contained in the new dirs list, remove it from the list as it has already in service;
• On the contrary, if sd is not in the new dir list, it means a in-service dir has been removed from configuration. Currently, it will lead to a warning in log file only.
• Load remain true new dirs to DataStorage.
• Add them to FSDataset.
Load volumes operation based on the DataStorage.revoverTransitionRead, we copy and modified it as revoverTransitionAdditionalRead, the different is:
• Do not initialize the StorageID and StorageDirectory list;
• Do not doTransition against all SDs, but only the inserted ones;
• Do not writeAll, but only the inserted ones;
• Return the new added SDs Collections, which is used for the blocks loading.
We did not unify the revoverTransitionAdditionalRead and revoverTransitionRead because we considered it may related with writeAll and transition operation.
Load blocks means we add volumes to FSDataset, it’s no more than add volume to array in FSVolumeSet and load blocks info into the map. But we think we can optimized on the loading to shorten the loading time.
FSDataset uses FSDatasetAsyncDiskService for some asynchronous execution on volumes, it is based on a HashMap, thus we only need to move the put operation from constrictor to a separate method and give its access permission to FSDataset.
Next step might be support the decommission operation which is not implemented now and leave a warning only.
I have committed the initial code on GitHub
And this post will give some explanation on the code.
Though we will keep improve our distribution, the NNC is not mature enough for product, and it is released as a reference for you.
The NNC code is based on Apache’s Hadoop branch-0.20 repository. I had claimed our dist is based on ydist, but we met an unresolved issue about MapReduce on it days ago. Hence I had to port it from Ydist to Apache dist, which delayed my release work for days.
The code is constructed as follows:
There still some found issues here:
We will keep fix found issues on current branch, but will not contribute new HA related feature on it. And we will pay more attention on trunk,
We are planning on a BackupNode based design, which is closer to mainline code. And we are also interesting in participating the job on HA in JIRA.
We have also add some management feature to DataNode — volume add and decommission online, which help us replace failed hard disk without shutdown a node. This feature will be committed to the github repositry this week.
If you have any question on our hadoop job, please contact with me. and for Chinese Reader, you can visit our official website:
[ Simplified Chinese ] http://labs.chinamobile.com/cloud/
We had written a HDFS performance test tool, which start a group of parallel read/write thread to press on HDFS.
The following figure illustrates the working of it.
The synchronizer is a server written in python, it accepts the request of test program running in test nodes. Having received requests from all nodes, it admits them start pressure simultaneously.
The test program is written in Java, and it starts several threads to write or read with DFSClient. All the pressure thread record the data it has written in a variable and the main thread of the test program collect them periodically, then written into a XML file.
Analyze the xml output file, we can tell the performance of reading and writing.
In our test program, it supports read only, write only and read-write. And it can be set as read files writen by itself or random files.
The only single point of failure (SPOF) in HDFS is on the most important node — NameNode. If it fails, the ongoing operations will fail and user data may be lost.
In 2008, we (team from China Mobile Research Institute) have implemented an initial version of Name-Node Cluster (NNC) on hadoop 0.17. NNC introduced a Synchronization Agent, which synchronizes the updates of FSNamesystem to slave NNs. Then we ported it to hadoop 0.19 and 0.20, and employed Zookeeper as registry of NNs. Related Information was mentioned in our presention in Hadoop Summit’09.
Here I will post some detail on the design and practice.
Each NameNode in NNC has a SyncAgent, either SyncMaster or SyncSlave. SyncAgent maintains the syncronization related status and deal with the synchronization data transfer between NameNodes. NNC work like a journalized filesystem, every change made in Namesystem of master NameNode will be logged and transfer to slaves as a SyncUnit, and it will replay at slave, which keep the slaves’ Namesystem identical with the master.
A slave in NNC may transit among four status:
The following figure shows the transition of the slave status.
When a new NameNode join NNC, it enters a bootup procedure. After having handshaked with master, the master will inject the whole Namespace to slave node.
Then the slave will be marked as “Unsynced”, and will receives the updates that happens since the above step 4. After having synchronized with the master, it will transit to “Synced” or “Up-to-date” status.
If an meta data changing operation happens, SyncMaster will transfer it to SyncSlave with the eastablished synchronization connection. If there are many slaves, each one will use a separated thread.
The following is the update events that will be hooked and synchronized:
In the 0.20 version of NNC, Zookeeper was introduce as a registry of NameNodes. From zookeeper, a slave can find the master’s location, and DFSClients and DataNodes can also use it to find available NameNode. We did plan to adopt zookeeper in the failover procedure, but did not implement it yet.
The failover of NNC is managed by Linux-HA (heartbeat) 2.99. Whe master is down, linux-ha promote a slave to master. It update it status and update Zookeeper. Then DFSClients and DataNodes will connect to it.
A new method setAsMaster() is added to ClientProtocol, and administrator or scripts can use dfsadmin to call this method. It will promote the specified NameNode from slave to master.
When a slave is promoted to master, it constructs a new SyncMaster with the existed SyncSlave contents. The procedure does not affect the data of FSNamespace.
Another new method status() is also added to ClientProtocol, and Linux-HA OCF script keeps track the status of NameNodes with this method. If it cannot get the status of the master NameNode, it will promote a slave NameNode as new Msater.
During the failover procedure, DFSClients and DataNodes cannot connect to master NameNode, and they will keep retry for a while. Once the new master returns to work, all the operation will be done.
We deployed some experimental systems of NNC, the synchronization works well in lab and will be adopt in some expermental systems. And there are some issues yet:
Backup Node (BN) in trunk brings “warm” standby (HADOOP-4539), it synchronizes namespace information from Main NameNode, and leaves DataNode information, Block Operation (such as replication), and lease information unsynchronized. It could be a base of our furture work.
As NN in current trunk can log all the namespace changes as journals to BN, the rest information needed to be synchronized includes:
- Changes in DataNode Map: information collected from DataNode heartbeats and reports. Together with namespace info, the blocks map can also be calculated.
- Lease information: leases are updated by clients periodically or when it lock/unlock file for write.
- Block operation’s status, such as pending replication block list: these information are depends on random number, thus it may lead to inconsistency if do not synchronize them.
Welcome to WordPress. This is your first post. Edit or delete it, then start blogging!