UberFS Overview and Performance
From CITA Computing
| Table of contents |
Overview
UberFS is an open source cluster filesystem that has been developed at CITA for use on McKenzie. It is targeted at environments that have significant amounts of distributed diskspace, most commonly compute nodes in commodity clusters. While these disks can be used as temporary local storage and cross-mounted by NFS, there is no mechanism to prevent data loss in the event of failure, and file location management presents unnecessary work for the end user. UberFS addresses these issues at the filesystem level, incurring minimal overhead to overall performance.
Uses
We use UberFS to turn 250+ disks (one disk each for most of our 264 compute nodes) into a single large 9.1TB partition that is visible from all compute nodes. UberFS has built-in RAID-1 like redundancy so actual physical diskspace used is twice this number.
UberFS could also be used to aggregate NAS storage units into a single larger (and more redundant) volume. One use of this might be to create scalable storage that could be attached to a compute cluster. 1000's of compute nodes running UberFS daemons could interact directly with 100's of NAS units. Adding more storage is as simple as adding more NAS units.
Assuming that exporting an UberFS volume over NFS works (as yet untested) it should also be possible to add an additional layer to such a configuration so that compute nodes know nothing about UberFS and only see a single large NFS volume. This layer would consist of a number of nodes running UberFS which (as usual) talk NFS to the NAS, but then re-export the entirity of UberFS over NFS to the compute cluster.
Features
UberFS offers the following features:
Flat namespace
- makes many separate disks appear as one large filesystem
- each node can see the entire filesystem
- removes the need for tedious user/application level book-keeping
Built-in fault tolerance
- data is protected in the event of a single drive or node failure
- ability to re-duplicate lost copies
Supports file and filesystem sizes up to the linux 64bit limit
- files >> 4 GB in size, total filesystem size >> TB
- files can be larger than a single physical disk
- storage partitions can use any filesystem that NFS can export, eg. ext3, XFS
Expandability
- filesystem size changes dynamically as more NFS shares are added/removed
Optimized performance
- theoretically limited by NFS
- stat operations are reduced to database calls, minimizing NFS traffic
Smart caching and data distribution
- all reads and writes are buffered internally for performance
- data is distributed across the cluster based on the underlying network topology
Consistancy
- integrity of the parallel filesystem is maintained by using the database as a central reference point amongst multiple reading/writing processes
- UberFS has as much file locking as is provided by the underlying NFS
Significant testing in a scientific computing environment with an aim for POSIX compliance
Underlying Software Requirements
- Python (http://python.org)
- FUSE (http://fuse.sourceforge.net)
- Linux kernel module and library API for userspace filesystems
- NFS (http://nfs.sourceforge.net)
- MySQL (http://www.mysql.com)
- Ganglia (http://ganglia.sourceforge.net)
How it Works
The basic layout of UberFS is illustrated in the image on the right. The storage capacity of UberFS is provided by the distributed diskspace. Each UberFS node is configured to mount a local disk or partition at /srv/uber-store, which is exported via NFS and automounted on other cluster members as necessary. Each node also broadcasts information about it's local disk usage by means of the Ganglia gmond daemon.The UberFS daemon and FUSE kernel module are loaded on boot. The FUSE module redirects all operations on it's special file descriptor(s), /dev/fuse, to the UberFS daemon. Typically /dev/fuse is mounted at /uber-scratch and this is the only point of access from userspace programs to the filesystem.
Metadata about each filesystem inode (eg. ownership, permissions, modes, inode type, modification time, ... as well as physical location on Uber nodes) is stored in a central MySQL database that is served from a head node of the cluster. The UberFS database consists of two tables, one stores inode information (POSIX information) and the other stores data locations. The database uses InnoDB tables to prevent access collisions, which are superior to the default MySQL tables as they allow row locking rather than full table locking. File and path name lookups in the database use 64 bit integer keys generated via md5 to minimize search times. Another beneficial aspect of the database is that stat information about a file can be provided with a database lookup rather than automounting the target filesystem over NFS.
When a file is created, two copies are generated by UberFS, each of which will be written to separate /srv/uber-store backend disks to provide redundancy. Optimally this results in one copy on the local scratch disk, with a back-up copy sent over NFS to a node on the same switch. Files greater than 1GB in size are broken into 1GB chunks, allowing files to be larger than any one physical disk. The actual locations where the chunks will be written are selected using heuristics that take into account the underlying network topology (MeshLib) and a current measure of which disks have available space (Ganglia). A cached list of nodes that lie on the same switch and have diskspace available is used to both minimize data scatter and to avoid unnecessary Ganglia lookups. If the space on a disk is accidentally exceeded UberFS will gracefully transfer the chunk in progress to a different node and continue the write operation. Data is also cached during the reads and writes to optimize performance.
When a file is accessed for reading the copy that will achieve the best i/o performance is selected. If a local copy exists it is used, followed by a copy on the same switch(s) as the reader. For McKenzie the most likely case is that neither copy of the data resides on a close switch, and so data is retrieved over the fat tree i/o network. If a node is down it is bypassed in the selection process.
Performance
The upper image on the right compares UberFS Bonnie++ performance to that of NFS on the McKenzie cluster without additional load. In this ideal case UberFS approaches NFS with minimal performance degradation.To better represent actual UberFS performance the second image on the right illustrates Bonnie++ results that were generated using a partially full UberFS filesystem while the McKenzie cluster was fully loaded. In this comparison, the 75 Nodes dataset was produced by averaging the results of 75 Bonnie++ processes running simultaneously on 75 different nodes, all of which are accessing UberFS. The 40 Nodes dataset was produced in the same fashion by running 80 Bonnie++ processes on 40 different nodes.
The 75 Nodes Uber Sim results were generated by running Bonnie++ using local scratch disks mounted over NFS in an attempt to model optimal UberFS performance. In this simulation 2 independent Bonnie++ processes are launched per node, with one writing to the local scratch disk and the other writing to an automounted scratch disk from a cluster member on the same switch. The simulation supports the results found in the ideal case scenario.
During this test the remaining compute nodes were busy running typical jobs, and 36% of the UberFS back-end disks were full. The Bonnie++ tests used 2G and 4G file sizes. While this indicates that UberFS performance degrades in a heavily loaded environment, it also shows that UberFS performance scales well when there are many processes actively using the filesystem. The main factor in the performance decrease can be attributed to bonnie++ processes that were required to use 2 non-local UberFS backend storage disks.
Limitations
MySQL Server
- central database could be a performance bottleneck for metadata operations (rapid small file transactions)
- db scalability is a problem in many fields and presumably can be solved by a loadbalancing MySQL cluster or some other technique
NFS
- UberFS can do no better file locking than NFS allows
- Limited to NFS speeds
Future Work
- MPI/IO support
- Support for multiple and varying sized disks per UberFS member
