Linux has long been fertile ground for the creation of various sorts of file systems. The reasons for this have been manifold:
People have moved to Linux from many places, and have wanted to make use of "legacy" file systems from wherever it was that they came, including such systems as Minix, MS-DOS, OS/2, Apple Macintosh, Atari ST, Amiga, and various "official" UNIXes;
Since the kernel is "hackable," some people just like to play with file systems, tuning them for better performance (e.g., reiserfs). A notable application area where improved performance is particularly valuable is news processing: news involves "lots and lots of tiny files," which tend to be challenging to handle.
People want to build "more secure" networked filesystems, typically by creating encrypted variants of NFS;
People want to build "highly reliable" file systems, generally using journalling;
People want to build file systems that provide special sorts of functionality, such as maintaining versioning or creating documents "on the fly." Virtual File System hooks make it possible to build "file systems" that run programs to handle data requests...
Regrettably, there is some conflict in this. The enormous appetite for "more functionality" prevents kernel folk from stopping to comprehensively fix problems. Somewhere between the FibreChannel device drivers, the mapping of those onto SCSI access, the VFS layer, and the various filesystems that connect to it, there are some evident holes.
My employer sponsored some reliability testing in the interests of seeing whether Linux on Opterons, connecting to FibreChannel disk arrays, would make a viable platform for large, highly-available PostgreSQL databases. All the filesystems corrupted painfully easily, even though the hardware ought to support better.
It's not going to be easy to resolve this; supporting HA hardware requires thoroughly verifying all the details and supporting the SCSI and FibreChannel protocols fully, and that would require a deceleration of Linux kernel development efforts, which is inconsistent with the way that Git is allowing larger and larger sets of contributors to cascade flurries of patches, like snow storms, onto the Linux development team.
Analysis of the Ext2fs structure
One of Linux's "claims to fame," at least compared to certain horribly-unstable operating systems, is that it has a pretty decent base file system known as ext2fs. There are rumblings back and forth that ext2fs is better than Berkeley FFS and vice-versa; what is certainly the case is that both provide decent performance for most purposes, and both are quite robust. I've got a section on the issue of defragmentation of ext2, as this is a topic that people ask about frequently. The brief synopsis is that a defragmentation utility is available, but it's not too likely that people actually need to use it.
Proposed upcoming extensions to ext2 include handling of:
Very large file systems (and large file sizes)
Better support for automatic file compression
Hashed directory lookups
Backups of inode table directories
Theodore Ts'o is a member of the "Linux Kernel Core Group" that has often been responsible for ext2 "stuff."
A TOPS-10-like approach to ACLs?
Tim Smith, tzs@halcyon.com, 1999/01/22
The problem with ACLs is that actually managing them is a complex task. Using ACLs implies adding more staff to do "security management."
Few organizations have seen much point in bothering with this. If there are such fine-grained security requirements that you really need this stuff, you probably need to go B1, in which case neither Linux nor NT is realistically the answer to the question anyway.
Actually, managing ACLs is only complex because most implementations take the approach of making an ACL some sort of extended attribute of the file or directory it applies to. This leads to complexity because (1) every utility that can copy or move files or directories has to be modified to know about ACLs, or the system has to have complex kludges to guess what should be done (they have to be complex because they have to be right, because botching the ACLs can lead to a security breach), and (2) if you've got a thousand files somewhere, you've got a thousand ACLs to worry about.
There is another approach, used on TOPS-10. I don't know if it was original, or if TOPS-10 borrowed it from somewhere else. In this approach, the ACLs are not associated with particular files. Rather, there is a separate file that contains the access control information for an entire group of files. Here's a description of how it might work on Unix, based on the TOPS-10 version, with appropriate changes for a Unix world.
On an attempt to access a file, the normal permissions would be checked. If they do not forbid it, the access is allowed.
If the permissions prohibit the access, the file .access_list in the home directory of the owner of the file is checked.
The file .access_list contains a series of entries, one per line, of the form

name:user:group:program:perms

Some form of wildcarding would be allowed (this is handy because many sites have patterns in the way they assign usernames). The "perms" field lists the access granted, e.g., rx for read and execute.
The way .access_list is used is that it is scanned looking for an entry whose "name" field matches the file name, whose "user" field matches the user trying to access the file, whose "group" field matches the group of the user, and whose "program" field matches the executable that is trying to access the file. When such a line is found, the "perms" field tells what type of access is to be granted.
Here's an example:
# anything in a directory named "private" is off limits
*/private/*:*:*:*:
# people in group "foo" get full (create, delete, read, write,
# execute) access to everything in the foo project directory
~/projects/foo/*:*:foo:*:cdrwx
# people playing mygame can update the high score file
~/mygame/score.dat:*:*: ~/mygame/bin/mygame:rw
# some friends have access to the RCS files for mygame
~/mygame/src/RCS/*:dennis,kevin,josh:*: /usr/bin/ci:rw
~/mygame/src/RCS/*:dennis,kevin,josh:*: /usr/bin/co:rw
# I'll put stuff I want everyone to read in my ~/public directory
# I'll make the public directory 744, so no one will actually have
# to check .access_list, but I'll still put in this entry for completeness
~/public/*:*:*:*:r
# anything left over gets no access
*:*:*:*:
I realize that there are problems with hard links with this scheme.
Note that this scheme is not nearly as inefficient as it looks, because most accesses would be to things where the normal permissions would be set to allow access, so .access_list would not be checked. You would only need .access_list for those special cases that don't fit in well with the user/group/other +users_in_multiple_groups model. Furthermore, one could greatly increase the efficiency by going to a binary format that can be compiled from .access_list.
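To make the scanning pass concrete, here's a little sketch in C of what the matcher might look like. This is purely my own illustration (check_access_list is a hypothetical helper, not anything lifted from TOPS-10 or FILDAE); it assumes the name:user:group:program:perms layout from the example above, uses fnmatch() for the wildcarding, and, for brevity, punts on ~-expansion and comma-separated user lists.

/* Hypothetical sketch of the .access_list matching pass described
   above.  Entries take the form name:user:group:program:perms; the
   first entry whose patterns all match determines the access. */
#include <stdio.h>
#include <string.h>
#include <fnmatch.h>

/* Returns the perms of the first matching entry, or NULL if the
   list could not be read or no entry matches. */
const char *check_access_list(const char *listpath, const char *file,
                              const char *user, const char *group,
                              const char *program)
{
    static char perms[64];
    char line[512];
    FILE *fp = fopen(listpath, "r");
    if (!fp)
        return NULL;
    while (fgets(line, sizeof line, fp)) {
        char name[256], u[64], g[64], prog[256];
        if (line[0] == '#' || line[0] == '\n')
            continue;                 /* skip comments and blank lines */
        perms[0] = '\0';              /* an entry may grant nothing */
        if (sscanf(line, "%255[^:]:%63[^:]:%63[^:]:%255[^:]:%63s",
                   name, u, g, prog, perms) < 4)
            continue;                 /* malformed entry */
        if (fnmatch(name, file, 0) == 0 && fnmatch(u, user, 0) == 0 &&
            fnmatch(g, group, 0) == 0 && fnmatch(prog, program, 0) == 0) {
            fclose(fp);
            return perms;             /* first match wins */
        }
    }
    fclose(fp);
    return NULL;
}

A production version would, as suggested above, compile the list into a binary form rather than rescanning text on each failed access.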
I think this kind of scheme actually fits in better than NT style access lists with the way people really think about security. No one wants to think about security on a thousand files on a file by file basis. You want to group them, and think about security by groups. e.g., "things that are part of the frobozz project need to be kept away from the marketing people", "only people at the VP level should see accounting files", etc.
Note that this scheme does not require any changes to existing utilities. Note also that no changes to existing filesystems are required.
I don't recall if TOPS-10 had one access file per user, or one per directory, or both.
PS: in case anyone is curious, the main problem with this scheme on TOPS-10 was that it was added as a kludge. There was a program, FILDAE (File Daemon -- TOPS-10 used 6.3 names), that handled the access lists. FILDAE was outside the kernel. When an access failed, the kernel sent a message to FILDAE asking if the access list allowed it. FILDAE made its decision and reported back to the kernel. People found attacks based on getting the kernel and FILDAE to become confused over which response was to which request.
If you have Partition Magic 3.0, ext2 partitions can be resized via Theodore Ts'o's resizefs. For the time being, only people who own Partition Magic 3.0 can get resizefs. Theodore has indicated that eventually resizefs will come out in GPL form.
e2fsprogs.sourceforge.net - ext2 resizer finally released
(hosted at SourceForge)
FSDEXT2: Second extended file system (ext2fs) for Windows 95
Reiserfs - File system based on balanced trees
This has been a most interesting project; the author has been benchmarking a file system that stores both inodes and data blocks as a balanced tree. It provides increased performance over ext2fs for small files and for certain types of directory accesses. I think he does a very good presentation of the benchmarks that compare performance. Note that as of version 2.4.2, reiserfs became part of the "official" Linux kernel, and tools such as RIP are available to help manage installations using it.
The work is also very useful as an example of benchmarking. In order to have measurably different results between ext2fs and reiserfs, it proved necessary to construct somewhat artificial benchmarks. (Which tells us that ext2fs performance isn't too bad at all, although we already knew that...) It is necessary to interpret the results carefully. See also Trees Are Fast - Hans Reiser on reiserfs.
The overall approach is quite similar to the log-structured filesystems; this one has the advantage that it actually exists now and is not merely in planning stages.
See also Filesystem Benchmarking using PostMark
PostMark is a filesystem benchmarking tool produced by Network Appliance. While it is possible that it may favor their products, there may still be useful insights available from comparing the performance on this benchmark of other systems.
Regrettably, Reiserfs has had something of a "soap opera" attached to it; Hans Reiser evidently murdered his ex-wife, a sad and sordid tale where nobody turns out to really have been "in the right." This appears to have turned the filesystem, now oddly-orphaned, into a curiosity.
btrfs - copy-on-write Filesystem for Linux
This is a copy-on-write filesystem created by Oracle.
Tru64 AdvFS for Linux Compatibility
AdvFS is a filesystem developed originally at Digital (now part of HP) which had, 10 years back, many of the sorts of features that Sun is hawking with their "ZFS". Tru64 is getting to be something of a curiosity, rather than an interesting product, and HP has made a "code drop" with a view to possibly making it usable on Linux.
Sun Microsystems developed ZFS for use in Solaris, but released the code as open source, so ports have been done to Linux, FreeBSD, and MacOS. It offers many features for flexible management of filesystems, including snapshotting, on-the-fly data compression, built-in awareness of RAID, volume management, transactional semantics, and checksums enabling self-healing of some filesystem failures.
Ext3fs is intended as a successor to Ext2fs, adding in journalling capabilities to allow faster recovery after unexpected reboots.
The Tux2 Failsafe Filesystem for Linux
Rather than using journalling to maintain consistency, the Tux2 filesystem uses a "phase tree" scheme where a tree-structure filesystem is updated in carefully delineated phases. The phase tree approach allows failsafe operation to be achieved with only a slight performance penalty.
See also the Tux2 development at SourceForge.
BULMA: Journal File Systems in Linux
This is a review comparing most of the journalling filesystems available on Linux, with some performance statistics. Since many such filesystems are undergoing active development efforts, statistics that are accurate today may not be accurate six months from now, so your mileage certainly may vary.
LVM - Logical Volume Manager
This appears to be the first of the "logical volume" projects for Linux to produce actual results. This system seems to be modelled after the logical volume system IBM provides with AIX.
Log-Structured File System Project
This is a discontinued project that planned to provide Linux with an "ultra-high-performance" file system that would simultaneously provide "ultra-high-reliability." The general approach was to use the lessons learned in writing robust, fast database management systems.
Grossly oversimplifying, robustness is provided by logging all updates before updating the database ("file system") proper, and speed is provided by having the database be a "view" that references the update logs. A separate process runs when the system is not very busy to "vacuum" out areas of the disk that have become fragmented due to files having been created and deleted. This approach takes after the way modern relational databases are implemented.
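Oversimplifying yet further, the kernel of the idea fits in a few lines of C. This toy is entirely my own sketch, not the project's code: every block update is appended to a log, and an in-memory index points at the newest copy of each block; the "vacuuming" pass that reclaims dead log space is omitted.

/* Toy log-structured store: updates only ever append to the log;
   reads go through an index that tracks the newest copy of each
   logical block. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BLKSZ   4096
#define NBLOCKS 1024

static off_t index_map[NBLOCKS];   /* logical block -> log offset */
static int log_fd;

int log_write(int blkno, const char buf[BLKSZ])
{
    off_t where = lseek(log_fd, 0, SEEK_END);   /* always append */
    if (write(log_fd, buf, BLKSZ) != BLKSZ)
        return -1;
    if (fsync(log_fd) != 0)         /* make it durable before acking */
        return -1;
    index_map[blkno] = where;       /* newest copy wins */
    return 0;
}

int log_read(int blkno, char buf[BLKSZ])
{
    if (index_map[blkno] < 0)
        return -1;                  /* block never written */
    return pread(log_fd, buf, BLKSZ, index_map[blkno]) == BLKSZ ? 0 : -1;
}

int main(void)
{
    char block[BLKSZ] = "hello, log-structured world";
    memset(index_map, -1, sizeof index_map);   /* all offsets start at -1 */
    log_fd = open("fs.log", O_RDWR | O_CREAT, 0644);
    if (log_fd < 0)
        return 1;
    log_write(7, block);
    log_read(7, block);
    puts(block);
    close(log_fd);
    return 0;
}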
The research material may still be useful; other projects continue.
Successor project to lfs.
LFS - Large File Summit - Greater than 2GB file sizes on 32-bit systems
Provides several things:
Allows support for files whose size exceeds the maximum value of the long datatype on 32-bit systems (2GB-1).
Page cache supporting at least up to 1TB file sizes.
A few behaviour fixes to bring some file-related functions into POSIX compliance.
Note that this will not be compatible with older NFS (pre-NFSv3; RFC 1094), which defines only a 32-bit file offset.
LFS has been widely implemented on all the Unix flavours still in widespread use. We are now getting to the point where 64-bit systems are common enough that LFS is starting to become irrelevant.
The point of this filesystem is that it:
Supports DVDs, and
Allows more efficient "CD-burning" schemes to be used.
News File System
Some of the above schemes could be combined to achieve a file system optimized for handling NNTP/INN news spools.
News is quite special in that it results in:
Relatively small files, as most posts do not exceed 4K in size
This encourages small cluster sizes so that space is not wasted
Spool directories containing thousands of files
With ext2, where directory entries are kept in a "list-like" data structure, accesses to files by name become increasingly inefficient as directory size grows
Various expiry information that is probably more important than creation/modification dates.
If expiry information were stored in the date "fields," both addition and deletion of news could take place faster.
An interesting test of the efficacy of new filesystems is to try to use them for a news spool.
Large File Support
A current weakness with Linux is in support for very large files. The commonly-used ext2 file system supports up to 4TB filesystems, which indeed does qualify as "very large." Files are nonetheless restricted to 2GB, which, for some applications, is not very large.
SAS held the SAS Large Files Summit for UNIX, where suggested APIs and approaches were presented as part of an X/Open "summit" to allow UNIX systems to portably support very large files. Linux's approach should follow this...
"I need to have files bigger than 2GB. What's the big problem?"
There are several issues.
REAL reason for 2GB file size limit
NIT: The real limitation for POSIX file sizes on 32-bit architectures isn't directly from POSIX, but rather, indirectly from ISO/ANSI C (which is assumed by POSIX).
In POSIX, the lseek() file offset is defined as off_t. There's nothing to prevent a 32-bit POSIX-compliant implementation from typedef'ing off_t as a 64-bit signed "long long". Similarly, ISO/ANSI C's fsetpos()/fgetpos() can be fixed by typedef'ing fpos_t to be a 64-bit long long.

The problem lies with ISO/ANSI C's fseek()/ftell(), which use a "long" for the offset. Why fpos_t wasn't used (consistently!) is beyond me, but hindsight is 20/20. AFAIK, these are the only two functions that must break if greater-than-2GB files are permitted with 32-bit longs. Other functions can break (like the present case), but aren't required to break by the POSIX+C standards.
Thus, it is ISO/ANSI C compatibility that is necessarily broken for file systems that support bigger-than-2GB files with 32-bit longs -- something that applies even to non-POSIX file systems that claim compatibility with ISO/ANSI C. I'm sure you all can think of at least two, but that's a discussion for the advocacy lists.
-- Chuck Phillips <cdp@peakpeak.com>
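For reference, here's roughly what the fix looks like from the application side on a modern glibc system; this is a sketch, not authoritative LFS documentation. Compiling with _FILE_OFFSET_BITS=64 makes off_t 64 bits wide, and fseeko()/ftello() take off_t where ISO C's fseek()/ftell() are stuck with long.

/* Ask glibc for the 64-bit off_t interfaces (the LFS transitional
   API); must come before any system header. */
#define _FILE_OFFSET_BITS 64

#include <stdio.h>
#include <sys/types.h>

int main(void)
{
    FILE *fp = fopen("bigfile", "w+");
    if (!fp)
        return 1;

    /* Seek past the old 2GB barrier; with a 32-bit long this offset
       simply cannot be expressed as an fseek() argument. */
    off_t big = (off_t)3 * 1024 * 1024 * 1024;   /* 3GB */
    if (fseeko(fp, big, SEEK_SET) != 0)
        return 1;
    fputc('x', fp);                /* creates a sparse 3GB file */

    printf("now at offset %lld\n", (long long)ftello(fp));
    fclose(fp);
    return 0;
}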
Or, alternatively, Alexander Viro has an entertaining answer for you...
A: because VM in Linux 2.2 and earlier can't cope with files larger than 2GB on 32-bit architectures. Regardless of filesystem.
A: use 2.4 or 2.2 with LFS patches or FreeBSD. All of them will handle more than 2Gb on ext2.
A: because if libc thinks that offsets are 32 bit it's not going to pass anything larger to the kernel
A: get sufficiently recent libc. And learn to use search engines, already - all that stuff has been beaten to death many times.
SGI has made their XFS filesystem available on Linux. Some notable properties of XFS include:
Supports Very Large Files
Supports journalled metadata
Uses B-Trees to represent directories, so that directory accesses take O(log n) time.
Clockwise real-time filesystem for Linux
Allows control over the scheduling of disk access requests, providing both "best effort" servicing as well as "real time," with a specified Quality of Service "contract."
StegFS - A Steganographic File System for Linux
StegFS is a Steganographic File System for Linux. Not only does it encrypt data, it also hides it such that it cannot be proved to be there.
Tailmerging is a technique that I first heard about from its use in ReiserFS. Tailmerging makes use of the wasted space in the last block of each file by sharing each tail block among a few files. Each file knows where to look in the tail block to find its own tail.
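To see what's at stake, here's a back-of-the-envelope calculation (assuming, for illustration, 4KB blocks) of the slack space that tail merging could reclaim:

/* Per-file slack: the unused portion of the partially-filled last
   block, which tail merging lets several files share. */
#include <stdio.h>

int main(void)
{
    const long blocksize = 4096;
    const long sizes[] = { 100, 4000, 4096, 9000 };  /* sample file sizes */

    for (int i = 0; i < 4; i++) {
        long tail  = sizes[i] % blocksize;           /* bytes in last block */
        long waste = tail ? blocksize - tail : 0;    /* slack to reclaim */
        printf("size %5ld: tail %4ld, wasted %4ld bytes\n",
               sizes[i], tail, waste);
    }
    return 0;
}

For a news spool full of sub-4K files, nearly every file wastes most of a block, which is why this matters.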
Cloudburst: A Compressing, Log-Structured Virtual Disk for Flash Memory
BSD Soft Updates
This is an alternative approach to journalling, which tracks and enforces dependencies among metadata updates to ensure that disk filesystems remain consistent. It imposes a partial ordering on buffer cache operations, which allows the requirement for synchronous directory updates to be eliminated; directory updates can see large performance increases. It also may allow deferring fsck runs, which may be run in the background while the system is "live". Performance is similar to what journalling provides: somewhat better in some cases, a little worse in others.
A read/write NTFS-compatible filesystem for Linux. It uses the Microsoft Windows ntfs.sys driver, running atop a layer that emulates the needful portions of the Windows NT kernel.
LFS: A Log Structured File System for Linux that Supports Snapshots
The Perl Filesystem (Kernel module to let you hook Perl code in to make up FSes)
Perl vs. traditional Filesystems
People have been known to react to the Perl Filesystem with "Why?", so I thought I'd compare the job of writing filesystems in Perl and C and let you draw your own conclusions.
Perl
Filesystem will work the same on any supported system, any supported kernel version. If somebody gives you a pre-built module, you won't even need the kernel sources.
Most bugs will cause error messages and meaningful syslog entries.
Some filesystems might be slower (but our example "Net" filesystem spends all the time waiting for servers at the other end, so it'd be just as slow in any other language).
Traditional
You need to recompile your filesystem for every combination of operating system/version where you want to use it. In most cases, this requires extensive rewriting (just look at the loadable kernel module which supports PerlFS - it tries to work on two kernel versions of the same operating system, and it contains more conditional compilation than is good for sanity)
Most bugs will result in a kernel panic or at best some obscure syslog entry.
Some filesystems might be faster.
"Why can't you use userfs?" - I wish I could find a recent version.
Another question I get is "Why not write a Perl NFS server instead?" - Because the NFS protocol is not flexible enough for some of the things I plan to do.
docfs - Unified Documentation Storage and Retrieval for Linux Systems
And now, for something completely different...
This project proposes creating special file systems that dynamically format documentation into the requested format. For instance, the "original source" would be in /usr/doc/sgml in SGML form. When a request is made for the manual page in /usr/man, this file system would dynamically run the SGML-to-GROFF translator, producing the manual page "on the fly." Similarly, accessing /usr/info/something would result in the SGML source being turned into TeXInfo form.
Usenetfs: A Stackable File System for Large Article Directories
File System development is very difficult and time consuming. Even small changes to existing file systems require deep understanding of kernel internals, making the barrier to entry for new developers high. Moreover, porting file system code from one operating system to another is almost as difficult as the first port. Past proposals to provide extensible (stackable) file system interfaces would have simplified the development of new file systems. These proposals, however, advocated massive changes to existing operating system interfaces and existing file systems; operating system vendors and maintainers resist making any large changes to their kernels because of stability and performance concerns. As a result, file system development is still a difficult, long, and non-portable process.
The FiST (File System Translator) system combines two methods to solve the above problems in a novel way: a set of stackable file system templates for each operating system, and a high-level language that can describe stackable file systems in a cross-platform portable fashion. Using FiST, stackable file systems need only be described once. FiST's code generation tool, fistgen, compiles a single file system description into loadable kernel modules for several operating systems (currently Solaris, Linux, and FreeBSD).
PyVen - for implementing Userspace Filesystems in Python, atop Coda
A file system "server" that stores files in a PostgreSQL database, accesses being handled using NFS clients.
The point of the exercise is to provide automatic versioning, so that one can compare current file "sets" to those that existed at a previous point in time, rolling forward and back as necessary.
This provides a "pervasive" equivalent to CVS.
Code hasn't been sighted in several years.
The Design and Implementation of the Inversion File System
A filesystem implemented atop Postgres. It was slower than NFS when each update was treated as atomic under "standard" Unix/NFS semantics. When they were able to run file operations within the DBMS, it was rather a lot faster...
Alex Viro's Per-Process Namespaces for Linux 2.4.2
This is based on the Plan 9 notion of namespaces.
In effect, a namespace associates a set of mounts of filesystems with a process, rather than the traditional Unix approach of associating them with a central table for the system as a whole.
This leads to the notion of mounting "private" filesystems that are visible only to a particular process (and perhaps its children). One thing that this would be useful for is in enhancing system security.
For instance, if I'm using CFS to secure a directory, with the traditional Unix approach, I might use the command cattach /home/cbbrowne/secret_stuff secretstuff to mount the data in /home/cbbrowne/secret_stuff on /crypt/secretstuff. Unfortunately, anyone on the system with suitable permissions can look in /crypt/secretstuff and see the readable version of the data. That's not terribly secret; I have to be quite careful to keep my data secret!
With a per-process namespace, the mount might be associated with a specific process, and its children. It would be invisible to other processes belonging to other users, and (for better or worse) is even invisible to processes that are not children of that environment. That's rather more secure.
Mind you, that does not necessarily help in this particular situation, since CFS behaves as a pretty much public NFS server for the host; the "mount" is for /crypt as a whole, not for each individual encrypted directory...
The other really cool thing that starts to become more practical is the notion of mapping data structures onto virtual filesystems. For instance, you might create a "driver" that maps DBM files so that one looks like a directory with a whole bunch of files.
I might thus do mount -t dbm /home/cbbrowne/data/file.dbm /home/cbbrowne/mounts/file and be given the ability to do the following sorts of things:
List the keys via ls /home/cbbrowne/mounts/file
Achieving:
key1 key2 key3 key4
Show the value for a key via cat /home/cbbrowne/mounts/file/key4
value4
More interestingly, we might create entries via echo "value 5" > /home/cbbrowne/mounts/file/key5
None of this would be conceptually impossible with a public namespace; the merit of the namespaces remaining private is that these sorts of isomorphisms are not blathered around publicly.
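For reference, here's roughly what that isomorphism amounts to when coded directly against the POSIX ndbm interface rather than through a (hypothetical) filesystem driver; this sketch of my own just walks the keys and fetches each value, which is what the ls and cat above would map onto:

/* Enumerate a DBM file the way the hypothetical "mount -t dbm"
   would expose it: each key becomes a "file name", each value its
   contents.  Uses the POSIX ndbm interface. */
#include <fcntl.h>
#include <ndbm.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s /path/to/file.dbm\n", argv[0]);
        return 1;
    }

    DBM *db = dbm_open(argv[1], O_RDONLY, 0);
    if (!db)
        return 1;

    for (datum key = dbm_firstkey(db); key.dptr != NULL;
         key = dbm_nextkey(db)) {
        datum val = dbm_fetch(db, key);
        int vlen = val.dptr ? (int)val.dsize : 0;
        printf("%.*s -> %.*s\n",
               (int)key.dsize, (char *)key.dptr,
               vlen, vlen ? (char *)val.dptr : "");
    }
    dbm_close(db);
    return 0;
}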
There are a number of cryptographic filesystems wherein a virtual filesystem is somehow authenticated at mount time and made accessible to the user.
AtFS - Attribute Filesystems - provides uniform access to immutable revisions of files
Allowing use of AES encryption for filesystems...
LUFS (Linux Userland FileSystem) is a hybrid userspace filesystem framework supporting an indefinite number of filesystems (localfs, sshfs, ftpfs, cardfs, and cefs implemented so far), transparently for any application.
For instance, consider ftpfs, the FTP File System, a Linux kernel module enhancing the VFS with FTP volume mounting capabilities. That is, you can "mount" FTP shared directories in your very personal file system and take advantage of local file ops.
LOCASEFS
LoCaseFS provides a lowercase mapping of the local file system. It comes in handy when importing win32 source trees on *nix systems.
SSHFS
SshFS is probably the most advanced LUFS file system because of its security, usefulness, and completeness. It is based on the SFTP protocol and requires OpenSSH. You can mount remote file systems accessible through sftp (the scp utility).
GNUTELLAFS
You mount a gnetfs in ~/gnet. You wait a couple of minutes so it can establish its peer connections. You start a search by creating a subdirectory of SEARCH: mkdir "~/gnet/SEARCH/metallica mp3". You wait a few seconds for the results to accumulate. Then you chdir to "SEARCH/metallica mp3" and try an ls; surprise - the files are there!
You fire up mpg123 and enjoy... You are happy.
A project to replace the traditional filesystem with a new document store.
The idea is to store data as BLOBs in a relational database, notably PostgreSQL, along with document attributes. Users would then look for documents based on the attributes, as opposed to designing (usually badly) a hierarchy.
redisfs - Replication-Friendly Redis-based filesystem
This implements a filesystem which stores data atop the Redis database.
Allows mounting Google Drive as a Linux filesystem
Tagsistant is a tool to organize files in a semantic way, which means using tags.
NFS is the "traditional" networked filesystem used on Linux and Unix.
The goal of the Global File System research project is to develop a serverless file system that exploits new interfaces like Fibre Channel that allow network attached storage. (Buzzword: SAN = Storage Area Network.)
The critical notion is that the system is serverless. With a traditional networked storage system like NFS, one host "owns" the filesystem and then provides access as a server, so that other hosts access the data through that server.
GFS eschews having "a server;" the shared-SCSI version exploits SCSI command extensions that provide a locking scheme such that multiple hosts may simultaneously access and update the filesystem directly across the SCSI bus. None of the hosts "owns" the filesystem.
Coda is a distributed filesystem with its origin in AFS2. It has many features that are very desirable for network filesystems. Currently, Coda has several features not found elsewhere.
Disconnected operation for mobile computing
Is freely available under a liberal license
High performance through client side persistent caching
Server replication
Security model for authentication, encryption and access control
Continued operation during partial network failures in server network
Network bandwidth adaptation
Good scalability
Well defined semantics of sharing, even in the presence of network failures
Oversimplifying somewhat, clients use a cache to store changes that are made to files. They then push updates back to the server, which then distributes changes to other clients.
By having a sufficiently large cache, it can operate even when systems are disconnected, deferring "pushing updates back to the server" until the server is again available.
It implemented the cache using RVM (Recoverable Virtual Memory).
InterMezzo is a new distributed file system with a focus on high availability. InterMezzo is an Open Source project, currently on Linux (2.2 and 2.3). A primary target of our development is to provide support for flexible replication of directories, with disconnected operation and a persistent cache. It was "deeply inspired" by Coda, and was originally started as part of that project.
Unison is a file-synchronization tool for Unix and Windows. It allows two replicas of a collection of files and directories to be stored on different hosts (or different disks on the same host), modified separately, and then brought up to date by propagating the changes in each replica to the other.
The Inferno operating system provides a distributed file access protocol called Styx which would be interesting to use on other OSes, perhaps even Linux...
A new distributed file-sharing system featuring fast, exhaustive searches and modest network bandwidth requirements. Written in Java 1.1 (with Swing GUI) for platform independence.
A file sharing system organized around users, allowing users to expose files to those users they wish to provide them to.
Lustre is a storage and file system architecture and implementation designed for use with very large clusters.
A partition editor for creation, deletion, resizing, moving, and copying of disk partitions.
linux.oreillynet.com: Proper Filesystem Layout [Oct. 11, 2001]
fstransform may be used to do in-place transformations of filesystems between several interesting Linux choices, including xfs, jfs, reiserfs, ext2, ext3, ext4, without a need to do additional backups.
Amazon has created a storage service, S3, which offers a web-service-based API. It is quite widely used for data access, file storage, and backups, and is extensively used by Amazon for hosting data for its EC2 virtualization service.
I haven't yet had call to use it directly, though I use it via some proxies (e.g., Dropbox). I would be particularly interested in seeing alternative implementations of the server side emerge. There's code out there, though at this stage it's neither particularly easy to deploy nor totally interoperable.