
I’m starting a series of posts about ZFS. I will try to provide architectural diagrams and code examples wherever possible. The series is for people who want to learn more about general filesystem architecture, and especially about ZFS (OpenZFS).
For now I will restrict myself to the implementation and architecture on Illumos, the open-source fork of OpenSolaris.
Let’s start with the highest-level abstraction, the Virtual File System. Like most Unix systems, Illumos uses a framework called the Virtual File System (VFS), an abstraction layer under which multiple concrete filesystems can be implemented. One of the first versions of VFS was introduced in SunOS 2.0 together with NFS (Network File System). It’s good to know that this SunOS implementation was actually the first virtual file system ever made, and it was later adopted by Unix System V Release 4.
VFS allows almost any object to be abstracted as a file or a filesystem. To mention a few of the filesystem categories in use today:
- Storage-based filesystems like ZFS, UFS, ext4
- Network-based filesystems like NFS (Network File System), CIFS (Common Internet File System)
- Pseudo filesystems like procfs (/proc), which exposes the address space of a process as a series of files, or devfs, which manages devices
Within VFS there are two key concepts. The first is the virtual file, abstracted as a vnode object; the second is the virtual file system, or vfs, object. A vnode provides file-related functions, and a vfs provides filesystem-related functions. In Illumos the vnode operations are defined in usr/src/uts/common/sys/vnode.h and contained within the structure vnodeops_t.
/*
 * Operations on vnodes. Note: File systems must never operate directly
 * on a 'vnodeops' structure -- it WILL change in future releases! They
 * must use vn_make_ops() to create the structure.
 */
typedef struct vnodeops {
        const char *vnop_name;
        VNODE_OPS;      /* Signatures of all vnode operations (vops) */
} vnodeops_t;
To mention some of the vnode operations available: open, close, read, write, seek, sync. All the function signatures are expanded from the VNODE_OPS macro.
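As the header comment warns, a concrete filesystem never fills in a vnodeops structure by hand; it registers its operations through vn_make_ops(). Below is a minimal sketch of what such a registration could look like. The filesystem name "myfs" and the myfs_open/myfs_read functions are hypothetical stand-ins; the real ZFS templates live in zfs_vnops.c.

#include <sys/vnode.h>
#include <sys/vfs_opreg.h>

/* Hypothetical vnode operations of an example filesystem "myfs". */
static int myfs_open(vnode_t **vpp, int flag, cred_t *cr,
    caller_context_t *ct);
static int myfs_read(vnode_t *vp, uio_t *uiop, int ioflag, cred_t *cr,
    caller_context_t *ct);

static vnodeops_t *myfs_vnodeops;

/* Template pairing operation names with their implementations. */
static const fs_operation_def_t myfs_vnodeops_template[] = {
        { VOPNAME_OPEN, { .vop_open = myfs_open } },
        { VOPNAME_READ, { .vop_read = myfs_read } },
        { NULL,         { NULL } }
};

static int
myfs_init(void)
{
        /* Builds the vnodeops_t that every "myfs" vnode will point to. */
        return (vn_make_ops("myfs", myfs_vnodeops_template, &myfs_vnodeops));
}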
The vnode and vfs functions delegate to the appropriate concrete filesystem implementation: all file-related operations reach the vfs/vnode layer through a system call and are directed from there to the filesystem that backs the vnode. The diagram below shows the high-level architecture of the VFS.
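To make the delegation concrete, here is an illustrative kernel-side sketch (simplified, not actual Illumos code): a read arriving at the VFS layer is forwarded through the vnode’s operation table via the VOP_READ macro from sys/vnode.h, so a ZFS-backed vnode ends up in ZFS’s read routine, a UFS-backed one in UFS’s, and so on.

#include <sys/vnode.h>

/* Illustrative only: forward a read to the concrete filesystem. */
static int
read_through_vfs(vnode_t *vp, uio_t *uiop, cred_t *cr)
{
        /* VOP_READ dispatches through the vnodeops attached to vp. */
        return (VOP_READ(vp, uiop, 0, cr, NULL));
}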
Files are referenced in process space using file descriptors. File descriptors are integer numbers, specifically of C type int. For example, the three standard POSIX file descriptors correspond to the three standard streams: STDIN is 0, STDOUT is 1 and STDERR is 2. A file descriptor is assigned when the file is opened and freed when the file is closed.
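As a small userland illustration (plain POSIX C, nothing Illumos-specific): with the three standard streams holding 0, 1 and 2, the first file a process opens usually receives descriptor 3.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
        /*
         * 0, 1 and 2 are taken by stdin/stdout/stderr, so the kernel
         * assigns the lowest free descriptor, typically 3.
         */
        int fd = open("/etc/passwd", O_RDONLY);
        if (fd == -1) {
                perror("open");
                return (1);
        }
        printf("allocated file descriptor: %d\n", fd);
        (void) close(fd);       /* the descriptor is now free for reuse */
        return (0);
}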
Each process has its own file list. Per-process file information in Illumos is kept in a structure called uf_info_t, which can be found in usr/src/uts/common/sys/user.h.
/*
 * Per-process file information.
 */
typedef struct uf_info {
        kmutex_t fi_lock;               /* see below */
        int fi_badfd;                   /* bad file descriptor # */
        int fi_action;                  /* action to take on bad fd use */
        int fi_nfiles;                  /* number of entries in fi_list[] */
        uf_entry_t *volatile fi_list;   /* current file list */
        uf_rlist_t *fi_rlist;           /* retired file lists */
} uf_info_t;
The file list is indexed by the integer file descriptor. Now let’s see how we would normally get to this information, the current file list. In usr/src/uts/common/sys/user.h the user_t structure is defined, containing all the per-process data related to a user. Through user_t, by accessing its field u_finfo (of type uf_info_t), we reach the per-process file information and the current file list. A good example is in the DTrace code (dtrace.c): uf_info_t *finfo = &curthread->t_procp->p_user.u_finfo. There are also tools, like pfiles, that let you see the fi_list for a given process ID.
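Putting those pieces together, a compressed kernel-context sketch (illustrative only; a real consumer would also take fi_lock/uf_lock) of looking up the entry for a given descriptor fd could read:

uf_info_t *finfo = &curthread->t_procp->p_user.u_finfo;

if (fd >= 0 && fd < finfo->fi_nfiles) {
        /* fi_list is indexed directly by the file descriptor. */
        uf_entry_t *ufp = &finfo->fi_list[fd];
        file_t *fp = ufp->uf_file;      /* NULL if fd is not open */
}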
The file list contains elements of type uf_entry_t. The definition for uf_entry_t is again in the same file usr/src/uts/common/sys/user.h.
/*
 * Entry in the per-process list of open files.
 * Note: only certain fields are copied in flist_grow() and flist_fork().
 * This is indicated in brackets in the structure member comments.
 */
typedef struct uf_entry {
        kmutex_t uf_lock;               /* per-fd lock [never copied] */
        struct file *uf_file;           /* file pointer [grow, fork] */
        struct fpollinfo *uf_fpollinfo; /* poll state [grow] */
        int uf_refcnt;                  /* LWPs accessing this file [grow] */
        int uf_alloc;                   /* right subtree allocs [grow, fork] */
        short uf_flag;                  /* fcntl F_GETFD flags [grow, fork] */
        short uf_busy;                  /* file is allocated [grow, fork] */
        kcondvar_t uf_wanted_cv;        /* waiting for setf() [never copied] */
        kcondvar_t uf_closing_cv;       /* waiting for close() [never copied] */
        struct portfd *uf_portfd;       /* associated with port [grow] */
        /* Avoid false sharing - pad to coherency granularity (64 bytes) */
        char uf_pad[64 - sizeof (kmutex_t) - 2 * sizeof (void*) -
            2 * sizeof (int) - 2 * sizeof (short) -
            2 * sizeof (kcondvar_t) - sizeof (struct portfd *)];
} uf_entry_t;
As we can see, uf_entry_t contains the pointer to the file in its uf_file field. Let’s look further down the path, at the link between the file pointer and its attached vnode. In Illumos the definition of struct file resides in usr/src/uts/common/sys/file.h.
/*
 * One file structure is allocated for each open/creat/pipe call.
 * Main use is to hold the read/write pointer associated with
 * each open file.
 */
typedef struct file {
        kmutex_t f_tlock;       /* short term lock */
        ushort_t f_flag;
        ushort_t f_flag2;       /* extra flags (FSEARCH, FEXEC) */
        struct vnode *f_vnode;  /* pointer to vnode structure */
        offset_t f_offset;      /* read/write character pointer */
        struct cred *f_cred;    /* credentials of user who opened it */
        struct f_audit_data *f_audit_data;      /* file audit data */
        int f_count;            /* reference count */
} file_t;
From the file pointer we can now reach the system-wide vnode through the f_vnode field. Multiple processes can hold references to the same vnode. We are almost at the end of the virtual filesystem level here, as we move towards the concrete filesystem implementation. The last barrier is the vnode itself. The vnode definition resides in usr/src/uts/common/sys/vnode.h.
typedef struct vnode {
        kmutex_t v_lock;                /* protects vnode fields */
        uint_t v_flag;                  /* vnode flags (see below) */
        uint_t v_count;                 /* reference count */
        void *v_data;                   /* private data for fs */
        struct vfs *v_vfsp;             /* ptr to containing VFS */
        struct stdata *v_stream;        /* associated stream */
        enum vtype v_type;              /* vnode type */
        dev_t v_rdev;                   /* device (VCHR, VBLK) */
        ...
} vnode_t;
The entry point into the filesystem-specific implementation is the v_data field within the vnode_t structure. With ZFS underneath, v_data points to a znode; in ZFS the znode is the equivalent of the UFS inode. The v_data is cast to a znode_t structure through the macro #define VTOZ(VP) ((znode_t *)(VP)->v_data). More about the znode and the ZFS POSIX operations implementation in the following posts.
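As a tiny sketch of how this looks from inside a ZFS vnode operation (the function zfs_example_op is a hypothetical stand-in; the VTOZ macro itself is the real one quoted above):

/* Inside a ZFS vnode operation, vp->v_data is recovered as a znode. */
static int
zfs_example_op(vnode_t *vp)
{
        znode_t *zp = VTOZ(vp);         /* ((znode_t *)(vp)->v_data) */

        /* ... operate on the znode ... */
        return (0);
}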
Let’s make a short recap of the code path, from the process-level user data down to the ZFS POSIX Layer implementation.
proc_t *p = curproc;                            /* the process */
user_t *userp = &p->p_user;                     /* the user structure */
uf_info_t *finfo = &userp->u_finfo;             /* per-process file information */
uf_entry_t *entry = &finfo->fi_list[0];         /* in case we want fd 0 */
file_t *fp = entry->uf_file;                    /* pointer to the file struct */
vnode_t *vp = fp->f_vnode;                      /* the vnode */
znode_t *zp = (znode_t *)vp->v_data;            /* reached the ZFS znode */
The next post will dive into the ZFS POSIX layer and into the code path of certain syscalls.
Happy coding.