Introduction to Linux File Systems
This is mostly described in Stevens, Chapter 4. Read it. Now.
Layout of the file system:
· Each physical drive can be divided into several partitions
· Each partition can contain one file system
· Each file system contains:
1. boot block(s);
2. superblock;
3. inode list;
4. data blocks.
· A boot block may contain the bootstrap code that is read into the machine upon booting.
· A superblock describes the state of the file system: its size, how many inodes and data blocks it holds, and how many of them are free.
· The inode list is an array of "information nodes" analogous to the FAT (File Allocation Table) system in MS-DOS.
· Data blocks start at the end of the inode list and contain file data and directory blocks.
The term file system can mean a single disk, or it can mean the entire collection of devices on a system. It's held together in this second sense by the directory structure.
The directory "tree" usually spans many disks and/or partitions by means of mount points. For example, Red Hat Linux pre-defines mount points for floppy disks and CD-ROMs at /mnt/floppy and /mnt/cdrom. See also /etc/fstab and /etc/mtab.
Some Linux-supported File Systems
minix is the filesystem used in the Minix operating system, the first to run under Linux. It has a number of shortcomings, mostly in being small.
ext is an elaborate extension of the minix filesystem. It has been completely superseded by ext2.
ext2 is the disk filesystem used by Linux for both hard drives and floppies. ext2, designed as an extension to ext, has in its turn generated a successor, ext3.
ext3 is ext2 with journaling added. Of the file systems supported under Linux, it offers a good combination of performance (speed and CPU usage) and data security, the latter due to its journaling feature.
xiafs was designed as a stable, safe file system by extending minix. It's no longer actively supported and is rarely used.
msdos is the filesystem used by MS-DOS and Windows. msdos filenames are limited to the 8 + 3 form. It's especially good for floppies that you move back and forth between Linux and DOS or Windows machines.
umsdos extends msdos by adding long filenames, ownership, permissions, and special files while remaining compatible with MS-DOS and Windows.
vfat extends msdos to be compatible with Microsoft Windows' support for long filenames (a good choice for dual-boot).
proc is a pseudo-filesystem which is used as an interface to kernel data. Its files do not use disk space. See proc(5).
iso9660 is a CD-ROM filesystem type conforming to the ISO 9660 standard, including both High Sierra and Rock Ridge.
nfs is a network filesystem used to access remote disks.
smb is a network filesystem that supports the SMB protocol, used by Windows.
ncpfs is a network filesystem that supports the NCP protocol, used by Novell NetWare.
Partition Structure
Boot Block(s)
Blocks on a Linux (and often a Unix) filesystem are 1024 bytes in length, but may be longer or shorter. The blocks are normally a power of 2 in size (1024 is 2 to the 10th power). Some systems use 512 bytes (2 to the 9th) but 2048 and 4096 are also seen.
The first few blocks on any partition may hold a boot program, a short program for loading the kernel of the operating system and launching it. Often, on a Linux system, it will be controlled by LILO or Grub, allowing booting of multiple operating systems. It's quite simple (and common) to have a multiple boot environment for both Linux and a flavour of Windows.
Superblock
The boot blocks are followed by the superblock, which contains information about the geometry of the physical disk, the layout of the partition, number of inodes and data blocks, and much more.
Inode Blocks
Disk space allocation is managed by the inodes (information node), which are created by the mkfs(8) (make filesystem) command. Inodes cannot be manipulated directly, but are changed by many commands, such as touch(1) and cp(1), and by system calls, like open(2) and unlink(2), and can be read by stat(2). Both chmod(1) and chmod(2) change access permissions.
Data Blocks
This is where the file data itself is stored. Since a directory is simply a specially formatted file, directories are also contained in the data blocks. An allocated data block can belong to one and only one file in the system. If a data block is not allocated to a file, it is free and available for the system to allocate when needed.
Structure of the super block
struct ext2_super_block {
__u32 s_inodes_count; /* Inodes count */
__u32 s_blocks_count; /* Blocks count */
__u32 s_r_blocks_count; /* Reserved blocks count */
__u32 s_free_blocks_count; /* Free blocks count */
__u32 s_free_inodes_count; /* Free inodes count */
__u32 s_first_data_block; /* First Data Block */
__u32 s_log_block_size; /* Block size */
__s32 s_log_frag_size; /* Fragment size */
__u32 s_blocks_per_group; /* # Blocks per group */
__u32 s_frags_per_group; /* # Fragments per group */
__u32 s_inodes_per_group; /* # Inodes per group */
__u32 s_mtime; /* Mount time */
__u32 s_wtime; /* Write time */
__u16 s_mnt_count; /* Mount count */
__s16 s_max_mnt_count; /* Maximal mount count */
__u16 s_magic; /* Magic signature */
__u16 s_state; /* File system state */
__u16 s_errors; /* Behaviour if error detected */
__u16 s_minor_rev_level; /* minor revision level */
__u32 s_lastcheck; /* time of last check */
__u32 s_checkinterval; /* max. time between checks */
__u32 s_creator_os; /* OS */
__u32 s_rev_level; /* Revision level */
__u16 s_def_resuid; /* Def uid for reserved blocks */
__u16 s_def_resgid; /* Def gid for reserved blocks */
__u32 s_first_ino; /* First non-reserved inode */
__u16 s_inode_size; /* size of inode structure */
__u16 s_block_group_nr; /* block grp # of this s'block */
__u32 s_feature_compat; /* compatible feature set */
__u32 s_feature_incompat; /* incompatible feature set */
__u32 s_feature_ro_compat; /* readonly-compat feature set */
__u8 s_uuid[16]; /* 128-bit uuid for volume */
char s_volume_name[16]; /* volume name */
char s_last_mounted[64]; /* dir where last mounted */
[some compression stuff]
__u16 s_padding1; /* Padding for alignment */
[some journaling stuff]
__u32 s_reserved[197]; /* Padding to end of the block */
};
Most of this structure, which is defined fully in /usr/include/ext2fs/ext2_fs.h, is used only by the operating system and by file system utilities; an ordinary user program neither reads nor modifies it directly.
Structure of an inode on the disk
Each file (a unique collection of data blocks) has exactly one inode, which completely defines the file except for its name(s). The filenames are actually links in the directory structure to the inode for the file.
From a program, the representation of the inode to use is struct stat, as filled in by stat(2). It is declared in /usr/include/bits/stat.h and shown on page 73 in Stevens.
struct ext2_inode {
__u16 i_mode; /* File mode */
__u16 i_uid; /* Low bits of Uid */
__u32 i_size; /* Size in bytes */
__u32 i_atime; /* Access time */
__u32 i_ctime; /* Creation time */
__u32 i_mtime; /* Modification time */
__u32 i_dtime; /* Deletion Time */
__u16 i_gid; /* Low bits of Gid */
__u16 i_links_count; /* Links count */
__u32 i_blocks; /* Blocks count */
__u32 i_flags; /* File flags */
union {
. . .
} osd1; /* OS dependent 1 */
__u32 i_block[EXT2_N_BLOCKS];/* Data */
__u32 i_generation;/* Version (NFS) */
[ACLs and stuff]
union {
. . .
} osd2; /* OS dependent 2 */
};
Directory entry
When you search through a directory for a filename, use only the d_name field of the directory entry. Pass that filename to stat(2) or lstat(2) to get the struct stat information.
struct dirent {
long d_ino; /* don't use */
__kernel_off_t d_off; /* don't use */
unsigned short d_reclen; /* don't */
char d_name[256];/* OK */
};
Special inode numbers
There are several pre-defined inodes with special purposes. Because of the way the superblock and inodes are defined, you do not need to know these unless you are creating or modifying filesystem software.
EXT2_BAD_INO 1 Bad blocks inode
EXT2_ROOT_INO 2 Root inode
EXT2_ACL_IDX_INO 3 ACL index inode
EXT2_ACL_DATA_INO 4 ACL data inode
EXT2_BOOT_LOADER_INO 5 Boot loader inode
EXT2_UNDEL_DIR_INO 6 Undelete dir inode
EXT2_RESIZE_INO 7 Res grp desc inode
EXT2_JOURNAL_INO 8 Journal inode
First non-reserved inode for ext2 filesystems
EXT2_GOOD_OLD_FIRST_INO 11
Inode Contents
Disk inodes contain the following information:
· last file modification time
· last file access time
· last inode modification time
Access Permissions
Directories
Two categories of Users
Linking files
In Linux and Unix, a data file is a bunch of data blocks on a disk, managed by an inode. Its name is stored only in the directory. Or in many directories. This is the concept of linking, as discussed in Stevens sections 4.14 through 4.17. Both "soft" (symbolic) links and "hard" links can be made using the ln(1) command or the link(2) and symlink(2) system calls.
Original file
System Prompt: ls -l
-rw-r--r-- 1 allisor staff 0 D/T abc
Create hard link
System Prompt: ln abc h-abc
System Prompt: ls -l
-rw-r--r-- 2 allisor staff 0 D/T abc
-rw-r--r-- 2 allisor staff 0 D/T h-abc
Create symbolic (soft) link
System Prompt: ln -s abc s-abc
System Prompt: ls -l
-rw-r--r-- 2 allisor staff 0 D/T abc
lrwxr-xr-x 1 allisor staff 3 D/T s-abc -> abc
-rw-r--r-- 2 allisor staff 0 D/T h-abc
Examine inodes
System Prompt: ls -i1
25263 abc
25263 h-abc
25265 s-abc
Remove original file
System Prompt: rm abc
System Prompt: ls -l
-rw-r--r-- 1 allisor staff 0 D/T h-abc
lrwxr-xr-x 1 allisor staff 3 D/T s-abc -> abc
Directory names
Directory names from the current directory
/*
   Modified from the Irix (SGI) man pages for the directory entry functions.
   List the entry names from the current directory.
*/
#include <stdio.h>
#include <stdlib.h>   /* exit(), EXIT_FAILURE */
#include <dirent.h>
int main( void )
{
    DIR *dp;
    struct dirent *dirp;
    if ( ( dp = opendir( "." ) ) == NULL )
    {
        perror( "Can't open current dir" );
        exit( EXIT_FAILURE );
    }
    while ( ( dirp = readdir( dp ) ) != NULL )
    {
        printf( "%s\n", dirp->d_name );
    }
    closedir( dp );
    return 0;   /* 0 reports success to the shell */
}
Directory names from argv
/*
   Modified from Program 1.1 on page 4 in Stevens.
   List the entry names from the directory name given as a command-line argument.
*/
#include <stdio.h>
#include <stdlib.h>   /* exit(), EXIT_FAILURE */
#include <dirent.h>
int main( int argc, char *argv[] )
{
    DIR *dp;
    struct dirent *dirp;
    if ( argc != 2 )
    {
        fprintf( stderr, "usage: %s dirname\n", argv[0] );
        exit( EXIT_FAILURE );
    }
    if ( ( dp = opendir( argv[1] ) ) == NULL )
    {
        perror( argv[1] );   /* says why the open failed */
        exit( EXIT_FAILURE );
    }
    while ( ( dirp = readdir( dp ) ) != NULL )
    {
        printf( "%s\n", dirp->d_name );
    }
    closedir( dp );
    return 0;   /* 0 reports success to the shell */
}
Some useful system calls
stat(2) Stevens, section 4.2, page 73
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
int stat( const char *path, struct stat *buf);
int fstat( int fdes, struct stat *buf );
int lstat(const char *path, struct stat *buf);
Purpose: To retrieve the file statistics (inode info) for the file at path (stat), for the open(2) file fdes (fstat), or, when path names a symbolic link, for the link itself rather than the file it points to (lstat).
Returns: On success, returns 0 and fills buf with the inode information. On failure, returns -1 and sets errno to indicate the failure.
struct stat [some padding removed]
{
__dev_t st_dev; /* Device. */
__ino_t st_ino; /* File serial no. */
__mode_t st_mode; /* File mode */
__nlink_t st_nlink;/* Link count */
__uid_t st_uid; /* file owner uid */
__gid_t st_gid; /* file owner gid */
__dev_t st_rdev; /* Dev no, if device */
__off_t st_size; /* filesize in bytes */
__blksize_t st_blksize;/* Best blksize */
__blkcnt_t st_blocks; /* Blocks alloc */
__time_t st_atime; /* last access time */
__time_t st_mtime; /* last mod time */
__time_t st_ctime; /* last change time */
};
access(2) Stevens, section 4.7, page 82
#include <unistd.h>
int access(const char *pathname, int mode);
Purpose: To see if the process would be allowed to read, write, or execute the file named in pathname, or simply to test that the file exists. If it's a symbolic link, permissions of the linked file are tested. mode is either F_OK (existence only) or a mask consisting of one or more of R_OK, W_OK, and X_OK.
Returns: If all requested permissions are granted 0 is returned. On error (at least one permission is denied, or some other error occurred), -1 is returned, and errno is set appropriately.
chmod(2) Stevens, section 4.9, page 85
#include <sys/types.h>
#include <sys/stat.h>
int chmod(const char *pathname, mode_t mode );
int fchmod( int fdes, mode_t mode );
Purpose: Sets the permissions of the file named by pathname, or of the open(2) file identified by fdes. If mode is written as an octal constant, it must begin with a zero (0644, not 644). Pre-defined flags such as S_IRUSR and S_IWUSR are also available in <sys/stat.h>.
Returns: On success, 0. On failure, -1 and sets errno to indicate the failure.
unlink(2) Stevens, section 4.15, page 95
#include <unistd.h>
int unlink( const char *pathname );
Purpose: Unlinks (that is, deletes a link to remove a file name's connection to an inode) the file named by pathname.
Returns: On success, 0. On failure, -1 and sets errno to indicate the failure.
perror(3) Stevens, section 1.7, page 14
#include <stdio.h>
void perror(const char *s);
Purpose: perror() translates error codes into human-readable form, writing a description of the latest error from a system function to stderr. The argument string is printed, then a colon and a blank, then the system error text and a newline. You should include diagnostic information such as the name of the failing function and its location.
The error number is taken from extern errno, set when errors occur but not cleared. Save errno if you plan to use it and the failing call is not immediately followed by a call to perror(3).
Some errno values (see /usr/include/asm/errno.h and other variations of errno.h in /usr/include for more):
ENOENT No such file or directory
ENOEXEC Exec format error
ECHILD No child processes
ENOMEM Out of memory
EACCES Permission denied
EEXIST File exists
ENOTDIR Not a directory
EISDIR Is a directory
EINVAL Invalid argument
EPIPE Broken pipe
ENOTEMPTY Directory not empty
ENOSPC No space left on device
ESPIPE Illegal seek
EROFS Read-only file system
EMLINK Too many links
EDOM Math argument out of domain
ERANGE Math result not representable
Some Time and Date Routines
All require #include <time.h>, and are described in Stevens section 6.9 on page 155. They work with the various time fields, including those in the inodes, and all are based on time_t and the count of seconds from the Epoch: 00:00:00 UTC 1 January 1970.
You can convert a time in time_t to a broken-down time in the tm structure with gmtime(3) or localtime(3), and convert a struct tm time to time_t with mktime(3). Either form can be converted to a standard 26-byte string (the count includes the trailing newline and the end-of-string) with asctime(3) for a struct tm or ctime(3) for a time_t, or you can put a broken-down time into a string using your own format with strftime(3). You can get the current time with time(2).
char *asctime(const struct tm *timeptr);
char *ctime(const time_t *timep);
struct tm *gmtime(const time_t *timep);
struct tm *localtime(const time_t *timep);
time_t mktime(struct tm *timeptr);
struct tm
{
int tm_sec; /* seconds (0-60) */
int tm_min; /* minutes (0-59) */
int tm_hour; /* hours (0-23) */
int tm_mday; /* day of the month (1-31) */
int tm_mon; /* month (0-11) */
int tm_year; /* year minus 1900 */
int tm_wday; /* day of the week (0-6, Sunday = 0) */
int tm_yday; /* day in the year (0-365, 1 Jan = 0) */
int tm_isdst; /* daylight saving flag */
};
If time_t is a 32-bit signed int, when does the Epoch end? If it's an unsigned integer? What if it has a 64-bit value?