
03 — Character and Block Devices

Technical Overview

Linux divides the device interface into three fundamental types: character devices, block devices, and network devices. This classification is not merely organizational — it determines the data model, the kernel subsystems involved, the buffering strategy, and the interface exposed to userspace. Understanding which category a device belongs to explains how it appears in /dev, how applications read and write to it, and which kernel APIs the driver must implement.

Character devices expose a byte stream with no inherent block structure. Most are read and written sequentially, though a driver may also support seeking; a transfer can be any length, down to a single byte, if the driver allows it. Block devices expose fixed-size blocks (typically 512 or 4096 bytes) and support random access. Network devices are the exception: they are packet-based, do not appear in /dev at all, and are accessed through the socket API.


Prerequisites

  • Linux driver model (01-driver-model.md)
  • Kernel module basics (02-kernel-modules.md)
  • Understanding of file descriptor semantics
  • VFS (Virtual Filesystem Switch) basics

Device Type Comparison

  ┌─────────────────┬──────────────────┬──────────────────┬──────────────────┐
  │ Property        │ Character Device │ Block Device     │ Network Device   │
  ├─────────────────┼──────────────────┼──────────────────┼──────────────────┤
  │ Data unit       │ Byte stream      │ Fixed-size block │ Packet           │
  │ Seekable        │ Maybe (tty: no)  │ Yes              │ No               │
  │ Buffered        │ No (or ring buf) │ Yes (page cache) │ sk_buff queue    │
  │ Appears in /dev │ Yes              │ Yes              │ No               │
  │ Access via      │ open/read/write  │ open/read/write  │ socket syscalls  │
  │ Examples        │ /dev/tty, /dev/  │ /dev/sda,        │ eth0, lo,        │
  │                 │  random, /dev/   │  /dev/nvme0n1,   │  wlan0           │
  │                 │  video0          │  /dev/md0        │                  │
  └─────────────────┴──────────────────┴──────────────────┴──────────────────┘

Major and Minor Numbers

Every device in /dev is identified by a major number (identifies the driver) and a minor number (identifies the specific device instance managed by that driver).

ls -la /dev/sda* /dev/tty* /dev/null /dev/zero /dev/urandom
# crw-rw-rw- 1 root root  1,  3 May 16 08:00 /dev/null
# crw-rw-rw- 1 root root  1,  5 May 16 08:00 /dev/zero
# crw-rw-rw- 1 root root  1,  9 May 16 08:00 /dev/urandom
# brw-rw---- 1 root disk  8,  0 May 16 08:00 /dev/sda
# brw-rw---- 1 root disk  8,  1 May 16 08:00 /dev/sda1
# crw-rw-rw- 1 root tty   5,  0 May 16 08:00 /dev/tty
# crw--w---- 1 root tty   4,  1 May 16 08:00 /dev/tty1

The first character is c (character) or b (block). The numbers separated by a comma in the size column are major, minor. /dev/sda (major 8, minor 0) and /dev/sda1 (major 8, minor 1) are both handled by the sd driver (major 8); minor 0 means the whole disk, minor 1 means partition 1.
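
The same type and major/minor information that ls prints can be read programmatically with stat(). The following is a minimal user-space sketch, assuming Linux with glibc-style major()/minor() macros; the struct dev_id type and the dev_id_from_path name are hypothetical helpers invented for this illustration.

```c
#include <sys/stat.h>
#include <sys/sysmacros.h>   /* major(), minor() */
#include <sys/types.h>

/* Decoded identity of a device node (hypothetical helper type). */
struct dev_id {
    char type;               /* 'c' = character, 'b' = block */
    unsigned int major_no;   /* selects the driver */
    unsigned int minor_no;   /* selects the instance within the driver */
};

/* Returns 0 on success, -1 if path is missing or is not a device node. */
int dev_id_from_path(const char *path, struct dev_id *out)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return -1;

    if (S_ISCHR(st.st_mode))
        out->type = 'c';
    else if (S_ISBLK(st.st_mode))
        out->type = 'b';
    else
        return -1;

    /* st_rdev is the device number the node refers to;
     * st_dev would be the filesystem that contains the node. */
    out->major_no = major(st.st_rdev);
    out->minor_no = minor(st.st_rdev);
    return 0;
}
```

Called on /dev/null, this should report type 'c' with major 1 and minor 3, matching the listing above. Regular files and directories are rejected because they carry no device number of their own.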

Major numbers are registered in the kernel (Documentation/admin-guide/devices.txt). Drivers can request a specific major number or ask for dynamic allocation:

/* Static allocation (specific major) */
int ret = register_chrdev_region(MKDEV(major, 0), num_minors, "my_device");

/* Dynamic allocation (kernel assigns major) */
dev_t dev;
int ret = alloc_chrdev_region(&dev, 0, num_minors, "my_device");
int major = MAJOR(dev);
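
Inside the kernel, MKDEV, MAJOR, and MINOR pack and unpack a 32-bit dev_t with 12 bits of major and 20 bits of minor (see include/linux/kdev_t.h). The sketch below mirrors that layout in user space so the arithmetic can be tried outside a module. The k_ names are made up for this example, and unlike the kernel macro, k_mkdev masks the minor as a safety measure.

```c
#include <stdint.h>

/* User-space mirror of the in-kernel dev_t layout: the major number
 * lives in the top 12 bits, the minor in the low 20 bits. */
#define K_MINORBITS 20
#define K_MINORMASK ((1U << K_MINORBITS) - 1)

typedef uint32_t k_dev_t;

/* Combine major and minor into one device number. */
static inline k_dev_t k_mkdev(unsigned int major_no, unsigned int minor_no)
{
    return ((k_dev_t)major_no << K_MINORBITS) | (minor_no & K_MINORMASK);
}

/* Extract the driver-selecting part. */
static inline unsigned int k_major(k_dev_t dev)
{
    return dev >> K_MINORBITS;
}

/* Extract the instance-selecting part. */
static inline unsigned int k_minor(k_dev_t dev)
{
    return dev & K_MINORMASK;
}
```

For /dev/sda1 (major 8, minor 1) the packed value is 0x800001. The dev_t that user space sees through stat() uses a different, wider encoding, which is why user-space code should use major() and minor() from <sys/sysmacros.h> rather than shifting by hand.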

Character Device Implementation

The full lifecycle for a character device driver:

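#include <linux/module.h>   /* THIS_MODULE, module_init/module_exit */
#include <linux/init.h>     /* __init, __exit */
#include <linux/fs.h>
#include <linux/cdev.h>
#include <linux/device.h>   /* class_create, device_create */
#include <linux/uaccess.h>  /* copy_to_user, copy_from_user */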

static dev_t dev_num;
static struct cdev my_cdev;
static struct class *my_class;

/* 1. Implement file_operations */
static int my_open(struct inode *inode, struct file *filp)
{
    /* filp->private_data can store per-fd state */
    /* Marks the file as non-seekable; propagate its return value */
    return nonseekable_open(inode, filp);
}

static int my_release(struct inode *inode, struct file *filp)
{
    return 0;
}

static ssize_t my_read(struct file *filp, char __user *buf,
                       size_t count, loff_t *f_pos)
{
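    static const char msg[] = "hello\n";
    size_t len = sizeof(msg) - 1;   /* exclude the trailing NUL */
    size_t to_copy;

    /* Report EOF once the message has been consumed so that
     * tools such as cat terminate instead of looping forever. */
    if (*f_pos >= len)
        return 0;
    to_copy = min(count, len - (size_t)*f_pos);

    /* copy_to_user: safely copy kernel data to user buffer */
    if (copy_to_user(buf, msg + *f_pos, to_copy))
        return -EFAULT;

    *f_pos += to_copy;
    return to_copy;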
}

static ssize_t my_write(struct file *filp, const char __user *buf,
                        size_t count, loff_t *f_pos)
{
    char kbuf[256];
    size_t to_copy = min(count, sizeof(kbuf) - 1);

    if (copy_from_user(kbuf, buf, to_copy))
        return -EFAULT;

    kbuf[to_copy] = '\0';
    pr_info("my_device: received: %s\n", kbuf);
    return to_copy;
}

static const struct file_operations my_fops = {
    .owner   = THIS_MODULE,
    .open    = my_open,
    .release = my_release,
    .read    = my_read,
    .write   = my_write,
    .llseek  = no_llseek,   /* removed in Linux 6.12; omit this line there (leaving it NULL has the same effect) */
};

/* 2. Register device number, cdev, and sysfs class */
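static int __init my_init(void)
{
    struct device *dev;
    int ret;

    /* Allocate major:minor range */
    ret = alloc_chrdev_region(&dev_num, 0, 1, "my_device");
    if (ret < 0)
        return ret;

    /* Initialize and add cdev to kernel */
    cdev_init(&my_cdev, &my_fops);
    my_cdev.owner = THIS_MODULE;
    ret = cdev_add(&my_cdev, dev_num, 1);
    if (ret < 0)
        goto err_chrdev;

    /* Create sysfs class → udev creates /dev/my_device.
     * Linux 6.4+ drops the first argument: class_create("my_device"). */
    my_class = class_create(THIS_MODULE, "my_device");
    if (IS_ERR(my_class)) {
        ret = PTR_ERR(my_class);
        goto err_cdev;
    }

    dev = device_create(my_class, NULL, dev_num, NULL, "my_device");
    if (IS_ERR(dev)) {
        ret = PTR_ERR(dev);
        goto err_class;
    }

    return 0;

    /* Unwind in reverse order of setup */
err_class:
    class_destroy(my_class);
err_cdev:
    cdev_del(&my_cdev);
err_chrdev:
    unregister_chrdev_region(dev_num, 1);
    return ret;
}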

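static void __exit my_exit(void)
{
    /* Tear down in reverse order of my_init */
    device_destroy(my_class, dev_num);
    class_destroy(my_class);
    cdev_del(&my_cdev);
    unregister_chrdev_region(dev_num, 1);
}

/* 3. Hook the init/exit functions into the module loader */
module_init(my_init);
module_exit(my_exit);
MODULE_LICENSE("GPL");   /* class_create/device_create are GPL-only exports */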

struct file_operations: The Driver's Interface Contract

struct file_operations is the vtable that connects VFS operations to driver code. Key callbacks:

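  ┌────────────────┬─────────────────────────────────┬───────────────────────────────────────────┐
  │ Callback       │ Triggered by                    │ Notes                                     │
  ├────────────────┼─────────────────────────────────┼───────────────────────────────────────────┤
  │ open           │ open() syscall                  │ Initialize per-fd state                   │
  │ release        │ Last close() on file descriptor │ Free per-fd state                         │
  │ read           │ read() / readv() / pread()      │ Must use copy_to_user                     │
  │ write          │ write() / writev() / pwrite()   │ Must use copy_from_user                   │
  │ unlocked_ioctl │ ioctl() syscall                 │ Device-specific commands                  │
  │ mmap           │ mmap() syscall                  │ Map device memory into user address space │
  │ poll           │ select() / poll() / epoll()     │ Report readability/writability            │
  │ llseek         │ lseek() syscall                 │ Reposition file offset                    │
  │ fsync          │ fsync() syscall                 │ Flush device buffers                      │
  └────────────────┴─────────────────────────────────┴───────────────────────────────────────────┘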
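
To make the vtable idea concrete, here is a toy user-space model of the dispatch. It is not kernel code: struct toy_fops, toy_vfs_read, toy_vfs_write, and the two example "drivers" are invented for illustration, and the callbacks take plain pointers instead of struct file and __user buffers. The one behavior borrowed from the real VFS is returning -EINVAL when the driver supplies no method for the operation.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>   /* ssize_t */

/* Toy stand-in for struct file_operations: a table of optional callbacks. */
struct toy_fops {
    ssize_t (*read)(char *buf, size_t count);
    ssize_t (*write)(const char *buf, size_t count);
};

/* Toy dispatchers playing the role of the VFS entry points:
 * they look up the callback and call it if the driver provided one. */
ssize_t toy_vfs_read(const struct toy_fops *fops, char *buf, size_t count)
{
    if (!fops->read)
        return -EINVAL;          /* driver did not implement read */
    return fops->read(buf, count);
}

ssize_t toy_vfs_write(const struct toy_fops *fops, const char *buf, size_t count)
{
    if (!fops->write)
        return -EINVAL;          /* driver did not implement write */
    return fops->write(buf, count);
}

/* A "driver" whose read fills the buffer with zeros. */
static ssize_t zero_read(char *buf, size_t count)
{
    memset(buf, 0, count);
    return (ssize_t)count;
}

/* A write callback that discards the data and reports it all consumed. */
static ssize_t discard_write(const char *buf, size_t count)
{
    (void)buf;
    return (ssize_t)count;
}

/* Two devices sharing the same dispatch code but different tables. */
const struct toy_fops zero_like_fops  = { .read = zero_read, .write = discard_write };
const struct toy_fops write_only_fops = { .write = discard_write };
```

The dispatch code never needs to know which driver it is talking to; behavior is selected entirely by which table the open file points at, which is the design that lets one read() syscall serve every device type.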

The ioctl Interface

ioctl (input/output control) is the escape hatch for device-specific commands that don't fit the read/write model. Setting a serial port's baud rate, querying a disk's geometry, or sending a custom command to a hardware encoder all go through ioctl.

ioctl command numbers are 32-bit values that encode four fields: the transfer direction, the argument size, a type (the driver's "magic" value), and a command number within that type:

/* Command number construction macros */
_IO(type, nr)              /* no argument */
_IOR(type, nr, data_type)  /* read from driver */
_IOW(type, nr, data_type)  /* write to driver */
_IOWR(type, nr, data_type) /* read + write */

/* Example: define commands for a temperature sensor driver */
#define TEMP_IOC_MAGIC    't'
#define TEMP_GET_CELSIUS  _IOR(TEMP_IOC_MAGIC, 0, int)
#define TEMP_SET_ALERT    _IOW(TEMP_IOC_MAGIC, 1, int)
#define TEMP_GET_ALERT    _IOWR(TEMP_IOC_MAGIC, 2, struct temp_alert)

/* In driver's unlocked_ioctl: */
static long temp_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
{
    int temp, alert_threshold;

    switch (cmd) {
    case TEMP_GET_CELSIUS:
        temp = read_hardware_temperature();
        return copy_to_user((int __user *)arg, &temp, sizeof(int)) ? -EFAULT : 0;

    case TEMP_SET_ALERT:
        if (copy_from_user(&alert_threshold, (int __user *)arg, sizeof(int)))
            return -EFAULT;
        set_hardware_alert(alert_threshold);
        return 0;

    default:
        return -ENOTTY; /* not a tty ioctl — standard "unknown ioctl" errno */
    }
}

The type field (a letter or number) acts as a namespace that keeps different drivers' command numbers from colliding. The kernel tree records the assigned values in Documentation/userspace-api/ioctl/ioctl-number.rst.
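
The encoding can be inspected from user space with the _IOC_DIR, _IOC_TYPE, _IOC_NR, and _IOC_SIZE macros from <linux/ioctl.h>. The sketch below rebuilds the temperature-sensor commands and decodes them. The layout of struct temp_alert is an assumption (the text above never defines it), and struct ioc_fields and ioc_decode are hypothetical helpers.

```c
#include <linux/ioctl.h>   /* _IO, _IOR, _IOW, _IOWR and the _IOC_* decoders */

/* Hypothetical payload for TEMP_GET_ALERT; assumed for this example. */
struct temp_alert {
    int threshold_celsius;
    int enabled;
};

#define TEMP_IOC_MAGIC    't'
#define TEMP_GET_CELSIUS  _IOR(TEMP_IOC_MAGIC, 0, int)
#define TEMP_SET_ALERT    _IOW(TEMP_IOC_MAGIC, 1, int)
#define TEMP_GET_ALERT    _IOWR(TEMP_IOC_MAGIC, 2, struct temp_alert)

/* The four fields packed into one ioctl command number. */
struct ioc_fields {
    unsigned int dir;    /* _IOC_NONE, _IOC_READ, _IOC_WRITE, or both */
    unsigned int type;   /* the driver's magic value */
    unsigned int nr;     /* command number within that type */
    unsigned int size;   /* sizeof the argument type */
};

/* Split a command number back into its fields. */
struct ioc_fields ioc_decode(unsigned long cmd)
{
    struct ioc_fields f = {
        .dir  = _IOC_DIR(cmd),
        .type = _IOC_TYPE(cmd),
        .nr   = _IOC_NR(cmd),
        .size = _IOC_SIZE(cmd),
    };
    return f;
}
```

Direction is named from the caller's point of view: _IOC_READ means user space reads data back from the driver. Because the size is part of the number, changing the argument struct changes the command value, which helps catch mismatched user-space and kernel headers.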


Classic Character Device Examples: /dev/null, /dev/zero, /dev/urandom

These are implemented in drivers/char/mem.c and are canonical examples of minimal character drivers:

/dev/null (major 1, minor 3): reads return 0 bytes (EOF immediately). Writes consume all bytes silently and return the count. Used to discard output: command > /dev/null.

/dev/zero (major 1, minor 5): reads return an infinite stream of zero bytes. Writes are discarded. Used to zero-fill files: dd if=/dev/zero of=file bs=1M count=100, or to create zero-initialized anonymous mappings.

/dev/urandom (major 1, minor 9): reads return cryptographically secure random bytes from the kernel's CSPRNG (ChaCha20-based since Linux 4.8, seeded from hardware entropy sources: CPU RDSEED, device interrupts, disk timing). Never blocks. /dev/random (minor 8) used to block until sufficient entropy was available; since Linux 5.6 it behaves like /dev/urandom after initial boot seeding.
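
The behavior described above can be checked from user space with plain open/read/write. This is a minimal sketch assuming a Linux system where the three nodes exist; dev_read, dev_write, and all_zero are hypothetical helper names.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Read up to count bytes from a device node.
 * Returns read()'s result, or -1 if the node cannot be opened. */
ssize_t dev_read(const char *path, void *buf, size_t count)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    ssize_t n = read(fd, buf, count);
    close(fd);
    return n;
}

/* Write count bytes to a device node.
 * Returns write()'s result, or -1 if the node cannot be opened. */
ssize_t dev_write(const char *path, const void *buf, size_t count)
{
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;
    ssize_t n = write(fd, buf, count);
    close(fd);
    return n;
}

/* Returns 1 if every byte in buf is zero, 0 otherwise. */
int all_zero(const unsigned char *buf, size_t count)
{
    for (size_t i = 0; i < count; i++)
        if (buf[i] != 0)
            return 0;
    return 1;
}
```

With a 64-byte buffer, dev_read on /dev/null returns 0 and leaves the buffer untouched, dev_read on /dev/zero returns 64 with all_zero true, and dev_write to either node returns 64.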


Block Device Implementation

Block devices are more complex because the kernel interposes a page cache between the device and user reads/writes. When a user reads from /dev/sda, the data may come from the page cache rather than triggering physical I/O. Writes go to the page cache and are written back to the device asynchronously by the kernel's writeback mechanism.

The block I/O path:

Application: read(fd, buf, 4096)
         │
         ▼
   VFS: generic_file_read_iter()
         │
         ▼
   Page cache: is page present?
   ├── YES: copy to user buffer, return
   └── NO: submit_bio() to block layer
              │
              ▼
         I/O scheduler (mq-deadline, kyber, none)
              │
              ▼
         blk-mq: multiqueue block layer
              │
              ▼
         Driver: .queue_rq callback
              │
              ▼
         Hardware: DMA to/from device
              │
              ▼
         Interrupt: bio completion callback
              │
              ▼
         Page cache populated, copy to user

The core structure for block I/O is struct bio:

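/* Abridged from include/linux/blk_types.h; field order and exact
 * types vary between kernel versions. */
struct bio {
    struct block_device *bi_bdev;    /* target device */
    unsigned int         bi_opf;     /* REQ_OP_READ, REQ_OP_WRITE, etc. (blk_opf_t in newer kernels) */
    struct bvec_iter     bi_iter;    /* bi_iter.bi_sector: start sector; bi_iter.bi_size: bytes remaining */
    struct bio_vec      *bi_io_vec;  /* array of (page, offset, len) tuples */
    unsigned short       bi_vcnt;    /* number of bio_vecs */
    bio_end_io_t        *bi_end_io;  /* completion callback */
    void                *bi_private; /* private data for the bio's owner */
    /* ... */
};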

A bio can span multiple non-contiguous physical pages (scatter-gather), which is essential for avoiding the overhead of physically contiguous allocation for large I/Os.
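
The (page, offset, len) bookkeeping can be illustrated with a toy user-space model that splits a byte range into per-page pieces, the way a buffer is described by a list of bio_vecs. This is a sketch under stated assumptions: struct toy_vec, toy_split, and the fixed 4096-byte TOY_PAGE_SIZE are invented for the example, and page numbers here are simply address / page size, whereas the pages in a real bio need not be adjacent.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096u

/* Toy analogue of struct bio_vec: one contiguous piece inside one page. */
struct toy_vec {
    uint64_t page;     /* page number (address / TOY_PAGE_SIZE) */
    uint32_t offset;   /* byte offset inside that page */
    uint32_t len;      /* bytes covered inside that page */
};

/* Split the byte range [addr, addr + len) into per-page pieces.
 * Returns the number of pieces written to vecs, or -1 if more than
 * max_vecs pieces would be needed. */
int toy_split(uint64_t addr, size_t len, struct toy_vec *vecs, int max_vecs)
{
    int n = 0;

    while (len > 0) {
        uint32_t offset = (uint32_t)(addr % TOY_PAGE_SIZE);
        uint32_t room   = TOY_PAGE_SIZE - offset;   /* bytes left in this page */
        uint32_t chunk  = len < room ? (uint32_t)len : room;

        if (n == max_vecs)
            return -1;

        vecs[n].page   = addr / TOY_PAGE_SIZE;
        vecs[n].offset = offset;
        vecs[n].len    = chunk;
        n++;

        addr += chunk;
        len  -= chunk;
    }
    return n;
}
```

A 6000-byte range starting 3000 bytes into page 3 splits into three pieces of 1096, 4096, and 808 bytes. Only the first and last pieces can be partial; every piece in between covers a whole page.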

The modern block driver interface uses blk-mq (multi-queue block layer, added in Linux 3.13) which maps to the multiple hardware queues that NVMe SSDs expose. A block driver registers a blk_mq_ops structure:

static const struct blk_mq_ops my_blk_mq_ops = {
    .queue_rq  = my_queue_rq,     /* submit request to hardware */
    .complete  = my_complete_rq,  /* finish a request after blk_mq_complete_request() */
    .init_hctx = my_init_hctx,    /* initialize hardware context */
};

udev Rules for Device Naming

udev rules create stable names and permissions for devices. Names assigned by the kernel alone follow enumeration order and are therefore unstable: which of two NICs becomes eth0 depends on probe order.

# /etc/udev/rules.d/99-my-devices.rules

# Char device: set permissions for /dev/my_device
SUBSYSTEM=="my_device", KERNEL=="my_device*", MODE="0666"

# Block device: create symlink for a specific disk by serial number
ACTION=="add", SUBSYSTEM=="block", \
    ATTRS{serial}=="WD-WCAWZ0123456", \
    SYMLINK+="disk/backup-drive"

# Network: rename NIC with specific MAC to eth-wan
ACTION=="add", SUBSYSTEM=="net", \
    ATTRS{address}=="00:1a:2b:3c:4d:5e", \
    NAME="eth-wan"

Production Examples

NVIDIA GPU: appears as a character device at /dev/nvidia0 (major 195, minor 0). The NVIDIA driver exposes an ioctl-heavy interface: essentially all GPU operations go through ioctl commands. libcuda.so, the user-space half of the driver, issues calls of the form ioctl(fd, NVIDIA_IOCTL_COMMAND, &params), where NVIDIA_IOCTL_COMMAND stands in for a driver-specific command number.

NVMe SSD: registers as a block device /dev/nvme0n1 and a character device /dev/nvme0 (for admin commands like nvme smart-log). The block device is used for data; the character device is used for NVMe-specific management commands (via ioctl with struct nvme_passthru_cmd).

Video4Linux (V4L2): webcams and video capture cards appear as /dev/videoX. The V4L2 API is entirely ioctl-based (VIDIOC_REQBUFS, VIDIOC_QUERYBUF, VIDIOC_QBUF, VIDIOC_DQBUF) with an mmap interface for zero-copy frame buffer access.


Debugging Notes

# List character and block devices with major/minor
ls -la /dev/ | grep -E '^[cb]'

# Show registered character device ranges
cat /proc/devices

# Monitor block I/O with blktrace
blktrace -d /dev/sda -o - | blkparse -i -

# Show block device queue parameters
cat /sys/block/sda/queue/scheduler
cat /sys/block/nvme0n1/queue/nr_requests

# Strace ioctl calls from a program
strace -e ioctl -p $(pgrep nvidia-smi)

# Test character device read
dd if=/dev/urandom of=/dev/null bs=1M count=100 status=progress

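Debugging read failures: copy_to_user returns the number of bytes it could not copy, so any non-zero result means part of the user buffer was not writable. Temporarily wrapping the call as WARN_ON(copy_to_user(...)) logs a stack trace that you can read with dmesg (or from /proc/kmsg). If the warning fires, the pointer or length passed by user space is invalid, so check the calling code. If read succeeds but returns wrong data, check the offset and length arithmetic in the driver instead.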
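
When a driver uses dynamic major allocation, the /proc/devices listing shown above is where user space finds the number the kernel picked. The following sketch parses text in that format; find_major is a hypothetical helper, and it takes a FILE * so it can be pointed at /proc/devices or at sample text.

```c
#include <stdio.h>
#include <string.h>

/* Look up a driver's major number in /proc/devices-style text.
 * section is "Character devices:" or "Block devices:".
 * Returns the major number, or -1 if name is not listed in that section. */
int find_major(FILE *fp, const char *section, const char *name)
{
    char line[256];
    int in_section = 0;

    while (fgets(line, sizeof line, fp)) {
        unsigned int major_no;
        char entry[128];

        line[strcspn(line, "\n")] = '\0';   /* strip the newline */

        if (strcmp(line, section) == 0) {
            in_section = 1;                 /* entered the requested section */
        } else if (strstr(line, "devices:")) {
            in_section = 0;                 /* some other section header */
        } else if (in_section &&
                   sscanf(line, "%u %127s", &major_no, entry) == 2 &&
                   strcmp(entry, name) == 0) {
            return (int)major_no;
        }
    }
    return -1;
}
```

A typical call is find_major(fp, "Character devices:", "my_device") with fp opened on /proc/devices. The section argument matters because character and block majors are separate number spaces, so the same number can appear in both sections for unrelated drivers.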


Security Implications

Device file permissions: /dev/sda is typically owned root:disk with permissions 0660. Any user in the disk group can read raw disk data, bypassing filesystem permissions. On multi-user systems, restricting membership of the disk group is critical.

ioctl validation: Every argument to ioctl arrives as an unsigned long and the driver must validate it before use. Failing to validate arg as a valid user-space pointer before dereferencing it as (struct foo __user *)arg leads to kernel exploits. The __user annotation (used by the sparse static analyzer) marks pointers to user memory that must go through copy_from_user/copy_to_user.

TOCTOU on ioctl: If an ioctl handler validates a field read from user memory and later reads the same field from user memory again (a double fetch), another thread of the calling process can change the value between the two reads. Always copy the entire struct to kernel space on entry and use only the kernel copy.
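
The copy-once pattern can be modeled in user space. In this sketch, memcpy stands in for copy_from_user, and the user pointer stands for memory that another thread could rewrite at any moment; struct toy_req and toy_handle are hypothetical names, not part of any kernel API.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical request layout shared between "user space" and the handler. */
struct toy_req {
    uint32_t len;        /* number of valid bytes in data */
    char     data[16];
};

/* Model of an ioctl handler that avoids a double fetch.
 * Returns the number of bytes copied to out, or -EINVAL. */
int toy_handle(const struct toy_req *user, char *out, size_t out_size)
{
    struct toy_req k;

    /* Single fetch: snapshot the whole struct into private storage. */
    memcpy(&k, user, sizeof k);

    /* Validate and use only the snapshot; 'user' is never read again,
     * so later changes to it cannot bypass the check. */
    if (k.len > sizeof k.data || k.len > out_size)
        return -EINVAL;

    memcpy(out, k.data, k.len);
    return (int)k.len;
}
```

The vulnerable variant would check user->len and then copy user->len bytes in a second read; the value used for the copy could then differ from the value that passed the check.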

Arbitrary mmap via driver: A driver that implements mmap and maps physical memory incorrectly can expose kernel memory to user space. /dev/mem, which offers access to all of physical memory, is restricted when the kernel is built with CONFIG_STRICT_DEVMEM: system RAM is off limits, and only device regions such as PCI and BIOS memory can be mapped.


Performance Implications

Character device read/write overhead: Each read/write syscall to a character device goes through copy_to_user/copy_from_user. These routines are heavily optimized, but every call still copies the data once and can take a page fault if the user buffer is not yet faulted in. For high-throughput data acquisition, mmap the device memory directly to avoid the copies.

Block device queue depth: NVMe SSDs support thousands of concurrent I/O operations (queue depth). blk-mq can match this with per-CPU software queues mapped onto the device's hardware queues. For highly parallel workloads, raising /sys/block/nvme0n1/queue/nr_requests (for example, to 1023) can improve throughput; the default and the permitted maximum depend on the driver and the active I/O scheduler, so read the current value first.

Page cache amplification: Writing 1 byte to a file on a block device causes the kernel to read the entire 4096-byte page into cache, modify it, and write it back — a read-modify-write cycle that amplifies small writes. For databases, O_DIRECT bypasses the page cache, letting the database manage its own buffer pool.
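
The amplification is simple arithmetic, worked through in the sketch below. Both function names are hypothetical. The alignment check reflects the usual O_DIRECT requirement that offset and length be multiples of the logical block size; it is a simplification, since the buffer address typically has to be aligned as well.

```c
#include <stdint.h>

/* Bytes of page-sized I/O touched by a buffered write of len bytes
 * starting at offset: every page the range overlaps counts in full. */
uint64_t page_io_bytes(uint64_t offset, uint64_t len, uint64_t page_size)
{
    if (len == 0 || page_size == 0)
        return 0;

    uint64_t first = offset / page_size;               /* first page touched */
    uint64_t last  = (offset + len - 1) / page_size;   /* last page touched */
    return (last - first + 1) * page_size;
}

/* Returns 1 if offset and len are both multiples of block_size,
 * the usual precondition for an O_DIRECT transfer; 0 otherwise. */
int direct_io_aligned(uint64_t offset, uint64_t len, uint64_t block_size)
{
    return block_size != 0 &&
           offset % block_size == 0 &&
           len % block_size == 0;
}
```

A 1-byte write at offset 100 touches one 4096-byte page, an amplification of 4096x. A 2-byte write at offset 4095 straddles a page boundary and touches 8192 bytes.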


Failure Modes

  • EFAULT on copy_to_user: User passed an invalid pointer (null, unmapped, or read-only mapping). Driver should return -EFAULT and not crash.
  • ENXIO on open: Device not present (e.g., /dev/sda opened after disk removal, or a character device node whose cdev was removed with cdev_del). The open fails gracefully instead of reaching driver code.
  • I/O error on bio completion: The block device signals EIO, which propagates through the page cache to the application as a read/write error. Redundant layers can mask it: md RAID reads from another mirror or rebuilds from parity, and dm-multipath retries on an alternative path.
  • Queue stall: The block device's request queue is full (hardware queue depth exhausted). The block layer applies back-pressure: processes calling write() block until queue space frees. With asynchronous I/O (io_uring), the submitting thread need not block; requests that cannot be issued immediately are deferred and issued as the hardware drains.
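
The EFAULT case in the list above is easy to provoke from user space: hand the driver a buffer it is not allowed to write. This sketch maps one page with PROT_NONE and asks /dev/zero to fill it. The function name read_into_bad_buffer is hypothetical, and the code assumes Linux with MAP_ANONYMOUS available.

```c
#define _DEFAULT_SOURCE      /* for MAP_ANONYMOUS under strict -std modes */
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Ask the driver behind path to fill a buffer user space cannot access.
 * Returns the errno from read(), 0 if the read unexpectedly succeeds,
 * or -1 if the setup (open or mmap) fails. */
int read_into_bad_buffer(const char *path)
{
    long page = sysconf(_SC_PAGESIZE);
    int result = -1;
    int fd;
    void *buf;

    if (page <= 0)
        return -1;

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* PROT_NONE: the address range exists but permits no access. */
    buf = mmap(NULL, (size_t)page, PROT_NONE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf != MAP_FAILED) {
        ssize_t n = read(fd, buf, 16);
        result = (n < 0) ? errno : 0;
        munmap(buf, (size_t)page);
    }

    close(fd);
    return result;
}
```

Calling read_into_bad_buffer("/dev/zero") should return EFAULT: the kernel's user-copy routine faults on the inaccessible page, the driver reports the failure, and the process sees an error code rather than a crash.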

Modern Usage

The transition from the legacy single-queue request_fn API to blk-mq is complete: NVMe and SATA (via SCSI) drivers use blk-mq, and the legacy request_fn path was removed in Linux 5.0.

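io_uring (Linux 5.1) provides a pair of shared-memory ring buffers for submitting I/O and collecting completions, so many operations can be batched into one syscall, or submitted with none when a kernel polling thread (SQPOLL) is used. For high-IOPS NVMe workloads (1M+ IOPS), this removes most of the per-I/O syscall cost, which is a major overhead at that rate.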

Character devices increasingly expose mmap-based zero-copy interfaces for high-bandwidth use cases (DPDK, SPDK, GPU memory, frame buffers) to avoid the cost of copying data between kernel and user space.


Future Directions

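io_uring and passthrough: Linux 5.19 added io_uring NVMe passthrough, which lets applications submit native NVMe commands through io_uring to the kernel driver's generic character device (for example, /dev/ng0n1). This offers part of what user-space NVMe drivers such as SPDK provide without detaching the device from the kernel.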

Rust character device drivers: The Rust binding layer for kernel modules aims to provide safe wrappers for file_operations callbacks, so that safe Rust code cannot dereference a user-space pointer without going through the checked copy APIs.


Exercises

  1. Write a character device driver that implements a simple in-kernel FIFO: writes add data to the FIFO, reads consume it. Block on empty reads using wait_queue_head_t.
  2. Implement poll in your FIFO driver so that select()/poll() from user space correctly reports readability.
  3. Add an ioctl command FIFO_CLEAR that empties the FIFO atomically using a spinlock.
  4. Write a user-space program that mmap's /dev/zero and measures the time to fault in 1GB of pages vs. malloc+memset.
  5. Use blktrace to capture a complete I/O trace during a fio benchmark on an NVMe device. Identify the distribution of I/O sizes and latencies.

References

  • include/linux/fs.h: struct file_operations, struct inode, struct file
  • include/linux/cdev.h: character device registration
  • include/linux/blk-mq.h: multiqueue block layer
  • drivers/char/mem.c: /dev/null, /dev/zero, /dev/urandom implementation
  • Linux Device Drivers, 3rd Edition, Chapters 3 and 16
  • Jonathan Corbet, "Rethinking the block I/O layer", LWN.net, 2013 (blk-mq introduction)
  • Documentation/block/: block layer documentation
  • Documentation/userspace-api/ioctl/: ioctl number registry and guidelines