03 — Character and Block Devices
Technical Overview
Linux divides the device interface into three fundamental types: character devices, block devices, and network devices. This classification is not merely organizational — it determines the data model, the kernel subsystems involved, the buffering strategy, and the interface exposed to userspace. Understanding which category a device belongs to explains how it appears in /dev, how applications read and write to it, and which kernel APIs the driver must implement.
Character devices expose a byte stream with no inherent block structure. Most are read sequentially, though a driver may support seeking; a read or write can transfer any number of bytes, down to a single byte, if the driver allows it. Block devices expose fixed-size blocks (typically 512 or 4096 bytes) and support random access. Network devices are special: they are packet-based, do not appear in /dev at all, and are accessed through the socket API.
Prerequisites
- Linux driver model (01-driver-model.md)
- Kernel module basics (02-kernel-modules.md)
- Understanding of file descriptor semantics
- VFS (Virtual Filesystem Switch) basics
Device Type Comparison
┌─────────────────┬──────────────────┬──────────────────┬──────────────────┐
│ Property │ Character Device │ Block Device │ Network Device │
├─────────────────┼──────────────────┼──────────────────┼──────────────────┤
│ Data unit │ Byte stream │ Fixed-size block │ Packet │
│ Seekable │ Maybe (tty: no) │ Yes │ No │
│ Buffered │ No (or ring buf) │ Yes (page cache) │ sk_buff queue │
│ Appears in /dev │ Yes │ Yes │ No │
│ Access via │ open/read/write │ open/read/write │ socket syscalls │
│ Examples │ /dev/tty, /dev/ │ /dev/sda, │ eth0, lo, │
│ │ random, /dev/ │ /dev/nvme0n1, │ wlan0 │
│ │ video0 │ /dev/md0 │ │
└─────────────────┴──────────────────┴──────────────────┴──────────────────┘
Major and Minor Numbers
Every device in /dev is identified by a major number (identifies the driver) and a minor number (identifies the specific device instance managed by that driver).
ls -la /dev/sda* /dev/tty* /dev/null /dev/zero /dev/urandom
# crw-rw-rw- 1 root root 1, 3 May 16 08:00 /dev/null
# crw-rw-rw- 1 root root 1, 5 May 16 08:00 /dev/zero
# crw-rw-rw- 1 root root 1, 9 May 16 08:00 /dev/urandom
# brw-rw---- 1 root disk 8, 0 May 16 08:00 /dev/sda
# brw-rw---- 1 root disk 8, 1 May 16 08:00 /dev/sda1
# crw-rw-rw- 1 root tty 5, 0 May 16 08:00 /dev/tty
# crw--w---- 1 root tty 4, 1 May 16 08:00 /dev/tty1
The first character is c (character) or b (block). The numbers separated by a comma in the size column are major, minor. /dev/sda (major 8, minor 0) and /dev/sda1 (major 8, minor 1) are both handled by the sd driver (major 8); minor 0 means the whole disk, minor 1 means partition 1.
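The same information that ls prints can be read programmatically. The following is a minimal userspace sketch, assuming a glibc-style `<sys/sysmacros.h>`; the helper name `dev_ids` is hypothetical and only illustrates how `stat()` exposes the device type and the major/minor pair.

```c
#include <sys/stat.h>
#include <sys/sysmacros.h>   /* major(), minor() */

/* Hypothetical helper: reports the node type ('c' or 'b') and its
 * major/minor numbers. Returns 0 on success, -1 if stat() fails or
 * the path is not a device node. */
int dev_ids(const char *path, char *type, unsigned int *ma, unsigned int *mi)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return -1;
    if (S_ISCHR(st.st_mode))
        *type = 'c';
    else if (S_ISBLK(st.st_mode))
        *type = 'b';
    else
        return -1;
    /* st_rdev is the device number of the node itself;
     * st_dev would be the device that holds the file. */
    *ma = major(st.st_rdev);
    *mi = minor(st.st_rdev);
    return 0;
}
```

Called on /dev/null, this should report type 'c' with major 1 and minor 3, matching the listing above.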
Major numbers are registered in the kernel (Documentation/admin-guide/devices.txt). Drivers can request a specific major number or ask for dynamic allocation:
/* Static allocation (specific major) */
int ret = register_chrdev_region(MKDEV(major, 0), num_minors, "my_device");
/* Dynamic allocation (kernel assigns major) */
dev_t dev;
int ret = alloc_chrdev_region(&dev, 0, num_minors, "my_device");
int major = MAJOR(dev);
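Inside the kernel, MKDEV, MAJOR, and MINOR are plain bit operations on a 32-bit dev_t: 12 bits of major above 20 bits of minor (include/linux/kdev_t.h). The sketch below restates that layout in userspace C. The K_ prefix is an assumption of this example, chosen to avoid clashing with libc's own major()/minor(), which use a different, wider encoding.

```c
/* Userspace restatement of the kernel-internal dev_t layout.
 * These K_* names are illustrative, not kernel identifiers. */
#define K_MINORBITS 20
#define K_MINORMASK ((1U << K_MINORBITS) - 1)

#define K_MAJOR(dev)    ((unsigned int)((dev) >> K_MINORBITS))
#define K_MINOR(dev)    ((unsigned int)((dev) & K_MINORMASK))
#define K_MKDEV(ma, mi) (((unsigned int)(ma) << K_MINORBITS) | (unsigned int)(mi))
```

With this layout, the sd driver's first partition (major 8, minor 1) packs into the single value 0x800001.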
Character Device Implementation
The full lifecycle for a character device driver:
#include <linux/module.h>
#include <linux/init.h>
#include <linux/fs.h>
#include <linux/cdev.h>
#include <linux/device.h>
#include <linux/uaccess.h>
static dev_t dev_num;
static struct cdev my_cdev;
static struct class *my_class;
/* 1. Implement file_operations */
static int my_open(struct inode *inode, struct file *filp)
{
    /* filp->private_data can store per-fd state */
    /* Mark the file as not supporting lseek/pread/pwrite and propagate the result */
    return nonseekable_open(inode, filp);
}
static int my_release(struct inode *inode, struct file *filp)
{
return 0;
}
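static ssize_t my_read(struct file *filp, char __user *buf,
                       size_t count, loff_t *f_pos)
{
    static const char kbuf[] = "hello\n";
    size_t len = sizeof(kbuf) - 1;   /* exclude the trailing NUL */
    size_t to_copy;

    if (*f_pos >= len)
        return 0;                    /* EOF; without this, cat never terminates */
    to_copy = min_t(size_t, count, len - *f_pos);
    /* copy_to_user: safely copy kernel data to user buffer */
    if (copy_to_user(buf, kbuf + *f_pos, to_copy))
        return -EFAULT;
    *f_pos += to_copy;
    return to_copy;
}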
static ssize_t my_write(struct file *filp, const char __user *buf,
size_t count, loff_t *f_pos)
{
char kbuf[256];
size_t to_copy = min(count, sizeof(kbuf) - 1);
if (copy_from_user(kbuf, buf, to_copy))
return -EFAULT;
kbuf[to_copy] = '\0';
pr_info("my_device: received: %s\n", kbuf);
return to_copy;
}
static const struct file_operations my_fops = {
.owner = THIS_MODULE,
.open = my_open,
.release = my_release,
.read = my_read,
.write = my_write,
.llseek = no_llseek, /* removed in recent kernels (6.12+); drop this line there and rely on nonseekable_open() */
};
/* 2. Register device number, cdev, and sysfs class */
static int __init my_init(void)
{
    struct device *dev;
    int ret;

    /* Allocate major:minor range */
    ret = alloc_chrdev_region(&dev_num, 0, 1, "my_device");
    if (ret < 0)
        return ret;
    /* Initialize and add cdev to kernel */
    cdev_init(&my_cdev, &my_fops);
    my_cdev.owner = THIS_MODULE;
    ret = cdev_add(&my_cdev, dev_num, 1);
    if (ret < 0)
        goto err_chrdev;
    /* Create sysfs class so that udev creates /dev/my_device.
     * Kernels 6.4 and later take only the name: class_create("my_device"). */
    my_class = class_create(THIS_MODULE, "my_device");
    if (IS_ERR(my_class)) {
        ret = PTR_ERR(my_class);
        goto err_cdev;
    }
    dev = device_create(my_class, NULL, dev_num, NULL, "my_device");
    if (IS_ERR(dev)) {
        ret = PTR_ERR(dev);
        goto err_class;
    }
    return 0;
err_class:
    class_destroy(my_class);
err_cdev:
    cdev_del(&my_cdev);
err_chrdev:
    unregister_chrdev_region(dev_num, 1);
    return ret;
}
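static void __exit my_exit(void)
{
    /* Tear down in reverse order of my_init */
    device_destroy(my_class, dev_num);
    class_destroy(my_class);
    cdev_del(&my_cdev);
    unregister_chrdev_region(dev_num, 1);
}
module_init(my_init);
module_exit(my_exit);
MODULE_LICENSE("GPL");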
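Once the module is loaded and udev has created the node, the driver can be exercised from userspace with ordinary file calls. The sketch below is one way to do that; the function name `chardev_roundtrip` is hypothetical, and it takes the path as a parameter so it can be pointed at /dev/my_device or at any other character device.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Writes msg to the device node at path, then reads up to cap bytes
 * back on the same descriptor. Returns the number of bytes read,
 * or -1 if open() or write() fails. */
ssize_t chardev_roundtrip(const char *path, const char *msg,
                          char *reply, size_t cap)
{
    ssize_t n;
    int fd = open(path, O_RDWR);      /* reaches the driver's .open */

    if (fd < 0)
        return -1;
    n = write(fd, msg, strlen(msg));  /* reaches the driver's .write */
    if (n < 0) {
        close(fd);
        return -1;
    }
    n = read(fd, reply, cap);         /* reaches the driver's .read */
    close(fd);                        /* last close triggers .release */
    return n;
}
```

Against the driver above, the write should show up in the kernel log via pr_info and the read should return the driver's greeting. The same function works unchanged on /dev/zero or /dev/null, which makes it easy to sanity-check before the module is loaded.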
struct file_operations: The Driver's Interface Contract
struct file_operations is the vtable that connects VFS operations to driver code. Key callbacks:
| Callback | Triggered by | Notes |
|---|---|---|
| open | open() syscall | Initialize per-fd state |
| release | Last reference to the open file is dropped (usually the final close()) | Free per-fd state |
| read | read() / readv() / pread() | Must use copy_to_user |
| write | write() / writev() / pwrite() | Must use copy_from_user |
| unlocked_ioctl | ioctl() syscall | Device-specific commands |
| mmap | mmap() syscall | Map device memory into user address space |
| poll | select() / poll() / epoll() | Report readability/writability |
| llseek | lseek() syscall | Reposition file offset |
| fsync | fsync() syscall | Flush device buffers |
The ioctl Interface
ioctl (input/output control) is the escape hatch for device-specific commands that don't fit the read/write model. Setting a serial port's baud rate, querying a disk's geometry, or sending a custom command to a hardware encoder all go through ioctl.
ioctl command numbers are constructed from a 32-bit encoding that embeds direction, argument size, type (the magic value), and a per-driver sequence number:
/* Command number construction macros */
_IO(type, nr) /* no argument */
_IOR(type, nr, data_type) /* read from driver */
_IOW(type, nr, data_type) /* write to driver */
_IOWR(type, nr, data_type) /* read + write */
/* Example: define commands for a temperature sensor driver */
#define TEMP_IOC_MAGIC 't'
#define TEMP_GET_CELSIUS _IOR(TEMP_IOC_MAGIC, 0, int)
#define TEMP_SET_ALERT _IOW(TEMP_IOC_MAGIC, 1, int)
#define TEMP_GET_ALERT _IOWR(TEMP_IOC_MAGIC, 2, struct temp_alert)
/* In driver's unlocked_ioctl: */
static long temp_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
{
int temp, alert_threshold;
switch (cmd) {
case TEMP_GET_CELSIUS:
temp = read_hardware_temperature();
return copy_to_user((int __user *)arg, &temp, sizeof(int)) ? -EFAULT : 0;
case TEMP_SET_ALERT:
if (copy_from_user(&alert_threshold, (int __user *)arg, sizeof(int)))
return -EFAULT;
set_hardware_alert(alert_threshold);
return 0;
default:
return -ENOTTY; /* not a tty ioctl — standard "unknown ioctl" errno */
}
}
The type field (a letter or number) is a namespace to avoid collisions between different drivers' command numbers. The kernel tracks assigned values in a registry at Documentation/userspace-api/ioctl/ioctl-number.rst.
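The encoding can be inspected with the decode macros that accompany _IOR and _IOW in `<linux/ioctl.h>`. The sketch below reuses the temperature-sensor command definitions from this section; the `ioc_fields` struct and `ioc_decode` function are hypothetical names introduced only for illustration.

```c
#include <linux/ioctl.h>   /* _IOR, _IOW, _IOC_DIR, _IOC_TYPE, _IOC_NR, _IOC_SIZE */

#define TEMP_IOC_MAGIC   't'
#define TEMP_GET_CELSIUS _IOR(TEMP_IOC_MAGIC, 0, int)
#define TEMP_SET_ALERT   _IOW(TEMP_IOC_MAGIC, 1, int)

/* The four fields packed into every ioctl command number. */
struct ioc_fields {
    unsigned int dir;    /* _IOC_NONE, _IOC_READ, _IOC_WRITE, or both */
    unsigned int type;   /* driver's magic value */
    unsigned int nr;     /* sequence number within the driver */
    unsigned int size;   /* sizeof the argument type */
};

struct ioc_fields ioc_decode(unsigned int cmd)
{
    struct ioc_fields f = {
        .dir  = _IOC_DIR(cmd),
        .type = _IOC_TYPE(cmd),
        .nr   = _IOC_NR(cmd),
        .size = _IOC_SIZE(cmd),
    };
    return f;
}
```

Decoding TEMP_GET_CELSIUS should yield direction _IOC_READ (data flows from the driver to the caller), type 't', number 0, and size sizeof(int). The numeric values of the direction constants vary by architecture, so comparing against the symbolic names is the safer habit.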
Classic Character Device Examples: /dev/null, /dev/zero, /dev/urandom
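/dev/null and /dev/zero are implemented in drivers/char/mem.c, which also registers /dev/urandom (its file operations live in drivers/char/random.c). They are canonical examples of minimal character drivers: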
/dev/null (major 1, minor 3): reads return 0 bytes (EOF immediately). Writes consume all bytes silently and return the count. Used to discard output: command > /dev/null.
/dev/zero (major 1, minor 5): reads return an infinite stream of zero bytes. Writes are discarded. Used to zero-fill files: dd if=/dev/zero of=file bs=1M count=100, or to create zero-initialized anonymous mappings.
/dev/urandom (major 1, minor 9): reads return cryptographically secure random bytes from the kernel's CSPRNG (ChaCha20-based since Linux 4.8, seeded from hardware entropy sources: CPU RDSEED, device interrupts, disk timing). Never blocks. /dev/random (minor 8) used to block until sufficient entropy was available; since Linux 5.6 it behaves like /dev/urandom after initial boot seeding.
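The behaviours described above are easy to observe from userspace. The following is a small sketch with two hypothetical helpers, `read_some` and `write_some`, that report exactly what read() and write() return for a given device node.

```c
#include <fcntl.h>
#include <unistd.h>

/* Returns what a single read() of up to n bytes from path reports,
 * or -1 if the node cannot be opened or the read fails. */
ssize_t read_some(const char *path, unsigned char *buf, size_t n)
{
    ssize_t got;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    got = read(fd, buf, n);
    close(fd);
    return got;
}

/* Returns what a single write() of n bytes to path reports, or -1. */
ssize_t write_some(const char *path, const void *buf, size_t n)
{
    ssize_t put;
    int fd = open(path, O_WRONLY);

    if (fd < 0)
        return -1;
    put = write(fd, buf, n);
    close(fd);
    return put;
}
```

A read from /dev/null should return 0 (EOF), a 16-byte read from /dev/zero should return 16 zero bytes, and a write to /dev/null should report the full count even though nothing is stored.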
Block Device Implementation
Block devices are more complex because the kernel interposes a page cache between the device and user reads/writes. When a user reads from /dev/sda, the data may come from the page cache rather than triggering physical I/O. Writes go to the page cache and are written back to the device asynchronously by the kernel's writeback mechanism.
The block I/O path:
Application: read(fd, buf, 4096)
│
▼
VFS: generic_file_read_iter()
│
▼
Page cache: is page present?
├── YES: copy to user buffer, return
└── NO: submit_bio() to block layer
│
▼
I/O scheduler (mq-deadline, kyber, none)
│
▼
blk-mq: multiqueue block layer
│
▼
Driver: .queue_rq callback
│
▼
Hardware: DMA to/from device
│
▼
Interrupt: bio completion callback
│
▼
Page cache populated, copy to user
The core structure for block I/O is struct bio:
struct bio {
struct block_device *bi_bdev; /* target device */
unsigned int bi_opf; /* REQ_OP_READ, REQ_OP_WRITE, etc. */
struct bvec_iter bi_iter; /* bi_iter.bi_sector: start sector */
struct bio_vec *bi_io_vec; /* array of (page, offset, len) tuples */
unsigned short bi_vcnt; /* number of bio_vecs */
bio_end_io_t *bi_end_io; /* completion callback */
void *bi_private; /* driver-private data */
// ...
};
A bio can span multiple non-contiguous physical pages (scatter-gather), which is essential for avoiding the overhead of physically contiguous allocation for large I/Os.
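The scatter-gather idea has a direct userspace analogue: struct iovec plays roughly the role that struct bio_vec plays inside the kernel. The sketch below is only an analogy, not block-layer code. It uses readv() to fill three separate buffers in one call, and the function name `scatter_read` is hypothetical.

```c
#include <fcntl.h>
#include <sys/uio.h>
#include <unistd.h>

/* Fills three independent buffers with a single readv() call.
 * Returns the total number of bytes read, or -1 on failure. */
ssize_t scatter_read(const char *path,
                     char *a, size_t alen,
                     char *b, size_t blen,
                     char *c, size_t clen)
{
    /* Each entry is a (base, length) pair, much like a bio_vec's
     * (page, offset, len) tuple. */
    struct iovec iov[3] = {
        { .iov_base = a, .iov_len = alen },
        { .iov_base = b, .iov_len = blen },
        { .iov_base = c, .iov_len = clen },
    };
    ssize_t n;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    n = readv(fd, iov, 3);
    close(fd);
    return n;
}
```

As with a bio, the buffers do not need to be adjacent in memory; the kernel walks the vector and treats it as one logical transfer.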
The modern block driver interface uses blk-mq (multi-queue block layer, added in Linux 3.13) which maps to the multiple hardware queues that NVMe SSDs expose. A block driver registers a blk_mq_ops structure:
static const struct blk_mq_ops my_blk_mq_ops = {
.queue_rq = my_queue_rq, /* submit request to hardware */
.complete = my_complete_rq, /* called from interrupt context */
.init_hctx = my_init_hctx, /* initialize hardware context */
};
udev Rules for Device Naming
udev rules create stable names and permissions for devices. Without custom rules, names are assigned by kernel enumeration order (unstable: which of two NICs is eth0 depends on probe order).
# /etc/udev/rules.d/99-my-devices.rules
# Char device: set permissions for /dev/my_device
SUBSYSTEM=="my_device", KERNEL=="my_device*", MODE="0666"
# Block device: create symlink for a specific disk by serial number
ACTION=="add", SUBSYSTEM=="block", \
ATTRS{serial}=="WD-WCAWZ0123456", \
SYMLINK+="disk/backup-drive"
# Network: rename NIC with specific MAC to eth-wan
ACTION=="add", SUBSYSTEM=="net", \
ATTRS{address}=="00:1a:2b:3c:4d:5e", \
NAME="eth-wan"
Production Examples
NVIDIA GPU: appears as a character device at /dev/nvidia0 (major 195, minor 0). The NVIDIA driver exposes an ioctl-heavy interface: essentially all GPU operations go through ioctl commands. libcuda.so, the userspace half of the driver, issues calls of the form ioctl(fd, NVIDIA_IOCTL_COMMAND, &params).
NVMe SSD: registers as a block device /dev/nvme0n1 and a character device /dev/nvme0 (for admin commands like nvme smart-log). The block device is used for data; the character device is used for NVMe-specific management commands (via ioctl with struct nvme_passthru_cmd).
Video4Linux (V4L2): webcams and video capture cards appear as /dev/videoX. The V4L2 API is entirely ioctl-based (VIDIOC_REQBUFS, VIDIOC_QUERYBUF, VIDIOC_QBUF, VIDIOC_DQBUF) with an mmap interface for zero-copy frame buffer access.
Debugging Notes
# List character and block devices with major/minor
ls -la /dev/ | grep -E '^[cb]'
# Show registered character device ranges
cat /proc/devices
# Monitor block I/O with blktrace
blktrace -d /dev/sda -o - | blkparse -i -
# Show block device queue parameters
cat /sys/block/sda/queue/scheduler
cat /sys/block/nvme0n1/queue/nr_requests
# Strace ioctl calls from a program
strace -e ioctl -p $(pgrep nvidia-smi)
# Test character device read
dd if=/dev/urandom of=/dev/null bs=1M count=100 status=progress
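Debugging read returning wrong data: Temporarily wrap the call as WARN_ON(copy_to_user(...)) and check the kernel log (dmesg, or /proc/kmsg). copy_to_user returns the number of bytes it could not copy, so a non-zero result means part of the user buffer is invalid or not writable; check the user-space code.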
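When checking whether a driver registered its device number, the /proc/devices listing can also be parsed from a program. The sketch below assumes the usual layout of that file: a "Character devices:" section and a "Block devices:" section, each with "major name" lines. The function name `lookup_major_in` is hypothetical; it takes an already-open stream so it can be fed fopen("/proc/devices", "r") or a saved copy.

```c
#include <stdio.h>
#include <string.h>

/* Scans a /proc/devices-style stream for a driver name within the
 * character ('c') or block ('b') section.
 * Returns the major number, or -1 if the name is not listed. */
int lookup_major_in(FILE *f, char kind, const char *name)
{
    char line[128], drv[64];
    char section = 0;
    int maj;

    while (fgets(line, sizeof line, f)) {
        if (strncmp(line, "Character", 9) == 0) {
            section = 'c';
            continue;
        }
        if (strncmp(line, "Block", 5) == 0) {
            section = 'b';
            continue;
        }
        if (section == kind &&
            sscanf(line, "%d %63s", &maj, drv) == 2 &&
            strcmp(drv, name) == 0)
            return maj;
    }
    return -1;
}
```

For the example driver earlier in this chapter, looking up "my_device" in the character section should return the dynamically assigned major after the module is loaded, and -1 before.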
Security Implications
Device file permissions: /dev/sda is typically owned root:disk with permissions 0660. Any user in the disk group can read raw disk data, bypassing filesystem permissions. On multi-user systems, ensuring disk group membership is restricted is critical.
ioctl validation: Every argument to ioctl arrives as an unsigned long and the driver must validate it before use. Failing to validate arg as a valid user-space pointer before dereferencing it as (struct foo __user *)arg leads to kernel exploits. The __user annotation (used by the sparse static analyzer) marks pointers to user memory that must go through copy_from_user/copy_to_user.
TOCTOU on ioctl: If an ioctl copies a struct from user space and then re-reads fields from user space later, a concurrent mprotect + write can change the value between reads. Always copy the entire struct to kernel space at entry and use only the kernel copy.
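The single-fetch rule can be modelled in plain C. The sketch below is a userspace simulation, not driver code: `fetch_from_user` stands in for copy_from_user (in a real driver the source would be a __user pointer), and `struct temp_req` and `handle_req` are hypothetical names.

```c
#include <string.h>

struct temp_req {            /* hypothetical ioctl argument layout */
    unsigned int len;        /* number of valid bytes in data[] */
    char data[64];
};

/* Stand-in for copy_from_user(); returns 0 on success like the real one. */
static int fetch_from_user(void *dst, const void *src, size_t n)
{
    memcpy(dst, src, n);
    return 0;
}

/* Single-fetch pattern: snapshot the whole struct once, then validate
 * and use only the snapshot. Returns the accepted payload length,
 * or -1 if the length field is out of range. */
int handle_req(const struct temp_req *user_ptr, struct temp_req *snapshot)
{
    if (fetch_from_user(snapshot, user_ptr, sizeof(*snapshot)))
        return -1;
    if (snapshot->len > sizeof(snapshot->data))
        return -1;
    /* From here on, user_ptr is never touched again, so later
     * changes to user memory cannot invalidate the check above. */
    return (int)snapshot->len;
}
```

The vulnerable variant would validate user_ptr->len and then read user_ptr->len a second time when copying the payload, which gives another thread a window to enlarge it in between.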
Arbitrary mmap via driver: A driver that implements mmap and maps physical memory incorrectly can expose kernel memory to user space. /dev/mem (physical memory access) is the classic example: kernels built with CONFIG_STRICT_DEVMEM refuse to map ordinary system RAM through it and permit only device regions such as PCI space and BIOS areas.
Performance Implications
Character device read/write overhead: Each read/write syscall to a character device invokes copy_to_user/copy_from_user, which use architecture-optimized copy routines but still pay for the copy itself and risk page faults (if the user buffer is not faulted in). For high-throughput data acquisition, mmap the device memory directly to avoid copies.
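The mmap approach can be tried on any Linux system with /dev/zero, a character device whose mmap handler hands out zero-filled private pages. The sketch below uses it as a stand-in; a real device driver would map its own memory in its .mmap callback. The function name `map_zero` is hypothetical.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Maps len bytes of /dev/zero as private, writable memory.
 * Returns the mapping, or NULL on failure. The caller releases
 * it with munmap(ptr, len). */
void *map_zero(size_t len)
{
    void *p;
    int fd = open("/dev/zero", O_RDWR);

    if (fd < 0)
        return NULL;
    p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    close(fd);   /* the mapping remains valid after the fd is closed */
    return p == MAP_FAILED ? NULL : p;
}
```

After the call, loads and stores on the returned pointer go straight to memory with no read() or write() syscall and no intermediate copy; the only kernel involvement is the page fault on first touch of each page.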
Block device queue depth: NVMe SSDs support thousands of concurrent I/O operations (queue depth). The kernel's blk-mq allows matching this with per-CPU hardware queues. Setting /sys/block/nvme0n1/queue/nr_requests to 1023 (from the default 256) can improve throughput for workloads with high parallelism.
Page cache amplification: Writing 1 byte to a file on a block device causes the kernel to read the entire 4096-byte page into cache, modify it, and write it back — a read-modify-write cycle that amplifies small writes. For databases, O_DIRECT bypasses the page cache, letting the database manage its own buffer pool.
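The amplification is simple arithmetic on page boundaries. The sketch below assumes a 4096-byte page and a hypothetical helper name, `dirtied_bytes`; it computes how many bytes of whole pages a buffered write touches.

```c
#include <stdint.h>

#define PAGE_SZ 4096u   /* assumed page size for this example */

/* Bytes of whole pages dirtied by a buffered write of len bytes
 * starting at byte offset off. */
uint64_t dirtied_bytes(uint64_t off, uint64_t len)
{
    uint64_t first, last;

    if (len == 0)
        return 0;
    first = off / PAGE_SZ;               /* index of first page touched */
    last = (off + len - 1) / PAGE_SZ;    /* index of last page touched */
    return (last - first + 1) * PAGE_SZ;
}
```

A 1-byte write dirties 4096 bytes (4096x amplification), and a 2-byte write that straddles a page boundary, for example at offset 4095, dirties 8192 bytes.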
Failure Modes
- EFAULT on copy_to_user: User passed an invalid pointer (null, unmapped, or read-only mapping). Driver should return -EFAULT and not crash.
- ENXIO on open: Device not present (e.g., /dev/sda opened after disk removal). cdev_del was called; the open fails gracefully.
- I/O error on bio completion: Block device signals EIO. Propagates through page cache to the application as a read/write error. RAID drivers (md, dm) retry on alternative paths.
- Queue stall: Block device's request queue is full (hardware queue depth exhausted). The block layer applies back-pressure: processes calling write() block until queue space frees. With async I/O (io_uring), submissions are queued in user space and submitted as hardware drains.
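The EFAULT case in the list above can be reproduced safely from userspace. The sketch below maps a page of /dev/zero read-only and then asks the kernel to read() into it, so the driver's copy to user space hits a non-writable mapping. The function name `read_into_readonly` is hypothetical.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Asks the kernel to fill a read-only buffer from /dev/zero.
 * Returns the errno reported by read(), 0 if the read unexpectedly
 * succeeded, or -1 if the setup (open or mmap) failed. */
int read_into_readonly(void)
{
    int err = 0;
    void *ro;
    int fd = open("/dev/zero", O_RDONLY);

    if (fd < 0)
        return -1;
    ro = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE, fd, 0);
    if (ro == MAP_FAILED) {
        close(fd);
        return -1;
    }
    if (read(fd, ro, 16) < 0)
        err = errno;          /* save before cleanup can overwrite it */
    munmap(ro, 4096);
    close(fd);
    return err;
}
```

On a typical Linux system this returns EFAULT: the process is not killed and the kernel does not crash, which is the behaviour a well-written driver should preserve.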
Modern Usage
The transition from the legacy single-queue request layer to blk-mq is complete for NVMe and SATA/SCSI drivers. The legacy request_fn API was removed in Linux 5.0.
io_uring (Linux 5.1) provides a pair of shared-memory ring buffers for submitting I/O and reaping completions, so many operations can be batched into one syscall, or submitted with none at all when submission-queue polling is enabled. For high-IOPS NVMe workloads (1M+ IOPS), this removes most of the per-I/O syscall cost, which was a dominant overhead.
Character devices increasingly expose mmap-based zero-copy interfaces for high-bandwidth use cases (DPDK, SPDK, GPU memory, frame buffers) to avoid the cost of copying data between kernel and user space.
Future Directions
io_uring and passthrough: Linux 5.19 added io_uring NVMe passthrough, which lets applications submit native NVMe commands through io_uring to the in-kernel driver via an NVMe character device, without adopting a full user-space driver stack such as SPDK.
Rust character device drivers: The Rust binding layer for kernel modules aims to provide safe wrappers for file_operations callbacks, so that safe Rust code cannot dereference a user-space pointer without going through the checked copy APIs.
Exercises
- Write a character device driver that implements a simple in-kernel FIFO: writes add data to the FIFO, reads consume it. Block on empty reads using wait_queue_head_t.
- Implement poll in your FIFO driver so that select()/poll() from user space correctly reports readability.
- Add an ioctl command FIFO_CLEAR that empties the FIFO atomically using a spinlock.
- Write a user-space program that mmaps /dev/zero and measures the time to fault in 1GB of pages vs. malloc + memset.
- Use blktrace to capture a complete I/O trace during a fio benchmark on an NVMe device. Identify the distribution of I/O sizes and latencies.
References
- include/linux/fs.h: struct file_operations, struct inode, struct file
- include/linux/cdev.h: character device registration
- include/linux/blk-mq.h: multiqueue block layer
- drivers/char/mem.c: /dev/null, /dev/zero, /dev/urandom implementation
- Linux Device Drivers, 3rd Edition, Chapters 3 and 16
- Jonathan Corbet, "Rethinking the block I/O layer", LWN.net, 2013 (blk-mq introduction)
- Documentation/block/: block layer documentation
- Documentation/userspace-api/ioctl/: ioctl number registry and guidelines