Embedded Systems Fundamentals
Overview
An embedded system is a computing system with a dedicated function within a larger mechanical or electrical system. Unlike general-purpose computers, embedded systems are designed to perform specific tasks — often in real time — under strict constraints of memory, power, cost, and physical size. The software is tightly coupled to the hardware it runs on, frequently down to individual peripheral registers and interrupt line assignments.
The defining characteristic is that the software exists to serve the device's function. A washing machine controller does not run arbitrary programs; its firmware is the washing machine. This tight coupling between software and hardware intent drives every design decision in the embedded domain.
Prerequisites
- Basic understanding of computer architecture (CPU, memory, I/O)
- Familiarity with C programming (pointers, memory layout, bitwise operations)
- Basic digital electronics concepts (GPIO, voltage levels, bus protocols)
- Understanding of interrupts and basic OS scheduling concepts
Embedded System Architecture
+-------------------------------------------------------------+
|                       EMBEDDED SYSTEM                       |
|                                                             |
|  +------------------+     +---------------------------+     |
|  | Microcontroller  |     |      External World       |     |
|  |                  |     |                           |     |
|  |  +------------+  |     |  Sensors      Actuators   |     |
|  |  |    CPU     |  |<--->|  (Temp)       (Motors)    |     |
|  |  | (Cortex-M4)|  |     |  (Accel)      (Displays)  |     |
|  |  +------------+  |     |                           |     |
|  |  +-----++-----+  |     +---------------------------+     |
|  |  | RAM ||Flash|  |                                  |
|  |  |256KB||512KB|  |                                  |
|  |  +-----++-----+  |                                  |
|  |  +------------+  |                                  |
|  |  | Peripherals|  |  GPIO / SPI / I2C / UART / CAN   |
|  |  |  ADC/DAC   |  |  Timers / PWM / DMA / RTC        |
|  |  |   Timers   |  |                                  |
|  |  +------------+  |                                  |
|  +------------------+                                  |
|                                                             |
|  Power Supply: Battery / Regulated 3.3V / 1.8V              |
+-------------------------------------------------------------+
Historical Context
The concept of embedded computing crystallized in the late 1960s with the Apollo Guidance Computer — arguably the first embedded system of consequence. Designed by MIT and built by Raytheon, it used core rope ROM and 4KB of RAM to guide humans to the moon. Intel's 4004 (1971) and the 8048 microcontroller family (1976) began the era of single-chip embedded computing.
By the 1980s, dedicated microcontrollers — chips integrating CPU, RAM, ROM, and peripherals — were displacing discrete logic in appliances and industrial equipment. The ARM architecture emerged from Acorn Computers in 1985, eventually dominating embedded through the Cortex-M profile introduced in 2004. Today, over 40 billion ARM Cortex-M cores have shipped, embedded in everything from toothbrushes to spacecraft.
Resource Constraints
Memory
Typical MCU memory budgets are a fraction of what desktop or server programmers deal with:
| Class | RAM | Flash | Example Chip |
|---|---|---|---|
| Ultra-low-end | 128 bytes | 1KB | PIC10F |
| Low-end MCU | 2KB | 32KB | ATmega328P (Arduino Uno) |
| Mid-range MCU | 64KB | 512KB | STM32F4 |
| High-end MCU | 1MB | 2MB | STM32H7 |
| Application processor | 512MB+ | External eMMC | i.MX8 |
Stack and heap are carved from the same RAM. Stack overflows silently corrupt data because there is no MMU to catch them. Dynamic memory allocation (malloc) is avoided in many embedded systems because fragmentation on constrained heaps causes unpredictable failures in long-running systems. Static allocation is preferred.
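One common way to get allocation-like flexibility without a heap is a fixed-block pool carved out of a static array. The sketch below is illustrative only: the block size, block count, and the `pool_*` names are assumptions chosen for the example, and the code is not interrupt-safe as written.

```c
#include <stddef.h>
#include <stdint.h>

#define POOL_BLOCK_SIZE  32u   /* bytes per block (assumed message size) */
#define POOL_BLOCK_COUNT 8u    /* number of blocks, fixed at build time */

/* A free block stores the link to the next free block in its own storage. */
typedef union pool_block {
    union pool_block *next;
    uint8_t payload[POOL_BLOCK_SIZE];
} pool_block_t;

static pool_block_t pool_storage[POOL_BLOCK_COUNT]; /* statically allocated */
static pool_block_t *pool_free_list;

/* Link every block into the free list. Call once before first use. */
void pool_init(void)
{
    pool_free_list = NULL;
    for (size_t i = 0; i < POOL_BLOCK_COUNT; i++) {
        pool_storage[i].next = pool_free_list;
        pool_free_list = &pool_storage[i];
    }
}

/* Return one block, or NULL when the pool is exhausted. Constant time. */
void *pool_alloc(void)
{
    pool_block_t *block = pool_free_list;
    if (block != NULL) {
        pool_free_list = block->next;
    }
    return block;
}

/* Give back a block obtained from pool_alloc(). NULL is ignored. */
void pool_free(void *ptr)
{
    if (ptr != NULL) {
        pool_block_t *block = (pool_block_t *)ptr;
        block->next = pool_free_list;
        pool_free_list = block;
    }
}
```

Because every block has the same size, the pool cannot fragment, and exhaustion shows up as an explicit NULL that the caller can handle. If blocks are allocated or freed from an ISR, the list updates would need to run with interrupts disabled.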
Power
Power consumption determines battery life or passive cooling requirements. Modern MCUs offer multiple power states:
- Run mode: Full CPU at maximum clock — typically 10-100mW
- Idle/WFI: CPU halted, peripherals running — 1-10mW
- Sleep: Most peripherals off, RTC and SRAM retained — 100µW-1mW
- Deep sleep/Stop: Only RTC and wake pin active — 1-100µW
- Hibernate/Standby: Near-off, wake from reset — <1µW
The ARM WFI (Wait For Interrupt) instruction halts the CPU until the next interrupt fires, allowing ultra-low-power idle loops. A temperature sensor node on a coin cell battery that samples once per minute and transmits via BLE can achieve years of battery life through aggressive use of deep sleep.
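The battery-life claim can be checked with a time-weighted average of active and sleep current. The following is a rough sketch in integer math; the struct, the function names, and all the numbers in the usage note are illustrative assumptions, not datasheet values.

```c
#include <stdint.h>

/* Duty-cycled load: awake for active_ms out of every period_ms. */
typedef struct {
    uint32_t active_current_ua;  /* average current while awake, microamps */
    uint32_t sleep_current_ua;   /* current while asleep, microamps */
    uint32_t active_ms;          /* awake time per period */
    uint32_t period_ms;          /* wake-up period */
} duty_cycle_t;

/* Time-weighted average current in nanoamps (rounds down).
 * Returns 0 for an invalid configuration. */
uint32_t average_current_na(const duty_cycle_t *dc)
{
    if (dc->period_ms == 0u || dc->active_ms > dc->period_ms) {
        return 0u;
    }
    uint64_t active = (uint64_t)dc->active_current_ua * dc->active_ms;
    uint64_t sleep  = (uint64_t)dc->sleep_current_ua *
                      (dc->period_ms - dc->active_ms);
    return (uint32_t)(((active + sleep) * 1000u) / dc->period_ms);
}

/* Ideal battery life in hours, ignoring self-discharge, temperature,
 * and pulse-current derating. Returns 0 if the current is 0. */
uint32_t battery_life_hours(uint32_t capacity_mah, uint32_t avg_current_na)
{
    if (avg_current_na == 0u) {
        return 0u;
    }
    return (uint32_t)(((uint64_t)capacity_mah * 1000000u) / avg_current_na);
}
```

With assumed figures of 5 mA for 20 ms once per minute and 2 µA asleep, the average is 3666 nA. An assumed 220 mAh cell then gives about 60,000 hours, roughly 6.8 years before derating. Note that at this duty cycle the sleep current contributes more than half of the average.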
Real-Time Requirements
Many embedded systems must respond to events within bounded time. A motor controller must update PWM duty cycle within microseconds of a sensor reading. An ABS brake system must modulate wheel cylinders within milliseconds of slip detection. Missing timing deadlines can mean mechanical failure, safety incidents, or degraded product quality.
Cost
High-volume consumer electronics amortize engineering costs over millions of units. A $0.10 difference in MCU cost becomes millions of dollars at scale. This drives the use of the smallest, cheapest MCU that meets requirements — which forces careful resource budgeting.
Microcontroller vs. Microprocessor
The most important hardware decision in embedded design:
Microcontroller (MCU)
A microcontroller integrates all major subsystems on one die:
+--------------------------------------------+
|            MCU (e.g., STM32F4)             |
|  +-----------+   +--------+   +---------+  |
|  |    CPU    |   |  RAM   |   |  Flash  |  |
|  | Cortex-M4 |   | 192KB  |   |   1MB   |  |
|  +-----------+   +--------+   +---------+  |
|  +-----------+   +--------+   +---------+  |
|  |Peripherals|   |  ADC   |   | Timers  |  |
|  | GPIO/SPI  |   | 12-bit |   | TIM1-14 |  |
|  | UART/I2C  |   |  16ch  |   | PWM/CAP |  |
|  | CAN/USB   |   +--------+   +---------+  |
|  +-----------+   +--------+   +---------+  |
|  |    DMA    |   |  DAC   |   |   RTC   |  |
|  | 16 streams|   |  2 ch  |   |         |  |
|  +-----------+   +--------+   +---------+  |
+--------------------------------------------+
The ARM Cortex-M series is the dominant MCU CPU:
- Cortex-M0/M0+: Ultra-low-cost, low power, no FPU, limited debug
- Cortex-M3: Full Thumb-2 ISA, hardware divide, NVIC
- Cortex-M4: DSP instructions, optional single-precision FPU, common in motor control and audio
- Cortex-M7: Dual-issue, double-precision FPU, cache, used in high-performance MCUs (STM32H7, iMXRT)
- Cortex-M33: Armv8-M with TrustZone security, modern replacement for M4
Microprocessor (MPU)
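An application processor integrates one or more CPU cores, usually with an MMU and on-chip controllers for interfaces such as USB, Ethernet, and display, but main memory (DRAM) and non-volatile storage are external: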
+--------------------------------------+
|    Application Processor (i.MX8M)    |
|   Cortex-A53 (4 cores) + Cortex-M4   |
+--------------------------------------+
    |              |              |
+--------+     +--------+     +--------+
| LPDDR4 |     |  eMMC  |     |  PCIe  |
|  2GB   |     |  8GB   |     |  USB3  |
+--------+     +--------+     +--------+
MPUs run Linux or Android. They require complex PCB design (DDR routing, PMIC), boot sequences, and full OS configuration. Boot times are seconds, not milliseconds. Use when you need a networking stack, file system, complex UI, or connectivity (WiFi, cellular, BT) with high data throughput.
Bare-Metal Programming
Bare-metal means no operating system — code runs directly on hardware. The program structure is typically:
+------------------+
|  Startup Code    |  (written in assembly or generated by toolchain)
|  - Initialize SP |
|  - Copy .data    |  ROM -> RAM
|  - Zero .bss     |
|  - Call main()   |
+------------------+
         |
         v
+------------------+
|  main()          |
|  - Init clocks   |
|  - Init GPIO     |
|  - Init UART     |
|  - Enable IRQs   |
|  - Superloop:    |
|    while(1) {    |
|      poll flags  |
|      process     |
|    }             |
+------------------+
         |
    +----+-----+
    |          |
+--------+ +--------+
|  ISRs  | | Timers |
|UART_IRQ| |SysTick |
|GPIO_IRQ| |TIM_IRQ |
+--------+ +--------+
Memory Sections
Flash (ROM):                     RAM:
+--------------------+           +--------------------+
| .text (code)       |  load ->  | .data (initialized)|
| .rodata (const)    |           | .bss (zeroed)      |
+--------------------+           | Heap (malloc)      |
| .data init values  |           | ...                |
+--------------------+           | Stack (grows down) |
                                 +--------------------+
The startup code (often startup_stm32xxx.s or generated by STM32CubeIDE) must:
1. Set the stack pointer to the top of RAM
2. Copy .data initialization values from flash to RAM
3. Zero the .bss section
4. Call any C++ constructors (__libc_init_array)
5. Call main()
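Steps 2 and 3 above amount to two small loops. The sketch below writes them as plain C functions that take explicit pointers so they can be exercised on a host. On a real target the arguments would come from linker-script symbols (names such as `_sidata`, `_sdata`, `_edata`, `_sbss`, `_ebss` are a common convention, assumed here), and the function names are hypothetical.

```c
#include <stdint.h>

/* Step 2: copy the initial values of .data from their load address
 * (in flash) to their run address (in RAM), one word at a time. */
void init_data(const uint32_t *load, uint32_t *start, const uint32_t *end)
{
    while (start < end) {
        *start++ = *load++;
    }
}

/* Step 3: clear .bss so zero-initialized statics really start at zero. */
void init_bss(uint32_t *start, const uint32_t *end)
{
    while (start < end) {
        *start++ = 0u;
    }
}

/* On a target, a reset handler would call these roughly as follows
 * (symbol names are assumptions taken from typical GCC linker scripts):
 *
 *     init_data(&_sidata, &_sdata, &_edata);
 *     init_bss(&_sbss, &_ebss);
 *     __libc_init_array();
 *     main();
 */
```

Both loops assume the sections are word-aligned, which the linker script normally guarantees. Code that runs before these loops complete must not rely on any global or static variable.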
A common bug: forgetting to initialize peripherals before enabling their interrupts, causing spurious interrupts before hardware is ready.
Interrupt Driven Design
Interrupts are the real workhorse of embedded systems. The NVIC (Nested Vectored Interrupt Controller) on Cortex-M3/M4/M7 handles up to 240 external interrupts with hardware priority levels (Cortex-M0/M0+ supports up to 32). ISRs must be short: update a flag or write to a queue, then return. Long ISRs block interrupts of equal or lower priority and create timing problems.
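#include <stdint.h>
#include "stm32f4xx.h"   /* CMSIS device header: USART1, USART_SR_RXNE */

void process_byte(uint8_t byte);   /* application-defined handler */

/* Shared between the ISR and the main loop, so both must be volatile. */
static volatile uint8_t uart_byte_received = 0;
static volatile uint8_t uart_data;

void USART1_IRQHandler(void) {
    if (USART1->SR & USART_SR_RXNE) {
        uart_data = (uint8_t)USART1->DR;  // Reading DR clears RXNE
        uart_byte_received = 1;           // Signal main loop
    }
}

int main(void) {
    // ... init clocks, GPIO, USART1, then enable the USART1 interrupt ...
    while (1) {
        if (uart_byte_received) {
            uint8_t byte = uart_data;     // Copy before clearing the flag
            uart_byte_received = 0;
            // A one-byte mailbox still drops data if bytes arrive faster
            // than process_byte() runs; queue them when that can happen.
            process_byte(byte);
        }
    }
}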
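A single flag and a single data byte only work when bytes arrive more slowly than the main loop consumes them. A common alternative is a single-producer, single-consumer ring buffer. The following is a minimal sketch assuming a single-core MCU where aligned 32-bit loads and stores are atomic; the `rb_*` names and the buffer size are illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

#define RB_SIZE 16u  /* must be a power of two so the index mask works */

typedef struct {
    volatile uint8_t  data[RB_SIZE];
    volatile uint32_t head;  /* written only by the producer (ISR) */
    volatile uint32_t tail;  /* written only by the consumer (main loop) */
} ring_buffer_t;

/* Producer side: call from the ISR. Returns false if the buffer is full. */
bool rb_put(ring_buffer_t *rb, uint8_t byte)
{
    uint32_t head = rb->head;
    if ((uint32_t)(head - rb->tail) == RB_SIZE) {
        return false;                 /* full: the caller counts the overrun */
    }
    rb->data[head & (RB_SIZE - 1u)] = byte;
    rb->head = head + 1u;             /* publish only after the data is stored */
    return true;
}

/* Consumer side: call from the main loop. Returns false if empty. */
bool rb_get(ring_buffer_t *rb, uint8_t *byte)
{
    uint32_t tail = rb->tail;
    if (tail == rb->head) {
        return false;
    }
    *byte = rb->data[tail & (RB_SIZE - 1u)];
    rb->tail = tail + 1u;             /* free the slot only after reading it */
    return true;
}
```

The indices run freely and are masked only when indexing, so full and empty are distinguishable without wasting a slot. Because each index has exactly one writer, no interrupt masking is needed under the stated assumptions. Multi-core parts, or more than one producer or consumer, would need memory barriers or critical sections.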
RTOS vs. Bare-Metal
| Aspect | Bare-Metal | RTOS |
|---|---|---|
| Footprint | Minimal (0 overhead) | 5-30KB flash, 1-10KB RAM |
| Determinism | Highly deterministic (known call paths) | Depends on scheduler |
| Complexity | Superloop can become spaghetti | Tasks decompose concerns |
| Multitasking | Manual state machines | Preemptive tasks |
| IPC | Volatile flags, ring buffers | Queues, semaphores, events |
| Debugging | Simpler stack traces | Task-aware debugging needed |
| Boot time | Instant | Small overhead for scheduler init |
For a simple sensor node doing one thing, bare-metal is appropriate and optimal. For a device managing multiple concurrent protocols (BLE + sensor reading + display update + OTA update), an RTOS like FreeRTOS provides essential structure. The break-even point is roughly 3-4 concurrent activities with timing dependencies between them.
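Before reaching for an RTOS, a bare-metal design often grows a small table-driven cooperative scheduler inside the superloop. The sketch below shows one possible shape. The task names, periods, and counters are hypothetical, and `now_ms` is assumed to come from a SysTick-driven millisecond counter on a real target.

```c
#include <stddef.h>
#include <stdint.h>

typedef void (*task_fn_t)(void);

typedef struct {
    task_fn_t run;          /* must return quickly: tasks are never preempted */
    uint32_t  period_ms;    /* how often the task should run */
    uint32_t  last_run_ms;  /* timestamp of the most recent run */
} task_t;

/* Run every task whose period has elapsed since its last run. */
void scheduler_poll(task_t *tasks, size_t count, uint32_t now_ms)
{
    for (size_t i = 0; i < count; i++) {
        /* Unsigned subtraction stays correct across counter wraparound. */
        if ((uint32_t)(now_ms - tasks[i].last_run_ms) >= tasks[i].period_ms) {
            tasks[i].last_run_ms = now_ms;
            tasks[i].run();
        }
    }
}

/* Hypothetical example tasks that only count their invocations. */
uint32_t sensor_task_runs;
uint32_t display_task_runs;

void sensor_task(void)  { sensor_task_runs++; }
void display_task(void) { display_task_runs++; }
```

In a superloop this would be called as `scheduler_poll(tasks, count, millis())` on every pass, where `millis()` is an assumed tick accessor. The limitation is the one the table implies: a slow task delays every other task, which is the point where preemptive RTOS tasks start to pay for their overhead.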
Embedded Linux
When a project requires:
- TCP/IP networking stack
- Filesystem (logging, OTA, databases)
- POSIX APIs for portable application code
- Complex display rendering (OpenGL ES, Qt)
- USB host (connecting USB peripherals)
- Existing Linux drivers for complex hardware
...embedded Linux is the right choice, accepting the tradeoffs: minimum 32MB RAM (typically 128MB+), second-scale boot times, higher power, and more complex toolchain.
Products using embedded Linux:
- Raspberry Pi: Education and prototyping, Broadcom BCM2711 (Pi 4)
- Home routers: OpenWrt on MIPS/ARM, e.g., Qualcomm IPQ series
- Industrial HMIs: Qt on i.MX6 or i.MX8
- Cameras: Ambarella SoC running Linux for H.264 encode
- Smart TVs: Android or Tizen on ARM Cortex-A
- Automotive telematic units: GENIVI / AGL Linux stacks
Production Examples
Arduino Uno (ATmega328P): 8-bit AVR, 2KB RAM, 32KB flash, 16MHz. Most instructions execute in a single 62.5ns clock cycle, so GPIO timing is highly predictable. Used mostly in hobbyist projects, but the same class of 8-bit MCU also appears in production sensor nodes.
ESP32 (Espressif): Dual-core Xtensa LX6, 520KB RAM, typically 4MB of SPI flash on common modules, WiFi + BT integrated. The vendor SDK (ESP-IDF) is built on FreeRTOS. Used in millions of IoT products: smart plugs, garage door openers, wearables.
STM32F4 (ST Microelectronics): Cortex-M4 @ 168MHz, 192KB RAM, 1MB flash. Used in drones (Pixhawk flight controllers run STM32F4/F7), motor controllers, industrial sensors, medical devices.
i.MX8M (NXP): Quad-core Cortex-A53 + Cortex-M4. Typically runs embedded Linux on the A53 cores and an RTOS such as FreeRTOS on the M4 (heterogeneous multiprocessing). Used in smart speakers, automotive systems, industrial gateways.
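Mars Perseverance Rover (JPL/NASA): Runs VxWorks on a RAD750 radiation-hardened PowerPC at up to 200MHz, 256MB RAM, 2GB flash. Extreme reliability requirements, with no interactive debugging possible: the one-way signal delay to Mars ranges from roughly 3 to 22 minutes depending on planetary positions, so every command and software update is sent without live feedback.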
Debugging Notes
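- JTAG/SWD debugging: Use a hardware debug probe (ST-LINK, J-Link, CMSIS-DAP) with GDB and OpenOCD or a vendor IDE. Hardware breakpoints (typically 4-8) are precious. Software breakpoints work by patching a breakpoint instruction into the code, so they are plentiful for code in RAM but require reprogramming flash, or are unavailable, for code in read-only regions.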
- Stack overflow: With no MMU, stack overflow silently corrupts neighboring variables or the heap. Signs: random corruptions, hard faults. Fix: enable RTOS stack checking, or place a canary value at stack bottom and verify in SysTick.
- Volatile keyword: Variables modified in ISRs must be volatile. Compiler optimization will cache non-volatile variables in registers, missing updates from interrupt context.
- Unaligned access: Cortex-M0/M0+ does not support unaligned memory access and will hard fault. Cast pointers carefully when deserializing network packets or protocol buffers.
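- Clock initialization: Many bugs stem from incorrect PLL configuration leaving clocks at reset defaults (for example, the 16MHz HSI on STM32F4 or the 8MHz HSI on STM32F1). A quick check is USART baud rate accuracy: garbage at the expected baud rate usually means the peripheral clock is not what the code assumes.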
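For the unaligned-access note above, the usual fix is to avoid casting a byte pointer to a wider type at all. The sketch below shows two portable alternatives; the function names are illustrative, and the little-endian wire format is an assumption.

```c
#include <stdint.h>
#include <string.h>

/* Read a little-endian 32-bit field from any byte offset. Only byte
 * loads are issued, so alignment and host endianness do not matter. */
uint32_t read_u32_le(const uint8_t *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* Read a native-endian 32-bit value from any byte offset.
 * memcpy makes no alignment assumptions about its arguments. */
uint32_t read_u32_native(const uint8_t *p)
{
    uint32_t value;
    memcpy(&value, p, sizeof value);
    return value;
}
```

By contrast, `*(const uint32_t *)(packet + 1)` is undefined behavior in C and faults on cores without unaligned access support. Compilers typically reduce both functions above to a few loads, so the safe form rarely costs much.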
Security Implications
- Readback protection (RDP): STM32 and most MCUs support locking flash to prevent JTAG readout. Products must enable RDP before shipping or firmware can be extracted and reverse-engineered.
- Secure boot: Bootloader verifies firmware signature before executing. Essential for preventing malicious firmware updates. ARM TrustZone (Cortex-M33+) enables hardware-enforced security boundaries.
- Buffer overflows: No ASLR, no stack canaries by default. Embedded C code with fixed-size buffers and external input (UART, network) is a common attack surface. Input validation is critical.
- JTAG exposure: Production hardware must disable JTAG/SWD or require authentication. Many IoT devices were compromised by physical JTAG access to exposed test pads.
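- Side-channel attacks: Cryptographic implementations should use constant-time algorithms to resist timing attacks. Resisting power analysis additionally requires countermeasures such as masking, or a hardened hardware crypto peripheral. The Cortex-M4 core itself has no AES instructions (nothing equivalent to x86 AES-NI); acceleration comes from vendor peripherals such as the CRYP block on some STM32F4 parts.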
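As one small instance of the constant-time idea, comparing a received authentication tag with `memcmp` can leak how many leading bytes matched, because it may return at the first difference. A minimal sketch of a comparison whose running time depends only on the length follows; the function name is illustrative, and this addresses timing leakage only, not power analysis.

```c
#include <stddef.h>
#include <stdint.h>

/* Compare two buffers without an early exit.
 * Returns 1 if all len bytes are equal, 0 otherwise. */
int ct_equal(const uint8_t *a, const uint8_t *b, size_t len)
{
    /* volatile discourages the compiler from short-circuiting the loop */
    volatile uint8_t diff = 0;

    for (size_t i = 0; i < len; i++) {
        diff |= (uint8_t)(a[i] ^ b[i]);  /* accumulate any differing bits */
    }
    return diff == 0;
}
```

The loop always visits every byte, so the time taken does not reveal the position of the first mismatch. Whether the compiled code is truly constant-time still depends on the compiler and target, so it is worth inspecting the generated assembly for security-critical builds.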
Performance Implications
- DMA: Direct Memory Access moves data without CPU involvement. UART/SPI/I2C DMA transfers free the CPU for other work. Critical for high-throughput peripherals.
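- Cache (Cortex-M7): Instruction and data caches provide dramatic speedups but require cache coherency management with DMA. DMA accesses memory directly and bypasses the cache: clean the data cache before a DMA transfer reads a buffer the CPU has written, and invalidate it before the CPU reads a buffer that DMA has written.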
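- FPU: Floating-point without FPU is 10-100x slower, because software float (-mfloat-abi=soft) emulates every operation in library routines. Enable the hardware FPU where available, for example -mfpu=fpv4-sp-d16 -mfloat-abi=hard on Cortex-M4 (Cortex-M7 parts use an fpv5 variant).
- Flash wait states: At high frequencies, the CPU is faster than flash, and wait states stall instruction fetch. A flash accelerator with prefetch and an instruction cache (the ART Accelerator on STM32F4) mitigates this significantly.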
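- Clock frequency vs. power: Dynamic voltage/frequency scaling is available on some MCUs. At a fixed core voltage, dynamic power scales roughly linearly with clock frequency, so running at 80MHz draws roughly 4x the dynamic power of 20MHz but finishes the same work 4x sooner. The energy for the computation itself is about the same; the gain comes from entering sleep earlier, which shortens the time that fixed overheads (regulators, oscillators, always-on peripherals) stay powered.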
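The frequency-versus-power trade-off can be made concrete with a simple model: awake power is a fixed overhead plus a term proportional to clock frequency, and awake time is the cycle count divided by frequency. The sketch below uses that model. The model itself, the function name, and all numbers are assumptions for illustration; it ignores voltage scaling and sleep current.

```c
#include <stdint.h>

/* Energy spent awake to execute `cycles` clock cycles at `f_mhz`,
 * in nanojoules. Power model (assumed): fixed_uw + uw_per_mhz * f_mhz.
 * Returns 0 if f_mhz is 0. */
uint64_t active_energy_nj(uint32_t fixed_uw, uint32_t uw_per_mhz,
                          uint32_t f_mhz, uint32_t cycles)
{
    if (f_mhz == 0u) {
        return 0u;
    }
    uint64_t power_uw = (uint64_t)fixed_uw + (uint64_t)uw_per_mhz * f_mhz;

    /* Awake time in microseconds is cycles / f_mhz, and uW * us = pJ. */
    uint64_t energy_pj = (power_uw * cycles) / f_mhz;

    return energy_pj / 1000u;  /* pJ to nJ */
}
```

With an assumed 2 mW fixed overhead, 300 µW per MHz, and 800,000 cycles of work, the model gives 320 µJ at 20MHz and 260 µJ at 80MHz. The frequency-dependent part is 240 µJ in both cases; only the fixed overhead shrinks, from 80 µJ to 20 µJ, because the part is awake for a quarter of the time.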
Failure Modes
- Watchdog timer not fed: WDT reset is the intended behavior — but if the system hangs in a bad state and the WDT period is too long, response is delayed.
- Interrupt storm: A peripheral generating interrupts faster than they can be serviced. CPU spends 100% in ISR, main loop never runs. Fix: mask interrupt at source, drain in bulk.
- Stack overflow corruption: Silently overwrites adjacent memory — manifests as corrupted globals, random crashes, or unexpected behavior much later in execution.
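- Race conditions without atomics: Read-modify-write sequences on variables shared between an ISR and the main loop can be interrupted midway unless interrupts are disabled or atomic operations are used. On 32-bit MCUs, aligned 32-bit loads and stores are atomic, but 64-bit accesses are not, so a 64-bit value can be read half-updated.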
- Brown-out without brown-out detection (BOD): Voltage sag during heavy load (e.g., radio TX) causes CPU to execute invalid instructions. Enable BOD to hold MCU in reset below threshold voltage.
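Silent stack overflow is easier to catch when the stack is "painted" with a known pattern at startup and the untouched portion is measured later. The following is a host-testable sketch that treats the stack as an array of words with the lowest address first (the stack grows down toward it). The pattern value and function names are illustrative assumptions.

```c
#include <stddef.h>
#include <stdint.h>

#define STACK_FILL_PATTERN 0xDEADBEEFu

/* Fill the stack region with a known pattern. On a target this runs
 * early in startup, over the part of the stack not yet in use. */
void stack_paint(uint32_t *lowest, size_t words)
{
    for (size_t i = 0; i < words; i++) {
        lowest[i] = STACK_FILL_PATTERN;
    }
}

/* Count the words at the low end that still hold the pattern.
 * These were never touched, so the result is the remaining headroom. */
size_t stack_unused_words(const uint32_t *lowest, size_t words)
{
    size_t unused = 0;
    while (unused < words && lowest[unused] == STACK_FILL_PATTERN) {
        unused++;
    }
    return unused;
}
```

A result of zero means the stack reached its limit at some point and has probably overflowed. Checking the value periodically, or logging the minimum seen during testing, gives a measured basis for sizing the stack. The check can be fooled if the program happens to store the pattern value itself, which is why an unlikely constant is used.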
Modern Usage
The embedded ecosystem has evolved significantly:
- Rust on embedded: The embedded-hal trait ecosystem enables type-safe peripheral access with zero-overhead abstractions. Cortex-M support is mature via the cortex-m and cortex-m-rt crates.
- MicroPython/CircuitPython: Python on MCUs (ESP32, RP2040) for rapid prototyping. 3-5x larger footprint, slower execution, but accessible.
- Zephyr RTOS: Linux Foundation project with a device tree model similar to Linux, strong IoT security focus, growing adoption in consumer and industrial IoT.
- Matter protocol: New smart home standard (formerly Project CHIP) running on ESP32 and similar devices — pushing more complex networking stacks to MCU-class hardware.
- Edge AI: TensorFlow Lite for Microcontrollers and ARM's Ethos-U NPU series bringing neural network inference to Cortex-M class devices with sub-milliwatt power budgets.
Future Directions
- RISC-V embedded: Open ISA alternative to ARM gaining traction (ESP32-C3 uses RISC-V, SiFive Freedom E series). Eliminates licensing costs; ecosystem is maturing rapidly.
- Chiplets and heterogeneous integration: MCUs bundling AI accelerators, RF cores, and sensor processing into multi-die packages.
- Security as baseline: PSA Certified (Platform Security Architecture) from ARM establishing baseline security requirements — secure boot, device attestation, firmware update — becoming table stakes for IoT certification.
- LLVM/Clang for embedded: GCC has dominated embedded C, but LLVM/Clang is gaining ground with better sanitizers, LTO, and language support (especially for Rust).
Exercises
- Write a bare-metal blinky for STM32F4 without any HAL — configure RCC, GPIOD, and toggle an LED using only direct register access. Calculate the exact blink frequency from the clock configuration.
- Implement a software UART receiver using a GPIO interrupt on the falling edge of the start bit. Take the first sample 1.5 bit periods after that edge (the middle of data bit 0), then sample once per bit period to read all 8 data bits.
- Profile the current draw of an STM32 MCU in run mode, WFI sleep, and stop mode. Verify against datasheet values with a current probe or shunt resistor + oscilloscope.
- Create a ring buffer (circular buffer) for UART receive that is safe to write from an ISR and read from the main loop. What is the invariant for safety, and when do you need to disable interrupts?
- Implement a watchdog timer that gets kicked by the main loop only when all subsystems have set their "alive" flags in the last N milliseconds. What failure modes does this catch vs. a simple WDT kick in the superloop?
References
- Joseph Yiu, The Definitive Guide to ARM Cortex-M3 and Cortex-M4 Processors (3rd ed., Elsevier, 2013)
- ARM Architecture Reference Manual, ARMv7-M (ARM DDI 0403)
- ST Microelectronics RM0090: STM32F4 Reference Manual
- Elecia White, Making Embedded Systems (O'Reilly, 2011)
- Jack Ganssle, The Art of Designing Embedded Systems (2nd ed., Newnes, 2008)
- Phillip Koopman, Better Embedded System Software (Drumnadrochit Education, 2010)
- FreeRTOS documentation: https://www.freertos.org/Documentation/RTOS_book.html
- ESP-IDF Programming Guide (Espressif Systems)
- Zephyr Project documentation: https://docs.zephyrproject.org