08 — MLOps and ML Infrastructure

Overview

MLOps (Machine Learning Operations) is the discipline of applying software engineering rigor to the lifecycle of machine learning systems: from data collection and feature engineering through model training, evaluation, deployment, and production monitoring. The term emerged around 2017 as organizations discovered that the hardest part of ML is not the model itself — it is the surrounding infrastructure needed to keep that model correct, current, and efficient in production. A typical ML system has less than 5% of its code in model logic; the remaining 95% is data pipelines, serving infrastructure, monitoring, retraining orchestration, and tooling. This document maps the full architecture of a production ML platform, from data versioning to LLM serving.


Prerequisites

  • Understanding of distributed training (see 32-ai-infrastructure/02-distributed-training.md)
  • Familiarity with Kubernetes and containers (see 20-containers/, 22-kubernetes-internals/)
  • Basic understanding of model parallelism (see 32-ai-infrastructure/04-model-parallelism-deep-dive.md)
  • Familiarity with storage systems (see 32-ai-infrastructure/05-ai-storage-systems.md)

Historical Context

The "Hidden Technical Debt" Paper (2015)

Google researchers Sculley et al. published "Hidden Technical Debt in Machine Learning Systems" at NeurIPS 2015, articulating the infrastructure burden ML systems impose. They identified patterns specific to ML that accumulate technical debt at rates exceeding traditional software:

  • Entanglement: Changing the input distribution of any one feature can change the importance and learned weights of all the others (CACE: Changing Anything Changes Everything). No ML component is truly modular.
  • Undeclared consumers: Other systems silently consume a model's predictions (from logs or shared files) without the model owners knowing. When the model changes, those downstream consumers shift behavior without any error.
  • Feedback loops: Model predictions influence future training data (recommendation systems shape what users click; click data trains next model).
  • Data dependency debt: Code dependencies are visible (imports); data dependencies are invisible until production breaks.

This paper catalyzed the MLOps discipline. The industry response was to build infrastructure that makes these dependencies explicit, versioned, and monitored.


ML Platform Architecture

Full ML Platform Architecture:

  Raw Data Sources
  (databases, logs, streams, external APIs)
       |
       v
  +------------------+
  | Data Pipeline     |  (Spark, dbt, Flink)
  | Data Lake         |  (S3/GCS + Delta Lake/Iceberg)
  | Data Versioning   |  (DVC, LakeFS)
  +------------------+
       |
       v
  +------------------+
  | Feature Store     |  (Feast, Tecton, Hopsworks)
  | Feature Serving   |  (online: Redis; offline: Parquet)
  +------------------+
       |
       v
  +------------------+
  | Training Pipeline |  (Kubeflow, SageMaker, Vertex AI)
  | Experiment Track  |  (MLflow, W&B, Neptune)
  | Hyperparameter    |  (Ray Tune, Optuna, Ax)
  +------------------+
       |
       v
  +------------------+
  | Model Registry    |  (MLflow Registry, SageMaker Registry)
  | Model Validation  |  (offline eval, fairness checks, shadow deployment)
  +------------------+
       |
       v
  +------------------+
  | Model Serving     |  (Triton, TorchServe, vLLM, BentoML)
  | A/B Testing       |  (feature flags, canary traffic)
  | Batch Inference   |  (Ray Data, Spark MLlib)
  +------------------+
       |
       v
  +------------------+
  | Monitoring        |  (Evidently, Fiddler, Arize)
  | Alerting          |  (data drift, concept drift, latency)
  | Retraining Trigger|  (schedule, drift-based, accuracy-based)
  +------------------+

Data Versioning

DVC (Data Version Control)

DVC extends Git for large files. Model training data (often hundreds of GB to TB) cannot be stored in Git. DVC stores data in a remote storage backend (S3, GCS, Azure Blob) and tracks small metadata files (.dvc files) in Git. A .dvc file is a tiny YAML pointer containing the MD5 hash, size, and relative path of the actual data.

DVC workflow:

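  dvc add data/train.csv
  → creates data/train.csv.dvc (pointer file, tracked in Git)
  → copies data/train.csv into the local DVC cache

  git add data/train.csv.dvc data/.gitignore
  git commit -m "add training data v1"
  git tag dataset-v1
  dvc push          → uploads the cached data to remote storage (S3)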

  [Reproduce training on any machine:]
  git checkout dataset-v1
  dvc pull          → downloads data/train.csv from S3
  dvc repro         → re-runs pipeline, verifies checksums

DVC pipelines (dvc.yaml) define dependencies between stages (data processing → feature engineering → training → evaluation). dvc repro re-runs only stages whose inputs changed — a Makefile-like caching system for ML.
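The Makefile-like caching can be sketched in a few lines. This is a simplified illustration of hash-based stage skipping, not DVC's actual implementation; the lock-file layout and the function names (`file_md5`, `stage_is_stale`, `run_stage`) are assumptions made for the example.

```python
import hashlib
import json
from pathlib import Path


def file_md5(path: Path) -> str:
    """Return the MD5 hex digest of a file's contents."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def stage_is_stale(deps: list[Path], lock_file: Path) -> bool:
    """True if no hashes are recorded yet or any dependency hash changed."""
    if not lock_file.exists():
        return True
    current = {str(p): file_md5(p) for p in deps}
    recorded = json.loads(lock_file.read_text())
    return current != recorded


def run_stage(deps: list[Path], lock_file: Path, command) -> bool:
    """Run `command` only when inputs changed; return whether it ran."""
    if not stage_is_stale(deps, lock_file):
        return False
    command()
    # Record the hashes the stage was run against (hypothetical lock format).
    lock_file.write_text(json.dumps({str(p): file_md5(p) for p in deps}))
    return True
```

Calling `run_stage` twice in a row with unchanged inputs executes the command once; editing any dependency makes the stage stale again.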

Delta Lake / Apache Iceberg

For large-scale tabular datasets, Delta Lake (open-sourced by Databricks in 2019) and Apache Iceberg (created at Netflix, an Apache top-level project since 2020) provide ACID transactions, schema evolution, and time-travel queries on top of Parquet files in object storage. They are critical for:

  • Reproducibility: Query training data as it existed at a specific version or timestamp (VERSION AS OF / TIMESTAMP AS OF in Spark SQL).
  • Concurrent writes: Multiple Spark jobs can write to the same table safely.
  • Schema evolution: Adding columns to a feature table without breaking existing readers.

Iceberg table structure:
  s3://data-lake/features/clicks/
    metadata/
      v1.metadata.json  (table schema, partition spec, snapshot list)
      v2.metadata.json
      snap-001.avro     (manifest list for one snapshot: points to manifest files, which list data files)
    data/
      part-00001.parquet
      part-00002.parquet
      ...

  Time travel (Spark SQL): SELECT * FROM clicks TIMESTAMP AS OF '2024-01-15 00:00:00'
  → reads the current metadata file, picks the snapshot that was current at that time, and resolves it to the correct data files
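Time travel comes down to choosing the right snapshot from the table's snapshot list. A minimal sketch of that lookup follows; the `Snapshot` class and its field names are illustrative assumptions, not Iceberg's metadata schema.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    committed_at_ms: int   # commit time, epoch milliseconds
    manifest_list: str     # hypothetical path to the snapshot's manifest list


def snapshot_as_of(snapshots: list[Snapshot], ts_ms: int) -> Snapshot:
    """Return the latest snapshot committed at or before ts_ms."""
    ordered = sorted(snapshots, key=lambda s: s.committed_at_ms)
    times = [s.committed_at_ms for s in ordered]
    idx = bisect_right(times, ts_ms)
    if idx == 0:
        raise LookupError("no snapshot exists at or before the requested time")
    return ordered[idx - 1]
```

Once the snapshot is chosen, the reader follows its manifest list to the data files; later commits never rewrite those files, which is what makes old versions queryable.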

Feature Store

A feature store is a centralized system for computing, storing, and serving ML features. It solves a key infrastructure problem: features computed for training offline must be identical to features computed for inference online — but the systems are completely different (batch Spark vs. real-time serving).

Feature Store Architecture:

  Feature Definition (Python):
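    # Illustrative pseudocode: Feature, col, count, now, and days are placeholders,
    # not the API of a specific feature store.
    driver_trips_last_7d = Feature(
        name="driver_trips_last_7d",
        source=trips_table,
        # Filter to the 7-day window BEFORE aggregating; after the groupby
        # the "date" column no longer exists.
        transform=lambda df: df.filter(col("date") >= now() - days(7))
                               .groupby("driver_id")
                               .agg(count("trip_id").alias("driver_trips_last_7d"))
    )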

  Offline Store (training):
    Parquet files in S3, computed by Spark batch jobs
    → Point-in-time correct joins (retrieve features as of each training example's timestamp)
    → Prevents label leakage and keeps training features consistent with what serving would have seen

  Online Store (inference):
    Redis or DynamoDB: pre-materialized feature values for real-time lookup
    Latency: <5ms feature retrieval

  Feature Pipeline (materialization):
    Batch job runs hourly: recomputes features, writes to online + offline stores
    Streaming pipeline (Flink/Spark Streaming): updates online store in real time

Training-serving skew is one of the most common causes of ML production degradation. A model trained with features computed in batch Spark (using LAST_7_DAYS from a full table scan) but served with features computed differently (using a real-time Redis counter with a different window definition) will silently produce worse predictions. A feature store with a single feature definition used for both is the solution.
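The point-in-time correct join used by the offline store can be sketched without any feature store library. For each label, the join looks up the most recent feature value whose event time is not later than the label time. The data layout and function names here are assumptions for illustration.

```python
from bisect import bisect_right
from typing import Optional

# feature_log[entity_id] = list of (event_time, value), sorted by event_time
FeatureLog = dict[str, list[tuple[int, float]]]


def point_in_time_value(log: FeatureLog, entity_id: str, as_of: int) -> Optional[float]:
    """Latest feature value with event_time <= as_of, or None if none exists."""
    rows = log.get(entity_id, [])
    times = [t for t, _ in rows]
    idx = bisect_right(times, as_of)
    return rows[idx - 1][1] if idx else None


def build_training_rows(labels, log: FeatureLog):
    """Attach to each (entity_id, label_time, label) the feature as of label_time."""
    return [
        (entity_id, label_time, point_in_time_value(log, entity_id, label_time), label)
        for entity_id, label_time, label in labels
    ]
```

A naive join on `entity_id` alone would attach the newest feature value to every label, including values computed after the label event, which leaks future information into training.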

Popular feature stores: Feast (open source, originally developed at Gojek), Tecton (commercial, founded by the creators of Uber's Michelangelo platform and a major Feast contributor), Hopsworks (open source + enterprise), Vertex AI Feature Store (GCP managed), SageMaker Feature Store (AWS managed).


Experiment Tracking

Experiment tracking systems record the inputs, configuration, and outputs of every training run, enabling reproducibility and systematic hyperparameter search.

MLflow

MLflow (Databricks, 2018) is the most widely deployed open-source experiment tracker. It provides:

  • Tracking API: Log parameters, metrics, and artifacts from training code
  • Experiment UI: Compare runs, plot metric curves, visualize artifacts
  • Model Registry: Store versioned model artifacts with lifecycle management
  • Projects: Reproducible packaging of ML code

import mlflow

mlflow.set_experiment("bert-finetuning")

with mlflow.start_run():
    mlflow.log_param("learning_rate", 2e-5)
    mlflow.log_param("batch_size", 32)
    mlflow.log_param("epochs", 3)

    # ... training loop ...
    for epoch in range(3):
        train_loss = train_epoch(model, train_loader)
        val_f1 = evaluate(model, val_loader)
        mlflow.log_metric("train_loss", train_loss, step=epoch)
        mlflow.log_metric("val_f1", val_f1, step=epoch)

    mlflow.pytorch.log_model(model, "model")
    # Artifacts saved to MLflow artifact store (S3, GCS, Azure)

Weights & Biases (W&B)

W&B is a widely used commercial platform, particularly on larger teams. It adds real-time collaboration (multiple users viewing the same experiment), sweep (hyperparameter search) orchestration, model artifact versioning with lineage tracking, and reports for communicating results.

W&B sweep example:
  import wandb

  sweep_config = {
    "method": "bayes",  # Bayesian optimization
    "metric": {"name": "val_f1", "goal": "maximize"},
    "parameters": {
      # log_uniform_values takes min/max as actual values; plain log_uniform expects exponents
      "learning_rate": {"min": 1e-6, "max": 1e-3, "distribution": "log_uniform_values"},
      "batch_size": {"values": [16, 32, 64]},
      "dropout": {"min": 0.1, "max": 0.5}
    }
  }
  sweep_id = wandb.sweep(sweep_config, project="bert-finetuning")
  wandb.agent(sweep_id, function=train, count=50)
  # Runs 50 trials, using Bayesian optimization to select promising configs

Distributed Training Orchestration

Kubeflow

Kubeflow (Google, 2018) is a Kubernetes-native ML platform. Its core components:

  • Kubeflow Pipelines: DAG-based workflow orchestration (steps: data prep → train → evaluate → deploy), with a visual UI
  • Training Operators: Kubernetes operators for distributed training jobs: TFJob, PyTorchJob, MPIJob, PaddleJob
  • KServe (formerly KFServing): Model serving framework with canary/blue-green deployment

# Kubeflow PyTorchJob: 1 master + 3 workers
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: bert-training
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          containers:
          - name: pytorch
            image: pytorch/pytorch:2.0.1-cuda11.7-cudnn8-runtime
            resources:
              limits:
                nvidia.com/gpu: 8
    Worker:
      replicas: 3
      template:
        spec:
          containers:
          - name: pytorch
            image: pytorch/pytorch:2.0.1-cuda11.7-cudnn8-runtime
            resources:
              limits:
                nvidia.com/gpu: 8

AWS SageMaker

SageMaker provides managed training infrastructure with automatic cluster provisioning, distributed training support (SageMaker Distributed, which wraps NCCL/EFA automatically), built-in hyperparameter tuning (Bayesian optimization), and tight integration with S3 for data/model storage.

Vertex AI

Google's Vertex AI provides custom training containers on TPU/GPU clusters, AutoML, and Vertex Pipelines (Kubeflow Pipelines backed by Google infrastructure). Integration with BigQuery ML enables SQL-defined model training.


Model Registry

A model registry is a versioned artifact store for trained models, with metadata, stage management, and approval workflows.

Model Registry Lifecycle:

  Training job completes
       |
       v
  Register model: mlflow.register_model("runs:/abc123/model", "bert-classifier")
  → Model version 1, stage: None
       |
       v
  Automated evaluation (offline test set, fairness metrics, latency benchmark)
       |
       v
  Promote: client.transition_model_version_stage("bert-classifier", 1, "Staging")
  → Shadow deployment: model receives traffic but results not served to users
       |
       v
  A/B test: 5% canary traffic routed to staging model
  Monitor metrics for 48 hours
       |
       v
  Promote: client.transition_model_version_stage("bert-classifier", 1, "Production")
  → Full traffic cutover
       |
       v
  Previous production model: stage "Archived"
  Rollback: transition archived version back to Production (< 5 minutes)

Model metadata stored in registry: training dataset version, hyperparameters, evaluation metrics, training code git hash, training hardware, training duration, input/output schema (feature names and types).
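The lifecycle above is a small state machine. The sketch below is a toy in-memory registry that enforces the allowed transitions and archives the previous production version on promotion. It is not the MLflow API; the class name, method names, and transition table are assumptions. Newer MLflow releases also favor model version aliases over fixed stages, so treat the stage names as conceptual.

```python
ALLOWED = {
    "None": {"Staging", "Archived"},
    "Staging": {"Production", "Archived"},
    "Production": {"Archived"},
    "Archived": {"Production"},   # rollback path
}


class ToyRegistry:
    """In-memory stand-in for a model registry (illustration only)."""

    def __init__(self) -> None:
        self.stages: dict[int, str] = {}

    def register(self) -> int:
        """Create a new model version in stage 'None' and return its number."""
        version = len(self.stages) + 1
        self.stages[version] = "None"
        return version

    def transition(self, version: int, target: str) -> None:
        """Move a version to `target`, rejecting transitions that skip a gate."""
        current = self.stages[version]
        if target not in ALLOWED[current]:
            raise ValueError(f"{current} -> {target} is not allowed")
        if target == "Production":
            # Only one version serves production; archive the previous one.
            for v, stage in self.stages.items():
                if stage == "Production":
                    self.stages[v] = "Archived"
        self.stages[version] = target
```

Rollback is the same operation as promotion: transitioning an archived version back to Production archives whichever version is currently serving.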


Model Serving Infrastructure

Online Inference

NVIDIA Triton Inference Server is the standard for high-performance model serving on GPU. It supports:

  • Multiple model frameworks: TensorRT, ONNX Runtime, PyTorch (TorchScript), TensorFlow, OpenVINO
  • Dynamic batching: Triton automatically batches concurrent requests, amortizing GPU kernel launch overhead
  • Model ensembles: chain preprocessing → model → postprocessing as a pipeline
  • Concurrent model execution: multiple models loaded on one GPU

Triton dynamic batching:

  t=0ms:  Request A arrives
  t=1ms:  Request B arrives
  t=2ms:  Triton batches A+B together → one GPU inference for both
  t=4ms:  Inference completes → A and B both receive responses

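  Without batching (requests run one at a time):
  t=0ms:  Request A → GPU inference (2ms)
  t=2ms:  A returns
  t=2ms:  Request B → GPU inference (2ms)
  t=4ms:  B returns
  GPU utilization: ~40% (illustrative; batch-size-1 kernels leave most compute units idle)

  With dynamic batching:
  GPU utilization: ~80%+ (illustrative; batched requests fill more of the GPU)
  Throughput: the GPU is busy for 2ms instead of 4ms for the same two requests,
  so roughly 2x at saturation; the cost is that A's latency rises from 2ms to 4ms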
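The batching policy in the timeline can be simulated with a short function. This is a simplified model of a size-or-deadline batcher, not Triton's scheduler: it ignores whether the GPU is busy, and the function and parameter names are assumptions (the delay parameter plays the role of Triton's max_queue_delay_microseconds).

```python
def form_batches(arrivals_ms, max_batch_size, max_queue_delay_ms):
    """Group request arrival times into batches.

    A batch is dispatched when it reaches max_batch_size or when the oldest
    queued request has waited max_queue_delay_ms, whichever comes first.
    Returns a list of (dispatch_time_ms, [arrival times]) tuples.
    """
    batches = []
    queue = []
    for t in sorted(arrivals_ms):
        # Flush the queue if its deadline passed before this arrival.
        if queue and t > queue[0] + max_queue_delay_ms:
            batches.append((queue[0] + max_queue_delay_ms, queue))
            queue = []
        queue.append(t)
        if len(queue) == max_batch_size:
            batches.append((t, queue))
            queue = []
    if queue:
        # Remaining requests go out when the oldest one hits its deadline.
        batches.append((queue[0] + max_queue_delay_ms, queue))
    return batches
```

With arrivals at 0ms and 1ms and a 2ms queue delay, both requests are dispatched together at t=2ms, matching the timeline. Raising the delay yields larger batches and higher throughput at the price of added latency for the first request in each batch.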

BentoML is a Python-native serving framework that packages models + preprocessing + serving logic into a single deployable unit (a "Bento"). It supports adaptive batching, multiple runners per service, and deployment to Kubernetes, AWS Lambda, or EC2.

TorchServe (PyTorch team) provides standardized HTTP/gRPC serving for PyTorch models with support for custom handlers, batch inference, and A/B testing via management API.

Batch Inference

For offline scoring (score all customers overnight, generate recommendations for the next day), batch inference pipelines use:

  • Ray Data: Distributed data processing with GPU inference. Scales from a laptop to 1000 GPUs with the same code.
  • Spark: Distributed batch scoring, either by calling transform() on a fitted Spark MLlib pipeline model or by wrapping an external model (PyTorch, XGBoost) in a pandas UDF.
  • AWS SageMaker Batch Transform: Managed batch inference job that provisions GPU/CPU instances, runs inference, writes outputs to S3, then terminates the fleet.

LLM Serving

Serving large language models (GPT-4 class, 70B+ parameters) requires specialized infrastructure:

LLM Serving Stack:

  Client requests (text generation)
       |
       v
  vLLM / TensorRT-LLM serving frontend
       |
       v
  Continuous batching + PagedAttention
    - KV cache managed as virtual memory pages
    - New requests inserted into batch as old ones complete
    - GPU utilization: 60-90% (vs. 10-30% without continuous batching)
       |
       v
  Multi-GPU model parallel inference
    - Tensor parallelism: each GPU holds 1/N of each layer's weights
    - Pipeline parallelism: different GPUs handle different layers
       |
       v
  Response streaming (token-by-token via SSE/WebSocket)

vLLM (UC Berkeley, 2023) introduced PagedAttention: the KV cache (key-value cache for attention computation) is stored in fixed-size blocks that need not be contiguous, analogous to virtual memory pages. This enables efficient KV cache sharing across requests and nearly eliminates memory fragmentation. The vLLM authors report up to 24x higher throughput than naive HuggingFace Transformers serving.
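The bookkeeping behind paged KV caching can be illustrated with a toy block allocator. The sketch tracks only which physical blocks each request owns; the K/V tensors themselves are omitted. The class and method names are assumptions, not vLLM's internals.

```python
class PagedKVCache:
    """Toy block allocator for a paged KV cache (bookkeeping only)."""

    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size                    # tokens per block
        self.free_blocks = list(range(num_blocks))      # physical block ids
        self.block_tables: dict[str, list[int]] = {}    # request -> its blocks
        self.num_tokens: dict[str, int] = {}

    def append_token(self, request_id: str) -> None:
        """Reserve space for one more token, allocating a block only when needed."""
        n = self.num_tokens.get(request_id, 0)
        table = self.block_tables.setdefault(request_id, [])
        if n == len(table) * self.block_size:           # all owned blocks are full
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; a real server would preempt")
            table.append(self.free_blocks.pop())
        self.num_tokens[request_id] = n + 1

    def free(self, request_id: str) -> None:
        """Return a finished request's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.num_tokens.pop(request_id, None)
```

Because blocks are allocated on demand, a request wastes at most one partially filled block, instead of reserving contiguous memory for its maximum possible sequence length up front.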


Model Monitoring

Production models degrade silently. Unlike software bugs (which produce errors), model degradation produces subtly wrong predictions — often undetected for weeks.

Data Drift Detection

Data drift monitoring:

  Training distribution (baseline):
    feature "age":  mean=34.2, std=12.1, histogram=[...]

  Production distribution (week 3):
    feature "age":  mean=36.8, std=14.3, histogram=[...]

  Statistical test: Kolmogorov-Smirnov test
    KS statistic: 0.23  (p-value: 0.001 → significant drift)
    Alert: "age feature distribution has drifted"

  Population Stability Index (PSI):
    PSI < 0.1: no significant change
    0.1 ≤ PSI < 0.25: moderate change (monitor)
    PSI ≥ 0.25: significant change (retrain)
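Both statistics are simple enough to compute directly. The following is a minimal pure-Python sketch; the equal-width binning over the baseline range, the `eps` floor for empty bins, and the function names are choices made for this example (production tools often use quantile bins and report a p-value as well).

```python
import math
from bisect import bisect_right


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index using equal-width bins over the baseline range."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0   # fall back to 1.0 for a constant baseline

    def proportions(values):
        counts = [0] * bins
        for v in values:
            # Clamp out-of-range production values into the edge bins.
            i = min(max(int((v - lo) / width), 0), bins - 1)
            counts[i] += 1
        # Floor at eps so empty bins do not produce log(0).
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def ks_statistic(x, y):
    """Two-sample KS statistic: max distance between the empirical CDFs."""
    xs, ys = sorted(x), sorted(y)
    return max(
        abs(bisect_right(xs, v) / len(xs) - bisect_right(ys, v) / len(ys))
        for v in xs + ys
    )
```

The KS statistic measures the largest gap between the two cumulative distributions, while PSI sums a per-bin divergence, so PSI depends on the binning scheme and the two can disagree on borderline shifts.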

Concept Drift

Concept drift occurs when the relationship between input features and the target changes — even if the input distribution is stable. Example: a fraud detection model trained before a new fraud pattern emerges.

Monitoring concept drift requires ground truth labels, which usually arrive with a delay. For fraud detection, chargebacks arrive days or weeks after the transaction, and that delay bounds how quickly drift can be detected. For content moderation, human reviewers provide labels with a 24-hour lag.

Infrastructure Monitoring

Beyond model quality, serving infrastructure requires standard SRE metrics:

  • Latency: P50/P95/P99 inference latency. P99 > SLA triggers alert.
  • Throughput: Requests per second per GPU. Autoscale when throughput approaches capacity.
  • GPU utilization: nvidia-smi for ad hoc inspection; DCGM (Data Center GPU Manager) with its exporter for Prometheus metrics.
  • Queue depth: Backlog of pending inference requests. Sustained queue growth triggers scale-out.
  • Error rate: Failed inference (OOM, timeout, malformed input).

Monitoring platforms for ML: Evidently AI (open source, data drift + model performance reports), Fiddler (commercial, explainability + monitoring), Arize (commercial, LLM + classical model monitoring), Datadog ML Monitoring.


LLM Infrastructure

Training foundation models at scale introduces infrastructure challenges beyond standard deep learning:

LLM Training Infrastructure (GPT-4 class, estimated):

  ~25,000 A100 GPUs across 3,000 nodes
  InfiniBand HDR (200 Gbps) between nodes
  NVLink within each 8-GPU node

  Storage:
    Training data: ~15T tokens (~30 TB as compressed Parquet)
    Checkpoint size: ~1 TB per checkpoint, written every 1000 steps
    Checkpoint I/O: write 1 TB in <5 minutes (to avoid compute stall)
    Checkpoint storage: Lustre / IBM GPFS / Azure NFS (parallel FS)

  Fault tolerance:
    Node failure rate: ~0.5% of nodes per day at 3,000 nodes
    → ~15 node failures per day, i.e. one roughly every 1.6 hours
    Checkpoint frequency: every 30 minutes (max 30 minutes lost work)
    Automatic restart: SLURM / Kubernetes Job restart on node failure

  RLHF pipeline (Reinforcement Learning from Human Feedback):
    1. Supervised fine-tuning on human demonstration data
    2. Reward model training (human preference pairs)
    3. PPO (Proximal Policy Optimization) with reward model signal
    Requires: running both policy model and reward model simultaneously
    Typically 2-4x compute vs. supervised training alone
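Working the fault-tolerance and checkpoint figures through as arithmetic makes the requirements concrete. The inputs are the estimates listed above. The lost-work calculation additionally assumes that each node failure stalls the whole synchronous job and that failures land uniformly within a checkpoint interval; both are simplifying assumptions.

```python
nodes = 3_000
daily_failure_rate = 0.005        # 0.5% of nodes fail per day (estimate above)
checkpoint_interval_min = 30
checkpoint_size_tb = 1.0
write_budget_s = 5 * 60           # checkpoint must be written in under 5 minutes

# Expected failures across the cluster, and the mean gap between them.
failures_per_day = nodes * daily_failure_rate
hours_between_failures = 24 / failures_per_day

# A failure lands mid-interval on average, so about half an interval is redone.
expected_lost_min = checkpoint_interval_min / 2
lost_fraction = failures_per_day * expected_lost_min / (24 * 60)

# Sustained write bandwidth needed to meet the checkpoint budget (decimal units).
required_write_gb_s = checkpoint_size_tb * 1000 / write_budget_s

print(f"{failures_per_day:.0f} failures/day, one every {hours_between_failures:.1f} h")
print(f"recomputed work: {lost_fraction:.1%} of wall-clock time")
print(f"checkpoint write bandwidth: {required_write_gb_s:.2f} GB/s")
```

Under these assumptions recomputation alone costs a double-digit percentage of wall-clock time, which is why large runs push toward more frequent or asynchronous checkpoints.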

Evaluation infrastructure for LLMs: automated harnesses running thousands of benchmark prompts (MMLU, HellaSwag, HumanEval) after each training checkpoint. EleutherAI's lm-evaluation-harness is the standard open-source benchmark runner. Anthropic, OpenAI, and Google run proprietary evaluation suites internally.


Debugging Notes

  • Training-serving skew investigation: When model accuracy in production is worse than offline evaluation, suspect feature computation differences. Log feature values at inference time and compare their statistics to training data features. A feature store with consistent computation eliminates most such issues.
  • Experiment reproducibility failure: "Same code, different results" is often caused by: different random seeds (set torch.manual_seed + np.random.seed + random.seed + PYTHONHASHSEED), non-deterministic CUDA ops (torch.use_deterministic_algorithms(True) to find culprits), or different data ordering (fix data loader shuffle seed).
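  • OOM in serving: vLLM OOM errors are often caused by KV cache over-allocation. In vLLM, --gpu-memory-utilization 0.85 caps the fraction of VRAM the engine may claim for weights, activations, and KV cache. Monitor KV cache utilization: sustained usage near 100% means requests are being queued or preempted, usually because of long contexts or high concurrency limits. In Triton, check the instance count and maximum batch size of each loaded model.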
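  • Drift alert fatigue: If monitoring produces too many drift alerts, the team learns to ignore them. Tune the tests to fire less often: use a stricter significance level (a smaller p-value threshold), require a minimum effect size, or use PSI instead of a raw KS test, since with large samples KS flags even trivially small shifts. Alert only on economically significant drift.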
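  • MLflow artifact storage: Large model artifacts stored on the local filesystem (the default, ./mlruns) fill the disk quickly. For production, run a tracking server with a remote artifact store, for example mlflow server --default-artifact-root s3://mlflow-bucket, and point clients at that server with mlflow.set_tracking_uri(). The tracking URI identifies the server or database, not the artifact bucket. Note that mlflow gc permanently removes only runs that are already marked deleted, so delete old runs first and then run mlflow gc.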
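For the reproducibility item above, the seeds can be set in one helper. This is a minimal sketch: the function name `seed_everything` is an assumption, and NumPy and PyTorch are seeded only when they are installed. It does not cover data loader shuffling or non-deterministic CUDA kernels, which need the separate measures named in that bullet.

```python
import os
import random


def seed_everything(seed: int) -> None:
    """Seed the common RNG sources; optional libraries are seeded if present."""
    # PYTHONHASHSEED only takes effect for interpreters started after this
    # point (for example worker subprocesses); the current process's hash
    # randomization was fixed at startup.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass
```

Call it once at the top of the training entry point, before any model or data loader is constructed, and log the seed with the run's other parameters.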

Security Implications

  • Model poisoning via training data: If training data is sourced from user-provided content or third-party APIs without validation, adversaries can inject poisoned examples that cause the model to behave maliciously in specific circumstances (backdoor attacks). Data lineage tracking in Delta Lake/DVC makes it possible to audit and remove poisoned data retroactively.
  • Model extraction attacks: An API that serves model predictions can be queried systematically to reconstruct the model's decision boundary, stealing IP. Rate limiting, query monitoring, and output perturbation (adding noise to predictions) mitigate this.
  • Feature store access control: Feature stores contain aggregated user behavior data that may be sensitive (location history, purchase patterns). Access should be governed by column-level access controls with audit logging.
  • Prompt injection (LLM serving): For LLM serving pipelines, user-supplied prompts can include adversarial instructions that override system prompt behavior. Input sanitization, prompt shields, and output classifiers are defensive layers.
  • Model artifact tampering: A model artifact served from S3 could be tampered with if bucket permissions are misconfigured. Record a checksum (for example SHA-256) for each artifact at registration time, store it separately from the artifact, and verify it before serving.

Performance Implications

  • Inference latency breakdown: For a typical classification model, end-to-end inference latency = feature retrieval (5ms) + network (1ms) + preprocessing (1ms) + model inference (10ms) + postprocessing (1ms) = ~18ms. Nearly half of the latency budget is spent outside the model, so feature retrieval and the surrounding stages deserve as much optimization attention as the model itself.
  • Batching vs. latency trade-off: Larger batches improve GPU utilization and throughput but increase latency for individual requests. For real-time serving (P99 < 50ms), batch sizes are typically 8–32. For throughput-optimized offline inference, batch sizes of 256–2048 are common.
  • Model quantization impact: INT8 quantized models (TensorRT calibration, PyTorch torch.ao.quantization) typically achieve 2–4x speedup and up to 4x weight-memory reduction vs. FP32, often with <1% accuracy degradation. FP16 (half precision) is the default for most serving scenarios.
  • Cold model loading: Loading a 7B parameter model from disk into GPU memory takes 10–30 seconds. Keep models warm in serving infrastructure; do not load on every request. Triton loads models from its model repository at startup (or on explicit load) and keeps them resident in memory between requests.
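The precision and latency figures above follow from simple arithmetic, sketched here. The helper names are assumptions, and the memory estimate covers weights only, not activations, KV cache, or framework overhead.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Memory for model weights only, in decimal gigabytes."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def end_to_end_latency_ms(stages: dict[str, float]) -> float:
    """Total latency when the stages run sequentially."""
    return sum(stages.values())


stages = {
    "feature_retrieval": 5.0,
    "network": 1.0,
    "preprocessing": 1.0,
    "model_inference": 10.0,
    "postprocessing": 1.0,
}
total_ms = end_to_end_latency_ms(stages)
non_model_share = 1 - stages["model_inference"] / total_ms

for dtype in BYTES_PER_PARAM:
    print(dtype, weight_memory_gb(7e9, dtype), "GB for a 7B-parameter model")
print(f"total {total_ms} ms, {non_model_share:.0%} outside the model")
```

For a 7B-parameter model the weights alone need 28 GB in FP32, 14 GB in FP16, and 7 GB in INT8, which often decides whether the model fits on a single GPU.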

Failure Modes

  • Silent model degradation: The most dangerous failure in ML systems — the model produces wrong predictions without raising any error. Only detected through monitoring. Establish SLOs for model quality metrics just as for latency/error rate.
  • Feature pipeline failure → stale features: If the batch feature pipeline fails silently, the feature store serves stale features (hours or days old). Feature freshness monitoring and alerts on pipeline failures are essential.
  • Training job divergence: Loss goes to NaN or explodes. Causes: learning rate too high, gradient explosion (add gradient clipping: torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)), numerical instability in custom ops, or corrupted data batch (single corrupt example with enormous loss).
  • Checkpoint corruption: A training job that writes a checkpoint while under memory pressure may produce a corrupted checkpoint. Always write checkpoints atomically (write to temp file, then rename) and verify checksum on load.
  • Evaluation harness drift: If the evaluation benchmark harness is updated between runs, scores become incomparable. Pin evaluation framework versions alongside model checkpoints in the experiment registry.
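The atomic write and checksum verification recommended for checkpoints can be sketched with the standard library. The sidecar `.sha256` file and the function names are assumptions for this example; real training frameworks write sharded checkpoints, but the temp-file-then-rename pattern is the same.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def write_checkpoint_atomic(data: bytes, path: Path) -> str:
    """Write data to path atomically, store a SHA-256 sidecar, return the digest."""
    digest = hashlib.sha256(data).hexdigest()
    # The temp file must live in the target directory so the rename stays
    # on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())       # force bytes to disk before the rename
        os.replace(tmp_name, path)     # atomic replacement of the old checkpoint
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    Path(str(path) + ".sha256").write_text(digest)
    return digest


def load_checkpoint(path: Path) -> bytes:
    """Read a checkpoint and verify it against its recorded digest."""
    data = path.read_bytes()
    expected = Path(str(path) + ".sha256").read_text().strip()
    if hashlib.sha256(data).hexdigest() != expected:
        raise ValueError(f"checksum mismatch for {path}")
    return data
```

A reader never observes a half-written file at `path`: it sees either the previous complete checkpoint or the new one. A crash before the sidecar is written leaves a checkpoint that fails verification, so the loader falls back to an older one instead of resuming from suspect state.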

Modern Usage (2024–2025)

  • LLMOps: MLOps tooling is adapting to LLMs. Key differences: evaluation is harder (no ground truth label for open-ended generation), prompts are artifacts (tracked in registries), fine-tuning is less common than RAG (Retrieval-Augmented Generation), and monitoring focuses on hallucination rate and semantic similarity rather than classification accuracy.
  • Ray ecosystem consolidation: Ray has become the de facto Python distributed computing substrate for ML: Ray Train for distributed training, Ray Serve for model serving, Ray Data for preprocessing, Ray Tune for hyperparameter search, all in one framework with shared resource management.
  • Inference optimization focus: As spending shifts from training toward serving large pre-trained models, engineering effort has shifted to inference optimization: speculative decoding (draft model + verification), KV cache offloading (system RAM as spill for KV cache), quantization-aware serving, and kernel fusion.
  • Platform engineering for ML: Large organizations consolidate ML tooling into an internal ML platform team that maintains standardized training templates, feature stores, and serving infrastructure, reducing the per-team burden of assembling the full MLOps stack.

Future Directions

  • Continuous training: Moving from scheduled retraining to continuous training — models that update incrementally as new data arrives (online learning, streaming fine-tuning) without full retraining cycles.
  • Automated ML pipelines (AutoML 2.0): Not just hyperparameter search, but automated feature engineering, architecture search, and deployment policy optimization — treating the full ML lifecycle as an optimization problem.
  • Foundation model fine-tuning standardization: Parameter-efficient fine-tuning methods (LoRA, QLoRA, prefix tuning) are becoming the standard operational model — small adapter layers trained per task, swapped into a shared base model. Infrastructure for adapter versioning, serving multiplexing, and composition is an active development area.
  • Multi-modal MLOps: Models that consume images, audio, video, and text require feature stores and monitoring systems that handle heterogeneous modalities. Standardized schemas and drift detection for embeddings (nearest-neighbor distribution shift) are emerging patterns.

Exercises

  1. End-to-end pipeline: Build a complete ML pipeline: (a) version a dataset with DVC, (b) train a model with MLflow tracking, (c) register the model in MLflow registry, (d) serve it with BentoML, (e) add a data drift monitor that compares request feature distributions to training distributions.

  2. Feature store implementation: Implement a minimal feature store with Feast: define a feature view for user purchase history, materialize features offline, and set up online serving via Redis. Verify that offline training data and online serving data produce identical feature vectors for the same entity.

  3. Triton dynamic batching: Deploy a PyTorch model in Triton. Configure dynamic batching with max_queue_delay_microseconds=5000. Use the perf_analyzer tool to measure throughput vs. latency under different request rates. Identify the optimal batch size for your throughput target at P99 < 50ms.

  4. Drift detection: Take a production dataset and artificially shift one feature distribution (multiply by 1.2). Run the KS test and PSI calculation. At what magnitude of shift does PSI exceed the 0.25 alert threshold? What does this imply about the sensitivity of drift detection?

  5. LLM serving benchmark: Deploy a 7B parameter LLaMA model with vLLM (4-bit quantized) and with naive HuggingFace generate(). Benchmark both at 10 concurrent requests: measure throughput (tokens/second) and P95 latency. Quantify the improvement from PagedAttention.


References

  • Sculley, D., et al. (2015). "Hidden Technical Debt in Machine Learning Systems." NeurIPS 2015.
  • Zaharia, M., et al. (2018). "Accelerating the Machine Learning Lifecycle with MLflow." IEEE Data Engineering Bulletin.
  • Kwon, W., et al. (2023). "Efficient Memory Management for Large Language Model Serving with PagedAttention." SOSP 2023.
  • Feast documentation. https://docs.feast.dev
  • MLflow documentation. https://mlflow.org/docs/latest/index.html
  • Kubeflow documentation. https://www.kubeflow.org/docs/
  • NVIDIA Triton Inference Server documentation. https://docs.nvidia.com/deeplearning/triton-inference-server/
  • Ray documentation. https://docs.ray.io
  • Shankar, S., et al. (2022). "Operationalizing Machine Learning: An Interview Study." arXiv:2209.09125.
  • Paleyes, A., Urma, R., Lawrence, N. (2022). "Challenges in Deploying Machine Learning: a Survey of Case Studies." ACM Computing Surveys.