Blog — GradientPond

Product Update Dec 12, 2024

Introducing GradientPond v2.4: Distributed Training at Scale

Our biggest release yet brings support for clusters of 10,000+ GPUs, automatic sharding, and 40% faster checkpoint recovery. Here's what's new and how to upgrade.

Engineering Nov 28, 2024

How We Built a Zero-Copy Checkpoint System for Large Models

A deep dive into the engineering behind our new checkpoint system that saves 70B+ parameter models in under 30 seconds using memory-mapped files and async I/O.

Best Practices Nov 15, 2024

5 Experiment Tracking Patterns Every ML Team Should Adopt

From naming conventions to metric hierarchies, these patterns will help your team maintain reproducibility and accelerate iteration cycles across projects.

Product Update Oct 30, 2024

New: Hyperparameter Optimization with Population-Based Training

Population-Based Training (PBT) is now available in GradientPond. Automatically tune hyperparameters during training with evolutionary strategies that adapt in real time.

Engineering Oct 18, 2024

Scaling Experiment Metadata: From PostgreSQL to a Custom Time-Series Store

When you're ingesting 2 million metric points per second, off-the-shelf databases don't cut it. Here's how we built a custom storage engine for ML telemetry.

Best Practices Oct 5, 2024

A Practical Guide to Model Versioning and Registry Workflows

How to structure your model registry for production ML: staging environments, promotion gates, rollback strategies, and automated validation pipelines.
