
As CPUs evolve toward hybrid designs with mixed core types and increasingly complex memory hierarchies, HPC schedulers must also evolve.
This post explains how Gridware Cluster Scheduler 9.1.0 meets that challenge—bringing detailed, topology-aware resource scheduling to modern heterogeneous compute nodes.
Why Topology Awareness Matters
In modern high-performance computing (HPC), CPU cores within a single socket may differ in clock frequency, power characteristics, or cache layout. Meanwhile, memory hierarchies—NUMA nodes, multi-level caches, and chiplets—add new layers of complexity.
To schedule jobs efficiently, a cluster manager must understand and exploit this hardware topology.
Gridware Cluster Scheduler 9.1.0 introduces expanded binding and topology-awareness features to maximize performance and ensure predictable resource placement.
Three Hardware Topologies from NVIDIA, AMD, and Intel
To demonstrate the scheduler’s new capabilities, the following sections show real-world topology examples from NVIDIA, AMD, and Intel hardware.
NVIDIA DGX Spark
The NVIDIA DGX Spark—notable for its presentation by Jensen Huang to Elon Musk at SpaceX—uses a heterogeneous ARM architecture optimized for AI/ML workloads.
The system features 20 ARM cores organized into five performance tiers, each with unique efficiency and frequency characteristics:
CPU-Type #4: efficiency=4, cpuset=0x00080000
    FrequencyMaxMHz = 4004
    LinuxCapacity   = 1024
CPU-Type #3: efficiency=3, cpuset=0x00078000
    FrequencyMaxMHz = 3978
    LinuxCapacity   = 1017
CPU-Type #2: efficiency=2, cpuset=0x000003e0
    FrequencyMaxMHz = 3900
    LinuxCapacity   = 997
CPU-Type #1: efficiency=1, cpuset=0x00007c00
    FrequencyMaxMHz = 2860
    LinuxCapacity   = 731
CPU-Type #0: efficiency=0, cpuset=0x0000001f
    FrequencyMaxMHz = 2808
    LinuxCapacity   = 718
Borrowing Intel's hybrid-core terminology, this architecture could be viewed as 10 performance cores and 10 efficiency cores
(10 × ARM Cortex-X925 + 10 × ARM Cortex-A725). Each core has private L1/L2 caches, and each group of 10 cores shares an L3 cache.
> loadcheck -cb | grep Topology
Topology (GCS): NSXEEEEECCCCCXEEEEECCCCC
Gridware Cluster Scheduler uses topology strings to represent such layouts.
Here: N = NUMA node, S = socket, X = L3 cache, E = Efficiency core, and C = Power core.
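As a quick sanity check, the core mix can be read straight out of that string with standard shell tools. The following one-liner is only a sketch and assumes the single-line loadcheck output shown above:

# count the E and C entries in the GCS topology string
> loadcheck -cb | grep "Topology (GCS)" | cut -d: -f2 | grep -o "[EC]" | sort | uniq -c

For the DGX Spark string this reports 10 C and 10 E entries, matching the 10 + 10 core split described above.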
Intel i9-14900HX
While the Intel i9-14900HX isn’t typical for HPC clusters, it’s an ideal case study for hybrid core architectures.
> loadcheck -cb | grep Topology
Topology (GCS): NSXCTTCTTCTTCTTCTTCTTCTTCTTYEEEEYEEEEYEEEEYEEEE
- Power cores (C): dual-threaded (T), each with its own L2 cache.
- Efficiency cores (E): single-threaded, grouped four per L2 cache (Y).
- NUMA node (N) and socket (S): encompass both core types and a shared L3 cache (X).
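Reading the string this way also gives the number of schedulable hardware threads: each T marks one thread of the preceding Power core, while every single-threaded Efficiency core counts as one. A small sketch with standard tools, again assuming the single-line loadcheck output shown above:

# count hardware threads: T entries (Power-core threads) plus E entries (single-threaded cores)
> loadcheck -cb | grep "Topology (GCS)" | cut -d: -f2 | grep -o "[TE]" | wc -l

For the i9-14900HX string this yields 32 (8 Power cores × 2 threads + 16 Efficiency cores), which matches the processor's logical CPU count.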
AMD EPYC Zen5
The AMD EPYC Zen5 series (e.g., AMD-Epyc-Zen5-c4d-highmem-384) represents a chiplet-based homogeneous design.
Each core provides two hardware threads, and the L3 cache structure (X) maps directly to chiplets/dies.
> loadcheck -cb | grep Topology
Topology (GCS): NSXCTTCTTCTTCTTCTTCTTCTTCTT XCTTCTTCTTCTTCTTCTTCTTCTT
                XCTTCTTCTTCTTCTTCTTCTTCTT   XCTTCTTCTTCTTCTTCTTCTTCTT
                ... (chiplet layout repeats for the remaining L3 cache domains of the first socket)
                NSXCTTCTTCTTCTTCTTCTTCTTCTT XCTTCTTCTTCTTCTTCTTCTTCTT
                XCTTCTTCTTCTTCTTCTTCTTCTT   XCTTCTTCTTCTTCTTCTTCTTCTT
                ... (chiplet layout repeats for the remaining L3 cache domains of the second socket)
(The string is wrapped and abridged here for readability.)
Each socket (S) corresponds to one NUMA node (N), while every core has a private L2 cache.
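The chiplet structure can likewise be derived from the string by splitting it at each X and counting the cores per L3 cache domain. This is only a sketch and assumes that loadcheck prints the full topology string on a single line:

# cores per L3 cache domain (chiplet): split at X, count C entries per segment
> loadcheck -cb | grep "Topology (GCS)" | cut -d: -f2 | awk '{ n = split($0, d, "X"); for (i = 2; i <= n; i++) printf("L3 domain %d: %d cores\n", i-1, gsub(/C/, "", d[i])) }'

For the layout shown above, every L3 domain reports eight cores, i.e. one Zen5 chiplet.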
Handling Heterogeneous Topologies in Gridware Cluster Scheduler
Efficient scheduling means assigning tasks to the most suitable hardware.
If a parallel job spans both slow and fast cores, the slowest becomes a bottleneck. Similarly, crossing NUMA or cache boundaries increases latency.
Gridware Cluster Scheduler 9.1 introduces fine-grained binding control, allowing binding to:
- Sockets
- Cores
- Threads
- NUMA nodes
- Chiplets/Dies (cache domains)
This ensures optimal locality and predictable performance, even on hybrid or asymmetric systems.
Chiplet/Die Binding Example
qsub -pe mpi 15 -btype host -bamount 2 -bunit X ...
This example requests 15 MPI tasks, all running on a single host. Using -btype host, binding is applied relative to the host topology. With -bamount 2 -bunit X, each job portion binds to two chiplets/dies, ensuring that cache boundaries are respected and minimizing cross-die interference.
💡 In this setup, the job uses 15 of the 16 cores available on the two bound chiplets. The scheduler keeps the remaining core idle to prevent contention.
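The same switches should carry over to the other binding units listed earlier. The variations below are a sketch rather than confirmed syntax: they assume that -bunit accepts the remaining topology-string letters (S for socket, C for core, T for thread, N for NUMA node) in the same way it accepts X for chiplets/dies.

# hypothetical variants: only -bunit X is shown above, the other unit letters are assumed
qsub -pe mpi 8 -btype host -bamount 8 -bunit C ...
qsub -pe mpi 32 -btype host -bamount 1 -bunit N ...

The first request would pin eight MPI tasks to eight dedicated cores on one host; the second would keep a 32-task job inside a single NUMA node to avoid remote-memory accesses. As in other Grid Engine derived schedulers, the resulting binding can typically be inspected with qstat -j <job_id>, though the exact reporting in 9.1.0 may differ.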
Summary
With version 9.1.0, Gridware Cluster Scheduler becomes fully topology-aware, bridging the gap between modern heterogeneous hardware and intelligent workload scheduling.
By supporting multiple binding types (core, socket, thread, NUMA, chiplet/die), it ensures efficient resource utilization and predictable performance across diverse compute nodes.
We are currently in the QA phase of this release and welcome user feedback on these new features.
They are already included in our nightly builds for testing, and beta releases will be available soon.
Download Gridware Cluster Scheduler
Stay tuned with HPC-Gridware for updates — we’ll share more insights, examples, and best practices as we approach the official release.
Follow me at X/Twitter or follow us at HPC-Gridware (LinkedIn, X/Twitter) for release announcements, tips, and technical insights.