<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Binding &#8211; bablick.de</title>
	<atom:link href="https://bablick.de/tag/binding/feed/" rel="self" type="application/rss+xml" />
	<link>https://bablick.de</link>
	<description>Writing About Clusters, Curiosity, and Everything in Between.</description>
	<lastBuildDate>Wed, 05 Nov 2025 06:12:53 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.8.3</generator>

<image>
	<url>https://bablick.de/wp-content/uploads/2025/08/cropped-BablickLogo-1-32x32.png</url>
	<title>Binding &#8211; bablick.de</title>
	<link>https://bablick.de</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Understanding Binding in Gridware Cluster Scheduler</title>
		<link>https://bablick.de/understanding-binding-in-gridware-cluster-scheduler/</link>
		
		<dc:creator><![CDATA[ernst.bablick]]></dc:creator>
		<pubDate>Sat, 25 Oct 2025 20:13:16 +0000</pubDate>
				<category><![CDATA[HPC]]></category>
		<category><![CDATA[IT]]></category>
		<category><![CDATA["chiplet"]]></category>
		<category><![CDATA["core binding"]]></category>
		<category><![CDATA["Gridware Cluster Scheduler"]]></category>
		<category><![CDATA["parallel computing"]]></category>
		<category><![CDATA["scheduler"]]></category>
		<category><![CDATA["socket binding"]]></category>
		<category><![CDATA["thread binding"]]></category>
		<category><![CDATA[Binding]]></category>
		<category><![CDATA[NUMA]]></category>
		<guid isPermaLink="false">https://bablick.de/?p=133</guid>

					<description><![CDATA[Modern compute nodes have grown increasingly complex — featuring heterogeneous cores, multi-level caches, and intricate NUMA topologies. In the previous post, Compute Nodes with Heterogeneous Topology in Gridware Cluster Scheduler, we looked at how these topologies are detected and represented in Gridware Cluster Scheduler. This post explores binding — how the scheduler decides where exactly...]]></description>
										<content:encoded><![CDATA[
<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"><figure class="wp-block-post-featured-image"><img fetchpriority="high" decoding="async" width="512" height="512" src="https://bablick.de/wp-content/uploads/2025/10/Brain-Topology-Binding-e1761421460621.png" class="attachment-post-thumbnail size-post-thumbnail wp-post-image" alt="Binding Topology" style="object-fit:cover;" /></figure></div>



<div class="wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>Modern compute nodes have grown increasingly complex — featuring heterogeneous cores, multi-level caches, and intricate NUMA topologies. In the previous post, <a href="https://bablick.de/compute-nodes-with-heterogenious-topology-in-gridware-cluster-scheduler/"><em>Compute Nodes with Heterogeneous Topology in Gridware Cluster Scheduler</em></a>, we looked at how these topologies are detected and represented in <strong>Gridware Cluster Scheduler</strong>.</p>



<p>This post explores <strong>binding</strong> — how the scheduler decides <em>where</em> exactly a job runs within a node, and how users can control this behavior for optimal performance.</p>
</div>
</div>



<span id="more-133"></span>



<h2 class="wp-block-heading">Why Binding Matters</h2>



<p>In high-performance computing (HPC), <strong>resource binding</strong> defines how processes or threads are mapped to specific CPU resources. Effective binding ensures predictable performance by preventing multiple jobs from competing for the same core, cache, or memory subsystem. It also improves bandwidth utilization for attached devices such as network interfaces, InfiniBand adapters, and GPUs by maintaining locality between compute tasks and their associated hardware resources.</p>



<p><strong>Gridware Cluster Scheduler</strong> treats binding as a <strong>first-class resource</strong>. Unlike traditional schedulers where binding was merely a hint, in Gridware Cluster Scheduler it is a <strong>hard requirement</strong> — a job will only start once the requested binding can be fulfilled.</p>



<h2 class="wp-block-heading">From Slots to Binding</h2>



<p>If you’ve read the previous post, you already know about the <strong>slot concept</strong> — where each slot represents a unit of computational capacity on a node. Here’s the full analogy that makes the difference between <em>slots</em> and <em>binding</em> concrete:</p>



<ul class="wp-block-list">
<li><strong>Slots = seats on an airplane.</strong> A compute node has a fixed number of slots, just as a plane has a fixed number of seats. Each sequential job and each task of a parallel job needs <strong>one slot</strong>, like each passenger needs <strong>one seat</strong>.</li>



<li><strong>Binding = weight-balanced placement and freight.</strong> Binding determines <em>where</em> the job runs within the node. In the airplane analogy, that’s like <strong>assigning specific rows/sections</strong> and <strong>placing freight in defined compartments</strong> to maintain balance. Similarly, binding pins tasks to <strong>threads, cores, sockets, dies (L3), or NUMA nodes</strong> so they benefit from nearby caches and memory and don’t interfere with other workloads.</li>
</ul>



<p>In short: slots define <strong>how many</strong>, binding defines <strong>where</strong> — the placement that preserves locality and stability.</p>



<h2 class="wp-block-heading">Binding Units and Amounts</h2>



<p>Binding behavior is primarily controlled with the <code>-bunit</code> and <code>-bamount</code> parameters during job submission.</p>



<h3 class="wp-block-heading">Define the Binding Level with <code>-bunit &lt;unit&gt;</code></h3>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:100%">
<figure class="wp-block-table"><table><thead><tr><th>Unit</th><th>Description</th></tr></thead><tbody><tr><td><strong>T</strong> or <strong>CT</strong></td><td>CPU thread of a power core</td></tr><tr><td><strong>ET</strong></td><td>CPU thread of an efficiency core</td></tr><tr><td><strong>C</strong></td><td>Power core (default)</td></tr><tr><td><strong>E</strong></td><td>Efficiency core</td></tr><tr><td><strong>S</strong> or <strong>CS</strong></td><td>All power cores of a socket</td></tr><tr><td><strong>ES</strong></td><td>All efficiency cores of a socket</td></tr><tr><td><strong>X</strong> or <strong>CX</strong></td><td>All power cores sharing the same L3 cache (chiplet/die)</td></tr><tr><td><strong>EX</strong></td><td>All efficiency cores sharing the same L3 cache</td></tr><tr><td><strong>Y</strong> or <strong>CY</strong></td><td>All power cores sharing the same L2 cache</td></tr><tr><td><strong>EY</strong></td><td>All efficiency cores sharing the same L2 cache</td></tr><tr><td><strong>N</strong> or <strong>CN</strong></td><td>All power cores of a NUMA node</td></tr><tr><td><strong>EN</strong></td><td>All efficiency cores of a NUMA node</td></tr></tbody></table></figure>
</div>
</div>



<p>Each unit level corresponds to a layer in the hardware hierarchy. </p>



<h3 class="wp-block-heading">Specify the Number of Units with <code>-bamount &lt;number&gt;</code></h3>



<p>The <strong>binding amount</strong> defines how many binding units should be assigned per slot (or per host).</p>



<pre class="wp-block-code"><code>qsub -pe mpi_8 16 -bunit C -bamount 2 ...</code></pre>



<p>This job requests 16 slots across two hosts. Each slot binds to <strong>two power cores</strong>, ideal for tasks starting two lightweight threads (or processes).</p>
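<p>Such a submission might be wrapped in a small job script. The sketch below is illustrative only; the <code>mpi_8</code> PE name comes from the example above, while <code>job.sh</code> and <code>./my_app</code> are placeholders, not part of the scheduler:</p>

<pre class="wp-block-code"><code>#!/bin/sh
# Illustrative job script; submit with:
#   qsub -pe mpi_8 16 -bunit C -bamount 2 job.sh
# Each slot is bound to two power cores, so each task can
# run two lightweight threads without oversubscribing them.
OMP_NUM_THREADS=2
export OMP_NUM_THREADS
exec mpirun ./my_app   # ./my_app is a placeholder binary</code></pre>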



<p>If the threads are tightly coupled, <strong>thread binding</strong> can be more suitable:</p>



<pre class="wp-block-code"><code>qsub -pe mpi_8 16 -bunit T -bamount 2 ...</code></pre>



<p>Binding threads instead of cores can enhance total cluster throughput by minimizing stalls from system calls, cache misses, and network delays, and by improving cache locality (especially for producer–consumer pairs). While each job may take up to twice as long to complete, the increased parallelism—running twice as many jobs simultaneously—often results in a net performance improvement of 5–10%.</p>



<p>Chiplet or die binding can be especially beneficial on modern CPUs where groups of cores share a common L3 cache. By aligning tasks to those chiplets, cache locality is preserved and cross-die memory traffic is minimized.</p>



<pre class="wp-block-code"><code>qsub ... -btype host -bunit X -bamount 1</code></pre>



<p>This command binds each job or task to all cores that share the same L3 cache. The job then has exclusive use of that cache domain and its attached resources (e.g. GPU or I/O devices), so other jobs cannot interfere. This can yield a performance benefit even if the job does not use every core of the die.</p>



<h2 class="wp-block-heading">Binding Types: Slot vs. Host</h2>



<p>Binding can be applied <strong>per slot</strong> (as we saw in previous examples) or <strong>per host</strong> using the <code>-btype</code> parameter.</p>



<ul class="wp-block-list">
<li><strong>Slot-based binding</strong> (default):<br>Each slot gets its own binding. This maximizes flexibility and is ideal for mixed workloads.</li>



<li><strong>Host-based binding</strong>:<br>Binding is applied collectively for all slots on a host, ensuring consistent placement but reducing flexibility.</li>
</ul>



<pre class="wp-block-code"><code>qsub -pe mpi_8 16 -btype host -bunit X -bamount 1 ...</code></pre>



<p>Here, the job gets 16 slots (8 per host) and <strong>Gridware Cluster Scheduler</strong> binds each group to one die (L3 cache domain). This host-wide approach minimizes fragmentation and improves cache locality.</p>



<p>Once a job (or advance reservation) is scheduled, its actual binding can be inspected with <code>qstat</code> (or <code>qrstat</code>):</p>



<pre class="wp-block-code"><code>qstat -j &lt;job_id&gt;   # or  qrstat -ar &lt;ar_id&gt;
...
binding:               bamount=16,binstance=set,bstrategy=pack,btype=host,bunit=X
exec_binding_list 1:   host1=NSxccccccccXCCCCCCCC,host2=NSxccccccccXCCCCCCCC</code></pre>



<p>The first line shows the binding request; the second lists the binding actually applied on each host (lowercase letters in the topology string). In this example, all cores below the first L3 cache of the first socket were used.</p>
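<p>The topology-string convention can be decoded mechanically. The following Python sketch is not part of the product; the string is simply copied from the output above, and lowercase letters mark the units bound to the job:</p>

<pre class="wp-block-code"><code># Decode a per-host binding string: lowercase letters mark
# units bound to the job, uppercase units remain free.
def bound_units(topology, unit):
    """Count bound (lowercase) occurrences of a unit letter."""
    return sum(1 for ch in topology if ch == unit.lower())

s = "NSxccccccccXCCCCCCCC"
print(bound_units(s, "C"))  # cores bound to the job: 8
print(bound_units(s, "X"))  # L3 domains bound: 1</code></pre>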



<h2 class="wp-block-heading">Binding Filters</h2>



<p>Sometimes certain cores or sockets should be left free — for example, one core per host reserved for system tasks. Binding filters, defined with <code>-bfilter</code>, make this possible.</p>



<p>A filter uses a <strong>topology string</strong> where lowercase letters mark excluded units.</p>



<p>Example:</p>



<pre class="wp-block-code"><code>qsub -bfilter ScCCCScCCC ...</code></pre>



<p>Here, the first core of each socket is masked and will not be used for binding. All other cores remain available.</p>



<p>Administrators can also define global filters by keyword:</p>



<pre class="wp-block-code"><code>qconf -sconf | grep binding_params
binding_params ... filter=first_core</code></pre>



<p>Global and job-specific filters are additive, and both restrictions apply simultaneously.</p>
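<p>The additive rule is easy to model: a unit is masked if it is lowercase in either filter string. A small Python sketch (illustrative only; the scheduler performs this combination internally):</p>

<pre class="wp-block-code"><code># Combine a global filter and a job filter additively:
# a unit is excluded when either string marks it lowercase.
def combine_filters(a, b):
    return "".join(x if x.islower() else y for x, y in zip(a, b))

print(combine_filters("ScCCCScCCC", "SCCCcSCCCc"))
# ScCCcScCCc: first and last core of each socket are masked</code></pre>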



<h2 class="wp-block-heading">Packed Binding (and What Comes Next)</h2>



<p>The <strong>packed binding strategy</strong> is the default in <strong>Gridware Cluster Scheduler</strong>.<br>It assigns available hardware units sequentially from left to right within a node’s topology string, ensuring that each host is filled efficiently while maintaining cache and NUMA locality.</p>



<p>Packed binding automatically groups tasks on nearby cores and within shared cache domains to reduce latency and memory contention.<br>If a host does not have enough free units to satisfy a job’s binding request, the scheduler simply skips that host.</p>



<p>Everything described so far is available in both <strong>Open Cluster Scheduler (OCS)</strong> and <strong>Gridware Cluster Scheduler (GCS)</strong>.</p>



<p>However, Gridware Cluster Scheduler introduces <a href="https://bablick.de/how-binding-order-and-range-shape-job-placement-in-gridware-cluster-scheduler/" data-type="post" data-id="158">extended binding control</a>. Packed binding can be refined through additional options — <strong><code>-bsort</code></strong>, <strong><code>-bstart</code></strong>, and <strong><code>-bstop</code></strong> — which let you influence the <strong>order</strong> and <strong>region</strong> of unit selection.</p>



<p>These <strong>advanced strategies</strong> are available <strong>only in Gridware Cluster Scheduler</strong> and will be discussed in detail in the <strong>next blog post</strong>.</p>



<p>🚀 <strong>Stay connected!</strong><br>Follow me on <strong><a href="https://x.com/ebablick" data-type="link" data-id="https://x.com/ebablick">X (Twitter)</a></strong> or join <strong>HPC-Gridware</strong> on <strong><a href="https://www.linkedin.com/company/hpc-gridware" data-type="link" data-id="https://www.linkedin.com/company/hpc-gridware">LinkedIn</a></strong> and <strong><a href="https://x.com/HPC_Gridware" data-type="link" data-id="https://x.com/HPC_Gridware">X (Twitter)</a></strong> for the latest release announcements, expert tips, and in-depth technical insights from our team.</p>



<p>🔧 <strong>Try it today:</strong> nightly builds featuring the latest <strong>OCS</strong> and <strong>GCS</strong> enhancements discussed in this post are now available from <strong><a href="https://hpc-gridware.com/download-main/" data-type="link" data-id="https://hpc-gridware.com/download-main/">HPC-Gridware</a></strong>.</p>



<p></p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Compute Nodes with Heterogeneous Topology in Gridware Cluster Scheduler</title>
		<link>https://bablick.de/compute-nodes-with-heterogenious-topology-in-gridware-cluster-scheduler/</link>
		
		<dc:creator><![CDATA[ernst.bablick]]></dc:creator>
		<pubDate>Thu, 23 Oct 2025 19:12:25 +0000</pubDate>
				<category><![CDATA[HPC]]></category>
		<category><![CDATA[Binding]]></category>
		<category><![CDATA[CPU]]></category>
		<category><![CDATA[Gridware]]></category>
		<category><![CDATA[NUMA]]></category>
		<category><![CDATA[Performance]]></category>
		<category><![CDATA[Scheduling]]></category>
		<guid isPermaLink="false">https://bablick.de/?p=125</guid>

					<description><![CDATA[As CPUs evolve toward hybrid designs with mixed core types and increasingly complex memory hierarchies, HPC schedulers must also evolve. This post explains how Gridware Cluster Scheduler 9.1.0 meets that challenge—bringing detailed, topology-aware resource scheduling to modern heterogeneous compute nodes. Why Topology Awareness Matters In modern high-performance computing (HPC), CPU cores within a single socket may...]]></description>
										<content:encoded><![CDATA[
<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"><figure class="wp-block-post-featured-image"><img decoding="async" width="512" height="512" src="https://bablick.de/wp-content/uploads/2025/10/Brain-topology-e1761246631294.png" class="attachment-post-thumbnail size-post-thumbnail wp-post-image" alt="Hardware Topology" style="object-fit:cover;" /></figure></div>



<div class="wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>As CPUs evolve toward hybrid designs with mixed core types and increasingly complex memory hierarchies, HPC schedulers must also evolve.<br>This post explains how <strong>Gridware Cluster Scheduler 9.1.0</strong> meets that challenge—bringing detailed, topology-aware resource scheduling to modern heterogeneous compute nodes.</p>
</div>
</div>



<span id="more-125"></span>



<h2 class="wp-block-heading">Why Topology Awareness Matters</h2>



<p>In modern high-performance computing (HPC), CPU cores within a single socket may differ in clock frequency, power characteristics, or cache layout. Meanwhile, memory hierarchies—NUMA nodes, multi-level caches, and chiplets—add new layers of complexity.</p>



<p>To schedule jobs efficiently, a cluster manager must understand and exploit this hardware topology.<br><strong>Gridware Cluster Scheduler 9.1.0</strong> introduces expanded binding and topology-awareness features to maximize performance and ensure predictable resource placement.</p>



<h2 class="wp-block-heading">Three Hardware Topologies from NVIDIA, AMD, and Intel</h2>



<p>To demonstrate the scheduler’s new capabilities, the following sections show real-world topology examples from <strong>NVIDIA</strong>, <strong>AMD</strong>, and <strong>Intel</strong> hardware.</p>



<h3 class="wp-block-heading">NVIDIA DGX Spark</h3>



<p>The <strong>NVIDIA DGX Spark</strong>—notable for its presentation by Jensen Huang to Elon Musk at SpaceX—uses a <strong>heterogeneous ARM architecture</strong> optimized for AI/ML workloads.<br>The system features <strong>20 ARM cores</strong> organized into <strong>five performance tiers</strong>, each with unique efficiency and frequency characteristics:</p>



<pre class="wp-block-code"><code>CPU-Type #4: efficiency=4, cpuset=0x00080000
  FrequencyMaxMHz = 4004
  LinuxCapacity   = 1024
CPU-Type #3: efficiency=3, cpuset=0x00078000
  FrequencyMaxMHz = 3978
  LinuxCapacity   = 1017
CPU-Type #2: efficiency=2, cpuset=0x000003e0
  FrequencyMaxMHz = 3900
  LinuxCapacity   = 997
CPU-Type #1: efficiency=1, cpuset=0x00007c00
  FrequencyMaxMHz = 2860
  LinuxCapacity   = 731
CPU-Type #0: efficiency=0, cpuset=0x0000001f
  FrequencyMaxMHz = 2808
  LinuxCapacity   = 718</code></pre>



<p>Using Intel’s terminology, this architecture could be viewed as <strong>10 Power cores</strong> and <strong>10 Efficiency cores</strong><br>(10 × ARM Cortex-X925 + 10 × ARM Cortex-A725). Each core has private L1/L2 caches, and groups of 10 share an L3 cache.</p>
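<p>The reported <code>LinuxCapacity</code> values appear to scale linearly with <code>FrequencyMaxMHz</code>, normalized to 1024 for the fastest tier. A quick Python check (this is an observation about the numbers above, not documented scheduler behavior):</p>

<pre class="wp-block-code"><code># LinuxCapacity looks like floor(1024 * freq / max_freq):
tiers = {4004: 1024, 3978: 1017, 3900: 997, 2860: 731, 2808: 718}
for mhz, capacity in tiers.items():
    estimate = int(1024 * mhz / 4004)
    print(mhz, capacity, estimate)  # estimate matches capacity</code></pre>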



<pre class="wp-block-code"><code>&gt; loadcheck -cb | grep Topology
Topology (GCS): NSXEEEEECCCCCXEEEEECCCCC</code></pre>



<p>Gridware Cluster Scheduler uses <strong>topology strings</strong> to represent such layouts.<br>Here: <code>N</code> = NUMA node, <code>S</code> = socket, <code>X</code> = L3 cache, <code>E</code> = Efficiency core, and <code>C</code> = Power core.</p>
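<p>Counting the letters in that string recovers the core inventory directly. A short illustrative Python sketch:</p>

<pre class="wp-block-code"><code># Tally the units in the DGX Spark topology string.
topo = "NSXEEEEECCCCCXEEEEECCCCC"
print(topo.count("C"))  # 10 power cores
print(topo.count("E"))  # 10 efficiency cores
print(topo.count("X"))  # 2 L3 cache groups of 10 cores each</code></pre>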



<h3 class="wp-block-heading">Intel i9-14900HX</h3>



<p>While the <strong>Intel i9-14900HX</strong> isn’t typical for HPC clusters, it’s an ideal case study for <strong>hybrid core</strong> architectures.</p>



<pre class="wp-block-code"><code>&gt; loadcheck -cb | grep Topology
Topology (GCS): NSXCTTCTTCTTCTTCTTCTTCTTCTTYEEEEYEEEEYEEEEYEEEE</code></pre>



<ul class="wp-block-list">
<li><strong>Power cores (C)</strong>: Dual-threaded (<code>T</code>), each with its own L2 cache.</li>



<li><strong>Efficiency cores (E)</strong>: Single-threaded, grouped by four per L2 cache (<code>Y</code>).</li>



<li><strong>NUMA node (N)</strong> and <strong>socket (S)</strong>: Encompass both core types and a shared L3 cache (<code>X</code>).</li>
</ul>
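<p>The same letter-counting applied to this string reproduces the chip's published layout: 8 P-cores with two hardware threads each plus 16 E-cores, i.e. 32 logical CPUs. A small illustrative Python check:</p>

<pre class="wp-block-code"><code>topo = "NSXCTTCTTCTTCTTCTTCTTCTTCTTYEEEEYEEEEYEEEEYEEEE"
p_cores   = topo.count("C")   # 8 power cores
p_threads = topo.count("T")   # 16 hardware threads
e_cores   = topo.count("E")   # 16 efficiency cores
print(p_threads + e_cores)    # 32 logical CPUs in total</code></pre>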



<h3 class="wp-block-heading">AMD EPYC Zen5</h3>



<p>The <strong>AMD EPYC Zen5</strong> series (e.g., <code>AMD-Epyc-Zen5-c4d-highmem-384</code>) represents a <strong>chiplet-based homogeneous design</strong>.<br>Each core provides two hardware threads, and the L3 cache structure (<code>X</code>) maps directly to chiplets/dies.</p>



<pre class="wp-block-code"><code>&gt; loadcheck -cb | grep Topology
Topology (GCS): NSXCTTCTTCTTCTTCTTCTTCTTCTT XCTTCTTCTTCTTCTTCTTCTTCTT
                  XCTTCTTCTTCTTCTTCTTCTTCTT XCTTCTTCTTCTTCTTCTTCTTCTT
                  ... (repeated chiplet layout per socket)
                NSXCTTCTTCTTCTTCTTCTTCTTCTT XCTTCTTCTTCTTCTTCTTCTTCTT
                  XCTTCTTCTTCTTCTTCTTCTTCTT XCTTCTTCTTCTTCTTCTTCTTCTT
                  ... (repeated chiplet layout per socket)</code></pre>



<p>Each socket (<code>S</code>) corresponds to one NUMA node (<code>N</code>), while every core has a private L2 cache.</p>



<h2 class="wp-block-heading">Handling Heterogeneous Topologies in Gridware Cluster Scheduler</h2>



<p>Efficient scheduling means <strong>assigning tasks to the most suitable hardware</strong>.<br>If a parallel job spans both slow and fast cores, the slowest becomes a bottleneck. Similarly, crossing NUMA or cache boundaries increases latency.</p>



<p>Gridware Cluster Scheduler 9.1 introduces <strong>fine-grained binding control</strong>, allowing binding to:</p>



<ul class="wp-block-list">
<li><strong>Sockets</strong></li>



<li><strong>Cores</strong></li>



<li><strong>Threads</strong></li>



<li><strong>NUMA nodes</strong></li>



<li><strong>Chiplets/Dies (cache domains)</strong></li>
</ul>



<p>This ensures optimal locality and predictable performance, even on hybrid or asymmetric systems.</p>



<h3 class="wp-block-heading">Chiplet/Die Binding Example</h3>



<pre class="wp-block-code"><code>qsub -pe mpi 15 -btype host -bamount 2 -bunit X ...</code></pre>



<p>This example requests <strong>15 MPI tasks</strong>, all running on a single host. Using <code>-btype host</code>, binding is applied relative to the host topology. With <code>-bamount 2 -bunit X</code>, each job portion binds to <strong>two chiplets/dies</strong>, ensuring that cache boundaries are respected and minimizing cross-die interference.</p>



<p>💡 <em>In this setup, the job uses 15 out of 16 available cores. The scheduler keeps the remaining core idle to prevent contention.</em></p>



<h2 class="wp-block-heading">Summary</h2>



<p>With version 9.1.0, Gridware Cluster Scheduler becomes fully topology-aware, bridging the gap between modern heterogeneous hardware and intelligent workload scheduling.<br>By supporting multiple binding types (core, socket, thread, NUMA, chiplet/die), it ensures efficient resource utilization and predictable performance across diverse compute nodes.</p>



<p>We are currently in the QA phase of this release and welcome user feedback on these new features.<br>They are already included in our nightly builds for testing, and beta releases will be available soon.<br><a href="https://hpc-gridware.com/download-main/" data-type="link" data-id="https://hpc-gridware.com/download-main/">Download Gridware Cluster Scheduler</a></p>



<p>Stay tuned with HPC-Gridware for updates — we’ll share <a href="https://bablick.de/understanding-binding-in-gridware-cluster-scheduler/" data-type="post" data-id="133">more insights</a>, examples, and best practices as we approach the official release.</p>



<p>Follow me at <a href="https://x.com/ebablick" data-type="link" data-id="https://x.com/ebablick">X/Twitter</a> or follow us at HPC-Gridware (<a href="https://www.linkedin.com/company/hpc-gridware" data-type="link" data-id="https://www.linkedin.com/company/hpc-gridware">LinkedIn</a>, <a href="https://x.com/HPC_Gridware" data-type="link" data-id="https://x.com/HPC_Gridware">X/Twitter</a>) for release announcements, tips, and technical insights.</p>



<p></p>
]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
