Stateless Permutation: Shuffle Numbers Without Collisions

The Problem: The Uniqueness Constraint

Shuffling is trivial with a lookup table.
But on a GPU, things get messy if you want to do it statelessly (no shared memory, no global counter, just every thread calculating its own new position in isolation).

You can’t just throw a rand() function, which gives you collisions (two inputs mapping to the same output).
What we need is Bijective (a strict one-to-one mapping: every unique input i hits exactly one unique output j. No gaps, no overlaps).

In short, the goal: map [0, N-1] to a random order that executes at full GPU speed.

In Houdini, I hit this wall while using SegmentByConnectivity. I had a bunch of ordered IDs and needed to randomly mask exactly N pieces.
— Not like 15% via idtomask, which is possibility-based and might return 14 or 16 out of 100 but not garanteeing exactly 15.

My plan was to shuffle the IDs and grab the first N.
Doing this in Copernicus while keeping data on the GPU is harder than it looks.

The Solutions

Linear Congruential Generator (LCG)

The most basic math for this is the LCG formula (Wikipedia: Linear Congruential Generator):
f(x) = (x * A + B) mod N

Technically, if A and N are coprime (they share no common factors other than 1), it produces a full-cycle shuffle, or say it’s stateless. And fast.
To make it work, A needs to be a massive prime number larger than N.

opencl OpenCL

unsigned long A = 1442695040888963407UL; // Large Prime 
int id_out = (id * A + seed) % count;

However, the LCG approach has a major flaw known in math as Marsaglia’s Theorem: While a simple LCG is technically bijective, the math is purely linear, so the “random” results aren’t truly random—they fall into specific planes, or patterns.

In a 2d grid, this manifested as certain patterns across the image. When masking, the selection didn’t look organic:

cop shuffle 1 — LCG patterns: Not exactly 'random'.

Cycle-Walking Permutation

Then I explored the Cycle-Walking Permutation. The core logic relies on Correlated Multi-Jittered Sampling by Andrew Kensler.

This technique is a standard in the graphics industry, used by Pixar’s RenderMan and NVIDIA’s Fermat to generate high-quality random samples without the collision headache:

opencl OpenCL

/// A pseudorandom permutation function (see "Correlated Multi-Jittered Sampling", by Andrew Kensler)
///
CUGAR_FORCEINLINE CUGAR_HOST_DEVICE
uint32 permute(uint32 i, uint32 l, uint32 p)
{
	uint32 w = l - 1;
	w |= w >> 1;
	w |= w >> 2;
	w |= w >> 4;
	w |= w >> 8;
	w |= w >> 16;
	do {
		i ^= p;					i *= 0xe170893d;
		i ^= p			>> 16;
		i ^= (i & w)	>> 4;
		i ^= p			>> 8;	i *= 0x0929eb3f;
		i ^= p			>> 23;
		i ^= (i & w)	>> 1;	i *= 1 | p >> 27;
								i *= 0x6935fa69;
		i ^= (i & w)	>> 11;	i *= 0x74dcb303;
		i ^= (i & w)	>> 2;	i *= 0x9e501cc3;
		i ^= (i & w)	>> 2;	i *= 0xc860a3df;
		i &= w;
		i ^= i			>> 5;
	} while (i >= l);

	return (i + p) % l;
}

i: Input number to be permuted.
L: Range size over which the permutation is performed.
p: Random seed.

How it works:

1. Finding the Power-of-Two Mask

opencl OpenCL

uint32 w = l - 1;
w |= w >> 1;
w |= w >> 2;
w |= w >> 4;
w |= w >> 8;
w |= w >> 16;

It looks weird, but it’s simply a fast GPU way (bit operations) to generate a bitmask w that represents the smallest power-of-two value minus one (POT - 1) that is greater than or equal to L. E.g., for L = 10, the target POT is $2^4=16$ , and w would be $2^4-1=15$ .
This defines the “workspace” of our rapid shuffle that is 0-w. (Yes, it can be shuffled out of L, we’ll handle this in the last section.)

Let’s see an example:

Tracing an Input (L=10):

$L = 10\ (Binary: 1010)$

We start with L-1 to catch cases where L is already a POT:

( A sub example here: if w=L=8 (1000), already a POT, the bit-filling would turn it into 15 (1111). But if we start with L-1=7 (0111), the bit-filling keeps it at 7(0111), which is the correct upper bound we need. )

w = L - 1 $\rightarrow w = 9\ (Binary: 1001)$

Then we fills the entire number with 1 from the highest bit down :

w |= w >> 1 $\rightarrow w = 1001 | 0100 = 1101$

w |= w >> 2 $\rightarrow w = 1101 | 0011 = 1111\ (Decimal: 15)$

In this case above, w is eventually $2^4 - 1 = 15$ . This 0-15 range is our target workspace for mapping.

2. The Scramble

opencl OpenCL

i ^= p;					i *= 0xe170893d;
i ^= p			>> 16;
i ^= (i & w)	>> 4;
i ^= p			>> 8;	i *= 0x0929eb3f;
i ^= p			>> 23;
i ^= (i & w)	>> 1;	i *= 1 | p >> 27;
					i *= 0x6935fa69;
i ^= (i & w)	>> 11;	i *= 0x74dcb303;
i ^= (i & w)	>> 2;	i *= 0x9e501cc3;
i ^= (i & w)	>> 2;	i *= 0xc860a3df;
i &= w;
i ^= i			>> 5;

The goal of this block is to kick the ID into a unique new slot within our 0-w range. No collisions.

Tracing (i=3, L=10, p=42):

$i = 3\ (0011);\ p = 42\ (101010);\ w = 15\ (1111)$

We XOR the seed into the ID. XOR is basically addition without the carry—it shifts the ID but keeps it unique:

i ^= p $\rightarrow i = 0011$ ^ $101010 = 101001\ (Decimal: 41)$

Then we multiply by a giant prime. This is Bijective (the mapping stays 1:1; no two inputs will ever land on the same product). It shatters the linear pattern:

i *= 0xe170893d $\rightarrow i = an\ unique\ massive\ 32bit\ integer\ A$

Then we shift the p by 16, and XOR into i, to mix the high-order bits of Seed into the low-order bits of ID. Therefore we use different parts of p (like >> 16, and later >> 8 and >> 23) to fully influence i.

( These parts are designed to work for large numbers. For our small p= 42 here, they actually do nothing )

i ^= p >> 16 $\rightarrow i = A$ ^ $(101010 >> 16) = A$ ^ $0 = A$

Then we fold the bits. High-order bits change wildly during multiplication, while low-order bits are lazier. So we XOR the high bits, sharing the chaos, back into the low bits.
For our small w = 15 here, it still does nothing :

i ^= (i & w) >> 4 $\rightarrow i = A$ ^ $((A \& 1111) >> 4) = A$ ^ $((last\ 4\ bits\ of\ A) >> 4) = A$ ^ $0 = A$

Then we repeat the sequence of shifts, XORs, and multiplications above to i to fully mix the bits.

( For our small example, not all the operations have an effect. )

3. Cycle-Walking

opencl OpenCL

do {
	// Scrambles
	// ...
} while (i >= l);

Here we handle the case where the result of the scramble lands outside our range 0-L, since our workspace was 0-w (where w is the next POT minus one).

If the result is >= L (e.g., it lands on 12 $\in$ [0, w=15] but we only have L = 10 slots), the Cycle-Walking loop triggers, sending it through the mixer again until it lands in a valid seat.
( Since w < 2L, it’ll land in a valid seat in less than two tries on average. )

cop shuffle 2 — The result: Clean, fast, stateless randomness.