
SIMD Math

Nitpick's SIMD math support (v0.55.6) extends the scalar math compiler builtins to simd<flt64, N> types. Each call is lowered either to a native LLVM vector intrinsic or to a lane-wise expansion of scalar calls, so transcendental functions can be applied to whole vectors.


Overview

SIMD math operations apply a scalar math function to each lane of a SIMD vector independently. For functions with native LLVM vector intrinsics (sqrt, abs), the compiler emits a single vector instruction. For the remaining builtins (sin, cos, exp, log, and the others listed under Supported Functions), the compiler performs lane-wise expansion: it unpacks the N lanes, makes N scalar calls, and repacks the results into a vector.
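The lane-wise expansion described above can be modelled in a few lines of Python. This is only a sketch of the semantics, not Nitpick code and not the compiler's actual lowering; the helper name `lanewise` and the use of tuples to stand in for simd<flt64, N> values are assumptions made for illustration.

```python
import math

def lanewise(scalar_fn, lanes):
    """Hypothetical model of lane-wise expansion:
    unpack the lanes, call the scalar function once per lane, repack."""
    return tuple(scalar_fn(x) for x in lanes)

# A 4-lane vector modelled as a tuple of Python floats (IEEE 754 doubles).
angles = (0.0, math.pi / 6.0, math.pi / 4.0, math.pi / 3.0)
sins = lanewise(math.sin, angles)   # four independent scalar sin calls
coss = lanewise(math.cos, angles)   # four independent scalar cos calls

print([round(s, 3) for s in sins])  # [0.0, 0.5, 0.707, 0.866]
print([round(c, 3) for c in coss])  # [1.0, 0.866, 0.707, 0.5]
```

Because every lane is computed by its own scalar call, the result for one lane never depends on the values in the others.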


Supported Functions

All scalar math builtins that accept flt64 also accept simd<flt64, N>:

| Function | LLVM Strategy | Notes |
|---|---|---|
| sqrt(v) | llvm.sqrt.v*f64 (single vector instruction) | Fastest |
| abs(v) | llvm.fabs.v*f64 (single vector instruction) | Fastest |
| sin(v) | Lane-wise expansion | N scalar calls |
| cos(v) | Lane-wise expansion | N scalar calls |
| tan(v) | Lane-wise expansion | N scalar calls |
| exp(v) | Lane-wise expansion | N scalar calls |
| exp2(v) | Lane-wise expansion | N scalar calls |
| log(v) | Lane-wise expansion | N scalar calls |
| log2(v) | Lane-wise expansion | N scalar calls |
| log10(v) | Lane-wise expansion | N scalar calls |
| pow(v, e) | Lane-wise expansion | Both v and e may be SIMD |
| floor(v) | Lane-wise expansion | |
| ceil(v) | Lane-wise expansion | |
| round(v) | Lane-wise expansion | |
| trunc(v) | Lane-wise expansion | |

Basic Usage

// 4-lane SIMD
simd<flt64, 4>:angles = { 0.0, PI()/6.0, PI()/4.0, PI()/3.0 };
simd<flt64, 4>:sins   = sin(angles);   // ≈ { 0, 0.5, 0.707, 0.866 }
simd<flt64, 4>:coss   = cos(angles);   // ≈ { 1, 0.866, 0.707, 0.5 }

// 8-lane SIMD
simd<flt64, 8>:vals = { 1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0 };
simd<flt64, 8>:logs = log2(vals);      // { 0, 1, 2, 3, 4, 5, 6, 7 }

Pythagorean Identity — Verified Lane-Wise

The identity sin(x)² + cos(x)² == 1 holds lane-by-lane within floating-point precision:

simd<flt64, 4>:x = { 0.1, 0.5, 1.2, 3.0 };
simd<flt64, 4>:s = sin(x);
simd<flt64, 4>:c = cos(x);
simd<flt64, 4>:pyth = s * s + c * c;
// pyth ≈ { 1.0, 1.0, 1.0, 1.0 } within 1e-15
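A tolerance check of this kind can be sketched in Python, again as a scalar model rather than Nitpick code. The tuples stand in for the four lanes, and the 1e-15 bound is the one quoted above.

```python
import math

x = (0.1, 0.5, 1.2, 3.0)

# One scalar sin/cos call per lane, mirroring lane-wise expansion.
s = tuple(math.sin(v) for v in x)
c = tuple(math.cos(v) for v in x)
pyth = tuple(si * si + ci * ci for si, ci in zip(s, c))

# Compare each lane against 1.0 with a tolerance; exact equality is
# not guaranteed because every operation rounds to the nearest double.
max_err = max(abs(p - 1.0) for p in pyth)
within_tolerance = max_err < 1e-15
print(within_tolerance)  # True
```

Checking the largest per-lane deviation, rather than testing for equality with 1.0, keeps the check robust to last-bit rounding differences.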

Element Access

SIMD results can be extracted by index:

simd<flt64, 4>:r = sqrt({ 1.0, 4.0, 9.0, 16.0 });
flt64:lane0 = r[0];  // 1.0
flt64:lane1 = r[1];  // 2.0
flt64:lane2 = r[2];  // 3.0
flt64:lane3 = r[3];  // 4.0

Arithmetic on SIMD Results

SIMD math results compose with standard SIMD arithmetic:

simd<flt64, 4>:a = { 1.0, 2.0, 3.0, 4.0 };
simd<flt64, 4>:b = { 5.0, 6.0, 7.0, 8.0 };

// Combine trig and arithmetic
simd<flt64, 4>:sin_sum = sin(a) * cos(b) + cos(a) * sin(b);
// sin_sum[i] ≈ sin(a[i] + b[i])  (angle sum identity, up to floating-point rounding)

NaN and Infinity Handling

SIMD math functions follow IEEE 754 for each lane independently:

simd<flt64, 4>:v = { -1.0, 0.0, 1.0, 4.0 };
simd<flt64, 4>:r = sqrt(v);
// r[0] is NaN     (sqrt of negative; NaN never compares equal, even to itself)
// r[1] == 0.0
// r[2] == 1.0
// r[3] == 2.0

NaN in one lane does not affect other lanes.
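The per-lane behaviour can be illustrated with a small Python model. The helper `ieee_sqrt` is hypothetical: Python's `math.sqrt` raises an exception on negative input, so the helper returns NaN instead to mimic the IEEE 754 result described above.

```python
import math

def ieee_sqrt(x):
    """Hypothetical scalar sqrt with IEEE 754 results:
    NaN for negative inputs instead of raising an exception."""
    if x < 0.0:
        return math.nan
    return math.sqrt(x)

v = (-1.0, 0.0, 1.0, 4.0)
r = tuple(ieee_sqrt(x) for x in v)   # each lane computed independently

# NaN never compares equal to anything, including itself,
# so a NaN lane has to be detected with isnan rather than ==.
print(math.isnan(r[0]))   # True
print(r[0] == r[0])       # False
print(r[1:])              # (0.0, 1.0, 2.0)
```

The NaN in lane 0 leaves lanes 1 to 3 untouched, which is the isolation property stated above.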


Performance Notes

Vector-Native Functions (sqrt, abs)

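For sqrt and abs, the compiler emits a single vector operation that maps directly onto AVX2, AVX-512, or NEON hardware (e.g., vsqrtpd on x86-64, fsqrt on ARM64). All N lanes are processed by one instruction, with no per-lane calls and no unpacking when the data is already in SIMD registers, so the cost is typically far lower than that of N scalar calls.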

Lane-Wise Expansion (sin, cos, exp, log, ...)

For transcendentals, the compiler unrolls to N scalar libm calls. This is semantically correct but does not benefit from SIMD-width acceleration unless the target supports SVML (Intel Short Vector Math Library) or similar.

Optimization hint: On targets with SVML available, LLVM's vectorizer can sometimes replace the unrolled calls with a single vector math-library call. This only happens when optimization is enabled (-O2 or -O3), and it is not guaranteed.

Preferred SIMD Width

| Target | Recommended Width | Reason |
|---|---|---|
| x86-64 + AVX2 | simd<flt64, 4> | 256-bit registers |
| x86-64 + AVX-512 | simd<flt64, 8> | 512-bit registers |
| ARM64 + NEON | simd<flt64, 2> | 128-bit registers |
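The recommended widths follow from dividing the register width by the 64-bit size of one flt64 lane. A short Python sketch of that arithmetic, with `lanes_for_register` as a hypothetical helper name:

```python
FLT64_BITS = 64  # width of one flt64 lane

def lanes_for_register(register_bits):
    """Number of flt64 lanes that exactly fill one vector register."""
    return register_bits // FLT64_BITS

for target, bits in (("AVX2", 256), ("AVX-512", 512), ("NEON", 128)):
    print(f"{target}: simd<flt64, {lanes_for_register(bits)}>")
# AVX2: simd<flt64, 4>
# AVX-512: simd<flt64, 8>
# NEON: simd<flt64, 2>
```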

SIMD vs. Scalar Math Summary

| Scenario | Recommendation |
|---|---|
| Single value computation | Scalar builtins |
| Batched trig (sin/cos of N angles) | simd<flt64, 4> or simd<flt64, 8> |
| Batched sqrt / abs | SIMD (direct vector instruction) |
| Loop over large arrays | SIMD for throughput, scalar for correctness verification |
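The last row, batched processing checked against a scalar loop, can be sketched as follows. This is a Python model under stated assumptions, not Nitpick code: `batched_sin` is a hypothetical helper, each slice stands in for one simd<flt64, N> value, and the final slice may be shorter than the chosen width.

```python
import math

def batched_sin(values, width=4):
    """Hypothetical model: process `values` in groups of `width` lanes,
    with a shorter final group when len(values) is not a multiple of width."""
    out = []
    for start in range(0, len(values), width):
        batch = values[start:start + width]      # one SIMD-sized group
        out.extend(math.sin(x) for x in batch)   # one scalar call per lane
    return out

data = [0.1 * i for i in range(10)]      # 10 elements: two full batches and a tail of 2
batched = batched_sin(data, width=4)
reference = [math.sin(x) for x in data]  # plain scalar loop used for verification

print(batched == reference)  # True
```

In this model both paths call the same scalar function, so the results match exactly. If a vector math library is substituted for the scalar calls, results may differ in the last bits, and a tolerance-based comparison is the safer check.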

Related