Monday, April 23, 2018

LZHAM decompressor vectorization

This is just a sketch of an idea, and it might not pan out - there are too many details I won't know until I try it. LZHAM is an old codec now, but maybe I can stretch it a bit further before moving on to a new design.

It should be possible to decode multiple segments of each LZHAM block simultaneously with vector instructions. Porting the decoder to C, then vectorizing the decoder loop with SPMD (using ispc), shouldn't be that hard. The compressed stream will need some hints so the decoder knows where each lane starts decoding from.

So for 8 parallel decodes: at the beginning of the block you memcpy() the arithmetic and Huffman statistics into 8 copies, run the SPMD decoder on, say, 1-2K bytes per lane, then sync the statistics after the eight 1-2K segments are processed. Then repeat the process.
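Here's a rough C++ sketch of the per-block flow I have in mind. All the names (lzham_stats_t, decode_segment(), merge_stats()) are hypothetical placeholders, not the real LZHAM API - in the real thing the middle loop would be the SPMD kernel itself:

#include <cstdint>
#include <vector>

// Hypothetical model state: arithmetic + Huffman statistics.
struct lzham_stats_t { /* symbol frequencies, adaptive tables, etc. */ };

// Scalar per-segment decoder (stub); in the ispc version this body is the
// vectorized kernel and each lane runs it in lockstep.
void decode_segment(const uint8_t* pSrc, uint8_t* pDst, uint32_t size, lzham_stats_t& stats);

// Combines the per-lane model updates (e.g. sums the symbol frequencies).
lzham_stats_t merge_stats(const lzham_stats_t* pLaneStats, uint32_t numLanes);

void decode_segment_group(const uint8_t* pComp, const uint32_t* pLaneOfs,
    uint8_t* pDst, const uint32_t* pDstOfs, uint32_t segSize,
    lzham_stats_t& stats, uint32_t numLanes)
{
    // 1. Replicate the shared statistics, one copy per lane.
    std::vector<lzham_stats_t> laneStats(numLanes, stats);

    // 2. Each lane decodes its own 1-2K segment, starting at the offset
    // hint the compressor wrote into the stream.
    for (uint32_t lane = 0; lane < numLanes; lane++)
        decode_segment(pComp + pLaneOfs[lane], pDst + pDstOfs[lane], segSize, laneStats[lane]);

    // 3. Sync: merge the per-lane updates back into the shared statistics
    // before the next group of segments.
    stats = merge_stats(laneStats.data(), numLanes);
}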

Lane 0 will be able to process match copies from any previously decompressed bytes, but lane 1 can't access lane 0's decoded bytes, lane 2 can't access lane 1's or 0's, etc. That'll definitely lower the ratio. However, it should be possible for the compressor to simulate each lane's decompressor and predict which bytes are available to it at each point in the compressed data stream.

The various if() statements in the decoder's inner loop won't be particularly coherent, which will hurt lane utilization. But even a 2-3x improvement in decoding speed would be pretty nice.

DirectXTex BC7 3 subset weirdness

I brought Microsoft's DirectXTex project (latest code) into my test project to see how it fares vs. ispc_texcomp and my encoder. Unfortunately, it appears broken. Across 31 test images (kodim and others):

DirectXTex BC7:

No flags (0): 9972.6 secs 44.41 dB

BC_FLAGS_USE_3SUBSETS: 13534.6 secs 44.25 dB

This is wrong. Quality should go up with 3 subset modes enabled, not down. I'm tempted to go figure out what's wrong in there myself, but it's a lot of code.

By comparison, my ispc encoder gets 477.1 secs 46.72 dB (using high quality settings). ispc_texcomp is in the same ballpark. With 3 subset modes disabled, I get 45.96 dB (as expected - the 3 subset modes are useful!).

I verified that the flag is doing something. Here's the BC7 mode histogram for kodim01 with the flags parameter to DirectX::D3DXEncodeBC7() set to 0:

0 17968 0 1752 0 0 4856 0

With 3 subsets enabled:

1435 16647 26 1675 0 0 4793 0

The source looks nice and readable, and as a library it was dead simple to get it building and linked against my stuff. But it doesn't appear to be a production-ready encoder; it's more like a sample.

I'm calling it from multiple threads using OpenMP (it's too slow to benchmark otherwise). It makes my 20 core Xeon workstation crawl for a while; it's that slow.

Also, there is no need to disable the 3 subset modes in a properly written encoder. Some timings with my encoder: 487 secs (3 subsets enabled) vs. 476 secs (disabled). In another test at lower quality: 215.9 secs (enabled) vs. 199.5 secs (disabled). This was on a 20 core Xeon workstation, 40 threads, 31 images (kodim+others).

The extra cost of a 3 subset mode isn't a big deal (three endpoint optimizations) once you've estimated which partition(s) to examine more deeply. Partition estimation is fast and simple with SIMD, and a nice property of the 3 subset modes is that the number of pixels fed to the endpoint optimizer per subset is rather low (enabling 3-subset specific optimizations). If your pbit handling is correct, these modes are quite valuable.


Sunday, April 22, 2018

Proper pbit computation in the BC7 texture format

The BC7 GPU texture format supports the clever concept of endpoint "pbits", where the LSB's of the RGB(A) endpoint components are forced to the same value so only 1 bit (vs. 3 or 4) needs to be coded. BC7's use of pbits saves precious bits which can be spent on other things that decrease error more. Some modes support a unique pbit per endpoint, and some have only 1 pbit for each endpoint pair.

I'm covering this because the majority of available BC7 encoders mess this important detail up. (I now kinda doubt any available BC7 encoder handles this correctly in all modes.) The overall difference across a 31 texture corpus (including the kodim images) is .26 dB RGB PSNR, which is quite a lot considering the CPU/GPU cost of doing this correctly vs. incorrectly is the same. (The improvement is even greater if you try all pbit combinations with proper rounding: .4 dB.)

ispc_texcomp handles this correctly for sure in most if not all modes, while the DirectXTex CPU, Volition GPU, and NVidia Texture Tools encoders don't as far as I can tell (they use pbit voting without proper rounding - the worst option). The difference vs. doing this correctly is pretty significant in some modes - at least ~.6 dB in mode 0!

Not doing this properly means your encoder will run slower because it will have to work harder (scanning more of the search space) to keep PSNR up vs. the competition. The amount of compute involved in lifting a BC7 encoder "up" by .26 dB across an entire test corpus is actually quite large, because there's a very steep quality vs. compute "wall" in BC7.

Here are some of the ways pbits can be computed. The RGB PSNR's are for a single encoded image (kodim18), purposely limited to mode 0 (4-bit components + unique per-endpoint pbits) to emphasize the importance of doing this correctly:
  • 40.217 dB: pbit search+compensated rounding: Compute properly rounded component endpoints compensating for the chosen pbit, try all pbit combinations. This is an encoder option in my new BC7 encoder. Encoding error evaluation cost: 2X or 4X (expensive!)
  • 39.663 dB: Round to middle of component bin+pbit search: Compute rounded endpoints (with a scale factor not including the pbit), then evaluate the full error of all 2 or 4 pbit possibilities. This is what I initially started doing, because it's trivial to code. In mode 0, you scale by 2^4, round, then iterate through all the pbits and test the error of each combination. Error evaluation cost: 2X or 4X
  • 39.431 dB: Compensated rounding (ispc_texcomp method): Proper quantization interval rounding, factoring in the shift introduced when the pbit is 1 vs. 0. The key idea: If an endpoint scaled and rounded to full-precision (with a scale factoring in the pbit) has an LSB which differs from the pbit actually encoded, you must properly round the output component value to compensate for the different LSB or you will introduce more error than necessary. So basically, if the LSB you want is different from what's encoded, you need to correctly round the encoded component index to compensate for this difference. You must also factor in the way the hardware scales up the encoded component+pbit to 8-bits. Error evaluation cost: 1X
  • 39.151 dB: Voting/parity (DirectXTex/Volition): Count how many endpoint components in the scaled colors (with a scale factor including the pbit) sharing each pbit have set LSB's. If half or more do then set the encoded pbit to 1, otherwise 0. pbit voting doesn't round the encoded components so it introduces a lot of error. Error evaluation cost: 1X
  • 38.712 dB: Always 0 or 0,1
  • 37.878 dB: Always 0 or 0,0
I tested a few different ways of breaking ties when computing pbits by voting and reported just the best one. At least on this image, biasing the high endpoint towards 1 helps a little:

Shared Unique  dB
>      >       39.053
>      >=      39.151
>=     >       38.996
>=     >=      38.432

This stuff is surprisingly tricky to get right, so here's a mode 0 example to illustrate what's going on. Each component can be coded to 16 possible values with one pbit selecting between two different ramps. So factoring in the pbit we have 32 possible representable 8-bit values. Here are the resulting 8-bit values (scaled using the shift+or method BC7 uses - not pure arithmetic scaling by 255/31 which is slightly different):

pbit 0: 0, 16, 33, 49, 66, 82, 99, 115, 132, 148, 165, 181, 198, 214, 231, 247

pbit 1: 8, 24, 41, 57, 74, 90, 107, 123, 140, 156, 173, 189, 206, 222, 239, 255
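This little C++ program reproduces both ramps using the shift+or expansion (the pbit is the LSB of the combined 5-bit value):

#include <cstdio>

int main()
{
    for (int pbit = 0; pbit <= 1; pbit++)
    {
        printf("pbit %d: ", pbit);
        for (int index = 0; index < 16; index++)
        {
            int v = (index << 1) | pbit;  // combined 5-bit endpoint value
            int v8 = (v << 3) | (v >> 2); // shift left 3, or the top 3 MSB's into the low bits
            printf("%d ", v8);
        }
        printf("\n");
    }
    return 0;
}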

Let's say the encoder wants to encode an endpoint value of 9/255 (using 8-bit precision) in mode 0 (4-bit components + pbit). The pbit voting encoders will compute a quantized/scaled component value of 1/31 (from a range of [0,31] - not [0,15], because we're factoring in the pbit). The LSB is 1 and the encoded component index (the top 4 MSB's) is 0, and if half or more of the other component LSB's are also 1, we're OK. In the good case we're coding a value of 8/255, which is closer to 9/255 than 24/255.

If instead a pbit of 0 is chosen, we're now encoding a value of 0/255 (because the encoded component index of 0 wasn't compensated), when we should have chosen the closer value of 16/255 (i.e. a component index of 1). Choosing the wrong LSB and not compensating the index has resulted in increased error.

There's an additional bit of complexity to all this: the hardware scales the mode 0 index+pbit up to 8 bits by shifting the combined 5-bit value left by 3 bits, then or'ing the upper 3 MSB's into the lower 3 bits to fill them in. This isn't quite the same as scaling by 255/31, so proper pbit determination code needs to factor this in too. Here are the ramps computed using arithmetic scaling+rounding (notice they differ slightly from the previous ramps computed using shifting+or'ing):

pbit 0: 0, 16, 33, 49, 66, 82, 99, 115, 132, 148, 165, 181, 197, 214, 230, 247

pbit 1: 8, 25, 41, 58, 74, 90, 107, 123, 140, 156, 173, 189, 206, 222, 239, 255

I worked out the formulas involved on a piece of paper: 

How to compute [0,1] values from mode 0 bins+pbits (using arithmetic scaling, NOT hardware scaling):
pbit 0: value=bin*2/31
pbit 1: value=(bin*2+1)/31

How to compute mode 0 bins from [0,1] values with proper compensation/rounding (rearranging the equations+rounding) for each pbit index:
pbit 0: bin=floor(value*31/2+.5)
pbit 1: bin=floor((value*31-1)/2+.5)
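Here's a minimal C++ translation of those two rounding formulas (arithmetic scaling, not the hardware shift+or ramp). Running it on the 9/255 example from above gives bin 1 for pbit 0 and bin 0 for pbit 1 - i.e. the compensated encodings of 16/255 and 8/255:

#include <cstdio>
#include <cmath>

// Quantize an 8-bit component to a mode 0 bin [0,15], compensating for the
// pbit that will actually be encoded.
static int quant_mode0_bin(int comp8, int pbit)
{
    float value = comp8 / 255.0f; // [0,1]
    float bin = pbit ? ((value * 31.0f - 1.0f) / 2.0f)  // value=(bin*2+1)/31
                     : (value * 31.0f / 2.0f);          // value=bin*2/31
    int b = (int)std::floor(bin + 0.5f);
    return (b < 0) ? 0 : ((b > 15) ? 15 : b);
}

int main()
{
    for (int pbit = 0; pbit <= 1; pbit++)
        printf("pbit %d -> bin %d\n", pbit, quant_mode0_bin(9, pbit));
    return 0;
}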

Here's the clever code in ispc_texcomp that handles this correctly for modes with unique pbits (modes 0, 3, 6 and 7). The bin calculations (see the comment in the code below) are slightly optimized forms of the previous set of equations.

I believe there's actually a bug in here for mode 7 - I don't see it scaling the component values up to 8-bit bytes for this mode. There's a special case to handle mode 0, and modes 3/6 don't need scaling because they have 7-bit components, but mode 7 has 5-bit components. I didn't check the rest of the code to see if it handles mode 7 elsewhere, but it's possible ispc_texcomp's handling of mode 7 is actually broken due to this bug. Mode 7 isn't valuable when encoding opaque textures, but it's pretty valuable for alpha textures because it's the only alpha mode that supports partitions.

///////////////////////////
// endpoint quantization

inline int unpack_to_byte(int v, uniform const int bits)
{
    assert(bits >= 4);
    int vv = v << (8 - bits);
    return vv + shift_right(vv, bits);
}

void ep_quant0367(int qep[], float ep[], uniform int mode, uniform int channels)
{
    uniform int bits = 7;
    if (mode == 0) bits = 4;
    if (mode == 7) bits = 5;

    uniform int levels = 1 << bits;
    uniform int levels2 = levels * 2 - 1;

    for (uniform int i = 0; i < 2; i++)
    {
        int qep_b[8];

        for (uniform int b = 0; b < 2; b++)
            for (uniform int p = 0; p < 4; p++)
            {
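                // The bin calculation: round (scaled value - pbit)/2 to the
                // nearest bin, then re-expand to the full value including the
                // pbit - an optimized form of the compensated rounding
                // equations above.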
                int v = (int)((ep[i * 4 + p] / 255f*levels2 - b) / 2 + 0.5) * 2 + b;
                qep_b[b * 4 + p] = clamp(v, b, levels2 - 1 + b);
            }

        float ep_b[8];
        for (uniform int j = 0; j < 8; j++)
            ep_b[j] = qep_b[j];

        if (mode == 0)
            for (uniform int j = 0; j < 8; j++)
                ep_b[j] = unpack_to_byte(qep_b[j], 5);

        float err0 = 0f;
        float err1 = 0f;
        for (uniform int p = 0; p < channels; p++)
        {
            err0 += sq(ep[i * 4 + p] - ep_b[0 + p]);
            err1 += sq(ep[i * 4 + p] - ep_b[4 + p]);
        }

        for (uniform int p = 0; p < 4; p++)
            qep[i * 4 + p] = (err0 < err1) ? qep_b[0 + p] : qep_b[4 + p];
    }
}

Saturday, April 21, 2018

RDO BC7 encoder planning

I'm building my first RDO BC7/BC6H encoders for our product (Basis), and I'm trying to decide which BC7 modes to implement first. I've decided on modes 1+6 for opaque textures. For alpha, I've boiled down the options to modes 1+5+6, 1+4+6, or maybe 1+4+6+7.

For alpha I know I'll need mode 4 and/or 5, because they're the only modes with uncorrelated alpha support. Mode 1 isn't optional because it supports 2 subsets, greatly reducing ugly block artifacts. The 3 subset modes aren't used much in practice and aren't necessary. Mode 6 is optional, but it boosts quality on simple blocks due to its awesome 7777.1 bit endpoint precision and beefy 4-bit selectors. The bare minimum for a decent opaque/alpha encoder seems to be modes 1+4 or 1+5, but adding 6 and 7 will improve quality, especially on alpha textures.

Others have suggested that I do an RDO BC7 encoder using only a single subset mode (6 seems perfect), but the result would have BC1-style block artifacts. I've already built a 2 subset RDO encoder for ETC1 in Basis, and it isn't that much harder to support 2 subsets vs. 1.

RDO GPU texture encoders try to optimize the encoded output data so that, after the output is LZ compressed (and the output pretty much always is!), you get as much quality per compressed bit as you can. The output bit patterns matter a lot. If you can reuse a bit pattern that's occurred in a nearby block, without impacting quality too much, an LZ codec can exploit this to issue a match vs. literals (saving bits and improving quality per compressed output bit). There's more to it than that, because you also want the ability to control the output quality vs. compressed size tradeoff in an artist-friendly way.

My RDO BC1 encoder in Basis uses Lagrangian optimization. This is simplified in BC1 because the bit patterns nicely line up on byte boundaries, which modern LZ codecs like. The endpoints and selectors aren't munged together, and a large chunk of the output entropy is in the selector data.
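The core of Lagrangian optimization fits in a few lines. This is just the general idea (candidate_t and est_bits are illustrative, not Basis's actual code): score each trial encoding by distortion plus lambda times its estimated post-LZ bit cost, and keep the minimum. Raising lambda trades quality for smaller compressed output:

#include <cstddef>
#include <cfloat>

struct candidate_t
{
    float distortion; // e.g. squared error vs. the source pixels
    float est_bits;   // estimated post-LZ cost (cheaper if it reuses nearby bit patterns)
};

// Minimize J = D + lambda * R over the candidate encodings.
static size_t pick_candidate(const candidate_t* pCands, size_t numCands, float lambda)
{
    size_t best = 0;
    float bestJ = FLT_MAX;
    for (size_t i = 0; i < numCands; i++)
    {
        float j = pCands[i].distortion + lambda * pCands[i].est_bits;
        if (j < bestJ) { bestJ = j; best = i; }
    }
    return best;
}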

Anyhow, here's some of the data I used to make these decisions about BC7. I have a lot more data than this which I may post later.

The image corpus was kodim + 7 others.
For comparison, ispc_basic (ispc_texcomp's basic profile, using 7 modes) gets 46.54 dB.
My encoder was set to extremely high quality (better than ispc_slow).

Single mode (encode times in secs):

mode enctime   dB               
1    83        45.8 
3    91.2      43.4 
6    12.1      42.9 
2    12.9      42.7 
0    19.6      42.34
4    38.56     41.48
5    14.4      38.45 

Key mode combinations:
                 
1+6  86.37     46.27
1+5  11.161    45.79
1+6  21.9      45.7 
1+4  137.6     45.38

This data definitely isn't ideal, because I used my ispc encoder which was heavily tuned for performance. So modes 1 and 3 tried all 64 partitions in this test, while modes 0 and 2 only tried the best predicted partition. But it matches up with earlier data I generated from a more brute force encoder.

Note that the mode 4 and 5 encoders used all rotation/index selector options, which skews the output a bit. I doubt I'll be supporting these mode 4/5 options. The two 1+6 entries are for the encoder set to very high vs. much lower quality levels.

Some earlier encoding error data for just kodim18, using an earlier C++ version of my encoder:

Best of any mode error: 12988.968550

Best of modes 0 and 1: 13186.750320
Best of modes 1 and 6: 13440.390471
Best of modes 1 and 2: 13521.855790
Best of modes 1 and 4: 13565.064541
Best of modes 1 and 3: 13589.143902
Best of modes 1 and 5: 13592.709517
Best of modes 1 and 7: 13604.182592
Best of modes 1 and 1: 13605.033774
Best of modes 0 and 6: 14099.723969
Best of modes 0 and 3: 15046.616630
Best of modes 2 and 6: 15383.155463
Best of modes 0 and 4: 15563.136445
Best of modes 0 and 5: 15665.245609
Best of modes 0 and 2: 15892.424359
Best of modes 3 and 6: 15977.876955

Best of modes 0 and 1 and 6: 13037.149688
Best of modes 0 and 1 and 4: 13160.175075
Best of modes 0 and 1 and 2: 13168.570462
Best of modes 0 and 1 and 3: 13171.807469
Best of modes 0 and 1 and 5: 13176.588633
Best of modes 0 and 1 and 7: 13186.504616
Best of modes 0 and 0 and 1: 13186.750320
Best of modes 0 and 1 and 1: 13186.750320
Best of modes 1 and 2 and 6: 13360.493404
Best of modes 1 and 4 and 6: 13400.774007
Best of modes 1 and 5 and 6: 13429.100640
Best of modes 1 and 3 and 6: 13433.822985
Best of modes 1 and 6 and 7: 13439.954762
Best of modes 1 and 1 and 6: 13440.390471
Best of modes 1 and 6 and 6: 13440.390471
Best of modes 1 and 2 and 4: 13489.904966
Best of modes 1 and 2 and 3: 13508.004738
Best of modes 1 and 2 and 5: 13511.406144
Best of modes 1 and 2 and 2: 13521.855790
Best of modes 1 and 1 and 2: 13521.855790

The 1+6 combination is pretty powerful. Only a quarter dB separates it from ispc_basic on this corpus. Also, mode 1's bit packing is nicely aligned to bytes (more or less) - perfect for RDO encoding!

byte bits
0    2 6  mode+partition
1    6 2  endpoints
2    4 4  endpoints
3    2 6  endpoints
4    6 2  endpoints
5    4 4  endpoints
6    2 6  endpoints
7    6 2  endpoints
8    4 4  endpoints
9    2 6  endpoints
10   2 6  pbits+selectors
11   8    selectors
12   8    selectors
13   8    selectors 
14   8    selectors
15   8    selectors

Note that most of mode 1's output is endpoint data, not selector data, so partition/pbit biasing, endpoint quantization and bit pattern optimization will be critical.

BC7 showdown: ispc_texcomp vs. my ispc encoder

This benchmark compares the Fast ISPC Texture Compressor's (ispc_texcomp) BC7 encoder vs. my new ispc vectorized encoder. I've just barely begun to profile and optimize mine, but it's already looking really strong. To create this encoder, I studied all the other available BC7 encoders and leveraged the things I learned while creating crunch's high-quality BC1 encoder and Basis's new ETC1 and universal format encoders. This is a non-RDO encoding test, i.e. what matters here is how much quality each encoder can achieve per unit time.

It was conducted on a 20 core Xeon workstation across 31 test images (first 24 are the kodim images). Both use AVX instructions. The quality metric is RGB average PSNR, perceptual mode disabled. The PSNR's below are averages across the entire set. The timings are the total amount of CPU time used only calling the encoder functions (across all threads). OpenMP was used for threading, and each encoder was called with 64 blocks per function call.

I'm currently focusing on ispc_texcomp's basic and slow profiles:

ispc_texcomp:
basic profile: 100.5 secs, 46.54 dB, .4631 dB/sec
slow profile: 355.29 secs, 46.77 dB, .1316 dB/sec

My encoder:
uber 0: 56.7 secs, 46.49 dB, .8199 dB/sec
uber 1: 86.4 secs, 46.72 dB, .5407 dB/sec
uber 2: 129.1 secs, 46.79 dB, .3624 dB/sec
uber 2 (2 refinement passes, 16 max 1,3 partitions): 161.9 secs, 46.84 dB, .2893 dB/sec
uber 3 (2 refinement passes, 32 max 1,3 partitions, pbit search): 215.2 secs, 46.91 dB, .2180 dB/sec
uber 4 (2 refinement passes, 64 max 1,3 partitions, pbit search): 292.5 secs, 46.96 dB, .1605 dB/sec

The dB/sec values are a simple measure of encoder efficiency. ispc_texcomp's slow profile at .1316 dB/sec is working very hard for very little quality gain per unit time compared to its basic profile. The efficiency of both encoders decreases as quality improves, but ispc_texcomp falls off very rapidly above basic while mine falls off later. I believe a whole-texture encoder like etc2comp's could get through the quality wall here more efficiently.

What this boils down to: If you use ispc_texcomp, definitely avoid the slow profile (the tiny gain in quality isn't worth it). And it's definitely possible to compete against ispc_texcomp using plain RGB metrics.

Friday, April 20, 2018

ispc_texcomp BC7 issues

I've been studying ispc_texcomp today to better understand why it's so slow compared to my encoder (currently by a factor of 2x at max quality). We do many of the same things, so why is it slower? Overall there are many clever/smart things in there (it's surprisingly well done!), but it's held back by weak vectorization, some missing optimizations, and the fact that it only supports linear RGB metrics.

Here's what I've found so far:

- The inner loops are bogged down with gathers and scatters. Definitely not good. There's even a set of helper functions at the top with a comment of "(perf warning expected)". (Umm - the compiler perf warnings are there for a reason!) For an example, check out block_quant().

The innermost loops should not have gathers, period.

- The partition estimation code uses full PCA. I've found this to be unnecessary in my testing - just using Waveren's bounding box approximation works well and is trivially vectorizable (a rough sketch follows below). After all, we're not trying to compute the actual output, just a reasonable approximation.

So ispc_texcomp goes into overkill mode and computes PCA while estimating the partition. At least the way it computes each subset's PCA is smart: it first computes the overall block's statistics/covariance, then it subtracts out the statistics of each partition's active (masked) pixels to compute each subset's individual covar.

Also, it's only computing what looks like an upper bound on the error from the block statistics, not an approximation of the actual error. The approximation of the actual error (factoring in quantization to 4/8/16 selectors) is extremely fast to compute with SIMD code, so it's not clear to me what's better yet.

Overall, the author seems to be favoring cleverness vs. exploiting the properties of fast but simple SIMD code.
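Here's a scalar C++ sketch of the kind of bounding box estimation I mean (my reading of the Waveren-style approach, not ispc_texcomp's code - the real thing would be vectorized):

#include <algorithm>

// Estimate one subset's error for a candidate partition: take the subset's
// RGB bounding box, treat the box diagonal as the principal axis, quantize
// each pixel's projection to the mode's selector count, and accumulate
// squared error vs. the interpolated color on the diagonal.
static float estimate_subset_err(const float (*pPixels)[3], const int* pPartition,
    int numPixels, int subset, int numSelectors)
{
    float lo[3] = { 1e9f, 1e9f, 1e9f }, hi[3] = { -1e9f, -1e9f, -1e9f };
    for (int i = 0; i < numPixels; i++)
    {
        if (pPartition[i] != subset) continue;
        for (int c = 0; c < 3; c++)
        {
            lo[c] = std::min(lo[c], pPixels[i][c]);
            hi[c] = std::max(hi[c], pPixels[i][c]);
        }
    }

    float axis[3], len2 = 0.0f;
    for (int c = 0; c < 3; c++) { axis[c] = hi[c] - lo[c]; len2 += axis[c] * axis[c]; }
    if (len2 == 0.0f) return 0.0f; // solid-color subset

    float totalErr = 0.0f;
    for (int i = 0; i < numPixels; i++)
    {
        if (pPartition[i] != subset) continue;

        // Project onto the box diagonal and quantize to a selector.
        float dot = 0.0f;
        for (int c = 0; c < 3; c++) dot += (pPixels[i][c] - lo[c]) * axis[c];
        int s = (int)(dot / len2 * (numSelectors - 1) + 0.5f);
        s = std::max(0, std::min(numSelectors - 1, s));
        float w = (float)s / (numSelectors - 1);

        // Squared error vs. the interpolated color on the diagonal.
        for (int c = 0; c < 3; c++)
        {
            float d = pPixels[i][c] - (lo[c] + w * axis[c]);
            totalErr += d * d;
        }
    }
    return totalErr;
}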

- It uses squish-style iterative refinement after choosing the partition: basically, it computes the PCA, comes up with some initial selectors, uses least squares to optimize the endpoints, then computes new selectors and tries all over again a few times. In my tests the PSNR gain from this method is too low (a fraction of a dB) to justify the repeated LS computation and selector selection costs. Just do it once and then optionally try other things. It's more effective to simply vary the selectors in simple ways (simplified cluster fit) in each trial.
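For reference, the least squares step itself is cheap - the cost is in repeating it. A one-channel scalar sketch of the general technique (not ispc_texcomp's exact code), given fixed selector weights w in [0,1]:

#include <cmath>

// Minimize sum(((1-w)*a + w*b - p)^2) over endpoints a and b for one channel.
// Returns false if the 2x2 normal equations are degenerate (all weights equal).
static bool ls_optimize_endpoints(const float* pPixels, const float* pWeights,
    int numPixels, float& a, float& b)
{
    float zz = 0, zo = 0, oo = 0, zp = 0, op = 0;
    for (int i = 0; i < numPixels; i++)
    {
        float w = pWeights[i], z = 1.0f - w, p = pPixels[i];
        zz += z * z; zo += z * w; oo += w * w;
        zp += z * p; op += w * p;
    }

    // Solve [zz zo; zo oo] * [a; b] = [zp; op]
    float det = zz * oo - zo * zo;
    if (std::fabs(det) < 1e-8f) return false;

    a = (oo * zp - zo * op) / det;
    b = (zz * op - zo * zp) / det;
    return true;
}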

- There's no support for perceptual colorspace metrics in there. This indirectly impacts performance (against other codecs that do support perceptual metrics) because it's stuck competing on RGB PSNR, and getting RGB PSNR up in BC7 is VERY computationally intensive. You basically hit a steep quality wall, and then it takes massively more compute to get above that wall even by a fraction of a dB.

If it supported perceptual metrics (where error in R,G,B is allowed to become a little unbalanced by approx. .25 - 1.5 dB, favoring G and R) it wouldn't have to try as hard because it would gain ~1.5 dB or more instantly before hitting the wall.

- First, the good news: the selector quantizer (see block_quant()) uses a clever algorithm: it dots the desired color with the subset's axis, converts that to a scaled int by rounding, clamps it to [1, num_selectors-1], then computes the full squared euclidean error between the desired color and the subset's interpolated colors (s-1) and s. It only has to compute the full distance to 2 colors vs. all of them, which is cool (sketched below).

I've compared this method vs. full distance to all colors and the results are super close (~1/1000th of a dB) on many images (but not all - I've seen .1 dB RGB PSNR loss on some images).

Now the bad news: the implementation is just bad. First, it recomputes the subset axis for every pixel (even though there are only 2 or 3 subsets in BC7). And it uses multiple gathers to fetch the endpoints! This is done for all 16 pixels in the block - ouch! There's also a per-pixel divide in there.

Also, with good SIMD, computing the full distance to all subset colors isn't that expensive, at least for 4 and maybe 8 color blocks. I've implemented optimized forms of both full search and ispc_texcomp's method. At least with AVX, all the fetches into the weighted_colors[] array (one for each lane) just slow the method down. Brute force leads to simpler code once vectorized and seems to slightly win out overall for 4 and 8 color blocks. With 16 color blocks the smarter method wins.
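Here's the two-distance trick as a scalar C++ sketch (my reconstruction of the idea, not ispc_texcomp's code). Note the axis and its squared length would be computed once per subset, not per pixel:

#include <algorithm>

// Quantize a desired color to a selector: project onto the subset axis to get
// a first guess s, then compute full squared error only vs. interpolated
// colors s-1 and s.
static int quant_selector(const float* pDesired, const float (*pInterp)[4],
    int numSelectors, const float* pLow, const float* pAxis, float axisLen2,
    int channels)
{
    float dot = 0.0f;
    for (int c = 0; c < channels; c++)
        dot += (pDesired[c] - pLow[c]) * pAxis[c];

    int s = (int)(dot / axisLen2 * (numSelectors - 1) + 0.5f);
    s = std::max(1, std::min(numSelectors - 1, s)); // clamp to [1, numSelectors-1]

    float err0 = 0.0f, err1 = 0.0f;
    for (int c = 0; c < channels; c++)
    {
        float d0 = pDesired[c] - pInterp[s - 1][c]; err0 += d0 * d0;
        float d1 = pDesired[c] - pInterp[s][c];     err1 += d1 * d1;
    }
    return (err0 < err1) ? (s - 1) : s;
}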

- After iterative refinement it doesn't have any more ways of improving quality. Varying the selectors in key ways (say, incrementing the lowest values and decrementing the highest values to exploit extrapolation) and then LS optimizing the results helps a lot (.3-.5 dB) and is very fast if you SIMD optimize the trial solution evaluator function, yet it doesn't do that.

- Its mode 0 encoder suffers from a lot of quantization error - which is indicative of some weaknesses in its endpoint selection:

ispc_texcomp mode 0 only: [image]

My encoder mode 0 only (no dithering - just stronger endpoint selection): [image]


- ispc_texcomp is weak with grayscale images, by around .6-1.2 dB in my testing. Granted, once you're over ~60dB it doesn't matter much.

The "slow" profile is solidly in the quality "wall" region I described earlier. The basic and faster profiles are in much healthier regions.

A few Intel SPMD Compiler (ispc) C porting tips

I took notes as I was porting my new BC7 encoder from C to ispc. First, be sure to read and re-read the user guide, performance guide, and FAQ. This compiler tech kicks ass and I hope Intel keeps throwing resources at it. My initial port of 3k lines of C (and initial stabs at vectorization) was only ~2x faster, but after tuning the inner loops perf shot up to over 5x vs. regular C code (on AVX). All without having to use a single ugly intrinsic.

I'm new to ispc so hopefully I haven't made any mistakes below, but here's what I learned during the process:

If you're not starting from scratch, port your code to plain C with minimal dependencies and test that first. Make some simple helper functions like clamp(), min(), etc. that look like the ispc standard lib's, so when you do get to ispc you can easily switch them over to use the stdlib's (which is very important for performance).

Then port your C code to ispc, but put "uniform" on ALL variables and pointers. Test that and make sure it still works. In my experience you should have minimal problems at this stage, assuming you put uniforms everywhere. Now you can dive into vectorizing the thing. I would first figure out how things should be laid out in memory and go from there. You may be able to just vectorize the hotspots, or you may have to vectorize the entire thing (like I did, which took hours of messing around with uniform/varying keywords).

While developing and recompiling over and over again, temporarily change your --target option to only target one specific instruction set: --target=avx (or SSE2, etc.). There's little use targeting a bunch of different instruction sets (SSE, SSE2, AVX, AVX2, etc.) while developing new code; at this point all you care about is getting it working correctly.

The mental model is like shaders but for the CPU. Conceptually, the entire program gang executes each instruction, but the results can be masked off on a per-lane basis. If you are comfortable with shaders you will get this model immediately. Just beware there's a lot of variability in the CPU cost of operations, and optimal code sequences can be dramatically faster than slower ones. Study the generated assembly of your hotspots in the debugger and experiment. CPU SIMD instruction sets seem more brittle than ones for GPU's (why?).

A single pointer deref can hide a super expensive gather or scatter. Don't ignore the compiler warnings. These warnings are critical and can help you understand what the compiler is actually doing with your code. Examine every gather and scatter and understand why the compiler is doing them. If these operations are in your hotspots/inner loops then rethink how your data is stored in memory. (I can't emphasize this enough - scatters and gathers kill perf. unless you are lucky enough to have a Xeon Phi.)

varying and uniform take on godlike properties in ispc. You must master them. A "varying struct" means the struct is internally expanded to contain X values for each member (one each for the size of the gang). sizeof(uniform struct) != sizeof(varying struct). While porting I had to check, recheck, and check again all uniform and varying keywords everywhere in my code.

You need to master pointers with ispc, which are definitely tricky at first. The pointee is uniform by default, but the pointer itself is varying by default which isn't always what you want. "varying struct *uniform ptr" is a uniform pointer to a varying struct (read it right to left). In most cases, I wanted varying struct's and uniform pointers to them.

Find all memset/memmove/memcpy's and examine them extremely closely. In many cases, they won't work as expected after vectorization. Check all sizeof()'s too. The compiler won't always give you warnings when you do something obviously dumb. In most cases I just replaced them with hand-rolled loops to copy/initialize the values, because once you switch to varying types all bets are off if a memset() will do the expected thing.

Sometimes, code sequences in vectorized code just don't work right. I had some code that inserted an element into a sorted list, that wouldn't work right until I rearranged it. Maybe it was something silly I did, but it pays to litter your code with assert()'s until you get things working.

assert()'s aren't automatically disabled in release builds; you must use "--opt=disable-assertions" to turn them off. assert()'s in vectorized code can be quite slow. The compiler should probably warn about assert()'s when optimizations are enabled.

print("%", var); is how you print things (not "%u" or "%f" etc.). Double parentheses around the value means the lane was masked out. If using Visual Studio I wouldn't fully trust the locals window when debugging - use print().

Once you start vectorizing, either the compiler is going to crash, or the compiler is going to generate function prologs that immediately crash. Both events are unfortunately going to happen until it's more mature. For the func. prolog crashes, in most if not all cases this was due to a mismatch between the varying/uniform attributes of the passed in pointers to functions that didn't cause compiler errors or warnings. Check and double check your varying and uniform attributes on your pointers. Fix your function parameters until the crash goes away. These were quite painful early on. To help track them down, #if 0 out large sections of code until it works, then slowly bring code in until it fails.

The latest version of ispc (1.9.2) supports limited debugging with Visual Studio. Examining struct's with bool's doesn't seem to work, the locals window is very iffy but more or less works. Single stepping works. Profiling works but seems a little iffy.

If you start to really fight the compiler on a store somewhere, you've probably got something wrong with your varying/uniform keywords. Rethink your data and how your code manipulates it.

If you're just starting a port and are new to ispc, and you wind up with a "varying varying" pointer then it's ok to be paranoid. It's probably not really what you want.

I experienced obvious codegen issues with uniform shifts and logical or's of uint16 values. Once I cast them to uint32's the problems went away. Be wary of integer shifts, which I had issues with in a few spots.

Some very general/hand-wavy recommendations with vectorized code: Prefer SP math over DP. Prefer FP math over integer math. Prefer 32-bit integer math over 64-bit. Prefer signed vs. unsigned integers. Prefer FP math vs. looking stuff up from tables if using the tables requires gathering. Avoid uint64's. Prefer 32-bit int math intermediates vs. 8-bit. Prefer simpler algorithms that load from constant array entries in a table (so all lanes lookup at the same location in the table), vs. more complex algorithms that require table lookups with unique per-lane indices.

Study stdlib.ispc. Prefer stdlib's clamp() vs. custom functions, and prefer stdlib vs. your own stuff for min, max, etc. The compiler apparently will not divine that what you are doing is just a clamp, you should use the stdlib functions to get good SIMD code.

Use uniform's as much as you possibly can. Prefer to make loop iterators uniform by default. Make loop iterators uniform by default when you start iterating at 0, even if the high loop limit is varying.

Use cif() etc. on conditionals which will strongly be taken or not taken by the entire gang. Compilation can get noticeably slower as you switch to cif().

A few min's or max's and some boolean/bit twiddling ops can be much faster than the equivalent multiple if() statements. Study the SSE2 etc. instruction sets because there are some powerful things in there. Prefer building code out of helpers like select() from the stdlib for performance.
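A trivial scalar illustration of the point - a clamp written as an if() chain vs. two ops (in ispc you'd reach for the stdlib's min()/max()/clamp()/select() here):

#include <algorithm>

// Branchy version: under SPMD, each taken branch can drag the whole gang
// through both paths under masks.
static float clamp_branchy(float x, float lo, float hi)
{
    if (x < lo) return lo;
    if (x > hi) return hi;
    return x;
}

// Branchless version: two min/max ops, no divergence.
static float clamp_branchless(float x, float lo, float hi)
{
    return std::max(lo, std::min(x, hi));
}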

Things that usually make perfect sense in CPU code, like early outs, may actually just hurt you with SIMD code. If your early out checks have to check all lanes, and it's an uncommon early out, consider just removing or rethinking them.