Great, I'll let you work on it a bit :) Without the normal remapping, 4b=>fp16 is pretty trivial: mask each nibble, shift it into the fp16 denorm position (the top four mantissa bits), then apply scale/bias with an fma (works for both sym and asym schemes). That is 12 instructions for 8 elements, so 1.5 instructions per element.
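// isolate one nibble per 16-bit half (b4x8 holds 8 packed 4-bit values)
and.b32 a0, b4x8, 0xf000f000;
and.b32 a1, b4x8, 0x0f000f00;
and.b32 a2, b4x8, 0x00f000f0;
and.b32 a3, b4x8, 0x000f000f;
// move each nibble to bits 6..9 of its half (fp16 denorm mantissa)
shr.b32 a0, a0, 6;
shr.b32 a1, a1, 2;
shl.b32 a2, a2, 2;
shl.b32 a3, a3, 6;
// dequantize two elements per instruction
fma.rn.f16x2 a0, a0, scale, bias;
fma.rn.f16x2 a1, a1, scale, bias;
fma.rn.f16x2 a2, a2, scale, bias;
fma.rn.f16x2 a3, a3, scale, bias;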
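
To sanity-check the masks and shifts on the host, here is a rough Python emulation of the sequence above. It is only a sketch: the function names are made up for illustration, the fma is done in double precision rather than fp16, and the 2^18 factor (a nibble n at mantissa bits 6..9 of a denorm reads as n * 2^-18) is applied explicitly instead of being folded into scale.

```python
import struct


def half_from_bits(bits: int) -> float:
    """Reinterpret a 16-bit pattern as an IEEE 754 binary16 value."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


# (mask, shift) pairs mirroring a0..a3; negative shift means shift right.
MASKS_AND_SHIFTS = [
    (0xF000F000, -6),
    (0x0F000F00, -2),
    (0x00F000F0, 2),
    (0x000F000F, 6),
]


def unpack_4b_to_fp16(b4x8: int, scale: float = 1.0, bias: float = 0.0):
    """Return [(lo, hi), ...] for a0..a3, emulating the and/shift/fma steps."""
    out = []
    for mask, shift in MASKS_AND_SHIFTS:
        a = b4x8 & mask
        a = (a >> -shift) if shift < 0 else ((a << shift) & 0xFFFFFFFF)
        lo = half_from_bits(a & 0xFFFF)   # low fp16 lane of the f16x2 pair
        hi = half_from_bits(a >> 16)      # high fp16 lane of the f16x2 pair
        # Each lane is n * 2**-18; undo that, then apply scale and bias.
        # (Assumption: on the GPU, 2**18 would be pre-multiplied into scale.)
        out.append((lo * 2**18 * scale + bias, hi * 2**18 * scale + bias))
    return out


if __name__ == "__main__":
    print(unpack_4b_to_fp16(0x76543210))
```

Note the output order is a0..a3, i.e. nibbles 3, 2, 1, 0 of each half, which is the "no remapping" part. Also, if the 2^18 really is folded into scale on the device, the folded value has to stay below the fp16 max of 65504, so the original scale would need to be under roughly 0.25.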