I think many crunchers would want support for more advanced CPUs. Your system requirements page says you already require SSE2, so builds that use newer instruction sets could speed up the applications even more.
- 64-bit versions of applications: This can help in three areas. The first and most important is that x86 in 32-bit mode is register starved, with only eight general-purpose registers, and AMD fixed this when designing AMD64 by doubling that to sixteen (and likewise doubling the SSE registers from eight to sixteen). The 32-bit register set was designed when memory ran at about the same speed as the processor, so memory operations were cheap back then. They are quite expensive today because DRAM and most caches are slower than the CPU core. One study on Pentium Pro processors, cited in one of my old college textbooks, found that they spent over half of their time waiting for the memory subsystem when executing code. The additional registers let the core stay busy doing real work and spend less time waiting on memory. They can also sometimes keep 64-bit capable NetBurst CPUs out of the pathologically energy-wasting replay mode, because data held in a register cannot miss the level 1 cache, and a level 1 miss is what guarantees entry into replay mode. The second area is that programs can directly address more than 4 gibibytes of memory. The third area is native 64-bit integer arithmetic, which is probably worthless for this application.
- SSE3: This adds a handful of instructions to the 128-bit wide vector unit, mainly horizontal add/subtract and combined add-subtract operations. These mostly help complex-number arithmetic and code that would otherwise need extra shuffles, so it might or might not be helpful depending on your code.
- AVX: This doubles the width of the floating point vector unit to 256 bits as compared to the 128-bit SSE/SSE2/SSE3 instruction sets. The integer vector unit is not affected by this instruction set.
- FMA4: This instruction multiplies two numbers and keeps all of the bits of the product without rounding, adds a third number to the product, and only then rounds the result, all as a single instruction with one rounding step (not literally in one clock cycle, since the unit is pipelined). It takes four operands, so the result can go to a register separate from the three sources. It therefore doubles the peak floating point operations per second figure if an FMA operation is counted as two floating point operations. AMD's Bulldozer processors need this instruction to perform decently in floating point, because otherwise their floating point units are pathologically slow.
- FMA3: This instruction does the same thing as FMA4, but has only three operands, so one of the source registers is overwritten with the result. AMD's Piledriver processors and later support it alongside FMA4, and need one of the two to reach an acceptable floating point speed, because all of the processors of the Bulldozer family have floating point units that are otherwise garbage. Intel's Haswell processors and later support FMA3 but not FMA4. Poor coordination between AMD and Intel, and Intel's discovery that FMA4 would require an extensive rework of its vector unit, created the confusion over which FMA instruction should be supported.
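- AVX2: This mainly doubles the width of the integer vector unit to 256 bits as compared to the 128-bit SSE/SSE2/SSE3 instruction sets. FMA3 is often lumped in with it, but it is technically a separate extension with its own CPUID flag: Intel introduced the two together in Haswell, while AMD's Piledriver has FMA3 without AVX2. AVX2 is found in Intel's Haswell processors and later, and in AMD's Excavator and later.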
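To make the list above a bit more concrete, here is a rough sketch of how a client could choose the fastest build a given CPU can run. The build names are made up for illustration and are not anything the project actually ships. The flag names are the ones Linux reports in /proc/cpuinfo ("lm" is 64-bit long mode, "pni" is SSE3, "fma" is FMA3). Which builds would be worth offering is purely my assumption.

```python
# Hypothetical build names, ordered from most to least demanding.
# Each entry is (build name, set of CPU flags that build requires).
# Flag names follow Linux /proc/cpuinfo:
#   "lm" = 64-bit long mode, "pni" = SSE3, "fma" = FMA3.
BUILDS = [
    ("x86_64-avx2-fma3", {"lm", "sse2", "pni", "avx", "avx2", "fma"}),
    ("x86_64-avx-fma4", {"lm", "sse2", "pni", "avx", "fma4"}),
    ("x86_64-avx", {"lm", "sse2", "pni", "avx"}),
    ("x86_64-sse3", {"lm", "sse2", "pni"}),
    ("x86_64-sse2", {"lm", "sse2"}),
    ("x86-sse2", {"sse2"}),
]


def pick_build(cpu_flags):
    """Return the first (most demanding) build whose flags are all present."""
    flags = set(cpu_flags)
    for name, required in BUILDS:
        if required <= flags:  # subset test: every required flag is available
            return name
    return None  # CPU does not even have SSE2


def read_linux_cpu_flags(path="/proc/cpuinfo"):
    """Read the feature flags of the first CPU listed (Linux only)."""
    with open(path) as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()
```

With this ordering a Piledriver (AVX, FMA3 and FMA4, but no AVX2) gets the FMA4 build, and a Haswell gets the AVX2 plus FMA3 build. On Linux, `pick_build(read_linux_cpu_flags())` would give the choice for the local machine.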
EDIT: Explained that Piledriver and later members of the Bulldozer family require either FMA4 or FMA3.
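For anyone wondering what "keeps all of the bits of the product" means in practice for FMA4 and FMA3, here is a small sketch. It does not use the hardware instruction. It only emulates the single-rounding behaviour with exact rational arithmetic, and the function names are my own. The inputs are chosen so that the exact product falls exactly halfway between two doubles.

```python
from fractions import Fraction


def unfused_mul_add(a, b, c):
    """Separate multiply and add: the product is rounded, then the sum is rounded."""
    return a * b + c


def fused_mul_add(a, b, c):
    """Emulated FMA: compute a*b + c exactly, then round once to a double."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)


# The exact product is 1 - 2**-54, which is not representable as a double.
a = 1.0 + 2.0**-27
b = 1.0 - 2.0**-27
c = -1.0

print(unfused_mul_add(a, b, c))  # product rounds to 1.0 first, so the tiny term is lost
print(fused_mul_add(a, b, c))    # single rounding keeps the -2**-54 term
```

The unfused version returns exactly zero, while the fused one returns -2**-54, so besides the speed benefit an FMA can also change (usually improve) the numerical result. That matters if a project compares results between hosts running different builds.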