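The xmm registers in SSE are all 128 bits wide. So far I have learned instructions such as movss and movps, but I don't yet know how to do parallel computation with them. For example, say I have four multipliers a1, a2, a3, a4 and four multiplicands b1, b2, b3, b4. I want to put a1, a2, a3, a4 into bits [0-31], [34-63], [64-95], [96-128] of xmm0, put b1, b2, b3, b4 into the same positions of xmm1, and then multiply them in parallel. How do I get a1, a2, a3, a4 into the right positions of the xmm register? movss plus shifting is surely not the best way. I just want to know whether there is a dedicated instruction for this kind of operation. I don't know whether shufps is the one; I couldn't follow its description.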
------Solution--------------------------------------------------------
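The SSE instructions movaps and movups can do this. The former requires the memory address to be 16-byte aligned; the latter does not.
Both instructions load four consecutive single-precision floating-point values from memory into a 128-bit XMM register.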
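As a rough sketch of what this looks like in practice, the C function below uses the compiler intrinsics from <xmmintrin.h>, which compilers normally translate to movups and mulps. The function name mul4 and its parameters are made up for illustration; it assumes the four multipliers and the four multiplicands are each stored contiguously in memory.

```c
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_loadu_ps, _mm_mul_ps, _mm_storeu_ps */

/* Hypothetical helper (not from the thread): multiplies a[0..3] by b[0..3]
 * element-wise and writes the four products to out[0..3]. */
void mul4(const float *a, const float *b, float *out)
{
    /* Unaligned 128-bit load (movups): a[0] lands in bits 0-31,
     * a[1] in bits 32-63, a[2] in bits 64-95, a[3] in bits 96-127. */
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);

    /* Packed multiply (mulps): all four lanes are multiplied at once. */
    __m128 prod = _mm_mul_ps(va, vb);

    /* Unaligned 128-bit store (movups) back to memory. */
    _mm_storeu_ps(out, prod);
}
```

So if a1..a4 and b1..b4 already sit in two float arrays, no per-element movss or shifting is needed: one load per array fills all four lanes.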
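The best reference for the SSE instructions is Intel's own documentation; see "Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 2A: Instruction Set Reference, A-M".
The copy I have at hand is the October 2010 edition, file name 253666.pdf, size 2988KB.
The manual describes the two instructions as: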
MOVAPS—Move Aligned Packed Single-Precision Floating-Point Values
MOVUPS—Move Unaligned Packed Single-Precision Floating-Point Values
Here is the manual's description of MOVUPS:
MOVUPS—Move Unaligned Packed Single-Precision Floating-Point Values
MOVUPS xmm1, xmm2/m128
Move packed single-precision floating-point values from xmm2/m128 to xmm1.
MOVUPS xmm2/m128, xmm1
Move packed single-precision floating-point values from xmm1 to xmm2/m128.
Description
Moves a double quadword containing four packed single-precision floating-point
values from the source operand (second operand) to the destination operand (first
operand). This instruction can be used to load an XMM register from a 128-bit
memory location, store the contents of an XMM register into a 128-bit memory location,
or move data between two XMM registers. When the source or destination
operand is a memory operand, the operand may be unaligned on a 16-byte boundary
without causing a general-protection exception (#GP) to be generated.
To move packed single-precision floating-point values to and from memory locations
that are known to be aligned on 16-byte boundaries, use the MOVAPS instruction.
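To illustrate the aligned variant mentioned here, the following sketch uses _mm_load_ps and _mm_store_ps, which compilers normally translate to movaps. The function name scale4_aligned is invented for this example, and it assumes the caller guarantees that both pointers are 16-byte aligned (for instance by declaring the buffers with _Alignas(16) in C11); with a misaligned address the aligned instructions fault.

```c
#include <xmmintrin.h>  /* SSE intrinsics: _mm_load_ps, _mm_set1_ps, _mm_mul_ps, _mm_store_ps */

/* Hypothetical helper (not from the manual): multiplies in[0..3] by factor.
 * ASSUMPTION: in and out are both aligned on a 16-byte boundary. */
void scale4_aligned(const float *in, float factor, float *out)
{
    /* Aligned 128-bit load (movaps); requires a 16-byte-aligned address. */
    __m128 v = _mm_load_ps(in);

    /* Broadcast the scalar factor into all four lanes. */
    __m128 f = _mm_set1_ps(factor);

    /* Packed multiply, then aligned 128-bit store (movaps). */
    _mm_store_ps(out, _mm_mul_ps(v, f));
}
```

When alignment cannot be guaranteed, the same function written with _mm_loadu_ps and _mm_storeu_ps is the safe choice.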
While executing in 16-bit addressing mode, a linear address for a 128-bit data access
that overlaps the end of a 16-bit segment is not allowed and is defined as reserved
behavior. A specific processor implementation may or may not generate a generalprotection
exception (#GP) in this situation, and the address that spans the end of
the segment may or may not wrap around to the beginning of the segment.
In 64-bit mode, use of the REX.R prefix permits this instruction to access additional
registers (XMM8-XMM15).
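Finally, a correction to a mistake in your question: [34-63] should be [32-63]. Likewise, [96-128] should be [96-127], since a 128-bit register has bits 0 through 127.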
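The lane layout can also be checked from C. This is only a sketch: pack4 is a hypothetical name, and it covers the case where a1..a4 are separate scalars rather than an array. It uses the _mm_set_ps intrinsic, which takes its arguments from the highest lane down to the lowest; the compiler chooses the actual instruction sequence.

```c
#include <xmmintrin.h>  /* SSE intrinsics: _mm_set_ps, _mm_storeu_ps */

/* Hypothetical helper: packs four separate scalars into one XMM value
 * and stores it to out so the lane order can be inspected. */
void pack4(float a1, float a2, float a3, float a4, float out[4])
{
    /* Arguments run from the highest lane to the lowest:
     * a4 -> bits 96-127, a3 -> bits 64-95, a2 -> bits 32-63, a1 -> bits 0-31. */
    __m128 v = _mm_set_ps(a4, a3, a2, a1);

    /* The store writes the lowest lane to the lowest address,
     * so out[0] holds bits 0-31 and out[3] holds bits 96-127. */
    _mm_storeu_ps(out, v);
}
```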