1,16 → 1,16 |
|
▄▀▀▀ |
▄▄█▄▄ ▄▄▄▄ ▄▄▄▄▄ ▄▄▄ ▄▄ |
█ █ █ █ █ █ |
█ ▄▀▀▀▀█ ▀▀▀▀▄ █ █ █ |
█ ▀▄▄▄▄█▄ ▄▄▄▄▄▀ █ █ █ |
,''' |
,,;,, ,,,, ,,,,, ,,, ,, |
; ; ; ; ; ; |
; ,''''; '''', ; ; ; |
; ',,,,;, ,,,,,' ; ; ; |
|
flat assembler 1.66 |
flat assembler 1.70 |
Programmer's Manual |
|
|
Table of contents |
───────────────── |
----------------- |
|
Chapter 1 Introduction |
|
50,6 → 50,11 |
2.1.17 SSE3 instructions |
2.1.18 AMD 3DNow! instructions |
2.1.19 The x86-64 long mode instructions |
2.1.20 SSE4 instructions |
2.1.21 AVX instructions |
2.1.22 AVX2 instructions |
2.1.23 Auxiliary sets of computational instructions |
2.1.24 Other extensions of instruction set |
|
2.2 Control directives |
2.2.1 Numerical constants |
75,8 → 80,9 |
2.4.4 Executable and Linkable Format |
|
|
|
Chapter 1 Introduction |
─────────────────────── |
----------------------- |
|
This chapter contains all the most important information you need to begin |
using the flat assembler. If you are an experienced assembly language programmer, |
139,7 → 145,7 |
destination file. |
The following is an example of the compilation summary: |
|
flat assembler version 1.66 |
flat assembler version 1.70 (16384 kilobytes memory) |
38 passes, 5.3 seconds, 77824 bytes. |
|
In case of error during the compilation process, the program will display an |
146,7 → 152,7 |
error message. For example, when the compiler can't find the input file, it will |
display the following message: |
|
flat assembler version 1.66 |
flat assembler version 1.70 (16384 kilobytes memory) |
error: source file not found. |
|
If the error is connected with a specific part of source code, the source line |
153,7 → 159,7 |
that caused the error will also be displayed. The placement of this line in |
the source is also given to help you find the error, for example: |
|
flat assembler version 1.66 |
flat assembler version 1.70 (16384 kilobytes memory) |
example.asm [3]: |
mob ax,1 |
error: illegal instruction. |
163,7 → 169,7 |
contains a macroinstruction, also the line in macroinstruction definition |
that generated the erroneous instruction is displayed: |
|
flat assembler version 1.66 |
flat assembler version 1.70 (16384 kilobytes memory) |
example.asm [6]: |
stoschar 7 |
example.asm [3] stoschar [1]: |
212,8 → 218,8 |
Any of the "+-*/=<>()[]{}:,|&~#`" is the symbol character. The sequence of |
other characters, separated from other items with either blank spaces or |
symbol characters, is a symbol. If the first character of symbol is either a |
single or double quote, it integrates the any sequence of characters following |
it, even the special ones, into a quoted string, which should end with the same |
single or double quote, it integrates any sequence of characters following it, |
even the special ones, into a quoted string, which should end with the same |
character, with which it began (the single or double quote) - however if there |
are two such characters in a row (without any other character between them), |
they are integrated into quoted string as just one of them and the quoted |
237,40 → 243,45 |
brackets or after the "ptr" operator). |
|
Table 1.1 Size operators |
┌──────────┬──────┬───────┐ |
│ Operator │ Bits │ Bytes │ |
╞══════════╪══════╪═══════╡ |
│ byte     │ 8    │ 1     │ |
│ word     │ 16   │ 2     │ |
│ dword    │ 32   │ 4     │ |
│ fword    │ 48   │ 6     │ |
│ pword    │ 48   │ 6     │ |
│ qword    │ 64   │ 8     │ |
│ tbyte    │ 80   │ 10    │ |
│ tword    │ 80   │ 10    │ |
│ dqword   │ 128  │ 16    │ |
└──────────┴──────┴───────┘ |
/-------------------------\ |
| Operator | Bits | Bytes | |
|==========|======|=======| |
| byte     | 8    | 1     | |
| word     | 16   | 2     | |
| dword    | 32   | 4     | |
| fword    | 48   | 6     | |
| pword    | 48   | 6     | |
| qword    | 64   | 8     | |
| tbyte    | 80   | 10    | |
| tword    | 80   | 10    | |
| dqword   | 128  | 16    | |
| xword    | 128  | 16    | |
| qqword   | 256  | 32    | |
| yword    | 256  | 32    | |
\-------------------------/ |
|
Table 1.2 Registers |
┌─────────┬──────┬────────────────────────────────────────────────┐ |
│ Type    │ Bits │                                                │ |
╞═════════╪══════╪════════════════════════════════════════════════╡ |
│         │ 8    │ al    cl    dl    bl    ah    ch    dh    bh   │ |
│ General │ 16   │ ax    cx    dx    bx    sp    bp    si    di   │ |
│         │ 32   │ eax   ecx   edx   ebx   esp   ebp   esi   edi  │ |
├─────────┼──────┼────────────────────────────────────────────────┤ |
│ Segment │ 16   │ es    cs    ss    ds    fs    gs               │ |
├─────────┼──────┼────────────────────────────────────────────────┤ |
│ Control │ 32   │ cr0   cr2   cr3   cr4                          │ |
├─────────┼──────┼────────────────────────────────────────────────┤ |
│ Debug   │ 32   │ dr0   dr1   dr2   dr3   dr6   dr7              │ |
├─────────┼──────┼────────────────────────────────────────────────┤ |
│ FPU     │ 80   │ st0   st1   st2   st3   st4   st5   st6   st7  │ |
├─────────┼──────┼────────────────────────────────────────────────┤ |
│ MMX     │ 64   │ mm0   mm1   mm2   mm3   mm4   mm5   mm6   mm7  │ |
├─────────┼──────┼────────────────────────────────────────────────┤ |
│ SSE     │ 128  │ xmm0  xmm1  xmm2  xmm3  xmm4  xmm5  xmm6  xmm7 │ |
└─────────┴──────┴────────────────────────────────────────────────┘ |
/-----------------------------------------------------------------\ |
| Type    | Bits |                                                | |
|=========|======|================================================| |
|         | 8    | al    cl    dl    bl    ah    ch    dh    bh   | |
| General | 16   | ax    cx    dx    bx    sp    bp    si    di   | |
|         | 32   | eax   ecx   edx   ebx   esp   ebp   esi   edi  | |
|---------|------|------------------------------------------------| |
| Segment | 16   | es    cs    ss    ds    fs    gs               | |
|---------|------|------------------------------------------------| |
| Control | 32   | cr0   cr2   cr3   cr4                          | |
|---------|------|------------------------------------------------| |
| Debug   | 32   | dr0   dr1   dr2   dr3   dr6   dr7              | |
|---------|------|------------------------------------------------| |
| FPU     | 80   | st0   st1   st2   st3   st4   st5   st6   st7  | |
|---------|------|------------------------------------------------| |
| MMX     | 64   | mm0   mm1   mm2   mm3   mm4   mm5   mm6   mm7  | |
|---------|------|------------------------------------------------| |
| SSE     | 128  | xmm0  xmm1  xmm2  xmm3  xmm4  xmm5  xmm6  xmm7 | |
|---------|------|------------------------------------------------| |
| AVX     | 256  | ymm0  ymm1  ymm2  ymm3  ymm4  ymm5  ymm6  ymm7 | |
\-----------------------------------------------------------------/ |
|
|
1.2.2 Data definitions |
316,25 → 327,25 |
considered unknown. |
|
Table 1.3 Data directives |
┌─────────┬────────┬─────────┐ |
│ Size    │ Define │ Reserve │ |
│ (bytes) │ data   │ data    │ |
╞═════════╪════════╪═════════╡ |
│ 1       │ db     │ rb      │ |
│         │ file   │         │ |
├─────────┼────────┼─────────┤ |
│ 2       │ dw     │ rw      │ |
│         │ du     │         │ |
├─────────┼────────┼─────────┤ |
│ 4       │ dd     │ rd      │ |
├─────────┼────────┼─────────┤ |
│ 6       │ dp     │ rp      │ |
│         │ df     │ rf      │ |
├─────────┼────────┼─────────┤ |
│ 8       │ dq     │ rq      │ |
├─────────┼────────┼─────────┤ |
│ 10      │ dt     │ rt      │ |
└─────────┴────────┴─────────┘ |
/----------------------------\ |
| Size    | Define | Reserve | |
| (bytes) | data   | data    | |
|=========|========|=========| |
| 1       | db     | rb      | |
|         | file   |         | |
|---------|--------|---------| |
| 2       | dw     | rw      | |
|         | du     |         | |
|---------|--------|---------| |
| 4       | dd     | rd      | |
|---------|--------|---------| |
| 6       | dp     | rp      | |
|         | df     | rf      | |
|---------|--------|---------| |
| 8       | dq     | rq      | |
|---------|--------|---------| |
| 10      | dt     | rt      | |
\----------------------------/ |
|
|
1.2.3 Constants and labels |
399,14 → 410,24 |
In the above examples all the numerical expressions were the simple numbers, |
constants or labels. But they can be more complex, by using the arithmetical |
or logical operators for calculations at compile time. All these operators |
with their priority values are listed in table 1.4. |
The operations with higher priority value will be calculated first, you can |
of course change this behavior by putting some parts of expression into |
parenthesis. The "+", "-", "*" and "/" are standard arithmetical operations, |
"mod" calculates the remainder from division. The "and", "or", "xor", "shl", |
"shr" and "not" perform the same logical operations as assembly instructions |
of those names. The "rva" performs the conversion of an address into the |
relocatable offset and is specific to some of the output formats (see 2.4). |
with their priority values are listed in table 1.4. The operations with higher |
priority value will be calculated first; you can of course change this |
behavior by putting some parts of the expression into parentheses. The "+", "-", |
"*" and "/" are standard arithmetical operations, "mod" calculates the |
remainder from division. The "and", "or", "xor", "shl", "shr" and "not" |
perform the same logical operations as assembly instructions of those names. |
The "rva" and "plt" are special unary operators that perform conversions |
between different kinds of addresses; they can be used only with a few of the |
output formats and their meaning may vary (see 2.4). |
The arithmetical and logical calculations are usually processed as if they |
operated on infinite precision 2-adic numbers, and the assembler signals an |
overflow error if because of its limitations it is not able to perform the |
required calculation, or if the result is too large a number to fit in either |
signed or unsigned range for the destination unit size. However "not", "xor" |
and "shr" operators are exceptions from this rule - if the value specified |
by numerical expression has to fit in a unit of specified size, and the |
arguments for operation fit into that size, the operation will be performed |
with precision limited to that size. |
The numbers in the expression are by default treated as a decimal, binary |
numbers should have the "b" letter attached at the end, octal number should |
end with "o" letter, hexadecimal numbers should begin with "0x" characters |
431,23 → 452,23 |
while simple "1" defines an integer value. |
|
Table 1.4 Arithmetical and logical operators by priority |
┌──────────┬──────────────┐ |
│ Priority │ Operators    │ |
╞══════════╪══════════════╡ |
│ 0        │ + -          │ |
├──────────┼──────────────┤ |
│ 1        │ * /          │ |
├──────────┼──────────────┤ |
│ 2        │ mod          │ |
├──────────┼──────────────┤ |
│ 3        │ and or xor   │ |
├──────────┼──────────────┤ |
│ 4        │ shl shr      │ |
├──────────┼──────────────┤ |
│ 5        │ not          │ |
├──────────┼──────────────┤ |
│ 6        │ rva          │ |
└──────────┴──────────────┘ |
/-------------------------\ |
| Priority | Operators    | |
|==========|==============| |
| 0        | + -          | |
|----------|--------------| |
| 1        | * /          | |
|----------|--------------| |
| 2        | mod          | |
|----------|--------------| |
| 3        | and or xor   | |
|----------|--------------| |
| 4        | shl shr      | |
|----------|--------------| |
| 5        | not          | |
|----------|--------------| |
| 6        | rva plt      | |
\-------------------------/ |
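|
The following lines are a small sketch of how the priorities from table 1.4 |
apply. The values given in the comments are worked out by hand from the table, |
they are not copied from the assembler output: |
|
    db 2+3*4          ; "*" is calculated before "+", the value is 14 |
    db (2+3)*4        ; parentheses change the order, the value is 20 |
    db 7 mod 4*2      ; "mod" is calculated before "*", the value is 6 |
    db 1 shl 2+1      ; "shl" is calculated before "+", the value is 5 |
    db 6 and 3 shl 1  ; "shl" is calculated before "and", the value is 6 |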
|
|
1.2.5 Jumps and calls |
459,7 → 480,7 |
in 32-bit mode, it will become the near jump. To force this instruction to be |
treated differently, use the "jmp near dword [0]" or "jmp far dword [0]" form. |
When operand of near jump is the immediate value, assembler will generate |
the shortest variant of this jump instruction if possible (but won't create |
the shortest variant of this jump instruction if possible (but will not create |
32-bit instruction in 16-bit mode nor 16-bit instruction in 32-bit mode, |
unless there is a size operator stating it). By specifying the jump type |
you can force it to always generate long variant (for example "jmp near 0") |
492,7 → 513,7 |
|
|
Chapter 2 Instruction set |
────────────────────────── |
-------------------------- |
|
This chapter provides the detailed information about the instructions and |
directives supported by flat assembler. Directives for defining labels were |
767,12 → 788,12 |
|
2.1.5 Logical instructions |
|
"not" inverts the bits in the specified operand to form a one's |
complement of the operand. It has no effect on the flags. Rules for the |
operand are the same as for the "inc" instruction. |
"and", "or" and "xor" instructions perform the standard |
logical operations. They update the SF, ZF and PF flags. Rules for the |
operands are the same as for the "add" instruction. |
"not" inverts the bits in the specified operand to form a one's complement |
of the operand. It has no effect on the flags. Rules for the operand are the |
same as for the "inc" instruction. |
"and", "or" and "xor" instructions perform the standard logical operations. |
They update the SF, ZF and PF flags. Rules for the operands are the same as |
for the "add" instruction. |
"bt", "bts", "btr" and "btc" instructions operate on a single bit which can |
be in memory or in a general register. The location of the bit is specified |
as an offset from the low order end of the operand. The value of the offset |
918,55 → 939,55 |
target address. |
|
Table 2.1 Conditions |
┌──────────┬───────────────────────┬────────────────────────┐ |
│ Mnemonic │ Condition tested      │ Description            │ |
╞══════════╪═══════════════════════╪════════════════════════╡ |
│ o        │ OF = 1                │ overflow               │ |
├──────────┼───────────────────────┼────────────────────────┤ |
│ no       │ OF = 0                │ not overflow           │ |
├──────────┼───────────────────────┼────────────────────────┤ |
│ c        │                       │ carry                  │ |
│ b        │ CF = 1                │ below                  │ |
│ nae      │                       │ not above nor equal    │ |
├──────────┼───────────────────────┼────────────────────────┤ |
│ nc       │                       │ not carry              │ |
│ ae       │ CF = 0                │ above or equal         │ |
│ nb       │                       │ not below              │ |
├──────────┼───────────────────────┼────────────────────────┤ |
│ e        │ ZF = 1                │ equal                  │ |
│ z        │                       │ zero                   │ |
├──────────┼───────────────────────┼────────────────────────┤ |
│ ne       │ ZF = 0                │ not equal              │ |
│ nz       │                       │ not zero               │ |
├──────────┼───────────────────────┼────────────────────────┤ |
│ be       │ CF or ZF = 1          │ below or equal         │ |
│ na       │                       │ not above              │ |
├──────────┼───────────────────────┼────────────────────────┤ |
│ a        │ CF or ZF = 0          │ above                  │ |
│ nbe      │                       │ not below nor equal    │ |
├──────────┼───────────────────────┼────────────────────────┤ |
│ s        │ SF = 1                │ sign                   │ |
├──────────┼───────────────────────┼────────────────────────┤ |
│ ns       │ SF = 0                │ not sign               │ |
├──────────┼───────────────────────┼────────────────────────┤ |
│ p        │ PF = 1                │ parity                 │ |
│ pe       │                       │ parity even            │ |
├──────────┼───────────────────────┼────────────────────────┤ |
│ np       │ PF = 0                │ not parity             │ |
│ po       │                       │ parity odd             │ |
├──────────┼───────────────────────┼────────────────────────┤ |
│ l        │ SF xor OF = 1         │ less                   │ |
│ nge      │                       │ not greater nor equal  │ |
├──────────┼───────────────────────┼────────────────────────┤ |
│ ge       │ SF xor OF = 0         │ greater or equal       │ |
│ nl       │                       │ not less               │ |
├──────────┼───────────────────────┼────────────────────────┤ |
│ le       │ (SF xor OF) or ZF = 1 │ less or equal          │ |
│ ng       │                       │ not greater            │ |
├──────────┼───────────────────────┼────────────────────────┤ |
│ g        │ (SF xor OF) or ZF = 0 │ greater                │ |
│ nle      │                       │ not less nor equal     │ |
└──────────┴───────────────────────┴────────────────────────┘ |
/-----------------------------------------------------------\ |
| Mnemonic | Condition tested      | Description            | |
|==========|=======================|========================| |
| o        | OF = 1                | overflow               | |
|----------|-----------------------|------------------------| |
| no       | OF = 0                | not overflow           | |
|----------|-----------------------|------------------------| |
| c        |                       | carry                  | |
| b        | CF = 1                | below                  | |
| nae      |                       | not above nor equal    | |
|----------|-----------------------|------------------------| |
| nc       |                       | not carry              | |
| ae       | CF = 0                | above or equal         | |
| nb       |                       | not below              | |
|----------|-----------------------|------------------------| |
| e        | ZF = 1                | equal                  | |
| z        |                       | zero                   | |
|----------|-----------------------|------------------------| |
| ne       | ZF = 0                | not equal              | |
| nz       |                       | not zero               | |
|----------|-----------------------|------------------------| |
| be       | CF or ZF = 1          | below or equal         | |
| na       |                       | not above              | |
|----------|-----------------------|------------------------| |
| a        | CF or ZF = 0          | above                  | |
| nbe      |                       | not below nor equal    | |
|----------|-----------------------|------------------------| |
| s        | SF = 1                | sign                   | |
|----------|-----------------------|------------------------| |
| ns       | SF = 0                | not sign               | |
|----------|-----------------------|------------------------| |
| p        | PF = 1                | parity                 | |
| pe       |                       | parity even            | |
|----------|-----------------------|------------------------| |
| np       | PF = 0                | not parity             | |
| po       |                       | parity odd             | |
|----------|-----------------------|------------------------| |
| l        | SF xor OF = 1         | less                   | |
| nge      |                       | not greater nor equal  | |
|----------|-----------------------|------------------------| |
| ge       | SF xor OF = 0         | greater or equal       | |
| nl       |                       | not less               | |
|----------|-----------------------|------------------------| |
| le       | (SF xor OF) or ZF = 1 | less or equal          | |
| ng       |                       | not greater            | |
|----------|-----------------------|------------------------| |
| g        | (SF xor OF) or ZF = 0 | greater                | |
| nle      |                       | not less nor equal     | |
\-----------------------------------------------------------/ |
|
The "loop" instructions are conditional jumps that use a value placed in |
CX (or ECX) to specify the number of repetitions of a software loop. All |
1158,7 → 1179,7 |
|
"salc" instruction sets all the bits of the AL register when the carry flag is |
set and zeroes the AL register otherwise. This instruction has no arguments. |
The instructions obtained by attaching the condition mnemonic to the "cmov" |
The instructions obtained by attaching the condition mnemonic to "cmov" |
mnemonic transfer the word or double word from the general register or memory |
to the general register only when the condition is true. The destination |
operand should be general register, the source operand can be general register |
1365,7 → 1386,7 |
commonly used constants onto the FPU register stack. The loaded constants are |
+1.0, +0.0, lb 10, lb e, pi, lg 2 and ln 2 respectively. These instructions |
have no operands. |
"fild" convert the singed integer source operand into double extended |
"fild" converts the signed integer source operand into double extended |
precision floating-point format and pushes the result onto the FPU register |
stack. The source operand can be a 16-bit, 32-bit or 64-bit memory location. |
|
1493,18 → 1514,18 |
fcmovb st0,st2 ; transfer st2 to st0 if below |
|
Table 2.2 FPU conditions |
┌──────────┬──────────────────┬────────────────────────┐ |
│ Mnemonic │ Condition tested │ Description            │ |
╞══════════╪══════════════════╪════════════════════════╡ |
│ b        │ CF = 1           │ below                  │ |
│ e        │ ZF = 1           │ equal                  │ |
│ be       │ CF or ZF = 1     │ below or equal         │ |
│ u        │ PF = 1           │ unordered              │ |
│ nb       │ CF = 0           │ not below              │ |
│ ne       │ ZF = 0           │ not equal              │ |
│ nbe      │ CF and ZF = 0    │ not below nor equal    │ |
│ nu       │ PF = 0           │ not unordered          │ |
└──────────┴──────────────────┴────────────────────────┘ |
/------------------------------------------------------\ |
| Mnemonic | Condition tested | Description            | |
|==========|==================|========================| |
| b        | CF = 1           | below                  | |
| e        | ZF = 1           | equal                  | |
| be       | CF or ZF = 1     | below or equal         | |
| u        | PF = 1           | unordered              | |
| nb       | CF = 0           | not below              | |
| ne       | ZF = 0           | not equal              | |
| nbe      | CF and ZF = 0    | not below nor equal    | |
| nu       | PF = 0           | not unordered          | |
\------------------------------------------------------/ |
|
"ftst" compares the value in ST0 with 0.0 and sets the flags in the FPU |
status word according to the results. "fxam" examines the contents of the ST0 |
1528,7 → 1549,12 |
destination in memory and reinitializes the FPU. "fsave" checks for pending |
unmasked FPU exceptions before proceeding, "fnsave" does not. "frstor" |
loads the FPU state from the specified memory location. All these instructions |
need an operand being a memory location. |
need an operand being a memory location. For each of these instructions there |
exist two additional mnemonics that allow precise selection of the type of the |
operation. The "fstenvw", "fnstenvw", "fldenvw", "fsavew", "fnsavew" and |
"frstorw" mnemonics force the instruction to perform operation as in the 16-bit |
mode, while "fstenvd", "fnstenvd", "fldenvd", "fsaved", "fnsaved" and "frstord" |
force the operation as in 32-bit mode. |
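As a hedged illustration of these mnemonics, the lines below assume that EBX |
points to a buffer large enough for the selected layout; the register choice is |
made up for this example only: |
|
    fnstenvw [ebx]    ; store the FPU environment in the 16-bit layout |
    fnstenvd [ebx]    ; store the FPU environment in the 32-bit layout |
    fsaved [ebx]      ; save the whole FPU state in the 32-bit layout |
    frstord [ebx]     ; load it back, again using the 32-bit layout |
|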
"finit" and "fninit" set the FPU operating environment into its default |
state. "finit" checks for pending unmasked FPU exception before proceeding, |
"fninit" does not. "fclex" and "fnclex" clear the FPU exception flags in the |
1573,17 → 1599,17 |
"psubsb" and "psubsw" perform the addition or subtraction of packed bytes |
or packed words with the signed saturation. "paddusb", "paddusw", "psubusb", |
"psubusw" are analogous, but with unsigned saturation. "pmulhw" and "pmullw" |
performs a signed multiply of the packed words and store the high or low words |
of the results in the destination operand. "pmaddwd" performs a multiply of |
the packed words and adds the four intermediate double word products in pairs |
to produce result as a packed double words. "pand", "por" and "pxor" perform |
the logical operations on the quad words, "pandn" peforms also a logical |
negation of the destination operand before performing the "and" operation. |
"pcmpeqb", "pcmpeqw" and "pcmpeqd" compare for equality of packed bytes, |
packed words or packed double words. If a pair of data elements is equal, the |
corresponding data element in the destination operand is filled with bits of |
value 1, otherwise it's set to 0. "pcmpgtb", "pcmpgtw" and "pcmpgtd" perform |
the similar operation, but they check whether the data elements in the |
perform a signed multiplication of the packed words and store the high or low |
words of the results in the destination operand. "pmaddwd" performs a multiply |
of the packed words and adds the four intermediate double word products in |
pairs to produce the result as packed double words. "pand", "por" and "pxor" |
perform the logical operations on the quad words, "pandn" also performs a |
logical negation of the destination operand before performing the "and" |
operation. "pcmpeqb", "pcmpeqw" and "pcmpeqd" compare for equality of packed |
bytes, packed words or packed double words. If a pair of data elements is |
equal, the corresponding data element in the destination operand is filled with |
bits of value 1, otherwise it's set to 0. "pcmpgtb", "pcmpgtw" and "pcmpgtd" |
perform the similar operation, but they check whether the data elements in the |
destination operand are greater than the corresponding data elements in the |
source operand. "packsswb" converts packed signed words into packed signed |
bytes, "packssdw" converts packed signed double words into packed signed |
1699,18 → 1725,18 |
cmpltss xmm0,[ebx] ; compare single precision values |
|
Table 2.3 SSE conditions |
┌──────┬──────────┬─────────────────────────┐ |
│ Code │ Mnemonic │ Description             │ |
╞══════╪══════════╪═════════════════════════╡ |
│ 0    │ eq       │ equal                   │ |
│ 1    │ lt       │ less than               │ |
│ 2    │ le       │ less than or equal      │ |
│ 3    │ unord    │ unordered               │ |
│ 4    │ neq      │ not equal               │ |
│ 5    │ nlt      │ not less than           │ |
│ 6    │ nle      │ not less than nor equal │ |
│ 7    │ ord      │ ordered                 │ |
└──────┴──────────┴─────────────────────────┘ |
/-------------------------------------------\ |
| Code | Mnemonic | Description             | |
|======|==========|=========================| |
| 0    | eq       | equal                   | |
| 1    | lt       | less than               | |
| 2    | le       | less than or equal      | |
| 3    | unord    | unordered               | |
| 4    | neq      | not equal               | |
| 5    | nlt      | not less than           | |
| 6    | nle      | not less than nor equal | |
| 7    | ord      | ordered                 | |
\-------------------------------------------/ |
|
"comiss" and "ucomiss" compare the single precision values and set the ZF, |
PF and CF flags to show the result. The destination operand must be a SSE |
1771,8 → 1797,8 |
|
"pextrw" copies the word in the source operand specified by the third |
operand to the destination operand. The source operand must be a MMX register, |
the destination operand must be a 32-bit general register (but only the low |
word of it is affected), the third operand must an 8-bit immediate value. |
the destination operand must be a 32-bit general register (the high word of |
the destination is cleared), the third operand must be an 8-bit immediate value. |
|
pextrw eax,mm0,1 ; extract word into eax |
|
1788,12 → 1814,12 |
return the maximum values of packed unsigned bytes, "pminub" returns the |
minimum values of packed unsigned bytes, "pmaxsw" returns the maximum values |
of packed signed words, "pminsw" returns the minimum values of packed signed |
words. "pmulhuw" performs a unsigned multiply of the packed words and stores |
the high words of the results in the destination operand. "psadbw" computes |
the absolute differences of packed unsigned bytes, sums the differences, and |
stores the sum in the low word of destination operand. All these instructions |
follow the same rules for operands as the general MMX operations described in |
previous section. |
words. "pmulhuw" performs an unsigned multiplication of the packed words and |
stores the high words of the results in the destination operand. "psadbw" |
computes the absolute differences of packed unsigned bytes, sums the |
differences, and stores the sum in the low word of destination operand. All |
these instructions follow the same rules for operands as the general MMX |
operations described in previous section. |
"pmovmskb" creates a mask made of the most significant bit of each byte in |
the source operand and stores the result in the low byte of destination |
operand. The source operand must be a MMX register, the destination operand |
1922,10 → 1948,11 |
point values to packed two double word integers, storing the result in the low |
quad word of the destination operand. "cvtdq2ps" converts packed four |
double word integers to packed single precision floating point values. |
"cvtdq2pd" converts packed two double word integers from the low quad word |
of the source operand to packed double precision floating point values. |
For all these instructions the destination operand must be a SSE register, the |
source operand can be a 128-bit memory location or SSE register. |
"cvtdq2pd" converts packed two double word integers from the source operand to |
packed double precision floating point values, the source can be a 64-bit |
memory location or SSE register, destination has to be SSE register. |
"movdqa" and "movdqu" transfer a double quad word operand containing packed |
integers from source operand to destination operand. At least one of the |
operands have to be a SSE register, the second one can be also a SSE register |
1943,7 → 1970,7 |
mnemonics starting with "p") are extended to operate on 128-bit packed |
integers located in SSE registers. Additional syntax for these instructions |
needs an SSE register where MMX register was needed, and the 128-bit memory |
location or SSE register where 64-bit memory location of MMX register were |
location or SSE register where 64-bit memory location or MMX register were |
needed. The exception is "pshufw" instruction, which doesn't allow extended |
syntax, but has two new variants: "pshufhw" and "pshuflw", which allow only |
the extended syntax, and perform the same operation as "pshufw" on the high |
1955,12 → 1982,12 |
pextrw eax,xmm0,7 ; extract highest word into eax |
|
"paddq" performs the addition of packed quad words, "psubq" performs the |
substraction of packed quad words, "pmuludq" performs an unsigned multiply |
of low double words from each corresponding quad words and returns the results |
in packed quad words. These instructions follow the same rules for operands as |
the general MMX operations described in 2.1.14. |
subtraction of packed quad words, "pmuludq" performs an unsigned |
multiplication of low double words from each corresponding quad words and |
returns the results in packed quad words. These instructions follow the same |
rules for operands as the general MMX operations described in 2.1.14. |
"pslldq" and "psrldq" perform logical shift left or right of the double |
quad word in the destination operand by the amount of bits specified in the |
quad word in the destination operand by the amount of bytes specified in the |
source operand. The destination operand should be a SSE register, source |
operand should be an 8-bit immediate value. |
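A short sketch of these shifts, with the immediate value counting bytes; the |
register and the counts are chosen only for this example: |
|
    pslldq xmm0,4     ; shift the double quad word left by 4 bytes (32 bits) |
    psrldq xmm0,8     ; shift right by 8 bytes, high quad word goes into low one |
|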
"punpckhqdq" interleaves the high quad word of the source operand and the |
2007,10 → 2034,10 |
"movddup" loads the 64-bit source value and duplicates it into high and low |
quad word of the destination operand. The destination operand should be SSE |
register, the source operand can be SSE register or 64-bit memory location. |
"lddqu" is functionally equivalent to "movdqu" instruction with memory as |
source operand, but it may improve performance when the source operand crosses |
a cacheline boundary. The destination operand has to be SSE register, the |
source operand must be 128-bit memory location. |
"lddqu" is functionally equivalent to "movdqu" with memory as source |
operand, but it may improve performance when the source operand crosses a |
cacheline boundary. The destination operand has to be SSE register, the source |
operand must be 128-bit memory location. |
"addsubps" performs single precision addition of second and fourth pairs and |
single precision subtraction of the first and third pairs of floating point |
values in the operands. "addsubpd" performs double precision addition of the |
2030,6 → 2057,44 |
waits for a write-back store to the address range set up by the "monitor" |
instruction. It uses two operands with additional parameters, first being the |
EAX and second the ECX register. |
The functionality of SSE3 is further extended by the set of Supplemental |
SSE3 instructions (SSSE3). They generally follow the same rules for operands |
as all the MMX operations extended by SSE. |
"phaddw" and "phaddd" perform the horizontal addition of the pairs of |
adjacent values from both the source and destination operand, and store the |
sums into the destination (sums from the destination operand go into lower part |
of destination register, sums from the source operand into the upper part). |
They operate on 16-bit or 32-bit chunks, respectively. |
"phaddsw" performs the same operation on signed 16-bit packed values, but the |
result of each addition is saturated. "phsubw" and "phsubd" analogously |
perform the horizontal subtraction of 16-bit or 32-bit packed values, and |
"phsubsw" performs the horizontal subtraction of signed 16-bit packed values |
with saturation. |
"pabsb", "pabsw" and "pabsd" calculate the absolute value of each packed |
signed value in the source operand and store them into the destination |
register. They operate on 8-bit, 16-bit and 32-bit elements respectively. |
"pmaddubsw" multiplies signed 8-bit values from the source operand with the |
corresponding unsigned 8-bit values from the destination operand to produce |
intermediate 16-bit values, and every adjacent pair of those intermediate |
values is then added horizontally and those 16-bit sums are stored into the |
destination operand. |
"pmulhrsw" multiplies corresponding 16-bit integers from the source and |
destination operand to produce intermediate 32-bit values, and the 16 bits |
next to the highest bit of each of those values are then rounded and packed |
into the destination operand. |
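"pshufb" shuffles the bytes in the destination operand according to the |
mask provided by source operand - each of the bytes in source operand is |
an index selecting which byte of the original destination value is copied into |
the corresponding position of the result, and if the highest bit of the index |
byte is set, the corresponding byte of the result is zeroed instead. |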
"psignb", "psignw" and "psignd" operate on 8-bit, 16-bit or |
32-bit integers in the destination operand, depending on the signs of the |
values in the source. If the value in source is negative, the corresponding |
value in the destination register is negated, if the value in source is |
positive, no operation is performed on the corresponding value, and if the |
value in source is zero, the value in destination is zeroed, too. |
"palignr" appends the source operand to the destination operand to form the |
intermediate value of twice the size, and then extracts into the destination |
register the 64 or 128 bits that are right-aligned to the byte offset |
specified by the third operand, which should be an 8-bit immediate value. This |
is the only SSSE3 instruction that takes three arguments. |
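|
The lines below are only a sketch of how some of the SSSE3 instructions look |
in source. The registers, the immediate value and the memory addresses in EBX |
and ESI are assumptions made for this example: |
|
    phaddw xmm0,xmm1      ; add adjacent pairs of words horizontally |
    pabsd mm0,[ebx]       ; absolute values of two double words from memory |
    psignw xmm0,xmm1      ; negate, keep or zero words by signs in xmm1 |
    pshufb xmm0,[esi]     ; shuffle bytes of xmm0 with the mask from memory |
    palignr xmm0,xmm1,4   ; take 16 bytes at byte offset 4 of joined operands |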
|
|
2.1.18 AMD 3DNow! instructions |
2040,9 → 2105,9 |
These instructions follow the same rules as the general MMX operations, the |
destination operand should be a MMX register, the source operand can be a MMX |
register or 64-bit memory location. "pavgusb" computes the rounded averages |
of packed unsigned bytes. "pmulhrw" performs a signed multiply of the packed |
words, round the high word of each double word results and stores them in the |
destination operand. "pi2fd" converts packed double word integers into |
of packed unsigned bytes. "pmulhrw" performs a signed multiplication of the |
packed words, rounds the high word of each double word result and stores them |
in the destination operand. "pi2fd" converts packed double word integers into |
packed floating point values. "pf2id" converts packed floating point values |
into packed double word integers using truncation. "pi2fw" converts packed |
word integers into packed floating point values, only low words of each |
2106,28 → 2171,28 |
instruction with any of the new registers. |
|
Table 2.4 New registers in long mode |
ÚÄÄÄÄÄÄÂÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÂÄÄÄÄÄÄÄ¿ |
³ Type ³ General ³ SSE ³ |
ÃÄÄÄÄÄÄÅÄÄÄÄÄÄÂÄÄÄÄÄÄÂÄÄÄÄÄÄÂÄÄÄÄÄÄÅÄÄÄÄÄÄÄ´ |
³ Bits ³ 8 ³ 16 ³ 32 ³ 64 ³ 128 ³ |
ÆÍÍÍÍÍÍØÍÍÍÍÍÍØÍÍÍÍÍÍØÍÍÍÍÍÍØÍÍÍÍÍÍØÍÍÍÍÍÍ͵ |
³ ³ ³ ³ ³ rax ³ ³ |
³ ³ ³ ³ ³ rcx ³ ³ |
³ ³ ³ ³ ³ rdx ³ ³ |
³ ³ ³ ³ ³ rbx ³ ³ |
³ ³ spl ³ ³ ³ rsp ³ ³ |
³ ³ bpl ³ ³ ³ rbp ³ ³ |
³ ³ sil ³ ³ ³ rsi ³ ³ |
³ ³ dil ³ ³ ³ rdi ³ ³ |
³ ³ r8b ³ r8w ³ r8d ³ r8 ³ xmm8 ³ |
³ ³ r9b ³ r9w ³ r9d ³ r9 ³ xmm9 ³ |
³ ³ r10b ³ r10w ³ r10d ³ r10 ³ xmm10 ³ |
³ ³ r11b ³ r11w ³ r11d ³ r11 ³ xmm11 ³ |
³ ³ r12b ³ r12w ³ r12d ³ r12 ³ xmm12 ³ |
³ ³ r13b ³ r13w ³ r13d ³ r13 ³ xmm13 ³ |
³ ³ r14b ³ r14w ³ r14d ³ r14 ³ xmm14 ³ |
³ ³ r15b ³ r15w ³ r15d ³ r15 ³ xmm15 ³ |
ÀÄÄÄÄÄÄÁÄÄÄÄÄÄÁÄÄÄÄÄÄÁÄÄÄÄÄÄÁÄÄÄÄÄÄÁÄÄÄÄÄÄÄÙ |
/--------------------------------------------------\ |
| Type | General | SSE | AVX | |
|------|---------------------------|-------|-------| |
| Bits | 8 | 16 | 32 | 64 | 128 | 256 | |
|======|======|======|======|======|=======|=======| |
| | | | | rax | | | |
| | | | | rcx | | | |
| | | | | rdx | | | |
| | | | | rbx | | | |
| | spl | | | rsp | | | |
| | bpl | | | rbp | | | |
| | sil | | | rsi | | | |
| | dil | | | rdi | | | |
| | r8b | r8w | r8d | r8 | xmm8 | ymm8 | |
| | r9b | r9w | r9d | r9 | xmm9 | ymm9 | |
| | r10b | r10w | r10d | r10 | xmm10 | ymm10 | |
| | r11b | r11w | r11d | r11 | xmm11 | ymm11 | |
| | r12b | r12w | r12d | r12 | xmm12 | ymm12 | |
| | r13b | r13w | r13d | r13 | xmm13 | ymm13 | |
| | r14b | r14w | r14d | r14 | xmm14 | ymm14 | |
| | r15b | r15w | r15d | r15 | xmm15 | ymm15 | |
\--------------------------------------------------/ |
|
In general any instruction from x86 architecture, which allowed 16-bit or |
32-bit operand sizes, in long mode allows also the 64-bit operands. The 64-bit |
2165,30 → 2230,30 |
the upper 32 bits of the 64-bit registers containing them are filled with |
zeros. This is unlike the operations on 16-bit or 8-bit portions of those |
registers, which preserve the upper bits. |
Three new type conversion instructions are available. The "cdqe" sign extends |
the double word in EAX into quad word and stores the result in RAX register. |
"cqo" sign extends the quad word in RAX into double quad word and stores the |
extra bits in the RDX register. These instructions have no operands. "movsxd" |
sign extends the double word source operand, being either the 32-bit register |
or memory, into 64-bit destination operand, which has to be register. |
No analogous instruction is needed for the zero extension, since it is done |
automatically by any operations on 32-bit registers, as noted in previous |
paragraph. And the "movzx" and "movsx" instructions, conforming to the general |
rule, can be used with 64-bit destination operand, allowing extension of byte |
or word values into quad words. |
All the binary arithmetic and logical instruction are promoted to allow |
64-bit operands in long mode. The use of decimal arithmetic instructions in |
long mode is prohibited. |
Three new type conversion instructions are available. The "cdqe" sign |
extends the double word in EAX into quad word and stores the result in RAX |
register. "cqo" sign extends the quad word in RAX into double quad word and |
stores the extra bits in the RDX register. These instructions have no |
operands. "movsxd" sign extends the double word source operand, being either |
the 32-bit register or memory, into 64-bit destination operand, which has to |
be register. No analogous instruction is needed for the zero extension, since |
it is done automatically by any operations on 32-bit registers, as noted in |
previous paragraph. And the "movzx" and "movsx" instructions, conforming to |
the general rule, can be used with 64-bit destination operand, allowing |
extension of byte or word values into quad words. |
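|
The conversions discussed in this paragraph could look as follows. This is |
only a sketch, it assumes that the 64-bit code generation has been selected |
with "use64", and the registers are chosen arbitrarily: |
|
```asm
use64
cdqe                    ; RAX = sign-extended EAX
cqo                     ; RDX:RAX = sign-extended RAX
movsxd rax,dword [rbx]  ; sign-extend 32-bit memory into RAX
movsxd rcx,edx          ; sign-extend EDX into RCX
mov eax,ecx             ; upper 32 bits of RAX get zeroed
movzx rdx,byte [rsi]    ; zero-extend byte into quad word
movsx r8,cx             ; sign-extend word into quad word
```
|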
All the binary arithmetic and logical instructions have been promoted to |
allow 64-bit operands in long mode. The use of decimal arithmetic instructions |
in long mode is prohibited. |
The stack operations, like "push" and "pop" in long mode default to 64-bit |
operands and it's not possible to use 32-bit operands with them. The "pusha" |
and "popa" are disallowed in long mode. |
The indirect near jumps and calls in long mode default to 64-bit operands and |
it's not possible to use the 32-bit operands with them. On the other hand, the |
indirect far jumps and calls allow any operands that were allowed by the x86 |
architecture and also 80-bit memory operand is allowed (though only EM64T seems |
to implement such variant), with the first eight bytes defining the offset and |
two last bytes specifying the selector. The direct far jumps and calls are not |
allowed in long mode. |
The indirect near jumps and calls in long mode default to 64-bit operands |
and it's not possible to use the 32-bit operands with them. On the other hand, |
the indirect far jumps and calls allow any operands that were allowed by the |
x86 architecture and also 80-bit memory operand is allowed (though only EM64T |
seems to implement such variant), with the first eight bytes defining the |
offset and two last bytes specifying the selector. The direct far jumps and |
calls are not allowed in long mode. |
The I/O instructions, "in", "out", "ins" and "outs" are the exceptional |
instructions that are not extended to accept quad word operands in long mode. |
But all other string operations are, and there are new short forms "movsq", |
2203,13 → 2268,990 |
The "cmpxchg16b" is the 64-bit equivalent of "cmpxchg8b" instruction, it uses |
the double quad word memory operand and 64-bit registers to perform the |
analogous operation. |
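|
A minimal sketch of such use, assuming long mode ("use64") and a 16-byte |
aligned memory location addressed by RDI (the choice of register is arbitrary): |
|
```asm
use64
lock cmpxchg16b [rdi]   ; if RDX:RAX equals the 16 bytes at [rdi], RCX:RBX
                        ; is stored there and ZF is set, otherwise the
                        ; memory contents are loaded into RDX:RAX
```
|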
The "fxsave64" and "fxrstor64" are new variants of "fxsave" and "fxrstor" |
instructions, available only in long mode, which use a different format of |
storage area in order to store some pointers in full 64-bit size. |
"swapgs" is the new instruction, which swaps the contents of GS register and |
the KernelGSbase model-specific register (MSR address 0C0000102h). |
"syscall" and "sysret" is the pair of new instructions that provide the |
functionality similar to "sysenter" and "sysexit" in long mode, where the |
latter pair is disallowed. |
latter pair is disallowed. The "sysexitq" and "sysretq" mnemonics provide the |
64-bit versions of "sysexit" and "sysret" instructions. |
The "rdmsrq" and "wrmsrq" mnemonics are the 64-bit variants of the "rdmsr" |
and "wrmsr" instructions. |
|
|
2.1.20 SSE4 instructions |
|
There are actually three different sets of instructions under the name SSE4. |
Intel designed two of them, SSE4.1 and SSE4.2, with latter extending the |
former into the full Intel's SSE4 set. On the other hand, the implementation |
by AMD includes only a few instructions from this set, but also contains |
some additional instructions, that are called the SSE4a set. |
The SSE4.1 instructions mostly follow the same rules for operands, as |
the basic SSE operations, so they require destination operand to be SSE |
register and source operand to be 128-bit memory location or SSE register, |
and some operations require a third operand, the 8-bit immediate value. |
"pmulld" performs a signed multiplication of the packed double words and |
stores the low double words of the results in the destination operand. |
"pmuldq" performs two signed multiplications of the low double words of the |
corresponding quad words of operands, and stores the results as |
packed quad words into the destination register. "pminsb" and "pmaxsb" |
return the minimum or maximum values of packed signed bytes, "pminuw" and |
"pmaxuw" return the minimum and maximum values of packed unsigned words, |
"pminud", "pmaxud", "pminsd" and "pmaxsd" return minimum or maximum values |
of packed unsigned or signed double words. These instructions complement the |
instructions computing packed minimum or maximum introduced by SSE. |
"ptest" sets the ZF flag to one when the result of bitwise AND of the |
both operands is zero, and zeroes the ZF otherwise. It also sets CF flag |
to one, when the result of bitwise AND of the source operand with |
the bitwise NOT of the destination operand is zero, and zeroes the CF otherwise. |
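|
As an illustration of the ZF result only, a sketch with hypothetical labels |
and arbitrarily chosen registers might be: |
|
```asm
        ptest xmm0,xmm1         ; ZF=1 when xmm0 and xmm1 share no set bits
        jz no_common_bits
        ptest xmm2,xmm2         ; ZF=1 when xmm2 contains only zero bits
        jz is_zero
no_common_bits:                 ; hypothetical label
is_zero:                        ; hypothetical label
```
|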
"pcmpeqq" compares packed quad words for equality, and fills the |
corresponding elements of destination operand with either ones or zeros, |
depending on the result of comparison. |
"packusdw" converts packed signed double words from both the source and |
destination operand into the unsigned words using saturation, and stores |
the eight resulting word values into the destination register. |
"phminposuw" finds the minimum unsigned word value in source operand and |
places it into the lowest word of destination operand, setting the remaining |
upper bits of destination to zero. |
"roundps", "roundss", "roundpd" and "roundsd" perform the rounding of packed |
or individual floating point value of single or double precision, using the |
rounding mode specified by the third operand. |
|
roundsd xmm0,xmm1,0011b ; round toward zero |
|
"dpps" calculates dot product of packed single precision floating point |
values, that is it multiplies the corresponding pairs of values from source and |
destination operand and then sums the products up. The high four bits of the |
8-bit immediate third operand control which products are calculated and taken |
to the sum, and the low four bits control, into which elements of destination |
the resulting dot product is copied (the other elements are filled with zero). |
"dppd" calculates dot product of packed double precision floating point values. |
The bits 4 and 5 of third operand control, which products are calculated and |
added, and bits 0 and 1 of this value control, which elements in destination |
register should get filled with the result. "mpsadbw" calculates multiple sums |
of absolute differences of unsigned bytes. The third operand controls, with |
value in bits 0-1, which of the four-byte blocks in source operand is taken to |
calculate the absolute differences, and with value in bit 2, at which of the |
first two four-byte blocks in destination operand to start calculating multiple |
sums. The sum is calculated from four absolute differences between the |
corresponding unsigned bytes in the source and destination block, and each next |
sum is calculated in the same way, but taking the four bytes from destination |
at the position one byte after the position of previous block. The four bytes |
from the source stay the same each time. This way eight sums of absolute |
differences are calculated and stored as packed word values into the |
destination operand. The instructions described in this paragraph follow the |
same rules for operands, as "roundps" instruction. |
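|
The control bits described above can be sketched like this; the immediate |
values and registers are chosen only for illustration: |
|
```asm
dpps xmm0,xmm1,11110001b    ; sum all four products, result in lowest element
dppd xmm0,xmm1,00110011b    ; sum both products, result in both elements
mpsadbw xmm0,xmm1,0         ; source block 0, destination offset 0
```
|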
"blendps", "blendvps", "blendpd" and "blendvpd" conditionally copy the |
values from source operand into the destination operand, depending on the bits |
of the mask provided by third operand. If a mask bit is set, the corresponding |
element of source is copied into the same place in destination, otherwise this |
position in destination is left unchanged. The rules for the first two operands |
are the same, as for general SSE instructions. "blendps" and "blendpd" need |
third operand to be 8-bit immediate, and they operate on single or double |
precision values, respectively. "blendvps" and "blendvpd" require third operand |
to be the XMM0 register. |
|
blendvps xmm3,xmm7,xmm0 ; blend according to mask |
|
"pblendw" conditionally copies word elements from the source operand into the |
destination, depending on the bits of mask provided by third operand, which |
needs to be 8-bit immediate value. "pblendvb" conditionally copies byte |
elements from the source operand into destination, depending on mask defined |
by the third operand, which has to be XMM0 register. These instructions follow |
the same rules for operands as "blendps" and "blendvps" instructions, |
respectively. |
"insertps" inserts a single precision floating point value taken from the |
position in source operand specified by bits 6-7 of third operand into location |
in destination register selected by bits 4-5 of third operand. Additionally, |
the low four bits of third operand control, which elements in destination |
register will be set to zero. The first two operands follow the same rules as |
for the general SSE operation, the third operand should be 8-bit immediate. |
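|
A sketch of how the bit fields of the third operand might be composed (the |
values are illustrative, not taken from the original text): |
|
```asm
insertps xmm0,xmm1,01100000b    ; element 1 of xmm1 into element 2 of xmm0
insertps xmm0,xmm1,00001110b    ; element 0 into element 0, elements 1-3 zeroed
```
|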
"extractps" extracts a single precision floating point value taken from the |
location in source operand specified by low two bits of third operand, and |
stores it into the destination operand. The destination can be a 32-bit memory |
value or general purpose register, the source operand must be SSE register, |
and the third operand should be 8-bit immediate value. |
|
extractps edx,xmm3,3 ; extract the highest value |
|
"pinsrb", "pinsrd" and "pinsrq" copy a byte, double word or quad word from |
the source operand into the location of destination operand determined by the |
third operand. The destination operand has to be SSE register, the source |
operand can be a memory location of appropriate size, or the 32-bit general |
purpose register (but 64-bit general purpose register for "pinsrq", which is |
only available in long mode), and the third operand has to be 8-bit immediate |
value. These instructions complement the "pinsrw" instruction operating on SSE |
register destination, which was introduced by SSE2. |
|
pinsrd xmm4,eax,1 ; insert double word into second position |
|
"pextrb", "pextrw", "pextrd" and "pextrq" copy a byte, word, double word or |
quad word from the location in source operand specified by third operand, into |
the destination. The source operand should be SSE register, the third operand |
should be 8-bit immediate, and the destination operand can be memory location |
of appropriate size, or the 32-bit general purpose register (but 64-bit general |
purpose register for "pextrq", which is only available in long mode). The |
"pextrw" instruction with SSE register as source was already introduced by |
SSE2, but SSE4 extends it to allow memory operand as destination. |
|
pextrw [ebx],xmm3,7 ; extract highest word into memory |
|
"pmovsxbw" and "pmovzxbw" perform sign extension or zero extension of eight |
byte values from the source operand into packed word values in destination |
operand, which has to be SSE register. The source can be 64-bit memory or SSE |
register - when it is register, only its low portion is used. "pmovsxbd" and |
"pmovzxbd" perform sign extension or zero extension of the four byte values |
from the source operand into packed double word values in destination operand, |
the source can be 32-bit memory or SSE register. "pmovsxbq" and "pmovzxbq" |
perform sign extension or zero extension of the two byte values from the |
source operand into packed quad word values in destination operand, the source |
can be 16-bit memory or SSE register. "pmovsxwd" and "pmovzxwd" perform sign |
extension or zero extension of the four word values from the source operand |
into packed double words in destination operand, the source can be 64-bit |
memory or SSE register. "pmovsxwq" and "pmovzxwq" perform sign extension or |
zero extension of the two word values from the source operand into packed quad |
words in destination operand, the source can be 32-bit memory or SSE register. |
"pmovsxdq" and "pmovzxdq" perform sign extension or zero extension of the two |
double word values from the source operand into packed quad words in |
destination operand, the source can be 64-bit memory or SSE register. |
|
pmovzxbq xmm0,word [si] ; zero-extend bytes to quad words |
pmovsxwq xmm0,xmm1 ; sign-extend words to quad words |
|
"movntdqa" loads double quad word from the source operand to the destination |
using a non-temporal hint. The destination operand should be SSE register, |
and the source operand should be 128-bit memory location. |
The SSE4.2, described below, adds not only some new operations on SSE |
registers, but also introduces some completely new instructions operating on |
general purpose registers only. |
"pcmpistri" compares two zero-ended (implicit length) strings provided in |
its source and destination operand and generates an index stored to ECX; |
"pcmpistrm" performs the same comparison and generates a mask stored to XMM0. |
"pcmpestri" compares two strings of explicit lengths, with length provided |
in EAX for the destination operand and in EDX for the source operand, and |
generates an index stored to ECX; "pcmpestrm" performs the same comparison |
and generates a mask stored to XMM0. The source and destination operand follow |
the same rules as for general SSE instructions, the third operand should be |
8-bit immediate value determining the details of performed operation - refer to |
Intel documentation for information on those details. |
"pcmpgtq" compares packed quad words, and fills the corresponding elements of |
destination operand with either ones or zeros, depending on whether the value |
in destination is greater than the one in source, or not. This instruction |
follows the same rules for operands as "pcmpeqq". |
"crc32" accumulates a CRC32 value for the source operand starting with |
initial value provided by destination operand, and stores the result in |
destination. Unless in long mode, the destination operand should be a 32-bit |
general purpose register, and the source operand can be a byte, word, or double |
word register or memory location. In long mode the destination operand can |
also be a 64-bit general purpose register, and the source operand in such case |
can be a byte or quad word register or memory location. |
|
crc32 eax,dl ; accumulate CRC32 on byte value |
crc32 eax,word [ebx] ; accumulate CRC32 on word value |
crc32 rax,qword [rbx] ; accumulate CRC32 on quad word value |
|
"popcnt" calculates the number of bits set in the source operand, which can |
be 16-bit, 32-bit, or 64-bit general purpose register or memory location, |
and stores this count in the destination operand, which has to be register of |
the same size as source operand. The 64-bit variant is available only in long |
mode. |
|
popcnt ecx,eax ; count bits set to 1 |
|
The SSE4a extension, which also includes the "popcnt" instruction introduced |
by SSE4.2, at the same time adds the "lzcnt" instruction, which follows the |
same syntax, and calculates the count of leading zero bits in source operand |
(if the source operand is all zero bits, the total number of bits in source |
operand is stored in destination). |
"extrq" extracts the sequence of bits from the low quad word of SSE register |
provided as first operand and stores them at the low end of this register, |
filling the remaining bits in the low quad word with zeros. The position of bit |
string and its length can either be provided with two 8-bit immediate values |
as second and third operand, or by SSE register as second operand (and there |
is no third operand in such case), which should contain position value in bits |
8-13 and length of bit string in bits 0-5. |
|
extrq xmm0,8,7 ; extract 8 bits from position 7 |
extrq xmm0,xmm5 ; extract bits defined by register |
|
"insertq" writes the sequence of bits from the low quad word of the source |
operand into specified position in low quad word of the destination operand, |
leaving the other bits in low quad word of destination intact. The position |
where bits should be written and the length of bit string can either be |
provided with two 8-bit immediate values as third and fourth operand, or by |
the bit fields in source operand (and there are only two operands in such |
case), which should contain position value in bits 72-77 and length of bit |
string in bits 64-69. |
|
insertq xmm1,xmm0,4,2 ; insert 4 bits at position 2 |
insertq xmm1,xmm0 ; insert bits defined by register |
|
"movntss" and "movntsd" store single or double precision floating point |
value from the source SSE register into 32-bit or 64-bit destination memory |
location respectively, using non-temporal hint. |
|
|
2.1.21 AVX instructions |
|
The Advanced Vector Extensions introduce instructions that are new variants |
of SSE instructions, with new scheme of encoding that allows extended syntax |
having a destination operand separate from all the source operands. It also |
introduces 256-bit AVX registers, which extend up the old 128-bit SSE |
registers. Any AVX instruction that puts some result into SSE register, puts |
zero bits into high portion of the AVX register containing it. |
The AVX version of SSE instruction has the mnemonic obtained by prepending |
SSE instruction name with "v". For any SSE arithmetic instruction which had a |
destination operand also being used as one of the source values, the AVX |
variant has a new syntax with three operands - the destination and two sources. |
The destination and first source can be SSE registers, and second source can be |
SSE register or memory. If the operation is performed on single pair of values, |
the remaining bits of first source SSE register are copied into the |
destination register. |
|
vsubss xmm0,xmm2,xmm3 ; subtract two 32-bit floats |
vmulsd xmm0,xmm7,qword [esi] ; multiply two 64-bit floats |
|
In case of packed operations, each instruction can also operate on the 256-bit |
data size when the AVX registers are specified instead of SSE registers, and |
the size of memory operand is also doubled then. |
|
vaddps ymm1,ymm5,yword [esi] ; eight sums of 32-bit float pairs |
|
The instructions that operate on packed integer types (in particular the ones |
that earlier had been promoted from MMX to SSE) also acquired the new syntax |
with three operands, however they are only allowed to operate on 128-bit |
packed types and thus cannot use the whole AVX registers. |
|
vpavgw xmm3,xmm0,xmm2 ; average of 16-bit integers |
vpslld xmm1,xmm0,1 ; shift double words left |
|
If the SSE version of instruction had a syntax with three operands, the third |
one being an immediate value, the AVX version of such instruction takes four |
operands, with immediate remaining the last one. |
|
vshufpd ymm0,ymm1,ymm2,10010011b ; shuffle 64-bit floats |
vpalignr xmm0,xmm4,xmm2,3 ; extract byte aligned value |
|
The promotion to new syntax according to the rules described above has been |
applied to all the instructions from SSE extensions up to SSE4, with the |
exceptions described below. |
"vdppd" instruction has syntax extended to four operands, but it does not |
have a 256-bit version. |
There are a few instructions, namely "vsqrtpd", "vsqrtps", "vrcpps" and |
"vrsqrtps", which can operate on 256-bit data size, but retained the syntax |
with only two operands, because they use data from only one source: |
|
vsqrtpd ymm1,ymm0 ; put square roots into other register |
|
In a similar way "vroundpd" and "vroundps" retained the syntax with three |
operands, the last one being immediate value. |
|
vroundps ymm0,ymm1,0011b ; round toward zero |
|
Also some of the operations on packed integers kept their two-operand or |
three-operand syntax while being promoted to AVX version. In such case these |
instructions follow exactly the same rules for operands as their SSE |
counterparts (since operations on packed integers do not have 256-bit variants |
in AVX extension). These include "vpcmpestri", "vpcmpestrm", "vpcmpistri", |
"vpcmpistrm", "vphminposuw", "vpshufd", "vpshufhw", "vpshuflw". And there are |
more instructions that in AVX versions keep exactly the same syntax for |
operands as the one from SSE, without any additional options: "vcomiss", |
"vcomisd", "vcvtss2si", "vcvtsd2si", "vcvttss2si", "vcvttsd2si", "vextractps", |
"vpextrb", "vpextrw", "vpextrd", "vpextrq", "vmovd", "vmovq", "vmovntdqa", |
"vmaskmovdqu", "vpmovmskb", "vpmovsxbw", "vpmovsxbd", "vpmovsxbq", "vpmovsxwd", |
"vpmovsxwq", "vpmovsxdq", "vpmovzxbw", "vpmovzxbd", "vpmovzxbq", "vpmovzxwd", |
"vpmovzxwq" and "vpmovzxdq". |
The move and conversion instructions have mostly been promoted to allow |
256-bit size operands in addition to the 128-bit variant with syntax identical |
to that from SSE version of the same instruction. Each of the "vcvtdq2ps", |
"vcvtps2dq" and "vcvttps2dq", "vmovaps", "vmovapd", "vmovups", "vmovupd", |
"vmovdqa", "vmovdqu", "vlddqu", "vmovntps", "vmovntpd", "vmovntdq", |
"vmovsldup", "vmovshdup", "vmovmskps" and "vmovmskpd" inherits the 128-bit |
syntax from SSE without any changes, and also allows a new form with 256-bit |
operands in place of 128-bit ones. |
|
vmovups [edi],ymm6 ; store unaligned 256-bit data |
|
"vmovddup" has the identical 128-bit syntax as its SSE version, and it also |
has a 256-bit version, which stores the duplicates of the lowest quad word |
from the source operand in the lower half of destination operand, and in the |
upper half of destination the duplicates of the low quad word from the upper |
half of source. Both source and destination operands need then to be 256-bit |
values. |
"vmovlhps" and "vmovhlps" have only 128-bit versions, and each takes three |
operands, which all must be SSE registers. "vmovlhps" copies two single |
precision values from the low quad word of second source register to the high |
quad word of destination register, and copies the low quad word of first |
source register into the low quad word of destination register. "vmovhlps" |
copies two single precision values from the high quad word of second source |
register to the low quad word of destination register, and copies the high |
quad word of first source register into the high quad word of destination |
register. |
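|
The two forms can be sketched as follows, with arbitrarily chosen registers: |
|
```asm
vmovlhps xmm0,xmm1,xmm2     ; low half from low of xmm1, high from low of xmm2
vmovhlps xmm0,xmm1,xmm2     ; low half from high of xmm2, high from high of xmm1
```
|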
"vmovlps", "vmovhps", "vmovlpd" and "vmovhpd" have only 128-bit versions and |
their syntax varies depending on whether memory operand is a destination or |
source. When memory is destination, the syntax is identical to the one of |
equivalent SSE instruction, and when memory is source, the instruction requires |
three operands, first two being SSE registers and the third one 64-bit memory. |
The value put into destination is then the value copied from first source with |
either low or high quad word replaced with value from second source (the |
memory operand). |
|
vmovhps [esi],xmm7 ; store upper half to memory |
vmovlps xmm0,xmm7,[ebx] ; low from memory, rest from register |
|
"vmovss" and "vmovsd" have syntax identical to their SSE equivalents as long |
as one of the operands is memory, while the versions that operate purely on |
registers require three operands (each being SSE register). The value stored |
in destination is then the value copied from first source with lowest data |
element replaced with the lowest value from second source. |
|
vmovss xmm3,[edi] ; low from memory, rest zeroed |
vmovss xmm0,xmm1,xmm2 ; one value from xmm2, three from xmm1 |
|
"vcvtss2sd", "vcvtsd2ss", "vcvtsi2ss" and "vcvtsi2sd" use the three-operand |
syntax, where destination and first source are always SSE registers, and the |
second source follows the same rules as the source in syntax of equivalent |
SSE instruction. The value stored in destination is then the value copied from |
first source with lowest data element replaced with the result of conversion. |
|
vcvtsi2sd xmm4,xmm4,ecx ; 32-bit integer to 64-bit float |
vcvtsi2ss xmm0,xmm0,rax ; 64-bit integer to 32-bit float |
|
"vcvtdq2pd" and "vcvtps2pd" allow the same syntax as their SSE equivalents, |
plus the new variants with AVX register as destination and SSE register or |
128-bit memory as source. Analogously "vcvtpd2dq", "vcvttpd2dq" and |
"vcvtpd2ps", in addition to variant with syntax identical to SSE version, |
allow a variant with SSE register as destination and AVX register or 256-bit |
memory as source. |
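|
A sketch of these mixed-size variants, with registers and addresses chosen |
arbitrarily for illustration: |
|
```asm
vcvtps2pd ymm0,xmm1             ; four floats widened to four doubles
vcvtdq2pd ymm2,dqword [esi]     ; four integers from 128-bit memory
vcvtpd2ps xmm0,ymm1             ; four doubles narrowed to four floats
vcvttpd2dq xmm3,yword [edi]     ; four doubles truncated to integers
```
|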
"vinsertps", "vpinsrb", "vpinsrw", "vpinsrd", "vpinsrq" and "vpblendw" use |
a syntax with four operands, where destination and first source have to be SSE |
registers, and the third and fourth operand follow the same rules as second |
and third operand in the syntax of equivalent SSE instruction. Value stored in |
destination is the value copied from first source with some data elements |
replaced with values extracted from the second source, analogously to the |
operation of corresponding SSE instruction. |
|
vpinsrd xmm0,xmm0,eax,3 ; insert double word |
|
"vblendvps", "vblendvpd" and "vpblendvb" use a new syntax with four register |
operands: destination, two sources and a mask, where second source can also be |
a memory operand. "vblendvps" and "vblendvpd" have 256-bit variant, where |
operands are AVX registers or 256-bit memory, as well as 128-bit variant, |
which has operands being SSE registers or 128-bit memory. "vpblendvb" has only |
a 128-bit variant. Value stored in destination is the value copied from the |
first source with some data elements replaced, according to mask, by values |
from the second source. |
|
vblendvps ymm3,ymm1,ymm2,ymm7 ; blend according to mask |
|
"vptest" allows the same syntax as its SSE version and also has a 256-bit |
version, with both operands doubled in size. There are also two new |
instructions, "vtestps" and "vtestpd", which perform analogous tests, but only |
of the sign bits of corresponding single precision or double precision values, |
and set the ZF and CF accordingly. They follow the same syntax rules as |
"vptest". |
|
vptest ymm0,yword [ebx] ; test 256-bit values |
vtestpd xmm0,xmm1 ; test sign bits of 64-bit floats |
|
"vbroadcastss", "vbroadcastsd" and "vbroadcastf128" are new instructions, |
which broadcast the data element defined by source operand into all elements |
of corresponding size in the destination register. "vbroadcastss" needs |
source to be 32-bit memory and destination to be either SSE or AVX register. |
"vbroadcastsd" requires 64-bit memory as source, and AVX register as |
destination. "vbroadcastf128" requires 128-bit memory as source, and AVX |
register as destination. |
|
vbroadcastss ymm0,dword [eax] ; get eight copies of value |
|
"vinsertf128" is the new instruction, which takes four operands. The |
destination and first source have to be AVX registers, second source can be |
SSE register or 128-bit memory location, and fourth operand should be an |
immediate value. It stores in destination the value obtained by taking |
contents of first source and replacing one of its 128-bit units with value of |
the second source. The lowest bit of fourth operand specifies at which |
position that replacement is done (either 0 or 1). |
"vextractf128" is the new instruction with three operands. The destination |
needs to be SSE register or 128-bit memory location, the source must be AVX |
register, and the third operand should be an immediate value. It extracts |
into destination one of the 128-bit units from source. The lowest bit of third |
operand specifies, which unit is extracted. |
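|
These two instructions might be used as in the following sketch, where the |
registers and addresses are arbitrary: |
|
```asm
vinsertf128 ymm0,ymm1,xmm2,1            ; ymm1 with upper 128 bits from xmm2
vinsertf128 ymm0,ymm0,dqword [esi],0    ; replace lower 128 bits from memory
vextractf128 xmm3,ymm0,1                ; get upper 128 bits of ymm0
vextractf128 [edi],ymm0,0               ; store lower 128 bits to memory
```
|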
"vmaskmovps" and "vmaskmovpd" are the new instructions with three operands |
that selectively store in destination the elements from second source |
depending on the sign bits of corresponding elements from first source. These |
instructions can operate on either 128-bit data (SSE registers) or 256-bit |
data (AVX registers). Either destination or second source has to be a memory |
location of appropriate size, the two other operands should be registers. |
|
vmaskmovps [edi],xmm0,xmm5 ; conditionally store |
vmaskmovpd ymm5,ymm0,[esi] ; conditionally load |
|
"vpermilpd" and "vpermilps" are the new instructions with three operands |
that permute the values from first source according to the control fields from |
second source and put the result into destination operand. They allow either |
three SSE registers or three AVX registers as operands, and the second |
source can be a memory of size equal to the registers used. In alternative |
form the second source can be immediate value and then the first source |
can be a memory location of the size equal to destination register. |
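A possible use of both forms, with operands chosen only for illustration: |
|
vpermilps ymm0,ymm1,ymm2 ; control fields taken from ymm2 |
vpermilps xmm0,[esi],1Bh ; reverse the order of four floats |
|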
"vperm2f128" is the new instruction with four operands, which selects |
128-bit blocks of floating point data from first and second source according |
to the bit fields from fourth operand, and stores them in destination. |
Destination and first source need to be AVX registers, second source can be |
AVX register or 256-bit memory area, and fourth operand should be an immediate |
value. |
|
vperm2f128 ymm0,ymm6,ymm7,12h ; permute 128-bit blocks |
|
"vzeroall" instruction sets all the AVX registers to zero. "vzeroupper" sets |
the upper 128-bit portions of all AVX registers to zero, leaving the SSE |
registers intact. These new instructions take no operands. |
"vldmxcsr" and "vstmxcsr" are the AVX versions of "ldmxcsr" and "stmxcsr" |
instructions. The rules for their operands remain unchanged. |
|
|
2.1.22 AVX2 instructions |
|
The AVX2 extension allows all the AVX instructions operating on packed integers |
to use 256-bit data types, and introduces some new instructions as well. |
The AVX instructions that operate on packed integers and had only 128-bit |
variants have been supplemented with 256-bit variants, and thus their syntax |
rules became analogous to AVX instructions operating on packed floating point |
types. |
|
vpsubb ymm0,ymm0,[esi] ; subtract 32 packed bytes |
vpavgw ymm3,ymm0,ymm2 ; average of 16-bit integers |
|
However there are some instructions that have not been equipped with the |
256-bit variants. "vpcmpestri", "vpcmpestrm", "vpcmpistri", "vpcmpistrm", |
"vpextrb", "vpextrw", "vpextrd", "vpextrq", "vpinsrb", "vpinsrw", "vpinsrd", |
"vpinsrq" and "vphminposuw" are not affected by AVX2 and allow only the |
128-bit operands. |
The packed shift instructions, which allowed the third operand specifying |
amount to be SSE register or 128-bit memory location, use the same rules |
for the third operand in their 256-bit variant. |
|
vpsllw ymm2,ymm2,xmm4 ; shift words left |
vpsrad ymm0,ymm3,xword [ebx] ; shift double words right |
|
There are also new packed shift instructions with standard three-operand AVX |
syntax, which shift each element from first source by the amount specified in |
corresponding element of second source, and store the results in destination. |
"vpsllvd" shifts 32-bit elements left, "vpsllvq" shifts 64-bit elements left, |
"vpsrlvd" shifts 32-bit elements right logically, "vpsrlvq" shifts 64-bit |
elements right logically and "vpsravd" shifts 32-bit elements right |
arithmetically. |
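For example, assuming the shift counts were prepared beforehand in ymm2 and |
in the memory at EBX: |
|
vpsllvd ymm0,ymm1,ymm2 ; shift each double word by its own count |
vpsravd xmm0,xmm1,[ebx] ; arithmetic shift, counts from memory |
|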
The sign-extend and zero-extend instructions, which in AVX versions allowed |
source operand to be SSE register or a memory of specific size, in the new |
256-bit variant need memory of that size doubled or SSE register as source and |
AVX register as destination. |
|
vpmovzxbq ymm0,dword [esi] ; bytes to quad words |
|
Also "vmovntdqa" has been upgraded with 256-bit variant, so it can transfer |
a 256-bit value from memory to AVX register; the memory address needs to be |
aligned to 32 bytes. |
"vpmaskmovd" and "vpmaskmovq" are the new instructions with syntax identical |
to "vmaskmovps" or "vmaskmovpd", and they perform analogous operation on |
packed 32-bit or 64-bit values. |
"vinserti128", "vextracti128", "vbroadcasti128" and "vperm2i128" are the new |
instructions with syntax identical to "vinsertf128", "vextractf128", |
"vbroadcastf128" and "vperm2f128" respectively, and they perform analogous |
operations on 128-bit blocks of integer data. |
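A short sketch of these integer variants (the operands are chosen |
arbitrarily): |
|
vpmaskmovd [edi],ymm0,ymm5 ; conditionally store double words |
vinserti128 ymm0,ymm1,[esi],0 ; replace the lower 128-bit block |
|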
"vbroadcastss" and "vbroadcastsd" instructions have been extended to allow |
SSE register as a source operand (which in AVX could only be a memory). |
"vpbroadcastb", "vpbroadcastw", "vpbroadcastd" and "vpbroadcastq" are the |
new instructions which broadcast the byte, word, double word or quad word from |
the source operand into all elements of corresponding size in the destination |
register. The destination operand can be either SSE or AVX register, and the |
source operand can be SSE register or memory of size equal to the size of data |
element. |
|
vpbroadcastb ymm0,byte [ebx] ; get 32 identical bytes |
|
"vpermd" and "vpermps" are new three-operand instructions, which use each |
32-bit element from first source as an index of element in second source which |
is copied into destination at position corresponding to element containing |
index. The destination and first source have to be AVX registers, and the |
second source can be AVX register or 256-bit memory. |
"vpermq" and "vpermpd" are new three-operand instructions, which use 2-bit |
indexes from the immediate value specified as third operand to determine which |
element from source is stored at given position in destination. The destination |
has to be AVX register, source can be AVX register or 256-bit memory, and the |
third operand must be 8-bit immediate value. |
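An illustrative pair of such permutations, assuming the indexes were loaded |
into ymm1 beforehand: |
|
vpermd ymm0,ymm1,[esi] ; select double words from memory |
vpermq ymm0,ymm2,1Bh ; reverse the order of four quad words |
|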
The family of new instructions performing "gather" operation have special |
syntax, as in their memory operand they use addressing mode that is unique to |
them. The base of address can be a 32-bit or 64-bit general purpose register |
(the latter only in long mode), and the index (possibly multiplied by scale |
value, as in standard addressing) is specified by SSE or AVX register. It is |
possible to use only index without base and any numerical displacement can be |
added to the address. Each of those instructions takes three operands. First |
operand is the destination register, second operand is memory addressed with |
a vector index, and third operand is register containing a mask. The most |
significant bit of each element of mask determines whether a value will be |
loaded from memory into corresponding element in destination. The address of |
each element to load is determined by using the corresponding element from |
index register in memory operand to calculate final address with given base |
and displacement. When the index register contains fewer elements than the |
destination and mask registers, the higher elements of destination are zeroed. |
After the value is successfully loaded, the corresponding element in mask |
register is set to zero. The destination, index and mask should all be |
distinct registers; it is not allowed to use the same register in two |
different roles. |
"vgatherdps" loads single precision floating point values addressed by |
32-bit indexes. The destination, index and mask should all be registers of the |
same type, either SSE or AVX. The data addressed by memory operand is 32-bit |
in size. |
|
vgatherdps xmm0,[eax+xmm1],xmm3 ; gather four floats |
vgatherdps ymm0,[ebx+ymm7*4],ymm3 ; gather eight floats |
|
"vgatherqps" loads single precision floating point values addressed by |
64-bit indexes. The destination and mask should always be SSE registers, while |
index register can be either SSE or AVX register. The data addressed by memory |
operand is 32-bit in size. |
|
vgatherqps xmm0,[xmm2],xmm3 ; gather two floats |
vgatherqps xmm0,[ymm2+64],xmm3 ; gather four floats |
|
"vgatherdpd" loads double precision floating point values addressed by |
32-bit indexes. The index register should always be SSE register, the |
destination and mask should be two registers of the same type, either SSE or |
AVX. The data addressed by memory operand is 64-bit in size. |
|
vgatherdpd xmm0,[ebp+xmm1],xmm3 ; gather two doubles |
vgatherdpd ymm0,[xmm3*8],ymm5 ; gather four doubles |
|
"vgatherqpd" loads double precision floating point values addressed by |
64-bit indexes. The destination, index and mask should all be registers of the |
same type, either SSE or AVX. The data addressed by memory operand is 64-bit |
in size. |
"vpgatherdd" and "vpgatherqd" load 32-bit values addressed by either 32-bit |
or 64-bit indexes. They follow the same rules as "vgatherdps" and "vgatherqps" |
respectively. |
"vpgatherdq" and "vpgatherqq" load 64-bit values addressed by either 32-bit |
or 64-bit indexes. They follow the same rules as "vgatherdpd" and "vgatherqpd" |
respectively. |
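The remaining variants can be sketched in the same way (registers picked |
only for illustration, and kept distinct as required): |
|
vgatherqpd xmm0,[ebx+xmm1*8],xmm2 ; gather two doubles |
vpgatherdd ymm0,[eax+ymm1*4],ymm2 ; gather eight double words |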
|
|
2.1.23 Auxiliary sets of computational instructions |
|
There is a number of additional instruction set extensions related to |
AVX. They introduce new vector instructions (and sometimes also their SSE |
equivalents that use classic instruction encoding), and even some new |
instructions operating on general registers that use the AVX-like encoding |
allowing the extended syntax with separate destination and source operands. |
The CPU support for each of these instruction sets needs to be determined |
separately. |
The AES extension provides a specialized set of instructions for the |
purpose of cryptographic computations defined by Advanced Encryption Standard. |
Each of these instructions has two versions: the AVX one and the one with |
SSE-like syntax that uses classic encoding. Refer to the Intel manuals for the |
details of operation of these instructions. |
"aesenc" and "aesenclast" perform a single round of AES encryption on data |
from first source with a round key from second source, and store result in |
destination. The destination and first source are SSE registers, and the |
second source can be SSE register or 128-bit memory. The AVX versions of these |
instructions, "vaesenc" and "vaesenclast", use the syntax with three operands, |
while the SSE-like version has only two operands, with first operand being |
both the destination and first source. |
"aesdec" and "aesdeclast" perform a single round of AES decryption on data |
from first source with a round key from second source. The syntax rules for |
them and their AVX versions are the same as for "aesenc". |
"aesimc" performs the InvMixColumns transformation of source operand and |
stores the result in destination. Both "aesimc" and "vaesimc" use only two |
operands, destination being SSE register, and source being SSE register or |
128-bit memory location. |
"aeskeygenassist" is a helper instruction for generating the round key. |
It needs three operands: destination being SSE register, source being SSE |
register or 128-bit memory, and third operand being 8-bit immediate value. |
The AVX version of this instruction uses the same syntax. |
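A minimal sketch of these forms, assuming the state is kept in xmm0 and the |
round keys come from xmm1 and from memory at ESI (both hypothetical): |
|
aesenc xmm0,xmm1 ; one round, key in xmm1 |
vaesenclast xmm2,xmm0,[esi] ; last round, key in memory |
aeskeygenassist xmm3,xmm1,1 ; helper for the key schedule |
|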
The CLMUL extension introduces just one instruction, "pclmulqdq", and its |
AVX version as well. This instruction performs a carryless multiplication of |
two 64-bit values selected from first and second source according to the bit |
fields in immediate value. The destination and first source are SSE registers, |
second source is SSE register or 128-bit memory, and immediate value is |
provided as last operand. "vpclmulqdq" takes four operands, while "pclmulqdq" |
takes only three operands, with the first one serving both the role of |
destination and first source. |
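For illustration, with arbitrarily chosen operands: |
|
pclmulqdq xmm0,xmm1,0 ; multiply the low quad words |
vpclmulqdq xmm2,xmm0,[ebx],11h ; multiply the high quad words |
|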
The FMA (Fused Multiply-Add) extension introduces additional AVX |
instructions which perform multiplication and summation as single operation. |
Each one takes three operands, first one serving both the role of destination |
and first source, and the following ones being the second and third source. |
The mnemonic of FMA instruction is obtained by appending to "vf" prefix: first |
either "m" or "nm" to select whether result of multiplication should be taken |
as-is or negated, then either "add" or "sub" to select whether third value |
will be added to the product or subtracted from the product, then either |
"132", "213" or "231" to select which source operands are multiplied and which |
one is added or subtracted, and finally the type of data on which the |
instruction operates, either "ps", "pd", "ss" or "sd". As it was with SSE |
instructions promoted to AVX, instructions operating on packed floating point |
values allow 128-bit or 256-bit syntax, in former all the operands are SSE |
registers, but the third one can also be a 128-bit memory, in latter the |
operands are AVX registers and the third one can also be a 256-bit memory. |
Instructions that compute just one floating point result need operands to be |
SSE registers, and the third operand can also be a memory, either 32-bit for |
single precision or 64-bit for double precision. |
|
vfmsub231ps ymm1,ymm2,ymm3 ; multiply and subtract |
vfnmadd132sd xmm0,xmm5,[ebx] ; multiply, negate and add |
|
In addition to the instructions created by the rule described above, there are |
families of instructions with mnemonics starting with either "vfmaddsub" or |
"vfmsubadd", followed by either "132", "213" or "231" and then either "ps" or |
"pd" (the operation must always be on packed values in this case). They add |
to the result of multiplication or subtract from it depending on the position |
of value in packed data - instructions from the "vfmaddsub" group add when the |
position is odd and subtract when the position is even, instructions from the |
"vfmsubadd" group add when the position is even and subtract when the |
position is odd. The rules for operands are the same as for other FMA |
instructions. |
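For instance, with operands chosen only as an illustration: |
|
vfmaddsub213pd ymm0,ymm1,[esi] ; add at odd, subtract at even positions |
vfmsubadd231ps xmm0,xmm1,xmm2 ; subtract at odd, add at even positions |
|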
The FMA4 instructions are similar to FMA, but use syntax with four operands |
and thus allow destination to be different than all the sources. Their |
mnemonics are identical to FMA instructions with the "132", "213" or "231" cut |
out, as having separate destination operand makes such selection of operands |
superfluous. The multiplication is always performed on values from the first |
and second source, and then the value from third source is added or |
subtracted. Either second or third source can be a memory operand, and the |
rules for the sizes of operands are the same as for FMA instructions. |
|
vfmaddpd ymm0,ymm1,[esi],ymm2 ; multiply and add |
vfmsubss xmm0,xmm1,xmm2,[ebx] ; multiply and subtract |
|
The F16C extension consists of two instructions, "vcvtps2ph" and |
"vcvtph2ps", which convert floating point values between single precision and |
half precision (the 16-bit floating point format). "vcvtps2ph" takes three |
operands: destination, source, and rounding controls. The third operand is |
always an immediate, the source is either SSE or AVX register containing |
single precision values, and the destination is SSE register or memory, the |
size of memory is 64 bits when the source is SSE register and 128 bits when |
the source is AVX register. "vcvtph2ps" takes two operands, the destination |
that can be SSE or AVX register, and the source that is SSE register or memory |
with size of the half of destination operand's size. |
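A sketch of a conversion in both directions (the addresses in EDI and ESI |
are assumptions of this example): |
|
vcvtps2ph qword [edi],xmm0,0 ; store four half precision values |
vcvtph2ps ymm0,xword [esi] ; load eight half precision values |
|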
The AMD XOP extension introduces a number of new vector instructions with |
encoding and syntax analogous to AVX instructions. "vfrczps", "vfrczss", |
"vfrczpd" and "vfrczsd" extract fractional portions of single or double |
precision values, they all take two operands. The packed operations allow |
either SSE or AVX register as destination, for the other two it has to be SSE |
register. Source can be register of the same type as destination, or memory |
of appropriate size (256-bit for destination being AVX register, 128-bit for |
packed operation with destination being SSE register, 64-bit for operation |
on a solitary double precision value and 32-bit for operation on a solitary |
single precision value). |
|
vfrczps ymm0,[esi] ; load fractional parts |
|
"vpcmov" copies bits from either first or second source into destination |
depending on the values of corresponding bits in the fourth operand (the |
selector). If the bit in selector is set, the corresponding bit from first |
source is copied into the same position in destination, otherwise the bit from |
second source is copied. Either second source or selector can be memory |
location, 128-bit or 256-bit depending on whether SSE registers or AVX |
registers are specified as the other operands. |
|
vpcmov xmm0,xmm1,xmm2,[ebx] ; selector in memory |
vpcmov ymm0,ymm5,[esi],ymm2 ; source in memory |
|
The family of packed comparison instructions takes four operands, the |
destination and first source being SSE register, second source being SSE |
register or 128-bit memory and the fourth operand being immediate value |
defining the type of comparison. The mnemonic of instruction is created |
by appending to "vpcom" prefix either "b" or "ub" to compare signed or |
unsigned bytes, "w" or "uw" to compare signed or unsigned words, "d" or "ud" |
to compare signed or unsigned double words, "q" or "uq" to compare signed or |
unsigned quad words. The respective values from the first and second source |
are compared and the corresponding data element in destination is set to |
either all ones or all zeros depending on the result of comparison. The fourth |
operand has to specify one of the eight comparison types (table 2.5). All |
these instructions also have variants with only three operands and the type |
of comparison encoded within the instruction name by inserting the comparison |
mnemonic after "vpcom". |
|
vpcomb xmm0,xmm1,xmm2,4 ; test for equal bytes |
vpcomgew xmm0,xmm1,[ebx] ; compare signed words |
|
Table 2.5 XOP comparisons |
/-------------------------------------------\ |
| Code | Mnemonic | Description | |
|======|==========|=========================| |
| 0 | lt | less than | |
| 1 | le | less than or equal | |
| 2 | gt | greater than | |
| 3 | ge | greater than or equal | |
| 4 | eq | equal | |
| 5 | neq | not equal | |
| 6 | false | false | |
| 7 | true | true | |
\-------------------------------------------/ |
|
"vpermil2ps" and "vpermil2pd" set the elements in destination register to |
zero or to a value selected from first or second source depending on the |
corresponding bit fields from the fourth operand (the selector) and the |
immediate value provided in fifth operand. Refer to the AMD manuals for the |
detailed explanation of the operation performed by these instructions. Each |
of the first four operands can be a register, and either second source or |
selector can be memory location, 128-bit or 256-bit depending on whether SSE |
registers or AVX registers are used for the other operands. |
|
vpermil2ps ymm0,ymm3,ymm7,ymm2,0 ; permute from two sources |
|
"vphaddbw" adds pairs of adjacent signed bytes to form 16-bit values and |
stores them at the same positions in destination. "vphaddubw" does the same |
but treats the bytes as unsigned. "vphaddbd" and "vphaddubd" sum all bytes |
(either signed or unsigned) in each four-byte block to 32-bit results, |
"vphaddbq" and "vphaddubq" sum all bytes in each eight-byte block to |
64-bit results, "vphaddwd" and "vphadduwd" add pairs of words to 32-bit |
results, "vphaddwq" and "vphadduwq" sum all words in each four-word block to |
64-bit results, "vphadddq" and "vphaddudq" add pairs of double words to 64-bit |
results. "vphsubbw" subtracts in each two-byte block the byte at higher |
position from the one at lower position, and stores the result as a signed |
16-bit value at the corresponding position in destination, "vphsubwd" |
subtracts in each two-word block the word at higher position from the one at |
lower position and makes signed 32-bit results, "vphsubdq" subtracts in each |
block of two double words the one at higher position from the one at lower |
position and makes signed 64-bit results. Each of these instructions takes |
two operands, the destination being SSE register, and the source being SSE |
register or 128-bit memory. |
|
vphadduwq xmm0,xmm1 ; sum quadruplets of words |
|
"vpmacsww" and "vpmacssww" multiply the corresponding signed 16-bit values |
from the first and second source and then add the products to the parallel |
values from the third source, then "vpmacsww" takes the lowest 16 bits of the |
result and "vpmacssww" saturates the result down to 16-bit value, and they |
store the final 16-bit results in the destination. "vpmacsdd" and "vpmacssdd" |
perform the analogous operation on 32-bit values. "vpmacswd" and "vpmacsswd" do |
the same calculation only on the low 16-bit values from each 32-bit block and |
form the 32-bit results. "vpmacsdql" and "vpmacssdql" perform such operation |
on the low 32-bit values from each 64-bit block and form the 64-bit results, |
while "vpmacsdqh" and "vpmacssdqh" do the same on the high 32-bit values from |
each 64-bit block, also forming the 64-bit results. "vpmadcswd" and |
"vpmadcsswd" multiply the corresponding signed 16-bit value from the first |
and second source, then sum all the four products and add this sum to each |
16-bit element from third source, storing the truncated or saturated result |
in destination. All these instructions take four operands, the second source |
can be 128-bit memory or SSE register, all the other operands have to be |
SSE registers. |
|
vpmacsdd xmm6,xmm1,[ebx],xmm6 ; accumulate product |
|
"vpperm" selects bytes from first and second source, optionally applies a |
separate transformation to each of them, and stores them in the destination. |
The bit fields in fourth operand (the selector) specify for each position in |
destination what byte from which source is taken and what operation is applied |
to it before it is stored there. Refer to the AMD manuals for the detailed |
information about these bit fields. This instruction takes four operands, |
either second source or selector can be a 128-bit memory (or they can be SSE |
registers both), all the other operands have to be SSE registers. |
"vpshlb", "vpshlw", "vpshld" and "vpshlq" shift logically bytes, words, double |
words or quad words respectively. The amount of bits to shift by is specified |
for each element separately by the signed byte placed at the corresponding |
position in the third operand. The source containing elements to shift is |
provided as second operand. Either second or third operand can be 128-bit |
memory (or they can be SSE registers both) and the other operands have to be |
SSE registers. |
|
vpshld xmm3,xmm1,[ebx] ; shift double words from xmm1 |
|
"vpshab", "vpshaw", "vpshad" and "vpshaq" arithmetically shift bytes, words, |
double words or quad words. These instructions follow the same rules as the |
logical shifts described above. "vprotb", "vprotw", "vprotd" and "vprotq" |
rotate bytes, words, double words or quad words. They follow the same rules as |
shifts, but additionally allow third operand to be immediate value, in which |
case the same amount of rotation is specified for all the elements in source. |
|
vprotb xmm0,[esi],3 ; rotate bytes to the left |
|
The MOVBE extension introduces just one new instruction, "movbe", which |
swaps bytes in value from source before storing it in destination, so it can |
be used to load and store big endian values. It takes two operands, either |
the destination or source should be a 16-bit, 32-bit or 64-bit memory (the |
last one being only allowed in long mode), and the other operand should be |
a general register of the same size. |
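For example, assuming ESI and EDI point to big endian data: |
|
movbe eax,[esi] ; load big endian double word |
movbe [edi],dx ; store word as big endian |
|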
The BMI extension, consisting of two subsets - BMI1 and BMI2, introduces |
new instructions operating on general registers, which use the same encoding |
as AVX instructions and so allow the extended syntax. All these instructions |
use 32-bit operands, and in long mode they also allow the forms with 64-bit |
operands. |
"andn" calculates the bitwise AND of second source with the inverted bits |
of first source and stores the result in destination. The destination and |
the first source have to be general registers, the second source can be |
general register or memory. |
|
andn edx,eax,[ebx] ; bit-multiply inverted eax with memory |
|
"bextr" extracts from the first source the sequence of bits using an index |
and length specified by bit fields in the second source operand and stores |
it into destination. The lowest 8 bits of second source specify the position |
of bit sequence to extract and the next 8 bits of second source specify the |
length of sequence. The first source can be a general register or memory, |
the other two operands have to be general registers. |
|
bextr eax,[esi],ecx ; extract bit field from memory |
|
"blsi" extracts the lowest set bit from the source, setting all the other |
bits in destination to zero. The destination must be a general register, |
the source can be general register or memory. |
|
blsi rax,r11 ; isolate the lowest set bit |
|
"blsmsk" sets all the bits in the destination up to the lowest set bit in |
the source, including this bit. "blsr" copies all the bits from the source to |
destination except for the lowest set bit, which is replaced by zero. These |
instructions follow the same rules for operands as "blsi". |
"tzcnt" counts the number of trailing zero bits, that is the zero bits up to |
the lowest set bit of source value. This instruction is analogous to "lzcnt" |
and follows the same rules for operands, so it also has a 16-bit version, |
unlike the other BMI instructions. |
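A few illustrative forms of these instructions (operands are arbitrary): |
|
blsmsk eax,[ebx] ; mask up to the lowest set bit |
blsr ecx,ecx ; clear the lowest set bit |
tzcnt ax,bx ; 16-bit count of trailing zeros |
|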
"bzhi" is BMI2 instruction, which copies the bits from first source to |
destination, zeroing all the bits up from the position specified by second |
source. It follows the same rules for operands as "bextr". |
"pext" uses a mask in second source operand to select bits from first |
source and puts the selected bits as a continuous sequence into destination. |
"pdep" performs the reverse operation - it takes sequence of bits from the |
first source and puts them consecutively at the positions where the bits in |
second source are set, setting all the other bits in destination to zero. |
These BMI2 instructions follow the same rules for operands as "andn". |
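These three might be used as follows, assuming the bit position is in ecx |
for the first one and the masks in ecx and in memory for the other two: |
|
bzhi eax,[esi],ecx ; keep only the bits below position in ecx |
pext eax,ebx,ecx ; collect bits of ebx selected by ecx |
pdep edx,eax,[edi] ; scatter bits of eax according to mask |
|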
"mulx" is a BMI2 instruction which performs an unsigned multiplication of |
value from EDX or RDX register (depending on the size of specified operands) |
by the value from third operand, and stores the low half of result in the |
second operand, and the high half of result in the first operand, and it does |
it without affecting the flags. The third operand can be general register or |
memory, and both the destination operands have to be general registers. |
|
mulx edx,eax,ecx ; multiply edx by ecx into edx:eax |
|
"shlx", "shrx" and "sarx" are BMI2 instructions, which perform logical or |
arithmetical shifts of value from first source by the amount specified by |
second source, and store the result in destination without affecting the |
flags. They have the same rules for operands as "bzhi" instruction. |
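For illustration, with the shift amount assumed to be in ecx: |
|
shlx eax,[esi],ecx ; shift value from memory left |
sarx edx,ebx,ecx ; arithmetic shift right |
|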
"rorx" is a BMI2 instruction which rotates right the value from source |
operand by the constant amount specified in third operand and stores the |
result in destination without affecting the flags. The destination operand |
has to be general register, the source operand can be general register or |
memory, and the third operand has to be an immediate value. |
|
rorx eax,edx,7 ; rotate without affecting flags |
|
The TBM is an extension designed by AMD to supplement the BMI set. The |
"bextr" instruction is extended with a new form, in which second source is |
a 32-bit immediate value. "blsic" is a new instruction which performs the |
same operation as "blsi", but with the bits of result reversed. It uses the |
same rules for operands as "blsi". "blsfill" is a new instruction, which takes |
the value from source, sets all the bits below the lowest set bit and stores |
the result in destination; it also uses the same rules for operands as "blsi". |
"blci", "blcic", "blcs", "blcmsk" and "blcfill" are instructions analogous |
to "blsi", "blsic", "blsr", "blsmsk" and "blsfill" respectively, but they |
perform the bit-inverted versions of the same operations. They follow the |
same rules for operands as the instructions they reflect. |
"tzmsk" finds the lowest set bit in value from source operand, sets all bits |
below it to 1 and all the rest of bits to zero, then writes the result to |
destination. "t1mskc" finds the least significant zero bit in the value from |
source operand, sets the bits below it to zero and all the other bits to 1, |
and writes the result to destination. These instructions have the same rules |
for operands as "blsi". |
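Some of these forms sketched with arbitrary operands: |
|
bextr eax,ebx,0804h ; extract 8 bits starting at bit 4 |
blsfill ecx,[esi] ; set bits below the lowest set bit |
tzmsk edx,eax ; mask of the trailing zero bits |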
|
|
2.1.24 Other extensions of instruction set |
|
There is a number of additional instruction set extensions recognized by flat |
assembler, and the general syntax of the instructions introduced by those |
extensions is provided here. For detailed information on the operations |
performed by them, check out the manuals from Intel (for the VMX, SMX, XSAVE, |
RDRAND, FSGSBASE, INVPCID, HLE and RTM extensions) or AMD (for the SVM |
extension). |
The Virtual-Machine Extensions (VMX) provide a set of instructions for the |
management of virtual machines. The "vmxon" instruction, which enters the VMX |
operation, requires a single 64-bit memory operand, which should be a physical |
address of memory region, which the logical processor may use to support VMX |
operation. The "vmxoff" instruction, which leaves the VMX operation, has no |
operands. The "vmlaunch" and "vmresume", which launch or resume the virtual |
machines, and "vmcall", which allows guest software to call the VM monitor, |
use no operands either. |
The "vmptrld" loads the physical address of current Virtual Machine Control |
Structure (VMCS) from its memory operand, "vmptrst" stores the pointer to |
current VMCS into address specified by its memory operand, and "vmclear" sets |
the launch state of the VMCS referenced by its memory operand to clear. These |
three instructions all require a single 64-bit memory operand. |
The "vmread" reads from VMCS a field specified by the source operand and |
stores it into the destination operand. The source operand should be a |
general purpose register, and the destination operand can be a register or |
memory. The "vmwrite" writes into a VMCS field specified by the destination |
operand the value provided by source operand. The source operand can be a |
general purpose register or memory, and the destination operand must be a |
register. The size of operands for those instructions should be 64-bit when |
in long mode, and 32-bit otherwise. |
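Outside of long mode these two could look like this (the choice of |
registers is only an assumption of the example): |
|
vmread [ebx],eax ; store the field selected by eax |
vmwrite eax,[ebx] ; set the field selected by eax |
|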
The "invept" and "invvpid" invalidate the translation lookaside buffers |
(TLBs) and paging-structure caches, either derived from extended page tables |
(EPT), or based on the virtual processor identifier (VPID). These instructions |
require two operands, the first one being the general purpose register |
specifying the type of invalidation, and the second one being a 128-bit |
memory operand providing the invalidation descriptor. The first operand |
should be a 64-bit register when in long mode, and 32-bit register otherwise. |
The Safer Mode Extensions (SMX) provide the functionalities available |
through the "getsec" instruction. This instruction takes no operands, and |
the function that is executed is determined by the contents of EAX register |
upon executing this instruction. |
The Secure Virtual Machine (SVM) is a variant of virtual machine extension |
used by AMD. The "skinit" instruction securely reinitializes the processor |
allowing the startup of trusted software, such as the virtual machine monitor |
(VMM). This instruction takes a single operand, which must be EAX, and |
provides a physical address of the secure loader block (SLB). |
The "vmrun" instruction is used to start a guest virtual machine, |
its only operand should be an accumulator register (AX, EAX or RAX, the |
last one available only in long mode) providing the physical address of the |
virtual machine control block (VMCB). The "vmsave" stores a subset of |
processor state into VMCB specified by its operand, and "vmload" loads the |
same subset of processor state from a specified VMCB. The same operand rules |
as for the "vmrun" apply to those two instructions. |
"vmmcall" allows the guest software to call the VMM. This instruction takes |
no operands. |
"stgi" sets the global interrupt flag to 1, and "clgi" zeroes it. These |
instructions take no operands. |
"invlpga" invalidates the TLB mapping for a virtual page specified by the |
first operand (which has to be accumulator register) and address space |
identifier specified by the second operand (which must be ECX register). |
The XSAVE set of instructions allows to save and restore processor state |
components. "xsave" and "xsaveopt" store the components of processor state |
defined by bit mask in EDX and EAX registers into area defined by memory |
operand. "xrstor" restores from the area specified by memory operand the |
components of processor state defined by mask in EDX and EAX. The "xsave64", |
"xsaveopt64" and "xrstor64" are 64-bit versions of these instructions, allowed |
only in long mode. |
"xgetbv" reads the contents of 64-bit XCR (extended control register) |
specified in ECX register into EDX and EAX registers. "xsetbv" writes the |
contents of EDX and EAX into the 64-bit XCR specified by ECX register. These |
instructions have no operands. |
The RDRAND extension introduces one new instruction, "rdrand", which loads |
the hardware-generated random value into general register. It takes one |
operand, which can be 16-bit, 32-bit or 64-bit register (with the last one |
being allowed only in long mode). |
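As a minimal sketch of how "rdrand" might be used (the "retry" label and |
the choice of EAX are only illustrative; the use of carry flag to signal |
whether a value was delivered follows the processor documentation and is |
not described in this manual): |
|
    retry: |
        rdrand  eax             ; try to load 32-bit random value into EAX |
        jnc     retry           ; CF=0 means no value was available yet |
|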
The FSGSBASE extension adds long mode instructions that allow to read and |
write the segment base registers for FS and GS segments. "rdfsbase" and |
"rdgsbase" read the corresponding segment base registers into operand, while |
"wrfsbase" and "wrgsbase" write the value of operand into those registers. |
All these instructions take one operand, which can be 32-bit or 64-bit general |
register. |
The INVPCID extension adds "invpcid" instruction, which invalidates mapping |
in the TLBs and paging caches based on the invalidation type specified in |
first operand and PCID invalidate descriptor specified in second operand. |
The first operand should be 32-bit general register when not in long mode, |
or 64-bit general register when in long mode. The second operand should be |
128-bit memory location. |
The HLE and RTM extensions provide set of instructions for the transactional |
management. The "xacquire" and "xrelease" are new prefixes that can be used |
with some of the instructions to start or end lock elision on the memory |
address specified by prefixed instruction. The "xbegin" instruction starts |
the transactional execution, its operand is the address of a fallback routine |
that gets executed in case of transaction abort, specified like the operand |
for near jump instruction. "xend" marks the end of transactional execution |
region, it takes no operands. "xabort" forces the transaction abort, it takes |
an 8-bit immediate value as its only operand, this value is passed in the |
highest bits of EAX to the fallback routine. "xtest" checks whether there is |
transactional execution in progress, this instruction takes no operands. |
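The following sketch shows how these instructions might be combined; the |
"fallback", "done" and "counter" names are hypothetical and serve only as |
an illustration: |
|
        xbegin  fallback         ; start transaction, abort goes to fallback |
        inc     dword [counter]  ; work done transactionally |
        xend                     ; end of the transactional region |
        jmp     done |
    fallback: |
        lock inc dword [counter] ; non-transactional path, EAX holds status |
    done: |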
|
|
2.2 Control directives |
|
This section describes the directives that control the assembly process, they |
2271,7 → 3313,7 |
|
2.2.2 Conditional assembly |
|
"if" directive causes come block of instructions to be assembled only under |
"if" directive causes some block of instructions to be assembled only under |
certain condition. It should be followed by logical expression specifying the |
condition, instructions in next lines will be assembled only when this |
condition is met, otherwise they will be skipped. The optional "else if" |
2299,6 → 3341,11 |
followed by any expression, usually just by a single symbol name; it checks |
whether the given expression contains only symbols that are defined in the |
source and accessible from the current position. |
With "relativeto" operator it is possible to check whether values of two |
expressions differ only by constant amount. The valid syntax is a numerical |
expression followed by "relativeto" and then another expression (possibly |
register-based). Labels that have no simple numerical value can be tested |
this way to determine what kind of operations may be possible with them. |
The following simple example uses the "count" constant that should be |
defined somewhere in source: |
|
2329,7 → 3376,7 |
of instructions get assembled, otherwise the last block of instructions, which |
follows the line containing only "else", is assembled. |
There are also operators that allow comparison of values being any chains of |
symbols. The "eq" compares two such values whether they are exactly the same. |
symbols. The "eq" compares whether two such values are exactly the same. |
The "in" operator checks whether given value is a member of the list of values |
following this operator, the list should be enclosed between "<" and ">" |
characters, its members should be separated with commas. The symbols are |
2431,7 → 3478,7 |
255). The loaded data cannot exceed current offset. |
The "store" directive can modify the already generated code by replacing |
some of the previously generated data with the value defined by given |
numerical expression, which follow. The expression can be preceded by the |
numerical expression, which follows. The expression can be preceded by the |
optional size operator to specify how large value the expression defines, and |
therefore how many bytes will be stored, if there is no size operator, the |
size of one byte is assumed. Then the "at" operator and the numerical |
2453,7 → 3500,7 |
end repeat |
|
and each byte of code will be xored with the value defined by "c" constant. |
"virtual" defines virtual data at specified address. This data won't be |
"virtual" defines virtual data at specified address. This data will not be |
included in the output file, but labels defined there can be used in other |
parts of source. This directive can be followed by "at" operator and the |
numerical expression specifying the address for virtual data, otherwise is |
2480,7 → 3527,7 |
end virtual |
|
With such definition instruction "mov ax,[LDT_limit]" will be assembled |
to "mov ax,[bx]". |
to the same instruction as "mov ax,[bx]". |
Declaring defined data values or instructions inside the virtual block would |
also be useful, because the "load" directive can be used to load the values |
from the virtually generated code into constants. This directive should be |
2547,12 → 3594,16 |
end repeat |
display 13,10 |
|
This block of directives calculates the four hexadecimal digits of 16-bit value |
and converts them into characters for displaying. Note that this won't work if |
the adresses in current addressing space are relocatable (as it might happen |
with PE or object output formats), since only absolute values can be used this |
way. The absolute value may be obtained by calculating the relative address, |
like "$-$$", or "rva $" in case of PE format. |
This block of directives calculates the four hexadecimal digits of 16-bit |
value and converts them into characters for displaying. Note that this will |
not work if the addresses in current addressing space are relocatable (as it |
might happen with PE or object output formats), since only absolute values can |
be used this way. The absolute value may be obtained by calculating the |
relative address, like "$-$$", or "rva $" in case of PE format. |
The "err" directive immediately terminates the assembly process when it is |
encountered by assembler. |
The "assert" directive tests whether the logical expression that follows it |
is true, and if not, it signals an error. |
|
|
2.2.6 Multiple passes |
2654,6 → 3705,15 |
The "used" operator may be expected to behave in a similar manner in |
analogous cases, however any other kinds of predictions may not be so simple and |
you should never rely on them this way. |
The "err" directive, usually used to stop the assembly when some condition is |
met, stops the assembly immediately, regardless of whether the current pass |
is final or intermediate. So even when the condition that caused this directive |
to be interpreted is mispredicted and temporary, and would eventually disappear |
in the later passes, the assembly is stopped anyway. |
The "assert" directive signals the error only if its expression is false |
after all the symbols have been resolved. You can use "assert 0" in place of |
"err" when you do not want to have assembly stopped during the intermediate |
passes. |
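For illustration, a size check written in both ways might look as follows |
(the limit of 510 bytes is an arbitrary assumption, chosen as it could be |
for a boot sector): |
|
        if $-$$ > 510 |
            err                 ; stops at once, even in an intermediate pass |
        end if |
|
        assert $-$$ <= 510      ; reported only after symbols are resolved |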
|
|
2.3 Preprocessor directives |
2676,11 → 3736,14 |
number of included files as long as they fit in memory. |
The quoted path can contain environment variables enclosed within "%" |
characters, they will be replaced with their values inside the path, both the |
"\" and "/" characters are allowed as a path separators. If no absolute path |
is given, the file is first searched for in the directory containing file |
which included it and when it's not found there, in the directory containing |
the main source file (the one specified in command line). These rules concern |
also paths given with the "file" directive. |
"\" and "/" characters are allowed as a path separators. The file is first |
searched for in the directory containing file which included it and when it is |
not found there, the search is continued in the directories specified in the |
environment variable called INCLUDE (the multiple paths separated with |
semicolons can be defined there, they will be searched in the same order as |
specified). If file was not found in any of these places, preprocessor looks |
for it in the directory containing the main source file (the one specified in |
command line). These rules concern also paths given with the "file" directive. |
|
|
2.3.2 Symbolic constants |
2713,7 → 3776,7 |
"d" constant back the value "edx", the second one will restore it to value |
"dword", and one more will revert "d" to original meaning as if no such |
constant was defined. If there was no constant defined of given name, |
"restore" won't cause an error, it will be just ignored. |
"restore" will not cause an error, it will be just ignored. |
Symbolic constant can be used to adjust the syntax of assembler to personal |
preferences. For example the following set of definitions provides the handy |
shortcuts for all the size operators: |
2726,6 → 3789,7 |
q equ qword |
t equ tword |
x equ dqword |
y equ qqword |
|
Because symbolic constant may also have an empty value, it can be used to |
allow the syntax with "offset" word before any address value: |
2841,10 → 3905,13 |
definition, so "mov es,ds,dx" will be assembled as "push ds", "pop es" and |
"mov ds,dx". |
By placing the "*" after the name of argument you can mark the argument as |
required - preprocessor won't allow it to have an empty value. For example the |
above macroinstruction could be declared as "macro mov op1*,op2*,op3" to make |
sure that first two arguments will always have to be given some non empty |
required - preprocessor will not allow it to have an empty value. For example |
the above macroinstruction could be declared as "macro mov op1*,op2*,op3" to |
make sure that first two arguments will always have to be given some non empty |
values. |
Alternatively, you can provide the default value for argument, by placing |
the "=" followed by value after the name of argument. Then if the argument |
has an empty value provided, the default value will be used instead. |
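A brief sketch of the default value, using a hypothetical "stoschar" |
macroinstruction: |
|
        macro stoschar char=20h |
        { |
            mov al,char |
            stosb |
        } |
|
        stoschar 'A'            ; stores the character 'A' |
        stoschar                ; empty argument, the default 20h is used |
|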
When it's needed to provide macroinstruction with argument that contains |
some commas, such argument should be enclosed between "<" and ">" characters. |
If it contains more than one "<" character, the same number of ">" should be |
2852,8 → 3919,8 |
"purge" directive allows removing the last definition of specified |
macroinstruction. It should be followed by one or more names of |
macroinstructions, separated with commas. If such macroinstruction has not |
been defined, you won't get any error. For example after having the syntax of |
"mov" extended with the macroinstructions defined above, you can disable |
been defined, you will not get any error. For example after having the syntax |
of "mov" extended with the macroinstructions defined above, you can disable |
syntax with three operands back by using "purge mov" directive. Next |
"purge mov" will disable also syntax for two operands being segment registers, |
and all the next such directives will do nothing. |
2903,7 → 3970,7 |
} |
|
Each time this macroinstruction is used, "move" will become other unique name |
in its instructions, so you won't get an error you normally get when some |
in its instructions, so you will not get an error you normally get when some |
label is defined more than once. |
"forward", "reverse" and "common" directives divide macroinstruction into |
blocks, each one processed after the processing of previous is finished. They |
2948,8 → 4015,8 |
} |
|
This macroinstruction can be used for calling the procedures using STDCALL |
convention, arguments are pushed on stack in the reverse order. For example |
"stdcall foo,1,2,3" will be assembled as: |
convention, which has all the arguments pushed on stack in the reverse order. |
For example "stdcall foo,1,2,3" will be assembled as: |
|
push 3 |
push 2 |
2985,7 → 4052,7 |
"jae exit" instructions. |
The "#" operator can be also used to concatenate two quoted strings into one. |
Also conversion of name into a quoted string is possible, with the "`" operator, |
which likewise can be used inside the macroinstruction. It convert the name |
which likewise can be used inside the macroinstruction. It converts the name |
that follows it into a quoted string - but note, that when it is followed by |
a macro argument which is being replaced with value containing more than one |
symbol, only the first of them will be converted, as the "`" operator converts |
3104,9 → 4171,10 |
with dot in the contents of macroinstruction. The macroinstruction defined |
using the "struc" directive can have the same name as some other |
macroinstruction defined using the "macro" directive, structure |
macroinstruction won't prevent the standard macroinstruction being processed |
when there is no label before it and vice versa. All the rules and features |
concerning standard macroinstructions apply to structure macroinstructions. |
macroinstruction will not prevent the standard macroinstruction from being |
processed when there is no label before it and vice versa. All the rules and |
features concerning standard macroinstructions apply to structure |
macroinstructions. |
Here is the sample of structure macroinstruction: |
|
struc point x,y |
3146,10 → 4214,7 |
|
The "rept" directive is a special kind of macroinstruction, which makes given |
amount of duplicates of the block enclosed with braces. The basic syntax is |
"rept" directive followed by number (it cannot be an expression, since |
preprocessor doesn't do calculations, if you need repetitions based on values |
calculated by assembler, use one of the code repeating directives that are |
processed by assembler, see 2.2.3), and then block of source enclosed between |
"rept" directive followed by number and then block of source enclosed between |
the "{" and "}" characters. The simplest example: |
|
rept 5 { in al,dx } |
3200,6 → 4265,16 |
will generate code which will clear the contents of eight SSE registers. |
You can define multiple counters separated with commas, and each one can have |
different base. |
The number of repetitions and the base values for counters can be specified |
using the numerical expressions with operator rules identical as in the case |
of assembler. However each value used in such expression must either be a |
directly specified number, or a symbolic constant with value also being an |
expression that can be calculated by preprocessor (in such case the value |
of expression associated with symbolic constant is calculated first, and then |
substituted into the outer expression in place of that constant). If you need |
repetitions based on values that can only be calculated at assembly time, use |
one of the code repeating directives that are processed by assembler, see |
section 2.2.3. |
The "irp" directive iterates the single argument through the given list of |
parameters. The syntax is "irp" followed by the argument name, then the comma |
and then the list of parameters. The parameters are specified in the same |
3253,7 → 4328,7 |
match +,- { include 'second.inc' } |
|
the first file will get included, since "+" after comma matches the "+" in |
pattern, and the second file won't be included, since there is no match. |
pattern, and the second file will not be included, since there is no match. |
To match any other symbol literally, it has to be preceded by "=" character |
in the pattern. Also to match the "=" character itself, or the comma, the |
"==" and "=," constructions have to be used. For example the "=a==" pattern |
3277,8 → 4352,8 |
|
match a b, 1 { db a } |
|
there will be nothing left for "b" to match, so the block won't get processed |
at all. |
there will be nothing left for "b" to match, so the block will not get |
processed at all. |
The block of source defined by match is processed in the same way as any |
macroinstruction, so any operators specific to macroinstructions can be used |
also in this case. |
3314,12 → 4389,12 |
a separate stage, and all other preprocessing is done after on the resulting |
source. |
The standard preprocessing that comes after, on each line begins with |
recognition of the first symbol. It begins with checking for the preprocessor |
recognition of the first symbol. It starts with checking for the preprocessor |
directives, and when none of them is detected, preprocessor checks whether the |
first symbol is macroinstruction. If no macroinstruction is found, it moves |
to the second symbol of line, and again begins with checking for directives, |
which in this case is only the "equ" directive, as this is the only one that |
occurs as the second symbol in line. If there's no directive, the second |
occurs as the second symbol in line. If there is no directive, the second |
symbol is checked for the case of structure macroinstruction and when none |
of those checks gives the positive result, the symbolic constants are replaced |
with their values and such line is passed to the assembler. |
3331,11 → 4406,15 |
|
would be then both interpreted as invocations of macroinstruction "foo", since |
the meaning of the first symbol overrides the meaning of second one. |
The macroinstructions generate the new lines from their definition blocks, |
replacing the parameters with their values and then processing the "#" and "`" |
operators. The conversion operator has the higher priority than concatenation. |
After this is completed, the newly generated line goes through the standard |
preprocessing, as described above. |
When the macroinstruction generates the new lines from its definition block, |
in every line it first scans for macroinstruction directives, and interprets |
them accordingly. All the other content in the definition block is used to |
brew the new lines, replacing the macroinstruction parameters with their values |
and then processing the symbol escaping and "#" and "`" operators. The |
conversion operator has the higher priority than concatenation and if any of |
them operates on the escaped symbol, the escaping is cancelled before finishing |
the operation. After this is completed, the newly generated line goes through |
the standard preprocessing, as described above. |
Though the symbolic constants are usually only replaced in the lines, where |
no preprocessor directives nor macroinstructions have been found, there are some |
special cases where those replacements are performed in the parts of lines |
3375,6 → 4454,33 |
block enclosed with braces. So if the "list" had value "1,2", the above line |
would generate the line containing "foo 1,2", which would then go through the |
standard preprocessing. |
The other special case is in the parameters of "rept" directive. The amount |
of repetitions and the base value for counter can be specified using |
numerical expressions, and if there is a symbolic constant with non-numerical |
name used in such an expression, preprocessor tries to evaluate its value as |
a numerical expression and if succeeds, it replaces the symbolic constant with |
the result of that calculation and continues to evaluate the primary |
expression. If the expression inside that symbolic constant also contains |
some symbolic constants, preprocessor will try to calculate all the needed |
values recursively. |
This allows to perform some calculations at the time of preprocessing, as |
long as all the values used are the numbers known at the preprocessing stage. |
A single repetition with "rept" can be used for the sole purpose of |
calculating some value, like in this example: |
|
define a b+4 |
define b 3 |
rept 1 result:a*b+2 { define c result } |
|
To compute the base value for "result" counter, preprocessor replaces the "b" |
with its value and recursively calculates the value of "a", obtaining 7 as |
the result, then it calculates the main expression with the result being 23. |
The "c" then gets defined with the first value of counter (because the block |
is processed just one time), which is the result of the computation, so the |
value of "c" is simply the "23" symbol. Note that if "b" is later redefined with |
some other numerical value, the next time an expression containing "a" is |
calculated, the value of "a" will reflect the new value of "b", because the |
symbolic constant contains just the text of the expression. |
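Continuing the above example as a sketch (the "d" constant is introduced |
here only for illustration), redefining "b" changes the later result: |
|
        define b 5 |
        rept 1 result:a*b+2 { define d result } |
|
Here "a" is calculated as 5+4, that is 9, and the main expression gives |
9*5+2, so "d" gets defined as the "47" symbol. |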
There is one more special case - when preprocessor goes to checking the |
second symbol in the line and it happens to be the colon character (what is |
then interpreted by assembler as definition of a label), it stops in this |
3421,9 → 4527,10 |
the "a" constant doesn't get defined. However symbolic constant "b" was |
processed normally, even though its definition was put just next to the one |
of "a". So because of the possible confusion you should be very careful |
every time when mixing the features of preprocessor and assembler - always |
try to imagine what your source will become after the preprocessing, and |
thus what the assembler will see and do its multiple passes on. |
every time when mixing the features of preprocessor and assembler - in such |
cases it is important to realize what the source will become after the |
preprocessing, and thus what the assembler will see and do its multiple passes |
on. |
|
|
2.4 Formatter directives |
3433,7 → 4540,10 |
"format" directive followed by the format identifier allows to select the |
output format. This directive should be put at the beginning of the source. |
Default output format is a flat binary file, it can also be selected by using |
"format binary" directive. |
"format binary" directive. This directive can be followed by the "as" keyword |
and the quoted string specifying the default file extension for the output |
file. Unless the output file name was specified from the command line, |
assembler will use this extension when generating the output file. |
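For example, a line like the following (the 'img' extension is just an |
illustrative choice) would make the default output file name end with that |
extension: |
|
        format binary as 'img' |
|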
"use16" and "use32" directives force the assembler to generate 16-bit or |
32-bit code, omitting the default setting for selected output format. "use64" |
enables generating the code for the long mode of x86-64 processors. |
3468,16 → 4578,20 |
2.4.2 Portable Executable |
|
To select the Portable Executable output format, use "format PE" directive, it |
can be followed by additional format settings: use "console", "GUI" or |
"native" operator selects the target subsystem (floating point value |
specifying subsystem version can follow), "DLL" marks the output file as a |
dynamic link library. Then can follow the "at" operator and the numerical |
expression specifying the base of PE image and then optionally "on" operator |
followed by the quoted string containing file name selects custom MZ stub for |
PE program (when specified file is not a MZ executable, it is treated as a |
flat binary executable file and converted into MZ format). The default code |
setting for this format is 32-bit. The example of fully featured PE format |
declaration: |
can be followed by additional format settings: first the target subsystem |
setting, which can be "console" or "GUI" for Windows applications, "native" |
for Windows drivers, "EFI", "EFIboot" or "EFIruntime" for the UEFI, it may be |
followed by the minimum version of system that the executable is targeted to |
(specified in form of floating-point value). Optional "DLL" and "WDM" keywords |
mark the output file as a dynamic link library and WDM driver respectively, |
and the "large" keyword marks the executable as able to handle addresses |
larger than 2 GB. |
After those settings can follow the "at" operator and a numerical expression |
specifying the base of PE image and then optionally "on" operator followed by |
the quoted string containing file name selects custom MZ stub for PE program |
(when specified file is not a MZ executable, it is treated as a flat binary |
executable file and converted into MZ format). The default code setting for |
this format is 32-bit. The example of fully featured PE format declaration: |
|
format PE GUI 4.0 DLL at 7000000h on 'stub.exe' |
|
3524,22 → 4638,24 |
identifier is followed by "from" operator and quoted file name - in such case |
data is taken from the given resource file. |
The "rva" operator can be used inside the numerical expressions to obtain |
the RVA of the item addressed by the value it is applied to. |
the RVA of the item addressed by the value it is applied to, that is the |
offset relative to the base of PE image. |
|
|
2.4.3 Common Object File Format |
|
To select Common Object File Format, use "format COFF" or "format MS COFF" |
directive whether you want to create classic or Microsoft's COFF file. The |
default code setting for this format is 32-bit. To create the file in |
Microsoft's COFF format for the x86-64 architecture, use "format MS64 COFF" |
setting, in such case long mode code is generated by default. |
directive, depending on whether you want to create classic (DJGPP) or Microsoft's |
variant of COFF file. The default code setting for this format is 32-bit. To |
create the file in Microsoft's COFF format for the x86-64 architecture, use |
"format MS64 COFF" setting, in such case long mode code is generated by |
default. |
"section" directive defines a new section, it should be followed by quoted |
string defining the name of section, then one or more section flags can |
follow. Section flags available for both COFF variants are "code" and "data", |
while "readable", "writeable", "executable", "shareable", "discardable", |
"notpageable", "linkremove" and "linkinfo" are flags available only with |
Microsoft COFF variant. |
while flags "readable", "writeable", "executable", "shareable", "discardable", |
"notpageable", "linkremove" and "linkinfo" are available only with Microsoft's |
COFF variant. |
By default section is aligned to double word (four bytes), in case of |
Microsoft COFF variant other alignment can be specified by providing the |
"align" operator followed by alignment value (any power of two up to 8192) |
3561,6 → 4677,12 |
public main |
public start as '_start' |
|
Additionally, with COFF format it's possible to specify exported symbol as |
static, it's done by preceding the name of symbol with the "static" keyword. |
When using the Microsoft's COFF format, the "rva" operator can be used |
inside the numerical expressions to obtain the RVA of the item addressed by the |
value it is applied to. |
|
2.4.4 Executable and Linkable Format |
|
To select ELF output format, use "format ELF" directive. The default code |
3578,14 → 4700,24 |
The "rva" operator can be used also in the case of this format (however not |
when target architecture is x86-64), it converts the address into the offset |
relative to the GOT table, so it may be useful to create position-independent |
code. |
code. There's also a special "plt" operator, which allows to call the external |
functions through the Procedure Linkage Table. You can even create an alias |
for external function that will make it always be called through PLT, with |
the code like: |
|
extrn 'printf' as _printf |
printf = PLT _printf |
|
To create executable file, follow the format choice directive with the |
"executable" keyword. It allows to use "entry" directive followed by the value |
to set as entry point of program. On the other hand it makes "extrn" and |
"public" directives unavailable, and instead of "section" there should be the |
"segment" directive used, followed only by one or more segment permission |
flags. The origin of segment is aligned to page (4096 bytes), and available |
flags for are: "readable", "writeable" and "executable". |
"executable" keyword and optionally the number specifying the brand of the |
target operating system (for example value 3 would mark the executable |
for Linux system). With this format selected it is allowed to use "entry" |
directive followed by the value to set as entry point of program. On the other |
hand it makes "extrn" and "public" directives unavailable, and instead of |
"section" there should be the "segment" directive used, followed by one or |
more segment permission flags and optionally a marker of special ELF |
executable segment, which can be "interpreter", "dynamic" or "note". The |
origin of segment is aligned to page (4096 bytes), and available permission |
flags are: "readable", "writeable" and "executable". |
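A minimal sketch of such an executable for 32-bit Linux might be as below |
(the "start" label is arbitrary, and the use of "int 0x80" with function 1 |
to exit is an assumption about the target system, not a feature of the |
assembler): |
|
        format ELF executable 3 |
        entry start |
|
        segment readable executable |
|
        start: |
            mov     eax,1       ; number of the exit system call |
            xor     ebx,ebx     ; exit code 0 |
            int     0x80 |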
|
|
EOF |
EOF |