Subversion Repositories Kolibri OS

Compare Revisions

Rev 2665 → Rev 2666

/data/Vortex86MX-eng/docs/FASM.TXT
1,16 → 1,16
 
▄▀▀▀
▄▄█▄▄ ▄▄▄▄ ▄▄▄▄▄ ▄▄▄ ▄▄
█ █ █ █ █ █
█ ▄▀▀▀▀█ ▀▀▀▀▄ █ █ █
█ ▀▄▄▄▄█▄ ▄▄▄▄▄▀ █ █ █
,'''
,,;,, ,,,, ,,,,, ,,, ,,
; ; ; ; ; ;
; ,''''; '''', ; ; ;
; ',,,,;, ,,,,,' ; ; ;
 
flat assembler 1.66
flat assembler 1.70
Programmer's Manual
 
 
Table of contents
─────────────────
-----------------
 
Chapter 1 Introduction
 
50,6 → 50,11
2.1.17 SSE3 instructions
2.1.18 AMD 3DNow! instructions
2.1.19 The x86-64 long mode instructions
2.1.20 SSE4 instructions
2.1.21 AVX instructions
2.1.22 AVX2 instructions
2.1.23 Auxiliary sets of computational instructions
2.1.24 Other extensions of instruction set
 
2.2 Control directives
2.2.1 Numerical constants
75,8 → 80,9
2.4.4 Executable and Linkable Format
 
 
 
Chapter 1 Introduction
───────────────────────
-----------------------
 
This chapter contains all the most important information you need to begin
using the flat assembler. If you are an experienced assembly language programmer,
139,7 → 145,7
destination file.
The following is an example of the compilation summary:
 
flat assembler version 1.66
flat assembler version 1.70 (16384 kilobytes memory)
38 passes, 5.3 seconds, 77824 bytes.
 
In case of error during the compilation process, the program will display an
146,7 → 152,7
error message. For example, when the compiler can't find the input file, it will
display the following message:
 
flat assembler version 1.66
flat assembler version 1.70 (16384 kilobytes memory)
error: source file not found.
 
If the error is connected with a specific part of source code, the source line
153,7 → 159,7
that caused the error will also be displayed. The placement of this line in
the source is also given to help you find the error, for example:
 
flat assembler version 1.66
flat assembler version 1.70 (16384 kilobytes memory)
example.asm [3]:
mob ax,1
error: illegal instruction.
163,7 → 169,7
contains a macroinstruction, also the line in macroinstruction definition
that generated the erroneous instruction is displayed:
 
flat assembler version 1.66
flat assembler version 1.70 (16384 kilobytes memory)
example.asm [6]:
stoschar 7
example.asm [3] stoschar [1]:
212,8 → 218,8
Any of the "+-*/=<>()[]{}:,|&~#`" is the symbol character. The sequence of
other characters, separated from other items with either blank spaces or
symbol characters, is a symbol. If the first character of symbol is either a
single or double quote, it integrates the any sequence of characters following
it, even the special ones, into a quoted string, which should end with the same
single or double quote, it integrates any sequence of characters following it,
even the special ones, into a quoted string, which should end with the same
character, with which it began (the single or double quote) - however if there
are two such characters in a row (without any other character between them),
they are integrated into quoted string as just one of them and the quoted
237,40 → 243,45
brackets or after the "ptr" operator).
 
Table 1.1 Size operators
┌──────────┬──────┬───────┐
│ Operator │ Bits │ Bytes │
╞══════════╪══════╪═══════╡
│ byte     │ 8    │ 1     │
│ word     │ 16   │ 2     │
│ dword    │ 32   │ 4     │
│ fword    │ 48   │ 6     │
│ pword    │ 48   │ 6     │
│ qword    │ 64   │ 8     │
│ tbyte    │ 80   │ 10    │
│ tword    │ 80   │ 10    │
│ dqword   │ 128  │ 16    │
└──────────┴──────┴───────┘
/-------------------------\
| Operator | Bits | Bytes |
|==========|======|=======|
| byte     | 8    | 1     |
| word     | 16   | 2     |
| dword    | 32   | 4     |
| fword    | 48   | 6     |
| pword    | 48   | 6     |
| qword    | 64   | 8     |
| tbyte    | 80   | 10    |
| tword    | 80   | 10    |
| dqword   | 128  | 16    |
| xword    | 128  | 16    |
| qqword   | 256  | 32    |
| yword    | 256  | 32    |
\-------------------------/
 
Table 1.2 Registers
┌─────────┬──────┬────────────────────────────────────────────────┐
│ Type    │ Bits │                                                │
╞═════════╪══════╪════════════════════════════════════════════════╡
│         │ 8    │ al   cl   dl   bl   ah   ch   dh   bh          │
│ General │ 16   │ ax   cx   dx   bx   sp   bp   si   di          │
│         │ 32   │ eax  ecx  edx  ebx  esp  ebp  esi  edi         │
├─────────┼──────┼────────────────────────────────────────────────┤
│ Segment │ 16   │ es   cs   ss   ds   fs   gs                    │
├─────────┼──────┼────────────────────────────────────────────────┤
│ Control │ 32   │ cr0  cr2  cr3  cr4                             │
├─────────┼──────┼────────────────────────────────────────────────┤
│ Debug   │ 32   │ dr0  dr1  dr2  dr3  dr6  dr7                   │
├─────────┼──────┼────────────────────────────────────────────────┤
│ FPU     │ 80   │ st0  st1  st2  st3  st4  st5  st6  st7         │
├─────────┼──────┼────────────────────────────────────────────────┤
│ MMX     │ 64   │ mm0  mm1  mm2  mm3  mm4  mm5  mm6  mm7         │
├─────────┼──────┼────────────────────────────────────────────────┤
│ SSE     │ 128  │ xmm0 xmm1 xmm2 xmm3 xmm4 xmm5 xmm6 xmm7        │
└─────────┴──────┴────────────────────────────────────────────────┘
/-----------------------------------------------------------------\
| Type    | Bits |                                                |
|=========|======|================================================|
|         | 8    | al   cl   dl   bl   ah   ch   dh   bh          |
| General | 16   | ax   cx   dx   bx   sp   bp   si   di          |
|         | 32   | eax  ecx  edx  ebx  esp  ebp  esi  edi         |
|---------|------|------------------------------------------------|
| Segment | 16   | es   cs   ss   ds   fs   gs                    |
|---------|------|------------------------------------------------|
| Control | 32   | cr0  cr2  cr3  cr4                             |
|---------|------|------------------------------------------------|
| Debug   | 32   | dr0  dr1  dr2  dr3  dr6  dr7                   |
|---------|------|------------------------------------------------|
| FPU     | 80   | st0  st1  st2  st3  st4  st5  st6  st7         |
|---------|------|------------------------------------------------|
| MMX     | 64   | mm0  mm1  mm2  mm3  mm4  mm5  mm6  mm7         |
|---------|------|------------------------------------------------|
| SSE     | 128  | xmm0 xmm1 xmm2 xmm3 xmm4 xmm5 xmm6 xmm7        |
|---------|------|------------------------------------------------|
| AVX     | 256  | ymm0 ymm1 ymm2 ymm3 ymm4 ymm5 ymm6 ymm7        |
\-----------------------------------------------------------------/
 
 
1.2.2 Data definitions
316,25 → 327,25
considered unknown.
 
Table 1.3 Data directives
┌─────────┬────────┬─────────┐
│ Size    │ Define │ Reserve │
│ (bytes) │ data   │ data    │
╞═════════╪════════╪═════════╡
│ 1       │ db     │ rb      │
│         │ file   │         │
├─────────┼────────┼─────────┤
│ 2       │ dw     │ rw      │
│         │ du     │         │
├─────────┼────────┼─────────┤
│ 4       │ dd     │ rd      │
├─────────┼────────┼─────────┤
│ 6       │ dp     │ rp      │
│         │ df     │ rf      │
├─────────┼────────┼─────────┤
│ 8       │ dq     │ rq      │
├─────────┼────────┼─────────┤
│ 10      │ dt     │ rt      │
└─────────┴────────┴─────────┘
/----------------------------\
| Size    | Define | Reserve |
| (bytes) | data   | data    |
|=========|========|=========|
| 1       | db     | rb      |
|         | file   |         |
|---------|--------|---------|
| 2       | dw     | rw      |
|         | du     |         |
|---------|--------|---------|
| 4       | dd     | rd      |
|---------|--------|---------|
| 6       | dp     | rp      |
|         | df     | rf      |
|---------|--------|---------|
| 8       | dq     | rq      |
|---------|--------|---------|
| 10      | dt     | rt      |
\----------------------------/
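
The directives from table 1.3 could be used, for instance, as in the sketch
below. The label names are made up for this illustration and are not taken
from the original manual:

    bytes  db 1,2,3         ; three bytes with the given values
    msg    db 'abc',0       ; quoted string followed by a zero byte
    value  dd 12345678h     ; one double word
    buffer rb 16            ; reserves 16 bytes without defining their values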
 
 
1.2.3 Constants and labels
399,14 → 410,24
In the above examples all the numerical expressions were the simple numbers,
constants or labels. But they can be more complex, by using the arithmetical
or logical operators for calculations at compile time. All these operators
with their priority values are listed in table 1.4.
The operations with higher priority value will be calculated first, you can
of course change this behavior by putting some parts of expression into
parenthesis. The "+", "-", "*" and "/" are standard arithmetical operations,
"mod" calculates the remainder from division. The "and", "or", "xor", "shl",
"shr" and "not" perform the same logical operations as assembly instructions
of those names. The "rva" performs the conversion of an address into the
relocatable offset and is specific to some of the output formats (see 2.4).
with their priority values are listed in table 1.4. The operations with higher
priority value will be calculated first, you can of course change this
behavior by putting some parts of expression into parenthesis. The "+", "-",
"*" and "/" are standard arithmetical operations, "mod" calculates the
remainder from division. The "and", "or", "xor", "shl", "shr" and "not"
perform the same logical operations as assembly instructions of those names.
The "rva" and "plt" are special unary operators that perform conversions
between different kinds of addresses, they can be used only with a few of the
output formats and their meaning may vary (see 2.4).
The arithmetical and logical calculations are usually processed as if they
operated on infinite precision 2-adic numbers, and the assembler signals an
overflow error if, because of its limitations, it is not able to perform the
required calculation, or if the result is too large a number to fit in either
signed or unsigned range for the destination unit size. However "not", "xor"
and "shr" operators are exceptions from this rule - if the value specified
by numerical expression has to fit in a unit of specified size, and the
arguments for operation fit into that size, the operation will be performed
with precision limited to that size.
The numbers in the expression are by default treated as a decimal, binary
numbers should have the "b" letter attached at the end, octal number should
end with "o" letter, hexadecimal numbers should begin with "0x" characters
431,23 → 452,23
while simple "1" defines an integer value.
 
Table 1.4 Arithmetical and logical operators by priority
┌──────────┬──────────────┐
│ Priority │ Operators    │
╞══════════╪══════════════╡
│ 0        │ + -          │
├──────────┼──────────────┤
│ 1        │ * /          │
├──────────┼──────────────┤
│ 2        │ mod          │
├──────────┼──────────────┤
│ 3        │ and or xor   │
├──────────┼──────────────┤
│ 4        │ shl shr      │
├──────────┼──────────────┤
│ 5        │ not          │
├──────────┼──────────────┤
│ 6        │ rva          │
└──────────┴──────────────┘
/-------------------------\
| Priority | Operators    |
|==========|==============|
| 0        | + -          |
|----------|--------------|
| 1        | * /          |
|----------|--------------|
| 2        | mod          |
|----------|--------------|
| 3        | and or xor   |
|----------|--------------|
| 4        | shl shr      |
|----------|--------------|
| 5        | not          |
|----------|--------------|
| 6        | rva plt      |
\-------------------------/
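
As an illustration, assuming the priority values given in table 1.4, the
following definitions (with arbitrary names chosen only for this sketch)
should be calculated as described in the comments:

    first  = 1+2*3          ; "*" is calculated before "+", result is 7
    second = (1+2)*3        ; parenthesis change the order, result is 9
    third  = 1 shl 2+1      ; "shl" is calculated before "+", result is 5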
 
 
1.2.5 Jumps and calls
459,7 → 480,7
in 32-bit mode, it will become the near jump. To force this instruction to be
treated differently, use the "jmp near dword [0]" or "jmp far dword [0]" form.
When operand of near jump is the immediate value, assembler will generate
the shortest variant of this jump instruction if possible (but won't create
the shortest variant of this jump instruction if possible (but will not create
32-bit instruction in 16-bit mode nor 16-bit instruction in 32-bit mode,
unless there is a size operator stating it). By specifying the jump type
you can force it to always generate long variant (for example "jmp near 0")
492,7 → 513,7
 
 
Chapter 2 Instruction set
──────────────────────────
--------------------------
 
This chapter provides the detailed information about the instructions and
directives supported by flat assembler. Directives for defining labels were
767,12 → 788,12
 
2.1.5 Logical instructions
 
"not" inverts the bits in the specified operand to form a one's
complement of the operand. It has no effect on the flags. Rules for the
operand are the same as for the "inc" instruction.
"and", "or" and "xor" instructions perform the standard
logical operations. They update the SF, ZF and PF flags. Rules for the
operands are the same as for the "add" instruction.
"not" inverts the bits in the specified operand to form a one's complement
of the operand. It has no effect on the flags. Rules for the operand are the
same as for the "inc" instruction.
"and", "or" and "xor" instructions perform the standard logical operations.
They update the SF, ZF and PF flags. Rules for the operands are the same as
for the "add" instruction.
"bt", "bts", "btr" and "btc" instructions operate on a single bit which can
be in memory or in a general register. The location of the bit is specified
as an offset from the low order end of the operand. The value of the offset
918,55 → 939,55
target address.
 
Table 2.1 Conditions
┌──────────┬───────────────────────┬────────────────────────┐
│ Mnemonic │ Condition tested      │ Description            │
╞══════════╪═══════════════════════╪════════════════════════╡
│ o        │ OF = 1                │ overflow               │
├──────────┼───────────────────────┼────────────────────────┤
│ no       │ OF = 0                │ not overflow           │
├──────────┼───────────────────────┼────────────────────────┤
│ c        │                       │ carry                  │
│ b        │ CF = 1                │ below                  │
│ nae      │                       │ not above nor equal    │
├──────────┼───────────────────────┼────────────────────────┤
│ nc       │                       │ not carry              │
│ ae       │ CF = 0                │ above or equal         │
│ nb       │                       │ not below              │
├──────────┼───────────────────────┼────────────────────────┤
│ e        │ ZF = 1                │ equal                  │
│ z        │                       │ zero                   │
├──────────┼───────────────────────┼────────────────────────┤
│ ne       │ ZF = 0                │ not equal              │
│ nz       │                       │ not zero               │
├──────────┼───────────────────────┼────────────────────────┤
│ be       │ CF or ZF = 1          │ below or equal         │
│ na       │                       │ not above              │
├──────────┼───────────────────────┼────────────────────────┤
│ a        │ CF or ZF = 0          │ above                  │
│ nbe      │                       │ not below nor equal    │
├──────────┼───────────────────────┼────────────────────────┤
│ s        │ SF = 1                │ sign                   │
├──────────┼───────────────────────┼────────────────────────┤
│ ns       │ SF = 0                │ not sign               │
├──────────┼───────────────────────┼────────────────────────┤
│ p        │ PF = 1                │ parity                 │
│ pe       │                       │ parity even            │
├──────────┼───────────────────────┼────────────────────────┤
│ np       │ PF = 0                │ not parity             │
│ po       │                       │ parity odd             │
├──────────┼───────────────────────┼────────────────────────┤
│ l        │ SF xor OF = 1         │ less                   │
│ nge      │                       │ not greater nor equal  │
├──────────┼───────────────────────┼────────────────────────┤
│ ge       │ SF xor OF = 0         │ greater or equal       │
│ nl       │                       │ not less               │
├──────────┼───────────────────────┼────────────────────────┤
│ le       │ (SF xor OF) or ZF = 1 │ less or equal          │
│ ng       │                       │ not greater            │
├──────────┼───────────────────────┼────────────────────────┤
│ g        │ (SF xor OF) or ZF = 0 │ greater                │
│ nle      │                       │ not less nor equal     │
└──────────┴───────────────────────┴────────────────────────┘
/-----------------------------------------------------------\
| Mnemonic | Condition tested      | Description            |
|==========|=======================|========================|
| o        | OF = 1                | overflow               |
|----------|-----------------------|------------------------|
| no       | OF = 0                | not overflow           |
|----------|-----------------------|------------------------|
| c        |                       | carry                  |
| b        | CF = 1                | below                  |
| nae      |                       | not above nor equal    |
|----------|-----------------------|------------------------|
| nc       |                       | not carry              |
| ae       | CF = 0                | above or equal         |
| nb       |                       | not below              |
|----------|-----------------------|------------------------|
| e        | ZF = 1                | equal                  |
| z        |                       | zero                   |
|----------|-----------------------|------------------------|
| ne       | ZF = 0                | not equal              |
| nz       |                       | not zero               |
|----------|-----------------------|------------------------|
| be       | CF or ZF = 1          | below or equal         |
| na       |                       | not above              |
|----------|-----------------------|------------------------|
| a        | CF or ZF = 0          | above                  |
| nbe      |                       | not below nor equal    |
|----------|-----------------------|------------------------|
| s        | SF = 1                | sign                   |
|----------|-----------------------|------------------------|
| ns       | SF = 0                | not sign               |
|----------|-----------------------|------------------------|
| p        | PF = 1                | parity                 |
| pe       |                       | parity even            |
|----------|-----------------------|------------------------|
| np       | PF = 0                | not parity             |
| po       |                       | parity odd             |
|----------|-----------------------|------------------------|
| l        | SF xor OF = 1         | less                   |
| nge      |                       | not greater nor equal  |
|----------|-----------------------|------------------------|
| ge       | SF xor OF = 0         | greater or equal       |
| nl       |                       | not less               |
|----------|-----------------------|------------------------|
| le       | (SF xor OF) or ZF = 1 | less or equal          |
| ng       |                       | not greater            |
|----------|-----------------------|------------------------|
| g        | (SF xor OF) or ZF = 0 | greater                |
| nle      |                       | not less nor equal     |
\-----------------------------------------------------------/
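
A short sketch of how the mnemonics from table 2.1 can be attached to the
"j" mnemonic. The label names are hypothetical and serve only this example:

    cmp eax,ebx
    jb is_below             ; taken when eax is below ebx as unsigned (CF = 1)
    jl is_less              ; taken when eax is less than ebx as signed
                            ; (SF xor OF = 1)
  is_below:
  is_less: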
 
The "loop" instructions are conditional jumps that use a value placed in
CX (or ECX) to specify the number of repetitions of a software loop. All
1158,7 → 1179,7
 
"salc" instruction sets the all bits of AL register when the carry flag is
set and zeroes the AL register otherwise. This instruction has no arguments.
The instructions obtained by attaching the condition mnemonic to the "cmov"
The instructions obtained by attaching the condition mnemonic to "cmov"
mnemonic transfer the word or double word from the general register or memory
to the general register only when the condition is true. The destination
operand should be general register, the source operand can be general register
1365,7 → 1386,7
commonly used constants onto the FPU register stack. The loaded constants are
+1.0, +0.0, lb 10, lb e, pi, lg 2 and ln 2 respectively. These instructions
have no operands.
"fild" convert the singed integer source operand into double extended
"fild" converts the signed integer source operand into double extended
precision floating-point format and pushes the result onto the FPU register
stack. The source operand can be a 16-bit, 32-bit or 64-bit memory location.
 
1493,18 → 1514,18
fcmovb st0,st2 ; transfer st2 to st0 if below
 
Table 2.2 FPU conditions
┌──────────┬──────────────────┬────────────────────────┐
│ Mnemonic │ Condition tested │ Description            │
╞══════════╪══════════════════╪════════════════════════╡
│ b        │ CF = 1           │ below                  │
│ e        │ ZF = 1           │ equal                  │
│ be       │ CF or ZF = 1     │ below or equal         │
│ u        │ PF = 1           │ unordered              │
│ nb       │ CF = 0           │ not below              │
│ ne       │ ZF = 0           │ not equal              │
│ nbe      │ CF and ZF = 0    │ not below nor equal    │
│ nu       │ PF = 0           │ not unordered          │
└──────────┴──────────────────┴────────────────────────┘
/------------------------------------------------------\
| Mnemonic | Condition tested | Description            |
|==========|==================|========================|
| b        | CF = 1           | below                  |
| e        | ZF = 1           | equal                  |
| be       | CF or ZF = 1     | below or equal         |
| u        | PF = 1           | unordered              |
| nb       | CF = 0           | not below              |
| ne       | ZF = 0           | not equal              |
| nbe      | CF and ZF = 0    | not below nor equal    |
| nu       | PF = 0           | not unordered          |
\------------------------------------------------------/
 
"ftst" compares the value in ST0 with 0.0 and sets the flags in the FPU
status word according to the results. "fxam" examines the contents of the ST0
1528,7 → 1549,12
destination in memory and reinitializes the FPU. "fsave" checks for pending
unmasked FPU exceptions before proceeding, "fnsave" does not. "frstor"
loads the FPU state from the specified memory location. All these instructions
need an operand being a memory location.
need an operand being a memory location. For each of these instructions there
exist two additional mnemonics that allow precise selection of the type of the
operation. The "fstenvw", "fnstenvw", "fldenvw", "fsavew", "fnsavew" and
"frstorw" mnemonics force the instruction to perform operation as in the 16-bit
mode, while "fstenvd", "fnstenvd", "fldenvd", "fsaved", "fnsaved" and "frstord"
force the operation as in 32-bit mode.
"finit" and "fninit" set the FPU operating environment into its default
state. "finit" checks for pending unmasked FPU exception before proceeding,
"fninit" does not. "fclex" and "fnclex" clear the FPU exception flags in the
1573,17 → 1599,17
"psubsb" and "psubsw" perform the addition or subtraction of packed bytes
or packed words with the signed saturation. "paddusb", "paddusw", "psubusb",
"psubusw" are analogous, but with unsigned saturation. "pmulhw" and "pmullw"
perform a signed multiply of the packed words and store the high or low words
of the results in the destination operand. "pmaddwd" performs a multiply of
the packed words and adds the four intermediate double word products in pairs
to produce result as packed double words. "pand", "por" and "pxor" perform
the logical operations on the quad words, "pandn" also performs a logical
negation of the destination operand before performing the "and" operation.
"pcmpeqb", "pcmpeqw" and "pcmpeqd" compare for equality of packed bytes,
packed words or packed double words. If a pair of data elements is equal, the
corresponding data element in the destination operand is filled with bits of
value 1, otherwise it's set to 0. "pcmpgtb", "pcmpgtw" and "pcmpgtd" perform
the similar operation, but they check whether the data elements in the
perform a signed multiplication of the packed words and store the high or low
words of the results in the destination operand. "pmaddwd" performs a multiply
of the packed words and adds the four intermediate double word products in
pairs to produce result as packed double words. "pand", "por" and "pxor"
perform the logical operations on the quad words, "pandn" also performs a
logical negation of the destination operand before performing the "and"
operation. "pcmpeqb", "pcmpeqw" and "pcmpeqd" compare for equality of packed
bytes, packed words or packed double words. If a pair of data elements is
equal, the corresponding data element in the destination operand is filled with
bits of value 1, otherwise it's set to 0. "pcmpgtb", "pcmpgtw" and "pcmpgtd"
perform the similar operation, but they check whether the data elements in the
destination operand are greater than the corresponding data elements in the
source operand. "packsswb" converts packed signed words into packed signed
bytes, "packssdw" converts packed signed double words into packed signed
1699,18 → 1725,18
cmpltss xmm0,[ebx] ; compare single precision values
 
Table 2.3 SSE conditions
┌──────┬──────────┬─────────────────────────┐
│ Code │ Mnemonic │ Description             │
╞══════╪══════════╪═════════════════════════╡
│ 0    │ eq       │ equal                   │
│ 1    │ lt       │ less than               │
│ 2    │ le       │ less than or equal      │
│ 3    │ unord    │ unordered               │
│ 4    │ neq      │ not equal               │
│ 5    │ nlt      │ not less than           │
│ 6    │ nle      │ not less than nor equal │
│ 7    │ ord      │ ordered                 │
└──────┴──────────┴─────────────────────────┘
/-------------------------------------------\
| Code | Mnemonic | Description             |
|======|==========|=========================|
| 0    | eq       | equal                   |
| 1    | lt       | less than               |
| 2    | le       | less than or equal      |
| 3    | unord    | unordered               |
| 4    | neq      | not equal               |
| 5    | nlt      | not less than           |
| 6    | nle      | not less than nor equal |
| 7    | ord      | ordered                 |
\-------------------------------------------/
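
For instance, going by the codes in table 2.3, the two lines below should be
equivalent ways of writing the same comparison. This is only a sketch with
arbitrarily chosen registers:

    cmpps xmm0,xmm1,1       ; condition given as the code in third operand
    cmpltps xmm0,xmm1       ; the same, with "lt" mnemonic instead of code 1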
 
"comiss" and "ucomiss" compare the single precision values and set the ZF,
PF and CF flags to show the result. The destination operand must be a SSE
1771,8 → 1797,8
 
"pextrw" copies the word in the source operand specified by the third
operand to the destination operand. The source operand must be a MMX register,
the destination operand must be a 32-bit general register (but only the low
word of it is affected), the third operand must be an 8-bit immediate value.
the destination operand must be a 32-bit general register (the high word of
the destination is cleared), the third operand must be an 8-bit immediate
value.
 
pextrw eax,mm0,1 ; extract word into eax
 
1788,12 → 1814,12
return the maximum values of packed unsigned bytes, "pminub" returns the
minimum values of packed unsigned bytes, "pmaxsw" returns the maximum values
of packed signed words, "pminsw" returns the minimum values of packed signed
words. "pmulhuw" performs an unsigned multiply of the packed words and stores
the high words of the results in the destination operand. "psadbw" computes
the absolute differences of packed unsigned bytes, sums the differences, and
stores the sum in the low word of destination operand. All these instructions
follow the same rules for operands as the general MMX operations described in
previous section.
words. "pmulhuw" performs an unsigned multiplication of the packed words and
stores the high words of the results in the destination operand. "psadbw"
computes the absolute differences of packed unsigned bytes, sums the
differences, and stores the sum in the low word of destination operand. All
these instructions follow the same rules for operands as the general MMX
operations described in previous section.
"pmovmskb" creates a mask made of the most significant bit of each byte in
the source operand and stores the result in the low byte of destination
operand. The source operand must be a MMX register, the destination operand
1922,10 → 1948,11
point values to packed two double word integers, storing the result in the low
quad word of the destination operand. "cvtdq2ps" converts packed four
double word integers to packed single precision floating point values.
"cvtdq2pd" converts packed two double word integers from the low quad word
of the source operand to packed double precision floating point values.
For all these instruction destination operand must be a SSE register, the
source operand can be a 128-bit memory location or SSE register.
"cvtdq2pd" converts packed two double word integers from the source operand to
packed double precision floating point values, the source can be a 64-bit
memory location or SSE register, destination has to be SSE register.
"movdqa" and "movdqu" transfer a double quad word operand containing packed
integers from source operand to destination operand. At least one of the
operands has to be a SSE register, the second one can be also a SSE register
1943,7 → 1970,7
mnemonics starting with "p") are extended to operate on 128-bit packed
integers located in SSE registers. Additional syntax for these instructions
needs an SSE register where MMX register was needed, and the 128-bit memory
location or SSE register where 64-bit memory location of MMX register were
location or SSE register where 64-bit memory location or MMX register were
needed. The exception is "pshufw" instruction, which doesn't allow extended
syntax, but has two new variants: "pshufhw" and "pshuflw", which allow only
the extended syntax, and perform the same operation as "pshufw" on the high
1955,12 → 1982,12
pextrw eax,xmm0,7 ; extract highest word into eax
 
"paddq" performs the addition of packed quad words, "psubq" performs the
subtraction of packed quad words, "pmuludq" performs an unsigned multiply
of low double words from each corresponding quad words and returns the results
in packed quad words. These instructions follow the same rules for operands as
the general MMX operations described in 2.1.14.
subtraction of packed quad words, "pmuludq" performs an unsigned
multiplication of low double words from each corresponding quad words and
returns the results in packed quad words. These instructions follow the same
rules for operands as the general MMX operations described in 2.1.14.
"pslldq" and "psrldq" perform logical shift left or right of the double
quad word in the destination operand by the amount of bits specified in the
quad word in the destination operand by the amount of bytes specified in the
source operand. The destination operand should be a SSE register, source
operand should be an 8-bit immediate value.
"punpckhqdq" interleaves the high quad word of the source operand and the
2007,10 → 2034,10
"movddup" loads the 64-bit source value and duplicates it into high and low
quad word of the destination operand. The destination operand should be SSE
register, the source operand can be SSE register or 64-bit memory location.
"lddqu" is functionally equivalent to "movdqu" instruction with memory as
source operand, but it may improve performance when the source operand crosses
a cacheline boundary. The destination operand has to be SSE register, the
source operand must be 128-bit memory location.
"lddqu" is functionally equivalent to "movdqu" with memory as source
operand, but it may improve performance when the source operand crosses a
cacheline boundary. The destination operand has to be SSE register, the source
operand must be 128-bit memory location.
"addsubps" performs single precision addition of second and fourth pairs and
single precision subtraction of the first and third pairs of floating point
values in the operands. "addsubpd" performs double precision addition of the
2030,6 → 2057,44
waits for a write-back store to the address range set up by the "monitor"
instruction. It uses two operands with additional parameters, first being the
EAX and second the ECX register.
The functionality of SSE3 is further extended by the set of Supplemental
SSE3 instructions (SSSE3). They generally follow the same rules for operands
as all the MMX operations extended by SSE.
"phaddw" and "phaddd" perform the horizontal addition of the pairs of
adjacent values from both the source and destination operand, and store the
sums into the destination (sums from the destination operand go into lower part
of destination register, sums from the source operand into the upper part).
They operate on 16-bit or 32-bit chunks, respectively. "phaddsw" performs the
same operation on signed 16-bit packed values, but the result of each addition
is saturated. "phsubw" and "phsubd" analogously perform the horizontal
subtraction of 16-bit or 32-bit packed values, and "phsubsw" performs the
horizontal subtraction of signed 16-bit packed values with saturation.
"pabsb", "pabsw" and "pabsd" calculate the absolute value of each packed
signed value in source operand and store them into the destination register.
They operate on 8-bit, 16-bit and 32-bit elements respectively.
"pmaddubsw" multiplies signed 8-bit values from the source operand with the
corresponding unsigned 8-bit values from the destination operand to produce
intermediate 16-bit values, and every adjacent pair of those intermediate
values is then added horizontally and those 16-bit sums are stored into the
destination operand.
"pmulhrsw" multiplies corresponding 16-bit integers from the source and
destination operand to produce intermediate 32-bit values, and the 16 bits
next to the highest bit of each of those values are then rounded and packed
into the destination operand.
"pshufb" shuffles the bytes in the destination operand according to the
mask provided by source operand - each of the bytes in source operand is
an index of the byte from the original destination value that gets copied into
the corresponding position (a mask byte with the highest bit set zeroes that
position instead).
"psignb", "psignw" and "psignd" perform the operation on 8-bit, 16-bit or
32-bit integers in destination operand, depending on the signs of the values
in the source. If the value in source is negative, the corresponding value in
the destination register is negated, if the value in source is positive, no
operation is performed on the corresponding value, and if the value in source
is zero, the value in destination is zeroed, too.
"palignr" appends the source operand to the destination operand to form the
intermediate value of twice the size, and then extracts into the destination
register the 64 or 128 bits that are right-aligned to the byte offset
specified by the third operand, which should be an 8-bit immediate value. This
is the only SSSE3 instruction that takes three arguments.
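A few illustrative forms of the SSSE3 instructions follow. This is only a
sketch, the register and memory operands are chosen arbitrarily:

    phaddw xmm0,xmm1        ; horizontal addition of packed 16-bit values
    pabsd mm0,[ebx]         ; absolute values of double words from memory
    pshufb xmm0,xmm2        ; shuffle bytes of xmm0 with the mask in xmm2
    psignw xmm0,xmm1        ; apply signs of words in xmm1 to words in xmm0
    palignr xmm0,xmm1,4     ; third operand is the byte offset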
 
 
2.1.18 AMD 3DNow! instructions
2040,9 → 2105,9
These instructions follow the same rules as the general MMX operations, the
destination operand should be a MMX register, the source operand can be a MMX
register or 64-bit memory location. "pavgusb" computes the rounded averages
of packed unsigned bytes. "pmulhrw" performs a signed multiply of the packed
words, rounds the high word of each double word result and stores them in the
destination operand. "pi2fd" converts packed double word integers into
of packed unsigned bytes. "pmulhrw" performs a signed multiplication of the
packed words, rounds the high word of each double word result and stores them
in the destination operand. "pi2fd" converts packed double word integers into
packed floating point values. "pf2id" converts packed floating point values
into packed double word integers using truncation. "pi2fw" converts packed
word integers into packed floating point values, only low words of each
2106,28 → 2171,28
instruction with any of the new registers.
 
Table 2.4 New registers in long mode
┌──────┬───────────────────────────┬───────┐
│ Type │ General                   │ SSE   │
├──────┼──────┬──────┬──────┬──────┼───────┤
│ Bits │ 8    │ 16   │ 32   │ 64   │ 128   │
╞══════╪══════╪══════╪══════╪══════╪═══════╡
│      │      │      │      │ rax  │       │
│      │      │      │      │ rcx  │       │
│      │      │      │      │ rdx  │       │
│      │      │      │      │ rbx  │       │
│      │ spl  │      │      │ rsp  │       │
│      │ bpl  │      │      │ rbp  │       │
│      │ sil  │      │      │ rsi  │       │
│      │ dil  │      │      │ rdi  │       │
│      │ r8b  │ r8w  │ r8d  │ r8   │ xmm8  │
│      │ r9b  │ r9w  │ r9d  │ r9   │ xmm9  │
│      │ r10b │ r10w │ r10d │ r10  │ xmm10 │
│      │ r11b │ r11w │ r11d │ r11  │ xmm11 │
│      │ r12b │ r12w │ r12d │ r12  │ xmm12 │
│      │ r13b │ r13w │ r13d │ r13  │ xmm13 │
│      │ r14b │ r14w │ r14d │ r14  │ xmm14 │
│      │ r15b │ r15w │ r15d │ r15  │ xmm15 │
└──────┴──────┴──────┴──────┴──────┴───────┘
/--------------------------------------------------\
| Type |          General          |  SSE  |  AVX  |
|------|---------------------------|-------|-------|
| Bits |   8  |  16  |  32  |  64  |  128  |  256  |
|======|======|======|======|======|=======|=======|
|      |      |      |      | rax  |       |       |
|      |      |      |      | rcx  |       |       |
|      |      |      |      | rdx  |       |       |
|      |      |      |      | rbx  |       |       |
|      | spl  |      |      | rsp  |       |       |
|      | bpl  |      |      | rbp  |       |       |
|      | sil  |      |      | rsi  |       |       |
|      | dil  |      |      | rdi  |       |       |
|      | r8b  | r8w  | r8d  | r8   | xmm8  | ymm8  |
|      | r9b  | r9w  | r9d  | r9   | xmm9  | ymm9  |
|      | r10b | r10w | r10d | r10  | xmm10 | ymm10 |
|      | r11b | r11w | r11d | r11  | xmm11 | ymm11 |
|      | r12b | r12w | r12d | r12  | xmm12 | ymm12 |
|      | r13b | r13w | r13d | r13  | xmm13 | ymm13 |
|      | r14b | r14w | r14d | r14  | xmm14 | ymm14 |
|      | r15b | r15w | r15d | r15  | xmm15 | ymm15 |
\--------------------------------------------------/
 
In general any instruction from x86 architecture, which allowed 16-bit or
32-bit operand sizes, in long mode allows also the 64-bit operands. The 64-bit
2165,30 → 2230,30
the upper 32 bits of the 64-bit registers containing them are filled with
zeros. This is unlike the operations on 16-bit or 8-bit portions of those
registers, which preserve the upper bits.
Three new type conversion instructions are available. The "cdqe" sign extends
the double word in EAX into quad word and stores the result in RAX register.
"cqo" sign extends the quad word in RAX into double quad word and stores the
extra bits in the RDX register. These instructions have no operands. "movsxd"
sign extends the double word source operand, being either the 32-bit register
or memory, into 64-bit destination operand, which has to be register.
No analogous instruction is needed for the zero extension, since it is done
automatically by any operations on 32-bit registers, as noted in previous
paragraph. And the "movzx" and "movsx" instructions, conforming to the general
rule, can be used with 64-bit destination operand, allowing extension of byte
or word values into quad words.
All the binary arithmetic and logical instruction are promoted to allow
64-bit operands in long mode. The use of decimal arithmetic instructions in
long mode is prohibited.
Three new type conversion instructions are available. The "cdqe" sign
extends the double word in EAX into quad word and stores the result in RAX
register. "cqo" sign extends the quad word in RAX into double quad word and
stores the extra bits in the RDX register. These instructions have no
operands. "movsxd" sign extends the double word source operand, being either
the 32-bit register or memory, into 64-bit destination operand, which has to
be register. No analogous instruction is needed for the zero extension, since
it is done automatically by any operations on 32-bit registers, as noted in
previous paragraph. And the "movzx" and "movsx" instructions, conforming to
the general rule, can be used with 64-bit destination operand, allowing
extension of byte or word values into quad words.
All the binary arithmetic and logical instructions have been promoted to
allow 64-bit operands in long mode. The use of decimal arithmetic instructions
in long mode is prohibited.
The stack operations, like "push" and "pop" in long mode default to 64-bit
operands and it's not possible to use 32-bit operands with them. The "pusha"
and "popa" are disallowed in long mode.
The indirect near jumps and calls in long mode default to 64-bit operands and
it's not possible to use the 32-bit operands with them. On the other hand, the
indirect far jumps and calls allow any operands that were allowed by the x86
architecture and also 80-bit memory operand is allowed (though only EM64T seems
to implement such variant), with the first eight bytes defining the offset and
two last bytes specifying the selector. The direct far jumps and calls are not
allowed in long mode.
The indirect near jumps and calls in long mode default to 64-bit operands
and it's not possible to use the 32-bit operands with them. On the other hand,
the indirect far jumps and calls allow any operands that were allowed by the
x86 architecture and also 80-bit memory operand is allowed (though only EM64T
seems to implement such variant), with the first eight bytes defining the
offset and two last bytes specifying the selector. The direct far jumps and
calls are not allowed in long mode.
The I/O instructions, "in", "out", "ins" and "outs" are the exceptional
instructions that are not extended to accept quad word operands in long mode.
But all other string operations are, and there are new short forms "movsq",
2203,13 → 2268,990
The "cmpxchg16b" is the 64-bit equivalent of "cmpxchg8b" instruction, it uses
the double quad word memory operand and 64-bit registers to perform the
analogous operation.
The "fxsave64" and "fxrstor64" are new variants of "fxsave" and "fxrstor"
instructions, available only in long mode, which use a different format of
storage area in order to store some pointers in full 64-bit size.
"swapgs" is the new instruction, which swaps the contents of GS register and
the KernelGSbase model-specific register (MSR address 0C0000102h).
"syscall" and "sysret" is the pair of new instructions that provide the
functionality similar to "sysenter" and "sysexit" in long mode, where the
latter pair is disallowed.
latter pair is disallowed. The "sysexitq" and "sysretq" mnemonics provide the
64-bit versions of "sysexit" and "sysret" instructions.
The "rdmsrq" and "wrmsrq" mnemonics are the 64-bit variants of the "rdmsr"
and "wrmsr" instructions.
 
 
2.1.20 SSE4 instructions
 
There are actually three different sets of instructions under the name SSE4.
Intel designed two of them, SSE4.1 and SSE4.2, with latter extending the
former into the full Intel's SSE4 set. On the other hand, the implementation
by AMD includes only a few instructions from this set, but also contains
some additional instructions, that are called the SSE4a set.
The SSE4.1 instructions mostly follow the same rules for operands, as
the basic SSE operations, so they require destination operand to be SSE
register and source operand to be 128-bit memory location or SSE register,
and some operations require a third operand, the 8-bit immediate value.
"pmulld" performs a signed multiplication of the packed double words and
stores the low double words of the results in the destination operand.
"pmuldq" performs two signed multiplications of the corresponding double
words in the lower quad words of operands, and stores the results as
packed quad words into the destination register. "pminsb" and "pmaxsb"
return the minimum or maximum values of packed signed bytes, "pminuw" and
"pmaxuw" return the minimum or maximum values of packed unsigned words,
"pminud", "pmaxud", "pminsd" and "pmaxsd" return minimum or maximum values
of packed unsigned or signed double words. These instructions complement the
instructions computing packed minimum or maximum introduced by SSE.
"ptest" sets the ZF flag to one when the result of bitwise AND of the
two operands is zero, and zeroes the ZF otherwise. It also sets CF flag
to one, when the result of bitwise AND of the source operand with the
bitwise NOT of the destination operand is zero, and zeroes the CF otherwise.
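
A possible use of "ptest" is sketched below, assuming that a bit mask has been
prepared in XMM1 and that "none_set" is a hypothetical label defined elsewhere:

    ptest xmm0,xmm1         ; ZF = 1 when xmm0 and xmm1 share no set bits
    jz none_set             ; taken when no bit selected by mask is set
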
"pcmpeqq" compares packed quad words for equality, and fills the
corresponding elements of destination operand with either ones or zeros,
depending on the result of comparison.
"packusdw" converts packed signed double words from both the source and
destination operand into the unsigned words using saturation, and stores
the eight resulting word values into the destination register.
"phminposuw" finds the minimum unsigned word value in source operand and
places it into the lowest word of destination operand, setting the remaining
upper bits of destination to zero.
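
For instance, the smallest of eight unsigned words could be brought into a
general purpose register with a sequence like this one (a sketch only, the
registers are chosen arbitrarily):

    phminposuw xmm0,xmm1    ; low word of xmm0 = smallest word of xmm1
    pextrw eax,xmm0,0       ; copy that minimum into eax
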
"roundps", "roundss", "roundpd" and "roundsd" perform the rounding of packed
or individual floating point value of single or double precision, using the
rounding mode specified by the third operand.
 
roundsd xmm0,xmm1,0011b ; round toward zero
 
"dpps" calculates dot product of packed single precision floating point
values, that is it multiplies the corresponding pairs of values from source and
destination operand and then sums the products up. The high four bits of the
8-bit immediate third operand control which products are calculated and taken
to the sum, and the low four bits control, into which elements of destination
the resulting dot product is copied (the other elements are filled with zero).
"dppd" calculates dot product of packed double precision floating point values.
The bits 4 and 5 of third operand control, which products are calculated and
added, and bits 0 and 1 of this value control, which elements in destination
register should get filled with the result. "mpsadbw" calculates multiple sums
of absolute differences of unsigned bytes. The third operand controls, with
value in bits 0-1, which of the four-byte blocks in source operand is taken to
calculate the absolute differences, and with value in bit 2, at which of the
first two four-byte blocks in destination operand to start calculating multiple
sums. The first sum is calculated from four absolute differences between the
corresponding unsigned bytes in the source and destination block, and each next
sum is calculated in the same way, but taking the four bytes from destination
at the position one byte after the position of previous block. The four bytes
from the source stay the same each time. This way eight sums of absolute
differences are calculated and stored as packed word values into the
destination operand. The instructions described in this paragraph follow the
same rules for operands, as "roundps" instruction.
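
The immediate masks of these instructions may be easier to follow with a few
sample values (chosen here purely for illustration):

    dpps xmm0,xmm1,01110001b ; sum products of elements 0-2 into element 0
    dppd xmm0,xmm1,00110011b ; sum both products, store in both elements
    mpsadbw xmm0,xmm1,000b   ; first source block, sums start at first
                             ; block of destination
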
"blendps", "blendvps", "blendpd" and "blendvpd" conditionally copy the
values from source operand into the destination operand, depending on the bits
of the mask provided by third operand. If a mask bit is set, the corresponding
element of source is copied into the same place in destination, otherwise this
position in destination is left unchanged. The rules for the first two operands
are the same, as for general SSE instructions. "blendps" and "blendpd" need
third operand to be 8-bit immediate, and they operate on single or double
precision values, respectively. "blendvps" and "blendvpd" require third operand
to be the XMM0 register.
 
blendvps xmm3,xmm7,xmm0 ; blend according to mask
 
"pblendw" conditionally copies word elements from the source operand into the
destination, depending on the bits of mask provided by third operand, which
needs to be 8-bit immediate value. "pblendvb" conditionally copies byte
elements from the source operand into destination, depending on mask defined
by the third operand, which has to be XMM0 register. These instructions follow
the same rules for operands as "blendps" and "blendvps" instructions,
respectively.
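
For example (the mask value and registers below are arbitrary):

    pblendw xmm0,xmm1,00001111b ; take four low words from xmm1
    pblendvb xmm1,xmm2,xmm0     ; take bytes where highest bit of the
                                ; corresponding byte in xmm0 is set
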
"insertps" inserts a single precision floating point value taken from the
position in source operand specified by bits 6-7 of third operand into location
in destination register selected by bits 4-5 of third operand. Additionally,
the low four bits of third operand control, which elements in destination
register will be set to zero. The first two operands follow the same rules as
for the general SSE operation, the third operand should be 8-bit immediate.
"extractps" extracts a single precision floating point value taken from the
location in source operand specified by low two bits of third operand, and
stores it into the destination operand. The destination can be a 32-bit memory
value or general purpose register, the source operand must be SSE register,
and the third operand should be 8-bit immediate value.
 
extractps edx,xmm3,3 ; extract the highest value
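
A comparable sketch for "insertps", with an immediate value chosen only for
illustration (source position 0, destination position 1, nothing zeroed):

    insertps xmm0,xmm1,00010000b ; element 0 of xmm1 into element 1 of xmm0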
 
"pinsrb", "pinsrd" and "pinsrq" copy a byte, double word or quad word from
the source operand into the location of destination operand determined by the
third operand. The destination operand has to be SSE register, the source
operand can be a memory location of appropriate size, or the 32-bit general
purpose register (but 64-bit general purpose register for "pinsrq", which is
only available in long mode), and the third operand has to be 8-bit immediate
value. These instructions complement the "pinsrw" instruction operating on SSE
register destination, which was introduced by SSE2.
 
pinsrd xmm4,eax,1 ; insert double word into second position
 
"pextrb", "pextrw", "pextrd" and "pextrq" copy a byte, word, double word or
quad word from the location in source operand specified by third operand, into
the destination. The source operand should be SSE register, the third operand
should be 8-bit immediate, and the destination operand can be memory location
of appropriate size, or the 32-bit general purpose register (but 64-bit general
purpose register for "pextrq", which is only available in long mode). The
"pextrw" instruction with SSE register as source was already introduced by
SSE2, but SSE4 extends it to allow memory operand as destination.
 
pextrw [ebx],xmm3,7 ; extract highest word into memory
 
"pmovsxbw" and "pmovzxbw" perform sign extension or zero extension of eight
byte values from the source operand into packed word values in destination
operand, which has to be SSE register. The source can be 64-bit memory or SSE
register - when it is register, only its low portion is used. "pmovsxbd" and
"pmovzxbd" perform sign extension or zero extension of the four byte values
from the source operand into packed double word values in destination operand,
the source can be 32-bit memory or SSE register. "pmovsxbq" and "pmovzxbq"
perform sign extension or zero extension of the two byte values from the
source operand into packed quad word values in destination operand, the source
can be 16-bit memory or SSE register. "pmovsxwd" and "pmovzxwd" perform sign
extension or zero extension of the four word values from the source operand
into packed double words in destination operand, the source can be 64-bit
memory or SSE register. "pmovsxwq" and "pmovzxwq" perform sign extension or
zero extension of the two word values from the source operand into packed quad
words in destination operand, the source can be 32-bit memory or SSE register.
"pmovsxdq" and "pmovzxdq" perform sign extension or zero extension of the two
double word values from the source operand into packed quad words in
destination operand, the source can be 64-bit memory or SSE register.
 
pmovzxbq xmm0,word [si] ; zero-extend bytes to quad words
pmovsxwq xmm0,xmm1 ; sign-extend words to quad words
 
"movntdqa" loads double quad word from the source operand to the destination
using a non-temporal hint. The destination operand should be SSE register,
and the source operand should be 128-bit memory location.
The SSE4.2, described below, adds not only some new operations on SSE
registers, but also introduces some completely new instructions operating on
general purpose registers only.
"pcmpistri" compares two zero-ended (implicit length) strings provided in
its source and destination operand and generates an index stored to ECX;
"pcmpistrm" performs the same comparison and generates a mask stored to XMM0.
"pcmpestri" compares two strings of explicit lengths, with length provided
in EAX for the destination operand and in EDX for the source operand, and
generates an index stored to ECX; "pcmpestrm" performs the same comparison
and generates a mask stored to XMM0. The source and destination operand follow
the same rules as for general SSE instructions, the third operand should be
8-bit immediate value determining the details of performed operation - refer to
Intel documentation for information on those details.
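
As a sketch only: assuming that ESI points to at least 16 readable bytes, that
"found_zero" is a hypothetical label, and taking from Intel documentation the
immediate value selecting the "equal each" comparison of unsigned bytes, the
position of a terminating zero within those 16 bytes could be found like this:

    pxor xmm0,xmm0             ; empty string as the first operand
    pcmpistri xmm0,[esi],1000b ; ecx = index of first zero byte, or 16
    jz found_zero              ; ZF = 1 when the 16 bytes contain a zero
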
"pcmpgtq" compares packed quad words, and fills the corresponding elements of
destination operand with either ones or zeros, depending on whether the value
in destination is greater than the one in source, or not. This instruction
follows the same rules for operands as "pcmpeqq".
"crc32" accumulates a CRC32 value for the source operand starting with
initial value provided by destination operand, and stores the result in
destination. Unless in long mode, the destination operand should be a 32-bit
general purpose register, and the source operand can be a byte, word, or double
word register or memory location. In long mode the destination operand can
also be a 64-bit general purpose register, and the source operand in such case
can be a byte or quad word register or memory location.
 
crc32 eax,dl ; accumulate CRC32 on byte value
crc32 eax,word [ebx] ; accumulate CRC32 on word value
crc32 rax,qword [rbx] ; accumulate CRC32 on quad word value
 
"popcnt" calculates the number of bits set in the source operand, which can
be 16-bit, 32-bit, or 64-bit general purpose register or memory location,
and stores this count in the destination operand, which has to be register of
the same size as source operand. The 64-bit variant is available only in long
mode.
 
popcnt ecx,eax ; count bits set to 1
 
The SSE4a extension, which also includes the "popcnt" instruction introduced
by SSE4.2, at the same time adds the "lzcnt" instruction, which follows the
same syntax, and calculates the count of leading zero bits in source operand
(if the source operand is all zero bits, the total number of bits in source
operand is stored in destination).
"extrq" extracts the sequence of bits from the low quad word of SSE register
provided as first operand and stores them at the low end of this register,
filling the remaining bits in the low quad word with zeros. The position of bit
string and its length can either be provided with two 8-bit immediate values
as second and third operand, or by SSE register as second operand (and there
is no third operand in such case), which should contain position value in bits
8-13 and length of bit string in bits 0-5.
 
extrq xmm0,8,7 ; extract 8 bits from position 7
extrq xmm0,xmm5 ; extract bits defined by register
 
"insertq" writes the sequence of bits from the low quad word of the source
operand into specified position in low quad word of the destination operand,
leaving the other bits in low quad word of destination intact. The position
where bits should be written and the length of bit string can either be
provided with two 8-bit immediate values as third and fourth operand, or by
the bit fields in source operand (and there are only two operands in such
case), which should contain position value in bits 72-77 and length of bit
string in bits 64-69.
 
insertq xmm1,xmm0,4,2 ; insert 4 bits at position 2
insertq xmm1,xmm0 ; insert bits defined by register
 
"movntss" and "movntsd" store single or double precision floating point
value from the source SSE register into 32-bit or 64-bit destination memory
location respectively, using non-temporal hint.
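
For example, assuming that EDI points to a writable buffer:

    movntss [edi],xmm0      ; store low 32-bit float with non-temporal hint
    movntsd [edi+8],xmm1    ; store low 64-bit float with non-temporal hint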
 
 
2.1.21 AVX instructions
 
The Advanced Vector Extensions introduce instructions that are new variants
of SSE instructions, with new scheme of encoding that allows extended syntax
having a destination operand separate from all the source operands. It also
introduces 256-bit AVX registers, which extend up the old 128-bit SSE
registers. Any AVX instruction that puts some result into SSE register, puts
zero bits into high portion of the AVX register containing it.
The AVX version of SSE instruction has the mnemonic obtained by prepending
SSE instruction name with "v". For any SSE arithmetic instruction which had a
destination operand also being used as one of the source values, the AVX
variant has a new syntax with three operands - the destination and two sources.
The destination and first source can be SSE registers, and second source can be
SSE register or memory. If the operation is performed on single pair of values,
the remaining bits of first source SSE register are copied into the
destination register.

vsubss xmm0,xmm2,xmm3 ; subtract two 32-bit floats
vmulsd xmm0,xmm7,qword [esi] ; multiply two 64-bit floats
 
In case of packed operations, each instruction can also operate on the 256-bit
data size when the AVX registers are specified instead of SSE registers, and
the size of memory operand is also doubled then.
 
vaddps ymm1,ymm5,yword [esi] ; eight sums of 32-bit float pairs
 
The instructions that operate on packed integer types (in particular the ones
that earlier had been promoted from MMX to SSE) also acquired the new syntax
with three operands, however they are only allowed to operate on 128-bit
packed types and thus cannot use the whole AVX registers.
 
vpavgw xmm3,xmm0,xmm2 ; average of 16-bit integers
vpslld xmm1,xmm0,1 ; shift double words left

If the SSE version of instruction had a syntax with three operands, the third
one being an immediate value, the AVX version of such instruction takes four
operands, with immediate remaining the last one.
 
vshufpd ymm0,ymm1,ymm2,10010011b ; shuffle 64-bit floats
vpalignr xmm0,xmm4,xmm2,3 ; extract byte aligned value

The promotion to new syntax according to the rules described above has been
applied to all the instructions from SSE extensions up to SSE4, with the
exceptions described below.
"vdppd" instruction has syntax extended to four operands, but it does not
have a 256-bit version.
There are a few instructions, namely "vsqrtpd", "vsqrtps", "vrcpps" and
"vrsqrtps", which can operate on 256-bit data size, but retained the syntax
with only two operands, because they use data from only one source:

vsqrtpd ymm1,ymm0 ; put square roots into other register
 
In a similar way "vroundpd" and "vroundps" retained the syntax with three
operands, the last one being immediate value.
 
vroundps ymm0,ymm1,0011b ; round toward zero

Also some of the operations on packed integers kept their two-operand or
three-operand syntax while being promoted to AVX version. In such case these
instructions follow exactly the same rules for operands as their SSE
counterparts (since operations on packed integers do not have 256-bit variants
in AVX extension). These include "vpcmpestri", "vpcmpestrm", "vpcmpistri",
"vpcmpistrm", "vphminposuw", "vpshufd", "vpshufhw", "vpshuflw". And there are
more instructions that in AVX versions keep exactly the same syntax for
operands as the one from SSE, without any additional options: "vcomiss",
"vcomisd", "vcvtss2si", "vcvtsd2si", "vcvttss2si", "vcvttsd2si", "vextractps",
"vpextrb", "vpextrw", "vpextrd", "vpextrq", "vmovd", "vmovq", "vmovntdqa",
"vmaskmovdqu", "vpmovmskb", "vpmovsxbw", "vpmovsxbd", "vpmovsxbq", "vpmovsxwd",
"vpmovsxwq", "vpmovsxdq", "vpmovzxbw", "vpmovzxbd", "vpmovzxbq", "vpmovzxwd",
"vpmovzxwq" and "vpmovzxdq".
The move and conversion instructions have mostly been promoted to allow
256-bit size operands in addition to the 128-bit variant with syntax identical
to that from SSE version of the same instruction. Each of the "vcvtdq2ps",
"vcvtps2dq" and "vcvttps2dq", "vmovaps", "vmovapd", "vmovups", "vmovupd",
"vmovdqa", "vmovdqu", "vlddqu", "vmovntps", "vmovntpd", "vmovntdq",
"vmovsldup", "vmovshdup", "vmovmskps" and "vmovmskpd" inherits the 128-bit
syntax from SSE without any changes, and also allows a new form with 256-bit
operands in place of 128-bit ones.
 
vmovups [edi],ymm6 ; store unaligned 256-bit data

"vmovddup" has the identical 128-bit syntax as its SSE version, and it also
has a 256-bit version, which stores the duplicates of the lowest quad word
from the source operand in the lower half of destination operand, and in the
upper half of destination the duplicates of the low quad word from the upper
half of source. Both source and destination operands need then to be 256-bit
values.
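
Both forms might look as follows (operands chosen arbitrarily):

    vmovddup xmm0,qword [esi]   ; two copies of 64-bit value from memory
    vmovddup ymm0,ymm1          ; duplicate low quad word of each half
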
"vmovlhps" and "vmovhlps" have only 128-bit versions, and each takes three
operands, which all must be SSE registers. "vmovlhps" copies two single
precision values from the low quad word of second source register to the high
quad word of destination register, and copies the low quad word of first
source register into the low quad word of destination register. "vmovhlps"
copies two single precision values from the high quad word of second source
register to the low quad word of destination register, and copies the high
quad word of first source register into the high quad word of destination
register.
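
A sketch of both instructions, with registers chosen arbitrarily and the
resulting quad words listed from low to high:

    vmovlhps xmm0,xmm1,xmm2 ; low half of xmm1, then low half of xmm2
    vmovhlps xmm0,xmm1,xmm2 ; high half of xmm2, then high half of xmm1
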
"vmovlps", "vmovhps", "vmovlpd" and "vmovhpd" have only 128-bit versions and
their syntax varies depending on whether memory operand is a destination or
source. When memory is destination, the syntax is identical to the one of
equivalent SSE instruction, and when memory is source, the instruction requires
three operands, first two being SSE registers and the third one 64-bit memory.
The value put into destination is then the value copied from first source with
either low or high quad word replaced with value from second source (the
memory operand).
 
vmovhps [esi],xmm7 ; store upper half to memory
vmovlps xmm0,xmm7,[ebx] ; low from memory, rest from register

"vmovss" and "vmovsd" have syntax identical to their SSE equivalents as long
as one of the operands is memory, while the versions that operate purely on
registers require three operands (each being SSE register). The value stored
in destination is then the value copied from first source with lowest data
element replaced with the lowest value from second source.
 
vmovss xmm3,[edi] ; low from memory, rest zeroed
vmovss xmm0,xmm1,xmm2 ; one value from xmm2, three from xmm1

"vcvtss2sd", "vcvtsd2ss", "vcvtsi2ss" and "vcvtsi2sd" use the three-operand
syntax, where destination and first source are always SSE registers, and the
second source follows the same rules as the source in syntax of equivalent
SSE instruction. The value stored in destination is then the value copied from
first source with lowest data element replaced with the result of conversion.
 
vcvtsi2sd xmm4,xmm4,ecx ; 32-bit integer to 64-bit float
vcvtsi2ss xmm0,xmm0,rax ; 64-bit integer to 32-bit float
 
"vcvtdq2pd" and "vcvtps2pd" allow the same syntax as their SSE equivalents,
plus the new variants with AVX register as destination and SSE register or
128-bit memory as source. Analogously "vcvtpd2dq", "vcvttpd2dq" and
"vcvtpd2ps", in addition to variant with syntax identical to SSE version,
allow a variant with SSE register as destination and AVX register or 256-bit
memory as source.
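
For example (the size operators for memory operands are written out here for
clarity, and the registers are chosen arbitrarily):

    vcvtps2pd ymm0,xmm1         ; four floats to four doubles
    vcvtdq2pd ymm2,xword [esi]  ; four integers to four doubles
    vcvtpd2ps xmm0,ymm1         ; four doubles to four floats
    vcvtpd2dq xmm3,yword [esi]  ; four doubles to four integers
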
"vinsertps", "vpinsrb", "vpinsrw", "vpinsrd", "vpinsrq" and "vpblendw" use
a syntax with four operands, where destination and first source have to be SSE
registers, and the third and fourth operand follow the same rules as second
and third operand in the syntax of equivalent SSE instruction. Value stored in
destination is the value copied from first source with some data elements
replaced with values extracted from the second source, analogously to the
operation of corresponding SSE instruction.

vpinsrd xmm0,xmm0,eax,3 ; insert double word
 
"vblendvps", "vblendvpd" and "vpblendvb" use a new syntax with four register
operands: destination, two sources and a mask, where second source can also be
a memory operand. "vblendvps" and "vblendvpd" have 256-bit variant, where
operands are AVX registers or 256-bit memory, as well as 128-bit variant,
which has operands being SSE registers or 128-bit memory. "vpblendvb" has only
a 128-bit variant. Value stored in destination is the value copied from the
first source with some data elements replaced, according to mask, by values
from the second source.
 
vblendvps ymm3,ymm1,ymm2,ymm7 ; blend according to mask

"vptest" allows the same syntax as its SSE version and also has a 256-bit
version, with both operands doubled in size. There are also two new
instructions, "vtestps" and "vtestpd", which perform analogous tests, but only
of the sign bits of corresponding single precision or double precision values,
and set the ZF and CF accordingly. They follow the same syntax rules as
"vptest".
 
vptest ymm0,yword [ebx] ; test 256-bit values
vtestpd xmm0,xmm1 ; test sign bits of 64-bit floats
 
"vbroadcastss", "vbroadcastsd" and "vbroadcastf128" are new instructions,
which broadcast the data element defined by source operand into all elements
of corresponding size in the destination register. "vbroadcastss" needs
source to be 32-bit memory and destination to be either SSE or AVX register.
"vbroadcastsd" requires 64-bit memory as source, and AVX register as
destination. "vbroadcastf128" requires 128-bit memory as source, and AVX
register as destination.
 
vbroadcastss ymm0,dword [eax] ; get eight copies of value
 
"vinsertf128" is the new instruction, which takes four operands. The
destination and first source have to be AVX registers, second source can be
SSE register or 128-bit memory location, and fourth operand should be an
immediate value. It stores in destination the value obtained by taking
contents of first source and replacing one of its 128-bit units with value of
the second source. The lowest bit of fourth operand specifies at which
position that replacement is done (either 0 or 1).
"vextractf128" is the new instruction with three operands. The destination
needs to be SSE register or 128-bit memory location, the source must be AVX
register, and the third operand should be an immediate value. It extracts
into destination one of the 128-bit units from source. The lowest bit of third
operand specifies, which unit is extracted.
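
For instance (registers and positions chosen only for illustration):

    vinsertf128 ymm0,ymm1,xmm2,1    ; ymm1 with upper half replaced by xmm2
    vextractf128 xmm3,ymm0,1        ; upper half of ymm0
    vextractf128 [edi],ymm0,0       ; lower half of ymm0 to memory
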
"vmaskmovps" and "vmaskmovpd" are the new instructions with three operands
that selectively store in destination the elements from second source
depending on the sign bits of corresponding elements from first source. These
instructions can operate on either 128-bit data (SSE registers) or 256-bit
data (AVX registers). Either destination or second source has to be a memory
location of appropriate size, the two other operands should be registers.

vmaskmovps [edi],xmm0,xmm5 ; conditionally store
vmaskmovpd ymm5,ymm0,[esi] ; conditionally load
 
"vpermilpd" and "vpermilps" are the new instructions with three operands
that permute the values from first source according to the control fields from
second source and put the result into destination operand. They allow either
three SSE registers or three AVX registers as their operands, and the second
source can be a memory of size equal to the registers used. In alternative
form the second source can be immediate value and then the first source
can be a memory location of the size equal to destination register.
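
The forms described above might look like this (the immediate values are
sample ones, and note that the permutation is done separately within each
128-bit half of an AVX register):

    vpermilps ymm0,ymm1,ymm2        ; selectors taken from ymm2
    vpermilpd ymm0,ymm1,0101b       ; swap doubles within each 128-bit half
    vpermilps xmm0,[esi],00011011b  ; reverse four floats from memory
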
"vperm2f128" is the new instruction with four operands, which selects
128-bit blocks of floating point data from first and second source according
to the bit fields from fourth operand, and stores them in destination.
Destination and first source need to be AVX registers, second source can be
AVX register or 256-bit memory area, and fourth operand should be an immediate
value.
 
vperm2f128 ymm0,ymm6,ymm7,12h ; permute 128-bit blocks
 
"vzeroall" instruction sets all the AVX registers to zero. "vzeroupper" sets
the upper 128-bit portions of all AVX registers to zero, leaving the SSE
registers intact. These new instructions take no operands.
"vldmxcsr" and "vstmxcsr" are the AVX versions of "ldmxcsr" and "stmxcsr"
instructions. The rules for their operands remain unchanged.
 
2.1.22 AVX2 instructions
 
The AVX2 extension allows all the AVX instructions operating on packed integers
to use 256-bit data types, and introduces some new instructions as well.
The AVX instructions that operate on packed integers and had only 128-bit
variants have been supplemented with 256-bit variants, and thus their syntax
rules became analogous to AVX instructions operating on packed floating point
types.
 
vpsubb ymm0,ymm0,[esi] ; subtract 32 packed bytes
vpavgw ymm3,ymm0,ymm2 ; average of 16-bit integers
 
However there are some instructions that have not been equipped with the
256-bit variants. "vpcmpestri", "vpcmpestrm", "vpcmpistri", "vpcmpistrm",
"vpextrb", "vpextrw", "vpextrd", "vpextrq", "vpinsrb", "vpinsrw", "vpinsrd",
"vpinsrq" and "vphminposuw" are not affected by AVX2 and allow only the
128-bit operands.
The packed shift instructions, which allowed the third operand specifying
amount to be SSE register or 128-bit memory location, use the same rules
for the third operand in their 256-bit variant.
 
vpsllw ymm2,ymm2,xmm4 ; shift words left
vpsrad ymm0,ymm3,xword [ebx] ; shift double words right
 
There are also new packed shift instructions with standard three-operand AVX
syntax, which shift each element from first source by the amount specified in
corresponding element of second source, and store the results in destination.
"vpsllvd" shifts 32-bit elements left, "vpsllvq" shifts 64-bit elements left,
"vpsrlvd" shifts 32-bit elements right logically, "vpsrlvq" shifts 64-bit
elements right logically and "vpsravd" shifts 32-bit elements right
arithmetically.
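For instance, assuming the shift counts were prepared in a register or in
memory (the operands below are only illustrative):

vpsllvd ymm0,ymm1,ymm2 ; each double word shifted by its own count
vpsravd xmm0,xmm1,[esi] ; counts for arithmetic shift read from memory
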
The sign-extend and zero-extend instructions, which in AVX versions allowed
source operand to be SSE register or a memory of specific size, in the new
256-bit variant need memory of that size doubled or SSE register as source and
AVX register as destination.
 
vpmovzxbq ymm0,dword [esi] ; bytes to quad words

Also "vmovntdqa" has been upgraded with a 256-bit variant, so it allows
transferring a 256-bit value from memory to an AVX register; the memory address
needs to be aligned to 32 bytes.
"vpmaskmovd" and "vpmaskmovq" are the new instructions with syntax identical
to "vmaskmovps" or "vmaskmovpd", and they perform analogous operations on
packed 32-bit or 64-bit values.
"vinserti128", "vextracti128", "vbroadcasti128" and "vperm2i128" are the new
instructions with syntax identical to "vinsertf128", "vextractf128",
"vbroadcastf128" and "vperm2f128" respectively, and they perform analogous
operations on 128-bit blocks of integer data.
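These might be used as in the following sketch, where the operands are
arbitrary and the immediate values are only sample selections:

vinserti128 ymm0,ymm1,xmm2,1 ; ymm1 with high block replaced by xmm2
vextracti128 [edi],ymm3,0 ; store the low block of ymm3
vbroadcasti128 ymm4,[esi] ; copy 128-bit block into both halves
vperm2i128 ymm0,ymm1,ymm2,20h ; low blocks of ymm1 and ymm2
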
"vbroadcastss" and "vbroadcastsd" instructions have been extended to allow
SSE register as a source operand (which in AVX could only be a memory).
"vpbroadcastb", "vpbroadcastw", "vpbroadcastd" and "vpbroadcastq" are the
new instructions which broadcast the byte, word, double word or quad word from
the source operand into all elements of corresponding size in the destination
register. The destination operand can be either SSE or AVX register, and the
source operand can be SSE register or memory of size equal to the size of data
element.
 
vpbroadcastb ymm0,byte [ebx] ; get 32 identical bytes

"vpermd" and "vpermps" are new three-operand instructions, which use each
32-bit element from first source as an index of element in second source which
is copied into destination at position corresponding to element containing
index. The destination and first source have to be AVX registers, and the
second source can be AVX register or 256-bit memory.
"vpermq" and "vpermpd" are new three-operand instructions, which use 2-bit
indexes from the immediate value specified as third operand to determine which
element from source is stored at given position in destination. The destination
has to be AVX register, source can be AVX register or 256-bit memory, and the
third operand must be 8-bit immediate value.
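A short illustration of both kinds of permutation, with arbitrarily chosen
operands and sample immediate values:

vpermd ymm0,ymm1,[esi] ; indexes in ymm1 select from memory
vpermq ymm0,ymm2,4Eh ; swap the 128-bit halves of ymm2
vpermpd ymm0,[esi],1Bh ; reverse the order of four doubles
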
The family of new instructions performing "gather" operation has special
syntax, as in their memory operand they use addressing mode that is unique to
them. The base of address can be a 32-bit or 64-bit general purpose register
(the latter only in long mode), and the index (possibly multiplied by scale
value, as in standard addressing) is specified by SSE or AVX register. It is
possible to use only index without base and any numerical displacement can be
added to the address. Each of those instructions takes three operands. First
operand is the destination register, second operand is memory addressed with
a vector index, and third operand is register containing a mask. The most
significant bit of each element of mask determines whether a value will be
loaded from memory into corresponding element in destination. The address of
each element to load is determined by using the corresponding element from
index register in memory operand to calculate final address with given base
and displacement. When the index register contains fewer elements than the
destination and mask registers, the higher elements of destination are zeroed.
After the value is successfully loaded, the corresponding element in mask
register is set to zero. The destination, index and mask should all be
distinct registers; it is not allowed to use the same register in two
different roles.
"vgatherdps" loads single precision floating point values addressed by
32-bit indexes. The destination, index and mask should all be registers of the
same type, either SSE or AVX. The data addressed by memory operand is 32-bit
in size.
 
vgatherdps xmm0,[eax+xmm1],xmm3 ; gather four floats
vgatherdps ymm0,[ebx+ymm7*4],ymm3 ; gather eight floats
 
"vgatherqps" loads single precision floating point values addressed by
64-bit indexes. The destination and mask should always be SSE registers, while
index register can be either SSE or AVX register. The data addressed by memory
operand is 32-bit in size.

vgatherqps xmm0,[xmm2],xmm3 ; gather two floats
vgatherqps xmm0,[ymm2+64],xmm3 ; gather four floats

"vgatherdpd" loads double precision floating point values addressed by
32-bit indexes. The index register should always be SSE register, the
destination and mask should be two registers of the same type, either SSE or
AVX. The data addressed by memory operand is 64-bit in size.

vgatherdpd xmm0,[ebp+xmm1],xmm3 ; gather two doubles
vgatherdpd ymm0,[xmm3*8],ymm5 ; gather four doubles
 
"vgatherqpd" loads double precision floating point values addressed by
64-bit indexes. The destination, index and mask should all be registers of the
same type, either SSE or AVX. The data addressed by memory operand is 64-bit
in size.
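By analogy with the examples above, and with registers picked only for
illustration, this instruction could be used like:

vgatherqpd xmm0,[xmm2],xmm3 ; gather two doubles
vgatherqpd ymm0,[ebx+ymm2*8],ymm3 ; gather four doubles
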
"vpgatherdd" and "vpgatherqd" load 32-bit values addressed by either 32-bit
or 64-bit indexes. They follow the same rules as "vgatherdps" and "vgatherqps"
respectively.
"vpgatherdq" and "vpgatherqq" load 64-bit values addressed by either 32-bit
or 64-bit indexes. They follow the same rules as "vgatherdpd" and "vgatherqpd"
respectively.
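Following those rules, the integer variants might look like this (registers
and scale values are arbitrary):

vpgatherdd ymm0,[esi+ymm1*4],ymm2 ; gather eight double words
vpgatherqd xmm0,[ymm1*4],xmm2 ; gather four double words
vpgatherdq ymm0,[ebx+xmm1*8],ymm2 ; gather four quad words
vpgatherqq xmm0,[xmm1*8],xmm2 ; gather two quad words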
 
2.1.23 Auxiliary sets of computational instructions
 
There are a number of additional instruction set extensions related to
AVX. They introduce new vector instructions (and sometimes also their SSE
equivalents that use classic instruction encoding), and even some new
instructions operating on general registers that use the AVX-like encoding
allowing the extended syntax with separate destination and source operands.
The CPU support for each of these instruction sets needs to be determined
separately.
The AES extension provides a specialized set of instructions for the
purpose of cryptographic computations defined by Advanced Encryption Standard.
Each of these instructions has two versions: the AVX one and the one with
SSE-like syntax that uses classic encoding. Refer to the Intel manuals for the
details of operation of these instructions.
"aesenc" and "aesenclast" perform a single round of AES encryption on data
from first source with a round key from second source, and store result in
destination. The destination and first source are SSE registers, and the
second source can be SSE register or 128-bit memory. The AVX versions of these
instructions, "vaesenc" and "vaesenclast", use the syntax with three operands,
while the SSE-like version has only two operands, with first operand being
both the destination and first source.
"aesdec" and "aesdeclast" perform a single round of AES decryption on data
from first source with a round key from second source. The syntax rules for
them and their AVX versions are the same as for "aesenc".
"aesimc" performs the InvMixColumns transformation of source operand and
stores the result in destination. Both "aesimc" and "vaesimc" use only two
operands, destination being SSE register, and source being SSE register or
128-bit memory location.
"aeskeygenassist" is a helper instruction for generating the round key.
It needs three operands: destination being SSE register, source being SSE
register or 128-bit memory, and third operand being 8-bit immediate value.
The AVX version of this instruction uses the same syntax.
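The operand forms described above can be sketched as follows, with registers,
addresses and the immediate value chosen only for illustration:

aesenc xmm0,xmm1 ; one round, key in xmm1
vaesenclast xmm0,xmm1,[ebx] ; last round, key in memory
aesimc xmm2,[esi] ; transform key from memory
vaeskeygenassist xmm1,xmm0,1 ; sample round constant
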
The CLMUL extension introduces just one instruction, "pclmulqdq", and its
AVX version as well. This instruction performs a carryless multiplication of
two 64-bit values selected from first and second source according to the bit
fields in immediate value. The destination and first source are SSE registers,
second source is SSE register or 128-bit memory, and immediate value is
provided as last operand. "vpclmulqdq" takes four operands, while "pclmulqdq"
takes only three operands, with the first one serving both the role of
destination and first source.
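For example, assuming the values to multiply are held in registers or memory
chosen arbitrarily here:

pclmulqdq xmm0,xmm1,0 ; low quad word by low quad word
vpclmulqdq xmm0,xmm1,[esi],11h ; high quad word by high quad word
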
The FMA (Fused Multiply-Add) extension introduces additional AVX
instructions which perform multiplication and summation as single operation.
Each one takes three operands, first one serving both the role of destination
and first source, and the following ones being the second and third source.
The mnemonic of FMA instruction is obtained by appending to "vf" prefix: first
either "m" or "nm" to select whether result of multiplication should be taken
as-is or negated, then either "add" or "sub" to select whether third value
will be added to the product or subtracted from the product, then either
"132", "213" or "231" to select which source operands are multiplied and which
one is added or subtracted, and finally the type of data on which the
instruction operates, either "ps", "pd", "ss" or "sd". As it was with SSE
instructions promoted to AVX, instructions operating on packed floating point
values allow 128-bit or 256-bit syntax, in former all the operands are SSE
registers, but the third one can also be a 128-bit memory, in latter the
operands are AVX registers and the third one can also be a 256-bit memory.
Instructions that compute just one floating point result need operands to be
SSE registers, and the third operand can also be a memory, either 32-bit for
single precision or 64-bit for double precision.
 
vfmsub231ps ymm1,ymm2,ymm3 ; multiply and subtract
vfnmadd132sd xmm0,xmm5,[ebx] ; multiply, negate and add
 
In addition to the instructions created by the rule described above, there are
families of instructions with mnemonics starting with either "vfmaddsub" or
"vfmsubadd", followed by either "132", "213" or "231" and then either "ps" or
"pd" (the operation must always be on packed values in this case). They add
to the result of multiplication or subtract from it depending on the position
of value in packed data - instructions from the "vfmaddsub" group add when the
position is odd and subtract when the position is even, instructions from the
"vfmsubadd" group add when the position is even and subtract when the
position is odd. The rules for operands are the same as for other FMA
instructions.
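A possible use of these mnemonics, with illustrative operands only:

vfmaddsub213pd ymm0,ymm1,[esi] ; third source in 256-bit memory
vfmsubadd132ps xmm0,xmm1,xmm2 ; all operands in SSE registers
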
The FMA4 instructions are similar to FMA, but use syntax with four operands
and thus allow destination to be different than all the sources. Their
mnemonics are identical to FMA instructions with the "132", "213" or "231" cut
out, as having separate destination operand makes such selection of operands
superfluous. The multiplication is always performed on values from the first
and second source, and then the value from third source is added or
subtracted. Either second or third source can be a memory operand, and the
rules for the sizes of operands are the same as for FMA instructions.
 
vfmaddpd ymm0,ymm1,[esi],ymm2 ; multiply and add
vfmsubss xmm0,xmm1,xmm2,[ebx] ; multiply and subtract

The F16C extension consists of two instructions, "vcvtps2ph" and
"vcvtph2ps", which convert floating point values between single precision and
half precision (the 16-bit floating point format). "vcvtps2ph" takes three
operands: destination, source, and rounding controls. The third operand is
always an immediate, the source is either SSE or AVX register containing
single precision values, and the destination is SSE register or memory, the
size of memory is 64 bits when the source is SSE register and 128 bits when
the source is AVX register. "vcvtph2ps" takes two operands, the destination
that can be SSE or AVX register, and the source that is SSE register or memory
with size of the half of destination operand's size.
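The size rules above can be illustrated with the following sketch; the
operands are arbitrary and the rounding control of zero is just a sample value:

vcvtps2ph [edi],ymm0,0 ; store eight halves in 128-bit memory
vcvtps2ph xmm1,xmm0,0 ; four halves in low part of xmm1
vcvtph2ps ymm0,xmm1 ; eight halves to single precision
vcvtph2ps xmm0,qword [esi] ; four halves from 64-bit memory
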
The AMD XOP extension introduces a number of new vector instructions with
encoding and syntax analogous to AVX instructions. "vfrczps", "vfrczss",
"vfrczpd" and "vfrczsd" extract fractional portions of single or double
precision values, they all take two operands. The packed operations allow
either SSE or AVX register as destination, for the other two it has to be SSE
register. Source can be register of the same type as destination, or memory
of appropriate size (256-bit for destination being AVX register, 128-bit for
packed operation with destination being SSE register, 64-bit for operation
on a solitary double precision value and 32-bit for operation on a solitary
single precision value).
 
vfrczps ymm0,[esi] ; load fractional parts

"vpcmov" copies bits from either first or second source into destination
depending on the values of corresponding bits in the fourth operand (the
selector). If the bit in selector is set, the corresponding bit from first
source is copied into the same position in destination, otherwise the bit from
second source is copied. Either second source or selector can be memory
location, 128-bit or 256-bit depending on whether SSE registers or AVX
registers are specified as the other operands.
 
vpcmov xmm0,xmm1,xmm2,[ebx] ; selector in memory
vpcmov ymm0,ymm5,[esi],ymm2 ; source in memory
 
The family of packed comparison instructions takes four operands, the
destination and first source being SSE register, second source being SSE
register or 128-bit memory and the fourth operand being immediate value
defining the type of comparison. The mnemonic of instruction is created
by appending to "vpcom" prefix either "b" or "ub" to compare signed or
unsigned bytes, "w" or "uw" to compare signed or unsigned words, "d" or "ud"
to compare signed or unsigned double words, "q" or "uq" to compare signed or
unsigned quad words. The respective values from the first and second source
are compared and the corresponding data element in destination is set to
either all ones or all zeros depending on the result of comparison. The fourth
operand has to specify one of the eight comparison types (table 2.5). All
these instructions also have variants with only three operands and the type
of comparison encoded within the instruction name by inserting the comparison
mnemonic after "vpcom".
 
vpcomb xmm0,xmm1,xmm2,4 ; test for equal bytes
vpcomgew xmm0,xmm1,[ebx] ; compare signed words
 
Table 2.5 XOP comparisons
/-------------------------------------------\
| Code | Mnemonic | Description |
|======|==========|=========================|
| 0 | lt | less than |
| 1 | le | less than or equal |
| 2 | gt | greater than |
| 3 | ge | greater than or equal |
| 4 | eq | equal |
| 5 | neq | not equal |
| 6 | false | false |
| 7 | true | true |
\-------------------------------------------/
 
"vpermil2ps" and "vpermil2pd" set the elements in destination register to
zero or to a value selected from first or second source depending on the
corresponding bit fields from the fourth operand (the selector) and the
immediate value provided in fifth operand. Refer to the AMD manuals for the
detailed explanation of the operation performed by these instructions. Each
of the first four operands can be a register, and either second source or
selector can be memory location, 128-bit or 256-bit depending on whether SSE
registers or AVX registers are used for the other operands.
 
vpermil2ps ymm0,ymm3,ymm7,ymm2,0 ; permute from two sources

"vphaddbw" adds pairs of adjacent signed bytes to form 16-bit values and
stores them at the same positions in destination. "vphaddubw" does the same
but treats the bytes as unsigned. "vphaddbd" and "vphaddubd" sum all bytes
(either signed or unsigned) in each four-byte block to 32-bit results,
"vphaddbq" and "vphaddubq" sum all bytes in each eight-byte block to
64-bit results, "vphaddwd" and "vphadduwd" add pairs of words to 32-bit
results, "vphaddwq" and "vphadduwq" sum all words in each four-word block to
64-bit results, "vphadddq" and "vphaddudq" add pairs of double words to 64-bit
results. "vphsubbw" subtracts in each two-byte block the byte at higher
position from the one at lower position, and stores the result as a signed
16-bit value at the corresponding position in destination, "vphsubwd"
subtracts in each two-word block the word at higher position from the one at
lower position and makes signed 32-bit results, "vphsubdq" subtracts in each
block of two double words the one at higher position from the one at lower
position and makes signed 64-bit results. Each of these instructions takes
two operands, the destination being SSE register, and the source being SSE
register or 128-bit memory.
 
vphadduwq xmm0,xmm1 ; sum quadruplets of words

"vpmacsww" and "vpmacssww" multiply the corresponding signed 16-bit values
from the first and second source and then add the products to the parallel
values from the third source, then "vpmacsww" takes the lowest 16 bits of the
result and "vpmacssww" saturates the result down to 16-bit value, and they
store the final 16-bit results in the destination. "vpmacsdd" and "vpmacssdd"
perform the analogous operation on 32-bit values. "vpmacswd" and "vpmacsswd" do
the same calculation only on the low 16-bit values from each 32-bit block and
form the 32-bit results. "vpmacsdql" and "vpmacssdql" perform such operation
on the low 32-bit values from each 64-bit block and form the 64-bit results,
while "vpmacsdqh" and "vpmacssdqh" do the same on the high 32-bit values from
each 64-bit block, also forming the 64-bit results. "vpmadcswd" and
"vpmadcsswd" multiply the corresponding signed 16-bit value from the first
and second source, then sum all the four products and add this sum to each
16-bit element from third source, storing the truncated or saturated result
in destination. All these instructions take four operands, the second source
can be 128-bit memory or SSE register, all the other operands have to be
SSE registers.
 
vpmacsdd xmm6,xmm1,[ebx],xmm6 ; accumulate product
 
"vpperm" selects bytes from first and second source, optionally applies a
separate transformation to each of them, and stores them in the destination.
The bit fields in fourth operand (the selector) specify for each position in
destination what byte from which source is taken and what operation is applied
to it before it is stored there. Refer to the AMD manuals for the detailed
information about these bit fields. This instruction takes four operands,
either second source or selector can be a 128-bit memory (or they can be SSE
registers both), all the other operands have to be SSE registers.
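Both placements of the memory operand might be written as follows, with
registers and addresses that are purely illustrative:

vpperm xmm0,xmm1,xmm2,[ebx] ; selector in memory
vpperm xmm0,xmm1,[esi],xmm3 ; second source in memory
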
"vpshlb", "vpshlw", "vpshld" and "vpshlq" shift logically bytes, words, double
words or quad words respectively. The amount of bits to shift by is specified
for each element separately by the signed byte placed at the corresponding
position in the third operand. The source containing elements to shift is
provided as second operand. Either second or third operand can be 128-bit
memory (or they can be SSE registers both) and the other operands have to be
SSE registers.
 
vpshld xmm3,xmm1,[ebx] ; shift double words from xmm1

"vpshab", "vpshaw", "vpshad" and "vpshaq" arithmetically shift bytes, words,
double words or quad words. These instructions follow the same rules as the
logical shifts described above. "vprotb", "vprotw", "vprotd" and "vprotq"
rotate bytes, words, double words or quad words. They follow the same rules as
shifts, but additionally allow third operand to be immediate value, in which
case the same amount of rotation is specified for all the elements in source.
 
vprotb xmm0,[esi],3 ; rotate bytes to the left
 
The MOVBE extension introduces just one new instruction, "movbe", which
swaps bytes in value from source before storing it in destination, so it can
be used to load and store big endian values. It takes two operands, either
the destination or source should be a 16-bit, 32-bit or 64-bit memory (the
last one being only allowed in long mode), and the other operand should be
a general register of the same size.
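A load and a store could look like this, assuming 32-bit code and arbitrarily
chosen registers:

movbe eax,[esi] ; load big endian double word
movbe [edi],cx ; store word with bytes swapped
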
The BMI extension, consisting of two subsets - BMI1 and BMI2, introduces
new instructions operating on general registers, which use the same encoding
as AVX instructions and so allow the extended syntax. All these instructions
use 32-bit operands, and in long mode they also allow the forms with 64-bit
operands.
"andn" calculates the bitwise AND of second source with the inverted bits
of first source and stores the result in destination. The destination and
the first source have to be general registers, the second source can be
general register or memory.
 
andn edx,eax,[ebx] ; bit-multiply inverted eax with memory
 
"bextr" extracts from the first source the sequence of bits using an index
and length specified by bit fields in the second source operand and stores
it into destination. The lowest 8 bits of second source specify the position
of bit sequence to extract and the next 8 bits of second source specify the
length of sequence. The first source can be a general register or memory,
the other two operands have to be general registers.
 
bextr eax,[esi],ecx ; extract bit field from memory

"blsi" extracts the lowest set bit from the source, setting all the other
bits in destination to zero. The destination must be a general register,
the source can be general register or memory.
 
blsi rax,r11 ; isolate the lowest set bit

"blsmsk" sets all the bits in the destination up to the lowest set bit in
the source, including this bit. "blsr" copies all the bits from the source to
destination except for the lowest set bit, which is replaced by zero. These
instructions follow the same rules for operands as "blsi".
"tzcnt" counts the number of trailing zero bits, that is the zero bits up to
the lowest set bit of source value. This instruction is analogous to "lzcnt"
and follows the same rules for operands, so it also has a 16-bit version,
unlike the other BMI instructions.
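As a brief sketch of these three instructions, with operands chosen only for
illustration:

blsmsk eax,ecx ; mask up to the lowest set bit of ecx
blsr edx,[ebx] ; memory value with lowest set bit cleared
tzcnt ax,bx ; 16-bit form
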
"bzhi" is BMI2 instruction, which copies the bits from first source to
destination, zeroing all the bits up from the position specified by second
source. It follows the same rules for operands as "bextr".
"pext" uses a mask in second source operand to select bits from first
source and puts the selected bits as a continuous sequence into destination.
"pdep" performs the reverse operation - it takes sequence of bits from the
first source and puts them consecutively at the positions where the bits in
second source are set, setting all the other bits in destination to zero.
These BMI2 instructions follow the same rules for operands as "andn".
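The following lines sketch the operand forms of these BMI2 instructions; the
registers and the address are arbitrary:

bzhi eax,[esi],ecx ; keep only the bits below position from ecx
pext eax,ebx,ecx ; collect bits of ebx selected by mask in ecx
pdep eax,ebx,[esi] ; distribute bits of ebx by mask from memory
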
"mulx" is a BMI2 instruction which performs an unsigned multiplication of
value from EDX or RDX register (depending on the size of specified operands)
by the value from third operand, and stores the low half of result in the
second operand, and the high half of result in the first operand, and it does
it without affecting the flags. The third operand can be general register or
memory, and both the destination operands have to be general registers.
 
mulx edx,eax,ecx ; multiply edx by ecx into edx:eax
 
"shlx", "shrx" and "sarx" are BMI2 instructions, which perform logical or
arithmetical shifts of value from first source by the amount specified by
second source, and store the result in destination without affecting the
flags. They have the same rules for operands as "bzhi" instruction.
"rorx" is a BMI2 instruction which rotates right the value from source
operand by the constant amount specified in third operand and stores the
result in destination without affecting the flags. The destination operand
has to be general register, the source operand can be general register or
memory, and the third operand has to be an immediate value.
 
rorx eax,edx,7 ; rotate without affecting flags

The TBM is an extension designed by AMD to supplement the BMI set. The
"bextr" instruction is extended with a new form, in which second source is
a 32-bit immediate value. "blsic" is a new instruction which performs the
same operation as "blsi", but with the bits of result inverted. It uses the
same rules for operands as "blsi". "blsfill" is a new instruction, which takes
the value from source, sets all the bits below the lowest set bit and stores
the result in destination, it also uses the same rules for operands as "blsi".
"blci", "blcic", "blcs", "blcmsk" and "blcfill" are instructions analogous
to "blsi", "blsic", "blsr", "blsmsk" and "blsfill" respectively, but they
perform the bit-inverted versions of the same operations. They follow the
same rules for operands as the instructions they reflect.
"tzmsk" finds the lowest set bit in value from source operand, sets all bits
below it to 1 and all the rest of bits to zero, then writes the result to
destination. "t1mskc" finds the least significant zero bit in the value from
source operand, sets the bits below it to zero and all the other bits to 1,
and writes the result to destination. These instructions have the same rules
for operands as "blsi".
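Some of these forms might be written as below, where the registers are
arbitrary and the immediate is only a sample bit field description:

bextr eax,[esi],0804h ; 8 bits starting from bit 4
blsfill eax,ebx ; fill bits below the lowest set bit
tzmsk ecx,[ebx] ; mask of trailing zero bits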
 
2.1.24 Other extensions of instruction set
 
There are a number of additional instruction set extensions recognized by flat
assembler, and the general syntax of the instructions introduced by those
extensions is provided here. For detailed information on the operations
performed by them, check out the manuals from Intel (for the VMX, SMX, XSAVE,
RDRAND, FSGSBASE, INVPCID, HLE and RTM extensions) or AMD (for the SVM
extension).
The Virtual-Machine Extensions (VMX) provide a set of instructions for the
management of virtual machines. The "vmxon" instruction, which enters the VMX
operation, requires a single 64-bit memory operand, which should be a physical
address of memory region, which the logical processor may use to support VMX
operation. The "vmxoff" instruction, which leaves the VMX operation, has no
operands. The "vmlaunch" and "vmresume", which launch or resume the virtual
machines, and "vmcall", which allows guest software to call the VM monitor,
use no operands either.
The "vmptrld" loads the physical address of current Virtual Machine Control
Structure (VMCS) from its memory operand, "vmptrst" stores the pointer to
current VMCS into address specified by its memory operand, and "vmclear" sets
the launch state of the VMCS referenced by its memory operand to clear. These
three instructions all require a single 64-bit memory operand.
The "vmread" reads from VMCS a field specified by the source operand and
stores it into the destination operand. The source operand should be a
general purpose register, and the destination operand can be a register or
memory. The "vmwrite" writes into a VMCS field specified by the destination
operand the value provided by source operand. The source operand can be a
general purpose register or memory, and the destination operand must be a
register. The size of operands for those instructions should be 64-bit when
in long mode, and 32-bit otherwise.
The "invept" and "invvpid" invalidate the translation lookaside buffers
(TLBs) and paging-structure caches, either derived from extended page tables
(EPT), or based on the virtual processor identifier (VPID). These instructions
require two operands, the first one being the general purpose register
specifying the type of invalidation, and the second one being a 128-bit
memory operand providing the invalidation descriptor. The first operand
should be a 64-bit register when in long mode, and 32-bit register otherwise.
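For illustration only, in 32-bit mode and with arbitrarily chosen registers,
the operand forms described above might be written as:

vmxon [ebx] ; 64-bit memory holding physical address
vmptrld [esi] ; load pointer to current VMCS
vmread [edi],eax ; field selected by eax goes to memory
vmwrite eax,[ebx] ; field selected by eax is set from memory
invept eax,[esi] ; type in eax, descriptor in memory
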
The Safer Mode Extensions (SMX) provide the functionalities available
through the "getsec" instruction. This instruction takes no operands, and
the function that is executed is determined by the contents of EAX register
upon executing this instruction.
The Secure Virtual Machine (SVM) is a variant of virtual machine extension
used by AMD. The "skinit" instruction securely reinitializes the processor
allowing the startup of trusted software, such as the virtual machine monitor
(VMM). This instruction takes a single operand, which must be EAX, and
provides a physical address of the secure loader block (SLB).
The "vmrun" instruction is used to start a guest virtual machine,
its only operand should be an accumulator register (AX, EAX or RAX, the
last one available only in long mode) providing the physical address of the
virtual machine control block (VMCB). The "vmsave" stores a subset of
processor state into VMCB specified by its operand, and "vmload" loads the
same subset of processor state from a specified VMCB. The same operand rules
as for the "vmrun" apply to those two instructions.
"vmmcall" allows the guest software to call the VMM. This instruction takes
no operands.
"stgi" sets the global interrupt flag to 1, and "clgi" zeroes it. These
instructions take no operands.
"invlpga" invalidates the TLB mapping for a virtual page specified by the
first operand (which has to be accumulator register) and address space
identifier specified by the second operand (which must be ECX register).
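A sketch of the operand forms of SVM instructions in 32-bit mode, assuming the
registers were already loaded with suitable values:

skinit eax ; eax holds physical address of SLB
vmrun eax ; eax holds physical address of VMCB
invlpga eax,ecx ; page address in eax, ASID in ecx
stgi ; no operands
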
The XSAVE set of instructions allows saving and restoring processor state
components. "xsave" and "xsaveopt" store the components of processor state
defined by bit mask in EDX and EAX registers into area defined by memory
operand. "xrstor" restores from the area specified by memory operand the
components of processor state defined by mask in EDX and EAX. The "xsave64",
"xsaveopt64" and "xrstor64" are 64-bit versions of these instructions, allowed
only in long mode.
"xgetbv" reads the contents of 64-bit XCR (extended control register)
specified in ECX register into EDX and EAX registers. "xsetbv" writes the
contents of EDX and EAX into the 64-bit XCR specified by ECX register. These
instructions have no operands.
The RDRAND extension introduces one new instruction, "rdrand", which loads
the hardware-generated random value into general register. It takes one
operand, which can be 16-bit, 32-bit or 64-bit register (with the last one
being allowed only in long mode).
The FSGSBASE extension adds long mode instructions that allow reading and
writing the segment base registers for FS and GS segments. "rdfsbase" and
"rdgsbase" read the corresponding segment base registers into operand, while
"wrfsbase" and "wrgsbase" write the value of operand into those registers.
All these instructions take one operand, which can be 32-bit or 64-bit general
register.
The INVPCID extension adds "invpcid" instruction, which invalidates mapping
in the TLBs and paging caches based on the invalidation type specified in
first operand and PCID invalidate descriptor specified in second operand.
The first operand should be a 32-bit general register when not in long mode,
or 64-bit general register when in long mode. The second operand should be
128-bit memory location.
The HLE and RTM extensions provide set of instructions for the transactional
management. The "xacquire" and "xrelease" are new prefixes that can be used
with some of the instructions to start or end lock elision on the memory
address specified by prefixed instruction. The "xbegin" instruction starts
the transactional execution, its operand is the address a fallback routine
that gets executes in case of transaction abort, specified like the operand
for near jump instruction. "xend" marks the end of transcational execution
region, it takes no operands. "xabort" forces the transaction abort, it takes
an 8-bit immediate value as its only operand, this value is passed in the
highest bits of EAX to the fallback routine. "xtest" checks whether there is
transactional execution in progress, this instruction takes no operands.
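The following sketch shows a typical shape of a transactional region. The
"fallback", "done" and "counter" names are hypothetical, and the fallback
routine here simply repeats the update with a locked instruction:

        xbegin fallback             ; start the transaction
        inc dword [counter]         ; memory accesses here are transactional
        xend                        ; commit the transaction
        jmp done
    fallback:
        ; the transaction was aborted, EAX holds the abort status
        lock inc dword [counter]
    done: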
 
 
2.2 Control directives
 
This section describes the directives that control the assembly process, they
2271,7 → 3313,7
 
2.2.2 Conditional assembly
 
"if" directive causes come block of instructions to be assembled only under
"if" directive causes some block of instructions to be assembled only under
certain condition. It should be followed by logical expression specifying the
condition, instructions in next lines will be assembled only when this
condition is met, otherwise they will be skipped. The optional "else if"
2299,6 → 3341,11
followed by any expression, usually just by a single symbol name; it checks
whether the given expression contains only symbols that are defined in the
source and accessible from the current position.
With "relativeto" operator it is possible to check whether values of two
expressions differ only by constant amount. The valid syntax is a numerical
expression followed by "relativeto" and then another expression (possibly
register-based). Labels that have no simple numerical value can be tested
this way to determine what kind of operations may be possible with them.
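A short sketch, assuming two labels named "first" and "second" that may or
may not belong to the same relocatable area:

    if second relativeto first
        distance = second - first   ; the difference is a plain number
    else
        display 'labels cannot be subtracted',13,10
    end if
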
The following simple example uses the "count" constant that should be
defined somewhere in source:
 
2329,7 → 3376,7
of instructions get assembled, otherwise the last block of instructions, which
follows the line containing only "else", is assembled.
There are also operators that allow comparison of values being any chains of
symbols. The "eq" compares two such values whether they are exactly the same.
symbols. The "eq" compares whether two such values are exactly the same.
The "in" operator checks whether given value is a member of the list of values
following this operator, the list should be enclosed between "<" and ">"
characters, its members should be separated with commas. The symbols are
2431,7 → 3478,7
255). The loaded data cannot exceed current offset.
The "store" directive can modify the already generated code by replacing
some of the previously generated data with the value defined by given
numerical expression, which follow. The expression can be preceded by the
numerical expression, which follows. The expression can be preceded by the
optional size operator to specify how large value the expression defines, and
therefore how many bytes will be stored, if there is no size operator, the
size of one byte is assumed. Then the "at" operator and the numerical
2453,7 → 3500,7
end repeat
 
and each byte of code will be xored with the value defined by "c" constant.
"virtual" defines virtual data at specified address. This data won't be
"virtual" defines virtual data at specified address. This data will not be
included in the output file, but labels defined there can be used in other
parts of source. This directive can be followed by "at" operator and the
numerical expression specifying the address for virtual data, otherwise is
2480,7 → 3527,7
end virtual
 
With such definition instruction "mov ax,[LDT_limit]" will be assembled
to "mov ax,[bx]".
to the same instruction as "mov ax,[bx]".
Declaring defined data values or instructions inside the virtual block would
also be useful, because the "load" directive can be used to load the values
from the virtually generated code into constants. This directive should be
2547,12 → 3594,16
end repeat
display 13,10
 
This block of directives calculates the four hexadecimal digits of 16-bit value
and converts them into characters for displaying. Note that this won't work if
the addresses in current addressing space are relocatable (as it might happen
with PE or object output formats), since only absolute values can be used this
way. The absolute value may be obtained by calculating the relative address,
like "$-$$", or "rva $" in case of PE format.
This block of directives calculates the four hexadecimal digits of 16-bit
value and converts them into characters for displaying. Note that this will
not work if the addresses in current addressing space are relocatable (as it
might happen with PE or object output formats), since only absolute values can
be used this way. The absolute value may be obtained by calculating the
relative address, like "$-$$", or "rva $" in case of PE format.
The "err" directive immediately terminates the assembly process when it is
encountered by assembler.
The "assert" directive tests whether the logical expression that follows it
is true, and if not, it signals the error.
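For instance, a boot sector source could guard its size limit in this way (a
sketch; the "start" label is assumed to be defined at the beginning of the
code, and 510 is the space left before the two-byte signature):

    assert $-start <= 510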
 
 
2.2.6 Multiple passes
2654,6 → 3705,15
The "used" operator may be expected to behave in a similar manner in
analogous cases, however any other kinds of predictions may not be so simple and
you should never rely on them this way.
The "err" directive, usually used to stop the assembly when some condition is
met, stops the assembly immediately, regardless of whether the current pass
is final or intermediate. So even when the condition that caused this directive
to be interpreted is mispredicted and temporary, and would eventually disappear
in the later passes, the assembly is stopped anyway.
The "assert" directive signals the error only if its expression is false
after all the symbols have been resolved. You can use "assert 0" in place of
"err" when you do not want to have assembly stopped during the intermediate
passes.
 
 
2.3 Preprocessor directives
2676,11 → 3736,14
number of included files as long as they fit in memory.
The quoted path can contain environment variables enclosed within "%"
characters, they will be replaced with their values inside the path, both the
"\" and "/" characters are allowed as a path separators. If no absolute path
is given, the file is first searched for in the directory containing file
which included it and when it's not found there, in the directory containing
the main source file (the one specified in command line). These rules concern
also paths given with the "file" directive.
"\" and "/" characters are allowed as path separators. The file is first
searched for in the directory containing the file which included it and when
it is not found there, the search is continued in the directories specified in
the environment variable called INCLUDE (multiple paths separated with
semicolons can be defined there, they will be searched in the same order as
specified). If the file was not found in any of these places, preprocessor
looks for it in the directory containing the main source file (the one
specified in command line). These rules also concern paths given with the
"file" directive.
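As a sketch, the lines below show a plain file name that is resolved with the
search order described above and a path built from an environment variable.
The file names and the MYLIBS variable are hypothetical:

    include 'macros.inc'            ; searched in the places listed above
    include '%MYLIBS%/struct.inc'   ; MYLIBS is replaced with its value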
 
 
2.3.2 Symbolic constants
2713,7 → 3776,7
"d" constant back the value "edx", the second one will restore it to value
"dword", and one more will revert "d" to original meaning as if no such
constant was defined. If there was no constant defined of given name,
"restore" won't cause an error, it will be just ignored.
"restore" will not cause an error, it will be just ignored.
Symbolic constant can be used to adjust the syntax of assembler to personal
preferences. For example the following set of definitions provides the handy
shortcuts for all the size operators:
2726,6 → 3789,7
q equ qword
t equ tword
x equ dqword
y equ qqword
 
Because symbolic constant may also have an empty value, it can be used to
allow the syntax with "offset" word before any address value:
2841,10 → 3905,13
definition, so "mov es,ds,dx" will be assembled as "push ds", "pop es" and
"mov ds,dx".
By placing the "*" after the name of argument you can mark the argument as
required - preprocessor won't allow it to have an empty value. For example the
above macroinstruction could be declared as "macro mov op1*,op2*,op3" to make
sure that first two arguments will always have to be given some non empty
required - preprocessor will not allow it to have an empty value. For example
the above macroinstruction could be declared as "macro mov op1*,op2*,op3" to
make sure that first two arguments will always have to be given some non empty
values.
Alternatively, you can provide the default value for argument, by placing
the "=" followed by value after the name of argument. Then if the argument
has an empty value provided, the default value will be used instead.
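A short sketch of such declaration (the macroinstruction name and the default
value are chosen only for illustration, and the destination is assumed to be
already set up in ES:EDI):

    macro fill count*,value=0
     {
        mov ecx,count
        mov al,value
        rep stosb
     }

With this definition "fill 16" should store sixteen zero bytes, while
"fill 16,0FFh" uses the given value instead of the default one.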
When it's needed to provide macroinstruction with argument that contains
some commas, such argument should be enclosed between "<" and ">" characters.
If it contains more than one "<" character, the same number of ">" should be
2852,8 → 3919,8
"purge" directive allows removing the last definition of specified
macroinstruction. It should be followed by one or more names of
macroinstructions, separated with commas. If such macroinstruction has not
been defined, you won't get any error. For example after having the syntax of
"mov" extended with the macroinstructions defined above, you can disable
been defined, you will not get any error. For example after having the syntax
of "mov" extended with the macroinstructions defined above, you can disable
syntax with three operands back by using "purge mov" directive. Next
"purge mov" will disable also syntax for two operands being segment registers,
and all the next such directives will do nothing.
2903,7 → 3970,7
}
 
Each time this macroinstruction is used, "move" will become other unique name
in its instructions, so you won't get an error you normally get when some
in its instructions, so you will not get an error you normally get when some
label is defined more than once.
"forward", "reverse" and "common" directives divide macroinstruction into
blocks, each one processed after the processing of previous is finished. They
2948,8 → 4015,8
}
 
This macroinstruction can be used for calling the procedures using STDCALL
convention, arguments are pushed on stack in the reverse order. For example
"stdcall foo,1,2,3" will be assembled as:
convention, which has all the arguments pushed on stack in the reverse order.
For example "stdcall foo,1,2,3" will be assembled as:
 
push 3
push 2
2985,7 → 4052,7
"jae exit" instructions.
The "#" operator can be also used to concatenate two quoted strings into one.
Also conversion of name into a quoted string is possible, with the "`" operator,
which likewise can be used inside the macroinstruction. It convert the name
which likewise can be used inside the macroinstruction. It converts the name
that follows it into a quoted string - but note, that when it is followed by
a macro argument which is being replaced with value containing more than one
symbol, only the first of them will be converted, as the "`" operator converts
3104,9 → 4171,10
with dot in the contents of macroinstruction. The macroinstruction defined
using the "struc" directive can have the same name as some other
macroinstruction defined using the "macro" directive, structure
macroinstruction won't prevent the standard macroinstruction being processed
when there is no label before it and vice versa. All the rules and features
concerning standard macroinstructions apply to structure macroinstructions.
macroinstruction will not prevent the standard macroinstruction from being
processed when there is no label before it and vice versa. All the rules and
features concerning standard macroinstructions apply to structure
macroinstructions.
Here is the sample of structure macroinstruction:
 
struc point x,y
3146,10 → 4214,7
 
The "rept" directive is a special kind of macroinstruction, which makes given
amount of duplicates of the block enclosed with braces. The basic syntax is
"rept" directive followed by number (it cannot be an expression, since
preprocessor doesn't do calculations, if you need repetitions based on values
calculated by assembler, use one of the code repeating directives that are
processed by assembler, see 2.2.3), and then block of source enclosed between
"rept" directive followed by number and then block of source enclosed between
the "{" and "}" characters. The simplest example:
 
rept 5 { in al,dx }
3200,6 → 4265,16
will generate code which will clear the contents of eight SSE registers.
You can define multiple counters separated with commas, and each one can have
different base.
The number of repetitions and the base values for counters can be specified
using the numerical expressions with operator rules identical as in the case
of assembler. However each value used in such expression must either be a
directly specified number, or a symbolic constant with value also being an
expression that can be calculated by preprocessor (in such case the value
of expression associated with symbolic constant is calculated first, and then
substituted into the outer expression in place of that constant). If you need
repetitions based on values that can only be calculated at assembly time, use
one of the code repeating directives that are processed by assembler, see
section 2.2.3.
The "irp" directive iterates the single argument through the given list of
parameters. The syntax is "irp" followed by the argument name, then the comma
and then the list of parameters. The parameters are specified in the same
3253,7 → 4328,7
match +,- { include 'second.inc' }
 
the first file will get included, since "+" after comma matches the "+" in
pattern, and the second file won't be included, since there is no match.
pattern, and the second file will not be included, since there is no match.
To match any other symbol literally, it has to be preceded by "=" character
in the pattern. Also to match the "=" character itself, or the comma, the
"==" and "=," constructions have to be used. For example the "=a==" pattern
3277,8 → 4352,8
 
match a b, 1 { db a }
 
there will be nothing left for "b" to match, so the block won't get processed
at all.
there will be nothing left for "b" to match, so the block will not get
processed at all.
The block of source defined by match is processed in the same way as any
macroinstruction, so any operators specific to macroinstructions can be used
also in this case.
3314,12 → 4389,12
a separate stage, and all other preprocessing is done after on the resulting
source.
The standard preprocessing that comes after, on each line begins with
recognition of the first symbol. It begins with checking for the preprocessor
recognition of the first symbol. It starts with checking for the preprocessor
directives, and when none of them is detected, preprocessor checks whether the
first symbol is macroinstruction. If no macroinstruction is found, it moves
to the second symbol of line, and again begins with checking for directives,
which in this case is only the "equ" directive, as this is the only one that
occurs as the second symbol in line. If there's no directive, the second
occurs as the second symbol in line. If there is no directive, the second
symbol is checked for the case of structure macroinstruction and when none
of those checks gives the positive result, the symbolic constants are replaced
with their values and such line is passed to the assembler.
3331,11 → 4406,15
 
would be then both interpreted as invocations of macroinstruction "foo", since
the meaning of the first symbol overrides the meaning of second one.
The macroinstructions generate the new lines from their definition blocks,
replacing the parameters with their values and then processing the "#" and "`"
operators. The conversion operator has the higher priority than concatenation.
After this is completed, the newly generated line goes through the standard
preprocessing, as described above.
When the macroinstruction generates the new lines from its definition block,
in every line it first scans for macroinstruction directives, and interprets
them accordingly. All the other content in the definition block is used to
brew the new lines, replacing the macroinstruction parameters with their values
and then processing the symbol escaping and "#" and "`" operators. The
conversion operator has the higher priority than concatenation and if any of
them operates on the escaped symbol, the escaping is cancelled before finishing
the operation. After this is completed, the newly generated line goes through
the standard preprocessing, as described above.
Though the symbolic constants are usually only replaced in the lines, where
no preprocessor directives nor macroinstructions have been found, there are some
special cases where those replacements are performed in the parts of lines
3375,6 → 4454,33
block enclosed with braces. So if the "list" had value "1,2", the above line
would generate the line containing "foo 1,2", which would then go through the
standard preprocessing.
The other special case is in the parameters of "rept" directive. The amount
of repetitions and the base value for counter can be specified using
numerical expressions, and if there is a symbolic constant with non-numerical
name used in such an expression, preprocessor tries to evaluate its value as
a numerical expression and if it succeeds, it replaces the symbolic constant
with the result of that calculation and continues to evaluate the primary
expression. If the expression inside that symbolic constant also contains
some symbolic constants, preprocessor will try to calculate all the needed
values recursively.
This allows performing some calculations at the time of preprocessing, as
long as all the values used are the numbers known at the preprocessing stage.
A single repetition with "rept" can be used for the sole purpose of
calculating some value, like in this example:
 
define a b+4
define b 3
rept 1 result:a*b+2 { define c result }

To compute the base value for "result" counter, preprocessor replaces the "b"
with its value and recursively calculates the value of "a", obtaining 7 as
the result, then it calculates the main expression with the result being 23.
The "c" then gets defined with the first value of counter (because the block
is processed just one time), which is the result of the computation, so the
value of "c" is simply the "23" symbol. Note that if "b" is later redefined
with some other numerical value, the next time an expression containing "a"
is calculated, the value of "a" will reflect the new value of "b", because
the symbolic constant contains just the text of the expression.
There is one more special case - when preprocessor goes to checking the
second symbol in the line and it happens to be the colon character (what is
then interpreted by assembler as definition of a label), it stops in this
3421,9 → 4527,10
the "a" constant doesn't get defined. However symbolic constant "b" was
processed normally, even though its definition was put just next to the one
of "a". So because of the possible confusion you should be very careful
every time when mixing the features of preprocessor and assembler - always
try to imagine what your source will become after the preprocessing, and
thus what the assembler will see and do its multiple passes on.
every time when mixing the features of preprocessor and assembler - in such
cases it is important to realize what the source will become after the
preprocessing, and thus what the assembler will see and do its multiple passes
on.
 
 
2.4 Formatter directives
3433,7 → 4540,10
"format" directive followed by the format identifier allows to select the
output format. This directive should be put at the beginning of the source.
Default output format is a flat binary file, it can also be selected by using
"format binary" directive.
"format binary" directive. This directive can be followed by the "as" keyword
and the quoted string specifying the default file extension for the output
file. Unless the output file name was specified from the command line,
assembler will use this extension when generating the output file.
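For example, a source that builds a disk image might begin with the line
below (a sketch; the 'img' extension is only an illustration):

    format binary as 'img'

Assuming no output name is given in the command line, assembling "boot.asm"
with this setting should produce the file "boot.img".
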
"use16" and "use32" directives force the assembler to generate 16-bit or
32-bit code, omitting the default setting for selected output format. "use64"
enables generating the code for the long mode of x86-64 processors.
3468,16 → 4578,20
2.4.2 Portable Executable
 
To select the Portable Executable output format, use "format PE" directive, it
can be followed by additional format settings: use "console", "GUI" or
"native" operator selects the target subsystem (floating point value
specifying subsystem version can follow), "DLL" marks the output file as a
dynamic link library. Then can follow the "at" operator and the numerical
expression specifying the base of PE image and then optionally "on" operator
followed by the quoted string containing file name selects custom MZ stub for
PE program (when specified file is not a MZ executable, it is treated as a
flat binary executable file and converted into MZ format). The default code
setting for this format is 32-bit. The example of fully featured PE format
declaration:
can be followed by additional format settings: first the target subsystem
setting, which can be "console" or "GUI" for Windows applications, "native"
for Windows drivers, "EFI", "EFIboot" or "EFIruntime" for the UEFI, it may be
followed by the minimum version of system that the executable is targeted to
(specified in form of floating-point value). Optional "DLL" and "WDM" keywords
mark the output file as a dynamic link library and WDM driver respectively,
and the "large" keyword marks the executable as able to handle addresses
larger than 2 GB.
After those settings the "at" operator and a numerical expression specifying
the base of PE image can follow, and then optionally the "on" operator
followed by the quoted string containing the name of file to be used as a
custom MZ stub for PE program (when specified file is not a MZ executable, it
is treated as a flat binary executable file and converted into MZ format). The
default code setting for this format is 32-bit. The example of fully featured
PE format declaration:
 
format PE GUI 4.0 DLL at 7000000h on 'stub.exe'
 
3524,22 → 4638,24
identifier is followed by "from" operator and quoted file name - in such case
data is taken from the given resource file.
The "rva" operator can be used inside the numerical expressions to obtain
the RVA of the item addressed by the value it is applied to.
the RVA of the item addressed by the value it is applied to, that is the
offset relative to the base of PE image.
 
 
2.4.3 Common Object File Format
 
To select Common Object File Format, use "format COFF" or "format MS COFF"
directive whether you want to create classic or Microsoft's COFF file. The
default code setting for this format is 32-bit. To create the file in
Microsoft's COFF format for the x86-64 architecture, use "format MS64 COFF"
setting, in such case long mode code is generated by default.
directive, depending on whether you want to create classic (DJGPP) or
Microsoft's variant of COFF file. The default code setting for this format is
32-bit. To create the file in Microsoft's COFF format for the x86-64
architecture, use "format MS64 COFF" setting, in such case long mode code is
generated by default.
"section" directive defines a new section, it should be followed by quoted
string defining the name of section, then one or more section flags can
follow. Section flags available for both COFF variants are "code" and "data",
while "readable", "writeable", "executable", "shareable", "discardable",
"notpageable", "linkremove" and "linkinfo" are flags available only with
Microsoft COFF variant.
while flags "readable", "writeable", "executable", "shareable", "discardable",
"notpageable", "linkremove" and "linkinfo" are available only with Microsoft's
COFF variant.
By default section is aligned to double word (four bytes), in case of
Microsoft COFF variant other alignment can be specified by providing the
"align" operator followed by alignment value (any power of two up to 8192)
3561,6 → 4677,12
public main
public start as '_start'
 
Additionally, with COFF format it's possible to specify exported symbol as
static, it's done by preceding the name of symbol with the "static" keyword.
When using the Microsoft's COFF format, the "rva" operator can be used
inside the numerical expressions to obtain the RVA of the item addressed by the
value it is applied to.
 
2.4.4 Executable and Linkable Format
 
To select ELF output format, use "format ELF" directive. The default code
3578,14 → 4700,24
The "rva" operator can be used also in the case of this format (however not
when target architecture is x86-64), it converts the address into the offset
relative to the GOT table, so it may be useful to create position-independent
code.
code. There's also a special "plt" operator, which allows calling external
functions through the Procedure Linkage Table. You can even create an alias
for external function that will make it always be called through PLT, with
the code like:
 
extrn 'printf' as _printf
printf = PLT _printf
 
To create executable file, follow the format choice directive with the
"executable" keyword. It allows to use "entry" directive followed by the value
to set as entry point of program. On the other hand it makes "extrn" and
"public" directives unavailable, and instead of "section" there should be the
"segment" directive used, followed only by one or more segment permission
flags. The origin of segment is aligned to page (4096 bytes), and available
flags for are: "readable", "writeable" and "executable".
"executable" keyword and optionally the number specifying the brand of the
target operating system (for example value 3 would mark the executable
for Linux system). With this format selected it is allowed to use "entry"
directive followed by the value to set as entry point of program. On the other
hand it makes "extrn" and "public" directives unavailable, and instead of
"section" there should be the "segment" directive used, followed by one or
more segment permission flags and optionally a marker of special ELF
executable segment, which can be "interpreter", "dynamic" or "note". The
origin of segment is aligned to page (4096 bytes), and available permission
flags are: "readable", "writeable" and "executable".
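A minimal sketch of a complete executable using these directives. It assumes
a 32-bit Linux target, where system call number 1 invoked through interrupt
80h terminates the process; the "start" label name is arbitrary:

    format ELF executable 3
    entry start

    segment readable executable

    start:
        mov eax,1       ; number of the exit system call
        xor ebx,ebx     ; exit code 0
        int 80h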
 
 
EOF
EOF