WebSVN – Kolibri OS – Diff – /data/eng/docs/FASM.TXT

-Üßßß
+,'''
-                         ÜÜÛÜÜ ÜÜÜÜ    ÜÜÜÜÜ ÜÜÜ ÜÜ
+                         ,,;,, ,,,,    ,,,,, ,,, ,,
-                           Û       Û  Û      Û  Û  Û
+                           ;       ;  ;      ;  ;  ;
-                           Û  ÜßßßßÛ   ßßßßÜ Û  Û  Û
+                           ;  ,'''';   '''', ;  ;  ;
-                           Û  ßÜÜÜÜÛÜ ÜÜÜÜÜß Û  Û  Û
+                           ;  ',,,,;, ,,,,,' ;  ;  ;
-                              flat assembler 1.66
+                              flat assembler 1.70
                               Programmer's Manual
 Table of contents
-ÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ
+-----------------
 2.1.16  SSE2 instructions
 2.1.17  SSE3 instructions
 2.1.18  AMD 3DNow! instructions
 2.1.19  The x86-64 long mode instructions
 2.1.20  SSE4 instructions
+2.1.21  AVX instructions
+2.1.22  AVX2 instructions
+2.1.23  Auxiliary sets of computational instructions
+2.1.24  Other extensions of instruction set
 2.2  Control directives
 2.2.1  Numerical constants
 2.2.2  Conditional assembly
 2.2.3  Repeating blocks of instructions
 2.4.3  Common Object File Format
 2.4.4  Executable and Linkable Format
 Chapter 1  Introduction
-ÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ
+-----------------------
 This chapter contains all the most important information you need to begin
 done, how much time it took, and how many bytes were written into the
 destination file.
 The following is an example of the compilation summary:
-flat assembler  version 1.66
+flat assembler  version 1.70 (16384 kilobytes memory)
 passes, 5.3 seconds, 77824 bytes.
 In case of error during the compilation process, the program will display an
 error message. For example, when compiler can't find the input file, it will
 display the following message:
-flat assembler  version 1.66
+flat assembler  version 1.70 (16384 kilobytes memory)
 error: source file not found.
 If the error is connected with a specific part of source code, the source line
 that caused the error will also be displayed. The placement of this line in
 the source is also given to help you find this error, for example:
-flat assembler  version 1.66
+flat assembler  version 1.70 (16384 kilobytes memory)
 example.asm [3]:
         mob     ax,1
 error: illegal instruction.
 It means that in the third line of the "example.asm" file compiler has
 encountered an unrecognized instruction. When the line that caused error
 contains a macroinstruction, also the line in macroinstruction definition
 that generated the erroneous instruction is displayed:
 that are individual items even when they are not spaced from the other ones.
 Any of the "+-*/=<>()[]{}:,|&~#`" is the symbol character. The sequence of
 other characters, separated from other items with either blank spaces or
 symbol characters, is a symbol. If the first character of symbol is either a
-single or double quote, it integrates the any sequence of characters following
+single or double quote, it integrates any sequence of characters following it,
-it, even the special ones, into a quoted string, which should end with the same
+even the special ones, into a quoted string, which should end with the same
 character, with which it began (the single or double quote) - however if there
 are two such characters in a row (without any other character between them),
 they are integrated into quoted string as just one of them and the quoted
 string continues then. The symbols other than symbol characters and quoted
 strings can be used as names, so are also called the name symbols.
   Every instruction consists of the mnemonic and the various number of
 by a colon should be put just before the address value (inside the square
 brackets or after the "ptr" operator).
    Table 1.1  Size operators
-  ÚÄÄÄÄÄÄÄÄÄÄÂÄÄÄÄÄÄÂÄÄÄÄÄÄÄ¿
+  /-------------------------\
-  ³ Operator ³ Bits ³ Bytes ³
+  | Operator | Bits | Bytes |
-  ÆÍÍÍÍÍÍÍÍÍÍØÍÍÍÍÍÍØÍÍÍÍÍÍÍµ
+  |==========|======|=======|
-  ³ byte     ³ 8    ³ 1     ³
+  | byte     | 8    | 1     |
-  ³ word     ³ 16   ³ 2     ³
+  | word     | 16   | 2     |
-  ³ dword    ³ 32   ³ 4     ³
+  | dword    | 32   | 4     |
-  ³ fword    ³ 48   ³ 6     ³
+  | fword    | 48   | 6     |
-  ³ pword    ³ 48   ³ 6     ³
+  | pword    | 48   | 6     |
-  ³ qword    ³ 64   ³ 8     ³
+  | qword    | 64   | 8     |
-  ³ tbyte    ³ 80   ³ 10    ³
+  | tbyte    | 80   | 10    |
-  ³ tword    ³ 80   ³ 10    ³
+  | tword    | 80   | 10    |
-  ³ dqword   ³ 128  ³ 16    ³
+  | dqword   | 128  | 16    |
+  | xword    | 128  | 16    |
+  | qqword   | 256  | 32    |
+  | yword    | 256  | 32    |
-  ÀÄÄÄÄÄÄÄÄÄÄÁÄÄÄÄÄÄÁÄÄÄÄÄÄÄÙ
+  \-------------------------/
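   As an illustrative sketch of the size operators listed in table 1.1 (the
 registers and addresses used here are arbitrary choices for this example, and
 the "yword" form assumes an assembler version with AVX support, such as the
 1.70 release this diff introduces):
     mov byte [ebx],1           ; store 8 bits
     mov word [ebx],1           ; store 16 bits
     mov dword [ebx],1          ; store 32 bits
     movdqu xmm0,dqword [esi]   ; load 128 bits
     vmovdqu ymm0,yword [esi]   ; load 256 bits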
    Table 1.2  Registers
-  ÚÄÄÄÄÄÄÄÄÄÂÄÄÄÄÄÄÂÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ¿
+  /-----------------------------------------------------------------\
-  ³ Type    ³ Bits ³                                                ³
+  | Type    | Bits |                                                |
-  ÆÍÍÍÍÍÍÍÍÍØÍÍÍÍÍÍØÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍµ
+  |=========|======|================================================|
-  ³         ³ 8    ³ al    cl    dl    bl    ah    ch    dh    bh   ³
+  |         | 8    | al    cl    dl    bl    ah    ch    dh    bh   |
-  ³ General ³ 16   ³ ax    cx    dx    bx    sp    bp    si    di   ³
+  | General | 16   | ax    cx    dx    bx    sp    bp    si    di   |
-  ³         ³ 32   ³ eax   ecx   edx   ebx   esp   ebp   esi   edi  ³
+  |         | 32   | eax   ecx   edx   ebx   esp   ebp   esi   edi  |
-  ÃÄÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ´
+  |---------|------|------------------------------------------------|
-  ³ Segment ³ 16   ³ es    cs    ss    ds    fs    gs               ³
+  | Segment | 16   | es    cs    ss    ds    fs    gs               |
-  ÃÄÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ´
+  |---------|------|------------------------------------------------|
-  ³ Control ³ 32   ³ cr0         cr2   cr3   cr4                    ³
+  | Control | 32   | cr0         cr2   cr3   cr4                    |
-  ÃÄÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ´
+  |---------|------|------------------------------------------------|
-  ³ Debug   ³ 32   ³ dr0   dr1   dr2   dr3               dr6   dr7  ³
+  | Debug   | 32   | dr0   dr1   dr2   dr3               dr6   dr7  |
-  ÃÄÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ´
+  |---------|------|------------------------------------------------|
-  ³ FPU     ³ 80   ³ st0   st1   st2   st3   st4   st5   st6   st7  ³
+  | FPU     | 80   | st0   st1   st2   st3   st4   st5   st6   st7  |
-  ÃÄÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ´
+  |---------|------|------------------------------------------------|
-  ³ MMX     ³ 64   ³ mm0   mm1   mm2   mm3   mm4   mm5   mm6   mm7  ³
+  | MMX     | 64   | mm0   mm1   mm2   mm3   mm4   mm5   mm6   mm7  |
-  ÃÄÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ´
+  |---------|------|------------------------------------------------|
+  | SSE     | 128  | xmm0  xmm1  xmm2  xmm3  xmm4  xmm5  xmm6  xmm7 |
+  |---------|------|------------------------------------------------|
-  ³ SSE     ³ 128  ³ xmm0  xmm1  xmm2  xmm3  xmm4  xmm5  xmm6  xmm7 ³
+  | AVX     | 256  | ymm0  ymm1  ymm2  ymm3  ymm4  ymm5  ymm6  ymm7 |
-  ÀÄÄÄÄÄÄÄÄÄÁÄÄÄÄÄÄÁÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÙ
+  \-----------------------------------------------------------------/
 may not be included in the output file, so its values should be always
 considered unknown.
    Table 1.3  Data directives
-  ÚÄÄÄÄÄÄÄÄÄÂÄÄÄÄÄÄÄÄÂÄÄÄÄÄÄÄÄÄ¿
+  /----------------------------\
-  ³ Size    ³ Define ³ Reserve ³
+  | Size    | Define | Reserve |
-  ³ (bytes) ³ data   ³ data    ³
+  | (bytes) | data   | data    |
-  ÆÍÍÍÍÍÍÍÍÍØÍÍÍÍÍÍÍÍØÍÍÍÍÍÍÍÍÍµ
+  |=========|========|=========|
-  ³ 1       ³ db     ³ rb      ³
+  | 1       | db     | rb      |
-  ³         ³ file   ³         ³
+  |         | file   |         |
-  ÃÄÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄ´
+  |---------|--------|---------|
-  ³ 2       ³ dw     ³ rw      ³
+  | 2       | dw     | rw      |
-  ³         ³ du     ³         ³
+  |         | du     |         |
-  ÃÄÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄ´
+  |---------|--------|---------|
-  ³ 4       ³ dd     ³ rd      ³
+  | 4       | dd     | rd      |
-  ÃÄÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄ´
+  |---------|--------|---------|
-  ³ 6       ³ dp     ³ rp      ³
+  | 6       | dp     | rp      |
-  ³         ³ df     ³ rf      ³
+  |         | df     | rf      |
-  ÃÄÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄ´
+  |---------|--------|---------|
-  ³ 8       ³ dq     ³ rq      ³
+  | 8       | dq     | rq      |
-  ÃÄÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄ´
+  |---------|--------|---------|
-  ³ 10      ³ dt     ³ rt      ³
+  | 10      | dt     | rt      |
-  ÀÄÄÄÄÄÄÄÄÄÁÄÄÄÄÄÄÄÄÁÄÄÄÄÄÄÄÄÄÙ
+  \----------------------------/
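   A brief sketch of the directives from table 1.3 in use (the label names
 "vals", "wide", "magic", "buf" and "tab" are hypothetical, chosen only for
 this illustration):
     vals    db 1,2,3         ; three initialized bytes
     wide    du 'fasm'        ; each character stored as a 16-bit word
     magic   dd 12345678h     ; one initialized double word
     buf     rb 64            ; reserve 64 bytes, contents unknown
     tab     rd 16            ; reserve 16 double words (64 bytes)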
 In the above examples all the numerical expressions were the simple numbers,
 constants or labels. But they can be more complex, by using the arithmetical
 or logical operators for calculations at compile time. All these operators
-with their priority values are listed in table 1.4.
+with their priority values are listed in table 1.4. The operations with higher
-The operations with higher priority value will be calculated first, you can
+priority value will be calculated first, you can of course change this
-of course change this behavior by putting some parts of expression into
+behavior by putting some parts of expression into parentheses. The "+", "-",
-parenthesis. The "+", "-", "*" and "/" are standard arithmetical operations,
+"*" and "/" are standard arithmetical operations, "mod" calculates the
-"mod" calculates the remainder from division. The "and", "or", "xor", "shl",
+remainder from division. The "and", "or", "xor", "shl", "shr" and "not"
-"shr" and "not" perform the same logical operations as assembly instructions
+perform the same logical operations as assembly instructions of those names.
+The "rva" and "plt" are special unary operators that perform conversions
-of those names. The "rva" performs the conversion of an address into the
+between different kinds of addresses, they can be used only with few of the
-relocatable offset and is specific to some of the output formats (see 2.4).
+output formats and their meaning may vary (see 2.4).
+  The arithmetical and logical calculations are usually processed as if they
+operated on infinite precision 2-adic numbers, and assembler signals an
+overflow error if because of its limitations it is not able to perform the
+required calculation, or if the result is too large a number to fit in either
+signed or unsigned range for the destination unit size. However "not", "xor"
+and "shr" operators are exceptions from this rule - if the value specified
+by numerical expression has to fit in a unit of specified size, and the
+arguments for operation fit into that size, the operation will be performed
+with precision limited to that size.
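   The exception for "not" can be sketched with a single data definition (the
 value 80h is an arbitrary example, not taken from the manual). Computed with
 infinite precision, "not 80h" would be -129, which does not fit in a byte.
 Since the argument itself fits in a byte, the rule above says the inversion
 is performed with 8-bit precision instead:
     db not 80h        ; argument fits in a byte, so "not" is limited to 8 bits
     dw not 8000h      ; the same situation for a 16-bit unit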
   The numbers in the expression are by default treated as a decimal, binary
 numbers should have the "b" letter attached at the end, octal number should
 end with "o" letter, hexadecimal numbers should begin with "0x" characters
 (like in C language) or with the "$" character (like in Pascal language) or
 they should end with "h" letter. Also quoted string, when encountered in
 characters. So "1.0", "1E0" and "1f" define the same floating point value,
 while simple "1" defines an integer value.
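   The notations described above can be sketched as follows (the values are
 arbitrary examples, and the last four instructions are intended to load the
 same number written in different ways):
     mov al,101b       ; binary
     mov al,17o        ; octal
     mov al,31         ; decimal
     mov al,0x1F       ; hexadecimal, C style
     mov al,$1F        ; hexadecimal, Pascal style
     mov al,1Fh        ; hexadecimal, "h" suffix
     dd 1.0            ; floating point value
     dd 1f             ; the same floating point value
     dd 1              ; integer value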
    Table 1.4  Arithmetical and logical operators by priority
-  ÚÄÄÄÄÄÄÄÄÄÄÂÄÄÄÄÄÄÄÄÄÄÄÄÄÄ¿
+  /-------------------------\
-  ³ Priority ³ Operators    ³
+  | Priority | Operators    |
-  ÆÍÍÍÍÍÍÍÍÍÍØÍÍÍÍÍÍÍÍÍÍÍÍÍÍµ
+  |==========|==============|
-  ³ 0        ³ +  -         ³
+  | 0        | +  -         |
-  ÃÄÄÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÄÄÄÄÄ´
+  |----------|--------------|
-  ³ 1        ³ *  /         ³
+  | 1        | *  /         |
-  ÃÄÄÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÄÄÄÄÄ´
+  |----------|--------------|
-  ³ 2        ³ mod          ³
+  | 2        | mod          |
-  ÃÄÄÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÄÄÄÄÄ´
+  |----------|--------------|
-  ³ 3        ³ and  or  xor ³
+  | 3        | and  or  xor |
-  ÃÄÄÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÄÄÄÄÄ´
+  |----------|--------------|
-  ³ 4        ³ shl  shr     ³
+  | 4        | shl  shr     |
-  ÃÄÄÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÄÄÄÄÄ´
+  |----------|--------------|
-  ³ 5        ³ not          ³
+  | 5        | not          |
-  ÃÄÄÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÄÄÄÄÄ´
+  |----------|--------------|
-  ³ 6        ³ rva          ³
+  | 6        | rva  plt     |
-  ÀÄÄÄÄÄÄÄÄÄÄÁÄÄÄÄÄÄÄÄÄÄÄÄÄÄÙ
+  \-------------------------/
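   A short sketch of how the priorities in table 1.4 decide the order of
 calculation (the numbers are arbitrary examples):
     mov eax,2+3*4       ; "*" has higher priority than "+": 2+12 = 14
     mov eax,(2+3)*4     ; parentheses change the order: 5*4 = 20
     mov eax,1 shl 4+1   ; "shl" has higher priority than "+": 16+1 = 17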
 instruction "jmp dword [0]" will become the far jump and when assembler is
 in 32-bit mode, it will become the near jump. To force this instruction to be
 treated differently, use the "jmp near dword [0]" or "jmp far dword [0]" form.
   When operand of near jump is the immediate value, assembler will generate
-the shortest variant of this jump instruction if possible (but won't create
+the shortest variant of this jump instruction if possible (but will not create
 32-bit instruction in 16-bit mode nor 16-bit instruction in 32-bit mode,
 unless there is a size operator stating it). By specifying the jump type
 you can force it to always generate long variant (for example "jmp near 0")
 or to always generate short variant and terminate with an error when it's
 impossible (for example "jmp short 0").
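   With a label as the target, the two forced forms might look like this (the
 label name "next" is hypothetical):
     jmp short next    ; always the short variant, error if out of range
     jmp near next     ; always the long variant
   next: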
 without forcing it to use the longer form of instruction.
 Chapter 2  Instruction set
-ÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ
+--------------------------
 This chapter provides the detailed information about the instructions and
 directives supported by flat assembler. Directives for defining labels were
 2.1.5  Logical instructions
-"not" inverts the bits in the specified operand to form a one's
+"not" inverts the bits in the specified operand to form a one's complement
-complement of the operand. It has no effect on the flags. Rules for the
+of the operand. It has no effect on the flags. Rules for the operand are the
-operand are the same as for the "inc" instruction.
+same as for the "inc" instruction.
-  "and", "or" and "xor" instructions perform the standard
+  "and", "or" and "xor" instructions perform the standard logical operations.
-logical operations. They update the SF, ZF and PF flags. Rules for the
+They update the SF, ZF and PF flags. Rules for the operands are the same as
-operands are the same as for the "add" instruction.
+for the "add" instruction.
   "bt", "bts", "btr" and "btc" instructions operate on a single bit which can
 be in memory or in a general register. The location of the bit is specified
 as an offset from the low order end of the operand. The value of the offset
 optimized (see 1.2.5), the operand should be an immediate value specifying
 target address.
    Table 2.1  Conditions
-  ÚÄÄÄÄÄÄÄÄÄÄÂÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÂÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ¿
+  /-----------------------------------------------------------\
-  ³ Mnemonic ³ Condition tested      ³ Description            ³
+  | Mnemonic | Condition tested      | Description            |
-  ÆÍÍÍÍÍÍÍÍÍÍØÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍØÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍµ
+  |==========|=======================|========================|
-  ³ o        ³ OF = 1                ³ overflow               ³
+  | o        | OF = 1                | overflow               |
-  ÃÄÄÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ´
+  |----------|-----------------------|------------------------|
-  ³ no       ³ OF = 0                ³ not overflow           ³
+  | no       | OF = 0                | not overflow           |
-  ÃÄÄÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ´
+  |----------|-----------------------|------------------------|
-  ³ c        ³                       ³ carry                  ³
+  | c        |                       | carry                  |
-  ³ b        ³ CF = 1                ³ below                  ³
+  | b        | CF = 1                | below                  |
-  ³ nae      ³                       ³ not above nor equal    ³
+  | nae      |                       | not above nor equal    |
-  ÃÄÄÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ´
+  |----------|-----------------------|------------------------|
-  ³ nc       ³                       ³ not carry              ³
+  | nc       |                       | not carry              |
-  ³ ae       ³ CF = 0                ³ above or equal         ³
+  | ae       | CF = 0                | above or equal         |
-  ³ nb       ³                       ³ not below              ³
+  | nb       |                       | not below              |
-  ÃÄÄÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ´
+  |----------|-----------------------|------------------------|
-  ³ e        ³ ZF = 1                ³ equal                  ³
+  | e        | ZF = 1                | equal                  |
-  ³ z        ³                       ³ zero                   ³
+  | z        |                       | zero                   |
-  ÃÄÄÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ´
+  |----------|-----------------------|------------------------|
-  ³ ne       ³ ZF = 0                ³ not equal              ³
+  | ne       | ZF = 0                | not equal              |
-  ³ nz       ³                       ³ not zero               ³
+  | nz       |                       | not zero               |
-  ÃÄÄÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ´
+  |----------|-----------------------|------------------------|
-  ³ be       ³ CF or ZF = 1          ³ below or equal         ³
+  | be       | CF or ZF = 1          | below or equal         |
-  ³ na       ³                       ³ not above              ³
+  | na       |                       | not above              |
-  ÃÄÄÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ´
+  |----------|-----------------------|------------------------|
-  ³ a        ³ CF or ZF = 0          ³ above                  ³
+  | a        | CF or ZF = 0          | above                  |
-  ³ nbe      ³                       ³ not below nor equal    ³
+  | nbe      |                       | not below nor equal    |
-  ÃÄÄÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ´
+  |----------|-----------------------|------------------------|
-  ³ s        ³ SF = 1                ³ sign                   ³
+  | s        | SF = 1                | sign                   |
-  ÃÄÄÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ´
+  |----------|-----------------------|------------------------|
-  ³ ns       ³ SF = 0                ³ not sign               ³
+  | ns       | SF = 0                | not sign               |
-  ÃÄÄÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ´
+  |----------|-----------------------|------------------------|
-  ³ p        ³ PF = 1                ³ parity                 ³
+  | p        | PF = 1                | parity                 |
-  ³ pe       ³                       ³ parity even            ³
+  | pe       |                       | parity even            |
-  ÃÄÄÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ´
+  |----------|-----------------------|------------------------|
-  ³ np       ³ PF = 0                ³ not parity             ³
+  | np       | PF = 0                | not parity             |
-  ³ po       ³                       ³ parity odd             ³
+  | po       |                       | parity odd             |
-  ÃÄÄÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ´
+  |----------|-----------------------|------------------------|
-  ³ l        ³ SF xor OF = 1         ³ less                   ³
+  | l        | SF xor OF = 1         | less                   |
-  ³ nge      ³                       ³ not greater nor equal  ³
+  | nge      |                       | not greater nor equal  |
-  ÃÄÄÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ´
+  |----------|-----------------------|------------------------|
-  ³ ge       ³ SF xor OF = 0         ³ greater or equal       ³
+  | ge       | SF xor OF = 0         | greater or equal       |
-  ³ nl       ³                       ³ not less               ³
+  | nl       |                       | not less               |
-  ÃÄÄÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ´
+  |----------|-----------------------|------------------------|
-  ³ le       ³ (SF xor OF) or ZF = 1 ³ less or equal          ³
+  | le       | (SF xor OF) or ZF = 1 | less or equal          |
-  ³ ng       ³                       ³ not greater            ³
+  | ng       |                       | not greater            |
-  ÃÄÄÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ´
+  |----------|-----------------------|------------------------|
-  ³ g        ³ (SF xor OF) or ZF = 0 ³ greater                ³
+  | g        | (SF xor OF) or ZF = 0 | greater                |
-  ³ nle      ³                       ³ not less nor equal     ³
+  | nle      |                       | not less nor equal     |
-  ÀÄÄÄÄÄÄÄÄÄÄÁÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÁÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÙ
+  \-----------------------------------------------------------/
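   As a minimal sketch of how the conditions in table 2.1 are typically used
 after a comparison (the label names are hypothetical, and the registers are
 arbitrary choices):
     cmp eax,ebx
     jae no_raise      ; CF = 0: eax is above or equal (unsigned compare)
     mov eax,ebx       ; reached only when eax was below ebx
   no_raise:
     cmp ecx,edx
     jge no_lift       ; SF xor OF = 0: ecx is greater or equal (signed)
     mov ecx,edx       ; reached only when ecx was less than edx
   no_lift: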
   The "loop" instructions are conditional jumps that use a value placed in
 CX (or ECX) to specify the number of repetitions of a software loop. All
 "loop" instructions automatically decrement CX (or ECX) and terminate the
     seto byte [bx]   ; set byte if overflow
   "salc" instruction sets all bits of the AL register when the carry flag is
 set and zeroes the AL register otherwise. This instruction has no arguments.
-  The instructions obtained by attaching the condition mnemonic to the "cmov"
+  The instructions obtained by attaching the condition mnemonic to "cmov"
 mnemonic transfer the word or double word from the general register or memory
 to the general register only when the condition is true. The destination
 operand should be general register, the source operand can be general register
 or memory.
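   A small sketch of conditional transfers following a comparison (the
 registers and the memory operand are arbitrary choices for illustration):
     cmp eax,ebx
     cmovb eax,ebx     ; copy ebx into eax only if eax was below ebx
     cmovge ecx,[esi]  ; transfer the double word at [esi] only if eax >= ebx
                       ; in the signed sense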
   "fld1", "fldz", "fldl2t", "fldl2e", "fldpi", "fldlg2" and "fldln2" load the
 commonly used constants onto the FPU register stack. The loaded constants are
 +1.0, +0.0, lb 10, lb e, pi, lg 2 and ln 2 respectively. These instructions
 have no operands.
-  "fild" convert the singed integer source operand into double extended
+  "fild" converts the signed integer source operand into double extended
 precision floating-point format and pushes the result onto the FPU register
 stack. The source operand can be a 16-bit, 32-bit or 64-bit memory location.
     fcomi st2        ; compare st0 with st2 and set flags
     fcmovb st0,st2   ; transfer st2 to st0 if below
    Table 2.2  FPU conditions
-  ÚÄÄÄÄÄÄÄÄÄÄÂÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÂÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ¿
+  /------------------------------------------------------\
-  ³ Mnemonic ³ Condition tested ³ Description            ³
+  | Mnemonic | Condition tested | Description            |
-  ÆÍÍÍÍÍÍÍÍÍÍØÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍØÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍµ
+  |==========|==================|========================|
-  ³ b        ³ CF = 1           ³ below                  ³
+  | b        | CF = 1           | below                  |
-  ³ e        ³ ZF = 1           ³ equal                  ³
+  | e        | ZF = 1           | equal                  |
-  ³ be       ³ CF or ZF = 1     ³ below or equal         ³
+  | be       | CF or ZF = 1     | below or equal         |
-  ³ u        ³ PF = 1           ³ unordered              ³
+  | u        | PF = 1           | unordered              |
-  ³ nb       ³ CF = 0           ³ not below              ³
+  | nb       | CF = 0           | not below              |
-  ³ ne       ³ ZF = 0           ³ not equal              ³
+  | ne       | ZF = 0           | not equal              |
-  ³ nbe      ³ CF and ZF = 0    ³ not below nor equal    ³
+  | nbe      | CF and ZF = 0    | not below nor equal    |
-  ³ nu       ³ PF = 0           ³ not unordered          ³
+  | nu       | PF = 0           | not unordered          |
-  ÀÄÄÄÄÄÄÄÄÄÄÁÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÁÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÙ
+  \------------------------------------------------------/
   "ftst" compares the value in ST0 with 0.0 and sets the flags in the FPU
 status word according to the results. "fxam" examines the contents of the ST0
 FPU state (operating environment and register stack) at the specified
 destination in memory and reinitializes the FPU. "fsave" checks for pending
 unmasked FPU exceptions before proceeding, "fnsave" does not. "frstor"
 loads the FPU state from the specified memory location. All these instructions
-need an operand being a memory location.
+need an operand being a memory location. Each of these instructions has
-  "finit" and "fninit" set the FPU operating environment into its default
+two additional mnemonics that allow precisely selecting the type of the
+operation. The "fstenvw", "fnstenvw", "fldenvw", "fsavew", "fnsavew" and
+"frstorw" mnemonics force the instruction to perform operation as in the 16-bit
+mode, while "fstenvd", "fnstenvd", "fldenvd", "fsaved", "fnsaved" and "frstord"
+force the operation as in 32-bit mode.
+  "finit" and "fninit" set the FPU operating environment into its default
 state. "finit" checks for pending unmasked FPU exception before proceeding,
 "fninit" does not. "fclex" and "fnclex" clear the FPU exception flags in the
 FPU status word. "fclex" checks for pending unmasked FPU exception before
 proceeding, "fnclex" does not. "wait" and "fwait" are synonyms for the same
 instruction, which causes the processor to check for pending unmasked FPU
 "psubd" perform the subtraction of appropriate types. "paddsb", "paddsw",
 "psubsb" and "psubsw" perform the addition or subtraction of packed bytes
 or packed words with the signed saturation. "paddusb", "paddusw", "psubusb",
 "psubusw" are analogous, but with unsigned saturation. "pmulhw" and "pmullw"
-performs a signed multiply of the packed words and store the high or low words
+perform a signed multiplication of the packed words and store the high or low
-of the results in the destination operand. "pmaddwd" performs a multiply of
+words of the results in the destination operand. "pmaddwd" performs a multiply
-the packed words and adds the four intermediate double word products in pairs
+of the packed words and adds the four intermediate double word products in
-to produce result as a packed double words. "pand", "por" and "pxor" perform
+pairs to produce result as packed double words. "pand", "por" and "pxor"
-the logical operations on the quad words, "pandn" peforms also a logical
+perform the logical operations on the quad words, "pandn" also performs a
-negation of the destination operand before performing the "and" operation.
+logical negation of the destination operand before performing the "and"
-"pcmpeqb", "pcmpeqw" and "pcmpeqd" compare for equality of packed bytes,
+operation. "pcmpeqb", "pcmpeqw" and "pcmpeqd" compare for equality of packed
-packed words or packed double words. If a pair of data elements is equal, the
+bytes, packed words or packed double words. If a pair of data elements is
-corresponding data element in the destination operand is filled with bits of
+equal, the corresponding data element in the destination operand is filled with
-value 1, otherwise it's set to 0. "pcmpgtb", "pcmpgtw" and "pcmpgtd" perform
+bits of value 1, otherwise it's set to 0. "pcmpgtb", "pcmpgtw" and "pcmpgtd"
-the similar operation, but they check whether the data elements in the
+perform the similar operation, but they check whether the data elements in the
 destination operand are greater than the corresponding data elements in the
 source operand. "packsswb" converts packed signed words into packed signed
 bytes, "packssdw" converts packed signed double words into packed signed
 words, using saturation to handle overflow conditions. "packuswb" converts
 packed signed words into packed unsigned bytes. Converted data elements from
 the source operand are stored in the low part of the destination operand,
     cmpps xmm2,xmm4,0  ; compare packed single precision values
     cmpltss xmm0,[ebx] ; compare single precision values
    Table 2.3  SSE conditions
-  ÚÄÄÄÄÄÄÂÄÄÄÄÄÄÄÄÄÄÂÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ¿
+  /-------------------------------------------\
-  ³ Code ³ Mnemonic ³ Description             ³
+  | Code | Mnemonic | Description             |
-  ÆÍÍÍÍÍÍØÍÍÍÍÍÍÍÍÍÍØÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍÍµ
+  |======|==========|=========================|
-  ³ 0    ³ eq       ³ equal                   ³
+  | 0    | eq       | equal                   |
-  ³ 1    ³ lt       ³ less than               ³
+  | 1    | lt       | less than               |
-  ³ 2    ³ le       ³ less than or equal      ³
+  | 2    | le       | less than or equal      |
-  ³ 3    ³ unord    ³ unordered               ³
+  | 3    | unord    | unordered               |
-  ³ 4    ³ neq      ³ not equal               ³
+  | 4    | neq      | not equal               |
-  ³ 5    ³ nlt      ³ not less than           ³
+  | 5    | nlt      | not less than           |
-  ³ 6    ³ nle      ³ not less than nor equal ³
+  | 6    | nle      | not less than nor equal |
-  ³ 7    ³ ord      ³ ordered                 ³
+  | 7    | ord      | ordered                 |
-  ÀÄÄÄÄÄÄÁÄÄÄÄÄÄÄÄÄÄÁÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÙ
+  \-------------------------------------------/
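   To illustrate how the codes and mnemonics in table 2.3 correspond, a short
 sketch (the registers are arbitrary choices); both lines request the "less
 than" comparison of packed single precision values:
     cmpps xmm2,xmm4,1   ; code 1 selects "less than"
     cmpltps xmm2,xmm4   ; the same comparison written with the "lt" mnemonic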
   "comiss" and "ucomiss" compare the single precision values and set the ZF,
 PF and CF flags to show the result. The destination operand must be a SSE
     cvtss2si eax,xmm0  ; convert single precision value to integer
   "pextrw" copies the word in the source operand specified by the third
 operand to the destination operand. The source operand must be a MMX register,
-the destination operand must be a 32-bit general register (but only the low
+the destination operand must be a 32-bit general register (the high word of
-word of it is affected), the third operand must an 8-bit immediate value.
+the destination is cleared), the third operand must be an 8-bit immediate value.
     pextrw eax,mm0,1   ; extract word into eax
   "pavgb" and "pavgw" compute average of packed bytes or words. "pmaxub"
 returns the maximum values of packed unsigned bytes, "pminub" returns the
 minimum values of packed unsigned bytes, "pmaxsw" returns the maximum values
 of packed signed words, "pminsw" returns the minimum values of packed signed
-words. "pmulhuw" performs a unsigned multiply of the packed words and stores
+words. "pmulhuw" performs an unsigned multiplication of the packed words and
-the high words of the results in the destination operand. "psadbw" computes
+stores the high words of the results in the destination operand. "psadbw"
-the absolute differences of packed unsigned bytes, sums the differences, and
+computes the absolute differences of packed unsigned bytes, sums the
-stores the sum in the low word of destination operand. All these instructions
+differences, and stores the sum in the low word of destination operand. All
-follow the same rules for operands as the general MMX operations described in
+these instructions follow the same rules for operands as the general MMX
-previous section.
+operations described in previous section.
   "pmovmskb" creates a mask made of the most significant bit of each byte in
 the source operand and stores the result in the low byte of destination
 operand. The source operand must be a MMX register, the destination operand
 must be a 32-bit general register.
   "pshufw" inserts words from the source operand in the destination operand
 operand. "cvtpd2dq" and "cvttpd2dq" convert packed double precision floating
 point values to packed two double word integers, storing the result in the low
 quad word of the destination operand. "cvtdq2ps" converts packed four
 double word integers to packed single precision floating point values.
-"cvtdq2pd" converts packed two double word integers from the low quad word
+For all these instructions the destination operand must be a SSE register, the
-of the source operand to packed double precision floating point values.
-For all these instruction destination operand must be a SSE register, the
 source operand can be a 128-bit memory location or SSE register.
-  "movdqa" and "movdqu" transfer a double quad word operand containing packed
+"cvtdq2pd" converts packed two double word integers from the source operand to
+packed double precision floating point values, the source can be a 64-bit
+memory location or SSE register, destination has to be SSE register.
+  "movdqa" and "movdqu" transfer a double quad word operand containing packed
 integers from source operand to destination operand. At least one of the
 operands has to be a SSE register, the second one can be also a SSE register
 or 128-bit memory location. Memory operands for "movdqa" instruction must be
 aligned on boundary of 16 bytes, operands for "movdqu" instruction don't have
 to be aligned.
   All MMX instructions operating on the 64-bit packed integers (those with
 mnemonics starting with "p") are extended to operate on 128-bit packed
 integers located in SSE registers. Additional syntax for these instructions
 needs an SSE register where MMX register was needed, and the 128-bit memory
-location or SSE register where 64-bit memory location of MMX register were
+location or SSE register where 64-bit memory location or MMX register were
 needed. The exception is "pshufw" instruction, which doesn't allow extended
 syntax, but has two new variants: "pshufhw" and "pshuflw", which allow only
 the extended syntax, and perform the same operation as "pshufw" on the high
 or low quad words of operands respectively. Also the new instruction "pshufd"
 is introduced, which performs the same operation as "pshufw", but on the
     psubb xmm0,[esi]   ; subtract 16 packed bytes
     pextrw eax,xmm0,7  ; extract highest word into eax
   "paddq" performs the addition of packed quad words, "psubq" performs the
-substraction of packed quad words, "pmuludq" performs an unsigned multiply
+subtraction of packed quad words, "pmuludq" performs an unsigned
-of low double words from each corresponding quad words and returns the results
+multiplication of low double words from each corresponding quad words and
-in packed quad words. These instructions follow the same rules for operands as
+returns the results in packed quad words. These instructions follow the same
-the general MMX operations described in 2.1.14.
+rules for operands as the general MMX operations described in 2.1.14.
   "pslldq" and "psrldq" perform logical shift left or right of the double
-quad word in the destination operand by the amount of bits specified in the
+quad word in the destination operand by the amount of bytes specified in the
 source operand. The destination operand should be a SSE register, source
 operand should be an 8-bit immediate value.
   "punpckhqdq" interleaves the high quad word of the source operand and the
 high quad word of the destination operand and writes them to the destination
 -bit memory location.
   "movddup" loads the 64-bit source value and duplicates it into high and low
 quad word of the destination operand. The destination operand should be SSE
 register, the source operand can be SSE register or 64-bit memory location.
-  "lddqu" is functionally equivalent to "movdqu" instruction with memory as
+  "lddqu" is functionally equivalent to "movdqu" with memory as source
-source operand, but it may improve performance when the source operand crosses
+operand, but it may improve performance when the source operand crosses a
-a cacheline boundary. The destination operand has to be SSE register, the
+cacheline boundary. The destination operand has to be SSE register, the source
-source operand must be 128-bit memory location.
+operand must be 128-bit memory location.
   "addsubps" performs single precision addition of second and fourth pairs and
 single precision subtraction of the first and third pairs of floating point
 values in the operands. "addsubpd" performs double precision addition of the
 second pair and double precision subtraction of the first pair of floating
 point values in the operands. "haddps" performs the addition of two single
 precision values within each quad word of source and destination operands,
 need its three operands to be EAX, ECX and EDX register in that order. "mwait"
 waits for a write-back store to the address range set up by the "monitor"
 instruction. It uses two operands with additional parameters, first being the
 EAX and second the ECX register.
   The functionality of SSE3 is further extended by the set of Supplemental
+SSE3 instructions (SSSE3). They generally follow the same rules for operands
+as all the MMX operations extended by SSE.
+  "phaddw" and "phaddd" perform the horizontal addition of the pairs of
+adjacent values from both the source and destination operand, and store the
+sums into the destination (sums from the destination operand go into the
+lower part of the destination register, sums from the source operand go into
+the upper part). They operate on 16-bit or 32-bit chunks, respectively.
+"phaddsw" performs the same operation on signed 16-bit packed values, but the
+result of each addition is saturated. "phsubw" and "phsubd" analogously
+perform the horizontal subtraction of 16-bit or 32-bit packed values, and
+"phsubsw" performs the horizontal subtraction of signed 16-bit packed values
+with saturation.
+  "pabsb", "pabsw" and "pabsd" calculate the absolute value of each signed
+packed value in source operand and store them into the destination
+register. They operate on 8-bit, 16-bit and 32-bit elements respectively.
+  "pmaddubsw" multiplies signed 8-bit values from the source operand with the
+corresponding unsigned 8-bit values from the destination operand to produce
+intermediate 16-bit values, and every adjacent pair of those intermediate
+values is then added horizontally and those 16-bit sums are stored into the
+destination operand.
+  "pmulhrsw" multiplies corresponding 16-bit integers from the source and
+destination operand to produce intermediate 32-bit values, and the 16 bits
+next to the highest bit of each of those values are then rounded and packed
+into the destination operand.
+  "pshufb" shuffles the bytes in the destination operand according to the
+mask provided by source operand - each of the bytes in source operand is
+an index of the byte in the original destination that gets copied into the
+corresponding position (or the position is zeroed when the highest bit of
+the mask byte is set).
+  "psignb", "psignw" and "psignd" perform the operation on 8-bit, 16-bit or
+32-bit integers in destination operand, depending on the signs of the values
+in the source. If the value in source is negative, the corresponding value in
+the destination register is negated, if the value in source is positive, no
+operation is performed on the corresponding value, and if the value in
+source is zero, the value in destination is zeroed, too.
+  "palignr" appends the source operand to the destination operand to form the
+intermediate value of twice the size, and then extracts into the destination
+register the 64 or 128 bits that are right-aligned to the byte offset
+specified by the third operand, which should be an 8-bit immediate value. This
+is the only SSSE3 instruction that takes three arguments.
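+  As a brief sketch of the syntax (the operands below are chosen arbitrarily
+for illustration), the SSSE3 instructions can be used with either MMX or SSE
+registers:
+    phaddd mm0,mm1         ; horizontal sums of packed double words
+    pabsw xmm0,xmm1        ; absolute values of eight packed words
+    pshufb xmm0,[esi]      ; shuffle bytes according to mask in memory
+    palignr xmm0,xmm1,4    ; extract 128 bits starting at byte offset 4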
 2.1.18  AMD 3DNow! instructions
 The 3DNow! extension adds new MMX instructions to those described in 2.1.14,
 and introduces operation on the 64-bit packed floating point values, each
 consisting of two single precision floating point values.
   These instructions follow the same rules as the general MMX operations, the
 destination operand should be a MMX register, the source operand can be a MMX
 register or 64-bit memory location. "pavgusb" computes the rounded averages
-of packed unsigned bytes. "pmulhrw" performs a signed multiply of the packed
+of packed unsigned bytes. "pmulhrw" performs a signed multiplication of the
-words, round the high word of each double word results and stores them in the
+packed words, rounds the high word of each double word result and stores them
-destination operand. "pi2fd" converts packed double word integers into
+in the destination operand. "pi2fd" converts packed double word integers into
 packed floating point values. "pf2id" converts packed floating point values
 into packed double word integers using truncation. "pi2fw" converts packed
 word integers into packed floating point values, only low words of each
 "ch" and "dh" registers in long mode, but you cannot use them in the same
 instruction with any of the new registers.
    Table 2.4  New registers in long mode
-  ÚÄÄÄÄÄÄÂÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÂÄÄÄÄÄÄÄ¿
+  /--------------------------------------------------\
-  ³ Type ³          General          ³  SSE  ³
+  | Type |          General          |  SSE  |  AVX  |
-  ÃÄÄÄÄÄÄÅÄÄÄÄÄÄÂÄÄÄÄÄÄÂÄÄÄÄÄÄÂÄÄÄÄÄÄÅÄÄÄÄÄÄÄ´
+  |------|---------------------------|-------|-------|
-  ³ Bits ³  8   ³  16  ³  32  ³  64  ³  128  ³
+  | Bits |  8   |  16  |  32  |  64  |  128  |  256  |
-  ÆÍÍÍÍÍÍØÍÍÍÍÍÍØÍÍÍÍÍÍØÍÍÍÍÍÍØÍÍÍÍÍÍØÍÍÍÍÍÍÍµ
+  |======|======|======|======|======|=======|=======|
-  ³      ³      ³      ³      ³ rax  ³       ³
+  |      |      |      |      | rax  |       |       |
-  ³      ³      ³      ³      ³ rcx  ³       ³
+  |      |      |      |      | rcx  |       |       |
-  ³      ³      ³      ³      ³ rdx  ³       ³
+  |      |      |      |      | rdx  |       |       |
-  ³      ³      ³      ³      ³ rbx  ³       ³
+  |      |      |      |      | rbx  |       |       |
-  ³      ³ spl  ³      ³      ³ rsp  ³       ³
+  |      | spl  |      |      | rsp  |       |       |
-  ³      ³ bpl  ³      ³      ³ rbp  ³       ³
+  |      | bpl  |      |      | rbp  |       |       |
-  ³      ³ sil  ³      ³      ³ rsi  ³       ³
+  |      | sil  |      |      | rsi  |       |       |
-  ³      ³ dil  ³      ³      ³ rdi  ³       ³
+  |      | dil  |      |      | rdi  |       |       |
-  ³      ³ r8b  ³ r8w  ³ r8d  ³ r8   ³ xmm8  ³
+  |      | r8b  | r8w  | r8d  | r8   | xmm8  | ymm8  |
-  ³      ³ r9b  ³ r9w  ³ r9d  ³ r9   ³ xmm9  ³
+  |      | r9b  | r9w  | r9d  | r9   | xmm9  | ymm9  |
-  ³      ³ r10b ³ r10w ³ r10d ³ r10  ³ xmm10 ³
+  |      | r10b | r10w | r10d | r10  | xmm10 | ymm10 |
-  ³      ³ r11b ³ r11w ³ r11d ³ r11  ³ xmm11 ³
+  |      | r11b | r11w | r11d | r11  | xmm11 | ymm11 |
-  ³      ³ r12b ³ r12w ³ r12d ³ r12  ³ xmm12 ³
+  |      | r12b | r12w | r12d | r12  | xmm12 | ymm12 |
-  ³      ³ r13b ³ r13w ³ r13d ³ r13  ³ xmm13 ³
+  |      | r13b | r13w | r13d | r13  | xmm13 | ymm13 |
-  ³      ³ r14b ³ r14w ³ r14d ³ r14  ³ xmm14 ³
+  |      | r14b | r14w | r14d | r14  | xmm14 | ymm14 |
-  ³      ³ r15b ³ r15w ³ r15d ³ r15  ³ xmm15 ³
+  |      | r15b | r15w | r15d | r15  | xmm15 | ymm15 |
-  ÀÄÄÄÄÄÄÁÄÄÄÄÄÄÁÄÄÄÄÄÄÁÄÄÄÄÄÄÁÄÄÄÄÄÄÁÄÄÄÄÄÄÄÙ
+  \--------------------------------------------------/
    In general any instruction from x86 architecture, which allowed 16-bit or
 32-bit operand sizes, in long mode allows also the 64-bit operands. The 64-bit
 registers should be used for addressing in long mode, the 32-bit addressing
   If any operation is performed on the 32-bit general registers in long mode,
 the upper 32 bits of the 64-bit registers containing them are filled with
 zeros. This is unlike the operations on 16-bit or 8-bit portions of those
 registers, which preserve the upper bits.
-  Three new type conversion instructions are available. The "cdqe" sign extends
+  Three new type conversion instructions are available. The "cdqe" sign
-the double word in EAX into quad word and stores the result in RAX register.
+extends the double word in EAX into quad word and stores the result in RAX
-"cqo" sign extends the quad word in RAX into double quad word and stores the
+register. "cqo" sign extends the quad word in RAX into double quad word and
-extra bits in the RDX register. These instructions have no operands. "movsxd"
+stores the extra bits in the RDX register. These instructions have no
-sign extends the double word source operand, being either the 32-bit register
+operands. "movsxd" sign extends the double word source operand, being either
-or memory, into 64-bit destination operand, which has to be register.
+the 32-bit register or memory, into 64-bit destination operand, which has to
-No analogous instruction is needed for the zero extension, since it is done
+be register. No analogous instruction is needed for the zero extension, since
-automatically by any operations on 32-bit registers, as noted in previous
+it is done automatically by any operations on 32-bit registers, as noted in
-paragraph. And the "movzx" and "movsx" instructions, conforming to the general
+previous paragraph. And the "movzx" and "movsx" instructions, conforming to
-rule, can be used with 64-bit destination operand, allowing extension of byte
+the general rule, can be used with 64-bit destination operand, allowing
-or word values into quad words.
+extension of byte or word values into quad words.
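+  A short sketch of these conversions, assuming code assembled for long mode
+(the registers and addresses are arbitrary examples):
+    cdqe                    ; sign extend eax into rax
+    cqo                     ; sign extend rax into rdx:rax
+    movsxd rax,dword [rbx]  ; sign extend double word into quad word
+    movzx rcx,byte [rbx]    ; zero extend byte into quad word
+    mov eax,edx             ; upper 32 bits of rax become zero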
-  All the binary arithmetic and logical instruction are promoted to allow
+  All the binary arithmetic and logical instructions have been promoted to
-64-bit operands in long mode. The use of decimal arithmetic instructions in
+allow 64-bit operands in long mode. The use of decimal arithmetic instructions
-long mode is prohibited.
+in long mode is prohibited.
   The stack operations, like "push" and "pop" in long mode default to 64-bit
 operands and it's not possible to use 32-bit operands with them. The "pusha"
 and "popa" are disallowed in long mode.
-  The indirect near jumps and calls in long mode default to 64-bit operands and
+  The indirect near jumps and calls in long mode default to 64-bit operands
-it's not possible to use the 32-bit operands with them. On the other hand, the
+and it's not possible to use the 32-bit operands with them. On the other hand,
-indirect far jumps and calls allow any operands that were allowed by the x86
+the indirect far jumps and calls allow any operands that were allowed by the
-architecture and also 80-bit memory operand is allowed (though only EM64T seems
+x86 architecture and also 80-bit memory operand is allowed (though only EM64T
-to implement such variant), with the first eight bytes defining the offset and
+seems to implement such variant), with the first eight bytes defining the
-two last bytes specifying the selector. The direct far jumps and calls are not
+offset and two last bytes specifying the selector. The direct far jumps and
-allowed in long mode.
+calls are not allowed in long mode.
   The I/O instructions, "in", "out", "ins" and "outs" are the exceptional
 instructions that are not extended to accept quad word operands in long mode.
 But all other string operations are, and there are new short forms "movsq",
 "cmpsq", "scasq", "lodsq" and "stosq" introduced for the variants of string
 operations for 64-bit string elements. The RSI and RDI registers are used by
 default to address the string elements.
 in long mode require the 80-bit memory operand.
   The "cmpxchg16b" is the 64-bit equivalent of "cmpxchg8b" instruction, it uses
 the double quad word memory operand and 64-bit registers to perform the
 analogous operation.
-  "swapgs" is the new instruction, which swaps the contents of GS register and
+  The "fxsave64" and "fxrstor64" are new variants of "fxsave" and "fxrstor"
+instructions, available only in long mode, which use a different format of
+storage area in order to store some pointers in full 64-bit size.
+  "swapgs" is the new instruction, which swaps the contents of GS register and
 the KernelGSbase model-specific register (MSR address 0C0000102h).
   "syscall" and "sysret" is the pair of new instructions that provide the
 functionality similar to "sysenter" and "sysexit" in long mode, where the
-latter pair is disallowed.
+latter pair is disallowed. The "sysexitq" and "sysretq" mnemonics provide the
+64-bit versions of "sysexit" and "sysret" instructions.
+  The "rdmsrq" and "wrmsrq" mnemonics are the 64-bit variants of the "rdmsr"
+and "wrmsr" instructions.
+2.1.20  SSE4 instructions
+There are actually three different sets of instructions under the name SSE4.
+Intel designed two of them, SSE4.1 and SSE4.2, with latter extending the
+former into the full Intel's SSE4 set. On the other hand, the implementation
+by AMD includes only a few instructions from this set, but also contains
+some additional instructions, that are called the SSE4a set.
+  The SSE4.1 instructions mostly follow the same rules for operands, as
+the basic SSE operations, so they require destination operand to be SSE
+register and source operand to be 128-bit memory location or SSE register,
+and some operations require a third operand, the 8-bit immediate value.
+  "pmulld" performs a signed multiplication of the packed double words and
+stores the low double words of the results in the destination operand.
+"pmuldq" performs two signed multiplications of the corresponding double
+words in the lower quad words of operands, and stores the results as
+packed quad words into the destination register. "pminsb" and "pmaxsb"
+return the minimum or maximum values of packed signed bytes, "pminuw" and
+"pmaxuw" return the minimum and maximum values of packed unsigned words,
+"pminud", "pmaxud", "pminsd" and "pmaxsd" return minimum or maximum values
+of packed unsigned or signed double words. These instructions complement the
+instructions computing packed minimum or maximum introduced by SSE.
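+  For example (a sketch, with operands chosen arbitrarily for illustration):
+    pmulld xmm0,xmm1      ; low double words of four signed products
+    pmuldq xmm0,xmm1      ; two signed quad word products
+    pmaxud xmm2,[ebx]     ; maximum of packed unsigned double words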
+  "ptest" sets the ZF flag to one when the result of bitwise AND of both
+operands is zero, and zeroes the ZF otherwise. It also sets CF flag to one,
+when the result of bitwise AND of the source operand with the bitwise NOT
+of the destination operand is zero, and zeroes the CF otherwise.
+"pcmpeqq" compares packed quad words for equality, and fills the
+corresponding elements of destination operand with either ones or zeros,
+depending on the result of comparison.
+  "packusdw" converts packed signed double words from both the source and
+destination operand into the unsigned words using saturation, and stores
+the eight resulting word values into the destination register.
+  "phminposuw" finds the minimum unsigned word value in source operand and
+places it into the lowest word of destination operand, setting the remaining
+upper bits of destination to zero.
+  "roundps", "roundss", "roundpd" and "roundsd" perform the rounding of packed
+or individual floating point value of single or double precision, using the
+rounding mode specified by the third operand.
+    roundsd xmm0,xmm1,0011b ; round toward zero
+  "dpps" calculates dot product of packed single precision floating point
+values, that is it multiplies the corresponding pairs of values from source and
+destination operand and then sums the products up. The high four bits of the
+8-bit immediate third operand control which products are calculated and taken
+to the sum, and the low four bits control, into which elements of destination
+the resulting dot product is copied (the other elements are filled with zero).
+"dppd" calculates dot product of packed double precision floating point values.
+The bits 4 and 5 of third operand control, which products are calculated and
+added, and bits 0 and 1 of this value control, which elements in destination
+register should get filled with the result. "mpsadbw" calculates multiple sums
+of absolute differences of unsigned bytes. The third operand controls, with
+value in bits 0-1, which of the four-byte blocks in source operand is taken to
+calculate the absolute differences, and with value in bit 2, at which of the
+two first four-byte blocks in destination operand to start calculating multiple
+sums. The sum is calculated from four absolute differences between the
+corresponding unsigned bytes in the source and destination block, and each next
+sum is calculated in the same way, but taking the four bytes from destination
+at the position one byte after the position of previous block. The four bytes
+from the source stay the same each time. This way eight sums of absolute
+differences are calculated and stored as packed word values into the
+destination operand. The instructions described in this paragraph follow the
+same rules for operands, as "roundps" instruction.
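+  As an illustrative sketch of how the immediate operand selects the elements
+(the operands and mask values below are arbitrary examples):
+    dpps xmm0,xmm1,11110001b  ; sum of all four products into lowest element
+    dppd xmm0,[esi],00110011b ; sum of both products into both elements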
+  "blendps", "blendvps", "blendpd" and "blendvpd" conditionally copy the
+values from source operand into the destination operand, depending on the bits
+of the mask provided by third operand. If a mask bit is set, the corresponding
+element of source is copied into the same place in destination, otherwise this
+position in destination is left unchanged. The rules for the first two operands
+are the same, as for general SSE instructions. "blendps" and "blendpd" need
+third operand to be 8-bit immediate, and they operate on single or double
+precision values, respectively. "blendvps" and "blendvpd" require third operand
+to be the XMM0 register.
+    blendvps xmm3,xmm7,xmm0 ; blend according to mask
+  "pblendw" conditionally copies word elements from the source operand into the
+destination, depending on the bits of mask provided by third operand, which
+needs to be 8-bit immediate value. "pblendvb" conditionally copies byte
+elements from the source operand into destination, depending on mask defined
+by the third operand, which has to be XMM0 register. These instructions follow
+the same rules for operands as "blendps" and "blendvps" instructions,
+respectively.
+  "insertps" inserts a single precision floating point value taken from the
+position in source operand specified by bits 6-7 of third operand into location
+in destination register selected by bits 4-5 of third operand. Additionally,
+the low four bits of third operand control, which elements in destination
+register will be set to zero. The first two operands follow the same rules as
+for the general SSE operation, the third operand should be 8-bit immediate.
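+  For example (a sketch with arbitrarily chosen registers and mask values):
+    insertps xmm0,xmm1,00010000b ; lowest value of xmm1 into second position
+    insertps xmm0,xmm1,00001000b ; lowest value into lowest, zero the highest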
+  "extractps" extracts a single precision floating point value taken from the
+location in source operand specified by low two bits of third operand, and
+stores it into the destination operand. The destination can be a 32-bit memory
+value or general purpose register, the source operand must be SSE register,
+and the third operand should be 8-bit immediate value.
+    extractps edx,xmm3,3 ; extract the highest value
+  "pinsrb", "pinsrd" and "pinsrq" copy a byte, double word or quad word from
+the source operand into the location of destination operand determined by the
+third operand. The destination operand has to be SSE register, the source
+operand can be a memory location of appropriate size, or the 32-bit general
+purpose register (but 64-bit general purpose register for "pinsrq", which is
+only available in long mode), and the third operand has to be 8-bit immediate
+value. These instructions complement the "pinsrw" instruction operating on SSE
+register destination, which was introduced by SSE2.
+    pinsrd xmm4,eax,1 ; insert double word into second position
+  "pextrb", "pextrw", "pextrd" and "pextrq" copy a byte, word, double word or
+quad word from the location in source operand specified by third operand, into
+the destination. The source operand should be SSE register, the third operand
+should be 8-bit immediate, and the destination operand can be memory location
+of appropriate size, or the 32-bit general purpose register (but 64-bit general
+purpose register for "pextrq", which is only available in long mode). The
+"pextrw" instruction with SSE register as source was already introduced by
+SSE2, but SSE4 extends it to allow memory operand as destination.
+    pextrw [ebx],xmm3,7 ; extract highest word into memory
+  "pmovsxbw" and "pmovzxbw" perform sign extension or zero extension of eight
+byte values from the source operand into packed word values in destination
+operand, which has to be SSE register. The source can be 64-bit memory or SSE
+register - when it is register, only its low portion is used. "pmovsxbd" and
+"pmovzxbd" perform sign extension or zero extension of the four byte values
+from the source operand into packed double word values in destination operand,
+the source can be 32-bit memory or SSE register. "pmovsxbq" and "pmovzxbq"
+perform sign extension or zero extension of the two byte values from the
+source operand into packed quad word values in destination operand, the source
+can be 16-bit memory or SSE register. "pmovsxwd" and "pmovzxwd" perform sign
+extension or zero extension of the four word values from the source operand
+into packed double words in destination operand, the source can be 64-bit
+memory or SSE register. "pmovsxwq" and "pmovzxwq" perform sign extension or
+zero extension of the two word values from the source operand into packed quad
+words in destination operand, the source can be 32-bit memory or SSE register.
+"pmovsxdq" and "pmovzxdq" perform sign extension or zero extension of the two
+double word values from the source operand into packed quad words in
+destination operand, the source can be 64-bit memory or SSE register.
+    pmovzxbq xmm0,word [si]  ; zero-extend bytes to quad words
+    pmovsxwq xmm0,xmm1       ; sign-extend words to quad words
+  "movntdqa" loads double quad word from the source operand to the destination
+using a non-temporal hint. The destination operand should be SSE register,
+and the source operand should be 128-bit memory location.
+  The SSE4.2, described below, adds not only some new operations on SSE
+registers, but also introduces some completely new instructions operating on
+general purpose registers only.
+  "pcmpistri" compares two zero-ended (implicit length) strings provided in
+its source and destination operand and generates an index stored to ECX;
+"pcmpistrm" performs the same comparison and generates a mask stored to XMM0.
+"pcmpestri" compares two strings of explicit lengths, with length provided
+in EAX for the destination operand and in EDX for the source operand, and
+generates an index stored to ECX; "pcmpestrm" performs the same comparison
+and generates a mask stored to XMM0. The source and destination operand follow
+the same rules as for general SSE instructions, the third operand should be
+8-bit immediate value determining the details of performed operation - refer to
+Intel documentation for information on those details.
+  "pcmpgtq" compares packed quad words, and fills the corresponding elements of
+destination operand with either ones or zeros, depending on whether the value
+in destination is greater than the one in source, or not. This instruction
+follows the same rules for operands as "pcmpeqq".
+  "crc32" accumulates a CRC32 value for the source operand starting with
+initial value provided by destination operand, and stores the result in
+destination. Unless in long mode, the destination operand should be a 32-bit
+general purpose register, and the source operand can be a byte, word, or double
+word register or memory location. In long mode the destination operand can
+also be a 64-bit general purpose register, and the source operand in such case
+can be a byte or quad word register or memory location.
+    crc32 eax,dl          ; accumulate CRC32 on byte value
+    crc32 eax,word [ebx]  ; accumulate CRC32 on word value
+    crc32 rax,qword [rbx] ; accumulate CRC32 on quad word value
+  "popcnt" calculates the number of bits set in the source operand, which can
+be 16-bit, 32-bit, or 64-bit general purpose register or memory location,
+and stores this count in the destination operand, which has to be register of
+the same size as source operand. The 64-bit variant is available only in long
+mode.
+    popcnt ecx,eax        ; count bits set to 1
+  The SSE4a extension, which also includes the "popcnt" instruction introduced
+by SSE4.2, at the same time adds the "lzcnt" instruction, which follows the
+same syntax, and calculates the count of leading zero bits in source operand
+(if the source operand is all zero bits, the total number of bits in source
+operand is stored in destination).
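+  For example (registers chosen arbitrarily for illustration):
+    lzcnt ecx,eax         ; count leading zero bits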
+  "extrq" extracts the sequence of bits from the low quad word of SSE register
+provided as first operand and stores them at the low end of this register,
+filling the remaining bits in the low quad word with zeros. The position of bit
+string and its length can either be provided with two 8-bit immediate values
+as second and third operand, or by SSE register as second operand (and there
+is no third operand in such case), which should contain position value in bits
+8-13 and length of bit string in bits 0-5.
+    extrq xmm0,8,7        ; extract 8 bits from position 7
+    extrq xmm0,xmm5       ; extract bits defined by register
+  "insertq" writes the sequence of bits from the low quad word of the source
+operand into specified position in low quad word of the destination operand,
+leaving the other bits in low quad word of destination intact. The position
+where bits should be written and the length of bit string can either be
+provided with two 8-bit immediate values as third and fourth operand, or by
+the bit fields in source operand (and there are only two operands in such
+case), which should contain position value in bits 72-77 and length of bit
+string in bits 64-69.
+    insertq xmm1,xmm0,4,2 ; insert 4 bits at position 2
+    insertq xmm1,xmm0     ; insert bits defined by register
+  "movntss" and "movntsd" store single or double precision floating point
+value from the source SSE register into 32-bit or 64-bit destination memory
+location respectively, using non-temporal hint.
+2.1.21  AVX instructions
+The Advanced Vector Extensions introduce instructions that are new variants
+of SSE instructions, with new scheme of encoding that allows extended syntax
+having a destination operand separate from all the source operands. It also
+introduces 256-bit AVX registers, which extend up the old 128-bit SSE
+registers. Any AVX instruction that puts some result into SSE register, puts
+zero bits into high portion of the AVX register containing it.
+  The AVX version of SSE instruction has the mnemonic obtained by prepending
+SSE instruction name with "v". For any SSE arithmetic instruction which had a
+destination operand also being used as one of the source values, the AVX
+variant has a new syntax with three operands - the destination and two sources.
+The destination and first source can be SSE registers, and second source can be
+SSE register or memory. If the operation is performed on single pair of values,
+the remaining bits of first source SSE register are copied into the
+destination register.
+    vsubss xmm0,xmm2,xmm3         ; subtract two 32-bit floats
+    vmulsd xmm0,xmm7,qword [esi]  ; multiply two 64-bit floats
+In case of packed operations, each instruction can also operate on the 256-bit
+data size when the AVX registers are specified instead of SSE registers, and
+the size of memory operand is also doubled then.
+    vaddps ymm1,ymm5,yword [esi]  ; eight sums of 32-bit float pairs
+The instructions that operate on packed integer types (in particular the ones
+that earlier had been promoted from MMX to SSE) also acquired the new syntax
+with three operands, however they are only allowed to operate on 128-bit
+packed types and thus cannot use the whole AVX registers.
+    vpavgw xmm3,xmm0,xmm2         ; average of 16-bit integers
+    vpslld xmm1,xmm0,1            ; shift double words left
+If the SSE version of instruction had a syntax with three operands, the third
+one being an immediate value, the AVX version of such instruction takes four
+operands, with immediate remaining the last one.
+    vshufpd ymm0,ymm1,ymm2,10010011b ; shuffle 64-bit floats
+    vpalignr xmm0,xmm4,xmm2,3        ; extract byte aligned value
+The promotion to new syntax according to the rules described above has been
+applied to all the instructions from SSE extensions up to SSE4, with the
+exceptions described below.
+  "vdppd" instruction has syntax extended to four operands, but it does not
+have a 256-bit version.
+  There are a few instructions, namely "vsqrtpd", "vsqrtps", "vrcpps" and
+"vrsqrtps", which can operate on 256-bit data size, but retained the syntax
+with only two operands, because they use data from only one source:
+    vsqrtpd ymm1,ymm0         ; put square roots into other register
+In a similar way "vroundpd" and "vroundps" retained the syntax with three
+operands, the last one being immediate value.
+    vroundps ymm0,ymm1,0011b  ; round toward zero
+  Also some of the operations on packed integers kept their two-operand or
+three-operand syntax while being promoted to AVX version. In such case these
+instructions follow exactly the same rules for operands as their SSE
+counterparts (since operations on packed integers do not have 256-bit variants
+in AVX extension). These include "vpcmpestri", "vpcmpestrm", "vpcmpistri",
+"vpcmpistrm", "vphminposuw", "vpshufd", "vpshufhw", "vpshuflw". And there are
+more instructions that in AVX versions keep exactly the same syntax for
+operands as the one from SSE, without any additional options: "vcomiss",
+"vcomisd", "vcvtss2si", "vcvtsd2si", "vcvttss2si", "vcvttsd2si", "vextractps",
+"vpextrb", "vpextrw", "vpextrd", "vpextrq", "vmovd", "vmovq", "vmovntdqa",
+"vmaskmovdqu", "vpmovmskb", "vpmovsxbw", "vpmovsxbd", "vpmovsxbq", "vpmovsxwd",
+"vpmovsxwq", "vpmovsxdq", "vpmovzxbw", "vpmovzxbd", "vpmovzxbq", "vpmovzxwd",
+"vpmovzxwq" and "vpmovzxdq".
+  The move and conversion instructions have mostly been promoted to allow
+256-bit size operands in addition to the 128-bit variant with syntax identical
+to that from SSE version of the same instruction. Each of the "vcvtdq2ps",
+"vcvtps2dq", "vcvttps2dq", "vmovaps", "vmovapd", "vmovups", "vmovupd",
+"vmovdqa", "vmovdqu", "vlddqu", "vmovntps", "vmovntpd", "vmovntdq",
+"vmovsldup", "vmovshdup", "vmovmskps" and "vmovmskpd" inherits the 128-bit
+syntax from SSE without any changes, and also allows a new form with 256-bit
+operands in place of 128-bit ones.
+    vmovups [edi],ymm6        ; store unaligned 256-bit data
+  "vmovddup" has the identical 128-bit syntax as its SSE version, and it also
+has a 256-bit version, which stores the duplicates of the lowest quad word
+from the source operand in the lower half of destination operand, and in the
+upper half of destination the duplicates of the low quad word from the upper
+half of source. Both source and destination operands need then to be 256-bit
+values.
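+  As a brief illustration that is not part of the original manual (the
+register and address choices are arbitrary), the 256-bit form could be
+written like this:
```asm
    vmovddup ymm0,ymm1            ; duplicate the even-indexed doubles of ymm1
    vmovddup ymm2,yword [esi]     ; the same with a 256-bit memory source
```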
+  "vmovlhps" and "vmovhlps" have only 128-bit versions, and each takes three
+operands, which all must be SSE registers. "vmovlhps" copies two single
+precision values from the low quad word of second source register to the high
+quad word of destination register, and copies the low quad word of first
+source register into the low quad word of destination register. "vmovhlps"
+copies two single precision values from the high quad word of second source
+register to the low quad word of destination register, and copies the high
+quad word of first source register into the high quad word of destination
+register.
+  "vmovlps", "vmovhps", "vmovlpd" and "vmovhpd" have only 128-bit versions and
+their syntax varies depending on whether memory operand is a destination or
+source. When memory is destination, the syntax is identical to the one of
+equivalent SSE instruction, and when memory is source, the instruction requires
+three operands, first two being SSE registers and the third one 64-bit memory.
+The value put into destination is then the value copied from first source with
+either low or high quad word replaced with value from second source (the
+memory operand).
+    vmovhps [esi],xmm7       ; store upper half to memory
+    vmovlps xmm0,xmm7,[ebx]  ; low from memory, rest from register
+  "vmovss" and "vmovsd" have syntax identical to their SSE equivalents as long
+as one of the operands is memory, while the versions that operate purely on
+registers require three operands (each being SSE register). The value stored
+in destination is then the value copied from first source with lowest data
+element replaced with the lowest value from second source.
+    vmovss xmm3,[edi]        ; low from memory, rest zeroed
+    vmovss xmm0,xmm1,xmm2    ; one value from xmm2, three from xmm1
+  "vcvtss2sd", "vcvtsd2ss", "vcvtsi2ss" and "vcvtsi2sd" use the three-operand
+syntax, where destination and first source are always SSE registers, and the
+second source follows the same rules as the source in syntax of equivalent
+SSE instruction. The value stored in destination is then the value copied from
+first source with lowest data element replaced with the result of conversion.
+    vcvtsi2sd xmm4,xmm4,ecx  ; 32-bit integer to 64-bit float
+    vcvtsi2ss xmm0,xmm0,rax  ; 64-bit integer to 32-bit float
+  "vcvtdq2pd" and "vcvtps2pd" allow the same syntax as their SSE equivalents,
+plus the new variants with AVX register as destination and SSE register or
+128-bit memory as source. Analogously "vcvtpd2dq", "vcvttpd2dq" and
+"vcvtpd2ps", in addition to variant with syntax identical to SSE version,
+allow a variant with SSE register as destination and AVX register or 256-bit
+memory as source.
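+  The widening and narrowing variants might look as follows. This is only a
+sketch added for illustration, with arbitrarily chosen registers and
+addresses; the size operators are spelled out so that the memory size is
+unambiguous:
```asm
    vcvtps2pd ymm0,xmm1           ; four floats widened to four doubles
    vcvtdq2pd ymm2,xword [esi]    ; four 32-bit integers to four doubles
    vcvtpd2ps xmm0,ymm1           ; four doubles narrowed to four floats
    vcvttpd2dq xmm3,yword [ebx]   ; four doubles truncated to 32-bit integers
```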
+  "vinsertps", "vpinsrb", "vpinsrw", "vpinsrd", "vpinsrq" and "vpblendw" use
+a syntax with four operands, where destination and first source have to be SSE
+registers, and the third and fourth operand follow the same rules as second
+and third operand in the syntax of equivalent SSE instruction. Value stored in
+destination is the value copied from first source with some data elements
+replaced with values extracted from the second source, analogously to the
+operation of corresponding SSE instruction.
+    vpinsrd xmm0,xmm0,eax,3  ; insert double word
+  "vblendvps", "vblendvpd" and "vpblendvb" use a new syntax with four register
+operands: destination, two sources and a mask, where second source can also be
+a memory operand. "vblendvps" and "vblendvpd" have 256-bit variant, where
+operands are AVX registers or 256-bit memory, as well as 128-bit variant,
+which has operands being SSE registers or 128-bit memory. "vpblendvb" has only
+a 128-bit variant. Value stored in destination is the value copied from the
+first source with some data elements replaced, according to mask, by values
+from the second source.
+    vblendvps ymm3,ymm1,ymm2,ymm7  ; blend according to mask
+  "vptest" allows the same syntax as its SSE version and also has a 256-bit
+version, with both operands doubled in size. There are also two new
+instructions, "vtestps" and "vtestpd", which perform analogous tests, but only
+of the sign bits of corresponding single precision or double precision values,
+and set the ZF and CF accordingly. They follow the same syntax rules as
+"vptest".
+    vptest ymm0,yword [ebx]  ; test 256-bit values
+    vtestpd xmm0,xmm1        ; test sign bits of 64-bit floats
+  "vbroadcastss", "vbroadcastsd" and "vbroadcastf128" are new instructions,
+which broadcast the data element defined by source operand into all elements
+of corresponding size in the destination register. "vbroadcastss" needs
+source to be 32-bit memory and destination to be either SSE or AVX register.
+"vbroadcastsd" requires 64-bit memory as source, and AVX register as
+destination. "vbroadcastf128" requires 128-bit memory as source, and AVX
+register as destination.
+    vbroadcastss ymm0,dword [eax]  ; get eight copies of value
+  "vinsertf128" is the new instruction, which takes four operands. The
+destination and first source have to be AVX registers, second source can be
+SSE register or 128-bit memory location, and fourth operand should be an
+immediate value. It stores in destination the value obtained by taking
+contents of first source and replacing one of its 128-bit units with value of
+the second source. The lowest bit of fourth operand specifies at which
+position that replacement is done (either 0 or 1).
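+  A possible use, given here only as an illustrative sketch with arbitrary
+operands (it does not come from the original manual):
```asm
    vinsertf128 ymm0,ymm1,xmm2,1         ; ymm1 with upper 128 bits taken from xmm2
    vinsertf128 ymm3,ymm3,xword [esi],0  ; replace lower 128 bits of ymm3 from memory
```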
+  "vextractf128" is the new instruction with three operands. The destination
+needs to be SSE register or 128-bit memory location, the source must be AVX
+register, and the third operand should be an immediate value. It extracts
+into destination one of the 128-bit units from source. The lowest bit of third
+operand specifies, which unit is extracted.
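+  The reverse operation could be sketched in the same way (again an added
+example with arbitrary operands):
```asm
    vextractf128 xmm0,ymm1,1           ; copy upper 128 bits of ymm1 into xmm0
    vextractf128 xword [edi],ymm1,0    ; store lower 128 bits of ymm1 in memory
```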
+  "vmaskmovps" and "vmaskmovpd" are the new instructions with three operands
+that selectively store in destination the elements from second source
+depending on the sign bits of corresponding elements from first source. These
+instructions can operate on either 128-bit data (SSE registers) or 256-bit
+data (AVX registers). Either destination or second source has to be a memory
+location of appropriate size, the two other operands should be registers.
+    vmaskmovps [edi],xmm0,xmm5  ; conditionally store
+    vmaskmovpd ymm5,ymm0,[esi]  ; conditionally load
+  "vpermilpd" and "vpermilps" are the new instructions with three operands
+that permute the values from first source according to the control fields from
+second source and put the result into destination operand. They accept
+either three SSE registers or three AVX registers as operands, and the second
+source can be a memory of size equal to the registers used. In alternative
+form the second source can be immediate value and then the first source
+can be a memory location of the size equal to destination register.
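+  Both forms are sketched below. The example is an addition to the manual and
+the immediate values are only sample choices: "vpermilps" takes a 2-bit index
+per element from the immediate, and "vpermilpd" takes one bit per element,
+selecting within each 128-bit half.
```asm
    vpermilps xmm0,xmm1,xmm2        ; control fields taken from xmm2
    vpermilps xmm0,[esi],00011011b  ; reverse the order of four floats from memory
    vpermilpd ymm0,ymm1,0101b       ; swap the two doubles in each 128-bit half
```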
+  "vperm2f128" is the new instruction with four operands, which selects
+128-bit blocks of floating point data from first and second source according
+to the bit fields from fourth operand, and stores them in destination.
+Destination and first source need to be AVX registers, second source can be
+AVX register or 256-bit memory area, and fourth operand should be an immediate
+value.
+    vperm2f128 ymm0,ymm6,ymm7,12h  ; permute 128-bit blocks
+  "vzeroall" instruction sets all the AVX registers to zero. "vzeroupper" sets
+the upper 128-bit portions of all AVX registers to zero, leaving the SSE
+registers intact. These new instructions take no operands.
+  "vldmxcsr" and "vstmxcsr" are the AVX versions of "ldmxcsr" and "stmxcsr"
+instructions. The rules for their operands remain unchanged.
+.1.22  AVX2 instructions
+The AVX2 extension allows all the AVX instructions operating on packed integers
+to use 256-bit data types, and introduces some new instructions as well.
+  The AVX instructions that operate on packed integers and had only 128-bit
+variants have been supplemented with 256-bit variants, and thus their syntax
+rules became analogous to AVX instructions operating on packed floating point
+types.
+    vpsubb ymm0,ymm0,[esi]   ; subtract 32 packed bytes
+    vpavgw ymm3,ymm0,ymm2    ; average of 16-bit integers
+However there are some instructions that have not been equipped with the
+256-bit variants. "vpcmpestri", "vpcmpestrm", "vpcmpistri", "vpcmpistrm",
+"vpextrb", "vpextrw", "vpextrd", "vpextrq", "vpinsrb", "vpinsrw", "vpinsrd",
+"vpinsrq" and "vphminposuw" are not affected by AVX2 and allow only the
+128-bit operands.
+  The packed shift instructions, which allowed the third operand specifying
+amount to be SSE register or 128-bit memory location, use the same rules
+for the third operand in their 256-bit variant.
+    vpsllw ymm2,ymm2,xmm4        ; shift words left
+    vpsrad ymm0,ymm3,xword [ebx] ; shift double words right
+  There are also new packed shift instructions with standard three-operand AVX
+syntax, which shift each element from first source by the amount specified in
+corresponding element of second source, and store the results in destination.
+"vpsllvd" shifts 32-bit elements left, "vpsllvq" shifts 64-bit elements left,
+"vpsrlvd" shifts 32-bit elements right logically, "vpsrlvq" shifts 64-bit
+elements right logically and "vpsravd" shifts 32-bit elements right
+arithmetically.
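+  For instance (an illustrative sketch that is not part of the original
+manual, with arbitrary operands):
```asm
    vpsllvd ymm0,ymm1,ymm2       ; shift each dword of ymm1 left by count from ymm2
    vpsrlvq xmm3,xmm4,[esi]      ; per-element logical right shift, counts in memory
    vpsravd ymm5,ymm6,ymm7       ; per-element arithmetic right shift of dwords
```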
+  The sign-extend and zero-extend instructions, which in AVX versions allowed
+source operand to be SSE register or a memory of specific size, in the new
+256-bit variant need memory of that size doubled or SSE register as source and
+AVX register as destination.
+    vpmovzxbq ymm0,dword [esi]   ; bytes to quad words
+  Also "vmovntdqa" has been upgraded with 256-bit variant, so it can
+transfer a 256-bit value from memory to AVX register; it needs the memory
+address to be aligned to 32 bytes.
+  "vpmaskmovd" and "vpmaskmovq" are the new instructions with syntax identical
+to "vmaskmovps" or "vmaskmovpd", and they perform analogous operation on
+packed 32-bit or 64-bit values.
+  "vinserti128", "vextracti128", "vbroadcasti128" and "vperm2i128" are the new
+instructions with syntax identical to "vinsertf128", "vextractf128",
+"vbroadcastf128" and "vperm2f128" respectively, and they perform analogous
+operations on 128-bit blocks of integer data.
+  "vbroadcastss" and "vbroadcastsd" instructions have been extended to allow
+SSE register as a source operand (which in AVX could only be a memory).
+  "vpbroadcastb", "vpbroadcastw", "vpbroadcastd" and "vpbroadcastq" are the
+new instructions which broadcast the byte, word, double word or quad word from
+the source operand into all elements of corresponding size in the destination
+register. The destination operand can be either SSE or AVX register, and the
+source operand can be SSE register or memory of size equal to the size of data
+element.
+    vpbroadcastb ymm0,byte [ebx]  ; get 32 identical bytes
+  "vpermd" and "vpermps" are new three-operand instructions, which use each
+32-bit element from first source as an index of element in second source which
+is copied into destination at position corresponding to element containing
+index. The destination and first source have to be AVX registers, and the
+second source can be AVX register or 256-bit memory.
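+  A short sketch of this syntax (added for illustration, with registers
+chosen arbitrarily):
```asm
    vpermd ymm0,ymm1,ymm2          ; ymm1 holds the indexes, ymm2 the data
    vpermps ymm3,ymm1,yword [esi]  ; the same indexes applied to floats in memory
```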
+  "vpermq" and "vpermpd" are new three-operand instructions, which use 2-bit
+indexes from the immediate value specified as third operand to determine which
+element from source to store at given position in destination. The destination
+has to be AVX register, source can be AVX register or 256-bit memory, and the
+third operand must be 8-bit immediate value.
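+  As an added illustration (not from the original manual; the immediate
+values are just sample selections), each pair of bits in the third operand,
+starting from the lowest, picks the source element for the next position in
+destination:
```asm
    vpermq ymm0,ymm1,00011011b     ; reverse the order of four quad words
    vpermpd ymm2,yword [esi],0     ; copy the lowest double to all four positions
```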
+  The family of new instructions performing "gather" operation have special
+syntax, as in their memory operand they use addressing mode that is unique to
+them. The base of address can be a 32-bit or 64-bit general purpose register
+(the latter only in long mode), and the index (possibly multiplied by scale
+value, as in standard addressing) is specified by SSE or AVX register. It is
+possible to use only index without base and any numerical displacement can be
+added to the address. Each of those instructions takes three operands. First
+operand is the destination register, second operand is memory addressed with
+a vector index, and third operand is register containing a mask. The most
+significant bit of each element of mask determines whether a value will be
+loaded from memory into corresponding element in destination. The address of
+each element to load is determined by using the corresponding element from
+index register in memory operand to calculate final address with given base
+and displacement. When the index register contains fewer elements than the
+destination and mask registers, the higher elements of destination are zeroed.
+After the value is successfully loaded, the corresponding element in mask
+register is set to zero. The destination, index and mask should all be
+distinct registers; it is not allowed to use the same register in two
+different roles.
+  "vgatherdps" loads single precision floating point values addressed by
+32-bit indexes. The destination, index and mask should all be registers of the
+same type, either SSE or AVX. The data addressed by memory operand is 32-bit
+in size.
+    vgatherdps xmm0,[eax+xmm1],xmm3    ; gather four floats
+    vgatherdps ymm0,[ebx+ymm7*4],ymm3  ; gather eight floats
+  "vgatherqps" loads single precision floating point values addressed by
+64-bit indexes. The destination and mask should always be SSE registers, while
+index register can be either SSE or AVX register. The data addressed by memory
+operand is 32-bit in size.
+    vgatherqps xmm0,[xmm2],xmm3        ; gather two floats
+    vgatherqps xmm0,[ymm2+64],xmm3     ; gather four floats
+  "vgatherdpd" loads double precision floating point values addressed by
+32-bit indexes. The index register should always be SSE register, the
+destination and mask should be two registers of the same type, either SSE or
+AVX. The data addressed by memory operand is 64-bit in size.
+    vgatherdpd xmm0,[ebp+xmm1],xmm3    ; gather two doubles
+    vgatherdpd ymm0,[xmm3*8],ymm5      ; gather four doubles
+  "vgatherqpd" loads double precision floating point values addressed by
+64-bit indexes. The destination, index and mask should all be registers of the
+same type, either SSE or AVX. The data addressed by memory operand is 64-bit
+in size.
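+  By analogy with the examples given for the other gather instructions, the
+two variants could be written as follows (a sketch added for illustration,
+with arbitrary registers):
```asm
    vgatherqpd xmm0,[xmm1],xmm2        ; gather two doubles
    vgatherqpd ymm0,[ebx+ymm1*8],ymm2  ; gather four doubles
```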
+  "vpgatherdd" and "vpgatherqd" load 32-bit values addressed by either 32-bit
+or 64-bit indexes. They follow the same rules as "vgatherdps" and "vgatherqps"
+respectively.
+  "vpgatherdq" and "vpgatherqq" load 64-bit values addressed by either 32-bit
+or 64-bit indexes. They follow the same rules as "vgatherdpd" and "vgatherqpd"
+respectively.
+.1.23  Auxiliary sets of computational instructions
+  There is a number of additional instruction set extensions related to
+AVX. They introduce new vector instructions (and sometimes also their SSE
+equivalents that use classic instruction encoding), and even some new
+instructions operating on general registers that use the AVX-like encoding
+allowing the extended syntax with separate destination and source operands.
+The CPU support for each of these instruction sets needs to be determined
+separately.
+  The AES extension provides a specialized set of instructions for the
+purpose of cryptographic computations defined by Advanced Encryption Standard.
+Each of these instructions has two versions: the AVX one and the one with
+SSE-like syntax that uses classic encoding. Refer to the Intel manuals for the
+details of operation of these instructions.
+  "aesenc" and "aesenclast" perform a single round of AES encryption on data
+from first source with a round key from second source, and store result in
+destination. The destination and first source are SSE registers, and the
+second source can be SSE register or 128-bit memory. The AVX versions of these
+instructions, "vaesenc" and "vaesenclast", use the syntax with three operands,
+while the SSE-like version has only two operands, with first operand being
+both the destination and first source.
+  "aesdec" and "aesdeclast" perform a single round of AES decryption on data
+from first source with a round key from second source. The syntax rules for
+them and their AVX versions are the same as for "aesenc".
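+  The difference between the two syntaxes can be seen in this sketch (added
+for illustration only; it assumes that the round keys have already been
+placed in registers or at the address in EBX, and it does not show a complete
+cipher):
```asm
    aesenc xmm0,xmm1             ; xmm0 is both the data and the destination
    vaesenc xmm0,xmm1,[ebx]      ; data from xmm1, round key from memory
    vaesenclast xmm2,xmm0,xmm7   ; final round with key from xmm7
    aesdec xmm3,xmm4             ; one round of decryption, same operand rules
```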
+  "aesimc" performs the InvMixColumns transformation of source operand and
+stores the result in destination. Both "aesimc" and "vaesimc" use only two
+operands, destination being SSE register, and source being SSE register or
+128-bit memory location.
+  "aeskeygenassist" is a helper instruction for generating the round key.
+It needs three operands: destination being SSE register, source being SSE
+register or 128-bit memory, and third operand being 8-bit immediate value.
+The AVX version of this instruction uses the same syntax.
+  The CLMUL extension introduces just one instruction, "pclmulqdq", and its
+AVX version as well. This instruction performs a carryless multiplication of
+two 64-bit values selected from first and second source according to the bit
+fields in immediate value. The destination and first source are SSE registers,
+second source is SSE register or 128-bit memory, and immediate value is
+provided as last operand. "vpclmulqdq" takes four operands, while "pclmulqdq"
+takes only three operands, with the first one serving both the role of
+destination and first source.
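+  In the sketch below (an added example with arbitrary operands) bit 0 of
+the immediate selects the quad word of first source and bit 4 selects the
+quad word of second source, which is the encoding described in the Intel
+manuals:
```asm
    pclmulqdq xmm0,xmm1,0              ; low quad word of xmm0 by low quad word of xmm1
    vpclmulqdq xmm0,xmm1,xmm2,10001b   ; high quad words of xmm1 and xmm2
```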
+  The FMA (Fused Multiply-Add) extension introduces additional AVX
+instructions which perform multiplication and summation as single operation.
+Each one takes three operands, first one serving both the role of destination
+and first source, and the following ones being the second and third source.
+The mnemonic of FMA instruction is obtained by appending to "vf" prefix: first
+either "m" or "nm" to select whether result of multiplication should be taken
+as-is or negated, then either "add" or "sub" to select whether third value
+will be added to the product or subtracted from the product, then either
+"132", "213" or "231" to select which source operands are multiplied and which
+one is added or subtracted, and finally the type of data on which the
+instruction operates, either "ps", "pd", "ss" or "sd". As it was with SSE
+instructions promoted to AVX, instructions operating on packed floating point
+values allow 128-bit or 256-bit syntax, in former all the operands are SSE
+registers, but the third one can also be a 128-bit memory, in latter the
+operands are AVX registers and the third one can also be a 256-bit memory.
+Instructions that compute just one floating point result need operands to be
+SSE registers, and the third operand can also be a memory, either 32-bit for
+single precision or 64-bit for double precision.
+    vfmsub231ps ymm1,ymm2,ymm3     ; multiply and subtract
+    vfnmadd132sd xmm0,xmm5,[ebx]   ; multiply, negate and add
+In addition to the instructions created by the rule described above, there are
+families of instructions with mnemonics starting with either "vfmaddsub" or
+"vfmsubadd", followed by either "132", "213" or "231" and then either "ps" or
+"pd" (the operation must always be on packed values in this case). They add
+to the result of multiplication or subtract from it depending on the position
+of value in packed data - instructions from the "vfmaddsub" group add when the
+position is odd and subtract when the position is even, instructions from the
+"vfmsubadd" group add when the position is even and subtract when the
+position is odd. The rules for operands are the same as for other FMA
+instructions.
+  The FMA4 instructions are similar to FMA, but use syntax with four operands
+and thus allow destination to be different than all the sources. Their
+mnemonics are identical to FMA instructions with the "132", "213" or "231" cut
+out, as having separate destination operand makes such selection of operands
+superfluous. The multiplication is always performed on values from the first
+and second source, and then the value from third source is added or
+subtracted. Either second or third source can be a memory operand, and the
+rules for the sizes of operands are the same as for FMA instructions.
+    vfmaddpd ymm0,ymm1,[esi],ymm2  ; multiply and add
+    vfmsubss xmm0,xmm1,xmm2,[ebx]  ; multiply and subtract
+  The F16C extension consists of two instructions, "vcvtps2ph" and
+"vcvtph2ps", which convert floating point values between single precision and
+half precision (the 16-bit floating point format). "vcvtps2ph" takes three
+operands: destination, source, and rounding controls. The third operand is
+always an immediate, the source is either SSE or AVX register containing
+single precision values, and the destination is SSE register or memory, the
+size of memory is 64 bits when the source is SSE register and 128 bits when
+the source is AVX register. "vcvtph2ps" takes two operands, the destination
+that can be SSE or AVX register, and the source that is SSE register or memory
+with size of the half of destination operand's size.
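+  A sketch of both directions follows. It is an added illustration with
+arbitrary operands; the rounding control of 0 is assumed here to mean
+rounding to nearest, as defined in the Intel manuals.
```asm
    vcvtps2ph qword [edi],xmm0,0  ; four floats to four halves in 64-bit memory
    vcvtps2ph xmm1,ymm0,0         ; eight floats to eight halves in SSE register
    vcvtph2ps ymm2,xword [esi]    ; eight halves from memory to eight floats
```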
+  The AMD XOP extension introduces a number of new vector instructions with
+encoding and syntax analogous to AVX instructions. "vfrczps", "vfrczss",
+"vfrczpd" and "vfrczsd" extract fractional portions of single or double
+precision values, they all take two operands. The packed operations allow
+either SSE or AVX register as destination, for the other two it has to be SSE
+register. Source can be register of the same type as destination, or memory
+of appropriate size (256-bit for destination being AVX register, 128-bit for
+packed operation with destination being SSE register, 64-bit for operation
+on a solitary double precision value and 32-bit for operation on a solitary
+single precision value).
+    vfrczps ymm0,[esi]           ; load fractional parts
+  "vpcmov" copies bits from either first or second source into destination
+depending on the values of corresponding bits in the fourth operand (the
+selector). If the bit in selector is set, the corresponding bit from first
+source is copied into the same position in destination, otherwise the bit from
+second source is copied. Either second source or selector can be memory
+location, 128-bit or 256-bit depending on whether SSE registers or AVX
+registers are specified as the other operands.
+    vpcmov xmm0,xmm1,xmm2,[ebx]  ; selector in memory
+    vpcmov ymm0,ymm5,[esi],ymm2  ; source in memory
+The family of packed comparison instructions take four operands, the
+destination and first source being SSE register, second source being SSE
+register or 128-bit memory and the fourth operand being immediate value
+defining the type of comparison. The mnemonic of instruction is created
+by appending to "vpcom" prefix either "b" or "ub" to compare signed or
+unsigned bytes, "w" or "uw" to compare signed or unsigned words, "d" or "ud"
+to compare signed or unsigned double words, "q" or "uq" to compare signed or
+unsigned quad words. The respective values from the first and second source
+are compared and the corresponding data element in destination is set to
+either all ones or all zeros depending on the result of comparison. The fourth
+operand has to specify one of the eight comparison types (table 2.5). All
+these instructions also have variants with only three operands and the type
+of comparison encoded within the instruction name by inserting the comparison
+mnemonic after "vpcom".
+    vpcomb   xmm0,xmm1,xmm2,4    ; test for equal bytes
+    vpcomgew xmm0,xmm1,[ebx]     ; compare signed words
+   Table 2.5  XOP comparisons
+  /-------------------------------------------\
+  | Code | Mnemonic | Description             |
+  |======|==========|=========================|
+  | 0    | lt       | less than               |
+  | 1    | le       | less than or equal      |
+  | 2    | gt       | greater than            |
+  | 3    | ge       | greater than or equal   |
+  | 4    | eq       | equal                   |
+  | 5    | neq      | not equal               |
+  | 6    | false    | false                   |
+  | 7    | true     | true                    |
+  \-------------------------------------------/
+  "vpermil2ps" and "vpermil2pd" set the elements in destination register to
+zero or to a value selected from first or second source depending on the
+corresponding bit fields from the fourth operand (the selector) and the
+immediate value provided in fifth operand. Refer to the AMD manuals for the
+detailed explanation of the operation performed by these instructions. Each
+of the first four operands can be a register, and either second source or
+selector can be memory location, 128-bit or 256-bit depending on whether SSE
+registers or AVX registers are used for the other operands.
+    vpermil2ps ymm0,ymm3,ymm7,ymm2,0  ; permute from two sources
+  "vphaddbw" adds pairs of adjacent signed bytes to form 16-bit values and
+stores them at the same positions in destination. "vphaddubw" does the same
+but treats the bytes as unsigned. "vphaddbd" and "vphaddubd" sum all bytes
+(either signed or unsigned) in each four-byte block to 32-bit results,
+"vphaddbq" and "vphaddubq" sum all bytes in each eight-byte block to
+64-bit results, "vphaddwd" and "vphadduwd" add pairs of words to 32-bit
+results, "vphaddwq" and "vphadduwq" sum all words in each four-word block to
+64-bit results, "vphadddq" and "vphaddudq" add pairs of double words to 64-bit
+results. "vphsubbw" subtracts in each two-byte block the byte at higher
+position from the one at lower position, and stores the result as a signed
+16-bit value at the corresponding position in destination, "vphsubwd"
+subtracts in each two-word block the word at higher position from the one at
+lower position and makes signed 32-bit results, "vphsubdq" subtracts in each
+block of two double words the one at higher position from the one at lower
+position and makes signed 64-bit results. Each of these instructions takes
+two operands, the destination being SSE register, and the source being SSE
+register or 128-bit memory.
+    vphadduwq xmm0,xmm1          ; sum quadruplets of words
+  "vpmacsww" and "vpmacssww" multiply the corresponding signed 16-bit values
+from the first and second source and then add the products to the parallel
+values from the third source, then "vpmacsww" takes the lowest 16 bits of the
+result and "vpmacssww" saturates the result down to 16-bit value, and they
+store the final 16-bit results in the destination. "vpmacsdd" and "vpmacssdd"
+perform the analogous operation on 32-bit values. "vpmacswd" and "vpmacsswd" do
+the same calculation only on the low 16-bit values from each 32-bit block and
+form the 32-bit results. "vpmacsdql" and "vpmacssdql" perform such operation
+on the low 32-bit values from each 64-bit block and form the 64-bit results,
+while "vpmacsdqh" and "vpmacssdqh" do the same on the high 32-bit values from
+each 64-bit block, also forming the 64-bit results. "vpmadcswd" and
+"vpmadcsswd" multiply the corresponding signed 16-bit values from the first
+and second source, then sum each pair of adjacent products and add this sum
+to the corresponding 32-bit element from third source, storing the truncated
+or saturated result
+in destination. All these instructions take four operands, the second source
+can be 128-bit memory or SSE register, all the other operands have to be
+SSE registers.
+    vpmacsdd xmm6,xmm1,[ebx],xmm6  ; accumulate product
+  "vpperm" selects bytes from first and second source, optionally applies a
+separate transformation to each of them, and stores them in the destination.
+The bit fields in fourth operand (the selector) specify for each position in
+destination what byte from which source is taken and what operation is applied
+to it before it is stored there. Refer to the AMD manuals for the detailed
+information about these bit fields. This instruction takes four operands,
+either second source or selector can be a 128-bit memory (or they can be SSE
+registers both), all the other operands have to be SSE registers.
+  "vpshlb", "vpshlw", "vpshld" and "vpshlq" shift logically bytes, words, double
+words or quad words respectively. The amount of bits to shift by is specified
+for each element separately by the signed byte placed at the corresponding
+position in the third operand. The source containing elements to shift is
+provided as second operand. Either second or third operand can be 128-bit
+memory (or they can be SSE registers both) and the other operands have to be
+SSE registers.
+    vpshld xmm3,xmm1,[ebx]       ; shift double words from xmm1
+"vpshab", "vpshaw", "vpshad" and "vpshaq" arithmetically shift bytes, words,
+double words or quad words. These instructions follow the same rules as the
+logical shifts described above. "vprotb", "vprotw", "vprotd" and "vprotq"
+rotate bytes, words, double words or quad words. They follow the same rules as
+shifts, but additionally allow third operand to be immediate value, in which
+case the same amount of rotation is specified for all the elements in source.
+    vprotb xmm0,[esi],3          ; rotate bytes to the left
+  The MOVBE extension introduces just one new instruction, "movbe", which
+swaps bytes in value from source before storing it in destination, so it can
+be used to load and store big endian values. It takes two operands, either
+the destination or source should be a 16-bit, 32-bit or 64-bit memory (the
+last one being only allowed in long mode), and the other operand should be
+a general register of the same size.
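+  For illustration only (the registers and the addressing are an arbitrary
+choice), a big endian double word could be loaded and a word stored back
+like this:
+    movbe eax,[esi]      ; load dword, reversing its bytes
+    movbe [edi],ax       ; store word, reversing its bytes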
+  The BMI extension, consisting of two subsets - BMI1 and BMI2, introduces
+new instructions operating on general registers, which use the same encoding
+as AVX instructions and so allow the extended syntax. All these instructions
+use 32-bit operands, and in long mode they also allow the forms with 64-bit
+operands.
+  "andn" calculates the bitwise AND of second source with the inverted bits
+of first source and stores the result in destination. The destination and
+the first source have to be general registers, the second source can be
+general register or memory.
+    andn edx,eax,[ebx]   ; bit-multiply inverted eax with memory
+  "bextr" extracts from the first source the sequence of bits using an index
+and length specified by bit fields in the second source operand and stores
+it into destination. The lowest 8 bits of second source specify the position
+of bit sequence to extract and the next 8 bits of second source specify the
+length of sequence. The first source can be a general register or memory,
+the other two operands have to be general registers.
+    bextr eax,[esi],ecx  ; extract bit field from memory
+  "blsi" extracts the lowest set bit from the source, setting all the other
+bits in destination to zero. The destination must be a general register,
+the source can be general register or memory.
+    blsi rax,r11         ; isolate the lowest set bit
+  "blsmsk" sets all the bits in the destination up to the lowest set bit in
+the source, including this bit. "blsr" copies all the bits from the source to
+destination except for the lowest set bit, which is replaced by zero. These
+instructions follow the same rules for operands as "blsi".
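+  To make the difference between "blsi", "blsmsk" and "blsr" concrete, assume
+(purely as an example) that EBX holds the value 01011000b:
+    blsi eax,ebx         ; eax = 00001000b, the lowest set bit alone
+    blsmsk ecx,ebx       ; ecx = 00001111b, mask up to that bit
+    blsr edx,ebx         ; edx = 01010000b, that bit cleared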
+  "tzcnt" counts the number of trailing zero bits, that is the zero bits up to
+the lowest set bit of source value. This instruction is analogous to "lzcnt"
+and follows the same rules for operands, so it also has a 16-bit version,
+unlike the other BMI instructions.
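+  For example, assuming that BX holds the value 01011000b:
+    tzcnt ax,bx          ; ax = 3, three zero bits below the lowest set bit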
+  "bzhi" is a BMI2 instruction, which copies the bits from first source to
+destination, zeroing all the bits up from the position specified by second
+source. It follows the same rules for operands as "bextr".
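+  For example, assuming that ECX holds 4 and the double word at ESI is 0FFh:
+    bzhi eax,[esi],ecx   ; eax = 0Fh, bits 4 and above are zeroed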
+  "pext" uses a mask in second source operand to select bits from first
+source and puts the selected bits as a continuous sequence into destination.
+"pdep" performs the reverse operation - it takes sequence of bits from the
+first source and puts them consecutively at the positions where the bits in
+second source are set, setting all the other bits in destination to zero.
+These BMI2 instructions follow the same rules for operands as "andn".
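+  A sketch of both operations with assumed register contents, EBX holding
+10110110b and ECX holding the mask 11110000b:
+    pext eax,ebx,ecx     ; eax = 00001011b, masked bits packed at the bottom
+    pdep edx,eax,ecx     ; edx = 10110000b, bits spread back under the mask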
+  "mulx" is a BMI2 instruction which performs an unsigned multiplication of
+value from EDX or RDX register (depending on the size of specified operands)
+by the value from third operand, and stores the low half of result in the
+second operand, and the high half of result in the first operand, and it does
+it without affecting the flags. The third operand can be general register or
+memory, and both the destination operands have to be general registers.
+    mulx edx,eax,ecx     ; multiply edx by ecx into edx:eax
+  "shlx", "shrx" and "sarx" are BMI2 instructions, which perform logical or
+arithmetical shifts of value from first source by the amount specified by
+second source, and store the result in destination without affecting the
+flags. They have the same rules for operands as "bzhi" instruction.
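+  For instance, with arbitrarily chosen operands:
+    shlx eax,[ebx],ecx   ; eax = value at ebx shifted left by count in ecx
+    sarx edx,esi,edi     ; edx = esi shifted arithmetically right by edi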
+  "rorx" is a BMI2 instruction which rotates right the value from source
+operand by the constant amount specified in third operand and stores the
+result in destination without affecting the flags. The destination operand
+has to be general register, the source operand can be general register or
+memory, and the third operand has to be an immediate value.
+    rorx eax,edx,7       ; rotate without affecting flags
+  The TBM is an extension designed by AMD to supplement the BMI set. The
+"bextr" instruction is extended with a new form, in which second source is
+a 32-bit immediate value. "blsic" is a new instruction which performs the
+same operation as "blsi", but with the bits of result inverted. It uses the
+same rules for operands as "blsi". "blsfill" is a new instruction, which takes
+the value from source, sets all the bits below the lowest set bit and stores
+the result in destination. It also uses the same rules for operands as "blsi".
+  "blci", "blcic", "blcs", "blcmsk" and "blcfill" are instructions analogous
+to "blsi", "blsic", "blsr", "blsmsk" and "blsfill" respectively, but they
+perform the bit-inverted versions of the same operations. They follow the
+same rules for operands as the instructions they reflect.
+  "tzmsk" finds the lowest set bit in value from source operand, sets all bits
+below it to 1 and all the rest of bits to zero, then writes the result to
+destination. "t1mskc" finds the least significant zero bit in the value from
+source operand, sets the bits below it to zero and all the other bits to 1,
+and writes the result to destination. These instructions have the same rules
+for operands as "blsi".
+.1.24  Other extensions of instruction set
+There is a number of additional instruction set extensions recognized by flat
+assembler, and the general syntax of the instructions introduced by those
+extensions is provided here. For detailed information on the operations
+performed by them, check out the manuals from Intel (for the VMX, SMX, XSAVE,
+RDRAND, FSGSBASE, INVPCID, HLE and RTM extensions) or AMD (for the SVM
+extension).
+  The Virtual-Machine Extensions (VMX) provide a set of instructions for the
+management of virtual machines. The "vmxon" instruction, which enters the VMX
+operation, requires a single 64-bit memory operand, which should be a physical
+address of memory region, which the logical processor may use to support VMX
+operation. The "vmxoff" instruction, which leaves the VMX operation, has no
+operands. The "vmlaunch" and "vmresume", which launch or resume the virtual
+machines, and "vmcall", which allows guest software to call the VM monitor,
+use no operands either.
+  The "vmptrld" loads the physical address of current Virtual Machine Control
+Structure (VMCS) from its memory operand, "vmptrst" stores the pointer to
+current VMCS into address specified by its memory operand, and "vmclear" sets
+the launch state of the VMCS referenced by its memory operand to clear. These
+three instructions all require a single 64-bit memory operand.
+  The "vmread" reads from VMCS a field specified by the source operand and
+stores it into the destination operand. The source operand should be a
+general purpose register, and the destination operand can be a register or
+memory. The "vmwrite" writes into a VMCS field specified by the destination
+operand the value provided by source operand. The source operand can be a
+general purpose register or memory, and the destination operand must be a
+register. The size of operands for those instructions should be 64-bit when
+in long mode, and 32-bit otherwise.
+  The "invept" and "invvpid" invalidate the translation lookaside buffers
+(TLBs) and paging-structure caches, either derived from extended page tables
+(EPT), or based on the virtual processor identifier (VPID). These instructions
+require two operands, the first one being the general purpose register
+specifying the type of invalidation, and the second one being a 128-bit
+memory operand providing the invalidation descriptor. The first operand
+should be a 64-bit register when in long mode, and 32-bit register otherwise.
+  The Safer Mode Extensions (SMX) provide the functionalities available
+through the "getsec" instruction. This instruction takes no operands, and
+the function that is executed is determined by the contents of EAX register
+upon executing this instruction.
+  The Secure Virtual Machine (SVM) is a variant of virtual machine extension
+used by AMD. The "skinit" instruction securely reinitializes the processor
+allowing the startup of trusted software, such as the virtual machine monitor
+(VMM). This instruction takes a single operand, which must be EAX, and
+provides a physical address of the secure loader block (SLB).
+  The "vmrun" instruction is used to start a guest virtual machine,
+its only operand should be an accumulator register (AX, EAX or RAX, the
+last one available only in long mode) providing the physical address of the
+virtual machine control block (VMCB). The "vmsave" stores a subset of
+processor state into VMCB specified by its operand, and "vmload" loads the
+same subset of processor state from a specified VMCB. The same operand rules
+as for the "vmrun" apply to those two instructions.
+  "vmmcall" allows the guest software to call the VMM. This instruction takes
+no operands.
+  "stgi" sets the global interrupt flag to 1, and "clgi" zeroes it. These
+instructions take no operands.
+  "invlpga" invalidates the TLB mapping for a virtual page specified by the
+first operand (which has to be accumulator register) and address space
+identifier specified by the second operand (which must be ECX register).
+  The XSAVE set of instructions allows saving and restoring processor state
+components. "xsave" and "xsaveopt" store the components of processor state
+defined by bit mask in EDX and EAX registers into area defined by memory
+operand. "xrstor" restores from the area specified by memory operand the
+components of processor state defined by mask in EDX and EAX. The "xsave64",
+"xsaveopt64" and "xrstor64" are 64-bit versions of these instructions, allowed
+only in long mode.
+  "xgetbv" reads the contents of 64-bit XCR (extended control register)
+specified in ECX register into EDX and EAX registers. "xsetbv" writes the
+contents of EDX and EAX into the 64-bit XCR specified by ECX register. These
+instructions have no operands.
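+  A minimal sketch of reading an extended control register. That index 0
+selects XCR0 is a fact taken from the Intel manuals, not from this text:
+    xor ecx,ecx          ; select XCR0
+    xgetbv               ; edx:eax = contents of XCR0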
+  The RDRAND extension introduces one new instruction, "rdrand", which loads
+the hardware-generated random value into general register. It takes one
+operand, which can be 16-bit, 32-bit or 64-bit register (with the last one
+being allowed only in long mode).
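+  The sketch below retries until a value is delivered. It relies on a detail
+documented by Intel rather than stated above: the carry flag is set when
+the register received a valid random value. The label name is arbitrary.
+    retry:
+        rdrand eax       ; CF=1 when eax holds a valid random value
+        jnc retry        ; otherwise try again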
+  The FSGSBASE extension adds long mode instructions that allow reading and
+writing the segment base registers for FS and GS segments. "rdfsbase" and
+"rdgsbase" read the corresponding segment base registers into operand, while
+"wrfsbase" and "wrgsbase" write the value of operand into those registers.
+All these instructions take one operand, which can be 32-bit or 64-bit general
+register.
+  The INVPCID extension adds "invpcid" instruction, which invalidates mapping
+in the TLBs and paging caches based on the invalidation type specified in
+first operand and PCID invalidate descriptor specified in second operand.
+The first operand should be 32-bit general register when not in long mode,
+or 64-bit general register when in long mode. The second operand should be
+128-bit memory location.
+  The HLE and RTM extensions provide a set of instructions for transactional
+management. The "xacquire" and "xrelease" are new prefixes that can be used
+with some of the instructions to start or end lock elision on the memory
+address specified by prefixed instruction. The "xbegin" instruction starts
+the transactional execution, its operand is the address of a fallback routine
+that gets executed in case of transaction abort, specified like the operand
+for near jump instruction. "xend" marks the end of transactional execution
+region, it takes no operands. "xabort" forces the transaction abort, it takes
+an 8-bit immediate value as its only operand, this value is passed in the
+highest bits of EAX to the fallback routine. "xtest" checks whether there is
+transactional execution in progress, this instruction takes no operands.
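+  A rough outline of a transactional region follows. The "fallback", "done"
+and "counter" names are hypothetical, and a real fallback path may need
+more elaborate synchronization than the single locked instruction used here:
+        xbegin fallback      ; start transaction, abort jumps to fallback
+        inc dword [counter]  ; work done transactionally
+        xend                 ; commit the transaction
+        jmp done
+    fallback:
+        lock inc dword [counter]     ; non-transactional path
+    done: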
 .2  Control directives
 .2.2  Conditional assembly
-"if" directive causes come block of instructions to be assembled only under
+"if" directive causes some block of instructions to be assembled only under
 certain condition. It should be followed by logical expression specifying the
 condition, instructions in next lines will be assembled only when this
 condition is met, otherwise they will be skipped. The optional "else if"
 even if symbol is used only after this check). The "defined" operator can be
 followed by any expression, usually just by a single symbol name; it checks
 whether the given expression contains only symbols that are defined in the
 source and accessible from the current position.
-  The following simple example uses the "count" constant that should be
+  With "relativeto" operator it is possible to check whether values of two
+expressions differ only by constant amount. The valid syntax is a numerical
+expression followed by "relativeto" and then another expression (possibly
+register-based). Labels that have no simple numerical value can be tested
+this way to determine what kind of operations may be possible with them.
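+  As a hedged sketch (the "target" label and the "distance" constant are
+hypothetical names), such a test could guard a computation that is valid
+only when the difference of two addresses is a plain number:
+    if target relativeto $
+        distance = target-$
+    end if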
+  The following simple example uses the "count" constant that should be
 defined somewhere in source:
     if count>0
         mov cx,count
 which follows the "else if", is evaluated and if it's true, the second block
 of instructions get assembled, otherwise the last block of instructions, which
 follows the line containing only "else", is assembled.
   There are also operators that allow comparison of values being any chains of
-symbols. The "eq" compares two such values whether they are exactly the same.
+symbols. The "eq" compares whether two such values are exactly the same.
 The "in" operator checks whether given value is a member of the list of values
 following this operator, the list should be enclosed between "<" and ">"
 characters, its members should be separated with commas. The symbols are
 considered the same when they have the same meaning for the assembler - for
 example "pword" and "fword" for assembler are the same and thus are not
 distinguished by the above operators. In the same way "16 eq 10h" is the true
 operator is specified, one byte is loaded (thus value is in range from 0 to
 255). The loaded data cannot exceed current offset.
   The "store" directive can modify the already generated code by replacing
 some of the previously generated data with the value defined by given
-numerical expression, which follow. The expression can be preceded by the
+numerical expression, which follows. The expression can be preceded by the
 optional size operator to specify how large value the expression defines, and
 therefore how many bytes will be stored, if there is no size operator, the
 size of one byte is assumed. Then the "at" operator and the numerical
 expression defining the valid address in current addressing code space, at
 which the given value has to be stored should follow. This is a directive for
 advanced applications and should be used carefully.
         store byte a xor c at $$+%-1
     end repeat
 and each byte of code will be xored with the value defined by "c" constant.
-  "virtual" defines virtual data at specified address. This data won't be
+  "virtual" defines virtual data at specified address. This data will not be
 included in the output file, but labels defined there can be used in other
 parts of source. This directive can be followed by "at" operator and the
 numerical expression specifying the address for virtual data, otherwise it
 uses current address, the same as "virtual at $". Instructions defining data
 are expected in next lines, ended with "end virtual" directive. The block of
         LDT_address dd ?
     end virtual
 With such definition instruction "mov ax,[LDT_limit]" will be assembled
-to "mov ax,[bx]".
+to the same instruction as "mov ax,[bx]".
   Declaring defined data values or instructions inside the virtual block would
 also be useful, because the "load" directive can be used to load the values
 from the virtually generated code into a constant. This directive should be
 used after the code it loads but before the virtual block ends, because it can
 only load the values from the same addressing space. For example:
         display d
     end repeat
     display 13,10
-This block of directives calculates the four hexadecimal digits of 16-bit value
+This block of directives calculates the four hexadecimal digits of 16-bit
-and converts them into characters for displaying. Note that this won't work if
+value and converts them into characters for displaying. Note that this will
-the adresses in current addressing space are relocatable (as it might happen
+not work if the addresses in current addressing space are relocatable (as it
-with PE or object output formats), since only absolute values can be used this
+might happen with PE or object output formats), since only absolute values can
-way. The absolute value may be obtained by calculating the relative address,
+be used this way. The absolute value may be obtained by calculating the
-like "$-$$", or "rva $" in case of PE format.
+relative address, like "$-$$", or "rva $" in case of PE format.
+  The "err" directive immediately terminates the assembly process when it is
+encountered by assembler.
+  The "assert" directive tests whether the logical expression that follows it
+is true, and if not, it signals an error.
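+  For example, assuming a hypothetical label "start" that marks the
+beginning of the code, the following line reports an error when more than
+512 bytes have been generated since that label:
+    assert $-start<=512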
 also defined somewhere later.
   The "used" operator may be expected to behave in a similar manner in
 analogous cases, however any other kinds of predictions may not be so simple and
 you should never rely on them this way.
   The "err" directive, usually used to stop the assembly when some condition is
+met, stops the assembly immediately, regardless of whether the current pass
+is final or intermediate. So even when the condition that caused this directive
+to be interpreted is mispredicted and temporary, and would eventually disappear
+in the later passes, the assembly is stopped anyway.
+  The "assert" directive signals the error only if its expression is false
+after all the symbols have been resolved. You can use "assert 0" in place of
+"err" when you do not want to have assembly stopped during the intermediate
+passes.
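+  A short sketch of such use, assuming "count" is a constant that may be
+defined later in the source:
+    if count>8
+        assert 0         ; reported only if still reached in the final pass
+    end if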
 .3  Preprocessor directives
 to the line containing the "include" directive. There are no limits to the
 number of included files as long as they fit in memory.
   The quoted path can contain environment variables enclosed within "%"
 characters, they will be replaced with their values inside the path, both the
-"\" and "/" characters are allowed as a path separators. If no absolute path
+"\" and "/" characters are allowed as path separators. The file is first
-is given, the file is first searched for in the directory containing file
+searched for in the directory containing file which included it and when it is
+not found there, the search is continued in the directories specified in the
-which included it and when it's not found there, in the directory containing
+environment variable called INCLUDE (the multiple paths separated with
+semicolons can be defined there, they will be searched in the same order as
+specified). If file was not found in any of these places, preprocessor looks
-the main source file (the one specified in command line). These rules concern
+for it in the directory containing the main source file (the one specified in
-also paths given with the "file" directive.
+command line). These rules concern also paths given with the "file" directive.
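+  For instance, with the INCLUDE variable set to a directory that holds some
+commonly used headers (the file name below is only an example), such a
+header can be included without giving any path:
+    include 'win32a.inc'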
 .3.2  Symbolic constants
 separated with commas. So "restore d" after the above definitions will give
 "d" constant back the value "edx", the second one will restore it to value
 "dword", and one more will revert "d" to original meaning as if no such
 constant was defined. If there was no constant defined of given name,
-"restore" won't cause an error, it will be just ignored.
+"restore" will not cause an error, it will be just ignored.
   Symbolic constant can be used to adjust the syntax of assembler to personal
 preferences. For example the following set of definitions provides the handy
 shortcuts for all the size operators:
     b equ byte
     f equ fword
     q equ qword
     t equ tword
     x equ dqword
     y equ qqword
   Because symbolic constant may also have an empty value, it can be used to
 allow the syntax with "offset" word before any address value:
 given, this macroinstruction will become two macroinstructions of the previous
 definition, so "mov es,ds,dx" will be assembled as "push ds", "pop es" and
 "mov ds,dx".
   By placing the "*" after the name of argument you can mark the argument as
-required - preprocessor won't allow it to have an empty value. For example the
+required - preprocessor will not allow it to have an empty value. For example
-above macroinstruction could be declared as "macro mov op1*,op2*,op3" to make
+the above macroinstruction could be declared as "macro mov op1*,op2*,op3" to
-sure that first two arguments will always have to be given some non empty
+make sure that first two arguments will always have to be given some non empty
 values.
-  When it's needed to provide macroinstruction with argument that contains
+  Alternatively, you can provide the default value for argument, by placing
+the "=" followed by value after the name of argument. Then if the argument
+has an empty value provided, the default value will be used instead.
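+  A minimal sketch, with a hypothetical macroinstruction name:
+    macro fillchar char=20h
+     {
+        mov al,char
+        stosb
+     }
+Here "fillchar" alone will be assembled as "mov al,20h" and "stosb", while
+"fillchar 'A'" will load the given character instead.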
+  When it's needed to provide macroinstruction with argument that contains
 some commas, such argument should be enclosed between "<" and ">" characters.
 If it contains more than one "<" character, the same number of ">" should be
 used to tell that the value of argument ends.
   "purge" directive allows removing the last definition of specified
 macroinstruction. It should be followed by one or more names of
 macroinstructions, separated with commas. If such macroinstruction has not
-been defined, you won't get any error. For example after having the syntax of
+been defined, you will not get any error. For example after having the syntax
-"mov" extended with the macroinstructions defined above, you can disable
+of "mov" extended with the macroinstructions defined above, you can disable
 syntax with three operands back by using "purge mov" directive. Next
 "purge mov" will disable also syntax for two operands being segment registers,
 and all the next such directives will do nothing.
   If after the "macro" directive you enclose some group of arguments' names in
 square brackets, it will allow giving more values for this group of arguments
 when using that macroinstruction. Any more argument given after the last
         jnz move
+     }
 Each time this macroinstruction is used, "move" will become other unique name
-in its instructions, so you won't get an error you normally get when some
+in its instructions, so you will not get an error you normally get when some
 label is defined more than once.
   "forward", "reverse" and "common" directives divide macroinstruction into
 blocks, each one processed after the processing of previous is finished. They
 differ in behavior only if macroinstruction allows multiple groups of
 arguments. Block of instructions that follows "forward" directive is processed
       common call proc
+     }
 This macroinstruction can be used for calling the procedures using STDCALL
-convention, arguments are pushed on stack in the reverse order. For example
+convention, which has all the arguments pushed on stack in the reverse order.
-"stdcall foo,1,2,3" will be assembled as:
+For example "stdcall foo,1,2,3" will be assembled as:
     push 3
     push 2
     push 1
 For example "jif ax,ae,10h,exit" will be assembled as "cmp ax,10h" and
 "jae exit" instructions.
   The "#" operator can be also used to concatenate two quoted strings into one.
 Also conversion of name into a quoted string is possible, with the "`" operator,
-which likewise can be used inside the macroinstruction. It convert the name
+which likewise can be used inside the macroinstruction. It converts the name
 that follows it into a quoted string - but note, that when it is followed by
 a macro argument which is being replaced with value containing more than one
 symbol, only the first of them will be converted, as the "`" operator converts
 only one symbol that immediately follows it. Here's an example of utilizing
 those two features:
 used. This label will be also attached at the beginning of every name starting
 with dot in the contents of macroinstruction. The macroinstruction defined
 using the "struc" directive can have the same name as some other
 macroinstruction defined using the "macro" directive, structure
-macroinstruction won't prevent the standard macroinstruction being processed
+macroinstruction will not prevent the standard macroinstruction from being
-when there is no label before it and vice versa. All the rules and features
+processed when there is no label before it and vice versa. All the rules and
-concerning standard macroinstructions apply to structure macroinstructions.
+features concerning standard macroinstructions apply to structure
-  Here is the sample of structure macroinstruction:
+macroinstructions.
+  Here is the sample of structure macroinstruction:
     struc point x,y
+     {
         .x dw x
 .3.5  Repeating macroinstructions
 The "rept" directive is a special kind of macroinstruction, which makes given
-amount of duplicates of the block enclosed with braces. The basic syntax is
-"rept" directive followed by number (it cannot be an expression, since
-preprocessor doesn't do calculations, if you need repetitions based on values
-calculated by assembler, use one of the code repeating directives that are
+amount of duplicates of the block enclosed with braces. The basic syntax is
-processed by assembler, see 2.2.3), and then block of source enclosed between
+"rept" directive followed by number and then block of source enclosed between
 the "{" and "}" characters. The simplest example:
 will generate code which will clear the contents of eight SSE registers.
 You can define multiple counters separated with commas, and each one can have
 different base.
+  The number of repetitions and the base values for counters can be specified
+using the numerical expressions with operator rules identical as in the case
+of assembler. However each value used in such expression must either be a
+directly specified number, or a symbolic constant with value also being an
+expression that can be calculated by preprocessor (in such case the value
+of expression associated with symbolic constant is calculated first, and then
+substituted into the outer expression in place of that constant). If you need
+repetitions based on values that can only be calculated at assembly time, use
+one of the code repeating directives that are processed by assembler, see
+section 2.2.3.
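+  For example, in this sketch (the names are arbitrary) both the number of
+repetitions and the base of the counter are computed by preprocessor:
+    define entries 2
+    rept entries*2 n:entries+1 { db n }
+It generates four lines, from "db 3" to "db 6".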
   The "irp" directive iterates the single argument through the given list of
 parameters. The syntax is "irp" followed by the argument name, then the comma
 and then the list of parameters. The parameters are specified in the same
 way like in the invocation of standard macroinstruction, so they have to be
 separated with commas and each one can be enclosed with the "<" and ">"
     match +,+ { include 'first.inc' }
     match +,- { include 'second.inc' }
 the first file will get included, since "+" after comma matches the "+" in
-pattern, and the second file won't be included, since there is no match.
+pattern, and the second file will not be included, since there is no match.
   To match any other symbol literally, it has to be preceded by "=" character
 in the pattern. Also to match the "=" character itself, or the comma, the
 "==" and "=," constructions have to be used. For example the "=a==" pattern
 will match the "a=" sequence.
 matched with "b". But in this case:
     match a b, 1 { db a }
-there will be nothing left for "b" to match, so the block won't get processed
+there will be nothing left for "b" to match, so the block will not get
-at all.
+processed at all.
   The block of source defined by match is processed in the same way as any
 macroinstruction, so any operators specific to macroinstructions can be used
 also in this case.
   What makes "match" directive more useful is the fact, that it replaces the
 that the "fix" directive and prioritized symbolic constants are processed in
 a separate stage, and all other preprocessing is done after on the resulting
 source.
   The standard preprocessing that comes after, on each line begins with
-recognition of the first symbol. It begins with checking for the preprocessor
+recognition of the first symbol. It starts with checking for the preprocessor
 directives, and when none of them is detected, preprocessor checks whether the
 first symbol is macroinstruction. If no macroinstruction is found, it moves
 to the second symbol of line, and again begins with checking for directives,
 which in this case is only the "equ" directive, as this is the only one that
-occurs as the second symbol in line. If there's no directive, the second
+occurs as the second symbol in line. If there is no directive, the second
 symbol is checked for the case of structure macroinstruction and when none
 of those checks gives the positive result, the symbolic constants are replaced
 with their values and such line is passed to the assembler.
   To see it on the example, assume that there is defined the macroinstruction
 called "foo" and the structure macroinstruction called "bar". Those lines:
     foo bar
 would be then both interpreted as invocations of macroinstruction "foo", since
 the meaning of the first symbol overrides the meaning of second one.
-  The macroinstructions generate the new lines from their definition blocks,
+  When the macroinstruction generates the new lines from its definition block,
+in every line it first scans for macroinstruction directives, and interprets
+them accordingly. All the other content in the definition block is used to
+build the new lines, replacing the macroinstruction parameters with their values
-replacing the parameters with their values and then processing the "#" and "`"
+and then processing the symbol escaping and "#" and "`" operators. The
-operators. The conversion operator has the higher priority than concatenation.
+conversion operator has the higher priority than concatenation and if any of
+them operates on the escaped symbol, the escaping is cancelled before finishing
-After this is completed, the newly generated line goes through the standard
+the operation. After this is completed, the newly generated line goes through
-preprocessing, as described above.
+the standard preprocessing, as described above.
   Though the symbolic constants are usually only replaced in the lines, where
 no preprocessor directives nor macroinstructions have been found, there are some
 special cases where those replacements are performed in the parts of lines
 containing directives. First one is the definition of symbolic constant, where
 the replacements are done everywhere after the "equ" keyword and the resulting
 then replaced with matched value when generating the new lines defined by the
 block enclosed with braces. So if the "list" had value "1,2", the above line
 would generate the line containing "foo 1,2", which would then go through the
 standard preprocessing.
-  There is one more special case - when preprocessor goes to checking the
+  The other special case is in the parameters of "rept" directive. The amount
+of repetitions and the base value for counter can be specified using
+numerical expressions, and if there is a symbolic constant with non-numerical
+name used in such an expression, preprocessor tries to evaluate its value as
+a numerical expression and if succeeds, it replaces the symbolic constant with
+the result of that calculation and continues to evaluate the primary
+expression. If the expression inside that symbolic constant also contains
+some symbolic constants, preprocessor will try to calculate all the needed
+values recursively.
+  This allows to perform some calculations at the time of preprocessing, as
+long as all the values used are the numbers known at the preprocessing stage.
+A single repetition with "rept" can be used for the sole purpose of
+calculating some value, like in this example:
+    define a b+4
+    define b 3
+    rept 1 result:a*b+2 { define c result }
+To compute the base value for "result" counter, preprocessor replaces the "b"
+with its value and recursively calculates the value of "a", obtaining 7 as
+the result, then it calculates the main expression with the result being 23.
+The "c" then gets defined with the first value of counter (because the block
+is processed just one time), which is the result of the computation, so the
+value of "c" is simply the "23" symbol. Note that if "b" is later redefined with
+some other numerical value, the next time an expression containing "a" is
+calculated, the value of "a" will reflect the new value of "b", because the
+symbolic constant contains just the text of the expression.
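   As a sketch of that last remark (the name "d" and the new value 5 are
 chosen here only for illustration), redefining "b" and repeating the same
 calculation should give a different result, because "a" still holds the text
 "b+4":
     define a b+4
     define b 3
     rept 1 result:a*b+2 { define c result }   ; a is 7, so c becomes 23
     define b 5
     rept 1 result:a*b+2 { define d result }   ; a is now 9, so d becomes 47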
+  There is one more special case - when preprocessor goes to checking the
 second symbol in the line and it happens to be the colon character (what is
 then interpreted by assembler as definition of a label), it stops in this
 place and finishes the preprocessing of the first symbol (so if it's the
 symbolic constant it gets unrolled) and if it still appears to be the label,
 it performs the standard preprocessing starting from the place after the
 Now when assembler processes it, the condition for the "if" is false, and
 the "a" constant doesn't get defined. However symbolic constant "b" was
 processed normally, even though its definition was put just next to the one
 of "a". So because of the possible confusion you should be very careful
-every time when mixing the features of preprocessor and assembler - always
+every time when mixing the features of preprocessor and assembler - in such
-try to imagine what your source will become after the preprocessing, and
+cases it is important to realize what the source will become after the
-thus what the assembler will see and do its multiple passes on.
+preprocessing, and thus what the assembler will see and do its multiple passes
+on.
 .4  Formatter directives
 These directives are actually also a kind of control directives, with the
 purpose of controlling the format of generated code.
   "format" directive followed by the format identifier allows to select the
+output format. This directive should be put at the beginning of the source.
+Default output format is a flat binary file, it can also be selected by using
+"format binary" directive. This directive can be followed by the "as" keyword
-output format. This directive should be put at the beginning of the source.
+and the quoted string specifying the default file extension for the output
-Default output format is a flat binary file, it can also be selected by using
+file. Unless the output file name was specified from the command line,
-"format binary" directive.
+assembler will use this extension when generating the output file.
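   For instance, assuming the output should get the "img" extension (an
 arbitrary choice made for this illustration), the declaration could look like
 this:
     format binary as 'img'
 With such a line, assembling a source named "boot.asm" without giving the
 output name would be expected to produce "boot.img".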
   "use16" and "use32" directives force the assembler to generate 16-bit or
 32-bit code, omitting the default setting for selected output format. "use64"
 .4.2  Portable Executable
 To select the Portable Executable output format, use "format PE" directive, it
+can be followed by additional format settings: first the target subsystem
-can be followed by additional format settings: use "console", "GUI" or
+setting, which can be "console" or "GUI" for Windows applications, "native"
+for Windows drivers, "EFI", "EFIboot" or "EFIruntime" for the UEFI, it may be
+followed by the minimum version of system that the executable is targeted to
+(specified in form of floating-point value). Optional "DLL" and "WDM" keywords
+mark the output file as a dynamic link library and WDM driver respectively,
-"native" operator selects the target subsystem (floating point value
+and the "large" keyword marks the executable as able to handle addresses
-specifying subsystem version can follow), "DLL" marks the output file as a
+larger than 2 GB.
-dynamic link library. Then can follow the "at" operator and the numerical
+  After those settings can follow the "at" operator and a numerical expression
-expression specifying the base of PE image and then optionally "on" operator
+specifying the base of PE image and then optionally "on" operator followed by
-followed by the quoted string containing file name selects custom MZ stub for
+the quoted string containing file name selects custom MZ stub for PE program
-PE program (when specified file is not a MZ executable, it is treated as a
+(when specified file is not a MZ executable, it is treated as a flat binary
-flat binary executable file and converted into MZ format). The default code
-setting for this format is 32-bit. The example of fully featured PE format
+executable file and converted into MZ format). The default code setting for
-declaration:
+this format is 32-bit. The example of fully featured PE format declaration:
 to be defined there. The same applies to the resource data when the "resource"
 identifier is followed by "from" operator and quoted file name - in such case
 data is taken from the given resource file.
   The "rva" operator can be used inside the numerical expressions to obtain
-the RVA of the item addressed by the value it is applied to.
+the RVA of the item addressed by the value it is applied to, that is the
+offset relative to the base of PE image.
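   A short sketch of its use, assuming a hypothetical label "message" placed
 in a data section of a PE file:
     section '.data' data readable writeable
     message db 'Hello',0
     message_rva dd rva message     ; offset of "message" from the image base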
 .4.3  Common Object File Format
 To select Common Object File Format, use "format COFF" or "format MS COFF"
-directive whether you want to create classic or Microsoft's COFF file. The
+directive, depending on whether you want to create classic (DJGPP) or Microsoft's
-default code setting for this format is 32-bit. To create the file in
+variant of COFF file. The default code setting for this format is 32-bit. To
+create the file in Microsoft's COFF format for the x86-64 architecture, use
-Microsoft's COFF format for the x86-64 architecture, use "format MS64 COFF"
+"format MS64 COFF" setting, in such case long mode code is generated by
-setting, in such case long mode code is generated by default.
+default.
   "section" directive defines a new section, it should be followed by quoted
 string defining the name of section, then one or more section flags can
 follow. Section flags available for both COFF variants are "code" and "data",
-while "readable", "writeable", "executable", "shareable", "discardable",
+while flags "readable", "writeable", "executable", "shareable", "discardable",
-"notpageable", "linkremove" and "linkinfo" are flags available only with
+"notpageable", "linkremove" and "linkinfo" are available only with Microsoft's
-Microsoft COFF variant.
+COFF variant.
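   For example, an object file in Microsoft's COFF variant might declare its
 sections like this (the section names are only illustrative):
     format MS COFF
     section '.text' code readable executable
     section '.data' data readable writeable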
   By default section is aligned to double word (four bytes), in case of
 Microsoft COFF variant other alignment can be specified by providing the
 "align" operator followed by alignment value (any power of two up to 8192)
     public main
     public start as '_start'
+Additionally, with COFF format it's possible to specify an exported symbol as
+static; this is done by preceding the name of the symbol with the "static"
+keyword.
+  When using the Microsoft's COFF format, the "rva" operator can be used
+inside the numerical expressions to obtain the RVA of the item addressed by the
+value it is applied to.
 .4.4  Executable and Linkable Format
 To select ELF output format, use "format ELF" directive. The default code
 COFF output format is selected (described in previous section).
   The "rva" operator can be used also in the case of this format (however not
 when target architecture is x86-64), it converts the address into the offset
 relative to the GOT table, so it may be useful to create position-independent
-code.
+code. There's also a special "plt" operator, which allows to call the external
-  To create executable file, follow the format choice directive with the
-"executable" keyword. It allows to use "entry" directive followed by the value
-to set as entry point of program. On the other hand it makes "extrn" and
+functions through the Procedure Linkage Table. You can even create an alias
-"public" directives unavailable, and instead of "section" there should be the
+for external function that will make it always be called through PLT, with
-"segment" directive used, followed only by one or more segment permission
+the code like:
-flags. The origin of segment is aligned to page (4096 bytes), and available
-flags for are: "readable", "writeable" and "executable".
+    extrn 'printf' as _printf
+    printf = PLT _printf
+  To create executable file, follow the format choice directive with the
+"executable" keyword and optionally the number specifying the brand of the
+target operating system (for example value 3 would mark the executable
+for Linux system). With this format selected it is allowed to use "entry"
+directive followed by the value to set as entry point of program. On the other
+hand it makes "extrn" and "public" directives unavailable, and instead of
+"section" there should be the "segment" directive used, followed by one or
+more segment permission flags and optionally a marker of special ELF
+executable segment, which can be "interpreter", "dynamic" or "note". The
+origin of segment is aligned to page (4096 bytes), and available permission
+flags are: "readable", "writeable" and "executable".
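   A minimal sketch of such an executable for 32-bit Linux, assuming the usual
 "int 0x80" system call convention in which function 1 terminates the process
 (the label "start" is an arbitrary name):
     format ELF executable 3
     entry start
     segment readable executable
     start:
         mov eax,1       ; number of the exit system call
         xor ebx,ebx     ; exit code 0
         int 0x80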
 EOF

(root)/data/eng/docs/FASM.TXT – Rev 1737 → 2666