  1.  
  2.                            ,'''
  3.                          ,,;,, ,,,,    ,,,,, ,,, ,,
  4.                            ;       ;  ;      ;  ;  ;
  5.                            ;  ,'''';   '''', ;  ;  ;
  6.                            ;  ',,,,;, ,,,,,' ;  ;  ;
  7.  
  8.                               flat assembler 1.70
  9.                               Programmer's Manual
  10.  
  11.  
  12. Table of contents
  13. -----------------
  14.  
  15. Chapter 1  Introduction
  16.  
  17.         1.1  Compiler overview
  18.         1.1.1  System requirements
  19.         1.1.2  Executing compiler from command line
  20.         1.1.3  Compiler messages
  21.         1.1.4  Output formats
  22.  
  23.         1.2  Assembly syntax
  24.         1.2.1  Instruction syntax
  25.         1.2.2  Data definitions
  26.         1.2.3  Constants and labels
  27.         1.2.4  Numerical expressions
  28.         1.2.5  Jumps and calls
  29.         1.2.6  Size settings
  30.  
  31. Chapter 2  Instruction set
  32.  
  33.         2.1  The x86 architecture instructions
  34.         2.1.1  Data movement instructions
  35.         2.1.2  Type conversion instructions
  36.         2.1.3  Binary arithmetic instructions
  37.         2.1.4  Decimal arithmetic instructions
  38.         2.1.5  Logical instructions
  39.         2.1.6  Control transfer instructions
  40.         2.1.7  I/O instructions
  41.         2.1.8  Strings operations
  42.         2.1.9  Flag control instructions
  43.         2.1.10  Conditional operations
  44.         2.1.11  Miscellaneous instructions
  45.         2.1.12  System instructions
  46.         2.1.13  FPU instructions
  47.         2.1.14  MMX instructions
  48.         2.1.15  SSE instructions
  49.         2.1.16  SSE2 instructions
  50.         2.1.17  SSE3 instructions
  51.         2.1.18  AMD 3DNow! instructions
  52.         2.1.19  The x86-64 long mode instructions
  53.         2.1.20  SSE4 instructions
  54.         2.1.21  AVX instructions
  55.         2.1.22  AVX2 instructions
  56.         2.1.23  Auxiliary sets of computational instructions
  57.         2.1.24  Other extensions of instruction set
  58.  
  59.         2.2  Control directives
  60.         2.2.1  Numerical constants
  61.         2.2.2  Conditional assembly
  62.         2.2.3  Repeating blocks of instructions
  63.         2.2.4  Addressing spaces
  64.         2.2.5  Other directives
  65.         2.2.6  Multiple passes
  66.  
  67.         2.3  Preprocessor directives
  68.         2.3.1  Including source files
  69.         2.3.2  Symbolic constants
  70.         2.3.3  Macroinstructions
  71.         2.3.4  Structures
  72.         2.3.5  Repeating macroinstructions
  73.         2.3.6  Conditional preprocessing
  74.         2.3.7  Order of processing
  75.  
  76.         2.4  Formatter directives
  77.         2.4.1  MZ executable
  78.         2.4.2  Portable Executable
  79.         2.4.3  Common Object File Format
  80.         2.4.4  Executable and Linkable Format
  81.  
  82.  
  83.  
  84. Chapter 1  Introduction
  85. -----------------------
  86.  
  87. This chapter contains all the most important information you need to begin
  88. using the flat assembler. If you are an experienced assembly language programmer,
  89. you should read at least this chapter before using this compiler.
  90.  
  91.  
  92. 1.1  Compiler overview
  93.  
  94. Flat assembler is a fast assembly language compiler for the x86 architecture
  95. processors, which does multiple passes to optimize the size of generated
  96. machine code. It is self-compilable and versions for different operating
  97. systems are provided. All the versions are designed to be used from the system
  98. command line and they should not differ in behavior.
  99.  
  100.  
  101. 1.1.1  System requirements
  102.  
  103. All versions require the x86 architecture 32-bit processor (at least 80386),
  104. although they can produce programs for the x86 architecture 16-bit processors,
  105. too. DOS version requires an OS compatible with MS DOS 2.0 and either true
  106. real mode environment or DPMI. Windows version requires a Win32 console
  107. compatible with 3.1 version.
  108.  
  109.  
  110. 1.1.2  Executing compiler from command line
  111.  
  112. To execute flat assembler from the command line you need to provide two
  113. parameters - first should be name of source file, second should be name of
  114. destination file. If no second parameter is given, the name for output
  115. file will be guessed automatically. After displaying short information about
  116. the program name and version, compiler will read the data from source file and
  117. compile it. When the compilation is successful, compiler will write the
  118. generated code to the destination file and display the summary of compilation
  119. process; otherwise it will display the information about error that occurred.
  120.   The source file should be a text file, and can be created in any text
  121. editor. Line breaks are accepted in both DOS and Unix standards, tabulators
  122. are treated as spaces.
  123.   In the command line you can also include "-m" option followed by a number,
  124. which specifies how many kilobytes of memory flat assembler should maximally
  125. use. In case of the DOS version this option limits only the usage of extended
  126. memory. The "-p" option followed by a number can be used to specify the limit
  127. for number of passes the assembler performs. If code cannot be generated
  128. within specified amount of passes, the assembly will be terminated with an
  129. error message. The maximum value of this setting is 65536, while the default
  130. limit, used when no such option is included in command line, is 100.
  134.   There are no command line options that would affect the output of compiler;
  135. flat assembler requires only the source code to include the information it
  136. really needs. For example, to choose the output format you use
  137. the "format" directive at the beginning of source.
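  As a minimal sketch of this, the following source selects its own output
format. The choice of the "binary" format, the 16-bit mode and the DOS
"terminate program" call are assumptions made only for this illustration;
the available formats are described in 2.4.

    format binary           ; output format chosen in the source itself
    use16                   ; generate 16-bit code

            mov     ax,4C00h    ; assumed DOS function 4Ch: terminate program
            int     21h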
  138.  
  139.  
  140. 1.1.3  Compiler messages
  141.  
  142. As stated above, after a successful compilation the compiler displays
  143. the compilation summary. It includes the information of how many passes were
  144. done, how much time it took, and how many bytes were written into the
  145. destination file.
  146. The following is an example of the compilation summary:
  147.  
  148. flat assembler  version 1.70 (16384 kilobytes memory)
  149. 38 passes, 5.3 seconds, 77824 bytes.
  150.  
  151. In case of error during the compilation process, the program will display an
  152. error message. For example, when compiler can't find the input file, it will
  153. display the following message:
  154.  
  155. flat assembler  version 1.70 (16384 kilobytes memory)
  156. error: source file not found.
  157.  
  158. If the error is connected with a specific part of source code, the source line
  159. that caused the error will also be displayed. The placement of this line in
  160. the source is also given to help you find the error, for example:
  161.  
  162. flat assembler  version 1.70 (16384 kilobytes memory)
  163. example.asm [3]:
  164.         mob     ax,1
  165. error: illegal instruction.
  166.  
  167. It means that in the third line of the "example.asm" file compiler has
  168. encountered an unrecognized instruction. When the line that caused the error
  169. contains a macroinstruction, the line in the macroinstruction definition
  170. that generated the erroneous instruction is also displayed:
  171.  
  172. flat assembler  version 1.70 (16384 kilobytes memory)
  173. example.asm [6]:
  174.         stoschar 7
  175. example.asm [3] stoschar [1]:
  176.         mob     al,char
  177. error: illegal instruction.
  178.  
  179. It means that the macroinstruction in the sixth line of the "example.asm" file
  180. generated an unrecognized instruction with the first line of its definition.
  181.  
  182.  
  183. 1.1.4  Output formats
  184.  
  185. By default, when there is no "format" directive in source file, flat
  186. assembler simply puts generated instruction codes into output, creating
  187. a flat binary file this way. By default it generates 16-bit code, but you can
  188. always switch it into the 16-bit or 32-bit mode by using "use16" or "use32" directive.
  189. Some of the output formats switch into 32-bit mode, when selected - more
  190. information about formats which you can choose can be found in 2.4.
  191.   All output code is always in the order in which it was entered into the
  192. source file.
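  A short sketch of how the code mode changes the generated bytes. The
byte values in the comments are the standard x86 encodings of these
instructions and are given here only as an illustration:

    use16
            mov     ax,1        ; B8 01 00
            mov     eax,1       ; 66 B8 01 00 00 00 (operand size prefix)
    use32
            mov     ax,1        ; 66 B8 01 00 (operand size prefix)
            mov     eax,1       ; B8 01 00 00 00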
  193.  
  194.  
  195. 1.2  Assembly syntax
  196.  
  197. The information provided below is intended mainly for the assembler
  198. programmers that have been using some other assembly compilers before.
  199. If you are a beginner, you should look for the assembly programming tutorials.
  200.   Flat assembler by default uses the Intel syntax for the assembly
  201. instructions, although you can customize it using the preprocessor
  202. capabilities (macroinstructions and symbolic constants). It also has its own
  203. set of the directives - the instructions for compiler.
  204.   All symbols defined inside the sources are case-sensitive.
  205.  
  206.  
  207. 1.2.1  Instruction syntax
  208.  
  209. Instructions in assembly language are separated by line breaks, and one
  210. instruction is expected to fill one line of text. If a line contains
  211. a semicolon, except for the semicolons inside the quoted strings, the rest of
  212. this line is a comment and the compiler ignores it. If a line ends with the "\"
  213. character (optionally followed by a semicolon and a comment), the next line
  214. is attached at this point.
  215.   Each line in source is the sequence of items, which may be one of the three
  216. types. One type are the symbol characters, which are the special characters
  217. that are individual items even when they are not spaced from the other ones.
  218. Any of the "+-*/=<>()[]{}:,|&~#`" is the symbol character. The sequence of
  219. other characters, separated from other items with either blank spaces or
  220. symbol characters, is a symbol. If the first character of symbol is either a
  221. single or double quote, it integrates any sequence of characters following it,
  222. even the special ones, into a quoted string, which should end with the same
  223. character, with which it began (the single or double quote) - however if there
  224. are two such characters in a row (without any other character between them),
  225. they are integrated into quoted string as just one of them and the quoted
  226. string continues then. The symbols other than symbol characters and quoted
  227. strings can be used as names, so are also called the name symbols.
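  The rules above can be sketched with a few lines of source. The "db"
directive used here to emit bytes is described in 1.2.2; the particular
values are arbitrary and chosen only for this example.

    db 'a;b'                ; the semicolon inside the quotes is data
    db 'it''s'              ; doubled quote: four bytes i, t, ' and s
    db "it's"               ; the same four bytes, using the other quote
    db 1,2,3,\              ; the next line is attached at this point
       4,5,6
    mov al,[bx+si]
    mov al , [ bx + si ]    ; same items, so the same instruction

  In the last two lines the symbol characters separate the items on their
own, so the extra spaces make no difference.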
  228.   Every instruction consists of the mnemonic and a varying number of
  229. operands, separated with commas. The operand can be a register, an immediate
  230. value or data addressed in memory; it can also be preceded by a size operator
  231. to define or override its size (table 1.1). The names of available registers
  232. can be found in table 1.2; their sizes cannot be overridden. Immediate value can be
  233. specified by any numerical expression.
  234.   When the operand is data in memory, the address of that data (also any
  235. numerical expression, but it may contain registers) should be enclosed in
  236. square brackets or preceded by the "ptr" operator. For example the instruction
  237. "mov eax,3" will put the immediate value 3 into the EAX register, the instruction
  238. "mov eax,[7]" will put the 32-bit value from the address 7 into EAX and the
  239. instruction "mov byte [7],3" will put the immediate value 3 into the byte at
  240. address 7; it can also be written as "mov byte ptr 7,3". To specify which
  241. segment register should be used for addressing, segment register name followed
  242. by a colon should be put just before the address value (inside the square
  243. brackets or after the "ptr" operator).
  244.  
  245.    Table 1.1  Size operators
  246.   /-------------------------\
  247.   | Operator | Bits | Bytes |
  248.   |==========|======|=======|
  249.   | byte     | 8    | 1     |
  250.   | word     | 16   | 2     |
  251.   | dword    | 32   | 4     |
  252.   | fword    | 48   | 6     |
  253.   | pword    | 48   | 6     |
  254.   | qword    | 64   | 8     |
  255.   | tbyte    | 80   | 10    |
  256.   | tword    | 80   | 10    |
  257.   | dqword   | 128  | 16    |
  258.   | xword    | 128  | 16    |
  259.   | qqword   | 256  | 32    |
  260.   | yword    | 256  | 32    |
  261.   \-------------------------/
  262.  
  263.    Table 1.2  Registers
  264.   /-----------------------------------------------------------------\
  265.   | Type    | Bits |                                                |
  266.   |=========|======|================================================|
  267.   |         | 8    | al    cl    dl    bl    ah    ch    dh    bh   |
  268.   | General | 16   | ax    cx    dx    bx    sp    bp    si    di   |
  269.   |         | 32   | eax   ecx   edx   ebx   esp   ebp   esi   edi  |
  270.   |---------|------|------------------------------------------------|
  271.   | Segment | 16   | es    cs    ss    ds    fs    gs               |
  272.   |---------|------|------------------------------------------------|
  273.   | Control | 32   | cr0         cr2   cr3   cr4                    |
  274.   |---------|------|------------------------------------------------|
  275.   | Debug   | 32   | dr0   dr1   dr2   dr3               dr6   dr7  |
  276.   |---------|------|------------------------------------------------|
  277.   | FPU     | 80   | st0   st1   st2   st3   st4   st5   st6   st7  |
  278.   |---------|------|------------------------------------------------|
  279.   | MMX     | 64   | mm0   mm1   mm2   mm3   mm4   mm5   mm6   mm7  |
  280.   |---------|------|------------------------------------------------|
  281.   | SSE     | 128  | xmm0  xmm1  xmm2  xmm3  xmm4  xmm5  xmm6  xmm7 |
  282.   |---------|------|------------------------------------------------|
  283.   | AVX     | 256  | ymm0  ymm1  ymm2  ymm3  ymm4  ymm5  ymm6  ymm7 |
  284.   \-----------------------------------------------------------------/
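
  The operand forms described in this section can be put together in a
short sketch. The addresses and registers are arbitrary and were chosen
only for illustration.

    mov     eax,3           ; immediate value into a register
    mov     eax,[7]         ; 32-bit value from address 7
    mov     byte [7],3      ; size operator defines the size of the operand
    mov     byte ptr 7,3    ; the same instruction written with "ptr"
    mov     al,[es:bx]      ; segment register given inside the brackets
    mov     word [bx],0     ; "word" taken from table 1.1: 16 bits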
  285.  
  286.  
  287. 1.2.2  Data definitions
  288.  
  289. To define data or reserve a space for it, use one of the directives listed in
  290. table 1.3. The data definition directive should be followed by one or more of
  291. numerical expressions, separated with commas. These expressions define the
  292. values for data cells of size depending on which directive is used. For
  293. example "db 1,2,3" will define the three bytes of values 1, 2 and 3
  294. respectively.
  295.   The "db" and "du" directives also accept the quoted string values of any
  296. length, which will be converted into chain of bytes when "db" is used and into
  297. chain of words with zeroed high byte when "du" is used. For example "db 'abc'"
  298. will define the three bytes of values 61h, 62h and 63h.
  299.   The "dp" directive and its synonym "df" accept the values consisting of two
  300. numerical expressions separated with colon, the first value will become the
  301. high word and the second value will become the low double word of the far
  302. pointer value. Also "dd" accepts such pointers consisting of two word values
  303. separated with colon, and "dt" accepts the word and quad word value separated
  304. with colon, the quad word is stored first. The "dt" directive with single
  305. expression as parameter accepts only floating point values and creates data in
  306. FPU double extended precision format.
  307.   Any of the above directives allows the usage of the special "dup" operator to
  308. make multiple copies of given values. The count of duplicates should precede
  309. this operator and the value to duplicate should follow - it can even be the
  310. chain of values separated with commas, but such set of values needs to be
  311. enclosed in parentheses, like "db 5 dup (1,2)", which defines five copies
  312. of the given two byte sequence.
  313.   The "file" is a special directive and its syntax is different. This
  314. directive includes a chain of bytes from file and it should be followed by the
  315. quoted file name, then optionally numerical expression specifying offset in
  316. file preceded by the colon, and - also optionally - comma and numerical
  317. expression specifying count of bytes to include (if no count is specified, all
  318. data up to the end of file is included). For example "file 'data.bin'" will
  319. include the whole file as binary data and "file 'data.bin':10h,4" will include
  320. only four bytes starting at offset 10h.
  321.   The data reservation directive should be followed by only one numerical
  322. expression, and this value defines how many cells of the specified size should
  323. be reserved. All data definition directives also accept the "?" value, which
  324. means that this cell should not be initialized to any value and the effect is
  325. the same as by using the data reservation directive. The uninitialized data
  326. may not be included in the output file, so its values should be always
  327. considered unknown.
  328.  
  329.    Table 1.3  Data directives
  330.   /----------------------------\
  331.   | Size    | Define | Reserve |
  332.   | (bytes) | data   | data    |
  333.   |=========|========|=========|
  334.   | 1       | db     | rb      |
  335.   |         | file   |         |
  336.   |---------|--------|---------|
  337.   | 2       | dw     | rw      |
  338.   |         | du     |         |
  339.   |---------|--------|---------|
  340.   | 4       | dd     | rd      |
  341.   |---------|--------|---------|
  342.   | 6       | dp     | rp      |
  343.   |         | df     | rf      |
  344.   |---------|--------|---------|
  345.   | 8       | dq     | rq      |
  346.   |---------|--------|---------|
  347.   | 10      | dt     | rt      |
  348.   \----------------------------/
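
  A brief sketch of the directives from table 1.3 in use. All values are
arbitrary and serve only as an illustration of the rules given above.

    db 1,2,3                ; three bytes of values 1, 2 and 3
    db 'abc'                ; three bytes: 61h, 62h, 63h
    du 'abc'                ; three words: 0061h, 0062h, 0063h
    dp 10h:12345678h        ; far pointer: high word 10h, low double word 12345678h
    dd 1234h:5678h          ; far pointer made of two word values
    dt 1.5                  ; floating point value in double extended format
    db 5 dup (1,2)          ; ten bytes: 1,2,1,2,1,2,1,2,1,2
    rb 64                   ; reserve 64 bytes
    dd ?                    ; one uninitialized double word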
  349.  
  350.  
  351. 1.2.3  Constants and labels
  352.  
  353. In the numerical expressions you can also use constants or labels instead of
  354. numbers. To define the constant or label you should use the specific
  355. directives. Each label can be defined only once and it is accessible from
  356. any place of source (even before it was defined). Constant can be redefined
  357. many times, but in this case it is accessible only after it was defined, and
  358. is always equal to the value from last definition before the place where it's
  359. used. When a constant is defined only once in source, it is - like the label -
  360. accessible from anywhere.
  361.   The definition of constant consists of name of the constant followed by the
  362. "=" character and numerical expression, which after calculation will become
  363. the value of constant. This value is always calculated at the time the
  364. constant is defined. For example you can define "count" constant by using the
  365. directive "count = 17", and then use it in the assembly instructions, like
  366. "mov cx,count" - which will become "mov cx,17" during the compilation process.
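  The following sketch extends that example with a redefinition. The second
definition of "count" is an assumption added here only to illustrate the
rule about redefined constants.

    count = 17
            mov     cx,count    ; assembled as "mov cx,17"
    count = count + 1           ; redefinition, calculated at this point
            mov     cx,count    ; assembled as "mov cx,18"

  Each use of "count" takes the value from the last definition that
precedes it.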
  367.   There are different ways to define labels. The simplest is to follow the
  368. name of label by the colon, this directive can even be followed by the other
  369. instruction in the same line. It defines the label whose value is equal to
  370. offset of the point where it's defined. This method is usually used to label
  371. the places in code. The other way is to follow the name of label (without a
  372. colon) by some data directive. It defines the label with value equal to
  373. offset of the beginning of defined data, and remembered as a label for data
  374. with cell size as specified for that data directive in table 1.3.
  375.   The label can be treated as constant of value equal to offset of labeled
  376. code or data. For example when you define data using the labeled directive
  377. "char db 224", to put the offset of this data into BX register you should use
  378. "mov bx,char" instruction, and to put the value of byte addressed by "char"
  379. label to DL register, you should use "mov dl,[char]" (or "mov dl,ptr char").
  380. But when you try to assemble "mov ax,[char]", it will cause an error, because
  381. fasm compares the sizes of operands, which should be equal. You can force
  382. assembling that instruction by using size override: "mov ax,word [char]", but
  383. remember that this instruction will read the two bytes beginning at "char"
  384. address, while it was defined as one byte.
  385.   The last and the most flexible way to define labels is to use "label"
  386. directive. This directive should be followed by the name of label, then
  387. optionally a size operator (it can be preceded by a colon) and then - also
  388. optionally - the "at" operator and the numerical expression defining the address at
  389. which this label should be defined. For example "label wchar word at char"
  390. will define a new label for the 16-bit data at the address of "char". Now the
  391. instruction "mov ax,[wchar]" will be after compilation the same as
  392. "mov ax,word [char]". If no address is specified, "label" directive defines
  393. the label at current offset. Thus "mov [wchar],57568" will copy two bytes
  394. while "mov [char],224" will copy one byte to the same address.
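  The examples from the last paragraphs can be collected into one sketch.
The second "db 224" is an assumption added here so that the word access
stays within defined data; it is not part of the examples above.

    char    db 224                  ; label for byte data
            db 224                  ; assumed second byte of the word
    label wchar word at char        ; 16-bit label at the same address

            mov     bx,char         ; offset of "char" into BX
            mov     dl,[char]       ; the byte at "char" into DL
            mov     ax,word [char]  ; size override: reads two bytes
            mov     ax,[wchar]      ; the same instruction after compilation
            mov     [wchar],57568   ; copies two bytes
            mov     [char],224      ; copies one byte to the same address

  With this data both stores leave the same contents in memory, since
57568 is 224*256+224.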
  395.   The label whose name begins with dot is treated as local label, and its name
  396. is attached to the name of last global label (with name beginning with
  397. anything but dot) to make the full name of this label. So you can use the
  398. short name (beginning with dot) of this label anywhere before the next global
  399. label is defined, and in the other places you have to use the full name. Labels
  400. beginning with two dots are the exception - they are like global, but they
  401. don't become the new prefix for local labels.
  402.   The "@@" name means anonymous label, you can define many of them in
  403. the source. Symbol "@b" (or equivalent "@r") references the nearest preceding
  404. anonymous label, symbol "@f" references the nearest following anonymous label.
  405. These special symbols are case-insensitive.
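
  A small sketch of local and anonymous labels. The names "fill_a" and
"fill_b" and the instructions between the labels are hypothetical and
serve only to show how the names are resolved.

    fill_a:
      .next:                        ; full name is fill_a.next
            stosb
            dec     cx
            jnz     .next           ; refers to fill_a.next
            ret
    fill_b:
      .next:                        ; full name is fill_b.next, no conflict
            stosw
            dec     cx
            jnz     .next           ; refers to fill_b.next
            mov     bx,fill_a.next  ; the full name is needed here
            ret

    @@:     dec     cx
            jz      @f              ; nearest following anonymous label
            jmp     @b              ; nearest preceding anonymous label
    @@:     ret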
  406.  
  407.  
  408. 1.2.4  Numerical expressions
  409.  
  410. In the above examples all the numerical expressions were the simple numbers,
  411. constants or labels. But they can be more complex, by using the arithmetical
  412. or logical operators for calculations at compile time. All these operators
  413. with their priority values are listed in table 1.4. The operations with higher
  414. priority value will be calculated first; you can of course change this
  415. behavior by putting some parts of expression into parentheses. The "+", "-",
  416. "*" and "/" are standard arithmetical operations, "mod" calculates the
  417. remainder from division. The "and", "or", "xor", "shl", "shr" and "not"
  418. perform the same logical operations as assembly instructions of those names.
  419. The "rva" and "plt" are special unary operators that perform conversions
  420. between different kinds of addresses, they can be used only with few of the
  421. output formats and their meaning may vary (see 2.4).
  422.   The arithmetical and logical calculations are usually processed as if they
  423. operated on infinite precision 2-adic numbers, and assembler signals an
  424. overflow error if because of its limitations it is not able to perform the
  425. required calculation, or if the result is too large number to fit in either
  426. signed or unsigned range for the destination unit size. However "not", "xor"
  427. and "shr" operators are exceptions from this rule - if the value specified
  428. by numerical expression has to fit in a unit of specified size, and the
  429. arguments for operation fit into that size, the operation will be performed
  430. with precision limited to that size.
  431.   The numbers in the expression are by default treated as decimal; binary
  432. numbers should have the "b" letter attached at the end, octal numbers should
  433. end with "o" letter, hexadecimal numbers should begin with "0x" characters
  434. (like in C language) or with the "$" character (like in Pascal language) or
  435. they should end with "h" letter. Also quoted string, when encountered in
  436. expression, will be converted into number - the first character will become
  437. the least significant byte of number.
  438.   The numerical expression used as an address value can also contain any of
  439. general registers used for addressing, they can be added and multiplied by
  440. appropriate values, as it is allowed for the x86 architecture instructions.
  441.   There are also some special symbols that can be used inside the numerical
  442. expression. First is "$", which is always equal to the value of current
  443. offset, while "$$" is equal to base address of current addressing space. The
  444. other one is "%", which is the number of current repeat in parts of code that
  445. are repeated using some special directives (see 2.2). There's also "%t"
  446. symbol, which is always equal to the current time stamp.
  447.   Any numerical expression can also consist of single floating point value
  448. (flat assembler does not allow any floating point operations at compilation
  449. time) in the scientific notation, they can end with the "f" letter to be
  450. recognized, otherwise they should contain at least one of the "." or "E"
  451. characters. So "1.0", "1E0" and "1f" define the same floating point value,
  452. while simple "1" defines an integer value.
  453.  
  454.    Table 1.4  Arithmetical and logical operators by priority
  455.   /-------------------------\
  456.   | Priority | Operators    |
  457.   |==========|==============|
  458.   | 0        | +  -         |
  459.   |----------|--------------|
  460.   | 1        | *  /         |
  461.   |----------|--------------|
  462.   | 2        | mod          |
  463.   |----------|--------------|
  464.   | 3        | and  or  xor |
  465.   |----------|--------------|
  466.   | 4        | shl  shr     |
  467.   |----------|--------------|
  468.   | 5        | not          |
  469.   |----------|--------------|
  470.   | 6        | rva  plt     |
  471.   \-------------------------/
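
  The notations and priorities described in this section can be sketched
as follows. The results in the comments follow from the rules above and
from table 1.4; the label "msg" and the constant "msg_len" are
hypothetical names used only for this example.

    db 65, 1000001b, 101o, 41h, 0x41, $41, 'A'  ; seven bytes, each equal to 65

    dd 2+3*4                ; 14, "*" has higher priority than "+"
    dd (2+3)*4              ; 20
    dd 17 mod 5             ; 2
    dd 1 shl 4 or 1         ; 17, "shl" has higher priority than "or"
    mov ax,'AB'             ; 4241h, the first character is the low byte

    msg     db 'hello'
    msg_len = $ - msg       ; 5, current offset minus offset of "msg"

    dd 1.0, 1E0, 1f         ; the same floating point value three times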
  472.  
  473.  
  474. 1.2.5  Jumps and calls
  475.  
  476. The operand of any jump or call instruction can be preceded not only by the
  477. size operator, but also by one of the operators specifying type of the jump:
  478. "short", "near" or "far". For example, when assembler is in 16-bit mode,
  479. instruction "jmp dword [0]" will become the far jump and when assembler is
  480. in 32-bit mode, it will become the near jump. To force this instruction to be
  481. treated differently, use the "jmp near dword [0]" or "jmp far dword [0]" form.
  482.   When operand of near jump is the immediate value, assembler will generate
  483. the shortest variant of this jump instruction if possible (but will not create
  484. 32-bit instruction in 16-bit mode nor 16-bit instruction in 32-bit mode,
  485. unless there is a size operator stating it). By specifying the jump type
  486. you can force it to always generate long variant (for example "jmp near 0")
  487. or to always generate short variant and terminate with an error when it's
  488. impossible (for example "jmp short 0").
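
  A sketch of the jump forms discussed in this section. The label
"target" is hypothetical and the address 0 is taken from the examples
above.

    use16
            jmp     dword [0]       ; far jump by default in 16-bit mode
            jmp     near dword [0]  ; forced to be a near jump
    use32
            jmp     dword [0]       ; near jump by default in 32-bit mode
            jmp     far dword [0]   ; forced to be a far jump
            jmp     target          ; shortest variant is chosen
            jmp     near target     ; always the long variant
            jmp     short target    ; always the short variant, error if impossible
    target:
            ret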
  489.  
  490.  
  491. 1.2.6  Size settings
  492.  
  493. When instruction uses some memory addressing, by default the smallest form of
  494. instruction is generated by using the short displacement, provided the address
  495. value fits in the range. This can be overridden using the "word" or "dword"
  496. operator before the address inside the square brackets (or after the "ptr"
  497. operator), which forces the long displacement of appropriate size to be made.
  498. In case when address is not relative to any registers, those operators allow
  499. also to choose the appropriate mode of absolute addressing.
  500.   Instructions "adc", "add", "and", "cmp", "or", "sbb", "sub" and "xor" with
  501. first operand being 16-bit or 32-bit are by default generated in shortened
  502. 8-bit form when the second operand is immediate value fitting in the range
  503. for signed 8-bit values. It also can be overridden by putting the "word" or
  504. "dword" operator before the immediate value. Similar rules apply to the
  505. "imul" instruction with the last operand being immediate value.
  506.   Immediate value as an operand for "push" instruction without a size operator
  507. is by default treated as a word value if assembler is in 16-bit mode and as a
  508. double word value if assembler is in 32-bit mode, shorter 8-bit form of this
  509. instruction is used if possible, "word" or "dword" size operator forces the
  510. "push" instruction to be generated in longer form for specified size. "pushw"
  511. and "pushd" mnemonics force assembler to generate 16-bit or 32-bit code
  512. without forcing it to use the longer form of instruction.
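
  The size settings above can be sketched in 32-bit mode as follows. The
registers and values are arbitrary; the comments only restate which form
the rules of this section select.

    use32
            mov     eax,[ebx+4]         ; short displacement, as 4 fits in the range
            mov     eax,[dword ebx+4]   ; long 32-bit displacement forced
            add     eax,1               ; shortened form with 8-bit immediate
            add     eax,dword 1         ; immediate forced to 32 bits
            push    1                   ; double word pushed, shorter 8-bit form
            push    dword 1             ; longer form forced
            pushw   1                   ; 16-bit push, shorter form still allowed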
  513.  
  514.  
  515. Chapter 2  Instruction set
  516. --------------------------
  517.  
  518. This chapter provides the detailed information about the instructions and
  519. directives supported by flat assembler. Directives for defining labels were
  520. already discussed in 1.2.3, all other directives will be described later in
  521. this chapter.
  522.  
  523.  
  524. 2.1  The x86 architecture instructions
  525.  
  526. In this section you can find information about both the syntax and the
  527. purpose of the assembly language instructions. If you need more technical
  528. information, look for the Intel Architecture Software Developer's Manual.
  529.   Assembly instructions consist of the mnemonic (instruction's name) and from
  530. zero to three operands. If there are two or more operands, usually first is
  531. the destination operand and second is the source operand. Each operand can be
  532. register, memory or immediate value (see 1.2 for details about syntax of
  533. operands). After the description of each instruction there are examples
  534. of different combinations of operands, if the instruction has any.
  535.   Some instructions act as prefixes and can be followed by other instruction
  536. in the same line, and there can be more than one prefix in a line. Each name
  537. of the segment register is also a mnemonic of instruction prefix, although it
  538. is recommended to use segment overrides inside the square brackets instead of
  539. these prefixes.
  540.  
  541.  
  542. 2.1.1  Data movement instructions
  543.  
  544. "mov" transfers a byte, word or double word from the source operand to the
  545. destination operand. It can transfer data between general registers, from
  546. the general register to memory, or from memory to general register, but it
  547. cannot move from memory to memory. It can also transfer an immediate value to
  548. general register or memory, segment register to general register or memory,
  549. general register or memory to segment register, control or debug register to
  550. general register and general register to control or debug register. The "mov"
  551. can be assembled only if the size of source operand and size of destination
  552. operand are the same. Below are the examples for each of the allowed
  553. combinations:
  554.  
  555.     mov bx,ax       ; general register to general register
  556.     mov [char],al   ; general register to memory
  557.     mov bl,[char]   ; memory to general register
  558.     mov dl,32       ; immediate value to general register
  559.     mov [char],32   ; immediate value to memory
  560.     mov ax,ds       ; segment register to general register
  561.     mov [bx],ds     ; segment register to memory
  562.     mov ds,ax       ; general register to segment register
  563.     mov ds,[bx]     ; memory to segment register
  564.     mov eax,cr0     ; control register to general register
  565.     mov cr3,ebx     ; general register to control register
  566.  
  567.   "xchg" swaps the contents of two operands. It can swap two byte operands,
  568. two word operands or two double word operands. Order of operands is not
  569. important. The operands may be two general registers, or general register
  570. with memory. For example:
  571.  
  572.     xchg ax,bx      ; swap two general registers
  573.     xchg al,[char]  ; swap register with memory
  574.  
  575.   "push" decrements the stack pointer (ESP register), then transfers
  576. the operand to the top of stack indicated by ESP. The operand can be memory,
  577. general register, segment register or immediate value of word or double word
  578. size. If operand is an immediate value and no size is specified, it is by
  579. default treated as a word value if assembler is in 16-bit mode and as a double
  580. word value if assembler is in 32-bit mode. "pushw" and "pushd" mnemonics are
  581. variants of this instruction that store the values of word or double word size
  582. respectively. If more operands follow in the same line (separated only with
  583. spaces, not commas), compiler will assemble chain of the "push" instructions
  584. with these operands. The examples are with single operands:
  585.  
  586.     push ax         ; store general register
  587.     push es         ; store segment register
  588.     pushw [bx]      ; store memory
  589.     push 1000h      ; store immediate value
  590.  
  591.   "pusha" saves the contents of the eight general registers on the stack.
  592. This instruction has no operands. There are two versions of this instruction,
  593. one 16-bit and one 32-bit, assembler automatically generates the appropriate
  594. version for current mode, but it can be overridden by using "pushaw" or
  595. "pushad" mnemonic to always get the 16-bit or 32-bit version. The 16-bit
  596. version of this instruction pushes general registers on the stack in the
  597. following order: AX, CX, DX, BX, the initial value of SP before AX was pushed,
  598. BP, SI and DI. The 32-bit version pushes equivalent 32-bit general registers
  599. in the same order.
  600.   "pop" transfers the word or double word at the current top of stack to the
  601. destination operand, and then increments ESP to point to the new top of stack.
  602. The operand can be memory, general register or segment register. "popw" and
  603. "popd" mnemonics are variants of this instruction for restoring the values of
  604. word or double word size respectively. If more operands separated with spaces
  605. follow in the same line, compiler will assemble chain of the "pop"
  606. instructions with these operands.
  607.  
  608.     pop bx          ; restore general register
  609.     pop ds          ; restore segment register
  610.     popw [si]       ; restore memory
  611.  
  612.   "popa" restores the registers saved on the stack by "pusha" instruction,
  613. except for the saved value of SP (or ESP), which is ignored. This instruction
  614. has no operands. To force assembling 16-bit or 32-bit version of this
  615. instruction use "popaw" or "popad" mnemonic.
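  As a brief illustration of how these instructions pair up, the following
sketch saves and restores registers around some code. It assumes 16-bit mode,
and the values loaded into registers are arbitrary:

    push ax bx      ; chain of two "push" instructions
    mov ax,1        ; AX and BX can now be changed freely
    mov bx,2
    pop bx ax       ; restore in the reverse order

    pusha           ; save all eight general registers
    xor cx,cx       ; any general register can be changed here
    popa            ; restore them, the saved SP is ignored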
  616.  
  617.  
  618. 2.1.2  Type conversion instructions
  619.  
  620. The type conversion instructions convert bytes into words, words into double
  621. words, and double words into quad words. These conversions can be done using
  622. the sign extension or zero extension. The sign extension fills the extra bits
  623. of the larger item with the value of the sign bit of the smaller item, the
  624. zero extension simply fills them with zeros.
  625.   "cwd" and "cdq" double the size of the value in AX or EAX respectively and
  626. store the extra bits into the DX or EDX register. The conversion is done using
  627. the sign extension. These instructions have no operands.
  628.   "cbw" extends the sign of the byte in AL throughout AX, and "cwde" extends
  629. the sign of the word in AX throughout EAX. These instructions also have no
  630. operands.
  631.   "movsx" converts a byte to word or double word and a word to double word
  632. using the sign extension. "movzx" does the same, but it uses the zero
  633. extension. The source operand can be general register or memory, while the
  634. destination operand must be a general register. For example:
  635.  
  636.     movsx ax,al         ; byte register to word register
  637.     movsx edx,dl        ; byte register to double word register
  638.     movsx eax,ax        ; word register to double word register
  639.     movsx ax,byte [bx]  ; byte memory to word register
  640.     movsx edx,byte [bx] ; byte memory to double word register
  641.     movsx eax,word [bx] ; word memory to double word register
  642.  
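  The difference between the two kinds of extension can be seen in the
following sketch, where the value placed in AL is an arbitrary choice with
the sign bit set:

    mov al,80h      ; sign bit of AL is set
    movsx bx,al     ; BX = 0FF80h, extra bits copy the sign bit
    movzx cx,al     ; CX = 0080h, extra bits are filled with zeros
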
  643.  
  644. 2.1.3  Binary arithmetic instructions
  645.  
  646. "add" replaces the destination operand with the sum of the source and
  647. destination operands and sets CF if a carry (unsigned overflow) has occurred.
  648. The operands may be bytes, words or double words. The destination operand can
  649. be general register or memory, the source operand can be general register or
  650. immediate value, it can also be memory if the destination operand is register.
  651.  
  652.     add ax,bx       ; add register to register
  653.     add ax,[si]     ; add memory to register
  654.     add [di],al     ; add register to memory
  655.     add al,48       ; add immediate value to register
  656.     add [char],48   ; add immediate value to memory
  657.  
  658.   "adc" sums the operands, adds one if CF is set, and replaces the destination
  659. operand with the result. Rules for the operands are the same as for the "add"
  660. instruction. An "add" followed by multiple "adc" instructions can be used to
  661. add numbers longer than 32 bits.
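  For instance, assuming (only for this sketch) that one 64-bit number is kept
in EDX:EAX and the other in EBX:ECX, they can be added this way:

    add eax,ecx     ; add low double words, CF receives the carry
    adc edx,ebx     ; add high double words and the carry

  The sum is left in EDX:EAX, and CF after the "adc" indicates a carry out of
the whole 64-bit result.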
  662.   "inc" adds one to the operand, it does not affect CF. The operand can be a
  663. general register or memory, and the size of the operand can be byte, word or
  664. double word.
  665.  
  666.     inc ax          ; increment register by one
  667.     inc byte [bx]   ; increment memory by one
  668.  
  669.   "sub" subtracts the source operand from the destination operand and replaces
  670. the destination operand with the result. If a borrow is required, the CF is
  671. set. Rules for the operands are the same as for the "add" instruction.
  672.   "sbb" subtracts the source operand from the destination operand, subtracts
  673. one if CF is set, and stores the result to the destination operand. Rules for
  674. the operands are the same as for the "add" instruction. A "sub" followed by
  675. multiple "sbb" instructions may be used to subtract numbers longer than 32
  676. bits.
  677.   "dec" subtracts one from the operand, it does not affect CF. Rules for the
  678. operand are the same as for the "inc" instruction.
  679.   "cmp" subtracts the source operand from the destination operand. It updates
  680. the flags as the "sub" instruction, but does not alter the source and
  681. destination operands. Rules for the operands are the same as for the "sub"
  682. instruction.
  683.   "neg" subtracts a signed integer operand from zero. The effect of this
  684. instruction is to reverse the sign of the operand from positive to negative or
  685. from negative to positive. Rules for the operand are the same as for the "inc"
  686. instruction.
  687.   "xadd" exchanges the destination operand with the source operand, then loads
  688. the sum of the two values into the destination operand. The destination can be
  689. general register or memory, the source operand must be a general register.
  690.   All the above binary arithmetic instructions update SF, ZF, PF and OF flags.
  691. SF is always set to the same value as the result's sign bit, ZF is set when
  692. all the bits of result are zero, PF is set when low order eight bits of result
  693. contain an even number of set bits, OF is set if result is too large for a
  694. positive number or too small for a negative number (excluding sign bit) to fit
  695. in destination operand.
  696.   "mul" performs an unsigned multiplication of the operand and the
  697. accumulator. If the operand is a byte, the processor multiplies it by the
  698. contents of AL and returns the 16-bit result to AH and AL. If the operand is a
  699. word, the processor multiplies it by the contents of AX and returns the 32-bit
  700. result to DX and AX. If the operand is a double word, the processor multiplies
  701. it by the contents of EAX and returns the 64-bit result in EDX and EAX. "mul"
  702. sets CF and OF when the upper half of the result is nonzero, otherwise they
  703. are cleared. Rules for the operand are the same as for the "inc" instruction.
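  As a small illustration with arbitrarily chosen values, multiplying two
double words whose product does not fit in 32 bits:

    mov eax,10000h  ; EAX = 65536
    mov ebx,10000h  ; EBX = 65536
    mul ebx         ; EDX = 1, EAX = 0, CF and OF are set

  If both values were small enough for the product to fit in EAX, EDX would
be zero and CF and OF would be cleared.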
  704.   "imul" performs a signed multiplication operation. This instruction has
  705. three variations. First has one operand and behaves in the same way as the
  706. "mul" instruction. Second has two operands, in this case destination operand
  707. is multiplied by the source operand and the result replaces the destination
  708. operand. Destination operand must be a general register, it can be word or
  709. double word, source operand can be general register, memory or immediate
  710. value. Third form has three operands, the destination operand must be a
  711. general register, word or double word in size, source operand can be general
  712. register or memory, and third operand must be an immediate value. The source
  713. operand is multiplied by the immediate value and the result is stored in the
  714. destination register. All three forms calculate the product to twice the size
  715. of operands and set CF and OF when the upper half of the result is not just
  716. the sign extension of the lower half, but second and third form truncate the
  717. product to the size of operands. So second and third forms can also be used
  718. for unsigned operands because, whether the operands are signed or unsigned,
  719. the lower half of the product is the same. Examples for all three forms:
  720.  
  721.     imul bl         ; accumulator by register
  722.     imul word [si]  ; accumulator by memory
  723.     imul bx,cx      ; register by register
  724.     imul bx,[si]    ; register by memory
  725.     imul bx,10      ; register by immediate value
  726.     imul ax,bx,10   ; register by immediate value to register
  727.     imul ax,[si],10 ; memory by immediate value to register
  728.  
  729.   "div" performs an unsigned division of the accumulator by the operand.
  730. The dividend (the accumulator) is twice the size of the divisor (the operand),
  731. the quotient and remainder have the same size as the divisor. If divisor is
  732. byte, the dividend is taken from AX register, the quotient is stored in AL and
  733. the remainder is stored in AH. If divisor is word, the upper half of dividend
  734. is taken from DX, the lower half of dividend is taken from AX, the quotient is
  735. stored in AX and the remainder is stored in DX. If divisor is double word,
  736. the upper half of dividend is taken from EDX, the lower half of dividend is
  737. taken from EAX, the quotient is stored in EAX and the remainder is stored in
  738. EDX. Rules for the operand are the same as for the "mul" instruction.
  739.   "idiv" performs a signed division of the accumulator by the operand.
  740. It uses the same registers as the "div" instruction, and the rules for
  741. the operand are the same.
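  Before dividing, the upper half of the dividend has to be prepared. A
minimal sketch with arbitrary values, which clears EDX for the unsigned
division and uses "cdq" (see 2.1.2) for the signed one:

    mov eax,100
    xor edx,edx     ; upper half of unsigned dividend is zero
    mov ecx,7
    div ecx         ; EAX = 14 (quotient), EDX = 2 (remainder)

    mov eax,-100
    cdq             ; extend the sign of EAX into EDX
    mov ecx,7
    idiv ecx        ; EAX = -14 (quotient), EDX = -2 (remainder)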
  742.  
  743.  
  744. 2.1.4  Decimal arithmetic instructions
  745.  
  746. Decimal arithmetic is performed by combining the binary arithmetic
  747. instructions (already described in the prior section) with the decimal
  748. arithmetic instructions. The decimal arithmetic instructions are used to
  749. adjust the results of a previous binary arithmetic operation to produce a
  750. valid packed or unpacked decimal result, or to adjust the inputs to a
  751. subsequent binary arithmetic operation so the operation will produce a valid
  752. packed or unpacked decimal result.
  753.   "daa" adjusts the result of adding two valid packed decimal operands in
  754. AL. "daa" must always follow the addition of two pairs of packed decimal
  755. numbers (one digit in each half-byte) to obtain a pair of valid packed
  756. decimal digits as results. The carry flag is set if carry was needed.
  757. This instruction has no operands.
  758.   "das" adjusts the result of subtracting two valid packed decimal operands
  759. in AL. "das" must always follow the subtraction of one pair of packed decimal
  760. numbers (one digit in each half-byte) from another to obtain a pair of valid
  761. packed decimal digits as results. The carry flag is set if a borrow was
  762. needed. This instruction has no operands.
  763.   "aaa" changes the contents of register AL to a valid unpacked decimal
  764. number, and zeroes the top four bits. "aaa" must always follow the addition
  765. of two unpacked decimal operands in AL. The carry flag is set and AH is
  766. incremented if a carry is necessary. This instruction has no operands.
  767.   "aas" changes the contents of register AL to a valid unpacked decimal
  768. number, and zeroes the top four bits. "aas" must always follow the
  769. subtraction of one unpacked decimal operand from another in AL. The carry flag
  770. is set and AH decremented if a borrow is necessary. This instruction has no
  771. operands.
  772.   "aam" corrects the result of a multiplication of two valid unpacked decimal
  773. numbers. "aam" must always follow the multiplication of two decimal numbers
  774. to produce a valid decimal result. The high order digit is left in AH, the
  775. low order digit in AL. The generalized version of this instruction allows
  776. adjustment of the contents of the AX to create two unpacked digits of any
  777. number base. The standard version of this instruction has no operands, the
  778. generalized version has one operand - an immediate value specifying the
  779. number base for the created digits.
  780.   "aad" modifies the numerator in AH and AL to prepare for the division of two
  781. valid unpacked decimal operands so that the quotient produced by the division
  782. will be a valid unpacked decimal number. AH should contain the high order
  783. digit and AL the low order digit. This instruction adjusts the value and
  784. places the result in AL, while AH will contain zero. The generalized version
  785. of this instruction allows adjustment of two unpacked digits of any number
  786. base. Rules for the operand are the same as for the "aam" instruction.
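  The following sketch, with arbitrarily chosen digits, shows the adjustment
after a packed addition and after an unpacked multiplication:

    mov al,38h      ; packed decimal 38
    add al,45h      ; binary sum is 7Dh
    daa             ; AL = 83h, which is packed decimal 83

    mov al,7        ; unpacked decimal digit 7
    mov bl,9        ; unpacked decimal digit 9
    mul bl          ; AX = 003Fh, which is 63
    aam             ; AH = 6, AL = 3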
  787.  
  788.  
  789. 2.1.5  Logical instructions
  790.  
  791. "not" inverts the bits in the specified operand to form a one's complement
  792. of the operand. It has no effect on the flags. Rules for the operand are the
  793. same as for the "inc" instruction.
  794.   "and", "or" and "xor" instructions perform the standard logical operations.
  795. They update the SF, ZF and PF flags and clear CF and OF. Rules for the
  796. operands are the same as for the "add" instruction.
  797.   "bt", "bts", "btr" and "btc" instructions operate on a single bit which can
  798. be in memory or in a general register. The location of the bit is specified
  799. as an offset from the low order end of the operand. The value of the offset
  800. is taken from the second operand, it may be either an immediate byte or
  801. a general register. These instructions first assign the value of the selected
  802. bit to CF. "bt" instruction does nothing more, "bts" sets the selected bit to
  803. 1, "btr" resets the selected bit to 0, "btc" changes the bit to its
  804. complement. The first operand can be word or double word.
  805.  
  806.     bt  ax,15        ; test bit in register
  807.     bts word [bx],15 ; test and set bit in memory
  808.     btr ax,cx        ; test and reset bit in register
  809.     btc word [bx],cx ; test and complement bit in memory
  810.  
  811.   "bsf" and "bsr" instructions scan a word or double word for first set bit
  812. and store the index of this bit into destination operand, which must be
  813. general register. The bit string being scanned is specified by source operand,
  814. it may be either general register or memory. The ZF flag is set if the entire
  815. string is zero (no set bits are found); otherwise it is cleared. If no set bit
  816. is found, the value of the destination register is undefined. "bsf" scans from
  817. low order to high order (starting from bit index zero). "bsr" scans from high
  818. order to low order (starting from bit index 15 of a word or index 31 of a
  819. double word).
  820.  
  821.     bsf ax,bx        ; scan register forward
  822.     bsr ax,[si]      ; scan memory reverse
  823.  
  824.   "shl" shifts the destination operand left by the number of bits specified
  825. in the second operand. The destination operand can be byte, word, or double
  826. word general register or memory. The second operand can be an immediate value
  827. or the CL register. The processor shifts zeros in from the right (low order)
  828. side of the operand as bits exit from the left side. The last bit that exited
  829. is stored in CF. "sal" is a synonym for "shl".
  830.  
  831.     shl al,1         ; shift register left by one bit
  832.     shl byte [bx],1  ; shift memory left by one bit
  833.     shl ax,cl        ; shift register left by count from cl
  834.     shl word [bx],cl ; shift memory left by count from cl
  835.  
  836.   "shr" and "sar" shift the destination operand right by the number of bits
  837. specified in the second operand. Rules for operands are the same as for the
  838. "shl" instruction. "shr" shifts zeros in from the left side of the operand as
  839. bits exit from the right side. The last bit that exited is stored in CF.
  840. "sar" preserves the sign of the operand by shifting in zeros on the left side
  841. if the value is positive or by shifting in ones if the value is negative.
  842.   "shld" shifts bits of the destination operand to the left by the number
  843. of bits specified in third operand, while shifting high order bits from the
  844. source operand into the destination operand on the right. The source operand
  845. remains unmodified. The destination operand can be a word or double word
  846. general register or memory, the source operand must be a general register,
  847. third operand can be an immediate value or the CL register.
  848.  
  849.     shld ax,bx,1     ; shift register left by one bit
  850.     shld [di],bx,1   ; shift memory left by one bit
  851.     shld ax,bx,cl    ; shift register left by count from cl
  852.     shld [di],bx,cl  ; shift memory left by count from cl
  853.  
  854.   "shrd" shifts bits of the destination operand to the right, while shifting
  855. low order bits from the source operand into the destination operand on the
  856. left. The source operand remains unmodified. Rules for operands are the same
  857. as for the "shld" instruction.
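  These two instructions are commonly paired with "shl" and "shr" to shift a
value wider than a single register. The sketch below assumes that a 64-bit
value is kept in EDX:EAX (a choice made only for this example) and that the
count in CL is less than 32:

    shld edx,eax,cl ; shift EDX left, taking in high bits of EAX
    shl eax,cl      ; then shift EAX left, taking in zeros

    shrd eax,edx,cl ; shift EAX right, taking in low bits of EDX
    shr edx,cl      ; then shift EDX right, taking in zeros

  The first pair shifts the whole 64-bit value left, the second pair shifts
it right.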
  858.   "rol" and "rcl" rotate the byte, word or double word destination operand
  859. left by the number of bits specified in the second operand. For each rotation
  860. specified, the high order bit that exits from the left of the operand returns
  861. at the right to become the new low order bit. "rcl" additionally puts in CF
  862. each high order bit that exits from the left side of the operand before it
  863. returns to the operand as the low order bit on the next rotation cycle. Rules
  864. for operands are the same as for the "shl" instruction.
  865.   "ror" and "rcr" rotate the byte, word or double word destination operand
  866. right by the number of bits specified in the second operand. For each rotation
  867. specified, the low order bit that exits from the right of the operand returns
  868. at the left to become the new high order bit. "rcr" additionally puts in CF
  869. each low order bit that exits from the right side of the operand before it
  870. returns to the operand as the high order bit on the next rotation cycle.
  871. Rules for operands are the same as for the "shl" instruction.
  872.   "test" performs the same action as the "and" instruction, but it does not
  873. alter the destination operand, only updates flags. Rules for the operands are
  874. the same as for the "and" instruction.
  875.   "bswap" reverses the byte order of a 32-bit general register: bits 0 through
  876. 7 are swapped with bits 24 through 31, and bits 8 through 15 are swapped with
  877. bits 16 through 23. This instruction is provided for converting little-endian
  878. values to big-endian format and vice versa.
  879.  
  880.     bswap edx        ; swap bytes in register
  881.  
  882.  
  883. 2.1.6  Control transfer instructions
  884.  
  885. "jmp" unconditionally transfers control to the target location. The
  886. destination address can be specified directly within the instruction or
  887. indirectly through a register or memory, the acceptable size of this address
  888. depends on whether the jump is near or far (it can be specified by preceding
  889. the operand with "near" or "far" operator) and whether the instruction is
  890. 16-bit or 32-bit. Operand for near jump should be "word" size for 16-bit
  891. instruction or the "dword" size for 32-bit instruction. Operand for far jump
  892. should be "dword" size for 16-bit instruction or "pword" size for 32-bit
  893. instruction. A direct "jmp" instruction includes the destination address as
  894. part of the instruction (and can be preceded by "short", "near" or "far"
  895. operator), the operand specifying address should be the numerical expression
  896. for near or short jump, or two numerical expressions separated with colon for
  897. far jump, the first specifies selector of segment, the second is the offset
  898. within segment. The "pword" operator can be used to force the 32-bit far jump,
  899. and "dword" to force the 16-bit far jump. An indirect "jmp" instruction
  900. obtains the destination address indirectly through a register or a pointer
  901. variable, the operand should be general register or memory. See also 1.2.5 for
  902. some more details.
  903.  
  904.     jmp 100h         ; direct near jump
  905.     jmp 0FFFFh:0     ; direct far jump
  906.     jmp ax           ; indirect near jump
  907.     jmp pword [ebx]  ; indirect far jump
  908.  
  909.   "call" transfers control to the procedure, saving on the stack the address
  910. of the instruction following the "call" for later use by a "ret" (return)
  911. instruction. Rules for the operands are the same as for the "jmp" instruction,
  912. but the "call" has no short variant of direct instruction and thus it is not
  913. optimized.
  914.   "ret", "retn" and "retf" instructions terminate the execution of a procedure
  915. and transfer control back to the program that originally invoked the
  916. procedure using the address that was stored on the stack by the "call"
  917. instruction. "ret" is the equivalent for "retn", which returns from the
  918. procedure that was executed using the near call, while "retf" returns from
  919. the procedure that was executed using the far call. These instructions default
  920. to the size of address appropriate for the current code setting, but the size
  921. of address can be forced to 16-bit by using the "retw", "retnw" and "retfw"
  922. mnemonics, and to 32-bit by using the "retd", "retnd" and "retfd" mnemonics.
  923. All these instructions may optionally specify an immediate operand, by adding
  924. this constant to the stack pointer, they effectively remove any arguments that
  925. the calling program pushed on the stack before the execution of the "call"
  926. instruction.
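  A minimal sketch of a near call with one argument passed on the stack,
assuming 32-bit mode. The labels "twice" and "done" are hypothetical names
used only for this example:

    pushd 5         ; store the argument
    call twice      ; store the return address and jump to the procedure
    jmp done        ; EAX = 10 here and the argument is already removed
    twice:
    mov eax,[esp+4] ; the argument lies just above the return address
    add eax,eax
    retn 4          ; return and add 4 to ESP
    done:

  The operand of "retn" is 4 because one double word argument was pushed.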
  927.   "iret" returns control to an interrupted procedure. It differs from "ret" in
  928. that it also pops the flags from the stack into the flags register. The flags
  929. are stored on the stack by the interrupt mechanism. It defaults to the size of
  930. return address appropriate for the current code setting, but it can be forced
  931. to use 16-bit or 32-bit address by using the "iretw" or "iretd" mnemonic.
  932.   The conditional transfer instructions are jumps that may or may not transfer
  933. control, depending on the state of the CPU flags when the instruction
  934. executes. The mnemonics for conditional jumps may be obtained by attaching
  935. the condition mnemonic (see table 2.1) to the "j" mnemonic,
  936. for example "jc" instruction will transfer the control when the CF flag is
  937. set. The conditional jumps can be short or near, and direct only, and can be
  938. optimized (see 1.2.5), the operand should be an immediate value specifying
  939. target address.
  940.  
  941.    Table 2.1  Conditions
  942.   /-----------------------------------------------------------\
  943.   | Mnemonic | Condition tested      | Description            |
  944.   |==========|=======================|========================|
  945.   | o        | OF = 1                | overflow               |
  946.   |----------|-----------------------|------------------------|
  947.   | no       | OF = 0                | not overflow           |
  948.   |----------|-----------------------|------------------------|
  949.   | c        |                       | carry                  |
  950.   | b        | CF = 1                | below                  |
  951.   | nae      |                       | not above nor equal    |
  952.   |----------|-----------------------|------------------------|
  953.   | nc       |                       | not carry              |
  954.   | ae       | CF = 0                | above or equal         |
  955.   | nb       |                       | not below              |
  956.   |----------|-----------------------|------------------------|
  957.   | e        | ZF = 1                | equal                  |
  958.   | z        |                       | zero                   |
  959.   |----------|-----------------------|------------------------|
  960.   | ne       | ZF = 0                | not equal              |
  961.   | nz       |                       | not zero               |
  962.   |----------|-----------------------|------------------------|
  963.   | be       | CF or ZF = 1          | below or equal         |
  964.   | na       |                       | not above              |
  965.   |----------|-----------------------|------------------------|
  966.   | a        | CF or ZF = 0          | above                  |
  967.   | nbe      |                       | not below nor equal    |
  968.   |----------|-----------------------|------------------------|
  969.   | s        | SF = 1                | sign                   |
  970.   |----------|-----------------------|------------------------|
  971.   | ns       | SF = 0                | not sign               |
  972.   |----------|-----------------------|------------------------|
  973.   | p        | PF = 1                | parity                 |
  974.   | pe       |                       | parity even            |
  975.   |----------|-----------------------|------------------------|
  976.   | np       | PF = 0                | not parity             |
  977.   | po       |                       | parity odd             |
  978.   |----------|-----------------------|------------------------|
  979.   | l        | SF xor OF = 1         | less                   |
  980.   | nge      |                       | not greater nor equal  |
  981.   |----------|-----------------------|------------------------|
  982.   | ge       | SF xor OF = 0         | greater or equal       |
  983.   | nl       |                       | not less               |
  984.   |----------|-----------------------|------------------------|
  985.   | le       | (SF xor OF) or ZF = 1 | less or equal          |
  986.   | ng       |                       | not greater            |
  987.   |----------|-----------------------|------------------------|
  988.   | g        | (SF xor OF) or ZF = 0 | greater                |
  989.   | nle      |                       | not less nor equal     |
  990.   \-----------------------------------------------------------/
  991.  
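  The same flags can be read in different ways. In the following sketch (the
value in AL and the label "skip" are arbitrary) the byte 0FFh is compared with
1, first as a signed and then as an unsigned value:

    mov al,0FFh     ; 255 when unsigned, -1 when signed
    cmp al,1        ; CF = 0, ZF = 0, SF = 1, OF = 0
    jg skip         ; not taken, -1 is not greater than 1
    ja skip         ; taken, 255 is above 1
    mov al,0        ; never executed
    skip:

  In general the "above" and "below" conditions fit unsigned comparisons,
while "greater" and "less" fit signed ones.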
  992.   The "loop" instructions are conditional jumps that use a value placed in
  993. CX (or ECX) to specify the number of repetitions of a software loop. All
  994. "loop" instructions automatically decrement CX (or ECX) and terminate the
  995. loop (don't transfer the control) when CX (or ECX) is zero. It uses CX or ECX
  996. depending on whether the current code setting is 16-bit or 32-bit, but it can
  997. be forced to use CX with the "loopw" mnemonic or to use ECX with the "loopd"
  998. mnemonic. "loope" and "loopz" are the synonyms for the same instruction, which
  999. acts as the standard "loop", but also terminates the loop when ZF flag is not
  1000. set. "loopew" and "loopzw" mnemonics force them to use CX register while
  1001. "looped" and "loopzd" force them to use ECX register. "loopne" and "loopnz"
  1002. are the synonyms for the same instruction, which acts as the standard "loop",
  1003. but also terminates the loop when ZF flag is set. "loopnew" and "loopnzw"
  1004. mnemonics force them to use CX register while "loopned" and "loopnzd" force
  1005. them to use ECX register. Every "loop" instruction needs an operand being an
  1006. immediate value specifying target address, it can be only short jump (in the
  1007. range of 128 bytes back and 127 bytes forward from the address of instruction
  1008. following the "loop" instruction).
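  As a simple sketch, assuming 16-bit mode (so that CX is the counter) and
using the hypothetical label "next", the loop below sums the numbers from 5
down to 1:

    xor ax,ax       ; AX holds the sum
    mov cx,5        ; number of repetitions
    next:
    add ax,cx       ; adds 5, 4, 3, 2 and 1 in turn
    loop next       ; decrement CX and jump back if it is not zero

  After the loop AX contains 15 and CX is zero.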
  1009.   "jcxz" branches to the label specified in the instruction if it finds a
  1010. value of zero in CX, "jecxz" does the same, but checks the value of ECX
  1011. instead of CX. Rules for the operands are the same as for the "loop"
  1012. instruction.
  1013.   "int" activates the interrupt service routine that corresponds to the
  1014. number specified as an operand to the instruction, the number should be in
  1015. range from 0 to 255. The interrupt service routine terminates with an "iret"
  1016. instruction that returns control to the instruction that follows "int".
  1017. "int3" mnemonic codes the short (one byte) trap that invokes the interrupt 3.
  1018. "into" instruction invokes the interrupt 4 if the OF flag is set.
  1019.   "bound" verifies that the signed value contained in the specified register
  1020. lies within specified limits. An interrupt 5 occurs if the value contained in
  1021. the register is less than the lower bound or greater than the upper bound. It
  1022. needs two operands, the first operand specifies the register being tested,
  1023. the second operand should be memory address for the two signed limit values.
  1024. The operands can be "word" or "dword" in size.
  1025.  
  1026.     bound ax,[bx]    ; check word for bounds
  1027.     bound eax,[esi]  ; check double word for bounds
  1028.  
  1029.  
  1030. 2.1.7  I/O instructions
  1031.  
  1032.   "in" transfers a byte, word, or double word from an input port to AL, AX,
  1033. or EAX. I/O ports can be addressed either directly, with the immediate byte
  1034. value coded in instruction, or indirectly via the DX register. The destination
  1035. operand should be AL, AX, or EAX register. The source operand should be an
  1036. immediate value in range from 0 to 255, or DX register.
  1037.  
  1038.     in al,20h        ; input byte from port 20h
  1039.     in ax,dx         ; input word from port addressed by dx
  1040.  
  1041.   "out" transfers a byte, word, or double word to an output port from AL, AX,
  1042. or EAX. The program can specify the number of the port using the same methods
  1043. as the "in" instruction. The destination operand should be an immediate value
  1044. in range from 0 to 255, or DX register. The source operand should be AL, AX,
  1045. or EAX register.
  1046.  
  1047.     out 20h,ax       ; output word to port 20h
  1048.     out dx,al        ; output byte to port addressed by dx
  1049.  
  1050.  
  1051. 2.1.8  Strings operations
  1052.  
  1053. The string operations operate on one element of a string. A string element
  1054. may be a byte, a word, or a double word. The string elements are addressed by
  1055. SI and DI (or ESI and EDI) registers. After every string operation SI and/or
  1056. DI (or ESI and/or EDI) are automatically updated to point to the next element
  1057. of the string. If DF (direction flag) is zero, the index registers are
  1058. incremented, if DF is one, they are decremented. The amount of the increment
  1059. or decrement is 1, 2, or 4 depending on the size of the string element. Every
  1060. string operation instruction has short forms which have no operands and use
  1061. SI and/or DI when the code type is 16-bit, and ESI and/or EDI when the code
  1062. type is 32-bit. SI and ESI by default address data in the segment selected
  1063. by DS, DI and EDI always address data in the segment selected by ES. Short
  1064. form is obtained by attaching to the mnemonic of string operation letter
  1065. specifying the size of string element, it should be "b" for byte element,
  1066. "w" for word element, and "d" for double word element. Full form of string
  1067. operation needs operands providing the size operator and the memory addresses,
  1068. which can be SI or ESI with any segment prefix, DI or EDI always with ES
  1069. segment prefix.
  1070.   "movs" transfers the string element pointed to by SI (or ESI) to the
  1071. location pointed to by DI (or EDI). Size of operands can be byte, word, or
  1072. double word. The destination operand should be memory addressed by DI or EDI,
  1073. the source operand should be memory addressed by SI or ESI with any segment
  1074. prefix.
  1075.  
  1076.     movs byte [di],[si]        ; transfer byte
  1077.     movs word [es:di],[ss:si]  ; transfer word
  1078.     movsd                      ; transfer double word
  1079.  
  1080.   "cmps" subtracts the destination string element from the source string
  1081. element and updates the flags AF, SF, PF, CF and OF, but it does not change
  1082. any of the compared elements. If the string elements are equal, ZF is set,
  1083. otherwise it is cleared. The first operand for this instruction should be the
  1084. source string element addressed by SI or ESI with any segment prefix, the
  1085. second operand should be the destination string element addressed by DI or
  1086. EDI.
  1087.  
  1088.     cmpsb                      ; compare bytes
  1089.     cmps word [ds:si],[es:di]  ; compare words
  1090.     cmps dword [fs:esi],[edi]  ; compare double words
  1091.  
  1092.   "scas" subtracts the destination string element from AL, AX, or EAX
  1093. (depending on the size of string element) and updates the flags AF, SF, ZF,
  1094. PF, CF and OF. If the values are equal, ZF is set, otherwise it is cleared.
  1095. The operand should be the destination string element addressed by DI or EDI.
  1096.  
  1097.     scas byte [es:di]          ; scan byte
  1098.     scasw                      ; scan word
  1099.     scas dword [es:edi]        ; scan double word
  1100.  
  1101.   "stos" places the value of AL, AX, or EAX into the destination string
  1102. element. Rules for the operand are the same as for the "scas" instruction.
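
          Since "stos" follows the same operand rules as "scas", its forms can be
        expected to look like the sketch below:

            stos byte [es:di]          ; store al
            stosw                      ; store ax
            stos dword [es:edi]        ; store eax
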
  1103.   "lods" places the source string element into AL, AX, or EAX. The operand
  1104. should be the source string element addressed by SI or ESI with any segment
  1105. prefix.
  1106.  
  1107.     lods byte [ds:si]          ; load byte
  1108.     lods word [cs:si]          ; load word
  1109.     lodsd                      ; load double word
  1110.  
  1111.   "ins" transfers a byte, word, or double word from an input port addressed
  1112. by DX register to the destination string element. The destination operand
  1113. should be memory addressed by DI or EDI, the source operand should be the DX
  1114. register.
  1115.  
  1116.     insb                       ; input byte
  1117.     ins word [es:di],dx        ; input word
  1118.     ins dword [edi],dx         ; input double word
  1119.  
  1120.   "outs" transfers the source string element to an output port addressed by
  1121. DX register. The destination operand should be the DX register and the source
  1122. operand should be memory addressed by SI or ESI with any segment prefix.
  1123.  
  1124.     outs dx,byte [si]          ; output byte
  1125.     outsw                      ; output word
  1126.     outs dx,dword [gs:esi]     ; output double word
  1127.  
  1128.   The repeat prefixes "rep", "repe"/"repz", and "repne"/"repnz" specify
  1129. repeated string operation. When a string operation instruction has a repeat
  1130. prefix, the operation is executed repeatedly, each time using a different
  1131. element of the string. The repetition terminates when one of the conditions
  1132. specified by the prefix is satisfied. All three prefixes automatically
  1133. decrease CX or ECX register (depending whether string operation instruction
  1134. uses the 16-bit or 32-bit addressing) after each operation and repeat the
  1135. associated operation until CX or ECX is zero. "repe"/"repz" and
  1136. "repne"/"repnz" are used exclusively with the "scas" and "cmps" instructions
  1137. (described above). When these prefixes are used, repetition of the next
  1138. instruction depends on the zero flag (ZF) also, "repe" and "repz" terminate
  1139. the execution when the ZF is zero, "repne" and "repnz" terminate the execution
  1140. when the ZF is set.
  1141.  
  1142.     rep  movsd       ; transfer multiple double words
  1143.     repe cmpsb       ; compare bytes until not equal
  1144.  
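          As a further sketch, the two fragments below combine the repeat prefixes
        with string operations. They assume 32-bit code with ES selecting the same
        segment as DS; "buffer" (a 64-byte area) and "text" (a zero-terminated
        string) are hypothetical labels introduced only for this example.

            cld              ; clear DF so that EDI is incremented
            mov edi,buffer   ; destination of the fill
            mov ecx,64       ; number of bytes to store
            xor al,al        ; value to store
            rep stosb        ; fill 64 bytes with zero

            cld              ; clear DF so that EDI is incremented
            mov edi,text     ; string to measure
            or ecx,-1        ; largest possible count
            xor al,al        ; search for the terminating zero
            repne scasb      ; scan until the zero byte is found
            not ecx          ; ECX = number of bytes scanned
            dec ecx          ; ECX = length without the terminator
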
  1145.  
  1146. 2.1.9  Flag control instructions
  1147.  
  1148. The flag control instructions provide a method for directly changing the
  1149. state of bits in the flag register. All instructions described in this
  1150. section have no operands.
  1151.   "stc" sets the CF (carry flag) to 1, "clc" zeroes the CF, "cmc" changes the
  1152. CF to its complement. "std" sets the DF (direction flag) to 1, "cld" zeroes
  1153. the DF, "sti" sets the IF (interrupt flag) to 1 and therefore enables the
  1154. interrupts, "cli" zeroes the IF and therefore disables the interrupts.
  1155.   "lahf" copies SF, ZF, AF, PF, and CF to bits 7, 6, 4, 2, and 0 of the
  1156. AH register. The contents of the remaining bits are undefined. The flags
  1157. remain unaffected.
  1158.   "sahf" transfers bits 7, 6, 4, 2, and 0 from the AH register into SF, ZF,
  1159. AF, PF, and CF.
  1160.   "pushf" decrements "esp" by two or four and stores the low word or
  1161. double word of flags register at the top of stack, size of stored data
  1162. depends on the current code setting. "pushfw" variant forces storing the
  1163. word and "pushfd" forces storing the double word.
  1164.   "popf" transfers specific bits from the word or double word at the top
  1165. of stack, then increments "esp" by two or four, this value depends on
  1166. the current code setting. "popfw" variant forces restoring from the word
  1167. and "popfd" forces restoring from the double word.
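
          The following sketch, assuming 32-bit code, reads the flags register
        into a general register and writes it back:

            pushfd           ; store the flags double word on the stack
            pop eax          ; EAX now holds a copy of the flags
            push eax         ; put the (possibly modified) copy back
            popfd            ; and load it into the flags register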
  1168.  
  1169.  
  1170. 2.1.10  Conditional operations
  1171.  
  1172.   The instructions obtained by attaching the condition mnemonic (see table
  1173. 2.1) to the "set" mnemonic set a byte to one if the condition is true and set
  1174. the byte to zero otherwise. The operand should be an 8-bit general register
  1175. or the byte in memory.
  1176.  
  1177.     setne al         ; set al if zero flag cleared
  1178.     seto byte [bx]   ; set byte if overflow
  1179.  
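          A typical use is turning the result of a comparison into a number. The
        sketch below assumes 32-bit code and uses the "l" (less) condition from
        table 2.1.

            cmp eax,ebx      ; compare two signed values
            setl al          ; AL = 1 if eax < ebx, otherwise 0
            movzx eax,al     ; extend the result to the whole EAX
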
  1180.   "salc" instruction sets all the bits of AL register when the carry flag is
  1181. set and zeroes the AL register otherwise. This instruction has no arguments.
  1182.   The instructions obtained by attaching the condition mnemonic to "cmov"
  1183. mnemonic transfer the word or double word from the general register or memory
  1184. to the general register only when the condition is true. The destination
  1185. operand should be general register, the source operand can be general register
  1186. or memory.
  1187.  
  1188.     cmove ax,bx      ; move when zero flag set
  1189.     cmovnc eax,[ebx] ; move when carry flag cleared
  1190.  
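          For example, a signed maximum can be computed without a jump. This is
        only a sketch, assuming 32-bit code and a processor that supports the
        conditional moves.

            cmp eax,ebx      ; compare two signed values
            cmovl eax,ebx    ; if eax < ebx, replace it with ebx
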
  1191.   "cmpxchg" compares the value in the AL, AX, or EAX register with the
  1192. destination operand. If the two values are equal, the source operand is
  1193. loaded into the destination operand. Otherwise, the destination operand is
  1194. loaded into the AL, AX, or EAX register. The destination operand may be a
  1195. general register or memory, the source operand must be a general register.
  1196.  
  1197.     cmpxchg dl,bl    ; compare and exchange with register
  1198.     cmpxchg [bx],dx  ; compare and exchange with memory
  1199.  
  1200.   "cmpxchg8b" compares the 64-bit value in EDX and EAX registers with the
  1201. destination operand. If the values are equal, the 64-bit value in ECX and EBX
  1202. registers is stored in the destination operand. Otherwise, the value in the
  1203. destination operand is loaded into EDX and EAX registers. The destination
  1204. operand should be a quad word in memory.
  1205.  
  1206.     cmpxchg8b [bx]   ; compare and exchange 8 bytes
  1207.  
  1208.  
  1209. 2.1.11  Miscellaneous instructions
  1210.  
  1211. "nop" instruction occupies one byte but affects nothing but the instruction
  1212. pointer. This instruction has no operands and doesn't perform any operation.
  1213.   "ud2" instruction generates an invalid opcode exception. This instruction
  1214. is provided for software testing to explicitly generate an invalid opcode.
  1215. This instruction has no operands.
  1216.   "xlat" replaces a byte in the AL register with a byte indexed by its value
  1217. in a translation table addressed by BX or EBX. The operand should be a byte
  1218. memory addressed by BX or EBX with any segment prefix. This instruction has
  1219. also a short form "xlatb" which has no operands and uses the BX or EBX address
  1220. in the segment selected by DS depending on the current code setting.
  1221.   "lds" transfers a pointer variable from the source operand to DS and the
  1222. destination register. The source operand must be a memory operand, and the
  1223. destination operand must be a general register. The DS register receives the
  1224. segment selector of the pointer while the destination register receives the
  1225. offset part of the pointer. "les", "lfs", "lgs" and "lss" operate identically
  1226. to "lds" except that rather than DS register the ES, FS, GS and SS is used
  1227. respectively.
  1228.  
  1229.     lds bx,[si]      ; load pointer to ds:bx
  1230.  
  1231.   "lea" transfers the offset of the source operand (rather than its value)
  1232. to the destination operand. The source operand must be a memory operand, and
  1233. the destination operand must be a general register.
  1234.  
  1235.     lea dx,[bx+si+1] ; load effective address to dx
  1236.  
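          Because "lea" only computes the address and does not access memory, it
        can also serve for simple arithmetic, as in this sketch for 32-bit code:

            lea eax,[eax+eax*4]  ; eax = eax*5
            lea edx,[ebx+ecx+8]  ; edx = ebx+ecx+8
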
  1237.   "cpuid" returns processor identification and feature information in the
  1238. EAX, EBX, ECX, and EDX registers. The information returned is selected by
  1239. entering a value in the EAX register before the instruction is executed.
  1240. This instruction has no operands.
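
          A minimal sketch of querying the vendor identification string, assuming
        32-bit code and a processor that supports "cpuid"; "vendor" is a
        hypothetical label of a 12-byte buffer.

            xor eax,eax          ; select function 0
            cpuid                ; EAX = highest function, EBX, EDX, ECX = vendor
            mov [vendor],ebx     ; first four characters
            mov [vendor+4],edx   ; next four characters
            mov [vendor+8],ecx   ; last four characters
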
  1241.   "pause" instruction delays the execution of the next instruction an
  1242. implementation specific amount of time. It can be used to improve the
  1243. performance of spin wait loops. This instruction has no operands.
  1244.   "enter" creates a stack frame that may be used to implement the scope rules
  1245. of block-structured high-level languages. A "leave" instruction at the end of
  1246. a procedure complements an "enter" at the beginning of the procedure to
  1247. simplify stack management and to control access to variables for nested
  1248. procedures. The "enter" instruction includes two parameters. The first
  1249. parameter specifies the number of bytes of dynamic storage to be allocated on
  1250. the stack for the routine being entered. The second parameter corresponds to
  1251. the lexical nesting level of the routine, it can be in range from 0 to 31.
  1252. The specified lexical level determines how many sets of stack frame pointers
  1253. the CPU copies into the new stack frame from the preceding frame. This list
  1254. of stack frame pointers is sometimes called the display. The first word (or
  1255. double word when code is 32-bit) of the display is a pointer to the last stack
  1256. frame. This pointer enables a "leave" instruction to reverse the action of the
  1257. previous "enter" instruction by effectively discarding the last stack frame.
  1258. After "enter" creates the new display for a procedure, it allocates the
  1259. dynamic storage space for that procedure by decrementing ESP by the number of
  1260. bytes specified in the first parameter. To enable a procedure to address its
  1261. display, "enter" leaves BP (or EBP) pointing to the beginning of the new stack
  1262. frame. If the lexical level is zero, "enter" pushes BP (or EBP), copies SP to
  1263. BP (or ESP to EBP) and then subtracts the first operand from ESP. For nesting
  1264. levels greater than zero, the processor pushes additional frame pointers on
  1265. the stack before adjusting the stack pointer.
  1266.  
  1267.     enter 2048,0     ; enter and allocate 2048 bytes on stack
  1268.  
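          With the nesting level of zero, the description above corresponds to
        the following sketch of a procedure frame in 32-bit code:

            enter 16,0           ; same effect as the three instructions:
                                 ;   push ebp
                                 ;   mov ebp,esp
                                 ;   sub esp,16
            mov dword [ebp-4],0  ; use a local variable
            leave                ; restore ESP and EBP
            ret                  ; return to the caller
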
  1269.  
  1270. 2.1.12  System instructions
  1271.  
  1272. "lmsw" loads the operand into the machine status word (bits 0 through 15 of
  1273. CR0 register), while "smsw" stores the machine status word into the
  1274. destination operand. The operand for both those instructions can be 16-bit
  1275. general register or memory, for "smsw" it can also be 32-bit general
  1276. register.
  1277.  
  1278.     lmsw ax          ; load machine status from register
  1279.     smsw [bx]        ; store machine status to memory
  1280.  
  1281.   "lgdt" and "lidt" instructions load the values in operand into the global
  1282. descriptor table register or the interrupt descriptor table register
  1283. respectively. "sgdt" and "sidt" store the contents of the global descriptor
  1284. table register or the interrupt descriptor table register in the destination
  1285. operand. The operand should be 6 bytes in memory.
  1286.  
  1287.     lgdt [ebx]       ; load global descriptor table
  1288.  
  1289.   "lldt" loads the operand into the segment selector field of the local
  1290. descriptor table register and "sldt" stores the segment selector from the
  1291. local descriptor table register in the operand. "ltr" loads the operand into
  1292. the segment selector field of the task register and "str" stores the segment
  1293. selector from the task register in the operand. Rules for operand are the same
  1294. as for the "lmsw" and "smsw" instructions.
  1295.   "lar" loads the access rights from the segment descriptor specified by
  1296. the selector in source operand into the destination operand and sets the ZF
  1297. flag. The destination operand can be a 16-bit or 32-bit general register.
  1298. The source operand should be a 16-bit general register or memory.
  1299.  
  1300.     lar ax,[bx]      ; load access rights into word
  1301.     lar eax,dx       ; load access rights into double word
  1302.  
  1303.   "lsl" loads the segment limit from the segment descriptor specified by the
  1304. selector in source operand into the destination operand and sets the ZF flag.
  1305. Rules for operand are the same as for the "lar" instruction.
  1306.   "verr" and "verw" verify whether the code or data segment specified with
  1307. the operand is readable or writable from the current privilege level. The
  1308. operand should be a word, it can be general register or memory. If the segment
  1309. is accessible and readable (for "verr") or writable (for "verw") the ZF flag
  1310. is set, otherwise it's cleared. Rules for operand are the same as for the
  1311. "lldt" instruction.
  1312.   "arpl" compares the RPL (requestor's privilege level) fields of two segment
  1313. selectors. The first operand contains one segment selector and the second
  1314. operand contains the other. If the RPL field of the destination operand is
  1315. less than the RPL field of the source operand, the ZF flag is set and the RPL
  1316. field of the destination operand is increased to match that of the source
  1317. operand. Otherwise, the ZF flag is cleared and no change is made to the
  1318. destination operand. The destination operand can be a word general register
  1319. or memory, the source operand must be a general register.
  1320.  
  1321.     arpl bx,ax       ; adjust RPL of selector in register
  1322.     arpl [bx],ax     ; adjust RPL of selector in memory
  1323.  
  1324.   "clts" clears the TS (task switched) flag in the CR0 register. This
  1325. instruction has no operands.
  1326.   "lock" prefix causes the processor's bus-lock signal to be asserted during
  1327. execution of the accompanying instruction. In a multiprocessor environment,
  1328. the bus-lock signal ensures that the processor has exclusive use of any shared
  1329. memory while the signal is asserted. The "lock" prefix can be prepended only
  1330. to the following instructions and only to those forms of the instructions
  1331. where the destination operand is a memory operand: "add", "adc", "and", "btc",
  1332. "btr", "bts", "cmpxchg", "cmpxchg8b", "dec", "inc", "neg", "not", "or", "sbb",
  1333. "sub", "xor", "xadd" and "xchg". If the "lock" prefix is used with one of
  1334. these instructions and the source operand is a memory operand, an undefined
  1335. opcode exception may be generated. An undefined opcode exception will also be
  1336. generated if the "lock" prefix is used with any instruction not in the above
  1337. list. The "xchg" instruction always asserts the bus-lock signal regardless of
  1338. the presence or absence of the "lock" prefix.
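
          A sketch of an atomic increment built from "lock" and "cmpxchg",
        assuming 32-bit code; "counter" is a hypothetical label of a double
        word in memory and "retry" is a hypothetical label.

            mov eax,[counter]          ; read the current value
          retry:
            lea edx,[eax+1]            ; compute the new value
            lock cmpxchg [counter],edx ; store it only if counter still equals eax
            jnz retry                  ; otherwise eax was reloaded, try again
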
  1339.   "hlt" stops instruction execution and places the processor in a halted
  1340. state. An enabled interrupt, a debug exception, the BINIT, INIT or the RESET
  1341. signal will resume execution. This instruction has no operands.
  1342.   "invlpg" invalidates (flushes) the TLB (translation lookaside buffer) entry
  1343. specified with the operand, which should be a memory. The processor determines
  1344. the page that contains that address and flushes the TLB entry for that page.
  1345.   "rdmsr" loads the contents of a 64-bit MSR (model specific register) of the
  1346. address specified in the ECX register into registers EDX and EAX. "wrmsr"
  1347. writes the contents of registers EDX and EAX into the 64-bit MSR of the
  1348. address specified in the ECX register. "rdtsc" loads the current value of the
  1349. processor's time stamp counter from the 64-bit MSR into the EDX and EAX
  1350. registers. The processor increments the time stamp counter MSR every clock
  1351. cycle and resets it to 0 whenever the processor is reset. "rdpmc" loads the
  1352. contents of the 40-bit performance monitoring counter specified in the ECX
  1353. register into registers EDX and EAX. These instructions have no operands.
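
          A sketch of measuring elapsed clock cycles with "rdtsc" in 32-bit
        code (ESI and EDI are used here only as temporary storage):

            rdtsc            ; EDX and EAX = counter at the start
            mov esi,eax      ; save low double word
            mov edi,edx      ; save high double word
            ; ... code being measured (must preserve ESI and EDI) ...
            rdtsc            ; EDX and EAX = counter at the end
            sub eax,esi      ; subtract low double words
            sbb edx,edi      ; subtract high double words with borrow
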
  1354.   "wbinvd" writes back all modified cache lines in the processor's internal
  1355. cache to main memory and invalidates (flushes) the internal caches. The
  1356. instruction then issues a special function bus cycle that directs external
  1357. caches to also write back modified data and another bus cycle to indicate that
  1358. the external caches should be invalidated. This instruction has no operands.
  1359.   "rsm" returns program control from the system management mode to the program
  1360. that was interrupted when the processor received an SMM interrupt. This
  1361. instruction has no operands.
  1362.   "sysenter" executes a fast call to a level 0 system procedure, "sysexit"
  1363. executes a fast return to level 3 user code. The addresses used by these
  1364. instructions are stored in MSRs. These instructions have no operands.
  1365.  
  1366.  
  1367. 2.1.13  FPU instructions
  1368.  
  1369. The FPU (Floating-Point Unit) instructions operate on the floating-point
  1370. values in three formats: single precision (32-bit), double precision (64-bit)
  1371. and double extended precision (80-bit). The FPU registers form the stack and
  1372. each of them holds the double extended precision floating-point value. When
  1373. some values are pushed onto the stack or are removed from the top, the FPU
  1374. registers are shifted, so ST0 is always the value on the top of FPU stack, ST1
  1375. is the first value below the top, etc. The ST0 name has also the synonym ST.
  1376.   "fld" pushes the floating-point value onto the FPU register stack. The
  1377. operand can be 32-bit, 64-bit or 80-bit memory location or the FPU register,
  1378. its value is then loaded onto the top of FPU register stack (the ST0
  1379. register) and is automatically converted into the double extended precision
  1380. format.
  1381.  
  1382.     fld dword [bx]   ; load single precision value from memory
  1383.     fld st2          ; push value of st2 onto register stack
  1384.  
  1385.   "fld1", "fldz", "fldl2t", "fldl2e", "fldpi", "fldlg2" and "fldln2" load the
  1386. commonly used constants onto the FPU register stack. The loaded constants are
  1387. +1.0, +0.0, lb 10, lb e, pi, lg 2 and ln 2 respectively. These instructions
  1388. have no operands.
  1389.   "fild" converts the signed integer source operand into double extended
  1390. precision floating-point format and pushes the result onto the FPU register
  1391. stack. The source operand can be a 16-bit, 32-bit or 64-bit memory location.
  1392.  
  1393.     fild qword [bx]  ; load 64-bit integer from memory
  1394.  
  1395.   "fst" copies the value of ST0 register to the destination operand, which
  1396. can be 32-bit or 64-bit memory location or another FPU register. "fstp"
  1397. performs the same operation as "fst" and then pops the register stack,
  1398. getting rid of ST0. "fstp" accepts the same operands as the "fst" instruction
  1399. and can also store value in the 80-bit memory.
  1400.  
  1401.     fst st3          ; copy value of st0 into st3 register
  1402.     fstp tword [bx]  ; store value in memory and pop stack
  1403.  
  1404.   "fist" converts the value in ST0 to a signed integer and stores the result
  1405. in the destination operand. The operand can be 16-bit or 32-bit memory
  1406. location. "fistp" performs the same operation and then pops the register
  1407. stack, it accepts the same operands as the "fist" instruction and can also
  1408. store integer value in the 64-bit memory, so it has the same rules for
  1409. operands as "fild" instruction.
  1410.   "fbld" converts the packed BCD integer into double extended precision
  1411. floating-point format and pushes this value onto the FPU stack. "fbstp"
  1412. converts the value in ST0 to an 18-digit packed BCD integer, stores the result
  1413. in the destination operand, and pops the register stack. The operand should be
  1414. an 80-bit memory location.
  1415.   "fadd" adds the destination and source operand and stores the sum in the
  1416. destination location. The destination operand is always an FPU register, if
  1417. the source is a memory location, the destination is ST0 register and only
  1418. source operand should be specified. If both operands are FPU registers, at
  1419. least one of them should be ST0 register. An operand in memory can be a
  1420. 32-bit or 64-bit value.
  1421.  
  1422.     fadd qword [bx]  ; add double precision value to st0
  1423.     fadd st2,st0     ; add st0 to st2
  1424.  
  1425.   "faddp" adds the destination and source operand, stores the sum in the
  1426. destination location and then pops the register stack. The destination operand
  1427. must be an FPU register and the source operand must be the ST0. When no
  1428. operands are specified, ST1 is used as a destination operand.
  1429.  
  1430.     faddp            ; add st0 to st1 and pop the stack
  1431.     faddp st2,st0    ; add st0 to st2 and pop the stack
  1432.  
  1433. "fiadd" instruction converts an integer source operand into double extended
  1434. precision floating-point value and adds it to the destination operand. The
  1435. operand should be a 16-bit or 32-bit memory location.
  1436.  
  1437.     fiadd word [bx]  ; add word integer to st0
  1438.  
  1439.   "fsub", "fsubr", "fmul", "fdiv", "fdivr" instructions are similar to "fadd",
  1440. have the same rules for operands and differ only in the performed computation.
  1441. "fsub" subtracts the source operand from the destination operand, "fsubr"
  1442. subtracts the destination operand from the source operand, "fmul" multiplies
  1443. the destination and source operands, "fdiv" divides the destination operand by
  1444. the source operand and "fdivr" divides the source operand by the destination
  1445. operand. "fsubp", "fsubrp", "fmulp", "fdivp", "fdivrp" perform the same
  1446. operations and pop the register stack, the rules for operand are the same as
  1447. for the "faddp" instruction. "fisub", "fisubr", "fimul", "fidiv", "fidivr"
  1448. perform these operations after converting the integer source operand into
  1449. floating-point value, they have the same rules for operands as "fiadd"
  1450. instruction.
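
          A sketch of evaluating a*b+c in single precision; "a", "b", "c" and
        "result" are hypothetical labels of double words in memory.

            fld dword [a]        ; st0 = a
            fmul dword [b]       ; st0 = a*b
            fadd dword [c]       ; st0 = a*b+c
            fstp dword [result]  ; store the result and pop the stack
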
  1451.   "fsqrt" computes the square root of the value in ST0 register, "fsin"
  1452. computes the sine of that value, "fcos" computes the cosine of that value,
  1453. "fchs" complements its sign bit, "fabs" clears its sign to create the absolute
  1454. value, "frndint" rounds it to the nearest integral value, depending on the
  1455. current rounding mode. "f2xm1" computes the exponential value of 2 to the
  1456. power of ST0 and subtracts 1.0 from it, the value of ST0 must lie in the
  1457. range -1.0 to +1.0. All these instructions store the result in ST0 and have no
  1458. operands.
  1459.   "fsincos" computes both the sine and the cosine of the value in ST0
  1460. register, stores the sine in ST0 and pushes the cosine on the top of FPU
  1461. register stack. "fptan" computes the tangent of the value in ST0, stores the
  1462. result in ST0 and pushes a 1.0 onto the FPU register stack. "fpatan" computes
  1463. the arctangent of the value in ST1 divided by the value in ST0, stores the
  1464. result in ST1 and pops the FPU register stack. "fyl2x" computes the binary
  1465. logarithm of ST0, multiplies it by ST1, stores the result in ST1 and pops the
  1466. FPU register stack; "fyl2xp1" performs the same operation but it adds 1.0 to
  1467. ST0 before computing the logarithm. "fprem" computes the remainder obtained
  1468. from dividing the value in ST0 by the value in ST1, and stores the result
  1469. in ST0. "fprem1" performs the same operation as "fprem", but it computes the
  1470. remainder in the way specified by IEEE Standard 754. "fscale" truncates the
  1471. value in ST1 and increases the exponent of ST0 by this value. "fxtract"
  1472. separates the value in ST0 into its exponent and significand, stores the
  1473. exponent in ST0 and pushes the significand onto the register stack. "fnop"
  1474. performs no operation. These instructions have no operands.
  1475.   "fxch" exchanges the contents of ST0 and another FPU register. The operand
  1476. should be an FPU register, if no operand is specified, the contents of ST0 and
  1477. ST1 are exchanged.
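
          As an illustration of combining the instructions described above, the
        natural logarithm can be obtained from "fldln2" and "fyl2x", since
        ln x = ln 2 * lb x. "x" and "y" are hypothetical labels of quad words
        in memory, and the value of x is assumed to be positive.

            fldln2           ; st0 = ln 2
            fld qword [x]    ; st0 = x, st1 = ln 2
            fyl2x            ; st1 = st1 * lb st0, then pop, so st0 = ln x
            fstp qword [y]   ; store the result and pop the stack
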
  1478.   "fcom" and "fcomp" compare the contents of ST0 and the source operand and
  1479. set flags in the FPU status word according to the results. "fcomp"
  1480. additionally pops the register stack after performing the comparison. The
  1481. operand can be a single or double precision value in memory or the FPU
  1482. register. When no operand is specified, ST1 is used as a source operand.
  1483.  
  1484.     fcom             ; compare st0 with st1
  1485.     fcomp st2        ; compare st0 with st2 and pop stack
  1486.  
  1487.   "fcompp" compares the contents of ST0 and ST1, sets flags in the FPU status
  1488. word according to the results and pops the register stack twice. This
  1489. instruction has no operands.
  1490.   "fucom", "fucomp" and "fucompp" perform an unordered comparison of two FPU
  1491. registers. Rules for operands are the same as for the "fcom", "fcomp" and
  1492. "fcompp", but the source operand must be an FPU register.
  1493.   "ficom" and "ficomp" compare the value in ST0 with an integer source operand
  1494. and set the flags in the FPU status word according to the results. "ficomp"
  1495. additionally pops the register stack after performing the comparison. The
  1496. integer value is converted to double extended precision floating-point format
  1497. before the comparison is made. The operand should be a 16-bit or 32-bit
  1498. memory location.
  1499.  
  1500.     ficom word [bx]  ; compare st0 with 16-bit integer
  1501.  
  1502.   "fcomi", "fcomip", "fucomi", "fucomip" perform the comparison of ST0 with
  1503. another FPU register and set the ZF, PF and CF flags according to the results.
  1504. "fcomip" and "fucomip" additionally pop the register stack after performing the
  1505. comparison. The instructions obtained by attaching the FPU condition mnemonic
  1506. (see table 2.2) to the "fcmov" mnemonic transfer the specified FPU register
  1507. into ST0 register if the given test condition is true. These instructions
  1508. allow two different syntaxes, one with single operand specifying the source
  1509. FPU register, and one with two operands, in that case destination operand
  1510. should be ST0 register and the second operand specifies the source FPU
  1511. register.
  1512.  
  1513.     fcomi st2        ; compare st0 with st2 and set flags
  1514.     fcmovb st0,st2   ; transfer st2 to st0 if below
  1515.  
  1516.    Table 2.2  FPU conditions
  1517.   /------------------------------------------------------\
  1518.   | Mnemonic | Condition tested | Description            |
  1519.   |==========|==================|========================|
  1520.   | b        | CF = 1           | below                  |
  1521.   | e        | ZF = 1           | equal                  |
  1522.   | be       | CF or ZF = 1     | below or equal         |
  1523.   | u        | PF = 1           | unordered              |
  1524.   | nb       | CF = 0           | not below              |
  1525.   | ne       | ZF = 0           | not equal              |
  1526.   | nbe      | CF and ZF = 0    | not below nor equal    |
  1527.   | nu       | PF = 0           | not unordered          |
  1528.   \------------------------------------------------------/
  1529.  
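  The comparison and the conditional move combine into a branchless
selection. As a minimal sketch (the choice of ST1 is only an example), the
following pair leaves the greater of ST0 and ST1 in ST0; note that an unordered
comparison also sets CF, so ST1 is copied in that case as well:

    fcomi st1          ; compare st0 with st1, set ZF, PF and CF
    fcmovb st0,st1     ; CF = 1 means st0 is below st1, so take st1
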
  1530.   "ftst" compares the value in ST0 with 0.0 and sets the flags in the FPU
  1531. status word according to the results. "fxam" examines the contents of the ST0
  1532. and sets the flags in FPU status word to indicate the class of value in the
  1533. register. These instructions have no operands.
  1534.   "fstsw" and "fnstsw" store the current value of the FPU status word in the
  1535. destination location. The destination operand can be either a 16-bit memory or
  1536. the AX register. "fstsw" checks for pending unmasked FPU exceptions before
  1537. storing the status word, "fnstsw" does not.
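  A common way to branch on the result of "ftst" or another FPU comparison is
to copy the status word into AX and load AH into the CPU flags with "sahf",
which places C0 in CF, C2 in PF and C3 in ZF. In this sketch the "unordered"
and "negative" labels are hypothetical names:

    ftst               ; compare st0 with 0.0
    fnstsw ax          ; copy FPU status word into ax
    sahf               ; C0 -> CF, C2 -> PF, C3 -> ZF
    jp unordered       ; PF = 1: st0 cannot be compared (for example NaN)
    jb negative        ; CF = 1: st0 is less than 0.0
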
  1538.   "fstcw" and "fnstcw" store the current value of the FPU control word at the
  1539. specified destination in memory. "fstcw" checks for pending unmasked FPU
  1540. exceptions before storing the control word, "fnstcw" does not. "fldcw" loads
  1541. the operand into the FPU control word. The operand should be a 16-bit memory
  1542. location.
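  As an illustration, the control word can be used to switch the rounding
mode temporarily. The sketch below assumes hypothetical word variables "old_cw"
and "new_cw" and a double word variable "result"; it sets both bits of the
rounding control field (bits 10 and 11) to select truncation:

    fnstcw [old_cw]          ; save the current control word
    mov ax,[old_cw]
    or ax,0C00h              ; rounding control = 11b, round toward zero
    mov [new_cw],ax
    fldcw [new_cw]           ; activate truncation
    fistp dword [result]     ; store st0 as truncated integer and pop
    fldcw [old_cw]           ; restore the previous rounding mode
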
  1543.   "fstenv" and "fnstenv" store the current FPU operating environment at the
  1544. memory location specified with the destination operand, and then mask all FPU
  1545. exceptions. "fstenv" checks for pending unmasked FPU exceptions before
  1546. proceeding, "fnstenv" does not. "fldenv" loads the complete operating
  1547. environment from memory into the FPU. "fsave" and "fnsave" store the current
  1548. FPU state (operating environment and register stack) at the specified
  1549. destination in memory and reinitialize the FPU. "fsave" checks for pending
  1550. unmasked FPU exceptions before proceeding, "fnsave" does not. "frstor"
  1551. loads the FPU state from the specified memory location. All these instructions
  1552. need an operand being a memory location. For each of these instructions
  1553. there exist two additional mnemonics that precisely select the type of the
  1554. operation. The "fstenvw", "fnstenvw", "fldenvw", "fsavew", "fnsavew" and
  1555. "frstorw" mnemonics force the instruction to perform the operation as in 16-bit
  1556. mode, while "fstenvd", "fnstenvd", "fldenvd", "fsaved", "fnsaved" and "frstord"
  1557. force the operation as in 32-bit mode.
  1558.   "finit" and "fninit" set the FPU operating environment into its default
  1559. state. "finit" checks for pending unmasked FPU exceptions before proceeding,
  1560. "fninit" does not. "fclex" and "fnclex" clear the FPU exception flags in the
  1561. FPU status word. "fclex" checks for pending unmasked FPU exceptions before
  1562. proceeding, "fnclex" does not. "wait" and "fwait" are synonyms for the same
  1563. instruction, which causes the processor to check for pending unmasked FPU
  1564. exceptions and handle them before proceeding. These instructions have no
  1565. operands.
  1566.   "ffree" sets the tag associated with specified FPU register to empty. The
  1567. operand should be an FPU register.
  1568.   "fincstp" and "fdecstp" rotate the FPU stack by one by adding one to or
  1569. subtracting one from the top of stack pointer. These instructions have no
  1570. operands.
  1571.  
  1572.  
  1573. 2.1.14  MMX instructions
  1574.  
  1575. The MMX instructions operate on the packed integer types and use the MMX
  1576. registers, which are the low 64-bit parts of the 80-bit FPU registers. Because
  1577. of this MMX instructions cannot be used at the same time as FPU instructions.
  1578. They can operate on packed bytes (eight 8-bit integers), packed words (four
  1579. 16-bit integers) or packed double words (two 32-bit integers). Use of packed
  1580. formats allows operations to be performed on multiple data at one time.
  1581.   "movq" copies a quad word from the source operand to the destination
  1582. operand. At least one of the operands must be a MMX register, the second one
  1583. can be also a MMX register or 64-bit memory location.
  1584.  
  1585.     movq mm0,mm1     ; move quad word from register to register
  1586.     movq mm2,[ebx]   ; move quad word from memory to register
  1587.  
  1588.   "movd" copies a double word from the source operand to the destination
  1589. operand. One of the operands must be a MMX register, the second one can be a
  1590. general register or 32-bit memory location. Only low double word of MMX
  1591. register is used.
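  For instance, the following lines move a double word in both directions
(the registers are chosen arbitrarily); when the MMX register is the
destination, its high double word is filled with zeros:

    movd mm0,eax     ; move eax into low double word of mm0
    movd [ebx],mm1   ; store low double word of mm1 in memory
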
  1592.   All general MMX operations have two operands, the destination operand should
  1593. be a MMX register, the source operand can be a MMX register or 64-bit memory
  1594. location. Operation is performed on the corresponding data elements of the
  1595. source and destination operand and stored in the data elements of the
  1596. destination operand. "paddb", "paddw" and "paddd" perform the addition of
  1597. packed bytes, packed words, or packed double words. "psubb", "psubw" and
  1598. "psubd" perform the subtraction of appropriate types. "paddsb", "paddsw",
  1599. "psubsb" and "psubsw" perform the addition or subtraction of packed bytes
  1600. or packed words with the signed saturation. "paddusb", "paddusw", "psubusb",
  1601. "psubusw" are analogous, but with unsigned saturation. "pmulhw" and "pmullw"
  1602. perform a signed multiplication of the packed words and store the high or low
  1603. words of the results in the destination operand. "pmaddwd" performs a multiply
  1604. of the packed words and adds the four intermediate double word products in
  1605. pairs to produce the result as packed double words. "pand", "por" and "pxor"
  1606. perform the logical operations on the quad words, "pandn" also performs a
  1607. logical negation of the destination operand before performing the "and"
  1608. operation. "pcmpeqb", "pcmpeqw" and "pcmpeqd" compare for equality of packed
  1609. bytes, packed words or packed double words. If a pair of data elements is
  1610. equal, the corresponding data element in the destination operand is filled with
  1611. bits of value 1, otherwise it's set to 0. "pcmpgtb", "pcmpgtw" and "pcmpgtd"
  1612. perform the similar operation, but they check whether the data elements in the
  1613. destination operand are greater than the corresponding data elements in the
  1614. source operand. "packsswb" converts packed signed words into packed signed
  1615. bytes, "packssdw" converts packed signed double words into packed signed
  1616. words, using saturation to handle overflow conditions. "packuswb" converts
  1617. packed signed words into packed unsigned bytes. Converted data elements from
  1618. the source operand are stored in the low part of the destination operand,
  1619. while converted data elements from the destination operand are stored in the
  1620. high part. "punpckhbw", "punpckhwd" and "punpckhdq" interleave the data
  1621. elements from the high parts of the source and destination operands and
  1622. store the result into the destination operand. "punpcklbw", "punpcklwd" and
  1623. "punpckldq" perform the same operation, but the low parts of the source and
  1624. destination operand are used.
  1625.  
  1626.     paddsb mm0,[esi] ; add packed bytes with signed saturation
  1627.     pcmpeqw mm3,mm7  ; compare packed words for equality
  1628.  
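  The unpack instructions are often paired with a zeroed register to widen
data. As a sketch, assuming ESI points to eight unsigned bytes, the following
sequence converts them into eight words held in MM0 and MM1 (MM7 serves as a
scratch register):

    pxor mm7,mm7         ; mm7 = 0
    movq mm0,[esi]       ; load eight bytes
    movq mm1,mm0         ; keep a copy for the high part
    punpcklbw mm0,mm7    ; low four bytes become four zero-extended words
    punpckhbw mm1,mm7    ; high four bytes become four zero-extended words
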
  1629.   "psllw", "pslld" and "psllq" perform logical shift left of the packed words,
  1630. packed double words or a single quad word in the destination operand by the
  1631. amount specified in the source operand. "psrlw", "psrld" and "psrlq" perform
  1632. logical shift right of the packed words, packed double words or a single quad
  1633. word. "psraw" and "psrad" perform arithmetic shift right of the packed words or
  1634. double words. The destination operand should be a MMX register, while source
  1635. operand can be a MMX register, 64-bit memory location, or 8-bit immediate
  1636. value.
  1637.  
  1638.     psllw mm2,mm4    ; shift words left logically
  1639.     psrad mm4,[ebx]  ; shift double words right arithmetically
  1640.  
  1641.   "emms" makes the FPU registers usable for the FPU instructions, it must be
  1642. used before using the FPU instructions if any MMX instructions were used.
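  A typical sequence might look like the sketch below, where "value" is a
hypothetical double word variable holding a single precision number:

    movq mm0,[esi]       ; MMX code
    paddw mm0,[edi]
    movq [esi],mm0
    emms                 ; mark all FPU registers as empty
    fld dword [value]    ; FPU instructions can be used again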
  1643.  
  1644.  
  1645. 2.1.15  SSE instructions
  1646.  
  1647. The SSE extension adds more MMX instructions and also introduces the
  1648. operations on packed single precision floating point values. The 128-bit
  1649. packed single precision format consists of four single precision floating
  1650. point values. The 128-bit SSE registers are designed for the purpose of
  1651. operations on this data type.
  1652.   "movaps" and "movups" transfer a double quad word operand containing packed
  1653. single precision values from source operand to destination operand. At least
  1654. one of the operands has to be a SSE register, the second one can be also a
  1655. SSE register or 128-bit memory location. Memory operands for "movaps"
  1656. instruction must be aligned on boundary of 16 bytes, operands for "movups"
  1657. instruction don't have to be aligned.
  1658.  
  1659.     movups xmm0,[ebx]  ; move unaligned double quad word
  1660.  
  1661.   "movlps" moves packed two single precision values between the memory and the
  1662. low quad word of SSE register. "movhps" moves packed two single precision
  1663. values between the memory and the high quad word of SSE register. One of the
  1664. operands must be a SSE register, and the other operand must be a 64-bit memory
  1665. location.
  1666.  
  1667.     movlps xmm0,[ebx]  ; move memory to low quad word of xmm0
  1668.     movhps [esi],xmm7  ; move high quad word of xmm7 to memory
  1669.  
  1670.   "movlhps" moves packed two single precision values from the low quad word
  1671. of source register to the high quad word of destination register. "movhlps"
  1672. moves two packed single precision values from the high quad word of source
  1673. register to the low quad word of destination register. Both operands have to
  1674. be SSE registers.
  1675.   "movmskps" transfers the most significant bit of each of the four single
  1676. precision values in the SSE register into low four bits of a general register.
  1677. The source operand must be a SSE register, the destination operand must be a
  1678. general register.
  1679.   "movss" transfers a single precision value between source and destination
  1680. operand (only the low double word is transferred). At least one of the operands
  1681. has to be a SSE register, the second one can be also a SSE register or 32-bit
  1682. memory location.
  1683.  
  1684.     movss [edi],xmm3   ; move low double word of xmm3 to memory
  1685.  
  1686.   Each of the SSE arithmetic operations has two variants. When the mnemonic
  1687. ends with "ps", the source operand can be a 128-bit memory location or a SSE
  1688. register, the destination operand must be a SSE register and the operation is
  1689. performed on packed four single precision values, for each pair of the
  1690. corresponding data elements separately, the result is stored in the
  1691. destination register. When the mnemonic ends with "ss", the source operand
  1692. can be a 32-bit memory location or a SSE register, the destination operand
  1693. must be a SSE register and the operation is performed on single precision
  1694. values, only low double words of SSE registers are used in this case, the
  1695. result is stored in the low double word of destination register. "addps" and
  1696. "addss" add the values, "subps" and "subss" subtract the source value from
  1697. destination value, "mulps" and "mulss" multiply the values, "divps" and
  1698. "divss" divide the destination value by the source value, "rcpps" and "rcpss"
  1699. compute the approximate reciprocal of the source value, "sqrtps" and "sqrtss"
  1700. compute the square root of the source value, "rsqrtps" and "rsqrtss" compute
  1701. the approximate reciprocal of square root of the source value, "maxps" and
  1702. "maxss" compare the source and destination values and return the greater one,
  1703. "minps" and "minss" compare the source and destination values and return the
  1704. lesser one.
  1705.  
  1706.     mulss xmm0,[ebx]   ; multiply single precision values
  1707.     addps xmm3,xmm7    ; add packed single precision values
  1708.  
  1709.   "andps", "andnps", "orps" and "xorps" perform the logical operations on
  1710. packed single precision values. The source operand can be a 128-bit memory
  1711. location or a SSE register, the destination operand must be a SSE register.
  1712.   "cmpps" compares packed single precision values and returns a mask result
  1713. into the destination operand, which must be a SSE register. The source operand
  1714. can be a 128-bit memory location or SSE register, the third operand must be an
  1715. immediate operand selecting code of one of the eight compare conditions
  1716. (table 2.3). "cmpss" performs the same operation on single precision values,
  1717. only low double word of destination register is affected, in this case source
  1718. operand can be a 32-bit memory location or SSE register. These two
  1719. instructions have also variants with only two operands and the condition
  1720. encoded within mnemonic. Their mnemonics are obtained by attaching the
  1721. mnemonic from table 2.3 to the "cmp" mnemonic and then attaching the "ps" or
  1722. "ss" at the end.
  1723.  
  1724.     cmpps xmm2,xmm4,0  ; compare packed single precision values
  1725.     cmpltss xmm0,[ebx] ; compare single precision values
  1726.  
  1727.    Table 2.3  SSE conditions
  1728.   /-------------------------------------------\
  1729.   | Code | Mnemonic | Description             |
  1730.   |======|==========|=========================|
  1731.   | 0    | eq       | equal                   |
  1732.   | 1    | lt       | less than               |
  1733.   | 2    | le       | less than or equal      |
  1734.   | 3    | unord    | unordered               |
  1735.   | 4    | neq      | not equal               |
  1736.   | 5    | nlt      | not less than           |
  1737.   | 6    | nle      | not less than nor equal |
  1738.   | 7    | ord      | ordered                 |
  1739.   \-------------------------------------------/
  1740.  
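  The mask produced by these instructions is usually combined with the logical
operations. The following sketch replaces with 0.0 every value in XMM0 that is
not greater than zero (or is not a number), using XMM2 as a scratch register:

    xorps xmm2,xmm2      ; xmm2 = 0.0 in all four positions
    cmpltps xmm2,xmm0    ; all bits set where 0.0 < value (cmpps with code 1)
    andps xmm0,xmm2      ; keep the positive values, clear the others
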
  1741.   "comiss" and "ucomiss" compare the single precision values and set the ZF,
  1742. PF and CF flags to show the result. The destination operand must be a SSE
  1743. register, the source operand can be a 32-bit memory location or SSE register.
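  For instance, assuming "a" and "b" are double word variables holding single
precision values, and the jump targets are hypothetical labels:

    movss xmm0,[a]
    comiss xmm0,[b]      ; compare a with b
    jp unordered         ; PF = 1: at least one of the values is a NaN
    ja greater           ; CF = 0 and ZF = 0: a is greater than b
    je equal             ; ZF = 1: both values are equal
                         ; otherwise a is less than b
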
  1744.   "shufps" moves any two of the four single precision values from the
  1745. destination operand into the low quad word of the destination operand, and any
  1746. two of the four values from the source operand into the high quad word of the
  1747. destination operand. The destination operand must be a SSE register, the
  1748. source operand can be a 128-bit memory location or SSE register, the third
  1749. operand must be an 8-bit immediate value selecting which values will be moved
  1750. into the destination operand. Bits 0 and 1 select the value to be moved from
  1751. destination operand to the low double word of the result, bits 2 and 3 select
  1752. the value to be moved from the destination operand to the second double word,
  1753. bits 4 and 5 select the value to be moved from the source operand to the third
  1754. double word, and bits 6 and 7 select the value to be moved from the source
  1755. operand to the high double word of the result.
  1756.  
  1757.     shufps xmm0,xmm0,10010011b ; shuffle double words
  1758.  
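  A frequent use is to replicate one value into all four positions. In this
sketch "scale" is a hypothetical double word variable holding a single
precision value:

    movss xmm1,[scale]   ; low double word = scale, the rest is zeroed
    shufps xmm1,xmm1,0   ; every selector is 0, so all four values = scale
    mulps xmm0,xmm1      ; multiply the four values in xmm0 by scale
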
  1759.   "unpckhps" performs an interleaved unpack of the values from the high parts
  1760. of the source and destination operands and stores the result in the
  1761. destination operand, which must be a SSE register. The source operand can be
  1762. a 128-bit memory location or a SSE register. "unpcklps" performs an
  1763. interleaved unpack of the values from the low parts of the source and
  1764. destination operand and stores the result in the destination operand,
  1765. the rules for operands are the same.
  1766.   "cvtpi2ps" converts packed two double word integers into the packed two
  1767. single precision floating point values and stores the result in the low quad
  1768. word of the destination operand, which should be a SSE register. The source
  1769. operand can be a 64-bit memory location or MMX register.
  1770.  
  1771.     cvtpi2ps xmm0,mm0  ; convert integers to single precision values
  1772.  
  1773.   "cvtsi2ss" converts a double word integer into a single precision floating
  1774. point value and stores the result in the low double word of the destination
  1775. operand, which should be a SSE register. The source operand can be a 32-bit
  1776. memory location or 32-bit general register.
  1777.  
  1778.     cvtsi2ss xmm0,eax  ; convert integer to single precision value
  1779.  
  1780.   "cvtps2pi" converts packed two single precision floating point values into
  1781. packed two double word integers and stores the result in the destination
  1782. operand, which should be a MMX register. The source operand can be a 64-bit
  1783. memory location or SSE register, only low quad word of SSE register is used.
  1784. "cvttps2pi" performs the similar operation, except that truncation is used to
  1785. round the source values to integers, rules for the operands are the same.
  1786.  
  1787.     cvtps2pi mm0,xmm0  ; convert single precision values to integers
  1788.  
  1789.   "cvtss2si" converts a single precision floating point value into a double
  1790. word integer and stores the result in the destination operand, which should be
  1791. a 32-bit general register. The source operand can be a 32-bit memory location
  1792. or SSE register, only low double word of SSE register is used. "cvttss2si"
  1793. performs the similar operation, except that truncation is used to round a
  1794. source value to integer, rules for the operands are the same.
  1795.  
  1796.     cvtss2si eax,xmm0  ; convert single precision value to integer
  1797.  
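  The difference between the two variants shows when the source has a
fractional part. The sketch assumes that the default rounding mode (round to
nearest) is set in MXCSR and that "value" is a hypothetical variable defined
as "dd 2.7":

    movss xmm0,[value]
    cvtss2si eax,xmm0    ; eax = 3, rounded according to MXCSR
    cvttss2si edx,xmm0   ; edx = 2, fraction truncated
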
  1798.   "pextrw" copies the word in the source operand specified by the third
  1799. operand to the destination operand. The source operand must be a MMX register,
  1800. the destination operand must be a 32-bit general register (the high word of
  1801. the destination is cleared), the third operand must be an 8-bit immediate value.
  1802.  
  1803.     pextrw eax,mm0,1   ; extract word into eax
  1804.  
  1805.   "pinsrw" inserts a word from the source operand in the destination operand
  1806. at the location specified with the third operand, which must be an 8-bit
  1807. immediate value. The destination operand must be a MMX register, the source
  1808. operand can be a 16-bit memory location or 32-bit general register (only low
  1809. word of the register is used).
  1810.  
  1811.     pinsrw mm1,ebx,2   ; insert word from ebx
  1812.  
  1813.   "pavgb" and "pavgw" compute average of packed bytes or words. "pmaxub"
  1814. returns the maximum values of packed unsigned bytes, "pminub" returns the
  1815. minimum values of packed unsigned bytes, "pmaxsw" returns the maximum values
  1816. of packed signed words, "pminsw" returns the minimum values of packed signed
  1817. words. "pmulhuw" performs an unsigned multiplication of the packed words and
  1818. stores the high words of the results in the destination operand. "psadbw"
  1819. computes the absolute differences of packed unsigned bytes, sums the
  1820. differences, and stores the sum in the low word of destination operand. All
  1821. these instructions follow the same rules for operands as the general MMX
  1822. operations described in previous section.
  1823.   "pmovmskb" creates a mask made of the most significant bit of each byte in
  1824. the source operand and stores the result in the low byte of destination
  1825. operand. The source operand must be a MMX register, the destination operand
  1826. must be a 32-bit general register.
  1827.   "pshufw" inserts words from the source operand in the destination operand
  1828. from the locations specified with the third operand. The destination operand
  1829. must be a MMX register, the source operand can be a 64-bit memory location or
  1830. MMX register, third operand must be an 8-bit immediate value selecting which
  1831. values will be moved into destination operand, in the similar way as the third
  1832. operand of the "shufps" instruction.
  1833.   "movntq" moves the quad word from the source operand to memory using a
  1834. non-temporal hint to minimize cache pollution. The source operand should be a
  1835. MMX register, the destination operand should be a 64-bit memory location.
  1836. "movntps" stores packed single precision values from the SSE register to
  1837. memory using a non-temporal hint. The source operand should be a SSE register,
  1838. the destination operand should be a 128-bit memory location. "maskmovq" stores
  1839. selected bytes from the first operand into a 64-bit memory location using a
  1840. non-temporal hint. Both operands should be MMX registers, the second operand
  1841. selects which bytes from the source operand are written to memory. The
  1842. memory location is pointed to by DI (or EDI) register in the segment selected
  1843. by DS.
  1844.   "prefetcht0", "prefetcht1", "prefetcht2" and "prefetchnta" fetch the line
  1845. of data from memory that contains byte specified with the operand to a
  1846. specified location in the cache hierarchy. The operand should be an 8-bit memory
  1847. location.
  1848.   "sfence" performs a serializing operation on all instructions storing to
  1849. memory that were issued prior to it. This instruction has no operands.
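  These instructions are typically used together when copying a large block
that is not going to be read again soon. A minimal sketch, assuming ESI and EDI
point to the source and destination, ECX holds a non-zero count of quad words,
and "copy_loop" is a hypothetical label:

  copy_loop:
    prefetchnta byte [esi+64]  ; hint to fetch data for later iterations
    movq mm0,[esi]
    movntq [edi],mm0           ; store with a non-temporal hint
    add esi,8
    add edi,8
    dec ecx
    jnz copy_loop
    sfence                     ; order the stores before any later ones
    emms                       ; make FPU registers usable again
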
  1850.   "ldmxcsr" loads the 32-bit memory operand into the MXCSR register. "stmxcsr"
  1851. stores the contents of MXCSR into a 32-bit memory operand.
  1852.   "fxsave" saves the current state of the FPU, MXCSR register, and all the FPU
  1853. and SSE registers to a 512-byte memory location specified in the destination
  1854. operand. "fxrstor" reloads data previously stored with "fxsave" instruction
  1855. from the specified 512-byte memory location. The memory operand for both those
  1856. instructions must be aligned on 16 byte boundary, it should declare operand
  1857. of no specified size.
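  One way to satisfy both requirements is to define the buffer with a plain
label, which carries no size, instead of a data definition. The sketch below
uses a hypothetical "state" buffer and assumes that the section containing it
is itself aligned on at least 16 bytes:

    fxsave [state]       ; save FPU, MXCSR and SSE registers
    ; ... code that changes the registers ...
    fxrstor [state]      ; restore the saved state

    align 16
    state: rb 512        ; label followed by colon has no size attached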
  1858.  
  1859.  
  1860. 2.1.16  SSE2 instructions
  1861.  
  1862. The SSE2 extension introduces the operations on packed double precision
  1863. floating point values, extends the syntax of MMX instructions, and adds also
  1864. some new instructions.
  1865.   "movapd" and "movupd" transfer a double quad word operand containing packed
  1866. double precision values from source operand to destination operand. These
  1867. instructions are analogous to "movaps" and "movups" and have the same rules
  1868. for operands.
  1869.   "movlpd" moves double precision value between the memory and the low quad
  1870. word of SSE register. "movhpd" moves double precision value between the memory
  1871. and the high quad word of SSE register. These instructions are analogous to
  1872. "movlps" and "movhps" and have the same rules for operands.
  1873.   "movmskpd" transfers the most significant bit of each of the two double
  1874. precision values in the SSE register into low two bits of a general register.
  1875. This instruction is analogous to "movmskps" and has the same rules for
  1876. operands.
  1877.   "movsd" transfers a double precision value between source and destination
  1878. operand (only the low quad word is transferred). At least one of the operands
  1879. has to be a SSE register, the second one can be also a SSE register or 64-bit
  1880. memory location.
  1881.   Arithmetic operations on double precision values are: "addpd", "addsd",
  1882. "subpd", "subsd", "mulpd", "mulsd", "divpd", "divsd", "sqrtpd", "sqrtsd",
  1883. "maxpd", "maxsd", "minpd", "minsd", and they are analogous to arithmetic
  1884. operations on single precision values described in previous section. When the
  1885. mnemonic ends with "pd" instead of "ps", the operation is performed on packed
  1886. two double precision values, but rules for operands are the same. When the
  1887. mnemonic ends with "sd" instead of "ss", the source operand can be a 64-bit
  1888. memory location or a SSE register, the destination operand must be a SSE
  1889. register and the operation is performed on double precision values, only low
  1890. quad words of SSE registers are used in this case.
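  For example, assuming "x", "y" and "r" are hypothetical quad word variables
holding double precision values, the following sketch computes the square
root of x*x+y and stores it in "r":

    movsd xmm0,[x]
    mulsd xmm0,xmm0      ; x*x
    addsd xmm0,[y]       ; x*x+y
    sqrtsd xmm0,xmm0     ; square root of the sum
    movsd [r],xmm0
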
  1891.   "andpd", "andnpd", "orpd" and "xorpd" perform the logical operations on
  1892. packed double precision values. They are analogous to SSE logical operations
  1893. on single precision values and have the same rules for operands.
  1894.   "cmppd" compares packed double precision values and returns a
  1895. mask result into the destination operand. This instruction is analogous to
  1896. "cmpps" and has the same rules for operands. "cmpsd" performs the same
  1897. operation on double precision values, only low quad word of destination
  1898. register is affected, in this case source operand can be a 64-bit memory or
  1899. SSE register. Variants with only two operands are obtained by attaching the
  1900. condition mnemonic from table 2.3 to the "cmp" mnemonic and then attaching
  1901. the "pd" or "sd" at the end.
  1902.   "comisd" and "ucomisd" compare the double precision values and set the ZF,
  1903. PF and CF flags to show the result. The destination operand must be a SSE
  1904. register, the source operand can be a 64-bit memory location or SSE register.
  1905.   "shufpd" moves any of the two double precision values from the destination
  1906. operand into the low quad word of the destination operand, and any of the two
  1907. values from the source operand into the high quad word of the destination
  1908. operand. This instruction is analogous to "shufps" and has the same rules for
  1909. operands. Bit 0 of the third operand selects the value to be moved from the
  1910. destination operand, bit 1 selects the value to be moved from the source
  1911. operand, the rest of bits are reserved and must be zeroed.
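  For example, using the same register for both operands with bit 0 set and
bit 1 cleared swaps its two values:

    shufpd xmm0,xmm0,1   ; low = old high value, high = old low value
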
  1912.   "unpckhpd" performs an unpack of the high quad words from the source and
  1913. destination operands, "unpcklpd" performs an unpack of the low quad words from
  1914. the source and destination operands. They are analogous to "unpckhps" and
  1915. "unpcklps", and have the same rules for operands.
  1916.   "cvtps2pd" converts the packed two single precision floating point values to
  1917. two packed double precision floating point values, the destination operand
  1918. must be a SSE register, the source operand can be a 64-bit memory location or
  1919. SSE register. "cvtpd2ps" converts the packed two double precision floating
  1920. point values to packed two single precision floating point values, the
  1921. destination operand must be a SSE register, the source operand can be a
  1922. 128-bit memory location or SSE register. "cvtss2sd" converts the single
  1923. precision floating point value to double precision floating point value, the
  1924. destination operand must be a SSE register, the source operand can be a 32-bit
  1925. memory location or SSE register. "cvtsd2ss" converts the double precision
  1926. floating point value to single precision floating point value, the destination
  1927. operand must be a SSE register, the source operand can be 64-bit memory
  1928. location or SSE register.
  1929.   "cvtpi2pd" converts packed two double word integers into the packed
  1930. double precision floating point values, the destination operand must be a SSE
  1931. register, the source operand can be a 64-bit memory location or MMX register.
  1932. "cvtsi2sd" converts a double word integer into a double precision floating
  1933. point value, the destination operand must be a SSE register, the source
  1934. operand can be a 32-bit memory location or 32-bit general register. "cvtpd2pi"
  1935. converts packed double precision floating point values into packed two double
  1936. word integers, the destination operand should be a MMX register, the source
  1937. operand can be a 128-bit memory location or SSE register. "cvttpd2pi" performs
  1938. the similar operation, except that truncation is used to round the source values
  1939. to integers, rules for operands are the same. "cvtsd2si" converts a double
  1940. precision floating point value into a double word integer, the destination
  1941. operand should be a 32-bit general register, the source operand can be a
  1942. 64-bit memory location or SSE register. "cvttsd2si" performs the similar
  1943. operation, except that truncation is used to round a source value to integer,
  1944. rules for operands are the same.
  1945.   "cvtps2dq" and "cvttps2dq" convert packed single precision floating point
  1946. values to packed four double word integers, storing them in the destination
  1947. operand. "cvtpd2dq" and "cvttpd2dq" convert packed double precision floating
  1948. point values to packed two double word integers, storing the result in the low
  1949. quad word of the destination operand. "cvtdq2ps" converts packed four
  1950. double word integers to packed single precision floating point values.
  1951. For all these instructions destination operand must be a SSE register, the
  1952. source operand can be a 128-bit memory location or SSE register.
  1953. "cvtdq2pd" converts packed two double word integers from the source operand to
  1954. packed double precision floating point values, the source can be a 64-bit
  1955. memory location or SSE register, destination has to be SSE register.
  1956.   "movdqa" and "movdqu" transfer a double quad word operand containing packed
  1957. integers from source operand to destination operand. At least one of the
  1958. operands has to be a SSE register, the second one can be also a SSE register
  1959. or 128-bit memory location. Memory operands for "movdqa" instruction must be
  1960. aligned on boundary of 16 bytes, operands for "movdqu" instruction don't have
  1961. to be aligned.
  1962.   "movq2dq" moves the contents of the MMX source register to the low quad word
  1963. of destination SSE register. "movdq2q" moves the low quad word from the source
  1964. SSE register to the destination MMX register.
  1965.  
  1966.     movq2dq xmm0,mm1   ; move from MMX register to SSE register
  1967.     movdq2q mm0,xmm1   ; move from SSE register to MMX register
  1968.  
  1969.   All MMX instructions operating on the 64-bit packed integers (those with
  1970. mnemonics starting with "p") are extended to operate on 128-bit packed
  1971. integers located in SSE registers. Additional syntax for these instructions
  1972. needs an SSE register where MMX register was needed, and the 128-bit memory
  1973. location or SSE register where 64-bit memory location or MMX register were
  1974. needed. The exception is "pshufw" instruction, which doesn't allow extended
  1975. syntax, but has two new variants: "pshufhw" and "pshuflw", which allow only
  1976. the extended syntax, and perform the same operation as "pshufw" on the high
  1977. or low quad words of operands respectively. Also the new instruction "pshufd"
  1978. is introduced, which performs the same operation as "pshufw", but on the
  1979. double words instead of words, it allows only the extended syntax.
  1980.  
  1981.     psubb xmm0,[esi]   ; subtract 16 packed bytes
  1982.     pextrw eax,xmm0,7  ; extract highest word into eax
  1983.  
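  As an illustration of "pshufd", the selector in this sketch picks the double
words of the source in reverse order (the registers are chosen arbitrarily):

    pshufd xmm1,xmm0,00011011b ; xmm1 = double words of xmm0 reversed
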
  1984.   "paddq" performs the addition of packed quad words, "psubq" performs the
  1985. subtraction of packed quad words, "pmuludq" performs an unsigned
  1986. multiplication of low double words from each pair of corresponding quad words and
  1987. returns the results in packed quad words. These instructions follow the same
  1988. rules for operands as the general MMX operations described in 2.1.14.
  1989.   "pslldq" and "psrldq" perform logical shift left or right of the double
  1990. quad word in the destination operand by the amount of bytes specified in the
  1991. source operand. The destination operand should be an SSE register, source
  1992. operand should be an 8-bit immediate value.
  1993.   "punpckhqdq" interleaves the high quad word of the source operand and the
  1994. high quad word of the destination operand and writes them to the destination
  1995. SSE register. "punpcklqdq" interleaves the low quad word of the source operand
  1996. and the low quad word of the destination operand and writes them to the
  1997. destination SSE register. The source operand can be a 128-bit memory location
  1998. or SSE register.
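
          A short sketch of the quad word operations described above (the choice of
        registers is arbitrary, and EBX is assumed to hold an address aligned on
        16 bytes):

            paddq xmm0,xmm1       ; add two pairs of quad words
            pmuludq xmm0,[ebx]    ; multiply the low double words of each quad word
            pslldq xmm0,4         ; shift the whole register left by 4 bytes
            punpcklqdq xmm0,xmm1  ; low quad word of xmm1 becomes the high quad word
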
  1999.   "movntdq" stores packed integer data from the SSE register to memory using
  2000. a non-temporal hint. The source operand should be an SSE register, the
  2001. destination operand should be a 128-bit memory location. "movntpd" stores
  2002. packed double precision values from the SSE register to memory using a
  2003. non-temporal hint. Rules for operands are the same. "movnti" stores integer
  2004. from a general register to memory using a non-temporal hint. The source
  2005. operand should be a 32-bit general register, the destination operand should
  2006. be a 32-bit memory location. "maskmovdqu" stores selected bytes from the first
  2007. operand into a 128-bit memory location using a non-temporal hint. Both
  2008. operands should be SSE registers, the second operand selects which bytes from
  2009. the source operand are written to memory. The memory location is pointed to
  2010. by DI (or EDI) register in the segment selected by DS and does not need to be
  2011. aligned.
  2012.   "clflush" writes and invalidates the cache line associated with the address
  2013. of byte specified with the operand, which should be an 8-bit memory location.
  2014.   "lfence" performs a serializing operation on all instructions loading from
  2015. memory that were issued prior to it. "mfence" performs a serializing operation
  2016. on all instructions accessing memory that were issued prior to it, and so it
  2017. combines the functions of "sfence" (described in previous section) and
  2018. "lfence" instructions. These instructions have no operands.
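
          The sequence below is only a sketch of how these instructions may be
        combined; it assumes that EDI holds an address aligned on 16 bytes and that
        ESI points to any byte of the cache line to be flushed:

            movntdq [edi],xmm0   ; non-temporal store of 16 bytes
            movnti [edi+16],eax  ; non-temporal store of a double word
            sfence               ; order the stores above before any later stores
            clflush byte [esi]   ; write back and invalidate the cache line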
  2019.  
  2020.  
  2021. 2.1.17  SSE3 instructions
  2022.  
  2023. Prescott technology introduced some new instructions to improve the performance
  2024. of SSE and SSE2 - this extension is called SSE3.
  2025.   "fisttp" behaves like the "fistp" instruction and accepts the same operands,
  2026. the only difference is that it always uses truncation, irrespective of the
  2027. rounding mode.
  2028.   "movshdup" loads into destination operand the 128-bit value obtained from
  2029. the source value of the same size by filling each quad word with the two
  2030. duplicates of the value in its high double word. "movsldup" performs the same
  2031. action, except it duplicates the values of low double words. The destination
  2032. operand should be SSE register, the source operand can be SSE register or
  2033. 128-bit memory location.
  2034.   "movddup" loads the 64-bit source value and duplicates it into high and low
  2035. quad word of the destination operand. The destination operand should be SSE
  2036. register, the source operand can be SSE register or 64-bit memory location.
  2037.   "lddqu" is functionally equivalent to "movdqu" with memory as source
  2038. operand, but it may improve performance when the source operand crosses a
  2039. cache line boundary. The destination operand has to be SSE register, the source
  2040. operand must be 128-bit memory location.
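
          For illustration, assume that xmm1 contains four single precision values
        x0, x1, x2 and x3 (listed from the lowest one) and that ESI points to
        readable memory:

            movsldup xmm0,xmm1        ; xmm0 = x0, x0, x2, x2
            movshdup xmm0,xmm1        ; xmm0 = x1, x1, x3, x3
            movddup xmm0,qword [esi]  ; both quad words = 64-bit value at ESI
            lddqu xmm2,[esi+1]        ; unaligned load of 16 bytes
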
  2041.   "addsubps" performs single precision addition of second and fourth pairs and
  2042. single precision subtraction of the first and third pairs of floating point
  2043. values in the operands. "addsubpd" performs double precision addition of the
  2044. second pair and double precision subtraction of the first pair of floating
  2045. point values in the operands. "haddps" performs the addition of two single
  2046. precision values within each quad word of source and destination operands,
  2047. and stores the results of such horizontal addition of values from destination
  2048. operand into low quad word of destination operand, and the results from the
  2049. source operand into high quad word of destination operand. "haddpd" performs
  2050. the addition of two double precision values within each operand, and stores
  2051. the result from destination operand into low quad word of destination operand,
  2052. and the result from source operand into high quad word of destination operand.
  2053. All these instructions need the destination operand to be SSE register, source
  2054. operand can be SSE register or 128-bit memory location.
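
          A common use of the horizontal addition is summing all the elements of one
        register. The sketch below assumes that ESI points to four single precision
        values a0, a1, a2 and a3, aligned on 16 bytes:

            movaps xmm0,[esi]  ; xmm0 = a0, a1, a2, a3
            haddps xmm0,xmm0   ; xmm0 = a0+a1, a2+a3, a0+a1, a2+a3
            haddps xmm0,xmm0   ; each element of xmm0 = a0+a1+a2+a3
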
  2055.   "monitor" sets up an address range for monitoring of write-back stores. It
  2056. needs its three operands to be EAX, ECX and EDX register in that order. "mwait"
  2057. waits for a write-back store to the address range set up by the "monitor"
  2058. instruction. It uses two operands with additional parameters, first being the
  2059. EAX and second the ECX register.
  2060.   The functionality of SSE3 is further extended by the set of Supplemental
  2061. SSE3 instructions (SSSE3). They generally follow the same rules for operands
  2062. as all the MMX operations extended by SSE.
  2063.   "phaddw" and "phaddd" perform the horizontal addition of the pairs of
  2064. adjacent values from both the source and destination operand, and store the
  2065. sums into the destination (sums from the source operand go into upper part of
  2066. destination register). They operate on 16-bit or 32-bit chunks, respectively.
  2067. "phaddsw" performs the same operation on signed 16-bit packed values, but the
  2068. result of each addition is saturated. "phsubw" and "phsubd" analogously
  2069. perform the horizontal subtraction of 16-bit or 32-bit packed values, and
  2070. "phsubsw" performs the horizontal subtraction of signed 16-bit packed values
  2071. with saturation.
  2072.   "pabsb", "pabsw" and "pabsd" calculate the absolute value of each packed
  2073. signed value in source operand and store them into the destination
  2074. register. They operate on 8-bit, 16-bit and 32-bit elements respectively.
  2075.   "pmaddubsw" multiplies signed 8-bit values from the source operand with the
  2076. corresponding unsigned 8-bit values from the destination operand to produce
  2077. intermediate 16-bit values, and every adjacent pair of those intermediate
  2078. values is then added horizontally and those 16-bit sums are stored into the
  2079. destination operand.
  2080.   "pmulhrsw" multiplies corresponding 16-bit integers from the source and
  2081. destination operand to produce intermediate 32-bit values, and the 16 bits
  2082. next to the highest bit of each of those values are then rounded and packed
  2083. into the destination operand.
  2084.   "pshufb" shuffles the bytes in the destination operand according to the
  2085. mask provided by source operand - each byte of the mask holds the index of the
  2086. destination byte that is copied into the corresponding position of result.
  2087.   "psignb", "psignw" and "psignd" perform the operation on 8-bit, 16-bit or
  2088. 32-bit integers in destination operand, depending on the signs of the values
  2089. in the source. If the value in source is negative, the corresponding value in
  2090. the destination register is negated, if the value in source is positive, no
  2091. operation is performed on the corresponding value, and if the value in
  2092. source is zero, the value in destination is zeroed, too.
  2093.   "palignr" appends the source operand to the destination operand to form the
  2094. intermediate value of twice the size, and then extracts into the destination
  2095. register the 64 or 128 bits that are right-aligned to the byte offset
  2096. specified by the third operand, which should be an 8-bit immediate value. This
  2097. is the only SSSE3 instruction that takes three arguments.
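
          A few samples of SSSE3 instructions follow. The "reverse_mask" label is
        a hypothetical name introduced only for this sketch, and its data should be
        placed outside of the executed code:

            pabsw xmm0,xmm1      ; absolute values of eight signed words
            palignr xmm2,xmm3,4  ; shift the xmm2:xmm3 pair right by 4 bytes
            pshufb xmm4,dqword [reverse_mask]  ; reverse the order of 16 bytes

            align 16
            reverse_mask db 15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0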
  2098.  
  2099.  
  2100. 2.1.18  AMD 3DNow! instructions
  2101.  
  2102. The 3DNow! extension adds new MMX instructions to those described in 2.1.14,
  2103. and introduces operation on the 64-bit packed floating point values, each
  2104. consisting of two single precision floating point values.
  2105.   These instructions follow the same rules as the general MMX operations, the
  2106. destination operand should be a MMX register, the source operand can be a MMX
  2107. register or 64-bit memory location. "pavgusb" computes the rounded averages
  2108. of packed unsigned bytes. "pmulhrw" performs a signed multiplication of the
  2109. packed words, rounds the high word of each double word result and stores them
  2110. in the destination operand. "pi2fd" converts packed double word integers into
  2111. packed floating point values. "pf2id" converts packed floating point values
  2112. into packed double word integers using truncation. "pi2fw" converts packed
  2113. word integers into packed floating point values, only low words of each
  2114. double word in source operand are used. "pf2iw" converts packed floating
  2115. point values to packed word integers, results are extended to double words
  2116. using the sign extension. "pfadd" adds packed floating point values. "pfsub"
  2117. and "pfsubr" subtract packed floating point values, the first one subtracts
  2118. source values from destination values, the second one subtracts destination
  2119. values from the source values. "pfmul" multiplies packed floating point
  2120. values. "pfacc" adds the low and high floating point values of the destination
  2121. operand, storing the result in the low double word of destination, and adds
  2122. the low and high floating point values of the source operand, storing the
  2123. result in the high double word of destination. "pfnacc" subtracts the high
  2124. floating point value of the destination operand from the low, storing the
  2125. result in the low double word of destination, and subtracts the high floating
  2126. point value of the source operand from the low, storing the result in the high
  2127. double word of destination. "pfpnacc" subtracts the high floating point value
  2128. of the destination operand from the low, storing the result in the low double
  2129. word of destination, and adds the low and high floating point values of the
  2130. source operand, storing the result in the high double word of destination.
  2131. "pfmax" and "pfmin" compute the maximum and minimum of floating point values.
  2132. "pswapd" reverses the high and low double word of the source operand. "pfrcp"
  2133. returns estimates of the reciprocals of floating point values from the
  2134. source operand, "pfrsqrt" returns estimates of the reciprocal square
  2135. roots of floating point values from the source operand, "pfrcpit1" performs
  2136. the first step in the Newton-Raphson iteration to refine the reciprocal
  2137. approximation produced by "pfrcp" instruction, "pfrsqit1" performs the first
  2138. step in the Newton-Raphson iteration to refine the reciprocal square root
  2139. approximation produced by "pfrsqrt" instruction, "pfrcpit2" performs the
  2140. second, final step in the Newton-Raphson iteration to refine the reciprocal
  2141. approximation or the reciprocal square root approximation. "pfcmpeq",
  2142. "pfcmpge" and "pfcmpgt" compare the packed floating point values and set
  2143. all bits or zero all bits of the corresponding data element in the
  2144. destination operand according to the result of comparison, first checks
  2145. whether values are equal, second checks whether destination value is greater
  2146. or equal to source value, third checks whether destination value is greater
  2147. than source value.
  2148.   "prefetch" and "prefetchw" load the line of data from memory that contains
  2149. byte specified with the operand into the data cache, "prefetchw" instruction
  2150. should be used when the data in the cache line is expected to be modified,
  2151. otherwise the "prefetch" instruction should be used. The operand should be an
  2152. 8-bit memory location.
  2153.   "femms" performs a fast clear of MMX state. This instruction has no
  2154. operands.
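
          The reciprocal instructions are intended to be used in sequence. The sketch
        below follows the usual pattern for computing the refined value of 1/x; it
        assumes that ESI points to a single precision value x:

            movd mm0,[esi]     ; low double word of mm0 = x
            pfrcp mm1,mm0      ; both halves of mm1 = estimate of 1/x
            punpckldq mm0,mm0  ; both halves of mm0 = x
            pfrcpit1 mm0,mm1   ; first Newton-Raphson step
            pfrcpit2 mm0,mm1   ; second step, mm0 = refined 1/x
            femms              ; clear MMX state before using FPU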
  2155.  
  2156.  
  2157. 2.1.19  The x86-64 long mode instructions
  2158.  
  2159. The AMD64 and EM64T architectures (we will use the common name x86-64 for them
  2160. both) extend the x86 instruction set for the 64-bit processing. While legacy
  2161. and compatibility modes use the same set of registers and instructions, the
  2162. new long mode extends the x86 operations to 64 bits and introduces several new
  2163. registers. You can turn on generating the code for this mode with the "use64"
  2164. directive.
  2165.   Each of the general purpose registers is extended to 64 bits, and eight
  2166. entirely new general purpose registers and eight new SSE registers are added.
  2167. See table 2.4 for the summary of new registers (only the ones that were not
  2168. listed in table 1.2). The general purpose registers of smaller sizes are the
  2169. low order portions of the larger ones. You can still access the "ah", "bh",
  2170. "ch" and "dh" registers in long mode, but you cannot use them in the same
  2171. instruction with any of the new registers.
  2172.  
  2173.    Table 2.4  New registers in long mode
  2174.   /--------------------------------------------------\
  2175.   | Type |          General          |  SSE  |  AVX  |
  2176.   |------|---------------------------|-------|-------|
  2177.   | Bits |  8   |  16  |  32  |  64  |  128  |  256  |
  2178.   |======|======|======|======|======|=======|=======|
  2179.   |      |      |      |      | rax  |       |       |
  2180.   |      |      |      |      | rcx  |       |       |
  2181.   |      |      |      |      | rdx  |       |       |
  2182.   |      |      |      |      | rbx  |       |       |
  2183.   |      | spl  |      |      | rsp  |       |       |
  2184.   |      | bpl  |      |      | rbp  |       |       |
  2185.   |      | sil  |      |      | rsi  |       |       |
  2186.   |      | dil  |      |      | rdi  |       |       |
  2187.   |      | r8b  | r8w  | r8d  | r8   | xmm8  | ymm8  |
  2188.   |      | r9b  | r9w  | r9d  | r9   | xmm9  | ymm9  |
  2189.   |      | r10b | r10w | r10d | r10  | xmm10 | ymm10 |
  2190.   |      | r11b | r11w | r11d | r11  | xmm11 | ymm11 |
  2191.   |      | r12b | r12w | r12d | r12  | xmm12 | ymm12 |
  2192.   |      | r13b | r13w | r13d | r13  | xmm13 | ymm13 |
  2193.   |      | r14b | r14w | r14d | r14  | xmm14 | ymm14 |
  2194.   |      | r15b | r15w | r15d | r15  | xmm15 | ymm15 |
  2195.   \--------------------------------------------------/
  2196.  
  2197.   In general any instruction from x86 architecture, which allowed 16-bit or
  2198. 32-bit operand sizes, in long mode allows also the 64-bit operands. The 64-bit
  2199. registers should be used for addressing in long mode, the 32-bit addressing
  2200. is also allowed, but it's not possible to use the addresses based on 16-bit
  2201. registers. Below are the samples of new operations possible in long mode on the
  2202. example of "mov" instruction:
  2203.  
  2204.     mov rax,r8   ; transfer 64-bit general register
  2205.     mov al,[rbx] ; transfer memory addressed by 64-bit register
  2206.  
  2207.   The long mode also uses the instruction pointer based addresses, you can
  2208. specify it manually with the special RIP register symbol, but such addressing
  2209. is also automatically generated by flat assembler, since there is no 64-bit
  2210. absolute addressing in long mode. You can still force the assembler to use the
  2211. 32-bit absolute addressing by putting the "dword" size override for address
  2212. inside the square brackets. There is also one exception, where the 64-bit
  2213. absolute addressing is possible, it's the "mov" instruction with one of the
  2214. operands being accumulator register, and second being the memory operand.
  2215. To force the assembler to use the 64-bit absolute addressing there, use the
  2216. "qword" size operator for address inside the square brackets. When no size
  2217. operator is applied to address, assembler generates the optimal form
  2218. automatically.
  2219.  
  2220.     mov [qword 0],rax  ; absolute 64-bit addressing
  2221.     mov [dword 0],r15d ; absolute 32-bit addressing
  2222.     mov [0],rsi        ; automatic RIP-relative addressing
  2223.     mov [rip+3],sil    ; manual RIP-relative addressing
  2224.  
  2225.   Also as the immediate operands for 64-bit operations only the signed 32-bit
  2226. values are possible, with the only exception being the "mov" instruction with
  2227. destination operand being 64-bit general purpose register. Trying to force the
  2228. 64-bit immediate with any other instruction will cause an error.
  2229.   If any operation is performed on the 32-bit general registers in long mode,
  2230. the upper 32 bits of the 64-bit registers containing them are filled with
  2231. zeros. This is unlike the operations on 16-bit or 8-bit portions of those
  2232. registers, which preserve the upper bits.
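
          The rules from the two paragraphs above can be sketched as follows (the
        values in comments are the results that follow from the described behavior):

            use64
            mov rax,-1                 ; rax = 0FFFFFFFFFFFFFFFFh
            mov eax,eax                ; upper bits cleared, rax = 0FFFFFFFFh
            mov rbx,-1
            mov bx,0                   ; upper bits kept, rbx = 0FFFFFFFFFFFF0000h
            mov rcx,123456789ABCDEF0h  ; 64-bit immediate, allowed only with "mov"
            add rcx,-1                 ; immediate is a sign-extended 32-bit value
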
  2233.   Three new type conversion instructions are available. The "cdqe" sign
  2234. extends the double word in EAX into quad word and stores the result in RAX
  2235. register. "cqo" sign extends the quad word in RAX into double quad word and
  2236. stores the extra bits in the RDX register. These instructions have no
  2237. operands. "movsxd" sign extends the double word source operand, being either
  2238. the 32-bit register or memory, into 64-bit destination operand, which has to
  2239. be register. No analogous instruction is needed for the zero extension, since
  2240. it is done automatically by any operations on 32-bit registers, as noted in
  2241. previous paragraph. And the "movzx" and "movsx" instructions, conforming to
  2242. the general rule, can be used with 64-bit destination operand, allowing
  2243. extension of byte or word values into quad words.
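
          A brief sketch of these conversions (RBX is assumed to point to readable
        memory):

            use64
            mov eax,-2            ; rax = 0FFFFFFFEh, upper 32 bits are zero
            cdqe                  ; rax = -2
            cqo                   ; rdx = -1, so rdx:rax = -2
            movsxd rcx,eax        ; rcx = -2
            movzx rdx,byte [rbx]  ; zero-extend byte into quad word
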
  2244.   All the binary arithmetic and logical instructions have been promoted to
  2245. allow 64-bit operands in long mode. The use of decimal arithmetic instructions
  2246. in long mode is prohibited.
  2247.   The stack operations, like "push" and "pop" in long mode default to 64-bit
  2248. operands and it's not possible to use 32-bit operands with them. The "pusha"
  2249. and "popa" are disallowed in long mode.
  2250.   The indirect near jumps and calls in long mode default to 64-bit operands
  2251. and it's not possible to use the 32-bit operands with them. On the other hand,
  2252. the indirect far jumps and calls allow any operands that were allowed by the
  2253. x86 architecture and also 80-bit memory operand is allowed (though only EM64T
  2254. seems to implement such variant), with the first eight bytes defining the
  2255. offset and two last bytes specifying the selector. The direct far jumps and
  2256. calls are not allowed in long mode.
  2257.   The I/O instructions, "in", "out", "ins" and "outs" are the exceptional
  2258. instructions that are not extended to accept quad word operands in long mode.
  2259. But all other string operations are, and there are new short forms "movsq",
  2260. "cmpsq", "scasq", "lodsq" and "stosq" introduced for the variants of string
  2261. operations for 64-bit string elements. The RSI and RDI registers are used by
  2262. default to address the string elements.
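
          As a sketch, copying a block of quad words could look like this ("source"
        and "destination" are hypothetical labels):

            use64
            mov rsi,source
            mov rdi,destination
            mov ecx,16   ; upper half of rcx is zeroed automatically
            cld          ; make sure that the addresses are incremented
            rep movsq    ; copy 16 quad words
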
  2263.   The "lfs", "lgs" and "lss" instructions are extended to accept 80-bit source
  2264. memory operand with 64-bit destination register (though only EM64T seems to
  2265. implement such variant). The "lds" and "les" are disallowed in long mode.
  2266.   The system instructions like "lgdt" which required the 48-bit memory operand,
  2267. in long mode require the 80-bit memory operand.
  2268.   The "cmpxchg16b" is the 64-bit equivalent of "cmpxchg8b" instruction, it uses
  2269. the double quad word memory operand and 64-bit registers to perform the
  2270. analogous operation.
  2271.   The "fxsave64" and "fxrstor64" are new variants of "fxsave" and "fxrstor"
  2272. instructions, available only in long mode, which use a different format of
  2273. storage area in order to store some pointers in full 64-bit size.  
  2274.   "swapgs" is the new instruction, which swaps the base address of GS register
  2275. with the KernelGSbase model-specific register (MSR address 0C0000102h).
  2276.   "syscall" and "sysret" are the pair of new instructions that provide the
  2277. functionality similar to "sysenter" and "sysexit" in long mode, where the
  2278. latter pair is disallowed. The "sysexitq" and "sysretq" mnemonics provide the
  2279. 64-bit versions of "sysexit" and "sysret" instructions.
  2280.   The "rdmsrq" and "wrmsrq" mnemonics are the 64-bit variants of the "rdmsr"
  2281. and "wrmsr" instructions.
  2282.  
  2283.  
  2284. 2.1.20  SSE4 instructions
  2285.  
  2286. There are actually three different sets of instructions under the name SSE4.
  2287. Intel designed two of them, SSE4.1 and SSE4.2, with the latter extending the
  2288. former into the full Intel's SSE4 set. On the other hand, the implementation
  2289. by AMD includes only a few instructions from this set, but also contains
  2290. some additional instructions, that are called the SSE4a set.
  2291.   The SSE4.1 instructions mostly follow the same rules for operands, as
  2292. the basic SSE operations, so they require destination operand to be SSE
  2293. register and source operand to be 128-bit memory location or SSE register,
  2294. and some operations require a third operand, the 8-bit immediate value.
  2295.   "pmulld" performs a signed multiplication of the packed double words and
  2296. stores the low double words of the results in the destination operand.
  2297. "pmuldq" performs two signed multiplications of the low double words from
  2298. each pair of corresponding quad words of operands, and stores the results as
  2299. packed quad words into the destination register. "pminsb" and "pmaxsb"
  2300. return the minimum or maximum values of packed signed bytes, "pminuw" and
  2301. "pmaxuw" return the minimum and maximum values of packed unsigned words,
  2302. "pminud", "pmaxud", "pminsd" and "pmaxsd" return minimum or maximum values
  2303. of packed unsigned or signed double words. These instructions complement the
  2304. instructions computing packed minimum or maximum introduced by SSE.
  2305.   "ptest" sets the ZF flag to one when the result of bitwise AND of
  2306. both operands is zero, and zeroes the ZF otherwise. It also sets CF flag
  2307. to one, when the result of bitwise AND of the destination operand with
  2308. the bitwise NOT of the source operand is zero, and zeroes the CF otherwise.
  2309. "pcmpeqq" compares packed quad words for equality, and fills the
  2310. corresponding elements of destination operand with either ones or zeros,
  2311. depending on the result of comparison.
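
          A typical use of "ptest" is checking whether a register contains only zero
        bits; the "all_zero" label in this sketch is hypothetical:

            ptest xmm0,xmm0  ; ZF = 1 when all 128 bits of xmm0 are zero
            jz all_zero
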
  2312.   "packusdw" converts packed signed double words from both the source and
  2313. destination operand into the unsigned words using saturation, and stores
  2314. the eight resulting word values into the destination register.
  2315.   "phminposuw" finds the minimum unsigned word value in source operand and
  2316. places it into the lowest word of destination operand, stores its index in the
  2317. next word, and sets the remaining upper bits of destination to zero.
  2318.   "roundps", "roundss", "roundpd" and "roundsd" perform the rounding of packed
  2319. or individual floating point value of single or double precision, using the
  2320. rounding mode specified by the third operand.
  2321.  
  2322.     roundsd xmm0,xmm1,0011b ; round toward zero
  2323.  
  2324.   "dpps" calculates dot product of packed single precision floating point
  2325. values, that is it multiplies the corresponding pairs of values from source and
  2326. destination operand and then sums the products up. The high four bits of the
  2327. 8-bit immediate third operand control which products are calculated and taken
  2328. to the sum, and the low four bits control, into which elements of destination
  2329. the resulting dot product is copied (the other elements are filled with zero).
  2330. "dppd" calculates dot product of packed double precision floating point values.
  2331. The bits 4 and 5 of third operand control, which products are calculated and
  2332. added, and bits 0 and 1 of this value control, which elements in destination
  2333. register should get filled with the result. "mpsadbw" calculates multiple sums
  2334. of absolute differences of unsigned bytes. The third operand controls, with
  2335. value in bits 0-1, which of the four-byte blocks in source operand is taken to
  2336. calculate the absolute differences, and with value in bit 2, at which of the
  2337. two first four-byte blocks in destination operand to start calculating multiple
  2338. sums. The sum is calculated from four absolute differences between the
  2339. corresponding unsigned bytes in the source and destination block, and each next
  2340. sum is calculated in the same way, but taking the four bytes from destination
  2341. at the position one byte after the position of previous block. The four bytes
  2342. from the source stay the same each time. This way eight sums of absolute
  2343. differences are calculated and stored as packed word values into the
  2344. destination operand. The instructions described in this paragraph follow the
  2345. same rules for operands, as "roundps" instruction.
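
          For example, with the values of third operand chosen here just for
        illustration:

            dpps xmm0,xmm1,11110001b  ; sum of four products into the lowest element
            dppd xmm0,xmm1,00110001b  ; sum of two products into the low quad word
            mpsadbw xmm0,xmm1,0       ; use the first blocks of source and destination
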
  2346.   "blendps", "blendvps", "blendpd" and "blendvpd" conditionally copy the
  2347. values from source operand into the destination operand, depending on the bits
  2348. of the mask provided by third operand. If a mask bit is set, the corresponding
  2349. element of source is copied into the same place in destination, otherwise this
  2350. position in destination is left unchanged. The rules for the first two operands
  2351. are the same, as for general SSE instructions. "blendps" and "blendpd" need
  2352. third operand to be 8-bit immediate, and they operate on single or double
  2353. precision values, respectively. "blendvps" and "blendvpd" require third operand
  2354. to be the XMM0 register.
  2355.  
  2356.     blendvps xmm3,xmm7,xmm0 ; blend according to mask
  2357.  
  2358.   "pblendw" conditionally copies word elements from the source operand into the
  2359. destination, depending on the bits of mask provided by third operand, which
  2360. needs to be 8-bit immediate value. "pblendvb" conditionally copies byte
  2361. elements from the source operand into destination, depending on mask defined
  2362. by the third operand, which has to be XMM0 register. These instructions follow
  2363. the same rules for operands as "blendps" and "blendvps" instructions,
  2364. respectively.
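
          Some samples of the immediate and register masks, chosen only for
        illustration:

            blendps xmm0,xmm1,0001b      ; copy only the lowest value from xmm1
            pblendw xmm0,xmm1,10101010b  ; copy words 1, 3, 5 and 7 from xmm1
            pblendvb xmm1,xmm2,xmm0      ; highest bit of each byte of xmm0 selects
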
  2365.   "insertps" inserts a single precision floating point value taken from the
  2366. position in source operand specified by bits 6-7 of third operand into location
  2367. in destination register selected by bits 4-5 of third operand. Additionally,
  2368. the low four bits of third operand control, which elements in destination
  2369. register will be set to zero. The first two operands follow the same rules as
  2370. for the general SSE operation, the third operand should be 8-bit immediate.
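
          As a sketch, with the value of third operand chosen only for illustration
        (source position 0, destination position 1, no elements zeroed):

            insertps xmm0,xmm1,00010000b ; lowest value of xmm1 into second position
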
  2371.   "extractps" extracts a single precision floating point value taken from the
  2372. location in source operand specified by low two bits of third operand, and
  2373. stores it into the destination operand. The destination can be a 32-bit memory
  2374. value or general purpose register, the source operand must be SSE register,
  2375. and the third operand should be 8-bit immediate value.
  2376.  
  2377.     extractps edx,xmm3,3 ; extract the highest value
  2378.  
  2379.   "pinsrb", "pinsrd" and "pinsrq" copy a byte, double word or quad word from
  2380. the source operand into the location of destination operand determined by the
  2381. third operand. The destination operand has to be SSE register, the source
  2382. operand can be a memory location of appropriate size, or the 32-bit general
  2383. purpose register (but 64-bit general purpose register for "pinsrq", which is
  2384. only available in long mode), and the third operand has to be 8-bit immediate
  2385. value. These instructions complement the "pinsrw" instruction operating on SSE
  2386. register destination, which was introduced by SSE2.
  2387.  
  2388.     pinsrd xmm4,eax,1 ; insert double word into second position
  2389.  
  2390.   "pextrb", "pextrw", "pextrd" and "pextrq" copy a byte, word, double word or
  2391. quad word from the location in source operand specified by third operand, into
  2392. the destination. The source operand should be SSE register, the third operand
  2393. should be 8-bit immediate, and the destination operand can be memory location
  2394. of appropriate size, or the 32-bit general purpose register (but 64-bit general
  2395. purpose register for "pextrq", which is only available in long mode). The
  2396. "pextrw" instruction with SSE register as source was already introduced by
  2397. SSE2, but SSE4 extends it to allow memory operand as destination.
  2398.  
  2399.     pextrw [ebx],xmm3,7 ; extract highest word into memory
  2400.  
  2401.   "pmovsxbw" and "pmovzxbw" perform sign extension or zero extension of eight
  2402. byte values from the source operand into packed word values in destination
  2403. operand, which has to be SSE register. The source can be 64-bit memory or SSE
  2404. register - when it is register, only its low portion is used. "pmovsxbd" and
  2405. "pmovzxbd" perform sign extension or zero extension of the four byte values
  2406. from the source operand into packed double word values in destination operand,
  2407. the source can be 32-bit memory or SSE register. "pmovsxbq" and "pmovzxbq"
  2408. perform sign extension or zero extension of the two byte values from the
  2409. source operand into packed quad word values in destination operand, the source
  2410. can be 16-bit memory or SSE register. "pmovsxwd" and "pmovzxwd" perform sign
  2411. extension or zero extension of the four word values from the source operand
  2412. into packed double words in destination operand, the source can be 64-bit
  2413. memory or SSE register. "pmovsxwq" and "pmovzxwq" perform sign extension or
  2414. zero extension of the two word values from the source operand into packed quad
  2415. words in destination operand, the source can be 32-bit memory or SSE register.
  2416. "pmovsxdq" and "pmovzxdq" perform sign extension or zero extension of the two
  2417. double word values from the source operand into packed quad words in
  2418. destination operand, the source can be 64-bit memory or SSE register.
  2419.  
  2420.     pmovzxbq xmm0,word [si]  ; zero-extend bytes to quad words
  2421.     pmovsxwq xmm0,xmm1       ; sign-extend words to quad words
  2422.  
  2423.   "movntdqa" loads double quad word from the source operand to the destination
  2424. using a non-temporal hint. The destination operand should be SSE register,
  2425. and the source operand should be 128-bit memory location.
  2426.   The SSE4.2, described below, adds not only some new operations on SSE
  2427. registers, but also introduces some completely new instructions operating on
  2428. general purpose registers only.
  2429.   "pcmpistri" compares two zero-ended (implicit length) strings provided in
  2430. its source and destination operand and generates an index stored to ECX;
  2431. "pcmpistrm" performs the same comparison and generates a mask stored to XMM0.
  2432. "pcmpestri" compares two strings of explicit lengths, with length provided
  2433. in EAX for the destination operand and in EDX for the source operand, and
  2434. generates an index stored to ECX; "pcmpestrm" performs the same comparison
  2435. and generates a mask stored to XMM0. The source and destination operand follow
  2436. the same rules as for general SSE instructions, the third operand should be
  2437. 8-bit immediate value determining the details of performed operation - refer to
  2438. Intel documentation for information on those details.
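
        A sketch for illustration only; the immediate values are just samples
        (1000b is assumed here to select the "equal each" comparison of unsigned
        bytes, and 0 the "equal any" one) and should be verified against Intel
        documentation:

            pcmpistri xmm1,xmm2,1000b     ; index is stored to ECX
            mov eax,4                     ; length of string in xmm1
            mov edx,16                    ; length of string at [esi]
            pcmpestrm xmm1,xword [esi],0  ; mask is stored to XMM0
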
  2439.   "pcmpgtq" compares packed quad words, and fills the corresponding elements of
  2440. destination operand with either ones or zeros, depending on whether the value
  2441. in destination is greater than the one in source, or not. This instruction
  2442. follows the same rules for operands as "pcmpeqq".
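
        For example, with registers chosen arbitrarily:

            pcmpgtq xmm0,xmm1     ; ones where xmm0 is greater, zeros elsewhere
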
  2443.   "crc32" accumulates a CRC32 value for the source operand starting with
  2444. initial value provided by destination operand, and stores the result in
  2445. destination. Unless in long mode, the destination operand should be a 32-bit
  2446. general purpose register, and the source operand can be a byte, word, or double
  2447. word register or memory location. In long mode the destination operand can
  2448. also be a 64-bit general purpose register, and the source operand in such case
  2449. can be a byte or quad word register or memory location.
  2450.  
  2451.     crc32 eax,dl          ; accumulate CRC32 on byte value
  2452.     crc32 eax,word [ebx]  ; accumulate CRC32 on word value
  2453.     crc32 rax,qword [rbx] ; accumulate CRC32 on quad word value
  2454.  
  2455.   "popcnt" calculates the number of bits set in the source operand, which can
  2456. be 16-bit, 32-bit, or 64-bit general purpose register or memory location,
  2457. and stores this count in the destination operand, which has to be register of
  2458. the same size as source operand. The 64-bit variant is available only in long
  2459. mode.
  2460.  
  2461.     popcnt ecx,eax        ; count bits set to 1
  2462.  
  2463.   The SSE4a extension, which also includes the "popcnt" instruction introduced
  2464. by SSE4.2, at the same time adds the "lzcnt" instruction, which follows the
  2465. same syntax, and calculates the count of leading zero bits in source operand
  2466. (if the source operand is all zero bits, the total number of bits in source
  2467. operand is stored in destination).
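
        A short sketch of this syntax, with arbitrarily chosen operands:

            lzcnt eax,ecx         ; count leading zero bits
            lzcnt ax,word [ebx]   ; 16 is stored when the source is zero
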
  2468.   "extrq" extracts the sequence of bits from the low quad word of SSE register
  2469. provided as first operand and stores them at the low end of this register,
  2470. filling the remaining bits in the low quad word with zeros. The position of bit
  2471. string and its length can either be provided with two 8-bit immediate values
  2472. as second and third operand, or by SSE register as second operand (and there
  2473. is no third operand in such case), which should contain position value in bits
  2474. 8-13 and length of bit string in bits 0-5.
  2475.  
  2476.     extrq xmm0,8,7        ; extract 8 bits from position 7
  2477.     extrq xmm0,xmm5       ; extract bits defined by register
  2478.  
  2479.   "insertq" writes the sequence of bits from the low quad word of the source
  2480. operand into specified position in low quad word of the destination operand,
  2481. leaving the other bits in low quad word of destination intact. The position
  2482. where bits should be written and the length of bit string can either be
  2483. provided with two 8-bit immediate values as third and fourth operand, or by
  2484. the bit fields in source operand (and there are only two operands in such
  2485. case), which should contain position value in bits 72-77 and length of bit
  2486. string in bits 64-69.
  2487.  
  2488.     insertq xmm1,xmm0,4,2 ; insert 4 bits at position 2
  2489.     insertq xmm1,xmm0     ; insert bits defined by register
  2490.  
  2491.   "movntss" and "movntsd" store single or double precision floating point
  2492. value from the source SSE register into 32-bit or 64-bit destination memory
  2493. location respectively, using non-temporal hint.
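
        A minimal sketch, assuming that EDI points to a writable buffer:

            movntss [edi],xmm1    ; store lowest 32-bit float
            movntsd [edi+8],xmm2  ; store lowest 64-bit float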
  2494.  
  2495.  
  2496. 2.1.21  AVX instructions
  2497.  
  2498. The Advanced Vector Extensions introduce instructions that are new variants
  2499. of SSE instructions, with new scheme of encoding that allows extended syntax
  2500. having a destination operand separate from all the source operands. It also
  2501. introduces 256-bit AVX registers, which extend up the old 128-bit SSE
  2502. registers. Any AVX instruction that puts some result into SSE register, puts
  2503. zero bits into high portion of the AVX register containing it.
  2504.   The AVX version of SSE instruction has the mnemonic obtained by prepending
  2505. SSE instruction name with "v". For any SSE arithmetic instruction which had a
  2506. destination operand also being used as one of the source values, the AVX
  2507. variant has a new syntax with three operands - the destination and two sources.
  2508. The destination and first source can be SSE registers, and second source can be
  2509. SSE register or memory. If the operation is performed on single pair of values,
  2510. the remaining bits of first source SSE register are copied into the
  2511. destination register.
  2512.  
  2513.     vsubss xmm0,xmm2,xmm3         ; subtract two 32-bit floats
  2514.     vmulsd xmm0,xmm7,qword [esi]  ; multiply two 64-bit floats
  2515.  
  2516. In case of packed operations, each instruction can also operate on the 256-bit
  2517. data size when the AVX registers are specified instead of SSE registers, and
  2518. the size of memory operand is also doubled then.
  2519.  
  2520.     vaddps ymm1,ymm5,yword [esi]  ; eight sums of 32-bit float pairs
  2521.  
  2522. The instructions that operate on packed integer types (in particular the ones
  2523. that earlier had been promoted from MMX to SSE) also acquired the new syntax
  2524. with three operands, however they are only allowed to operate on 128-bit
  2525. packed types and thus cannot use the whole AVX registers.
  2526.  
  2527.     vpavgw xmm3,xmm0,xmm2         ; average of 16-bit integers
  2528.     vpslld xmm1,xmm0,1            ; shift double words left
  2529.      
  2530. If the SSE version of instruction had a syntax with three operands, the third
  2531. one being an immediate value, the AVX version of such instruction takes four
  2532. operands, with immediate remaining the last one.
  2533.  
  2534.     vshufpd ymm0,ymm1,ymm2,10010011b ; shuffle 64-bit floats
  2535.     vpalignr xmm0,xmm4,xmm2,3        ; extract byte aligned value
  2536.      
  2537. The promotion to new syntax according to the rules described above has been
  2538. applied to all the instructions from SSE extensions up to SSE4, with the
  2539. exceptions described below.  
  2540.   "vdppd" instruction has syntax extended to four operands, but it does not
  2541. have a 256-bit version.
  2542.   There are a few instructions, namely "vsqrtpd", "vsqrtps", "vrcpps" and
  2543. "vrsqrtps", which can operate on 256-bit data size, but retained the syntax
  2544. with only two operands, because they use data from only one source:
  2545.    
  2546.     vsqrtpd ymm1,ymm0         ; put square roots into other register
  2547.  
  2548. In a similar way "vroundpd" and "vroundps" retained the syntax with three
  2549. operands, the last one being immediate value.  
  2550.  
  2551.     vroundps ymm0,ymm1,0011b  ; round toward zero
  2552.                              
  2553.   Also some of the operations on packed integers kept their two-operand or
  2554. three-operand syntax while being promoted to AVX version. In such case these
  2555. instructions follow exactly the same rules for operands as their SSE
  2556. counterparts (since operations on packed integers do not have 256-bit variants
  2557. in AVX extension). These include "vpcmpestri", "vpcmpestrm", "vpcmpistri",
  2558. "vpcmpistrm", "vphminposuw", "vpshufd", "vpshufhw", "vpshuflw". And there are
  2559. more instructions that in AVX versions keep exactly the same syntax for
  2560. operands as the one from SSE, without any additional options: "vcomiss",
  2561. "vcomisd", "vcvtss2si", "vcvtsd2si", "vcvttss2si", "vcvttsd2si", "vextractps",
  2562. "vpextrb", "vpextrw", "vpextrd", "vpextrq", "vmovd", "vmovq", "vmovntdqa",
  2563. "vmaskmovdqu", "vpmovmskb", "vpmovsxbw", "vpmovsxbd", "vpmovsxbq", "vpmovsxwd",
  2564. "vpmovsxwq", "vpmovsxdq", "vpmovzxbw", "vpmovzxbd", "vpmovzxbq", "vpmovzxwd",
  2565. "vpmovzxwq" and "vpmovzxdq".
  2566.   The move and conversion instructions have mostly been promoted to allow
  2567. 256-bit size operands in addition to the 128-bit variant with syntax identical
  2568. to that from SSE version of the same instruction. Each of the "vcvtdq2ps",
  2569. "vcvtps2dq", "vcvttps2dq", "vmovaps", "vmovapd", "vmovups", "vmovupd",
  2570. "vmovdqa", "vmovdqu", "vlddqu", "vmovntps", "vmovntpd", "vmovntdq",
  2571. "vmovsldup", "vmovshdup", "vmovmskps" and "vmovmskpd" inherits the 128-bit
  2572. syntax from SSE without any changes, and also allows a new form with 256-bit
  2573. operands in place of 128-bit ones.  
  2574.  
  2575.     vmovups [edi],ymm6        ; store unaligned 256-bit data
  2576.    
  2577.   "vmovddup" has the identical 128-bit syntax as its SSE version, and it also
  2578. has a 256-bit version, which stores the duplicates of the lowest quad word
  2579. from the source operand in the lower half of destination operand, and in the
  2580. upper half of destination the duplicates of the low quad word from the upper
  2581. half of source. Both source and destination operands need then to be 256-bit
  2582. values.
  2583.   "vmovlhps" and "vmovhlps" have only 128-bit versions, and each takes three
  2584. operands, which all must be SSE registers. "vmovlhps" copies two single
  2585. precision values from the low quad word of second source register to the high
  2586. quad word of destination register, and copies the low quad word of first
  2587. source register into the low quad word of destination register. "vmovhlps"
  2588. copies two single precision values from the high quad word of second source
  2589. register to the low quad word of destination register, and copies the high
  2590. quad word of first source register into the high quad word of destination
  2591. register.
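
        The two operations could be sketched as follows (registers are arbitrary):

            vmovlhps xmm0,xmm1,xmm2  ; low from xmm1, high from low of xmm2
            vmovhlps xmm0,xmm1,xmm2  ; high from xmm1, low from high of xmm2
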
  2592.   "vmovlps", "vmovhps", "vmovlpd" and "vmovhpd" have only 128-bit versions and
  2593. their syntax varies depending on whether memory operand is a destination or
  2594. source. When memory is destination, the syntax is identical to the one of
  2595. equivalent SSE instruction, and when memory is source, the instruction requires
  2596. three operands, first two being SSE registers and the third one 64-bit memory.
  2597. The value put into destination is then the value copied from first source with
  2598. either low or high quad word replaced with value from second source (the
  2599. memory operand).
  2600.  
  2601.     vmovhps [esi],xmm7       ; store upper half to memory
  2602.     vmovlps xmm0,xmm7,[ebx]  ; low from memory, rest from register  
  2603.  
  2604.   "vmovss" and "vmovsd" have syntax identical to their SSE equivalents as long
  2605. as one of the operands is memory, while the versions that operate purely on
  2606. registers require three operands (each being SSE register). The value stored
  2607. in destination is then the value copied from first source with lowest data
  2608. element replaced with the lowest value from second source.
  2609.  
  2610.     vmovss xmm3,[edi]        ; low from memory, rest zeroed
  2611.     vmovss xmm0,xmm1,xmm2    ; one value from xmm2, three from xmm1
  2612.  
  2613.   "vcvtss2sd", "vcvtsd2ss", "vcvtsi2ss" and "vcvtsi2sd" use the three-operand
  2614. syntax, where destination and first source are always SSE registers, and the
  2615. second source follows the same rules as the source in syntax of equivalent
  2616. SSE instruction. The value stored in destination is then the value copied from
  2617. first source with lowest data element replaced with the result of conversion.
  2618.  
  2619.     vcvtsi2sd xmm4,xmm4,ecx  ; 32-bit integer to 64-bit float
  2620.     vcvtsi2ss xmm0,xmm0,rax  ; 64-bit integer to 32-bit float
  2621.  
  2622.   "vcvtdq2pd" and "vcvtps2pd" allow the same syntax as their SSE equivalents,
  2623. plus the new variants with AVX register as destination and SSE register or
  2624. 128-bit memory as source. Analogously "vcvtpd2dq", "vcvttpd2dq" and
  2625. "vcvtpd2ps", in addition to variant with syntax identical to SSE version,
  2626. allow a variant with SSE register as destination and AVX register or 256-bit
  2627. memory as source.          
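
        A sketch of the variants with operands of different sizes (operands are
        chosen arbitrarily):

            vcvtps2pd ymm0,xmm1          ; four floats to four doubles
            vcvtdq2pd ymm0,xword [esi]   ; four integers to four doubles
            vcvtpd2ps xmm0,ymm1          ; four doubles to four floats
            vcvttpd2dq xmm0,yword [esi]  ; four doubles to integers, truncating
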
  2628.   "vinsertps", "vpinsrb", "vpinsrw", "vpinsrd", "vpinsrq" and "vpblendw" use
  2629. a syntax with four operands, where destination and first source have to be SSE
  2630. registers, and the third and fourth operand follow the same rules as second
  2631. and third operand in the syntax of equivalent SSE instruction. Value stored in
  2632. destination is the value copied from first source with some data elements
  2633. replaced with values extracted from the second source, analogously to the
  2634. operation of corresponding SSE instruction.  
  2635.  
  2636.     vpinsrd xmm0,xmm0,eax,3  ; insert double word
  2637.  
  2638.   "vblendvps", "vblendvpd" and "vpblendvb" use a new syntax with four register
  2639. operands: destination, two sources and a mask, where second source can also be
  2640. a memory operand. "vblendvps" and "vblendvpd" have 256-bit variant, where
  2641. operands are AVX registers or 256-bit memory, as well as 128-bit variant,
  2642. which has operands being SSE registers or 128-bit memory. "vpblendvb" has only
  2643. a 128-bit variant. Value stored in destination is the value copied from the
  2644. first source with some data elements replaced, according to mask, by values
  2645. from the second source.
  2646.  
  2647.     vblendvps ymm3,ymm1,ymm2,ymm7  ; blend according to mask    
  2648.    
  2649.   "vptest" allows the same syntax as its SSE version and also has a 256-bit
  2650. version, with both operands doubled in size. There are also two new
  2651. instructions, "vtestps" and "vtestpd", which perform analogous tests, but only
  2652. of the sign bits of corresponding single precision or double precision values,
  2653. and set the ZF and CF accordingly. They follow the same syntax rules as
  2654. "vptest".
  2655.  
  2656.     vptest ymm0,yword [ebx]  ; test 256-bit values
  2657.     vtestpd xmm0,xmm1        ; test sign bits of 64-bit floats
  2658.  
  2659.   "vbroadcastss", "vbroadcastsd" and "vbroadcastf128" are new instructions,
  2660. which broadcast the data element defined by source operand into all elements
  2661. of corresponding size in the destination register. "vbroadcastss" needs
  2662. source to be 32-bit memory and destination to be either SSE or AVX register.
  2663. "vbroadcastsd" requires 64-bit memory as source, and AVX register as
  2664. destination. "vbroadcastf128" requires 128-bit memory as source, and AVX
  2665. register as destination.
  2666.  
  2667.     vbroadcastss ymm0,dword [eax]  ; get eight copies of value          
  2668.  
  2669.   "vinsertf128" is the new instruction, which takes four operands. The
  2670. destination and first source have to be AVX registers, second source can be
  2671. SSE register or 128-bit memory location, and fourth operand should be an
  2672. immediate value. It stores in destination the value obtained by taking
  2673. contents of first source and replacing one of its 128-bit units with value of
  2674. the second source. The lowest bit of fourth operand specifies at which
  2675. position that replacement is done (either 0 or 1).
  2676.   "vextractf128" is the new instruction with three operands. The destination
  2677. needs to be SSE register or 128-bit memory location, the source must be AVX
  2678. register, and the third operand should be an immediate value. It extracts
  2679. into destination one of the 128-bit units from source. The lowest bit of third
  2680. operand specifies, which unit is extracted.  
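
        A sketch of both instructions, with arbitrarily chosen operands:

            vinsertf128 ymm0,ymm1,xmm2,1     ; replace upper 128 bits
            vextractf128 xword [edi],ymm0,0  ; store lower 128 bits
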
  2681.   "vmaskmovps" and "vmaskmovpd" are the new instructions with three operands
  2682. that selectively store in destination the elements from second source
  2683. depending on the sign bits of corresponding elements from first source. These
  2684. instructions can operate on either 128-bit data (SSE registers) or 256-bit
  2685. data (AVX registers). Either destination or second source has to be a memory
  2686. location of appropriate size, the two other operands should be registers.  
  2687.  
  2688.     vmaskmovps [edi],xmm0,xmm5  ; conditionally store
  2689.     vmaskmovpd ymm5,ymm0,[esi]  ; conditionally load  
  2690.  
  2691.   "vpermilpd" and "vpermilps" are the new instructions with three operands
  2692. that permute the values from first source according to the control fields from
  2693. second source and put the result into destination operand. They accept
  2694. either three SSE registers or three AVX registers as operands, and the second
  2695. source can also be a memory of size equal to the registers used. In alternative
  2696. form the second source can be immediate value and then the first source
  2697. can be a memory location of the size equal to destination register.
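
        A sketch of both forms; the immediate value is just a sample:

            vpermilps xmm0,xmm1,xmm2          ; control fields in xmm2
            vpermilpd ymm0,yword [esi],0101b  ; swap values in each 128-bit half
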
  2698.   "vperm2f128" is the new instruction with four operands, which selects
  2699. 128-bit blocks of floating point data from first and second source according
  2700. to the bit fields from fourth operand, and stores them in destination.
  2701. Destination and first source need to be AVX registers, second source can be
  2702. AVX register or 256-bit memory area, and fourth operand should be an immediate
  2703. value.
  2704.  
  2705.     vperm2f128 ymm0,ymm6,ymm7,12h  ; permute 128-bit blocks
  2706.  
  2707.   "vzeroall" instruction sets all the AVX registers to zero. "vzeroupper" sets
  2708. the upper 128-bit portions of all AVX registers to zero, leaving the SSE
  2709. registers intact. These new instructions take no operands.
  2710.   "vldmxcsr" and "vstmxcsr" are the AVX versions of "ldmxcsr" and "stmxcsr"
  2711. instructions. The rules for their operands remain unchanged.  
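
        A minimal sketch, assuming that EBX points to a 32-bit variable:

            vzeroupper            ; clear upper halves, keep SSE registers
            vstmxcsr dword [ebx]  ; store MXCSR in memory
            vldmxcsr dword [ebx]  ; load MXCSR back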
  2712.  
  2713.  
  2714. 2.1.22  AVX2 instructions
  2715.  
  2716. The AVX2 extension allows all the AVX instructions operating on packed integers
  2717. to use 256-bit data types, and introduces some new instructions as well.
  2718.   The AVX instructions that operate on packed integers and had only 128-bit
  2719. variants have been supplemented with 256-bit variants, and thus their syntax
  2720. rules became analogous to AVX instructions operating on packed floating point
  2721. types.
  2722.  
  2723.     vpsubb ymm0,ymm0,[esi]   ; subtract 32 packed bytes
  2724.     vpavgw ymm3,ymm0,ymm2    ; average of 16-bit integers
  2725.  
  2726. However there are some instructions that have not been equipped with the
  2727. 256-bit variants. "vpcmpestri", "vpcmpestrm", "vpcmpistri", "vpcmpistrm",
  2728. "vpextrb", "vpextrw", "vpextrd", "vpextrq", "vpinsrb", "vpinsrw", "vpinsrd",
  2729. "vpinsrq" and "vphminposuw" are not affected by AVX2 and allow only the
  2730. 128-bit operands.
  2731.   The packed shift instructions, which allowed the third operand specifying
  2732. amount to be SSE register or 128-bit memory location, use the same rules
  2733. for the third operand in their 256-bit variant.
  2734.  
  2735.     vpsllw ymm2,ymm2,xmm4        ; shift words left
  2736.     vpsrad ymm0,ymm3,xword [ebx] ; shift double words right
  2737.  
  2738.   There are also new packed shift instructions with standard three-operand AVX
  2739. syntax, which shift each element from first source by the amount specified in
  2740. corresponding element of second source, and store the results in destination.
  2741. "vpsllvd" shifts 32-bit elements left, "vpsllvq" shifts 64-bit elements left,
  2742. "vpsrlvd" shifts 32-bit elements right logically, "vpsrlvq" shifts 64-bit
  2743. elements right logically and "vpsravd" shifts 32-bit elements right
  2744. arithmetically.
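
        For example (operands chosen arbitrarily):

            vpsllvd ymm0,ymm1,ymm2         ; each element shifted by own amount
            vpsravd xmm0,xmm1,xword [esi]  ; amounts taken from memory
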
  2745.   The sign-extend and zero-extend instructions, which in AVX versions allowed
  2746. source operand to be SSE register or a memory of specific size, in the new
  2747. 256-bit variant need memory of that size doubled or SSE register as source and
  2748. AVX register as destination.
  2749.  
  2750.     vpmovzxbq ymm0,dword [esi]   ; bytes to quad words
  2751.    
  2752.   Also "vmovntdqa" has been upgraded with 256-bit variant, so it can
  2753. transfer 256-bit value from memory to AVX register; it needs memory address
  2754. to be aligned to 32 bytes.
  2755.   "vpmaskmovd" and "vpmaskmovq" are the new instructions with syntax identical
  2756. to "vmaskmovps" or "vmaskmovpd", and they perform analogous operation on
  2757. packed 32-bit or 64-bit values.    
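
        A sketch, assuming that ESI and EDI point to accessible memory:

            vpmaskmovd ymm0,ymm1,[esi]  ; conditionally load double words
            vpmaskmovq [edi],xmm1,xmm2  ; conditionally store quad words
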
  2758.   "vinserti128", "vextracti128", "vbroadcasti128" and "vperm2i128" are the new
  2759. instructions with syntax identical to "vinsertf128", "vextractf128",
  2760. "vbroadcastf128" and "vperm2f128" respectively, and they perform analogous
  2761. operations on 128-bit blocks of integer data.
  2762.   "vbroadcastss" and "vbroadcastsd" instructions have been extended to allow
  2763. SSE register as a source operand (which in AVX could only be a memory).
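
        For example (registers chosen arbitrarily):

            vbroadcastss ymm0,xmm1           ; eight copies of lowest float
            vbroadcastsd ymm0,xmm1           ; four copies of lowest double
            vbroadcasti128 ymm0,xword [esi]  ; two copies of 128-bit block
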
  2764.   "vpbroadcastb", "vpbroadcastw", "vpbroadcastd" and "vpbroadcastq" are the
  2765. new instructions which broadcast the byte, word, double word or quad word from
  2766. the source operand into all elements of corresponding size in the destination
  2767. register. The destination operand can be either SSE or AVX register, and the
  2768. source operand can be SSE register or memory of size equal to the size of data
  2769. element.
  2770.  
  2771.     vpbroadcastb ymm0,byte [ebx]  ; get 32 identical bytes
  2772.                  
  2773.   "vpermd" and "vpermps" are new three-operand instructions, which use each
  2774. 32-bit element from first source as an index of element in second source which
  2775. is copied into destination at position corresponding to element containing
  2776. index. The destination and first source have to be AVX registers, and the
  2777. second source can be AVX register or 256-bit memory.
  2778.   "vpermq" and "vpermpd" are new three-operand instructions, which use 2-bit
  2779. indexes from the immediate value specified as third operand to determine which
  2780. element from source is stored at given position in destination. The destination
  2781. has to be AVX register, source can be AVX register or 256-bit memory, and the
  2782. third operand must be 8-bit immediate value.    
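
        A sketch of both kinds of permutation; the immediate is just a sample:

            vpermd ymm0,ymm1,yword [esi]  ; ymm1 holds the indexes
            vpermq ymm0,ymm1,00011011b    ; reverse the order of quad words
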
  2783.   The family of new instructions performing "gather" operation have special
  2784. syntax, as in their memory operand they use addressing mode that is unique to
  2785. them. The base of address can be a 32-bit or 64-bit general purpose register
  2786. (the latter only in long mode), and the index (possibly multiplied by scale
  2787. value, as in standard addressing) is specified by SSE or AVX register. It is
  2788. possible to use only index without base and any numerical displacement can be
  2789. added to the address. Each of those instructions takes three operands. First
  2790. operand is the destination register, second operand is memory addressed with
  2791. a vector index, and third operand is register containing a mask. The most
  2792. significant bit of each element of mask determines whether a value will be
  2793. loaded from memory into corresponding element in destination. The address of
  2794. each element to load is determined by using the corresponding element from
  2795. index register in memory operand to calculate final address with given base
  2796. and displacement. When the index register contains fewer elements than the
  2797. destination and mask registers, the higher elements of destination are zeroed.
  2798. After the value is successfully loaded, the corresponding element in mask
  2799. register is set to zero. The destination, index and mask should all be
  2800. distinct registers, it is not allowed to use the same register in two
  2801. different roles.
  2802.   "vgatherdps" loads single precision floating point values addressed by
  2803. 32-bit indexes. The destination, index and mask should all be registers of the
  2804. same type, either SSE or AVX. The data addressed by memory operand is 32-bit
  2805. in size.
  2806.  
  2807.     vgatherdps xmm0,[eax+xmm1],xmm3    ; gather four floats
  2808.     vgatherdps ymm0,[ebx+ymm7*4],ymm3  ; gather eight floats
  2809.  
  2810.   "vgatherqps" loads single precision floating point values addressed by
  2811. 64-bit indexes. The destination and mask should always be SSE registers, while
  2812. index register can be either SSE or AVX register. The data addressed by memory
  2813. operand is 32-bit in size.
  2814.  
  2815.     vgatherqps xmm0,[xmm2],xmm3        ; gather two floats    
  2816.     vgatherqps xmm0,[ymm2+64],xmm3     ; gather four floats  
  2817.  
  2818.   "vgatherdpd" loads double precision floating point values addressed by
  2819. 32-bit indexes. The index register should always be SSE register, the
  2820. destination and mask should be two registers of the same type, either SSE or
  2821. AVX. The data addressed by memory operand is 64-bit in size.
  2822.  
  2823.     vgatherdpd xmm0,[ebp+xmm1],xmm3    ; gather two doubles
  2824.     vgatherdpd ymm0,[xmm3*8],ymm5      ; gather four doubles
  2825.  
  2826.   "vgatherqpd" loads double precision floating point values addressed by
  2827. 64-bit indexes. The destination, index and mask should all be registers of the
  2828. same type, either SSE or AVX. The data addressed by memory operand is 64-bit
  2829. in size.      
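
        A sketch analogous to the examples above, with arbitrarily chosen
        registers:

            vgatherqpd xmm0,[xmm1*8],xmm2      ; gather two doubles
            vgatherqpd ymm0,[ebx+ymm1*8],ymm2  ; gather four doubles
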
  2830.   "vpgatherdd" and "vpgatherqd" load 32-bit values addressed by either 32-bit
  2831. or 64-bit indexes. They follow the same rules as "vgatherdps" and "vgatherqps"
  2832. respectively.  
  2833.   "vpgatherdq" and "vpgatherqq" load 64-bit values addressed by either 32-bit
  2834. or 64-bit indexes. They follow the same rules as "vgatherdpd" and "vgatherqpd"
  2835. respectively.  
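
        A sketch of the integer variants, assuming the operand rules described
        above (registers are chosen arbitrarily):

            vpgatherdd ymm0,[ebx+ymm1*4],ymm2  ; gather eight double words
            vpgatherqd xmm0,[ymm1*4],xmm2      ; gather four double words
            vpgatherdq ymm0,[xmm1*8],ymm2      ; gather four quad words
            vpgatherqq xmm0,[eax+xmm1],xmm2    ; gather two quad words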
  2836.  
  2837.  
  2838. 2.1.23  Auxiliary sets of computational instructions
  2839.  
  2840.   There is a number of additional instruction set extensions related to
  2841. AVX. They introduce new vector instructions (and sometimes also their SSE
  2842. equivalents that use classic instruction encoding), and even some new
  2843. instructions operating on general registers that use the AVX-like encoding
  2844. allowing the extended syntax with separate destination and source operands.
  2845. The CPU support for each of these instruction sets needs to be determined
  2846. separately.    
  2847.   The AES extension provides a specialized set of instructions for the
  2848. purpose of cryptographic computations defined by Advanced Encryption Standard.
  2849. Each of these instructions has two versions: the AVX one and the one with
  2850. SSE-like syntax that uses classic encoding. Refer to the Intel manuals for the
  2851. details of operation of these instructions.
  2852.   "aesenc" and "aesenclast" perform a single round of AES encryption on data
  2853. from first source with a round key from second source, and store result in
  2854. destination. The destination and first source are SSE registers, and the
  2855. second source can be SSE register or 128-bit memory. The AVX versions of these
  2856. instructions, "vaesenc" and "vaesenclast", use the syntax with three operands,
  2857. while the SSE-like version has only two operands, with first operand being
  2858. both the destination and first source.
  2859.   "aesdec" and "aesdeclast" perform a single round of AES decryption on data
  2860. from first source with a round key from second source. The syntax rules for
  2861. them and their AVX versions are the same as for "aesenc".
  2862.   "aesimc" performs the InvMixColumns transformation of source operand and
  2863. stores the result in destination. Both "aesimc" and "vaesimc" use only two
  2864. operands, destination being SSE register, and source being SSE register or
  2865. 128-bit memory location.
  2866.   "aeskeygenassist" is a helper instruction for generating the round key.
  2867. It needs three operands: destination being SSE register, source being SSE
  2868. register or 128-bit memory, and third operand being 8-bit immediate value.  
  2869. The AVX version of this instruction uses the same syntax.  
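
        A sketch of the syntax only, not a complete cipher; registers are
        arbitrary, ESI is assumed to hold an address aligned to 16 bytes, and the
        immediate value is just a sample:

            aesenc xmm0,xmm1             ; one round, round key in xmm1
            aesenclast xmm0,xword [esi]  ; last round, round key in memory
            vaesdec xmm0,xmm2,xmm3       ; AVX version, separate destination
            aesimc xmm4,xmm1             ; InvMixColumns of xmm1
            aeskeygenassist xmm5,xmm1,1  ; helper for round key generation
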
  2870.   The CLMUL extension introduces just one instruction, "pclmulqdq", and its
  2871. AVX version as well. This instruction performs a carryless multiplication of
  2872. two 64-bit values selected from first and second source according to the bit
  2873. fields in immediate value. The destination and first source are SSE registers,
  2874. second source is SSE register or 128-bit memory, and immediate value is
  2875. provided as last operand. "vpclmulqdq" takes four operands, while "pclmulqdq"
  2876. takes only three operands, with the first one serving both the role of
  2877. destination and first source.
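
        A sketch, assuming that bit 0 of immediate selects the quad word from
        first source and bit 4 the one from second source (verify this against
        the Intel manuals):

            pclmulqdq xmm0,xmm1,00h        ; low quad words of both sources
            vpclmulqdq xmm0,xmm1,xmm2,11h  ; high quad words of both sources
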
  2878.   The FMA (Fused Multiply-Add) extension introduces additional AVX
  2879. instructions which perform multiplication and summation as single operation.
  2880. Each one takes three operands, first one serving both the role of destination
  2881. and first source, and the following ones being the second and third source.
  2882. The mnemonic of FMA instruction is obtained by appending to "vf" prefix: first
  2883. either "m" or "nm" to select whether result of multiplication should be taken
  2884. as-is or negated, then either "add" or "sub" to select whether third value
  2885. will be added to the product or subtracted from the product, then either
  2886. "132", "213" or "231" to select which source operands are multiplied and which
  2887. one is added or subtracted, and finally the type of data on which the
  2888. instruction operates, either "ps", "pd", "ss" or "sd". As it was with SSE
  2889. instructions promoted to AVX, instructions operating on packed floating point
  2890. values allow 128-bit or 256-bit syntax, in former all the operands are SSE
  2891. registers, but the third one can also be a 128-bit memory, in latter the
  2892. operands are AVX registers and the third one can also be a 256-bit memory.
  2893. Instructions that compute just one floating point result need operands to be
  2894. SSE registers, and the third operand can also be a memory, either 32-bit for
  2895. single precision or 64-bit for double precision.
  2896.  
  2897.     vfmsub231ps ymm1,ymm2,ymm3     ; multiply and subtract
  2898.     vfnmadd132sd xmm0,xmm5,[ebx]   ; multiply, negate and add        
  2899.  
  2900. In addition to the instructions created by the rule described above, there are
  2901. families of instructions with mnemonics starting with either "vfmaddsub" or
  2902. "vfmsubadd", followed by either "132", "213" or "231" and then either "ps" or
  2903. "pd" (the operation must always be on packed values in this case). They add
  2904. to the result of multiplication or subtract from it depending on the position
  2905. of value in packed data - instructions from the "vfmaddsub" group add when the
  2906. position is odd and subtract when the position is even, instructions from the
  2907. "vfmsubadd" group add when the position is even and subtract when the
  2908. position is odd. The rules for operands are the same as for other FMA
  2909. instructions.
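          As a minimal sketch of the alternating forms (the choice of registers
        and of the "213" ordering here is arbitrary, for illustration only):

            vfmaddsub213pd ymm0,ymm1,[esi]  ; add or subtract depending on position
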
  2910.   The FMA4 instructions are similar to FMA, but use syntax with four operands
  2911. and thus allow destination to be different than all the sources. Their
  2912. mnemonics are identical to FMA instructions with the "132", "213" or "231" cut
  2913. out, as having separate destination operand makes such selection of operands
  2914. superfluous. The multiplication is always performed on values from the first
  2915. and second source, and then the value from third source is added or
  2916. subtracted. Either second or third source can be a memory operand, and the
  2917. rules for the sizes of operands are the same as for FMA instructions.
  2918.  
  2919.     vfmaddpd ymm0,ymm1,[esi],ymm2  ; multiply and add  
  2920.     vfmsubss xmm0,xmm1,xmm2,[ebx]  ; multiply and subtract
  2921.    
  2922.   The F16C extension consists of two instructions, "vcvtps2ph" and
  2923. "vcvtph2ps", which convert floating point values between single precision and
  2924. half precision (the 16-bit floating point format). "vcvtps2ph" takes three
  2925. operands: destination, source, and rounding controls. The third operand is
  2926. always an immediate, the source is either SSE or AVX register containing
  2927. single precision values, and the destination is SSE register or memory, the
  2928. size of memory is 64 bits when the source is SSE register and 128 bits when
  2929. the source is AVX register. "vcvtph2ps" takes two operands, the destination
  2930. that can be SSE or AVX register, and the source that is SSE register or memory
  2931. with size of the half of destination operand's size.
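
          A short illustration of the operand sizes described above, assuming
        that EBX points to a 16-byte buffer (the registers and the rounding
        control value of 0 are picked only for the sake of example):

            vcvtps2ph [ebx],ymm1,0       ; eight singles into 128 bits of halves
            vcvtph2ps ymm0,[ebx]         ; eight halves back into singles
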
  2932.   The AMD XOP extension introduces a number of new vector instructions with
  2933. encoding and syntax analogous to AVX instructions. "vfrczps", "vfrczss",
  2934. "vfrczpd" and "vfrczsd" extract fractional portions of single or double
  2935. precision values, they all take two operands. The packed operations allow
  2936. either SSE or AVX register as destination, for the other two it has to be SSE
  2937. register. Source can be register of the same type as destination, or memory
  2938. of appropriate size (256-bit for destination being AVX register, 128-bit for
  2939. packed operation with destination being SSE register, 64-bit for operation
  2940. on a solitary double precision value and 32-bit for operation on a solitary
  2941. single precision value).
  2942.  
  2943.     vfrczps ymm0,[esi]           ; load fractional parts
  2944.    
  2945.   "vpcmov" copies bits from either first or second source into destination
  2946. depending on the values of corresponding bits in the fourth operand (the
  2947. selector). If the bit in selector is set, the corresponding bit from first
  2948. source is copied into the same position in destination, otherwise the bit from
  2949. second source is copied. Either second source or selector can be memory
  2950. location, 128-bit or 256-bit depending on whether SSE registers or AVX
  2951. registers are specified as the other operands.
  2952.  
  2953.     vpcmov xmm0,xmm1,xmm2,[ebx]  ; selector in memory
  2954.     vpcmov ymm0,ymm5,[esi],ymm2  ; source in memory
  2955.  
  2956. The family of packed comparison instructions takes four operands, the
  2957. destination and first source being SSE register, second source being SSE
  2958. register or 128-bit memory and the fourth operand being immediate value
  2959. defining the type of comparison. The mnemonic of instruction is created
  2960. by appending to "vpcom" prefix either "b" or "ub" to compare signed or
  2961. unsigned bytes, "w" or "uw" to compare signed or unsigned words, "d" or "ud"
  2962. to compare signed or unsigned double words, "q" or "uq" to compare signed or
  2963. unsigned quad words. The respective values from the first and second source
  2964. are compared and the corresponding data element in destination is set to
  2965. either all ones or all zeros depending on the result of comparison. The fourth
  2966. operand has to specify one of the eight comparison types (table 2.5). All
  2967. these instructions also have variants with only three operands and the type
  2968. of comparison encoded within the instruction name by inserting the comparison
  2969. mnemonic after "vpcom".
  2970.  
  2971.     vpcomb   xmm0,xmm1,xmm2,4    ; test for equal bytes
  2972.     vpcomgew xmm0,xmm1,[ebx]     ; compare signed words
  2973.  
  2974.    Table 2.5  XOP comparisons
  2975.   /-------------------------------------------\
  2976.   | Code | Mnemonic | Description             |
  2977.   |======|==========|=========================|
  2978.   | 0    | lt       | less than               |
  2979.   | 1    | le       | less than or equal      |
  2980.   | 2    | gt       | greater than            |
  2981.   | 3    | ge       | greater than or equal   |
  2982.   | 4    | eq       | equal                   |
  2983.   | 5    | neq      | not equal               |
  2984.   | 6    | false    | false                   |
  2985.   | 7    | true     | true                    |
  2986.   \-------------------------------------------/
  2987.  
  2988.   "vpermil2ps" and "vpermil2pd" set the elements in destination register to
  2989. zero or to a value selected from first or second source depending on the
  2990. corresponding bit fields from the fourth operand (the selector) and the
  2991. immediate value provided in fifth operand. Refer to the AMD manuals for the
  2992. detailed explanation of the operation performed by these instructions. Each
  2993. of the first four operands can be a register, and either second source or
  2994. selector can be memory location, 128-bit or 256-bit depending on whether SSE
  2995. registers or AVX registers are used for the other operands.
  2996.  
  2997.     vpermil2ps ymm0,ymm3,ymm7,ymm2,0  ; permute from two sources
  2998.  
  2999.   "vphaddbw" adds pairs of adjacent signed bytes to form 16-bit values and
  3000. stores them at the same positions in destination. "vphaddubw" does the same
  3001. but treats the bytes as unsigned. "vphaddbd" and "vphaddubd" sum all bytes
  3002. (either signed or unsigned) in each four-byte block to 32-bit results,
  3003. "vphaddbq" and "vphaddubq" sum all bytes in each eight-byte block to
  3004. 64-bit results, "vphaddwd" and "vphadduwd" add pairs of words to 32-bit
  3005. results, "vphaddwq" and "vphadduwq" sum all words in each four-word block to
  3006. 64-bit results, "vphadddq" and "vphaddudq" add pairs of double words to 64-bit
  3007. results. "vphsubbw" subtracts in each two-byte block the byte at higher
  3008. position from the one at lower position, and stores the result as a signed
  3009. 16-bit value at the corresponding position in destination, "vphsubwd"
  3010. subtracts in each two-word block the word at higher position from the one at
  3011. lower position and makes signed 32-bit results, "vphsubdq" subtracts in each
  3012. block of two double words the one at higher position from the one at lower
  3013. position and makes signed 64-bit results. Each of these instructions takes
  3014. two operands, the destination being SSE register, and the source being SSE
  3015. register or 128-bit memory.
  3016.  
  3017.     vphadduwq xmm0,xmm1          ; sum quadruplets of words
  3018.  
  3019.   "vpmacsww" and "vpmacssww" multiply the corresponding signed 16-bit values
  3020. from the first and second source and then add the products to the parallel
  3021. values from the third source, then "vpmacsww" takes the lowest 16 bits of the
  3022. result and "vpmacssww" saturates the result down to 16-bit value, and they
  3023. store the final 16-bit results in the destination. "vpmacsdd" and "vpmacssdd"
  3024. perform the analogous operation on 32-bit values. "vpmacswd" and "vpmacsswd" do
  3025. the same calculation only on the low 16-bit values from each 32-bit block and
  3026. form the 32-bit results. "vpmacsdql" and "vpmacssdql" perform such operation
  3027. on the low 32-bit values from each 64-bit block and form the 64-bit results,
  3028. while "vpmacsdqh" and "vpmacssdqh" do the same on the high 32-bit values from
  3029. each 64-bit block, also forming the 64-bit results. "vpmadcswd" and
  3030. "vpmadcsswd" multiply the corresponding signed 16-bit value from the first
  3031. and second source, then sum all the four products and add this sum to each
  3032. 16-bit element from third source, storing the truncated or saturated result
  3033. in destination. All these instructions take four operands, the second source
  3034. can be 128-bit memory or SSE register, all the other operands have to be
  3035. SSE registers.
  3036.  
  3037.     vpmacsdd xmm6,xmm1,[ebx],xmm6  ; accumulate product
  3038.  
  3039.   "vpperm" selects bytes from first and second source, optionally applies a
  3040. separate transformation to each of them, and stores them in the destination.
  3041. The bit fields in fourth operand (the selector) specify for each position in
  3042. destination what byte from which source is taken and what operation is applied
  3043. to it before it is stored there. Refer to the AMD manuals for the detailed
  3044. information about these bit fields. This instruction takes four operands,
  3045. either second source or selector can be a 128-bit memory (or they can be SSE
  3046. registers both), all the other operands have to be SSE registers.
  3047.   "vpshlb", "vpshlw", "vpshld" and "vpshlq" shift logically bytes, words, double
  3048. words or quad words respectively. The amount of bits to shift by is specified
  3049. for each element separately by the signed byte placed at the corresponding
  3050. position in the third operand. The source containing elements to shift is
  3051. provided as second operand. Either second or third operand can be 128-bit
  3052. memory (or they can be SSE registers both) and the other operands have to be
  3053. SSE registers.
  3054.  
  3055.     vpshld xmm3,xmm1,[ebx]       ; shift double words from xmm1
  3056.  
  3057. "vpshab", "vpshaw", "vpshad" and "vpshaq" arithmetically shift bytes, words,
  3058. double words or quad words. These instructions follow the same rules as the
  3059. logical shifts described above. "vprotb", "vprotw", "vprotd" and "vprotq"
  3060. rotate bytes, words, double words or quad words. They follow the same rules as
  3061. shifts, but additionally allow third operand to be immediate value, in which
  3062. case the same amount of rotation is specified for all the elements in source.
  3063.  
  3064.     vprotb xmm0,[esi],3          ; rotate bytes to the left
  3065.  
  3066.   The MOVBE extension introduces just one new instruction, "movbe", which
  3067. swaps bytes in value from source before storing it in destination, so can
  3068. be used to load and store big endian values. It takes two operands, either
  3069. the destination or source should be a 16-bit, 32-bit or 64-bit memory (the
  3070. last one being only allowed in long mode), and the other operand should be
  3071. a general register of the same size.  
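
          For illustration, a load and a store of big endian values could look
        like this (the registers used are arbitrary):

            movbe eax,[esi]      ; load big endian double word
            movbe [edi],ax       ; store word as big endian
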
  3072.   The BMI extension, consisting of two subsets - BMI1 and BMI2, introduces
  3073. new instructions operating on general registers, which use the same encoding
  3074. as AVX instructions and so allow the extended syntax. All these instructions
  3075. use 32-bit operands, and in long mode they also allow the forms with 64-bit
  3076. operands.
  3077.   "andn" calculates the bitwise AND of second source with the inverted bits
  3078. of first source and stores the result in destination. The destination and
  3079. the first source have to be general registers, the second source can be
  3080. general register or memory.
  3081.  
  3082.     andn edx,eax,[ebx]   ; bit-multiply inverted eax with memory
  3083.  
  3084.   "bextr" extracts from the first source the sequence of bits using an index
  3085. and length specified by bit fields in the second source operand and stores
  3086. it into destination. The lowest 8 bits of second source specify the position
  3087. of bit sequence to extract and the next 8 bits of second source specify the
  3088. length of sequence. The first source can be a general register or memory,
  3089. the other two operands have to be general registers.
  3090.  
  3091.     bextr eax,[esi],ecx  ; extract bit field from memory
  3092.    
  3093.   "blsi" extracts the lowest set bit from the source, setting all the other
  3094. bits in destination to zero. The destination must be a general register,
  3095. the source can be general register or memory.
  3096.  
  3097.     blsi rax,r11         ; isolate the lowest set bit      
  3098.  
  3099.   "blsmsk" sets all the bits in the destination up to the lowest set bit in
  3100. the source, including this bit. "blsr" copies all the bits from the source to
  3101. destination except for the lowest set bit, which is replaced by zero. These
  3102. instructions follow the same rules for operands as "blsi".
  3103.   "tzcnt" counts the number of trailing zero bits, that is the zero bits up to
  3104. the lowest set bit of source value. This instruction is analogous to "lzcnt"
  3105. and follows the same rules for operands, so it also has a 16-bit version,
  3106. unlike the other BMI instructions.
  3107.   "bzhi" is BMI2 instruction, which copies the bits from first source to
  3108. destination, zeroing all the bits up from the position specified by second
  3109. source. It follows the same rules for operands as "bextr".
  3110.   "pext" uses a mask in second source operand to select bits from first
  3111. source and puts the selected bits as a continuous sequence into destination.
  3112. "pdep" performs the reverse operation - it takes sequence of bits from the
  3113. first source and puts them consecutively at the positions where the bits in
  3114. second source are set, setting all the other bits in destination to zero.
  3115. These BMI2 instructions follow the same rules for operands as "andn".    
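
          A small sketch of these two instructions used as a pair, with the mask
        assumed to be held in ECX (register choice is illustrative only):

            pext eax,ebx,ecx     ; pack the bits of ebx selected by mask in ecx
            pdep edx,eax,ecx     ; spread them back to the positions from mask
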
  3116.   "mulx" is a BMI2 instruction which performs an unsigned multiplication of
  3117. value from EDX or RDX register (depending on the size of specified operands)
  3118. by the value from third operand, and stores the low half of result in the
  3119. second operand, and the high half of result in the first operand, and it does
  3120. it without affecting the flags. The third operand can be general register or
  3121. memory, and both the destination operands have to be general registers.
  3122.  
  3123.     mulx edx,eax,ecx     ; multiply edx by ecx into edx:eax  
  3124.  
  3125.   "shlx", "shrx" and "sarx" are BMI2 instructions, which perform logical or
  3126. arithmetical shifts of value from first source by the amount specified by
  3127. second source, and store the result in destination without affecting the
  3128. flags. They have the same rules for operands as "bzhi" instruction.
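
          For example, with the shift amount assumed to be in ECX (the operands
        here are chosen only for illustration):

            shlx eax,[ebx],ecx   ; shift value from memory to the left
            sarx edx,eax,ecx     ; arithmetical shift to the right
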
  3129.   "rorx" is a BMI2 instruction which rotates right the value from source
  3130. operand by the constant amount specified in third operand and stores the
  3131. result in destination without affecting the flags. The destination operand
  3132. has to be general register, the source operand can be general register or
  3133. memory, and the third operand has to be an immediate value.
  3134.  
  3135.     rorx eax,edx,7       ; rotate without affecting flags
  3136.                      
  3137.   The TBM is an extension designed by AMD to supplement the BMI set. The
  3138. "bextr" instruction is extended with a new form, in which second source is
  3139. a 32-bit immediate value. "blsic" is a new instruction which performs the
  3140. same operation as "blsi", but with the bits of result inverted. It uses the
  3141. same rules for operands as "blsi". "blsfill" is a new instruction, which takes
  3142. the value from source, sets all the bits below the lowest set bit and stores
  3143. the result in destination, it also uses the same rules for operands as "blsi".
  3144.   "blci", "blcic", "blcs", "blcmsk" and "blcfill" are instructions analogous
  3145. to "blsi", "blsic", "blsr", "blsmsk" and "blsfill" respectively, but they
  3146. perform the bit-inverted versions of the same operations. They follow the
  3147. same rules for operands as the instructions they reflect.
  3148.   "tzmsk" finds the lowest set bit in value from source operand, sets all bits
  3149. below it to 1 and all the rest of bits to zero, then writes the result to
  3150. destination. "t1mskc" finds the least significant zero bit in the value from
  3151. source operand, sets the bits below it to zero and all the other bits to 1,
  3152. and writes the result to destination. These instructions have the same rules
  3153. for operands as "blsi".
  3154.      
  3155.  
  3156. 2.1.24  Other extensions of instruction set
  3157.  
  3158. There are a number of additional instruction set extensions recognized by flat
  3159. assembler, and the general syntax of the instructions introduced by those
  3160. extensions is provided here. For detailed information on the operations
  3161. performed by them, check out the manuals from Intel (for the VMX, SMX, XSAVE,
  3162. RDRAND, FSGSBASE, INVPCID, HLE and RTM extensions) or AMD (for the SVM
  3163. extension).
  3164.   The Virtual-Machine Extensions (VMX) provide a set of instructions for the
  3165. management of virtual machines. The "vmxon" instruction, which enters the VMX
  3166. operation, requires a single 64-bit memory operand, which should be a physical
  3167. address of memory region, which the logical processor may use to support VMX
  3168. operation. The "vmxoff" instruction, which leaves the VMX operation, has no
  3169. operands. The "vmlaunch" and "vmresume", which launch or resume the virtual
  3170. machines, and "vmcall", which allows guest software to call the VM monitor,
  3171. use no operands either.
  3172.   The "vmptrld" loads the physical address of current Virtual Machine Control
  3173. Structure (VMCS) from its memory operand, "vmptrst" stores the pointer to
  3174. current VMCS into address specified by its memory operand, and "vmclear" sets
  3175. the launch state of the VMCS referenced by its memory operand to clear. These
  3176. three instructions all require a single 64-bit memory operand.
  3177.   The "vmread" reads from VMCS a field specified by the source operand and
  3178. stores it into the destination operand. The source operand should be a
  3179. general purpose register, and the destination operand can be a register or
  3180. memory. The "vmwrite" writes into a VMCS field specified by the destination
  3181. operand the value provided by source operand. The source operand can be a
  3182. general purpose register or memory, and the destination operand must be a
  3183. register. The size of operands for those instructions should be 64-bit when
  3184. in long mode, and 32-bit otherwise.
  3185.   The "invept" and "invvpid" invalidate the translation lookaside buffers
  3186. (TLBs) and paging-structure caches, either derived from extended page tables
  3187. (EPT), or based on the virtual processor identifier (VPID). These instructions
  3188. require two operands, the first one being the general purpose register
  3189. specifying the type of invalidation, and the second one being a 128-bit
  3190. memory operand providing the invalidation descriptor. The first operand
  3191. should be a 64-bit register when in long mode, and 32-bit register otherwise.
  3192.   The Safer Mode Extensions (SMX) provide the functionalities available
  3193. through the "getsec" instruction. This instruction takes no operands, and
  3194. the function that is executed is determined by the contents of EAX register
  3195. upon executing this instruction.
  3196.   The Secure Virtual Machine (SVM) is a variant of virtual machine extension
  3197. used by AMD. The "skinit" instruction securely reinitializes the processor
  3198. allowing the startup of trusted software, such as the virtual machine monitor
  3199. (VMM). This instruction takes a single operand, which must be EAX, and
  3200. provides a physical address of the secure loader block (SLB).
  3201.   The "vmrun" instruction is used to start a guest virtual machine,
  3202. its only operand should be an accumulator register (AX, EAX or RAX, the
  3203. last one available only in long mode) providing the physical address of the
  3204. virtual machine control block (VMCB). The "vmsave" stores a subset of
  3205. processor state into VMCB specified by its operand, and "vmload" loads the
  3206. same subset of processor state from a specified VMCB. The same operand rules
  3207. as for the "vmrun" apply to those two instructions.
  3208.   "vmmcall" allows the guest software to call the VMM. This instruction takes
  3209. no operands.
  3210.   "stgi" sets the global interrupt flag to 1, and "clgi" zeroes it. These
  3211. instructions take no operands.
  3212.   "invlpga" invalidates the TLB mapping for a virtual page specified by the
  3213. first operand (which has to be accumulator register) and address space
  3214. identifier specified by the second operand (which must be ECX register).
  3215.   The XSAVE set of instructions allows saving and restoring processor state
  3216. components. "xsave" and "xsaveopt" store the components of processor state
  3217. defined by bit mask in EDX and EAX registers into area defined by memory
  3218. operand. "xrstor" restores from the area specified by memory operand the
  3219. components of processor state defined by mask in EDX and EAX. The "xsave64",
  3220. "xsaveopt64" and "xrstor64" are 64-bit versions of these instructions, allowed
  3221. only in long mode.
  3222.   "xgetbv" reads the contents of 64-bit XCR (extended control register)
  3223. specified in ECX register into EDX and EAX registers. "xsetbv" writes the
  3224. contents of EDX and EAX into the 64-bit XCR specified by ECX register. These
  3225. instructions have no operands.
  3226.   The RDRAND extension introduces one new instruction, "rdrand", which loads
  3227. the hardware-generated random value into general register. It takes one
  3228. operand, which can be 16-bit, 32-bit or 64-bit register (with the last one
  3229. being allowed only in long mode).
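
          A minimal sketch of its use, assuming (as the Intel manuals state) that
        the carry flag is set only when a random value was actually delivered;
        the "retry" label is a name made up for this example:

            retry:
                rdrand eax       ; try to get a random 32-bit value
                jnc retry        ; CF clear means no value was available yet
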
  3230.   The FSGSBASE extension adds long mode instructions that allow to read and
  3231. write the segment base registers for FS and GS segments. "rdfsbase" and
  3232. "rdgsbase" read the corresponding segment base registers into operand, while
  3233. "wrfsbase" and "wrgsbase" write the value of operand into those registers.
  3234. All these instructions take one operand, which can be 32-bit or 64-bit general
  3235. register.  
  3236.   The INVPCID extension adds "invpcid" instruction, which invalidates mapping
  3237. in the TLBs and paging caches based on the invalidation type specified in
  3238. first operand and PCID invalidate descriptor specified in second operand.
  3239. The first operand should be 32-bit general register when not in long mode,
  3240. or 64-bit general register when in long mode. The second operand should be
  3241. 128-bit memory location.  
  3242.   The HLE and RTM extensions provide set of instructions for the transactional
  3243. management. The "xacquire" and "xrelease" are new prefixes that can be used
  3244. with some of the instructions to start or end lock elision on the memory
  3245. address specified by prefixed instruction. The "xbegin" instruction starts
  3246. the transactional execution, its operand is the address of a fallback routine
  3247. that gets executed in case of transaction abort, specified like the operand
  3248. for near jump instruction. "xend" marks the end of transactional execution
  3249. region, it takes no operands. "xabort" forces the transaction abort, it takes
  3250. an 8-bit immediate value as its only operand, this value is passed in the
  3251. highest bits of EAX to the fallback routine. "xtest" checks whether there is
  3252. transactional execution in progress, this instruction takes no operands.
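
          The overall shape of a transactional region might be sketched as
        follows; the "fallback" and "done" labels and the incremented counter
        are hypothetical, introduced only for this illustration:

                xbegin fallback      ; start transactional execution
                inc dword [ebx]      ; work performed inside the transaction
                xend                 ; end of the transactional region
                jmp done
            fallback:
                lock inc dword [ebx] ; non-transactional path taken after abort
            done: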
  3253.  
  3254.  
  3255. 2.2  Control directives
  3256.  
  3257. This section describes the directives that control the assembly process, they
  3258. are processed during the assembly and may cause some blocks of instructions
  3259. to be assembled differently or not assembled at all.
  3260.  
  3261.  
  3262. 2.2.1  Numerical constants
  3263.  
  3264. The "=" directive allows defining a numerical constant. It should be
  3265. preceded by the name for the constant and followed by the numerical expression
  3266. providing the value. The value of such constants can be a number or an address,
  3267. but - unlike labels - the numerical constants are not allowed to hold the
  3268. register-based addresses. Besides this difference, in their basic variant
  3269. numerical constants behave very much like labels and you can even
  3270. forward-reference them (access their values before they actually get defined).
  3271.   There is, however, a second variant of numerical constants, which is
  3272. recognized by assembler when you try to define the constant of name, under
  3273. which there already was a numerical constant defined. In such case assembler
  3274. treats that constant as an assembly-time variable and allows it to be assigned
  3275. with new value, but forbids forward-referencing it (for obvious reasons). Let's
  3276. see both variants of numerical constants in one example:
  3277.  
  3278.     dd sum
  3279.     x = 1
  3280.     x = x+2
  3281.     sum = x
  3282.  
  3283. Here the "x" is an assembly-time variable, and every time it is accessed, the
  3284. value that was assigned to it the most recently is used. Thus if we tried to
  3285. access the "x" before it gets defined the first time, like if we wrote "dd x"
  3286. in place of the "dd sum" instruction, it would cause an error. And when it is
  3287. re-defined with the "x = x+2" directive, the previous value of "x" is used to
  3288. calculate the new one. So when the "sum" constant gets defined, the "x" has
  3289. value of 3, and this value is assigned to the "sum". Since this one is defined
  3290. only once in source, it is the standard numerical constant, and can be
  3291. forward-referenced. So the "dd sum" is assembled as "dd 3". To read more about
  3292. how the assembler is able to resolve this, see section 2.2.6.
  3293.   The value of numerical constant can be preceded by size operator, which can
  3294. ensure that the value will fit in the range for the specified size, and can
  3295. affect also how some of the calculations inside the numerical expression are
  3296. performed. This example:
  3297.  
  3298.     c8 = byte -1
  3299.     c32 = dword -1
  3300.  
  3301. defines two different constants, the first one fits in 8 bits, the second one
  3302. fits in 32 bits.
  3303.   When you need to define constant with the value of address, which may be
  3304. register-based (and thus you cannot employ numerical constant for this
  3305. purpose), you can use the extended syntax of "label" directive (already
  3306. described in section 1.2.3), like:
  3307.  
  3308.     label myaddr at ebp+4
  3309.  
  3310. which declares label placed at "ebp+4" address. However remember that labels,
  3311. unlike numerical constants, cannot become assembly-time variables.
  3312.  
  3313.  
  3314. 2.2.2  Conditional assembly
  3315.  
  3316. "if" directive causes some block of instructions to be assembled only under
  3317. certain condition. It should be followed by logical expression specifying the
  3318. condition, instructions in next lines will be assembled only when this
  3319. condition is met, otherwise they will be skipped. The optional "else if"
  3320. directive followed with logical expression specifying additional condition
  3321. begins the next block of instructions that will be assembled if previous
  3322. conditions were not met, and the additional condition is met. The optional
  3323. "else" directive begins the block of instructions that will be assembled if
  3324. all the conditions were not met. The "end if" directive ends the last block of
  3325. instructions.
  3326.   You should note that "if" directive is processed at assembly stage and
  3327. therefore it doesn't affect any preprocessor directives, like the definitions
  3328. of symbolic constants and macroinstructions - when the assembler recognizes the
  3329. "if" directive, all the preprocessing has been already finished.
  3330.   The logical expression consists of logical values and logical operators. The
  3331. logical operators are "~" for logical negation, "&" for logical and, "|" for
  3332. logical or. The negation has the highest priority. Logical value can be a
  3333. numerical expression, it will be false if it is equal to zero, otherwise it
  3334. will be true. Two numerical expressions can be compared using one of the
  3335. following operators to make the logical value: "=" (equal), "<" (less),
  3336. ">" (greater), "<=" (less or equal), ">=" (greater or equal),
  3337. "<>" (not equal).
  3338.   The "used" operator followed by a symbol name, is the logical value that
  3339. checks whether the given symbol is used somewhere (it returns correct result
  3340. even if symbol is used only after this check). The "defined" operator can be
  3341. followed by any expression, usually just by a single symbol name; it checks
  3342. whether the given expression contains only symbols that are defined in the
  3343. source and accessible from the current position.
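
          One common way to employ the "used" operator, shown here as a sketch
        with a hypothetical routine named "clear_eax", is to assemble a routine
        only when something in the source refers to it:

            if used clear_eax
            clear_eax:
                xor eax,eax
                ret
            end if
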
  3344.   With "relativeto" operator it is possible to check whether values of two
  3345. expressions differ only by constant amount. The valid syntax is a numerical
  3346. expression followed by "relativeto" and then another expression (possibly
  3347. register-based). Labels that have no simple numerical value can be tested
  3348. this way to determine what kind of operations may be possible with them.
  3349.   The following simple example uses the "count" constant that should be
  3350. defined somewhere in source:
  3351.  
  3352.     if count>0
  3353.         mov cx,count
  3354.         rep movsb
  3355.     end if
  3356.  
  3357. These two assembly instructions will be assembled only if the "count" constant
  3358. is greater than 0. The next sample shows more complex conditional structure:
  3359.  
  3360.     if count & ~ count mod 4
  3361.         mov cx,count/4
  3362.         rep movsd
  3363.     else if count>4
  3364.         mov cx,count/4
  3365.         rep movsd
  3366.         mov cx,count mod 4
  3367.         rep movsb
  3368.     else
  3369.         mov cx,count
  3370.         rep movsb
  3371.     end if
  3372.  
  3373. The first block of instructions gets assembled when the "count" is non zero and
  3374. divisible by four, if this condition is not met, the second logical expression,
  3375. which follows the "else if", is evaluated and if it's true, the second block
  3376. of instructions gets assembled, otherwise the last block of instructions, which
  3377. follows the line containing only "else", is assembled.
  3378.   There are also operators that compare values consisting of arbitrary chains
  3379. of symbols. The "eq" operator checks whether two such values are exactly the
  3380. same. The "in" operator checks whether a given value is a member of the list of
  3381. values following this operator; the list should be enclosed between "<" and ">"
  3382. characters and its members should be separated with commas. The symbols are
  3383. considered the same when they have the same meaning for the assembler - for
  3384. example "pword" and "fword" are the same for the assembler and thus are not
  3385. distinguished by the above operators. In the same way "16 eq 10h" is a true
  3386. condition, however "16 eq 10+4" is not.
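          These operators are mostly useful inside macroinstructions (described in
        2.3.3), where the compared value is an argument. The following is only an
        illustrative sketch - the "zero" macroinstruction and the "counter"
        variable are hypothetical names, not a part of the assembler:

            macro zero dest
             {
              if dest in <eax,ebx,ecx,edx>
                xor dest,dest       ; one of the listed registers
              else
                mov dest,0          ; any other operand
              end if
             }

            zero eax                ; assembled as "xor eax,eax"
            zero dword [counter]    ; assembled as "mov dword [counter],0"
            counter dd ?
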
  3387.   The "eqtype" operator checks whether the two compared values have the same
  3388. structure, and whether the structural elements are of the same type. The
  3389. distinguished types include numerical expressions, individual quoted strings,
  3390. floating point numbers, address expressions (the expressions enclosed in square
  3391. brackets or preceded by "ptr" operator), instruction mnemonics, registers, size
  3392. operators, jump type and code type operators. Each of the special
  3393. characters that act as separators, like comma or colon, is a separate type
  3394. itself. For example, two values, each one consisting of register name followed
  3395. by comma and numerical expression, will be regarded as of the same type, no
  3396. matter what kind of register and how complicated numerical expression is used;
  3397. with the exception of quoted strings and floating point values, which are
  3398. special kinds of numerical expressions and are treated as different types. Thus
  3399. "eax,16 eqtype fs,3+7" condition is true, but "eax,16 eqtype eax,1.6" is false.
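          A common use is telling quoted strings apart from other numerical
        expressions. In this sketch the "item" macroinstruction (see 2.3.3 for
        the syntax) is a hypothetical name chosen for the example:

            macro item value
             {
              if value eqtype ''
                db value,0          ; quoted string, stored with a terminating zero
              else
                dd value            ; other numerical expression, stored as dword
              end if
             }

            item 'abc'
            item 1234h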
  3400.  
  3401.  
  3402. 2.2.3 Repeating blocks of instructions
  3403.  
  3404. "times" directive repeats one instruction specified number of times. It
  3405. should be followed by numerical expression specifying number of repeats and
  3406. the instruction to repeat (optionally colon can be used to separate number and
  3407. instruction). When special symbol "%" is used inside the instruction, it is
  3408. equal to the number of current repeat. For example "times 5 db %" will define
  3409. five bytes with values 1, 2, 3, 4, 5. Recursive use of "times" directive is
  3410. also allowed, so "times 3 times % db %" will define six bytes with values
  3411. 1, 1, 2, 1, 2, 3.
  3412.   "repeat" directive repeats the whole block of instructions. It should be
  3413. followed by numerical expression specifying number of repeats. Instructions
  3414. to repeat are expected in next lines, ended with the "end repeat" directive,
  3415. for example:
  3416.  
  3417.     repeat 8
  3418.         mov byte [bx],%
  3419.         inc bx
  3420.     end repeat
  3421.  
  3422. The generated code will store byte values from one to eight in the memory
  3423. addressed by BX register.
  3424.   Number of repeats can be zero, in that case the instructions are not
  3425. assembled at all.
  3426.   The "break" directive stops the repeating early and continues assembly
  3427. from the first line after the "end repeat". Combined with the "if" directive it
  3428. allows stopping the repetition under some special condition, like:
  3429.  
  3430.     s = x/2
  3431.     repeat 100
  3432.         if x/s = s
  3433.             break
  3434.         end if
  3435.         s = (s+x/s)/2
  3436.     end repeat
  3437.  
  3438.   The "while" directive repeats the block of instructions as long as the
  3439. condition specified by the logical expression following it is true. The block
  3440. of instructions to be repeated should end with the "end while" directive.
  3441. Before each repetition the logical expression is evaluated and when its value
  3442. is false, the assembly is continued starting from the first line after the
  3443. "end while". Also in this case the "%" symbol holds the number of current
  3444. repeat. The "break" directive can be used to stop this kind of loop in the same
  3445. way as with "repeat" directive. The previous sample can be rewritten to use the
  3446. "while" instead of "repeat" this way:
  3447.  
  3448.     s = x/2
  3449.     while x/s <> s
  3450.         s = (s+x/s)/2
  3451.         if % = 100
  3452.             break
  3453.         end if
  3454.     end while
  3455.  
  3456.   The blocks defined with "if", "repeat" and "while" can be nested in any
  3457. order, however they should be closed in the same order in which they were
  3458. started. The "break" directive always stops processing the block that was
  3459. started last with either the "repeat" or "while" directive.
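          For instance, in the following sketch the "break" leaves only the inner
        "repeat" block, so the outer block still runs three times and the whole
        structure defines the bytes 1, 2 three times in a row:

            repeat 3
                repeat 4
                    if % > 2
                        break
                    end if
                    db %
                end repeat
            end repeat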
  3460.  
  3461.  
  3462. 2.2.4  Addressing spaces
  3463.  
  3464.   "org" directive sets address at which the following code is expected to
  3465. appear in memory. It should be followed by numerical expression specifying
  3466. the address. This directive begins the new addressing space, the following
  3467. code itself is not moved in any way, but all the labels defined within it
  3468. and the value of "$" symbol are affected as if it was put at the given
  3469. address. However it's the responsibility of programmer to put the code at
  3470. correct address at run-time.
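          A typical case is a DOS program in the COM format, which the system loads
        at offset 100h of its segment. The sketch below is only an illustration
        and the label names are arbitrary:

            org 100h
            start:
                mov ah,9
                mov dx,message      ; the address already includes the 100h base
                int 21h
                int 20h
            message db 'Hello!',24h
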
  3471.   The "load" directive defines a constant with a binary value loaded
  3472. from the already assembled code. This directive should be followed by the name
  3473. of the constant, then optionally size operator, then "from" operator and a
  3474. numerical expression specifying a valid address in current addressing space.
  3475. The size operator has unusual meaning in this case - it states how many bytes
  3476. (up to 8) have to be loaded to form the binary value of constant. If no size
  3477. operator is specified, one byte is loaded (thus value is in range from 0 to
  3478. 255). The loaded data cannot exceed current offset.
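          As a small sketch (the names "sample" and "opcode" are chosen just for
        this example), the following lines read back the byte generated for the
        "nop" instruction and check it with the "assert" directive described in
        2.2.5:

            sample:
                nop
            load opcode byte from sample
            assert opcode = 90h
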
  3479.   The "store" directive can modify the already generated code by replacing
  3480. some of the previously generated data with the value defined by the given
  3481. numerical expression, which follows. The expression can be preceded by an
  3482. optional size operator to specify how large a value the expression defines, and
  3483. therefore how many bytes will be stored; if there is no size operator, the
  3484. size of one byte is assumed. Then the "at" operator should follow, with a
  3485. numerical expression defining a valid address in the current addressing space
  3486. at which the given value has to be stored. This is a directive for
  3487. advanced applications and should be used carefully.
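          One possible use, shown here only as a sketch with made-up label names,
        is to reserve a field first and fill it in once the needed value is known:

            body_size dw 0                      ; placeholder
            body:
                db 'some data'
            store word $-body at body_size      ; overwrite it with the size of body
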
  3488.   Both "load" and "store" directives are limited to operate on places in
  3489. current addressing space. The "$$" symbol is always equal to the base address
  3490. of current addressing space, and the "$" symbol is the address of current
  3491. position in that addressing space, therefore these two values define limits
  3492. of the area, where "load" and "store" can operate.
  3493.   Combining the "load" and "store" directives makes it possible to do things
  3494. like encoding some of the already generated code. For example to encode the
  3495. whole code generated in current addressing space you can use such a block:
  3496.  
  3497.     repeat $-$$
  3498.         load a byte from $$+%-1
  3499.         store byte a xor c at $$+%-1
  3500.     end repeat
  3501.  
  3502. and each byte of code will be xored with the value defined by "c" constant.
  3503.   "virtual" defines virtual data at specified address. This data will not be
  3504. included in the output file, but labels defined there can be used in other
  3505. parts of source. This directive can be followed by "at" operator and the
  3506. numerical expression specifying the address for virtual data, otherwise it
  3507. uses current address, the same as "virtual at $". Instructions defining data
  3508. are expected in next lines, ended with "end virtual" directive. The block of
  3509. virtual instructions itself is an independent addressing space, after it's
  3510. ended, the context of previous addressing space is restored.
  3511.   The "virtual" directive can be used to create union of some variables, for
  3512. example:
  3513.  
  3514.     GDTR dp ?
  3515.     virtual at GDTR
  3516.         GDT_limit dw ?
  3517.         GDT_address dd ?
  3518.     end virtual
  3519.  
  3520. It defines two labels for parts of the 48-bit variable at "GDTR" address.
  3521.   It can be also used to define labels for some structures addressed by a
  3522. register, for example:
  3523.  
  3524.     virtual at bx
  3525.         LDT_limit dw ?
  3526.         LDT_address dd ?
  3527.     end virtual
  3528.  
  3529. With such definition instruction "mov ax,[LDT_limit]" will be assembled
  3530. to the same instruction as "mov ax,[bx]".
  3531.   Declaring defined data values or instructions inside the virtual block can
  3532. also be useful, because the "load" directive can be used to load the values
  3533. from the virtually generated code into constants. This directive should be
  3534. used after the code it loads but before the virtual block ends, because it can
  3535. only load the values from the same addressing space. For example:
  3536.  
  3537.     virtual at 0
  3538.         xor eax,eax
  3539.         and edx,eax
  3540.         load zeroq dword from 0
  3541.     end virtual
  3542.  
  3543. The above piece of code will define the "zeroq" constant containing four bytes
  3544. of the machine code of the instructions defined inside the virtual block.
  3545. This method can be also used to load some binary value from external file.
  3546. For example this code:
  3547.  
  3548.     virtual at 0
  3549.         file 'a.txt':10h,1
  3550.         load char from 0
  3551.     end virtual
  3552.  
  3553. loads the single byte from offset 10h in file "a.txt" into the "char"
  3554. constant.
  3555.   Any of the "section" directives described in 2.4 also begins a new
  3556. addressing space.
  3557.  
  3558.  
  3559. 2.2.5  Other directives
  3560.  
  3561. "align" directive aligns code or data to the specified boundary. It should
  3562. be followed by a numerical expression specifying the number of bytes, to a
  3563. multiple of which the current address has to be aligned. The boundary value
  3564. has to be a power of two.
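          For example, in the sketch below (the labels are arbitrary) one byte has
        to be skipped so that "table" starts at an address divisible by four:

            org 0
            flags db 1,2,3
            align 4
            table dd 10,20,30
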
  3565.   The "align" directive fills the bytes that had to be skipped to perform the
  3566. alignment with the "nop" instructions and at the same time marks this area as
  3567. uninitialized data, so if it is placed among other uninitialized data that
  3568. wouldn't take space in the output file, the alignment bytes will act the same
  3569. way. If you need to fill the alignment area with some other values, you can
  3570. combine "align" with "virtual" to get the size of alignment needed and then
  3571. create the alignment yourself, like:
  3572.  
  3573.     virtual
  3574.         align 16
  3575.         a = $ - $$
  3576.     end virtual
  3577.     db a dup 0
  3578.  
  3579. The "a" constant is defined to be the difference between address after
  3580. alignment and address of the "virtual" block (see previous section), so it is
  3581. equal to the size of needed alignment space.
  3582.   "display" directive displays the message at the assembly time. It should
  3583. be followed by the quoted strings or byte values, separated with commas. It
  3584. can be used to display values of some constants, for example:
  3585.  
  3586.     bits = 16
  3587.     display 'Current offset is 0x'
  3588.     repeat bits/4
  3589.         d = '0' + $ shr (bits-%*4) and 0Fh
  3590.         if d > '9'
  3591.             d = d + 'A'-'9'-1
  3592.         end if
  3593.         display d
  3594.     end repeat
  3595.     display 13,10
  3596.  
  3597. This block of directives calculates the four hexadecimal digits of 16-bit
  3598. value and converts them into characters for displaying. Note that this will
  3599. not work if the addresses in current addressing space are relocatable (as it
  3600. might happen with PE or object output formats), since only absolute values can
  3601. be used this way. The absolute value may be obtained by calculating the
  3602. relative address, like "$-$$", or "rva $" in case of PE format.
  3603.   The "err" directive immediately terminates the assembly process when it is
  3604. encountered by assembler.
  3605.   The "assert" directive tests whether the logical expression that follows it
  3606. is true, and if not, it signals an error.
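          Both directives can guard against invalid settings. In this sketch the
        "buffer_size" constant is a hypothetical parameter of the program:

            buffer_size = 256

            if buffer_size mod 16 <> 0
                display 'buffer_size has to be a multiple of 16',13,10
                err
            end if
            assert buffer_size <= 4096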
  3607.  
  3608.  
  3609. 2.2.6  Multiple passes
  3610.  
  3611. Because the assembler allows referencing some of the labels or constants
  3612. before they get actually defined, it has to predict the values of such labels
  3613. and if there is even a suspicion that prediction failed in at least one case,
  3614. it does one more pass, assembling the whole source, this time doing better
  3615. prediction based on the values the labels got in the previous pass.
  3616.   The changing values of labels can cause some instructions to have encodings
  3617. of different length, and this can cause the change in values of labels again.
  3618. And since the labels and constants can also be used inside the expressions that
  3619. affect the behavior of control directives, the whole block of source can be
  3620. processed completely differently during the new pass. Thus the assembler does
  3621. more and more passes, each time trying to do better predictions to approach
  3622. the final solution, when all the values get predicted correctly. It uses
  3623. various methods for predicting the values, which have been chosen to allow
  3624. finding in a few passes the solution of the smallest possible length for most
  3625. programs.
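          A forward jump is a simple illustration of this (a sketch only, with an
        arbitrary label name). The distance to "finish" is not known when the
        jump is first met, and since the target turns out to be out of range of
        the short form, the longer near form has to be used, which in turn
        moves all the labels that follow the jump:

                jmp finish
                times 200 nop
            finish:
                ret
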
  3626.   Some of the errors, like the values not fitting in required boundaries, are
  3627. not signaled during those intermediate passes, since it may happen that when
  3628. some of the values are predicted better, these errors will disappear. However
  3629. if assembler meets some illegal syntax construction or unknown instruction, it
  3630. always stops immediately. Also defining some label more than once causes such
  3631. error, because it makes the predictions groundless.
  3632.   Only the messages created with the "display" directive during the last
  3633. performed pass get actually displayed. In case when the assembly has been
  3634. stopped due to an error, these messages may reflect the predicted values that
  3635. are not yet resolved correctly.
  3636.   The solution may sometimes not exist and in such cases the assembler will
  3637. never manage to make correct predictions - for this reason there is a limit for
  3638. a number of passes, and when assembler reaches this limit, it stops and
  3639. displays the message that it is not able to generate the correct output.
  3640. Consider the following example:
  3641.  
  3642.     if ~ defined alpha
  3643.         alpha:
  3644.     end if
  3645.  
  3646. The "defined" operator gives the true value when the expression following it
  3647. could be calculated in this place, which in this case means that the "alpha"
  3648. label is defined somewhere. But the above block causes this label to be defined
  3649. only when the value given by "defined" operator is false, which leads to an
  3650. antinomy and makes it impossible to resolve such code. When processing the "if"
  3651. directive assembler has to predict whether the "alpha" label will be defined
  3652. somewhere (it wouldn't have to predict only if the label was already defined
  3653. earlier in this pass), and whatever the prediction is, the opposite always
  3654. happens. Thus the assembly will fail, unless the "alpha" label is defined
  3655. somewhere in source preceding the above block of instructions - in such case,
  3656. as it was already noted, the prediction is not needed and the block will just
  3657. get skipped.
  3658.   The above sample might have been written as an attempt to define the label only
  3659. when it was not yet defined. It fails, because the "defined" operator does
  3660. check whether the label is defined anywhere, and this includes the definition
  3661. inside this conditionally processed block. However adding some additional
  3662. condition may make it possible to get it resolved:
  3663.  
  3664.     if ~ defined alpha | defined @f
  3665.         alpha:
  3666.         @@:
  3667.     end if
  3668.  
  3669. The "@f" is always the same label as the nearest "@@" symbol in the source
  3670. following it, so the above sample would mean the same if any unique name was
  3671. used instead of the anonymous label. When "alpha" is not defined in any other
  3672. place in source, the only possible solution is when this block gets assembled,
  3673. and this time this doesn't lead to an antinomy, because of the anonymous
  3674. label which makes this block self-establishing. To better understand this,
  3675. look at a block that contains nothing more than this self-establishment:
  3676.  
  3677.     if defined @f
  3678.         @@:
  3679.     end if
  3680.  
  3681. This is an example of source that may have more than one solution, as both
  3682. cases when this block gets processed or not are equally correct. Which one of
  3683. those two solutions we get depends on the algorithm of the assembler, in case
  3684. of flat assembler - on the algorithm of predictions. Back to the previous
  3685. sample, when "alpha" is not defined anywhere else, the condition for "if" block
  3686. cannot be false, so we are left with only one possible solution, and we can
  3687. hope the assembler will arrive at it. On the other hand, when "alpha" is
  3688. defined in some other place, we've got two possible solutions again, but one of
  3689. them causes "alpha" to be defined twice, and such an error causes assembler to
  3690. abort the assembly immediately, as this is the kind of error that deeply
  3691. disturbs the process of resolving. So we can get such source either correctly
  3692. resolved or causing an error, and what we get may depend on the internal
  3693. choices made by the assembler.
  3694.   However there are some facts about such choices that are certain. When
  3695. assembler has to check whether the given symbol is defined and it was already
  3696. defined in the current pass, no prediction is needed - it was already noted
  3697. above. And when the given symbol has never been defined before, including in
  3698. all the already finished passes, the assembler predicts it to be not defined.
  3699. Knowing this, we can expect that the simple self-establishing block shown
  3700. above will not be assembled at all and that the previous sample will resolve
  3701. correctly when "alpha" is defined somewhere before our conditional block,
  3702. while it will itself define "alpha" when it's not already defined earlier, thus
  3703. potentially causing the error because of double definition if the "alpha" is
  3704. also defined somewhere later.
  3705.   The "used" operator may be expected to behave in a similar manner in
  3706. analogous cases, however any other kinds of predictions may not be so simple and
  3707. you should never rely on them this way.
  3708.   The "err" directive, usually used to stop the assembly when some condition is
  3709. met, stops the assembly immediately, regardless of whether the current pass
  3710. is final or intermediate. So even when the condition that caused this directive
  3711. to be interpreted is mispredicted and temporary, and would eventually disappear
  3712. in the later passes, the assembly is stopped anyway.
  3713.   The "assert" directive signals the error only if its expression is false
  3714. after all the symbols have been resolved. You can use "assert 0" in place of
  3715. "err" when you do not want to have assembly stopped during the intermediate
  3716. passes.
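          For instance, a size check that depends on labels defined later is safer
        with "assert", because the values of such labels are only predicted
        during the intermediate passes. The labels and the limit in this sketch
        are hypothetical:

            assert code_end - start <= 510

            start:
                jmp finish
                times 100 nop
            finish:
                ret
            code_end: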
  3717.  
  3718.  
  3719. 2.3  Preprocessor directives
  3720.  
  3721. All preprocessor directives are processed before the main assembly process,
  3722. and therefore are not affected by the control directives. At this time also
  3723. all comments are stripped out.
  3724.  
  3725.  
  3726. 2.3.1  Including source files
  3727.  
  3728. "include" directive includes the specified source file at the position where
  3729. it is used. It should be followed by the quoted name of file that should be
  3730. included, for example:
  3731.  
  3732.     include 'macros.inc'
  3733.  
  3734. The whole included file is preprocessed before preprocessing the lines that
  3735. follow the line containing the "include" directive. There are no limits to the
  3736. number of included files as long as they fit in memory.
  3737.   The quoted path can contain environment variables enclosed within "%"
  3738. characters, which will be replaced with their values inside the path. Both the
  3739. "\" and "/" characters are allowed as path separators. The file is first
  3740. searched for in the directory containing file which included it and when it is
  3741. not found there, the search is continued in the directories specified in the
  3742. environment variable called INCLUDE (the multiple paths separated with
  3743. semicolons can be defined there, they will be searched in the same order as
  3744. specified). If file was not found in any of these places, preprocessor looks
  3745. for it in the directory containing the main source file (the one specified in
  3746. command line). These rules concern also paths given with the "file" directive.
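          The following lines sketch both forms; the file names and the "MYLIB"
        environment variable are hypothetical:

            include 'macros.inc'            ; first looked up next to this file
            include '%MYLIB%/structs.inc'   ; "%MYLIB%" is replaced with its value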
  3747.  
  3748.  
  3749. 2.3.2  Symbolic constants
  3750.  
  3751. The symbolic constants are different from the numerical constants: before the
  3752. assembly process they are replaced with their values everywhere in source
  3753. lines after their definitions, and anything can become their values.
  3754.   The definition of symbolic constant consists of name of the constant
  3755. followed by the "equ" directive. Everything that follows this directive will
  3756. become the value of constant. If the value of symbolic constant contains
  3757. other symbolic constants, they are replaced with their values before assigning
  3758. this value to the new constant. For example:
  3759.  
  3760.     d equ dword
  3761.     NULL equ d 0
  3762.     d equ edx
  3763.  
  3764. After these three definitions the value of "NULL" constant is "dword 0" and
  3765. the value of "d" is "edx". So, for example, "push NULL" will be assembled as
  3766. "push dword 0" and "push d" will be assembled as "push edx". And if then the
  3767. following line was put:
  3768.  
  3769.     d equ d,eax
  3770.  
  3771. the "d" constant would get the new value of "edx,eax". This way the growing
  3772. lists of symbols can be defined.
  3773.   "restore" directive brings back the previous value of a redefined symbolic
  3774. constant. It should be followed by one or more names of symbolic constants,
  3775. separated with commas. So "restore d" after the above definitions will give
  3776. "d" constant back the value "edx", the second one will restore it to value
  3777. "dword", and one more will revert "d" to original meaning as if no such
  3778. constant was defined. If there was no constant defined of given name,
  3779. "restore" will not cause an error, it will be just ignored.
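          The effect can be sketched on a shorter chain of definitions (the same
        names as above, used here just for illustration):

            d equ dword
            d equ edx
            push d          ; assembled as "push edx"
            restore d
            push d 0        ; assembled as "push dword 0"
            restore d       ; "d" has no symbolic value any more
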
  3780.   Symbolic constant can be used to adjust the syntax of assembler to personal
  3781. preferences. For example the following set of definitions provides the handy
  3782. shortcuts for all the size operators:
  3783.  
  3784.     b equ byte
  3785.     w equ word
  3786.     d equ dword
  3787.     p equ pword
  3788.     f equ fword
  3789.     q equ qword
  3790.     t equ tword
  3791.     x equ dqword
  3792.     y equ qqword
  3793.  
  3794.   Because symbolic constant may also have an empty value, it can be used to
  3795. allow the syntax with "offset" word before any address value:
  3796.  
  3797.     offset equ
  3798.  
  3799. After this definition "mov ax,offset char" will be valid construction for
  3800. copying the offset of "char" variable into "ax" register, because "offset" is
  3801. replaced with an empty value, and therefore ignored.
  3802.   The "define" directive followed by the name of constant and then the value,
  3803. is the alternative way of defining symbolic constant. The only difference
  3804. between "define" and "equ" is that "define" assigns the value as it is, it does
  3805. not replace the symbolic constants with their values inside it.
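          The difference can be sketched as follows (all the names are arbitrary):

            x equ eax
            y equ x         ; "x" is replaced first, so the value of "y" is "eax"
            define z x      ; the value of "z" is the unprocessed symbol "x"
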
  3806.   Symbolic constants can also be defined with the "fix" directive, which has
  3807. the same syntax as "equ", but defines constants of high priority - they are
  3808. replaced with their symbolic values even before processing the preprocessor
  3809. directives and macroinstructions, the only exception is "fix" directive
  3810. itself, which has the highest possible priority, so it allows redefinition of
  3811. constants defined this way.
  3812.   The "fix" directive can be used for syntax adjustments related to directives
  3813. of preprocessor, which cannot be done with "equ" directive. For example:
  3814.  
  3815.     incl fix include
  3816.  
  3817. defines a short name for "include" directive, while the similar definition done
  3818. with "equ" directive wouldn't give such result, as standard symbolic constants
  3819. are replaced with their values after searching the line for preprocessor
  3820. directives.
  3821.  
  3822.  
  3823. 2.3.3  Macroinstructions
  3824.  
  3825. "macro" directive allows you to define your own complex instructions, called
  3826. macroinstructions, the use of which can greatly simplify the process of
  3827. programming. In its simplest form it's similar to symbolic constant
  3828. definition. For example the following definition defines a shortcut for the
  3829. "test al,0xFF" instruction:
  3830.  
  3831.     macro tst {test al,0xFF}
  3832.  
  3833. After the "macro" directive there is a name of macroinstruction and then its
  3834. contents enclosed between the "{" and "}" characters. You can use "tst"
  3835. instruction anywhere after this definition and it will be assembled as
  3836. "test al,0xFF". Defining symbolic constant "tst" of that value would give the
  3837. similar result, but the difference is that the name of macroinstruction is
  3838. recognized only as an instruction mnemonic. Also, macroinstructions are
  3839. replaced with corresponding code even before the symbolic constants are
  3840. replaced with their values. So if you define macroinstruction and symbolic
  3841. constant of the same name, and use this name as an instruction mnemonic, it
  3842. will be replaced with the contents of macroinstruction, but it will be
  3843. replaced with the value of symbolic constant if used somewhere inside the
  3844. operands.
  3845.   The definition of macroinstruction can consist of many lines, because
  3846. "{" and "}" characters don't have to be in the same line as "macro" directive.
  3847. For example:
  3848.  
  3849.     macro stos0
  3850.      {
  3851.         xor al,al
  3852.         stosb
  3853.      }
  3854.  
  3855. The macroinstruction "stos0" will be replaced with these two assembly
  3856. instructions anywhere it's used.
  3857.   Like instructions which need some number of operands, the macroinstruction
  3858. can be defined to need some number of arguments separated with commas. The
  3859. names of needed arguments should follow the name of macroinstruction in the
  3860. line of "macro" directive and should be separated with commas if there is more
  3861. than one. Anywhere one of these names occurs in the contents of
  3862. macroinstruction, it will be replaced with corresponding value, provided when
  3863. the macroinstruction is used. Here is an example of a macroinstruction that
  3864. will do data alignment for binary output format:
  3865.  
  3866.     macro align value { rb (value-1)-($+value-1) mod value }
  3867.  
  3868. When the "align 4" instruction is found after this macroinstruction is
  3869. defined, it will be replaced with contents of this macroinstruction, and the
  3870. "value" will there become 4, so the result will be "rb (4-1)-($+4-1) mod 4".
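          To check the arithmetic, assume (for this sketch only) that the
        macroinstruction is used when "$" equals 5. Then "($+4-1) mod 4" is
        8 mod 4, which is 0, so three bytes get reserved and the next address
        is 8:

            macro align value { rb (value-1)-($+value-1) mod value }

            org 0
            db 1,2,3,4,5
            align 4             ; becomes "rb (4-1)-($+4-1) mod 4", here "rb 3"
            assert $ = 8
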
  3871.   If a macroinstruction is defined that uses an instruction with the same name
  3872. inside its definition, the previous meaning of this name is used. Useful
  3873. redefinition of macroinstructions can be done in that way, for example:
  3874.  
  3875.     macro mov op1,op2
  3876.      {
  3877.       if op1 in <ds,es,fs,gs,ss> & op2 in <cs,ds,es,fs,gs,ss>
  3878.         push  op2
  3879.         pop   op1
  3880.       else
  3881.         mov   op1,op2
  3882.       end if
  3883.      }
  3884.  
  3885. This macroinstruction extends the syntax of "mov" instruction, allowing both
  3886. operands to be segment registers. For example "mov ds,es" will be assembled as
  3887. "push es" and "pop ds". In all other cases the standard "mov" instruction will
  3888. be used. The syntax of this "mov" can be extended further by defining next
  3889. macroinstruction of that name, which will use the previous macroinstruction:
  3890.  
  3891.     macro mov op1,op2,op3
  3892.      {
  3893.       if op3 eq
  3894.         mov   op1,op2
  3895.       else
  3896.         mov   op1,op2
  3897.         mov   op2,op3
  3898.       end if
  3899.      }
  3900.  
  3901. It allows "mov" instruction to have three operands, but it can still have two
  3902. operands only, because when macroinstruction is given fewer arguments than it
  3903. needs, the rest of arguments will have empty values. When three operands are
  3904. given, this macroinstruction will become two macroinstructions of the previous
  3905. definition, so "mov es,ds,dx" will be assembled as "push ds", "pop es" and
  3906. "mov ds,dx".
  3907.   By placing the "*" after the name of argument you can mark the argument as
  3908. required - preprocessor will not allow it to have an empty value. For example
  3909. the above macroinstruction could be declared as "macro mov op1*,op2*,op3" to
  3910. make sure that first two arguments will always have to be given some non empty
  3911. values.
  3912.   Alternatively, you can provide the default value for argument, by placing
  3913. the "=" followed by value after the name of argument. Then if the argument
  3914. has an empty value provided, the default value will be used instead.
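          Both features are combined in this sketch, where the "fill"
        macroinstruction is a hypothetical example:

            macro fill count*,value=0
             {
                times count db value
             }

            fill 4              ; "times 4 db 0"
            fill 2,0FFh         ; "times 2 db 0FFh"
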
  3915.   When it's needed to provide macroinstruction with argument that contains
  3916. some commas, such argument should be enclosed between "<" and ">" characters.
  3917. If it contains more than one "<" character, the same number of ">" should be
  3918. used to tell that the value of argument ends.
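          For example, with the hypothetical "bytes" macroinstruction below, the
        whole list is passed as a single argument:

            macro bytes list { db list }

            bytes <1,2,3>       ; assembled as "db 1,2,3"
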
  3919.   "purge" directive allows removing the last definition of specified
  3920. macroinstruction. It should be followed by one or more names of
  3921. macroinstructions, separated with commas. If such macroinstruction has not
  3922. been defined, you will not get any error. For example after having the syntax
  3923. of "mov" extended with the macroinstructions defined above, you can disable
  3924. the syntax with three operands again by using "purge mov" directive. The next
  3925. "purge mov" will also disable syntax for two operands being segment registers,
  3926. and all the next such directives will do nothing.
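          A shorter sketch of the same mechanism, redefining the "nop" instruction
        only for a part of the source (the two-byte variant is made up for this
        illustration):

            macro nop { db 90h,90h }
            nop                 ; uses the macroinstruction, two bytes
            purge nop
            nop                 ; the standard instruction again
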
  3927.   If after the "macro" directive you enclose some group of arguments' names in
  3928. square brackets, it will allow giving more values for this group of arguments
  3929. when using that macroinstruction. Any further argument given after the last
  3930. argument of such group will begin the new group and will become the first
  3931. argument of it. That's why after closing the square bracket no more argument
  3932. names can follow. The contents of macroinstruction will be processed for each
  3933. such group of arguments separately. The simplest example is to enclose one
  3934. argument name in square brackets:
  3935.  
  3936.     macro stoschar [char]
  3937.      {
  3938.         mov al,char
  3939.         stosb
  3940.      }
  3941.  
  3942. This macroinstruction accepts an unlimited number of arguments, and each one
  3943. will be processed into these two instructions separately. For example
  3944. "stoschar 1,2,3" will be assembled as the following instructions:
  3945.  
  3946.     mov al,1
  3947.     stosb
  3948.     mov al,2
  3949.     stosb
  3950.     mov al,3
  3951.     stosb
  3952.  
  3953.   There are some special directives available only inside the definitions of
  3954. macroinstructions. The "local" directive defines local names, which will be
  3955. replaced with unique values each time the macroinstruction is used. It should
  3956. be followed by names separated with commas. If a name given as a parameter to
  3957. the "local" directive begins with a dot or two dots, the unique labels
  3958. generated by each evaluation of the macroinstruction will have the same
  3959. properties. This directive is usually needed for the constants or labels that
  3960. the macroinstruction defines and uses internally. For example:
  3961.  
  3962.     macro movstr
  3963.      {
  3964.         local move
  3965.       move:
  3966.         lodsb
  3967.         stosb
  3968.         test al,al
  3969.         jnz move
  3970.      }
  3971.  
  3972. Each time this macroinstruction is used, "move" will become a different unique
  3973. name in its instructions, so you will not get the error you normally get when
  3974. a label is defined more than once.
  3975.   The "forward", "reverse" and "common" directives divide a macroinstruction
  3976. into blocks, each one processed after the processing of the previous one is
  3977. finished. They differ in behavior only if the macroinstruction allows multiple
  3978. groups of arguments. The block of instructions that follows the "forward"
  3979. directive is processed for each group of arguments, from first to last -
  3980. exactly like the default block (not preceded by any of these directives). The
  3981. block that follows the "reverse" directive is processed for each group of
  3982. arguments in reverse order - from last to first. The block that follows the
  3983. "common" directive is processed only once, commonly for all groups of
  3984. arguments. A local name defined in one of the blocks is available in all the
  3985. following blocks when processing the same group of arguments as when it was
  3986. defined, and when it is defined in a common block it is available in all the
  3987. following blocks regardless of which group of arguments is processed.
  3988.   Here is an example of macroinstruction that will create the table of
  3989. addresses to strings followed by these strings:
  3990.  
  3991.     macro strtbl name,[string]
  3992.      {
  3993.       common
  3994.         label name dword
  3995.       forward
  3996.         local label
  3997.         dd label
  3998.       forward
  3999.         label db string,0
  4000.      }
  4001.  
  4002. The first argument given to this macroinstruction becomes the label for the
  4003. table of addresses; the next arguments should be the strings. The first block
  4004. is processed only once and defines the label; the second block declares a
  4005. local name for each string and defines the table entry holding its address.
  4006. The third block defines the data of each string with the corresponding label.
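          To illustrate, "strtbl strings,'First','Second'" should be roughly
        equivalent to the following, where "string1" and "string2" are only
        stand-ins for the unique names that the preprocessor generates:

            label strings dword
                dd string1
                dd string2
              string1 db 'First',0
              string2 db 'Second',0

        All the table entries come first because the second block is processed for
        every string before the third block begins.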
  4007.   The directive starting a block in a macroinstruction can be followed by the
  4008. first instruction of this block on the same line, as in the following
  4009. example:
  4010.  
  4011.     macro stdcall proc,[arg]
  4012.      {
  4013.       reverse push arg
  4014.       common call proc
  4015.      }
  4016.  
  4017. This macroinstruction can be used for calling procedures that use the STDCALL
  4018. convention, which has all the arguments pushed on the stack in reverse order.
  4019. For example "stdcall foo,1,2,3" will be assembled as:
  4020.  
  4021.     push 3
  4022.     push 2
  4023.     push 1
  4024.     call foo
  4025.  
  4026.   If a name inside a macroinstruction has multiple values (it is either one
  4027. of the arguments enclosed in square brackets or a local name defined in the
  4028. block following the "forward" or "reverse" directive) and is used in a block
  4029. following the "common" directive, it will be replaced with all of its values,
  4030. separated with commas. For example the following macroinstruction will pass
  4031. all of the additional arguments to the previously defined "stdcall"
  4032. macroinstruction:
  4033.  
  4034.     macro invoke proc,[arg]
  4035.      { common stdcall [proc],arg }
  4036.  
  4037. It can be used to call a procedure that uses the STDCALL convention
  4038. indirectly, through a pointer stored in memory.
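          As a sketch, assume that "handler" is a hypothetical variable holding
        the address of a procedure. Then "invoke handler,1,2" is first turned into
        "stdcall [handler],1,2", which should in turn be assembled as:

            push 2
            push 1
            call [handler]

        The "arg" name used in the common block stands for the whole list "1,2",
        which is why all the values reach the "stdcall" macroinstruction.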
  4039.   Inside a macroinstruction the special operator "#" can also be used. This
  4040. operator causes two names to be concatenated into one name. It can be useful,
  4041. because it's done after the arguments and local names are replaced with their
  4042. values. The following macroinstruction will generate a conditional jump
  4043. according to the "cond" argument:
  4044.  
  4045.     macro jif op1,cond,op2,label
  4046.      {
  4047.         cmp op1,op2
  4048.         j#cond label
  4049.      }
  4050.  
  4051. For example "jif ax,ae,10h,exit" will be assembled as "cmp ax,10h" and
  4052. "jae exit" instructions.
  4053.   The "#" operator can also be used to concatenate two quoted strings into one.
  4054. A name can also be converted into a quoted string with the "`" operator,
  4055. which likewise can be used inside the macroinstruction. It converts the name