Subversion Repositories Kolibri OS

Rev

Go to most recent revision | Blame | Last modification | View Log | RSS feed

  1. r600-sb
  2. =======
  3.  
  4. * * * * *
  5.  
  6. Debugging
  7. ---------
  8.  
  9. ### Environment variables
  10.  
  11. -   **R600\_DEBUG**
  12.  
  13.     There are new flags:
  14.  
  15.     -   **sb** - Enable optimization of graphics shaders
  16.     -   **sbcl** - Enable optimization of compute shaders (experimental)
  17.     -   **sbdry** - Dry run, optimize but use source bytecode -
  18.         useful if you only want to check shader dumps
  19.         without the risk of lockups and other problems
  20.     -   **sbstat** - Print optimization statistics (only time so far)
  21.     -   **sbdump** - Print IR after some passes.
  22.  
  23. ### Regression debugging
  24.  
  25. If there are any regressions as compared to the default backend
  26. (R600\_SB=0), it's possible to use the following environment variables
  27. to find the incorrectly optimized shader that causes the regression.
  28.  
  29. -   **R600\_SB\_DSKIP\_MODE** - allows to skip optimization for some
  30.     shaders
  31.     -   0 - disabled (default)
  32.     -   1 - skip optimization for the shaders in the range
  33.         [R600\_SB\_DSKIP\_START; R600\_SB\_DSKIP\_END], that is,
  34.         optimize only the shaders that are not in this range
  35.     -   2 - optimize only the shaders in the range
  36.         [R600\_SB\_DSKIP\_START; R600\_SB\_DSKIP\_END]
  37.  
  38. -   **R600\_SB\_DSKIP\_START** - start of the range (1-based)
  39.  
  40. -   **R600\_SB\_DSKIP\_END** - end of the range (1-based)
  41.  
  42. Example - optimize only the shaders 5, 6, and 7:
  43.  
  44.     R600_SB_DSKIP_START=5 R600_SB_DSKIP_END=7 R600_SB_DSKIP_MODE=2
  45.  
  46. All shaders compiled by the application are numbered starting from 1,
  47. the number of shaders used by the application may be obtained by running
  48. it with "R600_DEBUG=sb,sbstat" - it will print "sb: shader \#index\#"
  49. for each compiled shader.
  50.  
  51. After figuring out the total number of shaders used by the application,
  52. the variables above allow to use bisection to find the shader that is
  53. the cause of regression. E.g. if the application uses 100 shaders, we
  54. can divide the range [1; 100] and run the application with the
  55. optimization enabled only for the first half of the shaders:
  56.  
  57.     R600_SB_DSKIP_START=1 R600_SB_DSKIP_END=50 R600_SB_DSKIP_MODE=2 <app>
  58.  
  59. If the regression is reproduced with these parameters, then the failing
  60. shader is in the range [1; 50], if it's not reproduced - then it's in
  61. the range [51; 100]. Then we can divide the new range again and repeat
  62. the testing, until we'll reduce the range to a single failing shader.
  63.  
  64. *NOTE: This method relies on the assumption that the application
  65. produces the same sequence of the shaders on each run. It's not always
  66. true - some applications may produce different sequences of the shaders,
  67. in such cases the tools like apitrace may be used to record the trace
  68. with the application, then this method may be applied when replaying the
  69. trace - also this may be faster and/or more convenient than testing the
  70. application itself.*
  71.  
  72. * * * * *
  73.  
  74. Intermediate Representation
  75. ---------------------------
  76.  
  77. ### Values
  78.  
  79. All kinds of the operands (literal constants, references to kcache
  80. constants, references to GPRs, etc) are currently represented by the
  81. **value** class (possibly it makes sense to switch to hierarchy of
  82. classes derived from **value** instead, to save some memory).
  83.  
  84. All values (except some pseudo values like the exec\_mask or predicate
  85. register) represent 32bit scalar values - there are no vector values,
  86. CF/FETCH instructions use groups of 4 values for src and dst operands.
  87.  
  88. ### Nodes
  89.  
  90. Shader programs are represented using the tree data structure, some
  91. nodes contain a list of subnodes.
  92.  
  93. #### Control flow nodes
  94.  
  95. Control flow information is represented using four special node types
  96. (based on the ideas from [[1]](#references) )
  97.  
  98. -   **region\_node** - single-entry, single-exit region.
  99.  
  100.     All loops and if's in the program are enclosed in region nodes.
  101.     Region nodes have two containers for phi nodes -
  102.     region\_node::loop\_phi contains the phi expressions to be executed
  103.     at the region entry, region\_node::phi contains the phi expressions
  104.     to be executed at the region exit. It's the only type of the node
  105.     that contains associated phi expressions.
  106.  
  107. -   **depart\_node** - "depart region \$id after { ... }"
  108.  
  109.     Depart target region (jump to exit point) after executing contained
  110.     code.
  111.  
  112. -   **repeat\_node** - "repeat region \$id after { ... }"
  113.  
  114.     Repeat target region (jump to entry point) after executing contained
  115.     code.
  116.  
  117. -   **if\_node** - "if (cond) { ... }"
  118.  
  119.     Execute contained code if condition is true. The difference from
  120.     [[1]](#references) is that we don't have associated phi expressions
  121.     for the **if\_node**, we enclose **if\_node** in the
  122.     **region\_node** and store corresponding phi's in the
  123.     **region\_node**, this allows more uniform handling.
  124.  
  125. The target region of depart and repeat nodes is always the region where
  126. they are located (possibly in the nested region), there are no arbitrary
  127. jumps/goto's - control flow in the program is always structured.
  128.  
  129. Typical control flow constructs can be represented as in the following
  130. examples:
  131.  
  132. GLSL:
  133.  
  134.     if (cond) {
  135.         < 1 >
  136.     } else {
  137.         < 2 >
  138.     }
  139.  
  140. IR:
  141.  
  142.     region #0 {
  143.         depart region #0 after {
  144.             if (cond) {
  145.                 depart region #0 after {
  146.                     < 1 >
  147.                 }
  148.             }
  149.             < 2 >
  150.         }
  151.         <region #0 phi nodes >
  152.     }
  153.  
  154. GLSL:
  155.  
  156.     while (cond) {
  157.         < 1 >
  158.     }
  159.  
  160. IR:
  161.  
  162.     region #0 {
  163.         <region #0 loop_phi nodes>
  164.         repeat region #0 after {
  165.             region #1 {
  166.                 depart region #1 after {
  167.                     if (!cond) {
  168.                         depart region #0
  169.                     }
  170.                 }
  171.             }
  172.             < 1 >
  173.         }
  174.         <region #0 phi nodes>
  175.     }
  176.  
  177. 'Break' and 'continue' inside the loops are directly translated to the
  178. depart and repeat nodes for the corresponding loop region.
  179.  
  180. This may look a bit too complicated, but in fact this allows more simple
  181. and uniform handling of the control flow.
  182.  
  183. All loop\_phi and phi nodes for some region always have the same number
  184. of source operands. The number of source operands for
  185. region\_node::loop\_phi nodes is 1 + number of repeat nodes that
  186. reference this region as a target. The number of source operands for
  187. region\_node::phi nodes is equal to the number of depart nodes that
  188. reference this region as a target. All depart/repeat nodes for the
  189. region have unique indices equal to the index of source operand for
  190. phi/loop\_phi nodes.
  191.  
  192. First source operand for region\_node::loop\_phi nodes (src[0]) is an
  193. incoming value that enters the region from the outside. Each remaining
  194. source operand comes from the corresponding repeat node.
  195.  
  196. More complex example:
  197.  
  198. GLSL:
  199.  
  200.     a = 1;
  201.     while (a < 5) {
  202.         a = a * 2;
  203.         if (b == 3) {
  204.             continue;
  205.         } else {
  206.             a = 6;
  207.         }
  208.         if (c == 4)
  209.             break;
  210.         a = a + 1;
  211.     }
  212.  
  213. IR with SSA form:
  214.  
  215.     a.1 = 1;
  216.     region #0 {
  217.         // loop phi values: src[0] - incoming, src[1] - from repeat_1, src[2] - from repeat_2
  218.         region#0 loop_phi: a.2 = phi a.1, a.6, a.3
  219.  
  220.         repeat_1 region #0 after {
  221.             a.3 = a.2 * 2;
  222.             cond1 = (b == 3);
  223.             region #1 {
  224.                 depart_0 region #1 after {
  225.                     if (cond1) {
  226.                         repeat_2 region #0;
  227.                     }
  228.                 }
  229.                 a.4 = 6;
  230.  
  231.                 region #1 phi: a.5 = phi a.4; // src[0] - from depart_0
  232.             }
  233.             cond2 = (c == 4);
  234.             region #2 {
  235.                 depart_0 region #2 after {
  236.                     if (cond2) {
  237.                         depart_0 region #0;
  238.                     }
  239.                 }
  240.             }
  241.             a.6 = a.5 + 1;
  242.         }
  243.  
  244.         region #0 phi: a.7 = phi a.5 // src[0] from depart_0
  245.     }
  246.  
  247. Phi nodes with single source operand are just copies, they are not
  248. really necessary, but this allows to handle all **depart\_node**s in the
  249. uniform way.
  250.  
  251. #### Instruction nodes
  252.  
  253. Instruction nodes represent different kinds of instructions -
  254. **alu\_node**, **cf\_node**, **fetch\_node**, etc. Each of them contains
  255. the "bc" structure where all fields of the bytecode are stored (the type
  256. is **bc\_alu** for **alu\_node**, etc). The operands are represented
  257. using the vectors of pointers to **value** class (node::src, node::dst)
  258.  
  259. #### SSA-specific nodes
  260.  
  261. Phi nodes currently don't have special node class, they are stored as
  262. **node**. Destination vector contains a single destination value, source
  263. vector contains 1 or more source values.
  264.  
  265. Psi nodes [[5], [6]](#references) also don't have a special node class
  266. and stored as **node**. Source vector contains 3 values for each source
  267. operand - the **value** of predicate, **value** of corresponding
  268. PRED\_SEL field, and the source **value** itself.
  269.  
  270. ### Indirect addressing
  271.  
  272. Special kind of values (VLK\_RELREG) is used to represent indirect
  273. operands. These values don't have SSA versions. The representation is
  274. mostly based on the [[2]](#references). Indirect operand contains the
  275. "offset/address" value (value::rel), (e.g. some SSA version of the AR
  276. register value, though after some passes it may be any value - constant,
  277. register, etc), also it contains the maydef and mayuse vectors of
  278. pointers to **value**s (similar to dst/src vectors in the **node**) to
  279. represent the effects of aliasing in the SSA form.
  280.  
  281. E.g. if we have the array R5.x ... R8.x and the following instruction :
  282.  
  283.     MOV R0.x, R[5 + AR].x
  284.  
  285. then source indirect operand is represented with the VLK\_RELREG value,
  286. value::rel is AR, value::maydef is empty (in fact it always contain the
  287. same number of elements as mayuse to simplify the handling, but they are
  288. NULLs), value::mayuse contains [R5.x, R6.x, R7.x, R8.x] (or the
  289. corresponding SSA versions after ssa\_rename).
  290.  
  291. Additional "virtual variables" as in [HSSA [2]](#references) are not
  292. used, also there is no special handling for "zero versions". Typical
  293. programs in our case are small, indirect addressing is rare, array sizes
  294. are limited by max gpr number, so we don't really need to use special
  295. tricks to avoid the explosion of value versions. Also this allows more
  296. precise liveness computation for array elements without modifications to
  297. the algorithms.
  298.  
  299. With the following instruction:
  300.  
  301.     MOV R[5+AR].x, R0.x
  302.  
  303. we'll have both maydef and mayuse vectors for dst operand filled with
  304. array values initially: [R5.x, R6.x, R7.x, R8.x]. After the ssa\_rename
  305. pass mayuse will contain previous versions, maydef will contain new
  306. potentially-defined versions.
  307.  
  308. * * * * *
  309.  
  310. Passes
  311. ------
  312.  
  313. -   **bc\_parser** - creates the IR from the source bytecode,
  314.     initializes src and dst value vectors for instruction nodes. Most
  315.     ALU nodes have one dst operand and the number of source operands is
  316.     equal to the number of source operands for the ISA instruction.
  317.     Nodes for PREDSETxx instructions have 3 dst operands - dst[0] is dst
  318.     gpr as in the original instruction, other two are pseudo-operands
  319.     that represent possibly updated predicate and exec\_mask. Predicate
  320.     values are used in the predicated alu instructions (node::pred),
  321.     exec\_mask values are used in the if\_nodes (if\_node::cond). Each
  322.     vector operand in the CF/TEX/VTX instructions is represented with 4
  323.     values - components of the vector.
  324.  
  325. -   **ssa\_prepare** - creates phi expressions.
  326.  
  327. -   **ssa\_rename** - renames the values (assigns versions).
  328.  
  329. -   **liveness** - liveness computation, sets 'dead' flag for unused
  330.     nodes and values, optionally computes interference information for
  331.     the values.
  332.  
  333. -   **dce\_cleanup** - eliminates 'dead' nodes, also removes some
  334.     unnecessary nodes created by bc\_parser, e.g. the nodes for the JUMP
  335.     instructions in the source, containers for ALU groups (they were
  336.     only needed for the ssa\_rename pass)
  337.  
  338. -   **if\_conversion** - converts control flow with if\_nodes to the
  339.     data flow in cases where it can improve performance (small alu-only
  340.     branches). Both branches are executed speculatively and the phi
  341.     expressions are replaced with conditional moves (CNDxx) to select
  342.     the final value using the same condition predicate as was used by
  343.     the original if\_node. E.g. **if\_node** used dst[2] from PREDSETxx
  344.     instruction, CNDxx now uses dst[0] from the same PREDSETxx
  345.     instruction.
  346.  
  347. -   **peephole** - peephole optimizations
  348.  
  349. -   **gvn** - Global Value Numbering [[2]](#references),
  350.     [[3]](#references)
  351.  
  352. -   **gcm** - Global Code Motion [[3]](#references). Also performs
  353.     grouping of the instructions of the same kind (CF/FETCH/ALU).
  354.  
  355. -   register allocation passes, some ideas are used from
  356.     [[4]](#references), but implementation is simplified to make it more
  357.     efficient in terms of the compilation speed (e.g. no recursive
  358.     recoloring) while achieving good enough results.
  359.  
  360.     -   **ra\_split** - prepares the program to register allocation.
  361.         Splits live ranges for constrained values by inserting the
  362.         copies to/from temporary values, so that the live range of the
  363.         constrained values becomes minimal.
  364.  
  365.     -   **ra\_coalesce** - performs global allocation on registers used
  366.         in CF/FETCH instructions. It's performed first to make sure they
  367.         end up in the same GPR. Also tries to allocate all values
  368.         involved in copies (inserted by the ra\_split pass) to the same
  369.         register, so that the copies may be eliminated.
  370.  
  371.     -   **ra\_init** - allocates gpr arrays (if indirect addressing is
  372.         used), and remaining values.
  373.  
  374. -   **post\_scheduler** - ALU scheduler, handles VLIW packing and
  375.     performs the final register allocation for local values inside ALU
  376.     clauses. Eliminates all coalesced copies (if src and dst of the copy
  377.     are allocated to the same register).
  378.  
  379. -   **ra\_checker** - optional debugging pass that tries to catch basic
  380.     errors of the scheduler or regalloc,
  381.  
  382. -   **bc\_finalize** - propagates the regalloc information from values
  383.     in node::src and node::dst vectors to the bytecode fields, converts
  384.     control flow structure (region/depart/repeat) to the target
  385.     instructions (JUMP/ELSE/POP,
  386.     LOOP\_START/LOOP\_END/LOOP\_CONTINUE/LOOP\_BREAK).
  387.  
  388. -   **bc\_builder** - builds final bytecode,
  389.  
  390. * * * * *
  391.  
  392. References
  393. ----------
  394.  
  395. [1] ["Tree-Based Code Optimization. A Thesis Proposal", Carl
  396. McConnell](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.38.4210&rep=rep1&type=pdf)
  397.  
  398. [2] ["Effective Representation of Aliases and Indirect Memory Operations
  399. in SSA Form", Fred Chow, Sun Chan, Shin-Ming Liu, Raymond Lo, Mark
  400. Streich](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.33.6974&rep=rep1&type=pdf)
  401.  
  402. [3] ["Global Code Motion. Global Value Numbering.", Cliff
  403. Click](http://www.cs.washington.edu/education/courses/cse501/06wi/reading/click-pldi95.pdf)
  404.  
  405. [4] ["Register Allocation for Programs in SSA Form", Sebastian
  406. Hack](http://digbib.ubka.uni-karlsruhe.de/volltexte/documents/6532)
  407.  
  408. [5] ["An extension to the SSA representation for predicated code",
  409. Francois de
  410. Ferriere](http://www.cdl.uni-saarland.de/ssasem/talks/Francois.de.Ferriere.pdf)
  411.  
  412. [6] ["Improvements to the Psi-SSA Representation", F. de
  413. Ferriere](http://www.scopesconf.org/scopes-07/presentations/3_Presentation.pdf)
  414.