WebSVN – Kolibri OS – Path Comparison – / – /contrib/sdk/sources/Mesa/src/gallium/drivers/r600/sb/notes.markdown Rev 4357 and /contrib/sdk/sources/Mesa/src/gallium/drivers/r600/sb/notes.markdown Rev 4358

Regard whitespace Rev 4357 → Rev 4358

 /contrib/sdk/sources/Mesa/src/gallium/drivers/r600/sb/notes.markdown
 ,0 → 1,413
+r600-sb
+=======
+* * * * *
+Debugging
+---------
+### Environment variables
+-   **R600\_DEBUG**
+    There are new flags:
+    -   **sb** - Enable optimization of graphics shaders
+    -   **sbcl** - Enable optimization of compute shaders (experimental)
+    -   **sbdry** - Dry run, optimize but use source bytecode -
+        useful if you only want to check shader dumps
+        without the risk of lockups and other problems
+    -   **sbstat** - Print optimization statistics (only time so far)
+    -   **sbdump** - Print IR after some passes.
+### Regression debugging
+If there are any regressions as compared to the default backend
+(R600\_SB=0), it's possible to use the following environment variables
+to find the incorrectly optimized shader that causes the regression.
+-   **R600\_SB\_DSKIP\_MODE** - allows to skip optimization for some
+    shaders
+    -   0 - disabled (default)
+    -   1 - skip optimization for the shaders in the range
+        [R600\_SB\_DSKIP\_START; R600\_SB\_DSKIP\_END], that is,
+        optimize only the shaders that are not in this range
+    -   2 - optimize only the shaders in the range
+        [R600\_SB\_DSKIP\_START; R600\_SB\_DSKIP\_END]
+-   **R600\_SB\_DSKIP\_START** - start of the range (1-based)
+-   **R600\_SB\_DSKIP\_END** - end of the range (1-based)
+Example - optimize only the shaders 5, 6, and 7:
+    R600_SB_DSKIP_START=5 R600_SB_DSKIP_END=7 R600_SB_DSKIP_MODE=2
+All shaders compiled by the application are numbered starting from 1,
+the number of shaders used by the application may be obtained by running
+it with "R600_DEBUG=sb,sbstat" - it will print "sb: shader \#index\#"
+for each compiled shader.
+After figuring out the total number of shaders used by the application,
+the variables above allow to use bisection to find the shader that is
+the cause of regression. E.g. if the application uses 100 shaders, we
+can divide the range [1; 100] and run the application with the
+optimization enabled only for the first half of the shaders:
+    R600_SB_DSKIP_START=1 R600_SB_DSKIP_END=50 R600_SB_DSKIP_MODE=2 <app>
+If the regression is reproduced with these parameters, then the failing
+shader is in the range [1; 50], if it's not reproduced - then it's in
+the range [51; 100]. Then we can divide the new range again and repeat
+the testing, until we'll reduce the range to a single failing shader.
+*NOTE: This method relies on the assumption that the application
+produces the same sequence of the shaders on each run. It's not always
+true - some applications may produce different sequences of the shaders,
+in such cases the tools like apitrace may be used to record the trace
+with the application, then this method may be applied when replaying the
+trace - also this may be faster and/or more convenient than testing the
+application itself.*
+* * * * *
+Intermediate Representation
+---------------------------
+### Values
+All kinds of the operands (literal constants, references to kcache
+constants, references to GPRs, etc) are currently represented by the
+**value** class (possibly it makes sense to switch to hierarchy of
+classes derived from **value** instead, to save some memory).
+All values (except some pseudo values like the exec\_mask or predicate
+register) represent 32bit scalar values - there are no vector values,
+CF/FETCH instructions use groups of 4 values for src and dst operands.
+### Nodes
+Shader programs are represented using the tree data structure, some
+nodes contain a list of subnodes.
+#### Control flow nodes
+Control flow information is represented using four special node types
+(based on the ideas from [[1]](#references) )
+-   **region\_node** - single-entry, single-exit region.
+    All loops and if's in the program are enclosed in region nodes.
+    Region nodes have two containers for phi nodes -
+    region\_node::loop\_phi contains the phi expressions to be executed
+    at the region entry, region\_node::phi contains the phi expressions
+    to be executed at the region exit. It's the only type of the node
+    that contains associated phi expressions.
+-   **depart\_node** - "depart region \$id after { ... }"
+    Depart target region (jump to exit point) after executing contained
+    code.
+-   **repeat\_node** - "repeat region \$id after { ... }"
+    Repeat target region (jump to entry point) after executing contained
+    code.
+-   **if\_node** - "if (cond) { ... }"
+    Execute contained code if condition is true. The difference from
+    [[1]](#references) is that we don't have associated phi expressions
+    for the **if\_node**, we enclose **if\_node** in the
+    **region\_node** and store corresponding phi's in the
+    **region\_node**, this allows more uniform handling.
+The target region of depart and repeat nodes is always the region where
+they are located (possibly in the nested region), there are no arbitrary
+jumps/goto's - control flow in the program is always structured.
+Typical control flow constructs can be represented as in the following
+examples:
+GLSL:
+    if (cond) {
+        < 1 >
+    } else {
+        < 2 >
+    }
+IR:
+    region #0 {
+        depart region #0 after {
+            if (cond) {
+                depart region #0 after {
+                    < 1 >
+                }
+            }
+            < 2 >
+        }
+        <region #0 phi nodes >
+    }
+GLSL:
+    while (cond) {
+        < 1 >
+    }
+IR:
+    region #0 {
+        <region #0 loop_phi nodes>
+        repeat region #0 after {
+            region #1 {
+                depart region #1 after {
+                    if (!cond) {
+                        depart region #0
+                    }
+                }
+            }
+            < 1 >
+        }
+        <region #0 phi nodes>
+    }
+'Break' and 'continue' inside the loops are directly translated to the
+depart and repeat nodes for the corresponding loop region.
+This may look a bit too complicated, but in fact this allows more simple
+and uniform handling of the control flow.
+All loop\_phi and phi nodes for some region always have the same number
+of source operands. The number of source operands for
+region\_node::loop\_phi nodes is 1 + number of repeat nodes that
+reference this region as a target. The number of source operands for
+region\_node::phi nodes is equal to the number of depart nodes that
+reference this region as a target. All depart/repeat nodes for the
+region have unique indices equal to the index of source operand for
+phi/loop\_phi nodes.
+First source operand for region\_node::loop\_phi nodes (src[0]) is an
+incoming value that enters the region from the outside. Each remaining
+source operand comes from the corresponding repeat node.
+More complex example:
+GLSL:
+    a = 1;
+    while (a < 5) {
+        a = a * 2;
+        if (b == 3) {
+            continue;
+        } else {
+            a = 6;
+        }
+        if (c == 4)
+            break;
+        a = a + 1;
+    }
+IR with SSA form:
+    a.1 = 1;
+    region #0 {
+        // loop phi values: src[0] - incoming, src[1] - from repeat_1, src[2] - from repeat_2
+        region#0 loop_phi: a.2 = phi a.1, a.6, a.3
+        repeat_1 region #0 after {
+            a.3 = a.2 * 2;
+            cond1 = (b == 3);
+            region #1 {
+                depart_0 region #1 after {
+                    if (cond1) {
+                        repeat_2 region #0;
+                    }
+                }
+                a.4 = 6;
+                region #1 phi: a.5 = phi a.4; // src[0] - from depart_0
+            }
+            cond2 = (c == 4);
+            region #2 {
+                depart_0 region #2 after {
+                    if (cond2) {
+                        depart_0 region #0;
+                    }
+                }
+            }
+            a.6 = a.5 + 1;
+        }
+        region #0 phi: a.7 = phi a.5 // src[0] from depart_0
+    }
+Phi nodes with single source operand are just copies, they are not
+really necessary, but this allows to handle all **depart\_node**s in the
+uniform way.
+#### Instruction nodes
+Instruction nodes represent different kinds of instructions -
+**alu\_node**, **cf\_node**, **fetch\_node**, etc. Each of them contains
+the "bc" structure where all fields of the bytecode are stored (the type
+is **bc\_alu** for **alu\_node**, etc). The operands are represented
+using the vectors of pointers to **value** class (node::src, node::dst)
+#### SSA-specific nodes
+Phi nodes currently don't have special node class, they are stored as
+**node**. Destination vector contains a single destination value, source
+vector contains 1 or more source values.
+Psi nodes [[5], [6]](#references) also don't have a special node class
+and stored as **node**. Source vector contains 3 values for each source
+operand - the **value** of predicate, **value** of corresponding
+PRED\_SEL field, and the source **value** itself.
+### Indirect addressing
+Special kind of values (VLK\_RELREG) is used to represent indirect
+operands. These values don't have SSA versions. The representation is
+mostly based on the [[2]](#references). Indirect operand contains the
+"offset/address" value (value::rel), (e.g. some SSA version of the AR
+register value, though after some passes it may be any value - constant,
+register, etc), also it contains the maydef and mayuse vectors of
+pointers to **value**s (similar to dst/src vectors in the **node**) to
+represent the effects of aliasing in the SSA form.
+E.g. if we have the array R5.x ... R8.x and the following instruction :
+    MOV R0.x, R[5 + AR].x
+then source indirect operand is represented with the VLK\_RELREG value,
+value::rel is AR, value::maydef is empty (in fact it always contain the
+same number of elements as mayuse to simplify the handling, but they are
+NULLs), value::mayuse contains [R5.x, R6.x, R7.x, R8.x] (or the
+corresponding SSA versions after ssa\_rename).
+Additional "virtual variables" as in [HSSA [2]](#references) are not
+used, also there is no special handling for "zero versions". Typical
+programs in our case are small, indirect addressing is rare, array sizes
+are limited by max gpr number, so we don't really need to use special
+tricks to avoid the explosion of value versions. Also this allows more
+precise liveness computation for array elements without modifications to
+the algorithms.
+With the following instruction:
+    MOV R[5+AR].x, R0.x
+we'll have both maydef and mayuse vectors for dst operand filled with
+array values initially: [R5.x, R6.x, R7.x, R8.x]. After the ssa\_rename
+pass mayuse will contain previous versions, maydef will contain new
+potentially-defined versions.
+* * * * *
+Passes
+------
+-   **bc\_parser** - creates the IR from the source bytecode,
+    initializes src and dst value vectors for instruction nodes. Most
+    ALU nodes have one dst operand and the number of source operands is
+    equal to the number of source operands for the ISA instruction.
+    Nodes for PREDSETxx instructions have 3 dst operands - dst[0] is dst
+    gpr as in the original instruction, other two are pseudo-operands
+    that represent possibly updated predicate and exec\_mask. Predicate
+    values are used in the predicated alu instructions (node::pred),
+    exec\_mask values are used in the if\_nodes (if\_node::cond). Each
+    vector operand in the CF/TEX/VTX instructions is represented with 4
+    values - components of the vector.
+-   **ssa\_prepare** - creates phi expressions.
+-   **ssa\_rename** - renames the values (assigns versions).
+-   **liveness** - liveness computation, sets 'dead' flag for unused
+    nodes and values, optionally computes interference information for
+    the values.
+-   **dce\_cleanup** - eliminates 'dead' nodes, also removes some
+    unnecessary nodes created by bc\_parser, e.g. the nodes for the JUMP
+    instructions in the source, containers for ALU groups (they were
+    only needed for the ssa\_rename pass)
+-   **if\_conversion** - converts control flow with if\_nodes to the
+    data flow in cases where it can improve performance (small alu-only
+    branches). Both branches are executed speculatively and the phi
+    expressions are replaced with conditional moves (CNDxx) to select
+    the final value using the same condition predicate as was used by
+    the original if\_node. E.g. **if\_node** used dst[2] from PREDSETxx
+    instruction, CNDxx now uses dst[0] from the same PREDSETxx
+    instruction.
+-   **peephole** - peephole optimizations
+-   **gvn** - Global Value Numbering [[2]](#references),
+    [[3]](#references)
+-   **gcm** - Global Code Motion [[3]](#references). Also performs
+    grouping of the instructions of the same kind (CF/FETCH/ALU).
+-   register allocation passes, some ideas are used from
+    [[4]](#references), but implementation is simplified to make it more
+    efficient in terms of the compilation speed (e.g. no recursive
+    recoloring) while achieving good enough results.
+    -   **ra\_split** - prepares the program to register allocation.
+        Splits live ranges for constrained values by inserting the
+        copies to/from temporary values, so that the live range of the
+        constrained values becomes minimal.
+    -   **ra\_coalesce** - performs global allocation on registers used
+        in CF/FETCH instructions. It's performed first to make sure they
+        end up in the same GPR. Also tries to allocate all values
+        involved in copies (inserted by the ra\_split pass) to the same
+        register, so that the copies may be eliminated.
+    -   **ra\_init** - allocates gpr arrays (if indirect addressing is
+        used), and remaining values.
+-   **post\_scheduler** - ALU scheduler, handles VLIW packing and
+    performs the final register allocation for local values inside ALU
+    clauses. Eliminates all coalesced copies (if src and dst of the copy
+    are allocated to the same register).
+-   **ra\_checker** - optional debugging pass that tries to catch basic
+    errors of the scheduler or regalloc,
+-   **bc\_finalize** - propagates the regalloc information from values
+    in node::src and node::dst vectors to the bytecode fields, converts
+    control flow structure (region/depart/repeat) to the target
+    instructions (JUMP/ELSE/POP,
+    LOOP\_START/LOOP\_END/LOOP\_CONTINUE/LOOP\_BREAK).
+-   **bc\_builder** - builds final bytecode,
+* * * * *
+References
+----------
+[1] ["Tree-Based Code Optimization. A Thesis Proposal", Carl
+McConnell](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.38.4210&rep=rep1&type=pdf)
+[2] ["Effective Representation of Aliases and Indirect Memory Operations
+in SSA Form", Fred Chow, Sun Chan, Shin-Ming Liu, Raymond Lo, Mark
+Streich](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.33.6974&rep=rep1&type=pdf)
+[3] ["Global Code Motion. Global Value Numbering.", Cliff
+Click](http://www.cs.washington.edu/education/courses/cse501/06wi/reading/click-pldi95.pdf)
+[4] ["Register Allocation for Programs in SSA Form", Sebastian
+Hack](http://digbib.ubka.uni-karlsruhe.de/volltexte/documents/6532)
+[5] ["An extension to the SSA representation for predicated code",
+Francois de
+Ferriere](http://www.cdl.uni-saarland.de/ssasem/talks/Francois.de.Ferriere.pdf)
+[6] ["Improvements to the Psi-SSA Representation", F. de
+Ferriere](http://www.scopesconf.org/scopes-07/presentations/3_Presentation.pdf)

Subversion Repositories Kolibri OS

Compare Revisions

Regard whitespace Rev 4357 → Rev 4358