GL Dispatch in Mesa

Several factors combine to make efficient dispatch of OpenGL functions
fairly complicated. This document attempts to explain some of the issues
and introduce the reader to Mesa's implementation. Readers already familiar
with the issues around GL dispatch can safely skip ahead to the
overview of Mesa's implementation.

1. Complexity of GL Dispatch

Every GL application has at least one object called a GL context.
This object, which is an implicit parameter to every GL function, stores all
of the GL-related state for the application. Every texture, every buffer
object, every enable, and much, much more is stored in the context. Since
an application can have more than one context, the context to be used is
selected by a window-system dependent function such as
glXMakeContextCurrent.

In environments that implement OpenGL with the X Window System using GLX,
every GL function, including the pointers returned by glXGetProcAddress, is
context independent. This means that no matter what context is
currently active, the same glVertex3fv function is used.

This creates the first bit of dispatch complexity. An application can
have two GL contexts. One context is a direct rendering context where
function calls are routed directly to a driver loaded within the
application's address space. The other context is an indirect rendering
context where function calls are converted to GLX protocol and sent to a
server. The same glVertex3fv has to do the right thing depending
on which context is current.

Highly optimized drivers or GLX protocol implementations may want to
change the behavior of GL functions depending on current state. For
example, glFogCoordf may operate differently depending on whether
or not fog is enabled.

In multi-threaded environments, it is possible for each thread to have a
different GL context current. This means that poor old glVertex3fv
has to know which GL context is current in the thread where it is being
called.

2. Overview of Mesa's Implementation

Mesa uses two per-thread pointers. The first pointer stores the address
of the context current in the thread, and the second pointer stores the
address of the dispatch table associated with that context. The
dispatch table stores pointers to functions that actually implement
specific GL functions. Each time a new context is made current in a thread,
these pointers are updated.

The implementation of functions such as glVertex3fv becomes
conceptually simple:

  1. Fetch the current dispatch table pointer.
  2. Fetch the pointer to the real glVertex3fv function from the
     table.
  3. Call the real function.

This can be implemented in just a few lines of C code. The file
src/mesa/glapi/glapitemp.h contains code very similar to this.

    void glVertex3f(GLfloat x, GLfloat y, GLfloat z)
    {
        const struct _glapi_table * const dispatch = GET_DISPATCH();

        (*dispatch->Vertex3f)(x, y, z);
    }

Sample dispatch function

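To make the three steps concrete, the following is a minimal, self-contained sketch rather than Mesa's actual code. Every name in it (demo_table, demo_make_current, demo_Vertex3f, and the two counters) is hypothetical, and a plain global pointer stands in for GET_DISPATCH. It shows one public entry point reaching either a "direct" or an "indirect" backend depending on which table is current.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily reduced dispatch table: one slot per GL entry point. */
struct demo_table {
    void (*Vertex3f)(float x, float y, float z);
};

/* Counters let a caller observe which backend handled a call. */
int demo_direct_calls;
int demo_indirect_calls;

static void direct_Vertex3f(float x, float y, float z)
{
    (void) x; (void) y; (void) z;
    demo_direct_calls++;    /* a real driver would emit hardware commands */
}

static void indirect_Vertex3f(float x, float y, float z)
{
    (void) x; (void) y; (void) z;
    demo_indirect_calls++;  /* a real GLX client would encode protocol */
}

static const struct demo_table direct_table = { direct_Vertex3f };
static const struct demo_table indirect_table = { indirect_Vertex3f };

/* Stand-in for the current dispatch table pointer (single threaded here). */
static const struct demo_table *demo_current = &direct_table;

/* Stand-in for the dispatch-pointer update done when a context is made
 * current. */
void demo_make_current(int use_indirect)
{
    demo_current = use_indirect ? &indirect_table : &direct_table;
}

/* The public entry point: fetch the table, fetch the slot, call it. */
void demo_Vertex3f(float x, float y, float z)
{
    const struct demo_table *const dispatch = demo_current;

    assert(dispatch != NULL);
    (*dispatch->Vertex3f)(x, y, z);
}
```

Callers only ever see demo_Vertex3f; switching contexts changes the table behind it, not the function address handed to the application.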
The problem with this simple implementation is the large amount of
overhead that it adds to every GL function call.

In a multithreaded environment, a naive implementation of
GET_DISPATCH involves a call to pthread_getspecific or a
similar function. Mesa provides a wrapper function called
_glapi_get_dispatch that is used by default.

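As a rough illustration of the naive approach, a wrapper in the spirit of _glapi_get_dispatch could look like the sketch below. This is not Mesa's implementation: demo_table, demo_set_dispatch, demo_get_dispatch, and the key handling are assumptions made for the example. Only pthread_once, pthread_key_create, pthread_setspecific, and pthread_getspecific are real POSIX calls.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical table type; its contents do not matter for this sketch. */
struct demo_table {
    int placeholder;
};

static pthread_key_t demo_dispatch_key;
static pthread_once_t demo_key_once = PTHREAD_ONCE_INIT;

static void demo_create_key(void)
{
    /* No destructor: the tables are assumed to be owned elsewhere. */
    (void) pthread_key_create(&demo_dispatch_key, NULL);
}

/* Record the dispatch table for the calling thread. Returns 0 on success. */
int demo_set_dispatch(const struct demo_table *table)
{
    pthread_once(&demo_key_once, demo_create_key);
    return pthread_setspecific(demo_dispatch_key, table);
}

/* Naive lookup: library calls on every single GL entry point. */
const struct demo_table *demo_get_dispatch(void)
{
    pthread_once(&demo_key_once, demo_create_key);
    return pthread_getspecific(demo_dispatch_key);
}
```

Every dispatched call pays for at least one function call into the threading library before the real GL function is even located, which is the overhead the optimizations in the next section try to remove.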
3. Optimizations

A number of optimizations have been made over the years to diminish the
performance hit imposed by GL dispatch. This section describes these
optimizations. The benefits of each optimization and the situations where
each can or cannot be used are listed.

3.1. Dual dispatch table pointers

The vast majority of OpenGL applications use the API in a single threaded
manner. That is, the application has only one thread that makes calls into
the GL. In these cases, not only do the calls to
pthread_getspecific hurt performance, but they are completely
unnecessary! It is possible to detect this common case and avoid these
calls.

Each time a new dispatch table is set, Mesa examines and records the ID
of the executing thread. If the same thread ID is always seen, Mesa knows
that the application is, from OpenGL's point of view, single threaded.

As long as an application is single threaded, Mesa stores a pointer to
the dispatch table in a global variable called _glapi_Dispatch.
The pointer is also stored in a per-thread location via
pthread_setspecific. When Mesa detects that an application has
become multithreaded, NULL is stored in _glapi_Dispatch.

Using this simple mechanism the dispatch functions can detect the
multithreaded case by comparing _glapi_Dispatch to NULL.
The resulting implementation of GET_DISPATCH is slightly more
complex, but it avoids the expensive pthread_getspecific call in
the common case.

    #define GET_DISPATCH() \
        ((_glapi_Dispatch != NULL) \
         ? _glapi_Dispatch : pthread_getspecific(_glapi_Dispatch_key))

Improved GET_DISPATCH Implementation

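The whole scheme, including the thread-ID bookkeeping, can be sketched as follows. This is an illustration under stated assumptions, not Mesa's code: all demo_ names are hypothetical, a mutex is used to keep the bookkeeping simple, and the unsynchronized read of the global pointer glosses over memory-ordering questions that a production implementation would have to consider.

```c
#include <pthread.h>
#include <stddef.h>

struct demo_table {
    int placeholder;   /* hypothetical table contents */
};

static pthread_key_t demo_key;
static pthread_once_t demo_once = PTHREAD_ONCE_INIT;
static pthread_mutex_t demo_lock = PTHREAD_MUTEX_INITIALIZER;

/* Non-NULL only while a single thread has ever set a dispatch table. */
const struct demo_table *demo_Dispatch = NULL;

static int demo_have_first_thread = 0;
static int demo_multithreaded = 0;
static pthread_t demo_first_thread;

static void demo_init_key(void)
{
    (void) pthread_key_create(&demo_key, NULL);
}

void demo_set_dispatch(const struct demo_table *table)
{
    pthread_once(&demo_once, demo_init_key);
    pthread_mutex_lock(&demo_lock);
    if (!demo_have_first_thread) {
        demo_first_thread = pthread_self();
        demo_have_first_thread = 1;
    } else if (!pthread_equal(demo_first_thread, pthread_self())) {
        demo_multithreaded = 1;   /* a second thread has shown up */
    }
    /* Always keep the per-thread copy; the global is only a fast path. */
    pthread_setspecific(demo_key, table);
    demo_Dispatch = demo_multithreaded ? NULL : table;
    pthread_mutex_unlock(&demo_lock);
}

const struct demo_table *demo_get_dispatch(void)
{
    const struct demo_table *d = demo_Dispatch;

    if (d != NULL)
        return d;                 /* single-threaded fast path */
    pthread_once(&demo_once, demo_init_key);
    return pthread_getspecific(demo_key);
}

/* Thread body for exercising the slow path: set a table, read it back. */
void *demo_thread_main(void *arg)
{
    demo_set_dispatch(arg);
    return (void *) demo_get_dispatch();
}
```

Once a second thread has set a table, the global stays NULL and every thread falls back to its own per-thread value, so the first thread still sees the table it installed earlier.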
3.2. ELF TLS

Starting with the 2.4.20 Linux kernel, each thread is allocated an area
of per-thread, global storage. Variables can be put in this area using some
extensions to GCC. By storing the dispatch table pointer in this area, the
expensive call to pthread_getspecific and the test of
_glapi_Dispatch can be avoided.

The dispatch table pointer is stored in a new variable called
_glapi_tls_Dispatch. A new variable name is used so that a single
libGL can implement both interfaces. This allows the libGL to operate with
direct rendering drivers that use either interface. Once the pointer is
properly declared, GET_DISPATCH becomes a simple variable
reference.

    extern __thread struct _glapi_table *_glapi_tls_Dispatch
        __attribute__((tls_model("initial-exec")));

    #define GET_DISPATCH() _glapi_tls_Dispatch

TLS GET_DISPATCH Implementation

Use of this path is controlled by the preprocessor define
GLX_USE_TLS. Any platform capable of using TLS should use this as
the default dispatch method.

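A small sketch of the TLS approach, assuming a compiler that supports the __thread extension (GCC and Clang do). The demo_ names are hypothetical, and the tls_model attribute from the declaration above is left out to keep the example portable across build setups.

```c
#include <pthread.h>
#include <stddef.h>

struct demo_table {
    int placeholder;   /* hypothetical table contents */
};

/* One independent copy of this pointer exists per thread. */
static __thread const struct demo_table *demo_tls_Dispatch;

/* With TLS, fetching the table is a plain variable reference. */
#define DEMO_GET_DISPATCH() demo_tls_Dispatch

void demo_set_dispatch(const struct demo_table *table)
{
    demo_tls_Dispatch = table;
}

/* Thread body: install a table in this thread only and read it back. */
void *demo_thread_main(void *arg)
{
    demo_set_dispatch(arg);
    return (void *) DEMO_GET_DISPATCH();
}
```

A table installed by one thread is invisible to the others, yet no library call and no NULL test is needed on the lookup path.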
3.3. Assembly Language Dispatch Stubs

Many platforms have difficulty properly optimizing the tail-call in the
dispatch stubs. Platforms like x86 that pass parameters on the stack seem
to have even more difficulty optimizing these routines. All of the dispatch
routines are very short, and it is trivial to create optimal assembly
language versions. The amount of optimization provided by using assembly
stubs varies from platform to platform and application to application.
However, by using the assembly stubs, many platforms can use an additional
space optimization (see below).

The biggest hurdle to creating assembly stubs is handling the various
ways that the dispatch table pointer can be accessed. There are four
different methods that can be used:

  1. Using _glapi_Dispatch directly in builds for non-multithreaded
     environments.
  2. Using _glapi_Dispatch and _glapi_get_dispatch in
     multithreaded environments.
  3. Using _glapi_Dispatch and pthread_getspecific in
     multithreaded environments.
  4. Using _glapi_tls_Dispatch directly in TLS enabled
     multithreaded environments.

People wishing to implement assembly stubs for new platforms should focus
on #4 if the new platform supports TLS. Otherwise, implement #2 followed by
#3. Environments that do not support multithreading are uncommon and not
terribly relevant.

Selection of the dispatch table pointer access method is controlled by a
few preprocessor defines.

  1. If GLX_USE_TLS is defined, method #4 is used.
  2. If HAVE_PTHREAD is defined, method #3 is used.
  3. If WIN32_THREADS is defined, method #2 is used.
  4. If none of the preceding are defined, method #1 is used.

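Assuming the defines are tested in the order listed, the selection can be pictured as a single preprocessor chain. DEMO_DISPATCH_METHOD and demo_dispatch_method are hypothetical names used only for this sketch; the three tested macros are the ones named in the list.

```c
/* Map the build-time defines onto the method numbers used in the text. */
#if defined(GLX_USE_TLS)
#  define DEMO_DISPATCH_METHOD 4   /* _glapi_tls_Dispatch directly */
#elif defined(HAVE_PTHREAD)
#  define DEMO_DISPATCH_METHOD 3   /* _glapi_Dispatch + pthread_getspecific */
#elif defined(WIN32_THREADS)
#  define DEMO_DISPATCH_METHOD 2   /* _glapi_Dispatch + _glapi_get_dispatch */
#else
#  define DEMO_DISPATCH_METHOD 1   /* _glapi_Dispatch directly */
#endif

/* Report which access method this translation unit was built for. */
int demo_dispatch_method(void)
{
    return DEMO_DISPATCH_METHOD;
}
```

Because the first matching branch wins, a build that defines both GLX_USE_TLS and HAVE_PTHREAD would end up on method #4 in this sketch.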
Two different techniques are used to handle the different cases.
On x86 and SPARC, a macro called GL_STUB is used. In the preamble
of the assembly source file different implementations of the macro are
selected based on the defined preprocessor variables. The assembly code
then consists of a series of invocations of the macros such as:

    GL_STUB(Color3fv, _gloffset_Color3fv)

SPARC Assembly Implementation of glColor3fv

The benefit of this technique is that changes to the calling pattern
(i.e., addition of a new dispatch table pointer access method) require fewer
changed lines in the assembly code.

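The same single-point-of-change idea can be illustrated in C with the preprocessor. This is only an analogy to the assembly GL_STUB macro, not a rendering of it: DEMO_STUB, DEMO_GET_DISPATCH, and every other demo_ name are hypothetical.

```c
#include <stddef.h>

/* Hypothetical two-entry dispatch table. */
struct demo_table {
    void (*Color3fv)(const float *v);
    void (*Vertex3fv)(const float *v);
};

static const struct demo_table *demo_current;

/* All knowledge of how the table pointer is fetched lives in this macro. */
#define DEMO_GET_DISPATCH() demo_current

/* One stub body, written once and stamped out per function name. */
#define DEMO_STUB(name)                         \
    void demo_##name(const float *v)            \
    {                                           \
        (*DEMO_GET_DISPATCH()->name)(v);        \
    }

DEMO_STUB(Color3fv)
DEMO_STUB(Vertex3fv)

/* Trivial backends that count calls, so the stubs can be observed. */
int demo_color_calls;
int demo_vertex_calls;

static void impl_Color3fv(const float *v)  { (void) v; demo_color_calls++; }
static void impl_Vertex3fv(const float *v) { (void) v; demo_vertex_calls++; }

static const struct demo_table demo_impl = { impl_Color3fv, impl_Vertex3fv };

void demo_install(void)
{
    demo_current = &demo_impl;
}
```

Switching to a different access method means editing DEMO_GET_DISPATCH (or DEMO_STUB) once; the list of stub invocations stays untouched.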
However, this technique can only be used on platforms where the function
implementation does not change based on the parameters passed to the
function. For example, since x86 passes all parameters on the stack, no
additional code is needed to save and restore function parameters around a
call to pthread_getspecific. Since x86-64 passes parameters in
registers, varying amounts of code need to be inserted around the call to
pthread_getspecific to save and restore the GL function's
parameters.

The other technique, used by platforms like x86-64 that cannot use the
first technique, is to insert #ifdef within the assembly
implementation of each function. This makes the assembly file considerably
larger (e.g., 29,332 lines for glapi_x86-64.S versus 1,155 lines for
glapi_x86.S) and causes simple changes to the function
implementation to generate many lines of diffs. Since the assembly files
are typically generated by scripts (see below), this
isn't a significant problem.

Once a new assembly file is created, it must be inserted in the build
system. There are two steps to this. The file must first be added to
src/mesa/sources. That gets the file built and linked. The second
step is to add the correct #ifdef magic to
src/mesa/glapi/glapi_dispatch.c to prevent the C version of the
dispatch functions from being built.

3.4. Fixed-Length Dispatch Stubs

To implement glXGetProcAddress, Mesa stores a table that
associates function names with pointers to those functions. This table is
stored in src/mesa/glapi/glprocs.h. For different reasons on
different platforms, storing all of those pointers is inefficient. On most
platforms, including all known platforms that support TLS, we can avoid this
added overhead.

If the assembly stubs are all the same size, the pointer need not be
stored for every function. The location of the function can instead be
calculated by multiplying the size of the dispatch stub by the offset of the
function in the table. This value is then added to the address of the first
dispatch stub.

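The address calculation can be sketched as follows. The stub size, the stub count, the name-to-offset entries, and all demo_ names are made up for illustration; a byte array stands in for the block of machine code that would hold the real stubs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_STUB_SIZE  16u   /* hypothetical stub size in bytes */
#define DEMO_STUB_COUNT 4u    /* hypothetical number of stubs */

/* Stand-in for the region holding the fixed-size stubs back to back. */
static unsigned char demo_stub_area[DEMO_STUB_SIZE * DEMO_STUB_COUNT];

/* Only a name and a table offset are stored, not a pointer per function. */
struct demo_proc {
    const char *name;
    unsigned offset;
};

static const struct demo_proc demo_procs[] = {
    { "glColor3fv",  0 },
    { "glFogCoordf", 1 },
    { "glVertex3f",  2 },
    { "glVertex3fv", 3 },
};

/* address = address of the first stub + stub size * offset */
const void *demo_stub_address(unsigned offset)
{
    assert(offset < DEMO_STUB_COUNT);
    return demo_stub_area + (size_t) DEMO_STUB_SIZE * offset;
}

/* Look a function up by name and compute, rather than load, its address. */
const void *demo_get_proc_address(const char *name)
{
    size_t i;

    for (i = 0; i < sizeof demo_procs / sizeof demo_procs[0]; i++) {
        if (strcmp(demo_procs[i].name, name) == 0)
            return demo_stub_address(demo_procs[i].offset);
    }
    return NULL;   /* unknown function name */
}
```

Storing a small integer offset per entry instead of a full pointer is where the space saving in this sketch comes from; the multiplication and addition replace a pointer load.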
This path is activated by adding the correct #ifdef magic to
src/mesa/glapi/glapi.c just before glprocs.h is
included.

4. Automatic Generation of Dispatch Stubs