Hubbub parser architecture
==========================

Introduction
------------

Hubbub is a flexible HTML parser. It offers two interfaces:

  * a SAX-style event interface
  * a DOM-style tree-based interface

Overview
--------

Hubbub consists of two parts:

  * a tokeniser
  * a tree builder

Tokeniser
---------

The tokeniser divides the data held in the document buffer into chunks.
It sends SAX-style events for each chunk.

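As a rough illustration of the event-per-chunk model, a tokeniser can walk
the buffer and hand each chunk to a client callback. This is a sketch only:
every name below (toy_token, toy_tokenise, count_tags) is hypothetical and
not part of Hubbub's API, and the splitting rule is far simpler than real
HTML tokenisation.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical token kinds; a real HTML tokeniser distinguishes many more. */
typedef enum { TOY_TOKEN_TAG, TOY_TOKEN_TEXT } toy_token_type;

/* A token refers directly into the document buffer; nothing is copied. */
typedef struct {
    toy_token_type type;
    const char *data;
    size_t len;
} toy_token;

typedef void (*toy_token_handler)(const toy_token *token, void *pw);

/* Split buf into "<...>" chunks and runs of text, emitting one event each.
 * Returns the number of events emitted. */
static size_t toy_tokenise(const char *buf, size_t len,
        toy_token_handler handler, void *pw)
{
    size_t pos = 0, count = 0;

    while (pos < len) {
        toy_token token;
        const char *end;

        token.data = buf + pos;
        if (buf[pos] == '<' &&
                (end = memchr(buf + pos, '>', len - pos)) != NULL) {
            token.type = TOY_TOKEN_TAG;
            token.len = (size_t) (end - (buf + pos)) + 1;
        } else {
            /* Text runs to the next '<' (or the end of the buffer). */
            end = memchr(buf + pos + 1, '<', len - pos - 1);
            token.type = TOY_TOKEN_TEXT;
            token.len = (end != NULL) ?
                    (size_t) (end - (buf + pos)) : len - pos;
        }

        handler(&token, pw);
        pos += token.len;
        count++;
    }

    return count;
}

/* Example handler: counts tag tokens via the client's private word. */
static void count_tags(const toy_token *token, void *pw)
{
    if (token->type == TOY_TOKEN_TAG)
        (*(size_t *) pw)++;
}
```

Passing count_tags and a pointer to a size_t counter as the private word lets
the client tally tags without the tokeniser knowing anything about the
client's data.
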
Tree builder
------------

The tree builder constructs a DOM-like tree from the SAX events emitted by
the tokeniser. The exact representation of the tree is up to the client,
which must provide a number of tree building handler functions.

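One way to picture the handler functions is as a table of function pointers
that the builder calls with opaque node pointers. The sketch below assumes
such a table; its names and the three handlers shown are hypothetical and
do not reproduce Hubbub's actual tree handler interface.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical handler table: the builder only ever sees opaque node
 * pointers, so the client is free to choose the representation. */
typedef struct {
    void *(*create_element)(void *ctx, const char *name);
    int (*append_child)(void *ctx, void *parent, void *child);
    void (*destroy_node)(void *ctx, void *node);
    void *ctx;    /* client's private data, passed back to each handler */
} toy_tree_handler;

/* One possible client representation: a first-child/next-sibling tree. */
typedef struct toy_node {
    char name[16];
    struct toy_node *first_child, *last_child, *next;
} toy_node;

static void *client_create_element(void *ctx, const char *name)
{
    toy_node *node = calloc(1, sizeof(*node));

    (void) ctx;
    if (node != NULL)
        strncpy(node->name, name, sizeof(node->name) - 1);
    return node;
}

static int client_append_child(void *ctx, void *parent, void *child)
{
    toy_node *p = parent, *c = child;

    (void) ctx;
    if (p == NULL || c == NULL)
        return -1;
    if (p->last_child != NULL)
        p->last_child->next = c;
    else
        p->first_child = c;
    p->last_child = c;
    return 0;
}

static void client_destroy_node(void *ctx, void *node)
{
    (void) ctx;
    free(node);
}

/* Stand-in for the builder: it drives the handlers and never touches
 * toy_node directly. Builds the equivalent of <html><body/></html>. */
static void *toy_build(const toy_tree_handler *h)
{
    void *html = h->create_element(h->ctx, "html");
    void *body = h->create_element(h->ctx, "body");

    if (h->append_child(h->ctx, html, body) != 0) {
        h->destroy_node(h->ctx, body);
        h->destroy_node(h->ctx, html);
        return NULL;
    }
    return html;
}
```

Because toy_build only uses the table, swapping in a different node type
means changing the three client functions and nothing else.
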
Memory usage and ownership
--------------------------

Memory usage within the library is well defined, as is ownership of allocated
memory.

Raw input data provided by the library client is owned by the client.

SAX events which refer to document segments contain direct references to
internal data. Token objects are transient: the data within them is no
longer valid once the event handler has returned control to the tokeniser.
All data returned by a SAX event is owned by the library.

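In practice this means a handler that wants to keep token data must copy it
before returning. A minimal sketch, assuming a hypothetical token type
(toy_token_ref) and client structure rather than Hubbub's own:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical token: data points into memory owned by the library and
 * is only valid until the handler returns. */
typedef struct {
    const char *data;
    size_t len;
} toy_token_ref;

/* Client-owned storage for the most recent token's text. */
typedef struct {
    char *text;
} toy_client;

/* Event handler: take a private, NUL-terminated copy of the data. */
static void keep_token(const toy_token_ref *token, void *pw)
{
    toy_client *client = pw;
    char *copy = malloc(token->len + 1);

    if (copy == NULL)
        return;
    memcpy(copy, token->data, token->len);
    copy[token->len] = '\0';

    free(client->text);    /* drop any copy kept from an earlier event */
    client->text = copy;
}
```

The copy belongs to the client, which must free it; the library's buffer can
then be reused or released without affecting client->text.
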
The tree builder will use client callbacks to create the objects used
within the tree. Tree objects may be reference counted; a client that
prefers garbage collection may instead do nothing in the ref/unref callbacks.
The resultant tree is owned by the client.

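The two styles of ref/unref callback might look like the following. The
node type, the callback signatures and the toy_nodes_freed counter are all
assumptions made for this sketch, not Hubbub's interface.

```c
#include <stdlib.h>

/* Hypothetical client node carrying its own reference count. */
typedef struct {
    unsigned int refcnt;
} toy_rc_node;

/* Counts freed nodes so the effect of the final unref is observable. */
static int toy_nodes_freed;

static toy_rc_node *toy_rc_node_create(void)
{
    toy_rc_node *node = malloc(sizeof(*node));

    if (node != NULL)
        node->refcnt = 1;    /* the creator holds the first reference */
    return node;
}

/* ref callback: the builder is keeping a pointer to the node. */
static void toy_ref(void *ctx, void *node)
{
    (void) ctx;
    ((toy_rc_node *) node)->refcnt++;
}

/* unref callback: the builder has dropped its pointer; the node is
 * freed when the last reference goes away. */
static void toy_unref(void *ctx, void *node)
{
    toy_rc_node *n = node;

    (void) ctx;
    if (--n->refcnt == 0) {
        free(n);
        toy_nodes_freed++;
    }
}

/* A garbage-collected client could supply this for both callbacks. */
static void toy_noop(void *ctx, void *node)
{
    (void) ctx;
    (void) node;
}
```
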
Parse errors
------------

Notification of parse errors is made through a dedicated event. This event
contains the line/column offset of the error location, along with a message
detailing the error.

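The payload of such an event could be modelled as a small structure that the
client formats or logs. The structure and function below are hypothetical
names chosen for this sketch; the text above does not specify the actual
layout of the event.

```c
#include <stdio.h>

/* Hypothetical payload of the parse error event. */
typedef struct {
    unsigned int line;
    unsigned int col;
    const char *message;
} toy_parse_error;

/* Client handler: format the error as "line:col: message" into out.
 * Returns the snprintf result (the length the full text requires). */
static int format_parse_error(const toy_parse_error *err,
        char *out, size_t out_len)
{
    return snprintf(out, out_len, "%u:%u: %s",
            err->line, err->col, err->message);
}
```
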
Exceptional circumstances
-------------------------

Exceptional circumstances (such as memory exhaustion) are reported
immediately.

The parser's state in such situations is undefined. There is no recovery
mechanism.

Therefore, if the client is able to recover from the exceptional
circumstance (e.g. by making more free memory available), the only valid
way to proceed is to create a new parser instance and start parsing from
scratch. The client should destroy the old parser instance and any DOM tree
produced by it, to avoid resource leaks.
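
The destroy-and-restart pattern can be sketched with a toy parser that
simply accumulates its input and fails once a simulated memory budget is
exhausted. Everything here is an assumption for illustration: the toy_*
names, the toy_mem_available budget and the recovery hook are hypothetical.
The toy parser builds no tree, so there is none to destroy in this sketch.

```c
#include <stdlib.h>
#include <string.h>

typedef enum { TOY_OK, TOY_NOMEM } toy_error;

static size_t toy_mem_available;    /* simulated free memory, in bytes */

typedef struct {
    char *data;
    size_t len;
} toy_parser;

static toy_parser *toy_parser_create(void)
{
    return calloc(1, sizeof(toy_parser));
}

static void toy_parser_destroy(toy_parser *parser)
{
    if (parser != NULL) {
        free(parser->data);
        free(parser);
    }
}

/* Append a chunk; reports TOY_NOMEM once the simulated memory runs out. */
static toy_error toy_parser_parse_chunk(toy_parser *parser,
        const char *chunk, size_t len)
{
    char *grown;

    if (len == 0)
        return TOY_OK;
    if (parser->len + len > toy_mem_available)
        return TOY_NOMEM;
    grown = realloc(parser->data, parser->len + len);
    if (grown == NULL)
        return TOY_NOMEM;
    memcpy(grown + parser->len, chunk, len);
    parser->data = grown;
    parser->len += len;
    return TOY_OK;
}

/* Parse all chunks; on failure, discard the parser, call the client's
 * recovery hook and start again from the first chunk. */
static toy_parser *parse_with_restart(const char **chunks, size_t n,
        int (*recover)(void), unsigned int max_attempts)
{
    unsigned int attempt;

    for (attempt = 0; attempt < max_attempts; attempt++) {
        toy_parser *parser = toy_parser_create();
        size_t i;

        if (parser != NULL) {
            for (i = 0; i < n; i++) {
                if (toy_parser_parse_chunk(parser, chunks[i],
                        strlen(chunks[i])) != TOY_OK)
                    break;
            }
            if (i == n)
                return parser;    /* complete parse */
        }

        /* State is undefined after a failure: never reuse the parser. */
        toy_parser_destroy(parser);
        if (!recover())
            break;
    }
    return NULL;
}

/* Example recovery hook: pretend the client released some memory. */
static int release_memory(void)
{
    toy_mem_available += 8;
    return 1;
}
```

The caller owns the parser returned by parse_with_restart and destroys it
when done. Bounding the number of attempts keeps a client that cannot
actually recover from looping forever.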