The data which Hubbub is fed (the input stream) gets buffered into a UTF-8
buffer.  This buffer only holds a subset of the input stream at any given time.
To avoid unnecessary copying (which costs both time and memory), Hubbub tries
to make all emitted strings point into this buffer, which is then advanced
after tokens have been emitted.  This is not always possible, however, because
HTML5 specifies behaviour which requires replacing some characters with
others, and the replacement may not have the same length as the original.
These cases are:

 - CR handling -- CRLFs and CRs are converted to LFs
 - tag and attribute names are lowercased
 - entities are allowed in attribute names
 - NUL bytes must be turned into U+FFFD REPLACEMENT CHARACTER

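To make the length mismatch concrete, here is a minimal sketch of two of the
cases above (CR handling and NUL replacement).  The function name
"normalise" and its signature are hypothetical and are not taken from
Hubbub; the sketch only shows why the output cannot always be a pointer into
the input buffer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of Hubbub.  Copies `in` to `out`, turning
 * CRLF and lone CR into LF, and NUL into U+FFFD (EF BF BD in UTF-8).
 * `out` must have room for 3 * in_len bytes.  Returns bytes written. */
static size_t normalise(const unsigned char *in, size_t in_len,
                        unsigned char *out)
{
    size_t o = 0;

    assert(in != NULL && out != NULL);

    for (size_t i = 0; i < in_len; i++) {
        if (in[i] == '\r') {
            out[o++] = '\n';
            if (i + 1 < in_len && in[i + 1] == '\n')
                i++;            /* CRLF: two input bytes become one */
        } else if (in[i] == '\0') {
            out[o++] = 0xEF;    /* NUL: one input byte becomes three */
            out[o++] = 0xBF;
            out[o++] = 0xBD;
        } else {
            out[o++] = in[i];   /* everything else is copied unchanged */
        }
    }
    return o;
}
```

A CRLF shrinks the text by one byte and a NUL grows it by two, so in both
cases the emitted string differs from the bytes held in the input buffer.
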
When collecting the strings it will emit, Hubbub starts by assuming that no
transformations on the input stream will be required.  However, if it hits one
of the above cases, then it copies all of the collected characters into a
separate buffer and switches to using that instead.  This means that whenever
a character is collected into a string which may already have been switched
to a buffer, the code must check whether the character has to be copied into
that buffer.  To allow this check, and others, to happen when necessary and
never otherwise, Hubbub uses a set of macros to collect characters, detailed
below.

Hubbub strings are (beginning,length) pairs.  This means that once the
beginning is set to a position in the input stream, the string can collect
further character runs in the stream simply by adding to the length part.  This
makes extending strings very efficient.

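As an illustration of the (beginning,length) idea, the sketch below uses a
hypothetical "span" type.  The type and function names are assumptions made
for this example and are not Hubbub's own definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a Hubbub string: a (beginning, length) pair. */
typedef struct {
    const uint8_t *ptr;  /* beginning: points into the input buffer */
    size_t len;          /* length in bytes */
} span;

/* Begin a string at `at`, covering `len` bytes. */
static span span_start(const uint8_t *at, size_t len)
{
    span s = { at, len };
    return s;
}

/* Extend by the next `len` bytes of the stream: no copying, one addition.
 * Only valid when those bytes directly follow the string in the buffer. */
static void span_extend(span *s, size_t len)
{
    assert(s != NULL && s->ptr != NULL);
    s->len += len;
}
```

Collecting the tag name out of "<title>" then amounts to one span_start() at
the first "t" followed by four calls to span_extend(); no bytes are copied.
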
  | COLLECT(hubbub_string str, uintptr_t cptr, size_t length)

  This collects the character pointed to by "cptr" (of size "length") into
  "str", whether "str" is a buffered or unbuffered string, but only if "str"
  already points to collected characters.

  | COLLECT_NOBUF(hubbub_string str, size_t length)

  This collects "length" bytes into "str", but only if "str" already points to
  collected characters.  (There is no need to pass the character, since this
  just increases the length of the string.)

  | COLLECT_MS(hubbub_string str, uintptr_t cptr, size_t length)

  If "str" is currently zero-length, this acts like START(str, cptr, length).
  Otherwise, it just acts like COLLECT(str, cptr, length).

  | START(hubbub_string str, uintptr_t cptr, size_t length)

  This sets the start of the string "str" to "cptr" and its length to "length".

  | START_BUF(hubbub_string str, uintptr_t cptr, size_t length)

  This buffers the character of length "length" pointed to by "cptr" and then
  sets "str" to point to it.

  | SWITCH(hubbub_string str)

  This switches the string "str" from unbuffered to buffered; it copies all
  characters currently collected in "str" to the buffer and then updates "str"
  to point there.

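The behaviour described for these macros can be modelled roughly as follows.
This is a sketch under stated assumptions, not Hubbub's implementation: it
uses functions instead of macros, byte pointers instead of uintptr_t, an
explicit "buffered" flag, and a fixed-size side buffer to which only one
string appends at a time.  All type and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { BUF_CAP = 64 };                       /* assumed capacity */

typedef struct { const uint8_t *ptr; size_t len; bool buffered; } str_t;
typedef struct { uint8_t data[BUF_CAP]; size_t used; } side_buf;

static void buf_append(side_buf *b, const uint8_t *c, size_t len)
{
    assert(b->used + len <= BUF_CAP);
    memcpy(b->data + b->used, c, len);
    b->used += len;
}

/* START: point the string straight at the input stream. */
static void start(str_t *s, const uint8_t *cptr, size_t len)
{
    s->ptr = cptr; s->len = len; s->buffered = false;
}

/* START_BUF: copy the character into the side buffer, point there. */
static void start_buf(str_t *s, side_buf *b, const uint8_t *cptr, size_t len)
{
    s->ptr = b->data + b->used;
    buf_append(b, cptr, len);
    s->len = len; s->buffered = true;
}

/* COLLECT: no-op on an empty string; copy only if already buffered.
 * In the unbuffered case cptr is assumed to follow the string directly. */
static void collect(str_t *s, side_buf *b, const uint8_t *cptr, size_t len)
{
    if (s->len == 0) return;
    if (s->buffered) buf_append(b, cptr, len);
    s->len += len;
}

/* COLLECT_NOBUF: just grow the length of a non-empty string. */
static void collect_nobuf(str_t *s, size_t len)
{
    if (s->len != 0) s->len += len;
}

/* COLLECT_MS: START if the string is empty, otherwise COLLECT. */
static void collect_ms(str_t *s, side_buf *b, const uint8_t *cptr, size_t len)
{
    if (s->len == 0) start(s, cptr, len);
    else collect(s, b, cptr, len);
}

/* SWITCH: copy what was collected so far into the side buffer. */
static void switch_to_buf(str_t *s, side_buf *b)
{
    if (s->buffered) return;
    const uint8_t *old = s->ptr;
    s->ptr = b->data + b->used;
    buf_append(b, old, s->len);
    s->buffered = true;
}
```

In this model, lowercasing the tag name "dIv" would collect "d" unbuffered,
call switch_to_buf() on reaching "I", and then collect the lowercased "i" and
the remaining "v" into the side buffer.
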