Subversion Repositories Kolibri OS

Rev

Go to most recent revision | Blame | Last modification | View Log | Download | RSS feed

  1. Content caching
  2. ===============
  3.  
  4. NetSurf's existing fetch/cache architecture has a number of problems:
  5.  
  6. 1) Content dependencies are not modelled.
  7. 2) Content source data for non-shareable contents is duplicated.
  8. 3) Detection of content sharability is dependent on Content-Type, which
  9.    requires content cloning (which will fail for dependent contents).
  10. 4) Detection of cycles in content dependency graphs is not performed
  11.    (e.g. content1 includes content2, which includes content1).
  12. 5) All content caching is in-memory, there's no offline storage.
  13.  
  14. Proposal
  15. --------
  16.  
  17. A split-level cache.
  18.  
  19. Low-level cache:
  20.  
  21.   + Responsible for source data (+header) management.
  22.   + Interfaces with low-level fetch system to retrieve data from network.
  23.   + Is responsible for offline storage (if any) of cache objects.
  24.   + Returns opaque handles to low-level cache objects.
  25.   + Handles HTTP redirects, recording URLs encountered when retrieving resource.
  26.   + May perform content-type sniffing (requires usage context)
  27.  
  28. High-level cache:
  29.  
  30.   + Responsible for content objects.
  31.   + Tracks content dependencies (and potential cycles).
  32.   + Returns opaque handles to content objects.
  33.   + Manages content sharability & reusability (see below).
  34.   + Contents with unknown types are never shared and thus get unique handles.
  35.   + Content handles <> content objects: they're an indirection mechanism.
  36.  
  37. Content sharability & reusability
  38. --------------------------------
  39.  
  40.   If a content is shareable, then it may have multiple concurrent users.
  41.   Otherwise, it may have at most one user.
  42.  
  43.   If a content is reusable, then it may be retained in the cache for later use
  44.   when it has no users. Otherwise, it will be removed from the cache when
  45.   it has no users.
  46.  
  47. Example: retrieving a top-level resource
  48. ----------------------------------------
  49.  
  50.   1) Client requests an URL, specifying no parent handle.
  51.   2) High-level cache asks low-level cache for low-level handle for URL.
  52.   3) Low-level cache looks for appropriate object in its index.
  53.      a) it finds one that's not stale and returns its handle
  54.      b) it finds only stale entries, or no appropiate entry,
  55.         so allocates a new entry, requests a fetch for it,
  56.         and returns the handle.
  57.   4) High-level cache looks for content objects that are using the low-level
  58.      handle.
  59.      a) it finds one that's shareable and selects its handle for use.
  60.      b) it finds only non-shareable entries, or no appropriate entry,
  61.         so allocates a new entry and selects its handle for use.
  62.   5) High-level cache registers the parent and client with the selected handle,
  63.      then returns the selected handle.
  64.   6) Client carries on, happy in the knowledge that a content is available.
  65.  
  66. Example: retrieving a child resource
  67. ------------------------------------
  68.  
  69.   1) Client requests an URL, specifying parent handle.
  70.   2) High-level cache searches parent+ancestors for requested URL.
  71.      a) it finds the URL, so returns a non-fatal error.
  72.      b) it does not find the URL, so proceeds from step 2 of the
  73.         top-level resource algorithm.
  74.  
  75.   NOTE: this approach means that shareable contents may have multiple parents.
  76.  
  77. Handling of contents of unknown type
  78. ------------------------------------
  79.  
  80.   Contents of unknown type are, by definition, not shareable. Therefore, each
  81.   client will be issued with a different content handle.
  82.  
  83.   Content types are only known once a resource's headers are fetched (or once
  84.   the type has been sniffed from the resource's data when the headers are
  85.   inconclusive).
  86.  
  87.   As a resource is fetched, users of the resource are informed of the fetch
  88.   status. Therefore, the high-level cache is always informed of fetch progress.
  89.   Cache clients need not care about this: they are simply interested in
  90.   a content's readiness for use.
  91.  
  92.   When the high-level cache is informed of a low-level cache object's type,
  93.   it is in a position to determine whether the corresponding content handles
  94.   can share a single content object or not.
  95.  
  96.   If it detects that a single content object may be shared by multiple handles,
  97.   it simply creates the content object and registers each of the handles as
  98.   a user of the content.
  99.  
  100.   If it detects that each handle requires a separate content object, then it
  101.   will create a content object for each handle and register the handle as a
  102.   user.
  103.  
  104.   This approach requires that clients of the high-level cache get issued with
  105.   handles to content objects, rather than content objects (so that the decision
  106.   whether to create multiple content objects can be deferred until suitable
  107.   information is available).
  108.  
  109.   Handles with no associated content object will act as if they had a content
  110.   object that was not ready for use.
  111.  
  112. A more concrete example
  113. -----------------------
  114.  
  115.   + bw1 contains html1 which includes css1, css2, img1, img2
  116.   + bw2 contains html2 which includes css1, img1, img2
  117.   + bw3 contains img1
  118.  
  119.   Neither HTML nor CSS contents are shareable.
  120.   All shareable contents are requested from the high-level cache
  121.   once their type is known.
  122.  
  123.   Low-level cache contains source data for:
  124.  
  125.     1 - html1
  126.     2 - html2
  127.     3 - css1
  128.     4 - css2
  129.     5 - img1
  130.     6 - img2
  131.  
  132.   High-level cache contains:
  133.  
  134.     Content objects (ll-handle in parentheses):
  135.  
  136.       + c1 (1 - html1)
  137.       + c2 (2 - html2)
  138.       + c3 (3 - css1)
  139.       + c4 (4 - css2)
  140.       + c5 (5 - img1)
  141.       + c6 (6 - img2)
  142.       + c7 (3 - css1)
  143.  
  144.     Content handles (objects in parentheses):
  145.  
  146.       + h1 (c1, used by bw1)
  147.       + h2 (c3, used by h1)
  148.       + h3 (c4, used by h1)
  149.       + h4 (c2, used by bw2)
  150.       + h5 (c7, used by h4)
  151.       + h6 (c5, used by h1,h4,bw3)
  152.       + h7 (c6, used by h1,h4)
  153.  
  154.   If img1 was not of known type when requested:
  155.  
  156.     Content handles (objects in parentheses):
  157.  
  158.       + h1 (c1, used by bw1)
  159.       + h2 (c3, used by h1)
  160.       + h3 (c4, used by h1)
  161.       + h4 (c2, used by bw2)
  162.       + h5 (c7, used by h4)
  163.       + h6 (c5, used by h1)
  164.       + h7 (c6, used by h1,h4)
  165.       + h8 (c5, used by h4)
  166.       + h9 (c5, used by bw3)
  167.  
  168. This achieves the desired effect that:
  169.  
  170.   + source data is shared between contents
  171.   + content objects are only created when absolutely necessary
  172.   + content usage/dependency is tracked and cycles avoided
  173.   + offline storage is possible
  174.  
  175. Achieving this requires the use of indirection objects, but these are expected
  176. to be small in comparison to the content objects / ll-cache objects that they
  177. are indirecting.
  178.  
  179.