docs/accessibility/browser/how_a11y_works_2.md - chromium/src.git - Git at Google

 # How Chrome Accessibility Works, Part 2

 This document explains the technical details behind Chrome accessibility
 code by starting at a high level and progressively adding more levels of
 detail.

 See [Part 1](how_a11y_works.md) first.

 [TOC]

 ## A multi-process browser

 There are different ways that a web browser can be multi-process, so it's
 important to first discuss the model Chromium uses.

 In Chromium, there's a single browser process. That process is the "main"
 process that's launched by the user. It owns all of the windows and UI
 elements, and it handles nearly all of the interaction with operating
 system APIs. Then there are multiple render processes that handle running
 each individual web page. Render processes are sandboxed - this means that
 they're basically forbidden from directly talking to the operating system.

 Each renderer handles one web page. For now let's forget about iframes,
 we'll deal with that added complexity down below. A renderer handles
 the entire lifecycle of a page - managing the HTTP connection, parsing
 the HTML, resolving CSS styles, executing the JavaScript, and figuring out
 what to draw to the screen.

 Because the renderer is in a sandboxed process, it doesn't directly get user
 input like key presses or mouse clicks, and it doesn't directly draw to the
 screen.  These things are all handled by communicating with the browser process.
 The browser process owns the window; when there's a user input event like a
 mouse click or key press, it forwards that event to the appropriate
 renderer. The renderer figures out how to draw the webpage, but it doesn't draw
 directly to the screen - because it's sandboxed - it either sends pixels (in
 software rendering mode) or sends the drawing commands to Chromium's separate
 gpu process.

 A simplified diagram of Chromium's multi-process architecture is shown here:

 ![Multi process system diagram showing a web page renderer sitting inside a
 sandboxed render process; it receives user input events from a browser window
 in the browser process and sends pixels to that browser window; the browser
 window communicates with the operating system.](
 figures/multi_process_browser.png)

 In most system diagrams we're only going to show a single web page,
 because supporting multiple windows and tabs doesn't complicate the
 accessibility architecture much. But, it's helpful to understand how
 the system looks when the user has multiple windows and tabs open.
 Notably:

 * The browser process owns all of the windows and the tabs within them.
 * Each browser tab maintains a connection to a web page running in a
   render process.
 * There can be multiple render processes, but there's no correspondence
   between browser windows, tabs, and render processes. The different
   web page renderers might all be in one process, they might all be in
   different processes, or they might be split across processes in any
   arbitrary way, as shown in this diagram:

 ![Diagram showing two browser windows with two tabs each, communicating
 with four web page renderers that are split across three render processes,
 illustrating what the system diagram might look like when there are
 multiple windows and tabs open, and showing that there's no correspondence
 between a specific tab or window and which process its renderer might live in.](
 figures/multi_process_multiple_tabs.png)

 You can read more about Chromium's multi-process architecture here - note that
 this document is old, so it's more Windows-centric and some of the details
 are out of date, but the basic design is still quite similar to today:

 https://www.chromium.org/developers/design-documents/multi-process-architecture

 ### Blink (Rendering Engine).

 The majority of the code inside each web renderer is implemented in a module
 called [Blink](https://www.chromium.org/blink)
 (the Blink Rendering Engine). Historically, when Chromium was first released,
 this module was WebKit, but it was forked and renamed Blink in 2013.
 As described in [How Blink Works](
 https://docs.google.com/document/d/1aitSOucL0VHZa9Z2vbRJSyAIsAz24kX8LFByQ5xQnUg/view),
 Blink implements everything that renders content inside a browser tab:

 * Implement the specs of the web platform (e.g., HTML standard),
   including DOM, CSS and Web IDL
 * Embed V8 and run JavaScript
 * Request resources from the underlying network stack
 * Build DOM trees
 * Calculate style and layout
 * Embed Chrome Compositor and draw graphics

 There are a few small layers in Chromium's render processes outside of Blink,
 containing:

 * Handling the multi-process communication
 * The renderer side of Chrome browser features (like spell check or autofill)
   that aren't a core part of the web platform

 ### Other multi-process models

 There are other ways that a multi-process web browser could be implemented.
 Another possible approach is that each tab could be in its own process,
 but each tab still communicates directly with the operating system.
 In this model, the operating system can send input events directly to
 the active tab, and the tab can paint its own contents directly.

 ![System diagram showing how some other browsers use multiple processes.
 In this diagram, a browser process communicates with two tab processes
 that each own both the web rendering but also the UI for their tabs;
 these tabs also communicate directly with the operating system.](
 figures/other_multi_process_browser.png)

 Both Apple Safari and Microsoft Edge Legacy (i.e. Edge before Chromium) use
 variations of this multi-process model. They get some of these multi-process
 advantages, shared with Chromium:

 * Stability: a stuck tab won't hang the whole browser, a crashed tab
   won't crash the whole browser
 * Performance: a slow tab won't prevent other tabs from being responsive
 * Isolation: a compromised tab won't have access to user data from other tabs

 Chromium chose its multi-process model with sandboxed render processes because
 it provides much stronger protection against exploits:

 * Security: a compromised sandboxed renderer has no access to the operating
   system, so it can't compromise the user's system

 Unfortunately, Chromium's architecture makes accessibility more complex.
 Accessibility APIs are operating system APIs. In a non-sandboxed
 multi-process browser like in the diagram above, each tab can directly
 handle its own accessibility. In Chromium, accessibility APIs need to be
 handled by the browser process, even though most of the information about
 the web page lives in one of the sandboxed render processes.

 ### A dead-end: proxying accessibility requests

 Under Chrome's architecture, operating system accessibility APIs
 can only talk to the browser process. In fact, from the point of view
 of assistive technology, they don't even know about other processes.
 All of the windows are owned by the main browser process, so all of
 the accessibility APIs get called in that process.

 Let's consider the following scenario: assistive technology has found
 a node in the accessibility tree corresponding to a checkbox, and
 wants to query it to find out its current state (enabled, checked,
 focused, etc.).

 When we were first building accessibility support in Chromium, one approach we
 tried was for the browser process to have a lightweight tree of proxy objects,
 each one corresponding to a node in the accessibility tree in the render
 process. Upon receiving a call to getState, the browser process would make a
 blocking IPC to the render process, get the state of that particular node, then
 reply to the accessibility API call with that result.

 ![In this diagram, when the operating system wants to call an accessibility
 API, it's calling it on a node in the Proxy accessibility tree in the
 Browser process. The diagram shows how this node makes a blocking IPC call
 to a corresponding node in the accessibility tree in the Render process,
 which determines the result by querying the DOM tree in the same process.](
 figures/proxy_approach.png)

 We discovered two problems with this approach.

 First, making blocking IPCs from the browser to renderer was highly
 discouraged. It introduced "jankiness" into the browser and introduced
 the possibility of deadlock. Unfortunately nearly all accessibility APIs
 are synchronous method calls, so there's no easy way around this.

 Second, we discovered that some assistive technology was calling many thousands
 of accessibility APIs in a row when loading a page. For example, both JAWS
 and NVDA scanned the entire web page from top to bottom on first load in order
 to build their virtual buffers. This proxy model was slowing things down
 dramatically. Even though most calls only took a millisecond to return,
 when accessing thousands of nodes sequentially that resulted in even
 medium-sized web pages taking 10 seconds or more to load.

 In addition, while most calls would be fast, some blocking calls could take
 much longer because they'd need to block until not only the render process's
 main thread was free, but also until the document was in a clean layout
 state (more on layout below under Blink). And if a renderer was hung
 (long-running JavaScript or an endless loop), the blocking call might never
 return or might be forced to time out.

 ### Caching the full accessibility tree

 Instead of the proxying approach, Chromium caches the full accessibility
 tree for every web page in the browser process. When accessibility API
 calls come in from the operating system or assistive technology, they're
 handled immediately out of the cache, never blocking on a render process.
 Separately, renderers send atomic updates to the browser process to keep
 the accessibility tree up-to-date.

 Here's a diagram of this approach:

 ![In this diagram, when the operating system calls an accessibility API in
 the browser process, the API is satisfied directly from the cached
 accessibility tree. Separately, atomic updates flow from the accessibility
 tree in the render process to the cache in the browser process.](
 figures/caching_approach.png)

 One advantage to this approach is that handling operating system accessibility
 API calls is quite fast. In fact, this design leads to even faster performance
 than a traditional single-process browser, where many API calls would be
 handled by querying the DOM tree or Layout tree for details.

 This approach is also completely free from blocking IPCs or deadlocks.

 There are some drawbacks to this approach, though:

 Memory usage is higher. The cache necessarily duplicates information that
 was already stored elsewhere, so this is unavoidable. We mitigate this in
 part by ensuring that the data structure we use to store each accessibility
 node is sparse and compact.

 Second, the accessibility tree can't be computed lazily. Whenever a web page
 changes, updates have to be pushed to the browser process cache right away,
 so that the cache is up-to-date as soon as possible. Now, when a large and
 complex page loads and this page is immediately consumed by assistive
 technology, then this approach is no worse - we're just shifting the burden
 from providing the accessibility tree on-demand to precomputing it, but
 essentially doing the same work. However, when assistive technology is not
 actually consuming the changes, this approach can be inefficient. More work
 is needed to mitigate this performance issue.

 One question that comes up is: isn't it a problem that the browser cache
 is potentially "behind", showing a snapshot of the web page as it existed
 a moment ago? This is true, but in practice it ends up being insignificant.
 What's most important is that the cache always represents a complete and
 consistent snapshot of the accessibility tree. Also, note that the visual
 representation you see of a web page is also delayed slightly from the
 "source of truth" in the DOM. A typical graphics frame is calculated every
 ~17 ms (assuming a display refresh rate of 60 fps), and there's some
 additional latency from when a graphics frame is computed and when it's
 actually shown on the screen - so in a sense whenever a web page makes a
 change to JavaScript, what you see on the screen is usually 10 - 20 ms
 behind that. If you've ever clicked the mouse at the exact instant the
 web page scrolled out from under you and you clicked on the wrong thing,
 you've observed this phenomenon.

 Chromium's caching approach to multi-process accessibility has led to several
 advantages or insights that were not immediately apparent in the initial design:

 * The cache can be anywhere, it doesn't have to be in the browser process.
   On Chrome OS we put the accessibility cache in the process running the
   assistive technology. On Windows we have explored the idea of making a
   separate accessibility process for assistive technology to talk to, or
   possibly pushing the cache to the assistive technology's process.
 * It's all data. By design the accessibility cache is just data, it's a
   serialization of what the accessibility tree looks like at any point in
   time. This view ends up having a lot of nice advantages, described more
   below.

 ### Push vs pull

 One way to think about different ways to architect an accessibility system
 is to explore what triggers data to move in the system.

 Most operating system accessibility APIs are based around a "pull" model.
 When assistive technology requests information from the accessibility tree,
 it calls a method on the app to requests it. In a single-process browser or
 in the proxy approach described above, the underlying data behind that node
 is pulled from the source accessibility tree, which pulls from the underlying
 data model (the DOM and layout trees).

 In contrast, Chrome's "push" model tracks changes that happen to the
 accessibility tree and then pushes changes from the web directly to
 the cached accessibility tree cache in the browser process. This incurs
 from upfront cost when a page changes, but makes access to accessibility
 APIs very fast.

 ### It's all data

 Most accessibility APIs are based around a functional API, where you override
 methods in order to answer whatever queries assistive technology has about
 any particular accessibility object.

 This approach is consistent with many common design principles, such as
 DRY (don't repeat yourself) - that there should be a single source of truth
 for any piece of information (like whether a control is visible or not),
 rather than two copies of that information (which could get out of sync).
 (This is harder to achieve in a multi-process app, though.)

 However, the functional approach has its downsides.

 To query the current state of an object, you end up calling basically all of its
 methods. Even in a single process browser where there's no IPC overhead, every
 single API might go through several layers of indirection in order to be
 satisfied.

 For example, suppose the operating system calls the isEnabled() method on a
 checkbox. The implementation might make a series of method calls that end up
 querying the DOM or the Layout tree, before returning the result.  Subsequent
 calls to isVisible(), isFocused(), and isChecked() might go through the same
 series of calls, unless the app specifically cached them temporarily.

 In comparison, if you ask that same checkbox to just fill in a simple
 data structure with its accessibility state, it might be able to quickly
 compute isEnabled, isVisible, isFocused, and isChecked all at the same time,
 with no additional layers of indirection needed.

 Another advantage of this approach is that you can take advantage of default
 values using a sparse data structure - think of a node in the accessibility
 cache as a key/value store like a hash map.  If the default value of isChecked
 is false, then for any accessible object that isn't currently checked you don't
 have to do anything.  Only if it is checked do you add a "checked" attribute to
 the map.

 Another advantage of the accessibility tree being data is that you can
 save the complete state of the accessibility tree - either making a copy
 in memory, or dumping a human-readable text version to a log file. We use
 this concept extensively in accessibility browser tests.

 One powerful consequence of this approach is that an accessibility tree
 doesn't need a backing web page in order to function. It's possible to save
 an accessibility tree and a series of atomic mutations, and then "replay"
 them later and get identical results, without any backing web page.
 Chromium currently has some experimental support for recording changes to
 a web page in the chrome://accessibility page, and we also take advantage
 of this snapshotting in order to implement support for the Android
 "freeze-dried tabs" feature where a frozen snapshot of the page is
 displayed (with accessibility support) while the real page is being
 fetched from a slow connection.

 ### Data structures used by the accessibility cache

 In this section we'll cover the data structures used in order to implement
 the accessibility cache. For the moment, we'll ignore the render processes
 and how this data is generated, in order to focus on what data is received
 and how it's used.

 The key data structures used throughout accessibility code are found in the
 [ui/accessibility](
 https://source.chromium.org/chromium/chromium/src/+/main:ui/accessibility/)
 directory.

 The underlying data from one node is stored in AXNodeData. A simplified
 version of this struct is:

 ```C++
 struct AXNodeData {
     int id;
     Role role;
     vector<pair<StringAttribute, string>> string_attributes;
     vector<pair<IntAttribute, int>> int_attributes;
     vector<pair<FloatAttribute, float>> float_attributes;
     ...
     vector<int> child_ids;
     AXRelativeBounds relative_bounds;
 };
 ```

 Every node has an ID, but IDs are only required to be unique within the
 same web frame. The IDs are used to express the tree structure - each node
 has a vector of its child IDs.

 Every node has a role, since that's a fundamental concept in accessibility
 and every node needs one. Every node also has a bounding box (we'll go into
 why it's a *relative* bounding box later). Nearly all of the other
 attributes are stored as sparse vectors of (attribute type, attribute value)
 pairs. There are currently over 100 different attributes that Chromium
 can associate with a single accessible node, but most nodes only have
 5 - 10 of them set. Anything unset is treated as having the default value.

 An AXTreeUpdate is a pure-data snapshot of an accessibility tree or an
 update to an existing tree. A slightly simplified version of that struct is:

 ```C++
 struct AXTreeUpdate {
     AXTreeData tree_data;
     int root_id;
     vector<AXNodeData> nodes;
 };
 ```

 AXTreeData just has a few attributes that apply to the entire
 tree rather than just one particular node.

 A valid AXTreeUpdate corresponding to a complete accessibility tree just has to
 contain every node exactly once with no duplicates or redundant nodes. In other
 words, root_id must contain the ID of one node; that node must have the IDs of
 its children, and every node must either be the root or the child of some other
 node. The nodes can come in any order.

 A valid AXTreeUpdate can also represent the changes needed to change an
 existing accessibility tree. Any node that's unchanged can be completely
 left out; only nodes that change need to be included. To insert a node,
 just add its AXNodeData and be sure to update the child_ids in its parent.
 To remove a node, remove it from its parent's child_ids.

 An AXTreeUpdate is *stateful* - if you're using it to update an existing
 tree then the code that generated the update needs to understand the
 state that the tree was in previously.

 There are two classes that represent a live node and a live tree, here
 are the important details:

 ```C++
 class AXNode {
   public:
     ...
     const AXNodeData& data();
     AXNode* GetParent() const;
     vector<AXNode*>& GetAllChildren() const;
     ...
 };

 class AXTree {
   public:
     AXTree();
     explicit AXTree(const AXTreeUpdate& initial_state);
     bool Unserialize(const AXTreeUpdate& update);
     AXNode* root() const;
     AXNode* GetFromId(int id) const;
     ...
 };
 ```

 You can construct an AXTree from an AXTreeUpdate directly, or create an
 empty tree first. Then call Unserialize to take an AXTreeUpdate and
 apply those changes to the current tree. It will return false if the
 AXTreeUpdate is malformed or can't be applied to the current tree.

 Once you have an AXTree, you can get its root node, or look up nodes by ID.
 Each node is just a thin wrapper around its underlying AXNodeData plus
 some convenience functions to walk the tree.

 AXTree is essentially the underlying data model used to implement
 accessibility in the browser process. As described in
 [Part 1](how_a11y_works.md), there's a platform-specific accessibility
 object for each node in the tree. When a platform accessibility API is
 called on one of those nodes, it uses the corresponding AXNode's
 AXNodeData to get the details of the node to satisfy that query.

 Nearly all accessibility APIs can be satisfied directly from AXNodeData:
 for example a node's name, role, state, value, bounding box, allowed
 actions, and relationships with other nodes. There are a few exceptions
 that will be covered in [Part 3](how_a11y_works_3.md).

 BrowserAccessibilityManager is a layer on top of AXTree. It hooks up
 the cross-platform accessibility data structures to the platform-specific
 tree of native accessibility objects. That will be discussed in more
 detail in [Part 3](how_a11y_works_3.md).

 ### Serializing the accessibility tree from the renderer

 The other half of the puzzle here is what happens on the renderer side.
 Given a web page that's constantly changing, how do we serialize a
 representation of the accessibility tree and send small, atomic updates
 to the browser process to keep the cache in sync?

 See above for a reminder of Blink and how it fits into the render process.
 Render accessibility consists of two main pieces:

 * A tree of lightweight accessibility nodes inside Blink that represent
   the current state of the accessibility tree
 * Code outside Blink that keeps track of nodes that need to be updated,
   and periodically serializes updates to send to the browser process.

 #### The Blink Accessibility Tree

 Inside Blink, we use the following classes.

 AXObject is the base class representing one node in the accessibility
 tree. Each AXObject is a wrapper, it either wraps a DOM Node (a blink::Node)
 or a Blink layout object (a blink::LayoutObject). The AXObject contains
 very little state; it caches a few attributes that are expensive to recompute
 but otherwise doesn't store its full serialization.

 AXObjectCache is the class that represents all of the AXObjects for one
 web page. It's owned by the Document class, only if accessibility is enabled.

 AXObjects are built lazily, on-demand. When an AXObject is added or deleted,
 its parent marks its list of child objects as dirty so that the next time
 it's queried it knows to compute them.

 The vast majority of the web-specific accessibility logic is in AXObject and
 AXObjectCache. Besides just support for ARIA attributes and getting
 accessibility information from DOM elements, here you'll find all of the code
 that interprets the [ACCNAME](https://w3c.github.io/accname/) spec to compute
 the accessible name and description for any HTML element, by checking
 aria-labelledby, aria-label, and other relevant attributes, for example.

 In addition, the code in AXObject and AXObjectCache builds the structure of
 the accessibility tree, especially in cases where it differs from the DOM tree,
 such as CSS generated content, aria-hidden, or aria-owns.

 When the accessibility tree changes, AXObjectCache sends an event
 notification to the render accessibility code outside of Blink indicating
 that a node has changed and needs to be re-serialized.

 Blink does not currently have listener interfaces for all of the changes
 that accessibility cares about. Rather, Blink code is generally
 specifically instrumented for accessibility. So when a new DOM node is
 inserted, or when the value of a form control changes, you'll see explicit
 code in Blink to update the AXObjectCache.

 In earlier versions, Blink accessibility code used to post a lot of
 specific event notifications indicating exactly what changed, for
 example: NameChanged, ValueChanged, StateChanged, ChildrenChanged, etc. - but
 over time we've moved away from that model. Now we usually just mark a
 node as dirty - or the event notification is still there for historical
 reasons but it's never consumed and only marking the node is dirty ends up
 mattering.

 The reason for this was because we started adding code to automatically
 generate events from tree mutations. This helped avoid entire classes of
 problems like duplicate events.

 So in a nutshell, Blink is just responsible for notifying which nodes have
 changed. It's more efficient to just re-serialize nodes that probably changed
 (occasionally doing a bit of extra work) than it is to write very careful logic
 to only update nodes that actually changed.

 #### Render Process Accessibility code (outside of Blink)

 The main render process accessibility code outside of Blink is in
 [RenderAccessibilityImpl](https://source.chromium.org/chromium/chromium/src/+/main:content/renderer/accessibility/render_accessibility_impl.h).
 That code maintains the connection with the browser accessibility
 code and handles serializing updates to the accessibility tree.

 Updates to the accessibility tree are always batched.

 One important reason is because the accessibility tree can only be serialized
 when the document's lifecycle state is *clean*. In a nutshell, every time a
 change happens to a web page that could conceivably affect how it appears
 on-screen, the document is dirty until Blink has had a chance to do CSS style
 resolution and layout. Accessibility code will fail assertions if you try to
 query certain properties when the document is in a dirty state, because
 it could lead to inconsistent results or even crashes.

 So, accessibility changes are always queued up and then sent periodically
 only after first ensuring that layout is complete.

 When it is time to send accessibility updates, we make use of an
 abstraction called AXTreeSerializer. AXTreeSerializer is a class that
 knows how to walk a tree of nodes and generate valid AXTreeUpdates that
 incrementally update a remote AXTree.

 AXTreeSerializer is designed so that it doesn't know anything about Blink,
 and it doesn't interpret any accessibility logic, it just knows how to
 work with the AXNodeData and AXTreeUpdate data structures. In fact, we're
 using AXTreeSerializer for other accessibility trees in Chromium outside
 of Blink, and we have extensive unit tests for AXTreeSerializer that
 serialize from one AXTree into another AXTree to test the logic in isolation.

 AXTreeSerializer is stateful; it keeps track of what nodes have been sent
 to its counterpart. When walking the tree, if it encounters a node that
 it hasn't serialized before, it automatically serializes it. If it
 encounters a node that was previously serialized and wasn't marked as
 dirty, it automatically skips it.

 AXTreeSerializer uses an interface called AXTreeSource to enable it to walk
 any tree-like object without being tightly coupled to Blink or any other
 specific tree. We use an implementation BlinkAXTreeSource that maps all
 of the tree-walking and serialization calls into calls to Blink's
 AXObject class.

 This figure shows the overall system diagram covered so far:

 ![Inside Blink, which is inside the Render process, a Node and CSS both
 influence a LayoutObject; the Node and the LayoutObject both influence the
 AXObject. The AXObject communicates with RenderAccessibilityImpl, which is
 outside of Blink but in the Render process. RenderAccessibilityImpl uses
 AXTreeSerializer to send an update to the AXTree that lives in
 BrowserAccessibilityManager in the Browser process, which is the
 cached accessibility tree. Assistive technology then accesses this
 cached accessibility tree using platform accessibility APIs.](
 figures/multi_process_ax.png)

 ## We must go deeper

 In the next section we'll explore some of the details that were
 glossed over, including:

 * Abstracting platform-specific APIs
 * Relative coordinates
 * Text bounding boxes
 * Hit testing
 * Views and other non-web custom-drawn UI

 See [How Chrome Accessibility Works, Part 3](how_a11y_works_3.md)