| # How Chrome Accessibility Works, Part 2 |
| |
| This document explains the technical details behind Chrome accessibility |
| code by starting at a high level and progressively adding more levels of |
| detail. |
| |
| See [Part 1](how_a11y_works.md) first. |
| |
| [TOC] |
| |
| ## A multi-process browser |
| |
| There are different ways that a web browser can be multi-process, so it's |
| important to first discuss the model Chromium uses. |
| |
| In Chromium, there's a single browser process. That process is the "main" |
| process that's launched by the user. It owns all of the windows and UI |
| elements, and it handles nearly all of the interaction with operating |
| system APIs. Then there are multiple render processes that handle running |
| each individual web page. Render processes are sandboxed - this means that |
| they're basically forbidden from directly talking to the operating system. |
| |
| Each renderer handles one web page. For now let's forget about iframes, |
| we'll deal with that added complexity down below. A renderer handles |
| the entire lifecycle of a page - managing the HTTP connection, parsing |
| the HTML, resolving CSS styles, executing the JavaScript, and figuring out |
| what to draw to the screen. |
| |
| Because the renderer is in a sandboxed process, it doesn't directly get user |
| input like key presses or mouse clicks, and it doesn't directly draw to the |
| screen. These things are all handled by communicating with the browser process. |
| The browser process owns the window; when there's a user input event like a |
| mouse click or key press, it forwards that event to the appropriate |
| renderer. The renderer figures out how to draw the webpage, but it doesn't draw |
| directly to the screen - because it's sandboxed - it either sends pixels (in |
| software rendering mode) or sends the drawing commands to Chromium's separate |
| gpu process. |
| |
| A simplified diagram of Chromium's multi-process architecture is shown here: |
| |
|  |
| |
| In most system diagrams we're only going to show a single web page, |
| because supporting multiple windows and tabs doesn't complicate the |
| accessibility architecture much. But, it's helpful to understand how |
| the system looks when the user has multiple windows and tabs open. |
| Notably: |
| |
| * The browser process owns all of the windows and the tabs within them. |
| * Each browser tab maintains a connection to a web page running in a |
| render process. |
| * There can be multiple render processes, but there's no correspondence |
| between browser windows, tabs, and render processes. The different |
| web page renderers might all be in one process, they might all be in |
| different processes, or they might be split across processes in any |
| arbitrary way, as shown in this diagram: |
| |
|  |
| |
| You can read more about Chromium's multi-process architecture here - note that |
| this document is old, so it's more Windows-centric and some of the details |
| are out of date, but the basic design is still quite similar to today: |
| |
| https://www.chromium.org/developers/design-documents/multi-process-architecture |
| |
| ### Blink (Rendering Engine). |
| |
| The majority of the code inside each web renderer is implemented in a module |
| called [Blink](https://www.chromium.org/blink) |
| (the Blink Rendering Engine). Historically, when Chromium was first released, |
| this module was WebKit, but it was forked and renamed Blink in 2013. |
| As described in [How Blink Works]( |
| https://docs.google.com/document/d/1aitSOucL0VHZa9Z2vbRJSyAIsAz24kX8LFByQ5xQnUg/view), |
| Blink implements everything that renders content inside a browser tab: |
| |
| * Implement the specs of the web platform (e.g., HTML standard), |
| including DOM, CSS and Web IDL |
| * Embed V8 and run JavaScript |
| * Request resources from the underlying network stack |
| * Build DOM trees |
| * Calculate style and layout |
| * Embed Chrome Compositor and draw graphics |
| |
| There are a few small layers in Chromium's render processes outside of Blink, |
| containing: |
| |
| * Handling the multi-process communication |
| * The renderer side of Chrome browser features (like spell check or autofill) |
| that aren't a core part of the web platform |
| |
| ### Other multi-process models |
| |
| There are other ways that a multi-process web browser could be implemented. |
| Another possible approach is that each tab could be in its own process, |
| but each tab still communicates directly with the operating system. |
| In this model, the operating system can send input events directly to |
| the active tab, and the tab can paint its own contents directly. |
| |
|  |
| |
| Both Apple Safari and Microsoft Edge Legacy (i.e. Edge before Chromium) use |
| variations of this multi-process model. They get some of these multi-process |
| advantages, shared with Chromium: |
| |
| * Stability: a stuck tab won't hang the whole browser, a crashed tab |
| won't crash the whole browser |
| * Performance: a slow tab won't prevent other tabs from being responsive |
| * Isolation: a compromised tab won't have access to user data from other tabs |
| |
| Chromium chose its multi-process model with sandboxed render processes because |
| it provides much stronger protection against exploits: |
| |
| * Security: a compromised sandboxed renderer has no access to the operating |
| system, so it can't compromise the user's system |
| |
| Unfortunately, Chromium's architecture makes accessibility more complex. |
| Accessibility APIs are operating system APIs. In a non-sandboxed |
| multi-process browser like in the diagram above, each tab can directly |
| handle its own accessibility. In Chromium, accessibility APIs need to be |
| handled by the browser process, even though most of the information about |
| the web page lives in one of the sandboxed render processes. |
| |
| ### A dead-end: proxying accessibility requests |
| |
| Under Chrome's architecture, operating system accessibility APIs |
| can only talk to the browser process. In fact, from the point of view |
| of assistive technology, they don't even know about other processes. |
| All of the windows are owned by the main browser process, so all of |
| the accessibility APIs get called in that process. |
| |
| Let's consider the following scenario: assistive technology has found |
| a node in the accessibility tree corresponding to a checkbox, and |
| wants to query it to find out its current state (enabled, checked, |
| focused, etc.). |
| |
| When we were first building accessibility support in Chromium, one approach we |
| tried was for the browser process to have a lightweight tree of proxy objects, |
| each one corresponding to a node in the accessibility tree in the render |
| process. Upon receiving a call to getState, the browser process would make a |
| blocking IPC to the render process, get the state of that particular node, then |
| reply to the accessibility API call with that result. |
| |
|  |
| |
| We discovered two problems with this approach. |
| |
| First, making blocking IPCs from the browser to renderer was highly |
| discouraged. It introduced "jankiness" into the browser and introduced |
| the possibility of deadlock. Unfortunately nearly all accessibility APIs |
| are synchronous method calls, so there's no easy way around this. |
| |
| Second, we discovered that some assistive technology was calling many thousands |
| of accessibility APIs in a row when loading a page. For example, both JAWS |
| and NVDA scanned the entire web page from top to bottom on first load in order |
| to build their virtual buffers. This proxy model was slowing things down |
| dramatically. Even though most calls only took a millisecond to return, |
| when accessing thousands of nodes sequentially that resulted in even |
| medium-sized web pages taking 10 seconds or more to load. |
| |
| In addition, while most calls would be fast, some blocking calls could take |
| much longer because they'd need to block until not only the render process's |
| main thread was free, but also until the document was in a clean layout |
| state (more on layout below under Blink). And if a renderer was hung |
| (long-running JavaScript or an endless loop), the blocking call might never |
| return or might be forced to time out. |
| |
| ### Caching the full accessibility tree |
| |
| Instead of the proxying approach, Chromium caches the full accessibility |
| tree for every web page in the browser process. When accessibility API |
| calls come in from the operating system or assistive technology, they're |
| handled immediately out of the cache, never blocking on a render process. |
| Separately, renderers send atomic updates to the browser process to keep |
| the accessibility tree up-to-date. |
| |
| Here's a diagram of this approach: |
| |
|  |
| |
| One advantage to this approach is that handling operating system accessibility |
| API calls is quite fast. In fact, this design leads to even faster performance |
| than a traditional single-process browser, where many API calls would be |
| handled by querying the DOM tree or Layout tree for details. |
| |
| This approach is also completely free from blocking IPCs or deadlocks. |
| |
| There are some drawbacks to this approach, though: |
| |
| Memory usage is higher. The cache necessarily duplicates information that |
| was already stored elsewhere, so this is unavoidable. We mitigate this in |
| part by ensuring that the data structure we use to store each accessibility |
| node is sparse and compact. |
| |
| Second, the accessibility tree can't be computed lazily. Whenever a web page |
| changes, updates have to be pushed to the browser process cache right away, |
| so that the cache is up-to-date as soon as possible. Now, when a large and |
| complex page loads and this page is immediately consumed by assistive |
| technology, then this approach is no worse - we're just shifting the burden |
| from providing the accessibility tree on-demand to precomputing it, but |
| essentially doing the same work. However, when assistive technology is not |
| actually consuming the changes, this approach can be inefficient. More work |
| is needed to mitigate this performance issue. |
| |
| One question that comes up is: isn't it a problem that the browser cache |
| is potentially "behind", showing a snapshot of the web page as it existed |
| a moment ago? This is true, but in practice it ends up being insignificant. |
| What's most important is that the cache always represents a complete and |
| consistent snapshot of the accessibility tree. Also, note that the visual |
| representation you see of a web page is also delayed slightly from the |
| "source of truth" in the DOM. A typical graphics frame is calculated every |
| ~17 ms (assuming a display refresh rate of 60 fps), and there's some |
| additional latency from when a graphics frame is computed and when it's |
| actually shown on the screen - so in a sense whenever a web page makes a |
| change to JavaScript, what you see on the screen is usually 10 - 20 ms |
| behind that. If you've ever clicked the mouse at the exact instant the |
| web page scrolled out from under you and you clicked on the wrong thing, |
| you've observed this phenomenon. |
| |
| Chromium's caching approach to multi-process accessibility has led to several |
| advantages or insights that were not immediately apparent in the initial design: |
| |
| * The cache can be anywhere, it doesn't have to be in the browser process. |
| On Chrome OS we put the accessibility cache in the process running the |
| assistive technology. On Windows we have explored the idea of making a |
| separate accessibility process for assistive technology to talk to, or |
| possibly pushing the cache to the assistive technology's process. |
| * It's all data. By design the accessibility cache is just data, it's a |
| serialization of what the accessibility tree looks like at any point in |
| time. This view ends up having a lot of nice advantages, described more |
| below. |
| |
| ### Push vs pull |
| |
| One way to think about different ways to architect an accessibility system |
| is to explore what triggers data to move in the system. |
| |
| Most operating system accessibility APIs are based around a "pull" model. |
| When assistive technology requests information from the accessibility tree, |
| it calls a method on the app to requests it. In a single-process browser or |
| in the proxy approach described above, the underlying data behind that node |
| is pulled from the source accessibility tree, which pulls from the underlying |
| data model (the DOM and layout trees). |
| |
| In contrast, Chrome's "push" model tracks changes that happen to the |
| accessibility tree and then pushes changes from the web directly to |
| the cached accessibility tree cache in the browser process. This incurs |
| from upfront cost when a page changes, but makes access to accessibility |
| APIs very fast. |
| |
| ### It's all data |
| |
| Most accessibility APIs are based around a functional API, where you override |
| methods in order to answer whatever queries assistive technology has about |
| any particular accessibility object. |
| |
| This approach is consistent with many common design principles, such as |
| DRY (don't repeat yourself) - that there should be a single source of truth |
| for any piece of information (like whether a control is visible or not), |
| rather than two copies of that information (which could get out of sync). |
| (This is harder to achieve in a multi-process app, though.) |
| |
| However, the functional approach has its downsides. |
| |
| To query the current state of an object, you end up calling basically all of its |
| methods. Even in a single process browser where there's no IPC overhead, every |
| single API might go through several layers of indirection in order to be |
| satisfied. |
| |
| For example, suppose the operating system calls the isEnabled() method on a |
| checkbox. The implementation might make a series of method calls that end up |
| querying the DOM or the Layout tree, before returning the result. Subsequent |
| calls to isVisible(), isFocused(), and isChecked() might go through the same |
| series of calls, unless the app specifically cached them temporarily. |
| |
| In comparison, if you ask that same checkbox to just fill in a simple |
| data structure with its accessibility state, it might be able to quickly |
| compute isEnabled, isVisible, isFocused, and isChecked all at the same time, |
| with no additional layers of indirection needed. |
| |
| Another advantage of this approach is that you can take advantage of default |
| values using a sparse data structure - think of a node in the accessibility |
| cache as a key/value store like a hash map. If the default value of isChecked |
| is false, then for any accessible object that isn't currently checked you don't |
| have to do anything. Only if it is checked do you add a "checked" attribute to |
| the map. |
| |
| Another advantage of the accessibility tree being data is that you can |
| save the complete state of the accessibility tree - either making a copy |
| in memory, or dumping a human-readable text version to a log file. We use |
| this concept extensively in accessibility browser tests. |
| |
| One powerful consequence of this approach is that an accessibility tree |
| doesn't need a backing web page in order to function. It's possible to save |
| an accessibility tree and a series of atomic mutations, and then "replay" |
| them later and get identical results, without any backing web page. |
| Chromium currently has some experimental support for recording changes to |
| a web page in the chrome://accessibility page, and we also take advantage |
| of this snapshotting in order to implement support for the Android |
| "freeze-dried tabs" feature where a frozen snapshot of the page is |
| displayed (with accessibility support) while the real page is being |
| fetched from a slow connection. |
| |
| ### Data structures used by the accessibility cache |
| |
| In this section we'll cover the data structures used in order to implement |
| the accessibility cache. For the moment, we'll ignore the render processes |
| and how this data is generated, in order to focus on what data is received |
| and how it's used. |
| |
| The key data structures used throughout accessibility code are found in the |
| [ui/accessibility]( |
| https://source.chromium.org/chromium/chromium/src/+/main:ui/accessibility/) |
| directory. |
| |
| The underlying data from one node is stored in AXNodeData. A simplified |
| version of this struct is: |
| |
| ```C++ |
| struct AXNodeData { |
| int id; |
| Role role; |
| vector<pair<StringAttribute, string>> string_attributes; |
| vector<pair<IntAttribute, int>> int_attributes; |
| vector<pair<FloatAttribute, float>> float_attributes; |
| ... |
| vector<int> child_ids; |
| AXRelativeBounds relative_bounds; |
| }; |
| ``` |
| |
| Every node has an ID, but IDs are only required to be unique within the |
| same web frame. The IDs are used to express the tree structure - each node |
| has a vector of its child IDs. |
| |
| Every node has a role, since that's a fundamental concept in accessibility |
| and every node needs one. Every node also has a bounding box (we'll go into |
| why it's a *relative* bounding box later). Nearly all of the other |
| attributes are stored as sparse vectors of (attribute type, attribute value) |
| pairs. There are currently over 100 different attributes that Chromium |
| can associate with a single accessible node, but most nodes only have |
| 5 - 10 of them set. Anything unset is treated as having the default value. |
| |
| An AXTreeUpdate is a pure-data snapshot of an accessibility tree or an |
| update to an existing tree. A slightly simplified version of that struct is: |
| |
| ```C++ |
| struct AXTreeUpdate { |
| AXTreeData tree_data; |
| int root_id; |
| vector<AXNodeData> nodes; |
| }; |
| ``` |
| |
| AXTreeData just has a few attributes that apply to the entire |
| tree rather than just one particular node. |
| |
| A valid AXTreeUpdate corresponding to a complete accessibility tree just has to |
| contain every node exactly once with no duplicates or redundant nodes. In other |
| words, root_id must contain the ID of one node; that node must have the IDs of |
| its children, and every node must either be the root or the child of some other |
| node. The nodes can come in any order. |
| |
| A valid AXTreeUpdate can also represent the changes needed to change an |
| existing accessibility tree. Any node that's unchanged can be completely |
| left out; only nodes that change need to be included. To insert a node, |
| just add its AXNodeData and be sure to update the child_ids in its parent. |
| To remove a node, remove it from its parent's child_ids. |
| |
| An AXTreeUpdate is *stateful* - if you're using it to update an existing |
| tree then the code that generated the update needs to understand the |
| state that the tree was in previously. |
| |
| There are two classes that represent a live node and a live tree, here |
| are the important details: |
| |
| ```C++ |
| class AXNode { |
| public: |
| ... |
| const AXNodeData& data(); |
| AXNode* GetParent() const; |
| vector<AXNode*>& GetAllChildren() const; |
| ... |
| }; |
| |
| class AXTree { |
| public: |
| AXTree(); |
| explicit AXTree(const AXTreeUpdate& initial_state); |
| bool Unserialize(const AXTreeUpdate& update); |
| AXNode* root() const; |
| AXNode* GetFromId(int id) const; |
| ... |
| }; |
| ``` |
| |
| You can construct an AXTree from an AXTreeUpdate directly, or create an |
| empty tree first. Then call Unserialize to take an AXTreeUpdate and |
| apply those changes to the current tree. It will return false if the |
| AXTreeUpdate is malformed or can't be applied to the current tree. |
| |
| Once you have an AXTree, you can get its root node, or look up nodes by ID. |
| Each node is just a thin wrapper around its underlying AXNodeData plus |
| some convenience functions to walk the tree. |
| |
| AXTree is essentially the underlying data model used to implement |
| accessibility in the browser process. As described in |
| [Part 1](how_a11y_works.md), there's a platform-specific accessibility |
| object for each node in the tree. When a platform accessibility API is |
| called on one of those nodes, it uses the corresponding AXNode's |
| AXNodeData to get the details of the node to satisfy that query. |
| |
| Nearly all accessibility APIs can be satisfied directly from AXNodeData: |
| for example a node's name, role, state, value, bounding box, allowed |
| actions, and relationships with other nodes. There are a few exceptions |
| that will be covered in [Part 3](how_a11y_works_3.md). |
| |
| BrowserAccessibilityManager is a layer on top of AXTree. It hooks up |
| the cross-platform accessibility data structures to the platform-specific |
| tree of native accessibility objects. That will be discussed in more |
| detail in [Part 3](how_a11y_works_3.md). |
| |
| ### Serializing the accessibility tree from the renderer |
| |
| The other half of the puzzle here is what happens on the renderer side. |
| Given a web page that's constantly changing, how do we serialize a |
| representation of the accessibility tree and send small, atomic updates |
| to the browser process to keep the cache in sync? |
| |
| See above for a reminder of Blink and how it fits into the render process. |
| Render accessibility consists of two main pieces: |
| |
| * A tree of lightweight accessibility nodes inside Blink that represent |
| the current state of the accessibility tree |
| * Code outside Blink that keeps track of nodes that need to be updated, |
| and periodically serializes updates to send to the browser process. |
| |
| #### The Blink Accessibility Tree |
| |
| Inside Blink, we use the following classes. |
| |
| AXObject is the base class representing one node in the accessibility |
| tree. Each AXObject is a wrapper, it either wraps a DOM Node (a blink::Node) |
| or a Blink layout object (a blink::LayoutObject). The AXObject contains |
| very little state; it caches a few attributes that are expensive to recompute |
| but otherwise doesn't store its full serialization. |
| |
| AXObjectCache is the class that represents all of the AXObjects for one |
| web page. It's owned by the Document class, only if accessibility is enabled. |
| |
| AXObjects are built lazily, on-demand. When an AXObject is added or deleted, |
| its parent marks its list of child objects as dirty so that the next time |
| it's queried it knows to compute them. |
| |
| The vast majority of the web-specific accessibility logic is in AXObject and |
| AXObjectCache. Besides just support for ARIA attributes and getting |
| accessibility information from DOM elements, here you'll find all of the code |
| that interprets the [ACCNAME](https://w3c.github.io/accname/) spec to compute |
| the accessible name and description for any HTML element, by checking |
| aria-labelledby, aria-label, and other relevant attributes, for example. |
| |
| In addition, the code in AXObject and AXObjectCache builds the structure of |
| the accessibility tree, especially in cases where it differs from the DOM tree, |
| such as CSS generated content, aria-hidden, or aria-owns. |
| |
| When the accessibility tree changes, AXObjectCache sends an event |
| notification to the render accessibility code outside of Blink indicating |
| that a node has changed and needs to be re-serialized. |
| |
| Blink does not currently have listener interfaces for all of the changes |
| that accessibility cares about. Rather, Blink code is generally |
| specifically instrumented for accessibility. So when a new DOM node is |
| inserted, or when the value of a form control changes, you'll see explicit |
| code in Blink to update the AXObjectCache. |
| |
| In earlier versions, Blink accessibility code used to post a lot of |
| specific event notifications indicating exactly what changed, for |
| example: NameChanged, ValueChanged, StateChanged, ChildrenChanged, etc. - but |
| over time we've moved away from that model. Now we usually just mark a |
| node as dirty - or the event notification is still there for historical |
| reasons but it's never consumed and only marking the node is dirty ends up |
| mattering. |
| |
| The reason for this was because we started adding code to automatically |
| generate events from tree mutations. This helped avoid entire classes of |
| problems like duplicate events. |
| |
| So in a nutshell, Blink is just responsible for notifying which nodes have |
| changed. It's more efficient to just re-serialize nodes that probably changed |
| (occasionally doing a bit of extra work) than it is to write very careful logic |
| to only update nodes that actually changed. |
| |
| #### Render Process Accessibility code (outside of Blink) |
| |
| The main render process accessibility code outside of Blink is in |
| [RenderAccessibilityImpl](https://source.chromium.org/chromium/chromium/src/+/main:content/renderer/accessibility/render_accessibility_impl.h). |
| That code maintains the connection with the browser accessibility |
| code and handles serializing updates to the accessibility tree. |
| |
| Updates to the accessibility tree are always batched. |
| |
| One important reason is because the accessibility tree can only be serialized |
| when the document's lifecycle state is *clean*. In a nutshell, every time a |
| change happens to a web page that could conceivably affect how it appears |
| on-screen, the document is dirty until Blink has had a chance to do CSS style |
| resolution and layout. Accessibility code will fail assertions if you try to |
| query certain properties when the document is in a dirty state, because |
| it could lead to inconsistent results or even crashes. |
| |
| So, accessibility changes are always queued up and then sent periodically |
| only after first ensuring that layout is complete. |
| |
| When it is time to send accessibility updates, we make use of an |
| abstraction called AXTreeSerializer. AXTreeSerializer is a class that |
| knows how to walk a tree of nodes and generate valid AXTreeUpdates that |
| incrementally update a remote AXTree. |
| |
| AXTreeSerializer is designed so that it doesn't know anything about Blink, |
| and it doesn't interpret any accessibility logic, it just knows how to |
| work with the AXNodeData and AXTreeUpdate data structures. In fact, we're |
| using AXTreeSerializer for other accessibility trees in Chromium outside |
| of Blink, and we have extensive unit tests for AXTreeSerializer that |
| serialize from one AXTree into another AXTree to test the logic in isolation. |
| |
| AXTreeSerializer is stateful; it keeps track of what nodes have been sent |
| to its counterpart. When walking the tree, if it encounters a node that |
| it hasn't serialized before, it automatically serializes it. If it |
| encounters a node that was previously serialized and wasn't marked as |
| dirty, it automatically skips it. |
| |
| AXTreeSerializer uses an interface called AXTreeSource to enable it to walk |
| any tree-like object without being tightly coupled to Blink or any other |
| specific tree. We use an implementation BlinkAXTreeSource that maps all |
| of the tree-walking and serialization calls into calls to Blink's |
| AXObject class. |
| |
| This figure shows the overall system diagram covered so far: |
| |
|  |
| |
| ## We must go deeper |
| |
| In the next section we'll explore some of the details that were |
| glossed over, including: |
| |
| * Abstracting platform-specific APIs |
| * Relative coordinates |
| * Text bounding boxes |
| * Hit testing |
| * Views and other non-web custom-drawn UI |
| |
| See [How Chrome Accessibility Works, Part 3](how_a11y_works_3.md) |
| |
| |