# Web Test Expectations and Baselines

The primary function of the web tests is as a regression test suite; this
means that, while we care about whether a page is being rendered correctly, we
care more about whether the page is being rendered the way we expect it to. In
other words, we look more for changes in behavior than we do for correctness.

[TOC]

All web tests have "expected results", or "baselines", which may take one of
several forms. The test may produce one or more of:

* A text file containing JavaScript log messages.
* A text rendering of the Render Tree.
* A screen capture of the rendered page as a PNG file.
* WAV files of the audio output, for WebAudio tests.

For any of these types of tests, baselines are checked into the web_tests
directory. The filename of a baseline is the same as that of the corresponding
test, but the extension is replaced with `-expected.{txt,png,wav}` (depending on
the type of test output). Baselines usually live alongside tests, except when
baselines vary by platform; read
[Web Test Baseline Fallback](web_test_baseline_fallback.md) for more
details.
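
The naming rule can be sketched in a few lines of Python. This is only an
illustration: `baseline_name` is a hypothetical helper, not part of the test
tooling, and it ignores platform-specific fallback directories.

```python
from pathlib import PurePosixPath

# Baseline extensions named in the text above.
BASELINE_EXTENSIONS = ("txt", "png", "wav")


def baseline_name(test_path: str, extension: str) -> str:
    """Return the baseline path that sits alongside `test_path`.

    Hypothetical helper for illustration only.
    """
    if extension not in BASELINE_EXTENSIONS:
        raise ValueError(f"unsupported baseline extension: {extension}")
    path = PurePosixPath(test_path)
    # Replace the test's own extension with `-expected.<extension>`.
    return str(path.with_name(f"{path.stem}-expected.{extension}"))


print(baseline_name("foo/bar/test.html", "txt"))  # foo/bar/test-expected.txt
```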

Lastly, we also support the concept of "reference tests", which check that two
pages are rendered identically (pixel-by-pixel). As long as the output of the
two pages matches, the test passes. For more on reference tests, see
[Writing ref tests](https://trac.webkit.org/wiki/Writing%20Reftests).

## Failing tests

When the output doesn't match, there are two potential reasons for it:

* The port is performing "correctly", but the output simply won't match the
  generic version. This usually happens with things like form controls,
  which are rendered differently on each platform.
* The port is performing "incorrectly" (i.e., the test is failing).

In both cases, the convention is to check in a new baseline (aka rebaseline),
even though that file may be codifying errors. This helps us maintain test
coverage for all the other things the test is testing while we resolve the bug.

*** promo
If a test can be rebaselined, it should always be rebaselined instead of adding
lines to TestExpectations.
***

Bugs at [crbug.com](https://crbug.com) should track fixing incorrect behavior,
not lines in
[TestExpectations](../../third_party/blink/web_tests/TestExpectations). If a
test is never supposed to pass (e.g. it's testing Windows-specific behavior, so
can't ever pass on Linux/Mac), move it to the
[NeverFixTests](../../third_party/blink/web_tests/NeverFixTests) file. That
gets it out of the way of the rest of the project.

There are some cases where you can't rebaseline and, unfortunately, we don't
have a better solution than either:

1. Reverting the patch that caused the failure, or
2. Adding a line to TestExpectations and fixing the bug later.

In these cases, **reverting the patch is strongly preferred**.

These are the cases where you can't rebaseline:

* The test is a reference test.
* The test gives different output in release and debug; in this case, generate a
  baseline with the release build, and mark the debug build as expected to fail.
* The test is flaky, crashes or times out.
* The test is for a feature that hasn't shipped on some platforms yet, but
  will shortly.

## Handling flaky tests

<!-- TODO(crbug.com/40262793): Describe the current flakiness dashboard and
     LUCI test history. -->

Once you decide that a test is truly flaky, you can suppress it using the
TestExpectations file, as [described below](#updating-the-expectations-files).
We do not generally expect Chromium sheriffs to spend time trying to address
flakiness, though.

## How to rebaseline

Since baselines themselves are often platform-specific, updating baselines in
general requires fetching new test results after running the test on multiple
platforms.

### Rebaselining using try jobs

The recommended way to rebaseline for a currently-in-progress CL is to use
results from try jobs, by using the command-line tool
`third_party/blink/tools/blink_tool.py rebaseline-cl`:

1. First, upload a CL.
2. Trigger try jobs by running `blink_tool.py rebaseline-cl`. This should
   trigger jobs on
   [tryserver.blink](https://ci.chromium.org/p/chromium/g/tryserver.blink/builders).
3. Wait for all try jobs to finish.
4. Run `blink_tool.py rebaseline-cl` again to fetch new baselines.
5. Commit the new baselines and upload a new patch.

This way, the new baselines can be reviewed along with the changes, which helps
the reviewer verify that the new baselines are correct. It also means that there
is no period of time when the web test results are ignored.

#### Handle bot timeouts

When a change will cause many tests to fail, the try jobs may exit early because
the number of failures exceeds the limit, or the try jobs may time out because
more time is needed for the retries. Rebaselining based on such results is not
recommended. The solution is to temporarily increase the number of shards in
[`test_suite_exceptions.pyl`](/testing/buildbot/test_suite_exceptions.pyl) in your CL.
Change the values back to their originals before sending the CL to CQ.

#### Options

The tests which `blink_tool.py rebaseline-cl` tries to download new baselines for
depend on its arguments.

* By default, it tries to download all baselines for tests that failed in the
  try jobs.
* If you pass `--only-changed-tests`, then only tests modified in the CL will be
  considered.
* You can also explicitly pass a list of test names, and then just those tests
  will be rebaselined.
* By default, it finds the try jobs by looking at the latest patchset. If you
  have finished try jobs that are associated with an earlier patchset and you
  want to use them instead of scheduling new try jobs, you can add the flag
  `--patchset=n` to specify the patchset. This is very useful when the CL has
  'trivial' patchsets that are created e.g. by editing the CL description.

### Rebaseline script in results.html

The web test results.html page, linked from the bot job result page, provides an
alternative way to rebaseline tests for a particular platform.

* In the bot job result page, find the web test results.html link and click it.
* Choose "Rebaseline script" from the dropdown list after "Test shown ... in format".
* Click "Copy report" (or manually copy part of the script for the tests you want
  to rebaseline).
* In a local console, change directory into `third_party/blink/web_tests/platform/<platform>`.
* Paste.
* Add files into git and commit.

The generated command includes `blink_tool.py optimize-baselines <tests>` which
removes redundant baselines.

### Local manual rebaselining

```bash
third_party/blink/tools/run_web_tests.py --reset-results foo/bar/test.html
```

If there are current baselines for `web_tests/foo/bar/test.html`,
the above command will overwrite the current baselines at their original
locations with the actual results. The current baseline means the `-expected.*`
file used to compare the actual result when the test is run locally, i.e. the
first file found in the [baseline search path](https://cs.chromium.org/search/?q=port/base.py+baseline_search_path).

If there are no current baselines, the above command will create new baselines
in the platform-independent directory, e.g.
`web_tests/foo/bar/test-expected.{txt,png}`.
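
The overwrite-or-create rule above can be modeled roughly as follows. This is a
sketch under stated assumptions, not the tool's implementation:
`baseline_to_write`, the two-entry search path, and the `platform/linux`
directory are all hypothetical, and the real search path is computed by the
port code.

```python
from typing import Iterable, Sequence


def baseline_to_write(
    baseline: str,
    search_path: Sequence[str],
    existing_files: Iterable[str],
) -> str:
    """Pick the file that a reset would write, per the description above.

    `search_path` lists baseline directories from most to least specific and
    ends with the platform-independent directory (an assumption of this sketch).
    """
    existing = set(existing_files)
    for directory in search_path:
        candidate = f"{directory}/{baseline}"
        if candidate in existing:
            # The first baseline found is the "current" one: overwrite it in place.
            return candidate
    # No current baseline: create one in the platform-independent directory.
    return f"{search_path[-1]}/{baseline}"


search_path = ["web_tests/platform/linux", "web_tests"]  # hypothetical
checked_in = ["web_tests/platform/linux/foo/bar/test-expected.txt"]
print(baseline_to_write("foo/bar/test-expected.txt", search_path, checked_in))
print(baseline_to_write("foo/bar/test-expected.png", search_path, checked_in))
```

In this model the text baseline is overwritten under `platform/linux`, while
the missing image baseline is created next to the test.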

When you rebaseline a test, make sure your commit description explains why the
test is being rebaselined.

### Rebaselining flag-specific expectations

See [Testing Runtime Flags](./web_tests.md#Testing-Runtime-Flags) for details
about flag-specific expectations.

The [Rebaseline Tool](#How-to-rebaseline) supports all flag-specific suites that
[run in CQ/CI](/third_party/blink/tools/blinkpy/common/config/builders.json).
You may also rebaseline flag-specific results locally with:

```bash
third_party/blink/tools/run_web_tests.py --flag-specific=config --reset-results foo/bar/test.html
```

New baselines will be created in the flag-specific baselines directory, e.g.
`web_tests/flag-specific/config/foo/bar/test-expected.{txt,png}`.

Then you can commit the new baselines and upload the patch for review.

Sometimes it's difficult for reviewers to review a patch containing only new
files. You can follow the steps below for easier review.

1. Copy existing baselines to the flag-specific baselines directory for the
   tests to be rebaselined:
   ```bash
   third_party/blink/tools/run_web_tests.py --flag-specific=config --copy-baselines foo/bar/test.html
   ```
   Then add the newly created baseline files, commit and upload the patch.
   Note that the above command won't copy baselines for passing tests.

2. Rebaseline the test locally:
   ```bash
   third_party/blink/tools/run_web_tests.py --flag-specific=config --reset-results foo/bar/test.html
   ```
   Commit the changes and upload the patch.

3. Request review of the CL and tell the reviewer to compare the patch sets that
   were uploaded in step 1 and step 2 to see what the rebaseline changed.

## Kinds of expectations files

* [TestExpectations](../../third_party/blink/web_tests/TestExpectations): The
  main test failure suppression file. In theory, this should be used for
  temporarily marking tests as flaky.
  See [the `run_wpt_tests.py` doc](run_web_platform_tests.md) for information
  about WPT coverage for Chrome.
* [ASANExpectations](../../third_party/blink/web_tests/ASANExpectations):
  Tests that fail under ASAN.
* [CfTTestExpectations](../../third_party/blink/web_tests/CfTTestExpectations):
  Tests that fail under Chrome for Testing.
* [LeakExpectations](../../third_party/blink/web_tests/LeakExpectations):
  Tests that have memory leaks under the leak checker.
* [MSANExpectations](../../third_party/blink/web_tests/MSANExpectations):
  Tests that fail under MSAN.
* [NeverFixTests](../../third_party/blink/web_tests/NeverFixTests): Tests
  that we never intend to fix (e.g. a test for Windows-specific behavior will
  never be fixed on Linux/Mac). Tests that will never pass on any platform
  should just be deleted, though.
* [SlowTests](../../third_party/blink/web_tests/SlowTests): Tests that take
  longer than the usual timeout to run. Slow tests are given 5x the usual
  timeout.
* [StaleTestExpectations](../../third_party/blink/web_tests/StaleTestExpectations):
  Platform-specific lines that have been in TestExpectations for many months.
  They're moved here to get them out of the way of people doing rebaselines
  since they're clearly not getting fixed anytime soon.
* [W3CImportExpectations](../../third_party/blink/web_tests/W3CImportExpectations):
  A record of which W3C tests should be imported or skipped.

### Flag-specific expectations files

It is possible to handle tests that only fail when run with a particular flag
being passed to `content_shell`. See
[web_tests/FlagExpectations/README.txt](../../third_party/blink/web_tests/FlagExpectations/README.txt)
for more.

## Updating the expectations files

### Ordering

The file is not ordered. If you put new changes somewhere in the middle of the
file, this will reduce the chance of merge conflicts when landing your patch.

### Syntax

*** promo
Please see [The Chromium Test List Format](http://bit.ly/chromium-test-list-format)
for a more complete and up-to-date description of the syntax.
***

The syntax of the file is roughly one expectation per line. An expectation can
apply to either a directory of tests, or a specific test. Lines prefixed with
`# ` are treated as comments, and blank lines are allowed as well.

The syntax of a line is roughly:

```
[ bugs ] [ "[" modifiers "]" ] test_name_or_directory [ "[" expectations "]" ]
```

* Tokens are separated by whitespace.
* **The brackets delimiting the modifiers and expectations from the bugs and the
  test_name_or_directory are not optional**; however the modifiers component is optional. In
  other words, if you want to specify modifiers or expectations, you must
  enclose them in brackets.
* If test_name_or_directory is a directory, it should end with `/*`, and all
  tests under the directory will have the expectations, unless overridden by
  more specific expectation lines. **The wildcard is intentionally only allowed at the
  end of test_name_or_directory, so that it will be easy to reason about
  which test(s) a test expectation will apply to.**
* Lines are expected to have one or more bug identifiers, and the linter will
  complain about lines missing them. Bug identifiers are of the form
  `crbug.com/12345`, `code.google.com/p/v8/issues/detail?id=12345` or
  `Bug(username)`.
* If no modifiers are specified, the expectation applies to all of the
  configurations applicable to that file.
* If specified, modifiers can be one of `Fuchsia`, `Mac`, `Mac11`,
  `Mac11-arm64`, `Mac12`, `Mac12-arm64`, `Mac13`, `Mac13-arm64`, `Mac14`,
  `Mac14-arm64`, `Mac15`, `Mac15-arm64`, `Linux`, `Win`, `Win10.20h2`,
  `Win11`, `Win11-arm64`, `Android`, `Webview`, `iOS18-Simulator`, and,
  optionally, `Release`, or `Debug`.
  Check the `# tags: ...` comments [at the top of each
  file](/third_party/blink/web_tests/TestExpectations#1) to see which modifiers
  that file supports.
* Some modifiers are meta keywords, e.g. `Win` represents `Win10.20h2` and `Win11`.
  See the `CONFIGURATION_SPECIFIER_MACROS` dictionary in
  [third_party/blink/tools/blinkpy/web_tests/port/base.py](../../third_party/blink/tools/blinkpy/web_tests/port/base.py)
  for the meta keywords and which modifiers they represent.
* Expectations can be one or more of `Crash`, `Failure`, `Pass`, `Skip`,
  `Slow`, or `Timeout`.
  Some results don't make sense for some files; check the `# results: ...`
  comment at the top of each file to see what results that file supports.
  If multiple expectations are listed, the test is considered "flaky" and any
  of those results will be considered as expected.

For example:

```
crbug.com/12345 [ Win Debug ] fast/html/keygen.html [ Crash ]
```

which indicates that the "fast/html/keygen.html" test file is expected to crash
when run in the Debug configuration on Windows, and the tracking bug for this
crash is bug \#12345 in the [Chromium issue tracker](https://crbug.com). Note
that the test will still be run, so that we can notice if it doesn't actually
crash.
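
To make the token structure concrete, here is a rough Python sketch that splits
such a line into its four components. It is not the parser the tools use:
`Expectation`, `parse_expectation_line`, and the rule for recognizing bug
identifiers by prefix are assumptions made for illustration, and it validates
far less than the real linter.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Bug identifier forms listed above; matching by prefix is an assumption.
BUG_PREFIXES = ("crbug.com/", "code.google.com/p/v8/issues/detail?id=", "Bug(")


@dataclass
class Expectation:
    bugs: List[str]
    modifiers: List[str]
    test: str
    expectations: List[str]


def _bracketed(tokens: List[str], start: int) -> Tuple[List[str], int]:
    """Read an optional `[ ... ]` group; return its items and the next index."""
    if start >= len(tokens) or tokens[start] != "[":
        return [], start
    end = tokens.index("]", start)  # ValueError if the bracket is never closed
    return tokens[start + 1:end], end + 1


def parse_expectation_line(line: str) -> Optional[Expectation]:
    """Parse one line; return None for blank lines and comments."""
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return None
    tokens = stripped.split()  # tokens are separated by whitespace
    index = 0
    bugs = []
    while index < len(tokens) and tokens[index].startswith(BUG_PREFIXES):
        bugs.append(tokens[index])
        index += 1
    modifiers, index = _bracketed(tokens, index)
    if index >= len(tokens):
        raise ValueError(f"missing test name: {line!r}")
    test = tokens[index]
    expectations, index = _bracketed(tokens, index + 1)
    if index != len(tokens):
        raise ValueError(f"unexpected trailing tokens: {line!r}")
    return Expectation(bugs, modifiers, test, expectations)


print(parse_expectation_line(
    "crbug.com/12345 [ Win Debug ] fast/html/keygen.html [ Crash ]"))
```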

Assuming you're running a debug build on Mac 10.9, the following lines are
equivalent (in terms of whether the test is performed and its expected outcome):

```
fast/html/keygen.html [ Skip ]
Bug(darin) [ Mac10.9 Debug ] fast/html/keygen.html [ Skip ]
```

### Semantics

`Slow` causes the test runner to give the test 5x the usual time limit to run.
`Slow` lines go in the
[`SlowTests` file](../../third_party/blink/web_tests/SlowTests).
A given line cannot have both `Slow` and `Timeout`.

Also, when parsing the file, we use two rules to figure out if an expectation
line applies to the current run:

1. If the configuration parameters don't match the configuration of the current
   run, the expectation is ignored.
2. Expectations that match more of a test name are used before expectations that
   match less of a test name.

If a [virtual test] has no explicit expectations (following the rules above),
it inherits its expectations from the base (nonvirtual) test.

[virtual test]: /docs/testing/web_tests.md#Virtual-test-suites

For example, if you had the following lines in your file, and you were running a
debug build on `Mac10.10`:

```
crbug.com/12345 [ Mac10.10 ] fast/html [ Failure ]
crbug.com/12345 [ Mac10.10 ] fast/html/keygen.html [ Pass ]
crbug.com/12345 [ Win11 ] fast/forms/submit.html [ Failure ]
crbug.com/12345 fast/html/section-element.html [ Failure Crash ]
```

You would expect:

* `fast/html/article-element.html` to fail with a text diff (since it is in the
  fast/html directory).
* `fast/html/keygen.html` to pass (since the exact match on the test name takes
  precedence).
* `fast/forms/submit.html` to pass (since the configuration parameters don't
  match).
* `fast/html/section-element.html` to either crash or produce a text (or image
  and text) failure, but not time out or pass.
* `virtual/foo/fast/html/article-element.html` to fail with a text diff. The
  virtual test inherits its expectation from the first line.

Test expectations can also apply to all tests under a directory (specified with a
name ending with `/*`). A more specific expectation can override a less
specific expectation. For example:
```
crbug.com/12345 virtual/composite-after-paint/* [ Skip ]
crbug.com/12345 virtual/composite-after-paint/compositing/backface-visibility/* [ Pass ]
crbug.com/12345 virtual/composite-after-paint/compositing/backface-visibility/test.html [ Failure ]
```

*** promo
Duplicate expectations are not allowed within the file and will generate
warnings.
***

You can verify that any changes you've made to an expectations file are correct
by running:

```bash
third_party/blink/tools/lint_test_expectations.py
```

which will cycle through all of the possible combinations of configurations
looking for problems.