# Web Test Expectations and Baselines

The primary function of the web tests is as a regression test suite; this
means that, while we care about whether a page is being rendered correctly, we
care more about whether the page is being rendered the way we expect it to. In
other words, we look more for changes in behavior than we do for correctness.

[TOC]

All web tests have "expected results", or "baselines", which may take one of
several forms. The test may produce one or more of:

* A text file containing JavaScript log messages.
* A text rendering of the Render Tree.
* A screen capture of the rendered page as a PNG file.
* WAV files of the audio output, for WebAudio tests.

For any of these types of tests, baselines are checked into the web_tests
directory. The filename of a baseline is the same as that of the corresponding
test, but the extension is replaced with `-expected.{txt,png,wav}` (depending on
the type of test output). Baselines usually live alongside tests, except when
baselines vary by platform; read
[Web Test Baseline Fallback](web_test_baseline_fallback.md) for more
details.
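As a rough illustration of the naming rule above, the following shell sketch
derives a baseline filename from a test path. The function name `baseline_name`
is made up for this example and is not part of the Blink tooling; it only
mirrors the rule that the test's extension is replaced with `-expected.<ext>`.

```bash
# Hypothetical helper (not part of blink_tool.py): map a test path and an
# output type to the baseline filename described above.
baseline_name() {
  local test_path="$1"   # e.g. foo/bar/test.html
  local ext="$2"         # one of: txt, png, wav
  # Strip the test's extension, then append -expected.<ext>.
  printf '%s-expected.%s\n' "${test_path%.*}" "$ext"
}

baseline_name foo/bar/test.html txt
baseline_name foo/bar/test.html png
```

The sketch assumes the test filename has an extension; it says nothing about
which directory the baseline ends up in.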

Lastly, we also support the concept of "reference tests", which check that two
pages are rendered identically (pixel-by-pixel). As long as the two pages'
output matches, the test passes. For more on reference tests, see
[Writing ref tests](https://trac.webkit.org/wiki/Writing%20Reftests).

## Failing tests

When the output doesn't match, there are two potential reasons for it:

* The port is performing "correctly", but the output simply won't match the
  generic version. The usual reason for this is things like form controls,
  which are rendered differently on each platform.
* The port is performing "incorrectly" (i.e., the test is failing).

In both cases, the convention is to check in a new baseline (aka rebaseline),
even though that file may be codifying errors. This helps us maintain test
coverage for all the other things the test is testing while we resolve the bug.

*** promo
If a test can be rebaselined, it should always be rebaselined instead of adding
lines to TestExpectations.
***

Bugs at [crbug.com](https://crbug.com) should track fixing incorrect behavior,
not lines in
[TestExpectations](../../third_party/blink/web_tests/TestExpectations). If a
test is never supposed to pass (e.g. it's testing Windows-specific behavior, so
can't ever pass on Linux/Mac), move it to the
[NeverFixTests](../../third_party/blink/web_tests/NeverFixTests) file. That
gets it out of the way of the rest of the project.

There are some cases where you can't rebaseline and, unfortunately, we don't
have a better solution than either:

1. Reverting the patch that caused the failure, or
2. Adding a line to TestExpectations and fixing the bug later.

In these cases, **reverting the patch is strongly preferred**.

These are the cases where you can't rebaseline:

* The test is a reference test.
* The test gives different output in release and debug; in this case, generate a
  baseline with the release build, and mark the debug build as expected to fail.
* The test is flaky, crashes or times out.
* The test is for a feature that hasn't shipped on some platforms yet, but
  will shortly.

## Handling flaky tests

The
[flakiness dashboard](https://test-results.appspot.com/dashboards/flakiness_dashboard.html)
is a tool for understanding a test’s behavior over time.
Originally designed for managing flaky tests, it shows a timeline view of each
test’s results. The tool may be overwhelming at first,
but
[the documentation](https://dev.chromium.org/developers/testing/flakiness-dashboard)
should help. Once you decide that a test is truly flaky, you can suppress it
using the TestExpectations file, as described below.

We do not generally expect Chromium sheriffs to spend time trying to address
flakiness, though.

## How to rebaseline

Since baselines themselves are often platform-specific, updating baselines in
general requires fetching new test results after running the test on multiple
platforms.

### Rebaselining using try jobs

The recommended way to rebaseline for a currently-in-progress CL is to use
results from try jobs, by using the command-line tool
`third_party/blink/tools/blink_tool.py rebaseline-cl`:

1. First, upload a CL.
2. Trigger try jobs by running `blink_tool.py rebaseline-cl`. This should
   trigger jobs on
   [tryserver.blink](https://ci.chromium.org/p/chromium/g/tryserver.blink/builders).
3. Wait for all try jobs to finish.
4. Run `blink_tool.py rebaseline-cl` again to fetch new baselines.
5. Commit the new baselines and upload a new patch.

This way, the new baselines can be reviewed along with the changes, which helps
the reviewer verify that the new baselines are correct. It also means that there
is no period of time when the web test results are ignored.
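The steps above can be sketched as a pair of shell functions. This is an
illustration under assumptions, not an official script: `git cl upload` assumes
the usual depot_tools workflow, the function names and the commit message are
placeholders, and with `DRY_RUN=1` (the default here) each command is only
printed, so the snippet can be run outside a Chromium checkout.

```bash
BLINK_TOOL="third_party/blink/tools/blink_tool.py"
DRY_RUN="${DRY_RUN:-1}"   # set DRY_RUN=0 to actually execute the commands

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# Steps 1-2: upload the CL, then trigger try jobs.
start_rebaseline() {
  run git cl upload
  run "$BLINK_TOOL" rebaseline-cl
}

# Steps 4-5: run only after all try jobs have finished (step 3 is a manual wait).
finish_rebaseline() {
  run "$BLINK_TOOL" rebaseline-cl
  run git add third_party/blink/web_tests
  run git commit -m "Rebaseline web tests (placeholder message)"
  run git cl upload
}

start_rebaseline
```

The workflow is split in two because step 3 is a wait of unknown length; call
`finish_rebaseline` once the try jobs are done.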

#### Handle bot timeouts

When a change will cause many tests to fail, the try jobs may exit early because
the number of failures exceeds the limit, or the try jobs may time out because
more time is needed for the retries. Rebaselining based on such results is not
recommended. The solution is to temporarily increase the number of shards in
[test_suite_exceptions.pyl](https://source.chromium.org/chromium/chromium/src/+/main:testing/buildbot/test_suite_exceptions.pyl) in your CL.
Change the values back to their original values before sending the CL to CQ.

#### Options

The tests which `blink_tool.py rebaseline-cl` tries to download new baselines for
depend on its arguments.

* By default, it tries to download all baselines for tests that failed in the
  try jobs.
* If you pass `--only-changed-tests`, then only tests modified in the CL will be
  considered.
* You can also explicitly pass a list of test names, and then just those tests
  will be rebaselined.
* If some of the try jobs failed to run, and you wish to continue rebaselining
  assuming that there are no platform-specific results for those platforms,
  you can add the flag `--fill-missing`.
* By default, it finds the try jobs by looking at the latest patchset. If you
  have finished try jobs that are associated with an earlier patchset and you
  want to use them instead of scheduling new try jobs, you can add the flag
  `--patchset=n` to specify the patchset. This is very useful when the CL has
  'trivial' patchsets that are created e.g. by editing the CL description.
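For reference, the invocations described above might look as follows. The
commands are printed rather than executed so that the snippet runs outside a
Chromium checkout; the helper name `rebaseline_cl_cmd`, the test path, and the
patchset number are placeholders, and the flags are only the ones named in this
section.

```bash
BLINK_TOOL="third_party/blink/tools/blink_tool.py"

# Hypothetical helper: print a rebaseline-cl command line with the given arguments.
rebaseline_cl_cmd() {
  echo "$BLINK_TOOL" rebaseline-cl "$@"
}

rebaseline_cl_cmd                        # all tests that failed in the try jobs
rebaseline_cl_cmd --only-changed-tests   # only tests modified in the CL
rebaseline_cl_cmd foo/bar/test.html      # only the listed tests
rebaseline_cl_cmd --fill-missing         # continue when some try jobs failed to run
rebaseline_cl_cmd --patchset=3           # use try jobs from patchset 3
```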

### Rebaseline script in results.html

The web test results.html page linked from the bot job result page provides an
alternative way to rebaseline tests for a particular platform.

* In the bot job result page, find the web test results.html link and click it.
* Choose "Rebaseline script" from the dropdown list after "Test shown ... in format".
* Click "Copy report" (or manually copy part of the script for the tests you want
  to rebaseline).
* In a local console, change directory into `third_party/blink/web_tests/platform/<platform>`.
* Paste.
* Add files into git and commit.

The generated command includes `blink_tool.py optimize-baselines <tests>`, which
removes redundant baselines. However, the optimization doesn't work for
flag-specific baselines for now, so the rebaseline script may create redundant
baselines for flag-specific results. We prefer local manual rebaselining (see
below) for flag-specific rebaselines when possible.

### Local manual rebaselining

```bash
third_party/blink/tools/run_web_tests.py --reset-results foo/bar/test.html
```

If there are current baselines for `web_tests/foo/bar/test.html`,
the above command will overwrite the current baselines at their original
locations with the actual results. The current baseline means the `-expected.*`
file that the actual result is compared against when the test is run locally,
i.e. the first file found in the
[baseline search path](https://cs.chromium.org/search/?q=port/base.py+baseline_search_path).

If there are no current baselines, the above command will create new baselines
in the platform-independent directory, e.g.
`web_tests/foo/bar/test-expected.{txt,png}`.
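To make the "first file found" rule concrete, here is a simplified sketch of a
search-path lookup. It is not the logic in `port/base.py`: the function name
`find_current_baseline` is invented, and the two-entry search path (a
`platform/linux` directory followed by the generic directory) is an assumption
chosen to keep the demo small.

```bash
# Print the first existing baseline among the given directories, in order.
# Returns non-zero when no directory contains the file.
find_current_baseline() {
  local name="$1"; shift
  local dir
  for dir in "$@"; do
    if [ -f "$dir/$name" ]; then
      printf '%s\n' "$dir/$name"
      return 0
    fi
  done
  return 1
}

# Demo in a throwaway directory standing in for web_tests/.
root="$(mktemp -d)"
mkdir -p "$root/platform/linux/foo/bar" "$root/foo/bar"
touch "$root/foo/bar/test-expected.txt"

# Only the generic baseline exists, so it is the current baseline.
find_current_baseline foo/bar/test-expected.txt "$root/platform/linux" "$root"

# Once a platform-specific baseline exists, it is found first.
touch "$root/platform/linux/foo/bar/test-expected.txt"
find_current_baseline foo/bar/test-expected.txt "$root/platform/linux" "$root"

rm -rf "$root"
```

In this simplified model, `--reset-results` would overwrite whichever path the
lookup prints, and would create the file in the generic directory when the
lookup finds nothing.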

When you rebaseline a test, make sure your commit description explains why the
test is being rebaselined.
|
|