# Layout Test Expectations and Baselines

The primary function of the LayoutTests is as a regression test suite; this
means that, while we care about whether a page is being rendered correctly, we
care more about whether the page is being rendered the way we expect it to. In
other words, we look more for changes in behavior than we do for correctness.

[TOC]

All layout tests have "expected results", or "baselines", which may take one of
several forms. The test may produce one or more of:

* A text file containing JavaScript log messages.
* A text rendering of the Render Tree.
* A screen capture of the rendered page as a PNG file.
* WAV files of the audio output, for WebAudio tests.

For any of these types of tests, there are files checked into the LayoutTests
directory whose names end in `-expected.{txt,png,wav}`. Lastly, we also support
the concept of "reference tests", which check that two pages are rendered
identically (pixel-by-pixel). As long as the output of the two pages matches,
the test passes. For more on reference tests, see
[Writing ref tests](https://trac.webkit.org/wiki/Writing%20Reftests).

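As a rough illustration of the naming convention above, the following sketch
derives the candidate baseline names for a test. The helper
`expected_baseline_names` is hypothetical and is not part of the layout test
tooling; it also ignores any platform-specific baseline locations.

```python
import posixpath

# Baseline extensions named in the text above: text, image, and audio results.
BASELINE_EXTENSIONS = ("txt", "png", "wav")


def expected_baseline_names(test_path):
    """Return the candidate baseline paths for a test, one per result type.

    Hypothetical helper for illustration only.
    """
    # Drop the test's own extension (e.g. ".html") and append the suffix.
    stem, _ = posixpath.splitext(test_path)
    return [f"{stem}-expected.{ext}" for ext in BASELINE_EXTENSIONS]


for name in expected_baseline_names("fast/html/keygen.html"):
    print(name)
```
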
## Failing tests

When the output doesn't match, there are two potential reasons for it:

* The port is performing "correctly", but the output simply won't match the
  generic version. The usual reason for this is for things like form controls,
  which are rendered differently on each platform.
* The port is performing "incorrectly" (i.e., the test is failing).

In both cases, the convention is to check in a new baseline (aka rebaseline),
even though that file may be codifying errors. This helps us maintain test
coverage for all the other things the test is testing while we resolve the bug.

*** promo
If a test can be rebaselined, it should always be rebaselined instead of adding
lines to TestExpectations.
***

Bugs at [crbug.com](https://crbug.com) should track fixing incorrect behavior,
not lines in
[TestExpectations](../../third_party/WebKit/LayoutTests/TestExpectations). If a
test is never supposed to pass (e.g. it's testing Windows-specific behavior, so
can't ever pass on Linux/Mac), move it to the
[NeverFixTests](../../third_party/WebKit/LayoutTests/NeverFixTests) file. That
gets it out of the way of the rest of the project.

There are some cases where you can't rebaseline and, unfortunately, we don't
have a better solution than either:

1. Reverting the patch that caused the failure, or
2. Adding a line to TestExpectations and fixing the bug later.

In this case, **reverting the patch is strongly preferred**.

These are the cases where you can't rebaseline:

* The test is a reference test.
* The test gives different output in release and debug; in this case, generate a
  baseline with the release build, and mark the debug build as expected to fail.
* The test is flaky, crashes or times out.
* The test is for a feature that hasn't shipped on some platforms yet, but
  will shortly.

## Handling flaky tests

The
[flakiness dashboard](https://test-results.appspot.com/dashboards/flakiness_dashboard.html)
is a tool for understanding a test’s behavior over time. Originally designed
for managing flaky tests, the dashboard shows a timeline view of each test’s
results. The tool may be overwhelming at first, but
[the documentation](https://dev.chromium.org/developers/testing/flakiness-dashboard)
should help. Once you decide that a test is truly flaky, you can suppress it
using the TestExpectations file, as described below.

We do not generally expect Chromium sheriffs to spend time trying to address
flakiness, though.

## How to rebaseline

Since baselines themselves are often platform-specific, updating baselines in
general requires fetching new test results after running the test on multiple
platforms.

### Rebaselining using try jobs

The recommended way to rebaseline for a currently-in-progress CL is to use
results from try jobs, by using the command-line tool
`third_party/WebKit/Tools/Scripts/webkit-patch rebaseline-cl`:

1. First, upload a CL.
2. Trigger try jobs by running `webkit-patch rebaseline-cl`. This should
   trigger jobs on
   [tryserver.blink](https://build.chromium.org/p/tryserver.blink/builders).
3. Wait for all try jobs to finish.
4. Run `webkit-patch rebaseline-cl` again to fetch new baselines.
   By default, this will download new baselines for any failing tests
   in the try jobs.
   (Run `webkit-patch rebaseline-cl --help` for more specific options.)
5. Commit the new baselines and upload a new patch.

This way, the new baselines can be reviewed along with the changes, which helps
the reviewer verify that the new baselines are correct. It also means that there
is no period of time when the layout test results are ignored.

#### Options

The set of tests for which `webkit-patch rebaseline-cl` tries to download new
baselines depends on its arguments.

* By default, it tries to download all baselines for tests that failed in the
  try jobs.
* If you pass `--only-changed-tests`, then only tests modified in the CL will be
  considered.
* You can also explicitly pass a list of test names, and then just those tests
  will be rebaselined.
* If some of the try jobs failed to run, and you wish to continue rebaselining
  assuming that there are no platform-specific results for those platforms,
  you can add the flag `--fill-missing`.

### Rebaselining manually

1. If the test is already listed in TestExpectations as flaky, mark the test
   `NeedsManualRebaseline` and comment out the flaky line so that your patch can
   land without turning the tree red. If the test is not in TestExpectations,
   you can add a `[ Rebaseline ]` line to TestExpectations.
2. Run `third_party/WebKit/Tools/Scripts/webkit-patch rebaseline-expectations`.
3. Post the patch created in step 2 for review.

## Kinds of expectations files

* [TestExpectations](../../third_party/WebKit/LayoutTests/TestExpectations): The
  main test failure suppression file. In theory, this should be used for
  temporarily marking tests as flaky.
* [ASANExpectations](../../third_party/WebKit/LayoutTests/ASANExpectations):
  Tests that fail under ASAN.
* [LeakExpectations](../../third_party/WebKit/LayoutTests/LeakExpectations):
  Tests that have memory leaks under the leak checker.
* [MSANExpectations](../../third_party/WebKit/LayoutTests/MSANExpectations):
  Tests that fail under MSAN.
* [NeverFixTests](../../third_party/WebKit/LayoutTests/NeverFixTests): Tests
  that we never intend to fix (e.g. a test for Windows-specific behavior will
  never be fixed on Linux/Mac). Tests that will never pass on any platform
  should just be deleted, though.
* [SlowTests](../../third_party/WebKit/LayoutTests/SlowTests): Tests that take
  longer than the usual timeout to run. Slow tests are given 5x the usual
  timeout.
* [SmokeTests](../../third_party/WebKit/LayoutTests/SmokeTests): A small subset
  of tests that we run on the Android bot.
* [StaleTestExpectations](../../third_party/WebKit/LayoutTests/StaleTestExpectations):
  Platform-specific lines that have been in TestExpectations for many months.
  They're moved here to get them out of the way of people doing rebaselines
  since they're clearly not getting fixed anytime soon.
* [W3CImportExpectations](../../third_party/WebKit/LayoutTests/W3CImportExpectations):
  A record of which W3C tests should be imported or skipped.

### Flag-specific expectations files

It is possible to handle tests that only fail when run with a particular flag
being passed to `content_shell`. See
[LayoutTests/FlagExpectations/README.txt](../../third_party/WebKit/LayoutTests/FlagExpectations/README.txt)
for more.

## Updating the expectations files

### Ordering

The file is not ordered. If you put new changes somewhere in the middle of the
file, this will reduce the chance of merge conflicts when landing your patch.

### Syntax

The syntax of the file is roughly one expectation per line. An expectation can
apply to either a directory of tests, or a specific test. Lines prefixed with
`# ` are treated as comments, and blank lines are allowed as well.

The syntax of a line is roughly:

```
[ bugs ] [ "[" modifiers "]" ] test_name [ "[" expectations "]" ]
```

* Tokens are separated by whitespace.
* **The brackets delimiting the modifiers and expectations from the bugs and the
  test_name are not optional**; however the modifiers component is optional. In
  other words, if you want to specify modifiers or expectations, you must
  enclose them in brackets.
* Lines are expected to have one or more bug identifiers, and the linter will
  complain about lines missing them. Bug identifiers are of the form
  `crbug.com/12345`, `code.google.com/p/v8/issues/detail?id=12345` or
  `Bug(username)`.
* If no modifiers are specified, the expectation applies to all of the
  configurations applicable to that file.
* Modifiers can be one or more of `Mac`, `Mac10.9`, `Mac10.10`, `Mac10.11`,
  `Retina`, `Win`, `Win7`, `Win10`, `Linux`, `Linux32`, `Precise`, `Trusty`,
  `Android`, `Release`, `Debug`.
* Some modifiers are meta keywords, e.g. `Win` represents both `Win7` and
  `Win10`. See the `CONFIGURATION_SPECIFIER_MACROS` dictionary in
  [third_party/WebKit/Tools/Scripts/webkitpy/layout_tests/port/base.py](../../third_party/WebKit/Tools/Scripts/webkitpy/layout_tests/port/base.py)
  for the meta keywords and which modifiers they represent.
* Expectations can be one or more of `Crash`, `Failure`, `Pass`, `Rebaseline`,
  `Slow`, `Skip`, `Timeout`, `WontFix`, `Missing`, `NeedsManualRebaseline`.
  If multiple expectations are listed, the test is considered "flaky" and any
  of those results will be considered as expected.

For example:

```
crbug.com/12345 [ Win Debug ] fast/html/keygen.html [ Crash ]
```

which indicates that the "fast/html/keygen.html" test file is expected to crash
when run in the Debug configuration on Windows, and the tracking bug for this
crash is bug \#12345 in the [Chromium issue tracker](https://crbug.com). Note
that the test will still be run, so that we can notice if it doesn't actually
crash.

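To make the line syntax concrete, here is a minimal sketch of a tokenizer for
a single expectation line. It is not the parser used by the layout test
tooling: `ExpectationLine`, `parse_expectation_line` and the `BUG_PREFIXES`
heuristic are all hypothetical names introduced for illustration, and the
sketch does not validate modifier or expectation keywords.

```python
from dataclasses import dataclass

# Assumption: bug identifiers are recognized by the three forms listed in the
# syntax rules (crbug.com/..., code.google.com/..., Bug(username)).
BUG_PREFIXES = ("crbug.com/", "code.google.com/", "Bug(")


@dataclass
class ExpectationLine:
    bugs: list
    modifiers: list
    test_name: str
    expectations: list


def _read_bracketed(tokens, start):
    """Return (items, next_index) for an optional "[ ... ]" group at start."""
    if start >= len(tokens) or tokens[start] != "[":
        return [], start
    end = tokens.index("]", start)  # raises ValueError if never closed
    return tokens[start + 1:end], end + 1


def parse_expectation_line(line):
    """Parse one line; return None for blank lines and comments."""
    text = line.strip()
    if not text or text.startswith("#"):
        return None
    tokens = text.split()  # tokens are separated by whitespace
    index = 0
    bugs = []
    while index < len(tokens) and tokens[index].startswith(BUG_PREFIXES):
        bugs.append(tokens[index])
        index += 1
    modifiers, index = _read_bracketed(tokens, index)
    if index >= len(tokens):
        raise ValueError("missing test name")
    test_name = tokens[index]
    expectations, index = _read_bracketed(tokens, index + 1)
    if index != len(tokens):
        raise ValueError("unexpected trailing tokens")
    return ExpectationLine(bugs, modifiers, test_name, expectations)


print(parse_expectation_line(
    "crbug.com/12345 [ Win Debug ] fast/html/keygen.html [ Crash ]"))
```
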
Assuming you're running a debug build on Mac 10.9, the following lines are all
equivalent (in terms of whether the test is performed and its expected outcome):

```
fast/html/keygen.html [ Skip ]
fast/html/keygen.html [ WontFix ]
Bug(darin) [ Mac10.9 Debug ] fast/html/keygen.html [ Skip ]
```

### Semantics

* `WontFix` implies `Skip` and also indicates that we don't have any plans to
  make the test pass.
* `WontFix` lines always go in the
  [NeverFixTests file](../../third_party/WebKit/LayoutTests/NeverFixTests) as
  we never intend to fix them. These are just for tests that only apply to some
  subset of the platforms we support.
* `WontFix` and `Skip` must be used by themselves and cannot be specified
  alongside `Crash` or another expectation keyword.
* `Slow` causes the test runner to give the test 5x the usual time limit to run.
  `Slow` lines go in the
  [SlowTests file](../../third_party/WebKit/LayoutTests/SlowTests). A given
  line cannot have both `Slow` and `Timeout`.

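The keyword restrictions above can be sketched as a small check. This is an
illustration under stated assumptions, not the logic of the real linter:
`check_expectation_keywords` is a hypothetical helper, and it only covers the
two rules stated in this section.

```python
def check_expectation_keywords(expectations):
    """Return a list of problems with one line's expectation keywords.

    Hypothetical helper; covers only the rules described in the text.
    """
    keywords = set(expectations)
    problems = []
    # WontFix and Skip must each be the only keyword on the line.
    for exclusive in ("WontFix", "Skip"):
        if exclusive in keywords and len(keywords) > 1:
            problems.append(f"{exclusive} must be used by itself")
    # Slow and Timeout are mutually exclusive.
    if {"Slow", "Timeout"} <= keywords:
        problems.append("Slow and Timeout cannot appear on the same line")
    return problems


print(check_expectation_keywords(["Failure", "Crash"]))
print(check_expectation_keywords(["Skip", "Crash"]))
print(check_expectation_keywords(["Slow", "Timeout"]))
```
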
Also, when parsing the file, we use two rules to figure out if an expectation
line applies to the current run:

1. If the configuration parameters don't match the configuration of the current
   run, the expectation is ignored.
2. Expectations that match more of a test name are used before expectations that
   match less of a test name.

For example, if you had the following lines in your file, and you were running a
debug build on `Mac10.10`:

```
crbug.com/12345 [ Mac10.10 ] fast/html [ Failure ]
crbug.com/12345 [ Mac10.10 ] fast/html/keygen.html [ Pass ]
crbug.com/12345 [ Win7 ] fast/forms/submit.html [ Failure ]
crbug.com/12345 fast/html/section-element.html [ Failure Crash ]
```

You would expect:

* `fast/html/article-element.html` to fail with a text diff (since it is in the
  fast/html directory).
* `fast/html/keygen.html` to pass (since the exact match on the test name
  takes precedence over the directory match).
* `fast/forms/submit.html` to pass (since the configuration parameters don't
  match).
* `fast/html/section-element.html` to either crash or produce a text (or image
  and text) failure, but not time out or pass.

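The two matching rules can be sketched in a few lines. This is a simplified
model, not the real implementation: the `Expectation` class and helper names
are hypothetical, the current run is described by a hand-written set of
modifiers instead of the `CONFIGURATION_SPECIFIER_MACROS` expansion, and the
sketch assumes that a line matches when any of its platform modifiers and any
of its build-type modifiers describe the current run.

```python
from dataclasses import dataclass

BUILD_TYPES = frozenset({"Release", "Debug"})


@dataclass(frozen=True)
class Expectation:
    modifiers: frozenset
    test_name: str
    results: frozenset


def config_matches(modifiers, config):
    """Rule 1: ignore lines whose modifiers don't describe the current run."""
    build = modifiers & BUILD_TYPES
    platform = modifiers - BUILD_TYPES
    return ((not build or bool(build & config))
            and (not platform or bool(platform & config)))


def name_matches(pattern, test):
    """A pattern matches the test itself or any test under that directory."""
    return test == pattern or test.startswith(pattern.rstrip("/") + "/")


def expected_results(lines, test, config):
    """Rule 2: among applicable lines, the longest matching name wins."""
    candidates = [line for line in lines
                  if config_matches(line.modifiers, config)
                  and name_matches(line.test_name, test)]
    if not candidates:
        return frozenset({"Pass"})  # no applicable line: expected to pass
    return max(candidates, key=lambda line: len(line.test_name)).results


LINES = [
    Expectation(frozenset({"Mac10.10"}), "fast/html", frozenset({"Failure"})),
    Expectation(frozenset({"Mac10.10"}), "fast/html/keygen.html",
                frozenset({"Pass"})),
    Expectation(frozenset({"Win7"}), "fast/forms/submit.html",
                frozenset({"Failure"})),
    Expectation(frozenset(), "fast/html/section-element.html",
                frozenset({"Failure", "Crash"})),
]
# Assumed description of a debug build on Mac 10.10.
CONFIG = frozenset({"Mac", "Mac10.10", "Debug"})

for test in ("fast/html/article-element.html", "fast/html/keygen.html",
             "fast/forms/submit.html", "fast/html/section-element.html"):
    print(test, sorted(expected_results(LINES, test, CONFIG)))
```
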
*** promo
Duplicate expectations are not allowed within the file and will generate
warnings.
***

You can verify that any changes you've made to an expectations file are correct
by running:

```bash
third_party/WebKit/Tools/Scripts/lint-test-expectations
```

which will cycle through all of the possible combinations of configurations
looking for problems.