# Web Test Expectations and Baselines

The primary function of the web tests is as a regression test suite; this
means that, while we care about whether a page is being rendered correctly, we
care more about whether the page is being rendered the way we expect it to. In
other words, we look more for changes in behavior than we do for correctness.

[TOC]

All web tests have "expected results", or "baselines", which may take one of
several forms. The test may produce one or more of:

* A text file containing JavaScript log messages.
* A text rendering of the Render Tree.
* A screen capture of the rendered page as a PNG file.
* WAV files of the audio output, for WebAudio tests.

For any of these types of tests, baselines are checked into the web_tests
directory. The filename of a baseline is the same as that of the corresponding
test, but the extension is replaced with `-expected.{txt,png,wav}` (depending on
the type of test output). Baselines usually live alongside tests, except when
baselines vary by platform; read
[Web Test Baseline Fallback](web_test_baseline_fallback.md) for more
details.

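The naming rule above can be sketched as a small shell helper. This is only an illustration: `baseline_for` is a hypothetical function, not part of the Chromium tooling, and it ignores the platform-specific fallback directories.

```bash
# Hypothetical helper (not part of Chromium's tooling): derive the baseline
# filename for a test by swapping its extension for "-expected.<ext>".
baseline_for() {
  local test_path="$1"   # e.g. foo/bar/test.html
  local ext="$2"         # one of: txt, png, wav
  # ${test_path%.*} strips the shortest trailing ".something".
  printf '%s-expected.%s\n' "${test_path%.*}" "$ext"
}

baseline_for foo/bar/test.html txt   # prints foo/bar/test-expected.txt
```
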
Lastly, we also support the concept of "reference tests", which check that two
pages are rendered identically (pixel-by-pixel). As long as the two tests'
outputs match, the tests pass. For more on reference tests, see
[Writing ref tests](https://trac.webkit.org/wiki/Writing%20Reftests).

## Failing tests

When the output doesn't match, there are two potential reasons for it:

* The port is performing "correctly", but the output simply won't match the
  generic version. This usually happens with things like form controls,
  which are rendered differently on each platform.
* The port is performing "incorrectly" (i.e., the test is failing).

In both cases, the convention is to check in a new baseline (aka rebaseline),
even though that file may be codifying errors. This helps us maintain test
coverage for all the other things the test is testing while we resolve the bug.

*** promo
If a test can be rebaselined, it should always be rebaselined instead of adding
lines to TestExpectations.
***

Bugs at [crbug.com](https://crbug.com) should track fixing incorrect behavior,
not lines in
[TestExpectations](../../third_party/blink/web_tests/TestExpectations). If a
test is never supposed to pass (e.g. it's testing Windows-specific behavior, so
can't ever pass on Linux/Mac), move it to the
[NeverFixTests](../../third_party/blink/web_tests/NeverFixTests) file. That
gets it out of the way of the rest of the project.

There are some cases where you can't rebaseline and, unfortunately, we don't
have a better solution than either:

1. Reverting the patch that caused the failure, or
2. Adding a line to TestExpectations and fixing the bug later.

In these cases, reverting the patch is strongly preferred.

These are the cases where you can't rebaseline:

* The test is a reference test.
* The test gives different output in release and debug; in this case, generate a
  baseline with the release build, and mark the debug build as expected to fail.
* The test is flaky, crashes or times out.
* The test is for a feature that hasn't yet shipped on some platforms, but
  will shortly.

## Handling flaky tests

The
[flakiness dashboard](https://test-results.appspot.com/dashboards/flakiness_dashboard.html)
is a tool for understanding a test’s behavior over time.
Originally designed for managing flaky tests, the dashboard shows a timeline
view of the test’s results. The tool may be overwhelming at first,
but
[the documentation](https://dev.chromium.org/developers/testing/flakiness-dashboard)
should help. Once you decide that a test is truly flaky, you can suppress it
using the TestExpectations file, as described below.

We do not generally expect Chromium sheriffs to spend time trying to address
flakiness, though.

## How to rebaseline

Since baselines themselves are often platform-specific, updating baselines in
general requires fetching new test results after running the test on multiple
platforms.

### Rebaselining using try jobs

The recommended way to rebaseline for a currently-in-progress CL is to use
results from try jobs, by using the command-line tool
`third_party/blink/tools/blink_tool.py rebaseline-cl`:

1. First, upload a CL.
2. Trigger try jobs by running `blink_tool.py rebaseline-cl`. This should
   trigger jobs on
   [tryserver.blink](https://ci.chromium.org/p/chromium/g/tryserver.blink/builders).
3. Wait for all try jobs to finish.
4. Run `blink_tool.py rebaseline-cl` again to fetch new baselines.
5. Commit the new baselines and upload a new patch.

This way, the new baselines can be reviewed along with the changes, which helps
the reviewer verify that the new baselines are correct. It also means that there
is no period of time when the web test results are ignored.

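The five steps above might look like this from a shell at the root of a Chromium checkout. It is a sketch under assumptions: `git cl upload` (from depot_tools) is assumed as the upload command, the `run`, `start_rebaseline`, and `finish_rebaseline` helpers are hypothetical, and the `git add` path assumes all new baselines land under `third_party/blink/web_tests`.

```bash
BLINK_TOOL=third_party/blink/tools/blink_tool.py

# Hypothetical helper: print the command when DRY_RUN is set, else run it.
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Steps 1-2: upload the CL, then trigger the try jobs.
start_rebaseline() {
  run git cl upload                 # assumption: depot_tools is on PATH
  run "$BLINK_TOOL" rebaseline-cl
}

# Steps 4-5: after the try jobs finish, fetch baselines and upload a new patch.
# $1 is the commit message explaining why the baselines changed.
finish_rebaseline() {
  run "$BLINK_TOOL" rebaseline-cl
  run git add third_party/blink/web_tests   # assumption: baselines live here
  run git commit -m "$1"
  run git cl upload
}
```

Setting `DRY_RUN=1` before calling either function prints the commands instead of running them. Step 3 (waiting for the try jobs) happens between the two calls.
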
#### Handle bot timeouts

When a change will cause many tests to fail, the try jobs may exit early because
the number of failures exceeds the limit, or the try jobs may time out because
more time is needed for the retries. Rebaselining based on such results is not
recommended. The solution is to temporarily increase the number of shards in
[test_suite_exceptions.pyl](https://source.chromium.org/chromium/chromium/src/+/main:testing/buildbot/test_suite_exceptions.pyl) in your CL.
Change the values back to their original values before sending the CL to CQ.

#### Options

Which tests `blink_tool.py rebaseline-cl` tries to download new baselines for
depends on its arguments.

* By default, it tries to download all baselines for tests that failed in the
  try jobs.
* If you pass `--only-changed-tests`, then only tests modified in the CL will be
  considered.
* You can also explicitly pass a list of test names, and then just those tests
  will be rebaselined.
* If some of the try jobs failed to run, and you wish to continue rebaselining
  assuming that there are no platform-specific results for those platforms,
  you can add the flag `--fill-missing`.
* By default, it finds the try jobs by looking at the latest patchset. If you
  have finished try jobs that are associated with an earlier patchset and you
  want to use them instead of scheduling new try jobs, you can add the flag
  `--patchset=n` to specify the patchset. This is very useful when the CL has
  'trivial' patchsets that are created e.g. by editing the CL description.

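To make the options concrete, here is a minimal wrapper sketch. The `rebaseline_cl` function and the `DRY_RUN` switch are hypothetical conveniences; only the flags themselves (`--only-changed-tests`, `--fill-missing`, `--patchset=n`) come from the list above, and the patchset number and test path are placeholders.

```bash
BLINK_TOOL=third_party/blink/tools/blink_tool.py

# Hypothetical wrapper: forwards its arguments to `blink_tool.py rebaseline-cl`,
# or just prints the command line when DRY_RUN is set.
rebaseline_cl() {
  if [ -n "${DRY_RUN:-}" ]; then
    echo "$BLINK_TOOL rebaseline-cl $*"
  else
    "$BLINK_TOOL" rebaseline-cl "$@"
  fi
}

# Preview a few invocations without running the tool.
DRY_RUN=1
rebaseline_cl --only-changed-tests         # only tests modified in the CL
rebaseline_cl foo/bar/test.html            # only the named test (placeholder)
rebaseline_cl --patchset=3 --fill-missing  # reuse patchset 3's try jobs
```

Unset `DRY_RUN` to actually run the chosen command from the root of a checkout.
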
### Rebaseline script in results.html

The web test results.html linked from the bot job result page provides an
alternative way to rebaseline tests for a particular platform.

* In the bot job result page, find the web test results.html link and click it.
* Choose "Rebaseline script" from the dropdown list after "Test shown ... in format".
* Click "Copy report" (or manually copy part of the script for the tests you want
  to rebaseline).
* In a local console, change directory into `third_party/blink/web_tests/platform/<platform>`.
* Paste the copied script.
* Add the files to git and commit.

The generated command includes `blink_tool.py optimize-baselines <tests>` which
removes redundant baselines. However, the optimization doesn't work for
flag-specific baselines for now, so the rebaseline script may create redundant
baselines for flag-specific results. We prefer local manual rebaselining (see
below) for flag-specific rebaselines when possible.

### Local manual rebaselining

```bash
third_party/blink/tools/run_web_tests.py --reset-results foo/bar/test.html
```

If there are current baselines for `web_tests/foo/bar/test.html`,
the above command will overwrite the current baselines at their original
locations with the actual results. The current baseline means the `-expected.*`
file that the actual result is compared against when the test is run locally,
i.e. the first file found in the [baseline search path](https://cs.chromium.org/search/?q=port/base.py+baseline_search_path).

If there are no current baselines, the above command will create new baselines
in the platform-independent directory, e.g.
`web_tests/foo/bar/test-expected.{txt,png}`.

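Before running `--reset-results`, it can help to see which baselines already exist for a test, since that decides whether files are overwritten or newly created. The sketch below is a rough approximation: `list_baselines` is a hypothetical helper that only globs the platform-independent directory and `platform/*/`; it does not reproduce the real baseline search path order used by the test runner.

```bash
# Hypothetical helper: list existing baselines for a test under a web_tests dir.
# It checks the generic location first, then every platform/<name>/ directory.
list_baselines() {
  local web_tests_dir="$1"      # e.g. third_party/blink/web_tests
  local stem="${2%.*}"          # test path without its extension
  local f
  for f in "$web_tests_dir/$stem"-expected.* \
           "$web_tests_dir"/platform/*/"$stem"-expected.*; do
    # An unmatched glob stays literal, so skip anything that does not exist.
    [ -e "$f" ] && echo "$f"
  done
  return 0
}
```

For example, `list_baselines third_party/blink/web_tests foo/bar/test.html` prints nothing when no baseline exists in the locations it checks.
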
When you rebaseline a test, make sure your commit description explains why the
test is being rebaselined.