It works by comparing a dump of the a11y widget tree with a known-good version, but the dump seems to vary unpredictably, depending on some unknown factor. Upstream's CI currently disables all the a11y tests, so we can't expect this one to be reliable.