ABX test
An ABX test is a method of comparing two kinds of sensory stimuli to identify detectable differences. A subject is presented with two known samples (sample A, the reference, and sample B, an alternative), and one unknown sample X, for three samples total. X is randomly selected from A and B, and the subject identifies X as being either A or B. If sample X cannot be determined reliably with a low p-value in a predetermined number of trials, then the null hypothesis cannot be rejected and it cannot be proven that there is a perceptible difference between samples A and B.
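The trial procedure can be illustrated with a small simulation. This is only a sketch under stated assumptions: the function name run_abx_session and the parameter p_detect (a hypothetical probability that the listener genuinely hears the difference on a given trial) are illustrative and do not come from any ABX tool.

```python
import random

def run_abx_session(trials, p_detect, rng):
    """Simulate one ABX session and return the number of correct answers.

    p_detect is a hypothetical probability that the listener truly hears
    the difference on a given trial; otherwise the listener guesses.
    """
    correct = 0
    for _ in range(trials):
        x = rng.choice("AB")           # X is drawn at random from A and B
        if rng.random() < p_detect:
            answer = x                 # difference heard: X is identified
        else:
            answer = rng.choice("AB")  # nothing heard: coin-flip guess
        correct += (answer == x)
    return correct

rng = random.Random(0)  # fixed seed so the run is repeatable
print(run_abx_session(16, 0.0, rng), "of 16 correct when purely guessing")
print(run_abx_session(16, 1.0, rng), "of 16 correct with an obvious difference")
```

A listener who only guesses still scores about half of the trials correctly on average, which is why a single session result has to be judged against chance.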
ABX tests can easily be performed as double-blind trials, eliminating any possible unconscious influence from the researcher or supervising technician.
ABX tests are commonly used in evaluations of digital audio data compression methods; sample A is typically an uncompressed sample, and sample B is a compressed version of A. Audible compression artifacts that indicate a shortcoming in the compression algorithm can be identified with subsequent testing. ABX tests can also be used to compare the different degrees of fidelity loss between two different audio formats at a given bitrate.
ABX tests can be used to audition input, processing, and output components as well as cabling: virtually any audio product or prototype design.
Hardware tests
ABX test equipment utilizing relays to switch between two different hardware paths can help determine if there are perceptual differences in cables and components. Video, audio and digital transmission paths can be compared. If the switching is microprocessor controlled, double-blind tests are possible.
Loudspeaker level and line level audio comparisons could be performed on an ABX test device offered for sale as the "ABX Comparator" by QSC Audio Products from 1998 to 2004. Other hardware solutions have been fabricated privately by individuals or organizations for internal testing.
Confidence
If only one ABX trial were performed, a correct answer would occur 50% of the time just by chance, the same as flipping a coin. This would prove nothing. In order to make a statement having some degree of confidence, one requires a number of repeated trials. By increasing the number of trials, the likelihood of statistically asserting a person's ability to discern the difference between A and B is enhanced for a given confidence level. A 95% confidence level is commonly considered statistically significant. The company QSC, in the ABX Comparator user manual, recommended a minimum of ten listening trials in each round of tests.
Results required for a 95% confidence level:
Number of trials | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Minimum number correct | 9 | 9 | 10 | 10 | 11 | 12 | 12 | 13 | 13 | 14 | 15 | 15 | 16 | 16 | 17 | 18 |
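The minimum scores in the table are consistent with a one-sided binomial test in which each trial is a fair coin flip under the null hypothesis. The following sketch reproduces them for 10 to 25 trials; the function names are illustrative, and the strict "probability below 0.05" cutoff is an assumption about how the table was built.

```python
from math import comb

def tail_probability(correct, trials):
    """P(at least `correct` right answers in `trials` coin-flip guesses)."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials

def minimum_correct(trials, alpha=0.05):
    """Smallest score whose chance probability is below alpha."""
    for correct in range(trials + 1):
        if tail_probability(correct, trials) < alpha:
            return correct
    return None  # even a perfect score is not rare enough (very few trials)

for n in range(10, 26):
    print(n, minimum_correct(n))
```

For example, 9 correct out of 10 has a chance probability of 11/1024 (about 1.1%), while 8 out of 10 has 56/1024 (about 5.5%), which is why 9 is the minimum for ten trials.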
QSC recommended that no more than 25 trials be performed, as listener fatigue can set in, making the test less sensitive (less likely to reveal one's actual ability to discern the difference between A and B). However, a more sensitive test can be obtained by pooling the results from a number of such tests, using either separate individuals or tests from the same listener conducted in between rest breaks. For a large number of total trials N, a significant result (one with 95% confidence) can be claimed if the number of correct responses exceeds N/2 + √N. Important decisions are normally based on a higher level of confidence, since an erroneous "significant result" would be claimed in one of 20 such tests simply by chance.
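For pooled results, a common shortcut is the normal approximation: under pure guessing the number of correct responses has mean N/2 and standard deviation √N/2, so a score above N/2 + √N lies more than two standard deviations above chance. The sketch below compares that rule of thumb with an exact one-sided binomial threshold. It assumes the pooled trials are independent, and the function names are illustrative.

```python
from math import comb, floor, sqrt

def rule_of_thumb_minimum(total_trials):
    """Smallest whole score that exceeds N/2 + sqrt(N)."""
    return floor(total_trials / 2 + sqrt(total_trials)) + 1

def exact_minimum(total_trials, alpha=0.05):
    """Smallest score whose one-sided binomial chance probability is below alpha."""
    tail = 0     # number of outcomes with at least `score` correct answers
    best = None
    for score in range(total_trials, -1, -1):
        tail += comb(total_trials, score)
        if tail / 2 ** total_trials >= alpha:
            break
        best = score
    return best

for n in (50, 100, 200):
    print(n, rule_of_thumb_minimum(n), exact_minimum(n))
```

Because two standard deviations is a stricter cutoff than a one-sided 5% test needs, the rule of thumb never asks for fewer correct responses than the exact calculation in these examples.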
Software tests
The foobar2000 and the Amarok audio players support software-based ABX testing, the latter using a third-party script. aveX is open-source software, developed mainly for Linux, which also provides test monitoring from a remote computer. More ABX software can be found at the archived PCABX website.
Algorithmic Audio Compression Evaluation
Since ABX testing requires human beings for evaluation of lossy audio codecs, it is time-consuming and costly. Therefore, cheaper approaches have been developed, e.g. PEAQ, an algorithm that computes an objective difference grade (ODG) intended to correspond to the subjective difference grade used in human-based audio tests.
MUSHRA
In MUSHRA, the listener is presented with the reference (labeled as such), a certain number of test samples, a hidden version of the reference and one or more anchors. A 0–100 rating scale makes it possible to rate very small differences.
Discrimination testing
Alternative general methods are used in discrimination testing, such as paired comparison, duo–trio, and triangle testing. Of these, duo–trio and triangle testing are particularly close to ABX testing. Schematically:
ABX: ABX – two knowns, one unknown, test is which of the knowns the unknown is: X = A or X = B.
Duo–trio: AXY – one known, two unknown (one equals A, other equals B), test is which unknown is the known: X = A (and Y = B), or Y = A (and X = B).
Triangle: XXY – three unknowns (two are A and one is B or one is A and two are B), test which is the odd one out: Y = 1, Y = 2, or Y = 3.
In this context, ABX testing is also known as "duo–trio" in "balanced reference" mode – both knowns are presented as references, rather than one alone.
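The three schematic designs can be written out as small trial generators. This is a rough sketch only: the function names and the dictionary layout are hypothetical, chosen to show which samples are labelled, which are unknown, and what the subject has to answer.

```python
import random

def abx_trial(rng):
    """Two labelled references A and B, one unknown X; answer is 'A' or 'B'."""
    x = rng.choice(["A", "B"])
    return {"known": ["A", "B"], "unknown": [x], "answer": x}

def duo_trio_trial(rng):
    """One labelled reference A, two unknowns; answer is the position of A."""
    unknowns = ["A", "B"]
    rng.shuffle(unknowns)
    return {"known": ["A"], "unknown": unknowns, "answer": unknowns.index("A")}

def triangle_trial(rng):
    """Three unknowns, two alike and one odd; answer is the odd position."""
    odd, pair = rng.sample(["A", "B"], 2)  # which stimulus appears only once
    unknowns = [pair, pair, odd]
    rng.shuffle(unknowns)
    return {"known": [], "unknown": unknowns, "answer": unknowns.index(odd)}

rng = random.Random(0)  # fixed seed so the run is repeatable
for make_trial in (abx_trial, duo_trio_trial, triangle_trial):
    print(make_trial.__name__, make_trial(rng))
```

In the ABX and duo–trio designs the subject chooses between two possible answers, whereas the triangle design offers three, so the probability of a correct answer by guessing alone differs between them.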