Selected Deep Dives

Let’s go into some association measures in more detail.

Binary association

The association between binary variables has been studied extensively over the last 100 years [Cox70, Pro, Rey84, SSC10, War19]. A binary variable has only two values. It is typical to re-encode these values into 0 or 1. How and why each of these two values is mapped to 0 or 1 is subjective, arbitrary and/or context-specific. For example, if we have a variable that captures a person’s handedness (favoring the left or right hand), we could map left to 0 and right to 1, or left to 1 and right to 0. The 0-1 value representation of a binary variable’s values is the common foundation for understanding association. Below is a contingency table created from two binary variables. Notice that the main values of the table are a, b, c and d.

  • \(a = N_{11}\) is the count of observations where both variables have a value of 1

  • \(b = N_{10}\) is the count of observations where the row variable has a value of 1 and the column variable has a value of 0

  • \(c = N_{01}\) is the count of observations where the row variable has a value of 0 and the column variable has a value of 1

  • \(d = N_{00}\) is the count of observations where both variables have a value of 0

Also, look at how the table is structured with the value 1 coming before the value 0 in both the rows and columns.

Contingency table for two binary variables

          1          0          Total
1         a          b          a + b
0         c          d          c + d
Total     a + c      b + d      n = a + b + c + d
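As a quick sketch, we can tally a, b, c and d directly from two binary variables with plain Python. The vectors x and y below are made up for illustration and are not data from this section.

```python
# Tally the 2 x 2 contingency table cells from two binary variables.
# x and y are made-up example vectors.
x = [1, 1, 0, 1, 0, 0, 1, 0]
y = [1, 0, 0, 1, 0, 1, 1, 0]

a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)  # both 1
b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)  # row variable 1, column variable 0
c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)  # row variable 0, column variable 1
d = sum(1 for u, v in zip(x, y) if u == 0 and v == 0)  # both 0
n = a + b + c + d

print(a, b, c, d, n)  # 3 1 1 3 8
```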

Note that a and d are matches and b and c are mismatches. Sometimes, depending on the context, matching on 0 is not considered a match. For example, if 1 is the presence of something and 0 is the absence, then it does not really feel right to count an observation of absence and absence as a match (you cannot say two things match on what is not there). Additionally, when 1 is presence and 0 is absence and the data is very sparse (a lot of 0’s compared to 1’s), treating absence-absence agreement as a match will make the two variables appear very similar.

In [SSC10], 76 similarity and distance measures are identified (some are duplicates or redundant). Similarity is how alike two things are, and distance is how different two things are; in other words, similarity is how close two things are and distance is how far apart they are. If a similarity or distance measure produces a value in \([0, 1]\), then we can convert between the two easily.

  • If \(s\) is the similarity, then \(d = 1 - s\) is the distance.

  • If \(d\) is the distance, then \(s = 1 - d\) is the similarity.

If we use a contingency table to summarize bivariate binary data, the following similarity and distance measures may be derived entirely from a, b, c and/or d. The general pattern is that a similarity or distance measure is typically a ratio. The numerator defines what we are interested in measuring: when a and/or d is in the numerator, we are likely measuring similarity; when b and/or c is in the numerator, we are likely measuring distance. The denominator defines what we normalize against: the matches, the mismatches or both. The following lists give some identified similarity and distance measures based on 2 x 2 contingency tables.

Similarity measures for 2 x 2 contingency table [SSC10, Unia, War19]

  • 3W-Jaccard: \(\frac{3a}{3a+b+c}\)
  • Ample: \(\left|\frac{a(c+d)}{c(a+b)}\right|\)
  • Anderberg: \(\frac{\sigma-\sigma'}{2n}\), with \(\sigma\) and \(\sigma'\) as defined under Goodman-Kruskal below
  • Baroni-Urbani-Buser-I: \(\frac{\sqrt{ad}+a}{\sqrt{ad}+a+b+c}\)
  • Baroni-Urbani-Buser-II: \(\frac{\sqrt{ad}+a-(b+c)}{\sqrt{ad}+a+b+c}\)
  • Braun-Banquet: \(\frac{a}{\max(a+b,a+c)}\)
  • Cole [SSC10, War19]: \(\frac{\sqrt{2}(ad-bc)}{\sqrt{(ad-bc)^2-(a+b)(a+c)(b+d)(c+d)}}\); also given as \(\frac{ad-bc}{\min((a+b)(a+c),(b+d)(c+d))}\)
  • Cosine: \(\frac{a}{(a+b)(a+c)}\)
  • Dennis: \(\frac{ad-bc}{\sqrt{n(a+b)(a+c)}}\)
  • Dice; Czekanowski; Nei-Li: \(\frac{2a}{2a+b+c}\)
  • Dispersion: \(\frac{ad-bc}{(a+b+c+d)^2}\)
  • Driver-Kroeber: \(\frac{a}{2}\left(\frac{1}{a+b}+\frac{1}{a+c}\right)\)
  • Eyraud: \(\frac{n^2(na-(a+b)(a+c))}{(a+b)(a+c)(b+d)(c+d)}\)
  • Fager-McGowan: \(\frac{a}{\sqrt{(a+b)(a+c)}}-\frac{\max(a+b,a+c)}{2}\)
  • Faith: \(\frac{a+0.5d}{a+b+c+d}\)
  • Forbes-II: \(\frac{na-(a+b)(a+c)}{n \min(a+b,a+c) - (a+b)(a+c)}\)
  • Forbesi: \(\frac{na}{(a+b)(a+c)}\)
  • Fossum: \(\frac{n(a-0.5)^2}{(a+b)(a+c)}\)
  • Gilbert-Wells: \(\log a - \log n - \log \frac{a+b}{n} - \log \frac{a+c}{n}\)
  • Goodman-Kruskal: \(\frac{\sigma - \sigma'}{2n-\sigma'}\), where \(\sigma=\max(a,b)+\max(c,d)+\max(a,c)+\max(b,d)\) and \(\sigma'=\max(a+c,b+d)+\max(a+b,c+d)\)
  • Gower: \(\frac{a+d}{\sqrt{(a+b)(a+c)(b+d)(c+d)}}\)
  • Gower-Legendre: \(\frac{a+d}{a+0.5b+0.5c+d}\)
  • Hamann: \(\frac{(a+d)-(b+c)}{a+b+c+d}\)
  • Inner Product: \(a+d\)
  • Intersection: \(a\)
  • Jaccard [Wikb]: \(\frac{a}{a+b+c}\)
  • Johnson: \(\frac{a}{a+b}+\frac{a}{a+c}\)
  • Kulczynski-I: \(\frac{a}{b+c}\)
  • Kulczynski-II: \(\frac{0.5a(2a+b+c)}{(a+b)(a+c)}\), equivalently \(\frac{1}{2}\left(\frac{a}{a + b} + \frac{a}{a + c}\right)\)
  • McConnaughey: \(\frac{a^2 - bc}{(a+b)(a+c)}\)
  • Michael: \(\frac{4(ad-bc)}{(a+d)^2+(b+c)^2}\)
  • Mountford: \(\frac{a}{0.5(ab + ac) + bc}\)
  • Ochiai-I [Exc]; Otsuka; Fowlkes-Mallows Index [Wika]: \(\frac{a}{\sqrt{(a+b)(a+c)}}\), equivalently \(\sqrt{\frac{a}{a + b}\frac{a}{a + c}}\)
  • Ochiai-II: \(\frac{ad}{\sqrt{(a+b)(a+c)(b+d)(c+d)}}\)
  • Pearson-Heron-I: \(\frac{ad-bc}{\sqrt{(a+b)(a+c)(b+d)(c+d)}}\)
  • Pearson-Heron-II: \(\cos\left(\frac{\pi \sqrt{bc}}{\sqrt{ad}+\sqrt{bc}}\right)\)
  • Pearson-I: \(\chi^2=\frac{n(ad-bc)^2}{(a+b)(a+c)(c+d)(b+d)}\)
  • Pearson-II: \(\sqrt{\frac{\chi^2}{n+\chi^2}}\)
  • Pearson-III: \(\sqrt{\frac{\rho}{n+\rho}}\), where \(\rho=\frac{ad-bc}{\sqrt{(a+b)(a+c)(b+d)(c+d)}}\)
  • Peirce: \(\frac{ab+bc}{ab+2bc+cd}\)
  • Roger-Tanimoto: \(\frac{a+d}{a+2b+2c+d}\)
  • Russell-Rao: \(\frac{a}{a+b+c+d}\)
  • Simpson; Overlap [Wikc]: \(\frac{a}{\min(a+b,a+c)}\)
  • Sokal-Michener; Rand Index: \(\frac{a+d}{a+b+c+d}\)
  • Sokal-Sneath-I: \(\frac{a}{a+2b+2c}\)
  • Sokal-Sneath-II: \(\frac{2a+2d}{2a+b+c+2d}\)
  • Sokal-Sneath-III: \(\frac{a+d}{b+c}\)
  • Sokal-Sneath-IV: \(\frac{1}{4}\left(\frac{a}{a+b}+\frac{a}{a+c}+\frac{d}{b+d}+\frac{d}{c+d}\right)\)
  • Sokal-Sneath-V: \(\frac{ad}{(a+b)(a+c)(b+d)\sqrt{c+d}}\)
  • Sørensen–Dice [Wikf]: \(\frac{2(a + d)}{2(a + d) + b + c}\)
  • Sorgenfrei: \(\frac{a^2}{(a+b)(a+c)}\)
  • Stiles: \(\log_{10} \frac{n\left(|ad-bc|-\frac{n}{2}\right)^2}{(a+b)(a+c)(b+d)(c+d)}\)
  • Tanimoto-I: \(\frac{a}{2a+b+c}\)
  • Tanimoto-II [Wikb]: \(\frac{a}{b + c}\)
  • Tarwid: \(\frac{na - (a+b)(a+c)}{na + (a+b)(a+c)}\)
  • Tarantula: \(\frac{a(c+d)}{c(a+b)}\)
  • Tetrachoric: \(\frac{y-1}{y+1}\), where \(y = \left(\frac{ad}{bc}\right)^{\frac{\pi}{4}}\)
  • Tversky Index [Wikg]: \(\frac{a}{a+\theta b+ \phi c}\), where \(\theta\) and \(\phi\) are user-supplied parameters
  • Yule-Q: \(\frac{ad-bc}{ad+bc}\)
  • Yule-w: \(\frac{\sqrt{ad}-\sqrt{bc}}{\sqrt{ad}+\sqrt{bc}}\)
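To make the pattern above concrete, here is a small sketch that computes a handful of these similarity measures from a, b, c and d. The counts continue the made-up example from the earlier sketch, and the variable names are just for illustration.

```python
from math import sqrt

# Made-up cell counts for illustration.
a, b, c, d = 3, 1, 1, 3
n = a + b + c + d

jaccard        = a / (a + b + c)
dice           = 2 * a / (2 * a + b + c)
sokal_michener = (a + d) / n                 # counts matches on both 1's and 0's
ochiai_i       = a / sqrt((a + b) * (a + c))
yule_q         = (a * d - b * c) / (a * d + b * c)

print(jaccard, dice, sokal_michener, ochiai_i, yule_q)
# 0.6 0.75 0.75 0.75 0.8
```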

Distance measures for 2 x 2 contingency table [SSC10]

  • Chord: \(\sqrt{2\left(1 - \frac{a}{\sqrt{(a+b)(a+c)}}\right)}\)
  • Euclid: \(\sqrt{b+c}\)
  • Hamming; Canberra; Manhattan; Cityblock; Minkowski: \(b+c\)
  • Hellinger: \(2\sqrt{1 - \frac{a}{\sqrt{(a+b)(a+c)}}}\)
  • Jaccard distance [Wikb]: \(\frac{b + c}{a + b + c}\)
  • Lance-Williams; Bray-Curtis: \(\frac{b+c}{2a+b+c}\)
  • Mean-Manhattan: \(\frac{b+c}{a+b+c+d}\)
  • Pattern Difference: \(\frac{4bc}{(a+b+c+d)^2}\)
  • Shape Difference: \(\frac{n(b+c)-(b-c)^2}{(a+b+c+d)^2}\)
  • Size Difference: \(\frac{(b+c)^2}{(a+b+c+d)^2}\)
  • Squared-Euclid: \(\sqrt{(b+c)^2}\)
  • Vari: \(\frac{b+c}{4a+4b+4c+4d}\)
  • Yule-Q: \(\frac{2bc}{ad+bc}\)
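Likewise, a sketch of a few of the distance measures, plus the \(d = 1 - s\) conversion for a measure bounded in \([0, 1]\), using the same made-up counts as before.

```python
from math import sqrt

# Same made-up cell counts as before.
a, b, c, d = 3, 1, 1, 3
n = a + b + c + d

hamming          = b + c
euclid           = sqrt(b + c)
jaccard_distance = (b + c) / (a + b + c)
mean_manhattan   = (b + c) / n

# The Jaccard similarity lies in [0, 1], so distance and similarity interconvert.
jaccard_similarity = a / (a + b + c)
print(jaccard_distance, 1 - jaccard_similarity)  # 0.4 0.4
```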

Instead of using a, b, c and d from a contingency table to define these association measures, it is common to use set notation. For two binary variables, \(X\) and \(Y\), the following are equivalent.

  • \(|X \cap Y| = a\)

  • \(|X \setminus Y| = b\)

  • \(|Y \setminus X| = c\)

  • \(|X \cup Y| = a + b + c\)

You will notice that d does not show up in the above relationship.
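We can check this set-notation correspondence with a short sketch: treat each variable as the set of observation indices where its value is 1. The vectors below are made up for illustration.

```python
# Treat each binary variable as the set of observation indices where it is 1.
x = [1, 1, 0, 1, 0, 0, 1, 0]
y = [1, 0, 0, 1, 0, 1, 1, 0]

X = {i for i, v in enumerate(x) if v == 1}
Y = {i for i, v in enumerate(y) if v == 1}

a = len(X & Y)      # |X ∩ Y|
b = len(X - Y)      # |X \ Y|
c = len(Y - X)      # |Y \ X|
union = len(X | Y)  # |X ∪ Y|

print(a, b, c, union, a + b + c)  # 3 1 1 5 5
# d never appears: indices where both variables are 0 lie outside both sets.
```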

Concordant, discordant, tie

Let’s try to understand how to determine if a pair of observations is concordant, discordant or tied. The made-up example dataset below has two variables, \(X\) and \(Y\). There are 6 observations, and each observation is associated with an index from 1 to 6. An observation has a pair of values, one for \(X\) and one for \(Y\).

Warning

Do not confuse the pair of values within a single observation with a pair of observations.

Raw Data for \(X\) and \(Y\)

Index    \(X\)    \(Y\)
1        1        3
2        1        3
3        2        4
4        0        2
5        0        4
6        2        2

Because there are 6 observations, there are \({{6}\choose{2}} = 15\) possible pairs of observations. If we denote an observation by its corresponding index as \(O_i\), the observations are as follows.

  • \(O_1 = (1, 3)\)

  • \(O_2 = (1, 3)\)

  • \(O_3 = (2, 4)\)

  • \(O_4 = (0, 2)\)

  • \(O_5 = (0, 4)\)

  • \(O_6 = (2, 2)\)

The 15 possible combinations of observation pairings are as follows.

  • \(O_1, O_2\)

  • \(O_1, O_3\)

  • \(O_1, O_4\)

  • \(O_1, O_5\)

  • \(O_1, O_6\)

  • \(O_2, O_3\)

  • \(O_2, O_4\)

  • \(O_2, O_5\)

  • \(O_2, O_6\)

  • \(O_3, O_4\)

  • \(O_3, O_5\)

  • \(O_3, O_6\)

  • \(O_4, O_5\)

  • \(O_4, O_6\)

  • \(O_5, O_6\)

For each one of these observation pairs, we can determine whether the pair is concordant, discordant or tied. There are a couple of ways to determine this status; the easiest is mathematical, and another is to use rules. Both are equivalent. Because we will use abstract notation to describe the math and rules used to determine the status of each pair, and because we are striving for clarity, let’s expand these observation pairs into their component pairs of values along with their corresponding indexed \(X\) and \(Y\) notation.

  • \(O_1, O_2 = (1, 3), (1, 3) = (X_1, Y_1), (X_2, Y_2)\)

  • \(O_1, O_3 = (1, 3), (2, 4) = (X_1, Y_1), (X_3, Y_3)\)

  • \(O_1, O_4 = (1, 3), (0, 2) = (X_1, Y_1), (X_4, Y_4)\)

  • \(O_1, O_5 = (1, 3), (0, 4) = (X_1, Y_1), (X_5, Y_5)\)

  • \(O_1, O_6 = (1, 3), (2, 2) = (X_1, Y_1), (X_6, Y_6)\)

  • \(O_2, O_3 = (1, 3), (2, 4) = (X_2, Y_2), (X_3, Y_3)\)

  • \(O_2, O_4 = (1, 3), (0, 2) = (X_2, Y_2), (X_4, Y_4)\)

  • \(O_2, O_5 = (1, 3), (0, 4) = (X_2, Y_2), (X_5, Y_5)\)

  • \(O_2, O_6 = (1, 3), (2, 2) = (X_2, Y_2), (X_6, Y_6)\)

  • \(O_3, O_4 = (2, 4), (0, 2) = (X_3, Y_3), (X_4, Y_4)\)

  • \(O_3, O_5 = (2, 4), (0, 4) = (X_3, Y_3), (X_5, Y_5)\)

  • \(O_3, O_6 = (2, 4), (2, 2) = (X_3, Y_3), (X_6, Y_6)\)

  • \(O_4, O_5 = (0, 2), (0, 4) = (X_4, Y_4), (X_5, Y_5)\)

  • \(O_4, O_6 = (0, 2), (2, 2) = (X_4, Y_4), (X_6, Y_6)\)

  • \(O_5, O_6 = (0, 4), (2, 2) = (X_5, Y_5), (X_6, Y_6)\)

Now we can finally describe how to determine whether a pair of observations is concordant, discordant or tied. If we want to use math, then, for any two observations \((X_i, Y_i)\) and \((X_j, Y_j)\), the following determines the status.

  • concordant when \((X_j - X_i)(Y_j - Y_i) > 0\)

  • discordant when \((X_j - X_i)(Y_j - Y_i) < 0\)

  • tied when \((X_j - X_i)(Y_j - Y_i) = 0\)

If we prefer rules, then the following determines the status (a short code sketch of this test follows the list).

  • concordant if (\(X_i < X_j\) and \(Y_i < Y_j\)) or (\(X_i > X_j\) and \(Y_i > Y_j\))

  • discordant if (\(X_i < X_j\) and \(Y_i > Y_j\)) or (\(X_i > X_j\) and \(Y_i < Y_j\))

  • tied if \(X_i = X_j\) or \(Y_i = Y_j\)
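Here is a minimal sketch of the mathematical test as a Python function; the function name status is just for illustration, not a standard API.

```python
def status(xi, yi, xj, yj):
    """Concordancy status of the observation pair (Xi, Yi), (Xj, Yj)."""
    p = (xj - xi) * (yj - yi)
    if p > 0:
        return 'C'  # concordant
    if p < 0:
        return 'D'  # discordant
    return 'T'      # tied (on X, on Y, or on both)

print(status(1, 3, 2, 4))  # C
print(status(1, 3, 0, 4))  # D
print(status(1, 3, 1, 3))  # T
```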

Every pair of observations evaluates to exactly one of these statuses. Continuing with our dummy data above, the concordancy status of the 15 pairs of observations is as follows (where concordant is C, discordant is D and tied is T).

Concordancy Status

\((X_i, Y_i)\)    \((X_j, Y_j)\)    status
\((1, 3)\)        \((1, 3)\)        T
\((1, 3)\)        \((2, 4)\)        C
\((1, 3)\)        \((0, 2)\)        C
\((1, 3)\)        \((0, 4)\)        D
\((1, 3)\)        \((2, 2)\)        D
\((1, 3)\)        \((2, 4)\)        C
\((1, 3)\)        \((0, 2)\)        C
\((1, 3)\)        \((0, 4)\)        D
\((1, 3)\)        \((2, 2)\)        D
\((2, 4)\)        \((0, 2)\)        C
\((2, 4)\)        \((0, 4)\)        T
\((2, 4)\)        \((2, 2)\)        T
\((0, 2)\)        \((0, 4)\)        T
\((0, 2)\)        \((2, 2)\)        T
\((0, 4)\)        \((2, 2)\)        D

In this data set, the counts are \(C=5\), \(D=5\) and \(T=5\). If we divide these counts by the total number of pairs of observations, then we get the following probabilities.

  • \(\pi_C = \frac{C}{{n \choose 2}} = \frac{5}{15} = 0.33\)

  • \(\pi_D = \frac{D}{{n \choose 2}} = \frac{5}{15} = 0.33\)

  • \(\pi_T = \frac{T}{{n \choose 2}} = \frac{5}{15} = 0.33\)
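These counts and probabilities can be reproduced with a short, self-contained sketch over all 15 pairs of the dummy data.

```python
from itertools import combinations

# The six observations from the dummy data above.
obs = [(1, 3), (1, 3), (2, 4), (0, 2), (0, 4), (2, 2)]

C = D = T = 0
for (xi, yi), (xj, yj) in combinations(obs, 2):
    p = (xj - xi) * (yj - yi)
    if p > 0:
        C += 1
    elif p < 0:
        D += 1
    else:
        T += 1

n_pairs = C + D + T  # 15 pairs from 6 observations
print(C, D, T)       # 5 5 5
print(round(C / n_pairs, 2), round(D / n_pairs, 2), round(T / n_pairs, 2))
# 0.33 0.33 0.33
```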

Sometimes, it is desirable to distinguish between the types of ties. There are three possible types of ties.

  • \(T^X\) are ties on only \(X\)

  • \(T^Y\) are ties on only \(Y\)

  • \(T^{XY}\) are ties on both \(X\) and \(Y\)

Note, \(T = T^X + T^Y + T^{XY}\). If we want to distinguish between the tie types, then the status of each pair of observations is as follows.

Concordancy Status

\((X_i, Y_i)\)    \((X_j, Y_j)\)    status
\((1, 3)\)        \((1, 3)\)        \(T^{XY}\)
\((1, 3)\)        \((2, 4)\)        C
\((1, 3)\)        \((0, 2)\)        C
\((1, 3)\)        \((0, 4)\)        D
\((1, 3)\)        \((2, 2)\)        D
\((1, 3)\)        \((2, 4)\)        C
\((1, 3)\)        \((0, 2)\)        C
\((1, 3)\)        \((0, 4)\)        D
\((1, 3)\)        \((2, 2)\)        D
\((2, 4)\)        \((0, 2)\)        C
\((2, 4)\)        \((0, 4)\)        \(T^Y\)
\((2, 4)\)        \((2, 2)\)        \(T^X\)
\((0, 2)\)        \((0, 4)\)        \(T^X\)
\((0, 2)\)        \((2, 2)\)        \(T^Y\)
\((0, 4)\)        \((2, 2)\)        D

Distinguishing between ties, in this data set the counts are \(C=5\), \(D=5\), \(T^X=2\), \(T^Y=2\) and \(T^{XY}=1\). The probabilities of these statuses are as follows.

  • \(\pi_C = \frac{C}{{n \choose 2}} = \frac{5}{15} = 0.33\)

  • \(\pi_D = \frac{D}{{n \choose 2}} = \frac{5}{15} = 0.33\)

  • \(\pi_{T^X} = \frac{T^X}{{n \choose 2}} = \frac{2}{15} = 0.13\)

  • \(\pi_{T^Y} = \frac{T^Y}{{n \choose 2}} = \frac{2}{15} = 0.13\)

  • \(\pi_{T^{XY}} = \frac{T^{XY}}{{n \choose 2}} = \frac{1}{15} = 0.07\)
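Distinguishing the tie types only requires checking which coordinate is tied; a sketch over the same dummy data follows.

```python
from itertools import combinations

obs = [(1, 3), (1, 3), (2, 4), (0, 2), (0, 4), (2, 2)]

C = D = TX = TY = TXY = 0
for (xi, yi), (xj, yj) in combinations(obs, 2):
    if xi == xj and yi == yj:
        TXY += 1              # tied on both X and Y
    elif xi == xj:
        TX += 1               # tied on X only
    elif yi == yj:
        TY += 1               # tied on Y only
    elif (xj - xi) * (yj - yi) > 0:
        C += 1                # concordant
    else:
        D += 1                # discordant

print(C, D, TX, TY, TXY)      # 5 5 2 2 1
```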

There are quite a few measures of association that use concordance as the basis for the strength of association.

Association measures using concordance

  • Goodman-Kruskal’s \(\gamma\): \(\gamma = \frac{\pi_C - \pi_D}{1 - \pi_T}\)

  • Somers’ \(d\): \(d_{Y \cdot X} = \frac{\pi_C - \pi_D}{\pi_C + \pi_D + \pi_{T^Y}}\) and \(d_{X \cdot Y} = \frac{\pi_C - \pi_D}{\pi_C + \pi_D + \pi_{T^X}}\)

  • Kendall’s \(\tau\): \(\tau = \frac{C - D}{{n \choose 2}}\)

Note

Sometimes Somers’ d is written as Somers’ D, Somers’ Delta or even, incorrectly, as Somer’s D [Gle, Wike]. Somers’ d has two versions, one symmetric and one asymmetric. The asymmetric Somers’ d is the one most typically referred to [Gle]. The definition of Somers’ d presented here is the asymmetric one, which is why there are two formulas, \(d_{Y \cdot X}\) and \(d_{X \cdot Y}\).
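As a sketch, the three measures can be computed from the counts above; with this dummy data \(C = D\), so all of them come out to 0.

```python
from math import comb

n = 6                               # number of observations
n_pairs = comb(n, 2)                # 15 pairs

# Counts from the dummy data above.
C, D, TX, TY, TXY = 5, 5, 2, 2, 1
T = TX + TY + TXY

pi_C, pi_D, pi_T = C / n_pairs, D / n_pairs, T / n_pairs

gamma = (pi_C - pi_D) / (1 - pi_T)                    # Goodman-Kruskal's gamma
d_yx = (pi_C - pi_D) / (pi_C + pi_D + TY / n_pairs)   # Somers' d, Y given X
d_xy = (pi_C - pi_D) / (pi_C + pi_D + TX / n_pairs)   # Somers' d, X given Y
tau = (C - D) / n_pairs                               # Kendall's tau as defined above

print(gamma, d_yx, d_xy, tau)  # 0.0 0.0 0.0 0.0
```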

Goodman-Kruskal’s \(\lambda\)

Goodman-Kruskal’s lambda \(\lambda_{A|B}\) measures the proportional reduction in error (PRE) for two categorical variables, \(A\) and \(B\), when we want to understand how knowing \(B\) reduces the probability of an error in predicting \(A\). \(\lambda_{A|B}\) is estimated as follows.

\(\lambda_{A|B} = \frac{P_E - P_{E|B}}{P_E}\)

Where,

  • \(P_E = 1 - \frac{\max_c N_{+c}}{N_{++}}\)

  • \(P_{E|B} = 1 - \frac{\sum_r \max_c N_{rc}}{N_{++}}\)

In plain language:

  • \(P_E\) is the probability of an error in predicting \(A\)

  • \(P_{E|B}\) is the probability of an error in predicting \(A\) given knowledge of \(B\)

The terms \(N_{+c}\), \(N_{rc}\) and \(N_{++}\) come from the contingency table we build from \(A\) and \(B\) (\(A\) is in the columns and \(B\) is in the rows) and denote the column marginal for the c-th column, the count in the r-th row and c-th column cell, and the total, respectively. To be clear:

  • \(N_{+c}\) is the column marginal for the c-th column

  • \(N_{rc}\) is the count in the cell at the r-th row and c-th column

  • \(N_{++}\) is the total number of observations

The contingency table induced with \(A\) in the columns and \(B\) in the rows will look like the following. Note that \(A\) has C columns and \(B\) has R rows, or, in other words, \(A\) has C values and \(B\) has R values.

Contingency Table for \(A\) and \(B\)

             \(A_1\)       \(A_2\)       \(\dotsb\)    \(A_C\)
\(B_1\)      \(N_{11}\)    \(N_{12}\)    \(\dotsb\)    \(N_{1C}\)
\(B_2\)      \(N_{21}\)    \(N_{22}\)    \(\dotsb\)    \(N_{2C}\)
\(\vdots\)   \(\vdots\)    \(\vdots\)    \(\ddots\)    \(\vdots\)
\(B_R\)      \(N_{R1}\)    \(N_{R2}\)    \(\dotsb\)    \(N_{RC}\)

The table above only shows the cell counts \(N_{11}, N_{12}, \ldots, N_{RC}\) and not the row and column marginals. Below, we expand the contingency table to include

  • the row marginals \(N_{1+}, N_{2+}, \ldots, N_{R+}\), as well as,

  • the column marginals \(N_{+1}, N_{+2}, \ldots, N_{+C}\).

Contingency Table for \(A\) and \(B\)

             \(A_1\)       \(A_2\)       \(\dotsb\)    \(A_C\)       Total
\(B_1\)      \(N_{11}\)    \(N_{12}\)    \(\dotsb\)    \(N_{1C}\)    \(N_{1+}\)
\(B_2\)      \(N_{21}\)    \(N_{22}\)    \(\dotsb\)    \(N_{2C}\)    \(N_{2+}\)
\(\vdots\)   \(\vdots\)    \(\vdots\)    \(\ddots\)    \(\vdots\)    \(\vdots\)
\(B_R\)      \(N_{R1}\)    \(N_{R2}\)    \(\dotsb\)    \(N_{RC}\)    \(N_{R+}\)
Total        \(N_{+1}\)    \(N_{+2}\)    \(\dotsb\)    \(N_{+C}\)    \(N_{++}\)

Note that the row marginal for a row is the sum of the values across its columns, and the column marginal for a column is the sum of the values down its rows.

  • \(N_{r+} = \sum_{c=1}^{C} N_{rc}\)

  • \(N_{+c} = \sum_{r=1}^{R} N_{rc}\)

Also, \(N_{++}\) is just the sum over all the cells (excluding the row and column marginals); \(N_{++}\) is really just the sample size.

  • \(N_{++} = \sum_{r=1}^{R} \sum_{c=1}^{C} N_{rc}\)
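A quick sketch of the marginals for a small, made-up \(R \times C\) table of counts (pure Python lists; the numbers are illustrative only).

```python
# A made-up R x C table of counts: rows are values of B, columns are values of A.
N = [[10,  5,  5],
     [ 3, 12,  5],
     [ 2,  3, 15]]

row_marginals = [sum(row) for row in N]        # N_{r+}
col_marginals = [sum(col) for col in zip(*N)]  # N_{+c}
n_total = sum(row_marginals)                   # N_{++}

print(row_marginals)  # [20, 20, 20]
print(col_marginals)  # [15, 20, 25]
print(n_total)        # 60
```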

Let’s go back to computing \(P_E\) and \(P_{E|B}\).

\(P_E\) is given as follows.

  • \(P_E = 1 - \frac{\max_c N_{+c}}{N_{++}}\)

\(\max_c N_{+c}\) returns the maximum of the column marginals, and \(\frac{\max_c N_{+c}}{N_{++}}\) is just a probability. Which probability is it? It is the largest probability associated with a value of \(A\) (specifically, the value of \(A\) with the largest count). If we were to predict which value of \(A\) will show up, we would choose the value of \(A\) with the highest probability (it is the most likely). We would be correct \(\frac{\max_c N_{+c}}{N_{++}}\) of the time and wrong \(1 - \frac{\max_c N_{+c}}{N_{++}}\) of the time. Thus, \(P_E\) is the probability of an error in predicting \(A\) knowing nothing other than the distribution, or probability mass function (PMF), of \(A\).

\(P_{E|B}\) is given as follows.

  • \(P_{E|B} = 1 - \frac{\sum_r \max_c N_{rc}}{N_{++}}\)

What is \(\max_c N_{rc}\) giving us? It is the maximum cell count in the r-th row. \(\sum_r \max_c N_{rc}\) adds up the largest value in each row, and \(\frac{\sum_r \max_c N_{rc}}{N_{++}}\) is again a probability. Which probability? It is the probability of correctly predicting the value of \(A\) when we know \(B\). When we know the value of \(B\), the predicted value of \(A\) should be the one with the largest count in that row (it has the highest count and, equivalently, the highest probability). By always choosing the value of \(A\) with the highest count for the observed value of \(B\), we are correct \(\frac{\sum_r \max_c N_{rc}}{N_{++}}\) of the time and incorrect \(1 - \frac{\sum_r \max_c N_{rc}}{N_{++}}\) of the time. Thus, \(P_{E|B}\) is the probability of an error in predicting \(A\) when we know the value of \(B\) and the PMF of \(A\) given \(B\).

The expression \(P_E - P_{E|B}\) is the reduction in the probability of an error in predicting \(A\) given knowledge of \(B\); this is the "reduction in error" part of the term PRE. The "proportional" part comes from dividing by \(P_E\): \(\frac{P_E - P_{E|B}}{P_E}\) is a proportion.

What \(\lambda_{A|B}\) is trying to compute is the reduction of error in predicting \(A\) when we know \(B\). Did we reduce any prediction error of \(A\) by knowing \(B\)?

  • When \(\lambda_{A|B} = 0\), this value means that knowing \(B\) did not reduce any prediction error in \(A\). The only way to get \(\lambda_{A|B} = 0\) is when \(P_E = P_{E|B}\).

  • When \(\lambda_{A|B} = 1\), this value means that knowing \(B\) completely reduced all prediction errors in \(A\). The only way to get \(\lambda_{A|B} = 1\) is when \(P_{E|B} = 0\).

Generally speaking, \(\lambda_{A|B} \neq \lambda_{B|A}\), and \(\lambda\) is thus an asymmetric association measure. To compute \(\lambda_{B|A}\), simply put \(B\) in the columns and \(A\) in the rows and reuse the formulas above.
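Here is a sketch that computes \(\lambda_{A|B}\) and \(\lambda_{B|A}\) for the same made-up table as before (with \(A\) in the columns and \(B\) in the rows), showing the asymmetry; the function name is illustrative.

```python
def goodman_kruskal_lambda(N):
    """Lambda for predicting the column variable from the row variable."""
    n = sum(sum(row) for row in N)
    col_marginals = [sum(col) for col in zip(*N)]
    p_e = 1 - max(col_marginals) / n                 # error knowing only the column PMF
    p_e_given = 1 - sum(max(row) for row in N) / n   # error knowing the row variable
    return (p_e - p_e_given) / p_e

# The same made-up counts: A in the columns, B in the rows.
N = [[10,  5,  5],
     [ 3, 12,  5],
     [ 2,  3, 15]]
N_t = [list(col) for col in zip(*N)]  # transpose: A in the rows, B in the columns

print(goodman_kruskal_lambda(N))      # lambda_{A|B}, about 0.343
print(goodman_kruskal_lambda(N_t))    # lambda_{B|A}, 0.425
```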

Furthermore, \(\lambda\) can be used in studies of causality [Lie83]. We are not saying it is appropriate or even possible to entertain causality with just two variables alone [Pea88, Pea09, Pea16, Pea20], but, when we have two categorical variables and want to know which is likely the cause and which the effect, the asymmetry between \(\lambda_{A|B}\) and \(\lambda_{B|A}\) may prove informative [Wikd]. Causal analysis based on two variables alone has been studied [NIP].