PyPair

Contingency Table Analysis

These are the basic contingency tables used to analyze categorical data.

CategoricalTable
BinaryTable
ConfusionMatrix
AgreementTable

class pypair.contingency.AgreementMixin

Bases: object

Agreement computations.

property chohen_k

Computes Cohen’s \(\kappa\).

\(\kappa = \frac{\theta_1 - \theta_2}{1 - \theta_2}\)
\(\theta_1 = \sum_i p_{ii}\)
\(\theta_2 = \sum_i p_{i+}p_{+i}\)

Returns:: \(\kappa\).
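The formula above can be sketched in plain Python. This is an illustrative computation from a square table of counts, not pypair's own implementation; the function name cohen_kappa and the sample table are assumptions made for the example.

```python
from typing import Sequence


def cohen_kappa(table: Sequence[Sequence[float]]) -> float:
    """Cohen's kappa from a square table of counts (hypothetical helper)."""
    n = sum(sum(row) for row in table)
    k = len(table)
    # Marginal proportions p_{i+} (rows) and p_{+i} (columns).
    row_p = [sum(row) / n for row in table]
    col_p = [sum(col) / n for col in zip(*table)]
    theta_1 = sum(table[i][i] for i in range(k)) / n      # observed agreement
    theta_2 = sum(row_p[i] * col_p[i] for i in range(k))  # agreement expected by chance
    return (theta_1 - theta_2) / (1 - theta_2)


# Two raters, two categories; rows are rater 1, columns are rater 2.
ratings = [[20, 5], [10, 15]]
kappa = cohen_kappa(ratings)
print(round(kappa, 4))
```

Here the observed agreement is 0.7 and the chance agreement is 0.5, so the value is 0.2 / 0.5.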

property cohen_light_k

Cohen-Light \(\kappa\). \(\kappa\) is a measure of conditional agreement. Several \(\kappa\), one for each unique value, will be computed and returned.

\(\kappa = \frac{\theta_1 - \theta_2}{1 - \theta_2}\)
\(\theta_1 = \frac{p_{ii}}{p_{i+}}\)
\(\theta_2 = p_{+i}\)

Returns:: A list of \(\kappa\).
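A minimal sketch of the conditional version, again in plain Python and independent of pypair's source. It assumes the table is square with raw counts and that every row has at least one observation; cohen_light_kappas is a hypothetical name.

```python
from typing import List, Sequence


def cohen_light_kappas(table: Sequence[Sequence[float]]) -> List[float]:
    """One conditional kappa per category i (hypothetical helper)."""
    n = sum(sum(row) for row in table)
    col_totals = [sum(col) for col in zip(*table)]
    kappas = []
    for i, row in enumerate(table):
        p_ii = row[i] / n
        p_row = sum(row) / n          # p_{i+}
        theta_1 = p_ii / p_row        # agreement given the row category
        theta_2 = col_totals[i] / n   # p_{+i}
        kappas.append((theta_1 - theta_2) / (1 - theta_2))
    return kappas


kappas = cohen_light_kappas([[20, 5], [10, 15]])
print([round(k, 4) for k in kappas])
```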

class pypair.contingency.AgreementStats(table: Sequence[Sequence[int]])

Bases: AgreementMixin, ContingencyTable

Computes agreement stats.

__init__(table: Sequence[Sequence[int]]) → None

ctor.

Parameters:: table – Contingency table.

Bases: AgreementMixin, ContingencyTable

Represents a contingency table for agreement data on one variable. The variable is typically a rating variable (e.g. dislike, neutral, like), and the data is a pairing of ratings over the same set of items. The agreement table induced by the data is typically square, i.e. the number of rows equals the number of columns.

ctor.

Parameters:

a – Categorical variable.
b – Categorical variable.
a_vals – Values in a. Default None; figure out empirically.
b_vals – Values in b. Default None; figure out empirically.

class pypair.contingency.BinaryMixin

Bases: object

Binary computations based off of a, b, c and d from a 2x2 contingency table.
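As a rough illustration of how a, b, c and d feed the measures in this class, the sketch below tallies the four cells from two binary sequences and evaluates three of the formulas. The cell layout (a = both one, b = first one and second zero, c = first zero and second one, d = both zero) is an assumption of this example, as is the helper name binary_counts; pypair may lay out the cells differently, and the pseudocount option in the BinaryTable signature suggests it may adjust the counts, whereas raw counts are used here.

```python
from typing import Iterable, Tuple


def binary_counts(x: Iterable[int], y: Iterable[int]) -> Tuple[int, int, int, int]:
    """Tally the 2x2 cells (a, b, c, d) under the layout assumed above."""
    a = b = c = d = 0
    for xi, yi in zip(x, y):
        if xi == 1 and yi == 1:
            a += 1
        elif xi == 1 and yi == 0:
            b += 1
        elif xi == 0 and yi == 1:
            c += 1
        else:
            d += 1
    return a, b, c, d


x = [1, 1, 1, 0, 0, 0, 1, 0]
y = [1, 1, 0, 0, 0, 1, 1, 0]
a, b, c, d = binary_counts(x, y)

jaccard = a / (a + b + c)           # similarity
dice = 2 * a / (2 * a + b + c)      # similarity
hamming = b + c                     # distance
print((a, b, c, d), jaccard, dice, hamming)
```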

property ample

Ample

\(\left|\frac{a(c+d)}{c(a+b)}\right|\)

Returns:: Ample.

property anderberg

Anderberg

\(\frac{\sigma-\sigma'}{2n}\)

Returns:: Anderberg.

property baroni_urbani_buser_i

Baroni-Urbani-Buser-I

\(\frac{\sqrt{ad}+a}{\sqrt{ad}+a+b+c}\)

Returns:: Baroni-Urbani-Buser-I.

property baroni_urbani_buser_ii

Baroni-Urbani-Buser-II

\(\frac{\sqrt{ad}+a-(b+c)}{\sqrt{ad}+a+b+c}\)

Returns:: Baroni-Urbani-Buser-II.

property braun_banquet

Braun-Banquet

\(\frac{a}{\max(a+b,a+c)}\)

Returns:: Braun-Banquet.

property chisq

\(\chi^2\) (alias for Pearson-I)

Returns:: \(\chi^2\).

property chord

Chord

\(\sqrt{2\left(1 - \frac{a}{\sqrt{(a+b)(a+c)}}\right)}\)

Returns:: Chord (distance).

property cole_i

Cole-I

\(\frac{\sqrt{2}(ad-bc)}{\sqrt{(ad-bc)^2-(a+b)(a+c)(b+d)(c+d)}}\)

Returns:: Cole-I.

property cole_ii

Cole-II

\(\frac{ad-bc}{\min((a+b)(a+c),(b+d)(c+d))}\)

Returns:: Cole-II.

property contingency_coefficient

Contingency coefficient.

Returns:: Contingency coefficient.

property cosine

Cosine

\(\frac{a}{(a+b)(a+c)}\)

Returns:: Cosine.

property cramer_v

Cramer’s V.

Returns:: Cramer’s V.

property dennis

Dennis

\(\frac{ad-bc}{\sqrt{n(a+b)(a+c)}}\)

Returns:: Dennis.

property dice

Dice; Czekanowski; Nei-Li

\(\frac{2a}{2a+b+c}\)

Returns:: Dice.

property disperson

Dispersion

\(\frac{ad-bc}{(a+b+c+d)^2}\)

Returns:: Dispersion.

property driver_kroeber

Driver-Kroeber

\(\frac{a}{2}\left(\frac{1}{a+b}+\frac{1}{a+c}\right)\)

Returns:: Driver-Kroeber.

property euclid

Euclid

\(\sqrt{b+c}\)

Returns:: Euclid (distance).

property eyraud

Eyraud

\(\frac{n^2(na-(a+b)(a+c))}{(a+b)(a+c)(b+d)(c+d)}\)

Returns:: Eyraud.

property fager_mcgowan

Fager-McGowan

\(\frac{a}{\sqrt{(a+b)(a+c)}}-\frac{\max(a+b,a+c)}{2}\)

Returns:: Fager-McGowan.

property faith

Faith

\(\frac{a+0.5d}{a+b+c+d}\)

Returns:: Faith.

property forbes_ii

Forbes-II

\(\frac{na-(a+b)(a+c)}{n \min(a+b,a+c) - (a+b)(a+c)}\)

Returns:: Forbes-II.

property forbesi

Forbes-I

\(\frac{na}{(a+b)(a+c)}\)

Returns:: Forbes-I.

property fossum

Fossum

\(\frac{n(a-0.5)^2}{(a+b)(a+c)}\)

Returns:: Fossum.

property gilbert_wells

Gilbert-Wells

\(\log a - \log n - \log \frac{a+b}{n} - \log \frac{a+c}{n}\)

Returns:: Gilbert-Wells.

property goodman_kruskal

Goodman-Kruskal

\(\frac{\sigma - \sigma'}{2n-\sigma'}\)

Returns:: Goodman-Kruskal.

property gower

Gower

\(\frac{a+d}{\sqrt{(a+b)(a+c)(b+d)(c+d)}}\)

Returns:: Gower.

property gower_legendre

Gower-Legendre

\(\frac{a+d}{a+0.5b+0.5c+d}\)

Returns:: Gower-Legendre.

property hamann

Hamann.

\(\frac{(a+d)-(b+c)}{a+b+c+d}\)

Returns:: Hamann.

property hamming

Hamming; Canberra; Manhattan; Cityblock; Minkowski

\(b+c\)

Returns:: Hamming (distance).

property hellinger

Hellinger

\(2\sqrt{1 - \frac{a}{\sqrt{(a+b)(a+c)}}}\)

Returns:: Hellinger (distance).

property inner_product

Inner-product.

\(a+d\)

Returns:: Inner-product.

property intersection

Intersection

\(a\)

Returns:: Intersection.

property jaccard

Jaccard

\(\frac{a}{a+b+c}\)

Returns:: Jaccard.

property jaccard_3w

3W-Jaccard

\(\frac{3a}{3a+b+c}\)

Returns:: 3W-Jaccard.

property jaccard_distance

Jaccard

\(\frac{b + c}{a + b + c}\)

Returns:: Jaccard (distance).

property johnson

Johnson.

\(\frac{a}{a+b}+\frac{a}{a+c}\)

Returns:: Johnson.

property kulcyznski_ii

Kulczynski-II

\(\frac{0.5a(2a+b+c)}{(a+b)(a+c)}\)

Returns:: Kulczynski-II.

property kulczynski_i

Kulczynski-I

\(\frac{a}{b+c}\)

Returns:: Kulczynski-I.

property lance_williams

Lance-Williams; Bray-Curtis

\(\frac{b+c}{2a+b+c}\)

Returns:: Lance-Williams (distance).

property mcconnaughey

McConnaughey

\(\frac{a^2 - bc}{(a+b)(a+c)}\)

Returns:: McConnaughey.

property mcnemar_test

McNemar’s test.

Returns:: A tuple. First element is chi-square test statistics. Second element is p-value.

property mean_manhattan

Mean-Manhattan

\(\frac{b+c}{a+b+c+d}\)

Returns:: Mean-Manhattan (distance).

property michael

Michael

\(\frac{4(ad-bc)}{(a+d)^2+(b+c)^2}\)

Returns:: Michael.

property mountford

Mountford

\(\frac{a}{0.5(ab + ac) + bc}\)

Returns:: Mountford.

property ochia_i

Ochiai-I

Also known as the Fowlkes-Mallows Index. This measure is typically used to judge the similarity between two clusterings. A larger value indicates that the clusterings are more similar.

\(\frac{a}{\sqrt{(a+b)(a+c)}}\)

Returns:: Ochiai-I.

property ochia_ii

Ochiai-II

\(\frac{ad}{\sqrt{(a+b)(a+c)(b+d)(c+d)}}\)

Returns:: Ochiai-II.

property odds_ratio

Odds ratio. The odds ratio is also referred to as the cross-product ratio.

Returns:: Odds ratio.

property pattern_difference

Pattern difference

\(\frac{4bc}{(a+b+c+d)^2}\)

Returns:: Pattern difference (distance).

property pearson_heron_i

Pearson-Heron-I

\(\frac{ad-bc}{\sqrt{(a+b)(a+c)(b+d)(c+d)}}\)

Returns:: Pearson-Heron-I.

property pearson_heron_ii

Pearson-Heron-II

\(\sqrt{\frac{\chi^2}{n+\chi^2}}\)

Returns:: Pearson-Heron-II.

property pearson_i

Pearson-I

\(\chi^2=\frac{n(ad-bc)^2}{(a+b)(a+c)(c+d)(b+d)}\)

Returns:: Pearson-I.
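A small worked sketch, in plain Python and not taken from pypair, that evaluates the Pearson-I and Pearson-Heron-I formulas for one table. It also checks the identity \(\chi^2 = n\rho^2\), which follows directly from the two formulas as written. The function names and the cell values are assumptions made for illustration.

```python
import math


def pearson_i(a: float, b: float, c: float, d: float) -> float:
    """Chi-square statistic for a 2x2 table, per the Pearson-I formula."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (a + c) * (c + d) * (b + d))


def pearson_heron_i(a: float, b: float, c: float, d: float) -> float:
    """Pearson-Heron-I, per the formula given for that property."""
    return (a * d - b * c) / math.sqrt((a + b) * (a + c) * (b + d) * (c + d))


a, b, c, d = 3, 1, 1, 3
n = a + b + c + d
chi2 = pearson_i(a, b, c, d)
rho = pearson_heron_i(a, b, c, d)
print(chi2, rho, n * rho ** 2)
```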

property peirce

Peirce

\(\frac{ab+bc}{ab+2bc+cd}\)

Returns:: Peirce.

property person_ii

Pearson-II

\(\sqrt{\frac{\rho}{n+\rho}}\)

\(\rho=\frac{ad-bc}{\sqrt{(a+b)(a+c)(b+d)(c+d)}}\)

Returns:: Pearson-II.

property roger_tanimoto

Rogers-Tanimoto

\(\frac{a+d}{a+2b+2c+d}\)

Returns:: Rogers-Tanimoto.

property russel_rao

Russell-Rao

\(\frac{a}{a+b+c+d}\)

Returns:: Russell-Rao.

property shape_difference

Shape difference

\(\frac{n(b+c)-(b-c)^2}{(a+b+c+d)^2}\)

Returns:: Shape difference (distance).

property simpson

Simpson (or Overlap).

\(\frac{a}{\min(a+b,a+c)}\)

Returns:: Simpson.

property size_difference

Size difference

\(\frac{(b+c)^2}{(a+b+c+d)^2}\)

Returns:: Size difference (distance).

property sokal_michener

Sokal-Michener

\(\frac{a+d}{a+b+c+d}\)

Returns:: Sokal-Michener.

property sokal_sneath_i

Sokal-Sneath-I

\(\frac{a}{a+2b+2c}\)

Returns:: Sokal-Sneath-I.

property sokal_sneath_ii

Sokal-Sneath-II

\(\frac{2a+2d}{2a+b+c+2d}\)

Returns:: Sokal-Sneath-II.

property sokal_sneath_iii

Sokal-Sneath-III

\(\frac{a+d}{b+c}\)

Returns:: Sokal-Sneath-III.

property sokal_sneath_iv

Sokal-Sneath-IV

\(\frac{ad}{(a+b)(a+c)(b+d)\sqrt{c+d}}\)

Returns:: Sokal-Sneath-IV.

property sokal_sneath_v

Sokal-Sneath-V

\(\frac{1}{4}\left(\frac{a}{a+b}+\frac{a}{a+c}+\frac{d}{b+d}+\frac{d}{c+d}\right)\)

Returns:: Sokal-Sneath-V.

property sorensen_dice

Sørensen–Dice

\(\frac{2(a + d)}{2(a + d) + b + c}\)

Returns:: Sørensen–Dice.

property sorgenfrei

Sorgenfrei

\(\frac{a^2}{(a+b)(a+c)}\)

Returns:: Sorgenfrei.

property stiles

Stiles

\(\log_{10} \frac{n\left(|ad-bc|-\frac{n}{2}\right)^2}{(a+b)(a+c)(b+d)(c+d)}\)

Returns:: Stiles.

property tanimoto_distance

Tanimoto similarity and distance.

Returns:: Tanimoto distance.

property tanimoto_i

Tanimoto-I

\(\frac{a}{2a+b+c}\)

Returns:: Tanimoto-I.

property tanimoto_ii

Tanimoto-II

\(\frac{a}{b + c}\)

Returns:: Tanimoto-II.

property tarantula

Tarantula

\(\frac{a(c+d)}{c(a+b)}\)

Returns:: Tarantula.

property tarwid

Tarwid

\(\frac{na - (a+b)(a+c)}{na + (a+b)(a+c)}\)

Returns:: Tarwid.

property tetrachoric

Tetrachoric correlation takes values in \([-1, 1]\), where 0 indicates no agreement, 1 indicates perfect agreement and -1 indicates perfect disagreement.

if \(b=0\) or \(c=0\), 1.0
if \(a=0\) or \(d=0\), -1.0
else, \(\frac{y-1}{y+1}, y={\left(\frac{da}{bc}\right)}^{\frac{\pi}{4}}\)

Returns:: Tetrachoric correlation.

property tschuprow_t

Tschuprow’s T.

Returns:: Tschuprow’s T.

tversky_index(theta=1, phi=0)

Computes Tversky’s index.

\(\frac{a}{a+\theta b+\phi c}\)

\(\theta\) and \(\phi\) typically lie in \([0,1]\), with \(\theta + \phi = 1\).

Parameters:

theta – Weight \([0,1]\) of how important match on row variable is. Default 1.
phi – Weight \([0,1]\) of how important match on column variable is. Default 0.

Returns:

Tversky’s Index.
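A minimal sketch of the formula as a standalone function (not pypair's method, which takes a, b and c from the table itself). It also shows that, directly from the formulas in this class, \(\theta = \phi = 1\) reproduces Jaccard and \(\theta = \phi = 0.5\) reproduces Dice. The cell values are made up for the example.

```python
def tversky_index(a: float, b: float, c: float,
                  theta: float = 1.0, phi: float = 0.0) -> float:
    """a / (a + theta * b + phi * c), with the same defaults as documented above."""
    return a / (a + theta * b + phi * c)


a, b, c = 3, 1, 1
default_value = tversky_index(a, b, c)             # theta=1, phi=0
as_jaccard = tversky_index(a, b, c, 1.0, 1.0)      # equals a / (a + b + c)
as_dice = tversky_index(a, b, c, 0.5, 0.5)         # equals 2a / (2a + b + c)
print(default_value, as_jaccard, as_dice)
```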

property vari

Vari

\(\frac{b+c}{4a+4b+4c+4d}\)

Returns:: Vari (distance).

property yule_q

Yule’s Q

\(\frac{ad-bc}{ad+bc}\)

Also, Yule’s Q is based off of the odds ratio or cross-product ratio, \(\alpha\).

\(Q = \frac{\alpha - 1}{\alpha + 1}\)

Yule’s Q is the same as Goodman-Kruskal’s \(\gamma\) for 2 x 2 contingency tables and is also a measure of proportional reduction in error (PRE).

Returns:: Yule’s Q.
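The two expressions for Q can be checked against each other numerically. The sketch below is plain Python with made-up cell counts; it takes the odds ratio to be \(\alpha = \frac{ad}{bc}\), the usual cross-product form, and assumes b and c are nonzero.

```python
def yule_q_from_cells(a: float, b: float, c: float, d: float) -> float:
    """Q computed directly from the four cells."""
    return (a * d - b * c) / (a * d + b * c)


def yule_q_from_odds_ratio(a: float, b: float, c: float, d: float) -> float:
    """Q computed through the odds ratio alpha = ad / bc."""
    alpha = (a * d) / (b * c)
    return (alpha - 1) / (alpha + 1)


a, b, c, d = 8, 2, 1, 4
q_cells = yule_q_from_cells(a, b, c, d)
q_alpha = yule_q_from_odds_ratio(a, b, c, d)
print(q_cells, q_alpha)
```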

property yule_q_difference

Yule’s q

\(\frac{2bc}{ad+bc}\)

Returns:: Yule’s q (distance).

property yule_w

Yule’s w

\(\frac{\sqrt{ad}-\sqrt{bc}}{\sqrt{ad}+\sqrt{bc}}\)

Returns:: Yule’s w.

property yule_y

Yule’s Y is based off of the odds ratio or cross-product ratio, \(\alpha\).

\(Y = \frac{\sqrt\alpha - 1}{\sqrt\alpha + 1}\)

Returns:: Yule’s Y.
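With \(\alpha = \frac{ad}{bc}\) (the usual cross-product form, assumed here), the expression for Y is algebraically the same as the one given for yule_w: multiply the numerator and denominator of Y by \(\sqrt{bc}\). A quick numeric check in plain Python, with made-up counts and hypothetical function names:

```python
import math


def yule_y(a: float, b: float, c: float, d: float) -> float:
    """Y from the odds ratio alpha = ad / bc."""
    root_alpha = math.sqrt((a * d) / (b * c))
    return (root_alpha - 1) / (root_alpha + 1)


def yule_w(a: float, b: float, c: float, d: float) -> float:
    """W from the square roots of the two cross products."""
    return (math.sqrt(a * d) - math.sqrt(b * c)) / (math.sqrt(a * d) + math.sqrt(b * c))


a, b, c, d = 8, 2, 1, 4
y_value = yule_y(a, b, c, d)
w_value = yule_w(a, b, c, d)
print(y_value, w_value)
```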

class pypair.contingency.BinaryStats(table: Sequence[Sequence[int]])

Bases: CategoricalMixin, BinaryMixin, ContingencyTable

Computes binary stats.

__init__(table: Sequence[Sequence[int]]) → None

ctor.

Parameters:: table – Contingency table.

class pypair.contingency.BinaryTable(a: ndarray[tuple[Any, ...], dtype[Any]] | SupportsToNumpy | Sequence[Any] | Iterable[Any], b: ndarray[tuple[Any, ...], dtype[Any]] | SupportsToNumpy | Sequence[Any] | Iterable[Any], a_0: object = 0, a_1: object = 1, b_0: object = 0, b_1: object = 1, pseudocount: bool = True)

Bases: CategoricalMixin, BinaryMixin, ContingencyTable

Represents a contingency table for binary variables.

__init__(a: ndarray[tuple[Any, ...], dtype[Any]] | SupportsToNumpy | Sequence[Any] | Iterable[Any], b: ndarray[tuple[Any, ...], dtype[Any]] | SupportsToNumpy | Sequence[Any] | Iterable[Any], a_0: object = 0, a_1: object = 1, b_0: object = 0, b_1: object = 1, pseudocount: bool = True) → None

ctor.

Parameters:

a – Iterable list.
b – Iterable list.
a_0 – The zero value for a. Defaults to 0.
a_1 – The one value for a. Defaults to 1.
b_0 – The zero value for b. Defaults to 0.
b_1 – The one value for b. Defaults to 1.

class pypair.contingency.CategoricalMixin

Bases: object

Categorical computations based off a contingency table.

property adjusted_rand_index

The Adjusted Rand Index (ARI) typically lies in [0, 1]; negative values can arise when the index is less than its expected value. This property uses binom() from scipy.special, and when n >= 300 the intermediate results become very large and may overflow.

TODO: use a different way to compute binomial coefficient

Returns:: Adjusted Rand Index.

property chisq

The chi-square statistic \(\chi^2\), is defined as follows.

\(\sum_i \sum_j \frac{(O_{ij} - E_{ij})^2}{E_{ij}}\)

In a contingency table, \(O_{ij}\) is the observed cell count in row \(i\) and column \(j\), and \(E_{ij}\) is the corresponding expected cell count.

\(E_{ij} = \frac{N_{i*} N_{*j}}{N}\)

Where \(N_{i*}\) is the i-th row marginal, \(N_{*j}\) is the j-th column marginal and \(N\) is the sum of all the values in the contingency cells (or the total size of the data).

References

Chi-Square Statistic Definition

Returns:: Chi-square statistic.

property chisq_dof

Returns the degrees of freedom for \(\chi^2\), which is defined as \((R - 1)(C - 1)\), where \(R\) is the number of rows and \(C\) is the number of columns in a contingency table induced by two categorical variables.

Returns:: Degrees of freedom.
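The statistic and its degrees of freedom can be sketched for an arbitrary R x C table of counts as follows. This is a plain-Python illustration rather than pypair's code; it assumes every row and column total is positive, and the function names are hypothetical.

```python
from typing import Sequence


def chi_square(table: Sequence[Sequence[float]]) -> float:
    """Sum over cells of (observed - expected)^2 / expected."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # N_{i*} N_{*j} / N
            stat += (observed - expected) ** 2 / expected
    return stat


def chi_square_dof(table: Sequence[Sequence[float]]) -> int:
    """(R - 1)(C - 1) for an R x C table."""
    return (len(table) - 1) * (len(table[0]) - 1)


table = [[10, 20, 30], [20, 20, 20]]
stat = chi_square(table)
dof = chi_square_dof(table)
print(round(stat, 4), dof)
```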

property gk_lambda

Goodman-Kruskal’s lambda is the proportional reduction in error of predicting one variable b given another a: \(\lambda_{B|A}\).

The probability of an error in predicting the column category: \(P_e = 1 - \frac{\max_{c} N_{* c}}{N}\)
The probability of an error in predicting the column category given the row category: \(P_{e|r} = 1 - \frac{\sum_r \max_{c} N_{r c}}{N}\)

Where,

\(\max_{c} N_{* c}\) is the maximum of the column marginals
\(\sum_r \max_{c} N_{r c}\) is the sum over the maximum value per row
\(N\) is the total

Thus, \(\lambda_{B|A} = \frac{P_e - P_{e|r}}{P_e}\).

By default, the contingency table is set up with a on the rows and b on the columns. Note that Goodman-Kruskal’s lambda is not symmetric: \(\lambda_{B|A}\) does not necessarily equal \(\lambda_{A|B}\). By default, \(\lambda_{B|A}\) is computed; for the reverse, use gk_lambda_reversed.

Returns:: Goodman-Kruskal’s lambda.

property gk_lambda_reversed

Computes \(\lambda_{A|B}\).

Returns:: Goodman-Kruskal’s lambda.
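A minimal sketch of both directions, assuming a table of raw counts with a on the rows and b on the columns. It is written in plain Python for illustration and is not pypair's implementation; the reversed value is obtained here simply by transposing the table, and lambda_b_given_a is a hypothetical name.

```python
from typing import Sequence


def lambda_b_given_a(table: Sequence[Sequence[float]]) -> float:
    """Proportional reduction in error when predicting the column from the row."""
    n = sum(sum(row) for row in table)
    col_totals = [sum(col) for col in zip(*table)]
    p_e = 1 - max(col_totals) / n                           # error without the row
    p_e_given_r = 1 - sum(max(row) for row in table) / n    # error knowing the row
    return (p_e - p_e_given_r) / p_e


table = [[30, 10], [5, 25]]
lam = lambda_b_given_a(table)
# Transposing swaps the roles of the two variables.
lam_reversed = lambda_b_given_a([list(col) for col in zip(*table)])
print(round(lam, 4), round(lam_reversed, 4))
```

The two values differ for this table, which illustrates the asymmetry described above.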

property mutual_information

The mutual information between two variables \(X\) and \(Y\) is denoted as \(I(X;Y)\). \(I(X;Y)\) is non-negative and unbounded above, i.e. it lies in the range \([0, \infty)\). A higher mutual information value implies a stronger association. The formula for \(I(X;Y)\) is defined as follows.

\(I(X;Y) = \sum_y \sum_x P(x, y) \log \frac{P(x, y)}{P(x) P(y)}\)

Returns:: Mutual information.
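A rough sketch of the double sum over a table of counts, in plain Python. It assumes the natural logarithm (the formula above does not fix the base) and skips empty cells, treating \(0 \log 0\) as 0; the function name is hypothetical and this is not pypair's code.

```python
import math
from typing import Sequence


def mutual_information(table: Sequence[Sequence[float]]) -> float:
    """I(X;Y) in nats from a table of joint counts."""
    n = sum(sum(row) for row in table)
    p_x = [sum(row) / n for row in table]
    p_y = [sum(col) / n for col in zip(*table)]
    total = 0.0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            if count == 0:
                continue  # empty cells contribute nothing
            p_xy = count / n
            total += p_xy * math.log(p_xy / (p_x[i] * p_y[j]))
    return total


mi_independent = mutual_information([[10, 10], [10, 10]])
mi_perfect = mutual_information([[10, 0], [0, 10]])
print(mi_independent, mi_perfect)
```

For the independent table every ratio inside the logarithm is 1, so the sum is 0; for the perfectly associated 2 x 2 table it equals \(\log 2\).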

property phi

Gets \(\phi\).

\(\phi = \sqrt{\frac{\chi^2}{N}}\)

Returns:: \(\phi\).

property uncertainty_coefficient

The uncertainty coefficient \(U(X|Y)\) for two variables \(X\) and \(Y\) is defined as follows.

\(U(X|Y) = \frac{I(X;Y)}{H(X)}\)

Where,

\(H(X) = -\sum_x P(x) \log P(x)\)
\(I(X;Y) = \sum_y \sum_x P(x, y) \log \frac{P(x, y)}{P(x) P(y)}\)

\(H(X)\) is called the entropy of \(X\) and \(I(X;Y)\) is the mutual information between \(X\) and \(Y\). Note that \(0 \le I(X;Y) \le H(X)\). As such, the uncertainty coefficient may be viewed as the normalized mutual information between \(X\) and \(Y\), and it lies in the range \([0, 1]\).

Returns:: Uncertainty coefficient.
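The ratio can be sketched in plain Python as below. Which variable sits on the rows is an assumption of this example (X on the rows, Y on the columns), as is the natural logarithm; the base cancels in the ratio. The function name is hypothetical and the code is not taken from pypair.

```python
import math
from typing import Sequence


def uncertainty_coefficient(table: Sequence[Sequence[float]]) -> float:
    """U(X|Y) = I(X;Y) / H(X), with X on the rows of a table of counts."""
    n = sum(sum(row) for row in table)
    p_x = [sum(row) / n for row in table]
    p_y = [sum(col) / n for col in zip(*table)]
    h_x = -sum(p * math.log(p) for p in p_x if p > 0)
    i_xy = 0.0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            if count > 0:
                p_xy = count / n
                i_xy += p_xy * math.log(p_xy / (p_x[i] * p_y[j]))
    return i_xy / h_x


table = [[10, 10, 0, 0], [0, 0, 10, 10]]
u_x_given_y = uncertainty_coefficient(table)
# Transposing the table swaps the roles of X and Y.
u_y_given_x = uncertainty_coefficient([list(col) for col in zip(*table)])
print(round(u_x_given_y, 4), round(u_y_given_x, 4))
```

In this table the column determines the row exactly, but the row only narrows the column down to two of four values, so the two directions give different results.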

property uncertainty_coefficient_reversed

Uncertainty coefficient.

Returns:: Uncertainty coefficient.

class pypair.contingency.CategoricalStats(table: Sequence[Sequence[int]])

Bases: CategoricalMixin, ContingencyTable

Computes categorical stats.

__init__(table: Sequence[Sequence[int]]) → None

ctor.

Parameters:: table – Contingency table.

Bases: CategoricalMixin, ContingencyTable

Represents a contingency table for categorical variables.

ctor. If a_vals or b_vals are None, then the possible values will be determined empirically from the data.

Parameters:

a – Iterable list.
b – Iterable list.
a_vals – All possible values in a. Defaults to None.
b_vals – All possible values in b. Defaults to None.

class pypair.contingency.ConfusionMatrix(a: ndarray[tuple[Any, ...], dtype[Any]] | SupportsToNumpy | Sequence[Any] | Iterable[Any], b: ndarray[tuple[Any, ...], dtype[Any]] | SupportsToNumpy | Sequence[Any] | Iterable[Any], a_0: object = 0, a_1: object = 1, b_0: object = 0, b_1: object = 1, pseudocount: bool = True)

Bases: ConfusionMixin, ContingencyTable

Represents a confusion matrix. The confusion matrix looks like what is shown below for two binary variables a and b; a is in the rows and b in the columns. Most of the performance statistics come from the counts of TN, FN, FP and TP.

Confusion Matrix
	b=0	b=1
a=0	TN	FP
a=1	FN	TP

__init__(a: ndarray[tuple[Any, ...], dtype[Any]] | SupportsToNumpy | Sequence[Any] | Iterable[Any], b: ndarray[tuple[Any, ...], dtype[Any]] | SupportsToNumpy | Sequence[Any] | Iterable[Any], a_0: object = 0, a_1: object = 1, b_0: object = 0, b_1: object = 1, pseudocount: bool = True) → None

ctor. Note that a is the ground truth and b is the prediction.

Parameters:

a – Binary variable (iterable). Ground truth.
b – Binary variable (iterable). Prediction.
a_0 – The zero value for a. Defaults to 0.
a_1 – The one value for a. Defaults to 1.
b_0 – The zero value for b. Defaults to 0.
b_1 – The one value for b. Defaults to 1.
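To make the layout concrete, the sketch below tallies TN, FP, FN and TP from a ground-truth sequence a and a prediction sequence b, following the table shown for this class. It is plain Python with raw counts and default 0/1 coding; it does not reproduce pypair's handling of pseudocount, and confusion_counts is a hypothetical helper.

```python
from typing import Iterable, Tuple


def confusion_counts(a: Iterable[int], b: Iterable[int]) -> Tuple[int, int, int, int]:
    """Return (TN, FP, FN, TP) with a as ground truth and b as prediction."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(a, b):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


truth = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
tn, fp, fn, tp = confusion_counts(truth, pred)
accuracy = (tp + tn) / (tp + tn + fp + fn)
print((tn, fp, fn, tp), accuracy)
```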

class pypair.contingency.ConfusionMixin

Bases: object

Confusion matrix computations.

property acc

Accuracy.

\(ACC = \frac{TP + TN}{TP + TN + FP + FN}\)

Returns:: Accuracy.

property ba

Balanced accuracy.

\(BA = \frac{TPR + TNR}{2}\)

Returns:: Balanced accuracy.

property bm

Bookmaker informedness.

\(BM = TPR + TNR - 1\)

Returns:: BM.

property dor

Diagnostic odds ratio.

\(\frac{PLR}{NLR}\)

Returns:: DOR.
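A short numeric sketch of the ratio, using the PLR and NLR definitions given for those properties in this class. Substituting the rate definitions shows that the result reduces to \(\frac{TP \times TN}{FP \times FN}\), which the code checks. The counts are made up, the function name is hypothetical, and FP and FN are assumed to be nonzero.

```python
def diagnostic_odds_ratio(tp: float, fn: float, fp: float, tn: float) -> float:
    """DOR = PLR / NLR, built from the individual rates."""
    tpr = tp / (tp + fn)
    fnr = fn / (fn + tp)
    fpr = fp / (fp + tn)
    tnr = tn / (tn + fp)
    plr = tpr / fpr   # positive likelihood ratio
    nlr = fnr / tnr   # negative likelihood ratio
    return plr / nlr


tp, fn, fp, tn = 3, 1, 1, 5
dor = diagnostic_odds_ratio(tp, fn, fp, tn)
dor_direct = (tp * tn) / (fp * fn)
print(round(dor, 6), dor_direct)
```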

property f1

F1 score: harmonic mean of precision and sensitivity.

\(F1 = \frac{2 \times PPV \times TPR}{PPV + TPR}\)

Returns:: F1.

property fdr

False discovery rate.

\(FDR = \frac{FP}{FP + TP}\)

Returns:: FDR.

property fn

FN

Returns:: FN.

property fnr

False negative rate.

\(FNR = \frac{FN}{FN + TP}\)

Aliases

miss rate

Returns:: FNR.

property fomr

False omission rate.

\(FOR = \frac{FN}{FN + TN}\)

Returns:: FOR.

property fp

FP

Returns:: FP.

property fpr

False positive rate.

\(FPR = \frac{FP}{FP + TN}\)

Aliases

fall-out
probability of false alarm

Returns:: FPR.

property mcc

Matthews correlation coefficient.

\(MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}\)

Returns:: MCC.

property mk

Markedness.

\(MK = PPV + NPV - 1\)

Aliases

deltaP

Returns:: Markedness.

property n

\(N = TP + FN + FP + TN\)

Returns:: N.

property nlr

Negative likelihood ratio.

\(NLR = \frac{FNR}{TNR}\)

Aliases

LR-

Returns:: NLR.

property npv

Negative predictive value.

\(NPV = \frac{TN}{TN + FN}\)

Returns:: NPV.

property plr

Positive likelihood ratio.

\(PLR = \frac{TPR}{FPR}\)

Aliases

LR+

Returns:: PLR.

property ppv

Positive predictive value.

\(PPV = \frac{TP}{TP + FP}\)

Aliases

precision

Returns:: PPV.

property precision

Alias to PPV.

Returns:: PPV.

property prevalence

Prevalence.

\(\frac{TP + FN}{N}\)

Returns:: Prevalence.

property pt

Prevalence threshold.

\(PT = \frac{\sqrt{TPR(-TNR + 1)} + TNR - 1}{TPR + TNR - 1}\)

Returns:: Prevalence threshold.
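A small sketch that evaluates the expression above and compares it with the algebraically equivalent form \(\frac{\sqrt{FPR}}{\sqrt{TPR} + \sqrt{FPR}}\), where \(FPR = 1 - TNR\). The equivalence follows from factoring the numerator and denominator; the first form is undefined when \(TPR + TNR = 1\). The function names and input rates are assumptions for illustration.

```python
import math


def prevalence_threshold(tpr: float, tnr: float) -> float:
    """PT as written in the formula above."""
    return (math.sqrt(tpr * (1 - tnr)) + tnr - 1) / (tpr + tnr - 1)


def prevalence_threshold_alt(tpr: float, tnr: float) -> float:
    """Equivalent form in terms of sqrt(TPR) and sqrt(FPR)."""
    fpr = 1 - tnr
    return math.sqrt(fpr) / (math.sqrt(tpr) + math.sqrt(fpr))


tpr, tnr = 0.75, 5 / 6
pt = prevalence_threshold(tpr, tnr)
pt_alt = prevalence_threshold_alt(tpr, tnr)
print(round(pt, 4), round(pt_alt, 4))
```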

property recall

Alias to TPR.

Returns:: TPR.

property sensitivity

Alias to TPR.

Returns:: Sensitivity.

property specificity

Alias to TNR.

Returns:: Specificity.

property tn

TN

Returns:: TN.

property tnr

True negative rate.

\(TNR = \frac{TN}{TN + FP}\)

Aliases

specificity
selectivity

Returns:: TNR.

property tp

TP

Returns:: TP.

property tpr

True positive rate.

\(TPR = \frac{TP}{TP + FN}\)

Aliases

sensitivity
recall
hit rate
power
probability of detection

Returns:: TPR.

property ts

Threat score.

\(TS = \frac{TP}{TP + FN + FP}\)

Aliases

critical success index (CSI).

Returns:: TS.

class pypair.contingency.ConfusionStats(table: Sequence[Sequence[int]])

Bases: ConfusionMixin, ContingencyTable

Computes confusion matrix stats.

__init__(table: Sequence[Sequence[int]]) → None

ctor.

Parameters:: table – Contingency table.

class pypair.contingency.ContingencyTable(table: Sequence[Sequence[int]])

Bases: MeasureMixin, ABC

Abstract contingency table. All other tables inherit from this one.

__init__(table: Sequence[Sequence[int]]) → None

ctor.

Parameters:: table – A table of counts (list of lists).

Biserial

These are the biserial association measures.

Bases: MeasureMixin, BiserialMixin, object

Biserial association between a binary and continuous variable.

__init__(b: ndarray[tuple[Any, ...], dtype[Any]] | SupportsToNumpy | Sequence[Any] | Iterable[Any], c: ndarray[tuple[Any, ...], dtype[Any]] | SupportsToNumpy | Sequence[int | float] | Iterable[int | float], b_0: object = 0, b_1: object = 1) → None

class pypair.biserial.BiserialMixin

Bases: object

Biserial computations based off of \(n, p, q, y_0, y_1, \sigma\).

property biserial: float

property point_biserial: float

property rank_biserial: float

class pypair.biserial.BiserialStats(n: int, p: float, y_0: float, y_1: float, std: float)

Bases: MeasureMixin, BiserialMixin, object

Computes biserial stats.

__init__(n: int, p: float, y_0: float, y_1: float, std: float) → None

pypair.biserial.pd_isna(values: ndarray[tuple[Any, ...], dtype[Any]] | SupportsToNumpy | Sequence[Any] | Iterable[Any]) → ndarray[tuple[Any, ...], dtype[bool]]

Continuous

These are the continuous association measures.

Bases: MeasureMixin, ConcordanceMixin, object

Concordance for continuous and ordinal data.

ctor.

Parameters:

x – Continuous or ordinal data (iterable).
y – Continuous or ordinal data (iterable).

class pypair.continuous.ConcordanceMixin

Bases: object

property goodman_kruskal_gamma: float

Goodman-Kruskal \(\gamma\) is similar to Somers’ D. It is defined as follows.

\(\gamma = \frac{\pi_c - \pi_d}{1 - \pi_t}\)

Where

\(\pi_c = \frac{C}{n}\)
\(\pi_d = \frac{D}{n}\)
\(\pi_t = \frac{T}{n}\)
\(C\) is the number of concordant pairs
\(D\) is the number of discordant pairs
\(T\) is the number of ties
\(n\) is the sample size

Returns:: \(\gamma\).
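A brute-force sketch of the pair counting behind \(\gamma\), in plain Python. It reads \(n\) in the proportions above as the number of pairs compared and \(T\) as all tied pairs, which is an assumption of this example; under that reading the formula reduces to \(\frac{C - D}{C + D}\). The helper names are hypothetical, and the quadratic loop is meant for illustration rather than large inputs.

```python
from itertools import combinations
from typing import Sequence, Tuple


def count_pairs(x: Sequence[float], y: Sequence[float]) -> Tuple[int, int, int]:
    """Return (concordant, discordant, tied) counts over all pairs of observations."""
    c = d = t = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            c += 1
        elif sign < 0:
            d += 1
        else:
            t += 1  # tied on x, on y, or on both
    return c, d, t


def goodman_kruskal_gamma(x: Sequence[float], y: Sequence[float]) -> float:
    c, d, t = count_pairs(x, y)
    n_pairs = c + d + t
    pi_c, pi_d, pi_t = c / n_pairs, d / n_pairs, t / n_pairs
    return (pi_c - pi_d) / (1 - pi_t)


x = [1, 2, 3, 4, 5]
y = [1, 3, 2, 4, 4]
counts = count_pairs(x, y)
gamma = goodman_kruskal_gamma(x, y)
print(counts, round(gamma, 4))
```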

property kendall_tau: float

Kendall’s \(\tau\) is defined as follows.

\(\tau = \frac{C - D}{{{n}\choose{2}}}\)

Where

\(C\) is the number of concordant pairs
\(D\) is the number of discordant pairs
\(n\) is the sample size

Returns:: \(\tau\).
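The formula above divides by the total number of pairs, so tied pairs lower the magnitude of \(\tau\). A minimal plain-Python sketch, using a hypothetical function name and a brute-force loop over all pairs:

```python
from itertools import combinations
from math import comb
from typing import Sequence


def kendall_tau(x: Sequence[float], y: Sequence[float]) -> float:
    """(C - D) / (n choose 2), counting pairs by brute force."""
    c = d = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            c += 1
        elif sign < 0:
            d += 1
    return (c - d) / comb(len(x), 2)


tau = kendall_tau([1, 2, 3, 4, 5], [1, 3, 2, 4, 4])
print(tau)
```

Here there are 8 concordant pairs, 1 discordant pair and 1 tied pair out of 10.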

property somers_d: tuple[float, float]

Computes Somers’ d for two continuous variables. Note that Somers’ d is defined for \(d_{X \cdot Y}\) and \(d_{Y \cdot X}\) and in general \(d_{X \cdot Y} \neq d_{Y \cdot X}\).

\(d_{Y \cdot X} = \frac{\pi_c - \pi_d}{\pi_c + \pi_d + \pi_t^Y}\)
\(d_{X \cdot Y} = \frac{\pi_c - \pi_d}{\pi_c + \pi_d + \pi_t^X}\)

Where

\(\pi_c = \frac{C}{n}\)
\(\pi_d = \frac{D}{n}\)
\(\pi_t^X = \frac{T^X}{n}\)
\(\pi_t^Y = \frac{T^Y}{n}\)
\(C\) is the number of concordant pairs
\(D\) is the number of discordant pairs
\(T^X\) is the number of ties on \(X\)
\(T^Y\) is the number of ties on \(Y\)
\(n\) is the sample size

Returns:: \(d_{X \cdot Y}\), \(d_{Y \cdot X}\).
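A brute-force sketch of the two directions, in plain Python. It assumes that \(T^X\) counts pairs tied on \(X\) only, \(T^Y\) counts pairs tied on \(Y\) only, and pairs tied on both are left out; that reading is an assumption of this example, as is the function name. Since every proportion shares the same denominator \(n\), raw counts are used directly.

```python
from itertools import combinations
from typing import Sequence, Tuple


def somers_d(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Return (d_{X.Y}, d_{Y.X}) from pair counts."""
    c = d = t_x = t_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue          # tied on both variables
        if dx == 0:
            t_x += 1          # tied on X only
        elif dy == 0:
            t_y += 1          # tied on Y only
        elif dx * dy > 0:
            c += 1
        else:
            d += 1
    d_yx = (c - d) / (c + d + t_y)
    d_xy = (c - d) / (c + d + t_x)
    return d_xy, d_yx


d_xy, d_yx = somers_d([1, 2, 3, 4, 5], [1, 3, 2, 4, 4])
print(round(d_xy, 4), round(d_yx, 4))
```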

class pypair.continuous.ConcordanceStats(d: int, t_xy: int, t_x: int, t_y: int, c: int, n: int)

Bases: MeasureMixin, ConcordanceMixin

Computes concordance stats.

__init__(d: int, t_xy: int, t_x: int, t_y: int, c: int, n: int) → None

ctor.

Parameters:

d – Number of discordant pairs.
t_xy – Number of ties on XY pairs.
t_x – Number of ties on X pairs.
t_y – Number of ties on Y pairs.
c – Number of concordant pairs.
n – Total number of pairs.

Bases: MeasureMixin, object

ctor.

Parameters:

a – Continuous variable (iterable).
b – Continuous variable (iterable).

property kendall: tuple[float, float]

Kendall’s tau.

Returns:: Kendall’s tau, p-value.

property pearson: tuple[float, float]

Pearson’s r.

Returns:: Pearson’s r, p-value.

property regression: tuple[float, float]

Linear regression.

Returns:: Coefficient, p-value

property spearman: tuple[float, float]

Spearman’s r.

Returns:: Spearman’s r, p-value.

Bases: MeasureMixin, object

Correlation ratio.

ctor.

Parameters:

x – Categorical variable (iterable).
y – Continuous variable (iterable).

property anova: tuple[float, float]

Computes an ANOVA test.

Returns:: F-statistic, p-value.

property calinski_harabasz: float

Calinski-Harabasz Index.

Returns:: Calinski-Harabasz Index.

property davies_bouldin: float

Davies-Bouldin Index.

Returns:: Davies-Bouldin Index.

property eta: float

Gets \(\eta\).

Returns:: \(\eta\).

property eta_squared: float

Gets \(\eta^2 = \frac{\sigma_{\bar{y}}^2}{\sigma_{y}^2}\)

Returns:: \(\eta^2\).
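One common way to evaluate this ratio is as the between-group sum of squares over the total sum of squares, where the groups are the categories of x. The sketch below follows that reading in plain Python; it is an illustration under that assumption rather than pypair's code, and eta_squared is a hypothetical standalone function.

```python
import math
from collections import defaultdict
from typing import Hashable, Sequence


def eta_squared(x: Sequence[Hashable], y: Sequence[float]) -> float:
    """Between-group sum of squares divided by total sum of squares."""
    groups = defaultdict(list)
    for label, value in zip(x, y):
        groups[label].append(value)
    grand_mean = sum(y) / len(y)
    ss_between = sum(
        len(values) * (sum(values) / len(values) - grand_mean) ** 2
        for values in groups.values()
    )
    ss_total = sum((value - grand_mean) ** 2 for value in y)
    return ss_between / ss_total


x = ["a", "a", "b", "b"]
y = [1.0, 3.0, 5.0, 7.0]
eta_sq = eta_squared(x, y)
eta = math.sqrt(eta_sq)
print(eta_sq, round(eta, 4))
```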

property kruskal: tuple[float, float]

Computes the Kruskal-Wallis H-test.

Returns:: H-statistic, p-value.

property silhouette: float

Silhouette coefficient.

Returns:: Silhouette coefficient.

pypair.continuous.pd_isna(values: ndarray[tuple[Any, ...], dtype[Any]] | SupportsToNumpy | Sequence[Any] | Iterable[Any]) → ndarray[tuple[Any, ...], dtype[bool]]

Associations

Some of the functions here are just wrappers around the contingency tables and may be looked at as convenience methods to simply pass in data for two variables. If you need more than the specific association, you are encouraged to build the appropriate contingency table and then call upon the measures you need.

Gets the agreement association.

Parameters:

a – Categorical variable (iterable).
b – Categorical variable (iterable).
measure – Measure. Default is chohen_k.
a_vals – The unique values in a.
b_vals – The unique values in b.

Returns:

Measure.

Gets the binary-binary association.

Parameters:

a – Binary variable (iterable).
b – Binary variable (iterable).
measure – Measure. Default is chisq.
a_0 – The a zero value. Default 0.
a_1 – The a one value. Default 1.
b_0 – The b zero value. Default 0.
b_1 – The b one value. Default 1.

Returns:

Measure.

Gets the binary-continuous association.

Parameters:

b – Binary variable (iterable).
c – Continuous variable (iterable).
measure – Measure. Default is biserial.
b_0 – Value when b is zero. Default 0.
b_1 – Value when b is one. Default is 1.

Returns:

Measure.

Gets the categorical-categorical association.

Parameters:

a – Categorical variable (iterable).
b – Categorical variable (iterable).
measure – Measure. Default is chisq.
a_vals – The unique values in a.
b_vals – The unique values in b.

Returns:

Measure.

Gets the categorical-continuous association.

Parameters:

x – Categorical variable (iterable).
y – Continuous variable (iterable).
measure – Measure. Default is eta.

Returns:

Measure.

Gets the specified concordance between the two variables.

Parameters:

x – Continuous or ordinal variable (iterable).
y – Continuous or ordinal variable (iterable).
measure – Measure. Default is kendall_tau.

Returns:

Measure.

Gets the specified confusion matrix stats.

Parameters:

a – Binary variable (iterable).
b – Binary variable (iterable).
measure – Measure. Default is acc.
a_0 – The a zero value. Default 0.
a_1 – The a one value. Default 1.
b_0 – The b zero value. Default 0.
b_1 – The b one value. Default 1.

Returns:

Measure.

Gets the continuous-continuous association.

Parameters:

x – Continuous variable (iterable).
y – Continuous variable (iterable).
measure – Measure. Default is ‘pearson’.

Returns:

Measure.

Decorators

These are decorators.

pypair.decorator.distance(f: Callable[[P], R]) → Callable[[P], R]: Marker for distance functions.

pypair.decorator.similarity(f: Callable[[P], R]) → Callable[[P], R]: Marker for similarity functions.

pypair.decorator.timeit(f: Callable[[P], R]) → Callable[[P], R]: Records execution time when profiling is enabled.

Utility

These are utility functions.

class pypair.util.MeasureMixin

Bases: ABC

Measure mixin. Lists the functions decorated with @property and provides access to each such property by name.

get(measure: str) → Any: Gets the specified measure.

get_measures() → list[str]: Gets a list of all the measures.

classmethod measures() → list[str]: Gets a list of all the measures.
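One way such a mixin could be built is with the standard inspect module, as sketched below. This is a hypothetical illustration of the idea, not pypair's source: the class names MeasureSketch and TwoByTwo, the error type raised and the sorting of names are all assumptions.

```python
import inspect
from typing import Any, List


class MeasureSketch:
    """Hypothetical mixin: discover @property members and fetch them by name."""

    @classmethod
    def measures(cls) -> List[str]:
        # inspect.getmembers on a class yields the property objects themselves.
        return sorted(
            name for name, member in inspect.getmembers(cls)
            if isinstance(member, property)
        )

    def get(self, measure: str) -> Any:
        if measure not in self.measures():
            raise ValueError(f"unknown measure: {measure}")
        return getattr(self, measure)


class TwoByTwo(MeasureSketch):
    """Toy table exposing two of the binary measures as properties."""

    def __init__(self, a: int, b: int, c: int, d: int) -> None:
        self.a, self.b, self.c, self.d = a, b, c, d

    @property
    def jaccard(self) -> float:
        return self.a / (self.a + self.b + self.c)

    @property
    def dice(self) -> float:
        return 2 * self.a / (2 * self.a + self.b + self.c)


toy = TwoByTwo(3, 1, 1, 3)
names = TwoByTwo.measures()
value = toy.get("jaccard")
print(names, value)
```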

exception pypair.util.UndefinedMeasureError

Bases: ValueError

Raised when a measure is undefined for the provided data.

pypair.util.compute_all_measures(computer: MeasureComputer, context: str | None = None) → dict[str, Any]

pypair.util.corr(df: pd.DataFrame, f: PairwiseAssociationFn | Callable[[pd.Series[Any], pd.Series[Any]], float]) → pd.DataFrame

Computes the pairwise association matrix for a pandas dataframe.

Parameters:

df – Pandas data frame.
f – Callable function; e.g. lambda a, b: categorical_categorical(a, b, measure='phi')

pypair.util.get_measures(clazz: type[object]) → list[str]: Gets all the measures of a class.

pypair.util.raise_undefined_measure(measure: str, owner: object | str, detail: str, context: str | None = None) → None

pypair.util.to_numpy(values: ndarray[tuple[Any, ...], dtype[Any]] | SupportsToNumpy | Sequence[Any] | Iterable[Any], dtype: DTypeLike | None = None) → ndarray[tuple[Any, ...], dtype[Any]]: Converts common sequence / series inputs to a numpy array.

Spark

These are functions that you can use in Spark. You must pass in a Spark dataframe and you will get a pair-RDD as output. The pair-RDD will have the following as its keys and values.

key: in the form of a tuple of strings (k1, k2) where k1 and k2 are names of variables (column names)
value: a dictionary {'acc': 0.8, 'tpr': 0.9, 'fpr': 0.8, ...} where keys are association measure names and values are the corresponding association values

pypair.spark.agreement(sdf: Any, pseudocount: bool = True) → Any

Gets all pairwise categorical-categorical agreement association measures. The result is a Spark pair-RDD, where the keys are tuples of variable names, e.g. (k1, k2), and the values are dictionaries of association names and metrics, e.g. {'kappa': 0.9, 'delta': 0.2}. Each record in the pair-RDD has the following form.

(k1, k2), {'kappa': 0.9, 'delta': 0.2, …}

Parameters:: sdf – Spark dataframe. Should be strings or whole numbers to represent the values.
Returns:: Spark pair-RDD.

pypair.spark.binary_binary(sdf: Any, pseudocount: bool = True) → Any

Gets all the pairwise binary-binary association measures. The result is a Spark pair-RDD, where the keys are tuples of variable names, e.g. (k1, k2), and the values are dictionaries of association names and measures, e.g. {'phi': 1, 'lambda': 0.8}. Each record in the pair-RDD has the following form.

(k1, k2), {'phi': 1, 'lambda': 0.8, …}

Parameters:: sdf – Spark dataframe. Should be all 1’s and 0’s.
Returns:: Spark pair-RDD.

pypair.spark.binary_continuous(sdf: Any, binary: Sequence[str], continuous: Sequence[str], b_0: object = 0, b_1: object = 1) → Any

Gets all pairwise binary-continuous association measures. The result is a Spark pair-RDD, where the keys are tuples of variable names, e.g. (k1, k2), and the values are dictionaries of association names and metrics, e.g. {'biserial': 0.9, 'point_biserial': 0.2}. Each record in the pair-RDD has the following form.

(k1, k2), {'biserial': 0.9, 'point_biserial': 0.2, …}

All the binary fields/columns should be encoded in the same way. For example, if you are using 1 and 0, then all binary fields should only have those values, not a mixture of 1 and 0, True and False, -1 and 1, etc.

Parameters:

sdf – Spark dataframe.
binary – List of fields that are binary.
continuous – List of fields that are continuous.
b_0 – Zero value for binary field.
b_1 – One value for binary field.

Returns:

Spark pair-RDD.

pypair.spark.categorical_categorical(sdf: Any, pseudocount: bool = True) → Any

Gets all pairwise categorical-categorical association measures. The result is a Spark pair-RDD, where the keys are tuples of variable names, e.g. (k1, k2), and the values are dictionaries of association names and metrics, e.g. {'phi': 0.9, 'chisq': 0.2}. Each record in the pair-RDD has the following form.

(k1, k2), {'phi': 0.9, 'chisq': 0.2, …}

Parameters:: sdf – Spark dataframe. Should be strings or whole numbers to represent the values.
Returns:: Spark pair-RDD.

pypair.spark.categorical_continuous(sdf: Any, categorical: Sequence[str], continuous: Sequence[str]) → Any

Gets all pairwise categorical-continuous association measures. The result is a Spark pair-RDD, where the keys are tuples of variable names, e.g. (k1, k2), and the values are dictionaries of association names and metrics, e.g. {'eta_sq': 0.9, 'eta': 0.95}. Each record in the pair-RDD has the following form.

(k1, k2), {'eta_sq': 0.9, 'eta': 0.95}

For now, only eta \(\eta^2\) is supported.

Parameters:

sdf – Spark dataframe.
categorical – List of categorical variables.
continuous – List of continuous variables.

Returns:

Spark pair-RDD.

pypair.spark.concordance(sdf: Any, pseudocount: bool = True) → Any

Gets all the pairwise ordinal-ordinal concordance measures. The result is a Spark pair-RDD, where the keys are tuples of variable names, e.g. (k1, k2), and the values are dictionaries of association names and measures, e.g. {'kendall': 1, 'gamma': 0.8}. Each record in the pair-RDD has the following form.

(k1, k2), {'kendall': 1, 'gamma': 0.8, …}

Parameters:: sdf – Spark dataframe. Should be all ordinal data (numeric).
Returns:: Spark pair-RDD.

pypair.spark.confusion(sdf: Any, pseudocount: bool = True) → Any

Gets all the pairwise confusion matrix metrics. The result is a Spark pair-RDD, where the keys are tuples of variable names, e.g. (k1, k2), and the values are dictionaries of association names and metrics, e.g. {'acc': 0.9, 'fpr': 0.2}. Each record in the pair-RDD has the following form.

(k1, k2), {'acc': 0.9, 'fpr': 0.2, …}

Parameters:: sdf – Spark dataframe. Should be all 1’s and 0’s.
Returns:: Spark pair-RDD.

pypair.spark.continuous_continuous(sdf: Any) → Any

Gets all the pairwise continuous-continuous association measures. The result is a Spark pair-RDD, where the keys are tuples of variable names, e.g. (k1, k2), and the values are dictionaries of association names and measures, e.g. {'pearson': 1}. Each record in the pair-RDD has the following form.

(k1, k2), {'pearson': 1}

Only pearson is supported at the moment.

Parameters:: sdf – Spark dataframe. Should be all continuous data (numeric).
Returns:: Spark pair-RDD.