PyPair

Contingency Table Analysis

These are the basic contingency tables used to analyze categorical data.

  • CategoricalTable

  • BinaryTable

  • ConfusionMatrix

  • AgreementTable

class pypair.contingency.AgreementMixin

Bases: object

Agreement computations.

property chohen_k

Computes Cohen’s \(\kappa\).

  • \(\kappa = \frac{\theta_1 - \theta_2}{1 - \theta_2}\)

  • \(\theta_1 = \sum_i p_{ii}\)

  • \(\theta_2 = \sum_i p_{i+}p_{+i}\)

Returns

\(\kappa\).
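
As an illustration of the formula above, here is a minimal sketch that computes \(\kappa\) from a square table of counts. It is independent of PyPair's own implementation; the function name `cohen_kappa` and the sample table are hypothetical.

```python
def cohen_kappa(table):
    """Cohen's kappa from a square table of counts (list of lists)."""
    n = sum(sum(row) for row in table)
    k = len(table)
    # Joint proportions p_ij and the marginals p_{i+} (rows), p_{+i} (columns).
    p = [[cell / n for cell in row] for row in table]
    row_marg = [sum(p[i]) for i in range(k)]
    col_marg = [sum(p[i][j] for i in range(k)) for j in range(k)]
    theta_1 = sum(p[i][i] for i in range(k))                    # observed agreement
    theta_2 = sum(row_marg[i] * col_marg[i] for i in range(k))  # chance agreement
    return (theta_1 - theta_2) / (1 - theta_2)


# Hypothetical ratings from two raters over 50 items.
table = [[20, 5], [10, 15]]
kappa = cohen_kappa(table)  # theta_1 = 0.7, theta_2 = 0.5, so kappa is about 0.4
```

The function is undefined when \(\theta_2 = 1\); a real implementation would need to guard against that case.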

property cohen_light_k

Cohen-Light \(\kappa\). \(\kappa\) is a measure of conditional agreement. Several \(\kappa\), one for each unique value, will be computed and returned.

  • \(\kappa = \frac{\theta_1 - \theta_2}{1 - \theta_2}\)

  • \(\theta_1 = \frac{p_{ii}}{p_{i+}}\)

  • \(\theta_2 = p_{+i}\)

Returns

A list of \(\kappa\).
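
The per-category computation can be sketched as follows. This is an illustrative reading of the formulas above, not PyPair's code; `cohen_light_kappas` is a hypothetical name, and the sketch assumes every row has a non-zero total.

```python
def cohen_light_kappas(table):
    """One conditional kappa per category, from a square table of counts."""
    n = sum(sum(row) for row in table)
    k = len(table)
    kappas = []
    for i in range(k):
        p_ii = table[i][i] / n                            # diagonal proportion
        p_row = sum(table[i]) / n                         # p_{i+}
        p_col = sum(table[r][i] for r in range(k)) / n    # p_{+i}
        theta_1 = p_ii / p_row
        theta_2 = p_col
        kappas.append((theta_1 - theta_2) / (1 - theta_2))
    return kappas


kappas = cohen_light_kappas([[20, 5], [10, 15]])
# First category: theta_1 = 0.8, theta_2 = 0.6, so kappa is about 0.5.
```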

class pypair.contingency.AgreementStats(table)

Bases: AgreementMixin, ContingencyTable

Computes agreement stats.

__init__(table)

ctor.

Parameters

table – Contingency table.

class pypair.contingency.AgreementTable(a, b, a_vals=None, b_vals=None)

Bases: AgreementMixin, ContingencyTable

Represents a contingency table for agreement data against one variable. The variable is typically a rating variable (e.g. dislike, neutral, like), and the data is a pairing of ratings over the same set of items. The agreement table induced by the data is typically square: the number of rows equals the number of columns.

__init__(a, b, a_vals=None, b_vals=None)

ctor.

Parameters
  • a – Categorical variable.

  • b – Categorical variable.

  • a_vals – Values in a. Default None; figure out empirically.

  • b_vals – Values in b. Default None; figure out empirically.

class pypair.contingency.BinaryMixin

Bases: object

Binary computations based off of a, b, c and d from a 2x2 contingency table.
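
The measures in this class are all functions of the four cell counts. The sketch below shows one way to obtain them from two binary lists and then evaluate a few of the formulas documented here. The assignment of cells is an assumption (a: both 1, b: first 1 and second 0, c: first 0 and second 1, d: both 0), and `binary_counts` is a hypothetical helper, not part of PyPair.

```python
def binary_counts(x, y):
    """Returns (a, b, c, d) under the assumed cell convention."""
    pairs = list(zip(x, y))
    a = pairs.count((1, 1))  # both present
    b = pairs.count((1, 0))  # present in x only
    c = pairs.count((0, 1))  # present in y only
    d = pairs.count((0, 0))  # both absent
    return a, b, c, d


x = [1, 1, 1, 0, 0, 1, 0, 0]
y = [1, 1, 0, 0, 1, 1, 0, 0]
a, b, c, d = binary_counts(x, y)

jaccard = a / (a + b + c)                    # similarity
dice = 2 * a / (2 * a + b + c)               # similarity
hamming = b + c                              # distance
sokal_michener = (a + d) / (a + b + c + d)   # similarity
```

Many of the formulas divide by sums or products of these counts, so degenerate tables (for example, a + b + c = 0) need separate handling.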

property ample

Ample

\(\left|\frac{a(c+d)}{c(a+b)}\right|\)

Returns

Ample.

property anderberg

Anderberg

\(\frac{\sigma-\sigma'}{2n}\)

Returns

Anderberg.

property baroni_urbani_buser_i

Baroni-Urbani-Buser-I

\(\frac{\sqrt{ad}+a}{\sqrt{ad}+a+b+c}\)

Returns

Baroni-Urbani-Buser-I.

property baroni_urbani_buser_ii

Baroni-Urbani-Buser-II

\(\frac{\sqrt{ad}+a-(b+c)}{\sqrt{ad}+a+b+c}\)

Returns

Baroni-Urbani-Buser-II.

property braun_banquet

Braun-Blanquet

\(\frac{a}{\max(a+b,a+c)}\)

Returns

Braun-Blanquet.

property chisq

\(\chi^2\) (alias for Pearson-I)

Returns

\(\chi^2\).

property chord

Chord

\(\sqrt{2\left(1 - \frac{a}{\sqrt{(a+b)(a+c)}}\right)}\)

Returns

Chord (distance).

property cole_i

Cole-I

\(\frac{\sqrt{2}(ad-bc)}{\sqrt{(ad-bc)^2-(a+b)(a+c)(b+d)(c+d)}}\)

Returns

Cole-I.

property cole_ii

Cole-II

\(\frac{ad-bc}{\min((a+b)(a+c),(b+d)(c+d))}\)

Returns

Cole-II.

property contingency_coefficient

Contingency coefficient.

Returns

Contingency coefficient.

property cosine

Cosine

\(\frac{a}{(a+b)(a+c)}\)

Returns

Cosine.

property cramer_v

Cramer’s V.

Returns

Cramer’s V.

property dennis

Dennis

\(\frac{ad-bc}{\sqrt{n(a+b)(a+c)}}\)

Returns

Dennis.

property dice

Dice; Czekanowski; Nei-Li

\(\frac{2a}{2a+b+c}\)

Returns

Dice.

property disperson

Dispersion

\(\frac{ad-bc}{(a+b+c+d)^2}\)

Returns

Dispersion.

property driver_kroeber

Driver-Kroeber

\(\frac{a}{2}\left(\frac{1}{a+b}+\frac{1}{a+c}\right)\)

Returns

Driver-Kroeber.

property euclid

Euclid

\(\sqrt{b+c}\)

Returns

Euclid (distance).

property eyraud

Eyraud

\(\frac{n^2(na-(a+b)(a+c))}{(a+b)(a+c)(b+d)(c+d)}\)

Returns

Eyraud.

property fager_mcgowan

Fager-McGowan

\(\frac{a}{\sqrt{(a+b)(a+c)}}-\frac{\max(a+b,a+c)}{2}\)

Returns

Fager-McGowan.

property faith

Faith

\(\frac{a+0.5d}{a+b+c+d}\)

Returns

Faith.

property forbes_ii

Forbes-II

\(\frac{na-(a+b)(a+c)}{n \min(a+b,a+c) - (a+b)(a+c)}\)

Returns

Forbes-II.

property forbesi

Forbes-I

\(\frac{na}{(a+b)(a+c)}\)

Returns

Forbes-I.

property fossum

Fossum

\(\frac{n(a-0.5)^2}{(a+b)(a+c)}\)

Returns

Fossum.

property gilbert_wells

Gilbert-Wells

\(\log a - \log n - \log \frac{a+b}{n} - \log \frac{a+c}{n}\)

Returns

Gilbert-Wells.

property goodman_kruskal

Goodman-Kruskal

\(\frac{\sigma - \sigma'}{2n-\sigma'}\)

Returns

Goodman-Kruskal.

property gower

Gower

\(\frac{a+d}{\sqrt{(a+b)(a+c)(b+d)(c+d)}}\)

Returns

Gower.

property gower_legendre

Gower-Legendre

\(\frac{a+d}{a+0.5b+0.5c+d}\)

Returns

Gower-Legendre.

property hamann

Hamann.

\(\frac{(a+d)-(b+c)}{a+b+c+d}\)

Returns

Hamann.

property hamming

Hamming; Canberra; Manhattan; Cityblock; Minkowski

\(b+c\)

Returns

Hamming (distance).

property hellinger

Hellinger

\(2\sqrt{1 - \frac{a}{\sqrt{(a+b)(a+c)}}}\)

Returns

Hellinger (distance).

property inner_product

Inner-product.

\(a+d\)

Returns

Inner-product.

property intersection

Intersection

\(a\)

Returns

Intersection.

property jaccard

Jaccard

\(\frac{a}{a+b+c}\)

Returns

Jaccard.

property jaccard_3w

3W-Jaccard

\(\frac{3a}{3a+b+c}\)

Returns

3W-Jaccard.

property jaccard_distance

Jaccard

\(\frac{b + c}{a + b + c}\)

Returns

Jaccard (distance).

property johnson

Johnson.

\(\frac{a}{a+b}+\frac{a}{a+c}\)

Returns

Johnson.

property kulcyznski_ii

Kulczynski-II

\(\frac{0.5a(2a+b+c)}{(a+b)(a+c)}\)

Returns

Kulczynski-II.

property kulczynski_i

Kulczynski-I

\(\frac{a}{b+c}\)

Returns

Kulczynski-I.

property lance_williams

Lance-Williams; Bray-Curtis

\(\frac{b+c}{2a+b+c}\)

Returns

Lance-Williams (distance).

property mcconnaughey

McConnaughey

\(\frac{a^2 - bc}{(a+b)(a+c)}\)

Returns

McConnaughey.

property mcnemar_test

McNemar’s test.

Returns

A tuple. The first element is the chi-square test statistic. The second element is the p-value.

property mean_manhattan

Mean-Manhattan

\(\frac{b+c}{a+b+c+d}\)

Returns

Mean-Manhattan (distance).

property michael

Michael

\(\frac{4(ad-bc)}{(a+d)^2+(b+c)^2}\)

Returns

Michael.

property mountford

Mountford

\(\frac{a}{0.5(ab + ac) + bc}\)

Returns

Mountford.

property ochia_i

Ochiai-I

Also known as the Fowlkes-Mallows Index. This measure is typically used to judge the similarity between two clusters. A larger value indicates that the clusters are more similar.

\(\frac{a}{\sqrt{(a+b)(a+c)}}\)

Returns

Ochiai-I.

property ochia_ii

Ochiai-II

\(\frac{ad}{\sqrt{(a+b)(a+c)(b+d)(c+d)}}\)

Returns

Ochiai-II.

property odds_ratio

Odds ratio. The odds ratio is also referred to as the cross-product ratio.

Returns

Odds ratio.

property pattern_difference

Pattern difference

\(\frac{4bc}{(a+b+c+d)^2}\)

Returns

Pattern difference (distance).

property pearson_heron_i

Pearson-Heron-I

\(\frac{ad-bc}{\sqrt{(a+b)(a+c)(b+d)(c+d)}}\)

Returns

Pearson-Heron-I.

property pearson_heron_ii

Pearson-Heron-II

\(\sqrt{\frac{\chi^2}{n+\chi^2}}\)

Returns

Pearson-Heron-II.

property pearson_i

Pearson-I

\(\chi^2=\frac{n(ad-bc)^2}{(a+b)(a+c)(c+d)(b+d)}\)

Returns

Pearson-I.

property peirce

Peirce

\(\frac{ab+bc}{ab+2bc+cd}\)

Returns

Peirce.

property person_ii

Pearson-II

\(\sqrt{\frac{\rho}{n+\rho}}\)

  • \(\rho=\frac{ad-bc}{\sqrt{(a+b)(a+c)(b+d)(c+d)}}\)

Returns

Pearson-II.

property roger_tanimoto

Rogers-Tanimoto

\(\frac{a+d}{a+2b+2c+d}\)

Returns

Rogers-Tanimoto.

property russel_rao

Russell-Rao

\(\frac{a}{a+b+c+d}\)

Returns

Russell-Rao.

property shape_difference

Shape difference

\(\frac{n(b+c)-(b-c)^2}{(a+b+c+d)^2}\)

Returns

Shape difference (distance).

property simpson

Simpson (or Overlap).

\(\frac{a}{\min(a+b,a+c)}\)

Returns

Simpson.

property size_difference

Size difference

\(\frac{(b+c)^2}{(a+b+c+d)^2}\)

Returns

Size difference (distance).

property sokal_michener

Sokal-Michener

\(\frac{a+d}{a+b+c+d}\)

Returns

Sokal-Michener.

property sokal_sneath_i

Sokal-Sneath-I

\(\frac{a}{a+2b+2c}\)

Returns

Sokal-Sneath-I.

property sokal_sneath_ii

Sokal-Sneath-II

\(\frac{2a+2d}{2a+b+c+2d}\)

Returns

Sokal-Sneath-II.

property sokal_sneath_iii

Sokal-Sneath-III

\(\frac{a+d}{b+c}\)

Returns

Sokal-Sneath-III.

property sokal_sneath_iv

Sokal-Sneath-IV

\(\frac{ad}{(a+b)(a+c)(b+d)\sqrt{c+d}}\)

Returns

Sokal-Sneath-IV.

property sokal_sneath_v

Sokal-Sneath-V

\(\frac{1}{4}\left(\frac{a}{a+b}+\frac{a}{a+c}+\frac{d}{b+d}+\frac{d}{c+d}\right)\)

Returns

Sokal-Sneath-V.

property sorensen_dice

Sørensen–Dice

\(\frac{2(a + d)}{2(a + d) + b + c}\)

Returns

Sørensen–Dice.

property sorgenfrei

Sorgenfrei

\(\frac{a^2}{(a+b)(a+c)}\)

Returns

Sorgenfrei.

property stiles

Stiles

\(\log_{10} \frac{n\left(|ad-bc|-\frac{n}{2}\right)^2}{(a+b)(a+c)(b+d)(c+d)}\)

Returns

Stiles.

property tanimoto_distance

Tanimoto similarity and distance.

Returns

Tanimoto distance.

property tanimoto_i

Tanimoto-I

\(\frac{a}{2a+b+c}\)

Returns

Tanimoto-I.

property tanimoto_ii

Tanimoto-II

\(\frac{a}{b + c}\)

Returns

Tanimoto-II.

property tarantula

Tarantula

\(\frac{a(c+d)}{c(a+b)}\)

Returns

Tarantula.

property tarwid

Tarwid

\(\frac{na - (a+b)(a+c)}{na + (a+b)(a+c)}\)

Returns

Tarwid.

property tetrachoric

The tetrachoric correlation lies in \([-1, 1]\), where 0 indicates no agreement, 1 indicates perfect agreement and -1 indicates perfect disagreement.

  • if \(b=0\) or \(c=0\), 1.0

  • if \(a=0\) or \(d=0\), -1.0

  • else, \(\frac{y-1}{y+1}, y={\left(\frac{da}{bc}\right)}^{\frac{\pi}{4}}\)

Returns

Tetrachoric correlation.
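
A minimal sketch of this approximation, written from the piecewise definition above rather than from PyPair's source. The -1 branch is taken here when \(a\) or \(d\) is zero, which is where \(y\) tends to 0; the function name is hypothetical.

```python
import math


def tetrachoric(a, b, c, d):
    """Approximate tetrachoric correlation from the 2x2 cell counts."""
    if b == 0 or c == 0:
        return 1.0   # y tends to infinity, so (y - 1) / (y + 1) tends to 1
    if a == 0 or d == 0:
        return -1.0  # y is 0, so (y - 1) / (y + 1) is -1
    y = (a * d / (b * c)) ** (math.pi / 4)
    return (y - 1) / (y + 1)


r_balanced = tetrachoric(2, 2, 2, 2)  # ad = bc, so y = 1 and the result is 0
r_positive = tetrachoric(3, 1, 1, 3)  # ad > bc, so the result is positive
```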

property tschuprow_t

Tschuprow’s T.

Returns

Tschuprow’s T.

tversky_index(theta=1, phi=0)

Computes Tversky’s index.

\(\frac{a}{a+\theta b+\phi c}\)

\(\theta\) and \(\phi\) are typically in \([0,1]\) with \(\theta + \phi = 1\).

Parameters
  • theta – Weight \([0,1]\) of how important match on row variable is. Default 1.

  • phi – Weight \([0,1]\) of how important match on column variable is. Default 0.

Returns

Tversky’s Index.
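
To make the role of the two weights concrete, the sketch below evaluates the formula directly on cell counts. It is a standalone function with a hypothetical signature (the PyPair method takes only theta and phi and reads the counts from the table). With \(\theta = \phi = 1\) the expression reduces to the Jaccard formula, and with \(\theta = \phi = 0.5\) it reduces to the Dice formula.

```python
def tversky(a, b, c, theta=1.0, phi=0.0):
    """Tversky index a / (a + theta*b + phi*c) from cell counts."""
    return a / (a + theta * b + phi * c)


a, b, c = 3, 1, 1
as_jaccard = tversky(a, b, c, theta=1.0, phi=1.0)  # equals a / (a + b + c)
as_dice = tversky(a, b, c, theta=0.5, phi=0.5)     # equals 2a / (2a + b + c)
default = tversky(a, b, c)                         # ignores c entirely
```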

property vari

Vari

\(\frac{b+c}{4a+4b+4c+4d}\)

Returns

Vari (distance).

property yule_q

Yule’s Q

\(\frac{ad-bc}{ad+bc}\)

Also, Yule’s Q is based off of the odds ratio or cross-product ratio, \(\alpha\).

\(Q = \frac{\alpha - 1}{\alpha + 1}\)

Yule’s Q is the same as Goodman-Kruskal’s \(\gamma\) for 2 x 2 contingency tables and is also a measure of proportional reduction in error (PRE).

Returns

Yule’s Q.
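
The two expressions for Q above are algebraically the same, which a short check can confirm. The sketch assumes the odds ratio is \(\alpha = \frac{ad}{bc}\) with \(b\) and \(c\) non-zero; the function names are hypothetical.

```python
import math


def odds_ratio(a, b, c, d):
    """Cross-product ratio alpha = ad / bc (assumes b and c are non-zero)."""
    return (a * d) / (b * c)


def yule_q_from_counts(a, b, c, d):
    return (a * d - b * c) / (a * d + b * c)


def yule_q_from_alpha(alpha):
    return (alpha - 1) / (alpha + 1)


def yule_y_from_alpha(alpha):
    return (math.sqrt(alpha) - 1) / (math.sqrt(alpha) + 1)


a, b, c, d = 3, 1, 1, 3
alpha = odds_ratio(a, b, c, d)          # 9.0
q_counts = yule_q_from_counts(a, b, c, d)
q_alpha = yule_q_from_alpha(alpha)
y_alpha = yule_y_from_alpha(alpha)
```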

property yule_q_difference

Yule’s q

\(\frac{2bc}{ad+bc}\)

Returns

Yule’s q (distance).

property yule_w

Yule’s w

\(\frac{\sqrt{ad}-\sqrt{bc}}{\sqrt{ad}+\sqrt{bc}}\)

Returns

Yule’s w.

property yule_y

Yule’s Y is based off of the odds ratio or cross-product ratio, \(\alpha\).

\(Y = \frac{\sqrt\alpha - 1}{\sqrt\alpha + 1}\)

Returns

Yule’s Y.

class pypair.contingency.BinaryStats(table)

Bases: CategoricalMixin, BinaryMixin, ContingencyTable

Computes binary stats.

__init__(table)

ctor.

Parameters

table – Contingency table.

class pypair.contingency.BinaryTable(a, b, a_0=0, a_1=1, b_0=0, b_1=1)

Bases: CategoricalMixin, BinaryMixin, ContingencyTable

Represents a contingency table for binary variables.

__init__(a, b, a_0=0, a_1=1, b_0=0, b_1=1)

ctor.

Parameters
  • a – Iterable list.

  • b – Iterable list.

  • a_0 – The zero value for a. Defaults to 0.

  • a_1 – The one value for a. Defaults to 1.

  • b_0 – The zero value for b. Defaults to 0.

  • b_1 – The one value for b. Defaults to 1.

class pypair.contingency.CategoricalMixin

Bases: object

Categorical computations based off a contingency table.

property adjusted_rand_index

The Adjusted Rand Index (ARI) typically falls in [0, 1], but negative values can arise when the index is less than its expected value. This function uses binom() from scipy.special; when n >= 300, the intermediate results become very large and may cause overflow.

TODO: use a different way to compute binomial coefficient

Returns

Adjusted Rand Index.

property chisq

The chi-square statistic \(\chi^2\), is defined as follows.

\(\sum_i \sum_j \frac{(O_{ij} - E_{ij})^2}{E_{ij}}\)

In a contingency table, \(O_{ij}\) is the observed cell count corresponding to the \(i\)-th row and \(j\)-th column. \(E_{ij}\) is the expected cell count corresponding to the \(i\)-th row and \(j\)-th column.

\(E_{ij} = \frac{N_{i*} N_{*j}}{N}\)

Where \(N_{i*}\) is the i-th row marginal, \(N_{*j}\) is the j-th column marginal and \(N\) is the sum of all the values in the contingency cells (or the total size of the data).

Returns

Chi-square statistic.
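
The definition can be traced step by step with a small sketch that builds the expected counts from the marginals and sums the cell contributions. It is written from the formula above, not taken from PyPair; `chi_square` is a hypothetical name, and the sketch assumes no row or column total is zero.

```python
def chi_square(table):
    """Chi-square statistic for a table of counts (list of lists)."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]        # N_{i*}
    col_tot = [sum(col) for col in zip(*table)]  # N_{*j}
    chisq = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            chisq += (observed - expected) ** 2 / expected
    return chisq


table = [[10, 20], [30, 40]]
chisq = chi_square(table)
dof = (len(table) - 1) * (len(table[0]) - 1)  # (R - 1)(C - 1)
```

For a 2 x 2 table this agrees with the closed form \(\frac{n(ad-bc)^2}{(a+b)(a+c)(c+d)(b+d)}\) given for Pearson-I.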

property chisq_dof

Returns the degrees of freedom for \(\chi^2\), which is defined as \((R - 1)(C - 1)\), where \(R\) is the number of rows and \(C\) is the number of columns in a contingency table induced by two categorical variables.

Returns

Degrees of freedom.

property gk_lambda

Goodman-Kruskal’s lambda is the proportional reduction in error of predicting one variable b given another a: \(\lambda_{B|A}\).

  • The probability of an error in predicting the column category: \(P_e = 1 - \frac{\max_{c} N_{* c}}{N}\)

  • The probability of an error in predicting the column category given the row category: \(P_{e|r} = 1 - \frac{\sum_r \max_{c} N_{r c}}{N}\)

Where,

  • \(\max_{c} N_{* c}\) is the maximum of the column marginals

  • \(\sum_r \max_{c} N_{r c}\) is the sum over the maximum value per row

  • \(N\) is the total

Thus, \(\lambda_{B|A} = \frac{P_e - P_{e|r}}{P_e}\).

By default, the contingency table is set up with a on the rows and b on the columns. Note that Goodman-Kruskal’s lambda is not symmetric: \(\lambda_{B|A}\) does not necessarily equal \(\lambda_{A|B}\). By default, \(\lambda_{B|A}\) is computed; if you need the reverse, use gk_lambda_reversed.

Returns

Goodman-Kruskal’s lambda.
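
The asymmetry is easy to see numerically. The sketch below follows the definitions of \(P_e\) and \(P_{e|r}\) given above, with rows as the predictor; transposing the table swaps the roles of the two variables. The function name and the sample counts are hypothetical.

```python
def gk_lambda(table):
    """Lambda for predicting the column variable given the row variable."""
    n = sum(sum(row) for row in table)
    col_tot = [sum(col) for col in zip(*table)]
    p_e = 1 - max(col_tot) / n                            # error without the row
    p_e_given_r = 1 - sum(max(row) for row in table) / n  # error knowing the row
    return (p_e - p_e_given_r) / p_e


table = [[30, 10], [5, 55]]
transposed = [list(col) for col in zip(*table)]

lambda_b_given_a = gk_lambda(table)       # (0.35 - 0.15) / 0.35
lambda_a_given_b = gk_lambda(transposed)  # (0.40 - 0.15) / 0.40
```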

property gk_lambda_reversed

Computes \(\lambda_{A|B}\).

Returns

Goodman-Kruskal’s lambda.

property mutual_information

The mutual information between two variables \(X\) and \(Y\) is denoted as \(I(X;Y)\). \(I(X;Y)\) is non-negative and has no fixed upper bound, so it lies in the range \([0, \infty)\). A higher mutual information value implies a stronger association. The formula for \(I(X;Y)\) is defined as follows.

\(I(X;Y) = \sum_y \sum_x P(x, y) \log \frac{P(x, y)}{P(x) P(y)}\)

Returns

Mutual information.

property phi

Gets \(\phi\).

\(\phi = \sqrt{\frac{\chi^2}{N}}\)

Returns

\(\phi\).

property uncertainty_coefficient

The uncertainty coefficient \(U(X|Y)\) for two variables \(X\) and \(Y\) is defined as follows.

\(U(X|Y) = \frac{I(X;Y)}{H(X)}\)

Where,

  • \(H(X) = -\sum_x P(x) \log P(x)\)

  • \(I(X;Y) = \sum_y \sum_x P(x, y) \log \frac{P(x, y)}{P(x) P(y)}\)

\(H(X)\) is called the entropy of \(X\) and \(I(X;Y)\) is the mutual information between \(X\) and \(Y\). Note that \(0 \le I(X;Y) \le H(X)\). As such, the uncertainty coefficient may be viewed as the mutual information between \(X\) and \(Y\) normalized by \(H(X)\), and it lies in the range \([0, 1]\).

Returns

Uncertainty coefficient.
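
A minimal sketch of the two quantities, computed from a table of counts with natural logarithms. It assumes \(X\) is the row variable, which is a choice made for this example and not something stated in the text; the function names are hypothetical.

```python
import math


def mutual_information(table):
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    mi = 0.0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            if count == 0:
                continue  # a zero cell contributes nothing to the sum
            p_xy = count / n
            p_x, p_y = row_tot[i] / n, col_tot[j] / n
            mi += p_xy * math.log(p_xy / (p_x * p_y))
    return mi


def entropy(totals):
    n = sum(totals)
    return -sum((t / n) * math.log(t / n) for t in totals if t > 0)


def uncertainty_coefficient(table):
    # Assumption: X is on the rows, so H(X) uses the row marginals.
    return mutual_information(table) / entropy([sum(row) for row in table])


u_independent = uncertainty_coefficient([[10, 20], [20, 40]])  # close to 0
u_dependent = uncertainty_coefficient([[50, 0], [0, 50]])      # close to 1
```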

property uncertainty_coefficient_reversed

Uncertainty coefficient.

Returns

Uncertainty coefficient.

class pypair.contingency.CategoricalStats(table)

Bases: CategoricalMixin, ContingencyTable

Computes categorical stats.

__init__(table)

ctor.

Parameters

table – Contingency table.

class pypair.contingency.CategoricalTable(a, b, a_vals=None, b_vals=None)

Bases: CategoricalMixin, ContingencyTable

Represents a contingency table for categorical variables.

__init__(a, b, a_vals=None, b_vals=None)

ctor. If a_vals or b_vals are None, then the possible values will be determined empirically from the data.

Parameters
  • a – Iterable list.

  • b – Iterable list.

  • a_vals – All possible values in a. Defaults to None.

  • b_vals – All possible values in b. Defaults to None.

class pypair.contingency.ConfusionMatrix(a, b, a_0=0, a_1=1, b_0=0, b_1=1)

Bases: ConfusionMixin, ContingencyTable

Represents a confusion matrix. The confusion matrix looks like what is shown below for two binary variables a and b; a is in the rows and b in the columns. Most of the performance statistics come from the counts of TN, FN, FP and TP.

Confusion Matrix

         b=0    b=1
  a=0    TN     FP
  a=1    FN     TP

__init__(a, b, a_0=0, a_1=1, b_0=0, b_1=1)

ctor. Note that a is the ground truth and b is the prediction.

Parameters
  • a – Binary variable (iterable). Ground truth.

  • b – Binary variable (iterable). Prediction.

  • a_0 – The zero value for a. Defaults to 0.

  • a_1 – The one value for a. Defaults to 1.

  • b_0 – The zero value for b. Defaults to 0.

  • b_1 – The one value for b. Defaults to 1.

class pypair.contingency.ConfusionMixin

Bases: object

Confusion matrix computations.
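
As a rough orientation for the properties in this class, the sketch below derives the four counts from a ground-truth list and a prediction list, then evaluates a few measures by their standard definitions (F1 as the harmonic mean \(\frac{2 \cdot PPV \cdot TPR}{PPV + TPR}\), MCC with the product \(TP \times TN\) in the numerator). It does not use PyPair; `confusion_counts` and the sample data are hypothetical.

```python
import math


def confusion_counts(truth, pred):
    """Returns (TN, FP, FN, TP) for 0/1 encoded truth and prediction."""
    pairs = list(zip(truth, pred))
    return (pairs.count((0, 0)), pairs.count((0, 1)),
            pairs.count((1, 0)), pairs.count((1, 1)))


truth = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
tn, fp, fn, tp = confusion_counts(truth, pred)

acc = (tp + tn) / (tp + tn + fp + fn)
tpr = tp / (tp + fn)  # sensitivity, recall
tnr = tn / (tn + fp)  # specificity
ppv = tp / (tp + fp)  # precision
f1 = 2 * ppv * tpr / (ppv + tpr)
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
```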

property acc

Accuracy.

\(ACC = \frac{TP + TN}{TP + TN + FP + FN}\)

Returns

Accuracy.

property ba

Balanced accuracy.

\(BA = \frac{TPR + TNR}{2}\)

Returns

Balanced accuracy.

property bm

Bookmaker informedness.

\(BM = TPR + TNR - 1\)

Returns

BM.

property dor

Diagnostic odds ratio.

\(\frac{PLR}{NLR}\)

Returns

DOR.

property f1

F1 score: harmonic mean of precision and sensitivity.

\(F1 = \frac{2 \times PPV \times TPR}{PPV + TPR}\)

Returns

F1.

property fdr

False discovery rate.

\(FDR = \frac{FP}{FP + TP}\)

Returns

FDR.

property fn

FN

Returns

FN.

property fnr

False negative rate.

\(FNR = \frac{FN}{FN + TP}\)

Aliases

  • miss rate

Returns

FNR.

property fomr

False omission rate.

\(FOR = \frac{FN}{FN + TN}\)

Returns

FOR.

property fp

FP

Returns

FP.

property fpr

False positive rate.

\(FPR = \frac{FP}{FP + TN}\)

Aliases

  • fall-out

  • probability of false alarm

Returns

FPR.

property mcc

Matthews correlation coefficient.

\(MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}\)

Returns

MCC.

property mk

Markedness.

\(MK = PPV + NPV - 1\)

Aliases

  • deltaP

Returns

Markedness.

property n

\(N = TP + FN + FP + TN\)

Returns

\(N\).

property nlr

Negative likelihood ratio.

\(NLR = \frac{FNR}{TNR}\)

Aliases

  • LR-

Returns

NLR.

property npv

Negative predictive value.

\(NPV = \frac{TN}{TN + FN}\)

Returns

NPV.

property plr

Positive likelihood ratio.

\(PLR = \frac{TPR}{FPR}\)

Aliases

  • LR+

Returns

PLR.

property ppv

Positive predictive value.

\(PPV = \frac{TP}{TP + FP}\)

Aliases

  • precision

Returns

PPV.

property precision

Alias to PPV.

Returns

PPV.

property prevalence

Prevalence.

\(\frac{TP + FN}{N}\)

Returns

Prevalence.

property pt

Prevalence threshold.

\(PT = \frac{\sqrt{TPR(-TNR + 1)} + TNR - 1}{TPR + TNR - 1}\)

Returns

Prevalence threshold.

property recall

Alias to TPR.

Returns

TPR.

property sensitivity

Alias to TPR.

Returns

Sensitivity.

property specificity

Alias to TNR.

Returns

Specificity.

property tn

TN

Returns

TN.

property tnr

True negative rate.

\(TNR = \frac{TN}{TN + FP}\)

Aliases

  • specificity

  • selectivity

Returns

TNR.

property tp

TP

Returns

TP.

property tpr

True positive rate.

\(TPR = \frac{TP}{TP + FN}\)

Aliases

  • sensitivity

  • recall

  • hit rate

  • power

  • probability of detection

Returns

TPR.

property ts

Threat score.

\(TS = \frac{TP}{TP + FN + FP}\)

Aliases

  • critical success index (CSI).

Returns

TS.

class pypair.contingency.ConfusionStats(table)

Bases: ConfusionMixin, ContingencyTable

Computes confusion matrix stats.

__init__(table)

ctor.

Parameters

table – Contingency table.

class pypair.contingency.ContingencyTable(table)

Bases: MeasureMixin, ABC

Abstract contingency table. All other tables inherit from this one.

__init__(table)

ctor.

Parameters

table – A table of counts (list of lists).

Biserial

These are the biserial association measures.

class pypair.biserial.Biserial(b, c, b_0=0, b_1=1)

Bases: MeasureMixin, BiserialMixin, object

Biserial association between a binary and continuous variable.

__init__(b, c, b_0=0, b_1=1)

ctor.

Parameters
  • b – Binary variable (iterable).

  • c – Continuous variable (iterable).

  • b_0 – Value of b that represents zero. Default 0.

  • b_1 – Value of b that represents one. Default 1.

class pypair.biserial.BiserialMixin

Bases: object

Biserial computations based off of \(n, p, q, y_0, y_1, \sigma\).

property biserial

Computes the biserial correlation between a binary and continuous variable. The biserial correlation \(r_b\) can be computed from the point-biserial correlation \(r_{\mathrm{pb}}\) as follows.

\(r_b = \frac{r_{\mathrm{pb}}}{h} \sqrt{pq}\)

The tricky thing to explain is the \(h\) parameter. \(h\) is defined as the height of the standard normal density at \(z\), where \(P(z'<z) = q\) and \(P(z'>z) = p\). In practice, \(h\) is obtained by taking the inverse standard normal CDF of \(q\) and then evaluating the standard normal density at that result; with SciPy, this is norm.pdf(norm.ppf(q)).

Returns

Biserial correlation coefficient.
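
The computation of \(h\) can also be sketched with the standard library alone, using `statistics.NormalDist` in place of SciPy's `norm` (a substitution made for this example). The function name is hypothetical, and \(p\) must lie strictly between 0 and 1.

```python
import math
from statistics import NormalDist


def biserial_from_point_biserial(r_pb, p):
    """Converts a point-biserial correlation into a biserial correlation."""
    q = 1 - p
    std_normal = NormalDist()      # mean 0, standard deviation 1
    z = std_normal.inv_cdf(q)      # the point z with P(z' < z) = q
    h = std_normal.pdf(z)          # height of the density at z
    return r_pb / h * math.sqrt(p * q)


# With p = q = 0.5, z is 0 and h is 1 / sqrt(2 * pi).
r_b = biserial_from_point_biserial(0.4, 0.5)
```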

property point_biserial

Computes the point-biserial correlation coefficient between a binary variable \(X\) and a continuous variable \(Y\).

\(r_{\mathrm{pb}} = \frac{(Y_1 - Y_0) \sqrt{pq}}{\sigma_Y}\)

Where

  • \(Y_0\) is the average of \(Y\) when \(X=0\)

  • \(Y_1\) is the average of \(Y\) when \(X=1\)

  • \(\sigma_Y\) is the standard deviation of \(Y\)

  • \(p\) is \(P(X=1)\)

  • \(q\) is \(1 - p\)

Returns

Point-biserial correlation coefficient.
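
A short sketch of the formula, assuming that \(\sigma_Y\) is the population standard deviation (the text does not say which variant is used). Under that assumption the result coincides with Pearson's r between the 0/1 variable and \(Y\). The function name and data are hypothetical.

```python
import math
from statistics import mean, pstdev


def point_biserial(x, y):
    """Point-biserial correlation between binary x (0/1) and continuous y."""
    y_0 = [v for flag, v in zip(x, y) if flag == 0]
    y_1 = [v for flag, v in zip(x, y) if flag == 1]
    p = len(y_1) / len(y)  # P(X = 1)
    q = 1 - p
    # Assumption: population standard deviation of all y values.
    return (mean(y_1) - mean(y_0)) * math.sqrt(p * q) / pstdev(y)


x = [0, 0, 0, 1, 1, 1]
y = [1, 2, 3, 4, 5, 6]
r_pb = point_biserial(x, y)  # (5 - 2) * 0.5 / sqrt(35 / 12)
```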

property rank_biserial

Computes the rank-biserial correlation between a binary variable \(X\) and a continuous variable \(Y\).

\(r_r = \frac{2 (Y_1 - Y_0)}{n}\)

Where

  • \(Y_0\) is the average of \(Y\) when \(X=0\)

  • \(Y_1\) is the average of \(Y\) when \(X=1\)

  • \(n\) is the total number of data

Returns

Rank-biserial correlation.

class pypair.biserial.BiserialStats(n, p, y_0, y_1, std)

Bases: MeasureMixin, BiserialMixin, object

Computes biserial stats.

__init__(n, p, y_0, y_1, std)

ctor.

Parameters
  • n – Total number of samples.

  • p – \(P(Y|X=0)\).

  • y_0 – Average of \(Y\) when \(X=0\). \(\bar{Y}_0\)

  • y_1 – Average of \(Y\) when \(X=1\). \(\bar{Y}_1\)

  • std – Standard deviation of \(Y\), \(\sigma\).

Continuous

These are the continuous association measures.

class pypair.continuous.Concordance(x, y)

Bases: MeasureMixin, ConcordanceMixin, object

Concordance for continuous and ordinal data.

__init__(x, y)

ctor.

Parameters
  • x – Continuous or ordinal data (iterable).

  • y – Continuous or ordinal data (iterable).

class pypair.continuous.ConcordanceMixin

Bases: object

property goodman_kruskal_gamma

Goodman-Kruskal \(\gamma\) is similar to Somers’ d. It is defined as follows.

\(\gamma = \frac{\pi_c - \pi_d}{1 - \pi_t}\)

Where

  • \(\pi_c = \frac{C}{n}\)

  • \(\pi_d = \frac{D}{n}\)

  • \(\pi_t = \frac{T}{n}\)

  • \(C\) is the number of concordant pairs

  • \(D\) is the number of discordant pairs

  • \(T\) is the number of ties

  • \(n\) is the sample size

Returns

\(\gamma\).

property kendall_tau

Kendall’s \(\tau\) is defined as follows.

\(\tau = \frac{C - D}{{{n}\choose{2}}}\)

Where

  • \(C\) is the number of concordant pairs

  • \(D\) is the number of discordant pairs

  • \(n\) is the sample size

Returns

\(\tau\).
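
Counting concordant and discordant pairs by brute force makes the definition concrete. The sketch below is quadratic in the sample size and is only meant for illustration; it treats a pair as concordant when the two differences share a sign and as discordant when they differ in sign, and ignores tied pairs. The function names are hypothetical.

```python
from itertools import combinations
from math import comb


def concordant_discordant(x, y):
    """Returns (C, D) over all unordered pairs of observations."""
    c = d = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            c += 1
        elif sign < 0:
            d += 1
    return c, d


def kendall_tau(x, y):
    c, d = concordant_discordant(x, y)
    return (c - d) / comb(len(x), 2)


x = [1, 2, 3, 4]
y = [1, 3, 2, 4]
c, d = concordant_discordant(x, y)  # only the middle pair is discordant
tau = kendall_tau(x, y)             # (5 - 1) / 6
```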

property somers_d

Computes Somers’ d for two continuous variables. Note that Somers’ d is defined for \(d_{X \cdot Y}\) and \(d_{Y \cdot X}\) and in general \(d_{X \cdot Y} \neq d_{Y \cdot X}\).

  • \(d_{Y \cdot X} = \frac{\pi_c - \pi_d}{\pi_c + \pi_d + \pi_t^Y}\)

  • \(d_{X \cdot Y} = \frac{\pi_c - \pi_d}{\pi_c + \pi_d + \pi_t^X}\)

Where

  • \(\pi_c = \frac{C}{n}\)

  • \(\pi_d = \frac{D}{n}\)

  • \(\pi_t^X = \frac{T^X}{n}\)

  • \(\pi_t^Y = \frac{T^Y}{n}\)

  • \(C\) is the number of concordant pairs

  • \(D\) is the number of discordant pairs

  • \(T^X\) is the number of ties on \(X\)

  • \(T^Y\) is the number of ties on \(Y\)

  • \(n\) is the sample size

Returns

\(d_{X \cdot Y}\), \(d_{Y \cdot X}\).
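
The asymmetry can be illustrated with a brute-force sketch. Because every \(\pi\) term shares the same divisor \(n\), the ratios can be computed from raw counts. The sketch assumes that \(T^X\) counts pairs tied on \(X\) only, \(T^Y\) counts pairs tied on \(Y\) only, and pairs tied on both are skipped; these definitions are assumptions for the example, and the function name is hypothetical.

```python
from itertools import combinations


def somers_d(x, y):
    """Returns (d_xy, d_yx) computed from pair counts."""
    c = d = t_x = t_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue      # tied on both variables: not counted (assumption)
        elif dx == 0:
            t_x += 1      # tied on X only
        elif dy == 0:
            t_y += 1      # tied on Y only
        elif dx * dy > 0:
            c += 1
        else:
            d += 1
    d_yx = (c - d) / (c + d + t_y)
    d_xy = (c - d) / (c + d + t_x)
    return d_xy, d_yx


# X has three tied values; Y has none.
d_xy, d_yx = somers_d([1, 1, 1, 2], [1, 2, 3, 4])
```

Here C = 3, D = 0, \(T^X\) = 3 and \(T^Y\) = 0, so the two directions differ.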

class pypair.continuous.ConcordanceStats(d, t_xy, t_x, t_y, c, n)

Bases: MeasureMixin, ConcordanceMixin

Computes concordance stats.

__init__(d, t_xy, t_x, t_y, c, n)

ctor.

Parameters
  • d – Number of discordant pairs.

  • t_xy – Number of ties on XY pairs.

  • t_x – Number of ties on X pairs.

  • t_y – Number of ties on Y pairs.

  • c – Number of concordant pairs.

  • n – Total number of pairs.

class pypair.continuous.ConcordantCounts(d, t_xy, t_x, t_y, c)

Bases: object

Stores the concordant, discordant and tie counts.

__init__(d, t_xy, t_x, t_y, c)

ctor.

Parameters
  • d – Discordant.

  • t_xy – Tie.

  • t_x – Tie on X.

  • t_y – Tie on Y.

  • c – Concordant.

class pypair.continuous.Continuous(a, b)

Bases: MeasureMixin, object

__init__(a, b)

ctor.

Parameters
  • a – Continuous variable (iterable).

  • b – Continuous variable (iterable).

property kendall

Kendall’s tau.

Returns

Kendall’s tau, p-value.

property pearson

Pearson’s r.

Returns

Pearson’s r, p-value.

property regression

Linear regression.

Returns

Coefficient, p-value

property spearman

Spearman’s r.

Returns

Spearman’s r, p-value.

class pypair.continuous.CorrelationRatio(x, y)

Bases: MeasureMixin, object

Correlation ratio.

__init__(x, y)

ctor.

Parameters
  • x – Categorical variable (iterable).

  • y – Continuous variable (iterable).

property anova

Computes an ANOVA test.

Returns

F-statistic, p-value.

property calinski_harabasz

Calinski-Harabasz Index.

Returns

Calinski-Harabasz Index.

property davies_bouldin

Davies-Bouldin Index.

Returns

Davies-Bouldin Index.

property eta

Gets \(\eta\).

Returns

\(\eta\).

property eta_squared

Gets \(\eta^2 = \frac{\sigma_{\bar{y}}^2}{\sigma_{y}^2}\)

Returns

\(\eta^2\).
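
One common reading of this ratio takes \(\sigma_{\bar{y}}^2\) as the variance of the group means weighted by group size and \(\sigma_{y}^2\) as the overall variance of \(y\). The sketch below follows that reading, which is an assumption about the notation rather than a description of PyPair's code; the function name is hypothetical.

```python
from collections import defaultdict
from statistics import mean


def eta_squared(x, y):
    """Correlation ratio for categorical x and continuous y."""
    groups = defaultdict(list)
    for category, value in zip(x, y):
        groups[category].append(value)
    grand_mean = mean(y)
    # Between-group sum of squares, weighted by group size.
    between = sum(len(v) * (mean(v) - grand_mean) ** 2 for v in groups.values())
    # Total sum of squares; the common divisor N cancels in the ratio.
    total = sum((value - grand_mean) ** 2 for value in y)
    return between / total


x = ['a', 'a', 'b', 'b']
y = [1, 2, 3, 4]
eta_sq = eta_squared(x, y)  # between = 4, total = 5
eta = eta_sq ** 0.5
```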

property kruskal

Computes the Kruskal-Wallis H-test.

Returns

H-statistic, p-value.

property silhouette

Silhouette coefficient.

Returns

Silhouette coefficient.

Associations

Some of the functions here are just wrappers around the contingency tables and may be looked at as convenience methods to simply pass in data for two variables. If you need more than the specific association, you are encouraged to build the appropriate contingency table and then call upon the measures you need.

pypair.association.agreement(a, b, measure='chohen_k', a_vals=None, b_vals=None)

Gets the agreement association.

Parameters
  • a – Categorical variable (iterable).

  • b – Categorical variable (iterable).

  • measure – Measure. Default is chohen_k.

  • a_vals – The unique values in a.

  • b_vals – The unique values in b.

Returns

Measure.

pypair.association.binary_binary(a, b, measure='chisq', a_0=0, a_1=1, b_0=0, b_1=1)

Gets the binary-binary association.

Parameters
  • a – Binary variable (iterable).

  • b – Binary variable (iterable).

  • measure – Measure. Default is chisq.

  • a_0 – The a zero value. Default 0.

  • a_1 – The a one value. Default 1.

  • b_0 – The b zero value. Default 0.

  • b_1 – The b one value. Default 1.

Returns

Measure.

pypair.association.binary_continuous(b, c, measure='biserial', b_0=0, b_1=1)

Gets the binary-continuous association.

Parameters
  • b – Binary variable (iterable).

  • c – Continuous variable (iterable).

  • measure – Measure. Default is biserial.

  • b_0 – Value when b is zero. Default 0.

  • b_1 – Value when b is one. Default is 1.

Returns

Measure.

pypair.association.categorical_categorical(a, b, measure='chisq', a_vals=None, b_vals=None)

Gets the categorical-categorical association.

Parameters
  • a – Categorical variable (iterable).

  • b – Categorical variable (iterable).

  • measure – Measure. Default is chisq.

  • a_vals – The unique values in a.

  • b_vals – The unique values in b.

Returns

Measure.

pypair.association.categorical_continuous(x, y, measure='eta')

Gets the categorical-continuous association.

Parameters
  • x – Categorical variable (iterable).

  • y – Continuous variable (iterable).

  • measure – Measure. Default is eta.

Returns

Measure.

pypair.association.concordance(x, y, measure='kendall_tau')

Gets the specified concordance between the two variables.

Parameters
  • x – Continuous or ordinal variable (iterable).

  • y – Continuous or ordinal variable (iterable).

  • measure – Measure. Default is kendall_tau.

Returns

Measure.

pypair.association.confusion(a, b, measure='acc', a_0=0, a_1=1, b_0=0, b_1=1)

Gets the specified confusion matrix stats.

Parameters
  • a – Binary variable (iterable).

  • b – Binary variable (iterable).

  • measure – Measure. Default is acc.

  • a_0 – The a zero value. Default 0.

  • a_1 – The a one value. Default 1.

  • b_0 – The b zero value. Default 0.

  • b_1 – The b one value. Default 1.

Returns

Measure.

pypair.association.continuous_continuous(x, y, measure='pearson')

Gets the continuous-continuous association.

Parameters
  • x – Continuous variable (iterable).

  • y – Continuous variable (iterable).

  • measure – Measure. Default is pearson.

Returns

Measure.

Decorators

These are decorators.

pypair.decorator.distance(f)

Marker for distance functions.

pypair.decorator.similarity(f)

Marker for similarity functions.

pypair.decorator.timeit(f)

Benchmarks the time it takes (seconds) to execute.

Utility

These are utility functions.

class pypair.util.MeasureMixin

Bases: ABC

Measure mixin. Lists the functions decorated with @property and provides access to each such property by name.

get(measure)

Gets the specified measure.

Parameters

measure – Name of measure.

Returns

Measure.

get_measures()

Gets a list of all the measures.

Returns

List of all the measures.

classmethod measures()

Gets a list of all the measures.

Returns

List of all the measures.
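
One way a mixin like this could work is sketched below: properties are discovered with `inspect.getmembers` and looked up by name with `getattr`. This is an illustration of the idea only; the classes `MeasureSketch` and `PairCounts` are hypothetical and are not PyPair's implementation.

```python
import inspect


class MeasureSketch:
    """Hypothetical stand-in for a measure mixin."""

    @classmethod
    def measures(cls):
        # getmembers returns (name, value) pairs sorted by name.
        members = inspect.getmembers(cls, lambda m: isinstance(m, property))
        return [name for name, _ in members]

    def get(self, measure):
        # Look the property up by its name.
        return getattr(self, measure)


class PairCounts(MeasureSketch):
    def __init__(self, a, b, c):
        self.a, self.b, self.c = a, b, c

    @property
    def jaccard(self):
        return self.a / (self.a + self.b + self.c)

    @property
    def dice(self):
        return 2 * self.a / (2 * self.a + self.b + self.c)


names = PairCounts.measures()
value = PairCounts(3, 1, 1).get('jaccard')
```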

pypair.util.corr(df, f)

Computes the pairwise association matrix. ALL fields/columns must be of the same type so that the specified function f can compute the pairwise associations.

Parameters
  • df – Pandas data frame.

  • f – Callable function; e.g. lambda a, b: categorical_categorical(a, b, measure='phi')

pypair.util.get_measures(clazz)

Gets all the measures of a clazz.

Parameters

clazz – Clazz.

Returns

List of measures.

Spark

These are functions that you can use in Spark. You must pass in a Spark dataframe, and you will get a pair-RDD as output. The pair-RDD will have the following as its keys and values.

  • key: in the form of a tuple of strings (k1, k2) where k1 and k2 are names of variables (column names)

  • value: a dictionary {'acc': 0.8, 'tpr': 0.9, 'fpr': 0.8, ...} where keys are association measure names and values are the corresponding association values
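
The shape of these records can be mimicked in plain Python, without Spark, to see what downstream code would receive. The sketch below is an emulation under stated assumptions: columns live in a dict of 0/1 lists, pairs are the unordered combinations of column names, and the two measures shown are chosen only for illustration. None of the names come from PyPair.

```python
from itertools import combinations


def pairwise_records(columns):
    """Builds ((k1, k2), {measure: value}) records from binary columns."""
    records = []
    for k1, k2 in combinations(sorted(columns), 2):
        pairs = list(zip(columns[k1], columns[k2]))
        a = pairs.count((1, 1))
        b = pairs.count((1, 0))
        c = pairs.count((0, 1))
        # Assumes a + b + c > 0 for every pair of columns.
        measures = {'jaccard': a / (a + b + c), 'hamming': b + c}
        records.append(((k1, k2), measures))
    return records


data = {'x1': [1, 1, 0, 0], 'x2': [1, 0, 0, 0], 'x3': [1, 1, 1, 0]}
records = pairwise_records(data)
lookup = dict(records)  # keyed by (k1, k2), like collecting a pair-RDD as a map
```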

pypair.spark.agreement(sdf)

Gets all pairwise categorical-categorical agreement association measures. The result is a Spark pair-RDD, where the keys are tuples of variable names e.g. (k1, k2), and values are dictionaries of association names and metrics e.g. {'kappa': 0.9, 'delta': 0.2}. Each record in the pair-RDD is of the form:

  • (k1, k2), {'kappa': 0.9, 'delta': 0.2, …}

Parameters

sdf – Spark dataframe. Should be strings or whole numbers to represent the values.

Returns

Spark pair-RDD.

pypair.spark.binary_binary(sdf)

Gets all the pairwise binary-binary association measures. The result is a Spark pair-RDD, where the keys are tuples of variable names e.g. (k1, k2), and values are dictionaries of association names and measures e.g. {'phi': 1, 'lambda': 0.8}. Each record in the pair-RDD is of the form:

  • (k1, k2), {'phi': 1, 'lambda': 0.8, …}

Parameters

sdf – Spark dataframe. Should be all 1’s and 0’s.

Returns

Spark pair-RDD.

pypair.spark.binary_continuous(sdf, binary, continuous, b_0=0, b_1=1)

Gets all pairwise binary-continuous association measures. The result is a Spark pair-RDD, where the keys are tuples of variable names e.g. (k1, k2), and values are dictionaries of association names and metrics e.g. {'biserial': 0.9, 'point_biserial': 0.2}. Each record in the pair-RDD is of the form:

  • (k1, k2), {'biserial': 0.9, 'point_biserial': 0.2, …}

All the binary fields/columns should be encoded in the same way. For example, if you are using 1 and 0, then all binary fields should only have those values, not a mixture of 1 and 0, True and False, -1 and 1, etc.

Parameters
  • sdf – Spark dataframe.

  • binary – List of fields that are binary.

  • continuous – List of fields that are continuous.

  • b_0 – Zero value for binary field.

  • b_1 – One value for binary field.

Returns

Spark pair-RDD.

pypair.spark.categorical_categorical(sdf)

Gets all pairwise categorical-categorical association measures. The result is a Spark pair-RDD, where the keys are tuples of variable names e.g. (k1, k2), and values are dictionaries of association names and metrics e.g. {'phi': 0.9, 'chisq': 0.2}. Each record in the pair-RDD is of the form:

  • (k1, k2), {'phi': 0.9, 'chisq': 0.2, …}

Parameters

sdf – Spark dataframe. Should be strings or whole numbers to represent the values.

Returns

Spark pair-RDD.

pypair.spark.categorical_continuous(sdf, categorical, continuous)

Gets all pairwise categorical-continuous association measures. The result is a Spark pair-RDD, where the keys are tuples of variable names e.g. (k1, k2), and values are dictionaries of association names and metrics e.g. {'eta_sq': 0.9, 'eta': 0.95}. Each record in the pair-RDD is of the form:

  • (k1, k2), {'eta_sq': 0.9, 'eta': 0.95}

For now, only eta \(\eta^2\) is supported.

Parameters
  • sdf – Spark dataframe.

  • categorical – List of categorical variables.

  • continuous – List of continuous variables.

Returns

Spark pair-RDD.

pypair.spark.concordance(sdf)

Gets all the pairwise ordinal-ordinal concordance measures. The result is a Spark pair-RDD, where the keys are tuples of variable names e.g. (k1, k2), and values are dictionaries of association names and measures e.g. {'kendall': 1, 'gamma': 0.8}. Each record in the pair-RDD is of the form:

  • (k1, k2), {'kendall': 1, 'gamma': 0.8, …}

Parameters

sdf – Spark dataframe. Should be all ordinal data (numeric).

Returns

Spark pair-RDD.

pypair.spark.confusion(sdf)

Gets all the pairwise confusion matrix metrics. The result is a Spark pair-RDD, where the keys are tuples of variable names e.g. (k1, k2), and values are dictionaries of association names and metrics e.g. {'acc': 0.9, 'fpr': 0.2}. Each record in the pair-RDD is of the form:

  • (k1, k2), {'acc': 0.9, 'fpr': 0.2, …}

Parameters

sdf – Spark dataframe. Should be all 1’s and 0’s.

Returns

Spark pair-RDD.

pypair.spark.continuous_continuous(sdf)

Gets all the pairwise continuous-continuous association measures. The result is a Spark pair-RDD, where the keys are tuples of variable names e.g. (k1, k2), and values are dictionaries of association names and measures e.g. {'pearson': 1}. Each record in the pair-RDD is of the form:

  • (k1, k2), {'pearson': 1}

Only pearson is supported at the moment.

Parameters

sdf – Spark dataframe. Should be all continuous data (numeric).

Returns

Spark pair-RDD.