PyPair
Contingency Table Analysis
These are the basic contingency tables used to analyze categorical data.
CategoricalTable
BinaryTable
ConfusionMatrix
AgreementTable
- class pypair.contingency.AgreementMixin
Bases: object
Agreement computations.
- property chohen_k
Computes Cohen’s \(\kappa\).
\(\kappa = \frac{\theta_1 - \theta_2}{1 - \theta_2}\)
\(\theta_1 = \sum_i p_{ii}\)
\(\theta_2 = \sum_i p_{i+}p_{+i}\)
- Returns
\(\kappa\).
- property cohen_light_k
Cohen-Light \(\kappa\). \(\kappa\) is a measure of conditional agreement. Several \(\kappa\), one for each unique value, will be computed and returned.
\(\kappa = \frac{\theta_1 - \theta_2}{1 - \theta_2}\)
\(\theta_1 = \frac{p_{ii}}{p_{i+}}\)
\(\theta_2 = p_{+i}\)
- Returns
A list of \(\kappa\).
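As a rough illustration of the two formulas above, the following sketch computes Cohen's \(\kappa\) and the per-category Cohen-Light \(\kappa\) directly from a square table of counts. The function names and the sample table are hypothetical and written in plain Python; this is not PyPair's own implementation, and it assumes every row and column marginal is non-zero.

```python
def marginals(table):
    """Return (n, row proportions, column proportions) for a table of counts."""
    n = sum(sum(row) for row in table)
    rows = [sum(row) / n for row in table]
    cols = [sum(col) / n for col in zip(*table)]
    return n, rows, cols


def cohen_kappa(table):
    """Cohen's kappa: (theta_1 - theta_2) / (1 - theta_2)."""
    n, rows, cols = marginals(table)
    theta_1 = sum(table[i][i] for i in range(len(table))) / n  # sum of p_ii
    theta_2 = sum(r * c for r, c in zip(rows, cols))           # sum of p_i+ * p_+i
    return (theta_1 - theta_2) / (1 - theta_2)


def cohen_light_kappa(table):
    """One conditional kappa per category i."""
    n, rows, cols = marginals(table)
    kappas = []
    for i in range(len(table)):
        theta_1 = (table[i][i] / n) / rows[i]  # p_ii / p_i+
        theta_2 = cols[i]                      # p_+i
        kappas.append((theta_1 - theta_2) / (1 - theta_2))
    return kappas


# Hypothetical ratings table: rows are rater A, columns are rater B.
ratings = [[20, 5], [10, 15]]
kappa = cohen_kappa(ratings)
conditional = cohen_light_kappa(ratings)
```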
- class pypair.contingency.AgreementStats(table)
Bases: AgreementMixin, ContingencyTable
Computes agreement stats.
- __init__(table)
ctor.
- Parameters
table – Contingency table.
- class pypair.contingency.AgreementTable(a, b, a_vals=None, b_vals=None)
Bases: AgreementMixin, ContingencyTable
Represents a contingency table for agreement data against one variable. The variable is typically a rating variable (e.g. dislike, neutral, like), and the data is a pairing of ratings over the same set of items. The agreement table induced by the data is typically square: the number of rows equals the number of columns.
- __init__(a, b, a_vals=None, b_vals=None)
ctor.
- Parameters
a – Categorical variable.
b – Categorical variable.
a_vals – Values in a. Default None; figure out empirically.
b_vals – Values in b. Default None; figure out empirically.
- class pypair.contingency.BinaryMixin
Bases: object
Binary computations based off of a, b, c and d from a 2x2 contingency table.
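To make the four cell counts concrete, here is a minimal sketch that tallies them from two 0/1 sequences and evaluates a few of the measures listed below. It assumes the common convention that a counts pairs where both variables are 1, b and c count the two kinds of mismatch, and d counts pairs where both are 0; the helper name `binary_counts` is hypothetical and is not part of PyPair.

```python
def binary_counts(x, y):
    """Count the cells of a 2x2 table from two 0/1 sequences.

    Assumed convention: a = both 1, b = x only, c = y only, d = both 0.
    """
    a = sum(1 for i, j in zip(x, y) if i == 1 and j == 1)
    b = sum(1 for i, j in zip(x, y) if i == 1 and j == 0)
    c = sum(1 for i, j in zip(x, y) if i == 0 and j == 1)
    d = sum(1 for i, j in zip(x, y) if i == 0 and j == 0)
    return a, b, c, d


x = [1, 1, 1, 0, 0, 0, 1, 0]
y = [1, 1, 0, 0, 0, 1, 1, 0]
a, b, c, d = binary_counts(x, y)

jaccard = a / (a + b + c)        # a / (a + b + c)
dice = 2 * a / (2 * a + b + c)   # 2a / (2a + b + c)
hamming = b + c                  # b + c (a distance)
```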
- property ample
Ample
\(\left|\frac{a(c+d)}{c(a+b)}\right|\)
- Returns
Ample.
- property anderberg
Anderberg
\(\frac{\sigma-\sigma'}{2n}\)
- Returns
Anderberg.
- property baroni_urbani_buser_i
Baroni-Urbani-Buser-I
\(\frac{\sqrt{ad}+a}{\sqrt{ad}+a+b+c}\)
- Returns
Baroni-Urbani-Buser-I.
- property baroni_urbani_buser_ii
Baroni-Urbani-Buser-II
\(\frac{\sqrt{ad}+a-(b+c)}{\sqrt{ad}+a+b+c}\)
- Returns
Baroni-Urbani-Buser-II.
- property braun_banquet
Braun-Banquet
\(\frac{a}{\max(a+b,a+c)}\)
- Returns
Braun-Banquet.
- property chisq
\(\chi^2\) (alias for Pearson-I)
- Returns
\(\chi^2\).
- property chord
Chord
\(\sqrt{2\left(1 - \frac{a}{\sqrt{(a+b)(a+c)}}\right)}\)
- Returns
Chord (distance).
- property cole_i
Cole-I
\(\frac{\sqrt{2}(ad-bc)}{\sqrt{(ad-bc)^2-(a+b)(a+c)(b+d)(c+d)}}\)
- Returns
Cole-I.
- property cole_ii
Cole-II
\(\frac{ad-bc}{\min((a+b)(a+c),(b+d)(c+d))}\)
- Returns
Cole-II.
- property contingency_coefficient
- Returns
Contingency coefficient.
- property cosine
Cosine
\(\frac{a}{(a+b)(a+c)}\)
- Returns
Cosine.
- property cramer_v
- Returns
Cramer’s V.
- property dennis
Dennis
\(\frac{ad-bc}{\sqrt{n(a+b)(a+c)}}\)
- Returns
Dennis.
- property dice
Dice; Czekanowski; Nei-Li
\(\frac{2a}{2a+b+c}\)
- Returns
Dice.
- property disperson
Dispersion
\(\frac{ad-bc}{(a+b+c+d)^2}\)
- Returns
Dispersion.
- property driver_kroeber
Driver-Kroeber
\(\frac{a}{2}\left(\frac{1}{a+b}+\frac{1}{a+c}\right)\)
- Returns
Driver-Kroeber.
- property euclid
Euclid
\(\sqrt{b+c}\)
- Returns
Euclid (distance).
- property eyraud
Eyraud
\(\frac{n^2(na-(a+b)(a+c))}{(a+b)(a+c)(b+d)(c+d)}\)
- Returns
Eyraud.
- property fager_mcgowan
Fager-McGowan
\(\frac{a}{\sqrt{(a+b)(a+c)}}-\frac{\max(a+b,a+c)}{2}\)
- Returns
Fager-McGowan.
- property faith
Faith
\(\frac{a+0.5d}{a+b+c+d}\)
- Returns
Faith.
- property forbes_ii
Forbes-II
\(\frac{na-(a+b)(a+c)}{n \min(a+b,a+c) - (a+b)(a+c)}\)
- Returns
Forbes-II.
- property forbesi
Forbesi
\(\frac{na}{(a+b)(a+c)}\)
- Returns
Forbesi.
- property fossum
Fossum
\(\frac{n(a-0.5)^2}{(a+b)(a+c)}\)
- Returns
Fossum.
- property gilbert_wells
Gilbert-Wells
\(\log a - \log n - \log \frac{a+b}{n} - \log \frac{a+c}{n}\)
- Returns
Gilbert-Wells.
- property goodman_kruskal
Goodman-Kruskal
\(\frac{\sigma - \sigma'}{2n-\sigma'}\)
- Returns
Goodman-Kruskal.
- property gower
Gower
\(\frac{a+d}{\sqrt{(a+b)(a+c)(b+d)(c+d)}}\)
- Returns
Gower.
- property gower_legendre
Gower-Legendre
\(\frac{a+d}{a+0.5b+0.5c+d}\)
- Returns
Gower-Legendre.
- property hamann
Hamann.
\(\frac{(a+d)-(b+c)}{a+b+c+d}\)
- Returns
Hamann.
- property hamming
Hamming; Canberra; Manhattan; Cityblock; Minkowski
\(b+c\)
- Returns
Hamming (distance).
- property hellinger
Hellinger
\(2\sqrt{1 - \frac{a}{\sqrt{(a+b)(a+c)}}}\)
- Returns
Hellinger (distance).
- property inner_product
Inner-product.
\(a+d\)
- Returns
Inner-product.
- property intersection
Intersection
\(a\)
- Returns
Intersection.
- property jaccard
Jaccard
\(\frac{a}{a+b+c}\)
- Returns
Jaccard.
- property jaccard_3w
3W-Jaccard
\(\frac{3a}{3a+b+c}\)
- Returns
3W-Jaccard.
- property jaccard_distance
Jaccard
\(\frac{b + c}{a + b + c}\)
- Returns
Jaccard (distance).
- property johnson
Johnson.
\(\frac{a}{a+b}+\frac{a}{a+c}\)
- Returns
Johnson.
- property kulcyznski_ii
Kulczynski-II
\(\frac{0.5a(2a+b+c)}{(a+b)(a+c)}\)
- Returns
Kulczynski-II.
- property kulczynski_i
Kulczynski-I
\(\frac{a}{b+c}\)
- Returns
Kulczynski-I.
- property lance_williams
Lance-Williams; Bray-Curtis
\(\frac{b+c}{2a+b+c}\)
- Returns
Lance-Williams (distance).
- property mcconnaughey
McConnaughey
\(\frac{a^2 - bc}{(a+b)(a+c)}\)
- Returns
McConnaughey.
- property mcnemar_test
- Returns
A tuple. The first element is the chi-square test statistic. The second element is the p-value.
- property mean_manhattan
Mean-Manhattan
\(\frac{b+c}{a+b+c+d}\)
- Returns
Mean-Manhattan (distance).
- property michael
Michael
\(\frac{4(ad-bc)}{(a+d)^2+(b+c)^2}\)
- Returns
Michael.
- property mountford
Mountford
\(\frac{a}{0.5(ab + ac) + bc}\)
- Returns
Mountford.
- property ochia_i
Ochiai-I
Also known as the Fowlkes-Mallows Index. This measure is typically used to judge the similarity between two clusters. A larger value indicates that the clusters are more similar.
\(\frac{a}{\sqrt{(a+b)(a+c)}}\)
- Returns
Ochiai-I.
- property ochia_ii
Ochiai-II
\(\frac{ad}{\sqrt{(a+b)(a+c)(b+d)(c+d)}}\)
- Returns
Ochiai-II.
- property odds_ratio
Odds ratio. The odds ratio is also referred to as the cross-product ratio.
- Returns
Odds ratio.
- property pattern_difference
Pattern difference
\(\frac{4bc}{(a+b+c+d)^2}\)
- Returns
Pattern difference (distance).
- property pearson_heron_i
Pearson-Heron-I
\(\frac{ad-bc}{\sqrt{(a+b)(a+c)(b+d)(c+d)}}\)
- Returns
Pearson-Heron-I.
- property pearson_heron_ii
Pearson-Heron-II
\(\sqrt{\frac{\chi^2}{n+\chi^2}}\)
- Returns
Pearson-Heron-II.
- property pearson_i
Pearson-I
\(\chi^2=\frac{n(ad-bc)^2}{(a+b)(a+c)(c+d)(b+d)}\)
- Returns
Pearson-I.
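The Pearson-I and Pearson-Heron-I formulas are closely related: for a 2x2 table, \(\chi^2 = n\rho^2\) where \(\rho\) is the Pearson-Heron-I value. The sketch below evaluates both formulas as written above on hypothetical counts. The function names simply mirror the property names; this is an illustration, not the library's code, and it assumes none of the marginals is zero.

```python
from math import sqrt


def pearson_i(a, b, c, d):
    """Chi-square for a 2x2 table: n(ad-bc)^2 / ((a+b)(a+c)(c+d)(b+d))."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (a + c) * (c + d) * (b + d))


def pearson_heron_i(a, b, c, d):
    """(ad-bc) / sqrt((a+b)(a+c)(b+d)(c+d))."""
    return (a * d - b * c) / sqrt((a + b) * (a + c) * (b + d) * (c + d))


# Hypothetical counts.
chisq = pearson_i(3, 1, 1, 3)
rho = pearson_heron_i(3, 1, 1, 3)
n = 3 + 1 + 1 + 3
```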
- property peirce
Peirce
\(\frac{ab+bc}{ab+2bc+cd}\)
- Returns
Peirce.
- property person_ii
Pearson-II
\(\sqrt{\frac{\rho}{n+\rho}}\)
\(\rho=\frac{ad-bc}{\sqrt{(a+b)(a+c)(b+d)(c+d)}}\)
- Returns
Pearson-II.
- property roger_tanimoto
Roger-Tanimoto
\(\frac{a+d}{a+2b+2c+d}\)
- Returns
Roger-Tanimoto.
- property russel_rao
Russel-Rao
\(\frac{a}{a+b+c+d}\)
- Returns
Russel-Rao.
- property shape_difference
Shape difference
\(\frac{n(b+c)-(b-c)^2}{(a+b+c+d)^2}\)
- Returns
Shape difference (distance).
- property size_difference
Size difference
\(\frac{(b+c)^2}{(a+b+c+d)^2}\)
- Returns
Size difference (distance).
- property sokal_michener
Sokal-Michener
\(\frac{a+d}{a+b+c+d}\)
- Returns
Sokal-Michener.
- property sokal_sneath_i
Sokal-Sneath-I
\(\frac{a}{a+2b+2c}\)
- Returns
Sokal-Sneath-I.
- property sokal_sneath_ii
Sokal-Sneath-II
\(\frac{2a+2d}{2a+b+c+2d}\)
- Returns
Sokal-Sneath-II.
- property sokal_sneath_iii
Sokal-Sneath-III
\(\frac{a+d}{b+c}\)
- Returns
Sokal-Sneath-III.
- property sokal_sneath_iv
Sokal-Sneath-IV
\(\frac{ad}{(a+b)(a+c)(b+d)\sqrt{c+d}}\)
- Returns
Sokal-Sneath-IV.
- property sokal_sneath_v
Sokal-Sneath-V
\(\frac{1}{4}\left(\frac{a}{a+b}+\frac{a}{a+c}+\frac{d}{b+d}+\frac{d}{c+d}\right)\)
- Returns
Sokal-Sneath-V.
- property sorensen_dice
Sørensen–Dice
\(\frac{2(a + d)}{2(a + d) + b + c}\)
- Returns
Sørensen–Dice.
- property sorgenfrei
Sorgenfrei
\(\frac{a^2}{(a+b)(a+c)}\)
- Returns
Sorgenfrei.
- property stiles
Stiles
\(\log_{10} \frac{n\left(|ad-bc|-\frac{n}{2}\right)^2}{(a+b)(a+c)(b+d)(c+d)}\)
- Returns
Stiles.
- property tanimoto_distance
Tanimoto similarity and distance.
- Returns
Tanimoto distance.
- property tanimoto_i
Tanimoto-I
\(\frac{a}{2a+b+c}\)
- Returns
Tanimoto-I.
- property tanimoto_ii
Tanimoto-II
\(\frac{a}{b + c}\)
- Returns
Tanimoto-II.
- property tarantula
Tarantula
\(\frac{a(c+d)}{c(a+b)}\)
- Returns
Tarantula.
- property tarwid
Tarwid
\(\frac{na - (a+b)(a+c)}{na + (a+b)(a+c)}\)
- Returns
Tarwid.
- property tetrachoric
Tetrachoric correlation ranges over \([-1, 1]\), where 0 indicates no agreement, 1 indicates perfect agreement and -1 indicates perfect disagreement.
if \(b=0\) or \(c=0\), 1.0
if \(a=0\) or \(d=0\), -1.0
else, \(\frac{y-1}{y+1}, y={\left(\frac{da}{bc}\right)}^{\frac{\pi}{4}}\)
- Returns
Tetrachoric correlation.
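The piecewise definition can be sketched as a small function. This is only an illustration of the approximation formula, not the library's code. It takes the -1.0 branch to apply when \(a=0\) or \(d=0\), which is an assumption based on the fact that \(ad=0\) drives \(y\) to 0 and the ratio to -1.

```python
import math


def tetrachoric(a, b, c, d):
    """Approximate tetrachoric correlation from 2x2 counts."""
    if b == 0 or c == 0:
        return 1.0   # bc = 0: y grows without bound, ratio tends to 1
    if a == 0 or d == 0:
        return -1.0  # ad = 0 (assumed branch): y = 0, ratio is -1
    y = ((d * a) / (b * c)) ** (math.pi / 4)
    return (y - 1) / (y + 1)


r = tetrachoric(3, 1, 1, 3)
```

With equal cross-products (\(ad = bc\)), \(y = 1\) and the result is 0.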
- property tschuprow_t
- Returns
Tschuprow’s T.
- tversky_index(theta=1, phi=0)
Computes Tversky’s Index.
\(\frac{a}{a+\theta b+\phi c}\)
\(\theta\) and \(\phi\) are typically in \([0,1]\) with \(\theta + \phi = 1\).
- Parameters
theta – Weight \([0,1]\) of how important match on row variable is. Default 1.
phi – Weight \([0,1]\) of how important match on column variable is. Default 0.
- Returns
Tversky’s Index.
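A minimal sketch of the formula, written as a free function over the counts rather than as the method on the table class (the free-function form and the sample counts are illustrative assumptions). It also shows two special cases that follow from the formulas in this section: \(\theta=\phi=1\) gives Jaccard and \(\theta=\phi=0.5\) gives Dice.

```python
def tversky_index(a, b, c, theta=1.0, phi=0.0):
    """a / (a + theta * b + phi * c)."""
    return a / (a + theta * b + phi * c)


a, b, c = 3, 1, 1  # hypothetical counts
default = tversky_index(a, b, c)                            # theta=1, phi=0
as_jaccard = tversky_index(a, b, c, theta=1.0, phi=1.0)     # a / (a + b + c)
as_dice = tversky_index(a, b, c, theta=0.5, phi=0.5)        # 2a / (2a + b + c)
```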
- property vari
Vari
\(\frac{b+c}{4a+4b+4c+4d}\)
- Returns
Vari (distance).
- property yule_q
Yule’s Q
\(\frac{ad-bc}{ad+bc}\)
Also, Yule’s Q is based off of the odds ratio or cross-product ratio, \(\alpha\).
\(Q = \frac{\alpha - 1}{\alpha + 1}\)
Yule’s Q is the same as Goodman-Kruskal’s \(\gamma\) for 2 x 2 contingency tables and is also a measure of proportional reduction in error (PRE).
- Returns
Yule’s Q.
- property yule_q_difference
Yule’s q
\(\frac{2bc}{ad+bc}\)
- Returns
Yule’s q (distance).
- property yule_w
Yule’s w
\(\frac{\sqrt{ad}-\sqrt{bc}}{\sqrt{ad}+\sqrt{bc}}\)
- Returns
Yule’s w.
- property yule_y
Yule’s Y is based off of the odds ratio or cross-product ratio, \(\alpha\).
\(Y = \frac{\sqrt\alpha - 1}{\sqrt\alpha + 1}\)
- Returns
Yule’s Y.
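Both Yule's Q and Yule's Y can be written in terms of the odds ratio \(\alpha\). The sketch below assumes the standard cross-product form \(\alpha = \frac{ad}{bc}\), which is not spelled out in this section, and uses hypothetical function names and counts. It is undefined when \(bc = 0\).

```python
from math import sqrt


def odds_ratio(a, b, c, d):
    """Cross-product ratio ad / bc (assumed form; undefined when bc == 0)."""
    return (a * d) / (b * c)


def yule_q(a, b, c, d):
    alpha = odds_ratio(a, b, c, d)
    return (alpha - 1) / (alpha + 1)


def yule_y(a, b, c, d):
    root = sqrt(odds_ratio(a, b, c, d))
    return (root - 1) / (root + 1)


q = yule_q(3, 1, 1, 3)
y = yule_y(3, 1, 1, 3)
# Direct form of Q from the counts, for comparison.
q_direct = (3 * 3 - 1 * 1) / (3 * 3 + 1 * 1)
```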
- class pypair.contingency.BinaryStats(table)
Bases: CategoricalMixin, BinaryMixin, ContingencyTable
Computes binary stats.
- __init__(table)
ctor.
- Parameters
table – Contingency table.
- class pypair.contingency.BinaryTable(a, b, a_0=0, a_1=1, b_0=0, b_1=1)
Bases: CategoricalMixin, BinaryMixin, ContingencyTable
Represents a contingency table for binary variables.
- __init__(a, b, a_0=0, a_1=1, b_0=0, b_1=1)
ctor.
- Parameters
a – Iterable list.
b – Iterable list.
a_0 – The zero value for a. Defaults to 0.
a_1 – The one value for a. Defaults to 1.
b_0 – The zero value for b. Defaults to 0.
b_1 – The one value for b. Defaults to 1.
- class pypair.contingency.CategoricalMixin
Bases: object
Categorical computations based off a contingency table.
- property adjusted_rand_index
The Adjusted Rand Index (ARI) typically lies in [0, 1], but negative values can arise when the index is less than its expected value. This function uses binom() from scipy.special; when n >= 300, the intermediate values become very large and may cause overflow.
TODO: use a different way to compute binomial coefficient
- Returns
Adjusted Rand Index.
- property chisq
The chi-square statistic, \(\chi^2\), is defined as follows.
\(\sum_i \sum_j \frac{(O_{ij} - E_{ij})^2}{E_{ij}}\)
In a contingency table, \(O_{ij}\) is the observed cell count corresponding to the \(i\)-th row and \(j\)-th column. \(E_{ij}\) is the expected cell count corresponding to the \(i\)-th row and \(j\)-th column.
\(E_{ij} = \frac{N_{i*} N_{*j}}{N}\)
Where \(N_{i*}\) is the i-th row marginal, \(N_{*j}\) is the j-th column marginal and \(N\) is the sum of all the values in the contingency cells (or the total size of the data).
- Returns
Chi-square statistic.
- property chisq_dof
Returns the degrees of freedom for \(\chi^2\), which is defined as \((R - 1)(C - 1)\), where \(R\) is the number of rows and \(C\) is the number of columns in a contingency table induced by two categorical variables.
- Returns
Degrees of freedom.
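The chi-square statistic and its degrees of freedom can be sketched from a table of counts as follows. This is a plain-Python illustration with a hypothetical function name and table, not PyPair's implementation, and it assumes every expected count is non-zero.

```python
def chi_square(table):
    """Return (chi-square statistic, degrees of freedom) for a table of counts."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]        # N_{i*}
    col_tot = [sum(col) for col in zip(*table)]  # N_{*j}
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n  # E_ij
            stat += (observed - expected) ** 2 / expected
    dof = (len(row_tot) - 1) * (len(col_tot) - 1)
    return stat, dof


# Hypothetical 2 x 3 table of counts.
stat, dof = chi_square([[10, 20, 30], [20, 20, 0]])
```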
- property gk_lambda
Goodman-Kruskal’s lambda is the proportional reduction in error of predicting one variable b given another a: \(\lambda_{B|A}\).
The probability of an error in predicting the column category: \(P_e = 1 - \frac{\max_{c} N_{* c}}{N}\)
The probability of an error in predicting the column category given the row category: \(P_{e|r} = 1 - \frac{\sum_r \max_{c} N_{r c}}{N}\)
Where,
\(\max_{c} N_{* c}\) is the maximum of the column marginals
\(\sum_r \max_{c} N_{r c}\) is the sum over the maximum value per row
\(N\) is the total
Thus, \(\lambda_{B|A} = \frac{P_e - P_{e|r}}{P_e}\).
By default, the contingency table is set up with a on the rows and b on the columns. Note that Goodman-Kruskal’s lambda is not symmetric: \(\lambda_{B|A}\) does not necessarily equal \(\lambda_{A|B}\). By default, \(\lambda_{B|A}\) is computed, but if you desire the reverse, use gk_lambda_reversed.
- Returns
Goodman-Kruskal’s lambda.
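The definition of \(\lambda_{B|A}\) translates into a few lines of plain Python. The sketch below uses a hypothetical function name and table, with a on the rows and b on the columns. Applying it to the transposed table gives \(\lambda_{A|B}\), which illustrates the asymmetry.

```python
def gk_lambda(table):
    """Goodman-Kruskal lambda for predicting the column variable from the row variable."""
    n = sum(sum(row) for row in table)
    col_tot = [sum(col) for col in zip(*table)]
    p_e = 1 - max(col_tot) / n                          # error ignoring the row
    p_e_r = 1 - sum(max(row) for row in table) / n      # error given the row
    return (p_e - p_e_r) / p_e


table = [[10, 20, 30], [20, 20, 0]]        # hypothetical counts
transposed = [list(col) for col in zip(*table)]

lambda_b_given_a = gk_lambda(table)
lambda_a_given_b = gk_lambda(transposed)
```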
- property gk_lambda_reversed
Computes \(\lambda_{A|B}\).
- Returns
Goodman-Kruskal’s lambda.
- property mutual_information
The mutual information between two variables \(X\) and \(Y\) is denoted as \(I(X;Y)\). \(I(X;Y)\) is non-negative and unbounded above, i.e. in the range \([0, \infty)\). A higher mutual information value implies a stronger association. The formula for \(I(X;Y)\) is defined as follows.
\(I(X;Y) = \sum_y \sum_x P(x, y) \log \frac{P(x, y)}{P(x) P(y)}\)
- Returns
Mutual information.
- property phi
Gets \(\phi\).
\(\phi = \sqrt{\frac{\chi^2}{N}}\)
- Returns
\(\phi\).
- property uncertainty_coefficient
The uncertainty coefficient \(U(X|Y)\) for two variables \(X\) and \(Y\) is defined as follows.
\(U(X|Y) = \frac{I(X;Y)}{H(X)}\)
Where,
\(H(X) = -\sum_x P(x) \log P(x)\)
\(I(X;Y) = \sum_y \sum_x P(x, y) \log \frac{P(x, y)}{P(x) P(y)}\)
\(H(X)\) is called the entropy of \(X\) and \(I(X;Y)\) is the mutual information between \(X\) and \(Y\). Note that \(0 \le I(X;Y) \le H(X)\). As such, the uncertainty coefficient may be viewed as the mutual information between \(X\) and \(Y\) normalized by the entropy of \(X\), and it lies in the range \([0, 1]\).
- Returns
Uncertainty coefficient.
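A small sketch of mutual information and the uncertainty coefficient computed from a table of counts. It assumes \(X\) is the row variable and uses natural logarithms; the function names and tables are hypothetical, and this is not the library's implementation. Cells with zero count are skipped, following the usual convention that \(0 \log 0 = 0\).

```python
import math


def mutual_information(table):
    """I(X;Y) from a table of counts, with X on the rows (assumed)."""
    n = sum(sum(row) for row in table)
    p_x = [sum(row) / n for row in table]
    p_y = [sum(col) / n for col in zip(*table)]
    mi = 0.0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            if count > 0:  # skip empty cells: 0 * log(0) is taken as 0
                p_xy = count / n
                mi += p_xy * math.log(p_xy / (p_x[i] * p_y[j]))
    return mi


def entropy(probs):
    """H = -sum p log p over non-zero probabilities."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def uncertainty_coefficient(table):
    """U(X|Y) = I(X;Y) / H(X)."""
    n = sum(sum(row) for row in table)
    p_x = [sum(row) / n for row in table]
    return mutual_information(table) / entropy(p_x)


u_independent = uncertainty_coefficient([[10, 10], [10, 10]])
u_perfect = uncertainty_coefficient([[10, 0], [0, 10]])
```

The two sample tables hit the ends of the range: independent variables give 0 and a perfect one-to-one association gives 1.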
- property uncertainty_coefficient_reversed
- Returns
Uncertainty coefficient.
- class pypair.contingency.CategoricalStats(table)
Bases: CategoricalMixin, ContingencyTable
Computes categorical stats.
- __init__(table)
ctor.
- Parameters
table – Contingency table.
- class pypair.contingency.CategoricalTable(a, b, a_vals=None, b_vals=None)
Bases: CategoricalMixin, ContingencyTable
Represents a contingency table for categorical variables.
- __init__(a, b, a_vals=None, b_vals=None)
ctor. If a_vals or b_vals are None, then the possible values will be determined empirically from the data.
- Parameters
a – Iterable list.
b – Iterable list.
a_vals – All possible values in a. Defaults to None.
b_vals – All possible values in b. Defaults to None.
- class pypair.contingency.ConfusionMatrix(a, b, a_0=0, a_1=1, b_0=0, b_1=1)
Bases: ConfusionMixin, ContingencyTable
Represents a confusion matrix. The confusion matrix looks like what is shown below for two binary variables a and b; a is in the rows and b in the columns. Most of the statistics around performance come from the counts of TN, FN, FP and TP.
      b=0   b=1
a=0   TN    FP
a=1   FN    TP
- __init__(a, b, a_0=0, a_1=1, b_0=0, b_1=1)
ctor. Note that a is the ground truth and b is the prediction.
- Parameters
a – Binary variable (iterable). Ground truth.
b – Binary variable (iterable). Prediction.
a_0 – The zero value for a. Defaults to 0.
a_1 – The one value for a. Defaults to 1.
b_0 – The zero value for b. Defaults to 0.
b_1 – The one value for b. Defaults to 1.
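To illustrate the layout, with a as ground truth on the rows and b as prediction on the columns, the following sketch tallies the four cells from two 0/1 lists. The helper name and data are hypothetical; this is not the class's own code.

```python
def confusion_counts(truth, pred):
    """Return (TN, FP, FN, TP) for 0/1 ground truth and predictions."""
    tn = fp = fn = tp = 0
    for t, p in zip(truth, pred):
        if t == 0 and p == 0:
            tn += 1
        elif t == 0 and p == 1:
            fp += 1
        elif t == 1 and p == 0:
            fn += 1
        elif t == 1 and p == 1:
            tp += 1
    return tn, fp, fn, tp


truth = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
tn, fp, fn, tp = confusion_counts(truth, pred)
acc = (tp + tn) / (tp + tn + fp + fn)
```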
- class pypair.contingency.ConfusionMixin
Bases: object
Confusion matrix computations.
- property acc
Accuracy.
\(ACC = \frac{TP + TN}{TP + TN + FP + FN}\)
- Returns
Accuracy.
- property ba
Balanced accuracy.
\(BA = \frac{TPR + TNR}{2}\)
- Returns
Balanced accuracy.
- property bm
Bookmaker informedness.
\(BM = TPR + TNR - 1\)
- Returns
BM.
- property dor
Diagnostic odds ratio.
\(\frac{PLR}{NLR}\)
- Returns
DOR.
- property f1
F1 score: harmonic mean of precision and sensitivity.
\(F1 = \frac{2 \times PPV \times TPR}{PPV + TPR}\)
- Returns
F1.
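As a quick check of the harmonic-mean reading of F1, the sketch below computes precision (PPV), sensitivity (TPR) and \(\frac{2 \times PPV \times TPR}{PPV + TPR}\) from raw counts. The function name and counts are hypothetical, and the code assumes at least one true positive so that no denominator is zero.

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision (PPV) and sensitivity (TPR)."""
    ppv = tp / (tp + fp)
    tpr = tp / (tp + fn)
    return 2 * ppv * tpr / (ppv + tpr)


f1 = f1_score(tp=3, fp=1, fn=1)
# Equivalent count-based form: 2TP / (2TP + FP + FN).
f1_counts = 2 * 3 / (2 * 3 + 1 + 1)
```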
- property fdr
False discovery rate.
\(FDR = \frac{FP}{FP + TP}\)
- Returns
FDR.
- property fn
FN
- Returns
FN.
- property fnr
False negative rate.
\(FNR = \frac{FN}{FN + TP}\)
Aliases
miss rate
- Returns
FNR.
- property fomr
False omission rate.
\(FOR = \frac{FN}{FN + TN}\)
- Returns
FOR.
- property fp
FP
- Returns
FP.
- property fpr
False positive rate.
\(FPR = \frac{FP}{FP + TN}\)
Aliases
fall-out
probability of false alarm
- Returns
FPR.
- property mcc
Matthews correlation coefficient.
\(MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}\)
- Returns
MCC.
- property mk
Markedness.
\(MK = PPV + NPV - 1\)
Aliases
deltaP
- Returns
Markedness.
- property n
\(N = TP + FN + FP + TN\)
- Returns
\(N\).
- property nlr
Negative likelihood ratio.
\(NLR = \frac{FNR}{TNR}\)
Aliases
LR-
- Returns
NLR.
- property npv
Negative predictive value.
\(NPV = \frac{TN}{TN + FN}\)
- Returns
NPV.
- property plr
Positive likelihood ratio.
\(PLR = \frac{TPR}{FPR}\)
Aliases
LR+
- Returns
PLR.
- property ppv
Positive predictive value.
\(PPV = \frac{TP}{TP + FP}\)
Aliases
precision
- Returns
PPV.
- property precision
Alias to PPV.
- Returns
PPV.
- property prevalence
Prevalence.
\(\frac{TP + FN}{N}\)
- Returns
Prevalence.
- property pt
Prevalence threshold.
\(PT = \frac{\sqrt{TPR(-TNR + 1)} + TNR - 1}{TPR + TNR - 1}\)
- Returns
Prevalence threshold.
- property recall
Alias to TPR.
- Returns
TPR.
- property sensitivity
Alias to TPR.
- Returns
Sensitivity.
- property specificity
Alias to TNR.
- Returns
Specificity.
- property tn
TN
- Returns
TN.
- property tnr
True negative rate.
\(TNR = \frac{TN}{TN + FP}\)
Aliases
specificity
selectivity
- Returns
TNR.
- property tp
TP
- Returns
TP.
- property tpr
True positive rate.
\(TPR = \frac{TP}{TP + FN}\)
Aliases
sensitivity
recall
hit rate
power
probability of detection
- Returns
TPR.
- property ts
Threat score.
\(TS = \frac{TP}{TP + FN + FP}\)
Aliases
critical success index (CSI).
- Returns
TS.
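Several of the measures in this class are simple combinations of TPR, TNR, FPR and FNR. The sketch below derives a handful of them from raw counts using the formulas listed above, with MCC taken in its usual product form \(TP \times TN - FP \times FN\) in the numerator. The function name and counts are hypothetical, and the code assumes every denominator is non-zero.

```python
from math import sqrt


def confusion_rates(tp, fn, fp, tn):
    """Return a dict of selected confusion-matrix measures."""
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    fpr = fp / (fp + tn)
    fnr = fn / (fn + tp)
    plr = tpr / fpr
    nlr = fnr / tnr
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        'ba': (tpr + tnr) / 2,   # balanced accuracy
        'bm': tpr + tnr - 1,     # bookmaker informedness
        'plr': plr,              # positive likelihood ratio
        'nlr': nlr,              # negative likelihood ratio
        'dor': plr / nlr,        # diagnostic odds ratio
        'mcc': mcc,
    }


rates = confusion_rates(tp=3, fn=1, fp=1, tn=5)
```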
- class pypair.contingency.ConfusionStats(table)
Bases: ConfusionMixin, ContingencyTable
Computes confusion matrix stats.
- __init__(table)
ctor.
- Parameters
table – Contingency table.
- class pypair.contingency.ContingencyTable(table)
Bases: MeasureMixin, ABC
Abstract contingency table. All other tables inherit from this one.
- __init__(table)
ctor.
- Parameters
table – A table of counts (list of lists).
Biserial
These are the biserial association measures.
- class pypair.biserial.Biserial(b, c, b_0=0, b_1=1)
Bases: MeasureMixin, BiserialMixin, object
Biserial association between a binary and continuous variable.
- __init__(b, c, b_0=0, b_1=1)
ctor.
- Parameters
b – Binary variable (iterable).
c – Continuous variable (iterable).
b_0 – Value for b is zero. Default 0.
b_1 – Value for b is one. Default 1.
- class pypair.biserial.BiserialMixin
Bases: object
Biserial computations based off of \(n, p, q, y_0, y_1, \sigma\).
- property biserial
Computes the biserial correlation between a binary and continuous variable. The biserial correlation \(r_b\) can be computed from the point-biserial correlation \(r_{\mathrm{pb}}\) as follows.
\(r_b = \frac{r_{\mathrm{pb}}}{h} \sqrt{pq}\)
The tricky part is the \(h\) parameter. \(h\) is defined as the height (density) of the standard normal distribution at \(z\), where \(P(z'<z) = q\) and \(P(z'>z) = p\). In practice, \(h\) is obtained by taking the inverse standard normal CDF of \(q\) and then evaluating the standard normal density at that result. Using SciPy: norm.pdf(norm.ppf(q)).
- Returns
Biserial correlation coefficient.
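The computation of \(h\) can be sketched with the standard library alone. The example below uses `statistics.NormalDist` as a stand-in for SciPy's `norm.ppf` and `norm.pdf`; the function names and inputs are hypothetical, and this is not the library's implementation.

```python
from statistics import NormalDist


def normal_height(q):
    """Density of the standard normal at z, where P(Z < z) = q."""
    std_normal = NormalDist()    # mean 0, standard deviation 1
    z = std_normal.inv_cdf(q)    # counterpart of scipy.stats.norm.ppf
    return std_normal.pdf(z)     # counterpart of scipy.stats.norm.pdf


def biserial_from_point_biserial(r_pb, p):
    """r_b = (r_pb / h) * sqrt(p * q), with q = 1 - p."""
    q = 1 - p
    h = normal_height(q)
    return r_pb / h * (p * q) ** 0.5


h = normal_height(0.5)
r_b = biserial_from_point_biserial(0.4, 0.5)
```

For a balanced binary variable (\(p = q = 0.5\)), \(z = 0\) and \(h\) is the peak of the standard normal density, \(1/\sqrt{2\pi}\).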
- property point_biserial
Computes the point-biserial correlation coefficient between a binary variable \(X\) and a continuous variable \(Y\).
\(r_{\mathrm{pb}} = \frac{(Y_1 - Y_0) \sqrt{pq}}{\sigma_Y}\)
Where
\(Y_0\) is the average of \(Y\) when \(X=0\)
\(Y_1\) is the average of \(Y\) when \(X=1\)
\(\sigma_Y\) is the standard deviation of \(Y\)
\(p\) is \(P(X=1)\)
\(q\) is \(1 - p\)
- Returns
Point-biserial correlation coefficient.
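A minimal sketch of the point-biserial formula on raw data. It assumes \(\sigma_Y\) is the population standard deviation (dividing by \(n\)), which is the choice that makes the result equal to Pearson's r between \(X\) and \(Y\). The function name and data are hypothetical.

```python
from statistics import mean, pstdev


def point_biserial(x, y):
    """(Y1 - Y0) * sqrt(p * q) / sigma_Y for binary x and continuous y."""
    y0 = [v for b, v in zip(x, y) if b == 0]
    y1 = [v for b, v in zip(x, y) if b == 1]
    p = len(y1) / len(y)   # P(X = 1)
    q = 1 - p
    # pstdev is the population standard deviation (an assumption).
    return (mean(y1) - mean(y0)) * (p * q) ** 0.5 / pstdev(y)


r_pb = point_biserial([0, 0, 1, 1], [1, 2, 3, 4])
```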
- property rank_biserial
Computes the rank-biserial correlation between a binary variable \(X\) and a continuous variable \(Y\).
\(r_r = \frac{2 (Y_1 - Y_0)}{n}\)
Where
\(Y_0\) is the average of \(Y\) when \(X=0\)
\(Y_1\) is the average of \(Y\) when \(X=1\)
\(n\) is the total number of data
- Returns
Rank-biserial correlation.
- class pypair.biserial.BiserialStats(n, p, y_0, y_1, std)
Bases: MeasureMixin, BiserialMixin, object
Computes biserial stats.
- __init__(n, p, y_0, y_1, std)
ctor.
- Parameters
n – Total number of samples.
p – \(P(Y|X=0)\).
y_0 – Average of \(Y\) when \(X=0\). \(\bar{Y}_0\)
y_1 – Average of \(Y\) when \(X=1\). \(\bar{Y}_1\)
std – Standard deviation of \(Y\), \(\sigma\).
Continuous
These are the continuous association measures.
- class pypair.continuous.Concordance(x, y)
Bases: MeasureMixin, ConcordanceMixin, object
Concordance for continuous and ordinal data.
- __init__(x, y)
ctor.
- Parameters
x – Continuous or ordinal data (iterable).
y – Continuous or ordinal data (iterable).
- class pypair.continuous.ConcordanceMixin
Bases: object
- property goodman_kruskal_gamma
Goodman-Kruskal \(\gamma\) is similar to Somers’ d. It is defined as follows.
\(\gamma = \frac{\pi_c - \pi_d}{1 - \pi_t}\)
Where
\(\pi_c = \frac{C}{n}\)
\(\pi_d = \frac{D}{n}\)
\(\pi_t = \frac{T}{n}\)
\(C\) is the number of concordant pairs
\(D\) is the number of discordant pairs
\(T\) is the number of ties
\(n\) is the sample size
- Returns
\(\gamma\).
- property kendall_tau
Kendall’s \(\tau\) is defined as follows.
\(\tau = \frac{C - D}{{{n}\choose{2}}}\)
Where
\(C\) is the number of concordant pairs
\(D\) is the number of discordant pairs
\(n\) is the sample size
- Returns
\(\tau\).
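Counting concordant and discordant pairs by brute force gives a direct, if quadratic, way to evaluate this formula. The sketch below is an illustration with hypothetical names and data, not the library's algorithm. Tied pairs count as neither concordant nor discordant.

```python
from itertools import combinations
from math import comb


def kendall_tau(x, y):
    """(C - D) / C(n, 2), by enumerating all pairs of observations."""
    c = d = 0
    for (x1, y1), (x2, y2) in combinations(list(zip(x, y)), 2):
        s = (x1 - x2) * (y1 - y2)
        if s > 0:
            c += 1   # concordant: both differences have the same sign
        elif s < 0:
            d += 1   # discordant: differences have opposite signs
    return (c - d) / comb(len(x), 2)


tau = kendall_tau([1, 2, 3, 4], [1, 3, 2, 4])
```

In the sample data, five of the six pairs are concordant and one is discordant.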
- property somers_d
Computes Somers’ d for two continuous variables. Note that Somers’ d is defined for \(d_{X \cdot Y}\) and \(d_{Y \cdot X}\) and in general \(d_{X \cdot Y} \neq d_{Y \cdot X}\).
\(d_{Y \cdot X} = \frac{\pi_c - \pi_d}{\pi_c + \pi_d + \pi_t^Y}\)
\(d_{X \cdot Y} = \frac{\pi_c - \pi_d}{\pi_c + \pi_d + \pi_t^X}\)
Where
\(\pi_c = \frac{C}{n}\)
\(\pi_d = \frac{D}{n}\)
\(\pi_t^X = \frac{T^X}{n}\)
\(\pi_t^Y = \frac{T^Y}{n}\)
\(C\) is the number of concordant pairs
\(D\) is the number of discordant pairs
\(T^X\) is the number of ties on \(X\)
\(T^Y\) is the number of ties on \(Y\)
\(n\) is the sample size
- Returns
\(d_{X \cdot Y}\), \(d_{Y \cdot X}\).
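The asymmetry of Somers' d shows up clearly on a small example. The sketch below works with raw pair counts, since the common factor \(\frac{1}{n}\) cancels in both ratios. It assumes \(T^X\) counts pairs tied on \(X\) only and \(T^Y\) counts pairs tied on \(Y\) only, which may differ from how the library tallies ties. Names and data are hypothetical.

```python
from itertools import combinations


def somers_d(x, y):
    """Return (d_xy, d_yx) from brute-force pair counts."""
    c = d = t_x = t_y = 0
    for (x1, y1), (x2, y2) in combinations(list(zip(x, y)), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx * dy > 0:
            c += 1
        elif dx * dy < 0:
            d += 1
        elif dx == 0 and dy != 0:
            t_x += 1   # tied on X only (assumed meaning of T^X)
        elif dy == 0 and dx != 0:
            t_y += 1   # tied on Y only (assumed meaning of T^Y)
    d_yx = (c - d) / (c + d + t_y)
    d_xy = (c - d) / (c + d + t_x)
    return d_xy, d_yx


d_xy, d_yx = somers_d([1, 1, 1, 2], [1, 2, 3, 4])
```

Here three pairs are concordant and three are tied on \(X\) only, so the two directions give different values.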
- class pypair.continuous.ConcordanceStats(d, t_xy, t_x, t_y, c, n)
Bases: MeasureMixin, ConcordanceMixin
Computes concordance stats.
- __init__(d, t_xy, t_x, t_y, c, n)
ctor.
- Parameters
d – Number of discordant pairs.
t_xy – Number of ties on XY pairs.
t_x – Number of ties on X pairs.
t_y – Number of ties on Y pairs.
c – Number of concordant pairs.
n – Total number of pairs.
- class pypair.continuous.ConcordantCounts(d, t_xy, t_x, t_y, c)
Bases: object
Stores the concordant, discordant and tie counts.
- __init__(d, t_xy, t_x, t_y, c)
ctor.
- Parameters
d – Discordant.
t_xy – Tie.
t_x – Tie on X.
t_y – Tie on Y.
c – Concordant.
- class pypair.continuous.Continuous(a, b)
Bases: MeasureMixin, object
- __init__(a, b)
ctor.
- Parameters
a – Continuous variable (iterable).
b – Continuous variable (iterable).
- property kendall
- Returns
Kendall’s tau, p-value.
- property pearson
- Returns
Pearson’s r, p-value.
- property regression
- Returns
Coefficient, p-value
- property spearman
- Returns
Spearman’s r, p-value.
- class pypair.continuous.CorrelationRatio(x, y)
Bases: MeasureMixin, object
- __init__(x, y)
ctor.
- Parameters
x – Categorical variable (iterable).
y – Continuous variable (iterable).
- property anova
Computes an ANOVA test.
- Returns
F-statistic, p-value.
- property calinski_harabasz
- Returns
Calinski-Harabasz Index.
- property davies_bouldin
- Returns
Davies-Bouldin Index.
- property eta
Gets \(\eta\).
- Returns
\(\eta\).
- property eta_squared
Gets \(\eta^2 = \frac{\sigma_{\bar{y}}^2}{\sigma_{y}^2}\)
- Returns
\(\eta^2\).
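One common way to evaluate \(\eta^2\) is as the ratio of the between-group sum of squares to the total sum of squares, which equals the variance ratio above when the group means are weighted by group size. The sketch below follows that reading; it is an assumption about the weighting, uses hypothetical names and data, and is not the library's code.

```python
from statistics import mean


def eta_squared(x, y):
    """Between-group sum of squares over total sum of squares."""
    grand = mean(y)
    groups = {}
    for cat, val in zip(x, y):
        groups.setdefault(cat, []).append(val)
    between = sum(
        len(vals) * (mean(vals) - grand) ** 2 for vals in groups.values()
    )
    total = sum((val - grand) ** 2 for val in y)
    return between / total


eta_sq = eta_squared(['a', 'a', 'b', 'b'], [1, 2, 3, 4])
eta = eta_sq ** 0.5
```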
- property kruskal
Computes the Kruskal-Wallis H-test.
- Returns
H-statistic, p-value.
- property silhouette
- Returns
Silhouette coefficient.
Associations
Some of the functions here are just wrappers around the contingency tables and may be looked at as convenience methods to simply pass in data for two variables. If you need more than the specific association, you are encouraged to build the appropriate contingency table and then call upon the measures you need.
- pypair.association.agreement(a, b, measure='chohen_k', a_vals=None, b_vals=None)
Gets the agreement association.
- Parameters
a – Categorical variable (iterable).
b – Categorical variable (iterable).
measure – Measure. Default is chohen_k.
a_vals – The unique values in a.
b_vals – The unique values in b.
- Returns
Measure.
- pypair.association.binary_binary(a, b, measure='chisq', a_0=0, a_1=1, b_0=0, b_1=1)
Gets the binary-binary association.
- Parameters
a – Binary variable (iterable).
b – Binary variable (iterable).
measure – Measure. Default is chisq.
a_0 – The a zero value. Default 0.
a_1 – The a one value. Default 1.
b_0 – The b zero value. Default 0.
b_1 – The b one value. Default 1.
- Returns
Measure.
- pypair.association.binary_continuous(b, c, measure='biserial', b_0=0, b_1=1)
Gets the binary-continuous association.
- Parameters
b – Binary variable (iterable).
c – Continuous variable (iterable).
measure – Measure. Default is biserial.
b_0 – Value when b is zero. Default 0.
b_1 – Value when b is one. Default is 1.
- Returns
Measure.
- pypair.association.categorical_categorical(a, b, measure='chisq', a_vals=None, b_vals=None)
Gets the categorical-categorical association.
- Parameters
a – Categorical variable (iterable).
b – Categorical variable (iterable).
measure – Measure. Default is chisq.
a_vals – The unique values in a.
b_vals – The unique values in b.
- Returns
Measure.
- pypair.association.categorical_continuous(x, y, measure='eta')
Gets the categorical-continuous association.
- Parameters
x – Categorical variable (iterable).
y – Continuous variable (iterable).
measure – Measure. Default is eta.
- Returns
Measure.
- pypair.association.concordance(x, y, measure='kendall_tau')
Gets the specified concordance between the two variables.
- Parameters
x – Continuous or ordinal variable (iterable).
y – Continuous or ordinal variable (iterable).
measure – Measure. Default is kendall_tau.
- Returns
Measure.
- pypair.association.confusion(a, b, measure='acc', a_0=0, a_1=1, b_0=0, b_1=1)
Gets the specified confusion matrix stats.
- Parameters
a – Binary variable (iterable).
b – Binary variable (iterable).
measure – Measure. Default is acc.
a_0 – The a zero value. Default 0.
a_1 – The a one value. Default 1.
b_0 – The b zero value. Default 0.
b_1 – The b one value. Default 1.
- Returns
Measure.
- pypair.association.continuous_continuous(x, y, measure='pearson')
Gets the continuous-continuous association.
- Parameters
x – Continuous variable (iterable).
y – Continuous variable (iterable).
measure – Measure. Default is pearson.
- Returns
Measure.
Decorators
These are decorators.
- pypair.decorator.distance(f)
Marker for distance functions.
- pypair.decorator.similarity(f)
Marker for similarity functions.
- pypair.decorator.timeit(f)
Benchmarks the time it takes (seconds) to execute.
Utility
These are utility functions.
- class pypair.util.MeasureMixin
Bases: ABC
Measure mixin. Lists the functions decorated with @property and provides access to such properties by name.
- get(measure)
Gets the specified measure.
- Parameters
measure – Name of measure.
- Returns
Measure.
- get_measures()
Gets a list of all the measures.
- Returns
List of all the measures.
- classmethod measures()
Gets a list of all the measures.
- Returns
List of all the measures.
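The idea of listing @property members and reading one by name can be sketched with a tiny stand-alone mixin. Everything here (`MeasureSketch`, `TwoByTwo`, and the way properties are discovered) is a hypothetical illustration of the pattern, not PyPair's actual MeasureMixin.

```python
class MeasureSketch:
    """Hypothetical mixin: discover @property members and read them by name."""

    @classmethod
    def measures(cls):
        # Walk the class hierarchy and keep attributes defined as properties.
        names = set()
        for klass in cls.__mro__:
            for name, attr in vars(klass).items():
                if isinstance(attr, property):
                    names.add(name)
        return sorted(names)

    def get(self, measure):
        return getattr(self, measure)


class TwoByTwo(MeasureSketch):
    def __init__(self, a, b, c, d):
        self.a, self.b, self.c, self.d = a, b, c, d

    @property
    def jaccard(self):
        return self.a / (self.a + self.b + self.c)

    @property
    def dice(self):
        return 2 * self.a / (2 * self.a + self.b + self.c)


names = TwoByTwo.measures()
value = TwoByTwo(3, 1, 1, 3).get('jaccard')
```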
- pypair.util.corr(df, f)
Computes the pairwise association matrix. ALL fields/columns must be of the same type so that the specified function f can compute the pairwise associations.
- Parameters
df – Pandas data frame.
f – Callable function; e.g. lambda a, b: categorical_categorical(a, b, measure='phi')
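The pairwise idea can be sketched without pandas by treating a dict of lists as the table. The helper names below are hypothetical, and `match_rate` is a toy association function used only for illustration. How the real function treats the diagonal or symmetric pairs is not stated here, so this sketch simply applies f to every ordered pair.

```python
def pairwise_matrix(columns, f):
    """Apply f to every ordered pair of columns.

    columns: dict mapping a column name to a list of values
             (a plain-Python stand-in for a pandas data frame).
    f: callable taking two lists and returning a number.
    """
    names = list(columns)
    return {r: {c: f(columns[r], columns[c]) for c in names} for r in names}


def match_rate(a, b):
    """Toy association: fraction of positions where the values agree."""
    return sum(1 for i, j in zip(a, b) if i == j) / len(a)


data = {'x': [1, 0, 1, 1], 'y': [1, 0, 0, 1], 'z': [0, 1, 0, 0]}
matrix = pairwise_matrix(data, match_rate)
```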
- pypair.util.get_measures(clazz)
Gets all the measures of a clazz.
- Parameters
clazz – Clazz.
- Returns
List of measures.
Spark
These are functions that you can use in Spark. You must pass in a Spark dataframe and you will get a pair-RDD as output. The pair-RDD will have the following as its keys and values.
key: a tuple of strings (k1, k2), where k1 and k2 are names of variables (column names)
value: a dictionary {'acc': 0.8, 'tpr': 0.9, 'fpr': 0.8, ...}, where keys are association measure names and values are the corresponding association values
- pypair.spark.agreement(sdf)
Gets all pairwise categorical-categorical agreement association measures. The result is a Spark pair-RDD, where the keys are tuples of variable names e.g. (k1, k2), and values are dictionaries of association names and metrics e.g. {'kappa': 0.9, 'delta': 0.2}. Each record in the pair-RDD has the form:
(k1, k2), {'kappa': 0.9, 'delta': 0.2, …}
- Parameters
sdf – Spark dataframe. Should be strings or whole numbers to represent the values.
- Returns
Spark pair-RDD.
- pypair.spark.binary_binary(sdf)
Gets all the pairwise binary-binary association measures. The result is a Spark pair-RDD, where the keys are tuples of variable names e.g. (k1, k2), and values are dictionaries of association names and measures e.g. {'phi': 1, 'lambda': 0.8}. Each record in the pair-RDD has the form:
(k1, k2), {'phi': 1, 'lambda': 0.8, …}
- Parameters
sdf – Spark dataframe. Should be all 1’s and 0’s.
- Returns
Spark pair-RDD.
- pypair.spark.binary_continuous(sdf, binary, continuous, b_0=0, b_1=1)
Gets all pairwise binary-continuous association measures. The result is a Spark pair-RDD, where the keys are tuples of variable names e.g. (k1, k2), and values are dictionaries of association names and metrics e.g. {'biserial': 0.9, 'point_biserial': 0.2}. Each record in the pair-RDD has the form:
(k1, k2), {'biserial': 0.9, 'point_biserial': 0.2, …}
All the binary fields/columns should be encoded in the same way. For example, if you are using 1 and 0, then all binary fields should only have those values, not a mixture of 1 and 0, True and False, -1 and 1, etc.
- Parameters
sdf – Spark dataframe.
binary – List of fields that are binary.
continuous – List of fields that are continuous.
b_0 – Zero value for binary field.
b_1 – One value for binary field.
- Returns
Spark pair-RDD.
- pypair.spark.categorical_categorical(sdf)
Gets all pairwise categorical-categorical association measures. The result is a Spark pair-RDD, where the keys are tuples of variable names e.g. (k1, k2), and values are dictionaries of association names and metrics e.g. {'phi': 0.9, 'chisq': 0.2}. Each record in the pair-RDD has the form:
(k1, k2), {'phi': 0.9, 'chisq': 0.2, …}
- Parameters
sdf – Spark dataframe. Should be strings or whole numbers to represent the values.
- Returns
Spark pair-RDD.
- pypair.spark.categorical_continuous(sdf, categorical, continuous)
Gets all pairwise categorical-continuous association measures. The result is a Spark pair-RDD, where the keys are tuples of variable names e.g. (k1, k2), and values are dictionaries of association names and metrics e.g. {'eta_sq': 0.9, 'eta': 0.95}. Each record in the pair-RDD has the form:
(k1, k2), {'eta_sq': 0.9, 'eta': 0.95}
For now, only eta (\(\eta^2\)) is supported.
- Parameters
sdf – Spark dataframe.
categorical – List of categorical variables.
continuous – List of continuous variables.
- Returns
Spark pair-RDD.
- pypair.spark.concordance(sdf)
Gets all the pairwise ordinal-ordinal concordance measures. The result is a Spark pair-RDD, where the keys are tuples of variable names e.g. (k1, k2), and values are dictionaries of association names and measures e.g. {'kendall': 1, 'gamma': 0.8}. Each record in the pair-RDD has the form:
(k1, k2), {'kendall': 1, 'gamma': 0.8, …}
- Parameters
sdf – Spark dataframe. Should be all ordinal data (numeric).
- Returns
Spark pair-RDD.
- pypair.spark.confusion(sdf)
Gets all the pairwise confusion matrix metrics. The result is a Spark pair-RDD, where the keys are tuples of variable names e.g. (k1, k2), and values are dictionaries of association names and metrics e.g. {'acc': 0.9, 'fpr': 0.2}. Each record in the pair-RDD has the form:
(k1, k2), {'acc': 0.9, 'fpr': 0.2, …}
- Parameters
sdf – Spark dataframe. Should be all 1’s and 0’s.
- Returns
Spark pair-RDD.
- pypair.spark.continuous_continuous(sdf)
Gets all the pairwise continuous-continuous association measures. The result is a Spark pair-RDD, where the keys are tuples of variable names e.g. (k1, k2), and values are dictionaries of association names and measures e.g. {'pearson': 1}. Each record in the pair-RDD has the form:
(k1, k2), {'pearson': 1}
Only pearson is supported at the moment.
- Parameters
sdf – Spark dataframe. Should be all ordinal data (numeric).
- Returns
Spark pair-RDD.