Core class representing a binary classification (BC) machine learning (ML) model

class bcml_model[source]

bcml_model(model)

Represents a machine learning (ML) binary classification (BC) model.

Getting up and running

Creating a bcml_model instance

To create a new bcml_model instance, all you need to pass is a model: a binary classification machine learning model. This library was designed to work with scikit-learn classifiers, such as sklearn.linear_model.LogisticRegression, sklearn.ensemble.RandomForestClassifier, or sklearn.ensemble.GradientBoostingClassifier.
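For example, a minimal sketch (the import path for bcml_model is a placeholder; adjust it to wherever the class lives in your installation):

```python
from sklearn.ensemble import RandomForestClassifier

from bcml import bcml_model  # placeholder import path

# Wrap any scikit-learn binary classifier in a bcml_model
clf = RandomForestClassifier(n_estimators=100, random_state=0)
model = bcml_model(clf)
```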

Basic Functionality

bcml_model.fit[source]

bcml_model.fit(preds, labels)

Fits model to data.

preds should be $N \times m$ for a model with $m$ features. labels should be $N \times 1$ and take values in $\{0,1\}$.
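A minimal sketch of fitting on toy data (the arrays here are random and purely illustrative; shapes follow the convention above):

```python
import numpy as np

# Toy dataset: N = 1000 samples, m = 4 features
rng = np.random.default_rng(0)
preds = rng.normal(size=(1000, 4))       # N x m predictor array
labels = rng.integers(0, 2, size=1000)   # length-N labels in {0, 1}

model.fit(preds, labels)                 # fit the wrapped classifier on the data
```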

bcml_model.predict_proba[source]

bcml_model.predict_proba(preds)

Predicts the signal probability for each element of a dataset ($N \times m$ numpy array).

Returns a numpy array of length $N$ with values in $[0,1]$ giving the predicted signal probability for each data point.

bcml_model.predict[source]

bcml_model.predict(preds, threshold=None)

Predicts signal ($1$) or background ($0$) for each element of a dataset ($N \times m$ numpy array).

Returns a numpy array of length $N$ with values in $\{0,1\}$ giving the predicted classifications.

Uses the predict method built into scikit-learn models.
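A sketch of both prediction methods (the exact semantics of a non-default threshold are an assumption here):

```python
# Predicted signal probabilities, one per row of preds
probs = model.predict_proba(preds)                 # values in [0, 1]

# Hard 0/1 classifications; with threshold=None the built-in sklearn cut is used
default_calls = model.predict(preds)
tight_calls = model.predict(preds, threshold=0.9)  # assumed: signal only if prob > 0.9
```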

bcml_model.predict_hist[source]

bcml_model.predict_hist(preds, labels, num_bins=100, sepbg=False, sig_norm=1, bg_norm=1, dataframe=False, yAxisUnits='events/bin width')

Constructs a histogram of predicted signal probabilities for the signal and background constituents of a dataset ($N \times m$ numpy array).

If sepbg is False (the default), labels are assumed to take values in $\{0,1\}$ and the backgrounds are treated in combination: a list of $3$ numpy arrays is returned, containing bin edges (partitioning $[0,1]$), signal bin contents, and background bin contents.

If sepbg is True, labels are assumed to take values in $\{-n,\dots,-1,1\}$ (if there are $n$ backgrounds) and bg_norm should be a list of length $n$. The backgrounds are then differentiated: a list of $2 + n$ numpy arrays is returned, containing bin edges (partitioning $[0,1]$), signal bin contents, and $n$ sets of background bin contents.

yAxisUnits accepts either the string 'events' or 'events/bin width'. In the former case, histograms are generated normally (using density=True within np.histogram and then multiplying the bin contents by sig_norm and bg_norm), so the desired normalization values are obtained by summing the area under the curve. In the latter case the same procedure is followed, but the bin heights are then divided by the bin width, ensuring that simply summing the heights of the individual bins yields the desired normalization values.
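A plotting sketch under the sepbg=False convention described above (the unpacking of the returned list into bin edges, signal contents, and background contents is an assumption based on that description):

```python
import matplotlib.pyplot as plt

# Assumed return structure for sepbg=False: [bin_edges, signal_bins, background_bins]
bin_edges, sig_bins, bg_bins = model.predict_hist(preds, labels, num_bins=50,
                                                  sig_norm=1, bg_norm=1)

plt.stairs(sig_bins, bin_edges, label='signal')
plt.stairs(bg_bins, bin_edges, label='background')
plt.xlabel('predicted signal probability')
plt.ylabel('events / bin width')
plt.legend()
plt.show()
```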

bcml_model.feature_importance[source]

bcml_model.feature_importance()

Returns the importance of the $m$ features used to train self.model.

bcml_model.sorted_feature_importance[source]

bcml_model.sorted_feature_importance(features)

Returns list of features sorted by importance.

Given features, a list of length $m$ naming the model's features, returns a list of size $m \times 2$ in which the first column gives the features and the second their associated importances, sorted by importance.
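A sketch of ranking features by importance (the feature names are placeholders for whatever predictors the model was trained on):

```python
# Placeholder names, one per column of the training array
feature_names = ['pT_lead', 'pT_sublead', 'missing_ET', 'delta_phi']

importances = model.feature_importance()                  # one importance per feature
ranking = model.sorted_feature_importance(feature_names)  # [[feature, importance], ...]

for name, importance in ranking:
    print(f'{name}: {importance:.3f}')
```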

bcml_model.accuracy[source]

bcml_model.accuracy(preds, labels, threshold=None)

Computes model accuracy on a dataset ($N \times m$ predictors, length-$N$ labels).

Returns value in $[0,1]$ giving model accuracy on the provided predictors and labels.

bcml_model.conf_matrix[source]

bcml_model.conf_matrix(labels, predictions=None, preds=None)

Computes the confusion matrix of the trained model on a dataset ($N \times m$ predictors, length-$N$ labels).

Returns $2 \times 2$ confusion matrix using sklearn.metrics.confusion_matrix.

If predictors preds aren't provided, self.test_preds is used. If labels aren't provided, self.test_labels is used.
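A sketch combining the two diagnostics on a held-out set (test_preds and test_labels are placeholder arrays shaped like the training data):

```python
# Placeholder held-out data, shaped like the training arrays
test_preds = rng.normal(size=(500, 4))
test_labels = rng.integers(0, 2, size=500)

acc = model.accuracy(test_preds, test_labels)          # value in [0, 1]
cm = model.conf_matrix(test_labels, preds=test_preds)  # 2 x 2 confusion matrix

print(f'accuracy = {acc:.3f}')
print(cm)
```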

bcml_model.tpr_cm[source]

bcml_model.tpr_cm(conf_matrix)

Computes the true positive rate (tpr; correctly identified signal/total signal) of a trained model given a confusion matrix.

Returns value in $[0,1]$.

bcml_model.fpr_cm[source]

bcml_model.fpr_cm(conf_matrix)

Computes the false positive rate (fpr; misidentified background/total background) of a trained model given a confusion matrix.

Returns value in $[0,1]$.
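For reference, the same rates can be read directly off a $2 \times 2$ confusion matrix in sklearn's convention (rows are true classes, columns are predicted classes); this is an equivalent computation, not necessarily the library's exact implementation:

```python
import numpy as np

# sklearn convention: [[TN, FP], [FN, TP]]
cm = model.conf_matrix(test_labels, preds=test_preds)
tn, fp, fn, tp = cm.ravel()

tpr = tp / (tp + fn)   # correctly identified signal / total signal
fpr = fp / (fp + tn)   # misidentified background / total background

# These should agree with the helper methods
assert np.isclose(tpr, model.tpr_cm(cm))
assert np.isclose(fpr, model.fpr_cm(cm))
```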

bcml_model.tpr[source]

bcml_model.tpr(labels, predictions=None, preds=None)

Computes the true positive rate (tpr; correctly identified signal/total signal) of a trained model given predictions and labels (both numpy arrays of length $N$ with values in $\{0,1\}$).

Returns value in $[0,1]$.

bcml_model.fpr[source]

bcml_model.fpr(labels, predictions=None, preds=None)

Computes the false positive rate (fpr; misidentified background/total background) of a trained model given predictions and labels (both numpy arrays of length $N$ with values in $\{0,1\}$).

Returns value in $[0,1]$.

Phenomenology

bcml_model.significance[source]

bcml_model.significance(signal, background, tpr, fpr, sepbg=False)

Computes signal significance of a trained model given signal and background yield.

Returns a positive real number computed by $$\frac{S \cdot \text{TPR}}{\sqrt{S \cdot \text{TPR} + B \cdot \text{FPR}}}$$ which corresponds to the signal significance after selecting only the datapoints that the model identifies as signal.

If sepbg is False, background should be a single real number and is multiplied by fpr. If sepbg is True, background should be a list of length self.num_bgs whose $i$th element contains the background yield of the $i$th background type; fpr is then also a list of length self.num_bgs giving the false positive rates for each of the background types.
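A worked example of the formula with a combined background (the yields and rates are arbitrary illustrative numbers):

```python
import math

S, B = 100, 10_000     # expected signal and background yields (illustrative)
tpr, fpr = 0.80, 0.01  # classifier operating point

# Same quantity as model.significance(S, B, tpr, fpr)
signif = (S * tpr) / math.sqrt(S * tpr + B * fpr)
print(f'{signif:.2f}')  # 80 / sqrt(80 + 100) ≈ 5.96
```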

bcml_model.newvar2thresh[source]

bcml_model.newvar2thresh(newvar)

Helper method for bcml_model.max_allowable_threshold(), bcml_model.get_tprs_fprs(), and bcml_model.best_threshold(), performing the change of variables from newvar to threshold.

In particular, threshold $= 1 - 10^{\text{newvar}}$

bcml_model.thresh2newvar[source]

bcml_model.thresh2newvar(thresh)

Helper method for bcml_model.max_allowable_threshold(), bcml_model.get_tprs_fprs(), and bcml_model.best_threshold(), performing the change of variables from threshold to newvar.

In particular, newvar $= \log_{10}(1 - \text{threshold})$
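A quick sketch of the change of variables and its inverse, which lets thresholds be sampled arbitrarily close to $1$ by scanning newvar over increasingly negative values:

```python
import numpy as np

newvars = np.array([-1.0, -2.0, -4.0, -8.0])
thresholds = 1 - 10.0 ** newvars      # newvar2thresh: [0.9, 0.99, 0.9999, 0.99999999]

recovered = np.log10(1 - thresholds)  # thresh2newvar recovers the original newvars
assert np.allclose(recovered, newvars)
```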

bcml_model.max_allowable_threshold[source]

bcml_model.max_allowable_threshold(preds, labels, sigYield)

Returns the highest threshold such that labelling only the elements of preds with predicted probabilities higher than that threshold as signal still yields $25$ signal events.

To achieve a discovery potential of $5\sigma$, even in the best-case scenario ($\text{TPR} = 1$, $\text{FPR} = 0$) we still require $5^2 = 25$ signal events, since in that limit the significance reduces to $S/\sqrt{S} = \sqrt{S}$; hence we cannot choose a threshold so high that fewer than $25$ signal events are kept.

bcml_model.get_tprs_fprs[source]

bcml_model.get_tprs_fprs(preds, labels, sepbg=False)

Produces (true positive rate, false positive rate) pairs for the trained model at various thresholds on a given data set.

If sepbg is False, labels should take values in $\{0,1\}$. The backgrounds are combined and a list of length $4$ is returned, containing a list of $L$ sampled newvars (a convenient change of variables that lets thresholds approach arbitrarily close to $1$; related to thresholds by bcml_model.newvar2thresh()), an $L$-list of tprs associated to those thresholds, an $L$-list of fprs associated to those thresholds, and an $L$-list of length-$N$ numpy arrays giving the predicted signal probabilities for the given data set.

If sepbg is True, labels should take values in $\{-n,\dots,-1,1\}$. The backgrounds are kept separate and a list of length $4$ is returned, containing a list of $L$ sampled newvars, an $L$-list of tprs associated to those thresholds, an $L$-list of length-$n$ lists (one fpr per background type) giving the false positive rates at each threshold, and an $L$-list of length-$N$ numpy arrays giving the predicted signal probabilities for the given data set.

bcml_model.best_threshold[source]

bcml_model.best_threshold(signal, background, preds, labels, sepbg=False)

Optimizes the threshold on a given data set ($N \times m$ predictors, length-$N$ labels).

bcml_model.req_sig_cs[source]

bcml_model.req_sig_cs(lumi, bg_cs, tpr, fpr, sig=5, sepbg=False)

Given a luminosity (in fb$^{-1}$), a background cross section (in pb), a true positive rate, a false positive rate, and a target signal significance, computes the signal cross section required to achieve that significance.

If sepbg is False, the background is combined and a single fpr is used; if sepbg is True, bg_cs and fpr are each assumed to be lists of length $n$ (the number of backgrounds) and their dot product is used for the background yield.

The formula used by req_sig_cs arises as follows.

We know that $$\mathcal{S} = \frac{S \cdot \text{TPR}}{\sqrt{S \cdot \text{TPR} + B \cdot \text{FPR}}} = \frac{\mathcal{L} \cdot \sigma_s \cdot \text{TPR}}{\sqrt{\mathcal{L} \cdot \sigma_s \cdot \text{TPR} + \mathcal{L} \cdot \sigma_b \cdot \text{FPR}}}$$ We can work toward solving for $\sigma_s$ as follows (squaring both sides and dividing by $\mathcal{L}$ in the third line). \begin{align} \mathcal{S} &= \frac{\mathcal{L} \cdot \sigma_s \cdot \text{TPR}}{\sqrt{\mathcal{L} \cdot \sigma_s \cdot \text{TPR} + \mathcal{L} \cdot \sigma_b \cdot \text{FPR}}} \\ \mathcal{S} \sqrt{\mathcal{L} \cdot \sigma_s \cdot \text{TPR} + \mathcal{L} \cdot \sigma_b \cdot \text{FPR}} &= \mathcal{L} \cdot \sigma_s \cdot \text{TPR} \\ \mathcal{S}^2 \left(\sigma_s \cdot \text{TPR} + \sigma_b \cdot \text{FPR}\right) &= \mathcal{L} \cdot \sigma_s^2 \cdot \text{TPR}^2 \\ 0 &= -\left(\mathcal{L} \cdot \text{TPR}^2\right)\sigma_s^2 + \left(\mathcal{S}^2 \cdot \text{TPR}\right)\sigma_s + \left(\mathcal{S}^2 \cdot \sigma_b \cdot \text{FPR}\right) \end{align} This quadratic in $\sigma_s$ is then easily solved using the quadratic formula, taking the positive root.
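A sketch of solving the final quadratic for $\sigma_s$ numerically, assuming all quantities are already in mutually consistent units (the library's req_sig_cs, which mixes fb$^{-1}$ and pb, presumably handles the unit conversion internally):

```python
import math

def required_signal_cs(lumi, bg_cs, tpr, fpr, sig=5.0):
    """Solve 0 = -(L*TPR^2) s^2 + (S^2*TPR) s + S^2*sigma_b*FPR for s = sigma_s.

    Illustrative re-derivation only, assuming lumi and the cross sections are
    in mutually consistent units (e.g. fb^-1 and fb).
    """
    a = -lumi * tpr**2
    b = sig**2 * tpr
    c = sig**2 * bg_cs * fpr
    disc = b**2 - 4 * a * c                  # positive, since a < 0 and c >= 0
    return (-b - math.sqrt(disc)) / (2 * a)  # the physically meaningful (positive) root

# Example: 300 fb^-1, 50 fb of background, TPR = 0.8, FPR = 0.01, 5 sigma target
print(required_signal_cs(300.0, 50.0, 0.8, 0.01))  # 0.3125 fb
```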

Other utilities

bcml_model.save_model[source]

bcml_model.save_model(filename)

Saves the model to filename.joblib
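A sketch of saving and restoring (the restore via joblib.load is an assumption based on the .joblib extension; the saved object may be the wrapper or just the underlying classifier):

```python
import joblib

model.save_model('my_classifier')               # writes my_classifier.joblib

# Later / elsewhere: assumed plain joblib round-trip
restored = joblib.load('my_classifier.joblib')
```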

refresh_model[source]

refresh_model(model)

If this class gets updated, run this function on your already trained model to have it reflect the updated class without retraining being necessary.