Exhibiting the tools contained in this library using data from a collider phenomenology study of $Z'$ particles

Preparing Data

Informed by the documentation of the import47Ddata function, we can pull in all of the data files we'll need for training and testing from Google Storage. These were previously produced by using a MadAnalysis expert-mode script to print kinematic information from .lhe files generated by MadGraph.

filename_sig_masses = ['350G', '500G', '1T', '2T', '4T']
filename_bgs = ['h', '4t', 'noh']
signal_data = [import47Ddata(f'sig{mass}') for mass in filename_sig_masses]
background_data = [import47Ddata(f'bg{name}') for name in filename_bgs]

Now, using sklearn.model_selection.train_test_split, we can split each of our data sets (one for each $m_{Z'}$ value) into training and testing subsets and create labels for each.

import numpy as np
from sklearn.model_selection import train_test_split
data = []
for sig_data in signal_data:
    train_preds, test_preds, train_binary_labels, test_binary_labels, train_labels, test_labels = train_test_split(
        np.concatenate([sig_data] + [bg_data[:int(len(sig_data)/3)] for bg_data in background_data]),
        [1] * len(sig_data) + [0] * 3 * int(len(sig_data)/3),
        [1] * len(sig_data) + [-1] * int(len(sig_data)/3) + [-2] * int(len(sig_data)/3) + [-3] * int(len(sig_data)/3),
        test_size=0.25, random_state=42) 
    data.append([[np.array(train_preds), np.array(train_binary_labels), np.array(train_labels)],
                 [np.array(test_preds), np.array(test_binary_labels), np.array(test_labels)]])
del train_preds, test_preds, train_binary_labels, test_binary_labels, train_labels, test_labels

Here, we have created a list data of length $5$ (as we have $5$ sampled $Z'$ masses) where each row contains two lists, each of which contains three numpy arrays. The first triplet holds the $n \times m$ data set of training predictors (where $n$ is the number of data points and $m$ is the number of features: for us, $n \approx 1.5$ million for training and $m = 47$), the length $n$ binary labels (i.e., signal is $1$, background is $0$) to be used for training, and the length $n$ labels (i.e., signal is $1$, background is a negative integer) to be used when we want to differentiate backgrounds; the second triplet holds the corresponding testing arrays. Thus, our list is $5 \times 2 \times 3$, where the final dimension consists of predictors and labels as described above. We note that, in particular, we constructed our data for each $m_{Z'}$ value by combining $1$ million signal data points with approximately $1$ million background points (divided evenly between our three background types) before splitting into training and testing subsets.
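To keep the indexing straight, the following cheat sheet spells out how to pull pieces out of data (the variable names on the left-hand sides are purely illustrative and are not used elsewhere in this notebook):

# data[i]       -> the data sets for the i-th Z' mass
# data[i][0]    -> training triplet;  data[i][1] -> testing triplet
# data[i][j][0] -> predictors, data[i][j][1] -> binary labels, data[i][j][2] -> multi-type labels
train_predictors, train_binary_labels, train_type_labels = data[0][0]
test_predictors, test_binary_labels, test_type_labels = data[0][1]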

print(f'data has length {len(data)}')
print(f'each row of data has length {len(data[0])}')
print(f'each element in each row of data has length {len(data[0][0])}')
print(f'training data is of length {len(data[0][0][0])}',
      f'while testing data is of length {len(data[0][1][0])}')
print(f'our data has {len(data[0][0][0][0])} features')
data has length 5
each row of data has length 2
each element in each row of data has length 3
training data is of length 1499999 while testing data is of length 500000
our data has 47 features

We're also going to want one more thing. In particular, we'll eventually want to pass the (test) predictors and labels that correspond to specific background types. To facilitate this, we're going to construct a list indx of length $5$ (number of $Z'$ masses) where each row is a list of length $4$ (as we have signal and three background types), and indx[i][j] is a numpy array containing the indices of data[i][1][2] corresponding to signal (if $j = 0$) or the $j$th background (if $1 \leq j \leq 3$).

indx = [[np.where(row[1][2]==i)[0] for i in [1] + [-i for i in range(1,4)]] for row in data]
print(f'indx has length {len(indx)}')
print(f'each row of indx has length {len(indx[0])}')
print(f'the lengths of the elements of the first row of indx are {[len(elem) for elem in indx[0]]}')
indx has length 5
each row of indx has length 4
the lengths of the elements of the first row of indx are [249991, 83394, 83544, 83071]

Logistic Regression

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

It's useful to standardize your data (subtract the mean and divide by the standard deviation in each feature) before training: we can achieve this through sklearn's StandardScaler transformer and its pipeline functionality.

Initializing and Training

We'll now initialize and train a logistic regression model for each of our $5$ data sets using our bcml_model object. Note that the * syntax unpacks a list, so the training predictors and binary labels for a given $Z'$ mass (the first two elements of the training triplet in data) are passed to the fit method as separate arguments.

We instantiate a bcml_model object by passing it an sklearn classifier, and we fit the model via the bcml_model.fit method.

logreg_m350G_model = bcml_model(make_pipeline(StandardScaler(), LogisticRegression()))
logreg_m350G_model.fit(*data[0][0][:2])
logreg_m500G_model = bcml_model(make_pipeline(StandardScaler(), LogisticRegression()))
logreg_m500G_model.fit(*data[1][0][:2])
logreg_m1T_model = bcml_model(make_pipeline(StandardScaler(), LogisticRegression()))
logreg_m1T_model.fit(*data[2][0][:2])
logreg_m2T_model = bcml_model(make_pipeline(StandardScaler(), LogisticRegression()))
logreg_m2T_model.fit(*data[3][0][:2])
logreg_m4T_model = bcml_model(make_pipeline(StandardScaler(), LogisticRegression(max_iter=500)))
logreg_m4T_model.fit(*data[4][0][:2])

As a reminder, passing *data[0][0][:2], for example, means we are taking the zeroth row of data (first mass, $m_{Z'} = 350$ GeV), the zeroth element of that row (the training data, not the testing data), and then unpacking the first two elements of that element (the features and binary labels).
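In other words (a purely illustrative equivalence; the names introduced here are not used elsewhere):

train_predictors_350, train_binary_labels_350 = data[0][0][:2]
# so the call above is equivalent to
# logreg_m350G_model.fit(train_predictors_350, train_binary_labels_350)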

Naive Performance Evaluation

Now we can investigate our models' performances. Perhaps we can start by looking at the signal significance they achieve by default, given integrated luminosity $\mathcal{L} = 3000$ fb$^{-1}$. To do this, we're going to need signal and background yields, which we can compute using the function get_elijah_ttbarzp_cs; it pulls in our sampled $Z'$ masses, their associated cross sections, and the background cross sections.

masses, sig_css, bg_css = get_elijah_ttbarzp_cs()
print(masses)
[10, 50, 100, 200, 350, 500, 1000, 2000, 5000]

We will need a signal cross section for $m_{Z'} = 4000$ GeV, which we don't have, so we shall take a quick detour to interpolate it using our cross_section_helper object. In doing so, we'll update masses and sig_css to more relevant values.

zp_cs = cross_section_helper(masses, sig_css, bg_css, mass_units='GeV')
masses = [350, 500, 1000, 2000, 4000]
sig_css = [zp_cs.sig_cs(mass) for mass in masses]

Using $N = \mathcal{L} \cdot \sigma$ for number of events $N$, we can compute signal and background yields. Both of these are lists: signal cross sections vary with the $m_{Z'}$ values given by our masses variable, and the background cross sections vary between background types, originally specified by the filename_bgs variable.

conv = 10**15 / 10**12 # conv * lumi (in fb^{-1}) * cross sec (in pb) = # of events
lumi = 3000
signal_yields = [conv * lumi * sig_cs for sig_cs in sig_css]
background_yields = [conv * lumi * bg_cs for bg_cs in bg_css]
background_yield = sum(background_yields)

Now we can look at anticipated signal significance $\mathcal{S}$ yielded by our models, which takes the following form. $$\mathcal{S} = \frac{S \cdot \text{TPR}}{\sqrt{S \cdot \text{TPR} + B \cdot \text{FPR}}}$$ Here, $S$ and $B$ are signal and background yield, respectively; TPR is the true positive rate, or the proportion of signal events our model correctly identifies as signal; and FPR is the false positive rate, or the proportion of background events our model incorrectly identifies as signal (i.e., TPR, FPR $\in [0, 1]$). Effectively, we're implementing a solitary event selection criterion: namely, that our model thinks the event is signal.
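As a quick sanity check of the formula (this is just a sketch of the arithmetic, not the library's significance implementation):

def significance_sketch(S, B, tpr, fpr):
    # S, B: expected signal and background yields; tpr, fpr in [0, 1]
    return S * tpr / np.sqrt(S * tpr + B * fpr)

# e.g., S = 1000, B = 10**6, TPR = 0.6, FPR = 0.001 gives 600 / sqrt(1600) = 15 sigma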

For the benefit of our own understanding of the problem, let's briefly consider the signal cross section $\sigma_s$ as a function of $m_{Z'}$ with some assistance from the interpolation functionality of scipy and some matplotlib.pyplot settings retrievable via get_settings.

bg_labels = [r"No Higgs B.g. ($pp \; \to \; t\overline{t}b\overline{b} \; \backslash \; h \; \to \; bjj \; \overline{b}\ell\nu \;  b\overline{b}$)",
             r"Higgs B.g. ($pp \; \to \; t\overline{t}h \; \to \; bjj \; \overline{b}\ell\nu \;  b\overline{b}$)",
             r"4 Tops B.g. ($pp \; \to \; t\overline{t}t\overline{t} \; \to \; \overline{b}\ell\nu \;  b\overline{b} \; ...$)"]
import scipy.interpolate
import matplotlib.pyplot as plt
sample_masses = np.linspace(masses[0], masses[-1], 500)
sigs_f = scipy.interpolate.interp1d(masses, np.log10(sig_css), kind='cubic')
new_settings = get_settings()
new_settings['figure.figsize'] = (15,10)
with plt.rc_context(new_settings):
    plt.plot(sample_masses, [10**sigs_f(m) for m in sample_masses], 
             label=r"Signal ($pp \; \to \; t\overline{t}Z' \; \to \; bjj \; \overline{b}\ell\nu \;  b\overline{b}$)")
    for bg_cs, label in zip([bg_css[2], bg_css[0], bg_css[1]], bg_labels):
        plt.plot([350, 4000], [bg_cs]*2, label=r"{}".format(label))
    plt.ylabel('Cross Section (pb)');
    plt.xlabel(r"$Z'$ mass $m_{Z'}$ (GeV)");
    plt.xlim(350, 4000)
    plt.yscale('log')
    plt.legend(bbox_to_anchor=(0,0), loc="lower left")
    plt.savefig('./images/full_example/signal_cs', dpi=200, bbox_inches='tight')
for B, name in zip(background_yields, filename_bgs):
    print(f"Background {name} constitutes a fraction {round(B/background_yield,3)} of the total background cross section.")
print(f"\nTotal background cross section: {sum(bg_css)} pb")
Background h constitutes a fraction 0.019 of the total background cross section.
Background 4t constitutes a fraction 0.002 of the total background cross section.
Background noh constitutes a fraction 0.979 of the total background cross section.

Total background cross section: 5.6977 pb

The noh (no Higgs) background is far and away the dominant process. It would be careless, then, to compute a single overall FPR and multiply it by the total background yield $B$: instead, we should multiply the FPR for each background by that background's yield and sum over the results. The bcml_model.significance method allows us to specify whether or not we want to separate the backgrounds like this via the boolean argument sepbg. Let's look at both scenarios.
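To make the weighting concrete, here is a separated-background analogue of the sketch above (again, not necessarily how bcml_model.significance is implemented internally):

def significance_sepbg_sketch(S, B_list, tpr, fpr_list):
    # one FPR per background type, each weighted by that background's own yield
    bg_term = sum(B * fpr for B, fpr in zip(B_list, fpr_list))
    return S * tpr / np.sqrt(S * tpr + bg_term)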

We'll import datetime to help ourselves start gauging runtime for some of these computational tasks.

from datetime import datetime

def getTime(form='s'):
    # Coarse wall-clock timestamp built from the current day, hour, minute, and
    # second; only differences between successive calls are used below.
    now = datetime.now()
    seconds = now.day * 86400 + now.hour * 3600 + now.minute * 60 + now.second
    if form == 's':
        return seconds
    elif form == 'm':
        return round(seconds/60, 3)
    elif form == 'h':
        return round(seconds/3600, 3)
logreg_models = [logreg_m350G_model, logreg_m500G_model, logreg_m1T_model, logreg_m2T_model, logreg_m4T_model]
time_before = getTime()
logreg_sigs_nosepbg = [model.significance(signal_yield, background_yield, 
                                          tpr=model.tpr(data[i][1][1], preds=data[i][1][0]), 
                                          fpr=model.fpr(data[i][1][1], preds=data[i][1][0]),
                                          sepbg=False) 
                       for i, (model, signal_yield) in enumerate(zip(logreg_models, signal_yields))]
for sig, mass in zip(logreg_sigs_nosepbg, masses):
    print(f"Z' mass = {mass} GeV --> logistic regression gives significance of {round(sig, 3)} sigma",
          "w/o separated backgrounds")
time_after = getTime()
print(f"(Runtime: {time_after - time_before} seconds)")
Z' mass = 350 GeV --> logistic regression gives significance of 16.01 sigma w/o separated backgrounds
Z' mass = 500 GeV --> logistic regression gives significance of 8.386 sigma w/o separated backgrounds
Z' mass = 1000 GeV --> logistic regression gives significance of 2.049 sigma w/o separated backgrounds
Z' mass = 2000 GeV --> logistic regression gives significance of 0.283 sigma w/o separated backgrounds
Z' mass = 4000 GeV --> logistic regression gives significance of 0.006 sigma w/o separated backgrounds
(Runtime: 1 seconds)
time_before = getTime()
logreg_sigs_sepbg = [model.significance(signal_yield, background_yields, 
                                          tpr=model.tpr(np.ones((len(indx[i][0]),)), preds=data[i][1][0][indx[i][0]]), 
                                          fpr=[model.fpr(np.zeros((len(indx[i][j]),)), preds=data[i][1][0][indx[i][j]]) 
                                               for j in range(1,4)],
                                          sepbg=True)
                       for i, (model, signal_yield) in enumerate(zip(logreg_models, signal_yields))]
for sig, mass in zip(logreg_sigs_sepbg, masses):
    print(f"Z' mass = {mass} GeV --> logistic regression gives significance of {round(sig, 3)} sigma",
          "w/ separated backgrounds")
time_after = getTime()
print(f"(Runtime: {time_after - time_before} seconds)")
Z' mass = 350 GeV --> logistic regression gives significance of 19.856 sigma w/ separated backgrounds
Z' mass = 500 GeV --> logistic regression gives significance of 9.527 sigma w/ separated backgrounds
Z' mass = 1000 GeV --> logistic regression gives significance of 2.052 sigma w/ separated backgrounds
Z' mass = 2000 GeV --> logistic regression gives significance of 0.263 sigma w/ separated backgrounds
Z' mass = 4000 GeV --> logistic regression gives significance of 0.006 sigma w/ separated backgrounds
(Runtime: 1 seconds)

Numbers have limited utility on their own: let's try to visualize $\mathcal{S}$ as a function of $m_{Z'}$ by again enlisting the help of interpolation and our matplotlib settings.

max_mass = zp_cs.absolute_max_mass_sens()
logreg_sigs_nosepbg_f = scipy.interpolate.interp1d(masses, np.log10(logreg_sigs_nosepbg), kind='cubic')
logreg_sigs_sepbg_f = scipy.interpolate.interp1d(masses, np.log10(logreg_sigs_sepbg), kind='cubic')
with plt.rc_context(get_settings()):
    plt.plot([max_mass for i in range(2)], np.logspace(-2, 1.5, 2), label=r'Theoretical $5\sigma$ Limit', c='black')
    plt.plot(sample_masses, [5 for m in sample_masses], label=r'5$\sigma$', c='blue')
    plt.plot(sample_masses, [1.645 for m in sample_masses], label=r'90% confidence', c='cornflowerblue')
    plt.plot(
        sample_masses, [10**logreg_sigs_nosepbg_f(m) for m in sample_masses], label='Logistic Regression (No b.g. sep.)', 
        c='red')
    plt.plot(
        sample_masses, [10**logreg_sigs_sepbg_f(m) for m in sample_masses], label='Logistic Regression (B.g. sep.)', 
        c='red', linestyle=':')
    plt.ylabel('Signal Significance');
    plt.xlabel(r"$Z'$ mass $m_{Z'}$ (GeV)");
    plt.yscale('log')
    plt.legend()

Whenever a curve falls above, say, the darker blue $5\sigma$ line, we are assured that the associated model can achieve $5\sigma$ discovery potential at that value of $m_{Z'}$. The "Theoretical $5\sigma$ Limit" line denotes the highest $Z'$ mass at which $\mathcal{S} = 5\sigma$ is possible. In particular, in the best-case scenario (TPR $= 1$ and FPR $= 0$), we have $\mathcal{S} = \sqrt{S}$, so we require $\mathcal{L} \cdot \sigma_s > \mathcal{S}^2$ for signal cross section $\sigma_s$. Our cross_section_helper class has a built-in method for computing this maximal mass, given signal cross sections.
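As an illustration of that condition (a rough numerical sketch only; absolute_max_mass_sens may be implemented differently), one could scan the interpolated cross section sigs_f defined earlier for the largest mass satisfying $\mathcal{L} \cdot \sigma_s \geq 25$:

# 5-sigma in the ideal TPR = 1, FPR = 0 case requires conv * lumi * sigma_s(m) >= 25
scan_masses = np.linspace(masses[0], masses[-1], 10000)
sensitive = [m for m in scan_masses if conv * lumi * 10**sigs_f(m) >= 25]
print(f"rough estimate of the maximal sensitive mass: {int(max(sensitive))} GeV")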

print("We cannnot achieve sensitivity to Z' particles (via this channel)",
      f"with masses greater than {int(max_mass)} GeV")
We cannnot achieve sensitivity to Z' particles (via this channel) with masses greater than 2363 GeV

More Sophisticated Evaluation: Threshold Optimization

When passed a data point, a trained bcml_model assigns a signal probability. By default, an sklearn model (and thus our bcml_model) predicts that the data point is signal if the associated signal probability is greater than $0.5$. However, $0.5$ need not be the optimal threshold: to address this, we provide the ability to optimize the threshold on a test data set. To visualize why this matters, we can plot a histogram of predicted signal probability for signal and background test events for each model: this is easily achieved using the bcml_model.predict_hist method.
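Conceptually, threshold optimization is just a one-dimensional scan over the probability cut. A hedged sketch of the idea (assuming access to the predicted signal probabilities; bcml_model.best_threshold may proceed differently) is:

def best_threshold_sketch(probs, labels, S, B, thresholds=np.linspace(0.01, 0.99, 99)):
    # probs: predicted signal probabilities for test events; labels: binary test labels
    best_t, best_sig = 0.5, -np.inf
    for t in thresholds:
        tpr = np.mean(probs[labels == 1] > t)
        fpr = np.mean(probs[labels == 0] > t)
        sig = S * tpr / np.sqrt(S * tpr + B * fpr) if S * tpr + B * fpr > 0 else 0.0
        if sig > best_sig:
            best_t, best_sig = t, sig
    return best_t, best_sig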

background_names = ['Higgs', '4 Tops', 'No Higgs']
with plt.rc_context(get_settings()):
    for i, model in enumerate(logreg_models):
        plt.figure(i)
        bin_edges, sig_bins, bg_bins = model.predict_hist(data[i][1][0], data[i][1][1], num_bins=40, sepbg=False)
        plt.bar(bin_edges[:-1], sig_bins, width=np.diff(bin_edges), alpha=0.75, align="edge", label="Signal")
        plt.bar(bin_edges[:-1], bg_bins, width=np.diff(bin_edges), alpha=0.75, align="edge", label="Background")
        plt.yscale('log')
        plt.gca().set_ylim(10**-3, 10**2)
        plt.gca().set_xlim(0, 1)
        plt.legend()
        plt.savefig(f'./images/full_example/histograms/log_reg_m{masses[i]}GeV_hist', dpi=200, bbox_inches='tight')

Having motivated this endeavor, let's see how threshold optimization boosts our logistic regression models' performance: the bcml_model.best_threshold method performs this optimization task for us.

time_before = getTime(form='m')
logreg_opt_results_nosepbg = [model.best_threshold(signal_yield, background_yield, data[i][1][0], data[i][1][1],
                                                   sepbg=False) 
                              for i, (model, signal_yield) in enumerate(zip(logreg_models, signal_yields))]
logreg_opt_sigs_nosepbg = [result[1] for result in logreg_opt_results_nosepbg]
for sig, mass in zip(logreg_opt_sigs_nosepbg, masses):
    print(f"Z' mass = {mass} GeV --> logistic regression gives optimized sign. of {round(sig, 3)} sigma",
          "w/o separated backgrounds")
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")
Z' mass = 350 GeV --> logistic regression gives optimized sign. of 17.545 sigma w/o separated backgrounds
Z' mass = 500 GeV --> logistic regression gives optimized sign. of 10.155 sigma w/o separated backgrounds
Z' mass = 1000 GeV --> logistic regression gives optimized sign. of 3.588 sigma w/o separated backgrounds
Z' mass = 2000 GeV --> logistic regression gives optimized sign. of 2.363 sigma w/o separated backgrounds
Z' mass = 4000 GeV --> logistic regression gives optimized sign. of 1.01 sigma w/o separated backgrounds
(Runtime: 0.133 minutes)
time_before = getTime(form='m')
logreg_opt_results_sepbg = [model.best_threshold(signal_yield, background_yields, data[i][1][0], data[i][1][2],
                                                   sepbg=True)
                              for i, (model, signal_yield) in enumerate(zip(logreg_models, signal_yields))]
logreg_opt_sigs_sepbg = [result[1] for result in logreg_opt_results_sepbg]
for sig, mass in zip(logreg_opt_sigs_sepbg, masses):
    print(f"Z' mass = {mass} GeV --> logistic regression gives optimized sign. of {round(sig, 3)} sigma",
          "w/ separated backgrounds")
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")
Z' mass = 350 GeV --> logistic regression gives optimized sign. of 22.274 sigma w/ separated backgrounds
Z' mass = 500 GeV --> logistic regression gives optimized sign. of 11.603 sigma w/ separated backgrounds
Z' mass = 1000 GeV --> logistic regression gives optimized sign. of 3.519 sigma w/ separated backgrounds
Z' mass = 2000 GeV --> logistic regression gives optimized sign. of 3.367 sigma w/ separated backgrounds
Z' mass = 4000 GeV --> logistic regression gives optimized sign. of 1.01 sigma w/ separated backgrounds
(Runtime: 0.134 minutes)

Now we can overlay our new optimized signal significances onto our old plot to visualize (marked) improvement.

logreg_opt_sigs_nosepbg_f = scipy.interpolate.interp1d(masses, np.log10(logreg_opt_sigs_nosepbg), kind='linear')
logreg_opt_sigs_sepbg_f = scipy.interpolate.interp1d(masses, np.log10(logreg_opt_sigs_sepbg), kind='linear')
with plt.rc_context(get_settings()):
    plt.plot([max_mass for i in range(2)], np.logspace(-2, 1.5, 2), label=r'Theoretical $5\sigma$ Limit', c='black')
    plt.plot(sample_masses, [5 for m in sample_masses], label=r'5$\sigma$', c='blue')
    plt.plot(sample_masses, [1.645 for m in sample_masses], label=r'90% confidence', c='cornflowerblue')
    plt.plot(
        sample_masses, [10**logreg_sigs_nosepbg_f(m) for m in sample_masses], label='Logistic Regression (No b.g. sep.)', 
        c='red')
    plt.plot(
        sample_masses, [10**logreg_sigs_sepbg_f(m) for m in sample_masses], label='Logistic Regression (B.g. sep.)', 
        c='red', linestyle=':')
    plt.plot(
        sample_masses, [10**logreg_opt_sigs_nosepbg_f(m) for m in sample_masses], 
        label='Logistic Regression (No b.g. sep., opt. thresh.)', 
        c='red', linestyle='--')
    plt.plot(
        sample_masses, [10**logreg_opt_sigs_sepbg_f(m) for m in sample_masses], 
        label='Logistic Regression (B.g. sep., opt. thresh.)', 
        c='red', linestyle='-.')
    plt.ylabel('Signal Significance');
    plt.xlabel(r"$Z'$ mass $m_{Z'}$ (GeV)");
    plt.yscale('log')
    plt.legend()
    plt.savefig('./images/full_example/log_reg_sigs', dpi=200, bbox_inches='tight')

Additionally, we can re-plot our histograms with vertical lines showing where the default threshold cuts for signal selection versus where the optimized threshold does (since the threshold sets the predicted signal probability required for an event to be kept).

background_names = ['Higgs', '4 Tops', 'No Higgs']
with plt.rc_context(get_settings()):
    for i, (model, result) in enumerate(zip(logreg_models, logreg_opt_results_sepbg)):
        plt.figure(i)
        bin_edges, sig_bins, bg_bins = model.predict_hist(data[i][1][0], data[i][1][1], num_bins=40, sepbg=False)
        plt.bar(bin_edges[:-1], sig_bins, width=np.diff(bin_edges), alpha=0.75, align="edge", label="Signal")
        plt.bar(bin_edges[:-1], bg_bins, width=np.diff(bin_edges), alpha=0.75, align="edge", label="Background")
        plt.plot([0.5]*2, [10**-3, 10**2], label='Default Threshold', c='tab:red')
        plt.plot([result[0]]*2, [10**-3, 10**2], label='Optimized Threshold', c='tab:green')
        plt.yscale('log')
        plt.gca().set_ylim(10**-3, 10**2)
        plt.gca().set_xlim(0, 1)
        plt.legend()
        plt.savefig(f'./images/full_example/histograms/log_reg_m{masses[i]}GeV_opt_hist', dpi=200, bbox_inches='tight')

Random Forests

Having comprehensively explored logistic regression, we can ramp up our performance by adopting a more sophisticated machine learning model: random forests. In particular, we can implement the random forest methodology using sklearn.ensemble.RandomForestClassifier.

Initializing and Training

from sklearn.ensemble import RandomForestClassifier
time_before = getTime(form='m')
randforest_m350G_model = bcml_model(
    make_pipeline(StandardScaler(), RandomForestClassifier(n_estimators=1000, n_jobs=-1)));
randforest_m350G_model.fit(*data[0][0][:2])
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")
(Runtime: 64.567 minutes)
time_before = getTime(form='m')
randforest_m500G_model = bcml_model(
    make_pipeline(StandardScaler(), RandomForestClassifier(n_estimators=1000, n_jobs=-1)));
randforest_m500G_model.fit(*data[1][0][:2])
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")
(Runtime: 72.9 minutes)
time_before = getTime(form='m')
randforest_m1T_model = bcml_model(
    make_pipeline(StandardScaler(), RandomForestClassifier(n_estimators=1000, n_jobs=-1)));
randforest_m1T_model.fit(*data[2][0][:2])
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")
(Runtime: 111.5 minutes)
time_before = getTime(form='m')
randforest_m2T_model = bcml_model(
    make_pipeline(StandardScaler(), RandomForestClassifier(n_estimators=1000, n_jobs=-1)));
randforest_m2T_model.fit(*data[3][0][:2])
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")
(Runtime: 157.85 minutes)
time_before = getTime(form='m')
randforest_m4T_model = bcml_model(
    make_pipeline(StandardScaler(), RandomForestClassifier(n_estimators=1000, n_jobs=-1)));
randforest_m4T_model.fit(*data[4][0][:2])
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")
(Runtime: 121.3 minutes)

Evaluation

We now repeat all of the performance evaluation strategies we employed for logistic regression.

randforest_models = [
    randforest_m350G_model, randforest_m500G_model, randforest_m1T_model, randforest_m2T_model, randforest_m4T_model]
time_before = getTime()
randforest_sigs_nosepbg = [model.significance(signal_yield, background_yield, 
                                          tpr=model.tpr(data[i][1][1], preds=data[i][1][0]), 
                                          fpr=model.fpr(data[i][1][1], preds=data[i][1][0]),
                                          sepbg=False) 
                       for i, (model, signal_yield) in enumerate(zip(randforest_models, signal_yields))]
for sig, mass in zip(randforest_sigs_nosepbg, masses):
    print(f"Z' mass = {mass} GeV --> random forests gives significance of {round(sig, 3)} sigma",
          "w/o separated backgrounds")
time_after = getTime()
print(f"(Runtime: {time_after - time_before} seconds)")
Z' mass = 350 GeV --> random forests gives significance of 26.779 sigma w/o separated backgrounds
Z' mass = 500 GeV --> random forests gives significance of 13.313 sigma w/o separated backgrounds
Z' mass = 1000 GeV --> random forests gives significance of 3.06 sigma w/o separated backgrounds
Z' mass = 2000 GeV --> random forests gives significance of 0.378 sigma w/o separated backgrounds
Z' mass = 4000 GeV --> random forests gives significance of 0.008 sigma w/o separated backgrounds
(Runtime: 1201 seconds)
time_before = getTime()
randforest_sigs_sepbg = [model.significance(signal_yield, background_yields, 
                                          tpr=model.tpr(data[i][1][1], preds=data[i][1][0]), 
                                          fpr=[model.fpr(np.zeros((len(indx[i][j]),)), preds=data[i][1][0][indx[i][j]]) 
                                               for j in range(1,4)],
                                          sepbg=True) 
                       for i, (model, signal_yield) in enumerate(zip(randforest_models, signal_yields))]
for sig, mass in zip(randforest_sigs_sepbg, masses):
    print(f"Z' mass = {mass} GeV --> random forests gives significance of {round(sig, 3)} sigma",
          "w/ separated backgrounds")
time_after = getTime()
print(f"(Runtime: {time_after - time_before} seconds)")
Z' mass = 350 GeV --> random forests gives significance of 27.245 sigma w/ separated backgrounds
Z' mass = 500 GeV --> random forests gives significance of 13.844 sigma w/ separated backgrounds
Z' mass = 1000 GeV --> random forests gives significance of 3.107 sigma w/ separated backgrounds
Z' mass = 2000 GeV --> random forests gives significance of 0.375 sigma w/ separated backgrounds
Z' mass = 4000 GeV --> random forests gives significance of 0.008 sigma w/ separated backgrounds
(Runtime: 175 seconds)
time_before = getTime(form='m')
randforest_opt_results_nosepbg = [model.best_threshold(signal_yield, background_yield, data[i][1][0], data[i][1][1],
                                                       sepbg=False) 
                                  for i, (model, signal_yield) in enumerate(zip(randforest_models, signal_yields))]
randforest_opt_sigs_nosepbg = [result[1] for result in randforest_opt_results_nosepbg]
for sig, mass in zip(randforest_opt_sigs_nosepbg, masses):
    print(f"Z' mass = {mass} GeV --> random forests gives optimized sign. of {round(sig, 3)} sigma",
          "w/o separated backgrounds")
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")
Z' mass = 350 GeV --> random forests gives optimized sign. of 64.017 sigma w/o separated backgrounds
Z' mass = 500 GeV --> random forests gives optimized sign. of 44.418 sigma w/o separated backgrounds
Z' mass = 1000 GeV --> random forests gives optimized sign. of 18.138 sigma w/o separated backgrounds
Z' mass = 2000 GeV --> random forests gives optimized sign. of 4.242 sigma w/o separated backgrounds
Z' mass = 4000 GeV --> random forests gives optimized sign. of 0.968 sigma w/o separated backgrounds
(Runtime: 16.817 minutes)
time_before = getTime(form='m')
randforest_opt_results_sepbg = [model.best_threshold(signal_yield, background_yields, data[i][1][0], data[i][1][2],
                                                     sepbg=True)
                                for i, (model, signal_yield) in enumerate(zip(randforest_models, signal_yields))]
randforest_opt_sigs_sepbg = [result[1] for result in randforest_opt_results_sepbg]
for sig, mass in zip(randforest_opt_sigs_sepbg, masses):
    print(f"Z' mass = {mass} GeV --> random forests gives optimized sign. of {round(sig, 3)} sigma",
          "w/ separated backgrounds")
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")
Z' mass = 350 GeV --> random forests gives optimized sign. of 54.843 sigma w/ separated backgrounds
Z' mass = 500 GeV --> random forests gives optimized sign. of 37.07 sigma w/ separated backgrounds
Z' mass = 1000 GeV --> random forests gives optimized sign. of 15.351 sigma w/ separated backgrounds
Z' mass = 2000 GeV --> random forests gives optimized sign. of 6.677 sigma w/ separated backgrounds
Z' mass = 4000 GeV --> random forests gives optimized sign. of 0.968 sigma w/ separated backgrounds
(Runtime: 8.867 minutes)
randforest_sigs_nosepbg_f = scipy.interpolate.interp1d(masses, np.log10(randforest_sigs_nosepbg), kind='cubic')
randforest_sigs_sepbg_f = scipy.interpolate.interp1d(masses, np.log10(randforest_sigs_sepbg), kind='cubic')
randforest_opt_sigs_nosepbg_f = scipy.interpolate.interp1d(masses, np.log10(randforest_opt_sigs_nosepbg), kind='cubic')
randforest_opt_sigs_sepbg_f = scipy.interpolate.interp1d(masses, np.log10(randforest_opt_sigs_sepbg), kind='linear')
import matplotlib.pyplot as plt

with plt.rc_context(get_settings()):
    plt.plot([max_mass for i in range(2)], np.logspace(-2, 1.5, 2), label=r'Theoretical $5\sigma$ Limit', c='black')
    plt.plot(sample_masses, [5 for m in sample_masses], label=r'5$\sigma$', c='blue')
    plt.plot(sample_masses, [1.645 for m in sample_masses], label=r'90% confidence', c='cornflowerblue')
    plt.plot(
        sample_masses, [10**randforest_sigs_nosepbg_f(m) for m in sample_masses], 
        label='Random Forest (No b.g. sep.)', 
        c='lime')
    plt.plot(
        sample_masses, [10**randforest_sigs_sepbg_f(m) for m in sample_masses], 
        label='Random Forest (B.g. sep.)', 
        c='lime', linestyle=':')
    plt.plot(
        sample_masses, [10**randforest_opt_sigs_nosepbg_f(m) for m in sample_masses], 
        label='Random Forest (No b.g. sep., opt. thresh.)', 
        c='lime', linestyle='--')
    plt.plot(
        sample_masses, [10**randforest_opt_sigs_sepbg_f(m) for m in sample_masses], 
        label='Random Forest (B.g. sep., opt. thresh.)', 
        c='lime', linestyle='-.')
    plt.ylabel('Signal Significance');
    plt.xlabel(r"$Z'$ mass $m_{Z'}$ (GeV)");
    plt.yscale('log')
    plt.legend()
    plt.savefig('./images/full_example/rand_forest_sigs', dpi=200, bbox_inches='tight')

Gradient Boosting

Now we can take yet another step up by using a gradient boosting model. In particular, we'll use the xgboost library, which shares sk-learn's API but provides substantial computational speed-up for gradient boosting models.

Initializing and Training

from xgboost import XGBClassifier
time_before = getTime(form='m')
xgradboost_m350G_model = bcml_model(
    make_pipeline(StandardScaler(), 
                  XGBClassifier(n_estimators=250, max_depth=7, learning_rate=0.1, nthread=4, use_label_encoder=False)));
xgradboost_m350G_model.fit(*data[0][0][:2])
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")
(Runtime: 57.35 minutes)
time_before = getTime(form='m')
xgradboost_m500G_model = bcml_model(
    make_pipeline(StandardScaler(), 
                  XGBClassifier(n_estimators=250, max_depth=7, learning_rate=0.1, nthread=4, use_label_encoder=False)));
xgradboost_m500G_model.fit(*data[1][0][:2])
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")
(Runtime: 55.25 minutes)
time_before = getTime(form='m')
xgradboost_m1T_model = bcml_model(
    make_pipeline(StandardScaler(), 
                  XGBClassifier(n_estimators=250, max_depth=7, learning_rate=0.1, nthread=4, use_label_encoder=False)));
xgradboost_m1T_model.fit(*data[2][0][:2])
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")
(Runtime: 54.05 minutes)
time_before = getTime(form='m')
xgradboost_m2T_model = bcml_model(
    make_pipeline(StandardScaler(), 
                  XGBClassifier(n_estimators=250, max_depth=7, learning_rate=0.1, nthread=4, use_label_encoder=False)));
xgradboost_m2T_model.fit(*data[3][0][:2])
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")
(Runtime: 53.35 minutes)
time_before = getTime(form='m')
xgradboost_m4T_model = bcml_model(
    make_pipeline(StandardScaler(), 
                  XGBClassifier(n_estimators=250, max_depth=7, learning_rate=0.1, nthread=4, use_label_encoder=False)));
xgradboost_m4T_model.fit(*data[4][0][:2])
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")
(Runtime: 64.95 minutes)

Evaluation

xgradboost_models = [
    xgradboost_m350G_model, xgradboost_m500G_model, xgradboost_m1T_model, xgradboost_m2T_model, xgradboost_m4T_model]
time_before = getTime()
xgradboost_sigs_nosepbg = [model.significance(signal_yield, background_yield, 
                                          tpr=model.tpr(data[i][1][1], preds=data[i][1][0]), 
                                          fpr=model.fpr(data[i][1][1], preds=data[i][1][0]),
                                          sepbg=False) 
                       for i, (model, signal_yield) in enumerate(zip(xgradboost_models, signal_yields))]
for sig, mass in zip(xgradboost_sigs_nosepbg, masses):
    print(f"Z' mass = {mass} GeV --> gradient boosting gives significance of {round(sig, 3)} sigma",
          "w/o separated backgrounds")
time_after = getTime()
print(f"(Runtime: {time_after - time_before} seconds)")
Z' mass = 350 GeV --> gradient boosting gives significance of 33.524 sigma w/o separated backgrounds
Z' mass = 500 GeV --> gradient boosting gives significance of 17.511 sigma w/o separated backgrounds
Z' mass = 1000 GeV --> gradient boosting gives significance of 4.07 sigma w/o separated backgrounds
Z' mass = 2000 GeV --> gradient boosting gives significance of 0.465 sigma w/o separated backgrounds
Z' mass = 4000 GeV --> gradient boosting gives significance of 0.007 sigma w/o separated backgrounds
(Runtime: 723 seconds)
time_before = getTime()
xgradboost_sigs_sepbg = [model.significance(signal_yield, background_yields, 
                                          tpr=model.tpr(data[i][1][1], preds=data[i][1][0]), 
                                          fpr=[model.fpr(np.zeros((len(indx[i][j]),)), preds=data[i][1][0][indx[i][j]]) 
                                               for j in range(1,4)],
                                          sepbg=True) 
                       for i, (model, signal_yield) in enumerate(zip(xgradboost_models, signal_yields))]
for sig, mass in zip(xgradboost_sigs_sepbg, masses):
    print(f"Z' mass = {mass} GeV --> gradient boosting gives significance of {round(sig, 3)} sigma",
          "w/ separated backgrounds")
time_after = getTime()
print(f"(Runtime: {time_after - time_before} seconds)")
Z' mass = 350 GeV --> gradient boosting gives significance of 28.158 sigma w/ separated backgrounds
Z' mass = 500 GeV --> gradient boosting gives significance of 14.851 sigma w/ separated backgrounds
Z' mass = 1000 GeV --> gradient boosting gives significance of 3.426 sigma w/ separated backgrounds
Z' mass = 2000 GeV --> gradient boosting gives significance of 0.363 sigma w/ separated backgrounds
Z' mass = 4000 GeV --> gradient boosting gives significance of 0.005 sigma w/ separated backgrounds
(Runtime: 22 seconds)
time_before = getTime(form='m')
xgradboost_opt_results_nosepbg = [model.best_threshold(signal_yield, background_yield, data[i][1][0], data[i][1][1],
                                                       sepbg=False) 
                                  for i, (model, signal_yield) in enumerate(zip(xgradboost_models, signal_yields))]
xgradboost_opt_sigs_nosepbg = [result[1] for result in xgradboost_opt_results_nosepbg]
for sig, mass in zip(xgradboost_opt_sigs_nosepbg, masses):
    print(f"Z' mass = {mass} GeV --> gradient boosting forests gives optimized sign. of {round(sig, 3)} sigma",
          "w/o separated backgrounds")
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")
Z' mass = 350 GeV --> gradient boosting gives optimized sign. of 79.138 sigma w/o separated backgrounds
Z' mass = 500 GeV --> gradient boosting gives optimized sign. of 54.933 sigma w/o separated backgrounds
Z' mass = 1000 GeV --> gradient boosting gives optimized sign. of 20.962 sigma w/o separated backgrounds
Z' mass = 2000 GeV --> gradient boosting gives optimized sign. of 7.163 sigma w/o separated backgrounds
Z' mass = 4000 GeV --> gradient boosting gives optimized sign. of 1.021 sigma w/o separated backgrounds
(Runtime: 13.817 minutes)
time_before = getTime(form='m')
xgradboost_opt_results_sepbg = [model.best_threshold(signal_yield, background_yields, data[i][1][0], data[i][1][2],
                                                     sepbg=True)
                                for i, (model, signal_yield) in enumerate(zip(xgradboost_models, signal_yields))]
xgradboost_opt_sigs_sepbg = [result[1] for result in xgradboost_opt_results_sepbg]
for sig, mass in zip(xgradboost_opt_sigs_sepbg, masses):
    print(f"Z' mass = {mass} GeV --> gradient boosting gives optimized sign. of {round(sig, 3)} sigma",
          "w/ separated backgrounds")
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")
Z' mass = 350 GeV --> gradient boosting gives optimized sign. of 67.963 sigma w/ separated backgrounds
Z' mass = 500 GeV --> gradient boosting gives optimized sign. of 46.62 sigma w/ separated backgrounds
Z' mass = 1000 GeV --> gradient boosting gives optimized sign. of 17.07 sigma w/ separated backgrounds
Z' mass = 2000 GeV --> gradient boosting gives optimized sign. of 7.148 sigma w/ separated backgrounds
Z' mass = 4000 GeV --> gradient boosting gives optimized sign. of 1.021 sigma w/ separated backgrounds
(Runtime: 29.983 minutes)
for mass, result, sig_cs in zip(masses, xgradboost_opt_results_nosepbg, sig_css):
    print(f"for mass = {mass} GeV, thresh = {round(result[0],6)}, tpr = {round(result[2],5)} and fpr = {round(result[3],9)}",
          f"so yield is {int(lumi * sig_cs * conv * result[2])} and sig. sign. is {round(result[1],3)}")
for mass = 350 GeV, thresh = 0.987492, tpr = 0.52455 and fpr = 0.00139195 so yield is 15733 and sig. sign. is 79.138
for mass = 500 GeV, thresh = 0.996144, tpr = 0.52456 and fpr = 0.000343988 so yield is 5983 and sig. sign. is 54.933
for mass = 1000 GeV, thresh = 0.999345, tpr = 0.74706 and fpr = 5.1998e-05 so yield is 882 and sig. sign. is 20.962
for mass = 2000 GeV, thresh = 0.999975, tpr = 0.8408 and fpr = 0.0 so yield is 51 and sig. sign. is 7.163
for mass = 4000 GeV, thresh = 0.999873, tpr = 0.96767 and fpr = 0.0 so yield is 1 and sig. sign. is 1.021
xgradboost_sigs_nosepbg_f = scipy.interpolate.interp1d(masses, np.log10(xgradboost_sigs_nosepbg), kind='cubic')
xgradboost_sigs_sepbg_f = scipy.interpolate.interp1d(masses, np.log10(xgradboost_sigs_sepbg), kind='cubic')
xgradboost_opt_sigs_nosepbg_f = scipy.interpolate.interp1d(masses, np.log10(xgradboost_opt_sigs_nosepbg), kind='cubic')
xgradboost_opt_sigs_sepbg_f = scipy.interpolate.interp1d(masses, np.log10(xgradboost_opt_sigs_sepbg), kind='linear')
import matplotlib.pyplot as plt

with plt.rc_context(get_settings()):
    plt.plot([max_mass for i in range(2)], np.logspace(-2, 1.5, 2), label=r'Theoretical $5\sigma$ Limit', c='black')
    plt.plot(sample_masses, [5 for m in sample_masses], label=r'5$\sigma$', c='blue')
    plt.plot(sample_masses, [1.645 for m in sample_masses], label=r'90% confidence', c='cornflowerblue')
    plt.plot(
        sample_masses, [10**xgradboost_sigs_nosepbg_f(m) for m in sample_masses], 
        label='Gradient Boosting (No b.g. sep.)', 
        c='tab:purple')
    plt.plot(
        sample_masses, [10**xgradboost_sigs_sepbg_f(m) for m in sample_masses], 
        label='Gradient Boosting (B.g. sep.)', 
        c='tab:purple', linestyle=':')
    plt.plot(
        sample_masses, [10**xgradboost_opt_sigs_nosepbg_f(m) for m in sample_masses], 
        label='Gradient Boosting (No b.g. sep., opt. thresh.)', 
        c='tab:purple', linestyle='--')
    plt.plot(
        sample_masses, [10**xgradboost_opt_sigs_sepbg_f(m) for m in sample_masses], 
        label='Gradient Boosting (B.g. sep., opt. thresh.)', 
        c='tab:purple', linestyle='-.')
    plt.ylabel('Signal Significance');
    plt.xlabel(r"$Z'$ mass $m_{Z'}$ (GeV)");
    plt.yscale('log')
    plt.legend()
    plt.savefig('./images/full_example/grad_boost_sigs', dpi=200, bbox_inches='tight')

As a bonus, let's now compare performance across machine learning methodologies for, say, the optimized threshold + background separation case (as this is the most realistic option).

import matplotlib.pyplot as plt

with plt.rc_context(get_settings()):
    plt.plot([max_mass for i in range(2)], np.logspace(-0.5, 2, 2), label=r'Theoretical $5\sigma$ Limit', c='black')
    plt.plot(sample_masses, [5 for m in sample_masses], label=r'5$\sigma$', c='blue')
    plt.plot(sample_masses, [1.645 for m in sample_masses], label=r'90% confidence', c='cornflowerblue')
    plt.plot(
        sample_masses, [10**logreg_opt_sigs_sepbg_f(m) for m in sample_masses], 
        label='Logistic Regression', 
        c='red', linestyle='-.')
    plt.plot(
        sample_masses, [10**randforest_opt_sigs_sepbg_f(m) for m in sample_masses], 
        label='Random Forests', 
        c='lime', linestyle='-.')
    plt.plot(
        sample_masses, [10**xgradboost_opt_sigs_sepbg_f(m) for m in sample_masses], 
        label='Gradient Boosting', 
        c='tab:purple', linestyle='-.')
    plt.ylabel('Signal Significance');
    plt.xlabel(r"$Z'$ mass $m_{Z'}$ (GeV)");
    plt.yscale('log')
    plt.legend()
    plt.savefig('./images/full_example/all_model_sigs', dpi=200, bbox_inches='tight')

Feature Importance

Our models differentiate between signal and background using the features we provide, but in the eyes of the models not all features are created equal. Indeed, trained models assign importances to their features: accessing these allows us to understand precisely where our discriminating power comes from.

Our bcml_model class provides some functionality for accessing and analyzing feature importance, which we'll now explore.

Printing Values

We can access feature importances through the bcml_model.sorted_feature_importance method (except for when we're doing logistic regression, but more on that in a second). Let's start by constructing a helpful function for printing out the data returned by the aforementioned method.

max_feature_len = max([len(feature) for feature in get47Dfeatures()])

def printFeatureImportances(feature_importances, num=10, hasSign=False):
    feature_importances = feature_importances[:num]
    for i, (feature, importance) in enumerate(feature_importances):
        if hasSign:
            sign = '+' if importance >= 0 else '-'
            print('{4:2}. {0:{1}} {3}{2:.4f}'.format(feature, max_feature_len + 3, np.abs(importance), sign, i+1))
        else:
            print('{3:2}. {0:{1}} {2:.4f}'.format(feature, max_feature_len + 3, importance, i+1))

Now we can look at feature importances, starting with our logistic regression model. Unfortunately, while many sk-learn models have built-in methods for accessing feature importance (which our bcml_model class exploits), logistic regression lacks this: instead, we can look at the magnitudes of the learned coefficients, which (since our features are standardized in the pipeline) serve much the same purpose. Consequently, we'll adapt some of the code in bcml_model.sorted_feature_importance to explore the logistic regression model coefficients.

features = get47Dfeatures()
lr_feature_importances_raw = [model.model[-1].coef_[0] for model in logreg_models]
ranked_indices = [np.argsort(-np.abs(importances)) for importances in lr_feature_importances_raw]
lr_feature_importances = [
    [[features[i], importances[i]] for i in indices] for importances, indices in zip(lr_feature_importances_raw, ranked_indices)]
for mass, feature_importance in zip(masses, lr_feature_importances):
    print(f"Z' mass = {mass}")
    printFeatureImportances(feature_importance, hasSign=True)
    if mass != masses[-1]:
        print()
Z' mass = 350
 1. M j1 j2        -4.0128
 2. pT b1          +1.3718
 3. M b1 b2        -0.5675
 4. MT b1 l MET    +0.5563
 5. pT j2          -0.5362
 6. M b1 b4        +0.5027
 7. pT b3          +0.4634
 8. dR b1 b2       +0.4241
 9. pT l           -0.4114
10. MET            -0.3775

Z' mass = 500
 1. M j1 j2        -4.4617
 2. pT b1          +1.4807
 3. M b1 b4        +0.6866
 4. pT b2          +0.6386
 5. M b1 b2        +0.6010
 6. pT j2          -0.5372
 7. M b1 b3        +0.4356
 8. MT b1 l MET    +0.3963
 9. MT b4 l MET    -0.3363
10. pT l           -0.3351

Z' mass = 1000
 1. M j1 j2        -5.0810
 2. M b1 b2        +4.2592
 3. M b1 b3        +1.9567
 4. M b1 b4        +1.5468
 5. pT b3          -1.0996
 6. dR b1 b2       -0.8572
 7. pT b1          +0.6903
 8. M b2 b3        +0.6769
 9. dR b1 b4       -0.5659
10. M b2 b4        +0.4443

Z' mass = 2000
 1. M b1 b2        +5.8614
 2. M j1 j2        -3.7362
 3. M b1 b3        +2.3241
 4. M b1 b4        +1.6215
 5. pT b3          -1.3961
 6. pT b1          +1.2878
 7. M b2 b3        +1.0925
 8. dR b1 b2       -0.8629
 9. M b2 b4        +0.6677
10. dR b1 b4       -0.5678

Z' mass = 4000
 1. pT b1          +5.6635
 2. M b1 b2        +4.9294
 3. M j1 j2        -2.8621
 4. M b1 b3        +1.3875
 5. pT b2          +1.3621
 6. M b1 b4        +1.0313
 7. M b2 b3        +0.7581
 8. pT j2          -0.5869
 9. MT b2 l MET    +0.5058
10. pT b3          -0.4899

Now let's look at random forests: at this point, we're free to use the built-in method.

rf_feature_importances = [model.sorted_feature_importance(get47Dfeatures()) for model in randforest_models]
for mass, feature_importance in zip(masses, rf_feature_importances):
    print(f"Z' mass = {mass}")
    printFeatureImportances(feature_importance)
    if mass != masses[-1]:
        print()
Z' mass = 350
 1. M b1 b2        0.1582
 2. M j1 j2        0.0982
 3. pT b1          0.0892
 4. M b1 b3        0.0802
 5. pT b2          0.0614
 6. M b1 b4        0.0539
 7. pT b3          0.0402
 8. M b3 b4        0.0305
 9. pT j2          0.0301
10. M b2 b3        0.0279

Z' mass = 500
 1. M b1 b2        0.1957
 2. pT b1          0.1343
 3. M b1 b3        0.0911
 4. pT b2          0.0838
 5. M j1 j2        0.0754
 6. pT b3          0.0461
 7. M b1 b4        0.0440
 8. MT b1 l MET    0.0308
 9. M b2 b3        0.0294
10. M b2 b4        0.0266

Z' mass = 1000
 1. M b1 b2        0.2532
 2. pT b1          0.1665
 3. pT b2          0.1359
 4. M b1 b3        0.0981
 5. MT b1 l MET    0.0491
 6. pT b3          0.0466
 7. M b2 b3        0.0414
 8. M j1 j2        0.0344
 9. M b1 b4        0.0342
10. M b2 b4        0.0217

Z' mass = 2000
 1. M b1 b2        0.2275
 2. pT b1          0.1941
 3. pT b2          0.1357
 4. M b1 b3        0.0915
 5. MT b1 l MET    0.0609
 6. M b2 b3        0.0533
 7. M b1 b4        0.0513
 8. M b2 b4        0.0325
 9. MT b2 l MET    0.0308
10. pT b3          0.0277

Z' mass = 4000
 1. M b1 b2        0.2199
 2. pT b1          0.1579
 3. pT b2          0.1245
 4. M b1 b3        0.1160
 5. M b1 b4        0.0614
 6. MT b2 l MET    0.0606
 7. MT b1 l MET    0.0597
 8. M b2 b3        0.0513
 9. M b2 b4        0.0357
10. dR b1 b2       0.0316

Finally, we have the gradient boosting models.

xgb_feature_importances = [model.sorted_feature_importance(get47Dfeatures()) for model in xgradboost_models]
for mass, feature_importance in zip(masses, xgb_feature_importances):
    print(f"Z' mass = {mass}")
    printFeatureImportances(feature_importance)
    if mass != masses[-1]:
        print()
Z' mass = 350
 1. M b1 b2        0.2680
 2. M j1 j2        0.1726
 3. M b1 b3        0.0763
 4. pT b1          0.0646
 5. M b1 b4        0.0594
 6. M b3 b4        0.0384
 7. dR b1 b2       0.0366
 8. pT b4          0.0349
 9. M b2 b3        0.0331
10. M b2 b4        0.0331

Z' mass = 500
 1. M b1 b2        0.4526
 2. M j1 j2        0.1334
 3. M b1 b3        0.0773
 4. M b1 b4        0.0444
 5. pT b1          0.0368
 6. dR b1 b2       0.0315
 7. M b2 b3        0.0233
 8. M b3 b4        0.0225
 9. dR b1 b3       0.0208
10. M b2 b4        0.0206

Z' mass = 1000
 1. M b1 b2        0.7459
 2. M b1 b3        0.0615
 3. M j1 j2        0.0422
 4. M b1 b4        0.0309
 5. pT b1          0.0190
 6. M b2 b3        0.0124
 7. dR b1 b2       0.0112
 8. M b2 b4        0.0081
 9. M b3 b4        0.0065
10. MT l MET       0.0054

Z' mass = 2000
 1. M b1 b2        0.9073
 2. M b1 b3        0.0301
 3. M b1 b4        0.0111
 4. M j1 j2        0.0110
 5. M b2 b3        0.0050
 6. pT b1          0.0039
 7. M b2 b4        0.0029
 8. dR b1 b2       0.0024
 9. M b3 b4        0.0023
10. MT l MET       0.0018

Z' mass = 4000
 1. M b1 b2        0.9413
 2. pT b1          0.0079
 3. M b1 b3        0.0076
 4. M j1 j2        0.0072
 5. M b1 b4        0.0031
 6. pT j2          0.0023
 7. M b2 b3        0.0022
 8. MT b1 l MET    0.0021
 9. M b3 b4        0.0021
10. M b2 b4        0.0020

Visualization

Reading printed feature importances is fine, but it might be nice to visualize how feature importance varies with $m_{Z'}$ for different model types. We can do this with some list manipulation and interpolation as follows, beginning once more with our logistic regression models.

num = 18
lr_features = [[feature for feature, importance in elem] for elem in lr_feature_importances]
lr_best_features = list(set.intersection(*[set(elem[:num]) for elem in lr_features]))
lr_best_features_and_imps = [
    [feature, [elem[features.index(feature)][1] for elem, features in zip(lr_feature_importances, lr_features)]] 
    for feature in lr_best_features]
lr_fs = [scipy.interpolate.interp1d(masses, np.abs(importances), kind='quadratic') 
         for feature, importances in lr_best_features_and_imps]

This chunk of code works roughly by taking the intersection of the sets of the top num features for the various $Z'$ masses (to get the most important features overall), grabbing the importance of each of these features at each $m_{Z'}$ value, and then interpolating importance versus $m_{Z'}$ for each feature.
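As a toy illustration of the intersection step (using two made-up top-feature lists and num = 2):

tops = [['M b1 b2', 'M j1 j2', 'pT b1'], ['M b1 b2', 'pT b1', 'pT b2']]
set.intersection(*[set(t[:2]) for t in tops])   # -> {'M b1 b2'}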

sample_masses = np.linspace(masses[0], masses[-1], 500)
with plt.rc_context(get_settings()):
    for f, feature in zip(lr_fs, lr_best_features):
        plt.plot(sample_masses, f(sample_masses), label=feature)
    plt.legend()
    plt.xlabel(r"$m_{Z'}$ (GeV)")
    plt.xlim(masses[0], masses[-1])
    plt.savefig('./images/full_example/log_reg_features', dpi=200, bbox_inches='tight')

Now we proceed to random forests.

num = 10
rf_features = [[feature for feature, importance in elem] for elem in rf_feature_importances]
rf_best_features = list(set.intersection(*[set(elem[:num]) for elem in rf_features]))
rf_best_features_and_imps = [
    [feature, [elem[features.index(feature)][1] for elem, features in zip(rf_feature_importances, rf_features)]] 
    for feature in rf_best_features]
rf_fs = [scipy.interpolate.interp1d(masses, np.abs(importances), kind='quadratic') 
         for feature, importances in rf_best_features_and_imps]
sample_masses = np.linspace(masses[0], masses[-1], 500)
with plt.rc_context(get_settings()):
    for f, feature in zip(rf_fs, rf_best_features):
        plt.plot(sample_masses, f(sample_masses), label=feature)
    plt.legend()
    plt.xlabel(r"$m_{Z'}$ (GeV)")
    plt.xlim(masses[0], masses[-1])
    plt.savefig('./images/full_example/rand_forest_features', dpi=200, bbox_inches='tight')

And finally, on to gradient boosting.

num = 10
xgb_features = [[feature for feature, importance in elem] for elem in xgb_feature_importances]
xgb_best_features = list(set.intersection(*[set(elem[:num]) for elem in xgb_features]))
xgb_best_features_and_imps = [
    [feature, [elem[features.index(feature)][1] for elem, features in zip(xgb_feature_importances, xgb_features)]] 
    for feature in xgb_best_features]
xgb_fs = [scipy.interpolate.interp1d(masses, np.abs(importances), kind='quadratic') 
         for feature, importances in xgb_best_features_and_imps]
sample_masses = np.linspace(masses[0], masses[-1], 500)
with plt.rc_context(get_settings()):
    for f, feature in zip(xgb_fs, xgb_best_features):
        plt.plot(sample_masses, f(sample_masses), label=feature)
    plt.yscale('log')
    plt.legend()
    plt.xlabel(r"$m_{Z'}$ (GeV)")
    plt.xlim(masses[0], masses[-1])
    plt.savefig('./images/full_example/grad_boost_features', dpi=200, bbox_inches='tight')

TPR-FPR Curves

Setup

Consider the situation at hand. We want to maximize signal significance $\mathcal{S}$, which is a function of TPR, FPR, and $m_{Z'}$ (via $\sigma_s$). Visualizing a function $\mathbb{R}^3 \to \mathbb{R}$ is hard: we're going to have to reduce the dimensionality somehow. First, recall that we are really interested in achieving discovery potential, $5\sigma$. Given a TPR and FPR, there will be a range of $m_{Z'}$ values which yield $\mathcal{S} \geq 5$. More specifically, because the signal cross section falls with increasing $m_{Z'}$, for fixed TPR and FPR there is a maximal $m_{Z'}$ that yields $\mathcal{S} \geq 5$ (i.e., the mass at which $\mathcal{S} = 5$). We can construct a heatmap which uses color to depict that maximal mass sensitivity as a function of TPR and FPR for the $Z'$ signal cross sections we've been considering.
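
To make the monotonicity argument concrete, here is a rough back-of-the-envelope sketch. The library's actual significance definition is encapsulated in bcml_model and cross_section_helper; purely for intuition, assume the common approximation $\mathcal{S} = s/\sqrt{s + b}$ with $s = L\,\sigma_s\,\text{TPR}$ and $b = L\,\sigma_b\,\text{FPR}$. Then, at fixed TPR and FPR, $\mathcal{S}$ falls monotonically as $\sigma_s$ falls (i.e., as $m_{Z'}$ grows), so the mass at which $\mathcal{S}$ crosses $5$ is well defined. The cross sections below are hypothetical.

# Rough sketch only (assumed approximation S = s / sqrt(s + b); hypothetical cross sections).
import numpy as np

def approx_significance(sigma_s_fb, sigma_b_fb, tpr, fpr, lumi_fb=3000):
    s = lumi_fb * sigma_s_fb * tpr   # expected signal events passing the cut
    b = lumi_fb * sigma_b_fb * fpr   # expected background events passing the cut
    return s / np.sqrt(s + b)

sigma_b_fb = 1e4                            # hypothetical total background cross section (fb)
for sigma_s_fb in [10, 1, 0.1, 0.01]:       # falling signal cross section mimics rising m_Z'
    print(sigma_s_fb, round(approx_significance(sigma_s_fb, sigma_b_fb, tpr=0.5, fpr=1e-4), 2))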

The cross_section_helper class allows us to plot this heatmap via the cross_section_helper.max_mass_sens_versus_tpr_fpr method.

cvalues = [500, 750, 1000, 1500, 2000]
clabels = [str(val) for val in cvalues]
manual = [(0.85, -1), (0.85, -2), (0.85, -3), (0.85, -3.5), (0.85, -6)]
tpr_bounds = (0.05,1)
fpr_bounds = (10**-7, 0.5)
colors = np.linspace(0.3, 0.8, 5)
zp_cs.max_mass_sens_versus_tpr_fpr(plot=True, sig=5, lumi=3000, tpr_bounds=tpr_bounds, fpr_bounds=fpr_bounds, res=1000, 
                                   cvalues=cvalues, clabels=clabels, manual=manual);
plt.savefig('./images/full_example/tpr_fpr', dpi=200, bbox_inches='tight')

Logistic Regression

Recall that varying the threshold on a given model yields a set of (TPR, FPR) pairs: we can interpret these as points sweeping out a curve in TPR-FPR space. Threshold optimization, then, entails selecting the point on the curve which maximizes signal significance. In the language of our above heatmap, this means picking the point on the curve where the mass sensitivity is highest. It is therefore insightful to overlay the TPR-FPR curves which our models induce on top of the heatmap we've just established. Moreover, we will emphasize the points on the curve corresponding to both the default threshold ($0.5$) and the optimized threshold for each model, to provide yet another visualization of the improvement. We will employ this procedure first for our logistic regression models, reusing some of the output of bcml_model.best_threshold while also exploiting bcml_model.tpr and bcml_model.fpr.
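
For intuition only (this is not how the library computes these quantities), here is a minimal numpy sketch of how a single threshold on predicted signal probabilities maps to one (TPR, FPR) point; sweeping the threshold traces out exactly the kind of curve we overlay below. The labels and scores here are toy data.

import numpy as np

def tpr_fpr_at_threshold(probs, y, threshold):
    pred = probs >= threshold                        # classify as signal above the threshold
    tpr = np.sum(pred & (y == 1)) / np.sum(y == 1)   # fraction of true signal kept
    fpr = np.sum(pred & (y == 0)) / np.sum(y == 0)   # fraction of background kept
    return tpr, fpr

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)                             # toy binary labels
probs = np.clip(0.35 * y + rng.uniform(0, 0.65, 1000), 0, 1)  # toy scores correlated with the label
for threshold in (0.3, 0.5, 0.7):
    print(threshold, tpr_fpr_at_threshold(probs, y, threshold))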

lr_default_tprs = [model.tpr(data_row[1][1], preds=data_row[1][0]) for model, data_row in zip(logreg_models, data)]
lr_default_fprs = [model.fpr(data_row[1][1], preds=data_row[1][0]) for model, data_row in zip(logreg_models, data)]
lr_sample_tprs = [
    np.linspace(result[-2][0] + 1e-5, result[-2][-1] - 1e-5, 200) 
    for result in logreg_opt_results_nosepbg]
lr_tpr_fpr_fs = [
    scipy.interpolate.interp1d(result[-2], np.log10(np.where(result[-1]==0,1.2e-7,result[-1])), kind='linear')
    for result in logreg_opt_results_nosepbg]
lr_optimized_tpr = [result[-4] for result in logreg_opt_results_nosepbg]
lr_optimized_fpr = [result[-3] if result[-3] > 0 else 1.2e-7 for result in logreg_opt_results_nosepbg]
zp_cs.max_mass_sens_versus_tpr_fpr(plot=True, sig=5, lumi=3000, tpr_bounds=tpr_bounds, fpr_bounds=fpr_bounds, res=1000, 
                                   cvalues=cvalues, clabels=clabels, manual=manual);
with plt.rc_context(get_settings()):
    for result, f, sample_tprs, mass, color in zip(logreg_opt_results_nosepbg, lr_tpr_fpr_fs, lr_sample_tprs, masses, colors):
        plt.plot(sample_tprs, f(sample_tprs), c=plt.cm.YlOrRd(color), label=r"$m_{Z'} =" + f'{mass}$ GeV')
    plt.scatter(lr_default_tprs, np.log10(lr_default_fprs), c='white', label='Default', zorder=10, marker='X')
    plt.scatter(lr_optimized_tpr, np.log10(lr_optimized_fpr), c='white', label='Optimized', zorder=10, marker='*')
    plt.legend(bbox_to_anchor=(1.35,1), loc="upper left", facecolor='lightgrey')
    plt.xlim(*tpr_bounds);
    plt.ylim(*np.log10(fpr_bounds));
    plt.yticks(list(plt.yticks()[0]) + [np.log10(1.2e-7)], list(plt.yticks()[1]) + ['0'])
    plt.savefig('./images/full_example/log_reg_tpr_fpr', dpi=200, bbox_inches='tight')

It's most insightful to plot the FPR axis on a log scale, but this prevents us from showing portions of the TPR-FPR curves where $\text{FPR} = 0$, which does occasionally happen. To get around this, we use a small hack: whenever $\text{FPR} = 0$, we bump it up to a small (but non-zero) value and then manually relabel that tick as $0$. This is what you're seeing above.

Random Forests

We now go through the same procedure for our random forest models.

rf_default_tprs = [model.tpr(data_row[1][1], preds=data_row[1][0]) for model, data_row in zip(randforest_models, data)]
rf_default_fprs = [model.fpr(data_row[1][1], preds=data_row[1][0]) for model, data_row in zip(randforest_models, data)]
rf_sample_tprs = [
    np.linspace(result[-2][0] + 1e-5, result[-2][-1] - 1e-5, 200) 
    for result in randforest_opt_results_nosepbg]
rf_tpr_fpr_fs = [
    scipy.interpolate.interp1d(result[-2], np.log10(np.where(result[-1]==0,1.2e-7,result[-1])), kind='linear')
    for result in randforest_opt_results_nosepbg]
rf_optimized_tpr = [result[-4] for result in randforest_opt_results_nosepbg]
rf_optimized_fpr = [result[-3] if result[-3] > 0 else 1.2e-7 for result in randforest_opt_results_nosepbg]
zp_cs.max_mass_sens_versus_tpr_fpr(plot=True, sig=5, lumi=3000, tpr_bounds=tpr_bounds, fpr_bounds=fpr_bounds, res=1000, 
                                   cvalues=cvalues, clabels=clabels, manual=manual);
with plt.rc_context(get_settings()):
    for result, f, sample_tprs, mass, color in zip(randforest_opt_results_nosepbg, rf_tpr_fpr_fs, rf_sample_tprs, masses, colors):
        plt.plot(sample_tprs, f(sample_tprs), c=plt.cm.YlOrRd(color), label=r"$m_{Z'} =" + f'{mass}$ GeV')
    plt.scatter(rf_default_tprs, np.log10(rf_default_fprs), c='white', label='Default', zorder=10, marker='X')
    plt.scatter(rf_optimized_tpr, np.log10(rf_optimized_fpr), c='white', label='Optimized', zorder=10, marker='*')
    plt.legend(bbox_to_anchor=(1.35,1), loc="upper left", facecolor='lightgrey')
    plt.xlim(*tpr_bounds);
    plt.ylim(*np.log10(fpr_bounds));
    plt.savefig('./images/full_example/rand_forest_tpr_fpr', dpi=200, bbox_inches='tight')

Gradient Boosting

Finally, we have the gradient boosting models.

xgb_default_tprs = [model.tpr(data_row[1][1], preds=data_row[1][0]) for model, data_row in zip(xgradboost_models, data)]
xgb_default_fprs = [model.fpr(data_row[1][1], preds=data_row[1][0]) for model, data_row in zip(xgradboost_models, data)]
xgb_sample_tprs = [
    np.linspace(result[-2][0] + 1e-5, result[-2][-1] - 1e-5, 200) 
    for result in xgradboost_opt_results_nosepbg]
xgb_tpr_fpr_fs = [
    scipy.interpolate.interp1d(result[-2], np.log10(np.where(result[-1]==0,1.2e-7,result[-1])), kind='linear')
    for result in xgradboost_opt_results_nosepbg]
xgb_optimized_tpr = [result[-4] for result in xgradboost_opt_results_nosepbg]
xgb_optimized_fpr = [result[-3] if result[-3] > 0 else 1.2e-7 for result in xgradboost_opt_results_nosepbg]
zp_cs.max_mass_sens_versus_tpr_fpr(plot=True, sig=5, lumi=3000, tpr_bounds=tpr_bounds, fpr_bounds=fpr_bounds, res=1000, 
                                   cvalues=cvalues, clabels=clabels, manual=manual);
with plt.rc_context(get_settings()):
    for result, f, sample_tprs, mass, color in zip(xgradboost_opt_results_nosepbg, xgb_tpr_fpr_fs, xgb_sample_tprs, masses, colors):
        plt.plot(sample_tprs, f(sample_tprs), c=plt.cm.YlOrRd(color), label=r"$m_{Z'} =" + f'{mass}$ GeV')
    plt.scatter(xgb_default_tprs, np.log10(xgb_default_fprs), c='white', label='Default', zorder=10, marker='X')
    plt.scatter(xgb_optimized_tpr, np.log10(xgb_optimized_fpr), c='white', label='Optimized', zorder=10, marker='*')
    plt.legend(bbox_to_anchor=(1.35,1), loc="upper left", facecolor='lightgrey')
    plt.xlim(*tpr_bounds);
    plt.ylim(*np.log10(fpr_bounds));
    plt.yticks(list(plt.yticks()[0]) + [np.log10(1.2e-7)], list(plt.yticks()[1]) + ['0'])
    plt.savefig('./images/full_example/grad_boost_tpr_fpr', dpi=200, bbox_inches='tight')

CSV Files

Statistical analyses of real data at experiments like CMS and ATLAS sometimes make use of so-called "shape-based analyses," in which histograms of events are processed bin by bin in a search for signal significance. What follows is some code I produced (using the built-in bcml_model.predict_hist function) to extract and save, via the pandas library, the bin contents of histograms in which the x-axis variable is the signal probability predicted by each of our models.

logreg_models = [logreg_m350G_model, logreg_m500G_model, logreg_m1T_model, logreg_m2T_model, logreg_m4T_model]
xgradboost_models = [
    xgradboost_m350G_model, xgradboost_m500G_model, xgradboost_m1T_model, xgradboost_m2T_model, xgradboost_m4T_model]
background_names = ['Higgs', '4 Tops', 'No Higgs']
labels = ['Signal'] + [f'Background ({name})' for name in background_names]
all_names = [["Logistic Regression (m_Z' = {}) - {}".format(m, label) for label in labels] for m in masses] + \
    [["Random Forest (m_Z' = {}) - {}".format(m, label) for label in labels] for m in masses] + \
    [["Gradient Boosting (m_Z' = {}) - {}".format(m, label) for label in labels] for m in masses]
num_binss = [50, 100, 250]
dfs = [pd.DataFrame() for n in num_binss]
conv = 10**15 / 10**12 # conv * lumi (in fb^{-1}) * cross sec (in pb) = # of events
lumi = 3000
signal_yields = [conv * lumi * sig_cs for sig_cs in sig_css]
background_yields = [conv * lumi * bg_cs for bg_cs in bg_css]
for num_bins, df in zip(num_binss, dfs):
    for model, signal_yield, names in zip(logreg_models + randforest_models + xgradboost_models, 
                                         signal_yields * 3, 
                                         all_names):
        sig_name = names[0]
        bg_names = names[1:]
        results = model.predict_hist(
            num_bins=num_bins, sepbg=True, sig_norm=signal_yield, bg_norm=background_yields)
        bin_edges = results[0]
        sig_bins = results[1]
        bg_binss = results[2:]
        df[sig_name] = sig_bins
        for bg_name, bg_bins in zip(bg_names, bg_binss):
            df[bg_name] = bg_bins
        df['Bin Edges'] = pd.Series(bin_edges)
    df.to_csv('./csvs/hist_data_{}bins.csv'.format(num_bins))
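
As a quick sanity check on the output, the saved files can be read straight back with pandas; the column names are exactly the strings built in all_names above (here the 350 GeV logistic regression signal column is shown as an example).

import pandas as pd

hist_df = pd.read_csv('./csvs/hist_data_50bins.csv', index_col=0)
sig_col = "Logistic Regression (m_Z' = 350) - Signal"
print(hist_df[[sig_col, 'Bin Edges']].head())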

Smeared Data

Our work thus far has been rather idealistic. This is true for a variety of reasons, but to cite a specific example, we have effectively assumed perfect experimental detectors: we are training our models on data which coincides exactly with the kinematic behavior of (simulated) particles in events. In real life, detectors have associated uncertainties, which manifest themselves as "noised," "smeared," or "perturbed" data. We can gauge how well our models would do in this more realistic situation by perturbing each of our data points, which we do now.

Data Preparation

perturb = 0.1
from sklearn.model_selection import train_test_split
smeared_data = [
    train_test_split(
        np.concatenate([sig_data] + [bg_data[:int(len(sig_data)/3)] for bg_data in background_data]), 
        test_size=0.25, random_state=42) 
    for sig_data in signal_data]
# Split each array into predictors (all columns but the last) and the labels carried in the last
# column: binary labels (1 = signal, 0 = background) and the full labels (signal = 1, backgrounds = negative integers).
smeared_data = [[[train[:,:-1], np.where(train[:,-1]==1,np.ones_like(train[:,-1]),np.zeros_like(train[:,-1])), train[:,-1]], 
                 [test[:,:-1], np.where(test[:,-1]==1,np.ones_like(test[:,-1]),np.zeros_like(test[:,-1])), test[:,-1]]] 
                 for train, test in smeared_data]
time_before = getTime(form='m')
# Smear every training and testing event: draw from a multivariate normal centered on the event,
# with diagonal covariance perturb * |feature| (i.e., independent Gaussian noise per feature).
for i, elem in enumerate(smeared_data):
    smeared_data[i][0][0] = np.reshape(np.concatenate(
        [np.random.multivariate_normal(row, np.diag(np.abs(perturb * row))) for row in elem[0][0]], axis=0), (-1,47))
    smeared_data[i][1][0] = np.reshape(np.concatenate(
        [np.random.multivariate_normal(row, np.diag(np.abs(perturb * row))) for row in elem[1][0]], axis=0), (-1,47))
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")

Logistic Regression

logreg_m350G_smeared_model = bcml_model(make_pipeline(StandardScaler(), LogisticRegression()))
logreg_m350G_smeared_model.fit(*smeared_data[0][0][:2])
logreg_m500G_smeared_model = bcml_model(make_pipeline(StandardScaler(), LogisticRegression()))
logreg_m500G_smeared_model.fit(*smeared_data[1][0][:2])
logreg_m1T_smeared_model = bcml_model(make_pipeline(StandardScaler(), LogisticRegression()))
logreg_m1T_smeared_model.fit(*smeared_data[2][0][:2])
logreg_m2T_smeared_model = bcml_model(make_pipeline(StandardScaler(), LogisticRegression()))
logreg_m2T_smeared_model.fit(*smeared_data[3][0][:2])
logreg_m4T_smeared_model = bcml_model(make_pipeline(StandardScaler(), LogisticRegression(max_iter=500)))
logreg_m4T_smeared_model.fit(*smeared_data[4][0][:2])
logreg_smeared_models = [logreg_m350G_smeared_model, logreg_m500G_smeared_model, logreg_m1T_smeared_model, 
                         logreg_m2T_smeared_model, logreg_m4T_smeared_model]
print(smeared_data[0][0][0].shape, smeared_data[0][1][0].shape)
(1499999, 47) (500000, 47)
time_before = getTime()
logreg_smeared_sigs_nosepbg = [model.significance(signal_yield, background_yield, 
                                          tpr=model.tpr(smeared_data[i][1][1], preds=smeared_data[i][1][0]), 
                                          fpr=model.fpr(smeared_data[i][1][1], preds=smeared_data[i][1][0]),
                                          sepbg=False) 
                       for i, (model, signal_yield) in enumerate(zip(logreg_smeared_models, signal_yields))]
for sig, mass in zip(logreg_smeared_sigs_nosepbg, masses):
    print(f"Z' mass = {mass} GeV --> logistic regression gives significance of {round(sig, 3)} sigma",
          "w/o separated backgrounds (smeared data)")
time_after = getTime()
print(f"(Runtime: {time_after - time_before} seconds)")
Z' mass = 350 GeV --> logistic regression gives significance of 15.96 sigma w/o separated backgrounds (smeared data)
Z' mass = 500 GeV --> logistic regression gives significance of 8.38 sigma w/o separated backgrounds (smeared data)
Z' mass = 1000 GeV --> logistic regression gives significance of 1.99 sigma w/o separated backgrounds (smeared data)
Z' mass = 2000 GeV --> logistic regression gives significance of 0.273 sigma w/o separated backgrounds (smeared data)
Z' mass = 4000 GeV --> logistic regression gives significance of 0.006 sigma w/o separated backgrounds (smeared data)
(Runtime: 32 seconds)
time_before = getTime()
logreg_smeared_sigs_sepbg = [model.significance(signal_yield, background_yields, 
                                          tpr=model.tpr(np.ones((len(indx[i][0]),)), preds=smeared_data[i][1][0][indx[i][0]]), 
                                          fpr=[model.fpr(np.zeros((len(indx[i][j]),)), preds=smeared_data[i][1][0][indx[i][j]]) 
                                               for j in range(1,4)],
                                          sepbg=True)
                       for i, (model, signal_yield) in enumerate(zip(logreg_smeared_models, signal_yields))]
for sig, mass in zip(logreg_smeared_sigs_sepbg, masses):
    print(f"Z' mass = {mass} GeV --> logistic regression gives significance of {round(sig, 3)} sigma",
          "w/ separated backgrounds (smeared data)")
time_after = getTime()
print(f"(Runtime: {time_after - time_before} seconds)")
Z' mass = 350 GeV --> logistic regression gives significance of 19.799 sigma w/ separated backgrounds (smeared data)
Z' mass = 500 GeV --> logistic regression gives significance of 9.473 sigma w/ separated backgrounds (smeared data)
Z' mass = 1000 GeV --> logistic regression gives significance of 1.982 sigma w/ separated backgrounds (smeared data)
Z' mass = 2000 GeV --> logistic regression gives significance of 0.255 sigma w/ separated backgrounds (smeared data)
Z' mass = 4000 GeV --> logistic regression gives significance of 0.006 sigma w/ separated backgrounds (smeared data)
(Runtime: 14 seconds)
time_before = getTime(form='m')
logreg_smeared_opt_results_nosepbg = [model.best_threshold(signal_yield, background_yield, smeared_data[i][1][0], smeared_data[i][1][1],
                                                           sepbg=False) 
                              for i, (model, signal_yield) in enumerate(zip(logreg_smeared_models, signal_yields))]
logreg_smeared_opt_sigs_nosepbg = [result[1] for result in logreg_smeared_opt_results_nosepbg]
for sig, mass in zip(logreg_smeared_opt_sigs_nosepbg, masses):
    print(f"Z' mass = {mass} GeV --> logistic regression gives optimized sign. of {round(sig, 3)} sigma",
          "w/o separated backgrounds (smeared data)")
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")
Z' mass = 350 GeV --> logistic regression gives optimized sign. of 17.416 sigma w/o separated backgrounds (smeared data)
Z' mass = 500 GeV --> logistic regression gives optimized sign. of 10.125 sigma w/o separated backgrounds (smeared data)
Z' mass = 1000 GeV --> logistic regression gives optimized sign. of 3.544 sigma w/o separated backgrounds (smeared data)
Z' mass = 2000 GeV --> logistic regression gives optimized sign. of 2.376 sigma w/o separated backgrounds (smeared data)
Z' mass = 4000 GeV --> logistic regression gives optimized sign. of 1.004 sigma w/o separated backgrounds (smeared data)
(Runtime: 4.65 minutes)
time_before = getTime(form='m')
logreg_smeared_opt_results_sepbg = [model.best_threshold(signal_yield, background_yields, smeared_data[i][1][0], smeared_data[i][1][2],
                                                         sepbg=True)
                              for i, (model, signal_yield) in enumerate(zip(logreg_smeared_models, signal_yields))]
logreg_smeared_opt_sigs_sepbg = [result[1] for result in logreg_smeared_opt_results_sepbg]
for sig, mass in zip(logreg_smeared_opt_sigs_sepbg, masses):
    print(f"Z' mass = {mass} GeV --> logistic regression gives optimized sign. of {round(sig, 3)} sigma",
          "w/ separated backgrounds (smeared data)")
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")
Z' mass = 350 GeV --> logistic regression gives optimized sign. of 22.057 sigma w/ separated backgrounds (smeared data)
Z' mass = 500 GeV --> logistic regression gives optimized sign. of 11.495 sigma w/ separated backgrounds (smeared data)
Z' mass = 1000 GeV --> logistic regression gives optimized sign. of 3.457 sigma w/ separated backgrounds (smeared data)
Z' mass = 2000 GeV --> logistic regression gives optimized sign. of 3.315 sigma w/ separated backgrounds (smeared data)
Z' mass = 4000 GeV --> logistic regression gives optimized sign. of 1.004 sigma w/ separated backgrounds (smeared data)
(Runtime: 4.833 minutes)

Random Forests

time_before = getTime(form='m')
randforest_m350G_smeared_model = bcml_model(
    make_pipeline(StandardScaler(), RandomForestClassifier(n_estimators=250, n_jobs=-1)));
randforest_m350G_smeared_model.fit(*smeared_data[0][0][:2])
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")
(Runtime: 36.6 minutes)
time_before = getTime(form='m')
randforest_m500G_smeared_model = bcml_model(
    make_pipeline(StandardScaler(), RandomForestClassifier(n_estimators=250, n_jobs=-1)));
randforest_m500G_smeared_model.fit(*smeared_data[1][0][:2])
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")
(Runtime: 40.217 minutes)
time_before = getTime(form='m')
randforest_m1T_smeared_model = bcml_model(
    make_pipeline(StandardScaler(), RandomForestClassifier(n_estimators=250, n_jobs=-1)));
randforest_m1T_smeared_model.fit(*smeared_data[2][0][:2])
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")
(Runtime: 44.366 minutes)
time_before = getTime(form='m')
randforest_m2T_smeared_model = bcml_model(
    make_pipeline(StandardScaler(), RandomForestClassifier(n_estimators=250, n_jobs=-1)));
randforest_m2T_smeared_model.fit(*smeared_data[3][0][:2])
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")
(Runtime: 42.184 minutes)
time_before = getTime(form='m')
randforest_m4T_smeared_model = bcml_model(
    make_pipeline(StandardScaler(), RandomForestClassifier(n_estimators=250, n_jobs=-1)));
randforest_m4T_smeared_model.fit(*smeared_data[4][0][:2])
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")
(Runtime: 38.05 minutes)
randforest_smeared_models = [randforest_m350G_smeared_model, randforest_m500G_smeared_model, randforest_m1T_smeared_model, 
                             randforest_m2T_smeared_model, randforest_m4T_smeared_model]
time_before = getTime()
randforest_smeared_sigs_nosepbg = [model.significance(signal_yield, background_yield, 
                                          tpr=model.tpr(smeared_data[i][1][1], preds=smeared_data[i][1][0]), 
                                          fpr=model.fpr(smeared_data[i][1][1], preds=smeared_data[i][1][0]),
                                          sepbg=False) 
                       for i, (model, signal_yield) in enumerate(zip(randforest_smeared_models, signal_yields))]
for sig, mass in zip(randforest_smeared_sigs_nosepbg, masses):
    print(f"Z' mass = {mass} GeV --> random forests gives significance of {round(sig, 3)} sigma",
          "w/o separated backgrounds (smeared data)")
time_after = getTime()
print(f"(Runtime: {time_after - time_before} seconds)")
Z' mass = 350 GeV --> random forests gives significance of 23.203 sigma w/o separated backgrounds (smeared data)
Z' mass = 500 GeV --> random forests gives significance of 12.314 sigma w/o separated backgrounds (smeared data)
Z' mass = 1000 GeV --> random forests gives significance of 2.962 sigma w/o separated backgrounds (smeared data)
Z' mass = 2000 GeV --> random forests gives significance of 0.355 sigma w/o separated backgrounds (smeared data)
Z' mass = 4000 GeV --> random forests gives significance of 0.008 sigma w/o separated backgrounds (smeared data)
(Runtime: 217 seconds)
time_before = getTime()
randforest_smeared_sigs_sepbg = [model.significance(signal_yield, background_yields, 
                                          tpr=model.tpr(np.ones((len(indx[i][0]),)), preds=smeared_data[i][1][0][indx[i][0]]), 
                                          fpr=[model.fpr(np.zeros((len(indx[i][j]),)), preds=smeared_data[i][1][0][indx[i][j]]) 
                                               for j in range(1,4)],
                                          sepbg=True)
                       for i, (model, signal_yield) in enumerate(zip(randforest_smeared_models, signal_yields))]
for sig, mass in zip(randforest_smeared_sigs_sepbg, masses):
    print(f"Z' mass = {mass} GeV --> random forests gives significance of {round(sig, 3)} sigma",
          "w/ separated backgrounds (smeared data)")
time_after = getTime()
print(f"(Runtime: {time_after - time_before} seconds)")
Z' mass = 350 GeV --> random forests gives significance of 26.641 sigma w/ separated backgrounds (smeared data)
Z' mass = 500 GeV --> random forests gives significance of 13.436 sigma w/ separated backgrounds (smeared data)
Z' mass = 1000 GeV --> random forests gives significance of 3.0 sigma w/ separated backgrounds (smeared data)
Z' mass = 2000 GeV --> random forests gives significance of 0.353 sigma w/ separated backgrounds (smeared data)
Z' mass = 4000 GeV --> random forests gives significance of 0.007 sigma w/ separated backgrounds (smeared data)
(Runtime: 100 seconds)
time_before = getTime(form='m')
randforest_smeared_opt_results_nosepbg = [model.best_threshold(signal_yield, background_yield, smeared_data[i][1][0], smeared_data[i][1][1],
                                                           sepbg=False) 
                              for i, (model, signal_yield) in enumerate(zip(randforest_smeared_models, signal_yields))]
randforest_smeared_opt_sigs_nosepbg = [result[1] for result in randforest_smeared_opt_results_nosepbg]
for sig, mass in zip(randforest_smeared_opt_sigs_nosepbg, masses):
    print(f"Z' mass = {mass} GeV --> random forests gives optimized sign. of {round(sig, 3)} sigma",
          "w/o separated backgrounds (smeared data)")
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")
Z' mass = 350 GeV --> random forests gives optimized sign. of 55.96 sigma w/o separated backgrounds (smeared data)
Z' mass = 500 GeV --> random forests gives optimized sign. of 23.886 sigma w/o separated backgrounds (smeared data)
Z' mass = 1000 GeV --> random forests gives optimized sign. of 13.863 sigma w/o separated backgrounds (smeared data)
Z' mass = 2000 GeV --> random forests gives optimized sign. of 3.714 sigma w/o separated backgrounds (smeared data)
Z' mass = 4000 GeV --> random forests gives optimized sign. of 0.994 sigma w/o separated backgrounds (smeared data)
(Runtime: 7.066 minutes)
time_before = getTime(form='m')
randforest_smeared_opt_results_sepbg = [model.best_threshold(signal_yield, background_yields, smeared_data[i][1][0], smeared_data[i][1][2],
                                                         sepbg=True)
                              for i, (model, signal_yield) in enumerate(zip(randforest_smeared_models, signal_yields))]
randforest_smeared_opt_sigs_sepbg = [result[1] for result in randforest_smeared_opt_results_sepbg]
for sig, mass in zip(randforest_smeared_opt_sigs_sepbg, masses):
    print(f"Z' mass = {mass} GeV --> random forests gives optimized sign. of {round(sig, 3)} sigma",
          "w/ separated backgrounds (smeared data)")
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")
Z' mass = 350 GeV --> random forests gives optimized sign. of 51.271 sigma w/ separated backgrounds (smeared data)
Z' mass = 500 GeV --> random forests gives optimized sign. of 21.723 sigma w/ separated backgrounds (smeared data)
Z' mass = 1000 GeV --> random forests gives optimized sign. of 13.439 sigma w/ separated backgrounds (smeared data)
Z' mass = 2000 GeV --> random forests gives optimized sign. of 3.199 sigma w/ separated backgrounds (smeared data)
Z' mass = 4000 GeV --> random forests gives optimized sign. of 0.994 sigma w/ separated backgrounds (smeared data)
(Runtime: 6.684 minutes)

Gradient Boosting

time_before = getTime(form='m')
xgradboost_m350G_smeared_model = bcml_model(
    make_pipeline(StandardScaler(), 
                  XGBClassifier(n_estimators=250, max_depth=7, learning_rate=0.1, nthread=4, use_label_encoder=False)));
xgradboost_m350G_smeared_model.fit(*smeared_data[0][0][:2])
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")
[00:53:02] WARNING: /opt/concourse/worker/volumes/live/7a2b9f41-3287-451b-6691-43e9a6c0910f/volume/xgboost-split_1619728204606/work/src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
(Runtime: 44.933 minutes)
time_before = getTime(form='m')
xgradboost_m500G_smeared_model = bcml_model(
    make_pipeline(StandardScaler(), 
                  XGBClassifier(n_estimators=250, max_depth=7, learning_rate=0.1, nthread=4, use_label_encoder=False)));
xgradboost_m500G_smeared_model.fit(*smeared_data[1][0][:2])
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")
[01:37:54] WARNING: /opt/concourse/worker/volumes/live/7a2b9f41-3287-451b-6691-43e9a6c0910f/volume/xgboost-split_1619728204606/work/src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
(Runtime: 45.933 minutes)
time_before = getTime(form='m')
xgradboost_m1T_smeared_model = bcml_model(
    make_pipeline(StandardScaler(), 
                  XGBClassifier(n_estimators=250, max_depth=7, learning_rate=0.1, nthread=4, use_label_encoder=False)));
xgradboost_m1T_smeared_model.fit(*smeared_data[2][0][:2])
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")
[02:23:47] WARNING: /opt/concourse/worker/volumes/live/7a2b9f41-3287-451b-6691-43e9a6c0910f/volume/xgboost-split_1619728204606/work/src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
(Runtime: 45.45 minutes)
time_before = getTime(form='m')
xgradboost_m2T_smeared_model = bcml_model(
    make_pipeline(StandardScaler(), 
                  XGBClassifier(n_estimators=250, max_depth=7, learning_rate=0.1, nthread=4, use_label_encoder=False)));
xgradboost_m2T_smeared_model.fit(*smeared_data[3][0][:2])
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")
[03:09:13] WARNING: /opt/concourse/worker/volumes/live/7a2b9f41-3287-451b-6691-43e9a6c0910f/volume/xgboost-split_1619728204606/work/src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
(Runtime: 48.934 minutes)
time_before = getTime(form='m')
xgradboost_m4T_smeared_model = bcml_model(
    make_pipeline(StandardScaler(), 
                  XGBClassifier(n_estimators=250, max_depth=7, learning_rate=0.1, nthread=4, use_label_encoder=False)));
xgradboost_m4T_smeared_model.fit(*smeared_data[4][0][:2])
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")
[03:58:13] WARNING: /opt/concourse/worker/volumes/live/7a2b9f41-3287-451b-6691-43e9a6c0910f/volume/xgboost-split_1619728204606/work/src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
(Runtime: 42.3 minutes)
xgradboost_smeared_models = [xgradboost_m350G_smeared_model, xgradboost_m500G_smeared_model, xgradboost_m1T_smeared_model, 
                             xgradboost_m2T_smeared_model, xgradboost_m4T_smeared_model]
time_before = getTime()
xgradboost_smeared_sigs_nosepbg = [model.significance(signal_yield, background_yield, 
                                          tpr=model.tpr(smeared_data[i][1][1], preds=smeared_data[i][1][0]), 
                                          fpr=model.fpr(smeared_data[i][1][1], preds=smeared_data[i][1][0]),
                                          sepbg=False) 
                       for i, (model, signal_yield) in enumerate(zip(xgradboost_smeared_models, signal_yields))]
for sig, mass in zip(xgradboost_smeared_sigs_nosepbg, masses):
    print(f"Z' mass = {mass} GeV --> gradient boosting gives significance of {round(sig, 3)} sigma",
          "w/o separated backgrounds (smeared data)")
time_after = getTime()
print(f"(Runtime: {time_after - time_before} seconds)")
Z' mass = 350 GeV --> gradient boosting gives significance of 27.652 sigma w/o separated backgrounds (smeared data)
Z' mass = 500 GeV --> gradient boosting gives significance of 15.171 sigma w/o separated backgrounds (smeared data)
Z' mass = 1000 GeV --> gradient boosting gives significance of 3.752 sigma w/o separated backgrounds (smeared data)
Z' mass = 2000 GeV --> gradient boosting gives significance of 0.465 sigma w/o separated backgrounds (smeared data)
Z' mass = 4000 GeV --> gradient boosting gives significance of 0.007 sigma w/o separated backgrounds (smeared data)
(Runtime: 75 seconds)
time_before = getTime()
xgradboost_smeared_sigs_sepbg = [model.significance(signal_yield, background_yields, 
                                          tpr=model.tpr(np.ones((len(indx[i][0]),)), preds=smeared_data[i][1][0][indx[i][0]]), 
                                          fpr=[model.fpr(np.zeros((len(indx[i][j]),)), preds=smeared_data[i][1][0][indx[i][j]]) 
                                               for j in range(1,4)],
                                          sepbg=True)
                       for i, (model, signal_yield) in enumerate(zip(xgradboost_smeared_models, signal_yields))]
for sig, mass in zip(xgradboost_smeared_sigs_sepbg, masses):
    print(f"Z' mass = {mass} GeV --> gradient boosting gives significance of {round(sig, 3)} sigma",
          "w/ separated backgrounds (smeared data)")
time_after = getTime()
print(f"(Runtime: {time_after - time_before} seconds)")
Z' mass = 350 GeV --> gradient boosting gives significance of 28.297 sigma w/ separated backgrounds (smeared data)
Z' mass = 500 GeV --> gradient boosting gives significance of 14.964 sigma w/ separated backgrounds (smeared data)
Z' mass = 1000 GeV --> gradient boosting gives significance of 3.487 sigma w/ separated backgrounds (smeared data)
Z' mass = 2000 GeV --> gradient boosting gives significance of 0.374 sigma w/ separated backgrounds (smeared data)
Z' mass = 4000 GeV --> gradient boosting gives significance of 0.006 sigma w/ separated backgrounds (smeared data)
(Runtime: 26 seconds)
time_before = getTime(form='m')
xgradboost_smeared_opt_results_nosepbg = [model.best_threshold(signal_yield, background_yield, smeared_data[i][1][0], smeared_data[i][1][1],
                                                           sepbg=False) 
                              for i, (model, signal_yield) in enumerate(zip(xgradboost_smeared_models, signal_yields))]
xgradboost_smeared_opt_sigs_nosepbg = [result[1] for result in xgradboost_smeared_opt_results_nosepbg]
for sig, mass in zip(xgradboost_smeared_opt_sigs_nosepbg, masses):
    print(f"Z' mass = {mass} GeV --> gradient boosting gives optimized sign. of {round(sig, 3)} sigma",
          "w/o separated backgrounds (smeared data)")
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")
Z' mass = 350 GeV --> gradient boosting gives optimized sign. of 65.867 sigma w/o separated backgrounds (smeared data)
Z' mass = 500 GeV --> gradient boosting gives optimized sign. of 45.621 sigma w/o separated backgrounds (smeared data)
Z' mass = 1000 GeV --> gradient boosting gives optimized sign. of 20.097 sigma w/o separated backgrounds (smeared data)
Z' mass = 2000 GeV --> gradient boosting gives optimized sign. of 7.049 sigma w/o separated backgrounds (smeared data)
Z' mass = 4000 GeV --> gradient boosting gives optimized sign. of 1.022 sigma w/o separated backgrounds (smeared data)
(Runtime: 6.434 minutes)
time_before = getTime(form='m')
xgradboost_smeared_opt_results_sepbg = [model.best_threshold(signal_yield, background_yields, smeared_data[i][1][0], smeared_data[i][1][2],
                                                         sepbg=True)
                              for i, (model, signal_yield) in enumerate(zip(xgradboost_smeared_models, signal_yields))]
xgradboost_smeared_opt_sigs_sepbg = [result[1] for result in xgradboost_smeared_opt_results_sepbg]
for sig, mass in zip(xgradboost_smeared_opt_sigs_sepbg, masses):
    print(f"Z' mass = {mass} GeV --> gradient boosting gives optimized sign. of {round(sig, 3)} sigma",
          "w/ separated backgrounds (smeared data)")
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")
Z' mass = 350 GeV --> gradient boosting gives optimized sign. of 57.659 sigma w/ separated backgrounds (smeared data)
Z' mass = 500 GeV --> gradient boosting gives optimized sign. of 40.115 sigma w/ separated backgrounds (smeared data)
Z' mass = 1000 GeV --> gradient boosting gives optimized sign. of 15.489 sigma w/ separated backgrounds (smeared data)
Z' mass = 2000 GeV --> gradient boosting gives optimized sign. of 7.017 sigma w/ separated backgrounds (smeared data)
Z' mass = 4000 GeV --> gradient boosting gives optimized sign. of 1.022 sigma w/ separated backgrounds (smeared data)
(Runtime: 7.15 minutes)

Performance Comparison

logreg_smeared_opt_sigs_sepbg_f = scipy.interpolate.interp1d(
    masses, np.log10(logreg_smeared_opt_sigs_sepbg), kind='linear')
randforest_smeared_opt_sigs_sepbg_f = scipy.interpolate.interp1d(
    masses, np.log10(randforest_smeared_opt_sigs_sepbg), kind='linear')
xgradboost_smeared_opt_sigs_sepbg_f = scipy.interpolate.interp1d(
    masses, np.log10(xgradboost_smeared_opt_sigs_sepbg), kind='linear')
import matplotlib.pyplot as plt

with plt.rc_context(get_settings()):
    plt.plot([max_mass for i in range(2)], np.logspace(-1, 2, 2), label=r'Theoretical $5\sigma$ Limit', c='black')
    plt.plot(sample_masses, [5 for m in sample_masses], label=r'5$\sigma$', c='blue')
    plt.plot(sample_masses, [1.645 for m in sample_masses], label=r'90% confidence', c='cornflowerblue')
    plt.plot(
        sample_masses, [10**logreg_smeared_opt_sigs_sepbg_f(m) for m in sample_masses], 
        label='Log. Reg. (Smeared)', 
        c='red', linestyle='--')
    plt.plot(
        sample_masses, [10**randforest_smeared_opt_sigs_sepbg_f(m) for m in sample_masses], 
        label='Rand. Forest (Smeared)', 
        c='lime', linestyle='--')
    plt.plot(
        sample_masses, [10**xgradboost_smeared_opt_sigs_sepbg_f(m) for m in sample_masses], 
        label='Grad. Boost. (Smeared)', 
        c='tab:purple', linestyle='--')
    plt.plot(
        sample_masses, [10**logreg_opt_sigs_sepbg_f(m) for m in sample_masses], 
        label='Logistic Regression', 
        c='red', linestyle=':')
    plt.plot(
        sample_masses, [10**randforest_opt_sigs_sepbg_f(m) for m in sample_masses], 
        label='Random Forests', 
        c='lime', linestyle=':')
    plt.plot(
        sample_masses, [10**xgradboost_opt_sigs_sepbg_f(m) for m in sample_masses], 
        label='Gradient Boosting', 
        c='tab:purple', linestyle=':')
    plt.ylabel('Signal Significance');
    plt.xlabel(r"$Z'$ mass $m_{Z'}$ (GeV)");
    plt.yscale('log')
    plt.legend(bbox_to_anchor=(1.02,1), loc="upper left")
    plt.savefig('./images/full_example/smeared_sigs', dpi=200, bbox_inches='tight')

Histograms

sig_smeared_data = [sd[0][0][np.where(sd[0][1]==1)[0]] for sd in smeared_data]
with plt.rc_context(get_settings()):
    plt.hist(signal_data[0][:,names.index('M b1 b2')], density=True, bins=1000, histtype='step', linewidth=3, label=r"Signal ($m_{Z'} = 350$ GeV)");
    plt.hist(sig_smeared_data[0][:,names.index('M b1 b2')], density=True, bins=1000, histtype='step', linewidth=3, label=r"Signal ($m_{Z'} = 350$ GeV, smeared)");
    plt.xlim(300,400)
#     plt.ylim(10**-5,10**-1.5)
#     plt.legend(bbox_to_anchor=(1.02,1), loc="upper left")
    plt.legend()
    plt.xlabel(r'$m(b_1,b_2)$ (GeV)')
    plt.savefig('./images/full_example/mb1b2_smeared', dpi=200, bbox_inches='tight')
with plt.rc_context(get_settings()):
    plt.hist(signal_data[0][:,names.index('M b1 b3')], density=True, bins=1000, histtype='step', linewidth=3, label=r"Signal ($m_{Z'} = 350$ GeV)");
    plt.hist(sig_smeared_data[0][:,names.index('M b1 b3')], density=True, bins=1000, histtype='step', linewidth=3, label=r"Signal ($m_{Z'} = 350$ GeV, smeared)");
    plt.xlim(200,500)
#     plt.ylim(10**-5,10**-1.5)
    plt.legend(bbox_to_anchor=(1.02,1), loc="upper left")
with plt.rc_context(get_settings()):
    plt.hist(signal_data[0][:,names.index('M j1 j2')], density=True, bins=400, histtype='step', linewidth=3, label=r"Signal ($m_{Z'} = 350$ GeV)");
    plt.hist(sig_smeared_data[0][:,names.index('M j1 j2')], density=True, bins=400, histtype='step', linewidth=3, label=r"Signal ($m_{Z'} = 350$ GeV, smeared)");
    plt.xlim(70,90)
#     plt.ylim(10**-5,10**-1.5)
#     plt.legend(bbox_to_anchor=(1.02,1), loc="upper left")
    plt.legend()
    plt.xlabel(r'$m(j_1,j_2)$ (GeV)')
    plt.savefig('./images/full_example/mj1j2_smeared', dpi=200, bbox_inches='tight')
with plt.rc_context(get_settings()):
    plt.hist(signal_data[0][:,names.index('pT b1')], density=True, bins=400, histtype='step', linewidth=3, label=r"Signal ($m_{Z'} = 350$ GeV)");
    plt.hist(sig_smeared_data[0][:,names.index('pT b1')], density=True, bins=400, histtype='step', linewidth=3, label=r"Signal ($m_{Z'} = 350$ GeV, smeared)");
    plt.xlim(0,500)
    plt.legend(bbox_to_anchor=(1.02,1), loc="upper left")

Comparison with Manual Optimization of Event Selection Criteria

The machine learning approach to optimizing signal significance in a phenomenological problem can be motivated by comparing it with the performance of a more manual approach. The manual_opt file contains functions for optimizing event selection criteria (cuts) by hand, which we explore below and ultimately compare against our machine learning results.

sigss = signal_data
bgss = background_data
names = get47Dfeatures()

First, let's see what signal significance looks like without making any cuts at all.

for mass, signal_yield in zip(masses, signal_yields):
    print(
        f"For Z' mass = {mass}, we have a default signal significance of",
        f"{round(manual_significance(signal_yield, background_yields, 1, [1]*3), 4)}")
For Z' mass = 350, we have a default signal significance of 7.2484
For Z' mass = 500, we have a default signal significance of 2.7579
For Z' mass = 1000, we have a default signal significance of 0.2856
For Z' mass = 2000, we have a default signal significance of 0.0148
For Z' mass = 4000, we have a default signal significance of 0.0003

Now we'll ramp up our methodology and use the opt_sig function, which takes signal and background data, the corresponding yields, and a list of candidate cuts (each given as a feature index, a boolean flag, and a grid of candidate cut values) and scans over cut combinations for the one that maximizes signal significance. A toy sketch of this kind of grid scan is given below, followed by the real calls.
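
For intuition, here is a hypothetical toy version of such a grid scan (this is not the manual_opt implementation; the cut semantics, the helper name toy_opt_sig, and the $s/\sqrt{s+b}$ significance approximation are all illustrative): every combination of candidate lower-bound cuts is tried, and the combination giving the best approximate significance is kept.

import itertools

def toy_opt_sig(sig, bgs, sig_yield, bg_yields, cut_grids):
    # cut_grids: list of (feature_index, candidate lower-bound cut values) pairs.
    best_point, best_sig = None, -np.inf
    for point in itertools.product(*[values for _, values in cut_grids]):
        keep_sig = np.all([sig[:, idx] > cut for (idx, _), cut in zip(cut_grids, point)], axis=0)
        s = sig_yield * keep_sig.mean()
        b = sum(bg_yield * np.all([bg[:, idx] > cut for (idx, _), cut in zip(cut_grids, point)], axis=0).mean()
                for bg, bg_yield in zip(bgs, bg_yields))
        significance = s / np.sqrt(s + b) if s + b > 0 else 0.0
        if significance > best_sig:
            best_point, best_sig = point, significance
    return best_point, best_sig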

time_before = getTime(form='m')
cuts = [[names.index('M b1 b2'), True, np.linspace(0,400,5)],
        [names.index('M b1 b3'), True, np.linspace(0,400,5)],
        [names.index('M j1 j2'), False, np.append(np.linspace(100,400,4), 10000)],
        [names.index('pT b1'), True, np.linspace(0,500,5)]]
point, sig = opt_sig(sigss[0], bgss, signal_yields[0], background_yields, cuts)
print(f"Cutting at {point} gives maximum significance of {sig} (Z' mass = {masses[0]} GeV)")
time_after = getTime(form='m')
print(f"Runtime: {round(time_after - time_before,3)} minutes")
Cutting at (325.0, 225.0, 150.0, 187.5) gives maximum significance of 20.965001032944777 (Z' mass = 350 GeV)
Runtime: 5.75 minutes
time_before = getTime(form='m')
cuts = [[names.index('M b1 b2'), True, np.linspace(200,600,5)],
        [names.index('M b1 b3'), True, np.linspace(0,400,5)],
        [names.index('M j1 j2'), False, np.append(np.linspace(100,400,4), 10000)],
        [names.index('pT b1'), True, np.linspace(0,500,5)]]
point, sig = opt_sig(sigss[1], bgss, signal_yields[1], background_yields, cuts)
print(f"Cutting at {point} gives maximum significance of {sig} (Z' mass = {masses[1]} GeV)")
time_after = getTime(form='m')
print(f"Runtime: {round(time_after - time_before,3)} minutes")
Cutting at (475.0, 250.0, 150.0, 250.0) gives maximum significance of 13.882260799836876 (Z' mass = 500 GeV)
Runtime: 5.783 minutes
time_before = getTime(form='m')
cuts = [[names.index('M b1 b2'), True, np.linspace(600,1200,5)],
        [names.index('M b1 b3'), True, np.linspace(0,400,5)],
        [names.index('M j1 j2'), False, np.append(np.linspace(100,400,4), 10000)],
        [names.index('pT b1'), True, np.linspace(200,600,5)]]
point, sig = opt_sig(sigss[2], bgss, signal_yields[2], background_yields, cuts)
print(f"Cutting at {point} gives maximum significance of {sig} (Z' mass = {masses[2]} GeV)")
time_after = getTime(form='m')
print(f"Runtime: {round(time_after - time_before,3)} minutes")
Cutting at (975.0, 300.0, 150.0, 450.0) gives maximum significance of 6.847931300231083 (Z' mass = 1000 GeV)
Runtime: 8.317 minutes
time_before = getTime(form='m')
cuts = [[names.index('M b1 b2'), True, np.linspace(1200,2200,5)],
        [names.index('M b1 b3'), True, np.linspace(200,600,5)],
        [names.index('M j1 j2'), False, np.append(np.linspace(100,400,4), 10000)],
        [names.index('pT b1'), True, np.linspace(600,1000,5)]]
point, sig = opt_sig(sigss[3], bgss, signal_yields[3], background_yields, cuts)
print(f"Cutting at {point} gives maximum significance of {sig} (Z' mass = {masses[3]} GeV)")
time_after = getTime(form='m')
print(f"Runtime: {round(time_after - time_before,3)} minutes")
Cutting at (1825.0, 375.0, 150.0, 800.0) gives maximum significance of 3.4044571732873736 (Z' mass = 2000 GeV)
Runtime: 6.1 minutes
time_before = getTime(form='m')
cuts = [[names.index('M b1 b2'), True, np.linspace(2000,4000,5)],
        [names.index('M b1 b3'), True, np.linspace(0,50,5)],
        [names.index('M j1 j2'), False, np.append(np.linspace(100,500,4), 10000)],
        [names.index('pT b1'), True, np.linspace(0,300,5)]]
point, sig = opt_sig(sigss[4], bgss, signal_yields[4], background_yields, cuts)
print(f"Cutting at {point} gives maximum significance of {sig} (Z' mass = {masses[4]} GeV)")
time_after = getTime(form='m')
print(f"Runtime: {round(time_after - time_before,3)} minutes")
Cutting at (3500.0, 0.0, 266.6666666666667, 93.75) gives maximum significance of 1.0127201854783396 (Z' mass = 4000 GeV)
Runtime: 5.883 minutes

Having found significances manually, let's compile our results, interpolate, and compare with the performance of our machine learning models.

manual_default_sigs = [
    7.248418610012742, 2.757897361779127, 0.2855952155740879, 0.014759138609928545, 0.000260774522132171]
manual_opt_sigs = [
    20.965001032944777, 13.882260799836876, 6.847931300231083, 3.4044571732873736, 1.0127201854783396]
manual_default_sigs_f = scipy.interpolate.interp1d(masses, np.log10(manual_default_sigs), kind='cubic')
manual_opt_sigs_f = scipy.interpolate.interp1d(masses, np.log10(manual_opt_sigs), kind='cubic')
import matplotlib.pyplot as plt

with plt.rc_context(get_settings()):
    plt.plot([max_mass for i in range(2)], np.logspace(-2.5, 2, 2), label=r'Theoretical $5\sigma$ Limit', c='black')
    plt.plot(sample_masses, [5 for m in sample_masses], label=r'5$\sigma$', c='blue')
    plt.plot(sample_masses, [1.645 for m in sample_masses], label=r'90% confidence', c='cornflowerblue')
    plt.plot(
        sample_masses, [10**logreg_opt_sigs_sepbg_f(m) for m in sample_masses], 
        label='Logistic Regression', 
        c='red')
    plt.plot(
        sample_masses, [10**randforest_opt_sigs_sepbg_f(m) for m in sample_masses], 
        label='Random Forests', 
        c='lime')
    plt.plot(
        sample_masses, [10**xgradboost_opt_sigs_sepbg_f(m) for m in sample_masses], 
        label='Gradient Boosting', 
        c='tab:purple')
    plt.plot(
        sample_masses, [10**manual_default_sigs_f(m) for m in sample_masses], 
        label='Default', 
        c='black', linestyle='-.')
    plt.plot(
        sample_masses, [10**manual_opt_sigs_f(m) for m in sample_masses], 
        label='Manual', 
        c='black')
    plt.ylabel('Signal Significance');
    plt.xlabel(r"$Z'$ mass $m_{Z'}$ (GeV)");
    plt.yscale('log')
    plt.ylim(10**-2, 10**2)
    plt.legend(bbox_to_anchor=(1.02,1), loc="upper left")
    plt.savefig('./images/full_example/all_models_manual_sigs', dpi=200, bbox_inches='tight')

Kinematic Histograms

include_cuts = False
with plt.rc_context(get_settings()):
    plt.hist(signal_data[0][:,names.index('M b1 b2')], density=True, bins=100, histtype='step', log=True, linewidth=3, label=r"Signal ($m_{Z'} = 350$ GeV)");
    plt.hist(signal_data[2][:,names.index('M b1 b2')], density=True, bins=100, histtype='step', log=True, linewidth=3, label=r"Signal ($m_{Z'} = 1$ TeV)");
    for bg_data, name in zip(background_data, filename_bgs):
        plt.hist(bg_data[:,names.index('M b1 b2')], density=True, bins=100, histtype='step', log=True, linewidth=3, label=f"Background ({name})");
    if include_cuts:
        plt.plot([325]*2, [10**-5, 10**3], label=r"Cut ($m_{Z'} = 350$ GeV)", c='tab:blue')
        plt.plot([975]*2, [10**-5, 10**3], label=r"Cut ($m_{Z'} = 1$ TeV)", c='tab:orange')
    plt.xlim(0,1300)
    plt.ylim(10**-5,10**-1.5)
    plt.legend(bbox_to_anchor=(1.02,1), loc="upper left")
    plt.xlabel(r'$m(b_1, b_2)$ (GeV)')
    if include_cuts:
        plt.savefig('./images/full_example/mb1b2_with_cuts', dpi=200, bbox_inches='tight')
    else:
        plt.savefig('./images/full_example/mb1b2', dpi=200, bbox_inches='tight')
include_cuts = False
with plt.rc_context(get_settings()):
    plt.hist(signal_data[0][:,names.index('M b1 b3')], density=True, bins=100, histtype='step', log=True, linewidth=3, label=r"Signal ($m_{Z'} = 350$ GeV)");
    plt.hist(signal_data[2][:,names.index('M b1 b3')], density=True, bins=100, histtype='step', log=True, linewidth=3, label=r"Signal ($m_{Z'} = 1$ TeV)");
    for bg_data, name in zip(background_data, filename_bgs):
        plt.hist(bg_data[:,names.index('M b1 b3')], density=True, bins=100, histtype='step', log=True, linewidth=3, label=f"Background ({name})");
    if include_cuts:
        plt.plot([225]*2, [10**-5, 10**3], label=r"Cut ($m_{Z'} = 350$ GeV)", c='tab:blue')
        plt.plot([300]*2, [10**-5, 10**3], label=r"Cut ($m_{Z'} = 1$ TeV)", c='tab:orange')
    plt.xlim(0,1300)
    plt.ylim(10**-5,10**-1.5)
    plt.legend(bbox_to_anchor=(1.02,1), loc="upper left")
    plt.xlabel(r'$m(b_1, b_3)$ (GeV)')
    if include_cuts:
        plt.savefig('./images/full_example/mb1b3_with_cuts', dpi=200, bbox_inches='tight')
    else:
        plt.savefig('./images/full_example/mb1b3', dpi=200, bbox_inches='tight')
include_cuts = False
with plt.rc_context(get_settings()):
    plt.hist(signal_data[0][:,names.index('M j1 j2')], density=True, bins=100, histtype='step', log=True, linewidth=3, label=r"Signal ($m_{Z'} = 350$ GeV)");
    plt.hist(signal_data[2][:,names.index('M j1 j2')], density=True, bins=100, histtype='step', log=True, linewidth=3, label=r"Signal ($m_{Z'} = 1$ TeV)");
    for bg_data, name in zip(background_data, filename_bgs):
        bins = 100 if name != '4t' else 1000
        plt.hist(bg_data[:,names.index('M j1 j2')], density=True, bins=bins, histtype='step', log=True, linewidth=3, label=f"Background ({name})");
    if include_cuts:
        plt.plot([150]*2, [10**-5, 10**3], label=r"Cut ($m_{Z'} = 350$ GeV, $1$ TeV)", c='tab:blue')
    plt.xlim(0,200)
    plt.ylim(10**-4,10**-0.25)
    plt.legend(bbox_to_anchor=(1.02,1), loc="upper left")
    plt.xlabel(r'$m(j_1, j_2)$ (GeV)')
    if include_cuts:
        plt.savefig('./images/full_example/mj1j2_with_cuts', dpi=200, bbox_inches='tight')
    else:
        plt.savefig('./images/full_example/mj1j2', dpi=200, bbox_inches='tight')
include_cuts = False
with plt.rc_context(get_settings()):
    plt.hist(signal_data[0][:,names.index('pT b1')], density=True, bins=100, histtype='step', log=True, linewidth=3, label=r"Signal ($m_{Z'} = 350$ GeV)");
    plt.hist(signal_data[2][:,names.index('pT b1')], density=True, bins=100, histtype='step', log=True, linewidth=3, label=r"Signal ($m_{Z'} = 1$ TeV)");
    for bg_data, name in zip(background_data, filename_bgs):
        plt.hist(bg_data[:,names.index('pT b1')], density=True, bins=100, histtype='step', log=True, linewidth=3, label=f"Background ({name})");
    if include_cuts:
        plt.plot([187]*2, [10**-5, 10**3], label=r"Cut ($m_{Z'} = 350$ GeV)", c='tab:blue')
        plt.plot([450]*2, [10**-5, 10**3], label=r"Cut ($m_{Z'} = 1$ TeV)", c='tab:orange')
    plt.xlim(0,800)
    plt.ylim(10**-5,10**-1.5)
    plt.legend(bbox_to_anchor=(1.02,1), loc="upper left")
    plt.xlabel(r'$p_T(b_1)$ (GeV)')
    if include_cuts:
        plt.savefig('./images/full_example/ptb1_with_cuts', dpi=200, bbox_inches='tight')
    else:
        plt.savefig('./images/full_example/ptb1', dpi=200, bbox_inches='tight')

Z' Mass and Coupling Sensitivity Heatmap

xgradboost_m350G_model = refresh_model(xgradboost_m350G_model)
xgradboost_m500G_model = refresh_model(xgradboost_m500G_model)
xgradboost_m1T_model = refresh_model(xgradboost_m1T_model)
xgradboost_m2T_model = refresh_model(xgradboost_m2T_model)
xgradboost_m4T_model = refresh_model(xgradboost_m4T_model)
xgradboost_models = [xgradboost_m350G_model, xgradboost_m500G_model, xgradboost_m1T_model, xgradboost_m2T_model, xgradboost_m4T_model]
time_before = getTime(form='m')
couplings = np.linspace(1,10,5)
full_opt_results = [[
    model.best_threshold(signal_yield * coupling**2, background_yields, data[i][1][0], data[i][1][2], sepbg=True) 
    for i, (model, signal_yield) in enumerate(zip(xgradboost_models, signal_yields))] for coupling in couplings]
full_opt_sigs = np.array([[result[1] for result in row] for row in full_opt_results])
print('masses:', masses)
print()
print('couplings:', np.linspace(1,10,5))
print()
print(full_opt_sigs)
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")
masses: [350, 500, 1000, 2000, 4000]

couplings: [ 1.    3.25  5.5   7.75 10.  ]

[[6.79628544e+01 4.66200955e+01 1.70702202e+01 7.14800670e+00
  1.02139667e+00]
 [3.92786635e+02 2.54255973e+02 9.14069818e+01 2.39198991e+01
  3.31960094e+00]
 [7.68799223e+02 4.87387318e+02 1.67096270e+02 3.91419018e+01
  5.61789071e+00]
 [1.15395522e+03 7.22102793e+02 2.41307488e+02 5.71102924e+01
  7.91634778e+00]
 [1.54278338e+03 9.62286341e+02 3.17984806e+02 7.48541743e+01
  1.02150223e+01]]
(Runtime: 33.1 minutes)
masses, sig_css, bg_css = get_elijah_ttbarzp_cs()
zp_cs_objects = [cross_section_helper(masses, np.array(sig_css) * coupling**2, bg_css, mass_units='GeV') for coupling in couplings]
masses = [350, 500, 1000, 2000, 4000]
max_masses = [float(obj.absolute_max_mass_sens()) for obj in zp_cs_objects]
res = 500
ext = [masses[0], masses[-1], couplings[0], couplings[-1]]
pts = np.array([[m, c] for m in masses for c in couplings])
sigs = np.transpose(full_opt_sigs).flatten()
grid = np.array(
        [[[x,y] for x in np.linspace(ext[0], ext[1], res)]
        for y in np.linspace(ext[2], ext[3], res)])
interp_sigs = scipy.interpolate.griddata(pts, np.log10(sigs), grid, method='cubic')
with plt.rc_context(get_settings()):
    plt.imshow(interp_sigs, origin='lower', aspect='auto', extent=ext, cmap='winter')
    cbar = plt.colorbar()
    plt.xlabel(r"$m_{Z'}$ (GeV)")
    plt.ylabel("Coupling")
    cbar.set_label(r"Signal Significance", rotation=90)
    cbar_ticks = np.linspace(0,3,7)
    cbar.set_ticks(cbar_ticks)
    cbar.set_ticklabels([r"$10^{"+str(num)+"}$" for num in cbar_ticks])
    cvalues = [np.log10(5)]
    clabels = [r'$5\sigma$']
    result = plt.contour(grid[:,:,0], grid[:,:,1], interp_sigs, cvalues, colors='white',
                         linewidths=3, linestyles='dashed')
    fmt = {lev:lab for lev, lab in zip(result.levels, clabels)}
    plt.clabel(result, result.levels, inline=True, fmt=fmt, fontsize=32)
#     plt.plot(max_masses, couplings, c='w')
    plt.xlim(masses[0], masses[-1])
    plt.ylim(couplings[0], couplings[-1])
    plt.savefig('./images/full_example/mass_coupling_sensitivity', dpi=200, bbox_inches='tight')
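
If the numerical $5\sigma$ reach curve is needed rather than just the drawn contour (for example, to tabulate the maximum reachable $m_{Z'}$ at each coupling), the vertices can be read back out of the matplotlib contour set. A minimal sketch, assuming the result object returned by plt.contour above is still in scope:

# Extract the (mass, coupling) vertices of the 5-sigma contour drawn above.
# result.allsegs[k] lists the line segments for the k-th requested level;
# only one level (log10(5)) was requested, so we take allsegs[0].
five_sigma_pieces = result.allsegs[0]   # list of (N, 2) arrays, one per connected piece
for piece in five_sigma_pieces:
    reach_masses, reach_couplings = piece[:, 0], piece[:, 1]
    print(f"5-sigma piece spans m_Z' = {reach_masses.min():.0f}-{reach_masses.max():.0f} GeV")
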

Model Extrapolation

sig1p5T = import47Ddata('sig1p5T')
Downloading data from https://storage.googleapis.com/ttbarzp/47dim/sig1p5T.npy
94003200/94000128 [==============================] - 27s 0us/step
bg1T = data[2][1][0][np.array(data[2][1][1]) == 0]
bg2T = data[3][1][0][np.array(data[3][1][1]) == 0]
test1p5_1T_preds = np.concatenate((sig1p5T, bg1T))
test1p5_2T_preds = np.concatenate((sig1p5T, bg2T))
test1p5_1T_labels = np.concatenate((np.ones_like(sig1p5T[:,0]), np.array(data[2][1][2])[np.array(data[2][1][1]) == 0]))
test1p5_2T_labels = np.concatenate((np.ones_like(sig1p5T[:,0]), np.array(data[3][1][2])[np.array(data[3][1][1]) == 0]))
xgboost_1T_result = xgradboost_m1T_model.best_threshold(
    signal_yields[2], background_yields, test1p5_1T_preds, test1p5_1T_labels, sepbg=True)
xgboost_2T_result = xgradboost_m2T_model.best_threshold(
    signal_yields[3], background_yields, test1p5_2T_preds, test1p5_2T_labels, sepbg=True)
print(f"Gradient boosting (Z' mass 1 TeV) has significance {round(xgboost_1T_result[1],3)} on Z' mass 1.5 TeV sample")
print(f"Gradient boosting (Z' mass 2 TeV) has significance {round(xgboost_2T_result[1],3)} on Z' mass 1.5 TeV sample")
print(f"Interpolating gradient boosting significance yields {round(10**xgradboost_opt_sigs_sepbg_f(1500),3)}",
      "for Z' mass 1.5 TeV")
Gradient boosting (Z' mass 1 TeV) has significance 6.874 on Z' mass 1.5 TeV sample
Gradient boosting (Z' mass 2 TeV) has significance 0.944 on Z' mass 1.5 TeV sample
Interpolating gradient boosting significance yields 11.046 for Z' mass 1.5 TeV
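
The interpolated value quoted above comes from xgradboost_opt_sigs_sepbg_f, the interpolation of the threshold-optimized significances over $m_{Z'}$ built earlier in the notebook. As a rough sketch of how such an interpolator can be constructed (the exact construction used earlier may differ; the significances below are the coupling $= 1$ values from the grid scan above, rounded for illustration):

import numpy as np
from scipy.interpolate import interp1d

# Sketch: interpolate log10(significance) against the sampled Z' masses so that
# 10**f(m) gives a smooth significance estimate at intermediate masses such as 1.5 TeV.
sampled_masses = [350, 500, 1000, 2000, 4000]        # GeV
sampled_sigs = [67.96, 46.62, 17.07, 7.15, 1.02]     # coupling = 1 significances from the scan above (rounded)
log_sig_interp = interp1d(sampled_masses, np.log10(sampled_sigs), kind='cubic')
print(round(float(10**log_sig_interp(1500)), 3))

Comparing a smooth estimate of this kind with the 6.874 and 0.944 obtained by applying the 1 TeV and 2 TeV models directly to the 1.5 TeV sample suggests that a classifier trained at one $m_{Z'}$ does not transfer cleanly to a neighboring mass point.
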
print(len(test1p5_1T_preds))
print(len(test1p5_2T_preds))
500009
503236

Kinematic Histograms

xlabels = [
    r'$p_T(b_1)$',
    r'$p_T(b_2)$',
    r'$p_T(b_3)$',
    r'$p_T(b_4)$',
    r'$|\Delta\eta(b_1, b_2)|$',
    r'$|\Delta\eta(b_1, b_3)|$',
    r'$|\Delta\eta(b_1, b_4)|$',
    r'$|\Delta\eta(b_2, b_3)|$',
    r'$|\Delta\eta(b_2, b_4)|$',
    r'$|\Delta\eta(b_3, b_4)|$',
    r'$|\Delta\phi(b_1, b_2)|$',
    r'$|\Delta\phi(b_1, b_3)|$',
    r'$|\Delta\phi(b_1, b_4)|$',
    r'$|\Delta\phi(b_2, b_3)|$',
    r'$|\Delta\phi(b_2, b_4)|$',
    r'$|\Delta\phi(b_3, b_4)|$',
    r'$\Delta R(b_1, b_2)$',
    r'$\Delta R(b_1, b_3)$',
    r'$\Delta R(b_1, b_4)$',
    r'$\Delta R(b_2, b_3)$',
    r'$\Delta R(b_2, b_4)$',
    r'$\Delta R(b_3, b_4)$',
    r'MET',
    r'$p_T(\ell)$',
    r'$m_T(\ell, MET)$',
    r'$m(b_1, b_2)$',
    r'$m(b_1, b_3)$',
    r'$m(b_1, b_4)$',
    r'$m(b_2, b_3)$',
    r'$m(b_2, b_4)$',
    r'$m(b_3, b_4)$',
    r'$m_T(b_1, \ell, MET)$',
    r'$m_T(b_2, \ell, MET)$',
    r'$m_T(b_3, \ell, MET)$',
    r'$m_T(b_4, \ell, MET)$',
    r'$m_T(j_1, j_2)$',
    r'$p_T(j_1)$',
    r'$p_T(j_2)$',
    r'$\Delta R(j_1, j_2)$',
    r'$\Delta R(b_1, \ell)$',
    r'$\Delta R(b_2, \ell)$',
    r'$\Delta R(b_3, \ell)$',
    r'$\Delta R(b_4, \ell)$',
    r'$|\Delta\phi(b_1, \ell)|$',
    r'$|\Delta\phi(b_2, \ell)|$',
    r'$|\Delta\phi(b_3, \ell)|$',
    r'$|\Delta\phi(b_4, \ell)|$',
]
import matplotlib.pyplot as plt
histo_path = "/Users/sheridan/Academic/bcml4pheno/images/full_example/for_paper/kin_histos/"
xlims = ([[0, 1000]] * 4     # pt
         + [[-4, 4]] * 6     # sdEta
         + [[-6.5, 6.5]] * 6 # sdPhi
         + [[0, 6]] * 6      # dR
         + [[0, 500]] * 3    # misc. kinematic
         + [[0, 1500]] * 6   # invariant mass
         + [[0, 600]] * 7    # misc. kinematic
         + [[0, 6]] * 5      # dR
         + [[-6.5, 6.5]] * 4 # sdPhi
        )
ylims = ([None] * 4
         + [[10**-3, 0.5]] * 6 # sdEta
         + [None] * (47 - 6 - 4)
        )
legend_spot = (['upper right'] * 4
               + ['lower center'] * 18
               + ['upper right'] * 9
               + ['lower right'] * 5
               + ['upper right'] * 2
               + ['lower center'] * 9
              )
bg_labels_short = [r"No Higgs B.g. ($pp \; \to \; t\overline{t}b\overline{b} \; \backslash \; h$)",
                   r"Higgs B.g. ($pp \; \to \; t\overline{t}h$)",
                   r"4 Tops B.g. ($pp \; \to \; t\overline{t}t\overline{t}$)"]

with plt.rc_context(get_settings()):
    for N, (name, xlabel, xlim, ylim, loc) in enumerate(zip(get47Dfeatures(), xlabels, xlims, ylims, legend_spot)):
        plt.hist(
            signal_data[0][:,N], density=True, bins=1000, histtype='step', linewidth=3, label=r"Signal ($m_{Z'} = 350$ GeV)");
        plt.hist(
            signal_data[2][:,N], density=True, bins=1000, histtype='step', linewidth=3, label=r"Signal ($m_{Z'} = 1$ TeV)");
        for j, label in enumerate(bg_labels_short):
            plt.hist(
                background_data[j][:,N], density=True, bins=1000, histtype='step', linewidth=3, label=r"{}".format(label));
        plt.xlim(*xlim)
        if ylim:
            plt.ylim(*ylim)
        plt.yscale('log')
        # plt.legend(bbox_to_anchor=(1.02,1), loc="upper left")
        plt.legend(loc=loc)
        plt.xlabel(xlabel)
        plt.ylabel('a.u.')
        plt.savefig(histo_path + name.strip() + '.pdf', dpi=200, bbox_inches='tight', transparent=True, format='pdf')
        plt.clf()

Saving ML Output for Shape-Based Optimization

def genText(models, fileprefix, sig_yields, bg_yields, doPlots=False):
    hist_signal_yields = signal_yields # look into changing? (currently unused; refers to the global signal_yields, not the sig_yields argument)
    background_names = ['higgs', 'four_tops', 'no_higgs']
    num_binss = [100, 100, 100, 40, 40]
#     num_binss = [1, 1, 1, 1, 1]
    ms = [350, 500, 1000, 2000]  # only these four masses are processed; the zip below truncates to len(ms)
    
    for i, (m, X, model, num_bins, signal_yield) in enumerate(zip(ms, data, models, num_binss, sig_yields)):
        # generate histos
        bin_edges, sig_bins, bg_bins1, bg_bins2, bg_bins3 = model.predict_hist(
            X[1][0], X[1][2], num_bins=num_bins, sepbg=True, sig_norm=np.array(signal_yield), bg_norm=np.array(bg_yields),
            yAxisUnits='events/bin width')
        bg_bins = [bg_bins1, bg_bins2, bg_bins3]

        # save .txt files
        txt_path = "/Users/sheridan/Academic/bcml4pheno/images/full_example/for_paper/ml_histos/"
        np.savetxt(txt_path + f'{fileprefix}{m}GeV_gradboost_binedges.txt', bin_edges, delimiter='\n')
        np.savetxt(txt_path + f'{fileprefix}{m}GeV_gradboost_signal.txt', sig_bins, delimiter='\n')
        for name, bg_bin in zip(background_names, bg_bins):
            np.savetxt(txt_path + f'{fileprefix}{m}GeV_gradboost_background_{name}.txt', bg_bin, delimiter='\n')
            
        # print area under curve
        print(np.dot(np.diff(bin_edges), sig_bins))
        print(signal_yield)
        print(np.diff(bin_edges)[0])
        print()

        # make plots
        if doPlots:
            with plt.rc_context(get_settings()):
                plt.figure(i)
                plt.bar(bin_edges[:-1], sig_bins, width=np.diff(bin_edges), alpha=0.75, align="edge", label="Signal")
                for bg_bin, name in zip(bg_bins, background_names):
                    plt.bar(
                        bin_edges[:-1], bg_bin, width=np.diff(bin_edges), alpha=0.75, align="edge", label=f"Background ({name})")
                plt.yscale('log')
                plt.legend()
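
The .txt files written by genText hold one value per line, so they can be read straight back with np.loadtxt when setting up the shape-based fit or when re-plotting without re-running the models. A minimal sketch for one mass point (the path and prefix below are examples that follow the naming scheme above):

import numpy as np
import matplotlib.pyplot as plt

# Sketch: reload one set of saved ML-output histograms and redraw them.
txt_path = "/Users/sheridan/Academic/bcml4pheno/images/full_example/for_paper/ml_histos/"
prefix = "13TeV_gq1_350GeV_gradboost"   # example prefix following the naming scheme above
bin_edges = np.loadtxt(txt_path + f'{prefix}_binedges.txt')
sig_bins = np.loadtxt(txt_path + f'{prefix}_signal.txt')
bg_bins = {name: np.loadtxt(txt_path + f'{prefix}_background_{name}.txt')
           for name in ['higgs', 'four_tops', 'no_higgs']}

plt.stairs(sig_bins, bin_edges, label='Signal')
for name, bins in bg_bins.items():
    plt.stairs(bins, bin_edges, label=f'Background ({name})')
plt.yscale('log')
plt.legend()
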
sig_yield_13TeV_gq1 = np.array([0.03284,
                                0.01253,
                                0.001285,
                                6.6e-5]) * 3000 * conv * 0.445 * (0.7)**4 # xsec * lumi * unit conversion (conv) * branching ratio * (b-tag efficiency)^4
sig_yield_13TeV_gq0 = np.array([0.021576,
                                0.007379,
                                0.000502,
                                1.07e-5]) * 3000 * conv * 0.445 * (0.7)**4
sig_yield_14TeV_gq1 = np.array([0.04053,
                                0.01567,
                                0.001681,
                                9.33e-5]) * 3000 * conv * 0.445 * (0.7)**4
sig_yield_14TeV_gq0 = np.array([0.02714,
                                0.009557,
                                0.000703,
                                1.7e-5]) * 3000 * conv * 0.445 * (0.7)**4
masses, sig_css, bg_css = get_elijah_ttbarzp_cs() # ONLY BG CS IS USED HERE!
bg_yields = [conv * 3000 * bg_cs * (0.7)**4 for bg_cs in bg_css]

prefixes = ['13TeV_gq1_', '13TeV_gq0_', '14TeV_gq1_', '14TeV_gq0_']
sig_yields = [sig_yield_13TeV_gq1, sig_yield_13TeV_gq0, sig_yield_14TeV_gq1, sig_yield_14TeV_gq0]

for i, (prefix, sig_yield) in enumerate(zip(prefixes, sig_yields)):
#     if i == 0:
    genText(xgradboost_models, 'TEST' + prefix, sig_yield, bg_yields)
0.01
105.26320140000001
10526.32014
0.009994706

0.01
40.16284754999998
4016.2847549999983
0.009997864

0.01
4.118855474999998
411.8855474999999
0.009999395

self.model doesn't have a predict_proba function
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [35], in <cell line: 23>()
     21 sig_yields = [sig_yield_13TeV_gq1, sig_yield_13TeV_gq0, sig_yield_14TeV_gq1, sig_yield_14TeV_gq0]
     23 for i, (prefix, sig_yield) in enumerate(zip(prefixes, sig_yields)):
     24 #     if i == 0:
---> 25     genText(xgradboost_models, 'TEST' + prefix, sig_yield, bg_yields)

Input In [33], in genText(models, fileprefix, sig_yields, bg_yields, doPlots)
      6 ms = [350, 500, 1000, 2000]
      8 for i, (m, X, model, num_bins, signal_yield) in enumerate(zip(ms, data, models, num_binss, sig_yields)):
      9     # generate histos
---> 10     bin_edges, sig_bins, bg_bins1, bg_bins2, bg_bins3 = model.predict_hist(
     11         X[1][0], X[1][2], num_bins=num_bins, sepbg=True, sig_norm=np.array(signal_yield), bg_norm=np.array(bg_yields),
     12         yAxisUnits='events/bin width')
     13     bg_bins = [bg_bins1, bg_bins2, bg_bins3]
     15     # save .txt files

File ~/Academic/bcml4pheno/bcml4pheno/bcml_model.py:83, in bcml_model.predict_hist(self, preds, labels, num_bins, sepbg, sig_norm, bg_norm, dataframe, yAxisUnits)
     81 predictions = self.predict_proba(preds)
     82 labels = np.array(labels)
---> 83 sig_bins, bin_edges = np.histogram(predictions[labels==1], bins=num_bins, density=True)
     84 binWidthFactor = 1/num_bins if yAxisUnits == 'events/bin width' else 1
     85 sig_bins *= sig_norm * binWidthFactor

TypeError: 'NoneType' object is not subscriptable
bg_css
[0.106, 0.0117, 5.58]
sig_yields
[array([23654.652 ,  9025.359 ,   925.5855,    47.5398]),
 array([1.55411928e+04, 5.31509370e+03, 3.61590600e+02, 7.70721000e+00]),
 array([29193.759  , 11287.101  ,  1210.8243 ,    67.20399]),
 array([1.9548942e+04, 6.8839071e+03, 5.0637090e+02, 1.2245100e+01])]
bg_yields
[76351.79999999999, 8427.509999999998, 4019273.999999999]
conv
1000.0
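
As a quick cross-check, the background yields printed above follow directly from the line bg_yields = [conv * 3000 * bg_cs * (0.7)**4 for bg_cs in bg_css]: for the first cross section, $1000 \times 3000 \times 0.106 \times 0.7^4 \approx 76{,}352$, which matches the first entry of bg_yields.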

Misc. Significance Plots

import matplotlib.pyplot as plt

with plt.rc_context(get_settings()):
    plt.plot([max_mass for i in range(2)], np.logspace(-0.5, 2, 2), label=r'Theoretical $5\sigma$ Limit', c='black')
    plt.plot(sample_masses, [5 for m in sample_masses], label=r'5$\sigma$', c='blue')
    plt.plot(sample_masses, [1.645 for m in sample_masses], label=r'90% confidence', c='cornflowerblue')
    plt.plot(
        sample_masses, [10**logreg_opt_sigs_sepbg_f(m) for m in sample_masses], 
        label='Logistic Regression', 
        c='red', linestyle='-.')
    plt.plot(
        sample_masses, [10**randforest_opt_sigs_sepbg_f(m) for m in sample_masses], 
        label='Random Forests', 
        c='lime', linestyle='-.')
    plt.plot(
        sample_masses, [10**xgradboost_opt_sigs_sepbg_f(m) for m in sample_masses], 
        label='Gradient Boosting', 
        c='tab:purple', linestyle='-.')
    plt.ylabel('Signal Significance');
    plt.xlabel(r"$Z'$ mass $m_{Z'}$ (GeV)");
    plt.yscale('log')
    plt.legend()
    plt.savefig('./images/full_example/for_paper/sig_plots/sigs', dpi=200, bbox_inches='tight')
import matplotlib.pyplot as plt

with plt.rc_context(get_settings()):
    plt.plot([max_mass for i in range(2)], np.logspace(-0.5, 2, 2), label=r'Theoretical $5\sigma$ Limit', c='black')
    plt.plot(sample_masses, [5 for m in sample_masses], label=r'5$\sigma$', c='blue')
    plt.plot(sample_masses, [1.645 for m in sample_masses], label=r'90% confidence', c='cornflowerblue')
    plt.plot(
        sample_masses, [10**logreg_opt_sigs_sepbg_f(m)/2 for m in sample_masses], 
        label='Logistic Regression', 
        c='red', linestyle='-.')
    plt.plot(
        sample_masses, [10**randforest_opt_sigs_sepbg_f(m)/2 for m in sample_masses], 
        label='Random Forests', 
        c='lime', linestyle='-.')
    plt.plot(
        sample_masses, [10**xgradboost_opt_sigs_sepbg_f(m)/2 for m in sample_masses], 
        label='Gradient Boosting', 
        c='tab:purple', linestyle='-.')
    plt.ylabel('Signal Significance');
    plt.xlabel(r"$Z'$ mass $m_{Z'}$ (GeV)");
    plt.yscale('log')
    plt.legend()
    plt.savefig('./images/full_example/for_paper/sig_plots/sigs_significance_onehalf', dpi=200, bbox_inches='tight')

Refreshing Models

logreg_m350G_model = refresh_model(logreg_m350G_model)
logreg_m500G_model = refresh_model(logreg_m500G_model)
logreg_m1T_model = refresh_model(logreg_m1T_model)
logreg_m2T_model = refresh_model(logreg_m2T_model)
logreg_m4T_model = refresh_model(logreg_m4T_model)
randforest_m350G_model = refresh_model(randforest_m350G_model)
randforest_m500G_model = refresh_model(randforest_m500G_model)
randforest_m1T_model = refresh_model(randforest_m1T_model)
randforest_m2T_model = refresh_model(randforest_m2T_model)
randforest_m4T_model = refresh_model(randforest_m4T_model)
xgradboost_m350G_model = refresh_model(xgradboost_m350G_model)
xgradboost_m500G_model = refresh_model(xgradboost_m500G_model)
xgradboost_m1T_model = refresh_model(xgradboost_m1T_model)
xgradboost_m2T_model = refresh_model(xgradboost_m2T_model)
xgradboost_m4T_model = refresh_model(xgradboost_m4T_model)
xgradboost_models = [
    xgradboost_m350G_model, xgradboost_m500G_model, xgradboost_m1T_model, xgradboost_m2T_model, xgradboost_m4T_model]

Legacy

NEW

time_before = getTime(form='m')
xgradboost_m350G_model_N500 = bcml_model(
    make_pipeline(StandardScaler(), 
                  XGBClassifier(n_estimators=500, max_depth=7, learning_rate=0.1, nthread=4, use_label_encoder=False)));
xgradboost_m350G_model_N500.fit(*data[0][0][:2])
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")
(Runtime: 113.466 minutes)
time_before = getTime(form='m')
xgradboost_m350G_model_MD10 = bcml_model(
    make_pipeline(StandardScaler(), 
                  XGBClassifier(n_estimators=250, max_depth=10, learning_rate=0.1, nthread=4, use_label_encoder=False)));
xgradboost_m350G_model_MD10.fit(*data[0][0][:2])
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")
(Runtime: 99.267 minutes)
time_before = getTime(form='m')
xgradboost_m350G_model_LRpt01 = bcml_model(
    make_pipeline(StandardScaler(), 
                  XGBClassifier(n_estimators=250, max_depth=7, learning_rate=0.01, nthread=4, use_label_encoder=False)));
xgradboost_m350G_model_LRpt01.fit(*data[0][0][:2])
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")
(Runtime: 315.0 minutes)
time_before = getTime(form='m')
xgradboost_m350G_model_N500_MD10 = bcml_model(
    make_pipeline(StandardScaler(), 
                  XGBClassifier(n_estimators=500, max_depth=10, learning_rate=0.1, nthread=4, use_label_encoder=False)));
xgradboost_m350G_model_N500_MD10.fit(*data[0][0][:2])
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")
(Runtime: 157.117 minutes)
time_before = getTime(form='m')
xgradboost_m350G_model_N500_LRpt01 = bcml_model(
    make_pipeline(StandardScaler(), 
                  XGBClassifier(n_estimators=500, max_depth=7, learning_rate=0.01, nthread=4, use_label_encoder=False)));
xgradboost_m350G_model_N500_LRpt01.fit(*data[0][0][:2])
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")
(Runtime: 119.9 minutes)
time_before = getTime()

model = xgradboost_m350G_model_N500_LRpt01
M = 0
mass = 350
result = model.best_threshold(signal_yields[M], background_yields, data[M][1][0], data[M][1][2], sepbg=True)

time_after = getTime()
print(f"(Runtime: {time_after - time_before} seconds)")
/Users/sheridan/Academic/bcml4pheno/bcml4pheno/bcml_model.py:268: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
  predictionsss = np.array(
(Runtime: 7 seconds)
print(f"for mass = {mass} GeV, thresh = {round(result[0],6)}, tpr = {round(result[2],5)} and",
      f"fpr = {[round(elem,9) for elem in result[3]]}",
      f"so yield is {int(lumi * sig_css[M] * conv * result[2])} and sig. sign. is {round(result[1],3)}")
for mass = 350 GeV, thresh = 0.956855, tpr = 0.41891 and fpr = [0.0, 0.001915158, 0.002070518] so yield is 12564 and sig. sign. is 57.778
print(sig_css)
[0.009997999999999998, 0.0038020000000000007, 0.00039359999999999997, 2.0339999999999988e-05, 3.593803554603651e-07]

END NEW

with plt.rc_context(get_settings()):
    for elem in hists_gbt_m350G:
        width = 0.5*(elem[0][1]-elem[0][0])
        plt.bar(elem[0][:-1], elem[1], width=width)
    plt.yscale('log')
time_before = getTime(form='m')
xgradboost_m350G_model_subsamplept5 = bcml_model(
    make_pipeline(StandardScaler(), 
                  XGBClassifier(n_estimators=250, max_depth=7, learning_rate=0.1, nthread=4, 
                                subsample=0.5, use_label_encoder=False)));
xgradboost_m350G_model_subsamplept5.fit(*data[0][0][:2])
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")
(Runtime: 701.784 minutes)