Informed by the documentation of the import47Ddata
function, we can pull in all of the data files we'll need for training and testing from Google Storage. These were previously produced by printing out kinematic information from .lhe
files generated by MadGraph using a MadAnalysis expert mode script.
filename_sig_masses = ['350G', '500G', '1T', '2T', '4T']
filename_bgs = ['h', '4t', 'noh']
signal_data = [import47Ddata(f'sig{mass}') for mass in filename_sig_masses]
background_data = [import47Ddata(f'bg{name}') for name in filename_bgs]
Now, using sklearn.model_selection.train_test_split
we can split our data sets--one for each $m_{Z'}$ value--into training and testing subsets and create labels for each of these.
import numpy as np
from sklearn.model_selection import train_test_split
data = []
for sig_data in signal_data:
train_preds, test_preds, train_binary_labels, test_binary_labels, train_labels, test_labels = train_test_split(
np.concatenate([sig_data] + [bg_data[:int(len(sig_data)/3)] for bg_data in background_data]),
[1] * len(sig_data) + [0] * 3 * int(len(sig_data)/3),
[1] * len(sig_data) + [-1] * int(len(sig_data)/3) + [-2] * int(len(sig_data)/3) + [-3] * int(len(sig_data)/3),
test_size=0.25, random_state=42)
data.append([[np.array(train_preds), np.array(train_binary_labels), np.array(train_labels)],
[np.array(test_preds), np.array(test_binary_labels), np.array(test_labels)]])
del train_preds, test_preds, train_binary_labels, test_binary_labels, train_labels, test_labels
Here, we have created a list data
of length $5$ (as we have $5$ sampled $Z'$ masses) where each row contains two lists, each of which contains three numpy
arrays. The first triplet is the $n \times m$ data set of training predictors (where $n$ is the number of data points and $m$ is the number of features: for us, $n \approx 2$ million and $m = 47$), the length $n$ binary labels (i.e., signal is $1$, background is $0$) to be used for training, and the length $n$ labels (i.e., signal is $1$, background is negative integer) to be used when we want to differentiate backgrounds. Thus, our list is $5 \times 2 \times 3$, where the final dimension consists of predictors and labels as described above. We note that, in particular, we constructed our training/testing data for each $m_{Z'}$ value by combining $1$ million signal data points with ~$1$ million background points (divided evenly between our three background types).
print(f'data has length {len(data)}')
print(f'each row of data has length {len(data[0])}')
print(f'each element in each row of data has length {len(data[0][0])}')
print(f'training data is of length {len(data[0][0][0])}',
f'while testing data is of length {len(data[0][1][0])}')
print(f'our data has {len(data[0][0][0][0])} features')
We're also going to want one more thing. In particular, we'll eventually want to pass the (test) predictors and labels that correspond to specific background types. In order to facilitate this, we're going to construct a list indx
of length $5$ (number of $Z'$ masses) where each row is a list of length $4$ (as we have signal and three background types) where data[i][j]
is a numpy
array containing the indices of test[i][1][2]
corresponding to signal (if $j = 0$) or the $j$th background (if $1 \leq j \leq 3$).
indx = [[np.where(row[1][2]==i)[0] for i in [1] + [-i for i in range(1,4)]] for row in data]
print(f'indx has length {len(indx)}')
print(f'each row of indx has length {len(indx[0])}')
print(f'the lengths of the elements of the first row of indx are {[len(elem) for elem in indx[0]]}')
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
It's useful for standardize your data (subtract the mean and divide by the standard deviation in each feature) before training: we can achieve this through sklearn
's StandardScaler
filter and its pipeline
functionality.
We'll now initialize and train a logistic regression model for each of our $5$ data sets using our bcml_model
object. Note that the *
syntax unpacks a list, so our training data and testing data for a given $Z'$ mass (which makes up our rows of train_test_data
) are passed as separate arguments to our class instantiation.
We instantiate a bcml_model
object by passing a sklearn
classifying model and we fit the model via the bcml.fit
method.
logreg_m350G_model = bcml_model(make_pipeline(StandardScaler(), LogisticRegression()))
logreg_m350G_model.fit(*data[0][0][:2])
logreg_m500G_model = bcml_model(make_pipeline(StandardScaler(), LogisticRegression()))
logreg_m500G_model.fit(*data[1][0][:2])
logreg_m1T_model = bcml_model(make_pipeline(StandardScaler(), LogisticRegression()))
logreg_m1T_model.fit(*data[2][0][:2])
logreg_m2T_model = bcml_model(make_pipeline(StandardScaler(), LogisticRegression()))
logreg_m2T_model.fit(*data[3][0][:2])
logreg_m4T_model = bcml_model(make_pipeline(StandardScaler(), LogisticRegression(max_iter=500)))
logreg_m4T_model.fit(*data[4][0][:2])
As a reminder, passing *data[0][0][:2]
, for example, means we are taking the zeroth row of data
(first mass, $m_{Z'} = 350$ GeV), the zeroth element of that row (the training data, not the testing data), and then unpacking the first two elements of that element (the features and binary labels).
Now we can investigate our models' performances. Perhaps we can start by looking at the signal significance they achieve by default, given integrated luminosity $\mathcal{L} = 3000$ fb$^{-1}$. To do this, we're going to need signal and background yield, which we can compute using the function get_elijah_ttbarzp_cs
, which can pull in our information regarding sampled $Z'$ masses, their associated cross sections, and the background cross sections.
masses, sig_css, bg_css = get_elijah_ttbarzp_cs()
print(masses)
We will need a signal cross section for $m_{Z'} = 4000$ GeV, which we don't have, so we shall take a quick detour to interpolate it using our cross_section_helper
object. In doing so, we'll update masses
and sig_css
to more relevant values.
zp_cs = cross_section_helper(masses, sig_css, bg_css, mass_units='GeV')
masses = [350, 500, 1000, 2000, 4000]
sig_css = [zp_cs.sig_cs(mass) for mass in masses]
Using $N = \mathcal{L} \cdot \sigma$ for number of events $N$, we can compute signal and background yields. Both of these are lists: signal cross sections vary with the $m_{Z'}$ values given by our masses
variable, and the background cross sections vary between background types, originally specified by the filename_bgs
variable.
conv = 10**15 / 10**12 # conv * lumi (in fb^{-1}) * cross sec (in pb) = # of events
lumi = 3000
signal_yields = [conv * lumi * sig_cs for sig_cs in sig_css]
background_yields = [conv * lumi * bg_cs for bg_cs in bg_css]
background_yield = sum(background_yields)
Now we can look at anticipated signal significance $\mathcal{S}$ yielded by our models, which takes the following form. $$\mathcal{S} = \frac{S \cdot \text{TPR}}{\sqrt{S \cdot \text{TPR} + B \cdot \text{FPR}}}$$ Here, $S$ and $B$ are signal and background yield, respectively; TPR is the true positive rate, or the proportion of signal events our model correctly identifies as signal; and FPR is the false positive rate, or the proportion of background events our model incorrectly identifies as signal (i.e., TPR, FPR $\in [0, 1]$). Effectively, we're implementing a solitary event selection criterion: namely, that our model thinks the event is signal.
For the benefit of our own understanding of the problem, let's briefly consider the signal cross section $\sigma_s$ as a function of $m_{Z'}$ with some assistance from the interpolation functionality of scipy
and some matplotlib.pyplot
settings retrievable via get_settings
.
bg_labels = [r"No Higgs B.g. ($pp \; \to \; t\overline{t}b\overline{b} \; \backslash \; h \; \to \; bjj \; \overline{b}\ell\nu \; b\overline{b}$)",
r"Higgs B.g. ($pp \; \to \; t\overline{t}h \; \to \; bjj \; \overline{b}\ell\nu \; b\overline{b}$)",
r"4 Tops B.g. ($pp \; \to \; t\overline{t}t\overline{t} \; \to \; \overline{b}\ell\nu \; b\overline{b} \; ...$)"]
import scipy.interpolate
import matplotlib.pyplot as plt
sample_masses = np.linspace(masses[0], masses[-1], 500)
sigs_f = scipy.interpolate.interp1d(masses, np.log10(sig_css), kind='cubic')
new_settings = get_settings()
new_settings['figure.figsize'] = (15,10)
with plt.rc_context(new_settings):
plt.plot(sample_masses, [10**sigs_f(m) for m in sample_masses],
label=r"Signal ($pp \; \to \; t\overline{t}Z' \; \to \; bjj \; \overline{b}\ell\nu \; b\overline{b}$)")
for bg_cs, label in zip([bg_css[2], bg_css[0], bg_css[1]], bg_labels):
plt.plot([350, 4000], [bg_cs]*2, label=r"{}".format(label))
plt.ylabel('Cross Section (pb)');
plt.xlabel(r"$Z'$ mass $m_{Z'}$ (GeV)");
plt.xlim(350, 4000)
plt.yscale('log')
plt.legend(bbox_to_anchor=(0,0), loc="lower left")
plt.savefig('./images/full_example/signal_cs', dpi=200, bbox_inches='tight')
for B, name in zip(background_yields, filename_bgs):
print(f"Background {name} constitutes {round(B/background_yield,3)}% of the total background cross section.")
print(f"\nTotal backgroundcross section: {sum(bg_css)} pb")
The noh
(no Higgs) background is far and away the dominant process. It seems negligent, then, to compute total FPR and multiply by total B (background yield): instead, we should be multiplying the FPR for each background by the associated background yield B, and then summing over these. The bcml_model.significance
method allows us to specify whether or not we want to separate our backgrounds like this in the computation with the boolean argument sepbg
. Let's look at both scenarios.
We'll import datetime
to help ourselves start gauging runtime for some of these computational tasks.
from datetime import datetime
def getTime(form='s'):
seconds = datetime.now().day * 86400 + datetime.now().hour * 3600 + datetime.now().minute * 60 + datetime.now().second
if form == 's':
return seconds
elif form == 'm':
return round(seconds/60,3)
elif form == 'h':
return round(seconds/3600,3)
logreg_models = [logreg_m350G_model, logreg_m500G_model, logreg_m1T_model, logreg_m2T_model, logreg_m4T_model]
time_before = getTime()
logreg_sigs_nosepbg = [model.significance(signal_yield, background_yield,
tpr=model.tpr(data[i][1][1], preds=data[i][1][0]),
fpr=model.fpr(data[i][1][1], preds=data[i][1][0]),
sepbg=False)
for i, (model, signal_yield) in enumerate(zip(logreg_models, signal_yields))]
for sig, mass in zip(logreg_sigs_nosepbg, masses):
print(f"Z' mass = {mass} GeV --> logistic regression gives significance of {round(sig, 3)} sigma",
"w/o separated backgrounds")
time_after = getTime()
print(f"(Runtime: {time_after - time_before} seconds)")
time_before = getTime()
logreg_sigs_sepbg = [model.significance(signal_yield, background_yields,
tpr=model.tpr(np.ones((len(indx[i][0]),)), preds=data[i][1][0][indx[i][0]]),
fpr=[model.fpr(np.zeros((len(indx[i][j]),)), preds=data[i][1][0][indx[i][j]])
for j in range(1,4)],
sepbg=True)
for i, (model, signal_yield) in enumerate(zip(logreg_models, signal_yields))]
for sig, mass in zip(logreg_sigs_sepbg, masses):
print(f"Z' mass = {mass} GeV --> logistic regression gives significance of {round(sig, 3)} sigma",
"w/ separated backgrounds")
time_after = getTime()
print(f"(Runtime: {time_after - time_before} seconds)")
Numbers have limited utility on their own: let's try and visualize $\mathcal{S}$ as a function of $m_{Z'}$ by again enlisting the help of interpolation and our matplotlib settings.
max_mass = zp_cs.absolute_max_mass_sens()
logreg_sigs_nosepbg_f = scipy.interpolate.interp1d(masses, np.log10(logreg_sigs_nosepbg), kind='cubic')
logreg_sigs_sepbg_f = scipy.interpolate.interp1d(masses, np.log10(logreg_sigs_sepbg), kind='cubic')
with plt.rc_context(get_settings()):
plt.plot([max_mass for i in range(2)], np.logspace(-2, 1.5, 2), label=r'Theoretical $5\sigma$ Limit', c='black')
plt.plot(sample_masses, [5 for m in sample_masses], label=r'5$\sigma$', c='blue')
plt.plot(sample_masses, [1.645 for m in sample_masses], label=r'90% confidence', c='cornflowerblue')
plt.plot(
sample_masses, [10**logreg_sigs_nosepbg_f(m) for m in sample_masses], label='Logistic Regression (No b.g. sep.)',
c='red')
plt.plot(
sample_masses, [10**logreg_sigs_sepbg_f(m) for m in sample_masses], label='Logistic Regression (B.g. sep.)',
c='red', linestyle=':')
plt.ylabel('Signal Significance');
plt.xlabel(r"$Z'$ mass $m_{Z'}$ (GeV)");
plt.yscale('log')
plt.legend()
Whenever a curve falls above, say, the darker blue $5\sigma$ line, we are assured that the associated model can achieve $5\sigma$ discovery potential at that value of $m_{Z'}$. The "Theoretical 5$\sigma$ Limit" line denotes the highest $Z'$ mass such that $\mathcal{S} = 5\sigma$ is possible. In particular, in the best case scenario, TPR = 1 and FPR = 0, we have that $\mathcal{S} = \sqrt{S}$ so we require $\mathcal{L} \cdot \sigma_s > \mathcal{S}^2$ for signal cross section $\sigma_s$. Our cross_section_helper
class has a built-in method for computing this maximal mass, given signal cross sections.
print("We cannnot achieve sensitivity to Z' particles (via this channel)",
f"with masses greater than {int(max_mass)} GeV")
When passed a data point, a trained bcml_model
assigns a signal probability. By default, an sk-learn
(and thus, our bcml_model
) model predicts that the datapoint is signal if the associated signal probability is greater than $0.5$. However, $0.5$ need not be the optimal threshold: for this purpose, we provide the ability to optimize this threshold on a test data set. To visualize this, we can plot a histogram of predicted signal probability for signal and background test events in our model: this is easily achieved using the bcml_model.predict_hist
method.
background_names = ['Higgs', '4 Tops', 'No Higgs']
with plt.rc_context(get_settings()):
for i, model in enumerate(logreg_models):
plt.figure(i)
bin_edges, sig_bins, bg_bins = model.predict_hist(data[i][1][0], data[i][1][1], num_bins=40, sepbg=False)
plt.bar(bin_edges[:-1], sig_bins, width=np.diff(bin_edges), alpha=0.75, align="edge", label="Signal")
plt.bar(bin_edges[:-1], bg_bins, width=np.diff(bin_edges), alpha=0.75, align="edge", label="Background")
plt.yscale('log')
plt.gca().set_ylim(10**-3, 10**2)
plt.gca().set_xlim(0, 1)
plt.legend()
plt.savefig(f'./images/full_example/histograms/log_reg_m{masses[i]}GeV_hist', dpi=200, bbox_inches='tight')
Having motivated this endeavor, let's see how threshold optimization boosts our logistic regression models' performance: the bcml_model.best_threshold
method performs this optimization task for us.
time_before = getTime(form='m')
logreg_opt_results_nosepbg = [model.best_threshold(signal_yield, background_yield, data[i][1][0], data[i][1][1],
sepbg=False)
for i, (model, signal_yield) in enumerate(zip(logreg_models, signal_yields))]
logreg_opt_sigs_nosepbg = [result[1] for result in logreg_opt_results_nosepbg]
for sig, mass in zip(logreg_opt_sigs_nosepbg, masses):
print(f"Z' mass = {mass} GeV --> logistic regression gives optimized sign. of {round(sig, 3)} sigma",
"w/o separated backgrounds")
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")
time_before = getTime(form='m')
logreg_opt_results_sepbg = [model.best_threshold(signal_yield, background_yields, data[i][1][0], data[i][1][2],
sepbg=True)
for i, (model, signal_yield) in enumerate(zip(logreg_models, signal_yields))]
logreg_opt_sigs_sepbg = [result[1] for result in logreg_opt_results_sepbg]
for sig, mass in zip(logreg_opt_sigs_sepbg, masses):
print(f"Z' mass = {mass} GeV --> logistic regression gives optimized sign. of {round(sig, 3)} sigma",
"w/ separated backgrounds")
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")
Now we can overlay our new optimized signal significances onto our old plot to visualize (marked) improvement.
logreg_opt_sigs_nosepbg_f = scipy.interpolate.interp1d(masses, np.log10(logreg_opt_sigs_nosepbg), kind='linear')
logreg_opt_sigs_sepbg_f = scipy.interpolate.interp1d(masses, np.log10(logreg_opt_sigs_sepbg), kind='linear')
with plt.rc_context(get_settings()):
plt.plot([max_mass for i in range(2)], np.logspace(-2, 1.5, 2), label=r'Theoretical $5\sigma$ Limit', c='black')
plt.plot(sample_masses, [5 for m in sample_masses], label=r'5$\sigma$', c='blue')
plt.plot(sample_masses, [1.645 for m in sample_masses], label=r'90% confidence', c='cornflowerblue')
plt.plot(
sample_masses, [10**logreg_sigs_nosepbg_f(m) for m in sample_masses], label='Logistic Regression (No b.g. sep.)',
c='red')
plt.plot(
sample_masses, [10**logreg_sigs_sepbg_f(m) for m in sample_masses], label='Logistic Regression (B.g. sep.)',
c='red', linestyle=':')
plt.plot(
sample_masses, [10**logreg_opt_sigs_nosepbg_f(m) for m in sample_masses],
label='Logistic Regression (No b.g. sep., opt. thresh.)',
c='red', linestyle='--')
plt.plot(
sample_masses, [10**logreg_opt_sigs_sepbg_f(m) for m in sample_masses],
label='Logistic Regression (B.g. sep., opt. thresh.)',
c='red', linestyle='-.')
plt.ylabel('Signal Significance');
plt.xlabel(r"$Z'$ mass $m_{Z'}$ (GeV)");
plt.yscale('log')
plt.legend()
plt.savefig('./images/full_example/log_reg_sigs', dpi=200, bbox_inches='tight')
Additionally, we can re-plot our histogram with vertical lines visualizing where we used to be cutting for signal selection versus the optimal location (as our threshold determines the predicted signal probability required for a data point to be kept).
background_names = ['Higgs', '4 Tops', 'No Higgs']
with plt.rc_context(get_settings()):
for i, (model, result) in enumerate(zip(logreg_models, logreg_opt_results_sepbg)):
plt.figure(i)
bin_edges, sig_bins, bg_bins = model.predict_hist(data[i][1][0], data[i][1][1], num_bins=40, sepbg=False)
plt.bar(bin_edges[:-1], sig_bins, width=np.diff(bin_edges), alpha=0.75, align="edge", label="Signal")
plt.bar(bin_edges[:-1], bg_bins, width=np.diff(bin_edges), alpha=0.75, align="edge", label="Background")
plt.plot([0.5]*2, [10**-3, 10**2], label='Default Threshold', c='tab:red')
plt.plot([result[0]]*2, [10**-3, 10**2], label='Optimized Threshold', c='tab:green')
plt.yscale('log')
plt.gca().set_ylim(10**-3, 10**2)
plt.gca().set_xlim(0, 1)
plt.legend()
plt.savefig(f'./images/full_example/histograms/log_reg_m{masses[i]}GeV_opt_hist', dpi=200, bbox_inches='tight')
Having comprehensively explored logistic regression, we can ramp up our performance by adopting a more sophisticated machine learning model: random forests. In particular, we can implement the random forest methodology using sklearn.ensemble.RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier
time_before = getTime(form='m')
randforest_m350G_model = bcml_model(
make_pipeline(StandardScaler(), RandomForestClassifier(n_estimators=1000, n_jobs=-1)));
randforest_m350G_model.fit(*data[0][0][:2])
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")
time_before = getTime(form='m')
randforest_m500G_model = bcml_model(
make_pipeline(StandardScaler(), RandomForestClassifier(n_estimators=1000, n_jobs=-1)));
randforest_m500G_model.fit(*data[1][0][:2])
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")
time_before = getTime(form='m')
randforest_m1T_model = bcml_model(
make_pipeline(StandardScaler(), RandomForestClassifier(n_estimators=1000, n_jobs=-1)));
randforest_m1T_model.fit(*data[2][0][:2])
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")
time_before = getTime(form='m')
randforest_m2T_model = bcml_model(
make_pipeline(StandardScaler(), RandomForestClassifier(n_estimators=1000, n_jobs=-1)));
randforest_m2T_model.fit(*data[3][0][:2])
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")
time_before = getTime(form='m')
randforest_m4T_model = bcml_model(
make_pipeline(StandardScaler(), RandomForestClassifier(n_estimators=1000, n_jobs=-1)));
randforest_m4T_model.fit(*data[4][0][:2])
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")
We now iterate over all of the performance evaluation strategies we employed for logistic regression.
randforest_models = [
randforest_m350G_model, randforest_m500G_model, randforest_m1T_model, randforest_m2T_model, randforest_m4T_model]
time_before = getTime()
randforest_sigs_nosepbg = [model.significance(signal_yield, background_yield,
tpr=model.tpr(data[i][1][1], preds=data[i][1][0]),
fpr=model.fpr(data[i][1][1], preds=data[i][1][0]),
sepbg=False)
for i, (model, signal_yield) in enumerate(zip(randforest_models, signal_yields))]
for sig, mass in zip(randforest_sigs_nosepbg, masses):
print(f"Z' mass = {mass} GeV --> random forests gives significance of {round(sig, 3)} sigma",
"w/o separated backgrounds")
time_after = getTime()
print(f"(Runtime: {time_after - time_before} seconds)")
time_before = getTime()
randforest_sigs_sepbg = [model.significance(signal_yield, background_yields,
tpr=model.tpr(data[i][1][1], preds=data[i][1][0]),
fpr=[model.fpr(np.zeros((len(indx[i][j]),)), preds=data[i][1][0][indx[i][j]])
for j in range(1,4)],
sepbg=True)
for i, (model, signal_yield) in enumerate(zip(randforest_models, signal_yields))]
for sig, mass in zip(randforest_sigs_sepbg, masses):
print(f"Z' mass = {mass} GeV --> random forests gives significance of {round(sig, 3)} sigma",
"w/ separated backgrounds")
time_after = getTime()
print(f"(Runtime: {time_after - time_before} seconds)")
time_before = getTime(form='m')
randforest_opt_results_nosepbg = [model.best_threshold(signal_yield, background_yield, data[i][1][0], data[i][1][1],
sepbg=False)
for i, (model, signal_yield) in enumerate(zip(randforest_models, signal_yields))]
randforest_opt_sigs_nosepbg = [result[1] for result in randforest_opt_results_nosepbg]
for sig, mass in zip(randforest_opt_sigs_nosepbg, masses):
print(f"Z' mass = {mass} GeV --> random forests gives optimized sign. of {round(sig, 3)} sigma",
"w/o separated backgrounds")
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")
time_before = getTime(form='m')
randforest_opt_results_sepbg = [model.best_threshold(signal_yield, background_yields, data[i][1][0], data[i][1][2],
sepbg=True)
for i, (model, signal_yield) in enumerate(zip(randforest_models, signal_yields))]
randforest_opt_sigs_sepbg = [result[1] for result in randforest_opt_results_sepbg]
for sig, mass in zip(randforest_opt_sigs_sepbg, masses):
print(f"Z' mass = {mass} GeV --> random forests gives optimized sign. of {round(sig, 3)} sigma",
"w/ separated backgrounds")
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")
randforest_sigs_nosepbg_f = scipy.interpolate.interp1d(masses, np.log10(randforest_sigs_nosepbg), kind='cubic')
randforest_sigs_sepbg_f = scipy.interpolate.interp1d(masses, np.log10(randforest_sigs_sepbg), kind='cubic')
randforest_opt_sigs_nosepbg_f = scipy.interpolate.interp1d(masses, np.log10(randforest_opt_sigs_nosepbg), kind='cubic')
randforest_opt_sigs_sepbg_f = scipy.interpolate.interp1d(masses, np.log10(randforest_opt_sigs_sepbg), kind='linear')
import matplotlib.pyplot as plt
with plt.rc_context(get_settings()):
plt.plot([max_mass for i in range(2)], np.logspace(-2, 1.5, 2), label=r'Theoretical $5\sigma$ Limit', c='black')
plt.plot(sample_masses, [5 for m in sample_masses], label=r'5$\sigma$', c='blue')
plt.plot(sample_masses, [1.645 for m in sample_masses], label=r'90% confidence', c='cornflowerblue')
plt.plot(
sample_masses, [10**randforest_sigs_nosepbg_f(m) for m in sample_masses],
label='Random Forest (No b.g. sep.)',
c='lime')
plt.plot(
sample_masses, [10**randforest_sigs_sepbg_f(m) for m in sample_masses],
label='Random Forest (B.g. sep.)',
c='lime', linestyle=':')
plt.plot(
sample_masses, [10**randforest_opt_sigs_nosepbg_f(m) for m in sample_masses],
label='Random Forest (No b.g. sep., opt. thresh.)',
c='lime', linestyle='--')
plt.plot(
sample_masses, [10**randforest_opt_sigs_sepbg_f(m) for m in sample_masses],
label='Random Forest (B.g. sep., opt. thresh.)',
c='lime', linestyle='-.')
plt.ylabel('Signal Significance');
plt.xlabel(r"$Z'$ mass $m_{Z'}$ (GeV)");
plt.yscale('log')
plt.legend()
plt.savefig('./images/full_example/rand_forest_sigs', dpi=200, bbox_inches='tight')
Now we can take yet another step up by using a gradient boosting model. In particular, we'll use the xgboost
library, which shares sk-learn
's API but provides substantial computational speed-up for gradient boosting models.
from xgboost import XGBClassifier
time_before = getTime(form='m')
xgradboost_m350G_model = bcml_model(
make_pipeline(StandardScaler(),
XGBClassifier(n_estimators=250, max_depth=7, learning_rate=0.1, nthread=4, use_label_encoder=False)));
xgradboost_m350G_model.fit(*data[0][0][:2])
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")
time_before = getTime(form='m')
xgradboost_m500G_model = bcml_model(
make_pipeline(StandardScaler(),
XGBClassifier(n_estimators=250, max_depth=7, learning_rate=0.1, nthread=4, use_label_encoder=False)));
xgradboost_m500G_model.fit(*data[1][0][:2])
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")
time_before = getTime(form='m')
xgradboost_m1T_model = bcml_model(
make_pipeline(StandardScaler(),
XGBClassifier(n_estimators=250, max_depth=7, learning_rate=0.1, nthread=4, use_label_encoder=False)));
xgradboost_m1T_model.fit(*data[2][0][:2])
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")
time_before = getTime(form='m')
xgradboost_m2T_model = bcml_model(
make_pipeline(StandardScaler(),
XGBClassifier(n_estimators=250, max_depth=7, learning_rate=0.1, nthread=4, use_label_encoder=False)));
xgradboost_m2T_model.fit(*data[3][0][:2])
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")
time_before = getTime(form='m')
xgradboost_m4T_model = bcml_model(
make_pipeline(StandardScaler(),
XGBClassifier(n_estimators=250, max_depth=7, learning_rate=0.1, nthread=4, use_label_encoder=False)));
xgradboost_m4T_model.fit(*data[4][0][:2])
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")
xgradboost_models = [
xgradboost_m350G_model, xgradboost_m500G_model, xgradboost_m1T_model, xgradboost_m2T_model, xgradboost_m4T_model]
time_before = getTime()
xgradboost_sigs_nosepbg = [model.significance(signal_yield, background_yield,
tpr=model.tpr(data[i][1][1], preds=data[i][1][0]),
fpr=model.fpr(data[i][1][1], preds=data[i][1][0]),
sepbg=False)
for i, (model, signal_yield) in enumerate(zip(xgradboost_models, signal_yields))]
for sig, mass in zip(xgradboost_sigs_nosepbg, masses):
print(f"Z' mass = {mass} GeV --> gradient boosting gives significance of {round(sig, 3)} sigma",
"w/o separated backgrounds")
time_after = getTime()
print(f"(Runtime: {time_after - time_before} seconds)")
time_before = getTime()
xgradboost_sigs_sepbg = [model.significance(signal_yield, background_yields,
tpr=model.tpr(data[i][1][1], preds=data[i][1][0]),
fpr=[model.fpr(np.zeros((len(indx[i][j]),)), preds=data[i][1][0][indx[i][j]])
for j in range(1,4)],
sepbg=True)
for i, (model, signal_yield) in enumerate(zip(xgradboost_models, signal_yields))]
for sig, mass in zip(xgradboost_sigs_sepbg, masses):
print(f"Z' mass = {mass} GeV --> gradient boosting gives significance of {round(sig, 3)} sigma",
"w/ separated backgrounds")
time_after = getTime()
print(f"(Runtime: {time_after - time_before} seconds)")
time_before = getTime(form='m')
xgradboost_opt_results_nosepbg = [model.best_threshold(signal_yield, background_yield, data[i][1][0], data[i][1][1],
sepbg=False)
for i, (model, signal_yield) in enumerate(zip(xgradboost_models, signal_yields))]
xgradboost_opt_sigs_nosepbg = [result[1] for result in xgradboost_opt_results_nosepbg]
for sig, mass in zip(xgradboost_opt_sigs_nosepbg, masses):
print(f"Z' mass = {mass} GeV --> gradient boosting forests gives optimized sign. of {round(sig, 3)} sigma",
"w/o separated backgrounds")
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")
time_before = getTime(form='m')
xgradboost_opt_results_sepbg = [model.best_threshold(signal_yield, background_yields, data[i][1][0], data[i][1][2],
sepbg=True)
for i, (model, signal_yield) in enumerate(zip(xgradboost_models, signal_yields))]
xgradboost_opt_sigs_sepbg = [result[1] for result in xgradboost_opt_results_sepbg]
for sig, mass in zip(xgradboost_opt_sigs_sepbg, masses):
print(f"Z' mass = {mass} GeV --> gradient boosting gives optimized sign. of {round(sig, 3)} sigma",
"w/ separated backgrounds")
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")
for mass, result, sig_cs in zip(masses, xgradboost_opt_results_nosepbg, sig_css):
print(f"for mass = {mass} GeV, thresh = {round(result[0],6)}, tpr = {round(result[2],5)} and fpr = {round(result[3],9)}",
f"so yield is {int(lumi * sig_cs * conv * result[2])} and sig. sign. is {round(result[1],3)}")
xgradboost_sigs_nosepbg_f = scipy.interpolate.interp1d(masses, np.log10(xgradboost_sigs_nosepbg), kind='cubic')
xgradboost_sigs_sepbg_f = scipy.interpolate.interp1d(masses, np.log10(xgradboost_sigs_sepbg), kind='cubic')
xgradboost_opt_sigs_nosepbg_f = scipy.interpolate.interp1d(masses, np.log10(xgradboost_opt_sigs_nosepbg), kind='cubic')
xgradboost_opt_sigs_sepbg_f = scipy.interpolate.interp1d(masses, np.log10(xgradboost_opt_sigs_sepbg), kind='linear')
import matplotlib.pyplot as plt
with plt.rc_context(get_settings()):
plt.plot([max_mass for i in range(2)], np.logspace(-2, 1.5, 2), label=r'Theoretical $5\sigma$ Limit', c='black')
plt.plot(sample_masses, [5 for m in sample_masses], label=r'5$\sigma$', c='blue')
plt.plot(sample_masses, [1.645 for m in sample_masses], label=r'90% confidence', c='cornflowerblue')
plt.plot(
sample_masses, [10**xgradboost_sigs_nosepbg_f(m) for m in sample_masses],
label='Gradient Boosting (No b.g. sep.)',
c='tab:purple')
plt.plot(
sample_masses, [10**xgradboost_sigs_sepbg_f(m) for m in sample_masses],
label='Gradient Boosting (B.g. sep.)',
c='tab:purple', linestyle=':')
plt.plot(
sample_masses, [10**xgradboost_opt_sigs_nosepbg_f(m) for m in sample_masses],
label='Gradient Boosting (No b.g. sep., opt. thresh.)',
c='tab:purple', linestyle='--')
plt.plot(
sample_masses, [10**xgradboost_opt_sigs_sepbg_f(m) for m in sample_masses],
label='Gradient Boosting (B.g. sep., opt. thresh.)',
c='tab:purple', linestyle='-.')
plt.ylabel('Signal Significance');
plt.xlabel(r"$Z'$ mass $m_{Z'}$ (GeV)");
plt.yscale('log')
plt.legend()
plt.savefig('./images/full_example/grad_boost_sigs', dpi=200, bbox_inches='tight')
As a bonus, let's now compare the performance between machine learning methodologies for, say, the optimized threshold + background separation case (as this is the most realistic option).
import matplotlib.pyplot as plt
with plt.rc_context(get_settings()):
plt.plot([max_mass for i in range(2)], np.logspace(-0.5, 2, 2), label=r'Theoretical $5\sigma$ Limit', c='black')
plt.plot(sample_masses, [5 for m in sample_masses], label=r'5$\sigma$', c='blue')
plt.plot(sample_masses, [1.645 for m in sample_masses], label=r'90% confidence', c='cornflowerblue')
plt.plot(
sample_masses, [10**logreg_opt_sigs_sepbg_f(m) for m in sample_masses],
label='Logistic Regression',
c='red', linestyle='-.')
plt.plot(
sample_masses, [10**randforest_opt_sigs_sepbg_f(m) for m in sample_masses],
label='Random Forests',
c='lime', linestyle='-.')
plt.plot(
sample_masses, [10**xgradboost_opt_sigs_sepbg_f(m) for m in sample_masses],
label='Gradient Boosting',
c='tab:purple', linestyle='-.')
plt.ylabel('Signal Significance');
plt.xlabel(r"$Z'$ mass $m_{Z'}$ (GeV)");
plt.yscale('log')
plt.legend()
plt.savefig('./images/full_example/all_model_sigs', dpi=200, bbox_inches='tight')
Our models differentiate between signal and background using the features we provide, but in the eyes of the models not all features are created equal. Indeed, trained models assign importances to their features: accessing these allows us to understand precisely where our discriminating power comes from.
Our bcml_model
class provides some functionality for accessing and analyzing feature importance, which we'll now explore.
We can access feature importances through the bcml_model.sorted_feature_importance
method (except for when we're doing logistic regression, but more on that in a second). Let's start by constructing a helpful function for printing out the data returned by the aforementioned method.
max_feature_len = max([len(feature) for feature in get47Dfeatures()])
def printFeatureImportances(feature_importances, num=10, hasSign=False):
feature_importances = feature_importances[:num]
for i, (feature, importance) in enumerate(feature_importances):
if hasSign:
sign = '+' if importance >= 0 else '-'
print('{4:2}. {0:{1}} {3}{2:.4f}'.format(feature, max_feature_len + 3, np.abs(importance), sign, i+1))
else:
print('{3:2}. {0:{1}} {2:.4f}'.format(feature, max_feature_len + 3, importance, i+1))
Now we can look at feature importances, starting with our logistic regression model. Unfortunately, while make sk-learn
models have built-in methods for accessing feature importance (which our bcml_model
class exploits), logistic regression lacks this: instead, however, we can just look at the magnitudes of the learned coefficients, which amount to the same thing. Consequently, we'll adapt some of the code in bcml_model.sorted_feature_importance
to explore the logistic regression model coefficients.
features = get47Dfeatures()
lr_feature_importances_raw = [model.model[-1].coef_[0] for model in logreg_models]
ranked_indices = [np.argsort(-np.abs(importances)) for importances in lr_feature_importances_raw]
lr_feature_importances = [
[[features[i], importances[i]] for i in indices] for importances, indices in zip(lr_feature_importances_raw, ranked_indices)]
for mass, feature_importance in zip(masses, lr_feature_importances):
print(f"Z' mass = {mass}")
printFeatureImportances(feature_importance, hasSign=True)
if mass != masses[-1]:
print()
Now let's look at random forests: at this point, we're free to use the built-in function.
rf_feature_importances = [model.sorted_feature_importance(get47Dfeatures()) for model in randforest_models]
for mass, feature_importance in zip(masses, rf_feature_importances):
print(f"Z' mass = {mass}")
printFeatureImportances(feature_importance)
if mass != masses[-1]:
print()
Finally, we have the gradient boosting models.
xgb_feature_importances = [model.sorted_feature_importance(get47Dfeatures()) for model in xgradboost_models]
for mass, feature_importance in zip(masses, xgb_feature_importances):
print(f"Z' mass = {mass}")
printFeatureImportances(feature_importance)
if mass != masses[-1]:
print()
Reading out printed feature importances is fine, but it might be nice to try and visualize how feature importance varies with $m_{Z'}$ for different model types. We can do this with some list manipulation and interpolation as follows, beginning once more with our logistic regression models.
num = 18
lr_features = [[feature for feature, importance in elem] for elem in lr_feature_importances]
lr_best_features = list(set.intersection(*[set(elem[:num]) for elem in lr_features]))
lr_best_features_and_imps = [
[feature, [elem[features.index(feature)][1] for elem, features in zip(lr_feature_importances, lr_features)]]
for feature in lr_best_features]
lr_fs = [scipy.interpolate.interp1d(masses, np.abs(importances), kind='quadratic')
for feature, importances in lr_best_features_and_imps]
This chunk of code works roughly by taking the intersection of the sets of top num
features for the various $Z'$ masses (to get the most important features over all), grabbing the feature importance for each of these most importance features at each $m_{Z'}$ value, then interpolating functions of importance vs $m_{Z'}$ for each of the important features.
sample_masses = np.linspace(masses[0], masses[-1], 500)
with plt.rc_context(get_settings()):
for f, feature in zip(lr_fs, lr_best_features):
plt.plot(sample_masses, f(sample_masses), label=feature)
plt.legend()
plt.xlabel(r"$m_{Z'}$ (GeV)")
plt.xlim(masses[0], masses[-1])
plt.savefig('./images/full_example/log_reg_features', dpi=200, bbox_inches='tight')
Now we proceed onto random forests.
num = 10
rf_features = [[feature for feature, importance in elem] for elem in rf_feature_importances]
rf_best_features = list(set.intersection(*[set(elem[:num]) for elem in rf_features]))
rf_best_features_and_imps = [
[feature, [elem[features.index(feature)][1] for elem, features in zip(rf_feature_importances, rf_features)]]
for feature in rf_best_features]
rf_fs = [scipy.interpolate.interp1d(masses, np.abs(importances), kind='quadratic')
for feature, importances in rf_best_features_and_imps]
sample_masses = np.linspace(masses[0], masses[-1], 500)
with plt.rc_context(get_settings()):
for f, feature in zip(rf_fs, rf_best_features):
plt.plot(sample_masses, f(sample_masses), label=feature)
plt.legend()
plt.xlabel(r"$m_{Z'}$ (GeV)")
plt.xlim(masses[0], masses[-1])
plt.savefig('./images/full_example/rand_forest_features', dpi=200, bbox_inches='tight')
And finally onto gradient boosting.
num = 10
xgb_features = [[feature for feature, importance in elem] for elem in xgb_feature_importances]
xgb_best_features = list(set.intersection(*[set(elem[:num]) for elem in xgb_features]))
xgb_best_features_and_imps = [
[feature, [elem[features.index(feature)][1] for elem, features in zip(xgb_feature_importances, xgb_features)]]
for feature in xgb_best_features]
xgb_fs = [scipy.interpolate.interp1d(masses, np.abs(importances), kind='quadratic')
for feature, importances in xgb_best_features_and_imps]
sample_masses = np.linspace(masses[0], masses[-1], 500)
with plt.rc_context(get_settings()):
for f, feature in zip(xgb_fs, xgb_best_features):
plt.plot(sample_masses, f(sample_masses), label=feature)
plt.yscale('log')
plt.legend()
plt.xlabel(r"$m_{Z'}$ (GeV)")
plt.xlim(masses[0], masses[-1])
plt.savefig('./images/full_example/grad_boost_features', dpi=200, bbox_inches='tight')
Consider the situation at hand. We want to maximize signal significance $\mathcal{S}$, which is a function of TPR, FPR, and $m_{Z'}$ (via $\sigma_s$). Visualizing a function $\mathbb{R}^3 \to \mathbb{R}$ is hard: we're going to have to reduce our dimensionality somehow. First, recall that we are really interested in achieving discovery potential, $5\sigma$. Given a TPR and FPR, there will be a range of $m_{Z'}$ values which yield $\mathcal{S} \geq 5$. More specifically, because cross section scales inversely with $m_{Z'}$, there will be a maximal $m_{Z'}$ for fixed TPR and FPR that yields $\mathcal{S} \geq 5$ (i.e., the mass such that $\mathcal{S} = 5$). We can construct a heatmap which uses color to depict that maximal mass sensitivity as a function of TPR, FPR for the case of the $Z'$ signal cross sections we've been considering.
The cross_section_helper
class allows us to plot this heatmap via the cross_section_helper.max_mass_sens_versus_tpr_fpr
method.
cvalues = [500, 750, 1000, 1500, 2000]
clabels = [str(val) for val in cvalues]
manual = [(0.85, -1), (0.85, -2), (0.85, -3), (0.85, -3.5), (0.85, -6)]
tpr_bounds = (0.05,1)
fpr_bounds = (10**-7, 0.5)
colors = np.linspace(0.3, 0.8, 5)
zp_cs.max_mass_sens_versus_tpr_fpr(plot=True, sig=5, lumi=3000, tpr_bounds=tpr_bounds, fpr_bounds=fpr_bounds, res=1000,
cvalues=cvalues, clabels=clabels, manual=manual);
plt.savefig('./images/full_example/tpr_fpr', dpi=200, bbox_inches='tight')
Recall that varying the threshold on a given model yields a set of (TPR, FPR) pairs: we can interpret these as points sweeping out a curve in TPR-FPR space. Threshold optimization, then, entails selecting the point on the curve which maximizes signal significance. In the language of our above heatmap, this means picking the point on the curve where the mass sensitivity is highest. It is therefore insightful to overlay the TPR-FPR curves which our models induce on top of the heatmap we've just established. Moreover, we will emphasize the points on the curve corresponding to both the default threshold ($0.5$) and the optimized threshold for each model, to provide yet another visualization of the improvement. We will employ this procedure first for our logistic regression models, reusing some of the output of bcml_model.best_threshold
while also exploiting bcml_model.tpr
and bcml_model.fpr
.
lr_default_tprs = [model.tpr(data_row[1][1], preds=data_row[1][0]) for model, data_row in zip(logreg_models, data)]
lr_default_fprs = [model.fpr(data_row[1][1], preds=data_row[1][0]) for model, data_row in zip(logreg_models, data)]
lr_sample_tprs = [
np.linspace(result[-2][0] + 1e-5, result[-2][-1] - 1e-5, 200)
for result in logreg_opt_results_nosepbg]
lr_tpr_fpr_fs = [
scipy.interpolate.interp1d(result[-2], np.log10(np.where(result[-1]==0,1.2e-7,result[-1])), kind='linear')
for result in logreg_opt_results_nosepbg]
lr_optimized_tpr = [result[-4] for result in logreg_opt_results_nosepbg]
lr_optimized_fpr = [result[-3] if result[-3] > 0 else 1.2e-7 for result in logreg_opt_results_nosepbg]
zp_cs.max_mass_sens_versus_tpr_fpr(plot=True, sig=5, lumi=3000, tpr_bounds=tpr_bounds, fpr_bounds=fpr_bounds, res=1000,
cvalues=cvalues, clabels=clabels, manual=manual);
with plt.rc_context(get_settings()):
for result, f, sample_tprs, mass, color in zip(logreg_opt_results_nosepbg, lr_tpr_fpr_fs, lr_sample_tprs, masses, colors):
plt.plot(sample_tprs, f(sample_tprs), c=plt.cm.YlOrRd(color), label=r"$m_{Z'} =" + f'{mass}$ GeV')
plt.scatter(lr_default_tprs, np.log10(lr_default_fprs), c='white', label='Default', zorder=10, marker='X')
plt.scatter(lr_optimized_tpr, np.log10(lr_optimized_fpr), c='white', label='Optimized', zorder=10, marker='*')
plt.legend(bbox_to_anchor=(1.35,1), loc="upper left", facecolor='lightgrey')
plt.xlim(*tpr_bounds);
plt.ylim(*np.log10(fpr_bounds));
plt.yticks(list(plt.yticks()[0]) + [np.log10(1.2e-7)], list(plt.yticks()[1]) + ['0'])
plt.savefig('./images/full_example/log_reg_tpr_fpr', dpi=200, bbox_inches='tight')
It's most insightful to plot the FPR axis using log scale, but this prevents us from visualizing portions of the TPR-FPR curves where $\text{FPR} = 0$, which does happen sometimes. To combat this, we implement a hack: whenever $\text{FPR} = 0$, we bump it up to a small (but non-zero) value, and then manually label that FPR value $0$. This is what you're seeing above.
We now go through the same procedure for our random forests models.
rf_default_tprs = [model.tpr(data_row[1][1], preds=data_row[1][0]) for model, data_row in zip(randforest_models, data)]
rf_default_fprs = [model.fpr(data_row[1][1], preds=data_row[1][0]) for model, data_row in zip(randforest_models, data)]
rf_sample_tprs = [
np.linspace(result[-2][0] + 1e-5, result[-2][-1] - 1e-5, 200)
for result in randforest_opt_results_nosepbg]
rf_tpr_fpr_fs = [
scipy.interpolate.interp1d(result[-2], np.log10(np.where(result[-1]==0,1.2e-7,result[-1])), kind='linear')
for result in randforest_opt_results_nosepbg]
rf_optimized_tpr = [result[-4] for result in randforest_opt_results_nosepbg]
rf_optimized_fpr = [result[-3] if result[-3] > 0 else 1.2e-7 for result in randforest_opt_results_nosepbg]
zp_cs.max_mass_sens_versus_tpr_fpr(plot=True, sig=5, lumi=3000, tpr_bounds=tpr_bounds, fpr_bounds=fpr_bounds, res=1000,
cvalues=cvalues, clabels=clabels, manual=manual);
with plt.rc_context(get_settings()):
for result, f, sample_tprs, mass, color in zip(randforest_opt_results_nosepbg, rf_tpr_fpr_fs, rf_sample_tprs, masses, colors):
plt.plot(sample_tprs, f(sample_tprs), c=plt.cm.YlOrRd(color), label=r"$m_{Z'} =" + f'{mass}$ GeV')
plt.scatter(rf_default_tprs, np.log10(rf_default_fprs), c='white', label='Default', zorder=10, marker='X')
plt.scatter(rf_optimized_tpr, np.log10(rf_optimized_fpr), c='white', label='Optimized', zorder=10, marker='*')
plt.legend(bbox_to_anchor=(1.35,1), loc="upper left", facecolor='lightgrey')
plt.xlim(*tpr_bounds);
plt.ylim(*np.log10(fpr_bounds));
plt.savefig('./images/full_example/rand_forest_tpr_fpr', dpi=200, bbox_inches='tight')
Finally, we have the gradient boosting models.
xgb_default_tprs = [model.tpr(data_row[1][1], preds=data_row[1][0]) for model, data_row in zip(xgradboost_models, data)]
xgb_default_fprs = [model.fpr(data_row[1][1], preds=data_row[1][0]) for model, data_row in zip(xgradboost_models, data)]
xgb_sample_tprs = [
np.linspace(result[-2][0] + 1e-5, result[-2][-1] - 1e-5, 200)
for result in xgradboost_opt_results_nosepbg]
xgb_tpr_fpr_fs = [
scipy.interpolate.interp1d(result[-2], np.log10(np.where(result[-1]==0,1.2e-7,result[-1])), kind='linear')
for result in xgradboost_opt_results_nosepbg]
xgb_optimized_tpr = [result[-4] for result in xgradboost_opt_results_nosepbg]
xgb_optimized_fpr = [result[-3] if result[-3] > 0 else 1.2e-7 for result in xgradboost_opt_results_nosepbg]
zp_cs.max_mass_sens_versus_tpr_fpr(plot=True, sig=5, lumi=3000, tpr_bounds=tpr_bounds, fpr_bounds=fpr_bounds, res=1000,
cvalues=cvalues, clabels=clabels, manual=manual);
with plt.rc_context(get_settings()):
for result, f, sample_tprs, mass, color in zip(xgradboost_opt_results_nosepbg, xgb_tpr_fpr_fs, xgb_sample_tprs, masses, colors):
plt.plot(sample_tprs, f(sample_tprs), c=plt.cm.YlOrRd(color), label=r"$m_{Z'} =" + f'{mass}$ GeV')
plt.scatter(xgb_default_tprs, np.log10(xgb_default_fprs), c='white', label='Default', zorder=10, marker='X')
plt.scatter(xgb_optimized_tpr, np.log10(xgb_optimized_fpr), c='white', label='Optimized', zorder=10, marker='*')
plt.legend(bbox_to_anchor=(1.35,1), loc="upper left", facecolor='lightgrey')
plt.xlim(*tpr_bounds);
plt.ylim(*np.log10(fpr_bounds));
plt.yticks(list(plt.yticks()[0]) + [np.log10(1.2e-7)], list(plt.yticks()[1]) + ['0'])
plt.savefig('./images/full_example/grad_boost_tpr_fpr', dpi=200, bbox_inches='tight')
Statistical analyses of real data at experiments like CMS and ATLAS sometimes make use of so-called "shape-based analyses" where histograms of events are processed bin by bin in a search for signal significance. What follows is some code I produced (using the built-in bcml_model.predict_hist
function) to extract and save bin contents (using the pandas library) for histograms of events where our x-coordinate variable is signal probability, as predicted by each of our models.
logreg_models = [logreg_m350G_model, logreg_m500G_model, logreg_m1T_model, logreg_m2T_model, logreg_m4T_model]
xgradboost_models = [
xgradboost_m350G_model, xgradboost_m500G_model, xgradboost_m1T_model, xgradboost_m2T_model, xgradboost_m4T_model]
background_names = ['Higgs', '4 Tops', 'No Higgs']
labels = ['Signal'] + [f'Background ({name})' for name in background_names]
all_names = [["Logistic Regression (m_Z' = {}) - {}".format(m, label) for label in labels] for m in masses] + \
[["Random Forest (m_Z' = {}) - {}".format(m, label) for label in labels] for m in masses] + \
[["Gradient Boosting (m_Z' = {}) - {}".format(m, label) for label in labels] for m in masses]
num_binss = [50, 100, 250]
dfs = [pd.DataFrame() for n in num_binss]
conv = 10**15 / 10**12 # conv * lumi (in fb^{-1}) * cross sec (in pb) = # of events
lumi = 3000
signal_yields = [conv * lumi * sig_cs for sig_cs in sig_css]
background_yields = [conv * lumi * bg_cs for bg_cs in bg_css]
for num_bins, df in zip(num_binss, dfs):
for model, signal_yield, names in zip(logreg_models + randforest_models + xgradboost_models,
signal_yields * 3,
all_names):
sig_name = names[0]
bg_names = names[1:]
results = model.predict_hist(
num_bins=num_bins, sepbg=True, sig_norm=signal_yield, bg_norm=background_yields)
bin_edges = results[0]
sig_bins = results[1]
bg_binss = results[2:]
df[sig_name] = sig_bins
for bg_name, bg_bins in zip(bg_names, bg_binss):
df[bg_name] = bg_bins
df['Bin Edges'] = pd.Series(bin_edges)
df.to_csv('./csvs/hist_data_{}bins.csv'.format(num_bins))
Our work thus far has been rather idealistic. This is true for a variety of reasons, but to cite a specific example, we are assuming effectively perfect experimental detectors: we are training our models on data which perfectly coincides with the kinematic behavior of (simulated) particles in events. In real life, detectors have associated uncertainties, which manifests itself in "noised" or "smeared" or "perturbed" data. We can gauge how well our models will do in a more realistic situation (i.e., with smeared data) by perturbing each of our data points, which we do now.
perturb = 0.1
from sklearn.model_selection import train_test_split
smeared_data = [
train_test_split(
np.concatenate([sig_data] + [bg_data[:int(len(sig_data)/3)] for bg_data in background_data]),
test_size=0.25, random_state=42)
for sig_data in signal_data]
smeared_data = [[[train[:,:-1], np.where(train[:,-1]==1,np.ones_like(train[:,-1]),np.zeros_like(train[:,-1])), train[:,-1]],
[test[:,:-1], np.where(test[:,-1]==1,np.ones_like(test[:,-1]),np.zeros_like(test[:,-1])), test[:,-1]]]
for train, test in smeared_data]
time_before = getTime(form='m')
for i, elem in enumerate(smeared_data):
smeared_data[i][0][0] = np.reshape(np.concatenate(
[np.random.multivariate_normal(row, np.diag(np.abs(perturb * row))) for row in elem[0][0]], axis=0), (-1,47))
smeared_data[i][1][0] = np.reshape(np.concatenate(
[np.random.multivariate_normal(row, np.diag(np.abs(perturb * row))) for row in elem[1][0]], axis=0), (-1,47))
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")
logreg_m350G_smeared_model = bcml_model(make_pipeline(StandardScaler(), LogisticRegression()))
logreg_m350G_smeared_model.fit(*smeared_data[0][0][:2])
logreg_m500G_smeared_model = bcml_model(make_pipeline(StandardScaler(), LogisticRegression()))
logreg_m500G_smeared_model.fit(*smeared_data[1][0][:2])
logreg_m1T_smeared_model = bcml_model(make_pipeline(StandardScaler(), LogisticRegression()))
logreg_m1T_smeared_model.fit(*smeared_data[2][0][:2])
logreg_m2T_smeared_model = bcml_model(make_pipeline(StandardScaler(), LogisticRegression()))
logreg_m2T_smeared_model.fit(*smeared_data[3][0][:2])
logreg_m4T_smeared_model = bcml_model(make_pipeline(StandardScaler(), LogisticRegression(max_iter=500)))
logreg_m4T_smeared_model.fit(*smeared_data[4][0][:2])
logreg_smeared_models = [logreg_m350G_smeared_model, logreg_m500G_smeared_model, logreg_m1T_smeared_model,
logreg_m2T_smeared_model, logreg_m4T_smeared_model]
print(smeared_data[0][0][0].shape, smeared_data[0][1][0].shape)
time_before = getTime()
logreg_smeared_sigs_nosepbg = [model.significance(signal_yield, background_yield,
tpr=model.tpr(smeared_data[i][1][1], preds=smeared_data[i][1][0]),
fpr=model.fpr(smeared_data[i][1][1], preds=smeared_data[i][1][0]),
sepbg=False)
for i, (model, signal_yield) in enumerate(zip(logreg_smeared_models, signal_yields))]
for sig, mass in zip(logreg_smeared_sigs_nosepbg, masses):
print(f"Z' mass = {mass} GeV --> logistic regression gives significance of {round(sig, 3)} sigma",
"w/o separated backgrounds (smeared data)")
time_after = getTime()
print(f"(Runtime: {time_after - time_before} seconds)")
time_before = getTime()
logreg_smeared_sigs_sepbg = [model.significance(signal_yield, background_yields,
tpr=model.tpr(np.ones((len(indx[i][0]),)), preds=smeared_data[i][1][0][indx[i][0]]),
fpr=[model.fpr(np.zeros((len(indx[i][j]),)), preds=smeared_data[i][1][0][indx[i][j]])
for j in range(1,4)],
sepbg=True)
for i, (model, signal_yield) in enumerate(zip(logreg_smeared_models, signal_yields))]
for sig, mass in zip(logreg_smeared_sigs_sepbg, masses):
print(f"Z' mass = {mass} GeV --> logistic regression gives significance of {round(sig, 3)} sigma",
"w/ separated backgrounds (smeared data)")
time_after = getTime()
print(f"(Runtime: {time_after - time_before} seconds)")
time_before = getTime(form='m')
logreg_smeared_opt_results_nosepbg = [model.best_threshold(signal_yield, background_yield, smeared_data[i][1][0], smeared_data[i][1][1],
sepbg=False)
for i, (model, signal_yield) in enumerate(zip(logreg_smeared_models, signal_yields))]
logreg_smeared_opt_sigs_nosepbg = [result[1] for result in logreg_smeared_opt_results_nosepbg]
for sig, mass in zip(logreg_smeared_opt_sigs_nosepbg, masses):
print(f"Z' mass = {mass} GeV --> logistic regression gives optimized sign. of {round(sig, 3)} sigma",
"w/o separated backgrounds (smeared data)")
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")
time_before = getTime(form='m')
logreg_smeared_opt_results_sepbg = [model.best_threshold(signal_yield, background_yields, smeared_data[i][1][0], smeared_data[i][1][2],
sepbg=True)
for i, (model, signal_yield) in enumerate(zip(logreg_smeared_models, signal_yields))]
logreg_smeared_opt_sigs_sepbg = [result[1] for result in logreg_smeared_opt_results_sepbg]
for sig, mass in zip(logreg_smeared_opt_sigs_sepbg, masses):
print(f"Z' mass = {mass} GeV --> logistic regression gives optimized sign. of {round(sig, 3)} sigma",
"w/ separated backgrounds (smeared data)")
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")
time_before = getTime(form='m')
randforest_m350G_smeared_model = bcml_model(
make_pipeline(StandardScaler(), RandomForestClassifier(n_estimators=250, n_jobs=-1)));
randforest_m350G_smeared_model.fit(*smeared_data[0][0][:2])
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")
time_before = getTime(form='m')
randforest_m500G_smeared_model = bcml_model(
make_pipeline(StandardScaler(), RandomForestClassifier(n_estimators=250, n_jobs=-1)));
randforest_m500G_smeared_model.fit(*smeared_data[1][0][:2])
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")
time_before = getTime(form='m')
randforest_m1T_smeared_model = bcml_model(
make_pipeline(StandardScaler(), RandomForestClassifier(n_estimators=250, n_jobs=-1)));
randforest_m1T_smeared_model.fit(*smeared_data[2][0][:2])
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")
time_before = getTime(form='m')
randforest_m2T_smeared_model = bcml_model(
make_pipeline(StandardScaler(), RandomForestClassifier(n_estimators=250, n_jobs=-1)));
randforest_m2T_smeared_model.fit(*smeared_data[3][0][:2])
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")
time_before = getTime(form='m')
randforest_m4T_smeared_model = bcml_model(
make_pipeline(StandardScaler(), RandomForestClassifier(n_estimators=250, n_jobs=-1)));
randforest_m4T_smeared_model.fit(*smeared_data[4][0][:2])
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")
randforest_smeared_models = [randforest_m350G_smeared_model, randforest_m500G_smeared_model, randforest_m1T_smeared_model,
randforest_m2T_smeared_model, randforest_m4T_smeared_model]
time_before = getTime()
randforest_smeared_sigs_nosepbg = [model.significance(signal_yield, background_yield,
tpr=model.tpr(smeared_data[i][1][1], preds=smeared_data[i][1][0]),
fpr=model.fpr(smeared_data[i][1][1], preds=smeared_data[i][1][0]),
sepbg=False)
for i, (model, signal_yield) in enumerate(zip(randforest_smeared_models, signal_yields))]
for sig, mass in zip(randforest_smeared_sigs_nosepbg, masses):
print(f"Z' mass = {mass} GeV --> random forests gives significance of {round(sig, 3)} sigma",
"w/o separated backgrounds (smeared data)")
time_after = getTime()
print(f"(Runtime: {time_after - time_before} seconds)")
time_before = getTime()
randforest_smeared_sigs_sepbg = [model.significance(signal_yield, background_yields,
tpr=model.tpr(np.ones((len(indx[i][0]),)), preds=smeared_data[i][1][0][indx[i][0]]),
fpr=[model.fpr(np.zeros((len(indx[i][j]),)), preds=smeared_data[i][1][0][indx[i][j]])
for j in range(1,4)],
sepbg=True)
for i, (model, signal_yield) in enumerate(zip(randforest_smeared_models, signal_yields))]
for sig, mass in zip(randforest_smeared_sigs_sepbg, masses):
print(f"Z' mass = {mass} GeV --> random forests gives significance of {round(sig, 3)} sigma",
"w/ separated backgrounds (smeared data)")
time_after = getTime()
print(f"(Runtime: {time_after - time_before} seconds)")
time_before = getTime(form='m')
randforest_smeared_opt_results_nosepbg = [model.best_threshold(signal_yield, background_yield, smeared_data[i][1][0], smeared_data[i][1][1],
sepbg=False)
for i, (model, signal_yield) in enumerate(zip(randforest_smeared_models, signal_yields))]
randforest_smeared_opt_sigs_nosepbg = [result[1] for result in randforest_smeared_opt_results_nosepbg]
for sig, mass in zip(randforest_smeared_opt_sigs_nosepbg, masses):
print(f"Z' mass = {mass} GeV --> random forests gives optimized sign. of {round(sig, 3)} sigma",
"w/o separated backgrounds (smeared data)")
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")
time_before = getTime(form='m')
randforest_smeared_opt_results_sepbg = [model.best_threshold(signal_yield, background_yields, smeared_data[i][1][0], smeared_data[i][1][2],
sepbg=True)
for i, (model, signal_yield) in enumerate(zip(randforest_smeared_models, signal_yields))]
randforest_smeared_opt_sigs_sepbg = [result[1] for result in randforest_smeared_opt_results_sepbg]
for sig, mass in zip(randforest_smeared_opt_sigs_sepbg, masses):
print(f"Z' mass = {mass} GeV --> random forests gives optimized sign. of {round(sig, 3)} sigma",
"w/ separated backgrounds (smeared data)")
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")
time_before = getTime(form='m')
xgradboost_m350G_smeared_model = bcml_model(
make_pipeline(StandardScaler(),
XGBClassifier(n_estimators=250, max_depth=7, learning_rate=0.1, nthread=4, use_label_encoder=False)));
xgradboost_m350G_smeared_model.fit(*smeared_data[0][0][:2])
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")
time_before = getTime(form='m')
xgradboost_m500G_smeared_model = bcml_model(
make_pipeline(StandardScaler(),
XGBClassifier(n_estimators=250, max_depth=7, learning_rate=0.1, nthread=4, use_label_encoder=False)));
xgradboost_m500G_smeared_model.fit(*smeared_data[1][0][:2])
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")
time_before = getTime(form='m')
xgradboost_m1T_smeared_model = bcml_model(
make_pipeline(StandardScaler(),
XGBClassifier(n_estimators=250, max_depth=7, learning_rate=0.1, nthread=4, use_label_encoder=False)));
xgradboost_m1T_smeared_model.fit(*smeared_data[2][0][:2])
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")
time_before = getTime(form='m')
xgradboost_m2T_smeared_model = bcml_model(
make_pipeline(StandardScaler(),
XGBClassifier(n_estimators=250, max_depth=7, learning_rate=0.1, nthread=4, use_label_encoder=False)));
xgradboost_m2T_smeared_model.fit(*smeared_data[3][0][:2])
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")
time_before = getTime(form='m')
xgradboost_m4T_smeared_model = bcml_model(
make_pipeline(StandardScaler(),
XGBClassifier(n_estimators=250, max_depth=7, learning_rate=0.1, nthread=4, use_label_encoder=False)));
xgradboost_m4T_smeared_model.fit(*smeared_data[4][0][:2])
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")
xgradboost_smeared_models = [xgradboost_m350G_smeared_model, xgradboost_m500G_smeared_model, xgradboost_m1T_smeared_model,
xgradboost_m2T_smeared_model, xgradboost_m4T_smeared_model]
time_before = getTime()
xgradboost_smeared_sigs_nosepbg = [model.significance(signal_yield, background_yield,
tpr=model.tpr(smeared_data[i][1][1], preds=smeared_data[i][1][0]),
fpr=model.fpr(smeared_data[i][1][1], preds=smeared_data[i][1][0]),
sepbg=False)
for i, (model, signal_yield) in enumerate(zip(xgradboost_smeared_models, signal_yields))]
for sig, mass in zip(xgradboost_smeared_sigs_nosepbg, masses):
print(f"Z' mass = {mass} GeV --> gradient boosting gives significance of {round(sig, 3)} sigma",
"w/o separated backgrounds (smeared data)")
time_after = getTime()
print(f"(Runtime: {time_after - time_before} seconds)")
time_before = getTime()
xgradboost_smeared_sigs_sepbg = [model.significance(signal_yield, background_yields,
tpr=model.tpr(np.ones((len(indx[i][0]),)), preds=smeared_data[i][1][0][indx[i][0]]),
fpr=[model.fpr(np.zeros((len(indx[i][j]),)), preds=smeared_data[i][1][0][indx[i][j]])
for j in range(1,4)],
sepbg=True)
for i, (model, signal_yield) in enumerate(zip(xgradboost_smeared_models, signal_yields))]
for sig, mass in zip(xgradboost_smeared_sigs_sepbg, masses):
print(f"Z' mass = {mass} GeV --> gradient boosting gives significance of {round(sig, 3)} sigma",
"w/ separated backgrounds (smeared data)")
time_after = getTime()
print(f"(Runtime: {time_after - time_before} seconds)")
time_before = getTime(form='m')
xgradboost_smeared_opt_results_nosepbg = [model.best_threshold(signal_yield, background_yield, smeared_data[i][1][0], smeared_data[i][1][1],
sepbg=False)
for i, (model, signal_yield) in enumerate(zip(xgradboost_smeared_models, signal_yields))]
xgradboost_smeared_opt_sigs_nosepbg = [result[1] for result in xgradboost_smeared_opt_results_nosepbg]
for sig, mass in zip(xgradboost_smeared_opt_sigs_nosepbg, masses):
print(f"Z' mass = {mass} GeV --> gradient boosting gives optimized sign. of {round(sig, 3)} sigma",
"w/o separated backgrounds (smeared data)")
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")
time_before = getTime(form='m')
xgradboost_smeared_opt_results_sepbg = [model.best_threshold(signal_yield, background_yields, smeared_data[i][1][0], smeared_data[i][1][2],
sepbg=True)
for i, (model, signal_yield) in enumerate(zip(xgradboost_smeared_models, signal_yields))]
xgradboost_smeared_opt_sigs_sepbg = [result[1] for result in xgradboost_smeared_opt_results_sepbg]
for sig, mass in zip(xgradboost_smeared_opt_sigs_sepbg, masses):
print(f"Z' mass = {mass} GeV --> gradient boosting gives optimized sign. of {round(sig, 3)} sigma",
"w/ separated backgrounds (smeared data)")
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")
logreg_smeared_opt_sigs_sepbg_f = scipy.interpolate.interp1d(
masses, np.log10(logreg_smeared_opt_sigs_sepbg), kind='linear')
randforest_smeared_opt_sigs_sepbg_f = scipy.interpolate.interp1d(
masses, np.log10(randforest_smeared_opt_sigs_sepbg), kind='linear')
xgradboost_smeared_opt_sigs_sepbg_f = scipy.interpolate.interp1d(
masses, np.log10(xgradboost_smeared_opt_sigs_sepbg), kind='linear')
import matplotlib.pyplot as plt
with plt.rc_context(get_settings()):
plt.plot([max_mass for i in range(2)], np.logspace(-1, 2, 2), label=r'Theoretical $5\sigma$ Limit', c='black')
plt.plot(sample_masses, [5 for m in sample_masses], label=r'5$\sigma$', c='blue')
plt.plot(sample_masses, [1.645 for m in sample_masses], label=r'90% confidence', c='cornflowerblue')
plt.plot(
sample_masses, [10**logreg_smeared_opt_sigs_sepbg_f(m) for m in sample_masses],
label='Log. Reg. (Smeared)',
c='red', linestyle='--')
plt.plot(
sample_masses, [10**randforest_smeared_opt_sigs_sepbg_f(m) for m in sample_masses],
label='Rand. Forest (Smeared)',
c='lime', linestyle='--')
plt.plot(
sample_masses, [10**xgradboost_smeared_opt_sigs_sepbg_f(m) for m in sample_masses],
label='Grad. Boost. (Smeared)',
c='tab:purple', linestyle='--')
plt.plot(
sample_masses, [10**logreg_opt_sigs_sepbg_f(m) for m in sample_masses],
label='Logistic Regression',
c='red', linestyle=':')
plt.plot(
sample_masses, [10**randforest_opt_sigs_sepbg_f(m) for m in sample_masses],
label='Random Forests',
c='lime', linestyle=':')
plt.plot(
sample_masses, [10**xgradboost_opt_sigs_sepbg_f(m) for m in sample_masses],
label='Gradient Boosting',
c='tab:purple', linestyle=':')
plt.ylabel('Signal Significance');
plt.xlabel(r"$Z'$ mass $m_{Z'}$ (GeV)");
plt.yscale('log')
plt.legend(bbox_to_anchor=(1.02,1), loc="upper left")
plt.savefig('./images/full_example/smeared_sigs', dpi=200, bbox_inches='tight')
sig_smeared_data = [sd[0][0][np.where(sd[0][1]==1)[0]] for sd in smeared_data]
with plt.rc_context(get_settings()):
plt.hist(signal_data[0][:,names.index('M b1 b2')], density=True, bins=1000, histtype='step', linewidth=3, label=r"Signal ($m_{Z'} = 350$ GeV)");
plt.hist(sig_smeared_data[0][:,names.index('M b1 b2')], density=True, bins=1000, histtype='step', linewidth=3, label=r"Signal ($m_{Z'} = 350$ GeV, smeared)");
plt.xlim(300,400)
# plt.ylim(10**-5,10**-1.5)
# plt.legend(bbox_to_anchor=(1.02,1), loc="upper left")
plt.legend()
plt.xlabel(r'$m(b_1,b_2)$ (GeV)')
plt.savefig('./images/full_example/mb1b2_smeared', dpi=200, bbox_inches='tight')
with plt.rc_context(get_settings()):
plt.hist(signal_data[0][:,names.index('M b1 b3')], density=True, bins=1000, histtype='step', linewidth=3, label=r"Signal ($m_{Z'} = 350$ GeV)");
plt.hist(sig_smeared_data[0][:,names.index('M b1 b3')], density=True, bins=1000, histtype='step', linewidth=3, label=r"Signal ($m_{Z'} = 350$ GeV, smeared)");
plt.xlim(200,500)
# plt.ylim(10**-5,10**-1.5)
plt.legend(bbox_to_anchor=(1.02,1), loc="upper left")
with plt.rc_context(get_settings()):
plt.hist(signal_data[0][:,names.index('M j1 j2')], density=True, bins=400, histtype='step', linewidth=3, label=r"Signal ($m_{Z'} = 350$ GeV)");
plt.hist(sig_smeared_data[0][:,names.index('M j1 j2')], density=True, bins=400, histtype='step', linewidth=3, label=r"Signal ($m_{Z'} = 350$ GeV, smeared)");
plt.xlim(70,90)
# plt.ylim(10**-5,10**-1.5)
# plt.legend(bbox_to_anchor=(1.02,1), loc="upper left")
plt.legend()
plt.xlabel(r'$m(j_1,j_2)$ (GeV)')
plt.savefig('./images/full_example/mj1j2_smeared', dpi=200, bbox_inches='tight')
with plt.rc_context(get_settings()):
plt.hist(signal_data[0][:,names.index('pT b1')], density=True, bins=400, histtype='step', linewidth=3, label=r"Signal ($m_{Z'} = 350$ GeV)");
plt.hist(sig_smeared_data[0][:,names.index('pT b1')], density=True, bins=400, histtype='step', linewidth=3, label=r"Signal ($m_{Z'} = 350$ GeV, smeared)");
plt.xlim(0,500)
plt.legend(bbox_to_anchor=(1.02,1), loc="upper left")
The machine learning approach to optimizing signal significance in a phenomenological problem can be motivated through a comparison with the performance of a more manual approach. The manual_opt
file contains functions which enable analyses of event selection criteria optimization, which we explore as follows and ultimately compare with our machine learning-based results.
sigss = signal_data
bgss = background_data
names = get47Dfeatures()
First, let's see what signal significance looks like without making any cuts at all.
for mass, signal_yield in zip(masses, signal_yields):
print(
f"For Z' mass = {mass}, we have a default signal significance of",
f"{round(manual_significance(signal_yield, background_yields, 1, [1]*3), 4)}")
Now we'll ramp up our methodology and use the opt_sig
function, which takes in data (signal and background), features, and a list of cuts and iteratively scans for cut combinations that maximize signal significance.
time_before = getTime(form='m')
cuts = [[names.index('M b1 b2'), True, np.linspace(0,400,5)],
[names.index('M b1 b3'), True, np.linspace(0,400,5)],
[names.index('M j1 j2'), False, np.append(np.linspace(100,400,4), 10000)],
[names.index('pT b1'), True, np.linspace(0,500,5)]]
point, sig = opt_sig(sigss[0], bgss, signal_yields[0], background_yields, cuts)
print(f"Cutting at {point} gives maximum significance of {sig} (Z' mass = {masses[0]} GeV)")
time_after = getTime(form='m')
print(f"Runtime: {round(time_after - time_before,3)} minutes")
time_before = getTime(form='m')
cuts = [[names.index('M b1 b2'), True, np.linspace(200,600,5)],
[names.index('M b1 b3'), True, np.linspace(0,400,5)],
[names.index('M j1 j2'), False, np.append(np.linspace(100,400,4), 10000)],
[names.index('pT b1'), True, np.linspace(0,500,5)]]
point, sig = opt_sig(sigss[1], bgss, signal_yields[1], background_yields, cuts)
print(f"Cutting at {point} gives maximum significance of {sig} (Z' mass = {masses[1]} GeV)")
time_after = getTime(form='m')
print(f"Runtime: {round(time_after - time_before,3)} minutes")
time_before = getTime(form='m')
cuts = [[names.index('M b1 b2'), True, np.linspace(600,1200,5)],
[names.index('M b1 b3'), True, np.linspace(0,400,5)],
[names.index('M j1 j2'), False, np.append(np.linspace(100,400,4), 10000)],
[names.index('pT b1'), True, np.linspace(200,600,5)]]
point, sig = opt_sig(sigss[2], bgss, signal_yields[2], background_yields, cuts)
print(f"Cutting at {point} gives maximum significance of {sig} (Z' mass = {masses[2]} GeV)")
time_after = getTime(form='m')
print(f"Runtime: {round(time_after - time_before,3)} minutes")
time_before = getTime(form='m')
cuts = [[names.index('M b1 b2'), True, np.linspace(1200,2200,5)],
[names.index('M b1 b3'), True, np.linspace(200,600,5)],
[names.index('M j1 j2'), False, np.append(np.linspace(100,400,4), 10000)],
[names.index('pT b1'), True, np.linspace(600,1000,5)]]
point, sig = opt_sig(sigss[3], bgss, signal_yields[3], background_yields, cuts)
print(f"Cutting at {point} gives maximum significance of {sig} (Z' mass = {masses[3]} GeV)")
time_after = getTime(form='m')
print(f"Runtime: {round(time_after - time_before,3)} minutes")
time_before = getTime(form='m')
cuts = [[names.index('M b1 b2'), True, np.linspace(2000,4000,5)],
[names.index('M b1 b3'), True, np.linspace(0,50,5)],
[names.index('M j1 j2'), False, np.append(np.linspace(100,500,4), 10000)],
[names.index('pT b1'), True, np.linspace(0,300,5)]]
point, sig = opt_sig(sigss[4], bgss, signal_yields[4], background_yields, cuts)
print(f"Cutting at {point} gives maximum significance of {sig} (Z' mass = {masses[4]} GeV)")
time_after = getTime(form='m')
print(f"Runtime: {round(time_after - time_before,3)} minutes")
Having found significances manually, let's compile our results, interpolate, and compare with the performance of our machine learning models.
manual_default_sigs = [
7.248418610012742, 2.757897361779127, 0.2855952155740879, 0.014759138609928545, 0.000260774522132171]
manual_opt_sigs = [
20.965001032944777, 13.882260799836876, 6.847931300231083, 3.4044571732873736, 1.0127201854783396]
manual_default_sigs_f = scipy.interpolate.interp1d(masses, np.log10(manual_default_sigs), kind='cubic')
manual_opt_sigs_f = scipy.interpolate.interp1d(masses, np.log10(manual_opt_sigs), kind='cubic')
import matplotlib.pyplot as plt
with plt.rc_context(get_settings()):
plt.plot([max_mass for i in range(2)], np.logspace(-2.5, 2, 2), label=r'Theoretical $5\sigma$ Limit', c='black')
plt.plot(sample_masses, [5 for m in sample_masses], label=r'5$\sigma$', c='blue')
plt.plot(sample_masses, [1.645 for m in sample_masses], label=r'90% confidence', c='cornflowerblue')
plt.plot(
sample_masses, [10**logreg_opt_sigs_sepbg_f(m) for m in sample_masses],
label='Logistic Regression',
c='red')
plt.plot(
sample_masses, [10**randforest_opt_sigs_sepbg_f(m) for m in sample_masses],
label='Random Forests',
c='lime')
plt.plot(
sample_masses, [10**xgradboost_opt_sigs_sepbg_f(m) for m in sample_masses],
label='Gradient Boosting',
c='tab:purple')
plt.plot(
sample_masses, [10**manual_default_sigs_f(m) for m in sample_masses],
label='Default',
c='black', linestyle='-.')
plt.plot(
sample_masses, [10**manual_opt_sigs_f(m) for m in sample_masses],
label='Manual',
c='black')
plt.ylabel('Signal Significance');
plt.xlabel(r"$Z'$ mass $m_{Z'}$ (GeV)");
plt.yscale('log')
plt.ylim(10**-2, 10**2)
plt.legend(bbox_to_anchor=(1.02,1), loc="upper left")
plt.savefig('./images/full_example/all_models_manual_sigs', dpi=200, bbox_inches='tight')
include_cuts = False
with plt.rc_context(get_settings()):
plt.hist(signal_data[0][:,names.index('M b1 b2')], density=True, bins=100, histtype='step', log=True, linewidth=3, label=r"Signal ($m_{Z'} = 350$ GeV)");
plt.hist(signal_data[2][:,names.index('M b1 b2')], density=True, bins=100, histtype='step', log=True, linewidth=3, label=r"Signal ($m_{Z'} = 1$ TeV)");
for bg_data, name in zip(background_data, filename_bgs):
plt.hist(bg_data[:,names.index('M b1 b2')], density=True, bins=100, histtype='step', log=True, linewidth=3, label=f"Background ({name})");
if include_cuts:
plt.plot([325]*2, [10**-5, 10**3], label=r"Cut ($m_{Z'} = 350$ GeV)", c='tab:blue')
plt.plot([975]*2, [10**-5, 10**3], label=r"Cut ($m_{Z'} = 1$ TeV)", c='tab:orange')
plt.xlim(0,1300)
plt.ylim(10**-5,10**-1.5)
plt.legend(bbox_to_anchor=(1.02,1), loc="upper left")
plt.xlabel(r'$m(b_1, b_2)$ (GeV)')
if include_cuts:
plt.savefig('./images/full_example/mb1b2_with_cuts', dpi=200, bbox_inches='tight')
else:
plt.savefig('./images/full_example/mb1b2', dpi=200, bbox_inches='tight')
include_cuts = False
with plt.rc_context(get_settings()):
plt.hist(signal_data[0][:,names.index('M b1 b3')], density=True, bins=100, histtype='step', log=True, linewidth=3, label=r"Signal ($m_{Z'} = 350$ GeV)");
plt.hist(signal_data[2][:,names.index('M b1 b3')], density=True, bins=100, histtype='step', log=True, linewidth=3, label=r"Signal ($m_{Z'} = 1$ TeV)");
for bg_data, name in zip(background_data, filename_bgs):
plt.hist(bg_data[:,names.index('M b1 b3')], density=True, bins=100, histtype='step', log=True, linewidth=3, label=f"Background ({name})");
if include_cuts:
plt.plot([225]*2, [10**-5, 10**3], label=r"Cut ($m_{Z'} = 350$ GeV)", c='tab:blue')
plt.plot([300]*2, [10**-5, 10**3], label=r"Cut ($m_{Z'} = 1$ TeV)", c='tab:orange')
plt.xlim(0,1300)
plt.ylim(10**-5,10**-1.5)
plt.legend(bbox_to_anchor=(1.02,1), loc="upper left")
plt.xlabel(r'$m(b_1, b_3)$ (GeV)')
if include_cuts:
plt.savefig('./images/full_example/mb1b3_with_cuts', dpi=200, bbox_inches='tight')
else:
plt.savefig('./images/full_example/mb1b3', dpi=200, bbox_inches='tight')
include_cuts = False
with plt.rc_context(get_settings()):
plt.hist(signal_data[0][:,names.index('M j1 j2')], density=True, bins=100, histtype='step', log=True, linewidth=3, label=r"Signal ($m_{Z'} = 350$ GeV)");
plt.hist(signal_data[2][:,names.index('M j1 j2')], density=True, bins=100, histtype='step', log=True, linewidth=3, label=r"Signal ($m_{Z'} = 1$ TeV)");
for bg_data, name in zip(background_data, filename_bgs):
bins = 100 if name is not '4t' else 1000
plt.hist(bg_data[:,names.index('M j1 j2')], density=True, bins=bins, histtype='step', log=True, linewidth=3, label=f"Background ({name})");
if include_cuts:
plt.plot([150]*2, [10**-5, 10**3], label=r"Cut ($m_{Z'} = 350$ GeV, $1$ TeV)", c='tab:blue')
plt.xlim(0,200)
plt.ylim(10**-4,10**-0.25)
plt.legend(bbox_to_anchor=(1.02,1), loc="upper left")
plt.xlabel(r'$m(j_1, j_2)$ (GeV)')
if include_cuts:
plt.savefig('./images/full_example/mj1j2_with_cuts', dpi=200, bbox_inches='tight')
else:
plt.savefig('./images/full_example/mj1j2', dpi=200, bbox_inches='tight')
include_cuts = False
with plt.rc_context(get_settings()):
plt.hist(signal_data[0][:,names.index('pT b1')], density=True, bins=100, histtype='step', log=True, linewidth=3, label=r"Signal ($m_{Z'} = 350$ GeV)");
plt.hist(signal_data[2][:,names.index('pT b1')], density=True, bins=100, histtype='step', log=True, linewidth=3, label=r"Signal ($m_{Z'} = 1$ TeV)");
for bg_data, name in zip(background_data, filename_bgs):
plt.hist(bg_data[:,names.index('pT b1')], density=True, bins=100, histtype='step', log=True, linewidth=3, label=f"Background ({name})");
if include_cuts:
plt.plot([187]*2, [10**-5, 10**3], label=r"Cut ($m_{Z'} = 350$ GeV)", c='tab:blue')
plt.plot([450]*2, [10**-5, 10**3], label=r"Cut ($m_{Z'} = 1$ TeV)", c='tab:orange')
plt.xlim(0,800)
plt.ylim(10**-5,10**-1.5)
plt.legend(bbox_to_anchor=(1.02,1), loc="upper left")
plt.xlabel(r'$p_T(b_1)$ (GeV)')
if include_cuts:
plt.savefig('./images/full_example/ptb1_with_cuts', dpi=200, bbox_inches='tight')
else:
plt.savefig('./images/full_example/ptb1', dpi=200, bbox_inches='tight')
xgradboost_m350G_model = refresh_model(xgradboost_m350G_model)
xgradboost_m500G_model = refresh_model(xgradboost_m500G_model)
xgradboost_m1T_model = refresh_model(xgradboost_m1T_model)
xgradboost_m2T_model = refresh_model(xgradboost_m2T_model)
xgradboost_m4T_model = refresh_model(xgradboost_m4T_model)
xgradboost_models = [xgradboost_m350G_model, xgradboost_m500G_model, xgradboost_m1T_model, xgradboost_m2T_model, xgradboost_m4T_model]
time_before = getTime(form='m')
couplings = np.linspace(1,10,5)
full_opt_results = [[
model.best_threshold(signal_yield * coupling**2, background_yields, data[i][1][0], data[i][1][2], sepbg=True)
for i, (model, signal_yield) in enumerate(zip(xgradboost_models, signal_yields))] for coupling in couplings]
full_opt_sigs = np.array([[result[1] for result in row] for row in full_opt_results])
print('masses:', masses)
print()
print('couplings:', np.linspace(1,10,5))
print()
print(full_opt_sigs)
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")
masses, sig_css, bg_css = get_elijah_ttbarzp_cs()
zp_cs_objects = [cross_section_helper(masses, np.array(sig_css) * coupling**2, bg_css, mass_units='GeV') for coupling in couplings]
masses = [350, 500, 1000, 2000, 4000]
max_masses = [float(obj.absolute_max_mass_sens()) for obj in zp_cs_objects]
res = 500
ext = [masses[0], masses[-1], couplings[0], couplings[-1]]
pts = np.array([[m, c] for m in masses for c in couplings])
sigs = np.transpose(full_opt_sigs).flatten()
grid = np.array(
[[[x,y] for x in np.linspace(ext[0], ext[1], res)]
for y in np.linspace(ext[2], ext[3], res)])
interp_sigs = scipy.interpolate.griddata(pts, np.log10(sigs), grid, method='cubic')
with plt.rc_context(get_settings()):
plt.imshow(interp_sigs, origin='lower', aspect='auto', extent=ext, cmap='winter')
cbar = plt.colorbar()
plt.xlabel(r"$m_{Z'}$ (GeV)")
plt.ylabel("Coupling")
cbar.set_label(r"Signal Significance", rotation=90)
cbar_ticks = np.linspace(0,3,7)
cbar.set_ticks(cbar_ticks)
cbar.set_ticklabels([r"$10^{"+str(num)+"}$" for num in cbar_ticks])
cvalues = [np.log10(5)]
clabels = [r'$5\sigma$']
result = plt.contour(grid[:,:,0], grid[:,:,1], interp_sigs, cvalues, colors='white',
linewidths=3, linestyles='dashed')
fmt = {lev:lab for lev, lab in zip(result.levels, clabels)}
plt.clabel(result, result.levels, inline=True, fmt=fmt, fontsize=32)
# plt.plot(max_masses, couplings, c='w')
plt.xlim(masses[0], masses[-1])
plt.ylim(couplings[0], couplings[-1])
plt.savefig('./images/full_example/mass_coupling_sensitivity', dpi=200, bbox_inches='tight')
sig1p5T = import47Ddata('sig1p5T')
bg1T = data[2][1][0][np.array(data[2][1][1]) == 0]
bg2T = data[3][1][0][np.array(data[3][1][1]) == 0]
test1p5_1T_preds = np.concatenate((sig1p5T, bg1T))
test1p5_2T_preds = np.concatenate((sig1p5T, bg2T))
test1p5_1T_labels = np.concatenate((np.ones_like(sig1p5T[:,0]), np.array(data[2][1][2])[np.array(data[2][1][1]) == 0]))
test1p5_2T_labels = np.concatenate((np.ones_like(sig1p5T[:,0]), np.array(data[3][1][2])[np.array(data[3][1][1]) == 0]))
xgboost_1T_result = xgradboost_m1T_model.best_threshold(
signal_yields[2], background_yields, test1p5_1T_preds, test1p5_1T_labels, sepbg=True)
xgboost_2T_result = xgradboost_m2T_model.best_threshold(
signal_yields[3], background_yields, test1p5_2T_preds, test1p5_2T_labels, sepbg=True)
print(f"Gradient boosting (Z' mass 1 TeV) has significance {round(xgboost_1T_result[1],3)} on Z' mass 1.5 TeV sample")
print(f"Gradient boosting (Z' mass 2 TeV) has significance {round(xgboost_2T_result[1],3)} on Z' mass 1.5 TeV sample")
print(f"Interpolating gradient boosting significance yields {round(10**xgradboost_opt_sigs_sepbg_f(1500),3)}",
"for Z' mass 1.5 TeV")
print(len(test1p5_1T_preds))
print(len(test1p5_2T_preds))
xlabels = [
r'$p_T(b_1)$',
r'$p_T(b_2)$',
r'$p_T(b_3)$',
r'$p_T(b_4)$',
r'$|\Delta\eta(b_1, b_2)|$',
r'$|\Delta\eta(b_1, b_3)|$',
r'$|\Delta\eta(b_1, b_4)|$',
r'$|\Delta\eta(b_2, b_3)|$',
r'$|\Delta\eta(b_2, b_4)|$',
r'$|\Delta\eta(b_3, b_4)|$',
r'$|\Delta\phi(b_1, b_2)|$',
r'$|\Delta\phi(b_1, b_3)|$',
r'$|\Delta\phi(b_1, b_4)|$',
r'$|\Delta\phi(b_2, b_3)|$',
r'$|\Delta\phi(b_2, b_4)|$',
r'$|\Delta\phi(b_3, b_4)|$',
r'$\Delta R(b_1, b_2)$',
r'$\Delta R(b_1, b_3)$',
r'$\Delta R(b_1, b_4)$',
r'$\Delta R(b_2, b_3)$',
r'$\Delta R(b_2, b_4)$',
r'$\Delta R(b_3, b_4)$',
r'MET',
r'$p_T(\ell)$',
r'$m_T(l, MET)$',
r'$m(b_1, b_2)$',
r'$m(b_1, b_3)$',
r'$m(b_1, b_4)$',
r'$m(b_2, b_3)$',
r'$m(b_2, b_4)$',
r'$m(b_3, b_4)$',
r'$m_T(b_1, \ell, MET)$',
r'$m_T(b_2, \ell, MET)$',
r'$m_T(b_3, \ell, MET)$',
r'$m_T(b_4, \ell, MET)$',
r'$m_T(j_1, j_2)$',
r'$p_T(j_1)$',
r'$p_T(j_2)$',
r'$\Delta R(j_1, j_2)$',
r'$\Delta R(b_1, \ell)$',
r'$\Delta R(b_2, \ell)$',
r'$\Delta R(b_3, \ell)$',
r'$\Delta R(b_4, \ell)$',
r'$|\Delta\phi(b_1, \ell)|$',
r'$|\Delta\phi(b_2, \ell)|$',
r'$|\Delta\phi(b_3, \ell)|$',
r'$|\Delta\phi(b_4, \ell)|$',
]
import matplotlib.pyplot as plt
histo_path = "/Users/sheridan/Academic/bcml4pheno/images/full_example/for_paper/kin_histos/"
xlims = ([[0, 1000]] * 4 # pt
+ [[-4, 4]] * 6 # sdEta
+ [[-6.5, 6.5]] * 6 # sdPhi
+ [[0, 6]] * 6 # dR
+ [[0, 500]] * 3 # misc. kinematic
+ [[0, 1500]] * 6 # invariant mass
+ [[0, 600]] * 7 # misc. kinematic
+ [[0, 6]] * 5
+ [[-6.5, 6.5]] * 4
)
ylims = ([None] * 4
+ [[10**-3, 0.5]] * 6 # sdEta
+ [None] * (47 - 6 - 4)
)
legend_spot = (['upper right'] * 4
+ ['lower center'] * 18
+ ['upper right'] * 9
+ ['lower right'] * 5
+ ['upper right'] * 2
+ ['lower center'] * 9
)
bg_labels_short = [r"No Higgs B.g. ($pp \; \to \; t\overline{t}b\overline{b} \; \backslash \; h$)",
r"Higgs B.g. ($pp \; \to \; t\overline{t}h$)",
r"4 Tops B.g. ($pp \; \to \; t\overline{t}t\overline{t}$)"]
with plt.rc_context(get_settings()):
for N, (name, xlabel, xlim, ylim, loc) in enumerate(zip(get47Dfeatures(), xlabels, xlims, ylims, legend_spot)):
plt.hist(
signal_data[0][:,N], density=True, bins=1000, histtype='step', linewidth=3, label=r"Signal ($m_{Z'} = 350$ GeV)");
plt.hist(
signal_data[2][:,N], density=True, bins=1000, histtype='step', linewidth=3, label=r"Signal ($m_{Z'} = 1$ TeV)");
for j, label in enumerate(bg_labels_short):
plt.hist(
background_data[j][:,N], density=True, bins=1000, histtype='step', linewidth=3, label=r"{}".format(label));
plt.xlim(*xlim)
if ylim:
plt.ylim(*ylim)
plt.yscale('log')
# plt.legend(bbox_to_anchor=(1.02,1), loc="upper left")
plt.legend(loc=loc)
plt.xlabel(xlabel)
plt.ylabel('a.u.')
plt.savefig(histo_path + name.strip() + '.pdf', dpi=200, bbox_inches='tight', transparent=True, format='pdf')
plt.clf()
def genText(models, fileprefix, sig_yields, bg_yields, doPlots=False):
hist_signal_yields = signal_yields # look into changing?
background_names = ['higgs', 'four_tops', 'no_higgs']
num_binss = [100, 100, 100, 40, 40]
# num_binss = [1, 1, 1, 1, 1]
ms = [350, 500, 1000, 2000]
for i, (m, X, model, num_bins, signal_yield) in enumerate(zip(ms, data, models, num_binss, sig_yields)):
# generate histos
bin_edges, sig_bins, bg_bins1, bg_bins2, bg_bins3 = model.predict_hist(
X[1][0], X[1][2], num_bins=num_bins, sepbg=True, sig_norm=np.array(signal_yield), bg_norm=np.array(bg_yields),
yAxisUnits='events/bin width')
bg_bins = [bg_bins1, bg_bins2, bg_bins3]
# save .txt files
txt_path = "/Users/sheridan/Academic/bcml4pheno/images/full_example/for_paper/ml_histos/"
np.savetxt(txt_path + f'{fileprefix}{m}GeV_gradboost_binedges.txt', bin_edges, delimiter='\n')
np.savetxt(txt_path + f'{fileprefix}{m}GeV_gradboost_signal.txt', sig_bins, delimiter='\n')
for name, bg_bin in zip(background_names, bg_bins):
np.savetxt(txt_path + f'{fileprefix}{m}GeV_gradboost_background_{name}.txt', bg_bin, delimiter='\n')
# print area under curve
print(np.dot(np.diff(bin_edges), sig_bins))
print(signal_yield)
print(np.diff(bin_edges)[0])
print()
# make plots
if doPlots:
with plt.rc_context(get_settings()):
plt.figure(i)
plt.bar(bin_edges[:-1], sig_bins, width=np.diff(bin_edges), alpha=0.75, align="edge", label="Signal")
for bg_bin, name in zip(bg_bins, background_names):
plt.bar(
bin_edges[:-1], bg_bin, width=np.diff(bin_edges), alpha=0.75, align="edge", label=f"Background ({name})")
plt.yscale('log')
plt.legend()
sig_yield_13TeV_gq1 = np.array([0.03284,
0.01253,
0.001285,
6.6e-5]) * 3000 * conv * 0.445 * (0.7)**4 # xsec * lumi * branch ratio * b-tag efficiency
sig_yield_13TeV_gq0 = np.array([0.021576,
0.007379,
0.000502,
1.07e-5]) * 3000 * conv * 0.445 * (0.7)**4
sig_yield_14TeV_gq1 = np.array([0.04053,
0.01567,
0.001681,
9.33e-5]) * 3000 * conv * 0.445 * (0.7)**4
sig_yield_14TeV_gq0 = np.array([0.02714,
0.009557,
0.000703,
1.7e-5]) * 3000 * conv * 0.445 * (0.7)**4
masses, sig_css, bg_css = get_elijah_ttbarzp_cs() # ONLY BG CS IS USED HERE!
bg_yields = [conv * 3000 * bg_cs * (0.7)**4 for bg_cs in bg_css]
prefixes = ['13TeV_gq1_', '13TeV_gq0_', '14TeV_gq1_', '14TeV_gq0_']
sig_yields = [sig_yield_13TeV_gq1, sig_yield_13TeV_gq0, sig_yield_14TeV_gq1, sig_yield_14TeV_gq0]
for i, (prefix, sig_yield) in enumerate(zip(prefixes, sig_yields)):
# if i == 0:
genText(xgradboost_models, 'TEST' + prefix, sig_yield, bg_yields)
bg_css
sig_yields
bg_yields
conv
import matplotlib.pyplot as plt
with plt.rc_context(get_settings()):
plt.plot([max_mass for i in range(2)], np.logspace(-0.5, 2, 2), label=r'Theoretical $5\sigma$ Limit', c='black')
plt.plot(sample_masses, [5 for m in sample_masses], label=r'5$\sigma$', c='blue')
plt.plot(sample_masses, [1.645 for m in sample_masses], label=r'90% confidence', c='cornflowerblue')
plt.plot(
sample_masses, [10**logreg_opt_sigs_sepbg_f(m) for m in sample_masses],
label='Logistic Regression',
c='red', linestyle='-.')
plt.plot(
sample_masses, [10**randforest_opt_sigs_sepbg_f(m) for m in sample_masses],
label='Random Forests',
c='lime', linestyle='-.')
plt.plot(
sample_masses, [10**xgradboost_opt_sigs_sepbg_f(m) for m in sample_masses],
label='Gradient Boosting',
c='tab:purple', linestyle='-.')
plt.ylabel('Signal Significance');
plt.xlabel(r"$Z'$ mass $m_{Z'}$ (GeV)");
plt.yscale('log')
plt.legend()
plt.savefig('./images/full_example/for_paper/sig_plots/sigs', dpi=200, bbox_inches='tight')
import matplotlib.pyplot as plt
with plt.rc_context(get_settings()):
plt.plot([max_mass for i in range(2)], np.logspace(-0.5, 2, 2), label=r'Theoretical $5\sigma$ Limit', c='black')
plt.plot(sample_masses, [5 for m in sample_masses], label=r'5$\sigma$', c='blue')
plt.plot(sample_masses, [1.645 for m in sample_masses], label=r'90% confidence', c='cornflowerblue')
plt.plot(
sample_masses, [10**logreg_opt_sigs_sepbg_f(m)/2 for m in sample_masses],
label='Logistic Regression',
c='red', linestyle='-.')
plt.plot(
sample_masses, [10**randforest_opt_sigs_sepbg_f(m)/2 for m in sample_masses],
label='Random Forests',
c='lime', linestyle='-.')
plt.plot(
sample_masses, [10**xgradboost_opt_sigs_sepbg_f(m)/2 for m in sample_masses],
label='Gradient Boosting',
c='tab:purple', linestyle='-.')
plt.ylabel('Signal Significance');
plt.xlabel(r"$Z'$ mass $m_{Z'}$ (GeV)");
plt.yscale('log')
plt.legend()
plt.savefig('./images/full_example/for_paper/sig_plots/sigs_significance_onehalf', dpi=200, bbox_inches='tight')
logreg_m350G_model = refresh_model(logreg_m350G_model)
logreg_m500G_model = refresh_model(logreg_m500G_model)
logreg_m1T_model = refresh_model(logreg_m1T_model)
logreg_m2T_model = refresh_model(logreg_m2T_model)
logreg_m4T_model = refresh_model(logreg_m4T_model)
randforest_m350G_model = refresh_model(randforest_m350G_model)
randforest_m500G_model = refresh_model(randforest_m500G_model)
randforest_m1T_model = refresh_model(randforest_m1T_model)
randforest_m2T_model = refresh_model(randforest_m2T_model)
randforest_m4T_model = refresh_model(randforest_m4T_model)
xgradboost_m350G_model = refresh_model(xgradboost_m350G_model)
xgradboost_m500G_model = refresh_model(xgradboost_m500G_model)
xgradboost_m1T_model = refresh_model(xgradboost_m1T_model)
xgradboost_m2T_model = refresh_model(xgradboost_m2T_model)
xgradboost_m4T_model = refresh_model(xgradboost_m4T_model)
xgradboost_models = [
xgradboost_m350G_model, xgradboost_m500G_model, xgradboost_m1T_model, xgradboost_m2T_model, xgradboost_m4T_model]
time_before = getTime(form='m')
xgradboost_m350G_model_N500 = bcml_model(
make_pipeline(StandardScaler(),
XGBClassifier(n_estimators=500, max_depth=7, learning_rate=0.1, nthread=4, use_label_encoder=False)));
xgradboost_m350G_model_N500.fit(*data[0][0][:2])
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")
time_before = getTime(form='m')
xgradboost_m350G_model_MD10 = bcml_model(
make_pipeline(StandardScaler(),
XGBClassifier(n_estimators=250, max_depth=10, learning_rate=0.1, nthread=4, use_label_encoder=False)));
xgradboost_m350G_model_MD10.fit(*data[0][0][:2])
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")
time_before = getTime(form='m')
xgradboost_m350G_model_LRpt01 = bcml_model(
make_pipeline(StandardScaler(),
XGBClassifier(n_estimators=250, max_depth=7, learning_rate=0.01, nthread=4, use_label_encoder=False)));
xgradboost_m350G_model_LRpt01.fit(*data[0][0][:2])
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")
time_before = getTime(form='m')
xgradboost_m350G_model_N500_MD10 = bcml_model(
make_pipeline(StandardScaler(),
XGBClassifier(n_estimators=500, max_depth=10, learning_rate=0.1, nthread=4, use_label_encoder=False)));
xgradboost_m350G_model_N500_MD10.fit(*data[0][0][:2])
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")
time_before = getTime(form='m')
xgradboost_m350G_model_N500_LRpt01 = bcml_model(
make_pipeline(StandardScaler(),
XGBClassifier(n_estimators=500, max_depth=7, learning_rate=0.01, nthread=4, use_label_encoder=False)));
xgradboost_m350G_model_N500_LRpt01.fit(*data[0][0][:2])
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")
time_before = getTime()
model = xgradboost_m350G_model_N500_LRpt01
M = 0
mass = 350
result = model.best_threshold(signal_yields[M], background_yields, data[M][1][0], data[M][1][2], sepbg=True)
time_after = getTime()
print(f"(Runtime: {time_after - time_before} seconds)")
print(f"for mass = {mass} GeV, thresh = {round(result[0],6)}, tpr = {round(result[2],5)} and",
f"fpr = {[round(elem,9) for elem in result[3]]}",
f"so yield is {int(lumi * sig_css[M] * conv * result[2])} and sig. sign. is {round(result[1],3)}")
print(sig_css)
END NEW
with plt.rc_context(get_settings()):
for elem in hists_gbt_m350G:
width = 0.5*(elem[0][1]-elem[0][0])
plt.bar(elem[0][:-1], elem[1], width=width)
plt.yscale('log')
xgradboost_m350G_model_subsamplept5 = bcml_model(
make_pipeline(StandardScaler(),
XGBClassifier(n_estimators=250, max_depth=7, learning_rate=0.1, nthread=4,
subsample=0.5, use_label_encoder=False)));
xgradboost_m350G_model_subsamplept5.fit(*data[0][0][:2])
time_after = getTime(form='m')
print(f"(Runtime: {round(time_after - time_before,3)} minutes)")