Bayesian parameter inference of the binding energy

In this tutorial we will use the data from Garcia & Phillips 2011. In this beautiful paper they measured the fold-change in gene expression due to the presence of the transcriptional repressor LacI for different repressor binding sites (operators).
Using equilibrium statistical mechanics, the authors derived a simple expression that describes this phenomenon as a function of some of the "natural variables" of the problem, such as the repressor copy number per cell $R$ and its binding energy to the operator $\Delta\varepsilon_r$. The equation used in this paper is of the form
$$
\text{fold-change} = \left(1 + \frac{R}{N_{NS}} e^{-\beta \Delta\varepsilon_r}\right)^{-1},
$$
where $N_{NS}$ represents the number of non-specific (NS) binding sites where the repressor can bind and $\beta \equiv 1/k_BT$.
In the paper the authors built E. coli strains with different ribosomal binding sites for the repressor gene so that they could vary the mean copy number of this transcription factor in the cell. They then determined this mean copy number experimentally using quantitative western blots, and measured the gene expression fold-change with respect to a reference strain with no repressors present. Our job for this tutorial will be to use these pairs of measurements to infer the repressor binding energy $\Delta\varepsilon_r$.
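Later in the tutorial we will use a helper function foldChange that implements the equation above. Its exact implementation is not shown here, but a minimal sketch consistent with the equation (assuming the energy is given in units of $k_BT$ and taking $N_{NS} = 4.6\times10^6$, the number of base pairs in the E. coli genome) could look like this:
function fc = foldChange(R, epsilonR)
    % Simple-repression fold-change as a function of the repressor copy
    % number R and the repressor-operator binding energy epsilonR (in kBT).
    NNS = 4.6E6;  % assumed number of non-specific binding sites (E. coli genome size in bp)
    fc = 1 ./ (1 + R ./ NNS .* exp(-epsilonR));
end %function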

Exploratory data analysis.

The first step in every analysis should always be an exploration of the data. This means getting a sense not only of what the data is, but of how it is structured. The way in which you format your data can have a huge impact on how you perform your analysis. It is highly recommended to keep your data in what is called a "tidy data frame." We have already made this data set tidy, meaning that each row contains a single measurement along with all the relevant information for that measurement. This will make it much easier to manipulate our data.
% Read the data frame containing the experimental measurements
df = readtable('Garcia2011_AbsoluteSinglelacZ_wFC.csv');
% Display 3 specific columns of the data set
df(1:5,{'operator', 'repressors', 'fold_change'})
ans = 5×3 table
         operator    repressors    fold_change
    1    'Oid'          610           0.0002
    2    'O1'           610           0.0013
    3    'O2'           610           0.0061
    4    'Oid'           11           0.0067
    5    'O1'            11           0.0378
The file contains other relevant columns, but the ones we just printed to the screen are the important ones for what we are planning to do.
A powerful feature of having tidy data frames is that we can very easily group our data by different criteria. For example, let's say I want to generate a log-log plot of R vs. fold-change for the different operators. All I have to do is find the entries for each of the unique operators and plot them in a simple for loop. Let's implement that.
% Find unique operators
operators = unique(df.operator)
operators = 4×1 cell array
{'O1' }
{'O2' }
{'O3' }
{'Oid'}
Before performing the loop to plot the data, let's show how we can use the operator column to extract all the data points for a particular operator. Let's say I'm interested in the operator O2. To extract all the data for this operator, all I have to do is find which rows contain O2 in the operator column and then use boolean indexing to extract the other columns where this condition is true. Boolean indexing is one of the most powerful tools for working with your data if it is tidy. What this means is that if I want only measurements of a specific type (time points before a certain moment, a specific cell type, a list of drug concentrations, a specific date, or whatever other feature of the data) I just have to go row by row and ask "does this row satisfy my condition?", and if it does, I extract that row. In this case, since we have a column that specifies the operator, we will go row by row and extract the ones that have the operator we want to work with.
It used to be the case that one couldn't use boolean operators such as == with strings in MATLAB, and one had to use a function such as strcmp in order to perform this boolean comparison. So to extract the rows containing O2 I would have to write something like
O2Rows = strcmp(df.operator, 'O2');
But since version 2017a, when double-quoted string arrays were introduced, we can use the normal boolean comparison operator directly. Let's find which rows contain data for O2 and then use those true/false indices to separate out this data.
% Generate a boolean array (i.e. True and False) for where the column operator contains O2 as an entry.
O2Rows = df.operator == "O2";
% Use the boolean array to extract the data for O2 from the general data frame.
% We will extract only the columns we care about to keep it simple.
dfO2 = df(O2Rows, {'operator', 'repressors', 'fold_change'});
% Display the first 5 rows
dfO2(1:5,:)
ans = 5×3 table
         operator    repressors    fold_change
    1    'O2'           610           0.0061
    2    'O2'            11           0.1790
    3    'O2'             0           1.0000
    4    'O2'           610           0.0073
    5    'O2'            11           0.1575
Great! We just extracted all of the O2 data. Now we can literally just take the repressors and fold_change columns and plot what we were looking for.
% Plot repressor vs fold-change on a log-log plot.
loglog(dfO2.repressors, dfO2.fold_change, 'o')
% Label axis
xlabel('repressors per cell')
ylabel('fold-change')
% Change the font size
set(gca, 'FontSize', 16)
Another powerful thing about tidy data frames is that we can compute interesting quantities over groups. For example, let's say that we wanted to plot the mean and standard deviation of the experimental fold-change rather than all of the data points. We would need to group by repressor copy number, compute both the mean and standard deviation, and then plot these as points with error bars. MATLAB has a built-in function to compute statistics from a table by grouping with respect to some column. Let's use it to compute both the mean and standard deviation. The syntax is as follows
grpstats(table, column used to group, list of statistics to compute, extra arguments)
Let's then group by repressor copy number and compute the mean and standard deviation of the fold_change column.
% Group by repressor copy number and compute the mean fold-change and the standard deviation
statsO2 = grpstats(dfO2, 'repressors', {'mean','std'}, 'DataVars', 'fold_change')
statsO2 = 7×4 table
           repressors    GroupCount    mean_fold_change    std_fold_change
    0           0            11             1.0000              0
    11         11             4             0.1765              0.0143
    30         30             4             0.0453              0.0024
    62         62             4             0.0221              0.0042
    130       130             4             0.0125              0.0013
    610       610             4             0.0059              0.0011
    870       870             4             0.0047              0.0005
Having computed both the mean and standard deviation for each repressor copy number we can now plot this with error bars.
% Plot mean +- std with error bars
errorbar(statsO2.repressors, statsO2.mean_fold_change, statsO2.std_fold_change,...
'ko', 'Linewidth', 1.5)
% Set both axis scale to log
set(gca,'xscale','log')
set(gca,'yscale','log')
% Label axis
xlabel('repressors per cell')
ylabel('fold-change')
% Change font size
set(gca, 'FontSize', 16)
Our error bars appear asymmetric on this log plot, which is a hint that something is off; for now we will ignore this caveat. I just want to highlight that if you ever plot something with error bars on a log scale, you should display the error bars of the log of the quantity you are plotting, not the linear error bars. Keep that in mind whenever you plan to do this.
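As a sketch of what the more careful version could look like (this is not part of the original analysis; the log_fold_change column and the statsO2log table are names introduced here for illustration), one can compute the mean and standard deviation of log10(fold-change) within each group and convert them into asymmetric linear error bars:
% Add a (hypothetical) column with the log10 of the fold-change
dfO2.log_fold_change = log10(dfO2.fold_change);
% Group by repressor copy number, computing statistics of the log10 fold-change
statsO2log = grpstats(dfO2, 'repressors', {'mean','std'}, 'DataVars', 'log_fold_change');
% Convert the symmetric error in log space into asymmetric linear error bars
fcMean = 10.^statsO2log.mean_log_fold_change;
fcLower = fcMean - 10.^(statsO2log.mean_log_fold_change - statsO2log.std_log_fold_change);
fcUpper = 10.^(statsO2log.mean_log_fold_change + statsO2log.std_log_fold_change) - fcMean;
% Plot the geometric mean with the asymmetric error bars on a log-log scale
errorbar(statsO2log.repressors, fcMean, fcLower, fcUpper, 'ko', 'Linewidth', 1.5)
set(gca,'xscale','log')
set(gca,'yscale','log')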
Finally, before getting into the real objective of this tutorial, let's exemplify the power of tidy data frames by reproducing this same plot for all operators available in the data frame. This time we will use grpstats to group by two criteria: operator and repressor copy number. We will then compute the mean and standard deviation of the fold_change column for each group. In a single step, with no nasty for loops or anything like that, we will be able to plot all of the different operators because of the tidiness of our data!
% Compute the mean and standard deviation of the fold-change grouping by both operator and repressors columns
statsOperator = grpstats(df, {'operator', 'repressors'}, {'mean','std'}, 'DataVars', 'fold_change');
% Display first 5 rows
statsOperator(1:5, :)
ans = 5×5 table
               operator    repressors    GroupCount    mean_fold_change    std_fold_change
    Oid_0      'Oid'            0            11             1.0000             0
    Oid_11     'Oid'           11             4             0.0076             6.6052e-04
    Oid_30     'Oid'           30             4             0.0016             2.0450e-04
    Oid_62     'Oid'           62             4             0.0011             3.1438e-04
    Oid_130    'Oid'          130             4             0.0006             9.6439e-05
As far as I know (and I could be completely wrong), MATLAB doesn't have built-in functionality to plot data automatically grouped by some criterion. But this is not an issue for us. We will simply use a for loop to go through each of the individual operators and plot the mean and standard deviation of the fold-change for each one.
% loop through each operator plotting the fold-change with error bars
figure
hold on
for i=1:length(operators)
% Find rows for the ith operator
% NOTE: we have to convert the operator into a string type in order to use
% the == boolean operator
opRows = statsOperator.operator == string(operators(i));
% Extract data for that particular operator
opData = statsOperator(opRows, :);
% Plot fold-change with error bars
errorbar(opData.repressors, opData.mean_fold_change,...
opData.std_fold_change, 'o', 'Linewidth', 1.5)
end %for
% Change both axes to log scale
set(gca,'xscale','log')
set(gca,'yscale','log')
% Label axis
xlabel('repressors per cell')
ylabel('fold-change')
% Create a legend
legend(operators, 'Location', 'southwest')
% Change font size
set(gca, 'FontSize', 16)
% Set box around plot
box on
hold off

Introduction to Bayesian Regression

This section is based on Justin Bois' fantastic data analysis course at Caltech.

We highly recommend reading his introduction to Bayesian parameter estimation to learn in more detail about what we present here. Due to time constraints we will just go through a short primer on the topic.
As we mentioned during lecture, for Bayesians probability is a measure of plausibility of any statement. For parameter estimation we want to find the probability distribution of the value of our parameters of interest; i.e., as Bayesians we will never say something like "this is the value of the parameter we measured in our experiment, with 100% certainty," but rather we report the probability distribution of the values that these parameters can take.
Since the parameter we are trying to infer given some data is the binding energy, this can be written in terms of Bayes' theorem as
$$
P(\Delta\varepsilon_r \mid D) = \frac{P(D \mid \Delta\varepsilon_r)\, P(\Delta\varepsilon_r)}{P(D)},
$$
where $D$ represents our data set. Recall that each term in Bayes' theorem has a name commonly used in data analysis: $P(\Delta\varepsilon_r \mid D)$ is the posterior, $P(D \mid \Delta\varepsilon_r)$ is the likelihood, $P(\Delta\varepsilon_r)$ is the prior, and $P(D)$ is the evidence.
But note that for our purpose of finding the posterior probability of the binding energy given the data, the evidence doesn't depend on this parameter. So for this case it is just a normalization constant that we can ignore, giving us
$$
P(\Delta\varepsilon_r \mid D) \propto P(D \mid \Delta\varepsilon_r)\, P(\Delta\varepsilon_r).
$$
As researchers we have to make decisions about what functional forms the likelihood and the prior will take. One of the things I like the most about Bayesian analysis is that every single assumption that goes into the analysis is transparent. Every researcher performing statistical analysis is making assumptions about the data, even when they don't know it. For example, if you have ever used a Student's t-test you have implicitly assumed that your data is Gaussian distributed. But in the Bayesian framework we are in control of all of these assumptions, making the analysis much clearer.
Let's go ahead and list all of the assumptions that go into our analysis.

The likelihood

The likelihood, as the name indicates, tells us how likely it is that we would have observed our data given a certain parameter value. If the binding energy is completely off from the most likely value, this term will be really small, indicating that under our assumptions the data we obtained would be very unlikely to be generated by our model with that parameter value. On the other hand, if the binding energy is close to the most likely value, this term will be large, indicating that this specific choice of binding energy could plausibly have generated the data we observed.
Let's list the assumptions that go into this term.

Each datum is independent of the others

The first safe assumption we will make is that each datum in our data set is completely independent of the others. Given the laws of probability we have the "mantra" that independence of events means multiplying probabilities. Mathematically, this means we will write the likelihood as
$$
P(D \mid \Delta\varepsilon_r) = \prod_{i=1}^n P(d^{(i)} \mid \Delta\varepsilon_r),
$$
where $d^{(i)}$ represents the $i$-th datum and $\prod_{i=1}^n$ is read as the product over each datum in the data set $D$.

Gaussian likelihood for the deviations from our theoretical predictions

The likelihood contains all the information that we have on how the parameter we are trying to infer relates to our experiment. We can classify likelihoods into two categories depending on the nature of the model we have for the data:
  1. The likelihood models the deviations from the model given some deterministic relationship between parameters and data.
  2. The likelihood itself could be the model if our relationship between parameters and data is probabilistic. For example if my model is itself a probability distribution of how I expect to find my data I could use that distribution directly as a model.
Our model falls into the first category. We have a deterministic relationship between binding energy and fold-change; i.e., our model doesn't tell us anything about the deviations from the mean fold-change. If we were interested in these deviations we would have to use a different theoretical framework such as chemical master equations (look at this interesting review by Jane Kondev for more on their use). This means that our likelihood has to describe the distribution of deviations of the experimental data from our theoretical prediction. Note that for this we are assuming that the model indeed describes the data. Without getting into the philosophy of what it means for a model to be "correct," let's assume that the equation for the fold-change is an accurate description of the phenomenon.
Since our likelihood will describe deviations from the theoretical prediction, we can say that every experimental fold-change we measured is of the form
$$
\text{fc}_{\exp} = \text{fc}_{\text{theo}}(R, \Delta\varepsilon_r) + e,
$$
where the subscripts exp and theo indicate experimental and theoretical fold-changes, respectively, and $e$ is the error or deviation from the expected theoretical prediction. If we solve for $e$ we find that
$$
e = \text{fc}_{\exp} - \text{fc}_{\text{theo}}(R, \Delta\varepsilon_r),
$$
where we made the dependence on the binding energy explicit. That means that for a given value of our parameter we will use a statistical model for the difference between our theoretical predictions and our experimental data rather than for the experimental data directly.
For these deviations from the theoretical prediction it is common to use a Gaussian distribution, given that the powerful central limit theorem tells us that processes which involve many subprocesses tend to be Gaussian distributed (we are giving a very superficial explanation of this; for more information we highly recommend Sivia's outstanding book). Therefore, since we assume each data point is independent of the others, our likelihood will be given by
$$
P(D \mid \Delta\varepsilon_r, \sigma) = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \prod_{i=1}^n \exp\left[-\frac{\left(\text{fc}_{\exp}^{(i)} - \text{fc}_{\text{theo}}(R^{(i)}, \Delta\varepsilon_r)\right)^2}{2\sigma^2}\right],
$$
where the superscript $(i)$ indicates the index of the datum, i.e. each experimental fold-change is related to a single theoretical prediction given some repressor copy number $R^{(i)}$. Notice that we added an extra parameter σ to this Gaussian likelihood. This is a nuisance parameter that describes how the data deviates from the theoretical model; but since we are not interested in its actual value, we can at the end integrate it out of our posterior probability.
Now that we have our likelihood ready let's move on to the other term, the prior.

The prior

The prior, which now includes the extra parameter σ, captures the information we had about these parameters before performing the experiment (hence the name prior). This is probably the single most controversial aspect of Bayesian statistics. Traditional statisticians claim that we can bias the outcome of our inference to be whatever we want given some arbitrary prior. But as a Bayesian, I think the prior is a natural extension of how we do science: it is important to acknowledge all of our previous knowledge. There are other arguments one can make about the nature of the prior, but we will not go into them here.
The topic of choosing priors for your parameters is an entire area in and of itself. For our short introduction we will consider two types of parameters: location parameters and scale parameters.
Location parameters
This type of parameter, as the name indicates, dictates the location of the distribution. Think, for example, of a Gaussian distribution: the mean μ is a location parameter that sets where the bell-shaped curve lives. Let's look at a graphical example of this by plotting Gaussian curves with different means μ.
% Define a range of x values to evaluate the Gaussian PDF
x = linspace(-5, 5, 100);
% Define different mean values
mu = [0, 1.5, -0.5];
% Plot all curves on the same axes
figure
hold on
% Loop through the different means
for i=1:length(mu)
% Compute the probability density function (PDF) for a Gaussian distribution
pdf = normpdf(x, mu(i), 1);
% Plot the PDF
plot(x, pdf, 'Linewidth', 1.5)
end %for
% Label axis
xlabel('x')
ylabel('P(x)')
% Change font size
set(gca, 'FontSize', 16)
% Set box around plot
box on
hold off
Technically speaking, these parameters are said to have translation invariance. We can exemplify this with an example borrowed from Sivia's book. Let's imagine that you are trying to determine the position of the lighthouse near Nobska beach with respect to the Gemma (the MBL research boat), which is anchored at some position. All you are given is a series of estimates of the position $X$. The probability that the lighthouse lies in the infinitesimally small range between $X$ and $X + dX$ is given by
$$
P(X)\, dX.
$$
But what if they told you that the data you were given was actually shifted by a distance $x_0$, because they didn't account for the length of the boat itself or some other mistake that could have happened? That would mean that every single datum $X$ you were given was actually $X + x_0$. That shouldn't affect the probability of the lighthouse being located in a small range, since all we did was translate all the points together by the same amount (that's where the translation invariance comes from). What this implies is that
$$
P(X)\, dX = P(X + x_0)\, d(X + x_0).
$$
Since $x_0$ is a constant, we have that $d(X + x_0) = dX$. Therefore it must be true that $P(X) = P(X + x_0)$, and this can only be possible if $P(X)$ is a constant for all values of $X$. That's why location parameters are given what is called a uniform prior, meaning that before performing the measurement the parameter is equally likely to take any value within the allowed span.
Scale parameters
Scale parameters usually determine the width of the distribution. Going back to the graphical example we had before with the Gaussian curves, the standard deviation σ is a scale parameter, since shifting its value does not translate the position of the curve but changes the width of the distribution. Again, let's quickly plot an example for different values of the standard deviation.
% Define a range of x values to evaluate the Gaussian PDF
x = linspace(-5, 5, 100);
% Define different standard deviation values
sigma = [1, 1.5, 0.5];
% Plot all curves on the same axes
figure
hold on
% Loop through each of the values for the standard deviation
for i=1:length(sigma)
% Compute the PDF for a Gaussian distribution
pdf = normpdf(x, 0, sigma(i));
% Plot the PDF
plot(x, pdf, 'Linewidth', 1.5)
end %for
% Label axis
xlabel('x')
ylabel('P(x)')
% Change font size
set(gca, 'FontSize', 16)
% Set box around plot
box on
hold off
These parameters are said to have scale invariance. We can again use the lighthouse example to explain what this means. Let's say that I took the data in what Jane Kondev likes to call "communist units," for example meters. You naively thought that I was giving you "freedom units," i.e. feet, so the numbers I was giving you couldn't be right. When we figure out why you were having such a hard time with the data, we realize that all you needed was to multiply each datum $X$ by a constant β to convert back to feet. But if I analyze my data in meters and you analyze it in feet, it should be true that over equivalent small intervals the probabilities we obtain are the same. We write that mathematically as
$$
P(X)\, dX = P(\beta X)\, d(\beta X),
$$
where again β is the conversion factor between units (our scale parameter). Since β is a constant it follows that $d(\beta X) = \beta\, dX$. That implies that
$$
P(X) = \beta P(\beta X).
$$
Only if we set $P(X)$ to be proportional to $1/X$ will this equality be satisfied. You can see that by substituting $P(X) = 1/X$:
$$
\beta P(\beta X) = \beta \cdot \frac{1}{\beta X} = \frac{1}{X} = P(X).
$$
This is called the Jeffreys' prior and is the most common uninformative prior for scale parameters.
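One way to see why this prior is considered uninformative for a scale parameter (a short aside, not part of the original derivation): written as a differential, the Jeffreys prior is equivalent to a flat prior on the logarithm of the parameter,
$$
P(X)\, dX \propto \frac{dX}{X} = d(\ln X),
$$
so it assigns equal probability to equal multiplicative ranges of the parameter, regardless of the units we choose.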
Having given a short introduction to selection of prior probabilities let's choose a prior for both of our parameters.

The prior of each parameter is indpendent of the other parameter.

Since we chose a Gaussian likelihood we had to add an extra parameter σ to our posterior. Given that there is no clear connection I can think of between this new parameter and the binding energy, it is safe to assume that they are independent. This allows us to apply the mantra of independence again and write
$$
P(\Delta\varepsilon_r, \sigma) = P(\Delta\varepsilon_r)\, P(\sigma).
$$

Uniform prior for the binding energy

If we look at our likelihood again we note that $\Delta\varepsilon_r$ appears in the numerator of the expression in the exponential. It is therefore acting as a location parameter, so we will assign it a uniform prior
$$
P(\Delta\varepsilon_r) = \begin{cases} \dfrac{1}{\Delta\varepsilon_{\max} - \Delta\varepsilon_{\min}} & \text{for } \Delta\varepsilon_{\min} \leq \Delta\varepsilon_r \leq \Delta\varepsilon_{\max}, \\ 0 & \text{otherwise}, \end{cases}
$$
where $\Delta\varepsilon_{\min}$ and $\Delta\varepsilon_{\max}$ represent the boundaries that we set for our binding energy. The selection of these boundaries should make physical/biological sense. But since we will be computing a non-normalized posterior distribution, this is just a constant that we will end up absorbing into the proportionality constant.

Jeffreys' prior for σ

Again, if we stare at the likelihood we immediately recognize that σ appears in the denominator inside the exponential (and in the prefactor), so it is acting as a scale parameter. That means that we will assign it a Jeffreys prior, i.e.
$$
P(\sigma) \propto \frac{1}{\sigma}.
$$

The full posterior probability

Putting our likelihood and our priors together, the posterior probability looks like
$$
P(\Delta\varepsilon_r, \sigma \mid D) \propto \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \prod_{i=1}^n \exp\left[-\frac{\left(\text{fc}_{\exp}^{(i)} - \text{fc}_{\text{theo}}(R^{(i)}, \Delta\varepsilon_r)\right)^2}{2\sigma^2}\right] \cdot \frac{1}{\sigma} \cdot \frac{1}{\Delta\varepsilon_{\max} - \Delta\varepsilon_{\min}}.
$$
There are a couple of algebraic manipulations we can do. First of all, σ does not carry an index $i$, which allows us to take it out of the product as a factor raised to the $n$-th power. To see this more clearly, notice that $\prod_{i=1}^n a = a^n$ for a constant $a$.
Then we note that we can send the constants and the prior for the energy into the proportionality constant. With this we can rewrite the posterior as
$$
P(\Delta\varepsilon_r, \sigma \mid D) \propto \frac{1}{\sigma^{n+1}} \prod_{i=1}^n \exp\left[-\frac{\left(\text{fc}_{\exp}^{(i)} - \text{fc}_{\text{theo}}(R^{(i)}, \Delta\varepsilon_r)\right)^2}{2\sigma^2}\right].
$$

Removing the nuisance parameter σ

To get rid of the nuisance parameter σ there is a standard procedure in probability theory, known as marginalization, that can involve some heavy math machinery. What we do to get rid of the parameter we don't care about is integrate it out over all of its possible values:
$$
P(\Delta\varepsilon_r \mid D) = \int_0^\infty P(\Delta\varepsilon_r, \sigma \mid D)\, d\sigma.
$$
Performing this type of integral to marginalize parameters can be a very nasty task. As a matter of fact, in most cases these integrals have no closed-form analytical solutions. That is one of the main reasons that historically made scientists choose frequentist methods rather than Bayesian ones. But in the era of modern computers this should not be a limitation. In the next tutorial I will show you how to use an incredibly powerful tool known as Markov Chain Monte Carlo sampling to numerically perform these integrals over parameters we don't care about.
For our particular case, in which we had a uniform prior for the parameter we care about and a Jeffreys prior for the parameter σ imposed by our choice of Gaussian likelihood, there exists a closed-form solution for the integral. Here we will just quote the result since the derivation is not trivial. The marginalized posterior takes the form
$$
P(\Delta\varepsilon_r \mid D) \propto \left[\sum_{i=1}^n \left(\text{fc}_{\exp}^{(i)} - \text{fc}_{\text{theo}}(R^{(i)}, \Delta\varepsilon_r)\right)^2\right]^{-n/2},
$$
where $n$ is the number of data points. This is called the Student-t distribution. If you have heard of the frequentist methods known as maximum likelihood or minimization of the sum of the squares of the residuals, this functional form might ring a bell: whenever someone uses those approaches for parameter inference, they are implicitly assuming that the residuals follow a Gaussian likelihood, as we did.

Working with the log posterior

Many times, when dealing with very small numbers (such as probabilities), it is much easier for both analytical and numerical reasons to work in log scale. In our example, if we take the log of this posterior we find that
$$
\ln P(\Delta\varepsilon_r \mid D) = \text{constant} - \frac{n}{2}\ln \sum_{i=1}^n \left(\text{fc}_{\exp}^{(i)} - \text{fc}_{\text{theo}}(R^{(i)}, \Delta\varepsilon_r)\right)^2.
$$
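To make this expression concrete, here is a minimal sketch of how a function computing this log posterior could be written (the logPosterior helper used in the next section presumably follows the same logic; this sketch calls the foldChange function sketched at the beginning of the tutorial and drops the additive constant):
function logP = logPosterior(foldChangeExp, R, epsilonR)
    % Non-normalized log posterior for the binding energy epsilonR (in kBT),
    % with the nuisance parameter sigma already marginalized out.
    n = length(foldChangeExp);                            % number of data points
    residuals = foldChangeExp - foldChange(R, epsilonR);  % deviations from the theory
    logP = -n / 2 * log(sum(residuals.^2));               % Student-t form (constant dropped)
end %function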

Evaluating the log posterior function

Now that we have the theory in place we can simply compute the posterior probability for the binding energy. Let us use O2 as a first illustrative example. We have written a function logPosterior that computes the non-normalized log posterior. This function takes 3 arguments:
  1. foldChangeExp: Experimental fold-change measured for those given values of repressors.
  2. R: Number of repressors.
  3. epsilonR: Repressor binding energy.
We will compute the posterior probability for energies between -14.5 and -13.5 $k_BT$.
% Select rows that have operator O2 and a positive number of repressors (removing ∆lacI strain)
O2Rows = df.operator == "O2" & df.repressors > 0;
% Use the boolean array to extract the data for O2 from the general data frame.
% We will extract only the columns we care about to keep it simple.
dfO2 = df(O2Rows, {'operator', 'repressors', 'fold_change'});
% Define energies to compute the posterior probability on.
energies = linspace(-14.5, -13.5, 500);
% Initialize array to save the log posterior probability
logP = zeros(1, length(energies));
% Loop through the energies computing the log probability at each step.
for i=1:length(logP)
logP(i) = logPosterior(dfO2.fold_change, dfO2.repressors, energies(i));
end
% Shift the log probability so that the maximum probability is 1 (since it is a non-normalized distribution)
logP = logP - max(logP);
We can now plot and see the whole distribution! This is part of the power of the Bayesian approach. We don't have to stare at just one fixed value for our parameter; we get a complete distribution of the possible values that the parameter can take.
% Plot the energies vs probability
% remember to exponentiate since we computed the log posterior!
plot(energies, exp(logP), 'Linewidth', 1.5)
% Label axis
xlabel('\Delta\epsilon_r')
ylabel('P(\Delta\epsilon_r \mid D)')
% Change font size
set(gca, 'FontSize', 16)

Finding the most likely parameter value

Let's say we want to find where the peak of that distribution is. That would give us what in Bayesian jargon is called the MAP (maximum a posteriori) estimate, i.e. the most likely value of the parameter. This value is the special number that maximizes the posterior (and therefore the log posterior) distribution. Finding the extremum of a function is a common computational problem, and MATLAB has a lot of tools available to do this. Optimization is itself an active research area in computer science, but for our purposes we will use the simple function fminbnd to find the MAP value of our energy.
The fminbnd function takes 3 main arguments: the single-parameter function to be optimized, and the lower and upper bounds between which to search for the minimum. In our case we want to maximize the log posterior, but since the function is designed to find minima we can simply find the minimum of the negative log posterior. A tricky thing is that fminbnd takes a function with a single input argument, while our logPosterior function takes 3 arguments as discussed before. The trick to work around this issue is to define what MATLAB calls an anonymous function. Here we will define a function that takes only the binding energy as input, with the data fixed in the definition of the function. This will make more sense once we code it.
%Define anonymous function to compute -logPosterior of the O2 data
minusLogPosteriorO2 = @(epsilonR) -logPosterior(dfO2.fold_change, dfO2.repressors, epsilonR);
% Test the function feeding a single parameter
minusLogPosteriorO2(-13.7)
ans = -58.6459
We just defined a function that computes the negative log posterior, but within this function we cannot change the data used to compute it. This means the function wouldn't be useful for another operator, say O1, because the O2 data is "hard-coded" into the function itself. So it is not a very flexible function, but remember: we just want to define it in order to find the MAP of our energy.
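If you do want a reusable version, one option (not part of the original tutorial, just a sketch) is a small "factory" that takes a data table and returns the corresponding anonymous function of the binding energy:
% Hypothetical helper: given a data table with fold_change and repressors
% columns, return the negative log posterior as a function of the energy only.
makeMinusLogPosterior = @(data) @(epsilonR) -logPosterior(data.fold_change, data.repressors, epsilonR);
% For example, the O2 function defined above could also be built as
minusLogPosteriorO2 = makeMinusLogPosterior(dfO2);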
Having defined this anonymous function we can now feed it to the fminbnd function and find the MAP for our binding energy.
% Use fminbnd function to find the minimum of -log posterior.
[O2energy, fval] = fminbnd(minusLogPosteriorO2, -20, -10)
O2energy = -13.8826
fval = -66.7127
So we obtained a value of ≈ -13.9 $k_BT$ for the O2 operator. Let's look at how good this fit is compared to the data. We defined a function foldChange that computes the fold-change equation shown at the beginning of the tutorial. This function takes as input the repressor copy number and the binding energy. So let's plot the theory and then the data to see how well this parameter value describes the experiment.
% Define array of repressor values to use.
% Syntax: logspace(log_10(lower bound), log_10(upper bound), # datapoints)
Rarray = logspace(1, 4, 50);
% Compute the theoretical fold-change
foldChangeO2 = foldChange(Rarray, O2energy);
% Plot theoretical prediction and data
figure()
hold on
% Plot theoretical prediction
plot(Rarray, foldChangeO2, 'k')
% Plot mean +- std with error bars
errorbar(statsO2.repressors, statsO2.mean_fold_change, statsO2.std_fold_change,...
'ko', 'Linewidth', 1.5)
% Change both axes to log scale
set(gca,'xscale','log')
set(gca,'yscale','log')
% Label axis
xlabel('repressors per cell')
ylabel('fold-change')
% Change font size
set(gca, 'FontSize', 16)
% Set box around plot
box on
hold off
It looks like the binding energy that we inferred works pretty well! Now, as an exercise, I challenge you to do the same inference for the rest of the operators in a systematic way.

Bayesian error bar for a parameter value.

One of the main differences between Bayesian and frequentist analysis is the interpretation of the parameter value and the "error bars" associated with these parameters. You might be familiar with the phrase "95% confidence interval" that you might have read in a paper. Each school of thought has its own way to assign an error bar to a parameter value, but the interpretation of such constructs is completely different. If I were preachy about the Bayesian approach (which I'm trying really hard not to be) I would say that the Bayesian interpretation of probability, and of the error bar assigned to an inferred parameter value, makes more sense. As a matter of fact, it is very likely that what you think the 95% confidence interval is telling you, which is a frequentist construct, is actually what the Bayesian equivalent, called the "credible region," is saying. To explain the difference between the two interpretations I will adapt an example from this excellent video by Jake Vanderplas.
Imagine we are trying to infer the binding energy of our transcription factor. Under the frequentist interpretation the transcription factor has one specific value for the binding energy. As we take experimental measurements we can report what frequentists call the confidence interval. This confidence interval has a very non-intuitive interpretation: the 95% confidence interval is a probabilistic statement about the interval itself. In other words, the confidence interval comes with a recipe such that, in the limit of many, many experimental repeats, 95% of the intervals you construct this way would contain the "real" value of your parameter. It is weird, right? If you are confused right now about what I just said, it means I'm doing a great job of transmitting the message that the confidence interval is not what most people think it is.
Let me try to rephrase in a different way. Frequentist statistics gives you a "recipe" to construct an interval called the confidence interval. The interpretation of this confidence interval is associated with the recipe itself rather than with the actual interval you obtain when following the recipe. What the recipe tells you is that in the limit of infinitely many experimental repeats, 95% of the times you follow the recipe to construct the confidence interval, this interval would actually contain the value for your parameter. I know... it is shocking and you feel like you've been living in the darkness all of your life. I call that effect the "Bayesian baptism" (but I'm not being preachy!).
In the Bayesian interpretation one never says "the value of the parameter is exactly x." Since we assign a probability distribution to the parameter value, we would instead say that the value of the parameter has some probability of falling within a certain interval. In other words, if we report what Bayesians call the 95% credible region, what we are reporting is that, given our data, the parameter we are trying to infer has a 95% probability of falling within the credible region range. That is, I think, a much more natural way to assign an error bar to an inferred parameter. As I mentioned before, it shouldn't be surprising that what you thought the confidence interval was telling you is actually what the credible region is saying. Giving Bayesian interpretations to frequentist constructs is a very common mistake.

Gaussian approximation of the posterior

So how do we compute this credible region? Here we will follow an analytical approximation. For this we will need to exercise those calculus muscles that hopefully haven't completely atrophied. Let's define the log posterior probability as
$$
\mathcal{L}(\Delta\varepsilon_r) \equiv \ln P(\Delta\varepsilon_r \mid D).
$$
Now we Taylor expand this log probability distribution around the most likely parameter value $\Delta\varepsilon_r^*$, i.e. the peak of the distribution, up to second order, obtaining
$$
\mathcal{L}(\Delta\varepsilon_r) \approx \mathcal{L}(\Delta\varepsilon_r^*) + \left.\frac{d\mathcal{L}}{d\Delta\varepsilon_r}\right|_{\Delta\varepsilon_r^*} (\Delta\varepsilon_r - \Delta\varepsilon_r^*) + \frac{1}{2}\left.\frac{d^2\mathcal{L}}{d\Delta\varepsilon_r^2}\right|_{\Delta\varepsilon_r^*} (\Delta\varepsilon_r - \Delta\varepsilon_r^*)^2.
$$
Since we are expanding around $\Delta\varepsilon_r^*$, which is the peak of the distribution, i.e. the maximum, the first derivative of the log probability distribution evaluated at this point is equal to zero, leaving us only with the zeroth and second order terms. Now we exponentiate the result to recover the posterior distribution, obtaining
$$
P(\Delta\varepsilon_r \mid D) \approx C \exp\left[\frac{1}{2}\left.\frac{d^2\mathcal{L}}{d\Delta\varepsilon_r^2}\right|_{\Delta\varepsilon_r^*} (\Delta\varepsilon_r - \Delta\varepsilon_r^*)^2\right],
$$
where $C$ is a normalization constant. We can now see that this looks a lot like a Gaussian distribution in which $\Delta\varepsilon_r^*$ plays the role of the mean μ, and as a consequence the variance is given by
$$
\sigma_{\Delta\varepsilon_r}^2 = \left(-\left.\frac{d^2\mathcal{L}}{d\Delta\varepsilon_r^2}\right|_{\Delta\varepsilon_r^*}\right)^{-1},
$$
which is exactly what we were looking for! This is called the Gaussian approximation of the posterior, and it is a good approximation for well-behaved posteriors such as the one we obtained. What this approximation tells us to do is evaluate the second derivative of the log posterior at the MAP, take the negative of its inverse to obtain the variance, and use its square root as the error bar (credible region) on the binding energy.

Computing the error bar in the binding energy

To numerically compute the second derivative of a function there are several routines one could follow. The field of numerical differentiation is its own beast; if you are interested in how to generate a better estimate of this derivative there is plenty of literature on the topic. For simplicity we will take the most naive, and to some extent inaccurate, approach. We will use the function del2 to compute the discrete Laplacian, i.e. the second derivative. We will define a fine range of binding energies around the MAP, compute this discrete Laplacian, and take the value at the point closest to the MAP.
% Define very fine range of binding energies
energies = linspace(O2energy - 0.3, O2energy + 0.3, 200);
% Initialize array to save the log posterior probability
logP = zeros(1, length(energies));
% Loop through the energies computing the log probability at each step.
for i=1:length(logP)
logP(i) = logPosterior(dfO2.fold_change, dfO2.repressors, energies(i));
end %for
% Compute the discrete Laplacian
% The second argument of the function is the spacing between gridpoints.
% This is necessary when computing the numerical derivative
laplacian = del2(logP, min(diff(energies)));
% Find the closest value in energies to the O2energy
[closest, idx] = min(abs(energies - O2energy));
% Compute the variance for the binding energy parameter distribution
sigmaSquareEnergy = - 1 / laplacian(idx);
% Compute the error bar, i.e. the standard deviation of the distribution
errorEnergy = sqrt(sigmaSquareEnergy)
errorEnergy = 0.0814
So there we have it. Our best inference for the binding energy of the O2 operator is given by
$$
\Delta\varepsilon_r = -13.88 \pm 0.08\; k_BT.
$$
As an exercise can you expand the analysis to the rest of the operators so that we can compute all the binding energies automatically?
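One possible way to structure such a loop, reusing the pieces above (this is just a sketch; the variable names are my own, and the fminbnd search range is widened so that weaker operators are not excluded):
% Loop through the operators, reporting the MAP binding energy and its error bar
for i = 1:length(operators)
% Extract the data for this operator, excluding the zero-repressor strain
opRows = df.operator == string(operators(i)) & df.repressors > 0;
dfOp = df(opRows, {'operator', 'repressors', 'fold_change'});
% Find the MAP energy by minimizing the negative log posterior
minusLogPost = @(epsilonR) -logPosterior(dfOp.fold_change, dfOp.repressors, epsilonR);
mapEnergy = fminbnd(minusLogPost, -20, -5);
% Evaluate the log posterior on a fine grid around the MAP
energies = linspace(mapEnergy - 0.3, mapEnergy + 0.3, 200);
logP = zeros(1, length(energies));
for j = 1:length(energies)
logP(j) = logPosterior(dfOp.fold_change, dfOp.repressors, energies(j));
end %for
% Gaussian approximation: error bar from the discrete second derivative at the MAP
laplacian = del2(logP, min(diff(energies)));
[~, idx] = min(abs(energies - mapEnergy));
errorEnergy = sqrt(-1 / laplacian(idx));
% Report the result
fprintf('%s: %.2f +/- %.2f kBT\n', operators{i}, mapEnergy, errorEnergy)
end %for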

Conclusion

In this tutorial we learned the basic concepts of Bayesian parameter inference. This is by no means an extensive review of the topic, but at least now you are aware of the available tools and the general mindset behind this type of analysis.