(c) 2017 the authors. This work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.
# For operating system interaction
import os
import glob
import datetime
import sys
# For loading .pkl files.
import pickle
# For scientific computing
import numpy as np
import pandas as pd
import scipy.special
# Library to perform MCMC sampling
import emcee
# Import custom utilities
import mwc_induction_utils as mwc
# Useful plotting libraries
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import seaborn as sns
import corner
mwc.set_plotting_style()
# Magic function to make matplotlib inline; other style specs must come AFTER
%matplotlib inline
# This enables SVG graphics inline (only use with static plots (non-Bokeh))
%config InlineBackend.figure_format = 'svg'
# Generate a variable with the day that the script is run
today = str(datetime.datetime.today().strftime('%Y%m%d'))
In our first parameter estimation using MCMC we reported a credible region for the MWC parameters and the fold-change based on the data and the fit to the model. But these estimates didn't include previous characterized uncertainty in the parameters of the model.
For example, to compute the fold-change for each of the RBS mutants we assumed we knew with 100% certainty the mean repressor copy number. This assumption might not be completely truth since in their paper Garcia and Phillips report the mean $\pm$ standard deviation of the repressor copy number as revealed by multiple measurements of these quantities. The same applies to the repressor binding energies.
The question then becomes how do we include these sources of uncertainty into our fold-change model with induction?
Recall that the theoretical fold-change equation is given by $$ \text{fold-change} = \left( 1 + \frac{\left( 1 + \frac{c}{1M}\cdot e^{\tilde{k_A}} \right)^2}{\left( 1 + \frac{c}{1M}\cdot e^{\tilde{k_A}} \right)^2 + e^{-\beta \Delta\varepsilon_{AI}}\left( 1 + \frac{c}{1M}\cdot e^{\tilde{k_I}} \right)^2} \frac{R}{N_{NS}}e^{-\beta\Delta\varepsilon_{RA}} \right)^{-1}. \tag{1} $$ where $\beta \equiv \frac{1}{k_BT}, $ $c$ is the inducer concentration, $1M$ is a reference concentration, $\Delta\varepsilon_{AI}$ is the energy difference between the active and inactive state conformations, $R$ is the repressor copy number, $N_{NS}$ is the number of non-specific binding sites where the repressor can bind, and $\Delta\varepsilon_{RA}$ is the binding energy of the repressor to the DNA. We conveniently define $\tilde{k}_A \equiv -\log\frac{K_A}{1\text{ M}}$, and $\tilde{k}_I \equiv -\log\frac{K_I}{1\text{ M}}$ to sample the dissociation constants $K_A$ and $K_I$ in log scale since dissociation constants are scale invariant. This means that a change from 10 $\mu M$ to 1 $\mu M$ is an increase in binding as significant as a change from 1 $\mu M$ to 0.1 $\mu M$.
Originally we defined the parameter estimation problem using Bayes theorem as $$ P(\tilde{k}_A, \tilde{k}_I \mid D) \propto P(D \mid \tilde{k}_A, \tilde{k}_I)P(\tilde{k}_A, \tilde{k}_I), \tag{2} $$ where $D$ is all the data composed of independent variables (repressor copy number $R$, repressor-DNA binding energy $\Delta\varepsilon_{RA}$, and inducer concentration $c$) and one dependent variable (experimental fold-change), $P(D \mid \tilde{k}_A, \tilde{k}_I)$ is the likelihood of having observed the data given the parameter values, and $P(\tilde{k}_A, \tilde{k}_I)$ contains all the prior information on these parameters.
But now to include the uncertainty in the repressor copy number $R$ and the repressor binding energy $\Delta\varepsilon_{RA}$ we need to modify Eq. 2.
The Bayesian framework gives a natural form on how to tackle this problem. Let us focus first on the uncertainty on the repressor copy number measurements. We write our parameter estimation to include a single repressor copy number as a parameter to determine as \begin{equation} P(\tilde{k}_A, \tilde{k}_I, R \mid D) \propto P(D \mid \tilde{k}_A, \tilde{k}_I, R) \cdot P(\tilde{k}_A, \tilde{k}_I, R). \tag{3} \end{equation} The second term on the right hand side, the so-called prior, includes all the information we know before performing the experiment. For simplicity we assume that the parameters are independent such that we can rewrite this term as
\begin{equation} P(\tilde{k}_A, \tilde{k}_I, R \mid I) = P(\tilde{k}_A \mid I) \cdot P(\tilde{k}_I \mid I) \cdot P(R \mid I). \tag{4} \end{equation}Usually one does not want to bias the inference with the prior, therefore we are trained to use maximally uniformative priors. This is why for our previous model we chose a uniform prior for the $\tilde{k}_A$ and $\tilde{k}_I$ parameteres, i.e.
\begin{align} P(\tilde{k}_A \mid I) &\propto \frac{1}{\tilde{k}_{A}^\max - \tilde{k}_{A}^\min}, \tag{5} \\ P(\tilde{k}_I \mid I) &\propto \frac{1}{\tilde{k}_{I}^\max - \tilde{k}_{I}^\min}. \tag{6} \end{align}But in a sense Bayes theorem is a model for learning. What we mean with that is that it naturally allows us to update our parameter estimates after performing an experiment and obtaining data. So in the case when one does not know anything at all about the parameter value these uniformative priors are the right thing to use. But what happens when we have information about the value of some of these parameters from previous experiments? In that case we should include that information in the prior.
For the term $P(R \mid I)$ we must include the characterized uncertainty that Garcia and Phillips reported in their paper. As is written in the paper,
Immunoblots were used to measure the number of Lac repressors in six strains with different constitutive levels of Lac repressor. Each value corresponds to an average of cultures grown on at least 3 different days. The error bars are the SD of these measurements.
This means that the number they report is the mean repressor copy number and the standard deviation on these measurements. It is important to clarify that this standard deviation does not reflect the single-cell variability of repressors, but the experimental uncertainty when measuring the mean repressor copy number. In other words this standard deviation captures the lack of accuracy when measuring this parameter, not the natural biological noise one expects on a clonal population.
The repressor copy number per cell itself is a discrete variable, therefore one could naively think that the prior should be therefore given by a discrete distribution. But we highlight that what Garcia and Phillips measured with their immunoblots was not a single cell measurement, but a bulk measurement of the mean repressor copy number, which is not a discrete variable itself. As a matter of fact, from the central limit theorem we know that the distribution of the mean repressor copy number must be Gaussian. Therefore we can write \begin{equation} P(R \mid \sigma_R) \approx \frac{1}{\sqrt{2 \pi \sigma_R^2}}\exp \left[ -\frac{(R - R^)^2}{2 \sigma_R^2} \right] \; \forall~R > 0, \tag{7} \end{equation} where $\sigma_R$ is the characterized standard deviation that Garcia and Phillips experimentally characterized for each RBS mutant, and $R^$ is the mean repressor level also characterized experimentally.
If we include this prior for the repressor copy number and an equivalent one for the repressor binding energy, and allow the MCMC walkers to walk on these two new dimensions while fitting the MWC parameters, then the posterior probability for the fold-change will include all of the characterized uncertainty on all parameters. In this way we can properly build the credible region for our predictions.
We can use the same kind of informative prior for the binding energies and use all of the experimental fold-changes to determine the dissociation constatants, the four repressor-DNA binding energies for the four different operators {O1, O2, O3, Oid} and the 6 different mean repressor copy numbers.
We then write Bayes theorem as \begin{equation} P(\tilde{k}_A, \tilde{k}I, \mathbf{R}, \boldsymbol{\Delta\varepsilon{RA}} \mid D) \propto P(D \mid \tilde{k}_A, \tilde{k}I, \mathbf{R}, \boldsymbol{\Delta\varepsilon{RA}})\cdot P(\tilde{k}_A, \tilde{k}I, \mathbf{R}, \boldsymbol{\Delta\varepsilon{RA}}), \tag{8} \end{equation} where $\mathbf{R}$ is an array containing the six different repressor copy numbers to be fit, $\boldsymbol{\Delta\varepsilon_{RA}}$ is an array containing the four binding energies to be fit, and $D$ is the experimental fold-change data. The term $P(\tilde{k}_A, \tilde{k}_I, \mathbf{R}, \boldsymbol{\Delta\varepsilon_{RA}} \mid D)$ is the parameters probability distributions we aim to compute since it describes the plausibility of all parameters, taking into account the dependences they might have with each other. The term $(D \mid \tilde{k}_A, \tilde{k}_I, \mathbf{R}, \boldsymbol{\Delta\varepsilon_{RA}})$ represents the likelihood of having observed our experimental data given a value of the parameters. And finally $P(\tilde{k}_A, \tilde{k}_I, \mathbf{R}, \boldsymbol{\Delta\varepsilon_{RA}})$ contains all the prior information on the value of these parameters.
As before we assume that the deviations of the experimental fold-change from the theoretical predictions are normally distributed with mean zero and standard deviation $\sigma$. That implies that the first term on the right hand side of Eq. 8 can be written as \begin{align} P(\tilde{k}_A, \tilde{k}I, \mathbf{R}, \boldsymbol{\Delta\varepsilon{RA}}, \sigma \mid D ) = \frac{1}{(2\pi\sigma^2)^{\frac{n}{2}}}\prod\limits{i=1}^n \exp \left[-\frac{(\mathrm{fc}^{(i)}{\exp} - \mathrm{fc}(\tilde{k}_A, \tilde{k}I, R^{(i)}, \Delta\varepsilon{RA}^{(i)}, c^{(i)}))^2}{2\sigma^2}\right], \tag{9} \end{align} where $\text{fc}^{(i)}_{\text{exp}}$ is the experimental fold-change and $\mathrm{fc(\cdots)}$ is the theoretical prediction. The product $\prod_{i=1}^n$ captures the assumption that the $n$ data points are independent. Note that since we assume the errors follow a normal distribution with unknown standard deviation $\sigma$, this term has to be included as part of the fitting procedure.
For the prior term $P(\tilde{k}_A, \tilde{k}_I, \mathbf{R}, \boldsymbol{\Delta\varepsilon_{RA}}, \sigma)$ that also includes now the prior information on $\sigma$, we assign uniform priors to $\tilde{k}_A$ and $\tilde{k}_I$ as explained above, as well as Gaussian informative priors to both the repressor copy number and the repressor-DNA binding energies. Finally we assign a Jeffreys prior to $\sigma$ since it is a scaling parameter. All this together gives \begin{equation} P(\tilde{k}_A, \tilde{k}I, \mathbf{R}, \boldsymbol{\Delta\varepsilon{RA}}, \sigma) \propto \frac{1}{\tilde{k}{A}^\max - \tilde{k}{A}^\min} \cdot \frac{1}{\tilde{k}{I}^\max - \tilde{k}{I}^\min} \cdot \prodi \frac{1}{\sqrt{2\pi\sigma{Ri}^2}} \exp \left(-\frac{(R^{(i)} - \bar{R}^{(i)})^2}{2 \sigma{R_i}^2} \right)\cdot \prodj \frac{1}{\sqrt{2\pi\sigma{\varepsilonj}^2}} \exp \left(-\frac{(\Delta\varepsilon{RA}^{(j)} - \Delta\bar{\varepsilon}{RA}^{(j)})^2}{2 \sigma{\varepsilonj}^2} \right) \cdot \frac{1}{\sigma} \tag{10} \end{equation} where the product over all $i$ values represents all six RBS mutants used in this work and the product over all $j$ values represent all four operator sequences used in this work. The terms $\bar{R}^{(i)}$ and $\Delta\bar{\varepsilon}{RA}^{(j)}$ represent the prior values for the repressor copy numbers and binding energies as reported by Garcia and Phillips.
To implement the MCMC routine we need to compute the log posterior probability. Putting back together our likelihood from Eq. 9 and the prior from Eq. 10, and taking the log gives \begin{equation} \ln P(\tilde{k}_A, \tilde{k}_I, Ri, \varepsilon{RAj} \mid fc{\exp_{ijk}}, Ri^*, \sigma{Ri}, \varepsilon{RAj}^*, \sigma{\varepsilon_{RA_j}}, c_k, \sigma, I) \propto - (n + 1) \ln \sigma + \sum_i \sum_j \sumk - \frac{1}{2 \sigma^2} \left(fc{\exp_{ijk}} - fc(\tilde{k}_A, \tilde{k}_I, Ri, \varepsilon{RA_j}, c_k) \right)^2 - \frac{\left( R_i - Ri^* \right)^2}{2 \sigma^2{Ri}} - \frac{\left( \varepsilon{RAj} - \varepsilon{RAj}^* \right)^2}{2 \sigma^2{\varepsilon_{RAj}}} , \tag{11} \end{equation} where the subindex $i$ lists all the RBS mutants, the subindex $j$ lists all the operators, and the subindex $k$ lists all the IPTG concentrations $c$. $fc{\exp_{ijk}}$ represents the experimentally determined fold-change, and $n$ is the total number of data points.
In order to minimize the data frame parsing that the log-posterior has to do when performing the MCMC let's write a pre-processing function that will parse the data once such that the output can be feed to the log-posterior function.
def mcmc_pre_process(df):
"""
Pre-process the tidy DataFrame to prepare it for the MCMC. This is done
separately from the log-posterior calculation to speed up the process
avoiding parsing the DataFrame every evaluation of the posterior.
Parameteres
-----------
df : pandas DataFrame.
A tidy pandas DataFrame as standardized in the project that contains
at least the following columns:
fold_change_A : the experimental fold-change from channel A in the
flow cytometer.
IPTG_uM : 1d-array
Concentrations of the inducer in micromolar.
repressors : int
The mean repressor copy number in copies per cell.
delta_repressors : float
The experimental standard deviation on the mean repressor copy number
binding_energy : float
The mean repressor binding energy
delta_energy : float
The experimental standard deviation on the binding energy
Returns
-------
[rep_unique, eps_unique] : list
A list whose first element is the list of the unique mean repressor
copy number found in the DataFrame.
The second element is the list of unique binding energies also found
in the DataFrame.
This is used by the MCMC function to determine how many dimensions
the walkers should walk in.
param_idx : array-like.
An array that indicates in the param array where are each parameters
located. The logic is the following:
In the first 3 positions of the param argument for the MCMC we find
epsilon_A, epsilon_I and sigma the error associated with the Gaussian
likelihood.
After that we have all the repressor copy numbers for each of the RBS
mutants. Followed by all the unique binding energies in the DataFrame.
This variable indicates the position of each of these variables such
that the function is robust and it works for a DataFrame with 1 RBS
mutant and 1 energy as well as for multiple mutants and multiple enrgies.
data : array-like.
Numpy array pre-arranged in the order that the log-posterior function
expects it with the following columns:
data[:, 0] : fold_change_A
data[:, 1] : IPTG_uM
data[:, 2] : repressors
data[:, 3] : delta_repressors
data[:, 4] : binding_energy
data[:, 5] : delta_energy
"""
# List the unique variables
rep_unique = np.sort(df.repressors.unique())
eps_unique = np.sort(df.binding_energy.unique())
IPTG_unique = np.sort(df.IPTG_uM.unique())
# determine the number of unique variables
n_repressor = len(rep_unique)
n_epsilon_r = len(eps_unique)
n_IPTG = len(IPTG_unique)
# Depending on the number of parameters determine the indexes of the
# parameters to fit
param_idx = np.cumsum([3, n_repressor, n_epsilon_r])
# Sort the data frame such that the log-posterior function can
# automatically compute the log probability with the right parameters
# for each data point
df_sort = df.sort(['repressors', 'binding_energy', 'IPTG_uM'])
data = np.array(df_sort[['fold_change_A', 'IPTG_uM',
'repressors', 'delta_repressors',
'binding_energy', 'delta_energy']])
return [rep_unique, eps_unique], param_idx, data
Now let's define the function to compute the likelihood, the prior and the posterior probability
def log_likelihood(param, param_idx, unique_var, data, epsilon=4.5):
'''
Computes the log-likelihood
Parameters
----------
param : array-like
Array with the value of all the parameters/dismensions on which the
MCMC walkers should walk. The array follows this order:
ea, ei, sigma : first 3 columns.
repressor copy number : next columns.
binding energies : final columns.
The exact position of each of these parameters depends on the number
of unique repressors and energies as indicated by param_idx.
param_idx : array-like.
An array that indicates in the param array where are each parameters
located. The logic is the following:
In the first 3 positions of the param argument for the MCMC we find
epsilon_A, epsilon_I and sigma the error associated with the Gaussian
likelihood.
After that we have all the repressor copy numbers for each of the RBS
mutants. Followed by all the unique binding energies in the DataFrame.
This variable indicates the position of each of these variables such
that the function is robust and it works for a DataFrame with 1 RBS
mutant and 1 energy as well as for multiple mutants and multiple enrgies.
unique_var : : list.
A list whose first element is the list of the unique mean repressor
copy number found in the DataFrame.
The second element is the list of unique binding energies also found
in the DataFrame.
This is used by the MCMC function to determine how many dimensions
the walkers should walk in.
data : array-like.
Numpy array pre-arranged in the order that the log-posterior function
expects it with the following columns:
data[:, 0] : fold_change_A
data[:, 1] : IPTG_uM
data[:, 2] : repressors
data[:, 3] : delta_repressors
data[:, 4] : binding_energy
data[:, 5] : delta_energy
epsilon : float.
Energetic difference between the active and inactive state.
Returns
-------
log likelihood probability
'''
# unpack parameters
ea, ei, sigma = param[0:param_idx[0]] # MWC parameters
rep = param[param_idx[0]:param_idx[1]] # Repressor copy numbers
eps_r = param[param_idx[1]:param_idx[2]] # Represor energies
# Initialize the log_likelihood
log_like = 0
# loop through the parameters to fit in order to compute the
# theoretical fold change using the right parameters for each strain
for i, r in enumerate(unique_var[0]):
for j, eps in enumerate(unique_var[1]):
data_block = data[(data[:, 2]==r) & (data[:, 4]==eps), :]
# compute the theoretical fold-change
fc_theory = mwc.fold_change_log(data_block[:, 1],
ea, ei, epsilon,
rep[i], eps_r[j])
# compute the log likelihood for this block of data
log_like -= np.sum((fc_theory - data_block[:, 0])**2) / 2 / sigma**2
return log_like
def log_prior(param, param_idx, unique_var, data, epsilon=4.5):
'''
Computes the log-prior probability
Parameters
----------
param : array-like
Array with the value of all the parameters/dismensions on which the
MCMC walkers should walk. The array follows this order:
ea, ei, sigma : first 3 columns.
repressor copy number : next columns.
binding energies : final columns.
The exact position of each of these parameters depends on the number
of unique repressors and energies as indicated by param_idx.
param_idx : array-like.
An array that indicates in the param array where are each parameters
located. The logic is the following:
In the first 3 positions of the param argument for the MCMC we find
epsilon_A, epsilon_I and sigma the error associated with the Gaussian
likelihood.
After that we have all the repressor copy numbers for each of the RBS
mutants. Followed by all the unique binding energies in the DataFrame.
This variable indicates the position of each of these variables such
that the function is robust and it works for a DataFrame with 1 RBS
mutant and 1 energy as well as for multiple mutants and multiple enrgies.
unique_var : : list.
A list whose first element is the list of the unique mean repressor
copy number found in the DataFrame.
The second element is the list of unique binding energies also found
in the DataFrame.
This is used by the MCMC function to determine how many dimensions
the walkers should walk in.
data : array-like.
Numpy array pre-arranged in the order that the log-posterior function
expects it with the following columns:
data[:, 0] : fold_change_A
data[:, 1] : IPTG_uM
data[:, 2] : repressors
data[:, 3] : delta_repressors
data[:, 4] : binding_energy
data[:, 5] : delta_energy
epsilon : float.
Energetic difference between the active and inactive state.
Returns
-------
log prior probability
'''
# unpack parameters
ea, ei, sigma = param[0:param_idx[0]] # MWC parameters
rep = param[param_idx[0]:param_idx[1]] # Repressor copy numbers
eps_r = param[param_idx[1]:param_idx[2]] # Represor energies
# Initialize the log_prior
log_prior = 0
# loop through the parameters to to fit in order to compute the appropiate
# log prior
for i, r in enumerate(unique_var[0]):
for j, eps in enumerate(unique_var[1]):
data_block = data[(data[:, 2]==r) & (data[:, 4]==eps), :]
log_prior -= np.sum((rep[i] - data_block[:, 2])**2 / \
2 / data_block[:, 3]**2)
log_prior -= np.sum((eps_r[j] - data_block[:, 4])**2 / \
2 / data_block[:, 5]**2)
# check the bounds on the parameterreps
if np.any(rep <= 0) or (sigma <= 0):
return -np.inf
return log_prior
def log_post(param, param_idx, unique_var, data, epsilon=4.5):
'''
Computes the log posterior probability.
Parameters
----------
param : array-like
Array with the value of all the parameters/dismensions on which the
MCMC walkers should walk. The array follows this order:
ea, ei, sigma : first 3 columns.
repressor copy number : next columns.
binding energies : final columns.
The exact position of each of these parameters depends on the number
of unique repressors and energies as indicated by param_idx.
param_idx : array-like.
An array that indicates in the param array where are each parameters
located. The logic is the following:
In the first 3 positions of the param argument for the MCMC we find
epsilon_A, epsilon_I and sigma the error associated with the Gaussian
likelihood.
After that we have all the repressor copy numbers for each of the RBS
mutants. Followed by all the unique binding energies in the DataFrame.
This variable indicates the position of each of these variables such
that the function is robust and it works for a DataFrame with 1 RBS
mutant and 1 energy as well as for multiple mutants and multiple enrgies.
unique_var : : list.
A list whose first element is the list of the unique mean repressor
copy number found in the DataFrame.
The second element is the list of unique binding energies also found
in the DataFrame.
This is used by the MCMC function to determine how many dimensions
the walkers should walk in.
data : array-like.
Numpy array pre-arranged in the order that the log-posterior function
expects it with the following columns:
data[:, 0] : fold_change_A
data[:, 1] : IPTG_uM
data[:, 2] : repressors
data[:, 3] : delta_repressors
data[:, 4] : binding_energy
data[:, 5] : delta_energy
epsilon : float.
Energetic difference between the active and inactive state.
Returns
-------
The log posterior probability
'''
# unpack parameters
ea, ei, sigma = param[0:param_idx[0]] # MWC parameters
rep = param[param_idx[0]:param_idx[1]] # Repressor copy numbers
eps_r = param[param_idx[1]:param_idx[2]] # Represor energies
lnp = log_prior(param, param_idx, unique_var, data, epsilon)
# Check before computing the likelihood if one of the boundaries set by
# the prior was not satisfied. If that is the case don't waste time
# computing the likelihood and return -inf
if lnp == -np.inf:
return lnp
return -(len(data) + 1) * np.log(sigma)\
+ log_likelihood(param, param_idx, unique_var, data, epsilon)\
+ lnp
Now let's read the data into a tidy DataFrame
. We will include the uncertainty on the repressor copy number as determined experimentally by Garcia and Phillips.
# List the error sources as described by Garcia & Phillips PNAS 2011.
delta_R = {'HG104':2, 'RBS1147':10, 'RBS446':15, 'RBS1027':20, 'RBS1':80,
'RBS1L':170}
delta_epsilon_r = {'O1':0.2, 'O2':0.2, 'O3':0.1, 'Oid':0.2}
# Define working directory
datadir = '../../data/'
# List files to be read
files = ['flow_master.csv', 'merged_Oid_data_foldchange.csv']
# Read flow cytometry data
df_Oid = pd.read_csv(datadir + files[1], comment='#')
# make an extra column to have consistent labeling
df_Oid['fold_change_A'] = df_Oid.fold_change
# Remove manually the outlier with an unphysical fold-change
df_Oid = df_Oid[df_Oid.fold_change_A <= 1]
# Read the flow cytometry data
df = pd.read_csv(datadir + files[0], comment='#')
# Attach both data frames into a single one
df = pd.concat([df, df_Oid])
# Drop rows containing NA values
df.dropna(axis=1, inplace=True)
# Now we remove the autofluorescence and delta values
df = df[(df.rbs != 'auto') & (df.rbs != 'delta')]
# Restart index
df = df.reset_index()
# Add the error columns to the data frame
df['delta_repressors'] = pd.Series([delta_R[df.iloc[x].rbs] for x\
in np.arange(df.shape[0])])
df['delta_energy'] = pd.Series([delta_epsilon_r[x] for x in df.operator])
df.head()
To make the function more robust we will define a function that initializes the MCMC walkers nearby the MAP for the MWC parameters. The function first finds the MAP using the non-linear regression function we previously defined, and starts the walkers around that region. Then it initializes the walkers for the mean repressor copy number and the binding energy also around the MAP value.
def init_walkers(df, n_walkers, unique_var, param_idx):
'''
Initialize walkers according to however many dimensions will be explored
by the MCMC
Parameters
----------
df : pandas DataFrame
Data frame containing the data that will be used for fitting the
parameters
n_walkers : int
Number of walkers for the MCMC.
unique_var : : list
A list whose first element is the list of the unique mean repressor
copy number found in the DataFrame.
The second element is the list of unique binding energies also found
in the DataFrame.
This is used by the MCMC function to determine how many dimensions
the walkers should walk in.
param_idx : array-like
An array that indicates in the param array where are each parameters
located. The logic is the following:
In the first 3 positions of the param argument for the MCMC we find
epsilon_A, epsilon_I and sigma the error associated with the Gaussian
likelihood.
After that we have all the repressor copy numbers for each of the RBS
mutants. Followed by all the unique binding energies in the DataFrame.
This variable indicates the position of each of these variables such
that the function is robust and it works for a DataFrame with 1 RBS
mutant and 1 energy as well as for multiple mutants and multiple enrgies.
n_dim : int
Number of dimensions that the MCMC walkers will walk on.
Returns
-------
[p0, ndim] : list
The maximum a-priori value from optimization and the number of parameters
used for the MCMC execution.
'''
#Define the parameters for emcee
n_dim = 3 + np.sum([len(x) for x in unique_var])
# Perform a non-linear regression
map_param = mwc.non_lin_reg_mwc(df, p0=[1, 7], diss_const=False)
mean = [map_param[0], map_param[2]]
cov = np.array([[map_param[1], 0], [0, map_param[3]]])
# Initialize walkers
p0 = np.empty((n_walkers, n_dim))
# Initialize walkers
p0 = np.empty((n_walkers, n_dim))
p0[:,[0, 1]] = np.random.multivariate_normal(mean, cov, n_walkers)# ea, ei
p0[:,2] = np.random.uniform(1E-5, 0.2, n_walkers)# sigma
# loop through the repressors
for i, r in enumerate(unique_var[0]):
sigma_r = df[df.repressors==r].delta_repressors.unique()
# Check if any walker was initialized in a forbidden area
rep_num = np.random.normal(r, sigma_r, n_walkers)
rep_num[rep_num < 0] = 0
p0[:, param_idx[0]+i] = rep_num
for j, eps in enumerate(unique_var[1]):
sigma_eps = df[df.binding_energy==eps].delta_energy.unique()
p0[:, param_idx[1]+j] = np.random.normal(eps, sigma_eps, n_walkers)
return p0, n_dim
Let's first test the functions fitting for the RBS1027 strain.
rbs = df[(df.rbs=='RBS1027') & (df.operator=='O2')]
# Preprocess the data
unique_var, param_idx, data = mcmc_pre_process(rbs)
# initialize the walkers
n_walkers = 100
n_burn = 500
n_steps = 5000
p0, n_dim = init_walkers(rbs, n_walkers, unique_var, param_idx)
#Call the sampler.
sampler = emcee.EnsembleSampler(n_walkers, n_dim, log_post,\
args=(param_idx, unique_var, data, 4.5),\
threads=6)
#Do the burn in
pos, prob, state = sampler.run_mcmc(p0, n_burn, storechain=False)
sample = True
if sample:
# Perform the real MCMC
_ = sampler.run_mcmc(pos, n_steps)
output = open('../../data/mcmc/all_parameters_O2_R260.pkl', 'wb')
pickle.dump(sampler.flatchain, output)
pickle.dump(sampler.flatlnprobability, output)
# Load the flat-chain
with open('../../data/mcmc/all_parameters_O2_R260.pkl', 'rb') as file:
unpickler = pickle.Unpickler(file)
gauss_flatchain = unpickler.load()
gauss_flatlnprobability = unpickler.load()
# Draw the corner plot
fig = corner.corner(gauss_flatchain, bins=50, plot_contours=True)
# Load the flat-chain
with open('../../data/mcmc/all_parameters_O2_R260.pkl', 'rb') as file:
unpickler = pickle.Unpickler(file)
gauss_flatchain = unpickler.load()
gauss_flatlnprobability = unpickler.load()
# Generate a Pandas Data Frame with the mcmc chain
index = np.concatenate([['ka', 'ki', 'sigma'],\
[rbs[rbs.repressors==r].rbs.unique()[0] for r in \
np.sort(rbs.repressors.unique())],
[rbs[rbs.binding_energy==o].operator.unique()[0] for o in \
np.sort(rbs.binding_energy.unique())]])
# Generate a data frame out of the MCMC chains
mcmc_rbs = pd.DataFrame(gauss_flatchain, columns=index)
mcmc_rbs['Ka'] = np.exp(-mcmc_rbs['ka'])
mcmc_rbs['Ki'] = np.exp(-mcmc_rbs['ki'])
# Generate data frame with mode values for each parameter
max_idx = np.argmax(gauss_flatlnprobability, axis=0)
# Obtain the MAP for each parameter
param_fit = mcmc_rbs.ix[max_idx, :]
# Convert to data frame with column name mode
param_fit = param_fit.to_frame(name='mode')
# Generate parameter to save the hpd for each parameter
param_hpd = pd.DataFrame(columns=['hpd_min', 'hpd_max'])
# Loop through each parameter computing the 95% hpd
for column in mcmc_rbs:
param_hpd = param_hpd.append(pd.Series(np.abs(mwc.hpd(mcmc_rbs[column], 0.95) - \
param_fit.ix[column, 'mode']),
index=['hpd_min', 'hpd_max'], name=column))
# Combine the data frames into a single data frame
param_fit = pd.concat([param_fit, param_hpd], axis=1)
param_fit.round(2)
We see that for this particular strain we obtain basically the same values as the ones determined without fitting the repressor copy number or the energy.
Having tested the functions we can now proceed to perform the MCMC using all the strains.
# Preprocess the data
unique_var, param_idx, data = mcmc_pre_process(df)
n_walkers = 50
n_burn = 500
n_steps = 8000
p0, n_dim = init_walkers(df, n_walkers, unique_var, param_idx)
#Call the sampler.
sampler = emcee.EnsembleSampler(n_walkers, n_dim, log_post,\
args=(param_idx, unique_var, data, 4.5),\
threads=6)
sample = True
if sample:
#Do the burn in
print('Performing the burn-in')
pos, prob, state = sampler.run_mcmc(p0, n_burn, storechain=False)
# Perform the real MCMC
print('Performing the MCMC')
_ = sampler.run_mcmc(pos, n_steps)
output = open('../../data/mcmc/all_parameters_all_data_informative.pkl', 'wb')
pickle.dump(sampler.flatchain, output)
pickle.dump(sampler.flatlnprobability, output)
# Load the flat-chain
with open('../../data/mcmc/all_parameters_all_data_informative.pkl','rb') as file:
unpickler = pickle.Unpickler(file)
gauss_flatchain = unpickler.load()
gauss_flatlnprobability = unpickler.load()
# # Draw the corner plot
# fig = corner.corner(gauss_flatchain, bins=50, plot_contours=False,
# rasterized=True)
Let us now create a DataFrame
out of the MCMC chains.
# Generate a Pandas Data Frame with the mcmc chain
index = np.concatenate([['ka', 'ki', 'sigma'],\
[df[df.repressors==r].rbs.unique()[0] for r in \
np.sort(df.repressors.unique())],
[df[df.binding_energy==o].operator.unique()[0] for o in \
np.sort(df.binding_energy.unique())]])
# Generate a data frame out of the MCMC chains
mcmc_df = pd.DataFrame(gauss_flatchain, columns=index)
mcmc_df['Ka'] = np.exp(-mcmc_df['ka'])
mcmc_df['Ki'] = np.exp(-mcmc_df['ki'])
# redfine the index with the new entries
index = mcmc_df.columns
mcmc_df.head()
# Generate data frame with mode values for each parameter
max_idx = np.argmax(gauss_flatlnprobability, axis=0)
# Obtain the MAP for each parameter
param_fit = mcmc_df.ix[max_idx, :]
# Convert to data frame with column name mode
param_fit = param_fit.to_frame(name='mode')
# Generate parameter to save the hpd for each parameter
param_hpd = pd.DataFrame(columns=['hpd_min', 'hpd_max'])
# Loop through each parameter computing the 95% hpd
for column in mcmc_df:
param_hpd = param_hpd.append(pd.Series(np.abs(mwc.hpd(mcmc_df[column], 0.95) - \
param_fit.ix[column, 'mode']),
index=['hpd_min', 'hpd_max'], name=column))
# Combine the data frames into a single data frame
param_fit = pd.concat([param_fit, param_hpd], axis=1)
param_fit.round(2)
An interesting feature to explore is to see "how well we could fit the data" given that we truly allow all the parameters to be fit. We can simply do that by extending the $\sigma$ value on the priors for the repressor copy numbers and the binding energies. In this example we will try increasing the error that Hernan and Phillips originally infered.
Let's read the data and define this larger error.
# Define working directory
datadir = '../../data/'
# List files to be read
files = ['flow_master.csv', 'merged_Oid_data_foldchange.csv']
# Read flow cytometry data
df_Oid = pd.read_csv(datadir + files[1], comment='#')
# make an extra column to have consistent labeling
df_Oid['fold_change_A'] = df_Oid.fold_change
# Remove manually the outlier with an unphysical fold-change
df_Oid = df_Oid[df_Oid.fold_change_A <= 1]
# Read the flow cytometry data
df = pd.read_csv(datadir + files[0], comment='#')
# Attach both data frames into a single one
df = pd.concat([df, df_Oid])
# Drop rows containing NA values
df.dropna(axis=1, inplace=True)
# Now we remove the autofluorescence and delta values
df = df[(df.rbs != 'auto') & (df.rbs != 'delta')]
# Restart index
df = df.reset_index()
#Define increased error
error_r = 3
error_e = 15
# Add the error columns to the data frame
df['delta_repressors'] = pd.Series([delta_R[df.iloc[x].rbs] * error_r for x\
in np.arange(df.shape[0])])
df['delta_energy'] = pd.Series([delta_epsilon_r[x] * error_e for x in df.operator])
df[['repressors', 'delta_repressors', 'binding_energy', 'delta_energy']].head(6)
# Preprocess the data
unique_var, param_idx, data = mcmc_pre_process(df)
n_walkers = 50
n_burn = 2500
n_steps = 5000
p0, n_dim = init_walkers(df, n_walkers, unique_var, param_idx)
#Call the sampler.
sampler = emcee.EnsembleSampler(n_walkers, n_dim, log_post,\
args=(param_idx, unique_var, data, 4.5),\
threads=6)
sample = True
if sample:
#Do the burn in
print('Performing the burn-in')
pos, prob, state = sampler.run_mcmc(p0, n_burn, storechain=False)
# Perform the real MCMC
print('Performing the MCMC')
_ = sampler.run_mcmc(pos, n_steps)
output = open('../../data/mcmc/SI_E_global.pkl', 'wb')
pickle.dump(sampler.flatchain, output)
pickle.dump(sampler.flatlnprobability, output)
# Load the flat-chain
with open('../../data/mcmc/SI_E_global.pkl', 'rb') as file:
unpickler = pickle.Unpickler(file)
gauss_flatchain = unpickler.load()
gauss_flatlnprobability = unpickler.load()
# Generate a Pandas Data Frame with the mcmc chain
index = np.concatenate([['ka', 'ki', 'sigma'],\
[df[df.repressors==r].rbs.unique()[0] for r in \
np.sort(df.repressors.unique())],
[df[df.binding_energy==o].operator.unique()[0] for o in \
np.sort(df.binding_energy.unique())]])
# Generate a data frame out of the MCMC chains
mcmc_df = pd.DataFrame(gauss_flatchain, columns=index)
mcmc_df['Ka'] = np.exp(-mcmc_df['ka'])
mcmc_df['Ki'] = np.exp(-mcmc_df['ki'])
# redfine the index with the new entries
index = mcmc_df.columns
mcmc_df.head()
# Generate data frame with mode values for each parameter
max_idx = np.argmax(gauss_flatlnprobability, axis=0)
# Obtain the MAP for each parameter
param_fit = mcmc_df.ix[max_idx, :]
# Convert to data frame with column name mode
param_fit = param_fit.to_frame(name='mode')
# Generate parameter to save the hpd for each parameter
param_hpd = pd.DataFrame(columns=['hpd_min', 'hpd_max'])
# Loop through each parameter computing the 95% hpd
for column in mcmc_df:
param_hpd = param_hpd.append(pd.Series(np.abs(mwc.hpd(mcmc_df[column], 0.95) - \
param_fit.ix[column, 'mode']),
index=['hpd_min', 'hpd_max'], name=column))
# Combine the data frames into a single data frame
param_fit = pd.concat([param_fit, param_hpd], axis=1)
param_fit.round(2)
for value in param_fit.iterrows():
print(value[0])
if value[0] in df.rbs.unique():
p_mode = value[1][0] * 2
p_max = value[1][2] * 2
p_min = value[1][1] * 2
print('{0:.0f}^{{+{1:0.0f}}}_{{-{2:0.0f}}}'.format(p_mode, p_max, p_min))
elif value[0] in df.operator.unique():
p_mode = value[1][0]
p_max = value[1][2]
p_min = value[1][1]
print('{0:.1f}^{{+{1:0.1f}}}_{{-{2:0.1f}}}'.format(p_mode, p_max, p_min))
else:
p_mode = value[1][0]
p_max = value[1][2]
p_min = value[1][1]
print('{0:.2f}^{{+{1:0.2f}}}_{{-{2:0.2f}}}'.format(p_mode, p_max, p_min))
# print('{0:.2f}^{{+{1:0.2f}}}_{{-{2:.2f}}}'.format(float('%.1g' % p_mode),
# float('%.1g' % p_max),
# float('%.1g' % p_min)))
# map value of the parameters
map_param = param_fit['mode'].to_dict()
# Define the IPTG concentrations to evaluate
IPTG = np.logspace(-8, -2, 75)
# Set the colors for the strains
colors = sns.color_palette('colorblind', n_colors=7)
colors[4] = sns.xkcd_palette(['dusty purple'])[0]
# Define the operators and their respective energies
operators = ['O1', 'O2', 'O3', 'Oid']
energies = {'O1': -15.3, 'O2': -13.9, 'O3': -9.7, 'Oid': -17}
# Initialize the plot to set the size
fig,ax = plt.subplots(4, 1, figsize=(5.5, 13.5))
# Loop through operators
for i, op in enumerate(operators):
data = df[df.operator==op]
# loop through RBS mutants
for j, rbs in enumerate(df.rbs.unique()):
# plot the theory using the parameters from the fit.
ax[i].plot(IPTG, mwc.fold_change_log(IPTG * 1E6,
ea=map_param['ka'], ei=map_param['ki'], epsilon=4.5,
R=map_param[rbs],
epsilon_r=map_param[op]),
color=colors[j])
# plot 95% HPD region using the variability in the parameters
flatchain = np.array(mcmc_df[['ka', 'ki', rbs, op]])
cred_region = mwc.mcmc_cred_reg_error_prop(IPTG * 1E6,
flatchain, epsilon=4.5)
ax[i].fill_between(IPTG, cred_region[0,:], cred_region[1,:],
alpha=0.3, color=colors[j])
# compute the mean value for each concentration
fc_mean = data[data.rbs==rbs].groupby('IPTG_uM').fold_change_A.mean()
# compute the standard error of the mean
fc_err = data[data.rbs==rbs].groupby('IPTG_uM').fold_change_A.std() / \
np.sqrt(data[data.rbs==rbs].groupby('IPTG_uM').size())
# plot the experimental data
ax[i].errorbar(np.sort(data[data.rbs==rbs].IPTG_uM.unique()) / 1E6,
fc_mean, yerr=fc_err, fmt='o',
label=df[df.rbs==rbs].repressors.unique()[0],
color=colors[j])
# Add operator and binding energy labels.
ax[i].text(0.8, 0.08, r'{0}'.format(op), transform=ax[i].transAxes,
fontsize=14)
ax[i].text(0.7, 0.02,
r'$\Delta\varepsilon_{RA} = %s\,k_BT$' %energies[op],
transform=ax[i].transAxes, fontsize=12)
ax[i].set_xscale('log')
ax[i].set_xlabel('IPTG (M)', fontsize=15)
ax[i].set_ylabel('fold-change', fontsize=16)
ax[i].set_ylim([-0.01, 1.1])
ax[i].tick_params(labelsize=14)
ax[i].margins(0.02)
ax[0].legend(loc='upper left', title='repressors / cell')
# add plot letter labels
plt.tight_layout()
To see how this fit energies and repressor copy numbers affect the results from previous work let's look at the originl data from papers.
file='../../data/tidy_lacI_titration_data.csv'
df_old = pd.read_csv(file, comment='#')
df_old.head()
Since Brewster et al measured the repressor copy number using a different method based on the dilution method we won't change their repressor copy number. For Garcia & Phillips the story is different. Since we are using their original strains, the values for the repressors that we fit correspond to these strains. Let's add that information to these strains.
rep_dict = dict(zip([22, 60, 124, 260, 1220, 1740], param_fit.index[3:9]))
rep_dict
# Group by operator and author to distinguish between data sets
df_group = df_old.groupby(['operator', 'author'])
# Define the colors for each data set
colors = sns.color_palette('Paired', n_colors=8)
# Define an array to evaluate the fold-change as a function of repressor
# copy number
r_array = np.logspace(0, 3.5, 100)
# initialize counter (because of df_group)
i = 0
fig = plt.figure(figsize=(6,6))
for group, data in df_group:
ax = plt.subplot(111)
if group[1] == 'garcia':
# define the label for the plot
label=group[0]
# Compute the theoretical fold change for this operator
fc = mwc.fold_change_log(np.array([0]), -5.3, .31, 4.5,
r_array / 2, data.energy.unique())
ax.plot(r_array, fc, color=colors[i])
# Compute the fold change with the fit energies
fc = mwc.fold_change_log(np.array([0]), -5.3, .31, 4.5,
r_array / 2, param_fit.ix[group[0],'mode'])
ax.plot(r_array, fc, color=colors[i], linestyle='--')
# Adjust the number of repressors
# Extract the number of repressors
rep = data.repressor.values
# Find the closest repressor copy number compared to the original
# reported values and point at the right RBS
# NOTE: This was very clever ;)
rbs = [rep_dict[min(rep_dict, key=lambda x:abs(x - r))] for r in rep]
# Plot experimental data
ax.plot(param_fit.ix[rbs, 'mode'] * 2, data.fold_change,
color=colors[i], lw=0, marker='o', markeredgecolor='black',
label=label)
else:
label=''
# Plot experimental data
ax.plot(data.repressor, data.fold_change, color=colors[i],
marker='o', lw=0, label=label, markeredgecolor='black')
i+=1
df_group = df[df.IPTG_uM==0].groupby('operator')
i=1
for group, data in df_group:
plt.plot(data.repressors * 2, data.fold_change_A,
color=colors[i], marker='v', lw=0, label='')
i+=2
# Add fake plots to include generic label for theory curves
ax.plot([], [], color='black', label='HG & Phillips, 2011')
ax.plot([], [], color='black', linestyle='--', label='global fit')
# Format the plot
ax.set_xscale('log')
ax.set_yscale('log')
ax.set_xlabel('repressors / cell')
ax.set_ylabel('fold-change')
ax.set_xlim(right=10**3.5)
ax.legend(loc='lower left', fontsize=11)