© 2018 Suzy Beeler. This work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license

This exercise was generated from a Jupyter notebook. You can download the notebook here.


Objective

In this tutorial, we will be embarking on a large endeavor to compare theory and experiment in the context of gene expression in E. coli. In lecture, we derived a mathematical prediction of how the expression of a reporter gene (YFP) should change in response to increasing copy number of the LacI repressor. Here, we will make plots of our predictions as well as go through all the data analysis needed to plot the data on top of our theory to see how they compare.

Part 1: Using phase images to get segmentation masks

At the end of the day, we are interesting in determining the amount of YFP expression in E. coli with varying copy numbers of the repressor, LacI. Before we can assess the amount of fluorescence, we first need to find where the cell are in the images. To do this, it's important to use a channel that is independent of the reporter gene, so we will be using phase images to get our segmentation masks.

1.1: Background subtraction

Let's first look at a sample image.

In [1]:
# import the usual 
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# for nice plots
import seaborn as sns
rc={'lines.linewidth': 2, 'axes.labelsize': 14, 'axes.titlesize': 14, \
    'xtick.labelsize' : 14, 'ytick.labelsize' : 14}
sns.set(style='ticks', rc=rc)

# show images in viridis by default
plt.rcParams['image.cmap'] = 'viridis'

# for extracting filenames
import glob

# skimage submodules we need
import skimage.io
import skimage.measure
import skimage.filters
import skimage.exposure
In [2]:
# Read an example phase contrast image
im = skimage.io.imread('data/lacI_titration/O2_R124_phase_pos_02.tif')

# Show the image
plt.imshow(im)
Out[2]:
<matplotlib.image.AxesImage at 0x1c2093fa90>

This image is noisy and unevenly illuminated, where the top right corner is brighter than the bottom left. This gradient of pixel intensity, is not biologically relevant, and is actually something we would like to remove. We can fix this by background subtracting the image. To get the "background" we need to blur the image, removing any fine-detail structures (like the bacteria).

In [3]:
# normalize it
im_float = (im - np.min(im))/(np.max(im)-np.min(im))

# find the background
gauss_radius = 50
im_bg = skimage.filters.gaussian(im_float, sigma=gauss_radius)

# show the background
plt.imshow(im_bg)
Out[3]:
<matplotlib.image.AxesImage at 0x1c216858d0>

Now that we have the background image, we can subtract it from the original image.

In [4]:
# perform a background subtraction
im_bs = im_float - im_bg

# show the background-subtracted image
fig, ax = plt.subplots(1,2)
ax[0].imshow(im_float)
ax[0].set_title('original image')

ax[1].imshow(im_bs)
ax[1].set_title('background-subtracted image')
Out[4]:
Text(0.5,1,'background-subtracted image')

Nice! We can see that the background of our background-subtracted image is a lot more even now.

1.2: Thresholding

Now that we have cleaned up the image a little bit, we can start to distinguishing bacteria from background. We will do these by thresholding the image, similar to how we did on the first day. To figure out a good threshold, let's plot the histogram of pixel intensities.

In [5]:
hist, pixel_vals = skimage.exposure.histogram(im_bs)

plt.plot(pixel_vals, hist, '.')
plt.yscale('log')
plt.xlabel('pixel value')
plt.ylabel('count')
Out[5]:
Text(0,0.5,'count')

By plotting with the $y$-axis logarithmically scaled, we can distinguish one hump of low pixel intensities (bacteria) from another hump of high pixel intensities (background). There is a sharp transition around $-0.2$ where we transition from bacteria to background, and we will use this as our threshold.

In [6]:
# Choose a threshold
threshold = -0.2

# Thresholded image
im_thresh = im_bs < threshold

# Show the thresholded image
plt.imshow(im_thresh)
Out[6]:
<matplotlib.image.AxesImage at 0x1c22a8b710>

1.3: Label and get sizes of the objects in the image

Our thresholding did okay, but we see we have a lot of single pixels and a lot of clumps that don't correspond to individual bacteria. To remove these features that don't correspond to the bacteria, we will first label the image using skimage.measure.label(). This function will look for contiguous islands of $1$ pixels in our labeled image and assign each island as an object. Let's label our image and plot the result.

In [7]:
# label the objects in the binary image
im_label = skimage.measure.label(im_thresh)

# show the labeled image in a colored scale
plt.imshow(im_label)
Out[7]:
<matplotlib.image.AxesImage at 0x1c22b1ad68>

We see that each object whether it's a bacterium, a clump, or a single pixel was assigned a different pixel intensity (i.e. a different color). This is the result of skimage.measure.label() assigning all the pixels in the first object with pixel intensity $1$, and all the pixels in the second object with pixel intensity $2$, and so on, until all the objects have been uniquely identified. This means we can find the total number of objects in the image by looking at the maximum value in the image.

In [8]:
print("Objects in image:", im_label.max())
Objects in image: 179

Now that we have a labeled image, we will call skimage.measure.regionprops() to compute properties of each unique object. regionprops computes many things for us (like centroid, eccentricity, diameter, etc.), but we will just be using area, which accessed by object.area for a given object from the labeled image. Let's plot the areas (converted to µm$^2$) of all the objects in our image.

In [9]:
# length calibration 
pixel_size = 0.16 # µm

# list to store area values in um^2
areas = []

# obtain the areas of all labeled objects
objs = skimage.measure.regionprops(im_label)

# loop through the objects and update 
for obj in objs:
    areas.append(obj.area * pixel_size**2)
    
# Take a look at the distribution
_ = plt.hist(areas, bins = 50)
plt.xlabel('area (µm^2)')
plt.ylabel('count')
Out[9]:
Text(0,0.5,'count')

1.4: Remove non-bacteria objects from the image.

We see that there are a lot of very small objects in the image that certainly aren't bacteria. Additionally, things larger than 4 µm$^2$ are likely multiple bacteria. We would like to remove these objects, and we can do so by initially starting with a blank image (filled with all $0$s), and populating the image with only objects from our labeled image that are of a permissible size.

In [10]:
# area thresholds for what we call a bacteria
lower_area = 1 # µm^2
upper_area = 4 # µm^2

# create a blank image with dimensions of im_label
im_mask = np.zeros_like(im_label)

# loop through the objects
for obj in objs:
    
    # compute object area
    area = obj.area * pixel_size**2
    
    # if area is within the thresholds, add this object to the mask
    if (area > lower_area) and (area < upper_area):
        im_mask = im_mask + (im_label == obj.label)
        
plt.imshow(im_mask)
Out[10]:
<matplotlib.image.AxesImage at 0x1c23165550>

Great! We see that we've cleared out a substantial portion of non-bacteria looking objects. Let's see how many objects remain in this masked image.

In [11]:
# label the remaining objects
im_mask_label = skimage.measure.label(im_mask)

# print the number of object after size filtering
print(im_mask_label.max(), ' objects remain after filtering')
15  objects remain after filtering

1.5 Write a masking function

So far we've set up a pipeline for how to get a mask, an image the specifies where bacteria are located, for an single image. Now we just need to put all these operations together into a single function that we can call for all images. We've made such a function and have included it in the pboc_utilites file. Let's import this file and see that the create_mask() function works the same as the code we've written here.

In [12]:
import pboc_utils as pboc
In [13]:
test_mask = pboc.create_mask(im)

plt.imshow(test_mask)
Out[13]:
<matplotlib.image.AxesImage at 0x1c240aca20>

Step 2: Use our segmentation mask to get fluorescence intensities from a YFP image

Now that we have our mask, we are finally ready to assess the level of fluorescence in the corresponding bacteria. Below we show what the corresponding YFP image looks like.

In [14]:
# read the matching YFP image
im_yfp = skimage.io.imread('data/lacI_titration/O2_R124_yfp_pos_02.tif')

# show them next to each other
fig, ax = plt.subplots(1,3, figsize=[10,3])
ax[0].imshow(im_yfp)
ax[0].set_title('YFP image')

ax[1].imshow(im)
ax[1].set_title('Phase image')

ax[2].imshow(test_mask)
ax[2].set_title('Mask');

Below, we will get the corresponding fluorescence values for the bacteria, by looping through the individual cells in the masked image and then referring to the appropriate pixels in the YFP image.

In [15]:
# list to store yfp intensities 
intensities = [] 

# number of cells in the mask
n_cells = np.max(test_mask)

# loop through cells and find the mean intensity for each
for i in range(1,n_cells):
    
    # get a mask of the i'th cell
    cell_mask = (test_mask == i)
    
    # use cell mask to get a version of the yfp image for only that cell
    cell_yfp = im_yfp * cell_mask
    
    # use the number of pixels in the cell to get a mean intensity
    num_pixels = np.sum(cell_mask)
    cell_mean_intensity = np.sum(cell_yfp) / num_pixels
    
    # update our intensity list
    intensities.append(cell_mean_intensity)
    
# show intensities
sns.swarmplot(intensities)
plt.xlabel("YFP expression (a.u.)");

Write a function to find the YFP intensities

Now that we have a way to find the YFP intensities for a single image, let's turn it into a function that returns the intensities for any given YFP and mask image.

In [16]:
def find_intensities(im_phase,im_yfp):
    
    mask = pboc.create_mask(im_phase)
    
    # list to store yfp intensities 
    intensities = [] 

    # number of cells in the mask
    n_cells = np.max(mask)

    # loop through cells and find the mean intensity for each
    for i in range(1,n_cells):
        cell_mask = (mask == i)
        num_pixels = np.sum(cell_mask)
        cell_yfp = im_yfp * cell_mask
        cell_mean_intensity = np.sum(cell_yfp) / num_pixels
        intensities.append(cell_mean_intensity)
        
    return intensities

Step 3: Get YFP intensities for all images of a given strain

We are making progress towards our ultimate goal of comparing the YFP expression in strains with different binding sites and different repressor copy numbers. We have a pipeline in place that will find the segmentation mask of a phase image and then get the mean intensities of the bacteria in the corresponding YFP image. Let's now work on getting the YFP intensities for all the images of a given strain.

In [17]:
# strain information
operator = "O2"
repressor = "R124"

# filename structure
filename_structure_phase = "./data/lacI_titration/" + operator + "_" + repressor + \
                "_phase_pos_*.tif"
filename_structure_yfp = "./data/lacI_titration/" + operator + "_" + repressor + \
                "_yfp_pos_*.tif"

# grab all the images for this given strain
phase_names = np.sort(glob.glob(filename_structure_phase))
yfp_names = np.sort(glob.glob(filename_structure_yfp))

# number of positions
n_images = len(phase_names)

# list to store intensity values for *all* the images
intensities_all = []

# loop through the images
for i in range(n_images):
    
    # read in the image
    im_phase = skimage.io.imread(phase_names[i])
    im_yfp = skimage.io.imread(yfp_names[i])
    
    # get the intensities from from the YFP image
    intensities = find_intensities(im_phase, im_yfp)
    
    # update the all_intensities
    intensities_all = intensities_all + intensities

Let's plot these intensities.

In [18]:
plt.hist(intensities_all, bins=50);
plt.xlabel("YFP intensity (a.u.)")
plt.ylabel("frequency")
Out[18]:
Text(0,0.5,'frequency')

Write a function to get all bacterial intensities for a given strain

Once again, let's take the code we've developed and turn it into a function.

In [19]:
def find_intensities_all(operator,repressor):
    # filename structure
    filename_phase = "./data/lacI_titration/" + operator + "_" + repressor + \
                    "_phase_pos_*.tif"
    filename_yfp = "./data/lacI_titration/" + operator + "_" + repressor + \
                    "_yfp_pos_*.tif"

    # grab all the images for this given strain
    phase_names = np.sort(glob.glob(filename_phase))
    yfp_names = np.sort(glob.glob(filename_yfp))

    # number of positions
    n_images = len(phase_names)

    # list to store intensity values for *all* the images
    intensities_all = []

    # loop through the images
    for i in range(n_images):
    
        # read in the image
        im_phase = skimage.io.imread(phase_names[i])
        im_yfp = skimage.io.imread(yfp_names[i])

        # get the intensities from from the YFP image
        intensities = find_intensities(im_phase, im_yfp)

        # update the all_intensities
        intensities_all = intensities_all + intensities
        
    return intensities_all

Step 4: Get the fold-change for a given strain.

The fold-change for a given strain is defined as

$$\frac{\text{(YFP when $R \neq 0$)} - \text{autofluorescence}}{\text{(YFP when $R = 0$)} - \text{autofluorescence}}. \tag{1}$$

That is, we are interested in the expression in the presence of repressors relative to some baseline, where there are no repressors. Accordingly, fold-change should take on values less than $1$, as expression will decrease with increasing repressor copy number. Below we write a function the computes the fold change for a given operator / repressor copy number pair.

In [20]:
def find_fold_change(operator, repressor):
    # mean intensities for "auto", "delta", and "Rxxx"
    mean_auto = np.mean(find_intensities_all(operator,"auto"))
    mean_delta = np.mean(find_intensities_all(operator,"delta"))
    mean_R = np.mean(find_intensities_all(operator,repressor))
    
    # calculate the fold change
    fold_change = (mean_R - mean_auto) / (mean_delta - mean_auto)
    
    return fold_change

Let's test this function on the strain we've been working on throughout.

In [21]:
# strain information
operator = 'O2'
repressor = 'R124'

# compute fold-change
fc = find_fold_change(operator, repressor)

# Show the fold-change
print('Fold-change = ', fc )
Fold-change =  0.02670756956035521

Step 5: Analyze the whole dataset

Whew! We're almost there! Just to remind us of what we've accomplished so far:

  • We've made a function that will take a phase image and return a segmentation mask, showing where all the bacteria are.
  • We then wrote a function that will take that segmentation mask and then return the bacterial YFP intensities from the corresponding YFP image.
  • We then ramped things up to write a function that get the YFP intensities for all the images of a given strain.
  • Lastly, we wrote a function that computes the fold-change of given operator / repressor copy number pair, by getting the YFP intensities from the autofluorescent strain, the strain lacking LacI, and the strain with the repressor.

This means we are now ready to make use of our 'fold_change' function to run on the whole dataset!

In [22]:
# names of strain parameters
operators = ['O1', 'O2', 'O3']
repressors = ['R22', 'R60', 'R124', 'R260', 'R1220', 'R1740']
repressor_counts = [22, 60, 124, 260, 1220, 1740]

# number of operators and repressors
n_operators = len(operators)
n_repressors = len(repressors)

# fold-change 2D array to store values in
fold_change_values = np.zeros([n_operators, n_repressors])

# loop through list of operators
for i in range(n_operators):
    # loop through list of repressors
    for j in range(n_repressors):
        
        # grab the operator / repressor pair
        operator = operators[i]
        repressor = repressors[j]
        
        # print to see how the loop is progressing
        print(operator,repressor)
        
        # find the fold-change
        fold_change_values[i,j] = find_fold_change(operator, repressor)
O1 R22
O1 R60
O1 R124
O1 R260
O1 R1220
O1 R1740
O2 R22
O2 R60
O2 R124
O2 R260
O2 R1220
O2 R1740
O3 R22
O3 R60
O3 R124
O3 R260
O3 R1220
O3 R1740

Step 6: Plot the theory curves

To show we haven't "cooked the books," let's plot the theory curves before we look at the results of our data analysis. As we've derived in lecture, the fold-change equation is given by

$$ \text{fold-change}(R) = \frac{1}{1 + \frac{R}{N_{NS}}e^{-\beta \ \epsilon_R}}, \tag{1}$$

where $R$ is the number of repressors, $N_{NS}$ is the number of non-specific binding sites (which we will take to be the length of the genome), and $\epsilon_R$ is the binding energy of the repressor to the DNA. Below, let's plot our theroy curve for our three operator binding site O1, O2, and O3 which have the corresponding binding enegeries: $-15.3$, $-13.9$, and $-9.3 \ k_BT$.

In [23]:
# number of non-specific sites
NNS = 4.6E6

# repressor binding energies in kT units
dE = [-15.3, -13.9, -9.3]

# range of repressor copy numbers, 10 to 10,000
R_range = np.logspace(1, 4, 100)

# assign a color to each operator
colors = ['r', 'g', 'b']

# find and plot the theoretical predictions
for i in range(len(dE)):
    fold_change = 1/(1 + R_range/ NNS * np.exp(-dE[i]))
    plt.loglog(R_range, fold_change, color = colors[i])

plt.xlabel('Number of repressors')
plt.ylabel('Fold-change')
plt.legend(["O1","O2","O3"])
Out[23]:
<matplotlib.legend.Legend at 0x1c24ab71d0>

Step 7: Plot the data over the theory

Finally! The moment of truth! Let's plot our data on top of our theory curves and see how they fare.

In [25]:
# find and plot the theoretical predictions
for i in range(len(operators)):
    fold_change = 1/(1 + R_range/NNS * np.exp(-dE[i]))
    plt.loglog(R_range, fold_change, color = colors[i])

# plot the data on top    
for i in range(len(operators)):
    plt.loglog(repressor_counts, fold_change_values[i,:], '.', markersize = 20, color = colors[i])

plt.xlabel('Number of repressors')
plt.ylabel('Fold-change')
plt.legend(["O1","O2","O3"])
Out[25]:
<matplotlib.legend.Legend at 0x1c253146d8>

Nice! After all that hard work, we see that our data matches the theory extremely well (especially for the lower repressor copy numbers). We see that the last two data points for O1 and O2 don't match as well, but here we are approaching a $1000$-fold decrease in expression, which becomes hard to distinguish from autofluorescence.