Methods for modeling protein-DNA interaction based on high-throughput SELEX data

Before You Start
The SELEX Package
SelexGLM
- Loading and initalizing SelexGLM
- Creating a Feature-based Model
NRLBtools
- Working with NRLB Models for DNA Binding Specificity
- Visualizing enhancer affinity profiles

Before You Start

Before continuing with the rest of the tutorial below, first download and install the SELEX package from BioConductor here:

https://www.bioconductor.org/packages/release/bioc/html/SELEX.html

The installation instructions are straightforward and should not create issues. However, on some computers the rJava install either fails or incorrectly links to the system’s Java installation. As long as a JRE is properly installed on your computer (on unix systems, can type which java in a terminal to test), this issue is usually resolved by closing R, going to a terminal and typing R CMD javareconf. The SelexGLM package can be installed from the package archive file from here:

http://bussemakerlab.org/R-packages/SelexGLM_0.99.tar.gz

The NRLBTools package can be downloaded from here:

http://bussemakerlab.org/R-packages/NRLBtools_1.1.tar.gz

Finally, while not required for this exercise, the full data files from Slattery et al., Cell, 2011 are available here:

http://bussemakerlab.org/papers/SELEXseq2011/fastq/

The SELEX Package

Initializing the Workspace

This package contains an efficient implementation of the k-mer table based approach for analyzing SELEX-seq data described in Slattery et al. (2011) and Riley et al. (2014). This tutorial will use down-sampled data from Slattery et al. (2011) to recreate some of the figures and analyses. We will begin by setting the maximum Java memory usage limit and then load the package:

options(java.parameters="-Xmx1500M")  # Sets memory usage to 1.5 GB
library(SELEX)

Workflow in the SELEX package is centered around the workspace. The workspace consists of both the samples (and their associated data files) you are currently working with and the saved outputs of the various analyses you have run on these data. The workspace has a physical location on disk, and can be configured at any time:

workDir = "./cache/"
selex.config(workingDir=workDir, maxThreadNumber=4)  # maxThreadNumber sets the number of processor threads to use

## Reading the setting file:/Users/HJB/SELEX.properties
## Saving the setting file:/Users/HJB/SELEX.properties

Before any analyses can be performed, samples must be made ‘visible’ to the current SELEX session, which can be performed with the selex.loadAnnotation or selex.defineSample commands. Loading a sample lets the current SELEX session know the experimental setup of your data. You must provide round, barcode, variable region, and file path information for each sample you load. While selex.defineSample is convenient when one needs to quickly analyze new data, the XML-based database used by selex.loadAnnotation can be very useful for long-term storage and cataloging of data.

exampleFiles = selex.exampledata(workDir)  # Extract example data from package, including XML annotation

## Extracting example data file to : ./cache//R0.fastq.gz [TRUE]
## Extracting example data file to : ./cache//R2.fastq.gz [TRUE]
## Extracting example data file to : ./cache//config.xml [TRUE]

selex.loadAnnotation(exampleFiles[3])  # Load all sample files using XML database

You can use selex.sampleSummary to see the samples currently visible datasets:

  selex.sampleSummary()

##                seqName        sampleName rounds leftBarcode rightBarcode
## 2         R0.libraries      R0.barcodeCG      0         TGG      CCACGTC
## 5         R0.libraries      R0.barcodeGC      0         TGG      CCAGCTG
## 1         R2.libraries         ExdHox.R2      2         TGG      CCAGCTG
## 4 R2.libraries.split.1 ExdHox.R2.split.1      2         TGG      CCAGCTG
## 3 R2.libraries.split.2 ExdHox.R2.split.2      2         TGG      CCAGCTG
##                       leftFlank                   rightFlank
## 2 GTTCAGAGTTCTACAGTCCGACGATCTGG CCACGTCTCGTATGCCGTCTTCTGCTTG
## 5 GTTCAGAGTTCTACAGTCCGACGATCTGG CCAGCTGTCGTATGCCGTCTTCTGCTTG
## 1 GTTCAGAGTTCTACAGTCCGACGATCTGG CCAGCTGTCGTATGCCGTCTTCTGCTTG
## 4 GTTCAGAGTTCTACAGTCCGACGATCTGG CCAGCTGTCGTATGCCGTCTTCTGCTTG
## 3 GTTCAGAGTTCTACAGTCCGACGATCTGG CCAGCTGTCGTATGCCGTCTTCTGCTTG
##                                                                       seqFile
## 2                                                         ./cache/R0.fastq.gz
## 5                                                         ./cache/R0.fastq.gz
## 1                                                         ./cache/R2.fastq.gz
## 4 /Users/HJB/GitHub/Tutorials/./cache//R2.libraries.ExdHox.R2.2_split//0.part
## 3 /Users/HJB/GitHub/Tutorials/./cache//R2.libraries.ExdHox.R2.2_split//1.part

Sample handles are an easy way to address visible data, and are used by most package functions to allow easy manipulation of your datasets.

r0train = selex.sample(seqName="R0.libraries", 
                       sampleName="R0.barcodeGC", round=0)
r0test  = selex.sample(seqName="R0.libraries", 
                       sampleName="R0.barcodeCG", round=0)
r2 = selex.sample(seqName="R2.libraries", 
                  sampleName="ExdHox.R2", round=2)

At this point, you are ready to analyze your data.

Building a Markov Model

In order to properly identify motifs from a SELEX experiment, one needs to be able to characterize the non-randomness of the initial pool of oligomers. This non-randomness can be represented with a Markov model. In order to choose the optimal Markov model, models of various orders will be evaluated, and the one with the greatest cross-validated predictive capability will be chosen. This requires a testing dataset and finding the longest oligonucleotide length \(k\) such that all \(k\)-mers within this dataset are found at least 100 times. You can find this value using selex.kmax:

kmax.value = selex.kmax(sample=r0test)

## Counting [R0.libraries.R0.barcodeCG.0][ K = 1 ]
## Counting [R0.libraries.R0.barcodeCG.0][ K = 2 ]
## Counting [R0.libraries.R0.barcodeCG.0][ K = 3 ]
## Counting [R0.libraries.R0.barcodeCG.0][ K = 4 ]
## Counting [R0.libraries.R0.barcodeCG.0][ K = 5 ]
## Counting [R0.libraries.R0.barcodeCG.0][ K = 6 ]
## [ sample id : R0.libraries.R0.barcodeCG.0,  filter:  variableRegionIncludeRegex:null,variableRegionExcludeRegex:null,variableRegionGroupRegex:null ]
## [ R0.libraries.R0.barcodeCG.0.kmax = 5 ]

We are now ready to find, build and store the optimal Markov model. All this work can be easily performed using selex.mm:

mm = selex.mm(sample=r0train, order=NA, 
              crossValidationSample=r0test, Kmax=kmax.value)

## Overwriting Kmax = 5
## Counting [R0.libraries.R0.barcodeGC.0][ K = 1 ]
## Counting [R0.libraries.R0.barcodeGC.0][ K = 2 ]
## Counting [R0.libraries.R0.barcodeGC.0][ K = 3 ]
## Counting [R0.libraries.R0.barcodeGC.0][ K = 4 ]
## Counting [R0.libraries.R0.barcodeGC.0][ K = 5 ]
## [ markovLength = 2 ]
## [ maxR = 0.989223 ]
## [ Model = MarkovModelInfo [markovLength=2, markovLengthTotalCount=518340, markovR2=0.9892230425821316, markovCountsPath=/Users/HJB/GitHub/Tutorials/./cache//R0.libraries.R0.barcodeGC.0.2.dat_A7FE7F4E2E78A43F892C7F3227FFA520, markovObjPath=/Users/HJB/GitHub/Tutorials/./cache//R0.libraries.R0.barcodeGC.0.2.dat_A7FE7F4E2E78A43F892C7F3227FFA520.prob.obj2, sample=config.ExperimentReference@5dfcfece, markovModelMethod=DIVISION, crossValidationSample=config.ExperimentReference@23ceabc1, filter=variableRegionIncludeRegex:null,variableRegionExcludeRegex:null,variableRegionGroupRegex:null,kmerIncludeRegex:null,kmerExcludeRegex:null,kmerIncludeOnly:null] ]

You can use selex.mmSummary to see the cross-validated R\(^2\) values of these models and plot them:

selex.mmSummary()

##                        Sample Order MarkovModelType         R
## 1 R0.libraries.R0.barcodeGC.0     0        DIVISION 0.9800940
## 2 R0.libraries.R0.barcodeGC.0     1        DIVISION 0.9892230
## 3 R0.libraries.R0.barcodeGC.0     2        DIVISION 0.9891462
## 4 R0.libraries.R0.barcodeGC.0     3        DIVISION 0.9889096
## 5 R0.libraries.R0.barcodeGC.0     4        DIVISION 0.9837154
##                      CVSample CVLength
## 1 R0.libraries.R0.barcodeCG.0        5
## 2 R0.libraries.R0.barcodeCG.0        5
## 3 R0.libraries.R0.barcodeCG.0        5
## 4 R0.libraries.R0.barcodeCG.0        5
## 5 R0.libraries.R0.barcodeCG.0        5

mm.r2 = selex.mmSummary()
idx = which(mm.r2$R==max(mm.r2$R))
colstring = rep('BLUE',nrow(mm.r2))
colstring[idx]='RED'
barplot(height=mm.r2$R,names.arg=(mm.r2$Order), ylim=c(.98,1), xpd=FALSE, col=colstring, 
        xlab="Markov Model Order", ylab=expression(Markov ~ Model ~ R^{2}))

Finding the Optimal Footprint Size

We can find the optimal footprint size by observing how the distribution of \(k\)-mers frequences changes between selection rounds. selex.infogain uses the Kullback-Leibler divergence metric can be used to measure this for many \(k\)-mer lengths:

selex.infogain(sample=r2,markovModel=mm)

## Counting [InfoGain][ K = 2 ]
## Counting [InfoGain][ K = 3 ]
## Counting [InfoGain][ K = 4 ]
## Counting [InfoGain][ K = 5 ]
## Counting [InfoGain][ K = 6 ]
## Counting [InfoGain][ K = 7 ]
## Counting [InfoGain][ K = 8 ]
## Counting [InfoGain][ K = 9 ]
## Counting [InfoGain][ K = 10 ]
## Counting [InfoGain][ K = 11 ]
## Counting [InfoGain][ K = 12 ]
## Counting [InfoGain][ K = 13 ]
## Counting [InfoGain][ K = 14 ]
## Counting [InfoGain][ K = 15 ]
## Counting [InfoGain][ K = 16 ]

## [1] 2.415874

selex.infogainSummary can be used to view and plot the results:

selex.infogainSummary()[,1:3]

##                      Sample  K InformationGain
## 1  R2.libraries.ExdHox.R2.2  2       0.1448297
## 2  R2.libraries.ExdHox.R2.2  3       0.3623032
## 3  R2.libraries.ExdHox.R2.2  4       0.6695950
## 4  R2.libraries.ExdHox.R2.2  5       1.0660545
## 5  R2.libraries.ExdHox.R2.2  6       1.5104582
## 6  R2.libraries.ExdHox.R2.2  7       1.9133892
## 7  R2.libraries.ExdHox.R2.2  8       2.2302201
## 8  R2.libraries.ExdHox.R2.2  9       2.4158743
## 9  R2.libraries.ExdHox.R2.2 10       2.4143849
## 10 R2.libraries.ExdHox.R2.2 11       2.0598432
## 11 R2.libraries.ExdHox.R2.2 12       1.1818942
## 12 R2.libraries.ExdHox.R2.2 13       0.1063632
## 13 R2.libraries.ExdHox.R2.2 14       0.0000000
## 14 R2.libraries.ExdHox.R2.2 15       0.0000000
## 15 R2.libraries.ExdHox.R2.2 16       0.0000000

infoscores = selex.infogainSummary()
idx = which(infoscores$InformationGain==max(infoscores$InformationGain))
colstring = rep('BLUE', nrow(infoscores))
colstring[idx] = 'RED'
barplot(height=infoscores$InformationGain, names.arg=infoscores$K, col=colstring,
        xlab="Oligonucleotide Length (bp)", ylab="Information Gain (bits)")

optimalLength = infoscores$K[idx]

selex.counts can be used to compute the \(k\)-mer count table for the optimal footprint size:

table = selex.counts(sample=r2, k=optimalLength, markovModel=mm)

## Counting [R2.libraries.ExdHox.R2.2][ K = 9 ]
## [ Lowest Count =  1 ]

head(table)

##        Kmer ObservedCount  Probability ExpectedCount
## 1 ATGATTGAT          7118 7.741514e-06      3.093261
## 2 TGATTGATT          6068 9.665983e-06      3.862217
## 3 TGATTGATG          4310 7.506314e-06      2.999283
## 4 GATTGATTA          3996 7.916402e-06      3.163141
## 5 ATCAATCAT          3870 4.642924e-06      1.855164
## 6 TTGATTGAT          3451 9.665982e-06      3.862217

Calculating Relative Affinity Tables and Error

With the optimal footprint size and a Markov model, selex.affinities can be used to estimate the affinity and the standard error of the affinity estimate from the \(k\)-mer enrichments:

aff = selex.affinities(sample=r2, k=optimalLength, markovModel=mm)

## Counting [R2.libraries.ExdHox.R2.2][ K = 9 ]
## [ Lowest Count =  1 ]

head(aff)

##        Kmer ObservedCount  Probability ExpectedCount  Affinity         SE
## 1 ATGATTGAT          7118 7.741514e-06      3.093261 1.0000000 0.01676239
## 2 TGATTGATT          6068 9.665983e-06      3.862217 0.8262924 0.01500120
## 3 TGATTGATG          4310 7.506314e-06      2.999283 0.7902404 0.01702298
## 4 GATTGATTA          3996 7.916402e-06      3.163141 0.7409396 0.01657620
## 5 ATCAATCAT          3870 4.642924e-06      1.855164 0.9521244 0.02164478
## 6 TTGATTGAT          3451 9.665982e-06      3.862217 0.6231369 0.01500120

SelexGLM

Loading and initalizing SelexGLM

First, we need to load all the packages used by SelexGLM:

library(stringi)
library(Biostrings)
library(SelexGLM)
library(devtools)
library(reshape2)
library(ggplot2)
library(Rmisc)

First, we need to provide some information about the SELEX library and the kind of SelexGLM model we want to build:

model = new("model",
            varRegLen = 16,
            leftFixedSeq = "GTTCAGAGTTCTACAGTCCGACGATCTGG",  # These are the left and right adapter sequences 
            rightFixedSeq ="CCAGCTGTCGTATGCCGTCTTCTGCTTG", 
            seedLen = optimalLength,  # This is the optimal kmer length computed earlier 
            leftFixedSeqOverlap = 4,  # The number of bases in the fixed flanking regions to consider
            initialAffinityCutoff = 0.00,
            missingValueSuppression = 1,
            minSeedValue = .001, 
            upFootprintExtend = 2,
            includeWindowFactor = FALSE,
            confidenceLevel = .95, 
            verbose = FALSE, 
            useFixedValuesOffset.N = FALSE,
            rounds = list(c(2)),
            rcSymmetric = FALSE,
            minAffinity = 0.01
)

Next, we need to precompute some tables to eventually create design matrices:

data.kmerTable = selex.affinities(sample=r2, k=optimalLength, markovModel=mm)

## Counting [R2.libraries.ExdHox.R2.2][ K = 9 ]
## [ Lowest Count =  1 ]

data.kmerTable = data.kmerTable[order(-data.kmerTable$Affinity), ]
rownames(data.kmerTable) = NULL
head(data.kmerTable)  # Visualize what the kmer count table looks like

##        Kmer ObservedCount  Probability ExpectedCount  Affinity         SE
## 1 ATGATTGAT          7118 7.741514e-06      3.093261 1.0000000 0.01676239
## 2 ATCAATCAT          3870 4.642924e-06      1.855164 0.9521244 0.02164478
## 3 AATCAATCA          2603 3.466848e-06      1.385241 0.9036571 0.02504849
## 4 TGATTGATT          6068 9.665983e-06      3.862217 0.8262924 0.01500120
## 5 CATCAATCA          1853 3.042163e-06      1.215551 0.8139176 0.02673977
## 6 TGATTGATG          4310 7.506314e-06      2.999283 0.7902404 0.01702298

data.probeCounts = getProbeCounts(r2, markovModel = mm)

## Counting [R2.libraries.split.1.ExdHox.R2.split.1.2][ K = 16 ]
## [ Lowest Count =  1 ]
## Counting [R2.libraries.split.2.ExdHox.R2.split.2.2][ K = 16 ]
## [ Lowest Count =  1 ]

head(data.probeCounts)  # Visualize what the raw R2 probe data looks like

##              Probe ObservedCount  Probability Round
## 1 ATGTTTGATTGATTAT             4 1.413033e-09     2
## 2 ATGATTGATTATGTTT             4 1.413033e-09     2
## 3 GTTGATTGATGGGTTT             3 1.193594e-09     2
## 4 GTGATTGATTGTTTTC             3 1.129550e-09     2
## 5 GGATCATCAATCATTT             3 4.386453e-10     2
## 6 GATGATTGATGGGTTT             3 8.936956e-10     2

Creating a Feature-based Model

We will now use SelexGLM to create a mononucleotide model of the sample data stored in the SELEX package. To do so, we will first need to seed the fit. We do so by computing a seed PSAM from the kmer table computed before:

addSeedPsam(model) = seedTable2psam(model, data.kmerTable)
model@features@N  # How the model object looks after the seeding

## An object of class 'N'
## 
## Slot "seedLen":  9 
## 
## Slot "N.upFootprintExtend":  2 
## 
## Slot "N.downFootprintExtend":  2 
## 
## Slot "fS.upFootprintExtend":  2 
## 
## Slot "fS.downFootprintExtend":  2 
## 
## Slot "fpLen":  13 
## 
## Slot "N.set":  1 2 3 4 5 6 7 8 9 10 11 12 13 
## 
## Slot "N.equivMat":
## 13  x  13  null equivalence matrix
## 
## Slot "N.values":
##     1 2          3         4         5         6         7         8
## N.A 0 0  0.0000000 -1.120526 -1.752076  0.000000 -3.047111 -1.799968
## N.C 0 0 -1.5076843 -3.047111 -3.047111 -3.047111 -1.841626 -1.007452
## N.G 0 0 -0.5755842 -3.047111  0.000000 -3.047111 -3.047111 -1.000440
## N.T 0 0 -0.4729891  0.000000 -3.047111 -3.047111  0.000000  0.000000
##             9        10        11 12 13
## N.A -1.295752  0.000000 -3.047111  0  0
## N.C -3.047111 -3.047111 -3.047111  0  0
## N.G  0.000000 -3.047111 -3.047111  0  0
## N.T -2.047111 -3.047111  0.000000  0  0
## 
## 
## Slot "N.errors":
##     1 2 3 4 5 6 7 8 9 10 11 12 13
## N.A 0 0 0 0 0 0 0 0 0  0  0  0  0
## N.C 0 0 0 0 0 0 0 0 0  0  0  0  0
## N.G 0 0 0 0 0 0 0 0 0  0  0  0  0
## N.T 0 0 0 0 0 0 0 0 0  0  0  0  0
## 
## 
## Slot "N.z":
##     1 2 3 4 5 6 7 8 9 10 11 12 13
## N.A 0 0 0 0 0 0 0 0 0  0  0  0  0
## N.C 0 0 0 0 0 0 0 0 0  0  0  0  0
## N.G 0 0 0 0 0 0 0 0 0  0  0  0  0
## N.T 0 0 0 0 0 0 0 0 0  0  0  0  0
## 
## 
## Slot "N.sig":
##     1 2 3 4 5 6 7 8 9 10 11 12 13
## N.A 0 0 0 0 0 0 0 0 0  0  0  0  0
## N.C 0 0 0 0 0 0 0 0 0  0  0  0  0
## N.G 0 0 0 0 0 0 0 0 0  0  0  0  0
## N.T 0 0 0 0 0 0 0 0 0  0  0  0  0
## 
## 
## Slot "N.oldValues":
## <4 x 13 x 0 array of double>
## 
## Slot "N.oldErrors":
## <4 x 13 x 0 array of double>
## 
## Slot "N.oldZ":
## <4 x 13 x 0 array of double>
## 
## Slot "N.oldSig":
## <4 x 13 x 0 array of double>

Next, we need to select the top binding window for each probe and a model affinity score and confidence level for the selected window.

data = data.probeCounts
data = topModelMatch(data, model)

Use the aligned probes to build design matrix:

data = addDesignMatrix(data, model)

Now construct the regression expression with independent features and fit model:

regressionFormula = updatedRegressionFormula(data, model)
fit = glm(regressionFormula, 
          data=data, 
          family = poisson(link="log"))
model = addNewBetas(model, data, fit)  # Nucleotide Features after first round of fitting
plot(model, Nplot.ddG = TRUE, verticalPlots = TRUE)  # Plot the feature values

As you can see, the parameter values are a bit extreme. We need to run a few more rounds of fitting to reach a stable solution:

data = data.probeCounts
data.nrow = nrow(data)
for (i in 2:10) {
  data = topModelMatch(data, model)
  data = addDesignMatrix(data, model)
  if (data.nrow == nrow(data)) {
    print(paste0("Stability Reached in iteration ", i))
    break
  } else {
    data.nrow = nrow(data)
  }

  regressionFormula = updatedRegressionFormula(data, model)
  fit = glm(regressionFormula, 
            data=data, 
            family = poisson(link="log"))

  model = addNewBetas(model,data,fit)  # Nucleotide Features after i'th round of fitting
}
plot(model, Nplot.ddG = TRUE, verticalPlots = TRUE)  # Plot the final solution

NRLBtools

Working with NRLB Models for DNA Binding Specificity

The NRLBtools package makes it easy to use the DNA binding specicity models that were built from SELEX in vitro binding data using the NRLB algorithm and as described in the Rastogi et al., PNAS (2018) paper. While it can be used to work with NRLB models that you have built, we explore the pre-loaded models. The available models are:

library(NRLBtools)

## Welcome to the NRLBtools package

NRLBtools::model.info()

##       Protein  Platform        Info GenomicMaximum
## 1        AbdB SELEX-seq  HoxMonomer      1.0000000
## 2         Exd SELEX-seq  HoxMonomer      1.0000000
## 3         Lab SELEX-seq  HoxMonomer      1.0000000
## 4          Pb SELEX-seq  HoxMonomer      1.0000000
## 5         Scr SELEX-seq  HoxMonomer      1.0000000
## 6       UbxIa SELEX-seq  HoxMonomer      1.0000000
## 7      UbxIVa SELEX-seq  HoxMonomer      1.0000000
## 8     ExdAbdA SELEX-seq ExdHoxDimer      0.7926055
## 9     ExdAbdB SELEX-seq ExdHoxDimer      0.8906533
## 10    ExdAntp SELEX-seq ExdHoxDimer      0.9175093
## 11     ExdDfd SELEX-seq ExdHoxDimer      0.8553881
## 12     ExdLab SELEX-seq ExdHoxDimer      1.0000000
## 13      ExdPb SELEX-seq ExdHoxDimer      0.8217055
## 14     ExdScr SELEX-seq ExdHoxDimer      0.9197844
## 15   ExdUbxIa SELEX-seq ExdHoxDimer      0.8262734
## 16  ExdUbxIVa SELEX-seq ExdHoxDimer      0.8429623
## 17       AbdA SELEX-seq  HoxMonomer      1.0000000
## 18        Dfd SELEX-seq  HoxMonomer      1.0000000
## 19      CEBPb  HT-SELEX     NucOnly             NA
## 20      CEBPd  HT-SELEX     NucOnly             NA
## 21       E2F1  HT-SELEX     NucOnly             NA
## 22       EBF1  HT-SELEX     NucOnly             NA
## 23       EGR1  HT-SELEX     NucOnly             NA
## 24       ELF1  HT-SELEX     NucOnly             NA
## 25       ELK1  HT-SELEX     NucOnly             NA
## 26       ETS1  HT-SELEX     NucOnly             NA
## 27      GABPA  HT-SELEX     NucOnly             NA
## 28      GATA3  HT-SELEX     NucOnly             NA
## 29       IRF4  HT-SELEX     NucOnly             NA
## 30       MAFF  HT-SELEX     NucOnly             NA
## 31       MAFK  HT-SELEX     NucOnly             NA
## 32      MEF2A  HT-SELEX     NucOnly             NA
## 33     NFATC1  HT-SELEX     NucOnly             NA
## 34       NFE2  HT-SELEX     NucOnly             NA
## 35      NR2C2  HT-SELEX     NucOnly             NA
## 36       NRF1  HT-SELEX     NucOnly             NA
## 37       PAX5  HT-SELEX     NucOnly             NA
## 38     POU2F2  HT-SELEX     NucOnly             NA
## 39   POU5F1P1  HT-SELEX     NucOnly             NA
## 40       RFX5  HT-SELEX     NucOnly             NA
## 41      RUNX3  HT-SELEX     NucOnly             NA
## 42        SP1  HT-SELEX     NucOnly             NA
## 43     TFAP2A  HT-SELEX     NucOnly             NA
## 44     TFAP2C  HT-SELEX     NucOnly             NA
## 45       USF1  HT-SELEX     NucOnly             NA
## 46        YY1  HT-SELEX     NucOnly             NA
## 47     ZNF143  HT-SELEX     NucOnly             NA
## 48        MAX SELEX-seq     NucOnly             NA
## 49        MAX SELEX-seq    NucDinuc             NA
## 50        MAX  HT-SELEX     NucOnly             NA
## 51        MAX  HT-SELEX    NucDinuc             NA
## 52        MAX SMiLE-seq     NucOnly             NA
## 53        MAX SMiLE-seq    NucDinuc             NA
## 54        p53 SELEX-seq    WildType             NA
## 55        p53 SELEX-seq         Δ30             NA
## 56       ATF4 SELEX-seq    ATF4Only             NA
## 57      CEBPb SELEX-seq   CEBPbOnly             NA
## 58       ATF4 SELEX-seq  ATF4-CEBPb             NA
## 59      CEBPb SELEX-seq  ATF4-CEBPb             NA
## 60 ATF4-CEBPb SELEX-seq  ATF4-CEBPb             NA
##                             Genome
## 1  BSgenome.Dmelanogaster.UCSC.dm3
## 2  BSgenome.Dmelanogaster.UCSC.dm3
## 3  BSgenome.Dmelanogaster.UCSC.dm3
## 4  BSgenome.Dmelanogaster.UCSC.dm3
## 5  BSgenome.Dmelanogaster.UCSC.dm3
## 6  BSgenome.Dmelanogaster.UCSC.dm3
## 7  BSgenome.Dmelanogaster.UCSC.dm3
## 8  BSgenome.Dmelanogaster.UCSC.dm3
## 9  BSgenome.Dmelanogaster.UCSC.dm3
## 10 BSgenome.Dmelanogaster.UCSC.dm3
## 11 BSgenome.Dmelanogaster.UCSC.dm3
## 12 BSgenome.Dmelanogaster.UCSC.dm3
## 13 BSgenome.Dmelanogaster.UCSC.dm3
## 14 BSgenome.Dmelanogaster.UCSC.dm3
## 15 BSgenome.Dmelanogaster.UCSC.dm3
## 16 BSgenome.Dmelanogaster.UCSC.dm3
## 17 BSgenome.Dmelanogaster.UCSC.dm3
## 18 BSgenome.Dmelanogaster.UCSC.dm3
## 19                            <NA>
## 20                            <NA>
## 21                            <NA>
## 22                            <NA>
## 23                            <NA>
## 24                            <NA>
## 25                            <NA>
## 26                            <NA>
## 27                            <NA>
## 28                            <NA>
## 29                            <NA>
## 30                            <NA>
## 31                            <NA>
## 32                            <NA>
## 33                            <NA>
## 34                            <NA>
## 35                            <NA>
## 36                            <NA>
## 37                            <NA>
## 38                            <NA>
## 39                            <NA>
## 40                            <NA>
## 41                            <NA>
## 42                            <NA>
## 43                            <NA>
## 44                            <NA>
## 45                            <NA>
## 46                            <NA>
## 47                            <NA>
## 48                            <NA>
## 49                            <NA>
## 50                            <NA>
## 51                            <NA>
## 52                            <NA>
## 53                            <NA>
## 54                            <NA>
## 55                            <NA>
## 56                            <NA>
## 57                            <NA>
## 58                            <NA>
## 59                            <NA>
## 60                            <NA>

It is always a good idea to visualize the model as an energy logo before using it. We can visualize the Exd-UbxIVa model used in the paper by using the logo function and its model number (16 in this case):

NRLBtools::logo(nrlb.model = 16)

We can actually use model.info to provide more information about the NRLB models that come with this package. For example, we can see that model number 16 is a multi-mode model with two modes:

NRLBtools::model.info(nrlb.model = 16)

##   Mode  k RelativeAffinity       BestSequence
## 1    1 18         0.000000 GGATGATTTATGACCTTT
## 2    2 10        -7.057776         ATCATTAAAA
## 3  NSB  -        -9.559147                  -

Visualizing enhancer affinity profiles

We start by extracting the sequence underlying a genomic region of interest. Here, we will only use the E3N enhancer element (as shown in the paper) from the dm3 Drosophila genome:

# Just extracted the 292bp genomic sequence chrX:4,915,195-4,915,486 from the dm3 assembly

enh = "TCTTTTTTTTTATCCGGTGCGATCGTTTAGTTGCTGATTTGTTGACCCGATAAAAAATGGGACTTTAAGCCTCGCTGGCATGCACATAATTTGTAGTTTTTGGTCATTGGACAAAACTTTGACAAACAAAAAAAAAAGAAAAAGGCGATCAAAACTTGTCGAAAGTTCGGAAGTTCGATTGAGAGGTTTTTTTATTGCTTTTAAAGTGGACGGCCGCCCAAGTTGGTGGAGTTCATAATTCCTGAGCTGTTTGTTTGTTCTTCTTTATTACGCGACTACCTACCTAACCTAC"

An affinity profile along the E3N enhancer can be created as follows:

NRLBtools::score.seq(sequence = enh, nrlb.model = 16, plot = TRUE, nPeaks = 10, annotate = TRUE)

## ExdUbxIVa scores in enh
##        Affinity Position           Sequence
## 1  2.100306e-03       84 ACATAATTTGTAGTTTTT
## 2  1.733473e-03      185 GGTTTTTTTATTGCTTTT
## 3  1.149797e-03       32 TGCTGATTTGTTGACCCG
## 4  9.752449e-04      -79 CATGCACATAATTTGTAG
## 5  9.206493e-04       48 CGATAAAAAATGGGACTT
## 6  7.306726e-04     -228 GGAGTTCATAATTCCTGA
## 7  4.345313e-04      258 TTCTTCTTTATTACGCGA
## 8  3.288965e-04      -43 TGACCCGATAAAAAATGG
## 9  5.154725e-05        3 TTTTTTTTTATCCGGTGC
## 10 4.078117e-05      173 GTTCGATTGAGAGGTTTT