Before continuing with the rest of the tutorial below, first download and install the SELEX package from BioConductor here:
https://www.bioconductor.org/packages/release/bioc/html/SELEX.html
The installation instructions are straightforward and should not create issues. However, on some computers the rJava install either fails or incorrectly links to the system’s Java installation. As long as a JRE is properly installed on your computer (on unix systems, can type which java
in a terminal to test), this issue is usually resolved by closing R, going to a terminal and typing R CMD javareconf
. The SelexGLM package can be installed from the package archive file from here:
http://bussemakerlab.org/R-packages/SelexGLM_0.99.tar.gz
The NRLBTools package can be downloaded from here:
http://bussemakerlab.org/R-packages/NRLBtools_1.1.tar.gz
Finally, while not required for this exercise, the full data files from Slattery et al., Cell, 2011 are available here:
This package contains an efficient implementation of the k-mer table based approach for analyzing SELEX-seq data described in Slattery et al. (2011) and Riley et al. (2014). This tutorial will use down-sampled data from Slattery et al. (2011) to recreate some of the figures and analyses. We will begin by setting the maximum Java memory usage limit and then load the package:
options(java.parameters="-Xmx1500M") # Sets memory usage to 1.5 GB
library(SELEX)
Workflow in the SELEX package is centered around the workspace. The workspace consists of both the samples (and their associated data files) you are currently working with and the saved outputs of the various analyses you have run on these data. The workspace has a physical location on disk, and can be configured at any time:
workDir = "./cache/"
selex.config(workingDir=workDir, maxThreadNumber=4) # maxThreadNumber sets the number of processor threads to use
## Reading the setting file:/Users/HJB/SELEX.properties
## Saving the setting file:/Users/HJB/SELEX.properties
Before any analyses can be performed, samples must be made ‘visible’ to the current SELEX session, which can be performed with the selex.loadAnnotation
or selex.defineSample
commands. Loading a sample lets the current SELEX session know the experimental setup of your data. You must provide round, barcode, variable region, and file path information for each sample you load. While selex.defineSample
is convenient when one needs to quickly analyze new data, the XML-based database used by selex.loadAnnotation
can be very useful for long-term storage and cataloging of data.
exampleFiles = selex.exampledata(workDir) # Extract example data from package, including XML annotation
## Extracting example data file to : ./cache//R0.fastq.gz [TRUE]
## Extracting example data file to : ./cache//R2.fastq.gz [TRUE]
## Extracting example data file to : ./cache//config.xml [TRUE]
selex.loadAnnotation(exampleFiles[3]) # Load all sample files using XML database
You can use selex.sampleSummary
to see the samples currently visible datasets:
selex.sampleSummary()
## seqName sampleName rounds leftBarcode rightBarcode
## 2 R0.libraries R0.barcodeCG 0 TGG CCACGTC
## 5 R0.libraries R0.barcodeGC 0 TGG CCAGCTG
## 1 R2.libraries ExdHox.R2 2 TGG CCAGCTG
## 4 R2.libraries.split.1 ExdHox.R2.split.1 2 TGG CCAGCTG
## 3 R2.libraries.split.2 ExdHox.R2.split.2 2 TGG CCAGCTG
## leftFlank rightFlank
## 2 GTTCAGAGTTCTACAGTCCGACGATCTGG CCACGTCTCGTATGCCGTCTTCTGCTTG
## 5 GTTCAGAGTTCTACAGTCCGACGATCTGG CCAGCTGTCGTATGCCGTCTTCTGCTTG
## 1 GTTCAGAGTTCTACAGTCCGACGATCTGG CCAGCTGTCGTATGCCGTCTTCTGCTTG
## 4 GTTCAGAGTTCTACAGTCCGACGATCTGG CCAGCTGTCGTATGCCGTCTTCTGCTTG
## 3 GTTCAGAGTTCTACAGTCCGACGATCTGG CCAGCTGTCGTATGCCGTCTTCTGCTTG
## seqFile
## 2 ./cache/R0.fastq.gz
## 5 ./cache/R0.fastq.gz
## 1 ./cache/R2.fastq.gz
## 4 /Users/HJB/GitHub/Tutorials/./cache//R2.libraries.ExdHox.R2.2_split//0.part
## 3 /Users/HJB/GitHub/Tutorials/./cache//R2.libraries.ExdHox.R2.2_split//1.part
Sample handles are an easy way to address visible data, and are used by most package functions to allow easy manipulation of your datasets.
r0train = selex.sample(seqName="R0.libraries",
sampleName="R0.barcodeGC", round=0)
r0test = selex.sample(seqName="R0.libraries",
sampleName="R0.barcodeCG", round=0)
r2 = selex.sample(seqName="R2.libraries",
sampleName="ExdHox.R2", round=2)
At this point, you are ready to analyze your data.
In order to properly identify motifs from a SELEX experiment, one needs to be able to characterize the non-randomness of the initial pool of oligomers. This non-randomness can be represented with a Markov model. In order to choose the optimal Markov model, models of various orders will be evaluated, and the one with the greatest cross-validated predictive capability will be chosen. This requires a testing dataset and finding the longest oligonucleotide length \(k\) such that all \(k\)-mers within this dataset are found at least 100 times. You can find this value using selex.kmax
:
kmax.value = selex.kmax(sample=r0test)
## Counting [R0.libraries.R0.barcodeCG.0][ K = 1 ]
## Counting [R0.libraries.R0.barcodeCG.0][ K = 2 ]
## Counting [R0.libraries.R0.barcodeCG.0][ K = 3 ]
## Counting [R0.libraries.R0.barcodeCG.0][ K = 4 ]
## Counting [R0.libraries.R0.barcodeCG.0][ K = 5 ]
## Counting [R0.libraries.R0.barcodeCG.0][ K = 6 ]
## [ sample id : R0.libraries.R0.barcodeCG.0, filter: variableRegionIncludeRegex:null,variableRegionExcludeRegex:null,variableRegionGroupRegex:null ]
## [ R0.libraries.R0.barcodeCG.0.kmax = 5 ]
We are now ready to find, build and store the optimal Markov model. All this work can be easily performed using selex.mm
:
mm = selex.mm(sample=r0train, order=NA,
crossValidationSample=r0test, Kmax=kmax.value)
## Overwriting Kmax = 5
## Counting [R0.libraries.R0.barcodeGC.0][ K = 1 ]
## Counting [R0.libraries.R0.barcodeGC.0][ K = 2 ]
## Counting [R0.libraries.R0.barcodeGC.0][ K = 3 ]
## Counting [R0.libraries.R0.barcodeGC.0][ K = 4 ]
## Counting [R0.libraries.R0.barcodeGC.0][ K = 5 ]
## [ markovLength = 2 ]
## [ maxR = 0.989223 ]
## [ Model = MarkovModelInfo [markovLength=2, markovLengthTotalCount=518340, markovR2=0.9892230425821316, markovCountsPath=/Users/HJB/GitHub/Tutorials/./cache//R0.libraries.R0.barcodeGC.0.2.dat_A7FE7F4E2E78A43F892C7F3227FFA520, markovObjPath=/Users/HJB/GitHub/Tutorials/./cache//R0.libraries.R0.barcodeGC.0.2.dat_A7FE7F4E2E78A43F892C7F3227FFA520.prob.obj2, sample=config.ExperimentReference@5dfcfece, markovModelMethod=DIVISION, crossValidationSample=config.ExperimentReference@23ceabc1, filter=variableRegionIncludeRegex:null,variableRegionExcludeRegex:null,variableRegionGroupRegex:null,kmerIncludeRegex:null,kmerExcludeRegex:null,kmerIncludeOnly:null] ]
You can use selex.mmSummary
to see the cross-validated R\(^2\) values of these models and plot them:
selex.mmSummary()
## Sample Order MarkovModelType R
## 1 R0.libraries.R0.barcodeGC.0 0 DIVISION 0.9800940
## 2 R0.libraries.R0.barcodeGC.0 1 DIVISION 0.9892230
## 3 R0.libraries.R0.barcodeGC.0 2 DIVISION 0.9891462
## 4 R0.libraries.R0.barcodeGC.0 3 DIVISION 0.9889096
## 5 R0.libraries.R0.barcodeGC.0 4 DIVISION 0.9837154
## CVSample CVLength
## 1 R0.libraries.R0.barcodeCG.0 5
## 2 R0.libraries.R0.barcodeCG.0 5
## 3 R0.libraries.R0.barcodeCG.0 5
## 4 R0.libraries.R0.barcodeCG.0 5
## 5 R0.libraries.R0.barcodeCG.0 5
mm.r2 = selex.mmSummary()
idx = which(mm.r2$R==max(mm.r2$R))
colstring = rep('BLUE',nrow(mm.r2))
colstring[idx]='RED'
barplot(height=mm.r2$R,names.arg=(mm.r2$Order), ylim=c(.98,1), xpd=FALSE, col=colstring,
xlab="Markov Model Order", ylab=expression(Markov ~ Model ~ R^{2}))
We can find the optimal footprint size by observing how the distribution of \(k\)-mers frequences changes between selection rounds. selex.infogain
uses the Kullback-Leibler divergence metric can be used to measure this for many \(k\)-mer lengths:
selex.infogain(sample=r2,markovModel=mm)
## Counting [InfoGain][ K = 2 ]
## Counting [InfoGain][ K = 3 ]
## Counting [InfoGain][ K = 4 ]
## Counting [InfoGain][ K = 5 ]
## Counting [InfoGain][ K = 6 ]
## Counting [InfoGain][ K = 7 ]
## Counting [InfoGain][ K = 8 ]
## Counting [InfoGain][ K = 9 ]
## Counting [InfoGain][ K = 10 ]
## Counting [InfoGain][ K = 11 ]
## Counting [InfoGain][ K = 12 ]
## Counting [InfoGain][ K = 13 ]
## Counting [InfoGain][ K = 14 ]
## Counting [InfoGain][ K = 15 ]
## Counting [InfoGain][ K = 16 ]
## [1] 2.415874
selex.infogainSummary
can be used to view and plot the results:
selex.infogainSummary()[,1:3]
## Sample K InformationGain
## 1 R2.libraries.ExdHox.R2.2 2 0.1448297
## 2 R2.libraries.ExdHox.R2.2 3 0.3623032
## 3 R2.libraries.ExdHox.R2.2 4 0.6695950
## 4 R2.libraries.ExdHox.R2.2 5 1.0660545
## 5 R2.libraries.ExdHox.R2.2 6 1.5104582
## 6 R2.libraries.ExdHox.R2.2 7 1.9133892
## 7 R2.libraries.ExdHox.R2.2 8 2.2302201
## 8 R2.libraries.ExdHox.R2.2 9 2.4158743
## 9 R2.libraries.ExdHox.R2.2 10 2.4143849
## 10 R2.libraries.ExdHox.R2.2 11 2.0598432
## 11 R2.libraries.ExdHox.R2.2 12 1.1818942
## 12 R2.libraries.ExdHox.R2.2 13 0.1063632
## 13 R2.libraries.ExdHox.R2.2 14 0.0000000
## 14 R2.libraries.ExdHox.R2.2 15 0.0000000
## 15 R2.libraries.ExdHox.R2.2 16 0.0000000
infoscores = selex.infogainSummary()
idx = which(infoscores$InformationGain==max(infoscores$InformationGain))
colstring = rep('BLUE', nrow(infoscores))
colstring[idx] = 'RED'
barplot(height=infoscores$InformationGain, names.arg=infoscores$K, col=colstring,
xlab="Oligonucleotide Length (bp)", ylab="Information Gain (bits)")
optimalLength = infoscores$K[idx]
selex.counts
can be used to compute the \(k\)-mer count table for the optimal footprint size:
table = selex.counts(sample=r2, k=optimalLength, markovModel=mm)
## Counting [R2.libraries.ExdHox.R2.2][ K = 9 ]
## [ Lowest Count = 1 ]
head(table)
## Kmer ObservedCount Probability ExpectedCount
## 1 ATGATTGAT 7118 7.741514e-06 3.093261
## 2 TGATTGATT 6068 9.665983e-06 3.862217
## 3 TGATTGATG 4310 7.506314e-06 2.999283
## 4 GATTGATTA 3996 7.916402e-06 3.163141
## 5 ATCAATCAT 3870 4.642924e-06 1.855164
## 6 TTGATTGAT 3451 9.665982e-06 3.862217
With the optimal footprint size and a Markov model, selex.affinities
can be used to estimate the affinity and the standard error of the affinity estimate from the \(k\)-mer enrichments:
aff = selex.affinities(sample=r2, k=optimalLength, markovModel=mm)
## Counting [R2.libraries.ExdHox.R2.2][ K = 9 ]
## [ Lowest Count = 1 ]
head(aff)
## Kmer ObservedCount Probability ExpectedCount Affinity SE
## 1 ATGATTGAT 7118 7.741514e-06 3.093261 1.0000000 0.01676239
## 2 TGATTGATT 6068 9.665983e-06 3.862217 0.8262924 0.01500120
## 3 TGATTGATG 4310 7.506314e-06 2.999283 0.7902404 0.01702298
## 4 GATTGATTA 3996 7.916402e-06 3.163141 0.7409396 0.01657620
## 5 ATCAATCAT 3870 4.642924e-06 1.855164 0.9521244 0.02164478
## 6 TTGATTGAT 3451 9.665982e-06 3.862217 0.6231369 0.01500120
First, we need to load all the packages used by SelexGLM:
library(stringi)
library(Biostrings)
library(SelexGLM)
library(devtools)
library(reshape2)
library(ggplot2)
library(Rmisc)
First, we need to provide some information about the SELEX library and the kind of SelexGLM model we want to build:
model = new("model",
varRegLen = 16,
leftFixedSeq = "GTTCAGAGTTCTACAGTCCGACGATCTGG", # These are the left and right adapter sequences
rightFixedSeq ="CCAGCTGTCGTATGCCGTCTTCTGCTTG",
seedLen = optimalLength, # This is the optimal kmer length computed earlier
leftFixedSeqOverlap = 4, # The number of bases in the fixed flanking regions to consider
initialAffinityCutoff = 0.00,
missingValueSuppression = 1,
minSeedValue = .001,
upFootprintExtend = 2,
includeWindowFactor = FALSE,
confidenceLevel = .95,
verbose = FALSE,
useFixedValuesOffset.N = FALSE,
rounds = list(c(2)),
rcSymmetric = FALSE,
minAffinity = 0.01
)
Next, we need to precompute some tables to eventually create design matrices:
data.kmerTable = selex.affinities(sample=r2, k=optimalLength, markovModel=mm)
## Counting [R2.libraries.ExdHox.R2.2][ K = 9 ]
## [ Lowest Count = 1 ]
data.kmerTable = data.kmerTable[order(-data.kmerTable$Affinity), ]
rownames(data.kmerTable) = NULL
head(data.kmerTable) # Visualize what the kmer count table looks like
## Kmer ObservedCount Probability ExpectedCount Affinity SE
## 1 ATGATTGAT 7118 7.741514e-06 3.093261 1.0000000 0.01676239
## 2 ATCAATCAT 3870 4.642924e-06 1.855164 0.9521244 0.02164478
## 3 AATCAATCA 2603 3.466848e-06 1.385241 0.9036571 0.02504849
## 4 TGATTGATT 6068 9.665983e-06 3.862217 0.8262924 0.01500120
## 5 CATCAATCA 1853 3.042163e-06 1.215551 0.8139176 0.02673977
## 6 TGATTGATG 4310 7.506314e-06 2.999283 0.7902404 0.01702298
data.probeCounts = getProbeCounts(r2, markovModel = mm)
## Counting [R2.libraries.split.1.ExdHox.R2.split.1.2][ K = 16 ]
## [ Lowest Count = 1 ]
## Counting [R2.libraries.split.2.ExdHox.R2.split.2.2][ K = 16 ]
## [ Lowest Count = 1 ]
head(data.probeCounts) # Visualize what the raw R2 probe data looks like
## Probe ObservedCount Probability Round
## 1 ATGTTTGATTGATTAT 4 1.413033e-09 2
## 2 ATGATTGATTATGTTT 4 1.413033e-09 2
## 3 GTTGATTGATGGGTTT 3 1.193594e-09 2
## 4 GTGATTGATTGTTTTC 3 1.129550e-09 2
## 5 GGATCATCAATCATTT 3 4.386453e-10 2
## 6 GATGATTGATGGGTTT 3 8.936956e-10 2
We will now use SelexGLM to create a mononucleotide model of the sample data stored in the SELEX package. To do so, we will first need to seed the fit. We do so by computing a seed PSAM from the kmer table computed before:
addSeedPsam(model) = seedTable2psam(model, data.kmerTable)
model@features@N # How the model object looks after the seeding
## An object of class 'N'
##
## Slot "seedLen": 9
##
## Slot "N.upFootprintExtend": 2
##
## Slot "N.downFootprintExtend": 2
##
## Slot "fS.upFootprintExtend": 2
##
## Slot "fS.downFootprintExtend": 2
##
## Slot "fpLen": 13
##
## Slot "N.set": 1 2 3 4 5 6 7 8 9 10 11 12 13
##
## Slot "N.equivMat":
## 13 x 13 null equivalence matrix
##
## Slot "N.values":
## 1 2 3 4 5 6 7 8
## N.A 0 0 0.0000000 -1.120526 -1.752076 0.000000 -3.047111 -1.799968
## N.C 0 0 -1.5076843 -3.047111 -3.047111 -3.047111 -1.841626 -1.007452
## N.G 0 0 -0.5755842 -3.047111 0.000000 -3.047111 -3.047111 -1.000440
## N.T 0 0 -0.4729891 0.000000 -3.047111 -3.047111 0.000000 0.000000
## 9 10 11 12 13
## N.A -1.295752 0.000000 -3.047111 0 0
## N.C -3.047111 -3.047111 -3.047111 0 0
## N.G 0.000000 -3.047111 -3.047111 0 0
## N.T -2.047111 -3.047111 0.000000 0 0
##
##
## Slot "N.errors":
## 1 2 3 4 5 6 7 8 9 10 11 12 13
## N.A 0 0 0 0 0 0 0 0 0 0 0 0 0
## N.C 0 0 0 0 0 0 0 0 0 0 0 0 0
## N.G 0 0 0 0 0 0 0 0 0 0 0 0 0
## N.T 0 0 0 0 0 0 0 0 0 0 0 0 0
##
##
## Slot "N.z":
## 1 2 3 4 5 6 7 8 9 10 11 12 13
## N.A 0 0 0 0 0 0 0 0 0 0 0 0 0
## N.C 0 0 0 0 0 0 0 0 0 0 0 0 0
## N.G 0 0 0 0 0 0 0 0 0 0 0 0 0
## N.T 0 0 0 0 0 0 0 0 0 0 0 0 0
##
##
## Slot "N.sig":
## 1 2 3 4 5 6 7 8 9 10 11 12 13
## N.A 0 0 0 0 0 0 0 0 0 0 0 0 0
## N.C 0 0 0 0 0 0 0 0 0 0 0 0 0
## N.G 0 0 0 0 0 0 0 0 0 0 0 0 0
## N.T 0 0 0 0 0 0 0 0 0 0 0 0 0
##
##
## Slot "N.oldValues":
## <4 x 13 x 0 array of double>
##
## Slot "N.oldErrors":
## <4 x 13 x 0 array of double>
##
## Slot "N.oldZ":
## <4 x 13 x 0 array of double>
##
## Slot "N.oldSig":
## <4 x 13 x 0 array of double>
Next, we need to select the top binding window for each probe and a model affinity score and confidence level for the selected window.
data = data.probeCounts
data = topModelMatch(data, model)
Use the aligned probes to build design matrix:
data = addDesignMatrix(data, model)
Now construct the regression expression with independent features and fit model:
regressionFormula = updatedRegressionFormula(data, model)
fit = glm(regressionFormula,
data=data,
family = poisson(link="log"))
model = addNewBetas(model, data, fit) # Nucleotide Features after first round of fitting
plot(model, Nplot.ddG = TRUE, verticalPlots = TRUE) # Plot the feature values
As you can see, the parameter values are a bit extreme. We need to run a few more rounds of fitting to reach a stable solution:
data = data.probeCounts
data.nrow = nrow(data)
for (i in 2:10) {
data = topModelMatch(data, model)
data = addDesignMatrix(data, model)
if (data.nrow == nrow(data)) {
print(paste0("Stability Reached in iteration ", i))
break
} else {
data.nrow = nrow(data)
}
regressionFormula = updatedRegressionFormula(data, model)
fit = glm(regressionFormula,
data=data,
family = poisson(link="log"))
model = addNewBetas(model,data,fit) # Nucleotide Features after i'th round of fitting
}
plot(model, Nplot.ddG = TRUE, verticalPlots = TRUE) # Plot the final solution
The NRLBtools
package makes it easy to use the DNA binding specicity models that were built from SELEX in vitro binding data using the NRLB algorithm and as described in the Rastogi et al., PNAS (2018) paper. While it can be used to work with NRLB models that you have built, we explore the pre-loaded models. The available models are:
library(NRLBtools)
## Welcome to the NRLBtools package
NRLBtools::model.info()
## Protein Platform Info GenomicMaximum
## 1 AbdB SELEX-seq HoxMonomer 1.0000000
## 2 Exd SELEX-seq HoxMonomer 1.0000000
## 3 Lab SELEX-seq HoxMonomer 1.0000000
## 4 Pb SELEX-seq HoxMonomer 1.0000000
## 5 Scr SELEX-seq HoxMonomer 1.0000000
## 6 UbxIa SELEX-seq HoxMonomer 1.0000000
## 7 UbxIVa SELEX-seq HoxMonomer 1.0000000
## 8 ExdAbdA SELEX-seq ExdHoxDimer 0.7926055
## 9 ExdAbdB SELEX-seq ExdHoxDimer 0.8906533
## 10 ExdAntp SELEX-seq ExdHoxDimer 0.9175093
## 11 ExdDfd SELEX-seq ExdHoxDimer 0.8553881
## 12 ExdLab SELEX-seq ExdHoxDimer 1.0000000
## 13 ExdPb SELEX-seq ExdHoxDimer 0.8217055
## 14 ExdScr SELEX-seq ExdHoxDimer 0.9197844
## 15 ExdUbxIa SELEX-seq ExdHoxDimer 0.8262734
## 16 ExdUbxIVa SELEX-seq ExdHoxDimer 0.8429623
## 17 AbdA SELEX-seq HoxMonomer 1.0000000
## 18 Dfd SELEX-seq HoxMonomer 1.0000000
## 19 CEBPb HT-SELEX NucOnly NA
## 20 CEBPd HT-SELEX NucOnly NA
## 21 E2F1 HT-SELEX NucOnly NA
## 22 EBF1 HT-SELEX NucOnly NA
## 23 EGR1 HT-SELEX NucOnly NA
## 24 ELF1 HT-SELEX NucOnly NA
## 25 ELK1 HT-SELEX NucOnly NA
## 26 ETS1 HT-SELEX NucOnly NA
## 27 GABPA HT-SELEX NucOnly NA
## 28 GATA3 HT-SELEX NucOnly NA
## 29 IRF4 HT-SELEX NucOnly NA
## 30 MAFF HT-SELEX NucOnly NA
## 31 MAFK HT-SELEX NucOnly NA
## 32 MEF2A HT-SELEX NucOnly NA
## 33 NFATC1 HT-SELEX NucOnly NA
## 34 NFE2 HT-SELEX NucOnly NA
## 35 NR2C2 HT-SELEX NucOnly NA
## 36 NRF1 HT-SELEX NucOnly NA
## 37 PAX5 HT-SELEX NucOnly NA
## 38 POU2F2 HT-SELEX NucOnly NA
## 39 POU5F1P1 HT-SELEX NucOnly NA
## 40 RFX5 HT-SELEX NucOnly NA
## 41 RUNX3 HT-SELEX NucOnly NA
## 42 SP1 HT-SELEX NucOnly NA
## 43 TFAP2A HT-SELEX NucOnly NA
## 44 TFAP2C HT-SELEX NucOnly NA
## 45 USF1 HT-SELEX NucOnly NA
## 46 YY1 HT-SELEX NucOnly NA
## 47 ZNF143 HT-SELEX NucOnly NA
## 48 MAX SELEX-seq NucOnly NA
## 49 MAX SELEX-seq NucDinuc NA
## 50 MAX HT-SELEX NucOnly NA
## 51 MAX HT-SELEX NucDinuc NA
## 52 MAX SMiLE-seq NucOnly NA
## 53 MAX SMiLE-seq NucDinuc NA
## 54 p53 SELEX-seq WildType NA
## 55 p53 SELEX-seq Δ30 NA
## 56 ATF4 SELEX-seq ATF4Only NA
## 57 CEBPb SELEX-seq CEBPbOnly NA
## 58 ATF4 SELEX-seq ATF4-CEBPb NA
## 59 CEBPb SELEX-seq ATF4-CEBPb NA
## 60 ATF4-CEBPb SELEX-seq ATF4-CEBPb NA
## Genome
## 1 BSgenome.Dmelanogaster.UCSC.dm3
## 2 BSgenome.Dmelanogaster.UCSC.dm3
## 3 BSgenome.Dmelanogaster.UCSC.dm3
## 4 BSgenome.Dmelanogaster.UCSC.dm3
## 5 BSgenome.Dmelanogaster.UCSC.dm3
## 6 BSgenome.Dmelanogaster.UCSC.dm3
## 7 BSgenome.Dmelanogaster.UCSC.dm3
## 8 BSgenome.Dmelanogaster.UCSC.dm3
## 9 BSgenome.Dmelanogaster.UCSC.dm3
## 10 BSgenome.Dmelanogaster.UCSC.dm3
## 11 BSgenome.Dmelanogaster.UCSC.dm3
## 12 BSgenome.Dmelanogaster.UCSC.dm3
## 13 BSgenome.Dmelanogaster.UCSC.dm3
## 14 BSgenome.Dmelanogaster.UCSC.dm3
## 15 BSgenome.Dmelanogaster.UCSC.dm3
## 16 BSgenome.Dmelanogaster.UCSC.dm3
## 17 BSgenome.Dmelanogaster.UCSC.dm3
## 18 BSgenome.Dmelanogaster.UCSC.dm3
## 19 <NA>
## 20 <NA>
## 21 <NA>
## 22 <NA>
## 23 <NA>
## 24 <NA>
## 25 <NA>
## 26 <NA>
## 27 <NA>
## 28 <NA>
## 29 <NA>
## 30 <NA>
## 31 <NA>
## 32 <NA>
## 33 <NA>
## 34 <NA>
## 35 <NA>
## 36 <NA>
## 37 <NA>
## 38 <NA>
## 39 <NA>
## 40 <NA>
## 41 <NA>
## 42 <NA>
## 43 <NA>
## 44 <NA>
## 45 <NA>
## 46 <NA>
## 47 <NA>
## 48 <NA>
## 49 <NA>
## 50 <NA>
## 51 <NA>
## 52 <NA>
## 53 <NA>
## 54 <NA>
## 55 <NA>
## 56 <NA>
## 57 <NA>
## 58 <NA>
## 59 <NA>
## 60 <NA>
It is always a good idea to visualize the model as an energy logo before using it. We can visualize the Exd-UbxIVa model used in the paper by using the logo function and its model number (16 in this case):
NRLBtools::logo(nrlb.model = 16)
We can actually use model.info
to provide more information about the NRLB models that come with this package. For example, we can see that model number 16 is a multi-mode model with two modes:
NRLBtools::model.info(nrlb.model = 16)
## Mode k RelativeAffinity BestSequence
## 1 1 18 0.000000 GGATGATTTATGACCTTT
## 2 2 10 -7.057776 ATCATTAAAA
## 3 NSB - -9.559147 -
We start by extracting the sequence underlying a genomic region of interest. Here, we will only use the E3N enhancer element (as shown in the paper) from the dm3 Drosophila genome:
# Just extracted the 292bp genomic sequence chrX:4,915,195-4,915,486 from the dm3 assembly
enh = "TCTTTTTTTTTATCCGGTGCGATCGTTTAGTTGCTGATTTGTTGACCCGATAAAAAATGGGACTTTAAGCCTCGCTGGCATGCACATAATTTGTAGTTTTTGGTCATTGGACAAAACTTTGACAAACAAAAAAAAAAGAAAAAGGCGATCAAAACTTGTCGAAAGTTCGGAAGTTCGATTGAGAGGTTTTTTTATTGCTTTTAAAGTGGACGGCCGCCCAAGTTGGTGGAGTTCATAATTCCTGAGCTGTTTGTTTGTTCTTCTTTATTACGCGACTACCTACCTAACCTAC"
An affinity profile along the E3N enhancer can be created as follows:
NRLBtools::score.seq(sequence = enh, nrlb.model = 16, plot = TRUE, nPeaks = 10, annotate = TRUE)
## ExdUbxIVa scores in enh
## Affinity Position Sequence
## 1 2.100306e-03 84 ACATAATTTGTAGTTTTT
## 2 1.733473e-03 185 GGTTTTTTTATTGCTTTT
## 3 1.149797e-03 32 TGCTGATTTGTTGACCCG
## 4 9.752449e-04 -79 CATGCACATAATTTGTAG
## 5 9.206493e-04 48 CGATAAAAAATGGGACTT
## 6 7.306726e-04 -228 GGAGTTCATAATTCCTGA
## 7 4.345313e-04 258 TTCTTCTTTATTACGCGA
## 8 3.288965e-04 -43 TGACCCGATAAAAAATGG
## 9 5.154725e-05 3 TTTTTTTTTATCCGGTGC
## 10 4.078117e-05 173 GTTCGATTGAGAGGTTTT