The FreeHiCLite
package is designed for simulate Hi-C contact matrix.
The original package FreeHi-C is short for Fragment interactions empirical estimation for fast simulation of Hi-C data. It is a data-driven Hi-C data simulator for simulating and augmenting Hi-C datasets. FreeHi-C employs a non-parametric strategy for estimating an interaction distribution of genome fragments and simulates Hi-C reads from interacting fragments. Data from FreeHi-C exhibit higher fidelity to the biological Hi-C data. FreeHi-C not only can be used to study and benchmark a wide range of Hi-C analysis methods but also boosts power and enables false discovery rate control for differential interaction detection algorithms through data augmentation. Different from FreeHi-C (v1.0), a spike-in module is added enabling the simulation of true differential chromatin interactions.
FreeHi-C is designed for studies that are prone to simulate Hi-C interactions from the real data and add deviations from the true ones. Therefore, FreeHi-C requires real Hi-C sequencing data (FASTQ format) as input along with user-defined simulation parameters. FreeHi-C will eventually provide the simulated genomics contact counts in a sparse matrix format (BED format) which is compatible with the standard input of downstream Hi-C analysis.
Install the development version using the devtools package:
devtools::install_github("baconzhou/FreeHiCLite")
Load the package:
library(FreeHiCLite)
The .hic
file is a highly compressed binary file, which is developed in the Aiden Lab. Which can be used in juicebox for contact matrix visualization.
The .hic
file is formatted as HiC Format. To program with .hic
file, they provide straw and Dump to extract the information from the .hic
file. The readJuicer()
adopts most from the C++ version of straw.
The .hic
file only contains two units of resolution, and each unit contains a fix set of resolutions.
## Remote file location. The reomte file include downloading, it may take a while remoteFilePath <- "https://hicfiles.s3.amazonaws.com/hiseq/gm12878/in-situ/combined.hic" ## Local file location localFilePath <- system.file("extdata", "example.hic", package = "FreeHiCLite") ## Chromosomes needs to be extra chromosomes <- c("chr1", "chr2") ## Pairs needs to be extra pairs <- c("1_1", "1_2") unit <- "BP" resolution <- 500000L
We can extract some basic information from a .hic
file via:
juicerInfo <- readJuicerInformation(localFilePath, verbose = TRUE) #> File: /Library/Frameworks/R.framework/Versions/4.0/Resources/library/FreeHiCLite/extdata/example.hic #> GenomId: #> Available pair: 1_1, 1_2, 1_3, 2_2, 2_3, 3_3. #> Hi-C resolution: #> BP: 2500000, 500000, 5000. #> FRAG not available.
Besides, juicerInfo()
also provides chromosome size information, if the genomeID is not one of the genome they provided. check here
head(juicerInfo[["chromosomeSizes"]]) #> chromosome size #> 1 1 249250621 #> 2 10 135534747 #> 3 11 135006516 #> 4 12 133851895 #> 5 13 115169878 #> 6 14 107349540
To read a remote .hic
file. Provide a remote url to readJuicer()
. The full list of remote links can be found in http://aidenlab.org/data.html.
## pass chrosomes into function, it will contains all the interaction pairs datRemote <- readJuicer(file=remoteFilePath, chromosomes=chromosomes, pairs = NULL, unit=unit, resolution=resolution)
## pass chrosomes into function, it will contains all the interaction pairs datLoc <- readJuicer(file=localFilePath, chromosomes=chromosomes, pairs = NULL, unit=unit, resolution=resolution) str(datLoc) #> List of 3 #> $ contact :List of 3 #> ..$ 1_1: int [1:87158, 1:3] 500000 500000 1000000 500000 1000000 1500000 500000 1000000 1500000 2000000 ... #> .. ..- attr(*, "dimnames")=List of 2 #> .. .. ..$ : NULL #> .. .. ..$ : chr [1:3] "x" "y" "counts" #> ..$ 1_2: int [1:69341, 1:3] 2500000 4000000 4500000 5000000 5500000 6000000 7000000 8500000 9500000 10000000 ... #> .. ..- attr(*, "dimnames")=List of 2 #> .. .. ..$ : NULL #> .. .. ..$ : chr [1:3] "x" "y" "counts" #> ..$ 2_2: int [1:104782, 1:3] 0 0 500000 0 500000 1000000 0 500000 1000000 1500000 ... #> .. ..- attr(*, "dimnames")=List of 2 #> .. .. ..$ : NULL #> .. .. ..$ : chr [1:3] "x" "y" "counts" #> $ information:List of 4 #> ..$ genomeID : chr "hg19" #> ..$ resolution :List of 2 #> .. ..$ BP : int [1:3] 2500000 500000 5000 #> .. ..$ FRAG: int(0) #> ..$ pairs : chr [1:6] "1_1" "1_2" "1_3" "2_2" ... #> ..$ chromosomeSizes:'data.frame': 26 obs. of 2 variables: #> .. ..$ chromosome: chr [1:26] "1" "10" "11" "12" ... #> .. ..$ size : int [1:26] 249250621 135534747 135006516 133851895 115169878 107349540 102531392 90354753 81195210 78077248 ... #> $ settings :List of 4 #> ..$ unit : chr "BP" #> ..$ resolution : int 500000 #> ..$ chromosomes: chr [1:2] "chr1" "chr2" #> ..$ file : chr "/Library/Frameworks/R.framework/Versions/4.0/Resources/library/FreeHiCLite/extdata/example.hic" #> - attr(*, "class")= chr [1:2] "juicer" "freehic"
The juicebox also provide a function Pre to generate .hic
file from different format. In FreeHiCLite
, we provide a function writeJuicer()
to write the contact matrix into a short with score format. You can use convertJuicer()
to view the matrix, or directly use writeJuicer()
.
head(convertJuicer(datLoc[['contact']][[1]], "1", "1")) #> str1 chr1 pos1 frag1 str2 chr2 pos2 frag2 score #> 1 0 1 500000 0 0 1 500000 1 384 #> 2 0 1 500000 0 0 1 1000000 1 231 #> 3 0 1 1000000 0 0 1 1000000 1 1272 #> 4 0 1 500000 0 0 1 1500000 1 47 #> 5 0 1 1000000 0 0 1 1500000 1 373 #> 6 0 1 1500000 0 0 1 1500000 1 1665
writeJuicer(datLoc, file, overwrite = TRUE)
To add spikeIn into contact matrix. The FreeHiCLite
provides FreeSpikeIn()
to add spikein.
kernelSmooth <- TRUE bandwidth <- 50000L ## Create a random spikeIn contact <- datLoc[['contact']][[1]] Ns <- 0.1 * NROW(contact) spikeIn <- contact[sample(1:NROW(contact), Ns), ] spikeIn[,3] <- spikeIn[,3] * sample(seq(0, 10, 0.5), Ns, replace=TRUE) # 3rd is the counts spikeInContact <- FreeSpikeIn(contact, spikeIn, kernelSmooth = kernelSmooth, bandwidth = bandwidth) print(summary(contact[,3])) #> Min. 1st Qu. Median Mean 3rd Qu. Max. #> 1.00 2.00 4.00 25.71 9.00 4545.00 print(summary(spikeInContact[,3])) #> Min. 1st Qu. Median Mean 3rd Qu. Max. #> 1 2 5 206 14 4545