Sequences
=========

ftp://ftp.arabidopsis.org/home/tair/Genes/TAIR9_genome_release/
file: TAIR9_chr_all.fas

R code used to split TAIR9_chr_all.fas into 1 FASTA file per sequence:

  library(Biostrings)

  splitTair9ChrAllFasta <- function(filepath)
  {
      ## Maps the sequence names used in the FASTA file (names of NAMESMAP)
      ## to the names used for the output files (values of NAMESMAP).
      NAMESMAP <- paste("Chr", c(1:5, "M", "C"), sep="")
      names(NAMESMAP) <- c(1:5, "mitochondria", "chloroplast")
      chr_all <- readDNAStringSet(filepath)
      ## Keep only the first word of each FASTA header.
      shortnames <- sapply(strsplit(names(chr_all), " ", fixed=TRUE),
                           function(xx) xx[1L])
      if (!identical(shortnames, names(NAMESMAP)))
          stop(filepath, " doesn't contain the expected sequence names")
      names(chr_all) <- NAMESMAP
      for (i in seq_along(chr_all)) {
          out <- paste(NAMESMAP[i], ".fa", sep="")
          cat("Writing ", out, " file ... ", sep="")
          ## writeXStringSet() was called write.XStringSet() in old
          ## versions of Biostrings.
          writeXStringSet(chr_all[i], filepath=out)
          cat("OK\n")
      }
  }

  splitTair9ChrAllFasta("TAIR9_chr_all.fas")


Masks
=====

This was an attempt at putting built-in masks in
BSgenome.Athaliana.TAIR.TAIR9, but it didn't work (see below for why).
The first published version of BSgenome.Athaliana.TAIR.TAIR9
(version 1.3.18) has NO built-in masks.

Assembly gaps extracted from
ftp://ftp.arabidopsis.org/home/tair/Genes/TAIR9_genome_release/TAIR9_gff3/Assembly_GFF/
file: tair9_Assembly_gaps.gff

R code used to convert tair9_Assembly_gaps.gff into a UCSC-like "gap" file:

  tair9GapsGFF.to.UCSCGapFile <- function(infile, outfile)
  {
      ## The sanity checks below rely on levels(), so the character columns
      ## must be read as factors (no longer the default since R 4.0.0).
      gff <- read.table(infile, sep="\t", quote="", stringsAsFactors=TRUE)
      if (!identical(levels(gff[[3L]]), "gap"))
          stop("some rows have a feature type != \"gap\"")
      if (!identical(levels(gff[[7L]]), "+"))
          stop("some gaps are not on the + strand")
      if (!identical(levels(gff[[6L]]), ".")
       || !identical(levels(gff[[8L]]), "."))
          stop("fields 6 and 8 are not empty")
      ## Keep seqname, start and end, and switch from 1-based GFF starts
      ## to 0-based UCSC starts.
      gap <- gff[c(1L, 4L, 5L)]
      gap[[2L]] <- gap[[2L]] - 1L
      ## The 4th added column is the gap size (end - 0-based start).
      gap <- cbind(gap, 0L, "N", gap[[3L]] - gap[[2L]], ".", ".")
      ## Row names are written too, as the first column of the file.
      write.table(gap, file=outfile, quote=FALSE, sep="\t", col.names=FALSE)
  }

  tair9GapsGFF.to.UCSCGapFile("tair9_Assembly_gaps.gff", "gap.txt")

Problem with those assembly gaps: some of them have a size that doesn't
match the size of the corresponding N-block in the sequence (as found in
TAIR9_chr_all.fas).

Email sent to the TAIR curators on Dec 14, 2010:

-----------------------------------------------------------------------
Hi TAIR curators,

For my project, I need to use the whole chromosome TAIR9 sequences
provided here
  ftp://ftp.arabidopsis.org/home/tair/Genes/TAIR9_genome_release/TAIR9_chr_all.fas
and also I need to know the exact locations (and sizes) of the assembly
gaps for those sequences.

After looking at the tair9_Assembly_gaps.gff file (from
ftp://ftp.arabidopsis.org/home/tair/Genes/TAIR9_genome_release/TAIR9_gff3/Assembly_GFF/),
I found that some of the assembly gaps listed in this file are not
compatible with the N-blocks found in the sequences. See at the end of
this email for the exact list of problems I found.

My question is: what could explain that the size of an assembly gap is
not the same as the size of its corresponding N-block? Is this really
intended?
If not, is there another file somewhere on your site that contains the
actual size of the assembly gaps?

Thanks in advance,
H.

List of problems found in the tair9_Assembly_gaps.gff file:

  Line # | seqname | type |    start      end | NOTE
  -------+---------+------+-------------------+------------------------------
       5 | Chr1    | gap  | 14545722 14581721 | NOTE=Size_36000_centromeric BAC
    Pb: the size of the N-block found in the sequence is 35998!

      15 | Chr1    | gap  | 15345855 15346528 | NOTE=Size_674_exact sizes of the gaps are unknown
    Pb: the size of the N-block found in the sequence is 60!

      35 | Chr3    | gap  | 13589757 13589816 | NOTE=Size_60_centromeric BAC
    Pb: the size of the N-block found in the sequence is 56!

      76 | Chr3    | gap  | 15132295 15132547 | NOTE=Size_253_N's as sequence needs to be verified
    Pb: the size of the N-block found in the sequence is 251!

      93 | Chr5    | gap  | 11725025 11726024 | NOTE=Size_1000
    Pb: the size of the N-block found in the sequence is 998!
-----------------------------------------------------------------------
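
For reference, the kind of check that reveals such mismatches can be
sketched in base R as below. This is only an illustration, not the code
that was actually used to produce the list above: findNBlocks() and
checkGaps() are hypothetical helper names, and the toy sequence and gap
coordinates are made up (they are not TAIR9 data). For a real chromosome,
the sequence would presumably be obtained with something like
as.character() on the corresponding DNAString object.

```r
## Find all maximal runs of "N" in a sequence stored as a single string.
## Returns 1-based, inclusive start/end positions plus the run widths.
findNBlocks <- function(seq)
{
    m <- gregexpr("N+", seq)[[1L]]
    if (m[1L] == -1L)  # no N-block at all
        return(data.frame(start=integer(0), end=integer(0),
                          width=integer(0)))
    width <- attr(m, "match.length")
    start <- as.integer(m)
    data.frame(start=start, end=start + width - 1L, width=width)
}

## For each annotated gap (1-based, inclusive start/end, as in GFF),
## look up the N-block it overlaps and compare the 2 sizes.
## 'nblock_width' is NA if the gap overlaps zero or several N-blocks.
checkGaps <- function(gaps, nblocks)
{
    gap_width <- gaps$end - gaps$start + 1L
    nblock_width <- vapply(seq_len(nrow(gaps)), function(i) {
        hit <- which(nblocks$start <= gaps$end[i] &
                     nblocks$end >= gaps$start[i])
        if (length(hit) == 1L) nblocks$width[hit] else NA_integer_
    }, integer(1))
    ok <- !is.na(nblock_width) & gap_width == nblock_width
    cbind(gaps, gap_width, nblock_width, ok)
}

## Toy data (made up): the 2nd gap claims 5 positions, but the N-block
## it overlaps is only 3 positions long.
toy_seq <- "ACGTNNNNNACGTACGNNNACGT"
toy_gaps <- data.frame(start=c(5L, 15L), end=c(9L, 19L))

nblocks <- findNBlocks(toy_seq)
res <- checkGaps(toy_gaps, nblocks)
print(res)
```

Rows with ok == FALSE are the gaps whose annotated size disagrees with
the N-block found in the sequence. Note that this sketch only compares
sizes, it does not check that the gap and the N-block start at the same
position.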