Format of a Block

ID   short_identifier; BLOCK
AC   block_number; distance from previous block = (min,max)
DE   description
BL   xxx motif; width=w; seqs=s; 99.5%=n1; strength=n2
sequence_id  (offset) sequence_segment  sequence_weight
.
.
.
//

ID line starts a block entry and contains a short identifier for the group of sequences from which the block was made. If the block was taken from InterPro, it will be the InterPro group ID. The identifier is terminated by a semi-colon, and the word "BLOCK" indicates the entry type.

AC line contains the block number and the distance from the previous block. The block number is a seven-character group number for the sequences from which the block was made, followed by a letter (A-Z) indicating the order of the block in the sequences. If the group has only one block, the letter is omitted. If the block was made from InterPro group IPRnnnnnn, the block number is IPBnnnnnna. If the block was converted from Terri Attwood's Prints Database, the block number is PRnnnnna. min,max = the minimum and maximum number of amino acids between the previous block and this block, taken over the sequences in this block. For the first block in the group, the distance is measured from the beginning of the sequences.
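As an illustration of the block-number conventions above, the following is a minimal sketch that splits a block number into its group number and optional order letter. It assumes only the two patterns named in the text (IPBnnnnnn and PRnnnnn, each optionally followed by A-Z); the function name and the example accessions are made up for illustration.

```python
import re

# Patterns taken from the text: IPB + 6 digits, or PR + 5 digits,
# optionally followed by one letter giving the block's order.
_BLOCK_NUMBER = re.compile(r"^(IPB\d{6}|PR\d{5})([A-Z]?)$")


def split_block_number(block_number):
    """Return (group_number, order_letter_or_None, source_database)."""
    match = _BLOCK_NUMBER.match(block_number)
    if match is None:
        raise ValueError(f"unrecognized block number: {block_number!r}")
    group, letter = match.groups()
    source = "InterPro" if group.startswith("IPB") else "Prints"
    # The letter is omitted when the group has only one block.
    return group, letter or None, source


if __name__ == "__main__":
    # Hypothetical accessions, used only to exercise the function.
    print(split_block_number("IPB001234C"))
    print(split_block_number("PR00123"))
```

Block numbers that follow some other naming scheme are rejected by this sketch rather than guessed at.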

DE line contains a description of the group of sequences from which the block was made. If the block was taken from InterPro, it will be a slightly edited version of the InterPro description.

BL line contains information about the block:
xxx = the amino acids in the spaced triplet found by MOTIF upon which the block is based.
w = width of the sequence segments (columns) in the block.
s = number of sequence segments (rows) in the block.
n1 = raw calibration score; 99.5th percentile score of true negative sequences. Raw search scores are normalized by dividing by this score and multiplying by 1000.
n2 = median normalized score of known true positive sequences as documented in InterPro.
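The normalization described for n1 is simple arithmetic, sketched below. The function name is an assumption chosen for illustration; only the formula (raw score divided by n1, multiplied by 1000) comes from the text.

```python
def normalize_score(raw_score, calibration_score):
    """Normalize a raw search score against a block's 99.5% calibration score.

    A raw score equal to the calibration score (n1) maps to 1000, so
    normalized scores above 1000 exceed the 99.5th percentile of true negatives.
    """
    if calibration_score <= 0:
        raise ValueError("calibration score must be positive")
    return raw_score / calibration_score * 1000


if __name__ == "__main__":
    # Illustrative numbers only, not taken from any real block.
    print(normalize_score(750, 500))
```

Under this scaling, the strength value n2 can be read on the same axis: it is the median normalized score of the known true positives.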

Following the BL line are lines for each sequence with a segment in the block. The segments may be clustered, with clusters separated by blank lines. Each segment line contains a sequence identifier, the offset from the beginning of the sequence to the block in parentheses, the sequence segment, and a weight for the segment. The weights are normalized so that the most distant segment has a weight of 100.

// line terminates a block entry.
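The line types described above can be read with a small parser. The following is a rough sketch that follows the template literally, not the database's own software. The sample entry is invented for illustration, and the function name and dictionary keys are assumptions.

```python
import re

# Made-up entry for illustration only; it is not taken from the Blocks Database.
SAMPLE = """\
ID   EXAMPLE; BLOCK
AC   IPB000001A; distance from previous block=(3,17)
DE   Hypothetical example family
BL   ACD motif; width=8; seqs=2; 99.5%=500; strength=1200
SEQ1_TEST  (  12) ACDEFGHI  100

SEQ2_TEST  (   7) ACDQFGHI   54
//
"""

AC_RE = re.compile(r"AC\s+([^;\s]+);\s*distance from previous block\s*=\s*"
                   r"\(\s*(-?\d+)\s*,\s*(-?\d+)\s*\)")
BL_RE = re.compile(r"BL\s+(\S+)\s+motif;\s*width=(\d+);\s*seqs=(\d+);"
                   r"\s*99\.5%=(\d+);\s*strength=(\d+)")
SEG_RE = re.compile(r"(\S+)\s+\(\s*(\d+)\s*\)\s+(\S+)\s+(\d+(?:\.\d+)?)\s*$")


def parse_block(text):
    """Parse one block entry into a dict; raises ValueError on malformed input."""
    block = {"segments": []}
    seen_bl = False
    for line in text.splitlines():
        if not line.strip():
            continue  # blank lines only separate clusters of segments
        if line.startswith("//"):
            return block  # terminator: the entry is complete
        if seen_bl:
            # Everything between the BL line and "//" is a segment line.
            m = SEG_RE.match(line.strip())
            if m is None:
                raise ValueError(f"bad segment line: {line!r}")
            name, offset, segment, weight = m.groups()
            block["segments"].append((name, int(offset), segment, float(weight)))
        elif line.startswith("ID"):
            block["id"] = line[2:].split(";")[0].strip()
        elif line.startswith("AC"):
            m = AC_RE.match(line)
            if m is None:
                raise ValueError(f"bad AC line: {line!r}")
            block["block_number"] = m.group(1)
            block["distance"] = (int(m.group(2)), int(m.group(3)))
        elif line.startswith("DE"):
            block["description"] = line[2:].strip()
        elif line.startswith("BL"):
            m = BL_RE.match(line)
            if m is None:
                raise ValueError(f"bad BL line: {line!r}")
            block["motif"] = m.group(1)
            block["width"], block["seqs"] = int(m.group(2)), int(m.group(3))
            block["calibration"], block["strength"] = int(m.group(4)), int(m.group(5))
            seen_bl = True
    raise ValueError("entry is missing its // terminator")


if __name__ == "__main__":
    parsed = parse_block(SAMPLE)
    print(parsed["block_number"], parsed["distance"], len(parsed["segments"]))
```

Tracking whether the BL line has been seen avoids mistaking a sequence identifier that happens to begin with "ID" or "AC" for a header line.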


Other Multiple Alignment Formats

FASTA Format

Each sequence in the multiple alignment starts with a FASTA title line containing the sequence name followed by the aligned sequence residues with dashes representing gaps:
>JC2395
NVSDVNLNK---YIWRTAEKMK---ICDAKKFARQHKIPESKIDEIEHNSPQDAAE----
-------------------------QKIQLLQCWYQSHGKT--GACQALIQGLRKANRCD
IAEEIQAM
>KPEL_DROME
MAIRLLPLPVRAQLCAHLDAL-----DVWQQLATAVKLYPDQVEQISSQKQRGRS-----
-------------------------ASNEFLNIWGGQYN----HTVQTLFALFKKLKLHN
AMRLIKDY
>FASA_MOUSE
NASNLSLSK---YIPRIAEDMT---IQEAKKFARENNIKEGKIDEIMHDSIQDTAE----
-------------------------QKVQLLLCWYQSHGKS--DAYQDLIKGLKKAECRR
TLDKFQDM
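A reader for this layout can be sketched in a few lines. The version below is a minimal illustration, assuming the sequence name is the first whitespace-delimited word of the title line; the function name is hypothetical.

```python
def read_aligned_fasta(text):
    """Return {sequence_name: aligned_residues} from FASTA-formatted text.

    Residue lines are concatenated in order; dashes (gaps) are kept as-is.
    """
    parts = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            parts[name] = []
        elif name is not None:
            parts[name].append(line)
    return {n: "".join(chunks) for n, chunks in parts.items()}


if __name__ == "__main__":
    # Tiny made-up alignment, used only to exercise the function.
    demo = ">A\nAC--\nGT\n>B\nACDE\nGT\n"
    alignment = read_aligned_fasta(demo)
    print(alignment)
    # In a valid multiple alignment every row has the same length.
    print(len({len(seq) for seq in alignment.values()}) == 1)
```

Checking that all rows share one length is a cheap way to catch a truncated or mis-pasted alignment before further processing.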


CLUSTAL/STOCKHOLM Format

The first non-blank line must contain the word "CLUSTAL" or "STOCKHOLM". Sequences are interleaved on separate lines with gaps represented by dashes. Each sequence line starts with the sequence name, which is separated from the aligned sequence residues by spaces or tabs. Each set of interleaved sequence segments is separated by one or more blank lines. Lines containing sequence conservation symbols (CLUSTAL) or "//" (STOCKHOLM) are ignored.
(Please note: Some WWW sites post-process Clustal output so that it has a different format than in this example; in this case use FASTA format).
CLUSTAL W(1.60) multiple sequence alignment



JC2395          NVSDVNLNK---YIWRTAEKMK---ICDAKKFARQHKIPESKIDEIEHNSPQDAAE----
KPEL_DROME      MAIRLLPLPVRAQLCAHLDAL-----DVWQQLATAVKLYPDQVEQISSQKQRGRS-----
FASA_MOUSE      NASNLSLSK---YIPRIAEDMT---IQEAKKFARENNIKEGKIDEIMHDSIQDTAE----


JC2395          -------------------------QKIQLLQCWYQSHGKT--GACQALIQGLRKANRCD
KPEL_DROME      -------------------------ASNEFLNIWGGQYN----HTVQTLFALFKKLKLHN
FASA_MOUSE      -------------------------QKVQLLLCWYQSHGKS--DAYQDLIKGLKKAECRR


JC2395          IAEEIQAM
KPEL_DROME      AMRLIKDY
FASA_MOUSE      TLDKFQDM
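The interleaved layout above can be reassembled into one string per sequence. The following is a rough sketch under two assumptions that go beyond the text: conservation lines are recognized by their leading whitespace, and lines starting with "#" (Stockholm markup) are skipped. The function name is hypothetical.

```python
def read_clustal(text):
    """Return {sequence_name: aligned_residues} from CLUSTAL/STOCKHOLM text."""
    lines = iter(text.splitlines())
    # The first non-blank line must identify the format.
    header = next((line for line in lines if line.strip()), None)
    if header is None or not ("CLUSTAL" in header or "STOCKHOLM" in header):
        raise ValueError('first non-blank line must contain "CLUSTAL" or "STOCKHOLM"')
    alignment = {}
    for line in lines:
        if not line.strip() or line.startswith("//"):
            continue  # separators between sets, or the Stockholm terminator
        if line[0].isspace() or line.startswith("#"):
            continue  # assumption: conservation symbols / Stockholm markup
        fields = line.split()
        if len(fields) < 2:
            continue
        name, residues = fields[0], fields[1]
        # Segments for the same name are appended in file order.
        alignment[name] = alignment.get(name, "") + residues
    return alignment


if __name__ == "__main__":
    # Tiny made-up alignment, used only to exercise the function.
    demo = (
        "CLUSTAL W(1.60) multiple sequence alignment\n\n"
        "A   AC--\nB   ACDE\n    **  \n\n"
        "A   GT\nB   GT\n"
    )
    print(read_clustal(demo))
```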




MSF Format

Any comments at the beginning of the file are terminated with a line starting with two slashes. Sequences are interleaved on separate lines with gaps represented by periods. Each sequence line starts with the sequence name which is separated from the aligned sequence residues by white space:
//


                1                                                   50
JC2395          NVSDVNLNK. ..YIWRTAEK MK...ICDAK KFARQHKIPE SKIDEIEHNS 
KPEL_DROME      MAIRLLPLPV RAQLCAHLDA L.....DVWQ QLATAVKLYP DQVEQISSQK 
FASA_MOUSE      NASNLSLSK. ..YIPRIAED MT...IQEAK KFARENNIKE GKIDEIMHDS

                51                                                 100
JC2395          PQDAAE.... .......... .......... .....QKIQL LQCWYQSHGK
KPEL_DROME      QRGRS..... .......... .......... .....ASNEF LNIWGGQYN.
FASA_MOUSE      IQDTAE.... .......... .......... .....QKVQL LLCWYQSHGK

                101
JC2395          T..GACQALI QGLRKANRCD IAEEIQAM
KPEL_DROME      ...HTVQTLF ALFKKLKLHN AMRLIKDY
FASA_MOUSE      S..DAYQDLI KGLKKAECRR TLDKFQDM


