Feature Formats


What is a feature?

A feature is a region of interest in a specified nucleic or protein sequence. It has a specified start and end position. It has a name describing what type of thing it is.

Features may also explicitly or implicitly hold the name of the program or database that they are derived from, the sense (in a nucleic sequence), the score and many other pieces of information.

Feature Tables are groups of features.
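
As an illustration only, the description above can be modelled as a small data structure. This is a sketch, not how EMBOSS stores features internally: the class name, field names, the source name 'hypothetical_predictor' and the 1-based inclusive coordinates are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Feature:
    """One region of interest on a sequence (illustrative model only)."""
    name: str                      # what type of thing it is, e.g. "exon"
    start: int                     # start position on the sequence
    end: int                       # end position on the sequence
    source: Optional[str] = None   # program or database it is derived from
    sense: Optional[str] = None    # "+" or "-" (nucleic sequences only)
    score: Optional[float] = None  # optional score

    def length(self) -> int:
        # Assumes 1-based, inclusive coordinates.
        return self.end - self.start + 1


# A feature table is simply a group of features.
feature_table = [
    Feature("exon", 100, 250, source="hypothetical_predictor", sense="+", score=0.92),
    Feature("exon", 400, 520, source="hypothetical_predictor", sense="+"),
]
```

Only the name, start and end are required here; the remaining fields are optional because a feature may hold them explicitly or implicitly.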


Why have standard formats?

Standardising on a set of formats enables programs to be written that can read in the results from many different programs.

Even if you only intend to look at the resulting features, and not read them into any other programs, a standard set of formats is still worth having: you will very quickly get used to the look and feel of a format and be able to compare the features from different programs more easily.

Different programs may have different default feature formats. You may accept the default or choose your preferred format when you run the program.


What are the formats?

As the majority of sequence analysis programs find regions of interest in a sequence, the number of output formats is vast and chaotic.

The output types range from graphical displays of where restriction enzymes cut, to probabilities of the three states of a protein secondary structure prediction along a sequence, to rigidly defined text tables of the start and end positions of things like predicted exons or motif matches.

We will confine this document to describing the well-defined and flexible feature formats that have been developed for the major sequence databases (EMBL, Genbank, SwissProt, PIR) and for the input of features into the genome databases (GFF, acedb).

EMBOSS programs which write out in these feature formats will all obey the commands described below.

There are two ways feature tables can be stored; they can either be part of a sequence file or database entry or they can be in a file that does not contain the sequence that it refers to (a raw feature table).

When a feature table is held together with the sequence it refers to, the format is identical to the sequence format of the same name as the feature format, e.g. a feature table in an EMBL sequence file is in EMBL feature format.

Even when the feature table is not held in the same file as the sequence information, the format of the feature table is the format defined by the feature table definition of the equivalent sequence format. For example, SwissProt feature table format is defined as part of the SwissProt sequence format definition.

Because most feature table definitions have a controlled vocabulary (i.e. there is a specified list of feature key names that can be used), you cannot edit feature tables to add in features with keys like 'PhD-motif-3'. If you edit the feature tables, you must stick to the allowed set of feature keys. See the documentation below.

The commands you can give to modify the behaviour of the programs with regards to feature formats differ depending on whether the features are included in a sequence file or database entry, or whether the features are in a file which is separate from the sequence that it refers to.

Feature Format Names

The following feature formats are understood by EMBOSS.

  Name                    Comments / Documentation
  embl, em                The format used by the EMBL nucleic database.
  gff                     The General Feature Format defined by the Sanger Centre.
  swissprot, swiss, sw    The format used by the SWISSPROT protein database. The feature table keys are also defined.
  pir                     The format used by the PIR protein database.
  nbrf                    Only available for input - the same as PIR format.
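
The format names listed above include several aliases, and one name that is valid for input only. A minimal sketch of how a script might normalise them follows. The function and variable names are hypothetical, and treating names as case-insensitive is an assumption, not documented EMBOSS behaviour.

```python
# Canonical name for each format name listed above.
ALIASES = {
    "embl": "embl", "em": "embl",
    "gff": "gff",
    "swissprot": "swissprot", "swiss": "swissprot", "sw": "swissprot",
    "pir": "pir",
    "nbrf": "pir",  # the same as PIR format, but input only
}
INPUT_ONLY = {"nbrf"}


def canonical_format(name: str, for_output: bool = False) -> str:
    """Map a feature format name or alias to its canonical name."""
    key = name.lower()  # assumption: names are matched case-insensitively
    if key not in ALIASES:
        raise ValueError(f"unknown feature format: {name!r}")
    if for_output and key in INPUT_ONLY:
        raise ValueError(f"{name!r} is only available for input")
    return ALIASES[key]
```

For example, canonical_format("sw") gives "swissprot", while canonical_format("nbrf", for_output=True) raises an error.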

Uniform Feature Object

A 'UFO' (Uniform Feature Object) is a standard way of referring to a feature file that specifies both the format of the features and the name of the file. By analogy with USAs (Uniform Sequence Addresses), the feature format is given first, then a ':', then the name of the file, e.g. embl:results.dat

UFOs can be used to specify the feature format and file both on input and on output.

If no format is specified, then 'GFF' format is the default.
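
The UFO syntax is simple enough to sketch in a few lines. The function below is illustrative only and is not the EMBOSS parser; the name parse_ufo is hypothetical. It splits on the first ':' and falls back to 'gff' when no format is given.

```python
from typing import Tuple

DEFAULT_FORMAT = "gff"  # used when the UFO gives no format


def parse_ufo(ufo: str) -> Tuple[str, str]:
    """Split a UFO of the form 'format:filename' into (format, filename)."""
    fmt, sep, filename = ufo.partition(":")
    if not sep:
        # No ':' present, so the whole string is taken as the file name.
        return DEFAULT_FORMAT, ufo
    return fmt, filename
```

With this sketch, parse_ufo("embl:results.dat") gives ("embl", "results.dat") and parse_ufo("results.dat") gives ("gff", "results.dat"). File names that themselves contain a ':' (for example a drive letter) are not handled.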


Sequences with Features

Sequence Feature Input Command-line qualifiers

Many programs will read in and use the feature table of an input sequence. Amongst these are diffseq, extractfeat, maskfeat, seqret, showfeat.

If the feature table is already a part of the sequence (which is generally the case when you are reading the sequence from a database), then the feature table will be read with no problem. If the feature table is in a separate file, you can force the application to read it in using the '-ufo' command-line qualifier, e.g. '-ufo gff:results.dat'.

The '-fformat' and '-fopenfile' qualifiers can be used together to specify the feature format and the feature file name individually instead of as part of a UFO.

  -ufo                string     UFO features
  -fformat            string     features format
  -fopenfile          string     features file name

Using '-ufo' or '-fopenfile' to read in a feature table will cause the new feature table to replace any existing feature table that is part of the sequence data.
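
The two equivalent ways of naming a separate feature file can be sketched as a helper that builds the qualifier list for a script that launches an EMBOSS program. The helper name is hypothetical. It only assembles the qualifiers described above and does not run anything.

```python
from typing import List


def feature_input_args(fmt: str, filename: str, as_ufo: bool = True) -> List[str]:
    """Return the qualifiers that name a separate feature file.

    as_ufo=True  gives the single '-ufo format:filename' form.
    as_ufo=False gives the '-fformat' plus '-fopenfile' form.
    """
    if as_ufo:
        return ["-ufo", f"{fmt}:{filename}"]
    return ["-fformat", fmt, "-fopenfile", filename]
```

Either list can be appended to the argument list of a program that reads sequence features, such as showfeat.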

If you wish to combine feature table files from various sources, then the easiest way is to concatenate the GFF format feature files into one file and to specify that file using '-ufo'.
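
A minimal sketch of that concatenation step follows. The function name and file names are hypothetical. It copies the files verbatim in the order given, so any header or comment lines in the later files are repeated in the combined file; whether that matters depends on the program that reads it.

```python
from pathlib import Path
from typing import Iterable


def concatenate_feature_files(sources: Iterable[str], destination: str) -> int:
    """Append each source file to destination in order; return lines written."""
    written = 0
    with open(destination, "w") as out:
        for source in sources:
            text = Path(source).read_text()
            # Make sure the last line of one file does not run into the next file.
            if text and not text.endswith("\n"):
                text += "\n"
            out.write(text)
            written += text.count("\n")
    return written
```

The combined file can then be given to a program as, for example, '-ufo gff:combined.gff'.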

Sequence Feature Output Command-line qualifiers

If the program is capable of writing out sequences with features (the only example of such a program as of version 2.2 is seqret -feature), then the feature table will be written out as part of the output sequence file if the format of the sequence file is one of embl, gff, swissprot or pir, i.e. if the sequence file is capable of holding a feature table. If the sequence is written in a format which cannot hold a feature table, for example 'fasta', then a file 'unknown.gff' is written to hold the feature table.
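
The default rule for where the feature table ends up can be sketched as follows. This is an illustration of the rule as described here, not EMBOSS source code; the function name is hypothetical and the case-insensitive comparison is an assumption.

```python
# Sequence formats that can hold a feature table, as listed above.
FEATURE_CAPABLE = {"embl", "gff", "swissprot", "pir"}
FALLBACK_FILE = "unknown.gff"


def feature_destination(seq_format: str, seq_file: str) -> str:
    """Return the file that receives the feature table by default."""
    if seq_format.lower() in FEATURE_CAPABLE:
        # The feature table is written inside the sequence file itself.
        return seq_file
    # Formats such as 'fasta' cannot hold features.
    return FALLBACK_FILE
```

For example, feature_destination("embl", "out.embl") gives "out.embl", while feature_destination("fasta", "out.fasta") gives "unknown.gff".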

This behaviour can be overridden by using the following command-line qualifiers. They enable you to specify an output file and format for the features, even if a sequence format that is capable of holding a feature table has been specified.

  -oufo               string     UFO features
  -offormat           string     features format
  -ofname             string     features file name

Just Features (Raw Features)

Feature Input Command-line qualifiers

Not many programs currently read in a feature table on its own, i.e. one that is not part of a sequence (a raw feature table). (As of EMBOSS version 2.2, there are no such programs.)

These command-line qualifiers change the behaviour of a 'features' input parameter.

  -fformat            string     features format
  -fopenfile          string     features file name
  -fask               bool       prompt for begin/end/reverse
  -fbegin             integer    first base used
  -fend               integer    last base used, def=max length
  -freverse           bool       reverse (if DNA)

Feature Output Command-line qualifiers

Many programs are capable of writing raw feature tables. This is expected to become the default way of reporting things found in sequences as the EMBOSS project matures.

The default output feature format is 'gff', but this can be changed to the required format using the command-line qualifier '-offormat' followed by the format name.

These command-line qualifiers change the behaviour of a 'featout' output parameter.

  -offormat           string     output feature format
  -ofopenfile         string     features file name
  -ofextension        string     file name extension
  -ofname             string     base file name
  -ofsingle           bool       separate file for each entry
  -ofdirectory        bool       Output feature file directory

Displaying/Extracting features

You might find the program showfeat useful for displaying features.

You might find the program extractfeat useful for extracting the sequences of features.