To infer family relationships among genes, and the proteins they encode, JGI portals provide clustering on all organism and organism group pages. We compute clusters with the TRIBE-MCL clustering method of Enright et al., 2002, from an all-vs.-all BLAST of the proteins in the set of organisms included in a clustering run. (Note that in this help page the terms ‘gene’ and ‘protein’ may be used interchangeably. In such cases ‘gene’ will generally mean ‘protein encoded by a gene’.)

Enright AJ, Van Dongen S, Ouzounis CA. An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res. 2002 Apr 1;30(7):1575-84.

Clustering run page

To view clustering results for a given organism or group portal, click on the MCL CLUSTERS navigation tab. This leads to the default run for that organism or group of organisms. Other cluster runs can be selected in the pull-down menu labeled "Run". If you navigated to the cluster page from an individual organism's portal, the runs listed are the others in which that organism participated.

There are two main pages in which the results of a clustering run are displayed: Clustering Run Page and Cluster Details Page.

Clustering run table

The clustering run table is the central page of a clustering run. It shows a list of computed clusters ordered top to bottom, from the largest total number of proteins in the cluster to the smallest (generally there are many singletons in a cluster run). Each row represents a cluster; each column represents an organism and the proteins it contributes to a cluster (the rightmost column is a total for each cluster, summed irrespective of organism). The Pfam domains detected in each cluster's proteins are also summarized.
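The layout described above can be sketched in a few lines of Python. This is only an illustration, not the portal's implementation; the `build_run_table` helper, the cluster IDs, and the protein names are hypothetical.

```python
from collections import Counter

def build_run_table(clusters):
    """Return rows of (cluster_id, per-organism counts, total), largest first.

    `clusters` maps a cluster ID to a list of (organism, protein_id) pairs.
    """
    rows = []
    for cluster_id, members in clusters.items():
        counts = Counter(organism for organism, _ in members)
        rows.append((cluster_id, dict(counts), len(members)))
    # Largest total first; singletons end up at the bottom of the table.
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows

# Hypothetical example data: three clusters over two organisms.
clusters = {
    1: [("SchcoA", "p1")],
    2: [("SchcoA", "p2"), ("Dotse1", "p3"), ("Dotse1", "p4")],
    3: [("Dotse1", "p5"), ("SchcoA", "p6")],
}
table = build_run_table(clusters)
```

Each tuple corresponds to one table row: the per-organism dictionary holds the column values, and the last element is the TOTAL column.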

Table columns

Details of the columns of the clustering run table follow.

Cluster ID

This column shows the numeric ID of a given cluster, linked to the cluster details page, and a checkbox for selecting the row. A checkbox in the column header lets you select all clusters in the run.


The columns are technically gene model sets, or “tracks”; however in virtually all biologically meaningful applications of clustering, these tracks will be the gene catalog (best predicted or canonical gene set) of a particular organism, compared to that of other organisms. Thus a biologically intuitive way of thinking about the columns is to call them “organisms”. Technically, however, we could cluster different gene model sets (tracks) from the same organism, and this occasionally has uses in genome annotation quality control. Due to the general equivalency of ‘organisms’ and ‘tracks’, the terms will be used interchangeably in this help page.

In each cluster (row), for each organism (column) we see:

Track hover

The “TOTAL” column has a pie chart and protein count for the whole cluster (all organisms pooled together).

Organism labels have a hover function to let you know what gene set (track) was used in that organism. If you mouse over the label (which could be cryptic, e.g. ‘SchcoA’), you will see the full genome name and track that was used for clustering.

Domains hover

The organism column label can be clicked to reorder the clusters by counts of genes from that organism, from highest to lowest. A second click on the same column will reverse the order.

The pie charts also have a hover pop-up that shows a list of domains and the gene count for each domain.


This column lists all Pfam domains detected in the protein sequences of a given cluster. The color labels correspond to the color for a given domain on the chart. The number shows how many protein sequences in the cluster contain the domain at least once. The domain name is linked to the Pfam website, where you can get more information about the domain.
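The counting rule (one count per protein, however many times the domain occurs in its sequence) can be sketched as follows. The `domain_protein_counts` helper and the example data are hypothetical and only illustrate the rule.

```python
from collections import Counter

def domain_protein_counts(protein_domains):
    """Count, per domain, the proteins that carry it at least once.

    `protein_domains` maps a protein ID to the list of domain hits found in
    its sequence; a domain repeated within one protein is counted once.
    """
    counts = Counter()
    for hits in protein_domains.values():
        counts.update(set(hits))  # set() collapses repeated hits per protein
    return counts

# Hypothetical cluster with three proteins.
cluster = {
    "protA": ["PF00069", "PF00069", "PF09192"],
    "protB": ["PF00069"],
    "protC": [],
}
counts = domain_protein_counts(cluster)
```

Here `protA` carries PF00069 twice but contributes only one to that domain's count.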

Color labels

Note that the color labels are not unique across all clusters. There are two types of color labels: “bold” colors and “dissolved” colors. Bold colors are assigned to the 144 most abundant domains across all proteins in all clusters (ranked by the number of proteins associated with each domain). The bold colors are unique. Dissolved colors are assigned to all other (less abundant) domains and are cycled, so they are not unique.
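One way to picture this scheme is the sketch below. It is an assumption-laden illustration rather than the portal's code: the `assign_color_labels` helper, the palette names, and the representation of a bold color as a rank index are all hypothetical. Only the cutoff of 144 comes from the text above.

```python
from collections import Counter
from itertools import cycle

N_BOLD = 144  # number of uniquely colored domains, per the text above

def assign_color_labels(domain_counts, dissolved_palette, n_bold=N_BOLD):
    """Map each domain to a ("bold", rank) or ("dissolved", color) label."""
    # Rank domains from most to least abundant.
    ranked = [domain for domain, _ in Counter(domain_counts).most_common()]
    labels = {}
    for rank, domain in enumerate(ranked[:n_bold]):
        labels[domain] = ("bold", rank)  # unique per domain
    palette = cycle(dissolved_palette)   # reused once exhausted
    for domain in ranked[n_bold:]:
        labels[domain] = ("dissolved", next(palette))
    return labels

# Hypothetical data; n_bold is lowered to 2 to keep the example small.
counts = {"PF1": 10, "PF2": 8, "PF3": 5, "PF4": 3, "PF5": 1}
labels = assign_color_labels(counts, ["grey1", "grey2"], n_bold=2)
```

With a two-color dissolved palette, the third and fifth domains end up sharing a color, which is why dissolved labels cannot identify a domain across clusters.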

Bottom of the page

At the page bottom is a list of all tracks used in clustering with:

Bottom tracks

Top controls

Controls at the top of the page allow users to perform certain functions, e.g. filtering what clusters are shown, tweaking the table presentation, and downloading clusters.

Runs selector

Clustering run selector

At the very top of the screen you may see the clustering run selector. It is visible only when you view the run in the context of a particular organism portal. It lists all clustering runs in which that organism was used. Selecting a run redirects you to that run.

Run statistics

The clustering run statistics show the following data about a clustering run:

Table controls

Table display controls

The table display controls let the user adjust how the run table is presented.

Select all

Download function

The download function lets you download data from a clustering run. Before downloading, you must select one or more clusters in the table by ticking the checkboxes for the desired rows (clusters).

Download options

The following format options are available for the download function:

Download CSV example

id,run,cluster,pfam domains

The download file may be uncompressed, compressed as Zip, or compressed as Gzip file.
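A downloaded CSV file in any of these three forms could be read with a sketch like the one below. The `read_cluster_csv` helper is hypothetical, and it assumes that a Zip archive holds a single CSV member and that the file is UTF-8 encoded; neither is stated on this page.

```python
import csv
import gzip
import io
import zipfile

def read_cluster_csv(path):
    """Read a downloaded CSV that is plain, Gzip-compressed, or Zip-compressed."""
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as archive:
            # Assumption: the archive holds a single CSV member.
            text = archive.read(archive.namelist()[0]).decode("utf-8")
    else:
        with open(path, "rb") as handle:
            raw = handle.read()
        if raw[:2] == b"\x1f\x8b":  # Gzip magic number
            raw = gzip.decompress(raw)
        text = raw.decode("utf-8")
    # Each row becomes a dict keyed by the header line of the file.
    return list(csv.DictReader(io.StringIO(text)))
```

The rows are keyed by whatever header the file carries, for example `id`, `run`, `cluster`, and `pfam domains` for the CSV header shown above.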

Filtering options

Filtering function

The filtering function restricts which clusters appear in the clustering run table. Filtering is performed as a "composition" of 3 filtering criteria:

"Composition" means that all 3 criteria are combined using an AND operation. Only those clusters that satisfy all 3 criteria will be shown.

Filter by keywords

The following keywords can be used in the filter-by-keyword input:

It is possible to use multiple keywords, e.g. Dotse1.73412 Dotse1.88705. Queries use a simplified Lucene query language. Examples are:

pf00069 pf09192: Show all clusters with either pf00069 or pf09192 domains
+pf00069 +pf09192: Show all clusters with both pf00069 and pf09192 domains
+pf00069 pf09192: Show all clusters with either both domains or just the pf00069 domain
+pf00069 -pf09192: Show all clusters with pf00069 but no pf09192

(Keywords are NOT case sensitive.)
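The behavior of the four example queries can be reproduced with the sketch below. It is a simplified illustration of the `+` (required) and `-` (prohibited) prefixes, not the portal's query engine; the `matches_query` helper is hypothetical, and returning no match for an empty or purely negative query is an assumption.

```python
def matches_query(query, cluster_keywords):
    """Return True if a cluster's keywords satisfy a +/- keyword query."""
    keywords = {keyword.lower() for keyword in cluster_keywords}
    required, prohibited, optional = [], [], []
    for term in query.lower().split():  # lower(): keywords are not case sensitive
        if term.startswith("+"):
            required.append(term[1:])
        elif term.startswith("-"):
            prohibited.append(term[1:])
        else:
            optional.append(term)
    if any(term in keywords for term in prohibited):
        return False
    if required:
        # Optional terms do not affect matching once required terms exist.
        return all(term in keywords for term in required)
    # With no required terms, at least one optional term must match.
    return any(term in keywords for term in optional)

# Hypothetical clusters described by the domains they contain.
only_kinase = {"PF00069"}
both = {"PF00069", "PF09192"}
hits = [matches_query("+pf00069 -pf09192", c) for c in (only_kinase, both)]
```

In this example the cluster holding only PF00069 passes the query, while the cluster holding both domains is rejected by the `-pf09192` term.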

Filter by composition of genes in clusters

This option filters for clusters that satisfy the following conditions:

Filter by number of genes

Filter by number of genes from individual tracks in a cluster

This tool allows “fine tuning” of filtering. The following options are available:

The selector button on the TOTAL column behaves analogously; e.g. =1 for TOTAL will show only singleton clusters. Similarly, the selector button on the Cluster ID column sets the selectors in all track columns to the same value (and thus returns the clusters that match those criteria).
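The per-column selectors can be pictured with the sketch below. It handles only the `=N` form shown in the example above, since that is the only selector syntax given here. The `passes_selectors` and `set_all_tracks` helpers and the track labels are hypothetical, and computing TOTAL as the sum over tracks is an assumption.

```python
def passes_selectors(track_counts, selectors):
    """Check one cluster row against per-column selectors.

    `track_counts` maps a track label to its gene count in the cluster.
    `selectors` maps a column label (a track, or "TOTAL") to a selector
    string such as "=1"; columns without a selector are unconstrained.
    """
    total = sum(track_counts.values())
    for column, selector in selectors.items():
        value = total if column == "TOTAL" else track_counts.get(column, 0)
        if not selector.startswith("="):
            raise ValueError(f"unsupported selector: {selector!r}")
        if value != int(selector[1:]):
            return False  # selectors are combined with AND
    return True

def set_all_tracks(tracks, selector):
    """Mimic the Cluster ID column button: same selector on every track."""
    return {track: selector for track in tracks}

# Hypothetical rows: a singleton and a cluster with one gene per track.
singleton = {"SchcoA": 1, "Dotse1": 0}
one_each = {"SchcoA": 1, "Dotse1": 1}
is_singleton = passes_selectors(singleton, {"TOTAL": "=1"})
```

Applying `set_all_tracks(["SchcoA", "Dotse1"], "=1")` to `one_each` would select clusters with exactly one gene from every track.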

Table navigation

Table navigation lets the user move through the pages of the clustering run table. It shows

The current page is shown in bold.