To infer family relationships among genes, and the proteins they encode, JGI portals provide clustering for all organisms and organism group pages. We compute clusters using the TRIBE-MCL clustering method of Enright et al., 2002, from an all-vs.-all BLAST of the proteins in the set of organisms included in a clustering run. (Note that in this help page the terms ‘gene’ and ‘protein’ may be used interchangeably. In such cases ‘gene’ will generally mean ‘protein encoded by a gene’.)

Enright AJ, Van Dongen S, Ouzounis CA. An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res. 2002 Apr 1;30(7):1575-84.
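For readers unfamiliar with Markov clustering, the toy sketch below illustrates the general idea of alternating "expansion" and "inflation" steps on a protein similarity matrix. It is not the JGI pipeline: the function name, the inflation value, the iteration count, and the 0/1 similarity scores are all assumptions chosen for illustration (a real run would use BLAST-derived scores and the MCL software itself).

```python
def mcl(similarity, inflation=2.0, iterations=30):
    """Toy Markov clustering on a symmetric similarity matrix (list of lists).

    All parameter values here are illustrative assumptions.
    """
    n = len(similarity)

    def normalize_columns(matrix):
        # Make every column sum to 1 so it can be read as flow probabilities.
        for j in range(n):
            total = sum(matrix[i][j] for i in range(n))
            for i in range(n):
                matrix[i][j] /= total
        return matrix

    # Add self-loops, then normalize.
    flow = normalize_columns(
        [[similarity[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    )
    for _ in range(iterations):
        # Expansion: square the matrix (flow spreads along longer paths).
        expanded = [[sum(flow[i][k] * flow[k][j] for k in range(n))
                     for j in range(n)] for i in range(n)]
        # Inflation: raise entries to a power and renormalize,
        # which strengthens strong flows and weakens weak ones.
        flow = normalize_columns(
            [[value ** inflation for value in row] for row in expanded]
        )

    # Each row that still carries flow names the proteins attracted to it.
    clusters = {frozenset(j for j in range(n) if flow[i][j] > 1e-6)
                for i in range(n)}
    return sorted(sorted(c) for c in clusters if c)


# Proteins 0-2 are all similar to each other; proteins 3-4 form a second group.
toy_similarity = [
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
]
toy_clusters = mcl(toy_similarity)
```

In this toy input the two groups share no similarity, so they come out as two separate clusters.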

Clustering run page

To view clustering results for a given organism or group portal, click on the MCL CLUSTERS navigation tab. This leads to the default run for that organism or group of organisms. Notice that in the pull-down menu labeled "Run", other cluster runs can be selected. If you navigated to the cluster page from an individual organism's portal, these cluster runs will be others in which that organism participated.

There are two main pages in which the results of a clustering run are displayed: Clustering Run Page and Cluster Details Page.

Clustering run table

The clustering run table is the central page of a clustering run. It shows a list of computed clusters ordered top to bottom, from the largest total number of proteins in the cluster to the smallest (generally there are many singletons in a cluster run). Each row represents a cluster; each column represents an organism and the proteins it contributes to a cluster (the rightmost column is a total for each cluster, summed irrespective of organism). The Pfam domains detected in each cluster's proteins are also summarized.

Table columns

Details of the columns of the clustering run table follow.

Cluster ID
Organisms
TOTAL
Domains

Cluster ID

This column features the numeric ID of a given cluster, linked to the cluster details page, and a checkbox that allows selection of this row. There is also a checkbox in the column header that lets you select all clusters in the run.

Organisms

The columns are technically gene model sets, or “tracks”; however in virtually all biologically meaningful applications of clustering, these tracks will be the gene catalog (best predicted or canonical gene set) of a particular organism, compared to that of other organisms. Thus a biologically intuitive way of thinking about the columns is to call them “organisms”. Technically, however, we could cluster different gene model sets (tracks) from the same organism, and this occasionally has uses in genome annotation quality control. Due to the general equivalency of ‘organisms’ and ‘tracks’, the terms will be used interchangeably in this help page.

In each cluster (row), for each organism (column) we see:

PFAM domains chart - a donut chart that indicates the distribution of PFAM domains for a given organism and cluster. The solid area of the donut (or, inversely, the size of the donut hole) indicates the relative number of proteins each organism contributes to that cluster. Thus, the organism contributing the most proteins to the cluster does not have a hole; organisms contributing fewer proteins to the cluster have larger donut holes. The coloring of the donuts indicates the distribution of Pfam domains within the set of proteins contributed by an organism to a cluster.
Counter - the number of genes contributed by an organism to a cluster
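As a rough illustration of how the donut hole relates to the counter, the sketch below assumes the solid area is directly proportional to an organism's share relative to the largest contributor. The exact geometry used by the portal is not documented here, so the function name and the area-proportional rule are assumptions.

```python
import math


def donut_hole_radius(count, max_count, outer_radius=1.0):
    """Hole radius if the solid area scales with count / max_count (assumed rule)."""
    solid_fraction = count / max_count
    # Hole area is the remaining fraction of the full disc; area scales with r**2.
    return outer_radius * math.sqrt(1.0 - solid_fraction)


largest = donut_hole_radius(count=4, max_count=4)   # top contributor: no hole
smaller = donut_hole_radius(count=3, max_count=4)   # fewer proteins: a hole appears
```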

Track hover

The “TOTAL” column has a pie chart and protein count for the whole cluster (all organisms pooled together).

Organism labels have a hover function to let you know what gene set (track) was used in that organism. If you mouse over the label (which could be cryptic, e.g. ‘SchcoA’), you will see the full genome name and track that was used for clustering.

Domains hover

The organism column label can be clicked to reorder the clusters by counts of genes from that organism, from highest to lowest. A second click on the same column will reverse the order.

The pie charts also have a hover pop-up that shows a list of domains and the gene count for each domain.

Domains

This column has a list of all PFAM domains detected in the protein sequences of a given cluster. The color labels correspond to the color for a given domain on the chart. The number represents how many protein sequences in the cluster have this domain, in at least one occurrence per sequence. The domain name is linked to the PFAM website where you can get more information about the domain.
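The counting rule described above (a protein counts once per domain, however many times the domain occurs in its sequence) can be sketched as follows. The input structure, a mapping from protein ID to a list of domain hits, is an assumption made for illustration.

```python
from collections import Counter


def domain_protein_counts(protein_domains):
    """Count, for each domain, how many proteins carry it at least once."""
    counts = Counter()
    for hits in protein_domains.values():
        # set() collapses repeated occurrences within one sequence.
        counts.update(set(hits))
    return counts


# Hypothetical cluster: the first protein has two copies of the same domain.
example_cluster = {
    "jgi|Bauco1|365231": ["pf08636", "pf08636"],
    "jgi|Dotse1|66178": ["pf08636", "pf00069"],
    "jgi|Mycgr3|106144": [],
}
example_counts = domain_protein_counts(example_cluster)
```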

Color labels

Note that the color labels are not unique across all clusters. There are two types of color labels - “bold” color and “dissolved” color. Bold colors are assigned to the 144 most abundant domains assigned in all proteins, for all clusters (by number of proteins associated with each domain). The bold colors are unique. Dissolved colors are assigned to all other (less abundant) domains and are cycled (not unique).
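One way to picture the two kinds of color labels is the sketch below: domains are ranked by abundance, the top ranks each get their own bold color, and the rest cycle through a shorter list of dissolved colors. The palettes, the tie-breaking rule, and the function name are assumptions; in the portal the bold palette would hold 144 colors rather than the two used here.

```python
def assign_color_labels(domain_counts, bold_palette, dissolved_palette):
    """Give unique bold colors to the most abundant domains, cycled colors to the rest."""
    # Most abundant first; ties broken by domain ID (an assumed rule).
    ranked = sorted(domain_counts, key=lambda d: (-domain_counts[d], d))
    labels = {}
    for rank, domain in enumerate(ranked):
        if rank < len(bold_palette):
            labels[domain] = ("bold", bold_palette[rank])
        else:
            index = (rank - len(bold_palette)) % len(dissolved_palette)
            labels[domain] = ("dissolved", dissolved_palette[index])
    return labels


# Hypothetical counts and tiny palettes to keep the example short.
color_labels = assign_color_labels(
    {"pf00069": 50, "pf08636": 30, "pf09192": 10, "pf00001": 5, "pf00002": 1},
    bold_palette=["#e41a1c", "#377eb8"],
    dissolved_palette=["#fbb4ae", "#b3cde3"],
)
```

With two dissolved colors available, the third and fifth domains end up sharing a color, which is why dissolved labels cannot be assumed unique.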

Bottom of the page

At the page bottom is a list of all tracks used in clustering with:

Bottom tracks

Assigned labels (e.g. ‘SchcoA’)
Full name of the organism, linked to the organism portal
Name of the gene model track, linked to the Info page on the organism portal

Top controls

Controls at the top of the page allow users to perform certain functions, e.g. filtering what clusters are shown, tweaking the table presentation, and downloading clusters.

Runs selector

Clustering run selector

At the very top of the screen you may see the clustering run selector. It is visible only if you are viewing the run in the context of a particular organism portal. It lists all clustering runs in which that organism was used. Selecting a run from the list redirects you to that run.

Run statistics

Clustering run statistics shows the following data about a clustering run:

Number of multi-gene clusters in a run
Number of models (genes) in multi-gene clusters
Average multi-gene cluster size for a given run
Number of singletons; some genes end up in clusters in which they are the only gene
Clustering run date - date on which this clustering run was performed
Number of tracks (generally meaning organisms) used in given clustering run
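The size-based statistics above can be reproduced from a list of cluster sizes. The sketch below is a minimal illustration; the function name and the dictionary keys are made up for this example.

```python
def run_statistics(cluster_sizes):
    """Summarize a run from the number of genes in each cluster."""
    multi = [size for size in cluster_sizes if size > 1]
    singletons = sum(1 for size in cluster_sizes if size == 1)
    return {
        "multi_gene_clusters": len(multi),
        "models_in_multi_gene_clusters": sum(multi),
        "average_multi_gene_cluster_size": sum(multi) / len(multi) if multi else 0.0,
        "singletons": singletons,
    }


# Hypothetical run: three multi-gene clusters and three singletons.
stats = run_statistics([11, 4, 3, 1, 1, 1])
```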

Table controls

Table display controls

The table display controls allow the user to tweak the presentation of the run table.

show charts - turn on and off the display of "donut" charts; note that by default charts are shown for all clustering runs that contain 20 or fewer tracks
show counters - turn on and off the display of counters (number of genes); note that if charts are hidden, then this control is inactive and counters are always shown
show domains - turn on and off the "domains" column in the table; by default it is shown for all clustering runs that contain 20 or fewer tracks

Select all

Download function

The download function allows you to download data from a clustering run. Before downloading, you must select one or more clusters in the table. This is done by selecting the checkboxes for the desired rows (clusters).

Download options

The following format options are available for the download function:

CSV - download comma-separated file containing cluster members (see example below)
Protein FASTA - download protein FASTA formatted file with all members of all selected clusters
Transcript FASTA - download transcript FASTA formatted file with all members of all selected clusters
Genomic FASTA - download genomic (scaffold) FASTA formatted file with the scaffold segments that correspond to genes for all members of all selected clusters

Download CSV example

id,run,cluster,pfam domains
jgi|Bauco1|365231,Altbr1-comparative.280,5655,pf08636
jgi|CocheC5_1|25664,Altbr1-comparative.280,5655,
jgi|Dotse1|66178,Altbr1-comparative.280,5655,pf08636
jgi|Hyspu1|12316,Altbr1-comparative.280,5655,pf08636
jgi|Mycfi2|212589,Altbr1-comparative.280,5655,pf08636
jgi|Mycgr3|106144,Altbr1-comparative.280,5655,
jgi|Pyrtr1|152077,Altbr1-comparative.280,5655,
jgi|Pyrtt1|8188,Altbr1-comparative.280,5655,pf08636
jgi|Rhyru1|5374,Altbr1-comparative.280,5655,pf08636
jgi|Sepmu1|151711,Altbr1-comparative.280,5655,pf08636
jgi|Settu1|37360,Altbr1-comparative.280,5655,pf08636

The download file may be uncompressed, compressed as a Zip file, or compressed as a Gzip file.
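A downloaded CSV file in the format shown above can be summarized with a few lines of Python. The sketch below counts cluster members per organism, assuming the id column always has the form jgi|organism|protein id as in the example rows. The function name is hypothetical, and the pfam domains column is left unparsed because the example does not show how multiple domains would be separated.

```python
import csv
import io
from collections import Counter, defaultdict

# First rows of the example above, inlined so the sketch is self-contained.
SAMPLE = """id,run,cluster,pfam domains
jgi|Bauco1|365231,Altbr1-comparative.280,5655,pf08636
jgi|CocheC5_1|25664,Altbr1-comparative.280,5655,
jgi|Dotse1|66178,Altbr1-comparative.280,5655,pf08636
"""


def count_members(csv_text):
    """Return {cluster ID: Counter({organism: number of members})}."""
    counts = defaultdict(Counter)
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Assumed id layout: jgi|<organism label>|<protein id>
        _, organism, _protein_id = row["id"].split("|")
        counts[row["cluster"]][organism] += 1
    return counts


members = count_members(SAMPLE)
```

For a Gzip-compressed download, the text could be read with the standard gzip module (for example gzip.open(path, "rt")) before being passed to the same function.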

Filtering options

Filtering function

The filtering function restricts which clusters appear in the clustering run table. Filtering is performed as a "composition" of 3 filtering criteria:

Filter by keyword
Filter by composition of genes in clusters
Filter by number of genes from individual tracks in a single cluster

"Composition" means all 3 criteria are combined using an AND operation. Only those clusters that satisfy all 3 criteria will be shown.

Filter by keywords

The following kinds of keywords can be used in the filter by keyword input:

Organism
Protein id
Combination of organism and protein id, like Dotse1.73412
Chromosome or scaffold name
Combination of organism and scaffold name, like Dotse1.scaffold_1
Any PFAM domain ID or name for gene annotations
Words from description of PFAM domain
Name of the track
Words from the description of the track

It is possible to use multiple keywords, e.g. Dotse1.73412 Dotse1.88705. This is implemented with a simplified Lucene query language. Examples are:

pf00069 pf09192	Show all clusters with either pf00069 or pf09192 domains
+pf00069 +pf09192	Show all clusters with both pf00069 and pf09192 domains
+pf00069 pf09192	Show all clusters either with both domains, or just pf00069 domain
+pf00069 -pf09192	Show all clusters with pf00069 but no pf09192

(Keywords are NOT case sensitive.)
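The behavior of the four example queries can be mimicked with a small matcher. This is only a sketch of the semantics described above, not the portal's implementation: the function name is hypothetical, and each cluster is represented simply as a set of keywords.

```python
def matches(query, cluster_keywords):
    """Check one cluster against a query using +required, -excluded and plain terms."""
    keywords = {keyword.lower() for keyword in cluster_keywords}
    required, excluded, optional = [], [], []
    for term in query.lower().split():
        if term.startswith("+"):
            required.append(term[1:])
        elif term.startswith("-"):
            excluded.append(term[1:])
        else:
            optional.append(term)
    if any(term in keywords for term in excluded):
        return False
    if required:
        # Plain terms are optional once a required term is present.
        return all(term in keywords for term in required)
    return any(term in keywords for term in optional)


first_only = {"pf00069"}
both = {"PF00069", "pf09192"}   # mixed case: matching is case-insensitive
second_only = {"pf09192"}

either_query = matches("pf00069 pf09192", second_only)
both_query = matches("+pf00069 +pf09192", first_only)
mixed_query = matches("+pf00069 pf09192", first_only)
exclude_query = matches("+pf00069 -pf09192", both)
```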

Filter by composition of genes in clusters

This option filters for clusters that satisfy one of the following conditions:

All tracks in cluster have the same number of genes
All except one track have the same number of genes
All except two tracks have the same number of genes
80% or more tracks have the same number of genes
less than 80% of tracks have the same number of genes
50% or more tracks have the same number of genes
less than 50% of tracks have the same number of genes
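The conditions above all depend on how many tracks share the most common gene count in a cluster. The sketch below computes that figure under two assumptions that the text does not spell out: every track in the run is counted (including tracks with zero genes in the cluster), and "the same number" refers to the most frequent count. The function name and return keys are hypothetical.

```python
from collections import Counter


def composition_summary(counts_per_track):
    """Summarize how uniform the per-track gene counts of one cluster are."""
    if not counts_per_track:
        raise ValueError("at least one track is required")
    tally = Counter(counts_per_track.values())
    agreeing = max(tally.values())   # tracks sharing the most common count
    total = len(counts_per_track)
    return {
        "agreeing": agreeing,
        "outliers": total - agreeing,
        "fraction": agreeing / total,
    }


# Hypothetical cluster: four tracks contribute one gene each, one contributes three.
summary = composition_summary(
    {"Bauco1": 1, "Dotse1": 1, "Hyspu1": 1, "Mycfi2": 1, "Mycgr3": 3}
)
```

Under these assumptions the example cluster has one outlier and a fraction of 0.8, so it would pass both the "all except one track" and the "80% or more" conditions.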

Filter by number of genes

Filter by number of genes from individual tracks in a cluster

This tool allows "fine tuning" of filtering. The following options are available:

any - default, ANY NUMBER of genes
=0 - show only clusters with NO members from a given track
=1 - show only clusters with ONE member from a given track
01 - show only clusters with ONE or FEWER members from a given track
=2 - show only clusters with TWO members from a given track
1+ - show only clusters with ONE or MORE members from a given track
2+ - show only clusters with TWO or MORE members from a given track
1 - show only clusters with ANY number of members except ONE from a given track
2 - show only clusters with ANY number of members except TWO from a given track

The selector button on the TOTAL column behaves analogously; e.g. =1 for TOTAL will filter only singleton clusters. Similarly, the selector button on the cluster id column sets all selectors to the same value in all track columns (and thus returns the clusters that match those criteria).
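The per-track selectors, and the way they are combined with AND, can be sketched as a table of predicates. The names "any", "=0", "=1", "=2", "1+" and "2+" follow the list above; the spellings "<=1", "!=1" and "!=2" are hypothetical stand-ins for the "one or fewer" and "any but" options, and the function name is also an assumption.

```python
SELECTORS = {
    "any": lambda n: True,
    "=0": lambda n: n == 0,
    "=1": lambda n: n == 1,
    "<=1": lambda n: n <= 1,   # hypothetical spelling of "one or fewer"
    "=2": lambda n: n == 2,
    "1+": lambda n: n >= 1,
    "2+": lambda n: n >= 2,
    "!=1": lambda n: n != 1,   # hypothetical spelling of "any but one"
    "!=2": lambda n: n != 2,   # hypothetical spelling of "any but two"
}


def passes(cluster_counts, track_selectors, total_selector="any"):
    """True if the cluster satisfies the TOTAL selector and every per-track selector."""
    if not SELECTORS[total_selector](sum(cluster_counts.values())):
        return False
    return all(
        SELECTORS[selector](cluster_counts.get(track, 0))
        for track, selector in track_selectors.items()
    )


# Hypothetical cluster with gene counts per track.
cluster = {"Bauco1": 1, "Dotse1": 2, "Mycgr3": 0}
kept = passes(cluster, {"Bauco1": "=1", "Dotse1": "2+"})
dropped = passes(cluster, {"Mycgr3": "1+"})
singleton = passes({"Bauco1": 1}, {}, total_selector="=1")
```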

Table Navigations

Table navigation

Table navigation lets the user navigate through the pages of the clustering run table. It shows:

Total number of rows after filtering
Page navigation bar with "first", "last" and page number navigation
Selector for the number of clusters (rows) per page; the default value is 100, and the possible values are 100, 200, 500 and 1000 rows per page

The current page is shown in bold.
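The paging arithmetic behind the navigation bar can be sketched as follows. The allowed page sizes come from the list above; the function name and the choice to clamp out-of-range page numbers are assumptions.

```python
import math

PAGE_SIZES = (100, 200, 500, 1000)


def paginate(rows, page, per_page=100):
    """Return the rows for a 1-based page number and the number of the last page."""
    if per_page not in PAGE_SIZES:
        raise ValueError("per_page must be one of %s" % (PAGE_SIZES,))
    last_page = max(1, math.ceil(len(rows) / per_page))
    page = min(max(page, 1), last_page)   # clamp to the valid range
    start = (page - 1) * per_page
    return rows[start:start + per_page], last_page


# Hypothetical run with 250 clusters left after filtering.
filtered_rows = list(range(250))
page_rows, last_page = paginate(filtered_rows, page=3)
```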