Interpreting HMMER Output: Practical Tips and Examples
HMMER is a suite for searching sequence databases using profile hidden Markov models (HMMs). Its output contains several scores, E-values, alignments, and domain annotations that can be confusing at first. This guide explains the key fields, how to interpret them, and practical examples to help you separate true matches from noise.
Key output sections and fields
- Program and command
  - Shows which HMMER program produced the output (hmmscan, hmmsearch, phmmer, jackhmmer) and the exact command line used.
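- Query and target identifiers
  - target name: the database entry that was hit; a sequence for hmmsearch, phmmer, and jackhmmer, or a profile HMM for hmmscan.
  - query name: what you searched with; a profile HMM for hmmsearch, or a sequence for hmmscan, phmmer, and jackhmmer.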
- Bit score
  - What it is: log-odds score (in bits) indicating how much more likely the sequence is under the model than under a null model.
  - Interpretation: higher is better. Use bit scores to compare matches to the same model; a difference of about 10 bits corresponds to roughly a 1000-fold (2^10) change in the odds ratio, which is substantial.
- E-value
  - What it is: expected number of false positives scoring this high or higher in a search of this size.
  - Interpretation: lower is better. Typical cutoffs: ≤1e-3 for strong matches, ≤1e-1 for tentative; adjust by database size and objective. HMMER reports separate per-sequence and per-domain E-values, so check which one you are reading.
- Per-sequence vs per-domain reporting
  - Per-sequence: summarizes the overall match of a sequence to the model, with all domains combined (useful to find candidate homologs).
  - Per-domain: reports each domain hit separately when a sequence contains several. Domain-level E-values are often more relevant for multi-domain proteins.
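- Full sequence (full) vs domain (best 1, c-Evalue, i-Evalue)
  - full sequence score/E-value: score/E-value for the whole sequence against the model, summed over all domains found.
  - best 1 domain score/E-value: the values for the single best-scoring domain alone. If the full-sequence score is much higher than this, the hit rests on several weaker domains rather than one strong one.
  - domain scores:
    - c-Evalue (conditional E-value): E-value for the domain given that the sequence as a whole is already accepted as a hit. It is computed over only the sequences that passed the per-sequence threshold, so it is the more permissive measure; useful for deciding whether an additional domain is genuine.
    - i-Evalue (independent E-value): E-value the domain would receive if it were the only domain found in the sequence, computed over the whole database. This is the more stringent measure.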
- Bias
  - What it is: the correction, in bits, applied to the score to compensate for low-complexity or compositionally biased regions.
  - Interpretation: a bias of the same order of magnitude as the bit score suggests the match may be driven by biased composition, not true homology. Treat high-bias, low-bit matches with caution.
- Alignment block
  - Shows the alignment between the model consensus and the sequence, with a middle line marking identical and similar positions and a PP line giving the posterior probability of each aligned residue. The PP line uses the digits 0-9 and '*', where '*' marks the highest confidence (95-100%); stretches of high values indicate confidently aligned residues.
- Domain coordinates
  - Start/end positions in both the model (hmm from/to) and the sequence (ali from/to, plus the wider envelope env from/to); important to check whether the hit covers expected functional motifs or catalytic residues.
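The per-domain fields above map onto the columns of the tabular file written by --domtblout. As a minimal sketch, assuming the standard HMMER3 layout of 22 whitespace-separated columns followed by a free-text description, the Python below pulls out the fields discussed here. The DomainHit type, the parse_domtblout function, and the sample line are illustrative and not part of HMMER; the sample values are made up for the demonstration.

```python
from typing import Iterable, Iterator, NamedTuple


class DomainHit(NamedTuple):
    target: str         # column 1: target name
    query: str          # column 4: query name
    full_evalue: float  # column 7: full-sequence E-value
    full_score: float   # column 8: full-sequence bit score
    full_bias: float    # column 9: full-sequence bias
    c_evalue: float     # column 12: conditional E-value
    i_evalue: float     # column 13: independent E-value
    dom_score: float    # column 14: domain bit score
    dom_bias: float     # column 15: domain bias
    hmm_from: int       # columns 16-17: coordinates on the model
    hmm_to: int
    ali_from: int       # columns 18-19: coordinates on the sequence
    ali_to: int
    acc: float          # column 22: mean posterior probability of aligned residues


def parse_domtblout(lines: Iterable[str]) -> Iterator[DomainHit]:
    """Yield one DomainHit per data line, skipping comments and blank lines."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        # 22 fixed columns; anything after them is the free-text description.
        f = line.split(maxsplit=22)
        yield DomainHit(
            target=f[0], query=f[3],
            full_evalue=float(f[6]), full_score=float(f[7]), full_bias=float(f[8]),
            c_evalue=float(f[11]), i_evalue=float(f[12]),
            dom_score=float(f[13]), dom_bias=float(f[14]),
            hmm_from=int(f[15]), hmm_to=int(f[16]),
            ali_from=int(f[17]), ali_to=int(f[18]),
            acc=float(f[21]),
        )


# Made-up hmmscan-style line: the target is a model, the query is a sequence.
sample = ("Kinase_X - 260 seqA - 300 2e-20 150.0 0.0 1 1 3e-24 1e-21 149.5 0.0 "
          "5 250 45 230 43 232 0.97 made-up description text")
hit = next(parse_domtblout([sample]))
print(hit.target, hit.i_evalue, hit.ali_from, hit.ali_to)
```

In practice you would pass an open file handle instead of a list, for example parse_domtblout(open(path)), where path is whatever file you gave to --domtblout.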
Practical interpretation tips
- Start with E-values but check bit scores
  - Use an E-value threshold (e.g., 1e-3) to filter candidates, then use bit scores and alignment quality to rank and confirm.
- Prefer domain-level E-values for multi-domain proteins
  - A sequence may have a significant full-sequence E-value due to one strong domain; verify each reported domain with its domain E-values.
- Watch for composition bias
  - If bias is high, examine the alignment; low-complexity regions (repeats, poly-A/G stretches) can inflate scores.
- Check coverage and conserved motifs
  - Confirm that key conserved residues, motifs, or catalytic residues are present and aligned properly within the domain coordinates.
- Use bit score differences for model-specific ranking
  - When comparing multiple sequence hits to the same HMM, rank by bit score rather than E-value for consistency across database sizes.
- Consider database size and search type
  - E-values scale with database size, so use more stringent cutoffs for very large databases. Per-domain and per-sequence E-values differ; choose based on your aim.
- Manual inspection of borderline hits
  - For hits near the threshold, manually inspect alignments, domain boundaries, and biological plausibility (species, domain architecture).
- Combine HMMER with other evidence
  - Use complementary tools (BLAST, InterPro, structural prediction, phylogenetics) when function inference is critical.
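To make the database-size tip concrete: HMMER computes an E-value as the per-comparison P-value multiplied by the number of targets searched, so the same hit gets a proportionally larger E-value in a larger database. A minimal sketch of that scaling follows; the function name rescale_evalue and the database sizes are hypothetical.

```python
def rescale_evalue(evalue: float, z_old: int, z_new: int) -> float:
    """Convert an E-value computed against z_old targets to z_new targets.

    Relies on E-value = P-value * (number of targets), so the E-value
    scales linearly with the number of targets searched.
    """
    if z_old <= 0 or z_new <= 0:
        raise ValueError("database sizes must be positive")
    return evalue * (z_new / z_old)


# A hit with E = 1e-3 against 10,000 sequences, re-expressed for a
# database of 1,000,000 sequences (100 times larger).
print(rescale_evalue(1e-3, 10_000, 1_000_000))
```

A hit that looks comfortable at 1e-3 in the small search comes out at about 0.1 in the larger one, which is why cutoffs need to tighten as the database grows. Bit scores are unaffected by this scaling.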
Example 1 — Single-domain protein (hmmscan output)
- Observed: model X hits sequence A with full E-value 2e-20, full bit 150, domain i-Evalue 1e-21, bias 0.0, domain covers residues 45–230 and includes the catalytic Lys at alignment position 110 with high posterior probability.
- Interpretation: strong, confident match. Low E-values and high bits indicate true homology; presence of conserved catalytic residue and good coverage confirm functional annotation.
Example 2 — Multi-domain protein with one strong and one weak hit
- Observed: sequence B has two domain hits to model Y: domain1 i-Evalue 1e-50 (bit 300), domain2 i-Evalue 0.05 (bit 12), domain2 bias 8.0.
- Interpretation: domain1 is a clear match. Domain2 is borderline, and its bias (8.0) is comparable to its bit score (12), so it is likely a false positive driven by low-complexity sequence. Manual inspection required; consider trimming low-complexity regions and re-running.
Example 3 — Short, borderline match (hmmsearch)
- Observed: short sequence C returns a full E-value 0.2, full bit 8, high posterior uncertainty across alignment.
- Interpretation: likely spurious. Short sequences produce unreliable scores; require corroborating evidence (conserved motif, synteny, experimental data) before accepting.
Quick checklist before accepting a hit
- E-value below your project cutoff (e.g., 1e-3 for high confidence).
- Bit score substantially above background for that model.
- Low bias value.
- Good coverage of the domain and presence of conserved residues/motifs.
- Consistent domain architecture with known homologs.
- Manual alignment inspection for borderline cases.
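The numeric items of this checklist can be automated as a first pass. The sketch below is one possible way to do it, not a standard procedure: the Hit type, both function names, and the 0.5 bias-to-score ratio are assumptions chosen for illustration, and the coverage, motif, and architecture checks still need manual review. The two sample hits reuse the numbers from Example 1 and from domain2 of Example 2.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hit:
    name: str
    evalue: float  # E-value (per-sequence or per-domain, whichever you filter on)
    score: float   # bit score
    bias: float    # bias correction, in bits


def passes_numeric_checks(hit: Hit, max_evalue: float = 1e-3,
                          max_bias_fraction: float = 0.5) -> bool:
    """Apply the E-value and bias items of the checklist.

    max_bias_fraction is an assumed rule of thumb: reject hits whose bias
    exceeds this fraction of the bit score.
    """
    if hit.evalue > max_evalue:
        return False
    if hit.score <= 0:
        return False
    return hit.bias <= max_bias_fraction * hit.score


def triage(hits: List[Hit]) -> List[Hit]:
    """Keep hits that pass the numeric checks, ranked by bit score (best first)."""
    kept = [h for h in hits if passes_numeric_checks(h)]
    return sorted(kept, key=lambda h: h.score, reverse=True)


hits = [
    Hit("example1_domain", evalue=1e-21, score=150.0, bias=0.0),
    Hit("example2_domain2", evalue=0.05, score=12.0, bias=8.0),
]
for h in triage(hits):
    print(h.name, h.score)
```

Hits rejected here are not necessarily wrong; they are the ones to route to manual inspection rather than discard outright.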
Commands and output options to help interpretation
- Use --domtblout for easy parsing of per-domain results (tabular).
- Use --tblout for sequence-level summaries.
- Add --noali to skip verbose alignments when only tabular results are needed.
- Use --cut_ga, --cut_tc, or --cut_nc with models that carry curated thresholds (such as Pfam) to apply the gathering, trusted, or noise cutoffs stored in the model.
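As a rough illustration of how these options combine, the sketch below assembles an hmmscan command line in Python. The helper build_hmmscan_command and the file names are hypothetical; the argument order (options, then the profile database, then the sequence file) follows hmmscan's usage, and the database is assumed to have been prepared with hmmpress beforehand.

```python
import shlex
from typing import List


def build_hmmscan_command(hmm_db: str, seq_file: str, domtbl_path: str,
                          use_gathering_cutoff: bool = True,
                          skip_alignments: bool = True) -> List[str]:
    """Return an argument list for hmmscan with per-domain tabular output."""
    cmd = ["hmmscan", "--domtblout", domtbl_path]
    if skip_alignments:
        cmd.append("--noali")   # omit alignment blocks from the main output
    if use_gathering_cutoff:
        cmd.append("--cut_ga")  # use the gathering thresholds stored in each model
    cmd += [hmm_db, seq_file]
    return cmd


# Hypothetical file names, for illustration only.
cmd = build_hmmscan_command("Pfam-A.hmm", "proteins.fasta", "hits.domtblout")
print(shlex.join(cmd))

# To run the search (requires HMMER on PATH):
# import subprocess
# subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell-quoting problems when file paths contain spaces.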
Summary
Interpretation of HMMER output combines automated thresholds (E-values, bit scores) with biological judgment (domain coverage, conserved residues, composition bias). Favor domain-level metrics for multi-domain proteins, beware of biased sequences, and manually inspect borderline hits or use complementary evidence to confirm functional assignments.