Read Filtering & Downsampling¶
This guide documents which reads Lancet2 includes in analysis, how per-sample downsampling works, and how the --extract-pairs flag modifies read collection behavior.
Read Exclusion Criteria¶
The following reads are silently excluded during both active region detection and read collection:
| Flag | SAM Bit | Effect |
|---|---|---|
| QC-failed | 0x200 | Excluded — sequencer-flagged quality failures |
| Duplicate | 0x400 | Excluded — optical/PCR duplicate reads |
| Unmapped | 0x4 | Excluded — reads with no alignment |
| MAPQ = 0 | — | Excluded — multi-mapped reads with ambiguous placement |
Additionally, during the Profile pass (Pass 1), reads with MAPQ < 20 are counted but do not contribute to the downsampling candidate pool.
Two-Pass Read Collection¶
Read collection uses a memory-efficient two-pass strategy per sample:
Pass 1: Profile & Downsample Math¶
Iterates the region using zero-copy alignment proxies — no sequence or quality data is deep-copied. This pass:
- Counts total passing reads and bases.
- Collects unique qname hashes for all passing reads.
- Computes a coverage-based downsampling threshold from
--max-sample-cov. - If
--extract-pairsis enabled, tracks out-of-region mate locations.
The downsampling threshold is computed as:
If the number of passing reads exceeds max_reads, the qname hash list is shuffled with a deterministic seed and truncated. Both mates of a pair are symmetrically accepted or rejected — if one mate is in the keep set, its partner is always included.
Deterministic Downsampling
The downsampling shuffle uses a fixed seed (std::default_random_engine(0)) to guarantee identical variant calls between runs on the same input. This is a deliberate design choice for reproducibility in clinical and production pipelines.
Pass 2: Deep Copy & Object Construction¶
Re-iterates the region. Only reads whose qname hash is in the keep set trigger expensive operations — BuildSequence() and BuildQualities() via the Read constructor. Reads not in the keep set are skipped entirely.
Pass 3: Mate Recapture¶
If --extract-pairs is enabled, mates that fall outside the current window region are retrieved via targeted single-position queries using htslib. Mates whose qnames were not kept during downsampling are also excluded.
--extract-pairs¶
When this flag is enabled, the read collector changes behavior:
-
SA tag extraction: The supplementary alignment tag (
SA) is additionally requested from the BAM alongsideASandXS. Without this flag, onlyASandXSare extracted. -
Mate tracking: For each read that is not in a proper pair (
FLAG & 0x2 == 0) or has a supplementary alignment (SAtag present), the mate's location is recorded for out-of-region retrieval in Pass 3. Reads where both mates are already in-region are excluded from mate retrieval since both copies are already captured. -
Runtime impact: Mate retrieval issues additional targeted
htslibqueries (one per out-of-region mate location), which can slow down read collection significantly — particularly in regions with many discordant pairs or structural variant breakpoints.
When to enable: Use --extract-pairs when you expect structural variant breakpoints at window edges where one mate maps inside and the other maps far away. For standard SNV/indel calling, this flag is not needed.
- CLI reference:
--extract-pairs
--max-sample-cov¶
Controls the maximum per-sample read coverage retained per window.
- Default: 1000×
- Effect: Windows with >1000× per-sample coverage are downsampled to ~1000× using the deterministic paired downsampling described above.
-
Trade-off: Lower values = faster runtime and lower memory, but reduced sensitivity for subclonal variants. Higher values = more reads kept for assembly, but diminishing returns above ~500× for most variant types.
-
CLI reference:
--max-sample-cov