Merging High-Throughput, Amplicon-Based Second and Third Generation Sequencing Data: An Integrative and Modular Data Analysis Framework for Haplotype Prediction and Output Evaluation.

Authors

Sylvia Mink, Christian Attenberger, Yannik Busch, Johanna Kiefer, Wolfgang Peter, Janne Cadamuro, Tim A Steiert, Andre Franke, Christoph Gassner

Year of publication

2025

Journal

INT J MOL SCI

Volume

26

Issue

7

ISSN

1422-0067

Impact factor

4.9

Abstract

Despite providing highly accurate results, the short reads generated by second-generation sequencing have major limitations in mapping complex genomic regions. Longer reads can resolve these issues and additionally phase distant variants. The third-generation sequencing platform from Oxford Nanopore Technologies (ONT) currently achieves the longest sequencing reads but falls short in sequencing accuracy. Additionally, deriving phased haplotypes from amplicon-based NGS data remains a complex and time-consuming task that requires extensive bioinformatic expertise. We constructed an integrative, open-access, modular data-analysis framework that allows for automated processing of high-throughput sequencing data from both second-generation (Illumina) and third-generation (ONT) sequencing platforms, combining the strengths of both technologies. Variant information is automatically evaluated and color-coded for discrepancies. Haplotypes are listed by frequency. All parts of the framework can be used independently. The framework’s performance was validated using synthetic data and tested with real-life data by analyzing partly homologous FUT1/2/3 sequencing data from 400 blood donors.
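
The abstract does not specify how discrepancies are evaluated or how haplotypes are ranked, so the following is only a minimal sketch of the general idea, not the framework's actual code. It assumes variant calls from each platform are available as a mapping from position to called allele and that haplotypes are represented as strings; the function names `compare_calls` and `rank_haplotypes`, the status labels, and all sample values are hypothetical.

```python
from collections import Counter


def compare_calls(illumina, ont):
    """Compare per-position allele calls from two platforms.

    Both arguments map a position to the called allele. This layout is
    an assumption made for illustration, not the framework's format.
    """
    report = {}
    for pos in sorted(set(illumina) | set(ont)):
        a, b = illumina.get(pos), ont.get(pos)
        if a == b:
            status = "concordant"        # both platforms agree
        elif a is None or b is None:
            status = "single-platform"   # called by only one platform
        else:
            status = "discordant"        # platforms disagree on the allele
        report[pos] = (a, b, status)
    return report


def rank_haplotypes(haplotypes):
    """Return (haplotype, count) pairs, most frequent first."""
    return Counter(haplotypes).most_common()


# Hypothetical sample data, not taken from the study.
illumina_calls = {101: "A", 250: "T", 377: "G"}
ont_calls = {101: "A", 250: "C", 412: "G"}

report = compare_calls(illumina_calls, ont_calls)
ranked = rank_haplotypes(["A-T", "A-C", "A-T", "G-T", "A-T", "A-C"])

for pos, (a, b, status) in report.items():
    print(pos, a, b, status)
for haplotype, count in ranked:
    print(haplotype, count)
```

In a report, each status label could be mapped to a display color, which is one plausible reading of "color-coded for discrepancies".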