Integrating the ENCODE blocklist* for machine learning quality control of ChIP-seq with seqQscorer


Abstract

Motivation

Quality assessment of next-generation sequencing (NGS) data is a complex but important task for ensuring correct conclusions from experiments in molecular biology, biomedicine, and biotechnology. Although multiple software tools support this process, it is still mainly driven by domain experts who manually curate quality reports. This time-consuming effort is expected to grow as advanced technologies make it possible to generate more sequencing samples at lower cost. By leveraging quality-labelled data from large databases together with automatically created quality-related features derived from raw sequencing samples, supervised machine learning shows great potential for supporting quality assessment, as demonstrated by the seqQscorer tool. This tool shows outstanding results for automated quality assessment and has been used to investigate several quality-related aspects across large-scale sequencing datasets. To improve seqQscorer in terms of accuracy and processing time, we explored the potential of creating features that are more informative and faster to generate than those conventionally used by seqQscorer's models. For this purpose, we used the ENCODE blocklist*, a set of problematic genomic regions known to be related to the quality of NGS data.
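The abstract does not specify how the blocklist features are computed. As a purely illustrative sketch, one plausible feature set is the fraction of mapped reads that overlap each blocklist region. The function names (`build_index`, `blocklist_features`), the tuple layouts, and the toy coordinates below are assumptions made for illustration and are not taken from seqQscorer or from the actual ENCODE blocklist. Intervals are treated as half-open, following the BED convention.

```python
def build_index(regions):
    """Group (chrom, start, end, name) regions by chromosome, sorted by start."""
    index = {}
    for chrom, start, end, name in regions:
        index.setdefault(chrom, []).append((start, end, name))
    for chrom_regions in index.values():
        chrom_regions.sort()
    return index


def blocklist_features(reads, regions):
    """Return per-region read fractions plus the fraction hitting any region.

    reads   : iterable of (chrom, start, end), half-open coordinates
    regions : list of (chrom, start, end, name), half-open coordinates
    This is a hypothetical feature definition, not seqQscorer's own.
    """
    index = build_index(regions)
    counts = {name: 0 for _, _, _, name in regions}
    total = 0
    hits_any = 0
    for chrom, r_start, r_end in reads:
        total += 1
        overlapped = False
        for start, end, name in index.get(chrom, []):
            if start >= r_end:
                break  # regions are sorted by start, so none further can overlap
            if end > r_start:
                counts[name] += 1
                overlapped = True
        if overlapped:
            hits_any += 1
    if total == 0:
        # No reads: report zeros rather than dividing by zero.
        features = {name: 0.0 for name in counts}
        features["any_blocklist"] = 0.0
        return features
    features = {name: count / total for name, count in counts.items()}
    features["any_blocklist"] = hits_any / total
    return features


# Toy example with made-up coordinates (not real blocklist regions).
toy_regions = [
    ("chr1", 100, 200, "r1"),
    ("chr1", 500, 600, "r2"),
    ("chr2", 0, 50, "r3"),
]
toy_reads = [
    ("chr1", 150, 186),
    ("chr1", 190, 226),
    ("chr1", 200, 236),
    ("chr1", 580, 616),
    ("chr3", 10, 46),
]
toy_features = blocklist_features(toy_reads, toy_regions)
```

A feature vector of this kind has a fixed length (one entry per blocklist region), so it could in principle be passed to any standard supervised classifier alongside quality labels.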

Results

Our results show that features based on the ENCODE blocklist improve quality assessment for ChIP-seq samples derived from human tissues and cell lines by up to 4.4% when using a generic dataset containing both single-end and paired-end sequencing samples. For other assays, the new blocklist features generally still allow highly accurate quality assessment, but they are less accurate than the conventional seqQscorer features.

Outlook

These results strongly encourage an extension of the seqQscorer tool that automatically integrates the new blocklist features. Users investigating human protein-DNA interactions would benefit from this extension, as it not only improves model performance during quality assessment but also enhances usability by simplifying the installation procedure and reducing the computational resources required for feature generation.
