Conservative taxonomy and quality assessment of giant virus genomes with GVClass (2025)

Introduction

Giant viruses (GVs) of the viral phylum Nucleocytoviricota infect a wide range of eukaryotic hosts and exhibit a broad spectrum of morphologies, capsid sizes, and genome lengths1. They are currently classified into two classes: Megaviricetes, which includes the orders Imitervirales, Pandoravirales, Pimascovirales and Algavirales, and Pokkesviricetes, which includes the orders Asfuvirales and Chitovirales2. Two more classes, Proculviricetes and Mriyaviricetes, have been proposed recently3,4. Their double-stranded DNA genomes range from under 100 kb to over 2.7 Mb and encode hundreds to thousands of genes, often acquired through dynamic gene exchanges with various cellular and viral lineages1,5,6,7. Among these are genes associated with diverse microbial metabolic pathways, including photosynthesis, the tricarboxylic acid (TCA) cycle and glycolysis, which suggests that GVs may influence their hosts’ metabolic capabilities8. With the increased availability of metagenomic sequencing data from diverse ecosystems, it has become possible to study the diversity and distribution of giant virus genomes, greatly expanding our understanding of their genetic diversity, function, and role in shaping the ecosystems they inhabit.

Although their complex genomes pose challenges to taxonomic classification, GVs encode several giant virus orthologous groups (GVOGs)2, or nucleocytoplasmic large DNA virus orthologous groups (NCVOGs)9. NCVOGs have been used in earlier studies for quality estimates of giant virus genomes10. A subset of these GVOGs serves as phylogenetic markers and is integral to computational pipelines and tools for detecting and identifying GVs in metagenome-assembled sequences (contigs, bins), e.g. VirSorter211, geNomad12, ViralRecall13 or TIGTOG14. Other tools are based on aligning metagenomic reads to giant virus open reading frames (ORFs)15,16. Recently, a set of 7 mainly vertically inherited GVOGs was used to establish a taxonomic framework for studying Nucleocytoviricota diversity2. The largest proportion of this framework is based on giant virus metagenome-assembled genomes (GVMAGs)8,17.

GVMAGs are recovered from de novo assemblies, and metagenomic binning is prone to errors that in some cases lead to fragmented or mixed-population genomes with potential cellular contamination18. Further, GVMAGs often contain only a few conserved genes, making classification more challenging.

Here we present GVClass, a bioinformatic pipeline that assigns taxonomy to putative giant virus genomes down to species-level (Fig. 1). GVClass employs a conservative approach, relying on the consensus of single protein phylogenetic trees inferred from identified GVOGs in GV sequences after optimized gene calling. It also incorporates specific hallmark genes from the recently proposed Mriyaviricetes and Mirusviricota3,4,19, as well as cellular single-copy panorthologs - sets of genes that are orthologous across multiple species or lineages within a taxonomic group. Subsequently, GVClass estimates genome completeness and contamination of identified giant viruses and summarizes all results. Benchmarking in silico modified giant virus genomes showed the robustness of GVClass across different levels of completeness and taxonomic novelty. Taken together, GVClass is a fast, reliable and user-friendly tool to detect, classify and assess the quality of giant virus genomes, including those recovered from environmental sequence data, as shown in Pitot et al.20.

Fig. 1: Schematic of the optimized gene calling (opgecall), taxonomic affiliation and quality assessment of giant virus sequences identified by GVClass.

Results and discussion

Taxonomic classification with GVClass

To benchmark the GVClass workflow (Fig. 1), we excluded self-hits and limited taxonomic classification to the Class, Order, Family and Genus ranks. We used a comprehensive set of published giant virus genomes (n = 1109), which were fragmented and randomly reduced (in triplicate) to 75%, 50%, and 25% of their original length (Fig. 2A).

Fig. 2: A Completeness percentage of all 1109 benchmarked GV genomes before and after reduction to 75%, 50%, and 25% of their original length. B Relative distribution of GVClass “strict” and “majority” taxonomic predictions for GV genomes after reduction. C Calculated precision and recall of GVClass. Recall measures the proportion of actual positives correctly identified by GVClass, while precision measures the proportion of positive predictions that are actually correct. The closer to 1, the better.

Using “majority” taxonomic assignments, GVClass performed well across all ranks: true positives (TPs) accounted for 94.96%, false negatives (FNs) for 4.69% and false positives (FPs) for 0.35% of overall results, regardless of genome completeness (Fig. 2B).

The highest proportion of TPs was observed at the Class level, with a slight decrease at lower taxonomic ranks (Class 96.52% > Order 96.43% > Family 95.43% > Genus 91.46%). Across taxonomic ranks, genomes reduced to 75% completeness showed the highest proportion of TPs (98.93%, with 0.06% FPs); this decreased to 97.79% TPs and 0.23% FPs at 50% completeness, and to 88.17% TPs and 0.76% FPs at 25% completeness. On average, regardless of completeness and taxonomic rank, GVClass achieved 97% precision (instances predicted as positive that were actually positive) and 91% recall (positive instances successfully identified) (Fig. 2C). These high values demonstrate GVClass’s reliability in correctly identifying giant virus sequences while minimizing false positives and false negatives.
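For readers less familiar with these metrics, the following minimal Python sketch shows how precision and recall follow from TP, FP and FN counts. The counts are hypothetical and chosen only for illustration; they are not taken from the benchmark, and the averaged values reported above are not reproduced by this toy example.

```python
def precision(tp: float, fp: float) -> float:
    """Fraction of positive predictions that are correct: TP / (TP + FP)."""
    return tp / (tp + fp)


def recall(tp: float, fn: float) -> float:
    """Fraction of actual positives that are recovered: TP / (TP + FN)."""
    return tp / (tp + fn)


# Hypothetical counts for one rank at one completeness level (not from the paper).
tp, fp, fn = 90, 2, 8
print(f"precision = {precision(tp, fp):.3f}")
print(f"recall    = {recall(tp, fn):.3f}")
```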

Closer analysis revealed that GVClass showed the highest precision and recall at high taxonomic ranks, with slight decreases at lower ranks (Fig. 2C). As expected, stronger variations in precision and recall were observed with reduced genome completeness (Fig. 2C). Correct identification was most challenging at the Genus level for highly incomplete genomes (25% completeness), with precision at 86% and recall at 76%. This result underlines the importance of marker gene presence, which is typically directly linked to sufficient sequencing depth and assembly quality.

Given that a large proportion of our benchmark data consists of GVMAGs, we also examined the performance of GVClass on genomes of giant virus isolates. From a set of 121 giant virus isolate genomes affiliated with the orders Algavirales, Asfuvirales, Chitovirales, Imitervirales, and Pimascovirales as well as the proposed Pandoravirales (Fig. 3A), 97% were correctly predicted down to the Genus level. The exceptions were Clandestinovirus, Ectocarpus siliculosus virus 1 and Feldmannia species virus (Supplementary Table 1). However, each of these 3 isolates is the only representative of its genus, making it impossible for GVClass to predict the genus correctly, as self-hits were excluded from the analysis.

Fig. 3: A Treemap plot of the taxonomic classification of the 121 isolate genomes by GVClass. B Genome quality estimation for giant virus genomes. Density curve showing the distribution of duplication factor and completeness estimates by GVClass for the 121 Nucleocytoviricota isolates. Recommended thresholds for completeness are <30% as low, 30–70% as medium, and >70% as high completeness. Contamination is low for a duplication factor of <1.5, medium for a duplication factor of 1.5–2 and high for a duplication factor of >2.

Completeness and contamination estimation with GVClass

Next, we tested the completeness estimation feature. We first ran GVClass on the original genomes to generate a baseline before screening the reduced genomes. The mean completeness of non-reduced genomes was estimated at 85.5% (Fig. 2A). Reduced genomes were estimated to reach completeness of 61%, 42%, and 22%, respectively (Fig. 2A), closely matching the 75%, 50% and 25% reductions after applying a 0.25 correction factor (based on the mean completeness of non-reduced genomes). For our set of isolates, the average completeness was estimated at 82.5% (Fig. 3B).

Based on our results, we consider a completeness value below 30% as low, 30–70% as medium, and above 70% as high completeness when using GVClass. Our benchmarking illustrates the efficiency of GVClass in classifying giant virus query sequences with very high precision across taxonomic ranks and in estimating genome completeness.

In the second phase, we analyzed the order duplication factor of the tested isolates. The mean duplication factor of isolate genomes was 1.2 (Fig. 3B). None of the isolate giant virus genomes exceeded a duplication factor of 2; higher values therefore potentially indicate mixed viral populations. Based on our results, an order duplication factor below 1.5 suggests a low chance of representing a mixed bin (high quality). An order duplication factor between 1.5 and 2 suggests a medium chance of a mixed bin (medium quality). Lastly, an order duplication factor above 2 suggests a high chance of a mixed bin (low quality).
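The recommended thresholds can be applied to GVClass output with a few lines of code. The sketch below is illustrative only: the function names are hypothetical, the cut-offs follow the values given in the Fig. 3 caption (completeness <30%, 30–70%, >70%; duplication factor <1.5, 1.5–2, >2), and assigning the exact boundary values to the medium tier is our assumption.

```python
def completeness_tier(completeness_pct: float) -> str:
    """Map a completeness estimate (percent) to low / medium / high."""
    if completeness_pct < 30:
        return "low"
    if completeness_pct <= 70:  # boundary handling is an assumption
        return "medium"
    return "high"


def contamination_tier(duplication_factor: float) -> str:
    """Map an order-level duplication factor to low / medium / high
    chance of a mixed bin."""
    if duplication_factor < 1.5:
        return "low"
    if duplication_factor <= 2:  # boundary handling is an assumption
        return "medium"
    return "high"


# Mean values reported for the isolate set: 82.5% completeness, duplication factor 1.2.
print(completeness_tier(82.5), contamination_tier(1.2))
```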

Novelty detection capability of GVClass

We tested GVClass’s capability to detect novelty using query sequences from 25 Nucleocytoviricota taxonomic groups (10 genera, 10 families, and 5 orders). To conduct this analysis, we created 25 versions of the database by removing all genomes associated with each selected group, one at a time. By running GVClass on the members of the omitted taxa, we could assess how the predictions perform on genomes that are novel at different taxonomic levels (Fig. 4A).
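The leave-one-group-out design can be sketched as follows. This is a minimal illustration under stated assumptions, not the benchmarking code used in this study: the reference set is represented as a hypothetical dictionary mapping genome identifiers to lineages, and all identifiers and taxon names are placeholders.

```python
from typing import Dict, List, Tuple

Lineage = Dict[str, str]  # rank name -> taxon name


def leave_one_taxon_out(reference: Dict[str, Lineage], rank: str,
                        taxon: str) -> Tuple[Dict[str, Lineage], List[str]]:
    """Return (reduced reference set, held-out query genomes) for one omitted taxon."""
    kept = {g: lin for g, lin in reference.items() if lin.get(rank) != taxon}
    held_out = [g for g, lin in reference.items() if lin.get(rank) == taxon]
    return kept, held_out


# Placeholder reference set (hypothetical genome IDs and taxon names).
reference = {
    "genome_1": {"order": "OrderA", "family": "FamilyA", "genus": "GenusA"},
    "genome_2": {"order": "OrderA", "family": "FamilyA", "genus": "GenusB"},
    "genome_3": {"order": "OrderB", "family": "FamilyC", "genus": "GenusD"},
}

reduced_db, queries = leave_one_taxon_out(reference, "genus", "GenusA")
print(sorted(reduced_db), queries)
```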

Fig. 4: A Schematic of the reference dataset reduction for novelty detection. 25 versions of the database were created by removing all GV genomes associated with each selected group, one at a time (10 genera, 10 families, and 5 orders). B Relative distribution of GVClass “strict” and “majority” taxonomic predictions of omitted Nucleocytoviricota taxa. C Boxplots summarizing phylogenetic distances to nearest neighbors (measured as the mean length of branches connecting the query to its nearest neighbor across the GVOG single protein trees) for TPs for each group that was left out. Distribution of data points suggests values for mean evolutionary distance (across all GVOG trees) of 0.60–1.27 for genus-level novelty, 1.27–2.74 for family-level novelty and >2.74 for order-level novelty.

For instance, if a genus is removed from the database, members of that genus cannot be assigned to the correct genus, as it is no longer part of the reference data. This absence should result in a signal of novelty. Ideally, novelty should be detected as FN, as queries cannot be assigned correctly due to missing data in the database. However, prediction at higher taxonomic ranks should result in higher TP rates (with the exception of families that contain only a single genus). Indeed, this is reflected in our results, with a TP rate of over 92% at the family level and 100% at the order level or above (Fig. 4B).

Our results showed that the highest proportions of FN predictions were found at the same or higher taxonomic level than that of the omitted taxon (Fig. 4B). The main exception was for omitted genera, where classification at the genus level frequently also yielded FPs. A possible explanation is the greater similarity between viruses of closely related genera, which can lead to assignment to a related genus and thus to false positives in the benchmarking (Fig. 4B).

We also estimated novelty by calculating the average phylogenetic distance for every predicted affiliation of omitted taxa. Our analysis revealed that the mean of branch lengths to nearest neighbors of a query genome across all GVOG phylogenetic trees inferred by GVClass may allow us to estimate novelty at different taxonomic ranks. When choosing the 25th percentile as lower bound and the 75th percentile as upper bound, values of 0.6–1.27 were indicative of genus-level novelty, 1.27–2.74 of family-level novelty and >2.74 of order-level novelty (Fig. 4C).
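As an illustration, the distance ranges from Fig. 4C can be turned into a simple lookup. The function below is a hypothetical helper, not part of GVClass; how values exactly on a boundary are assigned, and the label returned for distances below 0.60 (for which no range is given here), are our assumptions.

```python
def novelty_level(mean_nn_distance: float) -> str:
    """Suggest a novelty level from the mean branch length to nearest
    neighbors across all GVOG trees, using the ranges from Fig. 4C."""
    if mean_nn_distance > 2.74:
        return "order"
    if mean_nn_distance >= 1.27:  # boundary assignment is an assumption
        return "family"
    if mean_nn_distance >= 0.60:
        return "genus"
    return "below genus-level range"


for distance in (0.3, 0.9, 1.5, 3.0):
    print(distance, novelty_level(distance))
```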

Our benchmarking demonstrated GVClass’s robust capability to detect novel viral sequences. This is crucial for identifying potentially new viral lineages in metagenomic data, which will add to the existing toolkit for exploring viral diversity and evolution.

Methods

The GVClass framework

To identify and classify GVs, query sequences can be provided as either FASTA amino acid (faa) or nucleotide (fna) format input (Fig. 1). To benefit from the entire set of functionalities of GVClass, fna is the preferred input format, with a recommended minimum assembly size of 20 kb.

For fna input, nucleotide sequences undergo optimized gene calling (opgecall) with a modified version of pyrodigal (https://github.com/tomasbruna/pyrodigal) under different genetic codes: “meta”, which employs pre-trained models for codes 1, 4, and 11, as well as de novo trained models for genetic codes 4, 6, 15, 29, 106 and 129. Codes 106 and 129 are our custom modifications of codes 6 and 29 in which only one stop codon is re-assigned: code 106 re-assigns TAA to glutamine, and code 129 re-assigns TAG to tyrosine. Following the opgecall step, the outputs are assessed using hmmsearch (http://hmmer.org/ v3.3.2) with a combined set of general HMMs and Nucleocytoviricota order-level HMMs (Fig. 1). Genetic codes are then ranked based on: 1) the highest number of complete profile hits (>60% of model coverage), 2) the average of bitscores corresponding to the best profile hits for each predicted protein. In the latter case, the coding density must also exceed prodigal meta by 5% to select a de novo trained model. The highest ranked output is selected for subsequent analyses.
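The ranking logic can be sketched as below. This is a simplified illustration under explicit assumptions and not the opgecall implementation: the `GeneCallResult` class and `select_genetic_code` function are hypothetical, the bitscore criterion is treated as a tie-breaker among codes with the same number of complete hits, and the 5% coding-density requirement is interpreted as a gain of at least 5 percentage points over “meta”.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GeneCallResult:
    code: str                  # genetic code label, e.g. "meta", "4", "106"
    complete_hits: int         # profile hits covering >60% of the model
    mean_best_bitscore: float  # mean bitscore of the best hit per protein
    coding_density: float      # percent of the sequence that is coding


def select_genetic_code(results: List[GeneCallResult],
                        min_density_gain: float = 5.0) -> GeneCallResult:
    """Pick one gene-calling run; `results` must contain a "meta" entry."""
    meta = next(r for r in results if r.code == "meta")
    most_hits = max(r.complete_hits for r in results)
    top = [r for r in results if r.complete_hits == most_hits]
    if len(top) == 1:
        return top[0]  # criterion 1 alone is decisive
    # Criterion 2 (assumed tie-breaker): de novo models must also gain
    # enough coding density over "meta" to be eligible.
    eligible = [r for r in top
                if r.code == "meta"
                or r.coding_density - meta.coding_density >= min_density_gain]
    if not eligible:
        return meta  # fallback choice is an assumption
    return max(eligible, key=lambda r: r.mean_best_bitscore)


# Hypothetical values for three gene-calling runs.
runs = [
    GeneCallResult("meta", 10, 200.0, 70.0),
    GeneCallResult("106", 10, 250.0, 90.0),
    GeneCallResult("4", 10, 260.0, 72.0),
]
print(select_genetic_code(runs).code)
```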

Hits with the general HMMs are extracted from translated proteins (faa), and each marker sequence is used as a query for a DIAMOND blastp search (v2.0.15)21 against a custom reference database. This database contains pre-extracted homologs of the general HMMs from a representative set of GVOGs2, Mirusviricota (Mirus 5), Mriyaviruses (Mrya 6), bacterial and archaeal genomes from the Genome Taxonomy Database (GTDB, release 214)22, EukProt (v3)23 and IMG/VR (v4)24. The top 100 hits for each query are then extracted and aligned with MAFFT (v7.505)25 using local pairwise alignment (-linsi) by default. Alignments are then trimmed with trimAl (v1.4.1)26 with the option -gt 0.1.
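For orientation, the search, alignment and trimming steps correspond to command lines of roughly the following shape. The sketch only assembles and prints the commands; file names are placeholders, the helper functions are hypothetical, and the exact options used inside GVClass may differ. MAFFT's L-INS-i strategy is written out here as `--localpair --maxiterate 1000`, and MAFFT writes the alignment to standard output.

```python
import shlex
from typing import List


def diamond_cmd(query_faa: str, db: str, out_tsv: str, max_hits: int = 100) -> List[str]:
    """DIAMOND blastp search keeping up to `max_hits` targets per query."""
    return ["diamond", "blastp", "--query", query_faa, "--db", db,
            "--out", out_tsv, "--outfmt", "6",
            "--max-target-seqs", str(max_hits)]


def mafft_cmd(in_faa: str) -> List[str]:
    """MAFFT L-INS-i (local pairwise) alignment; output goes to stdout."""
    return ["mafft", "--localpair", "--maxiterate", "1000", in_faa]


def trimal_cmd(aln: str, trimmed: str) -> List[str]:
    """trimAl with gap threshold 0.1, as stated in the text."""
    return ["trimal", "-in", aln, "-out", trimmed, "-gt", "0.1"]


# Placeholder file names for a single marker.
for cmd in (diamond_cmd("marker.faa", "reference.dmnd", "marker_hits.tsv"),
            mafft_cmd("marker_top100.faa"),
            trimal_cmd("marker.aln", "marker.trimmed.aln")):
    print(shlex.join(cmd))
```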

Next, single protein trees are constructed for each aligned set of proteins using IQ-TREE (v2.2.0.3)27 with the LG4X substitution model in the -fast mode (users can specify FastTree 2.1 for slightly decreased runtime per tree28). GVClass then analyzes phylogenetic trees to identify the nearest neighbors based on branch length of every query sequence, including paralogs. It employs functions to traverse the trees, find closest relatives, identify and characterize the nearest neighbors in provided reference datasets, and compiles results.

To yield a taxonomic assignment in the “strict” mode, all nearest neighbors (nn) in all phylogenetic trees must agree on the respective taxonomic level (species, genus, family, order, class, phylum, domain). In the “majority” mode (recommended), 50% of nn must agree for successful taxonomic classification. GVClass offers the option to build additional trees from a larger set of Nucleocytoviricota order-level HMMs to increase taxonomic affiliation precision (fast_mode FALSE). However, as this leads to a much larger set of trees (up to 264), the runtime of GVClass increases substantially.
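The two consensus modes can be expressed compactly. The function below is a hypothetical sketch for a single taxonomic rank, not the GVClass implementation; treating “50% of nn must agree” as “at least 50%” and resolving an exact 50:50 split in favor of the taxon encountered first are our assumptions.

```python
from collections import Counter
from typing import List, Optional


def consensus(nn_taxa: List[str], mode: str = "majority") -> Optional[str]:
    """Return the agreed taxon at one rank, or None if there is no consensus.

    `nn_taxa` holds the taxon of each nearest neighbor across all trees.
    """
    if not nn_taxa:
        return None
    taxon, count = Counter(nn_taxa).most_common(1)[0]
    if mode == "strict":
        return taxon if count == len(nn_taxa) else None
    if mode == "majority":
        return taxon if count / len(nn_taxa) >= 0.5 else None
    raise ValueError(f"unknown mode: {mode}")


orders = ["Imitervirales", "Imitervirales", "Algavirales"]
print(consensus(orders, "majority"), consensus(orders, "strict"))
```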

For sequences predicted to be affiliated with the phylum Nucleocytoviricota, the hit count of the order-level subset of the models (n = 274) is used to estimate lineage-specific genome completeness and contamination. The genome completeness estimate in GVClass is based on the proportion of mainly single-copy genes (duplication factor in published genomes <1.5) conserved in at least 50% of genomes of the respective Nucleocytoviricota order. To estimate contamination, a lineage-specific duplication factor is calculated: in brief, the total number of hits is divided by the number of unique hits, taking into account the order-level conserved genes.
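Both estimates can be illustrated with a short calculation. The sketch assumes that completeness is the percentage of the order's conserved markers found at least once and that the duplication factor is total hits divided by unique hits among those markers; the function and marker names are hypothetical, and the actual GVClass calculation may differ in detail.

```python
from typing import Dict, Set, Tuple


def completeness_and_duplication(hit_counts: Dict[str, int],
                                 conserved_markers: Set[str]) -> Tuple[float, float]:
    """Return (completeness in percent, duplication factor) for one query.

    hit_counts: marker -> number of hits in the query genome.
    conserved_markers: markers conserved in the predicted order.
    """
    found = {m: n for m, n in hit_counts.items()
             if m in conserved_markers and n > 0}
    unique_hits = len(found)
    total_hits = sum(found.values())
    completeness = 100.0 * unique_hits / len(conserved_markers)
    duplication = total_hits / unique_hits if unique_hits else 0.0
    return completeness, duplication


# Hypothetical marker names: three of four conserved markers found, one twice.
markers = {"m1", "m2", "m3", "m4"}
hits = {"m1": 1, "m2": 2, "m3": 1, "other": 5}
print(completeness_and_duplication(hits, markers))
```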

Finally, the GVClass pipeline summarizes all results in tabular output. The provided “gvclass_out_v1.0.0.tab” encompasses essential genome stats such as contig count, base pair length, GC content, gene count, coding percentage and genetic code. Taxonomic classification is detailed, extending down to the species level whenever possible, for both “strict” consensus and a “majority” consensus. Moreover, the file offers detailed insights into specific genetic elements, including the unique and total counts of Nucleocytoviricota and Mirusviricota MCP (Major capsid protein) genes, GVOG4 and GVOG8 (a set of 4 and 8 GVOGs suitable for supermatrix-based species tree calculation), phage and universal cellular housekeeping genes, along with their respective duplication factors. As there is no universal rule for giant viruses29, the provided wealth of information empowers users to discern intricate details about their query sequences. Of particular significance are the duplication factor and completeness index provided at the predicted order-level. A high duplication factor might suggest a sequence composed of multiple closely related giant viruses, while a low completeness index could indicate insufficient sequencing depth.

Understanding all GVClass metrics aids users in interpreting their data accurately and making informed decisions regarding subsequent analyses.

Data availability

GVClass is open-source software, and its code can be found at https://github.com/NeLLi-team/gvclass.

References

  1. Schulz, F., Abergel, C. & Woyke, T. Giant virus biology and diversity in the era of genome-resolved metagenomics. Nat. Rev. Microbiol. 20, 721–736 (2022).

  2. Aylward, F. O., Moniruzzaman, M., Ha, A. D. & Koonin, E. V. A phylogenomic framework for charting the diversity and evolution of giant viruses. PLoS Biol. 19, e3001430 (2021).

  3. Yutin, N., Mutz, P., Krupovic, M. & Koonin, E. V. Mriyaviruses: small relatives of giant viruses. mBio 15, e01035-24 (2024).

  4. Gaïa, M. et al. Mirusviruses link herpesviruses to giant viruses. Nature 616, 783–789 (2023).

  5. Filée, J. & Chandler, M. Gene exchange and the origin of giant viruses. Intervirology 53, https://doi.org/10.1159/000312920 (2010).

  6. La Scola, B. et al. A giant virus in amoebae. Science 299, 2033 (2003).

  7. Schulz, F. et al. Giant viruses with an expanded complement of translation system components. Science 356, 82–85 (2017).

  8. Moniruzzaman, M., Martinez-Gutierrez, C. A., Weinheimer, A. R. & Aylward, F. O. Dynamic genome evolution and complex virocell metabolism of globally-distributed giant viruses. Nat. Commun. 11, 1–12 (2020).

  9. Yutin, N., Wolf, Y. I., Raoult, D. & Koonin, E. V. Eukaryotic large nucleo-cytoplasmic DNA viruses: Clusters of orthologous genes and reconstruction of viral genome evolution. Virol. J. 6, 1–13 (2009).

  10. Schulz, F. et al. Hidden diversity of soil giant viruses. Nat. Commun. 9, 4881 (2018).

  11. Guo, J. et al. VirSorter2: a multi-classifier, expert-guided approach to detect diverse DNA and RNA viruses. Microbiome 9, 37 (2021).

  12. Camargo, A. P. et al. Identification of mobile genetic elements with geNomad. Nat. Biotechnol. https://doi.org/10.1038/s41587-023-01953-y (2023).

  13. Aylward, F. O. & Moniruzzaman, M. Viralrecall—a flexible command-line tool for the detection of giant virus signatures in ‘omic data. Viruses 13, 150 (2021).

  14. Ha, A. D. & Aylward, F. O. Automated classification of giant virus genomes using a random forest model built on trademark protein families. npj Viruses 2, 9 (2024).

  15. Verneau, J., Levasseur, A., Raoult, D., La Scola, B. & Colson, P. MG-digger: An automated pipeline to search for giant virus-related sequences in metagenomes. Front. Microbiol. 7, 428 (2016).

  16. Kerepesi, C. & Grolmusz, V. The “Giant Virus Finder” discovers an abundance of giant viruses in the Antarctic dry valleys. Arch. Virol. 162, 1671–1676 (2017).

  17. Schulz, F. et al. Giant virus diversity and host interactions through global metagenomics. Nature 578, 432–436 (2020).

  18. Schulz, F. et al. Advantages and Limits of Metagenomic Assembly and Binning of a Giant Virus. mSystems 5, e00048-20 (2020).

  19. Zhao, H., Meng, L., Hikida, H. & Ogata, H. Eukaryotic genomic data uncover an extensive host range of mirusviruses. Curr. Biol. 34, 2633–2643.e3 (2024).

  20. Pitot, T. M. et al. Distinct and rich assemblages of giant viruses in Arctic and Antarctic lakes. ISME Commun. https://doi.org/10.1093/ismeco/ycae048 (2024).

  21. Buchfink, B., Reuter, K. & Drost, H. G. Sensitive protein alignments at tree-of-life scale using DIAMOND. Nat. Methods 18, 366–368 (2021).

  22. Parks, D. H. et al. GTDB: An ongoing census of bacterial and archaeal diversity through a phylogenetically consistent, rank normalized and complete genome-based taxonomy. Nucleic Acids Res. 50, D785–D794 (2022).

  23. Richter, D. J. et al. EukProt: A database of genome-scale predicted proteins across the diversity of eukaryotes. Peer Community J. 2, e56 (2022).

  24. Camargo, A. P. et al. IMG/VR v4: an expanded database of uncultivated virus genomes within a framework of extensive functional, taxonomic, and ecological metadata. Nucleic Acids Res. 51, D733–D743 (2023).

  25. Katoh, K. et al. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 30, 3059–3066 (2002).

  26. Capella-Gutiérrez, S., Silla-Martínez, J. M. & Gabaldón, T. trimAl: A tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics 25, 1972–1973 (2009).

  27. Nguyen, L. T., Schmidt, H. A., Von Haeseler, A. & Minh, B. Q. IQ-TREE: A fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol. Biol. Evol. 32, 268–274 (2015).

  28. Price, M. N., Dehal, P. S. & Arkin, A. P. FastTree 2 - Approximately maximum-likelihood trees for large alignments. PLoS One 5, e9490 (2010).

  29. Claverie, J. M. & Abergel, C. Giant viruses: The difficult breaking of multiple epistemological barriers. Stud. Hist. Philos. Biol. Biomed. Sci. 59, 89–99 (2016).

Acknowledgements

The work conducted by the U.S. Department of Energy Joint Genome Institute (https://ror.org/04xm1d337), a DOE Office of Science User Facility, is supported by the Office of Science of the U.S. Department of Energy operated under Contract No. DE-AC02-05CH11231.

Author information

Authors and Affiliations

  1. Department of Biochemistry, Microbiology and Bioinformatics, Université Laval, 2325 rue de l’Université, Québec, QC, G1V0A6, Canada

    Thomas M. Pitot

  2. DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA, 94720, USA

    Thomas M. Pitot, Tomáš Brůna & Frederik Schulz

Authors

  1. Thomas M. Pitot

  2. Tomáš Brůna

  3. Frederik Schulz

Contributions

F.S. and T.B. are the primary contributors to the GVClass pipeline. T.M.P. wrote the main manuscript text. T.M.P. and F.S. prepared and designed the figures. All authors reviewed and approved the final manuscript.

Corresponding author

Correspondence to Frederik Schulz.

Ethics declarations

Competing interests

The authors declare no competing interests.

Ethical approval

Not applicable.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Pitot, T.M., Brůna, T. & Schulz, F. Conservative taxonomy and quality assessment of giant virus genomes with GVClass. npj Viruses 2, 60 (2024). https://doi.org/10.1038/s44298-024-00069-7
