Only a small fraction of genes deposited to databases have been experimentally characterised. The majority of proteins have their function assigned automatically, which can result in erroneous annotations. The reliability of current annotations in public databases is largely unknown; experimental attempts to validate the accuracy within individual enzyme classes are lacking. In this study we performed an overview of functional annotations to the BRENDA enzyme database. We first applied a high-throughput experimental platform to verify functional annotations to an enzyme class of S-2-hydroxyacid oxidases (EC 1.1.3.15). We chose 122 representative sequences of the class and screened them for their predicted function. Based on the experimental results, predicted domain architecture and similarity to previously characterised S-2-hydroxyacid oxidases, we inferred that at least 78% of sequences in the enzyme class are misannotated. We experimentally confirmed four alternative activities among the misannotated sequences and showed that misannotation in the enzyme class increased over time. Finally, we performed a computational analysis of annotations to all enzyme classes in the BRENDA database, and showed that nearly 18% of all sequences are annotated to an enzyme class while sharing no similarity or domain architecture to experimentally characterised representatives. We showed that even well-studied enzyme classes of industrial relevance are affected by the problem of functional misannotation.
Correct annotation of genomes is crucial for our understanding and utilization of functional gene diversity, yet the reliability of current protein annotations in public databases is largely unknown. In our work we validated annotations to an S-2-hydroxyacid oxidase enzyme class (EC 1.1.3.15) by assessing activity of 122 representative sequences in a high-throughput screening experiment. From this dataset we inferred that at least 78% of the sequences in the enzyme class are misannotated, and confirmed four alternative activities among the misannotated sequences. We showed that the misannotation is widespread throughout enzyme classes, affecting even well-studied classes of industrial relevance. Overall, our study highlights the value of experimental and computational validation of predicted functions within individual enzyme classes.
Citation: Rembeza E, Engqvist MKM (2021) Experimental and computational investigation of enzyme functional annotations uncovers misannotation in the EC 1.1.3.15 enzyme class. PLoS Comput Biol 17(9): e1009446.
In this study we utilize a high-throughput experimental platform, similar to those used for substrate profiling of protein families, to verify functional annotations to an enzyme class in the BRENDA database. We provide an overview of all the sequences annotated as S-2-hydroxyacid oxidases (EC 1.1.3.15) and select 122 representatives of the class for experimental screening of their predicted function. We show that the majority of the sequences contain non-canonical protein domains, do not catalyse the predicted reaction, and are wrongly annotated to the enzyme class. Among the misannotated sequences we confirm four alternative enzymatic activities. Finally, a computational analysis of all EC classes in BRENDA reveals that a large proportion of sequences are annotated to enzyme classes with no similarity to characterised enzymes, a problem which warrants further investigation.
Members of EC 1.1.3.15 are of high biological importance, with plant GOX being crucial for photorespiration, mammalian HAOs taking part in glycine synthesis and fatty acid oxidation, and bacterial LOX metabolising L-lactate as an energy source. Human HAO1 was recently proposed as a target for treating primary hyperoxaluria, an autosomal metabolic disorder leading to decline in renal function. Bacterial LOX are of particular medical and industrial interest, being used for lactate biosensor development in clinical care, sport medicine, and food processing.
To obtain an overview of sequence diversity in EC 1.1.3.15, we downloaded all sequences annotated to this EC in BRENDA 2017.1 and obtained 1058 unique sequences after filtering out partial genes. The sequence interrelatedness of these diverse proteins was visualized in a multidimensional scaling (MDS) plot using computed UniRep embeddings; a smaller distance in this plot indicates higher relatedness (Figs 1 and S2). Among the 1058 sequences, 17 are characterised and/or manually curated enzymes: sequences listed in BRENDA as experimentally tested or in SwissProt as manually curated sequences having experimental evidence at protein level. Over 90% of the enzymes annotated to this enzyme class are of bacterial origin, nearly 6% are eukaryotic and 2.6% archaeal (Fig 1A). Strikingly, 14 out of 17 characterised enzymes are of eukaryotic origin, showing a clear over-representation. The characterised sequences also cluster close together in the visualization (Fig 1A, 1B and 1C), indicating that the characterised/curated sequence diversity in EC 1.1.3.15 is limited.
We next determined the similarity of each sequence in EC 1.1.3.15 to the closest characterised S-2-hydroxyacid oxidase in terms of alignment-based sequence identity and domain architecture. Most sequences have little similarity with the characterised ones; 79% of sequences annotated as 1.1.3.15 share less than 25% sequence identity with the closest characterised/curated sequence (Figs 1B and S3). Furthermore, only 22.5% of the 1058 sequences are predicted to contain the FMN-dependent dehydrogenase domain (FMN_dh, PF01070) which is canonical for known 2-hydroxy acid oxidases (Fig 1C). The majority of sequences were predicted to contain non-canonical domains, such as FAD binding domains characteristic for FAD-dependent oxidoreductases (PF01266, PF01565, PF02913), as well as a cysteine rich domain (PF02754) and 2Fe-2S binding domain (PF04324). Many of the sequences with non-canonical domains form distinct clusters (Fig 1C). An analysis of alignment-based similarity between these domain clusters showed that the average sequence identity to the canonical FMN-dependent dehydrogenase domain cluster is below 16% for all clusters. An all versus all comparison revealed that no two clusters share more than 21% average sequence identity, while the identity of sequences within clusters ranges between 33% and 55% (Fig 1D).
This analysis clearly shows that the enzyme class EC 1.1.3.15 contains a set of very diverse protein sequences, the majority of which have low identity to sequences with experimental evidence, and also lack protein domains characteristic of S-2-hydroxy acid oxidases.
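The two per-sequence checks described above (identity to the closest characterised enzyme, and presence of the canonical domain) can be illustrated with a minimal sketch. It is not the pipeline used in the study: the function names `percent_identity` and `flag_sequence` are hypothetical, the identity is computed over ungapped columns of an already aligned pair (other denominators are in common use and give different values), and only the 25% cutoff and the Pfam accession PF01070 are taken from the text.

```python
CANONICAL_DOMAIN = "PF01070"  # FMN_dh, the canonical domain named in the text
IDENTITY_CUTOFF = 25.0        # percent identity cutoff quoted in the text


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over columns where neither aligned sequence has a gap.

    Assumption: the inputs are two rows of a pairwise alignment, with '-'
    marking gaps. Columns containing a gap are skipped entirely.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def flag_sequence(identity_to_closest: float, pfam_domains: set) -> dict:
    """Report the two warning signs discussed in the text for one sequence."""
    return {
        "low_identity": identity_to_closest < IDENTITY_CUTOFF,
        "lacks_canonical_domain": CANONICAL_DOMAIN not in pfam_domains,
    }


if __name__ == "__main__":
    # Toy aligned pair, not real enzyme sequences.
    identity = percent_identity("ACDE-F", "ACDQGF")
    print(identity, flag_sequence(identity, {"PF01266"}))
```

Neither flag on its own demonstrates misannotation; in the study the inference also rests on the experimental screening results.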
Due to the large diversity of sequences annotated to EC 1.1.3.15, we proceeded to experimental validation of their predicted activity. A total of 122 genes throughout the sequence space of the enzyme class were selected (S4 Fig, left panel), synthesised, cloned and recombinantly expressed in Escherichia coli in a high-throughput setup. Out of the 122 proteins, 65 were in soluble state (53%), with archaeal and eukaryotic proteins being proportionally less soluble than bacterial proteins (S4 Fig, right panel). Despite representing only half of the sequences chosen for experimental characterisation, the soluble proteins were still distributed throughout the sequence space of EC 1.1.3.15 (S4 Fig, left panel). The 65 soluble proteins were tested for S-2-hydroxy acid oxidase activity in an Amplex Red peroxide detection assay with a set of six 2-hydroxy acids: glycolate, lactate, 2-hydroxyoctanoate, 2-hydroxydecanoate, mandelate, and 2-hydroxyglutarate (S5 Fig).
We first investigated 24 proteins representing a cluster of 230 sequences containing the FMN_dh domain; these have the highest sequence identity to previously characterised 2-hydroxy acid oxidases (Figs 1C and 2A). Among them 14 proteins were active with a broad substrate range, as is characteristic for enzymes in EC 1.1.3.15, while 10 proteins were inactive. Bacterial sequences in the cluster were predominantly active with lactate, medium chain and aromatic 2-hydroxy acids, whereas the two active eukaryotic enzymes showed the highest activity with glycolate and lactate.
The seven active site residues are, however, conserved not only in S-2-hydroxyacid oxidases, but also among all the members of the FMN-dependent S-2-hydroxyacid oxidase/dehydrogenase family. We therefore looked for sequence motifs indicating the presence of other family members in our selection (Fig 2C). Two of the screened proteins (B8MKR3 and B8MMC0 from Talaromyces stipitatus) contain a heme binding domain (PF00173) characteristic for flavocytochrome b2 L-lactate dehydrogenase (EC 1.1.2.3) (Figs 2A and S6). These two proteins were tested in vitro for their ability to reduce cytochrome c, a physiological electron acceptor of flavocytochrome b2 L-lactate dehydrogenase. Indeed, the B8MKR3 protein displayed cytochrome b2 L-lactate dehydrogenase activity (S7 Fig). Additionally, four other proteins (E6SCX5 from Intrasporangium calvum, C9Y9E7 from a Curvibacter species and W6W585 from Rhizobium sp. CF080) contain a longer stretch in loop 4 characteristic for S-mandelate dehydrogenase (EC 1.1.99.31) and L-lactate 2-monooxygenase (EC 1.13.12.4) [27,28] (Figs 2A and S6). As seen in our Amplex Red assay, the four proteins display a high activity with mandelate, suggesting their native function may be as S-mandelate dehydrogenases, although further experiments are needed to determine this.
Next, we investigated the activity of 41 proteins not containing the canonical FMN_dh domain (Fig 1C), yet representing a full 78% of all sequences annotated to EC 1.1.3.15 in BRENDA. These proteins have only low sequence identity with previously characterised S-2-hydroxyacid oxidases (Fig 1B and 1D).