On the quality of NMR structures.
Tools and methodology for NMR data and structure validation.

Scope and summary of this thesis

Genome sequencing efforts are revealing the genetic inventories of numerous organisms at an ever-increasing rate. However, the resulting genomic information tells us comparatively little about how the individual parts fulfill their function in the machinery of life. This is a direct consequence of the fact that the genes are not the actual actors within a cell. In order to perform their encoded function, genes are transcribed and translated into proteins, and these in turn mediate most of the essential structures and functions of cells.

Much of our understanding of protein function at the atomic level originates from studying protein structures. There are various techniques to elucidate protein structure, and of these only two are currently producing structures in large quantities: single-crystal X-ray diffraction and nuclear magnetic resonance (NMR) spectroscopy. For the proteins that form suitable crystals, X-ray crystallography represents a mature and rapid approach. Soluble proteins that do not crystallize readily can only be studied by solution-state NMR spectroscopy.

X-ray crystallographers, aided by largely automated procedures, can sometimes solve a protein structure within hours of data collection. NMR spectroscopists have to go through a more laborious process of resonance assignment and structure calculation, and this can take months. Nevertheless, due to the development of more advanced spectrometers, more sophisticated experiments, and automated assignment and structure calculation procedures, NMR protein structure determination methods have advanced to the point where the structure of small- to medium-sized proteins (up to roughly 30 kDa) can now be determined in a routine manner. This has led to the integration of NMR spectroscopy as a structure determination tool in many of the currently ongoing structural genomics projects.

One of the foremost goals of these structural genomics projects is to determine a basic set of protein structures, which includes at least one member of each of the many different protein fold classes. This set of protein folds should provide the basis for the prediction of the three-dimensional structure of most of the remaining proteins using homology modeling techniques. However, for such an approach to be successful it is of the utmost importance that the protein structures present in the basic set are accurate and of high structural quality. Also, for further interpretation and use of these structure models in follow-up studies it is essential to have detailed knowledge of structural quality. Therefore, it is important that the structure models be extensively validated, using both the experimental data and structural knowledge obtained from a reference set of high-quality structures. The high rate at which biomolecular structures are being determined within structural genomics projects, and the ever-increasing amount of automation within these projects, make proper validation of the resulting structures all the more important. It is within this context that the work presented in this thesis, with its special focus on the validation of biomolecular structures determined using NMR spectroscopy, should be placed.

Validation of NMR structures is typically aimed at two aspects: how well the structures agree with the experimental NMR data, and how the structures compare to statistics derived from a reference database of high-quality protein structures. Both of these aspects are discussed in the first two, introductory, chapters. In Chapter 1 an overview is presented of validation approaches that are commonly used to assess the quality of NMR-derived structure models. Following a brief introduction on NMR structure calculation and the precision and accuracy of NMR structures, different techniques are discussed for the validation of both local and global geometric quality. Chapter 2 reviews common methods used to assess the quality of macromolecular structures in light of the experimental data obtained by NMR spectroscopy. An overview is given of the different types of experimental input data, their application in structure calculation algorithms, and the concepts and tools available for their validation.

In Chapter 3 a new method to analyze the information content of NMR data is presented. This method, named QUEEN, is based on a description of the structures in distance space and concepts taken from information theory. It allows for an objective description of the amount of information contained in complete datasets as well as individual restraints. The method is tested on several experimental datasets, and it is shown that QUEEN can be used to successfully identify the crucial restraints in a structure determination project.
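To make the idea of restraint information in distance space more concrete, the following toy sketch may help. It is not the QUEEN implementation: this summary only states that the method combines a distance-space description with concepts from information theory. The uncertainty measure used here (the mean logarithm of the allowed distance interval per atom pair), the fixed lower bound, the example bounds, and all function names are assumptions made purely for illustration.

```python
import math
from itertools import product


def smooth_upper_bounds(upper):
    """Tighten upper distance bounds using the triangle inequality.

    `upper` is a symmetric matrix (list of lists) of upper bounds in Angstrom.
    """
    n = len(upper)
    u = [row[:] for row in upper]
    # product() varies the first index slowest, so k is the outer loop,
    # which is what this Floyd-Warshall style update requires.
    for k, i, j in product(range(n), repeat=3):
        if u[i][k] + u[k][j] < u[i][j]:
            u[i][j] = u[i][k] + u[k][j]
    return u


def uncertainty(upper, lower=1.8):
    """ASSUMED measure: mean log-width of the allowed interval per atom pair.

    The fixed lower bound (a rough contact distance) is a simplification;
    lower bounds are not smoothed in this sketch.
    """
    u = smooth_upper_bounds(upper)
    n = len(u)
    widths = [u[i][j] - lower for i in range(n) for j in range(i + 1, n)]
    return sum(math.log(w) for w in widths) / len(widths)


def restraint_information(upper, i, j, new_upper):
    """Information of one restraint = the drop in uncertainty it causes."""
    restrained = [row[:] for row in upper]
    restrained[i][j] = restrained[j][i] = min(restrained[i][j], new_upper)
    return uncertainty(upper) - uncertainty(restrained)


# Hypothetical four-atom chain: neighbours at most 3.8 A apart,
# all other pairs initially unrestrained (50 A).
BIG = 50.0
chain = [
    [0.0, 3.8, BIG, BIG],
    [3.8, 0.0, 3.8, BIG],
    [BIG, 3.8, 0.0, 3.8],
    [BIG, BIG, 3.8, 0.0],
]

informative = restraint_information(chain, 0, 3, 5.0)   # long-range contact
redundant = restraint_information(chain, 0, 3, 20.0)    # already implied
print(f"informative restraint: {informative:.3f}")
print(f"redundant restraint:   {redundant:.3f}")
```

In this toy setting a restraint that is already implied by the chain geometry contributes no information, whereas a tight long-range restraint does, which is the kind of distinction an information measure for individual restraints is meant to capture.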

Subsequently, the information measures implemented in QUEEN are applied in Chapter 4 to investigate the relation between the information contained in experimental datasets and the quality of resulting structure ensembles. The results show, for the first time, that there is a direct relation between data information content and structural quality. This knowledge is used to derive a new per-residue quality parameter, which provides direct insight into the extent to which structural quality is governed by the experimental input data.

In addition to the quality of the data, the energy parameters used in the final refinement step have a major influence on the quality of biomolecular NMR structures. The DRESS database, presented in Chapter 5, provides a clear example of this finding. In this database, a set of 100 NMR-derived protein structures was re-refined using restrained molecular dynamics in explicit solvent. Validation of the structure ensembles, using approaches discussed in the introductory chapters, demonstrates that both the geometric and the overall quality of the NMR structure models in DRESS are significantly improved compared to the original ensembles.

It has become increasingly clear in recent years that the precision of deposited NMR ensembles often overestimates their accuracy, an issue discussed in the first chapter of this thesis. Chapter 6 reports on a method that yields a more realistic estimate of the uncertainty in the atomic coordinates by maximizing the structural variance within an ensemble of structures, while maintaining accordance with the experimentally derived data. The results indicate that the structural variance of most NMR structure ensembles can be significantly increased without compromising geometric quality or the fit to the experimental NMR data.
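The distinction between precision and accuracy can be illustrated with a small numerical sketch. It is not the method of Chapter 6. It assumes that all models and the reference are already superimposed in a common frame, it uses made-up coordinates, and the function names are hypothetical. Precision is taken here as the average RMSD of the models to their mean structure, and accuracy as the RMSD of that mean structure to a reference.

```python
import math


def mean_structure(models):
    """Average the coordinates of each atom over all models."""
    n_models = len(models)
    n_atoms = len(models[0])
    return [
        tuple(sum(m[a][d] for m in models) / n_models for d in range(3))
        for a in range(n_atoms)
    ]


def rmsd(a, b):
    """Root-mean-square deviation between two pre-superimposed structures."""
    sq = sum((p[d] - q[d]) ** 2 for p, q in zip(a, b) for d in range(3))
    return math.sqrt(sq / len(a))


def ensemble_precision(models):
    """Average RMSD of each model to the ensemble mean (lower = tighter)."""
    mean = mean_structure(models)
    return sum(rmsd(m, mean) for m in models) / len(models)


def ensemble_accuracy(models, reference):
    """RMSD of the ensemble mean to a reference structure."""
    return rmsd(mean_structure(models), reference)


# Made-up three-atom example (Angstrom): the models agree closely with
# each other but place the third atom far from the reference position.
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
models = [
    [(0.0, 0.0, 0.1), (3.8, 0.0, 0.1), (3.8, 3.8, 0.1)],
    [(0.0, 0.0, -0.1), (3.8, 0.0, -0.1), (3.8, 3.8, -0.1)],
]

precision = ensemble_precision(models)
accuracy = ensemble_accuracy(models, reference)
print(f"precision (RMSD to mean):      {precision:.2f} A")
print(f"accuracy  (RMSD to reference): {accuracy:.2f} A")
```

In this toy case the spread within the ensemble is much smaller than its deviation from the reference, so the spread alone would give an overly optimistic picture of the coordinate uncertainty.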

Finally, Chapter 7 presents an application of the methodology, tools and knowledge described in this thesis to a practical example. In search of a suitable template to build a homology model for the protein Dynein Light Chain 2A, two NMR structures of this protein were obtained from the Protein Data Bank, each originating from a different structural genomics effort. The folds of the two structures are remarkably different, despite their high sequence identity (96%). In this chapter a detailed analysis of both structure ensembles is presented, which allows us to identify one of the two ensembles as incorrect. Subsequently, the analysis of a large set of structures solved as part of structural genomics efforts shows that this erroneous structure is unfortunately not an isolated incident. In the conclusion, suggestions are offered on how the methods and tools described in this thesis could be applied to prevent such serious errors from occurring in the future.