1 Generalities

1.1 On Chemical Formulæ and Chemical Reactions
1.2 The massXpert Framework Data Format
1.3 Chemical Entity Naming Policy

In this chapter, I wish to introduce some general concepts around the massXpert program and the way data elements are named in this manual and in the program.

The massXpert mass spectrometry software suite has been designed to “work” with every linear polymer. Well, in a certain way this is true… A more faithful account of massXpert's capabilities would be: “The massXpert software suite works with whatever polymer chemistry the user cares to define; the more accurate the polymer chemistry definition, the more accurate massXpert will be”.

For the program to cope with a variety of possibly very different polymers, it had to be written with an abstraction layer between the mass calculation engine and the mere description of the polymer sequence. This abstraction layer is implemented with the help of “polymer chemistry definitions”, which are files describing precisely how a given polymer type should behave in the program and what its constitutive entities are. The way polymer chemistry definitions are detailed by the user is the subject of a chapter of this book (see menu XpertDef of the program).

However, in order to give a quick overview, here is a simple situation: a user is working on two polymer sequences, one of chemistry type “protein” and another one of chemistry type “DNA”. The protein sequence reads “ATGC”, and the DNA sequence reads “CGTA”. Now imagine that the user wants to compute the mass of these sequences. How will massXpert know what formula (hence mass) each monomer code corresponds to? There must be a way to inform massXpert that one of the sequences is a protein while the other is a DNA oligonucleotide. This is done upon creation of a polymer sequence: the program asks which chemistry type the new sequence belongs to. Once this “chemical parentage” has been defined for each sequence, massXpert will know how to handle both the graphical display of each sequence and the calculations for each sequence.
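To make the idea of “chemical parentage” concrete, here is a minimal Python sketch. It is not massXpert code: the `POLCHEM_DEFS` table and the `sequence_composition` function are hypothetical names, the table holds only four monomers per chemistry, and the sketch sums residue formulæ only, ignoring whatever the real definitions say about chain ends. It merely shows that the same monomer code resolves to a different formula depending on the chemistry type attached to the sequence.

```python
from collections import Counter

# Hypothetical, heavily truncated stand-ins for two polymer chemistry
# definitions. Each one-letter code maps to the elemental composition of
# the monomer *residue* (amino acid or nucleotide minus one water).
POLCHEM_DEFS = {
    "protein": {
        "A": {"C": 3, "H": 5, "N": 1, "O": 1},          # alanyl
        "T": {"C": 4, "H": 7, "N": 1, "O": 2},          # threonyl
        "G": {"C": 2, "H": 3, "N": 1, "O": 1},          # glycyl
        "C": {"C": 3, "H": 5, "N": 1, "O": 1, "S": 1},  # cysteinyl
    },
    "DNA": {
        "A": {"C": 10, "H": 12, "N": 5, "O": 5, "P": 1},  # dAMP residue
        "T": {"C": 10, "H": 13, "N": 2, "O": 7, "P": 1},  # dTMP residue
        "G": {"C": 10, "H": 12, "N": 5, "O": 6, "P": 1},  # dGMP residue
        "C": {"C": 9, "H": 12, "N": 3, "O": 6, "P": 1},   # dCMP residue
    },
}


def sequence_composition(sequence, chemistry):
    """Sum the residue compositions of a sequence under a given chemistry.

    Assumes one-letter monomer codes and ignores chain-end chemistry.
    """
    monomers = POLCHEM_DEFS[chemistry]
    total = Counter()
    for code in sequence:
        total.update(monomers[code])  # adds the element counts
    return dict(total)


protein = sequence_composition("ATGC", "protein")
dna = sequence_composition("CGTA", "DNA")
```

The letters A, T, G and C are valid codes in both tables, so the sequence string alone cannot tell the program which formulæ to use; the chemistry type chosen at creation time does.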

1.1 On Chemical Formulæ and Chemical Reactions

Any user of massXpert will inevitably have to perform two kinds of chemical simulations:

Define the formula of some chemical entity;
Define a given chemical reaction, like a protein monomer modification, for example.

While the definition of a formula poses no special difficulty, the definition of a chemical reaction is less trivial, as detailed in the following example. The lysyl residue has the following formula: C₆H₁₂N₂O. If that lysyl residue gets acetylated, the acetylation reaction reads this way: “An acetic acid molecule condenses onto the ε amine of the lysyl side chain, and water is lost in the process”. This can also read: “An acetyl group enters the lysyl side chain while a hydrogen atom leaves it”. The representation of that reaction is:

R-NH₂ + CH₃COOH ⇋ R-NH-CO-CH₃ + H₂O

When the user wants to define that chemical reaction, she can use this representation: “-H₂O+CH₃COOH”, or the briefer but chemically equivalent one: “-H+CH₃CO”. Both amount to a net gain of C₂H₂O. In massXpert, the chemical reaction representation is considered a valid formula.
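The equivalence of the two representations can be checked with a small Python sketch. This is an illustration, not the parser used by massXpert: the function names are hypothetical, subscripts are written as plain ASCII digits, and parentheses or isotopes are not handled. Each “+” or “-” applies to all the element symbols that follow it, up to the next sign.

```python
import re
from collections import Counter

# One token is either a sign, or an element symbol with an optional count.
TOKEN = re.compile(r"([+-])|([A-Z][a-z]?)(\d*)")


def parse_signed_formula(text):
    """Return the net element counts of a formula such as '-H2O+CH3COOH'."""
    counts = Counter()
    sign = 1  # a formula without a leading sign is treated as '+'
    pos = 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character at position {pos}: {text!r}")
        if match.group(1):
            sign = 1 if match.group(1) == "+" else -1
        else:
            counts[match.group(2)] += sign * int(match.group(3) or 1)
        pos = match.end()
    # Drop elements whose gains and losses cancel out.
    return {element: n for element, n in counts.items() if n != 0}


def apply_reaction(composition, reaction):
    """Add the net change of a signed formula to an existing composition."""
    result = Counter(composition)
    for element, delta in parse_signed_formula(reaction).items():
        result[element] += delta
    return {element: n for element, n in result.items() if n != 0}


lysyl = parse_signed_formula("C6H12N2O")
acetylated_long = apply_reaction(lysyl, "-H2O+CH3COOH")
acetylated_short = apply_reaction(lysyl, "-H+CH3CO")
```

Both calls yield the same acetylated residue, C₈H₁₄N₂O₂, because each reaction string reduces to a net gain of two carbons, two hydrogens and one oxygen.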

1.2 The massXpert Framework Data Format

All the data dealt with in massXpert are stored on disk as XML-formatted files. XML is the eXtensible Markup Language. This “language” makes it possible to describe the structure of a document. The structure of the data is first described in a section of the document called the Document Type Definition (DTD), and the data follow in the same file. One of the big advantages of using such an XML format in massXpert is that it is a text format, not a binary one. This means that any data in the massXpert package are human-readable (the XML syntax makes the data a bit difficult to read, but it is possible). Try to read one of the polymer chemistry definition XML files that are shipped with this software package, and you'll see that these files are pure text files (the same applies to the *.mxp XML polymer sequence files). The advantages of text file formats over binary file formats are:

The data in the files are readable even without the program that created them. Data extraction is possible, even if it takes some work;
Whenever a text document gets corrupted, it remains possible to extract some valid data from its uncorrupted parts. With a binary format, data are chained from bit to bit; losing one bit leads to the corruption of all the remaining bits in the file;
Text data files are searchable with standard console tools (sed, grep…), which make it easy to search for text patterns in one text file, or in thousands of them, with a single command line. This is not possible with a binary format, simply because reading such files requires the program that knows how to decode the data, and the powerful console-based tools would prove useless.
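As an illustration of the first and third points, the Python sketch below reads a tiny XML document without any help from massXpert, once with a grep-like line search and once with a generic XML parser. The element names (`polseqdata`, `name`, `codes`) are invented for this example and do not reproduce the actual massXpert file format; `grep_lines` is likewise a hypothetical helper.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical file content: the element names are made up for this sketch
# and are NOT the real massXpert schema.
SAMPLE = """<?xml version="1.0"?>
<polseqdata>
  <name>sample-protein</name>
  <codes>ATGC</codes>
</polseqdata>
"""


def grep_lines(text, pattern):
    """Return (line_number, line) pairs for lines matching a regex, like grep -n."""
    regex = re.compile(pattern)
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if regex.search(line)
    ]


# Plain-text search: no knowledge of the XML structure is needed.
hits = grep_lines(SAMPLE, r"<codes>")

# Structured access with a generic XML parser from the standard library.
root = ET.fromstring(SAMPLE)
codes = root.findtext("codes")
```

Neither approach needs the program that wrote the file, which is the point of storing the data as text.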

1.3 Chemical Entity Naming Policy

Unless otherwise specified, the user is strongly advised not to insert any non-alphanumeric or non-ASCII characters (space, %, #, $…) in the strings that identify polymer chemistry definition entities. This means that, for example, users must refrain from using such characters in the atom names and symbols, or in the names, codes or formulæ of the monomers, modifications, cleavage specifications, fragmentation specifications… Usually, the accepted delimiting characters are - and _. It is important to keep these polymer data strings clean for two main reasons:

So that the program performs smoothly: some file-parsing processes rely on specific characters (like # or %, for example) to isolate sub-strings from larger strings;
So that the results can be easily and clearly displayed when the time comes to print all the data.
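A user preparing many entity names may want to check them before typing them into a definition. The Python sketch below does this under one assumption: a name or code is acceptable if it contains only ASCII letters, ASCII digits, - and _. This is a reading of the advice above, not the rule massXpert itself enforces, and `is_safe_identifier` is a hypothetical helper. It is meant for names and codes only, since reaction formulæ legitimately contain + signs.

```python
import re

# Assumed rule: ASCII letters and digits, plus the delimiters '-' and '_'.
_ALLOWED = re.compile(r"[A-Za-z0-9_-]+")


def is_safe_identifier(text):
    """Return True if text is non-empty and uses only the allowed characters."""
    return _ALLOWED.fullmatch(text) is not None


candidates = ["Acetylation", "N-term_Acetyl", "my modif", "50%", "Formulæ"]
rejected = [name for name in candidates if not is_safe_identifier(name)]
```

Here `rejected` collects the names containing a space, a % sign or a non-ASCII letter, which are the kinds of characters the policy warns against.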
