In this chapter, I wish to introduce some general concepts around the massXpert program and the way data elements are named in this manual and in the program.
The massXpert mass spectrometry software suite has been designed to be able to “work” with every linear polymer. Well, in a certain way this is true… A more faithful account of massXpert's capabilities would be: “The massXpert software suite works with whatever polymer chemistry the user cares to define; the more accurate the polymer chemistry definition, the more accurate massXpert will be”.
For the program to be able to cope with a variety of possibly very different polymers, it had to be written with an abstraction layer between the mass calculation engine and the mere description of the polymer sequence. This abstraction layer is implemented with the help of “polymer chemistry definitions”, which are files describing precisely how a given polymer type should behave in the program and what its constitutive entities are. The way polymer chemistry definitions are detailed by the user is the subject of a chapter of this book (see menu of the program). However, in order to give a quick overview, here is a simple situation: a user is working on two polymer sequences, one of chemistry type “protein” and another one of chemistry type “DNA”. The protein sequence reads “ATGC”, and the DNA sequence reads “CGTA”. Now imagine that the user wants to compute the mass of these sequences. How will massXpert know what formula (hence mass) each monomer code corresponds to? There must be a way to inform massXpert that one of the sequences is a protein while the other is a DNA oligonucleotide. This is done upon creation of a polymer sequence: the program asks which chemistry type the new sequence belongs to. Once this “chemical parentage” has been defined for each sequence, massXpert will know how to handle both the graphical display of each sequence and the calculations for each sequence.
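The situation above can be sketched in a few lines of Python. This is only an illustration of the idea, not the way massXpert stores or reads its polymer chemistry definitions: the `CHEMISTRY_DEFINITIONS` table and the `monomer_formulas` function are hypothetical names, and the residue formulas are standard textbook values rather than values copied from the files shipped with the program.

```python
# Hypothetical, heavily simplified "polymer chemistry definitions":
# each chemistry type maps one-letter monomer codes to a residue formula.
# (Illustrative values only; massXpert's real definitions are XML files.)
CHEMISTRY_DEFINITIONS = {
    "protein": {
        "A": "C3H5NO",       # alanyl residue
        "T": "C4H7NO2",      # threonyl residue
        "G": "C2H3NO",       # glycyl residue
        "C": "C3H5NOS",      # cysteinyl residue
    },
    "DNA": {
        "A": "C10H12N5O5P",  # deoxyadenosine monophosphate residue
        "T": "C10H13N2O7P",  # thymidine monophosphate residue
        "G": "C10H12N5O6P",  # deoxyguanosine monophosphate residue
        "C": "C9H12N3O6P",   # deoxycytidine monophosphate residue
    },
}


def monomer_formulas(chemistry_type, sequence):
    """Translate each monomer code using the definition of the given chemistry type."""
    definition = CHEMISTRY_DEFINITIONS[chemistry_type]
    return [definition[code] for code in sequence]


# The same letters resolve to different formulas depending on the chemistry type.
print(monomer_formulas("protein", "ATGC"))
print(monomer_formulas("DNA", "CGTA"))
```

The point of the sketch is that the code “A” means nothing by itself: only the chemistry type attached to the sequence tells the calculation which formula to use.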
Any user of massXpert will inevitably have to perform two kinds of chemical simulations:
Define the formula of some chemical entity;
Define a given chemical reaction, like a protein monomer modification, for example.
While the definition of a formula poses no special difficulty, the definition of a chemical reaction is less trivial, as detailed in the following example. The lysyl residue has the following formula: C6H12N2O. If that lysyl residue gets acetylated, the acetylation reaction reads: “An acetic acid molecule condenses onto the ε-amine of the lysyl side chain”. This can also be read as: “An acetyl group enters the lysyl side chain while a hydrogen atom leaves it; water is lost in the process”. The representation of that reaction is:
R-NH2 + CH3COOH ⇋ R-NH-CO-CH3 + H2O
When the user wants to define that chemical reaction, the representation “-H2O+CH3COOH” can be used, or even the shorter but chemically equivalent one: “-H+CH3CO”. In massXpert, the chemical reaction representation is considered a valid formula.
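To see why the two representations are equivalent, here is a minimal Python sketch that computes the net elemental change of such a signed formula and applies it to the lysyl residue. It is not massXpert's own parser: the function names `parse_formula`, `apply_reaction` and `to_formula` are hypothetical, and the sketch only handles simple element symbols with optional counts and “+”/“-” signs.

```python
import re


def parse_formula(text):
    """Return the net element counts of a signed formula such as '-H2O+CH3COOH'."""
    net = {}
    # Each chunk may carry a leading sign; an unsigned chunk counts as '+'.
    for sign, chunk in re.findall(r"([+-]?)([A-Za-z0-9]+)", text):
        factor = -1 if sign == "-" else 1
        for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", chunk):
            net[symbol] = net.get(symbol, 0) + factor * (int(count) if count else 1)
    return {el: n for el, n in net.items() if n != 0}


def apply_reaction(formula, reaction):
    """Add the net change of 'reaction' to the element counts of 'formula'."""
    total = parse_formula(formula)
    for el, n in parse_formula(reaction).items():
        total[el] = total.get(el, 0) + n
    return {el: n for el, n in total.items() if n != 0}


def to_formula(counts):
    """Write positive element counts as a formula: C and H first, then alphabetical."""
    order = sorted(counts, key=lambda el: (el not in ("C", "H"), el))
    return "".join(el + (str(counts[el]) if counts[el] != 1 else "") for el in order)


# Both notations yield the same net change: two C, two H and one O are gained.
assert parse_formula("-H2O+CH3COOH") == parse_formula("-H+CH3CO")

acetylated = apply_reaction("C6H12N2O", "-H2O+CH3COOH")
print(to_formula(acetylated))  # C8H14N2O2
```

Treating the reaction as a formula with signed parts means that the same arithmetic serves both for plain formulas and for modifications.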
All the data dealt with in massXpert are stored on disk as XML-formatted files. XML is the eXtensible Markup Language, a “language” that makes it possible to describe the structure of a document. The structure of the data is first described in a section of the document called the Document Type Definition (DTD), and the data follow in the same file. One of the big advantages of using XML in massXpert is that it is a text format, not a binary one. This means that any data in the massXpert package is human-readable (the XML syntax makes the data a bit difficult to read, but it remains possible). Try to read one of the polymer chemistry definition XML files shipped with this software package, and you'll see that these files are pure text files (the same applies to the *.mxp XML polymer sequence files). The advantages of text file formats over binary file formats are:
The data in the files are readable even without the program that created them. Data extraction is possible, even if it costs work;
Whenever a text document gets corrupted, it remains possible to extract some valid data from its uncorrupted parts. With a binary format, data are typically chained from one bit to the next; losing one bit can corrupt all the remaining data in the file;
Text data files are searchable with standard console tools (sed, grep…), which make it possible to search easily for text patterns in any text file, or in thousands of these files, with a single command line. This is not possible with binary formats, simply because reading them requires the program that knows how to decode the data, so the powerful console-based tools would prove useless.
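As an illustration of these points, the following Python sketch reads a small XML document that carries its own DTD, first with a plain text search and then with a generic XML parser, without any help from the program that wrote it. The document is made up for this example: the element names `monomers`, `monomer`, `name`, `code` and `formula` are assumptions and do not reproduce the actual layout of massXpert's files.

```python
import xml.etree.ElementTree as ET

# A made-up XML document: the DTD comes first, the data follow in the same file.
SAMPLE = """<?xml version="1.0"?>
<!DOCTYPE monomers [
<!ELEMENT monomers (monomer+)>
<!ELEMENT monomer (name, code, formula)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT code (#PCDATA)>
<!ELEMENT formula (#PCDATA)>
]>
<monomers>
  <monomer><name>Glycine</name><code>G</code><formula>C2H3NO</formula></monomer>
  <monomer><name>Lysine</name><code>K</code><formula>C6H12N2O</formula></monomer>
</monomers>
"""

# Plain text search, in the spirit of grep: no knowledge of the format is needed.
matching_lines = [line.strip() for line in SAMPLE.splitlines() if "Lysine" in line]

# Structured extraction with a generic XML parser from the standard library.
root = ET.fromstring(SAMPLE)
formulas = {m.findtext("code"): m.findtext("formula") for m in root.iter("monomer")}

print(matching_lines)
print(formulas)
```

Neither step would be possible on a binary file without first writing a decoder for that specific format.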
Unless otherwise specified, the user is strongly advised not to insert any non-alphanumeric or non-ASCII characters (space, %, #, $…) in the strings that identify polymer chemistry definition entities. This means that, for example, users must refrain from using such characters in the atom names and symbols, or in the names, the codes or the formulæ of the monomers, of the modifications, of the cleavage specifications or of the fragmentation specifications… Usually, the accepted delimiting characters are - and _. It is important not to cripple these polymer data for two main reasons:
So that the program performs smoothly: some file-parsing processes rely on specific characters (like # or %, for example) to isolate sub-strings from larger strings;
So that the results can be easily and clearly displayed when the time comes to print all the data.
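A name or a code can be checked against this recommendation before it is typed into a polymer chemistry definition. The Python sketch below is only a convenience check written for this manual's rule of thumb (ASCII letters, digits, - and _); the function name `is_safe_identifier` is hypothetical and the check is not part of massXpert.

```python
import re

# ASCII letters and digits, plus the two delimiters usually accepted: - and _
_SAFE_IDENTIFIER = re.compile(r"[A-Za-z0-9_-]+")


def is_safe_identifier(text):
    """Return True if 'text' is non-empty and made only of the characters above."""
    return _SAFE_IDENTIFIER.fullmatch(text) is not None


for candidate in ("Acetylation", "Phospho_Ser", "my monomer", "50%", "formulæ"):
    print(candidate, is_safe_identifier(candidate))
```

The character class is spelled out explicitly instead of using `\w`, because `\w` also matches non-ASCII letters in Python.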