
PyMsXML

Introduction

PyMsXML is a Python script that converts vendor-specific mass spectrometry data files from Applied Biosystems' Q-Star, 4700, 4800, Mariner, and Voyager mass spectrometers from their raw binary form to either of the emerging XML file formats for mass spectra: mzXML, from the Sashimi Glossolalia project of the Institute for Systems Biology (ISB), and mzData, from the Proteomics Standards Initiative (PSI) of the Human Proteome Organization (HUPO).

PyMsXML uses installed vendor software under Windows to read the proprietary raw mass spectra file formats, calling the vendor-supplied libraries through their COM interface. Unlike other software solutions that use this approach, PyMsXML is written in Python, a free, open-source language. As such, no installation of Microsoft Visual C++ or Visual Basic is necessary to use, alter, or improve the PyMsXML script.

PyMsXML is easily extended for new instruments and vendor software, and for new or changed XML file formats. The code that interfaces with the vendor software is separate from the code that formats the data as XML, so adding support for a new instrument does not require rewriting the XML formatting code. Similarly, as new XML file formats emerge, the code that interfaces with the instrument software need not change.
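The separation described above can be illustrated with a minimal sketch. The class names, the dictionary fields, and the XML tags below are hypothetical and are not taken from the PyMsXML source; the point is only that a writer consumes scans from any reader without knowing which vendor produced them.

```python
# Hypothetical sketch: names and XML tags are illustrative, not PyMsXML's own.
class ListReader:
    """Stands in for a vendor-specific reader; yields one dict per scan."""
    def __init__(self, scans):
        self.scans = scans

    def spectra(self):
        for scan in self.scans:
            yield scan


class SimpleXMLWriter:
    """Stands in for a format-specific writer; knows nothing about the vendor."""
    def __init__(self, root_tag):
        self.root_tag = root_tag

    def write(self, reader):
        lines = ["<%s>" % self.root_tag]
        for scan in reader.spectra():
            lines.append('  <scan num="%d" msLevel="%d"/>'
                         % (scan["num"], scan["msLevel"]))
        lines.append("</%s>" % self.root_tag)
        return "\n".join(lines)


# The same reader feeds two different writers without modification.
reader = ListReader([{"num": 1, "msLevel": 1}, {"num": 2, "msLevel": 2}])
xml_a = SimpleXMLWriter("formatA").write(reader)
xml_b = SimpleXMLWriter("formatB").write(reader)
```

Under this arrangement, supporting a new instrument would mean adding one reader class, and supporting a new XML format would mean adding one writer class.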

PyMsXML is hosted at bioinformatics.org.

NOTE!

The vendor software (Analyst for .wiff files, Data Explorer for .t2d files) must be installed on the same computer as PyMsXML. These binary formats cannot be read without the vendor support libraries.

Installation

  1. Download and install the latest version of ActiveState ActivePython for Windows.

  2. Start the Pythonwin IDE (All Programs -> ActiveState ActivePython 2.4 -> Pythonwin IDE). From the Tools menu, select the "COM Makepy utility" entry. In the popup window, select "ExploreDataObjects 1.0 Type Library (1.0)" to build a Python interface to Analyst's COM libraries for reading .wiff files, or select "IDAExplorer 1.0 Type Library (1.0)" to build a Python interface to Data Explorer's COM libraries for reading ".dat" and ".t2d" files, then click OK. If you have both pieces of software, repeat this step for each software package.

  3. Check the installation of COM library interfaces. If any of these tests are unsuccessful, then PyMsXML will be unable to read the corresponding raw datafiles.

    For Analyst, these commands at the Pythonwin IDE command-line (copy-and-paste!) should elicit similar responses:

    >>> from win32com.client import Dispatch
    >>> Dispatch('Analyst.FMANSpecData')
    <win32com.gen_py.ExploreDataObjects 1.0 Type Library.IFMANSpecData instance at 0x14421558>
    >>> Dispatch('Analyst.FMANChromData')
    <win32com.gen_py.ExploreDataObjects 1.0 Type Library.IFMANChromData instance at 0x14418408>
    

    For Data Explorer, these commands at the Pythonwin IDE command-line (copy-and-paste!) should elicit similar responses:

    >>> from win32com.client import Dispatch, gencache
    >>> Dispatch('DataExplorer.Application',resultCLSID='{3FED40F1-D409-11D1-8B56-0060971CB54B}')
    <COMObject DataExplorer.Application>
    >>> gencache.EnsureModule('{06972F50-13F6-11D3-A5CB-0060971CB54B}',0,4,2)
    <module 'win32com.gen_py.06972F50-13F6-11D3-A5CB-0060971CB54Bx0x4x2' from 'C:\Python24\lib\site-packages\win32com\gen_py\06972F50-13F6-11D3-A5CB-0060971CB54Bx0x4x2.py'>
    

  4. Download and unpack the PyMsXML scripts and examples. After unzipping PyMsXML, edit the file pymsxml.cmd to point to your Python installation (usually C:\Python24\python.exe) and your PyMsXML installation.
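The interactive checks in step 3 can also be run as a small script. This is a sketch rather than part of PyMsXML: the function name check_com is made up for illustration, and the ProgIDs and the resultCLSID value are copied from the commands in step 3. On a machine without pywin32 or without the vendor software, every entry is simply reported as failed.

```python
# Sketch: report which vendor COM classes can be created on this machine.
# ProgIDs and the resultCLSID are the ones used in the step 3 checks.
PROGIDS = [
    ("Analyst.FMANSpecData", {}),
    ("Analyst.FMANChromData", {}),
    ("DataExplorer.Application",
     {"resultCLSID": "{3FED40F1-D409-11D1-8B56-0060971CB54B}"}),
]


def check_com(progids=PROGIDS):
    """Return {progid: True/False}; all False if pywin32 is not installed."""
    try:
        from win32com.client import Dispatch
    except ImportError:
        Dispatch = None
    results = {}
    for name, kwargs in progids:
        ok = False
        if Dispatch is not None:
            try:
                Dispatch(name, **kwargs)
                ok = True
            except Exception:
                # COM errors mean the class is not registered or not usable.
                ok = False
        results[name] = ok
    return results


if __name__ == "__main__":
    for name, ok in sorted(check_com().items()):
        print("%-28s %s" % (name, "OK" if ok else "FAILED"))
```

A FAILED line for an Analyst class suggests that .wiff files cannot be read; a FAILED line for Data Explorer suggests the same for ".dat" and ".t2d" files.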

Usage

PyMsXML consists of a single Python script. A Windows cmd file wrapper is provided to take care of calling the Python interpreter appropriately.

pymsxml [ options ] raw-spectra-data-file

Options:

-R raw-format, --rawdata raw-format
Valid raw-format values: wiff, qstar, t2d, ab4700, ab4800, voyager, mariner, mzXML. Optional if raw-spectra-data-file ends in .wiff, .t2m, or .mzXML.

-X xml-format, --xmlformat xml-format
Valid xml-format values: mzXML (ISB), mzData (HUPO). Optional if output-file ends in .mzXML or .mzData.

-o output-file, --output output-file
Name of output file. If omitted, and xml-format is supplied, then the output file name is inferred by changing the file extension of raw-spectra-data-file to xml-format.

-p ms-levels, --peaks ms-levels
Apply (vendor library) peak detection to spectra with level in ms-levels (comma separated). Available only for the QStar raw format (MS/MS spectra only) and the 4700 and 4800 raw formats. Default: 2.

-f filter-spec, --filter filter-spec
Filter output scans by their meta-data. Filters are specified as a comma-separated list of filter tokens. Each filter token is specified as field.comparison.value. field must be an attribute of the scan object. comparison must be one of eq, ne, lt, le, gt, or ge, specifying =, ≠, <, ≤, >, and ≥ respectively.

-V version, --version version
XML version. mzXML only. Valid options: 2.1, 2.2, 3.0. Default: 3.0.

-z, --compress_peaks
Compress mzXML peak data using zlib. Default: No compression of peak data. Requires mzXML version 3.0.

-Z compress-format, --compress compress-format
Compress output file. Valid options: gz. Default: None, unless output file ends with .gz, then gz.

-d, --debug
Debug. Output XML for first 10 spectra only. Truncate spectral data, too. Useful to verify that the output is formatted correctly.

-h, --help
Help.
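As an illustration of the field.comparison.value syntax accepted by the --filter option, the sketch below parses a filter specification and applies it to a few scans. It is not PyMsXML's own parser. Three things are assumptions: the field names msLevel and retentionTime are hypothetical, all tokens must match for a scan to be kept, and values that look numeric are compared as numbers.

```python
import operator

# Comparison tokens as listed for the --filter option.
COMPARISONS = {
    "eq": operator.eq, "ne": operator.ne,
    "lt": operator.lt, "le": operator.le,
    "gt": operator.gt, "ge": operator.ge,
}


def parse_filter(spec):
    """Turn 'field.cmp.value,...' into a list of (field, function, value)."""
    tests = []
    for token in spec.split(","):
        # Split on the first two dots only, so values like 10.5 stay intact.
        field, cmp_name, raw = token.strip().split(".", 2)
        try:
            value = float(raw)   # assumption: numeric values compare as numbers
        except ValueError:
            value = raw          # anything else compares as a string
        tests.append((field, COMPARISONS[cmp_name], value))
    return tests


def keep(scan, tests):
    """True if the scan (a plain dict here) passes every filter token."""
    return all(fn(scan[field], value) for field, fn, value in tests)


tests = parse_filter("msLevel.eq.2,retentionTime.ge.10.5")
scans = [
    {"msLevel": 1, "retentionTime": 12.0},
    {"msLevel": 2, "retentionTime": 9.0},
    {"msLevel": 2, "retentionTime": 11.0},
]
kept = [s for s in scans if keep(s, tests)]
```

Here only the third scan survives, since it is the only one at MS level 2 with a retention time of at least 10.5.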

Applied Biosystems Q-Star Spectra

The raw spectra data files for the ESI spectra from Applied Biosystems' Q-Star instruments are usually extracted as ".wiff" files. These can be opened using Applied Biosystems' Analyst or BioAnalyst programs. PyMsXML uses Analyst's support libraries to extract mass spectra from these files.

Applied Biosystems Mariner, Voyager, 4700, and 4800 Spectra

The raw spectra data files for the MALDI spectra from Applied Biosystems' Mariner, Voyager, 4700, and 4800 instruments are usually extracted as ".t2d" or ".dat" files. These can be opened using Applied Biosystems' Data Explorer program. PyMsXML uses Data Explorer's support libraries to extract mass spectra from these files. These file formats store very little meta-data in addition to the mass spectrum. If you do not care whether the output data file contains valid spot, matrix, and plate meta-data, and your spectra are all contained in one raw data file, then that file can be supplied on the command-line as raw-spectra-data-file. However, if you have many .dat or .t2d files, each corresponding to a MALDI spot, or if you wish to have spot, matrix, or plate meta-data populated into the output file, then follow the instructions below.

Plate, matrix, and spot meta-data must be supplied in a meta-data text file, which is given on the command-line as raw-spectra-data-file. The meta-data file is most easily constructed in Excel and saved as tab-separated values, but it can also be written by hand. Each line of the meta-data file is a record describing the MALDI plate, one of the plate's spots, or a scan acquired from one of these spots. A shortcut record that defines the plate and the spot naming convention is also provided.

The plate definition record consists of the word PLATE (case insensitive) in the first column, followed by alternating key-value pairs in subsequent columns. Particular key-value pairs do not need to be specified in any particular order. The following keys must be provided: plateID, spotXCount, spotYCount, plateManufacturer, and plateModel. The plateID value is referenced by the spot and scan definition records. The spotXCount is the number of MALDI spots in the horizontal dimension (integer). The spotYCount is the number of MALDI spots in the vertical dimension (integer). The plateManufacturer and plateModel values are inserted verbatim in the output XML.

The spot definition record consists of the word SPOT (case insensitive) in the first column, followed by alternating key-value pairs in subsequent columns. Particular key-value pairs do not need to be specified in any particular order. The following keys must be provided: plateID, spotID, spotXPosition, spotYPosition, and maldiMatrix. The plateID value must be defined by some plate definition record. The spotID is referenced by the scan definition records. The spotXPosition is the horizontal position of the spot on the plate (integer). The spotYPosition is the vertical position of the spot on the plate (integer). Spot positions can be numbered beginning at 0 or 1. The maldiMatrix value is inserted verbatim in the output XML.

The scan definition record consists of the word SCAN (case insensitive) in the first column, followed by alternating key-value pairs in subsequent columns. Particular key-value pairs do not need to be specified in any particular order. The following keys must be provided: plateID, spotID, filename, and index. The plateID must be defined by some plate definition record. The spotID must be defined by some spot definition record. The filename is the name of the ".dat" or ".t2d" file containing the corresponding scan's spectrum. The index is the ordinal of the corresponding spectrum in the provided file. Spectra within files should be referenced beginning at 1.
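The PLATE, SPOT, and SCAN record layout described above can be read with a few lines of Python. This is a sketch under stated assumptions, not the parser used by PyMsXML: keys are matched exactly as spelled above, and the sample values (P1, A1, CHCA, A1_MS.t2d) are made up for illustration.

```python
def parse_record(line):
    """Split one tab-separated meta-data line into (record type, {key: value})."""
    cols = line.rstrip("\r\n").split("\t")
    kind = cols[0].strip().upper()      # the record keyword is case insensitive
    pairs = cols[1:]
    if len(pairs) % 2:
        raise ValueError("unpaired key in %s record" % kind)
    # Even positions hold keys, odd positions hold the matching values.
    return kind, dict(zip(pairs[0::2], pairs[1::2]))


# Required keys for each record type, as listed in the text above.
REQUIRED = {
    "PLATE": ["plateID", "spotXCount", "spotYCount",
              "plateManufacturer", "plateModel"],
    "SPOT": ["plateID", "spotID", "spotXPosition",
             "spotYPosition", "maldiMatrix"],
    "SCAN": ["plateID", "spotID", "filename", "index"],
}


def missing_keys(kind, fields):
    """List the required keys that a record of this type does not supply."""
    return [k for k in REQUIRED.get(kind, []) if k not in fields]


lines = [
    "PLATE\tplateID\tP1\tspotXCount\t24\tspotYCount\t8"
    "\tplateManufacturer\tABI / SCIEX\tplateModel\t01-192+06-BB",
    "spot\tplateID\tP1\tspotID\tA1\tspotXPosition\t1"
    "\tspotYPosition\t1\tmaldiMatrix\tCHCA",
    "SCAN\tplateID\tP1\tspotID\tA1\tfilename\tA1_MS.t2d\tindex\t1",
]
records = [parse_record(l) for l in lines]
```

Because the key-value pairs are collected into a dictionary, their order within a line does not matter, which matches the description of the records.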

To alleviate some of the tedium of specifying the spot definition records, a shortcut plate definition record is provided. The platedef definition record consists of the word PLATEDEF (case insensitive) in the first column, followed by alternating key-value pairs in subsequent columns. Particular key-value pairs do not need to be specified in any particular order. The following keys must be provided: plateID, plateManufacturer, plateModel, spotNaming, and maldiMatrix. The plateID value is referenced by the spot and scan definition records. The plateManufacturer and plateModel are used to identify the properties of the MALDI plate. Currently, only the values ABI / SCIEX and 01-192+06-BB are recognized, but others are easily added on request. The "ABI / SCIEX 01-192+06-BB" plate consists of 8 rows of 24 spots (plus 6 calibration spots). The spotNaming must be one of alpha, parallel, or antiparallel. At this time, only alpha is implemented. The alpha naming scheme constructs spotID values for all spots on the plate, with the row specified by a letter from A to H and the column specified by a number from 1 to 24. The maldiMatrix value is assumed to be the same for each spot and is inserted verbatim in the output XML.
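The alpha naming scheme can be sketched as a generator over the 8 by 24 grid. The exact spotID spelling is an assumption here (letter followed by an unpadded number, such as A1 or H24), as is numbering positions from 1; the function name alpha_spots is hypothetical.

```python
import string


def alpha_spots(rows=8, cols=24):
    """Yield (spotID, x, y) for every spot: rows lettered, columns numbered."""
    for y in range(1, rows + 1):
        letter = string.ascii_uppercase[y - 1]   # row 1 -> A, row 8 -> H
        for x in range(1, cols + 1):
            # Assumed ID format: row letter plus unpadded column number.
            yield "%s%d" % (letter, x), x, y


spots = list(alpha_spots())
```

For the 8 by 24 plate this yields 192 spot definitions, which is the set of SPOT records that would otherwise have to be written out by hand.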

NOTE: The order of scan definition lines is important! MS/MS spectra must appear immediately after the MS spectrum containing their precursors.

Example meta-data files are provided in the distribution. example1.t2m explicitly defines the MALDI plate, spots and scans. example2.t2m is equivalent, but uses the platedef keyword shortcut.

Examples

Convert the Q-Star example.wiff file to mzXML format, placing the output in the file qstar-example.xml.

C:\PyMsXML\example\wiff> pymsxml.cmd -R wiff -X mzXML -o qstar-example.xml example.wiff

Convert the Q-Star example.wiff file to mzData format, placing the output in the file example.mzData.

C:\PyMsXML\example\wiff> pymsxml.cmd -X mzData example.wiff
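The file name inference described for the -o, -X, and -Z options can be sketched as follows. The function names are hypothetical and this is not PyMsXML's own code; matching the extension without regard to case is an assumption.

```python
import os


def infer_output_name(raw_path, xml_format):
    """Swap the raw file's extension for the XML format name."""
    base, _ext = os.path.splitext(raw_path)
    return base + "." + xml_format


def infer_from_output(output_path):
    """Guess (xml_format, compression) from an explicit output file name."""
    compression = None
    name = output_path
    if name.lower().endswith(".gz"):
        compression = "gz"
        name = name[:-3]                 # drop ".gz" before checking the format
    ext = os.path.splitext(name)[1].lstrip(".")
    xml_format = None
    for fmt in ("mzXML", "mzData"):
        if ext.lower() == fmt.lower():   # assumption: case-insensitive match
            xml_format = fmt
    return xml_format, compression


inferred = infer_output_name("example.wiff", "mzData")
```

With these rules, example.wiff and -X mzData give example.mzData, while an output name such as example3.mzXML.gz implies both the mzXML format and gz compression. A name ending in .xml implies neither, so the format has to be given with -X.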

Convert the Mariner ESI-CE-MALDI spectra in example3.dat to mzXML format, placing the output in the file example3.mzXML.gz (automatically gzipped):

C:\PyMsXML\example\dat> pymsxml.cmd -R mariner -o example3.mzXML.gz example3.dat

Convert the AB4700 spectra listed in example1.t2m and example2.t2m to mzXML format, placing the output in the files example1.mzXML and example2.mzXML.

C:\PyMsXML\example\t2d> pymsxml.cmd -X mzXML example*.t2m

Release Notes

The Analyst COM libraries seem to have trouble with long pathnames. If you consistently have trouble getting PyMsXML to read ".wiff" files, try moving the files to a shorter directory path.

Credits

Development of PyMsXML was significantly helped by the open-source Visual Basic source code of the MzStar program from the ISB Glossolalia project.