PyMsXML
Introduction
PyMsXML is a Python script for converting vendor-specific mass spectrometry data files for Applied Biosystems' Q-Star, 4700, 4800, Mariner, and Voyager mass spectrometers from their raw binary form to either of two emerging XML file formats for mass spectra: mzXML, from the Sashimi Glossolalia project of the Institute for Systems Biology (ISB), and mzData, from the Proteomics Standards Initiative (PSI) of the Human Proteome Organization (HUPO).
PyMsXML uses installed vendor software under Windows to access the proprietary raw mass spectra file formats, interfacing to the vendor-supplied libraries via their COM interface. Unlike other software solutions that use this approach, PyMsXML is written in Python, a free, open-source language. As such, no installation of Microsoft Visual C++ or Visual Basic is necessary to use, alter, or improve the PyMsXML script.
PyMsXML is easily extended for new instruments and vendor software, and for new or changed XML file formats. The code that interfaces to the vendor software is separate from the code that formats the data as XML. As such, adding support for a new instrument does not require rewriting the XML formatting code. Similarly, as new XML file formats emerge, the code that interfaces with the instrument software need not change.
PyMsXML is hosted at bioinformatics.org.
NOTE! The vendor software (Analyst for .wiff files, Data Explorer for .t2d files) must be installed on the same computer as PyMsXML. These binary formats cannot be read without the vendor support libraries.
Installation
- Download and install the latest version of ActiveState ActivePython for Windows.
- Start the Pythonwin IDE (All Programs -> ActiveState ActivePython
2.4 -> Pythonwin IDE). From the Tools menu, select the "COM Makepy
utility" entry. In the popup window, select "ExploreDataObjects 1.0
Type Library (1.0)" to build a Python interface to Analyst's COM
libraries for reading .wiff files, or select "IDAExplorer 1.0 Type
Library (1.0)" to build a Python interface to Data Explorer's COM
libraries for reading ".dat" and ".t2d" files. Click OK. If you have
both pieces of software, repeat this step for each software package.
- Check the installation of COM library interfaces. If any of these
tests are unsuccessful, then PyMsXML will be unable to read the
corresponding raw datafiles.
For Analyst, these commands at the Pythonwin IDE command-line (copy-and-paste!) should elicit similar responses:
>>> from win32com.client import Dispatch
>>> Dispatch('Analyst.FMANSpecData')
<win32com.gen_py.ExploreDataObjects 1.0 Type Library.IFMANSpecData instance at 0x14421558>
>>> Dispatch('Analyst.FMANChromData')
<win32com.gen_py.ExploreDataObjects 1.0 Type Library.IFMANChromData instance at 0x14418408>
For Data Explorer, these commands at the Pythonwin IDE command-line (copy-and-paste!) should elicit similar responses:
>>> from win32com.client import Dispatch, gencache
>>> Dispatch('DataExplorer.Application',resultCLSID='{3FED40F1-D409-11D1-8B56-0060971CB54B}')
<COMObject DataExplorer.Application>
>>> gencache.EnsureModule('{06972F50-13F6-11D3-A5CB-0060971CB54B}',0,4,2)
<module 'win32com.gen_py.06972F50-13F6-11D3-A5CB-0060971CB54Bx0x4x2' from 'C:\Python24\lib\site-packages\win32com\gen_py\06972F50-13F6-11D3-A5CB-0060971CB54Bx0x4x2.py'>
- Download and unpack the PyMsXML scripts and examples. After unzipping PyMsXML, edit the file pymsxml.cmd to point to your Python installation (usually C:\Python24\python.exe) and your PyMsXML installation.
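The Analyst COM check from the installation steps above can also be scripted. The following is a minimal sketch, not part of PyMsXML: it assumes a current Python 3 with the pywin32 package available, the function name check_com_servers is hypothetical, and the ProgIDs are the ones used in the interactive commands above. The dispatch argument exists only so the logic can be exercised on a machine without pywin32.

```python
# ProgIDs taken from the interactive Analyst check in the installation steps.
ANALYST_PROG_IDS = ["Analyst.FMANSpecData", "Analyst.FMANChromData"]


def check_com_servers(prog_ids, dispatch=None):
    """Try to create each COM object; return {prog_id: None or error text}."""
    if dispatch is None:
        # Imported lazily so this module still loads where pywin32 is absent.
        from win32com.client import Dispatch as dispatch
    results = {}
    for prog_id in prog_ids:
        try:
            dispatch(prog_id)
            results[prog_id] = None  # object created, server is registered
        except Exception as exc:  # pywin32 reports failures as com_error
            results[prog_id] = str(exc)
    return results
```

On a machine with Analyst installed, calling check_com_servers(ANALYST_PROG_IDS) would be expected to map both ProgIDs to None. Any other value is the error text reported for that ProgID. The Data Explorer check is left out of this sketch because it also passes a resultCLSID argument and calls gencache.EnsureModule, as shown above.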
Usage
PyMsXML consists of a single Python script. A Windows cmd file wrapper is provided to take care of calling the Python interpreter appropriately.
- pymsxml [ options ] raw-spectra-data-file

Options:
- -R raw-format, --rawdata raw-format
  Valid raw-format values: wiff, qstar, t2d, ab4700, ab4800, voyager, mariner, mzXML. Optional if raw-spectra-data-file ends in .wiff, .t2m, or .mzXML.
- -X xml-format, --xmlformat xml-format
  Valid xml-format values: mzXML (ISB), mzData (HUPO). Optional if output-file ends in .mzXML or .mzData.
- -o output-file, --output output-file
  Name of output file. If omitted, and xml-format is supplied, then the output file name is inferred by changing the file extension of raw-spectra-data-file to xml-format.
- -p ms-levels, --peaks ms-levels
  Apply (vendor library) peak detection to spectra whose MS level is in ms-levels (comma separated). Available for the QStar (MS/MS spectra only), 4700, and 4800 raw formats only. Default: 2.
- -f filter-spec, --filter filter-spec
  Filter output scans by their meta-data. Filters are specified as a comma-separated list of filter tokens. Each filter token is specified as field.comparison.value. field must be an attribute of the scan object. comparison must be one of eq, ne, lt, le, gt, or ge, specifying =, ≠, <, ≤, >, and ≥ respectively.
- -V version, --version version
  XML version. mzXML only. Valid options: 2.1, 2.2, 3.0. Default: 3.0.
- -z, --compress_peaks
  Compress mzXML peak data using zlib. Default: No compression of peak data. Requires mzXML version 3.0.
- -Z compress-format, --compress compress-format
  Compress output file. Valid options: gz. Default: None, unless output file ends with .gz, then gz.
- -d, --debug
  Debug. Output XML for first 10 spectra only. Truncate spectral data, too. Useful to verify that the output is formatted correctly.
- -h, --help
  Help.
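To make the filter-spec syntax of the -f option concrete, here is a hedged Python sketch of how tokens of the form field.comparison.value could be parsed and applied. It is not the PyMsXML implementation. The attribute names msLevel and retentionTime are hypothetical, and two behaviours are assumptions: a scan must satisfy every token, and the text value is converted to the type of the scan attribute.

```python
import operator
from types import SimpleNamespace

# Comparison names from the option description, mapped to Python operators.
COMPARISONS = {
    "eq": operator.eq, "ne": operator.ne,
    "lt": operator.lt, "le": operator.le,
    "gt": operator.gt, "ge": operator.ge,
}


def parse_filter_spec(spec):
    """Split 'field.comparison.value,...' into (field, compare, value) triples."""
    filters = []
    for token in spec.split(","):
        # maxsplit=2 keeps any dots inside the value, e.g. 'x.gt.10.5'.
        field, comparison, value = token.strip().split(".", 2)
        if comparison not in COMPARISONS:
            raise ValueError("unknown comparison: %r" % comparison)
        filters.append((field, COMPARISONS[comparison], value))
    return filters


def scan_passes(scan, filters):
    """Return True if the scan satisfies every filter (AND is assumed)."""
    for field, compare, value in filters:
        actual = getattr(scan, field)
        # Assumption: coerce the text value to the attribute's own type.
        if not compare(actual, type(actual)(value)):
            return False
    return True


# Hypothetical scan object standing in for PyMsXML's scan meta-data.
scan = SimpleNamespace(msLevel=2, retentionTime=12.5)
filters = parse_filter_spec("msLevel.eq.2,retentionTime.gt.10.0")
print(scan_passes(scan, filters))
```

Splitting on at most two dots means field names cannot contain a dot, but values such as 10.0 survive intact.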
Applied Biosystems Q-Star Spectra
The raw spectra data files for the ESI spectra from Applied Biosystems' Q-Star instruments are usually extracted as ".wiff" files. These can be opened using Applied Biosystems' Analyst or BioAnalyst programs. PyMsXML uses Analyst's support libraries to extract mass spectra from these files.
Applied Biosystems Mariner, Voyager, 4700 Spectra
The raw spectra data files for the MALDI spectra from Applied Biosystems' Mariner, Voyager, 4700, and 4800 instruments are usually extracted as ".t2d" or ".dat" files. These can be opened using Applied Biosystems' Data Explorer program. PyMsXML uses Data Explorer's support libraries to extract mass spectra from these files. These file formats store very little meta-data in addition to the mass spectrum. If you do not care whether the output data file contains valid spot, matrix, and plate meta-data, and your spectra are all contained in one raw data file, then that file can be supplied on the command-line as raw-spectra-data-file. However, if you have many .dat or .t2d files, each corresponding to a MALDI spot, or if you wish to have spot, matrix, or plate meta-data populated into the output file, then see the next set of instructions.
Plate, matrix, and spot meta-data must be supplied in a meta-data text file, which is given on the command-line as raw-spectra-data-file. The meta-data file is most easily constructed in Excel and saved as tab-separated values, but it can be written by hand too, if desired. Each line of the meta-data file specifies a record describing the MALDI plate, the plate's spots, or the scans acquired from these spots. A shortcut record that defines the plate and spot naming convention is also provided.
The plate definition record consists of the word PLATE (case insensitive) in the first column, followed by alternating key-value pairs in subsequent columns. Particular key-value pairs do not need to be specified in any particular order. The following keys must be provided: plateID, spotXCount, spotYCount, plateManufacturer, and plateModel. The plateID value is referenced by the spot and scan definition records. The spotXCount is the number of MALDI spots in the horizontal dimension (integer). The spotYCount is the number of MALDI spots in the vertical dimension (integer). The plateManufacturer and plateModel values are inserted verbatim in the output XML.
The spot definition record consists of the word SPOT (case insensitive) in the first column, followed by alternating key-value pairs in subsequent columns. Particular key-value pairs do not need to be specified in any particular order. The following keys must be provided: plateID, spotID, spotXPosition, spotYPosition, and maldiMatrix. The plateID value must be defined by some plate definition record. The spotID is referenced by the scan definition records. The spotXPosition is the horizontal position of the spot on the plate (integer). The spotYPosition is the vertical position of the spot on the plate (integer). Spot positions can be numbered beginning at 0 or 1. The maldiMatrix value is inserted verbatim in the output XML.
The scan definition record consists of the word SCAN (case insensitive) in the first column, followed by alternating key-value pairs in subsequent columns. Particular key-value pairs do not need to be specified in any particular order. The following keys must be provided: plateID, spotID, filename, and index. The plateID must be defined by some plate definition record. The spotID must be defined by some spot definition record. The filename is the name of the ".dat" or ".t2d" file containing the corresponding scan's spectrum. The index is the ordinal of the corresponding spectrum in the provided file. Spectra within files should be referenced beginning at 1.
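The record layout described above (a keyword in the first column, then alternating key and value columns, tab separated) can be produced with a few lines of Python instead of Excel. This is only a sketch: the plate ID P1, the matrix name CHCA, and the .t2d file names are made-up illustration values, and the helper names record and write_metadata are hypothetical.

```python
import csv
import io


def record(kind, **fields):
    """Build one row: record keyword, then alternating key/value columns."""
    row = [kind]
    for key, value in fields.items():
        row.extend([key, str(value)])
    return row


rows = [
    record("PLATE", plateID="P1", spotXCount=24, spotYCount=8,
           plateManufacturer="ABI / SCIEX", plateModel="01-192+06-BB"),
    record("SPOT", plateID="P1", spotID="A1", spotXPosition=1,
           spotYPosition=1, maldiMatrix="CHCA"),
    # The MS scan is listed first; its MS/MS scan follows immediately.
    record("SCAN", plateID="P1", spotID="A1", filename="A1_MS.t2d", index=1),
    record("SCAN", plateID="P1", spotID="A1", filename="A1_MSMS.t2d", index=1),
]


def write_metadata(rows, stream):
    """Write the records to an open text stream as tab-separated values."""
    writer = csv.writer(stream, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)


buffer = io.StringIO()
write_metadata(rows, buffer)
print(buffer.getvalue())
```

To produce a file rather than a string, pass a handle opened with open(name, "w", newline="") to write_metadata.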
To alleviate some of the tedium of specifying the spot definition records, a shortcut plate definition record is provided. The platedef definition record consists of the word PLATEDEF (case insensitive) in the first column, followed by alternating key-value pairs in subsequent columns. Particular key-value pairs do not need to be specified in any particular order. The following keys must be provided: plateID, plateManufacturer, plateModel, spotNaming, and maldiMatrix. The plateID value is referenced by the spot and scan definition records. The plateManufacturer and plateModel are used to identify the properties of the MALDI plate. Currently, only the values ABI / SCIEX and 01-192+06-BB are recognized, but others are easily added on request. The "ABI / SCIEX 01-192+06-BB" plate consists of 8 rows of 24 spots (plus 6 calibration spots). The spotNaming must be one of alpha, parallel, or antiparallel. At this time, only alpha is implemented. The alpha spot naming constructs spotID values for all spots on the plate, with the row specified by a letter from A to H and the column specified by a number from 1 to 24. The maldiMatrix value is assumed to be the same for each spot and is inserted verbatim in the output XML.
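As an illustration of the alpha naming convention, the following Python sketch enumerates spot IDs for a plate of 8 rows and 24 columns. The function name alpha_spot_ids is hypothetical, and the exact ID format (a letter followed directly by an unpadded number, such as A1) is an assumption. The description above only states that rows are lettered A to H and columns numbered 1 to 24.

```python
import string


def alpha_spot_ids(rows=8, columns=24):
    """Return spot IDs row by row: letters for rows, numbers for columns."""
    letters = string.ascii_uppercase[:rows]
    # Assumption: no zero padding and no separator between letter and number.
    return [f"{letter}{column}"
            for letter in letters
            for column in range(1, columns + 1)]


spot_ids = alpha_spot_ids()
print(len(spot_ids), spot_ids[0], spot_ids[-1])
```

With the default arguments this yields 8 x 24 = 192 IDs, which matches the sample spot count of the 01-192+06-BB plate. The 6 calibration spots are not covered by this sketch.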
NOTE: The order of scan definition lines is important! MS/MS spectra must appear immediately after the MS spectrum containing their precursors.
Example meta-data files are provided in the distribution. example1.t2m explicitly defines the MALDI plate, spots, and scans. example2.t2m is equivalent, but uses the platedef keyword shortcut.
Examples
Convert the Q-Star example.wiff file to mzXML format, placing the output in the file qstar-example.xml.
C:\PyMsXML\example\wiff> pymsxml.cmd -R wiff -X mzXML -o qstar-example.xml example.wiff
Convert the Q-Star example.wiff file to mzData format, placing the output in the file example.mzData.
C:\PyMsXML\example\wiff> pymsxml.cmd -X mzData example.wiff
Convert the Mariner ESI-CE-MALDI spectra in example3.dat to mzXML format, placing the output in the file example3.mzXML.gz (automatically gzipped):
C:\PyMsXML\example\dat> pymsxml.cmd -R mariner -o example3.mzXML.gz example3.dat
Convert the AB4700 spectra listed in example1.t2m and example2.t2m to mzXML format, placing the output in the files example1.mzXML and example2.mzXML.
C:\PyMsXML\example\t2d> pymsxml.cmd -X mzXML example*.t2m
Release Notes
The Analyst COM libraries seem to have trouble with long pathnames. If you consistently have trouble getting PyMsXML to read ".wiff" files, try moving the files to a shorter directory path.
Credits
Development of PyMsXML was significantly helped by the open-source Visual Basic source code from the MzStar program of the ISB glossolalia project.