rainbow.agilent.masshunter
Methods for parsing Agilent Masshunter files.
Functions
bin_to_grid | Bins per-point (mz, intensity) values into a (retention time x mz) grid.
calibrate_mz | Converts a raw time-of-flight axis to calibrated mz values.
count_scans | Returns the total scan count from MSTS.xml, or None if MSTS.xml is absent.
decompress_inten_list | Decompresses the run-length-encoded intensity stream of a MSProfile.bin segment (see segment_is_rle).
parse_allfiles | Finds and parses Agilent Masshunter MS data files.
parse_default_masscal | Reads the polynomial ValueUseFlags for each calibration id from DefaultMassCal.xml.
parse_icpmsdata | Parses Agilent Masshunter ICP-MS data (MSProfile.bin).
parse_msdata | Parses Masshunter MS data.
parse_mspeakdata | Parses Masshunter centroided MS data stored in MSPeak.bin.
parse_scan_xsd | Parses MSScan.xsd into a dictionary describing its "complex" types.
read_complextype | Reads a "complex" type from f.
read_default_masscal_rows | Reads the default per-calibration-id calibration rows from DefaultMassCal.xml.
read_scan_records | Reads the scan records (ScanRecordType) from MSScan.bin, one per retention time.
read_type | Reads a type from f.
segment_is_rle | Returns whether a MSProfile.bin segment uses run-length encoding.
type_size | Returns the on-disk byte size of one MSScan.bin record of the given type.
- parse_allfiles(path, prec=0, hrms=False, centroid=False)[source]
Finds and parses Agilent Masshunter MS data files.
MassHunter stores a scan’s spectrum as a dense profile trace (MSProfile.bin) and/or a peak-picked centroid list (MSPeak.bin). Both are opt-in: hrms parses the profile and centroid parses the centroids (see parse_msdata and parse_mspeakdata). With neither flag set nothing is parsed here.
- Parameters:
path (str) – Path to the Agilent .D directory.
prec (int, optional) – Number of decimals to round ylabels.
hrms (bool, optional) – Parse the profile spectrum (MSProfile.bin).
centroid (bool, optional) – Parse the centroid spectrum (MSPeak.bin).
- Returns:
List containing a DataFile for each parsed file.
- parse_msdata(path, prec=0)[source]
Parses Masshunter MS data.
IMPORTANT: Masshunter MS data can be stored in either MSProfile.bin or MSPeak.bin. This method only supports parsing MSProfile.bin.
- The following files are used (in order of listing):
MSScan.xsd -> File structure of MSScan.bin.
MSScan.bin -> Offsets, compression info, and scan count.
MSMassCal.bin -> Calibration info for masses.
MSProfile.bin -> Actual data values.
The scan count is recovered by reading MSScan.bin to EOF, so MSTS.xml is not required. This lets us parse OpenLab .rslt/.sirslt result folders, which omit MSTS.xml.
Learn more about this file format here.
- Parameters:
path (str) – Path to the AcqData subdirectory.
prec (int, optional) – Number of decimals to round mz values.
- Returns:
DataFile containing Masshunter MS data.
- parse_icpmsdata(path, prec=0)[source]
Parses Agilent Masshunter ICP-MS data (MSProfile.bin).
ICP-MS acquisitions store an intensity for each isotope channel at every retention time. Unlike the HRMS MSProfile.bin parsed by parse_msdata, the ICP-MS MSProfile.bin is NOT LZF-compressed and is laid out as four parallel blocks per scan (channel index, reported value, raw pulse count, analog value). This parser reads the reported values, which are the intensities Masshunter reports in its CSV export.
The decoding was contributed by Jeremy Hourigan (UC Santa Cruz); see issue #25. It has been verified against an Agilent 8900 triple-quadrupole ICP-MS file. It currently supports time-resolved acquisitions with a single tune mode and one measurement per isotope; files with multiple tune modes or multiple measurements per isotope are not yet handled.
- The following files are used (in order of listing):
MSScan.xsd -> File structure of MSScan.bin.
MSScan.bin -> Per-scan retention time, offset, and point count.
MSTS_XSpecific.xml -> Number of isotope channels.
MSTS_XAddition.xml (parent dir) -> Real isotope m/z labels.
MSProfile.bin -> Actual data values (uncompressed).
- Parameters:
path (str) – Path to the AcqData subdirectory.
prec (int, optional) – Number of decimals to round m/z values.
- Returns:
DataFile containing Masshunter ICP-MS data.
- bin_to_grid(mz_arr, intensities, rows, num_times, prec)[source]
Bins per-point (mz, intensity) values into a (retention time x mz) grid.
Rounding to prec decimals puts the mz values on a discrete grid of spacing 10**-prec. Scaling them to integers lets us assign each point a column directly and sum with a single pass, avoiding the global sort that numpy.unique() / numpy.searchsorted() would do over every point. For wide mz ranges at high prec the dense grid would be too large, so above _MAX_DENSE_BINS we fall back to the sort-based mapping.
- Parameters:
mz_arr (np.ndarray) – Rounded mz value of every point, all scans concatenated.
intensities (np.ndarray) – uint64 intensity of every point.
rows (np.ndarray) – Retention-time (row) index of every point.
num_times (int) – Number of retention times (grid rows).
prec (int) – Number of decimals the mz values were rounded to.
- Returns:
the sorted unique mz values that occur, and the (num_times, mz_ylabels.size) uint64 intensity grid.
- Return type:
Tuple (mz_ylabels, data)
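The dense binning described above can be sketched in pure Python. This is only an illustration of the idea, not the library's NumPy implementation: the name bin_to_grid_dense is hypothetical, plain lists stand in for arrays, and the _MAX_DENSE_BINS fallback to sort-based mapping is left out.

```python
def bin_to_grid_dense(mz_values, intensities, rows, num_times, prec):
    """Sum intensities into a (num_times x n_unique_mz) grid without sorting points."""
    scale = 10 ** prec
    # Scale the rounded mz values to integers so each value maps to one bin.
    keys = [round(mz * scale) for mz in mz_values]
    lo, hi = min(keys), max(keys)

    # Mark which integer bins between lo and hi actually occur.
    occupied = [False] * (hi - lo + 1)
    for key in keys:
        occupied[key - lo] = True

    # Number the occupied bins consecutively; they come out in ascending mz order.
    column_of = {}
    mz_ylabels = []
    for offset, used in enumerate(occupied):
        if used:
            column_of[offset] = len(mz_ylabels)
            mz_ylabels.append(round((lo + offset) / scale, prec))

    # Single pass over the points: add each intensity to its (row, column) cell.
    data = [[0] * len(mz_ylabels) for _ in range(num_times)]
    for key, inten, row in zip(keys, intensities, rows):
        data[row][column_of[key - lo]] += inten
    return mz_ylabels, data


labels, grid = bin_to_grid_dense(
    [100.1, 100.1, 100.2, 100.3], [5, 3, 7, 2], [0, 0, 0, 1], num_times=2, prec=1
)
```

The two points at mz 100.1 in the first scan land in the same cell and are summed, which is the behaviour the single-pass approach relies on.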
- parse_mspeakdata(path, prec=0)[source]
Parses Masshunter centroided MS data stored in MSPeak.bin.
MSPeak.bin holds the peak-picked (centroid) spectrum of each scan - a list of (mz, intensity) pairs - in contrast to the dense profile trace in MSProfile.bin (parse_msdata). GC quadrupole acquisitions store only centroids; Q-TOF/TOF acquisitions store a profile block and a centroid block per scan (see read_scan_records), and this reads the centroid one.
- The following files are used:
MSScan.xsd -> File structure of MSScan.bin.
MSScan.bin -> Per-scan metadata and pointers into MSPeak.bin.
MSPeak.bin -> Raw (mz, intensity) peak pairs.
The MSPeak.bin centroid decoding was contributed by denisshragin (issue #37).
- Parameters:
path (str) – Path to the AcqData subdirectory.
prec (int, optional) – Number of decimals to round mz values.
- Returns:
DataFile containing Masshunter centroided MS data.
- parse_default_masscal(xml_path)[source]
Reads the polynomial ValueUseFlags for each calibration id from DefaultMassCal.xml.
Each DefaultCalibration has a Polynomial step whose ValueUseFlags is a bitmask: bit k (counting from the least significant) being set means the polynomial includes a term of order k, and the active coefficients in MSMassCal.bin fill those orders in ascending order. A flag of 0 (or a missing file) means no polynomial refinement - only the traditional calibration is used.
- Parameters:
xml_path (str) – Path to DefaultMassCal.xml.
- Returns:
Dictionary mapping calibration id (int) to its ValueUseFlags (int). Empty if the file does not exist.
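The bitmask reading of ValueUseFlags can be illustrated with a short sketch. The helper names active_orders and assign_coefficients are hypothetical, and the sketch assumes the active coefficients are the leading entries of the stored coefficient list.

```python
def active_orders(use_flags):
    """Return the polynomial orders whose bits are set, in ascending order."""
    flags = use_flags or 0  # None and 0 both mean "no polynomial refinement"
    return [k for k in range(flags.bit_length()) if (flags >> k) & 1]


def assign_coefficients(use_flags, coeffs):
    """Pair stored coefficients with the active orders, lowest order first.

    Assumption: the active coefficients come first in `coeffs`.
    """
    return dict(zip(active_orders(use_flags), coeffs))


orders = active_orders(0b101101)                         # bits 0, 2, 3 and 5 are set
terms = assign_coefficients(0b110, [1.5, -2.0, 0.0, 0.0, 0.0, 0.0])
```

Here a flag of 0b110 selects orders 1 and 2, so only the first two stored coefficients are used.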
- read_default_masscal_rows(xml_path)[source]
Reads the default per-calibration-id calibration rows from DefaultMassCal.xml.
Used as the fallback when the per-scan MSMassCal.bin is absent. Each DefaultCalibration provides a Traditional step (coeff, base) and, optionally, a Polynomial step (left, right, then six coefficients). Together these are the ten doubles MSMassCal.bin would otherwise store per scan - [coeff, base, left, right, c0..c5] - so a row can stand in for a MSMassCal.bin row directly (see calibrate_mz). The polynomial’s ValueUseFlags is read separately by parse_default_masscal.
- Parameters:
xml_path (str) – Path to DefaultMassCal.xml.
- Returns:
Dictionary mapping calibration id (int) to a length-10 list of doubles. Empty if the file does not exist or defines no traditional calibration.
- calibrate_mz(tof, calib_row, use_flags)[source]
Converts a raw time-of-flight axis to calibrated mz values.
The traditional calibration is mz = (coeff * (tof - base))**2. When a polynomial refinement is active (use_flags truthy), a correction is subtracted: the six MSMassCal.bin coefficients are assigned to the polynomial orders whose bits are set in use_flags (ascending), and the polynomial is evaluated on the time-of-flight clipped to [left, right]. This matches the masses Agilent MassHunter reports (validated to <0.0001 Da against exported spectra); without it the masses are off by ~1-2 ppm.
- Parameters:
tof (np.ndarray) – Raw time-of-flight values for one scan.
calib_row (np.ndarray) – The scan’s 10 MSMassCal.bin doubles (coeff, base, left, right, and six polynomial coefficients).
use_flags (int or None) – The polynomial ValueUseFlags for this scan’s calibration id, or None/0 to apply only the traditional formula.
- Returns:
A numpy array of calibrated mz values.
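The calibration can be sketched as follows. This is a hedged pure-Python illustration, not the library's code: calibrate_mz_sketch is a hypothetical name, it works on plain lists, and it assumes the polynomial correction is subtracted from the traditional mz value and that the active coefficients are the leading ones of the six.

```python
def calibrate_mz_sketch(tof, calib_row, use_flags):
    """Traditional calibration, optionally refined by a polynomial correction."""
    coeff, base, left, right = calib_row[:4]
    poly_coeffs = calib_row[4:10]

    # Orders whose bits are set in use_flags, ascending (empty for None or 0).
    flags = use_flags or 0
    orders = [k for k in range(flags.bit_length()) if (flags >> k) & 1]

    calibrated = []
    for t in tof:
        mz = (coeff * (t - base)) ** 2
        if orders:
            # Evaluate the polynomial on the time-of-flight clipped to [left, right].
            x = min(max(t, left), right)
            correction = sum(c * x ** k for k, c in zip(orders, poly_coeffs))
            # Assumption: the correction is subtracted from the traditional mz.
            mz -= correction
        calibrated.append(mz)
    return calibrated


row = [0.5, 2.0, 0.0, 10.0, 0.25, 0.01, 0.0, 0.0, 0.0, 0.0]
plain = calibrate_mz_sketch([6.0, 22.0], row, None)
refined = calibrate_mz_sketch([6.0, 22.0], row, 0b11)
```

The second time-of-flight value (22.0) lies outside [left, right], so the polynomial is evaluated at the clipped value 10.0 while the traditional term still uses 22.0.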
- segment_is_rle(comp_bytes, num_mz)[source]
Returns whether a MSProfile.bin segment uses run-length encoding.
RLE segments leave the 16-byte (smallest mz, mz delta) header raw and follow it with an intensity stream whose first 4 bytes are a little-endian word: the low 3 bytes hold the point count and the high byte is a fixed 0x90 marker. Both must match for us to treat the segment as RLE, which makes this a self-validating check rather than a guess (LZF-compressed segments effectively never satisfy it). See decompress_inten_list.
- Parameters:
comp_bytes (bytes) – The raw segment bytes read from MSProfile.bin.
num_mz (int) – The expected number of mz-intensity pairs.
- Returns:
True if the segment is RLE-encoded, False otherwise.
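The check described above amounts to decoding one 32-bit word. A minimal sketch follows; looks_like_rle is a hypothetical name, and the test segment assumes the 16-byte header is two little-endian doubles, which the text does not state.

```python
import struct

HEADER_SIZE = 16   # raw (smallest mz, mz delta) header
RLE_MARKER = 0x90  # fixed high byte of the point-count word


def looks_like_rle(segment, num_mz):
    """Return True if the word after the header carries num_mz and the RLE marker."""
    if len(segment) < HEADER_SIZE + 4:
        return False
    (word,) = struct.unpack_from("<I", segment, HEADER_SIZE)
    # Low 3 bytes: point count. High byte: marker. Both must match.
    return (word & 0xFFFFFF) == num_mz and (word >> 24) == RLE_MARKER


header = struct.pack("<dd", 50.0, 0.01)  # assumed header layout, for illustration
segment = header + struct.pack("<I", (RLE_MARKER << 24) | 1000)
```

Requiring both the marker and the expected point count is what makes a false positive on LZF-compressed bytes very unlikely.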
- decompress_inten_list(comp_view, num_mz)[source]
Decompresses the run-length-encoded intensity stream of a MSProfile.bin segment (see segment_is_rle). Q-TOF profile acquisitions store intensities this way instead of LZF-compressing them (issue #27).
The stream begins with a 4-byte little-endian word holding the point count (low 3 bytes) and a fixed 0x90 marker (high byte), followed by two little-endian int32s: an initial count of leading zero intensities and a width flag (both stored negated). The width flag is 1, 2, 3 or 4, mapping to a 1-, 2-, 4- or 8-byte signed integer. The remaining values are then read at the current width:
A non-negative value is a literal intensity.
A negative value -v encodes divmod(v, 4): the quotient is a run of zero intensities to emit, and the remainder is the new width flag to switch to for subsequent values.
Trailing zero intensities are not stored, so the output is pre-filled with zeros to length num_mz.
- Parameters:
comp_view (memoryview) – Segment bytes after the 16-byte header.
num_mz (int) – The number of mz-intensity pairs (output length).
- Returns:
A numpy array of num_mz uint32 intensities.
- Raises:
ValueError – If the stream is malformed (bad width flag, runs past the point count, or is truncated).
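The decoding rules above can be sketched in pure Python. This is an illustration under stated assumptions rather than the library's implementation: decode_rle_intensities is a hypothetical name, it returns a list instead of a uint32 array, and because the text only documents width flags 1 to 4, any other flag (including a remainder of 0) is treated as malformed here, which may be stricter than the real parser.

```python
import struct

# Width flag -> struct format for a little-endian signed integer of that width.
WIDTH_FORMATS = {1: "<b", 2: "<h", 3: "<i", 4: "<q"}


def decode_rle_intensities(stream, num_mz):
    """Decode the intensity stream that follows the 16-byte segment header."""
    if len(stream) < 12:
        raise ValueError("stream is truncated")
    word, neg_zeros, neg_flag = struct.unpack_from("<Iii", stream, 0)
    if (word & 0xFFFFFF) != num_mz or (word >> 24) != 0x90:
        raise ValueError("missing point-count word or 0x90 marker")

    out = [0] * num_mz   # trailing zeros are implicit
    pos = -neg_zeros     # leading zero count is stored negated
    flag = -neg_flag     # width flag is stored negated
    if not 0 <= pos <= num_mz:
        raise ValueError("leading zero run is out of range")

    offset = 12
    while offset < len(stream):
        fmt = WIDTH_FORMATS.get(flag)
        if fmt is None:
            # Assumption: flags outside 1..4 are rejected.
            raise ValueError(f"bad width flag {flag}")
        size = struct.calcsize(fmt)
        if offset + size > len(stream):
            raise ValueError("stream is truncated")
        (value,) = struct.unpack_from(fmt, stream, offset)
        offset += size
        if value >= 0:
            if pos >= num_mz:
                raise ValueError("stream runs past the point count")
            out[pos] = value            # literal intensity
            pos += 1
        else:
            run, flag = divmod(-value, 4)  # zero run, then new width flag
            pos += run
            if pos > num_mz:
                raise ValueError("stream runs past the point count")
    return out


# Two leading zeros, 1-byte literals 5 and 7, three zeros plus a switch to
# 2-byte values (3 * 4 + 2 = 14, stored as -14), then the literal 300.
stream = (
    struct.pack("<Iii", (0x90 << 24) | 10, -2, -1)
    + struct.pack("<bbb", 5, 7, -14)
    + struct.pack("<h", 300)
)
decoded = decode_rle_intensities(stream, 10)
```

The last two points of the example are never stored; they remain zero because the output is pre-filled.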
- parse_scan_xsd(xsd_path)[source]
Parses MSScan.xsd into a dictionary describing its “complex” types.
There are “simple” types that translate directly into number types, and “complex” types made up of other “simple” and “complex” types. The returned dictionary maps each complex type’s name to a list of its (name, type) members, which enables the recursive parsing in read_complextype.
- Parameters:
xsd_path (str) – Path to MSScan.xsd.
- Returns:
Dictionary mapping complex type names to lists of (name, type) tuples.
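A minimal sketch of building such a dictionary with the standard library follows. The function name parse_complex_types and the sample schema are illustrative: only ScanRecordType and SpectrumParamValues appear in this documentation, the other member names are made up, and the sketch assumes every complex type is a top-level xs:complexType with an xs:sequence of elements.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

# Toy schema; member names other than SpectrumParamValues are hypothetical.
SAMPLE_XSD = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:complexType name="ScanRecordType">
    <xs:sequence>
      <xs:element name="ScanID" type="xs:int"/>
      <xs:element name="ScanTime" type="xs:double"/>
      <xs:element name="SpectrumParamValues" type="SpectrumParamsType"
                  maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="SpectrumParamsType">
    <xs:sequence>
      <xs:element name="SpectrumOffset" type="xs:long"/>
      <xs:element name="PointCount" type="xs:int"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>"""


def parse_complex_types(xsd_text):
    """Map each top-level complex type name to its ordered (name, type) members."""
    root = ET.fromstring(xsd_text)
    complex_types = {}
    for ctype in root.findall(XS + "complexType"):
        members = ctype.findall(f"{XS}sequence/{XS}element")
        complex_types[ctype.get("name")] = [
            (el.get("name"), el.get("type")) for el in members
        ]
    return complex_types


types = parse_complex_types(SAMPLE_XSD)
```

Keeping the members as an ordered list matters because the binary records are read field by field in schema order.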
- type_size(complextypes_dict, name)[source]
Returns the on-disk byte size of one MSScan.bin record of the given type.
Mirrors read_complextype / read_type: each member is counted once, so for ScanRecordType this is the size of a record with a single SpectrumParamValues block. Lets read_scan_records reason about the record stride without reading the file.
- Parameters:
complextypes_dict (dict) – Output of parse_scan_xsd.
name (str) – A simple (“xs:int”, …) or complex type name.
- Returns:
Size in bytes (int).
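The size computation is a straightforward recursion over the type dictionary. The sketch below uses a hypothetical name (type_size_sketch), an assumed table of simple-type byte sizes, and made-up member names apart from SpectrumParamValues.

```python
# Assumed on-disk sizes of the simple types, in bytes.
SIMPLE_SIZES = {
    "xs:byte": 1,
    "xs:short": 2,
    "xs:int": 4,
    "xs:long": 8,
    "xs:float": 4,
    "xs:double": 8,
}


def type_size_sketch(complex_types, name):
    """Return the byte size of one value of the named type, counting each member once."""
    if name in SIMPLE_SIZES:
        return SIMPLE_SIZES[name]
    return sum(type_size_sketch(complex_types, mtype) for _, mtype in complex_types[name])


toy_types = {
    "ScanRecordType": [
        ("ScanID", "xs:int"),
        ("ScanTime", "xs:double"),
        ("SpectrumParamValues", "SpectrumParamsType"),
    ],
    "SpectrumParamsType": [("SpectrumOffset", "xs:long"), ("PointCount", "xs:int")],
}
record_size = type_size_sketch(toy_types, "ScanRecordType")
block_size = type_size_sketch(toy_types, "SpectrumParamsType")
```

With these toy types a single-block record is 24 bytes, of which 12 belong to the repeated block.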
- read_scan_records(msscan_path, complextypes_dict, num_records=None)[source]
Reads the scan records (ScanRecordType) from MSScan.bin, one per retention time.
Each record holds the scalar scan fields followed by one or more SpectrumParamValues blocks - the schema element is maxOccurs="unbounded". A profile-only acquisition writes a single block, but an acquisition that also stores centroids (MSPeak.bin) writes a profile block and a centroid block, making the record larger. read_complextype reads only the first block, so reading records back-to-back would mis-parse the trailing block(s) as the next record.
To stay aligned we read each record at the true record stride and skip to the next. The stride is scalar + n * block for some block count n; it is taken from the MSTS.xml scan count when that is consistent, otherwise inferred as the value of n that tiles the record region exactly. The first block of each record is the profile spectrum, which is what parse_msdata consumes. Only when no stride tiles the region (a record truncated mid-write) do we fall back to reading single-block records to EOF.
- Parameters:
msscan_path (str) – Path to MSScan.bin.
complextypes_dict (dict) – Output of parse_scan_xsd.
num_records (int, optional) – Scan count from MSTS.xml (count_scans), used as a hint. May be None (OpenLab result folders omit MSTS.xml) or stale (interrupted acquisitions); the stride is validated against the file geometry either way.
- Returns:
List of dictionaries, one per retention time, each mapping the ScanRecordType member names to their parsed values.
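The stride reasoning can be sketched as plain arithmetic. Everything here is illustrative: infer_block_count and max_blocks are hypothetical names, region_size stands for the number of bytes occupied by the records, and when several block counts tile the region the sketch simply takes the smallest, which is a guess about tie-breaking.

```python
def infer_block_count(region_size, scalar_size, block_size, num_records=None, max_blocks=8):
    """Return the block count n with stride = scalar_size + n * block_size, or None."""
    # First try the scan-count hint, but only if it is consistent with the geometry.
    if num_records:
        stride, leftover = divmod(region_size, num_records)
        extra = stride - scalar_size
        if leftover == 0 and extra >= block_size and extra % block_size == 0:
            return extra // block_size

    # Otherwise look for a block count whose stride tiles the region exactly.
    for n in range(1, max_blocks + 1):
        if region_size % (scalar_size + n * block_size) == 0:
            return n
    return None  # no stride fits, e.g. a record truncated mid-write


# 100 records with 20 scalar bytes and two 12-byte blocks each: stride 44.
with_hint = infer_block_count(4400, 20, 12, num_records=100)
stale_hint = infer_block_count(4400, 20, 12, num_records=90)
no_hint = infer_block_count(4400, 20, 12)
truncated = infer_block_count(4390, 20, 12)
```

A stale hint does not divide the region evenly, so it is ignored and the geometry decides, matching the validation described above.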
- count_scans(acqdata_path)[source]
Returns the total scan count from MSTS.xml, or None if MSTS.xml is absent.
MSTS.xml lists the number of scans per acquisition time segment (<NumOfScans>); the total is their sum. Agilent OpenLab .rslt/.sirslt result folders omit MSTS.xml, in which case read_scan_records infers the scan count from the record geometry instead.
- Parameters:
acqdata_path (str) – Path to the AcqData subdirectory.
- Returns:
Total scan count (int), or None if MSTS.xml does not exist.
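A minimal sketch of this behaviour with the standard library is shown below. The name count_scans_sketch and the sample XML are illustrative: only the NumOfScans element is taken from the text, the surrounding element names are made up, and the sketch assumes MSTS.xml uses no XML namespace.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Toy MSTS.xml; element names other than NumOfScans are hypothetical.
SAMPLE_MSTS = """<TimeSegments>
  <TimeSegment><NumOfScans>120</NumOfScans></TimeSegment>
  <TimeSegment><NumOfScans>80</NumOfScans></TimeSegment>
</TimeSegments>"""


def count_scans_sketch(acqdata_path):
    """Sum every NumOfScans value in MSTS.xml, or return None if the file is absent."""
    msts_path = os.path.join(acqdata_path, "MSTS.xml")
    if not os.path.isfile(msts_path):
        return None
    root = ET.parse(msts_path).getroot()
    return sum(int(el.text) for el in root.iter("NumOfScans"))


with tempfile.TemporaryDirectory() as acqdata:
    missing = count_scans_sketch(acqdata)  # no MSTS.xml yet
    with open(os.path.join(acqdata, "MSTS.xml"), "w") as fh:
        fh.write(SAMPLE_MSTS)
    total = count_scans_sketch(acqdata)
```

Returning None rather than 0 for a missing file lets a caller distinguish "no hint available" from "zero scans recorded".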
- read_complextype(f, complextypes_dict, name)[source]
Reads a “complex” type from f. Used only for MSScan.bin.
Mutually recurs with read_type.
- Parameters:
f (_io.BufferedReader) – File opened in ‘rb’ mode.
complextypes_dict (dict) – Dictionary defining all “complex” types.
name (str) – Name of the “complex” type to parse.
- Returns:
Dictionary mapping subtype names to values. If the subtype is “complex”, the value is a nested dictionary. Otherwise, the value is a number.
- read_type(f, complextype_dict, name)[source]
Reads a type from f. Used only for MSScan.bin.
Mutually recurs with read_complextype.
- Parameters:
f (_io.BufferedReader) – File opened in ‘rb’ mode.
complextype_dict (dict) – Dictionary defining all “complex” types.
name (str) – Name of the type to parse.
- Returns:
If the type is “simple”, a number value. If the type is “complex”, a dictionary mapping names to values.
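The mutual recursion between the two readers can be sketched as follows. The function names carry a _sketch suffix to mark them as illustrations, the simple-type formats assume little-endian encoding, and the member names other than SpectrumParamValues are made up.

```python
import io
import struct

# Assumed little-endian struct formats for the simple types.
SIMPLE_FORMATS = {
    "xs:short": "<h",
    "xs:int": "<i",
    "xs:long": "<q",
    "xs:float": "<f",
    "xs:double": "<d",
}


def read_type_sketch(f, complex_types, name):
    """Read a number for a simple type, or recurse into a complex type."""
    fmt = SIMPLE_FORMATS.get(name)
    if fmt is None:
        return read_complextype_sketch(f, complex_types, name)
    (value,) = struct.unpack(fmt, f.read(struct.calcsize(fmt)))
    return value


def read_complextype_sketch(f, complex_types, name):
    """Read each member in schema order into a (possibly nested) dictionary."""
    return {
        member: read_type_sketch(f, complex_types, mtype)
        for member, mtype in complex_types[name]
    }


toy_types = {
    "ScanRecordType": [
        ("ScanID", "xs:int"),
        ("ScanTime", "xs:double"),
        ("SpectrumParamValues", "SpectrumParamsType"),
    ],
    "SpectrumParamsType": [("SpectrumOffset", "xs:long"), ("PointCount", "xs:int")],
}
raw = io.BytesIO(struct.pack("<idqi", 7, 1.25, 4096, 512))
record = read_complextype_sketch(raw, toy_types, "ScanRecordType")
```

Because each member is read exactly once, a record with more than one SpectrumParamValues block leaves unread bytes behind, which is why read_scan_records skips ahead by the record stride.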