rainbow.agilent.masshunter

Methods for parsing Agilent Masshunter files.

Functions

`bin_to_grid`(mz_arr, intensities, rows, ...)	Bins per-point (mz, intensity) values into a (retention time x mz) grid.
`calibrate_mz`(tof, calib_row, use_flags)	Converts a raw time-of-flight axis to calibrated mz values.
`count_scans`(acqdata_path)	Returns the total scan count from MSTS.xml, or None if MSTS.xml is absent.
`decompress_inten_list`(comp_view, num_mz)	Decompresses the run-length-encoded intensity stream of a MSProfile.bin segment (see `segment_is_rle`).
`parse_allfiles`(path[, precision, hrms, ...])	Finds and parses Agilent Masshunter MS data files.
`parse_default_masscal`(xml_path)	Reads the polynomial `ValueUseFlags` for each calibration id from DefaultMassCal.xml.
`parse_icpmsdata`(path[, precision])	Parses Agilent Masshunter ICP-MS data (MSProfile.bin).
`parse_msdata`(path[, precision, bin_width])	Parses Masshunter MS data.
`parse_mspeakdata`(path[, precision])	Parses Masshunter centroided MS data stored in MSPeak.bin.
`parse_scan_xsd`(xsd_path)	Parses MSScan.xsd into a dictionary describing its "complex" types.
`read_complextype`(f, complextypes_dict, name)	Reads a "complex" type from `f`.
`read_default_masscal_rows`(xml_path)	Reads the default per-calibration-id calibration rows from DefaultMassCal.xml.
`read_scan_records`(msscan_path, complextypes_dict[, ...])	Reads the scan records (ScanRecordType) from MSScan.bin, one per retention time.
`read_type`(f, complextype_dict, name)	Reads a type from `f`.
`segment_is_rle`(comp_bytes, num_mz)	Returns whether a MSProfile.bin segment uses run-length encoding.
`type_size`(complextypes_dict, name)	Returns the on-disk byte size of one MSScan.bin record of the given type.

Classes

ProfileDataFile(path, xlabels, tof, data, ...)

A high-resolution profile spectrum whose m/z axis is per-scan.

class ProfileDataFile(path, xlabels, tof, data, calib, use_flags, metadata, mz_decimals=4)[source]

Bases: DataFile

A high-resolution profile spectrum whose m/z axis is per-scan.

Unlike a regular DataFile, an HRMS profile has no single m/z axis. Every scan is sampled on the same raw flight-time grid (tof), so a column index is the same physical bin in every scan, but the flight-time-to-m/z calibration drifts from scan to scan. The m/z of a point therefore depends on both the scan and the point. Access a scan’s m/z with mass_labels() or scan(); reading ylabels raises an exception, because a single shared m/z axis does not exist (see HRMS Profile Data: Why Each Scan Has Its Own m/z).

Attributes:

tof (numpy.ndarray) – The shared flight-time axis, one value per column of data, identical for every scan.
data (numpy.ndarray) – 2D intensities, shape (num_scans, num_points). Rows are scans (retention times); columns are flight-time bins.
xlabels (numpy.ndarray) – Retention time of each scan (row).
mz_decimals (int or None) – Decimals to round reported m/z to; None keeps full float precision.
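
To make the attribute layout concrete, here is a minimal sketch with made-up numbers. The flight times and the (coeff, base) pairs are illustrative assumptions, not values from a real file, and only the traditional formula documented under calibrate_mz is used. It shows the same column mapping to slightly different m/z values in two scans.

```python
# Shared flight-time grid: one value per column, identical for every scan.
tof = [1000.0, 1000.5, 1001.0]

# Hypothetical per-scan calibration (coeff, base) pairs; in a real file
# these would come from MSMassCal.bin or DefaultMassCal.xml.
calib = [(0.0100, 2.0), (0.0100, 2.1)]


def scan_mz(i):
    """m/z axis of scan i using only mz = (coeff * (tof - base)) ** 2."""
    coeff, base = calib[i]
    return [(coeff * (t - base)) ** 2 for t in tof]


mz0 = scan_mz(0)
mz1 = scan_mz(1)

# Column 0 is the same physical bin in both scans, yet its m/z differs.
drift = mz0[0] - mz1[0]
```

Because drift is nonzero, no single list of column labels could describe both rows, which is why the m/z axis is exposed per scan.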

property ylabels

extract_traces(labels=None)[source]

Extracts data corresponding to the specified labels.

Raises an exception if any labels are invalid.

Parameters: labels (int/float/list, optional) – Ylabel(s) to extract.
Returns: 2D numpy array containing data for the specified ylabel(s). The rows correspond to the ylabels and the columns correspond to the retention times.

export_csv(filename, labels=None, delim=',')[source]

Outputs a CSV containing data for the specified labels.

Parameters:

filename (str) – Filename for the output CSV.
labels (int/float/list, optional) – Ylabel(s) to export.
delim (str, optional) – Delimiter used in the output CSV.

to_csvstr(labels=None, delim=',')[source]

Returns a string representation of a CSV containing data for the specified labels.

Parameters:

labels (int/float/list, optional) – Ylabel(s) to return.
delim (str, optional) – Delimiter used in the CSV representation.

plot(label, **kwargs)[source]

Shows a basic matplotlib plot for the specified label.

Parameters:

label (int/float) – Ylabel to be plotted.
**kwargs (optional) – Keyword arguments for matplotlib.

mass_labels(i)[source]: The calibrated m/z values for scan i (rounded to mz_decimals).

scan(i)[source]: The decoded spectrum of scan i as (mass_labels, intensities), i.e. the per-scan m/z axis and that scan’s intensities, with no binning and no inserted zeros.

get_info()[source]: Returns a string summary of the DataFile.

parse_allfiles(path, precision='auto', hrms=False, centroid=False, bin_width=None)[source]

Finds and parses Agilent Masshunter MS data files.

MassHunter stores a scan’s spectrum as a dense profile trace (MSProfile.bin) and/or a peak-picked centroid list (MSPeak.bin). Both are opt-in: hrms parses the profile and centroid parses the centroids (see parse_msdata and parse_mspeakdata). With neither flag set nothing is parsed here.

Parameters:

path (str) – Path to the Agilent .D directory.
precision (int or str, optional) – Number of decimals to round m/z to. 'auto' (the default) resolves per file: 4 for the profile and TOF centroids, 0 for unit-resolution (GC/quadrupole) centroids.
hrms (bool, optional) – Parse the profile spectrum (MSProfile.bin).
centroid (bool, optional) – Parse the centroid spectrum (MSPeak.bin).
bin_width (float, optional) – For the profile, omit (the default) to keep the per-scan representation (ProfileDataFile); pass a width in daltons to project onto the shared m/z grid; see parse_msdata.

Returns:

List containing a DataFile for each parsed file.

parse_msdata(path, precision='auto', bin_width=None)[source]

Parses Masshunter MS data.

IMPORTANT: Masshunter MS data can be stored in either MSProfile.bin or MSPeak.bin. This method parses only MSProfile.bin; use parse_mspeakdata for MSPeak.bin.

The following files are used (in order of listing):

MSScan.xsd -> File structure of MSScan.bin.
MSScan.bin -> Offsets, compression info, and scan count.
MSMassCal.bin -> Calibration info for masses.
MSProfile.bin -> Actual data values.

The scan count is recovered by reading MSScan.bin to EOF, so MSTS.xml is not required. This lets us parse OpenLab .rslt/.sirslt result folders, which omit MSTS.xml.

Learn more about this file format here.

With no bin_width (the default) the per-scan representation is returned: a list of ProfileDataFile objects (one per flight-time grid), each keeping the raw intensities and exposing the per-scan m/z via scan(i). Pass a bin_width to project the spectra onto a single shared m/z grid, which is convenient for extracted-ion chromatograms and heatmaps but inserts zeros and loses resolution for high-resolution data (see HRMS Profile Data: Why Each Scan Has Its Own m/z). The bin width is independent of the precision label rounding; it is what turns binning on.

Parameters:

path (str) – Path to the AcqData subdirectory.
precision (int or str, optional) – Number of decimals to round mz labels to. 'auto' (the default) resolves to 4 for this high-resolution data.
bin_width (float, optional) – Omit (the default) to return the per-scan representation (a list of ProfileDataFile, one per flight-time grid); pass a width in daltons to project onto the shared grid.

Returns:

A list of ProfileDataFile (one per grid), or, when a bin_width is given, a single DataFile on the shared grid.

parse_icpmsdata(path, precision='auto')[source]

Parses Agilent Masshunter ICP-MS data (MSProfile.bin).

ICP-MS acquisitions store an intensity for each isotope channel at every retention time. Unlike the HRMS MSProfile.bin parsed by parse_msdata, the ICP-MS MSProfile.bin is NOT LZF-compressed and is laid out as four parallel blocks per scan (channel index, reported value, raw pulse count, analog value). This parser reads the reported values, which are the intensities Masshunter reports in its CSV export.

The decoding was contributed by Jeremy Hourigan (UC Santa Cruz); see issue #25. It has been verified against an Agilent 8900 triple-quadrupole ICP-MS file. It currently supports time-resolved acquisitions with a single tune mode and one measurement per isotope; files with multiple tune modes or multiple measurements per isotope are not yet handled.

The following files are used (in order of listing):

MSScan.xsd -> File structure of MSScan.bin.
MSScan.bin -> Per-scan retention time, offset, and point count.
MSTS_XSpecific.xml -> Number of isotope channels.
MSTS_XAddition.xml (parent dir) -> Real isotope m/z labels.
MSProfile.bin -> Actual data values (uncompressed).

Parameters:

path (str) – Path to the AcqData subdirectory.
precision (int or str, optional) – Number of decimals to round m/z values to. The default is 'auto'.

Returns:

DataFile containing Masshunter ICP-MS data.

bin_to_grid(mz_arr, intensities, rows, num_times, precision, bin_width=None)[source]

Bins per-point (mz, intensity) values into a (retention time x mz) grid.

The shared-grid bin width is decoupled from the label precision. precision only sets how many decimals the returned m/z labels are rounded to, while bin_width (in daltons) sets how wide each bin is - i.e. how aggressively points from different scans are pooled into one column. When bin_width is None it defaults to 10**-precision (one bin per labelled m/z), which is the historical behavior.

Each point is assigned to bin round(mz / bin_width); mapping those bins to integers lets us assign each point a column directly and sum with a single pass, avoiding the global sort that numpy.unique()/ numpy.searchsorted() would do over every point. For wide mz ranges with narrow bins the dense grid would be too large, so above _MAX_DENSE_BINS we fall back to the sort-based mapping.

Parameters:

mz_arr (np.ndarray) – mz value of every point, all scans concatenated.
intensities (np.ndarray) – uint64 intensity of every point.
rows (np.ndarray) – Retention-time (row) index of every point.
num_times (int) – Number of retention times (grid rows).
precision (int) – Number of decimals to round the returned mz labels to.
bin_width (float, optional) – Width of each shared-grid bin in daltons. Defaults to 10**-precision.

Returns:

The sorted bin-center mz values that occur (mz_ylabels) and the (num_times, mz_ylabels.size) uint64 intensity grid (data).

Return type:

Tuple (mz_ylabels, data)
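
The binning rule above can be sketched in pure Python. This is a simplified stand-in, not the library code: it uses a dictionary instead of a dense integer grid, does not reproduce the _MAX_DENSE_BINS fallback, and the name bin_points is hypothetical.

```python
def bin_points(mz, intensities, rows, num_times, precision, bin_width=None):
    """Sum per-point intensities into a (retention time x mz) grid."""
    if bin_width is None:
        bin_width = 10 ** -precision  # one bin per labelled m/z

    # Integer bin index of every point: round(mz / bin_width).
    bins = [round(m / bin_width) for m in mz]

    # Only bins that actually occur become columns, in ascending order.
    occupied = sorted(set(bins))
    column = {b: j for j, b in enumerate(occupied)}

    data = [[0] * len(occupied) for _ in range(num_times)]
    for b, inten, r in zip(bins, intensities, rows):
        data[r][column[b]] += inten  # points sharing a bin are summed

    # precision only affects how the bin centers are labelled.
    labels = [round(b * bin_width, precision) for b in occupied]
    return labels, data


labels, grid = bin_points(
    mz=[100.01, 100.04, 100.26, 100.02],
    intensities=[5, 7, 3, 2],
    rows=[0, 0, 0, 1],
    num_times=2,
    precision=4,
    bin_width=0.1,
)
```

With a 0.1 Da bin the first two points of scan 0 are pooled into one column even though the labels keep four decimals, which illustrates how bin_width and precision are decoupled.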

parse_mspeakdata(path, precision='auto')[source]

Parses Masshunter centroided MS data stored in MSPeak.bin.

MSPeak.bin holds the peak-picked (centroid) spectrum of each scan - a list of (mz, intensity) pairs - in contrast to the dense profile trace in MSProfile.bin (parse_msdata). GC quadrupole acquisitions store only centroids; Q-TOF/TOF acquisitions store a profile block and a centroid block per scan (see read_scan_records), and this reads the centroid one.

The following files are used:

MSScan.xsd -> File structure of MSScan.bin.
MSScan.bin -> Per-scan metadata and pointers into MSPeak.bin.
MSPeak.bin -> Raw (mz, intensity) peak pairs.

The MSPeak.bin centroid decoding was contributed by denisshragin (issue #37).

Parameters:

path (str) – Path to the AcqData subdirectory.
precision (int or str, optional) – Number of decimals to round mz values to. 'auto' (the default) resolves to 4 for TOF-calibrated centroids and 0 for unit-resolution (GC/quadrupole) centroids.

Returns:

DataFile containing Masshunter centroided MS data.

parse_default_masscal(xml_path)[source]

Reads the polynomial ValueUseFlags for each calibration id from DefaultMassCal.xml.

Each DefaultCalibration has a Polynomial step whose ValueUseFlags is a bitmask: bit k (counting from the least significant) being set means the polynomial includes a term of order k, and the active coefficients in MSMassCal.bin fill those orders in ascending order. A flag of 0 (or a missing file) means no polynomial refinement - only the traditional calibration is used.

Parameters: xml_path (str) – Path to DefaultMassCal.xml.
Returns: Dictionary mapping calibration id (int) to its ValueUseFlags (int). Empty if the file does not exist.
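
As an illustration of the bitmask convention, the following sketch lists the polynomial orders selected by a ValueUseFlags value and pairs them with coefficients. The helper name polynomial_orders and the coefficient values are hypothetical.

```python
def polynomial_orders(use_flags):
    """Orders k whose bit (counting from the least significant) is set."""
    return [k for k in range(use_flags.bit_length()) if (use_flags >> k) & 1]


# Bits 1, 3 and 5 are set, so the polynomial has terms of order 1, 3 and 5.
orders = polynomial_orders(0b101010)

# The active coefficients fill those orders in ascending order.
made_up_coefficients = [0.5, 0.25, 0.125]
terms = dict(zip(orders, made_up_coefficients))
```

A flag of 0 yields an empty list, matching the "no polynomial refinement" case.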

read_default_masscal_rows(xml_path)[source]

Reads the default per-calibration-id calibration rows from DefaultMassCal.xml.

Used as the fallback when the per-scan MSMassCal.bin is absent. Each DefaultCalibration provides a Traditional step (coeff, base) and, optionally, a Polynomial step (left, right, then six coefficients). Together these are the ten doubles MSMassCal.bin would otherwise store per scan - [coeff, base, left, right, c0..c5] - so a row can stand in for a MSMassCal.bin row directly (see calibrate_mz). The polynomial’s ValueUseFlags is read separately by parse_default_masscal.

Parameters: xml_path (str) – Path to DefaultMassCal.xml.
Returns: Dictionary mapping calibration id (int) to a length-10 list of doubles. Empty if the file does not exist or defines no traditional calibration.

calibrate_mz(tof, calib_row, use_flags)[source]

Converts a raw time-of-flight axis to calibrated mz values.

The traditional calibration is mz = (coeff * (tof - base))**2. When a polynomial refinement is active (use_flags truthy), a correction is subtracted: the six MSMassCal.bin coefficients are assigned to the polynomial orders whose bits are set in use_flags (ascending), and the polynomial is evaluated on the time-of-flight clipped to [left, right]. This matches the masses Agilent MassHunter reports (validated to <0.0001 Da against exported spectra); without it the masses are off by ~1-2 ppm.

Parameters:

tof (np.ndarray) – Raw time-of-flight values for one scan.
calib_row (np.ndarray) – The scan’s 10 MSMassCal.bin doubles (coeff, base, left, right, and six polynomial coefficients).
use_flags (int or None) – The polynomial ValueUseFlags for this scan’s calibration id, or None/0 to apply only the traditional formula.

Returns:

A numpy array of calibrated mz values.
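
The two ingredients described above can be sketched separately in pure Python. The function names and the sample row are hypothetical. The sketch computes the traditional m/z and the value of the correction polynomial but deliberately stops short of combining them, since the text only states that the correction is subtracted.

```python
def traditional_mz(tof, calib_row):
    """mz = (coeff * (tof - base)) ** 2 for every flight time."""
    coeff, base = calib_row[0], calib_row[1]
    return [(coeff * (t - base)) ** 2 for t in tof]


def polynomial_correction(tof, calib_row, use_flags):
    """Value of the correction polynomial for every flight time."""
    if not use_flags:
        return [0.0] * len(tof)  # None or 0: no refinement
    left, right = calib_row[2], calib_row[3]
    coeffs = calib_row[4:10]
    # Orders whose bits are set, ascending; coefficients fill them in order.
    orders = [k for k in range(use_flags.bit_length()) if (use_flags >> k) & 1]
    out = []
    for t in tof:
        clipped = min(max(t, left), right)  # evaluate on tof clipped to [left, right]
        out.append(sum(c * clipped ** k for c, k in zip(coeffs, orders)))
    return out


# Made-up row: [coeff, base, left, right, c0..c5].
row = [0.01, 2.0, 500.0, 1500.0, 1e-3, 2e-6, 0.0, 0.0, 0.0, 0.0]
flight_times = [1002.0, 2000.0]
base_mz = traditional_mz(flight_times, row)
correction = polynomial_correction(flight_times, row, 0b11)
```

The second flight time lies outside [left, right], so its correction is evaluated at 1500.0 rather than 2000.0.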

segment_is_rle(comp_bytes, num_mz)[source]

Returns whether a MSProfile.bin segment uses run-length encoding.

RLE segments leave the 16-byte (smallest mz, mz delta) header raw and follow it with an intensity stream whose first 4 bytes are a little-endian word: the low 3 bytes hold the point count and the high byte is a fixed 0x90 marker. Both must match for us to treat the segment as RLE, which makes this a self-validating check rather than a guess (LZF-compressed segments effectively never satisfy it). See decompress_inten_list.

Parameters:

comp_bytes (bytes) – The raw segment bytes read from MSProfile.bin.
num_mz (int) – The expected number of mz-intensity pairs.

Returns:

True if the segment is RLE-encoded, False otherwise.
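
A minimal sketch of this check, assuming only what the description states (a 16-byte header followed by a little-endian 32-bit word). The name looks_like_rle is hypothetical.

```python
import struct

HEADER_SIZE = 16  # raw (smallest mz, mz delta) header


def looks_like_rle(comp_bytes, num_mz):
    """True if the word after the header carries num_mz and the 0x90 marker."""
    if len(comp_bytes) < HEADER_SIZE + 4:
        return False
    (word,) = struct.unpack_from("<I", comp_bytes, HEADER_SIZE)
    point_count = word & 0xFFFFFF  # low 3 bytes
    marker = word >> 24            # high byte
    return point_count == num_mz and marker == 0x90


# Synthetic segment: zeroed header, then the count/marker word.
segment = bytes(HEADER_SIZE) + struct.pack("<I", (0x90 << 24) | 1234)
is_rle = looks_like_rle(segment, 1234)
```

Requiring both the count and the marker to match is what makes a false positive on an LZF-compressed segment unlikely.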

decompress_inten_list(comp_view, num_mz)[source]

Decompresses the run-length-encoded intensity stream of a MSProfile.bin segment (see segment_is_rle). Q-TOF profile acquisitions store intensities this way instead of LZF-compressing them (issue #27).

The stream begins with a 4-byte point-count word (low 3 bytes) and a fixed 0x90 marker (high byte), then a negated little-endian int32 giving the count of leading zero intensities. The token stream follows, opening at a width of 4 bytes (signed). Each value is read at the current width:

A non-negative value is a literal intensity.
A negative value -v encodes divmod(v, 4): the quotient is a run of zero intensities to emit, and the remainder is the new width flag (1, 2, 3 -> 1-, 2-, 4-byte; 4 -> 8-byte) to switch to for subsequent values.

Most scans open with an 0xffffffff token, read as -1 at the 4-byte starting width, which emits no zeros and switches to 1-byte values; this is why the opening width is rarely seen directly. High-signal scans instead open with a literal 4-byte intensity. (Issue #27: an earlier reading mistook that first token for a separate “width flag” field, which decoded identically for the common case but failed on the literal-first scans.)

Trailing zero intensities are not stored, so the output is pre-filled with zeros to length num_mz.

Parameters:

comp_view (memoryview) – Segment bytes after the 16-byte header.
num_mz (int) – The number of mz-intensity pairs (output length).

Returns:

A numpy array of num_mz uint32 intensities.

Raises:

ValueError – If the stream is malformed (bad width flag, runs past the point count, or is truncated).
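
The token rules above can be sketched as a small pure-Python decoder. This is not the library implementation. It assumes every width is read as a signed little-endian integer, handles only width flags 1 to 3 (the 8-byte width is left out), and treats any other remainder as malformed. The name decode_intensities is hypothetical.

```python
import struct

WIDTH_BY_FLAG = {1: 1, 2: 2, 3: 4}              # flags handled by this sketch
FORMAT_BY_WIDTH = {1: "<b", 2: "<h", 4: "<i"}   # assumed signed, little-endian


def decode_intensities(stream, num_mz):
    """Decode an RLE intensity stream into a list of num_mz integers."""
    if len(stream) < 8:
        raise ValueError("truncated stream")
    word, neg_leading = struct.unpack_from("<Ii", stream, 0)
    if (word & 0xFFFFFF) != num_mz or (word >> 24) != 0x90:
        raise ValueError("bad point-count word")
    if neg_leading > 0:
        raise ValueError("bad leading-zero count")

    out = [0] * num_mz       # trailing zeros are implicit
    pos = -neg_leading       # skip the leading zeros
    offset, width = 8, 4     # the token stream opens at 4 bytes
    while offset < len(stream):
        if offset + width > len(stream):
            raise ValueError("truncated token")
        (value,) = struct.unpack_from(FORMAT_BY_WIDTH[width], stream, offset)
        offset += width
        if value >= 0:       # literal intensity
            if pos >= num_mz:
                raise ValueError("literal runs past the point count")
            out[pos] = value
            pos += 1
        else:                # zero run plus new width flag
            zeros, flag = divmod(-value, 4)
            if flag not in WIDTH_BY_FLAG:
                raise ValueError("bad width flag")
            pos += zeros
            if pos > num_mz:
                raise ValueError("zero run runs past the point count")
            width = WIDTH_BY_FLAG[flag]
    return out


# Synthetic stream: 10 points, 2 leading zeros, 0xffffffff switches to
# 1-byte values, then 5, 7, a run of 3 zeros (-13 -> divmod(13, 4)), then 9.
example = (
    struct.pack("<Ii", (0x90 << 24) | 10, -2)
    + struct.pack("<i", -1)
    + struct.pack("<bbbb", 5, 7, -13, 9)
)
decoded = decode_intensities(example, 10)
```

The synthetic stream follows the common pattern described above: the opening 4-byte token is -1, so the 4-byte width is used exactly once before the stream drops to 1-byte values.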

parse_scan_xsd(xsd_path)[source]

Parses MSScan.xsd into a dictionary describing its “complex” types.

There are “simple” types that translate directly into number types, and “complex” types made up of other “simple” and “complex” types. The returned dictionary maps each complex type’s name to a list of its (name, type) members, which enables the recursive parsing in read_complextype.

Parameters: xsd_path (str) – Path to MSScan.xsd.
Returns: Dictionary mapping complex type names to lists of (name, type) tuples.
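
A rough sketch of how such a dictionary could be built with the standard library. The inline schema is a made-up miniature; its member names and type names are illustrative and are not taken from a real MSScan.xsd. The function name complex_types is hypothetical.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

# Hypothetical two-type schema standing in for MSScan.xsd.
SAMPLE_XSD = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:complexType name="ScanRecordType">
    <xs:sequence>
      <xs:element name="ScanID" type="xs:int"/>
      <xs:element name="ScanTime" type="xs:double"/>
      <xs:element name="SpectrumParamValues" type="SpectrumParamsType"/>
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="SpectrumParamsType">
    <xs:sequence>
      <xs:element name="SpectrumOffset" type="xs:long"/>
      <xs:element name="PointCount" type="xs:int"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>"""


def complex_types(xsd_text):
    """Map each complex type name to its ordered (name, type) members."""
    root = ET.fromstring(xsd_text)
    types = {}
    for ctype in root.iter(XS + "complexType"):
        members = [
            (el.get("name"), el.get("type"))
            for el in ctype.iter(XS + "element")
        ]
        types[ctype.get("name")] = members
    return types


types = complex_types(SAMPLE_XSD)
```

Member order is preserved because it determines the order in which fields are later read from the binary file.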

type_size(complextypes_dict, name)[source]

Returns the on-disk byte size of one MSScan.bin record of the given type.

Mirrors read_complextype/read_type: each member is counted once, so for ScanRecordType this is the size of a record with a single SpectrumParamValues block. Lets read_scan_records reason about the record stride without reading the file.

Parameters:

complextypes_dict (dict) – Output of parse_scan_xsd.
name (str) – A simple (“xs:int”, …) or complex type name.

Returns:

Size in bytes (int).
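
The size computation can be sketched as a short recursion over the type dictionary. The simple-type size table and the sample dictionary below are assumptions made for illustration, and record_size is a hypothetical name.

```python
# Assumed byte sizes for a few XSD simple types.
SIMPLE_SIZES = {
    "xs:byte": 1,
    "xs:short": 2,
    "xs:int": 4,
    "xs:float": 4,
    "xs:long": 8,
    "xs:double": 8,
}

# Hypothetical type dictionary in the shape parse_scan_xsd returns.
sample_types = {
    "ScanRecordType": [
        ("ScanID", "xs:int"),
        ("ScanTime", "xs:double"),
        ("SpectrumParamValues", "SpectrumParamsType"),
    ],
    "SpectrumParamsType": [
        ("SpectrumOffset", "xs:long"),
        ("PointCount", "xs:int"),
    ],
}


def record_size(types, name):
    """Byte size of a type, counting each member exactly once."""
    if name in SIMPLE_SIZES:
        return SIMPLE_SIZES[name]
    return sum(record_size(types, member_type) for _, member_type in types[name])


block_size = record_size(sample_types, "SpectrumParamsType")
single_block_record = record_size(sample_types, "ScanRecordType")
```

Subtracting block_size from single_block_record gives the scalar part of a record, which is the quantity a stride calculation needs.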

read_scan_records(msscan_path, complextypes_dict, num_records=None)[source]

Reads the scan records (ScanRecordType) from MSScan.bin, one per retention time.

Each record holds the scalar scan fields followed by one or more SpectrumParamValues blocks - the schema element is maxOccurs="unbounded". A profile-only acquisition writes a single block, but an acquisition that also stores centroids (MSPeak.bin) writes a profile block and a centroid block, making the record larger. read_complextype reads only the first block, so reading records back-to-back would mis-parse the trailing block(s) as the next record.

To stay aligned we read each record at the true record stride and skip to the next. The stride is scalar + n * block for some block count n; it is taken from the MSTS.xml scan count when that is consistent, otherwise inferred as the value of n that tiles the record region exactly. The first block of each record is the profile spectrum, which is what parse_msdata consumes. Only when no stride tiles the region (a record truncated mid-write) do we fall back to reading single-block records to EOF.

Parameters:

msscan_path (str) – Path to MSScan.bin.
complextypes_dict (dict) – Output of parse_scan_xsd.
num_records (int, optional) – Scan count from MSTS.xml (count_scans), used as a hint. May be None (OpenLab result folders omit MSTS.xml) or stale (interrupted acquisitions); the stride is validated against the file geometry either way.

Returns:

List of dictionaries, one per retention time, each mapping the ScanRecordType member names to their parsed values.
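
The stride reasoning can be sketched as follows. This is an illustration rather than the library logic: the function name block_count, the max_blocks search limit, and the choice to return the smallest tiling n are all assumptions.

```python
def block_count(region_size, scalar_size, block_size, num_records=None, max_blocks=8):
    """Return n such that stride = scalar_size + n * block_size tiles the region."""
    # Prefer the MSTS.xml hint when it is consistent with the file geometry.
    if num_records and region_size % num_records == 0:
        stride = region_size // num_records
        n, remainder = divmod(stride - scalar_size, block_size)
        if remainder == 0 and n >= 1:
            return n
    # Otherwise look for a block count whose stride tiles the region exactly.
    for n in range(1, max_blocks + 1):
        if region_size % (scalar_size + n * block_size) == 0:
            return n
    return None  # nothing tiles: e.g. a record truncated mid-write


# Five records, each with a 12-byte scalar part and two 12-byte blocks.
region = 5 * (12 + 2 * 12)
with_hint = block_count(region, 12, 12, num_records=5)
stale_hint = block_count(region, 12, 12, num_records=7)
no_hint = block_count(region, 12, 12)
truncated = block_count(region + 1, 12, 12)
```

Some region sizes are tiled by more than one stride (72 bytes fits both 24-byte and 36-byte records in this toy layout), which is one reason a consistent scan-count hint is useful.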

count_scans(acqdata_path)[source]

Returns the total scan count from MSTS.xml, or None if MSTS.xml is absent.

MSTS.xml lists the number of scans per acquisition time segment (<NumOfScans>); the total is their sum. Agilent OpenLab .rslt/.sirslt result folders omit MSTS.xml, in which case read_scan_records infers the scan count from the record geometry instead.

Parameters: acqdata_path (str) – Path to the AcqData subdirectory.
Returns: Total scan count (int), or None if MSTS.xml does not exist.
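
A minimal sketch of the summation, assuming un-namespaced <NumOfScans> elements. The enclosing element names in the sample XML are made up, and total_scans is a hypothetical name.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def total_scans(acqdata_path):
    """Sum every <NumOfScans> in MSTS.xml, or return None if it is absent."""
    path = os.path.join(acqdata_path, "MSTS.xml")
    if not os.path.exists(path):
        return None
    root = ET.parse(path).getroot()
    return sum(int(el.text) for el in root.iter("NumOfScans"))


# Synthetic check: an empty folder, then one with two time segments.
with tempfile.TemporaryDirectory() as folder:
    missing = total_scans(folder)
    with open(os.path.join(folder, "MSTS.xml"), "w") as f:
        f.write(
            "<TimeSegments>"
            "<TimeSegment><NumOfScans>100</NumOfScans></TimeSegment>"
            "<TimeSegment><NumOfScans>250</NumOfScans></TimeSegment>"
            "</TimeSegments>"
        )
    total = total_scans(folder)
```

Returning None rather than 0 for a missing file lets a caller distinguish "no hint available" from "zero scans".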

read_complextype(f, complextypes_dict, name)[source]

Reads a “complex” type from f. Used only for MSScan.bin.

Mutually recursive with read_type.

Parameters:

f (_io.BufferedReader) – File opened in ‘rb’ mode.
complextypes_dict (dict) – Dictionary defining all “complex” types.
name (str) – Name of the “complex” type to parse.

Returns:

Dictionary mapping subtype names to values. If the subtype is “complex”, the value is a nested dictionary. Otherwise, the value is a number.

read_type(f, complextype_dict, name)[source]

Reads a type from f. Used only for MSScan.bin.

Mutually recursive with read_complextype.

Parameters:

f (_io.BufferedReader) – File opened in ‘rb’ mode.
complextype_dict (dict) – Dictionary defining all “complex” types.
name (str) – Name of the type to parse.

Returns:

If the type is “simple”, a number value. If the type is “complex”, a dictionary mapping names to values.
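
The mutual recursion between the two readers can be sketched as below. The little-endian struct formats, the sample type dictionary, and the names read_any and read_complex are assumptions made for illustration. The sketch reads from an in-memory buffer instead of a real MSScan.bin.

```python
import io
import struct

# Assumed struct formats for a few XSD simple types (little-endian).
SIMPLE_FORMATS = {
    "xs:short": "<h",
    "xs:int": "<i",
    "xs:long": "<q",
    "xs:float": "<f",
    "xs:double": "<d",
}

# Hypothetical type dictionary in the shape parse_scan_xsd returns.
sample_types = {
    "ScanRecordType": [
        ("ScanID", "xs:int"),
        ("ScanTime", "xs:double"),
        ("SpectrumParamValues", "SpectrumParamsType"),
    ],
    "SpectrumParamsType": [
        ("SpectrumOffset", "xs:long"),
        ("PointCount", "xs:int"),
    ],
}


def read_complex(f, types, name):
    """Read each member of a complex type in declaration order."""
    return {member: read_any(f, types, mtype) for member, mtype in types[name]}


def read_any(f, types, name):
    """Read a number for a simple type, or recurse for a complex one."""
    if name in SIMPLE_FORMATS:
        fmt = SIMPLE_FORMATS[name]
        return struct.unpack(fmt, f.read(struct.calcsize(fmt)))[0]
    return read_complex(f, types, name)


# One packed record: ScanID, ScanTime, SpectrumOffset, PointCount.
raw = struct.pack("<idqi", 7, 1.5, 4096, 200)
record = read_complex(io.BytesIO(raw), sample_types, "ScanRecordType")
```

Nested complex members come back as nested dictionaries, mirroring the return value described above.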