MS Selection Design Discussion Document
=======================================

George Moellenbrock
2003/11/12

Introduction
------------

Users' perceptions of aips++ have suffered from a confusing range (and
lack) of ways to select MS data. Some tool functions use explicit
parameters for certain selection keys, but these parameters, and their
names, are not consistent (nor consistently present) in all tools
requiring data selection. In most tools, usually in a setdata function,
there is a means of selecting data via the Table Query Language (TaQL).
While extraordinarily flexible and powerful, TaQL is not adequate, in
terms of ease of use, for the general user, especially for the simple
data selection cases which tend to be the most often encountered.
Therefore, a single, straightforward model for user-oriented MS data
selection is required, and this document is an attempt to define one.

Thanks to Kumar for helping flesh out the examples.

General Principles
------------------

Implementation of a user-friendly mechanism for MS data selection should
obey the following general principles.

1. Uniformity.
   One, and only one, set of MS data selection parameters should be used
   throughout the package. In general, for any conceptual parameter, the
   same parameter name should be used throughout the package, even if
   the parameter is not a global parameter. (This issue should be
   addressed in all aspects of the package, not just data selection.)

2. Name/Value specification.
   It should be possible to specify selection on MS keys and attributes
   according to name, value, and index (not just index), using a minimum
   of distinct parameters. In effect, this capability should be part of
   the built-in knowledge intended in the MS definition (that makes it
   distinct from a generic table of data). Note that fillers must label
   various data attributes with distinctive names (usually in the NAME
   columns of the various subtables) for name recognition to be
   effective.

3. Syntax/simplicity.
   Some data selection concepts (e.g., antenna-based vs. baseline-based
   selection) do not map in a one-to-one fashion between a simple
   selection specification and the selected data. Also, some selection
   parameters depend on the context set by what has been specified by
   other parameters (e.g., channel selection depends upon which spectral
   windows have been specified, and how many channels those spectral
   windows have). There will therefore need to be syntactical mechanisms
   to handle these cases. In fact, even the interfaces for the simpler
   parameters can benefit from a strong syntax. Combined with the
   name/value recognition flexibility mentioned above, the result will
   be a very powerful selection mechanism. In developing a powerful
   syntax, it is important that simple selections can still be expressed
   in a simple manner. The major part of this document attempts to
   describe a powerful yet simple specification syntax.

4. Portability.
   Data selection should be portable in the sense that one ensemble of
   selection parameters could be used in several different tools,
   without having to respecify it in detail for each. Also, it should be
   possible to interpret a subset of one MS in terms of a data selection
   specification, and thus use it to effect selection in different MSs.
   Whereas the parsing of a user's selection specification involves
   translating names and values into indices, a fully portable data
   selection mechanism will require the reverse translation. This
   reverse translation will have an important application in the
   specification and storage of flagging information.

5. Hide the details.
   There is an internal distinction between primary keys (fundamental
   data descriptors which can uniquely describe any subset of a dataset)
   and non-key attributes (which can be expressed in terms of the
   primary keys and serve only as shorthand for them), but this
   distinction should be largely hidden from the user.
   Also, the data keys in the MS are indices stored in a zero-based
   manner (i.e., counting starts at 0, not 1), and this is confusing in
   some user communities. It is hoped that the name and value
   recognition mechanisms will eliminate the need for ad hoc (and
   instrument- and community-dependent) adjustments to the index
   counting for the user.

With these principles in mind, this document proposes a set of selection
parameters which is divided into two groups. The primary group consists
of the most common and fundamental data selection parameters, and the
secondary group consists of more obscure attributes. Eventually, all of
these parameters will be available to the user, but the primary group
will be implemented first, and graphical interfaces of the future might
recognize the distinction to provide for an uncluttered 'basic' data
selection interface, with the option to expand it to the fully general
one.

For the most part, the intentions a user might have for the selected
data are kept distinct from the selection specification. This is not
true for the strided channel selection (which allows averaging), nor for
the correlation specification, which allows polarization conversions.
These operations are already possible and implemented (e.g., in msplot).
It is likely that in some data selection contexts, these options will be
forbidden, e.g., no polarization conversion will be permitted when
selecting data in calibrater, etc.

General Syntax
--------------

Data selection typically involves specification of lists and ranges of
names, values, or indices. It is proposed that a basic regex-like syntax
be adopted to enable indication of ranges, wildcards, etc. The following
table describes the meaning of the fundamental special characters which
will be used throughout the data selection interface. This list should
be viewed as a proposal; comments and suggestions are most welcome.
For simplicity's sake, we intend to try very hard to maintain a single
meaning for each special character, for all parameters, and to keep the
list of special characters as short as possible.

Character   Meaning
--------------------------------
, (comma)   Delimiter indicating distinct members of a list. Nearly all
            parameters permit a meaningful list specification; how each
            member of the list is specified depends upon the parameter.
            The comma represents a logical OR in the selection.
- (hyphen)  Indicates a range of values, with extrema at the values
            specified on either side of the hyphen, i.e., a shorthand
            for a continuous list. Lists of ranges (separated by commas)
            will be supported.
()          Grouping. Used to avoid ambiguity in a selection expression.
*           Wildcard, one or more characters.
?           Wildcard, single character.
: (colon)   Specialization. When a specified principal value is
            non-degenerate, a colon is used to (optionally) specify
            distinguishable sub-selections.
/           Delimiter for associated values which will be evaluated
            together (e.g., spw channel stride specification).
<,>,etc.    Arithmetic comparisons.
&           Pairing syntax for baselines.

Selection Parameters
--------------------

The table below lists the data selection parameters. Context-dependent
parameters are listed immediately following the one they directly depend
on. The table is an overview; details of each and how to specify it
follow. At this point, only the primary parameters are described in
detail.
Principal
Parameter    Dependents   Description
----------------------------------------------------------
-------------------Primary parameters---------------------
time                      Times, time ranges, or scan numbers
field                     Field names or indices
antenna                   Antenna names, stations, or indices
             feed         Feed index (per antenna, consistent w/ spw)
             pol          Polarization (receptor per antenna/feed)
uvdist                    Range(s) of uv distances
spw                       Spectral Window names or indices
             chan stride  Strided chan selection
             ch/freq/vel  Range or list of chan/freq/vel
corr                      Correlation (between two 'pols') or Stokes
------------------Secondary parameters--------------------
observation               Observation name/index
array                     (Sub-)array indices
uvw                       Range(s) of u,v,w coordinates
pulsargate                Specifies one or more pulsar gates
             pulsarbin    Specifies one or more pulsar bins
                          (per pulsargate)
state                     Specifies observational states (e.g., in a
                          switching observation)
processor                 Specifies one or more distinct instrumental
                          backends
             procphase    Specifies processor phase
weight                    Range(s) of data weights
ampl                      Range(s) of data amplitudes
phase                     Range(s) of data phases (about zero)
hourang                   Range(s) of hour angles
elevation                 Range(s) of elevations
parang                    Range(s) of parallactic angles
----------------------------------------------------------

Details - Primary Parameters
----------------------------

In the following, the context and style of each primary selection
parameter is described. The generic description of each specification
indicates the basic forms possible for a single name/value/index. For
name and value specifications, these forms refer to various MS subtables
and their columns as required to describe what the specification will be
matched with. Values in these subtable columns are indicated as
"SUBTABLE_NAME.COLUMN_NAME", and are usually self-explanatory.
Sometimes, there is more than one possibility for matching a specified
name.
In such cases, the multiple options will be listed in order of
decreasing precedence and separated by the || (short-circuit) OR
operator, which indicates that the matching tests will stop as soon as a
match is found. Some parameter specifications have optional parts; these
are enclosed in square brackets [].

For now, all name and value specifications will be specified as strings,
enclosed in single quotes in the glish interface. The requirement for
quotes may change in future CLIs and GUIs. Wildcards are always
permitted for name (String) matching. Indices will be specified as
integers (without quotes), unless a specialization follows (using a
colon). Note that specifying an integer without quotes will bypass any
name matching. (Is this a bad idea?)

In most cases, ranges or lists of the basic generic specifications are
permitted. These are demonstrated in the specific examples.

time
----

This parameter selects based on time ranges in the data. Since scan
numbers are often a proxy for time, scan numbers can also be specified
(even though scan numbers may not be a strict function of time).
Timestamps, time ranges, scans, and scan ranges may be specified.

The full date/time can be specified in the measures syntax
(YYYY/MM/DD/HH:MM:SS.sss or DD-MMM-YYYY/HH:MM:SS.sss), in the same epoch
frame (e.g., UTC) as the data. If the date portion is omitted, then the
first date in the dataset is assumed. The number of days after the first
day may be prepended to the time (ddd/HH:MM:SS.sss). The second time in
a time range specification (using '-') need only be the least
significant unique portion of the overall time string.
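
The abbreviation rule for the second time in a range can be illustrated
with a short Python sketch. This is only an illustration, not part of
the proposal: the function names expand_time_range and range_seconds are
hypothetical, only the YYYY/MM/DD/HH:MM:SS form with whole seconds is
handled (the DD-MMM-YYYY form contains hyphens of its own and would need
a smarter split), and it assumes that an abbreviated end time simply
replaces the same number of trailing fields of the start time.

```python
import re
from datetime import datetime

FMT = '%Y/%m/%d/%H:%M:%S'  # assumed: full date/time, whole seconds only


def expand_time_range(spec):
    """Return (start, end) strings, completing an abbreviated end time.

    The end time may give only its least significant fields; the missing
    leading fields are copied from the start time.
    """
    start, sep, end = spec.partition('-')
    start, end = start.strip(), end.strip()
    if not sep:
        return start, start  # a single timestamp, not a range
    n_end = len(re.split(r'[/:]', end))
    # Character offsets at which each '/'- or ':'-separated field starts.
    offsets = [0] + [m.end() for m in re.finditer(r'[/:]', start)]
    if n_end > len(offsets):
        raise ValueError('end time has more fields than start time')
    cut = offsets[len(offsets) - n_end]
    return start, start[:cut] + end


def range_seconds(spec):
    """Length of the selected time range in seconds."""
    start, end = expand_time_range(spec)
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds()
```

Under these assumptions, '2003/11/07/12:58:20-45' expands to an end time
of '2003/11/07/12:58:45', and '2003/11/07/12:58:20-59:45' to
'2003/11/07/12:59:45'.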

time = 'YYYY/MM/DD/HH:MM:SS.sss'
     = '< YYYY/MM/DD/HH:MM:SS.sss'
     = '> YYYY/MM/DD/HH:MM:SS.sss'
     = 'ddd/HH:MM:SS.sss'
     = '< ddd/HH:MM:SS.sss'
     = '> ddd/HH:MM:SS.sss'

Examples:

time = '2003/11/07/12:58:20'        # selects the timestamp nearest this time
time = '2003/11/07/12:58:20-45'     # selects data within this 25s range
time = '2003/11/07/12:58:20-59:45'  # selects data within this 1m25s range
time = '< 2003/11/07/12:58:20'      # selects data prior to this time
time = '13:05:10.005'               # selects timestamp nearest this time
                                    # (date defaults to first date in dataset)
time = '0/13:05:10.005'             # same as above
time = '3/13:05:10.005'             # selects timestamp nearest this time on
                                    # 4th day in dataset
time = '13:05'                      # selects timestamp nearest 13h05m
                                    # (date defaults to first date in dataset)
time = '< 13:05:10, > 13:06:35'     # all but the 1m25s of data between these
                                    # times (date defaults to first date)
time = '4,6,7'                      # scans 4,6,7
time = '54-231'                     # scans 54 through 231
time = '4325, 2003/11/07/01:16:00-17:10'  # scan 4325 and this 1m10s of data

- Should scan be a separate parameter?

field
-----

A field is an observational unit characterized by the direction of
observation in a celestial frame, and is described in a row of the FIELD
subtable. For synthesis telescopes, the relevant directions are the
phase and delay tracking centers (in the correlator model, not
necessarily the same), without regard to the possibility that the
individual telescopes' pointing directions may be different from these.
Note that synthesis mosaic observations always consist of many fields,
each with a distinct direction in celestial coordinates. Each field has
a unique NAME (String), which is often just a radio source name in
simple single-field observations, and is qualified by a CODE (String)
which indicates the field's user- or observatory-determined
classification (e.g., as a calibrator).

In a synthesis mosaic observation, the collection of associated fields
will share various properties, not the least of which might be the name
of the large object or region being observed. Other common properties
include line transition names and rest frequencies, the object's
systemic velocity and proper motion, etc. These common properties are
described in a row of the MS SOURCE subtable, which is characterized by
a unique NAME (String) and CODE (String), similar to the FIELD subtable.
The SOURCE_ID column in the FIELD subtable supplies an index to the
relevant row of the SOURCE subtable for each field.

It is therefore necessary to enable field selection according to the
NAME and CODE values in either the FIELD or SOURCE subtables, as well as
by an index into the FIELD table. Thus, each member of a list or range
of field specifications is matched with (in this order) the FIELD.NAME,
FIELD.CODE, SOURCE.NAME, SOURCE.CODE, until a successful match is found.
If none match, the specification is interpreted as an index into the
FIELD table, and if that fails, an error message will be generated.

field = 'FIELD.NAME || FIELD.CODE || SOURCE.NAME || SOURCE.CODE'
      = FIELD.INDEX

Examples:

field = '3C*,4C*'  # all FIELD.NAMEs beginning with 3C or 4C, or failing
                   # that, all fields with SOURCE.NAMEs beginning this way
field = 'P'        # all fields with CODE='P', e.g., calcode='P' in VLA
                   # calibrator manual
field = 0-3        # the first four entries in the FIELD subtable
field = >3         # all fields after the first 4 entries in the FIELD
                   # subtable

antenna
-------

This parameter is used to select combinations of physical array elements
of a synthesis telescope, usually which antennas in the array, which
feed on each antenna (for multi-feed systems, consistent with the spw
selection), and which receptor (polarization) in each feed. This
essentially provides for the selection of a unique signal path in front
of the electronics. Since most (all?)
existing synthesis telescopes are single-feed, the feed selection is
currently degenerate, and that part of the specification can be omitted.
For single-dish telescopes, the ANTENNA selection is degenerate, and for
multi-feed systems (e.g., as on Parkes, the GBT, bolometers) the FEED
selection becomes primary.

Note that polarization selection here is always antenna-based, where
specific polarization states are desired on a per-antenna basis. To
select specific correlations globally, it is better to use the
correlation parameter described below.

For arrays, a means of distinguishing antenna and baseline selection is
required. For example, one may wish to select all baselines involving a
certain antenna or antennas, or only those baselines among a certain
group of antennas. This is achieved in the antenna parameter via syntax
using (optionally) the '&' character. A list of antennas specified
without a '&' indicates an "exclusive" selection, i.e., only baselines
among the antennas in the list. Using the '&' provides for "inclusive"
antenna-based selections via wildcards, as well as for very specific
baseline selections. Note that not using '&' is the same as specifying
the same thing on both sides of a '&'.
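
As a rough illustration of the '&' pairing rule, the following Python
sketch expands a single baseline term into antenna pairs. It is not a
proposed implementation: the names expand_side and expand_baselines are
hypothetical, only integer antenna indices are handled (no names,
stations, or :FEED/:POL specializations), and only one term with at most
one '&' is accepted, so comma-separated lists of paired terms are out of
scope. Whether self-pairs (autocorrelations) belong in the result is
left as a flag, since that is a choice the selection mechanism would
have to settle.

```python
from itertools import product


def expand_side(side, all_antennas):
    """Expand one side of a '&' (e.g. '5', '5,6,7', '(5-8)', '*')."""
    side = side.strip()
    if side.startswith('(') and side.endswith(')'):
        side = side[1:-1]  # parentheses only group; drop them
    selected = set()
    for item in side.split(','):
        item = item.strip()
        if item == '*':
            selected.update(all_antennas)  # wildcard: every antenna
        elif '-' in item:
            lo, hi = (int(x) for x in item.split('-'))
            selected.update(a for a in all_antennas if lo <= a <= hi)
        else:
            selected.add(int(item))
    return selected


def expand_baselines(spec, all_antennas, include_auto=False):
    """Return the sorted list of (ant1, ant2) pairs selected by spec."""
    left, sep, right = spec.partition('&')
    if not sep:
        right = left  # no '&': same list on both sides
    lhs = expand_side(left, all_antennas)
    rhs = expand_side(right, all_antennas)
    # Baselines are unordered, so (6, 5) and (5, 6) collapse to (5, 6).
    pairs = {tuple(sorted(p)) for p in product(lhs, rhs)}
    if not include_auto:
        pairs = {p for p in pairs if p[0] != p[1]}
    return sorted(pairs)
```

With all_antennas = range(1, 11), expand_baselines('5,6,7', ...) gives
the three baselines among antennas 5, 6, and 7, while
expand_baselines('5 & *', ...) gives every baseline that includes
antenna 5.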

antenna = 'ANTENNA.NAME || ANTENNA.STATION'
        = ANTENNA.INDEX
        = 'ANTENNA.NAME || ANTENNA.STATION || INDEX [:POL]'
        = 'ANTENNA.NAME || ANTENNA.STATION || INDEX [:FEED][:POL]'

Examples:

antenna = '5'                # selects baselines involving antenna 5 only
                             # (self-correlations)
antenna = '5 & 5'            # same as above
antenna = '5 & 6'            # baseline 5&6 only
antenna = '5,6,7'            # selects all baselines among antennas 5,6,7
                             # (5&6, 5&7, 6&7)
antenna = '5,6,7 & 5,6,7'    # same as above
antenna = '5 & *'            # selects data for all baselines which include
                             # antenna 5
antenna = '5 & *, 6 & *'     # selects data for all baselines which include
                             # antenna 5 or 6
antenna = '(5,6) & *'        # same as above
antenna = '(5-6) & *'        # same as above
antenna = '(5-8) & (9,10)'   # selects 5&9,5&10,6&9,6&10,7&9,7&10,8&9,8&10
antenna = 'VLA_N*'           # all baselines among North arm (VLA) antennas
antenna = 'VLA_N* & *'       # all baselines with at least one N arm antenna
antenna = 'VLA_E* & VLA_W*'  # all baselines between E and W arms (does not
                             # include baselines internal to each arm)
antenna = 'VLA_E*,VLA_W*'    # all baselines among E and W arms (includes
                             # internal ones)
antenna = '5:R'              # selects RCP visibilities for antenna 5 only
                             # (RR-only)
antenna = '5:R & *'          # selects all visibilities involving the RCP
                             # receptor on antenna 5
antenna = '5:R & 7:L'        # selects baseline 5&7, RL (not LR)
antenna = '7:L & 5:R'        # exactly the same as above
antenna = '7:R & 5:L'        # selects baseline 5&7, LR (not RL)
antenna = '5:R & (3,4,7,8):L'  # selects visibilities involving RCP on
                               # antenna 5 and LCP on antennas 3,4,7,8

uvdist
------

This is a geometrical way to specify baseline selection; an alternative
to direct specification of the antennas involved. The uv-distance is the
baseline length projected on a plane perpendicular to the instantaneous
field direction. It can be expressed in units of distance (e.g., km or
m, explicitly) or wavelengths at the reference frequency (e.g., l, kl,
Ml). In the latter case, the result will be spectral window-dependent.
Ranges and inequalities are permitted. Annuli of uv-distances may be
specified either using a range, or by appending a fractional value (in
percent) to a single uv-distance value.

uvdist = 'value1-value2km'
       = 'value1-value2Ml'
       = 'valuekm:percentage%'
       = 'valueMl:percentage%'

Examples:

uvdist = '24-35km, 40-45km'  # two annuli in units of distance
uvdist = '24-35Ml, 40-45Ml'  # two annuli in units of wavelengths
uvdist = '< 45kl'            # less than 45 kilolambda
uvdist = '> 0l'              # greater than zero-length (no ACs)
uvdist = '31Ml:5%'           # +/- 2.5% about 31Ml

spw
---

The data for a single spectral window is characterized by a unique
spectral setup. The spectral setup is defined by a unique reference
frequency (in some frequency frame, e.g., LSRK, TOPO, etc.), a unique
sideband, and a unique channelization (total bandwidth, resolution,
channel width, number of channels, etc.). Transmission bandwidth
limitations throughout the signal path typically force a wide-bandwidth
observation to be divided into many separate spectral windows. For
single-polarization observations, each spectral window can be identified
with a single distinct physical electronic path appearing at the output
of the backend. For dual-polarization observations, each spectral window
consists of two distinct signal paths in each receiving element, and up
to four distinct outputs are formed from these in the backend or
correlator. Different spectral windows may be derived from different
physical feeds on an antenna; if the data from different feeds share the
same spectral setup, they are considered the same spectral window, and
the feed selection (in antenna) is required to distinguish them.

spw = 'SPECTRAL_WINDOW.NAME'
    = SPECTRAL_WINDOW.INDEX
    = 'SPECTRAL_WINDOW.NAME || INDEX [:nchan_start_step_width]'
    = 'SPECTRAL_WINDOW.NAME || INDEX [:]'
    = 'SPECTRAL_WINDOW.NAME || INDEX [:]'
    = 'SPECTRAL_WINDOW.NAME || INDEX [:]'

Examples:

spw = '2'                # spw 2
spw = '2:16/5/2/2'       # spw 2, 16 channels, starting with channel #5,
                         # stepping by 2 and averaging in pairs
spw = '2:16-40'          # spw 2, channels 16-40
spw = '2:5134-5138MHz'   # spw 2, 5134-5138MHz section only
spw = '2:51-76km/s'      # spw 2, 51-76km/s section only
spw = '3:(15,16,21,34)'  # spw 3, channels 15,16,21,34
spw = '2:16, 3:32-34'    # spw 2, channel 16 and spw 3, channels 32-34

- Should strided selection be handled in a separate parameter, where one
  set of strides is specified for each spw specified?
- How should we handle velocity range specification when the velocities
  are negative? Maybe ranges could be enclosed in square brackets,
  without the hyphen, and using a comma: 2:[-51,-76]km/s ?
- Should averaging (in strided selection) be handled separately from
  selection?

correlation
-----------

This selects baseline-based correlations (cf. antenna-based
polarizations in 'antenna'). Any correlation available in or derivable
from the otherwise selected data may be specified. Formal Stokes
parameters may also be specified (but note that for uncalibrated data,
these may be meaningless).

correlation = 'RR || RL || LR || LL || XX || XY || YX || YY || I || Q || U || V || P || X'

Examples:

correlation = 'RR'
correlation = 'RR RL'
correlation = 'RR RL LR LL'
correlation = 'LL I'
correlation = 'I Q U V'
correlation = 'P X'

- Should derived correlations/polarizations be supported in data
  selection specifications? This will have to be forbidden in some
  contexts.

Details - Secondary Parameters
------------------------------

(TBD) (Suggestions welcome)