MS Selection Design Discussion Document
=======================================

George Moellenbrock
2003/11/12

Introduction
------------

Users' perceptions of aips++ have suffered from a confusing range (and
lack) of ways to select MS data.  Some tool functions use explicit
parameters for certain selection keys, but these parameters, and their
names, are not consistent (nor consistently present) in all tools
requiring data selection.  In most tools, usually in a setdata
function, there is a means of selecting data via the Table Query
Language (TaQL).  While extraordinarily flexible and powerful, TaQL is
not adequate, in terms of ease-of-use, for the general user,
especially for simple data selection cases which tend to be the most
often encountered.  Therefore, a single, straightforward model for
user-oriented MS data selection is required, and this document is an
attempt to define one.

Thanks to Kumar for helping flesh out the examples. 


General Principles
------------------

Implementation of a user-friendly mechanism for MS data selection should
obey the following general principles.

1. Uniformity.  One, and only one, set of MS data selection parameters
should be used throughout the package.  In general, for any conceptual
parameter, the same parameter name should be used throughout the
package, even if the parameter is not a global parameter.  (This issue
should be addressed in all aspects of the package, not just data
selection.)

2. Name/Value specification. It should be possible to specify
selection on MS keys and attributes according to name, value, and
index (not just index), using a minimum of distinct parameters.  In
effect, this capability should be part of the built-in knowledge
intended in the MS definition (that makes it distinct from a generic
table of data).  Note that fillers must label various data attributes
with distinctive names (usually in the NAME columns of the various
subtables) for name recognition to be effective.

3. Syntax/simplicity. Some data selection concepts (e.g.,
antenna-based vs. baseline-based selection) do not map in a one-to-one
fashion between a simple selection specification and the selected data.
Also, some selection parameters depend on the context set by what has
been specified by other parameters (e.g., channel selection depends
upon which spectral windows have been specified, and how many channels
those spectral windows have).  There will therefore need to be
syntactical mechanisms to handle these cases.  In fact, even the
interfaces for the simpler parameters can benefit from a strong
syntax.  Combined with the name/value recognition flexibility
mentioned above, the result will be a very powerful selection
mechanism.  In developing a powerful syntax, it is important that
simple selections can still be expressed in a simple manner.  The
major part of this document attempts to describe a powerful yet simple
specification syntax.

4. Portability.  Data selection should be portable in the sense that
one ensemble of selection parameters could be used in several
different tools, without having to respecify it in detail for each.
Also, it should be possible to interpret a subset of one MS in terms of a
data selection specification, and thus use it to effect selection in
different MSs.  Whereas the parsing of a user's selection
specification involves translating names and values into indices, a
fully portable data selection mechanism will require the reverse
translation.  This reverse translation will have an important
application in specification and storage of flagging information.

5. Hide the details. There is an internal distinction between primary
keys (fundamental data descriptors which can uniquely describe any
subset of a dataset) and non-key attributes (which can be expressed in
terms of the primary keys and serve only as shorthand for them), but
this distinction should be largely hidden from the user.  Also, the
data keys in the MS are indices stored in a zero-based manner (i.e.,
counting starts at 0, not 1), and this is confusing in some
user communities.  It is hoped that the name and value recognition
mechanisms will eliminate the need for ad hoc (and instrument- and
community-dependent) adjustments to the index counting for the user.


With these principles in mind, this document proposes a set of
selection parameters which is divided into two groups.  The primary
group consists of the most common and fundamental data selection
parameters, and the secondary group consists of more obscure
attributes.  Eventually, all of these parameters will be available to
the user, but the primary group will be implemented first, and
graphical interfaces of the future might recognize the distinction to
provide for an uncluttered 'basic' data selection interface, with the
option to expand it to the fully general one.

For the most part, the intentions a user might have for the selected
data are kept distinct from the selection specification.  This is not
true for the strided channel selection (which allows averaging), nor
for the correlation specification, which allows polarization
conversions.  These operations are already possible and implemented
(e.g., in msplot).  It is likely that in some data selection contexts,
these options will be forbidden, e.g., no polarization conversion will
be permitted when selecting data in calibrater, etc.


General Syntax
--------------

Data selection typically involves specification of lists and ranges of
names, values, or indices.  It is proposed that a basic regex-like
syntax be adopted to enable indication of ranges, wildcards, etc.  The
following table describes the meaning of the fundamental special
characters which will be used throughout the data selection interface.
This list should be viewed as a proposal; comments and suggestions are
most welcome.  For simplicity's sake, we intend to try very hard to
maintain a single meaning for each special character, for all
parameters, and to keep the list of special characters as short as
possible.


Character     Meaning
--------------------------------
 ,  (comma)   Delimiter indicating distinct members of a list.  Nearly 
              all parameters permit a meaningful list specification; how
              each member of the list is specified depends upon the
              parameter.  The comma represents a logical OR in the
              selection. 

 -  (hyphen)  Indicates a range of values, with extrema at the values 
              specified on either side of the hyphen, i.e., a short-hand
              for a continuous list.  Lists of ranges (separated by commas)
              will be supported. 

 ()           Grouping.  Used to avoid ambiguity in selection expression.

 *            Wildcard, one or more characters.

 ?            Wildcard, single character.

 : (colon)    Specialization.   When a specified principal value
              is non-degenerate, a colon is used to (optionally) specify 
              distinguishable sub-selections.

 /            Delimiter for associated values which will be evaluated 
              together (e.g., spw channel stride specification)

 <,>,etc.     Arithmetic comparisons.

 &            Pairing syntax for baselines
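
As an illustration of how the comma and hyphen might be interpreted for
a simple index parameter, here is a minimal Python sketch.  The function
name expand_index_list is hypothetical and is not part of any existing
aips++ interface; it handles only non-negative integer indices and
ignores the other special characters in the table.

```python
def expand_index_list(spec):
    """Expand a comma-separated list of indices and hyphenated ranges.

    Hypothetical helper: handles only non-negative integers.
    """
    selected = []
    for member in spec.split(","):       # ',' separates list members (logical OR)
        member = member.strip()
        if "-" in member:                # '-' marks an inclusive range
            low, high = member.split("-", 1)
            selected.extend(range(int(low), int(high) + 1))
        else:
            selected.append(int(member))
    # Duplicates arising from overlapping ranges are dropped
    return sorted(set(selected))


print(expand_index_list("4,6,7"))
print(expand_index_list("0-3, 8"))
```

Since the comma is a logical OR, overlapping members simply merge, and
the order in which members are listed does not matter.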


Selection Parameters
--------------------

The table below lists the data selection parameters.
Context-dependent parameters are listed immediately following the one
they directly depend on.  The table is an overview; details of each
and how to specify it follow.  At this point, only the primary
parameters are described in detail.


Principal
Parameter     Dependents     Description
----------------------------------------------------------

-------------------Primary parameters---------------------

time                         Times, time ranges, or scan numbers

field                        Field names or indices

antenna                      Antenna names, stations, or indices
              feed           Feed index (per antenna, consistent w/ spw)
              pol            Polarization (receptor per antenna/feed)

uvdist                       Range(s) of uv distances

spw                          Spectral Window names or indices
              chan stride    Strided chan selection 
              ch/freq/vel    Range or list of chan/freq/vel 

corr                         Correlation (between two 'pols') or Stokes


------------------Secondary parameters--------------------

observation                  Observation name/index

array                        (Sub-)array indices

uvw                          Range(s) of u,v,w coordinates

pulsargate                   Specifies one or more pulsar gates
              pulsarbin      Specifies one or more pulsar bins 
                               (per pulsargate)

state                        Specifies observational states 
                              (e.g., in a switching observation)

processor                    Specifies one or more distinct 
                              instrumental backends
              procphase      Specifies processor phase

weight                       Range(s) of data weights  
ampl                         Range(s) of data amplitudes
phase                        Range(s) of data phases (about zero)

hourang                      Range(s) of hourangles
elevation                    Range(s) of elevations
parang                       Range(s) of parallactic angles

----------------------------------------------------------


Details - Primary Parameters
----------------------------

In the following, the context and style of each primary selection
parameter is described.  The generic description of each specification
indicates the basic forms possible for a single name/value/index.  For
name and value specifications, these forms refer to various MS
subtables and their columns as required to describe what the
specification will be matched with.  Values in these subtable columns
are indicated as "SUBTABLE_NAME.COLUMN_NAME", and are usually
self-explanatory.

Sometimes, there is more than one possibility for matching a specified
name.  In such cases, the multiple options will be listed in order of
decreasing precedence and separated by the || (short-circuit) OR
operator, which indicates that the matching tests will stop as soon as
a match is found.  Some parameter specifications have optional parts;
these are enclosed in square brackets [].

For now, all name and value specifications will be specified as
strings, enclosed in single quotes in the glish interface.  The
requirement for quotes may change in future CLIs and GUIs. Wildcards
are always permitted for name (String) matching.  Indices will be
specified as integers (without quotes), unless a specialization
follows (using a colon).   Note that specifying an integer without
quotes will bypass any name matching.  (Is this a bad idea?)

In most cases, ranges or lists of the basic generic specifications are
permitted.  These are demonstrated in the specific examples.


time
----

This parameter selects based on time ranges in the data.  Since scan
numbers are often a proxy for time, scan numbers can also be specified
(even though scan numbers may not be a strict function of time).
Timestamp(s), time ranges, scans, and scan ranges may be
specified.  The full date/time can be specified in the measures syntax
(YYYY/MM/DD/HH:MM:SS.sss or DD-MMM-YYYY/HH:MM:SS.sss), in the same
epoch frame (e.g., UTC) as the data.  If the date portion is omitted,
then the first date in the dataset is assumed. The number of days
after the first day may be prepended to the time (ddd/HH:MM:SS.sss).
The second time in a timerange specification (using '-') need only be
the least significant unique portion of the overall time string.

time = 'YYYY/MM/DD/HH:MM:SS.sss'
     = '< YYYY/MM/DD/HH:MM:SS.sss'
     = '> YYYY/MM/DD/HH:MM:SS.sss'
     = 'ddd/HH:MM:SS.sss'
     = '< ddd/HH:MM:SS.sss'
     = '> ddd/HH:MM:SS.sss'

Examples:

time = '2003/11/07/12:58:20'        # selects the timestamp nearest this time
time = '2003/11/07/12:58:20-45'     # selects data within this 25s range
time = '2003/11/07/12:58:20-59:45'  # selects data within this 1m25s range

time = '< 2003/11/07/12:58:20'     # selects data prior to this time

time = '13:05:10.005'              # selects timestamp nearest this time 
                                   (date defaults to first date in dataset)
time = '0/13:05:10.005'            # same as above
time = '3/13:05:10.005'            # selects timestamp nearest this time
                                   on 4th day in dataset 
time = '13:05'                     # selects timestamp nearest 13h05m
                                   (date defaults to first date in dataset)
time = '< 13:05:10, > 13:06:35'    # all but the 1m25s of data between these 
                                    times (date defaults to first date)
time = '4,6,7'                     # scans 4,6,7
time = '54-231'                    # scans 54 through 231

time = '4325, 2003/11/07/01:16:00-17:10'   # scan 4325 and this 1m10s of data
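
The rule that the second time in a range need only be its least
significant unique portion can be sketched as follows.  This is only an
illustration in Python, under the assumptions that the full
YYYY/MM/DD/HH:MM:SS form is used (whole seconds, fixed-width fields) and
that the abbreviated end time simply overwrites the tail of the start
time.  The function name parse_time_range is hypothetical.

```python
from datetime import datetime

TIME_FORMAT = "%Y/%m/%d/%H:%M:%S"   # assumed form; no fractional seconds


def parse_time_range(spec):
    """Return (start, end) datetimes for a 'start-end' specification."""
    start_text, end_text = (part.strip() for part in spec.split("-", 1))
    # The end may be abbreviated to its trailing fields; the leading
    # fields are borrowed from the start time.
    if len(end_text) < len(start_text):
        end_text = start_text[:len(start_text) - len(end_text)] + end_text
    start = datetime.strptime(start_text, TIME_FORMAT)
    end = datetime.strptime(end_text, TIME_FORMAT)
    return start, end


start, end = parse_time_range("2003/11/07/12:58:20-59:45")
print(start, end, (end - start).total_seconds())
```

The DD-MMM-YYYY date form contains hyphens of its own, so a real parser
would need a more careful way of locating the range delimiter than the
single split used here.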


-Should scan be a separate parameter?


field
-----

A field is an observational unit characterized by the direction of
observation in a celestial frame, and is described in a row of the
FIELD subtable.  For synthesis telescopes, the relevant directions are
the phase and delay tracking centers (in the correlator model, not
necessarily the same), without regard to the possibility that the
individual telescopes' pointing directions may be different from
these. Note that synthesis mosaic observations always consist of many
fields, each with a distinct direction in celestial coordinates. Each
field has a unique NAME (String), which is often just a radio source
name in simple single-field observations, and is qualified by a CODE
(String) which indicates the field's user- or observatory-determined
classification (e.g., as a calibrator).

In a synthesis mosaic observation, the collection of associated fields
will share various properties, not the least of which might be the
name of the large object or region being observed.  Other common
properties include line transition names and rest frequencies, the
object's systemic velocity and proper motion, etc.  These common
properties are described in a row of the MS SOURCE subtable, which is
characterized by a unique NAME (String) and CODE (String), similar to
the FIELD subtable.  The SOURCE_ID column in the FIELD subtable
supplies an index to the relevant row of the SOURCE subtable for each
field. 

It is therefore necessary to enable field selection according to
the NAME and CODE values in either the FIELD or SOURCE subtables,
as well as by an index into the FIELD table.  Thus, each
member of a list or range of field specifications is matched with
(in this order) the FIELD.NAME, FIELD.CODE, SOURCE.NAME, SOURCE.CODE,
until a successful match is found.  If none match, the specification
is interpreted as an index into the field table, and if that fails,
an error message will be generated.

field = 'FIELD.NAME || FIELD.CODE || SOURCE.NAME || SOURCE.CODE'
      = FIELD.INDEX

Examples:

field = '3C*,4C*'  # all FIELD.NAMEs beginning with 3C or 4C, 
                      or failing that, all fields with SOURCE.NAMEs
                      beginning this way
field = 'P'        # all fields with CODE='P', e.g., calcode='P' in VLA
                      calibrator manual
field = 0-3        # the first four entries in the FIELD subtable
field = >3         # all fields after the first 4 entries in the 
                      FIELD subtable
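
The matching order described above might be sketched as follows, in
Python.  The subtables are represented here as plain lists of
dictionaries, and the rows shown are made up purely for illustration;
match_field is a hypothetical name.  Python's fnmatch treats '*' as
zero or more characters, which differs slightly from the wildcard
definition proposed in the General Syntax table.

```python
from fnmatch import fnmatchcase


def match_field(pattern, fields, sources):
    """Return FIELD row indices selected by one member of a field list.

    fields:  list of dicts with NAME, CODE, SOURCE_ID (stand-in for FIELD)
    sources: list of dicts with NAME, CODE (stand-in for SOURCE)
    """
    columns = [
        lambda f: f["NAME"],                        # FIELD.NAME
        lambda f: f["CODE"],                        # FIELD.CODE
        lambda f: sources[f["SOURCE_ID"]]["NAME"],  # SOURCE.NAME
        lambda f: sources[f["SOURCE_ID"]]["CODE"],  # SOURCE.CODE
    ]
    for column in columns:
        hits = [i for i, f in enumerate(fields)
                if fnmatchcase(column(f), pattern)]
        if hits:
            return hits          # short-circuit: stop at the first match
    # No name or code matched: fall back to an index into FIELD
    if pattern.isdigit() and int(pattern) < len(fields):
        return [int(pattern)]
    raise ValueError("no field matches %r" % pattern)


# Made-up rows, for illustration only
sources = [{"NAME": "3C286", "CODE": "P"}, {"NAME": "M31", "CODE": ""}]
fields = [{"NAME": "3C286", "CODE": "P", "SOURCE_ID": 0},
          {"NAME": "M31_1", "CODE": "", "SOURCE_ID": 1},
          {"NAME": "M31_2", "CODE": "", "SOURCE_ID": 1}]

print(match_field("3C*", fields, sources))   # matched on FIELD.NAME
print(match_field("M31", fields, sources))   # matched on SOURCE.NAME
```

Because the tests short-circuit, a pattern that matches any FIELD.NAME
never reaches the SOURCE subtable, even if it would also match there.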


antenna
-------

This parameter is used to select combinations of physical array
elements of a synthesis telescope, usually which antennas in the
array, which feed on each antenna (for multi-feed systems, consistent
with the spw selection), and which receptor (polarization) in each
feed.  This essentially provides for the selection of a unique
signal path in front of the electronics.  Since most (all?) existing
synthesis telescopes are single-feed, the feed selection is currently
degenerate, and that part of the specification can be omitted.  For
single-dish telescopes, the ANTENNA selection is degenerate, and for
multi-feed systems (e.g., as on Parkes, the GBT, bolometers) the FEED
selection becomes primary.

Note that polarization selection here is always antenna-based, where
specific polarization states are desired on a per-antenna basis.  To
select specific correlations globally, it is better to use the
correlation parameter described below.

For arrays, a means of distinguishing antenna and baseline selection is
required.  For example, one may wish to select all baselines involving
a certain antenna or antennas, or only those baselines among a certain
group of antennas.  This is achieved in the antenna parameter via
syntax using (optionally) the '&' character.  A list of antennas
specified without a '&' indicates an "exclusive" selection, i.e., only
baselines among the antennas in the list.  Using the '&' provides for
"inclusive" antenna-based selections via wildcards, as well as for
very specific baseline selections.  Note that not using '&' is the same as 
specifying the same thing on both sides of a '&'.

antenna = 'ANTENNA.NAME || ANTENNA.STATION'
        = ANTENNA.INDEX
        = 'ANTENNA.NAME || ANTENNA.STATION || INDEX [:POL]'
        = 'ANTENNA.NAME || ANTENNA.STATION || INDEX [:FEED][:POL]'


Examples:

antenna = '5'              # selects baselines involving antenna 5 only
                             (self-correlations)
antenna = '5 & 5'          # same as above
antenna = '5 & 6'          # baseline 5&6 only
antenna = '5,6,7'          # selects all baselines among antenna 5,6,7 
                             (5&6, 5&7, 6&7)
antenna = '5,6,7 & 5,6,7'  # same as above
antenna = '5 & *'          # selects data for all baselines which include
                             antenna 5
antenna = '5 & *, 6 & *'   # selects data for all baselines which include 
                             antenna 5 or 6
antenna = '(5,6) & *'      # same as above
antenna = '(5-6) & *'      # same as above
antenna = '(5-8) & (9,10)' # selects 5&9,5&10,6&9,6&10,7&9,7&10,8&9,8&10

antenna = 'VLA_N*'         # all baselines among North arm (VLA) antennas
antenna = 'VLA_N* & *'     # all baselines with at least one N arm antenna
antenna = 'VLA_E* & VLA_W*' # all baselines between E and W arms
                             (does not include baselines internal to each arm)
antenna = 'VLA_E*,VLA_W*'  # all baselines among E and W arms
                              (includes internal ones)

antenna = '5:R'            # selects RCP visibilities for antenna 5 only 
                              (RR-only)
antenna = '5:R & *'        # selects all visibilities involving the RCP 
                              receptor on antenna 5
antenna = '5:R & 7:L'      # selects baselines 5&7, RL (not LR)
antenna = '7:L & 5:R'      # exactly same as above
antenna = '7:R & 5:L'      # selects baselines 5&7, LR (not RL)
antenna = '5:R & (3,4,7,8):L'  # selects visibilities involving RCP on
                                  antenna 5 and LCP on antennas 3,4,7,8 
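
The exclusive and inclusive behaviour of '&' can be illustrated with a
small Python sketch.  It handles a single term with integer antenna
indices only (no names, no ':POL' specialization, and no comma-separated
list of '&' terms), and select_baselines is a hypothetical name.
Whether self-correlations should be returned for a multi-antenna list is
not settled by the examples above, so it is left as a switch here.

```python
def expand_side(text, n_ant):
    """Expand one side of an '&' term into a set of antenna indices."""
    text = text.strip().strip("()")      # parentheses only group
    if text == "*":
        return set(range(n_ant))         # wildcard: every antenna
    antennas = set()
    for member in text.split(","):
        if "-" in member:                # inclusive range of indices
            low, high = member.split("-", 1)
            antennas.update(range(int(low), int(high) + 1))
        else:
            antennas.add(int(member))
    return antennas


def select_baselines(term, n_ant, autos=True):
    """Return sorted (ant1, ant2) pairs, ant1 <= ant2, for one term."""
    left, ampersand, right = term.partition("&")
    side1 = expand_side(left, n_ant)
    # Omitting '&' is the same as repeating the list on both sides
    side2 = expand_side(right, n_ant) if ampersand else side1
    pairs = {(min(i, j), max(i, j)) for i in side1 for j in side2}
    if not autos:
        pairs = {p for p in pairs if p[0] != p[1]}
    return sorted(pairs)


print(select_baselines("(5-8) & (9,10)", n_ant=28))
print(select_baselines("5,6,7", n_ant=28, autos=False))
```

Storing each pair with the lower index first makes '5 & 6' and '6 & 5'
select the same baseline, which is the intent when no polarization
specialization is attached.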


uvdist
------

This is a geometrical way to specify baseline selection; an
alternative to direct specification of the antennas involved.  The
uv-distance is the baseline length projected on a plane perpendicular
to the instantaneous field direction.   It can be expressed in
units of distance (e.g., km or m, explicitly) or wavelengths at the
reference frequency (e.g., l, kl, Ml).  In the latter case, the result
will be spectral window-dependent.  Ranges and inequalities are
permitted.  Annuli of uv-distances may be specified either using a
range, or by appending a fractional value (in percent) to a single uv
distance value.

uvdist = 'value1-value2km'
       = 'value1-value2Ml'
       = 'valuekm:percentage%'
       = 'valueMl:percentage%'

Examples:

uvdist = '24-35km, 40-45km'    # two annuli in units of distance
uvdist = '24-35Ml, 40-45Ml'    # two annuli in units of wavelengths
uvdist = '< 45kl'              # less than 45 kilolambda 
uvdist = '> 0l'                # greater than zero-length (no ACs)
uvdist = '31Ml:5%'             # +/- 2.5% about 31Ml
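
The percentage form and the spectral window dependence of wavelength
units can be illustrated as follows.  This Python sketch assumes the
percentage is the full width of the annulus (so 5% means +/- 2.5%, as in
the last example) and that 'l', 'kl', and 'Ml' denote wavelengths,
kilo-wavelengths, and mega-wavelengths at the reference frequency.  The
function names are hypothetical.

```python
import re

SPEED_OF_LIGHT = 299792458.0                    # m/s
LENGTH_UNITS = {"m": 1.0, "km": 1.0e3}
WAVELENGTH_UNITS = {"l": 1.0, "kl": 1.0e3, "Ml": 1.0e6}


def annulus_from_percent(spec):
    """Turn 'valueUNIT:percentage%' into (low, high, unit)."""
    match = re.fullmatch(
        r"\s*([\d.]+)\s*([A-Za-z]+)\s*:\s*([\d.]+)\s*%\s*", spec)
    if match is None:
        raise ValueError("not a percentage uvdist spec: %r" % spec)
    centre = float(match.group(1))
    unit = match.group(2)
    # The percentage is taken as the full width, so halve it
    half_width = centre * float(match.group(3)) / 200.0
    return centre - half_width, centre + half_width, unit


def to_metres(value, unit, ref_freq_hz):
    """Convert a uv distance to metres for one spectral window."""
    if unit in LENGTH_UNITS:
        return value * LENGTH_UNITS[unit]
    if unit in WAVELENGTH_UNITS:
        # Wavelength units scale with the reference frequency
        return value * WAVELENGTH_UNITS[unit] * SPEED_OF_LIGHT / ref_freq_hz
    raise ValueError("unknown uvdist unit: %r" % unit)


low, high, unit = annulus_from_percent("31Ml:5%")
print(low, high, unit)
```

A limit given in wavelengths therefore corresponds to a different
physical baseline length in each selected spectral window, whereas a
limit in km or m does not.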


spw
---

The data for a single spectral window is characterized by a unique
spectral setup.  The spectral setup is defined by a unique reference
frequency (in some frequency frame, e.g. LSRK, TOPO, etc.), a unique
sideband, and a unique channelization (total bandwidth, resolution,
channel width, number of channels, etc.).  Transmission bandwidth
limitations throughout the signal path typically force a wide-bandwidth
observation to be divided into many separate spectral windows.  For
single-polarization observations, each spectral window can be
identified with a single distinct physical electronic path appearing
at the output of the backend.  For dual-polarization observations,
each spectral window consists of two distinct signal paths in each
receiving element, and up to four distinct outputs are formed from
these in the backend or correlator.

Different spectral windows may be derived from different physical 
feeds on an antenna; if the data from different feeds share the
same spectral setup, they are considered the same spectral window,
and the feed selection (in antenna) is required to distinguish them.


spw = 'SPECTRAL_WINDOW.NAME'
    = SPECTRAL_WINDOW.INDEX 
    = 'SPECTRAL_WINDOW.NAME || INDEX [:nchan/start/step/width]'
    = 'SPECTRAL_WINDOW.NAME || INDEX [:<channel list or range>]'
    = 'SPECTRAL_WINDOW.NAME || INDEX [:<frequency list or range>]'
    = 'SPECTRAL_WINDOW.NAME || INDEX [:<velocity list or range>]'

Examples:

spw = '2'               # spw 2
spw = '2:16/5/2/2'      # spw 2, 16 channels, starting with
                           channel #5, stepping by 2 and averaging in pairs
spw = '2:16-40'         # spw 2, channels 16-40
spw = '2:5134-5138MHz'  # spw 2, 5134-5138MHz section only
spw = '2:51-76km/s'     # spw 2, 51-76km/s section only
spw = '3:(15,16,21,34)' # spw 3, channels 15,16,21,34
spw = '2:16, 3:32-34'   # spw 2, channel 16 and spectral
                           window 3 channels 32-34
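
One possible reading of the strided form is sketched below in Python.
It assumes that the four values are nchan/start/step/width, that nchan
counts output channels, and that each output channel averages 'width'
consecutive input channels beginning at start + k*step.  This
interpretation, and the name parse_strided_spw, are assumptions made for
illustration rather than settled parts of the proposal.

```python
def parse_strided_spw(spec):
    """Return (spw index, list of input-channel groups to average)."""
    spw_text, _, stride_text = spec.partition(":")
    nchan, start, step, width = (int(v) for v in stride_text.split("/"))
    groups = [list(range(start + k * step, start + k * step + width))
              for k in range(nchan)]
    return int(spw_text), groups


spw, groups = parse_strided_spw("2:16/5/2/2")
print(spw, groups[0], groups[-1], len(groups))
```

Under this reading, a step equal to the width tiles the input channels
without gaps or overlap, while a larger step skips channels between
averaged groups.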


-Should strided selection be handled in a separate parameter, where
 one set of strides is specified for each spw specified?

-How should we handle velocity range specification when the velocities
 are negative?  Maybe ranges could be enclosed in square brackets,
 without the hyphen, and using a comma: 2:[-51,-76]km/s ?

-Should averaging (in strided selection) be handled separately from
 selection?


correlation
-----------

This selects baseline-based correlations (cf antenna-based
polarizations in 'antenna').  Any correlation available in or
derivable from the otherwise selected data may be specified.
Formal Stokes parameters may also be specified (but note that
for uncalibrated data, these may be meaningless).

correlation='RR || RL || LR || LL || 
             XX || XY || YX || YY || 
              I ||  Q ||  U ||  V ||
              P ||  X '

Examples:

correlation='RR'
correlation='RR RL'
correlation='RR RL LR LL'
correlation='LL I'
correlation='I Q U V'
correlation='P X'

-Should derived correlations/polarizations be supported in data
selection specifications?  This will have to be forbidden in some
contexts.


Details - Secondary Parameters
------------------------------

(TBD)  (Suggestions welcome)

