Comments on Data Selection Proposal

S.T. Myers 25-Mar-2004

Based on document:
http://almasw.hq.eso.org/almasw/bin/view/OFFLINE/DataSelection
"MS Selection Design Discussion Document" George Moellenbrock 2003/11/12

General Principles:
-------------------

These seem sensible and complete.  Attention is paid to both functionality
and usability (for programmers and users).

General Syntax:
---------------

Adoption of regex style wildcarding is reasonable.

Character Meaning:
------------------

The choice of characters is sensible, with most defaulting to standard
uses.  Initial confusion may occur in some cases (- for ranges instead
of :, the delimiter /) but there is no good way around this.

Principle Parameter Dependents Description
------------------------------------------

"Principle" should be "Principal" :)

The division into "primary" and "secondary" parameters seems ok but 
somewhat arbitrary, and possibly unnecessary.  I would have divided into
"primary" and "derived" quantities, based on whether these parameters
exist in the MS or are derived from others.  Examples of secondary
parameters would be hour angles, az-el ranges, parallactic angles which
are derived from time, location, direction, and orientation primary 
parameters.

Details
-------

Time - the choices seem ok and straightforward.  In the CBI time selection
       we also allow dates in the form 25-mar-2004 which seems parsable also.
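
       As a minimal sketch of how the 25-mar-2004 form could be parsed (the
       function name parse_dmy is hypothetical, and the month table is
       spelled out so the result does not depend on the locale):

```python
from datetime import date

# Month abbreviations listed explicitly so parsing is locale independent.
_MONTHS = {name: i + 1 for i, name in enumerate(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"])}


def parse_dmy(text):
    """Parse a date written as DD-mon-YYYY, e.g. '25-mar-2004'."""
    day, month, year = text.strip().split("-")
    return date(int(year), _MONTHS[month.lower()], int(day))
```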

       I would make scan a separate parameter, as time = '1-3' seems ambiguous
       (I would guess that it actually meant integration numbers 1-3).  Note
       that scan needs to be clearly defined somewhere (i.e. an MS keyword-value
       somewhere) to make sense.  I would let time.index be integration,
       and maybe use time.scan or scan for, well, scan.

       Should time.index or scan be stridable (see spw below)?

Field - this seems ok, though one should be careful if one has fields named
       like numbers, e.g. "1-2" which could be mistaken for the range of
       field numbers "1-2".  One should be able to explicitly note what one
       is intending, e.g. field.index = "1-2" with field = referring to the
       generic order.  

       Should there be an escape character, e.g. \, so you could do "1\-2" in
       the above example?  Or should there be quote identifiers "" or `' to
       set off literals, e.g. "1-2"?
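
       Either option is easy enough to support.  A minimal sketch, assuming
       \ as the escape character and "" as the literal quotes (split_range
       is a hypothetical helper, not part of the proposal):

```python
def split_range(text, sep="-", escape="\\", quote='"'):
    """Split text on sep, ignoring separators that are escaped or quoted."""
    parts, current = [], []
    in_quote = False
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == escape and i + 1 < len(text):
            current.append(text[i + 1])  # keep the escaped character literally
            i += 2
            continue
        if ch == quote:
            in_quote = not in_quote      # quotes mark a literal and are dropped
        elif ch == sep and not in_quote:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    parts.append("".join(current))
    return parts
```

       With this, a bare 1-2 splits into two range endpoints, while 1\-2 and
       "1-2" both come back as the single name 1-2.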

       I would have thought that field.index would refer to the row in the
       FIELD table (=FIELD_ID), while source.index would be a row 
       (=SOURCE_ID) in the SOURCE table.  It is not clear to me that these
       are necessarily the same.  But I may be confused...

       For example, it may be desirable to have fields in a scan around a
       target source (e.g. an OTF map of NGC1068) have different field ids for
       the same source, in which case it seems one would need to specify both

               field.index = '1-10'; source.name = 'NGC1068'

       Is this the way mosaics are meant to be handled?

       It seems easiest to lump FIELD.* and SOURCE.* into 'field' as users
       might be confused between FIELD and SOURCE table entries, as long as
       there is a way to specify what is meant.  Otherwise, maybe 'field'
       and 'source' should be separate (and selecting on 'source' of course
       selects all fields that refer to SOURCE_IDs corresponding to the
       selected source, while selecting on 'field' is necessarily unique
       since the rows of the FIELD table are unique).
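
       To illustrate the difference, here is a toy sketch with in-memory
       stand-ins for the FIELD and SOURCE tables.  The row contents and the
       function names are invented, and shell-style wildcards from fnmatch
       stand in for whatever wildcarding is finally adopted:

```python
from fnmatch import fnmatchcase

# Toy stand-ins for the SOURCE and FIELD tables; the contents are invented.
SOURCE = [
    {"SOURCE_ID": 0, "NAME": "NGC1068"},
    {"SOURCE_ID": 1, "NAME": "3C84"},
]
FIELD = [  # the list position plays the role of FIELD_ID
    {"NAME": "NGC1068_p1", "SOURCE_ID": 0},
    {"NAME": "NGC1068_p2", "SOURCE_ID": 0},
    {"NAME": "3C84", "SOURCE_ID": 1},
]


def fields_for_source(pattern):
    """FIELD rows whose SOURCE_ID points at a source matching pattern."""
    ids = {s["SOURCE_ID"] for s in SOURCE if fnmatchcase(s["NAME"], pattern)}
    return [i for i, f in enumerate(FIELD) if f["SOURCE_ID"] in ids]


def fields_for_name(pattern):
    """FIELD rows whose own NAME matches pattern."""
    return [i for i, f in enumerate(FIELD) if fnmatchcase(f["NAME"], pattern)]
```

       Here selecting on the source name NGC1068 picks up both pointings,
       while selecting on the field name NGC1068 matches nothing.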

       Should field.index be stridable (see spw below), e.g. along a scan
       or in a mosaic?

Antenna - this looks ok.  The only question I have is how feeds will be
       designated.  I am not sure it's a good idea to specify :feed:pol 
       purely by order.  For example, I might expect to be able to use
       indexes for polarization as well as designators, e.g. :1=:R, :2=:L,
       and same for feeds.  Maybe should use ":pol/feed" to differentiate?
       
       As in the field case, there may be problems with literals if the
       antenna name has - or : in it (which is not entirely unlikely).
       Could use antenna.name = '1-2' to distinguish from antenna.index =
       '1-2' if one had such an unfortunate name.  Or use of escape \-
       or quoted literals "1-2" could work also.

       This brings up the question of context for ranges, since names can
       be associated with indexes it seems logical to allow ranges in names
       (not just wildcards) in which case it seems like a way to escape 
       identifiers to use in names is necessary.

UVdist - looks reasonable, but should this also be applicable to u and v
       (and w?) separately also?  I could see u = '20-30Ml' as a valid
       selection.
 
       I am not so sure about using :percentage%, since : is not really
       used in this manner anywhere else.  If you want to do this I would
       prefer a unique identifier to specify ranges around central values
       such as value+-delta where delta can be absolute or %.  Of course
       that means + is added to the identifier list but that's probably ok.
       I would guess that the need for :percentage% is small since it is
       easy enough to set up absolute ranges in any event (particularly
       in glish).
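
       A minimal sketch of the value+-delta idea, assuming the percentage is
       taken relative to the central value (center_delta_range is a
       hypothetical name):

```python
def center_delta_range(text):
    """Turn 'value+-delta' or 'value+-percent%' into a (low, high) pair."""
    center_text, delta_text = text.split("+-")
    center = float(center_text)
    if delta_text.endswith("%"):
        # Percentage deltas are interpreted relative to the central value.
        delta = abs(center) * float(delta_text[:-1]) / 100.0
    else:
        delta = float(delta_text)
    return center - delta, center + delta
```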

spw - I don't like specifications, like the proposed striding and averaging,
       that depend on order.  I would use unique identifiers such as

                     :nchan_start+step^width

       so you can default, like 2:16^2 to average 16 channels (starting from
       1) pairwise ending up with 8 double-wide channels.  Note that I am
       assuming the nchan is the original number, not the final (this could
       be seen to mean take 32 channels and average down to 16) - this should
       be carefully specified!  In the case of averaging, what to do about 
       leftover bits, e.g. 16^3?  (I would allow partial averages, so that 
       16^3 is different than 15^3.)

       Note that the above applies to anything where you want to stride
       and/or average, e.g. field indexes along a scan or in a mosaic might work,
       and pulsar bins.

       I think if striding is done using unique identifiers as above, don't
       need separate parameters, but dealing with ranges of windows could
       be messy, e.g. what do you mean by 2-3:16^2?  I would say this applies
       two-chan averaging to windows 2 and 3, and if you want separate strides
       for different windows require something like 2:16^2, 3:15^3.
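
       A sketch of how the :nchan_start+step^width form could be parsed and
       applied.  The assumptions here are mine: start, step and width all
       default to 1, nchan counts channels before averaging, selection
       happens before averaging, and partial averages are kept.  The names
       parse_spw and channel_groups are hypothetical:

```python
import re

# window ':' nchan, then optional '_start', '+step' and '^width' parts.
_SPW = re.compile(
    r"^(?P<spw>[^:]+):(?P<nchan>\d+)"
    r"(?:_(?P<start>\d+))?(?:\+(?P<step>\d+))?(?:\^(?P<width>\d+))?$")


def parse_spw(text):
    """Parse e.g. '2:16^2' into its window, nchan, start, step and width."""
    m = _SPW.match(text.strip())
    if m is None:
        raise ValueError(f"cannot parse spw selection: {text!r}")
    return {
        "spw": m.group("spw"),               # may itself be a range like '2-3'
        "nchan": int(m.group("nchan")),
        "start": int(m.group("start") or 1),
        "step": int(m.group("step") or 1),
        "width": int(m.group("width") or 1),
    }


def channel_groups(nchan, start=1, step=1, width=1):
    """Select nchan channels, then group them width at a time for averaging."""
    chans = [start + k * step for k in range(nchan)]
    # The last group may be short: partial averages are kept, not dropped.
    return [chans[i:i + width] for i in range(0, len(chans), width)]
```

       Under these assumptions 16^3 yields six groups, the last holding only
       channel 16, while 15^3 yields five full groups.  A list such as
       2:16^2, 3:15^3 would simply be split on the comma and each piece
       parsed separately.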

       For negative velocities, I would just sensibly use trailing - as a
       range and leading - as a sign, e.g. 2:[-51- -76]km/s or even just
       2:-51--76km/s for that matter.  Seems to work for me...
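
       The leading/trailing rule can be captured in a single regular
       expression.  A minimal sketch, applied to the part after the window
       prefix "2:" (parse_signed_range is a hypothetical name, and the
       brackets are treated as optional):

```python
import re

_NUM = r"[+-]?\d+(?:\.\d+)?"          # a leading - (or +) is a sign
_RANGE = re.compile(
    rf"^\[?\s*({_NUM})\s*-\s*({_NUM})\s*\]?\s*([A-Za-z/]*)$")


def parse_signed_range(text):
    """Parse 'low-high[unit]' where either end may carry its own sign."""
    m = _RANGE.match(text.strip())
    if m is None:
        raise ValueError(f"cannot parse range: {text!r}")
    low, high, unit = m.groups()
    return float(low), float(high), unit
```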

       I think as long as it is clear that averaging comes after selection
       you can specify both together (see above).

       Where do the IRAM-style wavebands, e.g. '3mm-USB', fit in?  Do we want
       to recognize these as valid spw selection parameters?  I would say yes,
       and it can probably be accommodated.

correlation - What's 'X'?  I guess it's Chi the EVPA.  Is this confusing with
       X as a polarization?  Maybe use C or A?  I guess context is enough
       to distinguish.

       Note that for CBI we have the notion of pseudo-I, since we don't have
       all pol products on each vis.  This is particularly for the imaging
       case.  For example, we have PI which means grid RR or LL if you have
       it (or average them if you have both).  Also, what do you do if you've
       got single-pol feeds but mixed, so RL and LR are different
       visibilities?  Does RL only give you half or do you conjugate LR and
       call it RL but in the other half-plane?  Note that imager should
       (eventually, if not now) handle complex RL or LR images which entails
       forming the other half plane, and this would be what is meant by RL or
       LR (which maps to Q+iU and Q-iU modulo PA) in that case.

       Is one allowed to use numbers, since pols are enumerated? e.g.
       correlation='12' instead of 'RL'?
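
       If numbers are allowed, the translation is trivial.  A sketch,
       assuming the hypothetical enumeration 1=R, 2=L (the same idea as the
       :1=:R, :2=:L suggestion for antennas):

```python
# Hypothetical enumeration of polarizations; the real one would come from the MS.
POL_INDEX = {"1": "R", "2": "L"}


def correlation_name(text):
    """Map an enumerated correlation such as '12' onto its name 'RL'."""
    # Characters that are not known indexes are passed through unchanged.
    return "".join(POL_INDEX.get(ch, ch) for ch in text)
```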

       Note IQUV and P and X are derived quantities in my terminology.

Details - Primary Parameters
----------------------------

None given in document.

Subarray indexes straightforward, with subarray = range parameter.

I would lump uvw Ranges with uvdist (see above).

Pulsar gate and bin will each get their own parameter, e.g. gate='1', and bins
could be numbered, strided, and averaged like channels.

state may be most problematic, depending on where the switching state
ends up being, e.g. sideband-separation phases will end up as windows
by the filler I expect, wobbler phases go where? fields? electronic
switching phases?  May need to give these as separate parameters to
make sure the meaning is clear.

My guess is that processor (e.g. backends) will map to different spw and
can be handled like the IRAM bands '3mm-USB'.

Ranges of weights, amplitudes, phases should be straightforward (handle like
uv-ranges but w/o units? or allow units like Jy,mJy,K? allow % or +/- ranges?)
What about real, imag?

Handle hourangles as time?  az-el, pa ranges straightforward numerical ranges
(units rad, deg? default is what?  Can you give angles in hours? I would
guess anything allowed by quanta is ok).  All numerical parameters including
uvw, uvdist should be handled similarly.  Are units necessary or are there
sensible defaults?
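
One way to handle all numerical parameters alike is to make the unit optional
and let each parameter supply its own default.  A minimal sketch (the name
parse_quantity and the default units in the tests are illustrative only, not
taken from the proposal or from quanta):

```python
import re

_QUANTITY = re.compile(r"^\s*([+-]?\d+(?:\.\d+)?)\s*([A-Za-z/%]*)\s*$")


def parse_quantity(text, default_unit):
    """Split '30deg' into (30.0, 'deg'); fall back to default_unit if none given."""
    m = _QUANTITY.match(text)
    if m is None:
        raise ValueError(f"cannot parse quantity: {text!r}")
    value, unit = m.groups()
    return float(value), unit or default_unit
```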

Summary
-------

Overall, this looks very comprehensive and should allow clearer selection
within the various tools.  I have some issues, which were given above and
are summarized here:

1. Avoid relying on order of things to specify context, like channel striding
and averaging.  I would try to use unique identifiers for things like channel 
numbers, start, striding and averaging.

2. You may need a way to allow 'escaping' of symbols appearing in names, such
as \- or \:, or a way to include literal quotes, e.g. "1-2"

3. Allow access directly to parameter subtypes, e.g. field.name and
field.index. 

4. May need to separate field and source as parameters.

5. May need time.index for integrations (timestamps) and time.scan or scan
for scans.

6. Unify u,v,w and uvdist.  Not sure about the :percentage% thing (and see
10 below).

7. Channel, pulsar bin, or field index specification through specification of
n, start, step, width best done through unique identifiers (see 1).  I suggest
:nchan_start+step^width.  Averaging (using width) should follow selection, and
partially filled averages should be kept.  Note that anything enumerated could
possibly be stridable so keep notation clear.  May even need to do something
like #n_start+step^width to be consistent.

8. Need to more fully specify what the states are (e.g. wobbler, electronic
beam switch).

9. Numerical parameters e.g. uvw, uvdist, amp, phase, pa, need units.  What
do they default to or are defaults allowed?

10. Do we want a way to specify ranges by central value and delta around it,
e.g. 10+-2 instead of 8-12, or 10+-20%?

11. Is enumeration allowed for anything that can be enumerated (rows in any
table, polarizations, correlations)?  I would say yes...

One final comment is with regard to how this relates to TaQL selection.
The above selection by parameters is most useful, it seems to me, when the
parameters are arguments to a selection tool, e.g.

   ms.selectpar(field='3C*',spw='2:16^2')

and not in a big string

   ms.selectpar(string="field='3C*', spw='2:16^2'") 

since that is not really any different than TaQL.  On the other hand,
it might be advantageous to allow the latter as it is often easier in
a script to assemble strings.  But this is just a random thought...
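
Supporting both forms need not be much work, since the string form can be
reduced to the argument form.  A minimal sketch (parse_selection is a
hypothetical helper; it assumes every value is quoted, with either ' or "):

```python
import re

# name = 'value' or name = "value"; names may contain dots, e.g. field.index
_PAIR = re.compile(r"([A-Za-z_][\w.]*)\s*=\s*(['\"])(.*?)\2")


def parse_selection(string):
    """Turn "field='3C*', spw='2:16^2'" into a dict of parameter values."""
    return {m.group(1): m.group(3) for m in _PAIR.finditer(string)}
```

The resulting dictionary holds the same parameter/value pairs that would
otherwise be given as separate arguments.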