Comments on Data Selection Proposal
S.T. Myers
25-Mar-2004

Based on document:
http://almasw.hq.eso.org/almasw/bin/view/OFFLINE/DataSelection
"MS Selection Design Discussion Document"
George Moellenbrock 2003/11/12

General Principles:
-------------------

These seem sensible and complete.  Attention is paid to both
functionality and usability (for programmers and users).

General Syntax:
---------------

Adoption of regex-style wildcarding is reasonable.

Character Meaning:
------------------

The choice of characters is sensible, with most defaulting to standard
uses.  Initial confusion may occur in some cases (- for ranges instead
of :, the delimiter /) but there is no good way around this.

Principle Parameter Dependents Description
------------------------------------------

"Principle" should be "Principal" :)

The division into "primary" and "secondary" parameters seems ok but
somewhat arbitrary, and possibly unnecessary.  I would have divided
into "primary" and "derived" quantities, based on whether these
parameters exist in the MS or are derived from others.  Examples of
secondary parameters would be hour angles, az-el ranges, and
parallactic angles, which are derived from the time, location,
direction, and orientation primary parameters.

Details
-------

Time - the choices seem ok and straightforward.  In the CBI time
selection we also allow dates in the form 25-mar-2004, which seems
parsable also.

I would make scan a separate parameter, as time = '1-3' seems ambiguous
(I would guess that it actually meant integration numbers 1-3).  Note
that scan needs to be clearly defined somewhere (i.e. an MS
keyword-value somewhere) to make sense.  I would let time.index be
integration, and maybe use time.scan or scan for, well, scan.

Should time.index or scan be stridable (see spw below)?

Field - this seems ok, though one should be careful if one has fields
named like numbers, e.g. "1-2", which could be mistaken for the range
of field numbers "1-2".
One should be able to explicitly note what one is intending,
e.g. field.index = "1-2", with field = referring to the generic order.
Should there be an escape character, e.g. \, so you could do "1\-2" in
the above example?  Or should there be quote identifiers "" or `' to
set off literals, e.g. "1-2"?

I would have thought that field.index would refer to the row in the
FIELD table (=FIELD_ID), while source.index would be a row
(=SOURCE_ID) in the SOURCE table.  It is not clear to me that these
are necessarily the same.  But I may be confused...

For example, it may be desirable to have fields in a scan around a
target source (e.g. an OTF map of NGC1068) have different field ids
for the same source, in which case it seems one would need to specify
both

   field.index = '1-10'; source.name = 'NGC1068'

Is this the way mosaics are meant to be handled?

It seems easiest to lump FIELD.* and SOURCE.* into 'field', as users
might be confused between FIELD and SOURCE table entries, as long as
there is a way to specify what is meant.  Otherwise, maybe 'field' and
'source' should be separate (selecting on 'source' of course selects
all fields that refer to SOURCE_IDs corresponding to the selected
source, while selecting on 'field' is necessarily unique since the
rows of the FIELD table are unique).

Should field.index be stridable (see spw below), e.g. along a scan or
in a mosaic?

Antenna - this looks ok.  The only question I have is how feeds will
be designated.  I am not sure it's a good idea to specify :feed:pol
purely by order.  For example, I might expect to be able to use
indexes for polarization as well as designators, e.g. :1=:R, :2=:L,
and the same for feeds.  Maybe one should use ":pol/feed" to
differentiate?

As in the field case, there may be problems with literals if the
antenna name has - or : in it (which is not entirely unlikely).  One
could use antenna.name = '1-2' to distinguish from
antenna.index = '1-2' if one had such an unfortunate name.
Or use of escape \- or quoted literals "1-2" could work also.  This
brings up the question of context for ranges: since names can be
associated with indexes, it seems logical to allow ranges in names
(not just wildcards), in which case a way to escape identifiers used
in names seems necessary.

UVdist - looks reasonable, but should this also be applicable to u and
v (and w?) separately?  I could see u = '20-30Ml' as a valid
selection.

I am not so sure about using :percentage%, since : is not really used
in this manner anywhere else.  If you want to do this, I would prefer
a unique identifier to specify ranges around central values, such as
value+-delta, where delta can be absolute or %.  Of course that means
+ is added to the identifier list, but that's probably ok.  I would
guess that the need for :percentage% is small, since it is easy enough
to set up absolute ranges in any event (particularly in glish).

spw - I don't like specifications, like the proposed striding and
averaging, that depend on order.  I would use unique identifiers such
as :nchan_start+step^width so you can default, like 2:16^2 to average
16 channels (starting from 1) pairwise, ending up with 8 double-wide
channels.  Note that I am assuming that nchan is the original number,
not the final (otherwise this could be seen to mean take 32 channels
and average down to 16) - this should be carefully specified!

In the case of averaging, what to do about leftover bits, e.g. 16^3?
(I would allow partial averages, so that 16^3 is different from 15^3.)

Note that the above applies to anything where you want to stride
and/or average, e.g. field indexes along a scan or in a mosaic might
work, and pulsar bins.

I think if striding is done using unique identifiers as above, we
don't need separate parameters, but dealing with ranges of windows
could be messy, e.g. what do you mean by 2-3:16^2?
I would say this applies two-chan averaging to windows 2 and 3, and if
you want separate strides for different windows, require something
like 2:16^2, 3:15^3.

For negative velocities, I would just sensibly use trailing - as a
range and leading - as a sign, e.g. 2:[-51- -76]km/s or even just
2:-51--76km/s for that matter.  Seems to work for me...

I think as long as it is clear that averaging comes after selection,
you can specify both together (see above).

Where do the IRAM-style wavebands, e.g. '3mm-USB', fit in?  Do we want
to recognize these as valid spw selection parameters?  I would say
yes, and it can probably be accommodated.

correlation - What's 'X'?  I guess it's chi, the EVPA.  Could this be
confused with X as a polarization?  Maybe use C or A?  I guess context
is enough to distinguish.

Note that for CBI we have the notion of pseudo-I, since we don't have
all pol products on each vis.  This is particularly for the imaging
case.  For example, we have PI, which means grid RR or LL if you have
it (or average them if you have both).

Also, what do you do if you've got single-pol feeds but mixed, so RL
and LR are different visibilities?  Does RL only give you half, or do
you conjugate LR and call it RL but in the other half-plane?  Note
that imager should (eventually, if not now) handle complex RL or LR
images, which entails forming the other half plane, and this would be
what is meant by RL or LR (which maps to Q+iU and Q-iU modulo PA) in
that case.

Is one allowed to use numbers, since pols are enumerated?
e.g. correlation='12' instead of 'RL'?

Note IQUV and P and X are derived quantities in my terminology.

Details - Primary Parameters
----------------------------

None given in document.

Subarray indexes are straightforward, with subarray = range parameter.

I would lump uvw ranges with uvdist (see above).

Pulsar gate and bin will get their own parameter, e.g.
gate='1', and bins could be numbered, strided, and averaged like
channels.

State may be most problematic, depending on where the switching state
ends up being, e.g. sideband-separation phases will end up as windows
by the filler I expect, but wobbler phases go where?  Fields?
Electronic switching phases?  We may need to give these as separate
parameters to make sure the meaning is clear.

My guess is that processor (e.g. backends) will map to different spw
and can be handled like the IRAM bands '3mm-USB'.

Ranges of weights, amplitudes, phases should be straightforward
(handle like uv-ranges but w/o units?  Or allow units like Jy, mJy, K?
Allow % or +/- ranges?)  What about real, imag?

Handle hour angles as time?

az-el, pa ranges are straightforward numerical ranges (units rad, deg?
Default is what?  Can you give angles in hours?  I would guess
anything allowed by quanta is ok).

All numerical parameters including uvw, uvdist should be handled
similarly.  Are units necessary or are there sensible defaults?

Summary
-------

Overall, this looks very comprehensive and should allow clearer
selection within the various tools.  I have some issues, which were
given above and are summarized here:

1. Avoid relying on order of things to specify context, like channel
   striding and averaging.  I would try to use unique identifiers for
   things like channel numbers, start, striding and averaging.

2. You may need a way to allow 'escaping' of symbols appearing in
   names, such as \- or \:, or a way to include literal quotes,
   e.g. "1-2".

3. Allow access directly to parameter subtypes, e.g. field.name and
   field.index.

4. May need to separate field and source as parameters.

5. May need time.index for integrations (timestamps) and time.scan or
   scan for scans.

6. Unify u,v,w and uvdist.  Not sure about the :percentage% thing (and
   see 10 below).

7. Channel, pulsar bin, or field index specification through
   specification of n, start, step, width is best done through unique
   identifiers (see 1).  I suggest :nchan_start+step^width.
   Averaging (using width) should follow selection, and partially
   filled averages should be kept.  Note that anything enumerated
   could possibly be stridable, so keep notation clear.  May even need
   to do something like #n_start+step^width to be consistent.

8. Need to more fully specify what the states are (e.g. wobbler,
   electronic beam switch).

9. Numerical parameters, e.g. uvw, uvdist, amp, phase, pa, need units.
   What do they default to, or are defaults allowed?

10. Do we want a way to specify ranges by central value and delta
    around it, e.g. 10+-2 instead of 8-12, or 10+-20%?

11. Is enumeration allowed for anything that can be enumerated (rows
    in any table, polarizations, correlations)?  I would say yes...

One final comment is with regard to how this relates to TaQL
selection.  The above selection by parameters is most useful, it seems
to me, when the parameters are arguments to a selection tool, e.g.

   ms.selectpar(field='3C*',spw='2:16^2')

and not in a big string

   ms.selectpar(string="field='3C*', spw='2:16^2'")

since that is not really any different than TaQL.  On the other hand,
it might be advantageous to allow the latter, as it is often easier in
a script to assemble strings.  But this is just a random thought...
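
To make points 1 and 7 concrete, here is a minimal Python sketch of how
the suggested nchan_start+step^width notation might be parsed and
applied.  It is only an illustration under stated assumptions, not a
proposed implementation: the names parse_spec, ChannelSpec and
channel_groups are hypothetical, the defaults start=1, step=1, width=1
are my guess, and I assume that nchan counts the channels selected
before averaging, that averaging follows selection, and that a
partially filled last average is kept.

```python
import re
from dataclasses import dataclass, field

# Hypothetical grammar: [win[-win]:]nchan[_start][+step][^width]
_SPEC = re.compile(
    r"^(?:(?P<w0>\d+)(?:-(?P<w1>\d+))?:)?"  # optional window or window range
    r"(?P<nchan>\d+)"                       # number of channels selected
    r"(?:_(?P<start>\d+))?"                 # first channel (1-based)
    r"(?:\+(?P<step>\d+))?"                 # stride between selected channels
    r"(?:\^(?P<width>\d+))?$"               # channels averaged per output
)


@dataclass
class ChannelSpec:
    nchan: int
    start: int = 1
    step: int = 1
    width: int = 1
    windows: list = field(default_factory=list)  # empty means "not given"


def parse_spec(text):
    """Parse a spec such as '2:16^2' or '2-3:16_5+2^3'."""
    m = _SPEC.match(text.strip())
    if m is None:
        raise ValueError(f"cannot parse channel spec: {text!r}")
    windows = []
    if m.group("w0") is not None:
        first = int(m.group("w0"))
        last = int(m.group("w1") or first)
        windows = list(range(first, last + 1))
    spec = ChannelSpec(
        nchan=int(m.group("nchan")),
        start=int(m.group("start") or 1),
        step=int(m.group("step") or 1),
        width=int(m.group("width") or 1),
        windows=windows,
    )
    if spec.step < 1 or spec.width < 1:
        raise ValueError("step and width must be at least 1")
    return spec


def channel_groups(spec):
    """Select channels first, then group them for averaging.

    The last group may hold fewer than `width` channels, so that
    16^3 and 15^3 give different results.
    """
    selected = [spec.start + i * spec.step for i in range(spec.nchan)]
    return [selected[i:i + spec.width]
            for i in range(0, len(selected), spec.width)]
```

Under these assumptions, channel_groups(parse_spec('2:16^2')) yields 8
pairs for window 2, parse_spec('2-3:16^2') applies the same averaging
to windows 2 and 3, and '16^3' gives six groups (the last holding only
channel 16) where '15^3' gives five.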