Proposal for the Second Pipeline User Test

Chris Wilson
Draft 3: 10 June 2004

Overview:
=========

The Second Pipeline User Test will be a type (1) test (see Pipeline
Test Plan):

  "(1) testing the heuristics for the science pipeline in an off-line
  or stand-alone mode"

This test will be run using glish scripts running in AIPS++.

Test start:        July 5, 2004     (NOTE: sandwiched between Canadian
Test end:          August 5, 2004    holiday weekends)
Report completed:  August 13, 2004

Testers: Chris Wilson (McMaster University)
         James di Francesco (HIA) (TBC)
         Brenda Matthews (Berkeley) (TBC)

The goal of this User Test is to test the ability of the pipeline
heuristics to perform automatic flagging of the calibrator data in a
VLA or PdBI data set. We foresee that two types of flags will be
applied:

(1) flags based on the internal properties of the calibrator itself
    (i.e. amplitude, phase);
(2) flags based on knowledge of the properties of the instrument on
    which the data were obtained (e.g. VLA: remove autocorrelator
    data; PdBI: remove Gibbs channels).

We expect the second type of flags to be quite straightforward, so the
evaluation focuses on the first type of flags. This test will use only
the raw uv data for the calibrator sources; it will not flag the
target source data, it will not attempt to solve for or apply
calibration solutions, and it will not attempt to produce images.

This test will test and grade a single requirement from the revised
Science Pipeline Software Requirements (6.5.1-R0) (see Appendix A).
Flagging based on weather conditions (G1.1) would require simulated
ALMA data and so will not be tested at this time; whether flagging
based on antenna temperature (G1.2) can be carried out in this test is
TBD.

The basic questions to be evaluated in this test are the following:

(A) Are there data which should have been flagged but weren't flagged?
(B) Are there data which should not have been flagged but were
    flagged?
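
Questions (A) and (B) amount to comparing the flags applied by the
script with the flags a tester would have applied. A minimal sketch of
that bookkeeping is given here, in Python rather than glish and purely
for illustration. The function name compare_flags and the
representation of flags as one boolean per uv data point are
assumptions made for this example; they are not part of the pipeline
scripts or of AIPS++.

```python
def compare_flags(script_flags, tester_flags):
    """Count disagreements between script flags and tester flags.

    Both arguments are equal-length sequences of booleans, one entry
    per uv data point, where True means "flagged as bad".
    (Hypothetical representation, chosen only for this sketch.)
    """
    if len(script_flags) != len(tester_flags):
        raise ValueError("flag lists must cover the same uv data points")

    total = len(script_flags)
    pairs = list(zip(script_flags, tester_flags))

    # Question (A): the tester would flag it, but the script did not.
    missed = sum(1 for s, t in pairs if t and not s)
    # Question (B): the script flagged it, but the tester would not.
    wrongly_flagged = sum(1 for s, t in pairs if s and not t)

    return {
        "total": total,
        "missed": missed,
        "wrongly_flagged": wrongly_flagged,
        "missed_fraction": missed / total if total else 0.0,
        "wrongly_flagged_fraction": wrongly_flagged / total if total else 0.0,
    }


if __name__ == "__main__":
    script = [True, False, False, True]
    tester = [True, True, False, False]
    print(compare_flags(script, tester))
```

The counts and fractions returned correspond to the "total number and
fraction of uv points" that testers are asked to estimate; in practice
the tester's judgement, not a script, supplies the second list.
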

This test is of necessity a qualitative test, as flagging bad data is
somewhat subjective. This is one of the reasons the test will involve
three different testers. The testers will attempt to answer these
questions by examining the raw and flagged uv data using AIPS++ and
also inspecting the logs, where that may be helpful. Where problems
are found, the testers will be asked to describe the problem that was
found and to provide an evaluation of the severity of the problem
(e.g. does it affect a single uv data point or 25% of the uv data, to
give an extreme example). The detailed description of the test is
given below.

The Pipeline will be deemed to have passed this user test if:

(A) the flagging quality on the two data sets used to develop the
    heuristics is evaluated to be good by all three testers;
(B) steps (1) and (2) complete properly for all ten of the new data
    sets (i.e. the glish scripts don't crash or flag all the data
    bad, etc.);
(C) flagging quality is evaluated to be good for at least 3 of the 5
    data sets from each of the VLA and PdBI by all three testers.

Two of the data sets that were used in developing the heuristics are
included in this test as a "reality check" on the flagging heuristics.
The first step in the test will be for the testers to examine these
two data sets, which the developers believe are properly flagged, to
see if the testers agree with the assessment of the developers.

Test Description:
=================

Raw data required:
- one set of VLA and one set of PdBI data that were used in
  developing the heuristics
- one set of VLA spectral line data for each of five different
  projects; one set of PdBI data for each of five different projects

TBD: will we use both 3 mm and 1 mm data in the PdBI data sets?
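
The pass criteria (A) to (C) given in the Overview could be tallied
mechanically once the testers' grades are in. The following is a
minimal sketch in Python (not glish), for illustration only. The
function names, the dictionary layout, and in particular the reading
of "evaluated to be good" as "graded Good or better on the five-level
quality scale" are assumptions of this example, not decisions recorded
in this proposal.

```python
# Five-level quality scale, ordered from worst to best.
GRADES = ["Poor", "Fair", "Good", "Excellent", "Perfect"]


def is_good(grade):
    # ASSUMPTION: "good" means Good or better on the scale above.
    return GRADES.index(grade) >= GRADES.index("Good")


def all_testers_good(grades):
    """True if every tester graded the data set Good or better."""
    return all(is_good(g) for g in grades)


def pipeline_passes(dev_grades, new_sets):
    """Evaluate pass criteria (A), (B) and (C).

    dev_grades: dict mapping each development data set name to the
                list of grades given by the testers.
    new_sets:   list of dicts, one per new data set, with keys
                'telescope' ("VLA" or "PdBI"), 'completed' (bool: the
                script ran and flagged known bad data properly) and
                'grades' (list of tester grades).
    (Hypothetical layout, chosen only for this sketch.)
    """
    # (A) both development data sets graded good by all testers
    crit_a = all(all_testers_good(g) for g in dev_grades.values())
    # (B) scripts completed properly on every new data set
    crit_b = all(ds["completed"] for ds in new_sets)
    # (C) at least 3 data sets per telescope graded good by all testers
    crit_c = True
    for telescope in ("VLA", "PdBI"):
        n_good = sum(
            1
            for ds in new_sets
            if ds["telescope"] == telescope and all_testers_good(ds["grades"])
        )
        crit_c = crit_c and n_good >= 3
    return crit_a and crit_b and crit_c
```

Keeping the three criteria as separate booleans makes it easy to
report which one failed rather than only the overall result.
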

Scripts required:
  glish script for AIPS++ for flagging VLA data
  glish script for AIPS++ for flagging PdBI data

Examination of:
  raw uvdata and flagged uvdata for each calibrator source in each
  data set (examination done in AIPS++ using a display tool);
  examination of logs produced for each data set, where useful

Criteria:

(1) did the script run to completion
      with no modifications of script?
      with minor modifications of script?
      with major modifications of script?

(2) did the script properly flag data known to be bad a priori?
      (VLA) autocorrelations removed?
      (VLA) "quack" run successfully?
      (PdBI) Gibbs channels removed?
      (both) shadowed antennas flagged?
      how many data points remaining at this point?

(3) For each calibrator source:

    were the right number of channels removed at the end of each
    spectrum?
      too many channels removed (how many too many?)
        how many uv data points involved?
      not enough channels removed (how many too few?)
        how many uv data points involved?
      describe the severity of the error:
        EXAMPLE: does it flag one single channel bad that you think
        it shouldn't flag, or did it flag 25% of the channels bad?

    were the correct number of spuriously high data points removed?
      how many high data points missed?
      how many high data points removed by mistake?
      describe the severity of the error:
        EXAMPLE: did it miss clear amplitude outliers (e.g. a uv
        point 3 times higher than any other uv point), or did it
        just not flag quite tightly enough (you'd have flagged at
        amp=95, it flagged at amp=100)

    were data with large phase scatter removed properly?
      how much bad phase data missed?
      how much data removed by mistake?
      describe the severity of the error:
        EXAMPLE: did it miss flagging a set of data with 360 deg
        phase scatter, or did it just miss a set of data with 50%
        more phase scatter than was typical (e.g. 30 deg range
        instead of 20 deg range)

    Are there other criteria you would normally use instead of, or
    in addition to, the criteria above to evaluate the quality of
    the flagging?
      if so, please describe and then evaluate the success of the
      automatic flagging as above.

    in your opinion, what is the total number and fraction of uv
    points that were mistakenly flagged as bad?

    in your opinion, what is the total number and fraction of uv
    points that were mistakenly NOT flagged as bad?

(4) Please provide an estimate of the quality of the automatic
    flagging for this data set:

      Perfect    (flagged exactly what I would have flagged)
      Excellent  (minor flagging differences, unlikely to affect
                 calibration fit results)
      Good       (significant numbers of flagging differences;
                 unlikely to affect calibration fit results; may
                 result in some loss of source data, e.g. one whole
                 block of calibrator data flagged as bad)
      Fair       (lots of flagging differences; may still result in a
                 reasonable calibration fit; significant loss of data
                 and/or chance of artifacts in the final image;
                 please expand)
      Poor       (the data as flagged would not give a good
                 calibration solution; please describe why you think
                 so, provide plots, etc.)

Examples of work required:
==========================

For each of 10 data sets:

1.* run glish scripts in AIPS++ environment
2.* assess whether flagging of "known" bad data was performed
    correctly
3.* count the number of uv data points for each calibrator
4.  inspect log messages for any useful information, errors, etc.
5.  inspect the flagged uv data for each calibrator
    - amplitude versus time
    - phase versus time
    - amplitude versus channel
    - (maybe) system temperature versus time
    - any other measures the tester normally would use
    note any uv data the tester would have flagged as bad
    - number of uv points affected
    - time range, if appropriate
6.  inspect the raw uv data and compare to the flagged data
    - amplitude versus time
    - phase versus time
    - etc.
    note any uv data that the script flagged as bad that the tester
    would NOT have flagged as bad
    - number of uv points affected
    - time range, if appropriate

* these items may need to be done by only one tester (C.
Wilson) and the resulting data sets and information made available
  to other testers

Examples of possible failure modes:
===================================

1. script crashes before reaching completion (J. Lightfoot et al. to
   fix)
2. script fails to flag "known" bad data properly (J. Lightfoot et
   al. to fix)
3. flagging of one of the data sets is not perfect (tester to
   document nature of failure in detail)
4. flagging of one of the data sets is useless (tester to provide an
   example of a piece of very bad data that should have been flagged;
   J. Lightfoot to try to figure out why it was missed)

Preparation required:
=====================

Preparation for this test (preparing and pre-testing heuristics
scripts, identifying data sets) will be carried out by John Lightfoot,
Dirk Muders, Chris Wilson, and possibly Debra Shepherd (help with VLA
and other data sets). The test will be carried out by Chris Wilson and
at least one other person, possibly from outside the ALMA project.
Support during the testing period will be provided by John Lightfoot
and Chris Wilson, and also Dirk Muders, if his time permits.

- need scripts for flagging of PdBI and VLA spectral line data sets
  (John)
- identify additional PdBI and VLA data sets to be used as a "reality
  check" on the heuristics scripts once they are developed; note
  these data sets are NOT the same ones used in the test described
  here, but are for the heuristics developers (Dirk, John)
- need at least 5 new PdBI and 5 new VLA spectral line data sets
  (Dirk, Chris)
- may need "target" files to provide e.g. source identification for
  scripts for each data set (John)
- need instructions for testers, e.g. nature of test, AIPS++, etc.
  (Chris)
- once we think we are ready to go, may need to run the 10 new data
  sets through the scripts just to make sure they don't crash or
  otherwise fail due to minor pilot errors (e.g. typos in "target"
  files).
  If so, it's important that no changes to the heuristics in the
  scripts be made at this stage so as to provide a fair test ...
  (John, Chris?)
- (maybe) one PdBI data set and one VLA data set that have been
  successfully processed as part of script development; provide to
  Chris to test that the AIPS++ installation works (John)
- (maybe) install the correct version of AIPS++ at McMaster to match
  what the heuristics team is using (Chris)
- (maybe) run one pre-tested VLA data set and one PdBI data set
  through scripts to make sure the AIPS++ installation is working
  properly (Chris)
- review methods for plotting uvdata in AIPS++ to make sure it will
  work well for testers (Chris)

Appendix A:
===========

Software Requirements to be tested/evaluated:
=============================================

May 18, 2004 version

3.6.5.1 INTERFEROMETRIC DATA

6.5.1-R0
  G1    The Pipeline shall flag bad data using some or all of the
        following criteria:
  G1.1  weather conditions                        Priority: 1
  G1.2  antenna temperature                       Priority: 1
  G1.3  difference from a running mean            Priority: 1
  G1.4  difference from an rms                    Priority: 1
  G1.5  OTHERS??                                  Priority: 2
  G2    The data must nevertheless be archived    Priority: 1
  G3    The flagging must be reversible in an
        off-line data reduction system            Priority: 2

Appendix B:
===========

Original Version of Second Test from Pipeline User Test Plan:
=============================================================

*************************************************
** this is included for reference so you can   **
** see what has changed from the original plan **
*************************************************

Pipeline data processing, user test 2: (Occurs before R2)

Test 2 (stand-alone): July 2004, single field, no single dish, 256
channels or less, integration time 10 sec or less, 5-27 antennas,
spectral line (with and without continuum subtraction), no
self-calibration.

Testing Focus:
- automatic data editing
- calibration
- imaging

Lower priority:
- pointing, Tsys, weather etc.
  to identify bad data
- choice of best deconvolution algorithm and/or region
- quality assessment
- robustness of heuristics script to variations in organization of
  data set

Requires:
- raw data (real and/or simulated)
- heuristics script for single field imaging (Early Science)
- required components of offline package available to be called by
  heuristics script
- meta-data (lower priority?)
- results from Tel Cal processing of calibration observations (lower
  priority?)
- two testers (Subsystem Scientist + one other (Offline Subsystem
  Scientist?))

COMMENTS ON TEST 2:

We may need to think carefully about how best to do a stand-alone
user test of the heuristics script in a sensible way, so that a lot of
extra effort is not required to prepare two versions of the same
script (one for the stand-alone test, a second for the integrated
test).