Proto-pipeline User Test
Christine Wilson and Debra Shephard
15 January 2004

Overview:
=========

The prototype pipeline should undergo user testing before CDR2 to give us confidence in the pipeline software as it has been designed to date, to let us present test results at CDR2, and to help us decide whether the prototype pipeline project met its goals.

There should be one session of Pipeline user tests before CDR2. The tests will be primarily carried out by the Pipeline Subsystem Scientist (C. Wilson) with assistance as needed from L. Davis, D. Shephard, and other members of the Offline/Pipeline team. This test is planned for the period 7 April 2004 to 7 May 2004, with the written report of the test results completed by 7 May 2004.

There are two general types of tests that we would like to do with the prototype pipeline software before CDR2:

(A) tests to ensure that the prototype pipeline is doing the right steps and producing the right outcomes (e.g., images and logs) in processing the data

(B) tests of the automated functioning of the prototype pipeline, e.g., given a recipe that works for one data set, can the pipeline process similar data sets in an automatic or semi-automatic way

Note that the Pipeline Test Plan written by C. Wilson (Aug 29, 2003 version) focuses primarily on testing heuristics for the Science Pipeline (both in off-line mode and in the pipeline itself), the Quick Look user interfaces, and the Quick Look heuristics. There is nothing in that plan for user tests of the Pipeline itself (as distinct from the heuristics). However, the first test described in that plan (given here in Appendix A) provides a good description of what we need to do for the pre-CDR2 test if we replace the focus on heuristics with the two types of tests described above. For these tests, we need to be able to flag bad data, calibrate the data, and then image the data.
Thus, it is important that these primary functionalities be available in the prototype pipeline. There is some concern about whether flagging will be available; it would be preferable to delay the start of these tests to allow flagging to be included, rather than dropping the flagging function from the tests. Nevertheless, the hard deadline for completing these tests, including the flagging functionality, is CDR2.

In each type of test, we intend to run the prototype pipeline on two different data sets: one GRB data set from the VLA and one day's data set on GG Tau from IRAM PdB. Note that, although we intend to work only with the 3 mm and 1 mm continuum data for GG Tau in these tests, these data are intrinsically spectral line in nature. We will discuss these two types of tests separately below, as there are somewhat different requirements and criteria for success in each case.

The prototype pipeline will be deemed to have passed this user test if it successfully passes Test A (parts 1 and 2) and Test B (part 1).

Test A: Is the prototype pipeline functioning correctly?
========================================================

This test is designed to make sure that the processing of data in the prototype pipeline is proceeding correctly. In philosophy, this test is similar to the IRAM-AIPS++ Phase I test: we want to process the same raw data with the same processing steps via two different methods and see how well the output images and log files agree. In this test, since both methods (pipeline and AIPS++) are using the same underlying AIPS++ engines, we should also be able to ensure that choices of input parameters (ranges of data to flag, type of fit performed in calibration, etc.) are identical, and so we might expect the output images from the two methods to be identical also.
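
As a rough illustration of the quantitative image comparison that Test A relies on, the sketch below checks the tolerances listed in the GRB and GG Tau criteria (peak within 10%, rms within 20%, peak-to-rms ratio within 10%, difference image within 3 x rms). It is not AIPS++ or pipeline code: the function names, the rectangular source box, the choice of the offline image as the reference value, and the reading of "structure" as any single pixel are all assumptions made for illustration.

```python
from math import sqrt

# Tolerances taken from the Test A criteria; all names below are hypothetical.
PEAK_TOL = 0.10     # source peak flux agrees to within 10%
RMS_TOL = 0.20      # rms in source-free regions agrees to within 20%
RATIO_TOL = 0.10    # ratio of peak to rms agrees to within 10%
DIFF_NSIGMA = 3.0   # difference image has nothing larger than |3 x rms|


def rms(values):
    """Root-mean-square of a sequence of pixel values."""
    values = list(values)
    return sqrt(sum(v * v for v in values) / len(values))


def measure(image, source_box):
    """Return (peak, off-source rms) for an image given as a list of rows.

    source_box = (row0, row1, col0, col1), half-open. Pixels outside the box
    are treated as source-free (an assumption of this sketch).
    """
    r0, r1, c0, c1 = source_box
    peak = max(max(row) for row in image)
    off_source = [
        v
        for r, row in enumerate(image)
        for c, v in enumerate(row)
        if not (r0 <= r < r1 and c0 <= c < c1)
    ]
    return peak, rms(off_source)


def agrees(value, reference, tol):
    """True if value lies within the fraction tol of the reference value."""
    return abs(value - reference) <= tol * abs(reference)


def compare_images(pipeline_img, offline_img, source_box, check_difference=False):
    """Apply the Test A criteria; the offline image is used as the reference."""
    p_peak, p_rms = measure(pipeline_img, source_box)
    o_peak, o_rms = measure(offline_img, source_box)
    results = {
        "peak": agrees(p_peak, o_peak, PEAK_TOL),
        "rms": agrees(p_rms, o_rms, RMS_TOL),
        "peak_to_rms": agrees(p_peak / p_rms, o_peak / o_rms, RATIO_TOL),
    }
    if check_difference:
        # Difference-image check (listed only for the GG Tau test); the rms
        # is measured on the difference image itself.
        diff = [
            p - o
            for p_row, o_row in zip(pipeline_img, offline_img)
            for p, o in zip(p_row, o_row)
        ]
        results["difference"] = max(abs(d) for d in diff) <= DIFF_NSIGMA * rms(diff)
    return results
```

A call such as `compare_images(pipeline_img, offline_img, (r0, r1, c0, c1), check_difference=True)` returns one boolean per criterion. In practice the measurements would be made with AIPS++ tools by the tester, as described in this plan; the sketch only makes the arithmetic behind each criterion explicit.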

The prototype pipeline has been designed with the VLA GRB data in mind, and in fact will have been designed and tested using four specific sets of GRB data obtained on different sources and/or different days. Thus, we expect that the first part of this test, which uses one of these GRB data sets, should be fairly straightforward and easily successful. However, it is important to test the prototype pipeline using a set of data other than that for which it was originally designed to give us an idea of the generality of the system. We feel that one day of data from the GG Tau data set which was used in Phase I of the IRAM-AIPS++ test is ideal for this purpose, as it has been extensively processed in the offline AIPS++ environment and glish scripts exist to reduce the data. This data set has the additional advantage of being intrinsically spectral line data and so will test parts of the prototype pipeline that the GRB data cannot test.

In comparing the images, we will use a quantitative comparison of various measurements from the image, such as peak flux, rms level, etc. Note that this quantitative comparison will be done in AIPS++ using tools such as msplot by C. Wilson (not automatically by the protopipeline). The comparison of the images will be a little different between the GRB data (which produce a point source) and the GG Tau data (which produce an extended source); the exact criteria are described separately for each input data set.

In comparing the log files, we will not be looking for "word-for-word" matching, but rather looking for things like: errors reported in one log file but not in the other; successful task completion reported in one file but not in the other; details about the task (rms of fits to data, etc.) that differ between the log files; etc.

1. GRB test
-----------

Raw data required:
    one set of GRB data (taken on a single day); will be one of the data sets that Lindsey has used in designing the prototype pipeline

Scripts required:
    python script for prototype pipeline
    glish script for AIPS++

Comparison between:
    image produced by prototype pipeline and image produced by AIPS++ using glish script;
    log files produced by each method

Criteria:
    source peak flux agrees to within 10%
    rms in source-free regions agrees to within 20%
    ratio of peak to rms agrees to within 10%
    no important disagreements in log messages produced in pipeline and offline modes

Note: we expect that the raw data set will have been identified and the python script produced before these tests by L. Davis. The glish script for AIPS++ reduction will also need to be produced before these tests begin, perhaps by D. Shephard.

2. GG Tau test
--------------

Raw data required:
    one set of GG Tau data taken on a single day; likely the set of data that D. Shephard used in producing the offline cookbook; this data set will be filled (and if necessary run through phcor) outside of the prototype pipeline

Scripts required:
    python script for prototype pipeline
    glish script for AIPS++

Comparison between:
    image produced by prototype pipeline and image produced by AIPS++ using glish script;
    log files produced by each method;
    comparisons done for both 3 mm and 1 mm images

Criteria:
    source peak flux agrees to within 10%
    rms in source-free regions agrees to within 20%
    ratio of peak to rms agrees to within 10%
    difference image (between pipeline and AIPS++-produced images) has no structure larger than |3 x rms|, where rms is measured on the difference image
    no important disagreements in log messages produced in pipeline and offline modes

Note: we expect that the raw data set will have been identified and the glish script produced before these tests by D. Shephard.
The python script for prototype pipeline reduction will also need to be produced before these tests begin, presumably by L. Davis.

Test B: Can the prototype pipeline process data automatically?
==============================================================

This test is designed to test the ability of the pipeline to process a data set automatically using scripts that were not designed and optimized for that data set. It is important to note here that we are testing the ability of the prototype pipeline to do automatic data processing but not the heuristics used in that automatic processing. Thus, the images produced in this test are likely not to be the best images that could be produced from the data.

The automatic tests described here using 3 mm and 1 mm GG Tau data are lower priority than the automatic tests using GRB data. It would be useful to complete the GG Tau automatic tests if possible, but there may be difficulties with the GG Tau data that we do not foresee at this point.

The comparison of the images produced in this test with those produced in Test A will be qualitative rather than quantitative. In an ideal result, there will be nothing obviously wrong with the image, e.g., the rms and the peak values will be reasonably consistent given differences in integration times and time of observations. An image with obvious defects (such as stripes) would still be acceptable if it is clear that the defects arise from a failure of the heuristics underlying the python script (such as how flagging was done) rather than an error in the actual processing of the data. Again, comparison of the images will be done by C. Wilson using AIPS++ tools such as msplot.

In comparing the log files, we will be looking for things like: reported successful completion of each task; successful completion of the whole script; whether any errors reported apply to relatively simple tasks that should have been successful; etc.

1. GRB test
-----------

Raw data required:
    one additional set of GRB data that was not one of the four used in designing the pipeline; it may be easiest if these data observed the same source at the same frequency as used in Test A, but this may not be required

Scripts required:
    python script for prototype pipeline

Comparison between:
    image produced by prototype pipeline in Test A and image produced in Test B;
    log files produced by each Test

Criteria:
    did pipeline run to completion with no modifications of script? with minor modifications of script? with major modifications of script?
    did pipeline produce an image with no modifications of script? with minor modifications of script? with major modifications of script?
    what is the quality of the images produced in this test?
        if poor, is the poor quality due to limited or incorrect heuristics being used (OK) or a failure of the pipeline or pipeline script (not OK)?
    how do the log messages produced in this test compare with the log messages produced by the pipeline in Test A?

Note: the raw data set will need to be identified (by L. Davis and D. Frayer?) before these tests. The python scripts from Test A will be reused, ideally with no or very limited modifications.

2. GG Tau test
--------------

Raw data required:
    another set of GG Tau data taken on a single day; this data set will be filled (and if necessary run through phcor) outside of the prototype pipeline

Scripts required:
    python script for prototype pipeline

Comparison between:
    image produced by prototype pipeline in Test A and image produced in Test B;
    log files produced by each Test;
    comparisons done for both 3 mm and 1 mm images

Criteria:
    did pipeline run to completion with no modifications of script? with minor modifications of script? with major modifications of script?
    did pipeline produce an image with no modifications of script? with minor modifications of script? with major modifications of script?
    what is the quality of the images produced in this test?
        if poor, is the poor quality due to limited or incorrect heuristics being used (OK) or a failure of the pipeline or pipeline script (not OK)?
    how do the log messages produced in this test compare with the log messages produced by the pipeline in Test A?

Note: the raw data set will need to be identified (by D. Shephard) and filled before these tests. The python scripts from Test A will be reused, ideally with no or very limited modifications.

Examples of work required:
==========================

Test A:
1. Run python scripts in pipeline environment
2. Run glish scripts in AIPS++ environment
3. Compare images using AIPS++ tools (measure peak, rms, peak/rms ratio; for GG Tau produce difference maps and measure rms and peak values)
4. Compare pipeline log messages with offline log messages (will need to look at each task to see what happened and follow up on any discrepancies)

Test B:
1. Run python scripts in pipeline environment
2. Examine images using AIPS++ tools (what is peak, rms, etc.? are the values reasonable (e.g., too high, too low, etc.)? any obvious artifacts in images? if appropriate, how do they compare with images from Test A?)
3. Compare pipeline log messages with log messages from Test A (will need to look at each task to see what happened and follow up on any discrepancies; is every step doing what you'd expect, given the data? i.e., is it doing something similar to what's done in Test A?)

Examples of possible failure modes:
===================================

Test A:
1. glish script crashes (D. Shephard et al. to fix)
2. python script crashes (L. Davis et al. to fix)
3. images produced by python script are wrong (C. Wilson to try to diagnose why they are wrong, i.e., where in script did error occur; L. Davis et al. to fix script or underlying engines)

Test B:
1. python script crashes (C. Wilson to try to determine why it crashed; L. Davis et al. to fix if possible)
2. obvious error recorded or noted in log file although image produced (L. Davis to determine source of error and fix)
3. image seems wrong (C. Wilson to try to diagnose; is every step doing what you'd expect, given the data? i.e., is it doing something similar to what's done in Test A?)

Appendix A: First test from Pipeline User Test Plan
===================================================

August 29, 2003 version

Pipeline data processing, user test 1: (Occurs before R1.1) ** OPTIONAL **

Test 1 (stand-alone): Jan-Feb 2004, single field, no single dish, 256 channels or less, integration time 10 sec or less, 5-27 antennas, spectral line (without continuum subtraction), no self-calibration.

Testing Focus:
- calibration and/or imaging
- provide early testing experience to guide later, more critical tests

Lower priority:
- automatic data editing
- pointing, Tsys, weather etc. to identify bad data
- choice of best deconvolution algorithm and/or region
- quality assessment
- robustness of heuristics script to variations in organization of data set

Requires:
- raw data if calibration being tested, calibrated data if only imaging is being tested.
- partial heuristics script for single field imaging (Early Science Case)
- some components of offline package required by heuristics script are available
- one tester beyond the heuristics developers (the Subsystem Scientist)

*********************************************************************
COMMENTS ON TEST 1:
Whether this test is carried out at all depends on whether tools are available in the Pipeline Prototype Infrastructure for the heuristics team to use to develop a partial or complete heuristics script.
*********************************************************************