ALMA Computing document number: COMP-70.45.00.00-004-A-REP

Report from the First Pipeline User Test
========================================

Christine Wilson
May 4, 2004

Test Start Date: April 7, 2004
Test Planned End Date: May 7, 2004
Test Actual End Date: May 4, 2004
Tester: Christine Wilson
Computer used to run scripts: bernoulli.aoc.nrao.edu
Computer for image analysis: eccles.physics.mcmaster.ca
Total hours spent on test: 23 hours

Introduction
============

The first Pipeline User Test consisted of two different types of tests of the prototype pipeline software:

(1) two tests to ensure that the prototype pipeline is doing the right steps and producing the right outcomes (e.g. images and logs) in processing the data

(2) two tests of the automated functioning of the prototype pipeline, e.g., given a recipe that works for one data set, is the pipeline successful in processing similar data sets in an automatic or semi-automatic way.

Each of these tests is described in more detail below.

The scripts used in the test were run in a guest account on the NRAO computer bernoulli.aoc.nrao.edu, while the image analysis and log inspection were done on the Solaris machine eccles.physics.mcmaster.ca at McMaster University. The guest account environment was configured for both AIPS++ and ACS by Lindsey Davis before the start of the test.

Each test was run on two fundamentally different data sets. One was an 8.4 GHz continuum data set from the VLA (to image a GRB) and the second was a 115 GHz/230 GHz spectral line data set from the Plateau de Bure (PdB) Interferometer (to image a protostar). Each test data set consisted of a single time-continuous set of data from the telescope. The data used in the second test were taken with essentially the same instrumental setup on the same astronomical sources, but obtained on a different day from the first test.
The data used in the second test had not been used to develop the prototype pipeline and had not been run through the pipeline prior to the immediate preparation for this test.

It is important to understand that the PdB data are intrinsically spectral line data, even for the "continuum" data contained within the data set. Because the observations of the CO spectral line do not have sufficient signal-to-noise to produce a good image from a single day's worth of data, the tests were carried out on the 115 GHz and 230 GHz continuum data only. However, these data are still valid for testing the spectral line portions of the prototype pipeline.

Test 1: Is the prototype pipeline functioning correctly?
========================================================

This test was designed to make sure that the processing of data in the prototype pipeline is proceeding correctly. In philosophy, this test is similar to the IRAM-AIPS++ Phase I test: we want to process the same raw data with the same processing steps via two different methods and see how well the output images and log files agree. In this test, since both methods (pipeline and AIPS++) are using the same underlying AIPS++ engines, we should also be able to ensure that choices of input parameters (ranges of data to flag, type of fit performed in calibration, etc.) are identical, and so we might expect the output images from the two methods to be identical also.

Lindsey Davis prepared and tested the two glish and two python scripts before the start of the test period. She also identified the set of raw data from the VLA and configured the input "targets" file for the script to find the raw data properly. She also stored the PdB data for this test in FITS format in the AIPS++ data repository and in the AIPS++ ms format in the test directory.
(Note that the PdB data had to be read into AIPS++ format outside of the python script because no python interface to the almatifiller task was planned as part of the prototype pipeline.)

Test 2: Can the prototype pipeline process data automatically?
==============================================================

This test was designed to test the ability of the pipeline to process a data set automatically using scripts that were not designed and optimized for that data set. It is important to note here that we are testing the ability of the prototype pipeline to do automatic data processing but not the heuristics used in that automatic processing. Thus, the images produced in this test are likely not to be the best images that could be produced from the data.

Lindsey Davis prepared and tested the glish and python scripts for the VLA data set for Test 2 before the start of the test period. She also identified the set of raw data from the VLA and configured the input "targets" file for the script to find the raw data properly.

Due to time constraints and personnel availability, the PdB data set for Test 2 was only identified after the formal start of the test period. Lindsey Davis stored the PdB data for this test in FITS format and in the AIPS++ ms format in the test directory, and set up the scripts and input files. The scripts and data were available by the end of the first week of the test period, well before the tester was ready to start this part of the test.

Summary of Test Results:
========================

Test 1 ("Is the prototype pipeline functioning correctly?") was very successful. After minor discrepancies (that had been left in by accident) between the glish and python scripts were corrected, the images produced by the python and glish scripts were identical for both the GRB data set and the PdB data set. The logs from the two scripts also agreed very well.
The only significant difference between the logs appeared in autoflag, where python writes out incomplete information in reporting back from some tasks. Both glish and python gave one or two warning messages that clearly did not affect the actual processing of the data.

Test 2 ("Can the prototype pipeline process data automatically?") was also successful. The performance of the prototype pipeline on the unknown GRB data set was impressive, as no modifications to the script were required between Test 1 and Test 2. The only significant differences between Test 1 and Test 2 were in the clean thresholds specified as inputs to the script and in the fact that the first data set reached threshold in under 1000 iterations while the second one did not. This difference could be attributed to how precisely the threshold was set to 3 sigma of the noise in the dirty image, and should be classified as a heuristics issue, which is not the focus of this test. (Inspection of the dirty V image showed that the threshold chosen was 3.03 sigma for Test 1 and 2.93 sigma for Test 2, which is probably enough to explain the different level of success in reaching threshold.) In any case, the quality of the image produced by Test 2 was not substantially different from the image produced by Test 1 for the GRB data.

The performance of the prototype pipeline on the unknown PdB data set was also very good. Here also Test 1 reached threshold in clean and Test 2 did not, both for the 1 mm and the 3 mm data. The images produced in Test 2 appear qualitatively better than those produced in Test 1 for this source; both the 3 mm and 1 mm images from Test 2 show a clear ring of emission for the central source. In contrast, the 3 mm image from Test 1 is CONSISTENT with a ring of emission but is broken down into several point sources rather than a smooth ring.
The 1 mm image from Test 1 looks qualitatively wrong, in that it shows a scattering of point sources concentrated in the central part of the image and no obvious evidence of a ring. However, comparison of the 1 mm image from Test 1 with the image produced from the same data set in the AIPS++ PdB cookbook suggests that this result is in fact the correct image given the input data. Identical scripts were not used for the PdB data in Test 1 and Test 2; in particular, details of the data flagging were tweaked for Test 2. However, an extra test running the script from Test 1 on the data from Test 2 produced images very similar to those obtained with the tweaked script.

Overall, the first Pipeline User Test is judged to be a success.

Appendix: Details of Individual Tests
=====================================

GRB test 1
==========

Raw data required: one set of GRB data (AK573_D031030) (VLA B array)
Scripts required: python script for prototype pipeline
                  glish script for AIPS++
Comparison between: image produced by prototype pipeline and image produced by AIPS++ using glish script; log files produced by each method

Notes:

(1) The first time this test was run, there were some minor incompatibilities between the inputs in the python and glish scripts:

(a) In autoflag, python is running with
    Selector: flag autocorr quack=120s,10s
while glish is running with
    Selector: flag autocorr quack=60s,10s

(b) in calibrater
glish:
    The following calibration components will be solved for:
    G table=D031030cal.ms.gtab t=90 preavg=90 phaseonly=F refant=0 append=F
python:
    The following calibration components will be solved for:
    G table=D031030cal.ms.gtab t=90 preavg=60 phaseonly=F refant=0 append=F

(c) the first time opacity is used
glish:
    For interval of -1 seconds, found 178 slots
    Elevation-dependent opacity corrections will be applied using a constant zenith opacity = 0
python:
    For interval of -1 seconds, found 178 slots
    Elevation-dependent opacity corrections will be applied using a
    constant zenith opacity = 1e-04

Points (a) and (c) were simply typos in the scripts that were supplied and were corrected by Lindsey Davis. Point (b) was slightly more complicated but was quickly fixed; Lindsey explained it this way:

"This turned out to be a Glish / Python interface issue. On the Glish side I was not setting the preavg value specifically but letting AIPS++ choose the default value. Although the specified default value is 60, for type 'G' solutions setsolve ignores this and sets preavg=t (90 in this case). On the Python side I cannot currently let parameters default so I must set them explicitly. I chose what I thought was the default value (60) and the C++ code accepted this. It turns out the C++ code does some limit checking as well which complicates things in some circumstances, but not here. The bottom line is to fix this problem for now I explicitly set preavg=t in the Python code."

These three problems (and one similar to (b) that popped up in imager) were fixed by Lindsey, and then the test scripts were run again from scratch. All the analysis of the logs and images was carried out on the results from this second running of the scripts.

Criteria:

source peak flux agrees to within 10%                YES (identical)
rms in source-free regions agrees to within 20%      YES (identical)
ratio of peak to rms agrees to within 10%            YES (identical)
no important disagreements in log messages
  produced in pipeline and offline modes             OK

Comments:

(1) glish and python images appear identical (from the statistics on the full image as well as the restoring beam listed in the header).

(2) The only significant difference between the logs appears to be in autoflag, where python writes out incomplete information in reporting back from some tasks.
For example, the glish log would report
    Selector: 0 pixel flags, 7182 row flags
where the python log would report simply
    Selector[
However, since the fitted beams and the peak values for the iterations in clean are identical between the glish and python implementations for both the calibrator and the source, I believe this difference is only one of reporting to the log and not one of acting on the data.

Summary: The images produced by the glish and python scripts were identical. The only significant difference between the logs appears to be in autoflag, where python writes out incomplete information in reporting back from some tasks. However, since the images produced are identical, I believe these differences in the logs are only ones of reporting to the log and not ones of acting on the data.

PdB test 1
==========

Raw data required: one set of GG Tau data (taken 07feb97)
Scripts required: python script for prototype pipeline
                  glish script for AIPS++
Comparison between: image produced by prototype pipeline and image produced by AIPS++ using glish script; log files produced

Notes:

(1) Lindsey Davis found some minor incompatibilities between the inputs in the python and glish scripts before the tester had a chance to run them (see Note 1 in GRB Test 1 above). These inconsistencies were fixed during the first week of the test period, before the tester began this portion of the test.

(2) The tester ran into a small but frustrating delay in calculating the difference image, as she was not familiar enough with AIPS++ and the AIPS++ general help files were not clear enough for her to quickly figure out how to calculate a difference image. However, Lindsey Davis responded promptly with a recipe that worked for the tester on the first try.

(3) A major irritant in inspecting the logs in this test was the fact that the sequence of commands diverged in a major way starting with the first call to imager.
This made it even more difficult to compare the glish and python calls line by line than it was for GRB test 1. In the end, the tester had to work through the python log sequentially and jump around in the glish log to find the corresponding sections to compare. This has the danger that extra steps in the glish log could be overlooked. However, since the final images from glish and python are identical, it was probably safe enough in this case.

Criteria:                                            3mm    1mm

source peak flux agrees to within 10%                 IDENTICAL
rms in source-free regions agrees to within 20%       IDENTICAL
ratio of peak to rms agrees to within 10%             IDENTICAL
difference image (between pipeline and
  AIPS++-produced images) has no structure
  larger than |3 x rms|, where rms is
  measured on the difference image                    IDENTICAL
no important disagreements in log messages
  produced in pipeline and offline modes             OK     OK

Comments:

(1) glish and python images appear identical (from the statistics on the full image as well as the restoring beam listed in the header) and the difference images showed them to be precisely identical.

(2) In inspecting the logs, the fact that glish-autoflag is 1-indexed while python-autoflag is 0-indexed caused some confusion for the tester at first, i.e., it was not clear that these two log messages were in fact reporting identical processing.
glish:
    Selector: chan=34:35 spwid=3 field=1; flag all
python:
    Selector: chan=33:34 spwid=2 field=0415+379; flag all

(3) Some incompleteness in reporting autoflag steps was also seen in some parts of the PdB python log (see GRB Test 1, comment 2).

(4) There were a couple of warning messages in the glish log that did not appear to have counterparts in the python log:
    WARN calibrater The following transfer fields have no solutions available: Index=-1=out-of-range
    WARN calibrater:setsolvegainsp Calibration table 07feb97-g067.ms.1mm.gcal exists, and append=F.
    WARN calibrater:setsolvegainsp Therefore, this table will be updated.
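
The acceptance criteria listed in the Test 1 tables can be written down as a small check. The following is only a sketch, not the procedure actually used in the test (the comparisons were done by inspection in AIPS++). It assumes that "agrees to within N%" means the fractional difference relative to the mean of the two values, and the names ImageStats, check_criteria and difference_image_ok are hypothetical. The numbers in the usage section are illustrative values, not measurements from this test.

```python
import math
from dataclasses import dataclass


@dataclass
class ImageStats:
    """Summary statistics for one image (hypothetical container)."""
    peak: float  # source peak flux
    rms: float   # rms measured in a source-free region (must be non-zero)


def fractional_difference(a: float, b: float) -> float:
    """|a - b| relative to the mean of |a| and |b| (assumed definition)."""
    scale = (abs(a) + abs(b)) / 2.0
    return 0.0 if scale == 0.0 else abs(a - b) / scale


def check_criteria(pipeline: ImageStats, offline: ImageStats) -> dict:
    """Evaluate the three statistical criteria from the Test 1 tables."""
    ratio_pipeline = pipeline.peak / pipeline.rms
    ratio_offline = offline.peak / offline.rms
    return {
        "peak_within_10pct":
            fractional_difference(pipeline.peak, offline.peak) <= 0.10,
        "rms_within_20pct":
            fractional_difference(pipeline.rms, offline.rms) <= 0.20,
        "peak_to_rms_within_10pct":
            fractional_difference(ratio_pipeline, ratio_offline) <= 0.10,
    }


def difference_image_ok(pipeline_pixels, offline_pixels) -> bool:
    """True if no pixel of the difference image exceeds |3 x rms|,
    with rms measured on the difference image itself."""
    diff = [p - q for p, q in zip(pipeline_pixels, offline_pixels)]
    rms = math.sqrt(sum(d * d for d in diff) / len(diff))
    return all(abs(d) <= 3.0 * rms for d in diff)


if __name__ == "__main__":
    # Illustrative values only; identical inputs mimic the Test 1 outcome.
    glish = ImageStats(peak=1.0, rms=0.01)
    python = ImageStats(peak=1.0, rms=0.01)
    print(check_criteria(python, glish))
    print(difference_image_ok([0.1, 0.2, 0.3], [0.1, 0.2, 0.3]))
```

For precisely identical images, as found here, the difference image is zero everywhere, so every criterion passes trivially. The thresholds only become interesting when the two processing paths diverge.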
Summary: The images produced by the glish and python scripts were identical for both 3 mm and 1 mm. The only significant differences between the logs appear to be in autoflag and calibrater. In autoflag, python writes out incomplete information in reporting back from some tasks. There are also some differences in some of the messages which I think are due to the 1-index for glish versus 0-index for python. In calibrater, there are two types of warning messages in the glish logs that aren't in the python logs. However, since the fitted beams and the peak values for the iterations in clean are identical between the glish and python implementations for both the calibrator and the source, I believe these differences are only ones of reporting to the log and not ones of acting on the data.

GRB test 2
==========

Raw data required: one set of GRB data that was not used in designing the pipeline (AK575_K040329) (VLA C array)
Scripts required: python script for prototype pipeline (glish script also available for comparison)
Comparison between: image produced by prototype pipeline in Test 1 and image produced in Test 2; log files produced by each Test

Notes:

(1) The preliminary tests by Lindsey Davis revealed an inconsistency in the way the clean mask was dealt with between the glish and python scripts. The inconsistency showed up because the data for Test 2 had more than one source in the target field. This inconsistency was fixed before the start of the official test period.

Criteria:

did pipeline run to completion
    with no modifications of script?       YES
    with minor modifications of script?
    with major modifications of script?
did pipeline produce an image
    with no modifications of script?       YES
    with minor modifications of script?
    with major modifications of script?
what is the quality of the images produced in this test?
    VERY GOOD
if poor, is the poor quality due to limited or incorrect heuristics being used (OK) or a failure of the pipeline or pipeline script (not OK)
how do the log messages produced in this test compare with the log messages produced by the pipeline in Test 1?
    WELL

Comments:

(1) Identical scripts were used for GRB Test 1 and GRB Test 2. The only differences were in the inputs in the "targets" file, of which the only important ones were the clean thresholds specified for the calibrator and the target source. Lindsey Davis had decided on these thresholds before the test by producing a dirty image in I and V and setting the threshold to something like 3 * sigma of the V image.

(2) In clean on the target source, Test 1 reached threshold at iteration 910 but Test 2 did not reach threshold after 1001 iterations. The final peak was 0.0855 mJy, whereas the threshold was specified to be 0.08 mJy.

Summary: The performance of the pipeline on the unknown VLA GRB data set was excellent. Both the images and the logs agree very well. It is impressive that no modifications to the script were required for this test.

PdB Test 2
----------

Raw data required: another set of GG Tau data (taken 20feb97)
Scripts required: python script for prototype pipeline (glish script also available for comparison)
Comparison between: image produced by prototype pipeline in Test 1 and image produced in Test 2; log files produced by each Test; comparisons done for both 3 mm and 1 mm images

Notes:

(1) Lindsey Davis ran the data set through the glish script without error, but she did have to fiddle with one "auto" flagging routine a little bit, as the sources and conditions are somewhat different and the actual autoflagging does not work as well on these data as on the GRB data. In the end the only changes made to the script between Test 1 and Test 2 were to some flagging criteria.
(2) For both Test 1 and Test 2, Lindsey Davis identified small specific time ranges to be flagged using msplot, information in the Cookbook (for Test 1), and personal judgement. This flagging was specified as one of the inputs to the script using the "targets" file. There were also some specific small ranges of channels identified as bad in each case, again by Lindsey Davis by examining the data. This flagging was specified inside the script itself.

(3) In general, the flagging of the PdB data was divided into two parts:
(a) a common part (known bad time intervals, known bad data, i.e. bad channels)
(b) an autoflag method which for the PdB data is not very auto. This mostly includes channels that are bad in a given time interval, or on a given baseline, as determined by msplot and judgement.
Lindsey did not have too much luck with autoflag auto methods here, although she did spend more time tuning the algorithms for the GRB data.

(4) The tester also ran the data for this second test through the identical script used in the first test. That script ran to completion and produced an image that was not noticeably different from that produced by the optimized script prepared by Lindsey Davis (see Note 1).

Criteria:

did pipeline run to completion
    with no modifications of script?       YES
    with minor modifications of script?    YES
    with major modifications of script?
did pipeline produce an image
    with no modifications of script?       YES
    with minor modifications of script?    YES
    with major modifications of script?
what is the quality of the images produced in this test?
    VERY GOOD
if poor, is the poor quality due to limited or incorrect heuristics being used (OK) or a failure of the pipeline or pipeline script (not OK)
how do the log messages produced in this test compare with the log messages produced by the pipeline in Test 1?
    WELL

Comments:

(1) The python scripts used for Test 2 and Test 1 were not identical in how the data were flagged.
(a) in the first call to autoflag, more data were flagged in Test 2
    Flagging MS '20feb97-g067.ms' chunk 2 (field 0415+379, spw 1)
    Test 2: 100 (29.41%) rows have been flagged.
    Test 1: 20 (4.55%) rows have been flagged.
(and similar messages for spw 2 to spw 23)

(b) in the second call to autoflag
Test 1:
    Selector: chan=33:34 spwid=2 field=0415+379; flag all
    Selector#2: spwid=2 field=0528+134 timerng(1) ant=4; flag all
    Selector#3: spwid=10,14,18 field=0528+134 timerng(1) ant=4; flag all
    Selector#4: chan=34:37 spwid=2 field=CRL618 timerng(1) baseline=1-3,2-4; flag all
    Selector#5: chan=38:46 spwid=10,14,18 field=CRL618 timerng(1) baseline=1-3; flag all
Test 2:
    Selector: chan=33:34 spwid=2 field=0415+379; flag all
    Selector#2: chan=38:39 spwid=2 field=MWC349 ; flag all

(2) There were naturally some differences in the rms reported from the bandpass fits:
Test 2:
    Per-baseline RMS log(Amp) statistics: (min/mean/max) = 0.0137405/0.0224736/0.0452974
    Per baseline RMS phase (deg) statistics: (min/mean/max) = 0.60342/1.12337/2.10041
Test 1:
    Per-baseline RMS log(Amp) statistics: (min/mean/max) = 0.00795083/0.0132126/0.02231
    Per baseline RMS phase (deg) statistics: (min/mean/max) = 0.510469/0.572744/0.682272
Similar differences were reported for the 1 mm bandpass, although in this case the rms for Test 2 was a bit better than that for Test 1.

(3) In clean on the 3 mm data, Test 2 did not reach threshold while Test 1 reached threshold in 298 iterations. This is probably OK, as the emission appears more extended in the image from the Test 2 data.

(4) In clean on the 1 mm data, Test 2 did not reach threshold while Test 1 did. In addition, the log messages for Test 2 suggest it may not have gone through many loops.
    2004-04-26T20:52:34.989 Initial maximum residual: 0.0364477
    2004-04-26T20:52:35.177 Iteration: 81, Maximum residual=0.0168215
    2004-04-26T20:52:35.714 Iteration: 1000, Maximum residual=0.00428044
    2004-04-26T20:52:35.715 Clean used 1000 iterations to get to a max residual of 0.00428044
    2004-04-26T20:52:35.739 Clean did not reach threshold
I don't understand why this run of clean produced so few lines in the log file; every other run of clean on GRB or PdB data produced several "Iteration" lines. However, the image from Test 2 was recognizably similar to the 3 mm image in that it showed a fairly smooth ring of emission, so it doesn't seem like something major is wrong. Perhaps this reporting has to do with major/minor cycles in clean?

Summary: Although there are differences in whether cleaning gets to threshold and in the flagging of the data, there are no important differences between the two log files. The images produced in this second test appear qualitatively better than those produced in the first test.
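
The manual inspection of clean output described in comment (4) could in principle be automated. The sketch below counts the "Iteration" lines in a clean log and notes whether the "did not reach threshold" message is present. It assumes only the line formats visible in the five log lines quoted in comment (4); the function name summarize_clean_log is hypothetical and this is not part of AIPS++ or the prototype pipeline. The report does not quote the message written when threshold is reached, so the sketch looks only for the failure message.

```python
import re

# Pattern inferred from the quoted log lines, e.g.
# "Iteration: 81, Maximum residual=0.0168215"
ITERATION_RE = re.compile(
    r"Iteration:\s*(\d+),\s*Maximum residual=([0-9.eE+-]+)"
)

# The five lines quoted in comment (4) for the 1 mm clean in PdB Test 2.
SAMPLE_LOG = [
    "2004-04-26T20:52:34.989 Initial maximum residual: 0.0364477",
    "2004-04-26T20:52:35.177 Iteration: 81, Maximum residual=0.0168215",
    "2004-04-26T20:52:35.714 Iteration: 1000, Maximum residual=0.00428044",
    "2004-04-26T20:52:35.715 Clean used 1000 iterations to get to a max "
    "residual of 0.00428044",
    "2004-04-26T20:52:35.739 Clean did not reach threshold",
]


def summarize_clean_log(lines):
    """Summarize one clean run from its log lines (hypothetical helper)."""
    iterations = []  # (iteration number, maximum residual) per report line
    did_not_reach = False
    for line in lines:
        match = ITERATION_RE.search(line)
        if match:
            iterations.append((int(match.group(1)), float(match.group(2))))
        elif "Clean did not reach threshold" in line:
            did_not_reach = True
    last = iterations[-1] if iterations else (None, None)
    return {
        "iteration_lines": len(iterations),
        "last_iteration": last[0],
        "last_residual": last[1],
        "did_not_reach_threshold": did_not_reach,
    }


if __name__ == "__main__":
    print(summarize_clean_log(SAMPLE_LOG))
```

A run that reports an unusually small number of "Iteration" lines, as this one did, could then be picked out for closer inspection by comparing iteration_lines across runs.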