Calibrater Performance Improvements as of 2004 Apr 14
=====================================================

George Moellenbrock

Intro
-----

At the last NAUG meeting, I discussed four areas in which calibrater
performance improvements were likely: CalTable I/O, the trivial model
case, accumulated calibration, and the core solver.  The first and
third of these have been addressed, and the improvements will be
checked in to the system this week.

CalTable I/O
------------

The current CalTable I/O methods suffer from the costs of a row-wise
treatment.  For each antenna and timestamp, the solution and all of
its descriptive information is packaged into a record structure which
is passed to the table system for writing to disk.  Considering the
GAIN column alone (the most voluminous column), I have demonstrated
with a test program that packaging the time-dependent solutions into
an appropriately sized array (matching the shape it will have on
disk) and sending this array en masse to the table system for a
single write is nearly 70X (!) faster.  Implementing this for all of
the relevant columns should make this fraction of the calibration
solve execution all but disappear.  A similar improvement should be
possible for the CalTable read.  (For very large solution sets, the
column-wise I/O will be performed in suitably-sized chunks of table
rows.)
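To make the contrast concrete, here is a minimal sketch of the two
write strategies for the GAIN column, assuming an aips++-style
Table/ArrayColumn interface.  The function names, the vector-of-rows
packaging, and the GAIN-only focus are illustrative, not the actual
CalTable code:

  // Sketch only: assumes calTab already exists with a GAIN array
  // column of the appropriate shape.
  #include <casa/BasicSL/Complex.h>
  #include <casa/Arrays/Array.h>
  #include <tables/Tables/Table.h>
  #include <tables/Tables/ArrayColumn.h>
  #include <vector>

  using namespace casa;

  // Old: row-wise.  One table-system call per antenna/timestamp row;
  // the per-call overhead dominates the total write time.
  void writeGainRowwise(Table& calTab,
                        const std::vector<Array<Complex> >& rowGain) {
    ArrayColumn<Complex> gainCol(calTab, "GAIN");
    for (uInt irow = 0; irow < rowGain.size(); ++irow)
      gainCol.put(irow, rowGain[irow]);
  }

  // New: column-wise.  The time-dependent solutions are packaged into
  // one array with the on-disk shape (row number as the last axis)
  // and written en masse -- this is the ~70X-faster path.
  void writeGainColumnwise(Table& calTab,
                           const Array<Complex>& allGain) {
    ArrayColumn<Complex> gainCol(calTab, "GAIN");
    gainCol.putColumn(allGain);
  }

For very large solution sets, the column-wise call would simply be
issued once per suitably-sized range of rows (e.g., via
putColumnRange) rather than once per row.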
Accumulated Calibration
-----------------------

Initially, this item concerned accumulating the antenna-based factors
for all calibration types (e.g., G, D, P) *before* forming the
nAnt*(nAnt+1)/2 baseline factors (the '+' becomes '-' if ACs are
ignored).  However, in the process of investigating this issue, it
became apparent that, even for application of a single calibration
type, the order and manner of calculation of the baseline correction
factors was not optimal.

The calibration solutions we solve for and store are the
antenna-based factors (2x2 Jones matrices) which, after forming
antenna-pair-wise outer products (yielding 4x4 Mueller matrices),
corrupt a perfect data model via multiplication.  Correction by this
calibration therefore requires taking the inverse of these solutions
(matrix "division").  In the current calibrater, the outer products
of nAnt*nAnt pairs of 2x2 matrices (all antenna pairs in both
directions) are formed, and then the inverses of all of these 4x4
matrices are taken.  It is much quicker to take the inverses of the
nAnt 2x2 matrices *before* forming (only) the nAnt*(nAnt+1)/2
required 4x4 matrices.

Additionally, the large number of 4x4 matrix inversions are performed
with no consideration of whether the matrices are general, diagonal,
or scalar.  Recognizing when the (4x4) matrix is diagonal (requiring
the inverses of 4 complex numbers) or scalar (requiring the inverse
of 1 complex number) can save a factor of 16 (n^2) or 64 (n^3),
respectively (n is the size of the matrix).  The gains for 2x2
matrices are somewhat more modest (4 or 8), but we also gain by doing
a factor of ~nAnt fewer of them.  The revised ordering is sketched
below.
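The reordering is valid because the inverse of an outer product is
the outer product of the inverses: (Ji (x) Jj*)^-1 = Ji^-1 (x)
(Jj^-1)*.  Below is a minimal sketch of the revised order, using
invented stand-in types rather than the calibrater's own matrix
classes (which, as noted above, already contain the structure-aware
inversion code):

  #include <complex>
  #include <vector>

  typedef std::complex<float> Complex;

  enum MatType { SCALAR, DIAGONAL, GENERAL };

  struct Jones {                     // antenna-based 2x2 factor
    MatType type;
    Complex m[2][2];
  };

  // Invert one 2x2 Jones matrix, exploiting diagonal/scalar structure
  // (1 or 2 complex reciprocals instead of a full 2x2 inverse).
  Jones inverse(const Jones& j) {
    Jones inv = j;
    const Complex one(1.0f, 0.0f);
    if (j.type == SCALAR) {
      inv.m[0][0] = inv.m[1][1] = one / j.m[0][0];
    } else if (j.type == DIAGONAL) {
      inv.m[0][0] = one / j.m[0][0];
      inv.m[1][1] = one / j.m[1][1];
    } else {                         // GENERAL: full 2x2 inverse
      Complex det = j.m[0][0]*j.m[1][1] - j.m[0][1]*j.m[1][0];
      inv.m[0][0] =  j.m[1][1]/det;  inv.m[0][1] = -j.m[0][1]/det;
      inv.m[1][0] = -j.m[1][0]/det;  inv.m[1][1] =  j.m[0][0]/det;
    }
    return inv;
  }

  struct Mueller { Complex m[4][4]; };  // baseline-based 4x4 factor

  // Outer (Kronecker) product Ji (x) conj(Jj).
  Mueller outerProduct(const Jones& ji, const Jones& jj) {
    Mueller M;
    for (int a = 0; a < 2; ++a)
      for (int b = 0; b < 2; ++b)
        for (int c = 0; c < 2; ++c)
          for (int d = 0; d < 2; ++d)
            M.m[2*a + c][2*b + d] = ji.m[a][b] * std::conj(jj.m[c][d]);
    return M;
  }

  // Revised order: invert the nAnt 2x2 matrices first, then form only
  // the nAnt*(nAnt+1)/2 required 4x4 baseline factors (start the
  // inner loop at j = i+1 to ignore ACs).
  std::vector<Mueller> correctionFactors(const std::vector<Jones>& ant) {
    const int nAnt = int(ant.size());
    std::vector<Jones> inv(nAnt);
    for (int i = 0; i < nAnt; ++i)
      inv[i] = inverse(ant[i]);
    std::vector<Mueller> bl;
    bl.reserve(nAnt*(nAnt+1)/2);
    for (int i = 0; i < nAnt; ++i)
      for (int j = i; j < nAnt; ++j)
        bl.push_back(outerProduct(inv[i], inv[j]));
    return bl;
  }

The same structure tags carry over to the 4x4 Mueller factors (a
diagonal Mueller needs 4 reciprocals, a scalar one just 1), which is
where the factors of 16 and 64 quoted above come from.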
The performance improvement available by choosing the optimal
strategy (the code for clever matrix inversion already exists) is
substantial, as the table below indicates.  The table lists the
performance figures for application of various numbers of different
types of solutions to a 1030-timestamp dataset (VLA continuum
polarimetry).  The P (parallactic angle) solution is applied twice,
first from only one p.a. (unrealistically) to illustrate the
magnitude of the other costs (mainly data i/o and the actual apply),
second using per-timestamp p.a. values.

The G and D solution applications are dominated by data i/o and the
actual apply, since there are so few solutions in these cases.  The T
solution application includes 20 seconds of CalTable I/O, which
should largely vanish when the CalTable I/O improvements are
included.  At that point the application of 1030 T solutions will be
comparable to application of the G and D solutions (G and D have many
fewer solutions; T is scalar), at ~6.8 sec.  The P solution
application is comparatively more expensive because the parallactic
angle must be calculated on-the-fly.  Avoiding repeated disk I/O of
the antenna positions used in this calculation should improve this.

Essentially, the baseline factor formation step during apply will be
reduced to a near-negligible fraction of the overall calibration
apply cost.  Note that solves for which calibration is pre-applied
(on-the-fly) will also benefit.  Finally, applying several
calibration types in sequence appears to be a relatively small
incremental cost, as indicated by the last 3 rows of the table, which
are clearly dominated by the cost of applying P (the data I/O costs
are the same for *all* rows in this table, at ~6 seconds).

Improvements in Cal Apply Performance
-------------------------------------

Timings include: data i/o, solution i/o, baseline factor formation,
and apply.

  Type     mtype    nSol  nData  Old(s)  New(s)
  --------------------------------------------------------------------
  P        diag        1   1030    7.9     8.0  (incl on-the-fly calc)
  P        diag     1030   1030   23.9    12.6  (incl on-the-fly calc)
  G        diag       12   1030    6.8     6.3
  D        gen         1   1030    7.1     7.2
  T        scalar   1030   1030   36.8    26.8  (incl 20sec soln i/o)
  P,G                      1030   24.9    14.2
  P,G,D                    1030   25.0    14.0
  P,G,D,T                  1030   56.6    34.5  (incl 20sec T soln i/o)

The data from which the above table was derived are those of the
aips++/AIPS/miriad benchmark: the simulated gravitational lens
dataset.  Executing this benchmark before and after the cal apply
improvements yields the following results (cf. the official aips++
benchmarks; click on "ALMA Benchmark Page", then "Test Case #1"):

  Step               Old(s)   New(s) (as of 2004Apr16)
  ====================================================================
  Fill                 19.4    19.1
  Setjy                 0.7     0.7
  Phase/Amp Cal        33.1     6.3*   (solve + PG apply to calibrators)
  D Cal + apply        37.3    25.6    (solve=10.8 + full PG apply=14.8)
  Image1               72.3    72.1
  Selfcal + apply     131.1    44**    (solve=30 + full PGDT apply=14)
  Image2               69.1    66.5
  --------------------------------------------------------------------
  Total               363     234.3  = 1.5X improvement

 * The original benchmark script (used for Old) includes a spurious
   step at this point: application of PG to the target source.  AIPS
   does not do this.

** The 44 seconds does not include the current solution I/O cost of
   approximately 50 seconds (write=30, read=20) for the 1030-solution
   T table.  The solution I/O cost for the other types (G, D) is
   already negligible because those solution sets are so small.

Summary
-------

So, with these two improvements, the aips++ run time for modest
continuum polarimetry datasets should drop to about 1.7X AIPS and
3.5X miriad (from 2.6X and 5.2X).  In fact, the gains should be
somewhat better than this, since the fraction of time spent on the
imaging steps is somewhat larger on my laptop than on the benchmark
machine (due to differences in the h/w details).

Additionally, more improvements are possible, including optimization
of the P calculation, implementation of the trivial model case (which
will reduce the data i/o costs during solves by a factor of 2; e.g.,
the selfcal solve should decrease from 30 to ~24 seconds; a rough
sketch of the idea follows), and a number of low-level improvements
in the core solver.
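As a rough illustration of the trivial model case, sketched here
under the assumption that "trivial" means a unit point source at the
phase center: every model visibility is then identically (1,0), so
the model column need not be read from disk at all, which is what
halves the data i/o during a solve.  The buffer type and function
names below are invented for illustration:

  #include <complex>
  #include <vector>

  typedef std::complex<float> Complex;

  struct VisBuffer {                 // invented stand-in
    std::vector<Complex> observed;   // always read from disk
    std::vector<Complex> model;      // read only when necessary
  };

  void fillModel(VisBuffer& vb, bool trivialModel) {
    if (trivialModel) {
      // No disk access: the model is identically (1,0).
      vb.model.assign(vb.observed.size(), Complex(1.0f, 0.0f));
    } else {
      // readModelColumn(vb);        // the usual (expensive) path
    }
  }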