Calibrater Performance Improvements as of 2004 Apr 14
=====================================================

George Moellenbrock

Intro
-----

At the last NAUG meeting, I discussed four areas in which calibrater
performance improvements were likely: CalTable I/O, the trivial model
case, accumulated calibration, and the core solver.  The first and
third of these have been addressed, and the improvements will be
checked in to the system this week.

CalTable I/O
------------

The current CalTable I/O methods suffer from the costs of a row-wise
treatment.  For each antenna and timestamp, the solution and all of
its descriptive information is packaged into a record structure which
is passed to the table system for writing to disk.  Considering the
GAIN column alone (the most voluminous column), I have demonstrated
with a test program that packaging the time-dependent solutions into
an appropriately sized array (matching the shape it will have on
disk) and sending this array en masse to the table system for a
single write is nearly 70X (!) faster.  Implementing this for all of
the relevant columns should make this fraction of the calibration
solve execution all but disappear.  A similar improvement should be
possible for the CalTable read.  (For very large solution sets, the
column-wise I/O will be performed in suitably-sized chunks of table
rows.)
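To make the contrast concrete, here is a minimal sketch of the two
write strategies for the GAIN column, assuming an aips++-style
Table/ArrayColumn interface.  The function names, the vector-of-rows
packaging, and the GAIN-only focus are illustrative, not the actual
CalTable code:

  // Sketch only: assumes calTab already exists with a GAIN array
  // column of the appropriate shape.
  #include <casa/BasicSL/Complex.h>
  #include <casa/Arrays/Array.h>
  #include <tables/Tables/Table.h>
  #include <tables/Tables/ArrayColumn.h>
  #include <vector>

  using namespace casa;

  // Old: row-wise.  One table-system call per antenna/timestamp row;
  // the per-call overhead dominates the total write time.
  void writeGainRowwise(Table& calTab,
                        const std::vector<Array<Complex> >& rowGain) {
    ArrayColumn<Complex> gainCol(calTab, "GAIN");
    for (uInt irow = 0; irow < rowGain.size(); ++irow)
      gainCol.put(irow, rowGain[irow]);
  }

  // New: column-wise.  The time-dependent solutions are packaged into
  // one array with the on-disk shape (row number as the last axis)
  // and written en masse -- this is the ~70X-faster path.
  void writeGainColumnwise(Table& calTab,
                           const Array<Complex>& allGain) {
    ArrayColumn<Complex> gainCol(calTab, "GAIN");
    gainCol.putColumn(allGain);
  }

For very large solution sets, the column-wise call would simply be
issued once per suitably-sized range of rows (e.g., via
putColumnRange) rather than once per row.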
Accumulated Calibration
-----------------------

Initially, this item concerned accumulating the antenna-based factors
for all calibration types (e.g., G, D, P) *before* forming the
nAnt*(nAnt+1)/2 baseline factors (the '+' becomes '-' if ACs are
ignored).  However, in the process of investigating this issue, it
became apparent that, even for application of a single calibration
type, the order and manner of calculation of the baseline correction
factors was not optimal.

The calibration solutions we solve for and store are the
antenna-based factors (2x2 Jones matrices) which, after forming
antenna-pair-wise outer products (yielding 4x4 Mueller matrices),
corrupt a perfect data model via multiplication.  Correction by this
calibration therefore requires taking the inverse of these solutions
(matrix "division").  In the current calibrater, the outer products
of nAnt*nAnt pairs of 2x2 matrices (all antenna pairs in both
directions) are formed, and then the inverses of all of these 4x4
matrices are taken.  It is much quicker to take the inverses of the
nAnt 2x2 matrices *before* forming (only) the nAnt*(nAnt+1)/2
required 4x4 matrices.

Additionally, the large number of 4x4 matrix inversions are performed
with no consideration of whether the matrices are general, diagonal,
or scalar.  Recognizing when the (4x4) matrix is diagonal (requiring
the inverses of 4 complex numbers) or scalar (requiring the inverse
of 1 complex number) can save a factor of 16 (n^2) or 64 (n^3),
respectively (n is the size of the matrix).  The gains for 2x2
matrices are somewhat more modest (4 or 8), but we also gain by doing
a factor of ~nAnt fewer of them.  The revised ordering is sketched
below.
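The reordering is valid because the inverse of an outer product is
the outer product of the inverses: (Ji (x) Jj*)^-1 = Ji^-1 (x)
(Jj^-1)*.  Below is a minimal sketch of the revised order, using
invented stand-in types rather than the calibrater's own matrix
classes (which, as noted above, already contain the structure-aware
inversion code):

  #include <complex>
  #include <vector>

  typedef std::complex<float> Complex;

  enum MatType { SCALAR, DIAGONAL, GENERAL };

  struct Jones {                     // antenna-based 2x2 factor
    MatType type;
    Complex m[2][2];
  };

  // Invert one 2x2 Jones matrix, exploiting diagonal/scalar structure
  // (1 or 2 complex reciprocals instead of a full 2x2 inverse).
  Jones inverse(const Jones& j) {
    Jones inv = j;
    const Complex one(1.0f, 0.0f);
    if (j.type == SCALAR) {
      inv.m[0][0] = inv.m[1][1] = one / j.m[0][0];
    } else if (j.type == DIAGONAL) {
      inv.m[0][0] = one / j.m[0][0];
      inv.m[1][1] = one / j.m[1][1];
    } else {                         // GENERAL: full 2x2 inverse
      Complex det = j.m[0][0]*j.m[1][1] - j.m[0][1]*j.m[1][0];
      inv.m[0][0] =  j.m[1][1]/det;  inv.m[0][1] = -j.m[0][1]/det;
      inv.m[1][0] = -j.m[1][0]/det;  inv.m[1][1] =  j.m[0][0]/det;
    }
    return inv;
  }

  struct Mueller { Complex m[4][4]; };  // baseline-based 4x4 factor

  // Outer (Kronecker) product Ji (x) conj(Jj).
  Mueller outerProduct(const Jones& ji, const Jones& jj) {
    Mueller M;
    for (int a = 0; a < 2; ++a)
      for (int b = 0; b < 2; ++b)
        for (int c = 0; c < 2; ++c)
          for (int d = 0; d < 2; ++d)
            M.m[2*a + c][2*b + d] = ji.m[a][b] * std::conj(jj.m[c][d]);
    return M;
  }

  // Revised order: invert the nAnt 2x2 matrices first, then form only
  // the nAnt*(nAnt+1)/2 required 4x4 baseline factors (start the
  // inner loop at j = i+1 to ignore ACs).
  std::vector<Mueller> correctionFactors(const std::vector<Jones>& ant) {
    const int nAnt = int(ant.size());
    std::vector<Jones> inv(nAnt);
    for (int i = 0; i < nAnt; ++i)
      inv[i] = inverse(ant[i]);
    std::vector<Mueller> bl;
    bl.reserve(nAnt*(nAnt+1)/2);
    for (int i = 0; i < nAnt; ++i)
      for (int j = i; j < nAnt; ++j)
        bl.push_back(outerProduct(inv[i], inv[j]));
    return bl;
  }

The same structure tags carry over to the 4x4 Mueller factors (a
diagonal Mueller needs 4 reciprocals, a scalar one just 1), which is
where the factors of 16 and 64 quoted above come from.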
The performance improvement available by choosing the optimal
strategy (the code for clever matrix inversion already exists) is
substantial, as the table below indicates.  The table lists the
performance figures for application of various numbers of different
types of solutions to a 1030-timestamp dataset (VLA continuum
polarimetry).  The P (parallactic angle) solution is applied twice,
first from only one p.a. (unrealistically) to illustrate the
magnitude of the other costs (mainly data i/o and the actual apply),
second using per-timestamp p.a. values.

The G and D solution applications are dominated by data i/o and the
actual apply, since there are so few solutions in these cases.  The T
solution application includes 20 seconds of CalTable I/O, which
should largely vanish when the CalTable I/O improvements are
included.  At that point the application of 1030 T solutions will be
comparable to application of the G and D solutions (G and D have many
fewer solutions; T is scalar), at ~6.8 sec.  The P solution
application is comparatively more expensive because the parallactic
angle must be calculated on-the-fly.  Avoiding repeated disk I/O of
the antenna positions used in this calculation should improve this.

Essentially, the baseline factor formation step during apply will be
reduced to a near-negligible fraction of the overall calibration
apply cost.  Note that solves for which calibration is pre-applied
(on-the-fly) will also benefit.  Finally, applying several
calibration types in sequence appears to be a relatively small
incremental cost, as indicated by the last 3 rows of the table, which
are clearly dominated by the cost of applying P (the data I/O costs
are the same for *all* rows in this table, at ~6 seconds).

Improvements in Cal Apply Performance
-------------------------------------

Timings include: data i/o, solution i/o, baseline factor formation,
and apply.

  Type     mtype    nSol  nData  Old(s)  New(s)
  --------------------------------------------------------------------
  P        diag        1   1030    7.9     8.0  (incl on-the-fly calc)
  P        diag     1030   1030   23.9    12.6  (incl on-the-fly calc)
  G        diag       12   1030    6.8     6.3
  D        gen         1   1030    7.1     7.2
  T        scalar   1030   1030   36.8    26.8  (incl 20sec soln i/o)
  P,G                      1030   24.9    14.2
  P,G,D                    1030   25.0    14.0
  P,G,D,T                  1030   56.6    34.5  (incl 20sec T soln i/o)

The data from which the above table was derived are those of the
aips++/AIPS/miriad benchmark: the simulated gravitational lens
dataset.  Executing this benchmark before and after the cal apply
improvements yields the following results (cf. the official aips++
benchmarks; click on "ALMA Benchmark Page", then "Test Case #1"):

  Step               Old(s)   New(s) (as of 2004Apr16)
  ====================================================================
  Fill                 19.4    19.1
  Setjy                 0.7     0.7
  Phase/Amp Cal        33.1     6.3*   (solve + PG apply to calibrators)
  D Cal + apply        37.3    25.6    (solve=10.8 + full PG apply=14.8)
  Image1               72.3    72.1
  Selfcal + apply     131.1    44**    (solve=30 + full PGDT apply=14)
  Image2               69.1    66.5
  --------------------------------------------------------------------
  Total               363     234.3  = 1.5X improvement

 * The original benchmark script (used for Old) includes a spurious
   step at this point: application of PG to the target source.  AIPS
   does not do this.

** The 44 seconds does not include the current solution I/O cost of
   approximately 50 seconds (write=30, read=20) for the 1030-solution
   T table.  The solution I/O cost for the other types (G, D) is
   already negligible because those solution sets are so small.

Summary
-------

So, with these two improvements, the aips++ run time for modest
continuum polarimetry datasets should drop to about 1.7X AIPS and
3.5X miriad (from 2.6X and 5.2X).  In fact, the gains should be
somewhat better than this, since the fraction of time spent on the
imaging steps is somewhat larger on my laptop than on the benchmark
machine (due to differences in the h/w details).

Additionally, more improvements are possible, including optimization
of the P calculation, implementation of the trivial model case (which
will reduce the data i/o costs during solves by a factor of 2; e.g.,
the selfcal solve should decrease from 30 to ~24 seconds; a rough
sketch of the idea follows), and a number of low-level improvements
in the core solver.
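As a rough illustration of the trivial model case, sketched here
under the assumption that "trivial" means a unit point source at the
phase center: every model visibility is then identically (1,0), so
the model column need not be read from disk at all, which is what
halves the data i/o during a solve.  The buffer type and function
names below are invented for illustration:

  #include <complex>
  #include <vector>

  typedef std::complex<float> Complex;

  struct VisBuffer {                 // invented stand-in
    std::vector<Complex> observed;   // always read from disk
    std::vector<Complex> model;      // read only when necessary
  };

  void fillModel(VisBuffer& vb, bool trivialModel) {
    if (trivialModel) {
      // No disk access: the model is identically (1,0).
      vb.model.assign(vb.observed.size(), Complex(1.0f, 0.0f));
    } else {
      // readModelColumn(vb);        // the usual (expensive) path
    }
  }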