Automatic RFI identification and flagging in CASA (Version 2)

Urvashi R.V.

Date: 26 Jun 2011 (Updated : 10 Aug 2011)

Abstract:

This note describes the autoflag algorithms available in CASA and their usage.

The 'testautoflag' task in casapy-test (> 15503) contains the TFCrop autoflag algorithm, flag extensions, and an interactive data/flag display.

Within CASA, please type 'help testautoflag' for inline documentation.

Autoflag task in CASA

The 'testautoflag' task in CASA runs an RFI-detection algorithm with options to extend and grow flags. Data and flags can be visualized at run-time and flag-summary plots can be generated at the end. The task can also be run in inspection-mode to find RFI and gather statistics (but not write flags to the MS).

Data Selection

Selected data is iterated though in chunks of time. Baselines, Field Ids, Spectral Windows and Array Ids are treated separately. Scans are combined to accumulate 'ntime' seconds of integrations.

Data Selection Parameters

vis ' ' Name of Measurement Set

field ' ' (all) Select data based on field id(s) or name(s)

spw ' ' (all) Select data based on spectral window and channels

selectdata False Other data-selection parameters

antenna ' ' (all) Select data based on baseline

uvrange ' ' (all) Select data within uvrange (default units meters)

timerange ' ' (all) Select data based on time range

scan ' ' (all) Select data based on scan number

feed ' ' (all) Selection based on the feed (not implemented yet)

array ' ' (all) Selection based on the antenna array

datacolumn 'data' Data column on which to operate.
options : 'data', 'corrected', 'model', 'residual', 'residual_data'
Flags are based on the absolute values of the visibilities from the specified column.

ntime 100 Timerange (in seconds) over which to accumulate data before running the autoflag algorithms.
The dataset will be iterated through in time-chunks defined here.

corrs [ ] (all) List of ones/zeros to signal which correlations to operate upon. default : [] (all correlations)
example : [1,0,0,1] to choose RR and LL in data with RR, RL, LR, LL
example : [1,0] to choose only RR from data containing RR and LL
NOTE : This syntax will change in a later version of this task, to support user-specified lists of the type 'RR, LL'.

TFCrop algorithm

For each chunk of time, visibilities are organized as 2D time-frequency planes, one for each baseline and correlation type, and the following steps (outlier-detection, and flag extension) are performed on each 2D plane.

STEP 1 : Calculate a bandshape template : Average the data across time, to construct an average bandpass. Construct an estimate of a clean bandpass (without RFI) via a robust piece-wise polynomial fit to the average bandpass shape.

Note : A robust fit is computed in upto 5 iterations. It begins with a straight line fit across the full range, and gradually increases to 'maxnpieces' number of pieces with third-order polynomials in each piece. At each iteration, the stddev between the data and the fit is computed, values beyond N-stddev are flagged, and the fit and stddev are re-calculated with the remaining points. This stddev calculation is adaptive, and converges to a value that reflects only the data and no RFI. At each iteration, the same relative threshold is applied to detect flags, and this results in a varying set of flagging thresholds, that allows deep flagging only when the fit represents the true data best. Iterations stop when the stddev changes by less than 10%, or when 5 iterations are completed.

The resulting clean bandpass is a fit across the base of RFI spikes.

STEP 2: Use this clean bandpass to find RFI on the 2D time-frequency plane.

Flagging is also done in upto 5 iterations. In each iteration, for every timestep, calculate the stddev of the data spectrum w.r.to the clean fitted bandshape, flag all points further than N times stddev from the fit, and recalculate the stddev. At each iteration, the same relative threshold is applied to detect flags. Optionally, use sliding-window based statistics to calculate additional flags.

STEP 3: Repeat STEPS 1 & 2, but in the other direction (i.e. average the data across frequency, calculate a piece-wise polynomial fit to the average time-series, and find flags based on deviations w.r.to this fit.)

TFCrop Parameters (Used if 'tfcrop'=True) :

timecutoff 4.0 Flag threshold in time (flag all data-points further than N-stddev from the fit).

freqcutoff 3.0 Flag threshold in frequency. Flag all data-points further than N-stddev from the fit.

timefit 'line' Fitting function for the time direction
options = 'line', 'poly'
A 'line' fit is a robust straight-line fit across the entire timerange (defined by 'ntime').
A 'poly' fit is a robust piece-wise polynomial fit across the timerange. Choose 'poly' only if the visibilities are expected to vary significantly over the timerange selected by 'ntime', or if there is a lot of strong but intermittent RFI.

freqfit 'poly' Fitting function for the frequency direction
options = 'line','poly' (similar to 'timefit')
Choose 'line' only if you are operating on bandpass-corrected data, or residuals, and expect that the bandshape is linear. The 'poly' option works better when there are multiple lines of strong narrow-band RFI.

maxnpieces 7 Maxinum number of pieces to allow in the piecewise-polynomial fits
options = 1 - 9
This parameter is used only if 'timefit' or 'freqfit' are chosen as 'poly'. If there is significant broad-band RFI, reduce this number (say 5). Using too many pieces could result in the RFI being fitted in the 'clean' bandpass.
In later stages of the fit, a third-order polynomial is fit per piece, so for best results, please ensure that nchan/maxnpieces is at-least 5.

flagdimension 'freqtime' Choose the directions along which to perform flagging
default = 'freqtime' : First flag along frequency, and then along time
options = 'time', 'freq', 'timefreq', 'freqtime'
For most cases, 'freqtime' or 'timefreq' are appropriate, and differences between these choices are apparant only if RFI in one dimension is significantly stronger than the other. The goal is to flag the dominant RFI first.
If there are very few (less than 5) channels of data, then choose 'time'. Similarly for 'freq'.

usewindowstats 'none' Use sliding-window statistics to find additional flags ( This is Experimental !! )
options = 'none', 'sum', 'std', 'both'
The 'sum' option chooses to flag a point, if the mean-value in a window centered on that point deviates from the fit by more than N-stddev/1.5. This option is an attempt to catch broad-band or time-persistent RFI that the above polynomial fits will mistakenly fit as the clean band. It is an approximation to the sumThreshold method found to be effective by Offringa et.al (2010) for LOFAR data.
The 'std' option chooses to flag a point, if the 'local' stddev calculated in a window centered on that point is larger than N-stddev/1.5. This option is an attempt to catch noisy RFI that is not excluded in the polynomial fits, and which increases the global stddev, and results in fewer flags (based on the N-stddev threshold). This is an approximation to the idea behind 'rflag' in AIPS (which E.Greisen is currently refining).

halfwin 1 Half width of sliding window to use with 'usewindowstats'. (This is Experimental !!)
options = 1,2,3 for 3-point, 5-point or 7-point window sizes

Flag Extensions

Extend/grow flags that have been detected until now (old and new).

Flag extension Parameters (Used if 'extendflags'=True) :

extendpols False Extend flags to all correlations
This option can be used in conjunction with 'corrs' to calculate flags using only parallel-hand data, but apply them to all correlations (for example)

growtime 50.0 For any channel, flag the entire timerange in the current 2D chunk (set by 'ntime') if more than X% of the timerange is already flagged.
options = 0.0 - 100.0
This option catches the low-intensity parts of time-persistent RFI.

growfreq 50.0 For any timestep, flag all channels in the current 2D chunk (set by data-selection) if more than X% of the channels are already flagged.
options = 0.0 - 100.0
This option catches broad-band RFI that is partially identified by earlier steps.

growaround True Extend flags to immediately surrounding points in the time-freq plane.
For every un-flagged point on the 2D time/freq plane, if more than four surrounding points are already flagged, flag that point. This option catches some wings of strong RFI spikes.

flagneartime False Flag points before and after every flagged one, in the time-direction.
Note : This can result in excessive flagging.

flagnearfreq False Flag points before and after every flagged one, in the frequency-direction
This option allows flagging of wings in the spectral response of strong RFI. Note : This can result in excessive flagging.

Flag Display and Summary

Visualization of the data and flags at run-time is possible by setting 'datadisplay'=True. The intended usage is to run testautoflag with datadisplay=True and writeflags=False on a small sub-selection of the data, and change parameters until the desired flagging results are obtained. Then, turn off the display, set writeflags=True, and run it again.

A flag summary is generated at the end of the run, to list the percentage of data flagged as a function of frequency channel, spw, field, etc. If the task is run by turning off the tfcrop and extendflag options, it will compute statistics for all existing flags in the MS.

datadisplay False Display data and flags at run-time, within an interactive GUI
This option opens a GUI to show the 2D time-freq planes of the data with old and new flags, for all correlations per baseline.
- The GUI allows stepping through all baselines (prev/next) in the current chunk (set by 'ntime'), and stepping to the next-chunk.
- The testautoflag task can be quit from the GUI, in case it becomes obvious that the current set of parameters is just wrong.
- There is an option to stop the display but continue flagging.

plotsummary False Parse flag counts, and display a spectrum of percentage-of-flagged-data
Flag percentages are shown separately for different fields and spws, and counts are combined for all selected time-ranges, baselines, and correlations.
Note : If some baselines are completely flagged (or some correlations have exact zeros and are flagged), the floor of the spectrum will rise.

showknownrfi False Overlay lines and ranges of known EVLA RFI on the above spectrum.
Show the frequencies of known EVLA RFI using dotted lines, and shaded boxes. (purple = Continuous, green = Intermittent, black = all other types)

Flag Reads/Writes

usepreflags True Choose whether or not to use/honour existing flags in the MS. If 'writeflags'=True, old flags in the MS will be overwritten (but a backup can be taken)

preflagzeros True Choose whether or not to preflag visibilities exactly equal to 0.0
Note : If set to True, an elevated floor-level in the 'plotsummary' rfi-spectrum will indicate the presence of exact-zeros in some baselines or correlations.

writeflags False Choose whether or not to write flags to the MS
The testautoflag task can be run in 'writeflags=False' mode just to inspect the data, or to try different parameters and converge on a set that works for a particular dataset.

flagbackup True If flags are being written to the MS, create a back-up flagversion first ?

Examples

A set of examples are discussed in these slides.

For each example, screen-shots of the data-display GUI are shown with default parameters, followed by successive changes to these parameters to achieve desired flagging results. A process of trial and error is currently required at-least once per spw. Based on tests with 5 EVLA datasets, it can be said that parameters do not have to be varied across antennas or baselines, but each SPW needs to be tuned separately.

Heuristics

Some rules of thumb are as follows. This list will be updated as more is learnt, and if any defaults are changed.

The defaults are optimized for strong narrow-band RFI.
For broad-band RFI, reduce 'maxnpieces' to minimize the chances of the piece-wise polynomial fit treating the RFI as a valid part of the bandshape.
For noisy RFI (extended in either time or frequency), use 'usewindowstats'='std' or 'sum'.
In case of overflagging, check thresholds, and flag extension options.

Change Log

Casapy revision 15451 (27 June 2011) : Implemented version 2 of the tfcrop algorithm, in a new task 'autoflag'.
Casapy revision 15459 (27 June 2011) : The 'corrs' parameter was being ignored in r15451. Fixed.
Casapy revision 15503 (01 Jul 2011) : Changed the task name to 'testautoflag' to avoid conflict with a parameter named 'autoflag' in the old flagdata2 task.
Casapy revision 15855 (10 Aug 2011) : (a) fixed a bug that caused the 'line' option for time and freq fitting to not be a robust line-fit, (b) fixed a bug that resulted in freqcutoff being used for time also, (c) modified the scaling of these thresholds for the 'sum','std' cases. Earlier, they were going too deep.

To Do List

Extending/Growing Flags :
1. extend cross-correlation flags from baseline 'a-b' to all baselines with antenna 'a' and 'b'
2. extend autocorrelation flags from antenna 'a' to all baselines containing antenna 'a'
Parallelize (at-least on baselines)
Allow the user to build up a set of flag-cmds with different parameters per SPW (for example) and run them all together and generate summary plots. This has to wait until the data-handler classes are written.
GUI features :
1. Add a button to step through baselines more easily - jump to the next baseline with the same antenna1 or antenna2 - to examine all baselines with a particular antenna.
2. Add a button to step to the next SPW - sometimes next-chunk will make the user go through too many timesteps.
3. Disable GUI buttons while flagging is in progress
4. Generate flagcmds from current data-selection and algorithm parameters. This is to allow users to record parameters they converge on after trial-and-error, and re-run autoflag in silent-mode with the same settings. These parameters can be different for different data-selections (for each spw for example), and it will allow for all the flags (and summary) to be computed in a single pass through the data (with writeflags=True).
5. Provide a mechanism to modify parameters from the GUI and re-run the algorithms with new parameters on the currently-displayed chunk of data. In the extreme-case this will allow the user to fine-tune parameters for every field, spw and timerange (set by 'ntime'), and if desired, step through the entire dataset.
Improve syntax for data column and correlation selections. This has to wait until more of the flagger infrastructure refactoring is complete.
Antenna/baseline selections :
1. Provide options to select and operate on only auto-correlations, or only cross-correlations, or both.
2. Distinguish between missing baselines, baselines with exact zeros, and flagged-baselines while displaying flags and counting flag statistics.

Known Problems

When the task is run multiple times, baselines with exact zeros in them appear to be counted differently (as flags). This affects the summary plots.
When the task is run multiple times from within the same casapy session, and datadisplay=True, the first 'chunk' moves by in the display without waiting for a button-press. This occurs only for the first 'chunk', and only if the task is started multiple times in succession.
Zooming on the time-freq plane plots is buggy. Click at top-left and bottom-right corners to zoom, right click to unzoom. Unzooming defaults to a data range of 0-1000, so one must re-zoom to the size of the 2D plane.

More algorithms

Detect RFI by searching for non-coherent phase as a function of time. The visibility phase for a true spectral line (or continuum source) should remain coherent across time, but RFI should be noisy, or in some cases it should wind. In cases where it does wind cleanly, flag it if the fringe-rate is beyond what the image field-of-view dictates (keep time-averaging in mind here..).
Implement the sumThreshold method as used at LOFAR (Offringa et al. 2010). It uses a 2D surface fit instead of two 1D fits, and works with complex visibilities.
Implement the 'rflag' method from AIPS, once it converges.
Explore the idea of using off-source visibilities to find RFI. For instance, all the data that we 'quack' could be used to find instantaneous RFI, with less fear of flagging source structure.
Explore RFI excision methods.

Old/other Documentation

The original implementation of this algorithm is described in NCRA Technical Report 202 (October 2003).

TFCrop Version 1 documentation (Feb 2011) contains information, examples and a to-do list from the first implementation in CASA.

About this document ...

Automatic RFI identification and flagging in CASA (Version 2)

This document was generated using the LaTeX2HTML translator Version 2002-2-1 (1.71)

The command line arguments were:
latex2html TFCrop.tex

The translation was initiated by R. V. Urvashi on 2011-08-10

R. V. Urvashi 2011-08-10

timecutoff	4.0	Flag threshold in time (flag all data-points further than N-stddev from the fit).
freqcutoff	3.0	Flag threshold in frequency. Flag all data-points further than N-stddev from the fit.
timefit	'line'	Fitting function for the time direction options = 'line', 'poly' A 'line' fit is a robust straight-line fit across the entire timerange (defined by 'ntime'). A 'poly' fit is a robust piece-wise polynomial fit across the timerange. Choose 'poly' only if the visibilities are expected to vary significantly over the timerange selected by 'ntime', or if there is a lot of strong but intermittent RFI.
freqfit	'poly'	Fitting function for the frequency direction options = 'line','poly' (similar to 'timefit') Choose 'line' only if you are operating on bandpass-corrected data, or residuals, and expect that the bandshape is linear. The 'poly' option works better when there are multiple lines of strong narrow-band RFI.
maxnpieces	7	Maxinum number of pieces to allow in the piecewise-polynomial fits options = 1 - 9 This parameter is used only if 'timefit' or 'freqfit' are chosen as 'poly'. If there is significant broad-band RFI, reduce this number (say 5). Using too many pieces could result in the RFI being fitted in the 'clean' bandpass. In later stages of the fit, a third-order polynomial is fit per piece, so for best results, please ensure that nchan/maxnpieces is at-least 5.
flagdimension	'freqtime'	Choose the directions along which to perform flagging default = 'freqtime' : First flag along frequency, and then along time options = 'time', 'freq', 'timefreq', 'freqtime' For most cases, 'freqtime' or 'timefreq' are appropriate, and differences between these choices are apparant only if RFI in one dimension is significantly stronger than the other. The goal is to flag the dominant RFI first. If there are very few (less than 5) channels of data, then choose 'time'. Similarly for 'freq'.
usewindowstats	'none'	Use sliding-window statistics to find additional flags ( This is Experimental !! ) options = 'none', 'sum', 'std', 'both' The 'sum' option chooses to flag a point, if the mean-value in a window centered on that point deviates from the fit by more than N-stddev/1.5. This option is an attempt to catch broad-band or time-persistent RFI that the above polynomial fits will mistakenly fit as the clean band. It is an approximation to the sumThreshold method found to be effective by Offringa et.al (2010) for LOFAR data. The 'std' option chooses to flag a point, if the 'local' stddev calculated in a window centered on that point is larger than N-stddev/1.5. This option is an attempt to catch noisy RFI that is not excluded in the polynomial fits, and which increases the global stddev, and results in fewer flags (based on the N-stddev threshold). This is an approximation to the idea behind 'rflag' in AIPS (which E.Greisen is currently refining).
halfwin	1	Half width of sliding window to use with 'usewindowstats'. (This is Experimental !!) options = 1,2,3 for 3-point, 5-point or 7-point window sizes

extendpols	False	Extend flags to all correlations This option can be used in conjunction with 'corrs' to calculate flags using only parallel-hand data, but apply them to all correlations (for example)
growtime	50.0	For any channel, flag the entire timerange in the current 2D chunk (set by 'ntime') if more than X% of the timerange is already flagged. options = 0.0 - 100.0 This option catches the low-intensity parts of time-persistent RFI.
growfreq	50.0	For any timestep, flag all channels in the current 2D chunk (set by data-selection) if more than X% of the channels are already flagged. options = 0.0 - 100.0 This option catches broad-band RFI that is partially identified by earlier steps.
growaround	True	Extend flags to immediately surrounding points in the time-freq plane. For every un-flagged point on the 2D time/freq plane, if more than four surrounding points are already flagged, flag that point. This option catches some wings of strong RFI spikes.
flagneartime	False	Flag points before and after every flagged one, in the time-direction. Note : This can result in excessive flagging.
flagnearfreq	False	Flag points before and after every flagged one, in the frequency-direction This option allows flagging of wings in the spectral response of strong RFI. Note : This can result in excessive flagging.