CASA MPI problems

There are several CASA MPI errors I encounter regularly.

2021-10-17 15:56:06     SEVERE  task_tclean::SynthesisImagerVi2::runCubeGridding (file src/code/synthesis/ImagerObjects/SynthesisImagerVi2.cc, line 1579)       remainder rank 7 failed
master 1 init 1
2021-10-17 15:59:21     SEVERE  tclean::::casa  Task tclean raised an exception of class RuntimeError with the following message: Error in making PSF : One or more  of the cube section failed in de/grid
ding. Return values for the sections: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

 Traceback (most recent call last):
   File "/blue/adamginsburg/adamginsburg/casa/casa-6.4.0-12/lib/py/lib/python3.6/site-packages/casashell/private/init_system.py", line 238, in __evprop__
     exec(stmt)
   File "<string>", line 1, in <module>
   File "/blue/adamginsburg/adamginsburg/casa/casa-6.4.0-12/lib/py/lib/python3.6/site-packages/casashell/private/init_system.py", line 175, in execfile
     newglob = run_path( filename, init_globals=globals )
   File "/blue/adamginsburg/adamginsburg/casa/casa-6.4.0-12/lib/py/lib/python3.6/runpy.py", line 263, in run_path
     pkg_name=pkg_name, script_name=fname)
   File "/blue/adamginsburg/adamginsburg/casa/casa-6.4.0-12/lib/py/lib/python3.6/runpy.py", line 96, in _run_module_code
     mod_name, mod_spec, pkg_name, script_name)
   File "/blue/adamginsburg/adamginsburg/casa/casa-6.4.0-12/lib/py/lib/python3.6/runpy.py", line 85, in _run_code
     exec(code, run_globals)
   File "/orange/adamginsburg/ALMA_IMF/reduction/reduction/line_imaging.py", line 935, in <module>
     **impars
   File "/blue/adamginsburg/adamginsburg/casa/casa-6.4.0-12/lib/py/lib/python3.6/site-packages/casatasks/tclean.py", line 1660, in __call__
     task_result = _tclean_t( _pc.document['vis'], _pc.document['selectdata'], _pc.document['field'], _pc.document['spw'], _pc.document['timerange'], _pc.document['uvrange'], _pc.document['antenna'], _pc.document['scan'], _pc.document['observation'], _pc.document['intent'], _pc.document['datacolumn'], _pc.document['imagename'], _pc.document['imsize'], _pc.document['cell'], _pc.document['phasecenter'], _pc.document['stokes'], _pc.document['projection'], _pc.document['startmodel'], _pc.document['specmode'], _pc.document['reffreq'], _pc.document['nchan'], _pc.document['start'], _pc.document['width'], _pc.document['outframe'], _pc.document['veltype'], _pc.document['restfreq'], _pc.document['interpolation'], _pc.document['perchanweightdensity'], _pc.document['gridder'], _pc.document['facets'], _pc.document['psfphasecenter'], _pc.document['wprojplanes'], _pc.document['vptable'], _pc.document['mosweight'], _pc.document['aterm'], _pc.document['psterm'], _pc.document['wbawp'], _pc.document['conjbeams'], _pc.document['cfcache'], _pc.document['usepointing'], _pc.document['computepastep'], _pc.document['rotatepastep'], _pc.document['pointingoffsetsigdev'], _pc.document['pblimit'], _pc.document['normtype'], _pc.document['deconvolver'], _pc.document['scales'], _pc.document['nterms'], _pc.document['smallscalebias'], _pc.document['restoration'], _pc.document['restoringbeam'], _pc.document['pbcor'], _pc.document['outlierfile'], _pc.document['weighting'], _pc.document['robust'], _pc.document['noise'], _pc.document['npixels'], _pc.document['uvtaper'], _pc.document['niter'], _pc.document['gain'], _pc.document['threshold'], _pc.document['nsigma'], _pc.document['cycleniter'], _pc.document['cyclefactor'], _pc.document['minpsffraction'], _pc.document['maxpsffraction'], _pc.document['interactive'], _pc.document['usemask'], _pc.document['mask'], _pc.document['pbmask'], _pc.document['sidelobethreshold'], _pc.document['noisethreshold'], _pc.document['lownoisethreshold'], _pc.document['negativethreshold'], _pc.document['smoothfactor'], _pc.document['minbeamfrac'], _pc.document['cutthreshold'], _pc.document['growiterations'], _pc.document['dogrowprune'], _pc.document['minpercentchange'], _pc.document['verbose'], _pc.document['fastnoise'], _pc.document['restart'], _pc.document['savemodel'], _pc.document['calcres'], _pc.document['calcpsf'], _pc.document['psfcutoff'], _pc.document['parallel'] )
   File "/blue/adamginsburg/adamginsburg/casa/casa-6.4.0-12/lib/py/lib/python3.6/site-packages/casatasks/private/task_tclean.py", line 364, in tclean
     imager.makePSF()
   File "/blue/adamginsburg/adamginsburg/casa/casa-6.4.0-12/lib/py/lib/python3.6/site-packages/casatasks/private/imagerhelpers/imager_base.py", line 344, in makePSF
     self.makePSFCore()
   File "/blue/adamginsburg/adamginsburg/casa/casa-6.4.0-12/lib/py/lib/python3.6/site-packages/casatasks/private/imagerhelpers/imager_base.py", line 496, in makePSFCore
     self.SItool.makepsf()
   File "/blue/adamginsburg/adamginsburg/casa/casa-6.4.0-12/lib/py/lib/python3.6/site-packages/casatools/synthesisimager.py", line 70, in makepsf
     return self._swigobj.makepsf()
   File "/blue/adamginsburg/adamginsburg/casa/casa-6.4.0-12/lib/py/lib/python3.6/site-packages/casatools/__casac__/synthesisimager.py", line 322, in makepsf
     return _synthesisimager.synthesisimager_makepsf(self)
 RuntimeError: Error in making PSF : One or more  of the cube section failed in de/gridding. Return values for the sections: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

That one appears to be caused by:

2021-10-17 15:53:34     WARN    MPICommandServer::command_request_handler_service::SynthesisImagerVi2::CubeMajorCycle::MPIServer-7 (file src/code/synthesis/ImagerObjects/CubeMajorCycleAlgorithm.cc, line
 A) Exception for chan range [1921, 1924] ---   Error in making PSF : Interpolate1D::operator() data has repeated x values
##################################
#############################
Exception: Error in making PSF : Interpolate1D::operator() data has repeated x values

This happened after a lot of apparently successful cleaning, as an .image was created. However, the failure caused CASA to exit.

This tclean command:

2021-10-17 15:19:43     INFO    tclean::::casa  tclean( vis='/blue/adamginsburg/adamginsburg/almaimf/workdir/G008.67_B3_spw2_12M.concat.ms', selectdata=True, field='G008.67', spw='', timerange='', uvrange='', antenna='', scan='', observation='', intent='', datacolumn='corrected', imagename='/blue/adamginsburg/adamginsburg/almaimf/workdir/G008.67_B3_spw2_12M_spw2', imsize=[2880, 2250], cell=['0.08arcsec', '0.08arcsec'], phasecenter='ICRS 271.5877979623041deg -21.620789662367244deg', stokes='I', projection='SIN', startmodel='', specmode='cube', reffreq='', nchan=-1, start='', width='', outframe='LSRK', veltype='radio', restfreq=[], interpolation='linear', perchanweightdensity=True, gridder='mosaic', facets=1, psfphasecenter='', wprojplanes=1, vptable='', mosweight=True, aterm=True, psterm=False, wbawp=True, conjbeams=False, cfcache='', usepointing=False, computepastep=360.0, rotatepastep=360.0, pointingoffsetsigdev=[], pblimit=0.05, normtype='flatnoise', deconvolver='multiscale', scales=[0, 4, 8, 16, 32], nterms=2, smallscalebias=0.5, restoration=True, restoringbeam='', pbcor=False, outlierfile='', weighting='briggsbwtaper', robust=0.0, noise='1.0Jy', npixels=0, uvtaper=[''], niter=5000000, gain=0.1, threshold='0.0168Jy', nsigma=0.0, cycleniter=-1, cyclefactor=2.0, minpsffraction=0.05, maxpsffraction=0.8, interactive=0, usemask='user', mask='', pbmask=0.1, sidelobethreshold=3.0, noisethreshold=5.0, lownoisethreshold=1.5, negativethreshold=0.0, smoothfactor=1.0, minbeamfrac=0.3, cutthreshold=0.01, growiterations=75, dogrowprune=True, minpercentchange=-1.0, verbose=False, fastnoise=True, restart=True, savemodel='none', calcres=False, calcpsf=True, psfcutoff=0.35, parallel=True )

using concatenated data caused the failure. It was run as an MPI job with 32 cores and 128 GB RAM. The same issue recurred for the next MS (spw3), but not for spw0 or spw1 - yet, they're still running.

This: "Exception for chan range [1921, 1924] --- Error in making PSF : Interpolate1D::operator() data has repeated x values" suggests to me that there's a problem with the gridder thinking there are more channels than there actually are; tclean's behavior is not consistent between operating on concatenated data sets and on lists of data sets. This issue may be a manifestation of tclean misunderstanding the grid when operating with MPI enabled. This could also prove not to be an MPI error, but one that is only encountered if MPI is enabled b/c MPI gets to channel 1921 while non-MPI never gets there before the job is canceled for time.

W51 spw2 had similar:

WARN    MPICommandServer::command_request_handler_service::SynthesisImagerVi2::CubeMajorCycle::MPIServer-21 (file src/code/synthesis/ImagerObjects/CubeMajorCycleAlgorithm.cc, line 336)        Exception for chan range [1269, 1277] ---   FilebufIO::readBlock - incorrect number of bytes read for file /blue/adamginsburg/adamginsburg/almaimf/workdir/W51-E_B3_spw2_12M_spw2.sumwt/table.f0
Exception: FilebufIO::readBlock - incorrect number of bytes read for file /blue/adamginsburg/adamginsburg/almaimf/workdir/W51-E_B3_spw2_12M_spw2.sumwt/table.f0
WARN    MPICommandServer::command_request_handler_service::SynthesisImagerVi2::CubeMajorCycle::MPIServer-9 (file src/code/synthesis/ImagerObjects/CubeMajorCycleAlgorithm.cc, line 336) Exception for chan range [1314, 1322] ---   FilebufIO::readBlock - incorrect number of bytes read for file /blue/adamginsburg/adamginsburg/almaimf/workdir/W51-E_B3_spw2_12M_spw2.sumwt/table.f0
Exception: FilebufIO::readBlock - incorrect number of bytes read for file /blue/adamginsburg/adamginsburg/almaimf/workdir/W51-E_B3_spw2_12M_spw2.sumwt/table.f0

I have not found solutions to any of these; the workaround appears to be just to not use MPI.

EDIT: here's another one

2021-10-22 22:36:53     WARN    MPICommandServer::command_request_handler_service::SynthesisImagerVi2::CubeMajorCycle::MPIServer-2 (file src/code/synthesis/ImagerObjects/CubeMajorCycleAlgorithm.cc, line 336) Exception for chan range [55, 109] ---   Setting masked pixels to zero for input startmodel : Error (Resource deadlock avoided) when acquiring lock on /blue/adamginsburg/adamginsburg/almaimf/workdir/G327.29_B6_spw1_12M_sio.contcube.model/table.lock
##################################
#############################
Exception: Setting masked pixels to zero for input startmodel : Error (Resource deadlock avoided) when acquiring lock on /blue/adamginsburg/adamginsburg/almaimf/workdir/G327.29_B6_spw1_12M_sio.contcube.model/table.lock
2021-10-22 22:36:54     WARN    MPICommandServer::command_request_handler_service::SynthesisImagerVi2::CubeMajorCycle::MPIServer-4 (file src/code/synthesis/ImagerObjects/CubeMajorCycleAlgorithm.cc, line 336) Exception for chan range [165, 219] ---   Setting masked pixels to zero for input startmodel : Error (Resource deadlock avoided) when acquiring lock on /blue/adamginsburg/adamginsburg/almaimf/workdir/G327.29_B6_spw1_12M_sio.contcube.model/table.lock
##################################
#############################
Exception: Setting masked pixels to zero for input startmodel : Error (Resource deadlock avoided) when acquiring lock on /blue/adamginsburg/adamginsburg/almaimf/workdir/G327.29_B6_spw1_12M_sio.contcube.model/table.lock
2021-10-22 22:36:57     WARN    MPICommandServer::command_request_handler_service::SynthesisImagerVi2::CubeMajorCycle::MPIServer-1 (file src/code/synthesis/ImagerObjects/CubeMajorCycleAlgorithm.cc, line 336) Exception for chan range [0, 54] ---   Setting masked pixels to zero for input startmodel : Error (Resource deadlock avoided) when acquiring lock on /blue/adamginsburg/adamginsburg/almaimf/workdir/G327.29_B6_spw1_12M_sio.contcube.model/table.lock
##################################
#############################
Exception: Setting masked pixels to zero for input startmodel : Error (Resource deadlock avoided) when acquiring lock on /blue/adamginsburg/adamginsburg/almaimf/workdir/G327.29_B6_spw1_12M_sio.contcube.model/table.lock

Comments