CASA corrupted measurement sets

Over the weekend, a huge quantity of my ALMA data was corrupted. The corruption appears to have been caused by an incorrectly mounted Lustre filesystem that had "flock" disabled. When the filesystem was re-mounted, every measurement set I had touched was unreadable.

Opening a corrupted data set with msmd.open produced messages like this:

RuntimeError: Exception: Illegal DATA_DESC_ID value 30 found in main table. /lustre/cv/projects/ALMA_IMF/2017.1.01355.L/science_goal.uid___A001_X1296_X211/group.uid___A001_X1296_X212/member.uid___A001_X1296_X217/calibrated/uid___A002_Xcbdb2a_X6e67.ms.split.cal/DATA_DESCRIPTION only has 0 rows (IDs).
... thrown by void casa::MSChecker::checkReferentialIntegrity() const at File: ../../msvis/MSVis/MSChecker.cc, line: 78

The solution was to replace all of the table.dat files in the DATA_DESCRIPTION, POLARIZATION, STATE, and PROCESSOR folders with "backups" from the parent measurement set. Luckily, all of my data were split from other measurement sets that still existed, and they were split with reindex=False, which meant that these tables were actually usable. There's no guarantee they would be in general.
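Since there is no guarantee that the parent's tables are usable, a read-only comparison before overwriting anything may be a sensible first step. The following is only a sketch: the function names `differing_offsets` and `looks_like_drop_in` and the `max_diffs` threshold are made up for illustration, and "same size, only a few differing bytes" is an assumed heuristic, not a CASA-sanctioned validity test.

```python
import os

# The four subtables whose table.dat files were replaced
SUBTABLES = ('DATA_DESCRIPTION', 'POLARIZATION', 'STATE', 'PROCESSOR')

def differing_offsets(path_a, path_b):
    """Return (same_size, offsets) where offsets lists the differing byte positions."""
    with open(path_a, 'rb') as fh:
        a = bytearray(fh.read())  # bytearray yields ints on Python 2 and 3
    with open(path_b, 'rb') as fh:
        b = bytearray(fh.read())
    # zip stops at the shorter file, so the size check is reported separately
    offsets = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    return len(a) == len(b), offsets

def looks_like_drop_in(split_ms, parent_ms, max_diffs=4):
    """For each subtable, report whether the parent's table.dat has the same
    size as the split's and differs in at most max_diffs bytes (assumed
    heuristic). None means one of the two files is missing."""
    report = {}
    for name in SUBTABLES:
        split_tb = os.path.join(split_ms, name, 'table.dat')
        parent_tb = os.path.join(parent_ms, name, 'table.dat')
        if not (os.path.exists(split_tb) and os.path.exists(parent_tb)):
            report[name] = None
            continue
        same_size, offsets = differing_offsets(split_tb, parent_tb)
        report[name] = same_size and len(offsets) <= max_diffs
    return report
```

Nothing is written to disk here, so it can be run over every split/parent pair first, and any pair that reports False or None can be inspected by hand.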

My fix script is this:

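import os
import shutil

def getdata(x):
    # Read the whole file as raw bytes
    with open(x, 'rb') as fh:
        return fh.read()

def diff(x, y):
    # Map byte offset -> (byte in x, byte in y) wherever the two differ
    return {ii:(xx,yy) for ii,(xx,yy) in enumerate(zip(x, y)) if xx!=yy}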

def fix_bad_mses():
    # Walk the tree looking for split measurement sets (*.split.cal)
    for dirpath, dirnames, filenames in os.walk('.'):
        for fn in dirnames:
            if fn[-10:] == ".split.cal":
                ffn = os.path.join(dirpath, fn)
                # The parent measurement set lives in the sibling 'working' directory
                workingfn = os.path.join(dirpath, 'working', fn[:-10])

                if os.path.exists(workingfn):
                    #print("Working on files {0} and {1}".format(ffn, workingfn))
                    print(ffn)

                    for dn in ['POLARIZATION', 'PROCESSOR', 'DATA_DESCRIPTION', 'STATE']:
                        # tbpth1 is the (possibly corrupted) split table, tbpth2 the parent's copy
                        tbpth1 = os.path.join(ffn, dn, 'table.dat')
                        tbpth2 = os.path.join(workingfn, dn, 'table.dat')

                        if os.path.exists(tbpth1) and os.path.exists(tbpth2):
                            d1 = getdata(tbpth1)
                            d2 = getdata(tbpth2)
                            delta = diff(d1, d2)
                            #print("Diff from {0} to {1} is {2}".format(tbpth1, tbpth2, delta))

                            if len(delta) > 0:
                                print("FIXING: Diff for {0} is {1}".format(dn, delta))
                                # Keep the corrupted file as .bck, then overwrite it with the parent's copy
                                shutil.copy(tbpth1, tbpth1+".bck")
                                shutil.copy(tbpth2, tbpth1)

which spewed out a bunch of messages like this:

FIXING: Diff for POLARIZATION is {24: ('\x00', '\x01'), 1059: ('\x00', '\x01')}
FIXING: Diff for PROCESSOR is {24: ('\x00', '\x03'), 1078: ('\x00', '\x03')}
FIXING: Diff for DATA_DESCRIPTION is {24: ('\x00', '\x18'), 772: ('\x00', '\x18')}
FIXING: Diff for STATE is {24: ('\x00', '\x1d'), 1699: ('\x00', '\x1d')}

showing that for each bad file, two bytes differed from the parent's copy: both were zero in the corrupted file and held the same nonzero value in the good one. I don't know how or why. Notably, though, the byte changes that corrupted the data did not change the file modification date; somehow CASA hid this operation from the filesystem.
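Because the modification date did not change, a timestamp check alone would not have caught this damage. One way to detect it in future would be to record a checksum of every table.dat and compare it later. This is a minimal sketch using only the standard library; the helper names `sha256_of`, `snapshot`, and `silently_changed` are hypothetical, and nothing here is specific to CASA.

```python
import hashlib
import os

def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large tables need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, 'rb') as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()

def snapshot(ms_path):
    """Map each table.dat under ms_path (as a relative path) to (mtime, sha256)."""
    result = {}
    for dirpath, dirnames, filenames in os.walk(ms_path):
        if 'table.dat' in filenames:
            full = os.path.join(dirpath, 'table.dat')
            rel = os.path.relpath(full, ms_path)
            result[rel] = (os.path.getmtime(full), sha256_of(full))
    return result

def silently_changed(before, after):
    """List files whose contents changed while their mtime stayed the same."""
    return sorted(rel for rel, (mtime, digest) in after.items()
                  if rel in before
                  and before[rel][0] == mtime
                  and before[rel][1] != digest)
```

Taking a `snapshot` while the data are known to be good and another after a suspicious event, then passing both to `silently_changed`, would flag exactly the case described here: altered bytes with an unaltered modification date.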
