16.2.3. Advanced

This more advanced tutorial covers some of the difficulties encountered when reading data which differs significantly from the structure CIS expects, and/or has little metadata in the associated files. We take the MODIS L2 plugin as our example and discuss each method in turn.

There are a number of specific MODIS L2 products which we have tested using this plugin, each with its own file signature, so in this plugin we take advantage of the fact that the regular expression returned by get_file_signature can be a list. This way we create a simple regular expression for each of the MODIS L2 products we support, rather than trying to create one more complicated regular expression which matches just these products to the exclusion of all others:

def get_file_signature(self):
    product_names = ['MYD06_L2', 'MOD06_L2', 'MYD04_L2', 'MOD04_L2']
    regex_list = [r'.*' + product + r'.*\.hdf' for product in product_names]
    return regex_list
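
As a quick sanity check (not part of the plugin itself), we can confirm that patterns of this form match a typical granule name; the filename used below is purely illustrative:

import re

patterns = [r'.*' + product + r'.*\.hdf'
            for product in ['MYD06_L2', 'MOD06_L2', 'MYD04_L2', 'MOD04_L2']]
example_filename = 'MYD04_L2.A2010001.0000.005.hdf'  # illustrative granule name
assert any(re.match(pattern, example_filename) for pattern in patterns)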

We have implemented the optional get_variable_names method here because MODIS files sometimes contain variables which CIS is unable to handle due to their irregular shape. We only want to report the variables which CIS can read, so we check each variable before adding it to the list of variables we return. We know that MODIS files only contain SD variables, so we can ignore any other types.

Note

HDF files can contain both Vdata (VD) and Scientific Dataset (SD) data collections (among others). These are stored and accessed quite differently, which makes dealing with these files quite fiddly - we often have to treat each case separately. In this case we know MODIS files only contain SD datasets, which makes things a bit simpler.
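
For reference, a minimal sketch of how the two are read with pyhdf (the filename and Vdata name here are purely illustrative):

from pyhdf.SD import SD
from pyhdf.HDF import HDF
from pyhdf.VS import VS

# SD access: open the file and select a named dataset
sd = SD('example.hdf')
latitude = sd.select('Latitude')[:]  # read the whole array
sd.end()

# VD access: go through the Vdata (VS) interface instead
hdf_file = HDF('example.hdf')
vs = hdf_file.vstart()
vdata = vs.attach('Some_Vdata_Table')  # hypothetical Vdata name
n_records = vdata.inquire()[0]         # inquire() returns (nrecords, interlace, fields, size, name)
records = vdata.read(n_records)        # read every record
vdata.detach()
vs.end()
hdf_file.close()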

def get_variable_names(self, filenames, data_type=None):
    import pyhdf.SD

    # Determine the valid shape for variables
    sd = pyhdf.SD.SD(filenames[0])
    datasets = sd.datasets()
    valid_shape = datasets['Latitude'][1]  # Assumes that latitude shape == longitude shape (it should)

    variables = set([])
    for filename in filenames:
        sd = pyhdf.SD.SD(filename)
        for var_name, var_info in sd.datasets().items():
            if var_info[1] == valid_shape:
                variables.add(var_name)

    return variables
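
Each entry in the dictionary returned by pyhdf's datasets() call is a tuple of (dimension names, shape, type, index), which is why var_info[1] is compared against the latitude shape above. Roughly (names and values here are illustrative):

# Illustrative structure of sd.datasets() for a MODIS granule:
# {
#     'Latitude': (('Cell_Along_Swath_5km', 'Cell_Across_Swath_5km'),  # [0] dimension names
#                  (406, 270),                                         # [1] shape
#                  5,                                                  # [2] HDF type code
#                  0),                                                 # [3] index within the file
#     ...
# }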

MODIS data often has a scale factor built in and stored against each variable. This method reads the scale factor for a particular variable and checks it against our built-in list of scale factors.

def __get_data_scale(self, filename, variable):
    from cis.exceptions import InvalidVariableError
    from pyhdf import SD

    try:
        meta = SD.SD(filename).datasets()[variable][0][0]
    except KeyError:
        raise InvalidVariableError("Variable "+variable+" not found")

    for scaling in self.modis_scaling:
        if scaling in meta:
            return scaling
    return None
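
The modis_scaling attribute referenced above is just a class-level list of the scales the plugin knows about; it looks something like this (see the full plugin for the definitive definition):

class MODIS_L2(AProduct):
    modis_scaling = ["1km", "5km", "10km"]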

In order to use data which has been scaled, we re-scale it on reading. This creates some overhead when reading the data, but saves considerable time when performing other operations on it later in the process. Routines like this can often be adapted from available Fortran or IDL routines for your data (assuming no Python routines are available).

def __field_interpolate(self,data,factor=5):
    '''
    Interpolates the given 2D field by the factor,
    edge pixels are defined by the ones in the centre,
    odd factors only!
    '''
    import numpy as np

    logging.debug("Performing interpolation...")

    output = np.zeros((factor * data.shape[0], factor * data.shape[1])) * np.nan
    output[factor // 2::factor, factor // 2::factor] = data
    for i in range(1, factor + 1):
        output[(factor // 2 + i):-(factor // 2):factor, :] = \
            i * ((output[factor // 2 + factor::factor, :] - output[factor // 2:-factor:factor, :]) / float(factor)) \
            + output[factor // 2:-factor:factor, :]
    for i in range(1, factor + 1):
        output[:, (factor // 2 + i):-(factor // 2):factor] = \
            i * ((output[:, factor // 2 + factor::factor] - output[:, factor // 2:-factor:factor]) / float(factor)) \
            + output[:, factor // 2:-factor:factor]
    return output
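
As a rough illustration of its effect when called from another method of the plugin (the shapes below are typical of a MODIS granule, but purely illustrative):

import numpy as np

lat_5km = np.arange(406 * 270, dtype=float).reshape(406, 270)  # a 5 km Latitude grid
lat_1km = self.__field_interpolate(lat_5km)                    # expand by the default factor of 5

assert lat_1km.shape == (2030, 1350)   # now matches the shape of the 1 km science datasets
assert lat_1km[2, 2] == lat_5km[0, 0]  # original values sit at the centre of each 5x5 block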

Next we read the coordinates from the file, factoring the coordinate reading out into its own method just as we did in the Aeronet case.

def _create_coord_list(self, filenames, variable=None):
    import datetime as dt

    variables = ['Latitude', 'Longitude', 'Scan_Start_Time']
    logging.info("Listing coordinates: " + str(variables))

As usual we rely on the lower level IO reading routines to provide the raw data, in this case using the hdf.read routine.

sdata, vdata = hdf.read(filenames, variables)

Note

Notice we have to put the vdata data somewhere, even though we don’t use it in this case.

We have to check whether we need to scale the coordinates to match the variable being read:

apply_interpolation = False
if variable is not None:
    scale = self.__get_data_scale(filenames[0], variable)
    apply_interpolation = True if scale == "1km" else False

Then we can read the coordinates, one at a time. We know the latitude information is stored in an SD dataset called Latitude, so we read that and interpolate it if needed.

lat = sdata['Latitude']
sd_lat = hdf.read_data(lat, "SD")
lat_data = self.__field_interpolate(sd_lat) if apply_interpolation else sd_lat
lat_metadata = hdf.read_metadata(lat, "SD")
lat_coord = Coord(lat_data, lat_metadata, 'Y')

The same for Longitude:

lon = sdata['Longitude']
sd_lon = hdf.read_data(lon, "SD")
lon_data = self.__field_interpolate(sd_lon) if apply_interpolation else sd_lon
lon_metadata = hdf.read_metadata(lon, "SD")
lon_coord = Coord(lon_data, lon_metadata, 'X')

Next we read the time variable, remembering to convert it to our internal standard time. (We know that MODIS atomic clock (TAI) time is referenced to 1st January 1993.)

time = sdata['Scan_Start_Time']
time_metadata = hdf.read_metadata(time,"SD")
# Ensure the standard name is set
time_metadata.standard_name = 'time'
time_coord = Coord(time,time_metadata,"T")
time_coord.convert_TAI_time_to_std_time(dt.datetime(1993,1,1,0,0,0))

return CoordList([lat_coord, lon_coord, time_coord])

The create_coords method itself then just wraps this list up as UngriddedCoordinates:

def create_coords(self, filenames, variable=None):
    return UngriddedCoordinates(self._create_coord_list(filenames))

In create_data_object we really just pull the above methods together: we read the specific variable the user has requested and combine it with the coordinates.

def create_data_object(self, filenames, variable):
    logging.debug("Creating data object for variable " + variable)

    # reading coordinates
    # the variable here is needed to work out whether to apply interpolation to the lat/lon data or not
    coords = self._create_coord_list(filenames, variable)

    # reading of variables
    sdata, vdata = hdf.read(filenames, variable)

    # retrieve data + its metadata
    var = sdata[variable]
    metadata = hdf.read_metadata(var, "SD")

    return UngriddedData(var, metadata, coords)
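
With all of these methods in place, data can be read through the plugin in the usual way; a minimal sketch using CIS's Python API (the filename pattern and variable name here are purely illustrative):

from cis import read_data

# CIS picks the MODIS L2 plugin automatically via get_file_signature,
# although the product can also be forced with the product keyword.
ctt = read_data('MYD06_L2*.hdf', 'Cloud_Top_Temperature')
print(ctt)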

We have also implemented the AProduct.get_file_format() method which allows some associated tools (for example the CEDA_DI tool) to use CIS to index files which they wouldn’t otherwise be able to read. We just return a file format descriptor as a string.

def get_file_format(self, filenames):
    """
    Get the file format
    :param filenames: the filenames of the files
    :return: the file format, as a string
    """

    return "HDF4/ModisL2"

The full MODIS L2 plugin is rather long to show but can be downloaded here.