14. How can I read my own data?

14.1. Introduction

One of the key strengths of CIS is the ability for users to create their own plugins to read data which CIS doesn’t currently support. These plugins can then be shared with the community to allow other users access to that data. Although the plugins are written in Python, this tutorial assumes no experience of Python. Some general programming experience is, however, assumed.

Note

Any technical details that may be useful to experienced Python programmers will be highlighted in this style - they aren’t necessary for completing the tutorial.

Here we describe the process of creating and sharing a plugin. A CIS plugin is simply a Python (.py) file with a set of methods (or functions) that describe how the plugin should behave.

Note

The methods for each plugin are described within a class. This gives the plugin a name and allows CIS to ensure that all of the necessary methods have been implemented.

There are a few methods that the plugin must contain, and some which are optional. A skeleton plugin would look like this:

class MyProd(AProduct):
    def get_file_signature(self):
        # Code goes here
        ...

    def create_coords(self, filenames):
        ...

    def create_data_object(self, filenames, variable):
        ...

Note that in Python whitespace matters! When filling in the above methods, the code for each method should be indented by four spaces relative to its signature, like this:

class MyProd(AProduct):

    def get_file_signature(self):
        # Code goes here
        foo = bar

Note also that the name of the plugin (MyProd in this case) should be changed to describe the data which it will read. (Don’t change the AProduct part though – this is important for telling CIS that this is a plugin for reading data.)

Note

The plugin class subclasses AProduct, the abstract class that defines the methods the plugin needs to override. It also includes a few helper functions for error catching.

When CIS looks for data plugins it searches for all classes which subclass AProduct. There are also plugins available for collocation with their own abstract base classes, so that users can store multiple plugin types in the same plugin directory.

In order to turn the above skeleton into a working plugin we need to fill in each of the methods with some code that turns our data into something CIS will understand. Often it is easiest to start from an existing plugin that reads closely matching data. For example, a plugin to read some other CCI data would probably be easiest to write starting from the Cloud or Aerosol CCI plugins. We have created three tutorials that walk you through the creation of some of the existing plugins to illustrate the process. The Easy tutorial walks through the creation of a basic plugin, the Medium tutorial builds on that by creating a plugin which has a bit more detail, and the Advanced tutorial talks through some of the main considerations when creating a large and complicated plugin.

A more general template plugin is included here in case no existing plugin matches your need. We have also created a short reference describing the purpose of each method the plugins implement here.

Note

Plugins aren’t the only way you can contribute though. CIS is an open source project hosted on GitHub, so please feel free to submit pull-requests for new features or bug-fixes – just check with the community first so that we’re not duplicating our effort.

14.1.1. Using and testing your plugin

It is important that CIS knows where to look to find your new plugin, and this is easily done by setting the environment variable CIS_PLUGIN_HOME to point to the directory within which your plugin is stored.

Once you have done this CIS will automatically use your plugin for reading any files which match the file signature you used.

If you have any issues with this (because for example the file signature clashes with a built-in plugin) you can tell CIS to use your plugin when reading data by simply specifying it after the variable and filename in most CIS commands, e.g.:

cis subset a_variable:filename.nc:product=MyProd ...
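Before running CIS it can be worth confirming that CIS_PLUGIN_HOME points where you expect. The following is a minimal sketch of such a check, not a description of how CIS itself discovers plugins: the function name list_plugin_files is hypothetical, and it only lists the .py files in the directory named by the environment variable.

```python
import os
from pathlib import Path


def list_plugin_files(environ=os.environ):
    """Return the names of the .py files in the CIS_PLUGIN_HOME directory.

    Returns an empty list if the variable is unset or is not a directory.
    This is only a sanity check; it does not import or validate the plugins.
    """
    plugin_home = environ.get("CIS_PLUGIN_HOME")
    if not plugin_home:
        return []
    directory = Path(plugin_home)
    if not directory.is_dir():
        return []
    return sorted(path.name for path in directory.glob("*.py"))


if __name__ == "__main__":
    files = list_plugin_files()
    if files:
        print("Plugin files found:", files)
    else:
        print("CIS_PLUGIN_HOME is unset, is not a directory, or contains no .py files")
```

If the file containing your plugin class is not in the list, CIS is unlikely to find it either.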

14.1.2. Sharing your plugin

This is the easy bit! Once you’re happy that your plugin can fairly reliably read a currently unsupported dataset you should share it with the community. Use the upload form here to submit your plugin to the community.

We moderate the plugins we receive to ensure they are appropriate and meet a minimum level of quality. We’re not expecting the plugins to necessarily be production-quality code, but we do expect them to work for the subset of data they claim to. Having said that, if we feel a plugin provides a really valuable capability and is of high quality we may incorporate that plugin into the core CIS data readers (with credit to the author, of course!).

14.3. Data plugin reference

This section provides a reference describing the expected behaviour of each of the functions a plugin can implement. The following methods are mandatory:

AProduct.get_file_signature()

This method should return a list of regular expressions, which CIS uses to decide which data product to use for a given file. If more than one regular expression is provided in the list then the file can match any of the expressions. The first product with a signature that matches the filename will be used. The order in which the products are searched is determined by the priority property, highest value first; internal products generally have a priority of 10.

For example, this would match all files with a name containing the string ‘CODE’ and with the ‘nc’ extension:

return [r'.*CODE.*\.nc']
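A signature can be tried against a few sample filenames with Python’s re module before it goes into a plugin. The sketch below assumes that each expression is matched from the start of the filename and without regard to case; that matching behaviour, the helper name matches_signature and the sample filenames are all assumptions made for illustration.

```python
import re

# Illustrative signature: any name containing "CODE" followed later by ".nc".
# "\." matches a literal dot, whereas a bare "." would match any character.
SIGNATURES = [r'.*CODE.*\.nc']


def matches_signature(filename, signatures):
    """Return True if the filename matches any regular expression in the list."""
    # Assumption: matching starts at the beginning of the name and ignores case.
    return any(re.match(sig, filename, re.IGNORECASE) for sig in signatures)


for name in ["my_CODE_data_2008.nc", "CODE.nc", "other_data.nc", "my_CODE_data.hdf"]:
    print(name, matches_signature(name, SIGNATURES))
# my_CODE_data_2008.nc True
# CODE.nc True
# other_data.nc False
# my_CODE_data.hdf False
```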

Note

If the signature matches, the framework will call AProduct.get_file_type_error(), which gives the product a chance to open the file and check the contents.

Returns: A list of regular expressions matching the product’s file naming convention.
Return type: list
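The selection rule described above (highest priority first, first matching signature wins, then a file type check) can be sketched as follows. This is an illustration under stated assumptions rather than the CIS implementation: ProductSketch is a hypothetical stand-in for AProduct, the two product classes and their priorities are invented, and the sketch assumes that a product whose file type check returns errors is skipped in favour of the next one.

```python
import re


class ProductSketch:
    """Hypothetical stand-in for a plugin base class (not the real AProduct)."""
    priority = 10

    def get_file_signature(self):
        return []

    def get_file_type_error(self, filename):
        return None  # None means "no errors, this product can read the file"


class GenericNetCDF(ProductSketch):
    priority = 10

    def get_file_signature(self):
        return [r'.*\.nc']


class MyProd(ProductSketch):
    priority = 20  # higher than the generic reader, so it is tried first

    def get_file_signature(self):
        return [r'.*CODE.*\.nc']


def choose_product(filename, product_classes):
    """Return the first class, in descending priority, that accepts the file."""
    ordered = sorted(product_classes, key=lambda cls: cls.priority, reverse=True)
    for cls in ordered:
        product = cls()
        for signature in product.get_file_signature():
            if re.match(signature, filename, re.IGNORECASE):
                # Assumption: a product reporting errors is passed over.
                if product.get_file_type_error(filename) is None:
                    return cls
    return None


print(choose_product("my_CODE_data.nc", [GenericNetCDF, MyProd]).__name__)  # MyProd
print(choose_product("other_data.nc", [GenericNetCDF, MyProd]).__name__)    # GenericNetCDF
```

Both signatures match my_CODE_data.nc, so it is the priority ordering that decides which product reads it.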

AProduct.create_coords(filenames)

Reads the coordinates from one or more files. Note that this method may have to make certain assumptions about the file in order to return a single coordinate set. The user should be warned through the logger if this is the case.

Parameters: filenames (list) – List of filenames to read coordinates from
Returns: CommonData object

AProduct.create_data_object(filenames, variable)

Create and return a CommonData object for a given variable from one or more files.

Parameters:
  • filenames (list) – List of filenames of files to read
  • variable (str) – Variable to read from the files
Returns: A CommonData object representing the specified variable

Raises:
  • FileIOError – Unable to read a file
  • InvalidVariableError – Variable not present in file

The following methods are optional:

AProduct.get_variable_names(filenames, data_type=None)

Get a list of available variable names from the filenames list passed in. This general implementation can be overridden in specific products to include/exclude variables which may or may not be relevant. The data_type parameter can be used to specify extra information.

Parameters:
  • filenames (list) – List of string filenames of files to be read from
  • data_type (str) – ‘SD’ or ‘VD’ to return only SD or VD variables from HDF files. This may take on other values in specific product implementations.
Returns: A set of variable names as strings
Return type: set of str

AProduct.get_file_type_error(filename)

Check a single file to see if it is of the correct type, and if not return a list of errors. If the return is None then there are no errors and this is the correct data product to use for this file.

This method gives a data product a way to identify itself as the correct product when a sufficiently specific file signature cannot be provided. For example, GASSP files are NetCDF files, so their filenames end with .nc, but so do those of other NetCDF files. The GASSP data product therefore opens the file and looks for the GASSP version attribute, and returns an error if it doesn’t find it.

Parameters: filename (str) – The filename for the file
Returns: List of errors, or None
Return type: list or None
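The "list of errors or None" contract can be illustrated with a check that needs only the standard library. Rather than reading an attribute as the GASSP product does, this sketch swaps in a simpler test: it looks at the first bytes of the file, which are CDF followed by a version byte for classic NetCDF files and the HDF5 signature for NetCDF-4 files. The function name is hypothetical, and the sketch assumes the HDF5 signature sits at the very start of the file; a real plugin would more likely open the file with a NetCDF library and inspect its attributes.

```python
def get_file_type_error_sketch(filename):
    """Return None if the file looks like NetCDF, otherwise a list of errors."""
    netcdf_classic = (b'CDF\x01', b'CDF\x02', b'CDF\x05')
    hdf5_signature = b'\x89HDF\r\n\x1a\n'  # NetCDF-4 files are HDF5 files

    try:
        with open(filename, 'rb') as f:
            header = f.read(8)
    except OSError as e:
        return ["Unable to open {}: {}".format(filename, e)]

    if header[:4] in netcdf_classic or header == hdf5_signature:
        return None  # no errors: the product may be used for this file
    return ["{} does not start with a NetCDF or HDF5 signature".format(filename)]
```

Returning a list of messages rather than raising lets the caller decide whether to report the problem or simply move on to another product.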

AProduct.get_file_format(filename)

Returns a file format hierarchy separated by slashes, of the form TopLevelFormat/SubFormat/SubFormat/Version, e.g. NetCDF/GASSP/1.0, ASCII/ASCIIHyperpoint or HDF4/CloudSat. This is mainly used within the ceda_di indexing tool. If not set it will default to the product’s name.

A filename of an example file can be provided to enable the determination of, for example, a dataset version number.

Parameters: filename (str) – Filename of file to be inspected
Returns: File format, of the form [parent/]format/specific instance/version, or the class name
Return type: str
Raises: FileFormatError if there is an error