12.5. Operation Development

In general, operation development follows the overall Cate plugin development approach: any valid Python function can be decorated accordingly to introduce it to the Cate plugin system. However, there are some caveats to keep in mind, as well as best practices to follow, when developing Cate operations. This chapter explores these issues.

12.5.1. Operation development technology stack

To develop operations for Cate one should be at least cursorily familiar with the following Python projects:

The xarray package is used the most, as xarray.Dataset is the data model used to represent raster data throughout Cate. Most operations work on xarray datasets and produce xarray datasets. For tabular data representation and manipulation Cate supports pandas. NumPy is the cornerstone of both xarray and pandas and is used when data is explicitly loaded into memory from an xarray object.
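
For orientation, here is a minimal sketch of how these packages relate in practice (the file name and the variable name sst are placeholders):

import xarray as xr

ds = xr.open_dataset('some_file.nc')  # xarray.Dataset: Cate's raster data model
df = ds['sst'].to_dataframe()         # pandas.DataFrame view of a variable
arr = ds['sst'].values                # numpy array, explicitly loaded into memory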

The dask package provides a numpy-like data array abstraction over datasets spanning many on-disk files or even remote locations. A dask array is the underlying array object type of xarray datasets spanning multiple files, which is the case in the large majority of Cate use cases. It is beneficial to be familiar with how dask works in order to write fast, parallelized operations; ignoring how dask works can result in a heavy performance penalty.
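
The following sketch illustrates this lazy, dask-backed behaviour; the file pattern, chunk sizes, and variable name are assumptions made for illustration:

import xarray as xr

# Opening many files yields a dataset backed by lazy dask arrays
ds = xr.open_mfdataset('data/*.nc', chunks={'time': 12})

# This only builds a dask task graph; nothing is read or computed yet
mean = ds['sst'].mean(dim='time')

# I/O and computation happen here, parallelized over chunks by dask
result = mean.compute()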

12.5.2. Registration with the Cate plugin system

Any Python function can be registered in the Cate operation registry by decorating it accordingly. At a bare minimum the @op decorator must be used. Depending on the particular circumstances it may also be necessary to use other decorators, such as @op_input, @op_output, and @op_return.

For in-depth information on these decorators and their parameters, please check the detailed design of Module cate.core.op, as well as the documentation on Operation Management and Plugin Concept.

A minimal Cate operation would then look like the following:

from cate.core.op import op


@op()
def dummy_operation(a, b):
    return a + b

A more involved example, which uses tags to ease operation discovery by the user and accepts file inputs, inputs restricted to known value sets, and variable inputs tied to a particular dataset, would look like the following:

import xarray as xr

from cate.core.op import op, op_input
from cate.core.types import VarName


@op(tags=['geometric'])
@op_input('file', file_open_mode='w', file_filters=[dict(name='NetCDF', extensions=['nc'])])
@op_input('set', value_set=['a', 'b', 'c'])
@op_input('var', value_set_source='ds', data_type=VarName)
def some_operation(ds: xr.Dataset,
                   file: str,
                   set: str = 'a',
                   var: VarName.TYPE = None):
    # Do some science here
    return ds

In this example we have marked the input named file as one that requires a file browser in the GUI, and the inputs set and var as ones that require a drop-down box in the GUI, also specifying which values should populate this drop-down box, or where to find them.

We also use the Cate typing system to let other parts of Cate (GUI, CLI) know the type of var, as well as to enable streamlined validation. With respect to operation development this is described in more detail here: Cate typing system.

If the newly created operation is meant to be part of the Cate core operation suite, it should be possible to import it when Cate is used programmatically. Hence, it should be put in cate/ops and imported in cate/ops/__init__.py.
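
For illustration, the corresponding entry in cate/ops/__init__.py might look like this (the module and operation names are hypothetical):

# In cate/ops/__init__.py
from .my_module import my_operation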

12.5.2.1. Tags

Each operation should have at least one tag. This can be the module name, input or output in the case of operations in the io module, or a tag from the following list:

  • utility for any utility operations

  • internal for internal operations, they will not be shown in user interfaces

  • geometric for geometric operations

  • point for operations that operate on single lon/lat points

  • spatial for predominantly spatial operations

  • temporal for predominantly temporal operations

  • filter for operations that filter out things from an input to an output

12.5.2.2. Deprecations

It is often necessary to change the name or the arguments of an existing operation. To provide backward compatibility, the “old” operation can still be maintained by deprecating it. If a parameter name (and possibly its type) needs to change, the old name and type can likewise be kept and deprecated.

The deprecated property is available for the @op, @op_input, and @op_output decorators. Its value may be just True or a string explaining why the operation/input/output has been deprecated and what to do instead.

@op(deprecated='"some_operation()" was inaccurate; use "some_new_operation()" instead')
def some_operation(...):
    ...

@op_input('id', nullable=False)
@op_input('name', deprecated='Meaning changed; use "id" instead')
def some_operation(id: str = None, name: str = None, ...):
    id = id or name
    ...

Note that for maximum backward compatibility it is always a good idea to pass arguments by keyword instead of by position.

By default, deprecated operations and deprecated inputs/outputs will not be shown to users in the Cate CLI and Cate App GUI.

To list all deprecated operations in the Cate CLI, type:

$ cate op list --deprecated

12.5.3. History information

Well-behaved netCDF filters are expected to add information about themselves to the history attribute of a netCDF file. See Description of netCDF file contents.

Cate facilitates this by automatically adding information about Cate, the particular operation, its version, and the invocation parameters to outputs that have been marked for history addition by providing the appropriate parameter to the @op_output or @op_return decorators. Note that version information must be provided to the @op decorator as well.

import xarray as xr

from cate.core.op import op, op_output


@op(version='1.0')
@op_output('name2', add_history=True)
def my_op_that_saves_history_info(ds1: xr.Dataset, ds2: xr.Dataset):
    # Do some science
    return {'name1': ds1, 'name2': ds2}

Here history information will be added only to the name2 output. We could have added add_history=True to both outputs. Adding history information to a sole output, if that output is a dataset, can be achieved by using @op_return in a similar manner.
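
A minimal sketch of this @op_return variant could look as follows (the operation itself is hypothetical):

import xarray as xr

from cate.core.op import op, op_return


@op(version='1.0')
@op_return(add_history=True)
def my_op_with_a_single_output(ds: xr.Dataset) -> xr.Dataset:
    # Do some science
    return ds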

12.5.4. Cate typing system

Operations must use the Cate typing system to ensure that correct controls are shown in the GUI for the given inputs. The Cate typing system also ensures that part of the input validation can be done ‘for free’ and is located in one place, and it lets one create operations that mimic polymorphism by accepting multiple input types.

For example, an operation that accepts either an xr.Dataset or a pd.DataFrame for its main input, and also takes a polygon, could look like this:

from cate.core.types import DatasetLike, PolygonLike
from cate.core.op import op, op_input


@op()
@op_input('dsf', data_type=DatasetLike)
@op_input('region', data_type=PolygonLike)
def my_op_using_advanced_types(dsf: DatasetLike.TYPE, region: PolygonLike.TYPE):
    # Convert inputs to base types (implicit validation)
    ds = DatasetLike.convert(dsf)
    region = PolygonLike.convert(region)

    # Do some science

    return ds

Note that the framework requires the Cate typing system to be used both in the decorators and in the function signature. Here we have created an operation that accepts either an xr.Dataset or a pd.DataFrame and converts it to an xr.Dataset for the actual calculation. We also have a region parameter that can be a shapely.geometry.Polygon, a coordinate string, a WKT string, a list of coordinate points, or a list of lon/lat values. The GUI is now also aware that the operation expects a polygon, and an appropriate dialog can be displayed.

12.5.5. Monitor usage

Operations that are potentially long running should support a Cate monitor that can be used by the CLI and the GUI to track the operation’s progress, as well as to cancel the operation. It can sometimes be hard to determine whether a particular operation will be long running or not; in that case the rule of thumb is to err on the side of implementing a monitor.

For example:

from cate.core.op import op
from cate.util.monitor import Cancellation, Monitor

@op()
def my_op_with_a_monitor(a: str, monitor: Monitor = Monitor.NONE):
    # Set up the monitor with the total amount of work to be done
    with monitor.starting('Monitor Operation', total_work=len(a)):
        for i in a:

            # Do work

            # Update the monitor. A plain monitor.progress(work=1) is enough
            # if there is nothing to clean up. If there are resources to
            # clean up (e.g., open file handles), catch the cancellation
            # raised by progress() and re-raise it after cleaning up:
            try:
                monitor.progress(work=1)
            except Cancellation:
                # Clean up resources here
                raise

    return a

Note that special care should be taken to ensure the correct step size, such that the task actually ends when total_work is reached. Apart from progress monitoring it is crucial to make long running operations cancellable and to perform the appropriate clean-up actions when an operation is cancelled.

Operations that delegate the compute-intensive work to xarray often have no way to report progress in a meaningful way, nor to handle cancellation in a timely manner. In this case the xarray task can be observed:

from cate.core.op import op
from cate.util.monitor import Monitor
import xarray as xr

@op()
def my_op_with_a_monitor(da: xr.DataArray, monitor: Monitor = Monitor.NONE) -> xr.DataArray:
    # Observe the dask-backed xarray computation via the monitor
    with monitor.observing('Monitor Operation'):
        return da.mean(dim='time')

See also Task Monitoring API.

12.5.6. Adherence to relevant conventions

Cate generally assumes that most, if not all, of the climate data towards which the toolbox is geared adheres to the CF Conventions and the Attribute Convention for Data Discovery, which complement each other.

On the one hand, an operation may assume that the data it receives is CF compliant. For example, netCDF variables that are ancillary to other variables, such as uncertainty information, should be denoted as such. See CF Ancillary Data.

On the other hand, this implies that special care must be taken to ensure that an operation does not break compatibility with said conventions and that it heeds the advice given in these conventions when creating new variables or datasets.

For example, an operation that adds a mask describing another data variable should follow CF Ancillary Data and CF Flags. Such an operation can be examined in cate/ops/outliers.py.
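
As an illustration only, a much simplified sketch of such an operation might look like the following; the variable names, flag values, and flag meanings are made up, and cate/ops/outliers.py should be consulted for the actual implementation:

import numpy as np
import xarray as xr


def add_mask(ds: xr.Dataset, var: str) -> xr.Dataset:
    rs = ds.copy()
    mask_name = var + '_mask'

    # Flag variable describing the parent variable (CF Flags)
    mask = xr.zeros_like(rs[var]).astype('int8')
    mask.attrs['flag_values'] = np.array([0, 1], dtype='int8')
    mask.attrs['flag_meanings'] = 'nominal outlier'
    rs[mask_name] = mask

    # Link the mask to its parent variable (CF Ancillary Data)
    rs[var].attrs['ancillary_variables'] = mask_name
    return rs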

Also, when an operation modifies spatiotemporal extents and/or resolution of the dataset, the corresponding global attributes from Attribute Convention for Data Discovery should be updated or added. There are dedicated functions in cate/ops/normalize.py for this purpose.

import xarray as xr

from cate.core.op import op
from cate.ops.normalize import adjust_spatial_attrs, adjust_temporal_attrs


@op()
def dummy_op(ds: xr.Dataset):
    rs = ds.copy()

    # Do some science

    # Adjust global attributes
    rs = adjust_spatial_attrs(rs)
    rs = adjust_temporal_attrs(rs)

    return rs

12.5.7. Operation outputs

Most operations work on xr.Datasets and return these as well. However, some operations may produce information that is best represented in tabular form. In these cases it is a good idea to return such data as a pd.DataFrame instead of an xr.Dataset. This way it can be represented better in the GUI, on the CLI, as well as in Jupyter notebooks.
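
As an illustration, a sketch of an operation returning a tabular summary could look like this (the computed statistics are merely an example):

import pandas as pd
import xarray as xr


def variable_statistics(ds: xr.Dataset) -> pd.DataFrame:
    # Summarize each (numeric) data variable as one row of a table
    rows = [{'variable': name,
             'mean': float(var.mean()),
             'std': float(var.std())}
            for name, var in ds.data_vars.items()]
    return pd.DataFrame(rows)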

Cate supports returning multiple named outputs as a Python dictionary.

...
@op_output('dataset', data_type=xr.Dataset, description='...')
@op_output('table', data_type=gpd.GeoDataFrame, description='...')
@op_output('scalar', data_type=float, description='...')
def my_op_that_has_named_outputs(...):
    ...
    return {'dataset': ds, 'table': df, 'scalar': x}

12.5.8. Using other operations

It is a good idea to reuse existing operations when developing more involved ones. Even for seemingly simple cases there might be corner situations that have already been solved in another operation. For example, one is encouraged to use the subset_spatial operation as opposed to directly selecting a dataset region using xarray’s sel method. The reason is that the given polygon might cross the antimeridian, a situation which is already handled by cate.ops.subset_spatial.

Some care must be taken when importing other operations to avoid circular imports. The correct way to import an existing operation is the following:

# Directly from subset.py
from cate.ops.subset import subset_spatial
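
For illustration, reusing the imported operation inside another operation might look like the following sketch; the region value is just an example of a PolygonLike coordinate string, and the actual signature of subset_spatial should be checked before relying on it:

from cate.ops.subset import subset_spatial


def my_regional_op(ds, region='-10, 40, 10, 60'):
    # Delegate spatial subsetting, including antimeridian handling,
    # to the existing operation instead of calling ds.sel() directly
    ds = subset_spatial(ds, region)
    # Do some science on the subset
    return ds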

12.5.9. Testing

All operations should be well tested. The unit tests should be fast and verify the functionality of the operation, not necessarily validate it scientifically. Each module in cate/ops/ should have a corresponding test module in test/ops/. A bare-bones test setup for any operation would be the following:

from unittest import TestCase

from cate.core.op import OP_REGISTRY
from cate.util.misc import object_to_qualified_name

from cate.ops import dummy_op


class TestDummyOp(TestCase):
    def test_registered(self):
        """
        Test that the operation is registered in the operation registry
        """
        op_reg = OP_REGISTRY.get_op(object_to_qualified_name(dummy_op))
        self.assertIsNotNone(op_reg)

    def test_nominal(self):
        """
        Test nominal execution
        """
        expected = 1
        result = dummy_op()
        self.assertEqual(expected, result)

    def test_error(self):
        """
        Test known error conditions
        """
        with self.assertRaises(ValueError):
            dummy_op(param='will error')

It is absolutely crucial to have at least a nominal test that runs the operation with expected inputs and asserts that the output is what was expected. The imported operation will automatically be invoked through the operation registry, so such a test also validates that the decorators have been used properly.

If an operation implements a monitor, it is a good idea to test if it has been implemented properly. For example:

from unittest import TestCase
from cate.util.monitor import ConsoleMonitor
from cate.ops import dummy_op


class TestDummyOp(TestCase):
    def test_monitor(self):
        m = ConsoleMonitor()
        result = dummy_op(monitor=m)
        self.assertEqual(m._percentage, 100)

It is also a good idea to test whether the dataset meta information is altered correctly, whether newly created data variables have correct attributes, and whether unexpected inputs are handled correctly.
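
A sketch of such a metadata test could look like the following; the test dataset, the expected ACDD attributes, and the operation itself are hypothetical:

from unittest import TestCase

import numpy as np
import xarray as xr

from cate.ops import dummy_op


class TestDummyOpMetadata(TestCase):
    def test_attributes(self):
        # Small in-memory dataset standing in for real input data
        ds = xr.Dataset({'sst': (['lat', 'lon'], np.zeros((2, 2)))},
                        coords={'lat': [10.0, 20.0], 'lon': [30.0, 40.0]})
        result = dummy_op(ds)

        # ACDD spatial attributes the operation is expected to update or preserve
        self.assertIn('geospatial_lat_min', result.attrs)
        self.assertIn('geospatial_lon_min', result.attrs)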

12.5.10. Optimization

12.5.10.1. Profiling

If an operation seems to be too slow, it should first be profiled to explore opportunities for potential improvement. The line_profiler package can come in handy here. It can be installed via conda (conda install line_profiler) and then used in a notebook to time individual lines of a given operation as follows:

import cate.ops as ops
%load_ext line_profiler
%lprun -f ops.some_op result = ops.some_op()

A caveat here is that, while profiling, the operation being profiled should be undecorated. Otherwise line_profiler has trouble finding the source code to profile.

12.5.10.2. Leveraging xarray and dask

When developing operations it should be kept in mind that every operation can potentially work on out-of-memory datasets. Hence one should leverage the possibilities offered by xarray and dask as much as possible.

For example, an operation producing a statistical quantity of the time series at each lon/lat point of a raster could be naively implemented as follows:

import numpy as np
import xarray as xr
from scipy import tricky_stat  # 'tricky_stat' is a placeholder for some statistical function

def some_op(da: xr.DataArray):
    """
    Run tricky_stat on the time series at each lon/lat point of the given data array
    """
    res = np.zeros((len(da.lon), len(da.lat)))
    for i in range(len(da.lon)):
        for j in range(len(da.lat)):
            array = da.isel(lat=j, lon=i).values
            res[i, j] = tricky_stat(array)
    return res

However, this implementation incurs a heavy performance penalty: the xr.DataArray is likely distributed among many files, parts of which will be read on each da.isel(lat=j, lon=i).values invocation, resulting in a large overhead in memory and processing time due to I/O operations.

A better approach is to use arithmetic and xarray ufuncs directly:

import xarray as xr

MAGIC_CONSTANT = 2.0  # placeholder constant, for illustration only

def some_op(da: xr.DataArray):
    """
    Run tricky_stat on the given data array. Influenced by the tricky_stat
    scipy implementation.
    """
    da1 = xr.ufuncs.sqrt(da * MAGIC_CONSTANT)
    tricky_stat = da1.mean(dim='time')
    return tricky_stat

This second implementation has the potential to run several orders of magnitude faster due to the minimized number of I/O operations, as well as additional optimizations and parallelization occurring behind the scenes in xarray and dask code.

12.5.11. Documentation

Operation docstrings are used to provide help information in all channels where an operation may be used. The docstring is rendered on the command line when cate op info some_op is invoked, it is shown in the appropriate places in the GUI and when users invoke Python’s help(), and it is published as part of the Cate documentation. Hence, it is of utmost importance that the docstring explains well what a particular operation does and documents all input parameters. See also Docstrings.

For example:

import xarray as xr
import pandas as pd

def doc_op(ds: xr.Dataset, df: pd.DataFrame, magical_const: float):
    """
    This operation carries out a well documented calculation.

    References
    ----------
    `Source <http://www.science.org/documented/calculation>`_

    :param ds: The input dataset used for calculation
    :param df: A dataframe containing auxiliary information
    :param magical_const: Magical constant to use for calculation
    :return: Input dataset with documented calculation applied to it
    """
    # Do some science
    return ds

To make sure the generated Cate documentation is updated, don’t forget to include the operation in cate/doc/source/api_reference.rst.

If an existing operation name is altered, don’t forget to run a search through Cate documentation source to find the possible places where a documentation update is needed.

12.5.12. Operation development checklist

  • Is the function registered with the operation registry properly?

  • Is the operation set up for import in cate/ops/__init__.py?

  • Are operation inputs decorated accordingly? E.g., are value sets provided and links between variables and datasets established?

  • If one or multiple outputs are xr.Dataset, is history information added when appropriate?

  • Does the operation use cate typing system so that it can be integrated with the GUI nicely? Both in the function signature and decorators?

  • Are inputs validated?

  • If the operation can take a while, does it use a monitor and can it be cancelled?

  • Is the operation a ‘well behaved netCDF filter’?

    • If it adds new variables to the netCDF file, do these follow CF conventions?

    • If the operation has the potential of changing spatiotemporal extents and/or resolution, does it update the global attributes accordingly?

    • Does the operation avoid dropping valuable global or variable attributes?

  • Does the operation produce outputs of appropriate types?

  • Are other operations imported correctly if used?

  • Is the operation well tested?

    • Is nominal functionality tested?

    • Is the monitor tested?

    • Are the side effects on attributes and other meta information tested?

    • Are error conditions tested?

    • Do the tests run reasonably fast?

  • Is the operation properly documented?

  • Is the operation properly tagged?

When a newly created operation satisfies this checklist, it can be said with some certainty that the operation behaves well with respect to the Cate framework, as well as the wider world.