Tutorial: Write an Interpolator
===============================

In case you want to extend the interpolation facilities its
straightforward to write an additional **Interpolator**.

First, it has to be decided what interpolator should be
implemented. In this tutorial a nearest neighbor interpolator
is implemented.


Example:  Nearest Neighbor Interpolator
---------------------------------------

A Nearest Neighbor interpolator just looks for the dataset, whose
distance in the coordinate space is smallest. Then, the properties
of this nearest dataset are returned for the requested coordinates.
Our implementation in this tutorial will use scipy's nearest neighbor
search algorithm for a fast search and overall performance

Step 0: Colt Factory
~~~~~~~~~~~~~~~~~~~~~~

*Interpolator* are written as Plugins of the *Interpolator* 
Colt-Factory, i.e. they are inherited from the Factory. 
The Factory takes already care of many things, which are common
to any interpolator, e.g. coordinate transformation and 
energy-only calculations etc. During the example we will 
encounter many of them and explain them whenever they appear.


.. code-block:: python
   :linenos:

   from scipy.spatial import cKDTree 
   from pysurf.spp import Interpolator

Step 1: Basic Features
~~~~~~~~~~~~~~~~~~~~~~

Every **Interpolator** has to implement 5 functions:

    def get(self, request):
        fill the request with the demanded data and additionally
        a flag has to be returned whether the result is trustworthy
        If *fit_only* of the SPP is False, an electronic structure
        calculation is started, else the result of the interpolator 
        is taken. If *fit_only* is True, the result of the interpolator
        will be used anyway.

    def get_interpolators(self, db, properties):
        for each property that is requested, a separate interpolator is 
        set up. The function has to return a dictionary that contains the 
        property name and the interpolator for the property

    def save(self, filename):
        The save method is primarily for machine learning algorithms. The
        weights have to put in a file.

    def get_interpolators_from_file(self, filename, properties):
        This function reads the weightsfile and sets up the interpolators
        for the properties from the weightsfile. Specifically in 
        machine learning algorithms, its not necessary to train the 
        algorithm again.

    def _train(self):
        This function uses the input data to train the interpolator. It is
        primarily used in machine learning interpolators.

    def loadweights(self, filename):
        The weights are loaded from a file.


.. code-block:: python
   :linenos:

    class NearestNeighborInterpolator(Interpolator):
        """Nearest Neighbor Interpolator"""

        @classmethod
        def from_config(cls, config, db, properties, logger, energy_only, weightsfile, crdmode,
                        fit_only):
            """ This classmethod overwrites the from_config method of the factory and is needed
                if interpolator specific user input is implemented
            """

        def __init__(self, db, properties, logger, energy_only=False, weightsfile=None, 
                     crdmode='cartesian', fit_only=False)
            """ This init overwrites the init of the interpolator factory. Therefor, it is
                advisable to call the init of the interpolator factory to make sure that all
                functionality is working and only additional features are added here.
            """
            super().__init__(db, properties, logger, energy_only, weightsfile, crdmode=crdmode,
                             fit_only=fit_only)


        def get(self, request):
            """fill request

               Return request and if data is trustworthy or not
            """

        def get_interpolators(self, db, properties):
            """ """

        def save(self, filename):
            """Save weights"""

        def get_interpolators_from_file(self, filename, properties):
            """setup interpolators from file"""

        def _train(self):
            """train the interpolators using the existing data"""

        def loadweights(self, filename):
            """load weights from file"""


Step 2: User Input
~~~~~~~~~~~~~~~~~~

PySurf is build around the Colt_ framework, developed along the lines
of this project. To specify specific input needed for your class you simply
use the *_questions* string:
In our example we will need 4 user inputs:
   - trust_radius_general, float
      the radius to decide whether an interpolation is trustworthy
   - trust_radius_ci, float
      the radius in the region of small energy gaps to decide
      whether an interpolation is trustworthy
   - energy_threshold, float
      the threshold to distinguish between regions with small and
      large energy gap
   - norm, str 
      the norm that is used to measure the distance between points


.. code-block:: python
   :linenos:

    class NearestNeighborInterpolator(Interpolator):
        """Basic Rbf interpolator"""
    
        _questions = """
            trust_radius_general = 0.75 :: float
            trust_radius_ci = 0.25 :: float
            energy_threshold = 0.02 :: float
            norm = euclidean :: str :: [euclidean]
        """
        @classmethod
        def from_config(cls, config, db, properties, logger, energy_only, weightsfile, crdmode, fit_only):
            trust_radius_general = config['trust_radius_general']
            trust_radius_CI = config['trust_radius_ci']
            energy_threshold = config['energy_threshold']
            #
            # convert input for norm in corresponding input (p-Norm) for the cKDTree
            # for more information go to the cKDTree.query documentation
            if config['norm'] == 'manhattan':
                norm = 1
            elif config['norm'] == 'max':
                norm = 'infinity'
            else:
                norm = 2
            #
            return cls(db, properties, logger, energy_only=energy_only, weightsfile=weightsfile,
                       crdmode=crdmode, trust_radius_general=trust_radius_general,
                       trust_radius_CI=trust_radius_CI, energy_threshold=energy_threshold,
                       fit_only=fit_only, norm=norm)
    
        def __init__(self, db, properties, logger, energy_only=False, weightsfile=None,
                 crdmode='cartesian', fit_only=False, trust_radius_general=0.75,
                 trust_radius_CI=0.25, energy_threshold=0.02, norm='euclidean'):
            self.trust_radius_general = trust_radius_general
            self.trust_radius_CI = trust_radius_CI
            self.energy_threshold = energy_threshold
            self.tree = None
            self.norm = norm
            # Call the init method of the Interpolator Factory
            super().__init__(db, properties, logger, energy_only, weightsfile,
                             crdmode=crdmode, fit_only=fit_only)


Parameters

    db:
        databse containing the datasets, on which the interpolation is
        based on

    properties: list
        properties (e.g. ['energy', 'gradient']) that should be fitted 

    logger:
         logger to log any incident

    energy_only: bool, optional
        if energy_only is True, gradients are derived from the energy surface

    weightsfile: str, optional
        filepath, where to save the weights. Not used in the case of the 
        NearestNeighborInterpolator, but needed for the overall framework.

    crdmode: str, optional
        Variable to determine whether a coordinate transformation is applied before 
        fitting.

    fit_only: bool, optional
        Flag to determine, whether no new QM calculations are performed

    trust_radius_general: float, optional
        radius to determine whether fitted result is trustworthy in regions of a 
        large energy gap

    trust_radius_CI: float, optional
        radius to determine whether fitted result is trustworthy in regions of a 
        small energy gap

    energy_threshold: float, optional
        Threshold to distinguish regions of small and large energy gaps.

    norm: str, optional
        Determining the norm for the nearest neighbor search. 'manhattan' corresponds
        to the 1-norm, 'euclidean' is the 2-norm, and 'max' is the infinity norm. 


Step 3: Implement *get_interpolators* function
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The next step is to implement the *get_interpolators* method and the helper class
for the NearestNeighborInterpolator of each property *NNInterpolator*. For each
property, an Interpolator is set up, which is an instance of the *NNInterpolator*
class. Each interpolator has to be callable and to return the desired property.

.. code-block:: python
   :linenos:

   class NearestNeighborInterpolator(Interpolator):
      """Nearest Neighbor Interpolator"""

      ...
        def get_interpolators(self, db, properties):
            """ """
            self.tree = cKDTree(self.crds)
            return {prop_name: NNInterpolator(db, self.tree, prop_name)
                    for prop_name in properties}, len(db)
    

    class NNInterpolator():
        def __init__(self, db, ckdtree, prop):
            self.db = db
            self.tree = ckdtree
            self.prop = prop

        def __call__(self, crd, request=None, idx=None):
            if idx is None:
                dist, idx = self.tree.query(crd)
            return self.db.get(self.prop, idx)

The *get_interpolators* method returns a dictionary with the property names as keys and the
interpolator for that specific property as value. For each property a separate interpolator
has to be set up so that the interpolator factory can handle the interpolators for the properties
independently, which allows e.g. the energy_only calculations. Implementing the interpolators in
this way, they naturally are included in the code package and the full functionality is available.


To avoid that the cKDTree is set up several times, the *NNInterpolator* takes the
tree as a Parameter. Moreover, if *NNInterpolator* is called with an index,
no nearest neighbor search is performed, but the property of dataset with the index
is returned. This is important in the case when several properties are demanded so that the
nearest neighbor search is done only once, cf. Step 4 and the *get* function.


Step 4: Implement *get* function
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The *get* function is called with the *request* as parameter. It 
has to fill in the desired results from the fit into the *request* instance and state
whether the fit is trustworthy.

.. code-block:: python
   :linenos:

   class NearestNeighborInterpolator(Interpolator):
      """Nearest Neighbor Interpolator"""

      ...

    def get(self, request):
        #
        # Convert coordinate into desired format
        if self.crdmode == 'internal':
            crd = internal(request.crd)
        else:
            crd = request.crd
        #
        # Make nearest neighbor search once and pass it to all interpolators
        dist, idx = self.tree.query(crd, p=self.norm)
        for prop in request:
            request.set(prop, self.interpolators[prop](crd, request, idx))
        #
        # Determine whether result is trustworthy, using the trust radii
        diffmin = np.min(np.diff(request['energy']))
        is_trustworthy = False
        if diffmin < self.energy_threshold:
            if dist < self.trust_radius_CI: is_trustworthy = True
        else:
            if dist < self.trust_radius_general: is_trustworthy = True
        #
        return request, is_trustworthy


The *get* function first has to make sure that the interpolators get the right coordinates.
Subsequently, the interpolators for all the desired properties are called and the
results are put into the request instance. 
Finally, it is checked, whether the requested point is within the trusted region. The trusted
region is devided into two parts, depending whether the smallest energy gap between two potential
energy surfaces is small or large. The threshold is given as the *energy_threshold* as user input
as well as the radii *trust_radius_ci* and *trust_radius_general*.


Step 5: Implement the save, load and _train methods
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The NearestNeighborInterpolator does not use the functionality that the interpolators and their parameters
are stored to a file and read afterwards, to avoid long training sessions. The training of the Nearest Neighbor
search is just to update the cKDTree, which doesn't take very long. Therefor, these functions are not really used,
but implemented in a way to make sure that the full functionality is available.


.. code-block:: python
   :linenos:

   class NearestNeighborInterpolator(Interpolator):
      """Nearest Neighbor Interpolator"""

      ...

   def loadweights(self, filename):
        """ Weights are loaded for the interpolators from a file. As the
            NearestNeighborInterpolator is not using the save option, also
            here, interpolators are just set up from the database

            Parameters:
            -----------
                filename, str:
                    filepath of the file containing the weights. Not used here!
        """
        #
        self.logger.warning("NearestNeighborInterpolator cannot load weights, interpolators are " +
                            "set up from DB")
        # As saving is not used, interpolators are set up from the database
        self.get_interpolators(self.db, self.properties)

    def save(self, filename):
        """ Method to save the interpolators to a file. Not used here!
            
            Parameters:
            -----------
                filename:
                    filepath where to save the information. Not used here!
        """
        #
        self.logger.warning("NearestNeighborInterpolator cannot be saved to a file")

    def _train(self):
        """ Method to train the interpolators. In the case of the NearestNeighborInterpolator
            the cKDTree has to be updated.
        """
        #update cKDTree
        self.tree = cKDTree(self.crds)


.. _Colt: https://github.com/mfsjmenger/colt