How to develop your own Python plugin for data enrichment

Example

Just * add a new python file to the enhancer directory * enable this plugin in your connector and * implement the object myOwnEnhancerName the function process with the parameters parameters and data which * reads some parameters from variable parameters and * write to the variable data to index your results:

Example:


import etl

class myOwnEnhancerName(object):

   def process ( parameters={}, data={} ):

      # do some analyses, i.e. with text
      uri\_length = len ( parameters['id'] )

      #write some analysis results to the index into a facet
      etl.append(data, facet='uri\_length\_i', values=uri\_length)

      #return results
      return parameters, data

Plugin directory

Save your enhancer *myOwnEnhancerName*.py into the subdir etl of your in python library directory (i.e. /usr/lib/python2.7/).

Config

Add your enhancer to the enhancer chain in the config of your connector(s), which shall enhance their data with it:

Enable your data enrichment plugin

Add the plugin name to the config of the connector, for example in /etc/opensemanticsearch/connector-files:

config['plugins'].append('*myOwnEnhancerName*')

Custom config options for your plugin

Optionally you can add your own config options, so you have more flexibility, because not everything has to be hard coded in your plugin

Example

config['myOwnEnhancerName'] = 'Own Enhancer'

Your plugin can read this config options from the variable parameters

Variables

Read / set parameters from/to the variable parameters

The variable parameters contains the config and some analysis results from enrichment plugins running before so you can use now as parameter for your plugin.

For example you can run your plugin only for certain document types or mime types or run a language detection plugin before you use its result as parameter for a OCR plugin.

Example data:

print parameters

{ 'id: 'file:///documents/document.pdf', 'content_type': 'application/pdf' }

Write results and values for indexing to the variable data

The variable data is for adding facets and values to the index.

To save and index your plugins results, just add them to the variable data.

Example: data['tags'] = 'myTopic' data['location'] = 'Berlin'

So the variable data will look like after running some enhancer plugins:

print data

{ 'content_type': 'application/pdf', 'filesize': 12345, 'tags': {'Open Source', 'Free software'}, 'email_ss': { 'info@opensemanticsearch.org', 'support@opensemanticsearch.org' } }

Functions

Process

Just implement the function process with the parameters parameters and data.

If the plugin is enabled, this function will be called by the connector before indexing the document.


...
   def process ( parameters={}, data={} ):
      # do some analyses, i.e. with text
      textsize = parameters[text'].size

      #write some analysis results to the index into a facet
      opensemanticsearch\_connector.append(data, facet='textsize', values=textsize)

      #return
      return parameters, data

You can use all Python functions and libraries for very easy or very complex analysis.

Append data

Since if there are values for the same facet before because of another plugin running before and with the first example how to write resulrs to the variable data, you would overwrite existing data and for adding data easier, you can use the function etl.append to add additional values from your analysis results to the index:

etl.append(data=data, facet = 'email', values = 'info@opensemanticsearch.org')

The parameter values can be a value like a integer or a string or multiple values as a python list.

Abort processing

To abort further processing and final indexing you can set parameters['break'] = True

So you can develop additional filters like for example the existing URL blacklist filter.

Another example plugin using regular expressions

Here an example, how to extract the email-adresses and store it to the facet email with just a few lines of python code to enable exploratory search, interactive filters and aggregated overviews for email-addresses, too. Its only an programming example, since extracting regular expressions is a standard plugin and to extract email-adresses its default config.


# Data enrichment plugin for extracting email-adresses
# Extracting email adresses and write them to facet email\_ss

# import the connector, so we can add our analysis to the indexed document
import opensemanticsearch\_connector

# import python module for regular expressions
import re

class enhance\_email(object):
 def process ( parameters={}, data={} ):

   # regular expression matching email-adresses
   regex = r'[\w\.-]+@[\w\.-]+'

   # facet / column where to store/index it
   facet = "email"

   # find all emailadresses with a regular expression
   matches = re.findall(regex, parameters['text'])

   if matches:
      # add the list matches to the facet
      opensemanticsearch\_connector.append(data, facet, matches)

   return parameters, data