Search engine components and architecture

Open source search engine architecture (components and modules) and processing (data integration, data analysis and data enrichment)

Architecture overview

Flowchart

Components and Modules

User Interface: Client and user interface
Search query forms: Search query form for full text search
Explorer and navigator: Full text search and exploratory navigation of the index or the search results with interactive filters (facets)
- Viewers: Parts of the UI that show different views (e.g. analytics like word clouds or trend charts) and previews for special formats (e.g. photos, documents, email ...)
- Code: /solr-php-ui/templates/
Annotators: Web Apps for tagging documents or CMS with forms and fields to manage meta data like tags or annotations
Search Apps: Applications and user interfaces for search like search with lists tool or named entities manager
Index and search server (Solr or Elasticsearch): Search server managing the index (indexer) and running search queries (query handler)
Datamodel/Schema: src/solr.deb/var/solr/data/opensemanticsearch/conf/managed-schema
Storage: /var/solr/data
Log: /var/solr/logs/
Open Semantic ETL: Framework for data integration, data analysis, data enrichment and ETL (Extract, transform, load) pipelines or chains
Connectors, importers, ingestors or crawlers: Import data from a data source (e.g. file system, file directory, file share, website or newsfeed)
Parsers: Apache Tika to extract text and metadata from different file formats and document formats
Entity extraction and entity linking: Open Semantic Entity Search API
Data enrichment plugins and enhancers: Enhance content with additional data such as metadata (e.g. tagging or annotations) or analytics (e.g. OCR)
ETL Exporter or Loader for Solr or Elasticsearch: Indexes the data to the search index
Trigger: Your CMS or your file system (file system monitoring) notifies the web service (API) when there is new data or when content has changed, so you don't have to spend resources on frequent recrawls to find new or changed content quickly
Web services (REST-API): Available via the standard network protocol HTTP, waiting until you (e.g. using the web admin interface) or another service (e.g. using the REST-API) requests an action such as crawling a directory or a web page, and then starting that action
Queue manager (Celery on RabbitMQ): Manages the task queue and starts text extraction, analysis, data enrichment and indexing jobs with a balanced number of parallel workers
Scheduler: Manages the start of scheduled indexing jobs. This can be a crontab for Cron that starts the command line tools. Config: /etc/cron.d/open-semantic-search
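
The interplay of trigger, web service and queue manager can be illustrated with a small sketch. It is not the project's Celery code: it uses only the Python standard library, and the names index_document and run_queue as well as the example paths are hypothetical. It only shows the idea of collecting change notifications, dropping duplicates and processing them with a bounded number of parallel workers.

```python
from concurrent.futures import ThreadPoolExecutor


def index_document(uri: str) -> str:
    """Hypothetical stand-in for extracting, enriching and indexing one document."""
    return f"indexed {uri}"


def run_queue(uris, max_workers=2):
    """Process notified URIs with at most max_workers parallel workers."""
    # dict.fromkeys drops repeated notifications while keeping the first-seen order
    pending = list(dict.fromkeys(uris))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order, whatever the completion order is
        return list(pool.map(index_document, pending))


if __name__ == "__main__":
    # Example notifications (assumed paths); the same file is reported twice
    notifications = ["/data/report.pdf", "/data/mail.eml", "/data/report.pdf"]
    for line in run_queue(notifications):
        print(line)
```

In the real system the queue lives in RabbitMQ and the workers are Celery processes, so tasks survive restarts and can be spread over several machines. The thread pool above only mirrors the scheduling idea.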

Document processing, extract, transform, load (ETL) and enhancing by data enrichment and data analysis

How (new) data is handled by these components during ETL (extract, transform, load), document processing, data analysis and data enrichment:

A user starts a command manually, or a Cron daemon starts it automatically at scheduled intervals
The command line tool or the web API that receives this command starts an ETL (extract, transform, load), data analysis and data enrichment chain to import, analyze and index data
An input plugin or connector (e.g. the connector for the file system or the connector for a website) reads from its data source
The connector, an Apache Tika parser, or a file-format-specific data converter or extractor extracts data from the given document or file format
The ETL framework calls all configured enhancer plugins for data enrichment to add further analysis of the data or annotations to this data from a CMS
The output storage plugin or indexer indexes the text and metadata to the Solr index or to the Elasticsearch index, so all other tools can search this data
The user searches with a user interface such as the search user interface, the search apps or other tools built on the search API of this index
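
The chain described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the Open Semantic ETL API: all function names, the field names (id, content, word_count, tags) and the in-memory list that stands in for the search index are hypothetical.

```python
from typing import Any, Callable, Dict, Iterable, List

Document = Dict[str, Any]


def file_connector(sources: Dict[str, bytes]) -> Iterable[Document]:
    """Connector: read raw data from a data source (here an in-memory dict)."""
    for uri, raw in sources.items():
        yield {"id": uri, "raw": raw}


def plain_text_parser(doc: Document) -> Document:
    """Parser: turn raw bytes into text (Apache Tika does this for real formats)."""
    doc["content"] = doc.pop("raw").decode("utf-8", errors="replace")
    return doc


def enhance_word_count(doc: Document) -> Document:
    """Enhancer: add a simple analytic value."""
    doc["word_count"] = len(doc["content"].split())
    return doc


def enhance_tags(doc: Document) -> Document:
    """Enhancer: tag documents that mention known keywords (assumed list)."""
    text = doc["content"].lower()
    doc["tags"] = [word for word in ("invoice", "contract") if word in text]
    return doc


def run_chain(
    sources: Dict[str, bytes],
    enhancers: List[Callable[[Document], Document]],
    index: List[Document],
) -> List[Document]:
    """Run connector, parser, all enhancers and finally the exporter."""
    for doc in file_connector(sources):
        doc = plain_text_parser(doc)
        for enhance in enhancers:
            doc = enhance(doc)
        index.append(doc)  # exporter: the list stands in for the search index
    return index


if __name__ == "__main__":
    result = run_chain(
        {"/data/a.txt": b"Invoice for May"},
        [enhance_word_count, enhance_tags],
        [],
    )
    print(result)
```

Because every enhancer takes a document and returns a document, plugins can be added, removed or reordered by configuration without touching the connector or the exporter.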

Services and Microservices

Linux services:

tika - Text extraction and OCR

tika-fake-ocr - Text extraction without OCR

solr - Search index

spacy-services - spaCy NLP

opensemanticetl - ETL workers

rabbitmq-server - Task queue

flower - Task queue monitoring user interface

apache2 - Search UI, search apps (e.g. thesaurus app or config UI) and Entity Search API
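
A quick way to see which of these services are running is sketched below. It assumes that the services are managed by systemd and that the unit names match the names listed above, which may differ on your installation. The helper names systemctl_is_active and service_report are hypothetical.

```python
import subprocess

# Service names as listed above; treating them as systemd unit names is an assumption
SERVICES = (
    "tika",
    "tika-fake-ocr",
    "solr",
    "spacy-services",
    "opensemanticetl",
    "rabbitmq-server",
    "flower",
    "apache2",
)


def systemctl_is_active(name: str) -> str:
    """Return the state reported by `systemctl is-active`, or 'unknown'."""
    try:
        result = subprocess.run(
            ["systemctl", "is-active", name],
            capture_output=True,
            text=True,
            check=False,  # a non-zero exit code only means "not active"
        )
    except FileNotFoundError:
        # systemctl is not installed on this machine
        return "unknown"
    return result.stdout.strip() or "unknown"


def service_report(services=SERVICES, probe=systemctl_is_active):
    """Map each service name to its state using the given probe function."""
    return {name: probe(name) for name in services}


if __name__ == "__main__":
    for name, state in service_report().items():
        print(f"{name}: {state}")
```

The probe is passed in as a parameter so that the report can be exercised without systemd, for example with a function that always returns a fixed state.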

User Interface and search applications

Solr-PHP-UI

User Interface (supports responsive design for mobiles and tablets) for search, faceted search, preview, different views and visualizations.

Based on the Solr client solr-php-client (pure vanilla PHP), standard user interface technologies (HTML5 and CSS with Zurb Foundation) and visualization libraries (D3.js), so you can install and run it on standard PHP webspace without effort and without special PHP modules, which are often not available.
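
Faceted search as offered by such a user interface maps onto Solr's standard query parameters (q, fq, facet and facet.field). The following sketch builds such a request URL in Python, not PHP, and is not taken from Solr-PHP-UI. The host and port (Solr's default 8983), the core name opensemanticsearch (derived from the schema path mentioned earlier) and the field names author_ss and content_type_ss are assumptions for illustration.

```python
from urllib.parse import urlencode

# Assumed endpoint: default Solr port and the core name from the schema path
DEFAULT_BASE = "http://localhost:8983/solr/opensemanticsearch/select"


def build_search_url(query, facet_fields, filters=None, base=DEFAULT_BASE, rows=10):
    """Build a Solr select URL for a full text query with facets and filters."""
    params = [
        ("q", query),         # full text query
        ("rows", str(rows)),  # number of results per page
        ("wt", "json"),       # response format
        ("facet", "true"),    # switch faceting on
    ]
    # One facet.field parameter per field that should be offered as a filter
    params += [("facet.field", field) for field in facet_fields]
    # Each selected facet value becomes a filter query (fq)
    for field, value in (filters or {}).items():
        params.append(("fq", f'{field}:"{value}"'))
    return base + "?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical field names; check the schema of your own index
    print(
        build_search_url(
            "open data",
            facet_fields=["author_ss", "content_type_ss"],
            filters={"author_ss": "Jane Doe"},
        )
    )
```

Selecting a facet in the user interface then amounts to adding one more fq parameter and sending the query again, which narrows the result list without changing the full text query itself.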
