How to index files like Word documents, PDF files and whole document folders into Apache Solr or Elasticsearch?
On Linux, you can crawl anything that can be mounted as a file system into an Apache Solr or Elasticsearch index or into a triplestore.
Index different file system types to Solr or Elasticsearch
This can be a hard disk or partition formatted with FAT, NTFS, ext3 or ext4; a file server or file share mounted via SMB, sshfs or SFTP; a private file sharing service such as Seafile or ownCloud on your own servers; or Dropbox, Amazon or other storage services in the cloud.
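Because every one of these sources ends up as an ordinary directory tree once mounted, a crawler only needs to walk that tree. The following is a minimal sketch of that idea, not the connector's actual implementation; the function name `iter_files` is a hypothetical helper introduced for illustration.

```python
import os
from typing import Iterator


def iter_files(root: str) -> Iterator[str]:
    """Yield the absolute path of every regular file below root.

    Hypothetical helper: it only enumerates files and leaves the
    actual indexing to whatever indexer is called with each path.
    """
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # sort in place so the traversal order is deterministic
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            if os.path.isfile(path):  # skip broken symlinks, sockets, etc.
                yield os.path.abspath(path)
```

Each yielded path could then be handed to the indexer of your choice, for example `for path in iter_files("/mnt/fileserver"): ...`, where `/mnt/fileserver` is just an example mount point.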
Data enrichment by different data analysis tools
This connector integrates data enrichment and data analysis plugins such as automatic text recognition (OCR) for images and photos (e.g. files like PNG, JPG, GIF ...) or inside PDFs (e.g. scanned documents) using Tesseract OCR.
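To illustrate what such an OCR step involves, here is a minimal sketch that decides by file extension whether a file is an image and builds a Tesseract command line for it. This is an assumption about how one could do it, not the connector's own plugin code: the names `OCR_EXTENSIONS`, `needs_ocr` and `tesseract_command` are hypothetical, and the default language `eng` is only an example. Scanned PDFs are not covered, since their pages would first have to be extracted as images.

```python
import os
from typing import List

# Assumption: the image types named above; extend the set to match your data.
OCR_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif"}


def needs_ocr(path: str) -> bool:
    """Return True if the file extension marks an image that may contain text."""
    ext = os.path.splitext(path)[1].lower()
    return ext in OCR_EXTENSIONS


def tesseract_command(path: str, lang: str = "eng") -> List[str]:
    """Build the argument list for the Tesseract CLI.

    Using "stdout" as the output base makes Tesseract print the
    recognized text to standard output instead of writing a file.
    """
    return ["tesseract", path, "stdout", "-l", lang]
```

With Tesseract installed, the recognized text could be captured with `subprocess.run(tesseract_command(path), capture_output=True, text=True).stdout`.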
Index a file or directory:
Using the web admin interface
- Open the page "Files"
- Enter the filename in the form
- Press the "crawl" button
Using the command line interface (CLI):
Using the REST-API:
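As a rough sketch of how such a call could be scripted, the following builds a request URL with the file path passed as a properly escaped query parameter. The endpoint `http://localhost/api/index-file` and the parameter name `uri` are assumptions made for illustration only, so check your installation for the actual address and parameter.

```python
from urllib.parse import urlencode


def build_index_url(file_path: str,
                    base_url: str = "http://localhost/api/index-file") -> str:
    """Return a URL that asks the indexing API to index file_path.

    Hypothetical: both the default base_url and the "uri" parameter
    name are placeholders, not a documented API.
    """
    # urlencode escapes slashes, spaces and other special characters
    return base_url + "?" + urlencode({"uri": file_path})
```

The resulting URL could then be requested with `urllib.request.urlopen(build_index_url("/home/user/readme.txt"))` or with any other HTTP client.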
Config file for indexing files: