Workflow and index settings troy notes

Document created by Troy Myers Employee on May 16, 2017

We have come across several questions and concerns about the settings for the Workflow/Index: what options are available and what they represent.
First, there are workflow settings and index settings. A workflow is a combination of three parts. The first is an input (Data Connector). The second is the workflow pipelines, an action area where data can be manipulated, standardized, appended, and so on.
The last part of the workflow is your output: it can be an Index (a search GUI), or the data can be enhanced and sent to a designated area (Namespace, file, etc.).


All settings I refer to were done on an 8 node system with 32 GB of RAM. The best way to improve performance is proper Pipeline development, but here we are just talking about the internal settings. This was tested multiple times to find the sweet spot for our system. Creating the initial index settings is similar to a migration, where small sets of data are used for benchmarking prior to the final indexing.


First go to Workflow → Workflow designer → the compass in Tasks → Edit settings

You will see two Memory settings: Driver heap limit and Executor heap limit. These are Spark settings.

The default is 1024 for both; through trial and error we found the ideal spot to be 4096 for both.
On our 64 GB system we found 8096 to be the ideal spot.

The Driver heap is the memory used while the objects are being imported into the Pipeline. The Executor heap limit is how much RAM is available for the pipeline to process the objects themselves. We tried different offsets between the two, but found that keeping them the same was ideal in our case. If you have very large files you may want to raise the executor limit to a larger number and not keep the two in sync. We have done this in some other tests with large PST files.

The number of Parallel jobs was set to 3 from the default of

We set our Reported extra server cores to 6. The default is 4.



The other settings we can look at are in System Configuration → Services → Manage Services in the blue box → Index configure


That is where you will find the Container memory. This is the amount of memory allocated to Docker. We set ours to 16000.0 (16 GB). Make sure you put the dot zero in the field.
This means we have now allocated 16 GB to the container and 8 GB to the Heap settings. On the 32 GB system we had raised it to higher numbers, but received multiple GC errors while it was running.


Below that are the Index Service options for Heap Size. We set ours to 15800, as it needs to be 200 MB lower than the Container memory. This is the Solr memory size.

We had raised both to higher numbers (keeping the 200 MB buffer) but received multiple GC errors while it was running. The defaults are 2000 for Container memory and 1800 for Heap Size.
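The 200 MB rule and the "dot zero" formatting can be sketched as two small helpers. The function names are hypothetical and only illustrate the arithmetic described above; they do not talk to the product in any way.

```python
# Hypothetical helpers illustrating the 200 MB buffer between the
# Container memory and the Index Service Heap Size described in this post.

HEAP_BUFFER_MB = 200


def index_heap_size_mb(container_memory_mb, buffer_mb=HEAP_BUFFER_MB):
    """Heap Size that sits buffer_mb below the Container memory."""
    if container_memory_mb <= buffer_mb:
        raise ValueError("container memory must be larger than the buffer")
    return int(container_memory_mb - buffer_mb)


def container_memory_field(container_memory_mb):
    """Format the Container memory with the trailing '.0' the field expects."""
    return f"{float(container_memory_mb):.1f}"


print(container_memory_field(16000), index_heap_size_mb(16000))
print(container_memory_field(2000), index_heap_size_mb(2000))
```

The first line corresponds to our settings and the second to the defaults.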



Another setting here is the Index protection level. We set this to 2, which means we had 2 copies of our index on the system. Part of your planning and testing needs to account for the physical size of the index. In our testing of ~6.5 million emails, our index size was 1.3 TB when complete.
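For planning, the figures above can be turned into a rough per-email index cost and scaled to another data set. This is only a sketch: it assumes decimal terabytes, assumes index size grows linearly with email count, and does not settle whether the 1.3 TB includes both protection copies. The function names are made up for illustration.

```python
# Rough index sizing from the test figures in this post.
# Assumptions: decimal TB (10**12 bytes) and linear growth with email count.

TB = 10**12


def bytes_per_email(index_bytes, email_count):
    """Average index bytes consumed per email."""
    return index_bytes / email_count


def estimate_index_bytes(email_count, per_email_bytes):
    """Projected index size for a different number of emails."""
    return email_count * per_email_bytes


per_email = bytes_per_email(1_300_000_000_000, 6_500_000)
projected = estimate_index_bytes(10_000_000, per_email)
print(f"{per_email / 1000:.0f} KB per email, "
      f"{projected / TB:.1f} TB for 10 million emails")
```

Your own mail will differ in size and attachments, so treat this as a starting point and benchmark with a small set first.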


When creating our initial index under Workflows → Index Collections → Create Index, we changed the Initial Shard count to 5. A shard is a slice of the index; we chose 5 since we have an 8 node system with 3 masters and 5 workers. Please be aware that the number of shards cannot be changed afterwards; to use a different count you must manually create a new index.


Once our testing is complete we will change the Initial Schema from the default (Schemaless) to Default or Basic to cut down on the number of fields to be indexed. This will improve performance and reduce bloat of the index.



With the above settings on an 8 node system with 32 GB, we were able to index ~6.5 million emails in 15 hrs and 12 minutes.  This works out to about 400k/hour or 118
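The throughput can be recomputed from the run time with a few lines. This is just the arithmetic on the numbers quoted above; the variable names are for illustration only.

```python
# Throughput of the test run: ~6.5 million emails in 15 hrs and 12 minutes.

emails = 6_500_000
elapsed_seconds = 15 * 3600 + 12 * 60

per_second = emails / elapsed_seconds
per_hour = per_second * 3600

print(f"{per_hour:,.0f} emails/hour, {per_second:.1f} emails/second")
```

The per-second result comes out just under 119, which suggests the 118 figure above is a per-second rate.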


The most common error found in testing was the GC error. This is garbage collection being run in Solr as it tries to process/remove the indexed data.
