I just wanted to know if it is possible to use "Number of copies to start" on the "Merge rows (diff)" step without getting wrong results.
Have you actually observed wrong results when running more than one copy of a merge rows step?
Please also take a look at ".09 Transformation Steps - Pentaho Data Integration - Pentaho Wiki" to see if it helps.
Thank you for your answer.
I haven't tried it, but there is a list of steps that cannot be used with "number of copies", for instance the "Sort rows" step.
I wanted to know if this is documented, because I cannot find anything about it.
As a suggestion, PDI should not let the user set the number of copies in these cases. It would save a lot of time for people asking in the forums and for others answering them.
The merge rows diff step assumes that the data from both sides is ordered, and this assumption is what makes the processing really simple.
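To illustrate why the ordering matters, here is a minimal Python sketch of a diff over two key-sorted streams. It is not PDI's actual implementation; the function name `merge_rows_diff` and the `(key, value)` row layout are assumptions made for the example. The flag names mirror the ones the step emits.

```python
def merge_rows_diff(reference, compare):
    """Diff two iterables of (key, value) rows, both assumed sorted by key."""
    ref_iter, cmp_iter = iter(reference), iter(compare)
    ref, cmp = next(ref_iter, None), next(cmp_iter, None)
    result = []
    while ref is not None or cmp is not None:
        if cmp is None or (ref is not None and ref[0] < cmp[0]):
            # Key exists only on the reference side.
            result.append((ref[0], "deleted"))
            ref = next(ref_iter, None)
        elif ref is None or cmp[0] < ref[0]:
            # Key exists only on the compare side.
            result.append((cmp[0], "new"))
            cmp = next(cmp_iter, None)
        else:
            # Same key on both sides: compare the values.
            flag = "identical" if ref[1] == cmp[1] else "changed"
            result.append((ref[0], flag))
            ref, cmp = next(ref_iter, None), next(cmp_iter, None)
    return result


reference = [(1, "a"), (2, "b"), (3, "c")]
compare = [(1, "a"), (2, "x"), (4, "d")]
sorted_result = merge_rows_diff(reference, compare)

# Same compare rows, but out of order (as could happen after parallel copies).
unsorted_result = merge_rows_diff(reference, [(2, "x"), (1, "a"), (4, "d")])
```

With sorted input each key gets exactly one flag. With the unsorted compare stream, key 1 is reported once as "deleted" and once as "new", because the single pass never looks back.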
Do you really need more than one copy of the merge rows diff step? It is probably not the bottleneck, so you likely don't need multiple copies.
If you have an ordered data stream but need multiple copies on the steps before the data reaches the merge rows diff step, you can insert a sorted merge step just before the merge rows diff step and the results will be correct.
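As a rough sketch of what a sorted merge does here: each copy of an upstream step keeps its own rows in order, but the combined output is no longer sorted. The helper names below are hypothetical, and round-robin distribution to the copies is an assumption of this example.

```python
import heapq


def distribute_round_robin(rows, copies):
    """Hand rows to `copies` parallel step copies in turn (assumed behaviour)."""
    buckets = [[] for _ in range(copies)]
    for i, row in enumerate(rows):
        buckets[i % copies].append(row)
    return buckets


def sorted_merge(streams):
    """Merge several individually sorted streams into one sorted stream."""
    return list(heapq.merge(*streams))


buckets = distribute_round_robin([1, 2, 3, 4, 5, 6], copies=2)
naive = buckets[0] + buckets[1]   # simple concatenation loses the global order
merged = sorted_merge(buckets)    # order restored before the diff step
```

The merge only works because each individual stream is still sorted; it does not re-sort data that is out of order within a single copy.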
You are correct that number of copies will not work.
However, you can achieve what you want (in theory) using partitioning. There are some restrictions with partitioning, though: I know the merge join step cannot be partitioned. But it just so happens that you CAN achieve it like so:
Scaling merge joins in Pentaho Data Integration | Codeks Blog
Now, be careful: you're starting to introduce a lot of overhead at this point. In my case I was processing 1M rows per second, so I had to do this, but make sure that your solution a) actually works and b) is actually faster than just single-threading it!
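The general idea behind partitioning can be sketched in a few lines of Python. This is not the recipe from the linked blog post, only an illustration under assumptions: integer keys, a remainder-of-division rule, and a hypothetical helper `partition_by_key`.

```python
def partition_by_key(rows, partitions):
    """Split (key, value) rows so that equal keys always share a partition."""
    buckets = [[] for _ in range(partitions)]
    for key, value in rows:
        # Assumes integer keys; the same key always maps to the same bucket.
        buckets[key % partitions].append((key, value))
    return buckets


reference = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
compare = [(1, "a"), (2, "x"), (4, "d"), (5, "e")]

ref_parts = partition_by_key(reference, 2)
cmp_parts = partition_by_key(compare, 2)
```

Because both inputs use the same rule and each partition keeps its input order, every partition pair can be compared independently. Whether that is faster than a single thread still has to be measured.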
Thank you, everyone.