How to optimize 'Join Rows (Cartesian Product)' step in Spoon?

Question asked by Himanshu Dixit on Apr 10, 2018
Hi Folks,


I am new to Kettle. I have a question regarding the 'Join Rows (Cartesian Product)' step.


I am using two BigQuery tables as input and cross joining them on three conditions based on date fields. The join conditions include operators such as '>=' and '<'. The first BigQuery table has around 5.5k records and the other has around 700k. Since it's a cross join, I am expecting the output to be somewhere around 3.8 billion records. Currently, the join happens on the BigQuery side; I read everything from that query and write it to a file, which takes close to 3 hours, sometimes 3.5. I want to optimize this. I am thinking about using two BigQuery inputs in Kettle and a 'Join Rows (Cartesian Product)' step to join them.
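To make the sizes concrete, here is a minimal Python sketch of the arithmetic and of a cross join filtered by range conditions. It is not Kettle or BigQuery code; the field names (`start`, `end`, `event_date`) and the toy rows are hypothetical, and the row counts are the approximate ones quoted above.

```python
from datetime import date, timedelta
from itertools import product

# Approximate row counts quoted in the post.
small_rows = 5_500
large_rows = 700_000

# Size of the unfiltered Cartesian product; the date conditions
# can only keep this many rows or fewer.
upper_bound = small_rows * large_rows


def range_join(left, right):
    """Cross join that keeps pairs where start <= event_date < end.

    Every (left, right) pair is examined, which is what makes a
    Cartesian-product join expensive regardless of how many pairs survive.
    """
    return [
        (l, r)
        for l, r in product(left, right)
        if l["start"] <= r["event_date"] < l["end"]
    ]


# Hypothetical toy data: two date windows and five consecutive event dates.
left = [
    {"id": 1, "start": date(2018, 1, 1), "end": date(2018, 1, 3)},
    {"id": 2, "start": date(2018, 1, 2), "end": date(2018, 1, 5)},
]
right = [{"event_date": date(2018, 1, 1) + timedelta(days=d)} for d in range(5)]

pairs = range_join(left, right)
print(f"upper bound for the real tables: {upper_bound:,}")
print(f"toy example: {len(left) * len(right)} pairs compared, {len(pairs)} kept")
```

In the toy data all 10 pairs are compared but only 5 satisfy the date window, so the number of comparisons, not the number of output rows, is what scales with the product of the two table sizes.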


Question - What is the best way to optimize the 'Join Rows (Cartesian Product)' step? I tried to implement the above logic in Kettle using the Join Rows step, but it also takes hours to finish. How can I achieve the same result in less time?