Sunday, March 16, 2008

More efficient surrogate key processing

What I'm basically trying to learn is whether you all find the performance of most first-tier ETL tools in processing surrogate keys (lookup and generation) acceptable.
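By lookup and generation I simply mean matching each incoming natural key to its surrogate key, and handing out a new surrogate when the key has not been seen before. A minimal Python sketch of that step might look like the following (the in-memory key_map and counter are purely illustrative, not drawn from any particular ETL tool; a real tool would use the dimension table and a database sequence):

```python
# Minimal sketch of surrogate key lookup & generation for one dimension.
# key_map and next_key are illustrative stand-ins for the dimension table
# and a key sequence.

key_map = {}    # natural key -> surrogate key
next_key = 1    # next surrogate key to hand out

def lookup_or_generate(natural_key):
    """Return the surrogate key for a natural key, creating one if it is new."""
    global next_key
    surrogate = key_map.get(natural_key)
    if surrogate is None:            # unseen natural key: generate a new surrogate
        surrogate = next_key
        key_map[natural_key] = surrogate
        next_key += 1
    return surrogate

# Example: assign surrogate keys to incoming customer records
for customer_id in ["C001", "C042", "C001"]:
    print(customer_id, "->", lookup_or_generate(customer_id))
```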

Even after fine-tuning, or re-writing the process to be more like a fact build, do you find that a large percentage of the ETL time is spent on this step? If so, that would imply there are significant performance gains to be made through more efficient surrogate key processing.

We've found that most ETL tools are relatively inefficient at processing surrogate keys, simply because they are generalist engines, and that surrogate key processing typically occupies 30-50% of the total ETL time.

For example, most ETL tools are only able to process surrogate keys in a multi-pass operation: a first pass to update the dimensions, then a second pass to update the facts. Processing the load in a single pass would significantly reduce complexity and improve performance.
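To make the single-pass idea concrete, here is a rough, purely illustrative sketch: dimension maintenance and fact keying happen in the same loop over the source, instead of in two separate scans. The field names and data are made up.

```python
# Minimal single-pass sketch (illustrative only). A multi-pass tool would scan
# the source once to update the dimension, then scan it again to key the facts.

key_map, next_key = {}, 1   # natural key -> surrogate key, plus the next key to hand out

def load_single_pass(source_rows, fact_table):
    global next_key
    for row in source_rows:
        nk = row["customer_id"]
        if nk not in key_map:             # dimension maintenance...
            key_map[nk] = next_key
            next_key += 1
        fact_table.append({               # ...and fact keying in the same pass
            "customer_sk": key_map[nk],
            "amount": row["amount"],
        })

facts = []
load_single_pass([{"customer_id": "C001", "amount": 10.0},
                  {"customer_id": "C042", "amount": 7.5}], facts)
print(facts)
```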

Secondly, ETL tools are generally unable to concurrently process multiple small sessions that update the same surrogate keys, for example when multiple sources (or multiple versions of the same source, such as different regions) update in parallel, or when dimension updates for multiple transaction types are processed in parallel. Instead, most ETL tools handle these as one large job, processing each source or transaction type in sequence without fully utilising a machine's multiple processors.
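As a rough illustration of what concurrent sessions against the same dimension could look like, here is a hedged sketch in which several source feeds share one key generator guarded by a lock (the feeds and data are made up; a real engine would use a shared sequence or a partitioned key service rather than a Python lock):

```python
# Hedged sketch: several small sessions (one per source/region) updating the
# same surrogate key map in parallel.
import threading

key_map, next_key = {}, 1
lock = threading.Lock()

def lookup_or_generate(natural_key):
    global next_key
    with lock:                          # serialise key generation across sessions
        if natural_key not in key_map:
            key_map[natural_key] = next_key
            next_key += 1
        return key_map[natural_key]

def session(rows):                      # one small session, e.g. one region's feed
    for natural_key in rows:
        lookup_or_generate(natural_key)

# Illustrative feeds from different regions arriving at the same time
feeds = [["C001", "C002"], ["C002", "C003"], ["C001", "C004"]]
threads = [threading.Thread(target=session, args=(feed,)) for feed in feeds]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(key_map)                          # each natural key gets exactly one surrogate
```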

Of course, if most companies can truly live with their current performance, it's not an issue. But I'm wondering if they do so because they feel their only alternatives to accepting the inefficiency are to custom-design a high-performance solution, which would be expensive and difficult to maintain, or to throw more hardware at the process, which is also expensive.

Based on a client's high performance needs, we've designed a specialised engine that streamlines surrogate key processing for updating marts and warehouses. I'm trying to understand how much need there is for our product.

Is the need only amongst those companies with an exceptionally large dimension (like customer) or those experiencing exceptionally high volumes of data or arrival rates?

Or is it that most companies just don't realise that they can significantly improve this portion of the ETL process and radically improve their data mart updating? That would mean they could maximise their existing infrastructure investment and do more with it: for example, move from a monthly to a weekly load, or add new sources, marts or querying tools. What thousands of flowers might bloom once this technical bottleneck is removed?
