So far we’ve seen various methods of cleansing source data using Data Quality Services. One of the main applications of DQS is in SQL Server Integration Services. This is where source data can be cleansed using the knowledge bases maintained in the DQS client by launching the DQS cleansing transform available in Integration Services. This blog gives step by step procedures to leverage the DQS cleansing transform available in Integration Services to perform data cleansing and use it to obtain good quality source data.
To illustrate this, let’s consider an Excel spreadsheet with the following data:
For simplicity, I’m including two relevant fields- State and Quantity. The source data is non-standardized and will utilize the DQS transformation in our Integration Services package to cleanse out the data as a part of our Extract Transform process.
The package used for the ETL process contains a single Data Flow Task with the below layout:
The above layout includes the following components:
- Excel Source – To extracted data from Excel spreadsheet
- DQS Transform – Launches the DQS engine and does cleansing of source data based on mapped Knowledge Base domain
- Conditional Transformation – Identify between correct and cleansed source rows
- Derived Column – To replace actual source value with DQS corrected value in the case of cleansed data
- Union ALL – To merge the correct and cleansed outputs
- OLEDB Destination – To populate destination table with cleansed source data
For verifying intermediate outputs, I’ve added two data viewers in the package indicated by icon. The DQS transform should be setup to utilize an available Knowledge Base domain as follows:
When we complete the package, it executes the DQS transform and shows us the below windows, which indicates partial results before and after the cleansing step:
The output after DQS transform shows the corrected values against source. This will be used in later tasks to replace the actual values from the source. The conditional transform identifies the cleansed rows and applies the logic to replace the State_Source values with State_Output values in order to populate the corrected values to the destination table. This can be confirmed by looking at the row count values for each output in the package (1 correct row against 5 corrected rows).