How to clean and enhance your data with reference data Learn how to identify wrong data and clean and enhanced them with Talend Enterprise Data Quality

In this tutorial, we will analyze the data in the Profiling perspective of Talend Enterprise Data Quality to see the correlation between them.
Then, we will clean and enhance them with the Design workspace perspective.

To do so, you need to download the exampleFile.zip file gathering the data files, as well as the Jobs used for this tutorial.

Prerequisite:
You first need to import the Jobs available in the exampleFile.zip file on your Studio, and execute the Load_customer_data Job of the Prerequisite folder. This Job will load the customer data in your database which we will analyze in this tutorial.

Download it!

You want to practice?

Download exampleFile.zip to get the files used for this tutorial.

You can also download tutorialProject.zip containing all the jobs needed to carry out this tutorial.

You can also:
Send it!

Share it!

 


Create an analysis


For this tutorial, we used an example of a customer data file.

In this file, we can see that the data seems to have good zip codes, but the cities seems to be misspelled.

We can scroll down in the file to see other customer data.

Next
The misspelling of the cities can be found in the rest of the file and some places in New York City are filled in in the City field.

But the zip codes seem to be well defined.

Next
In the DQ Repository on the left of Talend Enterprise Data Quality:

Right-click DB Connections.

In the menu, click New connection to open the Database Connection wizard.

Next
In the Database Connection wizard:

In the Name field, fill in the name you want to give to your database connection: CustomerDB.

Click Next to go to the second step.

Next
In this step, fill in the database connection settings:

In the Login and Password fields, fill in the login and password to connect to the database.
In the DB type list, select the database type. For this tutorial, we used a MySQL database.
Clear the retrieve all metadata check box to only retrieve the database used for this tutorial.

In the DB URL area, fill in the URL of the database. For this tutorial, we used a database named talend-dbms.
In the DBname, fill in the name of the database, here: tutorials.

Click Finish to close the wizard and create the connection.

Next
The connection displays on the DQ Repository.
You can navigate through the database and its tables.

For this tutorial, we will use the Customer_sample table. You can expand it to check its columns.

We want to analyze the Zip column, so right-click the Zip column and select Analyze correlation in the menu to open the New Analysis wizard.

Next
In the New Analysis wizard:

In the Name field, give a name to your analysis: corranalysis.

Click Finish to close the wizard and create the analysis.

Next
The new analysis displays on the analysis editor.

Click Select column to analyze to add a new column in the analysis.

Next
In the Column Selection wizard:

The database structure displays on the left on the wizard.

On the right of the wizard, select the City check box.

Click OK to validate.

Next
Execute and check the analysis


The new column displays on the Analysis Editor.

Click the running man icon to launch the analysis.

Once the analysis is executed, click the Analysis Results tab at the bottom of the window to see the results.

Next
In the corranalysis result tab:

Scroll to the Graphics area.

We can see the correlation between the zip code 55555 and the city of Young America, the correlation between 02026 and Dedham, etc.

Next
Scroll down to see the other correlations, and most of all the various ways of spelling North Reading, etc.

Next
You can clarify the display of those correlations. To do so:

Go to bottom of the Graphics area.

Select the Picking checkbox.

Next
Once the Picking checkbox is selected, you can click each of the blue and green points of the correlation and move them to space them, in order to clarify the display of each link and data.

Next
Scroll to the Data area of the analysis.

Click the COUNT column name to sort the data by occurrences.

We can see that the most frequent data are the one well spelled.

And the other ones are those which need to be cleaned up.

Next
Clean and enrich your data


To clean the data, we will use the census_data.csv file downloaded for this tutorial, which is a reference file downloaded from the US Census Bureau.

This file contains all the zip codes, City name, State id and County name, as well as latitude, longitude and postal zip class.

We will use this file to both clean and enrich the customer data.

Next
To clean the data we will use the Census_Lookup Job you downloaded and imported from the Download section of this tutorial.

To do so, click the Design Workspace button at the top right hand corner of the Studio.

Next


The Design Workspace button is only available with Talend Enterprise Data Quality. If you are using Talend Open Studio for Data Quality, you will have to use Talend Open Studio for Data Quality to perform the following steps.

In the Repository:

Expand the Metadata node.

Update the RemoteDBMS DB Connection, Census_Data File Delimited, and Customers File Excel metadata with your own file location and database connection information.

Expand the Job Designs node and double-click the Census_Lookup Job to open it.

Next
The Job is made up of two input file components: our customer data file and the census data file, the tMap that will allow to join the data of the two files and therefore clean and enrich our customer data, and two output file components containing our cleaned data and the rejected ones.

Double-click the tMap to open its editor.

Next
In the tMap editor:

The schemas of the input file components display on the top left area of the editor.
The schemas of the output file components display on the top right area of the editor.

On the Schema editor tab at the bottom of the editor, the column of the selected input and output schemas display.

Next
So, in the input area, we can see that our two customers and census_data schemas are linked by a join between their Zip column.

Next
This relationship between the two files will allow to enrich and enhance the data and append data like latitude and longitude.

The tMap is also set to replace the City and State names.

Next
In the case the relationship between the two files can not be found between the zip codes of the customer list and the census data, the record will be outputted to a reject flow.

Next
In the Job Designer:

Double-click the Standardized Customers component to display its Component view.

In the Component view:

In the File Name field, click the [...] to set the path and file name of the cleaned file.

You can also click the Sync columns button to retrieve the schema for the previous component.

Next
In the Job Designer:

Double-click the Zip Rejects component to display its Component view.

In the Component view:

In the File Name field, click the [...] to set the path and file name of the rejected data file.

You can also click the Sync columns button to retrieve the schema for the previous component.

Next

In the Job Designer:

Press Ctrl+S to save the Job.

Press F6 to run it.

The Run view displays at the bottom of Talend Open Studio and the console follows the Job execution.

Next
Now to see the result of our cleaning and enhancing process, in the Job Designer:

Right-click the Zip Rejects component and select data viewer in the menu.

We can see in the Data Preview wizard that all the rejected data are gathered in the reject file.

Click Close to close it.

Next


Data preview is only available from Talend Enterprise Data Quality. If you are using Talend Open Studio for Data Integration, you will not be able to preview the data from the studio.

Right-click the Standardized Customers component and select data viewer in the menu.

We can see in the Data Preview wizard that all our good data has been cleaned and enhanced  by our Job.

Click Close to close it.



Data preview is only available from Talend Data Quality Studio. If you are using Talend Open Studio, you will not be able to preview the data from the studio.



 

    Download it!     Send it!     Share it!

You want to practice?

Download exampleFile.zip to get the files used for this tutorial.

You can also download tutorialProject.zip containing all the jobs needed to carry out this tutorial.

Friends / colleagues may be interested in this tutorial? Send it to them!

You liked this tutorial ? Support it!

[ top ]