How to identify quality problems Learn how to use Talend Open Studio for Data Quality

In this tutorial, we will analyze customer data from a MySQL database to identify quality problems.

The example used is an email marketing campaign based on a gender segmentation of the customers. So, we will check the quality of the gender (m for male, f for female, and u for unknown) and email details of the customers in the database.

Prerequisite:
To do this tutorial, you need to install:
- a Java Runtime 1.5 or above (http://www.oracle.com/technetwork/java/javase/downloads/index.html),
- Talend Open Studio for Data Quality,
- a database on which you can run the analysis. (This database can be found in the exampleFile.zip file available at the bottom of this page, in the Download it! section of this tutorial. To install it, download and unzip the exampleFile.zip and follow the instructions provided in the readme.txt file extracted.)

Download it!

You want to practice?

Download exampleFile.zip to get the files used for this tutorial.

You can also download tutorialProject.zip containing all the jobs needed to carry out this tutorial.

You can also:
Send it!

Share it!

 


Create a DB Connection


In the DQ Repository on the left of Talend Open Studio for Data Quality main screen:

Right-click DB Connections.

In the menu, click New connection to open the Database Connection wizard.

Next
In the Database Connection wizard:

in the Name field, fill in the name you want to give to your database connection: CustomerDB.

Click Next to go to the second step.

Next
In this step, fill in the database connection settings:

In the Login and Password fields, fill in the login and password to connect to the database.
In the DB type list, select the database type. For this tutorial, we used a MySQL database.
Uncheck the retrieve all metadata check box to only retrieve the database used for this tutorial.

In the DB URL area, fill in the URL of the database. For this tutorial, we used a localhost database.
In the DBname, fill in the name of the database: tbi.

Click Finish to close the wizard and create the connection.

Next
Create an analysis


The connection displays on the DQ Repository.
You can navigate through the database and its tables.

For this tutorial, we will use the Customer table. You can expand it to check its columns.

We want to analyze the gender column, so right-click the gender column and select Analyze in the menu to open the New Analysis wizard.

Next
In the New Analysis wizard:

In the Name field, give a name to your analysis: GenderAna.

Click Finish to close the wizard and create the analysis.

Next
The new analysis displays on the analysis editor.

Click Select column to analyze to add a new column in the analysis.

Next
In the Column Selection wizard:

The database structure displays on the left on the wizard.

On the right of the wizard, select the email check box to check if the customers' email can be used.

Click Ok to validate.

Next
The email column displays under the gender one in the analysis.

Now that the columns to analyze are set, we will set up the analysis to carry out. So, click Select indicators for each column to open the wizard.

Next
In the Indicator Selection wizard:

Check the Simple Statistics and Text Statistics check boxes.

And scroll down to the end of the list.

Next
Select the Advanced Statistics and Pattern Frequency Statistics check boxes.

Click OK.

Next
In the Analysis editor:

The statistic indicators selected for the gender column display.

Click the Pattern icon corresponding to the email column to use a regular expression to verify the validity of the customers' email address.

Next
In the Pattern Selector wizard:

Expand the Patterns > Regex > Internet nodes.

Check the Email Address pattern.

Click OK.

Next
Execute and check your analysis


The new criterium displays on the Analysis Editor.

Click the running man icon to launch the analysis.

Once the analysis is executed, click the Analysis Results tab at the bottom of the window to see the results.

Next
In the GenderAna analysis result tab:

Scroll to Text Statistics table.

If we have a look at the minimum and maximal length of the gender column, we can see that their minimal and maximal values are 0 and 1. So it might contain null values, but the gender field has only one character probably corresponding to "m", "f" or "u", the expected data.

Next
Scroll to Frequency Statistics table.
We can see that there are empty fields, and also one "H" and ten "2". Let's have a look at these ones.

To do so, right-click the "2" and select View rows in the menu.


Next
Directly access bad data


If we have a look on the data, we can see that most of the customers displayed are female customers.

Scroll right to check the gender column.

Next
In the gender fields, we can see the "2" which is not a valid data.

As most of the first names were female, we can think that those data were imported from another database in which "2" meant female.

Next

Scroll to the Pattern Matching table, we can see that 41% of the Email addresses don't match the pattern. Let's have a look at those invalid email addresses.

To do so, right-click the Email Address row and select view invalid rows in the menu.

Next
We can see that those email addresses are invalid. They contain spaces, two at signs, no dot before org, etc.

Now, we know we have to correct those data of the database.



To take advantage of the SQL Editor, read the corresponding documentation available here: http://www.talend.com/resources/documentation.php, and discover how to save your own SQL queries and easily access your data.


 

    Download it!     Send it!     Share it!

You want to practice?

Download exampleFile.zip to get the files used for this tutorial.

You can also download tutorialProject.zip containing all the jobs needed to carry out this tutorial.

Friends / colleagues may be interested in this tutorial? Send it to them!

You liked this tutorial ? Support it!

[ top ]