  |
 | Create an analysis
|
For this tutorial, we used an example of a customer data file. In this file, we can see that the data seems to have good zip codes, but the cities seems to be misspelled. We can scroll down in the file to see other customer data.
Next
|
  |
The misspelling of the cities can be found in the rest of the file and some places in New York City are filled in in the City field. But the zip codes seem to be well defined.
Next
|
  |
In the DQ Repository on the left of Talend Enterprise Data Quality:Right-click DB Connections. In the menu, click New connection to open the Database Connection wizard.
Next
|
  |
In the Database Connection wizard:In the Name field, fill in the name you want to give to your database connection: CustomerDB. Click Next to go to the second step.
Next
|
  |
In this step, fill in the database connection settings: In the Login and Password fields, fill in the login and password to connect to the database. In the DB type list, select the database type. For this tutorial, we used a MySQL database. Clear the retrieve all metadata check box to only retrieve the database used for this tutorial. In the DB URL area, fill in the URL of the database. For this tutorial, we used a database named talend-dbms. In the DBname, fill in the name of the database, here: tutorials. Click Finish to close the wizard and create the connection.
Next
|
  |
The connection displays on the DQ Repository. You can navigate through the database and its tables. For this tutorial, we will use the Customer_sample table. You can expand it to check its columns. We want to analyze the Zip column, so right-click the Zip column and select Analyze correlation in the menu to open the New Analysis wizard.
Next
|
  |
In the New Analysis wizard:In the Name field, give a name to your analysis: corranalysis. Click Finish to close the wizard and create the analysis.
Next
|
  |
The new analysis displays on the analysis editor. Click Select column to analyze to add a new column in the analysis.
Next
|
  |
In the Column Selection wizard:The database structure displays on the left on the wizard. On the right of the wizard, select the City check box. Click OK to validate.
Next
|
  |
 | Execute and check the analysis
|
The new column displays on the Analysis Editor. Click the running man icon to launch the analysis. Once the analysis is executed, click the Analysis Results tab at the bottom of the window to see the results.
Next
|
  |
In the corranalysis result tab:Scroll to the Graphics area. We can see the correlation between the zip code 55555 and the city of Young America, the correlation between 02026 and Dedham, etc.
Next
|
  |
Scroll down to see the other correlations, and most of all the various ways of spelling North Reading, etc.
Next
|
  |
You can clarify the display of those correlations. To do so: Go to bottom of the Graphics area. Select the Picking checkbox.
Next
|
  |
Once the Picking checkbox is selected, you can click each of the blue and green points of the correlation and move them to space them, in order to clarify the display of each link and data.
Next
|
  |
Scroll to the Data area of the analysis. Click the COUNT column name to sort the data by occurrences. We can see that the most frequent data are the one well spelled. And the other ones are those which need to be cleaned up.
Next
|
  |
 | Clean and enrich your data
|
To clean the data, we will use the census_data.csv file downloaded for this tutorial, which is a reference file downloaded from the US Census Bureau. This file contains all the zip codes, City name, State id and County name, as well as latitude, longitude and postal zip class. We will use this file to both clean and enrich the customer data.
Next
|
  |
To clean the data we will use the Census_Lookup Job you downloaded and imported from the Download section of this tutorial. To do so, click the Design Workspace button at the top right hand corner of the Studio.
Next
 | The Design Workspace button is only available with Talend Enterprise Data Quality. If you are using Talend Open Studio for Data Quality, you will have to use Talend Open Studio for Data Quality to perform the following steps.
|
|
  |
In the Repository:Expand the Metadata node. Update the RemoteDBMS DB Connection, Census_Data File Delimited, and Customers File Excel metadata with your own file location and database connection information. Expand the Job Designs node and double-click the Census_Lookup Job to open it.
Next
|
  |
The Job is made up of two input file components: our customer data file and the census data file, the tMap that will allow to join the data of the two files and therefore clean and enrich our customer data, and two output file components containing our cleaned data and the rejected ones. Double-click the tMap to open its editor.
Next
|
  |
In the tMap editor:The schemas of the input file components display on the top left area of the editor. The schemas of the output file components display on the top right area of the editor. On the Schema editor tab at the bottom of the editor, the column of the selected input and output schemas display.
Next
|
  |
So, in the input area, we can see that our two customers and census_data schemas are linked by a join between their Zip column.
Next
|
  |
This relationship between the two files will allow to enrich and enhance the data and append data like latitude and longitude. The tMap is also set to replace the City and State names.
Next
|
  |
In the case the relationship between the two files can not be found between the zip codes of the customer list and the census data, the record will be outputted to a reject flow.
Next
|
  |
In the Job Designer:Double-click the Standardized Customers component to display its Component view. In the Component view: In the File Name field, click the [...] to set the path and file name of the cleaned file. You can also click the Sync columns button to retrieve the schema for the previous component.
Next
|
  |
In the Job Designer:Double-click the Zip Rejects component to display its Component view. In the Component view:In the File Name field, click the [...] to set the path and file name of the rejected data file. You can also click the Sync columns button to retrieve the schema for the previous component.
Next
|
  |
In the Job Designer: Press Ctrl+S to save the Job. Press F6 to run it. The Run view displays at the bottom of Talend Open Studio and the console follows the Job execution.
Next
|
  |
Now to see the result of our cleaning and enhancing process, in the Job Designer: Right-click the Zip Rejects component and select data viewer in the menu. We can see in the Data Preview wizard that all the rejected data are gathered in the reject file. Click Close to close it.
Next
 | Data preview is only available from Talend Enterprise Data Quality. If you are using Talend Open Studio for Data Integration, you will not be able to preview the data from the studio.
|
|
  |
Right-click the Standardized Customers component and select data viewer in the menu. We can see in the Data Preview wizard that all our good data has been cleaned and enhanced by our Job. Click Close to close it.
 | Data preview is only available from Talend Data Quality Studio. If you
are using Talend Open Studio, you will not be able to preview the data
from the studio.
|
|