You are going to drill down into single table and columns analyses.
These analyses will allow you to explore the tables of our catalogs.
Columns analyses enable discovering data content and format.
Define data validation rules with Data Quality rules.
In this exercise, you will perform one column analysis on each table of the “cif” and “crm” catalogs in order to get a more accurate view on data. The exercise comprises two steps:
- creation of column analyses for all tables in “cif” catalog, with column analyses configuration,
- creation of column analyses for all tables in “crm” catalog.
Prerequisites: To follow this tutorial, you need to extract and install the cif and crm databases zipped in the exampleFile.zip file available for download in the Download it! section of this tutorial.
This opens a new tab where you can see the duplicates.
cntry column has a maximum and minimum length of 3 characters
addr_st column has around 12% of duplicate values
zip column has a minimal length of 4 and maximal of 5 , with an average of 5. If you right-click on the Minimal Length bar in the
chart to display the rows, you can see that only 4 rows have a 4-digit zip
code: this means they are probably incorrect.
phone_num column has 48.2% of blank values.
According to the text indicator, a phone number seems to be 11 characters long.
cust_type column length is between 8 and 11. It has
only 3 distinct values (Simple Statistics) which can be displayed by
right-clicking on the Distinct Count and selecting View values in the menu: prospect, customer, beneficiary. This corresponds to the three customer types we have
to deal with.
In the Analysis Results tab, you can now check for the patterns of the selected columns.
code has three patterns, all numeric (“9”
symbolizes a number) with three different lengths: 3, 4 and 5 digits. The pattern
with 4 digits represents 99% of the values.
Back to the crm_cust analysis results, go to the phone_num
column results. Here, there are multiple patterns and empty fields. For
this training, we consider that the correct phone number pattern is 999-999-999, which is the most frequent in our dataset.
Use built-in patterns or import them from the Talend Exchange Community
For the email column, we have different patterns, but we know that an email address must contain a "@" character and a "." followed by a domain name.
All the pattern values that are present here are valid. Instead of creating our own email pattern we will use an existing one. We have two possibilities:
1) Use an existing pattern from the built-in list provided with Talend Open Studio for Data Quality or Talend Enterprise Data Quality. 2) Import a pattern that has been shared in Talend Exchange Community.
1) Use an
existing pattern from the built-in list provided with Talend Open Profiler or Talend Data Quality.
In the DQ Repository view, you can see the two patterns you created (zip_code and phone_number), and you can also
browse among the built-in ones. In the internet folder, you will find an Email Address built-in pattern
2)Import a
pattern that has been shared in Talend Exchange community.
In
the DQ Repository view:
Expand the Libraries, Exchange and Patterns nodes. Here you can see the different patterns that have been shared on Talend Exchange community.
Right-click on Email Address (with list of top-level domains) and select Import in DQ Repository which will import this pattern locally in your repository.
You will then be able to use the imported pattern in your analyses.
Optional: export
your patterns to Talend Exchange. If you created
patterns that can be reused by other people, you can share these with Talend
Exchange. First, you need
to export the pattern(s) locally.To do so:
Right-click on Date and select Export
for Talend Exchange.
Click the Browse... button next to the Select a folder field to select an export folder in your local hard
drive. In the Selected Pattern area, select the pattern that youwant to export (in the example, we are exporting a
date pattern that we will create in another exercise).
Click on Finish to export the pattern on the
selected folder. This will create a zip file that contains the pattern.
Open a Web browser and go to the Talend Exchange Web site. Click on the Data Quality link.
You
need to create an account in Talend Forum or Exchange in order to be
able to publish new patterns. As soon as you are authenticated, you can
access a panel to add extensions.
Save and run the analysis and check the new results.
You now have a Pattern Matching
indicator for these columns, where you can view valid and invalid rows
by right-clicking the results table or the chart.
Select indicators as follow: Simple statistics: all columns Text statistics: offer_name, owner, beneficiary Pattern Frequency Statistics: offer_name, owner, beneficiary
offer_name has 3 distinct values: online, simplicity and blocked
For duration, you can check the distinct, unique and duplicate values. For this tutorial, we consider that duration represents the time in minutes for pay monthly plans. It can only have discrete values.
Distinct values will show all values you can find here. There are only 15 different values here. Unique values only appear once: in this case, they are probably invalid values - the same duration is related to multiple contracts.
In this case, the Unique indicator value should always be 0.
Create a threshold for this indicator, from the Analysis Settings tab, click on the button (with the little bolt) next to the Unique indicator of the duration column, and fill in the threshold values (0 for min and max values).
Duplicate values show the values that appear at least twice: in
this case, the correct values will be found here. There are 8 duplicate
values. Let's check the rows with these values: right-click on duplicate count and select view rows:
1) Clear the Limit Rows check box 2 )Re-run the query 3) Sort the duration column in ascending order
Now
you can see that 0, 6 and 12 values are only present a few times,
whereas 30, 60, 120, 180 and 300 are frequent. We can assume that these
5 numbers are valid duration values.
owner is never null or blank and is 4 characters long.
beneficiary:
3 to 5 digit characters long; no blank value but there are multiple
values (45.75%) which pattern is only one digit long, and all of these
values are 0, which means there is no beneficiary for this contract, so the contract owner is also the beneficiary.
Set the threshold for Null, Duplicate and Blank indicators to 0 (min and max).
Now, if these values are not correct, a warning will be displayed in the analysis results:
column1: free-form text which seems to represent the duration (in minutes) and the offer name (blocked, simplicity, online)
column2: single character (most of the time a digit). Using the same method as for the duration column in the crm catalog, find the valid values for column2
For date formats, create a new pattern to check for dates like DD-MM-YYYY with this regular expression pattern: '^((0[1-9])|(1[0-9])|(2[0-9])|(3[0-1]))(\/|-)((0[1-9])|(1[0-2]))(\/|-)((19|20)[0-9]{2})$'
You can duplicate and reuse the Date MM DD YYYY 0.1 existing date pattern.