Single Column Analyses (Exercise 3) You are going to drill down into single table and columns analyses. These analyses will allow you to explore the tables of our catalogs. Columns analyses enable discovering data content and format. Define data validation rules with Data Quality rules.

In this exercise, you will perform one column analysis on each table of the “cif” and “crm” catalogs in order to get a more accurate view on data. The exercise comprises two steps:
- creation of column analyses for all tables in “cif” catalog, with column analyses configuration,
- creation of column analyses for all tables in “crm” catalog.

Prerequisites:
To follow this tutorial, you need to extract and install the cif and crm databases zipped in the exampleFile.zip file available for download in the Download it! section of this tutorial.

Download it!

You want to practice?

Download exampleFile.zip to get the files used for this tutorial.

You can also download tutorialProject.zip containing all the jobs needed to carry out this tutorial.

You can also:
Send it!

Share it!
Next Step: Performing [Single Table Analyses (Exercise 4)]

 


Create column analyses for each crm tables


In the DQ Repository view:

Expand the Data Profiling node.
Right-click Analyses.
In the menu, click New Analysis to open the Create New Analysis wizard.

Next
In the Create New Analysis wizard:

Expand Column Analysis and select Column Analysis.

Click Next.

In the Name field, name them crm_<table_name>. For example: crm_cust.

We are now going to configure these column analyses.

Next
Configure crm_cust analysis with simple indicators


(You may want to create a directory inside the Analyses category in order to sort your analyses.)

Open the crm_cust analysis and click on Select columns to analyze. In the pop-up window, select the cust table in the crm database.

Next
Now select the data profiling indicators you want to display for each column.

An online help is present in the right panel to help you choose among the indicators.

Select Simple Statistics for all columns and Text Statistics for code, cntry, zip, phone_num and cust_type.

Click OK.

Save your analysis and click on the Run button.

Next
Display the results by clicking on the Analysis results tab at the bottom of the window.

Next

Check the results for each column, for both text and simple indicators. Here is some information we can gather from the results:

  • The code column is between 3 and 5 characters long, and has 100% uniqueness (it is the primary key)
  • The name column has 9 duplicate values: right-click in the table or inside the results chart to display the duplicate values or rows.

Next
This opens a new tab where you can see the duplicates.
  • cntry column has a maximum and minimum length of 3 characters
  • addr_st column has around 12% of duplicate values
  • zip column has a minimal length of 4 and maximal of 5 , with an average of 5. If you right-click on the Minimal Length bar in the chart to display the rows, you can see that only 4 rows have a 4-digit zip code: this means they are probably incorrect.
  • phone_num column has 48.2% of blank values. According to the text indicator, a phone number seems to be 11 characters long.

Next
  • email column has 13.6% of duplicate values.
  • cust_type column length is between 8 and 11. It has only 3 distinct values (Simple Statistics) which can be displayed by right-clicking on the Distinct Count and selecting View values in the menu: prospect, customer, beneficiary. This corresponds to the three customer types we have to deal with.

Next
Add new pattern indicators in our crm_cust analysis


Try to refine your analysis. To do so:

Select the Pattern Frequency Table and Pattern Low Frequency Table. Check the boxes of the following columns: code, cntry, zip, phone_num, email.

Click OK to go back to the Analysis Settings tab.

Save and run your analysis.

Next
In the Analysis Results tab, you can now check for the patterns of the selected columns.
  • code has three patterns, all numeric (“9” symbolizes a number) with three different lengths: 3, 4 and 5 digits. The pattern with 4 digits represents 99% of the values.

Next
  • cntry values all have a pattern with 3 characters (in a pattern, “A” represents an alphabetic character in uppercase).

Next
  • zip codes all have 5 digits, except 4 values that are incorrect: we will generate a pattern that represents a valid zip code.

Next
Generate regular expressions from pattern indicators


In the pattern indicator for the zip column, right-click on the 5-digit long pattern and select Generate regular pattern.


In the New Regular Pattern wizard:

Name the pattern zip_code and leave all options as they are and click on Finish.

Next

Back to the crm_cust analysis results, go to the phone_num column results. Here, there are multiple patterns and empty fields. For this training, we consider that the correct phone number pattern is 999-999-999, which is the most frequent in our dataset.

Name this pattern phone_number.

Next
Use built-in patterns or import them from the Talend Exchange Community


For the email column, we have different patterns, but we know that an email address must contain a "@" character and a "." followed by a domain name.

All the pattern values that are present here are valid. Instead of creating our own email pattern we will use an existing one. We have two possibilities:

1) Use an existing pattern from the built-in list provided with Talend Open Studio for Data Quality or Talend Enterprise Data Quality.
2) Import a pattern that has been shared in Talend Exchange Community.

Next
1) Use an existing pattern from the built-in list provided with Talend Open Profiler or Talend Data Quality.

In the DQ Repository view, you can see the two patterns you created (
zip_code and phone_number), and you can also browse among the built-in ones.
In the internet folder, you will find an Email Address built-in pattern

Next
2)    Import a pattern that has been shared in Talend Exchange community.

In the DQ Repository view:

Expand the Libraries, Exchange and Patterns nodes.
Here you can see the different patterns that have been shared on Talend Exchange community.

Right-click on Email Address (with list of top-level domains) and select Import in DQ Repository which will import this pattern locally in your repository.

You will then be able to use the imported pattern in your analyses.



Next
Optional: export your patterns to Talend Exchange.
If you created patterns that can be reused by other people, you can share these with Talend Exchange.
First, you need to export the pattern(s) locally.To do so:

Right-click on Date and select Export for Talend Exchange.


Next
In the Export for Talend Exchange wizard:

Click the Browse... button next to the Select a folder field to select an export folder in your local hard drive.
In the Selected Pattern area, select the pattern that youwant to export (in the example, we are exporting a date pattern that we will create in another exercise).

Click on Finish to export the pattern on the selected folder. This will create a zip file that contains the pattern.

Next
Open a Web browser and go to the Talend Exchange Web site.
Click on the Data Quality link.

You need to create an account in Talend Forum or Exchange in order to be able to publish new patterns. As soon as you are authenticated, you can access a panel to add extensions.

Next
To add a new extension for Talend Open Studio for Data Quality:

Click Add an extension link at the bottom left-hand corner of the page.

Fill in the requested information and select the category of your extension:
  • Indicator allows you to add a custom indicator to column analyses
  • pattern is generic and lets you create Regex or SQL
  • regex is a regular expression pattern
  • SQL are databases-specific patterns that are checked inside a LIKE clause

Next
Apply the patterns to the columns for analysis



From the DQ Repository view, drag and drop the patterns on the columns to analyze.

Now, a new indicator is attached on these three columns.


Next
Save and run the analysis and check the new results.

You now have a Pattern Matching indicator for these columns, where you can view valid and invalid rows by right-clicking the results table or the chart.


Next
Analyze the outstanding crm tables



Analyze the contract table.

Select indicators as follow:
Simple statistics: all columns
Text statistics: offer_name, owner, beneficiary
Pattern Frequency Statistics: offer_name, owner, beneficiary

Save and run the analysis.

Check the results.


Next
Check the results:
  • number is the primary key, with 100% unique count
  • offer_name has 3 distinct values: online, simplicity and blocked
  • For duration, you can check the distinct, unique and duplicate values. For this tutorial, we consider that duration represents the time in minutes for pay monthly plans. It can only have discrete values.

Distinct
values will show all values you can find here. There are only 15 different values here.
Unique values only appear once: in this case, they are probably invalid values - the same duration is related to multiple contracts.


Next
In this case, the Unique indicator value should always be 0.

Create a threshold for this indicator, from the Analysis Settings tab, click on the button (with the little bolt) next to the Unique indicator of the duration column, and fill in the threshold values (0 for min and max values).


Next
Save and run the analysis and check the new results for the duration column.

A warning appears because the Unique indicators values are outside the threshold bounds.


Next
Duplicate values show the values that appear at least twice: in this case, the correct values will be found here. There are 8 duplicate values.
Let's check the rows with these values: right-click on duplicate count and select view rows:

1) Clear the Limit Rows check box
2 )Re-run the query
3) Sort the duration column in ascending order

Now you can see that 0, 6 and 12 values are only present a few times, whereas 30, 60, 120, 180 and 300 are frequent. We can assume that these 5 numbers are valid duration values.


Next
Let's create a new pattern to check for valid and invalid values in our column analysis.

In the DQ Repository view:

Expand Libraries, Patterns and Regex nodes.
Right-click on number and select New Regular Pattern in the menu to create a new regex pattern.


Next
In the New Regular Pattern wizard:

Name it Duration.

Click Next.

Type in the following pattern expression: '^30$'

Click Finish.


Next
Open the Duration pattern if it is not already opened and

1) Click on the Test button to display the Pattern Test ViewScope.

2) Type in some values in the Test Area.

3) Click on Test to check if your pattern works.

Your test should display a Matches smiley when you type in 30 and Non-matches warning for any other value.


Next
Add more valid duration values to the Duration pattern (in the left tab):
'^30$|^60$|^120$|^180$|^300$'

Save (1), click Test (2) again to refresh the right panel. You can now check all your duration values (3).

Next
1) Drag and drop this new pattern on the duration column in your analysis and run it.

2) The Tp_cd column has 3 values: 1, 2 and 3.

Create a new pattern by duplicating the Duration pattern. Right-click the Duration pattern and select Duplicate on the menu.

Next
Rename the pattern you just created with the name contract_type.

Change the valid values to 1, 2 and 3 and save.


Next
Drap and drop the pattern on the Tp_cd column in your analysis and run it.

Next
  • owner is never null or blank and is 4 characters long.

  • beneficiary: 3 to 5 digit characters long; no blank value but there are multiple values (45.75%) which pattern is only one digit long, and all of these values are 0, which means there is no beneficiary for this contract, so the contract owner is also the beneficiary.

  • begin_dt and end_dt are never null.

Next
Analyze the contract_tp table.

This is a very simple table, so:

Select all columns and select simple statistics indicators for them.

There are 3 contract types:
  • Subscription (code 1)
  • Insurance (code 2)
  • Support (code 3)

Next
Analyze the claim table:

These are the contract claims.
Perform simple analysis as for contract_tp and check results.

Now, we have a good vision of all tables composing the crm catalog.

In Exercise 5, we will perform cross-table analyses in order to discover and verify relationships between these tables.

Next
Create column analyses for the cif tables


Create a new column analysis for each of the cif catalog tables:

Right-click on each table and selecting Column Analysis.

Name each of these analyses: cif_<table_name>.

You are now going to configure these Column analyses. This will be more difficult since columns have intentionally been left "anonymous".

Next
Analyze the acct table:

In the cif_acct column analysis, select all columns from the table and the following indicators:
  • Simple Statistics: all columns
  • Text Statistics: all columns
  • Pattern Frequency Statistics: all columns

Save and run the results, and check the results:
column0: primary key (100% distinct and unique) with prefix CTR_

Next
We can put some threshold to monitor that this column is primary and should have unique, non-null values.

First, set the Distinct and Unique indicators min and max thresholds to 100%.

Next
Set the threshold for Null, Duplicate and Blank indicators to 0 (min and max).

Now, if these values are not correct, a warning will be displayed in the analysis results:
  • column1: free-form text which seems to represent the duration (in minutes) and the offer name (blocked, simplicity, online)
  • column2: single character (most of the time a digit). Using the same method as for the duration column in the crm catalog, find the valid values for column2

Next
Create a new pattern to check for valid and invalid values in your column analysis.

In the DQ Repository view:

Expand Libraries, Patterns and Regex nodes.

Right-click on number and select New Regular Pattern in the menu to create a new regex pattern.

Name it Column2, click Next and find the corresponding pattern expression.

Save and run the analysis once again to check the invalid rows.

Next
  
Tips:

  • Each valid value must be surrounded by "^" ("begins with") and "$" ("ends with")
  • Separate each valid value with the "|" operator ("or")
  • You can use the duplicate option by right-clicking on a pattern

Next
  
  • column3: always a date pattern (99-99-9999) with no null values
  • column4: date pattern with null values
  • column5: 4 digits long, no blank or null values
  • column6: 3 to 5 digits long, no blank or null values

Find a comprehensive name for each of the columns of the acct table.

Next
Analyze the country table:

This simple table is a list of countries with ISO codes on 2 and 3 characters and the country name.

Next
Analyze the person table:

In the cif_person column analysis, select all columns from the table and perform the analysis with the following indicators:

  • Simple Statistics: all columns
  • Text Statistics: all columns
  • Pattern Frequency Statistics: all columns

Next
Analyze the customer table:

Repeat the steps of the person table analysis.

Next
Tips:

For date formats, create a new pattern to check for dates like DD-MM-YYYY with this regular expression pattern:
'^((0[1-9])|(1[0-9])|(2[0-9])|(3[0-1]))(\/|-)((0[1-9])|(1[0-2]))(\/|-)((19|20)[0-9]{2})$'

You can duplicate and reuse the Date MM DD YYYY 0.1 existing date pattern.

  Next Step: Performing [Single Table Analyses (Exercise 4)]

 

    Download it!     Send it!     Share it!

You want to practice?

Download exampleFile.zip to get the files used for this tutorial.

You can also download tutorialProject.zip containing all the jobs needed to carry out this tutorial.

Friends / colleagues may be interested in this tutorial? Send it to them!

You liked this tutorial ? Support it!

[ top ]