[Translated from French] I'm looking for a way to detect homonyms in two separate tables before merging them. The tables contain user information (last name, first name, ...). The goal is to check for duplicates before merging the tables.
1) I need to detect "simple" duplicates (e.g. "Michel Petit" and "Michel Petit" in both databases)
2) I also need to detect names with hyphens or spaces and potential homonyms (e.g. "Jean Durant" and "Jean Durant-Smet" or "Jean Durant Smet")
I've already written a query that returns the "simple" duplicates, and I have a REGEXP to find names containing "-".
But I don't see how to satisfy both conditions.
TOP doesn't seem to allow this through its GUI, and I don't know whether it can be scripted, or in which language.
The language of this forum is English; please try to formulate your question in English so that more people can give you an answer.
Here is my answer:
Talend Open Profiler (TOP) is currently not able to compare tables (or columns in tables). This is planned for the next release: [Bugtracker, bug 5322, fixed] New kind of analysis: Column comparison analysis. And there is no scripting language in TOP.
For a fuzzy comparison I would suggest that you try to do it with Talend Open Studio. Your two tables would be the input of the tFuzzyMatch component.
In this component, you can choose the matching algorithm. For your example, I would suggest the use of the metaphone algorithm since it does a phonetic comparison. But you can also try the Levenshtein algorithm: Strings with missing or switched characters will match.
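To make the idea concrete, here is a minimal plain-Python sketch of this kind of comparison. It is not how tFuzzyMatch is implemented; it only illustrates the Levenshtein approach, combined with a normalization step for hyphens and spaces. The function names (`normalize`, `levenshtein`, `classify`) and the `max_distance` threshold of 2 are my own assumptions for illustration.

```python
import re


def normalize(name: str) -> str:
    """Lowercase the name and turn hyphens/runs of whitespace into single spaces."""
    return re.sub(r"[-\s]+", " ", name.strip().lower())


def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions or substitutions from a to b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def classify(name_a: str, name_b: str, max_distance: int = 2) -> str:
    """Return 'exact', 'compound', 'fuzzy' or 'none' for a pair of full names.

    Hypothetical helper: 'exact' means equal after normalization, 'compound'
    means one name is the other plus extra trailing parts (e.g. "Durant" vs
    "Durant-Smet"), 'fuzzy' means within max_distance edits.
    """
    a, b = normalize(name_a), normalize(name_b)
    if a == b:
        return "exact"
    shorter, longer = sorted((a.split(), b.split()), key=len)
    if len(shorter) >= 2 and longer[:len(shorter)] == shorter:
        return "compound"
    if levenshtein(a, b) <= max_distance:
        return "fuzzy"
    return "none"
```

With this sketch, "Jean Durant-Smet" and "Jean Durant Smet" compare as equal after normalization, while "Jean Durant" against either of them is flagged as a compound candidate. The threshold would need tuning on real data, since a small edit distance can also pair unrelated short names.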
I hope this helps.
Note that there is also a component in Talend Open Studio called tParseName, which helps to parse names and find homonyms, but it currently only supports English names and is available in Perl projects of Talend Open Studio. I am not sure it would help in your case.
First, a translation of the original post:
[beginning of the translation]
I'm looking for a way to detect homonyms in two separate tables before joining them. The tables contain user information (name, first name, ...). The goal is to check whether there are duplicate entries before appending the tables.
1) I need to detect "simple" duplicate entries (example: "Michel Petit" and "Michel Petit" in both tables)
2) I also need to detect close homonyms or compound homonyms (example: "Jean Durant", "Jean Durant-Smet" or "Jean Durantsmet" and so on).
I've written a SQL query to find simple duplicate entries and a REGEXP to find compound names. But I don't know how to do both at the same time.
TOP doesn't seem to allow this with its GUI, and it doesn't appear that a scripting capability exists.
Thanks for your help anyway
[end of the translation]
I've tried TOS, but it appeared that TOP might be better suited to do this (well). I'm also wondering whether TDQ's Data Cleaning would be appropriate for this problem.
About tParseName: 80% of my names are French and the rest are various European names.
Anyway thanks for your help, I'll take a closer look.
For this particular task, I am not sure that TDQ would be more appropriate (at least in the current version). It depends on your needs. For what you've described, TOS is currently the best tool. TDQ uses TOS and won't add anything to solve this problem right now. Maybe in a future release...
(Thanks for your translation ;-) )