If you want to extract data that matches a regular expression, I think the tExtractRegexFields component, which is available in Talend Open Studio, can help. Otherwise, you can write a bit of Java code in a component such as tMap or tJavaRow, so there is some manual work involved.
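As a rough illustration of the "bit of Java code" option, here is a minimal sketch using only java.util.regex. The class name HtmlNameExtractor, the method extract, and the sample input are hypothetical. The pattern is adapted from the one quoted in this thread ([A-Za-z0-9]\.html), with a "+" quantifier and a capture group added so that the whole name before ".html" is returned. In a tJavaRow you would presumably apply the same matching logic to a column of the incoming row.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HtmlNameExtractor {
    // Assumed pattern: one or more alphanumerics followed by ".html";
    // group 1 captures the name without the extension.
    private static final Pattern HTML_NAME =
            Pattern.compile("([A-Za-z0-9]+)\\.html");

    // Returns every captured name found in the given text, in order.
    public static List<String> extract(String text) {
        List<String> names = new ArrayList<>();
        if (text == null) {
            return names; // treat a null field as "no matches"
        }
        Matcher m = HTML_NAME.matcher(text);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    public static void main(String[] args) {
        // prints [index, page2]
        System.out.println(extract("see index.html and page2.html"));
    }
}
```

Compiling the pattern once and reusing it for every row avoids recompiling the regular expression on each call, which matters on files with millions of lines.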
In the enterprise solution, we provide several components that reuse the regular expressions created in the profiler perspective. Unfortunately, they are not available in the open studio.
You may also have a look at http://www.talendforge.org/exchange/index.php where the community shares the components it has developed.
Thanks for your reply. I have Talend Open Studio for MDM installed, and I have run the analysis on the file without ticking "allow drill down". Now I have to use the data in the DI job. Can you please advise me how this can be done in Talend MDM?
I couldn't find any tutorial that does anything close to this.
The advice is not to allow drill down. See picture.
This feature lets you see the data, but all of the data is loaded in memory, so it is not suitable for big files.
Uncheck the check box.
If you want to retrieve the data after an analysis, then it would be an ETL job designed in the DI perspective (not available in Talend Open Studio for DQ, but you may do it manually in Talend Open Studio for MDM).
When I try to profile the file containing around 20,00000 lines, keeping the maximum number of rows per indicator at around 4,00000, Talend keeps hanging and gives me the following error:
139752 [Thread-5] INFO org.talend.dataquality.indicators.impl.RegexpMatchingIndicatorImpl - Preparing regular expression matching indicator Random_sequence_of_text with regular expression: [A-Za-z0-9]\.html
Exception in thread "Thread-5" java.lang.OutOfMemoryError: GC overhead limit exceeded
Please advise me if I can do anything to get rid of this error.
It depends on what you exactly need.
It's possible to define an analysis for each file, then to run all analyses at once in the commercial edition.
If you want to create a generic analysis for all files, that could also be possible if all your files have the same schema. But it will require some tricks (I am thinking of using contexts and changing them automatically via a job).
Thanks a ton.
One more thing: is it possible to profile data in a batch? Instead of doing it for files one by one, can we profile all the files from a directory?
Contact sales or info http://www.talend.com/open-source-provider/contacts.php
See more details about the products at http://www.talend.com/products/enterprise-dq.php
Many thanks for the reply. Where can I find the subscription prices for Talend Enterprise Data Quality?
If you need more memory, you may increase the values in the Talend-Studio-*.ini file located next to the executable file.
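For illustration, the memory settings in that file would typically look like the fragment below. This is only a sketch, assuming the file follows the usual Eclipse launcher .ini layout (JVM options listed one per line after -vmargs); the values shown are examples, not recommendations, and should be sized to the RAM available on your machine.

```
-vmargs
-Xms512m
-Xmx2048m
```

Here -Xms is the initial Java heap size and -Xmx is the maximum heap size. The studio has to be restarted for a change to take effect.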
For the common tool, you may have a look at Talend Open Studio for MDM, but you'll still have some manual work to do.
Another solution is our subscription product Talend Enterprise Data Quality which combines both the profiling and the ETL jobs for data cleansing.
As for analyzing the results of an existing analysis, it could probably be done using XML components in Talend Open Studio for DI, but that may not be easy.
Again, Talend Enterprise Data Quality stores all results and provides reports and tools to further analyze your analysis results.
Is it possible to do an analysis on the result set of an existing analysis?
Your response is much appreciated.
I removed some indicators which I didn't need, and now I am able to import the data into a CSV file. Thank you very much for your help.
As I asked before, is there a common tool with DI and DQ integrated?
Something that lets me upload the analysed data to a database without writing it to an external file and then running a job to upload it explicitly?
Please find the screenshot attached.
It's a memory issue.
Some indicators heavily use the memory.
Can you tell me which indicator you used in the analysis?
(you may upload a screen capture of your analysis setting)
Thanks a lot for the link.
Coming back to my original problem: when I increased the number of rows to 314130 and tried to run the analysis, Talend DQ hung with the following error in the console:
61653 [Thread-4] INFO org.talend.dq.analysis.AnalysisExecutor - Connection tonull
61680 [Thread-4] INFO org.talend.dataquality.indicators.impl.RegexpMatchingIndicatorImpl - Preparing regular expression matching indicator Random_sequence_of_text with regular expression: [A-Za-z0-9]\.html
Exception in thread "State Data Manager" java.lang.OutOfMemoryError: Java heap space
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "State Data Manager"
Can you please advise me on how to get rid of this error?
Yes, you can do that.
There are a lot of tutorials at http://www.talendforge.org/tutorials/menu.php
Have a look and enjoy.