Problem Statement 1: We are working on a project where we need to read only the incremental data from an HBase table (Cloudera 4.6). A background process runs 24/7 and keeps loading data into that table.
Approach: We are using a tHBaseInput component to read from the HBase table, but we cannot find any filter where we can supply a timestamp value so that the component reads only the data loaded since the last run.
I am not sure whether I am missing something on the component or whether this is a limitation of tHBaseInput. I am using Talend Big Data 5.4.2.
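For anyone comparing against the plain HBase Java client: there, a time-bounded read is normally done with `Scan.setTimeRange(minStamp, maxStamp)`, which matches cells whose timestamp falls in the half-open range [minStamp, maxStamp). The sketch below is not Talend or HBase code. It only shows the bookkeeping an incremental job would need between runs, using the same half-open convention. The class name `IncrementalWindow` and its methods are hypothetical, and where the watermark is stored between runs is left open.

```java
/**
 * Hypothetical helper (not part of Talend or HBase): tracks the watermark
 * between runs and hands out half-open windows [minStamp, maxStamp).
 */
final class IncrementalWindow {
    private long watermarkMillis; // upper bound of the previous run

    IncrementalWindow(long initialWatermarkMillis) {
        if (initialWatermarkMillis < 0) {
            throw new IllegalArgumentException("watermark must be >= 0");
        }
        this.watermarkMillis = initialWatermarkMillis;
    }

    /**
     * Returns {minStamp, maxStamp} for the next run and advances the watermark.
     * The window is half-open, so a cell stamped exactly at maxStamp falls
     * into the following run's window rather than into both.
     */
    long[] nextWindow(long nowMillis) {
        if (nowMillis < watermarkMillis) {
            throw new IllegalArgumentException("clock moved backwards");
        }
        long[] window = { watermarkMillis, nowMillis };
        watermarkMillis = nowMillis;
        return window;
    }

    long watermark() {
        return watermarkMillis;
    }
}
```

With the HBase client on the classpath, the two values would be passed as `scan.setTimeRange(window[0], window[1])`. This assumes the cell timestamps reflect load time, which is the case when the writer does not set them explicitly.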
Problem Statement 2: By default tHBaseInput uses a scan to fetch data from the HBase table, and the cache size of this scan object is set to 1, which means the map task calls back to the region server for every record processed. Because of this, tHBaseInput takes a long time to read from the HBase table (30 minutes for 1 lakh, i.e. 100,000, records). We tried the same read in Java by creating a new scan object with the cache size set to 1000, and we were able to read 1 lakh records in just 2 minutes.
Is there any property in tHBaseInput that lets us increase the default cache size for the scan?
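To put rough numbers on the difference: in the plain HBase Java client the per-scan value is set with `Scan.setCaching(int)`, and the client-wide default comes from the `hbase.client.scanner.caching` property. Whether tHBaseInput passes either of these through is not something I can confirm. The sketch below is only a back-of-the-envelope model of how many fetch calls a scan needs for a given caching value; the class name `ScanCachingEstimate` is hypothetical, and the model ignores the extra calls for opening and closing the scanner.

```java
/**
 * Hypothetical estimate (not HBase code): number of fetch round trips
 * needed when each call returns at most `caching` rows.
 */
final class ScanCachingEstimate {
    private ScanCachingEstimate() {
    }

    static long roundTrips(long rows, int caching) {
        if (rows < 0 || caching < 1) {
            throw new IllegalArgumentException("rows must be >= 0 and caching >= 1");
        }
        // Ceiling division: a partially filled last batch still costs one call.
        return (rows + caching - 1) / caching;
    }
}
```

For 100,000 rows, `roundTrips(100000L, 1)` gives 100,000 calls while `roundTrips(100000L, 1000)` gives 100, which is in line with the large gap between the 30 minute and 2 minute runs described above.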
1.) Is there a timestamp field in your table?
2.) Try adding the relevant property in the Advanced settings panel of the tHBaseInput component.