Topic review (newest first)

Thalla Sandeep
2012-04-18 05:10:36

Hi Gabriel,

Thank you once again for your reply.
So, we can extract the text using a script built on the POI API. Could you please mail or post the procedure to create a sample job? That would be a great help to me.


Regards,
Sandeep.

gusto2
2012-04-17 14:42:47

Hi Sandeep,

then I'd create a script using the POI API (or any Word manipulation API, e.g. Lucene) to extract the document's body as clear text. (I usually deploy all my routines as web services; it is easier and more accessible than trying to make a new Talend component.) And then:

- for every document (tFileList)
- extract content as clear text (tSSH, tWebService) into a temporary file
  - read per row (tFileInputFullRow)
- check if the file contains the searched string (tFilterRow)
  - read other rows necessary (tFileInputRegex)

But there is no out-of-the-box Talend component to extract clear text from a Word document. In theory, you could reuse a WordExtractor from the Lucene project (it uses POI as well).

Gabriel
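
A minimal sketch of the "read per row, then filter" part of the steps above, written in plain Java so it could sit in a routine or be tested outside Talend. It assumes the Word document has already been converted to clear text by some other step (for example a POI-based script); the class name `LineFilter` and the method `matchingLines` are hypothetical, not part of Talend or POI.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: mimics tFileInputFullRow + tFilterRow on text
// that was already extracted from the Word document.
class LineFilter {

    // Splits the extracted text into rows and keeps only the rows that
    // contain the search string, ignoring case.
    static List<String> matchingLines(String extractedText, String search) {
        String needle = search.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        // \R matches any line break (\n, \r\n, \r)
        for (String line : extractedText.split("\\R")) {
            if (line.toLowerCase(Locale.ROOT).contains(needle)) {
                hits.add(line.trim());
            }
        }
        return hits;
    }
}
```

Calling `LineFilter.matchingLines(text, "name")` returns the rows worth passing on to a regex step; an empty list means the document can be skipped.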

Thalla Sandeep
2012-04-17 13:32:41

Hi Gabriel,

First of all thank you for your reply.
I have a requirement to read data from a Microsoft Word file.
I am well aware that a Word file is unstructured, but I just want to match a pattern in the file and read the data next to it.

For example:

Name : kathi
Place : USA

with a specified delimiter.

I want to match the key "name" and read the value "kathi" in TOS.

Regards,
Sandeep.
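
The kind of match described in this post can be sketched with a regular expression once the document text is available as a plain string. This is only an illustration under assumptions: the text has already been extracted from the Word file, each key sits on its own line, and the class name `KeyValueReader` and method `valueFor` are hypothetical.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: finds "key <delimiter> value" on a single line.
class KeyValueReader {

    // Returns the trimmed value that follows the key and the delimiter on
    // the first matching line, or an empty Optional if the key is absent.
    static Optional<String> valueFor(String text, String key, String delimiter) {
        Pattern pattern = Pattern.compile(
                "^[ \\t]*" + Pattern.quote(key)          // key at line start
                + "[ \\t]*" + Pattern.quote(delimiter)   // then the delimiter
                + "[ \\t]*(.*?)[ \\t]*$",                // then the value
                Pattern.MULTILINE | Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(text);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }
}
```

With the sample above, `KeyValueReader.valueFor(text, "name", ":")` yields the value `kathi`. The key and delimiter are quoted, so characters such as `|` or `.` in a delimiter are matched literally.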

gusto2
2012-04-17 13:14:13

Hi,

there is a discussion on LinkedIn about this topic (or was it you who wrote the question?): http://www.linkedin.com/groupItem?view= … gmp_812977

Still, I say the problem with a Word document is that it is unstructured. I mean, it can contain tables, text, images, links, headers, other documents... You can read data from an Excel sheet because there, at least, you have tables. So you cannot go directly from a Word doc; you need a step to extract any structured information. In theory, you could create a script to save your Word document as clear text, but wouldn't you lose some information?

If you know what is in the Word document, e.g. CSV (comma-separated values), you can use the POI API or Visual Basic to extract the data from Word, usually as delimited values (CSV), and then use Talend to do something useful with the data.

Carpe diem
Gabriel

Thalla Sandeep
2012-04-17 12:55:01

Hi Talend Team,

I just want to read some data from a Word file.
Is there any direct component that can read a Word file?
Or is there any other way to do it?

Regards,
Sandeep.
