#1 2012-05-24 12:02:40

jonathanbowen
Member
Registered: 2011-05-31
Posts: 21

Umlauts in UTF-8

Tags: [UTF-8, xml]

Hello

I'm transforming an XML file. The source file has umlaut characters that have been converted to their UTF-8 equivalents. For example, the source file contains:

<TOWNCITY>Düsseldorf</TOWNCITY>

When I transform this into a new XML format, TOS converts this to:

<TOWNCITY>Düsseldorf</TOWNCITY>

I would like to preserve the original, but I can't figure it out. Both the source and the output file are configured to be UTF-8 encoding.

Any ideas how I can achieve this?

Thanks for your help.

#2 2012-05-24 12:03:51

jonathanbowen
Member
Registered: 2011-05-31
Posts: 21

Re: Umlauts in UTF-8

Hmmm.... the character has been converted in the post too. So the original should be:

<TOWNCITY>D & # 2 5 2 ; sseldorf</TOWNCITY>

With spaces, so that it does not convert


Last edited by jonathanbowen (2012-05-24 12:06:13)

#3 2012-05-24 12:45:26

avdbrink
Member
Company: Conspect Consulting & ICT
Registered: 2010-11-08
Posts: 360

Re: Umlauts in UTF-8

Hi Jonathan,

I think the output of the job should be in the format you specify in the output component. So if your input contains UTF-8 and you read this into Talend it will convert it to an internal format, but when exporting, you should be able to select the desired format again, UTF-8 for example. This should give you a file or table with the correct data.
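To illustrate the "internal format" point, here is a minimal Java sketch of what a generic XML parser and serializer do with a numeric character reference. It is not Talend's actual code: the class and method names (CharRefRoundTrip, parse, serialize) are made up for the example, and it only uses the standard JDK XML APIs. The parser resolves the reference to the character itself, and the serializer then writes that character literally whenever the output encoding can represent it.

```java
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class for illustration only; not part of Talend.
final class CharRefRoundTrip {

    // Parses an XML string into a DOM. Numeric character references are
    // resolved to the characters they stand for at this point.
    public static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Serializes the DOM using the given output encoding and returns the
    // result decoded back into a String.
    public static String serialize(Document doc, String encoding) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.ENCODING, encoding);
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString(encoding);
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialize XML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<TOWNCITY>D&#252;sseldorf</TOWNCITY>");
        // With UTF-8 the character is representable, so it is written literally.
        System.out.println(serialize(doc, "UTF-8"));
        // With US-ASCII it is not representable, so the serializer has to
        // fall back to a numeric character reference.
        System.out.println(serialize(doc, "US-ASCII"));
    }
}
```

So with a plain JDK serializer, an ASCII output encoding is one way to get the references back. Whether a Talend output component behaves the same way when its encoding is set to something other than UTF-8 is an assumption I have not verified.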

Hope this helps.

Regards,
Arno

#4 2012-05-24 12:53:19

jonathanbowen
Member
Registered: 2011-05-31
Posts: 21

Re: Umlauts in UTF-8

Arno

Thanks for the reply. I'm doing as you suggest - the source file is read as UTF-8 and the output I create is also UTF-8, but Talend is still converting the data to the umlaut character. Maybe it's a bug - I can't find any configuration parameters that will change this.

Jonathan

#5 2012-05-24 13:00:12

janhess
Member
Company: Newcastle University
Registered: 2009-05-19
Posts: 1137

Re: Umlauts in UTF-8

If it's a bug, you could get round it by replacing the characters in a tMap or tReplace, but that will probably affect a number of characters.
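As a rough sketch of that kind of replacement, the routine below turns every non-ASCII character into a decimal numeric character reference, so it covers all affected characters rather than just the umlauts. The class and method names (NumericRefs, escapeNonAscii) are hypothetical; in Talend something like this would presumably live in a user routine called from a tMap expression, which is an assumption on my part.

```java
// Hypothetical routine for illustration only; not an existing Talend routine.
final class NumericRefs {

    // Replaces every code point above the ASCII range with a decimal
    // numeric character reference such as &#252; and leaves ASCII untouched.
    public static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> {
            if (cp < 128) {
                out.append((char) cp);
            } else {
                out.append("&#").append(cp).append(';');
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // \u00fc is the u-umlaut; written as an escape so the source file stays ASCII.
        System.out.println(escapeNonAscii("D\u00fcsseldorf"));
    }
}
```

One caveat: if the result is then passed to a component that writes XML, that writer may escape the ampersand again (giving &amp;#252;) or otherwise undo the replacement, so this is only likely to help if the value is written out as raw text.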

#6 2012-05-24 13:48:00

jonathanbowen
Member
Registered: 2011-05-31
Posts: 21

Re: Umlauts in UTF-8

Yes - I tried to post process it with a tReplace - no luck with this either I'm afraid - it still converts back to the umlaut character.
