[2015-12-21] Talend Blog: 2016 Predictions – 4 Ways Big Data & Analytics Will Impact Every Business

It’s hard to believe that 2016—the year that marks Talend’s 10-year anniversary—is right around the corner. If our society and businesses were ruled by the predictions of the film industry, we’d all have flying cars and drones walking our dogs. Granted, we’ve made great progress in terms of fuel-efficient electric vehicles, but we’re still not flying. More immediately, there are definitely a few emerging technologies that will have a profound impact on businesses and society at large in 2016. Here are a few of my ‘bets’:

Real-time Analytics Will Take Center Stage

Among all the technological innovations emerging, real-time big data analytics will absolutely be the most disruptive force in 2016. This kind of instantly actionable insight, as opposed to rear-view-mirror data analysis, is no longer just an option—it’s a necessity—particularly given the pace at which we move both as consumers and as businesses. We want relevant, personalized information now! Luckily, this type of data integration and processing power is no longer available only to the ‘behemoth’ cloud vendors—the likes of Netflix, Google or Amazon—it’s becoming mainstream. In 2016, companies of all sizes across all industries will be able to embrace opportunities that were previously unimaginable, such as improving patient care, increasing crop yields to help feed the world and, overall, making more informed business decisions.

New Threats Will Emerge from Unforeseen Areas, Increasing the Need for Customer Focus

In the age of real-time big data, as previously unattainable opportunities finally become tangible, new business challenges will also emerge. Tremendous competitive threats will establish themselves – and the biggest threats can now come from outside an organization’s core industry. Even organizations that aren’t even tangentially related to your business, and those you never would have imagined competing in your space, will start infringing on your market share. So companies need to be able to analyze data, anticipate these emerging threats and devise ways to not only combat them, but also re-evaluate and re-invigorate their customer engagement process in order to retain customer loyalty.

For years, businesses have been working on becoming more customer-centric. However, for the most part, their customers never see the payoff from those investments, and in today’s real-time big data age, when it comes to customer experience, ‘good’ is no longer enough. With the arrival of new real-time big data technologies in 2016, more companies will be able to actually impact the customer experience where it matters most – in the moment. Businesses will be able to leverage technology to deliver personalized information, incentives and service that add up to a better overall customer experience. Treating the everyday consumer as though they are ‘a celebrity’ is something every business owner should strive for, and with the application of real-time big data, customers will, for the first time, actually notice the difference.

CIO Turnover Will Accelerate

The gap between winning and losing CIOs will widen dramatically in 2016. Those that have been pioneering the move to the cloud and big data will move from pilots to production and will generate game-changing insights for the business. Those that have not will be exposed and pushed out as their companies fall behind their competition. Organizations that have already built their big data platform will have a dramatic head start in the 2016 big data sprint. With the advent of Spark and Spark Streaming, they’ll be able to unlock the true potential of their investments in building data lakes and warehouses on Hadoop. Big data pioneers will see those investments pay off in 2016.

As this gap widens, the demand for CIO talent will heighten – starting a CIO talent war with the weak being exposed and the strong being snatched up. At Talend Connect, we recognized exemplary organizations that are entering 2016 at the forefront of innovative data integration. These leaders used new ways to channel ever-growing volumes of data into actionable information that is not only improving their businesses, but in many instances, also benefiting the broader population. Luckily for those feeling behind the eight ball, there are now data integration technologies available that make it easy to deploy Spark capabilities quickly—meaning you still have the opportunity to catch up.

Businesses Will Retool

Now that we’ve identified real-time big data as the game-changing technology that will have a profound impact in 2016 and discussed the consequences of not keeping up, it’s time to address how businesses can ensure that they stay ahead of the curve.

The age of big data is causing businesses to rethink their organizational structures. Real-time big data is breaking down the barriers of traditional business best practices and structures – and the dynamic of “Business vs. IT” will give way to a new dynamic of “Business + IT = Innovation Powerhouse.” The companies that will win are those that figure out how business and IT can partner and succeed. Cross-functional centers of innovation must emerge – led by CEOs, CIOs, CDOs and new Chief Marketing Technology Officers (CMTOs) working together to merge their skillsets. These information SWAT teams will be able to turn insight into revenue and drive inroads into new markets never before thought possible, while remaining compliant with all security and privacy regulations. Silos must be broken down within company walls to take real-time big data to the next level in 2016 and make it a year of success for YOUR organization.

Looking forward to the year ahead and the innovations to come!

[2015-12-17] Talend Blog: Spoiler Alert! Talend 6.1 Hits the ‘Big Screen’

Back in September I talked about how excited I was about the new Star Wars movie coming out. Well, that day is upon us. Yes, Thursday, Dec 17, is the pre-ordained day in my part of the world (Ireland). Have you ever been hit by a spoiler? You know, an older sister who breaks the news about Santa Claus? A colleague who lets the ending of The Sixth Sense slip out, or perhaps a friend who provides a way too obvious hint about Luke’s “relationship” to Vader before you have the chance to catch The Empire Strikes Back? Spoilers change everything, don’t they? True story: I never shed a tear at E.T. because my buddy told me the ending before I saw the movie. Did I hear you say “Robbed”?

So, never one to spill the beans, here’s a heads up that there are major spoiler alerts below – so if you don’t want to hear about the fantastic new features in Talend 6.1, please turn away now! 

Oh and when it comes to Star Wars, I have a ticket for Friday night, so mum’s the word.

Introducing Talend 6.1

Fresh on the heels of the Talend 6.0 release, team Talend has been busy creating new capabilities for the holidays. Introducing Talend 6.1, which further enables the data-driven enterprise and boldly goes where no other integration tools have gone before (sorry, couldn’t resist), delivering shiny new presents: machine learning capabilities, continuous delivery enhancements, and data masking on Spark.

What does this mean for you? It's much easier and faster to make your data applications more intelligent.

As discussed in previous blogs, there are growing amounts of data everywhere including the Internet of Things. Businesses need to become data-driven to survive or risk being marginalized. We are seeing the rise of data science and machine learning as core competencies in every data-driven organization. 

But how do you make deploying and updating models a scalable and repeatable development process? This is where Talend 6.1 comes in.

New Tricks in Time for the Holidays

Talend 6.1 provides an easier way to operationalize analytics, benefiting IT and data scientists alike. Developers use Talend’s pre-built components and drag-and-drop tools to build Spark analytics models (e.g. Random Forest, Logistic Regression, Clustering via K-Means) for customer segmentation, forecasting, classification, regression analysis and more. Behind the scenes, Talend provides the smart tools for data connectivity, transformation, and cleansing, so you spend less time wrangling your data (which can take up 50 to 80% of an analytics project) and more time gaining insight.
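
To make that concrete, here is a rough sketch of what a customer-segmentation job like the one described above boils down to, written directly against the open Spark ML pipeline API rather than generated by Talend. The input path, the column names and the choice of k=5 are illustrative assumptions only, and the built-in CSV reader shown is from current Spark releases:

    import org.apache.spark.ml.clustering.KMeans;
    import org.apache.spark.ml.clustering.KMeansModel;
    import org.apache.spark.ml.feature.VectorAssembler;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class CustomerSegmentation {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("customer-segmentation")
                    .getOrCreate();

            // Hypothetical input: one row per customer with recency/frequency/monetary columns.
            Dataset<Row> customers = spark.read()
                    .option("header", "true")
                    .option("inferSchema", "true")
                    .csv("hdfs:///demo/customers.csv");

            // Assemble the numeric columns into the single feature vector K-Means expects.
            VectorAssembler assembler = new VectorAssembler()
                    .setInputCols(new String[]{"recency", "frequency", "monetary"})
                    .setOutputCol("features");
            Dataset<Row> features = assembler.transform(customers);

            // Cluster the customers into five segments (k is a tuning choice, not a given).
            KMeans kmeans = new KMeans()
                    .setK(5)
                    .setSeed(42L)
                    .setFeaturesCol("features")
                    .setPredictionCol("segment");
            KMeansModel model = kmeans.fit(features);

            // Attach the segment id to every customer and persist the result.
            model.transform(features)
                    .write().mode("overwrite")
                    .parquet("hdfs:///demo/customer_segments");

            spark.stop();
        }
    }

In the Studio, the equivalent work is assembled from components instead of hand-written classes, but the shape of the pipeline is the same: prepare features, fit a model, apply it at scale.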

Combined with the Continuous Delivery support added in Talend 6 (and enhanced in Talend 6.1 with Git version control and Talend ESB process support), this lets developers rapidly deploy machine-learning algorithms. These algorithms can provide real-time operational insight so the systems and people that need it can act in the moment (e.g. a machine is about to fail, so shift the operational load; online credit card fraud is about to occur, so disable the account; or a shopping cart is about to be abandoned, so make another recommendation or offer). The recent Talend Data Master Awards highlight some of these use cases.

Data scientists can use these machine-learning algorithms to understand data and teach the model to make predictions, while IT has the ability to quickly deploy into production for “testing with live users”. The benefits of Continuous Delivery are fast, iterative development and maintenance cycles, a flow of information back to the data scientist for further refinement, and an overall more collaborative approach between data scientists and IT – something that is on every CIO’s holiday list.

Talend 6.1 also delivers a couple of other cool new presents, data masking on Spark and advanced support for Cloudera Navigator, which will bring joy to the security advocates and data quality developers on your shopping list.

As companies build vast data lakes, there is an increasing need to keep data private in order to protect against data breaches and meet compliance mandates. Data masking obfuscates your data (numbers, strings, dates, personally identifiable information and more) without breaking the rules and formats that surround that data, and without letting other users see the real values. By running on Spark, this can be done in-memory and at scale – sort of like delivering presents to 526 million kids!
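
Talend’s masking components are configured in the Studio rather than hand-coded, but the underlying idea of format-preserving obfuscation at Spark scale can be sketched with a simple user-defined function like the one below. The masking rule (digits become ‘9’, letters become ‘X’, punctuation and length preserved), the column names and the paths are illustrative assumptions only:

    import static org.apache.spark.sql.functions.callUDF;
    import static org.apache.spark.sql.functions.col;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.api.java.UDF1;
    import org.apache.spark.sql.types.DataTypes;

    public class MaskPersonalData {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("pii-masking")
                    .getOrCreate();

            // Format-preserving mask: digits become '9', letters become 'X';
            // punctuation and length are kept so downstream rules still apply.
            spark.udf().register("mask",
                    (UDF1<String, String>) value -> value == null
                            ? null
                            : value.replaceAll("[0-9]", "9").replaceAll("[A-Za-z]", "X"),
                    DataTypes.StringType);

            // Hypothetical customer table with sensitive columns.
            Dataset<Row> customers = spark.read().parquet("hdfs:///demo/customers");

            Dataset<Row> masked = customers
                    .withColumn("ssn", callUDF("mask", col("ssn")))
                    .withColumn("email", callUDF("mask", col("email")));

            // The masked copy can be shared broadly; the raw values stay in the source.
            masked.write().mode("overwrite").parquet("hdfs:///demo/customers_masked");

            spark.stop();
        }
    }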

A unique, first-to-market capability for Talend 6.1 and Cloudera Navigator users is data lineage support for Spark. This lets users trace data lineage for MapReduce and Spark down to the level of the schema defined by the developer in a data job, which is crucial for impact analysis.

These are just some of the Talend 6.1 highlights. Check out the Talend 6.1 webinar or Technical Note to see a demo and learn more.  Happy Holidays and may 2016 be a year of fresh, new actionable insight!

[2015-12-16] Talend Forum Announcement: Talend Open Studio's 6.1.1 release is available

Dear Community,

We are very pleased to announce that Talend Open Studio's 6.1.1 release is available. This general availability release for all users contains many new features and bug fixes.

Download Talend Open Studio for [Data Integration | Big Data | Data Quality | MDM | ESB]'s general availability release:

Data Integration: http://www.talend.com/download/data-integration
Big Data: http://www.talend.com/download/big-data
Data Quality: http://www.talend.com/download/data-quality
MDM: http://www.talend.com/download/mdm
ESB: http://www.talend.com/download/esb

You can also view Release Notes for this 6.1.1 version, detailing new features, through this link: http://www.talend.com/download/talend-open-studio
Find the latest release notes with these steps: [Data Integration | Big Data | Data Quality | MDM | ESB] product tab > at the bottom of the page under "User Manuals PDF" > find this version's release notes.

For more information on fixed bugs and new features, go to the TalendForge Bugtracker.


Thanks for being a part of our community,
The Talend Team.

[2015-12-15] Talend Blog: When it Comes To Big Data – Speed Matters

Talend vs Informatica – The Big Data Benchmark

If you’ve spoken to a Talend sales representative or read some of my team’s marketing material, then you’ve undoubtedly heard our claims that when it comes to Big Data, Talend offers some significant speed advantages over the competition.

As an example, here’s a slide we used as part of our Talend 6 media deck.

Concerned that some folks might dismiss this content as marketing hype, I thought it would make sense to create some more concrete evidence to substantiate our claims. We utilized the skills of MCG Global Services, a leader in information management, to conduct some benchmark tests on our behalf comparing Talend Big Data Integration against Informatica Big Data Edition.

I believe MCG did a really nice job designing the benchmark and defining a common set of use cases and questions that would be highly relevant to many organizations.

Questions included:

- What impact do customers’ views of pages and products on our website have on sales? How many page views occur before they make a purchase decision (whether online or in-store)? (Use Case 1)

- How do our coupon promotional campaigns impact our product sales or service utilization? Do customers who view or receive our coupon promotion come to our website and buy more, or buy additional products they might not otherwise purchase without the coupon? (Use Case 2)

- How much does our recommendation engine influence or drive product sales? Do customers tend to buy additional products based on these recommendations? (Use Case 3)

As you’ll note below, the benchmark confirms our speed advantage claims. If you are interested in a more detailed view of the conditions and outcomes of the benchmark, you may download the full benchmark here.

 

Here’s a snapshot of the overall gains with Talend and how they increase as data volumes rise.

In the case of Talend versus Informatica, it’s relatively straightforward to explain why the gap is so startling. Clearly, by leveraging the in-memory capabilities of Apache Spark, Talend users can integrate datasets at much faster rates. Spark uses fast Remote Procedure Calls for efficient task dispatching and scheduling. It also leverages a thread pool for the execution of tasks rather than a pool of Java Virtual Machine processes. This enables Spark to schedule and execute tasks at rates measured in milliseconds, whereas MapReduce scheduling takes seconds, and sometimes minutes, in busy clusters.
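
To make the in-memory point concrete, here is a minimal sketch (not the benchmark code itself) of the kind of work Use Case 1 describes: caching a clickstream dataset in memory and joining it with orders to count page views per purchasing customer. The table paths and column names are assumptions, and this is plain Spark code rather than Talend-generated code:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class PageViewsBeforePurchase {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("clickstream-vs-sales")
                    .getOrCreate();

            // Hypothetical inputs: the web clickstream and the order history.
            Dataset<Row> pageViews = spark.read().parquet("hdfs:///demo/page_views");
            Dataset<Row> orders = spark.read().parquet("hdfs:///demo/orders");

            // Keep the clickstream in memory; repeated joins and aggregations over it
            // avoid re-reading from disk, which is where Spark pulls ahead of MapReduce.
            pageViews.cache();

            // Count page views per customer who went on to purchase.
            Dataset<Row> buyers = orders.select("customer_id").distinct();
            Dataset<Row> viewsPerBuyer = pageViews
                    .join(buyers, "customer_id")
                    .groupBy("customer_id")
                    .count()
                    .withColumnRenamed("count", "page_views_before_purchase");

            viewsPerBuyer.write().mode("overwrite").parquet("hdfs:///demo/views_per_buyer");
            spark.stop();
        }
    }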

With Informatica Big Data Edition, which doesn’t support Spark directly, how Hive-on-Spark behaves and performs is up to the Hadoop engine and how it is configured.

Again, if you want to learn more about the benchmark tests, you may download the full report here.

[2015-12-14] Talend Blog: Six Things a Big Data Platform Should Offer

For companies to operate in a truly data-driven way, they need to understand the “what?”, the “how?” and the “what if?”. And not only with respect to their own enterprise data (a narrow view), but also the big data surrounding them (geolocation, social and sensor data). That is the only way for companies to gain a 360-degree view of their customers, their operations and the market, and to grasp the challenges and opportunities this brings for their business.

One of the biggest big data challenges for companies is integrating multiple data sources to gain actionable insight. According to IDC (International Data Corporation), gathering and preparing data for analysis typically accounts for 80 percent of the total time spent on an analytics project. That is an astonishing amount of time spent just preparing data so that you can glean some insight from it. In other words, data integration software is very likely THE key to the success of every big data project.

We have put together six simple questions that anyone looking to implement the most effective data integration platform for optimal insight from their data should ask:

  1. Is it intuitive to use? Ask to see the user interface. Is it simple or complicated? Does the application generate code automatically, or do you have to write it by hand? Can you accomplish tasks via drag and drop? Do the platform’s workflow and user interface feel like a single, unified product, or more like a patchwork of different applications?
  2. Does it bring your data together? Can the platform integrate all types of data (cloud, on-premises, Internet of Things, etc.), and can you work both in batch and in real time with one and the same solution?
  3. Does it make the most of Hadoop? Some tools require data to be processed and transformed before it is loaded into Hadoop. This not only slows projects down considerably, it also means you cannot take full advantage of Hadoop’s processing power.
  4. Is the platform up to date? Is the software based on open source, or is it proprietary? Open source solutions have proven to keep pace better with the rapid rate of big data innovation, and they allow you to stay flexible and respond more readily to business requirements.
  5. Is it fast? Does the platform use Spark and Spark Streaming on Hadoop for data processing? Or does it belong in the YARN (Yet Another Resource Negotiator) category?
  6. Is it cost-effective? What is the total cost of acquisition? Is it reasonable and based on the number of developers involved? Or is it calculated by data volume, connectors or CPUs?

We would love to hear your opinion: what is your biggest challenge in getting big data under control?

 

[2015-12-11] Talend Blog: What’s Next for IoT: 4 Things to Watch

What’s next for IoT?

There’s no doubt that there are a lot more “connected things” these days, and that means a lot more data. Specifically, technology is moving out of the consumer’s hands and into Healthcare, Oil & Gas, Transportation, Aviation and more. The spread of smart devices and sensors creates new forms of value and brings challenges for enterprises seeking to exploit this technology. However, while this boom in data has the potential to advance the industrial space in ways never before thought possible, few companies today have the right technologies or the right business models in place to truly harness the power of IoT.

So what should enterprises know about the permeation of sensors and connected devices and how can they prepare now or risk being left behind by their faster moving competitors? Here are 4 things to keep an eye on:

Smart Meters for a Smarter Planet

Consider the impact of smart meters (the kind that Talend Data Masters winners m2ocity and Springg are using) on data growth. Each single device typically generates about 400MB of data a year. Not much, right? Well, the numbers add up quickly! According to a recent Bloomberg report, 680 million smart meters are predicted to be installed around the world by 2017, leading to an estimated 280 petabytes of data generated by smart meters each year!
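
A quick back-of-the-envelope check shows how those two figures line up: 680 million meters × 400 MB per meter per year comes to roughly 272 petabytes per year, right in line with that ~280 PB estimate.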

How can this lead to a smarter planet you ask? Consider Springg, an international software company specializing in agricultural data transport, software development and sensor technologies. Utilizing Talend software, Springg is able to evaluate data collected from sensors used in its mobile laboratories around the globe to measure soil elements. Based on these insights, Springg can give farmers soil and fertilizer advice to help dramatically increase their yield targets—thereby helping those farmers feed the world. 

Data Up in the Sky

Watching for another area of machine-to-machine data growth? Try looking up in the air. A report from Wikibon found that the aviation industry is ripe for innovation in the form of Big Data analytics. An airplane flight (depending on length) can generate up to 40TB of data that will be used to analyze flying patterns, tune jet engine functions, identify new routes and reduce downtime. This is the type of data-driven decision making that has been revolutionizing eCommerce. Airlines with the ability to handle their big data in real time and turn that data into instant insight will ultimately gain the competitive advantage.

Take Air France/KLM, for example. Each of the airline’s A380 aircraft contains roughly 24,000 sensors, which generate 1.6 GB of data per flight through smart meter technology. The company utilizes this data to detect breakdowns before they occur. Using smart meter analytics technology, Air France can now detect potentially needed repairs 10 to 20 days before a failure occurs. This prevents immobilization of the aircraft, which is not only expensive for the company, but also impacts its overall level of customer service and revenue.

Machine Learning and Data Science

Data is the lifeblood of your business. You need to not only absorb as much of it as you can, but also analyze it and uncover the secrets it holds. Data science and machine learning are gaining adoption as companies become even more data-driven. Machine learning is a form of artificial intelligence where computers can learn and act to make decisions – in effect automating some of the data science tasks. With the sheer volume of data coming from the billions of Internet of Things (IoT) devices, automation is a good thing! Spark MLlib is gaining popularity with its many machine learning algorithms for customer segmentation, forecasting, classification, regression analysis and more. Incorporating machine learning into heavy data loads will be an important step for companies to make better use of the wave of information coming from connected devices – and find the needles in the IoT haystack.
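
For a sense of what such a model looks like in practice, here is a minimal sketch of a supervised “failure prediction” model trained on historical sensor readings, written directly against Spark’s ML pipeline API. The feature columns, the 0/1 “failed” label and the storage paths are hypothetical, and this is plain Spark code rather than anything Talend generates:

    import org.apache.spark.ml.classification.LogisticRegression;
    import org.apache.spark.ml.classification.LogisticRegressionModel;
    import org.apache.spark.ml.feature.VectorAssembler;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class FailurePrediction {
        public static void main(String[] args) throws Exception {
            SparkSession spark = SparkSession.builder()
                    .appName("iot-failure-prediction")
                    .getOrCreate();

            // Hypothetical training data: historical sensor readings with a 0/1 "failed" label.
            Dataset<Row> history = spark.read().parquet("hdfs:///demo/sensor_history");

            // Combine the raw sensor columns into the feature vector the model expects.
            VectorAssembler assembler = new VectorAssembler()
                    .setInputCols(new String[]{"temperature", "vibration", "pressure"})
                    .setOutputCol("features");

            LogisticRegression lr = new LogisticRegression()
                    .setLabelCol("failed")
                    .setFeaturesCol("features")
                    .setMaxIter(20);

            // Learn from history, then save the model so a streaming or batch job
            // can apply it to live readings later.
            LogisticRegressionModel model = lr.fit(assembler.transform(history));
            model.write().overwrite().save("hdfs:///demo/models/failure_lr");

            spark.stop();
        }
    }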

Operationalizing Analytics

Even though we now live in ‘the Golden Age’ of big data, what will likely surprise you is that most companies are NOT using their data to its full potential. In fact, a recent study by McKinsey showed that less than 1% of all IoT data is actually being used for decision making today. WHHHAAATT??

Why, you ask? Mostly because IoT data is being used for alarms or real-time control, not for optimization or predictive and prescriptive analytics. Also, there are many challenges in making machine learning a reality. Data has to first be organized and cleansed. It can take a long time (months!) to put a model into production—particularly when analytics models change frequently (requiring more updates) and there is a lot of hand-coding going on. Not exactly in the best interest of the data-driven firm. But there is a solution! Companies applying open, native technologies like Talend for real-time big data integration (which requires zero hand-coding and utilizes Spark, Spark Streaming and Spark machine learning) can start to get more insight from their data.
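
To illustrate what “operationalizing” looks like at the code level, here is a minimal Spark Streaming sketch that watches a live feed of sensor readings and flags anomalies as they arrive. It uses plain Spark code rather than Talend components, a local socket as a stand-in for a real message queue, and a simple threshold where a production job would apply a trained model; all of those choices are illustrative assumptions:

    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaDStream;
    import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    public class SensorAlertStream {
        public static void main(String[] args) throws InterruptedException {
            SparkConf conf = new SparkConf().setAppName("sensor-alerts").setMaster("local[2]");
            JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

            // Hypothetical feed: one "deviceId,temperature" line per reading on a local socket;
            // a real deployment would read from a message queue instead.
            JavaReceiverInputDStream<String> readings = jssc.socketTextStream("localhost", 9999);

            // A fixed threshold stands in for a trained model here; a production job
            // would score each micro-batch with the persisted model instead.
            JavaDStream<String> alerts = readings.filter(line -> {
                String[] parts = line.split(",");
                if (parts.length != 2) {
                    return false; // ignore malformed readings
                }
                try {
                    return Double.parseDouble(parts[1]) > 90.0;
                } catch (NumberFormatException e) {
                    return false;
                }
            });

            // Act in the moment: print the alert (or push it to whatever system shifts the load).
            alerts.print();

            jssc.start();
            jssc.awaitTermination();
        }
    }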

Join Us

Want to learn more about new capabilities in Talend 6.1 for machine learning and real-time big data? Check out our free on-demand webinar on December 17th and automatically be entered to win two movie tickets to ‘STAR WARS: The Force Awakens’, due out this week! Register here.

[2015-12-11] Talend Forum Announcement: New support portal

Hello All 

We are glad to announce that Talend has put in place a new ticket management system to replace the previous support platform. We would like to assure you that this new customer portal is easier to use and provides more features so as to streamline your support experience with us.  Some of the benefits include:

• The ability to save details about your production and test environments and share them with Talend support experts
• Options for updating and storing your preferred language, time zone, and contact information for expedited ticket resolution
• Improved SLA tracking to ensure faster ticket responses

This customer portal is the first step in building a new Talend community that makes it easier for you to get the answers and assistance you need. As part of our commitment to building a better experience, we will be releasing new features and functionality throughout 2016. Please click here for instructions on accessing the new support portal.
Frequently Asked Questions:

Q. Will I need to create a new account? 
A. If you have an existing account for the Talend support portal, your credentials will automatically be migrated. New or expired users will be required to re-register.
 
Q. Will I still have access to my existing cases? 
A. All open or closed cases from January 2014 to present will be migrated to the new system.

Q. How will I register new users for the Talend support portal? 
A. An email has been sent to existing customers providing instructions on how to register additional users.

The Talend Team

[2015-12-07] Talend Blog: Talend “Job Design Patterns” and Best Practices

Talend developers everywhere, from beginners to the very experienced, often deal with the same question: “What is the best way for me to write this job?”  We know it should be efficient, easy to read, easy to write, and above all (in most cases), easy to maintain.  We also know that the Talend Studio is a free-form ‘canvas’ upon which we ‘paint’ our code using a comprehensive and colorful palette of Components, Repository Objects, Metadata, and Linkage Options.  How then can we ever be sure that we’ve created a job design that follows best practices?

Job Design Patterns

Since version 3.4, when I started using Talend, job designs have been very important to me.  At first I did not think of patterns while developing my jobs; I had used Microsoft SSIS and other similar tools before, so a visual editor like Talend was not new to me.  Instead my focus centered on basic functionality, then code reusability, then canvas layout and finally naming conventions. Today, after developing hundreds of Talend jobs for a variety of use cases, I have found my code becoming more refined, more reusable, more consistent and, yes, patterns have started to emerge.

After joining Talend in January this year, I’ve had many opportunities to review jobs developed by our customers.  It confirmed my perception that for every developer there are indeed multiple solutions for each use case.  This, I believe, compounds the problem for many of us.  We developers do think alike, but just as often we believe our way is the best or the only way to develop a particular job.  Inherently we also know, quietly haunting us from our shoulder, whispering in our ear, that maybe, just maybe, there is a better way.  Hence we look or ask for best practices: in this case, Job Design Patterns!

Formulating the Basics

When I consider what is needed to achieve the best possible job code, fundamental precepts are always at work.  These come from years of experience making mistakes and improving upon success.  They represent important principles that create a solid foundation upon which to build code and should be (IMHO) taken very seriously; I believe them to include (in no particular order of importance):

- Readability: creating code that can be easily figured out and understood

- Writability: creating straightforward, simple, code in the least amount of time

- Maintainability: creating appropriate complexity with minimal impact from change

- Functionality: creating code that delivers on the requirements

- Reusability: creating sharable objects and atomic units of work

- Conformity: creating real discipline across teams, projects, repositories, and code

- Pliability: creating code that will bend but not break

- Scalability: creating elastic modules that adjust throughput on demand

- Consistency: creating commonality across everything

- Efficiency: creating optimized data flow and component utilization

- Compartmentation: creating atomic, focused modules that serve a single purpose

- Optimization: creating the most functionality with the least amount of code

- Performance: creating effective modules that provide the fastest throughput

Achieving a real balance across these precepts is the key; in particular the first three, as they are in constant contradiction with each other.  You can often get two while sacrificing the third.  Try ordering all of these by importance, if you can!

Guidelines NOT Standards ~ It’s about Discipline!

Before we can really dive into Job Design Patterns, and in conjunction with the basic precepts I’ve just illustrated, let’s make sure we understand some additional details that should be taken into account.  Often I find rigid standards in place that leave no room for the unexpected situations that poke holes in them.  I also find, far too often, the opposite: unyielding, unkempt, and incongruous code from different developers doing basically the same thing; or worse, developers propagating confusing clutters of disjointed, unplanned chaos.  Frankly, I find this sloppy and misguided, as it really does not take much effort to avoid.

For these and other fairly obvious reasons, I prefer first to craft and document ‘Guidelines’, not ‘Standards’.  These encompass the foundational precepts and attach specifics to them.  Once a ‘Development Guidelines’ document is created and adopted by all the teams involved in the SDLC (Software Development Life Cycle) process, the foundation supports structure, definition, and context.  Invest in this and, long term, you will get results that everyone is happy with!

Here is a proposed outline that you may utilize for yours (feel free to change/expand on this; heck it’s only a guideline!).

  1. Methodologies which should detail HOW you want to build things
    1. Data Modeling
      1. Holistic / Conceptual / Logical / Physical
      2. Database, NoSQL, EDW, Files
    2. SDLC Process Controls
      1. Waterfall or Agile/Scrum
      2. Requirements & Specifications
    3. Error Handling & Auditing
    4. Data Governance & Stewardship
  2. Technologies which should list TOOLS (internal & external) and how they interrelate
    1. OS & Infrastructure Topology
    2. DB Management Systems
    3. NoSQL Systems
    4. Encryption & Compression
    5. 3rd Party Software Integration
    6. Web Service Interfaces
    7. External Systems Interfaces
  3. Best Practices which should describe WHAT & WHEN particular guidelines are to be followed
    1. Environments (DEV/QA/UAT/PROD)
    2. Naming Conventions
    3. Projects & Jobs & Joblets
    4. Repository Objects
    5. Logging, Monitoring & Notifications
    6. Job Return Codes
    7. Code (Java) Routines
    8. Context Groups & Global Variables
    9. Database & NoSQL Connections
    10. Source/Target Data & Files Schemas
    11. Job Entry & Exit Points
    12. Job Workflow & Layout
    13. Component Utilization
    14. Parallelization
    15. Data Quality
    16. Parent/Child Jobs & Joblets
    17. Data Exchange Protocols
    18. Continuous Integration & Deployment
      1. Integrated Source Code Control (SVN/GIT)
      2. Release Management & Versioning
      3. Automated Testing
      4. Artifact Repository & Promotion
    19. Administration & Operations
      1. Configuration
      2. User Security & Authorizations
      3. Roles & Permissions
      4. Project Management
      5. Job Tasks, Schedules, & Triggers
    20. Archives & Disaster Recovery

Some additional documents I think should be developed and maintained include:

- Module Library: describing all reusable projects, methods, objects, joblets, & context groups

- Data Dictionary: describing all data schemas & related stored procedures

- Data Access Layer: describing all things pertinent to connecting to and manipulating data

Sure, creating documentation like this takes time, but the value over its lifetime far outweighs the cost.  Keep it simple, direct and up-to-date (it doesn’t need to be a manifesto), and it will make huge contributions to the success of all the projects that utilize it by dramatically reducing development mistakes (which can prove to be even more expensive).

Can We Talk About Job Design Patterns Now?

Sure!  But first: one more thing. It is my belief that every developer can develop both good and bad habits when writing code.  Building upon the good habits is vital.  Start out with some easy habits, like always giving every component a label. This makes code more readable and understandable (one of our foundational precepts).  Once everyone is making a habit of that, ensure that all jobs are thoughtfully organized into repository folders with meaningful names that make sense for your projects (yes, conformity).  Then have everyone adopt the same style of logging messages, perhaps using a common method wrapper around the System.out.println() function, and establish common entry/exit point criteria, with options for alternative requirements, for job code (both of these help realize several precepts at once).  Over time, as development teams adopt and utilize well-defined Development Guideline disciplines, project code becomes easier to read, to write, and (my favorite) to maintain by anyone on the team.
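
As one possible example of such a wrapper, the sketch below shows a shared logging routine (Talend code routines are plain Java classes in the ‘routines’ package) that every job could call instead of printing directly. The class name, method names and message format are simply one convention you might adopt, not anything prescribed by Talend:

    package routines;

    import java.text.SimpleDateFormat;
    import java.util.Date;

    /*
     * A shared logging routine: jobs call LogUtil.info/warn/error instead of
     * System.out.println directly, so the timestamp, level and job name are
     * always formatted the same way across every project.
     */
    public class LogUtil {

        public static void info(String jobName, String message) {
            write("INFO", jobName, message);
        }

        public static void warn(String jobName, String message) {
            write("WARN", jobName, message);
        }

        public static void error(String jobName, String message) {
            write("ERROR", jobName, message);
        }

        private static void write(String level, String jobName, String message) {
            // SimpleDateFormat is not thread-safe, so create it per call (cheap enough for logging).
            String timestamp = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(new Date());
            System.out.println(timestamp + " [" + level + "] " + jobName + " - " + message);
        }
    }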

Job Design Patterns & Best Practices

For me, Talend Job Design Patterns present us with proposed template or skeleton layouts that involve essential and/or required elements focused on a particular use case.  They are patterns because they can often be reused for similar job creation, thus jumpstarting the code development effort.  As you might expect, there are also common patterns that can be adopted across several different use cases which, when identified and implemented properly, strengthen the overall code base, condense effort, and reduce repetitive but similar code.  So, let’s start there.

Here are 7 Best Practices to consider:

Canvas Workflow & Layout

There are many ways to place components on the job canvas, and just as many ways to link them together.  My preference is fundamentally to start ‘top to bottom’, then work ‘left and right’, where a left-bound flow is generally an error path and a right- and/or downward-bound flow is the desired, or normal, path.  Avoiding link lines that cross over themselves wherever possible is good and, as of v6.0.1, the nicely curved link lines support this strategy quite well.

For me, I am uncomfortable with the ‘zig-zag’ pattern, where components are placed ‘left to right’ serially and then, once the flow reaches the rightmost edge boundary, the next component drops down and back to the left edge for more of the same.  I think this pattern is awkward and can be harder to maintain, but I get it (it’s easy to write).  Use this pattern if you must, but it may indicate that the job is doing more than it should or is not organized properly.

Atomic Job Modules ~ Parent/Child Jobs

Big jobs with lots of components are, simply put, just hard to understand and maintain.  Avoid this by breaking them down into smaller jobs, or units of work, wherever possible.  Then execute them as child jobs from a parent job (using the tRunJob component) whose purpose includes controlling and executing them.  This also creates the opportunity to handle errors better and control what happens next.  Remember, a cluttered job can be hard to understand, difficult to debug/fix, and almost impossible to maintain.  Simple, smaller jobs with a clear purpose make their intent jump off the canvas, are almost always easy to debug/fix, and make maintenance, comparatively, a breeze.

While it is perfectly acceptable to create nested Parent/Child job hierarchies, there are practical limitations to consider.  Depending upon job memory utilization, passed parameters, test/debug concerns, and parallelization techniques (described below), a good job design pattern should not exceed 3 nested levels of tRunJob Parent/Child calls.  While it is perhaps safe to go deeper, I think that, with good reasons, 5 levels should be more than enough for any use case.

tRunJob vs Joblets

The simple difference between a child job and a joblet is that a child job is ‘called’ from your job while a joblet is ‘included’ in your job.  Both offer the opportunity to create reusable and/or generic code modules.  A highly effective strategy in any Job Design Pattern is to properly incorporate their use.

Entry & Exit Points

All Talend Jobs need to start and end somewhere.  Talend provides two basic components, tPreJob and tPostJob, whose purpose is to help control what happens before and after the content of a job executes.  I think of these as the ‘Initialize’ and ‘WrapUp’ steps in my code.  They behave as you might expect: the tPreJob executes first, then the real code gets executed, and finally the tPostJob code executes.  Note that the tPostJob code will execute regardless of whether any devised exit within the code body (like a tDie component, or a component checkbox option to ‘die on error’) is encountered.

Using the tWarn and tDie components should also be part of your consideration for job entry and exit points.  These components provide programmable control over where and how a job should complete.  They also support improved error handling, logging, and recovery opportunities.

One thing I like to do for this job design pattern is to use the tPreJob to initialize context variables, establish connections, and log important information, and to use the tPostJob to close connections, do other important cleanup, and log some more.  Fairly straightforward, right?  Do you do this?

Error Handling & Logging

This is very important, perhaps critical, and if you create a common job design pattern properly, a highly reusable mechanism can be established across almost all your projects.  My pattern is to create a ‘logPROCESSING’ joblet for a consistent, maintainable logging processor that can be included in any job, PLUS to incorporate well-defined ‘Return Codes’ that offer conformity, reusability, and high efficiency.  It was easy to write, is easy to read, and yes, quite easy to maintain.  I believe that once you’ve developed ‘your way’ of handling and logging errors across your project jobs, there will be a smile on your face a mile wide.  Adapt and adopt!
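
A natural companion to that logging joblet is a single shared definition of the agreed ‘Return Codes’, so every tDie and tWarn across a project uses the same values. The sketch below is just one illustrative convention written as a Talend-style code routine; the specific codes and names are assumptions, not Talend defaults:

    package routines;

    /*
     * One shared definition of job return codes, referenced by every tDie and tWarn,
     * so "what does exit code 102 mean?" has exactly one answer across the project.
     */
    public class ReturnCodes {
        public static final int SUCCESS            = 0;   // normal completion
        public static final int GENERAL_FAILURE    = 1;   // unclassified error
        public static final int BAD_INPUT_DATA     = 101; // source data failed validation
        public static final int CONNECTION_FAILURE = 102; // database, file or service unreachable
        public static final int MISSING_DEPENDENCY = 103; // required parent/child artifact not found
    }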

Recent versions of Talend have added support for Log4j and a log server.  Simply enable the Project Settings > Log4j menu option and configure the Logstash server in the TAC.  Incorporating this basic functionality into your jobs is definitely a good practice!

OnSubJobOK/ERROR vs OnComponentOK/ERROR (& Run If) Component Links

It can sometimes be a bit confusing to any Talend developer what the differences between the ‘On Subjob’ and ‘On Component’ links are.  The ‘OK’ versus ‘ERROR’ part is obvious.  So what are the differences between these ‘trigger connections’, and how do they affect a job design flow?

‘Trigger connections’ between components define the processing sequence and data flow where dependencies between components exist within a subjob.  Subjobs are characterized by a component having one or more components linked to it that deal with the current data flow.  Multiple subjobs can exist within a single job, and each is visualized by default with a blue highlighted box (which can be toggled on/off from the toolbar) around all the related subjob components.

An ‘On Subjob OK/ERROR’ trigger will continue the process to the next ‘linked’ subjob after all components within the subjob have completed processing.  This should be used only from the starting component in the subjob.  An ‘On Component OK/ERROR’ trigger will continue the process to the next ‘linked’ component after that particular component has completed processing.  A ‘Run If’ trigger can be quite useful when the continuation of the process to the next ‘linked’ component is based upon a programmable Java expression.

What is a Job Loop?

Significant to almost every Job Design Pattern are the ‘Main Loop’ and any ‘Secondary Loops’ in the code.  These are the points where the potential exit of a job’s execution is controlled.  The ‘Main Loop’ is generally represented by the top-most processing of a data flow result set; once it completes, the job is finished.  ‘Secondary Loops’ are nested within a higher-order loop and often require considerable control to ensure a job’s proper exit.  I always identify the ‘Main Loop’ and ensure that I add a tWarn and a tDie component to the controlling component.  The tDie is usually set to exit the JVM immediately (but note that even then the tPostJob code will execute).  These top-level exit points use a simple ‘0’ for success and ‘1’ for failure as return codes, but following your established ‘Return Codes’ guideline is best.  ‘Secondary Loops’ (and other critical components in the flow) are great places to incorporate additional tWarn and tDie components (where the tDie is NOT set to exit the JVM immediately).

Most of the Job Design Pattern best practices discussed above are illustrated below.  Notice that, while I’ve adopted useful component labels, even I’ve bent the rules a bit on component placement.  Regardless, the result is a highly readable, maintainable job that was fairly easy to write.

Conclusion

Well ~ I can’t say that all your questions about Job Design Patterns have been answered here; probably not in fact.  But it’s a start!  We’ve covered some fundamentals and proffered a direction and end game.  Hopefully it has been useful and provokes some insightful considerations for you, my gentle reader.

Clearly I’ll need to write another Blog (or perhaps a few) on this topic to cover everything.  The next one will focus on some valuable advanced topics and several Use Cases that we all are likely to encounter in some form.  Additionally the Customer Success Architecture team is working on some sample Talend code to support these use cases.  These will be available in the Talend Help Center for subscribed customers fairly soon.  Stay on the lookout for them.

[2015-12-01] Talend Blog: IT stuff for free! – 3 Zero-Cost Integration Projects

According to Gartner forecasts, IT spend is likely to be close to $4 trillion in 2015. It’s probably very welcome news, then, to hear that you can still complete some IT projects for free. Case in point: data integration. There are free, open solutions available that are a great alternative to either hand coding all your data integration connections (a time-consuming process to say the least) or spending tens of thousands of dollars on tools that you may or may not need, like, or want on an ongoing basis.

Open Source Integration is a great way to get your feet wet if you are new to the discipline or want to experiment with tools before you buy them. 

Here are three projects that you can complete for free today with Open Source Integration (OSI) software: 

1. Build a data warehouse

2. Perform data synchronization tasks

3. Conduct data migration projects

1. Data warehouse: Using OSI you can build a straightforward data warehouse by adding the necessary dimension and fact tables, aggregating data and building full data warehousing functionality (including a database and repository).  With OSI it’s a simple matter of pulling data into the warehouse, no matter where the data is located. In doing so, you actually sidestep the need to spend megabucks on conventional approaches, and build a real-world, fully functioning warehouse for free! Geez, you may end up getting a raise for being so smart.

Bonus: have fun as you learn! For example, did you ever think you might want to create your own fantasy football league? Well, you could by using a self-created data warehouse.

  OR

For a more work-related alternative, you may want to do some interesting web analysis by building a data warehouse on visitor traffic using a list of IP addresses and some web-log data.

2. Data synchronization: In this scenario, you might be trying to synchronize simple pieces of data between applications or from an application to an operational data store (ODS). Often, these are modest projects involving just a few fields. Rather than spending hundreds of thousands of dollars on high-powered tools, you can easily use Open Source Integration to integrate fields or tables from other repositories or between applications. You can also use it to move the same content from an ODS into other applications and—did I mention?—all of this is free. For example, for optimal time-to-market efficiency, you might need to synchronize data to analyze different vendors’ contributions to your supply chain.  Many SaaS applications allow you to do such synchronizations on some data; however, you may need to include more data that is not accessible through, or included in, their automatic methods.  OSI can easily help you access these ‘locked-away’ tables and fields that hold the information you need.

3. Data migration: This is an example of a project where you need to migrate data from an old application to an updated version – a short-lived effort that doesn’t require complex software tools. In this case, some of the open integration tools can save you significant amounts of both time and money. Perhaps you need to switch from one Salesforce Automation system to another. In this instance both are SaaS applications, but this is still not a 1-to-1 movement of data, as I am sure you may have experienced.  There are always data model differences and, even better, there are always different lists of values and data values that require some CRAZY data transformation. You can get simple migration projects wrapped up quickly and for free with Open Source Integration. You are able to connect to all versions of the software involved and deal with any type of transformation or manipulation of data from one application to another. This would obviously be an extremely time-consuming process if you attempted to do it by hand or with scripts.

So there you have it: three great integration jobs you can complete quickly and easily with open source integration (from Talend in this instance). And, with a price tag of zero, what do you really have to lose?

[2015-11-30] Talend Blog: Explore the Talend 6 Studio and Its Exciting Productivity Features

In Talend 6, I must really applaud our teams for the new and exciting productivity features added to the Talend Studio. Just check the look and feel of the Studio: it has been significantly revamped with a new design and many new features that make development much faster. Take a look at the faceted search that lets you find components more quickly, or at how the ways to create transitions between your components have been improved. Now that's productivity!

So with that, I’ll leave you to enjoy our video that introduces the Talend 6 Studio. It’s one of the many how-to videos that we are currently posting here, so check back often.

Remember, if you don’t already have Talend installed, you can still play along with the video by downloading our free trial.

[2015-11-25] Talend Blog: Creating the Golden Record that Makes Every Click Personal

Gartner has just released its annual “Magic Quadrant for Master Data Management of Customer Data Solutions”.

You can download a free copy of the report from our web site. This feels like the perfect time to share some thoughts on the market trends, highlight some of the best practices from Talend MDM customers, and introduce the latest advancements in the Talend platform for turning customer data into business outcomes.

_____________________

The need for a single view of the customer is not new, but it has reached a new level of criticality and urgency. Most customer-facing lines of business now understand the importance of being data-driven, and are figuring out how this mandate can effectively translate into increased sales, better customer retention and reduced costs. They need to use the new digital tools that are invading our business environment and leverage them to change the way they attract, sell to, serve, and interact with their customers.

Without question, customer data is the fuel to make this happen. Unfortunately, most companies have so far failed to build their “fuel refinery”, and this hits them hard. In its data quality survey, Experian found that 77 percent of companies believe their bottom line is affected by inaccurate and incomplete contact data, while respondents indicate that 12 percent of revenue is wasted as a result. Sales and marketing professionals consider that the rate of inaccurate data has increased to a staggering 32 percent, a situation that tends to get worse over the years.

So it is no surprise that Gartner estimates that overall software revenue from the MDM of customer data solutions market segment came to $500 million in 2014, an increase of 10.3% from 2013[1]. Although Talend experienced much higher growth rates across all MDM domains, we also see MDM of Customer Data as our most dynamic sub-market. Additionally, Customer Data was a demand booster across our product lines, especially Big Data, Data Quality and Data Integration. Knowing the customer through deeper analytics, connecting with them across different channels, interacting with them through personalized content and offers, and protecting sensitive data against breaches are extremely hot topics in any industry today.

Turning customer data into outcomes across industries

In the context of our educational campaigns around the MDM of Customer Data, we focused on documenting our customer success in vertical markets. Here are four representative examples:

The travel and tourism industry is undergoing considerable change, now that we as consumers have access to tons of travel information online through providers such as TripAdvisor, Booking.com or Airbnb. Traditional players need to differentiate themselves from those low-cost self-service options by delivering superior customer service and developing long-term relationships and repeat business through personalized interactions. MDM of Customer Data allows TUI, the world’s number one leisure tourism business, to add value to customer interactions across the many touch points that make up a customer journey, from search to booking, and then all along the traveling experience. Take a look at this webinar or this success story to learn more about their takeaways.

Do you remember the days when you used to visit your bank’s branches on a regular basis, even for basic transactions such as a withdrawal? Now our banking experience has turned multi-channel, from web to phone, from mobile to ATM to branch. But we can see that our banks struggle to provide us the right offer at the right time and are increasingly at risk of customer churn. Kiva Group, a provider of CRM systems to the banking industry, is leveraging Talend MDM to face this challenge by providing customer context across the different transactions and touchpoints, from ATM to branch or contact center. See or read how they made it happen.

In the professional services and software industry, delivering targeted messages to customers has become a must, not an option, now that the mailbox of each and every potential B2B customer is overloaded. In addition, many companies in this industry grow through acquisitions, mandating best practices to rapidly reconcile customer databases from different sources into a single view.  This is what Ellie Mae, another example of a company that serves banks, credit unions and mortgage companies, achieved with Talend MDM.

Personalization is the next big thing in the healthcare industry. Accolade’s mission is to transform healthcare one person and one family at a time. They provide “a better consumer experience and the right care the first time by combining leading technology, analytics and clinical decision support with a personalized engagement model that consumers value and trust”. To achieve this goal, they are backed by an incredibly innovative data architecture based on Talend MDM with Big Data on Amazon Web Services.  Find out more about their experience in this webinar.

Making the most of your customer data with Talend

In the market section of its Quadrant, Gartner highlights the fact that many of the inquiries it receives relate to best practices on how to successfully deploy an MDM program focused on Customer Data. An MDM project is not simply about systems integration and deployment. It requires change management and strong collaboration between multiple stakeholders across organizations throughout the life cycle of the program, and it needs to be backed by best practices.

One key takeaway is that MDM projects require a level of guidance, and this is why we packaged our Passport for MDM Success earlier this year together with our most knowledgeable MDM partners. I have already blogged on this topic, and since then this approach has proven successful in paving the way for many of our customers’ MDM initiatives.

With the launch of Talend 6, we have also significantly improved the implementation of data quality and MDM projects focused on customer data, in the areas of contact data management, identification and protection of personally identifiable information, and integration of MDM data into customer-facing applications. Together with Christophe Toum, we have recorded a webinar to present and demo these capabilities. Talend 6.1 will also bring new capabilities in the areas of MDM modeling and metadata management. Stay tuned, as we are less than one month away from its general availability, a couple of days before Christmas.

We also released Talend Data Fabric.  When used in the context of Master Data Management, Talend Data Fabric adds real-time big data capabilities and leverages the power of Spark. In this context, MDM brings the system of record with a single view of the customer; Big Data integration connects it to the system of interaction across channels, while Big Data analytics brings the system of insight (for example by leveraging Spark machine-learning capabilities). Our real-time big data capabilities allow users to connect this environment in real time to the system of engagement and to customer-facing applications such as web or mobile applications, or the applications used by sales representatives, call center agents, etc.  Earlier this year, we showcased how MDM can be effectively augmented with big data by leveraging data sources such as clickstreams, integrating with advanced analytics and machine learning, and connecting in real time to web applications that drive personalized interactions and real-time recommendations. See this online webinar.

As you can see, we have seen a lot of progress in recent months on how to make customer data actionable. What seemed like a very costly and risky effort a couple of months ago has now turned into a safer passage.  Through packaged approaches such as the Passport for MDM Success, MDM initiatives can be better planned and linked to clearer business benefits and key success factors. And through technologies such as Talend MDM and Spark, turning every customer interaction into something much more data-driven, personalized and linked to measurable business outcomes can be achieved in a cost-efficient way. Are you ready to get on board?



[1] Gartner, Inc., "Magic Quadrant for Master Data Management of Customer Data Solutions” by Bill O’Kane and Saul Judah, November 11, 2015.

 

[2015-11-23] Talend Blog: The Universal Language of Data Mastery

It’s no secret that big data is engulfing society at large. A recent survey of senior-level IT professionals by CIO Insight found that 90% of respondents said their organizations were investing in big data and analytics.  So, what do these results really mean? They mean that companies in all sectors are finding value in being data-driven.

This week at Talend Connect we introduced our first annual Talend Data Masters Awards to recognize forward-thinking, data-driven organizations around the globe that are doing incredible things with their data. Winners were selected based on their usage of big data technology, creativity and overall business value achieved. 

While they may be from very different industries and locations around the world, these Data Masters share very similar views on the meaning of data:

  1. GE Healthcare

GE Healthcare, a unit of General Electric Company is focused on utilizing technology to design innovative ways to reduce costs, increase access and improve patient quality of service around the world.

  1. Travis Perkins

The U.K.’s largest building materials supplier, Travis Perkins plc., operates 20 businesses across the country, including Wickes, a DIY home improvement chain, and Benchmarx, a kitchen and joinery supply store.

  3. Virtusa

Virtusa, a global information technology (IT) services company, provides high-value IT services that enable its clients to use data effectively to enhance business performance, accelerate time-to-market, increase productivity and improve customer service.

  4. Springg

Springg is an international software company specializing in agricultural data transport, software development and sensor technologies, with a broad mission to help feed the world.

  5. m2oCity

m2oCity, one of the leading operators of remote smart reader technology in France, monitors a network of more than 2,000,000 Machine-to-Machine (M2M) devices across 1,500 cities, measuring water and gas usage.

  6. Douglas

Douglas, a leading European beauty products retailer, needed a way to connect its multitude of disparate systems in order to reduce its time-to-order fulfillment process for a new product launch campaign.

A BIG congratulations to all of our Data Masters Award winners! We are proud to partner with each of these innovative companies to help make their big data come to life. If you’d like to see for yourself how Talend works wonders with big data, take Talend’s tech for a test drive by clicking here.

 

[2015-11-19] Talend Blog: [Demo] Combining Talend 6 + Spark for Real-Time Big Data Insights

They say the world used to run on oil, then it ran on data, and today it runs on fast data. For data-driven businesses it’s no longer enough just to have the right information; they need the right information, right now.

Take Amazon, for example: they found that delaying their data by just one second could end up costing them $1.6 billion in sales. Why? Today’s customers are accustomed to incredibly fast-paced standards when it comes to online services. Businesses that can’t serve up pricing and product recommendations in real time fall behind. The business benefits of transforming into a data-driven organization are unprecedented.

So how should you go about putting your big data to work? Talend 6, the first integration platform on Spark, is a good start. Talend 6 on Spark allows you to take full advantage of real-time analytics, and employ features such as predictive recommendations, dynamic pricing and more.

Learn how to create a real-time recommendations engine built on Talend 6 and Spark in the 4-minute video below.
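If you would rather see the idea in code than in the video, here is a minimal sketch in PySpark of the kind of collaborative-filtering model such a recommendations engine is typically built on. It is not the Talend-generated job from the demo; the input path, rating format, and model parameters are assumptions for illustration only.

```python
# Minimal sketch of training a recommendation model with Spark MLlib's ALS.
# This is NOT the Talend 6 job from the video; paths and formats are assumed.
from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS, Rating

sc = SparkContext(appName="product-recommendations")

# Assume each input line looks like "userId,productId,rating"
raw = sc.textFile("hdfs:///demo/clickstream_ratings.csv")  # hypothetical path
ratings = raw.map(lambda line: line.split(",")) \
             .map(lambda f: Rating(int(f[0]), int(f[1]), float(f[2])))

# Train a collaborative-filtering model (rank and iterations are illustrative)
model = ALS.train(ratings, rank=10, iterations=10)

# Recommend the top 3 products for user 42
print(model.recommendProducts(42, 3))
```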

Download Talend’s Real-Time Big Data Sandbox. The Sandbox includes a ready-to-run Talend Real-Time Big Data Platform installation. You will also find a Cookbook as well as scenarios using Apache Spark, Spark Streaming, Apache Kafka and NoSQL. It’s the fastest way for you to learn how you can leverage the power of Spark to improve your business.

[2015-11-18] Talend Blog: 6 Things You Should be Looking for in a Big Data Platform

In order to become data-driven, organizations need to be able to understand the “what?”, the “why?” and the “what if?” using not only their enterprise data (myopic view), but also the Big Data surrounding them (geolocation, social, and sensor data). This is the only way to gain a 360-degree view of their customers, business and market, and what it means in terms of business challenges and opportunities.

One of the largest Big Data challenges for organizations is the integration or ingestion of multiple data sources to gain meaningful insight. In fact, according to IDC, gathering and preparing data for analysis typically takes 80 percent of the time spent on any analytics project. That’s an astounding amount of time spent just prepping data so that you can glean insight from it. In other words, data integration software is very likely the key to the success of your Big Data project.

Here are six simple questions you should ask to ensure you are getting the most effective data integration platform for extracting the maximum amount of insight from your data:

  1. Is it Easy to Use? Ask to see the User Interface. Is it simple or complex? Does the application automatically generate code or does it force you to do it by hand? Can you perform tasks using drag-and-drop actions? Does the platform offer a single, consistent workflow and UI or does it look like a mix of separate applications?
  2. Is it Unified? Does the platform enable the integration of all types of data (cloud, on-premises, IoT, etc.) and can you perform both batch and real-time processing within the same solution?
  3. Does it fully leverage the power of Hadoop? Some tools require that you process and transform data before loading it into Hadoop. Not only does this data movement slow projects down, but it also means you are not fully exploiting the processing power of Hadoop.
  4. Is it Up to Date? Is the software based on open source or is it proprietary? Open source solutions are proven to better keep pace with the rate of big data innovation and enable you to remain agile and more responsive to the needs of the business.
  5. Is it Fast? Does it utilize Spark and Spark Streaming within Hadoop to process data? Or is it stuck in the days of YARN?
  6. Is it Cost Effective? What’s the total cost of ownership? Is it reasonable and based on the number of developers, or is it based on data volumes, connectors or CPUs?

Tell us your thoughts: What do you find to be the biggest challenge when it comes to mastering Big Data?

 

[2015-11-18] Talend Blog: Too Soon to Talk Holiday Shopping?

We recently completed a survey uncovering some of the new opportunities and challenges facing online retailers. You can read the highlights in the press release here: http://www.talend.com/about-us/press-releases/real-time-big-data-key-to-... . As you’ll see, one of the focus areas of the survey is the challenge of shopping-cart abandonment, which costs the retail industry hundreds of millions if not billions of dollars a year.

For this post however, I thought we could focus on some of the fun findings in the survey and, in particular, some of the differences between men and women when it comes to online shopping.

Here’s a SUPER easy one to start things off. Who do you think begins holiday shopping later - men or women?

Sure, no surprises here, it is indeed men. In fact, over 20% of the women we polled started holiday shopping before October, while only about 11% of men claimed to have done so. Men are also apparently far more likely to be shopping procrastinators, with over 30% indicating they wouldn’t start holiday shopping until December. Only 18% of women will wait that long.

There are a few things that women and men are completely in sync on – surprisingly within a couple of percentage points. Men and women both primarily do their online shopping from the comfort of their homes (approx. 94%). Both genders also indicate they expect to complete more of their shopping online this year (approx. 60%) and, in fact, about 27% of both groups think they’ll complete about half of all their shopping online this year.

In addition, an almost identical percentage of men and women say they always intend to purchase the items they place in their cart (37%). So what are the remaining 63% doing with their shopping carts if not shopping? Well, it seems many are just enjoying a guilt (and bill) free shopping trip. A full 29% of women are using the cart as a “wish list”, while for men about 22% and 21% respectively indicate they are using the cart to “just browse” or to calculate total cost.

For men, the number one reason for abandoning a cart was actually a dead heat at 27% between total cost of goods and cost of shipping. For women, shipping costs were far and away the biggest reason for deserted carts (37%). We also asked men and women to choose from a list what they shopped for most online. Options included Groceries, Fashion, Home Goods/Furniture, Cosmetics, Music/Film/Gaming, and Technology. Can you guess the top two for each gender? Take a look at the full survey to find out if you are right: http://www.talend.com/about-us/press-releases/real-time-big-data-key-to-...

Here’s wishing you a safe and happy shopping season!

[2015-11-17] Talend Blog: A Surprisingly Simple but Effective Masking System

I used to watch the TV show Zorro as a kid - fantastic stuff. While it never put me off watching, I do distinctly recall thinking how absolutely ludicrous it was that nobody could figure out that Zorro was actually the nobleman Don Diego de la Vega. Come on people, they have the same moustache for Pete’s sake!

But here’s the thing, Zorro had that mask, a simple but apparently effective disguise that obscured him just enough to fool almost everyone in the village – even if it didn’t manage to dupe five-year-old me.

Of course, this is a super long way to introduce you to another simple but effective masking system – one for masking sensitive data. As many of you will know, Data Masking is the process of scrambling sensitive information in order to protect it, while still making it available and useful for things like software testing and user training.

Data Masking is becoming a very important part of managing data across development, testing, training and reporting environments. It allows real data to exist in databases, but be ‘cloaked’ before it reaches users without the need or security clearance required to access the proprietary information (folks like internal developers or perhaps representatives from external partners).
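To make the idea concrete before the video, here is a minimal, generic sketch of field-level masking in Python. It is not the Talend component shown in the demo; it simply illustrates the principle of scrambling sensitive values consistently so the masked data stays usable for testing while the originals are never exposed.

```python
# Generic illustration of data masking, not Talend's masking component.
import hashlib

def mask_email(email: str) -> str:
    """Replace the local part with a stable pseudonym; keep a safe domain."""
    local, _, _domain = email.partition("@")
    pseudonym = hashlib.sha256(local.encode("utf-8")).hexdigest()[:8]
    return "user_{}@example.com".format(pseudonym)

def mask_card(card_number: str) -> str:
    """Keep only the last four digits, as on a printed receipt."""
    digits = "".join(c for c in card_number if c.isdigit())
    return "*" * (len(digits) - 4) + digits[-4:]

print(mask_email("don.diego@delavega.es"))  # e.g. user_1a2b3c4d@example.com
print(mask_card("4111 1111 1111 1234"))     # ************1234
```

The same principle applies whether the masking happens during a test-database refresh or inside an integration flow on the way to a training environment.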

So with that, I leave you to enjoy our video on Data Masking. It is the first in a series of how-to videos we’ll be posting here, so check back often.

Remember, if you don’t already have Talend installed, you can still play along with the video by downloading a free trial.

[2015-11-16] Talend Blog: You Too Can Become a Data Rock Star & Change the World

I had the chance to attend the Apache Spark Summit last week in Amsterdam and was lucky enough to meet Matei Zaharia in person (he is the assistant professor for my MIT Big Data online course). Matei is indeed a modern rock star for being part of the Spark inception team. It was his experience trying to solve machine learning problems with Hadoop that led him and his teammates to create Spark.

During my trip to Amsterdam, I watched a BBC documentary about the mysterious X Tombs in ancient Rome and how an entire population got decimated (a fun evening of entertainment!).

The filmmakers asked a DNA expert and palaeogeneticist to help find out the exact disease that caused the death of these people. So they took some samples from the teeth in the remains and went fishing through the various DNA present (including that of the disease).

The problem is that it would take weeks of computer calculations to decode the DNA and process all the information. In contrast, imagine if they had access to modern technologies like Hadoop + Spark, and how quickly they would be able to find the answer.

With today’s modern technologies (supercomputing power, lower costs, real-time and in-memory processing, and machine learning), scientists can not only discover drugs to cure people more quickly, but also help us understand history and solve some of the world’s biggest mysteries faster than ever before.

Having access to more information at the speed of thought can help organizations solve some of the world’s biggest problems more quickly: hunger, global warming, resource scarcity….

Real time Big Data can lead to breakthrough innovations at speed, and help us better understand our past to build a better future.

Are you ready to improve our future and become a rock star like Matei? You too can solve the world’s biggest problems and innovate by using Spark, even without advanced knowledge of Spark programming. Simply download our free Sandbox and get your feet wet.

https://info.talend.com/prodevaltpbdrealtimesandbox.html

[2015-11-13] Talend Forum Announcement: Join the Talend Data Preparation Beta Program!

Dear Community,

Talend invites you to join the worldwide beta program for a new product called Talend Data Preparation! As a participant, you will experience a new step forward in your data-driven journey and influence the future enhancements of the product. The beta period will run from November 30, 2015 through January 22, 2016 and is a great opportunity for you to get early visibility into Talend Data Preparation.

Please fill out this form if you are interested in participating in the Talend Data Preparation Beta Program: https://info.talend.com/applicationdata … type=forge

Thanks for being a part of our community,
The Talend Team.

[2015-11-10] Talend Blog: Infographic: Real-Time Big Data Key to Cyber Monday Success

 

Full Talend survey results are downloadable as a PDF here: http://www.talend.com/sites/default/files/talend_online_shopping_survey.pdf 

[2015-11-04] Talend Blog: Our Sandbox has Better Toys

Today we launched a new real-time big data Sandbox. This is a super quick and painless way for you to gain first-hand experience with one of the latest big data innovations, Apache Spark. For those of you not familiar with a Sandbox, we’re basically talking about a virtual development environment. Our virtual environment combines our latest data integration sensation, Talend 6, with some great ready-to-run, real-time data scenarios – plus a step-by-step Big Data Insights Cookbook. Within minutes you’ll be up and running and impressing your friends, using Talend to turn data into real-time decisions with test examples that will get your feet wet with Apache Kafka, Spark, Spark Streaming and NoSQL.

Specifically, in the demo scenarios you’ll see a simple version of how to turn a website into an intelligent application. You’ll experience building a Spark recommendation model using Spark machine learning, and you’ll set up a new Kafka topic to help simulate live traffic coming from users browsing a web storefront. Most importantly, you will learn first-hand how you can take streaming data and turn it into real-time recommendations that will help dramatically improve sales.
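For readers who like to see the moving parts, the sketch below shows the general shape of that streaming scenario in PySpark: clicks arrive on a Kafka topic and each micro-batch is mapped to recommendations. The topic name, broker address and the stand-in recommend() function are placeholders for illustration, not the Sandbox’s actual code.

```python
# Rough outline of the streaming scenario; names and the model lookup are
# placeholders, not the actual Sandbox implementation.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="storefront-recommendations")
ssc = StreamingContext(sc, batchDuration=2)   # 2-second micro-batches

# Read simulated storefront clicks from a Kafka topic (placeholder names)
clicks = KafkaUtils.createDirectStream(
    ssc, ["web_clicks"], {"metadata.broker.list": "localhost:9092"})

def recommend(user_id):
    # Stand-in for a lookup against a pre-trained recommendation model
    return ["product_a", "product_b"]

# Each Kafka message is a (key, value) pair; assume the value is a user id
recommendations = clicks.map(lambda kv: (kv[1], recommend(kv[1])))
recommendations.pprint()

ssc.start()
ssc.awaitTermination()
```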

So, whether you are a developer looking to sharpen your skills, a data architect or scientist looking to test some exciting new technologies for your company or a competitor looking to spy on Talend, we welcome you all with open arms; because, as they say, you have to play nice in the sandbox.

Remember, our Sandbox includes a 30-day evaluation of Talend 6, so the fun doesn’t have to stop at just the scenarios provided. Feel free to continue to use Talend 6 to test out some in-house projects; just don’t blame us if you get hooked.

[2015-10-27] Talend Blog: Talend Connect: Step into the Future of Big Data!

Save the date! Talend Connect returns to Paris on November 18 and 19.

New this year: the event will take place over two days. The first day will be entirely dedicated to Talend's partners. During that day we will present what's new in Talend 6, the first data integration platform on Spark, as well as the latest updates to our partner programs. The following day is reserved for Talend customers and users. In addition to the presentation of Talend 6, which notably offers real-time Big Data integration capabilities, attendees will be able to see presentations of Talend Integration Cloud and Talend MDM and discover the very latest product innovations. Talend customers from a range of industries, such as Air France-KLM, Schneider Electric and m2ocity, will take the stage to explain how Talend has helped them innovate and get the most out of their data. Attendees will also have the opportunity to meet and talk with members of the management team.

More information about the event and registration

Partner Day: November 18

Talend's partner ecosystem – integrators and consulting firms, resellers, technology and OEM partners – keeps growing and plays an essential role in the success of our customers' projects. We foster ongoing exchanges and constant collaboration with our partners. This is the philosophy that has driven Talend since its creation. Today, roughly 30% of the company's revenue comes directly from its reseller network.

We decided to dedicate an entire day to Talend's partner ecosystem in order to review the various partner programs we offer, highlight the most notable achievements of the past year and give our partners the opportunity to meet and talk one-on-one with members of our Research & Development department.

Reseller Focus: Specific attention will be given to Talend's new VAR (Value Added Reseller) program. Launched in January 2015, this program was tailor-made to enable resellers to expand their relationships with existing customers, acquire new ones and create new sources of recurring revenue. Talend Connect is an opportunity to talk directly with the indirect sales teams in charge of running this program.

User Day: November 19

While the presentation of the new features of Talend v6.0 will top the bill – notably the ability to integrate Big Data in real time in a production environment and the integration of Apache Spark, which dramatically improves Big Data processing performance – Talend Connect is also the occasion to see a demonstration of Talend's entire range of solutions.

Big Data Focus: Talend Connect is an ideal opportunity to discover the leading innovations and best practices for putting Big Data to work. Talend's activity in this area continues to be a major growth driver for the company. After recording a 78% increase in sales in the first quarter of 2015 compared to the same quarter of 2014, Talend continued its growth: over the first half of 2015, the number of annual renewable contracts increased by 66%. This rise was notably marked by a 92% increase in the number of new customers for its Big Data solutions.

Talend has quickly established itself as the industry leader. This is illustrated in particular by the speed with which Talend's solutions are certified by the main Hadoop distribution vendors – Cloudera, Hortonworks, MapR and AWS. Beyond Hadoop, Talend also benefits from strategic and technology partnerships with MongoDB, DataStax, Teradata and Vertica, to name just a few.

Talend Data Master Awards

The winners of the first Talend Data Masters Awards will be announced at Talend Connect. The Talend Data Master Awards program is designed to highlight and reward the most innovative uses of Talend solutions. Winners will be selected according to a range of criteria – industry impact and innovation, project size and complexity, use of Big Data technologies, and the results achieved. An award will be given in several project categories – Master Data Management (MDM), Data Integration, Big Data, Enterprise Service Bus (ESB) and Data Quality.

The finalists will be selected from among Talend's more than 1,700 customers worldwide, from the public and private sectors, operating in markets including distribution, finance, healthcare and industrial products.

A Big Thank-You to Our Sponsors

Talend Connect benefits from the support of Keyrus, CGI, Business & Decision, EXL Group, Edis JEMS Group, Micropole, Microstrategy, MapR, Synaltic, Accenture, Smile, Cloudera and Sopra Steria, sponsors of the 2015 Paris event.

I am very excited and delighted to welcome our users, customers and all the members of the community to the next French edition of Talend Connect. 300 users and customers are expected this year, and we hope to count you among them!

[2015-10-27] Talend Blog: Talend Connect: Step into the future of Big Data!

Save the date! Talend Connect will be back in Paris on November 18 and 19.

This year the Talend Connect event will take place over two days. The first day will be entirely devoted to Talend's partners. Over the course of the day we will showcase the new features of Talend 6, the first data integration platform on Spark. We will also present our new partnership programs. The second day is reserved for Talend customers and users. In addition to the presentation of Talend 6, visitors will be able to attend focused sessions on the latest innovations in Talend Integration Cloud and Talend MDM. Talend customers representing a variety of industries including Air France-KLM, Schneider Electric and m2ocity will take the stage to explain how they worked with Talend to revolutionize their business and make the most of their data. The attendees will also have the opportunity to connect with members of our executive team.

More information about the event and registration (in French)

Partner Day: November 18

The world of Talend partners – integrators and consulting firms, resellers, technology and OEM partners – continues to grow and play an essential role in the success of our customers' projects. We encourage ongoing discussions and collaboration with our partners at all times. This philosophy has driven Talend since its creation. Today, approximately 30% of the company's sales originate directly from partners.

We decided to dedicate an entire day to Talend's wide array of partners and to take stock of the various partner programs we offer. We’ll take the time to highlight the most significant achievements of the past year and offer our partners the opportunity to meet and engage exclusively with members of our Research & Development team.

Focus on Resellers: Specific attention will be given to Talend's new VAR (Value Added Reseller) program. Launched in January 2015, this program was customized to help resellers expand their relationships with their existing customers, acquire new customers and create new sources of recurring revenue. Talend Connect creates a platform for resellers to interact with the indirect sales teams who are responsible for the deployment of this program.

User Day: November 19

While the presentation of the latest features in Talend 6 will be the main attraction — in particular, the possibility of integrating Big Data in real time into the production environment — Talend Connect will give users a close-up demonstration of the entire range of Talend's new features and solutions.

Focus on Big Data: Talend Connect provides an ideal opportunity to discover the leading innovations and best operating practices of Big Data. Talend's activity in this area continues to be a major driver of growth for the company. After recording a 78% increase in sales in the first quarter of 2015 compared to the same quarter in 2014, Talend continued its growth during the first half of 2015, with the number of annual renewable contracts increasing by 66%. This increase was marked by a 92% hike in the number of new customers for Big Data solutions.

Talend has quickly established itself as the integration industry leader. This is demonstrated in how quickly Talend's solutions receive certification by leading Hadoop distributors – Cloudera, Hortonworks, MapR and AWS. Beyond Hadoop, Talend also benefits from strategic and technology partnerships with leaders like MongoDB, DataStax, Teradata and Vertica, to name a few.

Talend Data Master Awards

The winners of the first-ever Talend Data Masters Awards will be announced at Talend Connect. Talend Data Master Awards is a program designed to highlight and reward the most innovative uses of Talend solutions. The winners will be selected based on a range of criteria including market impact and innovation, project scale and complexity, Big Data technology usage and the overall business value achieved. Awards will be given across a number of different project categories including Master Data Management (MDM), Data Integration, Big Data, Enterprise Service Bus (ESB) and Data Quality.

Finalists are being selected from over 1,700 Talend customers worldwide, representing the public and private sectors and operating in markets including distribution, finance, healthcare and industrial products.

Special Thanks to Our Supporters

Talend Connect benefits from the support of Keyrus, CGI, Business & Decision, EXL Group, Edis JEMS Group, Micropole, Microstrategy, MapR, Synaltic, Accenture, Smile, Cloudera and Sopra Steria, the sponsors of the 2015 Paris event.

I am very excited about welcoming users, customers and all the members of the community to the next French session of Talend Connect. Around 300 users and customers are expected this year, and we hope you'll be among them!

[2015-10-23] Talend Blog: Three Key Takeaways from Amazon re:Invent 2015

Upon returning from AWS re:Invent, a jam-packed cloud computing frenzy in Las Vegas, I was a) pumped up about the impact that cloud environments can have for all of our customers and beyond; and b) left thinking about what key messages really stood out.

There was a lot to take in—but one thing was abundantly clear: AWS has done a remarkable job of making the move to public cloud seem inevitable: taking things like security off of the table, starting to remove some of the roadblocks around migration, and referencing a number of enormous companies who have either completed their migration or are making a major commitment.  It’s impressive. But what does all this mean for AWS, Talend, and our customers? Here are my key takeaways:

Data warehouses, analytics and IoT are all driving a continued migration to the cloud

Over two days of keynotes, executives from the following companies were among those who made presentations discussing their use of Amazon Web Services’ cloud: General Electric, Capital One, John Deere and BMW. If people were wondering before if enterprises are really using this platform, AWS re:Invent 2015 proved that they are.

The cloud is still in hyper-growth mode and small-to-medium sized companies are also using the cloud more than ever before—particularly as data stores continue to grow and the economics around managing these growing stores of data become unfeasible for most.

Migration made easy with Snowball

Given that one perceived roadblock on the way to AWS is the effort of moving existing data off current servers, it is not surprising that AWS announced a few migration offerings. Snowball is a hardened disk appliance that allows you to physically ship up to 50TB of data via UPS (literally); AWS also introduced the Database Migration Service (DMS) and a Schema Conversion Tool. The DMS is a simple wizard that does a like-for-like bulk migration to or from a “legacy” database to something in AWS, including data and code, compressing along the way for performance. They quote $3/TB to migrate. The Schema Conversion Tool is an option that will attempt a best-fit heterogeneous migration, including data type and stored procedure migration (with some exceptions, I’m sure).

QuickSight and the need for CLEAN data

The announcement of AWS QuickSight was an important one. At face value, it’s a tool customers can use to analyze data they already have stored in AWS’s cloud using fancy graphics. Salesforce has taken the same approach with its analytics cloud. However, as the expression goes—‘Garbage in, garbage out.’ The success of QuickSight will depend entirely on having really clean data. That said, both Redshift and Aurora integrate with Talend, which makes it significantly easier, faster and cheaper to get all of your data into one repository and have high-quality, clean data for analytics, so that you can get the best insights to infuse back into your business. Talend’s free Data Quality components would also play a critical role in getting the best data possible.

Symbolically, it also signals a shift. AWS has been an Infrastructure-as-a-Service company – it provides virtual machines, storage, databases and a whole lot of other cool cloud-based infrastructure components – whereas QuickSight is a Software-as-a-Service offering.

So what does all this mean for the future? Cloud is becoming the more obvious choice as the future of data warehouses. It’s more flexible, scalable, affordable and insured. And now, with these new offerings, AWS is making it even easier for customers to get into the cloud. All of these tools are being made more affordable and available so that companies looking to transform their organization into a data driven business have a fast, painless, and obvious path to take.

In addition to the myriad of messages targeted at CxOs, at re:Invent AWS also had the full attention of the world’s leading edge developers (both corporate & ISV) and it’s clear that we’re seeing the next winning multi-year platform franchise emerging.  AWS is (reportedly) already at a $7B annual revenue run rate, up 85% from last year (with their Q3 earnings expected to be announced today). Wow. All of this adds up to the fact that there is nothing but momentum in this space and Talend + AWS makes perfect sense.

[2015-10-19] Talend Blog: Building ‘Houses’ in the Cloud

In my IT career I have had the opportunity to work on many great Data Management projects, ranging from simple extract, transform and load (ETL) assignments that support operational systems like CRM, SFA, and ERP, to simple Data Warehouses.  I have been on some very impressive Master Data Management (MDM) and Data Quality projects for some of the top companies in their sectors, including both ETL and real-time Data Services integration patterns.  But, I have taken a break from that, and now work for a company that provides tools to help you build the very data fabric that all enterprises need to be successful.

Talend recently launched a new product in the integration Platform-as-a-Service (iPaaS) space that makes it even easier for customers to build and deploy their integration patterns in the cloud, where infrastructure and hardware aren’t necessary. This is a completely hosted data integration platform in the cloud, and if all your sources and targets are in the cloud, then your entire solution can be hosted and run in the cloud.

As part of my new role at Talend, I am fortunate enough to have early access to many of the products and am required to become an early expert in order to train other technical professionals in the company.   Sometimes this can be a blessing and sometimes it’s a challenge, as I can be dragged into some really hairy projects.  In this case I was excited when our CMO, Ashley Stirrup, came to me and asked if I would help build our own internal cloud-based Customer Data Warehouse (CDW).  I was very excited to help build a complete data warehouse entirely in the cloud.

End-to-End Sales and Marketing Data Integration


The concept of the CDW was pretty simple really: the executives wanted to see and measure the effectiveness of all Marketing and Sales activities from beginning to end for all our customers. The secondary project objective was to build the entire CDW using cloud technologies, including Talend Integration Cloud. The three sources were Marketo (Marketing Automation), Salesforce.com (Sales and Campaign Operations) and Netsuite (Billing and Invoicing) – all Software-as-a-Service (SaaS) platforms. We employed the assistance of our partner, full360, to build the Data Warehouse in Amazon Web Services (AWS) Redshift, with the online edition of Tableau for the visualization layer. The partner had a lot of experience with Talend's on-premises tools but, like everyone, was new to the Cloud edition. It was my job to assist with the migration of the traditional Talend jobs to the cloud – a process which we referred to as "Cloudifying" the flows.

The process was very simple and it took next to no time to build using all the different components and connections Talend provides. We built the flows and tested the overall process from our local development studios. This included a full batch control process within the Redshift tables to ensure all extracts from the sources out to AWS S3 were successful before loading data to the production reporting tables on Redshift. We also used several Data Quality Actions. Actions are "predefined integration patterns" used in a cloud data flow, for example to cleanse data quality issues. Once these were defined, I saw many steps that were excellent candidates to be turned into reusable Actions, such as the batch control process, which needed to retrieve a batch ID before every flow and then update a table at the end to record that the process was successful or report a failure. I turned this into a simple Action that all the Flows call in sequence in order to keep the entire process in check.
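For readers curious what that batch control pattern looks like outside of Talend, here is a minimal sketch in plain Python against Redshift's Postgres-compatible endpoint. The etl_batch table, its columns and the connection details are hypothetical; the point is simply the register-before, record-outcome-after shape that the reusable Action implements.

```python
# Hypothetical illustration of the batch-control pattern, not the actual
# Talend Action: open a batch row before the flow, close it with the outcome.
import psycopg2

def run_with_batch_control(flow_name, flow_fn, conn_params):
    conn = psycopg2.connect(**conn_params)
    with conn, conn.cursor() as cur:
        # Allocate the next batch id (assumed etl_batch control table)
        cur.execute("SELECT COALESCE(MAX(batch_id), 0) + 1 FROM etl_batch")
        batch_id = cur.fetchone()[0]
        cur.execute(
            "INSERT INTO etl_batch (batch_id, flow_name, status) "
            "VALUES (%s, %s, 'RUNNING')", (batch_id, flow_name))

    status = 'FAILED'
    try:
        flow_fn(batch_id)          # the actual extract/load work for this flow
        status = 'SUCCESS'
    finally:
        with conn, conn.cursor() as cur:
            cur.execute(
                "UPDATE etl_batch SET status = %s, ended_at = GETDATE() "
                "WHERE batch_id = %s", (status, batch_id))
        conn.close()
```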

The best part of this CDW process was that all my testing and production deployment was a matter of a "right click" and deploy and I was done!  I didn't have to call up my favorite hardware guys and order new integration servers or database servers because all the infrastructure was created for me in the cloud. The CDW process really is as simple as doing a right click and deploy to the cloud and I am ready to test, schedule, and run my integrations in production for my completely hosted Data Warehouse. 

Overall, building my first completely cloud-hosted Data Warehouse was a great experience! Of course, many of you still have data sources that are not in the cloud and you will need the on-premises functionality that Talend Integration Cloud offers, but in this CDW project it was very fulfilling to have an entire project where I didn't need to involve the Infrastructure team or worry about securing space in some data center. Finally, it’s important to note that as the requirements for the CDW grow, I know that AWS and Talend are both capable of scaling to meet the need with very little effort.

Final Data Warehouse Result

[2015-10-15] Talend Blog: You’ve Bought Into the Cloud: Now What?

For the last few years we’ve been hearing about the benefits of the cloud. And at this point, many of us agree the benefits are sizable. So, let’s say you’ve finally agreed that it’s time to move to the cloud – what’s next?

First off, “the cloud” can mean anything – so what is it exactly?  Applications like web analytics (Google Analytics) and customer relationship management (Salesforce) can be in the cloud. You can put data in the cloud in massive data warehouses like Amazon Redshift and Google BigQuery. You can move your analytics to the cloud with offerings like Tableau Online and Birst. And you can do any or all of the above, in any order.

Companies that weren’t “born in the cloud,” meaning any company more than a couple years old, need a plan for going cloud. Most organizations need to determine what, how and when to adopt cloud services.

Here are five strategies for transitioning to the cloud.

1. Use the right tool for the job.

If you’ve already decided to make a change, look at the toolkit of cloud services and see if there is a good option. Need a new data warehouse? An HR management solution? CRM? Consider starting with cloud there. You’ll likely have a faster implementation by going cloud, which means you’ll get value fast, and you’ll be able to start your transition without ripping out something that’s working. And chances are you’ll save money in the bargain.

For example, WildTangent, a worldwide distributor of mobile, social and online games, has moved most of its data infrastructure to the cloud. Scott Moran, the company’s Director of Business Intelligence, encourages people to look at the many cloud services offered as a toolkit. Select the right tool at the right time, and choose a tool that’s the right size for the need.

Moran moved WildTangent’s data architecture to the cloud piece by piece. He cautions that, in keeping with the analogy of the toolbox, you don’t want to get all your tools out at once. Use one, finish the job, and move on to the next. This helps you keep your business running as you make the transition.

2. Be as flexible as the cloud itself.

The cloud is in a stage of rapid evolution. You have the possibility of prototyping as you go, and adding volume when you’ve got it right. Keep an eye on new technologies and see how you can fit them into your workflows. Your best architecture today may not be your best architecture in a year, or even six months. A bit of tweaking can save you a lot of money.

As you consider new services, take advantage of the flexibility in the cloud. Elasticity is a characteristic of many cloud services. Basically, this means you can use (and pay for) a small amount at first and then scale dramatically when your concept is proven out. In the cloud, you can try things out without having to commit to massive infrastructure or licensing costs up front.

3. Plan for growth.

One of the advantages of a cloud infrastructure is that you can scale up easily – as long as you’ve got the right infrastructure. Take the time upfront to get your systems working as you want them, whether they be cloud applications, data or analytics. You don’t want to change from a relational database to an analytical database midstream, but you might want to double your analytic database capacity overnight. And if your business starts growing, a good system can go a long way with you – but a bad one will only add to your headaches.

4. Give your users a hand (or at least a single sign-on solution!).

One of the challenges with moving to the cloud is that your users may end up with a number of different username and password combinations to remember. Luckily, there’s an app for that. Single Sign-On (SSO) solutions like OneLogin and others let your users use one password for many applications. This can significantly reduce user headaches and make users more open to adopting new solutions. It’s a good idea to favor solutions that use SAML or OAuth so that you can make use of an SSO solution when you’re ready.

5. Add even more value by broadening access to data.

If you’re building a data infrastructure in the cloud, think about how your employees will be able to use that data. If you’re moving to cloud applications, think about how you’ll integrate the data with other data in your enterprise.  Otherwise, the limiting factor in your cloud infrastructure will be the time of your data scientists.

These steps provide a great starting place for exploring the cloud.  You’ve bought into its potential, it’s now up to you to get the most out of it!

About the Author, Edouard Beaucourt (Tableau Software)


Edouard Beaucourt is Regional Director for France, French-speaking Switzerland and North Africa, responsible for reinforcing and growing Tableau Software’s presence and driving sales of Tableau’s products within the region. Edouard, 35 years old, joined Tableau Software in December 2013 as Enterprise Account Manager for France and French-speaking Switzerland. Previously he held roles at IBM Business Analytics in Geneva, and at Clarity Systems in Paris and Geneva. He has a background in major account enterprise sales in the business intelligence and analytics software space and a wealth of industry insight. Prior to roles at the organisations above, Beaucourt also managed sales teams and channel partner programmes for Microsoft and Hyperion.

[2015-10-13] Talend Blog: Self-Service and Data Governance Empowers LOB Users

There is a major transformation underway in the use of data centric tools within the enterprise. When it comes to working with solutions for data integration, data preparation, data analysis and Business Intelligence (BI), the emphasis is shifting from IT to line of business (LOB) users. 

This is a natural evolution.  LOBs today have to deal quickly and efficiently with constantly increasing amounts of data generated by multiple, digitized sources such as social media and the Internet of Things (IoT). It’s a situation that requires a collaborative relationship with IT as LOBs become more involved in data processing to rapidly obtain the trusted, updated data they need to facilitate decision making. 

Error-ridden, unstructured data can have an immediate negative impact on business processes. For example, customers dealing with a call center with inadequate or faulty customer tracking may have to restate their issue every time they call in to resolve a problem – a major annoyance. Or a corporate marketing department may optimistically launch a major campaign only to realize poor results due to relying on a database riddled with errors and omissions. According to Gartner analyst Ted Friedman, organizations estimate losing on average $8.9 million per year due to these kinds of data quality issues.

However, even though business users want to be more involved in data-centric activities, they may be spending 70 to 80% of their time preparing data without any assurance the quality will be high or that governance risks such as privacy and compliance issues are being addressed. 

LOB users are partly responsible.  Data analysis tends to be an “individual sport” in which users often create their own version of the original data. It’s understandable that increasingly data driven LOBs want to participate in processing data relative to their business unit.  However, they must be accountable for the data they manage so that it can become a valuable asset for all parts of the organization, including IT. 

Consider the marketing department mentioned above, which has created a web form on its website to capture leads for a new campaign. If the data input is not properly controlled, the campaign will introduce poor data into the CRM system, including junk emails, invalid phone numbers, duplicate records, etc. This could ultimately impact the whole organization, not just marketing-specific activities such as outbound campaigns. Think about a shipping notice or a response to a claim that could end up as undeliverable mail.
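As a purely illustrative sketch (the field names and thresholds are invented, and a real project would rely on Talend's data quality components rather than hand-written rules), the snippet below shows the kind of checks that keep such a web form from polluting the CRM: reject malformed emails and implausible phone numbers, and drop duplicates on a normalized key.

```python
# Illustrative lead-cleansing checks; field names and rules are assumptions.
import re

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def clean_leads(leads):
    seen, valid = set(), []
    for lead in leads:
        email = lead.get("email", "").strip().lower()
        phone = re.sub(r"\D", "", lead.get("phone", ""))
        if not EMAIL_RE.match(email):
            continue                      # junk or missing email
        if phone and not (7 <= len(phone) <= 15):
            continue                      # implausible phone number
        if email in seen:
            continue                      # duplicate record
        seen.add(email)
        valid.append(dict(lead, email=email, phone=phone))
    return valid

print(clean_leads([
    {"email": "Jane.Doe@Example.com", "phone": "+33 1 23 45 67 89"},
    {"email": "jane.doe@example.com", "phone": "0123456789"},   # duplicate
    {"email": "not-an-email", "phone": "12"},                   # junk
]))
```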

A Collaborative Solution

For today’s organizations the solution is to transfer some data processing responsibilities to the LOB while allowing IT (or other cross-functional organizations, such as the Office of the Chief Data Officer that we see in some data-driven organizations) to keep control over the various processes involved. This is in line with the trends we have observed – a need for more autonomy on the part of the users, driving the use of self-service tools for data analysis, data preparation and data integration. However, this move to self-service can result in chaos if not accompanied by appropriate controls.

Data governance provides those controls – it not only allows the LOB to access the data but also ensures its quality and accuracy. Talend provides data governance capabilities to business users through the data stewardship features in its Master Data Management solution. And we are now building out a full vision of LOB self-service for our business customers that will allow them to cleanse, prepare and integrate data for their own analytical needs, or to move data from a file to an application or between applications.

Self-service also can empower LOBs to deal with data at an operational level. In that respect, a key part of the Talend solution is the Talend Integration Cloud, which allows “citizen integrators” – advanced users familiar with the underlying IT landscape such as  SaaS or commonly used on-premises business applications – to integrate data on an ad hoc basis or to collaborate with IT to create enterprise ready integration scenarios. 

Talend Integration Cloud is a secure cloud integration platform featuring powerful graphical tools and prebuilt components and connectors that make it simple to enrich and share data.  This makes data integration projects feasible for LOB users to tackle and therefore frees up IT for other, more strategic tasks. And it makes data Integration a team game between Lines of Business and IT, rather than a source of conflict.

Coming in the very near future based on a similar mindset is Talend Data Preparation, a self-service solution for business analysts, which will enable them to prepare data for analysis or any other data integration or stewardship tasks.  Data Preparation is being designed not only as a productivity tool for the LOB user, but as a collaborative tool that allows an organization to share most of its data assets.  The flexibility of this new solution will enable an organization to strike the right balance between LOB autonomy and IT control depending on the sensitivity of data involved, the organization’s culture, and the role that IT plays within the enterprise.

Benefits of a New Collaborative Approach

By recognizing the shift to LOB involvement in an organization’s data centric activities, the entire enterprise will realize numerous benefits:

- LOBs save time and increase productivity with easier sharing of information and a more comprehensive view of essential data

- Marketing organizations improve their campaigns through better targeting of customers (and more generally, lines of business can more successfully meet their operational objectives through a better use of their data assets)

- The enterprise gets better control over data including data security (such as avoiding major break-ins like the recent Sony/WikiLeaks hack – a company’s worst data nightmare mostly caused by uncontrolled copies of very sensitive data such as employees’ salaries and social security numbers)

- With LOBs controlling Big Data governance applications, the business users have the ability to use reliable, timely data for driving a competitive advantage.

The Talend unified platform is Talend’s integrated solution for all the technical challenges of data governance, data integration and preparation, together with data curation and stewardship. This is an easy-to-use, unified data management platform with more than 1,000 built-in connectors and components that make it easy to incorporate nearly any type of data source into your governance process.

With these capabilities in place, data quality has the potential to move from its traditional back office role to a far more proactive practice that can address quality issues before they even occur, and finally turn data into a day-to-day operational outcome across the lines of business. And, this approach is ushering in a new era of proactive, data aware LOB users who can work in close collaboration with their IT counterparts.

[2015-10-07] Talend Blog: Why Driving a Data-Driven Culture is Essential to Business Success

Big data is a familiar term to everyone in the world of IT but now it’s becoming known as a topic in our everyday lives. Big data is forecast to continue impacting all aspects of our lives, especially at work.

For better or for worse, 90% of the world’s data was generated within the last two years. But that data is only useful when anyone and everyone who needs to can access and understand it. This is why traditional business intelligence tools are being superseded by easy-access software solutions that don’t require initiation into some high priesthood of data science, or a PhD in statistics.

The rising tide of data has necessitated that everyone have familiarity with data analysis, not just experts with “analyst” in their titles. Organisations that make better use of data to make decisions are more successful, while those that don’t will begin to fall behind.

The democratisation of data has emerged as a consequence of several trends – proliferation of devices and the consumerisation of IT in general – and signs point to it becoming an ever more prevalent trend.  We are moving to the pervasive use of data, through online and real-world tracking and the internet of things.

In a world where people are drowning in data – from information on the Web, on spreadsheets, and in databases on tablets and devices – people need a lifeline, and that lifeline is data analytics.

A recent Teradata survey found that about 90% of organizations report medium to high levels of investment in data analytics solutions, and about a third call their investment “very significant.”

The study underlines this shift in thinking, as businesses see a return on their data-analytics investment across sectors and across areas of the business – from marketing to sales.

With increased data across various parts of today’s businesses, familiarity with data analysis is now an essential skill across roles and levels.

Unfortunately, most business analytics products are built to centralise and control data, not democratise it. As a result, the majority of companies are reliant on specialists just to answer basic questions. They stumble through Escher-like spreadsheets to work around inflexible business systems. Or they’re being stonewalled by enterprise-wide business intelligence platforms that spend more time in development than helping anyone.

There's no power in that approach. The power is in giving people the ability to think, act and deliver – and a self-service model means the IT department can concentrate on its strategic role, not on helping users work out how to generate reports!

When a company empowers employees with self-service analysis tools, it shows they are trusted as capable and respected. People start to drive their organisations forward in ways that senior management could never anticipate. The environment fosters their ingenuity and creativity, and people are able to tell stories with their data.

Top tips for driving a data culture within your business:

- Get buy-in and excitement: think of data analysis as a story, and use a narrative

- Find the story first: explore the data

- Write out the story to guide your audience through the journey

- Supplement hard data with qualitative data, and add emotion

- Be visual: use pictures, graphs and charts

- Make it easy for your audience: stick to 2-3 key issues and how they relate to your audience

- Determine what you want people to do as a result: write your happy ending

- Encourage data uptake by demonstrating the benefits to the business and to your colleagues’ roles – data empowerment can make business heroes!

 

About the Author, Edouard Beaucourt (Tableau Software)


Edouard Beaucourt is Regional Director for France, French-speaking Switzerland and North Africa, responsible for reinforcing and growing Tableau Software’s presence and driving sales of Tableau’s products within the region. Edouard, 35 years old, joined Tableau Software in December 2013 as Enterprise Account Manager for France and French-speaking Switzerland. Previously he held roles at IBM Business Analytics in Geneva, and at Clarity Systems in Paris and Geneva. He has a background in major account enterprise sales in the business intelligence and analytics software space and a wealth of industry insight. Prior to roles at the organisations above, Beaucourt also managed sales teams and channel partner programmes for Microsoft and Hyperion.

[2015-10-06] Talend Blog: Unlocking the Power of the Cloud: Talend Teams Up with AWS at re:Invent 2015

Hot on the heels of Strata + Hadoop World NYC, next week Talend ships out to ‘Sin City’ for four jam-packed days at AWS re:Invent, “the largest gathering of the global Amazon Web Services community.”

We’re still basking in the glow of our new Talend 6 real-time big data integration platform, which is going to make a huge impact on big data cloud environments everywhere.

At re:Invent, our primary demo will team up AWS solutions with Talend Integration Cloud, which lets you connect all your data in the cloud and on the ground. The solution includes over 900 connectors and components to simplify development of cloud-to-cloud and hybrid integration flows that can be deployed as governed integration services. At the show, you will be able to learn first-hand how to build simple or complex integration flows inside Talend Studio that connect, cleanse, and transform data. You might need to see it to believe it, but with a simple push of a button, you can publish and go live in seconds!

Talend Integration Cloud for AWS also offers:

- The Best for Real-Time Big Data, Spark, Kinesis & Kafka

- Connections for all your data sources & applications, cloud and on-premises

- Business user self-service features to trigger agile innovation across the company

- Integration with Aurora, EMR, S3, RDS, Redshift; and,

- Real-time Big Data — Insight at a fraction of the cost with support for leading big data distributions and data sources

Awaken your Cloud Data and Win

Make an appointment here or drop by booth #630 during AWS re:Invent at the Venetian Sands convention center in Las Vegas, October 6-9th.

While you’re there, be sure to snap an in-booth selfie and post it to Twitter or Instagram, using the #Talend6Awakens and #reInvent hashtags, and be entered to win a Star Wars collectible poster!! 

Win a David Prowse Autographed Darth Vader 8 x 10 Poster! 

Not attending? Good news! You can still enter for a chance to win Star Wars memorabilia by showing us your creative force: produce and post a 15-30 second video or a custom meme explaining how the Rebels or the Empire could have used Real-Time Data Analysis.

Be sure to follow all the show action through our social accounts, on Twitter and LinkedIn. We’ll also have key news and insights from the show right here on the blog, so stay tuned….

[2015-10-01] Talend Blog: You Can’t Fake the Data-Driven Force

Before I get into my blog – an admission – I’m a major Star Wars fan.

No, seriously – a MASSIVE fan. Back in the day, I was a light-saber-toting-wookie-lovin-Princess Leia-poster-on-the-wall type fan. Yup, that guy. In three months, the next Star Wars movie hits the streets and, as you can imagine, I can’t wait. While I still love the more recent one-through-three episodes of the franchise, my heart really belongs to the first three (episodes four, five, and six), which came out while I was still a young(ish) boy. 

The thing is, Vader had insight. He knew that he was Luke’s father… but Luke didn’t know. I remember watching the movie in the 80s and thinking “how does he not know? What is the point of the Force, if he can’t figure out some basic stuff? Come on!”

When it comes to being “data-driven”, I get the same feeling. “Come on! Surely you can do better than that!” I mean, what’s the point of all that data if you lack insight? While there are plenty of companies that are beginning to use their data, there is still only a handful that can fully exploit it. Those companies appear to know more about their customers than the customers know about themselves. They have insight. It’s instant, relevant, personal, possibly unnerving, but in the end, it’s about providing exceptional service.

You know the few companies that can do this today. It’s the likes of Google, Amazon and Netflix... They know what you are searching for before you finish typing, they automatically identify TV gems you would have never otherwise discovered, and they have made the process of ordering and receiving goods as easy as a single click of the mouse. In short, they deliver experiences that are instant, relevant, personal, and delightful.

Of course, they also have one other thing going for them – they could fill a football field with their IT talent and funding.

Clearly using day or week-old data to make business decisions and shape the customer experience is no longer going to cut it. Customers expect you to have a much better understanding of their needs and deliver a far more personalized experience. This however poses a wee bit of a challenge for the majority of companies that likely couldn’t fill a large office with their IT team and budget, let alone a football field.

Or at least it was a challenge until now - cue dramatic Star Wars opening theme music…

Talend 6 was introduced yesterday and became the FIRST integration product to be built on Apache Spark. This is really significant as it allows any company, regardless of the size of their IT budgets or teams, to handle real-time big data. This means you too can turn huge volumes of data into immediately actionable insights.

Cool right? It’s almost like we’ve handed you the Force. All we ask is that you use it wisely. And, unlike poor Luke, apply it to the really important stuff like revenue and your customer relationships.

P.S., shameless I know, but share my post for your chance to win a cool Star Wars inspired Tee that we created in celebration of getting Talend 6 out the door!


[2015-09-30] Talend Forum Announcement: Announcing Talend 6: The First Spark-Powered Data Integration Platform

Dear Community,

Talend today announced the immediate availability of Talend 6 (www.talend.com/products/talend-6), the industry’s first and only data integration platform with native support for Apache Spark and Spark Streaming. By leveraging over 100 Spark components, Talend 6 delivers unmatched data processing speed and enables any company to convert streaming big data or IoT sensor information into immediately actionable insights.

Read more in our press release, available here: http://www.talend.com/about-us/press-re … n-platform

What’s New in Talend 6?
- First data integration platform on Spark
- Deliver an end-to-end integration platform for the Internet of Things (IoT)
- From Continuous Integration to Continuous Delivery with Talend Data Integration
- Power Big Data, Mobile, and Cloud Apps with New insight
- Make data smarter and more secure
- Extend your integration reach

Get more details on our Talend 6 "What's New" page: http://www.talend.com/products/talend-6

Thanks for being a part of our community,
The Talend Team.

[2015-09-30] Talend Blog: Real-Time Big Data Goes Mainstream – Are You Ready? (German edition)

Real-time big data analytics, which has long been growing rapidly, is now reaching an inflection point. For the first time, all the pieces are coming together into an integrated real-time big data supply chain that will fundamentally change how business is run.

This development could hardly have come at a better time. According to IDC, we are in the middle of a digital universe that is growing 40% a year, driven not only by the online presence of users and companies, but also by the rapid growth of the Internet of Things (IoT). This digital universe doubles in size every two years. By 2020, IDC says, it will reach 44 zettabytes – that is 44 trillion gigabytes.

Until now, the ability to mine this enormous mass of data for answers that could be acted on immediately has been the domain of a handful of the world's most experienced and advanced companies: massive corporations with huge IT budgets and teams. Because of technical and financial constraints, small to mid-sized firms have so far been little more than bystanders when it comes to applying real-time big data analytics.

All of that is changing now

With the introduction of Talend 6, the real-time analytics landscape has changed radically and permanently. Talend 6 is the industry's first cost-effective data integration platform with native support not only for Hadoop, but also for Apache Spark and Spark Streaming. By leveraging more than 100 Spark components, the platform delivers the unmatched data processing speeds needed to turn streaming big data or IoT sensor information into actionable insights in real time.

It is a huge advantage to know what your customers did last week. What is even more important, though, is tracking their actions as they happen and being able to respond immediately in order to optimize your customers' experience. A saying that sounds like a Zen Buddhist principle is currently making the rounds and sums this up perfectly: "If it's not in the moment, it's meaningless."

A real-time big data supply chain lets you introduce innovations into your customer-facing solutions that were previously unimaginable. That is a direct result of the enormous performance gains made possible by the new Talend 6 platform.

For existing Talend customers, for example, converting MapReduce jobs (the old way of doing things in Hadoop) to Spark takes a single click and immediately delivers a fivefold performance increase. Developer productivity even increases tenfold compared with hand coding, thanks to an intuitive design interface, prebuilt Spark components and automated Spark code generation. Talend 6 also offers a built-in Lambda architecture with a single environment for working with bulk and batch, real-time, streaming and IoT data.

More important than the technical specifications of Talend 6, however, are the platform's benefits for companies and the countless use cases that arise when you can ask your data anything and get an answer immediately.

A few use cases that highlight the real-time big data difference

Here are just a few examples of the power of real-time big data as enabled by the Talend platform:

- Healthcare - Medical alert pendants, for example with motion detectors, connect elderly people directly to an emergency dispatch center should they become incapacitated and be unable to call for help in any other way.

Thanks to real-time big data, a healthcare provider can now continuously monitor the condition of at-risk patients. By combining real-time data from personal devices that track vital signs with the information in medical records, analytics tools can alert clinicians when proactive action on a patient becomes necessary.

- Retail - Abandoned shopping carts (when shoppers put merchandise in their online cart but leave a website before completing the purchase) are a major challenge for retailers. According to BI Intelligence, the total value of merchandise abandoned in shopping carts this year amounts to an incredible 4 trillion US dollars. Without real-time big data analytics capabilities, retailers can only track the extent of the loss.

Thanks to Spark-based data integration and Spark-enabled analytics, companies gain the speed and agility needed to get the shopping cart abandonment problem under control. With the ability to automatically synthesize real-time big data, companies can not only predict shopper behavior but also automatically offer incentives so that consumers actually complete the purchase.

- Agriculture - In the past, farmers submitted a soil sample to a dedicated service and then received an analysis weeks later, with recommendations on what measures could be taken to maximize yield.

Thanks to Talend 6 support for Spark data integration and analytics, such services can use multiple sources of structured and unstructured data – data from the field combined with historical lab data – to deliver analysis results and reports within seconds. This allows farmers to make informed field management decisions from one moment to the next.

Democratizing real-time analytics

With this new release, Talend offers the first real-time big data integration platform with the potential to completely change the way companies do business. Real-time analytics capabilities are no longer reserved for the few companies with deep pockets.

It also means that you could be at a competitive disadvantage if you do not embrace real-time big data analytics using an integrated, cost-effective platform like Talend 6.

The impact of this shift to real-time analytics goes considerably further. Beyond the obvious, inherent advantages for real-time marketing, other areas of the business – from manufacturing and supply chain management to human resources – can benefit as well. This is an excellent opportunity for the IT department to collaborate with other parts of the company to make the most of the real-time big data supply chain and to explore innovative new ways of using this advanced technology.

The beginning of a new era.

[2015-09-30] Talend Blog: Real-Time Big Data About to Go Mainstream – Are You Ready?

Real-time big data analytics, already exhibiting rapid growth, is reaching an inflection point.  Now, for the first time, all the ingredients are coming together to form an integrated real-time big data supply chain that will transform how we do business.

This development couldn’t come at a more auspicious time.  According to IDC, we are caught up in a digital universe that is growing 40% a year, fueled not only by the online presence of people and enterprises, but also the rapid growth of the Internet of Things (IoT).  This digital universe is doubling in size every two years. By 2020, says IDC, it will reach 44 zettabytes – that’s 44 trillion gigabytes.

Until now the ability to mine this deluge of data in order to get answers that could be immediately acted upon was the domain of a handful of the world’s most sophisticated companies – massive organizations with really huge IT budgets and teams. Because of technical and financial constraints, small to medium sized enterprises have been largely sitting on the sidelines when it comes to applying real-time big data analytics.

Game Changer

With the introduction of Talend 6, the real-time analytic landscape has changed forever.  Talend 6 is the industry’s first cost-effective, data integration platform with native support for not only Hadoop, but Apache Spark and Spark Streaming as well.  By leveraging over 100 Spark components, the platform delivers the unmatched data processing speeds needed to convert streaming big data or IoT sensor information into actionable insights in real-time.

It’s powerful to know what your customers were doing last week, but it’s even more powerful to track their behavior as it happens and be able to respond immediately to transform your customer’s experience for the better.  There’s a Zen-like phrase that’s making the rounds that somewhat sums this up: “If it’s not in the moment, it’s meaningless.”

A real-time big data supply chain allows you to introduce innovations into your customer-facing solutions that were unimaginable before – a direct result of the significant performance gains that the new Talend 6 platform makes possible.

For example, for existing Talend customers the conversion of MapReduce jobs (the old way of doing things in Hadoop) to Spark is accomplished at the click of a button and results in an immediate 5x performance increase.  Developer productivity is up 10x when compared to hand coding thanks to an intuitive design interface and prebuilt Spark components with automated Spark code generation. Talend 6 also offers a built-in Lambda architecture that provides a single environment for working with bulk and batch, real-time, streaming and IoT data.

More important than the technical specifications of Talend 6, however, is what the platform powers for companies and the innumerable use cases that are created when you can ask your data anything and receive an answer in an instant.

A Few Use Cases Highlighting the Real-Time Big Data Difference

Here are just a few examples of the power of real-time big data made possible by the Talend platform:

- Healthcare - Medical alert pendants, some with motion detectors, allow the elderly to connect directly with a dispatcher should they become incapacitated and unable to otherwise call for help.

Now, powered by real-time big data, a health service provider is able to constantly monitor at-risk patients. By combining real-time personal device data tracking vitals with medical record information, analytics tools can alert healthcare professionals if proactive patient action is required.

- Retail - Shopping cart abandonment — when shoppers put merchandise in an online cart, but leave before completing the purchase — is a significant challenge for retailers. According to BI Intelligence, a staggering $4 trillion worth of merchandise will be abandoned in shopping carts this year. Without real-time big data analytic capabilities, the only thing retailers are able to track is the extent of the loss.

Spark-powered data integration, coupled with Spark-enabled analytics, provides organizations with the speed and agility needed to begin to tackle the issue of shopping cart abandonment. With the ability to process real-time big data, companies can not only predict shopper behavior but also automatically deliver incentives to ensure shoppers complete their purchases.

- Agriculture - Historically, farmers would submit a physical soil sample to a service and weeks later receive an analysis telling them what actions to take to maximize a harvest.

Talend 6 support for Spark data integration and analytics allows services to correlate multiple sources of structured and unstructured data – data from the field combined with historical lab data – to deliver analysis and reporting within seconds. This allows farmers to make informed moment-by-moment management decisions.

Democratization of Real-Time Analytics

With this new release, Talend is providing the first real-time big data integration platform with the potential to totally transform how organizations of all sizes do business.  Real-time analytics capabilities are no longer just for the deep-pocketed few. 

It also means that you may be at a competitive disadvantage if you don't embrace real-time big data analytics using an integrated, cost-effective platform like Talend 6.

But the impact of this shift to real-time analytics has even further reaching implications.  Beyond the obvious advantages inherent in real-time marketing, other parts of the business – from manufacturing and supply chain management to human resources – can benefit as well. This is an excellent opportunity for IT to collaborate with other parts of the business to make the most of the real-time big data supply chain and explore innovative new ways to use this advanced technology. 

A new era has begun.

[2015-09-30] Talend Blog: Are You Ready to Enter the Era of Real-Time Big Data? (French edition)

Real-time big data analytics has been growing at breakneck speed and is now about to reach an inflection point. Today – and for the first time – a number of elements are coming together to define an integrated "real-time big data supply chain" that is about to transform the way we operate.

This development could not come at a better time. According to IDC, we are caught up in a digital whirlwind that is growing 40% per year! This growth is fueled not only by the web presence of ever more businesses and individuals, but also by the rapid development of the Internet of Things. This digital universe doubles in size every two years: by 2020, IDC predicts, it will reach 44 zettabytes (ZB) – that is, 44 trillion GB!

Until now, the capabilities needed to analyze this tsunami of data and obtain fast, actionable answers were the preserve of the world's most advanced companies – "big" organizations with truly considerable IT teams and budgets. Because of their technical and budgetary constraints, small and mid-sized businesses have long remained far removed from real-time big data analytics solutions.

The Talend revolution

With the introduction of Talend 6, the real-time analytics landscape is about to change for good. Talend 6 is the first cost-effective data integration platform to natively support not only Hadoop, but also Apache Spark and Spark Streaming. By leveraging more than 100 Spark components, the platform delivers the processing speeds needed to convert big data or sensor streams into immediately actionable insight.

It is certainly useful to know what your customers did last week, but nothing beats tracking their behavior in real time: the resulting analysis lets you react instantly to improve or transform their experience. The saying "now or never" sums up this situation well.

With a real-time big data supply chain, you can introduce innovations into your customer-facing solutions that were previously unimaginable – a direct consequence of the considerable performance gains delivered by the new Talend 6 platform.

For example, existing Talend customers can convert MapReduce jobs to Spark in just a few clicks and instantly enjoy performance that is five times higher. Talend also expects the new platform to help multiply developer productivity by ten compared with hand coding, thanks to its intuitive, user-friendly design interface, prebuilt Spark components and automated Spark code generation. Talend 6 also provides a new Lambda architecture, offering a single environment for working with bulk, batch, real-time, streaming and Internet of Things data.

Beyond its technical specifications, Talend 6 offers a major advantage: the countless use cases that materialize as soon as you can query your data and receive an answer in a fraction of a second.

Use cases: when real time makes the difference

Here are a few examples of the power of real-time big data made possible by the Talend platform:

- Healthcare - Medical alert pendants (sometimes fitted with a motion detector) allow the elderly to contact an emergency service directly and automatically if they suffer an incapacitating fall and/or are no longer able to call for help by conventional means.

Now, with the power of real-time big data, a healthcare provider can continuously monitor at-risk patients. By combining the real-time data recorded by such a connected device tracking medical symptoms with the information held in medical records, analytics tools can alert healthcare professionals when proactive action is urgently needed for a given patient.

- Retail - Shopping cart abandonment (when visitors to a site select products, place them in an online cart and then leave the site without buying) is a major challenge for online retailers. According to the analyst firm BI Intelligence, a staggering $4 trillion worth of merchandise will be abandoned in shopping carts this year. Without real-time big data analytics capabilities, the only thing retailers can track is... the extent of their losses!

A Spark-powered data integration solution combined with Spark-enabled analytics gives retailers the performance and agility they need to address their cart abandonment problem. With the ability to automatically synthesize their real-time big data, these companies can not only predict each customer's behavior but also encourage them to complete their purchases.

- Agriculture - Traditionally, farmers send a sample from their field to a specialized service and then have to wait several weeks before receiving an analysis suggesting the steps to take to optimize their harvest.

Talend 6, with its support for data integration and analytics on Spark, makes it possible to correlate multiple sources of structured and unstructured data, for example data collected in the field and historical lab data, and then generate the corresponding analyses and reports within seconds. This allows farmers to make informed decisions at any moment, fully adapted to the circumstances.

Democratizing real-time analytics

With this new release, Talend delivers the first real-time big data integration platform with the potential to profoundly transform the activities of companies of all sizes. Real-time analytics is no longer reserved for large enterprises alone.

It also means that you could quickly find yourself at a competitive disadvantage if you do not adopt real-time big data analytics using an integrated, cost-effective platform such as Talend 6.

The impact of this transition to real-time analytics carries even deeper implications. Beyond the obvious advantages it brings to real-time marketing, other parts of the business – from human resources to manufacturing and the supply chain – can also benefit. This is an excellent opportunity for your IT department to collaborate with other departments so they too can reap the benefits of real-time big data analysis and explore new and original ways of applying this advanced technology.

We are entering a new era.

[2015-09-29] Talend Blog: Real-Time Big Data Goes Mainstream – Are You Ready? (Japanese edition)

Real-time big data analytics, which has been growing rapidly, is now reaching a major turning point. All kinds of data can now come together to form a real-time big data supply chain, and that will change the way we do business.

According to IDC, we live in a digital world that is growing 40% a year, driven by the online activity of individuals and companies and by the explosive growth of the Internet of Things (IoT). This digital world doubles in size every two years. By 2020, IDC says, it will reach 44 zettabytes, or 44 trillion gigabytes.

Until now, only a handful of sophisticated companies with enormous IT budgets and staff have been able to dig into this mountain of data and produce analysis that can be acted on immediately. Other companies and organizations, held back by technical and financial constraints, have watched real-time big data analytics from the sidelines.

Game changer

With the arrival of Talend 6, the real-time analytics landscape has changed completely. Talend 6 is the industry's first cost-effective data integration platform to natively support not only Hadoop but also Apache Spark and Spark Streaming. With more than 100 Spark components, the platform converts streaming data and IoT sensor content into insight in real time with unmatched performance.

Knowing what your customers did last week is important, but what matters more is grasping their behavior the moment it happens and responding right away to improve their satisfaction. There is a Zen-like teaching that suggests as much: "If it is not here in this moment, it is meaningless."

The new Talend 6 dramatically improves performance and, by enabling a real-time big data supply chain, lets you bring innovation to your customer-facing solutions.

For example, existing Talend customers can convert MapReduce jobs (Hadoop's traditional processing model) into Spark jobs with the click of a button and improve processing performance fivefold. With an intuitive design interface and Spark components that automatically generate Spark code, developer productivity improves tenfold compared with hand coding. Talend 6 also provides, through a Lambda architecture, a unified environment for handling bulk and batch processing, real-time, streaming and IoT data.

More important than Talend 6's product specifications, however, are the benefits the product brings to companies and the many use cases in which questions about data can be answered in an instant.

Use cases made possible only by real-time big data

Here are just a few examples of the real-time big data integration the Talend platform makes possible:

  • Healthcare - Medical alert pendants, working together with motion detectors, can automatically send out a call for help the moment an elderly patient becomes unable to move.

Today, real-time big data allows healthcare service companies to constantly monitor high-risk patients. When proactive care is required, combining the data emitted by personal devices that track vital signs with medical record information makes it possible to alert healthcare staff.

  • Retail - Shopping cart abandonment (the rate at which shoppers put items in an online cart but give up before completing the purchase) is a major challenge for retailers. According to BI Intelligence, the value of merchandise abandoned in shopping carts is expected to reach $4 trillion this year. Without real-time big data analytics capabilities, retailers cannot prevent these lost opportunities.

Spark-enabled data integration, working together with Spark-enabled analytics, provides the agility needed to solve the cart abandonment problem. With real-time big data processing, companies can not only predict customers' buying behavior but also automatically offer them incentives to complete their purchases.

  • Agriculture - Traditionally, farmers submitted soil samples and received, weeks later, an analysis telling them what to do to maximize their harvest.

Because Talend 6 supports data integration on Spark, such services can collect and analyze data from a variety of structured and unstructured sources, such as field data combined with historical lab data, and produce reports in an instant. Farmers can then make decisions based on up-to-the-moment information.

The spread of real-time analytics

With this new release, Talend delivers the world's first real-time big data integration platform and brings a transformation in how companies and organizations of every kind do business. Real-time analytics is no longer the privilege of a limited few.

At the same time, this also means that not performing real-time big data analytics on a cost-effective integration platform like Talend 6 will put you at a significant competitive disadvantage.

The shift to real-time analytics means even more than that. Beyond real-time marketing, it benefits other operations as well, from manufacturing and supply chain management to human resources. It is a golden opportunity for IT departments to work with other departments to make the most of the real-time big data supply chain and explore ways of putting this advanced technology to innovative use.

A new era is dawning.

[2015-09-28] Talend Blog: Survive and Thrive in a Data-Driven Future: Talend Hits the Big Apple at Strata and Hadoop World 2015!

Big Data Analytics has been advancing in recent years as the amount of information has exploded to petabytes and even EXABYTES of data. Hadoop is definitely the platform of choice for Big Data analysis and computation; however, as data Volume, Variety and Velocity increase, Hadoop as a batch-processing framework cannot cope with the requirement for real-time analytics.

Enter Spark: the engine for large-scale data processing written in Scala. In contrast to Hadoop's two-stage, disk-based MapReduce paradigm, Spark's multi-stage in-memory primitives provide performance up to 100 times faster for certain applications. As a leading provider of Open Source data integration solutions, Talend is on the bleeding edge of what’s next for the future of big data analytics and we’re excited to share that with you!
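
As a rough illustration of that in-memory, multi-stage model, here is a minimal PySpark sketch (our own example rather than anything from the conference talks; the file path and field positions are assumptions) showing a dataset that is parsed once, cached, and then reused by two aggregations instead of being re-read from disk the way chained MapReduce passes would:

from pyspark import SparkContext

sc = SparkContext(appName="clickstream-sketch")

# Hypothetical tab-separated click events: user_id, timestamp, page
events = sc.textFile("hdfs:///data/clickstream/*.log").map(lambda line: line.split("\t"))
events.cache()  # keep the parsed RDD in memory across the stages below

clicks_by_user = events.map(lambda f: (f[0], 1)).reduceByKey(lambda a, b: a + b)
clicks_by_page = events.map(lambda f: (f[2], 1)).reduceByKey(lambda a, b: a + b)

print(clicks_by_user.take(5))  # both actions reuse the cached dataset
print(clicks_by_page.take(5))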

Strata + Hadoop World brings together the best minds in strategy, science, and industry for the defining event of the big data community to explore data-driven use cases, IoT and real-time, production-ready Hadoop, Hadoop use cases and more! Talend executives, solutions experts and partners plan to take Strata by storm, joining industry colleagues to dissect case studies, deliver in-depth tutorials, share emerging best practices, and build the future of your business. Join us at the Jacob K. Javits Convention Center to see how Talend is ‘awakening your data’ to the benefit and delight of all your customers!

Connecting the Data-Driven Enterprise

I don’t think there is a more capable speaker on “what’s next” for Hadoop and big data technology than Talend’s Chief Marketing Officer, Ashley Stirrup. In his solutions theater presentation titled, “Using Spark: 5 Quick and Easy Ways to Get More out of your Data,” Ashley will highlight how a faster data integration solution can help you fundamentally transform the customer experience and increase revenue.

Using Spark: 5 Quick and Easy Ways to Get More out of your Data

Speaker: Ashley Stirrup, Chief Marketing Officer, Talend

Time: Wednesday, Sept. 30th at 11:30 AM EST (repeat session on Thursday at 1:45 PM EST)

Location: Solutions Showcase Theater

We also have a Spark demo that will be performed alongside our partners at Cloudera in booth #725 on Wednesday, Sept. 30 at 4:20 PM EST.

Take Advantage of Big Data Jedi Coaching and Enter our Star Wars Collectible Contest at Booth #846

Learn how Talend makes data processing and analysis light-years faster with Spark Real-Time Data Analysis. Talk to one of our Big Data Jedis for tips on how to overcome some of the most common Big Data challenges and explore various Spark use cases. Simply make an appointment here or drop by booth #846. And while you’re there, be sure to snap an in-booth selfie and post it to Twitter or Instagram, using the #Talend6Awakens #strataconf hashtags, and be entered to win a Star Wars collectible poster!!

If you can’t make it to the Javits Convention Center in New York City on September 30th – October 1st, you can still enter for a chance to win Star Wars memorabilia: show us your creative force by producing and posting a 15-30 second video or a custom meme explaining how the Rebels or the Empire could have used Real-time Data Analysis. Submit your entry with the #Talend6Awakens hashtag through Twitter, Facebook or Instagram.

You can also follow all the show action through our social accounts @Talend, on our Talend for Big Data LinkedIn group, or on Facebook at facebook.com/talend. We’ll also have key news and insights from Strata right here on the blog, so stay tuned….

Hope to see you in booth #846!

[2015-09-25] Talend Forum Announcement: #Talend6Awakens: Win a *Signed Star Wars Collectible*

Dear Community,

Win a *signed Star Wars collectible*!
In celebration of the upcoming Talend 6 release we want you to tell us how the Rebels or Empire could have used Real-time Data Analysis. To enter, create a 30 second video and post it with the #Talend6Awakens hashtag. Get details and rules on our Facebook page through this link: https://www.facebook.com/talend/app_403834839671843
Show us your force!

May your data be with you!
The Talend Team

[2015-09-24] Talend Blog: The Role of Data Governance in Delivering Seamless Omni-Channel Experiences

Deploying the most powerful technology tools or building sophisticated processes helps retailers maximize the shopper experience, drive operational excellence and ensure financial health. However, these results all depend on the quality of the data used to drive business activities. Indeed, findings from Aberdeen Group’s Data-Driven Retail study show that quality of data is the most pressing challenge influencing retailers’ data management activities — see the figure below.

Figure 1: Balance Quantity with Quality

Note: This question was asked as a multi-answer question, meaning that respondents were able to select multiple data management challenges impacting their activities.

Quality of data refers to the cleansing and profiling of the data captured across multiple channels (e.g. social media, web and mobile applications). Retailers capture myriad insights related to shopper activities through numerous channels; some of this data is structured and some is unstructured. Unless the data is cleansed and profiled, companies will struggle to establish visibility into consumer journeys by integrating the data captured across these channels.

Savvy retailers are more than twice as likely as their peers to use technology tools such as database management to cleanse and profile customer and operational data. This helps them ensure that the data used to run analytics yields relevant results to guide shopper interactions as well as manage activities such as inventory management and pricing.

If you don’t currently have a data governance program designed to streamline your data management activities, we highly recommend you adopt one. It will help you get the maximum benefit from your technology investments in areas such as omni-channel and analytics. It will also help you minimize the business risks associated with using poor insights to guide your strategic and tactical activities.

Omer Minkara

Research Director, Contact Center & Customer Experience Management

Aberdeen Group

There are myriad technology tools that help retailers maximize shopping experiences. See Talend’s related post highlighting these technology enablers. It’s important to note that the results technology delivers are only as good as the data fed into these systems. Talend and Aberdeen highly recommend that retailers cleanse and profile customer data captured across multiple channels in order to improve data quality and, as a result, the accuracy of the insights gleaned through analytics. Watch the Talend webinar on ensuring data quality to learn more about this topic.

To learn more on how Best-in-Class retailers build and manage data governance programs, read Aberdeen’s Data-Driven Retail study.

About the author, Omer Minkara

Omer Minkara is the Research Director leading the Contact Center & Customer Experience Management research within Aberdeen Group.

In his research, Omer covers the Best-in-Class practices and emerging trends in the technologies and business processes used to enhance customer experience across multiple interaction channels (e.g. social, mobile, web, email and call center). Omer’s research is widely consumed by senior-level Customer Care, Marketing, Sales and Service executives. He has published numerous industry research papers, which are used by executives worldwide to build and nurture strategic customer engagement programs. Omer also speaks frequently with global decision makers to discuss their customer management activities.

Omer has a strong finance background with significant international experience. Prior to joining Aberdeen Group, he was an auditor at PricewaterhouseCoopers in the Europe region. Omer has an MBA degree from Babson College, where he participated in the launch of a technology company, creating a customer acquisition and engagement strategy, and developing all the operational and financial forecasts for the enterprise.

[2015-09-21] Talend Blog: The Path to Optimize Retail Operations through Big Data

Running a retail business is no easy feat. There are plenty of unknowns and moving parts that change the dynamics of the business. Coupled with rising consumer expectations for better service, achieving operational excellence is not an option, but rather a necessity for modern retailers.

In our previous post we noted the benefits of becoming a data-driven retailer. We also highlighted that savvy organizations that excel in making effective use of data enjoy substantial improvements in key operational metrics such as average order delivery time and time to market of products / services.

Findings in Aberdeen Group’s Data-Driven Retail study revealed invaluable insights on achieving operational excellence. The common factor across all these insights is that retailers must streamline how they capture, manage and use operational and consumer insights in order to maximize their performance. The below figure illustrates several activities that help savvy retailers accomplish this goal.

Figure 1: Drill-down into Data to Tailor the Shopper Experience

A common mistake most retailers make is focusing solely on adopting the activities above and overlooking the vital role data governance plays in bringing relevant and timely insights to these activities. For example, the above figure shows that retailers must utilize data from different shopper segments to identify the optimal price for each product — as well as the channels used to sell these items. However, in the absence of relevant and accurate data, pricing and customer targeting activities are prone to error, resulting in outcomes such as setting product prices lower than their value in the eyes of shoppers and thus missing out on additional revenue.

Retailers have access to a world of insights across multiple channels. Some of these are structured (e.g. web logs) and others are unstructured (e.g. customer-generated social media content such as a tweet). Successful businesses differentiate themselves through their ability to integrate these different types of data into a single view of the shopper. Doing so provides visibility into the products consumers demand, identifies price sensitivity for different items and determines which e-commerce site content is most likely to convert website visitors into paying customers.

If you haven’t already laid the foundation to help your employees analyze more data gathered from more sources, we highly recommend you do so. It will require moving to a new big data processing paradigm so you can integrate all your data, operate in real-time and act with insight. This will help you realize the full potential of using the best practices highlighted in the Data-Driven Retail study.

Omer Minkara

Research Director, Contact Center & Customer Experience Management

Aberdeen Group

As noted in the above post by Aberdeen, Best-in-Class retailers have implemented big data technologies that correlate historical patterns with in-the-moment buying behavior. These sites can predict when a shopping cart is about to be abandoned and take instant action to prevent it, such as offering new information, reviews or a price discount. Product recommendation engines, such as Amazon’s, are excellent examples of how to increase customer loyalty as well as average purchase price.

Like to learn more? Please read Aberdeen’s Data-Driven Retail study to learn more about the challenges faced by modern retailers and how effective data management activities pave the way for success.

About the author, Omer Minkara

Omer Minkara is the Research Director leading the Contact Center & Customer Experience Management research within Aberdeen Group.

In his research, Omer covers the Best-in-Class practices and emerging trends in the technologies and business processes used to enhance customer experience across multiple interaction channels (e.g. social, mobile, web, email and call center). Omer’s research is widely consumed by senior-level Customer Care, Marketing, Sales and Service executives. He has published numerous industry research papers, which are used by executives worldwide to build and nurture strategic customer engagement programs. Omer also speaks frequently with global decision makers to discuss their customer management activities.

Omer has a strong finance background with significant international experience. Prior to joining Aberdeen Group, he was an auditor at PricewaterhouseCoopers in the Europe region. Omer has an MBA degree from Babson College, where he participated in the launch of a technology company, creating a customer acquisition and engagement strategy, and developing all the operational and financial forecasts for the enterprise.

 

[2015-09-15] Talend Blog: Being a Data-Driven Retailer: What’s in it for You?

The short answer to the question above is: ‘a lot.’ However, that alone doesn’t convey the vital importance of adopting a data-driven approach to managing retail business activities ranging from supply chain management to pricing and demand planning. As such, this post provides a detailed overview of the performance benefits retailers enjoy by laying the foundation to become truly data-driven.

Before we highlight all the reasons why you must become data-driven, let’s first define what we mean by the term. Data-driven retailers are organizations that exhibit a high level of prowess in their data management activities, helping them convert multi-channel customer data into actionable insights. It’s these insights that allow them to deliver seamless (consistent and personalized) messages across multiple channels, such as in-store, web, live chat, email and mobile applications.

Below is a chart from Aberdeen Group’s Data-Driven Retail study illustrating the wide performance gaps between savvy retailers with Best-in-Class data management practices and All Others — more immature retail organizations.

Figure 1: Effective Use of Data Powers Superior Results

Retailers that excel in converting data into insights reap the rewards of their efforts across three main categories:

Operational efficiency: This refers to an organization’s ability to streamline its business activities to achieve desired outcomes with minimal errors, cost and effort. Performance metrics such as time-to-market of products and services as well as average order delivery time are among the indicators of organizational success in operational excellence. Data-driven retailers enjoy significantly greater annual improvements (reductions) in the time it takes to bring products and services to market, compared to All Others. Improvement in this metric validates an organization’s ability to use data to understand the bottlenecks and restrictions in providing customers with products and services, and to use these insights to fine-tune existing activities.

Financial success: Operational excellence should be a core focus of all retailers; however, it must be balanced by the pursuit of financial health, as a firm’s ultimate success is often correlated with its performance in driving top-line and bottom-line results. Effective use of data helps retailers achieve more than twice the annual increase in return on marketing investment (ROMI) compared to companies without Best-in-Class data management practices. Using a data-driven approach to establish the right prices for the right products and to ensure item availability also helps savvy retailers achieve far superior gains in product margins and return on inventory investments — both key measures tracked closely to ensure financial success.

Customer experiences: It’s important to note that the aforementioned results are closely linked with an organization’s ability to identify shopper needs and address them in a timely and relevant fashion. Companies that master the effective use of data once again excel in this area, enjoying approximately three times the annual increase in brand awareness compared to All Others. This validates the benefit of using data to identify distinct consumer groups and target them with personalized messaging aimed at making them aware of the company’s brand and encouraging them to purchase its products and services.

Omer Minkara

Research Director, Contact Center & Customer Experience Management

Aberdeen Group

Like to learn more? Please read Aberdeen’s Data-Driven Retail study to learn more about the challenges faced by modern retailers and how effective data management activities pave the way for success.

 

About the author, Omer Minkara

Omer Minkara is the Research Director leading the Contact Center & Customer Experience Management research within Aberdeen Group.

In his research, Omer covers the Best-in-Class practices and emerging trends in the technologies and business processes used to enhance customer experience across multiple interaction channels (e.g. social, mobile, web, email and call center). Omer’s research is widely consumed by senior-level Customer Care, Marketing, Sales and Service executives. He has published numerous industry research papers, which are used by executives worldwide to build and nurture strategic customer engagement programs. Omer also speaks frequently with global decision makers to discuss their customer management activities.

Omer has a strong finance background with significant international experience. Prior to joining Aberdeen Group, he was an auditor at PricewaterhouseCoopers in the Europe region. Omer has an MBA degree from Babson College, where he participated in the launch of a technology company, creating a customer acquisition and engagement strategy, and developing all the operational and financial forecasts for the enterprise.

[2015-09-09] Talend Blog: Creating a Strategic and Dynamic Data Supply Chain

Hortonworks recently featured Talend on their blog with an interview of Talend’s CMO, Ashley Stirrup. In this post, which we are also sharing with you here, Ashley talks about transforming ETL and helping organizations support a dynamic data supply chain.

In order to remain viable in increasingly competitive markets, companies must create ever-more detailed models of the business that incorporate all data – regardless of source or volume. In essence, companies need to expand from a one-megapixel view of business activity to a fine-grained gigapixel view. This visibility, paired with the creation of predictive models, enables companies to understand what is likely to happen and what activities lead to desired outcomes. Companies can then focus on encouraging those outcomes with data-driven management practices.

What holds most companies back from becoming truly data-driven? Not surprisingly it’s the data: accessing it from many sources to create a detailed view, preparing it, transforming it, loading it, reporting on it, storing it, and analyzing it.  In order to become a data-driven organization, companies must change the way they think about data. It must be considered and treated like a highly strategic asset. That means companies must become masters of their data – and therein lies the challenge.

Many companies today operate in data chaos. Multiple data silos, poor data quality, the growth of big data, new data sources and inconsistent data across systems – all of these are contributing factors. Data chaos has also resulted from legacy and accidental architectures that built up over time.

Meanwhile, the process of extracting, transforming, and loading data (ETL) is changing. The standard model of application data moving through an ETL process and ending up in a data warehouse has been under stress for a long time. For many organizations, offloading this work has been the primary use case for Hadoop, where the process has been reordered as ELT (Extract, Load, and Transform), with most of the T taking place inside Hadoop.

But the transformation that is really required is much bigger than this. With new sources of data everywhere, ETL is a vital process that must allow the E, the T, and the L to be combined, ordered, and located wherever needed. At times, you will land data in Hadoop, transform it there, and load it somewhere else. Or extract data from mobile gateways and distribute it to both big data repositories and to the data warehouse. In other words ETL has grown into a form of data logistics that supports a dynamic data supply chain. And like supply chains in the real world, the infrastructure must allow for constant adjustment, reconfiguration, and optimization as conditions change, new data sources arrive, new technology is installed, and disruptions occur. Instead of taking two months to add a field with traditional ETL, modern data logistics must allow changes to be made in near real time.


Talend and Hortonworks have been working together for many years to help organizations transition quickly and efficiently from data chaos to modern data architectures capable of supporting a dynamic data supply chain. Hortonworks is the leading contributor to the Hadoop core, and its Apache Hadoop distribution is the foundation of this modern data architecture. Talend integrates data from all types of data sources: real-time data, application data, data warehouses, and big data sources such as Hadoop. It is designed to solve the entire problem of creating a dynamic data supply chain: integrating all types of data, normalizing them, and providing governed access at scale.

Many organizations currently still live, to one degree or another, with data chaos. They know that part of the truth is in one system or data source, and some parts of the truth are over there in big data, but they have no effective way to integrate it all in time to matter. Ultimately, what Talend and Hortonworks provide is a way to change this situation and create a dynamic data supply chain that is foundational for data-driven organizations.

Learn how to Make the Transition to a Dynamic Data Supply Chain Today

Talend has created a fully integrated demo with the Hortonworks Sandbox. Check out the video and access the Talend + Hortonworks Big Data Sandbox here.

[2015-09-04] Talend Forum Announcement: For test only, Talend Open Studio's 6.1.0 M2 release is available

Dear Community,

We are pleased to announce that Talend Open Studio's 6.1.0 M2 release is available, for testing only. This milestone contains new features and bug fixes, and is recommended for experienced users who need an early preview of the upcoming 6.1 release.

Download the second milestone of Talend Open Studio for [Data Integration | Big Data | Data Quality | MDM | ESB]:


Big Data: http://www.talend.com/download/talend-open-studio
Data Quality: http://www.talend.com/download/talend-open-studio#t2
ESB: http://www.talend.com/download/talend-open-studio#t3
Data Integration: http://www.talend.com/download/talend-open-studio#t4
BPM: http://www.talend.com/download/talend-o … s_download
MDM: http://www.talend.com/download/talend-open-studio#t6

Thanks for being a part of our community
The Talend Team

[2015-09-03] Talend Blog: Bootstrapping AWS CloudFormation Stacks with Puppet and Structured EC2 User Data

The purpose of this blog is to provide practical advice on how to approach application bootstrapping on AWS using CloudFormation, Puppet, Hiera and Serf. To obtain the most benefit from the blog, you should be technically familiar with these four technologies.

High-level tasks include:

  1. Create your CloudFormation template
  2. Install and configure software on the instances
  3. Connect instances to other instances in the stack

Creating the CloudFormation template

Writing CloudFormation templates can be a painful task if you try to write them directly in JSON. Writing short and simple JSON structures by hand is fine; however, describing your entire AWS infrastructure directly as JSON is not very practical, for the following reasons:

- Developer workflow problems – keeping large JSON files in source control means everyone must use exactly the same parser and output formatting; otherwise diffs are unusable, which prevents code review.

- Very limited code modularity and re-usability.

- Depending on your parser of choice, syntax errors can be hard to find, especially in large documents. 

- JSON has no support for comments, and code without comments is never a good idea.

- Syntactically correct JSON isn't necessarily semantically correct CloudFormation. Using a full programming language allows for testing at an earlier stage.

Instead of writing large JSON files directly, it's much easier to use a full-featured programming language to generate the templates.

We have selected Python and Troposphere (https://github.com/cloudtools/troposphere) for our CloudFormation generation.
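
To give a flavor of what this looks like, here is a minimal, hypothetical troposphere sketch (the resource names, ports and CIDR range are ours, not taken from our actual stacks) that generates a pair of security groups in a loop and prints the resulting CloudFormation JSON:

from troposphere import Template
from troposphere.ec2 import SecurityGroup, SecurityGroupRule

t = Template()

# Ordinary Python comments, loops and helpers give us the modularity and reuse raw JSON lacks
for name, port in [("Web", 443), ("App", 8080)]:
    t.add_resource(SecurityGroup(
        "%sSecurityGroup" % name,
        GroupDescription="Allow inbound traffic to the %s tier" % name,
        SecurityGroupIngress=[SecurityGroupRule(
            IpProtocol="tcp", FromPort=port, ToPort=port, CidrIp="10.0.0.0/8")],
    ))

print(t.to_json())  # this generated JSON is what actually gets sent to CloudFormation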

In addition to describing your AWS resources, CloudFormation has the task of providing the instances' bootstrapping logic. This is done in Instance UserData. 

Let's have a look at some of the possibilities for UserData:

  1. Single shell script as UserData - requires CloudInit and encoding the shell script in the JSON template
  2. YAML-encoded CloudInit configuration - requires CloudInit and encoding YAML in the JSON template
  3. CloudFormation helper scripts - generally requires Amazon Linux, and may require encoding shell scripts in the CloudFormation metadata resource
  4. JSON-encoded UserData (our preferred option) - requires a custom AMI (Amazon Machine Image), since the logic that interprets the custom JSON-encoded UserData must already exist on the AMI

Exploring Option 4

This option requires using custom AMIs, but that's actually not a new requirement. We use custom AMIs since we don't want to install software during instance boot, which can cause failures and/or a general slowdown during autoscaling.

Since we are already building custom AMIs (using http://packer.io), why not install a start-up script that reads the structured UserData and hands the bootstrapping task to a configuration management tool? Configuration management tools are much better equipped for this task than shell scripts or cloud-init helper scripts.

Install and configure the application on the instance

Using custom AMIs means that installation happens during AMI creation, while configuration happens during instance boot.

Since we are taking the approach of JSON-encoded UserData, we need something on the instances that understands this UserData and translates it into application configuration.

Take, for example, the following UserData:

{
  "platform": {
    "branch": "releases",
    "dc": "aws-us-east-1",
    "environment": "development",
    "profile": "tipaas",
    "release": "101",
    "role": "webapp",
    "stack": "development-testing"
  },
  "cloudformation": {
    "resource_name": "WebAutoscalingGroup",
    "stack_name": "rnd-IntegrationCloud-1VTIEDLMDO8YW-Web1a-1IY3WUHI0XCNN"
  },
  "webapp_config": {
    "elastcache_endpoint": "dev1shared.qygjey.cfg.use1.cache.amazonaws.com:1121"
  }
}

Now, back to the Python troposphere library. It's very easy to extend the troposphere library to provide a custom UserData Python class, which returns the above JSON when the final CloudFormation template is rendered.
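
A minimal sketch of such a helper is shown below; the make_user_data function, the AMI ID and the LaunchConfiguration values are illustrative assumptions rather than our production code:

import json

from troposphere import Base64, Template
from troposphere.autoscaling import LaunchConfiguration

def make_user_data(platform, **sections):
    # Render the structured UserData document as Base64-encoded JSON
    doc = {"platform": platform}
    doc.update(sections)
    return Base64(json.dumps(doc, indent=2, sort_keys=True))

t = Template()
t.add_resource(LaunchConfiguration(
    "WebLaunchConfig",
    ImageId="ami-12345678",   # custom AMI baked with Packer, placeholder ID
    InstanceType="m3.medium",
    UserData=make_user_data(
        platform={"profile": "tipaas", "role": "webapp", "environment": "development",
                  "dc": "aws-us-east-1", "release": "101", "branch": "releases",
                  "stack": "development-testing"},
        webapp_config={"elastcache_endpoint": "dev1shared.qygjey.cfg.use1.cache.amazonaws.com:1121"},
    ),
))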

What is left to do is translate the above JSON to a concrete Puppet catalog (remember – Puppet, Hiera and the Puppet modules are already installed on the AMI). 

Next steps are:

1) “facter-ize” all the platform variables

2) execute `puppet apply site.pp` where site.pp is an empty manifest containing only an empty default node and let Hiera provide all the classes and variables for the Puppet catalog compilation.

For example “/etc/rc.local”  looks like this:

#!/bin/bash
/usr/local/sbin/ec2_userdata_facterize.py   # reads the EC2 UserData and creates a facter fact for each platform variable
/usr/bin/puppet apply site.pp
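
The facterization script itself is not shown here, so below is a hypothetical sketch of what /usr/local/sbin/ec2_userdata_facterize.py could look like, assuming the standard EC2 metadata endpoint and facter's external facts directory:

#!/usr/bin/env python3
# Hypothetical sketch: read the JSON UserData and expose each "platform" variable
# as an external facter fact with the t_ prefix used by the Hiera hierarchy below.
import json
import os
import urllib.request

USER_DATA_URL = "http://169.254.169.254/latest/user-data"   # EC2 instance metadata service
FACTS_DIR = "/etc/facter/facts.d"                            # facter external facts directory

def main():
    raw = urllib.request.urlopen(USER_DATA_URL, timeout=5).read()
    platform = json.loads(raw).get("platform", {})
    os.makedirs(FACTS_DIR, exist_ok=True)
    with open(os.path.join(FACTS_DIR, "platform.txt"), "w") as facts:
        for key, value in sorted(platform.items()):
            facts.write("t_%s=%s\n" % (key, value))   # e.g. t_profile=tipaas, t_role=webapp

if __name__ == "__main__":
    main()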

Snippet of hiera.yaml content (this is a very short snippet of our actual hierarchy):

- "%{::t_profile}/role/%{::t_role}/dc/%{::t_dc}/env/%{::t_environment}/stack/%{::t_stack}"
- "%{::t_profile}/role/%{::t_role}/env/%{::t_environment}/stack/%{::t_stack}"
- "%{::t_profile}/role/%{::t_role}/env/%{::t_environment}/release/%{::t_release}"
- "%{::t_profile}/role/%{::t_role}/dc/%{::t_dc}/env/%{::t_environment}"
- "%{::t_profile}/role/%{::t_role}/env/%{::t_environment}"
- "%{::t_profile}/env/%{::t_environment}/stack/%{::t_stack}"
- "%{::t_profile}/env/%{::t_environment}/release/%{::t_release}"
- "%{::t_profile}/env/%{::t_environment}/variables"

Now we have successfully separated the CloudFormation development from the configuration development. We are also using the full potential of Hiera and Puppet to separate code from configuration variables.

But there’s more. We use the platform variables/facts and serf (http://serfdom.io) to connect to other instances of the same stack.

3. Connecting to other instances in the stack

(Note: This approach is only suitable for development environments or testing phases of CI pipelines.)

Now that we have our facter facts in place, we use them to configure a serf agent on each instance of the stack. Serf is an agent for decentralized cluster membership (for more details see http://serfdom.io).

The agent is configured with a set of tags corresponding to the set of platform variables on our UserData. After the serf agent is configured and running we can use it to obtain information about other nodes in the stack.

Here is an example of output obtained by running serf:

#/usr/local/bin/serf members -status alive -format json
...
{
  "name": "ip-10-100-9-130",
  "addr": "10.100.9.130:7946",
  "port": 7946,
  "tags": {
    "t_branch": "releases",
    "t_dc": "aws-us-east-1",
    "t_environment": "development",
    "t_profile": "tipaas",
    "t_release": "101",
    "t_role": "webapp",
    "t_stack": "development-testing"
  }
}
...

The output of the above command contains one such member definition for each instance in the stack. Now we have to make this information available to Puppet in an easy way. That's done again with Hiera and facter.

First we create a set of custom facts – one for each profile+role combination – where the remaining platform variables (all but profile and role) match the same set of variables on the node where the custom facts are generated.

Example:

#facter | grep serf_my_
serf_my_activemq_broker => ip-10-100-2-21
serf_my_activemq_re => ip-10-100-49-114
serf_my_elk_elasticsearch => ip-10-100-41-79
serf_my_idm_syncope => ip-10-100-9-130
serf_my_mongo_repl_set_instance => ip-10-100-62-139
serf_my_repomgr_nexus => ip-10-100-51-250
serf_my_postgres_db => ip-10-100-20-245
serf_my_tipaas_rt_flow => ip-10-100-105-201
serf_my_tipaas_rt_infra => ip-10-100-36-145
serf_my_tipaas_webapp => ip-10-100-47-174
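
How these serf_my_* facts get generated is not shown above; one hypothetical way to produce them, assuming the serf members output shown earlier and hostnames that match the serf member names, is an external-fact script along these lines:

#!/usr/bin/env python3
# Hypothetical sketch: emit serf_my_<profile>_<role> facts for every alive serf member
# whose non profile/role platform tags match our own.
import json
import socket
import subprocess

MATCH_TAGS = ("t_branch", "t_dc", "t_environment", "t_release", "t_stack")

def serf_members():
    out = subprocess.check_output(
        ["/usr/local/bin/serf", "members", "-status", "alive", "-format", "json"])
    data = json.loads(out)
    # depending on the serf version, the JSON is either a bare list or wrapped in {"members": [...]}
    return data["members"] if isinstance(data, dict) else data

def main():
    members = serf_members()
    hostname = socket.gethostname().split(".")[0]           # e.g. ip-10-100-9-130
    me = next(m for m in members if m["name"] == hostname)
    mine = {tag: me["tags"].get(tag) for tag in MATCH_TAGS}
    for member in members:
        tags = member["tags"]
        if all(tags.get(tag) == mine[tag] for tag in MATCH_TAGS):
            # printed as key=value lines, e.g. redirected into /etc/facter/facts.d/serf_my.txt
            print("serf_my_%s_%s=%s" % (tags["t_profile"], tags["t_role"], member["name"]))

if __name__ == "__main__":
    main()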

Now that we have those custom facts, we can introduce them in Hiera at the appropriate levels of the hierarchy.

Example in a Hiera file, tipaas/role/webapp/env/development.yaml:

---
tipaas::activemq_nodes: "%{::serf_my_elk_elasticsearch}"
tipaas::mongo_nodes: "%{::serf_my_mongo_repl_set_instance}"

A Few Conclusions

Encoding shell scripts in CloudFormation templates is a valid approach, but using structured UserData provides better separation of concerns between the infrastructure code and configuration management.

Using Troposphere and Python to develop the CloudFormation templates allows for common developer workflows such as code reviews, local testing, and inline documentation as part of the code.

Combining master-less Puppet and Hiera with Serf (serfdom.io) works really well for orchestrating development and integration environments.

 

[2015-08-26] Talend Blog: Focus IT development on the user experience while improving the developer/designer relationship

Until recently, developing a user-friendly product rarely made the priority list of software makers. IT engineers were more concerned with the service to be rendered than they were with the usability[1] of their software. Packing a product with features was more important than making it easy to use. But at a time when IT products are a dime a dozen, we are now entitled to the same convenience and ease of use with business applications that we enjoy with mainstream products.

If software makers fail to consider usability, IT departments (their main customers) risk having their users turn to applications other than the ones they recommend.  This leads to the development of parallel computing models ("shadow IT") which are not certified by the IT department. The success of Tableau in the field of Business Intelligence is a perfect example.

 "User-Centered Design": Usability and Ownership

Software makers have now developed proven design methods to avoid this phenomenon and better identify user needs as well as potential constraints. User-centered design places the end user at the core of the development process (from needs analysis and the structure of the software to the development of more or less elaborate models). Nowadays, software developers can rely on user experience specialists to reach this goal.

Observe and Identify Key Use Cases

The first phase of the process is an on-site observation of users as they perform their tasks in everyday situations. This is the "field research" phase: studying how the user handles the application, how they interact with their coworkers and what constraints govern their daily work. For example, on an industrial assembly line, operators are often equipped with protective gloves, their keyboard is protected by plastic wrap and touch screens are more practical than using a mouse; it is essential that software makers consider this range of constraints to ensure that the design of new features is relevant.

Following this analysis, the UX designer composes descriptive lists detailing the profile of each user, as well as various characteristics such as the user's command of computer applications ("persona"). At the same time, with the help of the product manager, the designer completes a comprehensive analysis of use cases, i.e. the different needs that the software must meet. For example, to prepare a data set before integrating it into a marketing campaign management tool, a business user spends an average of 80% of their time preparing data and only 20% analyzing it. An application that can automate this process by suggesting changes (capitalizing the first letter of a name, splitting the first and last names into separate fields, correcting addresses, etc.) can help reverse this ratio, allowing the user to devote more time to adding value.

Prototyping and Testing

The second phase—prototyping—is an essential step; it provides an interface based on the analysis carried out in the initial phase. An initial set of static prototypes is generated from the first round of analysis, often in black and white, followed by a second series of prototypes that are more in-depth and detailed. In the majority of R&D departments, the work of the UX designer often stops there and the developer takes over to 'code' the interface. This phase of the development, which can be quite long, requires a great deal of attention to detail to ensure the UX specifications provided by the designer are faithfully reproduced. When they are not, bugs related to usability will be identified, but they are rarely made a priority, and fixing them is often put off until later. This frustrates all those involved in the project - the developer is criticized, the designer is not heard and the end user does not get what they want.

In my view, however, there is a simple solution that I have experienced myself - the UX designer must go further in the prototyping stage and offer animated models, delivered in HTML & CSS. This should be done using modern development frameworks such as Bootstrap.js or AngularJS, which are now broadly used by developers. This would greatly facilitate the work of developers, who could then simply concentrate on what they do best, i.e. data connection, data exchange, performance between front-end and back-end, etc. This would also present a functional, usable, and appealing version to decision makers and target users in the early stages, thus facilitating the adoption of the new software at all levels. Gone are the days of the lengthy 'tunnel' phase, during which assessing the progress of the project is difficult. Now, a ‘live’ version is available at all stages of the project.

Moreover, it facilitates the completion of the last phase of the "User-Centered Design" process, which consists of testing and evaluating developments. It becomes easy to assign each use case to the prototype/product in order to verify that it works correctly at each stage. The target user experiences actual operating conditions and says aloud what they are doing. Video cameras record the users and, after all footage is consolidated, problems are identified and addressed on a regular basis.

The Best of Both Worlds

The professions of IT engineer and UX designer have traditionally been very distinct. While the nature of their tasks presents different challenges, new forms of collaboration must be found, such as the joint development of the 'front-end' component. To this end, it would be useful if, during their respective courses of study, developers and designers were each familiarized with the work of the other, so that both parties understand the general constraints of a project and, ultimately, can speed up development while responding more precisely to the demands of end users.

Faced with increased competition, tighter turnaround times aimed at responding to market developments more quickly, and the need to deliver the applications our users consume daily ever more rapidly, collaboration between UX designers and developers is becoming a strategic challenge for software makers.



[1] According to Wikipedia, usability corresponds to "the effectiveness, efficiency, and satisfaction with which specific users should be able to perform tasks"

 

[2015-08-21] Talend Blog: Talend – Implementation in the ‘Real World’: Data Quality Matching (Part 2)

First an apology: It has taken me a while to write this second part of the matching blog and I am sure that the suspense has been killing you. In my defence, the last few months have been incredibly busy for the UK Professional Services team, with a number of our clients’ high-profile projects going live, including two UK MDM projects that I am involved with. On top of that, the interest in Talend MDM at the current time is phenomenal, so I have been doing a lot of work with our pre-sales / sales teams in Europe whenever their opportunities require deep MDM expertise. Then of course, Talend is gearing up for a major product release later this year, so I recently joined a group of experts from around the world at ‘Hell Week’ at our Paris office – where we do our utmost to break the Beta with real world use cases as well as test out all the exciting new features.[1]

Anyway – when I left you last time we had discussed the importance of understanding (profiling) our data, then applying standardisation techniques to the data in order to give ourselves the best chance of correctly identifying matches / duplicates within our data sets. As I stated last time, this is unlikely to be optimised in a single pass, but more likely to be an iterative process over the duration of the project (and beyond – I will discuss this later). Now we are ready to discuss the mechanics of actually matching data with Talend. At a high level there are two strategies for matching:

1. Deterministic Matching

This is by far the simpler of the two approaches. If our data set(s) contain appropriate data we can either:

  1. Use one or more fields as a key or unique identifier – where identical values exist we have a match. For example the primary / foreign keys of a record, national id numbers, Unique Property Reference Numbers (UPRN), tax registration numbers etc.
  2. Do an exact comparison of the contents of some or all of the fields

In essence with deterministic matching we are doing a lookup or a join – hopefully you should all be familiar with doing this within Talend and within relational databases.  Of course even this strategy can bring its own technical challenges – for example joining two very large sets of data efficiently, but this is a topic for another blog.
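For completeness, a trivial Python sketch of the deterministic case – an exact lookup/join on a shared identifier (the field names and sample values are invented for the example):

# Deterministic matching: an exact lookup/join on a shared identifier.
# 'national_id' is an invented key name for the example.
customers_a = [
    {"id": 1, "national_id": "AB123456C", "name": "Adam Pemble"},
    {"id": 2, "national_id": "ZZ999999Z", "name": "Jane Smith"},
]
customers_b = [
    {"id": 77, "national_id": "AB123456C", "name": "A. Pemble"},
]

index = {row["national_id"]: row for row in customers_a}
matches = [(index[row["national_id"]], row)
           for row in customers_b if row["national_id"] in index]
print(matches)   # identical key values => a match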

Everyone’s favourite component, tMap joining data from two sources and a sneak preview of new features:

 

2. Probabilistic Matching

 

 

The issue with deterministic matching is that it will not necessarily identify all matches / duplicates:

 

In the example above, the system IDs (the primary key) are different – i.e. the record has been keyed in twice, so this doesn’t help us. The National ID could be a contender for helping us match, but it appears to be an optional field. Besides, even if it were mandatory, what if a mistake was made typing in the ID? Finally we have the name fields, but again a deterministic match doesn’t help us here, due to the typo in the ‘Last’ field. The example also illustrates that even if we had some way of addressing these issues, it may not be possible for an automatic algorithm – or even human intervention – to accurately determine whether the two records are a match; we simply might not have enough information in the four columns to make a decision.

Now let’s say we had a real world data set (or multiple real world data sets) with a far greater breadth of information about an entity. This is where it gets interesting. Probabilistic or ‘Fuzzy’ matching allows us to match data in situations where deterministic matching is not possible or does not give us the full picture. Simplistically, it is the application of algorithms to various fields within the data, the results of which are combined using weighting techniques to give us a score. This score can be used to categorise the likelihood of a match into one of three categories: Match, Possible Match and Unmatched:

• Match – automatic matching
• Possible Match – records requiring a human Data Steward to make a decision
• Unmatched – no match found
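As a rough illustration of the weighting-and-threshold idea, here is a small Python sketch; the field names, weights and thresholds are invented for the example and would in practice be tuned to your data:

# Combine per-field similarity scores into a weighted total, then categorise it.
# Weights and thresholds are illustrative only.
WEIGHTS = {"first_name": 0.3, "last_name": 0.4, "postcode": 0.3}
MATCH_THRESHOLD = 0.85       # at or above: automatic match
POSSIBLE_THRESHOLD = 0.65    # between the two thresholds: send to a Data Steward

def categorise(field_scores):
    total = sum(WEIGHTS[f] * s for f, s in field_scores.items())
    if total >= MATCH_THRESHOLD:
        return total, "Match"
    if total >= POSSIBLE_THRESHOLD:
        return total, "Possible Match"
    return total, "Unmatched"

# e.g. per-field scores produced by the fuzzy algorithms discussed below
print(categorise({"first_name": 0.70, "last_name": 0.97, "postcode": 1.0}))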

Within the Talend Platform products, we supply a number of Data Quality components that utilise these ‘fuzzy’ algorithms. I cannot stress enough the importance of understanding, at least at a high level, how each algorithm works and what its strengths and weaknesses are. Broadly, they are split into two categories: Edit Distance and Phonetic.

Edit Distance Algorithms

From Wikipedia:

In computer science, edit distance is a way of quantifying how dissimilar two strings (e.g., words) are to one another by counting the minimum number of operations required to transform one string into the other.

From a DQ matching perspective, this technique is particularly useful for identifying the small typographical errors that are common when data is entered into a system by hand. Let’s look at the edit distance algorithms available within Talend, all of which are well-known, industry-standard algorithms:

Levenshtein distance

Most useful for matching single word strings. You can find a detailed description of the algorithm here: https://en.wikipedia.org/wiki/Levenshtein_distance, but in essence it works by calculating the minimum number of substitutions required to transform one string into another.

Example (again from Wikipedia):

The Levenshtein distance between ‘kitten’ and ‘sitting’ is 3, since the following three edits change one into the other, and there is no way to do it with fewer than three edits:

  1. kitten → sitten (substitution of "s" for "k")
  2. sitten → sittin (substitution of "i" for "e")
  3. sittin → sitting (insertion of "g" at the end).

As we look at each algorithm it is important to understand the weaknesses. Let’s return to our ‘Pemble’ vs ‘pembel’ example:

  1. Pemble → Pembee (substitution of "e" for "l")
  2. Pembee → Pembel (substitution of "l" for "e")
  3. Pembel → pemble? No – Pembel → pembel (substitution of "p" for "P")

Yes – that’s right – the algorithm is case sensitive and the distance is 3 – the same as ‘kitten’ and ‘sitting’! Once again, this is a nice illustration of the importance of standardisation before matching: For example, standardising so the first letter is upper case would immediately reduce the distance to 2. Later, I will show how these distance scores translate into scores in the DQ components.

Another example: ‘Adam’ vs ‘Alan’

  1. Adam → Alam
  2. Alam → Alan

Here the Levenshtein distance is 2. However consider the fact that ‘Adam’ and ‘Alan’ may be the same person (because the name was misheard) or they may be different people. This illustrates why we need to consider as much information as possible when deciding if two ‘entities’ are the same – the first name in isolation in this example is not enough information to make a decision. It also demonstrates that we need to consider the possibility of our fuzzy matching introducing false positives.
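If you want to experiment, here is a short Python sketch of the classic dynamic-programming Levenshtein distance, plus a normalised 0-to-1 score of the form 1 - distance / length-of-the-longer-string. The distance itself is standard; the normalisation is my assumption about how a distance becomes a score, and the exact formula used by the Talend components may differ:

# Classic Levenshtein distance (insertions, deletions, substitutions all cost 1).
def levenshtein(s1, s2):
    if len(s1) < len(s2):
        s1, s2 = s2, s1
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            current.append(min(previous[j] + 1,                  # deletion
                               current[j - 1] + 1,               # insertion
                               previous[j - 1] + (c1 != c2)))    # substitution
        previous = current
    return previous[-1]

def levenshtein_score(s1, s2):
    # Assumed normalisation: 1 - distance / length of the longer string.
    return 1.0 - float(levenshtein(s1, s2)) / max(len(s1), len(s2))

print(levenshtein("kitten", "sitting"))   # 3
print(levenshtein("Pemble", "pembel"))    # 3 -- case sensitive
print(levenshtein("Adam", "Alan"))        # 2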

Jaro-Winkler

From Wikipedia: https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance

In computer science and statistics, the Jaro–Winkler distance (Winkler, 1990) is a measure of similarity between two strings. It is a variant of the Jaro distance metric (Jaro, 1989, 1995), a type of string edit distance, and was developed in the area of record linkage (duplicate detection) (Winkler, 1990). The higher the Jaro–Winkler distance for two strings is, the more similar the strings are. The Jaro–Winkler distance metric is designed and best suited for short strings such as person names. The score is normalized such that 0 equates to no similarity and 1 is an exact match.

Personally I use Jaro-Winkler as my usual edit distance algorithm of choice as I find it delivers more accurate results than Levenshtein. I won’t break down detailed examples as before as the mathematics are a little more complex (and shown in detail on the Wikipedia page). However, let’s try running the same examples we looked at for Levenshtein through Jaro-Winkler:

• ‘kitten’ and ‘sitting’ -> Jaro-Winkler score: 0.7460317611694336
• ‘Pemble’ and ‘Pembel’ -> Jaro-Winkler score: 0.9666666507720947
• ‘Pemble’ and ‘pembel’ -> Jaro-Winkler score: 0.8222222328186035
• ‘Adam’ and ‘Alan’ -> Jaro-Winkler score: 0.7000000178813934

All Jaro-Winkler scores are between 0 and 1. Case is still skewing our results.
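For the curious, here is a compact Python sketch of the standard Jaro similarity plus the Winkler prefix bonus (scaling factor 0.1, prefix capped at four characters – implementations differ here, and some consider a longer prefix, which appears to be what produces the ‘francesca’/‘francis’ Jaro-Winkler figure further down). Run against the four pairs above, it reproduces both these Jaro-Winkler scores and the plain Jaro scores in the next section to four decimal places:

# Jaro similarity plus the Winkler prefix bonus (p = 0.1, prefix capped at 4 chars).
def jaro(s1, s2):
    if s1 == s2:
        return 1.0
    window = max(len(s1), len(s2)) // 2 - 1
    used = [False] * len(s2)
    matched1 = []
    for i, c in enumerate(s1):                      # find matching characters
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not used[j] and s2[j] == c:
                used[j] = True
                matched1.append(c)
                break
    if not matched1:
        return 0.0
    matched2 = [s2[j] for j, flag in enumerate(used) if flag]
    transpositions = sum(a != b for a, b in zip(matched1, matched2)) / 2.0
    m = float(len(matched1))
    return (m / len(s1) + m / len(s2) + (m - transpositions) / m) / 3.0

def jaro_winkler(s1, s2, p=0.1, max_prefix=4):
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return score + prefix * p * (1.0 - score)

for pair in [("kitten", "sitting"), ("Pemble", "Pembel"),
             ("Pemble", "pembel"), ("Adam", "Alan")]:
    print(pair, round(jaro(*pair), 4), round(jaro_winkler(*pair), 4))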

Jaro

Essentially Jaro-Winkler without the Winkler prefix modification – more details can be found online (see http://alias-i.com/lingpipe/docs/api/com/aliasi/spell/JaroWinklerDistance.html, ‘Step 3: Winkler Modification’). I don’t generally use it, as Jaro-Winkler is considered more accurate in most cases. In case you are wondering, our test cases score as follows:

• ‘kitten’ and ‘sitting’ -> Jaro score: 0.7460317611694336
• ‘Pemble’ and ‘Pembel’ -> Jaro score: 0.9444444179534912
• ‘Pemble’ and ‘pembel’ -> Jaro score: 0.8222222328186035
• ‘Adam’ and ‘Alan’ -> Jaro score: 0.6666666865348816

Note where the scores are the same as Jaro-Winkler and where they are different. If you are interested, these variations can be explained by the Winkler Modification’s preferential treatment of the initial part of the string, and you can even find edge cases where you could argue that this isn’t desirable:

• ‘francesca’ and ‘francis’ -> Jaro score: 0.8412697911262512
• ‘francesca’ and ‘francis’ -> Jaro-Winkler score: 0.9206348955631256

To a human, these are obviously two different names (which doesn’t mean that the name has not been misheard), but Jaro-Winkler skews the score upwards based on the shared initial part of the string – ‘fran’. Remember though that these scores would not usually be used in isolation; other fields will also be matched.

Q-grams (often referred to as n-grams)

This algorithm matches processed entries by dividing strings into letter blocks of length q in order to create a number of q-length grams. The matching result is given as the number of q-gram matches over the number of possible q-grams. At the time of writing, the q-grams algorithm that Talend provides is actually a character-level trigrams algorithm, so what does this mean exactly?

https://en.wikipedia.org/wiki/Trigram

Imagine ‘sliding a window’ over a string and splitting out all the combinations of the consecutive characters. Let’s take our ‘kitten’ and ‘sitting’ example and understand what actually happens:

‘kitten’ produces the following set of trigrams:

(#,#,k), (#,k,i), (k,i,t), (i,t,t), (t,t,e), (t,e,n), (e,n,#), (n,#,#)

‘sitting’ produces the following set of trigrams:

(#,#,s), (#,s,i), (s,i,t), (i,t,t), (t,t,i), (t,i,n), (i,n,g), (n,g,#), (g,#,#)

Where ‘#’ denotes a pad character appended to the beginning and end of each string. This allows:

• The first character of each string to potentially match even if the subsequent two characters are different. In this example (#,#,k) does not equal (#,#,s).
• The first two characters of each string to potentially match even if the subsequent character is different. In this example (#,k,i) does not equal (#,s,i).
• The last character of each string to potentially match even if the preceding two characters are different. In this example (n,#,#) does not equal (g,#,#).
• The last two characters of each string to potentially match even if the preceding character is different. In this example (e,n,#) does not equal (n,g,#).

There are two things to note from this:

  1. The pad character ‘#’ is treated differently to whitespace. This means the strings ‘Adam Pemble’ and ‘Pemble Adam’ will get a good score, but not a perfect match score, which is a desirable result.
  2. We should remove any ‘#’ characters from our strings before using this algorithm!

The algorithm in Talend uses the following formula to calculate a score:

normalisedScore =

(maxQGramsMatching - getUnNormalisedSimilarity(str1Tokens, str2Tokens)) / maxQGramsMatching

I won’t delve into the full details of each variable and function here, but essentially the score for our ‘kitten’ and ‘sitting’ example would be calculated as follows:

normalisedScore = (17 – 15) / 17 = 0.1176470588235294 – a low score

Once you understand the q-grams algorithm, you can see why it is particularly suited to longer strings or multi-word strings. For example, if we used q-grams to compare:

“The quick brown fox jumps over the lazy dog”

to

“The brown dog quick jumps over the lazy fox”

We would get a reasonably high score (0.8222222328186035) due to the strings containing the same words, but in a different order (remember the whitespace vs ‘#’). A Levenshtein score (not distance) for these strings would be 0.627906976744186. It is important to note that scores from different algorithms are NOT directly comparable – we will come back to this point later. However, we can say that the q-grams algorithm will give us relatively more favourable results in this ‘same words, different order’ scenario – if that’s what we were looking for.
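As a rough approximation of the scoring described above, here is a Python sketch that builds the padded trigrams and applies the same formula (which works out as 2 × matching trigrams / total trigrams); the exact padding and tokenisation rules of the Talend implementation may differ slightly:

from collections import Counter

# Character trigrams with '#' padding at both ends, as illustrated above.
def trigrams(s, q=3, pad="#"):
    padded = pad * (q - 1) + s + pad * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]

def qgram_score(s1, s2):
    t1, t2 = Counter(trigrams(s1)), Counter(trigrams(s2))
    max_qgrams = sum(t1.values()) + sum(t2.values())    # maxQGramsMatching
    common = sum((t1 & t2).values())                    # trigrams present in both strings
    unmatched = max_qgrams - 2 * common                 # getUnNormalisedSimilarity
    return float(max_qgrams - unmatched) / max_qgrams

print(qgram_score("kitten", "sitting"))   # (17 - 15) / 17, roughly 0.1176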

Phonetic Algorithms

Once again from Wikipedia: https://en.wikipedia.org/wiki/Phonetic_algorithm

A phonetic algorithm is an algorithm for indexing of words by their pronunciation. Most phonetic algorithms were developed for use with the English language; consequently, applying the rules to words in other languages might not give a meaningful result.

They are necessarily complex algorithms with many rules and exceptions, because English spelling and pronunciation is complicated by historical changes in pronunciation and words borrowed from many languages.

In Talend we include four phonetic algorithms, again all industry standards:

Soundex

Once more credit to Wikipedia (no point in reinventing the wheel) https://en.wikipedia.org/wiki/Soundex . The Soundex algorithm generates a code that represents the phonetic pronunciation of a word. This is calculated as follows:

The Soundex code for a name consists of a letter followed by three numerical digits: the letter is the first letter of the name, and the digits encode the remaining consonants. Consonants at a similar place of articulation share the same digit so, for example, the labial consonants B, F, P, and V are each encoded as the number 1.

The correct value can be found as follows:

  1. Retain the first letter of the name and drop all other occurrences of a, e, i, o, u, y, h, w.
  2. Replace consonants with digits as follows (after the first letter):
    • b, f, p, v → 1
    • c, g, j, k, q, s, x, z → 2
    • d, t → 3
    • l → 4
    • m, n → 5
    • r → 6
  3. If two or more letters with the same number are adjacent in the original name (before step 1), only retain the first letter; also two letters with the same number separated by 'h' or 'w' are coded as a single number, whereas such letters separated by a vowel are coded twice. This rule also applies to the first letter.
  4. Iterate the previous step until you have one letter and three numbers. If your word has too few letters to assign three numbers, append zeros until there are three numbers. If you end up with more than three numbers, just retain the first three.

Using this algorithm, both "Robert" and "Rupert" return the same string "R163" while "Rubin" yields "R150". "Ashcraft" and "Ashcroft" both yield "A261" and not "A226" (the chars 's' and 'c' in the name would receive a single number of 2 and not 22 since an 'h' lies in between them). "Tymczak" yields "T522" not "T520" (the chars 'z' and 'k' in the name are coded as 2 twice since a vowel lies in between them). "Pfister" yields "P236" not "P123" (the first two letters have the same number and are coded once as 'P').

A ‘score’ in Talend is generated based on the similarity of two codes, e.g. in the example above, ‘Robert’ and ‘Rupert‘ generate the same Soundex code of ‘R163’, so Talend would assign a score of 1. ‘Robert’ and ‘Rupern‘ (typo in Rupert, code = R165) would get a score of 0.75 as three of the four characters match. Also, it is worth noting that Soundex is not case sensitive.
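Soundex is simple enough to sketch in a few lines of Python; the function below follows the rules listed above, and the score at the end simply counts matching positions in the two four-character codes, as just described (e.g. R163 vs R165 gives 0.75):

# Soundex, following the rules listed above.
CODES = {}
for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for ch in letters:
        CODES[ch] = digit

def soundex(word):
    word = word.lower()
    first = word[0].upper()
    prev = CODES.get(word[0], "")
    digits = []
    for ch in word[1:]:
        if ch in "hw":                  # h/w separate letters but keep the previous code
            continue
        code = CODES.get(ch, "")
        if code and code != prev:       # adjacent letters with the same code collapse
            digits.append(code)
        prev = code                     # vowels (empty code) reset prev, so repeats recount
    return (first + "".join(digits) + "000")[:4]

def soundex_score(w1, w2):
    # Fraction of the four code characters that match, e.g. R163 vs R165 -> 0.75.
    return sum(a == b for a, b in zip(soundex(w1), soundex(w2))) / 4.0

print(soundex("Robert"), soundex("Rupert"), soundex("Ashcraft"),
      soundex("Tymczak"), soundex("Pfister"))   # R163 R163 A261 T522 P236
print(soundex_score("Robert", "Rupern"))        # 0.75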

Key point: As you can see, phonetic algorithms are a useful tool, especially where data may have been spelt ‘phonetically’ rather than with the correct spelling. English is also full of words that can sound the same but be spelt completely differently (they are called homophones https://www.oxford-royale.co.uk/articles/efl-homophones.html) – consider ‘buy’ and ‘bye’; both words will generate the same Soundex code ‘B000’. Being able to match phonetically can be a very powerful tool; however, phonetic algorithms also tend to ‘over-match’, e.g. ‘Robert’ and ‘Rupert’ or ‘Lisa’ and ‘Lucy’ generate the same code. Why not have a play yourself? There are plenty of online tools that generate Soundex codes, e.g. http://www.gedpage.com/soundex.html

Soundex FR

A variation of Soundex optimised for French language words. Talend began as a French company after all!

Metaphone / Double Metaphone

Once more to Wikipedia: https://en.wikipedia.org/wiki/Metaphone

Metaphone is a phonetic algorithm, published by Lawrence Philips in 1990, for indexing words by their English pronunciation. It fundamentally improves on the Soundex algorithm by using information about variations and inconsistencies in English spelling and pronunciation to produce a more accurate encoding, which does a better job of matching words and names which sound similar. As with Soundex, similar sounding words should share the same keys. Metaphone is available as a built-in operator in a number of systems.

The original author later produced a new version of the algorithm, which he named Double Metaphone. Contrary to the original algorithm whose application is limited to English only, this version takes into account spelling peculiarities of a number of other languages.

….

Original Metaphone contained many errors and was superseded by Double Metaphone

The rules for Metaphone / Double Metaphone are too complex to reproduce here, but are available online. Suffice it to say, if you are going to use a phonetic algorithm with Talend, it is likely that Double Metaphone will be your algorithm of choice. Once again though, be aware that even with Double Metaphone the ‘over-matching’ problem exists and you should handle it appropriately in your DQ processes. This could mean lower thresholds for automatic matching or stewardship processes that allow false positives to be unmatched.

This concludes our brief tour of Talend’s matching algorithms. It should be noted that we also support an algorithm type of ‘custom’ that allows your own algorithm to be plugged in. Another important point is that the algorithms supplied by Talend are focused on alphabet-based languages (for edit distance) and specific languages (phonetic). Languages that do not use an alphabetic script, such as Chinese, will require different matching strategies / algorithms (have a look online if you want to know more on this topic).

It is at this point I shall have to apologise once more. Last time I promised that we would discuss the actual mechanics of matching with Talend and Survivorship in this blog. However I think this post is long enough as it is and I shall continue with these topics next time.



[1] This obviously took all of our time. We definitely didn’t spend any time eating amazing French food, drinking beer and swapping war stories.  

 

[2015-08-14] Talend Blog: Beyond “The Data Vault”

In my last blog “What is ‘The Data Vault’ and why do we need it?” I introduced a fresh, compelling methodology for data warehouse modeling authored and invented by Dan Linstedt (http://danlinstedt.com) called ‘the Data Vault’.  I discussed how the Data Vault solves many of the characteristic and inherent problems found in crafting an Enterprise Data Warehouse (EDW), and how its high adaptability simplifies business ontologies and incorporates Big Data, resulting in the durable yet flexible solutions that most engineering departments dream of.

Before we get into the Data Vault EDW aspects however, I think we need to cover some basics.

Data Storage Systems

Certainly by now everyone has heard about Big Data, including, no doubt, the hype, disinformation, and misunderstandings about what Big Data is, which are regrettably just as pervasive.  So let’s back up a moment and confine the discussion to a sensible level.  Setting Relational 3NF and STAR schemas aside, ignoring e-commerce, business intelligence, and data integration, let’s look instead at the main storage facilities that encompass data technologies.  These are:

  • Database Engines
    • ROW: your traditional Relational Database Management System (RDBMS)
    • COLUMN: relatively new, widely misunderstood, feels like a normal  RDBMS
  • NoSQL: new kid on the block; really means ‘NOT ONLY SQL’
  • File Systems: everything else under the sun (ASCII/EBCDIC, CSV, JSON, XML, HTML, etc.)

Database Engines

The ROW based database storage methodology is one most of us are already familiar with.  Depending upon your vendor of choice (like Oracle, Microsoft, IBM, Postgres, etc…) Data Definition Language (DDL) and Data Manipulation Language (DML) syntax, collectively called SQL, creates tables for the storage and retrieval of structured records, row by row.  Commonly based upon some form of ‘key’ identifier, the Relational 3rd Normal Form (3NF) Data Model thrives upon the ROW based database engine and is widely used for many Transactional (OLTP) and/or Analytic (OLAP) systems and/or applications.  Highly efficient in complex schema designs and data queries, ROW based database engines offer a tried and true way to build solid data-‘based’ solutions.  We should not throw this away, I won’t!

The COLUMN based database storage methodology has been around, quietly, for a while as an alternative to ROW based databases where aggregations are essential.  Various vendors (like InfoBright, Vertica, SAP Hana, Sybase IQ, etc…) generally use similar DDL and DML syntax from ROW based databases, yet under the hood things are radically different; a highly efficient engine for processing structured records, column by column; perfect for aggregations (SUM/MIN/MAX/COUNT/AVG/PCT)!  This is the main factor that sets it apart from ROW based engines.   Some of these column based technologies also provide high data storage compression which allows for a much smaller disk footprint.  In some cases as much as 80/1 over their row based counterpart.  We should adopt this where appropriate; I do!

Big Data

The NoSQL based storage methodology (notice I don’t call this a database) is the newer kid on the block, where many vendors are vying for your immediate attention (like Cassandra, Cloudera, Hortonworks, MapR, MongoDB, etc.).  Many people think that NoSQL technologies are here to replace ROW or COLUMN based databases; that is simply not the case. Instead, as a highly optimized, highly scalable, high performance distributed ‘file system’ (see HDFS below), the NoSQL storage capabilities offer striking features simply not practical with ROW or COLUMN databases.  Dramatically enhancing file I/O technologies, NoSQL extends out to new opportunities that were either unavailable, impracticable, or both.  Let’s dive a bit deeper on this; OK?

There are three main variations of NoSQL technologies.  These include:

  • Key Value: which supports fast transaction inserts (like an internet shopping cart); generally stores data in memory and is great for web applications that need considerable in/out data operations;
  • Document Store: which stores highly unstructured data as named value pairs; great for web traffic analysis, detailed information, and applications that look at user behavior, actions, and logs in real time;
  • Column Store: which is focused upon massive amounts of unstructured data across distributed systems (think Facebook & Google); great for shallow but wide data relationships, yet fails miserably at ad-hoc queries;

(note: Column Store NoSQL is not the same as a Column Based RDBMS)
 

Most NoSQL vendors support structured, semi-structured, or non-structured data which can be very useful.  The real value, I believe, comes in the fact that NoSQL technologies ingest HUGE amounts of data, very FAST.  Forget Megabytes or Gigabytes, or even Terabytes, we are talking Petabytes and beyond!  Gobs and gobs of data!  With the clustering and multi-threaded inner-workings, scaling to future-proof the expected explosion of data, a NoSQL environment is an apparent ‘no-brainer’.  Sure, let’s get excited, but let’s also temper it with the understanding that NoSQL is complementary, not competitive to more traditional databases systems.  Also note that NoSQL is NOT A DATABASE but a highly distributed, parallelized ‘file system’ and really great at dealing with lots of non-structured data; did I say BIG DATA?

NoSQL technologies have both strengths and weaknesses.  Let’s look at these too:

  • NoSQL Strengths
    • A winner when you need the ability to store and look up Big Data
    • Commodity hardware based
    • Fast data ingestion (loads)
    • Fast lookup speeds (across clusters)
    • Streaming data
    • Multi-threaded
    • Scalable data capacity & distributed storage
    • Application focused
  • NoSQL Weaknesses
    • Conceivably an expensive infrastructure (CPU/RAM/DISK)
    • Complexities are hard to understand
    • Lack of a native SQL interface
    • Limited programmatic interfaces
    • Poor performance on update/delete operations
    • Good engineering talent still hard to find
    • Inadequate for analytic queries (aggregations, metrics, BI)

 

File I/O

The FILE SYSTEM data storage methodology is really straightforward and easy.  Fundamentally, file systems rely upon a variety of storage media (like Local Disks, RAID, NAS, FTP, etc.) and are managed by an Operating System (Windows/Linux/MacOS) supporting a variety of file access technologies (like FAT, NTFS, XFS, EXT3, etc.).  Files can comprise almost anything, be formatted in many ways, and be utilized in a wide variety of applications and/or systems.  Usually files are organized into folders and/or sub-folders, making the file system an essential element of almost all computing today.  But then you already know this; Right?

Hadoop/HDFS

So where does Hadoop fit in, and what is HDFS?  The ‘Hadoop Distributed File System’ (HDFS) is a highly fault-tolerant file system that runs on low-cost, commodity servers.  Spread across multiple ‘nodes’ in a hardware cluster (sometimes hundreds or even thousands of nodes), 64MB ‘chunks’ or data segments are processed using a ‘MapReduce’ programming model that takes advantage of a highly efficient parallel, distributed algorithm.

HDFS is focused on high throughput (fast) data access and support for very large files.  To enable data streaming HDFS has relaxed a few restrictions imposed by POSIX (https://en.wikipedia.org/wiki/POSIX) standards to allow support for batch processing applications targeting HDFS.

The Apache Hadoop Project is an open-source framework written in Java that is made up of the following modules:

  • Hadoop Common: which contains libraries and utilities
  • HDFS: the distributed file system
  • YARN: a resource manager responsible for cluster utilization & job scheduling (Apache YARN)
  • MapReduce: a programming model for large scale data processing

Collectively, this Hadoop ‘package’ has become the basis for several commercially available and enhanced products.

 

So let’s call them all: Data Stores

Let’s bring these three very different data storage technologies into a conjoined perspective; I think it behooves us all to consider that essentially all three offer certain value and benefits across multiple use cases.  They are collectively and generically therefore: Data Stores! 

Regardless of what type of system you are building, I’ve always subscribed to the notion that you use the right tool for the job.  This logic applies to data storage too.  Each of these data storage technologies offers specific features and benefits and therefore should be used in ways appropriate to the requirements.  Let’s review:

  1. ROW based databases should prevail when you want a complex, but not too-huge data set that requires efficient storage, retrieval, update, & delete for OLTP and even some OLAP usage;
  2. COLUMN based databases are clearly aimed at analytics; optimized for aggregations coupled with huge data compression, they should be adopted for most business intelligence usage;
  3. NoSQL based data solutions step in when you need to ingest BIG DATA, FAST, Fast, fast… and when you only really need to make correlations across the data quickly;
  4. File Systems are the underlying foundation upon which all these others are built.  Let’s not forget that!

 

The Enterprise Data ‘Vault’ Warehouse

Now that we have discussed where and how we might store data, let’s look at the process for crafting an Enterprise Data Warehouse (an obvious Big Data use case) based on a Data Vault model.

Architecture

An EDW is generally comprised of data originating from a ‘Source’ data store; likely an e-commerce system, or Enterprise Application, or perhaps even generated from machine ‘controls’.  The simple desire is to provide useful reporting on metrics aggregated from this ‘Source’ data.  Yet ‘IT’ engineering departments often struggle with the large volume and veracity of the data and often fail at the construction of an effective, efficient, and pliable EDW Architecture.  The complexities are not the subject of this Blog; however anyone who has been involved in crafting an EDW knows what I am talking about.  To be fair, it is harder than it may seem.

Traditionally an enterprise funnels ‘Source’ data into some form of staging area, often called an ‘Operational Data Store’ or ODS.  From there, the data is processed further into either a Relational 3NF or STAR data model in the EDW, where aggregated processing produces the business metrics desired.  We learned from my previous Blog that this is problematic and time consuming, causing tremendous pressure on data cleansing, transformations, and re-factoring when systems up-stream change.

This is where the Data Vault shines!

Design

After constructing a robust ODS (which I believe is sound architecture for staging data prior to populating an EDW) designing the Data Vault is the next task.  Let’s look at a simple ODS schema:

Let’s start with the HUB table design.  See if you can identify the business keys and the ‘Static’ attributes from the ‘Source’ data structures to include in the HUB tables.  Remember also that HUB tables define their own unique surrogate Primary Key and should contain record load date and source attributes.

 

The LNK ‘link’ tables capture relationships between HUB tables and may include specific ‘transactional’ attributes (there are none in this example).  These LNK tables also have a unique surrogate Primary Key and should record the linkage load date.

Finally, the SAT ‘satellite’ tables capture all the variable attributes.  These are constructed from all the remaining, valuable ‘Source’ data attributes that may change over time.  The SAT tables do not define their own unique surrogate keys; instead they incorporate either the HUB or the LNK table surrogates plus the record load date combined as the Primary Key.

Additionally the SAT tables include a record load end date column which is designed to contain a NULL value for one and only one instance of a satellite row representing the ‘current’ record.  When SAT attribute values change up-stream, a new record is inserted into the SAT table, updating the previously ‘current’ record by setting the record load end date to the date of the newly loaded record.

One very cool result of using this Data Vault model is that it becomes easy to create queries that go “Back-In-Time”: by checking the ‘rec_load_date’ and ‘rec_load_end_date’ values you can determine what the record attribute values were at any point in the past.  Those who have tried know this is very hard to accomplish using a STAR schema.
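To make the end-dating and ‘Back-In-Time’ mechanics concrete, here is a small Python sketch of a satellite held as an in-memory list of rows. The ‘rec_load_date’ / ‘rec_load_end_date’ column names follow the conventions above; everything else (the hub_id key, the sample attributes and dates, and using Python rather than SQL) is purely illustrative:

from datetime import date

# A satellite as a list of rows; exactly one row per hub key has rec_load_end_date = None.
sat_customer = []

def load_sat_row(hub_id, attributes, load_date):
    # End-date the previously 'current' row for this hub key, then insert the new one.
    for row in sat_customer:
        if row["hub_id"] == hub_id and row["rec_load_end_date"] is None:
            row["rec_load_end_date"] = load_date
    sat_customer.append({"hub_id": hub_id, "rec_load_date": load_date,
                         "rec_load_end_date": None, **attributes})

def as_of(hub_id, point_in_time):
    # 'Back-In-Time' lookup: the row whose validity interval covers the given date.
    for row in sat_customer:
        starts = row["rec_load_date"] <= point_in_time
        ends = row["rec_load_end_date"] is None or point_in_time < row["rec_load_end_date"]
        if row["hub_id"] == hub_id and starts and ends:
            return row
    return None

load_sat_row(1, {"city": "Dublin"}, date(2015, 1, 1))
load_sat_row(1, {"city": "Paris"}, date(2015, 6, 1))
print(as_of(1, date(2015, 3, 15))["city"])   # Dublin -- the value as it was in the past
print(as_of(1, date(2015, 7, 1))["city"])    # Paris  -- the current value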

Aggregations

Eventually data aggregations (MIN/MAX/SUM/AVG/COUNT/PCT), often called ‘Business Metrics’ or ‘Data Points’, must be generated from the Data Vault tables.  Reporting systems could query the tables directly, which is a viable solution.  I think, however, that this methodology puts a strain on any EDW, and a column-based database could be utilized instead.  As previously discussed, these column-based database engines are very effective for storing and retrieving data aggregations.  The design of these column-based tables could be highly de-normalized (consider this example), versioned to account for history.  This solution effectively replaces the FACT/DIMENSION relationship requirements and the potentially complex (populate and retrieve) data queries of the STAR schema data model.

Yes, for those who can read between the lines, this does impose an additional data processing step which must be built, tested, and incorporated into an execution schedule.  The benefits of doing this extra work are huge.  Once the business metrics are stored, pre-aggregated, reporting systems will always provide fast retrieval, consistent values, and managed storage footprints.  These become the ‘Data Marts’ for the business user and worth the effort!

ETL/ELT

Getting data from ‘Source’ data systems, to Data Vault tables, to column-based Data Marts, requires tools.  Data Integration tools.  Yes, Talend Tools!  As a real-world example, before I joined Talend, I successfully designed and built such an EDW platform using Talend and InfoBright.  The originating ‘Source’ data merged 15 identical databases, with 45 tables each, into an ODS; data was then synchronized from the ODS into a Data Vault model of 30 tables, which in turn populated 9 InfoBright tables.  The only real transformation requirements were to map the ‘Source’ tables to Data Vault tables, then to de-normalized tables.  After processing over 30 billion records, the resulting column-based ‘Data Marts’ could execute aggregated query results, across 6 billion records, in under 5 seconds.  Not bad!

Conclusion

Wow -- A lot of information has been presented here.  I can attest that there is a lot more that could have been covered.  I will close by saying, humbly, that an Enterprise Data ‘Vault’ Warehouse, using any of the data stores discussed above, is worth your consideration.  The methodology is sound and highly adaptable.  The real work then becomes how you define your business metrics and the ETL/ELT data integration process.  We believe, here at Talend, not too surprisingly, that our tools are well suited for building fast, pliable, maintainable, operational code that takes you “Beyond the Data Vault”.

 

[2015-08-13] Talend Forum Announcement: For test only, Talend Open Studio's 6.1.0 M1 release is available

Dear Community,


We are pleased to announce that Talend Open Studio's 6.1.0 M1 release is available, for testing only. This milestone contains new features and bug fixes, and is recommended for experienced users who need an early preview of the upcoming 6.1 release.
Download Talend Open Studio for [Data Integration | Big Data | Data Quality | MDM | ESB]'s first milestone:
Big Data: http://www.talend.com/download/talend-open-studio
Data Quality: http://www.talend.com/download/talend-open-studio#t2
ESB: http://www.talend.com/download/talend-open-studio#t3
Data Integration: http://www.talend.com/download/talend-open-studio#t4
BPM: http://www.talend.com/download/talend-o … s_download
MDM: http://www.talend.com/download/talend-open-studio#t6


Thanks for being a part of our community
The Talend Team

[2015-08-11] Talend Blog: Talend and the Gartner Magic Quadrant for Data Integration Tools – Less than a whisker from the leader’s quadrant

Why you should consider the emerging leader

When it comes to Talend, prospective customers have a favorite question they like to ask analysts:  “If I can get all of this from Talend at a fraction of the cost of the big guys, what am I really missing by not going with bigger vendors?”

And, what the analysts likely say is something along the lines of:  “Talend isn’t as mature as Informatica or IBM, so their support and how-to tools won’t be as robust.” 

As you might anticipate, my answer would be different.

I’d tell prospects that they’ll never regret taking a deeper look at Talend and assessing for themselves how we stack up.

For my part, here’s how I’d categorize the Data Integration players:

Megavendors (Informatica, IBM, Oracle) – These vendors have mostly grown through acquisitions, which means they all have different design and management environments.  While they offer a broad set of capabilities, they also require you to deploy/learn multiple, complex and proprietary products that come with a high total cost of ownership.

Stovepipes (SAP, SAS, Microsoft) – These vendors all have solid solutions, but they mostly win in their own ecosystems and we don’t see them in the market more broadly.  So, if you are ONLY a Microsoft shop or ONLY a SAP shop, then these could be for you.  Of course though, these vendors still require you to buy multiple products with a high TCO. 

Point Players – (Syncsort, Adeptia, Denodo, Cisco, Actian, Information Builders) – If you have a specific need, these players can fit the bill and they tend to be less expensive. However they lack the full set of capabilities and they aren’t players in the broader integration market.

Talend (Yes, shameless I know, but I truly believe we are currently in a category of one) – If you’re looking for a modern integration platform that can give you agility and meet all your integration needs, including big data, with a great cost of ownership, Talend really is your only option.

A Good Choice for Today and Tomorrow

Gartner outlines seven trends that are shaping the market and Talend is investing in all of them.  We are setting the bar on support for the key platforms of the future like Hadoop, NoSQL and the cloud.  As the complexity of use cases grows and the need to operate in real-time intensifies, our enterprise service bus and upcoming support for Spark streaming mean that we will deliver on more real-time use cases than anyone else.

Here are all seven of the trends highlighted by Gartner in the Magic Quadrant for Data Integration Tools*:

1. Growing interest in business moments and recognition of the required speed of digital business
2. Intensifying pressure for enterprises to modernize and enlarge their data integration strategy
3. Requirements to balance cost-effectiveness, incremental functionality, time-to-value and growing interest in self-service
4. Expectations for high-quality customer support and services
5. Increasing traction of extensive use cases
6. Extension of integration architectures through a combination of cloud and on-premises deployments
7. Need for alignment with application and information infrastructure

Of course, don’t take my word for it. Add Talend to your due diligence list and assess for yourself how a company that’s less than a whisker away from the Leaders Quadrant matches up. You might want to start by taking a closer look at the report, which you can download free here: https://info.talend.com/gartnermqdi.html

*[1] Gartner, Inc., "Magic Quadrant for Data Integration Tools” by Eric Thoo and Mark A. Beyer, July 29, 2015

[2015-08-06] Talend Blog: On the Road to MDM

Think big, but start small.  This is particularly good advice if you plan on implementing a master data management (MDM) system any time in the near future.

MDM is an extremely powerful technology that can yield astonishing results.  But like any complex, highly effective discipline it is best approached systematically and incrementally.

First, just what are we talking about?  Here’s an excellent definition of MDM from Gartner’s IT Glossary: “MDM is a technology-enabled discipline in which business and IT work together to ensure the uniformity, accuracy, stewardship, semantic consistency and accountability of the enterprise’s official shared master data assets.”

Gartner goes on to say that “Master data is the consistent and uniform set of identifiers and extended attributes that describes the core entities of the enterprise including customers, prospects, citizens, suppliers, sites, hierarchies and chart of accounts.”

The metadata residing in the system includes information on:

- The means of creation of the data

- Purpose of the data

- Technical data – data type, format, length, etc.

- Time, date and location of creation

- Creator or author of the data

- Standards

That’s a lot.  And it can lead to problems. For example, when moving data from one place to another, you have to know how it transforms, who owns it, where it comes from and what rules govern the data.  This may mean going back into the ETL integration and trying to unscramble initial problems in data mapping and compliance.  You can try tackling these problems using a spreadsheet (that way lies madness) or turn to IT for an answer that may take months in coming – not a particularly attractive solution if your business is attempting to become more agile.

The fact is, that for many organizations, a full-bore MDM deployment right from the start is overkill.  The massive effort required proves to be too complicated, expensive and inevitably winds up being put on hold.

To avoid this kind of quagmire, there is a better way.  As I mentioned above, start small and think big.  To be more specific, start with a data dictionary and ease your way into MDM over time.  (A data dictionary is a centralized repository of information about data such as meaning, relationships to other data, origin, usage, and format.)

The poster child for this approach is one of our valued customers – the Irish Office of Revenue Commissioners, known simply as “Revenue.”  Here’s their story.

Revenue’s core business is the assessment and collection of taxes and duties for the Republic of Ireland. With more than 65 offices countrywide, the agency has a staff of over 5,700. Revenue’s authorized users have access to operational business and third-party data in its data warehouse for query, reporting and analytical purposes.

The data is complex and growing rapidly. Historically the agency’s metadata has been accessed from multiple sources using spreadsheets and other business documentation, a fragmentary and ineffective solution.

Fortunately, Talend’s integration solutions are already being used by Revenue as the corporate ETL tool.  Because much of the technical metadata around data manipulation and transformation was already being captured by these Talend solutions, the implementation of Talend’s MDM unified platform made a lot of sense.

The implementation included the initial use of a data dictionary within the MDM platform in keeping with the “start small, think big” dictum. In addition, Revenue was able to leverage the existing skills and knowledge of their business and data analyst employees who are familiar with Talend Studio. 

Overall, the agency avoided additional costs for their metadata solution, reduced operational costs, and solved the business problem of knowing where to find pertinent data for reporting purposes.

Through their use of Talend MDM, we anticipate that Revenue’s metadata solution will increase the understanding of data throughout the organization.  This, in turn, will lead to improved decision making and improved data quality over time.  Plus, the solution’s web user interface will help cut metadata management and deployment costs. It provides the Revenue business analysts with access to one centralized location to gather metadata that had previously been scattered around the organization in various formats and residing in individual business and technical silos.

For many companies dealing with today’s influx of big data, the Revenue incremental approach is a good one.  They can start by building a data dictionary for free and then upgrade to handle more users and provide additional functionality.

After all, MDM is a journey, not a destination.  Companies that elect to follow this path will achieve cost effective and satisfying results by starting small and then moving ahead with all deliberate speed.

[2015-08-04] Talend Blog: OSGI Service Containers

The first post in this series provided a look at the definition of a Container.  The second post in the series explored how Platforms leverage Containers to deliver SOA design patterns at internet scale in the Cloud.  This post presents a simplified example of applying Container architecture for extensible business infrastructure.  It then addresses when and where to use the power of micro-service containers like OSGI.

Use Case

Consider a B2B company seeking to add a new trading partner.  The B2B partner may wish to subscribe to a data feed, so the partner will need to adapt its internal APIs to the trading network’s published API.

Rather than an elaborate IT project, this should be a simple Cloud self-service on-demand scenario.  This not only increases agility, it maximizes the market for the B2B company.  And it ensures business scalability since B2B IT staff will not be on the information supply chain critical path.

The Partner will probably want extensions to the base API so the B2B platform needs to observe the open-closed principle.  It needs to be closed from modification while being open to extension.  Extensions could be additional validation and business rules or additional schema extensions.  Schema extensions in particular will impact multiple workflow stages.  In addition, transformations for message schemas might be required along with data level filtering for fine grained access control.  In order to realize the Self-Service On-Demand level expected from Cloud providers, the Platform must allow this type of mediation to be dynamically provisioned without rebuilding the application.

 

Figure 3: Dynamic Subscription Mediation

 

In the diagram above, the partner submits a message to subscribe to a data feed (1) via, say, a REST web service.  The subscription message is received by the RouteBuilder, which dynamically instantiates a new route (2) that consumes from a JMS topic.  The route filters (3) messages based on the access privileges, provides custom subscription mediation logic (4), and then sends the message using WS-RM (5).

Where should this mediation logic be hosted?  Creating a service for each partner is not too difficult.  But as the number of services in a composite service increases the overhead of inter-process communication (IPC) becomes a problem.  As an extreme case, consider what the performance impact would be if the subscribe, filter, and custom mediation logic each required a separate service invocation. 

In many cases modularity and extensibility are even more important than performance.  When partners extend the API the impact may not be easily isolated to a single stage in the processing.  In such cases the extension points need to be decoupled from the core business flow.

Likewise, when the core service evolves we need to ensure consistent implementation across different B2B partners.  Regardless of variation, some requirements remain common.  We want to be sure that these requirements are implemented consistently.  A copy-paste approach will not be manageable.

Finally, using external processes to implement variation may undermine efficient resource pooling.  Each partner ends up with its own unique set of endpoints and supporting applications.  In the diagram above, mediation logic belongs to a pool of routes running in the same process to improve efficiency.

So we want granular composability for managed variation as well as modularity for extensibility of business logic.  This is in tension with IPC overhead and resource pooling.  

Sample Architecture

This post is focused on the role of the Service Container in resolving the design forces.  It is used in the context of the Application Container and ESB Containers shown in the sample architecture below.

 

Figure 1: SOA Whiteboard Logical Architecture

 

The Application Container hosts the actual business services. The business service is a plain old Java object (POJO).  It does not know or care about integration logic or inter-process communication. That is addressed by the Service Container. The Service Container runs in the Application Container process. The exposed service is called a Basic Service.

The Service container also runs in the ESB Container.  The ESB container provides additional integration including security, exactly-once messaging, monitoring and control, transformation, etc.  It provides a proxy of the Basic Service with the same business API but different non-functional characteristics.

Service Container

The Service Container is a logical concept that is language dependent.  Since the Service Container runs inside both the ESB and the Application Container it has to be compatible with the languages supported by the ESB and the Application Container.  It may well have multiple implementations since there may be multiple Application Containers used by the enterprise.

For purposes of discussion we will focus on Java.  We can think of Tomcat as a typical Application Container and Apache Karaf as the ESB container.  The Service Container depends on a Dependency Injection framework.  We might use Spring for dependency injection in Tomcat; in Karaf we might choose Blueprint.  The Service Container itself might be implemented in Apache Camel.  Camel works with both Spring and Blueprint.  The actual service implementation is a plain old Java object (POJO).

 

Figure 2: Containerized Services

 

The Service Container is non-invasive in the sense that it has a purely declarative API based on xml and annotations.  The service developer does not need to make any procedural calls.  Adoption of the Service Container by organizations is supported by an SDK that provides a cookbook for using it.  But it should be very simple.  Hence no tooling is required.  The SDK should address Continuous Integration (CI) and DevOps use cases for every stage of development.  As such the Service Container can encapsulate any lower level complexity introduced by other containers.

The service container adds functionality beyond the basic dependency injection framework to address endpoint encapsulation, mediation and routing, and payload marshalling.  Using the Service Container provides a flexible contract between the integration team and the service provider which allows performance optimization while maintaining logical separation of concerns.

But in some cases this requires the platform to be able to deploy the new jars dynamically at runtime to an existing, running container.  Indeed, there may be many Containers that will need to host the new extension points or adaptor services.  All such concerns should be transparent to the service provider.

This could be implemented by the service provider team, but the same mediation will be used by many service providers.  So it is preferable to delegate this functionality to the Platform.  This has the added benefit that service providers can focus on business logic rather than creating and managing efficient resource pools that deliver reliable, secure throughput.  Business logic and IT logic are often orthogonal skill sets.  So separation of concerns also leads to improved efficiency.

Having this dealt with by the Platform is good, but it raises the question: how are custom mediation jars resolved, and how are conflicts with custom logic from other partners managed?

Micro Service Containers

There is a key difference in the selection of the Dependency Injection framework used in the example architecture.  The Application Container uses Apache Tomcat and the ESB Container uses Apache Karaf.  Apache Karaf supports Blueprint for dependency injection, but it also supports OSGI micro-services whereas traditional Spring running in Tomcat does not. 

Flexible deployment of business logic can be achieved with Dependency Injection frameworks like Spring, but two problems arise.  The first is dependency management and classpath conflicts.  The second is managing the dynamic modules.

OSGI is a mature specification that manages dependencies at the package level.  What this means for the enterprise is that we can manage multiple versions of a module within the same runtime.  In turn, this means we can dynamically deploy new service modules to runtime containers without having to worry about conflicting libraries.  The concept is to achieve the same pluggable ease-of-use for enterprise services that you get with your smart phone’s App Store.

In addition to dependency management, the OSGI specification provides a micro-service architecture.  Micro-service refers to the fact that we are only talking about services and consumers within the same JVM.  Micro-services go beyond dependency injection to provide a framework for dynamic services that can come and go during the course of execution.  This supports elasticity of services in the cloud.
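
As a hedged illustration of what an in-JVM micro-service looks like at the code level, here is a minimal sketch using the standard OSGI framework API; the GreetingService contract is a hypothetical example, and in practice the same wiring would more likely be declared in Blueprint XML or with Declarative Services rather than written by hand.

import java.util.Hashtable;
import org.osgi.framework.BundleActivator;
import org.osgi.framework.BundleContext;
import org.osgi.framework.ServiceRegistration;

// Hypothetical micro-service contract shared between provider and consumer bundles.
interface GreetingService {
    String greet(String name);
}

// Provider bundle: registers the service when started and unregisters it when
// stopped, so consumers can see it come and go dynamically at runtime.
public class GreetingActivator implements BundleActivator {

    private ServiceRegistration<GreetingService> registration;

    @Override
    public void start(BundleContext context) {
        GreetingService impl = name -> "Hello, " + name;
        Hashtable<String, Object> props = new Hashtable<>();
        props.put("service.version", "1.0.0");   // illustrative property only
        registration = context.registerService(GreetingService.class, impl, props);
    }

    @Override
    public void stop(BundleContext context) {
        registration.unregister();   // the service disappears; trackers are notified
    }
}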

OSGI is the same technology used by Eclipse plugins.  So it is very mature and stable. Moreover, as an open standard it is appropriate for use in the enterprise. But there is some additional complexity with OSGI. OSGI complexity is merited when used to host dynamic modules which need to be composed in-process to encapsulate variation or re-use.  This is the case with the ESB Containers.  But it is not always the case for Basic Services running in the Application Containers.

For example, consider a simple transformation service.  It is visually designed and published as a web service.  B2B partners can use the transformation service, and if the additional latency does not impact their SLA, then the additional complexity of OSGI is not merited. 

As a general rule, mediation logic in ESB Containers should use OSGI to provide for flexible deployment of mediation modules.  Mediation is more likely to vary and it varies over a broader set of stakeholders.  These stakeholders may require diverse and potentially overlapping libraries.  Since they are all on their own lifecycle, it requires the dependency management capability of OSGI.  Moreover, mediation services are more likely to be highly dynamic.

In contrast, Basic Service logic can run in the Application Container and is usually delivered by a single organization along with other related services as part of a single Application deployment lifecycle.  Unlike the mediation use case, the service provider team has control and can resolve any library dependency issues during development.  As such it can be run in a lightweight container but it does not necessarily need OSGI.

In summary, Apache Camel provides a Service Container on top of the dependency injection framework, so it can run in Spring or OSGI.  Blueprint is the dependency injection framework for OSGI. OSGI should be considered for composite services and mediation and routing to provide the flexibility and extensibility needed for self-service on-demand, elastic SaaS. 

The next post in the series will explore the reference architecture in greater detail regarding dynamic provisioning and data driven mediation and routing.

[2015-07-27] Talend Blog: Surprising Data Warehouse Lessons from a Scrabble Genius

Lessons in data warehousing and cloud data integration can come from unexpected places. Consider: the latest French-language Scrabble champion doesn’t speak French.

New Zealander Nigel Richards just won the international French World Scrabble® Championship without speaking the language at all, even using many obscure words that the average French speaker doesn’t know. The secret to his success seems to be twofold: his ability to memorize the full 368,000 words in the official authorized French list, known as the Officiel du Scrabble (ODS), along with his expertise in finding the best strategic placement on the game board.

Surprisingly, this achievement highlights a few lessons for any company wishing to become a data-driven enterprise and get ahead of their competitors.

It’s all about consistent, trusted data

Nigel Richards essentially turned his brain into a Data Warehouse. It contained all the approved words, a “single version of the truth” that he based his play on. This is no small accomplishment: typical memorization techniques, like the Memory Palace, aren’t adapted to this sort of task, where the word has no meaning for the mental athlete.

Modern businesses face similar challenges. To become a true data-driven enterprise, you need a very solid data management strategy and infrastructure, so that your data is trusted and that the resulting actions are effective. This means extracting data from multiple sources (for instance Salesforce, Marketo, Netsuite, SAP, Excel as well as your different databases), then taking necessary steps to consolidate, cleanse, transform and load the data into your data warehouse. Unlike the ODS, your data isn’t a relatively frozen set. It needs to be kept up to date and synchronized frequently.

Getting the most out of your data

The next lesson is that you need to focus on key high-level patterns and metrics in order to act. Nigel Richards didn’t need to understand the words in order to play his tiles. He just analyzed the number of points each word would bring, how it would help his future moves, and how it would hinder his opponent’s options. In doing this, he was applying best practices learned throughout his experience as a Scrabble champion in other languages. The words may be different, but the patterns he would seek out are the same.

Before you can produce your analytics and business insights, for instance bringing all your data into AWS Redshift and using Tableau Software or QlikView, or directly importing all data into the Salesforce Analytics Cloud, the foundation needs to be rock solid, bringing us back to data integration.

What this means for you

Mr. Richards single-handedly imported nearly 400,000 words in just nine weeks. I know many managers who would be happy to see that sort of timeline for their projects! Luckily, Cloud Data Integration solutions provide a new, agile approach that allows organizations to integrate their applications and data sources rapidly, without having to worry about provisioning hardware or administering the platform. The best tools provide out-of-the-box connectors to extract and load your data without needing to hand-code, as well as components to de-duplicate, validate, standardize and enrich for higher data quality.

If you are looking for this type of solution, I would definitely recommend you check out Talend Integration Cloud.  As you would expect from Talend, the solution offers powerful, yet easy-to-use tools and prebuilt components and connectors that make it easier to connect, enrich and share data. This approach is designed to allow less skilled staff to complete everyday integration tasks so key personnel can remain focused on more strategic projects. We think you’ll find Talend Integration Cloud can help you become a world-class, champion data-driven enterprise faster than you thought possible!

[2015-07-15] Talend Blog: Hadoop Summit 2015 Takeaway: The Lambda Architecture

It has been a couple of weeks since I got back from the Hadoop Summit in San Jose and I wanted to share a few highlights that I believe validate the direction Talend has taken over the past couple of years.

Coming out of the Summit I really felt that as an industry we were beginning to move beyond the delivery of exciting innovative technologies for Hadoop insiders, to solutions that address real business problems. These next-generation solutions emphasize a strong focus on Enterprise requirements in terms of scalability, elasticity, hybrid deployment, security and robust overall governance.

From my perspective (biased of course!), the dominant themes at the Summit gravitated around:

- Lambda Architecture and typical use cases it enables

- Cloud, tools and ease of dealing with Big Data

- Machine Learning

In this blog post, I’ll focus on the first one: the Lambda Architecture.

Business use cases that require a mix of machine learning, batch and real-time data processing are not new; they have been around for many years.  For example:

- How do I stop fraud before it occurs?

- How can I make my customers feel like “royalty” and push personalized offers to reduce shopping cart abandonment?

- How can I prevent driving risks based on real-time hazards and driver profiles?

The good news is that technologies have greatly improved and with almost endless computing power at a fraction of yesterday’s cost, they are not science fiction anymore.

The Lambda architecture (see below) is a typical architecture to address some of those use cases.

Lambda Architecture (based on Nathan Marz's design)

Within the Big Data ecosystem, Apache Spark (https://spark.apache.org/) and Apache Flink (https://flink.apache.org/) are two major solutions that fit this architecture well.

Spark (the champion) stands out from the crowd because of its ability to address both batch and near real-time (micro batch in the case of Spark) data processing with great performance through its in-memory approach.

Spark is also continuously improving its platform by adding key components to appeal to more Data Scientists (on top of MLlib for machine learning, Spark R was added in the 1.4 release) and expand its Hadoop footprint.

Spark projects in the Enterprise are on the rise and slowly replacing Map/Reduce for Batch Processing in the mind of developers. IBM’s recent endorsement and commitment to put 3500 researchers and developers on Spark related projects will probably accelerate Spark adoption in the hearts of Enterprise architects.
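
To give a feel for why developers see Spark as a lighter alternative to hand-written MapReduce, here is a minimal batch sketch against the Spark 1.x Java API; the input and output paths are placeholders.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

// A classic word count, expressed as a short chain of in-memory transformations.
public class WordCountBatch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("word-count");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.textFile("hdfs:///data/input");   // placeholder path
        JavaPairRDD<String, Integer> counts = lines
            .flatMap(line -> Arrays.asList(line.split("\\s+")))      // Spark 1.x flatMap returns an Iterable
            .mapToPair(word -> new Tuple2<>(word, 1))
            .reduceByKey((a, b) -> a + b);

        counts.saveAsTextFile("hdfs:///data/output");                // placeholder path
        sc.stop();
    }
}

The same transformations written against MapReduce would require separate mapper and reducer classes plus driver boilerplate, which is a large part of Spark's appeal to developers.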

But, because there’s a champion, there must also be a contender.

This year, I was particularly impressed by the new Apache Flink project, which attempts to address some of Spark’s drawbacks like:

- Not yet being a first-class YARN citizen

- Being micro-batch (good in 95% of cases) rather than pure streaming

- Memory management that Flink aims to improve and simplify

 

If you look at Flink’s “marchitecture”, you can almost draw a one-for-one link between its modules and Spark’s. It’s the same story when it comes to their APIs: they are very similar.

https://flink.apache.org/img/flink-stack-small.png

 

 

So where is Talend in all of this?

With our Talend 5.6 platform, we delivered a few Spark components in Tech Preview; since then we have doubled down on our Spark investments, and our upcoming 6.0 release will see many new components to support almost any use case, batch or real-time. From a batch perspective, with 6.0, it will be easier to convert your MapReduce jobs into Spark jobs and gain significant performance improvements along the way.

It’s worth highlighting that the very famous and advanced tMap component will be available for Spark Batch and Streaming, allowing advanced Spark transformation, filtering and data routing from single or multiple sources to single or multiple destinations.

As always, and because we believe native code running directly on the cluster is better than going through proprietary layers, we are generating native Spark code, allowing our customers to benefit from the continuous performance improvements of their Hadoop data processing frameworks.

[2015-07-10] Talend Blog: Data Preparation: Empowering The Business User

A growing number of business users with limited knowledge of computer programming are taking an interest in integration and data quality functions, as companies become more and more “data-driven”. From marketing to logistics, customer service to executive management, HR to finance, data analysis has become a means for all company departments to improve their process productivity and efficiency. However, even with the graphical, cloud-based development capabilities offered by vendors like Talend, these tools remain largely reserved for IT specialists who develop data integration jobs.

For example, today a marketing manager wanting to launch a campaign has to go to his or her IT department in order to obtain specifically targeted and segmented data. The marketing manager has to spend time describing their needs in detail; the IT department has to set aside time to develop the project; and then both have to conduct initial tests to validate the relevance of the development. In this day and age, when reaction time means everything and against a backdrop of global competition in which real-time has become the norm, this process is no longer a valid option.

Indeed, business managers simply don't have time to waste and need shared self-service tools to help them reach their goals. The widespread use of Excel is proof. Business users do what they can to make their data usable, which means they spend 70 to 80% of their time preparing this data, without any assurance that the result is quality data. Furthermore, the lack of centralized governance creates risks around how the data is used, including privacy and compliance issues and even licensing problems.

These are very common constraints, and users need specific tools to manage enrichment, quality and problem-detection issues. Intended for business users, this new type of data preparation solution must be based on an Excel-like shared interface and must offer a broad spectrum of data quality functions. In addition, it must offer visualization and, it goes without saying, transformation functions that are easier to use than Excel macros and specialized for the most commonly used domains, in order to ensure adoption by the business user.

For example, by offering a semantic recognition function, the solution could enable automatic model detection and categorization, while simultaneously indicating potentially missing or non-compliant values. If it also offers a visual representation mode based on color codes and pictograms, the user is able to better understand his or her data. In addition, an automatic data class recognition function (personal information, social security number or credit card, URL, email, etc.) will further facilitate the user's task.
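
To make the idea of automatic data class recognition more concrete, here is a hedged sketch of the kind of pattern matching such a tool might perform under the hood; the class names and regular expressions are simplified illustrations, not Talend's actual semantic dictionary.

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Naive data-class detector: tries a few well-known patterns in order.
public class DataClassDetector {

    private static final Map<String, Pattern> PATTERNS = new LinkedHashMap<>();
    static {
        PATTERNS.put("EMAIL", Pattern.compile("^[\\w.+-]+@[\\w-]+\\.[\\w.-]+$"));
        PATTERNS.put("URL", Pattern.compile("^https?://\\S+$"));
        PATTERNS.put("US_SSN", Pattern.compile("^\\d{3}-\\d{2}-\\d{4}$"));
        PATTERNS.put("CREDIT_CARD", Pattern.compile("^\\d{13,19}$"));  // length check only; real tools also apply a Luhn check
    }

    // Returns the first matching class, or UNKNOWN when nothing matches.
    public static String classify(String value) {
        if (value == null || value.trim().isEmpty()) {
            return "EMPTY";
        }
        String v = value.trim();
        for (Map.Entry<String, Pattern> entry : PATTERNS.entrySet()) {
            if (entry.getValue().matcher(v).matches()) {
                return entry.getKey();
            }
        }
        return "UNKNOWN";
    }

    public static void main(String[] args) {
        System.out.println(classify("jane.doe@example.com"));  // EMAIL
        System.out.println(classify("4111111111111111"));      // CREDIT_CARD
    }
}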

But if a company settles for simply providing self-service tools, it is only addressing part of the challenge and is neglecting the issues related to the lack of data governance. The IT department, as competent as it may be, generally controls the data, which sometimes leads to a “Tower of Babel” as users extract their own versions of the original data. A data inventory function would enable the data sets that companies open to “self-service” to be itemized or certified by the IT department, while being directly managed by business users. This would enable the implementation of a truly centralized and collaborative platform, giving access to secure and reliable data while reducing the proliferation of different versions.

What's more, this shared and centralized platform can help IT control the use of data by way of indicators such as the popularity of data sets and the monitoring of their use, or by programming alerts to detect problems with data quality, compliance or privacy as early as possible. Tracking is the first step in a good governance plan. All in all, it is a win-win situation for everyone: the business user is happy to have access to self-service data sets and to be self-reliant and agile in carrying out the data transformations necessary for his or her business; at the same time, IT delegates more to its users while putting good data governance conditions in place.

However, a new pitfall of the “self-service” model is that it encourages a new type of proliferation: that of personal preparation scripts. In reality, many preparations can be automated, such as recurring operations that have to be conducted every month or every quarter, like accounting closures. This is what we refer to as “operationalization”: the IT department puts recurring data preparations into production, so that they can be consumed as a verified, certified and official flow of information. By operationalizing their preparations, users benefit from information-system reliability guarantees, including for very large volumes, and at a fraction of the cost thanks to Hadoop. In the end, this virtuous circle meets the twofold need highlighted by companies: the reactivity (even pro-activity) of business users who have to make decisions in less and less time, and the IT department's need for governance and a well-structured information system.

[2015-07-09] Talend Forum Announcement: Talend Open Studio's 6.0.0 release is available

Dear Community,

We are very pleased to announce that Talend Open Studio's 6.0.0 release is available. This general availability release for all users contains many new features and bug fixes.

Download Talend Open Studio for [Data Integration | Big Data | Data Quality | MDM | ESB]'s general availability release:

Data Integration: http://www.talend.com/download/data-integration
Big Data: http://www.talend.com/download/big-data
Data Quality: http://www.talend.com/download/data-quality
MDM: http://www.talend.com/download/mdm
ESB: http://www.talend.com/download/esb

You can also view Release Notes for this 6.0.0 version, detailing new features, through this link: http://www.talend.com/download/talend-open-studio
Find the latest release notes, with these steps: [Data Integration | Big Data | Data Quality | MDM | ESB] product tab > at bottom of page under "User Manuals PDF" > find this version's release notes.

For more information on fixed bugs and new features, go to the TalendForge Bugtracker.

Important features for this version are listed below

Talend Data Integration

Talend Open Studio for Data Integration:
- https://jira.talendforge.org/browse/TDI-31572 : JVM Visualization: not finished, but many improvements made.
- https://jira.talendforge.org/browse/TDI-31748 : add postgresql 9.4 support
- https://jira.talendforge.org/browse/TBD-1555 : Add option in wizard for Hive engine (most useful later for DQ)
- https://jira.talendforge.org/browse/TUP-2579 : Use maven to build jobs/routes mostly done
- https://jira.talendforge.org/browse/TDI-31535 : Amazon aurora on db wizard
- https://jira.talendforge.org/browse/TDI-32272 : Update for mdm wizard for the version 6.0
- https://jira.talendforge.org/browse/TUP-2644 : Support CDH 5.4

Studio:
- Support Java 8 (Studio part): https://jira.talendforge.org/browse/TUP-2511
- Allow custom components to extend FILTER of other Talend components: https://jira.talendforge.org/browse/TDI-31510
- Support of MariaDB in Studio Wizards
- New IPAAS / IPAAS_DQ branding (except images..), and Ipaas automatically available for every platform license.
- Metadata bridge available as an add-on at studio startup
- Integration of Bonita SE for MDM (need to review the perspective names and menus to hide)

Components
- Support Netezza Appliance 7.2 https://jira.talendforge.org/browse/TDI-31780
- Upgrade to Bonita 6.5 https://jira.talendforge.org/browse/TUP-2741
- Upgrade CXF jar to v3.1 in tWebservice https://jira.talendforge.org/browse/TUP-2674
- Support JVM 1.8 https://jira.talendforge.org/browse/TUP-1884
- Upgrade Teradata, Vertica, Mssql, Mysql5.6. Deprecate SQLserv 2005
- Upgrade MDM components https://jira.talendforge.org/browse/TDI-31722
- Support ELT for Vertica https://jira.talendforge.org/browse/TDI-29443
- Support for SQL Templates in Vertica https://jira.talendforge.org/browse/TDI-29444
- tMOM component should provide static discovery protocol and timeout https://jira.talendforge.org/browse/TDI-29407
- Support SSL with tMom*/tSAPIdocreceiver https://jira.talendforge.org/browse/TDI-32242
- Support fetchsize in tJDBC component https://jira.talendforge.org/browse/TDI-31765
- Netsuite: reject line on tNetsuiteOutput https://jira.talendforge.org/browse/TDI-32366
- Teradata SCD component https://jira.talendforge.org/browse/TDI-32044
- Add datasource alias feature to DB components: SQL Server, PostgreSQL, and DB2 https://jira.talendforge.org/browse/TDI-32491
- Add MariaDB driver in driver list
- Refresh UX Components
- Support for Cloudera CDH 5.4
- Support the batch mode back in shared connection mode
- Encoding Base64 tfileoutputLDIF file has problem
- Retrieve the upload status for tSalesforceWave*
- Salesforce wave component https://jira.talendforge.org/browse/TDI-31538
- Netsuite component update https://jira.talendforge.org/browse/TDI-31542
- Support Postgresql 9.4 (Component part) https://jira.talendforge.org/browse/TDI-31747
- Support IBM DB2 10.x (Component part) https://jira.talendforge.org/browse/TDI-31884
- Upgrade MDM components to support Tomcat https://jira.talendforge.org/browse/TDI-31722
- Add an option to select how to interpret blank value https://jira.talendforge.org/browse/TDI-31750
- Add one option to the tFileInputRegex to avoid the message "Line doesn't match" https://jira.talendforge.org/browse/TDI-32038
- tOracleSP component issue in Talend ESB runtime

Talend Big Data

Talend Open Studio for Big Data

- Hive and Pig running on Tez (TBD-1480, TBD-1504, TBD-1505)
- Support for Hortonworks 2.2 https://jira.talendforge.org/browse/TBD-1577
- Pig with Tez https://jira.talendforge.org/browse/TBD-1505
- Hive with Tez https://jira.talendforge.org/browse/TBD-1504
- Support for Cassandra CQL3 https://jira.talendforge.org/browse/TBD-1042
- https://jira.talendforge.org/browse/TBD-1592    error retrieving NULL values from bigquery
- https://jira.talendforge.org/browse/TBD-1593    when bigquery returns zero rows result tBigQueryInput fails with null pointer exception
- https://jira.talendforge.org/browse/TBD-1619    MongoDB SSL support
- https://jira.talendforge.org/browse/TBD-1796    Compilation error on tCassandraOutput when field is of type char
- https://jira.talendforge.org/browse/TBD-1820    MongoDB Close Component Run Failed When Log4j is Enabled
- https://jira.talendforge.org/browse/TBD-1508 : Update CDH version for 6.0
- https://jira.talendforge.org/browse/TBD-1509 : Update HDP version for 6.0
- https://jira.talendforge.org/browse/TBD-1513 : Upgrade MongoDB version for 6.0
- https://jira.talendforge.org/browse/TBD-1554 : Update CDH version for 6.0 in wizards
- https://jira.talendforge.org/browse/TBD-1594 : Upgrade MongoDB version for 6.0 in wizards
- https://jira.talendforge.org/browse/TBD-1908 : [5.6.1]tHiveXXX Components do not have capability to specify zookeeper settings in jdbc hive2 url
- https://jira.talendforge.org/browse/TBD-1919 : TBD-1513 Implement $set in MongoDBOutput so you can update specific fields in a collection
- https://jira.talendforge.org/browse/TBD-1939 : TBD-1513 [MongoDB] Add Kerberos authentication.
- https://jira.talendforge.org/browse/TBD-1940 : TBD-1513 [MongoDB] Add WriteConcern option
- https://jira.talendforge.org/browse/TBD-1948 : TBD-1513 [MongoDB] Add bulk option on tMongoDBOuput
- https://jira.talendforge.org/browse/TBD-1950 : TBD-1513 [MongoDB] tMongoDBInput add warning if the query does not match any indexed field
- https://jira.talendforge.org/browse/TBD-1951 : TBD-1513 [MongoDB] Add message on tMongoDBOuput to warn the user againt query without proper query isolation
- https://jira.talendforge.org/browse/TBD-1952 : TBD-1513 [MongoDB] The ReadPreference option should not be on the tMongoDBConnection.
- https://jira.talendforge.org/browse/TBD-1953 : TBD-1513 [MongoDB] Port all new MongoDBConnection modifications to all MongoDB components where applicable
- https://jira.talendforge.org/browse/TBD-1963 : TBD-1513 [MongoDB] Review components UI
- https://jira.talendforge.org/browse/TBD-1972 : TBD-1513 [MongoDB] Allow user to query the root node of the database.
- https://jira.talendforge.org/browse/TBD-1994 : TBD-1513 [MongoDB] Test mongoDB with multiple Mongos
- https://jira.talendforge.org/browse/TBD-1996 : TBD-1513 [MongoDB] Add migration tasks for old version of mongoDB
- https://jira.talendforge.org/browse/TBD-1998 : TBD-1513 [MongoDB] Cannot read numeric values which are not inserted by a tMongoDBOutput

Talend Data Quality

Talend Open Studio for Data Quality:
- Improved column analysis editor: https://jira.talendforge.org/browse/TDQ-9872
- Eclipse 4.4 upgrade: https://jira.talendforge.org/browse/TDQ-9830
- EMF compare upgrade: https://jira.talendforge.org/browse/TDQ-9394
- Data preview in Column analysis editor
- https://jira.talendforge.org/browse/TDQ-8428: Sampling algorithm to run Analysis on a sample of data which is not necessary the first 1000 or first 5000 rows.

Talend ESB
 
Talend Studio & Talend Open Studio for ESB:
- Eclipse 4.4 / Java 8 Support
- New look and feel: Data Service / Routebuilder
- Route Builder: upgrade to Camel 2.15
- Data Services: upgrade to CXF 3.1
- cMessageEndpoint (extended list of Camel 2.15 endpoint components)
- cFlatpack component added
- cLoop allows now to set the ‘copy’ parameter
 
Runtime
- Apache Karaf - 4.0 (New major version)
- Apache CXF - 3.1 (New major version)
- Apache Camel - 2.15
- Apache ActiveMQ - 5.11
- Apache Syncope - 1.2
- Updated ESB Infrastructure Services
- Updated Examples
- Java 8 Support

Talend MDM

- Support for Java 8 https://jira.talendforge.org/browse/TMDM-7291
- Web UI look & feel refresh, cleaner, lighter (Step 1) https://jira.talendforge.org/browse/TMDM-8039
- Web app running on Tomcat (EJB / JBoss removed) https://jira.talendforge.org/browse/TMDM-8055
- RESTFul API for CRUD operations onto records https://jira.talendforge.org/browse/TMDM-8027
- Event Manager based on JMS https://jira.talendforge.org/browse/TMDM-8086
- Upgrade to Eclipse 4

Thanks for being a part of our community,
The Talend Team.

[2015-07-03] Talend Forum Announcement: Talend Connect Paris: inviting all clients, users and partners!

Reserve your spot at Talend Connect Paris!

Don't miss, on November 19, 2015, a full day of valuable insight, information sharing and networking designed to help Talend's ecosystem of customers, users and partners capitalize on the full potential of their data and systems. Learn how leading companies are using Talend to unlock the power of their data!

Register for this event to:
- learn and network with like-minded professionals;
- understand the big data revolution;
- discover the latest evolutions of the Talend platform, presented by Talend executives

Date and time
November 19, 2015 from 8.30AM to 5PM

Where
Paris, France (location given when registration confirmed)

Who
Customers, users and partners

Please register today for free. Secure your place now: https://info.talend.com/talendconnectfr … type=forge

[2015-06-30] Talend Blog: Spaghetti alla Cloud: Prevent IT Indigestion today! (Part II)

In part I of this two-part post, we learned why the IT architecture supporting modern business feels bloated when cloud computing was supposed to be a liberating, game-changing paradigm instead, and why it is critical to address this issue as soon as possible.

Solutions for Cloud Integration

Cloud Service Integration solutions, also known as iPaaS (integration Platform-as-a-Service), are a new generation of native cloud offerings at the intersection of data integration and application integration. They bring the best of both worlds, with hybrid integration across cloud and on-premises data sources and applications.

Cloud Service Integration solutions will typically be articulated around connectors and actions. The former are used to connect to applications and data sources, in the cloud and on-premises, implementing the service call and processing the input / output content. Most solutions provide pre-built connectors, as well as a development environment to build native connectors to applications, instead of having to implement custom web services. The integration actions provide additional control over the data: check data quality, convert to different formats, and offer additional activities for managing, merging or cleansing your data.  

Enterprise offerings will include additional key functionality such as administration and monitoring, to check for data loss and errors, schedule job execution, or assist with the set-up of environments and templates.

4 tips for selecting your Cloud Service Integration solution

As with all Cloud solutions, it can be daunting to sift through all the different offerings and identify which is right for you. The following tips will help you pick the right future-proof, cost-effective solution.

  1. Future-proof support of Data Quality and Big Data Integration: Connecting without improving quality is a waste of time, but many vendors do not provide out-of-the-box data quality actions. Likewise, your top priorities should definitely include the roll-out of Data Warehousing, Analytics or Big Data, meaning you will need support for MapReduce and Hadoop integration;
  2. User interfaces adapted to different needs: Your successful deployment will require adoption by multiple categories of users, each with different needs and expectations. For instance, developers will want a powerful IDE, while business users (“citizen integrators”) might prefer a simplified Web interface;
  3. Avoid lock-in with proprietary technologies and languages: prioritize solutions built on open source projects with existing communities (Apache Software Foundation projects are a great starting point), and using popular technologies such as Java (making it easier to find development and support resources). If the components that are developed can be reused across your other top initiatives (Big Data, Master Data Management…) this would of course be a major boost.
  4. Check for hidden costs: beware of packages that seem attractive but don’t include the actual connectors and actions! Having to pay extra for “premium connectors” (SAP, Salesforce…) will significantly increase your total cost of ownership as your architecture grows over time, hurt your ability to evolve with agility, punish success, or even stall the project as extra budget isn’t available to fuel your growth.

Cloud Service Integration might just be your recipe for success, turning your Spaghetti alla Cloud into the fuel for your organization’s future success. It can enable you to realize the full benefits of Cloud & SaaS while successfully implementing your top priority projects, such as Big Data integration, data warehousing, business intelligence and enterprise reporting.

One such option to consider is…prepare for shameless plug…Talend Integration Cloud.  Affectionately referred to as “TIC” internally, Talend Integration Cloud is a secure and managed cloud integration platform that we believe will make it easy for you to connect, cleanse and share cloud and on-premises data. As a service managed by Talend, the platform provides rapid, elastic and secure capacity so you can easily shift workloads between the ground and cloud, helping increase your agility and lower operational costs. Or, in other words, it’s like going beyond a solid shot of Pepto-Bismol to soothe your IT indigestion and making sure you are fueling your organization’s performance like an elite athlete.

[2015-06-24] Talend Blog: Sporting Lessons to Kick-Start Big Data Success

As football teams around the world enjoy their pre-season break, these can be exciting but anxious times. Especially for the teams and players making the step-up to play in a higher division after the success of promotion.

The potential rewards are substantial but making the leap to the higher echelons will be challenging, and preparation will be everything.  Businesses migrating from traditional data management to big data implementations will be experiencing similar feelings - trepidation mixed with determination to make the most of the opportunity. This time spent in the run up to a new project, similarly with pre-season, will be equally important for businesses, as they plan methodically to ensure long-term success.

Not everything will be new of course. The basics of traditional and big data management approaches are similar. Both are essentially about migrating data from point A to point B. However, when businesses move to embrace big data, they often encounter new challenges. 

With the summer transfer window now open, clubs and managers focus on bringing on board new talent and skills to ensure they give themselves the best opportunity for success.  Businesses too will concentrate on ensuring they have the skills and tools in place. Over time, businesses will increasingly need to deliver data in real-time on demand, often to achieve a range of business goals from enhanced customer engagement to gaining greater insight into customer sentiment or tapping into incremental revenue streams.

It won’t always be straightforward. The volume of enterprise data is increasing exponentially. Estimates indicate it doubles every 18 months. The variety is growing too, with new data sources, many unstructured, coming on stream continuously. Finally, with the advance of social media and the Internet of Things, data is being distributed faster than ever and businesses need to respond in line with that increasing speed. 

These trends are driving the compelling need for organizations to migrate to big data implementations. But as traditional approaches to data management increasingly struggle to manage in this new digital world, businesses look for new ways to avoid driving costs sky-high or taking too long to reach viable results.

The emergence of big data requires businesses to move to a completely new architecture based on new technologies: from the MapReduce programming model, to the real-time in-memory streaming capabilities of Apache Spark and Apache Storm, to the latest high-powered analytics solutions.

There is much for businesses to do, from learning new technical languages and building new skills, to governance, funding and technology integration. Getting this right isn’t going to be an overnight success and businesses need to set realistic expectations and goals - just as in sport, managers whose teams are new to the top flight need to take a pragmatic approach and not be too dispirited if they fail to match the top team at the first attempt.

This is where testing environments can play a key role too. At Talend, we’ve developed a free Big Data Sandbox to help get people started with big data – without the need for coding. In this ready-to-run virtual environment, users can experience going from zero to big data in under 10 minutes! 

We have also identified five key stages to ensuring big data readiness:

  • The exploratory phase
  • The initial concept
  • The project deployment
  • Enterprise-wide adoption
  • And finally, optimization.

Here are some key goals businesses will need to accomplish at each stage of their journey in order to ultimately achieve big data success:

In the initial exploratory phase, the focus should be on driving awareness of the opportunities across the business. Organizations therefore first need to become familiar with big data technology and the vendor landscape; second, find a suitable use case, e.g. handling increasing data volumes; and third, provide guidance to management on next steps.

The second phase is around the design and development of a proof of concept. The overarching aim should be IT cost reduction, but the key landmark goals along the way will typically include building more experience in big data across the business, not least in order to better understand project complexity and risks; evaluating the impact of big data on the current information architecture; and starting to track and quantify costs, schedules and functionality.

The next stage moves the project on from theory to practical reality. The project deployment phase specifically targets improved performance. Key goals include achieving greater business insight; establishing and measuring ROI and KPI metrics; and developing data governance policies and practices for big data.

Enterprise-wide adoption drives broader business transformation. It is here that businesses should look to ensure that business units and IT can respond faster to market conditions; that processes are measured and controlled and ultimately become repeatable.  The final level of readiness is business optimization. To achieve this, organizations should look to use the insight they have gained to pursue new opportunities and/or to pivot the existing business.

My final recommendation is to make sure you build a clear and pragmatic execution plan, detailing what you want to achieve with big data success. Failure to do this may mean you don’t get the funding or support for a second project. It’s a bit like getting relegated at the end of your first season.

Fancy yourself as a data rock star? Find out how ready you are for big data success with our fun online quiz.

[2015-06-22] Talend Blog: Why Everyone Will Become a Part-Time Data Scientist

Your job description just changed. 

Take a look around you – Big Data is no longer a buzzword. Data volumes are exploding and so are the opportunities to understand your customers, create new business, and optimize your existing operations.

No matter what your current core competencies, if you’re not a part-time data scientist now, you will be.

The ability to do light data science (you don’t have to become a full bore PhD data maven in this new environment) will be as powerful a career tool as an MBA. Whether you’re in finance, marketing, manufacturing or supply chain management, unless you take on the mantle of part-time data scientist in addition to your other duties, your career growth might be stymied.

Successful companies today are data-driven.  Your role is to be one of those drivers. As a “data literate” employee comfortable slicing and dicing data in order to understand your business and make timely, innovative decisions, you can have a positive impact on your company’s operations and its bottom line.

For example, I’ve personally found that being able to drill into Talend’s marketing data has yielded critical insights. Analysis indicates that the adoption of Big Data – a key driver for part of our business – is much further along in some countries compared to others. As a result, different marketing messages resonate better in certain locales than others. Combine that data with web traffic activity, the impact of holiday schedules (France, for example, has a rash of holidays in May), weather patterns and other factors, and we come up with a much clearer picture of how these various elements impact our marketing efforts.  I can’t just look at global trends or make educated guesses – I need to drill into campaign data on a country by country basis.

So my recommendation to you is to dive in and get dirty with data. The good news is that you can become data literate now without spending years in graduate school. 

Start by becoming comfortable with Excel and pivot tables – a data summarization tool that lets you quickly summarize and analyze large amounts of data in lists and tables. (Microsoft has put quite a lot of work into its pivot tables to make them easier to use.)

Learn how to group, filter, and chart data in various ways to unearth and understand different patterns.

Now, once you’ve mastered these basics, you’ll feel comfortable bringing new data sources into the mix – like web traffic data or social media sentiment.  You will realize that you can aggregate this data in much the same way as you are able to analyze basic inventory levels or discounting trends.

In the case of Talend’s marketing operation, we are using the Talend Integration Cloud to bring together data from our financial, sales and marketing systems. This allows us to better understand and serve our customers and determine who should be targeted for new products and services. By taking this approach, you don’t have to wait for weeks or months for IT to conduct the analysis – these new tools provide results in hours or even minutes.

In the future, with the introduction of new data visualization tools, working with big data will become far easier for the growing ranks of part-time data scientists.  If you’re already comfortable with spreadsheets and statistics and have the core competence to spot different patterns in your data as you roll it up by week or by month, spotting trends using data visualization will be 10 times easier as you make the transition from a spreadsheet.

And, be sure to update your job description.  You’ve just joined the growing ranks of smart business users who have earned their part-time data scientist chops. Today this is a highly desirable option; tomorrow it will be mandatory.

[2015-06-18] Talend Blog: The Union of Real-Time and Batch Integration Opens Up New Development Possibilities

Hadoop's Big Data processing platforms feature two integration modes that correspond to different types of usage, but are being used interchangeably with increasing frequency. "Batch" or "asynchronous" mode enables the programming of typically overnight processing. Examples of using batch mode include a bank branch integrating the day's deposits into its books, a distributor using or updating a new product nomenclature, or a business owner consolidating sales for all branches for a given period. The primary advantages of using batch mode include the ability to process huge data sets and meet most traditional corporate analytics needs (business management, client and marketing expertise, decision-making support, etc.).

However, one of the limits of batch processing is the latency period which makes any real-time integration impossible. This constitutes a delicate problem for companies with the need to meet client demands on the spot, cases such as making a recommendation to an Internet user in the middle of a purchase (think Amazon), posting an ad on a website aimed at a specific Internet user within a matter of milliseconds, taking immediate stock of the variability of different elements in order to improve decision-making (such as weather or traffic conditions) or detecting fraud.

In the Hadoop ecosystem, a new solution to this problem has emerged: Spark, an Apache Software Foundation project, now offers a synchronous integration mode (in near real-time), also referred to as "streaming". This multifunction analysis engine is well adapted to fast processing of large data sets and includes the same functions as MapReduce, albeit with vastly superior performance. Namely, it enables the management of both data acquisition and processing, all while offering a processing speed that is 50 to 100 times greater than that of MapReduce.
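
To illustrate what Spark's micro-batch "streaming" mode looks like to a developer, here is a minimal sketch against the Spark 1.x Java streaming API; the socket source, host and port are placeholders for a real ingestion channel such as Kafka.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

// Counts words arriving over a socket, recomputed for every 5-second micro-batch.
public class StreamingWordCount {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("streaming-word-count");
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(5));

        JavaReceiverInputDStream<String> lines = ssc.socketTextStream("localhost", 9999); // placeholder source
        JavaPairDStream<String, Integer> counts = lines
            .flatMap(line -> Arrays.asList(line.split("\\s+")))   // Spark 1.x flatMap returns an Iterable
            .mapToPair(word -> new Tuple2<>(word, 1))
            .reduceByKey((a, b) -> a + b);

        counts.print();          // emit each micro-batch's counts to the driver log
        ssc.start();
        ssc.awaitTermination();
    }
}

Note that the transformations are the same as in the batch case; only the context and the notion of a batch interval change, which is what makes switching between the two modes comparatively painless.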

Today, Talend supports both of these integration modes (while making it possible to switch from one to the other in a transparent manner, whereas the majority of solutions on the market will require a total overhaul of the data integration layer). Not only does it simplify processing development, it also simplifies the management of the overall life cycle (updates, changes, re-use). In the face of increasing complexity when it comes to big data-related technological offerings, Talend strove to ensure its support of all Hadoop market distributions (especially the most recent versions), while masking their complexity through a simple and intuitive interface. Spark is now at the heart of Talend's batch & real-time integration offer.

What's more, Spark now features new functions which, given the backdrop of real-time activities, provide companies with expanding options. One such example is support for "machine learning" functions, now a native Spark feature. The primary advantage of machine learning is to improve processing based on learning. Combining batch and real-time processing to meet today's corporate needs is also just around the corner: for example, a processing chain that uses weekly (batch) sales figures to build predictive functions, then applies them in real-time mode to speed up decision-making and avoid missing opportunities as they arise.

The advantages are obvious for e-commerce (recommendation) sites, as well as for marketing in general: combining browsing history data with the very latest information from social networks. For banks, creation of a "data lake" where all market data (internal and external) are compiled with no volume restrictions can enable the development of a predictive program by integrating other types of data. In the banking industry, this solution also enables huge volumes of data containing pertinent information to be extracted in order to foresee several different scenarios (predictive maintenance).

At the end of the day, this concerns all business sectors, from agriculture to wholesale distribution, from service provision to digital service providers, from manufacturing to the public sector, and so on. The advent of this new type of tool gives companies unprecedented analytical potential and will help them align more accurately with the current reality of their business. Talend is the only player in the big data arena to offer, on the one hand, a data transformation and processing solution aimed specifically at capitalizing on both batch and real-time data integration functions, and on the other hand, a big data offering that integrates all of the traditional integration functions (Data Quality, MDM, Data Governance, etc.), addressing the needs of the biggest IT organizations for whom anything less than an Enterprise Ready solution is simply not an option.

[2015-06-16] Talend Blog: Spaghetti alla Cloud: Prevent IT Indigestion today! (Part I)

Spaghetti alla Cloud? It’s what’s on the menu for most organizations today. With the explosion in popularity of SaaS applications, as well as PaaS (Cloud platforms) and IaaS (Cloud Infrastructures), most IT architectures and business flows resemble a moving, tangled mess of noodles. I’m pretty sure that if you dig deeper, you’ll find some pretty old, legacy meatballs in there too.

The idea that our IT architectures remind us of a bowl of spaghetti isn’t new; in fact, the complexity of integrating on-premises applications and legacy systems has been a challenge for decades, leading many companies to try out SOA (Service oriented architecture) as well as ESB platforms (Enterprise Service Bus). Unfortunately, a number of new challenges are putting even greater strain on our existing architectures.

- The explosion of SaaS (Software-as-a-Service): Cloud solutions are increasingly popular with IT teams given their promise of agility, cost reduction and speed. Today, the average company uses 923 cloud services. This means incessant waves of new applications, each with disconnected islands of data,

- Even more SaaS when you consider how easy it is for business users to start using a tool without informing IT. It’s not rare to find “free” versions used in important business processes (survey tools for instance), as well as add-ons and extensions acquired directly from cloud platforms such as Salesforce AppExchange or AWS Marketplace. Some sources report that the average person now uses 28 cloud apps regularly.

How many Cloud or SaaS applications do organizations run? At least twice the number they think they run (tweet this now!)

- Cloud churn is a fact of today’s fast-paced cloud market, where cloud businesses come and go rapidly, meaning that you might lose access to your provider and therefore your data. Users don’t hesitate to move to a better or cheaper solution, and seem less attached to Cloud solutions than their on-premises equivalents,

- Hybrid, Private and Public clouds, different implementation flavors of cloud computing, add an additional challenge of integrating through the firewall,

- The changing nature of modern business means that organizations are continually adapting, adjusting and adopting new technologies and practices, 

- And of course, let’s not even talk about the Internet of Things quite yet, but the size of your plate is about to explode.

The result? Spaghetti alla Cloud, that can leave your business bloated when cloud computing was supposed to be a liberating, game-changing paradigm instead.

Untangling your spaghetti architecture

Overloading on Spaghetti alla Cloud can have a significant impact on your organization’s competitive edge, just as there is a fine line between an athlete carb-loading before a race and a couch-potato with stomach cramps.

TechTarget's 2015 IT Priorities Survey provides insight into the top IT projects: Big data integration, data warehousing, business intelligence and reporting all depend on having a high quality, up-to-date data feed. As the IT saying goes, “garbage in, garbage out,” meaning that bad input will always result in bad output.

Developing your own point-to-point enterprise application integration (EAI) patterns is, naturally, a bad idea. The cost of development, implementation and maintenance would be prohibitive; it’s important to think about the long run and the total cost of operations. Unfortunately, the traditional alternative, a SOA + ESB approach, also has a very high cost of implementation, meaning it’s rarely rolled out properly and even less frequently kept up to par with business needs.

Your application integration solution must be robust while not hindering your agility. Flexibility and scalability are the key for both software and the supporting hardware and this is where Cloud options really shine. They lower the total cost of ownership for servers, bringing on-demand scalability and capacity, while improving performance for globally distributed users.

In part II of this post, we’ll explore how solutions for cloud integration, such as Talend Integration Cloud, help prevent IT indigestion and we will uncover four key tips for selecting your next Cloud Service Integration solution.

[2015-06-08] Talend Forum Announcement: For test only, Talend Open Studio's 6.0.0 RC1 release is available

Dear Community,

We are pleased to announce that Talend Open Studio's 6.0.0 RC1 release is available, for testing only. This release candidate contains new features and bug fixes, and is recommended for experienced users who need an early preview of the upcoming 6.0 release.

Download Talend Open Studio for [Data Integration | Big Data | Data Quality | MDM | ESB]'s first release candidate:

Big Data: http://www.talend.com/download/talend-o … s_download
Data Quality: http://www.talend.com/download/talend-o … s_download
ESB: http://www.talend.com/download/talend-o … s_download
Data Integration: http://www.talend.com/download/talend-o … s_download
MDM: http://www.talend.com/download/talend-o … s_download

Below, please find key new features and bug fixes for Talend 6.0.0 RC1:

Talend Open Studio for Big Data

- https://jira.talendforge.org/browse/TBD-1508 : Update CDH version for 6.0
- https://jira.talendforge.org/browse/TBD-1509 : Update HDP version for 6.0
- https://jira.talendforge.org/browse/TBD-1513 : Upgrade MongoDB version for 6.0
- https://jira.talendforge.org/browse/TBD-1554 : Update CDH version for 6.0 in wizards
- https://jira.talendforge.org/browse/TBD-1594 : Upgrade MongoDB version for 6.0 in wizards
- https://jira.talendforge.org/browse/TBD-1908 : [5.6.1]tHiveXXX Components do not have capability to specify zookeeper settings in jdbc hive2 url
- https://jira.talendforge.org/browse/TBD-1919 : TBD-1513 Implement $set in MongoDBOutput so you can update specific fields in a collection
- https://jira.talendforge.org/browse/TBD-1939 : TBD-1513 [MongoDB] Add Kerberos authentication.
- https://jira.talendforge.org/browse/TBD-1940 : TBD-1513 [MongoDB] Add WriteConcern option
- https://jira.talendforge.org/browse/TBD-1948 : TBD-1513 [MongoDB] Add bulk option on tMongoDBOuput
- https://jira.talendforge.org/browse/TBD-1950 : TBD-1513 [MongoDB] tMongoDBInput add warning if the query does not match any indexed field
- https://jira.talendforge.org/browse/TBD-1951 : TBD-1513 [MongoDB] Add message on tMongoDBOuput to warn the user againt query without proper query isolation
- https://jira.talendforge.org/browse/TBD-1952 : TBD-1513 [MongoDB] The ReadPreference option should not be on the tMongoDBConnection.
- https://jira.talendforge.org/browse/TBD-1953 : TBD-1513 [MongoDB] Port all new MongoDBConnection modifications to all MongoDB components where applicable
- https://jira.talendforge.org/browse/TBD-1963 : TBD-1513 [MongoDB] Review components UI
- https://jira.talendforge.org/browse/TBD-1972 : TBD-1513 [MongoDB] Allow user to query the root node of the database.
- https://jira.talendforge.org/browse/TBD-1994 : TBD-1513 [MongoDB] Test mongoDB with multiple Mongos
- https://jira.talendforge.org/browse/TBD-1996 : TBD-1513 [MongoDB] Add migration tasks for old version of mongoDB
- https://jira.talendforge.org/browse/TBD-1998 : TBD-1513 [MongoDB] Cannot read numeric values which are not inserted by a tMongoDBOutput

Components

- Support Netezza Appliance 7.2 https://jira.talendforge.org/browse/TDI-31780
- Upgrade to Bonita 6.5 https://jira.talendforge.org/browse/TUP-2741
- Upgrade CXF jar to v3.1 in tWebservice https://jira.talendforge.org/browse/TUP-2674
- Support JVM 1.8 https://jira.talendforge.org/browse/TUP-1884
- Upgrade Teradata, Vertica, Mssql, Mysql5.6. Deprecate SQLserv 2005
- Upgrade MDM components https://jira.talendforge.org/browse/TDI-31722
- Support ELT for Vertica https://jira.talendforge.org/browse/TDI-29443
- Support for SQL Templates in Vertica https://jira.talendforge.org/browse/TDI-29444
- tMOM component should provide static discovery protocol and timeout https://jira.talendforge.org/browse/TDI-29407
- Support SSL with tMom*/tSAPIdocreceiver https://jira.talendforge.org/browse/TDI-32242
- Support fetchsize in tJDBC component https://jira.talendforge.org/browse/TDI-31765
- Netsuite: reject line on tNetsuiteOutput https://jira.talendforge.org/browse/TDI-32366
- Teradata SCD component https://jira.talendforge.org/browse/TDI-32044
- Add datasource alias feature to DB components: SQL Server, PostgreSQL, and DB2 https://jira.talendforge.org/browse/TDI-32491

Talend Open Studio for MDM - bug fixes

- https://jira.talendforge.org/browse/TMDM-8492: TMDM-8024 Fix update workflow object version properties
- https://jira.talendforge.org/browse/TMDM-8482: TMDM-8164 Write document for Synchronization usage
- https://jira.talendforge.org/browse/TMDM-8472: TMDM-8164 Provide text used for Sync function
- https://jira.talendforge.org/browse/TMDM-8467: Can't import job from mdm perspective any more
- https://jira.talendforge.org/browse/TMDM-8460: TMDM-8024 Generate corresponding role in organization when generating a workflow
- https://jira.talendforge.org/browse/TMDM-8451: Relations screen has UI issues
- https://jira.talendforge.org/browse/TMDM-8449: TMDM-8164 Incremental User synchronization
- https://jira.talendforge.org/browse/TMDM-8447: It's not convenient for user to hide the usefull elements
- https://jira.talendforge.org/browse/TMDM-8446: TMDM-8039 Installer 6.0 versioning
- https://jira.talendforge.org/browse/TMDM-8444: TMDM-8355 The latest WSDL were updated for studio side
- https://jira.talendforge.org/browse/TMDM-8439: REST API:XML dateTime as date criteria can not get record
- https://jira.talendforge.org/browse/TMDM-8438: error in console log while import items from server
- https://jira.talendforge.org/browse/TMDM-8434: TMDM-8024 Fix can not run synchronizing automatically when import Demo project
- https://jira.talendforge.org/browse/TMDM-8433: Icons lost in import dialog
- https://jira.talendforge.org/browse/TMDM-8432: Server and studio both are frozen when deploying items
- https://jira.talendforge.org/browse/TMDM-8429: Error Log generate When deploy a DM to server first time
- https://jira.talendforge.org/browse/TMDM-8423: TMDM-8164 Add retry mechanism when sync failed.
- https://jira.talendforge.org/browse/TMDM-8422: TMDM-8024 Migration of Bonita workflow automatically when importing 5.X workflow files
- https://jira.talendforge.org/browse/TMDM-8421: TMDM-8164 Secure Synchronization Servlet
- https://jira.talendforge.org/browse/TMDM-8418: TMDM-8086 Implement execute routing order synchronously
- https://jira.talendforge.org/browse/TMDM-8414: TMDM-8039 Change Bonita BPM Community branding
- https://jira.talendforge.org/browse/TMDM-8410: Deployment cause server connection error even connection check is successful on CE version
- https://jira.talendforge.org/browse/TMDM-8409: DI icon lost in MDM CE version
- https://jira.talendforge.org/browse/TMDM-8404: TMDM-8024 Upgrade MDM import item wizard
- https://jira.talendforge.org/browse/TMDM-8403: TMDM-8355 Couldn't create routing orders records in event manager view.
- https://jira.talendforge.org/browse/TMDM-8402: Validation shell pop up while checking MDM server connection firstly
- https://jira.talendforge.org/browse/TMDM-8398: TMDM-6823 Remove Custom FK Filter
- https://jira.talendforge.org/browse/TMDM-8397: TMDM-8055 H2Console inclusion
- https://jira.talendforge.org/browse/TMDM-8396: TMDM-8039 Studio icons changes
- https://jira.talendforge.org/browse/TMDM-8391: TMDM-8024 Remove dependencies library jar file from bar file when deploying worklfow to MDM server
- https://jira.talendforge.org/browse/TMDM-8375: TMDM-8024 Fix workflow synchronizing after importing demo files
- https://jira.talendforge.org/browse/TMDM-8366: Exception while partial update a record(query language)
- https://jira.talendforge.org/browse/TMDM-8363: TMDM-8164 User/Roles Synchronization
- https://jira.talendforge.org/browse/TMDM-8361: TMDM-8080 Upgrade Demo jobs to reflect 6.0 changes
- https://jira.talendforge.org/browse/TMDM-8355: service apis need implement
- https://jira.talendforge.org/browse/TMDM-8339: TMDM-8024 Migration of existing connectors and actors
- https://jira.talendforge.org/browse/TMDM-8338: TMDM-8024 Upgrade Demo workflow
- https://jira.talendforge.org/browse/TMDM-8334: TMDM-8080 Studio issues wrong http calls to server
- https://jira.talendforge.org/browse/TMDM-8332: TMDM-8024 Fix synchronizing when modifying and deleting workflow
- https://jira.talendforge.org/browse/TMDM-8324: DSC grid and master databrowser missing ScrollBar
- https://jira.talendforge.org/browse/TMDM-8320: Data Browser : "Edit item with row editor" missing Save and Cancel buttons
- https://jira.talendforge.org/browse/TMDM-8314: TMDM-8024 Fix synchronizing when creating and generating workflow
- https://jira.talendforge.org/browse/TMDM-8294: Can't Import server objects from MDM server before check connection with demo server explorer
- https://jira.talendforge.org/browse/TMDM-8292: Use default value rule for an entity browser the view will show error.
- https://jira.talendforge.org/browse/TMDM-8282: BeforeDeleting info message cannot be displayed in WebUI
- https://jira.talendforge.org/browse/TMDM-8281: Can't find the record in recycle bin with logical deletion
- https://jira.talendforge.org/browse/TMDM-8245: "Import misses CONF and PROVISIONING for system containers"
- https://jira.talendforge.org/browse/TMDM-8236: TMDM-8024 Fix compile error of org.bonitasoft.studio.workspace.mdm
- https://jira.talendforge.org/browse/TMDM-8235: TMDM-8024 Package MDM connector as a plugin for studio and update help content
- https://jira.talendforge.org/browse/TMDM-8234: TMDM-8080 event manger test
- https://jira.talendforge.org/browse/TMDM-8226: TMDM-8024 Fix compile error when showing and duplicate workflow in MDM repository view
- https://jira.talendforge.org/browse/TMDM-8225: TMDM-8024 Fix compile error when packaging/deploying/undeploying workflow
- https://jira.talendforge.org/browse/TMDM-8224: TMDM-8024 Fix error when generating workflow file
- https://jira.talendforge.org/browse/TMDM-8223: TMDM-8024 Upgrade generated workflow template file to 6.5
- https://jira.talendforge.org/browse/TMDM-8218: TMDM-8024 Upgrade MDM connector configuration file with Bonita new format
- https://jira.talendforge.org/browse/TMDM-8214: TMDM-8055 XSLT issues
- https://jira.talendforge.org/browse/TMDM-8164: Upgrade Server side to Bonita 6.5.0
- https://jira.talendforge.org/browse/TMDM-8135: TMDM-8055 Review portlets code
- https://jira.talendforge.org/browse/TMDM-8116: Java 8 support (Studio part)
- https://jira.talendforge.org/browse/TMDM-8055: Tomcat as the MDM server
- https://jira.talendforge.org/browse/TMDM-8039: Look & Feel redesign
- https://jira.talendforge.org/browse/TMDM-6829: Obliterate deprecated projects
- https://jira.talendforge.org/browse/TMDM-6823: Obliterate XMLDB support

Talend Open Studio for Data Quality

- https://jira.talendforge.org/browse/TDQ-8428: Sampling algorithm to run Analysis on a sample of data which is not necessary the first 1000 or first 5000 rows.

Talend Open Studio for ESB

Studio
- Route Builder: upgrade to Apache Camel 2.15.2
- cFlatpack component added
- cLoop now allows setting the ‘copy’ parameter

Runtime / ESB
- Apache Karaf - 4.0.0-SNAPSHOT (updated to current snapshot)
- Apache CXF - 3.1.1-SNAPSHOT  (updated to current snapshot)
- Apache Camel - 2.15.2 (updated from 2.15.1)

Thanks for being a part of our community,
The Talend Team.

[2015-06-02] Talend Forum Announcement: Survey: Talend Java Platform

Dear Community,

There is a lot of excitement here at Talend as we build out our product plans for 2015 and beyond. As you probably already know, a major cornerstone of our product set is Java. We are currently defining our strategy for supporting future Java versions and we would be very interested to hear from you about your own plans for keeping up-to-date with the latest Java releases. Your feedback will help us to ensure that we continue to provide support to all of our customers in their choice of software platforms.

We would therefore be grateful if you would take a couple of minutes to answer a few questions in our short survey. It will help us to ascertain the use of Java in your projects. Of course your answers will be kept strictly confidential.

We look forward to hearing your comments and views in the completed questionnaire: https://aytm.com/r5b4bad

Best,
The Talend Team.

[2015-05-27] Talend Blog: The Power of One

When we last spoke, I talked about how Talend is working with data-driven companies to define and implement their One-Click data strategies. 1-Click, introduced by Amazon.com in 1999, allows customers to make on-line purchases with a single click – and is a showcase of how well they can turn massive volumes of shopper, supplier and product data into a customer convenience and competitive advantage.

Recently I’ve noticed some customers, particularly in the area of Internet of Things (IoT), are using a similar but different metric – “1 percent.” In this instance, “1” is used to describe the significant impact a fractional improvement in efficiency can have on major industries.  

The Positive Power of 1%

For example, take GE Water & Power, a $28 billion unit of the parent company. For nearly a decade, GE has been monitoring its industrial turbines in order to predict maintenance and part replacement.  Recently, the company has dramatically increased its ability to capture massive amounts of data from these sources and blend it with additional large data sets.

GE Water & Power CIO Jim Fowler, in a speech last year in Las Vegas, said that GE now has 100 million hours of operating data, along with maintenance and part-swap data, across the 1,700 turbines its customers have in operation. Each sensor-equipped turbine is producing a terabyte of data per day. That information has been combined and processed with external data such as weather forecasts. And the results clearly illustrate the power of one.

According to Fowler, GE is using the data to help its customers realize a tiny 1% improvement in output that adds up to huge savings – about $2 to $5 million per turbine per year.  While that’s impressive enough, when one considers the total savings across all 1,700 turbines over the next 15 years, 1% efficiency savings could equate to a staggering figure in the range of $66 billion.

Another excellent example of the Power of One comes from the OTTO Group in Europe, also a Talend customer. OTTO is the world’s second largest online retailer in the end-consumer (B2C) business, with reported online sales of 6 billion euros in 2013. OTTO is using the Talend platform to “…make quicker and smarter decisions around product lines, improve forecasts, reduce leftover merchandise and importantly, improve our customer experience,” says Rupert Steffner, Chief BI Platform Architect of the company’s IT organization.

For OTTO, like any other online retailer, shopping cart abandonment is a major challenge. Industry reports note that $4 trillion worth of merchandise will be abandoned in online shopping carts this year. By applying solutions made possible by access to extensive customer data, OTTO estimates it can predict with 90% accuracy customers who are likely to abandon a cart. It’s not hard to see how this information could be used to send incentives and promotions to these customers before they leave the site and their carts. Even if such activities were to net only a 1% change of fortune, when you consider it’s a $4 trillion issue, such a shift would be extremely meaningful.

Another European customer exemplifying the Power of One is a financial services company. A 1% improvement in cross-selling insurance policies to its existing customer base (a tiny fraction of its overall business) has resulted in a return of 600,000 euros over the past year.

GE – A Data Driven Enterprise

GE, by the way, is totally committed to making the most of Big Data.  It has coined the term “The Industrial Internet” – a combination of Big Data analytics with the Internet of Things (IoT). The challenge, says CIO Fowler, is to build an open platform for ingesting and sharing Big Data to build new, highly secure applications. Talend is lending a hand.

Talend 5.6, our latest release, sets new benchmarks for Big Data productivity and profiling, provides new Master Data Management capabilities with advanced efficiency controls, and broadens IoT device connectivity.

Working with machine-generated content or IoT devices is enhanced by the platform’s support for two related protocols – MQTT and AMQP – that allow sensor data to be ingested into a company’s real-time data integration workflows. It also supports the latest Hadoop extensions, Apache Spark and Apache Storm, providing significant performance advantages and supporting real-time and operational data projects.
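
For the technically curious, here is a minimal, hypothetical sketch of ingesting sensor readings over MQTT using the Eclipse Paho Java client. It is a plain illustration of the protocol, not Talend’s MQTT components, and the broker URL and topic are invented placeholders:

import org.eclipse.paho.client.mqttv3.IMqttDeliveryToken;
import org.eclipse.paho.client.mqttv3.MqttCallback;
import org.eclipse.paho.client.mqttv3.MqttClient;
import org.eclipse.paho.client.mqttv3.MqttException;
import org.eclipse.paho.client.mqttv3.MqttMessage;

public class SensorIngestSketch {
    public static void main(String[] args) throws MqttException {
        // Hypothetical broker and client id - replace with your own.
        MqttClient client = new MqttClient("tcp://broker.example.com:1883", "turbine-ingest-demo");
        client.setCallback(new MqttCallback() {
            public void connectionLost(Throwable cause) {
                System.err.println("Connection lost: " + cause);
            }
            public void messageArrived(String topic, MqttMessage message) {
                // Each message is one sensor reading; hand it to the downstream integration flow here.
                System.out.println(topic + " -> " + new String(message.getPayload()));
            }
            public void deliveryComplete(IMqttDeliveryToken token) {
                // Only relevant when publishing.
            }
        });
        client.connect();
        client.subscribe("plant/turbines/+/telemetry");   // hypothetical topic filter
    }
}

In a production flow, the body of messageArrived would feed the readings into the rest of the integration pipeline rather than printing them.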

Talend 5.6 is ready to help any data-driven enterprise realize the Power of One – and that’s just for openers. For smaller companies, the Talend solution may generate even greater returns – efficiencies on the order of 10% to 20%. Now that’s Power!

[2015-05-19] Talend Blog: Self-Service is Great for Fro-Yo but is it Right for Integration?

I had cause to visit a self-service frozen yogurt wonder emporium on a recent visit to the U.S. It was delightful and, at first, a tad overwhelming – so many flavors to choose from, so many toppings. Needless to say, I overindulged (goodbye, ideal running weight). Based on the number of similar establishments I saw during the rest of my stay, people seem to like the control and convenience of the self-service business model. And, whether you look at banking, booking travel, or personal tax filings, self-service or DIY certainly appears to be a broader trend. 

It’s also starting to happen in IT. For some IT teams, this change is happening too fast; for some business users this change can’t happen quickly enough. This tension is evident in the case of data integration and analytics. Because today’s business decisions are increasingly data driven, there is a growing strain between data-hungry business users and cost-constrained IT organizations.  Users want to be able to tap into multiple data sources and employ a variety of integrated applications to further their initiatives, but many IT teams can’t keep up.

Here’s a typical scenario: A marketing team returns from a trade show with a gold mine of fresh leads sitting in Box or Dropbox.  The leads need to be moved into a marketing tool like Marketo to make them actionable, then on to Salesforce.com for additional CRM activity. Finally, leads are imported into a data lake where the information can be combined with other data sources and analyzed using a tool like Tableau.

More easily said than done: Making the connections between the various systems is often a highly manual activity, as is cleansing the prospect contact information, which is typically riddled with data inconsistencies, duplicate names and other quality challenges. Naturally, the business users turn to IT for help.

But that help may not be as forthcoming as the users would like – IT organizations saddled with budget cuts and diminishing headcounts have a lot on their plate.  The process might be further protracted if the information delivered isn’t exactly what is needed or if the business teams want to probe further based on the initial insights.   

This delay in access to information has encouraged some business users to work outside the boundaries of IT control, sometimes referred to as Shadow IT. Business users may select and try to integrate SaaS applications on their own, which contributes to rogue silos that do not adhere to company standards for quality, security, compliance and governance. 

In a recent Talend survey, 50% of companies told us they have five or more SaaS apps. This figure is on the rise, with IDC reporting that purchases of SaaS apps are growing 5X faster than on-premises software. At the same time, Gartner predicts that 66% of integration flows will extend beyond the firewall by 2017 and 65% will be developed by the line of business.

Like my overflowing cup of frozen yogurt, independence can be costly in the traditional world of self-service integration. Users must spend hours each week cleaning and manually entering data – a mind-numbing, error prone activity that not only leads to inaccurate data, but also creates the perfect environment for procrastination. Dirty data starts to pile up like unwashed dishes in the kitchen sink.

Compounding the problem is the fact that the tools used to build integration solutions are complex, costly and require frequent updating. Most business users understandably lack the deep programming skills required to effectively code these systems.

Best of Both Worlds

The difficulties I’ve referred to actually present a great opportunity for IT to empower its users, streamline its own operations, and bring a new level of agility to the enterprise. Think “controlled self-service”.  Help has arrived in the form of the recently introduced Talend Integration Cloud – a secure cloud integration platform that combines the power of Talend’s enterprise integration capabilities with the agility of the cloud.

To be specific, with Talend Integration Cloud we provide four key capabilities:

- The platform makes it fast and easy to cleanse, enrich and share data from all your on-premises and cloud applications,

- We provide the flexibility to run those jobs anywhere – in the cloud or in your own data center,

- We support cloud and on-premise big data platforms allowing you to deliver next generation analytics at a fraction of the cost of traditional warehouse solutions,

- We enable IT to empower business users with easy-to-use self-service integration tools, so they can move and access data faster than ever before.

With regard to that last point, IT is able to turn its business users into “citizen integrators,” a term used by Gartner.  Business users get the tools they need to automate processes that were once handled by IT, but IT retains a high level of visibility and control. 

Talend Integration Cloud allows IT to use the full power of Talend Studio and apply more than 1,000 connectors and components to simplify application integration by its citizen integrators. IT either prepares all connectors and components, or validates connectors that are available through Talend Exchange, our community marketplace. This ensures that all connectors are in compliance with enterprise governance and security guidelines. 

And best of all, those problematic, standalone silos can be shelved. Talend Integration Cloud provides a whole new style of integration that allows IT to partner with business users to meet their integration needs. It may make both teams so happy that they decide a group outing is in order – self-serve fro-yo anyone?

                                                                                                                         

Infogroup and Talend Integration Cloud

Infogroup is a marketing services and analytics provider offering data and SaaS to help a range of companies – including Fortune 100 enterprises – increase sales and customer loyalty. The company provides both digital and traditional marketing channel expertise that is enhanced by real-time client access to proprietary data on 235 million individuals and 24 million businesses.

“We use Talend’s big data and data integration products to help our customers access vast amounts of contextually-relevant data in real-time, so they can be more responsive to their client’s demands,” said Purandar Das, chief technology officer, Enterprise Solutions, Infogroup. “Our work requires that we constantly innovate and perfect the systems driving topline revenue for clients. Talend Integration Cloud, which we’ve been beta testing, will enable us to integrate more data from more cloud and mobile sources leveraging our existing skills and prior Talend project work.”
                                                                                                                         

Bringing Benefits to IT, Users and the Enterprise

Among the many benefits associated with deploying Talend Integration Cloud are:

- Increased business agility and performance, thanks to easy-to-use data integration tools provided to business users,

- Increased governance and security, within compliance guidelines established by IT,

- Easier collaboration among business users, who can think more strategically and engage in dialogues that foster creativity,

- Simplified and dramatically reduced time to deployment, because all servers are in the cloud, minimizing the need for on-premises infrastructure,

- The ability to shift workloads from on-premises to a secure cloud, with cloud-to-cloud or cloud-to-ground connectivity,

- Lower cost and effort of maintaining and evolving these integrations, thanks to up-to-date visual representations of their status.

Plus you enjoy all the typical cloud benefits such as reduced cost, increased agility, greater speed, and lower total cost of ownership (maintenance of integration flows and connections is so much easier).

Talend Integration Cloud provides IT with an unparalleled opportunity to meet business users’ needs for data integration and self-provisioning, while maintaining the highest standards of governance and security.  It’s the best of both worlds.

[2015-05-13] Talend Forum Announcement: Get a FREE 30-day Trial of Talend Integration Cloud

Dear Community,

Connect all your data in the cloud and on the ground. Register now for Talend Integration Cloud: https://info.talend.com/prodevaltic.html?type=forge

- Agile Integration: Easily export, cleanse, and import data between popular applications and data sources including Salesforce, Marketo, NetSuite, Redshift, SAP and more
- Speed IT Performance: Manage hybrid integration of cloud and ground systems with instant, elastic and secure capacity
- Enable Real-time Big Data: Insight at a fraction of the cost with support for leading big data distributions and data sources

Your free 30-day trial of Talend Integration Cloud 1.0 includes:

- Web-based designer and integration tools
- Talend Studio for Cloud (2 GB) to build advanced integration flows
- Free access to an online community with 100+ packaged connectors, components and templates

Register now to get your 30-day trial: https://info.talend.com/prodevaltic.html?type=forge

Best,
The Talend Team.

[2015-05-13] Talend Forum Announcement: Talend Open Studio's 5.6.2 release is available

Dear Community,

We are very pleased to announce that Talend Open Studio's 5.6.2 release is available. This general availability release for all users contains new features and bug fixes.

Download Talend Open Studio for [Data Integration | Big Data | Data Quality | MDM | ESB]'s general availability release:

Big Data: http://www.talend.com/download/talend-o … s_download
Data Quality: http://www.talend.com/download/talend-o … s_download
ESB: http://www.talend.com/download/talend-o … s_download
Data Integration: http://www.talend.com/download/talend-o … s_download
MDM: http://www.talend.com/download/talend-o … s_download

You can also view Release Notes for this 5.6.2 version, detailing new features, through this link: http://www.talend.com/download/talend-open-studio
To find the latest release notes, follow these steps: select the [Data Integration | Big Data | Data Quality | MDM | ESB] product tab; at the bottom of the page, you will find the user manuals and recent release notes.

For more information on fixed bugs and new features, go to the TalendForge Bugtracker.

Thanks for being a part of our community,
The Talend Team.

[2015-05-13] Talend Forum Announcement: For test only, Talend Open Studio's 6.0.0 M5 release is available

Dear Community,

We are pleased to announce that Talend Open Studio's 6.0.0 M5 release is available, for testing only. This milestone contains new features and bug fixes, and is recommended for experienced users who need an early preview of the upcoming 6.0 release.

Download Talend Open Studio for [Data Integration | Big Data | Data Quality | MDM | ESB]'s fifth milestone:

Big Data: http://www.talend.com/download/talend-o … s_download
Data Quality: http://www.talend.com/download/talend-o … s_download
ESB: http://www.talend.com/download/talend-o … s_download
Data Integration: http://www.talend.com/download/talend-o … s_download
BPM: http://www.talend.com/download/talend-o … s_download
MDM: http://www.talend.com/download/talend-o … s_download

Below, please find key new features for Talend 6.0.0 M5:

Talend Open Studio for Big Data

- Error retrieving NULL values from bigquery https://jira.talendforge.org/browse/TBD-1592   
- When bigquery returns zero rows result tBigQueryInput fails with null pointer exception https://jira.talendforge.org/browse/TBD-1593   
- MongoDB SSL support https://jira.talendforge.org/browse/TBD-1619   
- Compilation error on tCassandraOutput when field is of type char https://jira.talendforge.org/browse/TBD-1796   
- MongoDB Close Component Run Failed When Log4j is Enabled https://jira.talendforge.org/browse/TBD-1820

Talend Open Studio for Data Integration

- JVM Visualization: not finished, but many improvements made https://jira.talendforge.org/browse/TDI-31572
- Add PostgreSQL 9.4 support https://jira.talendforge.org/browse/TDI-31748
- Add option in wizard for Hive engine (most useful later for DQ) https://jira.talendforge.org/browse/TBD-1555
- Use Maven to build jobs/routes (mostly done) https://jira.talendforge.org/browse/TUP-2579
- Amazon Aurora in DB wizard https://jira.talendforge.org/browse/TDI-31535
- Update MDM wizard for version 6.0 https://jira.talendforge.org/browse/TDI-32272
- Support CDH 5.4 https://jira.talendforge.org/browse/TUP-2644

Components

- Upgrade MDM components
- Add MariaDB driver in driver list
- Refresh UX Components
- Support for Cloudera CDH 5.4
- Restore batch mode support in shared connection mode
- Fix a Base64 encoding problem with tFileOutputLDIF output files
- Retrieve the upload status for tSalesforceWave*

Talend Open Studio for ESB & Talend ESB SE

Studio
- Route Builder: upgrade to Apache Camel 2.15.1 (Target 2.15.2 with RC1)
- Data Services: upgrade to Apache CXF 3.1.0-SNAPSHOT

Runtime / ESB
- Apache Karaf - 4.0.0-SNAPSHOT
- Apache CXF - 3.1.0-SNAPSHOT
- Apache Camel - 2.15.1
- Apache AMQ - 5.11.1

Talend Open Studio for Data Quality

Features
- UX : new icon set https://jira.talendforge.org/browse/TDQ-10189
- Support Java 8 https://jira.talendforge.org/browse/TDQ-9598
- Column analysis editor improvements https://jira.talendforge.org/browse/TDQ-9953

Improvements:
- Reduce memory usage during TDQ item export https://jira.talendforge.org/browse/TDQ-8325

Thanks for being a part of our community,
The Talend Team.

[2015-05-12] Talend Blog: MDM and the Chief Marketing Officer: Made for Each Other

When it comes to CMOs, I’m about as data-centric as they get.  Early in my career, I worked as an economist for a consulting firm in Washington, D.C. I was happily awash in data and found myself analyzing such hot topics as the difference in prices of power tools in Japan and the United States. 

Years later when I became a CMO, I thought to myself, “Here’s where I can use my love of working with lots of data to drive decision making and performance in marketing.” I was in for a rude surprise – the data spigot was badly broken. 

The reason for this data logjam quickly became apparent.  Most marketers are dependent on systems that were built to automate individual business functions such as sales, finance, and customer service, to name just a few.  And, despite advances in CRM, e-commerce, BI and marketing applications, very few CMOs can see across these siloed systems to get the insights they need to do their job. 

It’s a frustrating dilemma – marketers are unable to resolve this problem because they do not own the internal sales, finance and customer service systems, and do not control the processes that collect the relevant data. Each of these systems was designed to automate a specific function – none were created with the entire IT landscape in mind or built to inform marketing decisions. In most companies, no one is charged with pulling all this information together so the data remains in silos – solving specific functional problems, but not addressing the larger opportunities within the business.

I recently talked to the CMO of an eyewear company in the UK about this very problem. His is a classic example – given the company’s siloed systems, he is unable to analyze SKUs to identify such essential sales patterns as how well various colors and styles of frames are selling in different regions of the country. He’s not just frustrated, he’s angry about being handcuffed because of silo creep and the influx of unstructured, dirty and largely inaccessible data. If he does not fix this problem, there is no way that he can get what he needs out of his business intelligence initiatives to market effectively.

MDM to the Rescue

Master Data Management (MDM) is the answer. MDM was created to work across all of your enterprise’s systems – to pull together all your data, clean it and categorize it, providing you with a 360-degree view of your customers and insight into every aspect of your business.

MDM helps solve three major problems:

  1. Analytic MDM allows you to analyze and understand your entire customer base in order to segment customers and identify new opportunities and trends.
  2. Customer-360 MDM gathers all of your data about a single customer or product, including transactional information (e.g. site navigation path and past purchases). This allows a salesperson or customer service representative to leverage this 360-degree view and better sell to or service their customers on a day-to-day basis.
  3. Operational MDM enables all sales and service systems to work together on behalf of the customer. Systems are connected in real time to improve data quality and streamline the customer experience – for example, when a customer registers on a web site all other relevant systems are automatically updated.

To help implement the MDM solution, Talend has launched a new consulting services package, Passport for Master Data Management (MDM) Success. The service helps establish the foundation needed to ensure MDM projects are delivered on time, within budget, and address the needs of a company’s various lines of business – including marketing.

MDM’s Beneficial Impact

From a CMO’s perspective, MDM solves a lot of problems and alleviates a lot of frustration. 

MDM can help you build your business by:

  • Improving product characterization in order to track and understand what lines are selling and why
  • Uniting customer information so you can understand which customers have purchased which specific products and services, and launch successful new offerings
  • Improving marketing database segmentation so you can better target customers based on role, title, product interest and past purchases
  • Eliminating duplicates and reducing market spend
  • Improving ROI tracking accuracy by gaining more insights into the interaction between marketing spend, touches and sales
  • Improving the customer experience by tying all of your systems together

The efficacy of MDM was brought home to me during another customer visit – this one far more upbeat.  Based on projections from a recently installed MDM system, this customer forecasts an increase in e-commerce revenues of 11% because the system allows customer service representatives to do a better job of cross-selling.  And, because the MDM system gives salespeople on the floor more insight into customers’ past purchases, in-store sales are expected to jump by 7%.

And here’s what TUI UK and Ireland had to say about their MDM implementation: “This modernization project is a key enabler for improved customer experience, enhanced multi-channel opportunities, and a reduction of contacts with our contact center,” stated Louise Williams, General Manager Customer Engagement at TUI UK and Ireland. “Talend is used to automatically merge customers to create a single golden record for each customer.”

For a marketing manager, a purpose-built MDM solution is the royal road out of the data management morass and an end to siloed systems.  It’s enough to make any CMO smile.

 

[2015-05-07] Talend Blog: Agatha Christie and the Challenge of Cloud Integration

Many classic detective novels progress in a familiar way: the hero, Hercule Poirot or Miss Marple for instance, has an incomplete understanding of how the crime played out and must painstakingly collect information from witnesses - filtering out lies from truth - until the big picture falls into place. In Agatha Christie’s best-selling novels, such as Murder on the Orient Express, this often leads to bringing all the characters together in the same room for the “reveal”. In this setting, an integrated story is told that ties everything together, the truth becomes obvious to all, and the police can take action.

Most businesses today are in the same predicament as those famous detectives. The data needed to compose a “complete view of the truth” is spread across numerous repositories and tools. Not only is data disconnected but also often contradictory. Customer data may be in a CRM solution, needing to be mapped to a Marketing Automation Platform. As well as lead lists coming in from events, user data may also reside in Customer Support tools and forums or on social media platforms.

Trying to bring together these disconnected data silos isn’t new for IT executives, but the situation is made more challenging today by the growing use of Cloud and SaaS solutions across the enterprise. Gartner reports that SaaS is now used on mission-critical projects and in production (Survey Analysis: Buyers Reveal Cloud Application Adoption Plans Through 2017). Talend research shows that half of businesses regularly use 10 or more SaaS solutions. This situation, more problematic than data silos, actually creates “islands of data”, each with its own data policies, access rules and regulations.

Lower overall cost, operational agility and increased speed: this is the promise of the Hybrid Cloud model, which orchestrates SaaS and on-premises applications with public and private clouds. However, this “new normal” is threatened by the lack of scalable, future-proof integration solutions that seamlessly connect SaaS applications with on-premises solutions. Executives who want to grow their business by becoming more data driven are in the same situation as our characters from the Agatha Christie novels: how do they bring together all the information needed quickly and resolve inconsistencies in order to gain a single, actionable view?

Connecting the data-driven enterprise with Cloud Integration

Leading companies are looking at iPaaS (integration Platform-as-a-Service) as a way to connect the data-driven enterprise – to connect all their data sources, cloud-to-cloud as well as cloud-to-ground. They are inevitably approaching a critical phase where the number of SaaS deployments, both controlled and rogue, is becoming unwieldy and needs to be brought together before the organization can fully leverage promising technologies like enterprise analytics or big data.

The benefits of an iPaaS solution to your data-driven enterprise:

  1. Drive Growth by connecting SaaS & On-premises applications and data: Use vendor-provided, community-designed or in-house connectors to bring together all the key applications inside your organization, connecting both applications rolled out as part of your IT strategy, as well as those resulting from ad-hoc projects: CRM, marketing automation, but also file sharing sites, HR, ERP platforms, etc. Enable big picture analytics, and predictable forecasting, and fuel growth by identifying new opportunities, markets, customers, and products.
  2. Boost Security, Governance and Quality for richer data:  Leverage the controlled data integration to enforce proper standards, data formats, and security protocols. Leaps in data quality can also be achieved by eliminating manual errors and identifying inconsistencies, duplications or junk data before it pollutes your repositories.
  3. Streamline Business Processes through Automation: Increase team efficiency by slashing time wasted on manual operations, boosting employee motivation while encouraging the roll-out of continuously improved best practice templates. Automate data cleansing, standardization, and enrichment while helping data flow faster through the enterprise, making it actionable sooner.
  4. Trigger Breakthrough Innovation by empowering Citizen Integrators: Enable business users to try new ways to simplify their day job with simple, Web-based interfaces to connect data sources in a controlled manner without needing IT expertise. Encourage them to think outside the box and innovate.

Hybrid Cloud Integration is the key to improving your organization’s bottom line by realizing the full benefits of Cloud & SaaS. Talend offers one such solution, Talend Integration Cloud. As you would expect from Talend, our solution offers powerful, yet easy-to-use tools along with prebuilt components and connectors to make it easier to connect, enrich, and share data. This approach is designed to allow less skilled staff to complete everyday integration tasks so key personnel can remain focused on more strategic projects. With capabilities for batch, real-time and big data integration, we believe you’ll find Talend Integration Cloud a great way to deliver on your vision of a business-oriented and future-proof Hybrid Cloud strategy.

Of course if only Agatha Christie’s heroes had such easy and fast access to consolidated information, knowing exactly what happened and why, most of her novels would only be a few pages long, perhaps even shorter. Maybe there is a new genre ready to explode here? The 140-character #crimenovel!

[2015-04-28] Talend Blog: Talend – Implementation in the ‘Real World’: Data Quality Matching (Part 1)

Data Quality (DQ) is an art form. If I was to define two simple rules for a project involving some element of DQ, they would be:

  1. Don’t underestimate the time and effort required to get it right
  2. Talk to the people who own and use the data and get continuous feedback!

In many cases the DQ role on a project is a full-time role. It is a continuous process of rule refinement, results analysis and discussion with the users and owners of the data. Typically the DI (Data Integration) developers will build a framework into which DQ rules can be inserted, adjusted and tested constantly without the need to rewrite the rest of the DI processes each time.

This blog focuses on the process of matching data, but many of the principles can also be used in other DQ tasks.

First of all, let’s understand what we are trying to achieve – why might we want to match?

- To find the same ‘entity’ (be it product, customer, etc.) in different source systems

- To de-duplicate data within a single system – or at least identify duplicates within a system for a Data Steward to be able to take some sort of action

- As part of a process of building a single version of the truth, possibly as part of an MDM (Master Data Management) initiative

- As part of a data-entry process to avoid the creation of duplicate data

To help us with these tasks I propose a simple DQ methodology:

As you can see, this is an iterative process and, as I said earlier, you are unlikely to find that just one ‘round’ of matching achieves the desired results.

Let’s look at each step on the diagram:

Profiling

Before we can attempt to match, we must first understand our data. We employ two different strategies to achieve this:

  1. Consulting:

- Reading relevant documentation

- Talking to stakeholders, end users, system administrators, data stewards, etc.

- Source system demonstrations

- Discussing the change in data over time

  2. Technical Profiling

- Using Talend tools to test assumptions and explore the data

- Analyse the actual change in data over time!

Both strategies must be employed to ensure success. One thing I have found is that end users of systems are constantly finding ways to ‘bend’ systems to do things that business teams need to do, but that the system perhaps wasn’t designed to do. A simple example of this would be a system that doesn’t include a tick box that lets call centre operators know that a customer does not want to be contacted by phone. You may find that another field which allows free text has been co-opted for this purpose e.g.

               Adam Pemble ****DO NOT PHONE****

This is why we cannot rely on the system documentation alone. A combination of consulting the users and technical profiling would help us identify this unexpected use of a field.

Typically for this step I would start by listing every assumption and query about the data – you should have at least one item for every field of every table and file, plus row- and table-level rules. Next, design Talend Profiler Analyses and/or Data Integration Jobs to test those assumptions. These results will then be discussed with the business users and owners of the data. The reports produced by the DQ profiler can be a great way to share information with these business users, especially if they are not very technical. DI can also produce results in formats familiar to business users e.g. spreadsheets.
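
To make that concrete, here is a deliberately simple, hypothetical sketch in plain Java (not a Talend Profiler Analysis) of the kind of checks such a job might run against a free-text column – counting missing values, spotting the ‘do not phone’ misuse shown above and summarising value patterns. The sample data and class name are invented for illustration:

import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ColumnProfilerSketch {

    // Reduce a raw value to a simple pattern: letters become 'a', digits become '9'.
    static String pattern(String value) {
        return value.replaceAll("[A-Za-z]", "a").replaceAll("[0-9]", "9");
    }

    public static void main(String[] args) {
        // Invented sample of a free-text contact name column.
        List<String> contactNames = Arrays.asList(
                "Adam Pemble ****DO NOT PHONE****", "Jane Smith", "", null, "J. Smith");

        long missing = contactNames.stream()
                .filter(v -> v == null || v.trim().isEmpty()).count();
        long misusedAsFlag = contactNames.stream()
                .filter(v -> v != null && v.toUpperCase().contains("DO NOT PHONE")).count();

        Map<String, Integer> patternFrequencies = new LinkedHashMap<>();
        contactNames.stream()
                .filter(v -> v != null && !v.trim().isEmpty())
                .forEach(v -> patternFrequencies.merge(pattern(v), 1, Integer::sum));

        System.out.println("Missing values: " + missing);
        System.out.println("Values co-opted as a do-not-phone flag: " + misusedAsFlag);
        System.out.println("Pattern frequencies: " + patternFrequencies);
    }
}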

Specific to the task of matching, some examples of assumptions we may wish to test:

  1. “In source system A, every Customer has at least one address and every address has at least one associated customer”
  2. “Every product should have a colour, perhaps we can use that in our matching rules? The colour field is free text in the source system.”
  3. “Source systems A and B both have an email address field – can we match on that?”
  4. “Source system X contains a lot of duplicates in the Customer table”

It is also important to analyse the lineage of each piece of data. For example, say we had an email address field. We may profile it and discover that it contains 100% correctly formatted email addresses. Is this because the front-end system is enforcing this, or is it just by chance? If it is the latter, our DI jobs may need to be written to cope with the possibility of an incorrect or missing email, even though none currently exist.
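
As a small, hypothetical illustration of that defensive stance (again plain Java rather than a Talend job), a matching key derived from an email field might only be produced when the value actually looks like an email:

import java.util.regex.Pattern;

public class EmailMatchKeySketch {

    // Deliberately simple format check; a real job might use stricter rules or an external verification service.
    private static final Pattern SIMPLE_EMAIL = Pattern.compile("^[^@\\s]+@[^@\\s]+\\.[^@\\s]+$");

    // Returns a normalised key for matching, or null when the email cannot be trusted.
    static String emailMatchKey(String rawEmail) {
        if (rawEmail == null || rawEmail.trim().isEmpty()) {
            return null;
        }
        String email = rawEmail.trim().toLowerCase();
        return SIMPLE_EMAIL.matcher(email).matches() ? email : null;
    }

    public static void main(String[] args) {
        System.out.println(emailMatchKey(" Adam.Pemble@Example.com "));  // adam.pemble@example.com
        System.out.println(emailMatchKey("not-an-email"));               // null
        System.out.println(emailMatchKey(null));                         // null
    }
}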

Note: I may write a future blog going into more detail about the importance of analysis before beginning to write any process logic.

Standardisation

Whilst performing the Analysis stage, it is likely that we will notice things about our data that will have an impact on our ability to match records. For example, we might profile a ‘colour’ column for a product and find results similar to those shown below:

What do we notice?

- Blk is an abbreviation of Black

- Chr is an abbreviation of Chrome

- Aqua MAY be a synonym of Blue

- Blu is a typo of Blue

- Etc.

If we were to do an exact match based on colour, some matches could be missed. Fuzzy matching could introduce false positives (more on this later).

To improve our matching accuracy, we need to standardise our data BEFORE attempting to match. Talend allows you to apply a number of different standardisation techniques including:

- Synonym indexes

- Reference data lookups

- Phone Number Standardisation

- Address Validation tools

- Other external validation / enrichment tools

- Grammar-based parsers

Let’s look at each of these in turn:

Synonym indexes

Our scenario with colours would be a classic use case for a synonym index. Simply put, a synonym index is a list of ‘words’ (i.e. our master values) and synonyms (related terms that we would like to standardise or convert to our master value). For example:

The above is an excerpt from one of the synonym indexes that Talend provides ‘out of the box’ (https://help.talend.com/display/TalendPlatformUniversalStudioUserGuide55EN/E.2++Description+of+available+indexes), in this case one that deals with names and nicknames. Talend also provides components to build your own indexes (the index itself is a Lucene index, stored on the file system) and to standardise data against these indexes. The advantages of using Lucene are that it is fast, it is an open standard, and we can leverage Lucene’s fuzzy search capabilities, so we can in some cases cater for synonyms that we can’t predict at design time (e.g. typos).

These jobs are in the Talend demo DQ project if you want to play with them. The indexes can also be utilised in the tStandardizeRow component, which we will discuss shortly.
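
As a purely illustrative sketch of the idea – a plain in-memory map with exact lookups, not Talend’s Lucene-backed indexes and without their fuzzy matching – standardising the colour values from earlier might look like this; the mappings are hypothetical and would need business sign-off:

import java.util.HashMap;
import java.util.Map;

public class ColourStandardiserSketch {

    // Hypothetical synonym index: synonym or typo -> master value.
    private static final Map<String, String> COLOUR_INDEX = new HashMap<>();
    static {
        COLOUR_INDEX.put("BLK", "Black");
        COLOUR_INDEX.put("CHR", "Chrome");
        COLOUR_INDEX.put("BLU", "Blue");
        COLOUR_INDEX.put("AQUA", "Blue"); // only if the business confirms Aqua really is a synonym of Blue
    }

    static String standardise(String raw) {
        if (raw == null) {
            return null;
        }
        String key = raw.trim().toUpperCase();
        // Fall back to the trimmed original value when no synonym is known.
        return COLOUR_INDEX.getOrDefault(key, raw.trim());
    }

    public static void main(String[] args) {
        for (String value : new String[]{"Blk", "Chr", "Blu", "Aqua", "Black"}) {
            System.out.println(value + " -> " + standardise(value));
        }
    }
}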

Reference data lookups

A classic DI lookup to a database table or other source of reference data e.g. an MDM system. Typically used as a join in a tMap.

Phone Number Standardisation

There is a handy component in the Talend DQ palette that you should know about: tStandardizePhoneNumber.

It uses a Google library to try to standardise a phone number into one of a number of available formats, based on the country of origin. If it can’t do this, it lets you know that the data supplied is not a valid phone number. Take the example of the following French phone numbers:

+33147045670 

0147045670

They both standardise to:

01 47 04 56 70

This uses the ‘National’ format option. In their original form, we would not have been able to match these records – after standardisation, we can make an exact match.
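
If you want to experiment outside the Studio, the sketch below does the same standardisation directly with Google’s libphonenumber library – an illustration of the kind of library the component relies on, rather than what tStandardizePhoneNumber does internally – assuming ‘FR’ as the default region:

import com.google.i18n.phonenumbers.NumberParseException;
import com.google.i18n.phonenumbers.PhoneNumberUtil;
import com.google.i18n.phonenumbers.PhoneNumberUtil.PhoneNumberFormat;
import com.google.i18n.phonenumbers.Phonenumber.PhoneNumber;

public class PhoneStandardiserSketch {
    public static void main(String[] args) {
        PhoneNumberUtil util = PhoneNumberUtil.getInstance();
        for (String raw : new String[]{"+33147045670", "0147045670"}) {
            try {
                // "FR" is the default region applied when no country prefix is present.
                PhoneNumber parsed = util.parse(raw, "FR");
                if (util.isValidNumber(parsed)) {
                    System.out.println(raw + " -> " + util.format(parsed, PhoneNumberFormat.NATIONAL));
                } else {
                    System.out.println(raw + " -> not a valid phone number");
                }
            } catch (NumberParseException e) {
                System.out.println(raw + " -> could not be parsed: " + e.getMessage());
            }
        }
    }
}

Both inputs come out as 01 47 04 56 70, which is exactly what makes the subsequent exact match possible.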

Address Validation tools

Talend provides components with our DQ offerings that allow you to interface to an external address validation tool such as Experian QAS, Loqate or MelissaData. Why would you want to do this? Well, if you are pulling address data from different systems, the likelihood is that the address data will be held in differing formats e.g. Address1, Address2 etc. vs Building, Street, Locality, etc. These formats may have different rules for governing the input of addresses – from no rules (free text) to validating against an external source. Even if two addresses are held in the same structure, there is no guarantee that the individual ‘tokens’ of the address will be in the same place or have the same level of completeness. This is where address validation tools come in. They take in an address in its raw form from a source and then using sophisticated algorithms, standardise and match the address against an external reference source like a PAF file (Post Office Address file) – a file from the postal organisation of a country that contains all addresses, updated on a regular basis. The result is returned in a well-structured and most importantly consistent format, with additional information such as geospatial information and match scores. Take the example below:

Two addresses that are quite different from each other to a computer; however, to a human, we can see that they are the same address. Running the addresses through an address validation tool (in this case Loqate) we get the same, standardised address as an output. Now our matching problem is much simpler.

You might ask – can I not build something like this with Talend rather than buy another tool? I was once part of a project, with quite poor-quality addresses, where this was attempted (not with Talend, it was a different tool). The issue is that addresses were designed to allow a human to deliver a letter to a given place, and there is a great deal of variation in how addresses can be represented. Six months of consultancy later, we had something that worked in most cases, but of course it was then realised that it would have been cheaper to buy a tool…. Why is address validation not built into Talend, you might ask? There are a number of reasons:

- Not all customers require Address Validation – it would make the product more expensive

- Those customers that do may already be using one of the major vendors and don’t want a different, Talend-proprietary system

- Different tools on the market suit different needs – e.g. MelissaData is centred on the US

- Why should Talend re-invent the wheel, when we could just allow you to utilise existing ‘best of breed’ solutions?

Other external validation / enrichment tools

There are many tools available on the market, most of which are easy to interface with using Talend (typically via a web service or API). For example, Dun & Bradstreet is a popular source of company data and Experian provides credit information on individuals. All of this data could be useful to an organisation in general and also potentially useful in matching processes.

Grammar-based parsers

Sometimes we will come across a data field that has been entered as free text, but could contain multiple bits of useful information that we could use to match, if only we could extract it consistently. Take for example a field that holds a product description:

34-9923  Monolithic Membrane  4' x  8'  26 lbs

 4' x  8'  Monolithic Membrane  26lbs Green

Now again, as a human, we can see that there is a high likelihood that these two descriptions are referring to the same product. What we need to be able to do is identify all of the different ‘tokens’: Product code, Name, Dimensions, Weight, and Colour - and create a set of rules to be able to ‘parse’ out these tokens, no matter the order or variations in representation (e.g. the spaces in 26 lbs but not in 26lbs). Essentially, what we are defining is some simple rules for a language or ‘grammar’. Talend includes a variety of parsing components which can help you, from simple regular expressions through to tStandardiseRow, which lets you construct an ANTLR grammar:
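
To make the idea of token extraction concrete, here is a deliberately naive sketch using plain regular expressions – nothing like a full ANTLR grammar, and the token patterns are invented for this one example:

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DescriptionParserSketch {

    // Invented token patterns for this one product family; a real grammar
    // (for example the ANTLR rules behind tStandardizeRow) would be far more robust.
    private static final Map<String, Pattern> TOKENS = new LinkedHashMap<>();
    static {
        TOKENS.put("productCode", Pattern.compile("\\b\\d{2}-\\d{4}\\b"));
        TOKENS.put("dimensions", Pattern.compile("\\d+'\\s*x\\s*\\d+'"));
        TOKENS.put("weight", Pattern.compile("\\d+\\s*lbs"));
        TOKENS.put("colour", Pattern.compile("\\b(Green|Black|Blue|Chrome)\\b"));
    }

    // Extract whichever tokens can be found, regardless of their order in the description.
    static Map<String, String> parse(String description) {
        Map<String, String> tokens = new LinkedHashMap<>();
        for (Map.Entry<String, Pattern> token : TOKENS.entrySet()) {
            Matcher matcher = token.getValue().matcher(description);
            tokens.put(token.getKey(), matcher.find() ? matcher.group() : null);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(parse("34-9923  Monolithic Membrane  4' x  8'  26 lbs"));
        System.out.println(parse(" 4' x  8'  Monolithic Membrane  26lbs Green"));
    }
}

A further standardisation pass (e.g. normalising 26lbs and 26 lbs to one representation) would follow before matching.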

A warning though: this is a hard task for even experienced professionals to get right. We are looking to include some additional intelligence in the tool to help you with building these sorts of rules in the future.

Next time: This blog continues with part 2: matching and survivorship / stewardship

[2015-04-23] Talend Blog: Talend – Implementation in the ‘real world’: what is this blog all about?

Let me introduce myself: My name is Adam Pemble and I am a Principal Professional Services Consultant based in the UK. My area of expertise is MDM (Master Data Management), along with the related disciplines of DI (Data Integration or what was traditionally known as ETL – Extract Transform Load) and DQ (Data Quality).

Now a little background: Talend has four main grades of consultant: Analyst, Consultant, Senior and Principal, with each grade bringing different levels of experience, technical knowledge and of course different price points. I have been with Talend for around four and a half years now, starting as a consultant (the first consultant in the UK team!) and working my way up. Before that, I worked for a competitor for three and a half years as an Analyst, then Consultant. I have two main roles: consulting for Talend customers and what we call ‘practice contribution’ – business development, defining best practices, training our consultants etc. When I am not consulting, I like to race cars – sponsored by Talend!

Talend implementation in the real world

So why am I telling you this? 

A little while ago I was asked by the Talend management team to start writing a blog for the website. I was given free rein to write about anything I liked (except cars – sadly). When I looked through the blogs and the bloggers that we already have on the site I realised that many of my colleagues were already doing a really good job of writing about the marketplace, the challenges faced by organisations, and where the industry is heading. All great stuff indeed; as an MDM practitioner, I’d particularly recommend reading the blogs of Mark http://www.talend.com/blogger/mark-balkenende, Christophe http://www.talend.com/blog/ctoum, Sebastiao http://www.talend.com/blogger/sebastiao-correia and Jean-Michel http://www.talend.com/blogger/jean-michel-franco.

Given that I like to pick my own “lane” (pun intended), I thought I should use my blog to discuss real-world problems and use cases, as well as provide some practical examples of how these may be overcome. I thought this type of content might also serve to augment some of the other information we make available to current and potential customers – such as documentation, training and forums – which focuses on ‘how’ to do something, but not necessarily ‘when’ and ‘why’. 

Let’s think about how most DI / DQ / MDM developers begin their journey with Talend. The practical reality is that if you have a decent sized project, your company will have chosen one of our Enterprise or Platform products. As a developer you may have been involved in the pre-sales process, but this is not a given. Perhaps you might have even downloaded and used one of the Talend Open Studio products, which was the catalyst for your company considering the purchase (if this is the case, and not that I am biased, well done!) Perhaps there is a Systems Integrator / partner involved in your project that you will work alongside or maybe you have your own development team. You may have used another DI tool in the past, perhaps worked extensively with Databases, or come from a coding background. Then again, maybe none of the above fits your particular situation. The truth is everyone starts at a different level and progresses at different speeds – this is only natural. Perhaps some people in your company think that Talend solves all their problems with no thought / effort required (encouraged by our marketing team I am sure). However, as practitioners, we know the reality is not quite as simple. What are the key factors in delivering a successful DI / DQ / MDM project?

- The right tools (aka Talend!)

- The right people at the right time – technical and business experts

- Experience

- Best practices / standards

- Analysis / Requirements / Design (i.e. a methodology which delivers results)

- Realistic expectations

All this can be rather daunting - so where do we begin?

The Talend training courses are a great place to start – incidentally, if you opt for an Instructor led course, this may be your first interaction with someone from our Professional Services team. The courses are a great first step that will put you on your way, but Talend is a powerful solution with a lot of depth, so mastering it will take time. Of course the disciplines of DI / DQ / MDM are complex, so no matter your background, it will take time and hands-on experience to be able to build truly ‘production-quality’ logic. I can’t quantify how long this will take because everyone is different (the quickest learners tend to be experienced practitioners who have used similar tools), but you are not alone as there are numerous resources available to you, some of which I have mentioned already.  

You should also consider utilising our Professional Services consultants – most customers use us for architecture design and installation of Talend, but we can also help mentor you through the development journey. We live and breathe the tool on a daily basis and have been through the project lifecycle many times. In most cases, we will have implemented something similar before (not always though and we love a challenge!). No one Talend consultant is an expert in all Talend products as the platform is too big for one person to know everything. Given this, we tend to specialise – for example I only deal with our ESB and Big Data products at a high level – MDM, DI and DQ are my specialisms. Regardless of your needs though, I can guarantee we have staff on hand that can help.

My hope is that my blog entries can be a practical guide to real-world problems using Talend, and that it will give you a little more insight into the way we work in Talend Professional Services.

Thanks for reading!

Next time: A practical guide to Data Quality Matching.

[2015-04-13] Talend Blog: Open Integration Meets Metadata With The New Talend Metadata Bridge

Great news!  Talend Metadata Bridge has been generally available (GA) since March 8th. Talend customers with an active subscription to any of our Enterprise or Platform products can download and install it as an add-on to our latest 5.6.1 Talend Studio. From 5.6.2 onwards, it will be installed automatically.

So, what does it bring to Talend developers, data architects and designers?

Metadata is data about data. Business metadata generally includes information like the definition of business objects (such as a customer), their attributes (for example, a customer ID), the relationships between objects (a contract related to a customer), the business rules that apply to that information (an active customer has at least one open contract), the roles with regard to that information, etc.  It brings clarity to information systems, making them more usable and accessible as self-service by business users, and it brings auditability too, a key capability especially needed in heavily regulated industries.
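
As a purely hypothetical illustration of what such business metadata might look like when captured in code (real metadata repositories are of course far richer), consider:

import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

public class BusinessMetadataSketch {
    public static void main(String[] args) {
        // Hypothetical business metadata for a "Customer" object, held in plain maps and lists
        // purely to make the idea concrete.
        Map<String, Object> customer = new LinkedHashMap<>();
        customer.put("object", "Customer");
        customer.put("attributes", Arrays.asList("customerId", "name", "segment"));
        customer.put("relationships", Arrays.asList("Customer 1..n Contract"));
        customer.put("businessRules", Arrays.asList("An active customer has at least one open contract"));
        System.out.println(customer);
    }
}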

Technical metadata is created by any tool that deals with data: databases, data modelling tools, Business Intelligence tools, development tools, enterprise applications, etc. In fact, metadata is a core capability for solutions and platforms that can bring a high level of abstraction to the IT technical layer, for example, for visual programming or Business Intelligence.

Talend is a perfect example. Metadata is the cornerstone of our visual design capabilities, so metadata is not new to Talend. What Talend Metadata Bridge adds is the ability to exchange Talend’s metadata with metadata from other tools. In addition, the Excel Data Mapping tool allows Talend’s data transformation capabilities, such as mappings and transformations, to be exposed and authored directly in Excel.  

Let’s run through some of the new capabilities. Please refer also to our new web page or online webinar for a more exhaustive overview. 

Faster design with the Talend Metadata Bridge

In many organizations, developers, application designers and data architects may not use the same tools when designing, implementing or maintaining systems. Designers may use tools that provide a very high level of abstraction but don’t deal with the technical details: they may use data, objects or process modeling tools, like CA ERwin Data Modeler, Embarcadero ER/Studio, SAP Sybase Power Designer, IBM Infosphere Data Architect, etc. Developers use other tools like a database, an ETL, a Business Intelligence tool, etc. The lack of integration between the tools leads to inefficiencies during the implementation phase.

What Talend Metadata Bridge does is seamlessly integrate Talend with higher-level tools. It can also reverse-engineer existing Talend data jobs into the modeling tools and keep them in sync during the project life cycle. In addition, it not only synchronizes data models with Talend’s physical models, but it also synchronizes metadata across all tools because of its ability to export the metadata across databases and BI tools.

The aforementioned modeling tools are very good at designing and managing data models and data relationships inside a system. However, they don’t provide similar capabilities to manage the relationships between systems, which is the typical problem you are addressing when you use Talend. Although Talend Studio provides a high level of abstraction over those data integration processes, some stakeholders involved in the design of a system may still find it too complex for their design job.

This is where our new Excel Bridge for data mapping comes into play. It is an Excel add-in, delivered as part of the Talend Metadata Bridge, that allows mappings with simple data transformations between data sources and targets to be designed in a simple spreadsheet. Designers will enjoy it for prototyping, documenting, auditing, or applying quick changes to the transformation process, directly from the Excel interface they are familiar with. The Excel add-in includes a new "ribbon" with helper functions to format the sheet. It also provides drop-down lists in the cells for easy access to the source or target metadata. With this new tool, collaboration between designers and developers becomes a matter of import-export, eliminating the traditional specification / implementation / acceptance cycle. The developer deals with the connectivity and other technicalities of the job, while the designer or a subject-matter expert uses Excel as a front end to complete the mappings.
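To make that hand-off concrete, here is a minimal sketch in Python of the kind of source-to-target mapping specification a designer might author in a spreadsheet and a developer might then apply in code. The column names, the "copy"/"upper"/"concat" transformation keywords and the apply_mapping helper are purely illustrative assumptions; this is not the format used by the Talend Excel Bridge.

import csv
import io

# A designer-authored mapping sheet, shown here as CSV for simplicity.
# Columns (hypothetical): target field, source field(s), transformation.
MAPPING_SHEET = """target,source,transform
CUSTOMER_ID,cust_id,copy
FULL_NAME,first_name|last_name,concat
COUNTRY,country_code,upper
"""

def apply_mapping(row: dict, mapping_rows: list) -> dict:
    """Apply a simple mapping specification to one source record."""
    out = {}
    for m in mapping_rows:
        sources = m["source"].split("|")
        if m["transform"] == "copy":
            out[m["target"]] = row[sources[0]]
        elif m["transform"] == "upper":
            out[m["target"]] = row[sources[0]].upper()
        elif m["transform"] == "concat":
            out[m["target"]] = " ".join(row[s] for s in sources)
    return out

mapping = list(csv.DictReader(io.StringIO(MAPPING_SHEET)))
source_record = {"cust_id": "42", "first_name": "Ada", "last_name": "Lovelace", "country_code": "uk"}
print(apply_mapping(source_record, mapping))
# {'CUSTOMER_ID': '42', 'FULL_NAME': 'Ada Lovelace', 'COUNTRY': 'UK'}

In practice, the mapping sheet is the shared artifact: the subject-matter expert edits it, and the developer wires it into the actual job.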

So what are the benefits of the Talend Metadata Bridge? It reduces implementation times and maintenance costs, increases data quality and compliance through better documentation and information consistency, and improves agility in the face of change.

At the same time, it empowers designers and business users with simple authoring capabilities for mappings and transformations in Talend, accelerates development by using common formats for specifications and development, and avoids runtime delays when a quick fix is needed for unforeseen changes.

Re-platforming and ETL offloading

Data platforms are at the core of any information system. Changing the core of a system often seems like a daunting task, which is one reason why data platforms don't change much over time. But there are times when change is needed. It happened twenty years ago, when relational databases outperformed their alternatives for information management. It happened more recently, though to a much lesser extent, when alternatives to traditional relational databases emerged, such as open source databases or data warehouse appliances. And now, as we engage with the Big Data trend, we see a new generation of innovative, open-sourced databases such as NoSQL stores, and data management environments such as Hadoop, that can handle greater volume, variety and velocity of information at a fraction of the cost of their predecessors.

But re-platforming may appear to be a risky and costly project that can undermine the benefits of the new technologies. There is a need for accelerators and a well-managed approach to address this challenge. In this respect, Talend Metadata Bridge enhances Talend's ability to handle migration projects. The bridge's metadata connectors complement the existing Talend data connectors to automatically create the data structures in the new environment before moving the data itself. It therefore lets you renew your platforms without losing your previous design and implementation investments, and it preserves existing development standards such as naming conventions.
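As a simplified illustration of this metadata-driven idea, the Python sketch below takes a relational table description and emits an equivalent Hive-style DDL statement, so the structure can exist in the new platform before any data is moved. The type names, the mapping table and the to_hive_ddl helper are assumptions made for the example; this is not the Talend Metadata Bridge API.

# Hypothetical illustration: translate a relational table definition captured as
# metadata into a Hive-style CREATE TABLE statement for the target platform.
TYPE_MAP = {
    "VARCHAR": "STRING",
    "NUMBER": "DECIMAL(18,2)",
    "INTEGER": "INT",
    "DATE": "TIMESTAMP",
}

def to_hive_ddl(table: str, columns: list) -> str:
    """Build a Hive DDL statement from (name, source_type) column metadata."""
    cols = ",\n  ".join(f"{name} {TYPE_MAP.get(src_type, 'STRING')}" for name, src_type in columns)
    return f"CREATE TABLE {table} (\n  {cols}\n) STORED AS PARQUET;"

legacy_customer = [("customer_id", "INTEGER"), ("name", "VARCHAR"), ("created_on", "DATE")]
print(to_hive_ddl("dw.customer", legacy_customer))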

When used in conjunction with Talend Big Data, it also dramatically streamlines the offloading of ETL processes to Hadoop: Existing ETL jobs can be converted into native Hadoop jobs that run without even requiring a proprietary run-time engine. In this scenario, the Talend Metadata Bridge can replicate the metadata from the legacy system to the one needed in the new Big Data platform.  Note that it is an accelerator, but not a magic box:  re-platforming is a project that needs a well-defined approach and methodology. This is something that we are investigating with System Integrators in the context of a new conversion program.

And that’s not all…

Metadata Management holds a lot more promise. Beyond the capabilities mentioned in this article, the Talend Metadata Bridge will drive new best practices in our Talend community, as we have seen through the feedback of some of the experts who participated in our early adoption program. In addition, the Talend Metadata Bridge provides the foundation for future capabilities within the Talend Unified Platform. Stay tuned...

[2015-04-10] Talend Forum Announcement: For test only, Talend Open Studio's 6.0.0 M4 release is available

Dear Community,

We are pleased to announce that Talend Open Studio's 6.0.0 M4 release is available, for testing only. This milestone contains new features and bug fixes, and is recommended for experienced users who need an early preview of the upcoming 6.0 release.

Download Talend Open Studio for [Data Integration | Big Data | Data Quality | MDM | ESB]'s fourth milestone:

Big Data: http://www.talend.com/download/talend-o … s_download
Data Quality: http://www.talend.com/download/talend-o … s_download
ESB: http://www.talend.com/download/talend-o … s_download
Data Integration: http://www.talend.com/download/talend-o … s_download
BPM: http://www.talend.com/download/talend-o … s_download
MDM: http://www.talend.com/download/talend-o … s_download

Below, please find key new features for Talend 6.0.0 M4:

Talend Open Studio for Big Data

- Support for Hortonworks 2.2 https://jira.talendforge.org/browse/TBD-1577
- Pig with Tez https://jira.talendforge.org/browse/TBD-1505
- Hive with Tez https://jira.talendforge.org/browse/TBD-1504
- Support for Cassandra CQL3 https://jira.talendforge.org/browse/TBD-1042


Talend Open Studio for Data Integration

Studio :
- Support of MariaDB in Studio Wizards
 
Data Integration Components :
- Salesforce wave component https://jira.talendforge.org/browse/TDI-31538
- Netsuite component update https://jira.talendforge.org/browse/TDI-31542
- Support Postgresql 9.4 (Component part) https://jira.talendforge.org/browse/TDI-31747
- Support IBM DB2 10.x (Component part) https://jira.talendforge.org/browse/TDI-31884
- Upgrade MDM components to support Tomcat https://jira.talendforge.org/browse/TDI-31722
- Add an option to select how to interpret blank value https://jira.talendforge.org/browse/TDI-31750
- Add one option to the tFileInputRegex to avoid the message "Line doesn't match" https://jira.talendforge.org/browse/TDI-32038

 
Talend Open Studio for ESB & Talend ESB SE

Studio
- Route Builder: upgrade to Apache Camel 2.15.1-SNAPSHOT
- Data Services: upgrade to Apache CXF 3.1.0-SNAPSHOT
 
Runtime / ESB
- Apache Karaf - 4.0.0-SNAPSHOT
- Apache CXF - 3.1.0-SNAPSHOT
- Apache Camel - 2.15.1-SNAPSHOT
- Apache AMQ - 5.11.1
- Apache Syncope - 1.2.3


Talend Open Studio for MDM

- Support for Java 8
- Web UI look & feel refresh, cleaner, lighter (Step 1)
- Web app running on Tomcat (EJB / JBoss removed)
- RESTful API for CRUD operations on records
- Event Manager based on JMS
- Upgrade to Eclipse 4


Thanks for being a part of our community,
The Talend Team.

[2015-04-09] Talend Blog: Big Data - a Relatively Short Trip for the Travel Industry

If there is one sector that has been particularly affected by the digital revolution, it is travel. According to research company PhoCusWright, the share of bookings derived from online channels will increase to 43 percent this year[1]. The move of more consumers to online sources for researching and booking travel, with sites like Airbnb, Booking.com or TripAdvisor counting visitors in the tens of millions, is a further boost to a sector that has historically always been a strong collector of detailed consumer information. At the same time, companies like Uber and BlaBlaCar are already showcasing the power of being data driven by fundamentally disrupting traditional taxi and rail travel services.

What could be more natural in these circumstances than travel companies being among the most committed to their digital transformation? A Forbes Insights report released earlier this year reinforces this point, placing travel at the top of industries in which companies are using data-driven marketing to find a competitive edge[2]. According to the report, 67 percent of travel executives say they have used data-driven marketing to find a competitive advantage in customer engagement and loyalty, and 56 percent have done so for new customer acquisition.

More Miles to Go

While the digital transformation of the travel industry is certainly underway, there is still a way to go – especially for the other 33 and 44 percent of travel executives who have yet to use data to drive a competitive advantage! Moreover, while the travel industry may be advanced in terms of marketing engagement, when it comes to relationships and the management of customer data, it still has a way to go. For instance, how often are you asked by reception at check-in if this is your first time at the hotel? And, even though during your stay you must constantly prove your identity, why is this only for hotel billing and security purposes rather than to enjoy personalized services? Moreover, some consumers suspect that being identified by the travel industry turns out to be a detriment rather than a benefit (for example, the case of "IP tracking" – the more one visits a booking site, the higher the ticket price might climb).

The reality is that companies in the travel industry are confined mostly to handling transactions, when there are technologies and practices linked to customer knowledge available that actually enable them to better manage and personalize the entire customer journey. The challenge is to reinvent the notion of the travel agency, which was formerly essential to linking customers to service providers. The Internet allows people to do a lot of the work themselves, such as finding a provider, making a reservation and responding to an event. The role of advisor remains, designed to provide, according to the customer’s profile, the right service at the right time. 

How do you differentiate?

Each ticket reservation (plane, train, coach, etc.), each hotel stay and each car rental leaves a “digital trail”, which can be consolidated and analyzed within a Customer Data Platform. This enables travel companies to better understand the needs and desires of an individual customer. Thus, a large amount of data can be collected both before (while booking a trip or a flight: destination preferences, language, climate, activities, etc.), during (food, excursions, sports, etc.) and after the trip (customer reviews and social commentary, recommendations, next trip, etc.). During the trip or the journey, it is also possible to be permanently connected to the customer, for example through the provision of Wi-Fi access, (as is already the case in the majority of hotels, airports, and increasingly on trains and planes). Globally, there are many more points of interaction today than there have ever been.

We therefore see travel companies launching services based on the Internet of Things and offering real-time recommendations to deliver new offers[3] (for example: the tennis court is open; would you like to use it?). Here, we are talking about managing the entire customer journey, not just the initial act of purchase.

A rough methodology

The first technological brick in this model is a customer database that covers all of the proposed services (online or offline reservations, points of sale, after-sales, customer service, call centers). This should include basic information about the customer, and it is what we call the golden record (the domain of Master Data Management). To get a unique view of the customer that is kept up to date across all channels, it must also reflect the transactions and events that took place during the customer's journey, including interactions. Big data plays a key role in this platform, as it can also integrate data from the web and social networks. Additionally, it allows for the extrapolation of analytical information from raw data, such as segmentations or scores that enable companies to predict the customer's affinity to a certain service.
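As a rough sketch of what building such a golden record involves, the Python snippet below merges customer fragments coming from different channels using a simple survivorship rule (the most recently updated non-empty value wins). The field names and the rule are illustrative assumptions, not a prescription for a real Customer Data Platform.

from datetime import date

# Customer fragments collected from different channels; each carries the date
# it was last updated so a "most recent non-empty value wins" rule can apply.
fragments = [
    {"source": "web", "updated": date(2015, 1, 10),
     "email": "j.doe@example.com", "phone": "", "loyalty_id": "L-123"},
    {"source": "call_center", "updated": date(2015, 3, 2),
     "email": "", "phone": "+33 1 23 45 67 89", "loyalty_id": ""},
    {"source": "store", "updated": date(2014, 11, 5),
     "email": "jdoe@old-mail.example", "phone": "", "loyalty_id": "L-123"},
]

def golden_record(records):
    """Merge channel fragments into one record, newest non-empty value per field."""
    merged = {}
    for rec in sorted(records, key=lambda r: r["updated"]):
        for field, value in rec.items():
            if field in ("source", "updated"):
                continue
            if value:  # keep the latest non-empty value seen so far
                merged[field] = value
    return merged

print(golden_record(fragments))
# {'email': 'j.doe@example.com', 'loyalty_id': 'L-123', 'phone': '+33 1 23 45 67 89'}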

This platform can also connect in real time to all points of contact, for example, a call center (which helps to increase efficiency and relevance through an immediate understanding of the customer context), websites, points of sale, reception desks, etc. The greater the number of points of contact, the more precise the picture of the individual customer will be and the greater the opportunity for companies to provide the right service at the right time. To the extent that this type of system directly affects the processes and the key actors in the customer relationship, it is essential to support the project with an accompanying change management effort.

In summary, a Customer Data Platform (some call it a Data Management Platform or DMP, but this term is ambiguous in my opinion since it is more often used in reference to a tool for managing online traffic and the purchase of display ads rather than a cross-channel platform intended to supply value-added services to target customers) enables, on the one hand, the creation of a sustainable and up-to-date customer information base and, on the other hand, a means to offer online services to the connected customer throughout their journey/stay/travel, thus creating a personalized relationship. And, finally, it allows for the recommendation of personalized offers in real time at the most opportune moments.

Though it may be difficult to maintain a one-to-one relationship with customers in some sectors, this is not the case in the travel sector; trips are often tailored, the context is personal and interactions with customers are frequent. The development of a Customer Data Platform is therefore essential for professionals in these sectors. Developing a real understanding of the customer journey is their last hope in a world where digital technology giants are beginning to take over their turf and mobility will only make it easier for them to collect more data.  

If you are interested in learning more about the impact of technology on the travel industry, you may wish to view this related on-demand webinar. The webinar details how TUI, the world’s number one integrated tourism business, with over 1,800 travel agencies and leading online portals, as well as airlines, hotels, and cruise lines, is using Talend Master Data Management (MDM) to build a single customer view and deliver a more seamless user experience across multiple channels.

Jean-Michel Franco is the Director, Product Marketing for Data Governance products, Talend, a global leader in big data integration and related markets.



[1] "Competitive Landscape Of The U.S. Online Travel Market Is Transforming", Forbes, April 2014, http://www.forbes.com/sites/greatspeculations/2014/04/08/competitive-landscape-of-the-u-s-online-travel-market-is-transforming/

[2] "Data Driven and Digitally Savvy: The Rise of the New Marketing Organization", Forbes, January 2015, http://www.forbes.com/forbesinsights/data_driven_and_digitally_savvy/

 

[2015-04-07] Talend Forum Announcement: ElasticSearch 1.1.1 vulnerability (Talend Log Server component)

Dear Community,

This post is intended for Talend customers who are running a Talend product version between 5.4.x and 5.6.1 and have the Talend Log Server installed.

We have identified a security vulnerability present in ElasticSearch 1.1.1, a component included with Talend Log Server in versions 5.4.x through 5.6.1. This vulnerability utilizes dynamic scripting, which allows remote attackers with network access to execute arbitrary MVEL expressions and Java code via the source parameter to _search. Further, a cross-site scripting (XSS) vulnerability in the CORS functionality in Elasticsearch allows remote attackers to inject arbitrary web script or HTML via unspecified vectors.

If you have not installed Talend Log Server, or are running a version of Talend software prior to 5.4.x, you are not affected.

However, if you are running a Talend product version between 5.4.x and 5.6.1, and have the Talend Log Server installed, it is necessary to make the following configuration changes to properly secure your system:

1. Create a file called elasticsearch.yml in your Talend Log Server directory (/Talend/<version>/Talend-LogServer)
2. Edit the file and add the following entries:

script.disable_dynamic: true
http.cors.allow-origin: "http://<TAC_SERVER_HOST>:<TAC_SERVER_PORT>"

3. Restart the Log Server.
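If you want to double-check that the change took effect, the short Python sketch below sends a search request containing a script field to the Log Server's ElasticSearch instance (assumed here to listen on localhost:9200; adjust the host for your environment). After the configuration change, such a request should be rejected with an error indicating that dynamic scripting is disabled, rather than being evaluated. This is an illustrative check, not an official Talend procedure.

import json
import urllib.error
import urllib.request

# Probe request: a search that asks ElasticSearch to evaluate a script field.
# With script.disable_dynamic: true, the server should refuse to run it.
ES_URL = "http://localhost:9200/_search"  # adjust host/port for your Log Server
probe = {
    "query": {"match_all": {}},
    "script_fields": {"probe": {"script": "1 + 1"}},
}

req = urllib.request.Request(
    ES_URL,
    data=json.dumps(probe).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

try:
    with urllib.request.urlopen(req, timeout=10) as resp:
        print("WARNING: the script was evaluated; dynamic scripting may still be enabled")
        print(resp.read().decode("utf-8")[:500])
except urllib.error.HTTPError as err:
    # Expected after the fix: the request is rejected instead of being executed.
    print("Request rejected (expected):", err.code)
    print(err.read().decode("utf-8")[:500])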

For more information please refer to Talend Jira System TUP-2775: https://jira.talendforge.org/browse/TUP-2775 .

More information about the CVE (Common Vulnerabilities and Exposures) IDs related to the Elastic Search vulnerabilities can be found at CVE 2014-3120 http://www.cve.mitre.org/cgi-bin/cvenam … =2014-3120 and CVE 2014-6439 http://www.cve.mitre.org/cgi-bin/cvenam … =2014-6439 .

Frequently Asked Questions:

Q. I’m running Talend Log Server. Is my system vulnerable to attackers?

If your system is properly secured behind a firewall, it would only be vulnerable to attacks from within your internal network. Talend recommends that you apply the configuration changes above to ensure that the system is not open to malicious attacks.

Q. I’m not sure if I installed Talend Log Server. How can I identify if it’s running?

You can check what services your system is running. In Windows, click on the start menu and type “services.msc” for a complete list. If you see Talend Logserver, the service is installed. Other ways to locate the log server:

On Windows, type netstat -na | find "9200" into the command prompt and verify whether a service is listening on port 9200. On Linux, the equivalent check is netstat -na | grep 9200.

Q. I see that I’m running the Talend Log Server. Instead of disabling dynamic scripting, can I turn off the Log Server entirely?

Yes. When viewing your Windows services, right click on the Talend Log Server, and select “stop”. Then select Properties, and change the service from “Automatic” to “Disabled”. The log server will no longer initialize when the server restarts.

Q. Are future versions of Talend software affected?

Starting with version 5.6.2, dynamic scripting will no longer be enabled for Talend Log Server. For Talend 6.0, the version of ElasticSearch included does not contain these vulnerabilities.

For any additional questions, please go to http://talend.com/ and contact Talend Support.

Best,
The Talend Team.

[2015-03-27] Talend Blog: What is “The Data Vault” and why do we need it?

For anything you might want to do, understanding the problem and using the right tools is essential. The methodologies and best practices that inevitably arise become the catalyst for innovation and superior accomplishments. Database systems, and particularly data warehouse systems, are no exception, yet do the best data modeling methodologies of the past offer the best solution today?

Big Data, arguably a very hot topic, will clearly play a significant part in the future of business intelligence solutions. Frankly, the reliance upon Inmon's Relational 3NF and Kimball's Star schema strategies simply no longer applies. Knowing which data modeling methodology to use, and how to use it, is a key design priority and has become critical to successful implementations. Persisting with outdated data modeling methodologies is like putting wagon wheels on a Ferrari.

Today, virtually all businesses make money using the Internet.  Harvesting the data they create in an efficient way and making sense of it has become a considerable IT challenge.  One can easily debate the pros and cons involved in the data modeling methodologies of the past, but that will not be the focus of this blog.  Instead let’s talk about something relatively new that offers a way to easily craft adaptable, sensible, data models that energize your data warehouse:  The Data Vault!

Enterprise Data Warehouse (EDW) systems aim to provide true Business Intelligence (BI) for the data-driven enterprise. Companies must address critical metrics ingrained in this vital, vibrant data. Providing an essential data integration process that eventually supports a variety of reporting requirements is a key goal for these Enterprise Data Warehouse systems. Building them involves significant design, development, administration, and operational effort. When upstream business systems, structures, or rules change, fail to provide consistent data, or require new systems integration solutions, the resulting reengineering requirements present us with problem #1: the one constant is change, so how well can an EDW/BI solution adapt?

"It is not the strongest of the species that survives, nor the most intelligent that survives. It is the one that is the most adaptable to change." Charles Darwin

Consumption and analysis of business data by diverse user communities has become a critical reality for maintaining a competitive edge, yet today's technological realities often require highly trained end users. Capturing, processing, transforming, cleansing, and reporting on this data may be understandable, but in most cases the sheer volume of data can be overwhelming. Yup, problem #2: Really Big Data, often characterized as Volume, Velocity, Variety, Variability, Veracity, Visualization, and Value!

Crafting effective and efficient EDW/BI systems, simplified for usability and reporting on this data, quickly becomes a daunting and often difficult technical ordeal even for veteran engineering teams. Several integrated technologies are required, from database systems, data processing (ETL) tools like Talend, various programming languages, administration, reporting, and interactive graphics software to high-performance networks and powerful computers with very large storage capacities. The design, creation, delivery, and support of robust, effortless EDW/BI systems for simplified, intelligent use are, you guessed it, problem #3: Complexity!

Often we see comprehensive and elegant solutions delivered to the business user that fail to address the true needs of the business. We're told that's just the way it is due to technical requirements (limitations; wink, wink) and/or design parameters (lack of features; nudge, nudge). Hence, problem #4: The Business Domain; fit the data to meet the needs of the business, not the other way around!

Furthermore, as upstream systems change (and they will), as EDW/BI technology plows ahead (and it must), and as the dynamic complexities involved prevail (relentlessly), every so often new data sources need to be added to the mix. These are usually unpredicted and unplanned for. The integration impact can be enormous, often requiring complete regeneration of the aggregated data; hence, problem #5: Flexibility, or the lack thereof!

So how do we solve these problems?  Well …

Bill Inmon, widely regarded as the father of data warehousing, defines a data warehouse as:

A subject oriented, nonvolatile, time-variant collection of data in support of management’s decisions
(http://en.wikipedia.org/wiki/Bill_Inmon)


Ralph Kimball (http://en.wikipedia.org/wiki/Ralph_Kimball), a pioneering data warehousing architect, developed the "dimensional modeling" methodology now regarded as the de-facto standard in the area of decision support. The Dimensional Model (called a "star schema") is different from Inmon's "normalized modeling" (sometimes called a "snowflake schema") methodology. In Kimball's Star Schema, transactional data is partitioned into aggregated "facts" with referential "dimensions" surrounding and providing descriptors that define the facts. The Normalized Model (3NF or "third normal form") stores data in related "tables" following relational database design rules established by E. F. Codd and Raymond F. Boyce in the early 1970s that eliminate data redundancy. While the question of which methodology is best fosters vigorous debate amongst EDW/BI architects, both have weaknesses when dealing with the inevitable changes in the systems feeding the data warehouse and with cleansing data to conform to strict methodology requirements.

Further, the OLAP cube (for "online analytical processing") is a data structure that allows fast analysis of data from multiple perspectives. The cube structure is created from either a Star or Snowflake Schema stored as metadata, from which one can view or "pivot" the data in various ways. Generally cubes have one time-based dimension that supports a historical representation of data. Creating OLAP cubes can be very expensive, and often creates a significant amount of data that is of little or no use. The 80/20 rule appears in many cases to hold true (only 20% of the OLAP cube data proves useful), which begs the question: built upon a traditional architecture, does an OLAP cube truly deliver sufficient ROI? Often, the answer is a resounding NO! Durable EDW/BI systems must deliver real value.

 

A Fresh Approach

The Data Vault is a hybrid data modeling methodology providing historical data representation from multiple sources, designed to be resilient to environmental changes. Its creator, Dan Linstedt, who originally conceived it in 1990 and released it in 2000 as a public domain modeling methodology, describes a resulting Data Vault database as:

A detail oriented, historical tracking and uniquely linked set of normalized tables that support one or more functional areas of business.  It is a hybrid approach encompassing the best of breed between 3NF and Star Schemas.  The design is flexible, scalable, consistent and adaptable to the needs of the enterprise.
(http://en.wikipedia.org/wiki/Data_Vault_Modeling)

Focused on the business process, the Data Vault, as a data integration architecture, has robust standards and definitional methods that unite information in order to make sense of it. The Data Vault model is comprised of three basic table types:

HUB (blue): contains a list of unique business keys, each with its own surrogate key. Metadata describing the origin of the business key, or record 'source', is also stored to track where and when the data originated.

LNK (red): establishes relationships between business keys (typically hubs, but links can link to other links), essentially describing a many-to-many relationship. Links are often used to deal with changes in data granularity, reducing the impact of adding a new business key to a linked Hub.

SAT (yellow): holds descriptive attributes that can change over time (similar to a Kimball Type II slowly changing dimension). Where Hubs and Links form the structure of the data model, Satellites contain temporal and descriptive attributes, including metadata linking them to their parent Hub or Link tables. Metadata attributes within a Satellite table containing the date a record became valid and the date it expired provide powerful historical capabilities, enabling queries that can go 'back in time'.
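To make the three table types more tangible, here is a minimal sketch in Python of what a Hub, a Link and a Satellite record might carry. The field names (surrogate keys, record source, load and expiry dates) follow the description above, but the exact naming and key-generation scheme are illustrative assumptions rather than a canonical Data Vault implementation.

from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class HubCustomer:
    """Hub: one row per unique business key, plus origin metadata."""
    hub_customer_key: int          # surrogate key
    customer_number: str           # business key
    record_source: str             # where the key was first seen
    load_date: datetime

@dataclass
class LinkCustomerContract:
    """Link: a many-to-many association between two Hubs."""
    link_key: int
    hub_customer_key: int
    hub_contract_key: int
    record_source: str
    load_date: datetime

@dataclass
class SatCustomerDetails:
    """Satellite: descriptive attributes that change over time for a Hub."""
    hub_customer_key: int              # parent Hub
    load_date: datetime                # date the record became valid
    load_end_date: Optional[datetime]  # date it expired (None = current)
    record_source: str
    name: str
    email: str

# A tiny example: one customer with one current Satellite row.
hub = HubCustomer(1, "CUST-0042", "crm", datetime(2015, 3, 1))
sat = SatCustomerDetails(1, datetime(2015, 3, 1), None, "crm", "Ada Lovelace", "ada@example.com")
print(hub, sat, sep="\n")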

There are several key advantages to the Data Vault approach:

- Simplifies the data ingestion process

- Removes the cleansing requirement of a Star Schema

- Instantly provides auditability for HIPAA and other regulations

- Puts the focus on the real problem instead of programming around it

- Easily allows for the addition of new data sources without disruption to existing schema

Simply put, the Data Vault is both a data modeling technique and methodology which accommodates historical data, auditing, and tracking of data.

The Data Vault is the optimal choice for modeling the EDW in the DW 2.0 framework
Bill Inmon

 

Adaptable

Through the separation of business keys (as they are generally static) and the associations between them from their descriptive attributes, a Data Vault confronts the problem of change in the environment. Using these keys as the structural backbone of a data warehouse, all related data can be organized around them. These Hubs (business keys), Links (associations), and Satellites (descriptive attributes) support a highly adaptable data structure while maintaining a high degree of data integrity. Dan Linstedt often compares the Data Vault to a simplistic view of the brain, where neurons are associated with Hubs and Satellites and dendrites are Links (vectors of information). Some Links are like synapses (vectors in the opposite direction). They can be created or dropped on the fly as business relationships change, automatically morphing the data model as needed without impacting the existing data structures. Problem #1 Solved!

 

Big Data

Data Vault v2.0 arrived on the scene in 2013 and incorporates seamless integration of Big Data technologies along with methodology, architecture, and best practice implementations. Through this adoption, very large amounts of data can easily be incorporated into a Data Vault designed to store data using products like Hadoop, Infobright, MongoDB and many other NoSQL options. Eliminating the cleansing requirements of a Star Schema design, the Data Vault excels when dealing with huge data sets, decreasing ingestion times and enabling parallel insertions that leverage the power of Big Data systems. Problem #2 Solved!

 

Simplification

Crafting an effective and efficient Data Vault model can be done quickly once you understand the basics of the three table types: Hub, Satellite, and Link! Identifying the business keys first and defining the Hubs is always the best place to start. From there, Hub-Satellites represent source table columns that can change, and finally Links tie it all together. Remember, it is also possible to have Link-Satellite tables. Once you've grasped these concepts, it's easy. After you've completed your Data Vault model, the next common step is to build the ETL data integration process to populate it. While a Data Vault data model is not limited to EDW/BI solutions, anytime you need to get data out of some data source and into some target, a data integration process is generally required. Talend's mission is to connect the data-driven enterprise.

With its suite of integration software, Talend simplifies the development process, reduces the learning curve, and decreases total cost of ownership with a unified, open, and predictable ETL platform.  A proven ETL technology, Talend can certainly be used to populate and maintain a robust EDW/BI system built upon a Data Vault data model.  Problem #3 Solved!

 

Your Business

The Data Vault essentially defines the Ontology of an Enterprise in that it describes the business domain and relationships within it.  Processing business rules must occur before populating a Star Schema.  With a Data Vault you can push them downstream, post EDW ingestion.  An additional Data Vault philosophy is that all data is relevant, even if it is wrong.  Dan Linstedt suggests that data being wrong is a business problem, not a technical one.  I agree!  An EDW is really not the right place to fix (cleanse) bad data.  The simple premise of the Data Vault is to ingest 100% of the source data 100% of the time; good, bad, or ugly.  Relevant in today’s world, auditability and traceability of all the data in the data warehouse thus become a standard requirement.  This data model is architected specifically to meet the needs of today’s EDW/BI systems.  Problem #4 Solved!
 

To understand the Data Vault is to understand the business

(http://danlinstedt.com)

 

Flexible

The Data Vault methodology is based on SEI/CMMI Level 5 best practices and includes many of its components, combining them with best practices from Six Sigma, TQM, and SDLC (Agile). Data Vault projects have short, controlled release cycles and can consist of a production release every two or three weeks, automatically adopting the repeatable, consistent, and measurable practices expected at CMMI Level 5. When new data sources need to be added, similar business keys are likely; new Hubs, Satellites, and Links can be added and then further linked to existing Data Vault structures without any change to the existing data model. Problem #5 Solved!

 

Conclusion

In conclusion, the Data Vault modeling and methodology addresses the elements of the problems we identified above:

- It adapts to a changing business environment

- It supports very large data sets

- It simplifies the EDW/BI design complexities

- It increases usability by business users because it is modeled after the business domain

- It allows for new data sources to be added without impacting the existing design

This technological advancement is already proving to be highly effective and efficient.  Easy to design, build, populate, and change, the Data Vault is a clear winner.  Very Cool!  Do you want one?

Visit http://learndatavault.com or http://www.keyldv.com/lms for much more on Data Vault modeling and methodology.

[2015-03-12] Talend Forum Announcement: For test only, Talend Open Studio's 6.0.0 M3 release is available

Dear Community,

We are pleased to announce that Talend Open Studio's 6.0.0 M3 release is available, for testing only. This milestone contains new features and bug fixes, and is recommended for experienced users who need an early preview of the upcoming 6.0 release.

Download Talend Open Studio for [Data Integration | Big Data | Data Quality | MDM | ESB]'s third milestone:

Big Data: http://www.talend.com/download/talend-o … s_download
Data Quality: http://www.talend.com/download/talend-o … s_download
ESB: http://www.talend.com/download/talend-o … s_download
Data Integration: http://www.talend.com/download/talend-o … s_download
BPM: http://www.talend.com/download/talend-o … s_download
MDM: http://www.talend.com/download/talend-o … s_download

Below, please find key new features for Talend 6.0.0 M3:

Talend Open Studio for Big Data

- Hive and Pig running on Tez (TBD-1480, TBD-1504, TBD-1505)

Talend Open Studio for Data Integration

Studio : 
- New palette, Curve rows (beziers) and CSS design (TUP-2591, TUP-2552)
- New solution to link components together

Talend Open Studio for ESB & Talend ESB SE
 
Studio
- Unified Platform-related improvements, incl. the new look and feel for the Data Service (Integration) and Route Builder (Mediation) perspective design window.

Runtime / ESB
- No new features at this time

Note: the Runtime command ‘tesb:start-all’ shows an error in the console with 6.0.0.M3 – please ignore this as it has no impact on the services themselves.

Talend Open Studio for Data Quality

- Improved column analysis editor: https://jira.talendforge.org/browse/TDQ-9872
- Eclipse 4.4 upgrade: https://jira.talendforge.org/browse/TDQ-9830
- EMF compare upgrade: https://jira.talendforge.org/browse/TDQ-9394

Thanks for being a part of our community,
The Talend Team.

[2015-03-12] Talend Blog: Announcing the Talend Passport for MDM Success

The need for a 360° view of customers, products, or any other business object used in your daily work is not new. Wasn't it supposed to be addressed with ERP, CRM and Enterprise Data Warehouses, by the way?

But the fact is that there is still a gap between the data expectations of the Lines of Business and what IT delivers; now, with the advent of Big Data, the gap is widening at an alarming pace. Does that mean we should forget about MDM because it is too challenging to achieve? Well, can you give up being customer centric? Can you afford to open your doors to new data-driven competitors in your industry? And can you afford to ignore industry regulations and privacy mandates?

Of course you can’t.

MDM is a must, not an option, so there isn't a choice but to overcome these challenges. To accomplish this -- together with selected Consulting and System Integration partners with a proven track record in MDM consulting -- we designed the Talend Passport for MDM Success.

In a must-read blog post titled “MDM: Highly recommended, still misunderstood”, Michele Goetz from Forrester Research provides evidence that MDM is a hot topic. At the same time, she warns that MDM is much more than “loading data into the hub, standardizing the view, and then pushing the data”; it is, rather, a data strategy. This explains why many surveys have shown that organizations often struggle to design a sound MDM strategy, not to mention a clear ROI.

This contrasts with the most recent success stories that Talend sees for MDM. Those achieving success appear to do so by closely linking their MDM back end with business outcomes on the front end. This drove them to fully engage their Lines of Business, not only to get the necessary funding, but also to collaboratively implement a sustainable support organization as well as best practices for Data Governance. It also allowed them to deliver MDM incrementally, starting small with a well-defined sweet spot in mind and then expanding fast through a series of initiatives aligned with well-defined business objectives.

The lesson learned is that planning is crucial to Master Data success; however, it also appears to be the most challenging step of the project. As a result, most organizations need guidance to succeed at this phase. At Talend, we believe that our role goes beyond equipping MDM projects with the right toolset: we want to contribute as much as we can to their overall success, and this is what guided us to address the issue. We aspire to help our customers back their MDM initiatives with a solid business case, build a clear project plan, and address the prerequisites before engaging their projects.

This drove us to design the Talend Passport for MDM Success. We designed it as a collaborative effort with our partners: we selected MDM Consulting and System Integration firms across regions, those with a proven track record both in MDM consulting and in delivering Talend MDM projects on expectation, on time and on budget. Once we gathered the community, we worked on the deliverables of the offer.

The Talend Passport for MDM Success is packaged consulting services that can be delivered in a short period (from four to six weeks depending on the project type and scope). It provides guidance to ensure that the MDM project is on the right track and establishes a solid foundation for an MDM roll-out. Concretely, the goal is to:

- Assess an organization's maturity for engaging in an MDM program and set up a plan to meet the prerequisites;

- Define/refine the MDM business case(s) and be ready to promote them to the Lines of Business;

- Draw a project roadmap and get ready to start the execution.

The feedback from the selected partners regarding this initiative has exceeded expectations. The initial objective was to have one partner on board for each of our core regions before the end of the first quarter of 2015. As of today, nine companies have joined the alliance, from global providers to specialist boutiques. And those names resonate: Bearing Point, Cap Gemini, CGI, CSC, McKnight Consulting Group, IPL, Micropole, Sopra Steria, Virtusa. And they are all ready to deliver.

Take McKnight Consulting Group for example. They are fully focused on strategizing, designing and deploying in the disciplines of Master Data Management, Big Data, Data Warehousing and Business Intelligence. Their CEO, William McKnight is a well-known thought leader in the area of information management as a strategist, information architect and program manager for complex, high-volume, full-lifecycle implementations worldwide. Here is his feedback on the program: “McKnight Consulting Group is delighted to be a partner in the Passport to Success program.  MDM is an imperative today.  Master data must be formed and distributed as soon as the data is available.  Often the data needs workflow and quality processes to be effective.  We have been helping clients realize these benefits for many years and are extremely focused in building solid, workable plans, built contextually to the organization’s current and future desired state.  All of our plans have formed the master data strategies for many years.  We look forward to continuing to get information under control and to raising the maturity of the important asset of information."

Because we know the power of a community, thanks to our open source DNA, we didn’t want to reinvent the wheel by creating a new offer. Rather, we are taking a very pragmatic approach in order to leverage the best practices and approaches that each of our partners have already successfully delivered. So we designed collaboratively with each of those partners a Passport for MDM Success that can be delivered today. It was simply a matter of aligning our objectives and assets. From a more personal perspective, this was a great exercise to connect with the best MDM experts from around the world and come together to create an offer that fully meets our initial objectives.

Now that we have launched the offer, the results are very promising. Not only are some of our prospects opting for these services, but they are already using our approach as a way to accelerate and secure their planning efforts and make sure that they have their Lines of Business on board. We also see interest from customers that are already engaged in delivering MDM and want to augment the impact of their MDM implementation and expand their projects to other domains and use cases.

Now that we have delivered this initiative, the story is not over. First, we welcome other Consulting and System Integrators that have experience in providing MDM guidance as well as delivering Talend MDM projects to join the community. Also, together with our partners, we will begin to add a deeper industry flavor to this program, so that we can bundle specific industry best practices into our standard services.

[2015-03-05] Talend Blog: Avoiding the Potholes on the Way to Cost Effective MDM

Master data management is one of those practices that everyone in business applauds. But anyone who has been exposed to the process realizes that MDM often comes with a price. Too often what began as a seemingly well thought out and adequately funded project begins accumulating unexpected costs and missing important milestones.

First we need to know what we’re talking about. One of the best definitions of an MDM project I’ve heard is from Jim Walker, a former Talend director of Global Marketing and the man responsible for the Talend MDM Enterprise Edition launch. Jim describes MDM as, “The practice of cleansing, rationalizing and integrating data across systems into a ‘system of record’ for core business activities.”

I have personally observed many MDM projects going off the rails while working with other organizations. Some of the challenges are vendor-driven. For example, customers often face huge initial costs to begin requirements definition and project development. And they can spend millions of upfront dollars on MDM licenses and services – but even before the system is live, upgrades and license renewals add more millions to the program cost without any value being returned to the customer. Other upfront costs may be incurred when vendors add various tools to the mix. For example, the addition of data quality, data integration and SOA tools can triple or quadruple the price.

Because typically it is so expensive to get an MDM project underway, customer project teams are under extreme pressure to realize as much value as they can as quickly as possible.  But they soon realize that the relevant data is either stored in hard to access silos or is of poor quality – inaccurate, out of date, and riddled with duplication. This means revised schedules and, once again, higher costs. 

Starting with Consolidation

To get around some of these problems, some experts advise starting small using the MDM Consolidation method. Because this approach consists of pulling data into the MDM Hub (the system’s repository) and performing cleansing and rationalizing, the benefit is that Consolidation has little impact on other systems. 

While Consolidation is also a good way to begin learning critical information about your data, including data quality issues and duplication levels, the downside is that these learnings can trigger several months of refactoring and rebuilding of the MDM Hub. This is a highly expensive proposition, involving a team of systems integrators and multiple software vendors.

In order to realize a rapid return on MDM investment, project teams often skip the Consolidation phase and go directly to a Co-existence type of MDM. This approach includes Consolidation and adds synchronization to external systems to the mix. Typically data creation and maintenance will co-exist in both the MDM system and the various data sources.  Unfortunately this solution introduces difficult governance issues regarding data ownership, as well as data integration challenges such as implementing a service-oriented architecture (SOA) or data services.

There are other types of MDM, each with its own set of problems. The upshot is that the company implementing an MDM system winds up buying additional software and undertaking supplementary development and testing, incurring more expense.

An Alternative Approach

Rather than become entangled in the cost and time crunches described above, you should be looking for vendors that provide a solution that lets you get underway slowly and with a minimum amount of upfront costs. 

In fact, part of the solution can include Open Source tools that allow you to build data models, extract data, and conduct match analysis, while building business requirements and the preliminary MDM design. All at a fraction of the resource costs associated with more traditional approaches.

Then, with the preliminary work in place, this alternative solution provides you with the tools needed to scale your users.  It is efficient enough to allow you to do the heavy development work necessary to create a production version of your MDM system without breaking the bank. 

Once in an operational state, you can scale up or down depending on your changing MDM requirements.  And, when the major development phase is over, you can ramp down to a core administrative group, significantly reducing the cost of the application over time.

You should look for vendors offering pricing for this model based on the number of developers – a far more economical and predictable approach when compared to other systems that use a pricing algorithm based on the size of data or the number of nodes involved. 

This approach to MDM deployment is particularly effective when combined with other open source tools that form the foundation of a comprehensive big data management solution. These include big data integration, quality, manipulation, and governance and administration.

By following this path to affordable, effective MDM that works within a larger big data management framework, you will have implemented a flexible architecture that grows along with your organization’s needs.

[2015-02-25] Talend Blog: Retail: Personalised Services to Generate Customer Confidence

In recent years, the Internet and e-commerce have revolutionised the retail industry, in the same way, for example, as the appearance of supermarkets. Beyond ease of purchase and the ability to consult the opinion of other consumers, e-commerce has overwhelmingly changed the way in which information about a customer's journey to purchase is captured. Today, it is also captured on a far more individual basis. For example, e-commerce enables you to know, with relative ease, what a particular customer is looking for, how they reached the site, what they buy, the associated products they have purchased previously, and even what purchases they abandoned. Reconstructing the customer’s journey was extremely difficult to achieve when the sole purchasing channel was the physical store and the only traceable element was the purchase itself. At best, the customer was only identified at the checkout, which, for example, ruled out the prospect of providing them personalised recommendations.  

Thanks to a better understanding of the customer's journey to purchase, e-commerce has opened up the possibility of not only gaining a better understanding of customer behaviour, but also the ability to react in real time based on this knowledge. Due to the success of such programmes, distributors have considered applying these concepts across all sales channels – stores, call centres, etc. This is the dual challenge facing most retailers today: to fully understand the customer's journey across different sales channels (multi- or omni-channel), while benefitting from greater accuracy, including in the case of physical retail outlets.

This is not as easy as it seems. Depending on the channel chosen by the customer, the knowledge obtained by the seller is not the same: as we know, whilst at the checkout, the customer will only be recognised if they own a loyalty card or have previously visited the store. But, in the latter case, it will be extremely complex to make the link to past purchases. Similarly, a website may enable the collection of data on the intention to buy, but it is extremely difficult to correlate these events with the purchasing transactions if they are not made online and in the same session. The stakes are high, given that 78% of consumers now do their research online prior to making a purchase[1] (the famous Web-to-store or ROPO).

One solution is to integrate sensors into the various elements that make up a customer’s purchasing journey, then analyse and cross-reference this big data to extract concrete information from it. For example, we have noticed that Internet users often visit commercial websites during the week in order to prepare for making a purchase on a Saturday. If, for example, the store has a self-service Wi-Fi facility linked to a mobile app that personalises the customer’s journey, it can follow this journey right up to the actual purchase, or even influence it by proposing a good deal at just the right moment.

Some of our customers are already largely engaged in this process, which is done in a gradual manner. It usually begins with a very detailed analysis of the customer’s online journey, collecting information on intention and cross-referencing it at an aggregated level with the actual purchases, at the catchment-area level, for example, to determine correlations and refine the segmentations. Then, this information is cross-referenced a second time with the transactional data from the physical stores and the website, which enables us to map the customer’s journey from the intention to buy to the purchase or even beyond, and across different channels. Thirdly, it’s a matter of developing a real-time recommendation system throughout the customer’s journey that yields a dual benefit: increased sales and greater loyalty.

The main challenge facing distributors in the future actually lies in the value-added services that they may or may not be able to provide to their customers, to accompany their products or services. Consumers have learned to be wary of digital technology. For example, they create specific email addresses to get the offer they need without revealing their true identity in order to prevent further contact. More than ever, they will only be inclined to share information on their intentions and their profiles if their trust has been gained and they perceive some benefit.

How do you create this trust? Via value-added services: when consumers see that their interests are being considered, they do not feel constrained or trapped by a commercial logic that is beyond them. Let us imagine that, on the basis of till receipts or a basket that is in the process of being filled, a retailer can guide the choice of products based on personal criteria, excluding, for example, those that contain peanut oil, which I must avoid as my son has just been declared highly allergic to it. I am aware that my journey is being tracked by the retailer, but I understand its uses and I derive some benefit from it. Amazon, with its “1-Click” ordering, has shown the way. In other sectors, such as the taxi industry, newcomers have gone even further, revolutionising the customer’s journey by utilising digital technology, from searching for a service to payment through a range of innovative services that make the customer’s life easier, such as the automated capture of expense forms.

In a world in which advertising and tracking are increasingly present, data analysis that is carried out with the sole aim of commercial transformation is ultimately doomed to failure, as it is based on an imbalance between the benefits offered to the customer and those gained by the supplier[2]. Until now, personalisation in retail has had a tendency to limit itself to marketing and measurement based on conversion rates, except for distributors, which have increasingly relied on customer loyalty. Multichannel is not the invention of the distributors but a reaction to consumers’ wishes. Think about it, even Amazon, Internet pure player par excellence, is going to start opening physical stores. Why? Because it has fully understood that a key element was missing in its bid to become better acquainted with its customers’ journey, while responding more effectively to their wishes.


 

[2015-02-17] Talend Blog: What is a Container? Cloud and SOA Converge in API Management (Container Architecture Series Part 2)

This is the second in a series of posts on container-centric integration architecture. The first post provided a brief background and definition of the Container architecture pattern. This post explores how Service Oriented Architecture (SOA) and Cloud initiatives converge in API Management, and how Platforms provide the Containerization infrastructure necessary for API Management.

Today we are seeing an increasing emphasis on Services and Composite Applications as the unit of product management in the Cloud. Indeed, API Management can be seen as the logical convergence of Cloud and SOA paradigms. Where SOA traditionally emphasized agility within the enterprise, API Management focuses on agility across the ecosystem of information, suppliers and consumers. With SOA, the emphasis was on re-use of the business domain contract. API Management addresses extensibility at all layers of the solution stack. Platforms take responsibility for non-business interfaces and deliver them via Containerization to business layer Service developers who can focus on the domain logic.

Most enterprise applications have historically had a single enterprise owner responsible for design, configuration and business operation of the Application. Each application team has its own release cycle and controls more or less all of its own dependencies subject to Enterprise Architecture policies and central IT operations provisioning. In this model the unit of design and delivery is at the Application level. Larger enterprises may have independent IT operations to operate and manage the production environment, but the operations contract has always been between individual application development teams as the supplier of business logic and the central IT organization, rather than a peer-to-peer ecosystem of business API’s shared between organizations. 

As long as the number of interdependencies between projects remains low and the universe of contributors and stakeholders stays narrow, the Application delivery model works fine.

The relevant principle here is that organizational structure impacts solution architecture. This is natural because the vectors of interest of stakeholders define the decision making context within the Software Development Life Cycle (SDLC). Applications built in isolation will reflect the assumptions and design limitations implicit in the environment in which they were created. A narrow organizational scope will therefore influence requirements decisions. Non-functional requirements for extensibility are understandably hard to justify in a small and shallow marketplace. 

SOA and Cloud change all of this.

As the pace of development accelerates, enterprises see an increasing emphasis placed on agility. SOA is about creating composite solutions through assembly of reusable Services.  Each line of business provides its own portfolio of services to other teams. Likewise, each organization consumes many services. Each service has its own product lifecycle. Because Services are smaller than Applications, they have a much faster product lifecycle. By encapsulating re-usable business functionality as a more modular Service rather than a monolithic Application, the enterprise increases agility by decreasing cycle time for the minimal unit of business capability.

In response, development and operations teams have crafted DevOps strategies to rapidly deploy new capability. It is not just that DevOps and Continuous Delivery have revolutionized the SDLC; the unit of release has changed as well. In these agile environments, the unit of business delivery is increasingly the Service rather than the Application. In large organizations the Application is still the unit of product management and funding lifecycles, but the Service module is a much more agile model for Cloud and SOA ecosystems. As a result, evolution takes place more rapidly.

This has implications beyond the enterprise boundary. More modular composite applications can be quickly constructed from modules that are contributed by a broader ecosystem of stakeholders, not just those within a single enterprise. From this perspective, SOA design patterns enable more rapid innovation by establishing social conventions for design that allow a faster innovation cycle from a broader population. As a result, evolution takes place more broadly.

SOA initiatives often emphasize standards-based interoperability. Interoperability is necessary but not sufficient for successful adoption of SOA. Standardization, modularity, flexibility, and extensibility are necessary in practice. These can be provided via Containerized Platforms. 

SOA drives containerization because the resulting composite business solutions cross organizational and domain boundaries. Coarse-grained Service APIs are shared between organizations via Service Contracts. But these come at a cost; the same granularity that provides the increased agility also increases the interactions between organizations and the complexity of the enterprise as a whole. The added complexity is not a negative, since it is essential complexity for the modern enterprise, but it must be managed via SOA Governance and API Management.

While Service Orientation can be incorporated into technology-oriented services such as Security, the primary focus of SOA contracts is on the business domain. Increased integration both within and across the enterprise boundary has implicit dependencies on these cross-cutting technology services. Data must be secured both in transit and at rest; audit and non-repudiation must be provided in a non-invasive manner; capacity must be provisioned elastically to respond to changing demand. All of these cross-cutting concerns require collaboration across the enterprise boundary; they impact service consumers as well as service providers. In order to scale, these non-functional requirements can be delegated to Platforms that realize them through Containers.

What distinguishes a Platform from an Application is the open architecture exposed by deeper APIs that are consumed by the Services running on the platform. In order for the Services to be composed into solutions, the Platform API must be loosely coupled to the Services. This is the function of the Container.  In addition to loose coupling, the Container provides Separation of Concerns between the Service development team and the Operations team. Lightweight containers make the delivery of Services much more modular, enabling, among other things, the operations team to deliver the fine-grained control necessary for elasticity. In an open ecosystem, a Service Container can also encapsulate the collaborative contract for the community of service providers and consumers, increasing flexibility and re-use. API contracts also support infrastructure features such as authentication and authorization, which become pluggable parts of the Platform. An extensible and open infrastructure allows further innovation while insulating business domain developers from the IT concerns.
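
To make this concrete, here is a minimal sketch, offered purely as an illustration, of how a service container decouples a business contract from its implementation through a service registry, using the standard OSGI API; the QuoteService interface, its implementation and the property values are hypothetical and not part of the original post.

    import java.util.Hashtable;

    import org.osgi.framework.BundleActivator;
    import org.osgi.framework.BundleContext;
    import org.osgi.framework.ServiceRegistration;

    // Hypothetical business contract; consumers only ever see this interface.
    interface QuoteService {
        double quote(String productId);
    }

    // Hypothetical implementation supplied by the business-domain team.
    class SimpleQuoteService implements QuoteService {
        public double quote(String productId) {
            return 42.0; // placeholder pricing logic
        }
    }

    // The activator hands the implementation to the container's service registry.
    // Because consumers resolve QuoteService by its interface alone, the platform
    // can interpose security, monitoring or versioning without touching this code.
    public class QuoteActivator implements BundleActivator {

        private ServiceRegistration<QuoteService> registration;

        @Override
        public void start(BundleContext context) {
            Hashtable<String, Object> props = new Hashtable<>();
            props.put("service.vendor", "example");
            registration = context.registerService(QuoteService.class, new SimpleQuoteService(), props);
        }

        @Override
        public void stop(BundleContext context) {
            if (registration != null) {
                registration.unregister();
            }
        }
    }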

Platform architectures can be applied within the enterprise or across enterprises. As Cloud delivery models have become popular, Platforms have emerged as a means of accelerating adoption of Cloud offerings.  Google, Facebook, Microsoft, Amazon, and SalesForce are all examples of Platforms.  But there are other, older examples of Platform architectures. In many ways B2B portals are the archetypal Platform, offering an ecosystem of extensible, layered APIs upon which higher-level services can be delivered across enterprise boundaries.

Of course Cloud and SOA are intimately related. After all, Cloud comes in three flavors: Infrastructure, Platform, or Software as a Service. So Service orientation is implicit in the Cloud paradigm.  What Cloud adds to Service Orientation is a maturity benchmark for delivery and operations of services. Whether they are top-level business services or the supporting infrastructure, Cloud requires self-service on demand, measurability, visibility, and elasticity of resources. Whereas an enterprise can deliver SOA design patterns, the agility achieved in practice will be constrained if operations cannot step up to the Cloud delivery model. Containerized Platforms provide both the efficiency necessary for Cloud maturity and the extensible modularity required for successful SOA adoption in the larger ecosystem.

The next post will explore how Talend applies OSGI technology, via Apache Karaf, to provide a Service Container for enterprise integration as part of the Talend Unified Platform.

[2015-02-16] Talend Forum Announcement: For test only, Talend Open Studio's 6.0.0 M2 release is available

Dear Community,

We are pleased to announce that Talend Open Studio's 6.0.0 M2 release is available, for testing only. This milestone contains new features and bug fixes, and is recommended for experienced users who need an early preview of the upcoming 6.0 release.

Download Talend Open Studio for [Data Integration | Big Data | Data Quality | MDM | ESB]'s second milestone:

Big Data: http://www.talend.com/download/talend-o … s_download
Data Quality: http://www.talend.com/download/talend-o … s_download
ESB: http://www.talend.com/download/talend-o … s_download
Data Integration: http://www.talend.com/download/talend-o … s_download
BPM: http://www.talend.com/download/talend-o … s_download
MDM: http://www.talend.com/download/talend-o … s_download

Below, please find key new features for Talend 6.0.0 M2:


Talend Open Studio for Data Integration

Studio:

- Support of Java 8 (feature available in the Studio only): https://jira.talendforge.org/browse/TUP-2511

- Allow custom components to extend FILTER of other Talend components: https://jira.talendforge.org/browse/TDI-31510

Component:

- tOracleSP component issue in Talend ESB runtime


Talend Open Studio for ESB & Talend ESB SE
 
Studio

- Only Unified Platform-related improvements at this time

Runtime / ESB

- No new features at this time


Talend Open Studio for Data Quality

- Data preview in Column analysis editor


Thanks for being a part of our community,
The Talend Team.

[2015-02-13] Talend Blog: Use Big Data to Secure the Love of Your Customers

[2015-02-10] Talend Blog: Defining Your “One-Click”

People talk about the impact of the “digital transformation” and how companies are moving to becoming “data-driven,” but what does it mean in practice? It may help to provide a couple of examples of data-driven companies. Netflix is often cited as a great example of a data-driven company. The entertainment subscription service is well known for using data to tune content recommendations to the tastes and preferences of their individual subscribers and even for developing new programming (NY Times: Giving Viewers What They Want). At Netflix, big data analysis is not something that only certain teams have access to, but rather a core asset that is used across the organization. Insights gained from data allow Netflix to make quick and highly informed decisions that improve all aspects of the business and most importantly, positively impact the customer experience.

Although perhaps not as well known outside the IT industry, GE is another fantastic example of a data-driven success story. GE has a vision of the “industrial internet,” a third wave of major innovation following the industrial and internet revolutions, which is fueled by the integration of complex physical machinery with networked sensors and software.

One example is in GE’s $30B Renewable Energy division. This team has begun to execute on a vision of using smart sensors and the cloud to connect over 22,000 wind turbines globally in real-time. The ultimate goal is to predict downtime before it happens and to be able to tune the pitch and orientation of turbines to generate billions in additional energy production. Talend is helping them achieve this vision. Our work with GE has helped cut the time it takes to gather and analyze sensor data from 30 days to one.  And, we believe that we can cut this down to minutes in the very near future.

Amazon is another stunning example of what it means to be a data-driven company. While data plays a significant role in all aspects of Amazon’s business, I view the company’s one-click ordering system – a button that, once clicked, automatically processes your payment and ships the selected item to your door – as a particularly compelling and pure illustration of being data-driven. This single button proves just how adept Amazon is at turning massive volumes of shopper, supplier and product data into a customer convenience and competitive advantage. Of course, Amazon didn’t become this data-driven overnight. Similar to its evolution from an online bookstore to the leading online retailer, becoming data-driven was a process that took time.

As an integration company deeply rooted in big data and Hadoop, our mission is to help companies through the process of becoming data-driven and, ultimately, define their own “one-click”. Regardless of the industry, companies’ one-click is often associated with customer-facing initiatives – which could be anything from protecting banking clients from fraud to enabling preemptive maintenance of turbines on a wind farm as discussed in the GE example.

Some organizations mistakenly believe being data-driven is all about being better at analytics. While analytics is certainly an important facet, companies must first become highly proficient at successfully stitching together disparate data and application silos.  Next, companies must manage and streamline the flow of data throughout their entire organization and ensure that the data they are analyzing is accurate and accessible in an instant.

This is certainly something that our recently released Talend 5.6 aims to help companies achieve. For those of you not familiar with our Integration Platform, it combines data integration, application integration, data quality and master data management (MDM) capabilities into a single, unified workflow and interface. We believe this approach, coupled with the now over 800 components and connectors for easing the integration of new applications and data sources with big data platforms, helps simplify data management and significantly reduce the otherwise steep learning curve associated with big data and Hadoop.

While 5.6 is a great solution for companies initiating their data journey, it’s also ideal for helping companies become data-driven and define their “one-click,” especially given some of the latest features we’ve introduced. As noted in our announcement, version 5.6 adds new efficiency controls for MDM. In our view, MDM is a key component for empowering our clients to begin to uniquely identify and track their customers across various touch points, as well as govern the association rules between various data sets. Notably, Talend 5.6 also initiates support for the latest Hadoop extensions, Apache Spark and Apache Storm. While perhaps not achievable for all companies immediately, the ability to operate in real time should be on every organization’s roadmap, and is, in part, what these technologies will help facilitate.

As some of you may have heard, later this year we will launch Talend Integration Cloud, an Integration Platform-as-a-Service (iPaaS). The solution will enable the connection of all data sources – cloud-to-cloud, cloud-to-ground – and let IT teams design and deploy projects that can run wherever they are needed. Also, for the first time, with Talend Integration Cloud, we will be enabling line of business users to access data integration tools and build jobs without having to rely on IT. Expect to hear far more about Talend Integration Cloud over the coming months, but we are very excited to provide our customers with this new tool in their arsenal and allow them to extend data access and intelligence throughout their enterprise.

I’m looking forward to the year ahead and being part of a fantastic team that will be helping more companies become data-driven and define their “one-click”. What about you? Is this the year your company, like Amazon, will be able to use data to make smarter business decisions at any given moment across your entire organization? If so, I hope to hear from you soon.

[2015-02-03] Talend Forum Announcement: Talend Open Studio's 5.5.2 release is available

Dear Community,

We are very pleased to announce that Talend Open Studio's 5.5.2 release is available. This general availability release for all users contains new features and bug fixes.

Download Talend Open Studio for [Data Integration | Big Data | Data Quality | MDM | ESB]'s general availability release:

Data Integration: http://www.talend.com/download/data-integration
Big Data: http://www.talend.com/download/big-data
Data Quality: http://www.talend.com/download/data-quality
MDM: http://www.talend.com/download/mdm
ESB: http://www.talend.com/download/esb

You can also view Release Notes for this 5.5.2 version, detailing new features, through this link: http://www.talend.com/download/
Find the latest release notes, with these steps: [Data Integration | Big Data | Data Quality | MDM | ESB] product tab > at bottom of page click on "User Manuals" > then, click on the "User Manuals" tab > the first download on the page is the most recent release note.

For more information on fixed bugs and new features, go to the TalendForge Bugtracker.

Thanks for being a part of our community,
The Talend Team.

[2015-02-02] Talend Blog: Big, Bad and Ugly - Challenges of Maintaining Quality in the Big Data Era – Part 1

More than a decade ago, we entered an era of data deluge. Data continues to explode - it has been estimated that for each day of 2012, more than 2.5 exabytes (or 2.5 million terabytes) of data were created. Today, the same amount of data is produced every few minutes!

One reason for this big data deluge is the steady decrease in the cost per gigabyte, which has made it possible to store more and more data for the same price. In 2004, the price of 1 GB of hard disk storage passed below the symbolic threshold of $1. It's now down to three cents (view declining costs chart). Another reason is the expansion of the Web, which has allowed everyone to create content and companies like Google, Yahoo, Facebook and others to collect increasing amounts of data.

Big data systems require fundamentally different approaches to data governance than traditional databases. In this post, I'd like to explore some of the paradigm shifts caused by the data deluge and its impact on data quality.

The Birth of a Distributed Operating System

With the advent of the Hadoop Distributed File System (HDFS) and the resource manager called YARN, a distributed data platform was born. With HDFS, very large amounts of data can now be placed in a single virtual place, similar to how you would store a regular file on your computer. And, with YARN, the processing of this data can be done by several engines such as SQL interactive engines, batch engines or real-time streaming engines.
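
As a rough illustration of what this “single virtual place” means in practice, the sketch below writes a file to HDFS and reads it back through the Hadoop FileSystem API, much as you would a local file; the namenode address and paths are placeholders rather than details from the original post.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Minimal sketch: storing and reading back a file on HDFS.
    public class HdfsQuickLook {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020"); // placeholder namenode

            FileSystem fs = FileSystem.get(conf);
            Path path = new Path("/data/lake/raw/events.txt");

            // Write: HDFS splits the file into blocks and replicates them across nodes,
            // but the API looks like an ordinary file write.
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.write("first event\n".getBytes(StandardCharsets.UTF_8));
            }

            // Read the same logical file back, regardless of where its blocks live.
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
                System.out.println(reader.readLine());
            }
            fs.close();
        }
    }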

Having the ability to store and process data in one location is an ideal framework for managing big data. Consulting firm Booz Allen Hamilton explored how this might work for organizations with its concept of a “data lake”, a place where all raw or unmodified data could be stored and easily accessed.

While a tremendous step forward in helping companies leverage big data, data lakes have the potential of introducing several quality issues, as outlined in an article by Barry Devlin. In summary, as the old adage goes, "garbage in, garbage out".

Being able to store petabytes of data does not guarantee that all the information will be useful and can be used. Indeed, as a recent New York Times article noted: “Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.”

Another concept the industry is discussing, similar to the data lake, is the idea of a data reservoir. The premise is to perform quality checks and data cleansing prior to inserting the data into the distributed system. Therefore, rather than being raw, the data is ready-to-use.

The accessibility of data is a data quality dimension that benefits from these concepts of a data lake or data reservoir. Hadoop makes all data, even legacy data, accessible: everything can be stored in the data lake, and tapes or other dedicated storage systems – for which accessibility was a known issue – are no longer required.

But distributed systems also have an intrinsic drawback, described by the CAP theorem. The theorem states that a partition-tolerant system can't provide data consistency and data availability simultaneously. Therefore, with the Hadoop Distributed File System - a partitioned system that guarantees consistency - the availability dimension of data quality can’t be guaranteed. This means that the data can't be accessed until all data copies on different nodes are synchronized (consistent). Clearly, this is a major stumbling block for organizations that need to scale and want to immediately use insights derived from their data. As Marissa Mayer of Google put it: “speed matters”.  A few hundred milliseconds of delay in the reply to a query and the organization will lose customers.  Finding the right compromise between data latency and consistency is therefore a major challenge in big data, although in practice the challenge tends to arise only in the most extreme situations, and innovative technologies keep appearing to tackle it.

Co-location of Data and Processing

Before Hadoop, when organizations wanted to analyze data stored in a database, they had to get it out of the database and put it into another tool or another database to conduct analysis or other tasks. Reporting and analysis are usually done on a data mart, which contains aggregated data from operational databases; as the system scales, they can't be conducted on the operational databases that contain the raw data.

With Hadoop, the data remains in Hadoop. The processing algorithm to be applied to the data can be sent to the Hadoop MapReduce framework, and the raw data can still be accessed by the algorithm. This is a major change in the way the industry manages data: the data is no longer moved out of the system in order to be processed by some algorithm or software. Instead, the algorithm is sent into the system, near the data to be processed. The prerequisite to reap this benefit is that applications can run natively in Hadoop.

For data quality, this is a significant improvement as you no longer need to extract data in order to profile it. You can then work with the whole data set rather than with samples or selections. In-place profiling combined with big data systems opens new doors for data quality. It's even possible to envisage data cleansing processes that take place within the big data framework rather than outside it.
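
To illustrate what in-place profiling could look like, here is a minimal sketch of a MapReduce job that counts the empty fields per column of a semicolon-delimited file directly where the data resides; the class names and HDFS paths are illustrative assumptions, not something described in the original post.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Minimal in-place profiling job: for each column of a semicolon-delimited file,
    // count how many values are empty. The algorithm is shipped to the data nodes;
    // nothing is extracted from the cluster.
    public class EmptyFieldProfiler {

        public static class ProfileMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                String[] fields = value.toString().split(";", -1);
                for (int i = 0; i < fields.length; i++) {
                    if (fields[i].trim().isEmpty()) {
                        context.write(new Text("column_" + i), ONE);
                    }
                }
            }
        }

        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int total = 0;
                for (IntWritable v : values) {
                    total += v.get();
                }
                context.write(key, new IntWritable(total));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "empty-field-profile");
            job.setJarByClass(EmptyFieldProfiler.class);
            job.setMapperClass(ProfileMapper.class);
            job.setCombinerClass(SumReducer.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path("/data/lake/raw/customers"));       // placeholder
            FileOutputFormat.setOutputPath(job, new Path("/data/lake/profile/customers")); // placeholder
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }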

Schema-on-read

With traditional databases, the schema of the tables is predefined and fixed. This means that data that does not fit into the schema constraints will be rejected and will not enter the system. For example, a long text string may be rejected if the column size is smaller than the input text size. Ensuring constraints with this kind of "schema-on-write" approach surely helps to improve the data quality, as the system is safeguarded against data that doesn’t conform to the constraints. Of course, very often, constraints are relaxed for one reason or another and bad data can still enter the system. Most often, integrity constraints such as the no null value constraint are relaxed so that some records can still enter the system even though some of their fields are empty.

However, at least some constraints dictated by a data schema may mandate a level of preparation before data goes into the database. For instance, a program may automatically truncate text that is too long, or add a default value when the data cannot be null, so that the record can still enter the system.
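
A minimal sketch of this kind of schema-on-write preparation, assuming a hypothetical customer table with a VARCHAR(200) comment column and a NOT NULL country column; the JDBC URL and column names are placeholders, not details from the original post.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    // The program truncates an over-long comment and substitutes a default country
    // before the row is inserted, so the schema constraints are satisfied and the
    // record is not rejected.
    public class PreparedInsert {
        public static void insertCustomer(String name, String comment, String country) throws Exception {
            String safeComment = (comment != null && comment.length() > 200)
                    ? comment.substring(0, 200)   // fit the VARCHAR(200) column
                    : comment;
            String safeCountry = (country == null || country.isEmpty())
                    ? "UNKNOWN"                   // satisfy the NOT NULL constraint
                    : country;

            try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:crm"); // placeholder URL
                 PreparedStatement stmt = conn.prepareStatement(
                         "INSERT INTO customer (name, comment, country) VALUES (?, ?, ?)")) {
                stmt.setString(1, name);
                stmt.setString(2, safeComment);
                stmt.setString(3, safeCountry);
                stmt.executeUpdate();
            }
        }
    }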

Big data systems such as HDFS have a different strategy. They use a "schema-on-read" approach. This means that there is no constraint on the data going into the system. The schema of the data is defined as the data is being read. It's like a “view” in a database. We may define several views on the very raw data, which makes the schema-on-read approach very flexible.

However, in terms of data quality, it's probably not a viable solution to let any kind of data enter the system. Letting a variety of data formats enter the system requires some processing algorithm that defines an appropriate schema-on-read to serve the data. For instance, such an algorithm would unify two different date formats like 01-01-2015 and 01/01/15 in order to display a single date format in the view. And it could become much more complex with more realistic data. Moreover, when input data evolves and is absorbed into the system, the change must be managed by the algorithm that produces the view. As time passes, the algorithm will become more and more complex. The more complex the input data becomes, the more complex the algorithm that parses, extracts and fixes it becomes - to the point where it becomes impossible to maintain.
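
As a small illustration of such a schema-on-read transformation, the sketch below unifies the two date formats mentioned above into a single ISO date at read time; the class and method names are hypothetical, and every additional source format would add another branch to maintain.

    import java.time.LocalDate;
    import java.time.format.DateTimeFormatter;
    import java.time.format.DateTimeParseException;
    import java.util.Arrays;
    import java.util.List;

    // Part of a "view" over raw data: the raw files may mix 01-01-2015 and 01/01/15,
    // and the view exposes a single canonical date format as the data is read.
    public class DateOnRead {

        private static final List<DateTimeFormatter> KNOWN_FORMATS = Arrays.asList(
                DateTimeFormatter.ofPattern("dd-MM-yyyy"),
                DateTimeFormatter.ofPattern("dd/MM/yy"));

        // Returns the canonical ISO-8601 form, or null if no known format matches.
        public static String readAsIsoDate(String raw) {
            for (DateTimeFormatter format : KNOWN_FORMATS) {
                try {
                    return LocalDate.parse(raw.trim(), format).toString();
                } catch (DateTimeParseException ignored) {
                    // try the next known format
                }
            }
            return null; // unknown format: something for the data governance team to flag
        }

        public static void main(String[] args) {
            System.out.println(readAsIsoDate("01-01-2015")); // 2015-01-01
            System.out.println(readAsIsoDate("01/01/15"));   // 2015-01-01
        }
    }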

Pushing this reasoning to its limits, some of the transformations executed by the algorithm can be seen as data quality transformations (unifying the date format, capitalizing names, …). Data quality then becomes a cornerstone of any big data management process, while the data governance team may have to manage “data quality services” and not only focus on data.

On the other hand, the data that is read through the "views" would still need to obey most of the standard data quality dimensions. A data governance team would also define data quality rules on this data retrieved from the views. It raises the question of the data lake versus the data reservoir. Indeed, the schema on read brings huge flexibility to data management, but controlling the quality and accuracy of data can then become extremely complex and difficult. There is a clear need to find the right compromise.

We see here that data quality is pervasive at all stages in Hadoop systems and not only involves the raw data, but also the transformations done in Hadoop on this data. This shows the importance of well-defined data governance programs when working with big data frameworks.

In my next post, I'll explore the impacts of the architecture of big data systems on the data quality dimensions. We'll see how traditional data quality dimensions apply and how new data quality dimensions are likely to emerge, or gain importance. 

[2015-01-19] Talend Blog: Open Source ETL Tools – Open for Business!

With all the hype and interest in Big Data lately, open source ETL tools seem to have taken a back seat. MapReduce, Yarn, Spark, and Storm are gaining significant attention, but it also should be noted that Talend’s ETL business and our thousands of ETL customers are thriving. In fact, the data integration market has a healthy growth rate with Gartner recently reporting that this market is forecasted to grow 10.3% in 2014 to $3.6 billion!

Open source ETL tools appear to be going through their own technology adoption lifecycle and are running mission critical operations for global 2000 corporations, which would suggest they are at least in the “early majority” adoption stage. Also, based on their strong community, open standards and more affordable pricing model, open source ETL tools are a viable solution for small to midsize companies.

I would think the SMB data integration market, which has been underserved for many years, is growing the fastest. Teams of two or three developers can get up to speed very quickly and get a fast ROI over hand-coding ETL processes. Many Talend customers are reporting huge savings on their data integration projects over hand-coding, e.g. Allianz Global Investors states that Talend is “proving to be 3 times faster than developing the same ETLs by hand and the ability to reuse Jobs, instead of rewriting them each time, is extremely valuable.”

A key component with open source is its vibrant community and the benefits it provides including sharing information, experiences, best practices and code. Companies can innovate faster through this model. For example, RTBF, one of over 100,000 Talend community users, states, “A major consideration was that Talend is open source and its community of active users ensures that the tools are rapidly updated and that user concerns are taken into account. Such forums make information easily accessible. As the community grows, more and more topics are covered which, of course, saves users a lot of time.”

And the good news is that the open source ETL tools category has blossomed with maturity to meet changing demands. What started as basic ETL and ELT capabilities has transformed into an open source integration platform. As firms break down their internal silos, data integration developers are being asked to integrate big data, to improve data quality and master data, to move from batch to real-time processing, and to create reusable services.

With increasing data integration requests, companies are looking for more and more pre-built components and connectors – from databases (traditional and NoSQL) and data warehouses, to applications like SAP and Oracle, to big data platforms like Cloudera and Hortonworks, to Cloud/SaaS applications like Salesforce and Marketo. Finally, not only do you need to connect to the Cloud, but run in the Cloud.

Almerys is an example that started with data integration and batch processing then moved to real-time data services, “Early on, significant real-time integration needs convinced us to adopt Talend Data Services, the only platform on the market offering the combination of a data integration solution and an ESB (Enterprise Service Bus).”

Big data may be getting all the attention and open source ETL tools may not be in the spotlight, but looking across the industry and at what Talend customers are doing, they have certainly matured into an indispensable part of IT’s toolbox.

 

(Gartner: The State of Data Integration: Current Practices and Evolving Trends, April 3, 2014)

[2015-01-12] Talend Blog: 7 Reasons to “Unify” Your Application and Data Integration Teams and Tools

I recently attended a Gartner presentation on the convergence of Application and Data Integration at their Application Architecture, Development and Integration conference.  During the talk they stressed that “chasms exist between application- and data-oriented people and tools” and that digital businesses have to break down these barriers in order to succeed.  Gartner research shows that more and more companies are recognizing this problem – in fact, 47% of respondents to a recent survey indicated they plan to create integrated teams in the next 2-3 years. 

And yet, very few integration platforms, other than Talend’s, provide a single solution that supports both application integration (AI) and data integration (DI).  It seems that although many people intuitively recognize the value of breaking down integration barriers, many still have a hard time pointing to the specific benefits that will result.  This post outlines my top reasons organizations should take a unified approach.

  1. Stop reinventing the wheel

Separate AI and DI teams can spend as much as 30% of their time re-inventing the wheel or re-creating similar integration jobs and metadata.  With a unified integration tool, you can create your metadata once and use it over and over again.  You can also often avoid re-creating the same integration job.  In many situations, the requirements of an integration job can be met with either style of integration, but with separate teams, you are forced to recreate the same jobs for different projects.

  2. Learn from Toyota: Rapid Changeover Kills Mass Production

History is full of examples where management has opted for specialization to increase throughput.  This works really well in a predictable environment.  A great example of this is Ford’s approach to the Model T, where you could have any color you wanted as long as it was black.  They could crank out the cars for less than anyone else with their assembly lines and mass-production approach.  Unfortunately, I have yet to see an IT organization that could successfully predict what their business owners need.  That’s why Toyota’s “one-piece flow” and flexible assembly lines have so dramatically outperformed U.S. automakers’ dedicated production lines.

  3. Pay only for what you need

    If you’re building two separate integration teams, you’re probably paying your implementation and administration “tax” twice.  You’re buying two sets of hardware and you’re paying people to set up and maintain two separate systems.  This tax is especially big if you need a high availability environment with live backup.  With a unified integration tool, you’re only doing all of this once and the savings can be huge.  One large Talend customer had a team of 10 admins across AI and DI and was able to cut that to 5 with Talend.
     
  4. Train once, integrate everything

    If you use two separate integration tools, it means you have to have specialists that understand each, or you have to train your people on two completely different and often highly complex tools.  With a unified solution, your developers can move back and forth across integration styles with very little incremental training.  This makes it much easier for your integration developers to stay current with both tools and styles of integration, even if they spend the majority of their time on a single style.  This reduces training costs and employee ramp time while increasing flexibility.
     
  5. Win with speed

    With new types of data (web and social) and new cloud applications, data volumes are exploding in every company, large and small.  This is making the ability to be data-driven a strategic differentiator that separates winners from losers.  A critical part of being data-driven is allowing the business to put the right data in the right places as quickly as possible.  With a unified tool, you can start out with one style of integration and be ready to add other integration styles at a moment’s notice.  A great example of the need for this is when a data warehousing project starts out with batch data movement and transformation requirements and then later business teams realize they can use this same data to make real-time recommendations to sales.  Without a unified integration solution, this would require two separate integration teams and two separate projects.
     
  6. Do more

Application integration and data integration tools are often stronger at some things and weaker at others.  For instance, data integration tools can include strong data quality and cleansing capabilities that application integration tools lack.  With a unified solution, each style of integration can benefit from the best that the other integration style has to offer.

  7. Stay aligned

It happens in almost every business.  Executives show up at a meeting, each with their own reports and data that give very different views on the business, and it’s almost impossible to reconcile the differences.  The same thing happens with separate integration teams.  Each defines separate rules around prices, revenue and product lines and, as a result, it’s very hard to get a consistent view of the business and key performance indicators.  A unified tool allows you to build those rules once and then apply them consistently across every integration job.  This kills many data discrepancies at the root cause.

What do you think? I’m interested to hear from folks that are considering a unified approach but believe the challenges may be too great – and equally happy to engage those with opposing viewpoints.

[2015-01-07] Talend Blog: Customer Data Platform: Toward the personalization of customer experiences in real time

Big data has monopolized media coverage in the past few years.  While many articles have covered the benefits of big data to organizations, in terms of customer knowledge, process optimization or improvements in predictive capabilities, few have detailed methods for how these benefits can be realized.

Yet, the technology is now mature and proven. Pioneers include Mint in the financial sector, Amazon in retail and Netflix in media. These companies showcase that it is possible today to put in place a centralized platform for the management of customer data that is able to integrate and deliver information in real time, regardless of the interaction channel being used.

This platform, known as a Customer Data Platform (CDP), allows organizations to reconstruct the entire customer journey by centralizing and cross-referencing interactional or internal data, such as purchase history, preferences, satisfaction, and loyalty, with social or external data that can uncover customer intention as well as broader habits and tastes. Thanks to the power and cost-effectiveness of a new generation of analytical technologies, in particular Hadoop and its ecosystem, the consolidation of these enormous volumes of customer data is not only very fast, but also enables immediate analysis.

As well as helping improve overall customer knowledge upstream, this data, importantly, also helps organizations understand and act upon individual customer needs on a real-time basis. In fact, it enables companies to predict a customer’s intentions and influence their journey through the delivery of the right message, at the right time, through the correct channel.

The Pillars of CDP

To achieve this, the Customer Data Platform must be based on four main pillars. The first pillar is about core data management functions around retrieving, integrating and centralizing all sources of useful data. In an ideal implementation, this system incorporates modules for data quality to ensure the relevance of the information, as well as Master Data Management (MDM) to uniquely identify a customer across touch points and govern the association rules between the various data sets.

The second pillar establishes a list of the offers and “conditions of eligibility”, taking into account, for instance, the specifics of the business such as premium pricing, loyalty cards, etc. The third pillar aims to analyze the data and its relationships in order to establish clear customer segments. Finally, the last pillar is concerned with predictability and enabling, through machine learning, the ability to automatically push an offer (or “recommendation”) that is most likely to be accepted by the customer.

These are the four steps that I believe are essential to achieving the Holy Grail or the ultimate in one-to-one marketing. Before companies tackle these types of projects, it is of course absolutely essential they first define the business case. What are the goals? Is it to increase the rate of business transformation, drive customer loyalty, or to launch a new product or service?  What is the desired return on investment? The pioneers in the market are advising companies to develop a storyboard that describes the ideal customer journey by uncovering "moments of truth” or the interactions that have the most value and impact in the eyes of the customer.

The Path to Real Time Success

Once companies have created their Customer Data Platform, they may begin to test various real time implementation scenarios. This would involve importing and integrating a data set to better understand the information it includes and then executing test recommendation models. By tracking the results of various models, companies can begin to refine their customer engagement programs.

The ability to modify customer engagement in real time may at first seem daunting, especially given that until now information systems have taken great care to decouple transactional functions from analytics. However, the technologies are now in place to make real-time engagement a reality. Companies no longer have to do analysis on a small subset of their customer base, wait weeks for the findings, and then far longer before they take action. Today, companies have all the power they need to connect the systems containing transactional information – web sites, mobile applications, point of sale systems, CRM, etc. – with analytical information, in real time.

In general, making the transition to real time can be completed gradually. For example, companies could start with the addition of personalized navigation on a mobile application or individualized exchanges between a client and a call center for a subset of customers. This has the advantage of quickly delivering measurable results that can grow over time as the project expands to more customers. These early ventures can be used as a stepping stone to building a Customer Data Platform that integrates the points of contact – web, call center, point of sale, etc. – more deeply, in order to enrich the customer profile and be able to personalize a broader set of interactions.

Once all the points of contact have been integrated, the company has the information necessary to make personalized recommendations. It is the famous Holy Grail of one-to-one marketing in real time, with the four main benefits being: total visibility of the customer journey (in other words, the alignment of marketing and sales); complete client satisfaction (no need to authenticate) and therefore loyalty; clear visibility into marketing effectiveness and, in the end, increased revenue due to higher conversion rates. Given that we already know that companies employing this type of analytical technique are more successful than competitors that don’t,[1] moving to real time becomes a necessity.

What about Privacy?

Respect for privacy should indeed be a key consideration for anyone interested in the personalization of the customer experience. While regulatory concerns might be top of mind, companies must first and foremost consider how their actions may impact their customer relationships. This is a matter of trust and respect, not simply compliance. Without a doubt, there is a lot we can learn here from what has been implemented in the health sector. Beyond getting an accurate diagnosis, people are generally comfortable being open with their doctors because they clearly understand the information will be held in confidence. In fact, physicians have a well-known code of conduct, the Hippocratic Oath. For companies, a similar understanding must be reached with their customers. They need to be upfront and clear about what information is being collected, how it will be used and how it will benefit the customer.

 

[2014-12-11] Talend Forum Announcement: Talend Open Studio's 5.6.1 release is available

Dear Community,

We are very pleased to announce that Talend Open Studio's 5.6.1 release is available. This general availability release for all users contains new features and bug fixes.

Download Talend Open Studio for [Data Integration | Big Data | Data Quality | MDM | ESB]'s general availability release:

Data Integration: http://www.talend.com/download/data-integration
Big Data: http://www.talend.com/download/big-data
Data Quality: http://www.talend.com/download/data-quality
MDM: http://www.talend.com/download/mdm
ESB: http://www.talend.com/download/esb

You can also view Release Notes for this 5.6.1 version, detailing new features, through this link: http://www.talend.com/download/
Find the latest release notes, with these steps: [Data Integration | Big Data | Data Quality | MDM | ESB] product tab > at bottom of page click on "User Manuals" > then, click on the "User Manuals" tab > the first download on the page is the most recent release note.

For more information on fixed bugs and new features, go to the TalendForge Bugtracker.

Thanks for being a part of our community,
The Talend Team.

[2014-12-09] Talend Blog: Takeaways from the new Magic Quadrant for Data Quality Tools

Gartner has just released its annual “Magic Quadrant for Data Quality Tools.”[1]

While everyone’s first priority might be to check out the various recognitions, I would also recommend taking the time to review the market overview section. I found the views shared by analysts Saul Judah and Ted Friedman on the overall data quality market and major trends both interesting and inspiring.

Hence this blog post to share my takeaways.

In every enterprise software submarket, reaching the $1 billion threshold is a significant milestone. According to Gartner estimates, the market for Data Quality tools reached it a couple of months ago and “will accelerate during the next few years, to almost 16% by 2017, bringing the total to $2 billion”.

Although Data Quality represents a significant market already, its growth pace indicates that it has yet to reach the mainstream. Other signs that point to this include continued consolidation on the vendor side and, from a demand-side perspective, a growing demand for democratization (in particular, lower entry costs and shorter implementation times).

Data quality is gaining popularity across data domains and use cases. In particular, “party data” (data related to customers, prospects, citizens, patients, suppliers, employees, etc.) is highlighted as the most frequent category. I believe demand for data quality is growing in this area because customer-facing lines of business are increasingly realizing that poor data quality is jeopardizing customer-relationship capabilities.  To further illustrate this fact, see the proliferation of press articles mentioning data quality as a key success factor for data-driven marketing activities (such as this one titled Data quality, the secret assassin of CRM). In addition, media coverage appears to reinforce that data quality, together with MDM of Customer Data, is a “must have” within CRM and digital marketing initiatives (see example in this survey from emarketer).

The Gartner survey referenced in the Data Quality Magic Quadrant also reveals that data quality is gaining ground across other domains beyond party data.  Three other domains are considered as a priority: financial/quantitative data, transaction data and product data (and this wasn’t the case in last year’s survey).

In my view, this finding also indicates that Data Quality is gaining ground as a function that needs to be delivered across Lines of Business.  Some organizations are looking to establish a shared service for managing data assets across the enterprise, rather than trying to solve it on a case-by-case basis for each activity, domain, use case, etc.  However, this appears to be an emerging practice delivered in only the most mature organizations (and we at Talend would advise considering it only once you have already demonstrated the value of data quality for some well-targeted use cases). Typically, those organizations are also the ones that have nominated a Chief Data Officer to orchestrate information management across the enterprise.

In terms of roles, Gartner sees an increasing number involved with data quality, especially among the lines of business, and states: “This shift in balance toward data quality roles in the business is likely to increase demand for self-service capabilities for data quality in the future.”

This is in sync with other research:  for example, at a recent MDM and data governance event in Paris, Henri Peyret from Forrester Research elaborated on the idea of Data Citizenship.

Our take at Talend is that data quality should be applied where the data resides or is exchanged. So, in our opinion, the deployment model would depend on the use case: data quality should be able to move to the cloud together with the business applications or with the integration platforms that process or store the data. Data quality should not however mandate moving data from on premises to the cloud or the other way round for its own purposes.

Last, the Gartner survey sees some interest in big data quality and in data quality for the Internet of Things, although these are not yet a key consideration for buyers.

“Inquiries from Gartner clients about data quality in the context of big data and the Internet of Things remain few, but they have increased since 2013. A recent Gartner study of data quality ("The State of Data Quality: Current Practices and Evolving Trends") showed that support for big data issues was rarely a consideration for buyers of data quality tools.”

This is a surprising, yet very interesting finding in my opinion, knowing that at the same time other surveys show that data governance and quality are becoming one of the biggest challenges in big data projects. See as an example this article from Mark Smith of Ventana Research, showing that most of the time spent in big data projects relates to data quality and data preparation. The topic is also discussed in a must-watch webinar on Big Data and Hadoop trends (requires registration), by Gartner analysts Merv Adrian and Nick Heudecker. An alternative to the highly promoted data lake approach is gaining ground, referred to as the “data reservoir approach”. The difference: while the data lake aims to gather data in a big data environment without further preparation and cleansing work, a reservoir aims to make the data more consumption-ready for a wider audience, not only for a limited number of highly skilled data scientists. Under that vision, data quality becomes a building block of big data initiatives, rather than a separate discipline.

I cannot end this post without personally thanking our customers for their support in developing our analyst relations program.  

Jean-Michel

[1] Gartner, Inc., "Magic Quadrant for Data Quality” by Saul Judah and Ted Friedman, November 26, 2014

[2014-12-02] Talend Blog: What Is a Container? (Container Architecture Series Part 1)

This is the first in a series of posts on container-centric integration architecture.  This first post covers common approaches to applying containers for application integration in an enterprise context.  It begins with a basic definition and discussion of the Container design patterns.  Subsequent posts will explore the role of Containers in the context of Enterprise Integration concerns.  This will continue with how SOA and Cloud solutions drive the need for enterprise management delivered via service containerization and the need for OSGI modularity.  Finally, we will apply these principles to explore two alternative solution architectures using OSGI service containers.

Containers are referenced everywhere in the Java literature but seldom clearly defined.  Traditional Java containers include web containers for JSP pages, Servlet containers such as Tomcat, EJB containers, and lightweight containers such as Spring.  Fundamentally, containers are just a framework pattern that provides encapsulation and separation of concerns for the components that use them.  Typically the container will provide mechanisms to address cross-cutting concerns like security or transaction management.  In contrast to a simple library, a container wraps the component and typically will also address aspects of classloading and thread control. 

Spring is the archetype container and arguably the most widely used container today.  Originally servlet and EJB containers had a programmatic API.  Most containers today follow Spring’s example in supporting Dependency Injection patterns.  Dependency Injection provides a declarative API for beans to obtain the resources needed to execute a method.  Declarative Dependency Injection is usually implemented using XML configuration or annotations and most frameworks will support both.  This provides a cleaner separation of concerns so that the bean code can be completely independent of the container API.
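
A minimal sketch of declarative Dependency Injection in the Spring style: the business beans below know nothing about the container, and the wiring is declared once in a configuration class. The GreetingService and Greeter names are illustrative assumptions rather than examples from the original post.

    import org.springframework.context.annotation.AnnotationConfigApplicationContext;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;

    public class DiDemo {

        // Hypothetical business contract and implementation.
        public interface GreetingService {
            String greet(String name);
        }

        public static class SimpleGreetingService implements GreetingService {
            public String greet(String name) {
                return "Hello, " + name;
            }
        }

        // A bean that depends on the service; the dependency arrives via its constructor.
        public static class Greeter {
            private final GreetingService service;

            public Greeter(GreetingService service) {
                this.service = service;
            }

            public String greetWorld() {
                return service.greet("world");
            }
        }

        // The container reads this declarative configuration and injects the
        // GreetingService into the Greeter; the beans never call the container API.
        @Configuration
        public static class AppConfig {
            @Bean
            public GreetingService greetingService() {
                return new SimpleGreetingService();
            }

            @Bean
            public Greeter greeter(GreetingService greetingService) {
                return new Greeter(greetingService);
            }
        }

        public static void main(String[] args) {
            AnnotationConfigApplicationContext ctx =
                    new AnnotationConfigApplicationContext(AppConfig.class);
            System.out.println(ctx.getBean(Greeter.class).greetWorld()); // "Hello, world"
            ctx.close();
        }
    }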

Containers are sometimes characterized as lightweight containers.  Spring is an example of a lightweight container in the sense that it can run inside of other containers such as a Servlet or EJB container.  “Lightweight” in this context refers to the resources required to run the container.  Ideally a container can address specific cross-cutting concerns and be composed with other containers that address different concerns.

Of course, lightweight is relative, and how lightweight a particular container instance is depends on the modularity of the container design as well as how many modules are actually instantiated.  Even a simple Spring container running in a bare JVM can be fairly heavyweight if a full set of transaction management and security modules is installed.  But in general a modular container like Spring will allow configuration of just those elements which are needed.

Open Source Containers typically complement Modularity with Extensibility.  New modules can be added to address other cross-cutting concerns.  If this is done in a consistent manner, an elegant framework is provided for addressing the full spectrum of design concerns facing an application developer.  Because containers decouple the client bean code from the extensible container modules, the cross-cutting features become pluggable.  In this manner, open source containers provide an open architecture foundation for application development.

Patterns are a popular way of approaching design concerns, and they provide an interesting perspective on containers. The Gang of Four Design Patterns[1] book categorized patterns as addressing Creation, Structure, or Behavior. Dependency Injection can be viewed as a mechanism for transforming procedural Creation code into Structure. Containers such as Spring also offer Aspect Oriented Programming features, which essentially allow Dependency Injection of Behavior. This allows transformation of Behavioral patterns into Structure as well, which simplifies the enterprise ecosystem because configuration of structure is much more easily managed than procedural code.
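As an illustration of “Dependency Injection of Behavior”, here is a sketch of a Spring-style aspect that wraps a timing concern around the hypothetical GreetingService from the previous sketch. It assumes @EnableAspectJAutoProxy is added to the configuration class and that the spring-aop and aspectjweaver libraries are on the classpath; none of these names come from the original post.

package demo;

import org.aspectj.lang.ProceedingJoinPoint;
import org.aspectj.lang.annotation.Around;
import org.aspectj.lang.annotation.Aspect;
import org.springframework.stereotype.Component;

// A cross-cutting timing concern expressed as structure (an aspect) rather
// than written procedurally inside every service method.
@Aspect
@Component
class TimingAspect {

    // Advise every public method of beans in the demo package whose class
    // name ends in "Service".
    @Around("execution(* demo.*Service.*(..))")
    public Object time(ProceedingJoinPoint pjp) throws Throwable {
        long start = System.nanoTime();
        try {
            return pjp.proceed();
        } finally {
            System.out.println(pjp.getSignature() + " took "
                    + (System.nanoTime() - start) + " ns");
        }
    }
}

The timing behavior is injected by configuration; the service bean itself contains no timing code at all.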

Talend provides an open source container using Apache Karaf.  Karaf implements the OSGI standard that provides additional modularity and dependency management features that are missing in the Java specification.  The Talend ESB also provides a virtual service container based on enterprise integration patterns (EIP) via Apache Camel.  Together these provide a framework for flexible and open solution architectures that can respond to the technical challenges of Cloud and SOA ecosystems.
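To give a flavour of what the EIP-based approach looks like in code, here is a minimal sketch of an Apache Camel route written in the Java DSL, assuming Camel 2.x on the classpath. The file endpoints and the priority attribute are invented for illustration and are not taken from the Talend ESB documentation.

package demo;

import org.apache.camel.CamelContext;
import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.impl.DefaultCamelContext;

// A sketch of a Content-Based Router, one of the Enterprise Integration
// Patterns (EIP) that Apache Camel implements.
public class OrderRouteDemo {
    public static void main(String[] args) throws Exception {
        CamelContext context = new DefaultCamelContext();
        context.addRoutes(new RouteBuilder() {
            @Override
            public void configure() {
                // Pick up XML orders from an inbox and route them by priority.
                from("file:data/inbox?noop=true")
                    .choice()
                        .when(xpath("/order/@priority = 'high'"))
                            .to("file:data/outbox/priority")
                        .otherwise()
                            .to("file:data/outbox/standard");
            }
        });
        context.start();
        Thread.sleep(10000); // let the route poll for a few seconds, then shut down
        context.stop();
    }
}

Inside an OSGI container such as Karaf, the same RouteBuilder would typically be deployed as a bundle (for example via a Blueprint descriptor) rather than started from a main method, but the routing logic is unchanged.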

[1] Erich Gamma, Richard Helm, Ralph Johnson, John Vlissides (November 10, 1994). Design Patterns: Elements of Reusable Object-Oriented Software


[2014-11-13] Talend Blog: More Action, Less Talk - Big Data Success Stories

The term ‘big data’ is at risk of premature over-exposure. I’m sure there are already many who tune out when they hear it – thinking there’s too much talk and very little action. In fact, observing that ‘many companies don’t know where to start with big data projects’ has become the default opinion within the IT industry.

I, however, stand by the view that the integration and analysis of this big data stands to transform today’s business world as we know it. And while it’s true that many firms are still unsure how and where to begin when it comes to drawing value from their data, there is a growing pool of companies to observe. Their applications might all be different, and they may tend to be larger corporations rather than mid-range businesses, but there is no reason why companies of any size can’t look and learn.

I was thinking this when several successful examples of how large volumes of data can be integrated and analysed came my way this week. The businesses involved were all from different industry sectors, from frozen foods to France’s top travel group.

What they have in common is that consumer demand, combined with the strength of competition in their own particular industry, is driving the need to gain some kind of deeper understanding of their business. For the former, Findus, this involves improving intelligence around its cold supply chain and gaining complete transparency and traceability.

For Karavel Promovacances, one of the largest French independent travel companies, it is more a question of integrating thousands upon thousands of travel options, including flights and hotel beds – and doing it at the speed that today’s internet users have come to expect. A third company, Groupe Flo, is creating business intelligence on the preferences of the 25 million annual visitors to the firm’s more than 300 restaurants.

Interestingly, the fourth and final case study involves a company which is dedicated to data. OrderDynamics analyses terabytes of data every day from its big-name retailer customers, such as Neiman Marcus, Brooks Brothers, and Speedo, to provide real-time intelligence and recommendations on everything from price adjustments and inventory re-orders to content alterations.

As I said, these are four completely different applications from four companies at the top of their own particular games. But these applications are born from the knife-edge competitive spirit they need in order to maintain their positions. A need that drives innovation and inventiveness and turns the chatter about new technologies into achievement.

This drive or need won’t remain in the upper echelons of the corporate world forever. An increasing number of mid-range and smaller companies are discovering that there are open source solutions now on the market that effectively address the challenge of large-scale volumes. And, importantly, that they can tackle these projects cost-effectively.

This is bound to turn up the heat across the mainstream business world. In a recent survey by the Economist Intelligence Unit, 47% of executives said that they don’t expect to increase investments in big data over the next three years (with 37% citing financial constraints as their barrier). However, I believe this caution will soon give way as more firms learn of the relatively low cost of entry and, perhaps more significantly, as they see competitors inch ahead using big data fueled business intelligence.

In other words, I expect to hear less talk and read more success stories in the months to come. Follow the links below to learn more about real-world, high-volume data success stories:

Karavel Promovacances Group (Travel and Tourism)

OrderDynamics (Retail/e-Tail)

Findus (Food)

Groupe Flo (Restaurant)

[2014-10-30] Talend Forum Announcement: Talend Open Studio's 5.6.0 release is available

Dear Community,

We are very pleased to announce that Talend Open Studio's 5.6.0 release is available. This general availability release for all users contains many new features and bug fixes.

Download Talend Open Studio for [Data Integration | Big Data | Data Quality | MDM | ESB]'s general availability release:

Data Integration: http://www.talend.com/download/data-integration
Big Data: http://www.talend.com/download/big-data
Data Quality: http://www.talend.com/download/data-quality
MDM: http://www.talend.com/download/mdm
ESB: http://www.talend.com/download/esb

You can also view Release Notes for this 5.6.0 version, detailing new features, through this link: http://www.talend.com/download/
To find the latest release notes, follow these steps: select the [Data Integration | Big Data | Data Quality | MDM | ESB] product tab > at the bottom of the page, click "User Manuals" > then click the "User Manuals" tab > the first download on the page is the most recent release note.

As part of this release, Talend Open Studio for Data Integration, Talend Open Studio for Data Quality and Talend Open Studio for ESB are now provided under the Apache License version 2.0: http://www.apache.org/licenses/LICENSE-2.0.txt

Important new features for our most recent release are viewable on our "what's new" page: http://www.talend.com/products/what-new

For more information on fixed bugs and new features, go to the TalendForge Bugtracker.

Thanks for being a part of our community,
The Talend Team.

[2014-10-23] Talend Forum Announcement: Talend Connect Netherlands: an Event for the Community!

Reserve your spot at Talend Connect Netherlands!

Don’t miss a full day of valuable insight, information sharing and networking on November 6, 2014, designed to help Talend’s ecosystem of customers, users and partners capitalize on the full potential of their data and systems. Learn how leading companies are using Talend to unlock the power of their data!

Register for this event to:
- learn more about the big data revolution;
- network with Talend users;
- ask our experts questions and share your best practices and experiences;
- discover new technology to answer your integration needs.

Date
November 6, 2014

Where
Central Amsterdam (location given when registration confirmed)

Who
Customers, users and partners

Please register today for free. Secure your place now: https://info.talend.com/TCNL14_Registration.html

[2014-10-18] Talend Blog: Turning a Page

At the end of October, I will be leaving Talend, after more than 7 years leading its marketing charge. It has been quite a ride – thrilling, high octane, wearing at times, but how rewarding.

And indeed, how rewarding it is to have witnessed both the drastic change of open source over the years, and the rise of a true alternative response to integration challenges.

Everyone in the open source world knows this quote from Mahatma Gandhi:

"First they ignore you, then they laugh at you, then they fight you, then you win."

And boy, do I recall our initial discussions with industry pundits and experts; not all of them were believers. I also remember the first struggles to convince IT execs of the value of our technology (even though their developers were users). And the criticism from “open source purists” about the “evil” open core model.

It would be preposterous to say that Talend has won the battle. But it is clearly fighting for (and winning) its fair share of business. And anyway, what does “winning the battle” mean in this context? We never aimed at putting the incumbents out of business (ok, maybe after a couple drinks, we might have boasted about it), but our goal has always been to offer alternatives, to make it easier and more affordable to adopt and leverage enterprise-grade integration technology.

Over these years, it has been a true honor to work with the founding team, with the world-class marketing team we have assembled, and of course with all the people who have made Talend what it is today. We can all be proud of what we have built, and the future is bright for Talend. The company is extremely well positioned, at the forefront of innovation, and with a solid team to take it forward, to the next step (world domination – not really, just kidding).

This is a small world, and I won’t be going very far, I am sure. But in the meantime, since I won’t be contributing to the Talend blog anymore, I will start blogging about digitalization – of the enterprise, of society, of anything I can think of, really – and I might even rant about air travel or French strikes every now and then. I hope you will find it interesting.

Digitally yours,

Yves
@ydemontcheuil
Connect on LinkedIn

[2014-10-03] Talend Forum Announcement: [Resolved] Maintenance Completed on Talend Bug Tracker

Dear Community,

On October 5, 2014, our operations team performed routine maintenance on the systems that power the Talend Bug Tracker (http://jira.talendforge.org). The maintenance has been completed!

Questions or concerns regarding our Bug Tracker platform may be directed to the comment section on this thread.

Sincerely,
The Talend Team

[2014-09-26] Talend Forum Announcement: For test only, Talend Open Studio's 5.6.0 RC1 release is available

Dear Community,

We are pleased to announce that Talend Open Studio's 5.6.0 RC1 release is available, for testing only. This release candidate contains new features and bug fixes, and is recommended for experienced users who need an early preview of the upcoming 5.6 release.

Download Talend Open Studio for [Data Integration | Big Data | Data Quality | MDM | ESB]'s first release candidate:

Data Integration: http://www.talend.com/download/data-int … data_integ
Big Data: http://www.talend.com/download/big-data … nload_tabs
Data Quality: http://www.talend.com/download/data-qua … nload_tabs
MDM: http://www.talend.com/download/mdm?qt-p … nload_tabs
ESB: http://www.talend.com/download/esb?qt-p … nload_tabs


Below, please find key new features and fixed bugs for Talend 5.6.0 RC1:

Talend Open Studio for Data Integration 5.6.0 RC1

Big Data
- Support for new Hadoop distributions in 5.6 (TBD-61 / TBD-132)
- Support Microsoft HD Insights Platform in bigdata wizards (TBD-949)
- Add Pig functions from DAFU library (TBD-677)
- Add Impala components (TBD-928)

Components
- tWriteJSONField is not outputting the columns from source which are not part of json tree (TDI-29704)
- tAdvancedFileOutputXml / add option to add Document type as text (default) or node (TDI-26630)

Studio
- Studio : move the lib/java folder into the configuration folder (TUP-1659)
- Change TOS DQ/DI/ESB license to APL 2.0 (TUP-1975)

Talend Open Studio for Data Quality 5.6.0 RC1:

- Fix the out of memory issue when running match analysis (TDQ-9320)
- Support new hadoop distributions (TDQ-9282)
- The about text must be changed in TOS DQ (TDQ-9410)
- Improve about MapDB mode (TDQ-9405)
- Set the default value of "hide groups less than" to 2 instead of 1 (TDQ-9297)

Talend Open Studio for MDM 5.6.0 RC1:

Bug fixes are viewable on JIRA: https://jira.talendforge.org/secure/Iss … stId=18082

Talend Open Studio for ESB & Talend ESB SE 5.6.0 RC1

New Features since 5.6.0 M4 now in RC1:

Studio
- Talend Open Studio for ESB – Now under Apache License V2.0

Runtime / ESB
- Service Locator – console commands
- Runtime Core: Apache versions included within Talend ESB Runtime 5.6.0 RC1: Apache ActiveMQ 5.10.0; Apache Camel 2.13.2; Apache CXF 2.7.12; Apache Karaf 2.3.8


Thanks for being a part of our community,
The Talend Team.

[2014-09-24] Talend Forum Announcement: Integrate Talend