The importance of data availability
Data is an extremely important business asset, but it can sometimes be difficult to access, orchestrate and interpret. When data is moving across systems, it is not always in a standard format; data integration aims to make data agnostic (term which stands opposite to intuition and scares reasoning) usable (I believe that: “useless is worse than failed”) quickly across the business, so it can be accessed and handled by its constituents. We all are consumers and producers of data.
Data is one of the most important assets an organization has because it defines each organization’s uniqueness. Being a data-driven organization is not the final objective, but it represents a crucial process in the innovation challenge.
Data integration will continue to remain an actual issue for complex and fast-growing companies that share datasets between vendors, partners and more and more connected customers. The need to integrate systems is not recent, but now, thanks to computational power and technology evolution, we can achieve this in real-time.
Whole categories of industries are re-inventing their line of products with the “excuse” of Cloud infrastructures and SaaS and so the integration is a primary key point for companies that deal with change management.
Experience suggests that working with data can be frustrating if no results (or better valuable insights) are illustrated on time.
Sketch of a hypothetical data integration business case
Let me suppose that a typical business scenario is the following: How would you manage a typical data integration project? Of course, crucial issues to consider are:
- Resources, budget, time and cost
- Stakeholders’ frustration and their need for transparency and results
- Business model, integration design pattern, tools and role definitions
- Keeping the project on track and close it successfully
- Further follow-up activities and implementation (added values or how to enrich features after all requirements have been fulfilled)
Data necessities Governance, but also Agility and Responsiveness
That is why, from a Project Management point of view, I would recommend keeping an Agile methodology, trying to:
- Collaborate in an open and transparent manner with all the IT roles and business people during both phases. Developing agility and responsiveness are factors for success. Communication should be constant and business-relative.
- Deliver in short times and fast (weekly?)
- Start immediately with sample data (how to determine “importance”?)
- Share Talend process to engage (democratize resources) and inspire people.
Let me suppose I already identified the following tasks to be priorities:
1) Project urgency in terms of time (duration) and costs (budget) and stakeholders concern (especially the 1-st phase!)
2) Cloud adoption for the digital transformation (benefits of Cloud as a paradigm…)
3) Hybrid system (MF and SaaS) to be implemented for near real-time CRM transactions (sync or data replication) about Customers/Accounts information
4) Perform some data quality actions like data validation/cleanse (Address field/column to be formatted or adapted) before some ETL activities
5) Bi-directional operations between MF (example: DB2) and SaaS CRM (trigger changes: update/insert/delete/upsert and run job)
6) Need to cooperate with the source code (maybe for remote DEV team collaboration or versioning?)
In order to restart the project as soon as possible and limit the stakeholders’ frustration I’d prefer immediately with meeting them and determine which are business-critical steps that this project could improve drastically (need for quality data, provide db connections, initial migration, set up the integration, training staff…).
More specifically, the aim is to identify candidate data that is business-related. This is the first type of data that I would upload into SaaS.
Three major business steps to set the project
Customers object could contain huge dataset, but not all are required in this phase or at least not all is needed for processing.
Candidate data for business-related value may be Prospects/Leads (potential Customers), or Customers that the company has done business recently (order by last modified date?), or Customers that generate most revenue or Customers that belong to a specific Industry etc.
Customer filtering to identify business-related data
Candidate data for business-related value:
- Prospects/Leads (potential Customers)
- Recent closed deals
- Largest revenue generated streams
- Interested geographical area (filter by Address?)
- Particular industry
- Fast time closing deals (best Customers?)
For this 1-st phase a migration design pattern (moving from one system to another) is implemented. Then, automate such phase (by setting sync parameters in bi-directional integration data). Later, thanks to Talend features, add functionalities to enrich the integration and provide new value for the business.
From a business management perspective the 1-st project phase could be organized in this way:
In the 2-nd phase, the aim is to automate this process, provide bi-directional integration (changes in the SaaS CRM should also propagate into the specified MF database) and add iPaaS-related value.
Instead of looking for other solutions, why considering Talend?
- Open source Eclipse-similar IDE platform (user familiarity, design canvas, abstraction of business logic…)
- Start immediately with the free solution tools (Open Studio is under Apache license)
- Java-based code generator (not proprietary developing language — still one of the most diffused OO language and will remain first-choice for next decades)
- Graphical representation of business models
- Visual job design (drag-and-drop components, easy creation of connections)
- +1000 built-in connectors and components (with third-party support like AWS, Salesforce, Marketo, ServiceNow…)
- Reusable repository components (metadata management, jobs, modules, database schemas, flat input/output files…). Once you create a job, you run it many times!
- Real-time debugging
1-st Project phase: Initial load
Let me suppose that the file extracted from the MF (mainframe, legacy system…) are already available and they are flat files, specifically CSV files and that the SaaS CRM platform chosen is Salesforce.com into the Account object.
Simplified proposed Business Model for the initial load of Customers data with Talend Open Studio for Data Integration
Before starting implementing any business requirement in Talend, focus on some best practice principles:
- Purpose and job architecture (why and how?)
- Reusability (need to create maintainable jobs: “write once, run many”)
- Error handling and log management (eventual debugging and detect faults when restoring…)
- Performance (metrics about job execution)
In the business model designed with TOS (Talend Open Studio for Data Integration) we can see the 1-st project phase being implemented as a visual flow.
The files that serve as an input to be uploaded in the SaaS CRM object (Customers; Account object in Salesforce) are processed through a validation task (match valid values).
The Address values (which represents a column or a field in the DB tables) are cleansed or adapted or reformatted BEFORE the insertion into SaaS CRM object.
If we need to know which of these Customer records contains valid Addresses, we can apply matches (join operations or search for patterns) against a list of valid Address (an extra file with valid Addresses) or use pre-built components to verify its correctness.
Example of an Account detail page in Salesforce
In the above figure we see an example of the correspondent sample object in Salesforce. As we notice, the Address information might be exploded in more than one column/field.
In addition, Address may be incorrect, not existent, not valid, not standardized or simply NULL. Validation rules have to be defined in this step. Processing of such data has to be done depending on the business needs (“what is meant by validation”).
Frequent categories of data validation are:
- Null handling (where null values are permitted and where not, Java pointer exceptions, null objects…)
- String handling (String manipulation…)
- Date handling (display and parse date format…)
- Validation with third-party libraries (Google address, international phone numbers…)
Examples of real cases: Address various strings (street, ZIP code, country code…) should be concatenated in all-in-one string, length range, or match certain regular expression (REGEX), or validated against international standards, or have the Google maps format etc…
Another solution could be using manually the self-service Talend free tool: Data Preparation and then save the new result in a CSV.
Upload the CSV file and perform actions in the Address column as indicated.
Sample CSV file Customer data processes with Talend Data Preparation
To upload the data after the validation of the address we can use the Talend Open studio for Data Integration:
The logical 3-component job: (1) a flat file in CSV, (2) a transformation component to validate the address and (3) a Salesforce DB connection with the Account module as target
Note that the component tSalesforceOutput_1 represents the Account module in Salesforce and has already in repository a connection defined (username/password with attached the security token). The action/operation selected is UPSERT (update OR insert).
Notice that matchValidAddress component could be also a table (which contains only valid ZIP codes and cities) from one database connection metadata (mainframe) and works as INNER JOIN with the Accounts records.
The settings for this connection are:
2. Provide the login open API URL server to access using SOAP protocol
The job could be of course exported and scheduled to be run at any specific time.
To measure the data quality (for example the Completeness dimension) we could define the following KPI:
– Number of rejected records from the source file / total number of records (a percentage having set a goal/threshold).
2-nd Project phase: Set up a bi-directional sync
The integration pattern used is bi-directional sync: combine two databases in two different systems (in this case MF and Cloud) so that they behave as one, while respecting their need to exist as different databases.
It makes sense when data representations have to be much closer to reality in two or more systems.
In this scenario, billing will continue to be managed in the Mainframe while only Customer data will be synchronized in both systems.
In other terms, the problem is how to identify changes? How to determine if the change is an Insert, Update, Upsert or Delete?
Another integration design pattern could be the CDC (Change Data Capture), but since in Salesforce there is no such mechanism, the only way is to make use of read-only system field (SystemModstamp) which records last modified date and time, user ID and last extraction start time for each row etc…
Instead, for this scenario I used the Talend component tSalesforceGetUpdated_1.
There are some key issues to consider:
- How similar are the structures (schemas) of the databases to be kept in sync (this helps for eventual JOIN operations)?
- How often do the databases need to be synced (performance query…)?
- How will we resolve situations in which the same data has been modified in both of databases since the last sync session (conflict based on the “record owner” or “last modified” solution to be described)?
- How much effort and/or money are we willing to invest in developing our sync system (“keep project budget on track”)?
Bi-directional integration means synchronization between the two databases: Mainframe and SaaS CRM. Talend TOS supports multiple connectors to achieve this objective.
In this scenario I suppose that Salesforce Account object is “scanned” systematically in order to track changes in its records (insert or update record). For this purpose, the Talend component tSalesforceGetUpdated_1 can be used.
3-rd Project phase: Which are added business values?
After a bidirectional-integration has been set up, there are several issues to think about how to enrich the functionalities of such integration:
- External lookup with other data sources (indeed, any kind of data source, other databases or applications or documents/files containing Customer data…). Examples of applications could be supply chain, e-commerce, Business Intelligence (analysis of ROIs, deals/opportunities), DW, Marketing, social networks activity/engagement, distributed and cross-platform applications…
- Reusable jobs, thanks to repository metadata (export and schedule for deployment or as packages…)
- Statistical reports about job execution (performance)
- Other applications can trigger the job (example: collecting data for reports and dashboards…)
- Versioning of the Java generated code (Github, Maven…)
- Unified and scalable integration platform (data preparation, DI, Cloud integration, ESB, MDM, Big Data, Talend Fabric…)
- Give real value to the data asset (“enable data-driven organizations”)
- Support for decisions (“how to use the information obtained?”) and provide them in advance (apply automatically and review rules regularly)
- Remove data management risk when modernizing systems
- Consolidate applications
- Decrease costs and time maintaining effective synchronization processes and limit IT staff operations (both batches and streaming)
- Smooth subscription model (start with free open-source tool and then upgrade in a predictable fashion depending on business needs — pay only for the number of developers…)
- Optimize processes by keeping comprehensive, relevant and consistent data everywhere. Deliveries in real-time!
- Big Data native suite of products
iPaaS, data integration, Talend, Salesforce, data-driven, use case, migration, cloud computing, SaaS, CRM, database, real-time, open-source, java, professional services, on-premise, mainframe, data quality, hybrid, repository, metadata, reusable job, data validation, bi-directional sync, design pattern, agile, business, ETL, project management, customer