What is Data Integration? – Dev Community

TLDR

Data integration is synchronizing information between one system (such as Salesforce) and another (such as Snowflake).



Outline

  • What is this?
  • Why do you need it?
  • How do you implement it?
  • Who is it for?
  • Future and next steps

What is this?

Data integration is the process of loading data from an external source and then exporting it to an internal destination. This process can also be reversed (e.g., reverse ETL), moving data from an internal system out to an external one.
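As a rough illustration of that load-then-export flow, here is a minimal sketch. The `integrate` function, the `extract_accounts` source, and the in-memory `warehouse` list are all hypothetical stand-ins; a real integration would call the source's API and write to an actual warehouse.

```python
from typing import Callable, Iterable

Record = dict[str, object]


def integrate(
    extract: Callable[[], Iterable[Record]],
    load: Callable[[list[Record]], None],
) -> int:
    """Pull every record from the source, write it to the destination,
    and return how many records were synced."""
    records = list(extract())
    load(records)
    return len(records)


# Hypothetical source: a real one would page through an API such as Salesforce.
def extract_accounts() -> Iterable[Record]:
    yield {"id": 1, "name": "Acme"}
    yield {"id": 2, "name": "Globex"}


# Hypothetical destination: a list standing in for a warehouse table.
warehouse: list[Record] = []

synced = integrate(extract_accounts, warehouse.extend)
print(synced)  # prints 2
```

For the reversed direction, the same shape applies with the roles swapped: `extract` reads from the internal system and `load` sends records to the external one.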

Scenario

You regularly collect gold from scattered kingdoms that have pledged allegiance to your throne. You collect their gold payments through Stripe.



Your treasury (aka the finance team) wants to better understand which demographics of kingdoms pay well and on time.

They need the gold payment data from Stripe in your Snowflake data warehouse so they can combine it with kingdom census data. Once it is combined, they can analyze the data and report on it to the Higher Council.
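Once both datasets sit in the same place, combining them is essentially a join on a shared key. The sketch below does that in plain Python; the kingdom names, field names, and numbers are all made up for illustration, and in practice this would more likely be a SQL join inside the warehouse.

```python
from collections import defaultdict

# Made-up rows standing in for payment data synced from Stripe.
payments = [
    {"kingdom": "Northreach", "amount_gold": 120, "paid_on_time": True},
    {"kingdom": "Northreach", "amount_gold": 80, "paid_on_time": False},
    {"kingdom": "Southmarsh", "amount_gold": 200, "paid_on_time": True},
]

# Made-up census data keyed by kingdom name.
census = {
    "Northreach": {"population": 50000},
    "Southmarsh": {"population": 20000},
}


def summarize(payments: list[dict], census: dict[str, dict]) -> list[dict]:
    """Aggregate payments per kingdom, then join each total to its census row."""
    totals = defaultdict(lambda: {"total_gold": 0, "payments": 0, "on_time": 0})
    for payment in payments:
        total = totals[payment["kingdom"]]
        total["total_gold"] += payment["amount_gold"]
        total["payments"] += 1
        total["on_time"] += int(payment["paid_on_time"])

    rows = []
    for kingdom, total in sorted(totals.items()):
        info = census.get(kingdom)
        if info is None:
            continue  # no census row to combine with, so skip this kingdom
        rows.append({
            "kingdom": kingdom,
            "population": info["population"],
            "total_gold": total["total_gold"],
            "on_time_rate": total["on_time"] / total["payments"],
        })
    return rows


report = summarize(payments, census)
```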

Why do you need it?

Use cases for integrating external data with internal systems include (but are not limited to):

  • Combine data from multiple sources to build comprehensive data models for business use cases
  • Analyze data with external SaaS tools
  • Combine data for personalization
  • Forecast

Use cases for loading data from an internal system (such as PostgreSQL) and then exporting it to an external system (such as HubSpot) include (but are not limited to):

  • Getting internal application data into marketing tools (such as Mailchimp, Google Ads, etc.) for outreach campaigns, ads, etc.
  • Syncing user data with CRMs (e.g., Salesforce)
  • Exporting internal data to task management software (e.g., Airtable) for Ops

Scenario

You have a big tournament coming up to celebrate the next heir to your throne. You have put up a lot of advertising posters all over the country, and several knights sign up to compete in your tournament.



You create a data integration with each ad service provider (such as Google Ads, Facebook Ads, etc.), pull data from their APIs, and store it in your BigQuery data warehouse.

Your team uses ad campaign performance data, knight sign-ups, tournament results, and audience spend to calculate the ROI of ad campaigns. These models will help your team promote the next tournament more efficiently.
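To make the ROI calculation concrete, here is a small sketch. It assumes the common definition ROI = (revenue - ad spend) / ad spend and treats audience spend as the revenue attributed to each campaign; the campaign names and figures are invented.

```python
def campaign_roi(ad_spend: float, revenue: float) -> float:
    """Return ROI as a fraction: 1.5 means a 150% return on the ad spend."""
    if ad_spend <= 0:
        raise ValueError("ad_spend must be positive")
    return (revenue - ad_spend) / ad_spend


# Invented figures: revenue here stands in for audience spend per campaign.
campaigns = [
    {"name": "posters_north", "ad_spend": 100.0, "revenue": 250.0},
    {"name": "posters_south", "ad_spend": 200.0, "revenue": 150.0},
]

roi_by_campaign = {
    campaign["name"]: campaign_roi(campaign["ad_spend"], campaign["revenue"])
    for campaign in campaigns
}
# posters_north returns 1.5 (profitable); posters_south returns -0.25 (a loss)
```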

How do you implement it?

There are 3 ways you can accomplish data integration:

  1. Software as a Service (SaaS)
  2. Open-source software
  3. Write code from scratch or use an open-source library



One benefit of using SaaS or open-source software is that you get out-of-the-box features such as (not a comprehensive list):

  • When a third-party API changes, the maintainers update the integration for you
  • When a record is a duplicate, it can be ignored automatically or used to update the existing record
  • The progress and status of each synchronization are tracked automatically
  • Failed synchronizations are retried
  • And more…
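If you write the code yourself, you have to build features like these on your own. Below is a minimal sketch of two of them: updating an existing record instead of duplicating it (an "upsert") and retrying a failed synchronization with exponential backoff. The names `upsert`, `with_retries`, and `flaky_sync` are hypothetical helpers for this example and do not come from any specific tool.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def upsert(table: dict[object, dict], records: list[dict], key: str = "id") -> None:
    """Insert new records; update the existing row when the key already exists."""
    for record in records:
        existing = table.get(record[key])
        if existing is None:
            table[record[key]] = dict(record)
        else:
            existing.update(record)


def with_retries(operation: Callable[[], T], attempts: int = 3, base_delay: float = 0.5) -> T:
    """Run operation, retrying on any exception; the wait doubles after each failure."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts, surface the last error
            time.sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("attempts must be at least 1")


# Deduplication: the second sync updates row 1 instead of adding a copy.
table: dict[object, dict] = {}
upsert(table, [{"id": 1, "email": "old@example.com"}])
upsert(table, [{"id": 1, "email": "new@example.com"}, {"id": 2, "email": "b@example.com"}])

# Retry: a hypothetical sync that fails twice before succeeding.
calls = {"count": 0}


def flaky_sync() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "synced"


result = with_retries(flaky_sync, attempts=3, base_delay=0)
```

Retrying on every exception keeps the sketch short; real code would usually retry only on errors known to be temporary, such as timeouts or rate limits.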

Who is it for?

Generally, the data engineering team owns data integration. Data engineers can choose how to implement it (e.g., buy SaaS or deploy open-source software).

Marketing, sales, operations, etc. can help influence which third-party sources require data integration or which third-party APIs need data synced from an internal data warehouse.

Specific business use cases require different solutions depending on what sources are needed.



Companies use SaaS, or self-hosted open-source software, to integrate data from common sources. When a source is uncommon and no SaaS provider supports it out of the box, teams usually write the synchronization code themselves.

In addition, when a team needs to export data from its internal systems to third-party APIs (such as Salesforce) and has very customized needs, it usually writes custom code for that specific use case too.
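Custom export code of this kind often comes down to two steps: mapping internal rows to the shape the external API expects, and sending them in batches. Here is a hedged sketch of both. The payload fields (`Email`, `FullName`) and the `send` callable are hypothetical; they are not the actual Salesforce API, and a real exporter would call the vendor's client library inside `send`.

```python
from typing import Callable, Iterator


def to_crm_contact(user: dict) -> dict:
    """Map an internal user row to a hypothetical CRM contact payload."""
    return {
        "Email": user["email"].strip().lower(),
        "FullName": f'{user["first_name"]} {user["last_name"]}',
    }


def batches(items: list[dict], size: int) -> Iterator[list[dict]]:
    """Yield consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def export_users(users: list[dict], send: Callable[[list[dict]], None], batch_size: int = 2) -> int:
    """Transform users and hand them to `send` one batch at a time."""
    sent = 0
    for batch in batches([to_crm_contact(user) for user in users], batch_size):
        send(batch)
        sent += len(batch)
    return sent


# Usage with a list standing in for the external API.
users = [
    {"email": " Ada@Example.com ", "first_name": "Ada", "last_name": "Lovelace"},
    {"email": "alan@example.com", "first_name": "Alan", "last_name": "Turing"},
    {"email": "grace@example.com", "first_name": "Grace", "last_name": "Hopper"},
]
received: list[list[dict]] = []
total_sent = export_users(users, received.append, batch_size=2)
```

Passing `send` in as a parameter keeps the mapping and batching logic testable without network access.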

Future and next steps

Once the data is integrated into a common destination, you can begin to combine it with data from other sources, analyze it, and then make predictions from it.

To accomplish this, you must first clean the data, manipulate it, and transform it through a data pipeline. It is important to manage this process in a testable, repeatable, and observable manner.
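One way to keep the cleaning step testable and repeatable is to write it as a pure function: the same input always gives the same output, so a plain assertion can check it. The sketch below assumes a made-up payment row with `kingdom` and `amount_gold` fields; the cleaning rules are illustrative only.

```python
from typing import Optional


def clean_payment(raw: dict) -> Optional[dict]:
    """Normalize one raw payment row, or return None if it cannot be used."""
    kingdom = str(raw.get("kingdom", "")).strip().title()
    try:
        amount = float(raw.get("amount_gold"))
    except (TypeError, ValueError):
        return None  # missing or non-numeric amount
    if not kingdom or amount < 0:
        return None  # blank kingdom or negative amount
    return {"kingdom": kingdom, "amount_gold": amount}


def transform(rows: list[dict]) -> list[dict]:
    """Clean every row and drop the ones that fail validation."""
    cleaned = (clean_payment(row) for row in rows)
    return [row for row in cleaned if row is not None]


raw_rows = [
    {"kingdom": " northreach ", "amount_gold": "120"},
    {"kingdom": "Southmarsh", "amount_gold": "n/a"},
    {"kingdom": "", "amount_gold": 5},
]

# A test is just an assertion on known input and expected output.
assert transform(raw_rows) == [{"kingdom": "Northreach", "amount_gold": 120.0}]
```

Observability is not covered here; a fuller version might count and log the dropped rows instead of discarding them silently.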



In this data integration series, upcoming content will include:

  • Singer Spec: Data Engineering Community Standard for Writing Data Integrations

  • How to write your own data integration
  • How to Build an End-to-End Data Pipeline to Sync Data
