This article details how to scale legacy data integration systems to modern cloud-native standards. It highlights the importance of backward compatibility, the use of Spark for distributed processing, and how FinOps automation can optimize infrastructure costs for massive enterprise workloads.
In our Engineering Energizers Q&A series, we highlight the engineering minds driving innovation across Salesforce. Today we spotlight Shivangi Srivastava, Senior Director of Software Engineering at Salesforce, as she details the creation of Cloud Data Integration, the distributed architecture behind Informatica’s platform that supports workflows for more than 5,500 enterprise customers running roughly 250,000 integration jobs daily.
Explore how the team transformed Informatica’s data integration framework from a single-node setup into a scalable Spark environment on Kubernetes, preserving backward compatibility for thousands of active pipelines and applying FinOps logic to balance operational cost against processing performance for massive datasets.
We make enterprise data accessible and reliable across hybrid and multi-cloud environments. Cloud Data Integration (CDI) serves as the engine for this mission by connecting systems, transforming datasets, and moving information to its destination.
Enterprises manage hundreds of sources, including SaaS platforms and legacy systems. CDI provides the necessary connectors for these environments. This allows teams to build pipelines that cleanse and reshape data as it moves through the organization.
Productivity remains a central focus for us. We prioritize graphical pipelines over handwritten code within the CDI model. Engineers define mappings while the runtime engine handles orchestration and scaling. This approach lets teams design integrations instead of managing infrastructure.
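To make the declarative model concrete, here is a minimal sketch of what a mapping definition could look like as structured data; the field names and connector labels are illustrative assumptions, not the actual CDI mapping format.

```python
# Hypothetical mapping definition: field names and connectors are illustrative,
# not the real CDI format. The engineer declares *what* should happen; the
# runtime decides how the work is orchestrated and scaled.
order_mapping = {
    "name": "orders_to_warehouse",
    "source": {"connector": "salesforce", "object": "Order"},
    "transformations": [
        {"type": "filter", "condition": "status = 'CLOSED'"},
        {"type": "expression", "fields": {"net_amount": "amount - discount"}},
    ],
    "target": {"connector": "snowflake", "table": "ANALYTICS.ORDERS"},
}
```

Nothing in a definition like this says how many executors to use or where the work runs; that separation is what the rest of the story builds on.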

High-level overview.
The original Informatica integration engine served a different era of data processing. It functioned as a single-node system, which worked for datasets measured in gigabytes.
Modern enterprises now operate at a massive scale. SaaS platforms and digital applications generate volumes that reach terabytes and petabytes. This shift required a move toward a distributed architecture.
Backward compatibility remained the primary constraint during this transition. Since thousands of production pipelines already existed on the platform, asking users to rebuild them was not an option. We solved this by preserving the logical abstraction layer used to design pipelines. Engineers still create graphical mappings, but the runtime now converts those mappings into distributed Spark execution plans.
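One way to picture that conversion, as a rough sketch rather than CDI’s actual compiler: each logical transformation in a mapping becomes a lazy Spark DataFrame operation, so the graphical design stays the same while execution becomes distributed. The `compile_mapping` helper and the mapping schema are assumptions carried over from the sketch above.

```python
# Toy compiler from the logical mapping sketch to a Spark DataFrame plan.
# Real mapping compilation handles many more transformation types, pushdown,
# and connector resolution; this only illustrates the layering.
from pyspark.sql import DataFrame
from pyspark.sql import functions as F

def compile_mapping(source_df: DataFrame, mapping: dict) -> DataFrame:
    """Apply a mapping's logical transformations as lazy Spark operations."""
    df = source_df
    for step in mapping["transformations"]:
        if step["type"] == "filter":
            df = df.filter(step["condition"])              # SQL-style predicate
        elif step["type"] == "expression":
            for column, expr in step["fields"].items():
                df = df.withColumn(column, F.expr(expr))   # derived column
        else:
            raise ValueError(f"unsupported step: {step['type']}")
    return df  # the caller writes the result to the target connector
```

Because the resulting plan is lazy, Spark’s optimizer can still reorder and fuse these steps before anything executes.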
Open-source Spark alone lacked necessary enterprise capabilities like lineage tracking and deep connector support. To fix this, we extended the engine into Spark++. This version combines the distributed processing model of Spark with our transformation framework and governance features. This extended runtime allows CDI to run complex integration pipelines at scale while keeping the logical abstractions engineers already use.
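The lineage idea can be hedged into a small sketch: wrap each transformation so the step applied and the columns it touched are recorded alongside the DataFrame. `TrackedFrame` is an illustrative name, not a Spark++ API, and a production engine would more likely capture this from the query plan than from a wrapper.

```python
# Illustrative-only lineage wrapper; not the Spark++ implementation.
from dataclasses import dataclass, field
from typing import Callable, List

from pyspark.sql import DataFrame

@dataclass
class TrackedFrame:
    df: DataFrame
    lineage: List[dict] = field(default_factory=list)  # ordered record of applied steps

    def apply(self, step_name: str, fn: Callable[[DataFrame], DataFrame]) -> "TrackedFrame":
        """Run one transformation and append a lineage entry describing it."""
        entry = {"step": step_name, "input_columns": list(self.df.columns)}
        return TrackedFrame(fn(self.df), self.lineage + [entry])
```

At write time, the accumulated entries could be shipped to a governance catalog so every target column traces back to its sources.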
CDI operates as the backbone of many data workflows. Reliability remains a core design requirement because integration pipelines power analytics systems and operational processes.
To maintain stability, CDI focuses on three reliability principles:
These systems allow CDI to function as a stable data integration backbone. The platform maintains a 99.9% control-plane availability target.

High-level architecture.
Scaling CDI presents challenges as enterprise data volumes expand. Infrastructure planning grows complex because cloud providers offer many compute and storage options.
Enterprise deployments face three specific scaling challenges:
CDI uses FinOps automation to solve these issues. Users define cost and performance goals instead of configuring clusters manually. The platform analyzes workloads and selects the best infrastructure configuration. It scales compute resources dynamically across Kubernetes-managed clusters.
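As a deliberately small illustration of “goals in, configuration out,” the sketch below picks the cheapest cluster shape that satisfies a declared budget and runtime target; the class, instance shapes, and numbers are hypothetical, not the CDI FinOps engine.

```python
# Toy goal-driven infrastructure selection. The goal fields, candidate catalog,
# and scoring are assumptions for illustration only.
from dataclasses import dataclass

@dataclass
class ClusterOption:
    name: str
    hourly_cost: float        # USD per hour for this cluster shape
    est_runtime_hours: float  # estimated runtime for the profiled workload

def pick_cluster(options, max_cost: float, max_runtime_hours: float):
    """Return the cheapest option that satisfies both the cost and runtime goals."""
    feasible = [
        o for o in options
        if o.hourly_cost * o.est_runtime_hours <= max_cost
        and o.est_runtime_hours <= max_runtime_hours
    ]
    return min(feasible, key=lambda o: o.hourly_cost * o.est_runtime_hours, default=None)

choice = pick_cluster(
    [ClusterOption("4x m5.xlarge", 0.80, 3.0), ClusterOption("8x m5.xlarge", 1.60, 1.4)],
    max_cost=3.00, max_runtime_hours=2.0,
)
# -> the larger shape wins: total cost 2.24 meets the budget and the runtime target.
```

The real system works from profiled workload characteristics and a far richer catalog of compute and storage options, but the shape of the decision is the same: the user states goals, and the platform searches configurations.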
This approach processes large datasets efficiently. Engineers no longer manage infrastructure directly.
Scaling distributed systems by simply adding hardware often leads to rapidly increasing infrastructure costs. CDI automates infrastructure optimization to balance performance and cost. This architecture removes the manual burden of cluster management.
Three core systems power the FinOps architecture:
These systems optimize infrastructure continuously. Production deployments show this architecture reduces infrastructure costs by a factor of approximately 1.65 while maintaining performance.

Sample performance results.
Scaling CDI at an enterprise level requires more than just initial construction. The platform must manage growth across customer environments, job concurrency, and data throughput.
CDI currently supports approximately 5,500 enterprise customers who execute roughly 250,000 integration jobs daily. Two architectural choices maintain stability at this scale:
This design ensures orchestration services remain stable during compute cluster spikes. Advanced scheduling also prevents large workloads from monopolizing shared infrastructure. These capabilities allow CDI to maintain performance during enterprise growth.
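To illustrate the anti-monopolization idea in miniature, the sketch below admits a job only while its tenant stays under a fixed share of a shared slot pool; the class, thresholds, and slot model are assumptions, not CDI’s scheduler.

```python
# Hedged sketch of fair-share admission: cap how much of the shared compute pool
# any one tenant can hold so a single large workload cannot starve the rest.
from collections import defaultdict

class FairShareAdmitter:
    def __init__(self, total_slots: int, max_tenant_fraction: float = 0.25):
        self.total_slots = total_slots
        self.max_tenant_slots = int(total_slots * max_tenant_fraction)
        self.in_use = defaultdict(int)  # slots currently held per tenant

    def try_admit(self, tenant: str, requested_slots: int) -> bool:
        """Admit a job only if the tenant stays under its share of the pool."""
        if self.in_use[tenant] + requested_slots > self.max_tenant_slots:
            return False  # job waits; other tenants keep their capacity
        self.in_use[tenant] += requested_slots
        return True

    def release(self, tenant: str, slots: int) -> None:
        self.in_use[tenant] -= slots
```

Combined with a control plane that runs separately from the compute clusters, a rule along these lines keeps one tenant’s burst from degrading everyone else’s jobs.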