DataFusion

Completed

Enterprise Data Integration Platform

A high-performance distributed data platform enabling semantic data discovery and SQL-based querying across heterogeneous data sources. It was co-developed at Stream Financial to solve the enterprise data integration challenge: letting tech-savvy business users access and query data without raising IT change requests.

C++, Distributed Systems, SQL, ColumnStore Database, Java API, Python API, .NET API, Vector Processing, Map Reduce, TLS Security
Multiple data sources
SQL '92 query standard
Cross-platform
Exceptional compression

The Problem We Solved

The Challenge

Large organisations with complex legacy infrastructure face a dilemma: the technology that once enabled their growth has reached the point where its value is declining. Legacy systems are inflexible just when changing markets demand agility.

The Insight

The tools available to business users for discovering and accessing enterprise data are poor. Even when data is found, extracting it and using it productively is difficult, which leads to a proliferation of shadow IT spreadsheets and Access databases.

The Solution

DataFusion provides a semantic layer that makes business-relevant metadata available, along with tools to discover, retrieve, and process information regardless of where that data actually resides in the enterprise.

Key Features
SQL '92 compliant query engine with distributed query processing across multiple data sources
Join queries seamlessly across DataFusion, Excel, Oracle, SQL Server, and more
Full query plan transparency providing data lineage and source timings
Temporal as-of queries for point-in-time data analysis
Automatic sharding and indexing for optimized performance
Open provider framework allowing custom data source integration without vendor lock-in
Secure TLS encryption for all data access
APIs available in Java, Python, and .NET (C#)
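
The temporal as-of feature listed above can be illustrated with a minimal sketch. It assumes each value is stored as a list of versions sorted by the date they took effect; the `history` store, the `ACME.rating` key, and the `as_of` function are hypothetical names for illustration and are not part of the DataFusion API.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical versioned store: each key maps to (valid_from, value) pairs,
# kept sorted by valid_from.
history = {
    "ACME.rating": [
        (date(2020, 1, 1), "A"),
        (date(2021, 6, 1), "BBB"),
        (date(2023, 3, 1), "BB"),
    ],
}

def as_of(key, when):
    """Return the value in effect for `key` on `when`, or None if none existed yet."""
    versions = history[key]
    dates = [valid_from for valid_from, _ in versions]
    # Index of the first version that starts strictly after `when`
    idx = bisect_right(dates, when)
    return versions[idx - 1][1] if idx else None

print(as_of("ACME.rating", date(2022, 1, 1)))
```

A binary search over the version dates keeps each point-in-time lookup logarithmic in the number of versions, which is one plausible reason to keep history sorted.
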
Business Impact
Data Accessibility: Tech-savvy business users can access enterprise data using standard SQL without IT change requests
Agility: Business needs are no longer blocked by IT system change cycles; users can explore data independently
Governance: Controlled sandbox environment ensuring the right people access the right data, with audit trails
Integration: Unified abstraction layer against constantly changing golden source systems
Technical Deep Dive

Technical Highlights

High-Performance ColumnStore Database

Performance

Written from the ground up in C++ to leverage vector processing in modern CPUs. Provides exceptional read/write speeds, with operations on compressed data that are actually faster than the same operations on uncompressed data.
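
One common way a column store can make compressed data faster than uncompressed data is to operate on the compressed form directly. The sketch below uses run-length encoding as an illustration; the source does not say which compression scheme DataFusion uses, so the scheme and the function names here are assumptions.

```python
def rle_encode(values):
    """Compress a column into [value, run_length] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1       # extend the current run
        else:
            runs.append([v, 1])    # start a new run
    return runs

def sum_rle(runs):
    # One multiply-add per run instead of one add per row
    return sum(value * length for value, length in runs)

# A column with long runs of repeated values, as is typical after sorting
column = [5] * 1000 + [7] * 2000 + [5] * 500
runs = rle_encode(column)

print(len(column), "rows compressed to", len(runs), "runs")
print(sum(column) == sum_rle(runs))
```

The aggregate touches three runs rather than 3,500 rows, so less memory is read and fewer operations are executed. The benefit depends on the data having long runs.
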

Distributed Query Engine

Scalability

SQL '92 compliant engine that distributes queries across multiple heterogeneous data sources, combining results seamlessly with full data lineage tracking.
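
As a rough illustration of the idea, the sketch below fetches rows from two sources in parallel, joins them on a shared key, and records per-source row counts and timings as a simple lineage record. The source functions, the `SOURCES` table, and `join_with_lineage` are hypothetical stand-ins, not the engine's actual interfaces.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sources standing in for, say, a database table and a spreadsheet.
def fetch_trades():
    return [
        {"trade_id": 1, "book": "FX"},
        {"trade_id": 2, "book": "Rates"},
        {"trade_id": 3, "book": "FX"},
    ]

def fetch_limits():
    return [{"book": "FX", "limit": 100}, {"book": "Rates", "limit": 250}]

SOURCES = {"trades": fetch_trades, "limits": fetch_limits}

def timed_fetch(name):
    """Fetch all rows from one source and measure how long it took."""
    start = time.perf_counter()
    rows = SOURCES[name]()
    return name, rows, time.perf_counter() - start

def join_with_lineage(left, right, key):
    # Query both sources concurrently
    with ThreadPoolExecutor() as pool:
        results = {n: (rows, secs) for n, rows, secs in pool.map(timed_fetch, [left, right])}
    left_rows, right_rows = results[left][0], results[right][0]
    # Hash join: index the right side by key, then probe with each left row
    index = {}
    for row in right_rows:
        index.setdefault(row[key], []).append(row)
    joined = [{**lrow, **rrow} for lrow in left_rows for rrow in index.get(lrow[key], [])]
    # Lineage: which sources contributed, how many rows, and how long each took
    lineage = [
        {"source": n, "rows": len(results[n][0]), "seconds": results[n][1]}
        for n in (left, right)
    ]
    return joined, lineage

joined, lineage = join_with_lineage("trades", "limits", "book")
for row in joined:
    print(row)
```
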

Open Provider Framework

Flexibility

Extensible services that provide access to data: CSV files, ODBC sources (Excel/Access), databases (SQL Server, Oracle, Snowflake), and opaque providers for R, MATLAB, and Python scripting.
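
A provider framework of this kind can be sketched as a small interface that every data source implements, plus a registry the query engine looks sources up in. The `Provider` interface, `CsvProvider`, and `REGISTRY` below are hypothetical names chosen for illustration; the real DataFusion provider API is not described in the source.

```python
import csv
import io
from abc import ABC, abstractmethod

class Provider(ABC):
    """Hypothetical contract a data source implements to be queryable."""

    @abstractmethod
    def schema(self):
        """Return the column names this source exposes."""

    @abstractmethod
    def scan(self):
        """Yield rows as dictionaries keyed by column name."""

class CsvProvider(Provider):
    def __init__(self, text):
        self._text = text  # CSV content held in memory to keep the sketch self-contained

    def schema(self):
        # The header row supplies the column names
        return next(csv.reader(io.StringIO(self._text)))

    def scan(self):
        yield from csv.DictReader(io.StringIO(self._text))

REGISTRY = {}

def register(name, provider):
    """Make a provider available to queries under a logical table name."""
    REGISTRY[name] = provider

register("positions", CsvProvider("book,notional\nFX,100\nRates,250\n"))
rows = list(REGISTRY["positions"].scan())
print(REGISTRY["positions"].schema(), rows)
```

Adding a new source type then means writing one more class that implements the same two methods, without touching the code that consumes the registry.
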

In-Memory Processing with Persistence

Reliability

A bespoke map-reduce implementation optimized for low latency: processing happens in memory, while data is persisted to disk for durability.
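
The general pattern can be sketched as follows: each partition is aggregated in memory (map), the partial results are merged (reduce), and the final result is written to disk atomically. This is a single-process illustration under assumed names (`map_partition`, `merge`, `run`) and an assumed JSON file format, not the bespoke implementation itself.

```python
import json
import os
import tempfile
from functools import reduce

def map_partition(rows):
    """Map step: aggregate one partition entirely in memory."""
    partial = {}
    for book, notional in rows:
        partial[book] = partial.get(book, 0) + notional
    return partial

def merge(a, b):
    """Reduce step: combine two partial aggregates."""
    out = dict(a)
    for key, value in b.items():
        out[key] = out.get(key, 0) + value
    return out

def run(partitions, path):
    totals = reduce(merge, map(map_partition, partitions), {})
    # Persist for durability: write to a temp file, flush to disk, then rename
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(totals, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # atomic swap, so readers never see a half-written file
    return totals

partitions = [[("FX", 100), ("Rates", 50)], [("FX", 25)], [("Rates", 10), ("Credit", 5)]]
path = os.path.join(tempfile.mkdtemp(), "totals.json")
totals = run(partitions, path)
print(totals)
```
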

System Architecture

Core Components

Query Engine (SQL '92), ColumnStore Database, Provider Framework, Workbench UI, Multi-language APIs, TLS Security Layer

Data Flow

Client Query → Query Engine → Provider Selection → Distributed Execution → Result Aggregation → Transparent Lineage

Integrations

CSV/Flat Files
Excel/Access (ODBC)
Oracle/SQL Server/Snowflake
R/Matlab/Python Scripts
Custom Vendor APIs
Enterprise Applications

Risk & Finance data unification: Enable consolidated views across traditionally siloed systems

Regulatory reporting: Aggregate data from multiple sources for BCBS 239 compliance

Data quality initiatives: Provide single query interface for data validation across systems

Business intelligence: Allow business users to create ad-hoc analyses without IT involvement

Legacy system migration: Query both old and new systems during transition periods

First-Principles Lessons
1. The business-IT gap is fundamentally a data accessibility problem, not just a communication issue
2. Semantic metadata is the key to making data discoverable by business users
3. Performance matters: compression that's faster than uncompressed access changes what's possible
4. Open extensibility prevents vendor lock-in and enables client customization
5. Security and governance must be built-in from the start, not added later
The Builder Storyline Connection

DataFusion represents the transition from consulting (identifying data integration problems) to building (creating technology solutions). The same first-principles approach was used later in PAI and Risk-Agents: understand the problem deeply, then build something that addresses the root cause.