A Litepaper On Provenance Infrastructure For Content in the Post-AI Internet
Executive Summary
Photos are fake by default. Pixels are as gullible as words. Pics, and it didn’t happen. Generative AI is overrunning social, commercial and information-sharing spaces online. We are at risk of losing all trust in photos, videos, and content in general. That loss would mean the breakdown of the social internet, contemporary politics, remote working and every other domain that relies on the evidentiary value of online media.
This litepaper introduces OpenOrigins’ novel protocol for bifurcating the internet into ‘human’ and ‘AI’, reconciling the proliferation of bots and algorithmically generated content with the need for authenticity and trust through internet-scalable infrastructure. We address the Internet’s foundational flaw, the lack of data authentication, with a new provenance infrastructure offering unparalleled scalability, privacy, and immutability. Just as the early Internet needed TCP/IP and digital payments needed blockchain, the coming AI-powered Internet needs OpenOrigins for trust.
Table of Contents
Problem
The Agentic Internet
Why Current Approaches Fail
Design Requirements: How To Preserve the Human Internet
Foundations: Distributed Ledger Approaches
OpenOrigins Cambium: Moving Past Blockchains
Endpoint Security: Verification of 'Human' Committed Proofs
The Complete System: Digital Provenance Infrastructure
Governance and Economics
Implementation Roadmap
Conclusion
Problem
The Internet was born incomplete. While it revolutionized information sharing, it fails to effectively attest to the lifecycle of data, such as the origin of content and the modifications made to it. This architectural omission was (mostly) sustainable only because humans collectively developed social and institutional guardrails that compensated for this technological deficiency.
Media literacy, journalistic standards, ad-hoc certificate authorities, whack-a-mole cybersecurity detection mechanisms, and academic credentials (among others) have served as stopgap measures in a content ecosystem that lacks intrinsic trust signals. For decades, we have accepted this limitation as inevitable, building increasingly elaborate workarounds rather than addressing the root deficiency.
Human perception, post-hoc cybersecurity tooling, and the social filters that previously buffered against this flaw are now being systematically undermined by the emergence of Generative AI capable of creating perceptually indistinguishable synthetic content at scale. We are rapidly approaching a paradigm shift where the old rules no longer apply and our existing guardrails will fail catastrophically.
Content and information attestation is a long-overdue necessity: just as SSL certificates helped us decide which websites to trust, we need a pervasive and persistent mechanism to trust content and its origins, one that scales to the billions of content transactions we will encounter in the coming decade.
The Agentic Internet
The Internet is rapidly evolving from a human-to-human communication medium into an agentic space where AI entities increasingly mediate, generate, and transform digital content. This transition is not merely an incremental evolution; agentic interactions and synthetic media will soon outstrip human-to-human interactions and authentic content. This is a fundamental transformation of the internet's operational paradigm that demands architectural reconsideration.
Without urgent intervention, we face the impending "dead internet" scenario: a digital landscape where AI-generated content proliferates to such an extent that distinguishing authentic human activity becomes practically impossible. This is the logical conclusion of current technological trajectories.
The implications of this shift extend far beyond academic concerns:
Identity Collapse: Digital identity becomes meaningless when synthetic entities can perfectly mimic human behavior, undermining everything from online dating to banking security.
Information Corruption: When the marginal cost of generating misinformation approaches zero, information ecosystems break down completely, not incrementally.
Economic Distortion: Markets that depend on human attention (advertising, content creation, social media) face existential threats when engagement becomes algorithmically manufactured.
Social Fragmentation: Communities disintegrate when participants can no longer distinguish between authentic and synthetic engagement.
Exponential Transactions: The number of interaction points will grow a thousandfold in an agentic world, as numerous agents act on our behalf throughout the internet.
Every proposed solution that does not address the fundamental provability problem amounts to applying band-aids to a gushing arterial wound. We need radical surgery, not first aid. On the flip side, if we can build an effective solution, we will be able to benefit from the massive productivity gains promised by AI agents (possibly moving away from traditional UIs altogether) without compromising on security, while giving Internet users clear indicators of who they are interacting with.
Why Current Approaches Fail
3.1 Attesting On Blockchain: Directionally Correct, But Functionally Non-Scalable At An Internet Level
The idea of anchoring the entire internet to blockchain technology represents one of those seductively elegant solutions that collapses under the weight of its own ambition. In theory, it is brilliant – create an immutable, timestamped record of all digital content, solving authenticity and provenance challenges in one decisive technological stroke. The vision promises a world where content cannot be silently modified, where digital history becomes verifiable truth rather than malleable narrative. However, this vision crashes headlong into the brutal reality of blockchain physics: current distributed ledger architectures simply cannot absorb the crushing volume of the internet's data firehose without either sacrificing their fundamental security properties or demanding computational resources at a scale that makes the entire proposition economically absurd.
Major blockchains process transactions in the range of dozens to thousands per second, while the internet generates millions of content pieces per minute. To be clear, this is not a mere technical limitation: it is a fundamental architectural constraint. A blockchain is a linear data structure: every block needs to be communicated to and agreed upon by every node before work can begin on the next one. Every blockchain faces an unforgiving trilemma between decentralization, security, and scalability. Even if we somehow engineered a technical solution to the throughput problem (layer-2 solutions, sharding, or other scaling approaches), the economic model disintegrates. The cost of securing each transaction becomes prohibitive at internet scale, especially for mundane content. We are talking about a world where attesting a single Reddit comment might cost more than creating it, where the blockchain itself would become a crushing bottleneck rather than a liberating technology. To give a sense of magnitude, even if all we were trying to do was secure content being uploaded to a single website such as Imgur, it would incur a daily cost of approximately 750,000 USD on Ethereum, with the equivalent cost on a layer-2 such as Polygon being around 60,000 USD. The promise is magnificent, but the weight of a linear data structure keeps this vision from being realistic.
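The shape of this estimate is easy to reproduce. The sketch below is a back-of-envelope calculation only: the upload volume, gas per attestation, gas price, and token price are illustrative assumptions, not live network figures, and different inputs will of course yield different totals.

```python
# Back-of-envelope estimate of anchoring every upload of one site on a blockchain.
# Every input below is an assumption; substitute current figures to update the estimate.

uploads_per_day = 1_000_000      # assumed daily uploads for a large image host
gas_per_attestation = 25_000     # assumed gas to store one 32-byte hash on-chain
gas_price_gwei = 20              # assumed network gas price
token_price_usd = 1_500          # assumed price of the chain's native token

gas_per_day = uploads_per_day * gas_per_attestation
tokens_per_day = gas_per_day * gas_price_gwei * 1e-9   # 1 gwei = 1e-9 of a token
cost_per_day_usd = tokens_per_day * token_price_usd

print(f"{gas_per_day:,} gas/day -> {tokens_per_day:,.0f} tokens/day -> ${cost_per_day_usd:,.0f}/day")
```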
3.2 Weak Endpoints
Most proposed solutions to the content authenticity problem focus on the endpoints or post-hoc analysis. Let’s quickly discuss why these do not address the fundamental problems that the agentic internet surfaces.
3.2.1 Deepfake Detection
The AI detection industry promotes a dangerous illusion: that machine-generated content can be reliably distinguished from human-created content. This approach is fundamentally flawed for several reasons:
False Positives: Detection systems incorrectly flag human content as machine-generated, disproportionately impacting non-native English speakers and neurodivergent writers. (source)
False Confidence: Even when accuracy rates seem impressive (e.g., "97% accurate"), the failure modes create dangerous blind spots that compound rather than mitigate harm.
Supercharging Misinformation: Detection tools become weaponized to falsely "certify" manipulated content as human-generated, amplifying rather than reducing harm.
Arms Race Dynamics: As generative models improve, detection becomes increasingly unreliable, perpetuating an endless and ultimately futile technical competition.
Detection is not just insufficient—it is actively harmful, creating illusory trust signals that exacerbate rather than mitigate the underlying problem.
3.2.2 Metadata-based Security
Appending provenance information to digital assets sounds promising, but fails in practice due to:
Compatibility Barriers: Metadata standards vary widely across platforms, making universal adoption practically impossible.
Easy Stripping: Metadata can be trivially removed, modified, or fabricated, rendering it useless for security purposes.
Verification Challenges: No practical mechanism exists to verify the authenticity of metadata across diverse digital ecosystems.
Metadata approaches exemplify the peripheral treatment of a central problem, attacking symptoms while leaving the disease untouched.
3.2.3 Institutional Trust
Centralized verification authorities—whether governmental or corporate—represent perhaps the most dangerous "solution":
Censorship Risk: Centralized arbiters of truth inevitably expand their scope, threatening free expression.
Single Points of Failure: Centralized systems create high-value targets for manipulation, hacking, and corruption.
End of the Open Internet: Gatekeeping mechanisms fundamentally contradict the decentralized ethos that makes the Internet valuable.
Replacing technological failures with bureaucratic overreach trades one disaster for another.
3.2.4 Watermarking and Steganography
Embedding invisible markers in AI-generated content represents an intriguing but ultimately inadequate approach:
Easy Circumvention: Minor modifications to watermarked content can defeat detection mechanisms.
Limited Scope: Watermarks only apply to cooperative AI systems, doing nothing to address malicious actors.
Format Constraints: Watermarking techniques vary by medium (text, images, video), creating implementation gaps.
While watermarking may serve limited use cases, it cannot address the systemic provability problem at scale.
None of these approaches constitutes a sufficient solution because none addresses the architectural deficit at the Internet's core. We need to rebuild the foundation, not redecorate the façade.
Design Requirements: How To Preserve the Human Internet
Given the failures of existing solutions, we need to take a different approach: instead of relying on post-hoc mechanisms, we need to proactively create proofs of content. These proofs need to be stored securely in a way that allows accessible verification. A genuine solution to the Internet's provability crisis must meet stringent requirements:
Universal Asset & Rights Tracking: Attest all digital media assets regardless of origin or destination. This implies massive scalability of the infrastructure.
Open Access: Provide read/write access to this provenance infrastructure for all participants with minimal gatekeeping.
Robust Security: Implement cryptographic guarantees that prevent tampering or falsification.
Privacy: Proving or retrieving the provenance of a piece of content should not necessitate disclosure of identity or personally identifiable information.
Fault Tolerance: Eliminate single points of failure to ensure system resilience.
Preserving Openness: Maintain the Internet's fundamental characteristic as an open, permissionless system.
Ubiquitous Adoption: Achieve deployment more widespread than HTTPS to ensure comprehensive coverage.
Cost: Both retrieval and proving should incur negligible costs (on the order of network bandwidth costs).
Speed: Time taken to attest and retrieve should be predictable and low enough to not disrupt existing workflows.
These requirements present a formidable challenge that no existing technology fully satisfies. However, examining current approaches provides critical insights toward a viable solution.
Foundations: Distributed Ledger Approaches
5.1 Permissioned Blockchains
Enterprise blockchain solutions like Besu offer promising elements:
Strong Security: Cryptographic verification ensures data integrity.
Fault Tolerance: Distributed consensus provides resistance to node failures.
However, they fall short in critical areas:
Limited Access: Permissioned systems restrict who can participate.
Scalability Constraints: Traditional blockchain architectures face throughput limitations.
Adoption Barriers: Enterprise focus limits widespread deployment.
Evaluating against our requirements:

5.2 Scalable Consensus
Advanced consensus mechanisms like Tendermint and Robust Round Robin improve on traditional approaches:
Improved Throughput: Higher transaction processing capacity.
Broader Participation: More open validation networks.
Enhanced Security Models: Sophisticated fault tolerance.
Even with these improvements, fundamental limitations persist:
Communication Overhead: All nodes must still communicate with each other.
Resource Requirements: Participation demands significant computing resources.
Governance Challenges: Determining who can validate remains problematic.
Evaluating against our requirements:

5.3 Layer 2
Layer 2 protocols generally trade off accessibility for greater throughput, and they introduce additional security considerations compared with flat network topologies. They offer some improvements over traditional consensus:
Improved Throughput: Higher transaction processing capacity.
Lower Communication Overhead: Only a subset of the entire network needs to communicate regularly.
The tradeoffs for these improvements are:
Consensus Hierarchy: Security guarantees and accessibility are unevenly distributed.
Limited Throughput: Despite the increase in throughput, layer-2 solutions still do not scale to the level the Internet requires.
Evaluating against our requirements:

These approaches provide valuable stepping stones but fail to deliver the comprehensive solution required. We need to move past the constraints of the blockchain data structure.
We need something more radical.
OpenOrigins Cambium: Moving Past Blockchains
OpenOrigins proposes Cambium, a theoretical breakthrough in distributed agreement that challenges fundamental assumptions about how network participants must interact. Unlike traditional approaches, where every node must communicate with every other node, Cambium produces a globally shared root of trust while each node communicates with only a logarithmic subset of the network.
Here, we highlight the high-level functional characteristics of Cambium. A detailed technical specification of Cambium and its implementation roadmap will be released separately.
6.1 The Conceptual Breakthrough
Traditional Byzantine fault-tolerant (BFT) systems suffer from a critical limitation: communication costs scale super-linearly with the number of nodes, creating an inherent ceiling on network size. Cambium breaks this constraint through a novel approach:
Banyan Trie Data Structure: Organizes the network into a self-balancing tree where information propagates efficiently without requiring all-to-all communication.
Logarithmic Communication: Each node only needs to interact with O(log n) other nodes, drastically reducing bandwidth requirements.
Localized Storage: Nodes store only their own data plus cryptographic proofs, not the entire global state.
This architecture enables a network that can theoretically scale to billions of nodes while maintaining practical operational costs and a root of trust.
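To see why logarithmic fan-out matters at scale, consider the following sketch. It is purely illustrative: the message-count models are simplifying assumptions rather than measured Cambium figures, but they show how all-to-all communication explodes quadratically while per-node logarithmic communication stays tractable.

```python
import math

def all_to_all_messages(n: int) -> int:
    """Every node exchanges a message with every other node once per round."""
    return n * (n - 1)

def logarithmic_messages(n: int) -> int:
    """Each node interacts with only O(log n) peers per cycle
    (constant factors ignored; a simplifying model, not a Cambium benchmark)."""
    return n * max(1, math.ceil(math.log2(n)))

for n in (10_000, 1_000_000, 1_000_000_000):
    print(f"n={n:>13,}  all-to-all={all_to_all_messages(n):.2e}  "
          f"logarithmic={logarithmic_messages(n):.2e}")
```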
Taking a step back to its theoretical roots, Cambium achieves these seemingly impossible results by not, strictly speaking, being a consensus algorithm. A consensus algorithm is defined by three properties:
Termination: Eventually, every correct process decides some value.
Integrity: If all the correct processes proposed the same value v, then any correct process must decide v.
Agreement: Every correct process must agree on the same value.
Any algorithm that achieves all three is bound by the Fischer, Lynch, and Paterson (FLP) impossibility result, which states that in an asynchronous system no deterministic consensus algorithm can guarantee termination in the presence of even a single crashing node. Modern consensus algorithms usually work around this by loosening the definition of “termination” and settling for eventual consistency.
Cambium targets Integrity instead: every node proposes a different value v_i, which is different from the value V that every node ultimately agrees on. Relaxing Integrity in this way allows us to achieve scalability that is theoretically impossible for strict consensus algorithms.
6.2 Technical Overview

Cambium proceeds in a series of discrete time periods, called cycles. The goal of each cycle is to produce a cycle trie, containing all the activity in that cycle. Each leaf of this cycle trie contains the Merkle root of the previous cycle trie, called its cycle root. A banyan trie is a succession of cycle tries, just like a blockchain is a succession of blocks.
This banyan trie data structure is the core of Cambium. Each node has a different view of it, and no node will have complete knowledge of any cycle trie. This is unlike a blockchain, where every node sees the entire ledger. Here each node knows only its own Merkle proof in each cycle trie; the only globally shared knowledge is the sequence of cycle roots. The figure above illustrates this structure.
Each cycle trie within the banyan trie has a depth of log(n/c) where n is the total number of nodes and c is a configurable parameter denoting the size of a committee. Each cycle trie has n leaves, one for each of the nodes in the network. The goal of Cambium is to securely and efficiently build up these cycle tries—and consequently the banyan trie—for a large number of nodes. This is done by achieving local consensus first in committees, and then merging the committees’ tries in a binary recursive structure. At the end of this process, most nodes should agree on the value of the new cycle root.
The sequence of cycle roots, being the only globally known values, acts as the root of trust for the entire Cambium network. For a node to prove that it committed a value at time t, it needs to show a Merkle proof terminating in cycle root t. Any other node in the network can then verify the proof by recomputing the inclusion path.
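As a deliberately simplified illustration of the structure described above, the sketch below builds a toy cycle trie over a handful of node commitments, then produces and verifies one node's inclusion proof against the resulting cycle root. It is a teaching model under strong simplifying assumptions, not the Cambium protocol: committee formation, the recursive merge across committees, and fault handling are all omitted.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_levels(leaves: list[bytes]) -> list[list[bytes]]:
    """Build a binary Merkle tree bottom-up; returns every level, leaves first."""
    levels = [leaves]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        if len(prev) % 2:                      # duplicate the last node on odd levels
            prev = prev + [prev[-1]]
        levels.append([h(prev[i] + prev[i + 1]) for i in range(0, len(prev), 2)])
    return levels

def prove(levels: list[list[bytes]], index: int) -> list[tuple[bytes, bool]]:
    """Collect (sibling_hash, sibling_is_right) pairs from leaf to root."""
    proof = []
    for level in levels[:-1]:
        if len(level) % 2:
            level = level + [level[-1]]
        sibling = index ^ 1
        proof.append((level[sibling], sibling > index))
        index //= 2
    return proof

def verify(leaf: bytes, proof: list[tuple[bytes, bool]], root: bytes) -> bool:
    acc = leaf
    for sibling, sibling_is_right in proof:
        acc = h(acc + sibling) if sibling_is_right else h(sibling + acc)
    return acc == root

# Each toy leaf commits the node's own activity together with the previous cycle root.
prev_cycle_root = h(b"cycle-root-t-1")
leaves = [h(prev_cycle_root + f"node-{i}-activity".encode()) for i in range(8)]

levels = build_levels(leaves)
cycle_root = levels[-1][0]                     # the only globally shared value
proof_for_node_3 = prove(levels, 3)

assert verify(leaves[3], proof_for_node_3, cycle_root)
print("node 3 proof verifies against cycle root:", cycle_root.hex()[:16], "...")
```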
6.3 Roadmap: To Be Solved
Despite its revolutionary potential, Cambium faces some unsolved technical challenges:
Sybil Resistance: Preventing actors from creating multiple identities to gain disproportionate influence.
Data Availability: Ensuring that required data remains accessible even when original sources go offline.
Offline Node Handling: Managing state transitions when nodes temporarily disconnect from the network.
Storage Optimization: Balancing redundancy requirements against storage efficiency.
These challenges, while significant, represent engineering problems for OpenOrigins rather than fundamental conceptual obstacles. With sustained research and development, viable solutions are within reach. We highlight avenues for solving these issues in the separate technical specification.
If successfully implemented and adopted, Cambium would satisfy all our core requirements:

However, even a perfect network infrastructure is not enough. We must also address the endpoint problem: how can we ensure that we can trust the data and content at the point of creation and/or ingestion?
Endpoint Security: Verification of 'Human' Committed Proofs
Securing the network solves only half the problem. We must also ensure that endpoints—the devices generating and consuming content—maintain the integrity of the provenance chain. Two complementary approaches show particular promise:
7.1 Secure Sourcing using Trusted Execution Environments (TEEs)
TEEs provide hardware-level isolation of security-critical operations:
Hardware Attestation: Cryptographic verification that code executes in an unmodified environment.
Protected Processing: Computation occurs in isolated memory regions inaccessible to the operating system.
Secure I/O Paths: Direct connections between input/output devices and the TEE bypass potential interception points.
While not foolproof, TEEs raise the bar significantly for endpoint tampering, making large-scale attacks economically infeasible.
The OpenOrigins Secure Sourcing solution leverages Apple iOS and Android TEEs to authenticate content at the point of capture. It may be deployed as a standalone application or as an SDK embedded into other applications. Secure Sourcing proves that the device it runs on is not an emulator, that the attached camera is the expected one, and that the scene in front of the device is a real three-dimensional view. In addition, we capture data such as location, time and gyroscope readings to further strengthen the attestation.
Combining all of these indicators in a remotely attested secure enclave gives us a high level of confidence that Secure Sourced content is, in fact, real, human and unedited. This represents the highest level of security for content provenance short of custom hardware.
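To make the shape of such an attestation concrete, the sketch below shows a schematic signed capture record. This is not the OpenOrigins Secure Sourcing SDK and performs no real hardware attestation: the field names are illustrative assumptions, and the Ed25519 key stands in for a per-device key that, in practice, would be generated inside the enclave and vouched for by the platform's attestation service.

```python
import hashlib, json, time
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

# Stand-in for a per-device key held inside the TEE and covered by platform
# attestation (hypothetical; in practice the key never leaves the enclave).
device_key = Ed25519PrivateKey.generate()

def attest_capture(image_bytes: bytes, sensors: dict) -> dict:
    """Bundle a content hash with capture-time sensor indicators and sign it."""
    record = {
        "content_sha256": hashlib.sha256(image_bytes).hexdigest(),
        "captured_at": int(time.time()),
        "sensors": sensors,                  # e.g. location, gyroscope, depth check
    }
    payload = json.dumps(record, sort_keys=True).encode()
    record["signature"] = device_key.sign(payload).hex()
    return record

def verify_capture(record: dict, public_key) -> bool:
    """Check the signature over everything except the signature field itself."""
    unsigned = {k: v for k, v in record.items() if k != "signature"}
    payload = json.dumps(unsigned, sort_keys=True).encode()
    try:
        public_key.verify(bytes.fromhex(record["signature"]), payload)
        return True
    except InvalidSignature:
        return False

record = attest_capture(b"...raw image bytes...", {"lat": 51.5, "lon": -0.12, "depth_ok": True})
print(verify_capture(record, device_key.public_key()))   # True
```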
7.2 Archive Anchoring (AA)
While Secure Sourcing works well for content created from this point forward, the question of what to do with historical content remains. This is where Archive Anchoring comes in. The guarantee we aim for is necessarily different: it is not possible to prove that content in an archive has never been edited, since no provable provenance existed at the point of capture. However, we can still create an audit trail for the content from the point of ingest. This is especially important for historical and news content. All content in vaulted archives becomes traceable to origin, a fossil record for historical content. This protects the foundations of human content from a synthetic internet.
The core realisation here is that we need to raise the threshold for an adversary to successfully fake an entry in an archive. As part of the archive anchoring process, we plug into an archive's external services (e.g., a logging service). For each media file in the archive, we calculate its hash and then query the external services for the earliest timestamp they recorded for that hash. Once we have collected at least two external timestamps in close proximity to the archive-reported timestamp, we consider the media item verified. An adversary would then need to compromise the archive as well as two external services to forge the timestamp. Following this, we anchor the media item along with the external witness service timestamps.
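A minimal sketch of this corroboration step follows. The witness interfaces, the tolerance window, and the threshold are placeholders chosen for illustration; in a real deployment each external witness (a logging service, a backup index, and so on) would be queried through its own API.

```python
import hashlib
from typing import Callable, Optional

# Hypothetical witness lookup: returns the earliest UNIX timestamp the service
# recorded for a given content hash, or None if it has never seen the hash.
Witness = Callable[[str], Optional[int]]

def corroborated(media_bytes: bytes,
                 archive_timestamp: int,
                 witnesses: list[Witness],
                 tolerance_seconds: int = 86_400,
                 required: int = 2) -> bool:
    """Accept the archive's claimed timestamp only if at least `required`
    independent witnesses recorded the same hash within the tolerance window."""
    digest = hashlib.sha256(media_bytes).hexdigest()
    matches = 0
    for earliest_seen in (w(digest) for w in witnesses):
        if earliest_seen is not None and abs(earliest_seen - archive_timestamp) <= tolerance_seconds:
            matches += 1
    return matches >= required

# Toy witnesses standing in for external logging/backup services.
log_service = lambda h: 1_700_000_100
backup_index = lambda h: 1_700_000_500
unrelated_service = lambda h: None

print(corroborated(b"archived video bytes", 1_700_000_000,
                   [log_service, backup_index, unrelated_service]))   # True
```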
The Complete System: Digital Provenance Infrastructure

The full OpenOrigins vision combines Cambium's network-level provenance tracking with robust endpoint security to create a comprehensive digital integrity system. Several additional components glue these elements together:
8.1 System Overview
A functional origin proof technology requires two layers: the ability to immutably define a piece of content at its creation, or through its archive root, and the ability to re-assert that immutable cryptographic proof in the event that a piece of content has been transformed (modified, edited, cropped, screenshotted, etc.). The cryptographic hash becomes the unbreakable root, while Origin Hashing (defined below) becomes a modification-tracking technology that links modified assets to their parent, ensuring we can prove human content even after it has been shared on the open internet.
8.2 Origin Hashing
Traditional cryptographic hashes break with even minimal content modifications. Matrixed perceptual hashing and keypoint-based lookups solve this through:
Content Fingerprinting: Generating signatures based on content characteristics rather than exact bit patterns.
Similarity Detection: Identifying derivatives and modifications of original content.
Cross-Format Tracking: Maintaining provenance even when content crosses media formats.
This enables tracking content lineage even through transformations and edits. We refer to this combined lookup mechanism as Origin Hashing.
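To illustrate the perceptual side of Origin Hashing, the sketch below implements a simple difference hash (dHash) with Pillow and compares two images by Hamming distance. Production Origin Hashing relies on considerably more robust matrixed perceptual hashes and keypoint lookups; this is only a minimal demonstration, with a synthetic test image and an arbitrary distance threshold, of the principle that similar images yield nearby fingerprints.

```python
from PIL import Image, ImageEnhance

def dhash(image: Image.Image, hash_size: int = 8) -> int:
    """Difference hash: compare adjacent pixels of a shrunken grayscale image."""
    small = image.convert("L").resize((hash_size + 1, hash_size), Image.LANCZOS)
    pixels = list(small.getdata())
    bits = 0
    for row in range(hash_size):
        for col in range(hash_size):
            left = pixels[row * (hash_size + 1) + col]
            right = pixels[row * (hash_size + 1) + col + 1]
            bits = (bits << 1) | (1 if left > right else 0)
    return bits

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

# Stand-in images: a synthetic "original" and a brightened "derivative".
original = Image.radial_gradient("L")
derivative = ImageEnhance.Brightness(original).enhance(1.2)

d = hamming(dhash(original), dhash(derivative))
print(f"hamming distance = {d} -> {'likely derivative' if d <= 10 else 'likely unrelated'}")
```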
8.3 Proof and Metadata Storage
Proofs created by the trusted endpoints, metadata associated with media files, and the data needed to perform reverse lookups all need to be stored in a secure, scalable data store.
For this, we rely on existing storage mechanisms on a case-by-case basis. Applications requiring high security would rely on decentralized storage mechanisms such as IPFS. However, we envision that for most applications, we would rely on centralized storage for efficiency. This does not pose a security risk since Cambium ensures tamper evidence.
Governance and Economics
The OpenOrigins ecosystem needs to create economic and governance alignment across all participants: content creators and archives will be rewarded for bootstrapping the ecosystem with provenanced content, node operators for ensuring uptime, and application/platform developers for advancing core platform technology and endpoints. The incentive system will be designed and described in a follow-up paper as we approach a token generation event (TGE). Here we present the high-level goals that the incentive system must accomplish for the healthy functioning of the OpenOrigins ecosystem:
Validation Rewards: Nodes that maintain network integrity (e.g., by participating in Cambium) need to be compensated for their continued participation.
Creator Benefits: In addition to providing tools for attribution and monetization that depend on provenance, we also envision an initial phase of direct incentivisation for creators anchoring their content to the OpenOrigins infrastructure. This helps bootstrap the network with a critical mass of content.
Consumer Value: Costs associated with consuming trust signals must be balanced with the inherent value that consumers see in having ready access to those signals. We anticipate that this will be a dynamically priced model.
Platform Advantages: Platforms and third-party ecosystems that integrate with the OpenOrigins infrastructure need to be incentivised, at least for an initial period, to get a critical mass of adoption.
Implementation Roadmap
The OpenOrigins Attestation Infrastructure has already been deployed and currently runs on a custom blockchain we have developed based on Hyperledger Besu. This first-generation blockchain solution is deployed within leading broadcasters and is already becoming the backbone for many institutional leaders. Although it will continue to run for specific use-cases, the next phase requires the jump to hyper-scalability and ‘absolute’ data privacy. This roadmap outlines the stages of our transition to the novel Cambium L1 as the backbone for our human-proof infrastructure. Realizing the OpenOrigins vision requires a phased approach:
Phase 1: Foundation (Months 0-6)
Complete formal specification of the Cambium consensus mechanism
Rigorous security analysis and peer review
Simulation testing under varied network conditions
At the end of this phase, we will have a fully specified and peer-reviewed Cambium protocol, ready for initial implementation.
Phase 2: Testnet (Months 6-18)
Reference implementation of core protocols
Small-scale testnet deployment
Developer tooling and API design
Initial endpoint security integration
At the end of this phase, we will have small-scale deployments of Cambium freely available for users to test and inspect. We also envision transitioning subsets of our current customer base to the testnet to ensure no disruption to their existing workflows when the eventual transition away from Besu occurs.
Phase 3: Mainnet (Months 18-30)
Open-source node software release
Partner integration with strategic platforms
Standards development for cross-platform compatibility
Incentive mechanism implementation
At this point, we will be ramping up endpoint adoption in conjunction with the full mainnet launch. Open-sourcing all of the node software should allow for greater scrutiny and trust.
Phase 4: Internet-Scale (Month 30 onwards)
Browser and OS-level integration
Content platform plugins and APIs
Enterprise tooling for compliance and verification
Consumer-facing trust indicators
This timeline anticipates significant technical challenges but represents a realistic path toward deployment at internet scale.
Conclusion