The Challenge of AI-Generated Components
AI models are trained on vast repositories of public code, which means they frequently produce code that closely resembles, or directly replicates, existing open source implementations. These similarities are often invisible to developers, who assume the generated code is original. The problem is structural: AI models learn patterns from their training data and reproduce those patterns when prompted for similar functionality.

Research demonstrates the scope of this challenge. Studies analysing LLM-generated code found that between 0.8% and 5.3% of it is similar enough to existing open source implementations, under stringent thresholds, to indicate copying rather than independent creation. Under more permissive similarity thresholds, approximately 30% of AI-generated code shows at least some overlap with open source codebases. These are not edge cases; they represent the routine output of mainstream AI coding tools.

The implications are significant:

- Hidden from dependency scanners: AI-generated snippets do not appear in dependency manifests, bypassing traditional Software Composition Analysis (SCA) tools that only scan declared dependencies.
- Distributed across codebases: Rather than being concentrated in a few files, AI-generated code fragments appear throughout projects, embedded within functions that developers have otherwise written themselves.
- Licence obligations without awareness: When an AI model generates code that matches GPL-licensed source code, developers inherit those licence obligations — even though they never consciously copied anything.
- Undetectable by conventional means: Traditional package-level scanners miss snippet-level similarities entirely, creating blind spots in compliance programmes.
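The gap described above can be made concrete with a small sketch. The routine below fingerprints code by hashing overlapping k-grams of tokens, roughly in the spirit of the winnowing technique used by snippet-level similarity tools; it is a minimal illustration, not any real SCA tool's API, and the normalisation step (lowercasing and whitespace splitting) is an assumption made to keep it short. Real matchers normalise far more aggressively (stripping comments, renaming identifiers, and so on).

```python
import hashlib

def kgram_hashes(code: str, k: int = 5) -> set:
    """Hash every k-gram of tokens in a piece of code.

    Normalisation here is deliberately crude (lowercase + whitespace
    split) -- an assumption for brevity, not how production snippet
    matchers work.
    """
    tokens = code.lower().split()
    grams = [" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]
    return {hashlib.sha1(g.encode()).hexdigest() for g in grams}

def snippet_overlap(generated: str, reference: str, k: int = 5) -> float:
    """Fraction of the generated code's fingerprints that also appear
    in the reference fingerprints (a containment score in [0, 1])."""
    gen = kgram_hashes(generated, k)
    ref = kgram_hashes(reference, k)
    if not gen:
        return 0.0
    return len(gen & ref) / len(gen)

# A copied snippet scores high even though no dependency was ever
# declared in a manifest -- which is why manifest-only SCA misses it.
reference = "for item in items: total += item.price * item.quantity return total"
generated = "for item in items: total += item.price * item.quantity return total"
print(snippet_overlap(generated, reference))  # 1.0 for an exact copy
```

The point of the sketch is the contrast: a package-level scanner inspects a declared dependency list, while snippet-level detection must fingerprint the code itself, because a pasted or regenerated fragment leaves no trace in any manifest.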