The current implementation keeps the interim report focused on findings metadata (source, dependency_info, finding_id). The finding-centric reachability slices such as call_chains, entry_call, and crypto_call are emitted by the dedicated --export-callgraph artifact rather than embedded in the interim report.
Overview
When --scan-dependencies is enabled, crypto-finder goes beyond scanning the user’s source code. It resolves the project’s dependency tree, scans each dependency for cryptographic usage, builds a cross-package call graph, and traces each finding back to the user’s code to answer: “Does my code actually reach this crypto function?”
The Six-Step Pipeline
The full pipeline lives in DependencyScanner.ScanWithDependencies(). Here’s what each step does and why.
Step 1: Resolve Dependencies
The Resolver interface discovers all dependencies and locates their source code on disk. Each ecosystem has its own resolver implementation (see Supported Ecosystems below). Each Dependency carries:
| Field | Go Example | Java Example | Python Example | Rust Example | Purpose |
|---|---|---|---|---|---|
| Module | golang.org/x/crypto | org.bouncycastle:bcprov-jdk18on | cryptography | ring | Import path / coordinate |
| Version | v0.17.0 | 1.77 | 42.0.5 | 0.17.8 | Resolved version |
| Dir | ~/go/pkg/mod/golang.org/x/crypto@v0.17.0 | ~/.crypto-finder/cache/sources/org.bouncycastle:bcprov-jdk18on/1.77/ | ~/.local/lib/python3.x/site-packages/cryptography/ | ~/.cargo/registry/src/.../ring-0.17.8/ | Filesystem path to scan |
RootModule (e.g. github.com/myorg/app for Go, com.myorg for Java) is used later to determine which packages are “user code” vs. “dependency code”.
Step 2: Load & Filter Rules
Rules are pre-loaded once and filtered to the ecosystem’s language(s). For a Go project, only go rules are kept; for Java, only java rules. This avoids running irrelevant rules against source code, which reduces scanner overhead.
Step 3: Scan Dependencies in Parallel
Each dependency is scanned independently using the same Orchestrator.Scan() pipeline as user code (Semgrep/OpenGrep rules → deduplication → enrichment). Dependencies are deduplicated by module@version and processed in a stable order (module, version, dir) so repeated scans produce deterministic report and call graph inputs.
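The deduplication and stable ordering can be sketched as follows. This is a minimal illustration rather than the project's code: the Dependency struct only mirrors the three fields from Step 1, dedupeAndSort is a hypothetical helper name, the x/sys version and all directory paths are made up, and versions are compared as plain strings.

```go
package main

import (
	"fmt"
	"sort"
)

// Dependency mirrors the three fields described in Step 1.
type Dependency struct {
	Module  string
	Version string
	Dir     string
}

// dedupeAndSort orders dependencies by (module, version, dir) and then keeps
// the first entry for each module@version, so the result does not depend on
// the order in which the resolver returned them.
func dedupeAndSort(deps []Dependency) []Dependency {
	sorted := append([]Dependency(nil), deps...) // copy; leave the input untouched
	sort.Slice(sorted, func(i, j int) bool {
		a, b := sorted[i], sorted[j]
		if a.Module != b.Module {
			return a.Module < b.Module
		}
		if a.Version != b.Version {
			return a.Version < b.Version
		}
		return a.Dir < b.Dir
	})
	out := make([]Dependency, 0, len(sorted))
	for i, d := range sorted {
		if i > 0 && d.Module == sorted[i-1].Module && d.Version == sorted[i-1].Version {
			continue // same module@version as the previous entry
		}
		out = append(out, d)
	}
	return out
}

func main() {
	deps := []Dependency{
		{"golang.org/x/sys", "v0.15.0", "/mod/sys"},          // hypothetical version and path
		{"golang.org/x/crypto", "v0.17.0", "/mod/crypto"},    // hypothetical path
		{"golang.org/x/crypto", "v0.17.0", "/mod/crypto-alt"}, // duplicate module@version
	}
	for _, d := range dedupeAndSort(deps) {
		fmt.Printf("%s@%s (%s)\n", d.Module, d.Version, d.Dir)
	}
}
```

Sorting before deduplicating means the surviving entry for a duplicated module@version is always the one with the smallest Dir, whatever order the input arrived in.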
Dependencies without a usable local source directory are not sent to the scanner. They are logged as Skipping dependency source scan: no local source directory instead of triggering empty-path scanner failures. For Java, those dependencies still proceed to step 4 as type-only inputs as long as module@version can be resolved to a compiled JAR.
Step 4: Build the Call Graph
This is where the architecture gets interesting. The call graph builder uses syntactic parsing to process source files, which means it works on raw source without needing a full Go toolchain, Java compiler, Python interpreter, or Rust toolchain. The Parser interface abstracts all language-specific behavior.
The NewParserForEcosystem() factory selects the right parser (GoParser, JavaParser, PythonParser, or RustParser) based on the detected ecosystem.
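A rough sketch of the interface-plus-factory shape described above. Only the names Parser, ParseDirectory, FileAnalysis, and NewParserForEcosystem come from this document; the method signatures, the Language method, the ecosystem strings, and stubParser (standing in for the four real parsers) are assumptions made for illustration.

```go
package main

import "fmt"

// FileAnalysis is a trimmed stand-in for the shared structure the parsers fill in.
type FileAnalysis struct {
	PackagePath string
	Imports     map[string]string
}

// Parser abstracts language-specific parsing. The method set is assumed.
type Parser interface {
	Language() string
	ParseDirectory(dir string) ([]FileAnalysis, error)
}

// stubParser stands in for GoParser, JavaParser, PythonParser, and RustParser.
type stubParser struct{ lang string }

func (p stubParser) Language() string { return p.lang }

func (p stubParser) ParseDirectory(dir string) ([]FileAnalysis, error) {
	return nil, fmt.Errorf("%s parsing of %s is not implemented in this sketch", p.lang, dir)
}

// NewParserForEcosystem picks a parser with a single switch, so supporting a
// new language means adding one case. The ecosystem strings are hypothetical.
func NewParserForEcosystem(ecosystem string) (Parser, error) {
	switch ecosystem {
	case "go":
		return stubParser{lang: "go"}, nil
	case "maven", "gradle":
		return stubParser{lang: "java"}, nil
	case "pip":
		return stubParser{lang: "python"}, nil
	case "cargo":
		return stubParser{lang: "rust"}, nil
	default:
		return nil, fmt.Errorf("unsupported ecosystem %q", ecosystem)
	}
}

func main() {
	p, err := NewParserForEcosystem("cargo")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(p.Language())
}
```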
Performance optimization: Only dependencies with crypto findings get full source parsing. Dependencies without findings contribute only their bytecode type signatures (class names, method signatures, return types, interface hierarchy). This preserves 100% type resolution accuracy for fluent chains while skipping expensive source parsing for ~80% of dependencies.
What the parser extracts
Each language parser extracts the same semantic information into the shared FileAnalysis / FunctionDecl structures. For Go (.go files, excluding _test.go):
| Extracted | Example | Stored As |
|---|---|---|
| Package imports | import "crypto/aes" | Imports["aes"] = "crypto/aes" |
| Function declarations | func Encrypt(...) | FunctionDecl{ID, FilePath, StartLine, EndLine, Calls} |
| Method declarations | func (b *Block) Seal(...) | Same, with Type = "*Block" |
| Call expressions | aes.NewCipher(key) | FunctionCall{Callee: {Package: "crypto/aes", Name: "NewCipher"}} |
For Java (.java files):
| Extracted | Example | Stored As |
|---|---|---|
| Package declaration | package javax.crypto; | PackagePath = "javax.crypto" |
| Imports (explicit) | import javax.crypto.Cipher; | Imports["Cipher"] = "javax.crypto" |
| Imports (wildcard) | import java.security.*; | WildcardImports = ["java.security"] |
| Class methods | class Cipher { getInstance(...) } | FunctionDecl{..., Type: "Cipher", Name: "getInstance"} |
| Constructors | new SecretKeySpec(...) | Call to FunctionID{..., Type: "SecretKeySpec", Name: "<init>"} |
| Method invocations | Cipher.getInstance("AES") | Resolved via imports → FunctionID{Package: "javax.crypto", Type: "Cipher", Name: "getInstance"} |
| Local variable types | Cipher cipher = Cipher.getInstance(...) | Enables cipher.doFinal() → resolves cipher to Cipher type |
For Python (.py files, excluding test_*.py and *_test.py):
| Extracted | Example | Stored As |
|---|---|---|
| Import statements | import hashlib | Imports["hashlib"] = "hashlib" |
| From imports | from cryptography.hazmat.primitives import Cipher | Imports["Cipher"] = "cryptography.hazmat.primitives" |
| Wildcard imports | from hashlib import * | WildcardImports = ["hashlib"] |
| Aliased imports | import hashlib as hl | Imports["hl"] = "hashlib" |
| Function definitions | def encrypt(key, data): | FunctionDecl{ID, FilePath, StartLine, EndLine, Calls} |
| Class methods | class Cipher: def __init__(self): | FunctionDecl{..., Type: "Cipher", Name: "<init>"} |
| Attribute calls | hashlib.sha256() | Resolved via imports → FunctionID{Package: "hashlib", Name: "sha256"} |
| Chained calls | cryptography.hazmat.primitives.hashes.SHA256() | Resolved via first segment import |
| self calls | self.encrypt() | FunctionID{Package: current_package, Name: "encrypt"} |
For Rust (.rs files, excluding *_test.rs and tests.rs):
| Extracted | Example | Stored As |
|---|---|---|
| Use declarations | use ring::aead::Aead; | Imports["Aead"] = "ring::aead" |
| Scoped use lists | use ring::aead::{Aead, AeadCore}; | Imports["Aead"] = "ring::aead", Imports["AeadCore"] = "ring::aead" |
| Wildcard use | use ring::aead::*; | WildcardImports = ["ring::aead"] |
| Free functions | fn encrypt(key: &[u8]) {...} | FunctionDecl{ID, FilePath, StartLine, EndLine, Calls} |
| Impl methods | impl Aead { fn new(...) {...} } | FunctionDecl{..., Type: "Aead", Name: "new"} |
| Scoped calls | Aead::new(...) | Resolved via imports → FunctionID{Package: "ring::aead", Type: "Aead", Name: "new"} |
| Field calls | self.encrypt(...) | FunctionID{Package: current_module, Name: "encrypt"} |
The two data structures
The CallGraph holds two maps:
- Functions maps FunctionID.String() → *FunctionDecl (forward: function → its outgoing calls)
- Callers maps callee → []callerID (reverse: who calls this function?)
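The two maps and the reverse-index construction might look roughly like this. The struct fields are trimmed to what the example needs, the String() format is an assumption, and sorting the keys is one possible way to keep the index deterministic.

```go
package main

import (
	"fmt"
	"sort"
)

// FunctionID identifies a function; the String() format is assumed.
type FunctionID struct{ Package, Type, Name string }

func (id FunctionID) String() string {
	if id.Type != "" {
		return id.Package + "." + id.Type + "." + id.Name
	}
	return id.Package + "." + id.Name
}

type FunctionCall struct{ Callee FunctionID }

type FunctionDecl struct {
	ID    FunctionID
	Calls []FunctionCall
}

// CallGraph holds the forward map and the reverse (caller) index.
type CallGraph struct {
	Functions map[string]*FunctionDecl // function key -> declaration with outgoing calls
	Callers   map[string][]string      // callee key -> keys of functions that call it
}

// buildCallerIndex inverts the forward edges. Caller keys are visited in
// sorted order so each Callers slice comes out in a stable order.
func buildCallerIndex(g *CallGraph) {
	g.Callers = make(map[string][]string)
	keys := make([]string, 0, len(g.Functions))
	for k := range g.Functions {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, callerKey := range keys {
		seen := make(map[string]bool) // record each caller -> callee edge once
		for _, call := range g.Functions[callerKey].Calls {
			calleeKey := call.Callee.String()
			if seen[calleeKey] {
				continue
			}
			seen[calleeKey] = true
			g.Callers[calleeKey] = append(g.Callers[calleeKey], callerKey)
		}
	}
}

func main() {
	newAEAD := FunctionID{Package: "golang.org/x/crypto/chacha20poly1305", Name: "New"}
	enc := &FunctionDecl{
		ID:    FunctionID{Package: "example.com/crypto-test/mypkg", Name: "SecureEncrypt"},
		Calls: []FunctionCall{{Callee: newAEAD}},
	}
	g := &CallGraph{Functions: map[string]*FunctionDecl{enc.ID.String(): enc}}
	buildCallerIndex(g)
	fmt.Println(g.Callers[newAEAD.String()])
}
```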
Edge resolution kinds
While building the caller index, the builder records how each edge was resolved in CallGraph.EdgeResolutions: exact (receiver type known, unique
target or overload set on that type), interface_dispatch (an interface/abstract
method expanded to concrete implementations by name+arity within a namespace
root), or name_only (a fluent-fallback guess with no receiver-type anchor).
This classification is emitted on the graph-fragment export (graph-fragment-1.3,
see Output Formats) so downstream
consumers can fail closed on over-broad dispatch instead of reporting it as
typed reachability. The dispatch heuristics below (interface expansion, fluent
fallback) are exactly the edges tagged non-exact. The same fragment also
exposes crypto_entry_points[] and supporting_calls[] so consumers can use a
stable API index rather than the removed entry_point_index projection.
Type resolution
After building the caller index, the builder runs additional resolution passes to improve type accuracy:
- TypeResolver (language-specific): For Java, a bytecode-based resolver reads .class files from resolver-supplied dependency JARs (Maven or Gradle) plus the selected JDK platform archives to extract fully-qualified method signatures. JAR indexing runs in parallel and uses a per-artifact bytecode cache under ~/.scanoss/crypto-finder/cache/bytecode/, keyed by exact artifact identity. This provides accurate parameter types (e.g., io.jsonwebtoken.SignatureAlgorithm instead of generic K) and return types for fluent chain resolution. The Java resolver can be configured per scan with java_jdk_major (auto, 8, 11, 17, 21) and java_jdk_homes, which also makes Java dependency resolution JDK-aware. The TypeResolver interface is extensible: each language can implement its own approach (Go: go/types, Python: .pyi stubs, Rust: rust-analyzer).
- Fluent chain resolution: For chained calls like Jwts.builder().setId(id).signWith(algo, key), return types are propagated through the chain. If builder() returns JwtBuilder, then setId() is resolved as JwtBuilder.setId. Interface inheritance is also followed (e.g., JwtBuilder extends ClaimsMutator, so setId resolves to ClaimsMutator.setId).
- Argument source tracing: For each function call, the parser traces where argument values come from: literal constants, local variables, class fields, method parameters, or call results. This produces recursive source_nodes showing the data flow into each argument.
Step 5: Trace & Attribute Findings
This is the core of the attribution system. For each crypto finding in a dependency, the tracer walks the call graph backward toward user code.
How TraceBack works (BFS)
- Start with the target function (where the crypto finding was detected)
- Look up all callers via the reverse index (graph.Callers[targetKey])
- Prepend each caller to the chain being built (so the chain grows backward: [caller, ...existing])
- Terminate when a root function is reached (a function with no callers, e.g., main)
- Validate that the complete chain passes through at least one user-package function
- Cycle detection prevents infinite loops in recursive call graphs
- Return all complete chains (BFS finds all paths from entry points to the crypto call site)
Each chain is ordered [program_entry_point, ..., intermediate, ..., crypto_call_site]: array position [i] calls position [i+1].
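The backward BFS can be sketched as below. This is a simplified illustration under stated assumptions, not the tracer's actual code: functions are plain string keys, user code is recognized by a naive prefix match, the maxDepth parameter is omitted, and the module paths in main are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// traceBack returns every chain that ends at target, starts at a root
// function (one with no callers), and touches at least one user package.
func traceBack(callers map[string][]string, target string, userPrefixes []string) [][]string {
	var complete [][]string
	queue := [][]string{{target}}
	for len(queue) > 0 {
		chain := queue[0]
		queue = queue[1:]
		parents := callers[chain[0]]
		if len(parents) == 0 { // chain[0] is a root: the chain is complete
			if touchesUserCode(chain, userPrefixes) {
				complete = append(complete, chain)
			}
			continue
		}
		for _, parent := range parents {
			if contains(chain, parent) {
				continue // cycle: this caller is already on the chain
			}
			// Prepend the caller so the chain grows backward.
			queue = append(queue, append([]string{parent}, chain...))
		}
	}
	return complete
}

func touchesUserCode(chain, userPrefixes []string) bool {
	for _, fn := range chain {
		for _, prefix := range userPrefixes {
			if strings.HasPrefix(fn, prefix) {
				return true
			}
		}
	}
	return false
}

func contains(chain []string, fn string) bool {
	for _, c := range chain {
		if c == fn {
			return true
		}
	}
	return false
}

func main() {
	callers := map[string][]string{ // callee -> callers (hypothetical module paths)
		"example.com/wrapper.newAEAD":         {"example.com/wrapper.Encrypt"},
		"example.com/wrapper.Encrypt":         {"example.com/app/mypkg.SecureEncrypt"},
		"example.com/app/mypkg.SecureEncrypt": {"example.com/app.main"},
	}
	for _, chain := range traceBack(callers, "example.com/wrapper.newAEAD", []string{"example.com/app"}) {
		fmt.Println(strings.Join(chain, " -> "))
	}
}
```

In this sketch a chain whose every caller is already on the chain is silently dropped; a real tracer may handle that case differently.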
Attribution output
After tracing, each finding gets structured attribution metadata in the interim report, and the detailed reachability slices are emitted by the separate call graph export.
Dependency finding: file_path values are relative to the dependency root. The artifact identity stays in dependency_info, so consumers do not need to parse module@version back out of the path string.
User code finding: marked with source: "direct" and carries no dependency_info.
Step 6: Merge Reports
The current implementation merges dependency findings into the interim report and defers reachability slicing to the call graph export. In other words, interim-report inclusion is not gated on whether a finding later produces one or more exported call chains. User code findings are always included and marked with source: "direct".
Practical Walkthrough
This section traces the entire pipeline using the test project at testdata/projects/go_with_crypto_dep/. Every value shown is real, produced by running crypto-finder scan --scan-dependencies against this project.
The Source Code
Three files make up the project:
- go.mod: declares one direct dependency
- main.go: the entry point. Does not use crypto directly
- mypkg/crypto.go: the user’s wrapper package. Uses crypto from a dependency
Step 1: Scan User Code (the normal scan)
The orchestrator runs Semgrep/OpenGrep rules against user code. It finds 3 assets in mypkg/crypto.go:
| Line | Match | Rule |
|---|---|---|
| 13 | chacha20poly1305.New(key) | go.xcrypto.chacha20poly1305.aead |
| 19 | rand.Read(nonce) | go.crypto.rand.usage |
| 29 | chacha20poly1305.New(key) | go.xcrypto.chacha20poly1305.aead |
At this point the findings have no source, call_chain, or dependency_info fields, just raw findings.
Step 2: Resolve Dependencies
The Go resolver runs go list -m -json all. It returns:
RootModule (example.com/crypto-test) is the key — any package whose import path starts with this prefix is considered user code. Everything else is a dependency.
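A minimal sketch of that prefix test. The helper name isUserPackage is hypothetical, and the path-boundary check (exact match or prefix followed by "/") is an added assumption so that a sibling module such as example.com/crypto-test-utils would not be mistaken for user code; the text above only states that the import path starts with the prefix.

```go
package main

import (
	"fmt"
	"strings"
)

// isUserPackage reports whether importPath belongs to the root module.
// It matches the module itself or any package below it.
func isUserPackage(importPath, rootModule string) bool {
	return importPath == rootModule || strings.HasPrefix(importPath, rootModule+"/")
}

func main() {
	root := "example.com/crypto-test"
	for _, pkg := range []string{
		"example.com/crypto-test",
		"example.com/crypto-test/mypkg",
		"golang.org/x/crypto/chacha20poly1305",
	} {
		fmt.Printf("%-40s user code: %v\n", pkg, isUserPackage(pkg, root))
	}
}
```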
Step 3: Scan Dependencies in Parallel
Each dependency gets scanned with the same rules, limited to Go rules only:
| Dependency | Crypto Assets Found | Why |
|---|---|---|
| golang.org/x/crypto | ~870 | It is a crypto library: virtually every file matches |
| golang.org/x/sys | ~3 | False positives (function names like Generate matching crypto rules) |
Step 4: Build the Call Graph
The builder receives three package directories. It parses every .go file and extracts function declarations with their calls.
From main.go:
mypkg/crypto.go:
golang.org/x/crypto/...: hundreds more function declarations.
Then buildCallerIndex() creates the reverse index (callee → who calls it):
Step 5: Trace & Attribute
The system now traces each finding back through the call graph to user code. Three different scenarios play out:
Scenario A: User finding, chacha20poly1305.New at line 13
This finding is in mypkg/crypto.go (user code). The enrichment flow:
5a-1. Find the containing function:
FindContainingFunction("mypkg/crypto.go", 13) iterates all FunctionDecls, looking for one whose FilePath matches and whose StartLine..EndLine spans line 13. It finds SecureEncrypt (lines 12–25).
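The lookup described here amounts to a linear scan over declarations. A minimal sketch, assuming a trimmed FunctionDecl with a Name field for readability; SecureEncrypt's range (12–25) comes from the text above, while the SecureDecrypt range is hypothetical.

```go
package main

import "fmt"

// FunctionDecl is trimmed to the fields the lookup needs.
type FunctionDecl struct {
	Name      string
	FilePath  string
	StartLine int
	EndLine   int
}

// findContainingFunction returns the first declaration in file whose
// StartLine..EndLine range includes line.
func findContainingFunction(decls []FunctionDecl, file string, line int) (FunctionDecl, bool) {
	for _, d := range decls {
		if d.FilePath == file && d.StartLine <= line && line <= d.EndLine {
			return d, true
		}
	}
	return FunctionDecl{}, false
}

func main() {
	decls := []FunctionDecl{
		{Name: "SecureEncrypt", FilePath: "mypkg/crypto.go", StartLine: 12, EndLine: 25},
		{Name: "SecureDecrypt", FilePath: "mypkg/crypto.go", StartLine: 28, EndLine: 40}, // hypothetical range
	}
	if d, ok := findContainingFunction(decls, "mypkg/crypto.go", 13); ok {
		fmt.Println(d.Name)
	}
}
```

Because every lookup walks the whole slice, grouping declarations by file path first is the kind of index that would cut repeated lookups on large reports.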
5a-2. Trace back to entry point:
TraceBack(SecureEncrypt, userPackages={"example.com/crypto-test"}, maxDepth=0):
main’s line is 14 (the line where main calls SecureEncrypt), not line 10 where main is declared. This is because findCallLine() searches main’s Calls list for the specific call to SecureEncrypt and returns that call-site line number.
Scenario B: User finding — chacha20poly1305.New at line 29
Same logic, but traces through SecureDecrypt:
main’s line is now 19 — the line where main calls SecureDecrypt, not 14.
Scenario C: Dependency finding — deep x/crypto internal functions (UNREACHABLE)
Take any internal function in golang.org/x/crypto, say ssh.newAESCTR:
call_chains is empty.
In the current implementation, this means the finding may still exist in the interim report, but it will not contribute a useful reachability slice to the call graph export.
This is why the interesting downstream signal stays narrow even when a dependency contains a large amount of internal crypto usage. The vast majority of golang.org/x/crypto’s internal crypto usage is not reachable from the user’s main().
Step 6: Merge
In this project, user code calls chacha20poly1305.New directly inside mypkg/crypto.go, a file in the user’s own module. The crypto usage is already captured as source: "direct". There’s no intermediate dependency wrapper that the user calls which then reaches crypto.
If the project had a longer chain — e.g. main → mypkg.Encrypt → someMiddleware.Process → chacha20poly1305.New where someMiddleware is a dependency — then we’d expect a dependency-backed reachability slice to appear in the call graph export.
Actual Interim Report Output
The final interim report lists each finding with its source and finding_id.
Visual Summary
Walkthrough 2: Multi-Hop Dependency Chain
The first walkthrough showed a case where user code calls crypto directly, so the useful reachability slices are anchored to direct findings. This second walkthrough uses testdata/projects/go_with_dep_chain/ to demonstrate a multi-hop chain where crypto usage is buried inside a dependency and dependency-backed reachability slices become the interesting artifact.
The Source Code
The key difference: user code never touches chacha20poly1305 directly. Instead, it calls through a wrapper dependency (cryptowrapper_dep/), which has an internal function layer.
main.go — entry point, calls mypkg:
mypkg/crypto.go — user code, delegates to the dependency. No crypto imports.
../cryptowrapper_dep/wrapper.go — the dependency (separate module). Has a public API and an internal function:
What the Scan Produces
Running with --scan-dependencies:
x/crypto and x/sys internals may still be scanned and identified, but they do not help downstream stitching unless the graph can connect them back to component-owned or user-owned entry points.
Tracing the 3-Step Chain
The finding at chacha20poly1305.New (line 59 of wrapper.go) produces a 3-step chain. Here’s the BFS trace:
call_chains:
Full Trace to main
The BFS walks all the way to root functions (functions with no callers, like main). This means the full chain main → SecureEncrypt → Encrypt → newAEAD is preserved. A chain is valid if it passes through at least one user-package function, so chains that only traverse dependency code are discarded.
Visual Summary
- User code has 0 crypto findings: mypkg has no crypto imports
- 2 dependency findings survive reachability because the call graph proves user code reaches them
- 492 dependency findings dropped: deep x/crypto internals unreachable from user code
- main appears at the head of each chain: BFS walks to root functions (no callers)
- Both Encrypt and Decrypt paths preserved: all chains stored in call_chains
Interim Report Contract (v1.3)
Version 1.3 keeps the attribution fields needed to join findings to the separate reachability export. Dependency-backed paths are dependency-root-relative; dependency_info remains the canonical place for module and version.
| Field | Type | When Present | Description |
|---|---|---|---|
| source | string | Always (when dependency scanning) | "direct" or "dependency" |
| dependency_info | object | Dependency findings only | {module, version} |
| finding_id | string | Always (when dependency scanning) | Short hash (SHA-256) for cross-referencing with the callgraph export |
Call Graph Export
When --export-callgraph is enabled, Crypto Finder emits a finding-centric JSON export that uses the same relative-path convention as the main report.
Schema note: call graph export version 4.3 adds Java runtime provenance in scan_metadata for JDK-aware platform signature enrichment.
- Each top-level record stays keyed by finding_id, which is the join key back to the interim report.
- call_chains is the primary value-flow structure. Each chain is ordered from the first reachable caller to the function that contains the matched crypto call.
- Each chain node contains a fully qualified function_name, a normalized file_path, start_line, optional dependency_info, and optional entry_call.
- entry_call describes how execution entered the current function from the previous step. Its file_path and line are the call-site location in the previous node’s source file.
- The last node in a chain carries crypto_call, which is the matched crypto-relevant call that triggered the finding.
- entry_call.parameters[] and crypto_call.parameters[] both export parameter_index (always 0-based), best-effort type, argument_expression, resolved_value, variable_name for simple identifiers only, and recursive source_nodes.
- For Java scans, scan_metadata may also include java_requested_jdk_major, java_runtime_version, java_platform_signatures_used, java_platform_signature_source, and java_platform_signature_unavailable_reason to show which JDK major was requested and whether JDK platform signatures were available for enrichment.
- source_nodes can now carry interprocedural provenance across wrapper hops, for example PARAMETER -> PARAMETER -> VALUE, and propagated nested nodes keep location.file_path plus location.line when known.
- Method-call expressions are preserved as CALL_RESULT nodes instead of flattening away their receivers. When the invoked method can be resolved, the node also exports call_target, and receiver provenance stays nested under the CALL_RESULT (for example CALL_RESULT -> PARAMETER alg -> VALUE SignatureAlgorithm.HS256).
- Findings that cannot be resolved to a containing function or a specific crypto call remain in the export with finding_location and unresolved_reason.
Call Chains Ordering
The call_chains field in the call graph export contains all traced paths from program entry points to the crypto call site. Each inner array is one complete path, ordered from program entry point (index 0) to crypto call site (last index). Entry [i] calls entry [i+1].
Example:
Findings Cache
Dependency scanning is dominated by opengrep execution time (~93% of pipeline time). Since a given module@version produces identical scan results with the same ruleset, caching eliminates the redundant work. On a second scan with the same dependencies and rules, the dependency scanning phase drops from minutes to near-zero.
How It Works
The cache sits between Step 2 (rule loading) and Step 3 (parallel scanning) in the pipeline. Before scanning each dependency, scanSingleDep checks for a cached result. On a cache miss, the scan runs normally and the result is stored.
Cache Key Design
The key captures everything that affects scan output:
- Module + version: e.g., org.bouncycastle:bcprov-jdk18on@1.78
- Rules hash: First 16 hex chars of SHA-256 over sorted rule file contents. If any rule is edited, the cache invalidates automatically
Example key: org.bouncycastle:bcprov-jdk18on@1.78:a3f8b2c1d4e5f678
The rulesHash is computed once per scan (not per-dep), so I/O cost is negligible.
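A sketch of how such a key could be derived. The function names rulesHash and cacheKey, sorting by file path, and hashing the path alongside the content with NUL separators are all assumptions; the text above only specifies SHA-256 over sorted rule file contents, truncated to 16 hex characters.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// rulesHash hashes rule files in sorted path order and returns the first
// 16 hex characters. rules maps file path -> file contents.
func rulesHash(rules map[string][]byte) string {
	paths := make([]string, 0, len(rules))
	for p := range rules {
		paths = append(paths, p)
	}
	sort.Strings(paths) // map iteration order is random; sort for a stable hash
	h := sha256.New()
	for _, p := range paths {
		// NUL separators keep adjacent files from blending into each other.
		h.Write([]byte(p))
		h.Write([]byte{0})
		h.Write(rules[p])
		h.Write([]byte{0})
	}
	return hex.EncodeToString(h.Sum(nil))[:16]
}

// cacheKey joins the parts in the module@version:rulesHash layout.
func cacheKey(module, version, hash string) string {
	return module + "@" + version + ":" + hash
}

func main() {
	rules := map[string][]byte{ // hypothetical rule files
		"rules/java/cipher.yaml": []byte("rules: []"),
		"rules/java/digest.yaml": []byte("rules: []"),
	}
	fmt.Println(cacheKey("org.bouncycastle:bcprov-jdk18on", "1.78", rulesHash(rules)))
}
```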
Storage Layout
The default implementation (DiskFindingsCache) stores results as JSON files:
Slashes in module names (e.g., golang.org/x/crypto) are replaced with _ for filesystem safety. Writes use temp file + atomic rename to prevent corruption from interrupted scans.
FindingsCache Interface
Both methods take a context.Context to support network-backed implementations with timeouts and cancellation. The pipeline doesn’t know or care which backend is behind the interface.
Distributed Extensibility
The FindingsCache interface is the extension point for multi-node scanning:
| Backend | Implementation | Use Case |
|---|---|---|
| Disk | DiskFindingsCache | Single-node, dev workflow |
| Redis | RedisFindingsCache | Multi-node cluster, shared LAN |
| S3/GCS | S3FindingsCache | Global fleet, persist across deploys |
| Two-tier | TieredCache{L1: memory, L2: redis} | Hot + warm layers |
Each backend only needs to implement Get/Put. The scanning pipeline is completely agnostic about the storage backend.
Architecture Map
Supported Ecosystems
The extensible architecture makes adding a new language a matter of implementing two interfaces and registering them. Currently supported:
Go
- Resolver: GoResolver, which uses go list -m -json all to resolve modules
- Parser: GoParser, syntactic parsing of Go source
- Manifest: go.mod
- Module format: Go import path (e.g., golang.org/x/crypto)
- Package separator: /
- Source location: Go module cache ($GOPATH/pkg/mod/)
Java (Maven / Gradle)
- Resolver: JavaResolver, which auto-detects Maven vs Gradle at the project root
- Parser: JavaParser, syntactic parsing of Java source
- Manifest: pom.xml, build.gradle, build.gradle.kts, settings.gradle, settings.gradle.kts
- Module format: groupId:artifactId (e.g., org.bouncycastle:bcprov-jdk18on)
- Package separator: .
- Source location: Source JARs resolved by the active build tool and extracted to ~/.scanoss/crypto-finder/cache/sources/
Maven Resolution Details
The MavenResolver uses a three-tier fallback strategy to maximize dependency recovery, especially for multi-module projects:
Tier 1 (reactor with --fail-never, always attempted):
- Runs mvn dependency:list --fail-never -DappendOutput=true -DincludeScope=compile
- The --fail-never flag continues past module failures; -DappendOutput=true accumulates results from all succeeding modules into a single output file
- If some modules resolve successfully, their dependencies are collected even if other modules fail
Tier 2 (per-module resolution):
- Detects modules from <modules> in the parent pom.xml
- Runs mvn dependency:list -pl <module> for each module independently
- Modules that fail are skipped; dependencies from succeeding modules are deduplicated and collected
Tier 3 (local install, then retry):
- Runs mvn install -DskipTests --fail-never to build all modules locally, populating ~/.m2/repository with inter-module artifacts
- Retries Tier 1 after install
- This is expensive (requires compilation) but is the only way to resolve inter-module transitive dependencies
Two best-effort steps complement the resolution tiers:
- mvn dependency:sources: downloads -sources.jar files to ~/.m2/repository/ (best-effort; ~65% of Java libraries publish source JARs)
- mvn dependency:tree --fail-never -DappendOutput=true: builds the dependency graph adjacency list (best-effort)
If a dependency has no source JAR but its compiled JAR is present in ~/.m2/repository, Java bytecode indexing can still use it as a type-only dependency.
Gradle Resolution Details
The GradleResolver asks Gradle itself for a machine-readable dependency model via a temporary init script:
- Prefers ./gradlew and falls back to gradle from PATH
- Resolves the main Java compile classpath for single-project and multi-project builds
- Treats included Gradle subprojects as WorkspaceMembers rather than external dependencies
- Captures external module coordinates, versioned dependency edges, compiled JAR paths, and best-effort source archive paths
- Reuses the shared source extraction cache so Gradle and Maven dependencies flow through the same Java scanning pipeline
Multi-Module Project Support
Multi-module Maven projects (parent POM with <modules>) are automatically detected. When detected:
- All modules are registered as WorkspaceMembers, meaning they are treated as user code for call chain tracing (same as Cargo workspace members)
- The WorkspaceMember.Name follows the format groupId:moduleDirName
- The three-tier fallback strategy handles common multi-module failures:
  - Inter-module dependencies (e.g., eladmin-logging depends on eladmin-common): resolved via Tier 3
  - HTTP mirror blocks (Maven 3.8.1+ blocks insecure HTTP repositories): partial results collected via Tier 1
  - Missing parent POMs or private repositories: gracefully degraded via Tier 1/2
Java Call Resolution
The JavaParser resolves method calls through import analysis:
- Explicit imports: import javax.crypto.Cipher; → Cipher.getInstance(...) resolves to package javax.crypto
- Wildcard imports: import java.security.*; → class names matched against wildcard packages
- Local variable types: Cipher c = Cipher.getInstance(...) → c.doFinal() resolves c to type Cipher via local variable tracking
- Field types: Class fields are tracked similarly to local variables
- Fallback: Unresolved calls default to the current package (same as Go’s behavior for unresolved variables)
Python (pip)
- Resolver: PipResolver, which uses python -m pip list --format=json + python -m pip show to resolve packages with the same interpreter used for metadata lookups
- Parser: PythonParser, syntactic parsing of Python source
- Manifests: pyproject.toml, requirements.txt, Pipfile, setup.py
- Module format: Python package name (e.g., cryptography)
- Package separator: .
- Source location: Site-packages directory (e.g., ~/.local/lib/python3.x/site-packages/)
Python Resolution Details
The PipResolver executes the following steps:
- Root module detection: reads pyproject.toml for [project] name, falls back to directory name
- python -m pip list --format=json: lists all installed packages with versions
- python -m pip show <packages>: gets location and dependency info for each package (batched in groups of 50)
- Distribution-to-import mapping: uses that same interpreter’s importlib.metadata.packages_distributions() (Python 3.10+) to map distribution names to import names. Falls back to scanning *.dist-info directories (top_level.txt → RECORD file) for older Python versions
- Package directory resolution: uses the import mapping, then heuristic name normalization, to find the source directory. Single-file modules (e.g., six.py) and C-extension packages are skipped
Python Call Resolution
The PythonParser resolves calls through import analysis:
- import X: X.func() resolves X via imports
- from X import Y: Y() resolves to package X, treated as constructor Y.<init>()
- Chained attributes: a.b.c.func(), first segment resolved via imports, rest chained
- self calls: self.method() resolves to the current package
- Wildcard imports: from X import * recorded for fallback resolution
- Aliased imports: import X as Y, where Y maps to X
- Fallback: Unresolved calls default to the current package
Rust (Cargo)
- Resolver: CargoResolver, which uses cargo metadata --format-version=1
- Parser: RustParser, syntactic parsing of Rust source
- Manifest: Cargo.toml
- Module format: Crate name (e.g., ring)
- Package separator: ::
- Source location: Cargo registry cache (e.g., ~/.cargo/registry/src/.../<crate>-<version>/)
Rust Resolution Details
The CargoResolver runs cargo metadata --format-version=1 which provides:
- All packages with name, version, and manifest path
- Resolve graph with dependency edges between packages
- Workspace detection: packages with source: null (local/workspace crates) are treated as user code; all others are dependencies
Rust Call Resolution
The RustParser resolves calls through use declaration analysis:
- Scoped identifiers: Aead::new(...) resolves Aead through use imports
- Qualified paths: ring::aead::new(...), first segment resolved via imports
- Scoped use lists: use ring::aead::{Aead, AeadCore}, each item registered separately
- Wildcard use: use ring::aead::* recorded for fallback resolution
- self calls: self.method() resolves to the current module
- src/ transparency: The src/ directory is transparent in module paths (e.g., ring/src/aead/ → ring::aead, not ring::src::aead)
- Impl blocks: Methods in impl Type { fn method() {} } are extracted with their type association
- Fallback: Unresolved calls default to the current module
Adding a New Language
To add support for a new ecosystem:
- Implement callgraph.Parser, with syntactic parsing for the target language
- Implement dependency.Resolver, which shells out to the ecosystem’s package manager
- Register the parser in parser_registry.go: add one case
- Register the resolver in scan.go: add one depRegistry.Register() call
- Add manifest detection in detectEcosystem(): add one if checking for the manifest file
No changes are needed in builder.go, tracer.go, dependency_scanner.go, entities, or schemas.
Performance
Two-Phase Call Graph Build
The call graph build is the most expensive step in the dependency scanning pipeline. To minimize cost while preserving 100% type resolution accuracy, the builder uses a two-phase approach:
Phase 1 (source parsing, targeted): Only dependencies with crypto findings + user code modules get full source parsing via Parser.ParseDirectory(). This builds FunctionDecl entries with call sites, parameters, and return types.
Phase 2 (bytecode type indexing, comprehensive): ALL dependencies (including those without findings) are indexed via JavaBytecodeTypeResolver. This reads .class files from Maven JARs to extract class names, method signatures, return types, and interface hierarchy. The type index is used to resolve fluent chains and enrich parameter types across dependency boundaries.
Why both phases are needed: Java fluent APIs (e.g., Jwts.builder().signWith(key)) require knowing return types from one dependency to resolve calls in another. A dependency without crypto findings may define the return type that bridges a call chain from user code to a crypto finding. Skipping its type information would break backward tracing.
Benchmarks (eladmin — 160 deps, 27 with findings, 269 crypto assets)
Current warm-run numbers with findings cache + Java bytecode cache enabled:
| Metric | Current |
|---|---|
| Packages source-parsed | 32 |
| Functions in graph | 168,386 |
| Caller index entries | 190,688 |
| Bytecode type packages | 165 |
| JARs indexed | 160 |
| Bytecode cache hits | 157 |
| Bytecode resolution duration | ~1.27s |
| Full dependency pipeline | ~33.0s |
| Wall time | ~44.0s |
| Findings | 269 |
| Exported call graph edges | 498,271 |
- Two-phase call graph build: only findings-bearing dependencies get full source parsing
- Parallel Java JAR indexing: exact-version JARs are indexed concurrently
- Per-artifact bytecode cache: repeated scans avoid reparsing unchanged JARs
Current Bottlenecks
The pipeline has three main time consumers:
- Source parsing for graph packages (~23s on warm eladmin): Parses .java files from 32 packages (user code + 27 deps with findings) to build 168K function declarations with call sites.
- Dependency scanning with opengrep: Still dominates cold scans. On warm scans most dependencies hit the findings cache; on first scans the cost depends on dependency source size and worker count (--dep-workers).
- Call graph post-processing (~2-3s): Caller index construction, bytecode merge/rewrite, and fluent-chain resolution are no longer dominant but still scale with graph size.
Future Optimization Opportunities
Source parsing and graph size reduction
The bytecode resolution bottleneck has largely been removed. Remaining performance opportunities are now upstream:
- Reduce graph packages further: Keep shrinking the set of dependencies that require full source parsing without breaking attribution accuracy.
- Smarter pre-scan eligibility: Skip dependencies that have source directories but no scannable files before invoking opengrep.
- Containing-function lookup index: `Tracer.FindContainingFunction()` still scans functions linearly; indexing by file could cut repeated lookups on large reports.
- Selective bytecode indexing: Only index JARs whose types appear in unresolved calls from graph packages. More complex, but now one of the few remaining bytecode-side wins.
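The containing-function index idea could look roughly like this. It is a sketch under stated assumptions, not the tool's code: functions are modelled as `(file, start_line, end_line, name)` tuples, and `build_file_index` and `find_containing_function` are hypothetical names.

```python
import bisect
from collections import defaultdict


def build_file_index(functions):
    """Group (file, start, end, name) tuples by file, sorted by start line."""
    by_file = defaultdict(list)
    for file, start, end, name in functions:
        by_file[file].append((start, end, name))
    index = {}
    for file, spans in by_file.items():
        spans.sort()
        index[file] = ([s[0] for s in spans], spans)
    return index


def find_containing_function(index, file, line):
    """Return the innermost function whose span contains `line`, or None."""
    if file not in index:
        return None
    starts, spans = index[file]
    # Last function that starts at or before the line.
    pos = bisect.bisect_right(starts, line) - 1
    # Walk back so a line after a nested function still finds its outer one.
    for i in range(pos, -1, -1):
        start, end, name = spans[i]
        if line <= end:
            return name
    return None
```

Each lookup touches only the functions of one file, and the binary search skips everything that starts after the finding's line, so repeated lookups no longer scan the whole graph.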
Opengrep scanning
- Result caching: The existing `DiskFindingsCache` caches dependency scan results by `module@version:rulesHash`. On repeated scans of the same project, most dependencies hit the cache.
- Cold-scan throughput: First-scan performance still depends heavily on opengrep throughput over large dependency sources.
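A cache key of the `module@version:rulesHash` form can be sketched as below. Only the key layout comes from this document; how the rules hash is actually computed is not stated here, so the SHA-256 over sorted rule paths and contents, and the names `rules_hash` and `cache_key`, are assumptions.

```python
import hashlib


def rules_hash(rule_files):
    """Hash rule contents in a stable order so load order does not matter.

    rule_files maps a rule path to its raw bytes (assumed input shape).
    """
    digest = hashlib.sha256()
    for path in sorted(rule_files):
        digest.update(path.encode("utf-8"))
        digest.update(b"\0")
        digest.update(rule_files[path])
        digest.update(b"\0")
    return digest.hexdigest()


def cache_key(module, version, rule_files):
    """Build a key in the module@version:rulesHash layout."""
    return f"{module}@{version}:{rules_hash(rule_files)}"
```

Including the rules hash in the key means that editing any rule invalidates cached findings for every dependency, while an unchanged rule set keeps warm scans fast.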
Call graph export
- Export is fast (~10-15s for 269 findings) and not currently a bottleneck. The finding-centric export only traces paths reachable from findings, producing a compact JSON (~9.5K lines for eladmin) regardless of full graph size.
Limitations
General
- Static analysis — The call graph is built from syntactic call expressions. It cannot resolve interface dispatch, reflection-based calls, or function values passed as arguments.
- All paths stored — When multiple call chains exist (BFS finds all paths), all are stored in `call_chains`. This ensures no reachability information is lost.
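Storing all paths can be illustrated with a backward breadth-first search over a caller index. This is a simplified sketch: the `caller_index` shape, the `is_user_code` predicate, and the choice to stop at the first user-code frame on each path are assumptions, not confirmed behaviour of the tracer.

```python
from collections import deque


def trace_to_user_code(caller_index, finding_func, is_user_code):
    """Enumerate every acyclic call chain from user code to finding_func.

    caller_index maps a callee to the list of functions that call it.
    Each chain is ordered entry point first, crypto call last.
    """
    chains = []
    queue = deque([[finding_func]])
    while queue:
        path = queue.popleft()
        head = path[0]
        if is_user_code(head):
            chains.append(path)
            continue  # assumed: stop at the first user-code frame
        for caller in caller_index.get(head, []):
            if caller not in path:  # skip cycles caused by recursion
                queue.append([caller] + path)
    return chains
```

Because every distinct path is kept, the number of chains can grow quickly on densely connected graphs, which is the cost of not losing reachability information.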
Go-specific
- No cross-module method resolution — Method calls on variables (e.g., `cipher.Encrypt()`) are recorded with the variable name as the type, not the resolved type. Cross-package type resolution would require full type analysis.
Java-specific
- Gradle source archives are best-effort — Gradle dependency resolution provides binary artifact paths deterministically, but source archive availability still depends on what upstream repositories publish.
- Missing source JARs — Dependencies without sources are skipped for source scanning, but they can still contribute Java bytecode types if the compiled JAR is present locally. They cannot produce source-level findings until sources are available.
- Wildcard import resolution — When multiple wildcard imports could match a class name, resolution is best-effort.
- No inheritance/polymorphism — Variable types are tracked syntactically; interface implementations and subclass overrides are not resolved.
- Multi-module Maven partial resolution — Multi-module Maven projects are supported via a three-tier fallback strategy. Tier 3 (`mvn install -DskipTests`) requires compilation and may fail if the project needs specific JDK versions or build tools not available in the scan environment.
Python-specific
- Requires a Python interpreter with `pip` available — The resolver now runs `python -m pip` and `importlib.metadata` through the same interpreter. If `VIRTUAL_ENV` is set, that environment’s Python is preferred; otherwise it falls back to `python3` then `python` from PATH.
- Single-file modules skipped — Packages distributed as a single `.py` file (e.g., `six.py`) are skipped since there is no directory to scan.
- C-extension packages skipped — Packages without Python source on disk (compiled C extensions) cannot be scanned.
- Distribution-to-import mapping — Relies on `importlib.metadata` (Python 3.10+) or `*.dist-info` fallback; packages with non-standard layouts may not be resolved.
- No dynamic dispatch — Calls resolved through `getattr`, `__getattr__`, or metaclass magic are not tracked.
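The interpreter preference order described in the first item can be sketched as follows. The function name `pick_interpreter` is hypothetical, and two details are assumptions: the `bin/` versus `Scripts/` layout check, and falling through to PATH when `VIRTUAL_ENV` is set but contains no interpreter.

```python
import os
import shutil


def pick_interpreter(env=None, which=shutil.which):
    """Return the Python executable to use, or None if none is found."""
    env = os.environ if env is None else env
    venv = env.get("VIRTUAL_ENV")
    if venv:
        # POSIX virtual environments use bin/, Windows ones use Scripts/.
        for rel in (("bin", "python"), ("Scripts", "python.exe")):
            candidate = os.path.join(venv, *rel)
            if os.path.isfile(candidate):
                return candidate
    # No usable virtual environment: try python3, then python, on PATH.
    for name in ("python3", "python"):
        found = which(name)
        if found:
            return found
    return None
```

Using the single returned path for both the `-m pip` invocation and the `importlib.metadata` query keeps the two views of installed packages consistent.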
Rust-specific
- Requires `cargo` in PATH — The resolver shells out to `cargo metadata`.
- No trait dispatch — Method calls on trait objects (e.g., `dyn Cipher`) are resolved syntactically by type name; trait implementations are not followed.
- Macro-generated code — Functions generated by macros (e.g., `proc_macro`) are invisible to syntactic parsing.
- `src/` transparency assumption — The parser assumes `src/` is the crate root; non-standard `[lib] path` configurations may produce incorrect module paths.
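For illustration, the output of `cargo metadata --format-version 1` could be turned into module, version, and directory triples as sketched below. The function name `parse_cargo_metadata` is hypothetical, and treating packages with a null `source` as user code to be skipped is an assumption rather than documented resolver behaviour.

```python
import json
import os


def parse_cargo_metadata(raw_json):
    """Extract (name, version, dir) for each package that has a source.

    Expects the JSON printed by `cargo metadata --format-version 1`.
    """
    meta = json.loads(raw_json)
    deps = []
    for pkg in meta.get("packages", []):
        if pkg.get("source") is None:
            continue  # assumed: local path/workspace packages are user code
        crate_dir = os.path.dirname(pkg["manifest_path"])
        deps.append((pkg["name"], pkg["version"], crate_dir))
    # Sort for a stable order across runs.
    return sorted(deps)
```

The directory is taken from the location of each package's `Cargo.toml`, so a crate with a non-standard `[lib] path` would still be located correctly on disk even if its module paths are later misparsed.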