Overview
The SCANOSS Engine is a command-line tool that scans files and directories for open-source component matches by comparing them against the SCANOSS Knowledgebase (KB). Results are printed to STDOUT in JSON format and include licence, copyright,
and component identification data.
The engine operates at the server level and receives WFP (Winnowing FingerPrint) files
from clients through the SCANOSS API’s ../scan-direct path. Each WFP file follows
this format:
TARGET can be a single file, a .wfp fingerprint file, or a directory.
Scanning Procedure Overview
At a high level, the engine produces one of three match types for each scanned file:
- None: the scanned file could not be matched with anything in the KB.
- Snippet: a portion of the scanned file was matched with a portion of a known open-source file.
- Full: the entire content of the scanned file matches an existing file in the KB.
To produce a result, the engine works through the following steps:
- Determine the type of match (full file or snippet).
- For snippet matches only: identify the most meaningful open-source file (MD5) among multiple candidates containing the matched snippet.
- Link the matched file with component information (URL, purl, release date, etc.). There may be more than one candidate; by default, the engine selects the best match.
- Query component-level information (licence, vulnerabilities, etc.).
- Produce the JSON report.
File Matching Logic
The engine attempts to match each scanned file against the KB using the following sequence:
- URL match: Does the file exactly match a known package archive at a registered URL? If so, the identification type (id) is "url".
- File match: The engine queries the “file table” to check whether the file’s MD5 (FILE_MD5) is present in the KB. If it is, the identification type (id) is "file". If it is not, snippet analysis begins.
- Snippet match: If neither of the above applies, the engine performs a snippet comparison using the CRC32 fingerprints from the WFP. The identification type (id) is "snippet".
- Binary match: For binary files, identification is performed via binary fingerprinting. The identification type (id) is "binary".
- No match: If none of the above apply, the identification type (id) is "none".
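The sequence above can be read as a simple cascade in which the first rule that applies decides the id value. The sketch below is illustrative only: the function name, its parameters, and the use of in-memory sets in place of KB table lookups are all assumptions, not the engine's actual interface.

```python
def identify(file_md5, url_md5s, file_md5s, has_snippet_match, has_binary_match):
    """Return the identification type for one scanned file.

    url_md5s / file_md5s are hypothetical stand-ins for KB lookups;
    the two booleans stand in for the snippet and binary analyses.
    """
    if file_md5 in url_md5s:      # exact match with a known package archive
        return "url"
    if file_md5 in file_md5s:     # MD5 present in the "file table"
        return "file"
    if has_snippet_match:         # CRC32 fingerprints produced a snippet match
        return "snippet"
    if has_binary_match:          # binary fingerprinting produced a match
        return "binary"
    return "none"


print(identify("abc", set(), {"abc"}, False, False))  # file
```

The checks run from the most specific evidence (whole archive, whole file) to the least specific, so a file that qualifies for several rules is reported under the earliest one.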
Snippet Matching in Detail
When a full file match is not found, the engine performs snippet analysis, a multi-step process designed to balance match accuracy with processing performance.
How snippet matching works
Each WFP fragment (a CRC32 value) represents a code fingerprint and is associated with a series of file MD5s in the KB. When two WFP fragments share an associated MD5, that MD5 accumulates a “hit”. The more hits an MD5 has, the larger the common snippet between the scanned file and the matched open-source file.
Because each CRC32 may be associated with many MD5s, the number of comparisons can grow very large. To keep processing time reasonable, the engine prioritises CRC32s linked to fewer MD5s: these represent less common (more distinctive) code fragments and therefore provide higher-value signal.
Snippet scanning procedure
- Pre-processing map: The engine queries the KB for the MD5s associated with each CRC32 and builds a map in the following format:
  Each row records a CRC32, its line number, the number of linked MD5s (LIST_SIZE), and the MD5s themselves.
- Sort by list size: The map is sorted ascending by list size, so less popular (more distinctive) fingerprints are processed first.
- Build matchmap: Each MD5 is processed and added to a matchmap. If an MD5 already exists in the map, a hit is added. The matchmap tracks line ranges and has a fixed capacity of 10,000 MD5s. Processing stops when all CRC32s are handled or the matchmap is full.
- Select largest snippets: The engine selects the MD5s with the most hits (the largest common snippets). One or more may be selected; subsequent steps run for each selected MD5.
- Resolve to components: Each selected MD5 may appear in multiple repositories (clones, forks, or dependents). The engine resolves these to a list of candidate components — see Component Ranking Logic below.
- Refine line ranges: Small or spurious line ranges are removed as false positives, and overlapping ranges are merged.
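The sort, matchmap, and selection steps above can be sketched as follows. This is a minimal illustration, assuming the pre-processing map is already available as a Python dict from CRC32 to a list of MD5s. The names build_matchmap and select_largest are hypothetical and line-range tracking is left out. Counting the first sighting of an MD5 as one hit, and stopping at the first MD5 that no longer fits, are simplifying assumptions rather than confirmed engine behaviour.

```python
MATCHMAP_CAPACITY = 10_000  # fixed capacity stated in the text


def build_matchmap(crc_to_md5s, capacity=MATCHMAP_CAPACITY):
    """Count hits per MD5, processing the most distinctive CRC32s first."""
    hits = {}
    # Sort ascending by list size: CRC32s linked to fewer MD5s come first.
    for _crc, md5s in sorted(crc_to_md5s.items(), key=lambda kv: len(kv[1])):
        for md5 in md5s:
            if md5 in hits:
                hits[md5] += 1          # MD5 already in the map: add a hit
            elif len(hits) < capacity:
                hits[md5] = 1           # first sighting (assumed to count as 1)
            else:
                return hits             # matchmap is full: stop processing
    return hits


def select_largest(hits):
    """Return the MD5s with the most hits (the largest common snippets)."""
    if not hits:
        return []
    top = max(hits.values())
    return sorted(md5 for md5, count in hits.items() if count == top)


crc_map = {0xAAAA: ["m1", "m2", "m3"], 0xBBBB: ["m1"], 0xCCCC: ["m1", "m2"]}
print(select_largest(build_matchmap(crc_map)))  # ['m1']
```

Processing short lists first means that, if the capacity limit is reached, the entries already in the matchmap come from the most distinctive fingerprints.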
Line range generation
Line ranges reported in snippet matches are approximate, due to the nature of the winnowing algorithm. Each matched CRC32 carries two pieces of positional information:
- Line number: The line in the scanned file where the fingerprint starts.
- OSS line number: The line in the matching open-source file where the fingerprint starts.
Each matched fingerprint is then assigned to a line range in one of two ways:
- A nearby line range already exists (within a configurable tolerance gap): the existing range is extended to include the new fingerprint.
- No nearby range exists: a new range is created.
OSS from value indicates the starting line in the
matching open-source file.
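The extend-or-create rule for line ranges can be sketched as below. This is an illustration under stated assumptions: the function name build_ranges is hypothetical, the tolerance of 10 lines is an arbitrary example value (the text only says the gap is configurable), and only the scanned-file line numbers are handled.

```python
def build_ranges(lines, tolerance=10):
    """Group fingerprint start lines into approximate (start, end) ranges."""
    ranges = []
    for line in sorted(lines):
        if ranges and line - ranges[-1][1] <= tolerance:
            ranges[-1][1] = line         # nearby range exists: extend it
        else:
            ranges.append([line, line])  # no nearby range: create a new one
    return [tuple(r) for r in ranges]


print(build_ranges([1, 5, 8, 40, 45, 100]))
# [(1, 8), (40, 45), (100, 100)]
```

A range such as (100, 100), which is backed by a single fingerprint, is the kind of small range that the refinement step described earlier could discard as a likely false positive.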
Component Ranking Logic
When a file is present in multiple components or versions in the KB, the engine applies a series of rules to determine the best match:
- SBOM / context priority: The engine can receive context information, for example via an SBOM (Software Bill of Materials) supplied with --sbom. Any component that matches the provided context is always selected and placed at the top of the candidate list, regardless of release date.
- First component released: If no context match is found, the engine defaults to the oldest component in the KB (i.e., the first released) as the best match.
- Tie-breaking: If two components share the same release date, the project-level release date is used as a tiebreaker.
- Component hint: The scanning client can optionally pass a component hint (the name of the most recently detected component) to guide matching. The engine favours files belonging to a component that matches this hint.
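The first three rules above amount to a multi-key sort, sketched below. The Candidate type, its field names, and the rank function are hypothetical and chosen for illustration. The component hint is left out because the text does not say how it is weighed against the other rules.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Candidate:
    purl: str
    release_date: date           # release date of this component version
    project_release_date: date   # project-level release date (tiebreaker)


def rank(candidates, context_purls=frozenset()):
    """Order candidates best-first: context matches, then oldest release."""
    return sorted(
        candidates,
        key=lambda c: (
            c.purl not in context_purls,  # False sorts first: context wins
            c.release_date,               # oldest release first
            c.project_release_date,       # tiebreaker on equal release dates
        ),
    )


a = Candidate("pkg:github/x/a", date(2015, 1, 1), date(2014, 1, 1))
b = Candidate("pkg:github/x/b", date(2012, 6, 1), date(2012, 1, 1))
c = Candidate("pkg:github/x/c", date(2012, 6, 1), date(2011, 1, 1))

print([x.purl for x in rank([a, b, c])])
# ['pkg:github/x/c', 'pkg:github/x/b', 'pkg:github/x/a']
print([x.purl for x in rank([a, b, c], {"pkg:github/x/a"})])
# ['pkg:github/x/a', 'pkg:github/x/c', 'pkg:github/x/b']
```

Putting the context flag first in the sort key is what makes a context match win regardless of release date.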
JSON Report
Once the best match and its component are selected, the engine queries additional component-level metadata (licences, known vulnerabilities, and so on) and serialises everything into a JSON result written to STDOUT.
By default, only the best match is included in the output. To inspect the full
candidate list (the top M components for each of the N matched files), pass the
-F256 flag. By default, M is 3; N is determined automatically at scan time.