Overview

The SCANOSS Engine is a command-line tool that scans files and directories for open-source component matches by comparing them against the SCANOSS Knowledgebase (KB). Results are printed to STDOUT in JSON format and include licence, copyright, and component identification data.

The engine operates at the server level and receives WFP (Winnowing FingerPrint) files from clients through the SCANOSS API’s ../scan-direct path. Each WFP file follows this format:

    file=FILE_MD5,FILE_SIZE,FILE_NAME
    LINE_NUMBER_1=CRC32_1A,CRC32_1B.....
    LINE_NUMBER_2=CRC32_2A,CRC32_2B.....

Each CRC32 value is produced by the winnowing algorithm and represents a fingerprint of a specific line range within the file.

Basic syntax:

    scanoss [parameters] [TARGET]

TARGET can be a single file, a .wfp fingerprint file, or a directory.
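
As an illustration of the WFP layout described above, the following Python sketch reads WFP text into simple records. It is not part of the engine: the names WfpFile and parse_wfp are hypothetical, the sample MD5 and CRC32 values are made up, and the parser assumes only the file= and LINE_NUMBER= line types shown in this section.

```python
from dataclasses import dataclass, field


@dataclass
class WfpFile:
    """One file= record plus its fingerprint lines (hypothetical structure)."""
    md5: str
    size: int
    name: str
    # line number -> CRC32 fingerprints (kept as hex strings) starting on that line
    fingerprints: dict[int, list[str]] = field(default_factory=dict)


def parse_wfp(text: str) -> list[WfpFile]:
    files: list[WfpFile] = []
    for raw in text.splitlines():
        line = raw.strip()
        key, sep, value = line.partition("=")
        if not sep:
            continue  # skip blank lines and anything that is not key=value
        if key == "file":
            # Split at most twice so a file name containing commas stays intact.
            md5, size, name = value.split(",", 2)
            files.append(WfpFile(md5, int(size), name))
        elif key.isdigit() and files:
            files[-1].fingerprints.setdefault(int(key), []).extend(value.split(","))
    return files


# Made-up sample data, for illustration only.
sample = """
file=0123456789abcdef0123456789abcdef,1024,src/main.c
5=0a1b2c3d,4e5f6a7b
9=8c9d0e1f
"""
parsed = parse_wfp(sample)
print(parsed[0].name, parsed[0].fingerprints)
```

Keeping the CRC32 values as strings avoids assumptions about how the engine stores them internally.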

Scanning Procedure Overview

At a high level, the engine produces one of three match types for each scanned file:
  • None — the scanned file could not be matched with anything in the KB.
  • Snippet — a portion of the scanned file was matched with a portion of a known open-source file.
  • Full — the entire content of the scanned file matches an existing file in the KB.
Regardless of the match type, every scan follows the same five main steps:
  1. Determine the type of match (full file or snippet).
  2. For snippet matches only: identify the most meaningful open-source file (MD5) among multiple candidates containing the matched snippet.
  3. Link the matched file with component information (URL, purl, release date, etc.). There may be more than one candidate; by default, the engine selects the best match.
  4. Query component-level information (licence, vulnerabilities, etc.).
  5. Produce the JSON report.
The sections below describe each of these steps in detail.

File Matching Logic

The engine attempts to match each scanned file against the KB using the following sequence:
  1. URL match: If the file exactly matches a known package archive at a registered URL, the identification type (id) is "url".
  2. File match: The engine queries the “file table” to check whether the file’s MD5 (FILE_MD5) is present in the KB. If it is, the identification type (id) is "file". If it is not, snippet analysis begins.
  3. Snippet match: If neither of the above applies, the engine performs a snippet comparison using the CRC32 fingerprints from the WFP. The identification type (id) is "snippet".
  4. Binary match: For binary files, identification is performed via binary fingerprinting. The identification type (id) is "binary".
  5. No match: If none of the above apply, the identification type (id) is "none".
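
The sequence above behaves like an ordered cascade in which the first successful check decides the id. The Python sketch below models that idea only. The lookup tables, the checks_for helper, and the hit counters are hypothetical stand-ins for the engine's KB queries, and treating the URL check as an MD5 lookup is an assumption made for brevity.

```python
from typing import Callable, Iterable

# Each check pairs an id value with a zero-argument predicate.
Check = tuple[str, Callable[[], bool]]


def identify(checks: Iterable[Check]) -> str:
    """Return the id of the first check that succeeds, or "none"."""
    for id_type, matches in checks:
        if matches():  # later checks are not evaluated once one succeeds
            return id_type
    return "none"


# Hypothetical stand-ins for KB tables, keyed by file MD5.
url_table = {"aaaa0000aaaa0000aaaa0000aaaa0000"}
file_table = {"bbbb1111bbbb1111bbbb1111bbbb1111"}


def checks_for(md5: str, snippet_hits: int = 0, binary_hits: int = 0) -> list[Check]:
    return [
        ("url", lambda: md5 in url_table),
        ("file", lambda: md5 in file_table),
        ("snippet", lambda: snippet_hits > 0),
        ("binary", lambda: binary_hits > 0),
    ]


file_result = identify(checks_for("bbbb1111bbbb1111bbbb1111bbbb1111"))
snippet_result = identify(checks_for("cccc2222cccc2222cccc2222cccc2222", snippet_hits=4))
none_result = identify(checks_for("dddd3333dddd3333dddd3333dddd3333"))
print(file_result, snippet_result, none_result)
```

Because the predicates are evaluated lazily, the more expensive snippet comparison only runs when the cheaper exact-match lookups have failed, which mirrors the order given in the list.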

Snippet Matching in Detail

When a full file match is not found, the engine performs snippet analysis — a multi-step process designed to balance match accuracy with processing performance.

How snippet matching works

Each WFP fragment (a CRC32 value) represents a code fingerprint and is associated with a series of file MD5s in the KB. When two WFP fragments share an associated MD5, that MD5 accumulates a “hit”. The more hits an MD5 has, the larger the common snippet between the scanned file and the matched open-source file. Because each CRC32 may be associated with many MD5s, the number of comparisons can grow very large. To keep processing time reasonable, the engine prioritises CRC32s linked to fewer MD5s — these represent less common (more distinctive) code fragments and therefore provide higher-value signal.

Snippet scanning procedure

  1. Pre-processing map: The engine queries the KB for the MD5s associated with each CRC32 and builds a map in the following format:
    CRC32-1, CRC32-1-LINE, CRC32-1-LIST_SIZE, MD5-1...MD5-N
    CRC32-2, CRC32-2-LINE, CRC32-2-LIST_SIZE, MD5-1...MD5-N
    ...
    
    Each row records a CRC32, its line number, the number of linked MD5s (LIST_SIZE), and the MD5s themselves.
  2. Sort by list size: The map is sorted ascending by list size, so less popular (more distinctive) fingerprints are processed first.
  3. Build matchmap: Each MD5 is processed and added to a matchmap. If an MD5 already exists in the map, a hit is added. The matchmap tracks line ranges and has a fixed capacity of 10,000 MD5s. Processing stops when all CRC32s are handled or the matchmap is full.
  4. Select largest snippets: The engine selects the MD5s with the most hits (the largest common snippets). One or more may be selected; subsequent steps run for each selected MD5.
  5. Resolve to components: Each selected MD5 may appear in multiple repositories (clones, forks, or dependents). The engine resolves these to a list of candidate components — see Component Ranking Logic below.
  6. Refine line ranges: Small or spurious line ranges are removed as false positives, and overlapping ranges are merged.
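
Steps 1 to 4 of the procedure above can be sketched in Python as follows. This is a simplified model under stated assumptions, not the engine's code: the KB lookup is replaced by an in-memory list of (CRC32, line, MD5 list) rows, build_matchmap and select_largest are hypothetical names, and a hit is modelled as one recorded line per MD5.

```python
MATCHMAP_CAPACITY = 10_000  # fixed capacity stated in step 3


def build_matchmap(rows, capacity=MATCHMAP_CAPACITY):
    """rows: iterable of (crc32, line, md5_list) tuples, as in the pre-processing map."""
    matchmap: dict[str, list[int]] = {}
    # Step 2: process the rarest (most distinctive) fingerprints first.
    for _crc32, line, md5s in sorted(rows, key=lambda row: len(row[2])):
        for md5 in md5s:
            if md5 not in matchmap:
                if len(matchmap) >= capacity:
                    return matchmap  # step 3: stop once the matchmap is full
                matchmap[md5] = []
            matchmap[md5].append(line)  # each recorded line counts as one hit
    return matchmap


def select_largest(matchmap):
    """Step 4: return the MD5(s) with the most hits."""
    if not matchmap:
        return []
    top = max(len(lines) for lines in matchmap.values())
    return sorted(md5 for md5, lines in matchmap.items() if len(lines) == top)


# Made-up rows, for illustration only.
rows = [
    ("aaaa1111", 10, ["md5_a", "md5_b", "md5_c"]),
    ("bbbb2222", 14, ["md5_a"]),
    ("cccc3333", 20, ["md5_a", "md5_b"]),
]
matchmap = build_matchmap(rows)
best = select_largest(matchmap)
print(best)  # ['md5_a']
```

In this example md5_a is linked to all three fingerprints, so it accumulates the most hits and is selected. Passing a small capacity, such as capacity=2, shows the early stop: md5_c is never added because the map is already full when it is reached.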

Line range generation

Line ranges reported in snippet matches are approximate, due to the nature of the winnowing algorithm. Each matched CRC32 carries two pieces of positional information:
  • Line number: The line in the scanned file where the fingerprint starts.
  • OSS line number: The line in the matching open-source file where the fingerprint starts.
As the matchmap is built, the engine attempts to group nearby fingerprints into contiguous line ranges. When a new fingerprint is added, one of two things happens:
  1. A nearby line range already exists (within a configurable tolerance gap) — the existing range is extended to include the new fingerprint.
  2. No nearby range exists — a new range is created.
After this initial pass, the engine runs a final optimisation step: it increases the tolerance gap and merges ranges until the total number of ranges is at most 10. This keeps the JSON report concise while preserving the most meaningful match regions. For example, an initial pass might produce 14 ranges:
144467 Accepted ranges (min lines range = 2):
144470 0 = 62 to 66 - OSS from: 171
144471 1 = 87 to 90 - OSS from: 199
144475 2 = 98 to 98 - OSS from: 125
144477 3 = 108 to 114 - OSS from: 131
144480 4 = 120 to 120 - OSS from: 138
144483 5 = 128 to 134 - OSS from: 141
144486 6 = 136 to 148 - OSS from: 40
144490 7 = 156 to 156 - OSS from: 87
144494 8 = 206 to 209 - OSS from: 171
144496 9 = 233 to 240 - OSS from: 199
144499 10 = 283 to 296 - OSS from: 125
144500 11 = 301 to 310 - OSS from: 141
144503 12 = 313 to 320 - OSS from: 43
144507 13 = 325 to 330 - OSS from: 110
The optimisation step then increases the tolerance and merges nearby ranges:
144509 Range tolerance: 8
144512 join range 2 with 1
144513 join range 4 with 2
144518 join range 5 with 2
144519 join range 6 with 2
144521 join range 7 with 2
144523 join range 11 with 5
144526 join range 12 with 5
144528 join range 13 with 5
144530 Final ranges:
144533 0 = 62 to 66 - OSS from: 171
144534 1 = 87 to 98 - OSS from: 199
144537 2 = 108 to 156 - OSS from: 131
144539 3 = 206 to 209 - OSS from: 171
144540 4 = 233 to 240 - OSS from: 199
144541 5 = 283 to 330 - OSS from: 125
The 14 initial ranges are reduced to 6 final ranges, each spanning a broader but more meaningful match region. The OSS from value indicates the starting line in the matching open-source file.
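
The merge step can be approximated with a short Python sketch. It assumes that two ranges are joined when the gap between them in the scanned file is at most the tolerance, and that the merged range keeps the OSS starting line of its first member, which is consistent with the log above. The names LineRange, merge_ranges, and reduce_ranges are hypothetical, and the starting tolerance and step used by reduce_ranges are assumptions, since the engine's actual schedule is not stated here.

```python
from typing import NamedTuple


class LineRange(NamedTuple):
    start: int     # first matched line in the scanned file
    end: int       # last matched line in the scanned file
    oss_from: int  # starting line in the matching open-source file


def merge_ranges(ranges, tolerance):
    """Join ranges whose gap in the scanned file is at most `tolerance` lines."""
    merged = []
    for rng in sorted(ranges):
        if merged and rng.start - merged[-1].end <= tolerance:
            prev = merged[-1]
            merged[-1] = prev._replace(end=max(prev.end, rng.end))
        else:
            merged.append(rng)
    return merged


def reduce_ranges(ranges, max_ranges=10, tolerance=1, step=1):
    """Raise the tolerance until at most `max_ranges` ranges remain."""
    max_ranges = max(1, max_ranges)  # guarantees the loop terminates
    current = merge_ranges(ranges, tolerance)
    while len(current) > max_ranges:
        tolerance += step
        current = merge_ranges(current, tolerance)
    return current, tolerance


# The 14 accepted ranges from the log above.
initial = [LineRange(*r) for r in [
    (62, 66, 171), (87, 90, 199), (98, 98, 125), (108, 114, 131),
    (120, 120, 138), (128, 134, 141), (136, 148, 40), (156, 156, 87),
    (206, 209, 171), (233, 240, 199), (283, 296, 125), (301, 310, 141),
    (313, 320, 43), (325, 330, 110),
]]

final = merge_ranges(initial, tolerance=8)
for index, rng in enumerate(final):
    print(f"{index} = {rng.start} to {rng.end} - OSS from: {rng.oss_from}")

reduced, used_tolerance = reduce_ranges(initial)
```

With a tolerance of 8, merge_ranges reproduces the six final ranges listed above. With the assumed defaults, reduce_ranges stops earlier, at a tolerance of 5 with exactly 10 ranges, which suggests the engine steps its tolerance differently from this sketch.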

Component Ranking Logic

When a file is present in multiple components or versions in the KB, the engine applies a series of rules to determine the best match:
  • SBOM / context priority: The engine can receive context information — for example, via an SBOM (Software Bill of Materials) supplied with --sbom. Any component that matches the provided context is always selected and placed at the top of the candidate list, regardless of release date.
  • First component released: If no context match is found, the engine defaults to the oldest component in the KB (i.e., the first released) as the best match.
  • Tie-breaking: If two components share the same release date, the project-level release date is used as a tiebreaker.
  • Component hint: The scanning client can optionally pass a component hint — the name of the most recently detected component — to guide matching. The engine favours files belonging to a component that matches this hint.
Internally, candidate components are maintained in a fixed-size linked list sorted by date (with context-matched components always at the front). If the list is full and a higher-priority component is added, the lowest-priority entry is removed. This resolution process runs independently for each of the N matched MD5s. The engine then compares the top-ranked component from each MD5’s candidate list and selects the single best component to report.
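
The ordering rules above can be expressed as a sort key, as in the Python sketch below. This is only an illustration: the Candidate fields, the purl values, and the add_candidate helper are hypothetical, a plain sorted list stands in for the engine's linked list, the default size of 3 is illustrative, and the component hint is left out for brevity.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Candidate:
    purl: str
    release_date: date          # release date of the matched component version
    project_release_date: date  # project-level date, used as the tiebreaker
    in_context: bool = False    # True if the component matches the supplied SBOM/context


def rank_key(candidate: Candidate):
    # False sorts before True, so context matches come first; then oldest dates win.
    return (not candidate.in_context, candidate.release_date, candidate.project_release_date)


def add_candidate(ranked, candidate, max_size=3):
    """Insert a candidate and drop the lowest-priority entries beyond max_size."""
    return sorted(ranked + [candidate], key=rank_key)[:max_size]


# Made-up candidates, for illustration only.
original = Candidate("pkg:github/example/original", date(2015, 3, 1), date(2014, 1, 1))
fork = Candidate("pkg:github/example/fork", date(2015, 3, 1), date(2016, 6, 1))
vendored = Candidate("pkg:github/example/vendored", date(2021, 9, 9), date(2020, 2, 2), in_context=True)
late = Candidate("pkg:github/example/late", date(2023, 1, 1), date(2022, 1, 1))

ranked = []
for candidate in (fork, late, original, vendored):
    ranked = add_candidate(ranked, candidate)

print([c.purl for c in ranked])
```

Here the context-matched component ranks first despite being newer, the two components released on the same day are ordered by project-level date, and the newest candidate falls off the end once the list is full.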

JSON Report

Once the best match and its component are selected, the engine queries additional component-level metadata — licences, known vulnerabilities, and so on — and serialises everything into a JSON result written to STDOUT. By default, only the best match is included in the output. To inspect the full candidate list (the top M components for each of the N matched files), pass the -F256 flag. By default, M is 3; N is determined automatically at scan time.
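
Because the report is plain JSON on STDOUT, it can be post-processed with any JSON library. The Python sketch below tallies the id values described in File Matching Logic. The report shape it assumes (an object mapping each scanned file path to a list of match objects, each carrying an id field) is an assumption for illustration, and the sample data is made up; check real engine output before relying on it.

```python
import json
from collections import Counter


def tally_match_types(report_text: str) -> Counter:
    """Count id values across all files in a report of the assumed shape."""
    report = json.loads(report_text)
    counts: Counter = Counter()
    for matches in report.values():
        for match in matches:
            counts[match.get("id", "none")] += 1
    return counts


# Made-up report in the assumed shape, for illustration only.
sample_report = """
{
  "src/main.c": [{"id": "snippet"}],
  "src/util.c": [{"id": "file"}],
  "README.md": [{"id": "none"}],
  "src/new.c": [{"id": "none"}]
}
"""
counts = tally_match_types(sample_report)
print(dict(counts))
```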