# Scanning a File or Directory

> Learn how to use the <a href='https://github.com/scanoss/engine' target='_blank' rel='noopener noreferrer'>SCANOSS Engine</a> to scan a file or directory against the SCANOSS Knowledgebase and interpret the identification results.

## Overview

The SCANOSS Engine is a command-line tool that scans files and directories for
open-source component matches by comparing them against the SCANOSS Knowledgebase
(KB). Results are printed to `STDOUT` in JSON format and include licence, copyright,
and component identification data.

The engine operates at the server level and receives WFP (Winnowing FingerPrint) files
from clients through the SCANOSS API's `../scan-direct` path. Each WFP file follows
this format:

```
file=FILE_MD5,FILE_SIZE,FILE_NAME
LINE_NUMBER_1=CRC32_1A,CRC32_1B.....
LINE_NUMBER_2=CRC32_2A,CRC32_2B.....
```

Each CRC32 value is produced by the winnowing algorithm and represents a fingerprint
of a specific line range within the file.
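
As an illustration of the format above, the following is a minimal Python sketch of a WFP reader. The `WfpFile` class and `parse_wfp` function are hypothetical names introduced for this example, not part of the SCANOSS Engine or its clients, and the sample MD5 and CRC32 values are made up.

```python
from dataclasses import dataclass, field


@dataclass
class WfpFile:
    """One `file=` entry plus the fingerprint lines that follow it."""
    md5: str
    size: int
    name: str
    # Maps a line number to the CRC32 values listed for that line.
    fingerprints: dict = field(default_factory=dict)


def parse_wfp(text):
    """Parse WFP text into a list of WfpFile records (illustrative only)."""
    files = []
    for raw in text.splitlines():
        line = raw.strip()
        key, sep, value = line.partition("=")
        if not sep:
            continue  # skip blank or malformed lines
        if key == "file":
            # maxsplit=2 keeps any commas inside the file name intact.
            md5, size, name = value.split(",", 2)
            files.append(WfpFile(md5, int(size), name))
        elif key.isdigit() and files:
            files[-1].fingerprints[int(key)] = value.split(",")
    return files


sample = (
    "file=0123456789abcdef0123456789abcdef,1024,src/main.c\n"
    "12=0a1b2c3d,4e5f6a7b\n"
    "15=deadbeef\n"
)
parsed = parse_wfp(sample)
print(parsed[0].name, sorted(parsed[0].fingerprints))
```

Keeping the fingerprints keyed by line number mirrors the layout of the file and makes the positional information available to later matching steps.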

**Basic syntax:**

```bash
scanoss [parameters] [TARGET]
```

`TARGET` can be a single file, a `.wfp` fingerprint file, or a directory.

## Scanning Procedure Overview

At a high level, the engine produces one of three match types for each scanned file:

* **None** — the scanned file could not be matched with anything in the KB.
* **Snippet** — a portion of the scanned file was matched with a portion of a known open-source file.
* **Full** — the entire content of the scanned file matches an existing file in the KB.

Regardless of the match type, every scan follows the same five main steps:

1. Determine the type of match (full file or snippet).
2. For snippet matches only: identify the most meaningful open-source file (MD5) among multiple candidates containing the matched snippet.
3. Link the matched file with component information (URL, purl, release date, etc.). There may be more than one candidate; by default, the engine selects the best match.
4. Query component-level information (licence, vulnerabilities, etc.).
5. Produce the JSON report.

The sections below describe each of these steps in detail.

## File Matching Logic

The engine attempts to match each scanned file against the KB using the following
sequence:

1. **URL match**: Does the file exactly match a known package archive at a registered
   URL? If so, the identification type (`id`) is `"url"`.
2. **File match**: The engine queries the KB "file table" for the file's MD5
   (`FILE_MD5`). If the MD5 is present, the identification type (`id`) is
   `"file"`.
3. **Snippet match**: If neither of the above applies, the engine performs a snippet
   comparison using the CRC32 fingerprints from the WFP. The identification type
   (`id`) is `"snippet"`.
4. **Binary match**: For binary files, identification is performed via binary
   fingerprinting. The identification type (`id`) is `"binary"`.
5. **No match**: If none of the above apply, the identification type (`id`) is
   `"none"`.
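
The sequence above can be pictured as a chain of lookups that returns the first `id` that applies. The sketch below is a toy model only: `ToyKB`, its fields, and `identify` are hypothetical, the KB is replaced by in-memory sets, and the checks follow the order of the list as written rather than the engine's actual internals.

```python
from dataclasses import dataclass, field


@dataclass
class ToyKB:
    """In-memory stand-in for the KB tables (hypothetical structure)."""
    url_md5s: set = field(default_factory=set)        # known package archives
    file_md5s: set = field(default_factory=set)       # known source files
    snippet_crc32s: set = field(default_factory=set)  # known fingerprints
    binary_hashes: set = field(default_factory=set)   # known binary fingerprints


def identify(kb, file_md5, crc32s=(), binary_hashes=()):
    """Return the identification type, checking in the documented order."""
    if file_md5 in kb.url_md5s:
        return "url"
    if file_md5 in kb.file_md5s:
        return "file"
    if any(crc in kb.snippet_crc32s for crc in crc32s):
        return "snippet"
    if any(h in kb.binary_hashes for h in binary_hashes):
        return "binary"
    return "none"


kb = ToyKB(file_md5s={"md5-known"}, snippet_crc32s={"0a1b2c3d"})
print(identify(kb, "md5-known"))                        # exact file match
print(identify(kb, "md5-other", crc32s=["0a1b2c3d"]))   # falls through to snippet
print(identify(kb, "md5-other"))                        # nothing matched
```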

## Snippet Matching in Detail

When a full file match is not found, the engine performs snippet analysis — a
multi-step process designed to balance match accuracy with processing performance.

### How snippet matching works

Each WFP fragment (a CRC32 value) represents a code fingerprint and is associated
with a series of file MD5s in the KB. When two WFP fragments share an associated
MD5, that MD5 accumulates a "hit". The more hits an MD5 has, the larger the common
snippet between the scanned file and the matched open-source file.

Because each CRC32 may be associated with many MD5s, the number of comparisons can
grow very large. To keep processing time reasonable, the engine prioritises CRC32s
linked to fewer MD5s — these represent less common (more distinctive) code fragments
and therefore provide higher-value signal.

### Snippet scanning procedure

1. **Pre-processing map**: The engine queries the KB for the MD5s associated with
   each CRC32 and builds a map in the following format:

   ```
   CRC32-1, CRC32-1-LINE, CRC32-1-LIST_SIZE, MD5-1...MD5-N
   CRC32-2, CRC32-2-LINE, CRC32-2-LIST_SIZE, MD5-1...MD5-N
   ...
   ```

   Each row records a CRC32, its line number, the number of linked MD5s (`LIST_SIZE`),
   and the MD5s themselves.

2. **Sort by list size**: The map is sorted ascending by list size, so less popular
   (more distinctive) fingerprints are processed first.

3. **Build matchmap**: Each MD5 is processed and added to a matchmap. If an MD5
   already exists in the map, a hit is added. The matchmap tracks line ranges and has
   a fixed capacity of 10,000 MD5s. Processing stops when all CRC32s are handled or
   the matchmap is full.

4. **Select largest snippets**: The engine selects the MD5s with the most hits (the
   largest common snippets). One or more may be selected; subsequent steps run for
   each selected MD5.

5. **Resolve to components**: Each selected MD5 may appear in multiple repositories
   (clones, forks, or dependents). The engine resolves these to a list of candidate
   components — see [Component Ranking Logic](#component-ranking-logic) below.

6. **Refine line ranges**: Small or spurious line ranges are removed as false
   positives, and overlapping ranges are merged.
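
Steps 1 to 4 above can be sketched in Python as follows. This is an illustration under stated assumptions, not the engine's implementation: `build_matchmap` and `select_largest` are hypothetical names, the KB is modelled as a plain dictionary from CRC32 to a list of MD5s, and the matchmap records only the scanned-file lines that hit each MD5. For simplicity the first sighting of an MD5 is counted as a hit too, which does not change the ranking.

```python
MATCHMAP_CAPACITY = 10_000  # fixed capacity quoted in step 3


def build_matchmap(fingerprints, kb_lookup, capacity=MATCHMAP_CAPACITY):
    """Return {md5: [scanned-file lines that hit it]}.

    fingerprints: list of (crc32, line) pairs taken from the WFP.
    kb_lookup: dict mapping crc32 -> list of MD5s (a stand-in for the KB).
    """
    # Step 1: pre-processing map rows of (crc32, line, list_size, md5s).
    rows = []
    for crc32, line in fingerprints:
        md5s = kb_lookup.get(crc32, [])
        if md5s:
            rows.append((crc32, line, len(md5s), md5s))

    # Step 2: most distinctive fingerprints (shortest MD5 lists) first.
    rows.sort(key=lambda row: row[2])

    # Step 3: accumulate hits per MD5 until done or the matchmap is full.
    matchmap = {}
    for _crc32, line, _size, md5s in rows:
        for md5 in md5s:
            if md5 not in matchmap:
                if len(matchmap) >= capacity:
                    return matchmap  # full: stop processing
                matchmap[md5] = []
            matchmap[md5].append(line)
    return matchmap


def select_largest(matchmap):
    """Step 4: return every MD5 tied for the highest hit count."""
    if not matchmap:
        return []
    best = max(len(lines) for lines in matchmap.values())
    return [md5 for md5, lines in matchmap.items() if len(lines) == best]


toy_kb = {
    "aaaa0001": ["m1", "m2", "m3"],  # common fragment, processed last
    "bbbb0002": ["m1"],              # distinctive fragment, processed first
    "cccc0003": ["m1", "m2"],
}
wfp_fingerprints = [("aaaa0001", 10), ("bbbb0002", 12), ("cccc0003", 14), ("dddd0004", 20)]
matchmap = build_matchmap(wfp_fingerprints, toy_kb)
print(select_largest(matchmap))
```

Sorting before filling the matchmap means that, if the capacity is reached, the entries already stored come from the most distinctive fingerprints.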

### Line range generation

Line ranges reported in snippet matches are approximate, due to the nature of the
winnowing algorithm. Each matched CRC32 carries two pieces of positional information:

* **Line number**: The line in the *scanned* file where the fingerprint starts.
* **OSS line number**: The line in the *matching open-source* file where the fingerprint starts.

As the matchmap is built, the engine attempts to group nearby fingerprints into
contiguous line ranges. When a new fingerprint is added, one of two things happens:

1. A nearby line range already exists (within a configurable tolerance gap) — the
   existing range is extended to include the new fingerprint.
2. No nearby range exists — a new range is created.

After this initial pass, the engine runs a final optimisation step: it increases the
tolerance gap and merges ranges until the total number of ranges is at most 10. This
keeps the JSON report concise while preserving the most meaningful match regions.

For example, an initial pass might produce 14 ranges:

```
144467 Accepted ranges (min lines range = 2):
144470 0 = 62 to 66 - OSS from: 171
144471 1 = 87 to 90 - OSS from: 199
144475 2 = 98 to 98 - OSS from: 125
144477 3 = 108 to 114 - OSS from: 131
144480 4 = 120 to 120 - OSS from: 138
144483 5 = 128 to 134 - OSS from: 141
144486 6 = 136 to 148 - OSS from: 40
144490 7 = 156 to 156 - OSS from: 87
144494 8 = 206 to 209 - OSS from: 171
144496 9 = 233 to 240 - OSS from: 199
144499 10 = 283 to 296 - OSS from: 125
144500 11 = 301 to 310 - OSS from: 141
144503 12 = 313 to 320 - OSS from: 43
144507 13 = 325 to 330 - OSS from: 110
```

The optimisation step then increases the tolerance and merges nearby ranges:

```
144509 Range tolerance: 8
144512 join range 2 with 1
144513 join range 4 with 2
144518 join range 5 with 2
144519 join range 6 with 2
144521 join range 7 with 2
144523 join range 11 with 5
144526 join range 12 with 5
144528 join range 13 with 5
144530 Final ranges:
144533 0 = 62 to 66 - OSS from: 171
144534 1 = 87 to 98 - OSS from: 199
144537 2 = 108 to 156 - OSS from: 131
144539 3 = 206 to 209 - OSS from: 171
144540 4 = 233 to 240 - OSS from: 199
144541 5 = 283 to 330 - OSS from: 125
```

The 14 initial ranges are reduced to 6 final ranges, each spanning a broader but more
meaningful match region. The `OSS from` value indicates the starting line in the
matching open-source file.
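
The merge shown in the log can be approximated with a short Python sketch. It assumes that two ranges are joined when the gap between the end of one and the start of the next is at most the tolerance, and that a merged range keeps the `OSS from` value of its first member; both assumptions are inferred from the example output, not taken from the engine source. `LineRange`, `merge_ranges`, and `reduce_ranges` are hypothetical names, and the starting tolerance and step in `reduce_ranges` are placeholders.

```python
from typing import NamedTuple


class LineRange(NamedTuple):
    start: int     # first matched line in the scanned file
    end: int       # last matched line in the scanned file
    oss_from: int  # starting line in the matching open-source file


def merge_ranges(ranges, tolerance):
    """Join ranges whose gap to the previous range is <= tolerance."""
    merged = []
    for rng in sorted(ranges):
        if merged and rng.start - merged[-1].end <= tolerance:
            prev = merged[-1]
            merged[-1] = LineRange(prev.start, max(prev.end, rng.end), prev.oss_from)
        else:
            merged.append(rng)
    return merged


def reduce_ranges(ranges, max_ranges=10, tolerance=1, step=1):
    """Widen the tolerance until at most max_ranges ranges remain."""
    result = merge_ranges(ranges, tolerance)
    while len(result) > max_ranges:
        tolerance += step
        result = merge_ranges(ranges, tolerance)
    return result, tolerance


# The 14 accepted ranges from the example log above.
INITIAL = [
    LineRange(62, 66, 171), LineRange(87, 90, 199), LineRange(98, 98, 125),
    LineRange(108, 114, 131), LineRange(120, 120, 138), LineRange(128, 134, 141),
    LineRange(136, 148, 40), LineRange(156, 156, 87), LineRange(206, 209, 171),
    LineRange(233, 240, 199), LineRange(283, 296, 125), LineRange(301, 310, 141),
    LineRange(313, 320, 43), LineRange(325, 330, 110),
]

final = merge_ranges(INITIAL, tolerance=8)
for index, rng in enumerate(final):
    print(f"{index} = {rng.start} to {rng.end} - OSS from: {rng.oss_from}")
```

With a tolerance of 8, this sketch yields the same six final ranges as the log. `reduce_ranges` illustrates the "widen until at most 10" loop, but because its starting tolerance and step are assumed, it may stop at a different tolerance than the engine does.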

## Component Ranking Logic

When a file is present in multiple components or versions in the KB, the engine
applies a series of rules to determine the best match:

* **SBOM / context priority**: The engine can receive context information — for
  example, via an SBOM (Software Bill of Materials) supplied with `--sbom`. Any
  component that matches the provided context is always selected and placed at the
  top of the candidate list, regardless of release date.
* **First component released**: If no context match is found, the engine defaults to
  the oldest component in the KB (i.e., the first released) as the best match.
* **Tie-breaking**: If two components share the same release date, the project-level
  release date is used as a tiebreaker.
* **Component hint**: The scanning client can optionally pass a component hint — the
  name of the most recently detected component — to guide matching. The engine
  favours files belonging to a component that matches this hint.

Internally, candidate components are maintained in a fixed-size linked list sorted by
date (with context-matched components always at the front). If the list is full and a
higher-priority component is added, the lowest-priority entry is removed.

This resolution process runs independently for each of the N matched MD5s. The
engine then compares the top-ranked component from each MD5's candidate list and
selects the single best component to report.
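
The ordering rules above can be expressed as a sort key. The sketch below is illustrative only: `Candidate`, `rank_key`, and `add_candidate` are hypothetical names, a sorted Python list stands in for the engine's linked list, the list size is a placeholder parameter, and the tiebreak direction (older project first) is an assumption. The component hint is left out to keep the example short.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Candidate:
    purl: str
    release_date: date           # release date of this component version
    project_release_date: date   # project-level date, used to break ties
    in_context: bool = False     # True if it matches the supplied SBOM/context


def rank_key(candidate):
    """Context matches first, then oldest release, then oldest project."""
    return (
        not candidate.in_context,  # False sorts before True
        candidate.release_date,
        candidate.project_release_date,
    )


def add_candidate(candidates, new, max_size=3):
    """Insert `new`, keep the list sorted, and drop the lowest-priority overflow."""
    ranked = sorted(candidates + [new], key=rank_key)
    return ranked[:max_size]


original = Candidate("pkg:github/example/original", date(2015, 3, 1), date(2014, 1, 1))
fork = Candidate("pkg:github/example/fork", date(2019, 6, 1), date(2019, 5, 1))
declared = Candidate("pkg:github/example/declared", date(2021, 1, 1),
                     date(2020, 1, 1), in_context=True)

ranked = []
for candidate in (fork, original, declared):
    ranked = add_candidate(ranked, candidate, max_size=2)
print([c.purl for c in ranked])
```

In this toy run the context-matched component leads the list even though it is the newest, and the fork is dropped once the list exceeds its size limit.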

## JSON Report

Once the best match and its component are selected, the engine queries additional
component-level metadata — licences, known vulnerabilities, and so on — and
serialises everything into a JSON result written to `STDOUT`.

By default, only the best match is included in the output. To inspect the full
candidate list (the top M components for each of the N matched files), pass the
`-F256` flag. M defaults to 3; N is determined automatically at scan time.
