# Fast Winnowing & Fingerprinting

> Fingerprinting is the process of creating a unique digital signature (fingerprint) for source code files using the Winnowing Algorithm.

SCANOSS uses this technique to identify and match code snippets across its knowledge
base, enabling accurate open-source component detection and licence compliance analysis.

The Winnowing Algorithm has been used for many years in academic networks to detect
plagiarism by comparing fingerprints against known texts and source code. SCANOSS
adopted this algorithm because of its well-established theoretical foundation and
demonstrated performance in large-scale code comparison tasks.

The SCANOSS implementation generates a WFP (Winnowing FingerPrint) for each file,
which contains metadata and a series of hash values representing the code's unique
characteristics.

## [Winnowing Algorithm](https://github.com/scanoss/wfp)

The Winnowing Algorithm converts source code into fingerprints through four key steps:

1. **Normalisation**
2. **Gram Fingerprinting**
3. **Window Selection**
4. **Output Formatting**

### Normalisation

The normalisation process strips every non-alphanumeric character (spaces, punctuation,
operators, and other special characters) from the input source code and converts the
remaining characters to lowercase.

#### Example

```c theme={null}
for (uint32_t i = 0; i < src_len; i++)
{
    if (src[i] == '\n') line++;
    uint8_t byte = normalize(src[i]);
    if (!byte) continue;
    gram[gram_ptr++] = byte;
}
```

#### After Normalisation

```
foruint32ti0isrcleniifsrcinlineuint8tbytenormalizesrciifbytecontinuegramgramptrbyteifgramptrgramwindowwindowptrcalccrc32c...
```

All spaces, operators, brackets, and punctuation are removed, leaving only lowercase
alphanumeric characters. The output runs past the snippet shown above because it is
taken from a longer source file.
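
The normalisation rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the SCANOSS implementation: the function name `normalise` is hypothetical, and the sketch assumes that only ASCII letters and digits are kept. It also ignores the line tracking that the C example performs.

```python
def normalise(source: str) -> str:
    """Keep ASCII letters and digits only, lowercased (assumed rule)."""
    return "".join(
        ch.lower() for ch in source if ch.isascii() and ch.isalnum()
    )


if __name__ == "__main__":
    print(normalise("for (uint32_t i = 0; i < src_len; i++)"))
```

Applied to the first line of the C example, the sketch yields `foruint32ti0isrcleni`, which matches the start of the normalised output shown above.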

### Gram Fingerprinting

From the normalised code, overlapping data samples called **grams** are extracted and
fingerprinted. Each gram is a 30-byte sequence taken from the normalised code, and each
sequence is hashed using CRC32C.

> **Note:** The example below uses 10-character sequences for readability. In
> production, SCANOSS uses 30-byte sequences.

#### Example

```
foruint32t = 1adf644b
oruint32ti = 6f72669d
ruint32ti0 = 88ad5ece
uint32ti0i = d368b44c
int32ti0is = 2123892a
nt32ti0isr = 336cdfdd
t32ti0isrc = 1c8e832d
```

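Each gram is hashed to a 32-bit CRC32C value, shown here in hexadecimal. Distinct grams
usually produce distinct hashes, although collisions are possible with any 32-bit
checksum.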
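
As a rough sketch of this step, the Python below slides a fixed-length gram over the normalised text one byte at a time and hashes each gram with a table-driven CRC32C (Castagnoli polynomial). The names `crc32c` and `gram_hashes` are hypothetical. The sketch uses the standard CRC32C parameters, so its output may not reproduce the example values listed above if the SCANOSS implementation differs in seed or byte handling.

```python
def _make_crc32c_table():
    # Reflected form of the Castagnoli polynomial used by CRC32C.
    poly = 0x82F63B78
    table = []
    for n in range(256):
        crc = n
        for _ in range(8):
            crc = (crc >> 1) ^ poly if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _make_crc32c_table()


def crc32c(data: bytes) -> int:
    """Standard CRC32C: initial value and final XOR are both 0xFFFFFFFF."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def gram_hashes(normalised: str, gram: int = 30) -> list:
    """Hash every overlapping `gram`-byte slice of the normalised text."""
    raw = normalised.encode("ascii")
    # Inputs shorter than one gram produce no hashes.
    return [crc32c(raw[i:i + gram]) for i in range(len(raw) - gram + 1)]


if __name__ == "__main__":
    for value in gram_hashes("foruint32ti0isrc", gram=10):
        print(f"{value:08x}")
```

A normalised input of `n` bytes produces `n - gram + 1` hashes, so every byte after the first gram adds one fingerprint candidate for the window selection step.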

### Window Selection

From the series of gram fingerprints, a sliding window is applied to select
representative hashes:

* **Window size**: 64 gram fingerprints
* **Selection method**: Choose the **minimum hash** from each window

Consistently selecting the minimum hash skews the distribution of selected values
towards the lower end of the hash space, which can cause imbalance in database indices.
To counteract this, SCANOSS applies a secondary hash to each selected value, ensuring
uniform distribution across the index.
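
The selection step can be sketched as follows. This is an illustration under stated assumptions rather than the SCANOSS code: the function name `select_minima` is hypothetical, ties are broken in favour of the leftmost minimum, and a minimum is recorded only when its position differs from the previous selection so that overlapping windows do not emit duplicates. The optional `rehash` parameter is a hypothetical hook standing in for the secondary hash.

```python
def select_minima(hashes, window=64, rehash=None):
    """Return (position, value) pairs for the minimum of each full window.

    hashes: list of integer gram hashes in source order.
    rehash: optional callable applied to each selected value
            (stand-in for the secondary hash; an assumption).
    """
    selected = []
    last_pick = None
    for start in range(len(hashes) - window + 1):
        chunk = hashes[start:start + window]
        # Position of the leftmost minimum within the whole sequence.
        pick = start + chunk.index(min(chunk))
        if pick != last_pick:
            value = hashes[pick]
            selected.append((pick, rehash(value) if rehash else value))
            last_pick = pick
    return selected


if __name__ == "__main__":
    print(select_minima([5, 3, 8, 1, 9, 2, 7], window=3))
```

With a window of 3, the input `[5, 3, 8, 1, 9, 2, 7]` yields the values 3, 1, and 2: the minimum 1 wins three consecutive windows but is recorded once. Sequences shorter than one window select nothing in this sketch.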

#### Why These Values?

The values **gram=30** and **window=64** were chosen after extensive testing across
multiple programming languages (C, Java, JavaScript, Ruby) to provide the optimal
balance between:

* **Footprint**: Number of fingerprints generated (affects storage and performance)
* **Uniformity**: Even distribution of hash values (prevents database index imbalance)
* **Match accuracy**: Ability to find matches even in modified code

### Output Formatting

The fingerprints are formatted as a **.wfp** file with:

* File metadata (MD5 hash, file size, file path)
* Line numbers where each fingerprint was found
* The hash values themselves
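
A minimal sketch of this formatting step is shown below, assuming the fingerprints arrive as `(line_number, hash)` pairs in source order. The function name `format_wfp` is hypothetical, and the layout follows the WFP example in this document: hashes are written as eight lowercase hexadecimal digits and grouped by line number.

```python
import hashlib


def format_wfp(path: str, content: bytes, fingerprints) -> str:
    """Render one file's fingerprints as WFP text.

    fingerprints: iterable of (line_number, hash_value) pairs in source order.
    """
    by_line = {}  # dicts keep insertion order, so lines stay in source order
    for line_no, value in fingerprints:
        by_line.setdefault(line_no, []).append(f"{value:08x}")

    header = f"file={hashlib.md5(content).hexdigest()},{len(content)},{path}"
    body = [f"{line_no}={','.join(values)}" for line_no, values in by_line.items()]
    return "\n".join([header, *body]) + "\n"


if __name__ == "__main__":
    sample = [(3, 0x688C09FE), (3, 0xFC6D701D), (5, 0x04A0062B)]
    print(format_wfp("test.c", b"int main(void) { return 0; }\n", sample), end="")
```

Zero-padding matters here: a value such as `0x04a0062b` must keep its leading zero so that every hash occupies exactly eight characters.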

## WFP File Format

A WFP (Winnowing FingerPrint) file is a plain-text format, readable by both machines and
humans, that contains fingerprints for one or more source code files.

### Structure

The .wfp file contains:

1. **File declarations** with metadata
2. **Fingerprints** organised by line number

### Example WFP File

```
file=34cff02ed13a3d26e716e473d4e8900d,948,test.c
3=688c09fe,fc6d701d,61b2b37c
5=5f7b1b19,99181ce1,79923cb2,64691599
6=f218cd1c
8=7cf9f396,17c3dd99
10=3a693f60,fb9493ca,54fc128c
12=6f8dfa99,d3f3a3ca,04a0062b
13=bccec1a8,1657ceac
15=4dde1f15,a4c8bf7a
16=b657086d,39b9f206,bec983db,2978bdfa
18=1fb6cdda
20=c18636e3,47091215,7f040b14
```

### Format Components

**File Declaration:**

```
file=<MD5_HASH>,<FILE_SIZE>,<FILE_PATH>
```

* **MD5\_HASH**: MD5 checksum of the entire file (used for exact file matching)
* **FILE\_SIZE**: File size in bytes
* **FILE\_PATH**: Relative path to the file

**Fingerprint Lines:**

```
<LINE_NUMBER>=<HASH1>,<HASH2>,<HASH3>,...
```

* **LINE\_NUMBER**: The source line number where these fingerprints were generated
* **HASH**: CRC32C hash values representing the normalised code at that line
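
Because the format is line-oriented, it can be read back with a short parser. The sketch below is illustrative only: `parse_wfp` is a hypothetical helper, not part of scanoss-py, and it assumes that a file path may itself contain commas, so the declaration is split on the first two commas only.

```python
def parse_wfp(text: str) -> list:
    """Parse WFP text into a list of per-file dictionaries."""
    files = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        key, _, value = line.partition("=")
        if key == "file":
            # Split on the first two commas only; the path may contain commas.
            md5, size, path = value.split(",", 2)
            current = {"md5": md5, "size": int(size), "path": path, "fingerprints": {}}
            files.append(current)
        elif key.isdigit() and current is not None:
            current["fingerprints"][int(key)] = value.split(",")
    return files


if __name__ == "__main__":
    sample = (
        "file=34cff02ed13a3d26e716e473d4e8900d,948,test.c\n"
        "3=688c09fe,fc6d701d,61b2b37c\n"
        "6=f218cd1c\n"
    )
    for entry in parse_wfp(sample):
        print(entry["path"], entry["size"], sorted(entry["fingerprints"]))
```

Each parsed entry maps source line numbers to the list of hash strings recorded for that line, which makes it straightforward to inspect what a scan would transmit.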

## Fingerprinting with SCANOSS-PY

### Basic Fingerprinting

Generate fingerprints for a file or directory:

```bash theme={null}
scanoss-py fingerprint /path/to/code
```

### Fingerprint a Specific File

```bash theme={null}
scanoss-py fingerprint /path/to/file.py
```

### Output to File

Save fingerprints to a specific file:

```bash theme={null}
scanoss-py fingerprint /path/to/code -o fingerprints.wfp
```

### What Programming Languages are Supported?

Fingerprinting works with **any text-based programming language** because it operates
on normalised character sequences rather than language-specific syntax. It has been
tested extensively on:

* C/C++
* Java
* JavaScript/TypeScript
* Python
* Ruby
* Go
* Rust
* PHP
* Other text-based languages

## Files Skipped During Fingerprinting

By default, SCANOSS skips fingerprinting for file types that are not suitable for code
matching. Use the `--all-extensions` flag to override this behaviour.

**Binary and Archive Files**

* `.exe`, `.zip`, `.tar`, `.tgz`, `.gz`, `.7z`, `.rar`
* `.jar`, `.war`, `.ear`, `.whl`, `.bin`, `.app`, `.out`

**Compiled and Object Files**

* `.class`, `.pyc`, `.o`, `.a`, `.so`, `.obj`, `.dll`, `.lib`

**Document and Office Files**

* `.doc`, `.docx`, `.xls`, `.xlsx`, `.ppt`, `.pptx`, `.pdf`
* `.odt`, `.ods`, `.odp`, `.pages`, `.key`, `.numbers`

**Data and Configuration Files**

* `.json`, `.xml`, `.html`, `.htm`, `.dat`, `.lst`, `.xsd`, `.pom`, `.mf`, `.sum`

**Other Text and Web Assets**

* `.md`, `.txt`, `.min.js`, `.woff`, `.woff2`
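
The skip rule amounts to a suffix check on the file name. The sketch below is a hypothetical illustration of that idea, not the scanoss-py implementation: `is_skipped` and `SKIPPED_SUFFIXES` are invented names, the tuple holds only a subset of the extensions listed above, the case-insensitive comparison is an assumption, and the `all_extensions` parameter merely mirrors the intent of the `--all-extensions` flag.

```python
# Subset of the extensions listed above, for illustration only.
SKIPPED_SUFFIXES = (
    ".exe", ".zip", ".tar", ".jar",
    ".class", ".pyc", ".o", ".so",
    ".pdf", ".docx",
    ".json", ".xml", ".html",
    ".md", ".txt", ".min.js",
)


def is_skipped(path: str, all_extensions: bool = False) -> bool:
    """Return True when the file would be skipped under this sketch's rule."""
    if all_extensions:
        return False
    # endswith accepts a tuple, which also handles compound suffixes such as .min.js.
    return path.lower().endswith(SKIPPED_SUFFIXES)


if __name__ == "__main__":
    for name in ("src/main.c", "dist/app.min.js", "README.md"):
        print(name, is_skipped(name))
```

Matching on the end of the name rather than on the final extension is what allows `.min.js` files to be skipped while ordinary `.js` files are still fingerprinted.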

## Fingerprinting vs Scanning

**Fingerprinting** generates the .wfp file but does not compare it against the SCANOSS
knowledge base.

It is useful when you want to:

* Generate fingerprints for later analysis
* Create a .wfp file to share or archive
* Inspect what data will be transmitted during a scan

**Scanning** performs fingerprinting and compares the results against the SCANOSS
knowledge base to identify components, licences, and vulnerabilities.

To scan using a pre-generated .wfp file:

```bash theme={null}
scanoss-py scan --wfp fingerprints.wfp
```
