SCANOSS uses the Winnowing Algorithm to identify and match code snippets against its knowledge base, enabling accurate open-source component detection and licence compliance analysis. The algorithm has been used for many years in academic settings to detect plagiarism by comparing fingerprints against known texts and source code. SCANOSS adopted it because of its well-established theoretical foundation and demonstrated performance in large-scale code comparison tasks. The SCANOSS implementation generates a WFP (Winnowing FingerPrint) for each file, which contains file metadata and a series of hash values representing the code’s characteristics.

Winnowing Algorithm

The Winnowing Algorithm converts source code into fingerprints through four key steps:
  1. Normalisation
  2. Gram Fingerprinting
  3. Window Selection
  4. Output Formatting

Normalisation

Normalisation converts the input source code to lowercase and removes every non-alphanumeric character, including spaces, punctuation, operators, and other special characters.

Example

for (uint32_t i = 0; i < src_len; i++)
{
    if (src[i] == '\n') line++;
    uint8_t byte = normalize(src[i]);
    if (!byte) continue;
    gram[gram_ptr++] = byte;
}

After Normalisation

foruint32ti0isrcleniifsrcinlineuint8tbytenormalizesrciifbytecontinuegramgramptrbyteifgramptrgramwindowwindowptrcalccrc32c...
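All spaces, operators, brackets, and punctuation are removed, leaving only lowercase alphanumeric characters. The output shown continues past the end of the excerpt above, and the trailing ... marks where it is truncated.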
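The normalisation step can be sketched in a few lines of Python. This is a minimal illustration of the rule described above, not the SCANOSS implementation; the function name normalise is hypothetical, and restricting the alphabet to ASCII letters and digits is an assumption.

```python
def normalise(source: str) -> str:
    """Keep only ASCII letters and digits, lowercased (illustrative sketch)."""
    return "".join(
        ch.lower()
        for ch in source
        if ch.isascii() and ch.isalnum()  # drops spaces, operators, brackets, punctuation
    )


if __name__ == "__main__":
    print(normalise("for (uint32_t i = 0; i < src_len; i++)"))
```

Applied to the first line of the example, this yields the same leading characters as the normalised output shown above.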

Gram Fingerprinting

From the normalised code, overlapping data samples called grams are extracted and fingerprinted. Each gram is a 30-byte sequence taken from the normalised code, and each sequence is hashed using CRC32C.
Note: The example below uses 10-character sequences for readability. In production, SCANOSS uses 30-byte sequences.

Example

foruint32t = 1adf644b
oruint32ti = 6f72669d
ruint32ti0 = 88ad5ece
uint32ti0i = d368b44c
int32ti0is = 2123892a
nt32ti0isr = 336cdfdd
t32ti0isrc = 1c8e832d
Each sequence is hashed independently with CRC32C, producing one 32-bit value per gram.
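The sliding extraction of grams can be sketched as follows. This is an illustrative sketch under stated assumptions: crc32c is a plain bitwise CRC32C (Castagnoli polynomial) written out because the Python standard library does not ship one, and gram_fingerprints is a hypothetical helper name. It is not claimed to reproduce the hex values listed above, since any byte ordering or post-processing applied by SCANOSS is not described here.

```python
def crc32c(data: bytes) -> int:
    """Bitwise CRC32C (Castagnoli), reflected polynomial 0x82F63B78."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def gram_fingerprints(normalised: str, gram_size: int = 30):
    """Return (gram, hash) pairs for every overlapping gram of gram_size bytes."""
    data = normalised.encode("ascii")
    return [
        (data[i:i + gram_size].decode("ascii"), crc32c(data[i:i + gram_size]))
        for i in range(len(data) - gram_size + 1)  # empty when input is shorter than one gram
    ]


if __name__ == "__main__":
    # 10-character grams, as in the readable example above
    for gram, value in gram_fingerprints("foruint32ti0isrc", gram_size=10):
        print(f"{gram} = {value:08x}")
```

Each step advances by a single character, so a normalised input of n bytes yields n - gram_size + 1 grams.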

Window Selection

From the series of gram fingerprints, a sliding window is applied to select representative hashes:
  • Window size: 64 gram fingerprints
  • Selection method: Choose the minimum hash from each window
Consistently selecting the minimum hash skews the distribution of selected values towards the lower end of the hash space, which can cause imbalance in database indices. To counteract this, SCANOSS applies a secondary hash to each selected value, ensuring uniform distribution across the index.
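Window selection can be sketched as below. This is a minimal sketch of minimum-per-window selection, not the SCANOSS code: the function name winnow is hypothetical, and two details are assumptions borrowed from the classic description of winnowing, namely that ties go to the rightmost minimum and that a hash is emitted only when the selected position changes. The secondary hash mentioned above is left out because its definition is not given here.

```python
def winnow(hashes, window=64):
    """Select (position, hash) pairs: the minimum of each sliding window."""
    selected = []
    last_pos = -1
    for start in range(len(hashes) - window + 1):
        current = hashes[start:start + window]
        smallest = min(current)
        # Assumption: on ties, prefer the rightmost occurrence of the minimum.
        offset = window - 1 - current[::-1].index(smallest)
        pos = start + offset
        if pos != last_pos:  # emit only when a new position is selected
            selected.append((pos, smallest))
            last_pos = pos
    return selected


if __name__ == "__main__":
    # Small window for readability; the text above uses 64.
    print(winnow([5, 3, 8, 1, 9, 7, 2, 6], window=4))
```

In this sketch, an input with fewer hashes than one full window selects nothing; how very short files are handled in practice is not covered here.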

Why These Values?

The values gram=30 and window=64 were chosen after extensive testing across multiple programming languages (C, Java, JavaScript, Ruby) to provide the optimal balance between:
  • Footprint: Number of fingerprints generated (affects storage and performance)
  • Uniformity: Even distribution of hash values (prevents database index imbalance)
  • Match accuracy: Ability to find matches even in modified code

Output Formatting

The fingerprints are formatted as a .wfp file with:
  • File metadata (MD5 hash, filename, size)
  • Line numbers where each fingerprint was found
  • The hash values themselves
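The formatting step can be sketched as a small function that writes the file declaration followed by one line per source line number. The name format_wfp and the shape of its fingerprints argument (a list of (line_number, hash) pairs) are assumptions made for illustration; the field order follows the file declaration shown in the WFP File Format section.

```python
import hashlib
from itertools import groupby


def format_wfp(path: str, content: bytes, fingerprints) -> str:
    """Render one file's fingerprints as WFP text (illustrative sketch)."""
    md5 = hashlib.md5(content).hexdigest()
    lines = [f"file={md5},{len(content)},{path}"]
    by_line = sorted(fingerprints, key=lambda pair: pair[0])  # stable: keeps hash order per line
    for line_no, group in groupby(by_line, key=lambda pair: pair[0]):
        hashes = ",".join(f"{value:08x}" for _, value in group)
        lines.append(f"{line_no}={hashes}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = [(3, 0x688C09FE), (3, 0xFC6D701D), (5, 0x5F7B1B19)]
    print(format_wfp("test.c", b"int main(void) { return 0; }\n", sample), end="")
```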

WFP File Format

A WFP (Winnowing FingerPrint) file is a plain-text format, readable by both machines and humans, that contains fingerprints for one or more source code files.

Structure

The .wfp file contains:
  1. File declarations with metadata
  2. Fingerprints organised by line number

Example WFP File

file=34cff02ed13a3d26e716e473d4e8900d,948,test.c
3=688c09fe,fc6d701d,61b2b37c
5=5f7b1b19,99181ce1,79923cb2,64691599
6=f218cd1c
8=7cf9f396,17c3dd99
10=3a693f60,fb9493ca,54fc128c
12=6f8dfa99,d3f3a3ca,04a0062b
13=bccec1a8,1657ceac
15=4dde1f15,a4c8bf7a
16=b657086d,39b9f206,bec983db,2978bdfa
18=1fb6cdda
20=c18636e3,47091215,7f040b14

Format Components

File Declaration:
file=<MD5_HASH>,<FILE_SIZE>,<FILE_PATH>
  • MD5_HASH: MD5 checksum of the entire file (used for exact file matching)
  • FILE_SIZE: File size in bytes
  • FILE_PATH: Relative path to the file
Fingerprint Lines:
<LINE_NUMBER>=<HASH1>,<HASH2>,<HASH3>,...
  • LINE_NUMBER: The source line number where these fingerprints were generated
  • HASH: CRC32C hash values representing the normalised code at that line
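Because the format is line-oriented, it is straightforward to read back. The sketch below parses WFP text into a list of dictionaries; parse_wfp and the dictionary keys are hypothetical names, and the parser assumes only the two line types described above (file declarations and fingerprint lines), ignoring anything else.

```python
def parse_wfp(text: str) -> list[dict]:
    """Parse WFP text into [{'md5', 'size', 'path', 'lines': {line_no: [hashes]}}]."""
    files = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        key, _, value = line.partition("=")
        if key == "file":
            # Split at most twice so a comma inside the path is preserved.
            md5, size, path = value.split(",", 2)
            files.append({"md5": md5, "size": int(size), "path": path, "lines": {}})
        elif key.isdigit() and files:
            files[-1]["lines"][int(key)] = value.split(",")
    return files


if __name__ == "__main__":
    sample = (
        "file=34cff02ed13a3d26e716e473d4e8900d,948,test.c\n"
        "3=688c09fe,fc6d701d,61b2b37c\n"
        "6=f218cd1c\n"
    )
    for entry in parse_wfp(sample):
        print(entry["path"], entry["size"], sorted(entry["lines"]))
```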

Fingerprinting with SCANOSS-PY

Basic Fingerprinting

Generate fingerprints for a file or directory:
scanoss-py fingerprint /path/to/code

Fingerprint a Specific File

scanoss-py fingerprint /path/to/file.py

Output to File

Save fingerprints to a specific file:
scanoss-py fingerprint /path/to/code -o fingerprints.wfp

What Programming Languages are Supported?

Fingerprinting works with any text-based programming language because it operates on normalised character sequences rather than language-specific syntax. It has been tested extensively on:
  • C/C++
  • Java
  • JavaScript/TypeScript
  • Python
  • Ruby
  • Go
  • Rust
  • PHP
  • Other text-based languages

Files Skipped During Fingerprinting

By default, SCANOSS skips fingerprinting for file types that are not suitable for code matching. Use the --all-extensions flag to override this behaviour.
Binary and Archive Files
  • .exe, .zip, .tar, .tgz, .gz, .7z, .rar
  • .jar, .war, .ear, .whl, .bin, .app, .out
Compiled and Object Files
  • .class, .pyc, .o, .a, .so, .obj, .dll, .lib
Document and Office Files
  • .doc, .docx, .xls, .xlsx, .ppt, .pptx, .pdf
  • .odt, .ods, .odp, .pages, .key, .numbers
Data and Configuration Files
  • .json, .xml, .html, .htm, .dat, .lst, .xsd, .pom, .mf, .sum
Other Text and Web Assets
  • .md, .txt, .min.js, .woff, .woff2
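The skip rule can be pictured as a suffix check against the extensions listed above. This is an illustrative sketch, not the scanoss-py implementation: is_skipped and SKIPPED_SUFFIXES are hypothetical names, the all_extensions parameter merely mirrors the effect of the --all-extensions flag, and case-insensitive matching is an assumption.

```python
# Extensions copied from the lists above; compound suffixes such as .min.js
# are matched as a whole.
SKIPPED_SUFFIXES = (
    ".exe", ".zip", ".tar", ".tgz", ".gz", ".7z", ".rar",
    ".jar", ".war", ".ear", ".whl", ".bin", ".app", ".out",
    ".class", ".pyc", ".o", ".a", ".so", ".obj", ".dll", ".lib",
    ".doc", ".docx", ".xls", ".xlsx", ".ppt", ".pptx", ".pdf",
    ".odt", ".ods", ".odp", ".pages", ".key", ".numbers",
    ".json", ".xml", ".html", ".htm", ".dat", ".lst", ".xsd", ".pom", ".mf", ".sum",
    ".md", ".txt", ".min.js", ".woff", ".woff2",
)


def is_skipped(path: str, all_extensions: bool = False) -> bool:
    """Return True when the path ends with one of the skipped suffixes."""
    if all_extensions:
        return False
    # Assumption: matching ignores case.
    return path.lower().endswith(SKIPPED_SUFFIXES)


if __name__ == "__main__":
    for name in ("src/main.py", "dist/app.min.js", "lib/Foo.JAR"):
        print(name, is_skipped(name))
```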

Fingerprinting vs Scanning

Fingerprinting generates the .wfp file but does not compare it against the SCANOSS knowledge base. It is useful when you want to:
  • Generate fingerprints for later analysis
  • Create a .wfp file to share or archive
  • Inspect what data will be transmitted during a scan
Scanning performs fingerprinting and compares the results against the SCANOSS knowledge base to identify components, licences, and vulnerabilities. To scan using a pre-generated .wfp file:
scanoss-py scan --wfp fingerprints.wfp