Winnowing Algorithm
The Winnowing Algorithm converts source code into fingerprints through four key steps:
- Normalisation
- Gram Fingerprinting
- Window Selection
- Output Formatting
Normalisation
Normalisation removes every non-alphanumeric character (spaces, punctuation, and special characters) from the input source code and converts the remaining text to lowercase.
Example
After Normalisation
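As a minimal sketch of the rule described above, the snippet below keeps only letters and digits and lowercases them. The function name `normalise` and the restriction to ASCII characters are assumptions made for illustration; the SCANOSS implementation may handle edge cases differently.

```python
def normalise(source: str) -> str:
    """Keep only ASCII letters and digits, converted to lowercase.

    Hypothetical helper: restricting to ASCII is an assumption of this sketch.
    """
    return "".join(ch.lower() for ch in source if ch.isascii() and ch.isalnum())


sample = "for (int i = 0; i < 10; i++) { total += i; }"
print(normalise(sample))  # prints: forinti0i10itotali
```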
Gram Fingerprinting
From the normalised code, overlapping data samples called grams are extracted and fingerprinted. Each gram is a 30-byte sequence taken from the normalised code, and each sequence is hashed using CRC32C.
Note: The example below uses 10-character sequences for readability. In production, SCANOSS uses 30-byte sequences.
Example
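The following is an illustrative sketch of gram extraction, not the SCANOSS implementation. Python's standard library has no CRC32C function (`zlib.crc32` uses a different polynomial), so the sketch includes a small bitwise CRC32C. The helper names `crc32c` and `gram_fingerprints` are assumptions, and the input is taken to be already normalised ASCII text.

```python
def crc32c(data: bytes) -> int:
    """Bitwise CRC32C (Castagnoli), reflected polynomial 0x82F63B78."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
    return crc ^ 0xFFFFFFFF


def gram_fingerprints(normalised: str, gram_size: int = 30) -> list[int]:
    """Hash every overlapping gram_size-byte slice of the normalised text."""
    data = normalised.encode("ascii")
    # One gram starts at every byte offset that leaves room for a full gram.
    return [crc32c(data[i:i + gram_size]) for i in range(len(data) - gram_size + 1)]


normalised = "forinti0i10itotali"
# 10-character grams, as in the note above, to keep the output short.
for i, h in enumerate(gram_fingerprints(normalised, gram_size=10)):
    print(f"{normalised[i:i + 10]} -> {h:08x}")
```

Input shorter than one gram yields no fingerprints in this sketch.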
Window Selection
From the series of gram fingerprints, a sliding window is applied to select representative hashes:
- Window size: 64 gram fingerprints
- Selection method: Choose the minimum hash from each window
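The selection rule above can be sketched as follows. This is a simplified illustration: the function name `select_fingerprints` is hypothetical, and the choice to emit a minimum only when it differs from the previously emitted value is an assumption of this sketch, not a documented detail of SCANOSS.

```python
def select_fingerprints(gram_hashes: list[int], window_size: int = 64) -> list[int]:
    """Slide a window over the gram hashes and keep each window's minimum.

    A minimum is appended only when it differs from the last value emitted,
    so consecutive windows sharing the same minimum produce one entry.
    """
    selected: list[int] = []
    last = None
    for start in range(len(gram_hashes) - window_size + 1):
        minimum = min(gram_hashes[start:start + window_size])
        if minimum != last:
            selected.append(minimum)
            last = minimum
    return selected


# Small made-up hash values and a window of 4 to keep the trace readable.
hashes = [77, 74, 42, 17, 98, 50, 17, 98, 8, 88, 67, 39, 77, 74, 42, 17, 98]
print(select_fingerprints(hashes, window_size=4))  # prints: [17, 8, 39, 17]
```

With fewer gram hashes than the window size, this sketch selects nothing. A larger window produces fewer fingerprints per file, which is the footprint trade-off.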
Why These Values?
The values gram=30 and window=64 were chosen after extensive testing across multiple programming languages (C, Java, JavaScript, Ruby) to provide the optimal balance between:
- Footprint: Number of fingerprints generated (affects storage and performance)
- Uniformity: Even distribution of hash values (prevents database index imbalance)
- Match accuracy: Ability to find matches even in modified code
Output Formatting
The fingerprints are formatted as a .wfp file with:
- File metadata (MD5 hash, filename, size)
- Line numbers where each fingerprint was found
- The hash values themselves
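A rough sketch of this step is shown below. It assumes a layout of one `file=MD5,SIZE,PATH` declaration followed by `LINE=hash,hash,...` lines with hashes written as 8-digit hexadecimal; that exact syntax is an assumption here and should be checked against a real .wfp file. The function name `format_wfp` is hypothetical.

```python
import hashlib


def format_wfp(path: str, contents: bytes, fingerprints: dict[int, list[int]]) -> str:
    """Render one file's fingerprints as WFP-style text (assumed layout)."""
    md5 = hashlib.md5(contents).hexdigest()
    # File declaration: MD5 of the whole file, size in bytes, relative path.
    lines = [f"file={md5},{len(contents)},{path}"]
    # One line per source line number, in ascending order.
    for line_no in sorted(fingerprints):
        hashes = ",".join(f"{h:08x}" for h in fingerprints[line_no])
        lines.append(f"{line_no}={hashes}")
    return "\n".join(lines) + "\n"


# Made-up hash values, for illustration only.
print(format_wfp("src/a.c", b"hello\n", {3: [1, 255], 1: [16]}), end="")
```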
WFP File Format
A WFP (Winnowing FingerPrint) file is a format, readable by both machines and humans, that contains fingerprints for one or more source code files.
Structure
The .wfp file contains:
- File declarations with metadata
- Fingerprints organised by line number
Example WFP File
Format Components
File Declaration:
- MD5_HASH: MD5 checksum of the entire file (used for exact file matching)
- FILE_SIZE: File size in bytes
- FILE_PATH: Relative path to the file
Fingerprint Lines:
- LINE_NUMBER: The source line number where these fingerprints were generated
- HASH: CRC32C hash values representing the normalised code at that line
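To show how these components fit together, here is a hedged sketch of a reader for WFP-style text. It assumes declarations of the form `file=MD5_HASH,FILE_SIZE,FILE_PATH` and fingerprint lines of the form `LINE_NUMBER=HASH,HASH,...`; both the syntax and the function name `parse_wfp` are assumptions for illustration, and the sample values are made up.

```python
def parse_wfp(text: str) -> list[dict]:
    """Parse WFP-style text into one dict per declared file (assumed layout)."""
    files: list[dict] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        key, _, value = line.partition("=")
        if key == "file":
            # Split at most twice so commas inside the path are preserved.
            md5, size, path = value.split(",", 2)
            files.append({"md5": md5, "size": int(size), "path": path, "fingerprints": {}})
        elif key.isdigit() and files:
            # A fingerprint line belongs to the most recent file declaration.
            files[-1]["fingerprints"][int(key)] = value.split(",")
        # Any other line is ignored by this sketch.
    return files


sample_wfp = (
    "file=0123456789abcdef0123456789abcdef,120,src/a.c\n"
    "4=a1b2c3d4,0000ffff\n"
    "9=deadbeef\n"
)
for entry in parse_wfp(sample_wfp):
    print(entry["path"], entry["size"], entry["fingerprints"])
```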
Fingerprinting with SCANOSS-PY
Basic Fingerprinting
Generate fingerprints for a file or directory:
Fingerprint a Specific File
Output to File
Save fingerprints to a specific file:
What Programming Languages are Supported?
Fingerprinting works with any text-based programming language because it operates on normalised character sequences rather than language-specific syntax. It has been tested extensively on:
- C/C++
- Java
- JavaScript/TypeScript
- Python
- Ruby
- Go
- Rust
- PHP
- Other text-based languages
Files Skipped During Fingerprinting
By default, SCANOSS skips fingerprinting for file types that are not suitable for code matching. Use the --all-extensions flag to override this behaviour.
Binary and Archive Files
.exe, .zip, .tar, .tgz, .gz, .7z, .rar
.jar, .war, .ear, .whl, .bin, .app, .out
.class, .pyc, .o, .a, .so, .obj, .dll, .lib
.doc, .docx, .xls, .xlsx, .ppt, .pptx, .pdf
.odt, .ods, .odp, .pages, .key, .numbers
.json, .xml, .html, .htm, .dat, .lst, .xsd, .pom, .mf, .sum
.md, .txt, .min.js, .woff, .woff2
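A filter based on the suffixes listed above might look like the sketch below. The names `SKIPPED_SUFFIXES` and `is_skipped` are hypothetical, the `all_extensions` parameter simply mirrors the --all-extensions flag, and the case-insensitive comparison is an assumption of this sketch rather than documented behaviour.

```python
# Suffixes copied from the list above; multi-part suffixes such as
# ".min.js" are matched against the end of the file name.
SKIPPED_SUFFIXES = (
    ".exe", ".zip", ".tar", ".tgz", ".gz", ".7z", ".rar",
    ".jar", ".war", ".ear", ".whl", ".bin", ".app", ".out",
    ".class", ".pyc", ".o", ".a", ".so", ".obj", ".dll", ".lib",
    ".doc", ".docx", ".xls", ".xlsx", ".ppt", ".pptx", ".pdf",
    ".odt", ".ods", ".odp", ".pages", ".key", ".numbers",
    ".json", ".xml", ".html", ".htm", ".dat", ".lst", ".xsd", ".pom", ".mf", ".sum",
    ".md", ".txt", ".min.js", ".woff", ".woff2",
)


def is_skipped(filename: str, all_extensions: bool = False) -> bool:
    """Return True if the file would be skipped under the default rules."""
    if all_extensions:
        # Mirrors the override: nothing is skipped based on its suffix.
        return False
    return filename.lower().endswith(SKIPPED_SUFFIXES)


for name in ("src/main.c", "README.md", "dist/app.min.js", "lib/util.js"):
    print(name, "skipped" if is_skipped(name) else "fingerprinted")
```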
Fingerprinting vs Scanning
Fingerprinting generates the .wfp file but does not compare it against the SCANOSS knowledge base. It is useful when you want to:
- Generate fingerprints for later analysis
- Create a .wfp file to share or archive
- Inspect what data will be transmitted during a scan