What is Canvass for Compliance?
Canvass for Compliance is a proprietary, client-server scanner for the OSS Review Toolkit (ORT).
Canvass for Compliance uses Natural Language Processing and other machine learning methods to find license statements and copyright notices in source code text files. It excels at finding licensing information that was modified from its original form and copyright notices that are unusually formatted. Software license compliance requires correct identification of Free and Open Source Software dependencies. Canvass Labs is collaborating with the OSS Review Toolkit (ORT) to solve this problem.
For more information, please refer to the Getting Started guide.
Why use machine learning and AI methods?
There are around 400 standard Open Source Licenses, as shown in the Software Package Data Exchange (SPDX) license list. When analyzing thousands of OSS packages, Canvass Labs noticed that about ten to twenty percent of the time, open-source software developers alter the license statements or write non-standard copyright statements. Open-source programmers can also refer to a license instead of providing the license text itself. The exact or fuzzy text matching used by other scanners does not find all of these variations.
Canvass for Compliance protects users' intellectual property
Before sending data to the server, Canvass for Compliance blanks out all the code except comments and string literals because only they may contain license information in the source code files. Currently, it parses the following languages:
C, C#, C++, Dart, Go, Java, JavaScript, Kotlin, Perl, Python, Ruby, Rust, Scala, TypeScript, and JSX (JavaScript XML, React)
For files in other languages, it currently uploads the entire content. We are developing parsers for additional programming languages. Users can review exactly what was sent to the server by examining the contents of the native-scan-results directory in their results.
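To illustrate the blanking step described above, here is a minimal sketch for Python sources only, built on the standard-library tokenize module. It is not Canvass for Compliance's actual parser: the function name blank_code is hypothetical, and overwriting code with spaces (so that line and column positions are preserved) is an assumption about how "blanking out" might work.

```python
import io
import tokenize


def blank_code(source: str) -> str:
    """Keep comments and string literals; overwrite all other code with spaces.

    Hypothetical helper for illustration only. Line breaks are preserved so
    that findings can still be reported by line number.
    Note: on Python 3.12+, f-strings are tokenized as FSTRING_* tokens rather
    than STRING, so this sketch blanks them as well.
    """
    lines = source.splitlines(keepends=True)
    # Start from a fully blanked copy that keeps only the line breaks.
    out = [[c if c in "\r\n" else " " for c in line] for line in lines]

    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type not in (tokenize.COMMENT, tokenize.STRING):
            continue
        (start_row, start_col), (end_row, end_col) = tok.start, tok.end
        # Copy the original characters back for every line the token spans.
        for row in range(start_row, end_row + 1):
            line = lines[row - 1]
            lo = start_col if row == start_row else 0
            hi = end_col if row == end_row else len(line)
            out[row - 1][lo:hi] = line[lo:hi]

    return "".join("".join(chars) for chars in out)


if __name__ == "__main__":
    sample = (
        "# SPDX-License-Identifier: MIT\n"
        "secret = compute(42)  # Copyright 2020 Example Corp\n"
        'name = "GPL-2.0"\n'
    )
    print(blank_code(sample))
```

In this sketch the identifiers secret and compute never leave the client, while the SPDX comment, the copyright comment, and the string literal survive unchanged.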
Canvass for Compliance anonymizes file names and paths before sending them to the server because they may contain valuable information.
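The source does not say how the anonymization is performed. One plausible approach, shown here purely as an assumption, is to replace every path component with a keyed hash computed on the client. The function name anonymize_path, the key parameter, the 16-character truncation, and the decision to keep the file extension are all hypothetical choices made for this sketch.

```python
import hashlib
import hmac
from pathlib import PurePosixPath


def anonymize_path(path: str, key: bytes) -> str:
    """Replace each path component with a keyed hash (illustrative only).

    The key stays on the client, so the server cannot recover the original
    names, yet the same path always maps to the same anonymized path.
    Keeping the extension is an assumption made so that the file type
    remains visible; it is not confirmed behavior.
    """
    posix_path = PurePosixPath(path)
    hashed_parts = [
        hmac.new(key, part.encode("utf-8"), hashlib.sha256).hexdigest()[:16]
        for part in posix_path.parts
    ]
    return "/".join(hashed_parts) + posix_path.suffix


if __name__ == "__main__":
    client_key = b"example-key-kept-on-the-client"
    print(anonymize_path("src/acme/billing.py", client_key))
```

Because the mapping is deterministic for a given key, the client can translate findings reported against anonymized paths back to the real files by keeping a local lookup table.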
The server caches only the file hash (MD5) and the license findings for future scans. Canvass for Compliance deletes users' uploaded source code files after the scan.
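A cache of this kind can be pictured as a lookup table keyed by the MD5 digest of the file content. The sketch below is an assumption about the general shape, not the server's real implementation: the class FindingsCache, its methods, and the representation of findings as a list of strings are hypothetical.

```python
import hashlib
from typing import Dict, List, Optional


def file_hash(content: bytes) -> str:
    """Return the MD5 hex digest used as the cache key."""
    return hashlib.md5(content).hexdigest()


class FindingsCache:
    """Stores only hashes and findings, never the file content itself."""

    def __init__(self) -> None:
        self._findings_by_hash: Dict[str, List[str]] = {}

    def lookup(self, content: bytes) -> Optional[List[str]]:
        # A hit means an identical file was scanned before.
        return self._findings_by_hash.get(file_hash(content))

    def store(self, content: bytes, findings: List[str]) -> None:
        # Only the digest and the findings are retained.
        self._findings_by_hash[file_hash(content)] = list(findings)


if __name__ == "__main__":
    cache = FindingsCache()
    data = b"/* Licensed under the MIT License */"
    if cache.lookup(data) is None:
        cache.store(data, ["MIT"])
    print(cache.lookup(data))
```

With this design a repeated scan of an unchanged file can be answered from the cache, and nothing stored in the table can be used to reconstruct the source code.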