Measuring Subversions: Security and Legal Risk in Reused Software Artifacts

Julius W. Davies (University of Victoria), December 10th, 2010
 
Submitted Abstract (pdf)   Reviews   Final Version of Abstract (pdf)  

32" by 48" Poster (3.1 MB)   Presentation Slides (2.6 MB)
 
 

Replication & Data

This site was built to help researchers replicate our study of a proprietary e-commerce application. We also offer processed data for researchers who wish to see the results in more detail.

Instructions

Click the hyperlinks in the two tables to the right to see additional processed data. When you drill down, you will see two types of files for each jar:

  1. *.sql.html - The raw SQL query.
  2. *.sql.txt.html - The results of the SQL query.

The full replication package is available for download: 2011-bertillonage-replication.zip (29.2 MB).

This zip file contains 80 Java binary archives (jars) that are very similar to the 84 we found in a proprietary e-commerce application running inside a North American financial institution. We also include instructions for obtaining an 81st jar.

To perform a full replication, we recommend mirroring the Maven 2 central repository. As of late 2010, a mirror of the Maven 2 central repository requires approximately 200 GB of disk space and takes about 5 days to download. Instructions for mirroring can be found in our bertillonage paper.

About the data

We were unable to take the jars from our original study outside of the financial institution. For replication purposes we have re-downloaded the closest possible match for each library. All of the jars in this replication package (summarized in the tables to the right) come from the original project sites (e.g. apache.org, sourceforge.net). We believe this gives a good approximation of the original artifacts that is also unencumbered by any legal or proprietary concerns.

Many of the jars (around 30%) were byte-for-byte identical with jars from the original study, which is an interesting observation in and of itself, since it shows many libraries are never recompiled from source, even in Maven 2.
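A byte-for-byte comparison of this kind can be sketched as follows. This is a minimal illustration, not the tooling used in the study: the class name `JarIdentity` and its methods are hypothetical, and SHA-256 is simply one convenient digest for checking that two files hold identical bytes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: decides whether two jar files are byte-for-byte identical.
public class JarIdentity {

    // Streams the file through SHA-256 and returns the digest as lowercase hex.
    public static String sha256Hex(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        }
        StringBuilder sb = new StringBuilder();
        for (byte b : md.digest()) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    // Cheap size check first, then compare digests.
    public static boolean sameBytes(Path a, Path b) throws IOException, NoSuchAlgorithmException {
        return Files.size(a) == Files.size(b) && sha256Hex(a).equals(sha256Hex(b));
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: java JarIdentity <first.jar> <second.jar>");
            return;
        }
        boolean same = sameBytes(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println(same ? "identical" : "different");
    }
}
```

Two jars built from the same source at different times usually differ at the byte level (archive timestamps alone are enough), so an identical digest suggests that the very same binary was redistributed rather than rebuilt.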

 
Click here for the license analysis.
 
Binary to Binary Matching
 
Similarity index   Type of match   Perfect   Correct product   Incorrect
1                  Single               48                 4           -
                   Multiple             14                 2           -
                   Subtotal             62                 6           0
(0,1)              Single                3                 7           2
                   Multiple              -                 -           -
                   Subtotal              3                 7           2
0                  No match              -                 -           1
Total                                   65                13           3
 

Using a binary-to-binary bertillonage technique to determine the provenance of 81 open source binary archives in a proprietary e-commerce application.
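The table above groups each match by similarity index: exactly 1, strictly between 0 and 1, or 0. This page does not define the index, so the sketch below assumes a Jaccard-style ratio over sets of class signatures (shared signatures divided by all distinct signatures). The class name `SimilarityIndex`, its method names, and the string signatures are hypothetical; they only illustrate how a score could be placed into one of the three bands.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: a Jaccard-style similarity between two sets of signatures.
public class SimilarityIndex {

    // |a ∩ b| / |a ∪ b|; two empty sets are scored 0.0 here by convention (an assumption).
    public static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0;
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return (double) intersection.size() / union.size();
    }

    // Maps a score to the row labels used in the tables: "1", "(0,1)" or "0".
    public static String band(double similarity) {
        if (similarity == 1.0) {
            return "1";
        }
        if (similarity == 0.0) {
            return "0";
        }
        return "(0,1)";
    }

    public static void main(String[] args) {
        // Made-up signatures, for illustration only.
        Set<String> subject = new HashSet<>(Arrays.asList("A.foo()", "A.bar()", "B.baz()"));
        Set<String> candidate = new HashSet<>(Arrays.asList("A.foo()", "A.bar()", "C.qux()"));
        double sim = jaccard(subject, candidate);
        System.out.println(sim + " falls in band " + band(sim));
    }
}
```

Under this assumption, a score of 1 means the two archives expose exactly the same set of signatures, and a score of 0 means they share none.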

 
 
 
 
Binary to Source Matching
 
Similarity index   Type of match   Perfect   Correct product   Incorrect
1                  Single               12                 3           -
                   Multiple              6                 -           -
                   Subtotal             18                 3           0
(0,1)              Single               21                17           1
                   Multiple              5                 2           -
                   Subtotal             26                19           1
0                  No match              -                 -          14
Total                                   44                22          15
 

Using a binary-to-source bertillonage technique to determine the provenance of 81 open source binary archives in a proprietary e-commerce application.