Software Bertillonage: Finding the provenance of an entity

Julius Davies (1), Daniel M. German (1), Michael W. Godfrey (2), Abram Hindle (3)
(1) University of Victoria, (2) University of Waterloo, (3) UC Davis
 
2011-bertillonage.pdf
 

Replication & Data

This site was built to help researchers replicate our 'Bertillonage' study of a proprietary e-commerce application. We also offer processed data for researchers who wish to see the results in more detail.

Instructions

Click on the hyperlinks in the two tables to the right to see additional processed data. When you drill down, you will see two types of files for each jar:

  1. *.sql.html - The raw SQL query.
  2. *.sql.txt.html - The results of the SQL query.

The full replication package is available for download: 2011-bertillonage-replication.zip (29.2 MB).

This zip file contains 80 Java binary archives (jars) that are very similar to the 84 we found in a proprietary e-commerce application running inside a North American financial institution. We also include instructions for obtaining an 81st jar.

To perform a full replication, we recommend mirroring the Maven 2 central repository. As of late 2010, a mirror of the Maven 2 central repository required approximately 200 GB of disk space and took about 5 days to download. Instructions for mirroring can be found in our original paper.

About the data

We were unable to take the jars from our original study outside of the financial institution. For replication purposes, we have re-downloaded the closest possible match for each library. All of the jars in this replication package (and summarized in the tables to the right) come from the original project sites (e.g., apache.org, sourceforge.net). We believe this creates a good approximation of the original artifacts that is also unencumbered by any legal or proprietary concerns.

Around 30% of the jars were byte-for-byte identical to jars from the original study. This is an interesting observation in itself: it shows that many libraries are never recompiled from source, even in Maven 2.
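A byte-for-byte comparison of two jars can be approximated by comparing cryptographic digests of the files. The following is a minimal sketch and not the procedure used in the study. The helper names and the choice of SHA-256 are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        # Read fixed-size chunks so large archives are not loaded at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_bytes(path_a, path_b):
    """True when both files hash to the same SHA-256 digest (hypothetical helper)."""
    return file_digest(path_a) == file_digest(path_b)
```

Equal digests indicate identical content with overwhelming probability. Two jars built from the same source can still differ in their bytes, for example because of timestamps stored in the archive, so a mismatch does not by itself mean the libraries differ.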

Page 7, Section 8.1, RQ1, Table 4:
 
Similarity index   Type of match   Perfect   Correct product   Incorrect
1                  Single               48                 4           -
                   Multiple             14                 2           -
                   Subtotal             62                 6           0
(0,1)              Single                3                 7           2
                   Multiple              -                 -           -
                   Subtotal              3                 7           2
0                  No match              -                 -           1

Total                                   65                13           3
 

Table 4: Using a binary-to-binary bertillonage technique to determine the provenance of 81* open source binary archives in a proprietary e-commerce application.
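The "Similarity index" column groups matches into three bands: exactly 1, strictly between 0 and 1, and exactly 0. As a rough illustration only, the sketch below assumes the index is a Jaccard coefficient over sets of class signatures. Please consult the original paper for the actual definition. The function names and the sample signatures are hypothetical.

```python
def similarity_index(signatures_a, signatures_b):
    """Jaccard coefficient of two signature collections (assumed definition)."""
    a, b = set(signatures_a), set(signatures_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


def band(index):
    """Map an index to the row labels used in the tables: '1', '(0,1)', '0'."""
    if index == 1:
        return "1"
    if index == 0:
        return "0"
    return "(0,1)"


subject = {"sig-A", "sig-B", "sig-C"}    # made-up signatures of the jar under study
candidate = {"sig-A", "sig-B", "sig-D"}  # made-up signatures of a repository jar
print(band(similarity_index(subject, candidate)))  # 2 shared out of 4 distinct -> (0,1)
```

Under this assumption, an index of 1 means the two archives have identical signature sets, and an index of 0 means they share no signature at all.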

 
 
 
 
Page 9, Section 8.2, RQ2, Table 7:
 
Similarity index   Type of match   Perfect   Correct product   Incorrect
1                  Single               12                 3           -
                   Multiple              6                 -           -
                   Subtotal             18                 3           0
(0,1)              Single               21                17           1
                   Multiple              5                 2           -
                   Subtotal             26                19           1
0                  No match              -                 -          14

Total                                   44                22          15
 

Table 7: Using a binary-to-source bertillonage technique to determine the provenance of 81* open source binary archives in a proprietary e-commerce application.

 

*Note: The tables above differ slightly from those in the original paper, since we are using the replication data set rather than the original data set. Please refer to the section "About the data" to the left.