Site map


_static/logoyael.png

Hamming distances

Yael provides very efficient functions to compute Hamming distances between binary vectors. Hereafter, we show how to use them in Matlab. The same functionalities are (of course) available from the C and python interface.

To make the following test, that the subdirectory ‘matlab’ of yael is in the Matlab path. If no, you can add it with the command addpath. Please also ensure that the mex-file yael_hamming.c is compiled, otherwise this test program invokes the non-optimized version implemented in yael_hamming.m, which is super-slow.

The Hamming functions operates on compact bit-vectors. A bit-vector is stored as an uint8 vector, therefore each entry of the Matlab vector stores effectively 8 bits. In the following lines, we randomly generate bitvectors for the sake of exposure.

nbits = 64;            % size of the bit vector, in bits
d = nbits / 8;         % corresponding size in terms of Matlab uint8 length
na = 100000;
nb = 1000;

% Generate some random bit vectors uniformely
a = uint8 (randi (256, d, na) - 1);
b = uint8 (randi (256, d, nb) - 1);

We display the first 8 vectors of a:

>> a(:,1:10)

ans =

  240  166  230  182  164  138   40  170  113   94
   23  200  238  162  144   43   98  115  195  206
   35  115   41  235  148  139   40   63   94   80
  202  134  251  168  210  250   59   71  101   76
   76  183  245  199    1   81  216   60   17   90
   45   64   86   24  182  160   27  127  242  233
  176  249  106   60  205   61   15  166  184  159
  133  242  243  109  106   19  165  149   98  242

Now, we can compute the pairwise distances between a and b.

ht = 16;
tic
dis = yael_hamming (a, b) ;
fprintf ('%dx%d = %d Hamming distances computed in %.3fs\n', na, nb, na*nb, toc);
disp (dis(1:8,1:6));

This typically produces the following output:

100000x1000 = 100000000 Hamming distances computed in 0.161s
   37     31     32     30     30     33
   30     30     33     31     31     28
   30     24     35     29     29     30
   30     28     25     29     31     30
   26     36     31     37     31     34
   35     33     28     32     26     35
   31     35     30     28     36     35
   33     27     36     26     30     27

where we observe that 100 million distances are computed in less than 0.2 second, at least on my laptop. Now, let detect the distances below a threshold ht. his can be done with the standard Matlab function find.

tic;
[idxa, idxb] = find(dis < ht);
fprintf ('Found %d distances below %d in %.3fs\n', numel (idxa), ht, toc);

with following output:

Found 1285 distances below 16 in 0.502s

Overall, finding these pairs takes 0.66s, the dominant cost being the one of the matlab function find! Additionally, Storying this matrix (of type uint16) is memory costly (200Mbytes)

>> whos dis
  Name           Size                  Bytes  Class     Attributes

  dis       100000x1000            200000000  uint16

Instead, what we can do is to directly retrieve the pairs of vectors in a and b whose distances are below ht. This is done as follows:

tic
[ids, hdis] = yael_hamming (a, b, ht);
fprintf ('Found %d distances below %d in %.3fs\n', numel (idxa), ht, toc);

The variable ids is a 2*nmatches vector of identifiers (for a and b), while hdis gives the corresponding distances. The output is:

Found 1285 distances below 16 in 0.185s

This is at least 3 times faster. Most importantly, the function never allocates a big matrix. It is therefore possible to compute much more distances, as shown in the example below (only on a machine with enough RAM and good cache size):

na = 100000;
nb = 100000;
nbits = 64;
d = nbits / 8;
ht = 16;

a = uint8 (randi (256, d, na) - 1);
b = uint8 (randi (256, d, nb) - 1);

tic
[ids, hdis] = yael_hamming (a, b, ht);
toc

On my machine, this takes 18.12 seconds to compute these 100,000^2= 10 billion distances! Observe that the complexity is linear in na and nb.