# Hamming distances¶

Yael provides very efficient functions to compute Hamming distances between binary vectors. Hereafter, we show how to use them in Matlab. The same functionalities are (of course) available from the C and python interface.

To make the following test, that the subdirectory ‘matlab’ of yael is in the Matlab path. If no, you can add it with the command addpath. Please also ensure that the mex-file yael_hamming.c is compiled, otherwise this test program invokes the non-optimized version implemented in yael_hamming.m, which is super-slow.

The Hamming functions operates on compact bit-vectors. A bit-vector is stored as an uint8 vector, therefore each entry of the Matlab vector stores effectively 8 bits. In the following lines, we randomly generate bitvectors for the sake of exposure.

```nbits = 64;            % size of the bit vector, in bits
d = nbits / 8;         % corresponding size in terms of Matlab uint8 length
na = 100000;
nb = 1000;

% Generate some random bit vectors uniformely
a = uint8 (randi (256, d, na) - 1);
b = uint8 (randi (256, d, nb) - 1);
```

We display the first 8 vectors of a:

```>> a(:,1:10)

ans =

240  166  230  182  164  138   40  170  113   94
23  200  238  162  144   43   98  115  195  206
35  115   41  235  148  139   40   63   94   80
202  134  251  168  210  250   59   71  101   76
76  183  245  199    1   81  216   60   17   90
45   64   86   24  182  160   27  127  242  233
176  249  106   60  205   61   15  166  184  159
133  242  243  109  106   19  165  149   98  242
```

Now, we can compute the pairwise distances between a and b.

```ht = 16;
tic
dis = yael_hamming (a, b) ;
fprintf ('%dx%d = %d Hamming distances computed in %.3fs\n', na, nb, na*nb, toc);
disp (dis(1:8,1:6));
```

This typically produces the following output:

```100000x1000 = 100000000 Hamming distances computed in 0.161s
37     31     32     30     30     33
30     30     33     31     31     28
30     24     35     29     29     30
30     28     25     29     31     30
26     36     31     37     31     34
35     33     28     32     26     35
31     35     30     28     36     35
33     27     36     26     30     27
```

where we observe that 100 million distances are computed in less than 0.2 second, at least on my laptop. Now, let detect the distances below a threshold ht. his can be done with the standard Matlab function find.

```tic;
[idxa, idxb] = find(dis < ht);
fprintf ('Found %d distances below %d in %.3fs\n', numel (idxa), ht, toc);
```

with following output:

```Found 1285 distances below 16 in 0.502s
```

Overall, finding these pairs takes 0.66s, the dominant cost being the one of the matlab function find! Additionally, Storying this matrix (of type uint16) is memory costly (200Mbytes)

```>> whos dis
Name           Size                  Bytes  Class     Attributes

dis       100000x1000            200000000  uint16
```

Instead, what we can do is to directly retrieve the pairs of vectors in a and b whose distances are below ht. This is done as follows:

```tic
[ids, hdis] = yael_hamming (a, b, ht);
fprintf ('Found %d distances below %d in %.3fs\n', numel (idxa), ht, toc);
```

The variable ids is a 2*nmatches vector of identifiers (for a and b), while hdis gives the corresponding distances. The output is:

```Found 1285 distances below 16 in 0.185s
```

This is at least 3 times faster. Most importantly, the function never allocates a big matrix. It is therefore possible to compute much more distances, as shown in the example below (only on a machine with enough RAM and good cache size):

```na = 100000;
nb = 100000;
nbits = 64;
d = nbits / 8;
ht = 16;

a = uint8 (randi (256, d, na) - 1);
b = uint8 (randi (256, d, nb) - 1);

tic
[ids, hdis] = yael_hamming (a, b, ht);
toc
```

On my machine, this takes 18.12 seconds to compute these 100,000^2= 10 billion distances! Observe that the complexity is linear in na and nb.