Computer Vision | Search | Indexing | Performance Optimization

Speeding up Binary Feature Matching

Feature matching is a core operation in the hot loop of visual odometry, SLAM, and structure-from-motion. Classical pipelines treat it as a nearest-neighbor search over floating-point descriptors such as SIFT, often accelerated with approximate indexing methods.

Here we look at a different regime: learned binary descriptors assumed to be globally unique. Under this assumption, matching becomes an exact-match problem with a theoretical linear-time solution, as shown in the Global Patch Collider (GPC).

In practice, though, the obvious hash-table implementation loses badly even to a sorting-based baseline, and only pulls ahead once the data structure is redesigned around memory locality.

TL;DR: Replacing unordered_map with a cache-conscious custom hash join yielded a 19x speedup over the baseline and outperformed both Abseil and Ankerl by over 20%.

Fig 1. Subset of benchmarked methods. The baseline $O(n)$ unordered_map (yellow) is slower than the $O(n\log n)$ sort baseline (orange). Our fastest method (ours) outperforms both baselines as well as Abseil and Ankerl on this workload. Apple M3, L1: 64K, L2: 4MB, AppleClang 16.0.0.16000026

Exclusive Feature Matching

The GPC defines a valid match as one where the binary embedding of a feature is present exactly once in the source and once in the target image. For instance, for the following two multisets of integers $S=(5,1,4,2,3,4), T=(7,4,2,5,5,6)$, we'd only have one match: $(2,2)$ as shown in Fig. 2.

Fig 2. Feature matching as defined in the GPC. We retain only one valid match $(2,2)$ based on the exclusive match criterion. We exclude $4$ and $5$ because there are duplicates, and $1, 3, 6, 7$ because they're only present in one of the multisets.

We can find a matching as described above in $O(n\log n)$ by sorting $S$ and $T$: $S=(1,2,3,4,4,5), T=(2,4,5,5,6,7)$. We then do a single joint linear pass over both multisets, where we skip values that appear more than once.

Baseline implementations

The following are two simple baseline implementations of the exclusive matching, starting with the sort-based method described in the previous section:

std::vector<std::pair<int, int>>
match_sort(const std::vector<uint64_t>& src,
           const std::vector<uint64_t>& dst) {
    // Pair each descriptor with its original index, then sort by value.
    std::vector<std::pair<uint64_t, int>> a, b;
    a.reserve(src.size());
    b.reserve(dst.size());
    for (size_t i = 0; i < src.size(); ++i) a.push_back({src[i], (int)i});
    for (size_t j = 0; j < dst.size(); ++j) b.push_back({dst[j], (int)j});

    std::sort(a.begin(), a.end());
    std::sort(b.begin(), b.end());

    std::vector<std::pair<int, int>> out;
    size_t i = 0, j = 0;

    while (i < a.size() && j < b.size()) {
        // Advance past the run of equal keys on each side.
        size_t i0 = i, j0 = j;
        while (i < a.size() && a[i].first == a[i0].first) ++i;
        while (j < b.size() && b[j].first == b[j0].first) ++j;

        if (a[i0].first == b[j0].first && i - i0 == 1 && j - j0 == 1)
            out.push_back({a[i0].second, b[j0].second});  // exclusive match
        else if (a[i0].first < b[j0].first)
            j = j0;  // a's key is smaller: skip a's run, revisit b's run
        else if (b[j0].first < a[i0].first)
            i = i0;  // b's key is smaller: skip b's run, revisit a's run
    }

    return out;
}

Theoretically, the hash-table approach shown below has expected linear-time complexity, $O(n)$, under standard hashing assumptions, compared with $O(n\log n)$ for the sort-based method; in the worst case, however, hash-table operations can degrade and the overall runtime becomes $O(n^2)$.

std::vector<std::pair<int, int>>
match_hashtable(const std::vector<uint64_t>& src,
                const std::vector<uint64_t>& dst) {
    // Count occurrences and remember the last position of each descriptor.
    std::unordered_map<uint64_t, int> src_count, dst_count;
    std::unordered_map<uint64_t, int> src_pos, dst_pos;

    for (size_t i = 0; i < src.size(); ++i) {
        ++src_count[src[i]];
        src_pos[src[i]] = (int)i;
    }

    for (size_t j = 0; j < dst.size(); ++j) {
        ++dst_count[dst[j]];
        dst_pos[dst[j]] = (int)j;
    }

    // Keep only descriptors that occur exactly once on each side.
    std::vector<std::pair<int, int>> matches;
    for (const auto& [x, c] : src_count) {
        auto it = dst_count.find(x);
        if (c == 1 && it != dst_count.end() && it->second == 1) {
            matches.emplace_back(src_pos[x], dst_pos[x]);
        }
    }
    return matches;
}

With real data, the baseline based on unordered_map is significantly slower than even the sort-based method, as shown in Fig. 1 and Fig. 6.

Building a cache-conscious exclusive hash join

In the following, we discuss why the above baseline is inefficient. We then show how to improve the performance of the hash table implementation and of the join over the two multisets by reducing the constant-factor overhead for this specific matching problem.

Memory access patterns and cache locality

The most prominent issue is that typical std::unordered_map implementations use a contiguous bucket array but store elements as individually allocated nodes. As a result, reads and writes involve non-contiguous memory access and poor cache locality.

This problem becomes worse in the presence of collisions. In that case, unordered_map typically follows additional pointers to heap-allocated nodes, making accesses even less predictable and further increasing constant-time overhead.

Fig 3. Heap-allocated bucket contents in `unordered_map` lead to scattered memory access.

We can (partially) remedy this problem by upper bounding our problem size $N$ and pre-allocating a large contiguous chunk of memory dedicated to our hash table buckets as in Fig 4. This on its own improves locality by reducing pointer chasing. Keeping the table compact also helps us keep it (mostly) in cache and improves locality of linear probing.

Fig 4. Hash table bucket contents reside in contiguous pre-allocated memory region, improving locality.

Data distribution and join operation

In practice, the descriptor distribution is heavy-tailed: a small number of hashes occur many times, while most occur rarely. The duplicates stem from the structured nature of image data; GPC descriptors of areas with little texture are often very similar. This is by design, since matches between indistinct image features are often of low quality and can cause problems further down the pipeline if they are not discarded.

We conduct the actual join (finding elements that are present exactly once in the source and once in the target multisets) with a single hash table $H$ as follows:

  1. Insert source descriptors into $H$ and track multiplicity.
  2. Scan target descriptors, generating a candidate match when the source multiplicity is 1 and tombstoning it when the target multiplicity exceeds 1.
  3. Run a final compaction pass to emit only valid matches.

The hash table $H$ is open-addressed and flat. The key is the 64-bit binary feature; the value stores the count and tombstone (32-bit), the original input index (32-bit), and optionally a generation counter (32-bit) for reuse across allocations. As discussed further below, the benchmarks do not use pre-allocation and consequently no generation counter either.

To improve its performance further, the following additional optimizations were considered and included:

Array of Structs (AoS) vs Struct of Arrays (SoA)

For the value part of the table (count, index, ...), both an array-of-structs (AoS) and a struct-of-arrays (SoA) layout were considered. For the problem sizes most relevant to feature matching in computer vision (multiset sizes of up to 2M), SoA outperformed AoS. At around 2M items, however, AoS appears to start overtaking SoA, so the tradeoff may need to be re-evaluated for larger input image dimensions.
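The difference between the two layouts can be sketched as follows (illustrative types, not the benchmarked structs):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// AoS: one struct per slot. A probe loads key and value fields together,
// whether or not the values are needed yet.
struct SlotAoS { uint64_t key; uint32_t count; uint32_t index; };  // 16 bytes
struct TableAoS { std::vector<SlotAoS> slots; };

// SoA: one array per field. The probe loop streams over the key array
// alone, so more keys fit per cache line while searching; value arrays
// are only touched on a hit.
struct TableSoA {
    std::vector<uint64_t> keys;
    std::vector<uint32_t> counts;
    std::vector<uint32_t> indices;
};
```

With 64-byte cache lines, a probe over the SoA key array sees 8 keys per line versus 4 full AoS slots, which is one plausible reason SoA wins at the sizes measured here.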

Manual unrolling and prefetching

We manually unroll the main hash table insertion and probing loop by a factor of 4 to benefit from instruction-level parallelism; these runs are denoted with unrolled in the plots. We also use manual prefetching via __builtin_prefetch during linear probing.

Other design choices

The following discusses additional, orthogonal approaches that were considered but not included in the benchmarks.

Memory Pre-Allocation

In a real-time computer vision application we'd receive a stream of frame pairs, match each pair, and then move on to the next. As a result, the read:write ratio for the hash tables is roughly 1:1, which makes optimizing table construction just as important as the reading and matching itself. Typically, past frame pairs are not reused, and the image dimensions are constant for a given camera feed.

Pre-allocating and reusing fixed-size memory for the hash tables can therefore save substantial time across repeated invocations with inputs of the same image size; in production, the table memory would be preallocated and reused because image dimensions are fixed for a given camera stream. For fairness, all benchmarks in this article include allocation costs and therefore do not use preallocation for the table.

Generation count

Repeated use of the same pre-allocated memory requires clearing it between invocations. One alternative is a generation counter that marks stale entries logically instead of physically clearing the table. This was not implemented here because the benchmarked version does not reuse memory; omitting the generation counter also keeps the table entries smaller, improving cache locality.

Early rejection

As discussed earlier, many features appear with high frequency, and it could intuitively make sense to exclude them from matching early on, before they even enter the hash table. Since many duplicates appear in sequence, this was evaluated by skipping over and invalidating descriptors that repeat consecutively in the input, instead of incrementing their count. However, the poor branch predictability of this check made the optimization slower than simply incrementing the count.

If very few values dominate the distribution, one could also keep a small list of items (for instance, the all-0 or all-1 descriptor) that are excluded from matching up front. To stay flexible with respect to the input distribution, this was not implemented or evaluated, given the complexity it would add.

Radix Partitioning

Since the keys to be matched are integers, it makes sense to radix partition them into smaller buckets (see Fig. 5) based on a prefix and to then operate on smaller hash tables for the matching. In theory the smaller hash tables are more likely to fit into cache; however, constructing the radix buckets also comes at a cost: the elements must be scattered based on the prefix, which is essentially a random write. Given the roughly 1:1 read/write ratio we expect for the matching, this additional construction overhead did not pay off. Results in Fig. 6 show that the radix-bucket method (oursAosRadix) does eventually match the faster oursSoAUnroll. Depending on the cache size of the target platform and the problem size, radix buckets may still make sense.

Fig 5. Radix partitioning exemplified with a single-digit base-10 prefix used to partition the input numbers into partitions $P_0,...,P_4$. For our actual matching problem we partition on the lowest 8 bits of the 64-bit integers.
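A minimal sketch of the scatter step on the lowest 8 bits (`radix_partition` is an illustrative name, not from the benchmarked code):

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

// Scatter keys into 256 partitions by their lowest 8 bits. Keys with the
// same low byte always land in the same partition, so each partition can
// be matched independently with a small, cache-resident table.
std::array<std::vector<uint64_t>, 256>
radix_partition(const std::vector<uint64_t>& keys) {
    std::array<std::vector<uint64_t>, 256> parts;
    for (uint64_t k : keys)
        parts[k & 0xFF].push_back(k);  // the scatter: an irregular write
    return parts;
}
```

The `parts[k & 0xFF]` write is exactly the cost discussed above: its target partition is effectively random per element, which is what made the construction overhead hard to amortize at a 1:1 read/write ratio.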

Motivated similarly, a full multi-pass radix sort (avoiding a hash table altogether) was also attempted, as shown in Fig. 6, but it did not come close to the hash table implementations.

Smaller tombstones

The count per feature is only relevant up to 2; beyond that, the entry is discarded and marked invalid with a tombstone value. It would therefore make sense to shrink the count field to 8 bits (or even fewer). However, this turned out to be less performant than the current best approaches, presumably because the cache line length is then no longer an even multiple of the size of the value part of the table (particularly for the AoS implementations).

Speedup and scaling behavior

We benchmark problem sizes up to roughly 2M items using binary features extracted from one image pair from the Middlebury dataset. For smaller problem sizes, descriptors are deterministically subsampled by index to preserve the underlying distribution. Across these workloads, the optimized implementation achieves up to 19x speedup over std::unordered_map and remains faster than both Ankerl and Abseil on this task.

Fig. 6. Full benchmark comparison. std::sort outperforms std::unordered_map, and both Abseil and Ankerl are outperformed by most of our specialized implementations on this workload. All methods are single-threaded, and the ours variants do not use explicit SIMD.

References