Encoding Multiplicity: Fast Duplicate Detection using Hyper-Dimensional Computing

Efficiently determining item multiplicity and detecting duplicates within large, unstructured sets remains a critical computational bottleneck in modern data processing. Conventional solutions—such as hash tables, explicit counting, or sorting followed by linear comparison—suffer from two primary drawbacks: they either incur high memory overhead (scaling with the size of the unique universe, U) or poor time complexity (involving multiple passes or non-linear comparison costs), making them impractical for high-throughput or resource-constrained environments.

I propose a fundamentally different approach rooted in Vector Symbolic Architectures (VSA). By employing a Shift-Code method, we can encode the entire multiset into a single, high-dimensional bundle vector ($S$) via superposition. The key advantage is that the presence and multiplicity of items are implicitly stored in the bundle's geometric structure. We can then recover the multiplicity profile—and thus detect duplicates—by performing a single, fast circular correlation between the bundle $S$ and the base vector $b$ using the Fast Fourier Transform (FFT). This approach offers a path toward a single-pass, memory-efficient solution for robust set analysis.

For folks unfamiliar with VSA, An Intuitive Introduction to Hyperdimensional Computing provides a high-level overview.

Methodology: The Algebra of Set Superposition

The core challenge in duplicate detection is to efficiently record both the identity and the count (multiplicity) of every item in a large set. Our VSA-based approach achieves this by leveraging the powerful properties of high-dimensional space.

The Intuition Behind The Idea

The Codebook and The Shift-Code

Imagine a vast, circular canvas (our high-dimensional vector space of size $D$). We first choose a random, grainy texture or base pattern $b$—like a unique fingerprint. To uniquely encode an integer item $k$, we simply cyclically shift this base pattern by $k$ positions. This rotation is our Shift-Code operation, sketched in code after the analogy below.

  • Analogy: Holographic Encoding. Think of an item's encoding as a hologram. The base pattern $b$ is the laser reference beam. Encoding item $k = 5$ means rotating the reference beam by 5 degrees. Every possible item corresponds to a unique, slightly displaced version of the same fundamental pattern.
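
To make this concrete, here is a minimal NumPy sketch of the Shift-Code. The dimension $D = 4096$ and the fixed random seed are illustrative choices, not values prescribed by the method; NumPy's np.roll plays the role of the cyclic shift.

```python
import numpy as np

D = 4096                        # dimensionality of the hypervector space (illustrative)
rng = np.random.default_rng(0)  # fixed seed for reproducibility (illustrative)

# Base pattern b: unit-norm Gaussian random vector (the "reference beam").
b = rng.standard_normal(D)
b /= np.linalg.norm(b)

def encode(k: int) -> np.ndarray:
    """Shift-Code: encode integer item k as a cyclic shift of b by k positions."""
    return np.roll(b, k)

# Identical shifts match exactly; distinct items are nearly orthogonal.
print(np.dot(encode(5), encode(5)))   # ~1.0
print(np.dot(encode(5), encode(42)))  # ~0.0 (noise on the order of 1/sqrt(D))
```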

The Bundle: Superposition via Summation

When processing the set, we don't store items separately. Instead, we superimpose (add) all their high-dimensional encodings into a single bundle vector ($S$).

  • Intuition: We take the "holograms" for all items and expose them onto the same photographic plate $S$. If item $k$ appears once, its pattern $b_k$ is added once. If item $k$ is a duplicate, its pattern is added twice, causing a brighter, more concentrated exposure in that specific region of the plate (illustrated in the continuation of the sketch below).

  • Key Insight: Because our patterns are high-dimensional random textures, they are nearly orthogonal (they interfere minimally). The plate S becomes a single, complex composite image where all item patterns are mixed, but each item's pattern retains its distinct, hidden signature.
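
Continuing the sketch above (reusing encode and $b$), bundling is nothing more than an element-wise sum, and probing the plate with an individual item's encoding already shows a duplicate "glowing" roughly twice as brightly:

```python
# Superimpose the encodings of a small multiset; item 7 appears twice (a duplicate).
items = [3, 7, 7, 11]
S = np.sum([encode(k) for k in items], axis=0)

# Probe the plate with individual "holograms": the dot product reads out brightness.
print(np.dot(S, encode(3)))    # ~1.0  (singleton)
print(np.dot(S, encode(7)))    # ~2.0  (duplicate: exposed twice)
print(np.dot(S, encode(20)))   # ~0.0  (absent item)
```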

The Correlation: Detecting the Echo

To read the set and find duplicates, we use circular correlation. We use the original base pattern $b$ as a "decoding filter" and sweep it across the composite image $S$, checking for a match at every possible shift $t$.

  • Pattern Matching. This process is like holding a magnifying glass (the base pattern $b$) up to the photographic plate $S$ and rotating it. When the magnifying glass is perfectly aligned with the latent pattern $b_k$ (i.e., when the shift $t$ matches $k$), the two patterns resonate and a clear image (a high correlation score) "pops out."

  • Duplicate Detection: A singleton item generates a modest match score ($\approx 1$). A duplicate ($m_k \geq 2$) generates a score that is significantly higher ($\approx 2$ or more) because its pattern was exposed twice as brightly. By checking which scores exceed a Duplicate Threshold ($\tau$), we instantly filter for and identify all duplicated item identities; a brute-force version of this sweep is sketched below.
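
Here is that brute-force sweep, reusing $b$ and $S$ from the running sketch; the threshold of 1.5 is an illustrative choice sitting between the singleton and duplicate levels. The FFT-accelerated form is derived in the next section.

```python
TAU = 1.5  # illustrative Duplicate Threshold between the singleton (~1) and duplicate (~2) levels

# Sweep the "magnifying glass" across every possible shift t (O(D^2) overall).
scores = np.array([np.dot(S, np.roll(b, t)) for t in range(D)])

duplicates = np.flatnonzero(scores > TAU)
print(duplicates)  # -> [7] for the example bundle above
```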

The Nitty Gritty

This approach formally relies on the properties of random vectors and the mathematical convenience of the convolution theorem for accelerating the correlation.

A. Encoding and Bundling

  1. Base Vector Generation: The process begins with a base vector $b \in \mathbb{R}^D$, generated as a unit-norm Gaussian random vector:

     $b \sim \mathcal{N}(0, I_D)$, normalized s.t. $\|b\|_2 = 1$

  2. Shift-Code Encoding: The encoding function is the cyclic shift (permutation) operator $\rho$:

     $E(k) = \rho^k b$, where $(\rho^k b)_i = b_{(i - k) \bmod D}$

  3. Bundle (Superposition): Given a multiset $X = \{x_1, x_2, \dots, x_N\}$, the bundle vector $S$ is the sum of the encoded items:

     $S = \sum_{j=1}^{N} \rho^{x_j} b = \sum_{k} m_k\, \rho^{k} b$, where $m_k$ is the multiplicity of item $k$ in $X$

B. Decoding via Circular Correlation

The core recovery operation is the circular cross-correlation between the bundle $S$ and the base vector $b$. The $t$-th element of the correlation is:

$c_t = \sum_{i=0}^{D-1} b_i\, S_{(i + t) \bmod D}$

Substituting the definition of $S$:

$c_t = \sum_{k} m_k \sum_{i=0}^{D-1} b_i\, b_{(i + t - k) \bmod D}$

The term $\sum_{i} b_i\, b_{(i + t - k) \bmod D}$ is the correlation between $b$ and $\rho^k b$ at shift $t$. Due to the cyclic shift property, this correlation has a high value ($\approx 1$) only when $t$ matches the shift $k$.

C. FFT Acceleration and Multiplicity

Direct calculation of $c$ is $O(D^2)$ (or $O(UD)$ if only the $U$ candidate shifts are probed), depending on the implementation. By using the Convolution Theorem, the circular correlation can be calculated efficiently in the frequency domain:

$c = \mathcal{F}^{-1}\big(\operatorname{conj}(\mathcal{F}(b)) \odot \mathcal{F}(S)\big)$

where $\mathcal{F}$ is the real-to-complex Fast Fourier Transform, $\operatorname{conj}$ is the complex conjugate, and $\odot$ is the element-wise Hadamard product. This yields a complexity of $O(D \log D)$ for the final detection step, making the process dominated by the $O(ND)$ cost of the initial bundling (which reduces to the vector summation itself if the encodings are pre-computed).
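
As a sketch, the same correlation can be computed with NumPy's real-FFT routines (np.fft.rfft / np.fft.irfft); it should reproduce the brute-force sweep from the intuition section exactly, but in $O(D \log D)$ time:

```python
def correlate_fft(S: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Circular cross-correlation of the bundle S with the base vector b via the FFT."""
    return np.fft.irfft(np.conj(np.fft.rfft(b)) * np.fft.rfft(S), n=len(b))

c = correlate_fft(S, b)
print(np.allclose(c, scores))  # True: matches the O(D^2) sweep above
```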

For an item $k$ with multiplicity $m_k$, the correlation $c_k$ approximates $m_k$:

$c_k = m_k \|b\|_2^2 + \epsilon_k$

Since $\|b\|_2^2 = 1$, $c_k \approx m_k + \epsilon_k$. Thus, $c_k \approx m_k$. The noise term $\epsilon_k$ arises from the cross-correlations between distinct shifts $k$ and $k'$, which are minimized because $b$ is a high-dimensional random vector.

Detection Rule: An item is declared a duplicate if its correlation score surpasses a predetermined threshold: item $k$ is a duplicate if $c_k > \tau$.

With $\tau$ set between the two levels (e.g., $\tau = 1.5$), this threshold robustly distinguishes between singletons ($c_k \approx 1$) and duplicates ($c_k \approx 2$ or more).
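
Putting the pieces together, here is a self-contained end-to-end sketch; the dimension, seed, and threshold are illustrative choices rather than values fixed by the method.

```python
import numpy as np

def find_duplicates(items, D=4096, tau=1.5, seed=0):
    """Single-pass duplicate detection via Shift-Code bundling + FFT correlation.

    Items are assumed to be integers in [0, D)."""
    rng = np.random.default_rng(seed)
    b = rng.standard_normal(D)
    b /= np.linalg.norm(b)          # unit-norm base vector

    # Bundle: superimpose one shifted copy of b per item (single pass over the data).
    S = np.zeros(D)
    for k in items:
        S += np.roll(b, k)

    # Decode: one circular correlation recovers the approximate multiplicity profile.
    c = np.fft.irfft(np.conj(np.fft.rfft(b)) * np.fft.rfft(S), n=D)
    return np.flatnonzero(c > tau)  # item identities whose score exceeds the threshold

print(find_duplicates([3, 7, 7, 11, 7, 42]))  # -> [7] (the readout is approximate, not exact)
```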

Conclusion

This work successfully demonstrates a highly efficient method for duplicate detection and multiplicity estimation by harnessing the unique properties of Vector Symbolic Architectures (VSA). By encoding entire multisets into a single, compact bundle vector (S) using Shift-Codes, we sidestep the high computational cost and memory overhead associated with traditional comparison-based methods.

This VSA framework offers a powerful, single-pass alternative for set analysis, making it particularly attractive for applications requiring high-throughput, memory-efficient processing, such as in massive streaming data pipelines or resource-constrained embedded systems. The scalability and robust performance of this correlation-based approach open promising avenues for future research into more complex set operations, including approximate intersection and union, all within the elegant algebra of high-dimensional vectors.