Latent-Mark: An Audio Watermark Robust to Neural Resynthesis

[Interspeech 2025] LATENT-MARK: Bridging the Gap Between Audio Watermarking and Neural Resynthesis

Abstract

LATENT-MARK is a novel zero-bit audio watermarking framework designed to survive "Neural Resynthesis" from modern neural codecs (e.g., EnCodec, SNAC). It achieves SOTA robustness by embedding the watermark as a directional shift in the codec's invariant latent space rather than as waveform-level noise.

TL;DR

As neural audio codecs (like EnCodec and SNAC) become the backbone of generative AI, traditional audio watermarks are being "washed away" by the semantic reconstruction process. LATENT-MARK addresses this by embedding watermarks directly into the codec's latent space. By treating the watermark as a structural feature of the audio manifold rather than stochastic noise, it survives neural resynthesis with state-of-the-art detection accuracy while remaining virtually imperceptible.

The "Semantic Filter" Problem: Why Traditional Watermarks Fail

Most existing watermarking tools (WavMark, AudioSeal) operate on the principle of additive perturbations. They hide data in high-frequency bands or phase shifts that are imperceptible to humans.

However, modern neural codecs don't just compress audio; they re-synthesize it. They map the waveform into a discrete latent bottleneck and reconstruct the signal from scratch. In this process, the codec acts as a "semantic filter": it keeps what sounds like speech or music and discards everything else as "quantization noise." Traditional watermarks are indistinguishable from noise to these models, so the embedded signal is almost entirely lost.

Methodology: Steering the Latent Manifold

The core insight of LATENT-MARK is simple yet profound: If you can't beat the codec, join it. If the codec only preserves features in its latent space, we must embed the watermark there.

1. Latent-Targeted Optimization

Instead of adding noise, LATENT-MARK uses gradient descent to slightly modify the input waveform $s$ so that its encoded latent representation $z$ shifts in a specific direction $v_{c}$.
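To make the idea concrete, here is a minimal sketch under strong assumptions: the codec encoder is replaced by a hypothetical linear map `W` (so $z = Ws$), and the objective rewards the projection of $z$ onto $v_{c}$ while penalizing the perturbation energy with a hypothetical weight `lam`. The actual loss, encoder, and constraints used by the authors are not given in this summary.

```python
import numpy as np

def embed_latent_shift(s, W, v_c, steps=200, lr=0.05, lam=1.0):
    """Toy latent-targeted optimization with a linear stand-in encoder z = W @ s.

    Minimizes  -<W (s + delta), v_c> + lam * ||delta||^2  over delta
    by plain gradient descent. All names here are illustrative assumptions.
    """
    v_unit = v_c / np.linalg.norm(v_c)
    delta = np.zeros_like(s)
    for _ in range(steps):
        # Gradient of the alignment term is -W^T v; of the penalty, 2 * lam * delta.
        grad = -W.T @ v_unit + 2.0 * lam * delta
        delta -= lr * grad
    return s + delta

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 256)) / 16.0   # stand-in encoder, not a real codec
s = rng.standard_normal(256)                # stand-in waveform
v_c = rng.standard_normal(16)               # stand-in watermark direction

s_wm = embed_latent_shift(s, W, v_c)
# How far the latent moved along the target direction (positive means "toward v_c").
shift = (W @ s_wm - W @ s) @ (v_c / np.linalg.norm(v_c))
```

With a real neural codec the encoder is nonlinear, so the gradient would come from automatic differentiation rather than the closed form used here.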

2. The Secret Axis: Latent-Cluster

How do we pick the direction $v_{c}$? The authors found that random directions are less robust. Instead, they use Latent-Cluster: they perform K-means clustering on the codec's codebook and define the axis as the vector between cluster centroids. This ensures the shift points toward "high-density" regions the quantizer is likely to preserve.
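The clustering step can be sketched as follows. This is an illustration, not the authors' code: the codebook is a synthetic two-blob array, K is fixed to 2, and the axis is taken between those two centroids. How many clusters the paper uses, and which centroid pair defines the axis, is not stated in this summary.

```python
import numpy as np

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(iters):
        # Distance from every point to every centroid, shape (n_points, k).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:        # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids, labels

def cluster_axis(codebook, k=2):
    """Unit vector between the first two centroids (k=2 is an assumption)."""
    centroids, _ = kmeans(codebook, k)
    axis = centroids[1] - centroids[0]
    return axis / np.linalg.norm(axis)

# Synthetic stand-in codebook: two dense groups of code vectors along dimension 0.
rng = np.random.default_rng(0)
offset = np.zeros(8)
offset[0] = 3.0
codebook = np.concatenate([
    0.3 * rng.standard_normal((64, 8)) + offset,
    0.3 * rng.standard_normal((64, 8)) - offset,
])
axis = cluster_axis(codebook)
```

Because both endpoints of the axis are centroids of densely populated codebook regions, a shift along it moves the latent toward codes that actually exist, which is the intuition the text gives for why it survives quantization.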

Figure 1 (Overall Architecture): The LATENT-MARK framework. Optimization (A) induces a constrained shift (B) that survives the RVQ bottleneck for detection (C).

3. Cross-Codec Optimization (The "Ensemble" Trick)

To make the watermark "zero-shot" transferable to codecs it wasn't designed for, the authors propose Joint Manifold Optimization. By optimizing the waveform against a committee of diverse surrogate codecs (e.g., SNAC, DAC, EnCodec), the watermark captures shared "semantic invariants" that exist across different architectures.
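One way to picture optimizing against a committee is to average the per-codec alignment objectives and descend on the shared waveform perturbation. The sketch below assumes hypothetical linear stand-in encoders with different latent sizes and one target axis per encoder; the weighting and exact joint loss used in the paper are not described here.

```python
import numpy as np

def embed_joint(s, encoders, axes, steps=200, lr=0.05, lam=1.0):
    """Toy joint optimization over several linear stand-in encoders.

    Minimizes  -(1/n) * sum_i <W_i (s + delta), v_i> + lam * ||delta||^2.
    """
    units = [v / np.linalg.norm(v) for v in axes]
    delta = np.zeros_like(s)
    for _ in range(steps):
        # Average pull direction across the committee, expressed in waveform space.
        align = sum(W.T @ v for W, v in zip(encoders, units)) / len(encoders)
        delta -= lr * (-align + 2.0 * lam * delta)
    return s + delta

rng = np.random.default_rng(0)
latent_dims = [16, 24, 32]                  # stand-ins for three different codecs
encoders = [rng.standard_normal((d, 256)) / 16.0 for d in latent_dims]
axes = [rng.standard_normal(d) for d in latent_dims]
s = rng.standard_normal(256)

s_wm = embed_joint(s, encoders, axes)
# Per-codec latent shift along that codec's own axis.
shifts = [
    float((W @ (s_wm - s)) @ (v / np.linalg.norm(v)))
    for W, v in zip(encoders, axes)
]
```

A single perturbation has to serve every encoder at once, so it concentrates on waveform directions that all committee members respond to; that is the toy analogue of the shared "semantic invariants" mentioned above.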

Experimental Battleground

The researchers tested LATENT-MARK across 7 datasets (Speech, Music, Ambient) against SOTA baselines.

Neural Resynthesis Survivability

When attacked by the SNAC codec, baselines like WavMark and SilentCipher dropped to near 0% detection. LATENT-MARK maintained high survivability (reaching 93% on some datasets). This supports the claim that the "latent shift" strategy is more resilient to neural bottlenecks than waveform-level perturbations.
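For intuition about how such a number could be computed, the sketch below assumes a simple zero-bit detector (mean projection of latent frames onto the secret axis, compared with a threshold) and reads survivability as the fraction of watermarked clips still detected after the attack. Both the detector and this definition are assumptions for illustration; the summary does not specify them.

```python
import numpy as np

def detection_score(latent_frames, axis):
    """Mean projection of latent frames (n_frames, dim) onto the unit axis."""
    axis = axis / np.linalg.norm(axis)
    return float(np.mean(latent_frames @ axis))

def survivability(scores, threshold):
    """Fraction of attacked, watermarked clips whose score still exceeds the threshold."""
    scores = np.asarray(scores, dtype=float)
    return float(np.mean(scores > threshold))

frames = np.array([[1.0, 0.0], [3.0, 0.0]])
print(detection_score(frames, np.array([2.0, 0.0])))   # 2.0
print(survivability([0.8, 0.1, 0.6, 0.9], 0.5))        # 0.75
```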

Standard DSP Attacks

Does optimizing for neural codecs break traditional robustness? No. LATENT-MARK remains competitive against Gaussian noise and resampling, often outperforming WavMark.

Table 1 (Performance Comparison): Main experimental results showing the large gap in survivability (Sur.) between LATENT-MARK and traditional baselines.

Acoustic Imperceptibility

Using UTMOS (a neural MOS predictor), the study shows that LATENT-MARK audio is virtually indistinguishable from the original. By aligning the perturbation with the "natural audio manifold," the authors aim to make the changes sound like natural variations of the audio signal rather than artificial glitches.

Critical Insight & Future Outlook

LATENT-MARK points to a broader shift in the field: away from signal-level watermarking and toward feature-level watermarking.

Limitations: As a zero-bit watermark, it currently only detects presence, not a multi-bit payload (like a URL or ID). Expanding this to carry high-capacity data while maintaining latent stability is the next logical step for the community.

In a world where AI can effortlessly strip away traditional metadata and digital signatures, LATENT-MARK offers a robust "DNA" for audio that survives even the most aggressive neural reconstructions.
