Perceptual Gaps: ASCII Art and Overlapping Audio as CAPTCHAs

Abstract

The rapid improvement of frontier multimodal large language models (LLMs) has rendered most existing CAPTCHA schemes obsolete: image-based CAPTCHAs, audio CAPTCHAs, and textual puzzles are routinely solved by current AI systems. This paper investigates whether human-specialised perceptual abilities can form the basis of next-generation CAPTCHAs that remain robust against AI while staying accessible to humans.

We introduce two novel CAPTCHA classes targeting distinct perceptual gaps. First, ASCII art CAPTCHAs exploit the mismatch between holistic human Gestalt reading and LLM tokenisation: humans effortlessly recognise characters rendered in ASCII art, while models fail because their tokenisers fragment the spatial patterns that carry semantic meaning. Second, overlapping audio CAPTCHAs exploit the cocktail-party effect: humans can isolate target speech from simultaneous distractors, whereas current audio-capable LLMs cannot reliably separate overlapping speakers.

We evaluate both CAPTCHA classes against frontier multimodal models available at the time of writing, including GPT-4o, GPT-4o Mini, GPT-4o Audio Mini, Gemini 1.5 Pro, Gemini 1.5 Flash, Gemini 2.0 Flash, Claude 3.5 Sonnet, and VoxTral Small. For ASCII CAPTCHAs, no model achieved non-zero exact-match accuracy under either input modality (text or image); character-level similarity averaged 0.16% across models. For audio CAPTCHAs, every model scored at or below chance across four noise augmentation conditions; the best result, 25% on clean audio (GPT-4o Audio Mini), only matches the random baseline for a 4-alternative forced-choice (4AFC) task. These results suggest that both CAPTCHA classes offer strong near-term robustness against current AI capabilities.

ASCII Art CAPTCHAs

ASCII art CAPTCHAs encode alphanumeric characters as spatial arrangements of printable symbols. Humans read ASCII art holistically, applying Gestalt principles to group local elements into global letter forms. LLM tokenisers, however, fragment ASCII art at character boundaries, destroying the spatial structure that carries the encoded information. This mismatch is fundamental: it is a property of the tokenisation architecture, not of model scale or training data.
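The loss of spatial structure can be illustrated without reference to any particular tokeniser. The sketch below models only the row-major character stream a model receives; the hand-drawn glyph and the helper name `flat_index` are hypothetical and are not taken from the paper's code.

```python
# Hand-drawn 5x5 glyph of the letter "H" (illustrative, not a pyfiglet font).
GLYPH = [
    "#   #",
    "#   #",
    "#####",
    "#   #",
    "#   #",
]

def flat_index(row: int, col: int, width: int) -> int:
    """Position of cell (row, col) in the newline-joined character stream."""
    return row * (width + 1) + col  # +1 accounts for the newline ending each row

width = len(GLYPH[0])
stream = "\n".join(GLYPH)

# The left stroke of the "H" occupies column 0 of every row.
left_stroke = [flat_index(r, 0, width) for r in range(len(GLYPH))]
gaps = [b - a for a, b in zip(left_stroke, left_stroke[1:])]

print(left_stroke)  # [0, 6, 12, 18, 24]
print(gaps)         # [6, 6, 6, 6]
```

Cells that are vertically adjacent on screen sit `width + 1` positions apart in the sequence, so a vertical stroke is never contiguous in the input, whatever token boundaries are later drawn over the stream.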

We generate 500 samples with pyfiglet, rendering random 7-character alphanumeric strings in more than 50 fonts. Each sample is evaluated under two input modalities: the raw ASCII text and a PNG rendering of it.
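A minimal sketch of how such a sample set might be assembled is shown below. It is not the paper's generation script: the character set (uppercase letters and digits), the seed, the three font names, and the helper names `random_target`, `render_ascii`, and `make_samples` are all assumptions made for illustration. Only `pyfiglet.figlet_format` is a real library call, and pyfiglet is imported lazily so that the rest of the sketch runs on the standard library alone.

```python
import random
import string

ALPHABET = string.ascii_uppercase + string.digits  # assumed character set

def random_target(rng: random.Random, length: int = 7) -> str:
    """Draw a random alphanumeric target string."""
    return "".join(rng.choice(ALPHABET) for _ in range(length))

def render_ascii(text: str, font: str = "standard") -> str:
    """Render text as ASCII art with pyfiglet (third-party: pip install pyfiglet)."""
    import pyfiglet  # lazy import: only needed when a sample is actually rendered
    return pyfiglet.figlet_format(text, font=font)

def make_samples(n: int, fonts: list[str], seed: int = 0) -> list[dict]:
    """Pair each random target with a randomly chosen font name."""
    rng = random.Random(seed)
    return [
        {"target": random_target(rng), "font": rng.choice(fonts)}
        for _ in range(n)
    ]

# Three common FIGlet font names stand in for the 50+ fonts used in the paper.
samples = make_samples(500, fonts=["standard", "banner", "big"])
```

With pyfiglet installed, `render_ascii(samples[0]["target"], samples[0]["font"])` returns the multi-line ASCII art for the first sample, which can be sent to a model as text or rasterised to PNG for the image condition.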

Model Failure Gallery

Each frontier model consistently returns outputs bearing no resemblance to the target string. Below are representative failure cases.

GPT-4o

Outputs a confident but entirely incorrect transcription.

Gemini 1.5 Pro

Recognises ASCII art structure but cannot decode the characters.

Gemini 1.5 Flash

The Flash variant shows the same failure pattern as the Pro model.

Overlapping Audio CAPTCHAs

The cocktail-party effect describes the human ability to selectively attend to a single speaker in a noisy, multi-speaker environment. We exploit this by presenting a target question spoken by one voice simultaneously with a distractor question from a second voice. The user must answer the target question, ignoring the distractor.

Audio is synthesised using XTTS-v2 and evaluated across four noise augmentation conditions: clean, white noise, bandpass-filtered noise, and pink noise. Models receive the mixed audio and must select the correct answer from four alternatives.
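The mixing and white-noise steps can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's pipeline: the sine tones stand in for synthesised speech, and the 10 dB signal-to-noise ratio, the unit distractor gain, and the helper names `rms`, `mix`, and `add_white_noise` are hypothetical. The bandpass and pink-noise conditions are omitted.

```python
import math
import random

def rms(x: list[float]) -> float:
    """Root-mean-square amplitude of a waveform."""
    return math.sqrt(sum(v * v for v in x) / len(x))

def mix(target: list[float], distractor: list[float], gain: float = 1.0) -> list[float]:
    """Overlay two waveforms sample by sample, zero-padding the shorter one."""
    n = max(len(target), len(distractor))
    t = target + [0.0] * (n - len(target))
    d = distractor + [0.0] * (n - len(distractor))
    return [a + gain * b for a, b in zip(t, d)]

def add_white_noise(x: list[float], snr_db: float, rng: random.Random) -> list[float]:
    """Add Gaussian white noise scaled to the requested signal-to-noise ratio."""
    noise_rms = rms(x) / (10 ** (snr_db / 20))
    return [v + rng.gauss(0.0, noise_rms) for v in x]

# Stand-in "voices": two sine tones sampled at 16 kHz instead of real speech.
sr = 16_000
target = [math.sin(2 * math.pi * 220 * i / sr) for i in range(sr)]
distractor = [math.sin(2 * math.pi * 330 * i / sr) for i in range(sr // 2)]

mixed = mix(target, distractor)                                    # clean condition
noisy = add_white_noise(mixed, snr_db=10.0, rng=random.Random(0))  # white-noise condition
```

Scaling the noise relative to the RMS of the mixture keeps the noise level comparable across samples of different loudness; a lower `snr_db` makes the condition harder for humans and models alike.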

Results

ASCII CAPTCHAs: Text Input

Model	Exact Match	Char Similarity	Notes
GPT-4o	0.00%	0.15%	Consistent nonsense output
GPT-4o Mini	0.00%	0.14%
Gemini 1.5 Pro	0.00%	0.18%	Occasional partial character matches
Gemini 1.5 Flash	0.00%	0.16%
Gemini 2.0 Flash	0.00%	0.17%
Claude 3.5 Sonnet	0.00%	0.15%

Table 1. ASCII CAPTCHA results under text input. Char similarity is LCS-based, averaged over 500 samples.
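For readers unfamiliar with the metric, the sketch below shows one common way to compute an LCS-based similarity. The normalisation by the longer string, the case-sensitive comparison, and the function names are assumptions; the paper's exact definition may differ.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def char_similarity(pred: str, target: str) -> float:
    """LCS length divided by the longer string's length (assumed normalisation)."""
    denom = max(len(pred), len(target))
    return lcs_length(pred, target) / denom if denom else 1.0

def exact_match(pred: str, target: str) -> bool:
    """True only when the whole prediction equals the target string."""
    return pred.strip() == target

# A prediction sharing the subsequence "AK2" with a 7-character target scores 3/7.
score = char_similarity("A1K2", "A7K2QX9")
```

Exact match is all-or-nothing, while the LCS ratio gives partial credit for characters recovered in the right order, which is why the tables report both.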

ASCII CAPTCHAs: Image Input

Model	Exact Match	Char Similarity	Notes
GPT-4o	0.00%	0.16%	Marginally higher similarity than text
GPT-4o Mini	0.00%	0.13%
Gemini 1.5 Pro	0.00%	0.19%	Best image-input similarity
Gemini 1.5 Flash	0.00%	0.15%
Gemini 2.0 Flash	0.00%	0.16%
Claude 3.5 Sonnet	0.00%	0.14%

Table 2. ASCII CAPTCHA results under image input. No model achieved non-zero exact-match accuracy despite the visual modality.

Audio CAPTCHAs: 4AFC Accuracy

Model	Clean	White Noise	Bandpass	Pink Noise
GPT-4o Audio Mini	25.0%	12.5%	12.5%	0.0%
Gemini 2.0 Flash	12.5%	12.5%	12.5%	12.5%
VoxTral Small	12.5%	0.0%	12.5%	12.5%
Random baseline	25.0%	25.0%	25.0%	25.0%

Table 3. Audio CAPTCHA 4AFC accuracy. Best result (25%, GPT-4o Audio Mini, clean) matches the random baseline. All models perform at or below random across all noise conditions.
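As a rough aid for reading Table 3, the sketch below computes how often a pure guesser would reach a given score on a 4AFC task. The trial count `n_trials = 8` is an assumption, chosen only because the reported accuracies are multiples of 12.5%; the number of audio samples is not stated here, and the helper names are hypothetical.

```python
from math import comb

def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def prob_at_least(k: int, n: int, p: float) -> float:
    """Probability of k or more successes in n independent trials."""
    return sum(binom_pmf(i, n, p) for i in range(k, n + 1))

chance = 1 / 4   # four alternatives, one correct
n_trials = 8     # ASSUMPTION: inferred from the 12.5% granularity, not stated in the text

# Chance that random guessing scores 25% or better (at least 2 of 8 correct).
p_guess_reaches_25 = prob_at_least(2, n_trials, chance)

# Chance that random guessing scores 0% (no correct answers).
p_guess_scores_zero = binom_pmf(0, n_trials, chance)
```

Under this assumed trial count, a random guesser reaches 25% or better in roughly 63% of runs, so the best entry in the table is fully consistent with guessing.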

BibTeX

@misc{chong2026perceptualgaps,
  title         = {Perceptual Gaps: {ASCII} Art and Overlapping Audio as {CAPTCHA}s},
  author        = {Chong, Choon-Hou Rafael},
  year          = {2026},
  eprint        = {2604.03612},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CR},
  url           = {https://arxiv.org/abs/2604.03612}
}