The rapid improvement of frontier multimodal large language models (LLMs) has rendered most existing CAPTCHA schemes obsolete: image-based CAPTCHAs, audio CAPTCHAs, and textual puzzles are routinely solved by current AI systems. This paper investigates whether human-specialised perceptual abilities can form the basis of next-generation CAPTCHAs that remain robust against AI while staying accessible to humans.
We introduce two novel CAPTCHA classes targeting distinct perceptual gaps. First, ASCII art CAPTCHAs exploit the mismatch between holistic human Gestalt reading and LLM tokenisation: humans effortlessly recognise characters rendered in ASCII art, while models fail because their tokenisers fragment the spatial patterns that carry semantic meaning. Second, overlapping audio CAPTCHAs exploit the cocktail-party effect: humans can isolate target speech from simultaneous distractors, whereas current audio-capable LLMs cannot reliably separate overlapping speakers.
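To make the first class concrete, the following is a minimal sketch of how a challenge string might be rendered as ASCII art. The three-glyph `FONT` table and the `render_ascii` helper are hypothetical illustrations, not the generator used in this work; a practical system would presumably draw on a full font and add distortion.

```python
# Hypothetical 5-row bitmap font covering only three glyphs (an assumption
# for illustration; not the font or generator used in the paper).
FONT = {
    "A": [" ### ", "#   #", "#####", "#   #", "#   #"],
    "C": [" ####", "#    ", "#    ", "#    ", " ####"],
    "T": ["#####", "  #  ", "  #  ", "  #  ", "  #  "],
}


def render_ascii(text: str, gap: int = 2) -> str:
    """Render text as ASCII art by joining glyph rows side by side."""
    glyphs = [FONT[ch] for ch in text.upper()]
    rows = []
    for r in range(5):
        # Each output row concatenates row r of every glyph, separated by `gap` spaces.
        rows.append((" " * gap).join(g[r] for g in glyphs).rstrip())
    return "\n".join(rows)


if __name__ == "__main__":
    print(render_ascii("CAT"))
```

A human reads the printed block as a whole shape, whereas a model receiving it as text sees five separate rows of spaces and `#` characters, so the letter forms exist only in the vertical alignment across rows.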
We evaluate both CAPTCHA classes against all frontier multimodal models available at the time of writing, including GPT-4o, GPT-4o Audio Mini, Gemini 1.5 Pro, Gemini 1.5 Flash, Gemini 2.0 Flash, Claude 3.5 Sonnet, and Voxtral Small. For ASCII CAPTCHAs, no model achieved non-zero exact-match accuracy under either input modality (text or image), and character-level similarity averaged 0.16% across all models. For audio CAPTCHAs, no model performed better than chance under any of the four noise augmentation conditions: the best result, 25% on clean audio (GPT-4o Audio Mini), equals the random baseline for a 4-alternative forced-choice task. These results suggest that both CAPTCHA classes offer strong near-term robustness against current AI capabilities.
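The reported quantities can be illustrated with a short sketch. It assumes case-insensitive comparison and uses `difflib.SequenceMatcher` as a stand-in for character-level similarity; the abstract does not specify either choice, and the function names are hypothetical.

```python
from difflib import SequenceMatcher


def exact_match_accuracy(preds, targets):
    """Fraction of predictions equal to the target (case-insensitive: an assumption)."""
    hits = sum(p.strip().upper() == t.strip().upper() for p, t in zip(preds, targets))
    return hits / len(targets)


def mean_char_similarity(preds, targets):
    """Mean SequenceMatcher ratio; an assumed stand-in for the paper's similarity metric."""
    ratios = [
        SequenceMatcher(None, p.upper(), t.upper()).ratio()
        for p, t in zip(preds, targets)
    ]
    return sum(ratios) / len(ratios)


def random_baseline(n_choices):
    """Expected accuracy of uniform guessing in an n-alternative forced-choice task."""
    return 1.0 / n_choices


if __name__ == "__main__":
    # Made-up example strings, not data from the evaluation.
    print(exact_match_accuracy(["cat", "dog"], ["CAT", "cow"]))
    print(random_baseline(4))
```

With four answer options, uniform guessing yields an expected accuracy of 1/4, which is why a 25% score carries no evidence that the model separated the target speaker.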