Published: June 23, 2026

CANDLE: Character-level Arabic Noise Deduplication using Lightweight Encoder

By Faris Alasmary, Taif Nono, Orjuwan Zaafarani, Kholood Al Tabash, Ahmad Ghannam, Anas Salamah, Shouq Sadah, Lahouari Ghouti

Research TL;DR

"Uses CTC as a novel sequence alignment formulation for character deduplication in Arabic text, outperforming classification baselines; distilled to 2 layers for 3x depth reduction with minimal accuracy loss."

Abstract

Handling repeated characters in text can be tricky, since they can represent either the correct spelling of a word or informal character elongation often seen in social media posts. We present CANDLE, a lightweight system for character-level Arabic noise deduplication that addresses this challenge without relying on handcrafted rules, dictionaries, or morphological analyzers. At the heart of CANDLE is a novel application of Connectionist Temporal Classification (CTC) to this task, a formulation not previously explored for character deduplication, which frames normalization as a sequence alignment problem over a character-based encoder. Evaluated on three benchmarks spanning clean newspaper, manually curated ambiguous cases, and real-world social media text, the CTC model achieves a Sentence Error Rate (SER) as low as 5.37% and consistently outperforms a classification-based baseline by a large margin. To reduce inference overhead, we distill the 6-layer CTC model into a 2-layer student, achieving a 3× depth reduction with minimal performance degradation. Beyond deduplication accuracy, normalization yields a practical downstream benefit: a relative reduction in tokenizer fertility of up to 12.8% across a diverse set of Arabic LLM tokenizers, directly lowering inference costs and improving context window utilization. We release all code and models publicly to support reproducibility and advance future research (https://github.com/abjadai/candle).
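To make the quantities in the abstract concrete, the sketch below illustrates three ideas under stated assumptions: the standard CTC collapse rule (merge adjacent repeats, then drop blanks), a sentence-level error rate computed as the fraction of sentences that do not exactly match their reference, and tokenizer fertility computed as tokens per whitespace-separated word. None of this is taken from the CANDLE code. The blank symbol `_`, the function names, and the `toy_tokenize` stand-in are hypothetical, the paper's exact metric definitions may differ, and Latin characters are used only for readability.

```python
from typing import Callable, List, Sequence

BLANK = "_"  # hypothetical blank symbol, chosen for this sketch only


def ctc_collapse(frame_labels: Sequence[str], blank: str = BLANK) -> str:
    """Standard CTC collapse: merge adjacent repeated labels, then drop blanks."""
    out = []
    prev = None
    for label in frame_labels:
        # Emit a label only when it differs from the previous frame
        # and is not the blank symbol.
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return "".join(out)


def sentence_error_rate(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of sentences whose prediction is not an exact match (assumed definition)."""
    if len(predictions) != len(references) or not references:
        raise ValueError("need two non-empty lists of equal length")
    errors = sum(p != r for p, r in zip(predictions, references))
    return errors / len(references)


def fertility(texts: Sequence[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-separated word (assumed definition)."""
    words = sum(len(t.split()) for t in texts)
    tokens = sum(len(tokenize(t)) for t in texts)
    return tokens / words


def toy_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in for a real tokenizer: 3-character chunks per word."""
    return [w[i:i + 3] for w in text.split() for i in range(0, len(w), 3)]


# A blank between two identical labels keeps a legitimate double letter,
# while an unbroken run of the same label collapses to a single character.
kept_double = ctc_collapse(list("co_ool"))    # frames: c, o, blank, o, o, l
merged_run = ctc_collapse(list("coooool"))    # frames: c, o, o, o, o, o, l

# Relative fertility reduction after normalization, using the toy tokenizer.
before = fertility(["soooo goood"], toy_tokenize)
after = fertility(["so good"], toy_tokenize)
relative_reduction = (before - after) / before
```

In this toy setting, the blank is what lets a CTC-style output distinguish a genuine doubled letter from elongation: the first call yields a doubled "o" while the second yields a single one. Swapping `toy_tokenize` for a real tokenizer's encode function would give a fertility comparison in the spirit of the one reported above.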

Read full paper on arXiv →

Related Research

New Bounds for the Last Iterate of the Stochastic subGradient Method

It's Complicated: On the Design and Evaluation of AI-Powered AAC Interfaces

Real vs. Complex Spectral Bases for Neural Operators: The Role of Green's Function Alignment