Character-Level Mistake Marking: Precision Error Tracking for Arabic Text

Jibran Kalia 12 min read

بِسْمِ ٱللَّٰهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ

Tracking recitation mistakes at the word level isn't enough. A student might pronounce a letter correctly but miss the diacritical mark attached to it. In this post, I'll walk through how we built character-level mistake tracking for Arabic Quranic text, from Unicode decomposition to the tap-based selection UI.

Why Character-Level Precision?

Consider the Arabic word بِسْمِ (bismi). It contains three letters (ب، س، م) and three diacritical marks (kasra, sukun, kasra). A student might pronounce the س correctly but miss the sukun above it, turning a stopped consonant into an open syllable. Word-level tracking would flag the entire word; character-level tracking pinpoints the exact mark that was missed.
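To make the letter/mark split concrete, here is a minimal sketch that walks the code points of that word. The `MARK_NAMES` table and the `describe_code_points` helper are illustrative names for this post, not part of the app's code, and the table only covers the marks needed for this one word.

```ruby
# Illustrative lookup table (an assumption for this example only).
MARK_NAMES = {
  "\u064E" => "fatha",
  "\u064F" => "damma",
  "\u0650" => "kasra",
  "\u0652" => "sukun"
}.freeze

# Labels each code point in the word as a named mark or a plain letter.
def describe_code_points(word)
  word.each_char.map do |char|
    kind = MARK_NAMES.fetch(char, "letter")
    format("U+%04X %s", char.ord, kind)
  end
end

# The word bismi written with explicit escapes: ba, kasra, seen, sukun, meem, kasra
BISMI = "\u0628\u0650\u0633\u0652\u0645\u0650"

puts describe_code_points(BISMI)
```

The string holds six code points even though a reader sees three columns, which is why any mistake index has to be defined per code point rather than per visible letter.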

This distinction matters for teachers. When reviewing a student's recitation, they need to see patterns: does this student consistently drop sukun marks? Do they confuse fatha and damma? Character-level data makes these patterns visible.

Understanding Arabic Character Structure

Arabic text is composed of base letters with combining diacritical marks (tashkeel) that modify pronunciation. These marks appear above or below the letter:

Diacritic Positions

Above the letter:

  • Fatha (◌َ) U+064E — short "a" vowel
  • Damma (◌ُ) U+064F — short "u" vowel
  • Sukun (◌ْ) U+0652 — no vowel (consonant stop)
  • Shadda (◌ّ) U+0651 — doubled consonant
  • Fathatan (◌ً) U+064B — nunation with "a"
  • Dammatan (◌ٌ) U+064C — nunation with "u"
  • Superscript Alef (◌ٰ) U+0670 — long "a" (dagger alef)

Below the letter:

  • Kasra (◌ِ) U+0650 — short "i" vowel
  • Kasratan (◌ٍ) U+064D — nunation with "i"
  • Subscript Alef (◌ٖ) U+0656

A single Arabic character column can therefore have three layers: the base letter, marks above, and marks below. Multiple marks can stack in the same position (e.g., shadda + fatha: بَّ).
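The three layers can be sketched for a single column as follows. This is a simplified illustration, assuming the input is exactly one base letter followed by its marks; `ABOVE_MARKS`, `BELOW_MARKS`, and `split_column` are hypothetical names built from the lists above, not the app's helper.

```ruby
# Name tables taken from the lists above (for illustration only).
ABOVE_MARKS = {
  "\u064E" => "fatha",    "\u064F" => "damma",
  "\u0652" => "sukun",    "\u0651" => "shadda",
  "\u064B" => "fathatan", "\u064C" => "dammatan",
  "\u0670" => "superscript alef"
}.freeze

BELOW_MARKS = {
  "\u0650" => "kasra", "\u064D" => "kasratan", "\u0656" => "subscript alef"
}.freeze

# Returns [base_letter, names_of_marks_above, names_of_marks_below].
# Assumes column_text is one base letter followed by zero or more marks.
def split_column(column_text)
  base, *marks = column_text.chars
  above = marks.select { |c| ABOVE_MARKS.key?(c) }.map { |c| ABOVE_MARKS[c] }
  below = marks.select { |c| BELOW_MARKS.key?(c) }.map { |c| BELOW_MARKS[c] }
  [base, above, below]
end

# ba + shadda + fatha: two marks stacked in the "above" slot
p split_column("\u0628\u0651\u064E")

# ba + shadda + kasra: one mark in each slot
p split_column("\u0628\u0651\u0650")
```

Stacked marks stay in the order they appear in the string, so the same visual column can arrive as shadda then fatha or fatha then shadda depending on how the text was typed or normalized.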

The Character Column Model

We represent each visual "column" of an Arabic word as a structured object with three slots:

# app/helpers/arabic_text_helper.rb

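# Combining marks: tashkeel and related marks (U+064B..U+065F),
# superscript (dagger) alef (U+0670), and the Quranic annotation
# signs range (U+06D6..U+06ED).
ARABIC_DIACRITICS = /[\u064B-\u065F\u0670\u06D6-\u06ED]/
# Marks drawn below the base letter: kasra, kasratan, subscript alef.
BELOW_DIACRITICS  = /[\u0650\u064D\u0656]/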

def arabic_character_columns(word_text)
  columns = []
  current_index = 1  # 1-based indexing

  word_text.each_char do |char|
    if char.match?(ARABIC_DIACRITICS)
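      # Diacritical mark: attach to the current column.
      if columns.empty?
        # Assumed handling: a mark with no preceding base letter has no
        # column to attach to (columns.last would be nil), so skip it but
        # still consume its index to keep later positions aligned.
        current_index += 1
        next
      end

      position = char.match?(BELOW_DIACRITICS) ? :below : :above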

      if columns.last[position]
        # Stack multiple marks: shadda + fatha
        columns.last[position][:text] += char
      else
        columns.last[position] = { text: char, index: current_index }
      end
      current_index += 1
    else
      # Base letter — start new column
      columns <<