Character-Level Mistake Marking: Precision Error Tracking for Arabic Text
بِسْمِ ٱللَّٰهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ
Tracking recitation mistakes at the word level isn't enough. A student might pronounce the letter correctly but miss the diacritical mark above it. In this post, I'll walk through how we built character-level mistake tracking for Arabic Quranic text — from Unicode decomposition to the tap-based selection UI.
Why Character-Level Precision?
Consider the Arabic word بِسْمِ (bismi). It contains three letters (ب، س، م) and three diacritical marks (kasra, sukun, kasra). A student might pronounce the س correctly but miss the sukun above it, turning a stopped consonant into an open syllable. Word-level tracking would flag the entire word. Character-level tracking pinpoints the exact mark that was missed.
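To make that breakdown concrete, here is a minimal sketch that lists the code points of bismi in storage order. The string is written with explicit escapes so the sequence is unambiguous. `MARK_NAMES` and `describe_code_points` are names invented for this example, not part of the app's code.

```ruby
# Illustrative only: MARK_NAMES is a hypothetical lookup table for this example.
MARK_NAMES = {
  "\u0650" => "kasra",
  "\u0652" => "sukun"
}.freeze

# bismi: ba + kasra, seen + sukun, meem + kasra
BISMI = "\u0628\u0650\u0633\u0652\u0645\u0650"

# Returns one "U+XXXX kind" string per code point in the word.
def describe_code_points(word)
  word.each_char.map do |char|
    code = format("U+%04X", char.ord)
    kind = MARK_NAMES.fetch(char, "letter")
    "#{code} #{kind}"
  end
end

puts describe_code_points(BISMI)
```

The six lines alternate between letter and mark: each base letter is followed immediately by the diacritic attached to it, which is what makes per-mark tracking possible.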
This distinction matters for teachers. When reviewing a student's recitation, they need to see patterns: does this student consistently drop sukun marks? Do they confuse fatha and damma? Character-level data makes these patterns visible.
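As a rough illustration of how character-level records surface such patterns, the sketch below tallies mistakes by mark name. The record shape (`word`, `char`) and the `DIACRITIC_NAMES` table are assumptions made for this example, not the app's actual schema.

```ruby
# Hypothetical name table covering a few common marks.
DIACRITIC_NAMES = {
  "\u064E" => "fatha",
  "\u064F" => "damma",
  "\u0650" => "kasra",
  "\u0651" => "shadda",
  "\u0652" => "sukun"
}.freeze

# Count mistakes per mark name, most frequent first.
# Anything not in the table is counted as a base letter.
def mistake_pattern(mistakes)
  mistakes
    .map { |m| DIACRITIC_NAMES.fetch(m[:char], "letter") }
    .tally
    .sort_by { |_name, count| -count }
    .to_h
end

# Assumed sample data: each record names the word position and the missed character.
sample = [
  { word: 1, char: "\u0652" },
  { word: 4, char: "\u0652" },
  { word: 7, char: "\u064E" },
  { word: 9, char: "\u0652" }
]

p mistake_pattern(sample)
```

For this sample, sukun is counted three times and fatha once, which is the kind of signal a teacher would look for.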
Understanding Arabic Character Structure
Arabic text is composed of base letters with combining diacritical marks (tashkeel) that modify pronunciation. These marks appear above or below the letter:
Diacritic Positions
Above the letter:
- Fatha (◌َ) U+064E — short "a" vowel
- Damma (◌ُ) U+064F — short "u" vowel
- Sukun (◌ْ) U+0652 — no vowel (consonant stop)
- Shadda (◌ّ) U+0651 — doubled consonant
- Fathatan (◌ً) U+064B — nunation with "a"
- Dammatan (◌ٌ) U+064C — nunation with "u"
- Superscript Alef (◌ٰ) U+0670 — long "a" (dagger alef)
Below the letter:
- Kasra (◌ِ) U+0650 — short "i" vowel
- Kasratan (◌ٍ) U+064D — nunation with "i"
- Subscript Alef (◌ٖ) U+0656
A single Arabic character column can therefore have three layers: the base letter, marks above, and marks below. Multiple marks can stack in the same position (e.g., shadda + fatha: بَّ).
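The position lists above can be expressed as a small lookup. This is a sketch that covers only the ten marks listed here. `ABOVE_MARKS`, `BELOW_MARKS`, and `mark_position` are names chosen for illustration.

```ruby
# Marks from the "above" list: fatha, damma, sukun, shadda,
# fathatan, dammatan, superscript alef.
ABOVE_MARKS = ["\u064E", "\u064F", "\u0652", "\u0651", "\u064B", "\u064C", "\u0670"].freeze

# Marks from the "below" list: kasra, kasratan, subscript alef.
BELOW_MARKS = ["\u0650", "\u064D", "\u0656"].freeze

# Returns :above, :below, or nil for anything not in the two lists
# (for example, a base letter).
def mark_position(char)
  return :above if ABOVE_MARKS.include?(char)
  return :below if BELOW_MARKS.include?(char)

  nil
end

# ba + shadda + fatha: two marks stacked above one base letter
stacked = "\u0628\u0651\u064E"
p stacked.each_char.map { |c| mark_position(c) }
# => [nil, :above, :above]
```

Two consecutive `:above` results after a single base letter are the stacking case: both marks belong to the same column and the same slot.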
The Character Column Model
We represent each visual "column" of an Arabic word as a structured object with three slots:
# app/helpers/arabic_text_helper.rb

# Combining marks: U+064B-U+065F, dagger alef (U+0670),
# and the Quranic annotation range U+06D6-U+06ED.
ARABIC_DIACRITICS = /[\u064B-\u065F\u0670\u06D6-\u06ED]/
# Marks drawn below the base letter: kasra, kasratan, subscript alef.
BELOW_DIACRITICS = /[\u0650\u064D\u0656]/

def arabic_character_columns(word_text)
  columns = []
  current_index = 1 # 1-based indexing

  word_text.each_char do |char|
    if char.match?(ARABIC_DIACRITICS)
      # Diacritical mark: attach to the current column.
      # A stray mark with no preceding base letter has no column, so it is skipped.
      if columns.any?
        position = char.match?(BELOW_DIACRITICS) ? :below : :above
        if columns.last[position]
          # Stack multiple marks in the same slot: shadda + fatha
          columns.last[position][:text] += char
        else
          columns.last[position] = { text: char, index: current_index }
        end
      end
      current_index += 1
    else
      # Base letter: start new column
      columns <<