Building a Natural Language Parser for Quranic References
بِسْمِ ٱللَّٰهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ
Teachers in a Quran class think in surahs, pages, and juz — not verse IDs. When a teacher types "Baqarah 255" or "juz 30 Q1", the system needs to resolve that into precise verse boundaries. In this post, I'll walk through how we built a natural language parser for Quranic references.
What Users Type
The parser accepts a wide range of input formats. Here are real examples:
Baqarah → Full surah (Al-Baqarah, verses 1-286)
Baqarah 255 → Single verse (Ayat al-Kursi)
Baqarah 255-260 → Verse range within a surah
Takweer to Falaq → Surah range (81 through 113)
2:255 → Verse key notation
2:255 to 3:10 → Cross-surah verse range
pg 5 → Full page in the mushaf
pg 233 H1 → First half of page 233
pg 233 H1 - 244 H2 → Page range with halves
juz 30 → Full juz (Amma)
juz 30 H1 → First hizb of juz 30
juz 30 Q1 → First quarter of juz 30
juz 30 Q1-3 → Quarters 1 through 3 of juz 30
nl: Baqarah 255 → With assignment type prefix
Baqarah 255 Pass → With grade suffix
The parser also handles Arabic digits (٠١٢٣٤٥٦٧٨٩) and Persian digits (۰۱۲۳۴۵۶۷۸۹), transliterated surah names with typos, and flexible delimiters (-, –, —, to).
Architecture Overview
The system has three layers:
- AssignmentRangeParser — the core parsing engine that resolves text into verse boundaries
- AssignmentRange::QuickEntry — a state machine that wraps the parser with progressive feedback
- unified_range_controller.js — the frontend Stimulus controller with keyboard shortcuts and real-time validation
Digit Normalization
Before any parsing begins, we normalize digits. Arabic-speaking users might type Eastern Arabic or Persian numerals:
# lib/digit_normalization.rb
module DigitNormalization
  def self.normalize(value)
    string = value.to_s.dup.force_encoding(Encoding::UTF_8)
    string
      .tr("\u0660\u0661\u0662\u0663\u0664\u0665\u0666\u0667\u0668\u0669",
          "0123456789")
      .tr("\u06F0\u06F1\u06F2\u06F3\u06F4\u06F5\u06F6\u06F7\u06F8\u06F9",
          "0123456789")
  end
end
U+0660-U+0669 Arabic-Indic digits ٠ ١ ٢ ٣ ٤ ٥ ٦ ٧ ٨ ٩
U+06F0-U+06F9 Extended Arabic-Indic ۰ ۱ ۲ ۳ ۴ ۵ ۶ ۷ ۸ ۹
"جز ٣٠" (juz 30 in Arabic digits) becomes "جز 30" before the parser ever sees it.
The Parser Engine
Dispatch by Range Type
The parser accepts a range type and raw input, dispatches to the appropriate handler, and returns a structured result:
# app/services/assignment_range_parser.rb
def parse(range_type, unified_range)
  cleaned_value = DigitNormalization.normalize(unified_range).strip
  result = case range_type
           when "ayah" then parse_ayah_range(cleaned_value)
           when "surah" then parse_surah_or_ayah_range(cleaned_value)
           when "page" then parse_page_range(cleaned_value)
           when "juz" then parse_juz_range(cleaned_value)
           when "qaidah" then parse_qaidah_range(cleaned_value)
           else error("Unknown range type: #{range_type}")
           end
  # validate_range_order returns a new hash, so its return value must be kept
  result = validate_range_order(result) if result[:valid] && range_type != "qaidah"
  result
end
Every result has the same shape:
{
  valid: true,
  display: "Baqarah 255",
  range_params: {
    from_chapter: 2,
    from_ayah: 255,
    to_chapter: 2,
    to_ayah: 255,
    selection_method: "ayah",
    selection_reference: "2:255-2:255"
  }
}
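A result in this shape can be assembled from two verse keys. The sketch below is only a guess at what a helper like build_result might do; the name build_range_result and its keyword arguments are hypothetical and do not come from the codebase.

```ruby
# Hypothetical helper: builds the result hash from two "chapter:ayah" keys.
def build_range_result(first_key, last_key, display:, selection_method: "ayah")
  from_chapter, from_ayah = first_key.split(":").map(&:to_i)
  to_chapter, to_ayah = last_key.split(":").map(&:to_i)
  {
    valid: true,
    display: display,
    range_params: {
      from_chapter: from_chapter,
      from_ayah: from_ayah,
      to_chapter: to_chapter,
      to_ayah: to_ayah,
      selection_method: selection_method,
      selection_reference: "#{first_key}-#{last_key}"
    }
  }
end

result = build_range_result("2:255", "2:255", display: "Baqarah 255")
```

Keeping both endpoints explicit, even for a single verse, means every consumer can treat a result as a range without special cases.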
Surah Name Resolution
This is the trickiest part. Teachers type surah names in various transliterations: "Baqarah", "baqara", "Al-Baqarah", or even Arabic "البقرة". We use a three-step approach:
def find_chapter(identifier)
  normalized = DigitNormalization.normalize(identifier).strip
  # Step 1: Try as a surah number
  return Chapter.find_by(id: normalized.to_i) if normalized.match?(/\A\d+\z/)
  # Step 2: Fuzzy search by English transliteration.
  # A pg_search scope returns a relation (always truthy), so take .first.
  # Step 3: Fall back to a lookup by Arabic name.
  Chapter.fuzzy_search(normalized).first || find_chapter_by_arabic_name(normalized)
end
def find_chapter_by_arabic_name(identifier)
  # Direct match
  chapter = Chapter.find_by(name_arabic: identifier)
  return chapter if chapter
  # Handle definite article "ال" prefix
  if identifier.start_with?("ال")
    Chapter.find_by(name_arabic: identifier.delete_prefix("ال"))
  else
    Chapter.find_by(name_arabic: "ال#{identifier}")
  end
end
The fuzzy search is a pg_search scope that combines trigram matching (PostgreSQL's pg_trgm extension) with prefix full-text search:
# Chapter model
pg_search_scope :fuzzy_search,
  against: :name_alternatives,
  using: {
    trigram: { threshold: 0.35 },
    tsearch: { prefix: true }
  }
name_alternatives stores multiple transliteration variants for each surah, so "Baqara", "Baqarah", and "Al-Baqara" all match Surah 2.
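To get a feel for the 0.35 threshold, here is a rough pure-Ruby approximation of how pg_trgm scores a single word. It is a simplification (the real extension also splits on non-alphanumeric characters and handles multiple words), and the method names are made up for this sketch.

```ruby
require "set"

# Approximates pg_trgm for one word: lowercase, pad with two leading spaces
# and one trailing space, then collect every 3-character window.
def trigrams(word)
  padded = "  #{word.downcase} "
  (0..padded.length - 3).map { |i| padded[i, 3] }.to_set
end

# Similarity = shared trigrams / all distinct trigrams across both words.
def trigram_similarity(a, b)
  ta = trigrams(a)
  tb = trigrams(b)
  (ta & tb).size.to_f / (ta | tb).size
end

trigram_similarity("baqara", "baqarah") # 6 shared of 9 distinct, about 0.67
trigram_similarity("bqrah", "baqarah")  # 3 shared of 11 distinct, about 0.27
```

Under this approximation a dropped final letter clears the threshold comfortably, while a spelling with the vowels stripped falls below it against that one variant. That is one reason to store several variants per surah rather than lowering the threshold.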
Parsing Surah and Ayah Ranges
The surah/ayah parser must distinguish between a full surah ("Baqarah"), a single verse ("Baqarah 255"), a verse range ("Baqarah 10-20"), and a surah range ("Takweer to Falaq"). The detection heuristics:
AYAH_PATTERNS = [
  %r{\d+:\d+},            # "2:255" - verse key notation
  %r{.+\s+\d+\s*[-–—]},   # "Surah 10-" (range start)
  %r{.+\s+\d+\s+to\s+},   # "Surah 10 to" (range start)
  %r{[-–—].*end}i,        # "-end" (dash followed by "end")
  %r{\s+to\s+end}i,       # " to end"
  %r{^.+\s+\d+$}          # "Surah 10" (single verse)
]
def surah_or_ayah?(input)
  AYAH_PATTERNS.any? { |pattern| input.match?(pattern) }
end
If any ayah pattern matches, we parse as a verse range. Otherwise, we parse as a surah range. Cross-surah verse ranges like "Al-Fatiha 1 to Baqarah 5" are supported by checking for a second surah name after the delimiter.
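Splitting a range into its two endpoints could look roughly like this. It is a sketch under simplifying assumptions: it ignores verse-key notation and the "end" keyword, and RANGE_DELIMITER, ENDPOINT, Endpoint, parse_endpoint, and split_endpoints are all hypothetical names.

```ruby
# A plain hyphen only counts as a delimiter after a digit, so the hyphen
# inside "Al-Fatiha" is left alone.
RANGE_DELIMITER = /\s+to\s+|\s*[–—]\s*|(?<=\d)\s*-\s*/i
ENDPOINT = /\A(?<surah>\D.*?)?\s*(?<ayah>\d+)?\z/

Endpoint = Struct.new(:surah, :ayah)

def parse_endpoint(text)
  match = ENDPOINT.match(text.strip)
  return Endpoint.new(nil, nil) unless match

  Endpoint.new(match[:surah], match[:ayah]&.to_i)
end

def split_endpoints(input)
  left, right = input.strip.split(RANGE_DELIMITER, 2)
  from = parse_endpoint(left)
  return [from, from.dup] if right.nil?

  to = parse_endpoint(right)
  to.surah ||= from.surah # "Baqarah 10-20": the end inherits the start surah
  [from, to]
end

split_endpoints("Al-Fatiha 1 to Baqarah 5") # two surahs, two verse numbers
split_endpoints("Baqarah 10-20")            # one surah, two verse numbers
split_endpoints("Takweer to Falaq")         # two surahs, no verse numbers
```

When the right-hand side carries its own surah name, the range crosses surahs; when it is a bare number, it stays inside the starting surah.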
Parsing Page Ranges
PAGE_PATTERN = %r{
  ^(\d+)               # Start page number
  \s*(?:H([12]))?      # Optional half (H1 or H2)
  \s*(?:to|-|–|—)?     # Optional delimiter
  \s*(\d+)?            # Optional end page number
  \s*(?:H([12]))?$     # Optional end half
}xi
def parse_page_range(input)
  match = input.match(PAGE_PATTERN)
  return error("Could not read page range") unless match
  start_page, start_half, end_page, end_half = match.captures
  # Validate both page numbers against the mushaf (typically 604 pages)
  last_page = [start_page, end_page].compact.map(&:to_i).max
  return error("Page exceeds #{mushaf.pages_count}") if last_page > mushaf.pages_count
  # Half-page validation: both sides or neither
  if end_page && (start_half.present? ^ end_half.present?)
    return error("Specify halves on both sides of the range")
  end
  # Convert page + half to verse boundaries.
  # A single half page such as "233 H1" ends in the same half it starts in.
  end_half ||= start_half unless end_page
  first_verse = MushafPage.verse_boundaries(mushaf, start_page, start_half)
  last_verse = MushafPage.verse_boundaries(mushaf, end_page || start_page, end_half)
  build_result(first_verse, last_verse, method: start_half ? :page_half : :page)
end
Half markers must be symmetric within a range: "pg 5 H1 - 10" is rejected, but "pg 5 H1 - 10 H2" and the single half page "pg 5 H1" are valid.
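The half-page rule can be checked in isolation. The sketch below assumes the "pg" prefix has already been stripped and that only H1 and H2 are legal; PAGE_SKETCH and check_page_halves are illustrative names, not part of the parser.

```ruby
# Single-line variant of the page pattern, restricted to H1/H2.
PAGE_SKETCH = /\A(\d+)\s*(?:H([12]))?\s*(?:to|-|–|—)?\s*(\d+)?\s*(?:H([12]))?\z/i

# Returns :ok, :mismatched_halves, or :invalid.
def check_page_halves(input)
  match = PAGE_SKETCH.match(input.strip)
  return :invalid unless match

  _start_page, start_half, end_page, end_half = match.captures
  return :ok if end_page.nil? # a single page or single half page

  start_half.nil? == end_half.nil? ? :ok : :mismatched_halves
end

check_page_halves("5 H1 - 10")    # :mismatched_halves
check_page_halves("5 H1 - 10 H2") # :ok
check_page_halves("5 H1")         # :ok
check_page_halves("five")         # :invalid
```

Rejecting the mixed form avoids guessing whether "5 H1 - 10" should end at the middle or the bottom of page 10.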
Parsing Juz Ranges
Juz parsing supports three levels of granularity: full juz, hizb (half a juz), and quarter (a quarter of a juz, which spans two rub).
JUZ_QUARTER_PATTERN = %r{
  ^(\d+)                            # Juz number (1-30)
  \s*Q(\d+)                         # Quarter start (Q1-Q4)
  (?:\s*(?:to|-|–|—)\s*Q?(\d+))?    # Optional quarter end
  $
}xi
def parse_juz_quarter(juz_num, quarter_from, quarter_to)
  quarter_to ||= quarter_from
  # Each juz has 8 rubs (2 per quarter)
  rub_from = (juz_num - 1) * 8 + ((quarter_from - 1) * 2) + 1
  rub_to = (juz_num - 1) * 8 + (quarter_to * 2)
  first_verse = Rub.find(rub_from).first_verse_key
  last_verse = Rub.find(rub_to).last_verse_key
  build_result(first_verse, last_verse, method: :rub,
               reference: "#{juz_num} Q#{quarter_from}-#{quarter_to}")
end
The Quran has 30 juz, 60 hizb (2 per juz), and 240 rub (8 per juz). Each is a reference table with precomputed verse boundaries.
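The index arithmetic is easy to verify without a database. This sketch restates it as pure functions; the constant and method names are mine, and the hizb helper assumes a hizb is simply the first or second group of four rub in a juz.

```ruby
RUBS_PER_JUZ = 8
RUBS_PER_QUARTER = 2
RUBS_PER_HIZB = 4

# Rub indexes (1..240) covered by quarters q_from..q_to of a juz.
def rub_range_for_quarters(juz, q_from, q_to = q_from)
  base = (juz - 1) * RUBS_PER_JUZ
  (base + (q_from - 1) * RUBS_PER_QUARTER + 1)..(base + q_to * RUBS_PER_QUARTER)
end

# Rub indexes covered by hizb 1 or 2 of a juz.
def rub_range_for_hizb(juz, hizb)
  base = (juz - 1) * RUBS_PER_JUZ + (hizb - 1) * RUBS_PER_HIZB
  (base + 1)..(base + RUBS_PER_HIZB)
end

rub_range_for_quarters(30, 1)    # 233..234
rub_range_for_quarters(30, 1, 3) # 233..238
rub_range_for_quarters(30, 4)    # 239..240, the last two rub
rub_range_for_hizb(30, 1)        # 233..236
```

The first verse of the range comes from the first rub in the span and the last verse from the last rub, so only two lookups are needed regardless of how many quarters are selected.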
Verse Order Validation
As a final check, the parser verifies that the start verse does not come after the end verse in Quran order (a single verse, where both are equal, is valid):
def validate_range_order(result)
  from_verse = Verse.find_by(verse_key: from_key(result))
  to_verse = Verse.find_by(verse_key: to_key(result))
  # Verses are stored with sequential IDs in Quran order
  if from_verse.id > to_verse.id
    result.merge(valid: false, error: "End verse comes before start verse")
  else
    result
  end
end
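The same ordering check can be expressed without verse records by comparing keys numerically. This is an alternative sketch, not the approach used above, and the method names are hypothetical. Note that comparing the raw strings would be wrong, since "10:1" sorts before "2:255" as text.

```ruby
# Turns "2:255" into [2, 255] so that array comparison follows Quran order.
def verse_key_tuple(key)
  key.split(":").map(&:to_i)
end

def range_in_order?(from_key, to_key)
  (verse_key_tuple(from_key) <=> verse_key_tuple(to_key)) <= 0
end

range_in_order?("2:255", "3:10")  # true
range_in_order?("2:255", "2:255") # true, a single verse is a valid range
range_in_order?("10:1", "2:255")  # false
```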
The State Machine
The QuickEntry model wraps the parser with a five-state machine that drives the UI:
# app/models/assignment_range/quick_entry.rb
STATES = %i[empty pending needs_type valid error].freeze
def state
  if @raw_input.blank?
    :empty # No input yet
  elsif error.present?
    :error # Parsing failed
  elsif range_attrs.present? && type.present?
    :valid # Ready to save
  elsif range_attrs.present? && type.nil?
    :needs_type # Range OK, but no assignment type selected
  else
    :pending # Still typing or incomplete
  end
end
Each state maps to a distinct UI treatment:
empty → Gray text: "Type a surah name, page, or juz"
pending → Gray text: "Enter the ending verse number"
needs_type → Amber with arrow-up icon: "Select an assignment type above"
valid → Green with check icon: "Ready to save"
error → Red with x icon: "Could not find surah 'bqrah'"
Type Detection from Input
The assignment type (NL, NR, OR, RG, QH) can be detected from the beginning or end of the input:
TYPE_PREFIXES = %w[nl nr or rg qh].freeze
def detect_type_from_input
  prefixes = TYPE_PREFIXES.join("|")
  # Try beginning: "nl: Baqarah" or "nl Baqarah".
  # Require a colon, whitespace, or end of input after the prefix so that a
  # longer word that merely starts with "or" or "rg" is not consumed.
  if (match = @input.match(/\A(#{prefixes})(?::\s*|\s+|\z)/i))
    @type ||= AssignmentType.from_prefix(match[1])
    @input = match.post_match.strip
    return
  end
  # Try end: "Baqarah nl"
  if (match = @input.match(/\s+(#{prefixes})\s*\z/i))
    @type ||= AssignmentType.from_prefix(match[1])
    @input = match.pre_match.strip
  end
end
Grade Detection
GRADE_ALIASES = {
  "pass" => :pass, "pas" => :pass, "pa" => :pass, "good" => :pass,
  "fail" => :fail, "fai" => :fail, "fa" => :fail,
  "redo" => :redo
}.freeze
def detect_grade
  grade_pattern = /\s+(#{GRADE_ALIASES.keys.join("|")})\s*\z/i
  if (match = @input.match(grade_pattern))
    @grade = GRADE_ALIASES[match[1].downcase]
    @input = @input.sub(match[0], "").strip
  end
end
So "nl Baqarah 255 pass" is parsed as: type = New Lesson, range = Baqarah 255, grade = Pass.
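Put together, the two detection steps behave roughly like the standalone function below. It is a sketch with assumptions of its own: the grade is stripped before the type, a leading prefix must be followed by a colon or whitespace, and the type is returned as the raw prefix rather than an AssignmentType. The name split_quick_entry is hypothetical.

```ruby
TYPE_PREFIXES = %w[nl nr or rg qh].freeze
GRADE_ALIASES = {
  "pass" => :pass, "pas" => :pass, "pa" => :pass, "good" => :pass,
  "fail" => :fail, "fai" => :fail, "fa" => :fail,
  "redo" => :redo
}.freeze

# Splits raw input into a type prefix, a grade, and the remaining range text.
def split_quick_entry(input)
  rest = input.strip
  type = nil
  grade = nil

  # Trailing grade: "... pass"
  if (m = rest.match(/\s+(#{GRADE_ALIASES.keys.join("|")})\s*\z/i))
    grade = GRADE_ALIASES[m[1].downcase]
    rest = m.pre_match
  end

  # Leading type ("nl: ..." or "nl ...") or trailing type ("... nl")
  if (m = rest.match(/\A(#{TYPE_PREFIXES.join("|")})(?::\s*|\s+)/i))
    type = m[1].downcase
    rest = m.post_match
  elsif (m = rest.match(/\s+(#{TYPE_PREFIXES.join("|")})\s*\z/i))
    type = m[1].downcase
    rest = m.pre_match
  end

  { type: type, grade: grade, range: rest.strip }
end

split_quick_entry("nl Baqarah 255 pass")
# type "nl", grade :pass, range "Baqarah 255"
```

Whatever remains after both steps is what gets handed to range type detection and the parser.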
Range Type Detection
When the user doesn't explicitly select a range type (surah, page, or juz), the system infers it from the input with a few heuristics:
def detect_range_type
  lower_input = @input.downcase.strip
  if qaidah_context?
    :qaidah
  elsif lower_input.match?(/\A(pg|page)[\s.]/i)
    :page
  elsif lower_input.match?(/\A(juz|j)[\s\d]/i)
    :juz
  elsif lower_input.match?(/\A\d+\s*(h[12]?)?\s*([-–—]|to)?\s*\d*\s*(h[12]?)?\z/i)
    :page # Bare numbers like "5", "5-10" default to page
  else
    :surah # Default: assume surah name
  end
end
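To see how these heuristics classify typical input, here is a standalone restatement that leaves out the qaidah branch (which depends on context not shown here). The name sketch_range_type is hypothetical.

```ruby
# Same regexes as above, applied to a plain string argument.
def sketch_range_type(input)
  text = input.downcase.strip
  if text.match?(/\A(pg|page)[\s.]/)
    :page
  elsif text.match?(/\A(juz|j)[\s\d]/)
    :juz
  elsif text.match?(/\A\d+\s*(h[12]?)?\s*([-–—]|to)?\s*\d*\s*(h[12]?)?\z/)
    :page
  else
    :surah
  end
end

sketch_range_type("pg 233 H1")   # :page
sketch_range_type("juz 30 Q1")   # :juz
sketch_range_type("5-10")        # :page, bare numbers default to page
sketch_range_type("2:255")       # :surah, verse keys go to the surah/ayah parser
sketch_range_type("Jumuah 5")    # :surah, "j" must be followed by a space or digit
```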
Contextual Hints
The hint engine adapts to the current state. If the input has a typo, it suggests corrections via Levenshtein distance:
def contextual_hints
  if error? && fuzzy_match.present?
    [{ input: fuzzy_match, desc: "did you mean this?" }]
  elsif pending? && ambiguous_number?
    number_disambiguation_hints # Is "30" a page or juz?
  elsif valid? && grade.nil?
    grade_hints # Suggest "pass", "fail", "redo"
  elsif empty?
    default_examples
  else
    range_type_hints
  end
end
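A typo suggestion based on Levenshtein distance might be sketched as follows. The distance function is the standard dynamic-programming algorithm; suggest_surah, its max_distance cutoff of 2, and the candidate list are assumptions made for illustration.

```ruby
# Levenshtein distance, keeping only the previous and current rows.
def levenshtein(a, b)
  previous = (0..b.length).to_a
  a.each_char.with_index(1) do |char_a, i|
    current = [i]
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      current << [
        previous[j] + 1,       # deletion
        current[j - 1] + 1,    # insertion
        previous[j - 1] + cost # substitution or match
      ].min
    end
    previous = current
  end
  previous.last
end

# Returns the closest name within max_distance edits, or nil.
def suggest_surah(input, names, max_distance: 2)
  best = names.min_by { |name| levenshtein(input.downcase, name.downcase) }
  return nil if best.nil?

  levenshtein(input.downcase, best.downcase) <= max_distance ? best : nil
end

suggest_surah("bqrah", %w[Baqarah Falaq Takweer]) # "Baqarah", two insertions away
suggest_surah("xyz", %w[Baqarah Falaq Takweer])   # nil, nothing is close enough
```

A small cutoff keeps suggestions from being misleading: offering an unrelated surah is worse than offering nothing.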
The Frontend
Keyboard Shortcuts
The Stimulus controller registers global keyboard shortcuts for fast range type selection:
// app/javascript/controllers/unified_range_controller.js
setupKeyboardShortcuts() {
  document.addEventListener("keydown", (event) => {
    if (event.target.tagName === "INPUT" || event.target.tagName === "TEXTAREA") return;
    if (event.metaKey || event.ctrlKey || event.altKey) return;
    switch (event.key.toLowerCase()) {
      case "s": this.selectSurah(); break;
      case "p": this.selectPage(); break;
      case "j": this.selectJuz(); break;
    }
  });
}
Real-Time Validation
As the user types, the frontend debounces input and sends it to a parse endpoint via Turbo:
# app/controllers/assignments/parse_controller.rb
def create
  @entry = parse_quick_entry(parse_params, personal_mushaf: @personal_mushaf)
  respond_to do |format|
    format.turbo_stream do
      render turbo_stream: turbo_stream.update(
        "parse-result",
        partial: "assignments/quick_entry_feedback",
        locals: { entry: @entry }
      )
    end
  end
end
The parse endpoint returns a Turbo Stream that updates the feedback area with the current state (valid, error, pending, etc.), preview text, and contextual hints — all without a full page reload.
Persistence
Once the user submits, the parsed range is stored as an immutable AssignmentRange record:
class AssignmentRange