Building a Natural Language Parser for Quranic References

Jibran Kalia 14 min read

بِسْمِ ٱللَّٰهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ

Teachers in a Quran class think in surahs, pages, and juz — not verse IDs. When a teacher types "Baqarah 255" or "juz 30 Q1", the system needs to resolve that into precise verse boundaries. In this post, I'll walk through how we built a natural language parser for Quranic references.

What Users Type

The parser accepts a wide range of input formats. Here are real examples:

Baqarah                    → Full surah (Al-Baqarah, verses 1-286)
Baqarah 255                → Single verse (Ayat al-Kursi)
Baqarah 255-260            → Verse range within a surah
Takweer to Falaq           → Surah range (81 through 113)
2:255                      → Verse key notation
2:255 to 3:10              → Cross-surah verse range

pg 5                       → Full page in the mushaf
pg 233 H1                  → First half of page 233
pg 233 H1 - 244 H2         → Page range with halves

juz 30                     → Full juz (Amma)
juz 30 H1                  → First hizb of juz 30
juz 30 Q1                  → First quarter of juz 30
juz 30 Q1-3                → Quarters 1 through 3 of juz 30

nl: Baqarah 255            → With assignment type prefix
Baqarah 255 Pass           → With grade suffix

The parser also handles Arabic digits (٠١٢٣٤٥٦٧٨٩) and Persian digits (۰۱۲۳۴۵۶۷۸۹), transliterated surah names with typos, and flexible delimiters (hyphen, en dash, em dash, or "to").

Architecture Overview

The system has three layers:

  1. AssignmentRangeParser — the core parsing engine that resolves text into verse boundaries
  2. AssignmentRange::QuickEntry — a state machine that wraps the parser with progressive feedback
  3. unified_range_controller.js — the frontend Stimulus controller with keyboard shortcuts and real-time validation

Digit Normalization

Before any parsing begins, we normalize digits. Arabic-speaking users might type Eastern Arabic or Persian numerals:

# lib/digit_normalization.rb

module DigitNormalization
  def self.normalize(value)
    string = value.to_s.dup.force_encoding(Encoding::UTF_8)

    string
      .tr("\u0660\u0661\u0662\u0663\u0664\u0665\u0666\u0667\u0668\u0669",
          "0123456789")
      .tr("\u06F0\u06F1\u06F2\u06F3\u06F4\u06F5\u06F6\u06F7\u06F8\u06F9",
          "0123456789")
  end
end

The two ranges mapped to ASCII digits:

U+0660-U+0669  Arabic-Indic digits              ٠ ١ ٢ ٣ ٤ ٥ ٦ ٧ ٨ ٩
U+06F0-U+06F9  Extended Arabic-Indic (Persian)  ۰ ۱ ۲ ۳ ۴ ۵ ۶ ۷ ۸ ۹
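
As a quick sanity check, the same mapping can also be written with character ranges, which String#tr expands by code point. This is a standalone sketch rather than the module above, and normalize_digits is a hypothetical name used only here.

```ruby
# Standalone sketch: map Arabic-Indic and Extended Arabic-Indic digits to ASCII.
# `normalize_digits` is a hypothetical helper, not part of the codebase above.
def normalize_digits(value)
  # tr expands "a-b" ranges by code point, so each ten-digit block maps onto 0-9.
  value.to_s.tr("\u0660-\u0669\u06F0-\u06F9", "0-90-9")
end

puts normalize_digits("جز ٣٠")  # input written with Arabic-Indic digits
puts normalize_digits("۲:۲۵۵")  # input written with Persian digits
```

Letters and punctuation pass through untouched, so the rest of the parser only ever has to deal with ASCII digits.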

"جز ٣٠" (juz 30 in Arabic digits) becomes "جز 30" before the parser ever sees it.

The Parser Engine

Dispatch by Range Type

The parser accepts a range type and raw input, dispatches to the appropriate handler, and returns a structured result:

# app/services/assignment_range_parser.rb

def parse(range_type, unified_range)
  cleaned_value = DigitNormalization.normalize(unified_range).strip

  result = case range_type
  when "ayah"   then parse_ayah_range(cleaned_value)
  when "surah"  then parse_surah_or_ayah_range(cleaned_value)
  when "page"   then parse_page_range(cleaned_value)
  when "juz"    then parse_juz_range(cleaned_value)
  when "qaidah" then parse_qaidah_range(cleaned_value)
  else error("Unknown range type: #{range_type}")
  end

  # validate_range_order returns a new hash, so keep its return value
  result = validate_range_order(result) if result[:valid] && range_type != "qaidah"
  result
end

Every result has the same shape:

{
  valid: true,
  display: "Baqarah 255",
  range_params: {
    from_chapter: 2,
    from_ayah: 255,
    to_chapter: 2,
    to_ayah: 255,
    selection_method: "ayah",
    selection_reference: "2:255-2:255"
  }
}

Surah Name Resolution

This is the trickiest part. Teachers type surah names in various transliterations: "Baqarah", "baqara", "Al-Baqarah", or even Arabic "البقرة". We use a three-step approach:

def find_chapter(identifier)
  normalized = DigitNormalization.normalize(identifier).strip

  # Step 1: Try as a surah number
  return Chapter.find_by(id: normalized.to_i) if normalized.match?(/\A\d+\z/)

  # Step 2: Fuzzy search by English transliteration.
  # The scope returns a relation (always truthy), so take the top match with .first
  # Step 3: Fall back to the Arabic name
  Chapter.fuzzy_search(normalized).first || find_chapter_by_arabic_name(normalized)
end

def find_chapter_by_arabic_name(identifier)
  # Direct match
  chapter = Chapter.find_by(name_arabic: identifier)
  return chapter if chapter

  # Handle definite article "ال" prefix
  if identifier.start_with?("ال")
    Chapter.find_by(name_arabic: identifier.delete_prefix("ال"))
  else
    Chapter.find_by(name_arabic: "ال#{identifier}")
  end
end

The fuzzy search is a pg_search scope that combines trigram matching (PostgreSQL's pg_trgm extension) with prefix full-text search:

# Chapter model
pg_search_scope :fuzzy_search,
  against: :name_alternatives,
  using: {
    trigram: { threshold: 0.35 },
    tsearch: { prefix: true }
  }

name_alternatives stores multiple transliteration variants for each surah, so "Baqara", "Baqarah", and "Al-Baqara" all match Surah 2.
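
To get a feel for what a 0.35 trigram threshold means, here is a rough pure-Ruby approximation of trigram similarity for single words. It is only a sketch: the real scoring happens inside PostgreSQL, and the helper names below are assumptions made for illustration.

```ruby
require "set"

# Approximation of pg_trgm's trigram extraction for a single word:
# lowercase, pad with two leading spaces and one trailing space,
# then take every 3-character window.
def trigrams(word)
  padded = "  #{word.downcase} "
  (0..padded.length - 3).map { |i| padded[i, 3] }.to_set
end

# Similarity = shared trigrams / all distinct trigrams.
def trigram_similarity(a, b)
  ta = trigrams(a)
  tb = trigrams(b)
  (ta & tb).size.to_f / (ta | tb).size
end

THRESHOLD = 0.35 # same value as the scope above

%w[baqara baqarah bqrah].each do |typed|
  score = trigram_similarity(typed, "baqarah")
  verdict = score >= THRESHOLD ? "clears threshold" : "below threshold"
  puts format("%-8s %.2f %s", typed, score, verdict)
end
```

Under this approximation a dropped final letter still scores well, while a heavily abbreviated spelling shares too few trigrams with the stored name. That is one reason to store several variants per surah.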

Parsing Surah and Ayah Ranges

The surah/ayah parser must distinguish between a full surah ("Baqarah"), a single verse ("Baqarah 255"), a verse range ("Baqarah 10-20"), and a surah range ("Takweer to Falaq"). The detection heuristics:

AYAH_PATTERNS = [
  %r{\d+:\d+},              # "2:255" - verse key notation
  %r{.+\s+\d+\s*[-–—]},     # "Surah 10-" (range start)
  %r{.+\s+\d+\s+to\s+},     # "Surah 10 to" (range start)
  %r{[-–—].*end}i,          # "-end" (dash followed by "end")
  %r{\s+to\s+end}i,         # " to end"
  %r{^.+\s+\d+$}            # "Surah 10" (single verse)
]

def surah_or_ayah?(input)
  AYAH_PATTERNS.any? { |pattern| input.match?(pattern) }
end

If any ayah pattern matches, we parse as a verse range. Otherwise, we parse as a surah range. Cross-surah verse ranges like "Al-Fatiha 1 to Baqarah 5" are supported by checking for a second surah name after the delimiter.
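
The "second surah name after the delimiter" check can be sketched as follows. This is an illustrative helper under stated assumptions, not the parser's actual code: split_verse_range and RANGE_DELIMITER are hypothetical names, and the rule that a dash only delimits when it follows a digit is my own choice to keep names like "Al-Fatiha" intact.

```ruby
# Hypothetical helper illustrating the "second surah name" check.
# A dash only counts as a range delimiter when it follows a digit,
# so the hyphen in "Al-Fatiha" is left alone.
RANGE_DELIMITER = /\s+to\s+|(?<=\d)\s*[-–—]\s*/i

def split_verse_range(input)
  from, to = input.strip.split(RANGE_DELIMITER, 2)
  # The right side names another surah if it contains letters other than "end".
  cross_surah = !to.nil? && !to.casecmp?("end") && to.match?(/\p{L}/)
  { from: from, to: to, cross_surah: cross_surah }
end

p split_verse_range("Al-Fatiha 1 to Baqarah 5")
p split_verse_range("Baqarah 255-260")
p split_verse_range("Baqarah 10 to end")
```

When cross_surah is true, the right-hand side would go through surah name resolution again. Otherwise it is read as an ayah number (or "end") within the first surah.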

Parsing Page Ranges

PAGE_PATTERN = %r{
  ^(\d+)           # Start page number
  \s*(?:H(\d+))?   # Optional half (H1 or H2)
  \s*(?:to|-|–|—)? # Optional delimiter
  \s*(\d+)?        # Optional end page number
  \s*(?:H(\d+))?$  # Optional end half
}xi

def parse_page_range(input)
  match = input.match(PAGE_PATTERN)
  return error("Could not read page range") unless match

  start_page, start_half, end_page, end_half = match.captures

  # Validate both page numbers against mushaf (typically 604)
  last_page = (end_page || start_page).to_i
  if [start_page.to_i, last_page].max > mushaf.pages_count
    return error("Page exceeds #{mushaf.pages_count}")
  end

  # Half-page validation: both sides or neither
  if end_page && (start_half.present? ^ end_half.present?)
    return error("Specify halves on both sides of the range")
  end

  # Convert page + half to verse boundaries
  first_verse = MushafPage.verse_boundaries(mushaf, start_page, start_half)
  last_verse  = MushafPage.verse_boundaries(mushaf, end_page || start_page, end_half)

  build_result(first_verse, last_verse, method: start_half ? :page_half : :page)
end

Half-page markers must be symmetric across a range: "pg 5 H1 - 10" is rejected, but "pg 5 H1 - 10 H2" and the single-page "pg 5 H1" are valid.

Parsing Juz Ranges

Juz parsing supports three levels of granularity: full juz, hizb (half a juz), and quarter (a quarter of a juz, which spans two rubs).

JUZ_QUARTER_PATTERN = %r{
  ^(\d+)                        # Juz number (1-30)
  \s*Q(\d+)                     # Quarter start (Q1-Q4)
  (?:\s*(?:to|-|–|—)\s*Q?(\d+))? # Optional quarter end
  $
}xi

def parse_juz_quarter(juz_num, quarter_from, quarter_to)
  quarter_to ||= quarter_from

  # Each juz has 8 rubs (2 per quarter)
  rub_from = (juz_num - 1) * 8 + ((quarter_from - 1) * 2) + 1
  rub_to   = (juz_num - 1) * 8 + (quarter_to * 2)

  first_verse = Rub.find(rub_from).first_verse_key
  last_verse  = Rub.find(rub_to).last_verse_key

  build_result(first_verse, last_verse, method: :rub,
    reference: "#{juz_num} Q#{quarter_from}-#{quarter_to}")
end

The Quran has 30 juz, 60 hizb (2 per juz), and 240 rub (8 per juz). Each is a reference table with precomputed verse boundaries.
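
The index arithmetic is easy to get off by one, so here it is as a pure function with no database involved. rub_span is a hypothetical name, and the bounds checks are an addition of mine rather than something shown in the parser.

```ruby
# Pure-arithmetic version of the rub index math above.
# `rub_span` is a hypothetical name used only for this illustration.
RUBS_PER_JUZ = 8
RUBS_PER_QUARTER = 2

def rub_span(juz_num, quarter_from, quarter_to = quarter_from)
  raise ArgumentError, "juz must be 1-30" unless (1..30).cover?(juz_num)
  unless (1..4).cover?(quarter_from) && (quarter_from..4).cover?(quarter_to)
    raise ArgumentError, "quarters must satisfy 1 <= from <= to <= 4"
  end

  base  = (juz_num - 1) * RUBS_PER_JUZ
  first = base + (quarter_from - 1) * RUBS_PER_QUARTER + 1
  last  = base + quarter_to * RUBS_PER_QUARTER
  first..last
end

p rub_span(30, 1)     # juz 30 Q1
p rub_span(30, 1, 3)  # juz 30 Q1-3
```

For juz 30 the base is 29 * 8 = 232, so Q1 covers rubs 233 and 234, and Q1-3 covers rubs 233 through 238.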

Verse Order Validation

As a final check, the parser verifies that the start verse comes before the end verse in Quran order:

def validate_range_order(result)
  from_verse = Verse.find_by(verse_key: from_key(result))
  to_verse   = Verse.find_by(verse_key: to_key(result))

  # Verses are stored with sequential IDs in Quran order
  if from_verse.id > to_verse.id
    result.merge(valid: false, error: "End verse comes before start verse")
  else
    result
  end
end
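
The same ordering can be checked without a database lookup, because a verse key sorts by chapter first and then by ayah. The sketch below is an alternative to the ID comparison above, not the parser's implementation, and both helper names are hypothetical.

```ruby
# Hypothetical helper: turn "2:255" into [2, 255] so Array#<=> can order keys.
def verse_key_parts(key)
  key.split(":").map { |part| Integer(part, 10) }
end

# True when from_key is at or before to_key in Quran order.
def in_quran_order?(from_key, to_key)
  (verse_key_parts(from_key) <=> verse_key_parts(to_key)) <= 0
end

p in_quran_order?("2:255", "3:10")
p in_quran_order?("3:10", "2:255")
```

Comparing integer pairs rather than raw strings matters here: as strings, "2:9" would sort after "2:10".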

The State Machine

The QuickEntry model wraps the parser with a five-state machine that drives the UI:

# app/models/assignment_range/quick_entry.rb

STATES = %i[empty pending needs_type valid error].freeze

def state
  if @raw_input.blank?
    :empty         # No input yet
  elsif error.present?
    :error         # Parsing failed
  elsif range_attrs.present? && type.present?
    :valid         # Ready to save
  elsif range_attrs.present? && type.nil?
    :needs_type    # Range OK, but no assignment type selected
  else
    :pending       # Still typing or incomplete
  end
end

Each state maps to a distinct UI treatment:

empty      → Gray text: "Type a surah name, page, or juz"
pending    → Gray text: "Enter the ending verse number"
needs_type → Amber with arrow-up icon: "Select an assignment type above"
valid      → Green with check icon: "Ready to save"
error      → Red with x icon: "Could not find surah 'bqrah'"

Type Detection from Input

The assignment type (NL, NR, OR, RG, QH) can be detected from the beginning or end of the input:

TYPE_PREFIXES = %w[nl nr or rg qh].freeze

def detect_type_from_input
  alternatives = TYPE_PREFIXES.join("|")

  # Try beginning: "nl: Baqarah" or "nl Baqarah".
  # Require a colon, whitespace, or end of input after the code so a word
  # that merely starts with those letters is not mistaken for a type.
  if (match = @input.match(/\A(#{alternatives})(?::\s*|\s+|\z)/i))
    @type ||= AssignmentType.from_prefix(match[1])
    @input = match.post_match.strip
    return
  end

  # Try end: "Baqarah nl"
  if (match = @input.match(/\s+(#{alternatives})\s*\z/i))
    @type ||= AssignmentType.from_prefix(match[1])
    # pre_match drops exactly the matched suffix, not an earlier identical substring
    @input = match.pre_match.strip
  end
end

Grade Detection

GRADE_ALIASES = {
  "pass" => :pass, "pas" => :pass, "pa" => :pass, "good" => :pass,
  "fail" => :fail, "fai" => :fail, "fa" => :fail,
  "redo" => :redo
}.freeze

def detect_grade
  grade_pattern = /\s+(#{GRADE_ALIASES.keys.join("|")})\s*\z/i
  if (match = @input.match(grade_pattern))
    @grade = GRADE_ALIASES[match[1].downcase]
    @input = match.pre_match.strip  # drop exactly the matched suffix
  end
end

So "nl Baqarah 255 pass" is parsed as: type = New Lesson, range = Baqarah 255, grade = Pass.
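
That example can be traced end to end with a standalone function. This is a simplified sketch: split_entry, TYPES, and GRADES are hypothetical names, the grade list omits the short aliases, and the model above keeps this state in instance variables instead of returning a hash.

```ruby
TYPES  = %w[nl nr or rg qh].freeze
GRADES = { "pass" => :pass, "fail" => :fail, "redo" => :redo }.freeze

# Hypothetical standalone function that peels a leading type code and a
# trailing grade word off the raw input, leaving the range text.
def split_entry(raw)
  text = raw.strip
  type = nil
  grade = nil

  # Leading type: must be followed by a colon or whitespace.
  if (m = text.match(/\A(#{TYPES.join("|")})(?::\s*|\s+)/i))
    type = m[1].downcase
    text = m.post_match.strip
  end

  # Trailing grade word.
  if (m = text.match(/\s+(#{GRADES.keys.join("|")})\s*\z/i))
    grade = GRADES[m[1].downcase]
    text = m.pre_match.strip
  end

  { type: type, range: text, grade: grade }
end

p split_entry("nl Baqarah 255 pass")
```

Whatever remains in the range field is what gets handed to range type detection and then to the parser.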

Range Type Detection

When the user doesn't explicitly specify a type, the system uses heuristics:

def detect_range_type
  lower_input = @input.downcase.strip

  if qaidah_context?
    :qaidah
  elsif lower_input.match?(/\A(pg|page)[\s.]/i)
    :page
  elsif lower_input.match?(/\A(juz|j)[\s\d]/i)
    :juz
  elsif lower_input.match?(/\A\d+\s*(h[12]?)?\s*([-–—]|to)?\s*\d*\s*(h[12]?)?\z/i)
    :page       # Bare numbers like "5", "5-10" default to page
  else
    :surah      # Default: assume surah name
  end
end

Contextual Hints

The hint engine adapts to the current state. If the input has a typo, it suggests corrections via Levenshtein distance:

def contextual_hints
  if error? && fuzzy_match.present?
    [{ input: fuzzy_match, desc: "did you mean this?" }]
  elsif pending? && ambiguous_number?
    number_disambiguation_hints  # Is "30" a page or juz?
  elsif valid? && grade.nil?
    grade_hints                  # Suggest "pass", "fail", "redo"
  elsif empty?
    default_examples
  else
    range_type_hints
  end
end
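
The code that computes fuzzy_match is not shown here, so the following is only a sketch of how a Levenshtein-based suggestion could work. The suggest helper, its max_distance cutoff, and the candidate list are all assumptions made for illustration.

```ruby
# Classic dynamic-programming edit distance, keeping only the previous row.
def levenshtein(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      # deletion, insertion, substitution (or match)
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Hypothetical suggestion helper: closest name within `max_distance`, or nil.
def suggest(input, names, max_distance: 2)
  best = names.min_by { |name| levenshtein(input.downcase, name.downcase) }
  best if best && levenshtein(input.downcase, best.downcase) <= max_distance
end

p suggest("bqrah", %w[Fatiha Baqarah Imran])
```

The cutoff keeps the hint from proposing a surah that is nowhere near what the teacher typed. "bqrah" is two insertions away from "Baqarah", so it qualifies with the default cutoff of 2.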

The Frontend

Keyboard Shortcuts

The Stimulus controller registers global keyboard shortcuts for fast range type selection:

// app/javascript/controllers/unified_range_controller.js

setupKeyboardShortcuts() {
  document.addEventListener("keydown", (event) => {
    if (event.target.tagName === "INPUT" || event.target.tagName === "TEXTAREA") return;
    if (event.metaKey || event.ctrlKey || event.altKey) return;

    switch (event.key.toLowerCase()) {
      case "s": this.selectSurah(); break;
      case "p": this.selectPage(); break;
      case "j": this.selectJuz(); break;
    }
  });
}

Real-Time Validation

As the user types, the frontend debounces input and sends it to a parse endpoint via Turbo:

# app/controllers/assignments/parse_controller.rb

def create
  @entry = parse_quick_entry(parse_params, personal_mushaf: @personal_mushaf)

  respond_to do |format|
    format.turbo_stream do
      render turbo_stream: turbo_stream.update(
        "parse-result",
        partial: "assignments/quick_entry_feedback",
        locals: { entry: @entry }
      )
    end
  end
end

The parse endpoint returns a Turbo Stream that updates the feedback area with the current state (valid, error, pending, etc.), preview text, and contextual hints — all without a full page reload.

Persistence

Once the user submits, the parsed range is stored as an immutable AssignmentRange record:

class AssignmentRange