Online education growth demands scalable quality assessment:
- Tight development timelines
- QA standards and rubrics are underutilized
- Difficulty connecting learning sciences with instructional design practice
- Lack of systematic identification of improvement opportunities
QA Bot's key innovations:
- Hybrid architecture (rule-based + LLM prompting)
- Resource efficient (8B parameters)
- Open source and free access
- Actionable feedback based on course quality standards
- RQ1: Can small open-source models match expert human reviewers?
- RQ2: Do different models agree with one another?
- RQ3: Do small models perform comparably to large commercial models?
Validation: 7 Canvas courses evaluated against the 20 OLC Essential Design standards, with human review as the benchmark.
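The hybrid architecture pairs deterministic rule checks with LLM judgment. A minimal sketch, assuming a text dump of course content and a callable local model; the function names, regex checks, and rating scale are hypothetical illustrations, not the bot's actual implementation:

```python
import re

def rule_checks(course_text: str) -> dict:
    """Deterministic checks that need no LLM call (hypothetical examples)."""
    return {
        "has_learning_objectives": bool(re.search(r"learning objectives?", course_text, re.I)),
        "has_grading_policy": bool(re.search(r"grading (policy|scale)", course_text, re.I)),
    }

def build_prompt(course_text: str, standard: str) -> str:
    """Ask the (local 8B) model to rate one standard and justify the rating."""
    return (
        f"Standard: {standard}\n"
        f"Course content:\n{course_text}\n"
        "Rate the standard as met / partially met / not met, then justify briefly."
    )

def assess(course_text, standards, llm):
    """Combine rule results with one LLM rating per quality standard."""
    report = {"rules": rule_checks(course_text)}
    report["standards"] = {s: llm(build_prompt(course_text, s)) for s in standards}
    return report

# Toy run with a stub in place of the real model
sample = "Syllabus. Learning objectives: listed per module. Grading policy: weighted."
report = assess(sample, ["OLC Standard 1"], llm=lambda prompt: "met")
```

Rule checks keep cheap, objective criteria out of the model's hands; only judgment calls reach the LLM, which is what makes an 8B model viable.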
| Model | % Agreement | Cohen's κ | Bias |
|---|---|---|---|
| Llama 3.1 8B | 48.6% | 0.06 | -0.60 |
| DeepSeek R1 8B | 30.0% | 0.04 | -1.13 |
| Claude Sonnet 4 | 15-17% | ~0.01 | -1.32 |
| GPT-4o | 17-20% | ~0.00 | -1.17 |
κ < 0.20 = slight agreement (Landis & Koch, 1977). Negative bias = systematic underrating relative to human reviewers.
Quality frameworks:
- Quality Matters: 28 criteria
- Online Learning Consortium: 20 objectives
- Universal Design for Learning: 5 principles
- Customized Rubric: future work
Open-Source Models: Llama 3.1 8B, DeepSeek R1 8B (run locally on a standard laptop with 8 GB RAM)
Commercial Models: Claude Sonnet 4, GPT-4o (cloud API)
Adapted from: Learning Engineering Process by Aaron Kessler, Jim Goodell, Sae Schatz (CC BY)
QA Bot supports nested improvement cycles:
- Creation: Validate course design against standards
- Implementation: Identify improvement opportunities
- Investigation: Targeted analysis of design elements
| Comparison | Justification Similarity | Rating Agreement | Gap (Sim − Agree) |
|---|---|---|---|
| Humans vs. DeepSeek | 0.41 | 0.30 | +0.11 |
| Humans vs. Llama | 0.43 | 0.49 | -0.06 |
Interpretation: Positive gap = justification quality exceeds rating accuracy. The reasoning module works; classification is poorly calibrated.
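The gap is simply justification similarity minus rating agreement. A minimal sketch; `difflib.SequenceMatcher` stands in for whatever text-similarity metric the study actually used (which is not specified here):

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Text similarity in [0, 1]; a stand-in for the study's metric."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def gap(justification_sim: float, rating_agreement: float) -> float:
    """Positive gap: the model's explanations track human reasoning
    better than its ratings track human ratings."""
    return justification_sim - rating_agreement

# Values from the table above
deepseek_gap = gap(0.41, 0.30)  # positive: explains better than it rates
llama_gap = gap(0.43, 0.49)     # negative: rates slightly better than it explains
```

A positive gap localizes the failure: the pipeline's reasoning output is sound, so recalibrating the final classification step is the cheaper fix.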
Future: Classification recalibration • Hybrid human-bot workflows • Larger model evaluation • Custom rubric integration • Multi-LMS support