Contribution · Application — Education

AI Automated Grading (Essays and Code)

Teachers report spending 10-20 hours a week on grading. LLM-assisted grading — essay scoring against rubrics, code-execution checks, short-answer evaluation — can cut that dramatically. The catch: automated grading has a troubled research history (bias toward longer essays, gaming via keyword stuffing), and regulators are starting to scrutinize high-stakes educational AI decisions. Low-stakes formative assessment is a safer wedge than summative high-stakes scoring.

Application facts

Domain: Education
Subdomain: Assessment
Example stack: Claude Sonnet 4.6 or GPT-5 with rubric-bound prompting · Pydantic structured-output scoring schema with rationale · Judge0 or Docker-sandboxed test execution for code grading · Streamlit or Moodle plug-in for teacher review UI · Evidently AI for fairness / drift monitoring across cohorts
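The "structured-output scoring schema with rationale" in the stack above can be sketched with Pydantic. The field names, rubric dimensions, and 1-4 level scale here are illustrative assumptions, not a standard; the point is that requiring a validated schema (including a non-trivial rationale) forces the model to justify each dimension score.

```python
from pydantic import BaseModel, Field

class DimensionScore(BaseModel):
    """Score for one rubric dimension, with a required rationale."""
    dimension: str                         # e.g. "thesis", "evidence"
    level: int = Field(ge=1, le=4)         # performance level on a 1-4 rubric
    rationale: str = Field(min_length=20)  # rejects empty or token justifications

class EssayScore(BaseModel):
    """Top-level object the LLM is asked to return as JSON."""
    dimensions: list[DimensionScore]
    overall_comment: str

    @property
    def total(self) -> int:
        return sum(d.level for d in self.dimensions)

# Validating a (hypothetical) LLM JSON response:
raw = {
    "dimensions": [
        {"dimension": "thesis", "level": 3,
         "rationale": "Clear claim, but it is not revisited in the conclusion."},
    ],
    "overall_comment": "Solid draft; tighten the conclusion.",
}
score = EssayScore.model_validate(raw)
print(score.total)  # 3
```

If validation fails (missing rationale, out-of-range level), the response is rejected and regenerated rather than silently accepted.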

Data & infrastructure needs

  • Rubric definitions at dimension + performance-level granularity
  • Teacher-scored calibration sets for inter-rater reliability
  • Student submissions (anonymized) for eval
  • Bias-testing corpora across demographic / linguistic dimensions
  • Item metadata — topic, difficulty, Bloom's level
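One concrete use of the teacher-scored calibration sets above is measuring model-teacher agreement with quadratic weighted kappa, a standard agreement metric in automated-essay-scoring research. A minimal stdlib sketch (the sample scores are made up):

```python
from collections import Counter

def quadratic_weighted_kappa(a, b, levels):
    """Agreement between two raters on an ordinal 1..levels scale.

    1.0 = perfect agreement, 0.0 = chance-level agreement; disagreements
    are penalized by the squared distance between the two ratings.
    """
    n = len(a)
    obs = Counter(zip(a, b))           # observed rating pairs
    ha, hb = Counter(a), Counter(b)    # marginal histograms
    num = den = 0.0
    for i in range(1, levels + 1):
        for j in range(1, levels + 1):
            w = (i - j) ** 2 / (levels - 1) ** 2
            o = obs.get((i, j), 0) / n
            e = (ha.get(i, 0) / n) * (hb.get(j, 0) / n)
            num += w * o
            den += w * e
    return 1.0 - num / den

teacher = [4, 3, 2, 4, 1, 3]   # illustrative calibration scores
model   = [4, 3, 3, 4, 1, 2]
print(round(quadratic_weighted_kappa(teacher, model, levels=4), 3))
```

A common rule of thumb from the AES literature is to require the model-teacher kappa to be at least as high as the teacher-teacher (inter-rater) kappa before trusting the model on new submissions.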

Risks & considerations

  • Bias against L2-English speakers or dialect variation
  • Gaming via keyword stuffing or rubric-targeted content
  • Due-process challenges for high-stakes decisions
  • Over-reliance displacing formative teacher judgment
  • FERPA violations if graded content leaves the controlled environment

Frequently asked questions

Is AI grading fair?

Only with bias audits. Historical AES systems (e-rater, PEG) scored longer essays higher and penalized L2-English speakers and speakers of dialects such as AAVE. LLM-based scoring shows similar patterns unless explicitly mitigated. NIST's AI RMF calls for bias testing, and some state laws (New York, Illinois) mandate it for high-stakes education AI.
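A bias audit can start very simply: compare mean assigned scores across the cohorts in the bias-testing corpora and flag large gaps for human review. A minimal sketch with made-up group labels and scores (a real audit would also test statistical significance and condition on item difficulty):

```python
from statistics import mean

def score_gap_by_group(records):
    """Mean-score gap between cohorts.

    records: list of (group_label, score) pairs, e.g. grouped by
    L1/L2-English status. Labels and scores here are illustrative.
    """
    by_group = {}
    for group, score in records:
        by_group.setdefault(group, []).append(score)
    means = {g: mean(scores) for g, scores in by_group.items()}
    gap = max(means.values()) - min(means.values())
    return means, gap

records = [("L1", 3.4), ("L1", 3.1), ("L2", 2.6), ("L2", 2.9)]
means, gap = score_gap_by_group(records)
print(means, round(gap, 2))  # flag for review if gap exceeds a preset threshold
```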

Which LLM grades best?

Claude Sonnet 4.6 and GPT-5 grade short-answer and essay questions at teacher-calibrated levels when given detailed rubrics. For code grading, combine an LLM with actual test execution (via a sandbox like Judge0) — LLMs reading code hallucinate about runtime behavior.
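The execution half of that combination can be sketched with a subprocess and a hard timeout. This is only a sketch of the shape of the check: a production grader needs real OS-level isolation (Judge0, Docker, seccomp), since a subprocess timeout alone does not contain hostile code. The test case shown is hypothetical.

```python
import subprocess, sys, tempfile, textwrap
from pathlib import Path

def run_submission(code: str, stdin_data: str, timeout_s: float = 2.0):
    """Execute a student submission in a subprocess and capture its output.

    Returns (stdout, returncode); a hang past timeout_s counts as failure.
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "submission.py"
        path.write_text(code)
        try:
            proc = subprocess.run(
                [sys.executable, str(path)],
                input=stdin_data, capture_output=True,
                text=True, timeout=timeout_s,
            )
            return proc.stdout.strip(), proc.returncode
        except subprocess.TimeoutExpired:
            return "", -1  # treat a hang as a failed test

# Hypothetical test case: read a number, print its double.
submission = textwrap.dedent("""
    n = int(input())
    print(n * 2)
""")
out, rc = run_submission(submission, "21\n")
print(out, rc)  # "42" 0
```

The LLM's role then narrows to what it is good at: judging style, structure, and partial credit against the rubric, while pass/fail facts come from actual execution.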

Can AI grading be challenged legally?

Yes — algorithmically assigned grades have been reversed after legal and public challenges on fairness and due-process grounds (the International Baccalaureate and UK Ofqual cases, both 2020, which used statistical grade prediction rather than LLMs). Safe deployments offer a transparent rationale, a right to human review, and audit logging of every scoring decision.
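The audit-logging requirement can be sketched as an append-only log where each entry hashes the previous one, so after-the-fact tampering with a contested score is detectable. Field names and the chaining scheme are illustrative assumptions, not a standard:

```python
import hashlib, json
from datetime import datetime, timezone

def audit_record(submission_id, score, rationale, model, prev_hash):
    """One append-only audit entry for a scoring decision.

    Chaining prev_hash into each entry's own hash means editing any
    past entry breaks every later link in the chain.
    """
    entry = {
        "submission_id": submission_id,
        "score": score,
        "rationale": rationale,
        "model": model,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "prev_hash": prev_hash,
    }
    entry["hash"] = hashlib.sha256(
        json.dumps(entry, sort_keys=True).encode()
    ).hexdigest()
    return entry

genesis = "0" * 64
e1 = audit_record("sub-001", 3, "Clear thesis; weak evidence.", "model-x", genesis)
e2 = audit_record("sub-002", 4, "Strong throughout.", "model-x", e1["hash"])
print(e2["prev_hash"] == e1["hash"])  # True
```

Storing the rationale alongside the score is what makes the "right to human review" meaningful: the reviewer sees why the model scored as it did, not just the number.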

Sources

  1. ETS — Research on automated essay scoring — accessed 2026-04-20
  2. NIST AI RMF — Educational applications — accessed 2026-04-20
  3. IEEE — Standard for automated grading systems — accessed 2026-04-20