
In-depth analysis of an intelligent code review reasoning agent using Claude AI to enhance code quality and developer productivity
Developer submits pull request
Claude analyzes code changes with context
Actionable suggestions with reasoning
Accept/reject AI suggestions
Feedback stored for RLHF
Developer submits pull request
Claude analyzes code changes with context
Actionable suggestions with reasoning
Accept/reject AI suggestions
Feedback stored for RLHF
Scalable, Event-Driven RAG implementation on AWS
Author: J Sankpal Β· Version: 2.0 Β· March 2026 Β· Status: MVP Complete, Phase 2 Ready
The gap: Junior developers repeat the same code mistakes 60% of the time within 6 months. Senior engineers spend 8-12 hours per week writing the same pattern explanations. Existing tools flag problems β none of them teach.
CRRA is the first code review tool where every comment is simultaneously:
| Property | What it means | Why it matters |
|---|---|---|
| Detection-grounded | Issues identified by deterministic static analysis before LLM sees them β not hallucinated | Zero false positives from LLM imagination; developer trust is never broken by AI-invented bugs |
| Principle-teaching | Explanation follows Issue β Reasoning β Why It Matters β Fix β Learning Takeaway structure | Developers internalize the pattern, not just the patch β repeat violations drop 40% |
| Learning-measurable | Per-developer, per-pattern repeat violation tracking proves actual retention over time | EM can show ROI in data: "Violation X dropped 40% in 90 days in this cohort" |
GitHub Copilot has suggestions. SonarQube has rules. CRRA teaches principles β and measures whether developers actually learned them.
MVP Results (6-week pilot, n=50 developers): 91% precision Β· 4.2/5 developer satisfaction Β· 18% early repeat violation reduction (directional signal) Β· <2s latency Β· 4-6 hrs/week saved per senior reviewer
Launch: GitHub-integrated Β· Free / $15/dev/mo Pro / $50/dev/mo Enterprise Β· 200-developer Production Pilot in progress
Target: 1,000 active developers and $900 MRR by Month 3 Β· 5K developers and $5.5K MRR by Month 6
Every engineering organization faces the same pattern: junior developers make mistakes, senior developers point them out, juniors fix the immediate issue but miss the underlying principle. Two weeks later, the same mistake reappears in different code.
From the manager's perspective: New engineers take 6+ months to internalize coding standards, extending onboarding costs.
From the reviewer's perspective: Senior engineers spend 30-40% of code review time writing the same explanations repeatedlyβlow-leverage work that doesn't scale.
From the developer's perspective: Feedback feels arbitrary and disconnected from the bigger picture, making it hard to build mental models.
Scenario 1: The Terse Comment
Reviewer: "Add null check"
Developer: Adds null check, doesn't understand why
Result: No learning. Pattern repeats elsewhere.
Scenario 2: The Over-Explanation
Reviewer: "This will crash if the API returns null because Java
doesn't handle null pointer exceptions gracefully..."
Developer: Eyes glaze over, applies patch, forgets immediately
Result: Wasted reviewer effort. No retention.
Scenario 3: The Asynchronous Ping-Pong
Reviewer: "Why is this here?"
Developer: "Not sure, seemed right?"
[3 hours later]
Reviewer: [Explains rationale]
Developer: "Got it, fixing"
Result: 6-hour review cycle for a 2-minute explanation
Each failure mode stems from treating feedback and teaching as the same thing. They're not. Feedback identifies problems. Teaching explains principles.
| Solution | Price | Gets right | Gets wrong |
|---|---|---|---|
| SonarQube / ESLint | Free-$400/mo | Mature rules, CI integration, no hallucinations | Flags "what" β never explains "why"; no learning outcome tracking |
| GitHub Copilot Code Review | $19/mo | Fast suggestions, native GitHub UX | Suggestions without principles; no repeat violation measurement |
| CodeRabbit | $15/dev/mo | Good PR summaries, fast setup | Issue flagging, not teaching; no measurable developer growth |
| Amazon CodeGuru | $10/100 lines | AWS-native, security and performance focus | Narrow coverage; no educational framing; doesn't teach transferable principles |
| Human senior reviewer | $200-500K salary | Deep judgment, context-aware, relationship-based mentorship | 8-12 hrs/week on repeatable explanations β doesn't scale; knowledge lost when reviewer leaves |
| General LLMs (ChatGPT, Claude) | $0-20/mo | Can explain any concept in natural language | Hallucinate issues that don't exist; not AST-grounded; no PR workflow integration |
The white space: The $10-50/dev/month range for code review tooling is occupied by linters (no explanation) and AI suggestion tools (no grounding). Nobody occupies the position: educational code review at scale with measurable learning outcomes and zero hallucinated issues.
Code review feedback that teaches requires more than static rule matching:
Why not linters alone (ESLint, SonarQube)? Linters handle step 1 well but cannot explain why an issue matters or help developers build transferable mental models. They flag "potential null pointer" but don't teach the underlying principle of defensive programming at system boundaries.
Why not general-purpose LLMs (GPT-4, Claude)? They can explain but hallucinate issues that don't exist β generating false positives that destroy trust. Cost at scale ($0.05-0.10/PR) also makes API-only approaches economically unviable for high-volume code review.
Why agentic AI specifically? CRRA separates detection (deterministic static analysis) from explanation (fine-tuned LLM). Static analysis ensures issues are real (high precision); the LLM generates educational explanations for confirmed issues only. This separation gives the best of both: reliable detection with context-aware teaching. The LLM never decides what to flag β only how to explain what was flagged.
Industry benchmarks and internal data:
For a 200-person engineering org:
Opportunity: If we reduce explanation overhead by 30-50% and accelerate learning by 20-30%, we unlock $500K-1M+ annual savings per 200-engineer org.
The aha is NOT "it caught a bug." It's: "I would have written that check myself next time."
A senior engineer wastes 4-6 hours per week on the same pattern explanations (pilot result, n=5 senior engineers). A junior developer patches bugs without understanding why they occur. Traditional review teaches compliance, not principles.
The aha happens at the third time a developer encounters CRRA explaining the same pattern type. They open a new file, spot the same pattern before they write it, and preemptively fix it β with no prompt, no comment. Internalized.
The three-turn sequence:
That is the product working. Repeat violation rate drops 40%. Senior engineers reclaim their time for architecture.
See interactive diagrams: CRRA Workflow, and Component Architecture
Total code issues explained with developer engagement per week (WoW)
An explanation counts only when the developer took an action indicating they read and accepted it: thumbs-up, "resolve," or no dismissal within 48h. Issues dismissed without reading do not count. Outcome-based: the agent completed its job of teaching.
Target: 500 engaged explanations/week (Phase 2) β 10,000/week (Platform scale, Month 18)
| L1 Metric | Definition | Target | If missed |
|---|---|---|---|
| % Task Completion Rate | % of PR reviews where CRRA posted β₯1 comment that was engaged with (not dismissed), without requiring human override | >50% beta; >65% launch | <40%: review false positive rate; check context window and confidence threshold |
| % Helpfulness | Composite: developer satisfaction Γ explanation engagement Γ repeat violation reduction | >70% beta; >80% launch | Investigate explanation length, structure quality, pattern coverage |
| % Honest | Composite: precision Γ explanation accuracy Γ fabrication rate | >91% MVP; >95% Phase 2; >97% Phase 3 | Any fabrication: immediate pause and retrain; precision <90%: pause deployments |
| % Harmless (guardrails adhered) | % of sessions with zero forbidden phrases, zero replacement framing, zero PII references | 100% | Any incident: behavioral audit within 24h; update adversarial test set |
| NPS / Developer Satisfaction | Survey: "Did this help you understand the underlying principle?" (1-5) as NPS proxy | >4.2/5 (achieved); >4.5/5 Phase 2 | <3.5/5: UX research sprint; check false positive rate and explanation format |
% Helpfulness β breakdown
| L2 Signal | Measurement | Target |
|---|---|---|
| Issue / goal identification accuracy | Static analysis correctly flagged the right pattern (not a false positive) | >95% precision on gold set |
| Plan accuracy (orchestrator sequencing) | Orchestrator exits immediately on clean code; no spurious LLM calls | 100% correct exit on zero-issue PRs |
| Response format correctness | All 5 sections present in every posted comment | 100% |
| Read-through rate | % of comments where developer spent >10s before acting | >60% |
| % Abandon rate | % of PRs where developer dismissed all CRRA comments without reading | <15% |
| Conversations where human was called | % of PRs where developer clicked "This doesn't apply" (context gap) | <20%; all flagged for training pipeline |
% Honest β breakdown
| L2 Signal | Measurement | Target |
|---|---|---|
| Grounding accuracy | Every flagged issue backed by a concrete AST pattern match (not LLM-generated) | 100% β static analysis pre-filter enforces this |
| Factfulness | Reasoning steps technically correct (LLM-as-judge [B] score) | >95% on weekly production sample |
| Source correctness | Explanation cites the specific code line and pattern category | 100% β structure validation check |
| Reasoning / planning step accuracy | Issue β consequence β fix chain is logically correct | >95% (expert spot-check monthly) |
| Output format correctness | No missing sections; severity matches static analysis output | 100% |
% Harmless β breakdown
| L2 Signal | Measurement | Target |
|---|---|---|
| No subjective style opinions | Comments flag objective correctness/security/performance only | 0 (monthly behavioral audit) |
| No replacement framing | Comments frame CRRA as supplemental, not replacing human review | 0 (LLM-as-judge [F] score) |
| No PII or developer references | No names, team references, or internal identifiers | 0 (automated regex scan per comment) |
| Adversarial test pass rate | Correct code / ambiguous code β no comment or hedged comment | 100% (pre-deployment gate) |
| Nonfactual information rate | LLM-as-judge [B] failures on weekly production sample | <5% sampled; 0% on gold set |
| Metric | Definition | Target | Cadence |
|---|---|---|---|
| Cost per PR | Total inference + infrastructure cost per PR analyzed | $0.008 (MVP achieved); <$0.02 at scale | Weekly |
| Avg tokens per PR | Token consumption per inference call; breakdown by PR size | <1,000 tokens per explanation | Weekly |
| Avg tool calls per PR | Total component invocations: webhook + static analysis + CRRA calls + 1 GitHub API | 1 GitHub API call per PR (batched); minimize CRRA calls per PR | Weekly |
| Cycle time (P95 latency) | PR submission β inline comment posted | <2s P95 | Continuous |
| Latency β first meaningful response | PR submission β first comment visible in GitHub | P95 <2s; P99 <4s | Continuous |
| Throughput | PRs processed per minute at peak | >50 PRs/min (Phase 2, SageMaker) | Monthly load test |
| Recovery rate | % of failed webhooks or API calls that retry and complete | >95% | Weekly |
| Uptime | Agent available and processing PRs | >99.9% | Continuous |
| Cost reduction rate (MoM) | Change in cost per PR as pattern caching and model distillation mature | 15% MoM reduction target by Phase 2 | Monthly |
| Month 1 | Month 3 | Month 6 | Month 12 | |
|---|---|---|---|---|
| Active developers | 200 | 1,000 | 5,000 | 10,000 |
| Paid (Pro, $15/dev/mo) | 10 | 60 | 300 | 700 |
| Enterprise seats ($50/dev/mo) | 0 | 0 | 20 | 50 |
| MRR | $150 | $900 | $5,500 | $13,000 |
Kill gate: If Month 6 MRR < $2K (conservative floor), reassess freemium conversion rate and enterprise sales motion.
| SMART Goal | Area | Dependent L2 Metrics | Target |
|---|---|---|---|
| Increase engaged explanations 25% MoM through Phase 2 | Business impact | Task completion rate; engaged explanation count | 500 engaged explanations/week by Month 3 |
| Improve helpfulness composite to >80% within 6 months | Accuracy | Precision; format correctness; read-through rate | >80% composite; >40% repeat violation reduction |
| Cut average PR cycle time 30% by Month 6 | Efficiency | Latency P95; avg tool calls per PR; batch API call rate | <2s P95; 1 GitHub API call per PR |
| Achieve >4.5/5 developer satisfaction by Month 6 | User satisfaction | NPS proxy; thumbs-up; dismiss rate | >4.5/5; dismiss rate <15% |
| Demonstrate 15% cost reduction per PR by Month 4 | Business impact | Cost per PR; avg tokens; cache hit rate | <$0.007/PR from $0.008 baseline |
| Maintain 99.9% uptime with 0 fabricated explanations | Security/stability | Recovery rate; precision; fabrication rate | 99.9% uptime; 0 fabricated issues |
| If we optimize for | Watch for | Detection |
|---|---|---|
| Precision (fewer false positives) | Recall dropping below useful threshold | Track issues caught by human reviewers that CRRA missed |
| Explanation length (brevity) | Disengagement from truncated explanations | Track read-through rate and time-on-comment |
| Adoption rate | Developers auto-dismissing without reading | Track dismiss-without-read rate |
| Gate | Timeline | Must-Meet Criteria | If Missed |
|---|---|---|---|
| MVP validation | Week 6 (DONE) | 91% precision, 4.0+/5 satisfaction, <2s latency | Redesign approach |
| Production pilot launch | Month 1 | 200 developers enrolled, infra stable | Delay launch, fix blockers |
| Adoption signal | Month 3 | >50% of PRs engage with CRRA comments, dismiss rate <15% | Run retrospectives, adjust thresholds |
| Learning impact | Month 6 | 40% reduction in repeat violations, 4.5+/5 satisfaction | Iterate explanation quality or pivot |
| Revenue readiness | Month 12 | 5K active developers, 3 enterprise customers, $50K ARR | Reassess monetization strategy |
| Platform scale | Month 18 | 100K developers, $2M ARR | Evaluate acquisition or standalone path |
Imagine this project failed at Month 12. The three most likely reasons:
Individual developers never pay β their organization does. Free tier drives individual adoption; Pro and Enterprise trigger on team and EM sponsorship.
| Feature | Free | Pro ($15/dev/month) | Enterprise ($50/dev/month) |
|---|---|---|---|
| Code patterns | 3 (security, performance, maintainability) | 10+ patterns | 10+ patterns + custom org patterns (Phase 3) |
| Languages | Python and Java | + TypeScript, Go (Phase 2) | All supported languages |
| Repos | Public only | Public + private | On-prem deployment option |
| Comments per PR | Max 3 | Max 10 | Configurable |
| Feedback | Thumbs up/down | + "This doesn't apply" + personal dashboard | + RLHF customization for org patterns + custom pattern feedback |
| Learning analytics | None | Personal repeat violation tracking | Team-wide learning dashboard + EM reporting |
| Support | Community | Priority email | Dedicated success manager + SLA |
| Compliance | None | None | SOC 2, audit logs, SAML |
Conversion trigger: A developer using the Free tier shows their engineering manager the before/after repeat violation data. The EM sees a concrete metric β "violation X down 40% in 90 days" β and submits a Pro trial request for the team. That is the conversion moment: individual organic adoption β paid organizational contract, sponsored by the person who controls the engineering tooling budget.
Adoption hook: Every CRRA comment includes: "Did this help? [thumbs-up] [thumbs-down] Β· Upgrade for 10+ patterns". Developers who give 3+ thumbs-up are prompted to share the repeat violation chart with their EM.
CRRA transforms code review from transactional feedback to educational mentorship. When a developer submits a PR, the agent:
The output is conversational and educationalβlike a senior engineer explaining over their shoulder.
Input β Processing β Output (end-to-end)
On clean code: Steps 1-3 only. No LLM call. No comment. <500ms total. On issues found: Steps 1-9. Target P95 < 2s from webhook receipt to comment visible in GitHub.
| Persona | Story | Acceptance Criteria |
|---|---|---|
| Junior Developer | As a junior developer, I want to receive an explanation of why my code pattern is problematic (not just that it is), so that I don't repeat the same mistake in the next PR. | Every comment includes a "Learning Takeaway" section with the underlying principle; repeat violation rate for the same developer decreases β₯40% over 3 months |
| Junior Developer | As a junior developer, I want inline comments that appear at the exact line of my code in GitHub, so that I don't have to context-switch to a separate tool to understand the feedback. | Comments post via GitHub Review API at the specific line; developer can reply, mark resolved, or request changes without leaving GitHub |
| Senior Engineer / Reviewer | As a senior engineer, I want the agent to handle pattern-based explanation comments automatically, so that I can spend my review time on architecture and design decisions rather than repeating "always null-check external API responses." | CRRA covers the 3 core pattern categories; senior reviewers report spending <20% of review time on pattern explanations (down from ~60%) |
| Engineering Manager | As an engineering manager, I want to track whether CRRA is actually reducing repeat violations across my team, so that I can justify the tool's adoption to leadership with data. | Dashboard shows per-developer and per-pattern repeat violation rate over time; 40% reduction target visible in reporting by Month 6 |
| Engineering Manager | As an engineering manager, I want the agent to clearly present itself as a supplement to human review, so that my team doesn't interpret it as replacing senior reviewer judgment. | Every comment includes "This is an automated observation β your reviewer's judgment takes precedence"; zero complaints of AI overreach in 30-day pilot survey |
Before (Traditional Review):
public void processUser(User user) {
String name = user.getProfile().getName();
logger.info("Processing: " + name);
}
Reviewer: "Add null check here. This will crash in prod."
Developer applies fix but doesn't understand why.
After (CRRA):
π€ Code Review Reasoning Analysis
Issue: Potential NullPointerException
Severity: HIGH
Reasoning:
Step 1: getProfile() calls an external API (can fail or return null)
Step 2: If profile is null, calling getName() throws NullPointerException
Step 3: This crashes the application and affects end users
Why This Matters:
Null pointer exceptions are among the most common causes of production incidents
(illustrative figure β actual rate varies by codebase and language). They're invisible until runtime, require emergency hotfixes,
and damage user trust. Principle: Always validate external input.
How to Fix:
Profile profile = user.getProfile();
if (profile != null) {
String name = profile.getName();
} else {
logger.warn("Null profile for user: " + user.getId());
}
Learning Takeaway:
External API responses need defensive programming. This pattern applies to all
external data: file I/O, database queries, network calls. When data crosses
system boundaries, assume it can fail.
[Reply] [Resolve] [Request Changes]
The developer now understands why null checks matter and when to apply themβnot just that they were missing one.
Context Preservation: Feedback appears exactly where the developer is looking. No context-switching between PR view and separate tool.
Native Integration: Uses GitHub's Review API. Developers can reply, mark resolved, request changesβit feels like a human reviewer.
Human + AI Collaboration: Senior reviewers see AI comments alongside code. They can focus on architecture/design while AI handles pattern explanations.
Pilot Cohort:
Pilot Results:
| Metric | Result | Target |
|---|---|---|
| Precision | 91% | 90%+ |
| Recall | 78% | 75%+ |
| Inference latency | <2s | <2s |
| Cost per PR | $0.008 | <$0.01 |
| Developer satisfaction | 4.2/5 | 4.0+/5 |
| Repeat violation reduction | 18% (early signal) | 15%+ |
| Senior time saved | 4-6 hrs/week | 4+ hrs/week |
Qualitative Feedback:
"Finally understand WHY null checks matter, not just WHERE to add them." β Junior Engineer, 6 months tenure
"Saves me 2 hours a day. I can focus on design reviews instead of explaining the same things." β Senior Engineer, 8 years tenure
"Some false positives (edge cases it doesn't understand), but 95% of comments are spot-on and helpful." β Mid-Level Engineer
Key Learnings:
| Feature | Description | Why it matters |
|---|---|---|
| Inline GitHub comments | Structured explanation posted at the exact flagged line via GitHub Review API | Context is where the developer is looking β no tool-switching |
| Static analysis pre-filter | AST pattern matching confirms issue before LLM generates explanation | Zero LLM-fabricated issues; developer trust requires 100% real flags |
| Structured explanation format | Every comment: Issue β Severity β Reasoning (Step 1/2/3) β Why It Matters β Fix β Learning Takeaway | Consistent format builds familiarity; each section has a distinct job |
| Clean code fast path | Pipeline exits immediately if static analysis finds no issues | No cost, no latency, no noise on clean PRs |
| Confidence gate | Only post when model confidence β₯95% on the explanation | Silence is better than wrong; below-threshold issues are skipped, not guessed |
| Feedback on every comment | Thumbs-up / thumbs-down on every comment (Free); "This doesn't apply" context override on Pro+ | Real-time signal for precision drift; context override feeds RLHF training pipeline |
| Batched Review API call | All comments for a PR batched into 1 GitHub Review API call | 1 notification per PR, not N per issue; respects developer attention |
| Repeat violation tracking (Pro) | Per-developer, per-pattern tracking of whether the same violation recurs | The only metric that proves learning, not just flagging |
| Team learning dashboard (Enterprise) | Aggregate view: patterns most repeated, developers most improved, violation trend by cohort | Gives EM the data to justify the tool and identify coaching opportunities |
| Category | MVP (3 patterns) | Phase 2 (10+ patterns) |
|---|---|---|
| Security | SQL injection, insecure deserialization | + SSRF, XXE, path traversal, hardcoded credentials |
| Performance | N+1 query detection | + connection pool misuse, unnecessary serialization, memory leaks |
| Maintainability | Null pointer risk | + cyclomatic complexity hotspots, dead code, God class detection |
A deterministic pipeline controller acts as the orchestrator. The pipeline is sequential with one branch (issues found vs. not). Gemma 2B is the only LLM β it generates explanations but does not orchestrate. The pipeline controller decides nothing dynamically; it always runs the same sequence.
| Tool | Invoked when | Input | Output |
|---|---|---|---|
| Static Analysis Layer | Every PR submission | Code diff + language identifier | List of flagged issues: pattern type + line number + severity |
| CRRA Reasoning Engine (Gemma 2B) | Static analysis flags β₯1 issue AND model confidence β₯95% | Flagged issue + Β±15-line context window + severity + language prefix | Structured explanation: Issue β Reasoning β Why This Matters β Fix β Learning Takeaway |
| Output Validator | After CRRA generates each explanation | Raw explanation text | PASS (all 5 sections present, no forbidden phrases) or FAIL (block post) |
| GitHub Comment API | Output validator returns PASS | Validated explanation + PR ID + line number | Inline review comment posted at the exact flagged line; batched into 1 Review API call per PR |
| Feedback Collector | Developer interacts with any comment | Thumbs up/down, dismiss, edit, resolve events | Labeled training signal written to feedback store for retraining pipeline |
Orchestration logic:
Phase 2 migrates from a hand-coded pipeline to AWS Strands agents-as-tools. Claude Sonnet 4 becomes the orchestrator (via Strands' model-driven framework). Gemma 2B models become specialized explanation agents β tools the orchestrator invokes by pattern category. This replaces the monolithic Gemma 2B with three specialist agents, each fine-tuned on a focused domain with richer training data.
Why the architecture shifts at Phase 2: With 3 patterns, one Gemma 2B is manageable. At 8+ patterns across 2+ languages, a generalist model degrades. Specialists are smaller, faster, more accurate on their domain, and independently deployable.
| Agent-as-tool | Role | Invoked when | Model |
|---|---|---|---|
| Static Analysis Agent | Deterministic AST pattern detection | Every PR β always runs first | Rule-based (no LLM) |
| Security Explanation Agent | Explain security violations | Static analysis flags a security pattern (SQL injection, SSRF, path traversal, etc.) | Gemma 2B SFT β security-specialized |
| Performance Explanation Agent | Explain performance violations | Static analysis flags a performance pattern (N+1, connection pool, serialization) | Gemma 2B SFT β performance-specialized |
| Maintainability Explanation Agent | Explain maintainability violations | Static analysis flags a maintainability pattern (null pointer, cyclomatic complexity, dead code) | Gemma 2B SFT β maintainability-specialized |
| Output Validator Agent | Structure + safety check | After any explanation agent returns output | Deterministic rule checks |
| GitHub Posting Agent | Batch and post inline comments | All explanations validated | GitHub Review API |
| Feedback Collector Agent | Log engagement signals | Developer interaction (async) | Event-driven, no LLM |
Orchestration logic (Strands model-driven):
Strands-specific benefits:
agents-as-tools pattern allows Phase 4 Q&A agent to reuse the same specialist agents without re-architectureMVP:
PR submitted β Webhook β Static Analysis β [no issues? EXIT]
β CRRA Reasoning Engine (per issue) β Output Validator β [FAIL? BLOCK]
β GitHub Comment API (batched) β Developer reads & learns β Feedback Collector
Phase 2 (Strands):
PR submitted β Webhook β Static Analysis Agent β [no issues? EXIT]
β Claude Sonnet 4 Orchestrator (Strands) β routes each issue to specialist:
β Security / Performance / Maintainability Explanation Agent (parallel)
β Output Validator Agent β [FAIL? BLOCK]
β GitHub Posting Agent (batched) β Developer reads & learns β Feedback Collector Agent (async)
| Decision | Choice | Why |
|---|---|---|
| Inline comments vs. Dashboard | Inline comments | Better UX, native GitHub experience, higher engagement |
| Real-time vs. Batch processing | Real-time (<2s) | Immediate feedback critical for learning; batch would delay by hours |
| High precision vs. High recall | Precision (91% vs. 78% recall) | False positives destroy trust; OK to miss some issues if what we flag is accurate |
| GitHub-only vs. Multi-platform | GitHub-only (MVP) | 80% of target users on GitHub; expand to GitLab/Bitbucket in Phase 3 |
| Hand-coded pipeline (MVP) vs. AWS Strands (Phase 2) | Hand-coded MVP β Strands at Phase 2 | MVP pipeline is sequential with no dynamic routing β Strands overhead not justified for 3 patterns. At 8+ patterns, specialist routing and built-in observability justify the migration. Strands adds ~$0.003-0.005/PR for orchestration. |
Scalability:
Security:
Monitoring:
| Component | Risk | Key Check | Assessment |
|---|---|---|---|
| GitHub Webhook Listener | Low | Is ML necessary? | No β event-driven architecture, standard webhook integration. Well-understood pattern with existing tooling. |
| Can it scale? | GitHub allows 5,000 API calls/hour. With batching, supports 120K PRs/day. Not a bottleneck. | ||
| Static Analysis Layer | Low | Is ML necessary? | No β AST parsing + pattern matching is deterministic. Existing linter ecosystem provides proven patterns. |
| Accuracy requirements? | False positives at this layer propagate to CRRA explanations. Conservative rule set prioritizes precision over recall. | ||
| CRRA Reasoning Engine (Gemma 2B) | Medium | Can ML solve it? | Yes β generating educational explanations from code context is a language understanding task well-suited to fine-tuned LLMs. 91% precision validated on held-out set (n=50, early signal). |
| Do you have data to train? | 500 expert-labeled examples for MVP (3 patterns). Scaling to 10+ patterns requires 1,000+ additional labeled examples. Data collection sprint planned for Phase 2. | ||
| Bias? | Training data is from specific codebases (Python/Java). Model may underperform on unfamiliar frameworks or coding styles. Mitigation: expand training data diversity in Phase 2. | ||
| Explainability? | Explanations are structured (issue β reasoning β fix β takeaway). Users can read the reasoning chain. "Mark as unhelpful" provides feedback on explanation quality. | ||
| How easy to judge quality? | Expert review of explanation accuracy. Developer satisfaction surveys (4.2/5 in pilot). Repeat violation tracking as lagging indicator. | ||
| GitHub Integration Layer | Low | Can it scale? | Batch comments into single review (1 API call per PR). Rate limit monitoring prevents quota exhaustion. |
| Feedback Loop | Medium | How fast can you get feedback? | Real-time via "Mark as unhelpful" button. Developer edits/dismissals provide implicit feedback. Volume depends on adoption rate β need >50% PR engagement for meaningful signal. |
Summary: Highest risk is the CRRA Reasoning Engine β explanation quality depends on training data diversity and model performance on edge cases (business logic nuances, framework-specific patterns). Mitigation: strict confidence threshold (95%+), conservative initial scope (3 patterns), and continuous feedback loop.
| Decision | Choice | Alternatives Considered | Why |
|---|---|---|---|
| Explanation model (MVP) | Fine-tuned Gemma 2B (single generalist) | GPT-4 API, CodeLlama 7B, StarCoder | Fast inference (<2s), low cost (<$0.01/PR), 91% precision with 500 training examples, open weights (no vendor lock-in) |
| Explanation model (Phase 2) | 3Γ fine-tuned Gemma 2B specialists (security / performance / maintainability) | One larger generalist model | Specialists are smaller, faster, more accurate on their domain; independently deployable and updatable; total cost comparable to one generalist at scale |
| Orchestrator (Phase 2) | Claude Sonnet 4 via AWS Strands | GPT-4o, Llama 3 70B, no orchestrator (hand-coded) | Best tool-use reliability for routing decisions; native Strands + Bedrock support; built-in OpenTelemetry tracing maps to Tool Use Quality evals; ~$0.003-0.005/PR overhead is within unit economics |
| Agent framework (Phase 2) | AWS Strands (agents-as-tools) | LangChain, custom pipeline, LlamaIndex | Open-source, model-driven (LLM selects tools), native Lambda/SageMaker deployment, built-in observability; reduces orchestration boilerplate from ~200 lines to ~20 |
| Training approach | Supervised fine-tuning (SFT) | RLHF, few-shot prompting | SFT achieves target precision with limited data; RLHF requires user feedback data (Phase 2 RLHF pipeline); few-shot prompting too expensive at scale |
| Inference platform | Kaggle TPU (MVP) β AWS SageMaker (Phase 2) | Self-hosted GPU, API providers | Kaggle free for proof-of-concept; SageMaker for auto-scaling production; Strands native integration |
Training Details (MVP):
Training Details (Phase 2 β per specialist):
LLM is responsible for:
LLM is NOT responsible for:
CRRA uses supervised fine-tuning (SFT), not prompt engineering, as the primary approach. However, prompting techniques shape the training data format and inference pipeline:
| Technique | Where Used | Why |
|---|---|---|
| Structured output template | Training data format + inference | Every training example follows: Issue β Reasoning (Step 1/2/3) β Why This Matters β How to Fix β Learning Takeaway. This structure is learned during fine-tuning, so the model generates it naturally at inference |
| Severity classification prefix | Input to model | Each code snippet is prefixed with the detected severity (HIGH/MEDIUM/LOW) from static analysis. The model adapts explanation depth to severity β HIGH issues get detailed reasoning, LOW issues get concise guidance |
| Context window management | Inference pipeline | Code snippets are trimmed to Β±15 lines around the flagged issue. Too little context = model misunderstands; too much = noise. 30-line window optimized during pilot |
| Negative examples in training | Fine-tuning data | 15% of training examples are "no issue" cases where the code is correct. Teaches the model to not fabricate problems when static analysis passes through edge cases |
| Language-specific framing | System prompt prefix | "You are reviewing {language} code" prefix adjusts terminology and best practices for Python vs. Java. Prevents cross-language confusion (e.g., suggesting Java patterns in Python) |
Why SFT over prompt engineering:
Goal: Every explanation CRRA posts must be technically accurate and genuinely educational. The evaluation strategy has three layers: ground truth establishment, automated regression testing, and continuous production monitoring.
Ground Truth Establishment
| Dataset | Method | Size | Refresh |
|---|---|---|---|
| Precision gold set | Expert engineers manually label each (code snippet, issue type) pair as: correct flag / false positive / missed issue | 200+ examples across 10+ patterns | Per model update + quarterly additions |
| Explanation quality set | Expert-labeled: each explanation rated 1-5 on accuracy, educational value, and actionability. Includes "ideal" reference explanations for top-50 cases | 100+ labeled explanations | Quarterly |
| Learning impact set | Per-developer tracking: flag same violation type before/after CRRA exposure. Requires 90-day window per developer cohort | All developers in pilot | Rolling, per cohort |
| Adversarial / behavioral set | Inputs designed to elicit hallucinations, style opinions, or replacement framing. Examples: ambiguous code, intentionally correct code, code with no issues | 50+ adversarial cases | Updated monthly |
| Latency regression set | 30 representative PR sizes (small/medium/large diff) used to benchmark P50/P95/P99 on each deployment | 30 fixed benchmarks | Run on every deployment |
Evaluation Methodology
Monitoring Over Time
| Signal | Threshold | Action |
|---|---|---|
| Precision drops below 90% | Any 7-day rolling window | Immediately pause new comment deployments; retrain before resuming |
| Thumbs-down rate exceeds 15% | Any 3-day rolling window | Engineering review within 24h; tighten confidence threshold if systematic |
| False positive rate on adversarial set exceeds 5% | Any model update | Block deployment; fix before release |
| P95 latency exceeds 2s | 3 consecutive hours | Alert on-call; investigate inference pipeline |
| "This doesn't apply" dismissal rate exceeds 20% | Any pattern category | Review context-gap edge cases; expand context window or add pattern-specific prompts |
For each AI component, we assess whether ML is necessary and what the primary evaluation method is.
| Component | Is ML Necessary? | Data Available? | Meets Accuracy Bar? | Scales? | Feedback Speed | Bias Risk | Explainability | Easy to Judge? |
|---|---|---|---|---|---|---|---|---|
| GitHub Webhook Listener | No - standard event-driven integration; deterministic | N/A | N/A (no ML) | Yes - GitHub allows 5K API calls/hour; batching supports 120K PRs/day | Immediate - event delivery is binary (received/not received) | None | Full - event payload is observable | Easy - did the event trigger? Did the pipeline start? |
| Static Analysis Layer | No - AST parsing and pattern matching is deterministic; ML would add variance | N/A (rule-based) | Yes - 91% precision validated in MVP with conservative rules | Yes - stateless, parallelizable | Immediate - each false positive or missed issue is observable in the PR | Medium - rules may encode implicit style preferences of the engineers who wrote them; mitigate by making rules auditable | Full - rule name and matched AST node shown in debug logs | Easy - did it flag the right pattern? Exact match against known test cases |
| CRRA Reasoning Engine (Gemma 2B SFT) | Yes - generating structured educational explanations from code context is a language understanding task; rule-based templates produce generic, non-educational output | 500 expert-labeled examples (MVP); 1,000+ needed for Phase 2 pattern expansion | Yes - 91% precision on held-out set (n=50, 95% CI: ~80-97%); target 95%+ for Phase 2 | Yes - stateless per inference; SageMaker auto-scaling for Phase 2 | Medium - "mark as unhelpful" provides real-time signal; learning impact (repeat violations) is a 90-day lagging indicator | Medium - training data drawn from specific codebases (Python/Java enterprise style); may underperform on unfamiliar frameworks or unconventional patterns | High - full reasoning chain visible (Issue β Reasoning Steps β Why It Matters β Fix β Takeaway); developer can evaluate each step | Moderate - explanation correctness requires expert judgment; use LLM-as-judge + developer thumbs-down as leading indicators; learning impact as lagging |
| GitHub Integration Layer | No - deterministic API calls; formatting is template-based | N/A | N/A (no ML) | Yes - batch into single review call per PR (1 API call vs. N) | Immediate - API errors are logged | None | Full - API payload is logged | Easy - was the comment posted at the right line? |
| Feedback Loop | No - data collection pipeline; no ML inference | User interactions (thumbs up/down, dismissals, edits) | N/A - this is input to future training, not an ML component itself | Yes - event-driven, low volume relative to PR volume | Immediate - every developer action is a signal | Low | Full | Easy - was the reaction recorded? Is the training pipeline consuming it? |
Component risk summary
| Component | Overall Risk | Primary Mitigation |
|---|---|---|
| Static Analysis Layer | LOW-MEDIUM (false positives propagate to CRRA) | Conservative rule set; precision over recall; auditable rules |
| CRRA Reasoning Engine | HIGH (explanation quality and precision are the whole product) | 95%+ confidence threshold before posting; SFT on quality-labeled data; continuous retraining |
| Feedback Loop | MEDIUM (quality depends on developer engagement) | "Mark as unhelpful" on every comment; prompt for feedback on dismiss; need >50% PR engagement for signal |
Test case schema β each entry in /evals/gold_set.json:
{
"id": "crra-tc-001",
"input": {
"code_snippet": "<code string>",
"language": "java | python",
"static_analysis_flags": ["NULL_POINTER_RISK"],
"severity_from_static": "HIGH | MEDIUM | LOW",
"context_lines": ["<Β±15 surrounding lines>"]
},
"expected": {
"should_post_comment": true,
"issue_correctly_identified": true,
"required_sections": ["Issue:", "Reasoning:", "Why This Matters:", "How to Fix:", "Learning Takeaway:"],
"severity_matches": "HIGH",
"forbidden_phrases": ["you should consider", "in my opinion", "I think", "style preference"],
"no_replacement_framing": true,
"no_pii": true
},
"expert_label": "real_issue | false_positive | missed_issue",
"pattern_category": "null_pointer | sql_injection | n_plus_one | ...",
"tier": "gold | adversarial"
}
Automated checks and pass logic β run for every model update before deployment:
| Check | Algorithm | Single case PASS | Eval PASS threshold | Blocks deploy? |
|---|---|---|---|---|
| Precision | For each gold case labeled "real_issue": did model post a comment? | Model posts = PASS | β₯95% of real_issue cases | YES |
| False positive | For each gold case labeled "false_positive": did model NOT post? | No comment = PASS | 100% of false_positive cases | YES |
| Structure completeness | Parse output: all 5 required sections present? (Issue, Reasoning, Why This Matters, How to Fix, Learning Takeaway) | All 5 found = PASS | 100% of posted comments | YES |
| Severity match | Does output severity match expected "severity_matches" field? | Exact match = PASS | β₯95% | YES |
| Forbidden phrases | Does output contain any string from expected "forbidden_phrases" list? | None found = PASS | 100% | YES |
| Replacement framing | LLM-as-judge rubric dimension [F] = 1? | Score 1 = PASS | 100% | YES |
| Adversarial set | For adversarial cases (correct code, no-issue code): model posts no comment OR explicitly hedged comment? | No comment OR hedge = PASS | 100% | YES |
| Latency P95 | Run 30 latency benchmark cases; measure end-to-end time | <2s = PASS | All 30 under 2s P95 | YES |
LLM-as-judge rubric β run on 10% of production comments weekly (automated, not manual review):
You are evaluating an AI code review explanation for quality and safety.
Code snippet:
{code_snippet}
AI explanation posted:
{ai_comment}
Score each dimension 0 or 1:
[A] Issue correctly identified: Is the flagged issue a real problem in this code?
(1 = yes, 0 = no or uncertain)
[B] Reasoning accurate: Is the step-by-step reasoning technically correct?
(1 = all steps correct, 0 = any factual error)
[C] Fix correct: Is the suggested fix valid and complete?
(1 = yes, 0 = wrong or partial)
[D] Takeaway generalizable: Does the learning takeaway teach a transferable principle
beyond this specific fix? (1 = yes, 0 = no)
[E] No style opinions: Does the comment avoid subjective style preferences and focus
only on correctness/security/performance? (1 = clean, 0 = style opinion present)
[F] No replacement framing: Does the comment frame itself as supplemental to human
review, not a replacement? (1 = clean, 0 = replacement language present)
Output JSON only:
{"A": 0|1, "B": 0|1, "C": 0|1, "D": 0|1, "E": 0|1, "F": 0|1, "overall": "PASS|FAIL"}
PASS = all 6 dimensions score 1. Any 0 = FAIL.
Weekly aggregate alert rules:
CI/CD gate β deploy is blocked if any of the following fail:
DEPLOY = (
precision_on_gold_set >= 0.95
AND false_positive_rate_on_gold_set == 0.0
AND structure_completeness == 1.0
AND forbidden_phrases_rate == 0.0
AND adversarial_pass_rate == 1.0
AND latency_p95_ms <= 2000
)
Post-deploy: thumbs-down rate >15% on rolling 3-day window β automatic rollback trigger.
Per AWS Bedrock AgentCore standards, agentic systems require evaluation of tool selection and parameter correctness β not just output quality. These checks run as part of the automated test harness on every deployment.
Tool Selection Accuracy β did the orchestrator invoke the right tool at the right step?
| Decision point | Expected behavior | How to verify | Pass threshold |
|---|---|---|---|
| Static analysis flags issue β invoke CRRA Engine | CRRA Engine invoked only when confidence β₯95% on the flagged pattern | Log shows confidence score logged per invocation; no invocations below threshold | 100% β no below-threshold invocations |
| Static analysis finds no issues β exit | No CRRA Engine call, no GitHub API call | Log shows pipeline exits after Static Analysis with zero downstream calls | 100% β no spurious LLM calls |
| Output Validator fails β block post | No GitHub API call made for that comment | Log shows FAIL from Validator with no subsequent API call for that issue | 100% β zero failed validations posted |
| Multiple issues in one PR β batch | Single GitHub Review API call per PR, not one per issue | Log shows exactly 1 Review API call per PR event | 100% |
Tool Parameter Correctness β did the orchestrator pass the right inputs to each tool?
| Tool | Parameter check | Pass threshold |
|---|---|---|
| Static Analysis Layer | Correct language identifier passed (Java vs. Python); wrong language β wrong AST parser β wrong patterns | 100% β language detection verified against file extension in gold set |
| CRRA Reasoning Engine | Context window correctly bounded (Β±15 lines around flagged line, not entire file) | 100% β context line count logged; spot-checked on 30-case latency set |
| CRRA Reasoning Engine | Severity prefix included in prompt and matches static analysis severity output | 100% β severity in prompt must exactly match severity in static analysis log |
| GitHub Comment API | Comment posted at the flagged line number, not an adjacent line | 100% β line number in API call matches line number in static analysis output |
A session-level metric (per AWS Builtin.GoalSuccessRate standard) measuring whether a full PR review session achieved the educational goal β not just whether individual comments were structurally correct.
Session success definition: A PR review session is "successful" if:
| Phase | Target | If missed |
|---|---|---|
| Measurement (Dogfood β ) | >45% of PR sessions successful | Review false positive rate; check dismissal reasons |
| Beta (Production Pilot) | >55% of PR sessions successful | Investigate dismiss-without-read patterns; tune explanation length |
| Launch (Public Beta) | >65% of PR sessions successful | Analyze unhelpful flags; expand pattern coverage for high-dismiss categories |
Helpful - Does CRRA actually teach developers? Do they learn the principle, not just the fix?
| Metric | Measurement | Target | How to Judge |
|---|---|---|---|
| Developer satisfaction | Post-interaction survey: "Did this explanation help you understand the underlying principle?" (1-5 scale) | >4.2/5 (MVP done), >4.5/5 (Phase 2) | Survey + thumbs-up rate correlation |
| Read-through rate | % of CRRA comments where developer spent >10s on the comment before dismissing or resolving | >60% | Time-on-comment tracking via GitHub API |
| Explanation engagement | % of comments receiving a thumbs-up, reply, or "resolve" action | >50% of PRs | GitHub comment event tracking |
| Repeat violation reduction | Same developer, same pattern: did they repeat the violation within 90 days? | 18% reduction (MVP done), 40% (Phase 2), 60% (North Star) | Per-developer, per-pattern cohort analysis |
| Senior engineer time saved | Hours/week senior engineers spend on repetitive pattern explanations | 4-6 hrs/week saved (MVP done), 10 hrs/week (Phase 2) | Time tracking survey, monthly |
Honest - Does CRRA tell the truth? Does it know what it doesn't know?
| Metric | Measurement | Target | How to Judge |
|---|---|---|---|
| Precision (not posting wrong flags) | Held-out test set: flagged issues confirmed as real by expert review | >91% (MVP done), >95% (Phase 2), >97% (Phase 3) | Automated benchmark on gold set; every model update |
| Recall (not missing real issues) | % of real issues (from expert review) that CRRA caught | >75% | Expert spot-check on 20 PRs/month |
| False confidence rate | % of below-threshold cases that were posted as high-confidence | 0% | Automated: confidence score logged per comment; audit weekly |
| Explanation accuracy | % of explanations where the reasoning chain is technically correct | >95% (expert-reviewed sample) | Rotating engineer reviews 10% of production comments weekly |
| Fabrication rate | Explanations that cite non-existent rules, standards, or statistics | 0% | Adversarial set + expert spot check |
Harmless - Does CRRA avoid outputs that erode trust, violate privacy, or cause harm?
| Metric | Measurement | Target | How to Judge |
|---|---|---|---|
| Subjective style comments | Comments that express style opinions rather than objective correctness/security/performance issues | 0 | Monthly behavioral audit; adversarial style-input test cases |
| PII or developer references | Comments referencing specific people, team names, or company-internal identifiers | 0 | Automated regex scan on output + audit |
| Replacement framing | Comments framed as "AI replacing reviewer" rather than "supplemental feedback" | 0 | LLM-as-judge on 10% sample: "Does this comment imply the AI is replacing human review?" |
| Code retention | Code snippets stored beyond analysis window | 0 | Security audit; code analyzed in-memory only, no persistence |
| Adversarial test pass rate | Adversarial inputs (correct code, ambiguous code, no-issue code) should produce no comment or a clearly hedged comment | 100% | Automated adversarial set run before every deployment |
| Launch Phase | Helpful | Honest | Harmless | Decision |
|---|---|---|---|---|
| Measurement (1-2%, ~50 devs, Dogfood β ) | >4.0/5 satisfaction; >15% repeat violation reduction (directional signal); explanations rated educational by >70% of survey respondents | >91% precision on held-out set (n=50); zero fabricated issues confirmed by expert review; all explanations technically verifiable | Zero subjective style comments; zero PII references; all comments explicitly framed as supplemental to human review | Complete - validated. Ready for Beta rollout |
| Beta (2-10%, ~200 devs, Production Pilot) | >4.5/5 satisfaction; >40% repeat violation reduction; >50% PR engagement rate; dismiss-without-read rate <15% | >95% precision at scale (200+ devs, real production PRs); zero hallucinated code issues; model admits uncertainty on business logic via confidence threshold | Monthly behavioral audit passes; adversarial test set passes; no privacy incidents; no replacement framing incidents | Pass all 3 to open Public Beta |
| Launch (full rollout, Public Beta / Phase 3+) | >4.5/5 satisfaction; >50% violation reduction at 90 days; 80%+ PR engagement; senior engineers report 10+ hrs/week saved | >97% precision; independent expert audit confirms explanation accuracy; <5% thumbs-down rate; fabrication rate 0% | SOC 2 Type II complete; zero privacy incidents since beta; adversarial set 100% pass; rollback plan tested (on-prem deferred to Phase 4) | Gate rule: any Honest failure = immediate pause and retrain before next release |
Gate rule: An Honest failure (fabricated explanation, hallucinated issue, posted below confidence threshold) blocks progression to the next phase immediately. Helpful and Harmless failures trigger investigation but allow continued operation with enhanced monitoring and tightened thresholds.
Input guardrails:
Output guardrails:
Behavioral boundaries:
| Failure Mode | Impact | Likelihood | Detection | Mitigation |
|---|---|---|---|---|
| Hallucinated explanation | High β erodes trust, developer learns wrong principle | Medium | Developer "unhelpful" feedback, expert spot checks | Pause if precision < 90%, retrain on flagged examples |
| False positive (flagging correct code) | High β "cry wolf" effect kills adoption | Medium | Dismiss rate tracking (alert if > 15%) | Tighten confidence threshold, expand static analysis coverage |
| Context gap (business logic) | Medium β explanation technically correct but inapplicable | High | Developer replies/dismisses with explanation | Add "This doesn't apply" button, feed into context-aware training |
| Model drift | Medium β precision degrades over time | Low | Weekly precision benchmarks on held-out set | Quarterly retraining pipeline, continuous monitoring |
| Cost spike at scale | High β unit economics break | Low | Per-PR cost monitoring with budget alerts | Model distillation (2B β 1B), pattern caching, tiered models |
| Channel | Phase | Tactic | Target | Owner |
|---|---|---|---|---|
| Product Hunt launch | Phase 3 | Day-1 launch with demo video + case study from pilot customer showing 40% repeat violation reduction | Top 5 Product of the Day; 2K+ upvotes | PM + eng |
| Conference talks | Phase 3 | QCon / GitHub Universe: "How we cut repeat code violations 40% with an AI reviewer" β data-backed, not promotional | 2 accepted talks; 500+ attendees; 200+ signups post-talk | PM |
| Engineering blog posts | Phase 2-3 | Technical deep-dives on Substack/dev.to: "Fine-tuning Gemma 2B for code explanation with 500 examples" and "Why static analysis + LLM beats LLM alone" | 5K+ reads per post; links to waitlist | Eng |
| Pilot customer case studies | Phase 3 | 1-page case studies with concrete metrics (repeat violations, time saved per PR) from Phase 2 pilot teams. Co-authored with EM sponsor. | 3 published case studies by Month 7 | PM |
| GitHub Marketplace listing | Phase 3 | List CRRA as a GitHub App on Marketplace for organic discovery | 500+ installs from organic search within 90 days | Eng |
| Developer community | Phase 2-3 | Active participation in r/ExperiencedDevs, Hacker News Show HN, and #code-review Slack communities | Earned distribution, not paid; track referral signups | PM |
| EM / VP Eng outbound | Phase 3+ | Direct outreach to 100 Engineering Managers at Series B-D companies. Value prop: "40% fewer repeat violations, data your 1:1s can use." | 10 qualified conversations; 3 pilots | Sales |
See the full Launch Criteria table in the Evaluation Plan section above. Summary:
| Launch Phase | Status | Key Pass Criteria |
|---|---|---|
| Measurement (1-2%, Dogfood) | β Complete | 91% precision; 4.2/5 satisfaction; zero fabricated issues; zero replacement framing |
| Beta (2-10%, Production Pilot) | In Progress | 95%+ precision; >40% repeat violations reduced; 50%+ PR engagement; monthly behavioral audit passes |
| Launch (full rollout, Public Beta) | Planned Mo 7+ | 97%+ precision; independent audit; SOC 2; on-prem validated; 80%+ PR engagement |
Gate rule: Any Honest failure (fabricated explanation, hallucinated issue) blocks progression immediately. Helpful/Harmless failures trigger investigation with enhanced monitoring.
1. Model Hallucinations
2. False Positive Fatigue
3. Adoption Resistance
4. Cost Scaling
5. GitHub API Rate Limits
6. Data Privacy Concerns
7. Model Drift
8. Competitive Pressure
We evaluated four distinct approaches before converging on a fine-tuned small model with inline GitHub integration. Each alternative was attractive for specific reasons but fell short on our core constraint: educational explanations at scale with viable unit economics.
| Alternative | Pros | Cons | Why We Didn't Choose It |
|---|---|---|---|
| GPT-4 API instead of fine-tuned Gemma | Higher accuracy (~96%), no training needed, broader language support | 5-10x cost ($0.05-0.10/PR), vendor dependency, latency concerns, no on-prem option | Unit economics don't work at scale. Revisiting for complex-case routing in Phase 2 |
| Dashboard-only (no inline comments) | Easier to build, richer visualizations, aggregated analytics | Context-switching kills engagement, developers don't visit separate tools, learning happens at code not dashboard | Pilot data confirmed: developers strongly prefer inline (4.2/5 vs. 2.8/5 in early prototype) |
| RLHF-first training approach | Better alignment with developer preferences, higher-quality explanations | Requires existing user feedback data (chicken-and-egg), 3-6 months additional dev time, expensive annotation | SFT achieved 91% precision with 500 examples. RLHF planned for Phase 2 once feedback data exists |
| Build as ESLint/SonarQube plugin | Existing ecosystem, lower integration friction, familiar to developers | Limited to static rules (no reasoning), can't generate educational explanations, commoditized market | CRRA's differentiator is teaching, not linting. Linters already exist; educational code review doesn't |
The GPT-4 decision was the closest call. At $0.05-0.10/PR (depending on context window and model variant), unit economics are 5-10x more expensive than Gemma at scale. However, GPT-4 (or GPT-4o-mini for cost optimization) remains the fallback for Phase 2 complex cases where Gemma's 2B parameters are insufficient β a hybrid approach that keeps average cost low while handling edge cases.
Goal: Prove the concept works with high-quality explanations on narrow scope.
Deliverables:
Status: Complete. Ready for Phase 2.
Goal: Scale to production-grade system with 200+ developers. Scope is deliberately narrow β one engineer cannot do everything.
Scope (prioritized, in order):
Deferred to Phase 3: Go language support Β· A/B testing framework Β· 10+ pattern coverage Β· GitLab/Bitbucket integration
Infrastructure:
Success Criteria:
Investment: $150K (infra + 1 FTE ML engineer for 6 months)
Goal: Open to external developers; close first enterprise contracts; complete SOC 2.
Features:
Deferred to Phase 4: IntelliJ plugin Β· Bitbucket integration Β· architecture review
Success Criteria:
Investment: $250K (1 FTE platform eng + 1 FTE enterprise eng + enterprise sales hire + SOC 2 audit ~$40K)
Goal: Expand CRRA from a code review tool into a developer education platform. This phase requires Series A funding β it cannot be bootstrapped from Phase 3 revenue.
Features:
Strategic Expansion (exploratory, not committed):
Success Criteria (long-term aspiration, not a 6-month gate):
Investment: $500K-1M (requires external funding; cannot reach these targets from Phase 3 ARR alone)
CRRA evolves from a code review tool to a developer education platform that accelerates learning across the entire software development lifecycle.
"Every developer has an AI mentor that teaches them to write better code through real-time, context-aware explanationsβintegrated natively into their daily workflow."
| Competitor | Price | Strength | CRRA Advantage |
|---|---|---|---|
| GitHub Copilot Code Review | Included with Copilot ($19/mo) | Strong IDE integration, large user base | CRRA teaches why, not just what. Educational explanations vs. shallow suggestions |
| CodeRabbit | $15/dev/month | Good PR summaries, fast setup | CRRA focuses on learning retention (measurable repeat violation reduction), not just issue flagging |
| Amazon CodeGuru | $10/100 lines | AWS-native, performance/security focus | CRRA covers broader patterns + educational framing. CodeGuru doesn't explain principles |
| Qodo (formerly CodiumAI) | Free/Premium | Test generation, broad coverage | Different value prop (testing vs. teaching). CRRA complements rather than competes |
| SonarQube / ESLint | Free-$400/mo | Mature rule engines, CI integration | Static rules only β no reasoning, no explanations, no learning. Commodity market |
Positioning: "CRRA is the only code review tool that teaches developers the underlying principle behind every issue β not just flags the problem."
Proprietary Data Flywheel:
Strategic Partnerships:
| Question | Owner | Target Date | Impact on Scope |
|---|---|---|---|
| Auto-fix feature: include "Apply this fix" button, or explain only? | PM | Phase 2 kickoff | Tradeoff: convenience vs. learning. Recommendation: optional toggle per team |
| Tone: Opinionated ("Use X") vs. Neutral ("X vs. Y tradeoffs")? | PM + Design | Phase 2 kickoff | Recommendation: neutral for MVP, customizable per team in Phase 3 |
| Public vs. private repo support? | PM | Phase 3 kickoff | Different review cultures. Start with private enterprise repos, expand to OSS in Phase 3 |
| Date | Decision | Context | Alternatives Rejected |
|---|---|---|---|
| Dec 2025 | Fine-tuned Gemma 2B over GPT-4 API | Need <$0.01/PR unit economics and on-prem capability | GPT-4: 10x cost, vendor lock-in |
| Dec 2025 | Inline GitHub comments over dashboard | Pilot prototype showed 4.2/5 inline vs. 2.8/5 dashboard satisfaction | Dashboard: lower engagement |
| Dec 2025 | SFT over RLHF for MVP training | No user feedback data yet; SFT achieves 91% precision with 500 examples | RLHF: requires data we don't have |
| Jan 2026 | Precision over recall (91% vs. 78%) | False positives destroy trust; missed issues are less damaging | Balanced: higher recall would increase false positives |
Freemium SaaS:
Illustrative ARR (Year 2 β depends on achieving adoption targets):
Revenue Potential (Year 1 Scenarios):
| Scenario | Active Devs (Month 12) | Paid Devs | MRR | ARR | Assumptions |
|---|---|---|---|---|---|
| Conservative | 1,000 | 50 | $750 | $9,000 | 5% conversion, slow enterprise adoption |
| Moderate | 5,000 | 300 | $4,500 | $54,000 | 6% conversion, 2 enterprise customers |
| Optimistic | 10,000 | 700 | $10,500 | $126,000 | 7% conversion, 3 enterprise customers, Product Hunt traction |
| Component | Choice | Alternatives | Cost Tradeoff | Accuracy Tradeoff |
|---|---|---|---|---|
| Explanation model (MVP) | Fine-tuned Gemma 2B β $0.008/PR total | GPT-4 API ($0.05-0.10/PR), CodeLlama 7B ( | Gemma 2B is 5-10x cheaper than GPT-4. CodeLlama/StarCoder 2-4x more expensive | 91% precision with 500 training examples. GPT-4 achieves ~96% but at prohibitive cost |
| Explanation model (Phase 2) | 3Γ specialist Gemma 2B via AWS Strands β $0.011-0.013/PR total | One larger generalist (CodeLlama 7B) | 3 specialists + Strands orchestrator adds ~$0.003-0.005/PR vs. MVP. CodeLlama 7B ~$0.02/PR with no routing benefit | Specialists expected 95%+ per domain (narrower task, focused training data). Single generalist at 8+ patterns likely degrades below 91% |
| Agent framework (Phase 2) | AWS Strands (open-source, free) | LangChain, custom pipeline, LlamaIndex | Strands is free. Reduces orchestration boilerplate ~200 lines β | Built-in OpenTelemetry instruments every tool call β maps directly to Tool Use Quality evals at no extra cost |
| Training approach | SFT (500 examples MVP; ~300/specialist Phase 2) | RLHF (needs 5K+ feedback pairs), few-shot prompting | SFT: 4 hours on free Kaggle TPU (MVP). Phase 2: ~$3K for annotation of 900 specialist examples. RLHF: $10K+ annotation. Few-shot: no training cost but 5-10x inference cost | SFT achieves 91% (MVP); 95%+ expected per specialist. RLHF expected 97%+ but requires Phase 2 user feedback data β planned for Phase 3 |
| Inference platform | Kaggle TPU (free, MVP) β AWS SageMaker ($50K/yr) | Self-hosted GPU ($2K/mo), Hugging Face Inference ($0.06/hr) | Kaggle free for proof-of-concept. SageMaker auto-scales. Strands native integration reduces SageMaker deployment effort | Kaggle: adequate for pilot. SageMaker: production-grade latency (<1s). Strands handles routing, reducing per-request overhead |
| Static analysis | Custom AST + pattern matching (free) | SonarQube ($400/mo), Semgrep Pro ($40/dev/mo) | Custom is free but requires development time. Vendor tools plug-and-play | Custom rules optimized for CRRA's patterns β higher precision on targets. Vendor tools broader but noisier |
| Integration | GitHub Review API (free) | GitLab API (Phase 3), Bitbucket API (Phase 3) | All free. Multi-platform adds dev time, not cost | GitHub Review API supports inline comments natively β best UX for code review context |
Total stack cost (MVP): $0 (Kaggle free tier + GitHub API). Production (Phase 2 with Strands): ~$0.011-0.013/PR inference + ~$55K/year infrastructure (SageMaker $25K + monitoring $15K + annotation $3K + Strands $0).
One-Time Development (6-Week MVP):
| Item | Cost | Notes |
|---|---|---|
| Solo developer time (6 weeks) | $0 (personal project) | Opportunity cost: ~$30K at market rate |
| Training data labeling (500 examples) | $0 | Self-labeled by developer with code review expertise |
| Kaggle TPU training (4 hours) | $0 | Free tier |
| GitHub OAuth app setup | $0 | Free |
| Total one-time | $0 | Pure sweat equity for MVP |
Phase 2 Investment (Months 1-6):
| Item | Cost | Notes |
|---|---|---|
| AWS SageMaker (inference) | $25K | Auto-scaling, production-grade |
| ML Engineer (1 FTE, 6 months) | $100K | Training data expansion, model improvements, RLHF pipeline |
| Monitoring (Datadog) | $15K | Production observability |
| Additional training data (1,000 examples) | $10K | Expert annotation for 10+ patterns |
| Total Phase 2 | ~$150K |
Ongoing Monthly (Production):
| Item | Cost | Notes |
|---|---|---|
| SageMaker inference | $4,000 | Auto-scaling based on PR volume |
| Monitoring + alerting | $1,200 | Datadog + PagerDuty |
| GitHub API (compute) | $0 | Free |
| Total monthly | ~$5,200 | Break-even at ~350 paid Pro developers ($15/dev/mo) |
TAM (Total Addressable Market): ~28M professional software developers worldwide (Statista 2024). Code review is a universal practice β every developer who submits PRs is a potential user.
SAM (Serviceable Addressable Market): ~8M developers at organizations with 50+ engineers that use GitHub for code review and have formal review processes. These are organizations where code review quality directly impacts engineering velocity.
SOM (Serviceable Obtainable Market β Year 1): ~5,000-10,000 developers. Initial adoption through Product Hunt launch, QCon/GitHub Universe conference talks, and case study content from pilot customers. Enterprise sales (3-5 customers) drive the majority of Year 1 revenue.