Tithely QA Strategy
A unified quality model across 8 product teams — better releases, fewer defect escapes, faster feedback. AI generates the tests; SDETs are the quality engineers who judge them.
The pipeline — every ticket flows left to right through four gates
How to read this deck: it is the operating manual for the QA function — the directions from the QA Manager (Josh Partridge). It defines how every ticket moves to production, what gates block, how we measure coverage, and what every embedded SDET owns. Use the checklists; they persist your progress.
The Tithely QA Model
Standardize QA across all 8 product teams so we ship higher-quality releases, faster — with one clear gate, one feedback loop, and one definition of done.
The strategic shift
- QA Engineer → SDET. Every quality engineer writes code; every test is automated. SDETs are not test authors — they are quality engineers who own the bar.
- One SDET per team. Each of the 8 product teams has a dedicated embedded SDET — no shared resources.
- Parallel, not sequential. SDETs run alongside developers from refinement onward — quality is built in, not inspected at the end.
- AI-first automation. Claude generates ~95% of tests; the SDET reviews, judges, and approves every one.
Two goals — never conflate them
Surge target — End of Q3 2026
Critical paths (P0) green on every high-risk surface within 90 days of full SDET placement. This is not 80% coverage.
Comprehensive horizon — Month 18
80% P1 regression coverage of critical-path journeys. The long game: full regression depth across all 8 teams.
Reporting discipline: "critical paths green" (Q3) and "80% coverage" (Month 18) are two different goals. Never report one as if it were the other to leadership.
QA Process Flow — Ticket to Production
The operational foundation of the strategy: exactly when each test type runs, which environment it targets, and who is responsible. Tests live in a separate test repo that GitHub Actions pulls from at each pipeline stage.
P0 vs P1 — the test classification
- P0 — critical happy path. Journeys that, if broken, cause system-wide customer-visible failure. Surge target: 100% by Q3.
- P1 — full regression. The full scenario including edge cases and negative paths. Targets 80–90% functional coverage; builds to Month 18.
- A journey is not fully covered until both a P0 and a P1 test exist for it.
The feedback loop
Any gate failure triggers an immediate Slack alert to the SDET and developer with the failing report. The build does not advance until the SDET judges it green. The nomenclature is aligned with bug priorities — a P0 test failure is a release blocker.
The Four-Stage Gate Design
Every team uses the Guild-maintained GitHub Actions template. Teams may add steps; they may not remove or loosen gates. Each stage has a time budget — exceeding it by >20% is a CI failure.
1 · PR — <3 min
Unit tests · linting · type check · SonarQube quality gate · Snyk dependency scan · unit coverage ≥ 80%.
BLOCKING — PR cannot merge
2 · Merge — <8 min
Integration tests · Pact contract verification · P0 critical-path smoke suite (tagged tests from the test repo) · Claude coverage-gap report (non-blocking).
BLOCKING — merge to main fails
3 · Staging — <20 min
P1 full regression (all journeys) · Percy visual snapshot · axe-core a11y audit · OWASP ZAP DAST scan · performance regression (p95 vs baseline).
BLOCKING — staging deploy fails
4 · Production — manual
Josh sign-off · SDET confirms no open regressions · rollback plan documented · canary deploy (30 min synthetic monitoring before full rollout).
BLOCKING — requires explicit approval
Why budgets are enforced as failures: a slow gate gets bypassed. Keeping PR under 3 min and merge under 8 min is what keeps the gate trusted and the feedback loop tight.
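A minimal TypeScript sketch of the budget rule, assuming budgets are tracked in whole seconds. The budgets and the 20% tolerance come from the gates above; the name `exceedsBudget` and the stage keys are illustrative, not part of the Guild template. Production is omitted because it has no time budget.

```typescript
// Stage budgets in seconds, from the four-stage gate design (3, 8 and 20 minutes).
const STAGE_BUDGET_SEC: Record<string, number> = {
  pr: 180,
  merge: 480,
  staging: 1200,
};

// Exceeding a budget by more than this percentage is a CI failure.
const TOLERANCE_PCT = 20;

// Hypothetical helper: returns true when the stage run should fail CI on time alone.
function exceedsBudget(stage: string, elapsedSec: number): boolean {
  const budget = STAGE_BUDGET_SEC[stage];
  if (budget === undefined) {
    throw new Error(`No time budget defined for stage "${stage}"`);
  }
  // Integer arithmetic avoids floating-point surprises at the exact boundary.
  return elapsedSec * 100 > budget * (100 + TOLERANCE_PCT);
}
```

Under this reading, a PR run fails on time once it passes 3 min 36 s; a run that lands exactly on the 20% mark still passes.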
Blocking Checks & Flake Quarantine
What blocks, who owns it, and what happens on failure. Quarantine is the pressure valve that keeps flaky tests from eroding trust in the gate.
| Check | Stage | Owner | Failure action |
|---|---|---|---|
| Unit coverage <80% | PR | Dev + SDET | PR blocked; dev adds tests before re-review |
| SonarQube / Snyk critical | PR | Guild template | PR blocked; resolve, accept with justification, or patch within 48 hrs |
| Pact contract broken | Merge | Embedded SDET | Merge blocked; consumer or provider fixes and republishes |
| Smoke (P0) suite failure | Merge | Embedded SDET | Merge blocked; SDET triages within 30 min — fix or quarantine |
| Claude coverage-gap report | Merge | Embedded SDET | Non-blocking; posted as PR comment; SDET acts within the sprint |
| E2E (P1) failure (>1 non-quarantined) | Staging | Embedded SDET | Deploy blocked; fix or quarantine within 2 hrs |
| axe-core critical a11y | Staging | Guild + SDET | Deploy blocked; must resolve before release |
| Performance p95 >15% over baseline | Staging | Guild stewards | Deploy blocked; eng lead alerted; root cause required |
| OWASP ZAP high-severity | Staging | Guild + DevSecOps | Deploy blocked; remediation ticket; exception needs CISO sign-off |
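The p95 check in the table reduces to a small comparison. The sketch below shows the logic only, assuming a nearest-rank percentile and latencies in milliseconds; the real gate runs through k6, and the function names here are illustrative.

```typescript
// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) {
    throw new Error("percentile needs at least one sample");
  }
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// True when the current p95 is more than maxIncreasePct above the stored baseline.
function p95Regressed(
  baselineP95Ms: number,
  currentSamplesMs: number[],
  maxIncreasePct = 15,
): boolean {
  const currentP95 = percentile(currentSamplesMs, 95);
  return currentP95 * 100 > baselineP95Ms * (100 + maxIncreasePct);
}
```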
Flake quarantine: a test is flaky if it fails non-deterministically on 2+ of the last 5 runs with no code change. The SDET tags it @quarantine within 24 hrs — it still runs and logs to the flake dashboard, but is excluded from blocking gates and does not count toward coverage. Resolution SLA: 5 business days to fix or delete. A re-promoted test must pass 10 consecutive runs before re-entering the gate. @ai-generated flake rate is tracked separately — if it exceeds the human baseline, the prompts get a Guild review.
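The quarantine and re-promotion rules can be sketched as two predicates. Two assumptions are baked in and are not stated in the policy: "no code change" is read as all five runs sharing one commit SHA, and "non-deterministic" is read as at least one pass among those runs. Type and function names are illustrative.

```typescript
interface TestRun {
  result: "pass" | "fail";
  commitSha: string; // code version the run executed against
}

// Flaky: 2+ failures in the last 5 runs, on unchanged code, with mixed results.
function isFlaky(history: TestRun[]): boolean {
  const lastFive = history.slice(-5);
  if (lastFive.length < 5) return false;
  const sameCode = lastFive.every((r) => r.commitSha === lastFive[0].commitSha);
  const failures = lastFive.filter((r) => r.result === "fail").length;
  const passes = lastFive.length - failures;
  // Five straight failures are deterministic: that is a real defect, not a flake.
  return sameCode && failures >= 2 && passes >= 1;
}

// Re-promotion: the 10 most recent runs must all pass.
function canRepromote(history: TestRun[]): boolean {
  const lastTen = history.slice(-10);
  return lastTen.length === 10 && lastTen.every((r) => r.result === "pass");
}
```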
Coverage Definition & The 4 KPIs
Coverage = % of identified critical-path journeys covered by at least one passing automated P0 or P1 test in Zephyr Scale. Not line coverage. Not test-case count. A journey is "covered" when an automated test exists, passes on the last 5 consecutive CI runs, and is tagged to that journey in Zephyr. This is the single definition used in every dashboard, gate check, and leadership report.
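As a sketch of that definition, assuming each register entry carries its tagged-test status and recent CI results. The `Journey` shape is hypothetical and is not Zephyr Scale's data model; the `quarantined` flag reflects the rule that quarantined tests do not count toward coverage.

```typescript
interface Journey {
  id: string;              // e.g. "journey-giving-003"
  hasTaggedTest: boolean;  // an automated P0 or P1 test is tagged to this journey
  quarantined: boolean;    // quarantined tests do not count toward coverage
  lastRuns: boolean[];     // CI results, oldest first; true = pass
}

// Covered: a tagged automated test exists and passed the last 5 consecutive CI runs.
function isCovered(j: Journey): boolean {
  const lastFive = j.lastRuns.slice(-5);
  return (
    j.hasTaggedTest &&
    !j.quarantined &&
    lastFive.length === 5 &&
    lastFive.every((passed) => passed)
  );
}

// Coverage % = covered journeys / identified journeys in the register.
function coveragePct(register: Journey[]): number {
  if (register.length === 0) return 0;
  const covered = register.filter(isCovered).length;
  return (covered * 100) / register.length;
}
```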
The four metrics — tracked monthly
- P0/P1 bug volume: < 8 / mo. A 25% reduction, tied to the org-wide OKR (Q4 2026).
- Defect escape rate: < 5%. Production bugs ÷ total bugs ('Found in Environment = Production' in Jira). Month 18.
- P1 regression coverage: 80%. % of journeys with a passing P1 test. Month 18.
- Flaky test rate: < 2%. Non-deterministic failures ÷ total. Month 18.
Milestone targets
| Milestone | P0 | P1 | Escape | Flaky | Date |
|---|---|---|---|---|---|
| P0 surge complete | 100% | baseline | baseline | baseline | End of Q3 2026 |
| Gate 1 — all teams reporting | 100% | ≥30% | <15% | <10% | Month 6 |
| Gate 2 — coverage build | 100% | ≥60% | <10% | <5% | Month 10 |
| Gate 3 — year-1 review | 100% | ≥70% | <7% | <3% | Month 12 |
| Comprehensive horizon | 100% | ≥80% | <5% | <2% | Month 18 |
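A sketch of checking a monthly KPI snapshot against a milestone row. Thresholds are copied from the table; the surge row is left out because its targets are "baseline". The flaky-rate denominator is assumed to be total test runs, which the metric definition leaves open. All names are illustrative.

```typescript
interface Milestone {
  name: string;
  minP1Pct: number;     // P1 coverage must be at least this
  maxEscapePct: number; // escape rate must be below this
  maxFlakyPct: number;  // flaky rate must be below this
}

const MILESTONES: Milestone[] = [
  { name: "Gate 1", minP1Pct: 30, maxEscapePct: 15, maxFlakyPct: 10 },
  { name: "Gate 2", minP1Pct: 60, maxEscapePct: 10, maxFlakyPct: 5 },
  { name: "Gate 3", minP1Pct: 70, maxEscapePct: 7, maxFlakyPct: 3 },
  { name: "Comprehensive horizon", minP1Pct: 80, maxEscapePct: 5, maxFlakyPct: 2 },
];

interface KpiSnapshot {
  p0Pct: number;
  p1Pct: number;
  productionBugs: number; // 'Found in Environment = Production'
  totalBugs: number;
  flakyFailures: number;  // non-deterministic failures
  totalRuns: number;      // assumed denominator for flaky rate
}

function pct(part: number, whole: number): number {
  return whole === 0 ? 0 : (part * 100) / whole;
}

// Returns the names of the targets the snapshot misses; empty means the milestone is met.
function unmetTargets(s: KpiSnapshot, m: Milestone): string[] {
  const unmet: string[] = [];
  if (s.p0Pct < 100) unmet.push("P0"); // every milestone row requires 100% P0
  if (s.p1Pct < m.minP1Pct) unmet.push("P1");
  if (pct(s.productionBugs, s.totalBugs) >= m.maxEscapePct) unmet.push("Escape");
  if (pct(s.flakyFailures, s.totalRuns) >= m.maxFlakyPct) unmet.push("Flaky");
  return unmet;
}
```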
Surge Coverage Plan & 90-Day Burn-Down
The surge target is critical paths green on every high-risk surface within 90 days of the last SDET being placed and productive — infrastructure first, coverage second.
High-risk surfaces — day-90 targets (measured from July 1)
| Team | Paths | Day 30 | Day 60 | Day 90 |
|---|---|---|---|---|
| Enterprise Giving | 12 | 4 | 8 | 12/12 |
| SMB / YoY Giving | 11 | 3 | 7 | 10/11 |
| People Participation | 12 | 3 | 7 | 10/12 |
| People Core | 12 | 2 | 5 | 9/12 |
Medium / lower risk — paths green by month 6
- Comms & Content (8 paths) — email/SMS send, unsubscribe, bounce, template editor
- Org Mgmt & Tooling (9 paths) — billing, RBAC admin, integration settings
- Growth Team (7 paths) — onboarding, trial conversion, feature flags
- Elvanto (10 paths) — service planning, roster, volunteer scheduling
How Giving hits day-90 green
Giving is treated as a migration project, not a standard ramp. The Enterprise Giving SDET runs a dedicated 2-week Stripe surface-mapping spike (days 1–14). Every critical path gets a Stripe sandbox from day 1 — no path is marked green without a passing test against the sandbox.
Register total: 91 journeys across 8 teams · 41/47 high-risk green at surge.
P0 Pipeline Wiring — First Steps Per Team
Wiring even a handful of tagged P0 tests into the QA-deploy gate proves the flow and delivers visible value before full coverage is built. Coverage % is irrelevant at this stage — the goal is to close the loop.
The wiring pattern — same for every team
- Step 1: Tag 3–5 existing Zephyr cases that cover a critical happy path as @P0. That's enough to start.
- Step 2: Add a GitHub Actions workflow that triggers on every deploy to QA, checks out the central test repo, runs only @P0 tests, and posts pass/fail to Slack within 5 minutes.
- Step 3: Set the workflow as a blocking status check. Any QA deploy with a failing P0 test stops and notifies the developer immediately.
- Done looks like: developer deploys to QA → P0 suite runs automatically → result in Slack in under 5 minutes.
Tagging order (highest priority first)
- Week 2: Enterprise Giving (Stripe migration), SMB / YoY Giving
- Week 3: People Participation, People Core
- Weeks 4–6: Comms, Org Mgmt, Elvanto, Growth
P0 trigger template
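    on:
      workflow_dispatch:
        inputs:
          environment:
            default: qa
    jobs:
      p0-suite:
        runs-on: ubuntu-latest
        steps:
          # Check out the central test repo, not the product repo
          - uses: actions/checkout@v4
            with:
              repository: tithely/qa-test-repo
              token: ${{ secrets.QA_REPO_TOKEN }}
          - run: npm ci
          # Browsers are not preinstalled for Playwright on the runner
          - run: npx playwright install --with-deps
          # Run only tests tagged @P0
          - run: npx playwright test --grep @P0
            env:
              BASE_URL: ${{ vars.QA_BASE_URL }}
              STRIPE_TEST_KEY: ${{ secrets.STRIPE_TEST_KEY }}
          - name: Post result to Slack
            if: always()
            uses: slackapi/slack-github-action@v1
            with:
              payload: |
                { "text": "P0 suite: ${{ job.status }} on QA deploy" }
            env:
              # v1 of the action reads the webhook from env; the secret name is an assumption
              SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
              SLACK_WEBHOOK_TYPE: INCOMING_WEBHOOK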
The config is identical for every team — the only variable is which tests get the @P0 tag first.
Team Shape & SDET Assignments
9 total QE members covering 8 product teams. All report directly to the QA Manager. One dedicated SDET embedded per team — no shared resources.
Headcount & reporting
| Role | Count | Notes |
|---|---|---|
| QA Manager | 1 | Josh Partridge — strategy, Guild, vendor mgmt, escalations, final prod sign-off |
| Internal SDET | 1 | Embedded on a high-risk product team |
| Build Online SDETs | 7 | Embedded across the remaining 7 teams |
| Total | 9 | One SDET in each of the 8 teams |
What an embedded SDET owns
- Build P0 & P1 automation against the team's critical paths
- Use and contribute to the standard framework + prompt library
- Improve quality in refinement and implementation, not just at the end
- Report coverage, escapes, and flake rate weekly
Risk tier & surge priority
| Team | Risk | Surge priority |
|---|---|---|
| Enterprise Giving | High | Day 90: 12/12 green |
| SMB / YoY Giving | High | Day 90: 10/11 green |
| People Participation | High | Day 90: 10/12 green |
| People Core | High | Day 90: 9/12 green |
| Comms & Content | Medium | Paths green by month 6 |
| Org Mgmt & Tooling | Medium | Paths green by month 6 |
| Elvanto | Medium | Paths green by month 6 |
| Growth Team | Lower | Paths green by month 6 |
Concentration risk is real: 7 of 9 are from one vendor. Mitigated by IP ownership of all test code, a versioned Guild prompt library, a 60-day knowledge-transfer requirement, and a secondary vendor identified by month 3.
AI-First Automation Approach
Building this coverage across 8 teams in 18 months is not feasible at human-only throughput. Claude generates the tests; the SDET is the quality engineer who reviews every one before it ships.
What Claude generates
- Happy-path tests from specs · edge-case parameterisation
- API scaffold from OpenAPI schemas · negative-path enumeration
- Regression tests from bug reports
- Coverage-gap analysis on PR diffs
- Boilerplate — data factories, page objects, fixtures
What the SDET must supply
- Domain judgment (Stripe migration risk, child-safety rules)
- Risk-based prioritisation — which tests run first
- Verifying assertions actually catch real bugs
- Upstream requirements review · exploratory & adversarial testing
- Prompt quality, library curation, mutation-test oversight
Mutation testing — the real quality signal. Flake rate alone does not prove tests catch bugs. Monthly, Stryker (JS/TS) or PITest (Java) introduces a small deliberate code change (e.g. > to ≥). A good test kills the mutant — it fails. A tautology test does not. Target: ≥70% mutation score by month 12; teams below 60% get their prompts reviewed.
The tautology-test problem. The most common AI failure mode is a test that passes regardless of what the code does. Before approving any AI-generated test, the SDET must be able to answer: "Can I make this test fail by introducing a real bug?" "Claude generated it" is never sufficient to approve.
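A toy illustration of the difference, with no test framework involved. The rule under test (`amount > limit`) is a made-up example, and the mutant is written by hand to mimic the `>` to `≥` change a mutation tool might introduce.

```typescript
// Code under test: hypothetical rule, a gift strictly above the limit needs review.
type Rule = (amount: number, limit: number) => boolean;
const original: Rule = (amount, limit) => amount > limit;

// Hand-written mutant: the comparison operator is changed from > to >=.
const mutant: Rule = (amount, limit) => amount >= limit;

// A test is modelled as a function that returns true when it passes.
type Check = (impl: Rule) => boolean;

// Tautology: only asserts that a boolean came back, so no bug can make it fail.
const tautologyTest: Check = (impl) => typeof impl(100, 100) === "boolean";

// Boundary test: pins the behaviour at and just above the limit.
const boundaryTest: Check = (impl) =>
  impl(100, 100) === false && impl(101, 100) === true;

// A mutant is killed when a test that passes on the original fails on the mutant.
function kills(test: Check): boolean {
  return test(original) && !test(mutant);
}
```

Here `kills(boundaryTest)` is true and `kills(tautologyTest)` is false: the tautology passes on both versions, which is exactly what a low mutation score exposes.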
Stripe Migration Coverage
The highest-priority testing surface in the strategy. It touches both Giving teams, runs in parallel with SDET onboarding, and has zero tolerance for defect escape given the financial and trust implications.
Payment methods
Card (3DS, soft/hard decline, card update) · ACH (verification, micro-deposit, return codes) · Apple Pay · Google Pay · card-present kiosks (NFC, receipt).
SCA / 3DS flows
3DS2 frictionless / challenge / failed · SCA exemptions (low-value, MIT, recurring) · 3DS1 fallback · recurring off-session re-auth. All against Stripe's test card matrix.
Recurring lifecycle
Create (weekly/monthly/annual) · schedule change · pause/resume · cancel · Smart Retries · dunning · leap-year & month-end edge cases.
Refunds & disputes
Full / partial refund · ACH refund timing · chargeback webhook handling · evidence submission · multi-fund refund allocation.
API authz & HMAC fuzzing (not just happy-path). The transactions HMAC issue class requires dedicated security testing: HMAC signature-bypass attempts, replay attacks outside the timestamp window, malformed webhook payloads, and missing/invalid idempotency keys. These run in the staging ZAP DAST scan and as a dedicated Stripe authz suite in the Giving merge gate.
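To make the attack classes concrete, here is a generic sketch of a timestamped HMAC-SHA256 check and the verdicts a fuzzing suite would probe. It is not Tithely's handler or Stripe's SDK: the signed-string layout (`timestamp.body`), the 300-second window, and all names are assumptions for illustration, and header parsing is left out.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

const TOLERANCE_SEC = 300; // assumed replay window

type Verdict = "ok" | "malformed" | "outside-window" | "bad-signature";

// Signs "<timestamp>.<body>" with HMAC-SHA256 and returns a hex digest.
function sign(secret: string, timestampSec: number, body: string): string {
  return createHmac("sha256", secret).update(`${timestampSec}.${body}`).digest("hex");
}

function verifyWebhook(
  secret: string,
  timestampSec: number,
  body: string,
  signatureHex: string,
  nowSec: number,
): Verdict {
  // Malformed input is rejected before any comparison (also keeps buffer lengths equal).
  if (!Number.isInteger(timestampSec) || !/^[0-9a-f]{64}$/.test(signatureHex)) {
    return "malformed";
  }
  // Replay protection: a validly signed but stale event is still rejected.
  if (Math.abs(nowSec - timestampSec) > TOLERANCE_SEC) {
    return "outside-window";
  }
  const expected = Buffer.from(sign(secret, timestampSec, body), "hex");
  const received = Buffer.from(signatureHex, "hex");
  // Constant-time comparison avoids leaking how many leading bytes matched.
  return timingSafeEqual(expected, received) ? "ok" : "bad-signature";
}
```

A dedicated authz suite would assert each non-"ok" verdict explicitly: tampered body, replayed timestamp, and malformed or missing signature.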
Environment strategy: dedicated Stripe sandbox per environment tier · test card matrix automated in the Giving data factory · webhook replay tool (Stripe CLI) for local dev · no production Stripe keys in CI · Stripe API version pinned (upgrades need Giving SDET sign-off) · sandbox refresh on every staging deploy.
Test Standards & Patterns
Consistency across 8 teams is what makes the shared library and prompt library work. These apply to all test code — AI-generated or human-written.
AAA structure
- Arrange: data factory only — no raw DB inserts, no hardcoded IDs, no shared state between tests.
- Act: one action per test.
- Assert: explicit, specific assertions on the outcome, never on implementation detail; one logical assertion per test, not several independent ones bundled together.
- Mutation check: before approving, articulate what code change would make this test fail.
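The structure above, sketched framework-free so the shape is visible. `makeDonation` and `submitDonation` are hypothetical stand-ins for a Guild data factory and the system under test; a real suite would express this as a Playwright or Jest test.

```typescript
interface Donation {
  id: string;
  amountCents: number;
  cardExpired: boolean;
}

let donationSeq = 0;

// Data factory: unique IDs and sensible defaults, overridable per test.
function makeDonation(overrides: Partial<Donation> = {}): Donation {
  donationSeq += 1;
  return { id: `don-${donationSeq}`, amountCents: 2500, cardExpired: false, ...overrides };
}

// Stand-in for the system under test.
function submitDonation(d: Donation): { status: "accepted" | "declined" } {
  return { status: d.cardExpired ? "declined" : "accepted" };
}

function submitDonation_withExpiredCard_isDeclined(): void {
  // Arrange: data comes from the factory, no hardcoded IDs or shared state.
  const donation = makeDonation({ cardExpired: true });

  // Act: exactly one action.
  const result = submitDonation(donation);

  // Assert: one specific assertion on the outcome.
  if (result.status !== "declined") {
    throw new Error(`expected "declined", got "${result.status}"`);
  }
  // Mutation check: this fails if the expiry branch is removed or inverted.
}
```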
Page objects & shared utilities
- All Playwright tests use Guild-maintained page objects — no raw DOM selectors in test files.
- Data factories handle all test data; auth/session helpers are shared.
- Stripe test-card constants live in the Giving data factory — no hardcoded card numbers anywhere else.
Naming conventions
submitDonation_withExpiredCard_showsDeclinedError
fetchDonorHistory_whenLoggedOut_returns401
recurringCharge_onMonthEnd_rollsToValidDate
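The three names share a shape that can be linted. The pattern below, three camelCase segments joined by underscores (action, condition, expected result), is inferred from the examples and is not a stated Guild rule.

```typescript
// Assumed pattern: action_condition_expectedResult, each segment camelCase.
const TEST_NAME_PATTERN = /^[a-z][A-Za-z0-9]*_[a-z][A-Za-z0-9]*_[a-z][A-Za-z0-9]*$/;

function followsNamingConvention(testName: string): boolean {
  return TEST_NAME_PATTERN.test(testName);
}
```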
Tag taxonomy
| Tag | Meaning |
|---|---|
| @P0 / @P1 | Critical happy path / full regression |
| @journey-[team]-[id] | e.g. @journey-giving-003 |
| @ai-generated | Flake rate tracked separately |
| @quarantine | Excluded from blocking gate |
| @regression-[bug-id] | Written from an escaped defect |
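How the tags combine at a gate, as a sketch over plain data. In the pipeline the selection is done by the test runner's tag filter; the `TaggedTest` shape and function names here are illustrative.

```typescript
interface TaggedTest {
  title: string;
  tags: string[];
}

// A blocking gate runs tests carrying the gate tag, minus anything quarantined.
function selectForGate(tests: TaggedTest[], gateTag: "@P0" | "@P1"): TaggedTest[] {
  return tests.filter(
    (t) => t.tags.includes(gateTag) && !t.tags.includes("@quarantine"),
  );
}

// Tests with no @journey-... tag cannot be counted toward any journey's coverage.
function missingJourneyTag(tests: TaggedTest[]): TaggedTest[] {
  return tests.filter((t) => !t.tags.some((tag) => tag.startsWith("@journey-")));
}
```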
Manual cases (Zephyr Scale): precondition, step-by-step actions, expected result per step, pass/fail criteria, journey tag. Exploratory and hardware protocols follow their own logged standard.
Iteration, Maintenance & Exploratory
Tests are code — they decay and need maintenance. This is who owns what, and how the loop closes. SDETs also gate quality before dev starts, because AI amplifies whatever is in the requirements.
The maintenance loop
| Trigger | Process | Owner | SLA |
|---|---|---|---|
| Feature changes | SDET reviews diff in PR; updates affected tests (re-prompts if AI-generated) before merge | Embedded SDET | Same sprint |
| Feature removed | Archive/delete tests; remove journeys from Zephyr; update prompt library | Embedded SDET | Same sprint |
| Escaped defect | Root-cause the missed journey; write @regression-[bug-id] test; document prompt gap | SDET + Josh | Within 5 business days of fix |
| Exploratory finding | If it's a register gap, add the journey and prompt Claude for coverage; else file a bug | Embedded SDET | Register updated in 2 business days |
| Mutation score <60% | Guild steward reviews prompts; SDET re-generates weak suite sections | Guild steward + SDET | Same monthly cycle |
Exploratory cadence — 100% human
- One 60–90 min charter-driven session per sprint per SDET, minimum (Giving: two). Protected in sprint planning.
- Focus: new features, AI coverage gaps, recent incidents, low mutation-score areas.
- Notes logged in Jira within 24 hrs; gaps feed the register within 2 business days.
Upstream requirements validation
- Refinement: SDET reviews stories for testability. Are the acceptance criteria specific enough for AI to generate tests from?
- Definition of Ready: no SDET sign-off → story doesn't start dev.
- Definition of Done: dev can't close a story without SDET QE sign-off in Jira.
Security Testing Depth
Security is integrated into the pipeline and tied to the existing security remediation program. Findings log directly into that backlog — not a separate board.
| Test type | Tool | Stage | Blocking? | Scope |
|---|---|---|---|---|
| Dependency scan | Snyk | PR | Yes (crit/high) | All dependencies; no criticals allowed; highs need a remediation plan within 48 hrs |
| SAST | SonarQube | PR | Yes | SQLi, XSS, hardcoded secrets, insecure deserialization |
| Secret scanning | GitHub Adv. Security | PR | Yes | Stripe keys, API tokens, DB credentials |
| DAST (authenticated) | OWASP ZAP | Staging | Yes (high) | All auth'd endpoints; OWASP Top 10; session, CSRF |
| API authz fuzzing | ZAP + k6 | Staging (Giving) | Yes | HMAC bypass, replay, missing auth, IDOR, scope escalation |
| Stripe webhook authz | Custom suite | Merge (Giving) | Yes | Signature validation, replay window, malformed/missing HMAC |
| Penetration test | External vendor | Quarterly | N/A | Full surface → remediation backlog |
Severity mapping: Critical = P0 (block release) · High = P1 (48 hr SLA) · Medium = P2 (next sprint) · Low = backlog. The QA Manager reviews the remediation backlog weekly and flags any item that has slipped its SLA. The monthly QA report includes a security burn-down alongside coverage.
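The severity mapping as a lookup table, with a helper for the weekly SLA review. Only High carries an hour-based SLA in the mapping, so the others use `null`; the type and function names are illustrative.

```typescript
type Severity = "critical" | "high" | "medium" | "low";

interface Triage {
  priority: "P0" | "P1" | "P2" | "backlog";
  action: string;
  blocksRelease: boolean;
  slaHours: number | null; // null where the mapping gives no hour-based SLA
}

const SEVERITY_MAP: Record<Severity, Triage> = {
  critical: { priority: "P0", action: "block release", blocksRelease: true, slaHours: null },
  high: { priority: "P1", action: "48 hr SLA", blocksRelease: false, slaHours: 48 },
  medium: { priority: "P2", action: "next sprint", blocksRelease: false, slaHours: null },
  low: { priority: "backlog", action: "backlog", blocksRelease: false, slaHours: null },
};

// True when a finding has been open longer than its hour-based SLA.
function hasSlippedSla(severity: Severity, openedAtMs: number, nowMs: number): boolean {
  const slaHours = SEVERITY_MAP[severity].slaHours;
  return slaHours !== null && nowMs - openedAtMs > slaHours * 60 * 60 * 1000;
}
```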
Toolchain & Technology Stack
One stack across all 8 teams. Deviations require Guild approval. Free / open-source first — adopt these before evaluating paid alternatives.
| Category | Tool | Notes |
|---|---|---|
| AI test generation | Claude (Code + API) | Primary test author; SDET reviews all output |
| E2E / UI | Playwright | Multi-browser; TS-first; Claude generates to this framework |
| API / contract | Supertest / REST Assured + Pact | Pact Broker hosted by Guild (self-hosted to start) |
| Unit | Jest / JUnit / pytest | Matched to each team's language stack |
| Mutation testing | Stryker (JS/TS) / PITest (Java) | Monthly; ≥70% score target by month 12 |
| Load / perf | k6 | Budgets & profiles enforced as a staging CI gate |
| Visual regression | Percy / Chromatic | Snapshot diffs on every staging deploy |
| Accessibility | axe-core + Playwright | WCAG 2.1 AA; 0 critical violations gate |
| DAST / SAST / deps | OWASP ZAP / SonarQube / Snyk | Authenticated scan + PR gates |
| Test management | Zephyr Scale | Journey register, coverage %, AI-vs-human tracking |
| Dashboards | Grafana + Allure | Coverage, escape rate, mutation score, flake, perf trends |
| CI/CD | GitHub Actions | 4-stage gate design; Guild templates |
Realistic year-1 tooling cost: ~$8k–$15k. Most of the stack is open source. Net-new paid spend is usage-based Anthropic API, Chromatic Pro, Zephyr Scale, and SonarCloud — self-host the Pact Broker and use OSS k6 to start.
Roadmap & Programme Gates
June 2026 is the placement window (PI2 begins Jun 24). Month 1 = July 2026 — full operational start. All surge targets (Day 90) are measured from July 1.
Programme gates & KPI milestones
| Milestone | When | Target |
|---|---|---|
| Day 90 — Surge target | End of Q3 2026 | Critical paths green on all 4 high-risk surfaces |
| Gate 1 — all teams live | Month 6 (Dec 2026) | ≥30% coverage; P0 suites active on 8/8 teams |
| Gate 2 — coverage build | Month 10 (Apr 2027) | ≥60% coverage; escape <10%; mutation ≥60% |
| Gate 3 — year-1 report | Month 12 (Jun 2027) | ≥70%; vendor conversion recommendation |
| Comprehensive horizon | Month 18 (Dec 2027) | ≥80%; escape <5%; flaky <2%; mutation ≥70% |
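Programme months convert to calendar months from the stated anchor, Month 1 = July 2026. A small sketch, with an illustrative function name:

```typescript
const MONTH_NAMES = [
  "Jan", "Feb", "Mar", "Apr", "May", "Jun",
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
];

// Month 1 = July 2026, so programme month n is (n - 1) months after July 2026.
function programmeMonth(n: number): string {
  if (!Number.isInteger(n) || n < 1) {
    throw new Error("programme month must be a positive integer");
  }
  const monthsFromJan2026 = 6 + (n - 1); // July is index 6
  const year = 2026 + Math.floor(monthsFromJan2026 / 12);
  return `${MONTH_NAMES[monthsFromJan2026 % 12]} ${year}`;
}
```

This reproduces the dates in the table: Month 6 is Dec 2026, Month 10 is Apr 2027, Month 12 is Jun 2027, and Month 18 is Dec 2027.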
Program increments
- PI2 (Jun 24 – Sep 1): foundation buildout + 90-day surge on all four high-risk surfaces.
- PI3 (Sep 2 – Nov 10): medium-risk teams to critical-path green; hit Gate 1.
- PI4 (Nov 11 →): coverage build toward Gate 2 and beyond.
Activating the plan — what's next
- Raise PI2 epics in Jira under Production Confidence — one per team; align with each PM + EM before raising.
- Road show with product/eng — walk each team through this plan; gather feedback.
- Wire the first P0 suites to the QA-deploy gate (see the P0 Pipeline Wiring section).
- Raise PI3 epics next cycle — medium-risk teams' P0 + start of P1 regression.
SDET Onboarding & AI Orientation
Every SDET completes the 30-day baseline track. Those without prior AI testing experience add a 3–5 day AI orientation in week 1. This is your first month — track it.
30-day baseline track
- Week 1: product walkthrough with PM · codebase overview with dev lead · dev env + Stripe sandbox setup · Guild intro · review your team's critical-path register.
- Week 2: Playwright hands-on (write one critical-path E2E manually) · CI walkthrough · Jira + Zephyr setup · shared utilities · Grafana + Allure orientation.
- Week 3: first sprint planning with your team · identify top 3 journey gaps · draft your surge coverage plan · review the Guild prompt library.
- Week 4: submit first AI-generated PR · gap analysis to Josh · participate in first Guild meeting.
3–5 day AI orientation
- Day 1–2: Claude Code + API hands-on; prompt structure; context injection with journey register + API schema.
- Day 3–4: generate a test from a real spec; compare to a manual equivalent; spot tautology tests; practice the re-prompt loop.
- Day 5: run Stryker on sample output; interpret mutation score; debrief with Josh.
Resources & references
Playwright
E2E framework: selectors, fixtures, page objects, the @tag grep we use to wire P0 suites.
playwright.dev/docs/intro
Stryker Mutator
Mutation testing for JS/TS: how mutants are generated and how to read a mutation score.
stryker-mutator.io/docs
Pact (Contract Testing)
Consumer/provider contracts, the Pact Broker, and merge-gate verification.
docs.pact.io
k6 Load Testing
Load profiles, thresholds, and the p95 regression gate enforced at staging.
k6.io/docs
OWASP ZAP
Authenticated DAST scans and API authz fuzzing for the Giving pipeline.
zaproxy.org/docs
Stripe Testing
Test card matrix, SCA/3DS test scenarios, and the webhook replay CLI.
docs.stripe.com/testing
What good looks like at 6 months: you independently own your team's critical-path suite, can articulate the risk profile of any failing test, contribute prompts to the Guild library, and run exploratory sessions without a charter from Josh.