Methodology

Last updated: 2026-05-01
Audience: Bounty program triagers, security teams, and other researchers evaluating Vulpes Watch reports.

This page describes how the Skulk research team produces findings. The goal is to make every report cheap to triage: clear scope, deterministic reproduction, explicit severity reasoning, and structured negative controls.

1. Pipeline at a glance

Skulk runs a four-stage pipeline. Each stage has a dedicated agent persona; agents share state through a private engagement database. No stage skips ahead.

Stage	Agent	Purpose	Output artifact
Scout	Reynard	Read program scope, watchlist, and prior disclosures. Identify in-scope surfaces and high-leverage areas.	Engagement plan
Hunt	Maui	Probe the surface with structured prompt batches across Claude, GPT, and Gemini families. Capture transcripts.	Raw transcript set
Grade	Taliesin	Score transcripts against the program’s published harm taxonomy and our internal severity rubric. Cull noise.	Triaged finding draft
Watch	Varuna	Verify reproduction from clean state. Decide submit / hold / discard. Track engagement state and disclosure clock.	Submission packet

A finding reaches a vendor only after all four stages have signed off in the engagement database.
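The sign-off rule can be pictured as a small ordered gate. This is a minimal sketch, not the team's actual implementation: the stage names come from the table above, while the `Finding` class, its fields, and the method names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field

# Stage names are taken from the pipeline table; the rest is illustrative.
STAGES = ("Scout", "Hunt", "Grade", "Watch")


@dataclass
class Finding:
    """Hypothetical in-memory stand-in for a row in the engagement database."""
    finding_id: str
    signoffs: list = field(default_factory=list)

    def sign_off(self, stage: str) -> None:
        """Record a sign-off, rejecting any stage that arrives out of order."""
        if len(self.signoffs) == len(STAGES):
            raise ValueError("all stages have already signed off")
        expected = STAGES[len(self.signoffs)]
        if stage != expected:
            raise ValueError(f"expected sign-off from {expected}, got {stage}")
        self.signoffs.append(stage)

    def ready_for_vendor(self) -> bool:
        """True only when every stage has signed off, in pipeline order."""
        return tuple(self.signoffs) == STAGES
```

Because `sign_off` only accepts the next expected stage, a finding cannot skip ahead, and `ready_for_vendor` stays false until Watch has signed.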

2. Reproducibility commitments

Every submitted finding includes:

Exact model identifiers: vendor, model name, model version, and the date of the run.
Full transcript: system prompt, user prompts, assistant responses, and any tool-call traces, in chronological order. Redacted only for explicitly harmful payloads (see Research Policies §7).
Reproduction harness: a self-contained script that, given a vendor API key, reproduces the finding from a clean state. We test this from a fresh key before submission.
Negative controls: prompts that did not trigger the issue, with brief notes on why. This helps triagers localize the boundary instead of guessing.
Environment notes: any non-default parameters (temperature, system prompt, tools, max tokens) that are load-bearing for the result.

If a finding cannot be reproduced from a fresh key, it is not submitted. We do not file findings that depend on session state, account history, or undisclosed prior conversation.
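A pre-submission check along these lines would enforce the commitments above mechanically. It is a sketch under assumptions: the `ReproPacket` structure, its field names, and `blocking_issues` are hypothetical, mapped one-to-one onto the items listed in this section.

```python
from dataclasses import dataclass, field


@dataclass
class ReproPacket:
    """Hypothetical container for the items every submitted finding includes."""
    vendor: str
    model_name: str
    model_version: str
    run_date: str              # date of the run, e.g. an ISO 8601 string
    transcript: list           # chronological messages and tool-call traces
    harness_path: str          # location of the self-contained script
    negative_controls: list    # prompts that did not trigger, with notes
    environment: dict = field(default_factory=dict)  # non-default parameters
    reproduced_from_fresh_key: bool = False


def blocking_issues(packet: ReproPacket) -> list:
    """Return the reasons a packet must not be submitted (empty if none)."""
    issues = []
    for name in ("vendor", "model_name", "model_version", "run_date", "harness_path"):
        if not getattr(packet, name):
            issues.append(f"missing {name}")
    if not packet.transcript:
        issues.append("transcript is empty")
    if not packet.negative_controls:
        issues.append("no negative controls recorded")
    if not packet.reproduced_from_fresh_key:
        issues.append("not reproduced from a fresh key")
    return issues
```

A packet would be submitted only when `blocking_issues` returns an empty list. `environment` is allowed to be empty because a run that uses only default parameters has nothing to record there.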

3. Severity reasoning

We do not propose severity by gut feel. Each report includes:

Mapping to the vendor’s published bar. We cite the specific tier and the language from the vendor’s policy that the finding satisfies.
Harm-taxonomy tag. Drawn from the vendor’s published taxonomy where one exists; otherwise from a stable internal taxonomy with a glossary appended.
Realistic threat model. A short paragraph naming the attacker, the asset, the access required, and the realistic abuse path. We avoid worst-case framing when the realistic case is enough to qualify.
What this is not. A line stating which adjacent severity tiers we are explicitly not claiming, to discourage triage from over- or under-leveling.

If our proposed severity is wrong, the explicit reasoning makes it cheap for the vendor to correct rather than reject.
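The four elements of a severity claim can also be checked for completeness and self-consistency before a report goes out. The following is an illustrative sketch only: `SeverityClaim`, its field names, and the tier labels used in any example are assumptions, not part of any vendor's policy.

```python
from dataclasses import dataclass


@dataclass
class SeverityClaim:
    """Hypothetical record of the severity reasoning attached to a report."""
    vendor_tier: str     # tier name as written in the vendor's policy
    policy_quote: str    # the policy language the finding satisfies
    taxonomy_tag: str    # vendor taxonomy tag, or internal tag with glossary
    threat_model: str    # attacker, asset, access required, abuse path
    not_claiming: list   # adjacent tiers explicitly ruled out


def claim_problems(claim: SeverityClaim) -> list:
    """List gaps or contradictions in a severity claim (empty if none)."""
    problems = []
    for name in ("vendor_tier", "policy_quote", "taxonomy_tag", "threat_model"):
        if not getattr(claim, name).strip():
            problems.append(f"{name} is empty")
    if not claim.not_claiming:
        problems.append("not_claiming lists no adjacent tiers")
    if claim.vendor_tier in claim.not_claiming:
        problems.append("vendor_tier also appears in not_claiming")
    return problems
```

The last check catches the one contradiction the structure makes possible: claiming a tier while also disclaiming it.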

4. Grading rubrics

These are the internal rubrics used by the Grade stage. Vendor programs may use different ones; we map ours to theirs in the submission.

Refusal quality. Is a refusal genuine, or is it a thin layer over compliant content? We score on a 0–4 scale with anchored examples.
Harm taxonomy coverage. Probes are designed across the published taxonomy axes (for Anthropic, for example: dangerous information, illegal acts, and deception). We do not over-fit to one axis.
Boundary placement. A finding is more useful when it shows where the safe / unsafe line actually sits, not just one example past it. Probes around the boundary are a reporting requirement, not a nice-to-have.
Cross-model portability. When a probe class works on multiple vendors, we note that explicitly so vendors can coordinate.
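As one example, the cross-model portability note can be derived directly from graded results. This sketch assumes each result is a `(probe_class, vendor, triggered)` tuple; that shape and the function name are hypothetical, not the Grade stage's real data model.

```python
from collections import defaultdict


def portable_probe_classes(results):
    """Map each probe class to the vendors it triggered on, keeping only
    classes that triggered on more than one vendor.

    `results` is an iterable of (probe_class, vendor, triggered) tuples.
    """
    vendors_by_class = defaultdict(set)
    for probe_class, vendor, triggered in results:
        if triggered:
            vendors_by_class[probe_class].add(vendor)
    return {
        probe_class: sorted(vendors)
        for probe_class, vendors in vendors_by_class.items()
        if len(vendors) > 1
    }
```

Any class that appears in the returned mapping is a candidate for an explicit portability note in each affected vendor's report.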

5. Tooling stack (high level)

Orchestration: Hermes agent platform on private LXC infrastructure.
Provider plumbing: Custom Python plugin (redteam) calling Anthropic, OpenAI, and Google inference APIs over stdlib HTTP. No third-party SDKs in the probe path, to avoid SDK-specific behavior coloring results.
State store: Private Postgres schema (skulk.engagements, skulk.findings, skulk.submissions) tracking state machines for engagement → finding → submission → payout.
Memory layer: Private knowledge store for prior disclosures, watchlist updates, and program ROE notes.
Browser automation: Used only for submitting reports to vendor portals (HackerOne, Bugcrowd, MSRC, vendor researcher portals). Not used for probing.

We deliberately do not name the host infrastructure or expose internal endpoints; the stack runs entirely on private network space.
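To illustrate what calling an inference API over stdlib HTTP with no third-party SDK can look like, here is a minimal sketch using only `json` and `urllib.request`. It is not the `redteam` plugin itself: the function name is hypothetical, and the endpoint URL, authentication headers, and payload fields differ by vendor, so the caller supplies them.

```python
import json
import urllib.request


def build_probe_request(url: str, headers: dict, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request without sending it.

    `url`, `headers`, and `payload` are caller-supplied because each vendor
    defines its own endpoint, auth header, and request schema.
    """
    body = json.dumps(payload).encode("utf-8")
    all_headers = {"Content-Type": "application/json", **headers}
    return urllib.request.Request(url, data=body, headers=all_headers, method="POST")
```

Sending is a separate step, for example `urllib.request.urlopen(req, timeout=60)`. Keeping request construction apart from sending makes it possible to log the exact bytes that go on the wire into the transcript.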

6. What we do not do

No automated mass-fuzzing against production endpoints. Probe rate is human-paced, with explicit max_parallel caps in the configuration. We are happy to share our rate-limit configuration with a triager on request.
No scraping of other researchers’ disclosures to derive submissions. If we use a public technique, we cite it and credit it.
No “spray and pray” submissions. A vendor receives one report per discrete finding, not a packet of related ones grouped for headline impact.
No retrofitting severity to maximize payout. Our severity claim is the same as our internal severity score; we do not inflate.
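The max_parallel cap mentioned in the first item above can be sketched with a bounded thread pool. Only the name `max_parallel` comes from this page; `run_probes`, the `send` callable, and the default value are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def run_probes(probes, send, max_parallel=2):
    """Send each probe with at most `max_parallel` requests in flight.

    `send` is a caller-supplied function that takes one probe and returns
    its response. Results come back in the same order as `probes`.
    """
    if max_parallel < 1:
        raise ValueError("max_parallel must be at least 1")
    # The pool never runs more than max_parallel workers at once, so that
    # value is a hard ceiling on concurrent calls to `send`.
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(send, probes))
```

Setting `max_parallel=1` degrades to strictly sequential probing. A pool-size cap bounds concurrency but not requests per second, so a separate delay would still be needed to enforce a rate.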

7. How vendors can verify all of this

On request to [email protected] we can share:

A redacted sample probe transcript showing the full evidence chain.
A redacted engagement timeline showing how a finding moved through the four stages.
The schema of our internal state store (DDL only; no data).
The exact prompts we used for a previously disclosed finding, after the embargo lifts.

We will not share:

Live engagement state or pre-disclosure findings.
Other vendors’ findings or correspondence.
Internal infrastructure details beyond what is on this page.

8. Updates

This page is versioned. Material methodology changes are announced in the next submission to active programs.