Methodology

Last updated: 2026-05-01
Audience: Bounty program triagers, security teams, and other researchers evaluating Vulpes Watch reports.

This page describes how the Skulk research team produces findings. The goal is to make every report cheap to triage: clear scope, deterministic reproduction, explicit severity reasoning, and structured negative controls.


1. Pipeline at a glance

Skulk runs a four-stage pipeline. Each stage has a dedicated agent persona; agents share state through a private engagement database. No stage skips ahead.

Stage | Agent | Purpose | Output artifact
Scout | Reynard | Read program scope, watchlist, and prior disclosures. Identify in-scope surfaces and high-leverage areas. | Engagement plan
Hunt | Maui | Probe the surface with structured prompt batches across Claude, GPT, and Gemini families. Capture transcripts. | Raw transcript set
Grade | Taliesin | Score transcripts against the program’s published harm taxonomy and our internal severity rubric. Cull noise. | Triaged finding draft
Watch | Varuna | Verify reproduction from clean state. Decide submit / hold / discard. Track engagement state and disclosure clock. | Submission packet

A finding only reaches a vendor after all four stages have signed off in the engagement database.
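
The sign-off rule can be pictured as an ordered gate. The following is a minimal sketch, not the team's actual implementation: only the stage names come from this page, while the `Finding` class, its fields, and its methods are hypothetical and stand in for whatever the engagement database records.

```python
from dataclasses import dataclass, field

# Stage names are taken from the pipeline table; everything else is hypothetical.
STAGES = ("Scout", "Hunt", "Grade", "Watch")


@dataclass
class Finding:
    finding_id: str
    signed_off: list = field(default_factory=list)

    def sign_off(self, stage: str) -> None:
        """Record a sign-off, refusing any stage that arrives out of order."""
        if len(self.signed_off) == len(STAGES):
            raise ValueError("all stages have already signed off")
        expected = STAGES[len(self.signed_off)]
        if stage != expected:
            raise ValueError(f"expected sign-off from {expected}, got {stage}")
        self.signed_off.append(stage)

    def ready_to_submit(self) -> bool:
        """True only once every stage has signed off, in pipeline order."""
        return tuple(self.signed_off) == STAGES
```

Because `sign_off` rejects anything other than the next expected stage, a finding cannot reach `ready_to_submit()` by skipping ahead.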

2. Reproducibility commitments

Every submitted finding includes:

If a finding cannot be reproduced from a fresh key, it is not submitted. We do not file findings that depend on session state, account history, or undisclosed prior conversation.
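
One way to read the fresh-state rule is as a check that builds a new session for every attempt. This is a sketch under assumptions: `make_session`, `replay`, and the default of three attempts are all hypothetical names and values chosen for illustration, and no real vendor API is implied.

```python
from typing import Callable


def reproduces_from_clean_state(
    make_session: Callable[[], object],
    replay: Callable[[object], bool],
    attempts: int = 3,  # hypothetical default, not a documented threshold
) -> bool:
    """Return True only if the finding reproduces on every fresh session.

    make_session must create brand-new state (for example, a fresh key with
    no history); replay runs the recorded steps against that state and
    reports whether the finding appeared.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    # A new session per attempt means no state carries over between runs.
    return all(replay(make_session()) for _ in range(attempts))
```

A finding that only reproduces when a session is reused would fail this check, which matches the rule of not filing findings that depend on session state.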

3. Severity reasoning

We do not propose severity by gut feel. Each report includes:

  1. Mapping to the vendor’s published bar. We cite the specific tier and the language from the vendor’s policy that the finding satisfies.
  2. Harm-taxonomy tag. Drawn from the vendor’s published taxonomy where one exists; otherwise from a stable internal taxonomy with a glossary appended.
  3. Realistic threat model. A short paragraph naming the attacker, the asset, the access required, and the realistic abuse path. We avoid worst-case framing when the realistic case is enough to qualify.
  4. What this is not. A line stating which adjacent severity tiers we are explicitly not claiming, so that triage neither over- nor under-levels the finding.

If our proposed severity is wrong, the explicit reasoning makes it cheap for the vendor to correct rather than reject.
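
The four parts of the severity reasoning can be treated as a record that must be complete before a report goes out. The sketch below is illustrative only: the class name, field names, and the `missing_parts` helper are hypothetical, chosen to mirror the four numbered items above.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class SeverityReasoning:
    # One field per numbered item in the list above; names are hypothetical.
    vendor_tier_mapping: str  # cited tier and policy language
    harm_taxonomy_tag: str    # vendor taxonomy tag, or internal tag with glossary
    threat_model: str         # attacker, asset, access required, abuse path
    not_claiming: str         # adjacent tiers explicitly not claimed

    def missing_parts(self) -> list:
        """Return the names of any parts left empty or whitespace-only."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]
```

A draft whose `missing_parts()` list is non-empty would be held back until each part is written out.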

4. Grading rubrics

Internal rubrics used by the Grade stage. Vendor programs may use different ones; we map ours to theirs in the submission.

5. Tooling stack (high level)

We deliberately do not name the host infrastructure or expose internal endpoints; the stack runs entirely on private network space.

6. What we do not do

7. How vendors can verify all of this

On request to [email protected] we can share:

We will not share:

8. Updates

This page is versioned. Material methodology changes are announced in the next submission to active programs.