Projects

Things I've built — AI systems, tools, and platforms.

RoleCall

Premium AI roleplay platform
Founding Engineer · rolecallstudios.com

A premium RP ecosystem for creative storytellers, co-built with Chi and Nemo. Features in-house models, a multi-agent architecture (DM story director + TV lorebook sidecar + narrator engine), 27 reactive game state trackers, inference routing across 25+ providers with E2E-encrypted BYOK, and a Compendium typed knowledge system.

  • Dual-stream AI architecture — DM plans the story, narrator writes the prose, neither sees the other's full context
  • Shadow world simulation — invisible arcs, hidden events, and NPC off-screen actions the player discovers organically
  • Autonomous lorebook management — TV sidecar reads, writes, and organizes lore in real-time across 4 phases
  • Immersion engine — combat, relationships, inventory, quests, and 20+ more trackers driven by AI directives
Next.js, TypeScript, React, Zustand, Supabase, PostgreSQL, Python, PyTorch, vLLM
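
The dual-stream idea above (the DM plans, the narrator writes, and neither sees the other's full context) can be illustrated with a minimal sketch. This is not RoleCall's actual code: `DirectorState`, `Directive`, `plan_next_beat`, and `build_narrator_prompt` are hypothetical names, and the model calls are replaced by plain string formatting. The point is only that the narrator's prompt is built from a narrow directive rather than from the director's private state.

```python
from dataclasses import dataclass, field


@dataclass
class DirectorState:
    """Private planning context (hypothetical): hidden arcs the narrator never sees."""
    hidden_arcs: list[str] = field(default_factory=list)


@dataclass
class Directive:
    """The only data handed from the director to the narrator in this sketch."""
    beat: str
    tone: str


def plan_next_beat(state: DirectorState, player_action: str) -> Directive:
    # Stand-in for a model call: a real director would reason over state.hidden_arcs.
    beat = f"React to '{player_action}' and foreshadow trouble"
    return Directive(beat=beat, tone="tense")


def build_narrator_prompt(directive: Directive, recent_prose: list[str]) -> str:
    # Built from the directive and the visible prose only, never from DirectorState.
    history = "\n".join(recent_prose[-3:])
    return f"{history}\n[Beat: {directive.beat}] [Tone: {directive.tone}]"


state = DirectorState(hidden_arcs=["The innkeeper is a spy"])
directive = plan_next_beat(state, "open the cellar door")
prompt = build_narrator_prompt(directive, ["You arrive at the inn."])
print(prompt)
```

Keeping the hand-off to a small typed object is one way to make the context boundary explicit: anything not on `Directive` cannot leak into the narrator's prompt.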

VectHare

Semantic RAG for AI chat

An intelligent Retrieval-Augmented Generation system for SillyTavern that gives AI characters long-term memory through semantic search. Instead of relying on fixed context windows, VectHare lets characters recall relevant information from hundreds of messages back — featuring natural memory decay, conditional activation, and multiple vector backends.

  • Temporal decay — memories fade naturally with a configurable half-life, and important scenes can be marked as immune
  • Conditional activation — chunks trigger on 28 emotion types, keywords, or recency, combined with AND/OR logic
  • Hybrid search — combines semantic similarity with keyword matching for more accurate retrieval
  • Multiple backends — built-in Vectra for zero-dep setup, LanceDB for millions of vectors, Qdrant for enterprise
  • Smart chunking — per-message, conversation turns, batches, or per-scene grouping strategies
JavaScript, Vectra, LanceDB, Qdrant, Embedding APIs
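
As a rough illustration of the temporal decay and hybrid search bullets above, the sketch below blends a semantic score with a keyword score and then applies half-life decay. It is written in Python for brevity and is not VectHare's implementation. The function names, the `alpha` weight, the default half-life, and the choice to measure age in messages are all assumptions made for the example.

```python
def hybrid_score(semantic: float, keyword: float, alpha: float = 0.7) -> float:
    """Blend semantic similarity and keyword match (both assumed to lie in [0, 1]).

    alpha is a hypothetical weight: 1.0 means pure semantic, 0.0 pure keyword.
    """
    return alpha * semantic + (1.0 - alpha) * keyword


def decayed_score(score: float, age_messages: int, half_life: float = 50.0,
                  immune: bool = False) -> float:
    """Halve the score every `half_life` messages unless the chunk is immune."""
    if immune:
        return score
    return score * 0.5 ** (age_messages / half_life)


base = hybrid_score(semantic=0.8, keyword=0.4)            # about 0.68
print(decayed_score(base, age_messages=50))               # one half-life: about 0.34
print(decayed_score(base, age_messages=50, immune=True))  # immune: unchanged
```

With exponential decay a memory never drops to exactly zero, so a strong match from far back can still outrank a weak recent one.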

RP-Bench

Roleplay quality benchmark for LLMs

A specialized evaluation framework that measures roleplay quality in large language models across 26 distinct dimensions. Addresses a gap in LLM benchmarking by testing what actually matters in interactive roleplay: character consistency, agency respect, lorebook integration, prose craft, and genre-specific skills — not generic writing metrics.

  • 26 evaluation dimensions across 3 tiers — from agency respect and continuity to atmospheric dread and structural comedy
  • Dual-judge system — Claude and GPT-4.1 score independently, and results are cross-averaged to reduce bias
  • 5 scenario types — completion, OOC correction, degradation, preference (A/B swipes), and consistency testing
  • 8 synthetic seeds spanning fantasy, horror, comedy, ERP, politics, tragedy, sci-fi thriller, and slice-of-life
  • HuggingFace dataset published at lazyweasel/roleplay-bench
Python, OpenRouter, Claude, GPT-4.1, Seaborn, HuggingFace
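
One plausible reading of the dual-judge bullet above is a per-dimension mean of the two judges' scores. The sketch below assumes that reading; the benchmark's real aggregation may differ. The function name, the dimension keys, and the score values are made up for illustration.

```python
from statistics import mean


def cross_average(judge_a: dict[str, float], judge_b: dict[str, float]) -> dict[str, float]:
    """Average the two judges' scores for every dimension that both of them rated."""
    shared = judge_a.keys() & judge_b.keys()
    return {dim: mean([judge_a[dim], judge_b[dim]]) for dim in sorted(shared)}


# Illustrative scores only, not real benchmark results.
claude_scores = {"agency_respect": 4.0, "continuity": 3.0}
gpt_scores = {"agency_respect": 5.0, "continuity": 3.0}

combined = cross_average(claude_scores, gpt_scores)
overall = mean(list(combined.values()))
print(combined, overall)
```

Restricting the average to shared dimensions avoids silently treating a missing score from one judge as a zero.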