Projects

Things I've built — AI systems, tools, and platforms.

RoleCall

Premium AI roleplay platform

A premium roleplay (RP) ecosystem for creative storytellers, co-built with Chi and Nemo. It features in-house models, a multi-agent architecture (DM story director + TV lorebook sidecar + narrator engine), 27 reactive game state trackers, inference routing across 25+ providers with E2E-encrypted BYOK, and Compendium, a typed knowledge system.

Dual-stream AI architecture: the DM plans the story and the narrator writes the prose; neither sees the other's full context
Shadow world simulation: invisible arcs, hidden events, and off-screen NPC actions that the player discovers organically
Autonomous lorebook management: the TV sidecar reads, writes, and organizes lore in real time across 4 phases
Immersion engine: combat, relationships, inventory, quests, and 20+ more trackers driven by AI directives

Next.js, TypeScript, React, Zustand, Supabase, PostgreSQL, Python, PyTorch, vLLM

VectHare

Semantic RAG for AI chat

Contributor: github.com/LeviTheWeasel/VectHare

A Retrieval-Augmented Generation (RAG) system for SillyTavern that gives AI characters long-term memory through semantic search. Instead of relying on a fixed context window, VectHare lets characters recall relevant information from hundreds of messages back. It features natural memory decay, conditional activation, and multiple vector backends.

Temporal decay: memories fade with a configurable half-life, and important scenes can be marked as immune
Conditional activation: chunks trigger on 28 emotion types, keywords, or recency, combined with AND/OR logic
Hybrid search: combines semantic similarity with keyword matching for more accurate retrieval
Multiple backends: built-in Vectra for zero-dependency setup, LanceDB for millions of vectors, Qdrant for enterprise deployments
Smart chunking: per-message, conversation-turn, batch, or per-scene grouping strategies
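
To make the temporal decay idea concrete, here is a minimal sketch of half-life scoring with an immunity flag. It is written in Python for brevity and is not VectHare's actual code (the extension itself is JavaScript). The names `MemoryChunk`, `decay_factor`, `decayed_score`, and `rank`, the default half-life of 50 messages, and the choice to multiply similarity by the decay factor are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MemoryChunk:
    """Hypothetical record for one stored memory (illustrative only)."""
    text: str
    similarity: float     # semantic similarity to the current query, in [0, 1]
    age_messages: int     # how many messages ago the chunk was written
    immune: bool = False  # immune chunks are exempt from decay


def decay_factor(age: float, half_life: float) -> float:
    """Exponential decay: the factor halves every `half_life` units of age."""
    if half_life <= 0:
        raise ValueError("half_life must be positive")
    return 0.5 ** (age / half_life)


def decayed_score(chunk: MemoryChunk, half_life: float = 50.0) -> float:
    """Scale similarity by the decay factor unless the chunk is immune."""
    if chunk.immune:
        return chunk.similarity
    return chunk.similarity * decay_factor(chunk.age_messages, half_life)


def rank(chunks: List[MemoryChunk], half_life: float = 50.0) -> List[MemoryChunk]:
    """Order chunks from most to least relevant after decay is applied."""
    return sorted(chunks, key=lambda c: decayed_score(c, half_life), reverse=True)


if __name__ == "__main__":
    memories = [
        MemoryChunk("The duel at the bridge", similarity=0.9, age_messages=100),
        MemoryChunk("Her oath to the crown", similarity=0.6, age_messages=100, immune=True),
        MemoryChunk("Small talk at the inn", similarity=0.5, age_messages=0),
    ]
    for chunk in rank(memories):
        print(f"{decayed_score(chunk):.3f}  {chunk.text}")
```

With a half-life of 50 messages, a chunk written 100 messages ago keeps a quarter of its similarity score, so an immune scene with lower raw similarity can still outrank it.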

JavaScript, Vectra, LanceDB, Qdrant, Embedding APIs

RP-Bench

Roleplay quality benchmark for LLMs

Creator: github.com/LeviTheWeasel/rp-benchmark

A specialized evaluation framework that measures roleplay quality in large language models across 26 distinct dimensions. It addresses a gap in LLM benchmarking by testing what actually matters in interactive roleplay (character consistency, agency respect, lorebook integration, prose craft, and genre-specific skills) rather than generic writing metrics.

26 evaluation dimensions across 3 tiers, from agency respect and continuity to atmospheric dread and structural comedy
Dual-judge system: Claude and GPT-4.1 score independently, and the results are cross-averaged to reduce bias
5 scenario types: completion, OOC correction, degradation, preference (A/B swipes), and consistency testing
8 synthetic seeds spanning fantasy, horror, comedy, ERP, politics, tragedy, sci-fi thriller, and slice-of-life
HuggingFace dataset published at lazyweasel/roleplay-bench
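
The dual-judge averaging can be sketched roughly as follows. This is an illustration under stated assumptions, not the benchmark's actual scoring code: the function names `cross_average`, `judge_gap`, and `overall`, the 1 to 5 score scale, and the use of a plain per-dimension mean are all hypothetical. Only the dimension names are taken from the description above.

```python
from statistics import mean
from typing import Dict


def cross_average(judge_a: Dict[str, float], judge_b: Dict[str, float]) -> Dict[str, float]:
    """Average two judges' scores dimension by dimension.

    Both judges must have scored exactly the same set of dimensions.
    """
    if judge_a.keys() != judge_b.keys():
        raise ValueError("judges scored different dimensions")
    return {dim: (judge_a[dim] + judge_b[dim]) / 2 for dim in judge_a}


def judge_gap(judge_a: Dict[str, float], judge_b: Dict[str, float]) -> Dict[str, float]:
    """Absolute disagreement per dimension, useful for spotting judge bias."""
    if judge_a.keys() != judge_b.keys():
        raise ValueError("judges scored different dimensions")
    return {dim: abs(judge_a[dim] - judge_b[dim]) for dim in judge_a}


def overall(scores: Dict[str, float]) -> float:
    """Unweighted mean across all dimensions (the weighting is an assumption)."""
    return mean(scores.values())


if __name__ == "__main__":
    # Made-up scores on an assumed 1-5 scale, for illustration only.
    claude_scores = {"agency_respect": 4.0, "continuity": 3.0}
    gpt_scores = {"agency_respect": 2.0, "continuity": 4.0}

    combined = cross_average(claude_scores, gpt_scores)
    print(combined)
    print(judge_gap(claude_scores, gpt_scores))
    print(overall(combined))
```

Tracking the per-dimension gap alongside the average makes it easy to see where the two judges systematically disagree, instead of letting the averaging hide it.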

Python, OpenRouter, Claude, GPT-4.1, Seaborn, HuggingFace