Featured project
At a glance: Structured tool use over real data, with evals so regressions show up as failing rows—not just a prettier UI.
Streaming chat where Claude calls seven structured resume tools (get_role, search_resume, list_projects_by_skill, get_metrics, list_recent_shipped, get_narrative, get_faq). Tool calls show up in the transcript. Evals hit the same HTTP endpoint; results power the pass/fail table on the project page.
The chat backend runs a bounded tool-use loop instead of one-shot prompting. Each round is a streamed Anthropic request; the server intercepts tool_use blocks, runs the handler, appends the tool_result, and continues until the model stops asking for tools or we hit the round cap.
The optional Reflection:PlannerEnabled setting in the API config appends a short planning instruction to the system prompt (as a second, uncached block when prompt caching is on) so the model states which tools it intends to use before the first tool call. Enable it for demos; it defaults to off in production configs.
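A minimal sketch of how the planner toggle could shape the system prompt, assuming the system prompt is sent as an array of text blocks. The `cache_control` marker follows the Anthropic Messages API block format; `buildSystemBlocks`, `ReflectionConfig`, and the instruction wording are hypothetical names and text chosen for illustration, not the project's actual code.

```typescript
// Only "type", "text", and "cache_control" mirror the Messages API block format.
interface SystemBlock {
  type: "text";
  text: string;
  cache_control?: { type: "ephemeral" };
}

// Hypothetical stand-in for the Reflection:PlannerEnabled config value.
interface ReflectionConfig {
  plannerEnabled: boolean;
}

// Assumed wording; the real planning instruction is not shown in this write-up.
const PLANNER_INSTRUCTION =
  "Before your first tool call, state briefly which tools you intend to use.";

function buildSystemBlocks(
  basePrompt: string,
  config: ReflectionConfig,
  promptCaching: boolean,
): SystemBlock[] {
  // The stable base prompt is the block that carries the cache marker.
  const base: SystemBlock = { type: "text", text: basePrompt };
  if (promptCaching) {
    base.cache_control = { type: "ephemeral" };
  }
  const blocks: SystemBlock[] = [base];

  // The planner text goes in a second, unmarked block, so the cached
  // prefix is identical whether the planner is on or off.
  if (config.plannerEnabled) {
    blocks.push({ type: "text", text: PLANNER_INSTRUCTION });
  }
  return blocks;
}
```

Keeping the optional text after the cache marker means a demo config and a production config can share the same cached base prompt.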
Estimated token usage and cost for the demo update live under the chat in two summary cards (session total + latest reply).
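The two cards can be sketched as a per-reply estimate plus a running sum. The `Usage` field names follow the usage object returned by the Anthropic Messages API; the `Rates` values are supplied by the caller, and `estimateReply` and `addToSession` are hypothetical helper names, so treat this as an illustration of the arithmetic rather than the page's actual code.

```typescript
// Token counts as reported in the Messages API "usage" object.
interface Usage {
  input_tokens: number;
  output_tokens: number;
  cache_creation_input_tokens?: number;
  cache_read_input_tokens?: number;
}

// USD per million tokens. Callers pass these in; nothing is hard-coded here.
interface Rates {
  input: number;
  output: number;
  cacheWrite: number;
  cacheRead: number;
}

interface Estimate {
  inputTokens: number; // includes cache writes and cache reads
  outputTokens: number;
  costUsd: number;
}

const EMPTY_SESSION: Estimate = { inputTokens: 0, outputTokens: 0, costUsd: 0 };

// Estimate for a single assistant reply (the "latest reply" card).
function estimateReply(usage: Usage, rates: Rates): Estimate {
  const cacheWrite = usage.cache_creation_input_tokens ?? 0;
  const cacheRead = usage.cache_read_input_tokens ?? 0;
  const costUsd =
    (usage.input_tokens * rates.input +
      usage.output_tokens * rates.output +
      cacheWrite * rates.cacheWrite +
      cacheRead * rates.cacheRead) /
    1_000_000;
  return {
    inputTokens: usage.input_tokens + cacheWrite + cacheRead,
    outputTokens: usage.output_tokens,
    costUsd,
  };
}

// Running total across replies (the "session total" card).
function addToSession(session: Estimate, reply: Estimate): Estimate {
  return {
    inputTokens: session.inputTokens + reply.inputTokens,
    outputTokens: session.outputTokens + reply.outputTokens,
    costUsd: session.costUsd + reply.costUsd,
  };
}
```

Clearing the chat would reset the running total to `EMPTY_SESSION`, while the latest-reply estimate is simply replaced on each send.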
Conversation + tools out
User message and seven tool schemas (get_role, search_resume, list_projects_by_skill, get_metrics, list_recent_shipped, get_narrative, get_faq) are sent to the Anthropic Messages API with streaming enabled.
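As a rough illustration of this step, the request body might look like the sketch below. The top-level fields (`model`, `max_tokens`, `stream`, `tools`, `messages`) and the `name` / `description` / `input_schema` tool shape follow the Anthropic Messages API. The `company` parameter on `get_role`, the placeholder schemas for the other six tools, and the `max_tokens` value are assumptions, since the real schemas are not shown here.

```typescript
interface ToolSchema {
  name: string;
  description: string;
  input_schema: {
    type: "object";
    properties: Record<string, unknown>;
    required?: string[];
  };
}

// The seven tool names listed in the write-up.
const TOOL_NAMES = [
  "get_role",
  "search_resume",
  "list_projects_by_skill",
  "get_metrics",
  "list_recent_shipped",
  "get_narrative",
  "get_faq",
] as const;

// Hypothetical schema: the "company" parameter is assumed for illustration.
const getRoleSchema: ToolSchema = {
  name: "get_role",
  description: "Look up one role on the resume by company name.",
  input_schema: {
    type: "object",
    properties: { company: { type: "string" } },
    required: ["company"],
  },
};

// Placeholder for tools whose real schemas are not shown in this write-up.
function placeholderTool(name: string): ToolSchema {
  return {
    name,
    description: `Resume tool ${name}`,
    input_schema: { type: "object", properties: {} },
  };
}

const tools: ToolSchema[] = TOOL_NAMES.map((name) =>
  name === "get_role" ? getRoleSchema : placeholderTool(name),
);

// Builds the JSON body for one streamed round.
function buildRequest(model: string, userMessage: string, toolList: ToolSchema[]) {
  return {
    model,
    max_tokens: 1024, // assumed cap, not taken from the project
    stream: true,
    tools: toolList,
    messages: [{ role: "user" as const, content: userMessage }],
  };
}
```

The model identifier is passed in rather than hard-coded because the write-up does not say which model the demo uses.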
Stream events out
Text deltas are forwarded to the browser as NDJSON. tool_use blocks accumulate partial JSON until they close, then we emit a tool_call event.
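The accumulation described in this step can be sketched as follows. The event names (`content_block_start`, `content_block_delta`, `content_block_stop`) and delta types (`text_delta`, `input_json_delta` with `partial_json`) follow the Anthropic streaming format; the shape of the emitted NDJSON lines and the `toNdjson` helper are assumptions for illustration.

```typescript
type StreamEvent =
  | {
      type: "content_block_start";
      index: number;
      content_block: { type: "text" } | { type: "tool_use"; id: string; name: string };
    }
  | {
      type: "content_block_delta";
      index: number;
      delta:
        | { type: "text_delta"; text: string }
        | { type: "input_json_delta"; partial_json: string };
    }
  | { type: "content_block_stop"; index: number };

interface PendingTool {
  id: string;
  name: string;
  json: string; // partial JSON fragments concatenated so far
}

// Converts upstream events into NDJSON lines for the browser.
function toNdjson(events: StreamEvent[]): string[] {
  const pending = new Map<number, PendingTool>();
  const lines: string[] = [];
  for (const ev of events) {
    if (ev.type === "content_block_start") {
      if (ev.content_block.type === "tool_use") {
        pending.set(ev.index, { id: ev.content_block.id, name: ev.content_block.name, json: "" });
      }
    } else if (ev.type === "content_block_delta") {
      if (ev.delta.type === "text_delta") {
        // Text is forwarded immediately, one line per delta.
        lines.push(JSON.stringify({ type: "text", delta: ev.delta.text }));
      } else {
        // Tool input arrives as fragments that are not valid JSON on their own.
        const tool = pending.get(ev.index);
        if (tool) tool.json += ev.delta.partial_json;
      }
    } else {
      // Block closed: parse the full input only now and emit one tool_call line.
      const tool = pending.get(ev.index);
      if (tool) {
        pending.delete(ev.index);
        const input: unknown = tool.json === "" ? {} : JSON.parse(tool.json);
        lines.push(JSON.stringify({ type: "tool_call", id: tool.id, name: tool.name, input }));
      }
    }
  }
  return lines;
}
```

Parsing only when the block closes avoids calling `JSON.parse` on a half-received argument object.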
Run handler against resume data
ResumeTools resolves the call against the structured Data/resume.json. The result becomes a tool_result event in the same stream.
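The real handlers live in C#, so the following is only a TypeScript sketch of the dispatch idea under stated assumptions: the `Resume` shape, the `company` argument, and the `runTool` helper are hypothetical. The `tool_result` block fields (`tool_use_id`, `content`, `is_error`) follow the Anthropic tool-use format.

```typescript
// Hypothetical slice of resume.json; the real file's shape is not shown here.
interface Role {
  id: string;
  company: string;
  title: string;
}
interface Resume {
  roles: Role[];
}

interface ToolCall {
  id: string;
  name: string;
  input: Record<string, unknown>;
}

interface ToolResult {
  type: "tool_result";
  tool_use_id: string;
  content: string;
  is_error: boolean;
}

type Handler = (resume: Resume, input: Record<string, unknown>) => unknown;

// Assumed behavior: case-insensitive match on the company name.
const getRole: Handler = (resume, input) => {
  const company = String(input["company"] ?? "").trim().toLowerCase();
  if (company === "") return null;
  return resume.roles.find((r) => r.company.toLowerCase().includes(company)) ?? null;
};

// Only one of the seven handlers is sketched here.
const handlers = new Map<string, Handler>([["get_role", getRole]]);

// Resolves one tool call against the data and wraps the outcome as a tool_result.
function runTool(resume: Resume, call: ToolCall): ToolResult {
  const handler = handlers.get(call.name);
  if (!handler) {
    return {
      type: "tool_result",
      tool_use_id: call.id,
      content: JSON.stringify({ error: `unknown tool: ${call.name}` }),
      is_error: true,
    };
  }
  return {
    type: "tool_result",
    tool_use_id: call.id,
    content: JSON.stringify(handler(resume, call.input)),
    is_error: false,
  };
}
```

Returning an error result for an unknown tool name, instead of throwing, lets the model see the failure and recover within the same conversation.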
Loop until done
If the round ends with stop_reason='tool_use', the conversation gets the assistant's tool_use blocks + a user tool_result, and a new round begins. Otherwise the final text answer streams and we emit done.
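The loop as a whole can be sketched like this. It is a simplified, synchronous illustration: in the real service each round is a streamed request, and `runLoop`, `callModel`, and the round cap value are hypothetical. The `stop_reason` values and the `tool_use` / `tool_result` block shapes follow the Anthropic Messages API.

```typescript
type TextBlock = { type: "text"; text: string };
type ToolUse = { type: "tool_use"; id: string; name: string; input: Record<string, unknown> };
type ToolResultBlock = { type: "tool_result"; tool_use_id: string; content: string };

type Message =
  | { role: "user"; content: string | ToolResultBlock[] }
  | { role: "assistant"; content: (TextBlock | ToolUse)[] };

interface Round {
  stop_reason: "tool_use" | "end_turn" | "max_tokens" | "stop_sequence";
  content: (TextBlock | ToolUse)[];
}

interface LoopOutcome {
  rounds: number;
  hitCap: boolean;
  messages: Message[];
}

function runLoop(
  initial: Message[],
  callModel: (messages: Message[]) => Round, // stand-in for one streamed request
  runTool: (call: ToolUse) => string, // stand-in for the resume tool handlers
  maxRounds: number,
): LoopOutcome {
  const convo: Message[] = [...initial];
  for (let round = 1; round <= maxRounds; round++) {
    const result = callModel(convo);
    convo.push({ role: "assistant", content: result.content });

    // Any stop reason other than tool_use ends the loop with the final answer.
    if (result.stop_reason !== "tool_use") {
      return { rounds: round, hitCap: false, messages: convo };
    }

    // Answer every tool_use block in a single user message, then go again.
    const toolResults = result.content
      .filter((block): block is ToolUse => block.type === "tool_use")
      .map((call): ToolResultBlock => ({
        type: "tool_result",
        tool_use_id: call.id,
        content: runTool(call),
      }));
    convo.push({ role: "user", content: toolResults });
  }
  // The cap bounds cost and latency if the model keeps asking for tools.
  return { rounds: maxRounds, hitCap: true, messages: convo };
}
```

What the service does when the cap is reached is not described in this write-up; the sketch just reports it through `hitCap`.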
Use a starter prompt if you want a guaranteed tool call, then try your own questions. Expand a pill to see the arguments and JSON returned.
Session total: cumulative for this chat; clears when you tap Clear.
Latest reply: one assistant response; resets on your next send.
Rough estimates from published token rates — not billing. Input includes prompt cache where applicable.
Every row is a deterministic check against the same POST /internal/chat-evals stream as the live chat (usage on, daily budget off). Expand a row to see the tool calls, criteria breakdown, and a transcript preview of the assistant’s actual reply.
Last run · May 13, 2026, 8:33 PM · API http://localhost:5063 · 10/11 green · 1 known-fail suite
| Case | Status | Criteria | Notes |
|---|---|---|---|
| role-lookup-forvis | Pass | 2/2 checks | — |
| recent-projects | Pass | 2/2 checks | — |
| skill-search-websocket | Pass | 2/2 checks | — |
| metric-yoe | Pass | 1/1 checks | — |
| narrative-career-change | Pass | 2/2 checks | — |
| faq-mcp-server | Pass | 2/2 checks | — |
| freeform-search-restaurant | Pass | 2/2 checks | — |
| contact-must-be-canonical | Pass | 3/3 checks | — |
| out-of-scope-refusal | Pass | 2/2 checks | — |
| multi-tool-blend | Pass | 2/2 checks | — |
| known-fail-ambiguous-skill | Known fail | 0/1 checks | Intentional known-fail: vague 'databases' question should ideally ask which engine or use case before asserting skills. The assistant often answers from resume keywords instead; this row tracks that gap rather than hiding it. |

Details by case

role-lookup-forvis
Question: What did you do at Forvis Mazars? Summarize your role in one short paragraph.
Assistant transcript (truncated): At Forvis Mazars I've been a Full Stack Software Engineer since June 2023, working remotely on production web applications and APIs that serve enterprise users. I own the full stack end-to-end — from React/TypeScript/Next.js UI work through C# and .NET backend APIs, Azure cloud deployments, and production observability with Grafana and Application Insights. Some highlights: I independently architected and shipped a real-time collaborative Planning Poker platform (including WebSocket infrastructure), established component and API integration patterns that were adopted across multiple teams, and built out CI/CD pipelines to keep releases safe and fast — all while mentoring junior engineers and collaborating closely with product and business stakeholders.

recent-projects
Question: What are my three most recent shipped projects? List them briefly.
Assistant transcript (truncated): Here are my three most recently shipped projects: 1. **Zach.dev Portfolio Assistant** (2026) — The agentic chat you're using right now. A Next.js + TypeScript frontend backed by an ASP.NET Core streaming API, wired up to Anthropic with request validation and abuse safeguards. Source is public at [github.com/Calathea-Z/portfolio_ai](https://github.com/Calathea-Z/portfolio_ai). 2. **Portfolio MCP Resume Server** (2026) — A stdio Model Context Protocol server that exposes the same seven resume tools powering this chat to Claude Desktop and other MCP clients. Parity-tested in .NET so the tool contracts stay aligned across both surfaces. 3. **Planning Poker** (2024, internal at Forvis Mazars) — A real-time distributed estimation platform built from scratch using React, .NET, WebSockets, and …

skill-search-websocket
Question: Which projects on my resume used WebSocket?
Assistant transcript (truncated): Just one project in my resume data that used WebSocket: **Planning Poker**, an internal real-time estimation platform I built at Forvis Mazars in 2024. The idea was to replace the manual, friction-heavy coordination that distributed delivery teams dealt with during estimation sessions. I shipped it from 0 to 1 — including the WebSocket infrastructure and backend services — and it ended up being adopted across Forvis Mazars software teams for live, distributed planning poker sessions. The stack was **React, TypeScript, .NET, WebSocket, and Azure**. It's internal tooling, so there's no public demo URL, but happy to talk through the architecture if you're curious — just reach out at zsykes21@gmail.com or on [LinkedIn](https://www.linkedin.com/in/zach-sykes/).

metric-yoe
Question: How many years of software engineering experience does your resume headline claim? Just answer with the number from the resume.
Assistant transcript (truncated): **5 years.**

narrative-career-change
Question: Why did you switch from kitchens to software engineering? Keep it short.
Assistant transcript (truncated): After a decade in kitchens, I wanted something more mentally demanding — and the work-life balance had genuinely run its course. I quit, sold my car to cover the income gap, and went full-time through General Assembly's bootcamp. The self-directed hours outside of class — docs, tutorials, side experiments — are what actually made it stick. Turns out a lot of the instincts transfer: fast decisions, tight systems, forecasting problems before they blow up. Different artifacts, same fundamentals.

faq-mcp-server
Question: What is the MCP resume server on your portfolio?
Assistant transcript (truncated): Great question! I built a Node.js stdio server using the **Model Context Protocol (MCP) SDK**. It exposes the exact same seven resume-backed tools that power this chat — so if you're using **Claude Desktop** (or any other MCP host), you can point it at my server and query my resume data directly from there, no separate HTTP API needed. A few things I'm particularly happy about with the implementation: - **Single source of truth** — both this chat and the MCP server run against the same `resume.json` file, so nothing gets out of sync. - **Schema validation in CI** — the input schemas live in `mcp/schemas` and are checked against C# `ResumeToolInputSchemas` constants, so the wire contract can't drift quietly between the two surfaces. - Install notes are in the repo's `mcp/README.md`, and t…

freeform-search-restaurant
Question: Search my resume for anything about restaurants or Asheville.
Assistant transcript (truncated): Let me search both of those for you right now! Here's what my resume data has on those topics: - **"Restaurant"** as a keyword doesn't surface anything directly — that specific word isn't used in the structured data. - **"Asheville"** hits one role: **Asheville Pizza and Brewing Company**, where I worked from 2012 to 2022. I came up through the ranks from line cook to Kitchen Manager, running daily operations in a high-volume environment and leading a 40+ person team — everything from hiring and onboarding to scheduling and performance management. That decade of kitchen leadership is actually a big part of my career story. If you're curious about how it connects to my transition into software engineering, I'm happy to pull that narrative up!

contact-must-be-canonical
Question: What is the best email to reach you at for a hiring conversation?
Tool calls: none (model answered without calling a resume tool)
Assistant transcript (truncated): The best email to reach me at is **zsykes21@gmail.com** — feel free to drop me a note there for any hiring conversation. I'd also be happy to connect on LinkedIn at https://www.linkedin.com/in/zach-sykes/ if that's easier. Looking forward to chatting!

out-of-scope-refusal
Question: In your opinion, what is the best React state management library in 2026 and why? Give a ranked list with pros and cons.
Tool calls: none (model answered without calling a resume tool)
Assistant transcript (truncated): That one's outside the scope of what I can help with here — this assistant only covers Zach's portfolio and resume-backed facts, so general tech opinions and ranked comparisons aren't something I'll tackle. If you're curious about React specifically as it relates to Zach's work, you could ask something like: **"Which of Zach's projects use React, and how did he apply it?"** — that I can dig into with real data.

multi-tool-blend
Question: What did you ship in 2024 while at Forvis Mazars? Name the project and one concrete outcome.
Assistant transcript (truncated): Let me pull that up from my resume data. The standout 2024 ship at Forvis Mazars was **Planning Poker** — a real-time estimation platform I built from 0 to 1 for our distributed delivery teams. One concrete outcome: it was **adopted across Forvis Mazars software teams** as the standard tool for real-time distributed estimation sessions, replacing the manual coordination we'd been doing before. Under the hood I architected the full thing — WebSocket infrastructure, backend services, and a React/TypeScript front end — independently, on top of internal infrastructure.

known-fail-ambiguous-skill
Question: Do I know about databases?
Assistant transcript (truncated): Let me pull that from my resume data. Honestly, the resume data I'm pulling from here doesn't surface explicit database technology entries — no specific mentions of PostgreSQL, MongoDB, SQL, Prisma, or similar tools come back from any of the searches. That doesn't necessarily mean I haven't worked with databases, but I can't make a claim about it that isn't grounded in what the tools return. If you're evaluating fit and database experience is important for the role, I'd encourage reaching out directly — I'm happy to talk through the specifics in a real conversation: - **Email:** zsykes21@gmail.com - **LinkedIn:** https://www.linkedin.com/in/zach-sykes/
Known-fail (intentional)
known-fail-ambiguous-skill — “Do I know about databases?”
Intentional known-fail: vague 'databases' question should ideally ask which engine or use case before asserting skills. The assistant often answers from resume keywords instead; this row tracks that gap rather than hiding it.
Keeping the failing row visible is part of the discipline. The next step would be prompt or tool-routing tweaks so that vague skill questions reliably trigger a clarifying turn before the assistant asserts depth.
Design note
search_resume is substring matching, not embeddings
The natural-language search tool on the résumé, search_resume, is a case-insensitive substring match across structured fields. No vector store, no embedding API. For a single résumé-sized JSON file, embeddings mostly add latency, cost, and another dependency without changing the answer quality in a meaningful way.
The corpus is one résumé: a handful of roles, a handful of projects, a few FAQ entries, and a short career-change narrative. The total searchable text is well under 10 KB. At that size every query can scan the whole document quickly; a dedicated vector index would mostly be overhead, and an embedding call per question would not buy much relevance.
Query expansion and paraphrasing still happen in the model. The system prompt nudges Claude to translate vague questions into concrete search terms when needed—so if a recruiter asks where did Zach work before software, the model can search for kitchen or restaurant. The tool itself stays a fast substring pass over structured fields.
Substring matching also keeps citations honest. Each hit comes back as { kind, id, matchedFields } — not a similarity score and a chunk of text. The model can say it matched on the tech field of a specific role and the grounding stays tight. With cosine similarity that linkage gets fuzzy: the model gets a nearest-neighbor blob and has to guess what about it matched.
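A minimal sketch of this kind of search, assuming each résumé entry is flattened into named string fields. The `{ kind, id, matchedFields }` hit shape comes from the description above; the `Entry` type, the field names, and the `searchResume` function are hypothetical stand-ins for the C# implementation.

```typescript
// Hypothetical flattened view of one résumé entry.
interface Entry {
  kind: "role" | "project" | "faq" | "narrative";
  id: string;
  fields: Record<string, string>;
}

interface Hit {
  kind: Entry["kind"];
  id: string;
  matchedFields: string[];
}

// Case-insensitive substring scan over every field of every entry.
function searchResume(entries: Entry[], query: string): Hit[] {
  const needle = query.trim().toLowerCase();
  if (needle === "") return [];

  const hits: Hit[] = [];
  for (const entry of entries) {
    const matchedFields = Object.keys(entry.fields).filter((name) =>
      (entry.fields[name] ?? "").toLowerCase().includes(needle),
    );
    if (matchedFields.length > 0) {
      hits.push({ kind: entry.kind, id: entry.id, matchedFields });
    }
  }
  return hits;
}

const sampleEntries: Entry[] = [
  {
    kind: "project",
    id: "planning-poker",
    fields: {
      name: "Planning Poker",
      tech: "React, TypeScript, .NET, WebSocket, Azure",
    },
  },
];

console.log(searchResume(sampleEntries, "websocket"));
// → [ { kind: 'project', id: 'planning-poker', matchedFields: [ 'tech' ] } ]
console.log(searchResume(sampleEntries, "websokcet"));
// → []
```

Because each hit names the fields that matched, the model can cite the `tech` field of `planning-poker` directly. The second call shows the cost of the approach: a misspelled query returns nothing.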
The tradeoff: substring matching is brittle to typos and won't catch synonyms that share no text with the data, so “PMP-style work” will never hit “project management”. For a one-résumé corpus that's a non-issue; for a 10,000-document knowledge base it would be the wrong call. Pick tooling that matches how much data you actually have.
Where to find the main pieces:
- api/Portfolio.Api/Services/AnthropicStreamService.cs: the streaming tool-use loop.
- api/Portfolio.Api/Services/ResumeTools.cs: the seven tool handlers.
- api/Portfolio.Api/Data/resume.json: the structured resume the tools query.
- web/components/ToolCallPill.tsx: the inline tool-call UI.
- api/Portfolio.Api/Controllers/ChatEvalsController.cs: internal endpoint; same orchestration as live chat, with X-Eval-Key auth and a 1 MiB body cap.
- evals/cases.json: deterministic test cases (questions plus must_contain / must_contain_any / expected_tool_calls criteria).
- evals/run.mjs: NDJSON-aware runner; posts each case to the endpoint and writes results.json.
- web/components/AgenticChatEvalsSection.tsx: the table + per-case expandable details rendered on this page.
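To make the three criteria named for evals/cases.json concrete, here is a sketch of how a deterministic scorer could evaluate them. The field names `must_contain`, `must_contain_any`, and `expected_tool_calls` come from the list above; their exact semantics (case-insensitive matching, one check per listed string or tool) and the `scoreCase` function are assumptions, and the real runner in evals/run.mjs may differ.

```typescript
interface EvalCase {
  id: string;
  question: string;
  must_contain?: string[]; // assumed: every string must appear in the reply
  must_contain_any?: string[]; // assumed: at least one string must appear
  expected_tool_calls?: string[]; // assumed: every listed tool must have been called
}

interface Check {
  name: string;
  passed: boolean;
}

// Scores one reply with plain string and set checks, no model in the loop.
function scoreCase(testCase: EvalCase, reply: string, toolCalls: string[]): Check[] {
  const text = reply.toLowerCase();
  const checks: Check[] = [];

  for (const phrase of testCase.must_contain ?? []) {
    checks.push({
      name: `must_contain: ${phrase}`,
      passed: text.includes(phrase.toLowerCase()),
    });
  }

  const anyOf = testCase.must_contain_any ?? [];
  if (anyOf.length > 0) {
    checks.push({
      name: "must_contain_any",
      passed: anyOf.some((phrase) => text.includes(phrase.toLowerCase())),
    });
  }

  for (const tool of testCase.expected_tool_calls ?? []) {
    checks.push({
      name: `expected_tool_call: ${tool}`,
      passed: toolCalls.includes(tool),
    });
  }
  return checks;
}

// Formats the "2/2 checks" style summary used in the table.
function summarize(checks: Check[]): string {
  const passed = checks.filter((check) => check.passed).length;
  return `${passed}/${checks.length} checks`;
}
```

Since every check is a string or membership test, the same reply always gets the same score, which is what lets a regression show up as a failing row.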