Hi everyone 👋
I've been working on the cost structure of the WWTP LLM Defense Benchmark for a while. At 1,080 API calls over 30 runs and an estimated $30–40 per model, comparing multiple models quickly becomes too expensive to scale. This report analyzes how I addressed that problem at the design stage.
Setup: 30 runs of 36 simulated hours each (1,080 API calls in total) on Gemini 2.5 Flash. Observed total cost: $3.74. Estimated cost without optimizations: $30–40. That is a roughly 8–11x (~10x) reduction from 6 architectural strategies embedded at design time.
The counterintuitive finding:
→ Output tokens were only 4.3% of all tokens but accounted for 73.5% of total cost. The effective output/input cost-per-token ratio was about 61x. A single prompt-level instruction ("respond in 1-2 sentences + JSON") saved ~$7.17, more than all five input-side optimizations combined.
→ The biggest input-side lever was conversation windowing: segmenting 36 hours into 6-hour windows reduced input tokens by ~67%. Without it, the 36th API call would carry ~18K accumulated input tokens.
→ None of the 6 strategies removes any SCADA data, security procedures, or decision fields. The principle: reduce token overhead without reducing information content. Cost optimization and benchmark validity can coexist rather than compete.
→ All strategies were embedded during design, not patched after. Post-hoc cost reduction risks invalidating existing results. Design-time optimization appears far more sustainable.
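To make the first two findings easier to check, here is a minimal Python sketch. It is not code from the benchmark. The function names are hypothetical, and the toy model (a fixed per-call overhead plus a constant number of tokens per retained turn) is an assumption for illustration only. The first function derives the output/input cost ratio implied by the 4.3% token share and 73.5% cost share; the second compares total input tokens with and without 6-hour windowing. The values of system_tokens and turn_tokens are placeholders picked to land in the same ballpark as the figures above, not measurements.

```python
def implied_price_ratio(output_token_share: float, output_cost_share: float) -> float:
    """Output/input cost-per-token ratio implied by token share and cost share."""
    input_token_share = 1.0 - output_token_share
    input_cost_share = 1.0 - output_cost_share
    # cost_out / cost_in = (tokens_out * price_out) / (tokens_in * price_in),
    # so price_out / price_in = (cost_out / cost_in) * (tokens_in / tokens_out).
    return (output_cost_share / input_cost_share) * (input_token_share / output_token_share)


def total_input_tokens(hours: int, window: int, system_tokens: int, turn_tokens: int) -> int:
    """Total input tokens over one run when history resets every `window` hours.

    Toy model (assumption): every call carries a fixed overhead of
    `system_tokens` plus `turn_tokens` for each turn currently in context.
    """
    total = 0
    for hour in range(hours):
        turns_in_context = hour % window + 1  # current turn plus earlier turns in this window
        total += system_tokens + turns_in_context * turn_tokens
    return total


if __name__ == "__main__":
    ratio = implied_price_ratio(output_token_share=0.043, output_cost_share=0.735)
    print(f"implied output/input cost ratio: {ratio:.1f}x")

    # Placeholder values, NOT measurements from the benchmark.
    system_tokens, turn_tokens = 1_500, 450
    full = total_input_tokens(36, window=36, system_tokens=system_tokens, turn_tokens=turn_tokens)
    windowed = total_input_tokens(36, window=6, system_tokens=system_tokens, turn_tokens=turn_tokens)
    print(f"no windowing: {full} tokens, 6-hour windows: {windowed} tokens")
    print(f"input-token reduction: {1 - windowed / full:.0%}")
```

The reason windowing helps so much in this model is that unwindowed history grows linearly per call, so total input grows quadratically with run length; resetting every 6 hours caps that growth. The fixed per-call overhead is not reduced by windowing, which is why the overall saving depends on how large that overhead is relative to the per-turn history.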
The report covers mechanism, observed impact, and behavioral integrity assessment for each strategy — with actual token counts from the benchmark run. Testing with various models is ongoing.
Single model, single benchmark campaign (30 runs): observations, not conclusions.
Full report: https://github.com/mmehmetisik/wwtp-engineering-benchmark/blob/main/report/Cost Optimization Analysis Report.pdf