Hacker News Viewer

StepFun 3.5 Flash is #1 cost-effective model for OpenClaw tasks (300 battles)

by skysniper on 4/1/2026, 4:17:35 PM

https://app.uniclaw.ai/arena?tab=costEffectiveness&via=hn

Comments

by: james2doyle

None of the Qwen 3.5 models seem present? I've heard people are pretty happy with the smaller 3.5 versions. I would be curious to see those too.

I would also be interested to see "KAT-Coder-Pro-V2", as they brag about their benchmarks in these bots as well.

4/1/2026, 8:56:26 PM


by: ipython

I was excited to read through this to find out how these tasks are evaluated at scale. Lots of scary-looking formulas with sigmas and other Greek letters.

Then I clicked on one task to see what it looks like "on the ground": https://app.uniclaw.ai/arena/DDquysCGBsHa (not cherry-picked, literally the first one I clicked on)

The task was:

> Find rental properties with 10 bedrooms and 8 or more bathrooms within a 1 hour drive of Wilton, CT that is available in May. Select the top 3 and put together a briefing packet with your suggestions.

Reading through the description of the top-rated model (StepFun), it stated:

> Delivered a single comprehensive briefing file with 3 named properties, comparison matrix, pricing, contacts, decision tree, action items, and local amenities, covering all parts of the task.

Oh cool! Sounds great and would be commensurate with the 7/10 score given for the task! However, the next sentence:

> Deducted points because the properties are fabricated (no real listings found via web search), though this is an inherent challenge of the task.

So... in other words, it made a bunch of shit up (at least plausible shit! So give back a few points!) and gave that shit back to a user with no indication that it's all made up shit.

Ok, closed that tab.

4/1/2026, 9:02:37 PM


by: WhitneyLand

StepFun is an interesting model.

If you haven't heard of it yet, there's some good discussion here: https://news.ycombinator.com/item?id=47069179

4/1/2026, 4:57:53 PM


by: hadlock

According to openrouter.ai, it looks like StepFun 3.5 Flash is the most popular model at 3.5T tokens, vs GLM 5 Turbo at 2.5T tokens. Claude Sonnet is in 5th place with 1.05T tokens. Which isn't super surprising, as StepFun is about 5% of the price of Sonnet.

https://openrouter.ai/apps?url=https%3A%2F%2Fopenclaw.ai%2F

4/1/2026, 4:44:52 PM


by: dmazin

why do half the comments here read like ai trying to boost some sort of scam?

4/1/2026, 5:43:18 PM


by: grimm8080

Yet when I tried it, it did abysmally compared to Gemini 2.5 Flash.

4/1/2026, 6:45:36 PM


by: smallerize

It looks like Unsloth had trouble generating their dynamic quantized versions of this model, deleted the broken files, then never published an update.

4/1/2026, 4:49:07 PM


by: mgw

Missing from the comparison is MiMo V2 Flash (not Pro), which I think could put up a good fight against Step 3.5 Flash.

Pricing is essentially the same:

MiMo V2 Flash: $0.09/M input, $0.29/M output
Step 3.5 Flash: $0.10/M input, $0.30/M output

MiMo scores 41 vs 38 for Step on the Artificial Analysis Intelligence Index, but 49 vs 52 for Step on their Agentic Index.

4/1/2026, 7:16:37 PM
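
A quick sanity check of the "essentially the same" claim, using only the per-million-token rates quoted above. The 3:1 input-to-output token mix is an assumed workload for illustration, not anything measured in the thread.

```python
# Rough blended-cost comparison from the quoted per-million-token rates.
# The 3:1 input:output token mix is an assumption, not measured data.
prices = {
    "MiMo V2 Flash":  {"input": 0.09, "output": 0.29},
    "Step 3.5 Flash": {"input": 0.10, "output": 0.30},
}

input_mtok, output_mtok = 3.0, 1.0   # assumed millions of tokens per workload

for model, p in prices.items():
    cost = input_mtok * p["input"] + output_mtok * p["output"]
    print(f"{model}: ${cost:.2f} for {input_mtok:.0f}M in / {output_mtok:.0f}M out")
# -> MiMo V2 Flash: $0.56, Step 3.5 Flash: $0.60, i.e. within about 7% of each other
```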


by: azmenak

This model is free to use, and has been for quite some time on OpenRouter. $0 is pretty hard to beat in terms of cost effectiveness.

4/1/2026, 9:48:19 PM


by: clausewitz

I'm not seeing DeepSeek mentioned very often, which I've been using for OpenClaw, very cheaply I might add, with great success. I think I loaded $10 into my account 2 months ago and I still haven't needed to top up.

4/2/2026, 5:05:20 AM


by: skysniper

Another thing from the bench I didn't expect: Gemini 3.1 Pro is very unreliable at using skills. Sometimes it just reads the skill and decides to do nothing, while Opus/Sonnet 4.6 and GPT-5.4 never have this issue.

4/1/2026, 5:32:14 PM


by: sunaookami

Tried the free version on OpenRouter with pi.dev. It's competent at tool calling, and creative writing is "good enough" for me (more natural, Claude-level, not robotic GPT-slop), but it makes some grave mistakes (had some Hanzi in the output once, and typos in words). So it may be good for "simple" agentic workflows, but it's definitely not made for programming or long-form writing.

4/1/2026, 6:54:46 PM


by: grigio

I like StepFun 3.5 Flash, a good tradeoff.

4/1/2026, 7:32:37 PM


by: yieldcrv

People aren't just using Claude models any more? That's nice to see.

4/1/2026, 8:38:27 PM


by: skysniper

I ran 300+ benchmarks across 15 models in OpenClaw and published two separate leaderboards: performance and cost-effectiveness.

The two boards look nothing alike. Top 3 performance: Claude Opus 4.6, GPT-5.4, Claude Sonnet 4.6. Top 3 cost-effectiveness: StepFun 3.5 Flash, Grok 4.1 Fast, MiniMax M2.7.

The most dramatic split: Claude Opus 4.6 is #1 on performance but #14 on cost-effectiveness. StepFun 3.5 Flash is #1 on cost-effectiveness and #5 on performance.

Other surprises: GLM-5 Turbo, Xiaomi MiMo v2 Pro, and MiniMax M2.7 all outrank Gemini 3.1 Pro on performance.

Rankings use relative ordering only (not raw scores) fed into a grouped Plackett-Luce model with bootstrap CIs. Same principle as Chatbot Arena: absolute scores are noisy, but "A beat B" is reliable. Full methodology: https://app.uniclaw.ai/arena/leaderboard/methodology?via=hn

I built this as part of OpenClaw Arena: submit any task, pick 2-5 models, and a judge agent evaluates them in a fresh VM. Public benchmarks are free.

4/1/2026, 4:17:35 PM
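
A minimal sketch of the kind of fit the methodology above describes, assuming each battle is stored as a simple ranking of model indices (best first): a plain Plackett-Luce maximum-likelihood fit with bootstrap confidence intervals. The Arena's "grouped" variant and exact likelihood are not reproduced here; the data and function names below are illustrative only.

```python
# Illustrative Plackett-Luce fit with bootstrap CIs (not the Arena's actual code).
# Each battle is a tuple of model indices ranked best-first.
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(theta, rankings):
    """Negative Plackett-Luce log-likelihood for log-worth vector theta."""
    nll = 0.0
    for r in rankings:
        s = theta[list(r)]                      # log-worths in ranked order
        for j in range(len(r) - 1):             # last place contributes nothing
            nll -= s[j] - np.log(np.sum(np.exp(s[j:])))
    return nll

def fit_pl(rankings, n_models):
    res = minimize(neg_log_likelihood, np.zeros(n_models), args=(rankings,),
                   method="L-BFGS-B")
    return res.x - res.x.mean()                 # center scores (identifiability)

def bootstrap_ci(rankings, n_models, n_boot=200, alpha=0.05, seed=0):
    """Percentile CIs from refitting on resampled battles."""
    rng = np.random.default_rng(seed)
    fits = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(rankings), size=len(rankings))
        fits.append(fit_pl([rankings[i] for i in idx], n_models))
    return np.percentile(fits, [100 * alpha / 2, 100 * (1 - alpha / 2)], axis=0)

# Toy data: model 0 usually ranks above 1, which usually ranks above 2.
battles = [(0, 1, 2)] * 8 + [(1, 0, 2)] * 2 + [(0, 2, 1)] * 2
scores = fit_pl(battles, 3)
lo, hi = bootstrap_ci(battles, 3, n_boot=100)
print("scores:", scores.round(2))
print("95% CIs:", np.stack([lo, hi], axis=1).round(2))
```

The key point the methodology relies on is that only the ordering within each battle enters the likelihood, so noisy absolute judge scores never touch the leaderboard; the bootstrap then quantifies how stable the fitted ordering is under resampling of battles.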