Hacker News Viewer

StepFun 3.5 Flash is #1 cost-effective model for OpenClaw tasks (300 battles)

by skysniper on 4/1/2026, 4:17:35 PM

https://app.uniclaw.ai/arena?tab=costEffectiveness&via=hn

Comments

by: james2doyle

None of the Qwen 3.5 models seem present? I've heard people are pretty happy with the smaller 3.5 versions. I would be curious to see those too.

I would also be interested to see "KAT-Coder-Pro-V2", as they brag about their benchmarks in these bots as well.

4/1/2026, 8:56:26 PM


by: ipython

I was excited to read through this to find out how these tasks are evaluated at scale. Lots of scary-looking formulas with sigmas and other Greek letters.

Then I clicked on one task to see what it looks like "on the ground": https://app.uniclaw.ai/arena/DDquysCGBsHa (not cherry-picked, literally the first one I clicked on)

The task was:

> Find rental properties with 10 bedrooms and 8 or more bathrooms within a 1 hour drive of Wilton, CT that is available in May. Select the top 3 and put together a briefing packet with your suggestions.

Reading through the description of the top-rated model (StepFun), it stated:

> Delivered a single comprehensive briefing file with 3 named properties, comparison matrix, pricing, contacts, decision tree, action items, and local amenities, covering all parts of the task.

Oh cool! Sounds great and would be commensurate with the 7/10 score given for the task! However, the next sentence:

> Deducted points because the properties are fabricated (no real listings found via web search), though this is an inherent challenge of the task.

So... in other words, it made a bunch of shit up (at least plausible shit! So give back a few points!) and gave that shit back to a user with no indication that it's all made up shit.

Ok, closed that tab.

4/1/2026, 9:02:37 PM


by: WhitneyLand

StepFun is an interesting model.

If you haven't heard of it yet, there's some good discussion here: https://news.ycombinator.com/item?id=47069179

4/1/2026, 4:57:53 PM


by: hadlock

According to openrouter.ai, it looks like StepFun 3.5 Flash is the most popular model at 3.5T tokens, vs GLM 5 Turbo at 2.5T tokens. Claude Sonnet is in 5th place with 1.05T tokens. Which isn't super surprising, as StepFun is about 5% of the price of Sonnet.

https://openrouter.ai/apps?url=https%3A%2F%2Fopenclaw.ai%2F

4/1/2026, 4:44:52 PM


by: dmazin

why do half the comments here read like ai trying to boost some sort of scam?

4/1/2026, 5:43:18 PM


by: grimm8080

Yet when I tried it, it did abysmally compared to Gemini 2.5 Flash.

4/1/2026, 6:45:36 PM


by: smallerize

It looks like Unsloth had trouble generating their dynamic quantized versions of this model, deleted the broken files, then never published an update.

4/1/2026, 4:49:07 PM


by: mgw

Missing from the comparison is MiMo V2 Flash (not Pro), which I think could put up a good fight against Step 3.5 Flash.

Pricing is essentially the same:

MiMo V2 Flash: $0.09/M input, $0.29/M output
Step 3.5 Flash: $0.10/M input, $0.30/M output

MiMo scores 41 vs 38 for Step on the Artificial Analysis Intelligence Index, but 49 vs 52 for Step on their Agentic Index.

4/1/2026, 7:16:37 PM
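
A quick sanity check of the "essentially the same" claim, using only the per-million-token rates quoted above. The 3:1 input-to-output token mix is an assumed workload for illustration, not anything measured in the thread.

```python
# Rough blended-cost comparison from the quoted per-million-token rates.
# The 3:1 input:output token mix is an assumption, not measured data.
prices = {
    "MiMo V2 Flash":  {"input": 0.09, "output": 0.29},
    "Step 3.5 Flash": {"input": 0.10, "output": 0.30},
}

input_mtok, output_mtok = 3.0, 1.0   # assumed millions of tokens per workload

for model, p in prices.items():
    cost = input_mtok * p["input"] + output_mtok * p["output"]
    print(f"{model}: ${cost:.2f} for {input_mtok:.0f}M in / {output_mtok:.0f}M out")
# -> MiMo V2 Flash: $0.56, Step 3.5 Flash: $0.60, i.e. within about 7% of each other
```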


by: azmenak

This model is free to use, and has been for quite some time on OpenRouter. $0 is pretty hard to beat in terms of cost effectiveness.

4/1/2026, 9:48:19 PM


by: clausewitz

I'm not seeing DeepSeek mentioned very often, which I've been using for OpenClaw, very cheaply I might add, with great success. I think I loaded $10 into my account 2 months ago and I still haven't needed to top up.

4/2/2026, 5:05:20 AM


by: skysniper

Another thing from the bench I didn't expect: Gemini 3.1 Pro is very unreliable at using skills. Sometimes it just reads the skill and decides to do nothing, while Opus/Sonnet 4.6 and GPT-5.4 never have this issue.

4/1/2026, 5:32:14 PM


by: sunaookami

Tried the free version on OpenRouter with pi.dev. It's competent at tool calling, and creative writing is "good enough" for me (more natural, Claude-level, not robotic GPT-slop), but it makes some grave mistakes (had some Hanzi in the output once, and typos in words). So it may be good for "simple" agentic workflows, but it's definitely not made for programming or long-form writing.

4/1/2026, 6:54:46 PM


by: grigio

I like StepFun 3.5 Flash, a good tradeoff.

4/1/2026, 7:32:37 PM


by: yieldcrv

People aren't just using Claude models any more? That's nice to see.

4/1/2026, 8:38:27 PM


by: skysniper

I ran 300+ benchmarks across 15 models in OpenClaw and published two separate leaderboards: performance and cost-effectiveness.

The two boards look nothing alike. Top 3 performance: Claude Opus 4.6, GPT-5.4, Claude Sonnet 4.6. Top 3 cost-effectiveness: StepFun 3.5 Flash, Grok 4.1 Fast, MiniMax M2.7.

The most dramatic split: Claude Opus 4.6 is #1 on performance but #14 on cost-effectiveness. StepFun 3.5 Flash is #1 on cost-effectiveness and #5 on performance.

Other surprises: GLM-5 Turbo, Xiaomi MiMo v2 Pro, and MiniMax M2.7 all outrank Gemini 3.1 Pro on performance.

Rankings use relative ordering only (not raw scores) fed into a grouped Plackett-Luce model with bootstrap CIs. Same principle as Chatbot Arena: absolute scores are noisy, but "A beat B" is reliable. Full methodology: https://app.uniclaw.ai/arena/leaderboard/methodology?via=hn

I built this as part of OpenClaw Arena: submit any task, pick 2-5 models, and a judge agent evaluates them in a fresh VM. Public benchmarks are free.

4/1/2026, 4:17:35 PM
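
A minimal sketch of the kind of fit the methodology above describes, assuming each battle is stored as a simple ranking of model indices (best first): a plain Plackett-Luce maximum-likelihood fit with bootstrap confidence intervals. The Arena's "grouped" variant and exact likelihood are not reproduced here; the data and function names below are illustrative only.

```python
# Illustrative Plackett-Luce fit with bootstrap CIs (not the Arena's actual code).
# Each battle is a tuple of model indices ranked best-first.
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(theta, rankings):
    """Negative Plackett-Luce log-likelihood for log-worth vector theta."""
    nll = 0.0
    for r in rankings:
        s = theta[list(r)]                      # log-worths in ranked order
        for j in range(len(r) - 1):             # last place contributes nothing
            nll -= s[j] - np.log(np.sum(np.exp(s[j:])))
    return nll

def fit_pl(rankings, n_models):
    res = minimize(neg_log_likelihood, np.zeros(n_models), args=(rankings,),
                   method="L-BFGS-B")
    return res.x - res.x.mean()                 # center scores (identifiability)

def bootstrap_ci(rankings, n_models, n_boot=200, alpha=0.05, seed=0):
    """Percentile CIs from refitting on resampled battles."""
    rng = np.random.default_rng(seed)
    fits = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(rankings), size=len(rankings))
        fits.append(fit_pl([rankings[i] for i in idx], n_models))
    return np.percentile(fits, [100 * alpha / 2, 100 * (1 - alpha / 2)], axis=0)

# Toy data: model 0 usually ranks above 1, which usually ranks above 2.
battles = [(0, 1, 2)] * 8 + [(1, 0, 2)] * 2 + [(0, 2, 1)] * 2
scores = fit_pl(battles, 3)
lo, hi = bootstrap_ci(battles, 3, n_boot=100)
print("scores:", scores.round(2))
print("95% CIs:", np.stack([lo, hi], axis=1).round(2))
```

The key point the methodology relies on is that only the ordering within each battle enters the likelihood, so noisy absolute judge scores never touch the leaderboard; the bootstrap then quantifies how stable the fitted ordering is under resampling of battles.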