hardsnow 10 hours ago [-]
I’ve been developing an open-source version of something similar[1] and have used it quite extensively (well over 1k PRs)[2]. I’m definitely a believer in the “prompt to PR” model. It's very liberating not to have to think about managing the agent sessions. It seems you have built a lot of useful tooling (e.g., session videos) around this core idea.
Couple of learnings to share that I hope could be of use:
1) Execution sandboxing is just the start. For any enterprise usage you want fairly tight network egress control as well, to limit the chances of accidental leaks or malicious exfiltration if there's any risk of untrusted material getting into the model context. Speaking as a decision maker at a tech company, we do actually review stuff like this when evaluating tools.
2) Once you have proper network sandboxing, you could secure credentials much better: give agent only dummy surrogates and swap them to real creds on the way out.
3) Sandboxed agents with automatic provisioning of a workspace from git can be used for more than just development tasks. In fact, it might be easier to find initial traction with more constrained and thus more predictable tasks, e.g., “ask my codebase” or “debug CI failures”.
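Points 1 and 2 can be pictured as a single filter running in an egress proxy outside the sandbox. The following is only a minimal sketch under stated assumptions: the host names, token strings, and the `rewrite_outbound` function are hypothetical, and a real proxy would also need TLS handling, logging, and secret storage.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist: the only hosts the sandbox may reach.
ALLOWED_HOSTS = {"api.github.com"}

# Hypothetical mapping: (destination host, surrogate token) -> real credential.
# Keying on the host means a real credential is only released to its own service.
SURROGATES = {("api.github.com", "dummy-gh-token"): "real-gh-token"}


def rewrite_outbound(url: str, headers: dict[str, str]) -> dict[str, str]:
    """Return the headers to forward, or raise if the destination is blocked."""
    host = urlsplit(url).hostname
    if host not in ALLOWED_HOSTS:
        raise PermissionError(f"egress to {host!r} is not allowed")

    forwarded = dict(headers)
    scheme, _, token = forwarded.get("Authorization", "").partition(" ")
    real = SURROGATES.get((host, token))
    if real is not None:
        # The agent only ever saw the surrogate; the swap happens on the way out.
        forwarded["Authorization"] = f"{scheme} {real}"
    return forwarded
```

Because the swap happens after the allowlist check, a surrogate that leaks out of the sandbox is useless on its own, and a request to any other host is rejected before credentials are touched.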
I love the idea of emailing agents like we email humans! Thank you for sharing your learnings:
1. Network constraints vary quite a bit from one enterprise customer to another, so right now this is something we handle on a case-by-case basis with them.
2. We came to the same conclusion. For sensitive credentials like LLM API keys, we generate ephemeral keys so the real keys never touch the sandbox.
3. Totally right, we support constrained tasks too (ask mode, automated CI fixes). We've gone back and forth on whether to go vertical-first or stay generic. We're still figuring out where the sweet spot is. The constrained tasks are more reliable today, but the open-ended ones are where teams get the most leverage.
woeirua 39 minutes ago [-]
I think Cloud Agents are the future, but I’ll be honest I don’t see how a third party provider survives in this space.
1. It’s really not that hard to stand this up on your own. GitHub agentic workflows gets you 95% of the way there already.
2. Anthropic and Cursor are already playing in this space and likely will eat your lunch.
IMO, the only way you can survive is to make this deployable behind the firewall. If you could do that then I would seriously consider using your product.
danoandco 13 minutes ago [-]
On gh-aw: it looks solid for the event-driven automation shape (triage, docs sync, CI fix). We're after a slightly different shape: interactive back-and-forth, steering from Slack or Linear, persistent sandboxes with a booted dev server for live previews. Thanks for the pointer, I'll dig into it more.
On labs eating our lunch: it's definitely a risk. Our bet is that reusing lab-native CLIs is enough to position ourselves in the market.
On behind the firewall: it's something we're looking into. We open-sourced agentbox-sdk in that direction.
cocoflunchy 2 hours ago [-]
Great timing as I'm exploring the space to get rid of Cursor in our stack.
For local dev everyone is switching to Claude Code or Codex.
The state of the art for cloud agents in my opinion right now is Cursor. But their pricing model per-user doesn't make sense when what I want is to enable anyone in the company to fix things in the product.
2 things not immediately clear from your homepage:
- do you support full computer use? Again Cursor is the best I've tried there
- what kind of triggers do you support? We have in particular one automation built with cursor to auto approve PRs that are low-risk. It triggers on a specific comment on a PR
Finally, some advice from a user's pov: you need to invest a lot in the onboarding experience. I tried Devin today and couldn't get it to work after one hour of fiddling. How do you store the repo's setup scripts? Cursor cloud is pretty opaque and annoying to configure on that side.
Anyway I'll try it!
danoandco 2 hours ago [-]
On computer use: Yes. Sandboxes come with a computer-use CLI for driving Linux GUI apps via X11.
On triggers: Cron, GitHub (PRs, issues, @twill mentions in review comments), Slack, Linear, Notion, Asana webhooks, plus CLI and web. For our PR-comment workflow, you would tag @twill with an instruction. That said, you can also set up a daily cron on Twill that checks PRs for a specific label like Confidence Score : x/5 and tell it to auto-approve when the score is 5/5, for example.
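The label check in such a cron could be as small as the sketch below. The label format follows the example above; the function name, regex, and `threshold` parameter are hypothetical, and fetching labels or submitting the approval through the GitHub API is deliberately left out.

```python
import re

# Matches labels such as "Confidence Score : 5/5" (spaces around ":" optional).
LABEL_RE = re.compile(r"^Confidence Score\s*:\s*([0-5])/5$")


def should_auto_approve(labels: list[str], threshold: int = 5) -> bool:
    """Return True if a confidence label is present and meets the threshold."""
    for label in labels:
        match = LABEL_RE.match(label.strip())
        if match:
            return int(match.group(1)) >= threshold
    # No confidence label at all: never auto-approve.
    return False
```

Defaulting to "do not approve" when the label is missing or malformed keeps the automation conservative.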
On setup scripts: Per-repo entrypoint script, env vars, and ports, all accessible in the UI. There is a dedicated Dev Environment agent mode that you start with to set up the infra. You can steer the agent on how to set things up if it gets stuck, so this should be smooth. The agent can also rewrite the entrypoint mid-task.
There is also a Twill skill you can add to your local agents to dispatch tasks to Twill. Meaning you can research and plan locally using your CLI and delegate the implementation to a sandbox on Twill.
cocoflunchy 2 hours ago [-]
Sent you some feedback from the app, I can't get GitHub to connect. Feel free to contact me over email to troubleshoot!
danoandco 2 hours ago [-]
Mmh this works on my end. Sending you an email. Ty
2001zhaozhao 10 hours ago [-]
24/7 running coding agents are pretty clearly the direction the industry is going now. I think we'll need either on-premises or cloud solutions, since obviously if you need an agent to run 24/7 then it can't live on your laptop.
Obviously cloud is better for making money, and some kind of VPC or local cloud solution is best for enterprise, but perhaps for individual devs, a self-hosted system on a home desktop computer running 24/7 (hybrid desktop / server) would be the best solution?
piker 9 hours ago [-]
> 24/7 running coding agents are pretty clearly the direction the industry is going now.
This assertion needs some support for those of us that don't have a macro insight into the industry. Are you seeing this from within FAANG shops? As a solo developer? What? Honest question.
2001zhaozhao 8 hours ago [-]
I'm speaking from my daily experience. Sometimes I don't want to close my laptop before going to bed because there are still 1-2 tasks ongoing in my AI kanban board, so I just leave my laptop open (locked but not suspended) so that the agents keep working for a while. I don't even have things all that automated.
I anticipate that once I have some more complex agentic scaffolds set up to do things like automatically explore promising directions for the project, then leaving the AI system on overnight becomes a necessity.
ragelink 7 hours ago [-]
The core issue for me is, I don't want to trust someone else with my code, or run my stuff on their computers. I don't see serious enterprise organizations offloading something as critical to security outside their own network perimeter.
danoandco 9 hours ago [-]
For a solo dev running one task at a time, a beefy desktop overnight is totally viable. We see a lot of this with the Mac Mini hype
Cloud starts to matter when you want to (a) run a swarm of agents on multiple independent tasks in parallel, (b) share agents across a team, or (c) not worry about keeping a machine online
2001zhaozhao 9 hours ago [-]
I would point out that a beefy desktop is probably faster at compiling code than a typical cloud instance simply due to more CPU performance. So maybe up to 10-ish concurrent agents it's faster to use a local desktop than a cloud instance, and then you start to get into the territory where multiple agents are compiling code at the same time, and the cloud setup starts to win. (That's assuming the codebase takes a while to compile and pegs your CPU at 100% while doing so. If the codebase is faster to compile or uses fewer threads, then the breakeven agent count is even higher.)
Other than that, I agree with what you said. I don't know what the tradeoffs for local on-premises and cloud agents are in terms of other areas like convenience, but I do think that scalability in the cloud is a big advantage.
danoandco 8 hours ago [-]
Totally right on the compile time. CIs have the same bottleneck, and the ecosystem is working on fixing this (faster cpus, better caching) in both coding agents and CI to improve overall velocity
Edit: just noticed this is a semi duplicate question to https://news.ycombinator.com/item?id=47723506 so rephrasing my question - will you have computer use and will you have self-hosted runners option? (you being just the controlplane / task orchestrator, which is the hardest problem apparently...)
Additional question - what types of sandboxes do you use? (just docker or also firecracker etc...)
Original comment:
Congrats on the launch!
What's the benefit over cursor cloud agents with computer use? (other than preventing vendor lock in?)
We already support computer use out of the box (linux sandboxes). Self hosted runners are not available yet, but Twill is built on a runtime agnostic layer (see https://github.com/TwillAI/agentbox-sdk) so it is feasible!
eranation 5 hours ago [-]
You definitely got my interest... will try it out!
willydouhard 5 hours ago [-]
Let us know how it goes!
dennisy 9 hours ago [-]
Congrats on the launch, the agentbox-sdk looks interesting, but seeing as the first commit was 3 days ago - I feel a little wary to use it just yet!
One question, do you have plans for any other forms of sandboxing that are a little more "lightweight"?
Also how do you add more agent types, do you support just ACP?
willydouhard 8 hours ago [-]
Thank you! agentbox-sdk is very recent so it is not stable just yet indeed!
For the lightweight sandbox, can you give an example?
Currently we support main coding CLIs, ACP support is not shipped yet.
eranation 6 hours ago [-]
HN hug of death probably, but your scorecard returns an error :(
Claude managed agents is a general-purpose hosted runtime for Claude, while Twill focuses on SWE tasks.
And so the SWE workflow is pre-built (research, planning, verification, PR, proof of work). Twill is also agnostic to the agent, so you can use codex for instance. Additionally you have more flexibility on sandbox sizing on Twill
hmokiguess 10 hours ago [-]
> Run the same agent n times to increase success rate.
Are there benchmarks out there that back this claim?
danoandco 10 hours ago [-]
Yes, this is the pass@k metric from code generation research. Found the relevant paper Evaluating Large Language Models Trained on Code (Chen et al., 2021) which introduced the metric.
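For reference, that paper estimates pass@k from n samples per problem, of which c pass the tests, as 1 - C(n-c, k) / C(n, k). A minimal sketch of that estimator follows; the function name and argument checks are my own, not taken from the paper's code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: n samples drawn, c of them correct, budget k."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 3 passing attempts out of 10, `pass_at_k(10, 3, 1)` is about 0.3 while `pass_at_k(10, 3, 5)` is about 0.92. Note that the metric assumes there is a way to recognise the correct attempt, such as tests or a human picking the best result.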
hmokiguess 9 hours ago [-]
Interesting, and how does Twill use it in that feature?
danoandco 7 hours ago [-]
On the Twill web app, you can run the same task across different agents and multiple attempts (each in its own sandbox). Then you pick the best result. This is super handy for UI work where you can open the live preview for each attempt and compare.
Next step for us is adding a final pass where an agent evaluates the results and combines the best parts into one PR.
danoandco 9 hours ago [-]
[dead]
j_gonzalez 6 hours ago [-]
[flagged]
a_t48 8 hours ago [-]
Does it support running Docker images inside the sandbox?
willydouhard 8 hours ago [-]
Yes, for instance Twill is running a local postgres and redis directly in the sandbox using docker compose when running on our codebase.
This is what enables Twill to self verify its work before opening a PR
wordpad 7 hours ago [-]
How does this compare to Jules from Google?
danoandco 7 hours ago [-]
Jules is similar to Twill with the following differences:
- Twill is CLI-agnostic, meaning you can use Claude Code, Codex or Gemini. Jules only works with Gemini.
- We focus on the delegation experience: Twill has native integrations with your typical stack like Slack or Linear. The PRs come back with proofs of work, such as screenshots or videos.
auszeph 7 hours ago [-]
I built an internal version of this for my workplace.
Something very useful that will be harder for you most likely is code search. Having a proper index over hundreds of code repos so the agent can find where code is called from or work out what the user means when they use an acronym or slightly incorrect name.
It's quite nice to use and I'm sure someone will make a strong commercial offering. Good luck
woeirua 44 minutes ago [-]
Why not connect to the GitHub MCP?
willydouhard 7 hours ago [-]
I agree and that is why I think monorepos are making a comeback.
That said, there are workarounds, like cloning all repos and enabling LSP (coding CLIs added that feature) or using a dedicated solution for codebase indexing and add a skill/mcp.
Super fast models spamming grep commands are also fun to watch!
Run a copy of this in the same VPC. Monorepos would definitely help, but that's not the structure we have. I didn't want to rely on API limits (or stability) at GitHub for such a core feature.
Using this we've had agents find dead APIs across multiple repos that can be cleaned up and the like. Very useful.
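To make the dead-API idea concrete, here is a deliberately naive sketch, not the setup described above: it treats every repo as plain text, collects Python `def` names, and reports the ones that never appear anywhere except at their own definition. The function name and input shape are hypothetical, and a real index (or an LSP) would resolve imports, scopes, and other languages properly.

```python
import re
from collections import defaultdict

DEF_RE = re.compile(r"^\s*def\s+([A-Za-z_]\w*)", re.MULTILINE)
WORD_RE = re.compile(r"[A-Za-z_]\w*")


def find_unreferenced(repos: dict[str, dict[str, str]]) -> set[str]:
    """repos maps repo name -> {file path: source text}.

    Returns function names whose only occurrences are their definitions.
    """
    definitions: dict[str, int] = defaultdict(int)
    occurrences: dict[str, int] = defaultdict(int)
    for files in repos.values():
        for text in files.values():
            for name in DEF_RE.findall(text):
                definitions[name] += 1
            for word in WORD_RE.findall(text):
                occurrences[word] += 1
    # A name is "dead" if it shows up only where it is defined.
    return {name for name, n in definitions.items() if occurrences[name] == n}
```

Because the counts span all repos at once, a function defined in one repo and called only from another is correctly treated as live, which is the case a single-repo search would miss.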
willydouhard 5 hours ago [-]
this is great, thanks for sharing
senordevnyc 8 hours ago [-]
How does this compare to something like Cursor Cloud Agents with a solid set of skills and tools?
danoandco 7 hours ago [-]
Similar, but we reuse lab-native CLIs like Claude Code or Codex, which the labs perform RL on. So in the long run, we believe this approach wins over custom harnesses.
gbnwl 9 hours ago [-]
So instead of using my Claude Code subscription, I can pay the vastly higher API rates to you so you can run Claude Code for me?
willydouhard 9 hours ago [-]
Anthropic recently killed the ability for third parties to use the Claude Code subscription, and it's assumed they're subsidising that price heavily. Which is fine, but it's a good reminder of the vendor lock-in risk. One policy change and your workflow breaks. Twill is agent-agnostic (Claude Code, Codex CLI, OpenCode), so you're not betting on any single vendor's pricing decisions.
On the cost for solo devs, yeah, if you're one person running one agent at a time on your laptop, the sub is probably the better deal today. No argument there. The cloud agent model starts to make sense when you want to fire off multiple tasks in parallel.
gbnwl 9 hours ago [-]
Not sure if you've seen it yourself but Claude code can kick off parallel agents working in their own worktrees natively now. I do it all the time.
willydouhard 9 hours ago [-]
Yes, the difference is that Twill launches dedicated infra on each sandbox for each task. This means you can work on multiple tasks requiring a DB migration for instance.
Also you can fire and forget tasks (my favorite) and don't have to keep your laptop running at night.
verdverm 7 hours ago [-]
See also Cowork and other upcoming Anthropic features.
See also Show HN, this exact product is frequently shown as a github link.
The paradigm shift in AI means what you are making is (1) filling a gap until the primaries implement it (most have it in their pipeline, if not already shipped) and (2) easy to replicate with said AI using my preferred tech stack.
willydouhard 6 hours ago [-]
Cowork does not seem to be focused on engineering, but we are fully expecting Anthropic to catch up in this category.
What Anthropic can't offer is to let you use Codex or combine it with Claude Code. That is why we think non ai-labs players have a say in this market.
To your last point, as always there is a buy vs build tradeoff which ultimately comes down to focusing on your core business which we think still remains important in the ai era
verdverm 6 hours ago [-]
> as always there is a buy vs build tradeoff
it's a nonbinary decision now
Google has a free, open source take on what you are building, looks more mature as well
My comment about Cowork is more about pointing out a different feature set that will cross over with Code. For example, they have the Task related things as an affordance, and Code has this coming.
willydouhard 6 hours ago [-]
I believe there is a difference between an open source framework and a product. You would still have to manage and scale your infra, build the integration layer around it to make it accessible where your teams are, fix bugs etc...
I am not saying that build is always the bad choice, but the tradeoff did not disappear imo
verdverm 6 hours ago [-]
[flagged]
gbnwl 5 hours ago [-]
I’m newer to knowing and caring about what YC does at all in terms of the companies it funds. The fact that this is YC makes me think the org has forfeited any sense of “taste” at all. Complete scattershot from people who have money to scatter I guess.
verdverm 5 hours ago [-]
You can read old Paul Graham essays and the early YC Startup School (which is probably when peak YC happened) to get a sense of the ethos. They increased batch size to scale (as context for the "stopped doing things that don't scale" comment)
We’re focused on SWE use cases. Code is nice because there’s already a built-in verification loop: diffs, tests, CI, review, rollback. But you quickly get to a state where the agent needs to take a risky action (a DB migration or an infra operation). This is where the agents' permission features are handy: allowlist, automode, etc. That way you only have to approve or reject the high-risk actions.
And I think this risk model is valid for both technical and non-technical use cases
[1] https://airut.org [2] https://haulos.com/blog/building-agents-over-email/
https://cursor.com/blog/agent-computer-use
Or the existing Claude Code Web?
The analysis request failed.
Hosted shell completed without parseable score_repo.py JSON output. 11 command(s), 11 output(s). (rest redacted)
Curious to know how you implemented it in house.
https://googlecloudplatform.github.io/scion/overview/
https://www.startupschool.org/