The Agent Experiment · Part 2 of 5

Twelve rounds of debate had already happened. Four AI agents — a macro analyst, an opportunity scout, a risk destructor, and a capital allocator — had been arguing about the best business to start in today’s economy. They were deep into stress-testing each other’s logic, killing weak concepts, and building toward a final recommendation. So far, about ten minutes of work had elapsed, and I still hadn’t said a word.

But not because I wasn’t paying attention.

I caught the issue at round twelve. By then, I knew I had made a mistake. And then I made the intentional choice to let the agents keep working without me.

The Setup

If you read Part 1, you already know the premise. I built a multi-agent investment committee using AutoGen and then gave them a real problem. I asked the team to identify the best B2B business to launch in the US under the real current macroeconomic and geopolitical conditions. I also limited them to $250K in capital and a 12-month operational timeline. Then I let them do their jobs. The AutoGen framework models debate natively. Multiple agents, each with a distinct role and perspective, all working toward convergence. For an investment thesis problem, this topology made the most sense.

As referenced earlier, I limited the team to just four agents. A MacroAnalyst who frames the economic environment. An OpportunityScout who generates viable concepts. A RiskDestructor who attacks opportunities from all angles. And a CapitalAllocator who scores survivability, unit economics, and capital efficiency. This agent also decides what lives and what gets killed.

During the build, I set human_input_mode="ALWAYS" on the UserProxy (the “agent” that represents the human operator in AutoGen). “ALWAYS” sounded like what I wanted. It sounded like “ask the human every turn.” In theory, I thought this would ensure the team would check in with me consistently. human_input_mode="ALWAYS" doesn’t mean any of that.

The Definition of the Word ‘ALWAYS’

Setting human_input_mode="ALWAYS" controls how UserProxy responds when it is selected to speak. It means that when the system calls on UserProxy, it asks the human for input instead of auto-replying on the human’s behalf. In other words, my microphone is connected and works, but only when I’m called on.

The problem was UserProxy never actually got selected to speak!

Speaker selection in AutoGen’s GroupChat works through a rotation list. All the agents were on that list. But UserProxy wasn’t, because I hadn’t explicitly added it. So the agents conducted the conversation by themselves, round after round. The system never once asked the human operator for comment.
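To make the failure concrete, here’s a minimal Python sketch of the shape of the bug. This is not AutoGen’s actual internals, just a stand-in round-robin rotation using the agent names from my setup:

```python
# Minimal sketch of the failure mode (not AutoGen's real implementation):
# a round-robin rotation only ever cycles through the names registered in
# its list. An agent that was never added can never be selected.

def run_rounds(rotation, n_rounds):
    """Return the speaker chosen for each round via simple round-robin."""
    return [rotation[i % len(rotation)] for i in range(n_rounds)]

# What I built: UserProxy was never added to the rotation.
agents = ["MacroAnalyst", "OpportunityScout", "RiskDestructor", "CapitalAllocator"]
speakers = run_rounds(agents, 12)

# human_input_mode="ALWAYS" only matters on turns where UserProxy is
# selected, and here it never is.
print("UserProxy" in speakers)  # False: the human is never called on
```

The setting and the rotation are two independent mechanisms, which is exactly why “ALWAYS” felt like a guarantee but wasn’t.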

That’s not a bug. That’s a gap between what I intended to build and what I built. The code did exactly what it was supposed to do. The real issue was the difference between what I thought I told it to do and what it actually did. Two very different things.

Twelve Blazing Rounds

I realized something was up by round twelve. Up to that point, I had my hands full just trying to keep up with the conversation by reading the log updates. The agents were doing real work and doing it very quickly. By the time I realized I had been left by the wayside, the first round of concepts had already been killed (ComplianceStack: wrong sales cycle, FieldFlow: too capital-heavy, AIroi Tracker: survivability too low). And the agents had already retooled, applied new criteria as given by CapitalAllocator, and were midway through evaluating a second slate of concepts!

The knee-jerk move was to stop the experiment, diagnose the issue, and start over.

But I didn’t do that.

I exercised curiosity (disguised as experimental discipline). If I intervened, I’d never know what the agents could produce without me. This type of data has real value. You can’t analyze what human participation does to agent output if you don’t have a baseline of what agent output looks like without you. Thus, I needed to eat my own dog food. I needed to experience what it feels like when an AI agent system cuts the human out of the loop. I needed to sit with the discomfort of being a passive spectator in a process where I was supposed to be a proactive participant.

So, with a deep sense of wonder and my compulsion to learn from mistakes, I allowed the experiment to go on without human participation. And beneficially so. The agents weren’t hallucinating or spinning in circles without me. As I followed along (still struggling to keep up), I could see all the foundational work I had injected into the build from the beginning was paying off. The agents were applying analytical rigor, building on each other’s reasoning, and self-correcting. At this point, I was experimenting on myself too — without human participation, how far could the agents get based solely on my underpinnings and “scenario-building” alone?

So, I let all 20 rounds run without ever saying a word.

The Mistake: Autonomous Agents — Attempt 1

The winner was StandUply Pro (SUP) — an AI-powered asynchronous “standup” tool for remote engineering teams.

The concept targets engineering and product managers at remote-first tech companies. The problem it solves is real. For example, a 15-person distributed team spending 15 minutes per person per day on status updates loses nearly 19 hours of collective time every week. Existing tools (Geekbot, Status Hero) require manual input. This means engineers must still write the updates. SUP pulls activity directly from GitHub, Jira, and Linear. It uses AI (the model of your choice) to generate natural-language standup summaries and then posts them to a Slack channel at 9AM (or whatever time you’d like). The management team also gets a dashboard showing progress, blockers, and velocity trends. Everyone gets their time back (more time spent working).

CapitalAllocator scored it 8/10. $600/month per team, 80% gross margins, four-month CAC payback, and product-led growth mechanics that compress customer acquisition costs over time. Survivability was set at 75%. The runner-up? ProposalForge (an automated proposal generator) scored a 6.8. The agents made a defensible call and supported it in detail.
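As a back-of-the-envelope sanity check on CapitalAllocator’s numbers, the cited metrics imply a ceiling on customer acquisition cost. The implied-CAC calculation below is my own inference from the figures above, not something the agents reported directly:

```python
# Back-of-the-envelope check of CapitalAllocator's SUP metrics.
# The implied CAC is my inference from the cited numbers, not an
# agent output: a 4-month gross-margin payback at $600/month and
# 80% margin caps what you can spend to acquire one team.

def implied_cac(monthly_price, gross_margin, payback_months):
    """CAC implied by a gross-margin payback period."""
    return monthly_price * gross_margin * payback_months

cac = implied_cac(monthly_price=600, gross_margin=0.80, payback_months=4)
print(cac)  # 1920.0: each team must cost under ~$1,920 to acquire
```

If acquisition costs land under that ceiling, the four-month payback claim holds together arithmetically.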

Outside of the time spent setting up the foundations for the experiment and creating the build, I was able to accomplish all this for under $0.40 in API costs.

The output was damn good, too. That’s what made the architecture failure so instructive. The agents didn’t need me to produce something useful. Turns out all they needed was the info, roles, tools, and scenarios I painstakingly put together so they could do their best work. Even though I wasn’t involved in the debate, my fingerprints were still all over the experiment. I did the groundwork. I just didn’t get a chance to join in on the fun!

The Fix: Making the Human a Mandatory Node

The fix was architectural, not cosmetic.

The “ping-pong” rule

I added logic to ensure that after every AI agent turn, starting AFTER round 5, the conversation would route back to me (UserProxy) for input. This means the agents seek my feedback from round six on without exception. I also added an “autonomous release” mechanism. I can type something like “talk amongst yourselves for N rounds” and the agents will run autonomously for N rounds before handing me the mic again. Essentially, I took the learning from my mistake and made it a feature (a feature I can proactively control). I can now deliberately choose when I want to be ignored. This is the difference between choosing to step back and being excluded without knowing it.
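The routing logic above can be sketched as a custom speaker-selection function. AutoGen’s GroupChat accepts custom selection logic, but the code below is a simplified stand-in, and the autonomous_until mechanism is my own feature, not an AutoGen built-in:

```python
# Sketch of the "ping-pong" rule as a speaker-selection function.
# Simplified stand-in for illustration; `autonomous_until` models my
# "talk amongst yourselves for N rounds" release mechanism.

PING_PONG_START = 6  # from round six on, every agent turn routes to the human

def next_speaker(round_no, last_speaker, agents, autonomous_until=0):
    """Route back to UserProxy after each agent turn once ping-pong
    starts, unless an autonomous release is still in effect."""
    in_ping_pong = round_no >= PING_PONG_START and round_no > autonomous_until
    if in_ping_pong and last_speaker != "UserProxy":
        return "UserProxy"  # the human is a mandatory node, not optional
    # Otherwise: simple rotation through the agent list.
    return agents[round_no % len(agents)]
```

With this in place, the human is structurally unavoidable after round five unless I have explicitly released control, which inverts the original failure.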

This is the distinction between HiC (Human in Control) and HITL (Human in the Loop). HITL is a checkpoint, a nominal human gate the system can and will route around if nothing structurally prevents it. HiC is both architectural foresight and operational command. The human is a required stop in the cycle, not an assumed participant in the rotation.

The Mulligan: The Stagflation Test (And Two More Bugs) — Attempt 2

I ran the second try as a validation and stress test combined. I swapped the macroenvironment scenario file but kept the agents and the now improved codebase the same. Instead of “Q1 2026 Current,” the agents were operating in a stagflation scenario: 7.8% CPI, Fed funds at 7%, GDP stagnant, regional bank failures, and a credit crisis. The thesis the agents produced was completely different. This time, they settled on a Distressed SaaS Rollup firm using offshore operations. The firm would buy dying software companies that were running out of runway at $0.30-$0.50 on the dollar. That was probably the right concept for such an inhospitable macroenvironment. The agents settling on a completely different business idea also helped confirm the parametric architecture was working properly.

But the second attempt was not without mistakes. It surfaced two more bugs, and they compounded in a way that’s worth understanding.

Bug one

The round counter display disappeared after round five completed. The counter itself was running fine (current_round kept tracking accurately throughout the session), but only behind the scenes. Once the “ping-pong” rounds started, the counter simply stopped rendering on screen.

Bug two

Natural language phrases weren’t recognized. I typed things like “keep going, don’t call on me again” to release autonomous control for the remaining rounds. The parser only matched phrases with an explicit number like “talk amongst yourselves for N rounds.” Thus, no autonomous release was ever triggered, and the system prompted me again at the next turn.
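Here is a small sketch of why those phrases fell through. The regex below is my reconstruction of the bug, not the literal pattern from my codebase: it only fires on phrases containing an explicit count, so anything open-ended silently matches nothing:

```python
import re

# Reconstruction of the parser bug (illustrative pattern, not my literal
# code): only phrases with an explicit number ever trigger a release.
NUMERIC_RELEASE = re.compile(r"talk amongst yourselves for (\d+) rounds?")

def parse_release(text):
    """Return the number of autonomous rounds requested, or None."""
    m = NUMERIC_RELEASE.search(text.lower())
    return int(m.group(1)) if m else None

print(parse_release("talk amongst yourselves for 5 rounds"))  # 5
print(parse_release("keep going, don't call on me again"))    # None: no release
```

The parser wasn’t broken in the crashing sense. It was narrow, and narrowness in an input parser fails silently.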

And then I was in a pickle. I needed to release control for a specific number of remaining rounds. BUT I couldn’t see what round I was on! This meant I didn’t know how many rounds were left and what scenarios I could fit into those rounds. So, I had to guess. I told the team to be “autonomous for X rounds” and hoped there were at least X rounds left. There were, but that’s luck — not design. The takeaways from this round’s failures are relevant beyond just this specific system.

BLIND human-in-the-loop might be worse than NO human-in-the-loop.

If the human can’t see system state (what round, what branch, what decision point is approaching), they can’t make informed decisions about when to intervene, when to step back, and so on. A human making uninformed interventions adds noise and risk, not direction and clarity. HiC requires visibility as much as it requires presence. If you don’t design for both, you’ve only built for half the job.

Both bugs were fixed. The round counter now displays on every branch and every round in this format:

[ROUND CONTROL] Round 7/20 | RiskDestructor spoke → YOUR TURN (13 rounds remaining).

And the completion-phrase library was expanded to cover natural language like “keep going,” “run to completion,” “don’t call on me again,” “for the rest of the session,” etc. The system also now injects MAX_ROUNDS into the selector module at session start, so remaining-round calculations are accurate regardless of how you’ve configured the ceiling.
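Both fixes can be sketched together. The banner formatter below reproduces the format shown above; the open-ended phrase list is illustrative of the expanded library, not its exhaustive contents:

```python
# Sketch of the fixed round-control banner plus the expanded release
# phrases. MAX_ROUNDS is injected at session start in the real system;
# the phrase list here is illustrative, not exhaustive.
MAX_ROUNDS = 20

def round_banner(current_round, last_speaker):
    """Render the visible round-control line for the human operator."""
    remaining = MAX_ROUNDS - current_round
    return (f"[ROUND CONTROL] Round {current_round}/{MAX_ROUNDS} | "
            f"{last_speaker} spoke → YOUR TURN ({remaining} rounds remaining).")

# Open-ended phrases release control for ALL remaining rounds.
OPEN_ENDED_RELEASES = ("keep going", "run to completion",
                       "don't call on me again", "for the rest of the session")

def release_rounds(text, current_round):
    """Rounds to run autonomously if the text is an open-ended release."""
    if any(phrase in text.lower() for phrase in OPEN_ENDED_RELEASES):
        return MAX_ROUNDS - current_round
    return 0

print(round_banner(7, "RiskDestructor"))
```

Note how the two fixes interlock: an open-ended release only works because the system now knows, and shows, how many rounds remain.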

In the end, I ran three iterations of the experiment. Each one exposed something the previous one didn’t.

What This Means for SMB Operators

The AutoGen experiment wasn’t a cautionary tale about AI going rogue. The agents didn’t do anything untrustworthy or dangerous. On the contrary, they did exactly what they were configured to do. They applied genuine analytical rigor and produced a defensible output. The failure was entirely in the design layer, specifically in the gap between what I intended and what the architecture actually enforced.

This gap is where most agentic AI failures are likely to live. Not in the models, but rather in well-intentioned configurations that play out differently than expected in production.

“I set it up for human oversight” is not the same as “I built human oversight into the framework from the ground up.”

PRO TIP

The parametric reusability is worth noting for SMB operators specifically. I didn’t rebuild the system to run the stagflation scenario. I simply swapped out a YAML file. Same agents, same logic, but different inputs (and the theses the agents produced were scenario-appropriate in each case). That’s not trivial. It means a well-designed agentic workflow that can be improved over time, requires little administrative burden, and can operate in good faith with or without you is worth the investment. You’re not building a one-time tool. You’re building reusable infrastructure.
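The swap itself is as simple as pointing a loader at a different file. My scenario files are YAML; the sketch below uses JSON only because PyYAML is not in the Python standard library, and the field names are illustrative rather than my actual schema:

```python
import json
import tempfile
from pathlib import Path

# Sketch of the parametric swap. My real scenario files are YAML; JSON is
# used here only to keep the sketch dependency-free. Field names are
# illustrative, not my actual schema.

def load_scenario(path):
    """Same agents, same logic; only the inputs change."""
    return json.loads(Path(path).read_text())

stagflation = {
    "name": "Stagflation Stress Test",
    "cpi_yoy": 7.8,
    "fed_funds_rate": 7.0,
    "gdp_growth": 0.0,
    "notes": "regional bank failures, credit crisis",
}

# Swapping macroenvironments is just writing/choosing a different file.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "scenario_stagflation.json"
    path.write_text(json.dumps(stagflation))
    scenario = load_scenario(path)
    print(scenario["cpi_yoy"])  # 7.8
```

Everything downstream (agent roles, debate logic, scoring) reads from the loaded scenario, which is what makes the workflow reusable infrastructure rather than a one-time tool.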

The hard costs? Less than $0.40 in API costs (x3). The soft costs? Time and effort (but you learn and grow too).

What Comes Next

StandUply Pro exists as a concept. And an investment thesis generated by a small team of AI agents says it’s worth pursuing. However, that’s not the same as having a shippable product.

My next experiment will take the AutoGen output and hand it to a different framework, LangGraph, with a different job. The LangGraph agentic team will be asked to turn the business concept into product specifications. LangGraph uses state machine architecture, quality gates, loop-back logic, and will have human interrupt points designed in from the start (lesson learned). The question is no longer whether agents can identify a good business idea — we answered that one.

Now the question is whether agents can “build-out” a good business idea.

We’ll go hunting for the answer in my next installment.

Chad Schmookler is a Fractional COO/CPO with 20 years in operations and product leadership. Creator of the HIxAI operating philosophy. He writes on the convergence of AI, operations, and organizational strategy — and the gap between boardroom vision and operational reality. Follow Chad on LinkedIn: linkedin.com/in/cschmookler

Key Takeaways

  1. human_input_mode="ALWAYS" controls whether AutoGen asks for input when UserProxy is selected — it does not guarantee UserProxy will ever be selected. Explicit rotation membership is a separate requirement.
  2. Letting the mistake run without intervening yielded a baseline the corrected version couldn't have produced: agents applied genuine analytical rigor and delivered a defensible investment thesis without a single human comment.
  3. HiC (Human in Control) is an architectural requirement, not a configuration setting. The human must be a required node in the cycle — not an assumed participant the rotation may or may not call on.
  4. Blind human-in-the-loop may be worse than no loop at all. Without visibility into system state, uninformed human interventions add noise and risk rather than direction and clarity.
  5. Parametric architecture — swappable scenario files, same agents and logic — is reusable infrastructure. A well-designed agentic workflow compounds over time. You're not building a one-time tool.