I try to save money when ordering groceries online by first going through the half price list, then the multi buy and general discounted list, before falling back to just searching for the items I want. This works really well to save me money; the problem is that it takes wayyyy too long. As I write this, the specials list contains 8919 items!
Make da robot do it. (Using a local model so I don't have to pay anyone)
I first tried making the agent interact with the browser by just giving it access to the Playwright MCP server. I didn't have high hopes, but I figured that if it worked it would save me a lot of time and effort.
It didn't work at all.
The Playwright MCP tools read the page's accessibility tree and act on that, and I think that is just far too abstract for an LLM to navigate. Maybe it would've been able to get by if the model was better, but I don't have access to hardware that can run those. Below is a single item taken from the accessibility tree of the page that lists bakery items.
- generic [ref=e7234]:
- generic [ref=e7238]:
- link "Coles Pikelets 8 Pack | 200g" [ref=e7240] [cursor=pointer]:
- /url: /product/coles-pikelets-8-pack-200g-4799340
- generic [ref=e7243]: EVERY DAY
- link "Coles Pikelets 8 Pack | 200g" [ref=e7245] [cursor=pointer]:
- /url: /product/coles-pikelets-8-pack-200g-4799340
- heading "Coles Pikelets 8 Pack | 200g" [level=2] [ref=e7246]
- generic [ref=e7247]:
- generic [ref=e7250]:
- generic "Price $2.50" [ref=e7252]: $2.50
- generic [ref=e7254]: $1.25/ 100g
- generic [ref=e7256]:
- button "save to list Pikelets 8 Pack" [ref=e7257] [cursor=pointer]:
- img [ref=e7259]
- generic [ref=e7263]:
- status [ref=e7264]: Product is not in your trolley
- 'button "Add to trolley: Coles Pikelets 8 Pack" [ref=e7265] [cursor=pointer]':
- generic [ref=e7266]:
- img [ref=e7267]
- text: Add
Now imagine trying to work with that complexity, but for a page with 60 items and a bunch of other irrelevant stuff, like ads, terms and conditions, and links to Coles's social media and apps. The full snapshot of that page, which is what the LLM would have to process, is 100257 characters long, which at the usual rough ratio of 4 characters per token is about 25k tokens... for one page. The models I'm able to run have small context windows so I can't afford to be that wasteful. On top of that, all that extra junk would make the agent perform worse.
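To make that arithmetic concrete, here is a small sketch of the estimate. The 4 characters per token ratio is only a rule of thumb (real tokenizers vary), and the function names and the 15000-token window are illustrative choices, not anything from the actual project.

```typescript
// Rule-of-thumb ratio; real tokenizers will give different numbers.
const CHARS_PER_TOKEN = 4;

// Estimate how many tokens a piece of text of the given length will use.
function estimateTokens(charCount: number): number {
  return Math.ceil(charCount / CHARS_PER_TOKEN);
}

// How many context windows of `contextTokens` the text would fill.
function contextShare(charCount: number, contextTokens: number): number {
  return estimateTokens(charCount) / contextTokens;
}

const snapshotChars = 100257; // full snapshot of the bakery page
console.log(estimateTokens(snapshotChars)); // 25065
console.log(contextShare(snapshotChars, 15000).toFixed(2)); // 1.67
```

So a single page snapshot would overflow a 15000-token window by about two thirds before the agent has done anything useful.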
I realised I was going to have to make it as easy as possible for the agent to get information from Coles. I would've loved to be able to make an MCP server that just hits Coles's APIs, but they don't have any public ones, and I think if I just went wild on the ones their website uses I would get flagged as a bot and IP banned or something, which would really suck. So, I decided to make my own MCP server that would interact with the website in a much more usable way. Instead of navigating the accessibility tree, this MCP server would expose specialised tools to perform high level actions. Here are some examples:
| Action | Description |
|---|---|
| list_categories | List the names and ids of the top-level product categories. |
| list_category_products | List products for a specific category or subcategory. Returns a list of JSON objects with the product id, name, description and price. |
| add_to_cart | Add a product to the shopping cart by its id. |
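
As a rough illustration of what tools like these could look like in code, here is a minimal sketch of a tool registry and dispatcher. Everything in it is an assumption made for the example: the `Category` and `Product` shapes are guessed from the table, the `Site` interface is a hypothetical stand-in for whatever talks to the page, and the argument names `categoryId` and `productId` are invented. It does not use the real MCP SDK.

```typescript
// Shapes guessed from the table above, not the project's real DTOs.
interface Category { id: string; name: string; }
interface Product { id: string; name: string; description: string; price: number; }

// Hypothetical stand-in for the code that actually drives the website.
interface Site {
  categories(): Promise<Category[]>;
  products(categoryId: string): Promise<Product[]>;
  addToCart(productId: string): Promise<void>;
}

// A tool takes a JSON-like argument object and returns JSON-serialisable data.
type ToolHandler = (args: Record<string, unknown>) => Promise<unknown>;

function buildTools(site: Site): Record<string, ToolHandler> {
  return {
    list_categories: async () => site.categories(),
    list_category_products: async (args) => site.products(String(args.categoryId)),
    add_to_cart: async (args) => {
      await site.addToCart(String(args.productId));
      return { ok: true };
    },
  };
}

// Look a tool up by name and run it, failing loudly on unknown names.
async function callTool(
  tools: Record<string, ToolHandler>,
  name: string,
  args: Record<string, unknown>,
): Promise<unknown> {
  if (!Object.prototype.hasOwnProperty.call(tools, name)) {
    throw new Error(`Unknown tool: ${name}`);
  }
  return tools[name](args);
}
```

The point of the design is that the model only ever sees small, structured results (a list of products with an id, name, description and price) instead of a 100k character page snapshot.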
Setting this up was a pain in the ass. I wanted to just host the MCP server in the extension itself, but because extensions can't read/write to stdin/stdout, nor listen on a socket, I had to make a separate application that would host the MCP server and also act as a WebSocket server for the extension to connect to. I decided to just make this a Node application so I could share the DTOs with the extension.
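One way a bridge like this can work is to tag every request with an id and resolve the matching promise when the extension replies. The sketch below shows that pattern only. The message shapes and the `Bridge` class are hypothetical, and `send` stands in for the WebSocket's send method so the example stays free of any particular WebSocket library.

```typescript
// Hypothetical message shapes shared by the Node app and the extension.
interface BridgeRequest { id: number; action: string; payload: unknown; }
interface BridgeResponse { id: number; ok: boolean; result?: unknown; error?: string; }

interface Pending {
  resolve: (value: unknown) => void;
  reject: (reason: Error) => void;
}

class Bridge {
  private nextId = 1;
  private pending = new Map<number, Pending>();
  private send: (data: string) => void;

  // In a real setup `send` would wrap the WebSocket connection to the extension.
  constructor(send: (data: string) => void) {
    this.send = send;
  }

  // Called by a tool handler: forwards the action and waits for the matching reply.
  request(action: string, payload: unknown): Promise<unknown> {
    const id = this.nextId++;
    return new Promise((resolve, reject) => {
      this.pending.set(id, { resolve, reject });
      const message: BridgeRequest = { id, action, payload };
      this.send(JSON.stringify(message));
    });
  }

  // Called for every message received from the extension.
  handleMessage(data: string): void {
    const response = JSON.parse(data) as BridgeResponse;
    const entry = this.pending.get(response.id);
    if (!entry) return; // unknown or already settled id
    this.pending.delete(response.id);
    if (response.ok) entry.resolve(response.result);
    else entry.reject(new Error(response.error ?? "Extension reported an error"));
  }
}
```

A real version would also want a timeout per request, so a tool call does not hang forever if the extension never answers.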
Once I had something set up that could bridge the communication between the extension and agent, the battle was certainly not over. Different parts of a Chrome extension can do different things. So I have a service worker that handles the WebSocket connection to the MCP server, but it can't read the webpage. This service worker talks to a content script that can read and interact with the webpage, but can't interact with any variables defined by the website. The content script talks to injected scripts that can hook into fetch requests to get information cleanly and interact with the Next.js router for non-bot-like navigation.
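Hooking into fetch usually means wrapping the page's fetch function so that responses can be observed on their way back to the site. The sketch below shows one way to do that. It is not the extension's actual code, and `FetchLike`, `hookFetch` and `onJson` are names made up for the example. It also simplifies by only accepting string URLs, whereas the real `fetch` accepts `Request` and `URL` objects too.

```typescript
// Simplified fetch signature: string URLs only (real fetch accepts more).
type FetchLike = (url: string, init?: RequestInit) => Promise<Response>;

// Wrap a fetch-like function so every JSON response is also passed to `onJson`.
function hookFetch(
  original: FetchLike,
  onJson: (url: string, body: unknown) => void,
): FetchLike {
  return async (url, init) => {
    const response = await original(url, init);
    try {
      const contentType = response.headers.get("content-type") ?? "";
      if (contentType.includes("application/json")) {
        // Read a clone so the page can still consume the original body.
        onJson(url, await response.clone().json());
      }
    } catch {
      // Never let the observer break the page's own request.
    }
    return response;
  };
}
```

In an injected script the wrapper would replace the page's own fetch, and the observed data would then have to be passed back to the content script, for example with `window.postMessage`.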
Finally, this worked and could interact with the webpage in whatever way I wanted. It has its own issues, such as the service worker throwing ERR_CONNECTION_REFUSED exceptions if the MCP server isn't running (you can't swallow or silence the exceptions from failed WebSocket connections for some godforsaken reason).
I had a test prompt of "Add some bread and milk to my cart." and a system prompt that is now lost to time. I implemented all the different tools the LLM would need to navigate the website and create an order, and eventually it was able to complete the task, which felt awesome. However, once I got it trying to look for more items, and especially when I had it review my previous orders, it fell apart. It would start out fine, but as soon as it filled up the very small context window of 15k tokens, it would forget what it had already done and start over. Damn.
I could see a couple of solutions to this problem. I could give it some tools for tracking its own progress and then try to convince it to use them often. Or I could stop using LM Studio as the harness and build my own. I went with the latter as it would allow me to create specialised subagents for each task. I would be able to give each of these agents their own prompts, parameters, models and tools, and I would have precise control over the context window.
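As one example of what controlling the context window can mean in practice, a harness can decide exactly which messages get sent to the model on each turn. The sketch below keeps the system prompt and then as many of the newest messages as fit a token budget. This is just one possible policy, not necessarily what the engine does, and the `Message` shape, the function names and the 4 characters per token estimate are all assumptions.

```typescript
interface Message {
  role: "system" | "user" | "assistant" | "tool";
  content: string;
}

// Rough size estimate, assuming about 4 characters per token.
function messageTokens(message: Message): number {
  return Math.ceil(message.content.length / 4);
}

// Keep every system message, then the newest other messages that fit the budget.
function fitToBudget(messages: Message[], budgetTokens: number): Message[] {
  const system = messages.filter((m) => m.role === "system");
  const rest = messages.filter((m) => m.role !== "system");
  let used = system.reduce((sum, m) => sum + messageTokens(m), 0);
  const kept: Message[] = [];
  // Walk from newest to oldest so recent context wins.
  for (const message of [...rest].reverse()) {
    const cost = messageTokens(message);
    if (used + cost > budgetTokens) break;
    used += cost;
    kept.unshift(message);
  }
  return [...system, ...kept];
}
```

Dropping old messages on its own would still make an agent forget earlier work, which is why splitting the job into subagents with small, focused histories is the more interesting half of the idea.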
This "engine" is really just a script that uses the LM Studio SDK to interact with the LLM. The LLM is still running in LM Studio and I can interact with it via REST endpoints. I had to write some code to hand tool calls off to the MCP server, which was probably the most difficult part, but other than that it wasn't too hard to get something working. The real challenge now is creating good agents. I used to laugh at the idea of a 'Prompt Engineer' but now that's what I'm doing and, when working with smaller models, it's actually quite hard.
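The tool hand-off generally boils down to a loop: ask the model, run whatever tools it requested, append the results to the history, and repeat until it answers without asking for a tool. Here is a minimal sketch of that loop. It deliberately does not use the LM Studio SDK or the MCP SDK. `chat` and `callTool` are hypothetical functions passed in by the caller, and the reply shape is invented for the example.

```typescript
interface ToolCall { name: string; args: Record<string, unknown>; }
interface ModelReply { text: string; toolCalls: ToolCall[]; }
interface ChatMessage {
  role: "system" | "user" | "assistant" | "tool";
  content: string;
}

// Hypothetical stand-ins: `Chat` for the call into the model,
// `CallTool` for the hand-off to the MCP server.
type Chat = (history: ChatMessage[]) => Promise<ModelReply>;
type CallTool = (call: ToolCall) => Promise<unknown>;

async function runAgent(
  chat: Chat,
  callTool: CallTool,
  history: ChatMessage[],
  maxSteps = 10,
): Promise<string> {
  for (let step = 0; step < maxSteps; step++) {
    const reply = await chat(history);
    history.push({ role: "assistant", content: reply.text });
    // No tool requests means the model considers the task finished.
    if (reply.toolCalls.length === 0) return reply.text;
    for (const call of reply.toolCalls) {
      const result = await callTool(call);
      history.push({ role: "tool", content: JSON.stringify({ tool: call.name, result }) });
    }
  }
  throw new Error("Agent did not finish within the step limit");
}
```

The step limit is there because a small model can get stuck calling the same tool over and over.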
After I started building the engine I realised that the Qwen3 family of models I had been using had unusually small context windows, all maxing out at 32k. For comparison, the GPT-OSS models have a context window of 131k and still fit in 14GB of VRAM. So, the main reason I changed my approach was probably invalid, but I think all the other benefits still apply and changing to this new approach will prove to be worthwhile.
Here are the agents I've created so far:
This agent is responsible for looking at my previous orders, figuring out my preferences, and then generating a shopping list that the later agents will try to complete.
The shopping list is actually more of a request list, as the items on the list aren't necessarily particular products, or even categories of products. The requests can be something like "Dips that go well with carrots" or "Some frozen meals, but none containing fish". Alongside the criteria, each request says how much to buy, which can be something like "250g", "5", or "Enough for 4 sandwiches".
This may be the most important agent to get right, as even if the later agents perfectly fulfil the requests, if the requests aren't good then I won't end up with an order I like.
This agent will need a model that can hold all of my previous orders in its context window. Unfortunately, I'll also need it to be quite smart to pick up on patterns and preferences.
An instance of the Product Candidate Selector agent is created for each request in the request list. Its job is to look through Coles's expansive catalog and find items that match the criteria for the request. It just needs to build a rough list, so it doesn't take into account the requested quantity (or even have access to that information).
While this agent is going to be ingesting the most information out of all the agents as it scans through large portions of the catalog, it's not important that it remembers products it previously saw, so I don't need to give it a model with a large context window. I also don't think it needs to be that smart, so I'll probably prioritise speed.
This agent is similar to the Product Candidate Selector, in that there is one instance for each request. However, instead of having access to all of Coles, it can only see the items that the Product Candidate Selector picked. Its job, as the name implies, is to narrow down the candidate list and decide exactly which items to buy, and how many of each. It has access to a tool that lets it look up a bunch of details about each item so it can make a more informed decision than the previous agent could. This is also a really important agent to get right as it's the one that will be making the final decision about what products get added to the order. I'll eventually want it to be able to compare prices, consider multi buy deals (e.g. any 2 for $10), consider alternative products, and analyse the ingredients to make sure it's something I would like.
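Price comparison with multi buy deals is the kind of arithmetic that is probably better done in a tool than left to a small model. Below is a sketch of how such a helper might work. The `MultiBuy` shape and function name are hypothetical, prices are in cents to avoid floating point rounding, and it only handles buying several units of the same product (a real "any 2 for $10" deal can mix different products).

```typescript
// A deal such as "any 2 for $10" is { quantity: 2, totalPriceCents: 1000 }.
interface MultiBuy {
  quantity: number;
  totalPriceCents: number;
}

// Cost of `units` of one product, applying the deal to as many full groups as possible.
function costWithMultiBuy(units: number, unitPriceCents: number, deal?: MultiBuy): number {
  if (!deal) return units * unitPriceCents;
  const groups = Math.floor(units / deal.quantity);
  const leftovers = units % deal.quantity;
  return groups * deal.totalPriceCents + leftovers * unitPriceCents;
}

const deal: MultiBuy = { quantity: 2, totalPriceCents: 1000 };
console.log(costWithMultiBuy(3, 600, deal)); // 1600: one pair at $10 plus one at $6
console.log(costWithMultiBuy(3, 600)); // 1800 without the deal
```

Comparing the two totals shows whether bumping the quantity up to the next multiple of the deal is worth it.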
This agent will need a model with a large context window as it's going to be thinking a lot and will need to have a lot of information from each of the candidates in order to make good decisions. It will also need to be a very smart model, so it makes the correct decisions, but I think it will be fine to sacrifice speed.
I haven't started to create this agent yet, but the idea is that it will just make sure that nothing is missing from the cart and that the quantities are correct. I might not even have to make it if the other agents are good enough.
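Put together, the agents described above form a simple pipeline: one agent turns previous orders into requests, then each request gets its own candidate search followed by a final selection. The sketch below shows that flow with each agent reduced to a hypothetical async function. The type and method names are invented for illustration, and the final checking agent is left out since it has not been built.

```typescript
// A request is free-form criteria plus a free-form quantity, e.g. "250g" or "5".
interface ShoppingRequest { criteria: string; quantity: string; }
interface CartLine { productId: string; units: number; }

// Hypothetical stand-ins for the agents, one function each.
interface Agents {
  planRequests(previousOrders: string[]): Promise<ShoppingRequest[]>;
  // Only sees the criteria, never the quantity.
  findCandidates(criteria: string): Promise<string[]>;
  // Sees the full request but only the shortlisted product ids.
  chooseItems(request: ShoppingRequest, candidateIds: string[]): Promise<CartLine[]>;
}

async function buildOrder(agents: Agents, previousOrders: string[]): Promise<CartLine[]> {
  const requests = await agents.planRequests(previousOrders);
  const cart: CartLine[] = [];
  for (const request of requests) {
    const candidates = await agents.findCandidates(request.criteria);
    if (candidates.length === 0) continue; // nothing suitable found for this request
    cart.push(...(await agents.chooseItems(request, candidates)));
  }
  return cart;
}
```

Because each request is handled independently, every agent instance starts with a fresh, small context, and the requests could be processed in parallel if the hardware allows it.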