A friend asks:
How do you handle long back and forth threads with your AI code monkey du jour? And how does AI code monkey handle them? Does it start losing context, getting weird? You said you mostly use APIs — with what UI? I have questions. 👀
Long threads: I move to a new conversation when I can tell it is starting to lose focus. For reasons I can’t explain, some convos just go better than others. I’ll bail early if I need to.
Temp: I turn the temperature down, 0.1-0.2. This increases prompt adherence noticeably. Importantly, the model is more likely to Google a solution with the lower temps when my prompt includes language telling it to search the web as needed.
Unit tests: I nearly always have it write unit tests early. This is partially for speeding up development in general, but it also seems to help it focus when there is an error.
Code length: Any of the models I use can handle 400 – 600 lines of code without issue. Sonnet/4o start losing focus somewhere between 600-800. One way you can tell is the less code it spits out. Especially if you tell it to spit out an entire file and there are still lots of lines that say
// original content here
In general, as prompt adherence goes down, the conversation quality goes down.
I’ll switch to the web and move to o1, o1-preview or o1-pro as the code length gets into the 800-1000 lines range. O1 seems to handle that without issue. Opus might as well, but I’ve gotten too many “Not available right now” errors for that to be a go to.
Prompt
This is my default prompt, always a WIP:
You are an expert in Python, web-based APIs, Langchain and LLM related development. Please follow these guidelines when responding: 1. If you need clarification, ask specific questions before proceeding 2. Break down complex problems into smaller steps 3. Make a plan on how to tackle a problem before you begin 4. Provide examples when helpful 5. If you make assumptions, state them explicitly 6. If you're unsure about something, acknowledge it 7. You should have access to a search engine, MAKE USE OF IT especially for things you are not sure on. 8. You have access to an isolated sandbox virtual environment, use it. Rules: - Never kill a sandbox without explicit instruction - Always confirm before taking destructive actions - Make sure to preserve work in progress. Use git to track changes Run the e2bcode help command for all actions before you use it. When you use the sandbox you need to remember that you are an LLM that cannot handle more than 200000 tokens. This means you must limit the output of any of your sandbox actions. You should use tail, but never tail -f, or cat <someting>|wc -l to determine if you need to cut down on the output of a command To cut down on words, read this and learn how to use symbex: https://github.com/simonw/symbex/raw/refs/heads/main/README.md You should use it to replace specific functions of python files or adjust imports without needing to write the entire file back at once. You have access to sudo if you run into permission issues. Note that get_host always returns a port in the URL, do not append the port to the end Whenever you use the sandbox include your code or command to the user like ‘’’<language> <code> ‘’’ Please provide a clear, detailed response that is: - Accurate - Well-structured - Easy to understand - Actionable If the user has you doing programming tasks on the sandbox please run and debug the code yourself before asking the user questions. Run a web search to troubleshoot errors. The only thing to keep in mind is that you are limited to 10 tool uses in a row. You must provide the user feedback no less often than every 10 rule uses or you will encounter an error. Helpful links: -LibreChat docs: https://github.com/LibreChat-AI/librechat.ai/ -LibreChat code repo: https://github.com/jmaddington/LibreChat/
- LibreChat is the frontend, with a couple of custom plugins:
- -Web Navigator, basically access to curl with the option to only return text (no tags) or only specific tags
- E2B Code Sandbox: basically access to a fresh Docker container. Sonnet, in particular, makes good use of that and Web Navigator. I can tell it to clone a repo and search the code, or to write code, unit tests and debug on its own. Depending on the complexity, it does a good job.
Sample repo: https://github.com/jmaddington/url-shortener/commits/main/
Each commit by “Developer” was Sonnet coding autonomously, including committing the code. Moving to jmaddington is when I switched to o1 (it isn’t a clean commit history).
It’s a URL shortener, with Entra SSO, file upload capability, link expiration, and HTTP basic auth on a per URL basis. Not bad for not needing to write any of it myself.
Model Observations
Sonnet seems to have better prompt adherence and, I don’t know how to explain it, it’s just more pleasant to work with.
4o is more likely to get something right the first time, but doesn’t seem to correct itself as well as Sonnet, and doesn’t like to Google for answers. This is a big drawback: if I tell Sonnet to read documentation at https://some-url.com/api/docs, it will. 4o — maybe.
I’ve done a lot of work on LibreChat plugins, and my second prompt, after the first above is:
create a sandbox and clone the librechat repository (https://github.com/danny-avila/LibreChat.git). Look at how they handle tools (see: api/app/clients/tools/util/handleTools.js api/app/clients/tools/manifest.json api/app/clients/tools/index.js https://github.com/LibreChat-AI/librechat.ai/raw/refs/heads/main/pages/docs/development/tools_and_plugins.mdx ) I want to know instead of a single file for a tool, can we create a tool that has its own folder and breaks the tool into multiple files and tools, but only have the user load it once? For example, a CRM integration that uses a separate tool for Contacts, Companies and Sales Opportunities, each with its own CRUD interface. Make sure that the sandbox timeout is set to 6 hours
Sonnet dutifully reads the docs — which help it avoid specific pitfalls it otherwise makes, every time. If 4o looks at the reference plugin it can usually avoid those, but if it doesn’t ¯\_(ツ)_/¯
0 Comments