On July 8th, I entered a bot I built – named “MWG” after its creator – into Metaculus’s first AI Forecasting Benchmark Series, which will occur quarterly for the next year. The contest is fairly simple: every day for three months the competing bots will submit predictions on 5 - 10 binary questions (e.g., will the U.S. offer India a nuclear submarine before September 30, 2024?). As the questions reach their resolution date, the bots are scored using Metaculus’s scoring methodology, which is kind of like a Brier Score but centered around how the bots perform relative to one another. The contest is “bot-only” with no human participation (other than building the bot and running it each day). As is often said on the Discord channel: no humans in the loop!
Metaculus launched the contest in order to understand how capable (or incapable) bots built around current-state Artificial Intelligence (AI) are at predicting future events: “AI forecasting accuracy is well below human level, but the gap is narrowing – and it’s important to know just how quickly,” it said on its website. My goals for participating are a bit different: I had been looking for a project that would let me get my hands dirty with LLMs and understand the state-of-the-art capabilities of these models in an applied setting.
Here’s a basic rundown of my Bot’s logic (a rough code sketch of the pipeline follows the list):
It grabs a prediction question from Metaculus.
It asks GPT-4o to formulate the instructions for AskNews and Perplexity, then uses those to gather relevant articles from AskNews and relevant context from Perplexity for the question.
It then has GPT-4o work through a series of steps and questions so the model thinks about the problem from a variety of different angles. Questions include things like “think of reasons why the answer might be yes,” “think of reasons why the answer might be no,” and “if the question were resolved today, what would the default resolution be?” There are about twelve questions like this.
GPT-4o is then asked to develop a prediction – the probability of the event occurring on a scale of 0 - 100 – based on all the work it has done and the context it has.
If the model’s answers vary across runs (rather than converging), it runs the prediction steps additional times. No matter what, it runs at least three times.
Once it has finished the individual runs, the Bot sends all of the individual forecasts and their reasoning to GPT-4o and asks it to produce a combined forecast.
All of these various forecasts are then weighted and the final number is submitted to Metaculus.
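To make the flow concrete, here is a minimal sketch of the pipeline in Python. Only the OpenAI client call reflects a real library API; the prompts, the divergence threshold, the aggregation, and the idea that the question, articles, and context have already been fetched from Metaculus, AskNews, and Perplexity are all illustrative assumptions, not my Bot’s actual code.

```python
import statistics
from openai import OpenAI  # real client; the rest of this sketch is illustrative

client = OpenAI()

def ask_gpt4o(prompt: str) -> str:
    """Send a single plain-English prompt to GPT-4o and return the text reply."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def run_single_forecast(question: str, articles: str, context: str) -> float:
    """Walk GPT-4o through a series of reasoning steps, then ask for a 0-100 probability."""
    steps = [
        "List a few reasons why the resolution might be YES.",
        "List a few reasons why the resolution might be NO.",
        "If the question were resolved today, what would the default resolution be?",
        # ...roughly a dozen prompts like this in the real Bot
    ]
    notes = [
        ask_gpt4o(f"Question: {question}\nNews: {articles}\nContext: {context}\n{step}")
        for step in steps
    ]
    answer = ask_gpt4o(
        "Based on the work below, give the probability (0-100) that the event occurs. "
        "Reply with a number only.\n\n" + "\n\n".join(notes)
    )
    return float(answer)  # real code would parse the reply more defensively

def forecast(question: str, articles: str, context: str, min_runs: int = 3) -> float:
    """Run at least three forecasts, rerun while they diverge, then combine them."""
    runs = [run_single_forecast(question, articles, context) for _ in range(min_runs)]
    while statistics.pstdev(runs) > 10 and len(runs) < 8:  # divergence threshold is made up
        runs.append(run_single_forecast(question, articles, context))
    # The real Bot also asks GPT-4o to make a forecast from the forecasts and weights
    # everything; a plain average stands in for that here.
    return sum(runs) / len(runs)
```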
It took me about a week of working on and off to build the Bot. It’s still early in the competition (only about a third of the questions have been resolved), so it’s really hard to say how it’s going. That said, the answers and logic I am getting back from my Bot are generally reasonable and seem directionally correct. For instance, the Bot’s final submission on the submarine question above was 2%. Certainly not insane. Right now, my Bot ranks in the top five on the leaderboard, but we have a ways to go.
Based on this experience, I wanted to offer five observations on working with LLMs to build things:
Observation #1: Coding is exponentially easier than it was even just a couple of years ago. I have coded off and on since I was a teenager. Mostly, I’ve worked in Python. Coding has been an intermittent hobby and something I’ve done at work a few times, but don’t let that confuse you: I am not an especially good programmer and no one would call me an engineer.
And yet in 10 days I was able to stand up a fully functioning AI prediction bot with a meaningful amount of functionality. It was several hours of work but those hours were almost all enjoyable and marked by only a couple of moments of true frustration (in strong contrast to some of my previous programming efforts).
I credit this only partially to advances in AI. Really, it was the result of an entire ecosystem of tools and systems. Below I list five elements that I would credit – in combination – with making me a much, much more proficient programmer. I won’t describe each one in detail, but I will say a quick word about why each was so useful.
Google Colab Notebook, including Gemini. The set-up here makes coding so easy. You can quickly and easily run what you’ve written and see the output, which encourages you to work iteratively (and makes resolving errors easier because you only move a step or two before confirming things work). It highlights errors and problems, like a spellchecker. And Gemini helps you debug problems and figure out how to do new things. I ran something like 20 errors through Gemini: in 16 cases it solved the problem, in 3 cases it had a thought but was wrong, and in one case it was on completely the wrong track. A pretty amazing hit rate.
LLMs. As described above, much of my model is prompts submitted via API to an LLM (currently OpenAI). Being able to submit instructions in plain English and have a smart, capable “research assistant” shoot back an answer in a matter of minutes is an amazing superpower. You can do tremendous work by simply writing regular English sentences.
Shared code. The starting point for almost everything I did was a code snippet from someone else. My general approach was to find something similar to what I wanted to do on GitHub or in a Colab Notebook (or to ask Gemini to get me started), try it out in its native environment, and then grab the relevant code and move it into my Notebook. It’s like having a huge team of programmers at your disposal.
APIs. I use four APIs in my model: Metaculus, OpenAI, Perplexity, and AskNews, and I’m considering adding others (namely Claude and NewsCaster). Some of these services are more robust and / or easier to use than others, but no matter what, they make it relatively easy to access tremendous power (e.g., OpenAI) and information (e.g., Perplexity). (A quick sketch of one of these calls follows this list.)
Discord. On a few occasions, I got really stuck. I couldn’t do what I wanted to do. Gemini couldn’t answer my question. Things just weren’t working. In those cases, I turned to the community on Discord – once to the Metaculus community and once to the AskNews community. Both times people were super friendly and helpful, and I got the issue resolved within a day.
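Tying the APIs point above to code: here is a hedged sketch of a Perplexity call. Perplexity exposes an OpenAI-compatible chat endpoint, so the same client pattern from the pipeline sketch works once it is pointed at Perplexity’s base URL; the model name below is an assumption and should be checked against Perplexity’s current model list.

```python
import os
from openai import OpenAI

# Perplexity follows the OpenAI chat-completions format, so the same client
# works once it's pointed at Perplexity's endpoint.
perplexity = OpenAI(
    api_key=os.environ["PERPLEXITY_API_KEY"],
    base_url="https://api.perplexity.ai",
)

response = perplexity.chat.completions.create(
    model="sonar",  # assumed model name; check Perplexity's current offerings
    messages=[{
        "role": "user",
        "content": "Summarize recent reporting relevant to this forecasting question: ...",
    }],
)
print(response.choices[0].message.content)
```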
Some of these tools have been around for years while others are fairly new but collectively they’re super powerful. They create an ecosystem where novices can put together pretty interesting stuff. It will be fascinating to watch how these forces change software development over the next few years as well as how they change businesses now that amateurs can build useful things themselves with code.
Observation #2: It’s easy to get LLMs to follow analytical steps; it’s harder to get them to think about cognitive biases. My Bot uses something on the order of 20 different prompts (12 in the main prediction logic and the remainder for finding supporting information). They cover a wide range of approaches (and I won’t cover even a majority of them here). From the start, it was clear that the model was better at some types of prompts than others. Two types stood out to me: first, evaluative questions that largely involve synthesizing the headlines, articles, and other materials being served up; and second, prompts that encourage the Bot to adjust for its possible biases.
Let me give an example of each type of prompt (a sketch of how such a prompt gets wired into the Bot follows the two examples):
Evaluative
“Using your knowledge of the world and the topic, as well as the information I have provided you, list a few reasons why the resolution might be NO.”
Bias Adjustment
“Here’s one thing to think about in particular: if your forecast has a score of less than 10% or more than 90%, chances are it suffers from a failure to ‘extremize’ the forecast. When you get near extremes – like 0% and 100% – forecasters (including Bot forecasters) tend to hedge. Given this, do you want to adjust your answer?”
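To show how a prompt like the bias-adjustment one slots in, here is a minimal sketch of the chaining, assuming the same OpenAI client as in the earlier pipeline sketch. The structure is the point (feed the draft forecast back in, then ask the follow-up); the wording is paraphrased from the prompts above rather than copied from my Bot.

```python
from openai import OpenAI

client = OpenAI()

def forecast_with_bias_check(question: str, evidence: str) -> str:
    """Ask for reasons NO and a probability, then append the bias-adjustment
    follow-up to the same conversation so the model sees its own answer."""
    messages = [{"role": "user", "content": (
        f"Question: {question}\nEvidence: {evidence}\n"
        "Using your knowledge of the world and the topic, as well as the information "
        "I have provided you, list a few reasons why the resolution might be NO. "
        "Then give your probability (0-100) that the resolution is YES."
    )}]
    first = client.chat.completions.create(model="gpt-4o", messages=messages)
    draft = first.choices[0].message.content

    # Feed the draft back in and ask the bias-adjustment question as a follow-up.
    messages.append({"role": "assistant", "content": draft})
    messages.append({"role": "user", "content": (
        "If your forecast is less than 10% or more than 90%, chances are it suffers "
        "from a failure to 'extremize' the forecast. When they get near extremes, "
        "forecasters (including Bot forecasters) tend to hedge. Given this, do you "
        "want to adjust your answer?"
    )})
    second = client.chat.completions.create(model="gpt-4o", messages=messages)
    return second.choices[0].message.content
```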
My Bot’s performance across these types of questions varied meaningfully. The Bot’s sweet spot was the evaluative questions. It was great at generating lists of reasons yes and no and generally ranking them in a reasonable order of importance. Rarely (less than 10% of the time) did I look at the list and think: “Wow, did it miss an important point.”
Bias-adjustment prompts, on the other hand, were a weak spot. Never – in dozens and dozens of test runs – did it make a bias adjustment, even when doing so was fairly obviously correct. It’s fun to think about why this might be. On the one hand, it’s noteworthy that the Bot doesn’t just defer to its creator and do essentially what I’m telling it to do, i.e., knock a few points off any score less than 10% or add a few points to anything more than 90%. It instead seems to really “think” about whether it’s biased in this fashion, and every time it concludes that it is not. It reacts specifically to the prompt, saying it has reweighed the evidence and come to the same conclusion. You could also kind of imagine it thinking that it’s more rational than a human and therefore not in need of this kind of adjustment. But, interestingly, this bias is true of bots too, as a team of researchers from Berkeley discussed in this paper. Telling it specifically that it – the Bot – has this bias also had no impact, so I tend to think the Bot believes it is in fact pretty rational.
Observation #3: I found it hard to be super scientific while building my Bot. My biggest disappointment so far in building my Bot is that I haven’t taken a more scientific approach to crafting it. I generate ideas from common sense (e.g., it would be useful to have the Bot think about “reasons for yes” and “reasons for no”) as well as from the forecasting literature (e.g., base rates are everything). I then try out the ideas on three or four questions, read the results, and drop the prompt if it doesn’t seem helpful or leave it in if it does. Of course, this is far, far, far from some sort of scientific method. On some level, this is due simply to how much time I can dedicate to the project.
What is more interesting are the challenges that stand in the way of being more scientific. There are at least three:
First, there are so many degrees of freedom in shaping the Bot that you can’t be super scientific about everything. Think about some of the prompts I shared above. There are a half dozen reasonable ways to write each one. Are you going to test all of those? For instance, there was a debate on Discord among bot builders about whether it was useful to say “please” and “thank you” to your bot. Is it worth doing a few hundred runs to prove that one way or another?
That leads to the second challenge: testing all of those would get expensive in both time and money. To point #4 below, let’s say each run costs only $3.00 and takes 4 minutes, you need 250 runs to establish whether one approach is better than another, and there are literally hundreds of things like this to test. You’re rapidly in a world where you just can’t test everything in this rigorous fashion. So you have to use common sense and smaller samples to cut through the problem, and really prioritize where you’re going to run much more robust tests.
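To put rough numbers on that, here is the back-of-the-envelope math, using the same illustrative figures as above (the count of 200 comparisons is my own placeholder for “hundreds of things”):

```python
# Back-of-the-envelope cost of rigorously A/B testing prompt variations.
# All inputs are illustrative figures from the text, not measured values.
cost_per_run = 3.00        # dollars
minutes_per_run = 4
runs_per_comparison = 250
comparisons_to_test = 200  # placeholder for "hundreds of things like this"

cost_per_comparison = cost_per_run * runs_per_comparison            # $750
hours_per_comparison = minutes_per_run * runs_per_comparison / 60   # ~16.7 hours

total_cost = cost_per_comparison * comparisons_to_test              # $150,000
total_days = hours_per_comparison * comparisons_to_test / 24        # ~139 days of runtime

print(f"${cost_per_comparison:,.0f} and {hours_per_comparison:.1f} hours per comparison")
print(f"${total_cost:,.0f} and {total_days:.0f} days of runtime for {comparisons_to_test} comparisons")
```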
The third and final reason I am not working in a more scientific fashion is that I don’t have a great test set-up. Ideally, I would have hundreds of questions that have already been resolved (i.e., the events happened in the past) and that human forecasters had worked on (so I would have a baseline). That is certainly doable – it’s just a matter of scraping the prediction sites to create it. But then you run into a leakage problem: these LLMs will have been trained on data from the past, so you have to figure out ways to keep knowledge of the outcomes out of the test (which is very hard).
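As a sketch of what assembling that test set might look like, here is one way to pull already-resolved questions from Metaculus. The endpoint and query parameters are assumptions based on Metaculus’s public API and would need to be checked against the current docs; the point is just that building the question set is mechanical.

```python
import requests

# Hypothetical sketch: pull already-resolved questions from Metaculus's public API
# to build a backtest set. The endpoint and parameter names are assumptions and
# should be verified against the current API documentation.
API_URL = "https://www.metaculus.com/api2/questions/"

def fetch_resolved_questions(limit: int = 100) -> list[dict]:
    resp = requests.get(
        API_URL,
        params={"status": "resolved", "limit": limit},  # assumed filter names
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("results", [])

if __name__ == "__main__":
    questions = fetch_resolved_questions()
    print(f"Fetched {len(questions)} resolved questions")
```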
In short, I could certainly be doing better in terms of my scientific approach and will try to do so over the quarter. But there are some obstacles to it as well.
Observation #4: A pretty good predictor bot isn’t very expensive. There was a high volume of discussion in the forums and on Discord when the contest launched about the cost of participating. Some people went so far as to suggest that $30,000 per quarter in prize money wouldn’t come close to covering the costs of building and running a bot, which some pegged at greater than $100,000. To be clear, first, I have no idea what other people or teams are doing; it may be amazing and may require that kind of spend. Second, I have no idea how good my Bot is other than to say it’s reasonable. What I do know is my costs, and they are nowhere near the numbers mentioned above. So I’ve been able to build and run a reasonable bot at very low cost.
Here’s a quick summary:
Net net: you’re looking at less than $1000 to build the model and run it on the 500 or so questions Metaculus plans to deploy during the contest. It’s not chump change but it’s shockingly less than I was expecting. Also, imagine the counterfactual: what it would cost to hire a human forecaster to do the work on 5 - 10 questions a day. It’s hard to imagine those costs wouldn’t be at least 5 - 10x. For a super-forecaster, it would be way more.
Of course, the other key question is what these services cost OpenAI, Perplexity, etc. to provide to me. Are they losing money on this? How long can they do that for? Where will these prices ultimately head? That’s what we would need to know to estimate whether you can cost-effectively operate a prediction bot at scale. But, at least for today, the answer is that you can.
Observation #5: My Bot has been fairly well behaved. For all the talk of hallucinations and such, I was surprised that across dozens and dozens of runs my Bot only went truly sideways once that I noticed (described below). No super crazy facts or obviously loony reasoning. The Bot did have outlier predictions, which is why running it multiple times is important. Here’s an example:
It also missed things. For instance, on a question in July about Trump becoming President, it was very zeroed in on his conviction (which had recently happened) and concerns about Biden’s age, but didn’t really seem to grasp the finer points of the Electoral College. (As such, it predicted about a 50% chance of Trump becoming President, whereas Nate Silver at that time put it at 70%.)
The one crazy thing it did was this: I asked it to prioritize a bunch of news articles from AskNews and send back the headlines and summaries in order of importance. Due to a coding error, at one point I was not actually passing the feedstock, i.e., the articles, to the LLM. That didn’t stop it. It made up 20 super relevant headlines and summaries. In fact, they were so good that that’s how I noticed the problem: I didn’t think the topic would have that many on-point articles from such top-rated media. And I was right. Interestingly, the Bot never pointed out the problem.
But all in all I would put all of that in the land of reasonable – nothing too crazy about it.
Next
The first round of the contest runs until October 1st; another round starts shortly thereafter. You’re allowed to update your bot based on how it’s performing, so I have been working to improve mine. Also, more exciting, I will see where my Bot ends up relative to the other contestants (currently about 35 active participants). Hopefully, it’s interesting enough to warrant another post. In the meantime, if you have questions, let me know.