Hold up, Multi-arm Bandits? | Tucson Richelson

In my vast expeditions into the depths of Wikipedia, I stumbled on the idea of Multi-arm Bandits (https://en.wikipedia.org/wiki/Multi-armed_bandit). It taught me everything I need to know about how to make better decisions in the face of uncertainty.

One-arm Bandits, otherwise known as slot machines, have negative expected value. If you multiply each possible payout by its probability, add those up, and subtract the cost to play, the result is negative. For example, if a game were a fair coin flip that paid $2 for heads and $0 for tails and cost $1.50 to play, the expected value would be (2 x 0.5 + 0 x 0.5) - 1.5 = -$0.50. It’s possible to make money on any single attempt, but on average you will lose money. Casinos collect that negative expected value from gambling machines.
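To make the arithmetic concrete, here is a minimal Python sketch of that expected value calculation. The function name `expected_value` and the list-of-pairs input format are my own choices for illustration, not anything standard.

```python
def expected_value(outcomes, cost):
    """Expected net result of one play.

    outcomes: list of (payout, probability) pairs (a hypothetical format).
    cost: price of a single play.
    """
    total_probability = sum(p for _, p in outcomes)
    if abs(total_probability - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Sum of payout times probability, minus the cost to play.
    return sum(payout * p for payout, p in outcomes) - cost


# The coin flip game from the text: $2 for heads, $0 for tails, $1.50 to play.
coin_flip = [(2.0, 0.5), (0.0, 0.5)]
ev = expected_value(coin_flip, cost=1.50)
print(f"{ev:.2f}")  # -0.50
```

A real slot machine has many more outcomes than a coin flip, but the calculation has the same shape: more rows in the list, same sum.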

Multi-arm Bandit Problems go like this: suppose you had a row of ten slot machines. Two of them have a positive expected value and eight have a negative expected value, but you don’t know which. What is the minimum number of “pulls,” or attempts, it would take to find the winners?

On one hand, if you attempt each machine once, that gives very little information. A machine may have a large range of payouts and a balanced probability of each. On the other hand, you could brute force the solution by pulling the arm of each machine 10,000 times. The optimum solution is somewhere between 1 and 10,000 pulls per machine. Yes, there is a mathematical answer to this arbitrary problem. If there’s one thing academia is good at, it’s answering questions no one thought to ask.
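For a feel of what sits between one pull and 10,000, here is a rough simulation sketch of epsilon-greedy, one well-known bandit heuristic. It is not the optimal answer, just a simple strategy: most of the time pull the machine that looks best so far, and occasionally try a random one. Everything specific here is an assumption for illustration: the made-up expected values, the bell-curve payout noise, and the `epsilon_greedy` function itself.

```python
import random


def epsilon_greedy(true_means, n_pulls=5000, epsilon=0.1, seed=0):
    """Simulate n_pulls plays and return (pull counts, estimated EV) per machine."""
    rng = random.Random(seed)  # seeded so repeated runs give the same result
    n_arms = len(true_means)
    counts = [0] * n_arms
    estimates = [0.0] * n_arms
    for _ in range(n_pulls):
        if rng.random() < epsilon:
            # Explore: pull a machine chosen uniformly at random.
            arm = rng.randrange(n_arms)
        else:
            # Exploit: pull the machine with the best estimate so far.
            arm = max(range(n_arms), key=lambda a: estimates[a])
        # Assumed payout model: net result is Gaussian noise around the true EV.
        reward = rng.gauss(true_means[arm], 1.0)
        counts[arm] += 1
        # Update the running average for this machine.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return counts, estimates


# Ten machines, two with positive expected value (numbers are made up).
machines = [-0.5, -0.3, 0.2, -0.4, -0.1, -0.6, 0.1, -0.2, -0.3, -0.5]
counts, estimates = epsilon_greedy(machines)
for i, (c, e) in enumerate(zip(counts, estimates)):
    print(f"machine {i}: pulled {c:4d} times, estimated EV {e:+.2f}")
```

The design choice worth noticing is `epsilon`. Set it to 0 and the strategy can get stuck on whichever machine got lucky early. Set it to 1 and it is pure random sampling that never cashes in on what it has learned.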

This does have actual real-world applications, though. Exposing children to a variety of subjects in school lets them explore in a fairly cheap way. If a child were exposed to math only once and disliked it, they might never try again. If they have repeated exposure, they might learn to like it. Same for vegetables. Once you’re an adult, if you don’t like asparagus yet, you’re probably not going to like it in the future.

Are people naturally good, or bad? At any given moment some people are having a bad day, but if you give them another chance, they might do the right thing, or maybe over the long term they’re making bad choices on average.

Perhaps the most profitable application is stock market portfolio allocation. Suppose you randomly bought ten stocks. Each week, each one would either go up or down. You can calculate how long you would have to hold on to each stock before you can confidently determine if it’s a winner or a loser.
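As a back-of-the-envelope version of that calculation, here is a sketch that uses a textbook normal approximation rather than a full bandit algorithm. It assumes weekly returns are independent, roughly bell-shaped, and that you somehow know the stock's true average weekly return and volatility, which in real life you don't. The function name and the example numbers (0.2% average weekly return, 3% weekly volatility) are hypothetical.

```python
import math
from statistics import NormalDist


def weeks_to_confidence(mean_weekly_return, weekly_volatility, confidence=0.95):
    """Rough number of weeks before the average weekly return is
    distinguishable from zero at the given two-sided confidence level."""
    if mean_weekly_return == 0:
        raise ValueError("a stock with zero average return never separates from zero")
    # z-score for the two-sided confidence level (about 1.96 for 95%).
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    # Solve z * volatility / sqrt(n) <= |mean| for n.
    return math.ceil((z * weekly_volatility / abs(mean_weekly_return)) ** 2)


# Hypothetical stock: +0.2% average weekly return, 3% weekly volatility.
weeks = weeks_to_confidence(0.002, 0.03)
print(weeks, "weeks, or about", round(weeks / 52, 1), "years")
```

With these made-up numbers the answer comes out to well over a decade of weekly data, which is the uncomfortable point: when the noise is much bigger than the signal, a few good or bad weeks tell you very little.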

Studying Multi-arm Bandit problems has taught me a few things.

A single occurrence isn’t necessarily indicative of the average
Don’t be afraid to stick around to see if an opportunity turns around
Don’t become attached to a single opportunity just because it’s been doing well for a stint
