This is a really common reaction after first encountering the No Free Lunch theorems (NFLs). The one for machine learning is especially unintuitive, because it flies in the face of everything that's discussed in the ML community. That said, the theorem is true, but what it *means* is open to some debate.

To restate the theorem for people who don't know it, the NFL theorem for machine learning is really a special case of the NFL theorem for local search and optimization. The local search version is easier to understand. The theorem makes the following, somewhat radical claim:

Averaged across *all* possible optimization problems, the average solution quality found by any local search algorithm you choose to use is *exactly* the same as the average solution quality of a local "search" algorithm that just generates possible solutions by sampling uniformly at random from the space of all solutions.

Another formulation, when people want an even stronger reaction, is to say that if you want to find the best solution to a problem, it's just as good to try things that seem to be making your solution iteratively worse as it is to try things that seem to be making your solution iteratively better. On average, both these approaches are equally good.
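This equal-on-average claim can actually be checked by brute force on a tiny search space. The sketch below (minimal Python; the function names are made up for illustration) enumerates *every* function from a 4-point domain into {0, 1}, then compares an "adaptive" searcher that reacts to its observations against a "fixed" one that ignores them. Both sample two distinct points, and averaged over all 16 functions their best-found values come out identical, just as NFL predicts.

```python
import itertools

points = range(4)   # a tiny search space {0, 1, 2, 3}
values = (0, 1)     # possible objective values

def adaptive(f):
    """'Smart' search: sample point 0, then pick the next point
    based on the value we observed there."""
    first = f[0]
    nxt = 1 if first == 1 else 3   # adapt to the observation
    return max(first, f[nxt])      # best value found

def fixed(f):
    """'Dumb' search: always sample points 0 and 1, ignoring observations."""
    return max(f[0], f[1])

# Average best-found value over ALL 16 functions f: {0..3} -> {0, 1}
all_functions = list(itertools.product(values, repeat=len(points)))
avg_adaptive = sum(adaptive(f) for f in all_functions) / len(all_functions)
avg_fixed = sum(fixed(f) for f in all_functions) / len(all_functions)
print(avg_adaptive, avg_fixed)   # both 0.75
```

The trick generalizes: for *any* non-revisiting search strategy, the sequence of values it observes is distributed identically when you average over all possible functions, so no strategy can beat another on average.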

Okay, so *why* is this true? Well, the key is in the details. Wolpert has sometimes described the theorem as a specialization of Hume's work on the problem of induction. The basic statement of the problem of induction is: we have no logical basis for assuming that the future will be like the past. Logically, there's no reason that the laws of physics couldn't all just radically change tomorrow. From a purely *logical* perspective, it's totally reasonable that the future can be different from the past in any number of ways. Hume's problem is that, in general, the future *is* like the past in a lot of ways. He tried to formulate a philosophical (logical) argument that this *needed* to be so, but basically failed.


The No Free Lunch theorems say the same thing. If you don't know what your search space looks like, and you iteratively refine your guess at what a good solution looks like in response to the observations you've made in the *past* about what good solutions look like (i.e. learning from data), then it's just as likely that the operation you make helps as it is that it hurts. That's why the "averaged over all possible optimization problems" part is key. For any optimization problem where hill climbing is a good strategy after $k$ moves, we can make one that is identical, except that the $k$th hill-climbing move leads to an awful solution. The actual proof is more subtle than that, but that's the basic idea.
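That adversarial-twin construction can be sketched in a few lines. This is a toy illustration, not the actual proof (all names here are hypothetical): a simple local searcher proposes its next point based only on *past* observations, so an adversary is free to plant an awful value at the point it moves to on step $k$ without disturbing the trajectory that led there.

```python
def greedy_step(history):
    """Propose the next point from (point, value) history:
    step up or down from the best point seen so far."""
    best_x, _ = max(history, key=lambda pv: pv[1])
    tried = {x for x, _ in history}
    for cand in (best_x + 1, best_x - 1):
        if cand not in tried:
            return cand
    return None

def run(f, start, k):
    """Run k greedy steps on a list-indexed objective f; return the history."""
    history = [(start, f[start])]
    for _ in range(k):
        nxt = greedy_step(history)
        if nxt is None or not 0 <= nxt < len(f):
            break
        history.append((nxt, f[nxt]))
    return history

f_good = [0, 1, 2, 3, 4, 5]       # a problem where hill climbing works well
hist = run(f_good, 0, 3)           # climbs 0 -> 1 -> 2 -> 3

# Adversarial twin: identical at every point the searcher saw before
# its last move, but awful at the point it commits to on step k.
f_bad = list(f_good)
f_bad[hist[-1][0]] = -100
hist_bad = run(f_bad, 0, 3)        # same trajectory, disastrous final value
```

Because the searcher hadn't yet observed the objective at its $k$th destination, `f_bad` is indistinguishable from `f_good` along the whole path it followed, so the same "smart" moves now end in a terrible solution.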

A very brief lay summary might be:

A machine learning algorithm can only be made to work better on some kinds of problems by being made to work worse on other kinds of problems.

So what does this **mean** in a practical sense? It means that you need to have some a priori *reason* for thinking that your algorithm will be effective on a *particular* problem. Exactly what a *good* reason looks like is the subject of vigorous debate within the ML community. This is very closely related to the bias/variance tradeoff.

Some common responses are:

- When you're looking at a new optimization problem, although it *could* have any random kind of structure, the problems we actually encounter in the real world are a lot more regular, and certain common themes are present, like the fact that iteratively moving "uphill" (reducing error) tends to lead to good solutions. Basically, this school of thought says NFL is an ornamental theorem: most ML algorithms work better on "the kind of problems we see in real life", by working worse on "the kind of problems we don't see in real life".
- When you're looking at a new optimization problem in [insert your favourite application domain], although it *could* have any random kind of structure, problems tend to look like [whatever you think], which makes [your favorite algorithm] a lot more effective than random guessing.
- Wolpert & Macready themselves published an interesting result showing that there actually *are* specialized optimization processes, based on co-evolution, that are *consistently* better than random guessing.

Regardless, it's indisputable that some algorithms are better than others, in certain sub-domains (we can see this empirically). NFL tells us that to be better there, they need to be worse somewhere else. The question up for debate is whether the "somewhere else" is real problems, or purely artificial ones.

"Although any optimization problem might be present", present? I suggest you clarify the points in the section "Some common responses are:". – nbro – 2019-09-27T16:56:56.783

Great answer. But by "algorithm", do they include all variations of it? For example, backprop may be implemented with derivatives, by taking small differences, or with double derivatives (as far as I know), so are they the same or different? And by "performance", is it end results only, or resources too? – DuttaA – 2019-09-27T16:58:04.867

@nbro: Actually I think that was just an unfortunate choice of `<` and `>` to show placeholders. I've switched them out so you can see closer to what John intended. – Neil Slater – 2019-09-27T17:04:02.760

@NeilSlater Yep, thanks for doing that! – John Doucette – 2019-09-27T17:23:40.117

@DuttaA Yes. The key idea is that, no matter what strategy you come up with for solving your optimization problem (like minimizing error by taking into account higher derivatives), I can create a version of the problem that looks exactly the same except that, after $k$ iterations, you end up in a bad solution. – John Doucette – 2019-09-27T17:24:55.103

@nbro Good point. I've updated the text. – John Doucette – 2019-09-27T17:26:48.440