More on the Preface

Back at Acme, our team of epistemologists is on the scene, looking at Inspector 14’s record of length measurements for pole 453-01-120. Name this pole “p”. To simplify, suppose there are n physical measurements, the conditions for measurement were standardized, the measurement device was calibrated, errors are distributed normally, et cetera, et cetera:

1. The measured length-1 of p is 6.005 meters.
2. The measured length-2 of p is 6.003 meters.
…
n. The measured length-n of p is 5.099 meters.

Our assumptions about those n measurements license us to view them as a random sample from all possible measurements of p. This in effect is what it means to assume that errors are normally distributed. The negation of any of the other conditions (e.g., calibration, standardization) would function like defeaters for the randomization assumption, and block the inference I am about to make.

Because we have good reason to view those n measurements as randomly drawn, and no evidence suggesting that the sample is biased, we say the length of p is the mean of those n measurements, plus or minus 1.96 times (2 over the square root of n).

The example imagines that the mean of the n measurements is 6.003; the ‘plus or minus (1.96 x 2/root n)’ gives us the confidence interval, at (roughly) 2 standard deviations, which we imagine to be [6.002, 6.004] meters. This says that the 95% confidence interval for the length of the flag pole p is [6.002, 6.004] meters. This expresses that 95% of the intervals calculated in this fashion will contain the true value of p; or, alternatively, that the probability that the true length of p is equal to a value within [6.002, 6.004] meters is 0.95.
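As a rough illustration of the arithmetic just described, here is a minimal Python sketch of "mean plus or minus 1.96 times the standard deviation over root n". The readings and the value of `sigma` are hypothetical stand-ins, not Inspector 14's record, and the function name `confidence_interval` is my own; the sketch assumes the measurement standard deviation is known in advance.

```python
import math
import statistics

def confidence_interval(measurements, sigma, z=1.96):
    """Return (low, high) for mean +/- z * sigma / sqrt(n).

    sigma is the measurement standard deviation, assumed known.
    """
    n = len(measurements)
    mean = statistics.fmean(measurements)
    half_width = z * sigma / math.sqrt(n)
    return mean - half_width, mean + half_width

# Hypothetical readings in meters (an assumption for illustration only).
readings = [6.005, 6.003, 6.001, 6.004, 6.002]

# Hypothetical known standard deviation of a single measurement, in meters.
low, high = confidence_interval(readings, sigma=0.002)
print(f"95% interval: [{low:.4f}, {high:.4f}] meters")
```

The interval narrows as n grows, since the half-width shrinks with the square root of the number of measurements.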

Now to the philosophy of statistics:

The first point to notice is that we are talking about the actual length of the pole, which is a fixed value (under standardized conditions, et cetera, et cetera) rather than a random variable. The actual length of the pole is either between 6.002 and 6.004 meters or it isn’t; there is no random variable to attach this 0.95 probability to. Rather, we say that we are confident to level 0.95 that the interval we calculated contains the length in meters of this pole. So, assuming here that 0.95 is a high enough confidence level, we accept flat out that the pole is between 6.002 and 6.004 meters. And notice that this is what Inspector 14 does: he puts his “Inspected by 14” sticker on p and adds it to the stock of Model A’s; he does not put 95% of the sticker on the pole, nor hedge his bet by thinking that it is partially not in the stock of Model A’s. It’s an A, in his judgment, on his evidence.
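The reading in the paragraph above, on which the length is fixed and the interval is what varies, can be sketched with a small simulation. This is only an illustration under stated assumptions: the true length, `sigma`, and sample size below are hypothetical, and the function name `coverage` is mine. The length never changes across trials; only the sample, and so the interval, does.

```python
import math
import random

def coverage(true_length, sigma, n, trials, seed=0, z=1.96):
    """Fraction of simulated intervals that contain the fixed true length."""
    rng = random.Random(seed)
    half_width = z * sigma / math.sqrt(n)
    hits = 0
    for _ in range(trials):
        # Draw n normally distributed measurements of the same fixed pole.
        sample_mean = sum(rng.gauss(true_length, sigma) for _ in range(n)) / n
        # The interval is random; the true length is not.
        if sample_mean - half_width <= true_length <= sample_mean + half_width:
            hits += 1
    return hits / trials

# Hypothetical values, chosen only for illustration.
rate = coverage(true_length=6.003, sigma=0.002, n=15, trials=20000)
print(f"fraction of intervals containing the true length: {rate:.3f}")
```

Any single interval either contains 6.003 or it does not; the figure near 0.95 describes the procedure over many repetitions.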

Notice what happens when you switch to the Bayesian view, for you can tell a mathematically equivalent story in Bayesian dress. If you want to hedge, want to assign a probability to a proposition, then you’ve got to have a random variable. So, the Bayesian interpretation lets the length of the pole be a random variable! This doesn’t make sense directly, of course: a pole can be no more 95% a magnitude of length than a woman can be 95% pregnant. Either pole or woman has or has not the stated attribute. But, if you change the story around to talk about the belief about the pole, or the belief that a certain woman is pregnant, then you can view the interval as bounds on your (non-extreme) credal probability that the length of the pole (pregnancy status of the woman in question) is in that prescribed interval.

Now the jump from statistics to epistemology!

To effect this move, a Bayesian must assume a flat subjective prior probability distribution to get his mathematical representation of this story into equilibrium with the frequentist story. If you are a statistician doing this, this isn’t (or doesn’t need to be) such a big problem: you are interested in tools for modeling parameters, and you know about this equivalence between methods and can exploit whichever seems right for the task.
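The equivalence mentioned above can be sketched numerically. The following is a minimal illustration, assuming normal measurement error with known standard deviation and using a very wide normal prior as a stand-in for the flat prior; the readings, `sigma`, and the function names are hypothetical. Under those assumptions the posterior for the length is normal, and its central 95% interval should land almost exactly on the classical interval.

```python
import math
from statistics import NormalDist, fmean

def normal_posterior(measurements, sigma, prior_mean, prior_sd):
    """Posterior for the length under a normal prior and known measurement sd."""
    n = len(measurements)
    data_precision = n / sigma ** 2
    prior_precision = 1.0 / prior_sd ** 2
    post_var = 1.0 / (prior_precision + data_precision)
    post_mean = post_var * (prior_precision * prior_mean
                            + data_precision * fmean(measurements))
    return NormalDist(post_mean, math.sqrt(post_var))

def central_interval(dist, level=0.95):
    """Central interval holding the given posterior probability."""
    tail = (1.0 - level) / 2.0
    return dist.inv_cdf(tail), dist.inv_cdf(1.0 - tail)

readings = [6.005, 6.003, 6.001, 6.004, 6.002]  # hypothetical, in meters
sigma = 0.002                                   # hypothetical known sd

# A prior with an enormous spread approximates the flat prior.
flat_ish = normal_posterior(readings, sigma, prior_mean=0.0, prior_sd=1000.0)
credible = central_interval(flat_ish)

# Classical interval from the same data, for comparison.
half_width = 1.96 * sigma / math.sqrt(len(readings))
classical = (fmean(readings) - half_width, fmean(readings) + half_width)

print("credible: ", credible)
print("classical:", classical)
```

The numbers agree; what differs is what the 0.95 is said to attach to.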

But, if you are an epistemologist making this move, you’ll likely be driven to do so by ideas about principles of rational belief fixation, and if you are making this move to explicate partial belief and use the Kolmogorov axioms as consistency constraints on such a notion, then you are doing so not as a technical ploy but rather because you think that issues of rationality are in play.

But for the epistemologist so-described the assumption of a flat prior takes on enormous weight, for it is not always reasonable to assume a flat prior, and the only reason that it is reasonable to do so here is that this is the assumption needed to get the Bayesian reconstruction to agree with the classical treatment of measurement! So, this reliance on a flat prior is crucial; the whole normative story hangs on it, but it is a thin reed from which to hang rational belief.

(I should stress that Ralph rejects probabilism, but he does accept a notion of partial belief; it remains to be seen whether there is a view of partial belief as opposed to full belief that escapes the general objection here.)

To tie this post up, a thought about the vaguely contemporary, mainstream view on the lottery paradox. In my view the move to avoid inconsistency by adopting partial belief pushes the key epistemic problems into the philosophy of science, where one finds ready assurances that the complaints I have just made are old and solved, or old and nearly solved, or at least old. What struck me most in reviewing the literature on the lottery paradox was that at some point in the late 70s or early 80s, people largely stopped engaging with the philosophy of statistics and instead focused on the thought experiments of the lottery and the preface. Epistemologists began writing on the particulars of these puzzles while assuming that the fundamental parameters of the puzzle, such as the interpretation of probability, were fixed.

Christensen’s book is remarkable in this sense, because it is constructed entirely within this contemporary and restricted view of the paradoxes; it primarily engages the post-1980 “received view” literature.

There is, however, a recent lottery literature outside of this tradition, some of which is arguably outside of philosophy. Nevertheless, there is engaging work to consider. Joe Halpern addresses the lottery with a theory of first-order probability; the field of non-monotonic reasoning, going back to Hayes and McCarthy in 1969, has observed a tension between defeasible conditions for belief fixation and logical rules for manipulating those beliefs; David Makinson is more famous for the AGM belief revision paradigm than for his Analysis paper; Eric Neufeld and Scott Goodwin rebut Pollock’s claim that there are fundamental differences between the lottery and the preface (which might make a good foil, Ralph, btw: Computational Intelligence, 14(3), 1998); and, to self-promote, I’ve proposed a limited system for 1-monotone capacities (JoLLI, 2006) that gives a unified treatment of the lottery and preface, a statistical default logic (NMR2004) which treats the normality assumption for sample distributions as a default assumption, and a review of the lottery paradox in the 2007 Harper and Wheeler collection.

More on the Preface — 6 Comments

1. Greg,

The Bayesian doesn’t need a flat prior. It can even be quite hilly, just not too mountainous. That’s what so-called “stable estimation” is all about. See section 4 of the Stanford Encyclopedia article on “Inductive Logic” for details.

Best,

Jim

2. Hi Jim! Good to hear from you! In reply: To get the generality here you need the “boundedness assumption” you discuss in your SEP article, which is a prior assumption an agent must make about possible values of the parameter outside of the specified interval. This might be fine, from a statistical point of view. But from an epistemological point of view it is a suspicious move to assume that this property holds of data before its collection. I had thought that I was being more charitable to the Bayesian view by dropping this requirement and stipulating instead a flat prior, which is presumably what one would do absent prior knowledge about the data. (Otherwise, why gather the data?)

Indeed, in your SEP article you point out that the boundedness assumption doesn’t always hold. This isn’t a problem for statistical practice in general, but it is a problem for epistemological practice if you are looking to build epistemic principles from this machinery.

3. “This expresses that 95% of the intervals calculated in this fashion will contain the true value of p; or, alternatively, that the probability that the true length of p is equal to a value within [6.002, 6.004] meters is 0.95.”

I’ve never quite understood how these two claims are supposed to be related to one another. The former I understand, and it looks like we even have good justification for it. Is this statement supposed to just give the meaning for the second claim?

4. Greg,
The boundedness assumption is not an assumption about the data “before its collection”. One might have the data first or have the prior distribution first, or “get them at the same time”. It’s not really a temporal matter at all. It’s just that the stable estimation theorem doesn’t apply if the data (once collected) happens to fall in a region where the agent’s prior distribution represents extremely small prior probabilities. If the data doesn’t happen to fall in such a region, then the agent’s posterior probability (his degree of confidence) that the true length lies within a 2SD margin of error of the data mean is .95.

That seems more like what we want than the usual confidence interval approach, which only says that 95 percent of the time when confidence regions are calculated this way, they will capture a true hypothesis within the interval. That is, confidence intervals make the former claim in Kenny’s quote, but not the latter claim — because on a frequentist view the probability (i.e. frequency) with which the true hypothesis is in this interval in this case is either 0 or 1. But on a Bayesian account, the probability (i.e. updated confidence strength for the agent) that the true hypothesis is in this interval in this case is .95, which is what we wanted.

Now you might object that confidence strengths on their own aren’t much good when what we want is high confidence in hypotheses that are true. But convergence to truth results can be used to bolster the Bayesian confidence strengths by showing that confidence for hypotheses formed in this way is very likely to result in high confidence for true hypotheses. (See the end of section 4 of the Inductive Logic article.)
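The "hilly but not mountainous" point in this comment can be illustrated with a crude grid approximation. This is a sketch under my own assumptions, not the construction in the SEP article: the two prior shapes, the readings, `sigma`, and the grid bounds are all hypothetical. It computes the posterior probability that the length lies within 1.96 standard errors of the data mean, first for a gently rolling prior and then for a prior piled up away from the data.

```python
import math
from statistics import fmean

def posterior_mass_near_mean(measurements, sigma, prior_density,
                             lo, hi, z=1.96, steps=20001):
    """Grid estimate of P(|length - sample mean| <= z * se | data)."""
    mean = fmean(measurements)
    se = sigma / math.sqrt(len(measurements))
    step = (hi - lo) / (steps - 1)
    total = inside = 0.0
    for i in range(steps):
        theta = lo + i * step
        # Normal likelihood of the data at theta, up to a constant factor.
        weight = prior_density(theta) * math.exp(-0.5 * ((mean - theta) / se) ** 2)
        total += weight
        if abs(theta - mean) <= z * se:
            inside += weight
    return inside / total

readings = [6.005, 6.003, 6.001, 6.004, 6.002]  # hypothetical, in meters
sigma = 0.002                                   # hypothetical known sd

def hilly(theta):
    # Gently rolling, always positive, nearly constant near the data.
    return 1.0 + 0.5 * math.sin(40.0 * theta)

def mountainous(theta):
    # Sharply peaked at 5.990 m, away from where the data fall.
    return math.exp(-0.5 * ((theta - 5.990) / 0.002) ** 2)

hilly_mass = posterior_mass_near_mean(readings, sigma, hilly, 5.95, 6.05)
mountain_mass = posterior_mass_near_mean(readings, sigma, mountainous, 5.95, 6.05)
print(f"hilly prior:       {hilly_mass:.3f}")
print(f"mountainous prior: {mountain_mass:.3f}")
```

With the rolling prior the posterior mass stays close to 0.95; with the peaked prior it falls well short, which is the case where the data land in a region of very small prior probability.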

5. This is an excellent question, Kenny.

I think that the only way to interpret the latter statement is in terms of the former, which means that there is an interpretation of probability that is fundamentally different than the measure-theoretic notion specified by the Kolmogorov axioms, and fundamentally different than the many, many variants of this classical interpretation of probability.

Many statistics textbooks maintain that Kolmogorov probability is probability enough, and their authors would pounce on my “alternatively, that the probability…” claim as a major blunder. Even Neyman and Pearson would bristle at my referring to this confidence interval as a probability.

Why their fuss? On the classical view of statistical estimation the confidence comes from the sampling distribution of the statistic rather than from the actual measurement values. The length of the pole is a fixed but unknown constant; this is the unknown “population” mean, the fixed magnitude of length in meters that is true of p. We are sampling to find this magnitude, to estimate its value. The sample mean (the average of our n measurements of pole p) is the random variable. There is therefore no random variable to attach that 95% confidence statement to; so it cannot be a Kolmogorov probability. And if this is the only probability there is, then my “alternatively,…” statement is a blunder, a category mistake.

The Bayesian view sticks to the classical axiomatic treatment of probability and yields a clear probability statement, but it does so at the price of assigning measure to some pretty funky looking events. We get a probability statement for the posterior distribution of the population mean, “length of pole p”, which now is a random variable rather than a fixed constant, given the actual sample data and not the sampling distribution of the statistic. In this case the sample data are these particular n measurements of p. So, you get to talk about probabilities of statements at the price of believing that it makes sense to talk about the probability that: [the length of pole p is between 6.002 and 6.004 meters, given that measurement-1 is 6.005 meters, measurement-2 is 6.003 meters, …, measurement-n is 5.099 meters] = 0.95.

Return now to Jim’s observation. On the Bayesian view, the probability is conditional on that actual data. So you have to be pretty careful defining the random variables over these values to effect inverse inference, and this information is loaded into your subjective prior. The classical statistician sees this and thinks that the bulk of the statistical inference problem is swept into this murky prior. The Bayesian sees this as simply articulating what we know about measurement.

It also doesn’t help matters that I, and everyone else, pick problems that are very easy to visualize and understand. It helps to have familiar problems that have simple linear functional dependencies. We have centuries of experience measuring the length of poles, after all. There is a lot of information about this that we can put into our prior. And, if the task is to come up with an epistemology for measuring the length of flag poles, then I agree, we can do a pretty good job at this. But when epistemologists talk about rationality constraints built atop this machinery, I often become very confused by what the algebra must be like. I’ve argued with a Bayesian-minded epistemologist on this blog who insisted that I must know the probability of, have a degree of belief for: [my suffering an elaborate hoax involving my boarding a commercial airline in Lisbon then landing and touring an ersatz Geneva, given that my airline booking number for the flight ticket was 77777]. (!). He was consistent and a good sport, but he endorsed the consequences of orthodoxy without flinching.

6. The second-to-last paragraph is particularly badly put. The random variable is the statement about the population parameter, and that is the variable that we define a prior on. The thought was that the agent is presumed to have some ideas about expected values for the particular observed data and the variable in question, but that information is put into your prior on the population parameter.