X-Phi and Sampling by Convenience

University students make up the majority of the samples for subject-based research articles in experimental psychology, mainly because they are cheap and convenient. But the reliance on student samples raises the question of whether results based on those samples are representative of the general population. An interesting article on the subject, titled “How Random is That?”, appeared in the APS Observer in 2006.

There are important exceptions: convenience sampling seems fine for studying cognitive functions, for instance. But in most non-cognitive areas of psychological research, sampling bias is an issue.

Which brings me to X-phi and the reliance on convenience sampling for assessing folk attitudes about philosophical intuitions. Why think that your samples are representative of the folk? And, more to the methodological heart of the matter, perhaps some caution might be needed before finger-wagging about appeals to intuition. For, if standard practices within (non-cognitive) psychology are what is being held up as the example to follow, the tables may very well be turned on you:

Feldman Barrett admits that some psychologists aren’t careful enough about making generalizations — in part because they lack a sufficient understanding of statistics. “Most of us are not trained in sampling theory,” she said. “There are sciences that take sampling much more seriously than we do.”

Michael Hunter is trained in sampling theory, and he does take it seriously. Co-author of the 2002 paper “Sampling and Generalizability in Developmental Research,” Hunter, too, sees this shift in psychological research away from statistical methodology, away from a time when analysis of variance supported an experiment’s causal claims and multiple regression analyses helped infer generalizability.

“Today, and for some time now, psychological researchers judge generalizability not on statistical grounds but instead on methodology, educated guesses or common sense, and accumulated knowledge,” Hunter said. Some of these techniques are fine to draw generalizations from, he said — sometimes even highly effective — as long as they are based on random samples of the population in question. The problem, of course, is that college students are seldom a random sample of a university population, let alone a national one. “Obviously, this basis of generalizability does not augur well for the generalizability of some research with college students, who are a selective population and who are rarely randomly sampled” (“How Random is That?”, APS 2006; emphasis mine).


X-Phi and Sampling by Convenience — 7 Comments

  1. This post looks like it is meant to make some sort of potential trouble for x-phi, but, really, the basic worry about the overreliance on college students (and, for that matter, mostly white and western college students) is one that the x-phi community is already pretty well aware of. And I think that if you look a little deeper at both the article you reference and at the body of work being done in x-phi, you’ll see that most of the x-phi work isn’t problematized by any of this, for a combination of factors:

    (i) Part of the point of some of the work is that there is previously-unnoticed variation in judgments about cases (so pointing out that there’s _even more_ such variation is hardly an objection).

    (ii) Many studies already use permutation tests, such as Fisher’s exact, as per the recommendation that appears in the part of the article right after the bit you quoted (assuming I’ve understood that correctly, which I’m willing to be shown that I haven’t!).

    (iii) I don’t think that anyone is looking to make other than very weak generalizations from their studies.

    (iv) The sorts of tasks under consideration, i.e., category-application tasks, fall well within the scope of that general presumption of representativeness for cognitive tasks (though maybe some ethics-related research is otherwise).

    (v) Perhaps most relevantly, many x-phi studies already done, and perhaps even most being done today, draw from a much wider available population via person-on-the-street interviews and/or web-based methods. These have their own risk of bias, mostly towards higher-SES subjects, but I don’t think that that’s a concern for most of the studied tasks — and they are still going to get a _vastly_ more representative sample than any roomful of philosophers would constitute, who are going to count 100% as high-SES under standard operationalizations.

    I must confess that I also don’t really see what sort of “table-turning” you have in mind in the last part of the post. What do you take the significance of that quoted section to be? It sounds like you think it says we should just stop doing studies and rely on armchair judgments, but I don’t see how it even begins to support such an upshot. So I think I must not be understanding what you have in mind.

  2. Great post! You know, I think this worry about our typical sampling methods for experimental studies is really quite sobering. One recent study (Arnett 2008, “The neglected 95%: Why American psychology needs to become less American”) calculates that 67% of the American samples in JPSP come from undergraduate psychology students. I would only add that this problem is actually part of a larger, far worse issue when you consider the further fact that not only are these samples often unrepresentative within our culture, but 68% of all subjects (and 73% of first authors) in top psych journals between 2003 and 2007 were from the US — rising to about 96% if you count “western” nations.

    Of course, this can be really problematic when psychologists take results from such American undergraduate samples and blindly make species-level generalizations! And as you capture in the APS quote, there are some ways to correct at least the more modest problem raised above (though in the opinion of some, hardly enough): “Today, and for some time now, psychological researchers judge generalizability not on statistical grounds but instead on methodology, educated guesses or common sense, and accumulated knowledge.” I think the same is generally true of xphi. But I was also thinking that the trend in xphi is to be *remarkably more modest* in making claims about the universality of its data than the implicit attitude of research in social psychology tends to be. And of course, some of the most exciting xphi work to date has existed precisely to point out these facts through cross-cultural research.

    Also, as Jonathan points out, there are sometimes good reasons for generalization from a narrow sample in certain cases (like when you think there is bound to be little variability across diverse subpopulations in a particular domain). It turns out that lots of xphi work is on category-application tasks, falling well within the scope of the general presumption of representativeness for cognitive tasks. But again, I think if you take a look at most xphi to date, you won’t find anyone making unqualified claims about things like universal processes from their data in the relevant way under concern.

  3. I am not quite so sanguine as Jonathan or Wesley about sampling. Jonathan’s first point about studies that try to establish the existence of variation looks right to me. But I’m a bit more skeptical about (ii) and (iv). For (ii), distribution-free (non-parametric) tests that make use of permutations or resampling are great for weakening assumptions about the underlying population (that’s the “samples from any population” bit), but they *do not* obviate the need to actually sample at random from the population of interest. Hence, I don’t see how permutation tests or resampling helps when the problem is that the samples are (practically) never random. With respect to (iv), category-application seems at least as likely as perception to be influenced by culture, and Nisbett has produced at least some evidence that perception is indeed influenced by culture. (Of course, that’s all to the good for those engaging in the so-called “negative” project.)
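    To make this point about (ii) concrete, here is a minimal toy sketch (my illustration, not from the thread; the data are invented) of an exact permutation test. Notice that everything the test does happens *within* the pooled sample: it licenses a conclusion about association among the units actually observed, and nothing in the machinery speaks to how those units were drawn from any wider population.

```python
from itertools import combinations
from statistics import mean

def exact_permutation_test(group_a, group_b):
    """Two-sided exact permutation test on the difference of means.

    Enumerates every relabeling of the pooled data and counts how
    many produce a mean difference at least as extreme as the one
    observed. Distribution-free about the sample -- but it says
    nothing about units outside the pooled sample.
    """
    pooled = group_a + group_b
    n_a = len(group_a)
    observed = abs(mean(group_a) - mean(group_b))
    count = total = 0
    for idx in combinations(range(len(pooled)), n_a):
        a = [pooled[i] for i in idx]
        b = [pooled[i] for i in range(len(pooled)) if i not in idx]
        total += 1
        if abs(mean(a) - mean(b)) >= observed - 1e-12:
            count += 1
    return count / total

# Toy data: two clearly separated groups. Only the 2 extreme
# relabelings (out of C(6,3) = 20) are as extreme as observed.
print(exact_permutation_test([1, 2, 3], [10, 11, 12]))  # 2/20 = 0.1
```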

    A strong (and recent) case that the sampling problem is serious in the behavioral sciences is made here. Summarized with commentary here.

    Of course, good statisticians have been saying this sort of thing for a long time. See this collection of essays by David Freedman, for a great example.

    If one is engaged in the project of endorsing a position on the basis of widespread intuitions, then it seems to me that one needs to have some generalization or uniformity commitment. The “traditional” commitment might be something like:

    (1) Since I have the intuition that x, everyone has the intuition that x.

    One positive x-phi move is to weaken the commitment and supplement with evidence. Hence, one might say:

    (2) Since p/n people in my sample have the intuition that x, p/n people have the intuition that x.

    If p/n is close enough (whatever that is) to 1, then endorse x.
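    As an aside for readers who want the arithmetic behind (2): here is a minimal sketch (my illustration; the numbers are hypothetical) of a standard normal-approximation confidence interval for a sample proportion p/n. The interval quantifies sampling error only — it presupposes random sampling from the population of interest, which is exactly the assumption convenience samples fail to meet.

```python
from math import sqrt

def proportion_ci(successes, n, z=1.96):
    """Approximate 95% confidence interval for a sample proportion,
    using the normal (Wald) approximation. The interval is only
    meaningful if the n subjects were sampled at random from the
    population being generalized to."""
    p = successes / n
    half = z * sqrt(p * (1 - p) / n)
    return max(0.0, p - half), min(1.0, p + half)

# Hypothetical study: 80 of 100 subjects report the intuition that x.
lo, hi = proportion_ci(80, 100)
print(f"p/n = 0.80, 95% CI = ({lo:.2f}, {hi:.2f})")  # (0.72, 0.88)
```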

    Unsurprisingly, better samples mean better inferences. We ought to be cautious in generalizing, especially from small, unrepeated studies. And we should be worried — more worried than we have been so far — about sampling. Still, I’m with Jonathan in wondering where the table-turning is here. I read the quotation as advocating for more and better statistics/experimental design in order to fix a *problem* of too much armchair work in psychology. Maybe the table-turning assumes that experimental philosophers slavishly follow whatever the nearest working scientists actually do?

  4. Wait, that was me being sanguine? I was just pointing out that since the bulk of extant xphi work is often difference testing, its claims to universality are naturally much more modest than the generalizations made in, say, mainstream social psych. That’s not to say I think there isn’t some serious issue with sampling in the general state of research in 2010.

    I thought that the original table-turning thing was supposed to be that xphi often accuses armchair philosophers of buttressing arguments with intuitions held by a very small group of people. But then, the thought is, the samples xphi studies rely on are also not so generalizable, given the relevant issue, or something like that?

  5. Jonathan W, Wesley, and Jonathan L.: thanks for your excellent comments. There is a lot of material to address, but I think I can get at my worry most directly by replying to Jonathan W’s remarks.

    Let me take JW’s points in a slightly different order. I will try sharpening what I vaguely had in mind about turning the tables later in another comment.

    (ii) Fisher’s exact test. For general readers, let me explain the gist of this test and make my general point in one go, by looking at the canonical example of the test, the Lady Tasting Tea experiment. A lady comes to you claiming that, by taste alone, she can determine whether her afternoon tea (a milk-tea mixture, suppose) was prepared by adding milk first or tea first. Dubious, you decide to test whether she has this ability with an experiment motivated by the following reasoning: if she cannot make this discrimination, you would expect her to perform no better than chance at picking the correct order from a sequence of randomly prepared milk-tea mixtures. For this setup, that means that if she could not discriminate lactose-first from lactose-last, you would expect her success rate over a sequence of n trials to look more or less like the proportion of n fair coin tosses landing heads. So, if you were to give her 5 randomly prepared milk-tea mixtures and she correctly identified milk-first or tea-first in all five cases, the chance of that happening by luck alone is the same as flipping 5 heads in a row with a fair coin, which is about 3%. The Fisher test, then, is a test designed to detect non-random associations between a pair of variables; here, between how the tea was prepared (H) and how the lady responded (T).
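    For the curious, the arithmetic above can be checked in a few lines (a sketch of my own, not from the thread). The first calculation is the 5-cups coin-flip bound; the second is a one-sided Fisher exact p-value, computed directly from the hypergeometric distribution, for Fisher’s classic 8-cup design (4 milk-first, 4 tea-first, all classified correctly):

```python
from math import comb

# Chance of guessing all five independently prepared cups correctly
# when each guess is a 50/50 coin flip.
p_five_in_a_row = 0.5 ** 5
print(p_five_in_a_row)  # 0.03125, i.e. about 3%

def fisher_one_sided(a, b, c, d):
    """One-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]
    (rows: true preparation H; columns: the lady's response T).
    With the margins fixed, sums the hypergeometric probability of a
    top-left cell count at least as large as the observed a."""
    row1, col1, n = a + b, a + c, a + b + c + d
    total = 0.0
    for k in range(a, min(row1, col1) + 1):
        total += comb(col1, k) * comb(n - col1, row1 - k) / comb(n, row1)
    return total

# 8 cups, 4 of each kind, all correctly identified: p = 1/70.
print(fisher_one_sided(4, 0, 0, 4))  # about 0.0143
```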

    Now, there are two places where inferential caution and moderation come in. Only one of them do I think X-phi observes.

    The first, which I believe Jonathan is referring to, concerns what inference to draw from observing five correct guesses: it is good evidence for a non-random association between H and T, but not necessarily good evidence that the lady has the ability she says she has. After all, you may have tipped her off visually by not stirring the mixtures thoroughly, exhibited a behavioral tell when you handed over the cups, and so forth. These types of confounders are usually settled in the design of the experiment, which is guided by methodology, educated guesses or common sense, and accumulated knowledge. And there is, often enough, nothing wrong with appeals to common sense here. (Aside: this caution is beaten into classical statisticians by the convoluted talk of rejecting the null hypothesis (i.e., no effect; i.e., the lady is a sham) when the data (a sequence of milk-first/tea-first classifications) fall within the rejection region (often, below 5%).)

    However, none of this addresses the inferential caution required in extrapolating the result of a Fisher test to a broader population. This is where I think X-phi is vulnerable. Returning to our concrete example, seeing this lady nail the taste test (suitably hedged!) doesn’t tell you a damn thing about ladies nailing the taste test, never mind folk nailing the taste test, regardless of the hedges. The lady, after all, came to you; she is no more a random sample of ladies or of the general public than, say, the sophomores who sign up for X-phi 201 class are a random sample of the university community or of the general public.

    Now, you might think to appeal here to methodology, educated guesses or common sense, and accumulated knowledge to justify extrapolating from your non-random sample to the general public in either case, but this would be a grievously bad use of common sense. (Call this “common-sense mongering”.)

    (iii) The strength of generalization. With this distinction between ‘good common sense’ and ‘common-sense mongering’ handy, turn to Jonathan’s claim that nobody “is looking to make other than very weak generalizations” from their studies. This may be so, and it should be so, when discussing the limits of a particular experimental design and what can reasonably be taken away from its results. But if this is all that X-phi is up to, then I don’t see the point of burning up armchairs and bullying epistemologists about their intuition pumping, as harebrained and maddening as we can be about this. It strikes me that you cannot have it both ways.

    (iv) I am dubious of the claim that the sorts of tasks that have been discussed here recently fall under cognitive tasks, in the sense in which a projection from observations within a sample carries over (better) to a population. I would like to see the data for this claim.

    (v) How many is ‘many’? Do person-on-the-street interviews make up the majority of X-phi published results? Do you have data on the proportion? Are those persons in the streets of Ithaca/Chapel Hill/Boulder/Austin/Bloomington/… or are they getting representative samples of American adults?

    (i) Variation in a non-random sample is variation in a non-random sample, not variation in the general population.

  6. Regarding Turning the tables:

    I think I like better Jonathan L.’s and Wesley’s proposals, but this is what I had in mind:

    First, I agree that philosophers often get away with murder by appealing to intuitions. But I think the categorical assault on intuitions is wrong-headed. So, originally, and depending on how the thread unspooled, I thought about conflating the “good” use of common sense with the “bad” use of common sense, calling both “common-sense mongering”, and then turning a few X-phi arguments bashing intuition mongering into arguments bashing common-sense mongering. You get the idea.

    The denouement then would have been to recognize a distinction between an appropriate use of intuition and an inappropriate use of intuition, and to connect this to a distinction between an appropriate use of untested background knowledge and an inappropriate use of untested background knowledge…which I hoped would have come up somewhere in the thread.

    But Jonathan led off with Fisher, whom I admire even more than Dave Zinkoff, and it seemed like I should instead take a shot at getting straight to the point:

    There is nothing wrong with appealing to intuitions per se; the problem is mistaking an intuition as supplying grounds for a general empirical claim about a population. There is nothing wrong with appealing to commonsense per se; the problem is mistaking a common-sense judgment of representativeness in a sample as justification for a general empirical claim about a population.

  7. Wesley, you don’t have to be very sanguine about the use of convenience samples in the social sciences to be *more* sanguine than I am! I was picking up on your agreement with Jonathan’s (ii) in saying that you are more optimistic than me, but if I’ve over-stated things, please accept my apology.
