Informed Surveys: Some Thoughts on How to Gather Information About, Evaluate, and Rank Philosophy Programs

Here are some ideas about how philosophy programs might be evaluated more effectively than they now are. Right now, the most widely used evaluations are those of the Philosophical Gourmet Report. (Disclosure: I’m on the Advisory Board of the PGR, and have been and am a supporter and defender of that project. For my suggestions as to how a prospective student might best use the PGR in deciding which programs to apply to and attend, see this post.) My recommendations here will be presented in the form of potential modifications (what I would take to be improvements) of the PGR. But they could instead be implemented by some other evaluating project.

Below the Fold:
1. A Two-Stage Process
2. Informed Surveys
3. Citation Counts?

1. A Two-Stage Process

The PGR rankings are currently arrived at by means of surveys. The PGR evaluators are given faculty lists for various universities’ programs (without the names of the universities), and then each is asked to score those programs, on a scale of 0 to 5, both for overall strength, and also for strength in the particular area(s) of philosophy the evaluator works in. The results of these surveys are processed to yield overall and area-by-area evaluations of the various programs.

From my perspective as one of the evaluators, one of the weaknesses of the process by which overall rankings are produced is that we’re all evaluating programs based on our perceptions of their strength in areas we know little about. There are a lot of names on those faculty lists that I don’t recognize. My sense of the various programs’ strengths in areas I know little about is, of course, very important to how I will score them in the overall surveys. After all, I only know a couple of areas at all well. Yet this sense is not at all well-informed.

One thing I often do to address this problem is consult the PGR area rankings when deciding on what overall score to give the programs. I don’t know how legitimate that procedure will seem to people. It would be problematic to consult old PGR overall rankings in giving programs overall scores: The PGR is seeking my opinion, and not looking for me to be a conduit by which its own old results can be perpetuated into the future. But it doesn’t seem to me in the same way problematic to use the area rankings in my deliberations about overall scores. That certainly seems to me to improve the quality of the overall scores I give. The area-by-area evaluations are perhaps the most valuable part of the PGR, precisely because these are determined by evaluators who have a good idea about what they’re evaluating. In fact, some, I think, wonder whether the PGR should just confine itself to the area evaluations, and not include overall rankings. I’m in favor of keeping the overall rankings, but I think I do a better job of assigning overall scores when I take into account the area evaluations. Here I’m using my fellow-philosophers’ evaluations of programs in areas they are expert in as I arrive at my own estimation of the programs’ overall strength. This seems to me a good procedure.

However, since the area surveys are done at the same time as the overall, I’m using the previous PGR’s two-year-old area rankings. Things change over two years. And even when I’m aware of relevant changes (additions or losses of philosophers working in the area in question, significant new publications in the area, etc.), I’m not as well-positioned to evaluate the significance of those changes as are the evaluators who work in the area in question.

Thus, it seems a good idea to do the PGR surveys in two stages. First would be the area evaluations, in which we each use faculty lists to evaluate programs in the areas we know best. Then, presented with the results of these area evaluations in addition to the faculty lists, we score programs in terms of overall strength. In giving overall scores, evaluators can use, or not use, the results of the area evaluations as they see fit. But if the area results were easily available to me as I assigned overall scores, I would certainly use them, and I think I would do a much better job of assigning overall scores.
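To make the two-stage idea concrete, here is a minimal Python sketch. Everything in it is hypothetical: the program labels, areas, scores, and function names are invented for illustration, and averaging the stage-one scores is just an assumption made for the example, not a description of how the PGR actually processes its surveys.

```python
from collections import defaultdict
from statistics import mean

# Stage 1 (hypothetical data): area specialists score programs from 0 to 5
# in the area they know best. Each tuple is (program, area, score).
area_ballots = [
    ("Program A", "epistemology", 4.5),
    ("Program A", "epistemology", 4.0),
    ("Program A", "ethics", 3.0),
    ("Program B", "epistemology", 2.5),
    ("Program B", "ethics", 4.5),
    ("Program B", "ethics", 4.0),
]

def summarize_area_results(ballots):
    """Average the stage-one ballots for each (program, area) pair."""
    grouped = defaultdict(list)
    for program, area, score in ballots:
        grouped[(program, area)].append(score)
    summary = defaultdict(dict)
    for (program, area), scores in grouped.items():
        summary[program][area] = round(mean(scores), 2)
    return dict(summary)

def stage_two_packet(program, faculty, area_summary):
    """Bundle what an overall evaluator would see for one program in stage two."""
    return {
        "program": program,
        "faculty": faculty,
        "area_results": area_summary.get(program, {}),
    }

area_summary = summarize_area_results(area_ballots)
packet = stage_two_packet("Program A", ["Faculty 1", "Faculty 2"], area_summary)
print(packet)
```

The point of the sketch is only the ordering: the area results are computed first and then travel with the faculty list to whoever assigns the overall score, who remains free to use or ignore them.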

2. Informed Surveys

As a PGR evaluator, I feel I am on much firmer ground when I’m scoring programs on areas I’m familiar with than when I am assigning scores on overall strength. But even when I’m assigning scores on the area I know best (epistemology), I often feel I’m making problematically uninformed decisions. I often address this problem, to some extent, by looking up information I can find quickly on-line – seeing what faculty in a certain program list epistemology as one of their areas, what, if anything, they’ve written in the area, etc. But that’s very time-consuming, and, largely because it is so time-consuming, I don’t do nearly as much of such checking as I should. I doubt that I’m alone in this.

So I’m quite certain I would do a much better job of ranking programs in my area if I weren’t just presented with a list of faculty in the various programs, but were also provided, as I assign my area scores, with a list of the faculty in each program who work in the area in question, and, for each of the people so listed, simple bibliographic information on the papers and books they’ve written in the area, and answers to some standard questions concerning whether, what, and how often they teach classes and graduate courses in the area.

Gathering this information would of course be a huge undertaking. If it were done by the PGR (and I certainly don’t mean to be presuming that the PGR will attempt anything like this), then Brian Leiter would need to be provided with an enormous amount of new help. I have some ideas about how it might be gathered, but I won’t go into that here.

However, this information wouldn’t only be useful for PGR evaluators, but could also be made available in its own right. It seems precisely the kind of information that prospective graduate students and their advisors could use in deciding which programs are well-suited to them. Those who use the PGR rankings can also benefit from having access to this underlying information. And even those opposed to rankings might well benefit from the underlying information. Indeed, it seems just the information some of them would like to see made available in place of rankings. So, people would be free to use the rankings, the underlying information, or both, as they see fit.

Of course, much of this information is available in various places on-line. But what’s available is spotty, and it takes a long time to find what information there is – which is why such information is badly underutilized by PGR evaluators. If it were all provided conveniently to them as they did their area scoring, I think it would be extremely helpful, and would result in far better area rankings. And, even for those who don’t believe in rankings, it can be helpful to get such standard information on all the programs in one convenient place, to help them arrive at lists of programs to check out in what they can hope is greater depth, by visiting the programs’ web pages and pages for the various relevant faculty members.

3. Citation Counts?

Here at Certain Doubts, Jon Kvanvig has recently posted some interesting entries exploring the possibility of using citation counts to evaluate philosophers and even philosophy programs. Like Jon, I’m interested in seeing how useful such methods of evaluation can be made. But my own (perhaps somewhat skeptical) opinion is that the best use of such measures will likely be to better inform reputational surveys like the PGR. At any rate, I think that if there were reliable citation counts for philosophical work, it would improve the PGR if such counts were included in the information given to evaluators. Above, I suggested that area evaluators be given lists of papers written by various faculty members in the areas in question. I think it would be helpful to also provide them with citation counts for the papers that are listed, which evaluators could use as they see fit. Wise evaluators would know not to expect a lot of citations for papers that have been published recently. For such a recent publication, the quality of the journal it appears in may be a better quick indicator of its value, one that might prod an evaluator to take a closer look at the scholar’s work. But there may well be a situation where my judgment would be helped by a citation count. For instance, there could well be an epistemologist whose work I haven’t yet encountered, and so I wouldn’t give her department much epistemology credit for having her on board, but high citation counts for several of her epistemology papers might be a good indication that I should take a closer look at her work before scoring her department in epistemology.

For reasons I’ve given in recent posts and comments here at Certain Doubts, I don’t think that Google Scholar is a good enough source for citation counts, so if that’s the best source we have, then I wouldn’t include GS counts in the information supplied to evaluators. But perhaps a better source for citation counts will be developed. And, even now, evaluators would of course be free to look up philosophers’ work on Google Scholar to get an idea of the impact of their work.
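For readers unfamiliar with how per-paper citation counts get boiled down to a single figure, here is a minimal Python sketch of the Hirsch number (h-index), which comes up in the comments below. The citation counts are made up, and this is only the standard textbook definition, not necessarily the exact calculation behind any particular ranking; the result is also only as good as the citation source feeding it.

```python
def h_index(citation_counts):
    """Return the largest h such that h papers have at least h citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank   # the top `rank` papers all have at least `rank` citations
        else:
            break      # counts only fall from here, so h cannot grow further
    return h

# Hypothetical citation counts for one scholar's papers in one area.
papers = [48, 30, 12, 7, 4, 1, 0]
print(h_index(papers))
```

Note that a scholar whose papers are all recent will score low on this measure however good the work is, which is the same caution about recent publications raised above.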


Informed Surveys: Some Thoughts on How to Gather Information About, Evaluate, and Rank Philosophy Programs — 11 Comments

  1. Keith, excellent thoughts here. I especially like the two-stage suggestion for PGR, and I concur with you about my own practices in evaluating: I don’t know the people well enough and worry about not knowing what people are doing even in my own field (especially in recent times where formal epistemology has come to be such a hot subpart of philosophy, and one that I want to take seriously in evaluating departments).

    The other part that I like very much is that your suggestions would move PGR to a system in which the evaluation of a department is closer to a function of the evaluation of its parts than the present system is. It might be best just to have evaluators assess the parts in question, and leave the assignment of overall weight up to other parties: maybe have a program that lets the reader of the Report weight the factors however they choose, for example. Or maybe select some obvious choices, and include results for those weightings.

    The one thing that compiling the data for the Hirsch number rankings did for me was to convince me of how ignorant I am of the players in philosophy that are outside my area. In some cases, I’d conclude that the numbers were not representative of status, but in other cases, and the more ordinary case, I concluded that I just don’t know enough about what is going on in subfields other than those I work on. And the data gathering convinced me that my ignorance is deeper and broader than I already knew about, and had considered, in doing evaluations for the PGR. Though it is possible that I’m much worse in this respect than other evaluators, I’ll continue to doubt it until presented with more compelling evidence than I now have!

    This last point, and your discussion of it, is part of what underlies the halo effect and the way in which conference participation and presentation has an effect on the rankings that citation information can help correct for. If we had an adequately funded national organization that wasn’t crippled by the forces opposed to any kind of sociological scrutiny of our profession, they could provide a really useful service to the profession. Such scrutiny will reveal impostors for what they are, however, so I suspect we’ll never in our lifetime see a change on this score.

  2. One of Jon’s comments in another post suggests that numerous folks are interested in having a ranking system that is more inclusive or maybe more refined. Jon lists 99 PhD granting departments, 49 of which are not mentioned in the PGR. So, one might say they are not included in the PGR or that they belong in an “unranked” category in the PGR. In either case, those 49 fall into a broad category. Essentially half the departments fall in that category. If rankings are in fact a service for potential grad students and for the profession, wouldn’t a more inclusive or refined ranking be a better service for potential grad students and the profession? Issues of tractability aside, might this not be a way to improve the quality and usefulness of the PGR?

  3. part of what underlies the halo effect and the way in which conference participation and presentation has an effect on the rankings that citation information can help correct for

    I’m missing the allusion here. I guess you’re referring to the favorable bias produced by participation in (sponsoring of?) high profile (national?) conferences on a department’s ranking. Is that it? I’m not sure why that shouldn’t count in favor of a department, so I’m probably missing something.

  4. In response to K (comment #2): I myself would not advocate extending the PGR, as it’s currently done, to cover all or even more of the graduate programs. I believe that the rankings (& here I’m thinking mostly of the overall rankings) become less and less trustworthy as you move down the list to lower ranked programs. (So, even among the limited number of programs currently ranked, I believe less confidence is appropriate in the bottom of the rankings than in the top.) This is because the rankings are based on reputational surveys, and evaluators tend to have more of a basis for the scores they assign to the more highly regarded programs. At what point, as you move down the list, you should stop doing the rankings is a very tricky call, and I admit that I have little confidence in my own inclinations as to where to draw that line. (The worries of some skeptics about the PGR can be seen as a limiting case of this kind of worry I’m expressing here. Like me, they worry that evaluators don’t have enough basis for the scores they assign, and like me, they may think this problem intensifies as one moves down the list of programs, but, unlike me, they think the problem is already bad enough at the top that we should stop doing these rankings before they begin.) But my own inclination is not to think that the PGR should be extended to cover significantly more programs in the overall rankings.

    However, I do believe that if a reputational survey were done in the way I’m here suggesting, it would be more reliable further down the list, and in fact, even if it included all PhD programs, it would be more trustworthy at the bottom of that long list than the PGR currently is toward the bottom of its much shorter list. This would be a very significant advantage because, as you point out, evaluations of the programs currently excluded from the PGR would be a very valuable service to many prospective students.

  5. Mike, I had a glitch in the software. Below is the original reply I wrote to your question and tried to post; I just noticed this morning, after posting the next reply below, that it was routed so that administrator approval was required before it would show up. That’s really strange! Reminds me of John Perry’s story at the beginning of the paper about essential indexicals…

    Mike, the point is that face-to-face contact tends to trump more reliable information about quality (such as actually reading a person’s work), and some departments have better travel budgets than others, and some people like to travel more than others. So these places and people secure an advantage in the ratings because of conference participation. The point isn’t that such participation is not a good thing and shouldn’t count at all, but that its value is likely to be disproportionate to its actual significance. The allusion I made was, I think, to the studies that show that interviewing candidates tends to swamp more reliable, on paper, information about the quality of the candidates. These studies have led Princeton, among others, to discontinue APA interviews, and we followed their lead when I was at Missouri.

  6. Mike, the worry is that face-to-face experiences have disproportionate influence compared to actual significance. There is data to this effect regarding interviews in comparison with more reliable, on-paper, data about job candidates, and I think the same phenomenon occurs regarding conferences in philosophy as opposed to written work. Citation-based information is a surrogate for actually reading all the work in question, and not a completely reliable one even when the database is excellent, but it can correct for a disproportionate assessment of a person or department based on personal contact.

  7. Keith wrote about the problem of assessing the strength of people in fields outside his own:

    One thing I often do to address this problem is consult the PGR area rankings when deciding on what overall score to give the programs. I don’t know how legitimate that procedure will seem to people.

    If his practice is representative, then it would go some way toward explaining why the Leiter data records a high degree of consensus.

    I think it is reasonable to complain a bit about this procedure: the task is for each evaluator to rank the departments, not for the evaluators to collude in assigning ranks. Looking at the specialty ranking information is (a form of) collusion.

    Push-polling is another form of consensus-building; the hand-selection of evaluation panels is another one still.

    I hope it is clear that I am complaining about methodology. The kind of variability in Kvanvig’s data is more in line with what I would expect in a discipline like philosophy, where people spend a lot of time at conferences talking and disagreeing about (other people’s) social position in the field. I don’t think philosophers would do this if they actually were in a “high consensus” field. (Otherwise, why bother?) The Leiter data look more like reports of bankers or financiers assessing each other rather than philosophers.

    I don’t want to suggest that there is no point to a sociological study of the field. But I think scorn and derision should be heaped upon ranking methods that resist complete, transparent, data-driven assessments of the field.

  8. Greg, I had suggested above that maybe only subfield rankings by specialists in the field should be gathered, leaving it to some other group or program to combine the data into overall rankings. There is a problem with this idea, however. Think of students looking at the subfield rankings and trying to decide what to make of them. To make an informed decision, they’d want to know what the dominant view in the profession is about which areas are most important. Keith’s second stage of evaluation addresses that concern, even though it also raises the concerns you have. But, if I were a student trying to decide where to apply, I’d want the information in Keith’s second stage, even knowing the methodological issues it raises.

    I had thought that one might use a different type of second stage, where one asks raters simply to rate subfields in terms of their importance to the discipline, but that won’t work. MIT is a good example of why. I’d never discourage a student from going to MIT if their interests involved MIT’s strengths, even though I doubt they would do well in terms of the distribution of subfields that would come out of a survey of the sort I imagine.

    Keith and I are both concerned about data sources, but if you want to see something even more worrisome, look at the data being used by Academic Analytics. There’s some info in the Chronicle of Higher Ed about it, and they are gathering clients slowly at present, but (I predict) only slowly because the NRC rankings are about to come out. As soon as they do and universities are looking for more readily available data, Academic Analytics will be a stronger force in academia, and the z-score that they generate gives some rather, errr, unusual results…

  9. One candidate for a second stage is placement data. There might be terrific people at U State but lousy placement results. From a student’s point of view, that would count against U State. From the profession’s point of view, that misalignment might count against any number of things: halo effects on hiring practices, vagaries of fashion, the number of PhD programs, a preference for caste systems….

    A solution to Academic Analytics is to be found in the kind of thing Jon’s doing. If you find a basket of metrics that are positively correlated, and you find the groupings that result useful—for professional mobility, for guiding students to graduate programs, for settling needling questions of purity—then either the z-score will be in this basket or it won’t.

    And if it isn’t? Then it might point to features that the profession should consider. But, more likely it will disagree because the metric is too coarse for measuring anything of interest in philosophy. Again, a case for embracing best practices: If the data is open and the methods transparent, then you can have this kind of discussion and be in a much better position for defending your interests.

  10. Why not for the first stage, instead of grouping philosophers by department, just list philosophers by AOS? The idea would be to determine, per philosopher, what kind of positive contribution he or she would make to a department, given his or her reputation. People would only be required to evaluate people in their fields, or people with whose work they are familiar from any field. Then Leiter or someone else could, as it were, put the departments back together and determine the departmental score as a function of the individual scores of the philosophers.

  11. It is the ‘put[ting] the departments back together and determine the departmental score as a function of the individual scores of the philosophers‘ part that is the rub: you’d need to say what function should be used and then explain/defend why.
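The worry in the last comment can be illustrated with a minimal Python sketch. The department names, individual scores, and both aggregation functions are hypothetical, chosen only to show that the choice of function can reverse the resulting order; nothing here is a proposal for which function should actually be used.

```python
from statistics import mean

# Hypothetical individual scores (0 to 5) for the philosophers in two departments.
departments = {
    "Dept X": [4.8, 4.6, 2.0, 1.5, 1.0],
    "Dept Y": [3.4, 3.2, 3.0, 2.9, 2.8],
}

def mean_score(scores):
    """Departmental score as the plain average of all individual scores."""
    return mean(scores)

def top_k_mean(scores, k=2):
    """Departmental score as the average of the k highest individual scores."""
    return mean(sorted(scores, reverse=True)[:k])

def rank_departments(depts, score_fn):
    """Order department names from highest to lowest under score_fn."""
    return sorted(depts, key=lambda name: score_fn(depts[name]), reverse=True)

print(rank_departments(departments, mean_score))
print(rank_departments(departments, top_k_mean))
```

With these made-up numbers, the plain average puts Dept Y first while the top-two average puts Dept X first, so whoever reassembles the departments has to say which function is being used and why.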
