Comments on the Rorschach Controversy

Jim Wood, Scott Lilienfeld, and Teresa Nezworski

August 2009

The following essays were published as part of an exchange concerning the Rorschach Technique on the listserv of the Society for a Science of Clinical Psychology, August 2009. This material is posted by permission of Prof. Wood.

Wood's critique is persuasive, to me, but the Rorschach still has its defenders. In 2011 Prof. Gregory Meyer and his colleagues introduced a new scoring system for the Rorschach that, they claim, has improved validity over previous systems (including Exner's highly popular "Comprehensive System"). Link to a paper from that group, documenting their approach.

Julius Wishner, one of my teachers in graduate school, claimed that the Rorschach technique was "psychology's most interesting test." He's right about that. But Wishner was also very skeptical that the Rorschach could tell us anything that we couldn't find out through alternative means that were both more valid and more efficient. I agree, and offer the Rorschach here only as an example of the constructive point of view on perception.

Part I: Validity

Scientific controversy has raged around the Rorschach since the 1950s, and around Exner's Comprehensive System since 1995. In the latest flare-up, Wikipedia's publication of the Rorschach inkblots has set off heated exchanges on SSCP-net and other internet discussion lists. Greg Meyer, a noted Rorschach proponent and editor of the Journal of Personality Assessment, has posted two long messages to SSCP-net in defense of the test.

We would like to respond to Meyer by addressing four topics concerning the Rorschach:

(1) Validity;

(2) Norms and Standardization;

(3) Interrater Reliability;

(4) Over-Marketing.

The present and lengthiest message focuses on the first topic, validity.

Validity of Exner's Comprehensive System

List members may have noticed an important inconsistency between the first and second parts of Meyer's posting on Rorschach validity. In the first part, he cites meta-analyses that have found the "global validity" of the Rorschach to be approximately .30. He concludes:

"In a general and global way the Rorschach demonstrates reasonable validity..... clinicians on the front lines should see about the same level of global validity in their work with clients for the Rorschach as for the MMPI." [italics added]

In contrast, during the second part of his posting, Meyer acknowledges that it does not really make sense to talk about the global validity of the Rorschach:

"Of course not everything is fine with the Rorschach either. It is not a single entity and so it does not really make sense to talk about the validity of the 'Rorschach' per se but about the validity and utility of individual scales." [italics added]

We're glad that Meyer agrees that validity is a characteristic of inferences drawn from individual test scores, and that "it does not really make sense to talk about the validity of the Rorschach per se." But isn't this statement inconsistent with the earlier part of his posting, in which he asserts that the Rorschach possesses "global validity" and is valid "in a general and global way"? The fact that Meyer openly acknowledges the inconsistency does not make it any less of an inconsistency. Instead, it should alert us that his interpretation of the Rorschach meta-analyses is not logically coherent, since it is based on a concept -- "global validity of the Rorschach" -- that doesn't quite make sense.

Our own interpretation of the meta-analyses on Rorschach "global validity" has the advantage, we think, of being logically coherent and consistent with the standard meaning of "validity." Essentially our position is this: Some scores in Exner's Comprehensive System possess well-demonstrated validity for some purposes, but most do not. Because meta-analyses of "global validity" typically average together validity coefficients from a potpourri of Rorschach variables, the resulting average validity coefficient is inevitably non-zero. However, this average validity -- which Meyer calls "global validity" -- tells us very little about the crucial validity question that clinicians must address before using the Rorschach: Which scores are valid for which purposes?

Exner's Comprehensive System is far and away the most widely used Rorschach scoring and interpretive system. For the past 13 years, we have challenged Exner, Meyer, and other proponents of the System to publish a list of its individual scores that have been validated in methodologically sound, independently replicated studies with consistent results. As of today, we are still awaiting such a list. Instead, Rorschach proponents have generally responded -- as Meyer does in his posting -- by citing "global" validity and then pointing to a few well-validated or promising Rorschach variables that are *not* in the Comprehensive System and that are rarely used in clinical settings, such as Masling's Rorschach Oral Dependency variable (ROD) and Klopfer's Rorschach Prognostic Rating Scale (RPRS).

We repeat our challenge to Meyer today: Please list the Exner Comprehensive System scores (not the Masling or Klopfer scores) that have been well validated in the scientific literature in consistent and independently replicated studies, along with appropriate citations. We have compiled our own list, and we think that about 20 Exner scores (out of more than 180) have reasonably well demonstrated validity. In general, we find that some Comprehensive System scores are related to intelligence, psychosis, and disorders that involve thought disorder (such as schizophrenia and Borderline Personality Disorder). Aside from these scores, however, we have not identified any Exner variables with a well-demonstrated relationship to other psychological diagnoses, to anxiety, depression, impulsiveness, psychopathy or anti-social behaviors, or to other clinically relevant personality traits. The Comprehensive System purports to measure these characteristics, and clinicians using the Rorschach on the front lines typically assume it does, but we can't find good evidence for this claim.

Below we sketch out our position more fully. List members interested in extended discussions can consult our book What's Wrong With the Rorschach? (Wood, Nezworski, Lilienfeld, & Garb, 2003, pp. 252-253) or an article on the web (Wood, Nezworski, Garb, & Lilienfeld, 2006):

http://www.division42.org/MembersArea/IPfiles/Spring06/practitioner/rorschach.php

Here is our take on meta-analyses of Rorschach "global validity," and on the validity of scores in Exner's Comprehensive System:

1. The most widely cited meta-analysis on the global validity of the Rorschach is by Hiller, Rosenthal, Bornstein, Berry, and Brunell-Neuleib (1999). We will focus on it here, with the understanding that our remarks apply generally to similar "global" meta-analyses.

2. Hiller et al. combined the results from 30 Rorschach articles randomly selected from the published literature. The topics of these articles were extremely diverse. For instance, three articles examined the correlation of Form Quality scores with learning disabilities. Another examined the correlation of Klopfer's Rorschach Prognostic Rating Scale with patients' improvement after psychotherapy. When the results from these and the remaining articles were combined, the average validity coefficient ("global validity") was .26.

3. First methodological caveat: A huge number of Rorschach variables have been developed and used in the past half century. The Exner system alone contains more than 180 variables. We estimate that perhaps 300 to 500 Rorschach variables were used clinically or in research between 1977 and 1997, the years covered by the Hiller et al. meta-analysis. But the meta-analysis included only 44 variables, a small fraction of the total.

Contrary to what is sometimes assumed, the Rorschach scores in the Hiller et al. meta-analysis did not represent a random sample from the much larger population of Rorschach variables. Only a small fraction of Rorschach scores have been frequently researched, in the sense that they have been examined again and again in published validity studies. Most Rorschach variables are rarely or never examined in published validity studies. Because the Hiller et al. meta-analysis was based on a random sample of published studies only, it inevitably over-represented the frequently researched scores and under-represented the infrequently researched ones. For example, tables in Hiller et al. show that certain frequently researched Rorschach categories (e.g., form quality scores) were heavily over-represented in its sample.

Thus the .26 average validity coefficient reported by Hiller et al. is not based on anything approaching a random sample of Rorschach scores. Instead, it reflects a small subset of frequently researched scores and can't confidently be generalized to Rorschach scores in general.
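A minimal sketch of this sampling effect, using invented numbers rather than anything taken from Hiller et al.: if studies are drawn at random, each score is in effect weighted by how often it has been studied, so the expected average can sit well above the average taken over all scores. The population below (10 heavily studied scores and 90 rarely studied ones, with the validities shown) is purely hypothetical.

```python
# Hypothetical population of Rorschach scores. All numbers are invented
# for illustration; none of them come from Hiller et al. (1999).
# Each entry is (number of published validity studies, validity).
scores = (
    [(40, 0.35)] * 10   # 10 frequently researched scores
    + [(1, 0.05)] * 90  # 90 rarely researched scores
)


def mean_over_scores(pop):
    """Average validity when every score counts exactly once."""
    return sum(v for _, v in pop) / len(pop)


def mean_over_studies(pop):
    """Expected average validity when published studies are sampled at
    random, so each score is weighted by its number of studies."""
    total_studies = sum(n for n, _ in pop)
    return sum(n * v for n, v in pop) / total_studies


per_score = mean_over_scores(scores)
per_study = mean_over_studies(scores)

print(f"average over scores:  {per_score:.2f}")
print(f"average over studies: {per_study:.2f}")
```

In this made-up population the study-weighted average is several times larger than the per-score average, which is the direction of bias described above. The size of the gap depends entirely on the assumed numbers.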

4. Second methodological caveat: Because the Hiller et al. meta-analysis included only published studies, its average validity coefficient may also be inflated due to publication bias and the file drawer effect. However, for purposes of discussion we're willing to accept .26 as a reasonable ballpark estimate of the average validity of the most frequently researched Rorschach scores.

5. Keeping the foregoing caveats in mind, the Hiller et al. results lead us to the following conclusion: About half of the most frequently researched Rorschach scores probably have validity greater than .26. By the same logic we're led to a symmetrical conclusion: About half of the most frequently researched Rorschach scores probably have validity less than .26.

6. Like Lee Sechrest, we doubt that a test score with validity lower than .40 is likely to be clinically useful for individual diagnostic judgments. However, for purposes of discussion we're willing to accept the premise that a score with validity of .26 or higher possesses what Meyer calls "reasonable validity." Even by this lenient standard, however, only about half of the most frequently researched Rorschach variables appear to have "reasonable validity."

7. We have now gone as far as "global" meta-analyses of the Rorschach can take us. But for clinical purposes, it's not far enough. The question immediately arises: "Which of the frequently researched Rorschach scores -- particularly the scores in Exner's Comprehensive System -- are the ones with 'reasonable validity'?"

It would be scientifically and professionally indefensible to ignore this question and use any Rorschach score we please, because (a) the meta-analysis indicates that about half of the frequently researched scores lack "reasonable validity", and (b) the meta-analysis provides little or no information about the validity -- even the "average validity" -- of infrequently researched scores.
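The limits of an average can be sketched with two invented sets of validity coefficients. Both sets are hypothetical, both average .26, and in both sets half the scores reach .26, yet they would call for very different clinical decisions. Nothing here is real Rorschach data.

```python
from statistics import mean

# Two hypothetical sets of ten validity coefficients (invented numbers).
tight_set = [0.28] * 5 + [0.24] * 5
mixed_set = [0.50, 0.48, 0.46, 0.44, 0.42, 0.10, 0.08, 0.06, 0.04, 0.02]


def share_at_or_above(coeffs, threshold):
    """Fraction of coefficients that reach the threshold."""
    return sum(c >= threshold for c in coeffs) / len(coeffs)


for name, coeffs in [("tight", tight_set), ("mixed", mixed_set)]:
    print(
        f"{name}: mean={mean(coeffs):.2f}, "
        f"share >= .26: {share_at_or_above(coeffs, 0.26):.1f}, "
        f"lowest={min(coeffs):.2f}"
    )
```

The average is the same in both cases, but only a score-by-score listing shows that half of the second set is close to zero. That score-by-score listing is what a "global" figure cannot supply.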

8. For this reason, from 1996 to the present we have repeatedly challenged Exner, Meyer, and other Rorschach proponents to publish a list, with citations, of Comprehensive System scores that are well validated. Which scores have shown a consistent relationship to psychological diagnoses or symptoms, personality traits, or behavior, in several methodologically adequate replications carried out by independent research groups? We would like to see something like Roger Greene's book, which even-handedly evaluates the scientific evidence for the validity of MMPI-2 scores. Even a casual perusal of Greene's book reveals that multiple MMPI-2 clinical scales (despite their well-documented psychometric shortcomings) have shown at least moderate validity when compared with other self-report and interview measures of their corresponding constructs.

Despite multiple requests on our part, Exner, Meyer, Weiner, and other Rorschach proponents have never responded to this challenge. Of course, we expect that publishing such a list would entail at least an implicit admission that most Comprehensive System scores lack well demonstrated validity. Given the lack of response from Rorschach proponents, we have tentatively created our own list of 20 Comprehensive System variables with well-demonstrated validity. But before presenting and discussing it in detail, we will pause to review the meta-analyses cited in Meyer's posting that have examined the validity of specific Rorschach variables, rather than the "global validity" of the test as a whole.

9. A few meta-analyses have been reported on the validity of specific Rorschach variables that are not part of the Comprehensive System. We briefly summarize the results here because Meyer cited them in his post and because they provide useful background on the validity of Rorschach scores outside the Exner System. Some readers may want to skip the summaries and go on to the next section.

(i) Meyer and Handler (1997; Meyer 2000) published a meta-analysis which showed that Bruno Klopfer's Rorschach Prognostic Rating Scale (RPRS) is a valid predictor of psychotherapy outcomes. The RPRS is not a part of the Comprehensive System and appears to be incompatible with its administration procedures. It has not been widely used since the 1970s and lacks current norms.

(ii) Romney (1990) published a meta-analysis showing that Singer's Rorschach Communication Deviance score is a valid measure of thought disorder among relatives of patients with schizophrenia (Singer, Wynne, & Toohey, 1978). The Communication Deviance score is not a part of the Comprehensive System and lacks current norms.

(iii) Bornstein (1998; 1999) published two meta-analyses which showed that Masling's Rorschach Oral Dependency scale (ROD) bears a valid relationship to non-pathological observed dependent behaviors (though apparently not pathological dependency or Dependent Personality Disorder) and to retrospectively recalled physical illness. The ROD is not a part of the Comprehensive System, is used almost exclusively in research settings, and lacks good norms.

(iv) Gronnerod (2004) published a meta-analysis which, according to Meyer's post, shows the "ability of the Rorschach to measure change as a result of psychotherapy." After a careful reading, we doubt that it shows any such thing. It is an odd and deeply puzzling meta-analysis and we will post a separate message to the list about it, seeking comment from Meyer and other list members. In any case, Gronnerod's meta-analysis includes many studies dating back to the 1940s and 1950s, long before the Exner system was developed. It sheds little or no light on the validity of Comprehensive System scores.

10. Having reviewed meta-analyses on variables outside the Exner system, we turn to meta-analyses on specific Comprehensive System variables. As a careful reading of Meyer's post reveals, there has been only one: A meta-analysis by Jorgensen et al. (2000) found that Exner's Depression Index had a validity of .14 for detecting depression, and his Schizophrenia Index had a validity of .44 for detecting schizophrenia. It is sobering to realize that the Comprehensive System has been in widespread clinical use for more than 30 years, and the center of controversy for 14 years, yet this is the only meta-analysis that has ever looked at the validity of any of its individual variables.

11. We are now ready to consider our own proposed list of valid Comprehensive System variables. Even in the absence of meta-analyses, there's sufficient research evidence to identify four main areas in which Comprehensive System scores have well-established validity (for citations, see Wood, Nezworski & Garb, 2003; Wood, Nezworski, Lilienfeld, & Garb, 2003; Wood et al., 2006).

First, the inkblot responses of patients with schizophrenia and bipolar disorder often exhibit poor form quality. That is, the images reported by these patients often do not fit the shape of the blots. The most prominent measures of form quality in the Exner system are Conventional Form (X+%), Distorted Form (X-%), Form Appropriate Extended (XA%), and the good and poor Human Representational Variables (GHR and PHR). As Robyn Dawes and others have noted, these are perceptual, not projective variables, and reflect the well documented and unsurprising finding that people with psychotic disorders and other disorders marked by reality distortion often perceive odd percepts in stimuli.

Second, the inkblot responses of patients with schizophrenia, schizotypal personality disorder, or borderline personality disorder, and patients in the manic phase of bipolar disorder, are often characterized by thought disorder, that is, by disorganized cognitions and peculiarities of language. The two most important measures of thought disorder in the Exner system are the Weighted Sum of 6 Special Scores (WSum6) and Level 2 scores.

Third, the Exner system includes three global indexes that combine measures of poor form quality with measures of thought disorder: the Schizophrenia Index (SCZI), the Perceptual Thinking Index (PTI), and the Ego Impairment Index (EII). These three indexes are highly correlated with each other and largely redundant. Patients with schizophrenia and other psychotic conditions receive high scores on all three.

Fourth, numerous CS scores are correlated with IQ. Moderate correlations with IQ, ranging from .30 to .40, have been found for Developmental Quality (DQ+) and Organizational Activity (Zf), scores that reflect the degree to which a patient has synthesized the diverse parts of each blot into a unified image. Form Quality scores (X+%, X-%, XA%), the total number of responses (R), Human responses, Human Movement responses (M), Whole responses, Blends, Lambda, and F% (a variant of Lambda) are also correlated with IQ.

12. Other than the scores just listed, we can identify no other Comprehensive System variables with well-demonstrated and consistently replicated validity -- though we challenge Meyer and other Rorschach proponents to supplement our list and provide relevant citations to the literature. In a wide-reaching review of the research evidence on the Rorschach and diagnoses nine years ago (Wood, Lilienfeld, Garb, & Nezworski, 2000), we concluded:

"The Rorschach has not shown a well-demonstrated relationship... to Major Depressive Disorder, Posttraumatic Stress Disorder (PTSD), anxiety disorders other than PTSD, Dissociative Identity Disorder, Dependent, Narcissistic, or Antisocial Personality Disorders, Conduct Disorder, or psychopathy."

To the best of our knowledge, no new evidence has been brought forward to change this conclusion.

13. So far we have focused on research that explicitly addresses the validity of Comprehensive System variables. However, there is another aspect of validity that is seldom recognized as such: test norms.

As noted earlier, validity is a characteristic of the inferences that are made from test scores. If test norms are inaccurate, then the inferences made from test scores will be inaccurate. Thus test norms are an aspect of validity -- even though books on psychological testing typically put the two topics in different chapters.

As we will discuss in our next post, the norms of the Comprehensive System are controversial and in our view seriously inaccurate. They can contribute to erroneous inferences that compromise whatever validity Rorschach scores possess. For the moment, however, we will simply note that the controversy surrounding the Rorschach norms is essentially a controversy about Rorschach validity.
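The mechanism can be sketched with a toy calculation. Every number below is hypothetical and chosen only to show the direction of the effect: a score is flagged when it falls in the top 5% of the published norms, but the healthy population actually scores one standard deviation higher than the norms say. The normal-distribution assumption is also ours, made for simplicity.

```python
from statistics import NormalDist

# Hypothetical distributions, assumed normal purely for illustration.
published_norm = NormalDist(mu=10.0, sigma=2.0)   # what the norm tables claim
true_population = NormalDist(mu=12.0, sigma=2.0)  # how healthy people actually score

# Flag a score as "pathological" if it lies in the top 5% of the published norms.
cutoff = published_norm.inv_cdf(0.95)

# Share of healthy people flagged if the norms were accurate (5% by construction).
flagged_if_norms_correct = 1 - published_norm.cdf(cutoff)

# Share of healthy people flagged given the shifted true distribution.
flagged_in_reality = 1 - true_population.cdf(cutoff)

print(f"cutoff: {cutoff:.2f}")
print(f"flagged if norms were accurate: {flagged_if_norms_correct:.1%}")
print(f"flagged with shifted population: {flagged_in_reality:.1%}")
```

Under these assumptions roughly a quarter of healthy respondents would be flagged instead of the intended 5%, even though the score itself is measured in exactly the same way. The inference fails because of the reference point, not because of the scoring.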

14. In closing, we will return briefly to the notion of "global validity." As we noted earlier, it is such a troublesome and slippery concept that Meyer embraces it in one paragraph and rejects it in another.

In our opinion, it is typically a bad idea to apply the term "global validity" to the Rorschach or any test, because it suggests -- wrongly -- that the test as a whole is valid or invalid. For instance, in a forensic setting, a psychologist who testifies that the "global validity of the Rorschach is comparable to that of other psychological tests" is giving a misleading impression that inferences drawn from the Rorschach generally possess validity -- whereas in fact many inferences commonly drawn from the test have the substantiality of moonbeams. Even as we write, hundreds of clinicians across the country are interpreting specific Rorschach indices as though they possessed well documented validity for drawing inferences about specific psychological disorders and personality traits. This is why we find the line of argumentation put forth by Meyer and many other Rorschach advocates so deeply troubling.

We ourselves have sometimes used the term "global validity" in our publications. But in general, we recommend that psychologists be more precise in their speech, and we will try to do so as well. Instead of "global validity," it is better to refer to "the average validity of frequently studied Rorschach scores." Or better yet, why not just say: "Some Rorschach scores are valid for some purposes. But most have little or no demonstrated validity at all."

References

Bornstein, R. F. (1998). Interpersonal dependency and physical illness: A meta-analytic review of retrospective and prospective studies. Journal of Research in Personality, 32, 480-497.

Bornstein, R. F. (1999). Criterion validity of objective and projective dependency tests: A meta-analytic assessment of behavioral prediction. Psychological Assessment, 11, 48-57.

Gronnerod, C. (2004). Rorschach assessment of changes following psychotherapy: A meta-analytic review. Journal of Personality Assessment, 83, 256-276.

Hiller, J. B., Rosenthal, R., Bornstein, R. F., Berry, D. T. R., & Brunell-Neuleib, S. (1999). A comparative meta-analysis of Rorschach and MMPI validity. Psychological Assessment, 11, 278-296.

Jorgensen, K., Andersen, T. J., & Dam, H. (2000). The diagnostic efficiency of the Rorschach Depression Index and the Schizophrenia Index: A review. Assessment, 7, 259-280.

Meyer, G. J. (2000). The incremental validity of the Rorschach Prognostic Rating Scale over the MMPI Ego Strength Scale and IQ. Journal of Personality Assessment, 74, 356-370.

Meyer, G. J., & Handler, L. (1997). The ability of the Rorschach to predict subsequent outcome: A meta-analysis of the Rorschach Prognostic Rating Scale. Journal of Personality Assessment, 69, 1-38.

Romney, D. M. (1990). Thought disorder in the relatives of schizophrenics: A meta-analytic review of selected published studies. Journal of Nervous and Mental Disease, 178, 481-486.

Singer, M. T., Wynne, L. C., & Toohey, M. L. (1978). Communication disorders in the family of schizophrenics. In L. C. Wynne, R. L. Cromwell, and S. Matthysse (Eds.), The nature of schizophrenia: New approaches to research and treatment (pp. 491-511). New York: Wiley.

Wood, J. M., Lilienfeld, S. O., Garb, H. N., & Nezworski, M. T. (2000). The Rorschach Test in clinical diagnosis: A critical review, with a backward look at Garfield (1947). Journal of Clinical Psychology, 56, 395-430.

Wood, J. M., Nezworski, M. T., & Garb, H. N. (2003). What's right with the Rorschach? Scientific Review of Mental Health Practice, 2, 142-146.

Wood, J. M., Nezworski, M. T., Garb, H. N., & Lilienfeld, S. O. (Spring, 2006). The controversy over Exner's comprehensive system for the Rorschach: The critics speak. The Independent Practitioner. Available on the web at

http://www.division42.org/MembersArea/IPfiles/Spring06/practitioner/rorschach.php

Wood, J. M., Nezworski, M. T., Lilienfeld, S. O., & Garb, H. N. (2003). What's wrong with the Rorschach? Science confronts the controversial inkblot test. San Francisco: Jossey-Bass.

Wood, J. M., Nezworski, M. T., & Stejskal, W. J. (1996). Thinking critically about the Comprehensive System for the Rorschach: A reply to Exner. Psychological Science, 7, 14-17.

Part II: Comprehensive System (CS) Norms

In our previous posting to SSCP-net we discussed the validity of scores in Exner's Comprehensive System (CS) for the Rorschach. We're glad to learn from Greg Meyer's response that we share common ground with him, and that he agrees with our conclusion that most Comprehensive System scores lack demonstrated validity and shouldn't be used in clinical practice. Although we continue to disagree with Greg on some of the specifics (e.g., the number of Rorschach scores that have been validated for their intended purpose in independently replicated studies), we're pleased by the collegial tone of the debate.

In this posting we describe the long and ever-expanding controversy concerning the Comprehensive System norms. As we previously noted, bad norms undermine the inferences based on test scores and thus should be regarded as an important aspect of test validity. If the Exner norms are seriously in error, their use can compromise whatever validity Rorschach scores otherwise possess. After discussing the problems of the Exner norms, we will briefly review issues concerning the new standardized administration procedure of the Comprehensive System (for earlier critiques, see Wood, Nezworski, Lilienfeld, & Garb, 2003; Wood, Nezworski, Garb, & Lilienfeld, 2006; Wood, Nezworski, Lilienfeld, & Garb, 2009).

The Crisis With the Comprehensive System Norms

During the past 50 years, debates have occasionally arisen among psychologists concerning the norms of certain popular tests. However, none of these disputes can compare even remotely with the wide-ranging and at times bizarre controversy that has surrounded the Comprehensive System norms for the past 10 years. For instance, there are now two competing sets of adult norms for the Comprehensive System, each preferred by a different faction of Rorschach experts. Likewise, consensus is growing (even among many prominent Rorschach proponents) that the children's norms for the Comprehensive System are so problematic that they should be discarded. A crisis has engulfed the Comprehensive System because of its norms and, in our view, no clear resolution is in sight.

The origins of the controversy can be traced back 20 years. The story begins in 1989 with Greg Meyer, then a doctoral student, now the editor of the Journal of Personality Assessment. While preparing his dissertation on the Comprehensive System, Greg routinely computed descriptive statistics for its variables. Unexpectedly, he discovered that the scores of the college students in his sample generally appeared pathological when compared with the Exner norms:

For virtually every variable the variances and/or the means were significantly different across the two samples..... In general, and in contrast to the objective data.... the Rorschach data indicated that the current sample was more "pathological" than the standardization sample. (Meyer, 1989/1991, p. 167)

Meyer (1989/1991) carefully re-checked the scoring of the Rorschach protocols. He also looked for evidence that his sample of undergraduates was abnormal, but found "there were no clear problems with the present sample in terms of Rorschach scoring or in terms of its comparability to a typical college population" (p. 175). Greg concluded there were "problems" with the Exner norms (p. 175) and observed:

Much of Exner's data remains unpublished, or non-refereed (in his books) and somewhat sloppy or contradictory when it is published. (pp. 71-72)

Greg didn't publish these unflattering conclusions. In fact, he later disavowed them (Meyer, 2001). Thus, the problems with the Exner norms generally went unnoticed until ten years later, when a group of respected Rorschach experts -- Thomas Shaffer, Philip Erdberg, John Haroian, and Mel Hamel -- reported the results of two groundbreaking studies. Their research appeared in the Journal of Personality Assessment, which certainly cannot be accused of being an anti-Rorschach journal.

In the first study (Shaffer, Erdberg, & Haroian, 1999), the researchers administered the Exner Rorschach, the WAIS-R, and the MMPI-2 to 123 nonpatient adults living in the community. Most of these participants were volunteers who donated blood at a blood bank and then gave their time to be tested by the research team. According to the WAIS-R and MMPI-2, the group was average or even slightly above-average compared with other Americans.

In only one respect did these apparently typical Americans stand out: When compared with the Exner norms, their Rorschach scores indicated that most of the individuals in the study were seriously disturbed. For example, about 1 in 6 of the participants scored in the pathological range on the CS Schizophrenia Index. Their Distorted Form Quality (X-%) scores were so high that half would be considered thought-disordered. Nearly a third gave a Reflection response, a supposedly pathognomonic indicator of narcissism.

A year later the same group of scholars published a second study, this time of 100 preadolescent children with no known history of mental health problems (Hamel, Shaffer, & Erdberg, 2000). The children were above-average in psychological adjustment according to a well-validated measure, the Conners Parent Rating Scale-93. Yet when their Rorschach scores were compared with the CS norms, the results were even more troubling than in the study of adults. More than 60% of the children scored in the pathological range on the Schizophrenia Index. More than 50% had Form Quality scores that indicated thought disorder. Nearly half scored in the "depressed" range on the CS Depression Index. Hamel and his colleagues wrote:

If we were writing a Rorschach-based, collective psychological evaluation for this sample, the clinical descriptors would command attention. In the main, these children may be described as grossly misperceiving and misinterpreting their surroundings and having unconventional ideation and significant cognitive impairment. Their distortion of reality and faulty reasoning approach psychosis.... They apparently suffer from an affective disorder that includes many of the markers found in clinical depression. Equally puzzling is that the previous Comprehensive System descriptors are incongruent with all other information known to this study about these children. (p. 291)

The findings of Shaffer et al. (1999) and Hamel et al. (2000) created a stir in the community of Rorschach scholars. Why did apparently normal adults and children appear seriously disturbed when compared with the Exner norms? Intrigued by this question, we searched the scientific literature from 1974 to 1999 and identified 32 additional studies that had administered the Exner Rorschach to nonpatient American adults. When we combined the numbers across studies, the results were very similar to those reported by Shaffer and his colleagues. That is, the apparently normal individuals in these 32 studies appeared "sick" when compared with the Exner norms.

In a critique based on these findings (Wood, Nezworski, Garb, & Lilienfeld, 2001a, 2001b), we concluded that the Exner norms do not accurately represent American adults, and that use of the norms tends to make clients appear much more disturbed than they really are. Exner (2001a) and Meyer (2001) wrote forceful comments on the critique and rejected its conclusions. Greg flatly asserted: "the Comprehensive System norms do not overpathologize" (p. 394).

In a strange twist, at approximately the same time that he was dismissing the evidence of problems with the Comprehensive System norms, Exner published an important revelation in a new edition of his Rorschach Workbook. The normative tables in his new book were based on 600 nonpatient adults, although in previous editions the same tables had been based on 700 adults. Exner (2001b, p. 172) explained this discrepancy in two sentences:

The reduced number results from the fact that when the sample of 700 nonpatients was selected, using stratification criteria, more than 200 duplicate records were included. Once detected, those records were deleted from the sample and most have been replaced to constitute the sample used here.

Exner thus announced that the Comprehensive System normative sample of 700 subjects described in his books since 1989 didn't really contain 700 subjects after all. Instead, it consisted of only 479 subjects. A subset of these -- 221 -- had somehow been duplicated and then added to the original 479, yielding an illusory sample of 700 subjects. Psychologists had been using this flawed set of norms for more than 10 years.

Barry Ritzler, a friend of Exner's and professor at Long Island University, later posted an internet message to explain how the mistake occurred:

John Exner told me that a technician who worked for him a number of years ago was responsible for entering the normative data. He actually entered 700 DIFFERENT cases, but pushed the wrong button and got a re-entering of the PREVIOUS 200 cases rather than 200 new ones. So 700 separate cases were prepared for entry into the norms, but only 500 got in, 200 twice.
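The two accounts quoted above give slightly different figures (479 distinct records plus 221 duplicates in our reading of Exner's note, versus Ritzler's rounder 500 plus 200). The short sketch below simply checks that both accounts add up to the reported sample of 700 and shows what share of that apparent sample would consist of repeated records. The function name is ours and is purely illustrative.

```python
# Sanity check on the sample counts quoted in the text.
REPORTED_N = 700

def inflation(distinct, duplicates):
    """Return the apparent sample size and the fraction of it that is duplicated."""
    apparent = distinct + duplicates
    return apparent, duplicates / apparent

# Counts as described in this essay: 479 distinct records, 221 entered twice.
essay_n, essay_share = inflation(479, 221)

# Ritzler's rounded account: 500 distinct records, 200 entered twice.
ritzler_n, ritzler_share = inflation(500, 200)

print(essay_n == REPORTED_N, round(essay_share, 3))
print(ritzler_n == REPORTED_N, round(ritzler_share, 3))
```

Under either account, roughly three in ten of the 700 "subjects" were repeats of records already in the sample.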

In 1999 Exner began to collect a new normative sample. Although he died in 2006, a new set of Comprehensive System norms based on 450 adults was published posthumously in 2007 (Exner, 2007). In most respects, the new Comprehensive System norms are similar to the old ones. However, for a large number of important variables, both sets of Exner norms are highly discrepant from the numbers reported by other researchers.

In 2007, in conjunction with the appearance of the new Comprehensive System norms, a second and much different set of norms was published by Greg Meyer in collaboration with Thomas Shaffer and Philip Erdberg, the two Rorschach researchers whose groundbreaking 1999 article had started the norms controversy (Meyer, Erdberg, & Shaffer, 2007; Shaffer, Erdberg, & Meyer, 2007). These authors had organized an ambitious international project in which collaborators gathered Comprehensive System Rorschach data from 21 adult samples in 17 countries, including the U.S. The data were then combined to yield "International Normative Reference Standards." For many important Rorschach scores, these standards (which we'll simply call the new International Norms) were highly discrepant from both the old and the new Exner norms, although similar to the numbers reported 8 years earlier by Shaffer et al. (1999) and in our own critique (Wood et al., 2001a).

The article introducing the new International Norms included a recommendation -- although arguably a "soft" one -- that they be used to interpret Rorschach scores: "We encourage clinicians to incorporate the composite international reference values into their clinical interpretation of protocols" (Meyer et al., 2007, p. 201). However, in a book chapter the following year, Meyer and Viglione (2008, p. 321) explained more clearly how the International Norms should be integrated with Exner's new norms for the Comprehensive System (CS):

We recommend that examiners use the new CS sample as their primary benchmark for adults, but adjust for those variables that have consistently looked different in international samples....

Thus Meyer and Viglione recommended that (a) when the new CS norms for adults and the International Standards are similar, the CS norms should be used, but (b) when they are different, the International Norms should be used. Not surprisingly, many Rorschach users have concluded that Meyer and Viglione are attempting to replace the Exner adult norms with the International Norms. Some users have accepted the idea, others have not, and two sects appear to be forming: those who follow the Exner Norms and those who transfer their allegiance to the International Norms. As one wag commented, the situation is reminiscent of the era in the Middle Ages when there were two popes.

While the adult norms for the Comprehensive System have sunk into ever-deeper controversy, its norms for children and adolescents have collapsed entirely. The international study by Shaffer, Erdberg, and Meyer included 31 child and adolescent samples from 5 countries. After comparing data from American and international samples with Exner's numbers, Meyer et al. (2007, p. S214) concluded that the Comprehensive System norms for children and adolescents were "dated and atypical." They warned that use of the norms with children would "incorrectly result in some very unhealthy inferences and attributions of psychopathology" and that their application in clinical settings was inadvisable. We agree. Of course, if clinicians take these warnings to heart, they must either stop using the Rorschach with children or return to the practice, common in the 1950s and 1960s, of using the test without norms.

New Comprehensive System Administration Procedures

We have described the controversy over the Comprehensive System norms as it has unfolded and reached crisis proportions. Clearly, the story is not yet finished. Before offering reflections on the situation, we will discuss new administration procedures for the Comprehensive System that have recently been introduced.

As discussed in What's Wrong With the Rorschach (Wood et al., 2003), the Rorschach has been bedeviled since the 1950s by "the problem of R." Here, "R" stands not for the Rorschach but for the total number of responses per Rorschach administration. The problem is that some respondents give fewer than 15 responses to the test, whereas others give more than 40. As has been shown by numerous researchers (including Greg Meyer), this variability in Response Frequency (R) can exert a strong and undesirable influence on respondents' scores on other Rorschach variables.

The obvious solution to the problem of R is to limit each respondent to the same number of responses, as Wayne Holtzman did with his psychometrically elegant inkblot test in the 1950s (Holtzman, Thorpe, Swartz, & Herron, 1961). However, this solution has never been popular among Rorschach users, and John Exner rejected suggestions to limit R in the Comprehensive System.

During the past few years, however, Greg Meyer and Don Viglione have introduced new administration procedures for the Comprehensive System that are intended to limit the number of responses that respondents give to the blots. The basic idea of the new procedures is to (a) encourage respondents who give only one response to a card to provide another response (with a maximum of three encouragements), and (b) allow a maximum of four responses per card. We believe that this proposal holds some promise for remedying the problem with R, although whether it succeeds remains to be seen (see below).
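To make the two rules concrete, here is a minimal sketch of how they would narrow the range of R. It rests on several assumptions of ours that the published procedure may not share: that the Rorschach is given with its usual ten cards, that the three encouragements are a total across the whole administration, and that every encouragement yields exactly one additional response. The response counts are invented for illustration.

```python
def total_responses(spontaneous, max_per_card=4, max_prompts=3):
    """Total R after applying the two rules to per-card spontaneous response counts.

    Assumptions (ours): prompts are counted across the whole administration,
    and each prompt produces exactly one extra response.
    """
    prompts_left = max_prompts
    total = 0
    for n in spontaneous:
        # Rule (a): encourage a second response when only one was given.
        if n == 1 and prompts_left > 0:
            prompts_left -= 1
            n += 1
        # Rule (b): accept no more than four responses per card.
        total += min(n, max_per_card)
    return total

# Two hypothetical respondents, ten cards each.
terse = [1] * 10     # 10 responses without the new rules
verbose = [6] * 10   # 60 responses without the new rules

print(sum(terse), sum(verbose))
print(total_responses(terse), total_responses(verbose))
```

In this sketch the gap between the two respondents shrinks from 10 versus 60 to 13 versus 40: the range is narrower, but the number of responses is still not equal across respondents.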

In a study published in the Journal of Personality Assessment, Dean, Viglione, Perry, and Meyer (2007) reported that the new procedure reduced the variability of R and increased validity coefficients for several Comprehensive System scores related to Thought Disorder. Based on these findings, the authors recommended that users of the Comprehensive System adopt the new administration procedures, a recommendation that has recently been reiterated by Meyer and Viglione (2008).

The new administration procedure, like the new International Norms, has encountered resistance from some quarters of the Rorschach community. The most forceful objection is that the new administration procedure was not used to collect data for either the Exner norms or the new International Norms. Thus, Rorschachs administered with the new procedure cannot be appropriately compared with either set of norms. In response, Meyer has argued and presented data to show that such worries are unjustified.

Reflections On the Norms and Administration Procedures

1. In our opinion, overwhelming evidence has accumulated that the Exner adult norms for many important Rorschach variables are seriously in error, tend to overpathologize individuals, and should not be used either clinically or forensically. Four data sets are particularly relevant: (a) Meyer's (1989/1991) dissertation; (b) the groundbreaking article by Shaffer et al. (1999) on 123 nonpatients; (c) our own synthesis of data from 32 nonpatient American samples (Wood et al., 2001a); and (d) the new International Norms based on 21 international and American samples (Meyer et al., 2007). The mean values of Comprehensive System variables are similar in these data sets, but differ strikingly from both the old and new Exner norms. The variables with the greatest discrepancies are listed in our article on the web (Wood, Nezworski, Garb, & Lilienfeld, 2006; http://www.division42.org/MembersArea/IPfiles/Spring06/practitioner/rorschach.php) and in a chapter by Meyer and Viglione (2008).

2. Likewise, we agree with Meyer and his colleagues (2007) that Exner's child and adolescent norms are inconsistent with the findings of other researchers, tend to seriously overpathologize children, and should not be used either clinically or forensically. Our 2003 book reached the same conclusions and made a strong recommendation against use of the Comprehensive System with children (Wood et al., 2003). We're gratified that some Rorschach proponents have come around to our point of view.

3. Though we'd like to know exactly what went wrong with the Exner norms, we doubt we ever will. When we prepared our 2001 critique of the norms, we wrote to Exner asking to re-analyze his data, but he declined our request. As we've discussed elsewhere, the data and studies underlying the Comprehensive System have often been unavailable for inspection by other scholars. The empirical foundations of the System, and their possible shortcomings, will probably never be fully revealed.

4. In our opinion, the research underlying the new International Norms of Meyer, Erdberg, and Shaffer (2007) is highly impressive and probably represents the most important scientific work on the Rorschach in the past 60 years. The findings are generally consistent with those of other researchers, and the International Norms are undoubtedly much closer to the true scores of nonpatient U.S. adults than the Exner norms are.

5. However, the International Norms are based on samples of convenience from a wide variety of countries, not a random or quasi-random sample of American adults. The normative group does not represent, even approximately, the population of American adults with respect to age, socioeconomic status, educational level, cultural background, or language. The methodology underlying the International Norms falls far short of that used to develop norms for such major instruments as the Wechsler intelligence tests and the MMPI-2. Thus, although we admire the scientific achievement of Meyer, Erdberg, and Shaffer (2007), we cannot recommend the International Norms for use in clinical or forensic practice.

6. Our evaluation is similar concerning the new Rorschach administration procedure introduced by Meyer and Viglione. On the one hand, we applaud these scholars for attempting to bring "the problem of R" under control. But although their new administration procedure represents a step in the right direction, in our opinion it does not go far enough, as it does not fully equalize the number of responses per participant. We think Wayne Holtzman's simpler solution was the right one: The way to tame the problem of R is to require that all respondents give the same number of responses to the blots. We doubt that the community of Rorschach users would be generally receptive to such an idea, however. The new administration procedure strikes us as a less-than-happy compromise between what is psychometrically desirable and what is politically acceptable in the Rorschach community.

7. More importantly, we do not share the assumption of Meyer and Viglione that their new Rorschach administration procedure can safely be used in conjunction with either the Exner or the new International norms, which were constructed using a substantially different administration procedure. Even minor changes in administration can alter the distribution of test scores and their meaning. For this reason, one of the principles of test standardization is that test scores should only be compared with normative data that were collected under similar conditions and with the same standardized procedures.

A well-trained psychologist would not expect that she or he could change the administration rules for the WAIS, but then interpret her or his patients' scores by comparing them with the standard WAIS norms. The same is true for the Comprehensive System. If the Rorschach administration procedure is changed, then new norms will need to be constructed using the new procedure. Any attempt to combine the new administration procedure with old norms strikes us as highly problematic.
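The standardization principle can be illustrated with a short sketch of how a raw score is interpreted against norms. All numbers below are hypothetical and chosen only for illustration; they are not taken from the Exner norms, the International Norms, or any other published source. The point is simply that the same raw score can look deviant or unremarkable depending on which normative mean and standard deviation it is compared with.

```python
def z_score(raw, norm_mean, norm_sd):
    """Standard score of a raw value relative to a normative mean and SD."""
    return (raw - norm_mean) / norm_sd

# Hypothetical raw score on some test variable.
raw = 12.0

# Hypothetical norms gathered under the ORIGINAL administration procedure.
old_procedure_norms = {"mean": 8.0, "sd": 2.0}

# Hypothetical norms that WOULD be obtained under a changed procedure.
new_procedure_norms = {"mean": 11.0, "sd": 2.5}

z_old = z_score(raw, old_procedure_norms["mean"], old_procedure_norms["sd"])
z_new = z_score(raw, new_procedure_norms["mean"], new_procedure_norms["sd"])

print(f"against old-procedure norms: z = {z_old:.2f}")
print(f"against new-procedure norms: z = {z_new:.2f}")
```

If a change in administration shifts the score distribution in this way, a client tested under the new procedure but scored against the old norms would appear two standard deviations above the mean, when in fact the score is close to average for people tested the same way.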

8. We hope the foregoing discussion shows why we consider the Comprehensive System to be in a state of crisis, with no quick rescue in sight. We conclude by posing a question for list members: Should graduate programs in clinical psychology continue to devote training time to a nearly 90-year-old test that does not have, and never has had, an adequate set of norms, and for which the original raw normative data have not been made available to independent scholars?

Our next posting will address the topic of Rorschach interrater reliability.

References

Dean, K. L., Viglione, D. J., Perry, W., & Meyer, G. J. (2007). A method to optimize the response range while maintaining Rorschach Comprehensive System validity. Journal of Personality Assessment, 89, 149-161.

Exner, J. E. (2001a). A comment on "The misperception of psychopathology: Problems with the norms of the Comprehensive System for the Rorschach." Clinical Psychology: Science and Practice, 8, 386-388.

Exner, J. E. (2001b). A Rorschach workbook for the Comprehensive System (5th ed.). Asheville, North Carolina: Rorschach Workshops.

Exner, J. E. (2007). A new U.S. adult nonpatient sample. Journal of Personality Assessment, 89(S1), S154-S158.

Hamel, M., Shaffer, T. W., & Erdberg, P. (2000). A study of nonpatient preadolescent Rorschach protocols. Journal of Personality Assessment, 75, 280-294.

Holtzman, W. H., Thorpe, J. S., Swartz, J. D., & Herron, E. W. (1961). Inkblot perception and personality. Austin, Texas: University of Texas Press.

Meyer, G. J. (1991). An empirical search for fundamental personality and mood dimensions within the Rorschach test (Unpublished dissertation. Loyola University of Chicago, 1989). Dissertation Abstracts International, 52, 1071B-1072B.

Meyer, G. J. (2001). Evidence to correct misperceptions about Rorschach norms. Clinical Psychology: Science and Practice, 8, 389-396.

Meyer, G. J., Erdberg, P., & Shaffer, T. W. (2007). Toward international normative reference data for the Comprehensive System. Journal of Personality Assessment, 89, S201-S216.

Meyer, G. J., & Viglione, D. J. (2008). An introduction to Rorschach assessment. In R. P. Archer & S. R. Smith (Eds.), A guide to personality assessment: Evaluation, application, and integration (pp. 281-336). New York: Routledge.

Shaffer, T. W., Erdberg, P., & Haroian, J. (1999). Current nonpatient data for the Rorschach, WAIS-R, and MMPI-2. Journal of Personality Assessment, 73, 305-316.

Shaffer, T. W., Erdberg, P., & Meyer, G. J. (2007). Introduction to the JPA special supplement on international reference samples for the Rorschach Comprehensive System. Journal of Personality Assessment, 89, S2-S6.

Wood, J. M., Nezworski, M. T., Garb, H. N., & Lilienfeld, S. O. (2001a). The misperception of psychopathology: Problems with the norms of the Comprehensive System for the Rorschach. Clinical Psychology: Science and Practice, 8, 350-373.

Wood, J. M., Nezworski, M. T., Garb, H. N., & Lilienfeld, S. O. (2001b). Problems with the norms of the Comprehensive System for the Rorschach: Methodological and conceptual considerations. Clinical Psychology: Science and Practice, 8, 397-402.

Wood, J. M., Nezworski, M. T., Garb, H. N., & Lilienfeld, S. O. (Spring, 2006). The controversy over Exner's comprehensive system for the Rorschach: The critics speak. The Independent Practitioner. Available on the web at

http://www.division42.org/MembersArea/IPfiles/Spring06/practitioner/rorschach.php

Wood, J. M., Nezworski, M. T., Lilienfeld, S. O., & Garb, H. N. (2003). What's wrong with the Rorschach? Science confronts the controversial inkblot test. San Francisco: Jossey-Bass.

Wood, J. M., Nezworski, M. T., Lilienfeld, S. O., & Garb, H. N. (2009). Projective techniques in the courtroom. In J. L. Skeem, K. S. Douglas, & S. O. Lilienfeld (Eds.), Psychological science in the courtroom: Controversies and consensus (pp. 202-223). New York: Guilford.

Part III: Comprehensive System Interrater Reliability

In our previous postings to SSCP-net we have discussed the validity, norms, and administration procedures for Exner's Comprehensive System for the Rorschach. In the present posting, which is relatively brief, we discuss interrater reliability. Our comments are drawn largely from our book (Wood, Nezworski, Lilienfeld, & Garb, 2003) and an article available on the internet (Wood, Nezworski, Garb, & Lilienfeld, 2006).

The Interrater Reliability of Comprehensive System Scores

For many years, psychologists accepted Exner's (1978, p. 14; 1986, p. 23) claim that all Comprehensive System (CS) scores have a scoring reliability (i.e., interrater reliability) of .85 or higher. However, recent studies have revealed this claim to be incorrect. For example, Acklin, McDowell, Verschell, and Chan (2000) found that for 89 CS scores, reliabilities (intraclass correlation coefficients) ranged from .16 to 1.00, with a median of .83. A recent study by McGrath et al. (2005) found that for 69 scores, reliabilities ranged from .58 to .99, with a median of .89. Studies by Meyer, Hilsenroth, et al. (2002) and Viglione and Taylor (2003) presented generally higher figures, but in our view their methodology and statistical analyses were problematic (for a critique, see Wood, Nezworski, Lilienfeld, & Garb, 2003, pp. 231-234, 366-367).


Is the interrater reliability of CS scores acceptable? This question has three answers, depending on which standards are applied. First, do CS scores meet the high standards set by the Wechsler IQ subtests, whose minimum interrater reliability is .90 (Wechsler, 1997)? The best studies on the Rorschach (Acklin et al., 2000; McGrath et al., 2005) indicate that approximately 50% of CS scores meet this stringent standard.

Second, do CS scores meet traditional minimal reliability standards for tests used in clinical practice? Because scores with reliability below .80 contain substantial error, experts in psychological assessment often recommend that only tests above this level of reliability should be used for clinical decision making (Nunnally & Bernstein, 1994; but see Cicchetti, 1994). According to the Acklin and McGrath studies, about 75% of CS scores meet this traditional standard of .80 reliability.

Third and finally, do CS scores meet recommended standards for tests used in research? Psychometric experts generally recommend that the minimum reliability of test scores used in research should be .60 (Shrout, 1998). All but a few CS scores meet this minimal standard.

In a recent article, Rorschach proponent Irving Weiner (2005, pp. 78-79) asserted that "recent research leaves little doubt that adequately trained examiners can achieve substantial reliability in their coding of Rorschach responses." We are in 75% agreement with this assertion. That is, research has clearly shown that approximately 75% of CS scores meet traditional minimal standards of interrater reliability for clinical use.

However, the same research indicates that approximately 25% of CS scores do not meet these standards. For example, the CS scores related to psychosis and thought disorder -- the Schizophrenia Index (SCZI), the Perceptual Thinking Index (PTI), and Weighted Sum of Six Special Scores (WSum6) -- have interrater reliabilities in the .70s. Other important scores with reliability below .80 include Distorted Form (X-%), the D Score and Adjusted D (said to be related to stress), the Sum of Diffuse Shading responses (Sum Y, said to be related to anxiety), and the ratios FC:CF+C and a:p. Although these scores possess adequate reliability for research applications, their use in clinical practice is likely to yield unacceptable error rates.
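The three benchmarks discussed in this posting (.90, .80, and .60) can be expressed as a simple classification rule. The sketch below is ours, not a procedure from any of the cited sources; the labels are shorthand for the standards described above, and the example coefficients are the range endpoints and medians quoted earlier from Acklin et al. (2000) and McGrath et al. (2005).

```python
# Cutoffs from the text, highest first; labels are our own shorthand.
STANDARDS = [
    (0.90, "meets Wechsler-level standard"),
    (0.80, "meets clinical minimum"),
    (0.60, "meets research minimum"),
]

def highest_standard_met(reliability):
    """Return the most stringent benchmark that a reliability coefficient reaches."""
    for cutoff, label in STANDARDS:
        if reliability >= cutoff:
            return label
    return "below research minimum"

# Range endpoints and medians reported in the two studies cited above.
for r in (0.16, 0.58, 0.83, 0.89, 0.99, 1.00):
    print(f"{r:.2f}: {highest_standard_met(r)}")
```

A score with reliability in the .70s, such as those listed in the preceding paragraph, falls into the middle category here: adequate for research but short of the traditional minimum for clinical use.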

References

Acklin, M. W., McDowell, C. J., Verschell, M. S., & Chan, D. (2000). Interobserver agreement, intraobserver reliability, and the Rorschach Comprehensive System. Journal of Personality Assessment, 74, 15-47.

Cicchetti, D. V. (1994). Guidelines, criteria, and rules of thumb for evaluating normed and standardized assessment instruments in psychology. Psychological Assessment, 6, 284-290.

Exner, J. E. (1978). The Rorschach: A Comprehensive System: Vol. 2. Current research and advanced interpretation. New York: Wiley.

Exner, J. E. (1986). The Rorschach: A Comprehensive System: Vol. 1. Basic foundations (2nd ed.). New York: Wiley.

McGrath, R. E., Pogge, D. L., Stokes, J. M., Cragnolino, A., Zaccario, M., Hayman, J., Piacentini, T., & Wayland-Smith, D. (2005). Field reliability of Comprehensive System scoring in an adolescent inpatient sample. Assessment, 12, 199-209.

Meyer, G. J., Hilsenroth, M. J., Baxter, D., Exner, J. E., Fowler, J. C., Piers, C. C., & Resnick, J. (2002). An examination of interrater reliability for scoring the Rorschach Comprehensive System in eight data sets. Journal of Personality Assessment, 78, 219-274.

Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd ed.). New York: McGraw-Hill.

Shrout, P. E. (1998). Measurement reliability and agreement in psychiatry. Statistical Methods in Medical Research, 7, 301-317.

Viglione, D. J., & Taylor, N. (2003). Empirical support for interrater reliability of Rorschach Comprehensive System coding. Journal of Clinical Psychology, 59, 111-121.

Wechsler, D. (1997). WAIS-III administration and scoring manual. San Antonio, TX: The Psychological Corporation.

Weiner, I. B. (2005). The utility of Rorschach assessment in clinical and forensic practice. Independent Practitioner, 25, 76-83.

Wood, J. M., Nezworski, M. T., Garb, H. N., & Lilienfeld, S. O. (Spring, 2006). The controversy over Exner's comprehensive system for the Rorschach: The critics speak. The Independent Practitioner. Available on the web at

http://www.division42.org/MembersArea/IPfiles/Spring06/practitioner/rorschach.php

Wood, J. M., Nezworski, M. T., Lilienfeld, S. O., & Garb, H. N. (2003). What's wrong with the Rorschach? Science confronts the controversial inkblot test. San Francisco: Jossey-Bass.