Structural and lexical features of successive versions of the Read Codes

Timothy E Bentley BDS MSc, Colin Price FRCS, Philip J B Brown MRCGP

NHS Centre for Coding & Classification, Woodgate, Loughborough, Leics., LE11 2TG

Abstract

Originally introduced in the mid-1980s as the 4-byte set, the Read Codes have evolved from a relatively simple clinical classification designed to capture summarised primary care data, into a comprehensive computer-based thesaurus of clinical terms. In 1988, they were adopted as the standard for general practice computing.

All areas of clinical information are supported, and Pringle concluded in 1990 that Version 2 of the Read Codes was the only coding system to offer full coverage for all aspects of primary care. The Read Thesaurus (Version 3 of the Read Codes) is a comprehensive user-led medical terminology that aims to support clinicians from the primary, secondary and tertiary sectors in recording all facets of the process of care. It was produced during the Terms Projects (1992-95), which involved over 2,000 clinicians from all disciplines in the enhancement of the existing Read Codes.

In this study, parallel sections of the 4-byte set, Version 2 and the Read Thesaurus have been evaluated based on expressivity, natural language representation and intrinsic structural properties. The results are presented and discussed with emphasis on the functionality of each set.

Introduction

Over 80% of United Kingdom (UK) general practices have computers and most of these utilise one of the early versions of the Read Codes, either the 4-byte set or Version 2. A number of studies have evaluated Read Codes both in clinical usage and also in experimental settings.

Smith noted the limitations of earlier versions when attempting to agree a standard data set for general practice communication of clinical information. The problems encountered related to both structure and content, in particular the absence of what was described as a "true hierarchical structure" in the areas of smoking and alcohol consumption. The study suggested that these problems result from the relatively simple fixed hierarchy structure of the early versions.

Williams studied the use of 4-byte Read codes in recording and retrieving data relating to depressive illness and acknowledged that limitations resulted from difficulties in matching a coding scheme to clinical consensus in any domain. This is a particular problem when clinical ideas evolve such that changes become necessary which cannot be accommodated in a fixed-depth code-dependent hierarchy.

A comparison of the content of the April 1993 Version 2 release with other internationally available coding schemes found that SNOMED was superior to both Read and the UMLS Metathesaurus, but the study acknowledged that the significant modifications subsequently introduced in Read Version 3 would probably lead to improved performance. Furthermore, SNOMED includes modifiers but, unlike Version 3, lacks an information model for their use. The study was also confined to coverage and made no attempt to assess the equally important area of data retrieval.

A preliminary study of Read Version 3 (including developmental qualifier templates) based on United States family practice terms reported encouraging results and suggested that, with its flexible hierarchy structure and qualifiers, it was a strong contender for adoption as a standard coding scheme.

Cimino has documented the ideal criteria for controlled medical vocabularies including completeness, non-redundancy, synonymy, non-ambiguity, multiple parentage and explicit relationships. In this paper, we examine the three versions of the Read Codes with an emphasis on the fulfilment of these criteria.

Overview of Read Codes

Background

The Read Codes were originally devised in the early 1980s by Dr James Read, a Loughborough (UK) general medical practitioner, for use in primary health care. The 4-byte set was first issued in 1985 with the intention of supporting the collection of clinical data by general practitioners using stand-alone personal computers. Throughout the 1980s computers became increasingly widely used in general practice and in 1987 the Joint Computing Group of the Royal College of General Practitioners and the General Medical Services Committee of the British Medical Association established a technical Working Party to evaluate clinical classification systems for general practice. Its 1988 report "The Classification of General Practice Data" recommended that Read Codes should become the standard, and on that basis, they were adopted as the standard for general practice computing by the Royal College of General Practitioners.

In 1990, the Read Clinical Classification was purchased by the Department of Health and became Crown Copyright. At the same time the National Health Service Centre for Coding and Classification (NHS-CCC) was established, with a principal function of maintaining and developing the Read Codes.

Figure 1: Milestones in the Evolution of the Read Thesaurus

       1985         4-byte set introduced
       1988         Recommended as standard for GP computing
       1990         Purchased by Department of Health; Crown Copyright
                    Version 2 introduced
       1992         Terms Projects begin
       1994         Version 3 introduced
       1995         First hospital partnership sites for Version 3

The Terms Projects were a series of collaborations between the clinical professions in the United Kingdom and the NHS Executive. The aim was to capture and incorporate into the Read Codes the natural language used in health care. The requirement for the recording and retrieval of complex clinical detail, and the need to support different specialist views, led to the development of a new (Version 3) file structure.

When the collected terms provided by the clinical specialties were integrated into the new Thesaurus, all concepts from Version 2 were also included, many as obsolete. Subsequently, all the 4-byte concepts have been integrated into Version 3 although this work is still developmental.

Features

The Read Codes have a number of features, some common to all versions, which are discussed below:

Codes

The 4-byte set, as the name implies, contains 4-character alphanumeric Read Codes utilising A to Z (originally upper case only) and 0 to 9. In Version 2, 5-character codes are used with A to Z, a to z and 0 to 9 but excluding letters O, o, I and i to avoid confusion. In Version 3, all characters in the ranges A to Z, a to z and 0 to 9 are available. The theoretical number of available codes therefore increases progressively between versions: 1,679,616 (4-byte); 656,356,768 (Version 2); and 916,132,832 (Version 3).
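The three totals quoted above follow from raising the size of each permitted character set to the code length. A minimal Python sketch of that arithmetic is given here; the names `ALPHABETS`, `code_space` and `CODE_SPACES` are illustrative only, and the sketch assumes, as the quoted figures do, that every position of a code may hold any permitted character.

```python
import string

UPPER = set(string.ascii_uppercase)
LOWER = set(string.ascii_lowercase)
DIGITS = set(string.digits)

# (permitted characters, code length) for each version, as described in the text.
ALPHABETS = {
    "4-byte": (UPPER | DIGITS, 4),
    # Version 2 excludes O, o, I and i to avoid confusion with 0 and 1.
    "Version 2": ((UPPER | LOWER | DIGITS) - set("OoIi"), 5),
    "Version 3": (UPPER | LOWER | DIGITS, 5),
}


def code_space(alphabet, length):
    """Number of distinct fixed-length codes over the given alphabet."""
    return len(alphabet) ** length


CODE_SPACES = {name: code_space(chars, n) for name, (chars, n) in ALPHABETS.items()}

for name, size in CODE_SPACES.items():
    print(f"{name}: {size:,}")
```

The sketch gives 36^4, 58^5 and 62^5 respectively, which agree with the figures in the text.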

Hierarchy

The concepts within the Read Codes are located in a branching display hierarchy in which levels of increasingly detailed concepts are placed. This structure facilitates browsing and eventual selection of a concept for data entry.

In the 4-byte and Version 2 file structures, the hierarchical position (or ancestry) was implicit in the alphanumeric code (path-form hierarchy representation). In Version 3 the hierarchy is stored as a table of parent-child linkages (link-form hierarchy representation). This migration from path- to link-form representation has allowed the adoption of a directed acyclic graph "hierarchy" structure which permits multiple parentage (see Figure 2), unlimited depth (the maximum in Version 3 at present is fifteen levels in parts of the anatomy chapter) and readily supports refinement and expansion.

Figure 2: Multiple parentage in Version 3

Thyroid disorder         Thyroid disorder          

 Goitre                   Thyrotoxicosis           

  Toxic goitre             Toxic goitre

It also overcomes a major limitation of the earlier versions in which, when the final level of the fixed-depth hierarchy is reached, new concepts can only be added as "siblings", even when they might be "children" (see Figure 3), or as synonyms, even when they are detailed variants that might warrant their own code.

Figure 3: Limitations of a fixed-depth hierarchy

Version 2  7130.   Mastectomy                                        

            71304 Subcutaneous mastectomy                            

            71307 Subcutaneous mastectomy - gynaecomastia            

Version 3  7130.   Mastectomy                                        

            71304 Subcutaneous mastectomy                            

             71307 Subcutaneous mastectomy - gynaecomastia

(Ideally 71307 should be a child of 71304, but this is not supported by the fixed Version 2 hierarchy, in which it has had to be added as a sibling.)
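The difference between the two hierarchy representations can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the released file structure: the function names are hypothetical, the path-form rule assumes codes padded with trailing full stops as in Figure 3, and the link table uses the terms from Figure 2 in place of codes, which the figure does not give.

```python
from typing import Optional

def path_form_parent(code: str) -> Optional[str]:
    """Path-form: the parent is implicit in the code itself.

    Drop the last significant character and re-pad with full stops.
    """
    stem = code.rstrip(".")
    if len(stem) <= 1:
        return None  # a top-level code has no parent
    return stem[:-1].ljust(len(code), ".")


# Link-form: the hierarchy is an explicit table of (child, parent) rows,
# so one concept may appear with several parents (Figure 2).
LINKS = [
    ("Goitre", "Thyroid disorder"),
    ("Thyrotoxicosis", "Thyroid disorder"),
    ("Toxic goitre", "Goitre"),
    ("Toxic goitre", "Thyrotoxicosis"),
]


def parents_of(concept, links):
    """Direct parents of a concept in a link-form table."""
    return {parent for child, parent in links if child == concept}


def ancestors_of(concept, links):
    """All ancestors, found by walking the directed acyclic graph upwards."""
    seen, stack = set(), [concept]
    while stack:
        for parent in parents_of(stack.pop(), links):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


print(path_form_parent("71307"))           # both 71304 and 71307 resolve to 7130.
print(parents_of("Toxic goitre", LINKS))   # two parents in link-form
```

In path-form a five-character code such as 71307 can only ever have 7130. as its parent, which is why it sits beside 71304 rather than beneath it; in link-form the relationship is a row in a table and can be changed or duplicated without touching the code.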

If hierarchies are to be coherent, some redundancy seems inevitable in the absence of multiple parentage. This is shown by the two Version 2 examples in Figure 4, where the same concepts have been duplicated to allow inclusion in both the Infections (A) and Nervous system (F) chapters, and in both the Female genital (K) and Congenital (P) chapters, respectively.

Figure 4: Redundancy in Version 2

Read Code   Term                      Read Code   Term

A130.       Tuberculous meningitis    F004.       Meningitis - tuberculous

K5623       Atresia of vagina         PC4yB       Atresia of vagina

Comprehensive coverage

At the time of its introduction, the 4-byte set allowed far greater coverage than any existing clinical coding system, including history, signs and symptoms, prevention and administration. Version 2 offered similar domain coverage with an emphasis on supporting the acute sector's requirements for generating central statistical data for diagnoses and operations. Version 3 aims to support the requirements of health care workers in all sectors and specialties for recording clinical data and thus contains significantly increased detail. A recent phase of refinement has ensured that concepts from earlier versions are also included, so that the new Thesaurus can fully support the needs of both specialists and generalists.

Dynamism

The Read Codes are re-issued quarterly (monthly for drugs) allowing timely incorporation of corrections and enhancements. Additions to the 4-byte set usually originate in the primary care sector; those for Version 2 from both primary and acute sectors. Version 3 is undergoing refinement in response to feedback from initial hospital partnership sites.

Cross-referenced

Version 2 and Version 3 are cross-referenced to the Tabular List of the Classification of Surgical Operations and Procedures produced by the Office of Population Censuses and Surveys (OPCS-4), the International Classification of Diseases 9th revision (ICD-9), and the International Classification of Diseases and Health Related Problems 10th revision (ICD-10).

Terms

In each version, a preferred term is assigned to a code, but synonymous terms can also be included. The overall ratio of terms to codes for the full sets is illustrated in Figure 5. This suggests that the richness of the terminology in Version 3 (ratio 1.2) is lower than that of the 4-byte set (ratio 1.5). However, Version 3 contains extensive sections of anatomy, organisms and other qualifying detail which have a low synonymy rate. A comparison of clinical subsets, as reported below, suggests that Version 3 is indeed a richer clinical vocabulary (see Figure 8).

Figure 5: Terms and Concepts: Full sets

All versions support synonymy and Versions 2 and 3 include separate term identifiers. In Version 2, the terms are fixed to concepts with 2-character additional term codes, whereas in Version 3 the labelling of terms with unique 5-character codes enables them to be relocated to different concepts. In Version 3, a dynamic descriptions table links the term identifiers to Read Codes (Figure 6).

Figure 6: Terms and codes

              Read Code      Term identifier   Term

Version 2     G3...          00                Ischaemic heart disease
                             11                Arteriosclerotic heart disease

Version 3     G3...          Y201T             Ischaemic heart disease
                             Y201V             Arteriosclerotic heart disease
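The practical consequence of the two term-identification schemes in Figure 6 can be sketched as follows. This is a simplified illustration, not the released table layout: the table and function names are hypothetical, the descriptions table is reduced to (Read Code, term identifier) pairs, and the target code `XHYPO` is an invented placeholder used only to show a relocation.

```python
# Version 2: a term is addressed by (Read Code, 2-character term code),
# so the term is fixed to its concept.
V2_TERMS = {
    ("G3...", "00"): "Ischaemic heart disease",
    ("G3...", "11"): "Arteriosclerotic heart disease",
}

# Version 3: each term has its own identifier...
V3_TERMS = {
    "Y201T": "Ischaemic heart disease",
    "Y201V": "Arteriosclerotic heart disease",
}
# ...and a separate descriptions table links term identifiers to Read Codes.
V3_DESCRIPTIONS = [("G3...", "Y201T"), ("G3...", "Y201V")]


def terms_for(code, descriptions, terms):
    """Terms currently linked to a Read Code through the descriptions table."""
    return [terms[term_id] for c, term_id in descriptions if c == code]


def relocate(term_id, old_code, new_code, descriptions):
    """Return a new descriptions table with one term moved to another concept.

    The term identifier and the term text are unchanged; only the link moves.
    """
    return [
        (new_code if (c, t) == (old_code, term_id) else c, t)
        for c, t in descriptions
    ]


# "XHYPO" is a hypothetical code, not a real Read Code.
moved = relocate("Y201V", "G3...", "XHYPO", V3_DESCRIPTIONS)
print(terms_for("G3...", moved, V3_TERMS))
print(terms_for("XHYPO", moved, V3_TERMS))
```

Because the Version 2 key contains the Read Code, the same move would require a new key for the term; in the Version 3 arrangement the identifier Y201V survives the move.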

In Version 3, a non-preferred term can be linked to more than one concept, allowing terms with a more general meaning to point to more specific concepts. This is important to support the functionality of a natural language terminology. In order to avoid ambiguity, it is important to ensure that terms do not depend on hierarchical or contextual clues for accurate assignment of meaning. In the 4-byte set and Version 2, the term IgG (Read code 43J3.) is in the laboratory procedures hierarchy though in isolation it may appear to be a substance.

Attempts to avoid confusion may require unnatural defining terminology, for example, Aspiration - action (a procedure as opposed to a cause of pneumonitis), though it may be possible to preferentially use a natural synonym in a live system.

A final point in relation to terms is the released term lengths: the 4-byte set includes only 30-character terms, whereas Versions 2 and 3 include 60- and 198-character term lengths. The use of unabbreviated terms avoids ambiguity in systems which can support them.

Qualifiers

The Read Thesaurus enumerates common clinical concepts but also supports the composition of concepts through the attachment of specified attribute-value pairs as qualifiers to objects. This was an essential factor in enabling Version 3 to represent a wealth of clinical detail in a manageable way. The template table contains allowable object-attribute-value triples to support qualification of concepts and also enables semantic definition as described below. The long-term plans for the development of Read Version 3 include widespread semantic characterisation of concepts throughout the thesaurus, a process referred to as atomic mapping.

Study

Aim

The aim was to compare parallel subsets from each of the three versions with respect to:

Quantifiable relationships between terms and concepts
Assessments of natural clinical language and synonymy
Comparison of semantic definition data

Comparison methods

Subsets

Current activity in Read Code development is focusing on the demands of inter-version compatibility and this has led to some preparatory analytical work at the NHS-CCC. In order to examine differences between the three versions, subsets were identified from the procedures and disorders chapters of Version 3 for comparison with parallel sections of Version 2 and the 4-byte set. Two clinical domains - thyroid disorders and breast operations - were non-randomly selected on the basis of stability, representing 0.5% and 1.25% of the current concepts in disorder and procedures chapters of Version 3 respectively.

Terms and codes

The overall numbers of terms and concept codes in the two clinical domains were determined from the April 1996 releases of the 4-byte set, Version 2 and Version 3.

Clinical usefulness of concepts

Two doctors were asked to examine each concept and indicate whether they considered it to be clinical or non-clinical. The latter include a number of concepts which were identified as residual categories derived from formal classifications (often including the rubrics Other specified/OS, Unspecified/NOS, Not elsewhere classified, and so on) and also organisational terms (for example, Breast biopsy and related procedures) which are structural headings, often useful for aggregation, but unlikely to be entered in a clinical record. The proportion of clinically useful representations in each subset was calculated.

Synonymy

The synonym purity of the non-preferred terms from each of the three subsets was assessed by the same two doctors. They were asked to make a judgment for each version, as to whether the non-preferred terms for each concept were synonyms, hyponyms, hypernyms, eponyms, or incorrect.

The following definitions were applied based on previous work in this field:

Synonyms - exact alternative expressions for a concept that should not be any more or less specific, for example, "Disorder of thyroid gland" and "Thyroid disorder"
Hyponyms - expressions that provide more information, for example, "Operation on skin of nipple" is a hyponym of "Operation on nipple"
Hypernyms - expressions that provide less information, for example, "Repair of breast" is a hypernym of "Suture of breast"
Eponyms - the use of proper names in the term, for example, de Quervain's thyroiditis. Eponyms are a difficult category in which to determine true synonymy, hypernymy or hyponymy as clinical usage is often imprecise
Incorrect - assigned when the reviewer considered that the alternative expression was an unacceptable label for that concept as defined by the preferred term

Semantic properties

Semantic definition of enumerated concepts in clinical vocabularies by means of attributed characteristics has been described by several groups. The technique involves applying to each concept a formal frame containing defined attributes and specifying appropriate values for the attributes when present within the concept. For the purposes of this study, the Breast operations hierarchy was analysed for each version of Read. Attribution of definitions was achieved either by manual assignment or by lexical matching between value sets and the preferred terms, followed by visual inspection. The completed definition frame for Percutaneous biopsy of breast is shown in Figure 7.

Figure 7: Attribute Definition Frame used for analysis

Attribute          Values                                

Site               Breast structure                      

Action             Biopsy                                

Approach           Percutaneous

We have utilised semantic definition to examine a number of intrinsic properties including the "molecularity" of the concepts in each version and also the size of the concept fields theoretically required to construct each version compositionally.
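The two measures can be sketched in Python as follows. This is an illustration of the idea rather than the analysis actually performed: the first frame is the one in Figure 7, the second frame ("Biopsy of breast") is a hypothetical example added so that there is more than one concept to count, and treating each attribute-value pair as one atom is an assumption of the sketch.

```python
# Definition frames: concept -> {attribute: value}.
FRAMES = {
    # From Figure 7.
    "Percutaneous biopsy of breast": {
        "Site": "Breast structure",
        "Action": "Biopsy",
        "Approach": "Percutaneous",
    },
    # Hypothetical second concept, for illustration only.
    "Biopsy of breast": {
        "Site": "Breast structure",
        "Action": "Biopsy",
    },
}


def total_atoms(frames):
    """Total number of attributed atoms across all concepts."""
    return sum(len(frame) for frame in frames.values())


def molecularity(frames):
    """Mean number of atoms per concept."""
    return total_atoms(frames) / len(frames)


def unique_atoms(frames):
    """Distinct atoms: the primitive set needed to build the concepts compositionally."""
    return {(attr, value) for frame in frames.values() for attr, value in frame.items()}


print(total_atoms(FRAMES), molecularity(FRAMES), len(unique_atoms(FRAMES)))
```

For these two frames the sketch counts five atoms in total, a molecularity of 2.5 and three unique atoms, since the site and action are shared by both concepts.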

Results

Terms and codes

Figure 8 illustrates the number of terms and concepts in each subset, together with the ratio of terms to concepts.

Figure 8: Terms and Codes for breast and thyroid subsets

Analysis of the subsets from the three versions reveals that the total number of concepts and terms, for thyroid disorders and breast procedures combined, increases with successive versions of the Read Codes: from 52 concepts (71 terms) in the 4-byte set, to 193 concepts (240 terms) in Version 2, to 285 concepts (406 terms) in Version 3. The ratio of terms to concepts is highest in Version 3 (1.42, compared with 1.37 in the 4-byte set and 1.24 in Version 2) because of the provision of more alternative terms.
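The term-to-concept ratios for the combined subsets can be recomputed directly from the counts quoted in this paragraph. A minimal Python sketch follows; the names `SUBSET_COUNTS` and `term_concept_ratio` are illustrative only.

```python
# (concepts, terms) for thyroid disorders and breast procedures combined,
# taken from the counts quoted in the text.
SUBSET_COUNTS = {
    "4-byte": (52, 71),
    "Version 2": (193, 240),
    "Version 3": (285, 406),
}


def term_concept_ratio(concepts, terms):
    """Average number of terms per concept."""
    return terms / concepts


RATIOS = {
    version: round(term_concept_ratio(concepts, terms), 2)
    for version, (concepts, terms) in SUBSET_COUNTS.items()
}

for version, ratio in RATIOS.items():
    print(f"{version}: {ratio}")
```

On these counts the ratio is 1.37 for the 4-byte set, 1.24 for Version 2 and 1.42 for Version 3.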

Clinically useful concepts

The results for the three versions in the two subset domains are summarised in Figure 9.

Figure 9: Clinical and non-clinical concepts

                          Thyroid disorders   Breast procedures   Overall

4-byte     Clinical            14 (64%)            24 (80%)        38 (73%)
           Non-clinical         8 (36%)             6 (20%)        14 (27%)
           Total               22                  30              52

Version 2  Clinical            55 (56%)            56 (58%)       111 (57%)
           Non-clinical        42 (43%)            40 (42%)        82 (43%)
           Total               97                  96             193

Version 3  Clinical           165 (100%)          115 (96%)       280 (98%)
           Non-clinical         0                   5 (4%)          5 (2%)
           Total              165                 120             285

The 4-byte set contained over 70% clinical concepts. The relatively high proportion of non-clinical concepts in Version 2 (over 40%) reflects its purpose of generating central statistical data for disorders and operations. In contrast Version 3, generated by clinicians as a vocabulary to support their records, consists almost entirely of clinical concepts (98%).

These results are presented graphically in Figure 10.

Figure 10: Bar chart of clinical and non clinical concepts

Synonymy

The results for the three versions in the two subset domains are summarised in Figure 11.

Figure 11: Preferred and Non-preferred term analyses for subsets

                              Thyroid disorders   Breast procedures

4-byte      Concepts               22                  30
            Total terms            38                  33
            Non-preferred          16                   3
             Synonyms               5 (31%)             2 (66%)
             Hyponyms               5 (31%)             1 (33%)
             Hypernyms              2 (13%)            --
             Eponyms                4 (25%)            --
             Incorrect             --                  --

Version 2   Concepts               97                  96
            Total terms           120                 120
            Non-preferred          23                  24
             Synonyms               2 (7%)              6 (25%)
             Hyponyms               5 (22%)             4 (3%)
             Hypernyms              1 (4%)             --
             Eponyms                4 (17%)            11 (9%)
             Incorrect             11 (48%)             3 (2%)

Version 3   Concepts              165                 120
            Total terms           227                 179
            Non-preferred          62                  59
             Synonyms              15 (25%)            15 (25%)
             Hyponyms              12 (20%)            12 (20%)
             Hypernyms             12 (20%)            12 (20%)
             Eponyms               18 (30%)            18 (30%)
             Incorrect              1 (1.5%)            1 (1.5%)

For thyroid disorders, the 4-byte set contained sixteen non-preferred terms. Although none was considered incorrect, only five were thought to be true synonyms. Version 2 is more expressive than the 4-byte set as it contains 120 terms, with 23 (26%) being non-preferred terms. However, eleven of these were marked as incorrect and only two were considered true synonyms. Version 3 is much more expressive than either of the earlier versions, containing 227 terms for thyroid disorders, of which 62 were non-preferred terms. Of these, one was considered incorrect whilst the remainder were designated hypernyms/hyponyms.

For breast procedures, the 4-byte set contained only three non-preferred terms, and none was considered incorrect. Of the 24 non-preferred terms in Version 2, only six were considered to be synonyms and the remainder included three that were marked incorrect. As with thyroid disorders, Version 3 provided a rich term set (179 terms) and the non-preferred group included twelve hyponyms, twelve hypernyms and one incorrect.

Semantic properties

These results relate to the breast procedures section of each Version.

The total number of attributed atoms in each version

Figure 12 illustrates the increasingly complex representation of clinical information in the enumerated sections of each version (4-byte, Version 2 and Version 3 current).

Figure 12: Total atom count for breast procedures in each version

The total number of atoms in relationship to the number of concepts, i.e. an indicator of the "molecularity" of the version (see Figure 13).

Figure 13: Molecularity for breast operation subsets

                            4-byte     Version 2    Version 3   

       Atoms (x)                  70         248          376   

       Concepts (y)               30           96         120   

       x/y                   2.33         2.58         3.13

Despite the facility to construct complex concepts by adding qualifiers, enumerated Version 3 concepts in the subsets examined have greater molecularity than those from the previous versions.

The number of unique atoms in each version, i.e. the size of the primitive concept set which would be required to construct the terminology compositionally

The bar chart (Figure 14) illustrates the size of the atomic concept fields in the 4-byte set, Version 2 and Version 3 (current).

Figure 14: Unique atoms for breast operations in each version

Discussion

Domain completeness in earlier versions of Read is limited primarily by the hierarchy structure (even the comparatively small 4-byte set has over 1 million unique codes available which could in theory support a coded vocabulary several times the size of SNOMED or Read Version 3). It is difficult, however, to expand earlier versions and at the same time to preserve meaningful hierarchical relationships between concepts for the purposes of browsing and analysis. The problems that result from a restricted number of available hierarchy positions are highlighted by Version 2, in which the limitations of the inflexible structure in supporting specialist detail are readily apparent. A further factor is the specific purpose for which these versions are designed: the 4-byte set for primary care data and Version 2 for acute sector encoding. The limited number of requests received for additions to the 4-byte set in particular suggests that it is fulfilling its role for most primary care users. Not surprisingly, evaluation of an early version for use as a comprehensive controlled medical vocabulary will yield sub-optimal results.

Analyses of term/concept ratios and of the atom content of enumerated concepts suggest that Version 3 is a far more expressive terminology. Furthermore, the availability of qualifiers in Version 3, not examined in this study, increases the expressivity further.

The trade-off between the introduction of redundancy and the need to ensure complete hierarchies is a problem which has not been satisfactorily resolved in early versions of Read with their fixed structures. The potential for Version 3 to support multiple classification overcomes this difficulty but also requires that the multiple placement of concepts should be complete if hierarchical analysis is to be effective.

Analysis suggests that Version 2 has the highest incidence of incorrect synonyms and this probably reflects its function in providing a comprehensive term-based index to classifications, the key requirement being the generation of the correct cross-mapping rather than clinical precision. This functional requirement was augmented by the incorporation as synonyms of a number of inclusion statements derived from classifications. This is again only a limiting factor in an inexpansible hierarchy: in Version 3 further concepts can be accommodated independently, rather than as impure synonyms.

Version 3 deliberately excludes concepts with a pure classification function (residual categories, and so on) from its current set, though a small number of organisational concepts have been introduced to support a rational structure.

Conclusion

This paper has outlined significant differences between the three versions of the Read Codes. The subset analyses have provided useful insight into structural and lexical features of the versions but further analyses need to be undertaken using larger and randomised samples to support statistical validation.

The 4-byte set has obtained widespread use in general practice, but its limited structure makes it unsuitable for supporting specialist detail and accurate cross-mapping, and insufficiently flexible for long-term future needs.

Version 2 was designed with the acute sector management requirements in mind, that is, encoding to the OPCS-4 and ICD classifications. The priority in this sector is to provide accurate cross-maps to the classifications within the limits of a fixed-level rigid hierarchy which Version 2 is able to support.

The flexible directed acyclic graph hierarchy and the provision of qualifiers in Version 3, though introducing complexities not present in its forerunners, offer the potential to support both specialist depth and generalist breadth. Version 3 most closely fulfils Cimino's criteria for a controlled clinical vocabulary to support all health care sectors.
